API

SDR Optimization Routines

SHARC.optimization_routines.optimize_DR(X, labels=None, num_samples=None, methods=['LMDS'], metric=None, storage_path='./', param_grid='./settings_DR.json', verbose=True, seed=None)

Function that finds the optimal parameter set for each DR method given a parameter grid.

Parameters

X : array-like, shape (n_samples, n_features)

An array containing the data that needs to be projected.

labels : array-like, shape (n_samples,), default=None

An array containing the labels (as numeric values) corresponding to each sample in X. Be sure to provide it when it is used by the optimization metric.

num_samples : int, default=None (optional)

Size of the random subset of samples that will be used to find the optimal DR parameters. If None, all samples will be used. Beware that for large datasets this may significantly slow down the optimization procedure! As a general recommendation, one should not use significantly more than 10000 samples.

methods : list, default=["LMDS"] (optional)

A list with the names of the DR methods to optimize, as strings.

metric : metrics.Metrics instance, default=None (optional)

A metrics.Metrics instance with a metric_total method, which will be called to evaluate the DR performance for a given parameter set. If not provided, metrics.DR_MetricsV1 will be initialized and used with its default parameters.

storage_path : str, default="./" (optional)

Path to the folder in which temporary files and results will be stored.

param_grid : str, default="./settings_DR.json" (optional)

The path to a JSON file containing a compact parameter grid for each method provided in methods.

verbose : bool, default=True (optional)

Controls the verbosity.

seed : int, default=None (optional)

Random seed used both by the projection technique and for selecting the random subset of num_samples samples.

Returns

best_params : dict

Dictionary containing the best parameter set for each DR method specified in methods.

best_scores : list

List containing the best total score for each DR method specified in methods. The scores are computed by calling metric.metric_total.
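A minimal usage sketch (the data, labels, and grid file below are placeholders; ./settings_DR.json is assumed to define a parameter grid for every method listed):

    import numpy as np
    from SHARC import optimization_routines

    # Toy stand-in data: 1000 samples, 10 features, 3 numeric classes.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))
    labels = rng.integers(0, 3, size=1000)

    # Search the grid in ./settings_DR.json for the best LMDS parameters.
    best_params, best_scores = optimization_routines.optimize_DR(
        X,
        labels=labels,
        num_samples=500,      # subsample to keep the search fast
        methods=["LMDS"],
        param_grid="./settings_DR.json",
        seed=42,
    )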

SHARC.optimization_routines.optimize_LGC(X, labels=None, num_samples=None, methods=['LMDS'], metric=None, storage_path='./', param_grid='./settings_LGC.json', DR_params='./best_DR_params.json', verbose=True, seed=None)

Function that finds the optimal parameter set for each LGC method given a parameter grid.

Parameters

X : array-like, shape (n_samples, n_features)

An array containing the data that needs to be projected.

labels : array-like, shape (n_samples,), default=None

An array containing the labels (as numeric values) corresponding to each sample in X. Be sure to provide it when it is used by the optimization metric.

num_samples : int, default=None (optional)

Size of the random subset of samples that will be used to find the optimal LGC parameters. If None, all samples will be used. Beware that for large datasets this may significantly slow down the optimization procedure! As a general recommendation, one should not use significantly more than 10000 samples.

methods : list, default=["LMDS"] (optional)

A list with the names of the DR methods to use in combination with LGC, as strings.

metric : metrics.Metrics instance, default=None (optional)

A metrics.Metrics instance with a metric_total method, which will be called to evaluate the LGC performance for a given parameter set. If not provided, metrics.LGC_Metrics will be initialized and used with its default parameters.

storage_path : str, default="./" (optional)

Path to the folder in which temporary files and results will be stored.

param_grid : str, default="./settings_LGC.json" (optional)

The path to a JSON file containing a compact parameter grid for each method provided in methods.

DR_params : str, default="./best_DR_params.json" (optional)

The path to a JSON file containing the parameters to use for each DR method provided in methods.

verbose : bool, default=True (optional)

Controls the verbosity.

seed : int, default=None (optional)

Random seed used both by the projection technique and for selecting the random subset of num_samples samples.

Returns

best_params : dict

Dictionary containing the best LGC parameter set for each DR method specified in methods.

best_scores : list

List containing the best total score for each DR method specified in methods. The scores are computed by calling metric.metric_total.

SHARC.optimization_routines.save_results(results, outfile)

Function to save optimization results to a JSON file.

Parameters

results : dict

A dictionary containing the results to be saved to outfile.

outfile : str

The name of the file the results should be saved to.
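The routines are typically chained: parameters found by optimize_DR are written out with save_results and then consumed by optimize_LGC through its DR_params argument. A hedged sketch, reusing X and labels from the example above and the default file names:

    from SHARC import optimization_routines

    # Stage 1: optimize the DR methods and persist the winning parameters.
    dr_params, dr_scores = optimization_routines.optimize_DR(X, labels=labels, seed=0)
    optimization_routines.save_results(dr_params, "./best_DR_params.json")

    # Stage 2: optimize LGC on top of the stored DR parameters.
    lgc_params, lgc_scores = optimization_routines.optimize_LGC(
        X,
        labels=labels,
        param_grid="./settings_LGC.json",
        DR_params="./best_DR_params.json",
        seed=0,
    )
    optimization_routines.save_results(lgc_params, "./best_LGC_params.json")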

SDR Optimization Metrics

class SHARC.metrics.DR_MetricsV1(metric=['euclidean', 'euclidean'], k=7)

Metric class for DR optimization using a metric composed of the trustworthiness, continuity, neighborhood hit and Shepard goodness metrics. Metric functions are inherited from the metrics.Metrics class.

Parameters

metric : str or list, default=["euclidean", "euclidean"]

Metrics to use when computing distances in the feature space and the projection space. When a string is provided, that same metric will be used for both the feature space and the projection space. Values are passed to scipy.spatial.distance.pdist.

k : int, default=7

Number of nearest neighbors to consider when computing the various metrics. Used by metric_trustworthiness, metric_continuity, metric_jaccard_similarity_coefficient, metric_neighborhood_hit and metric_distribution_consistency.

metric_total()

Function to compute the optimization metric.

Returns

total : float

The value between \([0, 1]\) of the composite metric, i.e.:

\[\frac{1}{4}\left(\text{trustworthiness} + \text{continuity} + \text{neighborhood hit} + \text{Shepard goodness}\right)\]
class SHARC.metrics.DR_MetricsV2(metric=['euclidean', 'euclidean'], k=7)

Metric class for DR optimization using a metric composed of only the distribution consistency metric. The metric function for distribution consistency is inherited from the metrics.Metrics class.

Parameters

metric : str or list, default=["euclidean", "euclidean"]

Metrics to use when computing distances in the feature space and the projection space. When a string is provided, that same metric will be used for both the feature space and the projection space. Values are passed to scipy.spatial.distance.pdist.

k : int, default=7

Number of nearest neighbors to consider when computing the various metrics. Used by metric_trustworthiness, metric_continuity, metric_jaccard_similarity_coefficient, metric_neighborhood_hit and metric_distribution_consistency.

metric_total()

Function to compute the optimization metric.

Returns

total : float

The value between \([0, 1]\) of the optimization metric, i.e. distribution consistency.

class SHARC.metrics.LGC_Metrics(metric='euclidean', k=7)

Metric class for LGC optimization using a metric composed of only the distribution consistency metric. The metric function for distribution consistency is inherited from the metrics.Metrics class.

Parameters

metric : str or list, default="euclidean"

Metrics to use when computing distances in the feature space and the projection space. When a string is provided, that same metric will be used for both the feature space and the projection space. Values are passed to scipy.spatial.distance.pdist.

k : int, default=7

Number of nearest neighbors to consider when computing the various metrics. Used by metric_trustworthiness, metric_continuity, metric_jaccard_similarity_coefficient, metric_neighborhood_hit and metric_distribution_consistency.

metric_total(k=7)

Function to compute the optimization metric.

Returns

total : float

The value between \([0, 1]\) of the optimization metric, i.e. distribution consistency.

class SHARC.metrics.Metrics(metric=['euclidean', 'euclidean'], k=7)

A base class for the computation of some basic metrics that quantify the performance of DR algorithms.

Parameters

metric : str or list, default=["euclidean", "euclidean"]

Metrics to use when computing distances in the feature space and the projection space. When a string is provided, that same metric will be used for both the feature space and the projection space. Values are passed to scipy.spatial.distance.pdist.

k : int, default=7

Number of nearest neighbors to consider when computing the various metrics. Used by metric_trustworthiness, metric_continuity, metric_jaccard_similarity_coefficient, metric_neighborhood_hit and metric_distribution_consistency.

fit(X, Y, labels=None)

Fit the provided data to the metric instance. That is, for both X and Y compact distance matrices and nearest neighbor sets are computed.

Parameters

X : array-like, shape (n_samples, n_features)

Feature space dataset.

Y : array-like, shape (n_samples, n_embedding_dimensions)

Projection space dataset.

labels : array-like, shape (n_samples,), default=None

An array of label values for each sample. Only required for purity/VSC metrics such as metric_neighborhood_hit, metric_distance_consistency and metric_distribution_consistency.

Returns

self : object

Returns self.

get_summary()

Function to get a summary of the computed metrics.

Returns

summary : dict

A dictionary containing all computed metrics and their values.
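A short sketch of the fit/summary workflow (X, Y, and labels are synthetic placeholders; Y would normally come from a DR method, and the summary reflects whichever metrics have been computed):

    import numpy as np
    from SHARC.metrics import Metrics

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 16))     # feature space data
    Y = rng.normal(size=(500, 2))      # stand-in for a 2-D projection
    labels = rng.integers(0, 4, size=500)

    m = Metrics(metric=["euclidean", "euclidean"], k=7)
    m.fit(X, Y, labels=labels)         # precompute distance matrices and kNN sets

    trust = m.metric_trustworthiness()
    nh = m.metric_neighborhood_hit()
    m.print_summary()                  # report the metrics computed so far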

metric_continuity()

Function to compute the continuity metric which quantifies the proportion of missing neighbors in the projection. The functional definition reads as follows:

(1)\[M_c(k) = 1 - \frac{2}{Nk(2N-3k-1)}\sum^{N}_{i=1}\sum_{j\in \mathcal{V}^k_i}(\hat{r}(i,j)-k)\]

In this definition, \(N\) is the number of samples in the dataset and \(k\) is the number of nearest neighbors to consider; \(k\) should always be smaller than \(N / 2\) for the metric to be properly normalized. The set \(\mathcal{V}^{k}_i\) consists of the \(k\) nearest neighbors of sample \(i\) in the original data space that are not among the \(k\) nearest neighbors of \(i\) after the projection. The quantity \(\hat{r}(i,j)\) specifies the rank of the point \(j\) when feature vectors are ordered based on their distance to point \(i\) after the projection.

Returns

continuity : float

The value between \([0,1]\) yielded by the continuity metric.

metric_distance_consistency()

Function to compute the distance consistency metric which measures how well separated data clusters with different labels are in the projection. The functional definition reads as follows:

(2)\[M_{\text{DSC}} = 1 - \frac{\left|\left\{\vec{x}\in D : \text{CD}(\vec{x}, \text{centr}(\text{clabel}(\vec{x}))) \neq 1\right\}\right|}{N}\]

In this definition, \(N\) is the number of samples in the dataset \(D\) and \(\text{CD}(\vec{x}, \text{centr}(\text{clabel}(\vec{x})))\) is the so-called centroid distance which is defined as follows:

\[\begin{split}\text{CD}(\vec{x}, \text{centr}(\text{clabel}(\vec{x}))) = \begin{cases} 1 & \text{if } d(\vec{x},\text{centr}(\text{clabel}(\vec{x}))) < d(\vec{x},\text{centr}(c_i))\ \forall i \in [0, m] \wedge c_i \neq \text{clabel}(\vec{x})\\ 0 & \text{otherwise} \end{cases}\end{split}\]

where \(\text{centr}(c_i)\) is the position of the centroid corresponding to all datapoints with class label \(c_i\), \(\text{clabel}(\vec{x})\) gets the class label of datapoint \(\vec{x}\) and \(d(\vec{x},\vec{y})\) is the distance between points \(\vec{x}\) and \(\vec{y}\).

Returns

distance_consistency : float

The value between \([0, 1]\) yielded by the distance consistency metric.

metric_distribution_consistency()

Function to compute the distribution consistency metric which measures how well separated data with different class labels are in the projection. The functional definition reads as follows:

(3)\[M_{\text{DC}} = 1 + \frac{1}{N\log_2(m)}\sum_{\vec{x}\in D}\sum_{i=0}^{m}\frac{p_{c_i}}{\sum_{i=0}^m p_{c_i}}\log_2\left(\frac{p_{c_i}}{\sum_{i=0}^m p_{c_i}}\right)\]

In this definition, \(N\) is the number of samples in the dataset \(D\), \(m\) is the number of unique class labels and \(p_{c_i}\) is the number of datapoints of class \(c_i\) in the nearest neighbor set of a point \(\vec{x}\). The way this metric is defined, it measures the average purity with respect to the class labels in the neighborhood of all points in the dataset. To probe the purity it uses the Shannon entropy.

Returns

distribution_consistency : float

The value between \([0, 1]\) yielded by the distribution consistency metric.
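An illustrative standalone computation of equation (3) with scikit-learn nearest neighbors; this sketches the definition rather than the library's code path, and it assumes integer labels in 0..m-1:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def distribution_consistency(Y, labels, k=7):
        """Average neighborhood purity via Shannon entropy, per equation (3)."""
        N = len(Y)
        m = len(np.unique(labels))
        # k nearest neighbors of each projected point (first hit is the point itself).
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y)
        total = 0.0
        for i in range(N):
            neigh = labels[idx[i, 1:]]                    # drop the self-neighbor
            counts = np.bincount(neigh, minlength=m).astype(float)
            p = counts[counts > 0] / counts.sum()         # class fractions p_c / sum p_c
            total += np.sum(p * np.log2(p))               # negative Shannon entropy
        return 1.0 + total / (N * np.log2(m))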

metric_jaccard_similarity_coefficient()

Function to compute the Jaccard similarity coefficient metric which quantifies the proportion of overlap between the \(k\)-nearest neighbor sets in the feature space and the projection space. The functional definition reads as follows:

(4)\[M_J(k) = \frac{1}{N}\sum^{N}_{i=1}\frac{\left|\mathcal{N}^k_i \cap \mathcal{M}^k_i\right|}{\left|\mathcal{N}^k_i \cup \mathcal{M}^k_i\right|}\]

In this definition, \(N\) is the number of samples in the dataset and \(k\) is the number of nearest neighbors to consider. The set \(\mathcal{N}^{k}_i\) consists of the \(k\) nearest neighbors of sample \(i\) in original data space. The set \(\mathcal{M}^{k}_i\) consists of the \(k\) nearest neighbors of sample \(i\) in the projection.

Returns

jaccard_similarity_coefficient : float

The value between \([0,1]\) yielded by the Jaccard similarity coefficient metric.
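Similarly, a hedged sketch of equation (4) (again an illustration of the definition, not SHARC's implementation):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def jaccard_similarity(X, Y, k=7):
        """Mean Jaccard overlap of kNN sets in feature and projection space."""
        _, nx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        _, ny = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y)
        scores = []
        for i in range(len(X)):
            a, b = set(nx[i, 1:]), set(ny[i, 1:])   # [1:] drops the self-neighbor
            scores.append(len(a & b) / len(a | b))
        return float(np.mean(scores))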

metric_neighborhood_hit()

Function to compute the neighborhood hit metric which measures how well separated datapoints with different labels are in the projection. The functional definition reads as follows:

(5)\[M_{NH}(k) = \frac{1}{kN}\sum^{N}_{i=1}\left|\left\{j\in\mathcal{N}^{k}_{i} | l_j = l_i\right\}\right|\]

In this definition, \(N\) is the number of samples in the dataset and \(k\) is the number of nearest neighbors to consider. The set \(\mathcal{N}^k_i\) is the set of nearest neighbors of point \(i\) in the projection space and \(l_i\) denotes the label of a point \(i\).

Returns

neighborhood_hit : float

The value between \([0, 1]\) yielded by the neighborhood hit metric.

metric_normalized_stress()

Function to compute the normalized stress metric, which quantifies the mismatch between pairwise distances in the feature space and the projection space. The functional definition reads as follows:

(6)\[M_{\sigma} = \frac{\sum^{N}_{i=1}\sum^{N}_{j=1}\left(\Delta^n(\mathbf{x}_i,\mathbf{x}_j)-\Delta^m\left(P(\mathbf{x}_i),P(\mathbf{x}_j)\right)\right)^2}{\sum^{N}_{i=1}\sum^{N}_{j=1}\Delta^n(\mathbf{x}_i,\mathbf{x}_j)^2}\]

In this definition, \(N\) is the number of samples in the dataset, \(\Delta^n(\mathbf{x}_i, \mathbf{x}_j)\) returns the distance between points \(i\) and \(j\) in the \(n\)-dimensional feature space, and \(P(\mathbf{x})\) denotes the projection of \(\mathbf{x}\) into \(m\) dimensions.

Returns

normalized_stress : float

The value between \([0, \infty]\) yielded by the normalized stress metric.

metric_shepard_goodness(return_shepard=False)

Function that computes the Shepard goodness metric, i.e. the Spearman rank correlation of the Shepard diagram.

Parameters

return_shepard : bool, default=False

Controls whether to return the Shepard diagram as well.

Returns

shepard_goodness : float

The value between \([0,1]\) of the Shepard goodness metric.

metric_trustworthiness()

Function to compute the trustworthiness metric which quantifies the proportion of false neighbors in the projection. The functional definition reads as follows:

(7)\[M_t(k) = 1 - \frac{2}{Nk(2N-3k-1)}\sum^{N}_{i=1}\sum_{j\in \mathcal{U}_i^k}(r(i,j) - k)\]

In this definition, \(N\) is the number of samples in the dataset and \(k\) is the number of nearest neighbors to consider; \(k\) should always be smaller than \(N / 2\) for the metric to be properly normalized. The set \(\mathcal{U}_i^k\) consists of the \(k\) nearest neighbors of sample \(i\) in the projection that are not amongst the \(k\) nearest neighbors of \(i\) in the original space. The quantity \(r(i,j)\) specifies the rank of the point \(j\) when feature vectors are ordered based on their distance to point \(i\) in the original space.

Returns

trustworthiness : float

The value between \([0,1]\) yielded by the trustworthiness metric.

print_summary(file=sys.stdout, end='\n')

Function to print a summary of the computed metrics.

Parameters

file : file-like object (stream), default=sys.stdout

The stream the summary is printed to.

end : str, default='\n'

String appended after the last value.

shepard_diagram()

Function that returns the Shepard diagram.

Returns

shepard_diagram : array-like, shape (n_pairs, 2)

An array of pairwise distances between points in the original data space and the projection.

NNP Models

class SHARC.nn_models.DenseBlock(*args, **kwargs)

Class constructor of a dense block.

Parameters

units : int (required)

Number of units in the Dense layer.

momentum : float between [0, 1], default=0.6 (optional)

Momentum parameter of the batch normalization layer. Should be close to 1 for slow learning of the batch normalization layer. Typically somewhere between 0.6 and 0.85 works fine for big batches.

alpha : float, default=0.3 (optional)

Negative slope coefficient of the leaky ReLU layer.

rate : float between [0, 1], default=0 (optional)

Dropout rate.

call(x, training=True)

Calls the model on new inputs and returns the outputs as tensors. In this case call() just reapplies all ops in the graph to the new inputs (e.g. builds a new computational graph from the provided inputs).

Note: this method should not be called directly. It is only meant to be overridden when subclassing tf.keras.Model. To call a model on an input, always use the __call__() method, i.e. model(inputs), which relies on the underlying call() method.

Args:

inputs: Input tensor, or dict/list/tuple of input tensors.

training: Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

mask: A mask or list of masks. A mask can be either a boolean tensor or None (no mask). For more details, check the guide at https://www.tensorflow.org/guide/keras/masking_and_padding.

Returns:

A tensor if there is a single output, or a list of tensors if there is more than one output.
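A hedged construction sketch (keyword names follow the Parameters listed above; the exact constructor signature may differ):

    import tensorflow as tf
    from SHARC.nn_models import DenseBlock

    # Dense -> batch normalization -> leaky ReLU -> dropout, per the description above.
    block = DenseBlock(units=64, momentum=0.7, alpha=0.3, rate=0.1)

    x = tf.random.normal((32, 16))   # batch of 32 samples with 16 features
    y = block(x, training=True)      # __call__ dispatches to call()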

class SHARC.nn_models.NNPModelBackboneV1(*args, **kwargs)

NNP model backbone class version 1.

Parameters

D1_units : int (required)

Number of units in the first dense layer of the network. Should not be less than 4!

**kwargs (optional)

Additional keyword arguments to be passed to each block in this backbone.

call(inputs, training=True)

Calls the model on new inputs and returns the outputs as tensors. In this case call() just reapplies all ops in the graph to the new inputs (e.g. builds a new computational graph from the provided inputs).

Note: this method should not be called directly. It is only meant to be overridden when subclassing tf.keras.Model. To call a model on an input, always use the __call__() method, i.e. model(inputs), which relies on the underlying call() method.

Args:

inputs: Input tensor, or dict/list/tuple of input tensors.

training: Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

mask: A mask or list of masks. A mask can be either a boolean tensor or None (no mask). For more details, check the guide at https://www.tensorflow.org/guide/keras/masking_and_padding.

Returns:

A tensor if there is a single output, or a list of tensors if there is more than one output.

class SHARC.nn_models.NNPModelBackboneV2(*args, **kwargs)

NNP model backbone class version 2.

Parameters

D1_units : int (required)

Number of units in the first dense layer of the network. Should not be less than 4!

**kwargs (optional)

Additional keyword arguments to be passed to each block in this backbone.

call(inputs, training=True)

Calls the model on new inputs and returns the outputs as tensors. In this case call() just reapplies all ops in the graph to the new inputs (e.g. builds a new computational graph from the provided inputs).

Note: this method should not be called directly. It is only meant to be overridden when subclassing tf.keras.Model. To call a model on an input, always use the __call__() method, i.e. model(inputs), which relies on the underlying call() method.

Args:

inputs: Input tensor, or dict/list/tuple of input tensors.

training: Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

mask: A mask or list of masks. A mask can be either a boolean tensor or None (no mask). For more details, check the guide at https://www.tensorflow.org/guide/keras/masking_and_padding.

Returns:

A tensor if there is a single output, or a list of tensors if there is more than one output.

SHARC.nn_models.construct_NNPModel(num_input_features, output_dimensions=2, output_activation='sigmoid', version=2, **kwargs)

Function to construct an NNP (neural network projection) model.

Parameters

num_input_features : int

The number of input features.

output_dimensions : int, default=2 (optional)

The number of output dimensions of the projection.

output_activation : str or function, default="sigmoid" (optional)

Activation function to use.

version : int, default=2 (optional)

Version of the NNP model backbone to use.

**kwargs (optional)

Additional keyword arguments will be passed to the NNP model backbone.

Returns

model : tensorflow Model

A tf.keras.Model instance.
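A minimal sketch constructing and compiling such a model (the optimizer, learning rate, and loss are illustrative choices; D1_units is assumed to be forwarded to the backbone):

    import tensorflow as tf
    from SHARC.nn_models import construct_NNPModel

    model = construct_NNPModel(
        num_input_features=16,
        output_dimensions=2,         # project to 2-D
        output_activation="sigmoid",
        version=2,
        D1_units=256,                # forwarded to the NNP model backbone
    )
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")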

NNP Training Utilities

SHARC.nn_training_utils.train_nnp(X, Y_true, model, loss_function, optimizer, labels=None, epochs=10, validation_ratio=0.25, save_path='./NNP', verbose=False)

Function that handles the training of the NNP model.

Parameters

X : array-like, shape (n_samples, n_features)

Feature space training dataset.

Y_true : array-like, shape (n_samples, n_embedding_dimensions)

Projection space training dataset.

model : tensorflow Model

The tf.keras.Model instance to train.

loss_function : loss function

A TensorFlow-compatible loss function (i.e. one that supports automatic differentiation) to use for optimization.

optimizer : tensorflow optimizer

The tf.keras optimizer to use for optimization.

labels : array-like, shape (n_samples,), default=None

An array containing the labels (as numeric values) corresponding to each sample in X and Y_true. When provided, it is used to stratify the cross-validation set.

epochs : int, default=10 (optional)

Maximum number of epochs.

validation_ratio : float, default=0.25 (optional)

Fraction of the dataset to use for cross validation at each training epoch.

save_path : str, default="./NNP" (optional)

Path to save the checkpoints, training history and trained model to.

verbose : bool, default=False (optional)

Controls the verbosity.

Returns

train_loss : numpy.ndarray, shape (epochs,)

Training loss at each epoch.

valid_loss : numpy.ndarray, shape (epochs,)

Validation loss at each epoch.

pred_train_loss : numpy.ndarray, shape (epochs,)

Inferential training loss at each epoch.
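A hedged training sketch, reusing the model from the construct_NNPModel example (X, Y_true and labels are placeholder arrays; the loss and optimizer are illustrative):

    import tensorflow as tf
    from SHARC.nn_training_utils import train_nnp
    from SHARC.loss_functions import MedianSquaredError

    # X: (n_samples, n_features) inputs; Y_true: (n_samples, 2) target projection.
    train_loss, valid_loss, pred_train_loss = train_nnp(
        X, Y_true, model,
        loss_function=MedianSquaredError(),
        optimizer=tf.keras.optimizers.Adam(1e-3),
        labels=labels,               # stratifies the validation split
        epochs=50,
        validation_ratio=0.25,
        save_path="./NNP",
        verbose=True,
    )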

Loss Function Definitions

class SHARC.loss_functions.AlternativeMeanAbsoluteError

Class for computing the Alternative Mean Absolute Error (AMAE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]

class SHARC.loss_functions.AlternativeMeanSquaredError

Class for computing the Alternative Mean Squared Error (AMSE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]

class SHARC.loss_functions.AlternativeMedianAbsoluteError

Class for computing the Alternative Median Absolute Error (AMedAE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]

class SHARC.loss_functions.AlternativeMedianSquaredError

Class for computing the Alternative Median Squared Error (AMedSE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]

class SHARC.loss_functions.MedianAbsoluteError

Class for computing the Median Absolute Error (MedAE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]

class SHARC.loss_functions.MedianSquaredError

Class for computing the Median Squared Error (MedSE) for predictions.

call(y_true, y_pred)

Parameters

y_true:

Ground truth values. shape = [batch_size, d0, .. dN], except sparse loss functions such as sparse categorical crossentropy where shape = [batch_size, d0, .. dN-1]

y_pred:

The predicted values. shape = [batch_size, d0, .. dN]
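For intuition, a median-based loss can be written as a small tf.keras.losses.Loss subclass. The sketch below illustrates the call(y_true, y_pred) contract; it is an independent example, not SHARC's actual implementation:

    import tensorflow as tf

    class SketchMedianSquaredError(tf.keras.losses.Loss):
        """Median of the per-element squared errors (illustrative only)."""

        def call(self, y_true, y_pred):
            err = tf.reshape(tf.square(y_true - y_pred), [-1])  # flatten errors
            n = tf.shape(err)[0]
            s = tf.sort(err)
            return s[n // 2]   # middle element (upper median for even n)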

SDR-NNP Classifier

class SHARC.classifiers.SDRNNPClassifier(nnp_model_path=None, classifier=None)

A classifier which implements SDR-NNP based classification.

Parameters

nnp_model_path : str, default=None

Path to the stored SDR-NNP model (required).

classifier : object, default=None

Classifier used for the final classification (required).

Attributes

X_ : ndarray, shape (n_samples, n_features)

The input passed during fit().

y_ : ndarray, shape (n_samples,)

The labels passed during fit().

classes_ : ndarray, shape (n_classes,)

The classes seen at fit().

fit(X, y)

Fit the SDR-NNP based classifier from the training dataset.

Parameters

X : array-like, shape (n_samples, n_features)

The training input samples.

y : array-like, shape (n_samples,)

The target values. An array of int.

Returns

self : object

Returns self.

plot_classifier_decision_boundaries(ax=None, grid_resolution=200, eps=0.2, plot_method='contourf', cmap=<matplotlib colormap>, alpha=0.3, **kwargs)

Plot decision boundaries for the trained classifier.

Parameters

ax : matplotlib Axes, default=None

Axes object to plot on. If None, a new figure and axes is created.

**kwargs

Additional arguments are passed to sklearn.inspection.DecisionBoundaryDisplay.from_estimator().

Returns

display : DecisionBoundaryDisplay

Object storing the result.

plot_projection(X, y=None, ax=None)

Plot the SDR-NNP projection of the input data X.

Parameters

X : array-like, shape (n_samples, n_features)

The input samples.

y : array-like, shape (n_samples,), default=None

The target values. An array of int.

ax : matplotlib Axes, default=None

Axes object to plot on. If None, a new figure and axes is created.

Returns

ax : matplotlib Axes

Axes object that was plotted on.

predict(X)

Predict the class labels for the provided data.

Parameters

X : array-like, shape (n_samples, n_features)

The input samples.

Returns

y : ndarray, shape (n_samples,)

Class labels for each data sample.

predict_proba(X)

Return probability estimates for the test data X.

Parameters

X : array-like, shape (n_samples, n_features)

The input samples.

Returns

p : ndarray, shape (n_samples, n_classes)

The probability estimates for each class, for each data sample.
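A hedged end-to-end sketch (the model path, base classifier and train/test arrays are placeholders):

    from sklearn.neighbors import KNeighborsClassifier
    from SHARC.classifiers import SDRNNPClassifier

    clf = SDRNNPClassifier(
        nnp_model_path="./NNP",                      # trained SDR-NNP model
        classifier=KNeighborsClassifier(n_neighbors=7),
    )
    clf.fit(X_train, y_train)            # project the data, then fit the classifier

    y_pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)
    ax = clf.plot_projection(X_test, y=y_test)       # inspect the 2-D embedding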

Consolidation Methods

SHARC.consolidation.alternative_consolidation(predictions)

When the predictions by the different classifiers are in disagreement, the sample is assigned to the post-consolidation outlier class.

Parameters

predictions : array-like, shape (n_classifiers, n_samples)

The predictions given by each classifier.

Returns

labels : array-like, shape (n_samples,)

The consolidated labels.

SHARC.consolidation.average_probability_consolidation(probabilities, threshold=None, label_lut=None)

Consolidation is done by averaging the probabilities for each class returned by each classifier. Samples are labelled by the class with the highest average probability.

Parameters

probabilities : array-like, shape (n_classifiers, n_samples, n_classes)

The probabilities predicted for each class by each classifier.

threshold : float, default=None (optional)

Optional probability threshold. Whenever the highest average probability falls below the given threshold value, the sample is classified as an outlier.

label_lut : array-like, shape (n_classes,)

Lookup table for the labels.

Returns

labels : array-like, shape (n_samples,)

The consolidated labels.

SHARC.consolidation.lowest_entropy_consolidation(probabilities, threshold=None, label_lut=None, return_entropies=False, return_selected_classifiers=False)

For each sample, use the classification of the classifier with the lowest entropy in its predicted class-label distribution.

Parameters

probabilities : array-like, shape (n_classifiers, n_samples, n_classes)

The probabilities predicted for each class by each classifier.

threshold : float, default=None

The entropy threshold. Samples with a post-consolidation entropy above this threshold will be classified as outliers. If None, no thresholding will be applied.

label_lut : array-like, shape (n_classes,)

Lookup table for the labels.

Returns

labels : array-like, shape (n_samples,)

The consolidated labels.

entropies : array-like, shape (n_classifiers, n_samples)

The computed entropy in the probabilities predicted for each class by each classifier. Only returned if return_entropies=True.

selected_classifiers : array-like, shape (n_samples,)

An array of indices corresponding to the classifier that was used in the final classification of each sample. Only returned if return_selected_classifiers=True.

SHARC.consolidation.majority_vote_consolidation(predictions)

Consolidation is done through a majority vote. When the vote is indecisive, the sample is classified as an outlier.

Parameters

predictions : array-like, shape (n_classifiers, n_samples)

The predictions given by each classifier.

Returns

labels : array-like, shape (n_samples,)

The consolidated labels.

SHARC.consolidation.multiplied_probability_consolidation(probabilities, threshold=None, label_lut=None)

Consolidation is done by multiplying the probabilities for each class returned by each classifier. Samples are labelled by the class with the highest multiplied probability.

Parameters

probabilities : array-like, shape (n_classifiers, n_samples, n_classes)

The probabilities predicted for each class by each classifier.

threshold : float, default=None (optional)

Optional probability threshold. Whenever the highest multiplied and normalized probability falls below the given threshold value, the sample is classified as an outlier.

label_lut : array-like, shape (n_classes,)

Lookup table for the labels.

Returns

labels : array-like, shape (n_samples,)

The consolidated labels.
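A sketch combining several trained classifiers (the classifiers list and X_test are placeholders; array shapes follow the conventions above):

    import numpy as np
    from SHARC import consolidation

    # Hard labels from each classifier: shape (n_classifiers, n_samples).
    predictions = np.stack([clf.predict(X_test) for clf in classifiers])
    labels_mv = consolidation.majority_vote_consolidation(predictions)

    # Class probabilities: shape (n_classifiers, n_samples, n_classes).
    probabilities = np.stack([clf.predict_proba(X_test) for clf in classifiers])
    labels_avg = consolidation.average_probability_consolidation(
        probabilities, threshold=0.5
    )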

Additional Utilities

SHARC.utils.insertColors(table, colors)

Combines magnitudes in an astropy Table into colours and adds them to the table. The astropy Table is modified in place.

Parameters

table : astropy Table

Table containing the magnitudes to be combined into colours.

colors : iterable

An array or list containing colours as strings, with magnitudes corresponding to column names in the astropy Table.

Returns

table : astropy Table

Modified astropy Table with colours added as columns to the end of the original Table.

SHARC.utils.writeDataset(table, filename, verbose=True, overwrite=False)

Writes an astropy Table to a FITS file.

Parameters

table : astropy Table

Table to be written to file.

filename : str

Filename of the file the astropy Table needs to be written to.

verbose : bool, default=True

Variable controlling the verbosity of this function.

overwrite : bool, default=False

Variable controlling whether to overwrite any existing file. When the file already exists and overwrite=False the dataset won’t be written and the function will exit.
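A short sketch using both utilities together; the colour-string format "BP-RP" is an assumption about how insertColors matches magnitude column names:

    from astropy.table import Table
    from SHARC.utils import insertColors, writeDataset

    # Hypothetical photometric catalogue with three magnitude columns.
    table = Table({"G": [14.2, 15.1], "BP": [14.8, 15.9], "RP": [13.6, 14.4]})

    # Add colour columns such as BP - RP (strings must reference column names).
    table = insertColors(table, ["BP-RP", "G-RP"])

    # Persist to a FITS file, refusing to overwrite an existing one.
    writeDataset(table, "catalogue.fits", verbose=True, overwrite=False)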

Plot Functions

class SHARC.plot_funcs.CustomConfusionMatrixDisplay(confusion_matrix, *, display_labels=None)