Signaturization

The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format that is ready to be used by machine learning algorithms. All CC data is stored as HDF5 files, following a folder structure organized by dataset.

The central type of data is the signature (one numerical vector per molecule). There are four types of signatures:

sign0 Signatures type 0: A sufficiently processed version of the raw data. They usually express explicit knowledge, which enables connectivity and interpretation.
sign1 Signatures type 1: A mildly compressed (usually latent) representation of the signatures, with a dimensionality that typically retains 90% of the original variance. They keep most of the complexity of the original data and they can be used for similarity calculations.
sign2 Signatures type 2: Network embedding of the similarity matrix derived from signatures type 1. These signatures have a fixed length (e.g. 128-d), which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
sign3 Signatures type 3: Fixed-length (e.g. 128-d) representation of the data, capturing and inferring the original (signature type 1) similarity of the data. Signatures type 3 are available for any molecule of interest and have a confidence measure assigned to them.

In addition, there are auxiliary types of data that may be of interest. The asterisk * stands for the signature type (0-3) that the auxiliary data corresponds to.

neig* Nearest neighbors: Result of a nearest-neighbor search. Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario.
clus* Clusters: Centroids and labels of a k-means clustering.
proj* 2D Projections: t-SNE 2D projections of the data.

Signature characteristics

Every CC signature must have:

A matrix of data (V).
Keys (typically InChIKeys) (keys).
A fit method to be run at production time.
A predict method to be run for out-of-production (new) data.
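
As a rough illustration of this contract, the sketch below defines a toy signature that holds V and keys and exposes fit and predict. The class name ToySignature, the placeholder keys and the mean-centering transformation are hypothetical and chosen only to keep the example short; they are not part of the CC codebase.

```python
from statistics import mean


class ToySignature:
    """Hypothetical stand-in for a CC signature: a data matrix V plus keys."""

    def __init__(self):
        self.V = []        # one numerical vector (row) per molecule
        self.keys = []     # typically InChIKeys, aligned with the rows of V
        self._center = []  # state learned by fit and reused by predict

    def fit(self, keys, X):
        """Learn the transformation on production data and store V and keys."""
        self._center = [mean(col) for col in zip(*X)]
        self.keys = list(keys)
        self.V = [self._transform(row) for row in X]
        return self

    def predict(self, X_new):
        """Apply the already fitted transformation to new molecules."""
        return [self._transform(row) for row in X_new]

    def _transform(self, row):
        # Toy transformation (assumption): subtract the column means seen at fit time.
        return [x - c for x, c in zip(row, self._center)]


sig = ToySignature().fit(["KEY-A", "KEY-B"], [[1.0, 4.0], [3.0, 8.0]])
new_vectors = sig.predict([[2.0, 6.0]])
```

The point of the split is that predict never re-estimates anything: whatever was learned at fit time is applied unchanged to out-of-production molecules.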

Below we detail the characteristics of each signature type and the algorithms behind it.

Signatures type 0

These are the raw signatures that enter the CC pipeline. The input of these signatures can be:

Sparse: A pairs vector (e.g. molecule-target pairs), optionally with weights.
Dense: An X matrix with keys (e.g. molecules) and features (e.g. cell lines).
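
To make the two input shapes concrete, the following sketch converts a sparse pairs vector into the equivalent dense X matrix. The function name pairs_to_matrix and the example identifiers are illustrative assumptions, not CC API.

```python
def pairs_to_matrix(pairs, weights=None):
    """Turn (key, feature) pairs into sorted keys, sorted features and a dense X."""
    if weights is None:
        weights = [1.0] * len(pairs)  # unweighted pairs count as 1
    keys = sorted({k for k, _ in pairs})
    features = sorted({f for _, f in pairs})
    row = {k: i for i, k in enumerate(keys)}
    col = {f: j for j, f in enumerate(features)}
    X = [[0.0] * len(features) for _ in keys]
    for (k, f), w in zip(pairs, weights):
        X[row[k]][col[f]] = w
    return keys, features, X


# Hypothetical molecule-target pairs with weights.
pairs = [("mol1", "targetA"), ("mol1", "targetB"), ("mol2", "targetB")]
keys, features, X = pairs_to_matrix(pairs, weights=[1.0, 0.5, 2.0])
```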

Signatures type 0 are minimally modified. We only apply the following modifications:

Imputation: for dense inputs, NA values are median-imputed.
Aggregation: If some keys are duplicated (for instance, at fit and predict time), the user can choose to keep the first instance, the last instance, or an average of the data.
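
The two modifications above can be sketched as follows. This is a minimal illustration under stated assumptions: medians are taken per feature (column), a column with no observed values falls back to 0.0, and the function names are hypothetical.

```python
import math
from statistics import median


def median_impute(X):
    """Replace NaN entries with the median of the observed values in each column."""
    medians = []
    for col in zip(*X):
        observed = [x for x in col if not math.isnan(x)]
        # Assumption: fall back to 0.0 when a column has no observed value.
        medians.append(median(observed) if observed else 0.0)
    return [[m if math.isnan(x) else x for x, m in zip(row, medians)] for row in X]


def aggregate_duplicates(keys, X, how="first"):
    """Collapse duplicated keys, keeping the first row, the last row, or the average."""
    groups = {}
    for k, row in zip(keys, X):
        groups.setdefault(k, []).append(row)
    out_keys, out_X = [], []
    for k, rows in groups.items():
        if how == "first":
            merged = rows[0]
        elif how == "last":
            merged = rows[-1]
        elif how == "average":
            merged = [sum(col) / len(col) for col in zip(*rows)]
        else:
            raise ValueError(f"unknown aggregation: {how}")
        out_keys.append(k)
        out_X.append(merged)
    return out_keys, out_X


nan = float("nan")
X = median_impute([[1.0, nan], [3.0, 4.0], [nan, 8.0]])
keys, X_avg = aggregate_duplicates(["a", "b", "a"], X, how="average")
```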

Signatures type 1

These signatures are processed versions of the experimental data available in the CC and can be used for similarity measures. They have variable dimensionality, depending on the CC space.

The CC provides an automated pipeline for producing signatures type 1. We allow some flexibility and the user can choose to apply the following procedures:

Latent representation: For originally sparse matrices, this is TF-IDF Latent-semantic indexing (LSI). For dense matrices, this corresponds to a PCA (optionally, preceded by a robust median-based scaling). By default, we keep 90% of the variance.
Outlier removal: Outlier keys (molecules) can be removed based on the isolation forest algorithm.
Metric learning: A shallow siamese network can be trained to learn a latent representation of the space that has a good distance distribution. Metric learning requires similar/dissimilar cases (triplets); we offer two options:
- Unsupervised: Triplets are drawn from the signature itself.
- Semi-supervised: Triplets are drawn from other CC signatures. This helps integrate/contextualize the signature in question within the framework of the CC.
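
To illustrate the default 90% variance criterion of the latent representation step, the sketch below picks the number of components to keep from the variance explained by each component. The function name components_to_keep and the example variances are hypothetical; the decomposition itself (LSI or PCA) is assumed to have been computed elsewhere.

```python
from itertools import accumulate


def components_to_keep(explained_variances, keep=0.9):
    """Smallest number of leading components whose cumulative variance reaches `keep`."""
    ordered = sorted(explained_variances, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= keep:
            return k
    return len(ordered)


# Variance explained by each component of a hypothetical decomposition.
variances = [6.0, 2.5, 1.0, 0.5]
n_dims = components_to_keep(variances)
```

Because the cut-off depends on how the variance is distributed, the resulting dimensionality differs from one CC space to another, which is why signatures type 1 have variable length.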

Signatures type 2

These signatures are mostly used for internal machine-learning procedures, as they have a convenient fixed-length format.

Signatures type 2 are produced in two steps:

Construction of a similarity network using signatures type 1.
Network embedding (node2vec).
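
The two steps can be sketched roughly as follows. This is a simplification under stated assumptions: the similarity network is a directed k-nearest-neighbor graph built with Euclidean distance, and the walks are uniform random walks, which is what node2vec reduces to when its return and in-out parameters p and q are both 1. The step that learns fixed-length vectors from the walks is omitted, and all function names and parameter values are hypothetical.

```python
import math
import random


def knn_graph(V, k=2):
    """Connect each row of V to its k nearest rows (Euclidean distance)."""
    graph = {}
    for i, vi in enumerate(V):
        others = sorted((j for j in range(len(V)) if j != i),
                        key=lambda j: math.dist(vi, V[j]))
        graph[i] = others[:k]
    return graph


def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Uniform random walks; node2vec would bias each step with p and q."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks


# Toy signatures type 1: two well-separated groups of molecules.
sign1 = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
graph = knn_graph(sign1, k=2)
walks = random_walks(graph)
```

In a full pipeline the walks would be fed to a skip-gram model, so that molecules that co-occur in walks end up with similar embeddings.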

Signatures type 3

These signatures are fixed-length vectors available for any molecule of interest. Thus, they are mostly inferred properties.

To learn signatures type 3:

Triplets are sampled from type 1 similarities.
Signatures type 2 across the CC are used as input for a deep siamese neural network. Thus, 25 fixed-length vectors are stacked.
A signature-dropout (subsampling) procedure is applied to ensure that the data seen in the training set are realistic, meaning that the signature coverage resembles the coverage available for those molecules that do not have data available for the CC space in question.
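
A minimal sketch of the signature-dropout idea is given below. It masks whole per-space vectors of the stacked input with a single uniform probability, which is a simplification: the procedure described above draws coverage patterns that resemble those of molecules lacking data in the space of interest. The function name, the keep_prob parameter and the rule of always keeping one observed space are assumptions.

```python
import random


def signature_dropout(stacked, available, keep_prob=0.5, seed=0):
    """Zero out whole signature blocks for randomly chosen CC spaces.

    stacked:   list of per-space vectors for one molecule (e.g. 25 x 128).
    available: booleans, True where the molecule has data in that space.
    """
    rng = random.Random(seed)
    kept = [a and rng.random() < keep_prob for a in available]
    if any(available) and not any(kept):
        # Assumption: always keep one observed space so the input is never empty.
        kept[rng.choice([i for i, a in enumerate(available) if a])] = True
    masked = [vec if k else [0.0] * len(vec) for vec, k in zip(stacked, kept)]
    return masked, kept


n_spaces, dim = 25, 4  # toy dimension; real signatures would be e.g. 128-d
stacked = [[1.0] * dim for _ in range(n_spaces)]
available = [i % 2 == 0 for i in range(n_spaces)]
masked, kept = signature_dropout(stacked, available)
```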

A confidence score is assigned to every signature, based on:

Applicability domain, computed as the distance of the signature to the signatures in the training data.
Robustness, determined by the variability of predictions when dropout is applied to the siamese network.
A prior confidence in light of the coverage of CC spaces available for the molecule.
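
The first two ingredients can be illustrated with the sketch below. It is only a rough approximation under stated assumptions: applicability is measured as the mean Euclidean distance to the k nearest training signatures, and robustness as the mean per-dimension standard deviation over repeated predictions with dropout. The function names are hypothetical, and the way these quantities are scaled and combined with the prior into a final score is not covered here.

```python
import math
from statistics import mean, pstdev


def applicability_distance(signature, training_signatures, k=1):
    """Mean distance from a signature to its k nearest training signatures."""
    distances = sorted(math.dist(signature, t) for t in training_signatures)
    return mean(distances[:k])


def prediction_spread(dropout_predictions):
    """Mean per-dimension standard deviation across repeated dropout predictions."""
    return mean(pstdev(dim) for dim in zip(*dropout_predictions))


train = [[0.0, 0.0], [10.0, 10.0]]                # toy training signatures
query = [3.0, 4.0]                                # signature being scored
repeats = [[3.0, 4.0], [3.0, 6.0], [3.0, 5.0]]    # predictions under dropout
distance = applicability_distance(query, train)
spread = prediction_spread(repeats)
```

A small distance and a small spread both point towards a more trustworthy signature; larger values suggest the molecule lies outside the training domain or that the prediction is unstable.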