Signaturization
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format ready to be used by machine learning algorithms. All CC data are stored as HDF5 files following a folder structure organized by datasets.
The central type of data are the signatures (one numerical vector per molecule), which are of four types:
- sign0 (Signatures type 0): A sufficiently processed version of the raw data, containing TF-IDF weightings if applicable. They usually show explicit knowledge, which enables connectivity and interpretation.
- sign1 (Signatures type 1): A PCA/LSI-projected version of the data, retaining 90% of the variance. They keep most of the complexity of the original data and they can be used for almost-exact similarity calculations.
- sign2 (Signatures type 2): Network embedding of the similarity matrix derived from signatures type 1. They have fixed length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
- sign3 (Signatures type 3): Network embedding of observed and inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually any molecule in any dataset. ⚠ These signatures are not calculated yet, and won't be in the near future.
Besides, there are other (auxiliary) types of data that may be of interest. The asterisk * denotes correspondence with signatures type 0-3. Typically, the repository holds the type 1 version of each of the following (e.g. neig1).
- sims* (Similarity vectors): Full similarities stored as light int8 data. Each molecule receives one such similarity vector per dataset. They may be observed (*_obs) or predicted (*_prd) similarities. These signatures are only applicable to exemplary datasets.
- neig* (Nearest neighbors): Result of a nearest-neighbor search. Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep neig1.
- nprd* (Predicted nearest neighbors): ⚠ These are not available and will not be in the near future.
- clus* (Clusters): Centroids and labels of a k-means clustering.
- proj* (2D projections): t-SNE 2D projections of the data.
I consider the numbering 0-3 to be conceptually closed. However, further auxiliary data types may be introduced in the future. Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
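The naming rule (a 4-character code followed by a digit) is easy to check programmatically. The following is a minimal sketch, not part of the CC code base: the helper name parse_cc_type is hypothetical, and restricting the code to lowercase letters is an assumption based on the names listed above.

```python
import re

# Assumption: the 4-character code is lowercase letters, the suffix a single digit,
# as in "sign0", "neig1", "clus1" or "proj1".
CC_TYPE_PATTERN = re.compile(r"([a-z]{4})([0-9])")


def parse_cc_type(name):
    """Split a CC data type name into (code, digit); raise ValueError otherwise."""
    match = CC_TYPE_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError("not a valid CC data type name: %r" % (name,))
    return match.group(1), int(match.group(2))
```

For example, parse_cc_type("neig1") splits the name into the code "neig" and the digit 1, while a name such as "signature1" is rejected.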
Commonalities
The following applies to all data types except the similarity vectors sims*, which are of a very different nature due to their organization in the folder structure.
Every CC data type has the following methods:
- fit(): Takes an input and learns to produce an output.
- predict(): Uses the fitted models to go from input to output.
- validate(): Performs a validation against external data such as MoA and ATC codes.
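One way to enforce this common interface is an abstract base class. This is only a sketch: the class name CCData and the argument names are hypothetical, since the text fixes only the three method names.

```python
from abc import ABC, abstractmethod


class CCData(ABC):
    """Hypothetical base class for a CC data type (name and signatures are assumed)."""

    @abstractmethod
    def fit(self, data):
        """Take an input and learn to produce an output."""

    @abstractmethod
    def predict(self, data):
        """Use the fitted models to go from input to output."""

    @abstractmethod
    def validate(self, validation_set):
        """Check the output against external data (e.g. MoA or ATC codes)."""
```

With this design, a data type that forgets to implement one of the three methods cannot be instantiated, so the omission surfaces early.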
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset in memory.
- keys: The keys, sorted alphabetically.
- __iter__(): Batch iteration, if necessary.
- __getattr__(): Returns the vector corresponding to the key. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
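The lookup strategy described here (sorted keys, bisect for the position, a set for the membership test) can be sketched as follows. The class KeyedVectors and its batches() method are hypothetical, the data live in plain Python lists rather than in an HDF5 file, and the sketch spells key access as __getitem__, which is an assumption about how the access would look in Python.

```python
import bisect


class KeyedVectors:
    """Hypothetical in-memory stand-in for a CC data object."""

    def __init__(self, keys, vectors):
        # Sort keys alphabetically and reorder the vectors accordingly.
        order = sorted(range(len(keys)), key=lambda i: keys[i])
        self.keys = [keys[i] for i in order]
        self._vectors = [vectors[i] for i in order]
        # The set gives a constant-time membership test.
        self._key_set = set(self.keys)

    def __getitem__(self, key):
        """Return the vector for `key`, or None if the key is unknown."""
        if key not in self._key_set:
            return None
        idx = bisect.bisect_left(self.keys, key)  # binary search on sorted keys
        return self._vectors[idx]

    def batches(self, batch_size):
        """Yield (keys, vectors) chunks of at most `batch_size` rows."""
        for start in range(0, len(self.keys), batch_size):
            stop = start + batch_size
            yield self.keys[start:stop], self._vectors[start:stop]
```

In a file-backed version, the same binary search would give the row index to read from disk, so only the keys (not the data matrix) need to be held in memory.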
I think it may be useful to keep track of the folder where the persisted models are stored:
- PATH: 🤔 Not sure whether it has to be an absolute path.
Signature commonalities
All signatures type 0-3 contain a numerical data matrix. Also, I think it is interesting to have an idea of the background similarity distribution, given a certain distance metric.
- V: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
- metric: The distance used, typically cosine.
- pvalues: A (dist, pval) array.
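As an illustration of what the pvalues array could contain, the sketch below estimates an empirical background distribution by sampling random pairs of rows from V. The function name, the sampling scheme and the binning are all assumptions, and rows of V are assumed to be non-zero.

```python
import numpy as np


def background_pvalues(V, n_pairs=10000, n_bins=100, seed=0):
    """Return an (n_bins, 2) array of (dist, pval) rows for the cosine distance.

    pval is the fraction of sampled random pairs whose distance is <= dist.
    """
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j  # discard pairs of a molecule with itself
    a, b = V[i[keep]], V[j[keep]]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    dists = 1.0 - cos
    pvals = np.linspace(0.0, 1.0, n_bins + 1)[1:]  # 1/n_bins, ..., 1.0
    cutoffs = np.quantile(dists, pvals)  # distance reached at each p-value
    return np.column_stack([cutoffs, pvals])
```

Given such a table, the p-value of an observed distance can be looked up by finding the first row whose dist is greater than or equal to it.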
Validation
For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets; this can be quantified with, e.g., the AUROC.
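A minimal, dependency-free sketch of this validation follows. It assumes the validation set is given as molecule pairs with a 0/1 label; the function names and the toy signatures in the usage example are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def auroc(scores, labels):
    """Probability that a positive pair scores higher than a negative pair.

    Ties count as one half. Quadratic in the number of pairs, so only
    suitable for small validation sets.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Usage with made-up signatures: A and B share a (hypothetical) MoA, C does not.
signatures = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
pairs = [("A", "B", 1), ("A", "C", 0), ("B", "C", 0)]
scores = [cosine_similarity(signatures[a], signatures[b]) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
result = auroc(scores, labels)
```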
Peculiarities
Signatures type 0
These signatures result from processing a raw (but standard) input. They can be stored as sparse matrices. Because they represent explicit data, features can be specified.
- features: For example, protein targets. Sorted alphabetically.
Signatures type 1
These signatures are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining the variance. In the fitting of these features, we cut at 90% of the variance, and we also identify the elbow of the variance-explained plot. I think that the elbow should be stored in this class:
- elbow: Index of the dimension corresponding to the elbow point in the scree plot.
Conveniently, one can then do:
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
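The 90% variance cut can be illustrated with a plain SVD-based PCA. This is a sketch under assumptions: the function names are hypothetical, the input is a dense matrix, and the elbow detection is left out because the text does not say how the elbow is identified.

```python
import numpy as np


def fit_pca(X, variance_cutoff=0.9):
    """Fit a PCA and keep the fewest components reaching `variance_cutoff`."""
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)  # variance ratio per component
    cumulative = np.cumsum(explained)
    # First index whose cumulative variance is >= cutoff, converted to a count.
    n_components = int(np.searchsorted(cumulative, variance_cutoff)) + 1
    n_components = min(n_components, len(S))
    return mean, Vt[:n_components], explained[:n_components]


def transform(X, mean, components):
    """Project X onto the retained components."""
    return (X - mean) @ components.T
```

Because the components are ordered by explained variance, slicing the first columns of the projected matrix (as with the elbow index) always keeps the most informative dimensions.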
Signatures type 2
Signatures type 2 are the result of a two-step process:
- Load nearest-neighbor similarities type 1 as a graph.
- Perform network embedding with node2vec.
It is not possible to produce network embeddings for out-of-sample (out-of-vocabulary) nodes, so a multi-output regression needs to be performed a posteriori (from signatures type 1 to signatures type 2) in order to endow predict() capabilities.
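The a posteriori regression can be sketched as below. The text does not say which regressor is used, so the closed-form ridge regression here is purely an assumption chosen for brevity; the function names and the alpha parameter are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Learn a linear map from signatures type 1 (X) to type 2 (Y).

    X is (n_molecules, d1), Y is (n_molecules, d2). All d2 outputs are
    fitted at once. For simplicity the intercept is penalized as well.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)  # weights of shape (d1 + 1, d2)


def predict_ridge(W, X):
    """Predict signatures type 2 for molecules that were not in the graph."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W
```

The point of the design is that, once fitted, predict() only needs the signature type 1 of a new molecule and never has to rerun the network embedding.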
Signatures type 3
Similarity vectors
You can skip this paragraph, really.
These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one sims1.h5 file per molecule, and inside each file there are 25 separate vectors holding a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding sign1.h5 file. Similarities can be observed (*_obs) or predicted (*_prd). OK, this is confusing, but who cares. We keep these data for now because they are used by targetmate and by the CC web app. Also, these signatures may be problematic during the 6-month update, which is why they are connected to the PostgreSQL database.
Nearest neighbors
A much better representation of the data is given by the k nearest neighbors, as calculated with faiss.
In this case, we have the following attributes:
- indices: An N·k matrix of integers denoting the index of the neighbors.
- D: Correspondingly, an N·k matrix of real numbers indicating the distances.
To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k in practice (e.g. 1000).
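To make the shape of these two attributes concrete, here is a brute-force sketch in NumPy. It deliberately does not use faiss: it computes all pairwise cosine distances, which is only feasible for small N, and the function name is hypothetical. Rows of V are assumed to be non-zero.

```python
import numpy as np


def knn_cosine(V, k):
    """Return (indices, D), both of shape (N, k), for cosine distance.

    Each molecule is searched against the full set, so its own index
    normally appears as the first neighbor with distance close to 0.
    """
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-length rows
    dist = 1.0 - Vn @ Vn.T  # (N, N) cosine distances
    indices = np.argsort(dist, axis=1, kind="stable")[:, :k]
    D = np.take_along_axis(dist, indices, axis=1)
    return indices, D
```

Row i of indices lists the neighbors of molecule i from closest to farthest, and row i of D holds the matching distances.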
Validation
Predicted nearest neighbors
Clusters
These are the results of a k-means clustering of an N·m matrix (typically signature type 1). Therefore, we store the centroids and the assignment of each sample to a centroid.
- labels: An N vector indicating the index of the centroid assigned to each sample.
- centroids: A k·m matrix defining the centroids.
In this case, the predict() method simply seeks the closest centroid to the query.
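The closest-centroid lookup can be sketched in a few lines. The function name is hypothetical and the Euclidean distance is assumed, as is usual for k-means.

```python
import numpy as np


def predict_labels(X, centroids):
    """Assign each row of X (n, m) to its closest centroid (k, m)."""
    # Squared Euclidean distance between every query and every centroid: (n, k).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```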
Validation
A contingency table (Fisher's test) checking whether molecules belonging to the same cluster tend to be positive pairs in the validation sets.
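A dependency-free sketch of this test is shown below. The layout of the 2x2 table, the representation of positive pairs as a set of frozensets, and the choice of a one-sided test (enrichment of positive pairs within clusters) are assumptions made for the example.

```python
from math import comb


def contingency_table(pairs, labels, positives):
    """Count pairs as (a, b, c, d).

    a: same cluster and positive     b: same cluster, not positive
    c: different clusters, positive  d: different clusters, not positive
    """
    a = b = c = d = 0
    for i, j in pairs:
        same = labels[i] == labels[j]
        positive = frozenset((i, j)) in positives
        if same and positive:
            a += 1
        elif same:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test: P(at least `a` same-cluster positives)."""
    same_total = a + b
    positive_total = a + c
    n = a + b + c + d
    upper = min(same_total, positive_total)
    favorable = sum(
        comb(same_total, x) * comb(n - same_total, positive_total - x)
        for x in range(a, upper + 1)
    )
    return favorable / comb(n, positive_total)
```

A small p-value indicates that positive pairs fall within the same cluster more often than expected by chance.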
2D projections
These are, actually, very similar to signatures, only that in this case V has 2 dimensions. The fit() method becomes straightforward.