Signaturization
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format that is ready to be used by machine learning algorithms. All CC data is stored as HDF5 files following a folder structure organized by datasets.
The central type of data are the signatures (one numerical vector per molecule), which are of four types:
- sign0: Signatures type 0. A sufficiently processed version of the raw data, containing TF-IDF weightings if applicable. They usually show explicit knowledge, which enables connectivity and interpretation.
- sign1: Signatures type 1. A PCA/LSI-projected version of the data, retaining 90% of the variance. They keep most of the complexity of the original data and can be used for almost-exact similarity calculations.
- sign2: Signatures type 2. Network embedding of the similarity matrix derived from signatures type 1. They have fixed length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
- sign3: Signatures type 3. Network embedding of observed and inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually any molecule in any dataset. ⚠ These signatures are not calculated yet, and won't be in the near future.
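As an illustration of the TF-IDF weighting mentioned for signatures type 0, the sketch below weights a small molecule-by-feature count table. The formula (raw count times log(N/df)) is one common TF-IDF variant and is an assumption here, as are the function name tfidf and the toy data; the text does not specify which variant the CC uses.

```python
import math

def tfidf(counts):
    """Weight a {key: {feature: count}} table with a basic TF-IDF scheme.

    Assumption: tf is the raw count and idf is log(N / df). Other variants exist.
    """
    n_docs = len(counts)
    # Document frequency: number of molecules in which each feature appears.
    df = {}
    for feats in counts.values():
        for feat, c in feats.items():
            if c > 0:
                df[feat] = df.get(feat, 0) + 1
    weighted = {}
    for key, feats in counts.items():
        weighted[key] = {
            feat: c * math.log(n_docs / df[feat])
            for feat, c in feats.items()
            if c > 0
        }
    return weighted

# Hypothetical toy data: molecule keys mapped to feature counts.
raw = {
    "MOL_A": {"FEAT_1": 1, "FEAT_2": 1},
    "MOL_B": {"FEAT_1": 1},
    "MOL_C": {"FEAT_1": 1, "FEAT_3": 2},
}
sign0_like = tfidf(raw)
```

With this variant, a feature present in every molecule (FEAT_1 here) receives weight zero, while rarer features are up-weighted.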
Besides the signatures, there are other (auxiliary) types of data that may be of interest. The asterisk * denotes correspondence with signatures type 0-3. Typically, the repository holds version 1 of the following (that is, the one corresponding to signatures type 1).
- sims*: Similarity vectors. Full similarities stored as light int8 data. Each molecule receives one such similarity vector per dataset. They may be observed (*_obs) or predicted (*_prd) similarities. These signatures are only applicable to exemplary datasets.
- neig*: Nearest neighbors. Nearest-neighbor search results. Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep neig1.
- nprd*: Predicted nearest neighbors. ⚠ These are not available and will not be in the near future.
- clus*: Clusters. Centroids and labels of a k-means clustering.
- proj*: 2D projections. t-SNE 2D projections of the data.
I consider the numbering 0-3 to be conceptually closed. However, further auxiliary data types may be introduced in the future. Note that all names consist of a 4-character code followed by a digit. Future data should stick to this nomenclature.
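The naming rule can be checked mechanically. The regular expression below is a minimal sketch that assumes lowercase letters for the 4-character code, a digit in 0-3, and an optional _obs or _prd suffix as used for similarity vectors; the helper name is_valid_cc_name is hypothetical.

```python
import re

# Assumed pattern: 4 lowercase letters, one digit in 0-3, optional _obs/_prd suffix.
CC_NAME = re.compile(r"[a-z]{4}[0-3](_obs|_prd)?")

def is_valid_cc_name(name):
    """Return True if `name` follows the 4-character-code-plus-digit convention."""
    return CC_NAME.fullmatch(name) is not None
```

For example, sign0, neig1 and sims1_obs pass this check, while sign4 or signature1 do not.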
Commonalities
The following applies to all data types except the similarity vectors sims*, which are of a very different nature due to their organization in the folder structure.
Every CC data type will have the following methods:
- fit(): Takes an input and transforms the data to an output.
- predict(): Uses the fitted models to go from input to output.
- validate(): Performs a validation against external data such as MoA and ATC codes.
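To make the validation step more concrete, here is a minimal sketch of two statistics this proposal has in mind for signatures: AUROC and the two-sample Kolmogorov-Smirnov (KS) statistic. It assumes that validation reduces to comparing similarities between molecule pairs that share an annotation (for example, the same MoA) against pairs that do not. The function names and toy similarities are hypothetical, and the real validate() may work differently.

```python
from bisect import bisect_right

def auroc(pos, neg):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical similarities for pairs sharing an MoA vs. pairs that do not.
same_moa = [0.9, 0.8, 0.7, 0.4]
diff_moa = [0.5, 0.3, 0.2, 0.1]
print(auroc(same_moa, diff_moa), ks_statistic(same_moa, diff_moa))
```

The quadratic AUROC loop is only meant for readability; for large validation sets a rank-based computation would be preferable.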
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset into memory.
- keys: The keys, sorted alphabetically.
- __iter__(): Batch iteration, if necessary.
- __getattr__(): Returns the vector corresponding to the key. Works fast with bisect, but should return None if the key is not in keys (ideally, keep a set to do this).
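The key-access requirements can be sketched with the standard library alone. The class below is illustrative: KeyedVectors and its in-memory row list stand in for an HDF5-backed matrix, and it uses __getitem__ for key lookup, which is an assumption (the list above names __getattr__).

```python
from bisect import bisect_left

class KeyedVectors:
    """Toy stand-in for a CC data object: sorted keys plus one vector per key."""

    def __init__(self, data):
        # Keep keys sorted so that bisect can locate a row by binary search.
        self.keys = sorted(data)
        # A set gives constant-time membership tests for unknown keys.
        self._keyset = set(self.keys)
        # In practice this would be an on-disk dataset, not a Python list.
        self._rows = [data[k] for k in self.keys]

    def __getitem__(self, key):
        if key not in self._keyset:
            return None  # unknown key: return None instead of raising
        return self._rows[bisect_left(self.keys, key)]

    def iter_batches(self, batch_size=2):
        """Yield (keys, vectors) chunks so callers never need the full matrix."""
        for start in range(0, len(self.keys), batch_size):
            stop = start + batch_size
            yield self.keys[start:stop], self._rows[start:stop]

# Hypothetical keys; real data would typically be keyed by InChIKey.
vecs = KeyedVectors({"KEY_C": [0.3], "KEY_A": [0.1], "KEY_B": [0.2]})
```

Here vecs["KEY_B"] returns its vector through a binary search over the sorted keys, and vecs["KEY_Z"] returns None without touching the rows.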
I think that it may be useful to keep track of the folder where the persisted models are stored:
- PATH: 🤔 Not sure whether it has to be an absolute path.
Signature commonalities
All signatures type 0-3 contain a numerical data matrix. Also, I think it is interesting to have an idea of the background similarity distribution, given a metric.
- V: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
- metric: The distance used, typically cosine.
- pvalues: A (dist, pval) array.
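One way to obtain such a (dist, pval) array is to sample random pairs of rows of V, compute their distances, and record the empirical fraction of background pairs at or below each distance. This is a sketch under that assumption; the function names, the number of sampled pairs, and the fixed seeds are illustrative choices, not the CC implementation.

```python
import math
import random
from bisect import bisect_right

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def background_pvalues(V, n_pairs=1000, seed=0):
    """Return sorted (dist, pval) tuples estimated from randomly sampled row pairs.

    pval is the fraction of sampled pairs whose distance is <= dist.
    """
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(V)), 2)  # two distinct rows
        dists.append(cosine_distance(V[i], V[j]))
    dists.sort()
    return [(d, bisect_right(dists, d) / n_pairs) for d in sorted(set(dists))]

# Hypothetical data: 50 random 8-dimensional vectors.
_rng = random.Random(42)
V = [[_rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
table = background_pvalues(V, n_pairs=200)
```

A query distance can then be mapped to an empirical p-value by looking up the closest entry in the table, for instance with bisect on the first column.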
Peculiarities
Signatures type 0
These signatures are the result of a processing of a raw (but standard) input. They can be stored as sparse matrices. Because they represent explicit data, features can be specified.
- features: For example, protein targets. Sorted alphabetically.
Signatures type 1
These signatures are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining variance. In the fitting of these features, we cut at 90% of variance, and we also identify the elbow of the variance-explained plot. I think that the elbow should be stored in this class:
- elbow: Index of the dimension corresponding to the elbow point in the scree plot.
Conveniently, one can do:
    # Keep only the dimensions up to the elbow of a type 1 signature.
    elbow_idx = my_sign1.elbow
    V_red = my_sign1.V[:, :elbow_idx]
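For reference, both cut points can be computed from the per-component explained-variance ratios. The sketch below assumes those ratios are already available (for example, from a PCA fit) and uses the maximum-distance-to-chord heuristic for the elbow. The text does not say which elbow criterion is used, so this choice, the function names, and the toy ratios are assumptions.

```python
def variance_cutoff(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    total = 0.0
    for n, r in enumerate(ratios, start=1):
        total += r
        if total >= threshold:
            return n
    return len(ratios)

def elbow_index(ratios):
    """Index of the scree-plot point lying farthest below the chord from first to last point.

    Assumes at least two components. The vertical gap to a fixed line is proportional
    to the perpendicular distance, so both give the same index.
    """
    last = len(ratios) - 1
    y0, y1 = ratios[0], ratios[-1]
    best_idx, best_gap = 0, 0.0
    for i, y in enumerate(ratios):
        chord = y0 + (y1 - y0) * i / last  # height of the chord at position i
        gap = chord - y
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Hypothetical explained-variance ratios, sorted in decreasing order.
ratios = [0.5, 0.25, 0.1, 0.06, 0.04, 0.03, 0.02]
n_keep = variance_cutoff(ratios)   # components needed to reach 90% of the variance
elbow = elbow_index(ratios)        # candidate value for the elbow attribute
```

Storing the elbow next to the full matrix lets users choose between the 90% cut (more faithful) and the elbow cut (more compact) without refitting.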
Signatures type 2
Signatures type 3
Similarity vectors
Nearest neighbors
Predicted nearest neighbors
Clusters
2D projections
Below, I list a schematic proposal of the classes:
class
As you know, we have most of the CC data stored as HDF5 files. I think the HDF5 format is good and we should stick to it. However, I think that these files should be accessible through some classes. These classes must not load all data into memory.
Signature classes
- Applies to:
  - Signatures Type 0
  - Signatures Type 1
  - Signatures Type 2
  - Signatures Type 3
  - 2D projections
- Every class must have at least the following attributes:
  - V: the values
  - keys: the keys, sorted alphabetically
  - metric: the distance used
  - pvalues: (distance, p-value) array
  - PATH: the path where everything is stored
- Every class must have at least the following methods:
  - __iter__: smart, batch iteration, if necessary
  - __getattr__
  - fit: Not sure this is necessary... Maybe we can just do it as part of the pipeline.
  - predict: For new samples, we should be able to produce the corresponding V vectors. This will be, by far, the trickiest part. One should access the models folder and use the stored models accordingly. To increase speed here, one should probably predict only for the molecules that are not already in the reference. Sometimes it will be necessary to learn a mapping function, for instance via AdaNet; this is the case for Signature Type 2, as node2vec does not allow for out-of-sample mapping.
  - validate: I'm thinking of a folder where we have validation files (for now, MoA and ATC), and then automatically outputting AUROC and KS metrics, among others.
  - background: Not sure this is necessary... Just like fit, this has to be done only at initialization.
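The speed-up suggested for predict (only compute vectors for molecules that are not already in the reference) can be sketched as follows. Here model stands for whatever fitted mapping is loaded from the models folder; it is a hypothetical plain callable, and the reference is a plain dict rather than an HDF5-backed object.

```python
def predict_vectors(keys, reference, model):
    """Return {key: vector}, reusing reference vectors and calling `model` only for new keys.

    reference: mapping from key to an already computed vector.
    model: callable taking a list of keys and returning one vector per key
           (assumed interface, not the actual CC API).
    """
    known = {k: reference[k] for k in keys if k in reference}
    missing = [k for k in keys if k not in reference]
    if missing:
        # One batched call for all new molecules instead of one call per molecule.
        for k, vec in zip(missing, model(missing)):
            known[k] = vec
    # Preserve the order in which the keys were requested.
    return {k: known[k] for k in keys}
```

For a case like Signature Type 2, the callable would wrap a learned mapping function, since the embedding itself cannot be applied to unseen molecules.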
Other classes
- We have other data types, such as the nearest neighbors produced by Oriol and the clusters produced by myself.
- These must also have at least the following methods:
  - __iter__
  - __getattr__
  - predict: As always, we want to be able to predict for new molecules using the stored models.
  - validate: Here we will not use AUROC and KS, but other statistics, depending on the case.
Here I put a scheme of the first part of the Chemical Checker pipeline: