|
|
|
|
|
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format ready-to-be-used by machine learning algorithms. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.
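For illustration, such a file can be read directly with `h5py`; the path below is a placeholder and the dataset names (`V`, `keys`) are assumptions based on the attributes described later in this page:

```python
import h5py

# Placeholder path; the actual location follows the CC folder structure.
with h5py.File("path/to/dataset/sign1/sign1.h5", "r") as f:
    V = f["V"][:]        # numerical matrix, one row per molecule (assumed dataset name)
    keys = f["keys"][:]  # molecule identifiers, typically InChIKeys (assumed dataset name)

print(V.shape, keys[:5])
```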
|
|
|
|
|
|
|
|
|
The central type of data is the *signature* (one numerical vector per molecule). There are 4 types of signatures:
|
|
|
|
|
|
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently-processed version of the raw data. They usually correspond to explicit knowledge, which enables connectivity and interpretation.
|
|
|
|
|
|
|
|
|
|
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A mildly compressed (usually latent) representation of the signatures, with a dimensionality that typically retains 90% of the original variance. They keep most of the complexity of the original data and they can be used for similarity calculations.
|
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. These signatures have a fixed length (e.g. 128-d), which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
|
|
|
* `sign3` [Signatures type 3](#signatures-type-3): Fixed-length (e.g. 128-d) representation of the data, capturing *and* inferring the original (signature type 1) similarity of the data. Signatures type 3 are available for any molecule of interest and have a confidence measure assigned to them.
|
|
|
|
|
|
Besides signatures, there are other auxiliary types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`.
|
|
|
|
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): Nearest neighbors search result. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario.
|
|
|
* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
|
|
|
* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase), and are mainly used for the [CC web resource](http://chemicalchecker.org).
|
|
|
|
|
|
Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
|
|
|
|
|
|
|
|
|
## Commonalities
|
|
|
*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*

Every CC signature must have:
|
|
|
* A matrix of data (`V`).
|
|
|
* Keys (typically InChIKeys) (`keys`).
|
|
|
* A `fit` method to be run at production time.
|
|
|
* A `predict` method to be run for out-of-production (new) data.
|
|
|
|
|
|
Every CC signature instance has the following methods:
|
|
|
|
|
|
|
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
|
* `validate()`: Performs a validation across external data such as MoA and ATC codes.
|
|
|
|
|
|
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset in memory.
|
|
|
|
|
|
* `keys`: The keys, **sorted alphabetically**.
|
|
|
* `__iter__()`: Batch iteration, if necessary.
|
|
|
* `__getattr__()`: Returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a `set` to do this).
|
|
|
|
|
|
It may also be useful to keep track of the folder where persistency models are stored (the sketch below illustrates the resulting interface):
|
|
|
|
|
|
* `PATH`: :thinking: Not sure whether it has to be an *absolute* path.
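As an illustration only, here is a minimal sketch of what such a signature container could look like. The class name is hypothetical, and key access is shown via `__getitem__` rather than `__getattr__`, since InChIKeys contain characters that are not valid in Python attribute names:

```python
import bisect

import numpy as np


class SignatureContainer:
    """Hypothetical sketch of the interface shared by CC signatures."""

    def __init__(self, V, keys, path):
        order = np.argsort(keys)
        self.keys = [keys[i] for i in order]  # sorted alphabetically
        self.V = np.asarray(V)[order]
        self._keyset = set(self.keys)         # fast membership checks
        self.PATH = path                      # folder where persistency models live

    def __iter__(self):
        # yield (key, vector) pairs; batching could be added for large datasets
        for key, row in zip(self.keys, self.V):
            yield key, row

    def __getitem__(self, key):
        if key not in self._keyset:
            return None                       # unknown keys return None
        idx = bisect.bisect_left(self.keys, key)
        return self.V[idx]
```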
|
|
|
|
|
|
### Signature commonalities
|
|
|
|
|
|
All signatures type 0-3 contain a numerical data matrix. It is also useful to have an idea of the background similarity distribution, given a certain distance metric (a sketch of this idea follows the list below):
|
|
|
|
|
|
* `V`: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
|
|
|
* `metric`: The distance used, typically `cosine`.
|
|
|
* `pvalues`: A `(dist, pval)` array.
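One possible way to obtain the `pvalues` array is to sample random molecule pairs and summarize the empirical background distribution of distances. This is only a sketch of that idea, not the actual CC procedure:

```python
import numpy as np
from scipy.spatial.distance import cosine


def background_pvalues(V, n_pairs=10000, n_bins=100, seed=42):
    """Empirical p-value: fraction of random pairs at distance <= dist."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(V), n_pairs)
    j = rng.integers(0, len(V), n_pairs)
    dists = np.array([cosine(V[a], V[b]) for a, b in zip(i, j) if a != b])
    grid = np.linspace(dists.min(), dists.max(), n_bins)
    pvals = np.array([(dists <= d).mean() for d in grid])
    return np.column_stack([grid, pvals])  # the (dist, pval) array
```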
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets, which can be summarized e.g. with an AUROC.
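A sketch of this validation, assuming a validation set of molecule pairs labeled as positive (e.g. sharing a MoA) or negative, and a signature object with dictionary-style access as sketched above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def validate_by_similarity(sign, pairs, labels):
    """pairs: list of (key_a, key_b); labels: 1 for positive pairs, 0 otherwise."""
    sims, kept = [], []
    for (a, b), y in zip(pairs, labels):
        va, vb = sign[a], sign[b]
        if va is None or vb is None:
            continue  # skip molecules without a signature
        cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        sims.append(cos)
        kept.append(y)
    return roc_auc_score(kept, sims)  # similar pairs should rank higher
```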
|
|
|
|
|
|
|
|
|
## Peculiarities

Below we detail the characteristics of each signature type and the algorithms behind them.
|
|
|
|
|
|
|
|
|
### Signatures type 0

These are the raw signatures that enter the CC pipeline. They are the result of processing a raw (but standard) input (please read about [connectivity](connectivity) for more information) and can be stored as sparse matrices. The input can be:
|
|
|
* _Sparse_: A `pairs` vector (e.g. molecule-target pairs), optionally with weights.
|
|
|
* _Dense_: An `X` matrix with `keys` (e.g. molecules) and `features` (e.g. cell lines).
|
|
|
|
|
|
Because they represent explicit data, features can be specified:

* `features`: For example, protein targets. Sorted alphabetically.

Signatures type 0 are minimally modified. We only apply the following procedures (a minimal sketch follows this list):
|
|
|
* _Imputation_: for _dense_ inputs, `NA` values are median-imputed.
|
|
|
* _Aggregation_: In case some keys are duplicated (for instance, at fit and predict time), the user can choose to keep the first instance, the last instance, or an average of the data.
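A minimal sketch of these two steps on a dense input, using pandas (the helper name and the aggregation labels are illustrative, not the CC API):

```python
import pandas as pd


def process_sign0_dense(X, keys, features, agg="average"):
    """Median-impute NA values and aggregate duplicated keys (sketch only)."""
    df = pd.DataFrame(X, index=keys, columns=features)
    df = df.fillna(df.median())                      # median imputation of NAs
    if agg == "first":
        df = df[~df.index.duplicated(keep="first")]
    elif agg == "last":
        df = df[~df.index.duplicated(keep="last")]
    else:                                            # "average" duplicated keys
        df = df.groupby(level=0).mean()
    return df.sort_index()                           # keys sorted alphabetically
```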
|
|
|
|
|
|
### Signatures type 1
|
|
|
|
|
|
These signatures are processed versions of the _experimental_ data available in the CC and can be used for similarity measures. They have variable dimensionality, depending on the CC space.

They are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining the variance. When fitting, we cut at 90% of the variance and we also identify the elbow of the variance-explained (scree) plot, which is stored as an attribute:
|
|
|
|
|
|
* `elbow`: Index of the dimension corresponding to the elbow point in the scree plot.
|
|
|
|
|
|
|
|
|
Conveniently, one can then do:
|
|
|
|
|
|
|
|
|
```python
|
|
|
# hypothetical signature type 1 instance: keep only the dimensions up to the elbow
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
|
|
|
```
|
|
|
The CC provides an automated pipeline for producing signatures type 1. We allow some flexibility and the user can choose to apply the following procedures (a sketch of the latent-representation step follows the list):

* _Latent representation_: For originally sparse matrices, this is TF-IDF Latent-semantic indexing (LSI). For dense matrices, this corresponds to a PCA (optionally, preceded by a robust median-based scaling). By default, we keep 90% of the variance.
|
|
|
* _Outlier removal_: Outlier keys (molecules) can be removed based on the isolation forest algorithm.
|
|
|
* _Metric learning_: A shallow siamese network can be trained to learn a latent representation of the space that has a good distance distribution. Metric learning requires similar/dissimilar cases (triplets); we offer two options:
|
|
|
  * _Unsupervised_: Triplets are drawn from the signature itself.
|
|
|
  * _Semi-supervised_: Triplets are drawn from other CC signatures. This helps integrate/contextualize the signature in question within the framework of the CC.
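As announced above, here is a sketch of the latent-representation step for a sparse input, using scikit-learn (TF-IDF followed by truncated SVD, keeping as many components as needed to reach 90% of the variance; parameter values are illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer


def lsi_projection(X_sparse, variance=0.9, max_components=500):
    """TF-IDF + LSI on a sparse keys-by-features matrix (sketch only)."""
    X_tfidf = TfidfTransformer().fit_transform(X_sparse)
    svd = TruncatedSVD(n_components=min(max_components, X_tfidf.shape[1] - 1))
    V = svd.fit_transform(X_tfidf)
    cum = np.cumsum(svd.explained_variance_ratio_)
    n_keep = int(np.searchsorted(cum, variance)) + 1  # smallest n reaching 90%
    return V[:, :n_keep]
```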
|
|
|
|
|
|
### Signatures type 2
|
|
|
|
|
|
These signatures are mostly used for internal machine-learning procedures, as they have a convenient fixed-length format.

Signatures type 2 are the result of a two-step process:
|
|
|
|
|
|
1. Load nearest-neighbor similarities type 1 as a graph.
|
|
|
2. Perform network embedding with `node2vec`.
|
|
|
|
|
|
|
|
|
It is not possible to produce network embeddings for out-of-sample (out-of-vocabulary) nodes, so a multi-output regression needs to be performed a posteriori (from signatures type 1 to signatures type 2) in order to enable `predict()` capabilities.
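A sketch of how such a mapping could be fitted; the choice of regressor and the dummy shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Dummy stand-ins: signatures type 1 (n x d1) and their network embeddings (n x 128).
X_sign1 = np.random.rand(1000, 300)
Y_sign2 = np.random.rand(1000, 128)

# Multi-output regression from signatures type 1 to signatures type 2 ...
regressor = MultiOutputRegressor(Ridge(alpha=1.0)).fit(X_sign1, Y_sign2)

# ... so that predict() also works for molecules absent from the similarity network.
x_new = np.random.rand(1, 300)
sign2_new = regressor.predict(x_new)
```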
|
|
|
|
|
|
|
|
|
### Signatures type 3
|
|
|
|
|
|
These signatures are fixed-length vectors available for _any_ molecule of interest. Thus, they are mostly *inferred* properties.

To learn signatures type 3:

* Triplets are sampled from type 1 similarities.

* Signatures type 2 across the CC are used as input for a deep siamese neural network. Thus, 25 fixed-length vectors are stacked.

* A signature-dropout (subsampling) procedure is applied to ensure that the data seen in the training set are _realistic_, meaning that the signature coverage resembles the coverage available for those molecules that do _not_ have data available for the CC space in question (a minimal sketch of this idea closes this section).

A confidence score is assigned to every signature, based on:

* _Applicability domain_, computed as the distance of the signature to the signatures in the training data.

* _Robustness_, determined by the variability of predictions when dropout is applied to the siamese network.

* A _prior_ confidence in light of the coverage of CC spaces available for the molecule.

:construction: Further, @mbertoni is trying to derive a regressor so that the `fit()` method becomes straightforward.
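A minimal numpy sketch of the signature-dropout idea mentioned above (array shapes and the masking policy are assumptions, not the actual CC implementation):

```python
import numpy as np


def signature_dropout(stacked, coverage_probs, rng=None):
    """Randomly mask whole CC spaces in a stacked (e.g. 25 x 128) input.

    stacked: signatures type 2 of one molecule, one row per CC space.
    coverage_probs: per-space probability of being available, mimicking the
                    coverage of molecules lacking data in the target space.
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(stacked.shape[0]) < coverage_probs  # spaces that survive
    masked = stacked.copy()
    masked[~keep] = 0.0                                    # drop whole spaces
    return masked
```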
|
|
|
|
|
|
### Similarity vectors
|
|
|
|
|
|
*You can skip this paragraph, really.*
|
|
|
|
|
|
These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one `sims1.h5` file per molecule and, inside each file, 25 separate vectors denoting a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding `sign1.h5` file. Similarities can be observed (`*_obs`) or predicted (`*_prd`). We keep these data for now because they are used by `targetmate` and by the [CC web app](http://chemicalchecker.org). Also, these signatures may be problematic during the 6-month update, which is why they are connected to the [PostgreSQL database](database).
|
|
|
|
|
|
### Nearest neighbors
|
|
|
|
|
|
A much better representation of the data is given by the k nearest neighbors, as calculated with [faiss](https://github.com/facebookresearch/faiss).
|
|
|
|
|
|
In this case, we have the following attributes:
|
|
|
|
|
|
* `indices`: An N·k matrix of integers denoting the indices of the neighbors.

* `D`: Correspondingly, an N·k matrix of real numbers indicating the distances.
|
|
|
|
|
|
To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k in practice (e.g. 1000).
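A sketch of the search with faiss, using inner product on L2-normalized vectors as a proxy for cosine similarity (array sizes and k are illustrative):

```python
import faiss
import numpy as np

V = np.random.rand(10000, 128).astype("float32")  # e.g. signatures type 1
faiss.normalize_L2(V)                             # inner product == cosine similarity

index = faiss.IndexFlatIP(V.shape[1])
index.add(V)

k = 1000
D, I = index.search(V, k)  # I: N x k neighbor indices; D: N x k similarities
```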
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
:construction: We haven't worked on it yet, but it should be easy. Perhaps truncated ROC curves will do.
|
|
|
|
|
|
### Predicted nearest neighbors
|
|
|
|
|
|
:construction: We haven't worked on it yet.
|
|
|
|
|
|
### Clusters
|
|
|
|
|
|
These are the results of a k-means clustering of an N·m matrix (typically signatures type 1). Accordingly, we store the centroids and the assignment of each sample to a centroid.
|
|
|
|
|
|
* `labels`: An N-vector indicating the index of the assigned centroid.
|
|
|
* `centroids`: A k·m matrix defining the centroids.
|
|
|
|
|
|
In this case, the `predict()` method simply seeks the closest centroid to the query.
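A sketch with scikit-learn, where `predict()` amounts to assigning the closest centroid (input data and k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

V = np.random.rand(5000, 128)                 # e.g. signatures type 1
kmeans = KMeans(n_clusters=100, n_init=10, random_state=42).fit(V)

labels = kmeans.labels_                       # N vector: centroid index per molecule
centroids = kmeans.cluster_centers_           # k x m matrix of centroids

queries = np.random.rand(10, 128)
new_labels = kmeans.predict(queries)          # closest centroid for each query
```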
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
A contingency table (Fisher's exact test) is built to check whether molecules belonging to the same cluster tend to be positive pairs in the validation sets.
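For instance, a 2×2 contingency table of same-cluster membership versus validation outcome could be assessed as follows (the helper and its inputs are hypothetical):

```python
import numpy as np
from scipy.stats import fisher_exact


def cluster_validation(labels_by_key, pairs, positives):
    """2x2 table: same cluster (yes/no) x positive pair (yes/no)."""
    table = np.zeros((2, 2), dtype=int)
    for (a, b), pos in zip(pairs, positives):
        same = labels_by_key[a] == labels_by_key[b]
        table[0 if same else 1, 0 if pos else 1] += 1
    odds, pval = fisher_exact(table, alternative="greater")
    return odds, pval
```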
|
|
|
|
|
|
### 2D projections
|
|
|
|
|
|
|
|
|
These are, actually, very similar to [signatures](#signature-commonalities), only that in this case `V` has 2 dimensions.
|
|
|
|
|
|
|
|
|
:construction: Together with @oguitart, we are figuring out how to obtain 2D projections that look good for a wide range of scenarios. Most probably, this will imply an intermediate clustering step.
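In the meantime, a plain t-SNE projection can be sketched as follows (input data and perplexity are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

V = np.random.rand(5000, 128)  # e.g. signatures type 2
proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(V)
# proj is an N x 2 matrix, i.e. the `V` of this data type
```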
|
|
|
|
|
|
|
|
|
|
|
\ No newline at end of file |