|
|
|
|
|
Every CC data type will have the following methods:
|
|
|
|
|
|
|
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
|
* `validate()`: Performs a validation across external data such as MoA and ATC codes.
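The three methods above could be formalized as a shared abstract base class, along these lines (a sketch only; the class name and signatures are illustrative, not the actual CC implementation):

```python
from abc import ABC, abstractmethod

class BaseSignature(ABC):
    """Hypothetical sketch of the interface common to all CC data classes."""

    @abstractmethod
    def fit(self, X):
        """Take an input and learn to produce an output."""

    @abstractmethod
    def predict(self, X):
        """Use the fitted models to go from input to output."""

    @abstractmethod
    def validate(self):
        """Validate against external data such as MoA and ATC codes."""
```

Concrete classes (signatures, clusters, nearest neighbors, …) would then subclass this and fill in the three methods.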
|
|
|
|
|
|
|
|
|
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset in memory.
|
|
|
|
|
|
* `keys`: The keys, **sorted alphabetically**.
|
|
|
* `__iter__()`: Batch iteration, if necessary.
|
|
|
|
|
|
* `__getattr__()`: Returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a `set` to do this).
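A minimal in-memory sketch of this access pattern, assuming keys and vectors are held in sorted parallel lists (the class name and the `get()` helper are illustrative; the real class reads from HDF5 without loading the whole dataset):

```python
from bisect import bisect_left

class KeyedData:
    """Sketch: alphabetically sorted keys with fast, set-guarded lookup."""

    def __init__(self, keys, vectors):
        # Keep keys sorted alphabetically; a set gives O(1) membership tests.
        order = sorted(range(len(keys)), key=lambda i: keys[i])
        self.keys = [keys[i] for i in order]
        self._vectors = [vectors[i] for i in order]
        self._key_set = set(self.keys)

    def __iter__(self):
        # Iterate over (key, vector) pairs in key order.
        return iter(zip(self.keys, self._vectors))

    def get(self, key):
        # Return the vector for `key`, or None if it is absent.
        if key not in self._key_set:
            return None
        return self._vectors[bisect_left(self.keys, key)]
```

The set check avoids `bisect` silently returning a neighbor's position for a missing key.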
|
|
|
|
|
|
I think that it may be interesting to keep track of the folder where persistence models are stored:
|
|
|
|
|
|
|
|
|
### Signature commonalities
|
|
|
|
|
|
|
|
|
All signatures type 0-3 contain a numerical data matrix. Also, I think it is interesting to have an idea of the background similarity distribution, given a certain distance metric.
|
|
|
|
|
|
* `V`: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
|
|
|
* `metric`: The distance used, typically `cosine`.
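One simple way to estimate such a background distribution is to sample random molecule pairs from `V` and compute their pairwise distances (a sketch under that assumption; the function name and sampling scheme are illustrative):

```python
import numpy as np

def background_distances(V, n_pairs=10000, seed=42):
    """Estimate the background cosine-distance distribution of a signature
    matrix V by sampling random row pairs. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    a, b = V[i], V[j]
    # Cosine distance = 1 - cosine similarity, computed row-wise.
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den
```

The resulting empirical distribution can then be used to convert a raw distance into a percentile or p-value.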
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
|
|
|
For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets (e.g. AUROC).
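The AUROC here can be computed directly from the similarities and the positive/negative pair labels via the rank-sum (Mann–Whitney) identity; a numpy-only sketch (not the actual CC code, and ties in the scores are not averaged):

```python
import numpy as np

def auroc(similarities, labels):
    """AUROC of similarity scores against binary positive-pair labels."""
    s = np.asarray(similarities, dtype=float)
    y = np.asarray(labels, dtype=bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    # Rank 1 = lowest similarity.
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    # Mann-Whitney U statistic, normalized to [0, 1].
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUROC of 1.0 means every positive pair is more similar than every negative pair; 0.5 is random.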
|
|
|
|
|
|
## Peculiarities
|
|
|
|
### Signatures type 0

These signatures are the result of processing a raw (but standard) input.
|
|
|
|
|
### Signatures type 1
|
|
|
|
|
|
|
|
|
These signatures are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining the variance. In the fitting of these features, we cut at 90% of the variance, and we also identify the elbow of the variance-explained plot. I think that the elbow should be stored in this class:
|
|
|
|
|
|
* `elbow`: Index of the dimension corresponding to the elbow point in the scree plot.
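The 90% cut and the elbow could be derived from the per-component explained-variance ratios along these lines (a sketch; the elbow is estimated here with a simple farthest-point-from-the-chord heuristic on the scree curve, which may differ from the method actually used):

```python
import numpy as np

def variance_cutoffs(explained_variance_ratio, cut=0.90):
    """Return (number of components reaching `cut` cumulative variance,
    index of the elbow of the scree plot). Illustrative sketch."""
    ev = np.asarray(explained_variance_ratio, dtype=float)
    cum = np.cumsum(ev)
    cut_idx = int(np.searchsorted(cum, cut)) + 1  # components kept at 90%
    # Elbow: point of the scree curve farthest from the chord joining
    # its first and last points (constant denominator omitted for argmax).
    x = np.arange(len(ev))
    x0, y0, x1, y1 = x[0], ev[0], x[-1], ev[-1]
    d = np.abs((y1 - y0) * x - (x1 - x0) * ev + x1 * y0 - y1 * x0)
    elbow = int(np.argmax(d))
    return cut_idx, elbow
```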
|
|
|
|
|
|
|
|
|
Conveniently, one can then do:
|
|
|
|
|
|
```python
|
|
|
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
```

### Signatures type 2
|
|
|
|
|
Signatures type 2 are the result of a two-step process:
|
|
|
|
|
|
|
|
|
1. Load nearest-neighbor similarities type 1 as a graph.
|
|
|
2. Perform network embedding with `node2vec`.
|
|
|
|
|
|
|
|
|
It is not possible to produce network embeddings for out-of-sample (out-of-vocabulary) nodes, so a multi-output regression needs to be performed a posteriori (from signatures type 1 to signatures type 2) in order to endow `predict()` capabilities.
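As a sketch of that a-posteriori mapping, a linear multi-output least-squares regression from sign1-space to sign2-space would look as follows (the real pipeline may well use a richer regressor; the class name is illustrative):

```python
import numpy as np

class OutOfSamplePredictor:
    """Sketch: map signatures type 1 to signatures type 2 so that
    out-of-vocabulary molecules can still get a type-2 signature."""

    def fit(self, S1, S2):
        # Least-squares fit with a bias column; W maps sign1 to sign2 space.
        X = np.hstack([S1, np.ones((S1.shape[0], 1))])
        self.W, *_ = np.linalg.lstsq(X, S2, rcond=None)
        return self

    def predict(self, S1_new):
        X = np.hstack([S1_new, np.ones((S1_new.shape[0], 1))])
        return X @ self.W
```

`fit()` is trained on molecules that have both signatures; `predict()` then serves any molecule for which a type-1 signature exists.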
|
|
|
|
|
|
### Signatures type 3
|
|
|
|
... | ... | @@ -87,18 +87,20 @@ It is not possible to produce embedding for out-of-sample (out-of-vocabulary) no |
|
|
|
|
|
### Similarity vectors
|
|
|
|
|
|
|
|
|
*You can skip this paragraph, really.*
|
|
|
|
|
|
These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one `sims1.h5` file per molecule, and inside each of the files there are 25 separate vectors denoting a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding `sign1.h5` file. Similarities can be observed (`*_obs`) or predicted (`*_prd`). OK, this is confusing, but who cares. We keep these data for now because they are used by `targetmate` and by the [CC web app](http://chemicalchecker.org). Also, these signatures may be problematic during the 6-month update, which is why they are connected to the [PostgreSQL database](database).
|
|
|
|
|
|
### Nearest neighbors
|
|
|
|
|
|
A much better representation of the data is given by the k nearest neighbors, as calculated with [faiss](https://github.com/facebookresearch/faiss).
|
|
|
|
|
|
|
|
|
In this case, we have the following attributes:
|
|
|
|
|
|
* `indices`: An N·k matrix of integers denoting the indices of the neighbors.
|
|
|
|
|
|
* `D`: Correspondingly, an N·k matrix of real numbers indicating the distances.
|
|
|
|
|
|
|
|
|
To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k in practice (e.g. 1000).
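For real data the search is done with faiss, but a brute-force numpy sketch makes clear what `indices` and `D` contain (Euclidean distance assumed here for simplicity):

```python
import numpy as np

def nearest_neighbors(V, k=5):
    """Brute-force k nearest neighbors, returning `indices` (N·k integers)
    and `D` (N·k distances), i.e. the shapes a faiss search yields.
    Sketch only; at CC scale use faiss instead."""
    # Pairwise squared Euclidean distances via the expansion trick.
    sq = (V ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * V @ V.T
    np.maximum(d2, 0, out=d2)  # guard against tiny negative round-off
    indices = np.argsort(d2, axis=1)[:, :k]
    D = np.sqrt(np.take_along_axis(d2, indices, axis=1))
    return indices, D
```

Note that each molecule's own index appears first in its row (distance 0), which is the usual faiss behavior when querying the indexed set against itself.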
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
|
|
|
### Clusters
|
|
|
|
|
|
|
|
|
These are the results of a k-means clustering of an N·m matrix (typically signature type 1). Therefore, we store the centroids and the assignment of each sample to a centroid.
|
|
|
|
|
|
|
|
|
* `labels`: An N vector indicating, for each sample, the index of its centroid.
|
|
|
* `centroids`: A k·m matrix defining the centroids.
|
|
|
|
|
|
In this case, the `predict()` method simply seeks the closest centroid to the query.
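That closest-centroid lookup is a one-liner in numpy (a sketch; the function name is illustrative):

```python
import numpy as np

def predict_cluster(centroids, query):
    """Sketch of the clusters' predict(): return the index of the centroid
    closest (Euclidean) to each query vector."""
    q = np.atleast_2d(query)
    # Distance from every query row to every centroid row.
    d = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```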
|
|
|
|