... | @@ -5,25 +5,24 @@ The main feature of the CC is the automatic conversion of virtually any compound |
|
The central type of data are the signatures (one numerical vector per molecule), which are of four types:
|
|
The central type of data are the signatures (one numerical vector per molecule), which are of four types:
|
|
|
|
|
|
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently-processed version of the raw data. They usually show explicit knowledge, which enables connectivity and interpretation.
|
|
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently-processed version of the raw data. They usually show explicit knowledge, which enables connectivity and interpretation.
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, retaining 90% of the variance. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A (TF-IDF) PCA/LSI-projected version of the signatures type 0, retaining 90% of the variance. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. They have fixed length, which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. They have a fixed length (e.g. 128-d), which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
|
|
* `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset. :warning: These signatures are not calculated yet, and won't be in the near future.
|
|
* `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset.
|
|
|
|
|
|
Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`. Typically, we have the `1` version of the following in the repository.
|
|
Besides, there are other auxiliary types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`.
|
|
|
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase).
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): Nearest neighbors search result. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario.
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): Nearest neighbors search result. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
|
|
|
|
* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): Predicted nearest neighbors. :warning: These are not available and will not be in the near future.
|
|
|
|
* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
|
|
* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
|
|
* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.
|
|
* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.
|
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase), and are mainly used for the [CC web resource](http://chemicalchecker.org).
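
To illustrate what a `neig*` result contains, here is a minimal, brute-force sketch of a k-nearest-neighbor search over a handful of signatures. The function names, the dictionary layout and the choice of cosine distance are assumptions made for this example only; the text above does not state which metric or search library the CC actually uses.

```python
import heapq
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric, not confirmed by the CC docs)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(signatures, query_key, k=1000):
    """Return up to k (key, distance) pairs closest to the query molecule.

    `signatures` is a hypothetical dict mapping a molecule key to its vector.
    The default k=1000 mirrors the number of neighbors mentioned in the text.
    """
    query = signatures[query_key]
    candidates = (
        (cosine_distance(query, vec), key)
        for key, vec in signatures.items()
        if key != query_key  # a molecule is not its own neighbor here
    )
    return [(key, dist) for dist, key in heapq.nsmallest(k, candidates)]


toy_signatures = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
toy_neighbors = nearest_neighbors(toy_signatures, "A", k=2)
```

A brute-force scan like this only works for toy data; at the scale of a real dataset an approximate search index would be needed.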
|
|
|
|
|
|
I consider the numbering `0`-`3` to be conceptually closed. However, further auxiliary data types may be introduced in the future. Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
|
|
Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
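
The nomenclature can be checked mechanically. The sketch below assumes that the 4-character code is lowercase letters and that the digit is in the range `0`-`3`, as the listed names suggest; `parse_data_name` is a hypothetical helper, not part of the CC code base, and it does not handle suffixes such as `_obs` or `_prd`.

```python
import re

# Assumption: four lowercase letters followed by a single digit between 0 and 3.
NAME_PATTERN = re.compile(r"([a-z]{4})([0-3])")


def parse_data_name(name):
    """Split a name such as 'sign2' into ('sign', 2); raise ValueError otherwise."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(
            f"{name!r} does not follow the <4-character code><digit> nomenclature"
        )
    code, level = match.groups()
    return code, int(level)
```

For example, `parse_data_name("neig1")` yields the code `neig` and the level `1`, while a name such as `signature1` is rejected.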
|
|
|
|
|
|
## Commonalities
|
|
## Commonalities
|
|
|
|
|
|
*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*
|
|
*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*
|
|
|
|
|
|
Every CC data will have the following methods:
|
|
Every CC signature instance has the following methods:
|
|
|
|
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
* `predict()`: Uses the fitted models to go from input to output.
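
The `fit()`/`predict()` contract described here can be sketched with a toy class. `MeanCenterSignature` is purely hypothetical and is not a CC class: it "learns" per-dimension means in `fit()` and applies them to new inputs in `predict()`, which is only meant to show the division of labor between the two methods.

```python
class MeanCenterSignature:
    """Hypothetical stand-in for a CC signature; illustrates fit/predict only."""

    def __init__(self):
        self.means = None  # learned state, filled in by fit()

    def fit(self, vectors):
        """Learn from the input: here, the mean of each dimension."""
        n = len(vectors)
        dim = len(vectors[0])
        self.means = [sum(v[j] for v in vectors) / n for j in range(dim)]
        return self

    def predict(self, vectors):
        """Use the fitted state to map new inputs to outputs."""
        if self.means is None:
            raise RuntimeError("call fit() before predict()")
        return [[x - m for x, m in zip(v, self.means)] for v in vectors]


toy_model = MeanCenterSignature().fit([[1.0, 2.0], [3.0, 4.0]])
toy_output = toy_model.predict([[2.0, 3.0], [4.0, 3.0]])
```

The point of the split is that `fit()` runs once on the data used to build a signature, whereas `predict()` can be called repeatedly on molecules that were not part of that data.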
|
... | | ... | |