... | ... | @@ -2,16 +2,16 @@ |
|
|
|
|
|
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format that machine learning algorithms can use directly. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.
|
|
|
|
|
|
The central type of data are the signatures (one numerical vector per molecule), which are of three types:
|
|
|
The central type of data are the signatures (one numerical vector per molecule), which are of four types:
|
|
|
|
|
|
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently processed version of the raw data, containing TF-IDF weightings if applicable. They usually show explicit knowledge, which enables connectivity and interpretation.
|
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, so that 90% of the variance is kept. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
|
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures. They have fixed-length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
|
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, retaining 90% of the variance. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
|
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. They have fixed length, which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
|
|
|
* `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset. :warning: These signatures are not calculated yet, and won't be in the near future.
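
To make the "retaining 90% of the variance" criterion for `sign1` concrete, here is a minimal sketch of how the number of kept components could be chosen from a decomposition's eigenvalues. The function name `components_for_variance` and the example eigenvalues are hypothetical, and the CC's own pipeline may select components differently.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.9):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `target` (a fraction in (0, 1]).

    Hypothetical helper for illustration; not part of the CC code base.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    # Largest eigenvalues first, as in a PCA ranking of components.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return k
    # Floating-point rounding can leave the last ratio just below target.
    return len(ordered)


# Made-up eigenvalues: the first three components explain 95% of the variance.
n_kept = components_for_variance([6.0, 2.0, 1.5, 0.5], target=0.9)
```

With these made-up values, two components explain only 80% of the variance, so three are kept.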
|
|
|
|
|
|
Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`.
|
|
|
Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`. Typically, only the `1` version of each of the following is kept in the repository.
|
|
|
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`_obs`) or predicted (`_prd`) similarities. These signatures are [only applicable to exemplary datasets](production-phase). Currently, we only keep `sims1`.
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`*_obs`) or predicted (`*_prd`) similarities. These vectors are [only applicable to exemplary datasets](production-phase).
|
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): XXXX. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
|
|
|
* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): XXXX
|
|
|
* `clus*` [Clusters](#clusters): XXXX
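
As a rough illustration of two of the auxiliary data types above, the sketch below packs similarities into 8-bit integers and retrieves the nearest neighbors of a query signature. Everything here is an assumption made for the example: the helper names, the `[0, 1]` similarity range and its `0..127` scaling, the cosine distance, and the toy molecule keys. The text above does not specify how the CC computes any of these.

```python
import heapq
import math
from array import array


def quantize_similarities(sims):
    """Pack similarities into signed 8-bit integers.

    Assumes values in [0, 1] mapped to 0..127; this scaling is an
    illustrative choice, not the documented CC encoding.
    """
    return array("b", (round(max(0.0, min(1.0, s)) * 127) for s in sims))


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / norm


def nearest_neighbors(query, signatures, k=1000):
    """Return up to `k` (distance, key) pairs, closest first."""
    scored = ((cosine_distance(query, vec), key) for key, vec in signatures.items())
    return heapq.nsmallest(k, scored)


# Toy signatures keyed by hypothetical molecule identifiers.
toy = {"mol_a": [1.0, 0.0], "mol_b": [0.0, 1.0], "mol_c": [1.0, 1.0]}
top2 = nearest_neighbors([1.0, 0.0], toy, k=2)
packed = quantize_similarities([0.0, 1.0])
```

`heapq.nsmallest` avoids sorting the full candidate list when `k` is much smaller than the number of molecules, and `array("b")` stores each quantized similarity in a single byte.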
|
... | ... | |