# Signaturization
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format that is ready to be used by machine learning algorithms. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.

To discuss with @mbertoni and @oguitart (I've put Martino as assignee because I think he is the most experienced with Python classes).

The central data type is the signature (one numerical vector per molecule). Signatures come in four types:
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently processed version of the raw data, containing TF-IDF weightings if applicable. They usually express explicit knowledge, which enables connectivity and interpretation.
* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, so that 90% of the variance is kept. They keep most of the complexity of the original data and can be used for *almost-exact* similarity calculations.
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures. They have a fixed length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
* `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset. :warning: These signatures are not calculated yet, and won't be in the near future.

Besides these, there are other (auxiliary) types of data that may be of interest. `*` denotes correspondence with signatures type `0`-`3`.
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as lightweight `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`_obs`) or predicted (`_prd`) similarities. These vectors are [only applicable to exemplary datasets](production-phase). Currently, we only keep `sims1`.
* `neig*` [Nearest neighbors](#nearest-neighbors): Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.

Although I consider the numbering (0-3) conceptually closed, further auxiliary data types may be introduced in the future.
## Commonalities
*The following applies to all data types except the [similarity vectors](#similarity-vectors) (`sims*`), which are of a very different nature due to their organization in the [folder structure](production-phase).*
## Signatures type 0
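As a rough illustration of the TF-IDF weighting mentioned for `sign0` in the overview, the sketch below weights a small molecule-by-feature count table. It assumes the raw data arrives as per-molecule feature counts and uses the plain `tf * log(N / df)` variant; the function name `tfidf` and the toy features are hypothetical, and the actual CC preprocessing may differ.

```python
import math
from collections import Counter


def tfidf(counts_per_molecule):
    """Weight feature counts by TF-IDF (plain tf * log(N / df) variant).

    counts_per_molecule: dict mapping molecule id -> dict of feature -> count.
    Returns the same structure with float weights.
    """
    n_molecules = len(counts_per_molecule)
    # Document frequency: number of molecules in which each feature occurs.
    df = Counter()
    for counts in counts_per_molecule.values():
        df.update(f for f, c in counts.items() if c > 0)
    weighted = {}
    for mol, counts in counts_per_molecule.items():
        total = sum(counts.values())
        weighted[mol] = {
            f: (c / total) * math.log(n_molecules / df[f])
            for f, c in counts.items()
            if c > 0
        }
    return weighted


# Hypothetical toy data: three molecules annotated with targets.
raw = {
    "mol_a": {"target_1": 2, "target_2": 1},
    "mol_b": {"target_1": 1},
    "mol_c": {"target_1": 1, "target_3": 3},
}
weights = tfidf(raw)
```

In this toy table, a feature shared by every molecule gets weight zero, while features specific to a single molecule dominate its vector.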
## Signatures type 1
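The 90% variance criterion described for `sign1` in the overview can be illustrated with a small helper that picks how many principal components to keep. This is a sketch only: it assumes the per-component explained variances (for example, PCA eigenvalues) are already available, and `n_components_for_variance` is a hypothetical name rather than part of the CC code.

```python
def n_components_for_variance(explained_variances, threshold=0.9):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `threshold`."""
    # Components are conventionally ranked by decreasing variance.
    ranked = sorted(explained_variances, reverse=True)
    total = sum(ranked)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, variance in enumerate(ranked, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(ranked)


# Hypothetical eigenvalues of a five-feature dataset.
eigenvalues = [5.0, 3.0, 1.5, 0.3, 0.2]
n_keep = n_components_for_variance(eigenvalues)
```

Because the number of retained components depends on the data, the resulting vectors may differ in length from one dataset to another.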
## Signatures type 2
## Signatures type 3
## Similarity vectors
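To illustrate why `int8` storage keeps similarity vectors light, the sketch below quantizes similarities in `[0, 1]` to one signed byte each. The linear scaling to `0..127` is an assumption made for the example (the actual CC encoding is not specified here), and `encode_similarities` and `decode_similarities` are hypothetical names.

```python
from array import array

SCALE = 127  # assumed: similarity 1.0 maps to the largest int8 value


def encode_similarities(similarities):
    """Quantize similarities in [0, 1] to signed 8-bit integers."""
    clipped = (min(1.0, max(0.0, s)) for s in similarities)
    # Typecode "b" is a signed char, i.e. one byte per value.
    return array("b", (round(s * SCALE) for s in clipped))


def decode_similarities(encoded):
    """Map int8 codes back to approximate similarities."""
    return [code / SCALE for code in encoded]


sims = [0.0, 0.25, 0.5, 1.0]
encoded = encode_similarities(sims)
decoded = decode_similarities(encoded)
```

Compared with 8-byte floats, this cuts storage eightfold at the cost of a quantization error of at most `0.5 / 127` per value.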
## Nearest neighbors
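As a sketch of what a `neig*` entry could hold, the snippet below keeps the `k` closest molecules for a query, using Euclidean distance on signature vectors. The distance metric, the brute-force search and the name `nearest_neighbors` are assumptions for illustration; with `k = 1000` and realistic dataset sizes, an indexed search would likely be preferable.

```python
import heapq
import math


def nearest_neighbors(query, signatures, k=1000):
    """Return the k (distance, molecule id) pairs closest to `query`.

    signatures: dict mapping molecule id -> numerical vector.
    """
    distances = (
        (math.dist(query, vector), mol) for mol, vector in signatures.items()
    )
    # nsmallest returns results sorted by increasing distance.
    return heapq.nsmallest(k, distances)


# Hypothetical two-dimensional signatures.
signatures = {
    "mol_a": [0.0, 0.0],
    "mol_b": [1.0, 0.0],
    "mol_c": [3.0, 4.0],
}
neighbors = nearest_neighbors([0.0, 0.0], signatures, k=2)
```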
Below is a schematic proposal of the classes:
```python
class
```
As you know, most of the CC data is stored as `HDF5` files. I think the `HDF5` format is a good choice and we should stick to it. However, these files should be accessible through dedicated classes, and these classes must not load all of the data into memory.
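The requirement that classes must not load all data into memory can be sketched as a thin wrapper that reads rows in chunks. The wrapper below only assumes a dataset-like object exposing `.shape` and slice indexing, which is the interface an `h5py.Dataset` offers; `LazySignatures`, `chunk_size` and the in-memory stand-in `ListDataset` are hypothetical names used for illustration, not a proposed CC API.

```python
class LazySignatures:
    """Row-wise access to a signature matrix without loading it whole."""

    def __init__(self, dataset, chunk_size=1000):
        # `dataset` is any object with `.shape` and slice indexing
        # (for example an open h5py.Dataset); rows are read on demand.
        self._dataset = dataset
        self.chunk_size = chunk_size

    def __len__(self):
        return self._dataset.shape[0]

    def __getitem__(self, index):
        # Delegates to the dataset, so only the requested rows are read.
        return self._dataset[index]

    def iter_chunks(self):
        """Yield consecutive blocks of at most `chunk_size` rows."""
        for start in range(0, len(self), self.chunk_size):
            yield self._dataset[start:start + self.chunk_size]


class ListDataset:
    """In-memory stand-in for an HDF5 dataset, used only for this demo."""

    def __init__(self, rows):
        self._rows = rows
        self.shape = (len(rows), len(rows[0]) if rows else 0)

    def __getitem__(self, index):
        return self._rows[index]


signatures = LazySignatures(ListDataset([[i, i + 1] for i in range(5)]), chunk_size=2)
chunk_sizes = [len(chunk) for chunk in signatures.iter_chunks()]
```

With `h5py`, the stand-in would be replaced by a dataset obtained from an open file, and each slice would be read from disk only when requested.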