Miquel Duran-Frigola · bdc3bf62
--- a/signaturization.md
+++ b/signaturization.md
 # Signaturization

-The main feature of the CC is the automatic processing of virtually any compound-related data into a homogeneous, standard-format data ready-to-be-used in machine learning tasks.
+The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format that is ready to be used by machine learning algorithms. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.

-To discuss with @mbertoni and @oguitart (I've put Martino as assignee, because I think he is the most experienced in python classes).
+The central data type is the signature (one numerical vector per molecule). Signatures come in four types:
+
+* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently processed version of the raw data, containing TF-IDF weightings if applicable. They usually show explicit knowledge, which enables connectivity and interpretation.
+* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, so that 90% of the variance is kept. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
+* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures. They have a fixed length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
+* `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset. :warning: These signatures are not calculated yet, and won't be in the near future.  
+
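As a rough illustration of the 90% variance criterion mentioned for `sign1`, the sketch below picks the smallest number of leading components whose cumulative variance reaches a target fraction. The function name and the example variances are hypothetical and are not taken from the CC code base. The actual pipeline may select components differently.

```python
from itertools import accumulate


def n_components_for_variance(variances, target=0.9):
    """Smallest number of leading components whose cumulative
    variance reaches `target` (a fraction of the total variance)."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive value")
    # Assumes components are sorted by decreasing variance, as PCA returns them.
    for k, cumulative in enumerate(accumulate(variances), start=1):
        if cumulative >= target * total:
            return k
    return len(variances)


# Hypothetical per-component variances, for illustration only.
example_variances = [6.0, 2.5, 1.0, 0.3, 0.2]
n_kept = n_components_for_variance(example_variances)
```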
+Besides, there are other (auxiliary) types of data that may be of interest. `*` denotes correspondence with signatures type `0`-`3`. 
+
+* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as lightweight `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`_obs`) or predicted (`_prd`) similarities. These vectors are [only applicable to exemplary datasets](production-phase). Currently, we only keep `sims1`.
+* `neig*` [Nearest-neighbors](#nearest-neighbors): Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
+* 
+
+Although I consider the numbering (0-3) conceptually closed, further auxiliary data types may be introduced in the future.
+
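To make the two auxiliary data types more concrete, here is a minimal sketch of how a similarity vector could be packed as `int8` and how nearest neighbors could be read from it. Everything in it is an assumption made for illustration: the choice of cosine similarity, the scaling to the range -127 to 127, and the function names are not taken from the CC.

```python
import heapq
import math
from array import array


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_vector_int8(query, signatures):
    """Similarities of `query` to every signature, scaled to [-127, 127]
    and stored with one byte per value (assumed scaling, for illustration)."""
    scaled = (round(127 * cosine(query, s)) for s in signatures)
    return array("b", (max(-127, min(127, x)) for x in scaled))


def nearest_neighbors(sims, k):
    """Indices of the k most similar molecules, most similar first.
    The query itself is included if it is part of the dataset."""
    return heapq.nlargest(k, range(len(sims)), key=sims.__getitem__)


# Toy signatures, for illustration only.
signatures = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
sims = similarity_vector_int8(signatures[0], signatures)
neighbors = nearest_neighbors(sims, k=2)
```

With the CC setting of 1000 neighbors, `k` would simply be 1000. A partial selection like `heapq.nlargest` avoids sorting the full similarity vector.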
+## Commonalities
+
+*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*
+
+## Signatures type 0
+
+## Signatures type 1
+
+## Signatures type 2
+
+## Signatures type 3
+
+## Similarity vectors
+
+## Nearest neighbors
+
+Below, I list a schematic proposal for the classes:
+
+
+```python
+class 
+
+```

 As you know, we have most of the CC data stored as `HDF5` files. I think `HDF5` format is good and we have to stick to this file format. However, I think that these files should be accessible through some classes. These classes must not load all data into memory.
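One possible shape for such a class is sketched below, using `h5py`. It is only an illustration of the "do not load everything into memory" requirement, not a proposal for the final design: the class name `SignatureFile` and the dataset name `"V"` are hypothetical.

```python
import os
import tempfile

import h5py


class SignatureFile:
    """Read-only view of one HDF5 file; rows are read from disk on demand."""

    def __init__(self, path, dataset="V"):
        # "V" is a hypothetical dataset name used only in this sketch.
        self.path = path
        self.dataset = dataset

    @property
    def shape(self):
        # The shape comes from the file metadata, so no vector data is read.
        with h5py.File(self.path, "r") as f:
            return f[self.dataset].shape

    def __getitem__(self, index):
        # h5py reads only the requested rows or slice from disk.
        with h5py.File(self.path, "r") as f:
            return f[self.dataset][index]


# Build a tiny throwaway file so the sketch can be run end to end.
path = os.path.join(tempfile.mkdtemp(), "sign1.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("V", data=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

sign = SignatureFile(path)
first_two = sign[:2]  # only these two rows are read from disk
```

Opening the file on every access keeps the object cheap to create and safe to pass around. A long-lived file handle would be faster for repeated reads, so this is a trade-off to discuss.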