Miquel Duran-Frigola · e20bb6db
--- a/signaturization.md
+++ b/signaturization.md
@@ -2,16 +2,16 @@

 The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format ready-to-be-used by machine learning algorithms. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.

-The central type of data are the signatures (one numerical vector per molecule), which are of three types:
+The central data type is the signature (one numerical vector per molecule), which comes in four types:

 * `sign0` [Signatures type 0](#signatures-type-0): A sufficiently-processed version of the raw data, containing TF-IDF weightings if applicable. They usually show explicit knowledge, which enables connectivity and interpretation.
-* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, so that 90% of the variance is kept. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
-* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures. They have fixed-length, which is convenient for machine learning, and capture both explicit and implicit similarity relationships in the data.
+* `sign1` [Signatures type 1](#signatures-type-1): A PCA/LSI-projected version of the data, retaining 90% of the variance. They keep most of the complexity of the original data and they can be used for *almost-exact* similarity calculations.
+* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. They have fixed length, which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
 * `sign3` [Signatures type 3](#signatures-type-3): Network embedding of observed *and* inferred similarity networks. Their added value, compared to signatures type 2, is that they can be derived for virtually *any* molecule in *any* dataset. :warning: These signatures are not calculated yet, and won't be in the near future.  

-Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`. 
+Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`. Typically, only the `1` version of each of the following (the one corresponding to signatures type 1) is kept in the repository.

-* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`_obs`) or predicted (`_prd`) similarities. These signatures are [only applicable to exemplary datasets](production-phase). Currently, we only keep `sims1`.
+* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase).
 * `neig*` [Nearest neighbors](#nearest-neighbors): XXXX. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
 * `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): XXXX
 * `clus*` [Clusters](#clusters): XXXX
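
The 90% variance criterion mentioned for `sign1` can be illustrated with a minimal sketch. This is not the CC implementation: the function name `components_for_variance` and the example variances are hypothetical, and the sketch assumes the per-component variances (e.g. PCA eigenvalues) have already been computed.

```python
def components_for_variance(variances, target=0.90):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `target`."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    # Components are ranked from most to least variance explained.
    ordered = sorted(variances, reverse=True)
    cumulative = 0.0
    for k, v in enumerate(ordered, start=1):
        cumulative += v
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical per-component variances (not taken from any CC dataset).
example = [6.0, 2.5, 1.0, 0.3, 0.2]
n_components = components_for_variance(example)
```

With these made-up values, the first two components cover 85% of the variance and the first three cover 95%, so three components are kept.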