Miquel Duran-Frigola · 831daa0b
--- a/signaturization.md
+++ b/signaturization.md
@@ -12,10 +12,10 @@ The central type of data are the signatures (one numerical vector per molecule),
 Besides, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`. Typically, we have the `1` version of the following in the repository.

 * `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase).
-* `neig*` [Nearest neighbors](#nearest-neighbors): XXXX. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
-* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): XXXX
-* `clus*` [Clusters](#clusters): XXXX
-* `proj*` [2D Projections](#2d-projections): XXX
+* `neig*` [Nearest neighbors](#nearest-neighbors): Result of a nearest-neighbor search. Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
+* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): Predicted nearest neighbors. :warning: These are not available and will not be in the near future.
+* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
+* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.


 I consider the numbering `0`-`3` to be conceptually closed. However, further auxiliary data types may be introduced in the future. Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
@@ -24,12 +24,22 @@ I consider the numbering `0`-`3` to be conceptually closed. However, further aux

 *The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*

-I suggest that
+Every CC data type will have the following methods:
+
+* `fit()`: Takes an input and transforms the data to an output.
+* `predict()`: Uses the fitted models to go from input to output.
+* `validate()`: Performs a validation across external data such as MoA and ATC codes.

 ## Signatures type 0

+* From: standard input
+* To: `sign0.h5`
+
 ## Signatures type 1

+* From: `sign0.h5`
+* To: `sign1.h5`
+
 ## Signatures type 2

 ## Signatures type 3
@@ -86,5 +96,3 @@ As you know, we have most of the CC data stored as `HDF5` files. I think `HDF5`
 * `validate`: Here we will not use AUROC and KS, but other statistics, depending on the case.

 Here I put a scheme of the first part of the Chemical Checker pipeline:
-
-![CC_backbone](/uploads/d117decadd2a47d1fcca824ebe892b05/CC_backbone.png)
\ No newline at end of file
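
Note on the change: the `fit()` / `predict()` / `validate()` contract added in the second hunk could be sketched roughly as below. This is a minimal illustration, not the Chemical Checker implementation. The class names `CCData` and `MeanCentered`, the name pattern, and the toy validation statistic are all assumptions made for the example. The only parts taken from the text are the three method names and the naming rule (a 4-character code followed by a digit `0`-`3`).

```python
import re
from abc import ABC, abstractmethod

# Naming rule from the document: 4-character code followed by a digit 0-3.
# Restricting the code to lowercase letters is an assumption of this sketch.
NAME_PATTERN = re.compile(r"^[a-z]{4}[0-3]$")


class CCData(ABC):
    """Hypothetical base class for a CC data type (name is illustrative)."""

    def __init__(self, name):
        if not NAME_PATTERN.match(name):
            raise ValueError(f"invalid data type name: {name!r}")
        self.name = name

    @abstractmethod
    def fit(self, rows):
        """Learn from the input and return the transformed output."""

    @abstractmethod
    def predict(self, rows):
        """Apply the already fitted model to new input."""

    @abstractmethod
    def validate(self, rows, labels):
        """Score the output against external labels."""


class MeanCentered(CCData):
    """Toy data type: centers each column on the mean seen during fit()."""

    def __init__(self, name):
        super().__init__(name)
        self.means = None

    def fit(self, rows):
        n = len(rows)
        self.means = [sum(col) / n for col in zip(*rows)]
        return self.predict(rows)

    def predict(self, rows):
        if self.means is None:
            raise RuntimeError("call fit() before predict()")
        return [[x - m for x, m in zip(row, self.means)] for row in rows]

    def validate(self, rows, labels):
        # Toy statistic (an assumption): fraction of rows whose nearest
        # other row carries the same external label. Needs >= 2 rows.
        out = self.predict(rows)
        hits = 0
        for i, row in enumerate(out):
            nearest = min(
                (j for j in range(len(out)) if j != i),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(row, out[j])),
            )
            hits += labels[i] == labels[nearest]
        return hits / len(out)
```

In this sketch `fit()` returns the transformed training data, so a call such as `MeanCentered("sign1").fit(rows)` mirrors the "From: `sign0.h5`, To: `sign1.h5`" step without touching any files. Reading and writing `HDF5` is deliberately left out.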