|
|
|
|
|
The main feature of the CC is the automatic conversion of virtually any compound-related data into a standard format ready-to-be-used by machine learning algorithms. All CC data is stored as `HDF5` files following a [folder structure](production-phase) organized by datasets.
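For illustration, such a file can be read directly with `h5py`; the path below is a placeholder and the dataset names (`V`, `keys`) are assumptions based on the attributes described later in this page:

```python
import h5py

# Placeholder path; the actual location follows the CC folder structure.
with h5py.File("path/to/dataset/sign1/sign1.h5", "r") as f:
    V = f["V"][:]        # numerical matrix, one row per molecule (assumed dataset name)
    keys = f["keys"][:]  # molecule identifiers, typically InChIKeys (assumed dataset name)

print(V.shape, keys[:5])
```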
|
|
|
|
|
|
|
|
|
The central type of data is the *signature* (one numerical vector per molecule). There are 4 types of signatures:
|
|
|
|
|
|
* `sign0` [Signatures type 0](#signatures-type-0): A sufficiently-processed version of the raw data. They usually correspond to explicit knowledge, which enables connectivity and interpretation.
|
|
|
|
|
|
|
|
|
|
|
|
* `sign1` [Signatures type 1](#signatures-type-1): A mildly compressed (usually latent) representation of the signatures, with a dimensionality that typically retains 90% of the original variance. They keep most of the complexity of the original data and they can be used for similarity calculations.
|
|
|
* `sign2` [Signatures type 2](#signatures-type-2): Network embedding of the similarity matrix derived from signatures type 1. These signatures have a fixed length (e.g. 128-d), which is convenient for machine learning, and capture both explicit *and* implicit similarity relationships in the data.
|
|
|
* `sign3` [Signatures type 3](#signatures-type-3): Fixed-length (e.g. 128-d) representation of the data, capturing *and* inferring the original (signature type 1) similarity of the data. Signatures type 3 are available for any molecule of interest and have a confidence measure assigned to them.
|
|
|
|
|
|
Besides signatures, there are other auxiliary types of data that may be of interest. The asterisk `*` denotes correspondence with signatures type `0`-`3`.
|
|
|
|
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): Nearest neighbors search result. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario.
|
|
|
* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
|
|
|
* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as light `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed `*_obs` or predicted `*_prd` similarities. These signatures are [only applicable to exemplary datasets](production-phase), and are mainly used for the [CC web resource](http://chemicalchecker.org).
|
|
|
|
|
|
Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
|
|
|
|
|
|
|
|
|
## Commonalities
|
|
|
*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*

Every CC signature must have:
|
|
|
* A matrix of data (`V`).
|
|
|
* Keys (typically InChIKeys) (`keys`).
|
|
|
* A `fit` method to be run at production time.
|
|
|
* A `predict` method to be run for out-of-production (new) data.
|
|
|
|
|
|
Every CC signature instance has the following methods:
|
|
|
|
|
|
|
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
|
* `validate()`: Performs a validation across external data such as MoA and ATC codes.
|
|
|
|
|
|
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset in memory.
|
|
|
|
|
|
* `keys`: The keys, **sorted alphabetically**.
|
|
|
* `__iter__()`: Batch iteration, if necessary.
|
|
|
* `__getattr__()`: Returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a `set` to do this).
|
|
|
|
|
|
It may also be useful to keep track of the folder where persistency models are stored (the sketch below illustrates the resulting interface):
|
|
|
|
|
|
* `PATH`: :thinking: Not sure whether it has to be an *absolute* path.
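As an illustration only, here is a minimal sketch of what such a signature container could look like. The class name is hypothetical, and key access is shown via `__getitem__` rather than `__getattr__`, since InChIKeys contain characters that are not valid in Python attribute names:

```python
import bisect

import numpy as np


class SignatureContainer:
    """Hypothetical sketch of the interface shared by CC signatures."""

    def __init__(self, V, keys, path):
        order = np.argsort(keys)
        self.keys = [keys[i] for i in order]  # sorted alphabetically
        self.V = np.asarray(V)[order]
        self._keyset = set(self.keys)         # fast membership checks
        self.PATH = path                      # folder where persistency models live

    def __iter__(self):
        # yield (key, vector) pairs; batching could be added for large datasets
        for key, row in zip(self.keys, self.V):
            yield key, row

    def __getitem__(self, key):
        if key not in self._keyset:
            return None                       # unknown keys return None
        idx = bisect.bisect_left(self.keys, key)
        return self.V[idx]
```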
|
|
|
|
|
|
### Signature commonalities
|
|
|
|
|
|
All signatures type 0-3 contain a numerical data matrix. It is also useful to have an idea of the background similarity distribution, given a certain distance metric (a sketch of this idea follows the list below):
|
|
|
|
|
|
* `V`: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
|
|
|
* `metric`: The distance used, typically `cosine`.
|
|
|
* `pvalues`: A `(dist, pval)` array.
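One possible way to obtain the `pvalues` array is to sample random molecule pairs and summarize the empirical background distribution of distances. This is only a sketch of that idea, not the actual CC procedure:

```python
import numpy as np
from scipy.spatial.distance import cosine


def background_pvalues(V, n_pairs=10000, n_bins=100, seed=42):
    """Empirical p-value: fraction of random pairs at distance <= dist."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(V), n_pairs)
    j = rng.integers(0, len(V), n_pairs)
    dists = np.array([cosine(V[a], V[b]) for a, b in zip(i, j) if a != b])
    grid = np.linspace(dists.min(), dists.max(), n_bins)
    pvals = np.array([(dists <= d).mean() for d in grid])
    return np.column_stack([grid, pvals])  # the (dist, pval) array
```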
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets, which can be summarized e.g. with an AUROC.
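A sketch of this validation, assuming a validation set of molecule pairs labeled as positive (e.g. sharing a MoA) or negative, and a signature object with dictionary-style access as sketched above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def validate_by_similarity(sign, pairs, labels):
    """pairs: list of (key_a, key_b); labels: 1 for positive pairs, 0 otherwise."""
    sims, kept = [], []
    for (a, b), y in zip(pairs, labels):
        va, vb = sign[a], sign[b]
        if va is None or vb is None:
            continue  # skip molecules without a signature
        cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        sims.append(cos)
        kept.append(y)
    return roc_auc_score(kept, sims)  # similar pairs should rank higher
```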
|
|
|
|
|
|
|
|
|
## Peculiarities

Below we detail the characteristics of each signature type and the algorithms behind them.
|
|
|
|
|
|
|
|
|
### Signatures type 0

These are the raw signatures that enter the CC pipeline. They are the result of processing a raw (but standard) input (please read about [connectivity](connectivity) for more information) and can be stored as sparse matrices. The input can be:
|
|
|
* _Sparse_: A `pairs` vector (e.g. molecule-target pairs), optionally with weights.
|
|
|
* _Dense_: An `X` matrix with `keys` (e.g. molecules) and `features` (e.g. cell lines).
|
|
|
|
|
|
Because they represent explicit data, features can be specified:

* `features`: For example, protein targets. Sorted alphabetically.

Signatures type 0 are minimally modified. We only apply the following procedures (a minimal sketch follows this list):
|
|
|
* _Imputation_: for _dense_ inputs, `NA` values are median-imputed.
|
|
|
* _Aggregation_: In case some keys are duplicated (for instance, at fit and predict time), the user can choose to keep the first instance, the last instance, or an average of the data.
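A minimal sketch of these two steps on a dense input, using pandas (the helper name and the aggregation labels are illustrative, not the CC API):

```python
import pandas as pd


def process_sign0_dense(X, keys, features, agg="average"):
    """Median-impute NA values and aggregate duplicated keys (sketch only)."""
    df = pd.DataFrame(X, index=keys, columns=features)
    df = df.fillna(df.median())                      # median imputation of NAs
    if agg == "first":
        df = df[~df.index.duplicated(keep="first")]
    elif agg == "last":
        df = df[~df.index.duplicated(keep="last")]
    else:                                            # "average" duplicated keys
        df = df.groupby(level=0).mean()
    return df.sort_index()                           # keys sorted alphabetically
```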
|
|
|
|
|
|
### Signatures type 1
|
|
|
|
|
|
These signatures are processed versions of the _experimental_ data available in the CC and can be used for similarity measures. They have variable dimensionality, depending on the CC space.

They are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining the variance. When fitting, we cut at 90% of the variance and we also identify the elbow of the variance-explained (scree) plot, which is stored as an attribute:
|
|
|
|
|
|
* `elbow`: Index of the dimension corresponding to the elbow point in the scree plot.
|
|
|
|
|
|
|
|
|
Conveniently, one can then do:
|
|
|
|
|
|
|
|
|
```python
|
|
|
# hypothetical signature type 1 instance: keep only the dimensions up to the elbow
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
|
|
|
```
|
|
|
The CC provides an automated pipeline for producing signatures type 1. We allow some flexibility and the user can choose to apply the following procedures (a sketch of the latent-representation step follows the list):

* _Latent representation_: For originally sparse matrices, this is TF-IDF Latent-semantic indexing (LSI). For dense matrices, this corresponds to a PCA (optionally, preceded by a robust median-based scaling). By default, we keep 90% of the variance.
|
|
|
* _Outlier removal_: Outlier keys (molecules) can be removed based on the isolation forest algorithm.
|
|
|
* _Metric learning_: A shallow siamese network can be trained to learn a latent representation of the space that has a good distance distribution. Metric learning requires similar/dissimilar cases (triplets); we offer two options:
|
|
|
  * _Unsupervised_: Triplets are drawn from the signature itself.
|
|
|
  * _Semi-supervised_: Triplets are drawn from other CC signatures. This helps integrate/contextualize the signature in question within the framework of the CC.
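As announced above, here is a sketch of the latent-representation step for a sparse input, using scikit-learn (TF-IDF followed by truncated SVD, keeping as many components as needed to reach 90% of the variance; parameter values are illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer


def lsi_projection(X_sparse, variance=0.9, max_components=500):
    """TF-IDF + LSI on a sparse keys-by-features matrix (sketch only)."""
    X_tfidf = TfidfTransformer().fit_transform(X_sparse)
    svd = TruncatedSVD(n_components=min(max_components, X_tfidf.shape[1] - 1))
    V = svd.fit_transform(X_tfidf)
    cum = np.cumsum(svd.explained_variance_ratio_)
    n_keep = int(np.searchsorted(cum, variance)) + 1  # smallest n reaching 90%
    return V[:, :n_keep]
```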
|
|
|
|
|
|
### Signatures type 2
|
|
|
|
|
|
These signatures are mostly used for internal machine-learning procedures, as they have a convenient fixed-length format.

Signatures type 2 are the result of a two-step process:
|
|
|
|
|
|
1. Load nearest-neighbor similarities type 1 as a graph.
|
|
|
2. Perform network embedding with `node2vec`.
|
|
|
|
|
|
|
|
|
It is not possible to produce network embeddings for out-of-sample (out-of-vocabulary) nodes, so a multi-output regression needs to be performed a posteriori (from signatures type 1 to signatures type 2) in order to enable `predict()` capabilities.
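A sketch of how such a mapping could be fitted; the choice of regressor and the dummy shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

# Dummy stand-ins: signatures type 1 (n x d1) and their network embeddings (n x 128).
X_sign1 = np.random.rand(1000, 300)
Y_sign2 = np.random.rand(1000, 128)

# Multi-output regression from signatures type 1 to signatures type 2 ...
regressor = MultiOutputRegressor(Ridge(alpha=1.0)).fit(X_sign1, Y_sign2)

# ... so that predict() also works for molecules absent from the similarity network.
x_new = np.random.rand(1, 300)
sign2_new = regressor.predict(x_new)
```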
|
|
|
|
|
|
|
|
|
### Signatures type 3
|
|
|
|
|
|
These signatures are fixed-length vectors available for _any_ molecule of interest. Thus, they are mostly *inferred* properties.

To learn signatures type 3:

* Triplets are sampled from type 1 similarities.

* Signatures type 2 across the CC are used as input for a deep siamese neural network. Thus, 25 fixed-length vectors are stacked.

* A signature-dropout (subsampling) procedure is applied to ensure that the data seen in the training set are _realistic_, meaning that the signature coverage resembles the coverage available for those molecules that do _not_ have data available for the CC space in question (a minimal sketch of this idea closes this section).

A confidence score is assigned to every signature, based on:

* _Applicability domain_, computed as the distance of the signature to the signatures in the training data.

* _Robustness_, determined by the variability of predictions when dropout is applied to the siamese network.

* A _prior_ confidence in light of the coverage of CC spaces available for the molecule.

:construction: Further, @mbertoni is trying to derive a regressor so that the `fit()` method becomes straightforward.
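A minimal numpy sketch of the signature-dropout idea mentioned above (array shapes and the masking policy are assumptions, not the actual CC implementation):

```python
import numpy as np


def signature_dropout(stacked, coverage_probs, rng=None):
    """Randomly mask whole CC spaces in a stacked (e.g. 25 x 128) input.

    stacked: signatures type 2 of one molecule, one row per CC space.
    coverage_probs: per-space probability of being available, mimicking the
                    coverage of molecules lacking data in the target space.
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(stacked.shape[0]) < coverage_probs  # spaces that survive
    masked = stacked.copy()
    masked[~keep] = 0.0                                    # drop whole spaces
    return masked
```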
|
|
|
|
|
|
### Similarity vectors
|
|
|
|
|
|
*You can skip this paragraph, really.*
|
|
|
|
|
|
These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one `sims1.h5` file per molecule and, inside each file, 25 separate vectors denoting a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding `sign1.h5` file. Similarities can be observed (`*_obs`) or predicted (`*_prd`). We keep these data for now because they are used by `targetmate` and by the [CC web app](http://chemicalchecker.org). Also, these signatures may be problematic during the 6-month update, which is why they are connected to the [PostgreSQL database](database).
|
|
|
|
|
|
### Nearest neighbors
|
|
|
|
|
|
A much better representation of the data is given by the k nearest neighbors, as calculated with [faiss](https://github.com/facebookresearch/faiss).
|
|
|
|
|
|
In this case, we have the following attributes:
|
|
|
|
|
|
* `indices`: An N·k matrix of integers denoting the indices of the neighbors.

* `D`: Correspondingly, an N·k matrix of real numbers indicating the distances.
|
|
|
|
|
|
To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k in practice (e.g. 1000).
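A sketch of the search with faiss, using inner product on L2-normalized vectors as a proxy for cosine similarity (array sizes and k are illustrative):

```python
import faiss
import numpy as np

V = np.random.rand(10000, 128).astype("float32")  # e.g. signatures type 1
faiss.normalize_L2(V)                             # inner product == cosine similarity

index = faiss.IndexFlatIP(V.shape[1])
index.add(V)

k = 1000
D, I = index.search(V, k)  # I: N x k neighbor indices; D: N x k similarities
```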
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
:construction: We haven't worked on it yet, but it should be easy. Perhaps truncated ROC curves will do.
|
|
|
|
|
|
### Predicted nearest neighbors
|
|
|
|
|
|
:construction: We haven't worked on it yet.
|
|
|
|
|
|
### Clusters
|
|
|
|
|
|
These are the results of a k-means clustering of an N·m matrix (typically signatures type 1). Accordingly, we store the centroids and the assignment of each sample to a centroid.
|
|
|
|
|
|
* `labels`: An N-vector indicating the index of the assigned centroid.
|
|
|
* `centroids`: A k·m matrix defining the centroids.
|
|
|
|
|
|
In this case, the `predict()` method simply seeks the closest centroid to the query.
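A sketch with scikit-learn, where `predict()` amounts to assigning the closest centroid (input data and k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

V = np.random.rand(5000, 128)                 # e.g. signatures type 1
kmeans = KMeans(n_clusters=100, n_init=10, random_state=42).fit(V)

labels = kmeans.labels_                       # N vector: centroid index per molecule
centroids = kmeans.cluster_centers_           # k x m matrix of centroids

queries = np.random.rand(10, 128)
new_labels = kmeans.predict(queries)          # closest centroid for each query
```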
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
A contingency table (Fisher's exact test) is built to check whether molecules belonging to the same cluster tend to be positive pairs in the validation sets.
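For instance, a 2×2 contingency table of same-cluster membership versus validation outcome could be assessed as follows (the helper and its inputs are hypothetical):

```python
import numpy as np
from scipy.stats import fisher_exact


def cluster_validation(labels_by_key, pairs, positives):
    """2x2 table: same cluster (yes/no) x positive pair (yes/no)."""
    table = np.zeros((2, 2), dtype=int)
    for (a, b), pos in zip(pairs, positives):
        same = labels_by_key[a] == labels_by_key[b]
        table[0 if same else 1, 0 if pos else 1] += 1
    odds, pval = fisher_exact(table, alternative="greater")
    return odds, pval
```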
|
|
|
|
|
|
### 2D projections
|
|
|
|
|
|
|
|
|
These are, actually, very similar to [signatures](#signature-commonalities), only that in this case `V` has 2 dimensions.
|
|
|
|
|
|
|
|
|
:construction: Together with @oguitart, we are figuring out how to obtain 2D projections that look good for a wide range of scenarios. Most probably, this will imply an intermediate clustering step.
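In the meantime, a plain t-SNE projection can be sketched as follows (input data and perplexity are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

V = np.random.rand(5000, 128)  # e.g. signatures type 2
proj = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(V)
# proj is an N x 2 matrix, i.e. the `V` of this data type
```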
|
|
|
|
|
|
|
|
|
|
|
\ No newline at end of file |