Miquel Duran-Frigola · 16863705
--- a/signaturization.md
+++ b/signaturization.md
@@ -47,6 +47,10 @@ All signatures type 0-3 contain a numerical data matrix. Also, I think it is int
 * `metric`: The distance used, typically `cosine`.
 * `pvalues`: A `(dist, pval)` array.
+##### Validation
+For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets.
 ## Peculiarities
 ### Signatures type 0
@@ -83,55 +87,44 @@ It is not possible to produce embedding for out-of-sample (out-of-vocabulary) no
 ### Similarity vectors
-These are, arguably, the mo
+These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one `sims1.h5` file per molecule, and inside each file there are 25 separate vectors holding a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding `sign1.h5` file. Similarities can be observed (`*_obs`) or predicted (`*_prd`). OK, this is confusing, but who cares, anyway. We keep these data for now because they are used by `targetmate` and by the [CC web app](http://chemicalchecker.org).
 ### Nearest neighbors
+A much better representation of the data is given by the k nearest neighbors, as calculated with [faiss](https://github.com/facebookresearch/faiss).
+In this case we have the following attributes:
+* `indices`: An N·k matrix of integers denoting the indices of the neighbors.
+* `D`: Correspondingly, an N·k matrix of real numbers indicating the distances.
+To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k (e.g. 1000).
+##### Validation
+:construction: We haven't worked on it yet, but it should be easy. Perhaps truncated ROC curves will do.
 ### Predicted nearest neighbors
+:construction: We haven't worked on it yet.
 ### Clusters
-### 2D projections
+These are the results of a k-means clustering of the N·m matrix (typically signature type 1). Therefore, we store the centroids and the assignment of each sample to its centroid.
-Below, I list an schematic proposal of the classes:
+* `labels`: An N vector indicating, for each sample, the index of its centroid.
+* `centroids`: A k·m matrix defining the centroids.
+In this case, the `predict()` method simply seeks the closest centroid to the query.
-```python
+##### Validation
-class 
-```
+A contingency table (Fisher's exact test) is used to check whether molecules belonging to the same cluster tend to be positive pairs in the validation sets.
+### 2D projections
+These are, actually, very similar to [signatures](#signature-commonalities), except that in this case `V` has 2 dimensions.
+:construction: Together with @oguitart, we are figuring out how to obtain 2D projections that look good for a wide range of scenarios. Most probably, this will imply an intermediate clustering step.
-As you know, we have most of the CC data stored as `HDF5` files. I think `HDF5` format is good and we have to stick to this file format. However, I think that these files should be accessible through some classes. These classes must not load all data into memory.
+:construction: Furthermore, @mbertoni is trying to derive a regressor so that the `fit()` method becomes straightforward.
\ No newline at end of file
-**Signature classes**
-* Applies to:
- * Signatures Type 0
- * Signatures Type 1
- * Signatures Type 2
- * Signatures Type 3
- * 2D projections
-* Every class must have **at least** the following attributes:
- * `V`: the values
- * `keys`: the keys, sorted alphabetically
- * `metric`: the distance used
- * `pvalues`: (distance, p-value) array
- * `PATH`: the path where everything is stored
-* Every class must have **at least** the following methods:
- * `__iter__`: smart, batch iteration, if necessary
- * `__getattr__`: 
- * `fit`: Not sure this is necessary... Maybe we can just do it as part of the pipeline.
- * `predict`: For the new samples, we should be able to produce the corresponding `V` vectors. This will be, by far, the most tricky part. One should access the `models` folder and use them correspondingly. To increase speed in this part, probably one should just predict for the ones that are not already in the reference. Sometimes, it will be necessary to learn a mapping functions, for instance via AdaNet; for example, in the case of Signature Type 2, as node2vec does not allow for out-of-sample mapping.
- * `validate`: I'm thinking of a folder where we have validation files (for now, MoA and ATC), and then automatically outputting AUROC and KS metrics, among others.
- * `background`: Not sure this is necessary... Just like fit, this has to be done only at initiation.
-**Other classes**
-* We have other data types, such as the nearest neighbors produced by Oriol and the clusters produced by myself.
-* These must also have **at least** the following methods:
- * `__iter__`
- * `__getattr__`
- * `predict`: As always, we want to be able to predict for new molecules using the models stored.
- * `validate`: Here we will not use AUROC and KS, but other statistics, depending on the case.
-Here I put a scheme of the first part of the Chemical Checker pipeline: