|
|
|
|
|
Every CC data type will have the following methods:
|
|
|
|
|
|
|
|
|
* `fit()`: Takes an input and learns to produce an output.
|
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
|
* `validate()`: Performs a validation across external data such as MoA and ATC codes.
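The three methods above could be formalized as a shared abstract base class, along these lines (a sketch only; the class name and signatures are illustrative, not the actual CC implementation):

```python
from abc import ABC, abstractmethod

class BaseSignature(ABC):
    """Hypothetical sketch of the interface common to all CC data classes."""

    @abstractmethod
    def fit(self, X):
        """Take an input and learn to produce an output."""

    @abstractmethod
    def predict(self, X):
        """Use the fitted models to go from input to output."""

    @abstractmethod
    def validate(self):
        """Validate against external data such as MoA and ATC codes."""
```

Concrete classes (signatures, clusters, nearest neighbors, …) would then subclass this and fill in the three methods.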
|
|
|
|
|
|
|
|
|
Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset in memory.
|
|
|
|
|
|
* `keys`: The keys, **sorted alphabetically**.
|
|
|
* `__iter__()`: Batch iteration, if necessary.
|
|
|
|
|
|
* `__getattr__()`: Returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a `set` to do this).
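A minimal in-memory sketch of this access pattern, assuming keys and vectors are held in sorted parallel lists (the class name and the `get()` helper are illustrative; the real class reads from HDF5 without loading the whole dataset):

```python
from bisect import bisect_left

class KeyedData:
    """Sketch: alphabetically sorted keys with fast, set-guarded lookup."""

    def __init__(self, keys, vectors):
        # Keep keys sorted alphabetically; a set gives O(1) membership tests.
        order = sorted(range(len(keys)), key=lambda i: keys[i])
        self.keys = [keys[i] for i in order]
        self._vectors = [vectors[i] for i in order]
        self._key_set = set(self.keys)

    def __iter__(self):
        # Iterate over (key, vector) pairs in key order.
        return iter(zip(self.keys, self._vectors))

    def get(self, key):
        # Return the vector for `key`, or None if it is absent.
        if key not in self._key_set:
            return None
        return self._vectors[bisect_left(self.keys, key)]
```

The set check avoids `bisect` silently returning a neighbor's position for a missing key.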
|
|
|
|
|
|
I think that it may be interesting to keep track of the folder where persistence models are stored:
|
|
|
|
|
|
|
|
|
### Signature commonalities
|
|
|
|
|
|
|
|
|
All signatures type 0-3 contain a numerical data matrix. Also, I think it is interesting to have an idea of the background similarity distribution, given a certain distance metric.
|
|
|
|
|
|
* `V`: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
|
|
|
* `metric`: The distance used, typically `cosine`.
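One simple way to estimate such a background distribution is to sample random molecule pairs from `V` and compute their pairwise distances (a sketch under that assumption; the function name and sampling scheme are illustrative):

```python
import numpy as np

def background_distances(V, n_pairs=10000, seed=42):
    """Estimate the background cosine-distance distribution of a signature
    matrix V by sampling random row pairs. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n = V.shape[0]
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    a, b = V[i], V[j]
    # Cosine distance = 1 - cosine similarity, computed row-wise.
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den
```

The resulting empirical distribution can then be used to convert a raw distance into a percentile or p-value.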
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
|
|
|
For CC signatures, validation is done by simply measuring (cosine) similarities. Similar molecules should be more likely to be positive pairs in the validation sets (e.g. AUROC).
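The AUROC here can be computed directly from the similarities and the positive/negative pair labels via the rank-sum (Mann–Whitney) identity; a numpy-only sketch (not the actual CC code, and ties in the scores are not averaged):

```python
import numpy as np

def auroc(similarities, labels):
    """AUROC of similarity scores against binary positive-pair labels."""
    s = np.asarray(similarities, dtype=float)
    y = np.asarray(labels, dtype=bool)
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    # Rank 1 = lowest similarity.
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    # Mann-Whitney U statistic, normalized to [0, 1].
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUROC of 1.0 means every positive pair is more similar than every negative pair; 0.5 is random.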
|
|
|
|
|
|
## Peculiarities
|
|
|
|
### Signatures type 0

These signatures are the result of processing a raw (but standard) input.
|
|
|
|
|
### Signatures type 1
|
|
|
|
|
|
|
|
|
These signatures are the result of a PCA-like projection, meaning that variables consecutively contribute to explaining the variance. In the fitting of these features, we cut at 90% of the variance, and we also identify the elbow of the variance-explained plot. I think that the elbow should be stored in this class:
|
|
|
|
|
|
* `elbow`: Index of the dimension corresponding to the elbow point in the scree plot.
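The 90% cut and the elbow could be derived from the per-component explained-variance ratios along these lines (a sketch; the elbow is estimated here with a simple farthest-point-from-the-chord heuristic on the scree curve, which may differ from the method actually used):

```python
import numpy as np

def variance_cutoffs(explained_variance_ratio, cut=0.90):
    """Return (number of components reaching `cut` cumulative variance,
    index of the elbow of the scree plot). Illustrative sketch."""
    ev = np.asarray(explained_variance_ratio, dtype=float)
    cum = np.cumsum(ev)
    cut_idx = int(np.searchsorted(cum, cut)) + 1  # components kept at 90%
    # Elbow: point of the scree curve farthest from the chord joining
    # its first and last points (constant denominator omitted for argmax).
    x = np.arange(len(ev))
    x0, y0, x1, y1 = x[0], ev[0], x[-1], ev[-1]
    d = np.abs((y1 - y0) * x - (x1 - x0) * ev + x1 * y0 - y1 * x0)
    elbow = int(np.argmax(d))
    return cut_idx, elbow
```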
|
|
|
|
|
|
|
|
|
Conveniently, one can then do:
|
|
|
|
|
|
```python
|
|
|
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
```

### Signatures type 2
|
|
|
|
|
Signatures type 2 are the result of a two-step process:
|
|
|
|
|
|
|
|
|
1. Load nearest-neighbor similarities type 1 as a graph.
|
|
|
2. Perform network embedding with `node2vec`.
|
|
|
|
|
|
|
|
|
It is not possible to produce network embeddings for out-of-sample (out-of-vocabulary) nodes, so a multi-output regression needs to be performed a posteriori (from signatures type 1 to signatures type 2) in order to endow `predict()` capabilities.
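As a sketch of that a-posteriori mapping, a linear multi-output least-squares regression from sign1-space to sign2-space would look as follows (the real pipeline may well use a richer regressor; the class name is illustrative):

```python
import numpy as np

class OutOfSamplePredictor:
    """Sketch: map signatures type 1 to signatures type 2 so that
    out-of-vocabulary molecules can still get a type-2 signature."""

    def fit(self, S1, S2):
        # Least-squares fit with a bias column; W maps sign1 to sign2 space.
        X = np.hstack([S1, np.ones((S1.shape[0], 1))])
        self.W, *_ = np.linalg.lstsq(X, S2, rcond=None)
        return self

    def predict(self, S1_new):
        X = np.hstack([S1_new, np.ones((S1_new.shape[0], 1))])
        return X @ self.W
```

`fit()` is trained on molecules that have both signatures; `predict()` then serves any molecule for which a type-1 signature exists.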
|
|
|
|
|
|
### Signatures type 3
|
|
|
|
... | ... | @@ -87,18 +87,20 @@ It is not possible to produce embedding for out-of-sample (out-of-vocabulary) no |
|
|
|
|
|
### Similarity vectors
|
|
|
|
|
|
|
|
|
*You can skip this paragraph, really.*
|
|
|
|
|
|
These are, arguably, the weirdest of the CC data types and I tend to dislike them (I hope we can get rid of them in the future). Currently, there is one `sims1.h5` file per molecule, and inside each of the files there are 25 separate vectors denoting a binned version of the similarity (p-value) between the molecule in question and the rest of the molecules, as listed in the corresponding `sign1.h5` file. Similarities can be observed (`*_obs`) or predicted (`*_prd`). OK, this is confusing, but who cares. We keep these data for now because they are used by `targetmate` and by the [CC web app](http://chemicalchecker.org). Also, these signatures may be problematic during the 6-month update, which is why they are connected to the [PostgreSQL database](database).
|
|
|
|
|
|
### Nearest neighbors
|
|
|
|
|
|
A much better representation of the data is given by the k nearest neighbors, as calculated with [faiss](https://github.com/facebookresearch/faiss).
|
|
|
|
|
|
|
|
|
In this case, we have the following attributes:
|
|
|
|
|
|
* `indices`: An N·k matrix of integers denoting the indices of the neighbors.
|
|
|
|
|
|
* `D`: Correspondingly, an N·k matrix of real numbers indicating the distances.
|
|
|
|
|
|
|
|
|
To stay on the safe side and not miss any relevant similarity, we use an unrealistically high k in practice (e.g. 1000).
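For real data the search is done with faiss, but a brute-force numpy sketch makes clear what `indices` and `D` contain (Euclidean distance assumed here for simplicity):

```python
import numpy as np

def nearest_neighbors(V, k=5):
    """Brute-force k nearest neighbors, returning `indices` (N·k integers)
    and `D` (N·k distances), i.e. the shapes a faiss search yields.
    Sketch only; at CC scale use faiss instead."""
    # Pairwise squared Euclidean distances via the expansion trick.
    sq = (V ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * V @ V.T
    np.maximum(d2, 0, out=d2)  # guard against tiny negative round-off
    indices = np.argsort(d2, axis=1)[:, :k]
    D = np.sqrt(np.take_along_axis(d2, indices, axis=1))
    return indices, D
```

Note that each molecule's own index appears first in its row (distance 0), which is the usual faiss behavior when querying the indexed set against itself.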
|
|
|
|
|
|
##### Validation
|
|
|
|
|
|
|
|
|
### Clusters
|
|
|
|
|
|
|
|
|
These are the results of a k-means clustering of an N·m matrix (typically signature type 1). Therefore, we store the centroids and the assignment of each sample to a centroid.
|
|
|
|
|
|
|
|
|
* `labels`: An N vector indicating, for each sample, the index of its centroid.
|
|
|
* `centroids`: A k·m matrix defining the centroids.
|
|
|
|
|
|
In this case, the `predict()` method simply seeks the closest centroid to the query.
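That closest-centroid lookup is a one-liner in numpy (a sketch; the function name is illustrative):

```python
import numpy as np

def predict_cluster(centroids, query):
    """Sketch of the clusters' predict(): return the index of the centroid
    closest (Euclidean) to each query vector."""
    q = np.atleast_2d(query)
    # Distance from every query row to every centroid row.
    d = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```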
|
|
|
|