|
|
|
|
|
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. **Similarity** can be defined between **pairs of molecules** in any of the CC [datasets](datasets).
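As an illustration of the similarity principle (not the CC implementation, which is dataset-specific), similarity between a pair of molecules can be computed as a Tanimoto coefficient over binary fingerprints. The bit positions below are made up:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as the set of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy fingerprints (hypothetical bit positions, not real CC data)
mol_1 = {3, 17, 42, 101}
mol_2 = {3, 42, 99, 101}

print(tanimoto(mol_1, mol_2))  # intersection=3, union=5 -> 0.6
```

The same idea extends to any CC dataset once molecules are represented as feature sets or vectors; only the distance function changes.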
|
|
|
|
|
|
When it comes to **comparing molecules to other biological entities** (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of **connectivity**. A classical view of connectivity is that of molecules that *mimic* the transcriptional profile of a shRNA experiment, or molecules that *revert* the transcriptional profile of a disease state.
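A minimal sketch of the mimic/revert idea (illustrative only, not the CC connectivity metric): correlate a molecule's transcriptional signature with a reference profile; a strongly positive score suggests mimicry, a strongly negative one reversion. The expression values below are invented:

```python
import math

def connectivity_score(sig_a, sig_b):
    """Cosine similarity between two expression signatures.
    > 0 suggests sig_a mimics sig_b; < 0 suggests it reverts it."""
    dot = sum(a * b for a, b in zip(sig_a, sig_b))
    norm = math.sqrt(sum(a * a for a in sig_a)) * math.sqrt(sum(b * b for b in sig_b))
    return dot / norm if norm else 0.0

disease = [1.2, -0.8, 0.5, -1.5]  # toy up/down-regulation values per gene
drug    = [-1.0, 0.9, -0.4, 1.3]  # roughly opposite profile

print(connectivity_score(drug, disease) < 0)  # True: the drug "reverts" the disease
```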
|
|
|
|
|
|
These are some ways similarity and connectivity can be applied in the CC:
|
|
|
|
|
|
|
|
|
|
## Easy calculation of similarity and connectivity
|
|
|
|
|
|
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the [pre-processing repository structure](datasets#dataset-pre-processing).
|
|
|
|
|
|
Every dataset has one (or more) pre-processing script(s), always consisting of two consecutive steps:
|
|
|
|
|
|
1. Data gathering (and conversion to a standard input file).

* It is very important that data are *minimally* transformed here.

* In the production phase (i.e. when *building* the dataset), data are gathered from downloads or from calculated molecular properties.

* In the mapping phase (i.e. when including *external* molecules or biological entities), data are parsed from a file provided by the user or fetched from calculated molecular properties, if these are available for the compounds of interest.
|
|
2. Standard input to Signature Type 0.

* The outcome of step 1 is some sort of [standard input](#standard-input-files) for step 2.

* When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method. Accordingly, a `predict()` method must be available.

* The output of this step is a Signature Type 0.

* Acceptable standard inputs include: `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.
|
|
* The complexity of this step can vary dramatically:
|
|
|
|
- *Very simple:* The case of 2D fingerprints, where we simply take the molecular properties corresponding to the InChIKey provided. Likewise, the case of indications, where we read drug-disease pairs and map them.
|
|
|
|
- *Simple:* The case of binding data, where on some occasions we map target classes to the binding data.
|
|
|
|
- *Not so simple:* The case of pathways, where we map targets to human orthologs, and then these to pathway annotations. In this case, the input may be of two types (i.e. targets or the pathways themselves).
|
|
|
|
- *Complex:* The case of interactomes, where we map targets to human orthologs and these to several networks using HotNet. Here again, the input may be of two types (i.e. targets or the neighbors themselves).
|
|
|
|
- *Very complex:* The case of LINCS transcriptomics data (`D1.001`), where, starting from signatures of interest, we compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary, and filter the outcome accordingly.
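The GSEA-like comparison mentioned for `D1.001` can be sketched as a running-sum enrichment score. This is an unweighted toy version, not the production metric; gene names and sets are invented:

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-like enrichment score: walk down the ranked
    list, stepping up on gene-set hits and down on misses, and return
    the maximum deviation of the running sum from zero."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    up, down = 1.0 / n_hit, 1.0 / n_miss
    running, best = 0.0, 0.0
    for hit in hits:
        running += up if hit else -down
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # genes sorted by differential expression
up_set = {"g1", "g2", "g5"}                    # toy gene set concentrated at the top

print(enrichment_score(ranked, up_set) > 0)  # True: the set is enriched near the top
```

A set concentrated at the bottom of the ranking would yield a negative score, which is what the *revert* direction of connectivity looks for.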
|
|
|
|
|
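As an example of a standard input, a `.gmt` file (one gene set per line: set name, description, then member identifiers, tab-separated) can be parsed in a few lines. The pathway names and genes below are invented for illustration:

```python
import io

def read_gmt(handle):
    """Parse a .gmt file: each line is
    <set name>\t<description>\t<member 1>\t<member 2>...
    Returns {set name: set of members}."""
    sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        name, _description, *members = fields
        sets[name] = set(members)
    return sets

# Toy file contents (made-up pathway names and genes)
toy = io.StringIO("PATHWAY_A\tdesc\tTP53\tEGFR\nPATHWAY_B\tdesc\tBRCA1\n")
gene_sets = read_gmt(toy)
print(sorted(gene_sets["PATHWAY_A"]))  # ['EGFR', 'TP53']
```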
|
|
|
In practice:
|
|
|
|
* In the production phase, all of the procedures above (steps 1 & 2) are wrapped in a `fit()` method of `sign0`.




* In the mapping phase, step 2 is wrapped in a `predict()` method of `sign0`.
|
|
|
|
* Every dataset must have a `README` file.
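The fit/predict contract above can be sketched as follows. The class body is a hypothetical illustration of the interface, not the actual `sign0` implementation:

```python
class Sign0Sketch:
    """Illustrative fit/predict interface for producing a Signature
    Type 0. Hypothetical: the real `sign0` class differs."""

    def __init__(self):
        self.features = None  # feature universe, learned at fit time

    def fit(self, standard_input):
        """Production phase: learn the feature universe from the full
        dataset, then signaturize every molecule in it."""
        self.features = sorted({f for feats in standard_input.values() for f in feats})
        return self.predict(standard_input)

    def predict(self, standard_input):
        """Mapping phase: project (possibly external) molecules onto
        the feature universe fixed at fit time."""
        return {
            key: [1 if f in feats else 0 for f in self.features]
            for key, feats in standard_input.items()
        }

# Toy usage with made-up molecule keys and features
s0 = Sign0Sketch()
s0.fit({"MOL1": {"featA", "featB"}, "MOL2": {"featB"}})
print(s0.predict({"MOL3": {"featA"}}))  # {'MOL3': [1, 0]}
```

The key property is that `predict()` never re-derives the feature space: external molecules are expressed in the universe fixed during `fit()`, which is what makes connectivity queries against a frozen dataset possible.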
|
|
|
|
|
|
It is of the utmost importance that step 2 is endowed with a `predict()` method. Having the ability to convert any standard input to a signature type 0 (in an automated manner) will enable implementation of [connectivity methods](connectivity). This is a critical feature of the CC and I anticipate that most of our efforts will be put in this particular step.
|
|
|
|
|