Every dataset has one (or more) pre-processing script(s), always consisting of two steps. The complexity of these scripts varies from one dataset to another:
|
- *Simple:* The case of binding data where, in some occasions, we map target classes to the binding data.
|
|
- *Not so simple:* The case of pathways, where we map targets to human orthologs, and then these to pathway annotations. In this case, the input may be of two types (i.e. targets or the pathways themselves).
|
|
- *Complex:* The case of interactomes, where we map targets to human orthologs and these to several networks using HotNet. Here again, in this case the input may be of two types (i.e. targets or the neighbors themselves).
|
|
- *Very complex:* The case of LINCS transcriptomics data (`D1.001`), where we start from signatures of interest, compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary, and filter the outcome accordingly.
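The ortholog-mapping cases above (pathways, interactomes) boil down to composing two lookups: target → human ortholog → annotation. A minimal sketch with toy dictionaries standing in for the real mapping tables (all identifiers and the function name are hypothetical, for illustration only):

```python
# Sketch of the two-hop mapping used by the pathway-like datasets:
# target -> human ortholog -> pathway annotations.
# The dictionaries below are toy stand-ins for the real mapping tables.

orthologs = {"P0A7B8": "P28074"}  # non-human target -> human ortholog
pathways = {"P28074": ["Proteasome", "NF-kB signaling"]}  # ortholog -> pathways


def map_target_to_pathways(target):
    """Compose the two mappings; return [] if any hop is missing."""
    human = orthologs.get(target, target)  # human targets map to themselves
    return pathways.get(human, [])


# Entry point 1: a target identifier.
print(map_target_to_pathways("P0A7B8"))  # ['Proteasome', 'NF-kB signaling']
# Entry point 2: a pathway term could be entered directly, skipping the lookup.
```

This also shows why such datasets naturally admit more than one entry point: the pathway terms themselves can be supplied directly, bypassing the ortholog hop.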
|
|
|
|
|
|
In practice:
|
|
* In the production phase, all of the procedures above (steps 1 & 2) are wrapped in a `fit()` method of `sign0`.
|
|
* In the mapping phase, step 2 is wrapped in a `predict()` method of `sign0`.
|
|
* Every dataset must have a `README` file.
|
|
* The method can have more than one entry point, i.e. multiple input types. For example, in the biological processes dataset we may enter the targets or the biological process terms directly.
|
|
|
|
* Inputs must be in a [standard input format](#standard-input-files).
|
|
It is of the utmost importance that step 2 is endowed with a `predict()` method. Having the ability to convert any standard input to a signature type 0 (in an automated manner) will enable the implementation of [connectivity methods](connectivity). This is a critical feature of the CC, and I anticipate that most of our efforts will be put into this particular step.
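The fit/predict split described above can be sketched as follows. This is a hypothetical, minimal stand-in (class and attribute names are illustrative, not the actual `sign0` API): `fit()` runs the pre-processing and learns the feature vocabulary, while `predict()` applies only step 2, projecting any standard input onto that vocabulary to produce a type-0 signature matrix.

```python
class Sign0Sketch:
    """Illustrative sketch of the fit/predict split (not the real sign0 API)."""

    def fit(self, pairs):
        # pairs: iterable of (key, feature) tuples from a standard input file.
        # Learn the feature vocabulary and fix the column order.
        self.features = sorted({f for _, f in pairs})
        self.index = {f: i for i, f in enumerate(self.features)}
        return self

    def predict(self, pairs):
        # Project new standard input onto the learned feature space.
        keys = sorted({k for k, _ in pairs})
        rows = {k: i for i, k in enumerate(keys)}
        X = [[0.0] * len(self.features) for _ in keys]
        for k, f in pairs:
            if f in self.index:  # features unseen at fit time are dropped
                X[rows[k]][self.index[f]] = 1.0
        return keys, X


s = Sign0Sketch().fit([("mol1", "P1"), ("mol1", "P2"), ("mol2", "P2")])
keys, X = s.predict([("molX", "P2"), ("molX", "P9")])
print(keys, X)  # ['molX'] [[0.0, 1.0]]
```

In production, `fit()` would run once per dataset; `predict()` is what connectivity methods would then call on any new standard input.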
|
|
|
|
|
|
More precisely, connectivity starts with **standard input files** and finishes with a **signature type 0**.
|
|
|
|
|
|
|
|
In some datasets, this procedure may be of considerable complexity and we need to conceive workflows that can be wrapped into a `predict()` method. For other datasets, the procedure will be almost trivial.
|
|
|
|
|
|
|
|
Another important matter here is the distance metric. The CC works with *common* distance metrics, such as the `cosine` or `euclidean` distances. Sometimes, connectivity may require other types of metrics (e.g. GSEA-like, overlap, etc.). We might consider training siamese networks that transform the original distances into the more standard ones. This is an unexplored avenue, though.
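For illustration, here is the difference between a standard vector metric and a set-overlap one, in a pure-Python sketch (the actual CC implementations may differ):

```python
import math


def cosine_distance(a, b):
    """Standard cosine distance between two dense signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)


def overlap_distance(set_a, set_b):
    """Overlap coefficient turned into a distance: 1 - |A∩B| / min(|A|, |B|)."""
    inter = len(set_a & set_b)
    return 1.0 - inter / min(len(set_a), len(set_b))


print(cosine_distance([1, 0], [0, 1]))         # 1.0 (orthogonal vectors)
print(overlap_distance({"g1", "g2"}, {"g2"}))  # 0.0 (one set contains the other)
```

A siamese network, in this picture, would learn an embedding in which a plain `cosine_distance` reproduces the rankings given by the non-standard metric.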
|
|
|
|
|
|
|
|
## Standard input files
|
|
|
|
|
|
|Type|Format|Description|
|---|---|---|
|InChIKeys|`TSV`|A one-column file containing InChIKeys. This will fetch the corresponding molecular properties from the CC database.|
|Key-feature pairs|`TSV`|A two-column file containing keys (first column) and features (second column). Features can be, for example, protein identifiers. Optionally, a third column can be included to specify the *weight* of the key-feature annotation.|
|Key profiles|`TSV`|A multiple-column file containing keys (first column) and features (second column onwards). These can be, for example, NCI-60 profiles, or chemical-genetic interaction profiles. If a header is not included, the order of the columns should match the one used internally in the CC.|
|Feature sets|`GMT`|A [GMT](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GMT)-like file, typically used for gene sets. **First column:** sample (signature) identifier. **Second column:** agent (perturbagen, molecule, etc.) identifier; if empty, assume it is the same as the first column (this is used in case aggregation is necessary downstream). **Third column:** up features (genes); may be `NULL`. **Fourth column:** down features (genes); if empty, assume that the gene set has no direction and take only the third column; may be `NULL`.|
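A minimal parser for the GMT-like feature-set rows described above can illustrate the column conventions. This is a sketch, not the actual CC loader; it assumes tab-separated columns with features comma-separated within a column, which the table above does not fully specify:

```python
def parse_gmt_like_line(line):
    """Parse one GMT-like row into (sample, agent, up, down).

    Sketch only: assumes tab-separated columns and comma-separated
    features within a column. Empty agent defaults to the sample id;
    empty or NULL feature columns become None.
    """
    cols = line.rstrip("\n").split("\t")
    sample = cols[0]
    agent = cols[1] if len(cols) > 1 and cols[1] else sample

    def genes(i):
        if i >= len(cols) or cols[i] in ("", "NULL"):
            return None
        return cols[i].split(",")

    up, down = genes(2), genes(3)
    return sample, agent, up, down


print(parse_gmt_like_line("sig1\t\tg1,g2\tNULL"))
# ('sig1', 'sig1', ['g1', 'g2'], None)
```

A `down` of `None` signals an undirected gene set, so downstream code would use only the `up` column, as stated in the table.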
|
## Documentation
|
|
|
|
|
|
|
|
Every pre-processing script needs to have a `README` file.
|
|
|
|
|
|
|
|
## Chemistry
|
|
|
|
|
|
|
|
|
|
|
|
## Targets
|
|
|
|
|
|
|
|
|
|
|
|
## Networks
|
|
|
|
|
|
|
|
|
|
|
|
## Cells
|
|
|
|
|
|
|
|
|
|
|
|
## Diseases
|
|
|
|
|