## Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. those that lead to Signature Type 0) are within the [pre-processing repository structure](datasets#dataset-pre-processing).
Every dataset has a particular processing protocol, always consisting of two consecutive steps:
1. Fetching of data and conversion to a standard input file.
* It is very important that data are *minimally* transformed here.
   * Data may be fetched from the downloaded files, from calculated properties, or from a user-supplied file of interest.
2. Conversion from the standard input to a signature type 0.
* When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
* Accordingly, a `predict()` method must be available.
* Acceptable standard inputs include: `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.
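The two-step protocol above can be sketched as a `fit()`/`predict()` pair. This is a minimal illustration only: the class name, the triplet-style `.tsv` layout, and the dense-vector output are all assumptions, not the CC's actual pre-processing code.

```python
import csv
import io


class Sign0Preprocess:
    """Hypothetical step-2 processor: converts a standard input (here,
    a TSV of entity / feature / value triplets) into a signature type 0,
    i.e. one feature vector per entity."""

    def __init__(self):
        self.features = None  # feature vocabulary, learned during fit()

    def fit(self, tsv_text):
        """Learn the feature vocabulary from a reference standard input."""
        rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
        self.features = sorted({feat for _entity, feat, _value in rows})
        return self

    def predict(self, tsv_text):
        """Map any standard input onto the learned feature space,
        producing a signature (vector) per entity; features not seen
        during fit() are dropped."""
        assert self.features is not None, "call fit() first"
        index = {f: i for i, f in enumerate(self.features)}
        signatures = {}
        for entity, feat, value in csv.reader(io.StringIO(tsv_text),
                                              delimiter="\t"):
            vec = signatures.setdefault(entity, [0.0] * len(self.features))
            if feat in index:
                vec[index[feat]] = float(value)
        return signatures


# Usage: fit on a reference standard input, then predict on new data.
reference = "mol1\tGO:0008150\t1\nmol2\tGO:0003674\t0.5\n"
prep = Sign0Preprocess().fit(reference)
sign0 = prep.predict("mol3\tGO:0008150\t0.7\n")
```

The design point is that `fit()` freezes the feature space, so `predict()` can later embed any user input into the same space as the reference dataset.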
It is of the utmost importance that step 2 is endowed with a `predict()` method. The ability to convert any standard input to a signature type 0 in an automated manner is what enables the implementation of [connectivity methods](connectivity). This is a critical feature of the CC and I anticipate that most of our efforts will be devoted to this particular step.
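To illustrate why `predict()` unlocks connectivity: once any input can be turned into signature type 0 vectors, a connectivity query reduces to ranking dataset entities by similarity to the query signature. The cosine metric and function names below are illustrative assumptions, not the CC's actual connectivity methods.

```python
import math


def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den if den else 0.0


def connectivity(query_sign0, dataset_sign0):
    """Rank dataset entities by decreasing similarity to the query."""
    ranked = sorted(dataset_sign0.items(),
                    key=lambda kv: cosine(query_sign0, kv[1]),
                    reverse=True)
    return [entity for entity, _vec in ranked]


# A query signature is compared against pre-computed dataset signatures.
dataset = {"mol1": [1.0, 0.0], "mol2": [0.0, 1.0], "mol3": [0.7, 0.7]}
ranking = connectivity([1.0, 0.1], dataset)
```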
More precisely, connectivity starts with **standard input files** and finishes with a **signature type 0**.
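As an illustration of one such standard input, a `.gmt` file (tab-separated: set name, description, then member entities, following the conventional GMT layout) can be read into the entity/feature pairs that step 2 consumes. This is a hedged sketch; the CC's actual parser may differ.

```python
def read_gmt(text):
    """Parse .gmt content into {set name: set of member entities}.
    Each line is: name <TAB> description <TAB> member1 <TAB> member2 ..."""
    sets = {}
    for line in text.strip().splitlines():
        name, _description, *members = line.split("\t")
        sets[name] = set(members)
    return sets


gmt = "pathwayA\tdesc\tgene1\tgene2\npathwayB\tdesc\tgene2\tgene3\n"
parsed = read_gmt(gmt)
```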