Miquel Duran-Frigola · b7060607
--- a/similarity-and-connectivity.md
+++ b/similarity-and-connectivity.md
@@ -41,4 +41,26 @@ In practice:
 |InChIKeys|`TSV`|A one-column file containing InChIKeys. This will fetch the corresponding molecular properties from the CC database.|
 |Key-feature pairs|`TSV`|A two-column file containing keys (first column) and features (second column). Features can be, for example, protein identifiers. Optionally, a third column can be included to specify the *weight* of the key-feature annotation.|
 |Key profiles|`TSV`|A multiple-column file containing keys (first column) and features (second column onwards). These can be, for example, NCI-60 profiles, or chemical-genetic interaction profiles. If a header is not included, the order of the columns should match the one used in the CC internally.|
-|Feature sets|`GMT`|A [GMT](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GMT)-like file, typically used for gene sets. **First column:** Sample (signature) identifier. **Second column:** Agent (perturbagen, molecule, etc.) identifier. If empty, assume the same than first column. This is used in case it is necessary to aggregate downstream. **Third column:** Up features (genes). May be `NULL`. **Fourth column:** Down features (genes). If empty, assume that there is no direction in the gene set, and only take third column. May be `NULL`.|
\ No newline at end of file
+|Feature sets|`GMT`|A [GMT](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide#GMT)-like file, typically used for gene sets. **First column:** Sample (signature) identifier. **Second column:** Agent (perturbagen, molecule, etc.) identifier. If empty, it is assumed to be the same as the first column. This identifier is used when signatures need to be aggregated downstream. **Third column:** Up features (genes). Can be `NULL`. **Fourth column:** Down features (genes). If empty, the gene set is assumed to have no direction, and only the third column is used. Can be `NULL`.|
+
+We strongly recommend that, when designing the datasets, features be as explicit as possible. A good starting point is the set of metanodes defined in the Bioteque:
+
+|Node|Abbreviation|
+|---|---|
+|Assay|`ASY`|
+|Cell|`CLL`|
+|Chemical entity|`CHE`|
+|Compartment|`CMP`|
+|Domain|`DOM`|
+|Compound|`CPD`|
+|Gene/Protein|`GEN`|
+|Disease|`DIS`|
+|Molecular function|`MFN`|
+|Pathway/processes|`PWY`|
+|Protein class|`PCL`|
+|Perturbagen|`PGN`|
+|Symptom|`SYM`|
+|Tissue|`TIS`|
+|Pharmacologic class|`PHC`|
+
+The *vocabularies* used in the production phase and the mapping phase must match.