Miquel Duran-Frigola · 01c5072d
--- a/datasets.md
+++ b/datasets.md
@@ -22,7 +22,7 @@ In turn, each level is divided into 5 sublevels or **coordinates** representing

 |Coordinate|Name|Description|
 |---|---|---|
-|`A1`|2D fingerprints|Binary representation of the 2D structure of a molecule. The neighborhood of every atom is encoded using circular topology hashing.|
+|`A1`|2D fingerprints|Binary representation of the 2D structure of a molecule. The neighbourhood of every atom is encoded using circular topology hashing.|
 |`A2`|3D fingerprints|Similar to `A1`, the 3D structures of the three best conformers after energy minimization are hashed into a binary representation without the need for structural alignment.|
 |`A3`|Scaffolds|Largest molecular scaffold (usually a ring system) remaining after applying Murcko’s pruning rules. Additionally, we keep the corresponding framework, i.e. a version of the scaffold where all atoms are carbons and all bonds are single. The scaffold and the framework are encoded with path-based 1024-bit fingerprints, suitable for capturing substructures in similarity searches.|
 |`A4`|Structural keys|166 functional groups and substructures widely accepted by medicinal chemists (MACCS keys).|
@@ -62,14 +62,14 @@ This is how we define a dataset:
 |Name|2D fingerprints|Display, short-name of the dataset.|
 |Technical name|1024-bit Morgan fingerprints|A more technical name for the dataset, suitable for chemo-/bio-informaticians.|
 |Description|2D fingerprints are...|This field contains a long description of the dataset. It is important that the curator outlines here the importance of the dataset, why did he/she make the decision to include it, and what are the scenarios where this dataset may be useful.|
-|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingeprints or gene expression data do not contain unknowns.|
+|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns.|
 |Permanent|`True`/`False`|Are measurements for each entry permanent? 2D fingerprints, for example, are permanent. However, most of the biological data may change/evolve with the different versions of the CC. This field, in essence, dictates whether the dataset should be completely updated in every update of the CC, or whether new entries can be simply appended.|
 |Finished|`True`/`False`|Is the dataset considered to be finished? For example, datasets coming from supplementary data of scientific papers are immutable, and they consequently need no updates in posterior versions of the CC.|
 |Data type|`Discrete`/`Continuous`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
 |Predicted|`True`/`False`|Is the dataset a result of a prediction (by us or by others?). Prediction results are perfectly valid CC datasets, in principle.|
 |Connectivity|`True`/`False`|Is there a way to connect this dataset to other biological entities? We understand connectivity as a generalization of the cMap idea of matching gene expression signatures.|
 |Connectivity comments|Free text commenting on the connectivity strategy (e.g. type of distance)|This field needs to be self-explanatory.|
-|Keys|e.g. `CPD` (we use @afernandez `Bioteque` nomenclature). Can be `NULL`.|In the core CC database, most of the times this field will correspond to `CPD`, as the CC is centered on small molecules. It only makes sense to have keys of different types when we do connectivity attempts, that is, for example, when mapping disease gene expression signatures.|
+|Keys|e.g. `CPD` (we use @afernandez `Bioteque` nomenclature). Can be `NULL`.|In the core CC database, most of the time this field will correspond to `CPD`, as the CC is centred on small molecules. It only makes sense to have keys of different types when we do connectivity attempts, that is, for example, when mapping disease gene expression signatures.|
 |Number of keys|e.g. 800000|Number of samples in the dataset.|
 |Features|e.g. `GEN` (we use `Bioteque` nomenclature). Can be `NULL`.|When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use `Bioteque` nomenclature to define the type of biological data, it is not mandatory that the vocabularies are the ones used by the `Bioteque`; for example, I can use non-human UniProt ACs, if I deem it necessary.|
 |Number of features|e.g. 1000|Number of features in the dataset.|
@@ -82,15 +82,14 @@ See the [PostgreSQL database](database) for more information.

 ## Dataset processing

-Every dataset has a particular processing protocol. All processings have two and only two steps:
+Every dataset has a particular processing protocol, always consisting of two consecutive steps:

-1. From 
+1. Fetching of data and conversion to a standard input file.
+ * It is very important that data are *minimally* transformed here.
+ * Data may be fetched from downloaded files, from calculated properties, or from a user-supplied file.
 2. From standard input to signature type 0
- * Must have a `predict` method
-
-Acceptable standard inputs include:
-
-* `.gmt`
-* `.h5`
-
+ * When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
+ * Accordingly, a `predict()` method must be available.
+ * Acceptable standard inputs include: `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.

+It is of the utmost importance that step 2 provides a `predict()` method. The ability to convert any standard input to a signature type 0 in an automated manner will enable the implementation of [connectivity methods](connectivity). This is a critical feature of the CC, and I anticipate that most of our efforts will go into this step.
\ No newline at end of file
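
The `fit()`/`predict()` contract described for step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the CC implementation: the class name `Sign0Sketch`, the two-column `.tsv` layout (one `key<TAB>feature` pair per line), the reading of a signature type 0 as a binary key-by-feature matrix, and the identifiers in the usage lines are all hypothetical placeholders.

```python
import io


class Sign0Sketch:
    """Hypothetical processor: standard input -> signature type 0 (sketch only)."""

    def __init__(self):
        self.features = None  # feature vocabulary, fixed by fit()

    @staticmethod
    def _read_pairs(handle):
        # Assumed input layout: one "key<TAB>feature" pair per line.
        pairs = []
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            key, feature = line.split("\t")
            pairs.append((key, feature))
        return pairs

    def _transform(self, pairs):
        # Build one binary row per key over the fitted feature vocabulary.
        index = {f: i for i, f in enumerate(self.features)}
        rows = {}
        for key, feature in pairs:
            row = rows.setdefault(key, [0] * len(self.features))
            if feature in index:  # features unseen during fit() are ignored
                row[index[feature]] = 1
        return rows

    def fit(self, handle):
        # Adding/updating a dataset: learn the vocabulary, then transform.
        pairs = self._read_pairs(handle)
        self.features = sorted({f for _, f in pairs})
        return self._transform(pairs)

    def predict(self, handle):
        # New standard input: reuse the fitted vocabulary unchanged.
        if self.features is None:
            raise RuntimeError("fit() must be called before predict()")
        return self._transform(self._read_pairs(handle))


# Usage with placeholder identifiers
train = io.StringIO("CPD1\tP00533\nCPD1\tP04626\nCPD2\tP00533\n")
model = Sign0Sketch()
fitted = model.fit(train)

new = io.StringIO("CPD3\tP04626\nCPD3\tQ99999\n")
predicted = model.predict(new)
```

The design point this illustrates is that `predict()` reuses whatever `fit()` learned (here, only the feature vocabulary), so any new standard input maps into the same feature space as the fitted dataset, which is what a connectivity method would need.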