## Dataset pre-processing

:construction:

Dataset pre-processing refers to everything that happens from the downloaded/calculated/user-defined data until Signature Type 0 is obtained. Pre-processing can vary greatly in complexity.

Every dataset has a particular processing protocol, always consisting of two consecutive steps:

![FigureS1-01](/uploads/1dcea76fe57d4031fdfb68efc2afc317/FigureS1-01.png)
1. Fetching of data and conversion to a standard input file.

   Here is where most of the SB&NB research happens. For now, dataset pre-processing is organized in a rather independent structure, i.e. each dataset receives its own pre-processing scripts.

   * It is very important that data are *minimally* transformed here (see the sketch after this list).
   * Data may be fetched from the downloaded files, from calculated properties, or from a file of interest to the user.

2. From standard input to signature type 0.

   * When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
   * Accordingly, a `predict()` method must be available.
   * Acceptable standard inputs include `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.
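
As an illustration of step 1, a conversion from a downloaded raw file to a `.tsv` standard input could be as simple as the sketch below. The file names, the column layout and the `inchikey`/`feature`/`value` triplet format are assumptions for illustration, not part of the CC specification.

```python
import csv

RAW_FILE = "download/raw_data.csv"      # hypothetical downloaded file
STANDARD_INPUT = "preprocess/data.tsv"  # hypothetical standard input location

with open(RAW_FILE) as raw, open(STANDARD_INPUT, "w", newline="") as out:
    reader = csv.reader(raw)
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["inchikey", "feature", "value"])  # assumed header
    for molecule_id, feature, value in reader:
        # Minimal transformation: pass values through untouched and keep
        # identifiers as recognizable entities.
        writer.writerow([molecule_id, feature, value])
```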
|
|
|
|
|
|
|
|
It is of the utmost importance that step 2 is endowed with a `predict()` method. The ability to convert any standard input to a signature type 0 in an automated manner will enable the implementation of [connectivity methods](connectivity). This is a critical feature of the CC, and I anticipate that most of our efforts will go into this particular step.
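
As a sketch of how step 2 might be encapsulated, the class below learns a feature vocabulary in `fit()` and converts any `.tsv` standard input (with the hypothetical `inchikey`/`feature`/`value` layout used above) into dense signature type 0 vectors in `predict()`. The class name and the internal layout are assumptions, not the actual CC implementation.

```python
import csv
from collections import defaultdict


class Sign0Preprocessor:
    """Hypothetical encapsulation of step 2: standard input -> signature type 0."""

    def __init__(self):
        self.features = None  # feature vocabulary, learned at fit time

    def fit(self, standard_input):
        """Learn the feature vocabulary from a .tsv standard input."""
        feats = set()
        with open(standard_input) as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header row
            for key, feature, value in reader:
                feats.add(feature)
        self.features = sorted(feats)
        return self

    def predict(self, standard_input):
        """Map any .tsv standard input to the signature space learned in fit()."""
        index = {feat: i for i, feat in enumerate(self.features)}
        signatures = defaultdict(lambda: [0.0] * len(self.features))
        with open(standard_input) as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header row
            for key, feature, value in reader:
                if feature in index:  # features unseen at fit time are ignored
                    signatures[key][index[feature]] = float(value)
        return dict(signatures)
```

With such an interface, `fit()` runs once when a dataset is added or updated, and `predict()` can then convert any user-provided standard input to the same signature space automatically, which is what the connectivity methods need.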
|
There are two mandatory files:

* `run.py`
* `README.md`
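
As an orientation, a `run.py` might look like the skeleton below. Everything in it (paths, the command-line interface, the `Sign0Preprocessor` class from the sketch above and its module location) is a hypothetical layout, not a prescribed one.

```python
"""Hypothetical pre-processing entry point for one dataset."""
import argparse

from preprocess import Sign0Preprocessor  # assumed module layout


def main():
    parser = argparse.ArgumentParser(description="Dataset pre-processing")
    parser.add_argument("standard_input", help="path to the .tsv standard input")
    args = parser.parse_args()

    # In a real setting, fit() would run once when the dataset is added or
    # updated, and predict() would reuse the fitted state afterwards.
    pre = Sign0Preprocessor().fit(args.standard_input)
    signatures = pre.predict(args.standard_input)
    print("Produced %d signatures of type 0" % len(signatures))


if __name__ == "__main__":
    main()
```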