## Dataset pre-processing

:construction:

Dataset pre-processing refers to everything that happens from the downloaded/calculated/user-defined data until Signature Type 0 is obtained. Pre-processing can vary greatly in complexity.

Every dataset has a particular processing protocol, always consisting of two consecutive steps:

![FigureS1-01](/uploads/1dcea76fe57d4031fdfb68efc2afc317/FigureS1-01.png)
1. Fetching of data and conversion to a standard input file.

   Here is where most of the SB&NB research happens. For now, dataset pre-processing is organized in a rather independent structure, i.e. each dataset receives its own pre-processing scripts.

   * It is very important that data are *minimally* transformed here (see the sketch after this list).
   * Data may be fetched from the downloaded files, from calculated properties, or from a file of interest to the user.

2. From standard input to signature type 0.

   * When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
   * Accordingly, a `predict()` method must be available.
   * Acceptable standard inputs include `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.
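
As an illustration of step 1, a conversion from a downloaded raw file to a `.tsv` standard input could be as simple as the sketch below. The file names, the column layout and the `inchikey`/`feature`/`value` triplet format are assumptions for illustration, not part of the CC specification.

```python
import csv

RAW_FILE = "download/raw_data.csv"      # hypothetical downloaded file
STANDARD_INPUT = "preprocess/data.tsv"  # hypothetical standard input location

with open(RAW_FILE) as raw, open(STANDARD_INPUT, "w", newline="") as out:
    reader = csv.reader(raw)
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["inchikey", "feature", "value"])  # assumed header
    for molecule_id, feature, value in reader:
        # Minimal transformation: pass values through untouched and keep
        # identifiers as recognizable entities.
        writer.writerow([molecule_id, feature, value])
```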
|
|
|
|
|
|
|
|
It is of the utmost importance that step 2 is endowed with a `predict()` method. The ability to convert any standard input to a signature type 0 in an automated manner will enable the implementation of [connectivity methods](connectivity). This is a critical feature of the CC, and I anticipate that most of our efforts will go into this particular step.
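
As a sketch of how step 2 might be encapsulated, the class below learns a feature vocabulary in `fit()` and converts any `.tsv` standard input (with the hypothetical `inchikey`/`feature`/`value` layout used above) into dense signature type 0 vectors in `predict()`. The class name and the internal layout are assumptions, not the actual CC implementation.

```python
import csv
from collections import defaultdict


class Sign0Preprocessor:
    """Hypothetical encapsulation of step 2: standard input -> signature type 0."""

    def __init__(self):
        self.features = None  # feature vocabulary, learned at fit time

    def fit(self, standard_input):
        """Learn the feature vocabulary from a .tsv standard input."""
        feats = set()
        with open(standard_input) as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header row
            for key, feature, value in reader:
                feats.add(feature)
        self.features = sorted(feats)
        return self

    def predict(self, standard_input):
        """Map any .tsv standard input to the signature space learned in fit()."""
        index = {feat: i for i, feat in enumerate(self.features)}
        signatures = defaultdict(lambda: [0.0] * len(self.features))
        with open(standard_input) as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header row
            for key, feature, value in reader:
                if feature in index:  # features unseen at fit time are ignored
                    signatures[key][index[feature]] = float(value)
        return dict(signatures)
```

With such an interface, `fit()` runs once when a dataset is added or updated, and `predict()` can then convert any user-provided standard input to the same signature space automatically, which is what the connectivity methods need.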
|
There are two mandatory files:

* `run.py`
* `README.md`
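
As an orientation, a `run.py` might look like the skeleton below. Everything in it (paths, the command-line interface, the `Sign0Preprocessor` class from the sketch above and its module location) is a hypothetical layout, not a prescribed one.

```python
"""Hypothetical pre-processing entry point for one dataset."""
import argparse

from preprocess import Sign0Preprocessor  # assumed module layout


def main():
    parser = argparse.ArgumentParser(description="Dataset pre-processing")
    parser.add_argument("standard_input", help="path to the .tsv standard input")
    args = parser.parse_args()

    # In a real setting, fit() would run once when the dataset is added or
    # updated, and predict() would reuse the fitted state afterwards.
    pre = Sign0Preprocessor().fit(args.standard_input)
    signatures = pre.predict(args.standard_input)
    print("Produced %d signatures of type 0" % len(signatures))


if __name__ == "__main__":
    main()
```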