Accompanying the folder structure, there is a [PostgreSQL database](database).
|
|
|
|
|
## Pipeline

Pipeline scripts are used to produce CC signatures, models and analyses. These scripts are typically run in the `pac-one cluster` at IRB Barcelona. Below, we provide detailed explanations of the different pipeline modalities.
|
|
|
|
|
|
Broadly speaking, the CC pipeline has two modalities:
|
|
|
|
|
|
|
|
1. [Dataset addition](#dataset-addition)
   * [Updates every 6 months](#six-month-pipeline).
   * [Sporadic addition or updates of datasets](#sporadic-datasets).
2. [Mapping of new data](#new-data-mapping)
   * Mapping of data to a dataset.
   * Individual querying.
   * Connectivity.
|
|
|
|
|
|
### Dataset addition

The resource is fully updated **every 6 months**.

#### Six-month pipeline
|
|
|
|
|
|
![6-month-pipeline.svg](/uploads/04a4dcb2ad2956cf4accbb0562b7fd38/6-month-pipeline.svg)
|
|
|
|
|
|
|
|
The main selection parameters of the pipeline are the following (a configuration sketch follows the list):
|
|
|
|
|
|
* `levels`: all datasets in all coordinates of the given level.
* `coordinates`: all datasets in the given coordinate.
* `datasets`: `None`
|
|
|
|
|
|
|
|
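As an illustration only, these selection parameters could be captured in a small configuration object. The class and field defaults below are assumptions, not the actual CC code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PipelineSelection:
    """Illustrative container for the selection parameters above.

    This class is hypothetical and not part of the CC code base.
    """
    # Levels to compute, e.g. ["A"]; empty means all levels (A-E).
    levels: List[str] = field(default_factory=list)
    # Coordinates to compute, e.g. ["A1", "B4"]; empty means all (A1-E5).
    coordinates: List[str] = field(default_factory=list)
    # Explicit dataset codes, e.g. ["A1.001"]; None means: derive the
    # datasets from the levels/coordinates above.
    datasets: Optional[List[str]] = None

# A full 6-month update computes everything, so the defaults suffice:
full_update = PipelineSelection()
```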
|
|
|
|
|
|
|
|
#### Pipeline execution
|
|
|
|
|
|
|
|
The pipeline will typically be run in the `pac-one cluster`.
|
|
|
|
|
|
|
|
Once the reference is done, with all of the models fitted, one can run the pipeline for any dataset (including the `full` dataset).
|
|
|
|
|
|
|
|
##### Fit and produce the models
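
No CC-specific code is shown on this page, so the following is only a minimal sketch of what this step amounts to, assuming hypothetical `fit_model` and `save_model` helpers: the reference models are fitted once and persisted, so that prediction can later reuse them for any dataset (including `full`).

```python
from pathlib import Path

MODELS_DIR = Path("models")  # assumed location for persisted reference models

def fit_reference_models(reference_datasets, fit_model, save_model):
    """Fit one model per reference dataset and persist it.

    `fit_model` and `save_model` are stand-ins for whatever routines the
    CC pipeline actually uses; they are assumptions, not the real API.
    """
    MODELS_DIR.mkdir(exist_ok=True)
    for code in reference_datasets:      # e.g. "A1.001", "B4.002", ...
        model = fit_model(code)          # learn the model for this dataset
        save_model(model, MODELS_DIR / f"{code}.model")
```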
|
|
|
|
|
|
|
|
##### Predict for any dataset
|
|
|
|
|
|
|
|
The arguments should be, at least, the following (an illustrative parser is sketched after the list):
|
|
|
|
|
|
|
|
* `--datasets`: Datasets `A1.001`-`E5.999` to calculate. One can also specify the level `A`-`E` or the coordinate `A1`-`E5`. All are considered by default.
|
|
|
|
* `--matrices`: What matrices to keep (e.g. `sig0`).
|
|
|
|
* `--only_exemplar`: Calculate only exemplar datasets.
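
As an illustration, the arguments above could be parsed as follows. This is a sketch only; the defaults and the validation pattern are assumptions, and only the flags documented on this page are used:

```python
import argparse
import re

# A dataset code looks like "A1.001"; a coordinate is "A1"-"E5";
# a level is "A"-"E".
DATASET_SPEC = re.compile(r"^[A-E]([1-5](\.\d{3})?)?$")

parser = argparse.ArgumentParser(
    description="Illustrative parser for the prediction arguments.")
parser.add_argument("--datasets", nargs="+",
                    default=["A", "B", "C", "D", "E"],
                    help="Datasets (A1.001-E5.999), coordinates (A1-E5) "
                         "or levels (A-E); all are considered by default.")
parser.add_argument("--matrices", nargs="+", default=None,
                    help="Matrices to keep, e.g. sig0.")
parser.add_argument("--only_exemplar", action="store_true",
                    help="Calculate only exemplar datasets.")

args = parser.parse_args()
for spec in args.datasets:
    if not DATASET_SPEC.match(spec):
        parser.error(f"Unrecognized dataset specification: {spec}")
```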
|
|
|
|
|
|
|
|
### A linear view of the 6-month pipeline
|
|
|
|
|
|
|
|
Below I sequentially list the steps of the pipeline. This is a linear and qualitative view of the pipeline and does not necessarily correspond to the organization of scripts in the repository.
|
|
|
|
|
|
|
|
1. Download data.
   * We need a [SQL table](database#downloads) specifying, for each download file, at least whether it is **completed** or not. Files that are internal to the SB&NB are simply copied.
   * If the data are **completed**, *or* the data have not been updated since the last CC update, don't download; just copy/move from the previous CC version (see the sketch after this list).
   * After this step, all data *and* libraries should be stored on disk.
2. Read small molecule structures.
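
The copy-or-download decision in step 1 could look like the sketch below. The field and helper names are assumptions; the **completed** flag is the one stored in the [downloads table](database#downloads):

```python
import shutil
from pathlib import Path

def fetch_file(entry, previous_cc, current_cc, download):
    """Decide whether a file can be reused from the previous CC version.

    `entry` is assumed to be a row of the downloads table exposing
    `name`, `completed` and `updated_since_last_cc`; `download` is a
    stand-in for the actual download routine. All names are hypothetical.
    """
    src = Path(previous_cc) / entry["name"]
    dst = Path(current_cc) / entry["name"]
    # Completed files, or files unchanged since the last CC update,
    # are copied from the previous version instead of re-downloaded.
    if entry["completed"] or not entry["updated_since_last_cc"]:
        shutil.copy2(src, dst)
    else:
        download(entry, dst)
```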
|
Please beware that, for simplicity, here I have omitted processes that are relevant, for example:

* Updating the PubChem entries (name, synonyms, etc.).
* Filling up the table of targets to show.
|
|
|
|
|
|
|
|
#### Sporadic datasets
|
|
|
|
|
|
|
|
Here again, a linear view of the pipeline is given below. The explanation is more succinct, since many details were already given [above](#six-month-pipeline).
|
|
|
|
|
|
|
|
0. Write scripts to process the data (this cannot be automated, as it is specific to each dataset).
   * If the dataset is of a new type,
   * Also, fill in the [PostgreSQL table](#database) with the required fields.
|
|
|
|
1. Download or calculate accordingly.
2. Process.
|
|
|
|
|
|
|
|
|
|
### New data mapping