*Diagrams below are editable at this [draw.io link](https://drive.google.com/file/d/1PR0G2u03_-LB1jZvpIV02zt5YDfwXiJc/view?usp=sharing). Feel free to modify them.*

The central task of the CC is to produce signatures of different types, given compound-related data typically fetched from the public domain. Signature production is performed via a rather long and complex [pipeline](#pipeline). Broadly speaking, the pipeline has two modalities:

1. Addition or update of a dataset.
   * This happens for *all* of the CC every 6 months.

   * It can happen, sporadically, during the development of a research project. Please note that [exemplary datasets](datasets#exemplary-datasets) can only be modified at the 6-month update, as they are bound to the [CC web app](http://chemicalchecker.org).

2. Mapping (or, more generally, *connection*) of external data to an existing dataset.
   * Mapping means that a new compound is projected onto the signature space of a dataset of choice. For example, given a new molecule with a calculated 2D fingerprint, we might want to obtain `A1.001` signatures for it without actually appending the molecule to the dataset. Another example would be a new molecule for which we have determined a target in the laboratory, and we want to obtain the mode-of-action (`B1.001`) signatures of this new molecule so that we can compare it with the rest. A minimal sketch of this scenario is given after this list.

   * [Connectivity](connectivity) is the generalization of mapping to biological entities other than compounds. Some (but not all) datasets will have connectivity capabilities. For example, a disease gene expression signature can be compared to a compound gene expression signature (`D1.001`), or disease-related genes can be allocated on the protein-protein interaction network and compared to the network context of drug targets (`C5.001`).
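
To make the mapping modality concrete, below is a minimal sketch in Python. It is illustrative only: the `ChemicalChecker` entry point and the `get_signature`/`predict` calls are assumptions about the programmatic interface, not a confirmed API.

```python
# Illustrative sketch only: class and method names are assumptions, not the
# confirmed CC API. The idea: load the fitted signature model of an existing
# dataset and project an external molecule onto its signature space.
from chemicalchecker import ChemicalChecker  # assumed package entry point

cc = ChemicalChecker()  # assumed to point to the default CC repository

# Fetch the fitted type 1 signature model of the A1.001 (2D fingerprint) space.
sign1 = cc.get_signature("sign1", "full", "A1.001")

# Project a new molecule (keyed by InChIKey) without adding it to the dataset.
new_molecule = ["RZVAJINKPMORJF-UHFFFAOYSA-N"]  # paracetamol, as an example
projected = sign1.predict(new_molecule)  # hypothetical prediction call
print(projected)
```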
This is best explained with a diagram:

![cc_pipelines_general.svg](/uploads/37e5a5ce93ec5182eab138a8e5d187c5/cc_pipelines_general.svg)
* (a) **Six-month update**: For every downloaded [data](data) source and [chemical library](libraries), the pipeline standardizes the chemical structures and calculates molecular properties (see the sketch after this list). Then, the data-driven datasets can be indexed by InChIKey. The union of all molecules participating in data-driven datasets defines the bioactive *universe*, which is used to select the molecules of property-driven datasets. Afterwards, for each dataset, signatures of type 0 are derived upon dataset-specific processing. A reference collection is chosen, and all of the training to derive signatures of type 1 and 2 happens on the reference data. Then, signatures can be obtained for the full dataset. For the 25 exemplary datasets, full similarity vectors are calculated per compound, and data are compared pair-wise between datasets so that similarity inferences can be performed.

* (b) **New data-driven dataset**: To incorporate, sporadically, a new dataset, we first standardize the structures and index the dataset by InChIKey. Then, we process the data to end up with signatures of type 0. Just like in (a), training happens on a reference set of molecules, and the full collection is later obtained. New datasets

* (c) **New property dataset**: When a new molecule property (e.g. a new chemical fingerprint) is defined by the CC team, we derive the corresponding dataset for the bioactive universe. Then, as in (b), we process the data, fit models for a reference set and finally obtain signatures for the full data.

* (d) **Mapping of a new molecule collection**: In this case, we simply want to obtain the signature representation of *external* molecules onto an existing property-based dataset. For this, we calculate the molecule properties (e.g. 2D fingerprints), process them correspondingly and use the fitted models to produce the full signatures.

* (e) **Connectivity of external data**: Likewise, when new data are to be mapped onto existing datasets, we process the raw data accordingly (potentially by using [connectivity functionalities](#connectivity)) and derive signatures using the fitted models. If the samples of the new dataset *are* indeed molecules, these are standardized as usual.
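
To make the standardization and indexing step concrete, here is a minimal sketch of how a molecule could be standardized, indexed by InChIKey and given 2D fingerprint properties. It assumes RDKit as the chemistry toolkit; the actual pipeline code may differ.

```python
# Minimal sketch, assuming RDKit; the actual pipeline may use different tools.
from rdkit import Chem
from rdkit.Chem import AllChem

def standardize_and_index(smiles):
    """Parse a molecule, derive its InChIKey index and a 2D fingerprint."""
    mol = Chem.MolFromSmiles(smiles)  # parsing already normalizes aromaticity
    if mol is None:
        return None  # unparsable structure: skipped by the pipeline
    inchikey = Chem.MolToInchiKey(mol)  # the key used to index datasets
    # 2D (Morgan) fingerprint, a typical property for the A1 space.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return inchikey, fp

print(standardize_and_index("CC(=O)Nc1ccc(O)cc1"))  # paracetamol
```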
:warning: Unfortunately, we do not have a final means to produce [signatures type 3](signaturization#signatures-type-3) yet. Consequently, their production is not appropriately reflected in the pipelines presented herein.
## Folder structure
The CC resource is organized in three parts:

* [The data repository](#data-repository)
* [Production scripts](#pipeline-scripts)
* [Data access scripts](#data-access-scripts)
### Data repository
The default path for the CC repository is `/aloy/chemical_checker_repo/`. This corresponds to the data generated in the dataset addition/update mode of the pipeline. Data in this directory are organized as follows:

```
├── downloads
│   ├──...
├── README.md
```
#### External data
When running the pipeline in mapping/connectivity mode, signatures are stored in a user-specified directory `/path/to/my/directory/`:

```
├── output
│   ├──...
```
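
As an illustration of how the produced signatures could be consumed downstream, here is a minimal sketch that reads a signature file from such an output directory. It assumes signatures are written as HDF5 with a data matrix and a key list; the file name and the `V`/`keys` dataset names are assumptions, not a confirmed layout.

```python
# Minimal sketch, assuming HDF5 output; file and dataset names are assumed.
import h5py

with h5py.File("/path/to/my/directory/output/sign1.h5", "r") as f:
    keys = [k.decode() for k in f["keys"][:]]  # InChIKeys of the molecules
    V = f["V"][:]                              # signature matrix (mols x dims)

print(len(keys), V.shape)
```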
In these cases, one should be able to run the pipeline with specific options (only certain levels, coordinates or datasets; only certain signature types, etc.); a hypothetical invocation is sketched below. See the [pipeline execution](#pipeline-execution) section below for details.
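
The following sketch illustrates what such a restricted run could look like programmatically. It is hypothetical: the `Pipeline` class, module path and parameter names are placeholders, not the actual interface (see [pipeline execution](#pipeline-execution)).

```python
# Hypothetical sketch: class, module and parameter names are placeholders.
from cc_pipeline import Pipeline  # placeholder import

pipe = Pipeline(
    mode="mapping",                   # mapping/connectivity mode, not update
    output_dir="/path/to/my/directory/",
    coordinates=["A1"],               # restrict to certain coordinates...
    datasets=["A1.001"],              # ...or to specific datasets
    signatures=["sign0", "sign1"],    # only certain signature types
)
pipe.run()
```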
Please note that the `/aloy/scratch/` organization, which is internal to the pipeline, is not presented here.
Accompanying the folder structure, there is a [PostgreSQL database](database) that is crucial for metadata storage and the proper functioning of the [CC web app](http://chemicalchecker.org).
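
As a toy illustration of how such metadata could be queried, here is a hedged sketch. The connection parameters, the `datasets` table and its columns are assumptions for illustration only; the actual schema is documented on the [database](database) page.

```python
# Hedged sketch: connection details, table and column names are assumed.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cc", user="cc_ro")
with conn, conn.cursor() as cur:
    # Hypothetical metadata table listing dataset codes and descriptions.
    cur.execute("SELECT code, description FROM datasets WHERE exemplary")
    for code, description in cur.fetchall():
        print(code, description)
conn.close()
```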
## Pipeline
Pipeline scripts are used to produce CC signatures, models and analyses. These scripts are typically run on the `pac-one` cluster at IRB Barcelona.
The resource update will happen **every 6 months**.

![6-month-pipeline.svg](/uploads/04a4dcb2ad2956cf4accbb0562b7fd38/6-month-pipeline.svg)
#### Pipeline
* `levels`: all datasets in all coordinates of t
* `coordinates`: