... | @@ -6,7 +6,7 @@ The central task of the CC is to produce signatures of different types, given co |
* A new dataset is *added* or a dataset is *updated*.
* It will happen for all of the CC every six months.
* It can happen, sporadically, during the development of a research project. Please note that [exemplary datasets](datasets#exemplary-datasets) can only be added prior to the 6-month update, as they participate in downstream analyses.
|
|
* It can happen, sporadically, during the development of a research project. Please note that [exemplary datasets](datasets#exemplary-datasets) can only be added in the 6-month update.
|
|
* Samples are *mapped* or, more generally, *connected* to an existing dataset.
* Mapping means that a new compound is projected onto the signature space of a dataset of choice. For example, given a new molecule with a calculated 2D fingerprint, we might want to obtain `A1.001` signatures for it, without actually adding the molecule to the `A1.001`. Another example would be a new molecule for which we know a target that is not associated
* [Connectivity](connectivity) is the generalization of mapping to biological entities other than compounds. Some, but not all, of the datasets will have connectivity capabilities. For example, a disease gene expression signature can be compared to a compound gene expression signature (`D1.001`). Likewise, disease-related genes can be located on the protein-protein interaction network and compared to the network context of drug targets (`C5.001`).
|
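The mapping idea described above can be sketched as follows: a model is fitted once on the molecules of a dataset, and a new molecule is then projected with that fitted model without being added to the dataset. This is only an illustration under stated assumptions. `ReferenceModel` is a hypothetical name, and per-feature standardization is a deliberately simple stand-in for whatever model the CC actually fits.

```python
from statistics import mean, pstdev


class ReferenceModel:
    """Hypothetical stand-in for a model fitted on a dataset's molecules."""

    def fit(self, rows):
        # Learn per-feature statistics from the dataset's own molecules.
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant features to avoid division by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def project(self, rows):
        # Apply the fitted statistics; the model itself is not updated.
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Toy feature matrix standing in for the molecules already in a dataset.
dataset_features = [[0.0, 2.0], [2.0, 4.0]]
model = ReferenceModel().fit(dataset_features)

# A new molecule is mapped onto the fitted space without joining the dataset.
new_molecule = [[3.0, 3.0]]
mapped = model.project(new_molecule)
```

The point of the sketch is the separation between fitting and projecting: the dataset stays unchanged, and only the fitted parameters are reused for the external molecule.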
... | @@ -15,16 +15,11 @@ This is best explained with a diagram: |
![cc_pipelines_general.svg](/uploads/37e5a5ce93ec5182eab138a8e5d187c5/cc_pipelines_general.svg)
* (a) Six month update
|
|
* (a) **Six-month update**: For all downloaded [data](data) and [chemical libraries](libraries), the pipeline standardizes the chemical structures and calculates molecular properties. The data-driven datasets may then be indexed by InChIKey. The union of all molecules participating in data-driven datasets defines the bioactive *universe*, which is used to select the molecules of property-driven datasets. Next, for each dataset, signatures of type 0 are derived after dataset-specific processing. A reference collection is chosen, and all the training needed to derive signatures of type 1 and 2 is done on the reference data. Signatures can then be derived for the full dataset. For the 25 exemplary datasets, full compound similarity vectors are calculated, and data are compared pairwise between datasets so that similarity inferences can be performed.
|
|
For every downloaded [data](data) and [chemical library](libraries), the pipeline standardizes the chemical structures. Then, it calculates molecular properties.
|
|
* (b) **New data-driven dataset**: To incorporate a new dataset sporadically, we first standardize the structures and index the dataset by InChIKey. Then, we process the data to obtain signatures of type 0. Just like in (a), training happens on a reference set of molecules, and signatures for the full collection are derived afterwards. New datasets
|
|
* (b) New data-driven dataset
|
|
* (c) **New property dataset**: When a new molecule property is defined, we derive the corresponding dataset for the bioactive universe. Then, as in (b), we process the data, fit models on a reference set, and finally derive signatures for the full data.
|
|
XXX
|
|
* (d) **Mapping of a new molecule collection**: In this case, we want to obtain the signature representation of *external* molecules in an existing property dataset. To do so, we calculate the properties, process them accordingly, and use the fitted models to produce the full signatures.
|
|
* (c) New property dataset
|
|
* (e) **Connectivity of external data**: Likewise, when new data are to be mapped on existing datasets, we simply process the raw data accordingly (potentially by using [connectivity functionalities](#connectivity)) and derive signatures using the fitted models. If the samples of the new dataset are indeed molecules, these are standardized as usual.
|
|
XXX
|
|
|
|
* (d) Mapping of new molecule collections
|
|
|
|
XXX
|
|
|
|
* (e) Connectivity of external data
|
|
|
|
XXX
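The definition of the bioactive *universe* in (a) and its use in (c) can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: the function names, the placeholder keys (`KEY-A`, `KEY-B`, `KEY-C`) and the toy property function are all hypothetical, and real InChIKeys would replace the placeholders.

```python
def build_universe(data_driven_datasets):
    """Union of the InChIKeys found in all data-driven datasets."""
    universe = set()
    for inchikeys in data_driven_datasets.values():
        universe.update(inchikeys)
    return universe


def derive_property_dataset(universe, compute_property):
    """Compute a property for every molecule in the universe, indexed by key."""
    return {key: compute_property(key) for key in sorted(universe)}


# Hypothetical contents: placeholder keys stand in for real InChIKeys.
data_driven = {
    "D1.001": {"KEY-A", "KEY-B"},
    "C5.001": {"KEY-B", "KEY-C"},
}

universe = build_universe(data_driven)

# Toy property (string length) standing in for a real molecular property.
property_dataset = derive_property_dataset(universe, len)
```

Because the universe is a set union, a molecule that appears in several data-driven datasets is counted once, and every property-driven dataset derived this way covers exactly the same molecules.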
|
|
|
|
|
|
|
|
:warning: Unfortunately, we do not yet have a clear means to produce [signatures type 3](signaturization#signatures-type-3). Consequently, their production is not appropriately reflected in the [pipeline](#pipeline).
... | @@ -208,7 +203,7 @@ The arguments should be, at least: |
* `--matrices`: What matrices to keep (e.g. `sig0`).
* `--only_exemplar`: Calculate only exemplar datasets.
|
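The two arguments listed above could be declared with Python's standard `argparse` module roughly as follows. This is a sketch under assumptions: the real script and its remaining arguments are not shown here, so `build_parser` is a hypothetical name, and accepting several values for `--matrices` and treating `--only_exemplar` as a boolean switch are guesses at the intended behavior.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the two arguments listed above."""
    parser = argparse.ArgumentParser(description="Pipeline update (sketch).")
    # Assumption: one or more matrix names can be given, e.g. "sig0".
    parser.add_argument(
        "--matrices",
        nargs="+",
        help="What matrices to keep (e.g. sig0).",
    )
    # Assumption: a plain on/off switch with no value.
    parser.add_argument(
        "--only_exemplar",
        action="store_true",
        help="Calculate only exemplar datasets.",
    )
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(["--matrices", "sig0", "--only_exemplar"])
```

With this layout, omitting `--only_exemplar` leaves the flag `False`, so a run covers all datasets unless the switch is passed.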
|
|
|
|
### A linear view of the pipeline
|
|
### A linear view of the 6 monthly pipeline
|
|
|
|
|
|
Below I sequentially list the steps of the pipeline. This is a linear and qualitative view of the pipeline and does not necessarily correspond to the organization of scripts in the repository.
... | | ... | |