... | ... | @@ -13,7 +13,7 @@ The central task of the CC is to produce signatures of different types, given co |
This is best explained with a diagram:

![cc_pipelines_general.svg](/uploads/37e5a5ce93ec5182eab138a8e5d187c5/cc_pipelines_general.svg)

![cc_pipelines_general_1_.svg](uploads/a76835d0973d45fc579ba70c139c24e6/cc_pipelines_general_1_.svg)

* (a) **Six-month update**: For every downloaded [data](data) source and [chemical library](libraries), the pipeline standardizes the chemical structures and calculates molecular properties. The data-driven datasets can then be indexed by InChIKey. The union of all molecules appearing in data-driven datasets defines the bioactive *universe*, which is used to select the molecules of property-driven datasets. Next, for each dataset, signatures of type 0 are derived through dataset-specific processing. A reference collection is chosen, and all of the training needed to derive signatures of type 1 and 2 is done on the reference data. Signatures can then be obtained for the full dataset. For the 25 exemplary datasets, full similarity vectors are calculated per compound, and data are compared pairwise between datasets so that similarity inferences can be performed.
* (b) **New data-driven dataset**: To incorporate a new dataset occasionally, between updates, we first standardize the structures and index the dataset by InChIKey. Then, we process the data to obtain signatures of type 0. Just as in (a), training is done on a reference set of molecules, and signatures for the full collection are obtained afterwards. New datasets
|
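The *universe* step described in (a) can be sketched with plain Python sets. This is only an illustration under stated assumptions, not the CC implementation: the function names `bioactive_universe` and `restrict_to_universe`, the dataset names, and the `KEY-*` strings (placeholders standing in for real InChIKeys) are all hypothetical.

```python
from typing import Dict, Set


def bioactive_universe(data_driven: Dict[str, Set[str]]) -> Set[str]:
    """Union of the InChIKeys found in all data-driven datasets."""
    universe: Set[str] = set()
    for keys in data_driven.values():
        universe |= keys
    return universe


def restrict_to_universe(candidates: Set[str], universe: Set[str]) -> Set[str]:
    """Keep only the candidate molecules that belong to the universe."""
    return candidates & universe


# Hypothetical data-driven datasets, each indexed by (placeholder) InChIKey.
data_driven = {
    "dataset_1": {"KEY-A", "KEY-B"},
    "dataset_2": {"KEY-B", "KEY-C"},
}

# Hypothetical candidate molecules for a property-driven dataset.
library = {"KEY-A", "KEY-C", "KEY-Z"}

universe = bioactive_universe(data_driven)
selected = restrict_to_universe(library, universe)
```

Here `KEY-Z` is dropped because it does not appear in any data-driven dataset, so a property-driven dataset built this way would only cover molecules inside the universe.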
... | ... | |