Production phase
Diagrams below are editable at this draw.io link. Feel free to modify them.
The central task of the CC is to produce signatures of different types, given compound-related data typically fetched from the public domain. Signature production is performed via a rather long and complex pipeline. Broadly speaking, the pipeline has two modalities:
- Addition or update of a dataset.
  - This happens for the entire CC every 6 months.
  - It can also happen, sporadically, during the development of a research project. Please note that exemplary datasets can only be modified at the 6-month update, as exemplary datasets are bound to the CC web app.
- Mapping (or, more generally, connection) of external data to an existing dataset.
  - Mapping means that a new compound is projected onto the signature space of a dataset of choice. For example, given a new molecule with a calculated 2D fingerprint, we might want to obtain `A1.001` signatures for it, without actually appending the molecule to the dataset. Another example would be a new molecule for which we have determined a target in the laboratory, and we want to obtain the mode-of-action (`B1.001`) signatures of this new molecule, so that we can compare it with the rest.
  - Connectivity is the generalization of mapping to biological entities other than compounds. Some (but not all) datasets will have connectivity capabilities. For example, a disease gene expression signature can be compared to a compound gene expression signature (`D1.001`), or disease-related genes can be allocated on the protein-protein interaction network and compared to the network context of drug targets (`C5.001`).
This is best explained with a diagram:
- (a) Six-month update: For each downloaded dataset and chemical library, the pipeline standardizes the chemical structures and calculates molecular properties. The data-driven datasets can then be indexed by InChIKey. The union of all molecules participating in data-driven datasets defines the bioactive universe, which is used to select the molecules of property-driven datasets. Afterwards, for each dataset, signatures of type 0 are derived upon dataset-specific processing. A reference collection is chosen, and all of the training to derive signatures of type 1 and 2 happens on the reference data. Then, signatures can be obtained for the full dataset. For the 25 exemplary datasets, full similarity vectors are calculated per compound, and data are compared pair-wise between datasets so that similarity inferences can be performed.
- (b) New data-driven dataset: To incorporate, sporadically, a new dataset, we first standardize the structures and index the dataset by InChIKey. Then, we process the data to end up with signatures of type 0. Just like in (a), training happens on a reference set of molecules, and signatures for the full collection are later obtained.
- (c) New property dataset: When a new molecule property (e.g. a new chemical fingerprint) is defined by the CC team, we derive the corresponding dataset for the bioactive universe. Then, as in (b), we process the data, fit models on a reference set and finally obtain signatures for the full data.
- (d) Mapping of a new molecule collection: In this case, we want to obtain the signature representation of external molecules in an existing property-based dataset. For this, we simply calculate the molecule properties (e.g. 2D fingerprints), process them correspondingly and use the fitted models to produce the full signatures.
- (e) Connectivity of external data: Likewise, when new data are to be mapped onto existing datasets, we simply process the raw data accordingly (potentially using connectivity functionalities) and derive signatures using the fitted models. If the samples of the new dataset are indeed molecules, these are standardized as usual.
Folder structure
The default path for the CC repository is `/aloy/chemical_checker_repo/`. This corresponds to the data generated in the dataset addition/update mode of the pipeline. Data in this directory is organized as follows:
├── downloads
│ ├──data
│ │ ├──*.*
│ ├──libraries
│ │ ├──*.smi
├── molrepos
│ ├──data
│ │ ├──*.tsv
│ ├──libraries
│ │ ├──*.tsv
├── molproperties
│ ├──*.*
├── reference
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──models
│ │ │ │ │ ├──*.*
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ │ ├──preprocess
│ │ │ │ │ ├──*.*
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── full
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── exemplary
│ ├──coordinates
│ │ ├──A
│ │ │ ├──A1
│ │ │ │ ├──sign*.h5 [link-to-full]
│ │ │ │ ├──clus*.h5 [link-to-full]
│ │ │ │ ├──neig*.h5 [link-to-full]
│ │ │ │ ├──proj*.h5 [link-to-full]
│ │ │ │ ├──nprd*.h5
│ │ │ │ ├──plots [link-to-full]
│ │ │ │ ├──stats [link-to-full]
│ │ │ ├──...
│ │ ├──...
│ ├──infer
│ │ ├──models
│ │ │ ├──*.*
│ ├──plots
│ │ ├──*.tsv
│ │ ├──*.png
├── molecules
│ ├──RZ
│ │ ├──VA
│ │ │ ├──RZVAJINKPMORJF-UHFFFAOYSA-N
│ │ │ │ ├──sims*.h5
│ │ │ │ ├──2d.svg
│ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── test
│ ├──validation_sets
│ │ ├──*.tsv
├── releases
│ ├──2018_04
│ │ ├──reference.tar.gz
│ │ ├──full.tar.gz
│ │ ├──exemplar.tar.gz
│ │ ├──test.tar.gz
│ │ ├──libraries.tar.gz
│ ├──...
├── README.md
When running the pipeline in mapping/connectivity mode, signatures will be stored in a user-specified directory (e.g. `/path/to/my/directory/`):
├── output
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
Please note that I do not present here the `/aloy/scratch/` organization, which is internal to the pipeline.
Accompanying the folder structure, there is a PostgreSQL database that is crucial for metadata storage and the proper functioning of the CC web app.
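To illustrate how this layout is meant to be navigated programmatically, here is a minimal sketch (not part of the pipeline code) that resolves a dataset code to its signature file and reads it with `h5py`. The `V` name used for the signature matrix inside the HDF5 file is an assumption for the example.

```python
import os
import h5py

CC_ROOT = "/aloy/chemical_checker_repo"  # default repository path (see the tree above)

def signature_path(dataset, cctype="sign1", molset="full"):
    """Build e.g. /aloy/chemical_checker_repo/full/A/A1/A1.001/sign1.h5"""
    level, sublevel = dataset[:1], dataset[:2]  # "A" and "A1" for dataset "A1.001"
    return os.path.join(CC_ROOT, molset, level, sublevel, dataset, cctype + ".h5")

path = signature_path("A1.001", cctype="sign1", molset="reference")
with h5py.File(path, "r") as fh:
    matrix = fh["V"][:]  # "V" is an assumed name for the signature matrix
print(path, matrix.shape)
```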
Pipeline
Pipeline scripts are used to produce CC signatures, models and analyses. These scripts are typically run on the `pac-one` cluster at IRB Barcelona. Below, we provide detailed explanations of the different pipeline modalities:
Dataset addition
Adding and updating datasets is the most important and computationally intensive part of the repository. Consequently, I anticipate that this part of the pipeline will be in constant evolution.
Six-month pipeline
Below, I list the steps of the pipeline sequentially. This is a linear and qualitative view of the pipeline and does not necessarily correspond to the organization of the scripts nor to the order of execution.
1. Download data.
    - We need a SQL table specifying, for each download file, at least whether it is complete or not. Files that are internal to the SB&NB are simply copied.
    - If the download is already complete, or the data have not been updated since the last CC update, do not download again; just copy/move from the previous CC version.
    - After this step, all data and libraries should be stored on disk.
2. Read small molecule structures.
    - Get all molecules in each downloaded file as an `(id, smiles)` file.
    - Standardize SMILES strings and save `molrepo` files (see the standardization sketch below).
    - Insert `inchikey` and `inchi` into the `structures` SQL table.
    - Calculate physicochemical parameters and insert them into the `physchem` SQL table.
    - Draw the molecule and save it to `./molecules/<IN>/<CH>/<INCHIKEY>/2d.svg`.
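The exact standardization protocol used by the CC is not detailed here; as a hedged illustration of the standardize-and-index idea, the sketch below cleans a SMILES string with RDKit and derives the InChI/InChIKey used for indexing.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    """Return (inchikey, inchi, canonical_smiles), or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)         # sanitize, normalize, reionize
    mol = rdMolStandardize.FragmentParent(mol)  # keep the parent fragment (e.g. remove salts)
    inchi = Chem.MolToInchi(mol)
    return Chem.InchiToInchiKey(inchi), inchi, Chem.MolToSmiles(mol)

print(standardize("CC(=O)Nc1ccc(O)cc1"))  # paracetamol -> ('RZVAJINKPMORJF-UHFFFAOYSA-N', ...)
```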
3. Calculate compound properties.
    - This means calculating fingerprints, scaffolds, structural keys, etc. (a fingerprint sketch is given below).
    - Property files are ever-growing and should be stored efficiently (e.g. in a binary format). It has to be easy to insert/append new molecules to these files. These files can be found in `./molproperties`.
    - Property calculation may sometimes fail, e.g. due to extreme molecular complexity. In these cases, we need to keep track of the molecules we have already attempted to calculate, so that we do not try them again.
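As a hedged example of such a property file, the sketch below computes Morgan (ECFP-like) fingerprints with RDKit and appends them to a resizable HDF5 dataset; the file name `fp2d.h5` and the `V` dataset name are hypothetical.

```python
import h5py
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

NBITS = 2048

def morgan_fp(smiles, radius=2):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # keep track of failures so they are not retried
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=NBITS)
    arr = np.zeros((NBITS,), dtype=np.int32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr.astype(np.int8)

new_rows = [fp for fp in (morgan_fp(s) for s in ["CCO", "c1ccccc1O"]) if fp is not None]

with h5py.File("fp2d.h5", "a") as fh:  # hypothetical file under ./molproperties
    if "V" not in fh:
        fh.create_dataset("V", shape=(0, NBITS), maxshape=(None, NBITS), dtype="i1")
    dset = fh["V"]
    dset.resize(dset.shape[0] + len(new_rows), axis=0)
    dset[-len(new_rows):] = np.vstack(new_rows)  # append the new molecules
```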
4. Do validation sets.
    - Validation sets are `(inchikey, inchikey, 0/1)` files denoting whether two molecules are similar (`1`) or different (`0`) in a certain aspect.
    - We can store an arbitrary number of validation sets.
5. Preprocess datasets.
    - This step is specific to each dataset. The goal is to go from one (or more) download files, or a molproperties file, to a `sign0.h5` file stored in `./full`.
    - Some datasets require a lot of processing, and we need to come up with a way to monitor them and not re-do them every time we run the CC.
    - We need to start with the datasets that have `subspace = None`.
    - Then, we proceed to the rest of the datasets (`subspace != None`). These are, typically, chemistry datasets for which we want to constrain the space, for instance, to bioactive molecules (i.e. molecules that appear in the exemplary `B`-`E` levels). See the dataset documentation for more information.
6. Remove (near-)duplicates.
    - The `sign0.h5` files may contain duplicates and near-duplicates, and it is important to remove them to obtain a good reference dataset (a minimal grouping sketch is given below).
    - Save the non-redundant `sign0.h5` under `./reference`.
    - Save the 1-to-many mapping of the groups of (near-)duplicates.
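A minimal sketch of (near-)duplicate grouping, assuming a simple rounding-based hashing of the signature-0 rows (the actual CC criterion is not specified here); it yields the non-redundant rows and the 1-to-many mapping from each kept row to its redundant copies.

```python
import numpy as np
from collections import defaultdict

def deduplicate(X, decimals=3):
    """Group rows that are identical after rounding and keep one representative per group."""
    groups = defaultdict(list)
    for i, row in enumerate(X):
        groups[tuple(np.round(row, decimals))].append(i)  # near-duplicates share a key
    reference_idx = np.array([idxs[0] for idxs in groups.values()])
    one_to_many = {idxs[0]: idxs for idxs in groups.values()}  # kept row -> all members
    return reference_idx, one_to_many

X = np.random.rand(1000, 16)
ref_idx, mapping = deduplicate(X)
X_reference = X[ref_idx]  # non-redundant matrix to be saved under ./reference
```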
7. Fit signatures type 1.
    - For continuous datasets, do PCA.
    - For discrete datasets, do LSI (see the scikit-learn sketch below).
    - Keep the `sign1.h5` file under `./reference`.
    - Save the models for persistency.
    - Do validation plots.
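A hedged scikit-learn sketch of this fitting step: PCA for a continuous matrix and truncated SVD (the usual implementation of LSI) for a discrete one. The component numbers and variance cutoff are illustrative, not the pipeline's actual settings.

```python
import numpy as np
from joblib import dump
from sklearn.decomposition import PCA, TruncatedSVD

X_continuous = np.random.rand(500, 100)                          # stand-in for a continuous sign0
X_discrete = (np.random.rand(500, 2048) > 0.95).astype(np.int8)  # stand-in for a sparse sign0

pca = PCA(n_components=0.9)               # keep 90% of the variance (illustrative)
sign1_continuous = pca.fit_transform(X_continuous)

lsi = TruncatedSVD(n_components=100)      # LSI via truncated SVD (illustrative)
sign1_discrete = lsi.fit_transform(X_discrete)

dump(pca, "sign1_model.joblib")           # persist the fitted model for later predictions
```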
8. Fit k-Means clusters.
    - Do an automated k-means clustering to discover centroids and groups in the data (sketch below).
    - Keep the `clus1.h5` file under `./reference`.
    - Save the models for persistency.
    - Do validation plots.
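A minimal sketch of an automated k-means step, choosing k by silhouette score over a small grid; the CC's actual k-selection heuristic is not specified here.

```python
import numpy as np
from joblib import dump
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 50)  # stand-in for a reference sign1 matrix

best_model, best_score = None, -1.0
for k in (5, 10, 20, 50):    # illustrative grid of cluster numbers
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_model, best_score = model, score

labels = best_model.labels_             # conceptually, the content of clus1.h5
dump(best_model, "clus1_model.joblib")  # persist for the prediction phase
```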
9. Fit nearest-neighbor search.
    - Do a nearest-neighbor search to discover pairs of similar molecules (sketch below).
    - Keep the `neig1.h5` file under `./reference`.
    - Save the models for persistency.
    - Do validation plots.
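A sketch of the nearest-neighbor step with scikit-learn; the number of neighbors and the metric are assumptions, and a production pipeline may well rely on an approximate search library instead.

```python
import numpy as np
from joblib import dump
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 50)  # stand-in for a reference sign1 matrix

nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)  # neighbors of every reference molecule

# conceptually, distances/indices are what neig1.h5 would store
dump(nn, "neig1_model.joblib")         # persist for out-of-sample queries
```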
10. Obtain signatures type 2.
    - These are the node2vec embeddings derived from the similarity network (a minimal sketch follows below).
    - First, a network is built (under a certain cutoff).
    - Then, a random-walk algorithm is run.
    - Finally, word2vec embeds the nodes (words) by reading the trajectories of the random walker.
    - Keep the `sign2.h5` file under `./reference`.
    - Do validation plots.
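A minimal sketch of the network-plus-node2vec idea, using the `networkx` and `node2vec` Python packages; the similarity measure, cutoff and embedding parameters are illustrative only, not the pipeline's settings.

```python
import networkx as nx
import numpy as np
from node2vec import Node2Vec

sign1 = np.random.rand(100, 20)   # stand-in for reference type-1 signatures
sim = np.corrcoef(sign1)          # toy molecule-molecule similarity matrix

G = nx.Graph()                    # similarity network under a cutoff
G.add_nodes_from(range(len(sign1)))
for i in range(len(sign1)):
    for j in range(i + 1, len(sign1)):
        if sim[i, j] > 0.5:
            G.add_edge(i, j, weight=sim[i, j])

n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=50, workers=1)  # random walks
model = n2v.fit(window=10, min_count=1)                                    # word2vec on the walks
sign2 = np.vstack([model.wv[str(n)] for n in G.nodes()])                   # type-2 signatures
```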
11. Fit signatures type 2.
    - Unfortunately, node2vec does not work with out-of-sample data, so we need to learn a mapping between signatures type 1 and signatures type 2.
    - This can be done with deep learning, preferably with automated deep learning (e.g. AdaNet); a simple stand-in sketch is given below.
    - Save the models for persistency.
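As a simple stand-in for the AdaNet learner suggested above, the sketch below regresses type-2 signatures from type-1 signatures with a scikit-learn multi-layer perceptron; the architecture and hyper-parameters are illustrative only.

```python
import numpy as np
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

sign1 = np.random.rand(1000, 100)  # stand-in for reference type-1 signatures
sign2 = np.random.rand(1000, 128)  # stand-in for the matching type-2 signatures

X_tr, X_te, y_tr, y_te = train_test_split(sign1, sign2, test_size=0.2, random_state=0)
mapper = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=500, random_state=0)
mapper.fit(X_tr, y_tr)
print("held-out R^2:", mapper.score(X_te, y_te))

dump(mapper, "sign1_to_sign2.joblib")  # persisted and reused to predict sign2 out-of-sample
```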
12. Obtain 2D projections.
    - A vanilla 2D projection of signatures type 1 does not work out-of-the-box with t-SNE; a bit of pre-processing is needed. We are still deciding what works best. In any case, it will require the calculation of an intermediate matrix.
    - From this intermediate matrix, run t-SNE to obtain the projection (sketch below).
    - Keep the `proj1.h5` file under `./reference`.
    - Do validation plots.
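A minimal t-SNE sketch in which the intermediate matrix is simply a PCA-reduced version of the type-1 signatures; this is one plausible pre-processing choice, not necessarily the one the pipeline will adopt.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

sign1 = np.random.rand(1000, 100)  # stand-in for reference type-1 signatures

intermediate = PCA(n_components=30).fit_transform(sign1)  # intermediate matrix (assumption)
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(intermediate)
# proj (n_molecules x 2) is, conceptually, what proj1.h5 would store
```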
13. Fit 2D projections.
    - Here again, t-SNE has no out-of-sample method, so we need to learn a mapping between signatures type 1 and the 2D projections.
    - I suggest using AdaNet in this case, too.
    - Save the models for persistency.
14. Once the reference calculations are done, we can move to the full dataset.
15. Predict signatures type 1.
    - If the molecule is in the reference set (or is a near-duplicate of a reference molecule), take its signature.
    - Else, use the persistent model to predict (a sketch covering both cases follows below).
    - Keep the `sign1.h5` file under `./full`.
    - Do validation plots.
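A hedged sketch of the take-or-predict logic for the full collection, assuming a fitted PCA model from the reference step and an InChIKey lookup built from the (near-)duplicate mapping; all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

X_ref = np.random.rand(200, 50)                   # toy reference sign0 matrix
pca = PCA(n_components=10).fit(X_ref)             # in the pipeline, this is the persisted model
reference_sign1 = pca.transform(X_ref)
ref_lookup = {"RZVAJINKPMORJF-UHFFFAOYSA-N": 0}   # inchikey -> reference row (illustrative)

def predict_sign1(inchikey, sign0_row):
    """Take the reference signature when available, otherwise predict it from sign0."""
    if inchikey in ref_lookup:                    # molecule (or near-duplicate) in reference
        return reference_sign1[ref_lookup[inchikey]]
    return pca.transform(sign0_row.reshape(1, -1))[0]  # out-of-sample prediction

print(predict_sign1("RZVAJINKPMORJF-UHFFFAOYSA-N", X_ref[0]).shape)
print(predict_sign1("UNKNOWN-KEY", np.random.rand(50)).shape)
```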
16. Predict k-Means clusters.
    - Do 1-to-many or predict, as necessary.
    - Keep the `clus1.h5` file under `./full`.
    - Do validation plots.
17. Predict nearest-neighbors.
    - Beware that there should be two modalities here:
        - Full-vs-reference (1-to-many or predict, as necessary); this is the one I would keep.
        - Full-vs-full (requires fitting again).
    - Keep the `neig1.h5` file under `./full`.
    - Do the validation plots.
18. Predict signatures type 2.
    - Do 1-to-many or predict, as necessary.
    - Keep the `sign2.h5` file under `./full`.
    - Do the validation plots.
19. Predict 2D projections.
    - Do 1-to-many or predict, as necessary.
    - Keep the `proj1.h5` file under `./full`.
    - Do the validation plots.
Points 1-19 are applicable to any dataset. Comparison of CC datasets is, for now, performed only among exemplary ones. From here on, we only perform the calculations on these 25 exemplary datasets.
20. Link exemplary to full datasets.
    - In the `./exemplary` folder, keep the corresponding signature files available from `./full`.
    - It is not necessary to copy the signature files; they can just be linked with a pointer.
21. Calculate full similarities.
    - Using signatures type 1 (a cosine-similarity sketch is given below).
    - Add datasets to `sims1.h5` (`*_obs`) under `./molecules`.
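A hedged sketch of the per-molecule similarity vectors implied here: cosine similarities of each molecule's type-1 signature against the whole exemplary dataset. The metric, the `*_obs` dataset name and the per-molecule storage are assumptions based on the folder layout above.

```python
import h5py
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

sign1 = np.random.rand(300, 100)   # stand-in for an exemplary dataset's sign1 matrix
sims = cosine_similarity(sign1)    # observed molecule-vs-all similarities

# one file per molecule, as in ./molecules/<IN>/<CH>/<INCHIKEY>/sims1.h5 (layout assumed)
with h5py.File("sims1_example.h5", "w") as fh:
    fh.create_dataset("A1.001_obs", data=sims[0])  # similarity vector of one molecule
```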
22. Calculate correlations between datasets.
    - Conditional similarities.
    - Paired conditional similarities.
    - Cluster paired conditional similarities.
    - Similarity rank comparisons.
    - Canonical correlation analysis, performed with signatures type 2.
23. Identify patterns of expected similarity distributions.
    - This generates clusters of distributions that are useful to perform similarity inferences.
    - Save them under `./exemplary`.
24. Prepare for inference.
    - Produce the correlation matrices and placeholders necessary for inference.
    - Save them under `./exemplary`.
25. Infer similarities.
    - Add datasets to `sims.h5` (`*_prd`) under `./molecules`.
26. Calculate CC scores.
    - Popularity.
    - Singularity.
    - Mappability.
Please note that, for simplicity, I have omitted here processes that are relevant to the CC web app. Some of these are related to the SQL database:
- Saving x-y limits of the projections.
- Updating the PubChem entries (name, synonyms, etc.).
- Filling up the table of targets to show.
Sporadic datasets
Whenever, during a research project, we want to introduce a new dataset to the CC resource, we can follow this reasoning:
Note that adding new data to the CC will necessarily require some scripting. Please refer to dataset processing for guidelines.
New data mapping
When we want to map or connect external data, we make heavy use of the predictors learned in the dataset addition phase. There is, therefore, no need for heavy computations in this modality, and we might consider offering this part of the pipeline outside the `pac-one` cluster. The only potentially heavy predictions are the ones related to the processing step.
The diagram below covers the different scenarios that we might encounter when we want to map/connect new molecules or biological entities.