Production phase
Diagrams below are editable at this draw.io link. Feel free to modify them.
The central task of the CC is to produce signatures of different types, given compound-related data typically fetched from the public domain. We can identify several scenarios where signatures are to be generated:
- A new dataset is added or a dataset is updated.
- It will happen for all of the CC every six months.
- It can happen sporadically during the development of a research project. Please note that exemplary datasets can only be added prior to the 6-month update, as they participate in downstream analyses.
- Samples are mapped or, more generally, connected to an existing dataset.
- Mapping means that a new compound is projected onto the signature space of a dataset of choice. For example, given a new molecule with a calculated 2D fingerprint, we might want to obtain `A1.001` signatures for it, without actually adding the molecule to `A1.001`. Another example would be a new molecule for which we know a target that is not associated.
- Connectivity is the generalization of mapping to biological entities other than compounds. Some, but not all, of the datasets will have connectivity capabilities. For example, a disease gene expression signature can be compared to a compound gene expression signature (`D1.001`). Or disease-related genes can be allocated on the protein-protein interaction network and compared to the network context of drug targets (`C5.001`).
In the diagram below, we show how these scenarios are handled.
Folder structure
The CC resource is organized in three parts:
Data repository
The default path for this repository is `/aloy/chemical_checker_repo/`. The following data structure is proposed:
├── downloads
│ ├──data
│ │ ├──*.*
│ ├──libraries
│ │ ├──*.smi
├── molrepos
│ ├──data
│ │ ├──*.tsv
│ ├──libraries
│ │ ├──*.tsv
├── molproperties
│ ├──*.*
├── reference
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──models
│ │ │ │ │ ├──*.*
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ │ ├──preprocess
│ │ │ │ │ ├──*.*
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── full
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── exemplary
│ ├──coordinates
│ │ ├──A
│ │ │ ├──A1
│ │ │ │ ├──sign*.h5 [link-to-full]
│ │ │ │ ├──clus*.h5 [link-to-full]
│ │ │ │ ├──neig*.h5 [link-to-full]
│ │ │ │ ├──proj*.h5 [link-to-full]
│ │ │ │ ├──nprd*.h5
│ │ │ │ ├──plots [link-to-full]
│ │ │ │ ├──stats [link-to-full]
│ │ │ ├──...
│ │ ├──...
│ ├──infer
│ │ ├──models
│ │ │ ├──*.*
│ ├──plots
│ │ ├──*.tsv
│ │ ├──*.png
├── molecules
│ ├──RZ
│ │ ├──VA
│ │ │ ├──RZVAJINKPMORJF-UHFFFAOYSA-N
│ │ │ │ ├──sims*.h5
│ │ │ │ ├──2d.svg
│ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── test
│ ├──validation_sets
│ │ ├──*.tsv
├── releases
│ ├──2018_04
│ │ ├──reference.tar.gz
│ │ ├──full.tar.gz
│ │ ├──exemplar.tar.gz
│ │ ├──test.tar.gz
│ │ ├──libraries.tar.gz
│ ├──...
├── README.md
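As a sketch of how this layout could be materialized, the helper below (hypothetical, not part of the CC codebase) creates the per-dataset subfolders of the `reference` branch, deriving the level and coordinate from the dataset code:

```python
import os

# Subfolders kept for every dataset under ./reference, per the tree above.
DATASET_SUBDIRS = ["models", "plots", "stats", "preprocess"]

def make_dataset_dirs(root, dataset):
    """Create <root>/reference/<level>/<coordinate>/<dataset>/{models,...}.

    For a dataset code such as 'A1.001', the level is its first character
    ('A') and the coordinate its first two characters ('A1').
    """
    level, coordinate = dataset[0], dataset[:2]
    base = os.path.join(root, "reference", level, coordinate, dataset)
    for sub in DATASET_SUBDIRS:
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base
```

For instance, `make_dataset_dirs("/aloy/chemical_checker_repo", "A1.001")` would create the `A/A1/A1.001` branch shown above.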
External data
The repository above is the result of running the CC processing pipeline on the downloaded data. We should be able to apply the CC pipeline to any collection of molecules and produce exactly the same data structure. The pipeline will produce the following folder structure in any directory of choice, `/path/to/my/directory/`:
├── output
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
In these cases, one should be able to run the pipeline with specifications (only certain levels, coordinates, or datasets; only certain signature types, etc.). See Pipeline execution below for details.
Pipeline scripts
Pipeline scripts are used to produce CC signatures, models and analyses. These scripts are typically run on the `pac-one` cluster at IRB Barcelona.
Broadly speaking, the CC pipeline has two modalities:
- Dataset addition:
- The resource is fully updated every 6 months.
- Datasets may be added sporadically.
- Mapping of data to a dataset:
- Individual querying
- Connectivity
Dataset addition
The resource update will happen every 6 months.
Pipeline
- `levels`: all datasets in all coordinates of the given level.
- `coordinates`:
- `datasets`: None
- `xxx`:
Pipeline execution
The pipeline will typically be run on the `pac-one` cluster.
Once the reference is done, with all of the models, one can run the pipeline for any dataset (including the `full` dataset).
Fit and produce the models
Predict for any dataset
The arguments should be, at least:
- `--datasets`: Datasets `A1.001`-`E5.999` to calculate. One can also specify the level `A`-`E` or the coordinate `A1`-`E5`. All are considered by default.
- `--matrices`: What matrices to keep (e.g. `sign0`).
- `--only_exemplar`: Calculate only exemplar datasets.
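A sketch of what this command-line interface could look like. The flag names come from the list above; `get_parser`, `select_datasets` and the placeholder dataset list are illustrative assumptions:

```python
import argparse

ALL_DATASETS = ["A1.001", "A1.002", "B4.001", "E5.999"]  # placeholder list

def select_datasets(spec, available):
    """Expand a --datasets value: a full code ('A1.001'), a coordinate
    ('A1') or a level ('A'). With no value, all datasets are considered."""
    if spec is None:
        return list(available)
    return [d for d in available if d.startswith(spec)]

def get_parser():
    parser = argparse.ArgumentParser(description="Run the CC pipeline")
    parser.add_argument("--datasets", default=None,
                        help="Datasets (A1.001-E5.999), level (A-E) or coordinate (A1-E5)")
    parser.add_argument("--matrices", default=None,
                        help="Matrices to keep, e.g. sign0")
    parser.add_argument("--only_exemplar", action="store_true",
                        help="Calculate only exemplar datasets")
    return parser
```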
A linear view of the pipeline
Below I sequentially list the steps of the pipeline. This is a linear and qualitative view of the pipeline and does not necessarily correspond to the organization of scripts in the repository.
- Download data.
- We need a SQL table specifying, for each download file, at least whether it is complete or not. Files that are internal to the SB&NB are downloaded from an `ftp` repository or the like anyway.
- If the data are complete, or the data have not been updated since the last CC update, don't download; just copy/move from the previous CC version.
- After this step, all data and libraries should be stored on disk.
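Such a bookkeeping table could look as follows (SQLite for illustration; the column names and the `needs_download` helper are assumptions, not the CC's actual schema):

```python
import sqlite3

# One row per download file, tracking completeness and upstream updates.
SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    filename     TEXT PRIMARY KEY,
    url          TEXT,
    completed    INTEGER NOT NULL DEFAULT 0,  -- 0/1 flag
    last_updated TEXT  -- ISO date of the last upstream update, if known
);
"""

def needs_download(conn, filename, last_cc_update):
    """Return False when we can copy/move the file from the previous CC
    version: it is complete, or unchanged since the last CC update."""
    row = conn.execute(
        "SELECT completed, last_updated FROM downloads WHERE filename = ?",
        (filename,)).fetchone()
    if row is None:
        return True  # never attempted: download it
    completed, last_updated = row
    if completed or (last_updated and last_updated <= last_cc_update):
        return False
    return True
```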
- Read small molecule structures.
- Get all molecules in each downloaded file as an `(id, smiles)` file.
- Standardize SMILES strings and save `molrepo` files.
- Insert `inchikey` and `inchi` into the `structures` SQL table.
- Calculate physicochemical parameters and insert them into the `physchem` SQL table.
- Draw the molecule and save it to `./molecules/<IN>/<CH>/<INCHIKEY>/2d.svg`.
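The `./molecules/<IN>/<CH>/<INCHIKEY>/` location can be derived from the InChIKey alone, since `<IN>` and `<CH>` are its first two and next two characters (compare the folder structure above). A minimal sketch (hypothetical helper name):

```python
import os

def molecule_dir(root, inchikey):
    """Shard the ./molecules tree by the first two and next two characters
    of the InChIKey, e.g. RZVAJINKPMORJF-UHFFFAOYSA-N -> RZ/VA/..."""
    return os.path.join(root, "molecules", inchikey[:2], inchikey[2:4], inchikey)
```

The `2d.svg` drawing, the `sims*.h5` files and the per-molecule JSON files all live inside this folder.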
- Calculate compound properties
- This means calculating fingerprints, scaffolds, structural keys, etc.
- Property files are ever-growing and should be stored efficiently (e.g. in a binary format). It has to be easy to insert/append new molecules to these files. These files can be found in `./molproperties`.
- Property calculation may sometimes fail, due to e.g. extreme molecular complexity. In these cases, we need to keep track of which molecules we have already attempted to calculate, and not do it again.
- Do validation sets
- Validation sets are `(inchikey, inchikey, 0/1)` files denoting whether two molecules are similar (`1`) or different (`0`) in a certain aspect.
- We can store an arbitrary number of validation sets.
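Such a file can be parsed along these lines (illustrative sketch; the function name is an assumption):

```python
import csv

def read_validation_set(fileobj):
    """Parse a tab-separated (inchikey, inchikey, 0/1) file into labeled
    pairs: 1 means the two molecules are similar, 0 that they differ."""
    pairs = []
    for key_a, key_b, label in csv.reader(fileobj, delimiter="\t"):
        pairs.append((key_a, key_b, int(label)))
    return pairs
```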
- Preprocess datasets
- This step is specific to each dataset. The goal is to go from one (or more) download file, or molproperties file, to a `sign0.h5` file stored in `./full`.
- Some datasets really require a lot of processing, and we need to come up with a way to monitor them and not re-do them every time we run the CC.
- We need to start with the datasets that have `subspace = None`.
- Then, we proceed to the rest of the datasets (`subspace != None`). These are, typically, chemistry datasets for which we want to constrain the space, for instance, to bioactive molecules (i.e. molecules that appear in the exemplar `B`-`E` levels). See the dataset documentation for more information.
- Remove (near-)duplicates.
- The `sign0.h5` files may contain duplicates and near-duplicates, and it is important to remove them to obtain a good reference dataset.
- Save the non-redundant `sign0.h5` under `./reference`.
- Save the 1-to-many mapping of the groups of (near-)duplicates.
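The grouping logic can be sketched as follows. This version collapses exact duplicates only; a real near-duplicate step would replace the exact-match key with, e.g., a distance cutoff. All names are illustrative:

```python
def deduplicate(rows):
    """Collapse duplicate signature rows.

    `rows` is a list of (molecule_id, signature) pairs. Returns the
    non-redundant rows (one representative per group) and the 1-to-many
    mapping from each representative to all members of its group."""
    groups, order = {}, []
    for mol_id, signature in rows:
        key = tuple(signature)
        if key not in groups:
            groups[key] = []
            order.append(key)
        groups[key].append(mol_id)
    reference = [(groups[k][0], list(k)) for k in order]
    mapping = {groups[k][0]: groups[k] for k in order}
    return reference, mapping
```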
- Fit signatures type 1.
- For continuous datasets, do PCA.
- For discrete datasets, do LSI.
- Keep the `sign1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
- Fit k-means clusters.
- Do an automated k-means clustering to discover centroids and groups in the data.
- Keep the `clus1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
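The clustering step could be prototyped along these lines (plain-Python k-means for illustration; the actual pipeline would use an optimized implementation, plus some automated way of choosing k):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's k-means on a list of coordinate tuples.
    Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize on k random points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels
```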
- Fit nearest-neighbor search.
- Do a nearest-neighbor search to discover pairs of similar molecules.
- Keep the `neig1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
- Obtain signatures type 2.
- These are the node2vec embeddings derived from the similarity network.
- First, a network is built (under a certain cutoff).
- Then, a random walk algorithm is run.
- Finally, word2vec embeds nodes (words) by reading the trajectories of the random walker.
- Keep the `sign2.h5` file under `./reference`.
- Do validation plots.
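The random-walk stage can be sketched as follows. Note that node2vec proper biases the transition probabilities with its p/q parameters; this illustrative version walks uniformly. The walks are the "sentences" later fed to word2vec:

```python
import random

def random_walks(graph, walk_length, n_walks, seed=0):
    """Generate `n_walks` uniform random walks per node over an adjacency
    dict mapping each node to the list of its neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```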
- Fit signatures type 2
- Unfortunately, node2vec does not work with out-of-sample data, so we need to learn a mapping between signatures type 1 and signatures type 2.
- This can be done with deep learning, preferably with automated deep learning (e.g. AdaNet).
- Save the models for persistence.
- Obtain 2D projections
- Vanilla 2D projection of signatures type 1 does not work out-of-the-box with t-SNE. A bit of pre-processing is needed. We are still deciding what's best. In any case, it will require the calculation of an intermediate matrix.
- From this intermediate matrix, run t-SNE to obtain the projection.
- Keep the `proj1.h5` file under `./reference`.
- Do validation plots.
- Fit 2D projections
- Here again, t-SNE has no out-of-sample method, so we need to learn a mapping between signatures type 1 and the 2D projections.
- I suggest using AdaNet in this case, too.
- Save the models for persistence.
Once reference calculations are done, we can move to the full dataset.
- Predict signatures type 1.
- If the molecule is in the reference (or is a near-duplicate of it), take the signature.
- Else, use the persistent model to predict.
- Keep the `sign1.h5` file under `./full`.
- Do validation plots.
- Predict k-means clusters.
- Do 1-to-many or predict, as necessary.
- Keep the `clus1.h5` file under `./full`.
- Do validation plots.
- Predict nearest neighbors.
- Beware that there should be two modalities here:
- Full-vs-reference (1-to-many or predict, as necessary); this is the one I would keep.
- Full-vs-full (requires fitting again).
- Keep the `neig1.h5` file under `./full`.
- Do the validation plots.
- Predict signatures type 2.
- Do 1-to-many or predict, as necessary.
- Keep the `sign2.h5` file under `./full`.
- Do the validation plots.
- Predict 2D projections.
- Do 1-to-many or predict, as necessary.
- Keep the `proj1.h5` file under `./full`.
- Do the validation plots.
Points 1-19 are applicable to any dataset. Comparison of CC datasets is, for now, only among exemplary ones. From here on, we only perform the calculations on these 25 exemplary datasets.
- Link exemplary to full datasets
- In `./exemplary`, keep the corresponding signature files available from `./full`.
- It is not necessary for the signature files to be copied; they can just be linked with a pointer.
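The pointer can simply be a symbolic link, mirroring the `[link-to-full]` entries in the folder tree above (hypothetical helper):

```python
import os

def link_signature(full_path, exemplary_path):
    """Expose a file from ./full inside ./exemplary as a symlink
    rather than a copy."""
    os.makedirs(os.path.dirname(exemplary_path), exist_ok=True)
    if os.path.islink(exemplary_path):
        os.remove(exemplary_path)  # refresh a stale link
    os.symlink(full_path, exemplary_path)
```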
- Calculate full similarities
- Using signatures type 1.
- Add datasets to `sims1.h5` (`*_obs`) under `./molecules`.
- Calculate correlations between datasets.
- Conditional similarities.
- Paired conditional similarities.
- Cluster paired conditional similarities.
- Similarity rank comparisons.
- Canonical correlation analysis, performed with signatures type 2.
- Identify patterns of expected similarity distributions
- This generates clusters of distributions that are useful to perform similarity inferences.
- Save them under `./exemplary`.
- Prepare for inference.
- Produce correlation matrices and placeholders necessary for inference.
- Save them under `./exemplary`.
- Infer similarities
- Add datasets to `sims.h5` (`*_prd`) under `./molecules`.
- Calculate CC scores.
- Popularity
- Singularity
- Mappability
Please beware that, for simplicity, I have omitted processes that are relevant to the CC web app. Some of these are related to the SQL database:
- Saving x-y limits of the projections.
- Updating the PubChem entries (name, synonyms, etc.).
- Filling up the table of targets to show.
New data mapping
Data access scripts
The CC resource provides a very simple API to query the data within the resource. In brief, the API facilitates the following:
- Given one or more `inchikeys`:
- It lists the corresponding `datasets`.
- It returns the InChI structure(s).
- If a `dataset` is specified, it returns the signatures (`sign*`) or vectorial data (`neig*`, `clus*`, etc.) of choice.
- Given one or more `datasets`:
- It lists the corresponding `inchikeys`.
- It returns the signatures (`sign*`) or vectorial data (`neig*`, `clus*`, etc.) of choice.
- It deletes the dataset from the resource (might be convenient when we are testing datasets).
- Given one or more `levels` or `coordinates`:
- It lists the corresponding `datasets`.
- It indicates the exemplar `dataset(s)`.
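A sketch of what such an API could look like, backed here by an in-memory index for illustration; the class and method names are assumptions, not the CC's actual interface:

```python
class CCData:
    """Toy query interface over a dataset -> inchikeys index."""

    def __init__(self):
        self._index = {}  # dataset code -> set of inchikeys

    def add(self, dataset, inchikeys):
        self._index.setdefault(dataset, set()).update(inchikeys)

    def datasets(self, inchikey):
        """Given an inchikey, list the datasets where it appears."""
        return sorted(d for d, keys in self._index.items() if inchikey in keys)

    def inchikeys(self, dataset):
        """Given a dataset, list its inchikeys."""
        return sorted(self._index.get(dataset, ()))

    def delete(self, dataset):
        """Remove a dataset from the resource (handy when testing datasets)."""
        self._index.pop(dataset, None)
```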