Production phase
Diagrams below are editable at this draw.io link. Feel free to modify them.
The central task of the CC is to produce signatures of different types, given compound-related data typically fetched from the public domain. We can identify several scenarios where signatures are to be generated:
- A new dataset is added or a dataset is updated.
- It will happen for all of the CC every six months.
- It can happen sporadically during the development of a research project. Please note that exemplary datasets can only be added prior to the 6-month update, as they participate in downstream analyses.
- Samples are mapped or, more generally, connected to an existing dataset.
- Mapping means that a new compound is projected onto the signature space of a dataset of choice. For example, given a new molecule with a calculated 2D fingerprint, we might want to obtain `A1.001` signatures for it, without actually adding the molecule to `A1.001`. Another example would be a new molecule for which we know a target that is not associated.
- Connectivity is the generalization of mapping to biological entities other than compounds. Some, but not all, of the datasets will have connectivity capabilities. For example, a disease gene expression signature can be compared to a compound gene expression signature (`D1.001`). Or disease-related genes can be allocated on the protein-protein interaction network and compared to the network context of drug targets (`C5.001`).
In the diagram below, we show how these scenarios are handled.
Folder structure
The CC resource is organized in three parts:
Data repository
The default path for this repository is `/aloy/chemical_checker_repo/`. The following data structure is proposed:
├── downloads
│ ├──data
│ │ ├──*.*
│ ├──libraries
│ │ ├──*.smi
├── molrepos
│ ├──data
│ │ ├──*.tsv
│ ├──libraries
│ │ ├──*.tsv
├── molproperties
│ ├──*.*
├── reference
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──models
│ │ │ │ │ ├──*.*
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ │ ├──preprocess
│ │ │ │ │ ├──*.*
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── full
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── exemplary
│ ├──coordinates
│ │ ├──A
│ │ │ ├──A1
│ │ │ │ ├──sign*.h5 [link-to-full]
│ │ │ │ ├──clus*.h5 [link-to-full]
│ │ │ │ ├──neig*.h5 [link-to-full]
│ │ │ │ ├──proj*.h5 [link-to-full]
│ │ │ │ ├──nprd*.h5
│ │ │ │ ├──plots [link-to-full]
│ │ │ │ ├──stats [link-to-full]
│ │ │ ├──...
│ │ ├──...
│ ├──infer
│ │ ├──models
│ │ │ ├──*.*
│ ├──plots
│ │ ├──*.tsv
│ │ ├──*.png
├── molecules
│ ├──RZ
│ │ ├──VA
│ │ │ ├──RZVAJINKPMORJF-UHFFFAOYSA-N
│ │ │ │ ├──sims*.h5
│ │ │ │ ├──2d.svg
│ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── test
│ ├──validation_sets
│ │ ├──*.tsv
├── releases
│ ├──2018_04
│ │ ├──reference.tar.gz
│ │ ├──full.tar.gz
│ │ ├──exemplar.tar.gz
│ │ ├──test.tar.gz
│ │ ├──libraries.tar.gz
│ ├──...
├── README.md
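As a sketch of how this layout could be materialized, the helper below (hypothetical, not part of the CC codebase) creates the per-dataset subfolders of the `reference` branch, deriving the level and coordinate from the dataset code:

```python
import os

# Subfolders kept for every dataset under ./reference, per the tree above.
DATASET_SUBDIRS = ["models", "plots", "stats", "preprocess"]

def make_dataset_dirs(root, dataset):
    """Create <root>/reference/<level>/<coordinate>/<dataset>/{models,...}.

    For a dataset code such as 'A1.001', the level is its first character
    ('A') and the coordinate its first two characters ('A1').
    """
    level, coordinate = dataset[0], dataset[:2]
    base = os.path.join(root, "reference", level, coordinate, dataset)
    for sub in DATASET_SUBDIRS:
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base
```

For instance, `make_dataset_dirs("/aloy/chemical_checker_repo", "A1.001")` would create the `A/A1/A1.001` branch shown above.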
External data
The repository above is the result of running the CC processing pipeline on the downloaded data. We should be able to apply the CC pipeline to any collection of molecules and produce exactly the same data structure. The pipeline will produce the following folder structure in any directory of choice, `/path/to/my/directory/`:
├── output
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
In these cases, one should be able to run the pipeline with specifications (only certain levels, coordinates, or datasets; only certain signature types, etc.). See Pipeline execution below for details.
Pipeline scripts
Pipeline scripts are used to produce CC signatures, models and analyses. These scripts are typically run on the `pac-one` cluster at IRB Barcelona.
Broadly speaking, the CC pipeline has two modalities:
- Dataset addition:
- The resource is fully updated every 6 months.
- Datasets may be added sporadically.
- Mapping of data to a dataset:
- Individual querying
- Connectivity
Dataset addition
The resource update will happen every 6 months.
Pipeline
- `levels`: all datasets in all coordinates of the given level.
- `coordinates`:
- `datasets`: None
- `xxx`:
Pipeline execution
The pipeline will typically be run on the `pac-one` cluster.
Once the reference is done, with all of the models, one can run the pipeline for any dataset (including the `full` dataset).
Fit and produce the models
Predict for any dataset
The arguments should be, at least:
- `--datasets`: Datasets `A1.001`-`E5.999` to calculate. One can also specify the level `A`-`E` or the coordinate `A1`-`E5`. All are considered by default.
- `--matrices`: What matrices to keep (e.g. `sign0`).
- `--only_exemplar`: Calculate only exemplar datasets.
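A sketch of what this command-line interface could look like. The flag names come from the list above; `get_parser`, `select_datasets` and the placeholder dataset list are illustrative assumptions:

```python
import argparse

ALL_DATASETS = ["A1.001", "A1.002", "B4.001", "E5.999"]  # placeholder list

def select_datasets(spec, available):
    """Expand a --datasets value: a full code ('A1.001'), a coordinate
    ('A1') or a level ('A'). With no value, all datasets are considered."""
    if spec is None:
        return list(available)
    return [d for d in available if d.startswith(spec)]

def get_parser():
    parser = argparse.ArgumentParser(description="Run the CC pipeline")
    parser.add_argument("--datasets", default=None,
                        help="Datasets (A1.001-E5.999), level (A-E) or coordinate (A1-E5)")
    parser.add_argument("--matrices", default=None,
                        help="Matrices to keep, e.g. sign0")
    parser.add_argument("--only_exemplar", action="store_true",
                        help="Calculate only exemplar datasets")
    return parser
```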
A linear view of the pipeline
Below I sequentially list the steps of the pipeline. This is a linear and qualitative view of the pipeline and does not necessarily correspond to the organization of scripts in the repository.
- Download data.
- We need a SQL table specifying, for each download file, at least whether it is complete or not. Files that are internal to the SB&NB are downloaded from an `ftp` repository or the like anyway.
- If the data are complete, or the data have not been updated since the last CC update, don't download; just copy/move from the previous CC version.
- After this step, all data and libraries should be stored on disk.
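Such a bookkeeping table could look as follows (SQLite for illustration; the column names and the `needs_download` helper are assumptions, not the CC's actual schema):

```python
import sqlite3

# One row per download file, tracking completeness and upstream updates.
SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    filename     TEXT PRIMARY KEY,
    url          TEXT,
    completed    INTEGER NOT NULL DEFAULT 0,  -- 0/1 flag
    last_updated TEXT  -- ISO date of the last upstream update, if known
);
"""

def needs_download(conn, filename, last_cc_update):
    """Return False when we can copy/move the file from the previous CC
    version: it is complete, or unchanged since the last CC update."""
    row = conn.execute(
        "SELECT completed, last_updated FROM downloads WHERE filename = ?",
        (filename,)).fetchone()
    if row is None:
        return True  # never attempted: download it
    completed, last_updated = row
    if completed or (last_updated and last_updated <= last_cc_update):
        return False
    return True
```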
- Read small molecule structures.
- Get all molecules in each downloaded file as an `(id, smiles)` file.
- Standardize SMILES strings and save `molrepo` files.
- Insert `inchikey` and `inchi` into the `structures` SQL table.
- Calculate physicochemical parameters and insert them into the `physchem` SQL table.
- Draw the molecule and save it to `./molecules/<IN>/<CH>/<INCHIKEY>/2d.svg`.
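The `./molecules/<IN>/<CH>/<INCHIKEY>/` location can be derived from the InChIKey alone, since `<IN>` and `<CH>` are its first two and next two characters (compare the folder structure above). A minimal sketch (hypothetical helper name):

```python
import os

def molecule_dir(root, inchikey):
    """Shard the ./molecules tree by the first two and next two characters
    of the InChIKey, e.g. RZVAJINKPMORJF-UHFFFAOYSA-N -> RZ/VA/..."""
    return os.path.join(root, "molecules", inchikey[:2], inchikey[2:4], inchikey)
```

The `2d.svg` drawing, the `sims*.h5` files and the per-molecule JSON files all live inside this folder.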
- Calculate compound properties
- This means calculating fingerprints, scaffolds, structural keys, etc.
- Property files are ever-growing and should be stored efficiently (e.g. in a binary format). It has to be easy to insert/append new molecules to these files. These files can be found in `./molproperties`.
- Property calculation may sometimes fail, due to e.g. extreme molecular complexity. In these cases, we need to keep track of which molecules we have already attempted to calculate, and not do it again.
- Do validation sets
- Validation sets are `(inchikey, inchikey, 0/1)` files denoting whether two molecules are similar (`1`) or different (`0`) in a certain aspect.
- We can store an arbitrary number of validation sets.
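Such a file can be parsed along these lines (illustrative sketch; the function name is an assumption):

```python
import csv

def read_validation_set(fileobj):
    """Parse a tab-separated (inchikey, inchikey, 0/1) file into labeled
    pairs: 1 means the two molecules are similar, 0 that they differ."""
    pairs = []
    for key_a, key_b, label in csv.reader(fileobj, delimiter="\t"):
        pairs.append((key_a, key_b, int(label)))
    return pairs
```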
- Preprocess datasets
- This step is specific to each dataset. The goal is to go from one (or more) download file, or molproperties file, to a `sign0.h5` file stored in `./full`.
- Some datasets really require a lot of processing, and we need to come up with a way to monitor them and not re-do them every time we run the CC.
- We need to start with the datasets that have `subspace = None`.
- Then, we proceed to the rest of the datasets (`subspace != None`). These are, typically, chemistry datasets for which we want to constrain the space, for instance, to bioactive molecules (i.e. molecules that appear in the exemplar `B`-`E` levels). See the dataset documentation for more information.
- Remove (near-)duplicates.
- The `sign0.h5` files may contain duplicates and near-duplicates, and it is important to remove them to obtain a good reference dataset.
- Save the non-redundant `sign0.h5` under `./reference`.
- Save the 1-to-many mapping of the groups of (near-)duplicates.
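The grouping logic can be sketched as follows. This version collapses exact duplicates only; a real near-duplicate step would replace the exact-match key with, e.g., a distance cutoff. All names are illustrative:

```python
def deduplicate(rows):
    """Collapse duplicate signature rows.

    `rows` is a list of (molecule_id, signature) pairs. Returns the
    non-redundant rows (one representative per group) and the 1-to-many
    mapping from each representative to all members of its group."""
    groups, order = {}, []
    for mol_id, signature in rows:
        key = tuple(signature)
        if key not in groups:
            groups[key] = []
            order.append(key)
        groups[key].append(mol_id)
    reference = [(groups[k][0], list(k)) for k in order]
    mapping = {groups[k][0]: groups[k] for k in order}
    return reference, mapping
```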
- Fit signatures type 1.
- For continuous datasets, do PCA.
- For discrete datasets, do LSI.
- Keep the `sign1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
- Fit k-means clusters.
- Do an automated k-means clustering to discover centroids and groups in the data.
- Keep the `clus1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
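The clustering step could be prototyped along these lines (plain-Python k-means for illustration; the actual pipeline would use an optimized implementation, plus some automated way of choosing k):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's k-means on a list of coordinate tuples.
    Returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize on k random points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels
```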
- Fit nearest-neighbor search.
- Do a nearest-neighbor search to discover pairs of similar molecules.
- Keep the `neig1.h5` file under `./reference`.
- Save the models for persistence.
- Do validation plots.
- Obtain signatures type 2.
- These are the node2vec embeddings derived from the similarity network.
- First, a network is built (under a certain cutoff).
- Then, a random walk algorithm is run.
- Finally, word2vec embeds nodes (words) by reading the trajectories of the random walker.
- Keep the `sign2.h5` file under `./reference`.
- Do validation plots.
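The random-walk stage can be sketched as follows. Note that node2vec proper biases the transition probabilities with its p/q parameters; this illustrative version walks uniformly. The walks are the "sentences" later fed to word2vec:

```python
import random

def random_walks(graph, walk_length, n_walks, seed=0):
    """Generate `n_walks` uniform random walks per node over an adjacency
    dict mapping each node to the list of its neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```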
- Fit signatures type 2
- Unfortunately, node2vec does not work with out-of-sample data, so we need to learn a mapping between signatures type 1 and signatures type 2.
- This can be done with deep learning, preferably with automated deep learning (e.g. AdaNet).
- Save the models for persistence.
- Obtain 2D projections
- Vanilla 2D projection of signatures type 1 does not work out-of-the-box with t-SNE. A bit of pre-processing is needed. We are still deciding what's best. In any case, it will require the calculation of an intermediate matrix.
- From this intermediate matrix, run t-SNE to obtain the projection.
- Keep the `proj1.h5` file under `./reference`.
- Do validation plots.
- Fit 2D projections
- Here again, t-SNE has no out-of-sample method, so we need to learn a mapping between signatures type 1 and the 2D projections.
- I suggest using AdaNet in this case, too.
- Save the models for persistence.
Once reference calculations are done, we can move to the full dataset.
- Predict signatures type 1.
- If the molecule is in the reference (or is a near-duplicate of it), take the signature.
- Else, use the persistent model to predict.
- Keep the `sign1.h5` file under `./full`.
- Do validation plots.
- Predict k-means clusters.
- Do 1-to-many or predict, as necessary.
- Keep the `clus1.h5` file under `./full`.
- Do validation plots.
- Predict nearest neighbors.
- Beware that there should be two modalities here:
- Full-vs-reference (1-to-many or predict, as necessary); this is the one I would keep.
- Full-vs-full (requires fitting again).
- Keep the `neig1.h5` file under `./full`.
- Do the validation plots.
- Predict signatures type 2.
- Do 1-to-many or predict, as necessary.
- Keep the `sign2.h5` file under `./full`.
- Do the validation plots.
- Predict 2D projections.
- Do 1-to-many or predict, as necessary.
- Keep the `proj1.h5` file under `./full`.
- Do the validation plots.
Points 1-19 are applicable to any dataset. Comparison of CC datasets is, for now, only among exemplary ones. From here on, we only perform the calculations on these 25 exemplary datasets.
- Link exemplary to full datasets
- In `./exemplary`, keep the corresponding signature files available from `./full`.
- It is not necessary for the signature files to be copied; they can just be linked with a pointer.
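The pointer can simply be a symbolic link, mirroring the `[link-to-full]` entries in the folder tree above (hypothetical helper):

```python
import os

def link_signature(full_path, exemplary_path):
    """Expose a file from ./full inside ./exemplary as a symlink
    rather than a copy."""
    os.makedirs(os.path.dirname(exemplary_path), exist_ok=True)
    if os.path.islink(exemplary_path):
        os.remove(exemplary_path)  # refresh a stale link
    os.symlink(full_path, exemplary_path)
```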
- Calculate full similarities
- Using signatures type 1.
- Add datasets to `sims1.h5` (`*_obs`) under `./molecules`.
- Calculate correlations between datasets.
- Conditional similarities.
- Paired conditional similarities.
- Cluster paired conditional similarities.
- Similarity rank comparisons.
- Canonical correlation analysis, performed with signatures type 2.
- Identify patterns of expected similarity distributions
- This generates clusters of distributions that are useful to perform similarity inferences.
- Save them under `./exemplary`.
- Prepare for inference.
- Produce correlation matrices and placeholders necessary for inference.
- Save them under `./exemplary`.
- Infer similarities
- Add datasets to `sims.h5` (`*_prd`) under `./molecules`.
- Calculate CC scores.
- Popularity
- Singularity
- Mappability
Please beware that, for simplicity, I have omitted processes that are relevant to the CC web app. Some of these are related to the SQL database:
- Saving x-y limits of the projections.
- Updating the PubChem entries (name, synonyms, etc.).
- Filling up the table of targets to show.
New data mapping
Data access scripts
The CC resource provides a very simple API to query the data within the resource. In brief, the API facilitates the following:
- Given one or more `inchikeys`:
- It lists the corresponding `datasets`.
- It returns the InChI structure(s).
- If a `dataset` is specified, it returns the signatures (`sign*`) or vectorial data (`neig*`, `clus*`, etc.) of choice.
- Given one or more `datasets`:
- It lists the corresponding `inchikeys`.
- It returns the signatures (`sign*`) or vectorial data (`neig*`, `clus*`, etc.) of choice.
- It deletes the dataset from the resource (might be convenient when we are testing datasets).
- Given one or more `levels` or `coordinates`:
- It lists the corresponding `datasets`.
- It indicates the exemplar `dataset(s)`.
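A sketch of what such an API could look like, backed here by an in-memory index for illustration; the class and method names are assumptions, not the CC's actual interface:

```python
class CCData:
    """Toy query interface over a dataset -> inchikeys index."""

    def __init__(self):
        self._index = {}  # dataset code -> set of inchikeys

    def add(self, dataset, inchikeys):
        self._index.setdefault(dataset, set()).update(inchikeys)

    def datasets(self, inchikey):
        """Given an inchikey, list the datasets where it appears."""
        return sorted(d for d, keys in self._index.items() if inchikey in keys)

    def inchikeys(self, dataset):
        """Given a dataset, list its inchikeys."""
        return sorted(self._index.get(dataset, ()))

    def delete(self, dataset):
        """Remove a dataset from the resource (handy when testing datasets)."""
        self._index.pop(dataset, None)
```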