The Chemical Checker repository
The Chemical Checker (CC) is a data-driven resource of small-molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in 5 levels of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition, and morphological profiles. The CC differs from other integrative compound databases in almost every aspect: the classical, relational representation of the data is replaced here by a less explicit, more machine-learning-friendly abstraction of the data (see the CC Manifesto).
The CC resource is ever-growing and maintained by the Structural Bioinformatics & Network Biology Laboratory at the Institute for Research in Biomedicine (IRB Barcelona). Should you have any questions, please send an email to firstname.lastname@example.org or email@example.com.
This project was first presented to the scientific community in the following paper: Duran-Frigola et al., Extending the small molecule similarity principle to all levels of biology (2019), and has since produced a number of related publications.
Source data and datasets
The CC capitalizes on publicly available bioactivity data fetched from many sources. We are committed to updating the resource every 6 months (versions are named accordingly, e.g. `chemical_checker_2019_01`), incorporating the latest releases of the source datasets and adding new ones upon request, after subsequent evaluation by our team.
The basic data unit of the CC is the dataset. There are 5 levels (A Chemistry, B Targets, C Networks, D Cells and E Clinics) and, in turn, each level is divided into 5 sublevels or coordinates (A1-E5), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. A1.001), one of which is selected as being exemplary.
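As an illustration, the 25 coordinates can be enumerated programmatically, and dataset codes such as A1.001 can be split into their coordinate and serial number. The helper below is hypothetical, for illustration only; it is not part of the CC API:

```python
import re

# The 5 CC levels (A through E), each with 5 sublevels = 25 coordinates.
COORDINATES = [f"{level}{sublevel}" for level in "ABCDE" for sublevel in range(1, 6)]

def parse_dataset_code(code):
    """Split a dataset code like 'A1.001' into (coordinate, dataset number).

    Hypothetical helper for illustration; not part of the CC API.
    """
    match = re.fullmatch(r"([A-E])([1-5])\.(\d{3})", code)
    if match is None:
        raise ValueError(f"not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return level + sublevel, int(number)

print(len(COORDINATES))              # 25
print(parse_dataset_code("A1.001"))  # ('A1', 1)
```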
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discoverers, including approved drugs, natural products, and commercial screening libraries.
For further information, please refer to:
Signaturization of the data
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning tools. Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called CC signatures:
| Signature | Description | Pros | Cons |
|---|---|---|---|
| Type 0 | Raw dataset data, expressed in a matrix format. | Explicit data. | Possibly sparse, heterogeneous, unprocessed. |
| Type 1 | PCA/LSI projections of the data, accounting for 90% of the variance. | Signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning. | Variable dimensions; they may still be sparse. |
| Type 2 | Network embedding of the similarity network. | Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network. | Information leak due to similarity measures. Hyper-parameter tuning. |
| Type 3 | Network embedding of the inferred similarity network. | Fixed dimension and available for any molecule. | Possibly very noisy, hence useless, especially for low-data datasets. |
There are other important types of data:
| Data type | Description |
|---|---|
| Similarities | Full similarity vectors, molecule-specific. Indexed (binned) by p-value and designed to occupy little disk space. |
| Nearest neighbours | Nearest neighbours using a distance metric of choice, typically the cosine distance. |
| Predicted nearest neighbours | Nearest neighbours predicted by exploiting correlations between datasets. |
| Clusters | Clusters or data partitions of the data. Typically obtained with a simple clustering algorithm such as k-means. |
| 2D projections | 2D representations of the data, typically obtained with t-SNE. |
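For instance, nearest neighbours under cosine distance can be computed with scikit-learn. This is a generic sketch on random stand-in data, not the CC implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
signatures = rng.normal(size=(100, 16))  # stand-in signatures for 100 molecules

# Find the 5 nearest neighbours of every molecule under cosine distance.
nn = NearestNeighbors(n_neighbors=5, metric="cosine")
nn.fit(signatures)
distances, indices = nn.kneighbors(signatures)

# Each molecule is its own nearest neighbour, at cosine distance ~0.
print(bool((indices[:, 0] == np.arange(100)).all()))  # True
```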
All data in the CC resource are stored as HDF5 files and can be accessed with a simple API. Measuring correlations between signatures belonging to different datasets yields a systematic assessment of the small-molecule similarity principle (similar molecules have similar properties). Please follow the links below for more details:
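The kind of access pattern HDF5 enables can be sketched with h5py; the file layout below (a `V` matrix plus molecule `keys`) is illustrative, not the actual CC schema:

```python
import h5py
import numpy as np

# Write a toy signature matrix with its molecule keys (illustrative layout).
with h5py.File("toy_signatures.h5", "w") as f:
    f.create_dataset("V", data=np.arange(12, dtype=np.float32).reshape(3, 4))
    f.create_dataset("keys", data=np.array([b"CHEMBL1", b"CHEMBL2", b"CHEMBL3"]))

# Read it back lazily: HDF5 lets you slice without loading the full matrix.
with h5py.File("toy_signatures.h5", "r") as f:
    row = f["V"][1]  # only this row is read from disk
    keys = [k.decode() for k in f["keys"][:]]

print(keys[1], float(row.sum()))  # CHEMBL2 22.0
```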
Resource structure, pipeline and access to the data
This repository is divided into two parts:
In the production phase, signatures are generated. The scripts in this part of the repository thus refer to:
- The bulk update of data performed every six months.
- Sporadic additions of datasets.
- Any external data that may be mapped (compounds) or connected (other biological entities) to the existing resource, which in turn produces new signatures for the queries.
The data generated by the production phase can be easily accessed with a simple Python library.
Similarity searches on the web
Signature similarity searches can be performed at a high level using the CC web interface, available at http://chemicalchecker.org. This resource is limited to the 25 exemplary datasets of the CC.
In the Main page, the user can query small molecules and obtain an overview of their location inside the CC. The user will learn in which CC datasets these molecules have data available, with grey 2D density plots indicating whether they are peripheral (low-density regions) or central (high-density regions). To get a better sense of the location of the query molecules, landmark compounds from popular compound collections can be displayed. Deeper insights can be obtained by clicking on the Explore button for a molecule of choice.
In the Explore page, we look for similar molecules in the CC resource and display them in a 25-column table, corresponding to all CC datasets. In CC datasets where the molecule is present, we measure similarities to other molecules in the dataset. If the molecule is not present, similarities to the molecules in the dataset can only be inferred.
Other interesting pages of the website include:
The CC contains both chemical and biological signatures. One of the most interesting features of these signatures is that they can be connected to the signatures of other biological entities, opening the way to unsupervised learning. The connectivity idea was first popularized by the Connectivity Map in the context of gene expression data.
In the CC we generalize this notion to other types of data and provide functionalities to connect small molecules to other biologically-annotated entities such as diseases, cell lines or genetic perturbation experiments. Some examples would be:
- A molecule whose gene expression profile is opposite to a disease-characteristic gene expression signature.
- Simply, a molecule whose targets are close to disease-related genes in protein-protein interaction networks.
- A molecule whose gene-sensitivity profiles resemble a basal gene expression of a cell line.
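As a toy illustration of the first (reversion-type) connection, one can correlate a molecule's expression signature with a disease signature: strong anti-correlation suggests the molecule may revert the disease profile. This is a simplification; real connectivity scores (e.g. in the Connectivity Map) use rank-based enrichment statistics:

```python
import numpy as np

def connectivity_score(molecule_sig, disease_sig):
    """Pearson correlation between two signatures.

    Negative values suggest reversion, positive values mimicry.
    Toy illustration only, not the CC connectivity functions.
    """
    return float(np.corrcoef(molecule_sig, disease_sig)[0, 1])

disease = np.array([2.0, -1.0, 0.5, -2.0, 1.5])
drug = -disease  # a perfect "reverter" of the disease profile

print(connectivity_score(drug, disease))  # -1.0
```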
Finding the right connectivity strategy requires a deep understanding of the datasets and, with the CC, we simplify this by pre-assigning connectivity functions to each dataset. Please note that some datasets cannot be connected to biology (e.g. 2D chemical fingerprints), whereas some others can be connected by different means (e.g. reversion/mimicking, global/local, etc.).
For a thorough explanation of the connectivity strategies, please visit:
As a research laboratory, we are committed to exploiting the CC and to implementing commonly-needed procedures that may be of use to computational drug discoverers. These applications are encapsulated as separate packages that integrate seamlessly with the central CC resource.
Similarity-based property (target) prediction:
- This package derives from work by David Amat (@damat) named TargetMate. We are currently evolving it into a simple kernel predictor that can also take label correlations into account.
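The underlying idea (not the TargetMate implementation itself) can be sketched as a nearest-neighbour classifier over signatures: a molecule inherits the targets of its most similar neighbours. The data and the "target" label below are synthetic stand-ins:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in signatures for 60 molecules; activity depends on the first dimension.
X = rng.normal(size=(60, 8))
y = (X[:, 0] > 0).astype(int)  # 1 = "binds hypothetical target T", 0 = inactive

# Similarity-based prediction: majority vote among the 5 most similar molecules.
clf = KNeighborsClassifier(n_neighbors=5, metric="cosine")
clf.fit(X, y)

query = np.zeros((1, 8))
query[0, 0] = 3.0  # strongly positive first dimension, so neighbours are actives
print(int(clf.predict(query)[0]))  # 1
```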
Automated machine learning:
- This is the work of Modesto Orozco (@morozco), who is currently using the AutoML TPOT library.
- The DeepChem library is an outstanding chemoinformatics toolbox that incorporates a number of machine learning algorithms. Moreover, DeepChem contains MoleculeNet, a collection of benchmark sets related, among others, to biophysical properties and physiological outcomes of compounds.
- Automatically 2D-project and massively predict properties for a library of interest.
- We will start to do so with the ChemistriX library by Nostrum BioDiscovery.
- Conveniently, DeepChem contains several featurizers, i.e. functions that convert a compound structure to a vector format, typically representing its chemistry. We branch DeepChem to include `cc_featurizers` able to convert, in principle, any compound structure to the corresponding CC signature. Martino Bertoni (@mbertoni) has started doing so with type 2 signatures.
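A CC featurizer would expose the same interface as other DeepChem featurizers: compound structures in, fixed-length vectors out. The sketch below mimics that interface with a hypothetical stand-in (hashed character trigrams of the SMILES string) rather than real CC signatures, and does not depend on DeepChem or RDKit:

```python
import hashlib
import numpy as np

class ToySignatureFeaturizer:
    """Hypothetical stand-in for a cc_featurizer: SMILES in, fixed-length vector out.

    A real CC featurizer would return learned signatures (e.g. type 2);
    here we hash character trigrams just to show the interface.
    """

    def __init__(self, n_dims=32):
        self.n_dims = n_dims

    def featurize(self, smiles_list):
        out = np.zeros((len(smiles_list), self.n_dims), dtype=np.float32)
        for i, smiles in enumerate(smiles_list):
            for j in range(len(smiles) - 2):
                trigram = smiles[j:j + 3].encode()
                idx = int(hashlib.md5(trigram).hexdigest(), 16) % self.n_dims
                out[i, idx] += 1.0  # count each hashed trigram
        return out

feat = ToySignatureFeaturizer()
vectors = feat.featurize(["CCO", "c1ccccc1O"])  # ethanol, phenol
print(vectors.shape)  # (2, 32)
```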