The Chemical Checker repository
The Chemical Checker (CC) is a data-driven resource of small molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in 5 levels of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition, and morphological profiles.
The CC is maintained by the Structural Bioinformatics & Network Biology Laboratory at the Institute for Research in Biomedicine (IRB Barcelona). Should you have any questions, please send an email to firstname.lastname@example.org.
This project was first presented to the scientific community in the following paper: Duran-Frigola et al., Extending the small molecule similarity principle to all levels of biology (2019), and has some related publications.
Source data and datasets
The CC is built from public bioactivity data. We are committed to updating the resource every 6 months (versions named accordingly, e.g. chemical_checker_2019_01). New datasets may be incorporated upon request.
The basic data unit of the CC is the dataset. There are 5 data levels (A Chemistry, B Targets, C Networks, D Cells and E Clinics) and, in turn, each level is divided into 5 sublevels or coordinates (A1 to E5). Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have a finite number of datasets (e.g. A1.001), one of which is selected as being exemplary.
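The naming scheme above can be illustrated with a small parser. This is only a sketch: the exact code layout (level letter, sublevel digit, a dot, and a three-digit dataset number) is an assumption inferred from the single example A1.001, and the names `DatasetCode`, `parse_dataset_code` and `all_coordinates` are hypothetical helpers, not part of the CC library.

```python
import re
from typing import List, NamedTuple

# Assumed layout, inferred from the example "A1.001":
# level letter (A-E), sublevel digit (1-5), a dot, three-digit dataset number.
DATASET_CODE = re.compile(r"^([A-E])([1-5])\.(\d{3})$")


class DatasetCode(NamedTuple):
    level: str     # one of the 5 data levels, "A" to "E"
    sublevel: int  # 1 to 5 within the level
    number: int    # index of the dataset within its coordinate

    @property
    def coordinate(self) -> str:
        """Coordinate the dataset belongs to, e.g. "A1"."""
        return f"{self.level}{self.sublevel}"


def parse_dataset_code(code: str) -> DatasetCode:
    """Split a dataset code such as "A1.001" into its parts."""
    match = DATASET_CODE.match(code)
    if match is None:
        raise ValueError(f"not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return DatasetCode(level, int(sublevel), int(number))


def all_coordinates() -> List[str]:
    """Enumerate the 25 coordinates, 5 levels times 5 sublevels."""
    return [f"{level}{sub}" for level in "ABCDE" for sub in range(1, 6)]
```

For instance, `parse_dataset_code("A1.001").coordinate` gives the coordinate of that dataset, and `all_coordinates()` lists every coordinate a dataset may belong to.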
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discoverers, including approved drugs, natural products, and commercial screening libraries.
For further information, please refer to:
Signaturization of the data
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning toolkits such as scikit-learn.
Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called CC signatures:
| Description | Advantages | Disadvantages |
|---|---|---|
| Raw dataset data, expressed in a matrix format. | Explicit data. | Possibly sparse, heterogeneous, unprocessed. |
| PCA/LSI projections of the data, accounting for 90% of the data. | Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning. | Variable dimensions; they may still be sparse. |
| Network-embedding of the similarity network. | Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network. | Information leak due to similarity measures. Hyper-parameter tuning. |
| Network-embedding of the inferred similarity network. | Fixed dimension and available for any molecule. | Possibly very noisy, hence useless, especially for low-data datasets. |
There are other important types of data:
- Full similarity vectors, molecule-specific. Indexed (binned) by p-value and designed to occupy little disk space.
- Nearest neighbours using a distance metric of choice, typically the cosine distance.
- Predicted nearest neighbours, obtained by exploiting correlations between datasets.
- Clusters or partitions of the data, typically obtained with a simple clustering algorithm such as k-means.
- 2D representations of the data, typically computed with t-SNE.
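As an illustration of the nearest-neighbour searches mentioned above, the following minimal sketch ranks molecules by cosine distance between their signatures. It uses plain Python lists on toy data; the function names and the dictionary layout (molecule name mapped to signature vector) are assumptions made for the example and do not reflect how the CC stores or searches its data.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, defined as 1 minus the cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbours(
    query: Sequence[float],
    signatures: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k molecules whose signatures are closest to the query."""
    distances = [
        (cosine_distance(query, signature), name)
        for name, signature in signatures.items()
    ]
    distances.sort()  # smallest distance first
    return [(name, distance) for distance, name in distances[:k]]


# Toy signatures (hypothetical molecules, two dimensions only).
toy_signatures = {
    "mol_a": [1.0, 0.0],
    "mol_b": [0.0, 1.0],
    "mol_c": [1.0, 1.0],
}
closest = nearest_neighbours([1.0, 0.1], toy_signatures, k=2)
```

A brute-force scan like this is only practical for small collections; for large ones an approximate nearest-neighbour index would be the usual choice.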
All data in the CC resource are stored as HDF5 files and can be accessed with a simple API. Measuring correlations between signatures belonging to different datasets yields a systematic assessment of the small molecule similarity principle (similar molecules have similar properties). Please follow the links below for more details:
Code structure, pipeline and access to data
This repository is divided into two parts:
In the production phase, signatures are generated. The scripts in this part of the repository thus refer to:
- The bulk update of data performed every six months.
- Sporadic additions of datasets.
- Any external data that may be mapped (compounds) or connected (other biological entities) to the existing resource, which in turn produces new signatures for the queries.
The data generated by the production phase can be easily accessed with a simple Python library.
Similarity searches on the web
Signature similarity searches can be performed at a high level using the CC web interface, available at http://chemicalchecker.org. This resource is limited to the 25 exemplary datasets of the CC.
On the Main page, the user can query small molecules and obtain an overview of their location inside the CC. The user will see the CC datasets in which these molecules have data available, with grey 2D density plots indicating whether they are peripheral (low-density regions) or central (high-density regions). To give a better sense of the location of query molecules, landmark compounds from popular compound collections can be displayed. Deeper insights can be obtained by clicking on the Explore button for a molecule of choice.
On the Explore page, we look for similar molecules in the CC resource and display them in a 25-column table, corresponding to all CC datasets. In CC datasets where the molecule is present, we measure similarities to other molecules in the dataset. If the molecule is not present, we infer similarities only to the molecules in the dataset.
Other interesting pages of the website include:
The CC contains both chemical and biological signatures. One of the most interesting features of biological signatures is that they can be connected to signatures of biology, opening the way to unsupervised learning. The connectivity idea was first popularized by the Connectivity Map in the context of gene expression data.
In the CC we generalize this notion to other types of data and provide functionalities to connect small molecules to other biologically-annotated entities such as diseases, cell lines or genetic perturbation experiments. Some examples would be:
- A molecule whose gene expression profile is opposite to a disease-characteristic gene expression signature.
- Simply, a molecule whose targets are close to disease-related genes in protein-protein interaction networks.
- A molecule whose gene-sensitivity profile resembles the basal gene expression of a cell line.
Finding the right connectivity strategy requires a deep understanding of the datasets and, with the CC, we simplify this by pre-assigning connectivity functions to each dataset. Please note that some datasets cannot be connected to biology (e.g. 2D chemical fingerprints), whereas some others can be connected by different means (e.g. reversion/mimicking, global/local, etc.).
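To make the reversion/mimicking distinction concrete, here is a toy sketch that compares a molecule's gene expression profile with a disease signature over the same genes. It is not one of the CC's pre-assigned connectivity functions: the use of Pearson correlation, the threshold of 0.5, and the function names are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant profiles")
    return cov / math.sqrt(var_x * var_y)


def connectivity(
    molecule_profile: Sequence[float],
    disease_signature: Sequence[float],
    threshold: float = 0.5,  # arbitrary cut-off, for illustration only
) -> str:
    """Label a molecule-disease pair by the direction of the correlation."""
    score = pearson(molecule_profile, disease_signature)
    if score <= -threshold:
        return "reversion"  # molecule opposes the disease signature
    if score >= threshold:
        return "mimicking"  # molecule resembles the disease signature
    return "none"


# Hypothetical expression changes for four genes.
disease = [2.0, -1.0, 0.5, -1.5]
reverting_molecule = [-2.0, 1.0, -0.5, 1.5]
label = connectivity(reverting_molecule, disease)
```

A molecule labelled "reversion" in this toy setting corresponds to the first example in the list above, where the molecule's profile is opposite to the disease-characteristic signature.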
For a thorough explanation of the connectivity strategies, please visit: