Chemical Checker
The Chemical Checker (CC) is a data-driven resource of small molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in five levels of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition and morphological profiles. The CC differs from other integrative compound databases in almost every aspect. The classical, relational representation of the data is surpassed here by less explicit, machine-learning-oriented abstractions of the data (see the CC Manifesto).
The CC is an ever-growing resource maintained by the Structural Bioinformatics & Network Biology Laboratory at the Institute for Research in Biomedicine (IRB Barcelona). Should you have any inquiries, please send an email to miquel.duran@irbbarcelona.org.
The CC was first released to the scientific community in the following paper: Duran-Frigola et al., Expanding the small molecule similarity principle to all levels of biology (2019), and has since produced a number of related publications.
Source data and datasets
The CC capitalizes on publicly available bioactivity data fetched from different sources. We are committed to updating the CC resource every 6 months.
The basic data unit of the CC is the dataset. There are 5 levels of increasing biological complexity (A: Chemistry, B: Targets, C: Networks, D: Cells and E: Clinics) and, in turn, each level is divided into five sublevels (1--5), denoting different types of data. In the CC terminology, a dataset is a set of compounds belonging to a certain source and having a certain data type. Each dataset belongs to one and only one of the 25 categories.
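As a quick illustration, the 25 categories can be enumerated from the level letters and sublevel digits (actual dataset identifiers in the CC additionally carry a source suffix, which is omitted here):

```python
# Enumerate the 25 CC dataset categories: five levels (A-E) times
# five sublevels (1-5). Level names follow the description above.
levels = {
    "A": "Chemistry",
    "B": "Targets",
    "C": "Networks",
    "D": "Cells",
    "E": "Clinics",
}
coordinates = [f"{letter}{sublevel}" for letter in levels for sublevel in range(1, 6)]
print(coordinates[0], coordinates[-1])  # A1 E5
print(len(coordinates))                 # 25
```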
For further information, please refer to:
- Bioactivity data sources
- Chemical libraries
- Meta-data
- Full list of datasets and file-system organization
Signaturization of the data
The main task of the CC is to convert raw data into formats that are suitable inputs for machine learning toolboxes such as sklearn, keras or tensorflow.
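For instance, once a dataset has been signaturized into a fixed-length matrix (one row per molecule), it can be fed to any scikit-learn estimator directly. The sketch below uses random numbers in place of real CC signatures and hypothetical activity labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for a CC signature matrix: one fixed-length vector per
# molecule. Real matrices come from the CC; values here are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))    # 200 molecules, 128 features
y = rng.integers(0, 2, size=200)   # hypothetical binary activity labels

# Any sklearn estimator accepts this matrix as-is.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X[:3]).shape)  # (3,)
```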
As such, the backbone scripts of the CC are devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main asset of the CC are the so-called CC signatures:
Signatures | Abbreviation | Description | Advantages | Disadvantages |
---|---|---|---|---|
Type 0 | dataset | Raw dataset data, expressed in matrix format. | Explicit data. | Possibly sparse, heterogeneous, un-processed. |
Type 1 | sig | PCA/LSI projections of the data, accounting for 90% of the variance. | Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning. | Variable dimensions; they may still be sparse. |
Type 2 | netemb | Network embedding of the similarity network. | Fixed-length, usually acceptably short. Suitable for machine learning. Captures global properties of the similarity network. | Information leak due to similarity measures. Hyper-parameter tuning. |
Type 3 | fullnetemb | Network embedding of the inferred similarity network. | Fixed dimension and available for any molecule. | Possibly very noisy, hence useless, especially for low-data datasets. |
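The projection behind type 1 signatures can be sketched with scikit-learn's PCA, which accepts a target explained-variance ratio directly (random data stands in for a real type 0 matrix; the LSI variant used for sparse datasets is omitted here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for a type 0 (raw) signature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

# Passing a float as n_components keeps just enough components to
# explain 90% of the variance, mirroring the type 1 definition above.
pca = PCA(n_components=0.90).fit(X)
X_sig1 = pca.transform(X)
print(X_sig1.shape[0])                              # 300 molecules
print(pca.explained_variance_ratio_.sum() >= 0.90)  # True
```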
There are also other important types of data:
Name | Abbreviation | Description |
---|---|---|
Nearest neighbors | nneigh | Nearest neighbors using a distance metric of choice, typically the cosine distance. |
Clusters | clust | Clusters or partitions of the data, typically obtained with a simple clustering algorithm such as k-means. |
2D projections | proj | 2D representations of the data, typically obtained with t-SNE. |
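The first two data types can be sketched with scikit-learn, again on random data in place of real signatures (the 2D projections are analogous, with t-SNE in place of k-means):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

# Toy fixed-length signatures: 100 molecules, 16 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))

# Nearest neighbors under the cosine distance (the 'nneigh' data type).
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:1])
print(idx.shape)  # (1, 5)

# Simple k-means partition of the data (the 'clust' data type).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
print(len(set(labels)))
```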
All data in the CC resource are stored as HDF5 files. For further information, please refer to:
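A minimal sketch of writing and reading such a file with h5py is shown below; the dataset names used here ('V' for the matrix, 'keys' for molecule identifiers) are illustrative, so check the actual CC files for their exact layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Toy signature matrix plus molecule identifiers.
X = np.random.default_rng(0).normal(size=(10, 4))
keys = np.array([f"MOL{i:03d}" for i in range(10)], dtype="S")

# Write both arrays to an HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "toy_sign.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("V", data=X)
    f.create_dataset("keys", data=keys)

# Read them back.
with h5py.File(path, "r") as f:
    shape = f["V"].shape
    first_key = f["keys"][0]
print(shape)      # (10, 4)
print(first_key)  # b'MOL000'
```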
Similarity searches in the web
Similarity searches can be performed at a high level using the CC resource, available at http://chemicalchecker.org. This resource is limited to the 25 exemplar datasets of the CC.
In the Main page, the user can query small molecules and obtain an overview of their location inside the CC. The user will learn the CC datasets where these molecules have data available, with gray 2D density plots indicating whether they are peripheral (low-density regions) or central (high-density regions). To have a better sense of the location of query molecules, landmark compounds from popular collections can be displayed. Deeper insights can be obtained by clicking on the Explore button for a molecule of choice.
In the Explore page, we look for similar molecules in the CC database and display them in a 25-column table, corresponding to all CC datasets. In CC datasets where the molecule is present, we measure similarities to other molecules in the dataset. If the molecule is not present in a dataset, similarities to the molecules in that dataset are inferred instead.
For further information, please refer to:
Connectivity
The CC contains both chemical and biological signatures. One of the most interesting features of biological signatures is that they can be connected to signatures derived from other biological entities. This idea was first popularized by the Connectivity Map in the context of gene expression data.
In the CC, we generalize the notion of connectivity to other types of data and provide functionalities to connect small molecules to other biologically annotated entities such as diseases, cell lines or genetic perturbation experiments. Some examples would be:
- A molecule whose gene expression profile is opposite to a disease-characteristic gene expression signature.
- Simply, a molecule whose targets are nearby in protein-protein interaction networks.
- A molecule whose gene-sensitivity profiles resemble a basal gene expression of a cell line.
Finding the right connectivity strategy requires a deep understanding of the datasets and, with the CC, we simplify this by manually assigning connectivity functions to each dataset. Some datasets cannot be connected (such as chemical fingerprints), and some others may enjoy different connectivity functions.
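As a toy illustration of one possible connectivity function, the sketch below scores a molecule's expression signature against a disease signature by Pearson correlation, where strong anti-correlation suggests a "reverting" relationship. All names and data are hypothetical; the CC assigns a suitable function to each dataset:

```python
import numpy as np

def connectivity_score(mol_sig, disease_sig):
    # Pearson correlation between two signatures; strongly negative
    # values suggest the molecule "reverts" the disease signature.
    return float(np.corrcoef(mol_sig, disease_sig)[0, 1])

# Hypothetical disease signature and a molecule that reverts it.
rng = np.random.default_rng(0)
disease = rng.normal(size=500)
reverting_molecule = -disease + rng.normal(scale=0.1, size=500)

score = connectivity_score(reverting_molecule, disease)
print(score < -0.9)  # True: near-perfect anti-correlation
```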
For further information, please refer to:
Customary drug discovery tasks
We are currently most interested in making the CC resource available to everyone and identifying standard applications that are of use to computational drug discoverers. Below, we list some such applications:
- Library characterization:
- Massive property prediction:
- Similarity-based prediction: This work emerges from earlier work by David Amat (@damat) named TargetMate. We are currently evolving it into a simple kernel predictor that can take into account label correlation as well.
- Automated machine learning: This is the work of Modesto Orozco (@morozco), who is currently using the AutoML TPOT library.
- DeepChem integration: The DeepChem library is an outstanding chemoinformatics toolbox that incorporates a number of machine learning algorithms with seamless integration with tensorflow and similar libraries. Moreover, DeepChem contains MoleculeNet, a collection of benchmark sets related, among others, to biophysical properties and physiological outcomes of compounds. Conveniently, DeepChem contains several featurizers, i.e. functions that are able to convert a compound structure to a vector format, typically representing its chemistry. We branch DeepChem to include cc_featurizers able to convert, in principle, any compound structure to the corresponding signature. We've started doing so with Type 2 signatures. This is the work of Martino Bertoni (@mbertoni).
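A standalone sketch of what such a featurizer might look like is shown below. The class name, lookup table and zero-vector fallback are all hypothetical; the actual cc_featurizers live in the CC branch of DeepChem and fetch real type 2 signatures through DeepChem's featurizer interface:

```python
import numpy as np

class CCSignatureFeaturizer:
    """Hypothetical featurizer backed by precomputed CC signatures."""

    def __init__(self, signature_lookup):
        # Maps a molecule identifier (e.g. an InChIKey) to its signature.
        self.signature_lookup = signature_lookup

    def featurize(self, molecule_ids):
        # Return one fixed-length vector per molecule; unknown molecules
        # get a zero vector of matching dimension.
        dim = len(next(iter(self.signature_lookup.values())))
        rows = [self.signature_lookup.get(m, np.zeros(dim)) for m in molecule_ids]
        return np.array(rows)

# Toy lookup table standing in for real type 2 signatures.
lookup = {"MOL_A": np.ones(128), "MOL_B": np.full(128, 2.0)}
feat = CCSignatureFeaturizer(lookup)
X = feat.featurize(["MOL_A", "MOL_B", "UNKNOWN"])
print(X.shape)  # (3, 128)
```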