Datasets
Levels, coordinates and datasets
The CC is divided into five levels of increasing complexity:
Level | Name | Description |
---|---|---|
A | Chemistry | Chemical properties of the compounds. |
B | Targets | Chemical-protein interactions. |
C | Networks | Higher-order effects of small molecules. |
D | Cells | Readouts of compound cell-based assays. |
E | Clinics | Clinical data of drugs and environmental chemicals. |
In turn, each level is divided into five sublevels, or coordinates, representing different aspects of the data. Each coordinate has an exemplar dataset, as described below:
Coordinate | Name | Description |
---|---|---|
A1 | 2D fingerprints | Binary representation of the 2D structure of a molecule. The neighborhood of every atom is encoded using circular topology hashing. |
A2 | 3D fingerprints | Similar to A1, the 3D structures of the three best conformers after energy minimization are hashed into a binary representation without the need for structural alignment. |
A3 | Scaffolds | Largest molecular scaffold (usually a ring system) remaining after applying Murcko’s pruning rules. Additionally, we keep the corresponding framework, i.e. a version of the scaffold where all atoms are carbons and all bonds are single. The scaffold and the framework are encoded with path-based 1024-bit fingerprints, suitable for capturing substructures in similarity searches. |
A4 | Structural keys | 166 functional groups and substructures widely accepted by medicinal chemists (MACCS keys). |
A5 | Physicochemistry | Physicochemical properties such as molecular weight, logP and refractivity. Number of hydrogen-bond donors and acceptors, rings, etc. Drug-likeness measurements, e.g. number of structural alerts, Lipinski’s rule-of-5 violations or chemical beauty (QED). |
B1 | Mechanism of action | Drug targets with known pharmacological action and modes (agonist, antagonist, etc.). |
B2 | Metabolic genes | Drug metabolizing enzymes, transporters and carriers. |
B3 | Crystals | Small molecules co-crystallized with protein chains. Data is organized according to the structural families of the protein chains. |
B4 | Binding | Compound-protein binding data available in major public chemogenomics databases. Data mainly comes from academic publications and patents. Only binding affinities below a class-specific threshold are kept (kinases ≤ 30 nM, GPCRs ≤ 100 nM, nuclear receptors ≤ 100 nM, ion channels ≤ 10 µM and others ≤ 1 µM). |
B5 | HTS bioassays | Hits from screening campaigns against protein targets (mainly confirmatory functional assays below 10 µM). |
C1 | Biological roles | Ontology terms associated with small molecules with recognized biological roles, such as known drugs, metabolites and other natural products. |
C2 | Metabolic network | Curated reconstruction of human metabolism, containing metabolites and reactions. Data is represented as a network where nodes are metabolites and edges connect substrates and products of reactions. |
C3 | Canonical pathways | Canonical pathways related to the known receptors of compounds (as recorded in B4). Pathways are assigned via a guilt-by-association approach, i.e. a molecule is related to a pathway if at least one of the molecule's targets is a member of it. |
C4 | Biological processes | Similar to C3, biological processes from the Gene Ontology are associated with compounds via a guilt-by-association approach from B4 data. All parent terms are kept, from the leaves of the ontology to its root. |
C5 | Interactomes | Neighborhoods of B4 targets are collected by inspecting several large protein-protein interaction networks. A random-walk algorithm is used to obtain a robust measure of 'proximity' in the network. |
D1 | Gene expression | Transcriptional response of cell lines upon exposure to small molecules. A well-documented reference dataset of gene expression profiles is used to map all compound profiles using a two-sided gene set enrichment analysis. |
D2 | Cancer cell lines | Small molecule sensitivity data (GI50) of a panel of 60 cancer cell lines. |
D3 | Chemical genetics | Growth inhibition profiles in a panel of ~300 yeast mutants. Data are combined with yeast genetic interaction data, so that compounds can be assimilated to genetic alterations when they have similar profiles. |
D4 | Morphology | Changes in U-2 OS cell morphology measured after compound treatment using a multiplexed-cytological cell painting assay. 812 morphology features are recorded via automated microscopy and image analysis. |
D5 | Cell bioassays | Small molecule cell bioassays reported in ChEMBL, mainly growth and proliferation measurements found in the literature. |
E1 | Therapeutic areas | Anatomical Therapeutic Chemical (ATC) codes of drugs. All ATC levels are considered. |
E2 | Indications | Indications of approved drugs and drugs in clinical trials. A controlled medical vocabulary is used. |
E3 | Side effects | Side effects extracted from drug package inserts via text-mining techniques. |
E4 | Disease phenotypes | Manually curated relationships between chemicals and diseases. Chemicals include drug molecules and environmental substances, among others. |
E5 | Drug-drug interactions | Changes in the effect of a drug when it is taken together with a second drug. Drug-drug interactions may alter pharmacokinetics and/or cause side effects. |
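
To make the B4 filtering rule and the C3 guilt-by-association rule concrete, here is a minimal Python sketch. It is not the CC implementation: the function names, the class labels (`"kinase"`, `"gpcr"`, etc.) and the input layout are hypothetical, and affinities are assumed to be already expressed in nM. Only the threshold values and the "at least one target is a member" rule come from the table above.

```python
# Class-specific affinity cutoffs in nM, taken from the B4 description.
# The dictionary keys are illustrative labels, not CC identifiers.
B4_THRESHOLDS_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,  # 10 uM
    "other": 1_000.0,         # 1 uM
}


def keep_binding(target_class, affinity_nm):
    """Return True if an affinity (in nM) passes the cutoff for its target class.

    Unknown classes fall back to the "other" cutoff (an assumption of this sketch).
    """
    cutoff = B4_THRESHOLDS_NM.get(target_class, B4_THRESHOLDS_NM["other"])
    return affinity_nm <= cutoff


def pathways_by_association(targets, pathway_members):
    """Guilt by association: a molecule is related to a pathway if at least
    one of its targets is a member of that pathway."""
    target_set = set(targets)
    return {
        pathway
        for pathway, members in pathway_members.items()
        if target_set & set(members)
    }
```

For example, `keep_binding("kinase", 45.0)` is rejected while `keep_binding("gpcr", 45.0)` is kept, and a molecule with targets `["T1"]` is linked to every pathway whose member list contains `"T1"`.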
Each coordinate can contain an arbitrary number of datasets. All datasets are fully described in the PostgreSQL database and are searchable at http://chemicalchecker.org/datasets/. Each dataset receives a numbered code (e.g. A1.001).
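
The coding scheme can be illustrated with a small parser. This is a sketch under stated assumptions, not part of the CC package: `parse_dataset_code`, `DatasetCode` and `LEVELS` are hypothetical names, and the three-digit dataset number is inferred from the single example `A1.001`.

```python
import re
from typing import NamedTuple

# Level letters and names, as listed in the levels table.
LEVELS = {
    "A": "Chemistry",
    "B": "Targets",
    "C": "Networks",
    "D": "Cells",
    "E": "Clinics",
}


class DatasetCode(NamedTuple):
    level: str      # "A" to "E"
    sublevel: int   # 1 to 5
    number: int     # dataset number within the coordinate

    @property
    def coordinate(self):
        """Coordinate label, e.g. "A1"."""
        return f"{self.level}{self.sublevel}"


# Assumed format: level letter, sublevel digit, dot, three-digit number.
_CODE_RE = re.compile(r"([A-E])([1-5])\.(\d{3})")


def parse_dataset_code(code):
    """Split a code such as "A1.001" into level, sublevel and dataset number."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return DatasetCode(level, int(sublevel), int(number))


# All 25 coordinates, A1 through E5.
COORDINATES = [f"{level}{i}" for level in LEVELS for i in range(1, 6)]
```

With this sketch, `parse_dataset_code("A1.001").coordinate` gives `"A1"`, and `LEVELS["A"]` maps the level letter back to its name.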