Datasets

Levels, coordinates and datasets

The CC is divided into five levels of increasing complexity:

Level	Name	Description
`A`	Chemistry	Chemical properties of the compounds.
`B`	Targets	Chemical-protein interactions.
`C`	Networks	Higher-order effects of small molecules.
`D`	Cells	Readouts of compound cell-based assays.
`E`	Clinics	Clinical data of drugs and environmental chemicals.

In turn, each level is divided into five sublevels representing different aspects of the data. Each sublevel has an exemplar dataset, as described below:

Coordinate	Name	Description
`A1`	2D fingerprints	Binary representation of the 2D structure of a molecule. The neighborhood of every atom is encoded using circular topology hashing.
`A2`	3D fingerprints	Similar to `A1`, the 3D structures of the three best conformers after energy minimization are hashed into a binary representation without the need for structural alignment.
`A3`	Scaffolds	Largest molecular scaffold (usually a ring system) remaining after applying Murcko’s pruning rules. Additionally, we keep the corresponding framework, i.e. a version of the scaffold where all atoms are carbons and all bonds are single. The scaffold and the framework are encoded with path-based 1024-bit fingerprints, suitable for capturing substructures in similarity searches.
`A4`	Structural keys	166 functional groups and substructures widely accepted by medicinal chemists (MACCS keys).
`A5`	Physicochemistry	Physicochemical properties such as molecular weight, logP and refractivity; counts of hydrogen-bond donors and acceptors, rings, etc.; and drug-likeness measurements, e.g. number of structural alerts, Lipinski’s rule-of-5 violations or chemical beauty (QED).
`B1`	Mechanism of action	Drug targets with known pharmacological action and modes (agonist, antagonist, etc.).
`B2`	Metabolic genes	Drug metabolizing enzymes, transporters and carriers.
`B3`	Crystals	Small molecules co-crystallized with protein chains. Data is organized according to the structural families of the protein chains.
`B4`	Binding	Compound-protein binding data available in major public chemogenomics databases. Data mainly comes from academic publications and patents. Only binding affinities below a class-specific threshold are kept (kinases ≤ 30 nM, GPCRs ≤ 100 nM, nuclear receptors ≤ 100 nM, ion channels ≤ 10 µM and others ≤ 1 µM).
`B5`	HTS bioassays	Hits from screening campaigns against protein targets (mainly confirmatory functional assays below 10 µM).
`C1`	Biological roles	Ontology terms associated with small molecules that have recognized biological roles, such as known drugs, metabolites and other natural products.
`C2`	Metabolic network	Curated reconstruction of human metabolism, containing metabolites and reactions. Data is represented as a network where nodes are metabolites and edges connect substrates and products of reactions.
`C3`	Canonical pathways	Canonical pathways related to the known receptors of compounds (as recorded in `B4`). Pathways are assigned via a guilt-by-association approach, i.e. a molecule is related to a pathway if at least one of the molecule's targets is a member of it.
`C4`	Biological processes	Similar to `C3`, biological processes from the Gene Ontology are associated with compounds via a guilt-by-association approach from `B4` data. All parent terms are kept, from the leaves of the ontology to its root.
`C5`	Interactomes	Neighborhoods of `B4` targets are collected by inspecting several large protein-protein interaction networks. A random-walk algorithm is used to obtain a robust measure of 'proximity' in the network.
`D1`	Gene expression	Transcriptional response of cell lines upon exposure to small molecules. A well-documented reference dataset of gene expression profiles is used to map all compound profiles using a two-sided gene set enrichment analysis.
`D2`	Cancer cell lines	Small molecule sensitivity data (GI50) of a panel of 60 cancer cell lines.
`D3`	Chemical genetics	Growth inhibition profiles in a panel of ~300 yeast mutants. Data are combined with yeast genetic interaction data, so that compounds can be assimilated to genetic alterations when they have similar profiles.
`D4`	Morphology	Changes in U-2 OS cell morphology measured after compound treatment using a multiplexed-cytological cell painting assay. 812 morphology features are recorded via automated microscopy and image analysis.
`D5`	Cell bioassays	Small molecule cell bioassays reported in ChEMBL, mainly growth and proliferation measurements found in the literature.
`E1`	Therapeutic areas	Anatomical Therapeutic Chemical (ATC) codes of drugs. All ATC levels are considered.
`E2`	Indications	Indications of approved drugs and drugs in clinical trials. A controlled medical vocabulary is used.
`E3`	Side effects	Side effects extracted from drug package inserts via text-mining techniques.
`E4`	Disease phenotypes	Manually curated relationships between chemicals and diseases. Chemicals include drug molecules and environmental substances, among others.
`E5`	Drug-drug interactions	Changes in the effect of a drug when it is taken together with a second drug. Drug-drug interactions may alter pharmacokinetics and/or cause side effects.
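
As an illustration of two rules described in the table above, the following Python sketch applies the class-specific affinity thresholds listed for `B4` and then assigns pathways by guilt by association as described for `C3`. It is a minimal sketch and not the CC implementation: the function names, the record layout, the class labels and the toy data are all hypothetical, and only the threshold values are taken from the `B4` description.

```python
# Hypothetical sketch; names and data layout are assumptions, not the CC code.
# Class-specific affinity cut-offs from the B4 description, expressed in nM.
THRESHOLDS_NM = {
    "kinase": 30.0,
    "gpcr": 100.0,
    "nuclear_receptor": 100.0,
    "ion_channel": 10_000.0,  # 10 µM
}
DEFAULT_THRESHOLD_NM = 1_000.0  # all other target classes: 1 µM


def keep_binding(target_class, affinity_nm):
    """Return True if the affinity is at or below the cut-off for its class."""
    return affinity_nm <= THRESHOLDS_NM.get(target_class, DEFAULT_THRESHOLD_NM)


def filter_bindings(records):
    """Map each compound to the set of targets that pass the threshold.

    records: iterable of (compound, target, target_class, affinity_nm).
    Compounds with no surviving target are omitted from the result.
    """
    targets = {}
    for compound, target, target_class, affinity_nm in records:
        if keep_binding(target_class, affinity_nm):
            targets.setdefault(compound, set()).add(target)
    return targets


def assign_pathways(compound_targets, pathway_members):
    """Guilt by association: link a compound to a pathway if at least
    one of its targets is a member of that pathway."""
    return {
        compound: {
            name
            for name, members in pathway_members.items()
            if targets & members
        }
        for compound, targets in compound_targets.items()
    }


# Toy data, invented for illustration only.
records = [
    ("cpd1", "EGFR", "kinase", 12.0),
    ("cpd1", "ABL1", "kinase", 250.0),    # dropped: above 30 nM
    ("cpd2", "DRD2", "gpcr", 80.0),
    ("cpd3", "HDAC1", "other", 5_000.0),  # dropped: above 1 µM
]
pathways = {
    "ErbB signaling": {"EGFR", "ERBB2"},
    "Dopaminergic synapse": {"DRD2", "DRD1"},
}

kept = filter_bindings(records)
print(assign_pathways(kept, pathways))
```

In this toy example `cpd3` does not appear in the result, because its only binding record fails the threshold for its class. The same `assign_pathways` helper could be reused for the `C4` case by passing biological-process term memberships instead of pathways, although the propagation to parent terms mentioned for `C4` is not covered here.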