|
# Datasets
|
|
# Datasets
|
|
|
|
|
|
The main xxx |
|
## Level, coordinates and datasets
|
|
\ No newline at end of file |
|
|
|
|
|
The CC is devided in five levels of increasing complexity:
|
|
|
|
|
|
|
|
|Level|Name|Description|
|
|
|
|
|---|---|---|
|
|
|
|
|`A`|Chemistry|Chemical properties of the compounds.|
|
|
|
|
|`B`|Targets|Chemical-protein interactions.|
|
|
|
|
|`C`|Networks|Higher-order effects of small molecules.|
|
|
|
|
|`D`|Cells|Readouts of compound cell-based assays.|
|
|
|
|
|`C`|Clinics|Clinical data of drugs and environmental chemicals.|
|
|
|
|
|
|
|
|
In turn, each level is divided in 5 sublevels representing different aspects of the data. Each sublevel has an *exemplar* dataset, as described below:
|
|
|
|
|
|
|
|
|Coordinate|Name|Description|
|
|
|
|
|---|---|---|
|
|
|
|
|`A1`|2D fingerprints|Binary representation of the 2D structure of a molecule. The neighborhood of every atom is encoded using circular topology hashing.|
|
|
|
|
|`A2`|3D fingerprints|Similar to `A1`, the 3D structures of the three best conformers after energy minimization are hashed into a binary representation without the need for structural alignment.|
|
|
|
|
|`A3`|Scaffolds|Largest molecular scaffold (usually a ring system) remaining after applying Murcko’s pruning rules. Additionally, we keep the corresponding framework, i.e. a version of the scaffold where all atoms are carbons and all bonds are single. The scaffold and the framework are encoded with path-based 1024-bit fingerprints, suitable for capturing substructures in similarity searches.|
|
|
|
|
|`A4`|Structural keys|166 functional groups and substructures widely accepted by medicinal chemists (MACCS keys).|
|
|
|
|
|`A5`|Physicochemistry|Physicochemical properties such as molecular weight, logP and refractivity. Number of hydrogen-bond donors and acceptors, rings, etc. Drug-likeness measurements e.g. number of structural alerts, Lipinski’s rule-of-5 violations or chemical beauty (QED).|
|
|
|
|
|`B1`|Mechanism of action|Drug targets with known pharmacological action and modes (agonist, antagonist, etc.).|
|
|
|
|
|`B2`|Metabolic genes Drug metabolizing enzymes, transporters and carriers.|
|
|
|
|
|`B3`|Crystals|Small molecules co-crystalized with protein chains. Data is organized according to the structural families of the protein chains.|
|
|
|
|
|`B4`|Binding|Compound--protein binding data available in major public chemogenomics databases. Data mainly comes from academic publications and patents. Only binding affinities below a class-specific threshold are kept (kinases ≤ 30 nM, GPCRs ≤ 100 nM, nuclear receptors ≤ 100 nM, ion channels ≤ 10 uM and others ≤ 1 uM).|
|
|
|
|
|`B5`|HTS bioassays|Hits from screening campaigns against protein targets (mainly confirmatory functional assays below 10 uM).|
|
|
|
|
|`C1`|Biological roles|Ontology terms associated to small molecules with recognized biological roles, such as known drugs, metabolites and other natural products.|
|
|
|
|
|`C2`|Metabolic network|Curated reconstruction of human metabolism, containing metabolites and reactions. Data is represented as a network where nodes are metabolites and edges connect substrates and products of reactions.|
|
|
|
|
|`C3`|Canonical pathways|Canonical pathways related to the known receptors of compounds (as recorded in `B4`). Pathways are assigned via a guilt-by-association approach, i.e. a molecule is related to a pathway if at least one of the molecule targets is a member of it.|
|
|
|
|
|`C4`|Biological processes|Similar to `C3`, biological processes from the gene ontology are associated to compounds via a guilt-by-association approach from `B4` data. All parent terms are kept, from the leaves of the ontology to its root.|
|
|
|
|
|`C5`|Interactomes|Neighborhoods of `B4` targets are collected by inspecting several large protein-protein interaction networks. A random-walk algorithm is used to obtain a robust measure of 'proximity' in the network.|
|
|
|
|
|`D1`|Gene expression|Transcriptional response of cell lines upon exposure to small molecules. A well-documented reference dataset of gene expression profiles is used to map all compound profiles using a two-sided gene set enrichment analysis.
|
|
|
|
|`D2`|Cancer cell lines|Small molecule sensitivity data (GI50) of a panel of 60 cancer cell lines.|
|
|
|
|
|`D3`|Chemical genetics|Growth inhibition profiles in a panel of ~300 yeast mutants. Data are combined with yeast genetic interaction data, so that compounds can be assimilated to genetic alterations when they have similar profiles.|
|
|
|
|
|`D4`|Morphology|Changes in U-2 OS cell morphology measured after compound treatment using a multiplexed-cytological cell painting assay. 812 morphology features are recorded via automated microscopy and image analysis.|
|
|
|
|
|`D5`|Cell bioassays|Small molecule cell bioassays reported in ChEMBL, mainly growth and proliferation measurements found in the literature.|
|
|
|
|
|`E1`|Therapeutic areas|Anatomical Therapeutic Chemical (ATC) codes of drugs. All ATC levels are considered.|
|
|
|
|
|`E2`|Indications|Indications of approved drugs and drugs in clinical trials. A controlled medical vocabulary is used.|
|
|
|
|
|`E3`|Side effects|Side effects extracted from drug package inserts via text-mining techniques.|
|
|
|
|
|`E4`|Disease phenotypes|Manually curated relationships between chemicals and diseases. Chemicals include drug molecules and environmental substances, among others.|
|
|
|
|
|`E5`|Drug-drug interactions|Changes in the effect of a drug when it is taken together with a second drug. Drug-drug interactions may alter pharmacokinetics and/or cause side effects.| |