# Chemical Checker

The Chemical Checker (CC) is a data-driven resource of small-molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in **5 levels** of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition, and morphological profiles. The CC differs from other integrative compound databases in almost every aspect. The classical, relational representation of the data is surpassed here by less explicit, machine-learning-friendly abstractions of the data (see the CC [Manifesto](manifesto)).

The CC is an ever-growing resource maintained by the [Structural Bioinformatics & Network Biology Laboratory](http://sbnb.irbbarcelona.org) at the Institute for Research in Biomedicine ([IRB Barcelona](http://irbbarcelona.org)). Should you have any inquiries, please send an email to [miquel.duran@irbbarcelona.org](mailto:miquel.duran@irbbarcelona.org).

The CC was first released to the scientific community in the following paper: [D

## Source data and datasets

The CC capitalizes on publicly available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months**, and versions are named accordingly (e.g. `chemical_checker_2019_01`).

The basic data units of the CC are the *datasets*. There are 5 *levels* (`A` Chemistry, `B` Targets, `C` Networks, `D` Cells and `E` Clinics) and, in turn, each level is divided into 5 sublevels or *coordinates* (`A1`-`E5`), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. `A1.001`).

The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discovery, including approved drugs, natural compounds and commercial screening libraries.
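
The level, coordinate and dataset naming described above can be illustrated with a small sketch. The helper names (`parse_dataset_code`, `all_coordinates`) are hypothetical and not part of the CC code base, and the three-digit dataset suffix is an assumption inferred from the single example `A1.001`.

```python
import re
from typing import List, NamedTuple

# Level letters and names as given in the text.
LEVELS = {"A": "Chemistry", "B": "Targets", "C": "Networks", "D": "Cells", "E": "Clinics"}

# Assumption: a dataset code is a coordinate (A1..E5), a dot, and three digits,
# as suggested by the example `A1.001`. This is not an official specification.
_DATASET_CODE = re.compile(r"([A-E])([1-5])\.(\d{3})")


class DatasetCode(NamedTuple):
    level: str       # e.g. "A"
    coordinate: str  # e.g. "A1"
    number: int      # e.g. 1 for "A1.001"


def parse_dataset_code(code: str) -> DatasetCode:
    """Split a dataset code such as 'A1.001' into level, coordinate and number."""
    match = _DATASET_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return DatasetCode(level, level + sublevel, int(number))


def all_coordinates() -> List[str]:
    """Enumerate the 25 coordinates, from 'A1' to 'E5'."""
    return [f"{level}{sublevel}" for level in LEVELS for sublevel in range(1, 6)]


if __name__ == "__main__":
    code = parse_dataset_code("A1.001")
    print(code.coordinate, LEVELS[code.level], code.number)
    print(len(all_coordinates()))
```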
For further information, please refer to:

* [Bioactivity data sources](data)
* [Compound libraries](libraries)
* [Resource meta-data](database)
* [Datasets and file-system organization](datasets)

## Signaturization of the data

The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning tools such as `sklearn`, `keras` or `tensorflow`.

Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called *CC signatures*:

| Signatures | Abbreviation | Description | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |