... | ... | @@ -2,19 +2,19 @@ |
|
|
|
|
|
![cc_logo-01.svg](/uploads/aff01a127cb9572b34b9f1d48861acf0/cc_logo-01.svg)
|
|
|
|
|
|
The Chemical Checker (CC) is a data-driven resource of small molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in **5 levels** of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition, and morphological profiles. The CC is different to other integrative compounds database in almost every aspect. The classical, relational representation of the data is surpassed here by less explicit, machine-learning-friendly abstractions of the data (see the CC [Manifesto](manifesto)).
|
|
|
The Chemical Checker (CC) is a data-driven resource of small molecule bioactivity data. The main goal of the CC is to express data in a format that can be used off-the-shelf in daily computational drug discovery tasks. The resource is organized in **5 levels** of increasing complexity, ranging from the chemical properties of the compounds to their clinical outcomes. In between, we consider targets, off-targets, perturbed biological networks and several cell-based assays, including gene expression, growth inhibition, and morphological profiles. The CC differs from other integrative compound databases in almost every aspect. The classical, relational representation of the data is surpassed here by a less explicit, more machine-learning-friendly abstraction of the data (see the CC [Manifesto](manifesto)).
|
|
|
|
|
|
The CC is an ever-growing resource maintained by the [Structural Bioinformatics & Network Biology Laboratory](http://sbnb.irbbarcelona.org) at the Institute for Research in Biomedicine ([IRB Barcelona](http://irbbarcelona.org)). Should you have any inquiries, please send an email to [miquel.duran@irbbarcelona.org](miquel.duran@irbbarcelona.org).
|
|
|
The CC resource is ever-growing and maintained by the [Structural Bioinformatics & Network Biology Laboratory](http://sbnb.irbbarcelona.org) at the Institute for Research in Biomedicine ([IRB Barcelona](http://irbbarcelona.org)). Should you have any inquiries, please send an email to [miquel.duran@irbbarcelona.org](mailto:miquel.duran@irbbarcelona.org) or [patrick.aloy@irbbarcelona.org](mailto:patrick.aloy@irbbarcelona.org).
|
|
|
|
|
|
The CC was first released to the scientific community in the following paper: [Duran-Frigola et al., *Extending the small molecule similarity principle to all levels of biology* (2019)](https://www.dropbox.com/s/x2rqszfdfpqdqdy/duranfrigola_etal_ms_current.pdf?dl=0), and has since produced a number of [related publications](publications).
|
|
|
|
|
|
## Source data and datasets
|
|
|
|
|
|
The CC capitalizes on publicly-available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months**, and versions are named accordingly (e.g. `chemical_checker_2019_01`).
|
|
|
The CC capitalizes on publicly-available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months** (versions are named accordingly, e.g. `chemical_checker_2019_01`), including the latest datasets and incorporating new ones upon request, after evaluation by our team.
|
|
|
|
|
|
The basic data unit of the CC are the *datasets*. There are 5 *levels* (`A` Chemistry, `B` Targets, `C` Networks, `D` Cells and `E` Clinics) and, in turn, each level is divided into 5 sublevels or *coordinates* (`A1`-`E5`), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. `A1.001`), one of which is selected as being *exemplar*.
|
|
|
The basic data unit of the CC is the *dataset*. There are 5 *levels* (`A` Chemistry, `B` Targets, `C` Networks, `D` Cells and `E` Clinics) and, in turn, each level is divided into 5 sublevels or *coordinates* (`A1`-`E5`), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. `A1.001`), one of which is selected as being *exemplary*.
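
The naming scheme above (level letter, sublevel digit, dataset number) can be illustrated with a minimal sketch. The helper names `parse_dataset_code`, `DatasetCode` and `ALL_COORDINATES` are hypothetical and not part of the CC code base, and the three-digit dataset suffix is an assumption inferred from the single example `A1.001`.

```python
import re
from typing import NamedTuple

# Level letters and names as listed in the text above.
LEVELS = {
    "A": "Chemistry",
    "B": "Targets",
    "C": "Networks",
    "D": "Cells",
    "E": "Clinics",
}

# Assumption: a dataset code is <level><sublevel>.<three digits>, e.g. "A1.001".
_CODE_PATTERN = re.compile(r"([A-E])([1-5])\.(\d{3})")


class DatasetCode(NamedTuple):
    """Hypothetical container for the parts of a dataset code."""
    level: str
    sublevel: int
    number: int

    @property
    def coordinate(self) -> str:
        # One of the 25 coordinates, e.g. "A1".
        return f"{self.level}{self.sublevel}"

    @property
    def level_name(self) -> str:
        return LEVELS[self.level]


def parse_dataset_code(code: str) -> DatasetCode:
    """Split a code such as 'A1.001' into level, sublevel and dataset number."""
    match = _CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"Not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return DatasetCode(level, int(sublevel), int(number))


# The 25 coordinates A1-E5, in level order.
ALL_COORDINATES = [f"{level}{i}" for level in LEVELS for i in range(1, 6)]

if __name__ == "__main__":
    ds = parse_dataset_code("A1.001")
    print(ds.coordinate, ds.level_name, ds.number)
```

Keeping the coordinate separate from the dataset number mirrors the rule that each dataset belongs to exactly one coordinate, while a coordinate may hold several datasets.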
|
|
|
|
|
|
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discovery, including approved drugs, natural compounds, and commercial screening libraries.
|
|
|
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discovery, including approved drugs, natural products, and commercial screening libraries.
|
|
|
|
|
|
For further information, please refer to:
|
|
|
* [Bioactivity data sources](data)
|
... | ... | @@ -26,7 +26,7 @@ For further information, please refer to: |
|
|
|
|
|
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning tools such as `sklearn`, `keras` or `tensorflow`.
|
|
|
|
|
|
Accordingly, the backbone pipeline of the CC are devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called *CC signatures*:
|
|
|
Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called *CC signatures*:
|
|
|
|
|
|
|Signature|Abbreviation|Description|Advantages|Disadvantages|
|
|
|
|---|---|---|---|---|
|
... | ... | @@ -46,18 +46,30 @@ There are other important types of data: |
|
|
|2D projections|`proj*`|2D representations of the data, typically performed with t-SNE.|
|
|
|
`*` denotes correspondence with the signature type `0`-`3`.
|
|
|
|
|
|
All data in the CC resource are stored as `HDF5` files. Measuring correlations between signatures belonging to different datasets yield a systematic assessment of the **small molecule similarity principle** (similar molecules have similar properties). Please follow the links below for more details:
|
|
|
All data in the CC resource are stored as `HDF5` files and can be accessed with a [simple API](access-the-data). Measuring correlations between signatures belonging to different datasets yields a systematic assessment of the **small molecule similarity principle** (similar molecules have similar properties). Please follow the links below for more details:
|
|
|
|
|
|
* [Signaturization](signaturization)
|
|
|
* [Dataset correlations](dataset-correlation)
|
|
|
|
|
|
## Resource structure and pipeline
|
|
|
## Resource structure, pipeline and access to the data
|
|
|
|
|
|
* [Folders and pipeline](folders-and-pipeline)
|
|
|
This repository is divided into two parts:
|
|
|
|
|
|
* [Production phase](production-phase)
|
|
|
* [Access to data](access-to-data)
|
|
|
|
|
|
In the **production phase**, signatures are generated. The scripts in this part of the repository thus refer to:
|
|
|
|
|
|
* The bulk update of data performed every six months.
|
|
|
* Sporadic additions of datasets.
|
|
|
* Any external data that may be mapped (compounds) or [connected](#connectivity) (other
|
|
|
biological entities) to the existing resource, which in turn produces new signatures for the queries.
|
|
|
|
|
|
The data generated by the production phase can be easily accessed with a simple Python library.
|
|
|
|
|
|
## Similarity searches on the web
|
|
|
|
|
|
Signature similarity searches can be performed at a high level using the CC web interface, available at [http://chemicalchecker.org](http://chemicalchecker.org). This resource is limited to the 25 *exemplar* datasets of the CC.
|
|
|
Signature similarity searches can be performed at a high level using the CC web interface, available at [http://chemicalchecker.org](http://chemicalchecker.org). This resource is limited to the 25 *exemplary* datasets of the CC.
|
|
|
|
|
|
On the *Main* page, the user can query small molecules and obtain an overview of their location inside the CC. The user will see which CC datasets have data available for these molecules, with grey 2D density plots indicating whether they are peripheral (low-density regions) or central (high-density regions). To give a better sense of the location of query molecules, landmark compounds from popular compound collections can be displayed. Deeper insights can be obtained by clicking on the *Explore* button for a molecule of choice.
|
|
|
|
... | ... | |