... | @@ -10,9 +10,9 @@ The CC was first released to the scientific community in the following paper: [D |
|
|
|
|
|
The CC capitalizes on publicly-available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months**, and versions are named accordingly (e.g. `chemical_checker_2019_01`).
|
|
The CC capitalizes on publicly-available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months**, and versions are named accordingly (e.g. `chemical_checker_2019_01`).
|
|
|
|
|
|
The basic data units of the CC are the *datasets*. There are 5 *levels* (`A` Chemistry, `B` Targets, `C` Networks, `D` Cells and `E` Clinics) and, in turn, each level is divided into 5 sublevels or *coordinates* (`A1`-`E5`), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. `A1.001`).
|
|
The basic data units of the CC are the *datasets*. There are 5 *levels* (`A` Chemistry, `B` Targets, `C` Networks, `D` Cells and `E` Clinics) and, in turn, each level is divided into 5 sublevels or *coordinates* (`A1`-`E5`), denoting different types or aspects of the data. Each dataset belongs to one and only one of the 25 coordinates, and each coordinate can have an arbitrary number of datasets (e.g. `A1.001`), one of which is selected as the *exemplar*.
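
The naming scheme described above can be sketched in a few lines of Python. The helper below is hypothetical (it is not part of the CC code base) and assumes that dataset codes always follow the `A1.001` pattern: a level letter `A`-`E`, a sublevel digit `1`-`5`, a dot, and a three-digit dataset number.

```python
import re
from typing import NamedTuple

# Level letters and names, as listed in the text above.
LEVELS = {"A": "Chemistry", "B": "Targets", "C": "Networks", "D": "Cells", "E": "Clinics"}

# Assumed pattern: level letter, sublevel digit, dot, three-digit dataset number.
_CODE_RE = re.compile(r"^([A-E])([1-5])\.(\d{3})$")


class DatasetCode(NamedTuple):
    level: str       # e.g. "A"
    coordinate: str  # e.g. "A1"
    number: int      # e.g. 1 for "A1.001"


def parse_dataset_code(code: str) -> DatasetCode:
    """Split a dataset code such as 'A1.001' into its parts (hypothetical helper)."""
    match = _CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not a valid dataset code: {code!r}")
    level, sublevel, number = match.groups()
    return DatasetCode(level, level + sublevel, int(number))


# The 25 coordinates, A1 through E5.
COORDINATES = [f"{level}{sub}" for level in LEVELS for sub in range(1, 6)]
```

For instance, `parse_dataset_code("A1.001")` yields level `A`, coordinate `A1` and dataset number `1`, while a malformed code raises a `ValueError`.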
|
|
|
|
|
|
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discovery, including approved drugs, natural compounds and commercial screening libraries.
|
|
The CC is a chemistry-first biomedical resource and, as such, it contains several predefined compound collections that are of interest to drug discovery, including approved drugs, natural compounds, and commercial screening libraries.
|
|
|
|
|
|
For further information, please refer to:
|
|
For further information, please refer to:
|
|
* [Bioactivity data sources](data)
|
|
* [Bioactivity data sources](data)
|
... | @@ -24,16 +24,16 @@ For further information, please refer to: |
|
|
|
|
|
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning tools such as `sklearn`, `keras` or `tensorflow`.
|
|
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning tools such as `sklearn`, `keras` or `tensorflow`.
|
|
|
|
|
|
Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main asset of the CC are the so-called *CC signatures*:
|
|
Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called *CC signatures*:
|
|
|
|
|
|
| Signatures | Abbreviation | Description | Advantages | Disadvantages |
|
|
|Signatures|Abbreviation|Description|Advantages|Disadvantages|
|
|
| --- | --- | --- | --- | --- |
|
|
|---|---|---|---|---|
|
|
| Type 0 | `dataset` | Raw dataset data, expressed in matrix format. | Explicit data. | Possibly sparse, heterogeneous, un-processed. |
|
|
|Type 0| `dataset`|Raw dataset data, expressed in matrix format.|Explicit data.|Possibly sparse, heterogeneous, unprocessed.|
|
|
| Type 1 | `sig` | PCA/LSI projections of the data, accounting for 90% of the data. | Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning. | Variable dimensions; they may still be sparse. |
|
|
|Type 1|`sig`|PCA/LSI projections of the data, accounting for 90% of the data.|Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning.|Variable dimensions; they may still be sparse.|
|
|
| Type 2 | `netemb` | Network-embedding of the similarity network. | Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network. | Information leak due to similarity measures. Hyper-parameter tuning. |
|
|
|Type 2|`netemb`|Network-embedding of the similarity network.|Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network.|Information leak due to similarity measures. Hyper-parameter tuning.|
|
|
| Type 3 | `fullnetemb` | Network-embedding of the inferred similarity network. | Fixed dimension and available for *any* molecule. | Possibly very noisy, hence useless, especially for low-data datasets. |
|
|
|Type 3|`fullnetemb`|Network-embedding of the inferred similarity network.|Fixed dimension and available for *any* molecule.|Possibly very noisy, hence useless, especially for low-data datasets.|
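
As a rough illustration of how a Type 1 signature could end up with a variable number of dimensions, the sketch below picks the smallest number of leading components whose cumulative share reaches a target. It assumes that "accounting for 90% of the data" means 90% of the explained variance, which the text does not state explicitly; the function name and the example ratios are made up for illustration and are not taken from the CC pipeline.

```python
from itertools import accumulate
from typing import Sequence


def components_for_variance(explained_ratios: Sequence[float], target: float = 0.9) -> int:
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target` (hypothetical helper)."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    for k, cumulative in enumerate(accumulate(explained_ratios), start=1):
        # Small tolerance guards against floating-point round-off.
        if cumulative >= target - 1e-12:
            return k
    # If the target is never reached, keep every component.
    return len(explained_ratios)


# Made-up per-component ratios, sorted from largest to smallest.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n_components = components_for_variance(ratios)  # cumulative: 0.5, 0.75, 0.875, 0.9375
```

Because the number of components needed depends on how the variance is spread, two datasets processed this way will generally yield signatures of different lengths. In `sklearn`, passing a float such as `PCA(n_components=0.9)` performs a comparable selection.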
|
|
|
|
|
|
There are also other important types of data:
|
|
There are other important types of data:
|
|
|
|
|
|
|Name|Abbreviation|Description|
|
|
|Name|Abbreviation|Description|
|
|
|---|---|---|
|
|
|---|---|---|
|
... | @@ -41,24 +41,24 @@ There are also other important types of data: |
|
|Clusters|`clust`|Clusters or data partitions of the data. Typically obtained with a simple clustering algorithm such as k-Means.|
|
|
|Clusters|`clust`|Clusters or data partitions of the data. Typically obtained with a simple clustering algorithm such as k-Means.|
|
|
|2D projections|`proj`|2D representations of the data, typically performed with t-SNE.|
|
|
|2D projections|`proj`|2D representations of the data, typically performed with t-SNE.|
|
|
|
|
|
|
All data in the CC resource are stored as `HDF5` files. For further information, please refer to:
|
|
All data in the CC resource are stored as `HDF5` files. Details of the pipeline are given in the link below:
|
|
* [Signaturization](signaturization)
|
|
* [Signaturization](signaturization)
|
|
|
|
|
|
## Similarity searches on the web
|
|
## Similarity searches on the web
|
|
|
|
|
|
Similarity searches can be performed at a high level using the CC resource, available at [http://chemicalchecker.org](http://chemicalchecker.org). This resource is limited to the 25 *exemplar* datasets of the CC.
|
|
Signature similarity searches can be performed at a high level using the CC web interface, available at [http://chemicalchecker.org](http://chemicalchecker.org). This resource is limited to the 25 *exemplar* datasets of the CC.
|
|
|
|
|
|
On the *Main* page, the user can query small molecules and obtain an overview of their location inside the CC. The page shows the CC datasets in which these molecules have data available, with gray 2D density plots indicating whether they lie in peripheral (low-density) or central (high-density) regions. To give a better sense of where the query molecules lie, landmark compounds from popular collections can be displayed. Deeper insights can be obtained by clicking on the *Explore* button for a molecule of choice.
|
|
On the *Main* page, the user can query small molecules and obtain an overview of their location inside the CC. The page shows the CC datasets in which these molecules have data available, with gray 2D density plots indicating whether they lie in peripheral (low-density) or central (high-density) regions. To give a better sense of where the query molecules lie, landmark compounds from popular compound collections can be displayed. Deeper insights can be obtained by clicking on the *Explore* button for a molecule of choice.
|
|
|
|
|
|
In the *Explore* page, we look for similar molecules in the CC database and display them in a 25-column table, corresponding to all CC datasets. In CC datasets where the molecule *is* present, we measure similarities to other molecules in the dataset. If the molecule *is not* present, we infer similarities only to the molecules in the dataset.
|
|
In the *Explore* page, we look for similar molecules in the CC resource and display them in a 25-column table, corresponding to all CC datasets. In CC datasets where the molecule *is* present, we measure similarities to other molecules in the dataset. If the molecule *is not* present, we infer similarities only to the molecules in the dataset.
|
|
|
|
|
|
For further information, please refer to:
|
|
Other interesting pages of the website include:
|
|
* [Short web tutorial](http://chemicalchecker.org/help/)
|
|
* [Short web tutorial](http://chemicalchecker.org/help/)
|
|
* [Statistics page](http://chemicalchecker.org/stats/)
|
|
* [Statistics page](http://chemicalchecker.org/stats/)
|
|
|
|
|
|
## Connectivity
|
|
## Connectivity
|
|
|
|
|
|
The CC contains both chemical and biological signatures. One of the most interesting features of biological signatures is that they can be *connected* to signatures of biology. This idea was first popularized by the Connectivity Map in the context of gene expression data.
|
|
The CC contains both chemical *and* biological signatures. One of the most interesting features of biological signatures is that they can be *connected* to signatures of biology. This idea was first popularized by the Connectivity Map in the context of gene expression data.
|
|
|
|
|
|
In we generalize the notion of connectivity to other types of data and provide functionalities to connect small molecules to other biologically-annotated entities such as disease, cell lines or genetic perturbation experiments. Some examples would be:
|
|
In we generalize the notion of connectivity to other types of data and provide functionalities to connect small molecules to other biologically-annotated entities such as disease, cell lines or genetic perturbation experiments. Some examples would be:
|
|
|
|
|
... | | ... | |