The CC was first released to the scientific community in the following paper: [D
|
|
|
|
|
The CC capitalizes on publicly available bioactivity data fetched from different sources. We are committed to updating the CC resource **every 6 months**.
|
|
|
|
|
|
|
|
|
The basic data units of the CC are the *datasets*. There are 5 levels of increasing biological complexity (A: Chemistry, B: Targets, C: Networks, D: Cells and E: Clinics) and, in turn, each level is divided into five sublevels (1-5), denoting different types of data. In the CC terminology, a dataset is a set of compounds belonging to a certain source and having a certain data type. Each dataset belongs to one and only one of the 25 categories.
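
The 25 categories can be thought of as a simple level/sublevel coordinate system. A minimal sketch in Python (the code strings `A1`..`E5` follow the naming above; this is illustrative, not the CC API):

```python
# The 5 CC levels, in order of increasing biological complexity.
levels = {
    "A": "Chemistry",
    "B": "Targets",
    "C": "Networks",
    "D": "Cells",
    "E": "Clinics",
}

# Each level has 5 sublevels, giving 25 dataset categories in total.
datasets = [f"{letter}{number}" for letter in levels for number in range(1, 6)]

print(len(datasets))  # 25
print(datasets[:5])   # ['A1', 'A2', 'A3', 'A4', 'A5']
```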
|
|
|
|
|
|
For further information, please refer to:
|
|
|
* [Bioactivity data sources](data)
|
|
|
* [Chemical libraries](libraries)
|
|
|
* [Meta-data](database)
|
|
|
* [Full list of datasets and file-system organization](datasets)
|
|
|
|
|
|
## Signaturization of the data
|
|
|
|
|
|
The main task of the CC is to convert raw data into formats that are suitable inputs for machine-learning toolboxes such as `sklearn`, `keras` or `tensorflow`.
|
|
|
|
|
|
As such, the backbone scripts of the CC are devoted to processing every dataset and converting it to a series of formats that may be readily useful for machine learning. The main assets of the CC are the so-called *CC signatures*:
|
|
|
|
|
|
The four signature types are summarized below:
|
|
|
| Signatures | Abbreviation | Description | Advantages | Disadvantages |
|
|
|
| --- | --- | --- | --- | --- |
|
|
|
| Type 0 | `dataset` | Raw data of the dataset, expressed in matrix format. | Explicit data. | Possibly sparse, heterogeneous and unprocessed. |
|
|
|
| Type 1 | `sig` | PCA/LSI projections of the data, accounting for 90% of the variance. | Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning. | Variable dimensions; they may still be sparse. |
|
|
|
| Type 2 | `netemb` | Network embedding of the similarity network. | Fixed-length, usually acceptably short. Suitable for machine learning. Captures global properties of the similarity network. | Information leak due to similarity measures. Requires hyper-parameter tuning. |
|
|
|
| Type 3 | `fullnetemb` | Network-embedding of the inferred similarity network. | Fixed dimension and available for *any* molecule. | Possibly very noisy, hence useless, especially for low-data datasets. |
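
As a toy illustration of a type 1 signature, a plain PCA can keep the components that explain 90% of the variance (here implemented with NumPy's SVD; the CC's actual projections, LSI variants and dimensions differ):

```python
import numpy as np

# Toy dataset: 100 molecules x 20 raw features (stand-in for a CC dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Xc = X - X.mean(axis=0)  # center the data before PCA

# PCA via SVD: singular values give the per-component explained variance.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()

# Keep the smallest number of components reaching 90% cumulative variance.
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1

sig1 = Xc @ Vt[:k].T  # "type 1 signature": 100 molecules x k components
print(sig1.shape)
```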
|
|
|
|
|
|
There are also other important types of data:
|
|
|
|
|
|
| Name | Abbreviation | Description |
|
|
|
| --- | --- | --- |
|
|
|
| Nearest neighbors | `nneigh` | Nearest neighbors using a distance metric of choice, typically the cosine distance. |
|
|
|
| Clusters | `clust` | Clusters or partitions of the data. Typically obtained with a simple clustering algorithm such as k-means. |
|
|
|
| 2D projections | `proj` | 2D representations of the data, typically performed with t-SNE. |
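
A minimal sketch of how `nneigh` could be derived from signatures with the cosine distance (toy data; the CC's actual pipeline and parameters may differ):

```python
import numpy as np

# Toy signature matrix: 5 molecules x 8 dimensions.
rng = np.random.default_rng(1)
sigs = rng.normal(size=(5, 8))

# Pairwise cosine distances: 1 - cosine similarity of L2-normalized rows.
norm = sigs / np.linalg.norm(sigs, axis=1, keepdims=True)
cos_dist = 1.0 - norm @ norm.T

np.fill_diagonal(cos_dist, np.inf)             # exclude self-matches
nneigh = np.argsort(cos_dist, axis=1)[:, :3]   # 3 nearest neighbors per molecule
print(nneigh.shape)
```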
|
|
|
|
|
|
All data in the CC resource are stored as `HDF5` files. For further information, please refer to:
|
|
|
* [Signaturization](signaturization)
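
A minimal sketch of the `HDF5` round trip with `h5py`. The file path and the dataset key `V` for the signature matrix are assumptions for illustration, not the CC's actual schema:

```python
import os
import tempfile

import h5py
import numpy as np

# Toy signature matrix to store: 10 molecules x 4 dimensions.
sig = np.random.rand(10, 4).astype(np.float32)
path = os.path.join(tempfile.gettempdir(), "toy_sig.h5")

# Write the matrix under a (hypothetical) key "V", with gzip compression.
with h5py.File(path, "w") as f:
    f.create_dataset("V", data=sig, compression="gzip")

# Read the full matrix back into memory.
with h5py.File(path, "r") as f:
    loaded = f["V"][:]

print(loaded.shape)
```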
|
|
|
|
|
|
## Similarity searches on the web
|
In the *Main* page, the user can query small molecules and obtain an overview of
|
|
In the *Explore* page, we look for similar molecules in the CC database and display them in a 25-column table corresponding to all CC datasets. In CC datasets where the molecule *is* present, we measure similarities to the other molecules in the dataset. If the molecule *is not* present, we can only infer similarities to the molecules in the dataset.
|
|
|
|
|
|
For further information, please refer to:
|
|
|
* [Short web-tutorial](http://chemicalchecker.org/help/)
|
|
|
* [Statistics](http://chemicalchecker.org/stats/)
|
|
|
|
|
|
|
|
|
|
|
|
## Connectivity
|
|
|
|
|
|
XXXX
|
|
|
|
|
|
## Customary drug discovery tasks
|
|
|
|
|
|
We are currently most interested in making the CC resource available to everyone and in identifying standard applications that are of use to computational drug discoverers. Below, we list some such applications:
|
|
|
|
|
|
* [Connectivity](connectivity)
|
|
* [Library characterization](XXX):
|
|
|
* Massive property prediction:
|
|
|
    * [Similarity-based prediction](XXX): This emerges from work by David Amat (@damat) named TargetMate. We are currently evolving it into a simple kernel predictor that can also take label correlations into account.
|
|
|
    * [Automated machine learning](XXX): This is the work of Modesto Orozco (@morozco), who is currently using the TPOT AutoML library.
|
|
|
* [DeepChem integration](XXX): The [DeepChem library](http://deepchem.io) is an outstanding chemoinformatics toolbox that incorporates a number of machine learning algorithms with seamless integration with `tensorflow` and similar frameworks. Moreover, DeepChem contains [MoleculeNet](http://moleculenet.ai), a collection of benchmark sets related, among others, to biophysical properties and physiological outcomes of compounds. Conveniently, DeepChem provides several `featurizers`, i.e. functions that are able to convert a compound structure to a vector format, typically representing its chemistry. We branch DeepChem to include `cc_featurizers` able to convert, in principle, *any* compound structure to the corresponding CC signature. We have started doing so with type 2 signatures. This is the work of Martino Bertoni (@mbertoni).
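
As an illustration of the featurizer pattern (not the actual `cc_featurizers` implementation, which looks up precomputed CC signatures), a DeepChem-style featurizer maps a list of compound structures to a matrix of fixed-length vectors:

```python
import numpy as np

class CCFeaturizer:
    """Toy DeepChem-style featurizer. The hashing scheme below is a
    deterministic stand-in for a real signature lookup."""

    def __init__(self, n_dims=128):
        self.n_dims = n_dims

    def featurize(self, smiles_list):
        # Map each SMILES string to a reproducible (within one run) vector.
        vecs = []
        for smi in smiles_list:
            rng = np.random.default_rng(abs(hash(smi)) % (2 ** 32))
            vecs.append(rng.normal(size=self.n_dims))
        return np.stack(vecs)

feat = CCFeaturizer(n_dims=128)
X = feat.featurize(["CCO", "c1ccccc1"])
print(X.shape)  # (2, 128)
```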