... | ... | @@ -26,7 +26,7 @@ The main task of the CC is to convert raw data into formats that are suitable in |
|
|
|
|
|
Accordingly, the backbone pipeline of the CC is devoted to processing every dataset and converting it into a series of formats that are readily usable for machine learning. The main assets of the CC are the so-called *CC signatures*:
|
|
|
|
|
|
|Signatures|Abbreviation|Description|Advantages|Disadvantages|
|
|
|
|Signature|Abbreviation|Description|Advantages|Disadvantages|
|---|---|---|---|---|
|Type 0| `sign0`|Raw dataset data, expressed in a matrix format.|Explicit data.|Possibly sparse, heterogeneous, unprocessed.|
|Type 1|`sign1`|PCA/LSI projections of the data, accounting for 90% of the data.|Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning.|Variable dimensions; they may still be sparse.|
|
... | ... | @@ -40,7 +40,7 @@ There are other important types of data: |
|
|
|Nearest neighbors |`neigh*`|Nearest neighbors using a distance metric of choice, typically the cosine distance.|
|Clusters|`clus*`|Clusters or partitions of the data, typically obtained with a simple clustering algorithm such as k-Means.|
|2D projections|`proj*`|2D representations of the data, typically performed with t-SNE.|
|
|
|
`*` denotes correspondence with the signature type `1`-`3`.
|
|
|
`*` denotes correspondence with the signature type `0`-`3`.
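
To make the nearest-neighbors row above concrete, here is a minimal sketch of a cosine-distance neighbor search over a handful of signature vectors. It is not the CC implementation: the function names (`cosine_distance`, `nearest_neighbors`), the molecule keys, and the toy vectors are all hypothetical, and a brute-force search like this would not scale to the full resource.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has distance 1.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(signatures, query_key, k=2):
    """Rank all other molecules by cosine distance to the query signature."""
    query = signatures[query_key]
    distances = [
        (key, cosine_distance(query, vec))
        for key, vec in signatures.items()
        if key != query_key
    ]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


# Hypothetical toy signatures (molecule key -> signature vector).
toy_signatures = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
    "mol_d": [-1.0, 0.0, 0.0],
}

neighbors = nearest_neighbors(toy_signatures, "mol_a", k=2)
for key, dist in neighbors:
    print(f"{key}\t{dist:.4f}")
```

Cosine distance ignores vector magnitude, so two signatures pointing in the same direction count as close regardless of scale.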
|
|
|
|
|
|
All data in the CC resource are stored as `HDF5` files. Measuring correlations between signatures belonging to different datasets yields a systematic assessment of the **small molecule similarity principle** (similar molecules have similar properties). Please follow the links below for more details:
|
|
|
|
... | ... | |