... | ... | @@ -28,18 +28,18 @@ Accordingly, the backbone pipeline of the CC are devoted to processing every dat |
|Signatures|Abbreviation|Description|Advantages|Disadvantages|
|---|---|---|---|---|
|Type 0|`dataset`|Raw dataset data, expressed in matrix format.|Explicit data.|Possibly sparse, heterogeneous, unprocessed.|
|Type 1|`sig`|PCA/LSI projections of the data, accounting for 90% of the data.|Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning.|Variable dimensions; they may still be sparse.|
|Type 2|`netemb`|Network-embedding of the similarity network.|Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network.|Information leak due to similarity measures. Hyper-parameter tuning.|
|Type 3|`fullnetemb`|Network-embedding of the inferred similarity network.|Fixed dimension and available for *any* molecule.|Possibly very noisy, hence useless, especially for low-data datasets.|
|Type 0|`sign0`|Raw dataset data, expressed in a matrix format.|Explicit data.|Possibly sparse, heterogeneous, unprocessed.|
|Type 1|`sign1`|PCA/LSI projections of the data, accounting for 90% of the data.|Biological signatures of this type can be obtained by simple projection. Easy to compute and require no fine-tuning.|Variable dimensions; they may still be sparse.|
|Type 2|`sign2`|Network-embedding of the similarity network.|Fixed-length, usually acceptably short. Suitable for machine learning. Capture global properties of the similarity network.|Information leak due to similarity measures. Hyper-parameter tuning.|
|Type 3|`sign3`|Network-embedding of the inferred similarity network.|Fixed dimension and available for *any* molecule.|Possibly very noisy, hence useless, especially for low-data datasets.|

There are other important types of data:

|Name|Abbreviation|Description|
|---|---|---|
|Nearest neighbours|`nneigh`|Nearest neighbours using a distance metric of choice, typically the cosine distance.|
|Clusters|`clust`|Clusters or data partitions of the data. Typically obtained with a simple clustering algorithm such as k-Means.|
|2D projections|`proj`|2D representations of the data, typically performed with t-SNE.|
|Nearest neighbors|`neigh*`|Nearest neighbors using a distance metric of choice, typically the cosine distance.|
|Clusters|`clus*`|Clusters or data partitions of the data. Typically obtained with a simple clustering algorithm such as k-Means.|
|2D projections|`proj*`|2D representations of the data, typically performed with t-SNE.|

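
To illustrate the nearest-neighbor entry in the table above, here is a minimal sketch of a cosine-distance neighbor search over a handful of signature vectors. It is not the CC implementation: the function names (`cosine_distance`, `nearest_neighbors`), the molecule keys, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption made for this sketch: a zero vector is treated
        # as orthogonal to every other vector.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(signatures, query_key, k=2):
    """Return the keys of the k signatures closest to the query."""
    query = signatures[query_key]
    scored = [
        (cosine_distance(query, vec), key)
        for key, vec in signatures.items()
        if key != query_key
    ]
    scored.sort()  # ascending distance, ties broken by key
    return [key for _, key in scored[:k]]


# Hypothetical toy signatures (three dimensions for readability).
signatures = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
    "mol_d": [-1.0, 0.0, 0.0],
}

neighbors = nearest_neighbors(signatures, "mol_a", k=2)
```

A brute-force scan like this is only practical for small collections; for large signature matrices an approximate or indexed search would normally be preferred.
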
All data in the CC resource are stored as `HDF5` files. Measuring correlations between signatures belonging to different datasets yields a systematic assessment of the **small molecule similarity principle** (similar molecules have similar properties). Please follow the links below for more details:
... | ... | |