... | ... | @@ -12,10 +12,10 @@ The central type of data are the signatures (one numerical vector per molecule), |
|
|
In addition, there are other (auxiliary) types of data that may be of interest. The asterisk `*` denotes correspondence with signature types `0`-`3`. Typically, the repository holds the `1` version of each of the following.
|
|
|
|
|
|
* `sims*` [Similarity vectors](#similarity-vectors): Full similarities stored as lightweight `int8` data. Each molecule receives one such similarity vector per dataset. They may be observed (`*_obs`) or predicted (`*_prd`) similarities. These vectors are [only applicable to exemplary datasets](production-phase).
|
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): XXXX. Currently, we consider the 1000-nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
|
|
|
* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): XXXX
|
|
|
* `clus*` [Clusters](#clusters): XXXX
|
|
|
* `proj*` [2D Projections](#2d-projections): XXX
|
|
|
* `neig*` [Nearest neighbors](#nearest-neighbors): Result of a nearest-neighbor search. Currently, we consider the 1000 nearest neighbors, which is more than sufficient in any realistic scenario. For now, we only keep `neig1`.
|
|
|
* `nprd*` [Predicted nearest neighbors](#predicted-nearest-neighbors): Predicted nearest neighbors. :warning: These are not available and will not be in the near future.
|
|
|
* `clus*` [Clusters](#clusters): Centroids and labels of a k-means clustering.
|
|
|
* `proj*` [2D Projections](#2d-projections): t-SNE 2D projections of the data.
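To make the `neig*` entry concrete, here is a minimal brute-force sketch of a k-nearest-neighbor lookup over signatures. It is only an illustration: the function name `nearest_neighbors`, the dictionary layout, and the use of Euclidean distance are assumptions, not the search method or metric actually used in the Chemical Checker.

```python
import heapq
import math


def nearest_neighbors(query, signatures, k=1000):
    """Return up to k (distance, key) pairs closest to `query`, nearest first.

    `signatures` maps a molecule key to its numerical vector.
    Euclidean distance is an assumption made for this sketch.
    """
    scored = (
        (math.dist(query, vector), key)
        for key, vector in signatures.items()
    )
    # nsmallest keeps only the k best candidates and returns them sorted.
    return heapq.nsmallest(k, scored)


# Toy signatures, invented for illustration only.
signatures = {
    "mol_a": [0.0, 0.0],
    "mol_b": [1.0, 0.0],
    "mol_c": [3.0, 4.0],
}
neighbors = nearest_neighbors([0.0, 0.0], signatures, k=2)
```

With fewer than `k` molecules available, the function simply returns all of them, so the default of 1000 acts as an upper bound.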
|
|
|
|
|
|
|
|
|
I consider the numbering `0`-`3` to be conceptually closed. However, further auxiliary data types may be introduced in the future. Note that all names have a 4-character code followed by a digit. Future data should stick to this nomenclature.
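The naming rule above can be checked mechanically. The following is a small sketch, assuming that codes are lowercase letters and that the trailing digit is restricted to `0`-`3`; the helper name `parse_data_type` is hypothetical.

```python
import re

# Four lowercase letters followed by a single digit in 0-3.
# Both restrictions are assumptions derived from the text, not an official spec.
DATA_TYPE_PATTERN = re.compile(r"([a-z]{4})([0-3])")


def parse_data_type(name):
    """Split a data type name such as 'sign1' into its code and its digit."""
    match = DATA_TYPE_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the 4-character code + digit nomenclature")
    code, digit = match.groups()
    return code, int(digit)


code, level = parse_data_type("neig1")
```

A check like this could run whenever a new auxiliary data type is registered, so that names such as `projection2` or `sim1` are rejected early.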
|
... | ... | @@ -24,12 +24,22 @@ I consider the numbering `0`-`3` to be conceptually closed. However, further aux |
|
|
|
|
|
*The following applies to all data types except the [similarity vectors](#similarity-vectors)* `sims*`*, which are of a very different nature due to their organization in the [folder structure](production-phase).*
|
|
|
|
|
|
I suggest that
|
|
|
Every CC data type will have the following methods:
|
|
|
|
|
|
* `fit()`: Takes an input and transforms the data to an output.
|
|
|
* `predict()`: Uses the fitted models to go from input to output.
|
|
|
* `validate()`: Performs a validation against external data such as MoA and ATC codes.
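The three methods listed above could be captured in a shared interface. The sketch below is one possible shape, not the actual Chemical Checker API: the class names `CCData` and `MeanCentered`, the constructor argument, and the toy mean-centering transform are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class CCData(ABC):
    """Hypothetical base class shared by all CC data types."""

    def __init__(self, name):
        self.name = name          # e.g. "sign1"
        self.is_fitted = False

    @abstractmethod
    def fit(self, rows):
        """Learn from the input and transform it to an output."""

    @abstractmethod
    def predict(self, rows):
        """Use the fitted model to go from input to output."""

    @abstractmethod
    def validate(self, reference):
        """Compare the output against external data (e.g. MoA, ATC codes)."""


class MeanCentered(CCData):
    """Toy data type that subtracts per-column means (illustration only)."""

    def fit(self, rows):
        n = len(rows)
        self.means = [sum(column) / n for column in zip(*rows)]
        self.is_fitted = True
        return self.predict(rows)

    def predict(self, rows):
        if not self.is_fitted:
            raise RuntimeError("call fit() before predict()")
        return [[x - m for x, m in zip(row, self.means)] for row in rows]

    def validate(self, reference):
        # Validation statistics are left out of this sketch on purpose.
        raise NotImplementedError("validation is not part of this sketch")


model = MeanCentered("sign1")
fitted = model.fit([[1.0, 2.0], [3.0, 4.0]])
predicted = model.predict([[2.0, 3.0]])
```

Using an abstract base class means a data type that forgets one of the three methods fails at instantiation time rather than later in the pipeline.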
|
|
|
|
|
|
## Signatures type 0
|
|
|
|
|
|
* From: standard input
|
|
|
* To: `sign0.h5`
|
|
|
|
|
|
## Signatures type 1
|
|
|
|
|
|
* From: `sign0.h5`
|
|
|
* To: `sign1.h5`
|
|
|
|
|
|
## Signatures type 2
|
|
|
|
|
|
## Signatures type 3
|
... | ... | @@ -86,5 +96,3 @@ As you know, we have most of the CC data stored as `HDF5` files. I think `HDF5` |
|
|
* `validate`: Here we will not use AUROC and KS, but other statistics, depending on the case.
|
|
|
|
|
|
The following diagram shows the first part of the Chemical Checker pipeline: |
|
|
|
|
|
![CC_backbone](/uploads/d117decadd2a47d1fcca824ebe892b05/CC_backbone.png) |
|
|
\ No newline at end of file |