## Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. those that lead to Signature Type 0) are within the [pre-processing repository structure](datasets#dataset-pre-processing).
Every dataset has a particular processing protocol, always consisting of two consecutive steps:
1. Fetching of data and conversion to a standard input file.
* It is very important that data are *minimally* transformed here.
   * Data may be fetched from the downloaded files, from calculated properties, or from a user-supplied file of interest.
2. Conversion from the standard input to a signature type 0.
* When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
* Accordingly, a `predict()` method must be available.
* Acceptable standard inputs include: `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are *recognizable* entities, e.g. those defined in the `Bioteque`.
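The two-step protocol above can be sketched as a `fit()`/`predict()` pair. This is a minimal illustration only: the class name, the triplet-style `.tsv` layout, and the dense-vector output are all assumptions, not the CC's actual pre-processing code.

```python
import csv
import io


class Sign0Preprocess:
    """Hypothetical step-2 processor: converts a standard input (here,
    a TSV of entity / feature / value triplets) into a signature type 0,
    i.e. one feature vector per entity."""

    def __init__(self):
        self.features = None  # feature vocabulary, learned during fit()

    def fit(self, tsv_text):
        """Learn the feature vocabulary from a reference standard input."""
        rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
        self.features = sorted({feat for _entity, feat, _value in rows})
        return self

    def predict(self, tsv_text):
        """Map any standard input onto the learned feature space,
        producing a signature (vector) per entity; features not seen
        during fit() are dropped."""
        assert self.features is not None, "call fit() first"
        index = {f: i for i, f in enumerate(self.features)}
        signatures = {}
        for entity, feat, value in csv.reader(io.StringIO(tsv_text),
                                              delimiter="\t"):
            vec = signatures.setdefault(entity, [0.0] * len(self.features))
            if feat in index:
                vec[index[feat]] = float(value)
        return signatures


# Usage: fit on a reference standard input, then predict on new data.
reference = "mol1\tGO:0008150\t1\nmol2\tGO:0003674\t0.5\n"
prep = Sign0Preprocess().fit(reference)
sign0 = prep.predict("mol3\tGO:0008150\t0.7\n")
```

The design point is that `fit()` freezes the feature space, so `predict()` can later embed any user input into the same space as the reference dataset.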
It is of the utmost importance that step 2 is endowed with a `predict()` method. The ability to convert any standard input to a signature type 0 in an automated manner is what enables the implementation of [connectivity methods](connectivity). This is a critical feature of the CC and I anticipate that most of our efforts will be devoted to this particular step.
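To illustrate why `predict()` unlocks connectivity: once any input can be turned into signature type 0 vectors, a connectivity query reduces to ranking dataset entities by similarity to the query signature. The cosine metric and function names below are illustrative assumptions, not the CC's actual connectivity methods.

```python
import math


def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den if den else 0.0


def connectivity(query_sign0, dataset_sign0):
    """Rank dataset entities by decreasing similarity to the query."""
    ranked = sorted(dataset_sign0.items(),
                    key=lambda kv: cosine(query_sign0, kv[1]),
                    reverse=True)
    return [entity for entity, _vec in ranked]


# A query signature is compared against pre-computed dataset signatures.
dataset = {"mol1": [1.0, 0.0], "mol2": [0.0, 1.0], "mol3": [0.7, 0.7]}
ranking = connectivity([1.0, 0.1], dataset)
```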
More precisely, connectivity starts with **standard input files** and finishes with a **signature type 0**.
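As an illustration of one such standard input, a `.gmt` file (tab-separated: set name, description, then member entities, following the conventional GMT layout) can be read into the entity/feature pairs that step 2 consumes. This is a hedged sketch; the CC's actual parser may differ.

```python
def read_gmt(text):
    """Parse .gmt content into {set name: set of member entities}.
    Each line is: name <TAB> description <TAB> member1 <TAB> member2 ..."""
    sets = {}
    for line in text.strip().splitlines():
        name, _description, *members = line.split("\t")
        sets[name] = set(members)
    return sets


gmt = "pathwayA\tdesc\tgene1\tgene2\npathwayB\tdesc\tgene2\tgene3\n"
parsed = read_gmt(gmt)
```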