Similarity and connectivity
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.
When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA, or molecules that revert the transcriptional profile of a disease state.
These are some ways similarity and connectivity can be applied in the CC:
Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the pre-processing repository structure.
Every dataset has a particular processing protocol, always consisting of two consecutive steps:
1. Fetching of data and conversion to a standard input file.
   - It is very important that data are minimally transformed here.
   - Data may be fetched from the downloaded files, from calculated properties, or from a file of interest of the user.
2. From standard input to signature type 0.
   - When adding/updating a dataset, all procedures here must be encapsulated in a `fit()` method.
   - Accordingly, a `predict()` method must be available.
   - Acceptable standard inputs include: `.gmt`, `.h5` and `.tsv`. It is strongly recommended that input features are recognizable entities, e.g. those defined in the Bioteque.
It is of the utmost importance that step 2 is endowed with a `predict()` method. Having the ability to convert any standard input to a signature type 0 (in an automated manner) will enable implementation of connectivity methods. This is a critical feature of the CC and I anticipate that most of our efforts will go into this particular step.
More precisely, connectivity starts with standard input files and finishes with a signature type 0.
In some datasets, this procedure may be of considerable complexity and we need to conceive workflows that can be wrapped into a `predict()` method. For other datasets, the procedure will be almost trivial.
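As an illustration of the `fit()`/`predict()` contract described above, here is a minimal sketch for one of the almost trivial cases. It is not the CC implementation: the class name `KeyFeaturePreprocessor`, the in-memory list of key-feature pairs, and the binary encoding of the resulting matrix are all assumptions made for the example.

```python
import numpy as np


class KeyFeaturePreprocessor:
    """Hypothetical pre-processing step: key-feature pairs -> signature type 0.

    The class name and the binary (presence/absence) encoding are
    assumptions made for illustration only.
    """

    def fit(self, pairs):
        # Learn the feature vocabulary from the pairs seen when the
        # dataset is added or updated.
        self.features_ = sorted({feature for _, feature in pairs})
        self.index_ = {f: i for i, f in enumerate(self.features_)}
        return self

    def predict(self, pairs):
        # Encode new pairs against the vocabulary learned in fit().
        # Features that were not seen during fit() are ignored.
        keys = sorted({key for key, _ in pairs})
        row = {k: i for i, k in enumerate(keys)}
        matrix = np.zeros((len(keys), len(self.features_)), dtype=np.int8)
        for key, feature in pairs:
            col = self.index_.get(feature)
            if col is not None:
                matrix[row[key], col] = 1
        return keys, matrix


# Toy data: the keys and features are placeholders, not real identifiers.
train = [("MOL_A", "P1"), ("MOL_A", "P2"), ("MOL_B", "P2")]
prep = KeyFeaturePreprocessor().fit(train)
keys, sign0 = prep.predict([("MOL_C", "P1"), ("MOL_C", "P9")])
```

Because `predict()` reuses the feature space fixed by `fit()`, any new entity lands in the same coordinates as the molecules already in the dataset, which is what makes a connectivity comparison possible.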
Another important matter here is the distance. The CC works with common distance metrics, such as the `cosine` or `euclidean` distances. Sometimes, connectivity may require other types of metrics (e.g. GSEA-like, overlap, etc.). We might consider learning Siamese networks that transform original distances to the more standard ones. This is an unexplored avenue, though.
Standard input files
Type | Format | Description |
---|---|---|
Feature sets | GMT | XX |
Key-feature pairs | TSV | XX |
Key profiles | TSV | XX |
InChIKeys | TSV | XX |
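
As an example of reading one of these inputs, the sketch below parses the widely used GMT layout (one set per line: name, description, then tab-separated members). The function name `read_gmt` is hypothetical, and whether the CC interprets the set name as a molecule key or as something else is an assumption left open here.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into a {set_name: [members]} dictionary."""
    sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and sets that list no members.
        if len(fields) < 3:
            continue
        name, _description, *members = fields
        sets[name] = [m for m in members if m]
    return sets


# Toy input; set and feature names are placeholders.
example = io.StringIO("SET_1\tna\tF1\tF2\nSET_2\tna\tF2\tF3\tF4\n")
feature_sets = read_gmt(example)
```

The same function accepts a real file opened with `open(path)`, since it only iterates over lines.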
Documentation
Every pre-processing script needs to have a `README` file.