Similarity and connectivity
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.
When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA, or molecules that revert the transcriptional profile of a disease state.
These are some ways similarity and connectivity can be applied in the CC:
Easy calculation of similarity and connectivity
- XX
- XX
More precisely, the connectivity procedure starts with standard input files and finishes with a type 0 signature.
In some datasets, this procedure may be considerably complex, and we need to conceive workflows that can be wrapped into a `predict()` method. For other datasets, the procedure will be almost trivial.
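As an illustration of the "almost trivial" case, here is a minimal sketch of a workflow wrapped in a `predict()` method. The class name `KeyFeaturePreprocessor`, its constructor, and the binary-matrix output are hypothetical and are not taken from the CC code base; the sketch only assumes that the input is a list of key-feature pairs and that the output is one fixed-length vector per key.

```python
class KeyFeaturePreprocessor:
    """Hypothetical wrapper, for illustration only (not a CC class)."""

    def __init__(self, features):
        # A fixed feature vocabulary, so every output shares the same columns.
        self.features = sorted(set(features))
        self._index = {f: i for i, f in enumerate(self.features)}

    def predict(self, pairs):
        """Turn (key, feature) pairs into one binary vector per key."""
        rows = {}
        for key, feature in pairs:
            row = rows.setdefault(key, [0] * len(self.features))
            col = self._index.get(feature)
            if col is not None:  # features outside the vocabulary are ignored
                row[col] = 1
        keys = sorted(rows)
        return keys, [rows[k] for k in keys]


pre = KeyFeaturePreprocessor(["f1", "f2", "f3"])
keys, matrix = pre.predict([("k2", "f3"), ("k1", "f1"), ("k1", "f2")])
```

Fixing the vocabulary at construction time is one possible design choice: it keeps vectors from new inputs comparable with those computed earlier.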
Another important matter here is the distance. The CC works with common distance metrics, such as the `cosine` or `euclidean` distances. Sometimes, connectivity may require other types of metrics (e.g. GSEA-like, overlap, etc.). We might consider learning Siamese networks that transform original distances to the more standard ones. This is an unexplored avenue, though.
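For reference, the two standard metrics can be sketched in a few lines of plain Python. The function names and the toy vectors below are illustrative assumptions, not CC functions or CC data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; undefined when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy signatures: same direction, different magnitude.
sig_a = [1.0, 0.0, 2.0]
sig_b = [2.0, 0.0, 4.0]
d_cos = cosine_distance(sig_a, sig_b)     # close to 0: the direction is identical
d_euc = euclidean_distance(sig_a, sig_b)  # non-zero: the magnitudes differ
```

The two metrics disagree here because cosine distance ignores vector length. For unit-length vectors they are tied together: the squared Euclidean distance equals twice the cosine distance.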
Standard input files
Type | Format | Description |
---|---|---|
Feature sets | GMT | XX |
Key-feature pairs | TSV | XX |
Key profiles | TSV | XX |
InChIKeys | TSV | XX |
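
As an example of reading one of these inputs, the sketch below parses feature sets from GMT. It assumes the usual GMT layout (one set per line, tab-separated: set name, description, then the members); the function name `parse_gmt` is hypothetical, and the CC may impose further conventions not shown here.

```python
def parse_gmt(lines):
    """Map each set name to its members, assuming the usual GMT layout."""
    feature_sets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least a name and a description")
        name, _description, *members = fields
        feature_sets[name] = {m for m in members if m}
    return feature_sets


# Works on any iterable of lines, e.g. an open file handle.
example = ["set1\tfirst set\tA\tB\n", "set2\tsecond set\tC\n"]
sets = parse_gmt(example)
```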
Documentation
Every pre-processing script needs to have a `README` file.