Similarity and connectivity
In the context of the CC, "connectivity" is a generalization of "similarity": a more flexible means of comparing entities. The notion of connectivity is of special interest to unsupervised drug discovery because it enables external biological data to be mapped onto the chemical space.
In the CC pipeline, connectivity happens at the pre-processing step. The pre-processing step has two phases:
- XX
- XX
More precisely, connectivity starts with standard input files and finishes with a signature type 0.
In some datasets, this procedure may be of considerable complexity and we need to conceive workflows that can be wrapped into a predict() method. For other datasets, the procedure will be almost trivial.
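To illustrate what wrapping a workflow into a predict() method could look like, here is a minimal sketch for a near-trivial dataset. It is not the CC implementation: the class name ToyConnectivity, the dict-based input format, and the assumption that the output is a binary entity-by-feature matrix are all hypothetical choices made for this example.

```python
import numpy as np


class ToyConnectivity:
    """Hypothetical wrapper, not part of the CC package.

    Assumes the input is a dict mapping an entity key to an iterable of
    feature names, and that the output is a binary matrix with one row
    per entity and one column per feature seen at fit time.
    """

    def fit(self, records):
        # Fix the feature space (column order) from the reference data.
        self.features_ = sorted({f for feats in records.values() for f in feats})
        self.index_ = {f: i for i, f in enumerate(self.features_)}
        return self

    def predict(self, records):
        # Map new entities onto the feature space learned in fit().
        keys = sorted(records)
        X = np.zeros((len(keys), len(self.features_)), dtype=np.int8)
        for row, key in enumerate(keys):
            for feature in records[key]:
                col = self.index_.get(feature)
                if col is not None:  # features unseen at fit time are ignored
                    X[row, col] = 1
        return keys, X


# Usage: fit on reference data, then project an external entity.
model = ToyConnectivity().fit({"c1": ["t1", "t2"], "c2": ["t2"]})
keys, X = model.predict({"c3": ["t2", "t9"]})
```

The point of the fit/predict split is that external data are projected onto a feature space fixed beforehand, so new entities become comparable with those already in the resource. A complex dataset would need more steps inside predict(), but the interface could stay the same.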
Another important matter here is the distance. The CC works with common distance metrics, such as the cosine or Euclidean distances. Sometimes, connectivity may require other types of metrics (e.g. GSEA-like, overlap, etc.). We might consider learning Siamese networks that transform the original distances into the more standard ones. This is an unexplored avenue, though.
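As a rough illustration of the difference between the standard metrics and a set-based one, the sketch below computes the cosine and Euclidean distances on numeric vectors and an overlap-style distance on sets. The function names are hypothetical, and the overlap distance is defined here as one minus the overlap coefficient, which is only one of several possible conventions.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero numeric vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u, v):
    """Straight-line distance between two numeric vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.linalg.norm(u - v))


def overlap_distance(a, b):
    """1 - overlap coefficient between two sets (assumed convention)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 1.0  # no shared elements are possible with an empty set
    return 1.0 - len(a & b) / min(len(a), len(b))


# Vector-based metrics operate on signatures; the set-based one on raw sets.
d_cos = cosine_distance([1, 0], [0, 1])
d_euc = euclidean_distance([0, 0], [3, 4])
d_ovl = overlap_distance({"A", "B"}, {"B", "C", "D"})
```

The first two functions take fixed-length vectors, whereas the third takes sets of arbitrary size. This mismatch is why a learned transformation into a standard metric space might be worth exploring.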