Correlation between datasets
With molecules expressed as signatures, it becomes straightforward to test the similarity principle across datasets, that is, to check whether molecules that are similar in one dataset remain similar in another.
As of today, we measure the following types of correlation (only between the 25 exemplary datasets):
- Canonical correlation analysis.
- Shared pairs of similar molecules.
- Coincidence of similarity ranks.
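As a rough illustration of the last two measures, here is a minimal sketch using only the Python standard library. It is not the CC implementation. The toy signatures, the Euclidean distance, the distance cutoffs, the Jaccard index for shared pairs, and the Spearman correlation for rank coincidence are all assumptions made for the example, and every function name is hypothetical. Canonical correlation analysis is left out because it would normally come from a numerical library.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two signature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_distances(signatures):
    """Distance for every unordered molecule pair, keyed by (i, j) with i < j."""
    return {
        (i, j): euclidean(signatures[i], signatures[j])
        for i, j in combinations(range(len(signatures)), 2)
    }


def shared_pairs_jaccard(dist_a, dist_b, cutoff_a, cutoff_b):
    """Jaccard index of the sets of similar pairs (distance <= cutoff) in two spaces."""
    similar_a = {p for p, d in dist_a.items() if d <= cutoff_a}
    similar_b = {p for p, d in dist_b.items() if d <= cutoff_b}
    union = similar_a | similar_b
    return len(similar_a & similar_b) / len(union) if union else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def rank_coincidence(dist_a, dist_b):
    """Spearman correlation between the pair distances of two spaces."""
    pairs = sorted(dist_a)
    ra = average_ranks([dist_a[p] for p in pairs])
    rb = average_ranks([dist_b[p] for p in pairs])
    mean_a = sum(ra) / len(ra)
    mean_b = sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


# Toy signatures for the same four molecules in two hypothetical spaces.
space_a = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
space_b = [[1.0], [1.5], [9.0], [10.0]]

dist_a = pair_distances(space_a)
dist_b = pair_distances(space_b)

shared = shared_pairs_jaccard(dist_a, dist_b, cutoff_a=1.5, cutoff_b=1.5)
rho = rank_coincidence(dist_a, dist_b)
print(f"shared similar pairs (Jaccard): {shared:.2f}")
print(f"rank coincidence (Spearman): {rho:.2f}")
```

In this toy case the same two pairs of molecules fall under the cutoff in both spaces, so the two spaces would be considered highly correlated under either measure.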
These correlations are used not only for analysis but, more importantly, to predict unseen similarities: if two molecules are similar in a certain space, they are likely to be similar in other highly correlated spaces, too.
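The idea can be sketched as a correlation-weighted vote. This is only a naive illustration, not the predictor used in the CC. The function name, the space names, and the correlation values are hypothetical, and the weights are assumed to be non-negative.

```python
def predict_similarity(observed, correlation_to_target):
    """Score in [0, 1] that a pair is similar in the target space.

    observed: space name -> True if the pair is similar in that space.
    correlation_to_target: space name -> non-negative correlation between
    that space and the target space (hypothetical values).
    """
    total = sum(correlation_to_target[space] for space in observed)
    if total == 0:
        return 0.0
    support = sum(
        correlation_to_target[space]
        for space, is_similar in observed.items()
        if is_similar
    )
    return support / total


# The pair is similar in two spaces that correlate well with the target
# space and dissimilar in one that correlates poorly with it.
observed = {"space_a": True, "space_b": False, "space_c": True}
correlation_to_target = {"space_a": 0.8, "space_b": 0.1, "space_c": 0.5}

score = predict_similarity(observed, correlation_to_target)
print(f"predicted similarity score: {score:.2f}")
```

Spaces that correlate strongly with the target space dominate the score, which is the behavior described above. A real predictor would also need to be calibrated against known similarities.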
This approach yields predictors that are, hopefully, of sufficient quality for the CC web app. However, I do think we have to be more ambitious and aim for more accurate classifiers, potentially using deep learning (e.g. Siamese networks).
Such classifiers will be a crucial achievement for the development of the CC, as they will yield signatures of type 3, which are currently lacking.