# Correlation between datasets

With molecules expressed as [signatures](signaturization), it becomes straightforward to test the similarity principle across datasets: we can check whether molecules that are similar in one dataset remain similar in another.

As of today, we measure the following types of correlation (only between the 25 exemplary datasets):

* Canonical correlation analysis.
* Shared pairs of similar molecules.
* Coincidence of similarity ranks.

These correlations are used not only for analysis but, more importantly, to *predict* unseen similarities: if two molecules are similar in a certain space, they are likely to be similar in other highly correlated spaces, too.
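
As a rough illustration of this idea, the Python sketch below scores how many similar pairs two spaces share and then carries similar pairs over to a space where the molecules have not been measured. Everything in it is an assumption made for illustration, not the CC implementation: the function names (`similar_pairs`, `shared_pairs_score`, `predict_unseen_pairs`), the toy signatures, the Euclidean distance cutoff, and the use of a Jaccard index as the shared-pairs score are all hypothetical. The predicted pairs simply inherit the shared-pairs score as a crude confidence value.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def similar_pairs(signatures, cutoff):
    """Return the pairs of molecules whose signatures lie within `cutoff`."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(signatures), 2)
        if dist(signatures[a], signatures[b]) <= cutoff
    }


def shared_pairs_score(pairs_a, pairs_b):
    """Jaccard index of two sets of similar pairs (0 = disjoint, 1 = identical)."""
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def predict_unseen_pairs(source_pairs, target_molecules, score, min_score=0.5):
    """Propose source-space pairs for the target space.

    Only pairs with at least one molecule missing from the target space are
    proposed, and only if the two spaces are correlated enough (assumed rule).
    """
    if score < min_score:
        return {}
    return {pair: score for pair in source_pairs if not pair <= target_molecules}


# Toy signatures (hypothetical): space A covers 8 molecules, space B only 6.
space_a = {
    "mol1": (0.0, 0.0), "mol2": (0.1, 0.0),
    "mol3": (1.0, 1.0), "mol4": (1.1, 1.0),
    "mol5": (2.0, 2.0), "mol6": (2.1, 2.0),
    "mol7": (3.0, 3.0), "mol8": (3.1, 3.0),
}
space_b = {
    "mol1": (0.0,), "mol2": (0.2,),
    "mol3": (5.0,), "mol4": (5.1,),
    "mol5": (9.0,), "mol6": (12.0,),
}
CUTOFF = 0.5  # assumed similarity cutoff, applied to both spaces for simplicity

# Measure the correlation on the molecules that both spaces have in common.
common = set(space_a) & set(space_b)
pairs_a_common = similar_pairs({m: space_a[m] for m in common}, CUTOFF)
pairs_b = similar_pairs(space_b, CUTOFF)
score = shared_pairs_score(pairs_a_common, pairs_b)

# Carry similar pairs from A over to molecules that are unmeasured in B.
pairs_a = similar_pairs(space_a, CUTOFF)
predicted = predict_unseen_pairs(pairs_a, set(space_b), score)
print(f"shared-pairs score: {score:.2f}")
for pair, confidence in predicted.items():
    print(sorted(pair), f"predicted similar in B (confidence {confidence:.2f})")
```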

This approach yields predictors that are, hopefully, of sufficient quality for the [CC web app](http://chemicalchecker.org). However, I do think we have to be more ambitious and aim for more proficient classifiers, potentially using deep learning (e.g. Siamese networks).

This will be a crucial achievement for the development of the CC, as it will yield signatures of type 3, which are currently lacking.