Miquel Duran-Frigola · 2377afda
--- a/dataset-correlation.md
+++ b/dataset-correlation.md
 # Correlation between datasets

+With molecules expressed as [signatures](signaturization), it becomes straightforward to test the similarity principle across datasets, that is, to check whether molecules that are similar in one dataset remain similar in another.
+
+Currently, we measure the following types of correlation, and only between the 25 exemplary datasets:
+
+* Canonical correlation analysis.
+* Shared pairs of similar molecules.
+* Coincidence of similarity ranks.
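
As a rough illustration of the three measures listed above, here is a minimal NumPy sketch on toy data. It is not the CC implementation. The function names, the cosine distance, the 5% similarity cutoff, the Jaccard index as the overlap score, and a Spearman-style rank correlation as the "coincidence of ranks" are all assumptions made for the example.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two signature matrices (rows = molecules).

    Assumes both matrices describe the same molecules in the same order and
    have full column rank. Uses the QR + SVD formulation of CCA.
    """
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))  # orthonormal basis of centered X
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))  # orthonormal basis of centered Y
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against tiny numerical overshoot

def cosine_distances(X):
    """Pairwise cosine distance matrix (hypothetical choice of metric)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def similar_pairs(X, quantile=0.05):
    """Pairs (i, j), i < j, whose distance lies in the lowest `quantile`."""
    D = cosine_distances(X)
    i, j = np.triu_indices(len(X), k=1)
    keep = D[i, j] <= np.quantile(D[i, j], quantile)
    return set(zip(i[keep].tolist(), j[keep].tolist()))

def shared_pairs_jaccard(X, Y, quantile=0.05):
    """Jaccard overlap of the similar-pair sets found in each space."""
    a, b = similar_pairs(X, quantile), similar_pairs(Y, quantile)
    return len(a & b) / len(a | b)

def rank_coincidence(X, Y):
    """Correlation between the ranks of all pairwise distances in X and in Y."""
    i, j = np.triu_indices(len(X), k=1)
    dx, dy = cosine_distances(X)[i, j], cosine_distances(Y)[i, j]
    rx, ry = dx.argsort().argsort(), dy.argsort().argsort()  # ranks, ties ignored
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy signatures: B is a noisy copy of A, C is unrelated to A.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 8))
B = A + 0.3 * rng.normal(size=(200, 8))
C = rng.normal(size=(200, 8))

for name, other in [("A vs B", B), ("A vs C", C)]:
    print(name,
          round(float(canonical_correlations(A, other).mean()), 3),
          round(shared_pairs_jaccard(A, other), 3),
          round(rank_coincidence(A, other), 3))
```

On this toy data, all three scores should come out higher for the `A`/`B` pair than for the `A`/`C` pair, which is the behaviour one would expect from correlated versus uncorrelated datasets.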
+
+These correlations are used not only for analysis but, more importantly, to *predict* unseen similarities: if two molecules are similar in one space, they are likely to be similar in other, highly correlated spaces too.
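
One simple way to picture this kind of prediction is as a conditional probability estimated from molecule pairs observed in both spaces. The sketch below is only an illustration of that reasoning under assumed inputs: the function name and the toy labels are hypothetical, and the actual CC predictors may be built differently.

```python
import numpy as np

def conditional_similarity(similar_in_a, similar_in_b):
    """Estimate P(similar in B | similar in A) from boolean labels per pair."""
    a = np.asarray(similar_in_a, dtype=bool)
    b = np.asarray(similar_in_b, dtype=bool)
    if not a.any():
        return float("nan")  # no similar pairs in A, nothing to condition on
    return float((a & b).sum() / a.sum())

# Hypothetical labels for six molecule pairs seen in both spaces.
similar_in_a = [True, True, True, False, False, True]
similar_in_b = [True, True, False, False, True, True]

p_b_given_a = conditional_similarity(similar_in_a, similar_in_b)
baseline = float(np.mean(similar_in_b))  # P(similar in B) without using A
print(p_b_given_a, baseline)
```

The larger the gap between the conditional estimate and the baseline, the more informative space A is about similarities in space B for pairs that have not been observed there.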
+
+This approach yields predictors that are, hopefully, good enough for the [CC web app](http://chemicalchecker.org). However, I think we should be more ambitious and aim for stronger classifiers, potentially based on deep learning (e.g. siamese networks).
+
+Such classifiers will be crucial for the development of the CC, as they will yield signatures of type 3, which are currently lacking.
+