# Correlation between datasets
With molecules expressed as [signatures](signaturization), it is easy to apply the similarity principle, that is, to check whether molecules that are similar in one dataset remain similar in another.

As of today, we measure the following types of correlation (only between the 25 exemplary datasets):
* Shared pairs of similar molecules.
* Coincidence of similarity ranks.

These correlations are used not only for analysis but, more importantly, to *predict* unseen similarities: if two molecules are similar in a certain space, they are likely to be similar in other highly correlated spaces, too.

This approach yields predictors that are, hopefully, of sufficient quality for the [CC web app](http://chemicalchecker.org). However, I do think we have to be more ambitious and aim at more proficient classifiers, potentially using deep learning (e.g. siamese networks).

These correlations greatly help analysis (here we show a consensus on exemplary datasets):

![consensus](/uploads/c1ac016c0c7436755c1709ec2ca3afeb/consensus.png)

Together with conditional probabilities, such correlations are used internally, in a very simple manner, by the [CC web app](http://chemicalchecker.org).
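As a rough illustration of the two correlation types listed earlier and of the conditional-probability idea, here is a minimal sketch. It is not the CC implementation: the function names, the use of cosine similarity, the top-10% cutoff that defines "similar" pairs, the Jaccard index for shared pairs and the Spearman correlation for rank coincidence are all assumptions made for the example, and the signatures are random toy data.

```python
import math
import random
from itertools import combinations


def pairwise_cosine(signatures):
    """Cosine similarity for every unordered pair of molecules (rows).

    Assumes the same molecules, in the same order, in every dataset,
    and that no signature is an all-zero vector.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    return [cosine(u, v) for u, v in combinations(signatures, 2)]


def similar_pairs(sims, top_fraction=0.1):
    """Indices of the most similar pairs (the cutoff is a hypothetical choice)."""
    n_top = max(1, round(top_fraction * len(sims)))
    order = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)
    return set(order[:n_top])


def shared_pairs(pairs_a, pairs_b):
    """Jaccard index between two sets of similar pairs."""
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order):
        result[k] = rank
    return result


def rank_coincidence(sims_a, sims_b):
    """Spearman correlation, i.e. Pearson correlation of the ranks."""
    ranks_a, ranks_b = ranks(sims_a), ranks(sims_b)
    mean = (len(ranks_a) - 1) / 2  # mean of the ranks 0..n-1
    cov = sum((x - mean) * (y - mean) for x, y in zip(ranks_a, ranks_b))
    var = sum((x - mean) ** 2 for x in ranks_a)  # identical for both rank lists
    return cov / var


def conditional_probability(pairs_a, pairs_b):
    """Estimate P(similar in B | similar in A) from the observed pairs."""
    return len(pairs_a & pairs_b) / len(pairs_a) if pairs_a else 0.0


# Toy data: space B is a noisy copy of space A, space C is unrelated.
rng = random.Random(0)
space_a = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(40)]
space_b = [[x + rng.gauss(0, 0.5) for x in row] for row in space_a]
space_c = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(40)]

sims_a, sims_b, sims_c = (pairwise_cosine(s) for s in (space_a, space_b, space_c))
top_a, top_b, top_c = (similar_pairs(s) for s in (sims_a, sims_b, sims_c))

print("A vs B:", shared_pairs(top_a, top_b), rank_coincidence(sims_a, sims_b))
print("A vs C:", shared_pairs(top_a, top_c), rank_coincidence(sims_a, sims_c))
print("P(similar in B | similar in A):", conditional_probability(top_a, top_b))
```

In this toy setting the correlated pair of spaces (A and B) should score higher on both measures than the unrelated pair (A and C), which is the intuition behind using a highly correlated space to guess similarities that have not been observed.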
## Signatures Type 3
Signatures Type 3 are the attempt to predict, for *any* given molecule (with *any* given information available for it), the signature corresponding to a certain data type.

This will be a crucial achievement for the development of the CC, as it will yield signatures of type 3, which are currently lacking.