Better 2D projections
I am not able to consistently obtain good 2D projections for all spaces. There are two issues:
- Big spaces (N > 100k) give granular projections, which is quite disappointing for a global visualization.
- There is a loss of AUC performance, relative to the multi-dimensional spaces, in the MoA / ATC validations.
There seems to be a trade-off between these two issues. As a constraint, I don't want to tweak tSNE too much [see `mlutils/proj.h5`]. These are the current solutions I've come up with:
Default solution
- Input = `sig.h5`
- Treatment = remove (near-)duplicates (possibly with `faiss` LSH; sketched below).
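
For reference, a minimal sketch of what the LSH-based deduplication could look like. It assumes the signatures are float32 rows of an array `X` (e.g. loaded from `sig.h5`); the code size (`nbits`), the number of neighbors inspected and the Hamming threshold are illustrative values that would need tuning.

```python
import numpy as np
import faiss

def remove_near_duplicates(X, nbits=256, k=10, max_hamming=8):
    """Return indices of rows to keep, greedily dropping near-duplicates.

    Two rows are treated as near-duplicates when the Hamming distance
    between their LSH codes is <= max_hamming (an assumed threshold).
    """
    X = np.ascontiguousarray(X, dtype=np.float32)
    index = faiss.IndexLSH(X.shape[1], nbits)
    index.add(X)
    # For each vector, inspect its k closest codes (Hamming distance).
    D, I = index.search(X, k)
    kept = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # keep i unless an already-kept vector sits within the threshold
        is_dup = any(j != i and kept[j] and d <= max_hamming
                     for d, j in zip(D[i], I[i]))
        if not is_dup:
            kept[i] = True
    return np.where(kept)[0]

# Toy usage; in practice X would come from sig.h5.
X = np.random.rand(1000, 128).astype(np.float32)
X_dedup = X[remove_near_duplicates(X)]
```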
Martino's solution
- Input = `netemb.h5` (128-d vectors resulting from running `node2vec` on the similarity network, built with a p-value cutoff of 0.01).
- Treatment = remove near-duplicates with `faiss` LSH.
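
For completeness, here is roughly how such an embedding could be produced with the `node2vec` package on top of `networkx`. The edge list, walk parameters and window size below are illustrative assumptions, not necessarily the settings behind `netemb.h5`.

```python
import networkx as nx
from node2vec import Node2Vec

# Hypothetical edge list: (sig_i, sig_j, similarity) for pairs that
# pass the p-value < 0.01 cutoff when building the similarity network.
significant_edges = [("SIG_0001", "SIG_0002", 0.83),
                     ("SIG_0002", "SIG_0003", 0.91)]

G = nx.Graph()
G.add_weighted_edges_from(significant_edges)

# 128-d embeddings; walk parameters are illustrative defaults
n2v = Node2Vec(G, dimensions=128, walk_length=80, num_walks=10, workers=4)
model = n2v.fit(window=10, min_count=1)

vec = model.wv["SIG_0001"]  # the 128-d vector of a signature node
```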
Siamese solution (the current one)
- Input = `clustemb.h5`, resulting from training a Siamese network on the k-Means clustering results (learning to identify pairs of signatures that belong to the same cluster). The mid-dimensional inner hidden layer of the Siamese network is used as the embedding (see the sketch below).
- Treatment = none, currently, but we should remove near-duplicates with `faiss` LSH.
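
Since this is the current solution, a minimal PyTorch sketch of the idea, for discussion: a shared encoder trained with a contrastive loss on pairs labeled by k-Means co-membership, whose output would be the embedding stored in `clustemb.h5`. The layer sizes, margin, input dimensionality and `pair_loader` (a DataLoader yielding labeled pairs) are hypothetical.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared branch of the Siamese network; its output is the embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    # pull same-cluster pairs together, push others at least `margin` apart
    d = torch.norm(z1 - z2, dim=1)
    return (same * d.pow(2) +
            (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

enc = Encoder(in_dim=2048)  # assumed signature dimensionality
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)

# pair_loader is a hypothetical DataLoader yielding (x1, x2, same), where
# same = 1 if both signatures fall in the same k-Means cluster, else 0.
for x1, x2, same in pair_loader:
    loss = contrastive_loss(enc(x1), enc(x2), same.float())
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, enc(X) yields the mid-dimensional embedding fed to tSNE.
```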
Another solution from Martino
- Input = `densenetemb.h5` (32-d vectors from `node2vec`).
- Treatment = remove near-duplicates with `faiss` LSH.
- @mbertoni, can you please produce 32-d embeddings so that we can test this option? Keep all the other parameters fixed.
Human eye solution
- Conveniently, `faiss` has a fast k-Means implementation that we could use to cluster into a relatively large number of clusters (e.g. `k = np.sqrt(N)` or `k = np.sqrt(N/2)`). It is not really necessary to have so many points in a 2D projection, so I thought we could just 2D-project the k centroids. This would make tSNE much faster and, possibly, more nicely organized. We would then have to find a way to interpolate the rest of the points (see the sketch after this list).
- Dangers: ideally, clusters should be of comparable size; otherwise, when we map new molecules, they might end up concentrated in a few centroids.
- One may also consider removing outliers, even though this is not easily done with k-Means.
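
A rough sketch of this route, assuming scikit-learn's tSNE and the simplest possible interpolation (each point inherits its centroid's 2D position, with a little jitter); the `niter` setting and the jitter scale are illustrative.

```python
import numpy as np
import faiss
from sklearn.manifold import TSNE

X = np.random.rand(100_000, 128).astype(np.float32)  # placeholder signatures
N, d = X.shape
k = int(np.sqrt(N))  # or int(np.sqrt(N / 2))

# fast faiss k-Means on the full space
km = faiss.Kmeans(d, k, niter=25)
km.train(X)

# tSNE only on the k centroids: much cheaper than on all N points
proj = TSNE(n_components=2).fit_transform(km.centroids)

# naive interpolation: each point takes its centroid's 2D position,
# plus small jitter so co-clustered points do not overlap exactly
_, assign = km.index.search(X, 1)
xy = proj[assign.ravel()] + np.random.normal(scale=0.05, size=(N, 2))
```

Nearest-centroid placement is the crudest option; a weighted average over a few nearest centroids would interpolate more smoothly.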
I would suggest that @oguitart first implements the LSH near-duplicate detection, and then we decide which way to go.