Better 2D projections
I am not able to consistently obtain good 2D projections for all spaces. There are two issues:
- Big spaces (N > 100k) give granular projections, which is quite disappointing for a global visualization.
- There is a loss of AUC performance, relative to the multi-dimensional spaces, in the MoA / ATC validations.
There seems to be a trade-off between these two issues. As a constraint, I don't want to tweak tSNE too much [see `mlutils/proj.h5`]. These are the current solutions I've come up with:
Default solution
- Input = `sig.h5`
- Treatment = remove (near-)duplicates (possibly with `faiss` LSH; sketched below).
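
For reference, a minimal sketch of what the LSH-based deduplication could look like. It assumes the signatures are float32 rows of an array `X` (e.g. loaded from `sig.h5`); the code size (`nbits`), the number of neighbors inspected and the Hamming threshold are illustrative values that would need tuning.

```python
import numpy as np
import faiss

def remove_near_duplicates(X, nbits=256, k=10, max_hamming=8):
    """Return indices of rows to keep, greedily dropping near-duplicates.

    Two rows are treated as near-duplicates when the Hamming distance
    between their LSH codes is <= max_hamming (an assumed threshold).
    """
    X = np.ascontiguousarray(X, dtype=np.float32)
    index = faiss.IndexLSH(X.shape[1], nbits)
    index.add(X)
    # For each vector, inspect its k closest codes (Hamming distance).
    D, I = index.search(X, k)
    kept = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        # keep i unless an already-kept vector sits within the threshold
        is_dup = any(j != i and kept[j] and d <= max_hamming
                     for d, j in zip(D[i], I[i]))
        if not is_dup:
            kept[i] = True
    return np.where(kept)[0]

# Toy usage; in practice X would come from sig.h5.
X = np.random.rand(1000, 128).astype(np.float32)
X_dedup = X[remove_near_duplicates(X)]
```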
Martino's solution
- Input = `netemb.h5` (128-d vectors resulting from running `node2vec` on the similarity network, built with a p-value cutoff of 0.01).
- Treatment = remove near-duplicates with `faiss` LSH.
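
For completeness, here is roughly how such an embedding could be produced with the `node2vec` package on top of `networkx`. The edge list, walk parameters and window size below are illustrative assumptions, not necessarily the settings behind `netemb.h5`.

```python
import networkx as nx
from node2vec import Node2Vec

# Hypothetical edge list: (sig_i, sig_j, similarity) for pairs that
# pass the p-value < 0.01 cutoff when building the similarity network.
significant_edges = [("SIG_0001", "SIG_0002", 0.83),
                     ("SIG_0002", "SIG_0003", 0.91)]

G = nx.Graph()
G.add_weighted_edges_from(significant_edges)

# 128-d embeddings; walk parameters are illustrative defaults
n2v = Node2Vec(G, dimensions=128, walk_length=80, num_walks=10, workers=4)
model = n2v.fit(window=10, min_count=1)

vec = model.wv["SIG_0001"]  # the 128-d vector of a signature node
```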
Siamese solution (the current one)
- Input = `clustemb.h5`, resulting from training a Siamese network on the k-Means clustering results (learning to identify pairs of signatures that belong to the same cluster). The mid-dimensional inner hidden layer of the Siamese network is used as the embedding (see the sketch below).
- Treatment = none, currently, but we should remove near-duplicates with `faiss` LSH.
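
Since this is the current solution, a minimal PyTorch sketch of the idea, for discussion: a shared encoder trained with a contrastive loss on pairs labeled by k-Means co-membership, whose output would be the embedding stored in `clustemb.h5`. The layer sizes, margin, input dimensionality and `pair_loader` (a DataLoader yielding labeled pairs) are hypothetical.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared branch of the Siamese network; its output is the embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    # pull same-cluster pairs together, push others at least `margin` apart
    d = torch.norm(z1 - z2, dim=1)
    return (same * d.pow(2) +
            (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

enc = Encoder(in_dim=2048)  # assumed signature dimensionality
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)

# pair_loader is a hypothetical DataLoader yielding (x1, x2, same), where
# same = 1 if both signatures fall in the same k-Means cluster, else 0.
for x1, x2, same in pair_loader:
    loss = contrastive_loss(enc(x1), enc(x2), same.float())
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, enc(X) yields the mid-dimensional embedding fed to tSNE.
```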
Another solution from Martino
- Input = `densenetemb.h5` (32-d vectors from `node2vec`).
- Treatment = remove near-duplicates with `faiss` LSH.
- @mbertoni, can you please produce 32-d embeddings so that we can test this option? Keep all the other parameters fixed.
Human eye solution
- Conveniently, `faiss` has a fast k-Means implementation that we could use to cluster into a relatively large number of clusters (e.g. `k = np.sqrt(N)` or `k = np.sqrt(N/2)`). It is not really necessary to have so many points in a 2D projection, so I thought we could just 2D-project the k centroids. This would make tSNE much faster and, possibly, more nicely organized. We would then have to find a way to interpolate the rest of the points (see the sketch after this list).
- Dangers: ideally, clusters should be of comparable size; otherwise, when we map new molecules, they might end up concentrated in a few centroids.
- One may also consider removing outliers, even though this is not easily done with k-Means.
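
A rough sketch of this route, assuming scikit-learn's tSNE and the simplest possible interpolation (each point inherits its centroid's 2D position, with a little jitter); the `niter` setting and the jitter scale are illustrative.

```python
import numpy as np
import faiss
from sklearn.manifold import TSNE

X = np.random.rand(100_000, 128).astype(np.float32)  # placeholder signatures
N, d = X.shape
k = int(np.sqrt(N))  # or int(np.sqrt(N / 2))

# fast faiss k-Means on the full space
km = faiss.Kmeans(d, k, niter=25)
km.train(X)

# tSNE only on the k centroids: much cheaper than on all N points
proj = TSNE(n_components=2).fit_transform(km.centroids)

# naive interpolation: each point takes its centroid's 2D position,
# plus small jitter so co-clustered points do not overlap exactly
_, assign = km.index.search(X, 1)
xy = proj[assign.ravel()] + np.random.normal(scale=0.05, size=(N, 2))
```

Nearest-centroid placement is the crudest option; a weighted average over a few nearest centroids would interpolate more smoothly.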
I would suggest that @oguitart first implements the LSH near-duplicate detection, and then we decide which way to go.