Consistency and persistency of the metric

I suggest that every file that stores vectors has the following format:
- V: the matrix
- keys: the values, always sorted alphabetically
- normed: flag to say whether it has been normed or not
- integerized: flag to say whether it has been integerized
- principal_components: whether the columns are sorted by variable importance
- shape: tuple with the shape of V
- distance: background distance
- pvalue: pvalue
- metric: metric used for the background
- name: name of the dataset
- date: date of creation
- ... (some datasets, like clust.h5, may have additional matrices). No problem.
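
To make the format above easy to check, here is a minimal sketch of a validator. The field names come from the list above; the helper functions (missing_fields, keys_are_sorted) are hypothetical and not part of the current code. It only needs an iterable of dataset names, so in principle it could be fed the names found in an open h5 file. Note that "sorted alphabetically" is assumed here to mean Python's default string ordering.

```python
# Dataset names taken from the proposed format; the helpers are hypothetical.
REQUIRED_FIELDS = (
    "V", "keys", "normed", "integerized", "principal_components",
    "shape", "distance", "pvalue", "metric", "name", "date",
)


def missing_fields(present_names, required=REQUIRED_FIELDS):
    """Return the required dataset names that are absent, in list order."""
    present = set(present_names)
    return [name for name in required if name not in present]


def keys_are_sorted(keys):
    """True if keys are in ascending order (assumed: Python's default ordering)."""
    keys = list(keys)
    return keys == sorted(keys)


# Example: a file that only has V and keys is missing everything else.
print(missing_fields(["V", "keys"]))
```

For files such as clust.h5, where only some datasets are mandatory, a shorter tuple can be passed as `required`.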

About the values that these metadata may take:

sig.h5

normed: depends on the args (be careful! the argument is not_normalized, so if the argument is false, then normed is true, and vice versa)
integerized: depends on the argument: it takes the same value as args.integerize.
principal_components: true
metric: cosine
name: for example, A1_sig
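
Since the inverted flag is easy to get wrong, here is a small sketch of how the sig.h5 values could be derived from the arguments. The argument names not_normalized and integerize are the ones mentioned above; the function sig_flags is hypothetical, and the Namespace is built by hand only for illustration.

```python
import argparse


def sig_flags(args):
    """Derive the sig.h5 metadata flags from the command-line arguments."""
    return {
        # The argument is negated: not_normalized=False means normed=True.
        "normed": not args.not_normalized,
        "integerized": bool(args.integerize),
        "principal_components": True,
        "metric": "cosine",
    }


# Hand-built arguments, standing in for the result of a real parser.
args = argparse.Namespace(not_normalized=False, integerize=True)
flags = sig_flags(args)
print(flags)
```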

clust.h5

The only indispensable datasets in this file are keys and labels.

The rest really is a bonus, and is only kept when the clustering is done with pqkmeans (other versions of the clustering, such as hdbscan, cannot produce V vectors and so on). This one is complicated, so I will do it myself, OK?

[V: in the current version, it is called V_pqcode. Let's call it V]
normed: False
integerized: False
principal_components: False
metric: symmetric_distance
distance and pvalue: they were, formerly, bg_pq_euclideans.
name: for example, A1_clust
But remember, the only indispensable datasets here are keys and labels.

clustemb.h5

normed: False
integerized: False
principal_components: False
metric: cosine
distance and pvalue: We can calculate them (see auto_plots.py)

proj.h5

normed: False
integerized: False
principal_components: False
metric: cosine
distance and pvalue: We can calculate them (see auto_plots.py)

<martino embeddings 1 and 2>

normed: False
integerized: False
principal_components: False
metric: cosine
distance and pvalue: We can calculate them (see auto_plots.py)

About the metric (the function to calculate distance)

As you can see, we have different distances:

cosine

if normed: then 1 - np.dot(x,y) (fast)
if not normed: then cosine(x,y) (slow)

euclidean

always: euclidean(x,y)

symmetric_distance

This is the pqkmeans version of euclidean, using a lookup table. It relates to the clusters.
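
The choice between these distances could be centralized in one place. Below is a rough sketch, assuming the metric name and the normed flag have already been read from the file. The function get_metric and the three distance helpers are hypothetical names. symmetric_distance is deliberately left unimplemented here, because it depends on the pqkmeans lookup table.

```python
import numpy as np


def cosine_distance(x, y):
    """Cosine distance for arbitrary vectors (slow path)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def cosine_distance_normed(x, y):
    """Cosine distance for vectors that already have unit norm (fast path)."""
    return 1.0 - np.dot(x, y)


def euclidean_distance(x, y):
    """Plain Euclidean distance."""
    return np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))


def get_metric(name, normed=False):
    """Return a distance function metric(v1, v2) for the given metric name."""
    if name == "cosine":
        return cosine_distance_normed if normed else cosine_distance
    if name == "euclidean":
        return euclidean_distance
    if name == "symmetric_distance":
        # Needs the pqkmeans lookup table, which is out of scope for this sketch.
        raise NotImplementedError("symmetric_distance requires the pqkmeans lookup table")
    raise ValueError("unknown metric: %s" % name)


metric = get_metric("cosine", normed=True)
print(metric(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```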

So which scripts need to be modified? Every function that calculates a distance, for example cosine(v1, v2), should now be metric(v1, v2). This affects the calculation of similarities, the calculation of background_distances, etc. In addition, I very often use the function cdist. Every call to cdist must now pass the corresponding metric, for example cdist(X, Y, metric=metric).
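
As an illustration of the cdist change, here is a small sketch using scipy.spatial.distance.cdist, which accepts a metric name such as "cosine" or "euclidean". The matrices are random and only for demonstration, and the metric variable is assumed to come from the file's metric field. The sketch also checks that, for row-normalized matrices, the fast 1 - dot product form gives the same cosine distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Random demonstration data (not real signatures).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Y = rng.normal(size=(3, 8))

# Assumed to be read from the file's "metric" field in practice.
metric = "cosine"
D = cdist(X, Y, metric=metric)

# Fast path: once rows have unit norm, cosine distance is 1 - dot product.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
D_fast = 1.0 - Xn @ Yn.T

print(np.allclose(D, D_fast))
```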

Other considerations

You might have noticed that we sometimes put a limit on the number of components to take from a matrix (e.g. max_comp = 200). Well, this is ONLY VALID if principal_components=True. Therefore, we can add something like this:

if max_comp and principal_components: # max_comp is not None
    V = hf["V"][:,:max_comp]
else:
    V = hf["V"][:]
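
The same logic could be wrapped in a helper so that every script applies it consistently. This is only a sketch: the function name load_V is hypothetical, and a plain dict holding a NumPy array stands in for the open h5 file so that the example runs on its own. Unlike the snippet above, it tests max_comp against None explicitly.

```python
import numpy as np


def load_V(hf, max_comp=None, principal_components=False):
    """Load V, truncating columns only when they are sorted by importance."""
    if max_comp is not None and principal_components:
        return hf["V"][:, :max_comp]
    return hf["V"][:]


# Stand-in for an open h5 file: anything indexable by dataset name works here.
hf = {"V": np.arange(20).reshape(4, 5)}

print(load_V(hf, max_comp=2, principal_components=True).shape)
print(load_V(hf, max_comp=2, principal_components=False).shape)
```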

Edited Oct 05, 2018 by Miquel Duran-Frigola