Consistency and persistency of the metric
- I suggest that every vectorial file has the following format:
  - V: the matrix
  - keys: the values, always sorted alphabetically
  - normed: flag to say whether it has been normed or not
  - integerized: flag to say whether it has been integerized
  - principal_components: whether the columns are sorted by variable importance
  - shape: tuple to have the shape
  - distance: background distance
  - pvalue: pvalue
  - metric: metric used for the background
  - name: name of the dataset
  - date: date of creation
  - ... (some datasets, like clust.h5, may have additional matrices). No problem.
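As a rough illustration of the format above, here is a minimal sketch of a consistency check. It assumes the file behaves like a dict (an open h5py file, or a plain dict of NumPy arrays). The helper name check_vector_file and the choice to treat every listed field as required are assumptions of this sketch, not part of the proposal. "Sorted alphabetically" is interpreted here as Python's default string ordering.

```python
import numpy as np

# Field names are taken from the list above. Treating all of them as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = ("V", "keys", "normed", "integerized", "principal_components",
                   "shape", "distance", "pvalue", "metric", "name", "date")


def check_vector_file(hf, required=REQUIRED_FIELDS):
    """Return a list of problems found in a dict-like vectorial file."""
    problems = ["missing field: %s" % f for f in required if f not in hf]
    if "keys" in hf:
        keys = list(hf["keys"][:])
        # Python's default ordering stands in for "alphabetically" here.
        if keys != sorted(keys):
            problems.append("keys are not sorted")
    if "V" in hf and "shape" in hf:
        stored = tuple(np.asarray(hf["shape"][()]).tolist())
        if tuple(hf["V"].shape) != stored:
            problems.append("shape does not match V")
    return problems
```

An empty list means the file passes. For files such as clust.h5, a shorter tuple (for example only keys plus whatever else is needed) can be passed as required.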
- About the values that these metadata may take:
sig.h5
- normed: depends on the args. Be careful: the argument is not_normalized, so if the argument is False then normed is True, and vice versa.
- integerized: depends on the argument. If args.integerize is True, then True; if it is False, then False.
- principal_components: True
- metric: cosine
- name: for example, A1_sig
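The inverted normed flag is easy to get wrong, so here is a small sketch of how the sig.h5 metadata could be derived from the arguments. The attribute names not_normalized and integerize come from the text above. The command-line flag spellings and the helper name sig_metadata are assumptions.

```python
import argparse

parser = argparse.ArgumentParser()
# Flag spellings are assumed; only the attribute names appear in the text.
parser.add_argument("--not_normalized", action="store_true")
parser.add_argument("--integerize", action="store_true")


def sig_metadata(args):
    """Metadata values for sig.h5 (hypothetical helper)."""
    return {
        "normed": not args.not_normalized,  # note the inversion
        "integerized": bool(args.integerize),
        "principal_components": True,
        "metric": "cosine",
    }


meta = sig_metadata(parser.parse_args(["--integerize"]))
```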
clust.h5
The only indispensable datasets in this file are keys and labels. The rest really is a bonus, and is only kept when the clustering is done with pqkmeans (other versions of the clustering, such as hdbscan, cannot produce V vectors and so on). This one is complicated, I will do it myself, OK?
- [V: in the current version, it is called V_pqcode. Let's call it V]
- normed: False
- integerized: False
- principal_components: False
- metric: symmetric_distance
- distance and pvalue: they were, formerly, bg_pq_euclideans.
- name: for example, A1_clust
- But remember, the only indispensable datasets here are keys and labels.
clustemb.h5
- normed: False
- integerized: False
- principal_components: False
- metric: cosine
- distance and pvalue: We can calculate them (see auto_plots.py)
proj.h5
- normed: False
- integerized: False
- principal_components: False
- metric: cosine
- distance and pvalue: We can calculate them (see auto_plots.py)
<martino embeddings 1 and 2>
- normed: False
- integerized: False
- principal_components: False
- metric: cosine
- distance and pvalue: We can calculate them (see auto_plots.py)
About the metric (the function to calculate distance)
As you can see, we have different distances:
cosine
- if normed: then 1 - np.dot(x,y) (fast)
- if not normed: then cosine(x,y) (slow)
euclidean
- always: euclidean(x,y)
symmetric_distance
- this is the pqkmeans version of euclidean, using a lookup table. It relates to the clusters.
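The selection rules above could be wrapped in one helper, sketched below. It assumes that cosine and euclidean refer to the functions in scipy.spatial.distance. The name get_metric is hypothetical. The pqkmeans symmetric_distance is deliberately left unimplemented, since its lookup table depends on the fitted clustering.

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean


def get_metric(name, normed=False):
    """Return a function metric(x, y) for the given metric name."""
    if name == "cosine":
        if normed:
            # Unit-length vectors: the cosine distance is 1 minus the dot product.
            return lambda x, y: 1.0 - np.dot(x, y)
        return cosine
    if name == "euclidean":
        return euclidean
    # symmetric_distance needs the pqkmeans lookup table; not sketched here.
    raise NotImplementedError("no metric implemented for %r" % name)
```

With this, a call such as cosine(v1, v2) becomes metric = get_metric(name, normed) followed by metric(v1, v2).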
So what scripts need to be modified?
EVERY function that calculates a distance, for example cosine(v1, v2), should now be metric(v1, v2). This affects the calculation of similarities, the calculation of background_distances, etc.
IN ADDITION, I very often use the function cdist. Every call of cdist must now have the corresponding metric. For example, cdist(X, Y, metric=metric)
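A minimal sketch of such a call, assuming cdist is scipy.spatial.distance.cdist and that the metric name was read from the file. The wrapper name pairwise_distances is hypothetical. String values read back from HDF5 may arrive as bytes, so the sketch decodes them first.

```python
import numpy as np
from scipy.spatial.distance import cdist


def pairwise_distances(X, Y, metric):
    """Call cdist with the metric stored alongside the data."""
    if isinstance(metric, bytes):
        # Strings read from HDF5 may come back as bytes.
        metric = metric.decode()
    return cdist(X, Y, metric=metric)


X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[1.0, 0.0]])
D = pairwise_distances(X, Y, b"cosine")
```

cdist accepts either a metric name it knows (such as "cosine" or "euclidean") or a callable taking two vectors. It does not know the name symmetric_distance, so that case would have to be passed as a callable.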
Other considerations
You might have realized that sometimes we put a limit on the number of components to take from a matrix (e.g. max_comp = 200). Well, this is ONLY VALID if principal_components=True. Therefore, we can add something like this:
    if max_comp and principal_components:  # max_comp is set and columns are sorted by importance
        V = hf["V"][:, :max_comp]
    else:
        V = hf["V"][:]
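The snippet above takes principal_components as given. The sketch below shows one way it might be read from the file itself, assuming the flag is stored as a scalar dataset named principal_components. That storage choice and the helper name load_matrix are assumptions.

```python
import numpy as np


def load_matrix(hf, max_comp=None):
    """Load V, truncating columns only when they are sorted by importance."""
    # Assumes the flag is stored as a scalar dataset in the file.
    principal_components = bool(hf["principal_components"][()])
    if max_comp and principal_components:
        return hf["V"][:, :max_comp]
    return hf["V"][:]
```

Here hf can be an open h5py file or, for testing, a plain dict of NumPy arrays.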