Thoughts about Signature Type 3
Hi @mbertoni,
Some thoughts about Signatures Type 3.
- We will first do 600 cross-predictions to get a feeling for the predictive power (one predictor per ordered pair of the 25 datasets, i.e. 25 × 24 = 600; a small sketch follows below).
- Here we will remove redundancy in the covariate matrix by applying the "remove near duplicates" method from the very beginning (i.e. on Signatures Type 1 or, ideally, Type 0).
- Train/test/validation splitting is done here on a per-pair basis (easy).
- In principle, these cross-predictors are not going to be used downstream, since we decided to forget about the meta-predictor. They are just for our understanding of the data.
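For reference, a minimal sketch of where the 600 comes from, assuming (my assumption) that the 25 datasets follow the usual A1–E5 coding:

```python
from itertools import permutations

# Assumed dataset codes: 5 levels (A-E) x 5 sublevels each, 25 in total.
datasets = [f"{level}{i}" for level in "ABCDE" for i in range(1, 6)]

# One cross-predictor per ordered (covariates, target) dataset pair.
pairs = list(permutations(datasets, 2))
print(len(pairs))  # 25 * 24 = 600
```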
- We will do a train/test/validation (stratified) split that is smart enough to "balance" samples from different datasets.
- Not an easy task, but we do not need an exact solution either (a rough sketch follows below).
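As a first approximation, stratifying on a single dataset label with scikit-learn would look like the sketch below; `X` and `dataset_labels` are made-up placeholders, and the real problem (samples belonging to several datasets at once) will need something smarter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples, each tagged with one dataset of origin.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 128))
dataset_labels = rng.choice(["A1", "A2", "C3", "D1"], size=1000)

# Carve out a 20% test set, then split the rest 75/25 into train/validation,
# keeping the per-dataset proportions roughly constant in every split.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, dataset_labels, test_size=0.2, stratify=dataset_labels, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)
```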
- We will then train 25 predictors, based on matrices of 128×24 (+24) dimensions.
- These will necessarily include many missing values. One possibility (I think) is to assign 0 to the missing values and not allow for "biases" in the functions (weights only), so that missing inputs contribute nothing to the activations.
- An additional boolean vector (the +24 above) may be added to indicate presence/absence of data. Hopefully the network will learn how to handle this (see the sketch below).
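A minimal sketch of that input encoding, with PyTorch as an assumed framework and made-up shapes:

```python
import numpy as np
import torch
import torch.nn as nn

# A 128x24 covariate matrix with NaN columns for missing datasets.
sig = np.random.randn(128, 24)
sig[:, [3, 7]] = np.nan            # pretend datasets 3 and 7 are missing

mask = ~np.isnan(sig).all(axis=0)  # True where a dataset is available
sig = np.nan_to_num(sig, nan=0.0)  # missing values -> 0

# Flatten and append the 24-dim presence/absence vector.
x = np.concatenate([sig.ravel(), mask.astype(float)])  # 128*24 + 24

# First layer without a bias term, so all-zero (missing) inputs
# contribute nothing to the pre-activations.
layer = nn.Linear(128 * 24 + 24, 128, bias=False)
out = layer(torch.from_numpy(x).float())
```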
- I think it will be necessary to do some "data augmentation".
- Say we have A1, A2, D1, C3, C4, C5 available and we want to predict B4. We will "always" have C3-C5 data available to predict B4, but this would be over-optimistic for a real scenario.
- I suggest that we augment the data by duplicating the samples, each copy keeping only a subset of the available datasets. For example:
  - A1, D1
  - C3, C4, A2
  - A1, A2
  - D1
  - etc.
- Here again, we have to come up with a sampling method, the only parameter being the degree of over-sampling that we want to do.
- One possibility is to base this sampling method on conditional probabilities, i.e. drawing subsets according to how often datasets actually co-occur (a toy sketch follows below).
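A toy sketch of the augmentation, with a uniform keep-probability as a stand-in (the hypothetical `p_keep` is where the conditional probabilities would come in, and `n_copies` is the over-sampling degree):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(available, n_copies=3, p_keep=0.5):
    """Duplicate a sample, each copy keeping a random non-empty subset
    of its available datasets."""
    copies = []
    for _ in range(n_copies):
        keep = [d for d in available if rng.random() < p_keep]
        if keep:  # discard empty subsets
            copies.append(keep)
    return copies

print(augment(["A1", "A2", "D1", "C3", "C4", "C5"]))
```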
- Finally, we can think of a "general purpose" fingerprint once we have the 25 Signatures Type 3. This can be done by simply concatenating them (128×25 dimensions) and then training an autoencoder to remove redundancies (see the sketch below). That'd be cool.
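A rough sketch of that last step (PyTorch assumed again; the 128-dim bottleneck and the random data are placeholders):

```python
import torch
import torch.nn as nn

# 25 signatures of 128 dims each, concatenated per sample.
n_samples, dim = 1000, 128 * 25
X = torch.randn(n_samples, dim)

# Simple autoencoder; the bottleneck is the "general purpose" fingerprint.
encoder = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 128))
decoder = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, dim))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

for epoch in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(X)), X)
    loss.backward()
    opt.step()

with torch.no_grad():
    fingerprint = encoder(X)  # (n_samples, 128)
```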
Let's talk about all of this today so we are fully aligned.