Every CC data will have the following methods:

Also, data are always sorted by a certain key (an InChIKey, typically). These keys should be accessible and iterable without having to load the whole dataset into memory.

* `keys`: The keys, **sorted alphabetically**.

* `__iter__()`: Batch iteration, if necessary.

* `__getattr__()`: Returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a set to do this).
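
A minimal sketch of this lookup behaviour (class and argument names are hypothetical, and plain `__getitem__` is used instead of `__getattr__`, since arbitrary InChIKeys are awkward as attribute names):

```python
import bisect

class SignatureData:
    """Sketch of sorted-key access with a set for fast misses."""

    def __init__(self, keys, V):
        # `keys` are assumed sorted alphabetically, as stored on disk,
        # with one row of `V` per key, in the same order.
        self.keys = list(keys)
        self._keyset = set(keys)  # O(1) membership check before bisecting
        self.V = V

    def __getitem__(self, key):
        # Return the vector for `key`, or None if the key is unknown.
        if key not in self._keyset:
            return None
        idx = bisect.bisect_left(self.keys, key)
        return self.V[idx]

    def __iter__(self):
        # Plain (key, vector) iteration; batched reads could go here.
        return iter(zip(self.keys, self.V))
```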
|
|
|
|
|
|
I think it may be interesting to keep track of the folder where persistency models are stored:
|
|
|
|
|
|
* `PATH`: :thinking: Not sure whether it has to be an *absolute* path.
|
|
|
|
|
|
### Signature commonalities
|
|
|
|
|
|
All signatures type 0-3 contain a numerical data matrix. Also, I think it is interesting to have an idea of the background similarity distribution, given a metric.
|
|
|
|
|
|
* `V`: Typically, a dense matrix (it can be sparse in the case of signatures type 0).
|
|
|
|
|
|
* `metric`: The distance used, typically `cosine`.
|
|
|
* `pvalues`: A `(dist, pval)` array mapping distances to background p-values.
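
One way such a `(dist, pval)` array could be built is by sampling random row pairs as the background (a sketch; `background_pvalues`, its defaults, and the `pvalue` lookup helper are all hypothetical, and cosine is assumed as the metric):

```python
import numpy as np

def background_pvalues(V, n_pairs=10000, seed=42):
    """Empirical background of cosine distances between random row pairs.

    Returns a (dist, pval) array sorted by distance, where pval is the
    fraction of background pairs at distance <= dist, so a small query
    distance maps to a small (significant) p-value.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(V), n_pairs)
    j = rng.integers(0, len(V), n_pairs)
    a, b = V[i], V[j]
    sim = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    dists = np.sort(1.0 - sim)  # cosine distance = 1 - cosine similarity
    pvals = np.arange(1, n_pairs + 1) / n_pairs
    return np.column_stack([dists, pvals])

def pvalue(pvalues, dist):
    # Look up the empirical p-value for a query distance.
    idx = np.searchsorted(pvalues[:, 0], dist, side="right")
    return pvalues[min(idx, len(pvalues) - 1), 1]
```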
|
|
|
|
|
|
## Peculiarities
|
|
|
|
|
|
### Signatures type 0
|
|
|
|
|
|
* From: standard input
|
|
|
* To: `sign0.h5`
|
|
|
These signatures are the result of processing a raw (but standard) input. They can be stored as sparse matrices. Because they represent explicit data, features can be specified.
|
|
|
|
|
|
* `features`: For example, protein targets. Sorted alphabetically.
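
As a sketch of how such an explicit, sparse matrix could be assembled from raw `(key, feature)` annotations (the `build_sign0` helper is hypothetical), keeping both keys and features sorted alphabetically:

```python
from scipy.sparse import csr_matrix

def build_sign0(pairs):
    """Build a sparse binary matrix from (key, feature) pairs,
    e.g. (InChIKey, protein target) annotations."""
    keys = sorted({k for k, _ in pairs})
    features = sorted({f for _, f in pairs})
    row = {k: i for i, k in enumerate(keys)}
    col = {f: j for j, f in enumerate(features)}
    data = [1] * len(pairs)
    rows = [row[k] for k, _ in pairs]
    cols = [col[f] for _, f in pairs]
    V = csr_matrix((data, (rows, cols)), shape=(len(keys), len(features)))
    return keys, features, V
```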
|
|
|
|
|
|
### Signatures type 1
|
|
|
|
|
|
* From: `sign0.h5`
|
|
|
* To: `sign1.h5`
|
|
|
These signatures are the result of a PCA-like projection, so that successive variables contribute decreasing amounts of explained variance. In the fitting of these features, we cut at 90% of the variance, and we also identify the elbow of the variance-explained plot. I think the elbow should be stored in this class:
|
|
|
|
|
|
* `elbow`: Index of the dimension corresponding to the elbow point in the scree plot.
|
|
|
|
|
|
Conveniently, one can do:
|
|
|
|
|
|
```python
elbow_idx = my_sign1.elbow
V_red = my_sign1.V[:, :elbow_idx]
```
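
The elbow itself can be located in several ways; a simple sketch (not necessarily what the pipeline does) is to take the scree-curve point furthest from the straight line joining its endpoints:

```python
import numpy as np

def find_elbow(explained_variance):
    """Index of the elbow: the scree-curve point with maximum
    perpendicular distance from the line joining first and last points."""
    y = np.asarray(explained_variance, dtype=float)
    x = np.arange(len(y), dtype=float)
    x0, y0, xn, yn = x[0], y[0], x[-1], y[-1]
    # Perpendicular distance of each point to the (x0, y0)-(xn, yn) line
    num = np.abs((yn - y0) * x - (xn - x0) * y + xn * y0 - yn * x0)
    den = np.hypot(yn - y0, xn - x0)
    return int(np.argmax(num / den))
```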
|
|
|
|
|
|
### Signatures type 2
|
|
|
|
As you know, we have most of the CC data stored as `HDF5` files.
|
|
* `PATH`: the path where everything is stored
|
|
|
* Every class must have **at least** the following methods:
|
|
|
* `__iter__`: smart, batch iteration, if necessary
|
|
|
* `__getattr__`: returns the vector corresponding to the key. Works fast with `bisect`, but should return `None` if the key is not in `keys` (ideally, keep a set to do this).
|
|
|
* `fit`: Not sure this is necessary... Maybe we can just do it as part of the pipeline.
|
|
|
* `predict`: For new samples, we should be able to produce the corresponding `V` vectors. This will be, by far, the trickiest part. One should access the `models` folder and use the stored models accordingly. To increase speed, one should probably predict only for the samples that are not already in the reference. Sometimes it will be necessary to learn a mapping function, for instance via AdaNet; this is the case for signatures type 2, as node2vec does not allow for out-of-sample mapping.
|
|
|
* `validate`: I'm thinking of a folder where we keep validation files (for now, MoA and ATC), from which we automatically output AUROC and KS metrics, among others.
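
The reuse-the-reference logic of `predict` could be sketched like this (all names hypothetical; `model` stands for whatever gets loaded from the `models` folder):

```python
def predict(new_keys, reference, model):
    """Return one vector per key in `new_keys`.

    `reference` maps key -> vector for samples already signaturized;
    `model` is any object with a .predict(keys) -> list-of-vectors method.
    Only keys missing from the reference are sent to the (slow) model.
    """
    missing = [k for k in new_keys if k not in reference]
    predicted = dict(zip(missing, model.predict(missing))) if missing else {}
    return [reference[k] if k in reference else predicted[k] for k in new_keys]
```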
|