chemical_checker issue #25

Closed
Open
Created Nov 05, 2018 by Miquel Duran-Frigola (@mduran, Developer)

HDF5, models and classes

To discuss with @mbertoni and @oguitart. (I've assigned this to Martino because I think he is the most experienced with Python classes.)

As you know, most of the CC data is stored as HDF5 files. I think HDF5 is a good format and we should stick with it. However, these files should be accessed through classes, and those classes must not load all the data into memory.

Signature classes

  • Applies to:
    • Signatures Type 0
    • Signatures Type 1
    • Signatures Type 2
    • Signatures Type 3
    • 2D projections
  • Every class must have at least the following attributes:
    • V: the values
    • keys: the keys, sorted alphabetically
    • metric: the distance metric used
    • pvalues: (distance, p-value) array
    • PATH: the path where everything is stored
  • Every class must have at least the following methods:
    • __iter__: smart, batched iteration where necessary.
    • __getitem__: returns the vector corresponding to a key. Lookup is fast with bisect on the sorted keys, but it should return None if the key is not in keys (ideally, keep a set for this check).
    • fit: Not sure this is necessary... Maybe we can just do it as part of the pipeline.
    • predict: For new samples, we should be able to produce the corresponding V vectors. This will be, by far, the trickiest part. It should access the models folder and use the stored models accordingly. To speed this up, it should probably predict only for samples that are not already in the reference. Sometimes it will be necessary to learn a mapping function, for instance via AdaNet; this is the case for Signature Type 2, because node2vec does not allow out-of-sample mapping.
    • validate: I'm thinking of a folder holding validation files (for now, MoA and ATC), from which AUROC and KS metrics, among others, are output automatically.
    • background: Not sure this is necessary... Just like fit, this only has to be done at initialization.
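
A minimal sketch of the key lookup and batched iteration described above, assuming Python. `SignatureSketch`, `batch_size` and the toy data are hypothetical names for illustration, not existing Chemical Checker code. `V` only needs `len()` plus integer and slice indexing, so a list of lists is used here; an on-disk dataset supporting the same indexing could stand in for it, so that only the requested rows are read. Key lookup is written as `__getitem__`, the hook Python calls for `sig[key]`.

```python
from bisect import bisect_left


class SignatureSketch:
    """Hypothetical read-only view over sorted keys and a row-indexable matrix."""

    def __init__(self, keys, V, batch_size=1000):
        # Keys are held in memory (they are small); V is only indexed on demand.
        self.keys = list(keys)
        if any(a >= b for a, b in zip(self.keys, self.keys[1:])):
            raise ValueError("keys must be sorted and unique")
        if len(self.keys) != len(V):
            raise ValueError("keys and V must have the same number of rows")
        self.V = V
        self.batch_size = batch_size
        self._key_set = set(self.keys)  # O(1) membership test

    def __contains__(self, key):
        return key in self._key_set

    def __getitem__(self, key):
        # Return None for unknown keys instead of raising, as proposed above.
        if key not in self._key_set:
            return None
        # Binary search on the sorted keys gives the row index.
        return self.V[bisect_left(self.keys, key)]

    def __iter__(self):
        # Yield (keys, rows) one batch at a time.
        for start in range(0, len(self.keys), self.batch_size):
            stop = start + self.batch_size
            yield self.keys[start:stop], self.V[start:stop]


# Toy data standing in for an on-disk matrix.
sig = SignatureSketch(
    ["AAA", "BBB", "CCC"],
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    batch_size=2,
)
row = sig["BBB"]       # the stored vector for "BBB"
missing = sig["ZZZ"]   # None, because the key is unknown
batches = list(sig)    # two batches: two rows, then one row
```

Keeping both the sorted list and the set duplicates the keys in memory, but it keeps the membership check constant-time and leaves bisect to do only the index lookup.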
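
As a rough illustration of the two metrics named for validate, here is a standard-library sketch of AUROC and the two-sample KS statistic. It assumes validation boils down to two lists of similarity scores (pairs sharing an annotation such as MoA versus pairs that do not), with higher meaning more similar; that framing, the function names and the toy numbers are assumptions, not part of the proposal. Only the KS statistic is computed, not its p-value.

```python
from bisect import bisect_left, bisect_right


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    neg_sorted = sorted(neg_scores)
    wins = 0.0
    for p in pos_scores:
        below = bisect_left(neg_sorted, p)          # negatives strictly below p
        ties = bisect_right(neg_sorted, p) - below  # negatives equal to p
        wins += below + 0.5 * ties
    return wins / (len(pos_scores) * len(neg_scores))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy similarity scores (hypothetical values).
same_moa = [0.9, 0.8, 0.4]
different_moa = [0.5, 0.3, 0.2, 0.1]
print(auroc(same_moa, different_moa))
print(ks_statistic(same_moa, different_moa))
```

Both functions expect non-empty inputs; an empty list raises ZeroDivisionError.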

Other classes

  • We have other data types, such as the nearest neighbors produced by Oriol and the clusters produced by me.
  • These must also have at least the following methods:
    • __iter__
    • __getitem__
    • predict: As always, we want to be able to predict for new molecules using the stored models.
    • validate: Here we will not use AUROC and KS, but other statistics, depending on the case.
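
One way predict could avoid redundant work is to reuse stored vectors and call the model only for molecules that are not yet in the reference. The sketch below assumes `inputs` maps each key to its raw features, `reference` is any mapping-like object supporting `in` and `[]`, and `model` is a callable that takes a list of features and returns one vector per entry. All three names, and `toy_model`, are hypothetical.

```python
def predict_with_reference(inputs, reference, model):
    """Return {key: vector}, calling the model only for keys missing from reference."""
    new_keys = [key for key in inputs if key not in reference]
    predicted = {}
    if new_keys:
        # One batched call covering just the unseen samples.
        vectors = model([inputs[key] for key in new_keys])
        predicted = dict(zip(new_keys, vectors))
    # Preserve the order of the inputs in the result.
    return {
        key: predicted[key] if key in predicted else reference[key]
        for key in inputs
    }


calls = []


def toy_model(batch):
    # Stand-in for a stored model; records what it was asked to predict.
    calls.append(list(batch))
    return [[x * 2] for x in batch]


reference = {"AAA": [1.0]}
result = predict_with_reference({"AAA": 5, "BBB": 7}, reference, toy_model)
```

Here only "BBB" reaches the model, while "AAA" is served from the reference.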

Here is a diagram of the first part of the Chemical Checker pipeline:

[Image: CC_backbone]
