The chemical checker package aims to provide a single interface to all chemical checker tools, from generating new data (signatures, etc.) to accessing data that has already been produced.
This package will be used in the chemical checker pipeline, but it is completely independent of that pipeline. The package will have five main modules, shown in the flow diagram. It will also contain a set of unit tests and API documentation to help users work with the package.
CORE
This module will contain all the main functionality of the package. To use it, users will obtain an instance of ChemicalChecker by passing a config file. Using this config file, the package will be able to validate the existing directory structure or create a new one.
This module will contain an abstract class (base.py) that defines the main methods the other classes will inherit. There will then be one class for each type of signature and for each other kind of generated data (clusters, nneigh, etc.).
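The class layout described above could be sketched roughly as follows. This is only an illustration under assumptions: the names BaseSignature, Sign, Clus, fit and describe are hypothetical, and the only part taken from the design is the idea of an abstract class in base.py that the data classes inherit from.

```python
from abc import ABC, abstractmethod


class BaseSignature(ABC):
    """Hypothetical abstract base class (would live in base.py)."""

    def __init__(self, data_path):
        # Path of the H5 file backing this data type.
        self.data_path = data_path

    @abstractmethod
    def fit(self, data):
        """Generate the data; every subclass must implement this."""

    def describe(self):
        # Shared behaviour inherited by every subclass.
        return f"{type(self).__name__} stored at {self.data_path}"


class Sign(BaseSignature):
    def fit(self, data):
        # Placeholder: a real implementation would compute a signature.
        return list(data)


class Clus(BaseSignature):
    def fit(self, data):
        # Placeholder: a real implementation would cluster the data.
        return sorted(set(data))
```

Because BaseSignature declares fit as abstract, it cannot be instantiated directly, which forces each data class to provide its own implementation.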
DATABASE
The goal of this module is to let users interact with the chemical checker database without writing queries. The module should provide methods for each of the database tables. Through this module the package should be database agnostic: the database could use any database technology, and the user should not need to be aware of which one.
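One way to picture a query-free, database-agnostic interface is an abstract store with a concrete backend behind it. The sketch below is an assumption-laden example, not the planned implementation: the class names MoleculeStore and SqliteMoleculeStore, the table name molecule and its columns are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3
from abc import ABC, abstractmethod


class MoleculeStore(ABC):
    """Hypothetical query-free interface to a single database table."""

    @abstractmethod
    def add(self, inchikey, inchi):
        """Store one molecule."""

    @abstractmethod
    def get(self, inchikey):
        """Return the stored InChI, or None if the key is unknown."""


class SqliteMoleculeStore(MoleculeStore):
    """One possible backend; callers never see the SQL."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS molecule "
            "(inchikey TEXT PRIMARY KEY, inchi TEXT)"
        )

    def add(self, inchikey, inchi):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO molecule VALUES (?, ?)",
                (inchikey, inchi),
            )

    def get(self, inchikey):
        row = self._conn.execute(
            "SELECT inchi FROM molecule WHERE inchikey = ?", (inchikey,)
        ).fetchone()
        return row[0] if row else None
```

Code written against MoleculeStore would keep working if the SQLite backend were swapped for another database technology.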
IO
This module will help users manage all file downloads. It will also contain all the logic for reading and writing H5 files in different ways (for example, in batch mode).
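Batch-mode reading can be illustrated with a small generator that walks a dataset in fixed-size slices. This is a sketch under assumptions: the helper name iter_batches is hypothetical, and it is written against any object that supports len() and slicing, so it runs here on a plain list. An h5py dataset also supports both operations, but the use of h5py is an assumption, since the design only mentions H5 files.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        # Only one slice is materialised at a time.
        yield dataset[start:start + batch_size]


# Usage with an in-memory list standing in for an H5 dataset.
batches = list(iter_batches([1, 2, 3, 4, 5], batch_size=2))
```

Reading slice by slice keeps memory use bounded by the batch size rather than by the size of the file.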
TOOLS
This module will encapsulate several tools that the chemical checker uses (node2vec, hotnet, etc.). The idea is to give users an interface to the main functionality of these tools.
UTIL
This module will contain several utilities that the package uses and that users might need too.
Logging
It will be used internally by the package, but it could also be useful to users.
Config
All config parsing will be implemented here. The config file will be passed between modules through an environment variable. This variable will be set inside the container, and the config file will always be stored at the same location inside the container.
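Locating the config file through an environment variable might look like the sketch below. Several details are assumptions made for illustration: the variable name CC_CONFIG, the function name load_config and the JSON file format are not specified by the design.

```python
import json
import os

# Hypothetical variable name; the design only says "an environment variable".
CONFIG_ENV_VAR = "CC_CONFIG"


def load_config(environ=None):
    """Parse the config file whose path is stored in CONFIG_ENV_VAR."""
    environ = os.environ if environ is None else environ
    path = environ.get(CONFIG_ENV_VAR)
    if not path:
        raise RuntimeError(f"{CONFIG_ENV_VAR} is not set")
    # Assumes a JSON config file; the real format may differ.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Accepting an optional environ mapping keeps the function easy to test without modifying the real process environment.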
HPC
The package will try to be cluster agnostic. It should provide methods to submit all kinds of jobs, with the user only telling the package which kind of system the cluster uses.
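A cluster-agnostic submission layer could map the name of the cluster system to the command that submits a job. The sketch below only builds the command and does not run it. The function name build_submit_command is hypothetical, and the choice of SGE (qsub) and Slurm (sbatch) as supported systems is an assumption, since the design does not name any scheduler.

```python
def build_submit_command(system, script, queue=None):
    """Return the argument list that would submit `script` on `system`."""
    if not system:
        # No cluster configured: run the script locally.
        return ["bash", script]

    name = system.lower()
    if name == "sge":
        command = ["qsub"]
        if queue:
            command += ["-q", queue]
    elif name == "slurm":
        command = ["sbatch"]
        if queue:
            command.append(f"--partition={queue}")
    else:
        raise ValueError(f"unsupported cluster system: {system!r}")
    return command + [script]
```

The returned list could be handed to subprocess.run, so supporting another scheduler would only require one more branch.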
Directory Structure
The package needs to know what we call the ROOT_PATH. Under this path, the package expects to find, or will create, the following directory structure:
ROOT_PATH/release/<VERSION>/
├── reference
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──models
│ │ │ │ │ ├──*.*
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ │ ├──preprocess
│ │ │ │ │ ├──*.*
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── full
│ ├──A
│ │ ├──A1
│ │ │ ├──A1.001
│ │ │ │ ├──sign*.h5
│ │ │ │ ├──clus*.h5
│ │ │ │ ├──neig*.h5
│ │ │ │ ├──proj*.h5
│ │ │ │ ├──plots
│ │ │ │ │ ├──*.tsv
│ │ │ │ │ ├──*.png
│ │ │ │ ├──stats
│ │ │ │ │ ├──*.json
│ │ │ ├──...
│ │ ├──...
│ ├──...
├── exemplary
│ ├──coordinates
│ │ ├──A
│ │ │ ├──A1
│ │ │ │ ├──sign*.h5 [link-to-full]
│ │ │ │ ├──clus*.h5 [link-to-full]
│ │ │ │ ├──neig*.h5 [link-to-full]
│ │ │ │ ├──proj*.h5 [link-to-full]
│ │ │ │ ├──nprd*.h5
│ │ │ │ ├──plots [link-to-full]
│ │ │ │ ├──stats [link-to-full]
│ │ │ ├──...
│ │ ├──...
│ ├──infer
│ │ ├──models
│ │ │ ├──*.*
│ ├──plots
│ │ ├──*.tsv
│ │ ├──*.png
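Given the layout above, the directory of a dataset under reference or full can be derived from its code alone. The helper below is a sketch under assumptions: the function name dataset_dir, its parameter names and the example version string are hypothetical, and it assumes dataset codes always follow the A1.001 pattern shown in the tree.

```python
from pathlib import PurePosixPath


def dataset_dir(root_path, version, branch, dataset):
    """Build the directory of a dataset such as 'A1.001'."""
    if branch not in ("reference", "full"):
        raise ValueError("branch must be 'reference' or 'full'")
    space = dataset.split(".")[0]  # 'A1.001' -> 'A1'
    level = space[0]               # 'A1' -> 'A'
    return PurePosixPath(
        root_path, "release", version, branch, level, space, dataset
    )


# Example with a made-up root path and version name.
example = dataset_dir("/cc", "v1", "full", "A1.001")
```

Here example is the path /cc/release/v1/full/A/A1/A1.001, under which the sign*.h5, clus*.h5, neig*.h5 and proj*.h5 files would be expected.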
A link named current will be available at the following location:
ROOT_PATH/release/current
We will keep the last two versions; older versions will be compressed and saved in this location:
ROOT_PATH/old_releases/
Config Data
This is the list of parameters that the config file needs to include:
- CC root path
- data path
- scratch path
- HPC cluster name (if empty, jobs will run locally)
- Database server host
- Database server username and password
- Queue name
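A check that a parsed config contains every parameter in the list above could be sketched as follows. The key names in REQUIRED_KEYS and the function name validate_config are hypothetical, since the design lists the parameters but not how they are spelled in the file.

```python
# Hypothetical key names, one per parameter in the list above.
REQUIRED_KEYS = (
    "cc_root_path",
    "data_path",
    "scratch_path",
    "hpc_cluster",
    "db_host",
    "db_user",
    "db_password",
    "queue",
)


def validate_config(config):
    """Return a copy of config, or raise if a required key is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    checked = dict(config)
    # An empty cluster name means jobs run locally.
    checked["run_locally"] = not config["hpc_cluster"]
    return checked
```

Reporting all missing keys at once lets a user fix the config file in a single pass.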
Implementation issues
- The package should use Python 3