package

The chemical checker package aims to provide a single interface to all chemical checker tools, from generating new data (signatures, etc.) to accessing data that has already been produced.

This package will be used in the chemical checker pipeline, but it does not depend on that pipeline. The package will have five main modules, which are shown in the flow diagram. It will also include unit tests and API documentation to help users work with the package.

CORE

This module will hold the package's main functionality. To use it, users create an instance of ChemicalChecker by passing a config file. Using this config file, the package will validate the existing directory structure or create a new one.

This module will contain an abstract class (base.py) that defines the main methods the other classes inherit. There will then be one class for each kind of signature and for each other kind of generated data (clusters, nneigh, etc.).
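The class layout described above could be sketched as follows. This is only an illustration under stated assumptions: the names BaseSignature, Sign, Clus, fit and data_path are hypothetical, and the file name "<prefix>.h5" is a guess at how the sign*.h5 and clus*.h5 files are named.

```python
from abc import ABC, abstractmethod
from pathlib import PurePosixPath


class BaseSignature(ABC):
    """Hypothetical abstract base; the real base.py may define other methods."""

    # File-name prefix such as "sign" or "clus" (assumed naming).
    prefix = ""

    def __init__(self, data_dir):
        self.data_dir = PurePosixPath(data_dir)

    @property
    def data_path(self):
        """Assumed location of the H5 file for this kind of data."""
        return self.data_dir / f"{self.prefix}.h5"

    @abstractmethod
    def fit(self, data):
        """Generate the data; every concrete class must implement this."""


class Sign(BaseSignature):
    prefix = "sign"

    def fit(self, data):
        # Placeholder: a real implementation would compute a signature.
        return list(data)


class Clus(BaseSignature):
    prefix = "clus"

    def fit(self, data):
        # Placeholder: a real implementation would compute clusters.
        return sorted(set(data))
```

Because fit is abstract, BaseSignature itself cannot be instantiated, so each kind of data must supply its own implementation.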

DATABASE

The goal of this module is to let users interact with the chemical checker database without writing queries. It should provide methods for each database table. Through this module the package should be database agnostic: the database could run on any database technology without the user needing to know which.
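One way to keep callers unaware of the database technology is to code against an abstract interface and hide each backend behind it. The sketch below assumes a hypothetical molecule table with inchikey and smiles columns, and uses SQLite from the standard library purely as a stand-in backend; none of these names come from the package.

```python
import sqlite3
from abc import ABC, abstractmethod


class MoleculeStore(ABC):
    """Hypothetical query-free interface to one database table."""

    @abstractmethod
    def add(self, inchikey, smiles):
        """Insert or update one row."""

    @abstractmethod
    def get(self, inchikey):
        """Return the stored value, or None if the key is unknown."""


class SqliteMoleculeStore(MoleculeStore):
    """Example backend; another class could wrap a different database."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS molecule "
            "(inchikey TEXT PRIMARY KEY, smiles TEXT)"
        )

    def add(self, inchikey, smiles):
        # The connection context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO molecule VALUES (?, ?)",
                (inchikey, smiles),
            )

    def get(self, inchikey):
        row = self._conn.execute(
            "SELECT smiles FROM molecule WHERE inchikey = ?", (inchikey,)
        ).fetchone()
        return row[0] if row else None
```

User code would only call add and get, so swapping the backend would not change it.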

IO

This module will help users manage file downloads. It will also contain all the logic for reading and writing H5 files in different ways (e.g. batch mode).
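Batch-mode reading could look roughly like the sketch below. The helper iter_batches is hypothetical. It relies only on len() and slicing, which h5py datasets support, so it is shown here on a plain list to stay free of third-party imports.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive slices of at most batch_size rows.

    Works on any object supporting len() and slicing, so only one
    batch needs to be held in memory at a time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


# Usage on an in-memory stand-in for a dataset.
batches = list(iter_batches(list(range(5)), 2))
```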

TOOLS

This module will encapsulate several tools that the chemical checker uses (node2vec, hotnet, etc.). The idea is to give users an interface to the main functionality of these tools.

UTIL

This module will hold utilities that the package uses and that users might also need.

Logging

Logging will be used internally by the package, but it could also be useful to users.

Config

All config parsing will be implemented here. The config file will be passed between modules through an environment variable. This variable will be set inside the container, and the config file will always be stored at the same location inside the container.
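The environment-variable handoff could be sketched as follows. Both the variable name CC_CONFIG and the JSON file format are assumptions made for illustration; the document does not specify either.

```python
import json
import os

# Hypothetical variable name; the real one is not specified here.
CONFIG_ENV_VAR = "CC_CONFIG"


def load_config(environ=None):
    """Read the config file whose path is stored in CONFIG_ENV_VAR.

    The file is assumed to be JSON. `environ` defaults to os.environ and
    can be replaced by a plain dict, which makes the function testable.
    """
    environ = os.environ if environ is None else environ
    try:
        path = environ[CONFIG_ENV_VAR]
    except KeyError:
        raise RuntimeError(f"{CONFIG_ENV_VAR} is not set") from None
    with open(path) as handle:
        return json.load(handle)
```

Any module can then call load_config() without being handed the file path explicitly.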

HPC

The package will try to be cluster agnostic. It should provide methods to submit all kinds of jobs once it has been told which kind of system the cluster uses.
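A minimal sketch of this idea, assuming two example schedulers (SGE and Slurm) and a hypothetical helper build_submit_command. The function only builds the command line; an empty system name is treated as a local run, which is an assumption about how local execution would be selected.

```python
def build_submit_command(system, script, queue=None):
    """Return the argument list that would submit `script`.

    `system` names the scheduler; an empty value means run locally.
    """
    system = (system or "").lower()
    if system == "":
        # No cluster configured: run the script on the local machine.
        return ["bash", script]
    if system == "sge":
        command = ["qsub"]
        if queue:
            command += ["-q", queue]
        return command + [script]
    if system == "slurm":
        command = ["sbatch"]
        if queue:
            command += ["--partition", queue]
        return command + [script]
    raise ValueError(f"unsupported cluster system: {system!r}")
```

The returned list could be passed to subprocess.run; supporting another scheduler would mean adding one more branch.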

Directory Structure

The package needs to know the location we call ROOT_PATH. Under this path, it expects to find, or will create, the following directory structure:

ROOT_PATH/release/<VERSION>/

├── reference
│   ├──A
│   │   ├──A1
│   │   │   ├──A1.001
│   │   │   │   ├──sign*.h5
│   │   │   │   ├──clus*.h5
│   │   │   │   ├──neig*.h5
│   │   │   │   ├──proj*.h5
│   │   │   │   ├──models
│   │   │   │   │   ├──*.*
│   │   │   │   ├──plots
│   │   │   │   │   ├──*.tsv
│   │   │   │   │   ├──*.png
│   │   │   │   ├──stats
│   │   │   │   │   ├──*.json
│   │   │   │   ├──preprocess
│   │   │   │   │   ├──*.*
│   │   │   ├──...
│   │   ├──...
│   ├──...
├── full
│   ├──A
│   │   ├──A1
│   │   │   ├──A1.001
│   │   │   │   ├──sign*.h5
│   │   │   │   ├──clus*.h5
│   │   │   │   ├──neig*.h5
│   │   │   │   ├──proj*.h5
│   │   │   │   ├──plots
│   │   │   │   │   ├──*.tsv
│   │   │   │   │   ├──*.png
│   │   │   │   ├──stats
│   │   │   │   │   ├──*.json
│   │   │   ├──...
│   │   ├──...
│   ├──...
├── exemplary
│   ├──coordinates
│   │   ├──A
│   │   │   ├──A1
│   │   │   │   ├──sign*.h5 [link-to-full]
│   │   │   │   ├──clus*.h5 [link-to-full]
│   │   │   │   ├──neig*.h5 [link-to-full]
│   │   │   │   ├──proj*.h5 [link-to-full]
│   │   │   │   ├──nprd*.h5
│   │   │   │   ├──plots [link-to-full]
│   │   │   │   ├──stats [link-to-full]
│   │   │   ├──...
│   │   ├──...
│   ├──infer
│   │   ├──models
│   │   │   ├──*.*
│   ├──plots
│   │   ├──*.tsv
│   │   ├──*.png
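The layout of the reference and full branches above can be turned into a small path helper. The function dataset_dir is hypothetical, and it assumes that a dataset code such as A1.001 always determines its two parent directories (A and A1), as the tree suggests.

```python
from pathlib import PurePosixPath


def dataset_dir(root_path, version, branch, dataset):
    """Build ROOT_PATH/release/<VERSION>/<branch>/A/A1/A1.001.

    `dataset` is a code such as "A1.001"; `branch` is "reference" or "full".
    """
    if branch not in ("reference", "full"):
        raise ValueError(f"unknown branch: {branch!r}")
    space = dataset.split(".")[0]  # "A1.001" -> "A1"
    level = space[0]               # "A1" -> "A"
    return (PurePosixPath(root_path) / "release" / version
            / branch / level / space / dataset)


# Usage with a made-up root path and version name.
example = dataset_dir("/cc", "v1", "full", "A1.001")
```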

A link named current will be kept at the following location:

ROOT_PATH/release/current

We will keep the last two versions; the rest will be compressed and saved in this location:

ROOT_PATH/old_releases/
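The retention rule could be sketched as below. The helper archive_old_releases is hypothetical, and it assumes that version directory names sort in chronological order and that a gzip-compressed tar file is an acceptable archive format. Note that it deletes each version directory after archiving it.

```python
import shutil
from pathlib import Path


def archive_old_releases(root_path, keep=2):
    """Compress all but the newest `keep` versions into old_releases/.

    Returns the names of the versions that were archived.
    """
    root = Path(root_path)
    release_dir = root / "release"
    archive_dir = root / "old_releases"
    # Real version directories only; links such as "current" are skipped.
    versions = sorted(
        p for p in release_dir.iterdir() if p.is_dir() and not p.is_symlink()
    )
    old = versions[:-keep] if keep > 0 else versions
    archive_dir.mkdir(exist_ok=True)
    archived = []
    for version in old:
        # Writes old_releases/<VERSION>.tar.gz containing the directory.
        shutil.make_archive(
            str(archive_dir / version.name), "gztar",
            root_dir=str(release_dir), base_dir=version.name,
        )
        shutil.rmtree(version)  # destructive: removes the original copy
        archived.append(version.name)
    return archived
```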

Config Data

This is the list of parameters that the config file needs to include:

CC root path
data path
scratch path
HPC cluster name (if empty, jobs will run locally)
Database server host
Database server username and password
Queue name
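A check that a parsed config carries every parameter in the list above might look like this. The key names are hypothetical, since the document lists the parameters only in prose, and the config is assumed to have been parsed into a dict already.

```python
# Hypothetical key names, one per parameter listed above.
REQUIRED_KEYS = (
    "cc_root_path",
    "data_path",
    "scratch_path",
    "hpc_cluster",
    "db_host",
    "db_user",
    "db_password",
    "queue",
)


def validate_config(config):
    """Raise ValueError if any required parameter is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config parameters: " + ", ".join(missing))


def runs_locally(config):
    """An empty HPC cluster name means jobs run on the local machine."""
    return not config["hpc_cluster"]
```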

Implementation issues

The package should use Python 3.