The pipeline is the process that the Chemical Checker follows in order to update its internal resources.
The Chemical Checker has two main resources: the Package and the Website. The pipeline tries to keep these two entities separate. For the same reason, the database resource that the Checker provides should be divided into two databases: a permanent database that is updated over time, and a web-related database that is created each time a new release is produced.
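A minimal sketch of this split, assuming a SQLAlchemy-style setup; the connection URIs, database names and the get_engine helper are hypothetical and only illustrate the permanent/per-release separation:
from sqlalchemy import create_engine

# hypothetical URI of the permanent, continuously updated database
PERMANENT_DB_URI = "postgresql://user:pass@host/cc_permanent"

def get_engine(release=None):
    """Engine for the permanent DB, or for the web DB of a given release."""
    if release is None:
        return create_engine(PERMANENT_DB_URI)
    # a fresh web database is created each time a new release is produced
    return create_engine("postgresql://user:pass@host/cc_web_%s" % release)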
1 Download
- can be parallelized on Datasources
- code 100% there
from chemicalchecker.database import Datasource
# check if the Datasource table is there
if not Datasource._table_exists():
    # create the Datasource table
    Datasource._create_table()
    # populate it with the Datasources needed for the exemplary Datasets
    Datasource.from_csv('./exemplary_datasources.csv')
# start 45 download jobs (one per Datasource), job will wait until finished
job = Datasource.download_hpc('/aloy/scratch/sbnb-adm/tmp_job_download')
# check that the downloads are really done
if not Datasource.test_all_downloaded():
    print("Something went WRONG while DOWNLOAD, should retry")
    # print the faulty ones
    for ds in Datasource.get():
        if not ds.available():
            print("Datasource %s not available" % ds)
2 Molrepo
- can be parallelized on Datasources having Molrepos
- code 30% there, missing:
  - 2/14 parsers adapted
  - verify chembldb requirements
  - decide what to do if the table is already there, update? (one option is sketched after the code below)
from chemicalchecker.database import Molrepo
# create the Molrepo table
Molrepo._create_table()
# start 14 molrepo jobs (one per Datasource), job will wait until finished
job = Molrepo.molrepo_hpc('/aloy/scratch/sbnb-adm/tmp_job_molrepo')
# check if the molrepos are really done
if not Molrepo.test_all_available():
    print("Something went WRONG while MOLREPO, should retry")
3 Signaturization (pipeline style)
# cc is a ChemicalChecker instance; the RNDuplicates import path may vary between package versions
from chemicalchecker import ChemicalChecker
from chemicalchecker.util.remove_near_duplicates import RNDuplicates
import h5py

cc = ChemicalChecker()
# compute the full sign0 for the A1.001 dataset
sign0_full = cc.get_signature('sign0', 'full', 'A1.001')
sign0_full.predict()
# build the reference sign0 by removing near-duplicate molecules
sign0_ref = cc.get_signature('sign0', 'reference', 'A1.001')
rnd = RNDuplicates()
rnd.remove(sign0_full.data_path)
# keep the feature names of the full signature
with h5py.File(sign0_full.data_path, 'r') as f5:
    features = f5['features'][:]
# save the de-duplicated matrix and copy the features over
rnd.save(sign0_ref.data_path)
with h5py.File(sign0_ref.data_path, 'a') as hf:
    hf.create_dataset('features', data=features)
# fit sign1 on the reference set, then predict it for the full set
sign1_ref = cc.get_signature('sign1', 'reference', 'A1.001')
sign1_ref.fit(sign0_ref)
sign1_full = cc.get_signature('sign1', 'full', 'A1.001')
sign1_ref.predict(sign0_full, destination=sign1_full.data_path)
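A quick sanity check after this step, in the spirit of the checks in the Download and Molrepo steps; it only assumes the signature files are plain HDF5 and makes no assumption on the dataset names:
import h5py
# list the datasets written to the full sign1 file and their shapes
with h5py.File(sign1_full.data_path, 'r') as hf:
    for name, obj in hf.items():
        print(name, getattr(obj, 'shape', '(group)'))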