The pipeline is the process that the Chemical Checker follows in order to update its internal resources.
The Chemical Checker has two main resources: the Package and the Website. The pipeline tries to differentiate between these two entities. For the same reason, the database resource that the Checker provides should be split into two databases: a permanent database that is updated over time, and a web-related database that is created each time a new release is produced.
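As a rough illustration of that split, a pipeline configuration could keep two separate connection strings; everything in this sketch (names, URLs) is hypothetical and not part of the Chemical Checker:
# hypothetical sketch of the two-database split described above; names and URLs are made up
DATABASES = {
    'permanent': 'postgresql://user@host/cc_permanent',         # updated in place by the pipeline
    'web_release': 'postgresql://user@host/cc_release_YYYY_MM', # recreated for every website release
}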
1 Download
- can be parallelized on Datasources
- code 100% there
from chemicalchecker.database import Datasource
# check if the Datasource table is there
if not Datasource._table_exists():
    # create the Datasource table
    Datasource._create_table()
    # populate it with the Datasources needed for the exemplary Datasets
    Datasource.from_csv('./exemplary_datasources.csv')
# start 45 download jobs (one per Datasource); the call waits until the jobs are finished
job = Datasource.download_hpc('/aloy/scratch/sbnb-adm/tmp_job_download')
# check that the downloads really completed
if not Datasource.test_all_downloaded():
    print("Something went WRONG during DOWNLOAD, should retry")
    # print the faulty ones
    for ds in Datasource.get():
        if not ds.available():
            print("Datasource %s not available" % ds)
2 Molrepo
- can be parallelized on Datasources having Molrepos
- code 30% there, missing:
  - 2/14 parsers adapted
  - verify chembldb requirements
  - decide what to do if the table is already there (update it?); see the sketch after the code below
from chemicalchecker.database import Molrepo
# create the Molrepo table
Molrepo._create_table()
# start 14 molrepo jobs (one per Datasource); the call waits until the jobs are finished
job = Molrepo.molrepo_hpc('/aloy/scratch/sbnb-adm/tmp_job_molrepo')
# check that the molrepos are really there
if not Molrepo.test_all_available():
    print("Something went WRONG during MOLREPO, should retry")
3 Signaturization (pipeline style)
import h5py
from chemicalchecker import ChemicalChecker
# assumption: module path of RNDuplicates (near-duplicates removal utility) may differ between versions
from chemicalchecker.util.remove_near_duplicates import RNDuplicates

cc = ChemicalChecker()

# signature 0 for the full molecule set of the exemplary A1.001 dataset
sign0_full = cc.get_signature('sign0', 'full', 'A1.001')
sign0_full.predict()

# reference set: remove near-duplicate rows from the full sign0
sign0_ref = cc.get_signature('sign0', 'reference', 'A1.001')
rnd = RNDuplicates()
keys, data, maps = rnd.remove(sign0_full.data_path)

# carry over the feature names from the full signature
with h5py.File(sign0_full.data_path, 'r') as f5:
    features = f5["features"][:]

# write the reference sign0 HDF5 file
with h5py.File(sign0_ref.data_path, 'w') as hf:
    hf.create_dataset("features", data=features)
    hf.create_dataset("keys", data=keys)
    hf.create_dataset("V", data=data, dtype='i8')
    hf.create_dataset("shape", data=data.shape)

# fit sign1 on the reference set, then project the full set with the fitted model
sign1_ref = cc.get_signature('sign1', 'reference', 'A1.001')
sign1_ref.fit(sign0_ref)
sign1_full = cc.get_signature('sign1', 'full', 'A1.001')
sign1_ref.predict(sign0_full, destination=sign1_full.data_path)
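A quick sanity check on the produced files can be done with plain h5py. This assumes the sign1 files follow the same HDF5 layout (a V matrix and a keys dataset) as the sign0 reference file written above:
# print the shapes of the written signature matrices (sign1 layout is an assumption)
for sign in (sign0_ref, sign1_ref, sign1_full):
    with h5py.File(sign.data_path, 'r') as hf:
        print(sign.data_path, hf["V"].shape, len(hf["keys"]))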