Miquel Duran-Frigola · b5dbc5af
--- a/datasets.md
+++ b/datasets.md
@@ -4,11 +4,11 @@ In the CC nomenclature, a dataset is determined by:
 1. One coordinate.
 2. One (typically) or multiple (eventually) sources having the same type of (mergeable) data.
-3. A processing procedure yielding signatures type 0.
+3. A [processing procedure](dataset-processing) yielding signatures type 0.
 ## Levels, coordinates and datasets
-The CC is divided in five **levels** of increasing complexity:
+The CC is divided into five **levels** of increasing complexity:
 |Level|Name|Description|
 |---|---|---|
@@ -18,7 +18,7 @@ The CC is divided in five **levels** of increasing complexity:
 |`D`|Cells|Readouts of compound cell-based assays.|
 |`E`|Clinics|Clinical data of drugs and environmental chemicals.|
-In turn, each level is divided in 5 sublevels or **coordinates** representing different aspects of the data. Each sublevel has an *exemplary* dataset, as described below:
+In turn, each level is divided into 5 sublevels or **coordinates** representing different aspects of the data. Each sublevel has an *exemplary* dataset, as described below:
 |Coordinate|Name|Description|
 |---|---|---|
@@ -26,20 +26,20 @@ In turn, each level is divided in 5 sublevels or **coordinates** representing di
 |`A2`|3D fingerprints|Similar to `A1`, the 3D structures of the three best conformers after energy minimization are hashed into a binary representation without the need for structural alignment.|
 |`A3`|Scaffolds|Largest molecular scaffold (usually a ring system) remaining after applying Murcko’s pruning rules. Additionally, we keep the corresponding framework, i.e. a version of the scaffold where all atoms are carbons and all bonds are single. The scaffold and the framework are encoded with path-based 1024-bit fingerprints, suitable for capturing substructures in similarity searches.|
 |`A4`|Structural keys|166 functional groups and substructures widely accepted by medicinal chemists (MACCS keys).|
-|`A5`|Physicochemistry|Physicochemical properties such as molecular weight, logP and refractivity. Number of hydrogen-bond donors and acceptors, rings, etc. Drug-likeness measurements e.g. number of structural alerts, Lipinski’s rule-of-5 violations or chemical beauty (QED).|
+|`A5`|Physicochemistry|Physicochemical properties such as molecular weight, logP, and refractivity. Number of hydrogen-bond donors and acceptors, rings, etc. Drug-likeness measurements e.g. number of structural alerts, Lipinski’s rule-of-5 violations or chemical beauty (QED).|
 |`B1`|Mechanism of action|Drug targets with known pharmacological action and modes (agonist, antagonist, etc.).|
-|`B2`|Metabolic genes|Drug metabolizing enzymes, transporters and carriers.|
+|`B2`|Metabolic genes|Drug metabolizing enzymes, transporters, and carriers.|
 |`B3`|Crystals|Small molecules co-crystalized with protein chains. Data is organized according to the structural families of the protein chains.|
 |`B4`|Binding|Compound--protein binding data available in major public chemogenomics databases. Data mainly comes from academic publications and patents. Only binding affinities below a class-specific threshold are kept (kinases ≤ 30 nM, GPCRs ≤ 100 nM, nuclear receptors ≤ 100 nM, ion channels ≤ 10 uM and others ≤ 1 uM).|
 |`B5`|HTS bioassays|Hits from screening campaigns against protein targets (mainly confirmatory functional assays below 10 uM).|
-|`C1`|Biological roles|Ontology terms associated to small molecules with recognized biological roles, such as known drugs, metabolites and other natural products.|
+|`C1`|Biological roles|Ontology terms associated with small molecules with recognized biological roles, such as known drugs, metabolites and other natural products.|
 |`C2`|Metabolic network|Curated reconstruction of human metabolism, containing metabolites and reactions. Data is represented as a network where nodes are metabolites and edges connect substrates and products of reactions.|
 |`C3`|Canonical pathways|Canonical pathways related to the known receptors of compounds (as recorded in `B4`). Pathways are assigned via a guilt-by-association approach, i.e. a molecule is related to a pathway if at least one of the molecule targets is a member of it.|
-|`C4`|Biological processes|Similar to `C3`, biological processes from the gene ontology are associated to compounds via a guilt-by-association approach from `B4` data. All parent terms are kept, from the leaves of the ontology to its root.|
+|`C4`|Biological processes|Similar to `C3`, biological processes from the gene ontology are associated with compounds via a guilt-by-association approach from `B4` data. All parent terms are kept, from the leaves of the ontology to its root.|
 |`C5`|Interactomes|Neighborhoods of `B4` targets are collected by inspecting several large protein-protein interaction networks. A random-walk algorithm is used to obtain a robust measure of 'proximity' in the network.|
 |`D1`|Gene expression|Transcriptional response of cell lines upon exposure to small molecules. A well-documented reference dataset of gene expression profiles is used to map all compound profiles using a two-sided gene set enrichment analysis.|
 |`D2`|Cancer cell lines|Small molecule sensitivity data (GI50) of a panel of 60 cancer cell lines.|
-|`D3`|Chemical genetics|Growth inhibition profiles in a panel of ~300 yeast mutants. Data are combined with yeast genetic interaction data, so that compounds can be assimilated to genetic alterations when they have similar profiles.|
+|`D3`|Chemical genetics|Growth inhibition profiles in a panel of ~300 yeast mutants. Data are combined with yeast genetic interaction data so that compounds can be assimilated to genetic alterations when they have similar profiles.|
 |`D4`|Morphology|Changes in U-2 OS cell morphology measured after compound treatment using a multiplexed-cytological cell painting assay. 812 morphology features are recorded via automated microscopy and image analysis.|
 |`D5`|Cell bioassays|Small molecule cell bioassays reported in ChEMBL, mainly growth and proliferation measurements found in the literature.|
 |`E1`|Therapeutic areas|Anatomical Therapeutic Chemical (ATC) codes of drugs. All ATC levels are considered.|
@@ -62,36 +62,25 @@ This is how we define a dataset:
 |Name|2D fingerprints|Display, short-name of the dataset.|
 |Technical name|1024-bit Morgan fingerprints|A more technical name for the dataset, suitable for chemo-/bio-informaticians.|
 |Description|2D fingerprints are...|This field contains a long description of the dataset. It is important that the curator outlines here the importance of the dataset, why did he/she make the decision to include it, and what are the scenarios where this dataset may be useful.|
-|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingreprints or gene expression data do not contain unknowns.|
+|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns.|
 |Permanent|`True`/`False`|Are measurements for each entry permanent? 2D fingerprints, for example, are permanent. However, most of the biological data may change/evolve with the different versions of the CC. This field, in essence, dictates whether the dataset should be completely updated in every update of the CC, or whether new entries can be simply appended.|
 |Finished|`True`/`False`|Is the dataset considered to be finished? For example, datasets coming from supplementary data of scientific papers are immutable, and they consequently need no updates in later versions of the CC.|
 |Data type|`Discrete`/`Continuous`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
 |Predicted|`True`/`False`|Is the dataset a result of a prediction (by us or by others?). Prediction results are perfectly valid CC datasets, in principle.|
 |Connectivity|`True`/`False`|Is there a way to connect this dataset to other biological entities? We understand connectivity as a generalization of the cMap idea of matching gene expression signatures.|
 |Connectivity comments|Free text commenting on the connectivity strategy (e.g. type of distance)|This field needs to be self-explanatory.|
-|Keys|e.g. `CPD` (we use @afernandez `Bioteque` nomenclature). May be `NULL`.|In the core CC database, most of the times this field will correspond to `CPD`, as the CC is centered on small molecules. It only makes sense to have keys of different types when we do connectivity attempts, that is, for example, when mapping disease gene expression signatures.|
+|Keys|e.g. `CPD` (we use @afernandez `Bioteque` nomenclature). Can be `NULL`.|In the core CC database, most of the time this field will correspond to `CPD`, as the CC is centered on small molecules. It only makes sense to have keys of different types when we do connectivity attempts, that is, for example, when mapping disease gene expression signatures.|
 |Number of keys|e.g. 800000|Number of samples in the dataset.|
-|Features|e.g. `GEN` (we use `Bioteque` nomenclature). May be `NULL`.|When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use `Bioteque` nomenclature to the define the type of biological data, it is not mandatory that the vocabularies are the ones used by the `Bioteque`; for example, I can use non-human Uniprot ACs, if I deem it necessary.|
+|Features|e.g. `GEN` (we use `Bioteque` nomenclature). Can be `NULL`.|When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use `Bioteque` nomenclature to define the type of biological data, it is not mandatory that the vocabularies are the ones used by the `Bioteque`; for example, I can use non-human UniProt ACs, if I deem it necessary.|
 |Number of features|e.g. 1000|Number of features in the dataset.|
 |Exemplary|`True`/`False`|Is the dataset exemplary of the coordinate. Only one exemplary dataset is valid for each coordinate. Exemplary datasets should have good coverage (both in keys space and feature space) and acceptable quality of the data.|
 |Source|Free text defining the source of data.|More than one source is allowed. We have mild constraints in the nomenclature, here.|
 |Version|CC version|The CC is updated every 6 months.|
 |Public|`True`/`False`|Some datasets are public, and some are not, especially those that come from collaborations with the pharma industry.|
-The information above can be stored in a `postgresql` table named `datasets`.
+See the [PostgreSQL database](database) for more information.
-It is important that we decide how to store and organize datasets in the file-system, though. For this I need advice and help from all of you, @mbertoni , @oguitart and @afernandez.
+## Dataset processing
-I suggest the following structure in e.g. `aloy/web_checker/` (or somewhere else):
-* Each dataset is stored correspondingly, e.g. `./datasets/A/A1/001/`.
+Every dataset has one 
-* A `./data.h5` file:
- * `V`: the matrix of values (`np.int8`, `np.float32`, ...)
- * `keys`: *sorted* alphabetically
- * `features`: *sorted* alphabetically
- * etc (de
-* A `./processing/` folder where a mini-pipeline is devoted to processing the data until obtaining the final dataset (`data.h5`). **Downloads are not here!**, since many downloads are shared between datasets. Downloads and download scripts are wherever @oguitart decides.
-* A `./connectivity/` folder where connectivity scripts are stored. I don't know how to organized this, yet. This folder will obviously be empty if no connectivity is possible for this dataset.
-* A `./models/` folder where persistent models are stored. This folder may be empty many times.
-## Dataset processing
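
The dataset fields described in the table above lend themselves to a small validation step. The following is only a minimal sketch, not the CC implementation: `DatasetRecord` and `check_exemplary` are hypothetical names, only a subset of the fields is modelled, and the example values are illustrative. It checks that a coordinate is one of the 25 level/sublevel combinations, that the data type is `Discrete` or `Continuous`, and that no coordinate has more than one exemplary dataset.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = "ABCDE"      # the five CC levels
SUBLEVELS = "12345"   # five coordinates per level
DATA_TYPES = {"Discrete", "Continuous"}


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical record mirroring a subset of the dataset fields."""
    coordinate: str                  # e.g. "A1"
    name: str                        # display name, e.g. "2D fingerprints"
    data_type: str                   # "Discrete" or "Continuous"
    unknowns: bool
    permanent: bool
    exemplary: bool
    keys: Optional[str] = "CPD"      # key type; None stands for NULL
    features: Optional[str] = None   # feature type; None stands for NULL

    def __post_init__(self):
        c = self.coordinate
        if len(c) != 2 or c[0] not in LEVELS or c[1] not in SUBLEVELS:
            raise ValueError(f"invalid coordinate: {c!r}")
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"invalid data type: {self.data_type!r}")


def check_exemplary(records):
    """Return the coordinates that have more than one exemplary dataset."""
    counts = {}
    for r in records:
        if r.exemplary:
            counts[r.coordinate] = counts.get(r.coordinate, 0) + 1
    return sorted(c for c, n in counts.items() if n > 1)


# Illustrative values only; they are not taken from the CC database.
records = [
    DatasetRecord("A1", "2D fingerprints", "Discrete", False, True, True),
    DatasetRecord("A1", "Another fingerprint", "Discrete", False, True, True),
    DatasetRecord("B4", "Binding", "Discrete", True, False, True, features="GEN"),
]
print(check_exemplary(records))  # coordinates violating the one-exemplary rule
```

A frozen dataclass is used here so that a record cannot be modified after validation; whether such checks belong in application code or in database constraints is left open.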