... | @@ -14,7 +14,7 @@ The CC is divided in five **levels** of increasing complexity: |
... | @@ -14,7 +14,7 @@ The CC is divided in five **levels** of increasing complexity: |
|
|`D`|Cells|Readouts of compound cell-based assays.|
|
|
|`D`|Cells|Readouts of compound cell-based assays.|
|
|
|`E`|Clinics|Clinical data of drugs and environmental chemicals.|
|
|
|`E`|Clinics|Clinical data of drugs and environmental chemicals.|
|
|
|
|
|
|
In turn, each level is divided in 5 sublevels or **coordinates** representing different aspects of the data. Each sublevel has an *exemplar* dataset, as described below:
|
|
In turn, each level is divided in 5 sublevels or **coordinates** representing different aspects of the data. Each sublevel has an *exemplary* dataset, as described below:
|
|
|
|
|
|
|Coordinate|Name|Description|
|
|
|Coordinate|Name|Description|
|
|
|---|---|---|
|
|
|---|---|---|
|
... | @@ -57,9 +57,9 @@ This is how we define a dataset: |
... | @@ -57,9 +57,9 @@ This is how we define a dataset: |
|
|Coordinate|e.g.`A1`|Coordinates in the CC organization.|
|
|
|Coordinate|e.g.`A1`|Coordinates in the CC organization.|
|
|
|Name|2D fingerprints|Display, short-name of the dataset.|
|
|
|Name|2D fingerprints|Display, short-name of the dataset.|
|
|
|Technical name|1024-bit Morgan fingerprints|A more technical name for the dataset, suitable for chemo-/bio-informaticians.|
|
|
|Technical name|1024-bit Morgan fingerprints|A more technical name for the dataset, suitable for chemo-/bio-informaticians.|
|
|
|Description|2D fingerprints are...|This field contains a long description of the dataset. It is important that the curator outlines here the importance of the dataset, why did he/she made the decision to include it, and what are the scenarios where this dataset may be useful.|
|
|
|Description|2D fingerprints are...|This field contains a long description of the dataset. It is important that the curator outlines here the importance of the dataset, why did he/she make the decision to include it, and what are the scenarios where this dataset may be useful.|
|
|
|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns.|
|
|
|Unknowns|`True`/`False`|Does the dataset contain known/unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns.|
|
|
|Permanent|`True`/`False`|Are measurements for each entry permanent? 2D fingerprints, for example, are permanent. However, most of biological data may change/evolve with the different versions of the CC. This field, in essence, dictates whether the dataset should be completely updated in every update of the CC, or whether new entries can be simply appended.|
|
|
|Permanent|`True`/`False`|Are measurements for each entry permanent? 2D fingerprints, for example, are permanent. However, most of the biological data may change/evolve with the different versions of the CC. This field, in essence, dictates whether the dataset should be completely updated in every update of the CC, or whether new entries can be simply appended.|
|
|
|Finished|`True`/`False`|Is the dataset considered to be finished? For example, datasets coming from supplementary data of scientific papers are immutable, and they consequently need no updates in later versions of the CC.|
|
|
|Finished|`True`/`False`|Is the dataset considered to be finished? For example, datasets coming from supplementary data of scientific papers are immutable, and they consequently need no updates in later versions of the CC.|
|
|
|Data type|`Discrete`/`Continuous`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
|
|
|Data type|`Discrete`/`Continuous`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
|
|
|Predicted|`True`/`False`|Is the dataset the result of a prediction (by us or by others)? Prediction results are perfectly valid CC datasets, in principle.|
|
|
|Predicted|`True`/`False`|Is the dataset the result of a prediction (by us or by others)? Prediction results are perfectly valid CC datasets, in principle.|
|
... | @@ -69,7 +69,7 @@ This is how we define a dataset: |
... | @@ -69,7 +69,7 @@ This is how we define a dataset: |
|
|Number of keys|e.g. 800000|Number of samples in the dataset.|
|
|
|Number of keys|e.g. 800000|Number of samples in the dataset.|
|
|
|Features|e.g. `GEN` (we use `Bioteque` nomenclature). May be `NULL`.|When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use `Bioteque` nomenclature to define the type of biological data, it is not mandatory that the vocabularies are the ones used by the `Bioteque`; for example, I can use non-human Uniprot ACs, if I deem it necessary.|
|
|
|Features|e.g. `GEN` (we use `Bioteque` nomenclature). May be `NULL`.|When features correspond to explicit knowledge, such as proteins, gene ontology processes, or indications, we express with this field the type of biological entities. It is not allowed to mix different feature types. Features can, however, have no type, typically when they come from a heavily-processed dataset, such as gene-expression data. Even if we use `Bioteque` nomenclature to define the type of biological data, it is not mandatory that the vocabularies are the ones used by the `Bioteque`; for example, I can use non-human Uniprot ACs, if I deem it necessary.|
|
|
|Number of features|e.g. 1000|Number of features in the dataset.|
|
|
|Number of features|e.g. 1000|Number of features in the dataset.|
|
|
|Exemplar|`True`/`False`|Is the dataset exemplar of the coordinate. Only one exemplar dataset is valid for each coordinate. Exemplar datasets should have good coverage (both in keys space and feature space) and acceptable quality of the data.|
|
|
|Exemplary|`True`/`False`|Is the dataset exemplary of the coordinate. Only one exemplary dataset is valid for each coordinate. Exemplary datasets should have good coverage (both in keys space and feature space) and acceptable quality of the data.|
|
|
|Source|Free text defining the source of data.|More than one source is allowed. We have mild constraints in the nomenclature, here.|
|
|
|Source|Free text defining the source of data.|More than one source is allowed. We have mild constraints in the nomenclature, here.|
|
|
|Version|CC version|The CC is updated every 6 months.|
|
|
|Version|CC version|The CC is updated every 6 months.|
|
|
|Public|`True`/`False`|Some datasets are public, and some are not, especially those that come from collaborations with the pharma industry.|
|
|
|Public|`True`/`False`|Some datasets are public, and some are not, especially those that come from collaborations with the pharma industry.|
|
... | | ... | |
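
The dataset fields visible in the hunks above could be captured in a small record type. The following is only a minimal Python sketch, not the CC's actual code: the class name `DatasetDefinition`, the attribute names, the helper `check_single_exemplary`, and the coordinate pattern (a level letter `A`-`E` followed by a sublevel digit `1`-`5`) are all assumptions made for illustration, and any fields that fall outside the lines shown here are simply not modeled.

```python
import re
from dataclasses import dataclass, replace
from typing import Iterable, Optional

# Assumption: a coordinate is one level letter plus one sublevel digit, as in `A1`.
# The letter range A-E is inferred from "five levels" and is not confirmed here.
_COORDINATE_RE = re.compile(r"[A-E][1-5]")
_DATA_TYPES = {"Discrete", "Continuous"}


@dataclass(frozen=True)
class DatasetDefinition:
    """Hypothetical record mirroring the dataset fields listed in the table."""
    coordinate: str
    name: str
    technical_name: str
    description: str
    unknowns: bool
    permanent: bool
    finished: bool
    data_type: str           # "Discrete" or "Continuous"
    predicted: bool
    num_keys: int
    features: Optional[str]  # feature type, e.g. "GEN"; None when untyped
    num_features: int
    exemplary: bool
    source: str
    version: str
    public: bool

    def __post_init__(self) -> None:
        if not _COORDINATE_RE.fullmatch(self.coordinate):
            raise ValueError(f"invalid coordinate: {self.coordinate!r}")
        # Categorical and mixed variables are not allowed by the table above.
        if self.data_type not in _DATA_TYPES:
            raise ValueError(f"invalid data type: {self.data_type!r}")


def check_single_exemplary(datasets: Iterable[DatasetDefinition]) -> None:
    """Raise ValueError if a coordinate has more than one exemplary dataset."""
    seen = set()
    for dataset in datasets:
        if dataset.exemplary:
            if dataset.coordinate in seen:
                raise ValueError(f"several exemplary datasets for {dataset.coordinate}")
            seen.add(dataset.coordinate)


# Illustrative values only, loosely based on the examples in the table.
example = DatasetDefinition(
    coordinate="A1", name="2D fingerprints",
    technical_name="1024-bit Morgan fingerprints",
    description="2D fingerprints are...",
    unknowns=False, permanent=True, finished=False, data_type="Discrete",
    predicted=False, num_keys=800000, features=None, num_features=1024,
    exemplary=True, source="illustrative", version="illustrative", public=True,
)
check_single_exemplary([example, replace(example, coordinate="A2")])
```

Validating in `__post_init__` means a malformed definition fails at construction time, while the one-exemplary-per-coordinate rule needs the separate helper because it depends on the whole collection of datasets rather than on a single record.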