... | ... | @@ -65,7 +65,7 @@ This is how we define a dataset: |
|
|
|Unknowns|`True`/`False`|Does the dataset contain unknown data? Binding data from chemogenomics datasets, for example, are positive-unlabeled, so they do contain unknowns. Conversely, chemical fingerprints or gene expression data do not contain unknowns.|
|
|
|
|Permanent|`True`/`False`|Are measurements for each entry permanent? 2D fingerprints, for example, are permanent. However, most biological data may change between versions of the CC. This field, in essence, dictates whether the dataset should be completely updated in every update of the CC, or whether new entries can simply be appended.|
|
|
|
|Finished|`True`/`False`|Is the dataset considered to be finished? For example, datasets coming from the supplementary data of scientific papers are immutable, and consequently need no updates in later versions of the CC.|
|
|
|
|Data type|`Discrete`/`Continuous`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
|
|
|
|Is Discrete|`True`/`False`|The type of data that ultimately expresses the dataset, after the pre-processing. Categorical variables are not allowed; they must be converted to one-hot encoding or binarized. Mixed variables are not allowed, either.|
|
|
|
|Predicted|`True`/`False`|Is the dataset the result of a prediction (by us or by others)? Prediction results are perfectly valid CC datasets, in principle.|
|
|
|
|Connectivity|`True`/`False`|Is there a way to connect this dataset to other biological entities? We understand connectivity as a generalization of the cMap idea of matching gene expression signatures.|
|
|
|
|Connectivity comments|Free text commenting on the connectivity strategy (e.g. type of distance)|This field needs to be self-explanatory.|
|
... | ... | |