Similarity and connectivity
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.
When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA experiment, or molecules that revert the transcriptional profile of a disease state.
These are some ways similarity and connectivity can be applied in the CC:
Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the pre-processing repository structure.
Every dataset has one (or more) pre-processing script(s), always consisting of two steps (a minimal code sketch follows this list):
1. Data gathering (and conversion to a standard input file).
   - In the production phase (i.e. when building the dataset), data are gathered from downloads or from calculated molecular properties.
   - In the mapping phase (i.e. when including external molecules or biological entities), data are parsed from user input, or fetched from calculated molecular properties if these are available for the user's compounds of interest.
2. Standard input to Signature Type 0.
   - The outcome of step 1 is a standard input file for step 2.
   - The output of this step is a Signature Type 0.
   - The complexity of this step can vary dramatically:
     - Very simple: The case of 2D fingerprints, where we simply take the corresponding molecular properties of the InChIKey provided. Likewise, the case of indications, where we read drug-disease pairs and map them.
     - Simple: The case of binding data, where on some occasions we map target classes to the binding data.
     - Not so simple: The case of pathways, where we map targets to human orthologs, and then these to pathway annotations. In this case, the input may be of two types (i.e. targets or the pathways themselves).
     - Complex: The case of interactomes, where we map targets to human orthologs, and these to several networks using HotNet. Here again, the input may be of two types (i.e. targets or the neighbors themselves).
     - Very complex: The case of LINCS transcriptomics data (D1.001), where we start from signatures of interest, compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary, and filter the outcome accordingly.
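To make the two steps concrete, here is a minimal sketch of what a pre-processing script might look like for a binding-like dataset. All file names, paths and helper functions are assumptions for illustration, not the actual CC code:

```python
# Hypothetical pre-processing script (illustrative only; the real CC
# scripts live in the pre-processing repository and may differ).
import csv

def gather_data(download_path):
    """Step 1: read a raw download and extract (InChIKey, target) pairs."""
    pairs = []
    with open(download_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            inchikey, target = row[0], row[1]
            pairs.append((inchikey, target))
    return pairs

def write_standard_input(pairs, output_path):
    """End of step 1: write the standard key-feature TSV consumed by step 2."""
    with open(output_path, "w") as f:
        for key, feature in pairs:
            f.write(f"{key}\t{feature}\n")

if __name__ == "__main__":
    pairs = gather_data("downloads/binding_raw.tsv")  # assumed file name
    write_standard_input(pairs, "standard_input.tsv")
```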
In practice:
- At the production phase, all the procedures above (steps 1 & 2) are wrapped in a `fit()` method of `sign0`.
- For the mapping phase, step 2 is wrapped in a `predict()` method of `sign0`.
- The method can have more than one entry point, i.e. multiple input types. For example, in the biological processes dataset we may enter the targets or the biological process terms directly.
- Inputs must be in a standard input format.
- Every dataset must have a `README` file.
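A minimal usage sketch of the two phases, assuming a `sign0` class that exposes `fit()` and `predict()` as described above; the constructor arguments, dataset code and the `read_standard_input` helper are hypothetical:

```python
# Illustrative only: the exact sign0 API may differ from this sketch.

# Production phase: steps 1 & 2 run end to end.
s0 = sign0(dataset="B4.001")  # hypothetical constructor and dataset code
s0.fit()                      # gathers data and computes Signature Type 0

# Mapping phase: only step 2, applied to external molecules or entities.
external = read_standard_input("my_molecules.tsv")  # hypothetical helper
signatures = s0.predict(external)
```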
Standard input files
Type | Format | Description |
---|---|---|
InChIKeys | TSV | A one-column file containing InChIKeys. This will fetch the corresponding molecular properties from the CC database. |
Key-feature pairs | TSV | A two-column file containing keys (first column) and features (second column). Features can be, for example, protein identifiers. Optionally, a third column can be included to specify the weight of the key-feature annotation. |
Key profiles | TSV | A multiple-column file containing keys (first column) and features (second column onwards). These can be, for example, NCI-60 profiles, or chemical-genetic interaction profiles. If a header is not included, the order of the columns should match the one used internally in the CC. |
Feature sets | GMT | A GMT-like file, typically used for gene sets. First column: sample (signature) identifier. Second column: agent (perturbagen, molecule, etc.) identifier; if empty, assume the same as the first column (this is used in case it is necessary to aggregate downstream). Third column: up features (genes); can be `NULL`. Fourth column: down features (genes); if empty, assume that there is no direction in the gene set and only take the third column; can be `NULL`. |
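To make the formats concrete, here is a minimal sketch that writes a key-feature pairs TSV and a GMT-like line. The identifiers are real UniProt accessions and gene symbols used purely as examples, and packing several features into one column with commas is an assumption of this sketch:

```python
# Write a key-feature pairs TSV (the third, weight column is optional).
with open("pairs.tsv", "w") as f:
    f.write("RZVAJINKPMORJF-UHFFFAOYSA-N\tP23219\t1.0\n")  # paracetamol -> COX-1
    f.write("RZVAJINKPMORJF-UHFFFAOYSA-N\tP35354\t1.0\n")  # paracetamol -> COX-2

# Write a GMT-like feature-set line: sample, agent, up features, down features.
# Comma-separation within the up/down columns is assumed here for illustration.
with open("sets.gmt", "w") as f:
    up = ["EGR1", "FOS", "JUN"]
    down = ["TP53"]
    f.write("sig_001\tcpd_001\t" + ",".join(up) + "\t" + ",".join(down) + "\n")
```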
We highly recommend that, when designing datasets, features be as explicit as possible. A good starting point is the set of metanodes defined in the Bioteque:
Metanode | Abbreviation |
---|---|
Assay | ASY |
Cell | CLL |
Chemical entity | CHE |
Compartment | CMP |
Domain | DOM |
Compound | CPD |
Gene/Protein | GEN |
Disease | DIS |
Molecular function | MFN |
Pathway/process | PWY |
Protein class | PCL |
Perturbagen | PGN |
Symptom | SYM |
Tissue | TIS |
Pharmacologic class | PHC |
It is mandatory that the vocabularies used in the production phase and the mapping phase match.
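For example, one possible way to keep features explicit and vocabularies aligned across phases is to prefix each feature with its metanode abbreviation; this prefix convention is an illustration, not a CC requirement:

```python
# Illustrative feature naming using Bioteque metanode abbreviations.
features = [
    "GEN:P23219",     # Gene/Protein (UniProt accession for COX-1)
    "PWY:hsa04010",   # Pathway/process (e.g. a KEGG pathway identifier)
    "DIS:DOID:9352",  # Disease (a Disease Ontology identifier)
]

# A mapping-phase input expressed in a different vocabulary (e.g. gene
# symbols) would need to be translated to the production vocabulary first.
```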
Distance metrics
From Signature Type 0 onwards, the CC only deals with two distance metrics: the cosine distance and the Euclidean distance. These are well-accepted metrics that capture two different properties: the direction and the absolute distance, respectively.
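Both metrics are readily available in standard Python libraries; for instance:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.5, 1.0, 1.5])

# Cosine distance compares direction (1 - cosine similarity).
d_cos = cosine(a, b)

# Euclidean distance compares absolute positions in signature space.
d_euc = euclidean(a, b)

print(d_cos, d_euc)
```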
It may happen, though, that some datasets require more advanced metrics. In this case, we recommend applying any required transformation of the data in the pre-processing, so that Signatures Type 0 are natively comparable using cosine/Euclidean distances. This can be achieved with metric learning algorithms. For example, one may incorporate a Siamese network in the pre-processing:
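A minimal sketch of such a Siamese network, written here in PyTorch with a contrastive loss; the actual architecture, framework and loss used in any given CC dataset are not specified here:

```python
# Siamese-network sketch: learn an embedding in which Euclidean distance
# reflects the desired notion of similarity (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Maps raw features to an embedding space with a meaningful metric."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs beyond the margin."""
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

# One training step on random stand-in data: x1, x2 are paired inputs,
# `same` is 1 for similar pairs and 0 for dissimilar ones.
encoder = SiameseEncoder(in_dim=1024)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, 1024), torch.randn(32, 1024)
same = torch.randint(0, 2, (32,)).float()
loss = contrastive_loss(encoder(x1), encoder(x2), same)
opt.zero_grad(); loss.backward(); opt.step()
```

The embeddings produced by the trained encoder would then serve as the Signature Type 0 values, natively comparable with the two supported distances.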