Similarity and connectivity
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.
When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA experiment, or molecules that revert the transcriptional profile of a disease state.
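For illustration, a minimal sketch of a similarity calculation between two signature vectors; the vectors, their dimensionality and the noise level are made up for the example:

```python
# Minimal sketch of the similarity principle: two molecules whose signatures
# are close in vector space get a similarity score near 1.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two signature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
mol_a = rng.normal(size=128)                     # hypothetical signature of molecule A
mol_b = mol_a + rng.normal(scale=0.1, size=128)  # a close analogue of A
mol_c = rng.normal(size=128)                     # an unrelated molecule

print(cosine_similarity(mol_a, mol_b))  # high -> "similar", so similar properties expected
print(cosine_similarity(mol_a, mol_c))  # near 0 -> unrelated
```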
Below are some of the ways in which similarity and connectivity can be applied in the CC:
Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the pre-processing repository structure.
Every dataset has one (or more) pre-processing script(s), always consisting of two steps (a minimal code sketch of this two-step structure is given after the list):
- Step 1: Data gathering (and conversion to a standard input file).
  - In the production phase (i.e. when building the dataset), data are gathered from downloads or from calculated molecular properties.
  - In the mapping phase (i.e. when including external molecules or biological entities), data are parsed by the user or fetched from calculated molecular properties, if these are available for the user's compounds of interest.
- Step 2: Standard input to signature type 0.
  - The outcome of step 1 is some sort of standard input for step 2.
  - The output of this step is a signature type 0.
  - The complexity of this step can vary dramatically:
    - Very simple: the case of 2D fingerprints, where we simply take the corresponding molecular properties of the InChIKey provided. Likewise, the case of indications, where we read drug-disease pairs and map them.
    - Simple: the case of binding data, where on some occasions we map target classes to the binding data.
    - Not so simple: the case of pathways, where we map targets to human orthologs, and then these to pathway annotations. In this case, the input may be of two types (i.e. targets or the pathways themselves).
    - Complex: the case of interactomes, where we map targets to human orthologs and these to several networks using HotNet. Here again, the input may be of two types (i.e. targets or the neighbors themselves).
    - Very complex: the case of LINCS transcriptomics data (D1.001), where we start from signatures of interest, compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary and filter the outcome accordingly.
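As referenced above, here is a minimal sketch of such a two-step script, assuming a key-feature-pairs standard input and a binary key x feature matrix as the signature type 0; all file, function and variable names are hypothetical:

```python
# Hypothetical two-step pre-processing script (names and formats are illustrative).
import csv

def gather(standard_input_path):
    """Step 1: gather data (downloads / calculated properties) and write a
    standard input file; here, a two-column TSV of key-feature pairs."""
    pairs = [("INCHIKEY_A", "FEATURE_1"),
             ("INCHIKEY_A", "FEATURE_2"),
             ("INCHIKEY_B", "FEATURE_2")]
    with open(standard_input_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(pairs)

def to_sign0(standard_input_path):
    """Step 2: convert the standard input into a signature type 0
    (here, a simple binary key x feature matrix)."""
    keys, features, pairs = [], [], set()
    with open(standard_input_path) as f:
        for key, feature in csv.reader(f, delimiter="\t"):
            if key not in keys:
                keys.append(key)
            if feature not in features:
                features.append(feature)
            pairs.add((key, feature))
    matrix = [[1 if (k, ft) in pairs else 0 for ft in features] for k in keys]
    return keys, features, matrix

if __name__ == "__main__":
    gather("standard_input.tsv")          # production phase only
    print(to_sign0("standard_input.tsv"))
```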
In practice:
- At production phase, all the procedures above (steps 1 and 2) are wrapped in a `fit()` method of `sign0` (see the sketch after this list).
- For the mapping phase, step 2 is wrapped in a `predict()` method of `sign0`.
- Every dataset must have a `README` file.
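The sketch below illustrates the fit/predict split from the caller's side; the `Sign0` class here is a stand-in written for this document, not the actual CC implementation, and its method signatures are assumptions:

```python
# Stand-in illustration of the fit()/predict() division of labour (not the CC API).
class Sign0:
    def __init__(self, dataset_code):
        self.dataset_code = dataset_code
        self.features = None  # feature space fixed during fit()

    def fit(self):
        """Production phase: step 1 (data gathering) + step 2 (standard input -> sign0)."""
        standard_input = self._gather()                               # step 1
        self.features, self.matrix = self._to_sign0(standard_input)  # step 2

    def predict(self, standard_input):
        """Mapping phase: step 2 only, projecting external entities onto the
        feature space learned at fit() time."""
        _, matrix = self._to_sign0(standard_input, features=self.features)
        return matrix

    # --- trivial placeholders so the example runs ---
    def _gather(self):
        return [("KEY_A", "F1"), ("KEY_A", "F2"), ("KEY_B", "F2")]

    def _to_sign0(self, pairs, features=None):
        if features is None:
            features = sorted({f for _, f in pairs})
        keys = sorted({k for k, _ in pairs})
        pairset = set(pairs)
        matrix = [[1 if (k, f) in pairset else 0 for f in features] for k in keys]
        return features, matrix

sign0 = Sign0("D1.001")
sign0.fit()
print(sign0.predict([("EXTERNAL_KEY", "F2")]))  # -> [[0, 1]]
```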
It is of the utmost importance that step 2 is endowed with a `predict()` method. Having the ability to convert any standard input to a signature type 0 in an automated manner will enable the implementation of connectivity methods. This is a critical feature of the CC and I anticipate that most of our efforts will be put into this particular step.
More precisely, connectivity starts with standard input files and finishes with a signature type 0.
In some datasets, this procedure may be of considerable complexity and we need to conceive workflows that can be wrapped into a `predict()` method. For other datasets, the procedure will be almost trivial.
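For instance, once an external entity's standard input has been converted into a signature type 0 via `predict()`, a connectivity query can reduce to ranking CC molecules by their similarity to that signature. A minimal sketch with toy data and hypothetical names; cosine similarity is chosen only for illustration:

```python
# Toy connectivity query: rank CC molecules against a predicted query signature.
import numpy as np

def rank_molecules(query_sign0, cc_matrix, cc_keys):
    """Return (key, score) pairs sorted by cosine similarity to the query."""
    q = query_sign0 / np.linalg.norm(query_sign0)
    m = cc_matrix / np.linalg.norm(cc_matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)
    return [(cc_keys[i], float(scores[i])) for i in order]

cc_keys = ["MOL_A", "MOL_B", "MOL_C"]             # molecules already in the CC
cc_matrix = np.array([[1., 0., 1.],
                      [0., 1., 0.],
                      [1., 1., 0.]])              # their (toy) sign0 vectors
query = np.array([1., 0., 0.9])                   # output of a hypothetical predict()
print(rank_molecules(query, cc_matrix, cc_keys))
```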
Another important matter here is the distance metric. The CC works with common distance metrics, such as the cosine or Euclidean distances. Sometimes, connectivity may require other types of metrics (e.g. GSEA-like, overlap, etc.). We might consider learning Siamese networks that transform the original distances into the more standard ones. This remains an unexplored avenue, though.
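For concreteness, here are sketches of the kinds of metrics mentioned above; cosine and Euclidean are the standard choices, while the overlap coefficient stands in for the set-based (GSEA-like, overlap) family. Nothing here prescribes what the CC uses internally:

```python
# Illustrative distance/similarity metrics (numpy only).
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def overlap_coefficient(set_a, set_b):
    """Set-based similarity, useful when signatures are feature sets rather than vectors."""
    return len(set_a & set_b) / min(len(set_a), len(set_b))

a, b = np.array([1., 2., 3.]), np.array([1., 0., 3.])
print(euclidean_distance(a, b), cosine_distance(a, b))
print(overlap_coefficient({"GENE1", "GENE2"}, {"GENE2", "GENE3"}))
```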
Standard input files
| Type | Format | Description |
|---|---|---|
| Feature sets | GMT | XX |
| Key-feature pairs | TSV | XX |
| Key profiles | TSV | XX |
| InChIKeys | TSV | XX |
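Assuming the conventional layouts for these formats (GMT: set name, description, then members, all tab-separated; key-feature pairs: a two-column TSV), minimal readers could look as follows. The actual column conventions should be taken from each dataset's `README`:

```python
# Minimal readers for two of the standard input formats (layouts assumed above).
def read_gmt(path):
    """Yield (set_name, members) tuples from a GMT file."""
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                yield fields[0], fields[2:]

def read_key_feature_pairs(path):
    """Yield (key, feature) tuples from a two-column TSV file."""
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                yield fields[0], fields[1]
```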
Documentation
Every pre-processing script needs to have a `README` file.