Similarity and connectivity
The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.
When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA experiment, or molecules that revert the transcriptional profile of a disease state.
These are some ways similarity and connectivity can be applied in the CC:
Easy calculation of similarity and connectivity
Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the pre-processing repository structure.
Every dataset has one (or more) pre-processing script(s), always consisting of two steps:
1. Data gathering (and conversion to a standard input file).
   - In the production phase (i.e. when building the dataset), data are gathered from downloads or from calculated molecular properties.
   - In the mapping phase (i.e. when including external molecules or biological entities), data are parsed by the user or fetched from calculated molecular properties, if these are available for the user's compounds of interest.
2. Standard input to Signature Type 0.
   - The outcome of step 1 is a standard input for step 2.
   - The output of this step is a Signature Type 0.
   - The complexity of this step can vary dramatically:
     - Very simple: 2D fingerprints, where we simply take the corresponding molecular properties of the InChIKey provided. Likewise indications, where we read drug-disease pairs and map them.
     - Simple: binding data, where, on some occasions, we map target classes to the binding data.
     - Not so simple: pathways, where we map targets to human orthologs, and then these to pathway annotations. In this case, the input may be of two types (i.e. targets or the pathways themselves).
     - Complex: interactomes, where we map targets to human orthologs and these to several networks using HotNet. Here again, the input may be of two types (i.e. targets or the neighbors themselves).
     - Very complex: LINCS transcriptomics data (D1.001), where we start from signatures of interest, compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary and filter the outcome accordingly.
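As an illustration of the "very simple" end of step 2, the sketch below turns key-feature pairs into a key-by-feature matrix. It assumes that a Signature Type 0 can be represented as such a matrix; the function name `pairs_to_matrix` and the identifiers in the toy data are hypothetical and are not part of the CC code base.

```python
import numpy as np


def pairs_to_matrix(pairs, default_weight=1.0):
    """Turn (key, feature[, weight]) tuples into a dense key-by-feature matrix.

    Hypothetical helper for illustration only.
    """
    keys = sorted({p[0] for p in pairs})
    features = sorted({p[1] for p in pairs})
    key_idx = {k: i for i, k in enumerate(keys)}
    feat_idx = {f: j for j, f in enumerate(features)}
    X = np.zeros((len(keys), len(features)))
    for p in pairs:
        # The optional third element is the weight of the annotation.
        weight = float(p[2]) if len(p) > 2 else default_weight
        X[key_idx[p[0]], feat_idx[p[1]]] = weight
    return keys, features, X


# Toy key-feature pairs (all identifiers are made up).
pairs = [
    ("MOL_A", "P00001"),
    ("MOL_A", "P00002", 2),
    ("MOL_B", "P00002"),
]
keys, features, X = pairs_to_matrix(pairs)
```

The more complex cases in the list would add mapping steps (orthologs, networks, GSEA-like comparisons) before this final conversion.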
In practice:
- At production phase, all procedures above (1 & 2) are wrapped in a fit() method of sign0.
- For the mapping phase, step 2 is wrapped in a predict() method of sign0.
  - The method can have more than one entry point, i.e. multiple input types. For example, in the biological processes dataset we may enter the targets or the biological process terms directly.
  - Inputs must be in a standard input format.
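The fit()/predict() split with several entry points could look roughly like the sketch below. `ToySign0`, its private helpers and the protein-to-pathway lookup are hypothetical stand-ins; only the method names fit() and predict() and the idea of an entry point come from the text above.

```python
class ToySign0:
    """Hypothetical stand-in for sign0; not the CC implementation."""

    def __init__(self, protein_to_pathways):
        # Lookup used when the entry point is "proteins" (assumed to be given).
        self.protein_to_pathways = protein_to_pathways
        self.features = None

    def _to_pathways(self, pairs, entry_point):
        if entry_point == "pathways":
            return list(pairs)
        if entry_point == "proteins":
            return [(key, pwy) for key, prot in pairs
                    for pwy in self.protein_to_pathways.get(prot, [])]
        raise ValueError(f"Unknown entry point: {entry_point}")

    def _vectorize(self, pairs):
        col = {f: j for j, f in enumerate(self.features)}
        rows = {}
        for key, feat in pairs:
            row = rows.setdefault(key, [0] * len(self.features))
            if feat in col:  # features unseen at fit time are ignored
                row[col[feat]] = 1
        return rows

    def fit(self, pairs, entry_point="proteins"):
        """Production phase: learn the feature vocabulary and vectorize."""
        mapped = self._to_pathways(pairs, entry_point)
        self.features = sorted({f for _, f in mapped})
        return self._vectorize(mapped)

    def predict(self, pairs, entry_point="proteins"):
        """Mapping phase: reuse the vocabulary learned in fit()."""
        if self.features is None:
            raise RuntimeError("call fit() first")
        return self._vectorize(self._to_pathways(pairs, entry_point))


model = ToySign0({"P1": ["PWY_X"], "P2": ["PWY_X", "PWY_Y"]})
train = model.fit([("MOL_A", "P1"), ("MOL_B", "P2")])
by_protein = model.predict([("my_key", "P2")])
by_pathway = model.predict([("my_key", "PWY_Y")], entry_point="pathways")
```

In this toy version the user-provided key (`my_key`) is kept as given, and both entry points end up in the same feature space because predict() reuses the vocabulary fixed by fit().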
Standard input files
Type | Format | Description |
---|---|---|
InChIKeys | TSV | A one-column file containing InChIKeys. This will fetch the corresponding molecular properties from the CC database. |
Key-feature pairs | TSV | A two-column file containing keys (first column) and features (second column). Features can be, for example, protein identifiers. Optionally, a third column can be included to specify the weight of the key-feature annotation. |
Key profiles | TSV | A multiple-column file containing keys (first column) and features (second column onwards). These can be, for example, NCI-60 profiles, or chemical-genetic interaction profiles. If a header is not included, the order of the columns should match the one used in the CC internally. |
Feature sets | GMT | A GMT-like file, typically used for gene sets. First column: sample (signature) identifier. Second column: agent (perturbagen, molecule, etc.) identifier; if empty, it is assumed to be the same as the first column. This is used in case it is necessary to aggregate downstream. Third column: up features (genes); can be NULL. Fourth column: down features (genes); if empty, the gene set is assumed to have no direction and only the third column is used; can be NULL. |
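A minimal parser for the four-column feature-set layout described in the table might look as follows. It assumes tab-separated columns and comma-separated features within a column; the in-column delimiter is not stated in this document, so treat it as a guess. The function names and the example rows are hypothetical.

```python
import csv
import io


def _split(field):
    """Split a feature column; empty or NULL means no features."""
    field = field.strip()
    if not field or field == "NULL":
        return []
    return field.split(",")  # assumed in-column delimiter


def parse_feature_sets(handle):
    """Read sample, agent, up and down columns from a GMT-like file."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        row = row + [""] * (4 - len(row))  # pad missing trailing columns
        sample, agent, up, down = (c.strip() for c in row[:4])
        records.append({
            "sample": sample,
            "agent": agent or sample,  # empty agent falls back to the sample
            "up": _split(up),
            "down": _split(down),
        })
    return records


example = io.StringIO(
    "sig_1\tmol_1\tTP53,EGFR\tMYC\n"
    "sig_2\t\tBRCA1\n"
)
records = parse_feature_sets(example)
```

A record with an empty `down` list corresponds to a gene set without direction.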
We highly recommend that, when designing the datasets, features are as explicit as possible. A good way to start would be the metanodes defined in the Bioteque:
Metanode | Abbreviation |
---|---|
Assay | ASY |
Cell | CLL |
Chemical entity | CHE |
Compartment | CMP |
Domain | DOM |
Compound | CPD |
Gene/Protein | GEN |
Disease | DIS |
Molecular function | MFN |
Pathway/process | PWY |
Protein class | PCL |
Perturbagen | PGN |
Symptom | SYM |
Tissue | TIS |
Pharmacologic class | PHC |
Obviously, it is mandatory that the vocabularies used in the production phase and the mapping phase match.
Distance metrics
From Signature Type 0 onwards, the CC only deals with two distance metrics: the cosine distance and the Euclidean distance. These are well-accepted metrics that capture two different properties: the direction and the absolute distance, respectively.
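To make the difference between the two metrics concrete, the following sketch computes both with NumPy, taking cosine distance as one minus the cosine similarity. The vectors are arbitrary toy data.

```python
import numpy as np


def cosine_distance(u, v):
    """1 - cosine similarity: sensitive to direction, not to magnitude."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def euclidean_distance(u, v):
    """Straight-line distance: sensitive to magnitude as well."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.linalg.norm(u - v)


a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the magnitude

cos_ab = cosine_distance(a, b)
euc_ab = euclidean_distance(a, b)
```

Here `a` and `b` point in the same direction, so their cosine distance is (numerically) zero, while their Euclidean distance equals the norm of `a`.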
Some datasets may require more advanced metrics, though. In this case, we recommend applying any required transformation of the data in the pre-processing, so that Signatures Type 0 are natively comparable using cosine/Euclidean distances. This can be achieved with metric learning algorithms; for example, one can incorporate a Siamese network in the pre-processing.
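The core of the Siamese idea can be sketched in a few lines: one shared set of weights embeds both members of a pair, and a contrastive loss pulls similar pairs together while pushing dissimilar pairs at least a margin apart in Euclidean distance. The single-layer embedding, the margin value and the random data below are assumptions for illustration. The document does not specify an architecture, and the training loop (normally written with a deep learning framework) is omitted.

```python
import numpy as np


def embed(X, W):
    """Shared ("Siamese") branch: the same weights W embed every input."""
    return np.tanh(X @ W)


def contrastive_loss(X1, X2, similar, W, margin=1.0):
    """Mean contrastive loss over pairs; similar[i] is 1 for similar pairs, else 0."""
    d = np.linalg.norm(embed(X1, W) - embed(X2, W), axis=1)
    similar = np.asarray(similar, dtype=float)
    pull = similar * d ** 2                                    # similar pairs: shrink distance
    push = (1.0 - similar) * np.maximum(0.0, margin - d) ** 2  # dissimilar pairs: enforce margin
    return float(np.mean(pull + push))


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))   # 5 raw features -> 2 embedding dimensions
X1 = rng.normal(size=(4, 5))  # first members of 4 pairs
X2 = rng.normal(size=(4, 5))  # second members of 4 pairs
loss = contrastive_loss(X1, X2, [1, 0, 1, 0], W)
```

Once `W` has been trained to minimize this loss, `embed(X, W)` would be applied in the pre-processing so that the resulting vectors are directly comparable with the Euclidean distance.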
Entry points
The mapping (prediction) for new molecules/entities can be entered at one or multiple steps of the predict pipeline. The corresponding argument is entry_point.
Please note that, in this case, the key that is kept in the Signature Type 0 is exactly the one provided by the user.
Vocabularies are specified below. Terms in these vocabularies should look like:
Vocabulary | Example |
---|---|
SMILES | Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C |
InChI | InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34) |
InChIKey | KTUFNOKKBVMGRW-UHFFFAOYSA-N |
UniProtAC | Q9H3N8 |
ChEMBL target class | Class:1033 |
PDB ID | 1f3v |
ECOD domain ID | e1m48A1 |
ECOD hierarchy ID | E:e1m48A1 X:150 H:3 T:1 F:13 |
ChEBI | CHEBI:52206 |
Reactome | R-HSA-6804756 |
Gene Ontology | GO:0023051 |
Interactome & UniProtAC | string_Q9H3N8 inbiomap_Q9H3N8... |
Gene Names/Symbols | HRH4 |
NCI60 cell lines | MDA-MB-435 |
Yeast mutants | YBL067C_sn1167 |
812 microscopy features | Cells_AreaShape_Eccentricity |
Cellosaurus | CVCL_0382 |
ATC | L L02 L02A L02AA L02AA03 |
MEDIC/MeSH | DB00932 |
UMLS | C0157733 |
DrugBank | DB00930 |
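Before mapping, it can be useful to check that user-provided terms look like members of the expected vocabulary. The sketch below does this with regular expressions for a handful of the vocabularies above. The patterns are assumptions inferred from the example terms in the table and from common identifier conventions, not rules taken from the CC, and `check_terms` is a hypothetical helper.

```python
import re

# Assumed patterns, inferred from the examples in the table above.
PATTERNS = {
    "InChIKey": re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$"),
    "ChEBI": re.compile(r"^CHEBI:\d+$"),
    "Reactome": re.compile(r"^R-[A-Z]{3}-\d+$"),
    "Gene Ontology": re.compile(r"^GO:\d{7}$"),
    "Cellosaurus": re.compile(r"^CVCL_[A-Z0-9]{4}$"),
}


def check_terms(vocabulary, terms):
    """Return the terms that do not match the pattern of the vocabulary."""
    pattern = PATTERNS[vocabulary]
    return [t for t in terms if not pattern.match(t)]


bad_keys = check_terms("InChIKey", ["KTUFNOKKBVMGRW-UHFFFAOYSA-N", "not-a-key"])
bad_go = check_terms("Gene Ontology", ["GO:0023051"])
```

A format check of this kind only catches malformed terms; it does not confirm that a term actually exists in the vocabulary used at production phase.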
A
Chemistry
A1.001
2D fingerprints
- Key-raw molecule: smiles / vocabulary: SMILES
- Key-standard molecule: inchi / vocabulary: InChI
- InChIKeys: inchikey [Default] / vocabulary: InChIKey
A2.001
3D fingerprints
- Key-raw molecule: smiles / vocabulary: SMILES
- Key-standard molecule: inchi / vocabulary: InChI
- InChIKeys: inchikey [Default] / vocabulary: InChIKey
A3.001
Scaffolds
- Key-raw molecule: smiles / vocabulary: SMILES
- Key-standard molecule: inchi / vocabulary: InChI
- InChIKeys: inchikey [Default] / vocabulary: InChIKey
A4.001
Structural keys
- Key-raw molecule: smiles / vocabulary: SMILES
- Key-standard molecule: inchi / vocabulary: InChI
- InChIKeys: inchikey [Default] / vocabulary: InChIKey
A5.001
Physicochemistry
- Key-raw molecule: smiles / vocabulary: SMILES
- Key-standard molecule: inchi / vocabulary: InChI
- InChIKeys: inchikey [Default] / vocabulary: InChIKey
B
Targets
B1.001
Mechanism of action
- Key-Protein pairs (-1/+1; default = -1): proteins [Default] / vocabulary: UniProtAC
- Key-Class pairs (-1/+1; default = -1): classes / vocabulary: ChEMBL target class
B2.001
Metabolic genes
- Key-Protein pairs: proteins [Default] / vocabulary: UniProtAC
- Key-Class pairs: classes / vocabulary: ChEMBL target class
B3.001
Crystals
- Key-structure pairs: structures / vocabulary: PDB ID
- Key-domain pairs: domains [Default] / vocabulary: ECOD domain ID
- Key-domain pairs: domain_hierarchies / vocabulary: ECOD hierarchy ID
B4.001
Binding
- Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
- Key-Class pairs (2/1; default = 1): classes / vocabulary: ChEMBL target class
B5.001
HTS bioassays
- Key-Protein pairs: proteins [Default] / vocabulary: UniProtAC
- Key-Class pairs: classes / vocabulary: ChEMBL target class
C
Networks
C1.001
Small molecule roles
- Key-ChEBI pairs: terms [Default] / vocabulary: ChEBI
C2.001
Small molecule pathways
- Key-metabolite pairs (neighbors) (10-1; default = 5): metabolites_neighbors [Default] / vocabulary: InChIKey
C3.001
Signaling pathways
- Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
- Key-Pathway pairs (2/1; default = 1): pathways / vocabulary: Reactome
C4.001
Biological processes
- Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
- Key-Process pairs (2/1; default = 1): processes / vocabulary: Gene Ontology (Biological processes)
C5.001
Interactome
- Key-Protein pairs (exact nodes) (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
- Key-Protein pairs (neighbors) (20-1; default = 10): protein_neighbors / vocabulary: UniProtAC
- Key-Protein pairs (neighbors in specific network) (20-1; default = 10): protein_neighbors_network / vocabulary: Interactome & UniProtAC
D
Cells
D1.001
Gene expression
- Key-(perturbation)-Up&Down genes: up_down [Default] / vocabulary: Gene names/symbols
D2.001
Cancer cell lines
- Key-Profile (score): profile [Default] / vocabulary: NCI-60 cell lines
D3.001
Chemical genetics
- Key-Strain (2/1; default = 1): strain [Default] / vocabulary: Yeast mutants
D4.001
Morphology
- Key-Measure (score): measure [Default] / vocabulary: 812 microscopy features
D5.001
Cell bioassays
- Key-Cell: cell [Default] / vocabulary: Cellosaurus
E
Clinics
E1.001
Therapeutic areas
- Key-ATC: atc [Default] / vocabulary: ATC codes
E2.001
Indications
- Key-Disease (4/1; default = 2): disease [Default] / vocabulary: MEDIC/MeSH
E3.001
Side effects
- Key-Side effect: side_effect [Default] / vocabulary: UMLS
E4.001
Diseases and Toxicology
- Key-Disease (M/T; mandatory): disease [Default] / vocabulary: MEDIC/MeSH
E5.001
Drug-drug interactions
- Key-Drug: drug [Default] / vocabulary: DrugBank