Similarity and connectivity

The CC is based upon the similarity principle, i.e. similar molecules have similar properties. Similarity can be defined between pairs of molecules in any of the CC datasets.

When it comes to comparing molecules to other biological entities (antibodies, shRNAs, diseases, etc.), the similarity principle can be generalized to the notion of connectivity. Classical examples of connectivity are molecules that mimic the transcriptional profile of an shRNA experiment, or molecules that revert the transcriptional profile of a disease state.

These are some ways similarity and connectivity can be applied in the CC:

Easy calculation of similarity and connectivity

Calculating similarities and connectivities is the most important feature of the CC and, as such, it must be a flexible part of the repository. Similarities and connectivities are dataset-specific. The relevant scripts (i.e. the ones that lead to Signature Type 0) are within the pre-processing repository structure.

Every dataset has one (or more) pre-processing script(s), always consisting of two steps:

1. Data gathering (and conversion to a standard input file).
   - In the production phase (i.e. when building the dataset), data are gathered from downloads or from calculated molecular properties.
   - In the mapping phase (i.e. when including external molecules or biological entities), data are parsed by the user or fetched from calculated molecular properties, if these are available for the user's compounds of interest.
2. Standard input to Signature Type 0.
   - The outcome of step 1 is a standard input for step 2.
   - The output of this step is a Signature Type 0.
   - The complexity of this step can vary dramatically:
     - Very simple: As in the case of 2D fingerprints, where we simply take the molecular properties corresponding to the InChIKey provided, or the case of indications, where we read drug-disease pairs and map them.
     - Simple: The case of binding data where, on some occasions, we map target classes to the binding data.
     - Not so simple: The case of pathways, where we map targets to human orthologs, and then these to pathway annotations. Here, the input may be of two types (i.e. targets or the pathways themselves).
     - Complex: The case of interactomes, where we map targets to human orthologs and these to several networks using HotNet. Here again, the input may be of two types (i.e. targets or the neighbors themselves).
     - Very complex: The case of LINCS transcriptomics data (D1.001), where we start from signatures of interest, compare them to the Touchstone signatures using a GSEA-like metric, aggregate them if necessary, and filter the outcome accordingly.

In practice:

In the production phase, all procedures above (1 & 2) are wrapped in a fit() method of sign0.
In the mapping phase, step 2 is wrapped in a predict() method of sign0.
- The method can have more than one entry point, i.e. multiple input types. For example, in the biological processes dataset we may enter the targets or the biological process terms directly.
- Inputs must be in a standard input format.
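
The split between the two phases can be illustrated with a toy example. This is only a sketch under stated assumptions, not the actual sign0 implementation: the class `ToyPreprocessor`, its helper methods, and the toy target-to-process table are hypothetical, and step 1 (data gathering) is assumed to have been done by the caller. Only the method names fit() and predict() and the idea of an entry point come from the text above.

```python
from collections import defaultdict


class ToyPreprocessor:
    """Hypothetical two-step pre-processor (not the real sign0 API)."""

    def __init__(self, target_to_terms):
        # Hypothetical annotation table: target -> list of process terms.
        self.target_to_terms = target_to_terms
        self.features = None  # feature vocabulary, fixed when fit() is called

    def _to_terms(self, pairs, entry_point):
        """Bring any accepted input type to a key -> set-of-terms mapping."""
        out = defaultdict(set)
        for key, feat in pairs:
            if entry_point == "proteins":
                out[key].update(self.target_to_terms.get(feat, []))
            elif entry_point == "processes":
                out[key].add(feat)
            else:
                raise ValueError(f"unknown entry point: {entry_point}")
        return out

    def _vectorize(self, key_terms):
        # One binary vector per key, ordered by the fitted vocabulary.
        return {key: [1 if f in terms else 0 for f in self.features]
                for key, terms in key_terms.items()}

    def fit(self, gathered_pairs):
        """Production phase: the vocabulary is learned from the gathered data."""
        key_terms = self._to_terms(gathered_pairs, "proteins")
        self.features = sorted(set().union(*key_terms.values()))
        return self._vectorize(key_terms)

    def predict(self, pairs, entry_point="proteins"):
        """Mapping phase: step 2 only, reusing the fitted vocabulary."""
        if self.features is None:
            raise RuntimeError("call fit() before predict()")
        return self._vectorize(self._to_terms(pairs, entry_point))


pre = ToyPreprocessor({"T1": ["proc_a", "proc_b"], "T2": ["proc_b"]})
train = pre.fit([("mol1", "T1"), ("mol2", "T2")])
by_target = pre.predict([("new1", "T2")])
by_term = pre.predict([("new2", "proc_a")], entry_point="processes")
```

Both calls to predict() produce vectors over the same features as fit(), which is what makes new entities comparable to those already in the dataset.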

Standard input files

Type	Format	Description
InChIKeys	`TSV`	A one-column file containing InChIKeys. This will fetch the corresponding molecular properties from the CC database.
Key-feature pairs	`TSV`	A two-column file containing keys (first column) and features (second column). Features can be, for example, protein identifiers. Optionally, a third column can be included to specify the weight of the key-feature annotation.
Key profiles	`TSV`	A multiple-column file containing keys (first column) and features (second column onwards). These can be, for example, NCI-60 profiles, or chemical-genetic interaction profiles. If a header is not included, the order of the columns should match the one used in the CC internally.
Feature sets	`GMT`	A GMT-like file, typically used for gene sets. First column: Sample (signature) identifier. Second column: Agent (perturbagen, molecule, etc.) identifier. If empty, assume the same as the first column. This is used in case it is necessary to aggregate downstream. Third column: Up features (genes). Can be `NULL`. Fourth column: Down features (genes). If empty, assume that there is no direction in the gene set, and only take the third column. Can be `NULL`.
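
As an illustration of the key-feature pairs format, the following sketch reads a two- or three-column TSV. The function name `read_key_feature_pairs` is hypothetical, and the default weight of 1.0 for rows without a third column is an assumption, since the table does not state one. The keys and features in the example are placeholders.

```python
import csv
import io


def read_key_feature_pairs(handle, default_weight=1.0):
    """Parse a key-feature TSV into (key, feature, weight) tuples.

    default_weight is an assumption for rows that omit the third column.
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or not fields[0].strip():
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"expected at least 2 columns, got: {fields}")
        key, feature = fields[0], fields[1]
        has_weight = len(fields) > 2 and fields[2] != ""
        weight = float(fields[2]) if has_weight else default_weight
        rows.append((key, feature, weight))
    return rows


# Placeholder content: the first row carries a weight, the second does not.
example = io.StringIO("mol1\tFEAT_A\t2\nmol1\tFEAT_B\n")
pairs = read_key_feature_pairs(example)
```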

We highly recommend that, when designing the datasets, features are as explicit as possible. A good way to start would be the metanodes defined in the Bioteque:

Metanode	Abbreviation
Assay	`ASY`
Cell	`CLL`
Chemical entity	`CHE`
Compartment	`CMP`
Domain	`DOM`
Compound	`CPD`
Gene/Protein	`GEN`
Disease	`DIS`
Molecular function	`MFN`
Pathway/process	`PWY`
Protein class	`PCL`
Perturbagen	`PGN`
Symptom	`SYM`
Tissue	`TIS`
Pharmacologic class	`PHC`

Obviously, it is mandatory that the vocabularies used in the production phase and the mapping phase match.

Distance metrics

From Signature Type 0 onwards, the CC only deals with two distance metrics: the cosine distance and the Euclidean distance. These are well-accepted metrics that capture two different properties: the direction and the absolute distance, respectively.
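
The difference between the two metrics can be seen in a small example. This is a plain-Python sketch with hypothetical function names, not the CC's own code: two vectors pointing in the same direction have a cosine distance close to zero regardless of their magnitudes, while their Euclidean distance grows with the difference in magnitude.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the magnitude

d_cos = cosine_distance(u, v)     # close to 0: the directions coincide
d_euc = euclidean_distance(u, v)  # equals the norm of u: the magnitudes differ
```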

Some datasets may require more advanced metrics, though. In this case, we recommend applying any required transformation of the data in the pre-processing, so that Signatures Type 0 are natively comparable using cosine/Euclidean distances. This can be achieved by metric learning algorithms. For example, one can incorporate a Siamese network in the pre-processing.

Entry points

The mapping (prediction) for new molecules/entities can be entered at one or multiple steps of the predict pipeline. The corresponding argument is entry_point.

Please note that, in this case, the key that is kept in the Signature Type 0 is exactly the one provided by the user.

Vocabularies are specified below. Terms in these vocabularies should look like:

Vocabulary	Example
SMILES	Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C
InChI	InChI=1S/C29H31N7O/c1-21-5-10-25(18-27(21)34-29-31-13-11-26(33-29)24-4-3-12-30-19-24)32-28(37)23-8-6-22(7-9-23)20-36-16-14-35(2)15-17-36/h3-13,18-19H,14-17,20H2,1-2H3,(H,32,37)(H,31,33,34)
InChIKey	KTUFNOKKBVMGRW-UHFFFAOYSA-N
UniProtAC	Q9H3N8
ChEMBL target class	Class:1033
PDB ID	1f3v
ECOD domain ID	e1m48A1
ECOD hierarchy ID	E:e1m48A1 X:150 H:3 T:1 F:13
ChEBI	CHEBI:52206
Reactome	R-HSA-6804756
Gene Ontology	GO:0023051
Interactome & UniProtAC	string_Q9H3N8 inbiomap_Q9H3N8...
Gene names/symbols	HRH4
NCI-60 cell lines	MDA-MB-435
Yeast mutants	YBL067C_sn1167
812 microscopy features	Cells_AreaShape_Eccentricity
Cellosaurus	CVCL_0382
ATC	L L02 L02A L02AA L02AA03
MEDIC/MeSH	DB00932
UMLS	C0157733
DrugBank	DB00930
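
Before mapping, it can help to check that user-provided terms at least have the expected shape. The patterns below are a sketch: they are inferred only from the example terms in the table above and are not official identifier specifications (the Reactome pattern, for instance, assumes human entries only). The names `PATTERNS` and `looks_like` are hypothetical.

```python
import re

# Hypothetical shape checks inferred from the example terms in the table;
# they are not official identifier specifications.
PATTERNS = {
    "InChIKey": re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$"),
    "ChEBI": re.compile(r"^CHEBI:\d+$"),
    "Reactome": re.compile(r"^R-HSA-\d+$"),  # assumes human entries only
    "Gene Ontology": re.compile(r"^GO:\d{7}$"),
    "Cellosaurus": re.compile(r"^CVCL_[A-Z0-9]+$"),
}


def looks_like(vocabulary, term):
    """Return True if term matches the assumed pattern for vocabulary."""
    try:
        pattern = PATTERNS[vocabulary]
    except KeyError:
        raise ValueError(f"no pattern registered for {vocabulary!r}") from None
    return bool(pattern.match(term))


ok = looks_like("InChIKey", "KTUFNOKKBVMGRW-UHFFFAOYSA-N")
bad = looks_like("InChIKey", "HRH4")  # a gene symbol, not an InChIKey
```

A check like this only catches terms entered under the wrong vocabulary; it does not confirm that a term exists in the vocabulary used in the production phase.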

`A` Chemistry

`A1.001` 2D fingerprints

Key-raw molecule: smiles / vocabulary: SMILES
Key-standard molecule: inchi / vocabulary: InChI
InChIKeys: inchikey [Default] / vocabulary: InChIKey

`A2.001` 3D fingerprints

Key-raw molecule: smiles / vocabulary: SMILES
Key-standard molecule: inchi / vocabulary: InChI
InChIKeys: inchikey [Default] / vocabulary: InChIKey

`A3.001` Scaffolds

Key-raw molecule: smiles / vocabulary: SMILES
Key-standard molecule: inchi / vocabulary: InChI
InChIKeys: inchikey [Default] / vocabulary: InChIKey

`A4.001` Structural keys

Key-raw molecule: smiles / vocabulary: SMILES
Key-standard molecule: inchi / vocabulary: InChI
InChIKeys: inchikey [Default] / vocabulary: InChIKey

`A5.001` Physicochemistry

Key-raw molecule: smiles / vocabulary: SMILES
Key-standard molecule: inchi / vocabulary: InChI
InChIKeys: inchikey [Default] / vocabulary: InChIKey

`B` Targets

`B1.001` Mechanism of action

Key-Protein pairs (-1/+1; default = -1): proteins [Default] / vocabulary: UniProtAC
Key-Class pairs (-1/+1; default = -1): classes / vocabulary: ChEMBL target class

`B2.001` Metabolic genes

Key-Protein pairs: proteins [Default] / vocabulary: UniProtAC
Key-Class pairs: classes / vocabulary: ChEMBL target class

`B3.001` Crystals

Key-structure pairs: structures / vocabulary: PDB ID
Key-domain pairs: domains [Default] / vocabulary: ECOD domain ID
Key-domain pairs: domain_hierarchies / vocabulary: ECOD hierarchy ID

`B4.001` Binding

Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
Key-Class pairs (2/1; default = 1): classes / vocabulary: ChEMBL target class

`B5.001` HTS bioassays

Key-Protein pairs: proteins [Default] / vocabulary: UniProtAC
Key-Class pairs: classes / vocabulary: ChEMBL target class

`C` Networks

`C1.001` Small molecule roles

Key-ChEBI pairs: terms [Default] / vocabulary: ChEBI

`C2.001` Small molecule pathways

Key-metabolite pairs (neighbors) (10-1; default = 5): metabolites_neighbors [Default] / vocabulary: InChIKey

`C3.001` Signaling pathways

Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
Key-Pathway pairs (2/1; default = 1): pathways / vocabulary: Reactome

`C4.001` Biological processes

Key-Protein pairs (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
Key-Process pairs (2/1; default = 1): processes / vocabulary: Gene Ontology (Biological processes)

`C5.001` Interactome

Key-Protein pairs (exact nodes) (2/1; default = 1): proteins [Default] / vocabulary: UniProtAC
Key-Protein pairs (neighbors) (20-1; default = 10): protein_neighbors / vocabulary: UniProtAC
Key-Protein pairs (neighbors in specific network) (20-1; default = 10): protein_neighbors_network / vocabulary: Interactome & UniProtAC

`D` Cells

`D1.001` Gene expression

Key-(perturbation)-Up&Down genes: up_down [Default] / vocabulary: Gene names/symbols

`D2.001` Cancer cell lines

Key-Profile (score): profile [Default] / vocabulary: NCI-60 cell lines

`D3.001` Chemical genetics

Key-Strain (2/1; default = 1): strain [Default] / vocabulary: Yeast mutants

`D4.001` Morphology

Key-Measure (score): measure [Default] / vocabulary: 812 microscopy features

`D5.001` Cell bioassays

Key-Cell: cell [Default] / vocabulary: Cellosaurus

`E` Clinics

`E1.001` Therapeutic areas

Key-ATC: atc [Default] / vocabulary: ATC

`E2.001` Indications

Key-Disease (4/1; default = 2): disease [Default] / vocabulary: MEDIC/MeSH

`E3.001` Side effects

Key-Side effect: side_effect [Default] / vocabulary: UMLS

`E4.001` Diseases and Toxicology

Key-Disease (M/T; mandatory): disease [Default] / vocabulary: MEDIC/MeSH

`E5.001` Drug-drug interactions

Key-Drug: drug [Default] / vocabulary: DrugBank
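
The entry points listed above lend themselves to a simple lookup. The sketch below transcribes a few of the datasets into a hypothetical registry (`ENTRY_POINTS`) and resolves the entry_point argument against it, falling back to the entry point marked [Default]. The registry layout and the function `resolve_entry_point` are assumptions made for illustration, not the CC's own data structures.

```python
# Hypothetical registry transcribed from a few of the datasets listed above:
# dataset code -> (default entry point, {entry point: vocabulary}).
ENTRY_POINTS = {
    "A1.001": ("inchikey", {"smiles": "SMILES",
                            "inchi": "InChI",
                            "inchikey": "InChIKey"}),
    "B4.001": ("proteins", {"proteins": "UniProtAC",
                            "classes": "ChEMBL target class"}),
    "C4.001": ("proteins", {"proteins": "UniProtAC",
                            "processes": "Gene Ontology (Biological processes)"}),
    "E5.001": ("drug", {"drug": "DrugBank"}),
}


def resolve_entry_point(dataset, entry_point=None):
    """Return (entry_point, vocabulary), using the dataset default if needed."""
    if dataset not in ENTRY_POINTS:
        raise KeyError(f"dataset not in this sketch: {dataset}")
    default, allowed = ENTRY_POINTS[dataset]
    chosen = default if entry_point is None else entry_point
    if chosen not in allowed:
        raise ValueError(
            f"{chosen!r} is not an entry point of {dataset}; "
            f"choose from {sorted(allowed)}"
        )
    return chosen, allowed[chosen]


default_choice = resolve_entry_point("B4.001")
explicit_choice = resolve_entry_point("C4.001", entry_point="processes")
```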