We apply and develop novel graph-theoretic/statistical/machine learning techniques for solving problems in computational biology and medicine. These techniques can provide an answer to many challenges in these domains, because they offer a natural way to integrate different types of data and to handle large amounts of noisy information.

An important idea that has emerged recently is that a cell can be viewed as a complex network of interrelated proteins, nucleic acids and other bio-molecules. A bio-molecular network can be viewed as a collection of nodes, representing the bio-molecules, connected by links, representing relations between the bio-molecules. Examples of biological networks include regulatory networks, metabolic networks and signalling networks. At the same time, data generated by large-scale experiments often have a natural representation as networks, such as protein-protein interaction (PPI) networks, genetic interaction networks, and co-expression networks. Finally, it is understood that the interconnectivity between cellular components (genes, metabolites, microRNAs, etc.) has important implications for diseases. The view that has become widely accepted is that genetic disease is the result of abnormal interactions between multiple players in complex networks. From a computational point of view, a central objective for systems biology and medicine is therefore to develop methods for inferring networks, parts of networks, or relations between networks, possibly using data which are themselves in the form of networks.

Much of our research focuses on developing novel mathematical methods specifically suited for making inferences on biological networks, building on recent results from computer science and machine learning. In particular, the methods we develop take into account both the structure of the networks representing the data and the structure of the network representing the biological question being answered. The final goal is to be able to answer questions in systems biology and medicine that will help us understand and predict complex cellular behaviour in health and disease.

Recently, we have mainly focused on seven areas:
  1. Network Medicine
  2. Inference and analysis of large-scale Protein-Protein Interaction networks
  3. Protein Function Prediction
  4. Inferring relationships between Genotype, Phenotype and Environment
  5. Analysis of Biological Processes from co-expression networks
  6. Computational Pharmacology
  7. 3D Genome Analysis


Network Medicine

In a cell, the function of most cellular components (genes, proteins, metabolites, micro-RNA, etc.) is brought to bear through the interaction with other cellular components. The interconnectivity among bio-molecules implies that the relation between the entire set of genes in a cell (genotype) and their physical manifestation (phenotype) is extremely complex, since it is mediated by these complex molecular networks. Network medicine is a recent paradigm that exploits the organizing principles of human cellular networks and links network structures to disease.

From a network medicine perspective, hereditary diseases can be seen as perturbations of “disease modules” in the interactome. An important effort in our lab has been aimed at quantifying similarity between heritable diseases at the molecular level by bringing together the existing information that is scattered across the vast corpus of biomedical literature. In other words, we obtain a number that quantifies the distance between disease modules in the interactome.

Quantifying disease similarity at the molecular level enables the transfer of knowledge between similar diseases, providing hypotheses for causal gene discovery and even suggestions for drug repositioning. This is particularly important for hereditary diseases for which no disease gene is currently known – about 30% of them. For these orphan diseases, our measure can help pinpoint the location of their molecular perturbations. Our measure can also be used for differential diagnosis, aiding medical practitioners in identifying putative alternative diagnoses that are obscured by the complexity and multiplicity of the symptoms.

We have also developed a novel network-based approach to prioritize gene-disease associations that can also predict genes for diseases with no known molecular basis by exploiting our phenotypic measure. Our method, which uses semi-supervised learning for the prediction, can accurately predict disease genes for molecularly uncharacterised diseases and also gives excellent results for molecularly characterized diseases, when compared with state-of-the-art methods. Moreover, it can also be used for disease module prediction.

In collaboration with the lab of Giorgio Valentini at the University of Milan, we have developed a network-based approach for modelling patients’ biomolecular profiles for clinical phenotype/outcome prediction. Our method builds the profiles in a graph-structured patient space rather than the more typical biomarker space. We construct a network of patients based on their functional or genetic similarities, and then we apply a semi-supervised transductive approach to predict phenotype/clinical outcomes. Extensive tests show that our approach accurately predicts phenotype/outcome in patients with several diseases and provides interpretable results, thus leading to an explainable patient stratification based on their biomolecular characteristics.
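To make the idea of transductive prediction on a patient network concrete, here is a minimal sketch of generic label propagation. It is not the method described above: the function name `propagate_labels`, the patient identifiers, the similarity weights and the 0/1 outcome encoding are all assumptions chosen for illustration.

```python
def propagate_labels(adjacency, labels, iterations=50):
    """Generic label propagation on a weighted patient-similarity graph.

    Known labels are clamped at every step; each unlabelled patient takes
    the similarity-weighted mean score of its neighbours.
    Assumes every patient has at least one neighbour.
    """
    # Unlabelled patients start from an uninformative score of 0.5.
    scores = {node: labels.get(node, 0.5) for node in adjacency}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            if node in labels:
                updated[node] = labels[node]
            else:
                total_weight = sum(neighbours.values())
                updated[node] = sum(
                    weight * scores[nb] for nb, weight in neighbours.items()
                ) / total_weight
        scores = updated
    return scores


# Hypothetical patient network: edge weights are similarity scores.
adjacency = {
    "p1": {"p2": 1.0, "p3": 0.8},
    "p2": {"p1": 1.0, "p3": 0.9},
    "p3": {"p1": 0.8, "p2": 0.9, "p4": 0.1},
    "p4": {"p3": 0.1, "p5": 1.0},
    "p5": {"p4": 1.0},
}
# Only two patients have a known outcome (1.0 = poor, 0.0 = good).
labels = {"p1": 1.0, "p5": 0.0}

scores = propagate_labels(adjacency, labels)
```

In this toy network, patients strongly connected to `p1` end up with scores close to 1 and the patient strongly connected to `p5` ends up with a score close to 0, and each prediction can be traced back to the neighbouring patients that produced it.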

In our lab, research in Network Medicine has been funded by the BBSRC (grants BB/K004131/1, BB/F00964X/1 and BB/M025047/1).


Inference and analysis of large-scale protein-protein interaction networks

Proteins carry out their molecular functions by interacting with other molecules, mainly other proteins. For this reason protein interactions provide an important step toward understanding protein function and cell behaviour. Systematically mapping the set of all protein-protein interactions within an organism – the interactome – has therefore become a major challenge in post-genomic biology. Recent developments in experimental procedures (e.g. co-affinity purification followed by mass spectrometry, AP-MS) have resulted in the publication of many high-quality protein-protein interaction datasets for different organisms ranging from the yeast Saccharomyces cerevisiae to Homo sapiens.

An interactome has a natural representation as an undirected graph, often called a protein-protein interaction (PPI) network, where nodes represent proteins and edges represent interactions between pairs of proteins. Often an estimate of the reliability of such interactions is available and is included as edge labels (weights). Interactomes have a modular structure, meaning that there are sets of proteins that interact with each other more frequently than with the rest of the network. These densely connected regions are typically interpreted as protein complexes, and their identification is crucial to deepen our understanding of cellular processes. The problem of identifying protein complexes from PPI data is then equivalent to detecting dense regions containing many connections in PPI networks (or regions with large weights if the networks are weighted).
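As a small illustration of what "dense region" means in a weighted PPI network, the sketch below scores a candidate set of proteins by its weighted density, i.e. the total weight of the edges inside the set divided by the number of possible pairs. The function name `weighted_density`, the protein names and the weights are assumptions for illustration; this only scores a given set and is not a complex-detection algorithm.

```python
def weighted_density(edges, nodes):
    """Total edge weight inside `nodes` divided by the number of possible pairs."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal_weight = sum(
        weight for (u, v), weight in edges.items() if u in nodes and v in nodes
    )
    return internal_weight / (n * (n - 1) / 2)


# Toy weighted PPI network: each key is an unordered protein pair,
# each value a hypothetical reliability score in [0, 1].
edges = {
    ("A", "B"): 0.9,
    ("A", "C"): 0.8,
    ("B", "C"): 0.7,
    ("C", "D"): 0.2,
    ("D", "E"): 0.3,
}

dense = weighted_density(edges, ["A", "B", "C"])   # fully connected, high weights
sparse = weighted_density(edges, ["C", "D", "E"])  # two weak edges out of three pairs
```

A complex-detection method would search over many candidate sets and report those that score highly on a measure of this kind.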

In our lab, research on large-scale PPI networks has been funded by the BBSRC (grant BB/F00964X/1) and the Royal Society (grant NF080750).


Protein Function Prediction

In recent years, numerous large-scale sequencing projects have generated enormous amounts of sequence data. This has led to the identification of thousands of previously unknown genes whose function has yet to be characterized. A precise definition of protein function is difficult, as in general the meaning of the term “function” depends on the context being considered. The current dominant solution to this problem is the use of ontologies, consisting of terms in a controlled vocabulary organized in a hierarchical structure through a set of well-defined relationships.

Standard ontologies usually have a structure that can be modelled by a rooted and oriented tree or, more generally, by a directed acyclic graph, like the Gene Ontology, which is becoming the standard. Even when function is defined through ontologies, about a third of the proteins have unknown function, even for the best characterized model organisms. A fundamental goal is therefore to identify the function of uncharacterised genes on a genomic scale. It is difficult to design functional assays for uncharacterised genes, so a major challenge in bioinformatics is to devise algorithmic methods that, given a gene, can predict a hypothesis for its function that can then be validated experimentally.
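One practical consequence of the directed acyclic graph structure is that an annotation to a specific term implies annotation to all of its ancestors, so predictions should be consistent with the hierarchy. The sketch below illustrates this on a toy ontology; the term names and parent links are simplified assumptions rather than actual Gene Ontology entries, and `ancestors` and `propagate` are hypothetical helper names.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_terms, parents):
    """Extend a set of direct annotations with every ancestor term."""
    full = set(direct_terms)
    for term in direct_terms:
        full |= ancestors(term, parents)
    return full


# Toy ontology: each term maps to its parents. A term may have several
# parents, which is what makes the structure a DAG rather than a tree.
parents = {
    "protein kinase activity": ["kinase activity", "protein modifying activity"],
    "kinase activity": ["catalytic activity"],
    "protein modifying activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}

annotations = propagate({"protein kinase activity"}, parents)
```

Here a single direct annotation expands to five terms, including the root, which is why function prediction over an ontology is usually treated as a structured problem rather than as a set of independent labels.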

In our lab, research in protein function prediction has been funded by the BBSRC (grant BB/F00964X/1).


Inferring relationships between genotype, phenotype and environment

An important problem in biology is to uncover the links between the genetic makeup of an organism (genotype) and its observable physical or biochemical characteristics (phenotype). For example, this would increase our ability to rapidly characterize an unknown microorganism, which is critical both in responding to infectious disease and in biodefense. To do this, we need some way of anticipating an organism’s phenotype based on the molecules encoded by its genome.

At the same time, how specific sequences link distinct environmental conditions with specific biological processes is also not well understood. Thus, another important challenge is to understand how the usage of particular pathways and subnetworks reflects the adaptation of microbial communities across environments and habitats – i.e., how network dynamics relate to environmental features.


Analysis and detection of biological processes from co-expression networks

Gene expression experiments measure the activity of thousands of genes in response to different conditions. Generally, genes involved in a particular biological mechanism tend to exhibit similar expression patterns and form groups. An important question in this area is that of detecting from transcriptomics data which biological processes are activated in a given condition.
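A co-expression network can be built by linking genes whose expression profiles are strongly correlated across conditions. The sketch below uses Pearson correlation with a fixed cut-off; the gene names, expression values and the 0.9 threshold are assumptions for illustration, and real analyses typically use many more conditions and a more careful choice of threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpression_edges(expression, threshold=0.9):
    """Link every pair of genes whose absolute correlation reaches the threshold."""
    edges = {}
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges[(gene_a, gene_b)] = r
    return edges


# Hypothetical expression levels of three genes across four conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 1.2],
}

edges = coexpression_edges(expression)
```

With these toy values only `geneA` and `geneB` are linked, since their profiles rise together while `geneC` follows a different pattern.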

Another problem is that of selecting marker genes which can represent such specific mechanisms. These markers can be used as readouts to help understand the mechanisms, monitor the interactions between them and track the physiological effect they may exert. For example, as yeast cells grow, genes involved in various hormone pathways exhibit distinct similarity in expression patterns and form groups. Sensitive and specific markers which can track and report the dynamics of each group are important for investigating the mechanisms of response to each hormone, cross-talk between hormone pathways and the relationship between hormones and phenotypic effects.

In our lab, research on the analysis of transcriptomics data has been funded by the BBSRC (grant BB/F00964X/1) and Royal Holloway, through the Agnes Grace Ellen Endowment.


Computational Pharmacology

The frequency of drug side effects in the population is determined through placebo-controlled studies during drug clinical trials. However, it is well recognised that many drug side effects are not observed during such trials, thus remaining a leading cause of morbidity and mortality in health care. Our idea is to use a matrix decomposition model to learn a low dimensional representation of drugs and side effects that encodes the interplay between them – we call these low dimensional representations signatures. Results show that our model can predict the frequency of drug side effects with high accuracy and, importantly, that drug signatures can explain molecular and clinical drug responses. Thus, our biologically interpretable model goes beyond standard black-box machine learning modelling and can shed new light on population-level drug response.
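The general idea of learning low dimensional signatures from a partially observed drug by side-effect matrix can be sketched with a plain regularised matrix factorization trained by stochastic gradient descent. This is a generic stand-in rather than the model described above: the function names, the toy frequency scores, the rank and the learning-rate settings are all assumptions for illustration.

```python
import random


def factorize(observed, n_drugs, n_effects, rank=2, steps=2000,
              lr=0.01, reg=0.01, seed=0):
    """Fit observed[(i, j)] ~ dot(W[i], H[j]) using only the observed entries."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_drugs)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_effects)]
    for _ in range(steps):
        for (i, j), value in observed.items():
            error = value - predict(W, H, i, j)
            for k in range(rank):
                w, h = W[i][k], H[j][k]
                W[i][k] += lr * (error * h - reg * w)
                H[j][k] += lr * (error * w - reg * h)
    return W, H


def predict(W, H, i, j):
    """Predicted score for drug i and side effect j from their signatures."""
    return sum(w * h for w, h in zip(W[i], H[j]))


def rmse(observed, W, H):
    """Root mean squared error over the observed entries."""
    squared = [(value - predict(W, H, i, j)) ** 2
               for (i, j), value in observed.items()]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical frequency scores (1 = very rare ... 5 = very frequent);
# missing (drug, side effect) pairs are simply absent from the dictionary.
observed = {
    (0, 0): 5, (0, 1): 4, (0, 2): 1,
    (1, 0): 4, (1, 1): 5, (1, 3): 1,
    (2, 0): 1, (2, 2): 5, (2, 3): 4,
    (3, 1): 1, (3, 2): 4, (3, 3): 5,
}

W0, H0 = factorize(observed, 4, 4, steps=0)  # random initial signatures
W, H = factorize(observed, 4, 4)             # trained signatures
error_before = rmse(observed, W0, H0)
error_after = rmse(observed, W, H)
```

After training, `predict(W, H, 0, 3)` gives a score for a pair that was never observed, and the rows of `W` and `H` play the role of drug and side-effect signatures in this simplified setting.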

Moreover, we have recently begun to develop similar matrix completion approaches for tackling the problem of prediction of drug-target interactions, an important step in the drug discovery and repositioning process. Our final aim is to extend the druggable genome and preliminary results show that our models can be very effective at predicting drug-target associations involving previously unknown targets.

In the context of computational pharmacology, we have developed pipelines for repositioning drug combinations for neglected tropical diseases. Currently we are focusing on Chagas disease (among the top 10 target diseases of the Gates Foundation), which is caused by the protozoan parasite Trypanosoma cruzi (T. cruzi) and is endemic throughout Latin America. About 6 to 7 million people are infected with T. cruzi, around 40 million are at risk of infection, and no drug is effective against the parasite in the chronic phase of the disease. This project aims at identifying drug combinations with antitrypanosomal effects. We developed a method that exploits concepts from comparative genomics for the prediction of FDA-approved drugs which could be effective against T. cruzi. Our method selects FDA-approved drugs which target enzymes in model organisms that are evolutionarily related to enzymes in T. cruzi pathways and could therefore be effective at disrupting its metabolic pathways. We have identified 384 FDA-approved drugs, and the drug combinations are selected following a multi-objective optimisation approach. These are being tested in vitro by Dr Celeste Vega at CEDIC, our collaborating lab in Asuncion, Paraguay.

In our lab, research in Network Medicine has been funded by the BBSRC (grants BB/K004131/1, BB/F00964X/1 and BB/M025047/1).


3D Genome Analysis

The different types of cells in the human body have an identical 1D genome, i.e., a linear sequence of nucleotides, yet their genomes have different underlying 3D architectures. Their different ways of packing DNA molecules into cell nuclei lead to different arrangements of genomic elements in 3D, and such layouts play a central role in gene regulation and cell fate determination. In fact, transcriptional regulation is mediated by the interaction between promoters and distal enhancers which are spatially close together via the formation of loops or domains.

Over the last decade, genome-wide ligation-based assays such as Hi-C have provided an unprecedented opportunity to investigate the 3D organization of the genome. Results of a typical Hi-C experiment are summarized by a chromosomal contact map, a matrix whose elements reflect the population-averaged co-location frequencies of genomic loci, which can be viewed as a measurement of the spatial proximity between genomic loci.

We realized that there are two different components contributing to the overall contact frequency observed between a pair of genes in the contact map. The first component is related to their genomic distance, i.e., the distance between genes due to the fact that they are positioned sequentially on the 1D DNA strand. The second component depends on cell-specific arrangements of the genes in 3D. Since all human cells have an identical 1D genome, it is the second component that has a role in gene regulation.

We developed a network-based framework that effectively extracts the 3D component of the gene proximity signal. We show that this component can be used for in-depth analysis of the interplay between the spatial positioning of genes and their regulation in different human cells, and that this interplay is consistently easier to detect and quantify than when using the contact frequency obtained directly from the Hi-C data. In other words, our procedure can be thought of as a de-noising procedure that is able to extract the 3D component of the signal from the mixture of 1D and 3D signal components that constitutes the experimental Hi-C data.
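The separation of the two components can be illustrated with a much simpler and widely used correction, in which each entry of the contact map is divided by the average contact frequency observed at the same genomic separation. This observed-over-expected ratio is only a stand-in to show the idea and is not the network-based framework described above; the function name and the toy contact map are assumptions.

```python
def observed_over_expected(contacts):
    """Divide each contact by the mean contact at the same genomic separation.

    `contacts` is a square, symmetric matrix whose rows and columns are
    loci ordered along the chromosome. Values above 1 indicate pairs that
    are in contact more often than their 1D distance alone would suggest.
    """
    n = len(contacts)
    # Expected contact frequency for each separation d = |i - j|,
    # estimated as the mean of the d-th diagonal.
    expected = []
    for d in range(n):
        diagonal = [contacts[i][i + d] for i in range(n - d)]
        expected.append(sum(diagonal) / len(diagonal))
    return [
        [
            contacts[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy contact map for four consecutive loci (hypothetical counts).
contacts = [
    [100, 40, 30, 5],
    [40, 100, 40, 10],
    [30, 40, 100, 40],
    [5, 10, 40, 100],
]

ratio = observed_over_expected(contacts)
```

In this toy map, loci 0 and 2 are in contact more often than loci 1 and 3 even though both pairs are the same distance apart on the 1D sequence, so their ratios fall above and below 1 respectively, while adjacent loci all have a ratio of 1.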