Inference and analysis of large-scale protein-protein interaction networks

Proteins carry out their molecular functions by interacting with other molecules, mainly other proteins. For this reason, understanding protein interactions is an important step toward understanding protein function and cell behaviour. Systematically mapping the set of all protein-protein interactions within an organism – the interactome – has therefore become a major challenge in post-genomic biology. Recent developments in experimental procedures (e.g. co-affinity purification followed by mass spectrometry, AP-MS) have resulted in the publication of many high-quality protein-protein interaction datasets for different organisms ranging from the yeast Saccharomyces cerevisiae to Homo sapiens.

An interactome has a natural representation as an undirected graph, often called protein-protein interaction (PPI) network, where nodes represent proteins and edges represent interactions between pairs of proteins. Often an estimation of the reliability of such interactions is available and is included as edge labels (weights). Interactomes have a modular structure, meaning that there are sets of proteins that interact with each other more frequently than with the rest of the network. These densely connected regions are typically interpreted as protein complexes, and their identification is crucial to deepen our understanding of cellular processes. The problem of identifying protein complexes from PPI data is then equivalent to detecting dense possibly overlapping regions containing many connections in PPI networks (or regions with large weights in weighted networks).

We developed ClusterONE (Clustering with Overlapping Neighborhood Expansion), an overlapping clustering algorithm for protein interaction datasets, capable of detecting potentially overlapping protein complexes from weighted protein-protein interaction data. Since its publication, ClusterONE has become one of the most used algorithms for the detection of protein complexes from protein interaction data. Moreover, it has been applied to different domains, e.g. for the detection of communities in social networks. The current version of the software that we provide can exploit multiple CPU cores and scales up to graphs containing millions of vertices and edges. It has now been downloaded over 20,000 times.

We developed a graph diffusion procedure that exploits the PPI network topology and protein co-localisation information in order to reduce the amount of noise in the experimentally inferred PPI network. The idea here is that if two proteins are not very well connected through their local network neighbourhood, but there exists a link between them in the PPI network, this constitutes a spurious edge and it is removed by our procedure. In collaborations with the labs of Andre Emili (Boston Un) and Edward Marcotte (Un Texas, Austin), we applied our algorithm together with  ClusterONE to a set of high-confidence human protein-protein interactions.

In our lab research on large scale PPI networks has been funded by the BBSRC (grant BB/F00964X/1) and the Royal Society (grant NF080750).