Protein Function Prediction

In recent years, numerous large-scale sequencing projects have generated enormous amounts of sequence data. This has led to the identification of thousands of previously unknown genes whose function remains to be characterised. A precise definition of protein function is difficult, because the meaning of the term “function” generally depends on the context under consideration. The dominant solution to this problem is the use of ontologies, which consist of terms in a controlled vocabulary organised in a hierarchical structure through a set of well-defined relationships.

Standard ontologies usually have a structure that can be modelled by a rooted, directed tree or, more generally, by a directed acyclic graph, as in the Gene Ontology (GO), which is becoming the standard. Even with function defined through ontologies, about a third of the proteins in the best-characterised model organisms have no known function. A fundamental goal is therefore to identify the function of uncharacterised genes on a genomic scale. Functional assays are difficult to design for uncharacterised genes, so a major challenge in bioinformatics is to devise algorithmic methods that, given a gene, generate a hypothesis about its function that can then be validated experimentally.
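To make the graph structure concrete, here is a minimal sketch of an ontology stored as a directed acyclic graph, in which annotating a protein with a term implies annotation with every ancestor of that term. The term names and the helpers `ancestors` and `propagate_annotations` are hypothetical and chosen for illustration only; they are not taken from GO or from any particular library.

```python
from typing import Dict, List, Set

# Toy ontology (hypothetical term names): each term maps to its direct parents.
# A term may have more than one parent, which makes this a DAG rather than a tree.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "nucleotide_binding": ["binding"],
    "kinase_activity": ["catalytic_activity"],
    "atp_dependent_kinase": ["kinase_activity", "nucleotide_binding"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate_annotations(terms: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Close a set of annotations upwards so that it includes all ancestors."""
    closed = set(terms)
    for term in terms:
        closed |= ancestors(term, parents)
    return closed


if __name__ == "__main__":
    print(sorted(propagate_annotations({"atp_dependent_kinase"}, PARENTS)))
```

Because a term can have several parents, the traversal keeps a `seen` set so that shared ancestors are visited only once.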

For over 10 years our lab has been at the forefront of international research on methods for protein function prediction. We have framed this problem as a multi-class, multi-label classification problem, and to solve it we have developed new machine learning methods based on the diffusion of information over large weighted graphs inferred from evolutionary information. The results obtained in the last two published editions of the triennial CAFA competition (Critical Assessment of Functional Annotation) placed our system among the best-performing systems in the world.
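As an illustration of what diffusing label information over a weighted graph can look like, the sketch below uses a generic, textbook-style label spreading iteration on a symmetrically normalised weight matrix. It is not our published algorithm; the function name `diffuse_labels`, the parameters `alpha` and `iterations`, and the toy four-protein graph are all assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def diffuse_labels(weights: Matrix, labels: Matrix,
                   alpha: float = 0.5, iterations: int = 100) -> Matrix:
    """Generic label spreading: F <- alpha * S * F + (1 - alpha) * Y.

    weights: symmetric n x n matrix of non-negative edge weights between proteins.
    labels:  n x t matrix, labels[i][k] = 1.0 if protein i is annotated with term k.
    Returns an n x t matrix of scores, one column per term (multi-label).
    """
    n = len(weights)
    n_terms = len(labels[0])
    degree = [sum(row) for row in weights]
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2); isolated nodes get zero rows.
    scale = [d ** -0.5 if d > 0 else 0.0 for d in degree]
    s = [[weights[i][j] * scale[i] * scale[j] for j in range(n)] for i in range(n)]

    scores = [row[:] for row in labels]
    for _ in range(iterations):
        scores = [
            [
                alpha * sum(s[i][j] * scores[j][k] for j in range(n))
                + (1.0 - alpha) * labels[i][k]
                for k in range(n_terms)
            ]
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    # Toy path graph 0 - 1 - 2 - 3; protein 0 has term A, protein 3 has term B.
    W = [[0.0, 1.0, 0.0, 0.0],
         [1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0],
         [0.0, 0.0, 1.0, 0.0]]
    Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
    for protein, row in enumerate(diffuse_labels(W, Y)):
        print(protein, [round(value, 3) for value in row])
```

Each ontology term is diffused independently as its own column, which is what makes the setting multi-label: an unannotated protein receives a score for every term, with higher scores for terms carried by nearby, strongly connected proteins.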

Recently, we introduced S2F (Sequence to Function), a network propagation approach for the functional annotation of newly sequenced organisms. Our main idea is to systematically transfer functionally relevant data from model organisms to newly sequenced ones, thus allowing us to use a label propagation approach. S2F introduces a novel label diffusion algorithm that can account for the presence of overlapping communities of proteins with related functions. Currently, S2F is considered to be the best method for predicting protein function in bacteria.

We have also proposed more accurate measures of GO semantic similarity.

Our lab's research on protein function prediction has been funded by the BBSRC (grant BB/F00964X/1).