Additional information, benchmarking scripts and FAQs

Authors: Tamas Nepusz, Haiyuan Yu, Alberto Paccanaro
Date: 15 September 2012
Revision: 1

In this document we provide additional details regarding the ClusterONE algorithm and the benchmarks described in [1]. We also provide the source code for calculating the quality measures we have used in the manuscript. Together with the already released datasets and gold standards, this should make it possible to replicate our results simply by running the downloaded benchmark script in an appropriately prepared software environment.

We plan to maintain this document and extend it in the future with frequently asked questions and answers regarding ClusterONE.

In the following sections, we repeat and clarify the construction of the gold standard datasets and of the input networks, and describe our benchmarking procedure. Next, we describe the software requirements of our benchmark scripts and provide usage instructions for the calculation of various quality measures using our implementation.

Contents

  • Construction of the gold standards
  • Construction of the input networks
  • Benchmarking a network on a given gold standard
  • Reproducing our results using the provided scripts
  • Frequently asked questions
  • References

Construction of the gold standards

In this section, we describe how the gold standards were constructed. The data files can be downloaded from the following URL:

http://membrane.cs.rhul.ac.uk/static/cl1/cl1_gold_standard.zip

The MIPS gold standard

The MIPS gold standard we have used was constructed from the most recent version of the MIPS catalog of protein complexes that was available when we wrote the manuscript. This version was dated 18 May 2006 and was downloaded from ftp://ftpmips.gsf.de/yeast/catalogues/complexcat/.

The MIPS catalog is organized hierarchically, which means that complexes in the catalog may consist of subcomplexes, nested up to five hierarchy levels deep. One example of a deeply embedded complex is the one with the identifier 510.190.10.20.10: it represents the SAGA complex and is found at the fifth hierarchy level of the catalog. However, there is no clear separation in MIPS between complexes and groups of related complexes; for instance, the category with the identifier 510.180 corresponds to all "DNA-repair complexes". To avoid any selection bias, we decided not to classify the MIPS categories manually into "complexes" and "complex groups"; instead, we simply set lower and upper thresholds on the complex size and considered every MIPS category with at least 3 and at most 100 members as a valid protein complex. The only exception was MIPS category 550 and its descendants, which we excluded because these categories correspond to unconfirmed protein complexes that were predicted by computational methods. Including these putative complexes would clearly have biased the benchmarks towards algorithms that work similarly to the ones that were used to create category 550.
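
The size-based filtering described above can be summarized in a short Python sketch. The data structure and function name below are assumptions made for this illustration and do not correspond to the exact code we used:

# Illustrative sketch only. `categories` is assumed to map MIPS category
# identifiers (e.g. "510.190.10.20.10") to lists of member proteins.
def filter_mips_categories(categories):
    complexes = {}
    for cat_id, members in categories.items():
        # skip category 550 and its descendants (computationally
        # predicted, unconfirmed complexes)
        if cat_id == "550" or cat_id.startswith("550."):
            continue
        # keep only categories with at least 3 and at most 100 members
        if 3 <= len(members) <= 100:
            complexes[cat_id] = set(members)
    return complexes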

The SGD gold standard

The SGD gold standard was created out of the need for a gold standard that is more recent than the MIPS one (which was dated 18 May 2006). These complexes were derived from the Gene Ontology (GO) annotations in the Saccharomyces Genome Database (SGD), downloaded on 11 Aug 2010. The rules were as follows. First, we downloaded the set of annotation terms in the Gene Ontology that were valid on 11 Aug 2010, together with their relationships. Second, we ran an inference engine on the Gene Ontology that provided us with all the terms of the ontology that are descendants of the term GO:0043234 (protein complex) using "is_a" relations only. These GO terms were then treated as protein complex annotations, and we collected all the proteins from the SGD that were annotated with these terms, provided that the annotation was supported by at least one evidence code other than IEA (Inferred from Electronic Annotation), which is used in the Gene Ontology to denote relationships that were determined by a computational algorithm and have not yet been approved by a human curator. Proteins sharing the same Gene Ontology term were then treated as being part of the same complex (where the complex is represented by the Gene Ontology term). Annotations with qualifiers such as not or colocalizes_with were ignored.

The inference engine we have used is available in a forked version of Biopython at http://github.com/ntamas/biopython; the engine itself is contained in the Bio.GO.inference module.
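
The essence of the inference step can also be illustrated without Biopython. In the sketch below, the data structures (is_a_children, annotations) and function names are hypothetical and serve only to show the idea of collecting the "is_a" descendants of GO:0043234 and grouping the annotated proteins by term:

# Illustrative sketch only. `is_a_children` maps a GO term to its direct
# "is_a" children; `annotations` is an iterable of
# (protein, term, evidence_code, qualifier) tuples.
def protein_complex_terms(is_a_children, root="GO:0043234"):
    terms, queue = set([root]), [root]
    while queue:
        term = queue.pop()
        for child in is_a_children.get(term, ()):
            if child not in terms:
                terms.add(child)
                queue.append(child)
    return terms

def complexes_from_annotations(annotations, terms):
    complexes = {}
    for protein, term, evidence, qualifier in annotations:
        # ignore IEA evidence and annotations carrying qualifiers such as
        # "not" or "colocalizes_with"
        if term not in terms or evidence == "IEA" or qualifier:
            continue
        complexes.setdefault(term, set()).add(protein)
    return complexes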

Construction of the input networks

In this section, we describe how the input networks were constructed. The data files can be downloaded from the following URL:

http://membrane.cs.rhul.ac.uk/static/cl1/cl1_datasets.zip

All the input networks we have used were stripped of self-interactions and isolated proteins.

Gavin et al, 2006

In the paper of Gavin et al [2], the weights of the interactions were defined using the so-called socio-affinity index, a measure that is based on the log-odds of the number of times two proteins were observed together in a purification, relative to the expected frequency of such a co-occurrence based on the number of times the proteins appeared in purifications. In other words, pairs of proteins with high socio-affinity indices were seen together in a purification more frequently than what one would have expected by random chance. More details are to be found in Section 4.1 of the Supplementary Discussion of our manuscript [1] or in the paper of Gavin et al [2].

Gavin et al argued that "generally, pairs with socio-affinity indices below 5 should be considered with caution". Therefore, our dataset was constructed from the socio-affinity indices published by Gavin et al [2], removing interactions that had a socio-affinity index less than or equal to 5. The socio-affinity indices were then divided by the maximum socio-affinity index encountered in the dataset to obtain the weights of the interactions.
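
As an illustration, the thresholding and rescaling described above amount to the following short sketch (the variable and function names are assumptions made for the example):

# Illustrative sketch only. `edges` is a list of (protein1, protein2,
# socio_affinity) tuples.
def gavin_weights(edges, threshold=5.0):
    # drop interactions with a socio-affinity index of at most 5
    kept = [(u, v, sa) for (u, v, sa) in edges if sa > threshold]
    # rescale the remaining indices by the maximum to obtain edge weights
    max_sa = max(sa for (_, _, sa) in kept)
    return [(u, v, sa / float(max_sa)) for (u, v, sa) in kept]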

Krogan et al, 2006

Krogan et al [3] have used MALDI-TOF mass spectrometry and LC-MS/MS to identify protein-protein interactions, based on the observation that either mass spectrometry method alone often fails to identify a protein, and that using two independent methods can increase the coverage and confidence of the obtained interactome. The results of the two techniques were combined by supervised machine learning methods with two rounds of learning, using hand-curated protein complexes in the MIPS database as a gold standard dataset. More details are to be found in Section 4.2 of the Supplementary Discussion of our manuscript [1] or in the paper of Krogan et al [3].

Krogan et al have provided two versions of their results using different thresholds on the output of the final classifier in their pipeline. The core dataset includes all interactions with posterior probability higher than 0.273, while the extended dataset includes all interactions with posterior probability higher than 0.101. The posterior probabilities were used as weights in our networks without any further modifications (apart from the thresholds already applied by Krogan et al [3]).
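
The relation between the two variants can be sketched as follows (the names are assumptions made for the example; the thresholds are the ones quoted above):

# Illustrative sketch only. `edges` is a list of (protein1, protein2,
# posterior_probability) tuples; the posteriors are used directly as weights.
def krogan_network(edges, core=True):
    threshold = 0.273 if core else 0.101  # core vs. extended dataset
    return [(u, v, p) for (u, v, p) in edges if p > threshold]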

Collins et al, 2007

Collins et al [4] have combined the experimentally derived PPI networks of Krogan et al [3] and Gavin et al [2] by re-analyzing the raw primary affinity purification data using a novel scoring technique called purification enrichment (PE). The PE scores were motivated by the probabilistic socio-affinity scoring framework of Gavin et al [2] but also take into account negative evidence (i.e. pairs of proteins where one of them does not appear as a prey when the other one is used as a bait). More details are to be found in Section 4.3 of the Supplementary Discussion of our manuscript [1] or in the paper of Collins et al [4].

Although the PE scores are not constrained to the [0, 1] interval, Collins et al [4] have already remapped them into this range. They have set scores below 0.05 to zero for computational efficiency, and then established a set of reliable interactions by taking the 9074 interactions with the highest PE scores. The number of interactions to keep was selected based on the true positive to true negative rate evaluated on the MIPS small-scale experiments. We have used these 9074 interactions along with the remapped scores provided by Collins et al [4] without further modifications.

BioGRID

The BioGRID dataset was the only unweighted dataset we have considered. It was downloaded from BioGRID version 3.1.77 and contained all physical interactions that involve yeast proteins only. Weights were ignored even for those interactions that had associated confidence scores, for two reasons:

  1. Only 18.05% of all the interactions contained confidence scores and it was unclear how to specify weights for the remaining 81.95%.
  2. The confidence scores were determined not by a single, consistent algorithm but were provided by the research groups that also provided the interaction data. Since interactions from different research groups were determined by different techniques, their confidence scores are not directly comparable to each other.

Benchmarking a network on a given gold standard

The set of proteins in a gold standard is almost always different from the set of proteins in the network being benchmarked. Proteins can thus be divided into three groups:

  1. Proteins appearing in both the input network and in the gold standard
  2. Proteins appearing in the input network but not in the gold standard
  3. Proteins appearing in the gold standard but not in the input network

For the proteins in groups 1 and 2, there is no need for any intervention. In the former case, the protein has at least one known interaction with other proteins and it participates in at least one complex in the gold standard; therefore it is theoretically possible to detect the complexes the protein participates in from the input dataset. In the latter case, the protein has known interactions but does not belong to any of the known complexes, which may pose a problem for traditional graph clustering methods that put every protein into exactly one cluster (since they are forced to put proteins without complexes into singleton clusters), but it is not a problem for an overlapping clustering algorithm that allows outliers (like ClusterONE).

However, in the third case, we know that the protein participates in at least one complex, but we do not know any interactions that the protein is involved in; hence it is impossible for any clustering algorithm that uses topological data only to infer the correct membership of this protein or to identify the corresponding complex accurately. On the other hand, if the complex is large enough, we should not throw it away only because a few of its constituent proteins are missing from the input data, since it is quite conceivable that the rest of the complex is easily identifiable from the network itself.

Based on the considerations outlined above, we performed an additional on-the-fly filtering on the gold standard datasets when they were used to benchmark a particular network. The procedure we used was as follows (a short Python sketch of the same filtering is given after the list):

  1. We have read the input network and collected the identifiers of those proteins that had at least one known interaction with other proteins.
  2. We have considered each complex from the gold standard and calculated the intersection of the complex with the set of known proteins as determined in the previous step.
  3. If the size of the intersection was less than half the size of the complex in the gold standard, we removed the complex from the current benchmark run, because we deemed that too many of its proteins were missing from the input dataset and even a partial match to the rest of the complex would more likely be the result of pure chance than a correct inference made by a clustering algorithm.
  4. If the size of the intersection was larger than or equal to half the size of the complex, we used the complex from the gold standard but removed those proteins from it that were not present in the input dataset. Since we were modifying the gold standard here and not the input network, this modification does not give any of the algorithms an advantage, as all the algorithms are still assessed on the same gold standard (but without those proteins whose membership could not be inferred anyway).
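
A minimal Python sketch of this filtering, with hypothetical names, is as follows:

# Illustrative sketch only. `complexes` is an iterable of sets of protein
# names from the gold standard; `network_proteins` is the set of proteins
# that have at least one interaction in the input network.
def filter_gold_standard(complexes, network_proteins):
    filtered = []
    for members in complexes:
        common = members & network_proteins
        # keep the complex only if at least half of its proteins are
        # present in the network, and restrict it to those proteins
        if 2 * len(common) >= len(members):
            filtered.append(common)
    return filtered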

Reproducing our results using the provided scripts

In this section, we provide instructions for downloading and running the quality score calculation scripts in order to obtain the same quality scores for ClusterONE as the ones we have reported in the manuscript. The instructions are meant for Linux or Mac OS X; if you are running Windows, you can still run our quality score calculation script from the command line but the master script that processes all the input files in a single batch will not work for you since it requires a shell typically found on Linux or Mac OS X only.

Software requirements

You will need to install the following software on your machine in order to run the scripts:

  • Linux or Mac OS X if you wish to use the master script that runs ClusterONE on all the input files and calculates the quality scores. Windows is also suitable, but you will have to run the quality score calculation manually.
  • Java Version 5 or later. This can be obtained from http://www.java.com/getjava/. You will need the Java Runtime Environment, not only the browser plugin, because ClusterONE runs outside a web browser.
  • Python 2.7 or later (but not Python 3.x). At the time of writing, the latest suitable version is Python 2.7.3.

Obtaining the scripts

We have archived the input files, the gold standards, an implementation of ClusterONE (version 0.93, the version we used for the manuscript), and a script for the calculation of quality scores in a single self-contained ZIP file that can be downloaded from here:

http://membrane.cs.rhul.ac.uk/static/cl1/cl1_reproducibility.zip

Extract the contents of the ZIP file into an arbitrary directory.

Running the scripts

The downloaded ZIP file contains a single master script called run_all.sh. Running run_all.sh from a POSIX-compatible shell such as bash runs ClusterONE on all the datasets in the datasets/ folder and compares the detected complexes to all the gold standards in the gold_standard/ folder. The quality scores are printed to the standard output using the following abbreviations:

  • frac is the fraction of complexes in the gold standard that are matched by at least one detected complex with a match score larger than 0.25 (see the sketch after this list).
  • cws is the clustering-wise sensitivity.
  • ppv is the positive predictive value.
  • acc is the geometric accuracy.
  • mmr is the maximum matching ratio that we have proposed.
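
To clarify what frac measures, the following is a minimal sketch that assumes the commonly used overlap score (the squared size of the intersection of two complexes divided by the product of their sizes) as the match score; it is an illustration only and not a substitute for scripts/match_standalone.py:

# Minimal sketch, assuming the overlap score
# w(A, B) = |A & B|^2 / (|A| * |B|) as the match score.
def overlap_score(a, b):
    common = len(a & b)
    return common * common / float(len(a) * len(b))

def matched_fraction(reference, predicted, threshold=0.25):
    # fraction of reference complexes matched by at least one predicted
    # complex with an overlap score larger than the threshold
    matched = [ref for ref in reference
               if any(overlap_score(ref, pred) > threshold
                      for pred in predicted)]
    return len(matched) / float(len(reference))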

The master script cannot be run directly in Windows since Windows does not contain a POSIX-compatible shell. (It is possible to install one from Cygwin or MSYS, but this is outside the scope of the present document). However, the core of the benchmark procedure is the script that compares the set of predicted complexes with the gold standard, and this script can also be used from Windows.

The quality score calculation script resides in scripts/match_standalone.py and it is written in the Python language. It can be executed from the command line as follows:

python scripts/match_standalone.py

The above command assumes that the Python interpreter (python or python.exe on Windows) is in the system path; otherwise the full path must be provided. Running the command prints a short usage message. The script itself requires two mandatory arguments: the file containing the reference complexes (i.e. the gold standard) and the file containing the predicted complexes. For instance, the results of ClusterONE on the Collins dataset can be compared with the MIPS gold standard as follows:

java -jar impl/cluster_one-0.93.jar datasets/collins2007.txt >result.txt
python scripts/match_standalone.py gold_standard/mips_3_100.txt result.txt

However, note that the script does not "know" the input dataset by default, hence it cannot filter the gold standard according to the procedure we have outlined above. Therefore, in order to obtain the exact same results that we have published, one must also pass the input network to the quality score calculation script using the optional -n switch:

python scripts/match_standalone.py -n datasets/collins2007.txt gold_standard/mips_3_100.txt result.txt

This should produce the quality scores from Supplementary Table 7 of our manuscript.

Frequently asked questions

Is ClusterONE deterministic?

It would be quite hard to reproduce our results exactly if ClusterONE were a stochastic algorithm. We have tried to provide an implementation of ClusterONE that is as deterministic as possible, and our experience is that it can generally be considered deterministic on weighted networks, because there is a standard set of rules that the algorithm follows in order to break ties during the course of the computation. The only exception concerns unweighted networks.

In unweighted networks, ties occur very frequently in the seed selection phase, i.e. when ClusterONE must decide which protein to grow a complex from. ClusterONE selects the protein with the largest (weighted) degree that has not been included in any of the previously found complexes yet. In the absence of weights, it can happen that there are multiple proteins with the same degree that are not included in complexes. Due to internal programming decisions we have made, this choice was random in ClusterONE 0.93 and it still is (although this will probably be fixed in a future release). The result is a slight variation in the final complex set; the variation typically occurs only between results obtained on different machines. Since such ties do not occur in weighted networks, the results for the weighted datasets are practically deterministic.
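
To illustrate the role of ties, the sketch below shows a seed selection rule in the spirit of the one described above, extended with an explicit deterministic tie-breaking step (sorting candidates by name). This is an illustration of the concept only; ClusterONE 0.93 does not break ties this way.

# Illustrative sketch only. `degrees` maps protein names to their
# (weighted) degrees; `already_clustered` is the set of proteins that are
# already part of a previously found complex.
def next_seed(degrees, already_clustered):
    candidates = [(deg, name) for name, deg in degrees.items()
                  if name not in already_clustered]
    if not candidates:
        return None
    # highest degree first; among proteins with equal degrees, the
    # alphabetically smallest name wins, making the choice reproducible
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return candidates[0][1]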

References

[1] Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 9:471-472, 2012.
[2] Gavin A et al: Proteome survey reveals modularity of the yeast cell machinery. Nature 440:631-636, 2006.
[3] Krogan N et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440:637-643, 2006.
[4] Collins SR et al: Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics 6:439-450, 2007.