Author: | Tamás Nepusz, Haiyuan Yu, Alberto Paccanaro |
---|---|
Contact: | tamas@cs.rhul.ac.uk |
This is the documentation of the command line interface of ClusterONE, created by Tamás Nepusz, Haiyuan Yu and Alberto Paccanaro.
If you use results calculated by ClusterONE in a publication, please cite one of the suggested references.
The command line interface of ClusterONE is distributed in a Java archive file (JAR). Its name is likely to be something like cluster_one-X.Y.jar, where X.Y is the version number. In the document, we will simply use the name without the version number.
In the rest of this document, a dollar ($) sign at the start of a line in the examples represents the shell prompt of the operating system. There is no need to type it.
The easiest use-case is to run ClusterONE on an input file containing id1-id2-weight triplets. Assuming that you have the Java interpreter on the path of your operating system, this is as simple as:
$ java -jar cluster_one.jar input_file.txt
ClusterONE also works on SIF files (Standard Interaction Format). It ignores the interaction types and assumes each interaction to have a weight of 1.0:
$ java -jar cluster_one.jar input_file.sif
In most cases, the only parameters of the algorithm you wish to tweak is the size and density thresholds of the detected complexes (3 and 0.3 by default):
$ java -jar cluster_one.jar input_file.txt -s 4 -d 0.4
The above command line would detect complexes with at least four proteins and a minimum density of 0.4.
You can also supply the input file on the standard input by specifying - as the input file name:
$ java -jar cluster_one.jar -
When you are not using the standard input for supplying the input dataset, you can use it to detect a protein complex around a pre-defined set of seed nodes by setting the seeding method to stdin:
$ java -jar cluster_one.jar input_data.txt -S stdin
Alternatively, you can specify a single seed on the command line using the single seeding method:
$ java -jar cluster_one.jar input_data.txt -S 'single(node1,node2,node3)'
Note that it is usually necessary to enclose arguments containing parentheses in single quotes if your shell would otherwise try to interpret the parentheses on its own.
For more details about the command line arguments, see Invocation.
ClusterONE strives to discover densely connected and possibly overlapping regions within the Cytoscape network you are working with. The interpretation of these regions depends on the context (i.e. what the network represents) and it is left up to you. For instance, in protein-protein interaction networks derived from high-throughput AP-MS experiments, these dense regions usually correspond to protein complexes or fractions of them. ClusterONE works by "growing" dense regions out of small seeds (typically one or two vertices), driven by a quality function called cohesiveness.
Before we move on to the formal definition of cohesiveness, let us introduce some terminology that classifies vertices and edges of a graph G according to their relationship to a selected group of vertices V0. Vertices of V0 are called internal vertices, while vertices not in V0 are called external vertices. An edge that is situated between two internal vertices is an internal edge, an edge going between an internal and an external vertex is a boundary edge, and an edge between two external vertices is an external edge. An internal vertex incident on at least one boundary edge is an internal boundary vertex, while an external vertex incident on at least one boundary edge is an external boundary vertex. The following figure illustrates these concepts:
Here, V0 itself is denoted by a shaded background, which delimits internal and external vertices. Thick black edges are internal, thin black edges are boundary edges, while thin gray dashed edges are completely external. Vertices marked by a letter are (internal or external) boundary vertices.
The quality of the group can be assessed by the number of internal edges divided by the sum of the number of internal and boundary edges. This quality measure is driven by the fact that a well-defined group should have many internal edges and only a few boundary edges; in other words, the boundary of the group should be sharp. If the edges have weights (i.e. a numeric value assigned to each edge that quantifies how reliable that edge is or how confident we are in its existence), the same guidelines apply, but the number of edges should be replaced by the total confidence associated to those edges. Whenever you have such confidence values, store them in a numeric edge attribute in Cytoscape and use that attribute to drive the cluster growth process. From now on, such confidence values will simply be called edge weights and the above mentioned quality measure will be referred to as cohesiveness.
ClusterONE essentially looks for groups of high cohesiveness. This is achieved by adopting a greedy strategy: starting from a single seed vertex (or a small set of vertices that are strongly bound together), one can extend the group step by step with new vertices so that the newly added vertex always increases the cohesiveness of a group as much as possible. Removals are also allowed if removing a vertex from the group increases its cohesiveness. The process stops when it is not possible to increase the cohesiveness of the group by adding another external boundary vertex or removing an internal boundary vertex. See the ClusterONE paper [1] for the description of the exact procedure. The growth process is repeated either for every vertex or every connected vertex pair to obtain an initial set of cohesive subgroups. Subgroups smaller than a given size or having a density less than a given threshold are thrown away. Finally, redundant cohesive subgroups (i.e. those that overlap significantly with each other) are merged to form larger subgroups to make the results easier to interpret.
ClusterONE is distributed as a Java archive (JAR) file. Assuming that you have already installed Java and the Java executable (java on Linux and Mac OS X, java.exe on Windows) is already on the system path, you can start ClusterONE as follows:
$ java -jar cluster_one.jar [options] input_file
where options is a list of command-line options (see below) and input_file is the name of the input file to be processed. The order of the command line options is irrelevant. The output will simply show the predicted protein complexes, by default, one per line. See Input file formats for the expected format of the input file, and Output file formats for alternative output formats if the default one is not suitable for you.
The following command line options are recognised:
-f, --input-format | |
specifies the format of the input file (sif or edge_list). Use this option only if ClusterONE failed to detect the format automatically. | |
-F, --output-format | |
specifies the format of the output file (plain, csv or genepro). | |
-h, --help | shows a general help message |
-d, --min-density | |
sets the minimum density of predicted complexes. auto means that the density threshold will be set automatically based on whether the graph is weighted or not, and if not, what its clustering coefficient is. Weighted graphs will have a default density threshold of 0.3, unweighted graphs will have a density threshold of 0.5, unless their global clustering coefficient is less than 0.1, in which case the density threshold is set to 0.6. | |
-s, --min-size | sets the minimum size of the predicted complexes |
-v, --version | shows the version information |
--debug | turns on debugging mode. The debugging mode prints more details about the progress of the cluster growth process. You should not turn this option on unless you think you have found a bug in ClusterONE and you would like to track it down. |
--fluff | fluffs the clusters as a post-processing step. This is not used in the published algorithm, but it may be useful for your specific problem. The idea is to check whether the external boundary nodes of each cluster connect to more than two third of the internal nodes; if so, such external boundary nodes are added to the cluster. Fluffing is applied before the size and density filters. |
--haircut | apply a haircut transformation as a post-processing step on the detected clusters. This is not used in the published algorithm either, but it may be useful for your specific problem. A haircut transformation removes dangling nodes from a cluster: if the total weight of connections from a node to the rest of the cluster is less than x times the average node weight in the cluster (where x is the argument of the switch), the node will be removed. The process is repeated iteratively until there are no more nodes to be removed. Haircut is applied before the size and density filters. |
--max-overlap | specifies the maximum allowed overlap between two clusters, as measured by the match coefficient, which takes the size of the overlap squared, divided by the product of the sizes of the two clusters being considered, as in the paper of Bader and Hogue [2] |
--merge-method | specifies the method to be used to merge highly overlapping complexes. The following values are accepted:
|
--similarity | specifies the similarity function to be used in the merging step. More precisely, this switch controls which scoring function is used to decide whether two complexes overlap significantly or not. The following values are accepted:
|
--no-fluff | don't fluff the clusters, this is the default. For more details about fluffing, see the --fluff switch above. |
--no-merge | don't merge highly overlapping clusters (in other words, skip the last merging phase). This is useful for debugging purposes only. |
--penalty | sets a penalty value for the inclusion of each node. When you set this option to x, ClusterONE will assume that each node has an extra boundary weight of x when it considers the addition of the node to a cluster (see [1] for more details). It can be used to model the possibility of uncharted connections for each node, so nodes with only a single weak connection to a cluster will not be added to the cluster as the penalty value will outweigh the benefits of adding the node. The default penalty value is 2. |
--seed-method | specifies the seed generation method to use. The following values are accepted:
|
The following input file formats are recognised:
When the extension of the input file is .sif, ClusterONE will automatically try to parse the file according to the SIF format of Cytoscape. Each line of the file must be according to the following format:
id1 type id2
where id1 and id2 are the IDs of the two interacting proteins and type is the interaction type (which will silently be ignored by ClusterONE). Each edge will have unit weight. The columns of the input file may be separated by spaces or tabs; however, make sure that you do not mix these separator characters.
This is the default file format assumed by ClusterONE unless the file extension suggests otherwise. Each line of the file has the following format:
id1 id2 weight
where id1 and id2 are the IDs of the interaction proteins and weight is the associated confidence value between 0 and 1. If the weight is omitted, it is considered to be equal to 1. Lines starting with hash marks (#) or percentage signs (%) are considered as comments and they are silently ignored.
If ClusterONE fails to recognise the input format of your file, feel free to specify it using the --input-format command line option.
The following output file formats are available:
If you use results calculated by ClusterONE in a publication, please cite the following reference:
[1] | (1, 2) Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes from protein-protein interaction networks. Nature Methods, Advance Online Publication, 2012. doi:10.1038/nmeth.1938 |
Some other papers that might be of interest (and were referenced earlier in this help file):
[2] | Bader GD, Hogue CWV: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2, 2003. doi:10.1186/1471-2105-4-2 |