Abstract
The previously fastest algorithms for computing the R* consensus tree of two given (rooted) phylogenetic trees with a leaf label set of cardinality n run in Θ(n 3) time (Bryant and Berry in Adv. Appl. Math. 27(4):705–732, 2001; Kannan et al. in SIAM J. Comput. 27(6):1695–1724, 1998). In this manuscript, we describe a new \(O(n^{2} \sqrt{\log n})\)-time algorithm to solve the problem. This is a significant improvement because the R* consensus tree is defined in terms of a set \(\mathcal {R}_{\mathit{maj}}\) which may contain Ω(n 3) elements, so any direct approach that explicitly constructs \(\mathcal {R}_{\mathit{maj}}\) requires Ω(n 3) time.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Phylogenetic trees are leaf-labeled trees commonly used to describe the evolutionary history of a set of objects such as biological species or languages [1, 9, 10, 16, 18]. Typically, in a phylogenetic tree, each leaf represents one of the objects being studied and the branching structure of the tree shows the assumed evolutionary relationships among the objects. Depending on the available data, it can be difficult to infer an accurate phylogenetic tree. For example, small changes in the input data may result in trees with very different structures. A consensus tree is a single phylogenetic tree which summarizes the branching information contained in an input collection of phylogenetic trees with identical leaf label sets. Consensus trees are useful when different data sets or different tree inference methods have produced a set of trees with the same leaf labels and slightly conflicting structures, yet a single tree is required to represent all of them [1, 3, 7, 9, 14, 18] (indeed, phylogenetic analyses often output alternative trees for the same set of species [1]). Also, by exclusion, a consensus tree indicates areas of conflict in the input trees [3]. Furthermore, consensus trees are sometimes used as a basis for new phylogenetic inferences [3].
Over the years, many types of consensus trees have been defined and studied in detail. For a survey, see, e.g., [3], Chap. 30 in [9], Chap. 3.2 in [18], or Chap. 8.4 in [19]. Different types of consensus trees use different criteria to resolve conflicts among the input trees, so their mathematical properties vary. Thus, the most suitable type of consensus tree to use in practice depends on the particular application. In this paper, we focus on the so-called R* consensus tree. One advantage of the R* consensus tree is that it provides a statistically consistent estimator of the species tree topology when combining a set of gene trees, as recently demonstrated by Degnan et al. [8]; moreover, R* consensus outperformed other methods such as majority-rule consensus in the study conducted by [8]. For the case of two input trees, it is known [3] that the R* consensus tree is equivalent to the RV-III tree introduced in [14].
The R* consensus tree is defined in Sect. 2 below.Footnote 1 The idea is: For any set of input phylogenetic trees on a fixed leaf label set L, there are certain small binary trees called rooted triplets (each containing exactly three leaf labels from L) that occur as embedded subtrees more frequently than others, and the R* consensus tree includes as many of these rooted triplets as possible and none of the others. In short, if the rooted triplet xy|z for any {x,y,z}⊆L is consistent with more input trees than each of the two rooted triplets xz|y and yz|x is, then xy|z belongs to a set named \(\mathcal {R}_{\mathit{maj}}\); the R* consensus tree is the phylogenetic tree τ having the largest possible number of internal nodes such that every rooted triplet consistent with τ belongs to \(\mathcal {R}_{\mathit{maj}}\).
The goal of this paper is to develop a fast algorithm for constructing the R* consensus tree for two input trees T 1 and T 2 with a leaf label set L of cardinality n. The previous algorithms for this problem require Θ(n 3) time [4, 14], which was believed to be optimal because when T 1 and T 2 have similar branching structures, \(|\mathcal {R}_{\mathit{maj}}| = \varOmega (n^{3})\). Our main result is that the time complexity of the problem is in fact subcubic. In order to obtain a subcubic-time algorithm, we have to avoid explicitly constructing the set \(\mathcal {R}_{\mathit{maj}}\). For this purpose, we compute the values of a function named \(s_{\mathcal {R}_{\mathit{maj}}}\) associated to \(\mathcal {R}_{\mathit{maj}}\) by using a novel formulation based on distances from leaves to lowest common ancestors of pairs of leaves, and then apply an algorithm by Bryant and Berry [4] to compute the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\). Next, we check each of the obtained Apresjan clusters to find all strong clusters of \(\mathcal {R}_{\mathit{maj}}\), which are subsequently used to build the R* consensus tree. In total, the running time of our new algorithm R*_consensus_tree is \(O(n^{2} \sqrt{\log n})\). (The preliminary conference version of this paper [13] described a slightly slower O(n 2logn)-time method.)
The paper is organized as follows. Section 2 introduces the basic definitions and terminology used throughout the paper and states some simple properties of clusters and the R* consensus tree. Next, our algorithm R*_consensus_tree is presented in Sect. 3. Two crucial steps of the algorithm (how to compute the values of the function \(s_{\mathcal {R}_{\mathit{maj}}}\) efficiently and how to determine if a cluster is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\)) are described in detail in Sects. 4 and 5. Finally, Sect. 6 discusses open problems for further research.
2 Preliminaries
2.1 Basic Definitions
A phylogenetic tree is a rooted, unordered, distinctly leaf-labeled tree in which every internal node has at least two children. From here on, “tree” means “phylogenetic tree”, and every leaf in a tree is identified with its label.
We shall use the following terminology and notation. For any tree T and any node u of T, the subtree of T rooted at u is denoted by T[u]. The set of all leaves in a tree T is written as Λ(T). To simplify the presentation, every node in a tree is considered to be an ancestor as well as a descendant of itself. For any nodes u,v in a tree, in case u is a descendant of v and u≠v then we call u a proper descendant of v. For any set A of nodes in a tree T, the lowest common ancestor of A in T is the node w such that: (1) every node in A is a descendant of w; and (2) w is a proper descendant of x for every other node x which is an ancestor of all nodes in A. The lowest common ancestor of any two nodes u and v in a tree T is denoted by lca T(u,v).
A triplet is a tree with three leaves. Any non-binary tree containing exactly three leaves {x,y,z} is called a fan triplet and is written as x|y|z. On the other hand, any binary tree with exactly three leaves {x,y,z} is called a rooted triplet, and is denoted by xy|z if the lowest common ancestor of x and y is a proper descendant of the lowest common ancestor of x and z. Note that there are precisely four different triplets for any set of three leaf labels {x,y,z}, namely x|y|z, xy|z, xz|y, and yz|x; see Fig. 1.
For any tree T and {x,y,z}⊆Λ(T), the fan triplet x|y|z is said to be consistent with T if lca T(x,y) = lca T(x,z) = lca T(y,z). Similarly, the rooted triplet xy|z is consistent with T if lca T(x,y) is a proper descendant of lca T(x,z) = lca T(y,z). Let T||{x,y,z} denote the unique triplet with leaf label set {x,y,z} which is consistent with T. Finally, for any tree T, let r(T) be the set of all rooted triplets which are consistent with T, i.e., define r(T)={T||{x,y,z}:{x,y,z}⊆Λ(T) and T||{x,y,z} is a rooted triplet}, and define t(T) as the set of all triplets (rooted triplets and fan triplets) consistent with T, i.e., t(T)={T||{x,y,z}:{x,y,z}⊆Λ(T)}. (See Fig. 2 for some examples.) It follows that |t(T)|=Θ(|Λ(T)|3) for any tree T, and |r(T)|=Θ(|Λ(T)|3) when T is a binary tree because |r(T)|=|t(T)| in this case.
2.2 Strong Clusters and Apresjan Clusters
Below, let \(\mathcal {R}\) be a given set of triplets over a leaf label set \(L = \bigcup_{r \in \mathcal {R}} \varLambda (r)\) such that for each {x,y,z}⊆L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to \(\mathcal {R}\). A cluster of L is any non-empty subset of L. We define two special types of clusters:
Definition 1
A cluster A of L is called a strong cluster of \(\mathcal {R}\) if \(aa'|x \in \mathcal {R}\) for all a,a′∈A and x∈L∖A. Furthermore, L as well as every singleton set of L is also defined to be a strong cluster of \(\mathcal {R}\).
Definition 2
For each a,b∈L with a≠b, define \(s_{\mathcal {R}}(a,b) = |\{y \in L \setminus \{a,b\} : ab|y \in \mathcal {R}\}|\), and for each a∈L, define \(s_{\mathcal {R}}(a,a) = | L | - 1\). A cluster A of L is called an Apresjan cluster of \(s_{\mathcal {R}}\) if \(s_{\mathcal {R}}(a,a') > s_{\mathcal {R}}(a,x)\) for all a,a′∈A and x∈L∖A.
Write n=|L|. By Theorem 2.3 and Corollary 2.1 of [4], the following holds:
Lemma 1
(Bryant and Berry [4])
-
1.
There are O(n) Apresjan clusters of \(s_{\mathcal {R}}\).
-
2.
Given the values of \(s_{\mathcal {R}}(a,b)\) for all a,b∈L, the Apresjan clusters of \(s_{\mathcal {R}}\) can be computed in O(n 2) time.
There is a connection between the strong clusters of \(\mathcal {R}\) and the Apresjan clusters of \(s_{\mathcal {R}}\):
Lemma 2
Every strong cluster of \(\mathcal {R}\) is an Apresjan cluster of \(s_{\mathcal {R}}\).
Proof
Let C be a strong cluster of \(\mathcal {R}\). Consider any fixed a,a′∈C and x∈L∖C. We need to show that \(s_{\mathcal {R}}(a,a') > s_{\mathcal {R}}(a,x)\).
First of all, for every y∈L∖C, we have \(aa'|y \in \mathcal {R}\) by the definition of a strong cluster, which gives \(s_{\mathcal {R}}(a,a') = |\{aa'|y : aa'|y \in \mathcal {R}\}| \geq |L \setminus C|\). In the same way, \(ab|x \in \mathcal {R}\) holds for every b∈C, which (along with the requirement that for each {x,y,z}⊆L, at most one of x|y|z, xy|z, xz|y, and yz|x belongs to \(\mathcal {R}\)) implies that \(ax|b \not\in \mathcal {R}\). Thus, \(s_{\mathcal {R}}(a,x) =|\{ax|y : ax|y \in \mathcal {R}\}| \leq |(L \setminus C) \setminus \{x\}| <|L \setminus C|\). Since \(s_{\mathcal {R}}(a,x) < s_{\mathcal {R}}(a,a')\), C is an Apresjan cluster of \(s_{\mathcal {R}}\) by definition. □
2.3 R* Consensus Trees
Let T 1 and T 2 be two given trees with Λ(T 1)=Λ(T 2)=L. For any {a,b,c}⊆L, define #ab|c as the number of trees T i for which ab|c∈r(T i ). The set of “majority rooted triplets” \(\mathcal {R}_{\mathit{maj}}\) is defined as {ab|c:a,b,c∈L and #ab|c>#ac|b,#bc|a}. An R* consensus tree of T 1 and T 2 is a tree τ with Λ(τ)=L which satisfies \(r(\tau) \subseteq \mathcal {R}_{\mathit{maj}}\) and which maximizes the number of internal nodes.Footnote 2
The next two lemmas describe some useful properties of the strong clusters of \(\mathcal {R}_{\mathit{maj}}\).
Lemma 3
Let T be a tree with Λ(T)=L and \(r(T) \subseteq \mathcal {R}_{\mathit{maj}}\). For any node u of T, Λ(T[u]) is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\).
Proof
If u is a leaf or the root of T then Λ(T[u]) is trivially a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). If u is an internal non-root node then for any two a,a′∈Λ(T[u]) and any \(x \not\in \varLambda (T[u])\), the triplet aa′|x belongs to \(\mathcal {R}_{\mathit{maj}}\), so Λ(T[u]) is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). □
Lemma 4
There exists a tree τ such that {Λ(τ[u]):u is a node in τ} equals the set of strong clusters of \(\mathcal {R}_{\mathit{maj}}\).
Proof
First observe that for any two strong clusters A and B of \(\mathcal {R}_{\mathit{maj}}\), either A⊆̷B, B⊆̷A, or A∩B=∅. To prove this claim, suppose for the sake of contradiction that there are x,y,z∈L such that x∈A∖B, y∈B∖A, and z∈A∩B. Since A is a strong cluster, \(xz|y \in \mathcal {R}_{\mathit{maj}}\). Since B is a strong cluster, \(yz|x \in \mathcal {R}_{\mathit{maj}}\). But by the definition of \(\mathcal {R}_{\mathit{maj}}\), we cannot have both of xz|y and yz|x in \(\mathcal {R}_{\mathit{maj}}\). The claim follows.
As a direct consequence, the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) form a nested hierarchy (laminar family) on L. By Theorem 13.21 of [17], the collection of strong clusters of \(\mathcal {R}_{\mathit{maj}}\) can be represented by a rooted tree τ together with a function \(\pi : L \rightarrow \{\text{nodes of~$\tau$}\}\) such that the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) are in one-to-one-correspondence with the nodes of τ in the following sense: For any strong cluster A and any x∈L, it holds that x∈A if and only if π(x) is a descendant of the node τ(A) corresponding to A (as before, every node is also defined to be a descendant of itself).Footnote 3
Next, we show that:
-
For every x∈L, the node π(x) must be a leaf: Suppose that for some x∈L, the node π(x) is an internal node. Consider the strong cluster {x}. By the above, the node π(x) is equal to or a proper descendant of the node τ({x}). Since π(x) has at least one child, there exists some node v in τ which is a proper descendant of τ({x}). The (non-empty) strong cluster corresponding to node v is therefore a proper subset of {x}, which is impossible. Hence, π cannot map any element of L to an internal node of τ.
-
Every internal node of τ has at least two children: If A and B are strong clusters of \(\mathcal {R}_{\mathit{maj}}\) with A⊆̷B then there exists some strong cluster C of \(\mathcal {R}_{\mathit{maj}}\) such that C⊆̷B and C∩A=∅. For example, take any c∈B∖A and use the fact that {c} is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\).
By these observations, there exists a tree τ leaf-labeled by L where every internal node has at least two children such that each strong cluster of \(\mathcal {R}_{\mathit{maj}}\) equals Λ(τ[u]) for some rooted subtree τ[u] of τ. □
Say that a tree T includes a cluster A of L if T contains a node u such that Λ(T[u])=A.
Theorem 1
An R* consensus tree of T 1 and T 2 always exists. In particular, it includes every strong cluster of \(\mathcal {R}_{\mathit{maj}}\) and no other clusters of L.
Proof
By Lemma 4, there exists a tree τ that includes all the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) and does not include any other clusters. For any aa′|x∈r(τ), there exists a node u in τ such that a,a′∈Λ(τ[u]) and \(x \not\in \varLambda (\tau[u])\), i.e., a,a′∈A and \(x \not\in A\) for some strong cluster A of \(\mathcal {R}_{\mathit{maj}}\), which implies that \(aa'|x \in \mathcal {R}_{\mathit{maj}}\). Thus, \(r(\tau) \subseteq \mathcal {R}_{\mathit{maj}}\).
To prove the optimality of τ, suppose that there exists a tree τ′ which satisfies \(r(\tau') \subseteq \mathcal {R}_{\mathit{maj}}\) and has more internal nodes than τ. By Lemma 3, the set of leaves in each rooted subtree of τ′ forms a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Since τ′ has more internal nodes than τ, it follows that τ′ includes some strong cluster of \(\mathcal {R}_{\mathit{maj}}\) which τ does not include. This contradicts Lemma 4. Hence, τ is an R* consensus tree of T 1 and T 2. □
2.4 Constructing a Tree from a Set of Clusters
Here, we address the problem of constructing a tree from a given set of clusters. We first recall a problem known as the directed perfect phylogeny problem with binary characters. Suppose M is a binary matrix. Let I be the set of rows and J the set of columns in M. A directed perfect phylogeny for M is a tree T such that: (1) the leaves of T are bijectively labeled by I; and (2) each c∈J is associated with exactly one node T(c) of T in such a way that for any x∈I, it holds that M[x,c]=1 if and only if leaf x belongs to the subtree of T rooted at node T(c). The directed perfect phylogeny problem with binary characters (DPPB) is to, given a binary matrix M, construct a directed perfect phylogeny for M (if one exists). For examples and more details, refer to, e.g., Sect. 17.3 in [10] or Sect. 7.2.2 in [19]. DPPB can be solved in O(|I|⋅|J|) time by Gusfield’s algorithm; see Sect. 17.3.4 in [10] or Sect. 7.2.2.1 in [19].
Now, suppose we are given a set L of leaf labels and a set C of clusters of L. To build a tree that includes all clusters in C and no other clusters of L (if such a tree exists), we first construct a binary matrix M of size |L|×|C| whose rows represent the leaf labels in L and whose columns represent the clusters in C by letting each entry M[i,j]=1 if and only if the ith leaf label belongs to the jth cluster. Then, we try to build a directed perfect phylogeny T for M by applying Gusfield’s algorithm. If such a T exists, it must be unique since all columns of M are distinct (see the comments in Sect. 2 in [11]). However, T might include some clusters not present in C; to ensure that this is not the case, count the number of nodes in T and check if it is equal to |C|. This gives:
Lemma 5
Given a set L of leaf labels and a set C of clusters of L, we can build a tree that includes all clusters in C and no other clusters of L (or determine that no such tree exists) in O(|L|⋅|C|) time.
3 Constructing the R* Consensus Tree
Based on the properties mentioned in Sects. 2.2–2.4, we can construct the R* consensus tree of two given trees T 1 and T 2 as outlined in our main algorithm R*_consensus_tree (displayed in Fig. 3).
The strategy of Algorithm R*_consensus_tree is as follows: First compute \(s_{\mathcal {R}_{\mathit{maj}}}\) and all the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\).Footnote 4 Next, check each Apresjan cluster to see if it is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) (by Lemma 2, the set of strong clusters of \(\mathcal {R}_{\mathit{maj}}\) is a subset of the set of Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\)). Finally, construct a tree T that includes all the strong clusters of \(\mathcal {R}_{\mathit{maj}}\) and no other clusters of L by the method in Lemma 5. According to Theorem 1, the resulting T is the R* consensus tree of T 1 and T 2.
We now analyze the time complexity of Algorithm R*_consensus_tree. We shall explain how to implement step 1 in \(O(n^{2} \sqrt{\log n})\) time in Sect. 4 (step 1 is in fact the only step that does not take O(n 2) time). In step 2, the Apresjan clusters of \(s_{\mathcal {R}_{\mathit{maj}}}\) are computed by running the algorithm of Bryant and Berry from [4], which takes O(n 2) time according to Lemma 1 above. Next, there are O(n) Apresjan clusters to consider in the loop of step 3 by Lemma 1, and each one is checked in O(n) time in step 4, as detailed in Sect. 5 below. Finally, in step 5, the set C satisfies |C|=O(n) because C is a subset of the Apresjan clusters, so applying the method of Lemma 5 takes O(n 2) time.
In summary, we have the following theorem.
Theorem 2
Algorithm R*_consensus_tree constructs the R* consensus tree of T 1 and T 2 in \(O(n^{2} \sqrt{\log n})\) time.
The following two sections are devoted to implementing steps 1 and 4 of R*_consensus_tree efficiently.
4 Computing \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) for All a,b∈L with a≠b
From the definition of \(\mathcal {R}_{\mathit{maj}}\), we have:
Lemma 6
For any a,b,c∈L, \(ab|c \in \mathcal {R}_{\mathit{maj}}\) if and only if either
-
1.
ab|c∈t(T 1)∩t(T 2); or
-
2.
ab|c∈t(T 1) and a|b|c∈t(T 2); or
-
3.
a|b|c∈t(T 1) and ab|c∈t(T 2).
We introduce the following three auxiliary functions:
For any a,b∈L, the function \(\mathit{count}_{r,r}(a,b)\) tells us how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in both T 1 and T 2. Similarly, \(\mathit{count}_{r,f}(a,b)\) counts how many times a triplet involving a and b and some other leaf w occurs as a rooted triplet of the form ab|w in T 1 and as a fan triplet of the form a|b|w in T 2 (and analogously for \(\mathit{count}_{f,r}(a,b)\)). Then, Definition 2 and Lemma 6 immediately imply:
Corollary 1
For every a,b∈L with a≠b,
(Observe that three auxiliary \(\mathit{count}\)-functions suffice to express the value of \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\). For our purposes, it is unnecessary to define a fourth auxiliary function \(\mathit{count}_{f,f}(a,b)\) that would count how many fan triplets of the form a|b|w that are consistent with both T 1 and T 2.)
To compute \(\mathit{count}_{r,r}\), \(\mathit{count}_{r,f}\), and \(\mathit{count}_{f,r}\), we could preprocess T 1 and T 2 in O(n) time so that lowest common ancestor queries in T 1 and T 2 could be answered in O(1) time [2, 12]. Then, for any given a,b∈L, a brute-force solution would easily obtain all of \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) in O(n) time by checking T 1||{a,b,w} and T 2||{a,b,w} for every w∈L∖{a,b}. This approach would therefore take O(n 3) time to compute \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) for all a,b∈L. In the rest of this section, we show how to obtain these values more efficiently.
First, in Sect. 4.1, we reformulate the \(\mathit{count}\)-functions in terms of distances from leaves to lowest common ancestors of leaves. All of these distances may be computed in O(n 2) time. Then, based on this alternative formulation, for any fixed leaf label a∈L, Sect. 4.2 describes an \(O(n \sqrt{\log n})\)-time method for computing \(\mathit{count}_{r,r}(a,b)\) for all b∈L∖{a}, and Sect. 4.3 describes an O(n)-time method for computing \(\mathit{count}_{r,f}(a,b)\) for all b∈L∖{a}. (By symmetry, we can obtain \(\mathit{count}_{f,r}(a,b)\) for all b∈L∖{a} in O(n) time with the same technique as in Sect. 4.3.) To summarize, after O(n 2) time preprocessing, \(O(n \sqrt{\log n})\) time is enough to compute \(\mathit{count}_{r,r}(a,b)\), \(\mathit{count}_{r,f}(a,b)\), and \(\mathit{count}_{f,r}(a,b)\) for all b∈L∖{a} for each fixed a∈L. Therefore, we only need \(O(n^{2} \sqrt{\log n})\) time in total to obtain \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) for all a,b∈L according to Corollary 1.
4.1 Distances from Leaves to Lowest Common Ancestors
For any tree T and any a,b∈Λ(T), let d a,T (b) be the distance (i.e., the number of edges) between a and lca T(a,b) in T, and let e a,T (b) be the child of lca T(a,b) which is an ancestor of b. Without loss of generality, define e a,T (a)=∅. See Fig. 4(a) for an illustration of these definitions.
Note that the values of d a,T (b) and e a,T (b) for all a,b∈Λ(T) can be obtained in O(n 2) time in total by performing n simple traversals of T, where n=|Λ(T)|. More precisely, for each leaf a∈Λ(T), spend O(n) time to compute d a,T (b) and e a,T (b) for all b∈Λ(T) as follows: Let d a,T (a):=0 and e a,T (a):=∅. Start at a and follow the upwards path P from a to the root of T while using a variable d to remember the distance from a to the current location on P. Whenever a new node u on P is visited, let C be the set of children of u not belonging to P, and for each c∈C, traverse the subtree of T rooted at c and assign the values d a,T (b):=d and e a,T (b):=c for all leaves b that are encountered.
The next lemma states the relationship between d and e and the triplets consistent with T. For an example, refer to Fig. 4(b). The first part of Lemma 7 is a special case of Theorem 1 in [15], which was derived by Lee et al. [15] to solve a different problem known as the maximum agreement subtree problem (MAST).
Lemma 7
For any tree T and any three distinct a,x,w∈Λ(T), it holds that:
-
d a,T (x)<d a,T (w) if and only if ax|w∈t(T);
-
d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w) if and only if a|x|w∈t(T).
Proof
For the first part, d a,T (x)<d a,T (w) if and only if lca T(a,x) is a proper descendant of lca T(a,w). This is equivalent to ax|w∈t(T) by the definition of “consistent with T” for a rooted triplet and the definition of t(T).
For the second part, a|x|w∈t(T) if and only if lca T(a,x)=lca T(a,w)=lca T(x,w) by the definition of “consistent with T” for a fan triplet and the definition of t(T). The condition lca T(a,x)=lca T(a,w)=lca T(x,w) directly implies that d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w). On the other hand, if d a,T (x)=d a,T (w) and e a,T (x)≠e a,T (w) then lca T(a,x)=lca T(a,w); denote this node by v. Since e a,T (x) and e a,T (w) are two different children of v, we have lca T(x,w)=v and thus lca T(a,x)=lca T(a,w)=lca T(x,w). □
Next, consider two trees T 1 and T 2 with Λ(T 1)=Λ(T 2)=L. We have:
Lemma 8
For any a,b∈L with a≠b:
-
\(\mathit{count}_{r,r}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).
-
\(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\).
-
\(\mathit{count}_{f,r}(a,b) =|\{ w :d_{a,T_{1}}(b) = d_{a,T_{1}}(w), e_{a,T_{1}}(b) \neq e_{a,T_{1}}(w)\), and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).
Proof
By Lemma 7, \(d_{a,T_{1}}(b)<d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b)<d_{a,T_{2}}(w)\) mean that ab|w∈t(T 1)∩t(T 2). Hence, \(\mathit{count}_{r,r}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\).
Again, by Lemma 7, \(d_{a,T_{1}}(b)<d_{a,T_{1}}(w)\), \(d_{a,T_{2}}(b)=d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w)\) mean that ab|w∈t(T 1) and a|b|w∈t(T 2). Therefore, \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\), \(d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\).
The formula for \(\mathit{count}_{f,r}(a,b)\) follows by symmetry. □
4.2 Computing \(\mathit{count}_{r,r}(a,b)\) for All b∈L∖{a}
Suppose a∈L is fixed. This section describes how to compute \(\mathit{count}_{r,r}(a,b)\) for all b∈L∖{a} in \(O(n \sqrt{\log n})\) time, assuming all values of \(d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b)\) have been precomputed.
Our algorithm is named Compute_count_rr and is listed in Fig. 5. It works as follows. Suppose that G is an integer grid of size (n×n) and each b∈L∖{a} is represented by the point \((d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b))\) on G. Then:
Lemma 9
For each b∈L∖{a}, the value of \(\mathit{count}_{r,r}(a,b)\) equals the number of points inside the rectangle \([(d_{a,T_{1}}(b) + 1) : n] \times [(d_{a,T_{2}}(b) + 1) : n]\) in G.
Proof
Lemma 8 states that \(\mathit{count}_{r,r}(a,b)\) equals \(|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w) \}|\). Every element w that contributes to this expression satisfies \(d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\) and \(d_{a,T_{2}}(b) < d_{a,T_{2}}(w)\), and thus, \(d_{a,T_{1}}(w) \in [(d_{a,T_{1}}(b) + 1) : n]\) and \(d_{a,T_{2}}(w) \in [(d_{a,T_{2}}(b) + 1) : n]\). □
According to Lemma 9, we can obtain \(\mathit{count}_{r,r}(a,b)\) by calculating how many points lie inside the corresponding rectangle on G. Actually, we have to do this for all of the n−1 rectangles defined by the elements of L∖{a}. So to implement Lemma 9 efficiently, we borrow a data structure by Chan and Pǎtraşcu [6] for offline orthogonal range counting on G. More precisely, we use the following result (Corollary 2.3 in [6]):
Lemma 10
(Chan and Pǎtraşcu [6])
Given n points and n axis-aligned rectangles in the plane, we can count the number of points inside each rectangle in \(O(n \sqrt{\log n})\) total time.
Note that one query to the data structure returns the number of points in each of the rectangles. By Lemma 10, the running time of Algorithm Compute_count_rr becomes \(O(n \sqrt{\log n})\).
4.3 Computing \(\mathit{count}_{r,f}(a,b)\) for All b∈L∖{a}
Again, suppose a∈L is fixed. Here, we show how to compute \(\mathit{count}_{r,f}(a,b)\) (and by symmetry, \(\mathit{count}_{f,r}(a,b)\)) for all b∈L∖{a} in O(n) time. We assume that the values of \(d_{a,T_{1}}(b)\), \(d_{a,T_{2}}(b)\), \(e_{a,T_{1}}(b)\), \(e_{a,T_{2}}(b)\) for all b∈L∖{a} have been precomputed.
For any given a,b, we need to count how many leaves w appear in a rooted triplet of the form ab|w in t(T 1) and in a fan triplet of the form a|b|w in t(T 2) at the same time. Our idea is to first use T 2 to partition L∖{a} into disjoint subsets C 1,…,C p in such a way that for every pair of elements x,y in the same C i , it holds that \(d_{a,T_{2}}(x) = d_{a,T_{2}}(y)\). Also, further partition each C i into C i,1,…,C i,q so that for every pair of elements x,y in the same C i,j , it holds that \(e_{a,T_{2}}(x) = e_{a,T_{2}}(y)\). The second part of Lemma 7 immediately yields:
Lemma 11
For any b,w∈L∖{a}, the fan triplet a|b|w belongs to t(T 2) if and only if b and w belong to the same C i -set but different C i,q -sets.
Then, for any b∈C i,j , the number of fan triplets of the form a|b|w that are consistent with T 2 equals |C i ∖C i,j |. However, what we need to know is when a fan triplet a|b|w from T 2 also satisfies ab|w∈t(T 1). To also include this crucial condition on T 1, we refine Lemma 11 in the following way:
Lemma 12
Let b∈L∖{a} and suppose b∈C i,j . Then, \(\mathit{count}_{r,f}(a,b) =|\{ w \in C_{i} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}| -|\{ w \in C_{i,j} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\).
Proof
By Lemma 8, \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\), and \(e_{a,T_{2}}(b) \neq e_{a,T_{2}}(w) \}|\). Since \(d_{a,T_{2}}(b) = d_{a,T_{2}}(w)\) if and only if w∈C i , and moreover, \(e_{a,T_{2}}(b) = e_{a,T_{2}}(w)\) if and only if w∈C i,j , we have \(\mathit{count}_{r,f}(a,b) =|\{ w :d_{a,T_{1}}(b) < d_{a,T_{1}}(w), w \in C_{i}\), and \(w \not\in C_{i,j} \}|\). Next, to count the total number of leaves w from C i ∖C i,j that satisfy \(d_{a,T_{1}}(b) < d_{a,T_{1}}(w)\), just count how many leaves in C i that satisfy the condition and then subtract the number of leaves in C i,j that also satisfy it. This gives the desired formula for \(\mathit{count}_{r,f}(a,b)\). □
Our algorithm Compute_count_rf is specified in Fig. 6. It uses Lemma 12 to compute \(\mathit{count}_{r,f}(a,b)\) for all b∈L∖{a} for any fixed a∈L. For each b, define \(s(b) = |\{ w \in C_{i} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\) and \(ss(b) = |\{ w \in C_{i,j} : d_{a,T_{1}}(b) < d_{a,T_{1}}(w) \}|\), where b∈C i,j . Then, by Lemma 12, \(\mathit{count}_{r,f}(a,b) = s(b) - ss(b)\). The algorithm computes the values of s(b) for all leaves b in each set C i in steps 4–13 by simple counting and by using a variable named c to keep track of how many consecutive elements from C i that have the same \(d_{a,T_{1}}\)-value. Next, ss(b) for all leaves b in each set C i,j are obtained analogously in steps 17–26. At last, step 30 assigns the value s(b)−ss(b) to \(\mathit{count}_{r,f}(a,b)\). The correctness follows from the correctness of Lemma 12.
Algorithm Compute_count_rf can be implemented to run in O(n) time by using radix sort to sort the elements inside each C i - and C i,j -set in steps 3 and 16. Each b∈L∖{a} belongs to exactly one C i -set and exactly one C i,j -set, and is therefore treated once in steps 4–13 and once in steps 17–26.
5 Determining if a Cluster Is a Strong Cluster
This section shows how to check whether any given cluster A of L with |A|≥2 is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) in O(n) time (recall from the definitions in Sect. 2.2 that A is trivially a strong cluster if |A|=1). Our solution relies on the next lemma, illustrated in Fig. 7. For any node u of a tree T, each subtree of T rooted at a child of u is said to be attached to u.
Lemma 13
Let A⊆L with |A|≥2 and define u 1 and u 2 to be the lowest common ancestor of A in T 1 and T 2, respectively. A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) if and only if:
-
(1)
For i=1,2, every subtree U attached to u i in T i satisfies either Λ(U)⊆A or Λ(U)⊆L∖A; and
-
(2)
X 1∩X 2=∅, where X i =Λ(T i [u i ])∖A.
Proof
(⇒) Let A be a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). We shall prove that both conditions (1) and (2) hold.
We first prove (1). For the sake of obtaining a contradiction, suppose that for some i∈{1,2}, there exist a∈A, x∈L∖A where both a and x are in the same subtree attached to u i . Then, \(lca^{T_{i}}(a, x)\) is a proper descendant of u i . The node u i is defined as the lowest common ancestor of A, so \(lca^{T_{i}}(a,a') = u_{i}\) for some a′∈A. However, then \(lca^{T_{i}}(a,x)\) is a proper descendant of \(lca^{T_{i}}(a,a')\), giving ax|a′∈t(T i ). This implies that aa′|x cannot belong to \(\mathcal {R}_{\mathit{maj}}\), which contradicts the fact that A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Thus, condition (1) must hold.
To prove condition (2), we first show that there exist two leaves a,a′∈A such that \(lca^{T_{1}}(a,a') = u_{1}\) and \(lca^{T_{2}}(a,a') = u_{2}\). Observe that for both i=1,2, the leaves from A are located in at least two different subtrees attached to u i . Thus, there exists at least one pair a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\). Now, assume on the contrary that for every pair a,a′∈A such that \(lca^{T_{1}}(a,a') = u_{1}\), \(lca^{T_{2}}(a,a')\) is a proper descendant of u 2. Consider any pair of different subtrees attached to u 1 that contain elements from A, and denote their two sets of leaves by A 1 and A 2. By the assumption, \(lca^{T_{2}}(x,y)\) is a proper descendant of u 2 for all x∈A 1 and y∈A 2, and thus all leaves in A 1∪A 2 must appear in a single subtree attached to u 2. By repeating this argument, all elements of A appear in one subtree attached to u 2, which is impossible. Therefore, there exists a pair a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\) such that \(lca^{T_{2}}(a,a')\) is not a proper descendant of u 2, i.e., \(lca^{T_{2}}(a,a') = u_{2}\).
Finally, we are ready to prove (2). Take any a,a′∈A with \(lca^{T_{1}}(a,a') = u_{1}\) and \(lca^{T_{2}}(a,a') = u_{2}\) as in the preceding paragraph. To obtain a contradiction, suppose that X 1∩X 2≠∅. Then there exists an x∈X 1∩X 2, and moreover, x|a|a′∈t(T 1) and x|a|a′∈t(T 2) because the subtree attached to u i for i=1,2 that contains x cannot contain a or a′ by condition (1) above. But then \(aa'|x \not\in \mathcal {R}_{\mathit{maj}}\), contradicting the fact that A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). Hence, (2) holds.
(⇐) Suppose A satisfies the two conditions stated in the lemma. We will prove that for every a,a′∈A and c∈L∖A, it holds that \(aa'|c \in \mathcal {R}_{\mathit{maj}}\). The leaf c belongs to exactly one of the three disjoint sets X 1, X 2, and L∖(A∪X 1∪X 2), which gives three cases:
-
If c∈L∖(A∪X 1∪X 2): Then \(lca^{T_{i}}(a,a')\) is a proper descendant of \(lca^{T_{i}}(a,c)\) for both i=1,2. Thus, \(aa'|c \in \mathcal {R}_{\mathit{maj}}\).
-
If c∈X 1: Then c is not a descendant of u 2 in T 2 because \(c \not\in X_{2}\), so aa′|c∈t(T 2) always holds. There are two subcases depending on whether or not \(lca^{T_{1}}(a,a')\) is a proper descendant of u 1. If yes, then aa′|c∈t(T 1), and thus \(aa'|c \in \mathcal {R}_{\mathit{maj}}\). If no, then \(lca^{T_{1}}(a,a') = u_{1}\), and thus a|a′|c∈t(T 1), which also yields \(aa'|c \in \mathcal {R}_{\mathit{maj}}\).
-
If c∈X 2: Then it follows that \(aa'|c \in \mathcal {R}_{\mathit{maj}}\) in the same way as for the case c∈X 1 above.
By definition, A is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\). □
Armed with Lemma 13, it is straightforward to check if any given cluster A of L is a strong cluster of \(\mathcal {R}_{\mathit{maj}}\) in O(n) time as shown in Algorithm Check_if_strong_cluster in Fig. 8.
6 Concluding Remarks
In this paper, we have shown how to construct the R* consensus tree of two input trees in \(O(n^{2} \sqrt{\log n}) = o(n^{3})\) time. It is an open problem to determine if the time complexity can be further improved to O(n 2) or better. The bottleneck of our algorithm R*_consensus_tree is the computation of \(\mathit{count}_{r,r}\) in Sect. 4.2, which uses \(O(n \sqrt{\log n})\) time for each fixed a∈L to obtain the values of \(\mathit{count}_{r,r}(a,b)\) for all b∈L∖{a}.
We remark that for the restricted case where both input trees are binary, the problem becomes much easier because then the R* consensus tree is equivalent to the so-called RV-I tree in [14]. The RV-I tree differs slightly from the R* consensus tree in general in that if ab|c belongs to one of t(T 1) and t(T 2), and a|b|c belongs to the other, then ab|c may never be consistent with the output tree; however, for the special case of binary trees, neither t(T 1) nor t(T 2) contain any fan triplets so \(\mathcal {R}_{\mathit{maj}}\) equals the set of all rooted triplets that are consistent with both T 1 and T 2, and this is precisely the “set of triples which are resolved identically in T 1 and T 2” in Definition 5.1 in [14]. Furthermore, according to Theorem 5.1 in [14], the RV-I tree of T 1 and T 2 is always equivalent to the strict consensus tree (see also [3, 7, 9, 18, 19]) of T 1 and T 2, which can be constructed in O(n) time by a classical algorithm of Day [7]. Thus, there is a large gap in the time complexity of constructing the R* consensus tree of two trees in the binary case (O(n) time) and in the general, non-binary case (\(O(n^{2} \sqrt{\log n})\) time).
The definition of the set of majority rooted triplets \(\mathcal {R}_{\mathit{maj}}\) as well as the definition of an R* consensus tree can be naturally extended to the case of k>2 input trees; see [3]. The currently fastest algorithm for the case k>2, outlined in [3], runs in O(kn 3) time. It can be implemented by explicitly constructing the sets r(T i ) for every input tree T i using O(kn 3) total time, then obtaining \(\mathcal {R}_{\mathit{maj}}\) in O(kn 3) time by finding the most frequently occurring rooted triplet (if it exists) for each {x,y,z} in the leaf label set in O(k) time, and finally applying the O(n 3)-time strong cluster algorithm from Corollary 2.2 in [4] to \(\mathcal {R}_{\mathit{maj}}\). An important open problem is to reduce the time complexity to o(kn 3). It seems difficult to extend the approach used in this paper because it would yield an exponential number (in k) of cases in Lemma 6 and thus express \(s_{\mathcal {R}_{\mathit{maj}}}(a,b)\) as the sum of an exponential number of auxiliary \(\mathit{count}\)-functions, causing the total running time to be exponential in k.
Notes
See Sect. 2 for formal definitions of rooted triplet, consistent with, strong cluster, etc.
Observe that every rooted triplet consistent with τ must also belong to \(\mathcal {R}_{\mathit{maj}}\), i.e., the R* consensus tree is not allowed to introduce any new rooted triplets. In a related problem called the maximum rooted triplets consistency problem (MaxRTC), the input is a set \(\mathcal {R}\) of (possibly conflicting) rooted triplets and the objective is to infer a tree T consistent with as many rooted triplets as possible from \(\mathcal {R}\); in particular, the output T may also be consistent with rooted triplets not present in \(\mathcal {R}\). MaxRTC is NP-hard; see [5] for more details and references.
In other words, each strong cluster A corresponds to a node τ(A) in τ and the set of elements from L that are mapped by π to the subtree rooted at τ(A) is precisely A. Also, A⊆̷B, where A and B are strong clusters of \(\mathcal {R}_{\mathit{maj}}\), if and only if the node τ(A) is a proper descendant of the node τ(B).
Recall from Sect. 2.2 that for a set \(\mathcal {R}\) of triplets over a leaf label set L, the function \(s_{\mathcal {R}}\) is defined as \(s_{\mathcal {R}}(a,b) = \bigl|\{y : ab|y \in \mathcal {R}\}\bigr|\) for each a,b∈L with a≠b, and \(s_{\mathcal {R}}(a,a) = |L| - 1\) for each a∈L.
References
Bansal, M.S., Dong, J., Fernández-Baca, D.: Comparing and aggregating partially resolved trees. In: Proceedings of the 8th Latin American Symposium on Theoretical Informatics (LATIN 2008). LNCS, vol. 4957, pp. 72–83. Springer, Berlin (2008)
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Proceedings of the 4th Latin American Symposium on Theoretical Informatics (LATIN 2000). LNCS, vol. 1776, pp. 88–94. Springer, Berlin (2000)
Bryant, D.: A classification of consensus methods for phylogenetics. In: Janowitz, M.F., Lapointe, F.-J., McMorris, F.R., Mirkin, B., Roberts, F.S. (eds.) Bioconsensus. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 61, pp. 163–184. Am. Math. Soc., Providence (2003)
Bryant, D., Berry, V.: A structured family of clustering and tree construction methods. Adv. Appl. Math. 27(4), 705–732 (2001)
Byrka, J., Guillemot, S., Jansson, J.: New results on optimizing rooted triplets consistency. Discrete Appl. Math. 158(11), 1136–1147 (2010)
Chan, T.M., Pǎtraşcu, M.: Counting inversions, offline orthogonal range counting, and related problems. In: Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010), pp. 161–173. SIAM, Philadelphia (2010)
Day, W.H.E.: Optimal algorithms for comparing trees with labeled leaves. J. Classif. 2(1), 7–28 (1985)
Degnan, J.H., DeGiorgio, M., Bryant, D., Rosenberg, N.A.: Properties of consensus methods for inferring species trees from gene trees. Syst. Biol. 58(1), 35–54 (2009)
Felsenstein, J.: Inferring Phylogenies. Sinauer, Sunderland (2004)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York (1997)
Gusfield, D.: Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In: Proceedings of the 6th Annual International Conference on Computational Biology (RECOMB 2002), pp. 166–175. ACM, New York (2002)
Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13(2), 338–355 (1984)
Jansson, J., Sung, W.-K.: Constructing the R* consensus tree of two trees in subcubic time. In: Proceedings of the 18th Annual European Symposium on Algorithms (ESA 2010). LNCS, vol. 6346, pp. 573–584. Springer, Berlin (2010)
Kannan, S., Warnow, T., Yooseph, S.: Computing the local consensus of trees. SIAM J. Comput. 27(6), 1695–1724 (1998)
Lee, C.-M., Hung, L.-J., Chang, M.-S., Shen, C.-B., Tang, C.-Y.: An improved algorithm for the maximum agreement subtree problem. Inf. Process. Lett. 94(5), 211–216 (2005)
Nakhleh, L., Warnow, T., Ringe, D., Evans, S.N.: A comparison of phylogenetic reconstruction methods on an Indo-European dataset. Trans. Philol. Soc. 103(2), 171–192 (2005)
Schrijver, A.: Combinatorial Optimization: Polyhedra and Efficiency. Algorithms and Combinatorics, vol. 24/A. Springer, Berlin (2003)
Scornavacca, C.: Supertree methods for phylogenomics. PhD thesis, University of Montpellier II, France (2009)
Sung, W.-K.: Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC, Boca Raton (2010)
Acknowledgements
The authors would like to thank David Bryant for some clarifications and the anonymous referees for their helpful comments. JJ was supported by the Special Coordination Funds for Promoting Science and Technology and KAKENHI grant number 23700011.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Jansson, J., Sung, WK. Constructing the R* Consensus Tree of Two Trees in Subcubic Time. Algorithmica 66, 329–345 (2013). https://doi.org/10.1007/s00453-012-9639-1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-012-9639-1