当前位置: 首页 > 期刊 > 《分子生物学进展》 > 2003年第1期 > 正文
编号:10582159
Evolutionary Distances Between Sequences
http://www.100md.com 《分子生物学进展》2003年第1期
     School of Biological Sciences, University of Manchesterj&vl, 百拇医药

    Abstractj&vl, 百拇医药

    Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t 1/2 |max|-1 logN, where max is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t 1/2 |1|-1 logN, where 1 is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than those based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.

    Key Words: molecular phylogeny • maximum likelihood • evolutionary distances • distance matrixse}t, 百拇医药

    Introductionse}t, 百拇医药

    There are now a large number of methods available for the construction of phylogenetic trees from molecular sequences. Where relatively small numbers of sequences are involved and large amounts of computer time is available, maximum likelihood (ML) inference of phylogenies can be used . The ML method is based on sound statistical principles and allows tests to be made for comparing alternative tree topologies and alternative models of sequence evolution. However, when large numbers of taxa are involved, the ML method becomes very slow, and practical methods of dealing with large sets of sequences are still required. The fastest phylogeny methods usually use matrices of evolutionary distances. The neighbor joining (NJ) method is one of the most popular of these, and tests have shown that it is relatively accurate in comparison with other heuristic clustering methods . If the distance matrix corresponds to an exactly additive tree, NJ reproduces the correct tree topology, and it is robust against small errors in the distance estimates between sequences . If errors in distance estimates are large, however, then NJ and other distance matrix methods will give incorrect tree topologies. In this article, we investigate the way in which the errors in the pairwise distances between sequences depend on the sequence length, the evolutionary model, and the degree of divergence between sequences.

    The problems with distance estimation are already apparent in the simplest Jukes-Cantor (JC) model of sequence evolution. Evolutionary distance is usually measured in terms of the average number of substitutions per site, which we denote as d. For two aligned sequences of length N, the estimated evolutionary distance is given within the JC model by4p8d, 百拇医药

    where D is the observed fraction of sites that differ. We will refer to equations such as equation (1) as "log transform" formulae because they transform observed quantities (in this case D) to estimates of distances that are not themselves directly observable quantities. A general method of deriving log transform formulae is given in Error Analysis of Distance Estimates for a General Rate Model, equations (4)–(9). For finite sequences, n has a binomial distribution, and hence D fluctuates about its mean value. Because is a nonlinear function of D, the expectation value of is not exactly equal to d; hence, there is a small systematic error of order 1/N. In practice, this is dominated by the statistical error, which is of the order 1/N1/2 .

    It also is apparent from equation (1) that the estimated distance is infinite or undefined, if D 3/4. When the true d is large, D approaches 3/4 for an infinite sequence; therefore, there is a significant chance that the observed D in a finite sequence will be larger than 3/4 and that the distance estimate will be undefined. For small evolutionary distances, this will rarely occur. However, for any finite length N and for any true distance d, there is a nonzero probability that will be undefined.j%@3, 百拇医药

    When using the JC model, the problems described above will only become important when d >> 1, in which case there is saturation of mutations. In this case there is very little phylogenetic information left in the sequences, and we probably should not be using these sequences for tree construction in the first place. However, we will show below that these problems tend to be more serious with more complex models. The effect of increasing the model complexity is illustrated by the next most simple model of evolution: the Kimura 2- parameter model (K2P) . In this model the base frequencies are equal, but a distinction is made between transitions (occurring at rate 1/4) and transversions (occurring at rate 1/4ß). The ratio /ß is usually denoted by . The estimated evolutionary distance between two sequences is the sum of the estimated numbers of transversions and transitions per site and also can be written as a log transform formula, this time with two log terms,

    where S and V are the observed fractions of sites that differ by transitions and transversions between the two sequences. There are now two timescales, one associated with transitions and one with transversions. Transitions are usually faster ( > 1). The distance estimate can diverge because of divergences in either of the two log terms. The quantity 1 - 2S - V tends to zero in a time governed by the rapid transition rate, whereas the term 1 - 2V tends to zero on the slow transversion timescale. Divergences, therefore, tend to arise because of saturation of the transitions, even when the overall evolutionary distance is small. The errors in this formula, therefore, become significant at smaller evolutionary distances than for the JC model.], http://www.100md.com

    In addition to the JC and K2P models, many other rate matrix models have been proposed. The model of Hasegawa, Kishino, and Yano (HKY) and the model of Tamura and Nei (TN) have rate matrices for which the eigenvectors and eigenvalues are analytically soluble. Hence, an explicit log transform formula can be written. A distance formula for general reversible models has been derived in various forms by and Other distance measures such as the paralinear distance and the LogDet distance have been introduced to address nonstationarity, i.e., heterogeneity in the base composition of the sequences studied.

    The observation that increasing model complexity can increase both the variance of distance estimates and the frequency with which undefined distance estimates are encountered has been made before (p. 84 for a discussion of this point). However, we wish to consider what generic features of a model affect the accuracy of distance estimates and the frequency of undefined distances.[]am, http://www.100md.com

    We will begin by showing a few problems with distance estimates in real data. We will then give a quantitative theory to expand on the above arguments concerning log-transform formulae. We also consider likelihood-based methods of estimating distances where we fix the estimates of the rate parameters before estimation of the evolutionary distance. We show that these methods have errors associated with the slowest timescale in the model, and hence that these methods are more accurate.[]am, http://www.100md.com

    Examples of Distance Estimates in Real Sequence Data[]am, http://www.100md.com

    As an example with real data, we consider a set of large subunit (LSU) rRNA sequences obtainedfrom the database of . A set of 90 mitochondrial LSU sequences was used, including representatives from all groups of mitochondria-containing eukaryotes for which sequence data were available. The alignment in the database was taken without alteration. To exclude variable regions and poorly aligned regions from the analysis, we eliminated from the alignment those sites which contained a gap or unknown nucleotide for 10% or more of the sequences. In the evolutionary distance estimates for the K2P model (eq. 2) are plotted against the sequence dissimilarity D for each pair of sequences. This is a well-behaved data set for which none of the distance estimates diverge and for which there is a clear relationship between D and so that all the points appear to lie on a smooth curve.

    fig.ommittedu, 百拇医药

    FIG. 1. Plot of estimated evolutionary distance against sequence dissimilarity D (uncorrected distance) for K2P model log transform estimates (left hand graph, 1a), TN model log transform estimates (middle graph, 1b), and estimates from TN model using ML with fixed rate parameters (right hand graph, 1c), all on LSU rRNA sequences (see main text). Evolutionary distances of d = -0.25 indicate a divergent estimateu, 百拇医药

    shows the same data with distance estimates calculated with the TN model, using the log transform formula given in . In this case, the distance estimate diverges for 1.43% of sequence pairs. Pairs that diverge are plotted as points at = -0.25 in the lower portion of the graph. Divergence occurs only for widely separated pairs with D between 0.6 and 0.7. Also apparent from is that the scatter of distance estimates is greater for the TN model than for the K2P model, and the relationship between D and does not seem to be so well defined. In particular, at large D there are a number of outlier points, with well above the main curve. In any case, the TN distances could not be used without some finite value being assigned to the pairs that diverge.

    We also can use likelihood ratio tests for model selection purposes, and this gives results that are contradictory to those above. Because both the K2P and TN model distance estimates can be derived by maximizing the likelihood in terms of the observed numbers of tranversions and transitions and the K2P can be considered as a special case of the TN model, we can apply a likelihood ratio test to each sequence pair from the this data set. It is straightforward to calculate the log-likelihood values for the two models for any pair of sequences and, from this, to determine the difference in the log-likelihoods. Because the two models are nested, we know that 2 is distributed according to a 2 distribution under the hypothesis that the simpler K2P model is correct . On performing such a calculation for each possible pair of sequences, we find that the TN model provides a significantly better fit to the data (P < 0.05) for 99.8% of the sequence pairs for which the distance estimate does not diverge. From this, we would conclude that the TN model provides an overwhelmingly better explanation of the observed sequence data than does the K2P model. (We note that the likelihood ratio test is usually applied to the likelihood of whole trees, whereas here we are applying it simply to the likelihood of each sequence pair.)

    Our second example consists of 455 small subunit (SSU) rRNA sequences taken from the database of , chosen so that there is one example of a sequence from each genus of Eubacteria. This data set has been analyzed previously by and because it is an example of sequence evolution under the constraint of a conserved RNA secondary structure. These articles (and references therein) discuss a range of models for evolution of the paired stem regions of RNA sequences. Stem regions evolve through compensatory substitutions that maintain the pairing pattern; hence, the substitutions in the sites on either side of a pair are strongly correlated, and it is necessary to consider the evolution of the pair as a single unit rather than as two separate sites. One appropriate model is that of , which considers six allowed paired states, AU, GU, GC, UA, UG, and CG, (occurring with frequencies AU, ..., CG) and has rate parameters , ß, and quantifying the rates of double transitions, double transversions, and single transitions, respectively. The structure of the model is simple enough for an explicit log transform formula to be calculated (), namely = V + + . V, , and are estimators of KV, K, and K, the expected numbers of transversions, single transitions, and double transitions, respectively. These are given as KV = ßt, K = 82 (1 + 3) t, and K = 813t, where1 = 1/ 2(AU + UA), 2 = 1/2 (GU + UG), and 3 = 1/2 (GC + CG). The estimators V, , and are

    where V, S1, and S2 are the observed fractions of transversions, single transitions, and double transitions, respectively. Clearly, the ratios of the rate parameters also can be estimated from V, , and , i.e., 1 is estimated by 1 = /813V, and 2 /ß is estimated by 2 = /82(1 + 3)V. shows the estimated distances using this formula. A divergent distance estimate occurs for 0.91% of the sequence pairs. These divergent pairs (shown in the lower portion of the graph at = -0.25) occur over a range of dissimilarity that begins with a value of D as small as 0.25. There also is considerable scatter in the distance estimates for the points that do not diverge. The problems of distance estimation are therefore quite serious in this example.*-!5n, http://www.100md.com

    fig.ommitted*-!5n, http://www.100md.com

    FIG. 2. Plot of estimated evolutionary distance d against sequence dissimilarity D for pairs of sequences from the SSU rRNA data set (see main text). Left hand graph (2a) shows results from six-state model with distances estimated using log transform formula. Right hand graph (2b) shows results from Tillier and Collins six-state model with distances estimated from ML with fixed rate parameters. Approximately, only every fifth data point is actually plotted on the graphs. Evolutionary distances of d = -0.25 indicate a divergent estimate

    Distance Estimates with Fixed Rate Parameters@1y$', http://www.100md.com

    The problem with the pairwise distance estimates is that the values of parameters like the base frequencies and rate matrix parameters , ß, etc. are estimated separately for each pair, using only information from that pair, i.e., the parameters have different values for each sequence pair. Usually when building a tree, we make the assumption that the rate matrix parameters are constant during evolution of the whole tree. It, therefore, makes sense to obtain a single estimate of the parameters using information from all the sequences and then to estimate ML distances for each sequence pair while keeping the parameters fixed at their previously estimated values. The accuracy of pairwise distance estimates obtained in this way is analyzed in ML Distance Estimates with Fixed Rate Parameters and Appendix B.@1y$', http://www.100md.com

    The most natural way of estimating the parameters would be simply to do an ML calculation of the optimum tree and parameter values for the whole set of sequences. However, this is not practical for large data sets of sequences without spending huge amounts of computer time. Indeed, if we were able to carry out ML for the complete set of sequences, we would not be particularly interested in using distance matrix methods anyway.

    One way around the problem is to take a subset of sequences and do the ML calculation and then to use the ML parameters estimated from this subset when estimating pairwise distances for the whole set. shows the distances obtained in this way for the LSU rRNA sequences with the TN model. The estimates of the rate parameters have been obtained using TREE- PUZZLE on 10 sequences selected at random from the data set. The pairwise distances are then obtained for each sequence pair by maximizing the likelihood while keeping the rate parameters fixed at the values estimated by TREE-PUZZLE. There are now no divergent distance estimates, and the scatter of points at high D is substantially reduced in comparison with . It is now practical to use the TN model on this data, which we know was the preferred model according to the likelihood ratio tests.wh|ed), 百拇医药

    The model of for paired sites in RNAs is not implemented in TREE-PUZZLE or other currently available programs. We are currently in the process of developing a package that will use ML and Monte Carlo Markov Chain methods to construct phylogenies of RNAs using a variety of paired-site models. This will be available shortly . However, for the present article, we used a much simpler method to estimate the rate parameters for the model of in the example with the SSU rRNA sequences

    For each sequence pair of the SSU rRNA data set, the parameter ratios 1 and 2 can be estimated as outlined in the previous section. We obtain final estimates of 1 and 2 by averaging finite estimates over all the 455 x 454/2 sequence pairs of the SSU rRNA data set. 1 and 2 are then held fixed at these values. Fixing 1 and 2 (along with the usual timescale normalization condition that the expected number of substitutions per site per unit time is 1) completely specifies all three rate parameters, , ß, and . Specifically, ß = (1 + 8131 + 82[1 + 3]2)-1, = 1ß, and = 2ß. shows ML distances calculated from the paired states in the SSU rRNA sequences and using the six-state model of with fixed rate parameters estimated in this fashion. There are now no divergent distance estimates and the scatter of points is reduced, indicating that distance estimates are more reliable than those with the log transform formula.

    The key point of this section is that if good estimates of the rate parameters are known by ML estimation on a tree or by averaging the estimates of parameter ratios from each pair as described above, then accurate estimates of the pairwise distances can be obtained by maximizing the likelihoods for each pair with the fixed values of the rate parameters. These estimates suffer much less from problems of divergence than do estimates using the log transform formulae. In the next two sections, we consider in detail why this is so. Those readers not interested in the mathematical detail can proceed to Discussion and Conclusions, where we discuss the implications of these findings.m+g62gz, 百拇医药

    Error Analysis of Distance Estimates for a General Rate Modelm+g62gz, 百拇医药

    Consider a rate matrix R, whose elements Rij (Rij > 0 i !=m+g62gz, 百拇医药

    j) define the rate of substitution from state i to state j. Thus, the probability Pij(t) of finding a particular nucleotide in state j at time t, given that it was initially in state i, obeys the simple evolution equation

    R is defined such that the diagonal elements Rii = -{Sigma}e, 百拇医药

    j!=e, 百拇医药

    i
Rij and also so that the substitution process is reversible, i.e., iRij = jRji i, j. At long times, Pij(t) tends to j, the equilibrium frequency of state j.e, 百拇医药

    We have labeled the eigenvalues of R as i, where i = 0, ..., Nstate - 1. There is one eigenvalue 0 which is zero, whereas the other eigenvalues are negative. We have assumed the eigenvalues are ordered so that |1| " |2| " ··· " ||. We also label as max because it is the eigenvalue of the largest absolute value. We also can define the matrices U and V whose columns are formed from the eigenvectors of R and RT, respectively. The transition probability matrix P can then be written in terms of this decomposition as

    The evolutionary distance d between two sequences is defined as the expected number of nucleotide changes per site from which one finds d = -t {Sigma}a|, http://www.100md.com

    i iRii. This can be written as a function of the eigenvalues asa|, http://www.100md.com

    To derive the log transform formula, the Pij(t) in equation (5) is replaced by the observed values in the pair of sequences in question, i.e., equating the expectation value Pij(t) with the observed number of substitutions of that type. Thus, the derivation essentially uses the "Methods of Moments" (see for example, , p. 312). This gives a set of simultaneous equations for the unknown quantities exp(kt). The number of unknowns is equal to the number of distinct nonzero eigenvalues of R, and we call this Ne. Solving these equations for kt and substituting into equation (6) gives an estimate of the evolutionary distance as a log transform formulae

    The quantities Mk depend on the frequencies of the states i but not on the rate parameters. Therefore, they can easily be estimated from the observed state frequencies. The Qi can be taken to be of the formg;22, http://www.100md.com

    Here, the quantities S, = 1, ..., M, represent observed fractions of various substitution classes, e.g., S and V in the K2P model. The coefficients Bk will be explicit from the particular log transform formula used. The log transform formulae given above (eqs. 1 and 2) are both of this form, as are those for the TN model and the RNA model used in Distance Estimates with Fixed Rate Parameters.g;22, http://www.100md.com

    In Appendix A, we carry out an analysis of the errors in a log transform formula of the general form given in equation (7). The probability that the ith individual logarithmic term in equation (7) diverges isg;22, http://www.100md.com

    whereg;22, http://www.100md.com

    Here, the notation "">"

    represents the expectation value. From equation (9), we can see that the individual logarithmic term in equation (7) that is estimating the fastest eigenvalue max is the most likely to produce a divergence, and so max predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.^/p%, 百拇医药

    The total probability that the estimate (eq. 7) diverges is pdiv = 1 - P(Q1 > 0, Q2 > 0, ..., > 0). Obtaining a generic analytic form for pdiv is not possible. However, we approximate P(Q1 > 0, Q2 > 0, ..., > 0) ~=^/p%, 百拇医药

    P(Q1 > 0)P(Q2 > 0)...P( > 0), i.e., we approximate the Qi as being independent. This gives^/p%, 百拇医药

    where the complementary error function erfc(x) = 1 - erf(x). In the case of the K2P model, one has^/p%, 百拇医药

    where C11 = 4Pv(1 - Pv), C22 = Pv + 4Ps - (Pv + 2Ps)2 with Pv, Ps being the expected number of transversions and transitions, i.e., Pv = "V">"

    and Ps = "S">"4, 百拇医药

    .4, 百拇医药

    The approximation that Q1 and Q2 are independent is, as we shall see from simulations, a reasonably accurate one for the K2P model. If the transition to transversion ratio is large, then there is a clear separation of timescales between the two processes. Thus, pdiv consists of two step-like increases at evolutionary distances d1 and d2 . Thus, the range of applicability is governed by the shorter of the two timescales, d1. To test the accuracy of the above approximation, simulations were performed by generating pairs of random sequences with known evolutionary distance d and by calculating the fraction of pairs for which diverges. shows simulation results for the K2P model with sequence length N = 500 and = 10. The theoretical estimate is in excellent agreement with the simulation estimate of pdiv. The separation of the two distance scales is very clear because is large. We also performed simulations with = 2 (not shown). In this case, the two steps merge together to a single large step. The theory and simulation are also in good agreement in this case.

    fig.ommittedxf?{k, 百拇医药

    FIG. 3. Plot of simulated and theoretical divergence probability of log transform and ML (fixed rate matrix parameters) distance estimates for the K2P model with = 10 against evolutionary distance. The upper two curves (simulation = diamonds, theory = circles) are for the log transform estimate (eq. 2) and the lower two curves (simulation = triangles, theory = squares) are for ML with fixed rate parameters. Simulation averages evaluated over 2,000 replicates. Sequence length N = 500. One percent divergence probability occurs at approximate evolutionary distances of d = 1.40 and d = 6.65 for the two methods, respectivelyxf?{k, 百拇医药

    For all models we have studied in this article, the approximation that the Qi in equation (7) is independent is found to be a good one. Certainly, for any model where the distinct nonzero eigenvalues are well separated, it is valid for pdiv " 1/2 because within this range pdiv is predominantly controlled by a single eigenvalue max. For pdiv > 1/2, we consider the inaccuracy introduced by assuming that Qi in equation (7) is independent to be largely unimportant in comparison with the fact that at this stage greater than 50% of the distance estimates will be undefined. From equation (11), we see that for a general rate model pdiv consists of Ne step-like increases reaching a limiting value (for long sequences and large evolutionary distances) of 1 - .

    ML Distance Estimates with Fixed Rate Parametersa\, http://www.100md.com

    Our calculations on real data sets revealed that no divergent pairwise evolutionary distance estimates were obtained if the rate parameters were held fixed and not simultaneously estimated from the sequence data of the pair. We now consider this point in more detail and determine what controls the probability of a divergent estimate and the accuracy of that estimate when the rate parameters are held fixed.a\, http://www.100md.com

    Let us consider methods of distance estimation based on ML. We suppose that estimates of all the parameters in the rate matrix are known and that we maximize the likelihood with respect to only the distance t. If the specified values of the rate parameters are equal to the true values, we find (see Appendix B) that the probability of obtaining a divergent distance estimate, for large N and t, is now controlled by the slowest nonzero eigenvalue 1. The limiting value, as t "->"a\, http://www.100md.com

    {infty}

    , for the probability of divergence is 1/2 (to leading order in N-1), irrespective of the rate matrix used. For the K2P model, we find6hv3r, http://www.100md.com

    The result from this equation is shown in in comparison with estimates of pdiv from simulation. The sequence data has been generated with 0 = 10.0, and the same value has been specified in the ML calculation. It can be seen that the range of applicability of the ML distance estimate (with fixed rate parameters) is much larger because the divergence probability is a function of the slow timescale (transversions) only. The simulation results are in excellent agreement with the theory indicating that the probability of obtaining a divergent evolutionary distance estimate is predominantly controlled by the slowest nonzero eigenvalue 1.6hv3r, http://www.100md.com

    In Appendix B, we derive a perturbative result for the RMS error of the ML estimate (with fixed rate parameters) of d. Again the theoretical result (eq. 26) is in excellent agreement with simulation (not shown). Consequently, we also can conclude that the error in the ML estimate of d is indeed controlled by the slowest nonzero eigenvalue 1.

    In general, we will not know a priori the true parameters in the rate matrix. If the sequence data have been generated by the same class of model that is used to estimate the pairwise evolutionary distances, then for the methods we have used to estimate the rate parameters, e.g., for ML on a fixed tree (as in TREE-PUZZLE), we expect the estimates of rate parameters to become more accurate as sequence length N is increased. In this case, our analysis in Appendix B is unchanged to leading order in N. Consequently, for increasing sequence length, distance estimates are still controlled by the slowest eigenvalue1. Obviously, for any fixed sequence length N, the exact value of "({delta}l:4}j], 百拇医药

    t)2">"l:4}j], 百拇医药

    will depend on precisely how close the estimates of the rate matrix parameters are to the true values.l:4}j], 百拇医药

    Discussion and Conclusionsl:4}j], 百拇医药

    The focus of this article has been on the accuracy and reliability of distance matrix methods for phylogeny reconstruction. Our analysis has shown that distance estimates from log transform formula are accurate only on the shortest distance scale, |max|-1, defined by the underlying rate matrix R and can frequently give rise to undefined distances. Pairwise distance estimates can be improved using ML if the rate parameters are known with reasonable accuracy, for example, by estimation from more than just the two particular sequences under consideration. In such cases, pairwise distance estimates are accurate on the longest distance scale, |1|-1, defined by R. With improved pairwise distance estimates, better phylogenies can be constructed using NJ or variants such as Weighbor .

    Both the probability of divergence (eq. 11) and the estimate of the variance of (eq. 22 in Appendix A) provide a measure of how much sequence information is required to resolve a given evolutionary distance between two sequences using log transform formulae. To maintain a fixed probability of divergence pdiv or fixed variance of the evolutionary distance estimate, we have the scaling (for large evolutionary distances)qlz, 百拇医药

    Because tends to a constant, independent of N, as t "->"qlz, 百拇医药

    {infty}qlz, 百拇医药

    , we can, for large sequence lengths N, consequently ignore it in the scaling relation above.qlz, 百拇医药

    Models with higher Ne tend to be applicable over a narrower range of t and give more divergences because the more eigenvalues there are, the larger the largest one tends to be. This was already seen in the example with the LSU rRNA sequences in . We also can see from equation (14) that the evolutionary distance between two sequences that can be resolved accurately depends logarithmically on N. To increase the evolutionary distance probed requires an exponential increase in sequence length.

    What are the sequence lengths required to obtain a given level of accuracy if evolutionary distances are estimated using ML with fixed rate parameters? Again, one finds that to maintain a constant level of accuracy or constant probability of obtaining a divergent evolutionary distance estimate, the required scaling ismx0, 百拇医药

    Thus, the evolutionary distances that one can probe still depends logarithmically on N. However, one should note that the prefactor to log N is as opposed to when using log transform formulae. Thus, if one has a fixed reasonably accurate estimate of the underlying rate matrix, then the evolutionary distance estimate obtained by maximizing the likelihood with respect to t can be considerably more accurate than that obtained by using the appropriate log transform formulae. This is particularly the case when evolution of sequences is occurring through processes on vastly different timescales. For the K2P model, when = 10 (i.e., transitions occurring much more rapidly that transversions) the ratio max/1 = 5.

    Appendix A7f@-, 百拇医药

    Probability of Divergence in Log Transform Formulae7f@-, 百拇医药

    We take as our starting point a log transform formula of the form7f@-, 百拇医药

    which provides an estimate of the true evolutionary distance d between two sequences. The amplitudes Mi are assumed to depend only on the state frequencies {i, i = 1, ..., Nstate}. Ne is the number of distinct nonzero eigenvalues of the rate matrix model to which the log transform formula (16) applies. The Qi can be taken to be of the form7f@-, 百拇医药

    Here, the quantities S, 1, ..., M, represent observed fractions of various substitution classes. With the quantities S estimated from the sequence data, Qi provides estimates of exp(it) and in particular "Qi">"7f@-, 百拇医药

    = exp(it).7f@-, 百拇医药

    The distribution, P(S1, ..., SM) is multinomial and so, as N "->"

    {infty}5, http://www.100md.com

    , is well approximated by a multivariate Gaussian,5, http://www.100md.com

    For large N, we can consider the quantities S1, ..., SM and Q1, ..., to be continuously distributed between -{infty}5, http://www.100md.com

    and {infty}5, http://www.100md.com

    . Using the distribution (eq. 18), we calculate5, http://www.100md.com

    where5, http://www.100md.com

    The probability that the ith individual logarithmic term in equation (16) diverges is5, http://www.100md.com

    From this we can see that the individual logarithmic term in equation (16) that is estimating the fastest eigenvalue max is the most likely to produce a divergence, and so max predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.5, http://www.100md.com

    Assuming that the dominant contribution to equation (16) comes from estimating max, then the variance in such a distance estimate is equally easily derived by generalizing the analysis of

    Appendix B08s[}h4, http://www.100md.com

    Analysis of ML for a Sequence Pair with Fixed Rate Parameters08s[}h4, http://www.100md.com

    The likelihood of observing two sequences 1 and 2 is08s[}h4, http://www.100md.com

    where s1,k and s2,k label the states at the kth site for sequences 1 and 2, respectively.08s[}h4, http://www.100md.com

    We wish to derive the probability of obtaining an infinite value for the estimate of t which maximizes the likelihood (eq. 23) when keeping the rate matrix R fixed, i.e., the probability that the solution, tML, to {partial}08s[}h4, http://www.100md.com

    log L/{partial}08s[}h4, http://www.100md.com

    t = 0 is infinite. For the two sequence problems, we assume that there exists at most one maxima in t of the likelihood L at a finite value of tML. Thus, a divergent solution tML is equivalent to the asymptotic gradient of log L approaching zero from above, i.e., {partial}08s[}h4, http://www.100md.com

    log L/{partial}08s[}h4, http://www.100md.com

    t "->"

    0+, t "->"0!yd, 百拇医药

    {infty}0!yd, 百拇医药

    . Thus, we will focus on the distribution of limt"->"0!yd, 百拇医药

    {infty}
{partial}0!yd, 百拇医药

    log L/{partial}0!yd, 百拇医药

    t.0!yd, 百拇医药

    We take two sequences separated by branch length t0 and generated with rate matrix R. We consider our fixed estimate of R in equation (23) to be exact. As N "->"0!yd, 百拇医药

    {infty}0!yd, 百拇医药

    , we can take the distribution of x(t) = {partial}0!yd, 百拇医药

    logL/{partial}0!yd, 百拇医药

    t to be Gaussian, i.e., a probability density (2{pi} B1)-1/2 exp(- [x - B0]2/2B1), where B0 = "{partial}0!yd, 百拇医药

    logL/{partial}0!yd, 百拇医药

    t">"0!yd, 百拇医药

    and B1 = Var({partial}0!yd, 百拇医药

    logL/{partial}0!yd, 百拇医药

    t). Consequently, the probability of obtaining a divergent estimate for tML is1i, 百拇医药

    where C = limt"->"1i, 百拇医药

    {infty}
B0/. We find1i, 百拇医药

    A perturbative analysis of the accuracy of the ML estimate tML also can be obtained in a straightforward manner. The derivation is tedious, and so we quote only the final result here. Writing tML = t0 + {delta}1i, 百拇医药

    tML (t0 again being the true separation between the two sequences), we find that the mean squared error is given by1i, 百拇医药

    Consequently, in contrast to equation (22), the accuracy of the ML (with fixed rate parameters) estimate tML when the two sequences are well separated is on a distance scale determined by the slowest nonzero eigenvalue 1.1i, 百拇医药

    Acknowledgements1i, 百拇医药

    This work was supported by the U.K. Biotechnology and Biological Sciences Research Council.

    Literature Cited.., 百拇医药

    Atteson, K. 1997. The performance of the NJ method of phylogeny reconstruction. Pp. 133–148 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology, DIMACS series in discrete mathematics and theoretical computer science, Vol. 37. American Mathematical Society, Providence, R.I..., 百拇医药

    Barry, D., and J. A. Hartigan. 1987. Asynchronous distance between homologous DNA sequences. Biometrics 43:261- 276..., 百拇医药

    Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance- based phylogeny reconstruction. Mol. Biol. Evol 17:189- 197..., 百拇医药

    Casella, G., and R. L. Berger. 2002. Statistical inference. 2nd edition. Duxbury, Pacific Grove, Calif..., 百拇医药

    De Rijk, P., J. Wuyts, Y. Van De Peer, T. Winklemans, and R. De Wachter. 2000. The European large subunit ribosomal RNA database. Nucleic Acids Res 28:177-178 [sequence data available via ..., 百拇医药

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol 17:368-376.

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol 36:182-198.n)m/9o, http://www.100md.com

    Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol 32:443-445.n)m/9o, http://www.100md.com

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol 22:160-174.n)m/9o, http://www.100md.com

    Higgs, P. G. 2000. RNA secondary structure—physical and computational aspects. Q. Rev. Biophys 33:199-253.n)m/9o, http://www.100md.com

    Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol 19:1591- 1601.n)m/9o, http://www.100md.com

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–123 in H. N. Munro, ed. Mammalian protein metabolism III. Academic Press, New York.n)m/9o, http://www.100md.com

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol 16:111-120.

    Kimura, M., and T. Ohta. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol 2:87-90.m, http://www.100md.com

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.m, http://www.100md.com

    Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol 20:86-93.m, http://www.100md.com

    Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.m, http://www.100md.com

    Li, W.-H., and M. Gouy. 1991. Statistical methods for testing molecular phylogenies. Pp. 249–277 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York.m, http://www.100md.com

    Li, W.-H., and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. Pp. 449–459 in F. Russell, ed. Methods in enzymology. Academic Press, San Diego.m, http://www.100md.com

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol 11:605-612.

    Nei, M., J. C. Stephens, and N. Saitou. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol 2:66-85.#(%, 百拇医药

    Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol 142:485-501.#(%, 百拇医药

    Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol 6:514-525.#(%, 百拇医药

    Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol 4:406-426.#(%, 百拇医药

    Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157:399-411.#(%, 百拇医药

    Sourdis, J., and M. Nei. 1988. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol 5:298-311.

    Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett 7:19-24.$^, 百拇医药

    Strimmer, K. S. 1997. Maximum likelihood methods in molecular phylogenetics. Ph.D. thesis, University of Munich, Munich [available via ].$^, 百拇医药

    Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol 14:210-211.$^, 百拇医药

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol 13:964-969.$^, 百拇医药

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.$^, 百拇医药

    Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol 12:7-15.$^, 百拇医药

    Van De Peer, Y., P. De Rijk, J. Wuyts, T. Winklemans, and R. De Wachter. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res 28:175-176 [sequence data available via .$^, 百拇医药

    Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol 39:315-329.$^, 百拇医药

    Accepted for publication July 17, 2002.(D. C. Hoyle and P. G. Higgs)