
Evolutionary Distances Between Sequences

http://www.100md.com 《分子生物学进展》 (Advances in Molecular Biology), 2003, Issue 1

    School of Biological Sciences, University of Manchester

    Abstract

    Phylogenetic methods that use matrices of pairwise distances between sequences (e.g., neighbor joining) will only give accurate results when the initial estimates of the pairwise distances are accurate. For many different models of sequence evolution, analytical formulae are known that give estimates of the distance between two sequences as a function of the observed numbers of substitutions of various classes. These are often of a form that we call "log transform formulae". Errors in these distance estimates become larger as the time t since divergence of the two sequences increases. For long times, the log transform formulae can sometimes give divergent distance estimates when applied to finite sequences. We show that these errors become significant when t ~ 1/2 |λ_max|^-1 log N, where λ_max is the eigenvalue of the substitution rate matrix with the largest absolute value and N is the sequence length. Various likelihood-based methods have been proposed to estimate the values of parameters in rate matrices. If rate matrix parameters are known with reasonable accuracy, it is possible to use the maximum likelihood method to estimate evolutionary distances while keeping the rate parameters fixed. We show that errors in distances estimated in this way only become significant when t ~ 1/2 |λ₁|^-1 log N, where λ₁ is the eigenvalue of the substitution rate matrix with the smallest nonzero absolute value. The accuracy of likelihood-based distance estimates is therefore much higher than that of estimates based on log transform formulae, particularly in cases where there is a large range of timescales involved in the rate matrix (e.g., when the ratio of transition to transversion rates is large). We discuss several practical ways of estimating the rate matrix parameters before distance calculation and hence of increasing the accuracy of distance estimates.

    Key Words: molecular phylogeny • maximum likelihood • evolutionary distances • distance matrix

    Introduction

    There are now a large number of methods available for the construction of phylogenetic trees from molecular sequences. Where relatively small numbers of sequences are involved and large amounts of computer time are available, maximum likelihood (ML) inference of phylogenies can be used. The ML method is based on sound statistical principles and allows tests to be made for comparing alternative tree topologies and alternative models of sequence evolution. However, when large numbers of taxa are involved, the ML method becomes very slow, and practical methods of dealing with large sets of sequences are still required. The fastest phylogeny methods usually use matrices of evolutionary distances. The neighbor joining (NJ) method is one of the most popular of these, and tests have shown that it is relatively accurate in comparison with other heuristic clustering methods. If the distance matrix corresponds to an exactly additive tree, NJ reproduces the correct tree topology, and it is robust against small errors in the distance estimates between sequences. If errors in distance estimates are large, however, then NJ and other distance matrix methods will give incorrect tree topologies. In this article, we investigate the way in which the errors in the pairwise distances between sequences depend on the sequence length, the evolutionary model, and the degree of divergence between sequences.

    The problems with distance estimation are already apparent in the simplest model of sequence evolution, the Jukes-Cantor (JC) model. Evolutionary distance is usually measured in terms of the average number of substitutions per site, which we denote as d. For two aligned sequences of length N, the estimated evolutionary distance d̂ is given within the JC model by

    $$\hat{d} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right), \tag{1}$$

    where D is the observed fraction of sites that differ. We will refer to equations such as equation (1) as "log transform" formulae because they transform observed quantities (in this case D) to estimates of distances that are not themselves directly observable quantities. A general method of deriving log transform formulae is given in Error Analysis of Distance Estimates for a General Rate Model, equations (4)–(9). For finite sequences, the number n = ND of differing sites has a binomial distribution, and hence D fluctuates about its mean value. Because d̂ is a nonlinear function of D, the expectation value of d̂ is not exactly equal to d; hence, there is a small systematic error of order 1/N. In practice, this is dominated by the statistical error, which is of the order 1/N^1/2.
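    The size of the statistical error can be illustrated with a first-order (delta method) calculation. The sketch below is an illustration rather than part of the original analysis: it assumes the standard JC relation d̂ = -(3/4) ln(1 - 4D/3) and a binomial variance D(1 - D)/N for D, and the function names are hypothetical.

```python
import math


def jc_distance(D):
    """Jukes-Cantor log transform estimate from the observed fraction D of differing sites."""
    if D >= 0.75:
        raise ValueError("the JC estimate is undefined for D >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * D / 3.0)


def jc_standard_error(D, N):
    """First-order standard error of the JC estimate for sequence length N.

    Assumes Var(D) = D(1 - D)/N (binomial count of differing sites) and uses
    the slope d(d_hat)/dD = 1/(1 - 4D/3) of the log transform.
    """
    slope = 1.0 / (1.0 - 4.0 * D / 3.0)
    return slope * math.sqrt(D * (1.0 - D) / N)


if __name__ == "__main__":
    # Quadrupling N halves the standard error, i.e. the error scales as N^(-1/2).
    for N in (100, 400, 1600):
        print(N, jc_distance(0.3), jc_standard_error(0.3, N))
```

    The slope factor 1/(1 - 4D/3) grows without bound as D approaches 3/4, which is why the same sampling noise in D produces much larger errors in d̂ for strongly diverged pairs.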

    It is also apparent from equation (1) that the estimated distance is infinite or undefined if D ≥ 3/4. When the true d is large, D approaches 3/4 for an infinite sequence; therefore, there is a significant chance that the observed D in a finite sequence will be larger than 3/4 and that the distance estimate will be undefined. For small evolutionary distances, this will rarely occur. However, for any finite length N and for any true distance d, there is a nonzero probability that d̂ will be undefined.
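    As a rough illustration of how this probability depends on d and N, the following sketch sums the exact binomial tail for the JC model. It assumes that sites evolve independently with expected dissimilarity (3/4)(1 - exp(-4d/3)), which is the standard JC result; the function names are hypothetical.

```python
import math


def jc_expected_dissimilarity(d):
    """Expected fraction of differing sites under JC at true distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def prob_jc_undefined(d, N):
    """Probability that the observed D is >= 3/4, so that the JC estimate is undefined.

    The number of differing sites is treated as Binomial(N, p) with
    p = jc_expected_dissimilarity(d).
    """
    p = jc_expected_dissimilarity(d)
    if p <= 0.0:
        return 0.0
    n_min = (3 * N + 3) // 4  # smallest integer n with n/N >= 3/4
    total = 0.0
    for n in range(n_min, N + 1):
        # log of the binomial probability, via lgamma to avoid overflow
        log_term = (math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
                    + n * math.log(p) + (N - n) * math.log1p(-p))
        total += math.exp(log_term)
    return total


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 5.0):
        print(d, prob_jc_undefined(d, 100))
```

    The probability is never exactly zero for d > 0, but it is negligible until d is large enough for the expected dissimilarity to come within a few standard deviations of 3/4.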

    When using the JC model, the problems described above will only become important when d >> 1, in which case there is saturation of mutations. In this case there is very little phylogenetic information left in the sequences, and we probably should not be using these sequences for tree construction in the first place. However, we will show below that these problems tend to be more serious with more complex models. The effect of increasing the model complexity is illustrated by the next most simple model of evolution: the Kimura 2-parameter model (K2P). In this model the base frequencies are equal, but a distinction is made between transitions (occurring at rate (1/4)α) and transversions (occurring at rate (1/4)β). The ratio α/β is usually denoted by κ. The estimated evolutionary distance between two sequences is the sum of the estimated numbers of transversions and transitions per site and also can be written as a log transform formula, this time with two log terms,

    $$\hat{d} = -\frac{1}{2}\ln(1 - 2S - V) - \frac{1}{4}\ln(1 - 2V), \tag{2}$$

    where S and V are the observed fractions of sites that differ by transitions and transversions between the two sequences. There are now two timescales, one associated with transitions and one with transversions. Transitions are usually faster (κ > 1). The distance estimate can diverge because of divergences in either of the two log terms. The quantity 1 - 2S - V tends to zero in a time governed by the rapid transition rate, whereas the term 1 - 2V tends to zero on the slow transversion timescale. Divergences, therefore, tend to arise because of saturation of the transitions, even when the overall evolutionary distance is small. The errors in this formula, therefore, become significant at smaller evolutionary distances than for the JC model.
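    A minimal sketch of the K2P log transform estimate is given below, assuming the standard Kimura form with the two logarithmic arguments 1 - 2S - V and 1 - 2V described above. The function name and the convention of returning None for a divergent estimate are choices made for this illustration only.

```python
import math


def k2p_distance(S, V):
    """K2P log transform distance from observed transition (S) and transversion (V) fractions.

    Returns None when either logarithm has a non-positive argument,
    i.e. when the estimate diverges or is undefined.
    """
    q_fast = 1.0 - 2.0 * S - V   # decays on the fast (transition) timescale
    q_slow = 1.0 - 2.0 * V       # decays on the slow (transversion) timescale
    if q_fast <= 0.0 or q_slow <= 0.0:
        return None
    return -0.5 * math.log(q_fast) - 0.25 * math.log(q_slow)


if __name__ == "__main__":
    print(k2p_distance(0.10, 0.05))  # well-behaved pair
    print(k2p_distance(0.40, 0.25))  # transitions saturated: None although D = 0.65 < 3/4
```

    The second call shows the point made in the text: the fast term can fail while the total dissimilarity D = S + V is still below the JC limit of 3/4.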

    In addition to the JC and K2P models, many other rate matrix models have been proposed. The model of Hasegawa, Kishino, and Yano (HKY) and the model of Tamura and Nei (TN) have rate matrices for which the eigenvectors and eigenvalues are analytically soluble. Hence, an explicit log transform formula can be written. A distance formula for general reversible models has been derived in various forms. Other distance measures such as the paralinear distance and the LogDet distance have been introduced to address nonstationarity, i.e., heterogeneity in the base composition of the sequences studied.

    The observation that increasing model complexity can increase both the variance of distance estimates and the frequency with which undefined distance estimates are encountered has been made before (p. 84 for a discussion of this point). However, we wish to consider what generic features of a model affect the accuracy of distance estimates and the frequency of undefined distances.

    We will begin by showing a few problems with distance estimates in real data. We will then give a quantitative theory to expand on the above arguments concerning log transform formulae. We also consider likelihood-based methods of estimating distances where we fix the estimates of the rate parameters before estimation of the evolutionary distance. We show that these methods have errors associated with the slowest timescale in the model, and hence that these methods are more accurate.

    Examples of Distance Estimates in Real Sequence Data

    As an example with real data, we consider a set of large subunit (LSU) rRNA sequences obtained from an existing rRNA database. A set of 90 mitochondrial LSU sequences was used, including representatives from all groups of mitochondria-containing eukaryotes for which sequence data were available. The alignment in the database was taken without alteration. To exclude variable regions and poorly aligned regions from the analysis, we eliminated from the alignment those sites which contained a gap or unknown nucleotide for 10% or more of the sequences. In figure 1a, the evolutionary distance estimates for the K2P model (eq. 2) are plotted against the sequence dissimilarity D for each pair of sequences. This is a well-behaved data set for which none of the distance estimates diverge and for which there is a clear relationship between D and d̂, so that all the points appear to lie on a smooth curve.

    (Figure omitted.)

    FIG. 1. Plot of estimated evolutionary distance d̂ against sequence dissimilarity D (uncorrected distance) for K2P model log transform estimates (left hand graph, 1a), TN model log transform estimates (middle graph, 1b), and estimates from TN model using ML with fixed rate parameters (right hand graph, 1c), all on LSU rRNA sequences (see main text). Evolutionary distances of d̂ = -0.25 indicate a divergent estimate.

    Figure 1b shows the same data with distance estimates calculated with the TN model, using the corresponding log transform formula. In this case, the distance estimate diverges for 1.43% of sequence pairs. Pairs that diverge are plotted as points at d̂ = -0.25 in the lower portion of the graph. Divergence occurs only for widely separated pairs with D between 0.6 and 0.7. Also apparent from figure 1b is that the scatter of distance estimates is greater for the TN model than for the K2P model, and the relationship between D and d̂ does not seem to be so well defined. In particular, at large D there are a number of outlier points, with d̂ well above the main curve. In any case, the TN distances could not be used without some finite value being assigned to the pairs that diverge.

    We can also use likelihood ratio tests for model selection purposes, and this gives results that are contradictory to those above. Because both the K2P and TN model distance estimates can be derived by maximizing the likelihood in terms of the observed numbers of transversions and transitions, and the K2P model can be considered as a special case of the TN model, we can apply a likelihood ratio test to each sequence pair from this data set. It is straightforward to calculate the log-likelihood values for the two models for any pair of sequences and, from this, to determine the difference in the log-likelihoods. Because the two models are nested, we know that twice this difference is distributed according to a χ² distribution under the hypothesis that the simpler K2P model is correct. On performing such a calculation for each possible pair of sequences, we find that the TN model provides a significantly better fit to the data (P < 0.05) for 99.8% of the sequence pairs for which the distance estimate does not diverge. From this, we would conclude that the TN model provides an overwhelmingly better explanation of the observed sequence data than does the K2P model. (We note that the likelihood ratio test is usually applied to the likelihood of whole trees, whereas here we are applying it simply to the likelihood of each sequence pair.)

    Our second example consists of 455 small subunit (SSU) rRNA sequences taken from an existing rRNA database, chosen so that there is one example of a sequence from each genus of Eubacteria. This data set has been analyzed in previous articles because it is an example of sequence evolution under the constraint of a conserved RNA secondary structure. These articles (and references therein) discuss a range of models for evolution of the paired stem regions of RNA sequences. Stem regions evolve through compensatory substitutions that maintain the pairing pattern; hence, the substitutions in the sites on either side of a pair are strongly correlated, and it is necessary to consider the evolution of the pair as a single unit rather than as two separate sites. One appropriate model is that of Tillier and Collins, which considers six allowed paired states, AU, GU, GC, UA, UG, and CG (occurring with frequencies π_AU, ..., π_CG), and has rate parameters α, β, and γ quantifying the rates of double transitions, double transversions, and single transitions, respectively. The structure of the model is simple enough for an explicit log transform formula to be calculated, namely d̂ = K̂_V + K̂_γ + K̂_α. K̂_V, K̂_γ, and K̂_α are estimators of K_V, K_γ, and K_α, the expected numbers of transversions, single transitions, and double transitions, respectively. These are given as K_V = βt, K_γ = 8γφ₂(φ₁ + φ₃)t, and K_α = 8αφ₁φ₃t, where φ₁ = 1/2 (π_AU + π_UA), φ₂ = 1/2 (π_GU + π_UG), and φ₃ = 1/2 (π_GC + π_CG). The estimators K̂_V, K̂_γ, and K̂_α are

    where V, S₁, and S₂ are the observed fractions of transversions, single transitions, and double transitions, respectively. Clearly, the ratios of the rate parameters also can be estimated from K̂_V, K̂_γ, and K̂_α, i.e., κ₁ = α/β is estimated by κ̂₁ = K̂_α/(8φ₁φ₃K̂_V), and κ₂ = γ/β is estimated by κ̂₂ = K̂_γ/(8φ₂(φ₁ + φ₃)K̂_V). Figure 2a shows the estimated distances using this formula. A divergent distance estimate occurs for 0.91% of the sequence pairs. These divergent pairs (shown in the lower portion of the graph at d̂ = -0.25) occur over a range of dissimilarity that begins with a value of D as small as 0.25. There also is considerable scatter in the distance estimates for the points that do not diverge. The problems of distance estimation are therefore quite serious in this example.

    (Figure omitted.)

    FIG. 2. Plot of estimated evolutionary distance d against sequence dissimilarity D for pairs of sequences from the SSU rRNA data set (see main text). Left hand graph (2a) shows results from six-state model with distances estimated using log transform formula. Right hand graph (2b) shows results from Tillier and Collins six-state model with distances estimated from ML with fixed rate parameters. Approximately, only every fifth data point is actually plotted on the graphs. Evolutionary distances of d = -0.25 indicate a divergent estimate

    Distance Estimates with Fixed Rate Parameters

    The problem with the pairwise distance estimates is that the values of parameters like the base frequencies and rate matrix parameters α, β, etc. are estimated separately for each pair, using only information from that pair, i.e., the parameters have different values for each sequence pair. Usually when building a tree, we make the assumption that the rate matrix parameters are constant during evolution of the whole tree. It, therefore, makes sense to obtain a single estimate of the parameters using information from all the sequences and then to estimate ML distances for each sequence pair while keeping the parameters fixed at their previously estimated values. The accuracy of pairwise distance estimates obtained in this way is analyzed in ML Distance Estimates with Fixed Rate Parameters and Appendix B.
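    The procedure can be sketched for the K2P model as follows. This is an illustration under stated assumptions, not the implementation used for the results in this article: it assumes each of the two transversions occurs at rate β/4 and the transition at rate α/4, normalizes the rates so that one substitution per site is expected per unit time (so d = t), holds κ = α/β fixed, and maximizes the pairwise likelihood over d by a grid scan followed by golden-section refinement. All function names, the search bound d_max, and the convention of returning None for a divergent estimate are hypothetical.

```python
import math


def k2p_expected_fractions(d, kappa):
    """Expected transition (P_s) and transversion (P_v) fractions at distance d.

    Assumed normalization: alpha/4 + beta/2 = 1, with kappa = alpha/beta.
    """
    beta = 4.0 / (kappa + 2.0)
    alpha = kappa * beta
    e_slow = math.exp(-beta * d)                  # transversion timescale
    e_fast = math.exp(-0.5 * (alpha + beta) * d)  # transition timescale
    p_v = 0.5 * (1.0 - e_slow)
    p_s = 0.25 + 0.25 * e_slow - 0.5 * e_fast
    return p_s, p_v


def log_likelihood(d, kappa, n_same, n_ts, n_tv):
    """Pairwise log-likelihood (up to a constant) for counts of site classes."""
    p_s, p_v = k2p_expected_fractions(d, kappa)
    probs = (1.0 - p_s - p_v, p_s, p_v)
    counts = (n_same, n_ts, n_tv)
    floor = 1e-300  # guards against log(0) at d = 0
    return sum(n * math.log(max(p, floor)) for n, p in zip(counts, probs) if n > 0)


def ml_distance_fixed_kappa(n_same, n_ts, n_tv, kappa, d_max=50.0, grid=2000):
    """ML distance with kappa held fixed; None if the maximum sits at d_max (treated as divergent)."""
    def f(d):
        return log_likelihood(d, kappa, n_same, n_ts, n_tv)

    step = d_max / grid
    best = max(range(grid + 1), key=lambda k: f(k * step))
    if best == grid:
        return None
    lo, hi = max(best - 1, 0) * step, (best + 1) * step
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > 1e-9:  # golden-section refinement around the best grid point
        x1, x2 = hi - g * (hi - lo), lo + g * (hi - lo)
        if f(x1) < f(x2):
            lo = x1
        else:
            hi = x2
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    p_s, p_v = k2p_expected_fractions(0.5, 10.0)
    N = 500
    print(ml_distance_fixed_kappa(N * (1 - p_s - p_v), N * p_s, N * p_v, 10.0))
```

    Because only one parameter is optimized, a single-variable search is sufficient; the same pattern would apply to other rate models once their transition probabilities are available.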

    The most natural way of estimating the parameters would be simply to do an ML calculation of the optimum tree and parameter values for the whole set of sequences. However, this is not practical for large data sets of sequences without spending huge amounts of computer time. Indeed, if we were able to carry out ML for the complete set of sequences, we would not be particularly interested in using distance matrix methods anyway.

    One way around the problem is to take a subset of sequences and do the ML calculation and then to use the ML parameters estimated from this subset when estimating pairwise distances for the whole set. Figure 1c shows the distances obtained in this way for the LSU rRNA sequences with the TN model. The estimates of the rate parameters have been obtained using TREE-PUZZLE on 10 sequences selected at random from the data set. The pairwise distances are then obtained for each sequence pair by maximizing the likelihood while keeping the rate parameters fixed at the values estimated by TREE-PUZZLE. There are now no divergent distance estimates, and the scatter of points at high D is substantially reduced in comparison with figure 1b. It is now practical to use the TN model on this data, which we know was the preferred model according to the likelihood ratio tests.

    The model of Tillier and Collins for paired sites in RNAs is not implemented in TREE-PUZZLE or other currently available programs. We are currently in the process of developing a package that will use ML and Monte Carlo Markov Chain methods to construct phylogenies of RNAs using a variety of paired-site models. This will be available shortly. However, for the present article, we used a much simpler method to estimate the rate parameters for the model of Tillier and Collins in the example with the SSU rRNA sequences.

    For each sequence pair of the SSU rRNA data set, the parameter ratios κ₁ and κ₂ can be estimated as outlined in the previous section. We obtain final estimates of κ₁ and κ₂ by averaging finite estimates over all the 455 x 454/2 sequence pairs of the SSU rRNA data set. κ₁ and κ₂ are then held fixed at these values. Fixing κ₁ and κ₂ (along with the usual timescale normalization condition that the expected number of substitutions per site per unit time is 1) completely specifies all three rate parameters, α, β, and γ. Specifically, β = (1 + 8φ₁φ₃κ₁ + 8φ₂[φ₁ + φ₃]κ₂)^-1, α = κ₁β, and γ = κ₂β. Figure 2b shows ML distances calculated from the paired states in the SSU rRNA sequences and using the six-state model of Tillier and Collins with fixed rate parameters estimated in this fashion. There are now no divergent distance estimates and the scatter of points is reduced, indicating that distance estimates are more reliable than those with the log transform formula.
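    The averaging and normalization steps just described can be sketched as below. This is only an illustration: the names kappa1, kappa2, phi1, phi2, and phi3 are labels chosen here for the two rate ratios (double transition and single transition rates, each divided by the double transversion rate) and for the three pair-frequency sums, and the function names are hypothetical.

```python
import math


def average_finite(values):
    """Mean of the finite entries, skipping pairs whose estimate diverged (inf, nan, or None)."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    if not finite:
        raise ValueError("no finite estimates to average")
    return sum(finite) / len(finite)


def six_state_rates(kappa1, kappa2, phi1, phi2, phi3):
    """Rate parameters of the six-state model from fixed ratios.

    Uses the normalization that one substitution per site is expected per
    unit time: beta * (1 + 8*phi1*phi3*kappa1 + 8*phi2*(phi1 + phi3)*kappa2) = 1.
    """
    beta = 1.0 / (1.0 + 8.0 * phi1 * phi3 * kappa1
                  + 8.0 * phi2 * (phi1 + phi3) * kappa2)
    return {
        "double_transition": kappa1 * beta,
        "double_transversion": beta,
        "single_transition": kappa2 * beta,
    }


if __name__ == "__main__":
    # Hypothetical per-pair ratio estimates; the infinite one is ignored.
    k1 = average_finite([1.8, 2.2, float("inf"), 2.0])
    k2 = average_finite([5.5, 6.5, 6.0])
    print(six_state_rates(k1, k2, phi1=0.2, phi2=0.1, phi3=0.2))
```

    Averaging only the finite per-pair ratios discards exactly those pairs for which the log transform estimators fail, so the fixed values are dominated by well-behaved pairs.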

    The key point of this section is that if good estimates of the rate parameters are known by ML estimation on a tree or by averaging the estimates of parameter ratios from each pair as described above, then accurate estimates of the pairwise distances can be obtained by maximizing the likelihoods for each pair with the fixed values of the rate parameters. These estimates suffer much less from problems of divergence than do estimates using the log transform formulae. In the next two sections, we consider in detail why this is so. Those readers not interested in the mathematical detail can proceed to Discussion and Conclusions, where we discuss the implications of these findings.

    Error Analysis of Distance Estimates for a General Rate Model

    Consider a rate matrix R, whose elements R_ij (R_ij > 0 ∀ i ≠ j) define the rate of substitution from state i to state j. Thus, the probability P_ij(t) of finding a particular nucleotide in state j at time t, given that it was initially in state i, obeys the simple evolution equation

    $$\frac{dP_{ij}(t)}{dt} = \sum_k P_{ik}(t)\,R_{kj}. \tag{4}$$

    R is defined such that the diagonal elements R_ii = -Σ_{j≠i} R_ij and also so that the substitution process is reversible, i.e., π_i R_ij = π_j R_ji ∀ i, j. At long times, P_ij(t) tends to π_j, the equilibrium frequency of state j.
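    These defining properties are easy to check numerically for a concrete model. The sketch below builds a K2P rate matrix under the assumption that each transversion occurs at rate β/4 and each transition at rate α/4 (state order A, G, C, T is a choice made here), and verifies that rows sum to zero and that detailed balance holds with equal frequencies. The function names are hypothetical.

```python
def k2p_rate_matrix(alpha, beta):
    """4x4 K2P rate matrix for states ordered A, G, C, T (assumed parameterization)."""
    transitions = {(0, 1), (1, 0), (2, 3), (3, 2)}  # A<->G and C<->T
    R = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                R[i][j] = alpha / 4.0 if (i, j) in transitions else beta / 4.0
        # diagonal element is minus the sum of the off-diagonal rates in the row
        R[i][i] = -sum(R[i][j] for j in range(4) if j != i)
    return R


def is_reversible(R, pi, tol=1e-12):
    """Check detailed balance pi_i R_ij = pi_j R_ji for all i, j."""
    n = len(R)
    return all(abs(pi[i] * R[i][j] - pi[j] * R[j][i]) <= tol
               for i in range(n) for j in range(n))


def mean_substitution_rate(R, pi):
    """Expected number of substitutions per site per unit time, -sum_i pi_i R_ii."""
    return -sum(p * R[i][i] for i, p in enumerate(pi))


if __name__ == "__main__":
    kappa = 10.0
    beta = 4.0 / (kappa + 2.0)  # normalizes the mean rate to 1
    R = k2p_rate_matrix(kappa * beta, beta)
    pi = [0.25] * 4
    print(is_reversible(R, pi), mean_substitution_rate(R, pi))
```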

    We have labeled the eigenvalues of R as λ_i, where i = 0, ..., N_state - 1. There is one eigenvalue λ₀ which is zero, whereas the other eigenvalues are negative. We have assumed the eigenvalues are ordered so that |λ₁| ≤ |λ₂| ≤ ··· ≤ |λ_{N_state-1}|. We also label λ_{N_state-1} as λ_max because it is the eigenvalue of the largest absolute value. We also can define the matrices U and V whose columns are formed from the eigenvectors of R and R^T, respectively. The transition probability matrix P can then be written in terms of this decomposition as

    The evolutionary distance d between two sequences is defined as the expected number of nucleotide changes per site, from which one finds d = -t Σ_i π_i R_ii. This can be written as a function of the eigenvalues as

    To derive the log transform formula, the P_ij(t) in equation (5) is replaced by the observed values in the pair of sequences in question, i.e., equating the expectation value P_ij(t) with the observed number of substitutions of that type. Thus, the derivation essentially uses the "Method of Moments" (see, for example, p. 312). This gives a set of simultaneous equations for the unknown quantities exp(λ_k t). The number of unknowns is equal to the number of distinct nonzero eigenvalues of R, and we call this N_e. Solving these equations for λ_k t and substituting into equation (6) gives an estimate of the evolutionary distance as a log transform formula

    The quantities M_k depend on the frequencies of the states π_i but not on the rate parameters. Therefore, they can easily be estimated from the observed state frequencies. The Q_i can be taken to be of the form

    Here, the quantities S₁, ..., S_M represent observed fractions of various substitution classes, e.g., S and V in the K2P model. The coefficients B_k will be explicit from the particular log transform formula used. The log transform formulae given above (eqs. 1 and 2) are both of this form, as are those for the TN model and the RNA model used in Distance Estimates with Fixed Rate Parameters.

    In Appendix A, we carry out an analysis of the errors in a log transform formula of the general form given in equation (7). The probability that the ith individual logarithmic term in equation (7) diverges is

    $$P(Q_i \le 0) = \frac{1}{2}\,\mathrm{erfc}\left(\langle Q_i\rangle\sqrt{\frac{N}{2C_{ii}}}\right), \tag{9}$$

    where

    $$C_{ii} = N\left(\langle Q_i^2\rangle - \langle Q_i\rangle^2\right). \tag{10}$$

    Here, the notation ⟨·⟩ represents the expectation value, so that ⟨Q_i⟩ = exp(λ_i t). From equation (9), we can see that the individual logarithmic term in equation (7) that is estimating the fastest eigenvalue λ_max is the most likely to produce a divergence, and so λ_max predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.

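    The total probability that the estimate (eq. 7) diverges is p_div = 1 - P(Q₁ > 0, Q₂ > 0, ..., Q_{N_e} > 0). Obtaining a generic analytic form for p_div is not possible. However, we approximate P(Q₁ > 0, Q₂ > 0, ..., Q_{N_e} > 0) ≈ P(Q₁ > 0)P(Q₂ > 0)...P(Q_{N_e} > 0), i.e., we approximate the Q_i as being independent. This gives

    $$p_{\mathrm{div}} \simeq 1 - \prod_{i=1}^{N_e}\left[1 - \frac{1}{2}\,\mathrm{erfc}\left(\langle Q_i\rangle\sqrt{\frac{N}{2C_{ii}}}\right)\right], \tag{11}$$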

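    where the complementary error function erfc(x) = 1 - erf(x). In the case of the K2P model, one has

    $$p_{\mathrm{div}} \simeq 1 - \left[1 - \frac{1}{2}\,\mathrm{erfc}\left((1 - 2P_v)\sqrt{\frac{N}{2C_{11}}}\right)\right]\left[1 - \frac{1}{2}\,\mathrm{erfc}\left((1 - 2P_s - P_v)\sqrt{\frac{N}{2C_{22}}}\right)\right], \tag{12}$$

    where C₁₁ = 4P_v(1 - P_v), C₂₂ = P_v + 4P_s - (P_v + 2P_s)², with P_v, P_s being the expected fractions of transversions and transitions, i.e., P_v = ⟨V⟩ and P_s = ⟨S⟩.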
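    The K2P case can be evaluated numerically with a few lines of code. The sketch below is an illustration under stated assumptions: it uses the Gaussian tail probability (1/2) erfc(⟨Q⟩ (N/2C)^1/2) for each logarithmic term together with the independence approximation, the quantities C₁₁ and C₂₂ as defined above, and an assumed K2P parameterization (each transversion at rate β/4, the transition at rate α/4, rates normalized so that d = t). The function name is hypothetical.

```python
import math


def k2p_pdiv_theory(d, kappa, N):
    """Approximate divergence probability of the K2P log transform estimate."""
    beta = 4.0 / (kappa + 2.0)            # assumed normalization: alpha/4 + beta/2 = 1
    alpha = kappa * beta
    q_slow = math.exp(-beta * d)                  # <1 - 2V>
    q_fast = math.exp(-0.5 * (alpha + beta) * d)  # <1 - 2S - V>
    p_v = 0.5 * (1.0 - q_slow)
    p_s = 0.25 + 0.25 * q_slow - 0.5 * q_fast
    c11 = 4.0 * p_v * (1.0 - p_v)
    c22 = p_v + 4.0 * p_s - (p_v + 2.0 * p_s) ** 2

    def tail(mean, c):
        # Gaussian probability that the logarithm's argument is <= 0
        if c <= 0.0:
            return 0.0
        return 0.5 * math.erfc(mean * math.sqrt(N / (2.0 * c)))

    return 1.0 - (1.0 - tail(q_slow, c11)) * (1.0 - tail(q_fast, c22))


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 5.0, 10.0, 20.0):
        print(d, k2p_pdiv_theory(d, kappa=10.0, N=500))
```

    With a large κ the output shows two separate rises, the first driven by the transition term and the second by the transversion term, approaching 3/4 at very large d.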

    The approximation that Q₁ and Q₂ are independent is, as we shall see from simulations, a reasonably accurate one for the K2P model. If the transition to transversion ratio is large, then there is a clear separation of timescales between the two processes. Thus, p_div consists of two step-like increases at evolutionary distances d₁ and d₂, and the range of applicability is governed by the shorter of the two, d₁. To test the accuracy of the above approximation, simulations were performed by generating pairs of random sequences with known evolutionary distance d and by calculating the fraction of pairs for which d̂ diverges. Figure 3 shows simulation results for the K2P model with sequence length N = 500 and κ = 10. The theoretical estimate is in excellent agreement with the simulation estimate of p_div. The separation of the two distance scales is very clear because κ is large. We also performed simulations with κ = 2 (not shown). In this case, the two steps merge together to a single large step. The theory and simulation are also in good agreement in this case.
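    A simulation of this kind can be sketched as follows. This is not the code used for figure 3: rather than generating explicit nucleotide sequences, it draws each site independently as identical, transition, or transversion with the K2P probabilities, which is equivalent for this model with equal base frequencies. The parameterization (each transversion at rate β/4, rates normalized so that d = t), the function name, and the random seed are assumptions of this illustration.

```python
import math
import random


def simulate_k2p_pdiv(d, kappa, N, replicates=2000, seed=1):
    """Fraction of simulated sequence pairs for which the K2P log transform estimate diverges."""
    rng = random.Random(seed)
    beta = 4.0 / (kappa + 2.0)
    alpha = kappa * beta
    e_slow = math.exp(-beta * d)
    e_fast = math.exp(-0.5 * (alpha + beta) * d)
    p_v = 0.5 * (1.0 - e_slow)                    # transversion difference
    p_s = 0.25 + 0.25 * e_slow - 0.5 * e_fast     # transition difference

    diverged = 0
    for _ in range(replicates):
        n_s = n_v = 0
        for _ in range(N):
            u = rng.random()
            if u < p_s:
                n_s += 1
            elif u < p_s + p_v:
                n_v += 1
        # divergence when 1 - 2S - V <= 0 or 1 - 2V <= 0 (integer form avoids rounding)
        if N - 2 * n_s - n_v <= 0 or N - 2 * n_v <= 0:
            diverged += 1
    return diverged / replicates


if __name__ == "__main__":
    for d in (1.0, 2.0, 5.0, 10.0):
        print(d, simulate_k2p_pdiv(d, kappa=10.0, N=500))
```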

    (Figure omitted.)

    FIG. 3. Plot of simulated and theoretical divergence probability of log transform and ML (fixed rate matrix parameters) distance estimates for the K2P model with κ = 10 against evolutionary distance. The upper two curves (simulation = diamonds, theory = circles) are for the log transform estimate (eq. 2) and the lower two curves (simulation = triangles, theory = squares) are for ML with fixed rate parameters. Simulation averages evaluated over 2,000 replicates. Sequence length N = 500. One percent divergence probability occurs at approximate evolutionary distances of d = 1.40 and d = 6.65 for the two methods, respectively.

    For all models we have studied in this article, the approximation that the Q_i in equation (7) are independent is found to be a good one. Certainly, for any model where the distinct nonzero eigenvalues are well separated, it is valid for p_div ≤ 1/2 because within this range p_div is predominantly controlled by a single eigenvalue λ_max. For p_div > 1/2, we consider the inaccuracy introduced by assuming that the Q_i in equation (7) are independent to be largely unimportant in comparison with the fact that at this stage greater than 50% of the distance estimates will be undefined. From equation (11), we see that for a general rate model p_div consists of N_e step-like increases reaching a limiting value (for long sequences and large evolutionary distances) of 1 - 2^-N_e.

    ML Distance Estimates with Fixed Rate Parameters

    Our calculations on real data sets revealed that no divergent pairwise evolutionary distance estimates were obtained if the rate parameters were held fixed and not simultaneously estimated from the sequence data of the pair. We now consider this point in more detail and determine what controls the probability of a divergent estimate and the accuracy of that estimate when the rate parameters are held fixed.

    Let us consider methods of distance estimation based on ML. We suppose that estimates of all the parameters in the rate matrix are known and that we maximize the likelihood with respect to only the distance t. If the specified values of the rate parameters are equal to the true values, we find (see Appendix B) that the probability of obtaining a divergent distance estimate, for large N and t, is now controlled by the slowest nonzero eigenvalue λ₁. The limiting value, as t → ∞, for the probability of divergence is 1/2 (to leading order in N^-1), irrespective of the rate matrix used. For the K2P model, we find

    The result from this equation is shown in figure 3 in comparison with estimates of p_div from simulation. The sequence data have been generated with κ₀ = 10.0, and the same value has been specified in the ML calculation. It can be seen that the range of applicability of the ML distance estimate (with fixed rate parameters) is much larger because the divergence probability is a function of the slow timescale (transversions) only. The simulation results are in excellent agreement with the theory, indicating that the probability of obtaining a divergent evolutionary distance estimate is predominantly controlled by the slowest nonzero eigenvalue λ₁.

    In Appendix B, we derive a perturbative result for the RMS error of the ML estimate (with fixed rate parameters) of d. Again the theoretical result (eq. 26) is in excellent agreement with simulation (not shown). Consequently, we also can conclude that the error in the ML estimate of d is indeed controlled by the slowest nonzero eigenvalue λ₁.

    In general, we will not know a priori the true parameters in the rate matrix. If the sequence data have been generated by the same class of model that is used to estimate the pairwise evolutionary distances, then for the methods we have used to estimate the rate parameters, e.g., for ML on a fixed tree (as in TREE-PUZZLE), we expect the estimates of rate parameters to become more accurate as sequence length N is increased. In this case, our analysis in Appendix B is unchanged to leading order in N. Consequently, for increasing sequence length, distance estimates are still controlled by the slowest eigenvalue λ₁. Obviously, for any fixed sequence length N, the exact value of ⟨(δt)²⟩ will depend on precisely how close the estimates of the rate matrix parameters are to the true values.

    Discussion and Conclusions

    The focus of this article has been on the accuracy and reliability of distance matrix methods for phylogeny reconstruction. Our analysis has shown that distance estimates from log transform formulae are accurate only on the shortest distance scale, |λ_max|^-1, defined by the underlying rate matrix R and can frequently give rise to undefined distances. Pairwise distance estimates can be improved using ML if the rate parameters are known with reasonable accuracy, for example, by estimation from more than just the two particular sequences under consideration. In such cases, pairwise distance estimates are accurate on the longest distance scale, |λ₁|^-1, defined by R. With improved pairwise distance estimates, better phylogenies can be constructed using NJ or variants such as Weighbor.

    Both the probability of divergence (eq. 11) and the estimate of the variance of d̂ (eq. 22 in Appendix A) provide a measure of how much sequence information is required to resolve a given evolutionary distance between two sequences using log transform formulae. To maintain a fixed probability of divergence p_div or fixed variance of the evolutionary distance estimate, we have the scaling (for large evolutionary distances)

    Because the variance coefficient tends to a constant, independent of N, as t → ∞, we can, for large sequence lengths N, consequently ignore it in the scaling relation above.

    Models with higher N_e tend to be applicable over a narrower range of t and give more divergences because the more eigenvalues there are, the larger the largest one tends to be. This was already seen in the example with the LSU rRNA sequences in figure 1. We also can see from equation (14) that the evolutionary distance between two sequences that can be resolved accurately depends logarithmically on N. To increase the evolutionary distance probed requires an exponential increase in sequence length.
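    The practical consequence of the logarithmic dependence can be seen from a short calculation. The sketch below uses the order-of-magnitude relation t ~ 1/2 |λ|^-1 log N quoted in the abstract, dropping constants of order one; the function names and the example eigenvalue are hypothetical.

```python
import math


def resolvable_time(eigenvalue, N):
    """Approximate time up to which distances are resolved: (1/2) log(N) / |eigenvalue|."""
    return 0.5 * math.log(N) / abs(eigenvalue)


def required_length(eigenvalue, t):
    """Approximate sequence length needed to resolve time t: exp(2 |eigenvalue| t)."""
    return math.exp(2.0 * abs(eigenvalue) * t)


if __name__ == "__main__":
    lam = -2.0  # hypothetical controlling eigenvalue
    for N in (100, 10_000, 100_000_000):
        # each squaring of N only doubles the resolvable time
        print(N, resolvable_time(lam, N))
```

    Under this relation, doubling the resolvable time requires squaring the sequence length, while halving the magnitude of the controlling eigenvalue doubles the resolvable time at fixed N.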

    What are the sequence lengths required to obtain a given level of accuracy if evolutionary distances are estimated using ML with fixed rate parameters? Again, one finds that to maintain a constant level of accuracy or constant probability of obtaining a divergent evolutionary distance estimate, the required scaling is

    Thus, the evolutionary distances that one can probe still depend logarithmically on N. However, one should note that the prefactor to log N is 1/2 |λ₁|^-1 as opposed to 1/2 |λ_max|^-1 when using log transform formulae. Thus, if one has a fixed reasonably accurate estimate of the underlying rate matrix, then the evolutionary distance estimate obtained by maximizing the likelihood with respect to t can be considerably more accurate than that obtained by using the appropriate log transform formulae. This is particularly the case when evolution of sequences is occurring through processes on vastly different timescales. For the K2P model, when κ = 10 (i.e., transitions occurring much more rapidly than transversions) the ratio λ_max/λ₁ = 5.

    Appendix A

    Probability of Divergence in Log Transform Formulae

    We take as our starting point a log transform formula of the form

    which provides an estimate d̂ of the true evolutionary distance d between two sequences. The amplitudes M_i are assumed to depend only on the state frequencies {π_i, i = 1, ..., N_state}. N_e is the number of distinct nonzero eigenvalues of the rate matrix model to which the log transform formula (16) applies. The Q_i can be taken to be of the form

    Here, the quantities S₁, ..., S_M represent observed fractions of various substitution classes. With the quantities S₁, ..., S_M estimated from the sequence data, Q_i provides estimates of exp(λ_i t) and in particular ⟨Q_i⟩ = exp(λ_i t).

    The distribution $P(S_1, \ldots, S_M)$ is multinomial and so, as $N \to \infty$, is well approximated by a multivariate Gaussian,

    For large $N$, we can consider the quantities $S_1, \ldots, S_M$ and $Q_1, \ldots, Q_{N_e}$ to be continuously distributed between $-\infty$ and $\infty$. Using the distribution (eq. 18), we calculate

    where

    The probability that the $i$th individual logarithmic term in equation (16) diverges is

    From this we can see that the individual logarithmic term in equation (16) that is estimating the fastest eigenvalue $\lambda_{\max}$ is the most likely to produce a divergence, and so $\lambda_{\max}$ predominantly determines the likelihood of obtaining a divergent evolutionary distance estimate from a log transform formula.
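    The argument can be illustrated in the simplest setting, the Jukes-Cantor model, which has a single nonzero eigenvalue. The sketch below is an assumption-laden example rather than the expression derived in this appendix: it takes the standard Jukes-Cantor formula $d = -\tfrac{3}{4}\log(1 - 4p/3)$, with $p$ the observed fraction of differing sites, uses the binomial variance of $p$, and applies a Gaussian approximation to estimate the probability that the argument of the logarithm is not positive. The function names are hypothetical.

```python
import math

# Single nonzero eigenvalue of the Jukes-Cantor rate matrix when the mean
# substitution rate is scaled to 1 (assumption of this sketch).
LAMBDA_JC = -4.0 / 3.0

def divergence_prob(d, n_sites):
    """Gaussian estimate of P(Q <= 0), where Q = 1 - 4p/3 estimates exp(lambda*d)."""
    if d <= 0 or n_sites <= 0:
        raise ValueError("d and n_sites must be positive")
    mean_q = math.exp(LAMBDA_JC * d)           # expected value of Q
    p = 0.75 * (1.0 - mean_q)                  # expected fraction of differing sites
    sd_q = (4.0 / 3.0) * math.sqrt(p * (1.0 - p) / n_sites)
    return 0.5 * math.erfc(mean_q / (math.sqrt(2.0) * sd_q))

def crossover_distance(n_sites):
    """Distance where mean(Q) equals its large-d standard deviation 1/sqrt(3N)."""
    return math.log(3.0 * n_sites) / (2.0 * abs(LAMBDA_JC))

for n in (100, 1000, 10000):
    d_c = crossover_distance(n)
    print(n, round(d_c, 3), divergence_prob(d_c, n))
```

    In this special case the crossover distance grows as $\log(3N)/(2|\lambda|)$, which has the same logarithmic form as the scaling discussed in the main text.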

    Assuming that the dominant contribution to equation (16) comes from estimating $\lambda_{\max}$, the variance in such a distance estimate is equally easily derived by generalizing the analysis of

    Appendix B

    Analysis of ML for a Sequence Pair with Fixed Rate Parameters

    The likelihood of observing two sequences 1 and 2 is

    where $s_{1,k}$ and $s_{2,k}$ label the states at the $k$th site for sequences 1 and 2, respectively.

    We wish to derive the probability of obtaining an infinite value for the estimate of $t$ which maximizes the likelihood (eq. 23) when keeping the rate matrix $R$ fixed, i.e., the probability that the solution, $t_{ML}$, to $\partial \log L/\partial t = 0$ is infinite. For the two-sequence problem, we assume that there exists at most one maximum in $t$ of the likelihood $L$ at a finite value of $t_{ML}$. Thus, a divergent solution $t_{ML}$ is equivalent to the asymptotic gradient of $\log L$ approaching zero from above, i.e., $\partial \log L/\partial t \to 0^{+}$ as $t \to \infty$. Thus, we will focus on the distribution of $\lim_{t \to \infty} \partial \log L/\partial t$.
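    A minimal sketch of such a fixed-rate ML estimate is given below for the K2P model. It is an illustration under stated assumptions, not the authors' implementation: `alpha` is taken to be the transition rate and `beta` the rate of each type of transversion, sites are summarized by counts of identical, transition, and transversion pairs using Kimura's standard two-parameter probabilities, and the single-maximum assumption made in the text justifies a golden-section search. An estimate that runs into the hypothetical cutoff `t_max` is reported as divergent.

```python
import math

def k2p_class_probs(t, alpha, beta):
    """Probabilities that a site pair is identical, a transition, or a transversion."""
    e1 = math.exp(-4.0 * beta * t)             # mode with eigenvalue -4*beta
    e2 = math.exp(-2.0 * (alpha + beta) * t)   # mode with eigenvalue -2*(alpha+beta)
    p_ts = 0.25 + 0.25 * e1 - 0.5 * e2
    p_tv = 0.5 - 0.5 * e1
    return 1.0 - p_ts - p_tv, p_ts, p_tv

def log_likelihood(t, counts, alpha, beta):
    """Log-likelihood of the three class counts, up to a t-independent constant."""
    probs = k2p_class_probs(t, alpha, beta)
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)

def ml_distance(counts, alpha, beta, t_max=30.0, tol=1e-9):
    """Maximize over t with alpha and beta held fixed; math.inf if t_max is reached."""
    g = (math.sqrt(5.0) - 1.0) / 2.0           # golden-section ratio
    lo, hi = 1e-6, t_max
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc = log_likelihood(c, counts, alpha, beta)
    fd = log_likelihood(d, counts, alpha, beta)
    while hi - lo > tol:
        if fc < fd:                            # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = log_likelihood(d, counts, alpha, beta)
        else:                                  # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = log_likelihood(c, counts, alpha, beta)
    t_hat = 0.5 * (lo + hi)
    return math.inf if t_hat > 0.999 * t_max else t_hat

# Counts of (identical, transition, transversion) sites; illustrative values only.
print(ml_distance([620, 290, 90], alpha=1.0, beta=0.1))
print(ml_distance([200, 250, 550], alpha=1.0, beta=0.1))
```

    In the second call the observed transversion fraction exceeds its saturation value of 1/2, so the log-likelihood keeps increasing with $t$ and the sketch reports a divergent estimate.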

    We take two sequences separated by branch length $t_0$ and generated with rate matrix $R$. We consider our fixed estimate of $R$ in equation (23) to be exact. As $N \to \infty$, we can take the distribution of $x(t) = \partial \log L/\partial t$ to be Gaussian, i.e., a probability density $(2\pi B_1)^{-1/2} \exp(-[x - B_0]^2/2B_1)$, where $B_0 = \langle \partial \log L/\partial t \rangle$ and $B_1 = \mathrm{Var}(\partial \log L/\partial t)$. Consequently, the probability of obtaining a divergent estimate for $t_{ML}$ is

    where $C = \lim_{t \to \infty} B_0/$ . We find

    A perturbative analysis of the accuracy of the ML estimate $t_{ML}$ can also be obtained in a straightforward manner. The derivation is tedious, and so we quote only the final result here. Writing $t_{ML} = t_0 + \delta t_{ML}$ ($t_0$ again being the true separation between the two sequences), we find that the mean squared error is given by

    Consequently, in contrast to equation (22), the accuracy of the ML (with fixed rate parameters) estimate $t_{ML}$ when the two sequences are well separated is on a distance scale determined by the slowest nonzero eigenvalue $\lambda_1$.

    Acknowledgements

    This work was supported by the U.K. Biotechnology and Biological Sciences Research Council.

    Literature Cited

    Atteson, K. 1997. The performance of the NJ method of phylogeny reconstruction. Pp. 133-148 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology, DIMACS series in discrete mathematics and theoretical computer science, Vol. 37. American Mathematical Society, Providence, R.I.

    Barry, D., and J. A. Hartigan. 1987. Asynchronous distance between homologous DNA sequences. Biometrics 43:261-276.

    Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189-197.

    Casella, G., and R. L. Berger. 2002. Statistical inference. 2nd edition. Duxbury, Pacific Grove, Calif.

    De Rijk, P., J. Wuyts, Y. Van De Peer, T. Winklemans, and R. De Wachter. 2000. The European large subunit ribosomal RNA database. Nucleic Acids Res. 28:177-178 [sequence data available via ].

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198.

    Hasegawa, M., H. Kishino, and N. Saitou. 1991. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. 32:443-445.

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.

    Higgs, P. G. 2000. RNA secondary structure: physical and computational aspects. Q. Rev. Biophys. 33:199-253.

    Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:1591-1601.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21-123 in H. N. Munro, ed. Mammalian protein metabolism III. Academic Press, New York.

    Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.

    Kimura, M., and T. Ohta. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:87-90.

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.

    Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93.

    Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Li, W.-H., and M. Gouy. 1991. Statistical methods for testing molecular phylogenies. Pp. 249-277 in M. M. Miyamoto and J. Cracraft, eds. Phylogenetic analysis of DNA sequences. Oxford University Press, New York.

    Li, W.-H., and X. Gu. 1996. Estimating evolutionary distances between DNA sequences. Pp. 449-459 in F. Russell, ed. Methods in enzymology. Academic Press, San Diego.

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.

    Nei, M., J. C. Stephens, and N. Saitou. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2:66-85.

    Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485-501.

    Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514-525.

    Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-426.

    Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157:399-411.

    Sourdis, J., and M. Nei. 1988. Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. Mol. Biol. Evol. 5:298-311.

    Steel, M. A. 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19-24.

    Strimmer, K. S. 1997. Maximum likelihood methods in molecular phylogenetics. Ph.D. thesis, University of Munich, Munich [available via ].

    Strimmer, K., N. Goldman, and A. von Haeseler. 1997. Bayesian probabilities and quartet puzzling. Mol. Biol. Evol. 14:210-211.

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.

    Tillier, E. R. M., and R. A. Collins. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7-15.

    Van De Peer, Y., P. De Rijk, J. Wuyts, T. Winklemans, and R. De Wachter. 2000. The European small subunit ribosomal RNA database. Nucleic Acids Res. 28:175-176 [sequence data available via ].

    Zharkikh, A. 1994. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329.

    Accepted for publication July 17, 2002. (D. C. Hoyle and P. G. Higgs)
