Identifying Site-Specific Substitution Rates

http://www.100md.com Molecular Biology and Evolution, 2003, No. 2

    * Max-Planck-Institut für evolutionäre Anthropologie, Leipzig, Germany

    Heinrich-Heine-Universität Düsseldorf, Germany

    Forschungszentrum Jülich, Germany

    Abstract

    A maximum likelihood framework for estimating site-specific substitution rates is presented that does not require any prior assumptions about the rate distribution. We show that, when the branching pattern of the underlying tree is known, the analysis of pairs of positions is sufficient to estimate site-specific rates. In the absence of a known topology, we introduce an iterative procedure to estimate simultaneously the branching pattern, the branch lengths, and site-specific substitution rates. Simulations show that the evolutionary rate of fast-evolving sites can be reliably inferred and that the accuracy of rate estimates depends mainly on the number of sequences in the data set. Thus, large sets of aligned sequences are necessary for reliable site-specific rate estimates. The method is applied to the complete mitochondrial DNA sequence of 53 humans, providing a complete picture of the site-specific substitution rates in human mitochondrial DNA.

    Key Words: sequence evolution • site-specific substitution rates • maximum likelihood • tree reconstruction • human mitochondrial DNA

    Introduction

    In a nucleotide or amino acid sequence, not all sites evolve at the same speed. Some sites experience more changes than others, mainly because different selective constraints act on the different sites of the sequence. Examples of sequences with heterogeneous substitution rates among sites are coding regions of the genome, where the three codon positions evolve at different rates; noncoding regions where some but not all sites have regulatory and control functions, like the hypervariable regions of mitochondrial DNA; and regions of the genome that are involved in the formation of secondary structure, like the large and small subunit ribosomal RNAs. Because of the relationship between substitution rate and selective constraint, we can gain valuable insight into the structural and functional constraints that act on a sequence by quantifying the rate of substitution at each sequence position. In addition, various sequence analyses can, in principle, benefit from the consideration of site-specific substitution rates. Substitution models or distance estimates, and thus tree reconstruction, can become more accurate, and population genetic inference can be influenced. The bias introduced in sequence analysis by ignoring heterogeneous rates among sites has been studied in population genetics and in phylogenetic reconstruction.

    To assess site-specific rate heterogeneity, several approaches have been developed. They can be separated into three categories: parsimony-based approaches, maximum likelihood-based methods (Olsen, Pracht, and Overbeek, personal communication), and empirical "pairwise" methods.

    In the parsimony approach, the number of substitutions each site requires in a most parsimonious tree is counted, and those sites that exceed a cut-off value (i.e., sites that fall in the upper 10% quantile of the distribution of substitutions) are termed "fast sites." However, as the assumptions of parsimony are not always valid, this method tends to underestimate the amount of rate variation. Likewise, skewed base composition and biased substitutions among nucleotides (or amino acids), both of which are known to mimic effects of rate variation among sites, cannot be taken fully into account. Thus, only approximate estimates of site-specific variability are inferred.

    Maximum likelihood methods, in contrast, have a statistically well-defined basis and can cope with recurrent mutations, skewed base composition, and biased variation in rate by specifying a model of sequence evolution. However, the drawback of such methods is that they are computationally intensive, such that the estimation of site-specific rates is not possible without assuming either a tree topology including branch lengths (Olsen, Pracht, and Overbeek, personal communication) or a specific distribution of rates a priori. Typically one assumes that rates are distributed according to the Gamma distribution. Yang (1994) suggested the discrete Gamma distribution, which has been used to obtain site-specific rate estimates. From a biological point of view, there is no reason to assume that sites evolve with "discrete" rates. Moreover, the discrete approach has the tendency to underestimate the rate of fast-evolving sites. Thus, the use of a continuous distribution would, in principle, be favorable. Unfortunately, this is at present computationally infeasible.

    Empirical "pairwise" methods circumvent the construction of a tree by focusing on pairs of sequences. In each sequence pair, a position contains either a pair of different nucleotides or two identical nucleotides, where the probability of observing either event depends on the site-specific substitution rate and the evolutionary distance separating the sequences. This relation has been used to infer site-specific rates of alignment positions. Although this approach yields sensible results, its statistical basis is not well understood. Thus, we introduce a maximum likelihood framework for estimating site-specific substitution rates from pairs of sequences based on these ideas. Moreover, we suggest an iterative maximum likelihood procedure to compute site-specific rates and the phylogenetic tree simultaneously. The performance of the method is evaluated by simulations. As an illustrative example, we applied the estimation procedure to the data set of 53 complete mitochondrial DNA (mtDNA) sequences of humans and obtained the full rate spectrum.

    Modeling Substitutions

    Let S_l(t) ∈ {A, C, G, T} represent the nucleotide state in the lth position of a DNA sequence of length L at time t. We assume that replacement of one nucleotide by another follows a time-continuous, homogeneous, stationary, and reversible Markov process and that the (S_l(t))_{l = 1, ..., L} are independent. Moreover, we require that each position l in the sequence evolves according to the same model, which is typically summarized in the so-called rate matrix

        Q = (Q_{ij}), \quad i, j \in \{A, C, G, T\},    (1)

    that provides an infinitesimal description of the process. The entries Q_ij > 0 (i ≠ j) of Q define the instantaneous rate of change per time unit from nucleotide i to nucleotide j. The collection of Q_ij defines the substitution model. The main diagonal elements

        Q_{ii} = -\sum_{j \neq i} Q_{ij}    (2)

    describe the total flow away from nucleotide i; the total rate r of evolution per unit time is defined as

        r = -\sum_{i} \pi_i Q_{ii},    (3)

    where π = (π_A, π_C, π_G, π_T) denotes the stationary distribution of the nucleotides. In the discussion that follows we will assume that the total rate r of evolution equals 1. Thus, time is measured in number of substitutions.

    We will use d = r · t for the expected number of substitutions between two sequences.
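    The normalization r = 1 can be checked numerically for a concrete model. The following is a minimal sketch, assuming the Jukes-Cantor model with uniform stationary frequencies; the helper names `jc_rate_matrix` and `total_rate` are illustrative and not part of the published program.

```python
# Jukes-Cantor rate matrix scaled so that the total rate r equals 1.
NUCS = "ACGT"
PI = {n: 0.25 for n in NUCS}  # uniform stationary distribution (Jukes-Cantor assumption)


def jc_rate_matrix(mu=1.0 / 3.0):
    """Return Q as a nested dict; every off-diagonal rate is mu, rows sum to zero."""
    q = {}
    for i in NUCS:
        q[i] = {j: mu for j in NUCS if j != i}
        q[i][i] = -sum(q[i].values())  # diagonal: minus the total flow away from i
    return q


def total_rate(q, pi):
    """Total rate r = -sum_i pi_i * Q_ii."""
    return -sum(pi[i] * q[i][i] for i in q)


Q = jc_rate_matrix()
r = total_rate(Q, PI)  # with mu = 1/3 each row leaves at rate 1, so r is 1
```

    With this scaling, one unit of time corresponds to one expected substitution per site, so d and t coincide.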

    Rate Heterogeneity

    If we assume that the entries Q_ij of Q are the same for all sites in a sequence of length L, then all sites are evolving according to the same model of sequence evolution and with the same rate of evolution. The latter assumption can be relaxed by introducing a site-specific scaling factor f_l, l = 1, ..., L, that defines the rate per site as

        r_l = f_l \cdot r.    (4)

    Thus the sequence will accumulate on average

        \frac{1}{L} \sum_{l=1}^{L} f_l \cdot r \cdot t    (5)

    substitutions per site. We will assume normalized f_l, such that (1/L) \sum_{l=1}^{L} f_l = 1.

    Inferring the Evolutionary Distance

    The rate matrix Q specifies the transition probabilities from one nucleotide to another if d substitutions per site are expected by the relation

        P(d) = (P_{ij}(d)) = \exp(Q d).    (6)

    If two sequences X = (X₁, ..., X_L) and Y = (Y₁, ..., Y_L) are compared and Q is specified, we can estimate the number of substitutions between the two sequences under the homogeneous rate assumption by maximizing the likelihood function

        lik(d) = \prod_{l=1}^{L} \pi_{X_l} P_{X_l Y_l}(d).    (7)

    The value \hat{d} that maximizes lik(d) is the maximum likelihood estimate of the number of substitutions that occurred between X and Y. For example, in the Jukes-Cantor model the maximum likelihood estimate for the number of substitutions equals

        \hat{d} = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right),    (8)

    where p is the frequency of observed differences. If we assume rate heterogeneity between sites, then d cannot be estimated from equation 7 unless one knows the distribution of rate heterogeneity, i.e., the shape parameter α of the Gamma distribution. If α is not known, one can only estimate it from a phylogenetic analysis of more than two sequences.
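    As a worked illustration of the Jukes-Cantor case, the sketch below evaluates the closed-form estimate and compares the log-likelihood at that value with nearby values. It assumes the standard Jukes-Cantor transition probabilities with the total rate scaled to 1; the function names are illustrative.

```python
import math


def jc_distance(p):
    """Closed-form ML estimate of d under Jukes-Cantor; p = fraction of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("the estimate is undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_log_lik(d, n_diff, n_sites):
    """Log-likelihood of d for a sequence pair (constant stationary-frequency terms dropped)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site shows the same nucleotide
    p_diff = 0.25 - 0.25 * e  # probability of one specific different nucleotide
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


# Example: 10 differences observed among 100 aligned sites.
d_hat = jc_distance(10 / 100)
```

    The estimate is slightly larger than the observed proportion of differences because the model corrects for multiple substitutions at the same site.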

    Inferring Site-Specific Rates

    If the model of sequence evolution Q is known and is the same for each position in the sequence, then we estimate the site-specific scaling factor f_l (l = 1, ..., L) as follows. For a given position in a sequence, we analyze (independent) pairs of sequences at that position. Instead of studying two sequences that accumulated d substitutions, we now consider k independent pairs of sequences at position l in a multiple sequence alignment. Let (X_l^1, Y_l^1), ..., (X_l^k, Y_l^k) denote the k pairs, which are d₁, d₂, ..., d_k substitutions apart. From this data set, we can estimate the site-specific factor f_l by maximizing

        lik(f_l) = \prod_{i=1}^{k} \pi_{X_l^i} P_{X_l^i Y_l^i}(f_l \cdot d_i)    (9)

    for l = 1, ..., L. Thus, equation 9 describes a procedure similar to that suggested elsewhere, but in a likelihood framework. If k is large and the distances d = (d₁, ..., d_k) are given, it is possible to estimate f_l for each site in a sequence, and thus we obtain a site-specific rate vector f = (f₁, ..., f_L).
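    The maximization of the site likelihood can be sketched for the Jukes-Cantor model, where each pair contributes only through whether the two nucleotides differ. The code below is an illustrative sketch, not the published program: the grid search with local refinement, the step size, and the names `site_log_lik` and `estimate_site_rate` are assumptions. The upper bound of 75 mirrors the arbitrary large rate reported in the simulations when the maximization did not converge.

```python
import math


def site_log_lik(f, differs, dists):
    """Log-likelihood of the scaling factor f at one site under Jukes-Cantor.

    differs[i] is True if pair i shows two different nucleotides; dists[i] is d_i.
    Constant stationary-frequency factors are dropped.
    """
    total = 0.0
    for x, d in zip(differs, dists):
        e = math.exp(-4.0 * f * d / 3.0)
        prob = 0.25 - 0.25 * e if x else 0.25 + 0.75 * e
        total += math.log(max(prob, 1e-300))  # guard against log(0)
    return total


def estimate_site_rate(differs, dists, f_max=75.0, step=0.05):
    """Grid search plus ternary refinement for the f that maximizes site_log_lik."""
    if not any(differs):
        return 0.0  # no observed change: the likelihood peaks at f = 0
    grid = [step * i for i in range(1, int(f_max / step) + 1)]
    best = max(grid, key=lambda f: site_log_lik(f, differs, dists))
    if best >= grid[-1]:
        return f_max  # likelihood still rising at the bound: no finite maximum found
    lo, hi = max(best - step, 1e-9), best + step
    for _ in range(60):  # ternary search inside the bracket around the best grid point
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if site_log_lik(m1, differs, dists) < site_log_lik(m2, differs, dists):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Example: 20 pairs, all 0.1 substitutions apart, 4 of them differ at this site.
f_hat = estimate_site_rate([True] * 4 + [False] * 16, [0.1] * 20)
```

    A site at which every pair differs has no finite maximum, which is one way the non-convergence mentioned in the simulations can arise for small data sets.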

    Analyzing Real Data

    To estimate f = (f₁, ..., f_L), equation 9 requires knowledge of the d₁, d₂, ..., d_k. Several approaches are possible. One can compute the pairwise distances for each pair separately and then estimate the site-specific rates. Another option is to use the known phylogeny of the 2k sequences to identify independent sequence pairs in the tree and to deduce their distances from the branch lengths in the tree. Here, independence means that the path that connects the ith sequence pair (Xⁱ, Yⁱ), i = 1, ..., k, in the tree is disjoint from the remaining k - 1 paths in the tree. This approach will be abbreviated IP-method (independent pairs). A second tree-based approach is simply to take all possible pairs and their distances from a known phylogeny, ignoring the fact that the pairs are not independent. We will call this the AP-method (all pairs). Both methods require knowledge of the phylogeny and the evolutionary model Q. Note also that the AP-method generalizes an earlier empirical pairwise method.
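    The difference between the two pair-selection schemes can be illustrated on a small tree. The sketch below is an assumption-laden illustration rather than the published implementation: the tree is a hypothetical four-leaf example stored as an adjacency dictionary, "disjoint" is interpreted as sharing no branch, and independent pairs are chosen greedily starting with the shortest paths.

```python
from itertools import combinations


def path_edges(adj, a, b):
    """Return the branches on the unique path from a to b in a tree."""
    stack = [(a, None, [])]
    while stack:
        node, parent, edges = stack.pop()
        if node == b:
            return edges
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, edges + [tuple(sorted((node, nxt)))]))
    raise ValueError("no path: input is not a connected tree")


def all_pairs(adj, leaves):
    """AP: every leaf pair with its path length and the branches on its path."""
    out = {}
    for a, b in combinations(leaves, 2):
        edges = path_edges(adj, a, b)
        out[(a, b)] = (sum(adj[u][v] for u, v in edges), edges)
    return out


def independent_pairs(adj, leaves):
    """IP (greedy sketch): take pairs, shortest first, whose paths share no branch."""
    used, chosen = set(), {}
    ranked = sorted(all_pairs(adj, leaves).items(), key=lambda kv: kv[1][0])
    for pair, (dist, edges) in ranked:
        if used.isdisjoint(edges):
            chosen[pair] = dist
            used.update(edges)
    return chosen


# Hypothetical tree ((A:0.1, B:0.2)u, (C:0.3, D:0.25)v) with an internal branch u-v of 0.05.
adj = {
    "A": {"u": 0.1}, "B": {"u": 0.2}, "C": {"v": 0.3}, "D": {"v": 0.25},
    "u": {"A": 0.1, "B": 0.2, "v": 0.05},
    "v": {"C": 0.3, "D": 0.25, "u": 0.05},
}
leaves = ["A", "B", "C", "D"]
ap = all_pairs(adj, leaves)          # n(n - 1)/2 = 6 pairs
ip = independent_pairs(adj, leaves)  # n/2 = 2 pairs
```

    In this example AP supplies three times as many distances as IP, at the cost of reusing branches across pairs.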

    If, however, no information on the phylogenetic relationship of the sequences is available, we suggest an iterative procedure, which includes the estimation of the model of sequence evolution, the phylogeny, and heterogeneous rates among sites. To compute the maximum likelihood estimates of the tree, either PUZZLE or DNAML from PHYLIP is used.

    Algorithm 1: Iterative Procedure

    1. Compute the maximum likelihood tree T₀ and the parameters of the model Q assuming no rate heterogeneity. Call the likelihood lik(T₀).

    2. Apply IP or AP to compute d(T₀) derived from the tree T₀ and use equation 9 to estimate the site rate vector f.

    3. Rescale f such that (1/L) \sum_l f_l = 1.

    4. Use f to compute the corresponding maximum likelihood tree T₁ and its likelihood lik(T₁).

    5. If lik(T₁) > lik(T₀) then

       (a) replace T₀ by T₁.

       (b) goto 2.

       Otherwise goto 6.

    6. Output T₀ and f.

    Step 3 of the iterative procedure puts constraints on the rates and ensures that we are analyzing relative rates with respect to the average evolutionary rate.
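    The control flow of the iterative procedure, including the rescaling step, can be sketched as follows. Tree reconstruction and rate estimation are represented by caller-supplied placeholder functions (`fit_tree` and `estimate_rates` are hypothetical names, standing in for a program such as PUZZLE or DNAML and for equation 9). The iteration cap and the choice to return the rate vector belonging to the last accepted tree are assumptions of this sketch.

```python
def rescale(rates):
    """Scale the rate vector so that its mean equals 1 (assumes a positive mean)."""
    mean = sum(rates) / len(rates)
    return [f / mean for f in rates]


def iterate(fit_tree, estimate_rates, n_sites, max_iter=100):
    """Alternate between tree and rate estimation until the likelihood stops improving.

    fit_tree(rates) -> (tree, log_likelihood)   placeholder for ML tree reconstruction
    estimate_rates(tree) -> list of raw rates   placeholder for IP/AP plus equation 9
    """
    rates = [1.0] * n_sites                 # start without rate heterogeneity
    tree0, lik0 = fit_tree(rates)
    for _ in range(max_iter):
        new_rates = rescale(estimate_rates(tree0))
        tree1, lik1 = fit_tree(new_rates)
        if lik1 <= lik0:                    # no improvement: keep the current tree
            break
        tree0, lik0, rates = tree1, lik1, new_rates
    return tree0, rates
```

    Because the rates are rescaled in every round, the branch lengths of successive trees remain comparable: they are always expressed relative to an average rate of 1.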

    Efficiency

    If we had a collection of independent pairs of sequences together with a distance estimate d_i for each pair (i = 1, ..., k), then equation 9 would estimate f_l (l = 1, ..., L) if k is large (data not shown).

    In real applications of the method, sequences are related by a tree, and estimates are therefore dependent. To test how well equation 9 estimates f for a collection of aligned sequences, we employed computer simulations, in which we compared the estimated site-specific rates with the rates modeled ("true" rates).

    To this end, random tree topologies based on the coalescent were generated. Sequences were evolved according to the Jukes-Cantor model with Gamma-distributed rates using Seq-Gen, where the shape parameter α is an indicator of the amount of rate heterogeneity. A small value of α implies pronounced rate heterogeneity.

    To test the influence of the number of sequence pairs on the reliability of the estimates, data sets with n = 25, 50, 100, and 250 sequences were generated. Because of the complexity of the problem, we confine ourselves to three types of simulations. The simulation conditions are summarized in table 1.

    Table 1. Simulation conditions (table omitted).

    The performance of simulations 1 to 3 was analyzed by plotting the "true" rates against the respective rate estimate. Moreover, we calculated the correlation coefficient to measure the degree of linear dependency between true rates and their estimates. The slope of the regression line between true and estimated rates is used to check for a possible bias. A slope close to 1 is an indication of an unbiased procedure.
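    Both summary statistics are straightforward to compute. The sketch below assumes the Pearson correlation coefficient and an ordinary least-squares regression of the estimated rates on the true rates; the text does not state the direction of the regression, so this choice is an assumption.

```python
import math


def correlation_and_slope(true_rates, est_rates):
    """Pearson correlation and least-squares slope of est_rates regressed on true_rates."""
    n = len(true_rates)
    mx = sum(true_rates) / n
    my = sum(est_rates) / n
    sxx = sum((x - mx) ** 2 for x in true_rates)
    syy = sum((y - my) ** 2 for y in est_rates)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_rates, est_rates))
    return sxy / math.sqrt(sxx * syy), sxy / sxx


# Toy example: estimates that are perfectly correlated but twice too large.
corr, slope = correlation_and_slope([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0])
```

    The toy example shows why both numbers are reported: a correlation of 1 alone would not reveal the twofold overestimation, whereas the slope does.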

    Simulation 1: IP versus AP

    Here we wanted to elucidate how using all possible n(n - 1)/2 pairs of sequences and their respective distances affects the estimation of f, as compared to using only the n/2 pairs available when independent pairs are required.

    Figure 1 shows the scatter plots for samples of n = 25 and n = 100 sequences. For both approaches the accuracy of the estimates increases if the number of sequences is increased, as can be seen from the reduced scatter of the points. For n = 25 it sometimes happened that the maximization of equation 9 did not converge. In those instances an arbitrary large rate of 75 was assigned. For n = 100 the maximization always converged. Moreover, for large data sets the effect of observing horizontal lines of rate estimates (see fig. 1a) disappears. Thus a much finer resolution of rate estimates seems possible.

    FIG. 1 (figure omitted). Scatter plot of the "true" rate versus the respective rate estimate for that position. a, Results for the data set of 25 sequences. b, Results for the data set of 100 sequences. For exact simulation conditions see table 1. For values of correlation coefficients and slope of regression line, see table 2. For each of the two data sets, two variants of the estimation method are shown (AP and IP).

    Calculation of correlation coefficients and the slope of the regression lines (table 2) corroborates the observation that larger data sets lead to an increased precision of rate estimates. If 250 sequences are analyzed, the correlation coefficient equals 0.9391 for the IP-method and 0.9714 for the AP-method; the slope equals 1.0109 and 1.0110, respectively. Thus, a bias in the estimates is not observed. Surprisingly, the AP-method yields consistently better correlation coefficients and slopes. It appears that the larger sample size of AP outweighs possible biases introduced by the nonindependence of pairs of sequences. Therefore, the AP-method is the method of choice and will be used in the following simulations.

    Table 2. Correlation coefficients and slope of the regression line as a function of the number of sequences (table omitted).

    Simulation 2: Influence of α

    We investigated the performance of the AP-method when the shape parameter α of the Gamma distribution is changed from extreme rate heterogeneity (α = 0.1) to weak heterogeneity (α = 5) and when the phylogenetic tree is also varied. To accomplish this investigation, the AP-method was applied to 100 random trees.

    Figure 2 summarizes the results. The overall performance of the estimates improves as the number of sequences is increased, but the AP-method performs best if strong rate heterogeneity is present. If α = 5.0, then an average correlation coefficient of 0.5212 is observed for n = 250 sequences, which increases to 0.9531 if α = 0.1 (see fig. 2). In the latter case the correlation coefficients range between 0.85 and 0.95, indicating a high degree of correlation between estimated rates and true rates. Moreover, the range of the distribution of the correlation coefficients is narrower if rate heterogeneity is strong. Also, the variance of the empirical distribution of correlation coefficients is reduced as n is increased.

    FIG. 2 (figure omitted). Correlation coefficients between the "true" and the estimated rates as a function of the number of sequences and the amount of rate heterogeneity α. The squares represent the average correlation coefficients computed from the rate estimates obtained from 100 random trees (see table 1, Sim2). The bars show the range of correlation coefficients. The crosses represent correlation coefficients obtained from the iterative procedure, where the tree is unknown and needs to be estimated jointly with the site-specific rates (see table 1, Sim3).

    As already observed in Simulation 1, the estimates appear unbiased, as the slope of the regression line ranges between 0.9734 and 1.2996, irrespective of the choice of α or the number n of sequences.

    Because the results are averaged over 100 random trees, our conclusions are independent of the underlying tree. Thus, the method seems to work with good accuracy as long as the data set is sufficiently large and rate heterogeneity is present.

    Simulation 3: The Iterative Procedure

    Finally, we investigated the performance of the iterative procedure (algorithm 1). As maximum likelihood tree reconstruction is the time-limiting factor (step 4 of the algorithm), we employed a subsampling strategy for the simulated data with n = 100 and 250 sequences. Here, 10 random subsamples of 50 sequences were drawn. For each subsample, site-specific rates were estimated using the iterative procedure, and the estimates at each site were averaged.
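    The subsampling strategy amounts to averaging per-site estimates over random subsets of the alignment. The following sketch assumes sampling without replacement and a fixed random seed for reproducibility; `estimate` is a hypothetical placeholder for the iterative procedure applied to one subsample.

```python
import random


def subsample_rates(sequences, estimate, n_sub=10, size=50, seed=0):
    """Average per-site rate estimates over random subsamples of the sequences.

    estimate(subset) -> list of per-site rates; it stands in for the iterative procedure.
    """
    rng = random.Random(seed)
    n_sites = len(sequences[0])
    totals = [0.0] * n_sites
    for _ in range(n_sub):
        subset = rng.sample(sequences, size)  # draw `size` sequences without replacement
        for site, f in enumerate(estimate(subset)):
            totals[site] += f
    return [t / n_sub for t in totals]
```

    Each tree reconstruction then involves only 50 sequences, which keeps the time-limiting step feasible at the price of using less information per estimate.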

    The iterative procedure stopped in all instances (see table 1, Sim3) after seven iterations. The analysis that follows is for the resulting tree and the corresponding rate vector. The crosses in figure 2 display the results of the iterative method. If we can compute the overall maximum likelihood tree (n ≤ 50), then the correlation coefficients of the iterative method fall within the range of the coefficients obtained when the tree is given (Sim2). However, if n = 100 or 250, then we observe a reduced correlation compared to Sim2. The correlation coefficients overlap with the coefficients one would obtain if 50 sequences were analyzed. Thus the reduction in correlation between true and estimated rates may be attributed to the fact that we have employed the subsampling strategy. If we were able to compute maximum likelihood trees for large data sets in reasonable time, this reduction in correlation would disappear.

    Simulation 2 and Simulation 3 both show a high correlation coefficient when strong rate heterogeneity is present and a reduced correlation coefficient when the sequence positions evolve with weak heterogeneity. Figure 3 shows the scatter plots of one simulation of the rate estimates after convergence of algorithm 1 for 50 sequences, assuming α = 5.0 (fig. 3a), α = 0.8 (fig. 3b), and α = 0.1 (fig. 3c). Whereas the slope equals 1.0147 (α = 5.0), 1.0868 (α = 0.8), and 1.0195 (α = 0.1), the correlation coefficients are 0.4844, 0.7746, and 0.9164, respectively. Thus, while the slope is close to 1 in all examples, we observe a reduced correlation for the weak rate heterogeneity case. An explanation may be the lack of fast-evolving positions, which are present for α = 0.1, say (see fig. 3c); these fast-evolving positions (true rates larger than 30) can be reliably estimated and cause the high correlation. In summary, our simulations show that we are able to estimate site-specific rates in reasonable computation time. Moreover, they show that knowledge of the phylogeny is not necessary to infer the rates.

    FIG. 3 (figure omitted). Scatter plots of the "true" rates versus estimated rates for 50 sequences calculated using the iterative procedure, assuming various degrees of rate heterogeneity. a, For low rate heterogeneity (α = 5.0); b, for moderate heterogeneity (α = 0.8); c, for strong rate heterogeneity (α = 0.1).

    Site-Specific Rates of Human Mitochondrial DNA

    To apply the iterative procedure, we analyzed the 53 complete human mtDNA sequences. The sequences were aligned by eye, resulting in an alignment of length 16,591. According to the Anderson reference sequence, we identified 23 gapped positions (44.1, 309.1–309.2, 317.1–317.3, 523.1–523.4, 573.1–573.6, 2161.1, 2232.1, 5909.1, 16193.1–16193.4), which were excluded from the analysis. Position 3107 was also excluded because it is missing in all 53 sequences. Thus, we analyzed 16,567 positions with the iterative procedure. The initial tree of the algorithm was estimated using the PUZZLE program with the Tamura-Nei model, where the model parameters were estimated from the data. Based on this tree, the iterative procedure was begun. After three iterations the algorithm stopped. The resulting site-specific rate estimates are shown in figure 4. In total, 660 varied positions are observed scattered along the entire mitochondrial genome. The majority (474/660) of these varied positions evolve with relative rates less than 20, and only a small fraction (39 positions) evolve with rates larger than 100. Using the PUZZLE program, we estimated a shape parameter α = 0.002, which indicates extreme rate heterogeneity. If we were to take this value at face value, then we would expect about 40 positions among the total of 16,567 positions with a relative rate above 100. This rough estimate agrees fairly well with our calculations, yet the distribution of the fast-evolving sites along the genome is far from uniform. Of the varied sites, 124 are located in the hypervariable regions (H1 and H2), which comprise only (359 + 320)/16,567 = 4% of the complete genome. More precisely, in H1 23.39% of the sites are variable, and in H2 12.5% of the positions show variation, whereas the rest of the genome shows 3.37% variable sites.

    In contrast, regions that seem to be particularly conserved are regions of known function in the D-Loop (subsections 1, 2, and 3) as well as the central region separating H1 and H2, all tRNA genes, the 12S rRNA, and the 16S rRNA. It is interesting to note, however, that single sites with very high relative rates (>150) are interspersed throughout the genome and are found in ND1, ATP6, COIII, ND3, ND4, ND5, Cytb, and in one tRNA (Serine).
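    The expectation of about 40 positions with a relative rate above 100 can be reproduced from the Gamma distribution itself. The sketch below assumes a continuous Gamma distribution with shape α = 0.002 and mean 1 and evaluates the upper tail by Simpson's rule; the integration limits and step count are illustrative choices, not taken from the paper.

```python
import math


def gamma_upper_tail(alpha, x, upper=60.0, n=60000):
    """P(R > x) for a Gamma-distributed rate R with shape alpha and mean 1.

    Integrates the unnormalized density of G = alpha * R, which follows a
    Gamma(alpha, 1) distribution, from alpha * x to `upper` with Simpson's rule.
    The contribution beyond `upper` is negligible.
    """
    a, b = alpha * x, upper
    h = (b - a) / n  # n must be even for Simpson's rule

    def g(t):
        return t ** (alpha - 1.0) * math.exp(-t)

    s = g(a) + g(b)
    s += 4.0 * sum(g(a + i * h) for i in range(1, n, 2))
    s += 2.0 * sum(g(a + i * h) for i in range(2, n, 2))
    return (s * h / 3.0) / math.gamma(alpha)


n_positions = 16567
expected_fast = n_positions * gamma_upper_tail(0.002, 100.0)  # expected sites with rate > 100
hv_fraction = (359 + 320) / n_positions                        # share of H1 and H2 in the genome
```

    Under these assumptions the expected count comes out close to 40, in line with the 39 positions actually estimated, and the hypervariable regions account for roughly 4% of the analyzed positions.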

    FIG. 4 (figure omitted). Site-specific rate estimates for the human mitochondrial DNA. The lengths of the bars reflect the relative rate at each site, where the average substitution rate is normalized to 1. The dotted circles are spaced at intervals of 50 relative rate units. The three small subsections (1, 2, and 3) within the D-Loop summarize several control elements. The labeled sections indicate the locations of the genes for the 12S and 16S rRNAs (12S, 16S), NADH dehydrogenase subunits 1, 2, 3, 4L, 4, 5, and 6 (ND1 to ND6), cytochrome c oxidase subunits I, II, and III (COI to III), ATPase subunits 6 and 8 (ATP6 and 8), and cytochrome b (Cytb). The small unlabeled sections outside the D-Loop show the locations of the 22 different tRNA genes. OL denotes the origin of L-strand replication. The exact positions of the gene products and their original references are compiled in MITOMAP.

    Thus, our method is able to pinpoint well-known regions of high evolutionary rates and to detect other positions of high substitution rates. We should keep in mind, however, that we can only generate hypotheses about the mode of evolution of single positions; the precise nature of the substitution process and its dependence on the sequence environment remain unclear.

    Discussion

    We have presented a maximum likelihood framework for the estimation of site-specific rates from pairs of sequences. This approach generalizes earlier empirical pairwise methods. Based on simulation studies, we show that the method is able to estimate site-specific rates, even if we need to estimate the tree and the model. To obtain reliable estimates, however, large data sets are necessary. Our approach is based on the analyses of pairs of sequences. Whether this method is advantageous compared to the optimization of the likelihood of a given site pattern (Olsen et al. 1994) is at present unclear. It will be interesting to compare these two approaches in further detail.

    With the proposed method, continuous rate estimates are obtained. This is a major advantage when pronounced rate heterogeneity is present, and thus the suggested method seems superior to other approaches, where evolutionary rates are more or less arbitrarily pooled in discrete categories.

    If rate heterogeneity is low (α = 5.0), the correlation of true and estimated rates is small. This finding is explained by the fact that the absolute differences between rates are small. In fact, if α = 5.0, approximately 90% of the sites evolve with relative rates between 0.25 and 1.75. In contrast, for α = 0.1, only 14% of the sites are in that range, and the absolute differences in rates between sites are larger. Thus, we find sites evolving with relative rates between 0 and 50 (see fig. 3c). Our results show that the method is able to distinguish between fast- and slow-evolving sites with high reliability. This is true even when we have to estimate the tree with the iterative procedure. Here our method faces a bottleneck, which can be overcome only by faster maximum likelihood tree reconstruction programs. To estimate site-specific rates reliably, we need large data sets; however, at present it is impossible to estimate maximum likelihood trees for more than 50 sequences. More work is necessary to develop further approximation procedures.
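    The proportions of sites falling between relative rates of 0.25 and 1.75 can be checked by simulation. The sketch below is a Monte Carlo check under the assumption of Gamma-distributed rates with mean 1 (shape α, scale 1/α); the sample size and seed are arbitrary.

```python
import random


def fraction_in_range(alpha, lo=0.25, hi=1.75, n=200000, seed=1):
    """Monte Carlo estimate of P(lo < R < hi) for Gamma rates with shape alpha and mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes shape and scale; scale = 1 / alpha gives mean 1
    hits = sum(1 for _ in range(n) if lo < rng.gammavariate(alpha, 1.0 / alpha) < hi)
    return hits / n


frac_weak = fraction_in_range(5.0)    # weak rate heterogeneity
frac_strong = fraction_in_range(0.1)  # strong rate heterogeneity
```

    Under this parameterization the simulated proportions should land near the figures quoted above: roughly nine sites in ten for α = 5.0, against roughly one in seven for α = 0.1.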

    Further extensions of our approach are straightforward: we need to incorporate more complex models of sequence evolution and to introduce a statistical framework to estimate the significance of estimates obtained. Moreover, the method makes the assumption that the distribution of variable sites is the same in all sequences. We do not expect that this is a problem for the population sequence data we have studied. However, if very distantly related sequences are studied, it is possible that the distribution of sites free to vary changes across the phylogeny. At present it is not clear how to include rate changes at a single position in our analyses.

    Appendix

    A computer program to determine site-specific rates is available. It is written in ANSI C and should run on most popular platforms after proper compilation. The current version of the program, together with a detailed analysis of our example, is available on request. Parts of the source code have been taken from the free software PUZZLE.

    Acknowledgements

    We express special thanks to Matthias Deliano, Roland Fleißner, Dirk Metzler, and Heiko Schmidt for stimulating discussions. Financial support from the DFG and MPG is greatly appreciated.

    Literature Cited

    Anderson, S., A. T. Bankier, B. G. Barrell, et al. (14 co-authors). 1981. Sequence and organization of the human mitochondrial genome. Nature 290:457-465.

    Aris-Brosou, S., and L. Excoffier. 1996. The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. Biol. Evol. 13:494-504.

    Clayton, D. A. 2000. Transcription and replication of mitochondrial DNA. Hum. Reprod. 15:11-17.

    De Rijk, P., Y. Van de Peer, I. Van den Broeck, and R. De Wachter. 1995. Evolution according to large ribosomal subunit RNA. J. Mol. Evol. 41:366-375.

    Excoffier, L., and Z. Yang. 1999. Substitution rate variation among sites in mitochondrial hypervariable region I of humans and chimpanzees. Mol. Biol. Evol. 16:1357-1368.

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.

    Fitch, W. M. 1971. The nonidentity of invariable positions in the cytochromes c of different species. Biochem. Genet. 5:231-241.

    Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866-873.

    Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546-557.

    Hasegawa, M., A. Di Rienzo, T. D. Kocher, and A. C. Wilson. 1993. Toward a more accurate time scale for the human mitochondrial DNA tree. J. Mol. Evol. 37:347-354.

    Hudson, R. R. 1991. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.

    Huelsenbeck, J. P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698-707.

    Ingman, M., H. Kaessmann, S. Pääbo, and U. Gyllensten. 2000. Mitochondrial genome variation and the origin of modern humans. Nature 408:708-712.

    Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

    Kelly, C., and J. Rice. 1996. Modelling nucleotide evolution: a heterogeneous rate analysis. Math. Biosci. 133:85-109.

    Kogelnik, A. M., M. T. Lott, M. D. Brown, S. B. Navathe, and D. C. Wallace. 1998. MITOMAP: a human mitochondrial genome database: 1998 update. Nucleic Acids Res. 26:112-115.

    Lockhart, P. J., D. Huson, U. Maier, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolve in eubacteria. Mol. Biol. Evol. 17:835-838.

    Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1-7.

    Meyer, S., G. Weiss, and A. von Haeseler. 1999. Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 152:1103-1110.

    Miyamoto, M. M., and W. M. Fitch. 1995. Testing the covarion hypothesis of molecular evolution. Mol. Biol. Evol. 12:503-513.

    Nielsen, R. 1997. Site-by-site estimation of the rate of substitution and the correlation of rates in mitochondrial DNA. Syst. Biol. 46:346-353.

    Pesole, G., and C. Saccone. 2001. A novel method for estimating substitution rate variation among sites in a large dataset of homologous DNA sequences. Genetics 157:859-865.

    Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.

    Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502-504.

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer Associates, Sunderland, Mass.

    Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526.

    Tavaré, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Pp. 57–86 in M. S. Waterman, ed. Some mathematical questions in biology: DNA sequence analysis. The American Mathematical Society, Providence, R.I.

    Van de Peer, Y., S. L. Baldauf, W. F. Doolittle, and A. Meyer. 2000. An updated and comprehensive rRNA phylogeny of (crown) eukaryotes based on rate-calibrated evolutionary distances. J. Mol. Evol. 51:565-576.

    Van de Peer, Y., J. M. Neefs, P. De Rijk, and R. De Wachter. 1993. Reconstructing evolution from eukaryotic small-ribosomal-subunit RNA sequences: calibration of the molecular clock. J. Mol. Evol. 37:221-232.

    Wakeley, J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623.

    Yang, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.

    Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.

    Yang, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005.

    Yang, Z. 1996. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367-372.

    Yang, Z., and T. Wang. 1995. Mixed model analysis of DNA sequence evolution. Biometrics 51:552-561.

    Accepted for publication October 1, 2002. (Sonja Meyer and Arndt von Haeseler)

Source: 100md.com, http://www.100md.com/html/DirDu/2005/05/06/58/24/22.htm