
Estimating Ancestral Population Sizes and Divergence Times

http://www.100md.com Genetics (《基因杂志》), 2003, Issue 1

     ^a Department of Human Genetics, University of Chicago, Chicago, Illinois 60637

    ABSTRACT

    This article presents a new method for jointly estimating species divergence times and ancestral population sizes. The method improves on previous ones by explicitly incorporating intragenic recombination, by utilizing orthologous sequence data from closely related species, and by using a maximum-likelihood framework. The latter allows for efficient use of the available information and provides a way of assessing how much confidence we should place in the estimates. I apply the method to recently collected intergenic sequence data from humans and the great apes. The results suggest that the human-chimpanzee ancestral population size was four to seven times larger than the current human effective population size and that the current human effective population size is slightly >10,000. These estimates are similar to previous ones, and they appear relatively insensitive to assumptions about the recombination rates or mutation rates across loci.

    The effective population size (N_e) of a species has a direct effect on the amount and the pattern of DNA sequence variation. Researchers have therefore used sequence polymorphism data to estimate N_e (e.g., KREITMAN 1983; TAKAHATA 1993; NACHMAN and CROWELL 2000). The amount of observed diversity can be used to estimate the population mutation parameter θ = 4N_eµ, and the per generation mutation rate µ can be estimated either directly (HARADA et al. 1993; GIANNELLI et al. 1999) or indirectly (e.g., KIMURA 1983; SATTA et al. 1993; KUMAR and HEDGES 1998; NACHMAN and CROWELL 2000) from divergence data (given assumptions about the divergence date and the average generation time).

    Most estimates of N_e for humans are ~10,000–15,000 (e.g., TAKAHATA 1993; HARDING et al. 1997). While there are many possible reasons why the effective population size may be quite different from the census population size (CABALLERO 1994), it remains surprising that the human N_e is so low, especially given humans' large range over the past 1–2 million years (MY; e.g., SWISHER et al. 1994; GABUNIA and VEKUA 1995). In particular, great ape species historically have had much smaller ranges, but have N_e two to three times larger than the human N_e. Did some event associated with the founding of the genus Homo (TAKAHATA 1993) or some other particular event in human history lead to a sharp reduction in effective population size? It is difficult to answer this without knowing how N_e has varied over evolutionary time. Recently, progress has been made in estimating the effective population size of the population directly ancestral to two extant daughter species.

    We refer to the ancestor's population size as the "ancestral N_e," or N_a. A few main methods exist for estimating N_a (see TAKAHATA and SATTA 2002 for a more in-depth discussion). One method, referred to as the trichotomy method, requires orthologous sequence data from three closely related species (NEI 1987; WU 1991). This approach uses a single orthologous sequence from each of three species and assumes a simple model of population history where at fixed times different species become isolated with no further admixture (cf. HEY 1994). Random mating is assumed within each population. If the time between the two speciation events is small, the gene tree for a particular region will not always match the species tree (NEI 1987; see Figure 1). The probability that this happens depends in part on N_a for the ancestral population of species 1 and 2. In particular, a necessary (but not sufficient) condition for the gene tree and the species tree to be incompatible is that the species 1 and 2 lineages do not coalesce between the two speciation events. If N_a is larger, the probability of a common ancestor before time T₂ is reduced, leading to a greater chance that the gene tree and the species tree do not match. The trichotomy method uses orthologous data from many unlinked loci, infers the gene tree at each locus, calculates the proportion of loci where the inferred gene tree does not match the species tree, and then uses this proportion to estimate N_a. Application of the trichotomy method to human and great ape sequence data has led to estimates of N_a (for the human-chimpanzee ancestral population) substantially larger than the current N_e for humans (RUVOLO 1997; CHEN and LI 2001; TAKAHATA and SATTA 2002). CHEN and LI 2001 estimate, for example, N_a = 52,000–96,000, or roughly five to nine times larger than the current human N_e.
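    The dependence of gene-tree discordance on N_a can be made concrete with a minimal sketch. It assumes the standard result for this simple isolation model (an assumption here, not a formula quoted from the text): two lineages fail to coalesce during the t generations between the two speciation events with probability exp(-t/(2N_a)), and when that happens two of the three equally likely gene trees disagree with the species tree. The function names and the numerical values in the usage note are hypothetical.

```python
import math


def incongruence_probability(t_generations, n_a):
    """P(gene tree differs from species tree) in the simple isolation model.

    t_generations: generations between the two speciation events.
    n_a: diploid effective size of the ancestral population.
    Assumes random mating and no recombination within the locus.
    """
    p_no_coalescence = math.exp(-t_generations / (2.0 * n_a))
    # Once all three lineages enter the older ancestor, each of the three
    # gene trees is equally likely; two of them are discordant.
    return (2.0 / 3.0) * p_no_coalescence


def estimate_n_a(p_incongruent, t_generations):
    """Invert the formula above to get N_a from a discordance proportion."""
    if not 0.0 < p_incongruent < 2.0 / 3.0:
        raise ValueError("proportion must lie strictly between 0 and 2/3")
    return -t_generations / (2.0 * math.log(1.5 * p_incongruent))
```

    For instance, `estimate_n_a(incongruence_probability(80000, 50000), 80000)` recovers 50,000 up to rounding error. The inversion needs t as an input and treats every inferred gene tree as correct.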


    Figure 1. Two possible gene trees given a particular species tree. T₂ is the time when the first speciation event occurs. In A, the gene tree and species tree are compatible, while in B they are incompatible. Divergence between single orthologous sequences from two species (species 1 and 2) consists of two parts: time when the species are separated (y) and the time when the two sequences segregate in the ancestral population (x).

    Another method for estimating N_a requires divergence data from two or three species (TAKAHATA 1986; TAKAHATA et al. 1995; YANG 1997, YANG 2002). Here, I describe the two-species method, because that is what is generally used. Given two orthologous sequences, one each from a pair of species, they will coalesce at some time that predates the species divergence time (see Figure 1). For the autosomes, the time spent in the ancestral population before coalescence (x in Figure 1) is exponentially distributed, with mean 2N_ag (where g is the average generation time and N_a is the diploid ancestral N_e). In contrast, the postspeciation branch lengths (y in Figure 1) are fixed. Given data from multiple unlinked loci and assumptions about g and µ, one can use maximum likelihood to jointly estimate the speciation time and N_a (TAKAHATA et al. 1995; YANG 1997). The general idea is that large values of N_a correspond to greater variability in the coalescence time of two orthologous sequences and thus greater variance in the observed divergences across loci. Using human and chimpanzee divergence data, estimates of the human-chimpanzee N_a are ~5–10 times the current N_e (TAKAHATA and SATTA 1997; TAKAHATA 2001).

    Finally, two other methods require intraspecific polymorphism data from two species and use either a moment-based (WAKELEY and HEY 1997) or a maximum-likelihood (NIELSEN and WAKELEY 2001) approach to estimate model parameters (including in particular N_a and the divergence time). Both of these methods are well suited for species that have diverged relatively recently, but less so for species such as humans and chimpanzees that share very little ancestral polymorphism. In any case, at present they cannot be used to estimate the human-chimp N_a because of a lack of chimpanzee polymorphism data.

    Large estimates of the human-chimpanzee N_a are concordant with a study of Mhc that used the high levels of diversity there to estimate a long-term (i.e., over the past 10–20 million years) average effective population size of ~10⁵ (TAKAHATA 1991). However, it should be noted that the large estimates of N_a are difficult to reconcile with human-chimpanzee divergence times estimated from molecular data. Most recent estimates of the divergence time fall between 4 and 6 million years ago (MYA; e.g., HORAI et al. 1995; EASTEAL and HERBERT 1997; KUMAR and HEDGES 1998; KUMAR and SUBRAMANIAN 2002). These estimates are for a single human and a single chimpanzee sequence; they reflect both divergence between species and segregation in the ancestral population (x and y in Figure 1). If N_a is large, then x must be large; if x is large and x + y is fixed, then y must be small. Suppose, for example, that (x + y) = 5.5 MY, as estimated by KUMAR and HEDGES 1998, and that N_a = 52,000–96,000 (cf. CHEN and LI 2001). Taking x at its expected value of 2N_ag, if the average generation time is 20 years, y would be 1.7–3.4 MY. If the average generation time were 25 years (see DISCUSSION), then y = 0.7–2.9 MY. These estimates postdate many well-documented australopithecine fossils and are therefore dubious estimates of the time since speciation.
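    The arithmetic behind these ranges can be checked with a short sketch. It assumes that x is replaced by its expected value, 2N_ag years, so that y = (x + y) - 2N_ag; the function name is hypothetical.

```python
def post_speciation_time_my(total_divergence_my, n_a, generation_years):
    """Return y (in MY) given x + y (in MY), N_a and the generation time.

    x is set to its expected value of 2 * N_a * g years, which is an
    assumption of this sketch rather than a property of any single locus.
    """
    expected_x_my = 2.0 * n_a * generation_years / 1e6
    return total_divergence_my - expected_x_my


for g in (20, 25):
    for n_a in (52000, 96000):
        y = post_speciation_time_my(5.5, n_a, g)
        print(f"g = {g} years, N_a = {n_a}: y = {y:.2f} MY")
```

    With g = 20 years the two values of N_a give y = 3.42 and 1.66 MY, and with g = 25 years they give 2.90 and 0.70 MY, matching the rounded ranges quoted above.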

    One possible explanation is that N_a has been consistently overestimated. Indeed, both the trichotomy method and the two-species maximum-likelihood method have been criticized (HUDSON 1992; TAKAHATA et al. 1995; SATTA et al. 2000; TAKAHATA and SATTA 2002), and it is not clear how accurate the estimates are. For one, the trichotomy method assumes one already knows the time between the two speciation events, but this is generally not known a priori. Furthermore, it assumes one can correctly infer the phylogeny for any particular locus. Errors in phylogenetic inference arise when analyzing actual data, and the whole endeavor does not make sense in the presence of intragenic recombination (cf. NORDBORG 2001). With three closely related species, the true phylogenies for nearby sites are not always the same. So, when loci are analyzed, they are often an amalgamation of sites with different phylogenies. Trying to infer a single phylogeny from such data is clearly not appropriate (SATTA et al. 2000). Even if the problems of recombination and phylogenetic reconstruction were ignored, the trichotomy method does not make an efficient use of the available information. Data from each locus are summarized into a single binary variable, depending on whether the inferred locus phylogeny agrees or disagrees with the species tree.

    The maximum-likelihood methods are more rigorous and efficient, but they too have two main drawbacks. As with the trichotomy method and the method of NIELSEN and WAKELEY 2001, intragenic recombination is ignored. In addition, the methods of TAKAHATA et al. 1995 are highly sensitive to variation in µ among loci. The problem is that variation in µ leads to greater variance in observed divergences across loci, which inflates the estimate of N_a. Although variation in µ can be explicitly modeled in the analyses (YANG 1997; TAKAHATA and SATTA 2002), it is difficult to know whether a particular model of rate variation is appropriate, especially given data from only two species (but see YANG 2002).

    In this article, I present a new method for estimating N_a. The method requires orthologous sequence data from three or more species (two plus one or more outgroups) and jointly estimates N_a and species divergence times using a summary maximum-likelihood approach. Unlike the previous maximum-likelihood methods, intragenic recombination is incorporated, and likelihoods are estimated from coalescent simulations. Also, the model can account for variation in mutation rates across loci. Although the method can be used on data from any taxonomic group (as long as at least one outgroup species is available), I concentrate here on analyzing human and great ape sequence data. The maximum-likelihood framework allows for the estimation of confidence intervals; this, along with a more realistic model, allows us to assess with greater rigor whether the human-chimp N_a was much larger than the current human N_e, as previous studies have claimed. I apply the method to the orthologous data from 53 intergenic regions reported in CHEN and LI 2001 and generate both point estimates and approximate confidence intervals for N_a and species divergence times. Intergenic sequence data are preferable to data from genes (even synonymous sites or introns) because they are less likely to have been affected by natural selection at closely linked sites.

    METHODS

    I describe the model in which there are orthologous sequence data from four species. The case in which there are three species (or five or more) follows analogously.

    Suppose we have four species with a known phylogeny. We assume a null model of speciation (cf. HEY 1994; SATTA et al. 2000) whereby a panmictic ancestral population splits at a fixed time into two panmictic descendant populations, with no subsequent migration between the descendant populations. The scaled mutation and recombination parameters are θ (= 4N_hµ) and ρ (= 4N_hr), where µ is the mutation rate per site per generation and r is the recombination rate per site per generation. Label the species H, C, G, and O, with current diploid effective population sizes N_h, N_c, N_g, and N_o, respectively. Suppose H and C split at time T₁, H and G at time T₂, and H and O at time T₃, with T₁ < T₂ < T₃ (see Figure 2). T₁, T₂, and T₃ are scaled in units of 4N_h generations. From time T₁ until T₃ both the H-C ancestral population and the H-C-G ancestral population have effective size N_a, while the H-C-G-O ancestral population has effective size N_o. The results are similar if the latter ancestral population has effective size N_a (results not shown). Finally, define n_s as the number of contiguous nucleotide sites in the simulation. There are a total of 11 parameters in the model, listed in Table 1. We assume that the generation time and mutation rate do not vary across species. These assumptions are reasonable when the species considered have similar life-history traits and are closely related. There are three possible (unrooted) gene trees, with H and C, H and G, or C and G as sibling species.


    Figure 2. Model of species history considered. There are four species with known branching order, labeled H, C, G, and O. The three speciation events (starting from the present) occur at times T₁, T₂, and T₃, where time is scaled in units of 4N_h generations. See METHODS for more details.


    Table 1. Model parameters

    Now, suppose we have a single orthologous sequence from each species. For each site that is "segregating" (i.e., is not identical across all species), we can infer that one or more mutations happened on certain branches in the unrooted tree. We do this assuming the fewest number of mutations that can explain the data. For example, if the H, C, G, and O sequences have A, G, A, and A, respectively, then we infer that a mutation happened on the branch leading to species C. All biallelic segregating sites fall into seven categories, resulting from mutations on seven different branches of an unrooted tree. These seven branches have H, C, G, O, HC, HG, or CG as descendants and are referred to as the seven types of branches. Note that for any particular gene tree there are only five possible branches, four external ones (with a single species as a descendant) and one internal one. Any site may have one of three possible gene trees, leading to seven possible branches over all possible gene trees (the four external branches that are common to each gene tree and one internal branch from each gene tree). For sites with three segregating nucleotides, we assume that the two species with the same base share the ancestral state and that the two other bases each arose from a single mutation. The CHEN and LI 2001 data do not contain any sites where each species has a different nucleotide, so we do not consider this possibility.
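    The classification rule just described can be sketched as follows. This is a minimal illustration, not the author's C implementation: the function names are hypothetical, the input is assumed to be four gap-free aligned sequences of equal length, and an internal branch is named after the pair of species that does not include O (in an unrooted tree, the HC branch and the GO branch are the same branch).

```python
from collections import Counter

SPECIES = "HCGO"
BRANCH_TYPES = ("H", "C", "G", "O", "HC", "HG", "CG")


def classify_site(h, c, g, o):
    """Return the branch types that receive one inferred mutation each."""
    bases = dict(zip(SPECIES, (h, c, g, o)))
    counts = Counter(bases.values())
    if len(counts) == 1:
        return []  # site is identical across all species
    if len(counts) == 4:
        raise ValueError("four different bases: not covered by this sketch")
    if len(counts) == 2 and set(counts.values()) == {2}:
        # 2-2 split: one mutation on an internal branch, named after the
        # pair of species whose base differs from that of O.
        return ["".join(sp for sp in "HCG" if bases[sp] != bases["O"])]
    # 3-1 split, or three bases with one shared pair: every species that
    # carries a unique base gets one mutation on its external branch.
    return [sp for sp in SPECIES if counts[bases[sp]] == 1]


def branch_counts(seq_h, seq_c, seq_g, seq_o):
    """Summarize an alignment as the vector b, ordered as BRANCH_TYPES."""
    b = dict.fromkeys(BRANCH_TYPES, 0)
    for column in zip(seq_h, seq_c, seq_g, seq_o):
        for branch in classify_site(*column):
            b[branch] += 1
    return [b[t] for t in BRANCH_TYPES]
```

    As a check against the example in the text, `classify_site("A", "G", "A", "A")` returns `["C"]`.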

    The sequence data for the 53 intergenic regions reported in CHEN and LI 2001 were kindly provided by the authors and aligned by eye. All indels were excluded. From the remaining sequence, we count the inferred number of mutations that happened on each of the seven branch types. (No distinction is made between transitions and transversions.) For a given region, denote these numbers of inferred mutations as b = (b₁, b₂, ... b₇). For given values of M = (θ, ρ, N_h, N_c, N_g, N_o, N_a, T₁, T₂, T₃, n_s) we estimate the likelihood of observing the vector b using Monte Carlo simulations.

    The population model in Figure 2 is simulated using a modification of the coalescent with recombination (HUDSON 1983). For each site in each replicate, we classify all the branches in the genealogy as one of the seven types of branches (i.e., with descendants H, C, G, O, HC, HG, or CG in the unrooted tree). Since mutations happen at rate µ per site per generation, we can tabulate from the total branch lengths the expected number of mutations that lie on each of the types of branches. For the jth replicate, denote these expected values as B_j = (B₁, B₂, ... B₇). Treating the number of mutations on each branch type as Poisson distributed, the probability of observing b given B_j is then

    $$\Pr(b \mid B_j) = \prod_{i=1}^{7} \frac{e^{-B_i} B_i^{b_i}}{b_i!}$$

    To estimate the likelihood of M, we just average this probability over many replicates,

    $$\mathrm{lik}(M \mid b) \approx \frac{1}{x} \sum_{j=1}^{x} \Pr(b \mid B_j),$$

    where x is large. The CHEN and LI 2001 data consist of two sequences from each species (a single diploid sequence), while the method requires a single sequence. Intraspecies polymorphism may add to the b_i values, depending on which chromosome is considered. In these cases, we take the average likelihood over the different possible b_i values.
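    The averaging step can be sketched as follows. The sketch assumes that the number of mutations on each branch type is Poisson distributed with mean B_i given the simulated genealogies, and that the vectors B_j have already been produced by a coalescent simulator, which is not shown. All function names are hypothetical. The average is computed on the log scale because the individual probabilities can be very small.

```python
import math


def log_poisson_pmf(k, mean):
    """Log of the Poisson probability of k events with the given mean."""
    if mean == 0.0:
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def log_prob_b_given_B(b, B):
    """Log Pr(b | B_j): independent Poisson counts on the seven branch types."""
    return sum(log_poisson_pmf(k, m) for k, m in zip(b, B))


def log_likelihood(b, simulated_B):
    """Log of the Monte Carlo average of Pr(b | B_j) over all replicates."""
    logs = [log_prob_b_given_B(b, B) for B in simulated_B]
    top = max(logs)
    if top == float("-inf"):
        return top  # no replicate can produce the observed counts
    # log-sum-exp: factor out the largest term to avoid underflow
    return top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))
```

    A design note: with 5 × 10⁴ replicates per parameter combination, most of the running time goes into the simulation that generates the B_j vectors, not into this averaging.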

    The above equation describes how to estimate the likelihood of M for a single locus. Define M' as a vector containing the first 10 values of M. Estimation of the likelihood of M' over multiple loci is straightforward. Given a collection of k loci, define $\{M_i\}_{i=1}^{k}$ as a collection of corresponding M vectors, where the M_i are identical except for n_s (which is calculated for each locus). Define b_i as the vector b for the ith locus. Then, since unlinked loci are evolutionarily independent, we can estimate the likelihood of M' over multiple loci simply by taking the product of the individual lik(M|b) estimates:

    $$\mathrm{lik}(M' \mid b_1, \ldots, b_k) = \prod_{i=1}^{k} \mathrm{lik}(M_i \mid b_i)$$

    We have taken the approach of summarizing the data by b before performing maximum likelihood. Summary-likelihood methods have been quite useful in other situations (e.g., WEISS and VON HAESELER 1998; WALL 2000; FEARNHEAD and DONNELLY 2002) and are generally computationally much simpler than full-likelihood methods. For the case of estimating N_a, a full-likelihood approach including intragenic recombination does not look to be computationally feasible at this time.

    Of the 11 parameters that make up the model M, only 9 can freely vary. n_s is fixed from the actual data, while N_h is relevant only indirectly; it turns out that the simulations use only the ratios of the effective population sizes (i.e., N_a/N_h, N_c/N_h, etc.), not their actual values. The actual N_h comes into play when interpreting the simulation results (e.g., translating from scaled time to actual time). Ideally, one would like to let θ, ρ, N_c/N_h, N_g/N_h, N_o/N_h, N_a/N_h, T₁, T₂, and T₃ vary freely and determine which combination of parameter values maximizes the likelihood of observing the actual data. However, this is computationally prohibitive, so we fix those values for which we have prior information and let the others vary: N_a/N_h, T₁, T₂, and T₃ vary freely (at increments of 1.0, 0.25, 0.5, and 1.0, respectively), and we consider the following five schemes for the other parameters:

    Model 1: θ = ρ = 0.001/bp; N_c = N_g = N_o = 3N_h.

    Model 2: θn_s for each locus is proportional to the total inferred number of mutations (across all four species), and the average θ/bp (over all loci) is 0.001; ρ = 0.001/bp; N_c = N_g = N_o = 3N_h.

    Model 3: Same as model 2, but all CpG sites were excluded, and the average θ/bp (over all loci) is 0.00075.

    Model 4: θ = 0.001/bp; ρ = 0.002/bp; N_c = N_g = N_o = 3N_h.

    Model 5: θ = ρ = 0.001/bp; N_c = N_g = N_o = 6N_h.

    θ can be easily estimated from human sequence polymorphism data (e.g., WATTERSON 1975). Putatively neutral sites from recent resequencing studies of human variation suggest that θ = 0.001/bp is a good ballpark figure for the autosomes and that roughly one-fourth of all segregating mutations occur at CpG sites (e.g., NACHMAN and CROWELL 2000; PRZEWORSKI et al. 2000; TEMPLETON et al. 2000; EBERSBERGER et al. 2001; FRISSE et al. 2001). Model 2 tests how sensitive the results are to variation in θ across loci by taking the same average θ as model 1, but assuming that θn_s for each locus is proportional to the observed number of inferred mutations. (This is equivalent to estimating θ using WATTERSON 1975, assuming an average of θ = 0.001/bp.) The genome-wide average rate of crossing over in humans is r = 1.3 × 10⁻⁸/bp (YU et al. 2001). If N_h ≈ 10⁴, then ρ ≈ 5.2 × 10⁻⁴/bp. We take slightly larger ρ values to account for the unknown contribution of gene conversion to overall rates of recombination (see, e.g., FRISSE et al. 2001; PRZEWORSKI and WALL 2001). Finally, levels of nonhuman great ape diversity seem to be substantially higher than human diversity levels (DEINARD and KIDD 1999; KAESSMANN et al. 1999, KAESSMANN et al. 2001), but not enough data have been gathered to accurately estimate N_c/N_h, N_g/N_h, or N_o/N_h. We have chosen values that might plausibly reflect the total species diversity in chimps, gorillas, and orangs. Models 3–5 were chosen to explore the sensitivity of the results to the presence of hypermutable CpG sites, assumptions about the recombination rate, and assumptions about great ape population sizes, respectively.
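    The scaled recombination rate quoted above follows directly from ρ = 4N_hr, as the short sketch below shows. The helper names are hypothetical, and the per-generation mutation rate computed at the end is only the value implied by θ = 0.001/bp and N_h = 10⁴ under θ = 4N_hµ; it is not a figure reported in the text.

```python
def scaled_rate(n_e, per_generation_rate):
    """Population-scaled rate, 4 * N_e * (rate per site per generation)."""
    return 4.0 * n_e * per_generation_rate


def per_generation_rate(n_e, scaled):
    """Inverse of scaled_rate: recover the per-generation rate."""
    return scaled / (4.0 * n_e)


rho_crossing_over = scaled_rate(1e4, 1.3e-8)  # about 5.2e-4 per bp
implied_mu = per_generation_rate(1e4, 0.001)  # about 2.5e-8 per bp
print(rho_crossing_over, implied_mu)
```

    The values of ρ used in the models (0.001/bp and 0.002/bp) are therefore roughly two to four times the crossing-over estimate alone.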

    In addition to using the parameter combination that maximizes the likelihood as a point estimate, it would be useful to determine how much confidence we should place in the estimated values. To get a sense of how the likelihood varies as a function of T₁, for example, I calculate the (approximate) profile likelihood:

    $$L_p(T_1) = \max_{N_a/N_h,\, T_2,\, T_3} \mathrm{lik}(M' \mid b_1, \ldots, b_k)$$

    Approximate 95% confidence intervals are found by using the standard χ² approximation for the likelihood-ratio statistic 2 ln(L₀/L₁) (where L₀ is the maximum likelihood and L₁ is the profile likelihood at an alternative point). The likelihood functions calculated are not true profile likelihoods, since some of the nuisance parameters are not allowed to vary freely. So, it is not clear whether the standard χ² approximation is appropriate. Approximate profile likelihoods are calculated for N_a/N_h and T₁, and linear interpolation is used to estimate the log-likelihood for parameter values that are not directly estimated by simulation.
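    A minimal sketch of the interval calculation follows. It assumes one degree of freedom for the χ² approximation (a single parameter of interest), so that the cutoff lies about 1.92 log-likelihood units below the maximum, and it assumes that the profile log-likelihood has been evaluated on an increasing grid of parameter values. The function name and the toy numbers in the usage note are hypothetical.

```python
CHI2_95_1DF = 3.841459  # 95th percentile of chi-square with 1 d.f.


def profile_interval(grid, loglik, drop=CHI2_95_1DF / 2.0):
    """Approximate confidence limits from a gridded profile log-likelihood.

    grid: increasing parameter values; loglik: profile log-likelihood at
    each grid value. The limits are where the curve, linearly interpolated
    between grid points, falls `drop` units below its maximum.
    """
    peak = max(range(len(grid)), key=lambda i: loglik[i])
    cutoff = loglik[peak] - drop

    def walk(step):
        i = peak
        while 0 <= i + step < len(grid):
            j = i + step
            if loglik[j] < cutoff:
                # interpolate between the last point above the cutoff (i)
                # and the first point below it (j)
                frac = (loglik[i] - cutoff) / (loglik[i] - loglik[j])
                return grid[i] + frac * (grid[j] - grid[i])
            i = j
        return grid[i]  # cutoff never crossed: limit is the grid edge

    return walk(-1), walk(+1)
```

    For example, `profile_interval([1, 2, 3, 4, 5], [-10, -2, 0, -2, -10], drop=1.0)` returns `(2.5, 3.5)`. When a limit falls on the edge of the grid, the interval should be read as open on that side.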

    To verify the accuracy of the method, I run coalescent simulations with known T₁, T₂, T₃, and N_a/N_h values; then, I use the new method on the simulated data to estimate parameters and to compare the estimated values with the actual ones. These simulations modeled 50 loci of 500 bp each, with θ = ρ = 0.001/bp, N_c = N_g = N_o = 3N_h, T₁ = 5.0, T₂ = 8.0, T₃ = 14.0, and N_a/N_h = 5.0. The parameter values were chosen to roughly match both the CHEN and LI 2001 data and our a priori knowledge about species divergence times and ancestral population sizes. Five replicates were run; I analyzed each one under the assumptions of model 1 (see above). Note that this assumes an idealized situation, where the nuisance parameters are known exactly.

    All programs were written in C and are available from the author on request. A total of 5 × 10⁴ replicates were run for each model and parameter combination. To give a sense of the computational efficiency, the total simulations took 5 months to run on a pair of 1.7 GHz Pentium 4 processors.

    RESULTS

    The maximum-likelihood estimates for T₁, T₂, T₃, and N_a/N_h are presented in Table 2. The estimates across the five models are broadly similar; all of them estimate an ancestral population size five to six times larger than the current human effective population size, in keeping with previous studies (TAKAHATA and SATTA 1997; CHEN and LI 2001; TAKAHATA 2001). The estimates of T₁, the human-chimpanzee divergence time, are also roughly in line with expectations. If we assume that g = 25 years and N_h = 10⁴ (or that g = 20 years and N_h = 12,500), then these estimates range from 3.5 to 4.0 MYA. In contrast, the paleontological record suggests that uniquely human ancestors were around at least 4–4.5 MYA (WHITE et al. 1994; LEAKEY et al. 1995) and perhaps much earlier (HAILE-SELASSIE 2001; BRUNET et al. 2002). This disparity can easily be reconciled if both the average generation time and the current human effective population size are on the larger side of previous estimates (e.g., g = 25 years and N_h = 15,000). Given our uncertainty in parameter estimates, these values are quite plausible. If instead we were to assume the generation time and species divergence time were known, then we could use the results to estimate the current human effective population size. If T₁ = 6 MYA and g = 25 years, then the point estimates of N_h range from 15,000 to 17,100. The other species divergence times are also on the recent side; assuming once again that g = 25 years and N_h = 10,000, the estimated human-gorilla divergence time ranges from 5.0 to 5.5 MYA, while the estimated human-orangutan divergence time ranges from 12 to 13 MYA.
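    The conversions between scaled and calendar time used in this paragraph can be sketched as follows. Times are scaled in units of 4N_h generations, so one scaled unit corresponds to 4N_hg years; with g = 25 years and N_h = 10⁴ that is exactly 1 MY, which is why the scaled estimates and the times in MYA coincide under that assumption. The function names are hypothetical.

```python
def scaled_to_mya(t_scaled, n_h, generation_years):
    """Convert a time in units of 4*N_h generations to millions of years."""
    return t_scaled * 4.0 * n_h * generation_years / 1e6


def implied_n_h(t_scaled, t_mya, generation_years):
    """N_h implied by a scaled time estimate and an assumed calendar time."""
    return t_mya * 1e6 / (4.0 * generation_years * t_scaled)


print(scaled_to_mya(3.5, 1e4, 25))    # 3.5 MYA
print(scaled_to_mya(4.0, 12500, 20))  # 4.0 MYA
print(implied_n_h(4.0, 6.0, 25))      # 15000.0
print(implied_n_h(3.5, 6.0, 25))      # about 17143, i.e. roughly 17,100
```

    The last two lines reproduce the range of N_h quoted above for T₁ = 6 MYA and g = 25 years.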


    Table 2. Parameter estimates and confidence intervals for the CHEN and LI 2001 data

    To assess how much confidence we should place in the point estimates, I calculated approximate profile-likelihood curves and estimated ~95% confidence intervals. The intervals for T₁ and N_a/N_h are listed in Table 2. For both T₁ and N_a/N_h the intervals are quite narrow, which suggests that the estimates are precise. All four models exclude N_a/N_h ≤ 3.5 and N_a/N_h ≥ 7.1 from the approximate confidence intervals. For T₁, the lower boundaries range from 2.7 to 3.0 and the upper boundaries range from 4.2 to 4.8. If as before we take g = 25 years and N_h = 10⁴, the upper boundaries range from 4.2 to 4.8 MYA; these times are still more recent than the paleontological record would suggest. As mentioned above, a small increase in N_h is sufficient to reconcile the time estimates with the paleontological record. Figure 3 shows the profile-likelihood functions of N_a/N_h and T₁ for model 2. The curves quickly become quite steep, suggesting that the range of plausible values is not that large. So, even if the approximate confidence intervals were nonconservative, it is likely that conservative ones would not differ much from the intervals listed in Table 2. The corresponding likelihood curves for the other models are qualitatively similar to those in Figure 3.


    Figure 3. Approximate profile-likelihood curves under model 2. (A) The curve for N_a/N_h; (B) the curve for T₁. In both cases, the y-axis is the maximal log-likelihood given a particular value of the parameter (see METHODS for details). The shaded horizontal line shows the cutoff for the ~95% confidence intervals.

    To verify the accuracy of the method, I applied it to five simulated data sets with known parameter values (see METHODS). Each one had actual values of T₁ = 5.0, T₂ = 8.0, T₃ = 14.0, and N_a/N_h = 5. The estimated parameter values, along with the confidence intervals for T₁ and N_a/N_h, are given in Table 3. The means of the parameter estimates are 5.0, 8.1, 14.0, and 4.8 for T₁, T₂, T₃, and N_a/N_h, respectively, which suggests that the method has no or low bias. In addition, the confidence intervals for T₁ and N_a/N_h contain the true value all five times. Due to the large computational burden, it was not possible to run enough replicates to accurately estimate the coverage properties of the confidence intervals.


    Table 3. Parameter estimates and confidence intervals for simulated data

    Comparing the different rows in Table 2 can give us some idea of how sensitive the results are to assumptions about the nuisance parameters (i.e., θ, ρ, N_c/N_h, N_g/N_h, and N_o/N_h). Since the results from all of the models are very similar, the particular assumptions made do not appear to be very important. In particular, unlike the two-species maximum-likelihood method of TAKAHATA et al. 1995, the results do not seem to be very sensitive to variation in mutation rates across loci. This may be due to the information about locus-specific mutation rates contained in the outgroup species or because the actual data have very little variation in mutation rates across loci.

    DISCUSSION

    Estimating ancestral population sizes has been an active research area for several years. The work presented here improves on previous efforts by explicitly incorporating intragenic recombination (see also SATTA et al. 2000) and by efficiently utilizing data from outgroup species. The estimates of the human-chimpanzee N_a are five to six times larger than the current human effective population size (see Table 2). Although most previous studies came to similar conclusions, it was not clear how much confidence to place in these estimates because of unrealistic assumptions, such as no recombination or no variation in mutation rates (TAKAHATA and SATTA 2002). The narrow confidence intervals and simulation results presented here (Tables 2 and 3) provide additional evidence that ancestral population sizes were substantially larger than the current human effective population size.

    One recent study that came to a different conclusion (namely, that N_a is roughly as small as N_h) incorporated variation in mutation rates across loci but not intragenic recombination (YANG 2002). Recombination tends to decrease the variance in estimated branch lengths across loci; because of this, models that assume no recombination tend to underestimate N_a (TAKAHATA and SATTA 2002). Further work must be done to quantify how model assumptions (both here and in other studies) affect estimates of the ancestral population size.

    Although this application focuses on the human-chimpanzee N_a, the same method can be used to estimate N_a from other taxa, as long as there are orthologous sequence data from three or more species (including at least one outgroup) at multiple unlinked loci. Below, I discuss issues that might affect the general applicability of the method.

    Likelihood model:

    One possible criticism of the model is that the relative locations of the segregating mutations are ignored. However, this is not likely to be very important, since the number and the pattern of segregating mutations are far more informative. Incorporating the segregating site locations may lead to narrower confidence intervals and more accurate estimation of the likelihood function, but excluding them is not expected to bias the results in either direction. Given the results, it does not seem to be worth the substantial computational burden to consider the full-likelihood model.

    Mutational model:

    The mutational model that was adopted makes no distinction between transitions and transversions and assumes the mutation rate at each site in a locus is the same. However, some sites have higher mutation rates than others (NACHMAN and CROWELL 2000; TEMPLETON et al. 2000), which would increase the number of sites experiencing multiple mutations. Those multiply hit sites with fewer than three segregating nucleotides would then be misclassified by the model. In primates, the transition rate away from CpG sites is thought to be elevated by more than an order of magnitude due to methylated-cytosine mutagenesis (e.g., JONES et al. 1992; GIANNELLI et al. 1999). To test whether homoplasies from multiply hit CpG sites affected the parameter estimates, I reran model 2 excluding all CpG sites (i.e., model 3). Both the maximum-likelihood estimate and the shape of the profile-likelihood curves are almost identical (Table 2; results not shown), suggesting that the results presented here are relatively insensitive to the effects of multiple mutations at CpG sites. Calculations suggest that other proposed sites with elevated mutation rates, such as mononucleotide runs or DNA polymerase α-arrest sites (KRAWCZAK and COOPER 1991; TEMPLETON et al. 2000), are too rare to appreciably increase the expected number of homoplasies (results not shown). For studies of species with larger levels of divergence, the effects of homoplasies may be a more serious concern. Future work will concentrate on implementing a finite-site mutation model in the maximum-likelihood scheme described here.

    Molecular clock:

    The method described here assumes that the rate of mutation per unit time is the same on all branches. This is likely a reasonable assumption for the data considered here. Noncoding regions are less likely to be affected by natural selection than are the coding regions analyzed in other studies. Also, there is no reason to assume substantial differences in mutation rates (per generation) between humans and great apes. The data on generation times are sparse; EYRE-WALKER and KEIGHTLEY (1999) cite a time of g = 23 years in chimpanzees, while estimates of current human generation times are ~30 years (SIGURÐARDÓTTIR et al. 2000; TREMBLAY and VÉZINA 2000). Average human generation times (over the last several million years) may be substantially smaller. Indeed, the CHEN and LI (2001) data show no evidence for more mutations on the chimpanzee branch than on the human branch (CHEN and LI 2001; results not shown), suggesting that the long-term average generation times for humans and chimpanzees are quite similar.
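
    To see why the branch comparison is informative, consider what the quoted generation times would imply if they had held over the whole divergence. The sketch below assumes equal per-generation mutation rates in both lineages (as argued above); the value of mu is a placeholder that cancels out of the ratio and is not taken from the article.

```python
def per_year_rate(mu_per_generation: float, generation_time_years: float) -> float:
    """Convert a per-generation mutation rate into a per-year rate."""
    return mu_per_generation / generation_time_years


# Generation times quoted in the text.
g_chimp = 23.0  # years
g_human = 30.0  # years

# Placeholder per-generation rate (hypothetical); it cancels in the ratio.
mu = 2e-8

# Expected ratio of mutations on the chimpanzee branch to the human branch
# if these generation times had applied over the entire divergence.
ratio = per_year_rate(mu, g_chimp) / per_year_rate(mu, g_human)

if __name__ == "__main__":
    print(f"expected chimpanzee/human branch ratio: {ratio:.2f}")
```

    Under these assumptions the chimpanzee branch would carry roughly 30% more mutations than the human branch, so the absence of such an excess is consistent with similar long-term average generation times.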

    For other taxa, the clock assumption may not be appropriate. It would be straightforward to generalize the model to have different rates of evolution on different branches and to estimate these as well as species divergence times and ancestral population sizes. More sequence data and more computational time would be required to accurately estimate the additional parameters, and the method (with variable rates) may not be feasible with more than three species.

    Nuisance parameters:

    Although the goal of this article is to estimate ancestral population sizes and species divergence times, the model presented here also includes other parameters, such as θ, ρ, or N_c/N_h. The reason ρ is included is not to estimate the recombination rate from divergence data (which would be somewhat challenging). Rather, the values of parameters like ρ affect the likelihoods, so some assumptions must be made about them. In the interest of computational tractability, I have chosen plausible values for θ, ρ, N_c/N_h, N_g/N_h, and N_o/N_h. Comparing model 1 with models 4 and 5 suggests that the choice of particular values for these other parameters may not affect the estimates of the parameters of interest. Further simulations show that this is true for a wider range of values (ρ = 0.0005–0.003/bp; N_h ≤ N_c, N_g, N_o ≤ 6N_h), although it should be pointed out that assuming no recombination (as previous methods do) leads to a likelihood of 0, due to the presence of several incompatibilities within loci. So, the estimates of N_a/N_h, T₁, T₂, and T₃ are robust to the assumptions made about the other parameters.
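
    The remark about incompatibilities can be made concrete. Assuming "incompatibility" is meant in the usual four-gamete sense (two biallelic sites at which all four haplotype combinations are observed, which cannot arise without recombination when each site mutates at most once), a minimal sketch of the check looks as follows. The function name and the toy data are hypothetical and are not the article's implementation.

```python
from itertools import combinations


def incompatible_pairs(haplotypes: list[str]) -> list[tuple[int, int]]:
    """Return index pairs of biallelic sites that show all four gametes."""
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")
    # Transpose so that columns[j] holds the alleles observed at site j.
    columns = [[h[j] for h in haplotypes] for j in range(n_sites)]
    biallelic = [j for j, col in enumerate(columns) if len(set(col)) == 2]
    pairs = []
    for i, j in combinations(biallelic, 2):
        gametes = set(zip(columns[i], columns[j]))
        if len(gametes) == 4:  # all four two-site haplotypes are present
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = ["AAC", "ATC", "TAC", "TTG"]
    print(incompatible_pairs(sample))
```

    A single incompatible pair within a locus is enough to give the data probability 0 under a model with no recombination and no recurrent mutation, which is why ρ cannot simply be set to 0.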

    Speciation model:

    Estimates of N_a/N_h and T₁ provide information about the mean and the variance of the distribution of coalescent times of a single human and a single chimpanzee sequence. Under the simple speciation model considered here, greater variances in coalescent times must be the result of larger ancestral population sizes. Some researchers have suggested that there is often gene flow between "incipient species" (e.g., WU 2001). A model of limited gene flow (prior to strict isolation) will lead to a greater variance in coalescent times. In particular, if there were gene flow after the initial divergence of the human and chimpanzee lines, then the ancestral population size (before the initial divergence) would be overestimated. Further work must be done to develop methods that can distinguish between limited gene flow and large ancestral population sizes using orthologous sequence data.
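
    The link between ancestral population size and the variance in coalescent times follows from standard coalescent results. As a sketch, assume (these conventions are not stated in the text) that T₁ is measured in generations, that N_a is the diploid effective size of a constant, panmictic ancestral population, and that there is no gene flow after the split. The symbols T_HC and X are introduced here only for illustration.

```latex
% T_{HC}: coalescence time of one human and one chimpanzee sequence at a site
% X: waiting time to coalescence in the ancestral population (exponential)
\begin{align}
  T_{HC} &= T_1 + X, \qquad X \sim \text{Exponential with mean } 2N_a, \\
  \mathbb{E}[T_{HC}] &= T_1 + 2N_a, \\
  \operatorname{Var}[T_{HC}] &= (2N_a)^2 = 4N_a^2 .
\end{align}
```

    Under strict isolation the variance depends only on N_a, so any additional variance across loci, including variance produced by gene flow after the initial split, is absorbed into a larger estimate of N_a.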

    ACKNOWLEDGMENTS

    I thank M. Hare, M. Przeworski, N. Takahata, J. Wakeley, and an anonymous reviewer for comments on an earlier version of this manuscript. J.D.W. was supported in part by a National Science Foundation Postdoctoral Fellowship in Bioinformatics.

    Manuscript received February 12, 2002; Accepted for publication October 14, 2002.

    LITERATURE CITED

    BRUNET, M., F. GUY, D. PILBEAM, H. T. MACKAYE, and A. LIKIUS et al., 2002 A new hominid from the Upper Miocene of Chad, Central Africa. Nature 418:145-151.

    CABALLERO, A., 1994 Developments in the prediction of effective population size. Heredity 73:657-679.

    CHEN, F.-C. and W.-H. LI, 2001 Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.

    DEINARD, A. and K. KIDD, 1999 Evolution of a HOXB6 intergenic region within the great apes and humans. J. Hum. Evol. 36:687-703.

    EASTEAL, S. and G. HERBERT, 1997 Molecular evidence from the nuclear genome for the time frame of human evolution. J. Mol. Evol. 44:S121-S132.

    EBERSBERGER, I., D. METZLER, C. SCHWARZ, and S. PÄÄBO, 2001 Genomewide comparison of DNA sequences between humans and chimpanzees. Am. J. Hum. Genet. 70:1490-1497.

    EYRE-WALKER, A. and P. D. KEIGHTLEY, 1999 High genomic deleterious mutation rates in hominids. Nature 397:344-347.

    FEARNHEAD, P. and P. DONNELLY, 2002 Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. B 64:657-680.

    FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, and J. DONFACK et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69:831-843.

    GABUNIA, L. and A. VEKUA, 1995 A Plio-Pleistocene hominid from Dmanisi, East Georgia, Caucasus. Nature 373:509-512.

    GIANNELLI, F., T. ANAGNOSTOPOLOUS, and P. M. GREEN, 1999 Mutation rates in humans. II. Sporadic mutation-specific rates and rate of detrimental human mutations inferred from hemophilia B. Am. J. Hum. Genet. 65:1580-1587.

    HAILE-SELASSIE, Y., 2001 Late Miocene hominids from the Middle Awash, Ethiopia. Nature 412:178-181.

    HARADA, K., S. KUSAKABE, T. YAMAZAKI, and T. MUKAI, 1993 Spontaneous mutation rates in null and band-morph mutations of enzyme loci in Drosophila melanogaster. Jpn. J. Genet. 68:605-616.

    HARDING, R. M., S. M. FULLERTON, R. C. GRIFFITHS, J. BOND, and M. J. COX et al., 1997 Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60:772-789.

    HEY, J., 1994 Bridging phylogenetics and population genetics with gene tree models, pp. 435–449 in Molecular Ecology and Evolution: Approaches and Applications, edited by B. SCHIERWATER, B. STREIT, G. P. WAGNER and R. DESALLE. Birkhäuser Verlag, Basel, Switzerland.

    HORAI, S., K. HAYASAKA, R. KONDO, K. TSUGANE, and N. TAKAHATA, 1995 Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc. Natl. Acad. Sci. USA 92:532-536.

    HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183-201.

    HUDSON, R. R., 1992 Gene trees, species trees and the segregation of ancestral alleles. Genetics 131:509-512.

    JONES, P. A., W. M. RIDEOUT, J. C. SHEN, C. H. SPRUCK, and Y. C. TSAI, 1992 Methylation, mutation and cancer. Bioessays 14:33-36.

    KAESSMANN, H., V. WIEBE, and S. PÄÄBO, 1999 Extensive nuclear DNA sequence diversity among chimpanzees. Science 286:1159-1161.

    KAESSMANN, H., V. WIEBE, G. WEISS, and S. PÄÄBO, 2001 Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet. 27:155-156.

    KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

    KRAWCZAK, M. and D. N. COOPER, 1991 Gene deletions causing human genetic disease: mechanisms of mutagenesis and the role of the local DNA sequence environment. Hum. Genet. 86:425-441.

    KREITMAN, M., 1983 Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 304:412-417.

    KUMAR, S. and B. HEDGES, 1998 A molecular timescale for vertebrate evolution. Nature 392:917-920.

    KUMAR, S. and S. SUBRAMANIAN, 2002 Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. USA 99:803-808.

    LEAKEY, M. G., C. S. FEIBEL, I. MCDOUGALL, and A. WALKER, 1995 New four-million-year-old hominid species from Kanapoi and Allia Bay, Kenya. Nature 376:565-571.

    NACHMAN, M. W. and S. L. CROWELL, 2000 Estimate of the mutation rate per nucleotide in humans. Genetics 156:297-304.

    NEI, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.

    NIELSEN, R. and J. WAKELEY, 2001 Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158:885-896.

    NORDBORG, M., 2001 Coalescent theory, pp. 179–212 in Handbook of Statistical Genetics, edited by D. BALDING, M. BISHOP and C. CANNINGS. Wiley, Chichester, UK.

    PRZEWORSKI, M. and J. D. WALL, 2001 Why is there so little intragenic linkage disequilibrium in humans? Genet. Res. 77:143-151.

    PRZEWORSKI, M., R. R. HUDSON, and A. DI RIENZO, 2000 Adjusting the focus on human variation. Trends Genet. 16:296-302.

    RUVOLO, M., 1997 Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol. 14:248-265.

    SATTA, Y., C. O'HUIGIN, N. TAKAHATA, and J. KLEIN, 1993 The synonymous substitution rate of the major histocompatibility complex loci in primates. Proc. Natl. Acad. Sci. USA 90:7480-7484.

    SATTA, Y., J. KLEIN, and N. TAKAHATA, 2000 DNA archives and our nearest relative: the trichotomy problem revisited. Mol. Phylogenet. Evol. 14:259-275.

    SIGURÐARDÓTTIR, S., A. HELGASON, J. R. GULCHER, K. STEFANSSON, and P. DONNELLY, 2000 The mutation rate in the human mtDNA control region. Am. J. Hum. Genet. 66:1599-1609.

    SWISHER, C. C., G. H. CURTIS, T. JACOB, A. G. GETTY, and A. SUPRIJO et al., 1994 Age of the earliest known hominids in Java, Indonesia. Science 263:1118-1121.

    TAKAHATA, N., 1986 An attempt to estimate the effective size of the ancestral species common to two extant species from which homologous genes are sequenced. Genet. Res. 48:187-190.

    TAKAHATA, N., 1991 Trans-species polymorphism of HLA molecules, founder principle, and human evolution, pp. 29–49 in Molecular Evolution of the Major Histocompatibility Complex, edited by J. KLEIN and D. KLEIN. Springer, Heidelberg, Germany.

    TAKAHATA, N., 1993 Allelic genealogy and human evolution. Mol. Biol. Evol. 10:2-22.

    TAKAHATA, N., 2001 Molecular phylogeny and demographic history of humans, pp. 299–305 in Humanity From African Naissance to Coming Millennia—Colloquia in Human Biology and Palaeoanthropology, edited by P. V. TOBIAS, M. A. RAATH, J. MOGGI-CECCHI and G. A. DOYLE. Firenze University Press, Firenze, Italy.

    TAKAHATA, N. and Y. SATTA, 1997 Evolution of the primate lineage leading to modern humans: phylogenetic and demographic inferences from DNA sequences. Proc. Natl. Acad. Sci. USA 94:4811-4815.

    TAKAHATA, N., and Y. SATTA, 2002 Pre-speciation coalescence and the effective size of ancestral populations, pp. 52–71 in Modern Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.

    TAKAHATA, N., Y. SATTA, and J. KLEIN, 1995 Divergence time and population size in the lineage leading to modern humans. Theor. Popul. Biol. 48:198-221.

    TEMPLETON, A. R., A. G. CLARK, K. M. WEISS, D. A. NICKERSON, and E. BOERWINKLE et al., 2000 Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am. J. Hum. Genet. 66:69-83.

    TREMBLAY, M. and H. VÉZINA, 2000 New estimates of intergenerational time intervals for the calculation of age and origins of mutations. Am. J. Hum. Genet. 66:651-658.

    WAKELEY, J. and J. HEY, 1997 Estimating ancestral population parameters. Genetics 145:847-855.

    WALL, J. D., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17:156-163.

    WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.

    WEISS, G. and A. VON HAESELER, 1998 Inference of population history using a likelihood approach. Genetics 149:1539-1546.

    WHITE, T. D., G. SUWA, and B. ASFAW, 1994 Australopithecus ramidus, a new species of early hominid from Aramis, Ethiopia. Nature 371:306-312.

    WU, C.-I, 1991 Inferences of species phylogeny in relation to segregation of ancient polymorphisms. Genetics 127:429-435.

    WU, C.-I, 2001 The genic view of the process of speciation. J. Evol. Biol. 14:851-866.

    YANG, Z., 1997 On the estimation of ancestral population sizes of modern humans. Genet. Res. 69:111-116.

    YANG, Z., 2002 Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162:1811-1823.

    YU, A., C. ZHAO, Y. FAN, W. JANG, and A. J. MUNGALL et al., 2001 Comparison of human genetic and sequence-based physical maps. Nature 409:951-953.

    (Jeffrey D. Wall)
