当前位置: 首页 > 期刊 > 《基因杂志》 > 2003年第1期 > 正文
编号:10585729
Estimating Effective Population Size and Migration Rates From Genetic Samples Over Space and Time
http://www.100md.com 《基因杂志》2003年第1期
     a Institute of Zoology, Zoological Society of London, London NW1 4RY, United Kingdom;, http://www.100md.com

    b Department of Zoology, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada;, http://www.100md.com

    ABSTRACT;, http://www.100md.com

    In the past, moment and likelihood methods have been developed to estimate the effective population size (Ne) on the basis of the observed changes of marker allele frequencies over time, and these have been applied to a large variety of species and populations. Such methods invariably make the critical assumption of a single isolated population receiving no immigrants over the study interval. For most populations in the real world, however, migration is not negligible and can substantially bias estimates of Ne if it is not accounted for. Here we extend previous moment and maximum-likelihood methods to allow the joint estimation of Ne and migration rate (m) using genetic samples over space and time. It is shown that, compared to genetic drift acting alone, migration results in changes in allele frequency that are greater in the short term and smaller in the long term, leading to under- and overestimation of Ne, respectively, if it is ignored. Extensive simulations are run to evaluate the newly developed moment and likelihood methods, which yield generally satisfactory estimates of both Ne and m for populations with widely different effective sizes and migration rates and patterns, given a reasonably large sample size and number of markers.

    ONE of the major applications of population genetics theory has been in the estimation of genetic and demographic properties of populations. Quantities such as the mutation rate, the migration rate, and the recombination rate are all in principle estimable from population genetic data, at least as products with the effective population size.am21+v2, 百拇医药

    The variance effective population size (Ne) is an important quantity in evolutionary biology, as a description of the rate at which genetic variance changes due to genetic drift. A variety of methods have been developed to estimate Ne (SCHWARTZ et al. 1999 ), including predictive equations based on life history data (reviewed in CABALLERO 1994 ; WANG and CABALLERO 1999 ), the lethal allelism approach (DOBZHANSKY and WRIGHT 1941 ), and a linkage disequilibrium approach (HILL 1981 ). Several authors have developed methods to estimate Ne for a population from temporal genetic data, that is, data on the frequencies of marker alleles taken at two or more time points from the same populations. In principle, if other population genetic processes such as mutation, migration, and selection are negligible in their effects on allele frequency change over the time frame of the study, then the observed changes in allele frequency can be assumed to be due to the effects of genetic drift. Using what is known about the effects of genetic drift, then, we could estimate Ne from such temporal data. This so-called "temporal approach" was pioneered by KRIMBAS and TSAKAS 1971 , with subsequent substantial statistical refinements by others (e.g., PAMILO and VARVIO-AHO 1980 ; NEI and TAJIMA 1981 ; POLLAK 1983 ; TAJIMA and NEI 1984 ; WAPLES 1989 ; WILLIAMSON and SLATKIN 1999 ; ANDERSON et al. 2000 ; WANG 2001A ). These methods have been applied to a large variety of species and populations (e.g., FUNK et al. 1999 ; LABATE et al. 1999 ; FIUMERA et al. 1999 , FIUMERA et al. 2000 ; KANTANEN et al. 1999 ; TURNER et al. 1999 ; see also many earlier references cited in WILLIAMSON and SLATKIN 1999 ).

    A key assumption made by all these approaches, however, is that the only factor involved in allele frequency change is genetic drift. All systematic forces (selection, mutation, and migration) are assumed to be absent. In our view, it is often legitimate to ignore the effects of mutation on the timescale involved in most empirical studies. It also may be reasonable to neglect the effects of selection because direct selection on most markers may be unlikely to be strong enough to cause substantial changes in their frequencies. (The effects of indirect selection acting on other linked loci on the allele frequency change of the marker locus may arguably be included as a "drift" effect and legitimately included in a Ne term.) However, for most populations the effects of migration are not negligible. A major weakness of these temporal methods is that they ignore the effects of migration, which can substantially bias estimates of Ne, either upward or downward (see below). The purpose of this article is to include the effects of migration in the estimation of demographic parameters from temporal genetic data.

    Migration can affect the estimation of Ne in two ways. In the short term, migration can change allele frequencies quite quickly. Given that the temporal methods estimate Ne by measuring the pace at which allele frequencies change, migration causes a population to behave as if strong drift was changing allele frequencies very rapidly, resulting in an estimate of Ne that is too low. In the longer term, the problem is reversed. Constant migration and drift would cause the populations to approach an equilibrium level of genetic differentiation and the rate of change of allele frequencies in a deme to slow down to approach that predicted by the effective size of the whole metapopulation. Thus the effective population size of the deme would be substantially overestimated in the long term. These effects are discussed later in this article in more mathematical detail.;i-?, 百拇医药

    This problem can be avoided by simultaneously estimating Ne and the migration rate, which is the approach taken in this article. It turns out that while both migration and drift affect allele frequency changes in a population, they do so differently in terms of the direction and magnitude of these changes. Under migration, the allele frequency of a focal population would tend to become increasingly similar to that of the source population. Under drift, however, the allele frequency of a focal population changes purely at random, irrespective of that of the source population. Furthermore, the magnitude of allele frequency change of a focal population depends on the allele frequency difference between the focal and source populations under migration. Migration has no effect on alleles that happen to be at the same frequencies both in the deme in question and in the populations providing those migrants, but migration is expected to change allele frequencies substantially in cases when the source and recipient demes have very different allele frequencies. Genetic drift, on the other hand, is not affected by differences in allele frequencies among demes, but depends only on the local allele frequency. These differences in the patterns of changes of allele frequencies due to migration and drift allow us to estimate them separately, so long as enough data are from multiple loci. Fortunately, these kinds of data are often available with modern biochemical methods.

    The estimation of migration rate and migrant number has a long history. The oldest and most widely used genetic method assumes that populations are at equilibrium in a system approximated by WRIGHT's (1931) island model and then uses estimates of Wright's FST to calculate 1/(4Ne m + 1), where m is the local immigration rate (see SLATKIN 1987 ). This method has been roundly criticized (WHITLOCK and MCCAULEY 1999 ), in particular because of the unrealistic assumptions it requires about the demographic structure of the species. Other approaches include genetic stock identification (MILNER et al. 1985 ), assignment methods (RANNALA and MOUNTAIN 1997 ), two-locus descent measures (VITALIS and COUVET 2001 ), coalescent methods (NIELSEN and WAKELEY 2001 ), isolation-by-distance methods (see review by ROUSSET 2001 ), and migration matrix-likelihood models (BEERLI and FELSENSTEIN 2001 ). The first two approaches require detailed knowledge about the possible sources of migrants and do not allow for geetic drift. The latter four approaches assume that the populations are at an equilibrium between the effects of genetic drift and migration. There is a need for a method that does not make restrictive assumptions about genetic and demographic equilibria, yet allows simultaneous estimation of the effects of drift and migration.

    Here we extend previous temporal methods for estimating the Ne of a single isolated population so that they can be used to estimate Ne and m jointly for a metapopulation. We first develop an expectation based on a moment estimation approach, which allows an approximate description of the behavior of the estimator and the system. We follow this by developing a likelihood method based on that of WILLIAMSON and SLATKIN 1999 , ANDERSON et al. 2000 , and WANG 2001A . Extensive simulations are run to check the performance of the moment and likelihood estimators in terms of precision and accuracy for the estimation of Ne and m. Two metapopulation models (a focal small population receiving immigrants from an infinitely large source population and two small populations connected by migration) are considered, and the applications of the methods to more complicated metapopulation models (e.g., a small number of demes in island or stepping-stone migration models) were checked by simulations. As an example for the application of our methods, a spatial and temporal genetic data set on Triturus newts (JEHLE et al. 2001) was reanalyzed by our methods for the simultaneous estimation of migration and drift.

    DEFINITIONS AND ASSUMPTIONSk&4[-nx, 百拇医药

    We use temporal data to estimate the effective size (Ne) and immigration rate (m) of a partially isolated population. For brevity, we refer to the population for which parameters are being estimated as the focal population and that from which the migrants come as the source population. Data and parameters for the source population are given the subscript A, and those for the focal population, B. In the first section, we assume that the source population is infinitely large and therefore not changing genetically through the time frame of the study. In the section after that, we assume that both populations are finite and jointly estimate the properties of both. In the latter case, source and focal populations are obviously reciprocal. Throughout we assume diploidy. For data from haploid species or loci, the Ne estimated by these models would be double the actual effective size.k&4[-nx, 百拇医药

    For the infinite-source population model, a minimum of three samples is required to estimate Ne and m of the focal population: two temporally spaced ones from the focal population and one taken at any time from the source. For the two small population models, a minimum of four samples is necessary, two temporally spaced ones taken concurrently from each population. Let us denote the size of these samples S, with subscripts A or B to describe whether the sample is from the source or focal population, respectively, and a subscript t to describe the generation in which the sample is taken. Thus SB,0 is the number of individuals sampled from the focal population in the initial generation. We use x to refer to the relative frequency of a particular allele in a given sample. Therefore x is an estimator of the true frequency (p) of that allele at that time in that population. For convenience, we define q = 1 - p. These quantities, x, p, and q, are subscripted in the same way as S.

    Several summary statistics of these quantities turn out to be useful. We define pAB pA,0 - pB,0 to be the initial difference between the allele frequencies of the two populations and pB,t pB,t - pB,0 to be the change in allele frequency by generation t in deme B. In the derivations of the moment-based estimators, we also use {delta}kcl'a, http://www.100md.com

    B,i to refer to the change in generation i due to genetic drift of the allele frequency in population B. Each of the quantities can refer to population A if the subscript B is replaced by A.kcl'a, http://www.100md.com

    We make several assumptions throughout. These are that the alleles under study are not under direct selection, that mutation is a negligible source of allele frequency change, that potential sources of migrants to the focal populations can be identified a priori, that all samples are randomly taken from their populations, and that the sampling event does not change the available pool of reproductive individuals. The latter can be realized by sampling after reproduction, sampling with replacement, or sampling from a population much larger than the sample. This sampling scheme therefore corresponds to WAPLES's (1989) plan 1.

    summarizes most of the terms used in this article.zq.d, http://www.100md.com

    fig.ommitteedzq.d, http://www.100md.com

    Table 1. Definitions of termszq.d, http://www.100md.com

    IMMIGRATION FROM AN INFINITE-SOURCE POPULATIONzq.d, http://www.100md.com

    Theoretical background:zq.d, http://www.100md.com

    Imagine that the focal population is receiving migrants at rate m per generation from an infinite, unchanging source population. For effectively neutral alleles, changes in allele frequency over a short period of time are a function of the local effective population size and migration rate. Mutations are negligible compared to drift and migration because of their rare occurrence and the short time span. Under these conditions, and assuming that migration precedes population regulation, the allele frequency at generation i is given byzq.d, http://www.100md.com

    Note then thatzq.d, http://www.100md.com

    Expanding across generations, we can findzq.d, http://www.100md.com

    sozq.d, http://www.100md.com

    Under the assumptions of Wright-Fisher sampling, the {delta}

    terms are independently binomially distributed. The expected value of all of the {delta}r:o, 百拇医药

    terms is zero, because drift does not change on average the allele frequency. The variance of each of these terms is given by pB,i(1 - pB,i)/Ne for haploids or pB,i(1 - pB,i)/2Ne for diploids. In reality, we do not know pB,i for any generation between samples, and we can only estimate pB,i for generations in which there was a sample. This uncertainty is the root of error in the estimation of statistics derived from these equations. In this article, we first estimate this error by using the approximate moments of the distribution of allele frequencies, and then we describe a likelihood approach to the estimation of Ne and m.r:o, 百拇医药

    Moment approximations:

    Ne and m affect the first and second moments of the change in allele frequency over time. Under certain circumstances, we can use these moments to estimate m and Ne. We do this in a two-step process, first estimating m from the mean change in allele frequencies at all loci and then using this value in the estimation of Ne from the second moment of allele frequency change.l@m6#32, 百拇医药

    Estimating the migration rate: The change in allele frequency is given by above. Taking the expectation of {Delta} pB,t we findl@m6#32, 百拇医药

    because the expected values of all the {delta}l@m6#32, 百拇医药

    terms are zero. Assuming that t is known without error, we can therefore make a one-locus estimate of (1 - (1 - m)t) from pB,t/pAB.l@m6#32, 百拇医药

    To predict the optimal sampling strategy and to approximate standard errors, we estimate the error variance of an estimate of m. The expected error in estimating a ratio is an approximate function of the error variance of estimating the numerator and the denominator and the error covariance between the two (see LYNCH and WALSH 1998 , Appendix 1). The error in estimating (1 - (1 - m)t) can be approximated from the error in estimating pB,t/pAB. The error variances (Verr[ ]) and error covariance (COVerr) are given by

    Note that the 2's in the denominators drop out when these equations are applied to haploids. The approximation in the first equation here comes from using the initial value of pq to predict the amount of variance in final allele frequency due to genetic drift.hm:m, 百拇医药

    Using these last equations and Equation A1.19b from LYNCH and WALSH 1998 , we can find the expected error variance for the estimate of (1 - (1 - m))t:hm:m, 百拇医药

    If the three sample sizes are roughly equivalent, then the error in estimating m is influenced more by the allele frequency in population B than by that in the source population, especially if the product mt is small. The terms involving only the sample sizes, m, and t are presumably constant across loci, so the error variance is largely proportional to 1/({Delta} pAB)2. The minimum variance possible when combining L loci can be obtained by weighting each locus by the inverse of its expected error variance. The best estimator of m is given approximately by

    where the weight for locus l is wl = (xAB)2, Fl = xB,t/xAB, and W is the sum of wl across loci.)c@5{:, http://www.100md.com

    Estimating Ne: Under the assumption that the total change in allele frequency per generation is not large (i.e., Ne is not very small and m is not very large), each of the terms in the equation describing allele frequency change is roughly independent of the others. The variance in allele frequency change at generation i due to genetic drift within the focal population is V[{delta})c@5{:, http://www.100md.com

    B,i] = pB,iqB,i/2Ne,B. If the allele frequency is not changing rapidly, then we can assume that the value of pq is approximately the same each generation. A reasonable estimator of the mean value of pq is (xB,0(1 - xB,0) + xB,t (1 - xB,t))/2. This is implicitly assuming that the allele frequency on average takes a linear path from the initial to the final value, but in any case is an ad hoc approximation. Then using , we can find

    where M = (1 - (1 - m)2t)/((2 - m)m). Again, the coefficient of Ne,B of 2 drops out when this formula is used for haploids. From and using the estimate of m, we obtain an estimator of 1/(2Ne,B),6;3vp, 百拇医药

    (10)6;3vp, 百拇医药

    where = (1 - (1 - )2t)/((2 -6;3vp, 百拇医药

    In general we do not have exact information about the change in population allele frequencies, but only an estimate based on sample data. Because of sampling variance, E[xB,t2] is biased upward as an estimator of pB,t2:6;3vp, 百拇医药

    Thus we can estimate6;3vp, 百拇医药

    The error variance per locus for estimating 1/(2Ne) is approximately proportional to 1/ (calculations not shown), so a minimum-variance estimator of 1/(2Ne) can be obtained by weighting alleles by approximately, estimated by (xB,0(1 - xB,0) + xB,t(1 - xB,t))/2.

    Choosing the right value of t: With a large separation in time between the first and second samples, estimates of m become problematic. The estimate of (1 - (1 - m)t) becomes more and more similar for different values of m as t gets larger (limt"->"17h4^, 百拇医药

    {infty}
[{delta}17h4^, 百拇医药

    (1 - (1 - m)t)/{delta}17h4^, 百拇医药

    m] = 0), yet the error variance in estimating this quantity increases linearly with t. Furthermore, as t increases, the allele frequency change approaches a distribution determined largely by the relative sizes of the effective size and m, rather than by the absolute values of either. When t is large enough that the populations reach an equilibrium between genetic drift and migration, then "->"17h4^, 百拇医药

    0, irrespective of the actual true migration rate. This is clear by inspecting the single-locus estimator of m, = 1 - t. As t "->"17h4^, 百拇医药

    {infty}

    , xB,t no longer increases and xB,t/xAB is expected to reach a constant smaller than one with time, therefore resulting in "->"{nv$ft, 百拇医药

    0.{nv$ft, 百拇医药

    If the time interval is too short and m too small, then there is less power to measure m because too few migration events may have occurred. Larger samples with more informative markers are therefore necessary to obtain a good estimate of m when (1 - m)t is close to one or zero.{nv$ft, 百拇医药

    Similarly, the estimation of Ne is also biased if the sampling interval is too long. As t "->"{nv$ft, 百拇医药

    {infty}{nv$ft, 百拇医药

    , we have "->"{nv$ft, 百拇医药

    0 shown above and reduces to Est(1/2Ne,B) = E[p2B,t]/(t). If the source population is infinite in size, then E[p2B,t] tends to a constant as t "->"{nv$ft, 百拇医药

    {infty}

    , and therefore e,B "->".)evn4, 百拇医药

    {infty}.)evn4, 百拇医药

    . If the source population has a finite size, then E[p2B,t] continues to increase as t "->".)evn4, 百拇医药

    {infty}.)evn4, 百拇医药

    (so long as the alleles are not fixed in the population), but at a slower rate corresponding to the total size of the focal and source populations. In both cases, the estimate of Ne,B approaches the effective size of the whole species rather than that of the focal population. Again, there is less power for estimating Ne,B if t and S/(Ne,B) are too small. With a small quantity of tS/(Ne,B), the change in allele frequency due to drift would be overwhelmed by that from sampling..)evn4, 百拇医药

    As a numerical example, shows the average estimates of 1/(2Ne,B) and m over 10,000 replicates plotted against sampling interval t, and shows the coefficients of variation (CV) of the estimates when t = 1–12 where the estimation is unbiased. (Methods for these simulations are given in the Appendix.) The true values being estimated are 1/(2Ne,B) = 0.005 (Ne,B = 100) and m = 0.2 for the focal population, which is assumed to be in a metapopulation containing 10 other populations of the same size (Ne = 100) with an island migration model. The effective size of the source population is therefore ~

    1000, sufficiently large although not infinite. As can be seen, estimates of both 1/(2Ne,B) and m are close to the true values when t is small, but approach the true values for the whole species, 0.00045 (Ne = 1112, calculated using the known parameters) and 0, with increasing t. The maximum number of generations allowed for valid joint estimates of 1/(2Ne,B) and m and the number of generations required to reach the equilibrium so that the species Ne is estimated depend on m, the effective sizes of the focal and source populations, and the initial genetic differentiation when the first sample was taken (see below).*|, http://www.100md.com

    fig.ommitteed*|, http://www.100md.com

    Figure 1. Means (A) and coefficients of variation (B) of estimates of 1/(2Ne) and m as a function of sampling interval t (in generations). The population contains 11 subpopulations of an equal Ne (100), migrating in the island model with rate 0.2. After the population reaches quasi-equilibrium, two samples separated by t generations are taken from each of the 11 subpopulations. The size of each sample from the focal subpopulation is 200 individuals, and the sample from any source subpopulation is 10 individuals. The 20 samples from the source subpopulations are combined into a single sample. For each sample, 20 loci, each having four alleles, are genotyped and used in the estimation. In the top graph, the averages over 10,000 replicate estimates of m and 1/(2Ne) from the moment estimator are denoted by thin and thick solid lines, respectively. The true value for m is denoted by the horizontal thin dashed line, and the true values of 1/(2Ne) for the focal subpopulation (0.005) and the whole population (1/2224) are denoted by the top and bottom horizontal thick dashed lines. Both axes of this graph are in log scale. In the bottom graph, the coefficients of variation over 10,000 estimates of m and 1/(2Ne) are denoted by thin and thick lines, respectively, from the moment estimator.

    For this specific numerical example, the estimation is unbiased when ~'pp{6, 百拇医药

    t < 12. The precision of the estimators changes considerably over this short sampling period (). As expected, the best estimates are obtained at an intermediate value of t (t = 2–4 for the example). The optimal sampling interval depends, again, on m, the effective size of the focal population and the initial genetic differentiation when the first sample was taken. With decreasing m and/or increasing initial differentiation, for example, the optimal t is expected to increase.'pp{6, 百拇医药

    In experimental design, it is difficult to determine beforehand the optimal interval between sampling for maximum power of the temporal method, because the power depends partially on the parameter being estimated (see e.g., ). Without any prior knowledge of the demographic history of the populations in question, it is reasonable to consider a short rather than a long sampling interval. While a too long sampling interval results in biased estimates of Ne and m, a too short sampling interval only decreases the precision of the estimates. With powerful statistical methods (e.g., maximum likelihood shown below) and large sample size and marker information, the estimation precision can be improved greatly even if t is as short as one generation.

    The consequences of assuming that m = 0: Other methods using temporal data for estimating the effective size have assumed that the immigration rate is zero. We can use the equations above to investigate the effects of making this assumption when it is in fact false.;k, 百拇医药

    When m "->";k, 百拇医药

    0, the expected change in allele frequency is zero, M = t, and reduces to;k, 百拇医药

    which is similar to NEI and TAJIMA's (1981) estimator using Fc.;k, 百拇医药

    When m > 0, the variance in allele frequency change over the interval between samples is a function of both genetic drift and migration. This can be either greater or smaller than that expected by drift alone, depending on, among other parameters, t. In the short term, the change in allele frequency due to the joint effects of migration and drift will be larger than that expected by drift alone. If t = 1, for example, we have E[(pB,t - pB,0)2] = {Delta} pAB2m2 + pB,0(1 - pB,0)/2Ne,B. In this case, therefore, the squared change in allele frequency is increased relative to that due to drift only, which is pB,0(1 - pB,0)/2Ne,B. Falsely assuming that m = 0 would on average cause an overestimate of the amount of drift and an underestimate of Ne,B. In contrast, as the time between samples gets large, the change in allele frequency is constrained by migration to a term only somewhat larger than the squared initial difference in allele frequency between the source and focal populations: limt"->"

    {infty}
E[(pB,t - pB,0)2] = {Delta} pAB2 + pB,0(1 - pB,0)/(Ne,Bm(2 - m)). With an increasing sampling interval (t), therefore, a false assumption of m = 0 would cause a downward bias in the estimation of the amount of genetic drift and an upward bias in estimating Ne. As the number of generations between samples continues to get larger, however, the estimate would approach the effective size of the whole species, rather than just the effective size of the focal population, no matter whether the assumption of m = 0 is made or not. These biases can be indefinitely large..+7|f, 百拇医药

    shows the biases of Ne estimation when a false assumption of m = 0 is made. The true parameter values are Ne = 100 and m = 0.1, and the populations are assumed to be either completely differentiated (no previous migration, case A) or at drift-migration equilibrium (case B, where Fst {cong} 0.02) when the initial sample is taken. Ignoring migration results in an overestimation of 1/(2Ne) when sampling interval (t) is small and underestimation when t is large. Compared with case A, the initial overestimation of 1/(2Ne) in case B from ignoring migration is much smaller, because the populations are more similar genetically ({Delta} pAB small) and therefore migrants have less effect on allele frequency changes. With a larger m and smaller Ne, the initial overestimation of 1/(2Ne) can be dramatic even when the populations are initially at equilibrium (data not shown). For both cases, the estimation is reasonably good for a large range of sampling interval when immigration is accounted for. However, the sampling interval cannot be too long. Otherwise, 1/(2Ne) is increasingly underestimated with t, approaching the true value for the whole species.

    fig.ommitteedao, 百拇医药

    Figure 2. Bias of 1/(2Ne) estimates as a function of sampling interval (in generations). The true value being estimated is 1/(2Ne) = 0.005 (Ne = 100) denoted by the horizontal dashed line, and the true migration rate is 0.1. The populations are assumed to have no previous migration (case A) or be at drift-migration equilibrium (case B) when the first sample from the focal population is taken. For cases A and B, the averages over 1000 replicate estimates of 1/(2Ne) are denoted by thin and thick solid lines, respectively, from the moment estimator developed herein that takes immigration into account, by thin and thick dashed lines, respectively, from the moment estimator ignoring immigration . In the estimation, we assumed 20 loci, each having four alleles, and two samples from the focal and one sample from the source population, each sample having 200 individuals. Note that both axes are in log scale.ao, 百拇医药

    Likelihood model:ao, 百拇医药

    Let us first consider a single locus with two alleles, and we focus on a particular allele. For brevity in this subsection, subscript B is dropped for the focal population. To estimate m and Ne of the focal population jointly, at least two temporally spaced samples from the focal and a single sample from the source population (A) are necessary. Because the three sampling events are independent, the probability of getting a set of samples with allele frequencies denoted by x isao, 百拇医药

    Each of these samples can be assumed to follow a binomial distribution (see WAPLES 1989 for a discussion of when this is appropriate); therefore(Jinliang Wang and Michael C. Whitlock)