当前位置: 首页 > 医学版 > 期刊论文 > 基础医学 > 分子生物学进展 > 2005年 > 第10期 > 正文
编号:11259188
Inferring the Effects of Demography and Selection on Drosophila melanogaster Populations from a Chromosome-Wide Scan of DNA Variation
     Section of Evolutionary Biology, Biocenter, University of Munich, Grosshaderner Strasse 2, D-82152 Planegg-Martinsried, Germany

    E-mail: ometto@zi.biologie.uni-muenchen.de.

    Abstract

    Identifying regions of the Drosophila melanogaster genome that have been recent targets of positive Darwinian selection will provide evidence for adaptations that have helped this species to colonize temperate habitats. We have begun a search for such genomic regions by analyzing multiple loci (about 250) dispersed across the X chromosome in a putatively ancestral population from East Africa and a derived European population. For both populations we found evidence for past changes in population size. We estimated that a major bottleneck associated with the colonization of Europe occurred about 3,500–16,000 years ago. We also found that while this bottleneck can account for most of the reduction in variation observed in the European sample, there is a deficit of polymorphism in some genomic regions that cannot be explained by demography alone.

    Key Words: Drosophila melanogaster ? SNPs ? demography ? selective sweeps

    Introduction

    Inferring a species' demographic history from patterns of genetic variation is essential in a search for adaptive signatures in the genome. Traditionally, DNA variation is compared with the expectations of the neutral theory of molecular evolution (Kimura 1968). While it is tempting to ascribe departures from the neutral equilibrium model to the action of positive selection, caution must be taken due to the possible confounding effects of demography. For example, strong reduction in levels of nucleotide polymorphism may result from hitchhiking associated with positive directional selection (Maynard-Smith and Haigh 1974) or from a strong population-size bottleneck. To disentangle demographic and selective forces, it is helpful to employ multilocus approaches (e.g., in humans, Akey et al. 2004; in Drosophila, Harr, Kauer, and Schl?tterer 2002; Glinka et al. 2003; in Arabidopsis, Schmid et al. 2005). The rationale behind these studies is the observation that while demography affects patterns of variation across the entire genome, positive selection acts locally.

    The cosmopolitan species Drosophila melanogaster is thought to have colonized Europe after the last glaciation about 10,000–15,000 years ago (David and Capy 1988; Lachaise et al. 1988). Several studies proposed that this colonization was accompanied by the occurrence of numerous adaptations to the new habitat (Harr, Kauer, and Schl?tterer 2002; Glinka et al. 2003; Orengo and Aguadé 2004). On the other hand, despite their long evolutionary history, African populations also show a departure from the neutral equilibrium model, suggesting that D. melanogaster has faced recent selective and demographic processes in its ancestral species range (Andolfatto and Przeworski 2001; Andolfatto and Wall 2003; Glinka et al. 2003).

    We analyzed chromosome-wide patterns of DNA variation in a derived population from Europe and in a putative ancestral population from East Africa. We sequenced >150 fragments of noncoding DNA, extending our previous data set (Glinka et al. 2003) to more than 250 loci distributed along the X chromosome. To evaluate the contribution of adaptive evolution in the European population, we developed a maximum likelihood approach that allowed us to test each locus against a demographic null model. Using this method, we estimated the parameters of a simple bottleneck model from the chromosome-wide polymorphism data and identified those loci whose level of variation cannot be explained by demography alone. An interesting additional finding (not observed thus far in Drosophila) is the positive correlation between divergence and recombination rate.

    Materials and Methods

    Data Collection

    Drosophila melanogaster sequence data were obtained from 24 highly inbred lines, 12 of which were from an African population (Lake Kariba, Zimbabwe, Begun and Aquadro 1993) and 12 from a European population (Leiden, The Netherlands). One inbred line of the sister species Drosophila simulans (Davis, Calif.) was used for interspecific comparisons.

    We made several single-mate crosses between virgin Canton-S females, homozygous for the standard chromosome arrangement, and males from each of the 24 D. melanogaster lines. From each of these crosses, we used salivary gland preparations from five F1 third-instar larvae (maintained at 18°C) to maximize the detection of heterozygous inversions. Polytene chromosomes were stained using the lacto-acetic orcein method and examined under an inverted compound microscope.

    Sequence data were collected from fragments located in noncoding DNA on the X chromosome of D. melanogaster (based on Flybase, Release 3.2, http://www.flybase.org). We sequenced 158 loci of the African and 163 of the European lines. For 143 loci, we also obtained the homolog in D. simulans. DNA sequences were generated as described in Glinka et al. (2003). In addition, for the analysis we also used the data from 100 fragments of Glinka et al. (2003), whose location was verified based on the genome Release 3.2 (originally, Release 2 was used). The updated annotation revealed that 5 of the previously analyzed 105 loci are located in putative coding regions. Therefore, we excluded them from the analysis. As a result, polymorphism (divergence) data were obtained from a total of 253 (232) and 263 (241) fragments for the African and the European populations, respectively. A total of 250 fragments were analyzed in both samples of D. melanogaster, and 230 of these loci were also analyzed in D. simulans. The new sequences have been deposited into the European Molecular Biology Laboratory (EMBL) database (under accession numbers from AM000058 to AM003900).

    Statistical Analysis

    Basic population genetic parameters were estimated by a program provided by H. Li. Nucleotide diversity was estimated using (Tajima 1983) and (Watterson 1975). The same program was also used to estimate Tajima's D (Tajima 1989) and Fay and Wu's H (Fay and Wu 2000) statistics and to evaluate their statistical significance by 10,000 coalescent simulations. Linkage disequilibrium (LD) measure ZnS (Kelly 1997) and interspecific divergence were estimated by the program VariScan (Vilella et al. 2005). A coalescent-based program (Ramos-Onsins et al. 2004) was used to determine for each locus the probability associated with the observed ZnS value, conditioned on and the population recombination rate (see below). To evaluate the decay of LD, we measured the correlation coefficient r2 (Hill and Robertson 1968) for each pair of biallelic sites within fragments and then plotted the pooled values across loci against the distances between the corresponding pair of sites (singletons, being noninformative, were discarded).

    Departure from the neutral equilibrium model was investigated by the multilocus Hudson-Kreitman-Aguadé (HKA; Hudson, Kreitman, and Aguadé 1987) and the Tajima D (Tajima 1989) tests. Both tests were performed using the program HKA, provided by J. Hey (http://lifesci.rutgers.edu/heylab) (Kliman et al. 2000). In addition, we examined the number of haplotypes KDV and the haplotype diversity HDV (Depaulis and Veuille 1998) as described in Glinka et al. (2003).

    Because recombination is absent in males of D. melanogaster, the population recombination parameter was estimated by 2Nc (Stephan et al. 1998; Przeworski, Wall, and Andolfatto 2001), where c is the female recombination rate per fragment per generation and, assuming a Wright-Fisher neutral model, N = Ne = 106 (Li, Satta, and Takahata 1999). For each locus, we estimated c by multiplying its per-site recombination rate by its length L; was obtained using the program Recomb-Rate (Comeron, Kreitman, and Aguadé 1999). For comparison, recombination rate was also estimated following the procedure proposed by Charlesworth (1996). We used the recombination rate estimated for the white locus as threshold distinguishing regions of low recombination (region I) from regions of normal to high recombination (region II).

    The availability, for most of our fragments, of the D. simulans homologous sequence allowed us to polarize the state of our biallelic sites. A variant was considered ancestral if observed in both species and derived if segregating only in D. melanogaster. In the analysis of the European population, we were interested in the fraction of mutations that originated after the bottleneck, which we assumed to equal the SNPs exclusive to the European sample. We focused on all these kEall single-nucleotide polymorphisms (SNPs), or only the kEs singletons. Then, to denote the polymorphisms that originated after the bottleneck in each locus i, we define where Li and mi are the length and the sample size, respectively. It follows that for while for

    Demographic Modeling of the African Population

    To investigate the hypothesis of size expansion of the African D. melanogaster population, we applied a maximum likelihood method proposed by Weiss and von Haeseler (1998) which allowed us to extract information about the population history, such as the time (Te, in years) at which the population started to increase, and the strength of the expansion () defined as the ratio of the current and the initial effective population size. This method is based on both and and requires that for each fragment the parameters A, C, G, T (corresponding to the frequency of each base in the sequence), the transition/transversion parameter (), and the pyrimidine/purine transition parameter () are estimated. This was done with a program kindly provided by H. Li. Because fragments in regions of low recombination are more affected by the impact of selection (see Results), we excluded them from the analysis. In addition, we excluded 15 fragments due to undefined values of the parameters or , leaving 214 fragments for analysis. For each fragment, the likelihood of a given parameter set was determined through 10,000 coalescent simulations without recombination by the program IPHULA, kindly provided by G. Weiss. To compute the likelihood for all fragments, free recombination between fragments was assumed. The estimate of Te was obtained assuming for the current effective population size N0 = 106 (Li, Satta, and Takahata 1999) and 10 generations per year. The confidence intervals (95% CI) for these estimates were obtained by the standard MAX–2 rule (e.g., Kaplan and Weir 1995).

    Demographic Modeling of the European Population

    The population-size bottleneck associated with the colonization of Europe likely resulted in a genome-wide reduction of polymorphism that confounds the signature of selective sweeps. However, the detection of a stronger reduction of heterozygosity than expected under a bottleneck model may be attributed to the action of positive selection. To investigate this hypothesis, we examined for each locus i the observed number of segregating sites ki using coalescent simulations of a simple bottleneck model. Candidate loci of selective sweeps can thus be identified as those harboring less genetic variation than expected based on the chromosome-wide reduction of polymorphism due to a bottleneck.

    To characterize the bottleneck, we assume that the population drops to a fraction of its current effective population size N0 at time Tb (measured in units of 3N0, because we use X-linked data), and then recovers to N0. We also assume that the duration of the bottleneck is short such that mutations arising during this time may be ignored. We combine duration and depth of the bottleneck in a compound parameter Sb, describing the strength of the bottleneck (Galtier, Depaulis, and Barton 2000; see also Fay and Wu 1999). The parameter Sb can be interpreted as the time that would be necessary to obtain the same number of coalescent events in a neutral constant-size population. Therefore, the total length of the coalescent tree Ttree will be a function of both Tb and Sb, i.e., The estimation of the bottleneck parameters Tb and Sb is done using a maximum likelihood approach. That is, we consider

    (1)

    where Liki refers to the likelihood and Ki to the average number of differences between two sequences at locus i. We consider the loci to be independent (i.e., we assume free interlocus, but no intralocus recombination) and consequently calculate the joint likelihood for the m loci by multiplying their individual likelihoods. Then, and are the maximum likelihood estimates that maximize the product Note that the maximum likelihood estimate applies to the pair rather than to and individually.

    To calculate the likelihood of each locus, we simulate G genealogies (see below) and define for each locus i

    (2)

    where Pi,g is the probability of genealogy g for locus i. We used various methods to calculate Pi,g, each resulting in a different set An overview is given in table 1. In method I, we considered the probability to observe Ki = ki SNPs in a coalescent tree of length Ttree,g, based on the Poisson distribution

    (3)

    where and Li is the length in base pairs of locus i.

    Table 1 Demographic Modeling of the European Population— Overview of Methods

    In method II, we analyzed a subset w of our m loci, for which we were able to polarize the state of the kw observed SNPs (see above), where We define and as the polymorphisms in this subset that occurred before and after the colonization of Europe, respectively (see above), such that We assume that the African sample represents the ancestral population. The SNPs were either assumed to be identical to (method IIs) or to (method IIall). The partitioning of the kw SNPs into and SNPs is correlated with the proportion of the time spent by the species in "Africa" and "Europe" and will therefore depend on the demographic history. For a single locus i write Then,

    (4)

    where and measure the lengths of the coalescent tree portions after and before Tb (going backward in time), respectively, and

    Finally, a third method was used, where we considered the probability to observe independently and SNPs in the pre- and postbottleneck phases, respectively,

    (5)

    where and As before, we considered either (method IIIs) or (method IIIall).

    Note that the value of the mutational parameter was estimated in two ways. In the first (for method III only), we considered the observed average value of for the African sample, In the second (for methods I–III), i is locus specific, i.e., it is the value observed in the African sample at locus i. This is reasonable because there is a strong positive correlation between the i values found in the African and in the European samples (R = 0.527, P < 0.0001).

    For each method, we tested more than 600 Tb and Sb combinations, characterizing severe recent bottlenecks (i.e., Tb = 0.0005 and Sb = 2.0) to shallow old ones (i.e., Tb = 0.2 and Sb = 0.02). For each locus and parameter combination, G = 2,500 genealogies were simulated using a modified version of a program kindly provided by S. E. Ramos-Onsins (Ramos-Onsins et al. 2004), which is based on the program ms (Hudson 2002). To evaluate the fit of our model to the data, we used the estimated bottleneck parameter sets to simulate G = 10,000 genealogies for each locus and calculated for each of the 10,000 resulting samples the average of Tajima's D and Kelly's ZnS statistics. For the latter analysis, our simulation methodology did not consent to use simultaneously both i and Thus, only i was used as mutation parameter in method III.

    To identify candidate loci for positive selection, we used two approaches that tested each locus independently against the expectations under the estimated bottleneck models. In the first one (for methods I and II), G = 10,000 coalescent simulations were carried out for each locus i to compute the probability to harbor at most ki segregating sites,

    (6)

    The second one enabled us to use the information on the mutational and demographic history of the sample (method III). G = 10,000 coalescent simulations (for each locus i) were used to calculate the proportion of simulated genealogies, thereby generating at the same time (1) at most segregating sites in the portion of the simulated tree with mutational parameter and (2) at most segregating sites in the portion of the simulated tree with mutational parameter i. We call this probability Then, a locus was considered to lie in a candidate sweep region if it contained fewer polymorphisms than expected under the estimated bottleneck model, i.e., when Qi < 0.05 or

    Because hitchhiking and bottlenecks affect more than , using ki in the identification of targets of selection is conservative.

    Results

    Polymorphism Patterns of the African Population

    Before embarking on our sequencing study, we confirmed that all African (and European) X chromosomes are inversion free. From the African X chromosomes, we gathered sequencing data for 153 fragments. Together with the previously published data for 100 loci (see Materials and Methods) this increased the density of our scan of the X chromosome considerably. The average distance between adjacent loci for the whole chromosome is now <70 kb, and for the proximal half <50 kb. Fragment length ranges from 199 to 781 bp, with a mean (standard error, SE) of 510.4 (6.9) bp. We sequenced a total of 129,133 nt sites (excluding insertions and deletions), of which 4,922 are polymorphic. Over half of the observed polymorphic sites are in low frequency (i.e., singletons). A summary of the polymorphism and divergence data of all analyzed fragments is provided in the online supplementary table 1.

    The mean levels of diversity (SE) across the 253 loci are 0.0114 (0.0004) for and 0.0131 (0.0004) for . When levels of nucleotide diversity are plotted against recombination rate (fig. 1a and b), we observe a significantly positive correlation (for , Spearman's R = 0.140, P = 0.026; for , Spearman's R = 0.147, P = 0.020; hereafter, all correlations are tested using Spearman's R). Furthermore, a weak correlation between recombination rate and divergence across all sequenced 232 loci is found (R = 0.127, P = 0.054; fig. 1c). The average divergence (SE) between the African sample and D. simulans is 0.0621 (0.0019). To investigate this correlation more closely, we divided the data set into fragments of low (region I) and normal to high recombination rates (region II; see Materials and Methods). This approach, corresponding to a threshold value of 2 x 10–8 recombination events per base pair per generation, yields 24 fragments for region I. The correlation between levels of nucleotide diversity and recombination rates still exists in region I (for , R = 0.459, P = 0.024; for , R = 0.510, P = 0.011), but is weaker in region II (for , R = 0.114, P = 0.087; for , R = 0.116, P = 0.080). The opposite is seen when we correlate levels of divergence with recombination rates. A significant correlation is observed in region II (R = 0.142, P = 0.040), whereas none is found in region I (R = 0.085, P = 0.702). These observations also hold for the second measure of recombination rate (see Materials and Methods), except that levels of nucleotide diversity and recombination rate are not correlated in region I (for , R = 0.491, P = 0.125; for , R = 0.527, P = 0.096). The observed correlation between nucleotide diversity and crossing-over rate agrees with the results of several previous studies (e.g., Begun and Aquadro 1992), whereas the correlation between divergence and crossing over came as a surprise and requires further explanation (see Discussion).

    FIG. 1.— Nucleotide diversity, divergence and Tajima's D values versus recombination rate for the African (a–d) and the European (e–h) populations. Recombination rate is expressed in recombination events per site per generation x 108 (Comeron, Kreitman, and Aguadé 1999).

    Our data on intraspecific polymorphism and interspecific divergence were also used to test if the population departs from neutral equilibrium. Under neutrality, genome regions with high sequence diversity within a species should also show high levels of divergence between species (Hudson, Kreitman, and Aguadé 1987). To test if the data depart from this expectation, we used the multilocus HKA test. No departure from the neutral equilibrium model was detected (2 = 184.15, P = 0.990).

    To investigate the haplotype structure in the African sample, we used the number of haplotypes, KDV, and haplotype diversity, HDV, and examined their values across the chromosome as described in Glinka et al. (2003). Under neutrality, we expect an equal proportion of the observed values of KDV and HDV being lower and higher than the simulated median. We observed a significant excess of fragments with higher values than expected for both statistics (sign test, two-tailed, P < 0.001 for both KDV and HDV). High values can result from a star-like genealogy following population expansion, or from an old complete sweep (Depaulis and Veuille 1998). In the latter case, due to recombination, recurrent selective sweeps across the chromosome are expected to leave footprints of partial hitchhiking. We searched for such evidence using the KDV- and HDV-haplotype tests (Depaulis and Veuille 1998) and Fay and Wu's H test (Fay and Wu 2000), with the conservative assumption of zero recombination. We observed only one significant value of the HDV statistic (fragment 728) and another five with a significantly negative Fay and Wu's H value (one-tailed, P < 0.05), namely fragments 276, 295, 310, 392, and 483. The average H value (SE) across all fragments was –0.583 (0.155) indicating a significant skew toward high-frequency–derived variants (P = 0.007).

    Star-like genealogies typically produce a skew in the frequency spectrum toward rare variants that can be detected by negative values of Tajima's D statistic. Indeed, most fragments (m = 225) have negative D values (sign test, two-tailed, P < 0.0001; fig. 1d), 21 of which also depart from neutral equilibrium (P < 0.05; supplementary table 1). The observed average value (SE) of –0.608 (0.033) was compared with the prediction of the standard neutral model using coalescent simulations (see Materials and Methods). None of 10,000 simulated samples had either a more negative average or a smaller variance than the observed values. Furthermore, we did not find a positive correlation between D and recombination rate (P = 0.426; fig. 1d), in contrast to the prediction of the recurrent hitchhiking model (Braverman et al. 1995; Andolfatto and Przeworski 2001). This observation holds for both measures of recombination rates (data not shown).

    We analyzed LD, using Kelly's ZnS statistic (Kelly 1997). Most of the values are very low, with an average (SE) of 0.139 (0.004). To assess whether they are consistent with the neutral equilibrium scenario, we performed coalescent simulations with recombination. This assumption is conservative, because recombination decreases ZnS (Kelly 1997). We observed 78 of these loci with a ZnS value significantly lower than expected under a neutral equilibrium model (one-tailed, P < 0.05; supplementary table 1), pointing to a deficiency of LD across the chromosome. The short-range behavior of LD in the chromosome was studied using the statistic r2 (Hill and Robertson 1968). Pooling the data from all fragments resulted in 13,930 values. The correlation between r2 values and distance clearly indicates a strong decay of LD with distance, as LD drops 60% over the average fragment length (i.e., 500 bp; fig. 2a).

    FIG. 2.— Decay of LD with distance in the African (a) and the European (b) populations. The squared correlation coefficient of allele frequencies between biallelic sites (r2) is plotted against distance in base pairs across loci. The average r2 values for 10 subsets containing an equal number of site pairs (pooled based on the distance) are plotted as filled circles.

    Both observations (i.e., a chromosome-wide excess of low-frequency variants and a lack of LD) are consistent with an expansion of the African population, as suggested by Glinka et al. (2003) (see Discussion).

    Polymorphism Pattern of the European Population

    We gathered polymorphism data from a total of 263 fragments, spanning 142,135 nt sites (gaps were excluded), 1,925 of which are polymorphic. Number of segregating sites, diversity indices, and basic statistics for each locus are shown in the online supplementary table 2.

    The means (SE) of and across the X chromosome are 0.0047 (0.0003) and 0.0046 (0.0002), respectively. We found 22 loci (18 intronic and 4 intergenic regions) with no polymorphism. When the estimates of nucleotide diversity are plotted against recombination rate, a significant positive correlation is found for , but not for (R = 0.135, P = 0.028, and R = 0.079, P = 0.201, respectively; fig. 1e and f). If the lower values in regions of reduced recombination were due to a lower mutation rate, they should also be less diverged. Divergence across 241 loci between D. simulans and the European population is on average (SE) 0.0666 (0.0021). In contrast to the African sample, it does not correlate with recombination rate (R = 0.100, P = 0.123), even when the data are partitioned into those of recombination regions I and II (results not shown). Furthermore, no significant correlation was found when the ratio between and divergence was plotted against recombination rate (R = 0.105, P = 0.102).

    Tajima's D was calculated for each locus. We observed an average (SE) of –0.103 (0.079). As in Glinka et al. (2003), coalescent simulations (see Materials and Methods) showed that the average does not deviate from the neutral expectation (P = 0.077), while its SE is too large (P = 0.0001). We found a strong positive correlation between haplotype diversity and D (R = 0.611, P < 0.0001), reflecting the observation that positive D values are caused by two almost equally frequent haplotypes and negative values by rare divergent haplotypes or by few rare variants (distributed across the entire sample).

    Surprisingly, there is a weak but highly significant negative correlation between D and recombination rate (R = –0.188, P = 0.003), whereas no significant correlation was found between haplotype diversity and recombination rate (R = 0.019, P = 0.764). This suggests that the negative correlation between D and recombination rate is due to an excess of rare variants that are distributed over more than one sequence of the sample, rather than being comprised in a single diverged haplotype.

    A typical signature of a recent bottleneck is an elevated level of LD. The average (SE) statistic ZnS across loci is 0.438 (0.018), much higher than the value observed in the African sample and significantly higher than the value at neutral equilibrium (P < 0.0001; 26 loci had significantly higher LD than expected, i.e., P < 0.05; no recombination was assumed). No correlation with the recombination rate was found (data not shown). When the pooled r2 values across loci are plotted against distance (6,162 data points; see Materials and Methods), LD is observed to decay 60% over the average fragment length (i.e., 500 bp; fig. 2b).

    The multilocus HKA test showed significant departure from neutrality (2 = 483.40, P < 10–5). The statistic was no longer significant if 48 loci (with the largest contributions) were removed (data not shown). The observation that 30 of these show an excess of polymorphism agrees with the frequently observed haplotype structure.

    Estimating the Parameters of a Simple Bottleneck Model for the European Population

    Several observations reported above indicate the occurrence of bottlenecks during the colonization of Europe by D. melanogaster. While a bottleneck results in the loss of heterozygosity at a genome-wide scale, selective sweeps may lead to a further reduction of polymorphism locally in the genome. In order to disentangle the effects of these two forces, we developed a coalescent-based approach that compares the behavior of a single locus against that of all loci on the X chromosome (see Materials and Methods). We assume a simple bottleneck model such that only a single reduction of population size occurred during colonization. The maximum likelihood estimates of the bottleneck parameters are obtained by five different procedures (table 1). For method I, we analyzed all loci for which both populations had been sequenced (m = 250); for methods II and III, the D. simulans homologs were required (m = 230).

    The likelihood surfaces and the maximum likelihood estimates for the age, Tb, and the strength, Sb, of the bottleneck obtained by our methods (table 1) are shown in figure 3 and table 2, respectively. Only when the partitioning of segregating sites between pre- and postbottleneck SNPs is used (methods II and III), a clear maximum becomes apparent (fig. 3b–e vs. fig. 3a). Nonetheless, parameters estimated by method I are in close agreement with those estimated by methods IIs and IIIs, where only the European private singletons were equated with the SNPs that originated after the bottleneck. On the other hand, methods IIall and IIIall that classified all European private polymorphisms as postbottleneck ones, point to a more ancient bottleneck (table 2). Hereafter, bottlenecks will be identified according to the parameter set used; e.g., bottleneck IIall refers to the bottleneck simulated with and

    FIG. 3.— Log likelihood surfaces for the estimation of the bottleneck associated with the colonization of Europe (the lighter, the larger is the likelihood; contour lines define log likelihood intervals of 200). Maximum likelihood estimates of age, Tb (in units of 3N0 generations), and strength, Sb, of the bottleneck were obtained exploring the parameter space ranging from severe recent bottlenecks (i.e., Tb = 0.0005 and Sb = 2.0) to shallow old ones (i.e., Tb = 0.2 and Sb = 0.02) through coalescent simulation. Asterisks designate the approximate maximum likelihood parameters' estimates. Plots for methods I (a), II (b and c), and III (d and e) are shown (table 1; see Materials and Methods).

    Table 2 Demographic Modeling of the European Population—Results

    To evaluate the fit of our model to the data, we compared the observed averages of Tajima's D and Kelly's ZnS across loci to those expected under the estimated bottlenecks. All simulated bottlenecks produced average D and ZnS values close to the observed ones (table 2). For ZnS, incorporating recombination in our simulations results in an average of 0.230 (for all methods; data not shown), suggesting that some recombination is needed to have a better fit. However, the empirical averages are statistically compatible only with the (older) bottlenecks IIall and IIIall, while they do not lie within the 95% CI of the other three bottlenecks.

    Among the loci used in these simulations, we observed 20 with no polymorphism. For each locus and genealogy, we generated Poisson-distributed mutations according to the different mutation processes (see Materials and Methods). In each case, an average number of loci with no segregating sites close to 20 was generated, with the exception of those of bottleneck IIIall, where on average only 8.42 loci harbored no variation (table 2).

    These results indicate a good agreement between our model and the empirical data. In particular, method IIall produces estimates of the bottleneck parameters that appear to be consistent with the polymorphism pattern observed in the European sample.

    Identifying Candidate Sweep Regions in the European Population

    We attempted to identify loci not compatible with the estimated bottlenecks. Each locus was tested against the model by calculating the probability Q (or QE) that it harbors at most the observed number of segregating sites (fig. 4; see Materials and Methods). Depending on the estimated age of the bottleneck, 3–24 loci cannot be explained by demography alone (Q < 0.05 or QE < 0.05; table 2). We note that, of the 20 loci with no polymorphism, 16 were among the 24 outliers identified by bottleneck IIIall. These results indicate that demography accounts for most of the chromosome-wide lack of variation, but does not explain the effect alone. This is also seen by pooling all loci. For each bottleneck scenario, the probability to observe at most the observed total number of SNPs was P < 0.001.

    FIG. 4.— Probability Q to obtain at most the observed number of segregating sites in the European sample for a given locus (given the bottleneck estimated by method IIall) against locus position on the X chromosome. The corresponding values of nucleotide diversity are plotted below the x axis. In the upper part of the figure, the position of outliers (Q < 0.05) are denoted by filled triangles; empty and filled circles denote the position of loci with significantly negative or positive Tajima's D values, respectively; and empty and filled squares denote significant deficiency or excess of LD, respectively, as measured by ZnS. Significance was calculated given the bottleneck estimated by method IIall.

    The probability Q (calculated using method IIIall) strongly correlates with the decrease in heterozygosity of the European sample relative to the African one (i.e., the ratio between the respective values; R = 0.571, P < 0.0001). This is consistent with our simulation approach, which aimed to identify those loci that lost more of the ancestral variation than expected under a simple bottleneck model (see Materials and Methods).

    In contrast, because we did not use the frequency spectrum in defining the outliers, Tajima's D shows no correlation with Q, nor does LD (e.g., for bottleneck IIall, R = 0.111, P = 0.110, and R = 0.016, P = 0.831, respectively). The estimated bottleneck parameters were then used to assess the significance of Tajima's D and ZnS by calculating, for each locus i, the proportion of simulations (G = 10,000) that produced values more extreme than the observed one (i.e., Akey et al. 2004). Because in our simulations we could use only one mutation parameter at the time, this analysis was possible only for methods I, II, and IIall (see Materials and Methods). Depending on the bottleneck parameters, 12 to 19 loci have too extreme values of Tajima's D and 12 to 13 too extreme values of ZnS (P < 0.05; table 2; fig. 4). Significantly low values of ZnS were usually observed in loci with few singletons. Two loci showed both significantly high Tajima's D and high ZnS. These loci are not associated with the outliers detected by both methods IIall and IIIall (permutation test, 1,000 permutations, P > 0.5 for both Tajima's D and ZnS).

    We also addressed the multiple testing problem by considering the false discovery rate as proposed by Storey (2002). None of the above findings was still significant after correction. Obviously, this is due to the low power of our single locus approach owing to the limited number of SNPs in each fragment.

    Discussion

    In general, we have found that both demography and natural selection shaped patterns of nucleotide variation on the X chromosome of D. melanogaster from Africa and Europe. By developing a method to distinguish between the confounding effects of selection and demography, we were able to localize the gene regions that were the target of selection in the recent past in the derived European population.

    Demographic History of the African Population

    An interesting feature of the African sample was the chromosome-wide excess of rare variants. Here we discuss possible explanations of this observation. First, we can exclude the contribution of chromosomal inversions (Andolfatto, Depaulis, and Navarro 2001), as we did not find any on the X chromosome. Second, mutation bias does not seem to be a valid explanation either. For instance, to investigate the possible contribution of a nucleotide-specific mutation bias (i.e., C/G vs. A/T; e.g., Birdsell 2002), we polarized 3,633 alleles, of which 1,693 were derived singletons. We observed that derived singletons are more overrepresented among the SNPs with C or G as ancestral state compared to A or T (for singletons n = 942 and 751 for C/G and A/T, respectively; for the rest of the SNPs, n = 999 and 942, respectively; 2 = 6.245, P = 0.012). This trend agrees with the study by Kern and Begun (2005), who predicted that G/C to A/T mutations should be at lower frequencies than A/T to G/C mutations due to a recent lineage-specific change in mutation bias (for singletons n = 804 and 455 for C/G and A/T, respectively; for the rest of the SNPs, n = 840 and 580, respectively; 2 = 6.232, P = 0.013). However, both C or G and A or T singletons are in excess when compared to the neutral expectations, calculated as k/ai where k is the total number of SNPs and for a sample size of 12 chromosomes (Fu 1995).

    Third, becasue we have previously shown that selection is not a satisfactory explanation for the observed excess of low-frequency variants (for instance, we found no correlation between recombination rate and Tajima's D values, as expected under recurrent selective sweeps; Glinka et al. 2003), it remains to consider demographic processes. Stajich and Hahn (2005) suggested that admixture can lead to an average negative Tajima's D. This, however, does not seem to apply to rural Zimbabwean populations as used here (Kauer, Dieringer, and Schlotterer 2003). For this reason, the most straightforward explanation for the observed excess of singletons (and the lack of LD) is that the ancestral population has undergone a relatively recent growth in size, expanding either (1) from a population with a long-term constant population size, or (2) from a severe bottleneck. To explore hypothesis (1), we used an approach proposed by Weiss and von Haeseler (1998) (see Materials and Methods). According to this procedure, the maximum likelihood estimates (95% CI) of the time of expansion, Te, and the ratio of the present to the past population size, , are 15,000 (0–30,000) years and 5 (1–1,000), respectively, suggesting a rather recent population-size expansion. While the simple growth model underlying this approach produced estimates that are roughly compatible with the idea of a recent expansion out of Africa (David and Capy 1988; Lachaise et al. 1988), some aspects of the data (e.g., the negative average value of the H statistic) are not. Hypothesis (2), proposed by Haddrill et al. (2005), postulates a (slow) population-size growth following a severe old bottleneck (200,000 years ago). Indeed, the species history may have been strongly influenced by the climatic changes of the past 200,000 years, when three glacial maxima (the last 18,000–21,000 years ago) alternated with warmer and moister periods (Webb and Bartlein 1992; De Vivo and Carmignotto 2004). This scenario of a long-term and slow population growth (interrupted by repeated population-size crashes) may also explain why our selective sweep method was unable to find footprints of selection in the African sample. Sweeps that are older than 0.1Ne generations cannot reliably be detected by this method (Kim and Stephan 2000). Thus, taken together, our results seem to favor the second scenario.

    Demographic and Selection History of the European Population

    Our analysis of the European population is based on a thorough approach to distinguish between the confounding effects of selection and demography. This approach consists of two steps: (1) an estimation of the parameters of a simple bottleneck model, and (2) the identification of loci whose reduction of variation is more extreme than predicted by this bottleneck model. The estimates of the bottleneck parameters obtained in step (1), in particular by method IIall, are consistent with the polymorphism pattern observed in the European sample. Assuming N0 = Ne for the African population (where N0 = 106) and 5–10 generations per year, we find from that the bottleneck occurred between 3,600 and 15,800 years ago, depending on the method used to estimate (tables 1 and 2). These values are close to the commonly accepted estimates of 10,000–15,000 years ago as the time of the European colonization (David and Capy 1988; Lachaise et al. 1988).

    Although a single bottleneck can explain most of the reduction of variation, we identified several candidate loci with less polymorphism than expected. Loci for which the reduction in heterozygosity due to selection is more severe than due to the bottleneck alone are expected to cause an underestimation of the age of the bottleneck. To test this hypothesis, we removed the eight loci with Q < 0.05 in bottleneck IIall (six of which had no SNPs), and reestimated the bottleneck parameters. As expected, age and strength point to an older and less severe bottleneck ( and vs. and respectively). This result does not depend on the low polymorphism of the removed loci. When removing a random set of zero-polymorphism loci (but with Q > 0.05), the values of the bottleneck parameter were indistinguishable from those estimated from the complete data set.

    In contrast, we estimated a much recent and stronger bottleneck for the eight outliers alone, consistent with the recent occurrence of positive selection (and ).

    Selective sweeps are expected to reduce the opportunity of new segregating mutations to accumulate during population expansion (by eliminating those linked to the selected site). Thus, assuming that selection acted uniformly over the entire chromosome, we expect more mutations to accumulate in regions of high recombination, where the target of selection and the flanking regions can evolve more independently. Supporting this hypothesis, we found a significant positive correlation between Es and recombination rate, which is responsible for the negative correlation observed between Tajima's D and recombination rate (see Results). Furthermore, intronic regions have lower Eall (P = 0.037) and underwent a more severe drop in genetic diversity than intergenic regions (although not significantly; data not shown), as expected if genes were the main targets of selection, but show less negative D values. These findings would contradict those of Orengo and Aguadé (2004), who interpreted a positive correlation between Tajima's D values and distance to coding regions (i.e., more negative D values for loci close to genes) as signature of recent positive selection in a Spanish population of D. melanogaster. The negative correlation between Tajima's D and recombination rate can be a consequence of the sampling process associated with the colonization of Europe. This is because the more variation there is in the ancestral African population (which correlates with the recombination rate), the more chances are that a line surviving the bottleneck in the derived population harbors some polymorphisms that are not shared. However, Eall does not correlate with recombination rate, as expected if only the sampling process was responsible for the observed correlation and the correlation between Es and recombination rate is significant in intergenic (R = 0.378, P = 0.0001), but not in intronic regions (R = 0.007, P = 0.941; this also excludes a possible role of mutation bias across the recombination gradient, see below); moreover, there is no correlation between the African and either Es or Tajima's D (R = –0.063, P = 0.353, and R = 0.095, P = 0.149, respectively).

    During the bottleneck phase, adjacent loci may have been partially linked and therefore not independent, as we assumed. Thus, we may have underestimated the age of the bottleneck. The effect of this underestimation is difficult to gauge. On the one hand, it may produce more conservative tests for the individual loci (see Glinka et al. 2003). On the other hand, ignoring the partial linkage between loci may not be conservative, because low levels of polymorphism at adjacent loci may just reflect their shared ancestry. Nonindependence of loci, resulting in pseudoreplicated data, could have also produced smaller CI for the statistics estimated under the bottlenecks (table 2). To evaluate the contribution of linkage in our bottleneck estimates, we selected two (partially overlapping) sets of 65 loci with an average distance between adjacent ones of 250 kb (and average 0.0046). Using method IIall, we estimated bottleneck parameters comparable with those obtained using the whole data set ( and vs. and for 65 and 230 loci, respectively). Furthermore, a population bottleneck explained the data significantly better than a constant population-size model in both sets of loci (likelihood ratio test, G = 193.2, df = 1, P < 10–50, and G = 105.5, df = 1, P < 10–24, for all 230 loci and one set of 65 loci, respectively, using method IIall; in the constant population-size model, we assumed that the binomial factor of equation (4) equals 1, making it equivalent to method I). Only among the 24 identified by method IIIall there are cases of adjacent ones (i.e., three; they are at a minimum distance of 21,362 bp), otherwise they are separated by at least one locus with Q > 0.05. The minimum distance between any couple of the eight candidate loci identified by method IIall is 121,177 bp. They are separated by at least two loci with Q > 0.05 (fig. 4). Moreover, when we plotted the absolute average difference in Q or between a locus and its two neighbors against the average distance or difference in recombination rate to them (as a prediction of association), no correlation is found (data not shown). Therefore, we can be reasonably confident that linkage does not affect our results.

    Is Recombination Mutagenic in D. melanogaster?

    Another striking observation of our genome scan was the positive correlation between average divergence (of the African sample to D. simulans) and recombination rate. The most straightforward explanation of this observation is that recombination is mutagenic in D. melanogaster. However, other explanations may also be possible. Although hitchhiking associated with strong positive or negative selection does not increase the rate of neutral molecular evolution, the substitution rate of weakly selected mutations depends on the degree of linkage with strongly selected ones (via the fixation probability; Birky and Walsh 1988; McVean and Charlesworth 2000). Thus, depending on the relative magnitude of the rates of slightly advantageous and deleterious mutations, positive or negative correlations between divergence and recombination rate may result (Birky and Walsh 1988). Furthermore, the increase of divergence with recombination may simply be due to the relatively recent split of the D. melanogaster and D. simulans lineages (Hellmann et al. 2003). The latter hypothesis can be tested as soon as the complete D. yakuba sequence is available.

    Supplementary Material

    Supplementary tables 1 and 2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

    Acknowledgements

    We thank S. Mousset, P. Pfaffelhuber and S. Ramos-Onsins for their advice during the development of the bottleneck test, S. Hutter, H. Li, S. Ramos-Onsins and G. Weiss for computer programs, and, in particular, A. Wilken for excellent technical assistance. We are also grateful to P. Pennings and J. Hermisson for pointing out an error in our thinking on the joint effects of hitchhiking and demography that led to a misinterpretation of figure 4 in Glinka et al. (2003), and to P. Andolfatto, K. Thornton, and two reviewers for very constructive comments. This work was funded by the Deutsche Forschungsgemeinschaft (grant STE 325/6).

    References

    Akey, J. M., M. A. Eberle, M. J. Rieder, C. S. Carlson, M. D. Shriver, D. A. Nickerson, and L. Kruglyak. 2004. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2:e286.

    Andolfatto, P., F. Depaulis, and A. Navarro. 2001. Inversion polymorphisms and nucleotide variability in Drosophila. Genet. Res. 77:1–8.

    Andolfatto, P., and M. Przeworski. 2001. Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 158:657–665.

    Andolfatto, P., and J. D. Wall. 2003. Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster. Genetics 165:1289–1305.

    Begun, D. J., and C. F. Aquadro. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519–520.

    ———. 1993. African and North American populations of Drosophila melanogaster are very different at the DNA level. Nature 365:548–550.

    Birdsell, J. A. 2002. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol. 19:1181–1197.

    Birky, C. W. Jr, and J. B. Walsh. 1988. Effects of linkage on rates of molecular evolution. Proc. Natl. Acad. Sci. USA 85:6414–6418.

    Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley, and W. Stephan. 1995. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783–796.

    Charlesworth, B. 1996. Background selection and patterns of genetic diversity in Drosophila melanogaster. Genet. Res. 68:131–149.

    Comeron, J. M., M. Kreitman, and M. Aguadé. 1999. Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151:239–249.

    David, J. R., and P. Capy. 1988. Genetic variation of Drosophila melanogaster natural populations. Trends Genet. 4:106–111.

    Depaulis, F., and M. Veuille. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:1788–1790.

    De Vivo, M., and A. P. Carmignotto. 2004. Holocene vegetation change and the mammal faunas of South America and Africa. J. Biogeogr. 31:943–957.

    Fay, J. C., and C. I. Wu. 1999. A human population bottleneck can account for the discordance between patterns of mitochondrial versus nuclear DNA variation. Mol. Biol. Evol. 16:1003–1005.

    ———. 2000. Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413.

    Fu, Y. X. 1995. Statistical properties of segregating sites. Theor. Popul. Biol. 48:172–197.

    Galtier, N., F. Depaulis, and N. H. Barton. 2000. Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155:981–987.

    Glinka, S., L. Ometto, S. Mousset, W. Stephan, and D. De Lorenzo. 2003. Demography and natural selection have shaped genetic variation in Drosophila melanogaster: a multi-locus approach. Genetics 165:1269–1278.

    Haddrill, P. R., K. R. Thornton, B. Charlesworth, and P. Andolfatto. 2005. Multilocus patterns of nucleotide variability and the demographic and selection history of Drosophila melanogaster populations. Genome Res. 15:790–799.

    Harr, B., M. Kauer, and C. Schl?tterer. 2002. Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 99:12949–12954.

    Hellmann, I., I. Ebersberger, S. E. Ptak, S. Paabo, and M. Przeworski. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. 72:1527–1535.

    Hill, W. G., and A. Robertson. 1968. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38:226–231.

    Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.

    Hudson, R. R., M. Kreitman, and M. Aguadé. 1987. A test of neutral molecular evolution based on nucleotide data. Genetics 116:153–159.

    Kaplan, N. L., and B. S. Weir. 1995. Are moment bounds on the recombination fraction between a marker and a disease locus too good to be true? Allelic association mapping revisited for simple genetic diseases in the Finnish population. Am. J. Hum. Genet. 57:1486–1498.

    Kauer, M., D. Dieringer, and C. Schlotterer. 2003. Nonneutral admixture of immigrant genotypes in African Drosophila melanogaster populations from Zimbabwe. Mol. Biol. Evol. 20:1329–1337.

    Kelly, J. K. 1997. A test of neutrality based on interlocus associations. Genetics 146:1197–1206.

    Kern, A. D., and D. J. Begun. 2005. Patterns of polymorphism and divergence from noncoding sequences of Drosophila melanogaster and D. simulans: evidence for nonequilibrium processes. Mol. Biol. Evol. 22:51–62.

    Kim, Y., and W. Stephan. 2000. Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics 155:1415–1427.

    ———. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765–777.

    Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217:624–626.

    Kliman, R. M., P. Andolfatto, J. A. Coyne, F. Depaulis, M. Kreitman, A. J. Berry, J. McCarter, J. Wakeley, and J. Hey. 2000. The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics 156:1913–1931.

    Lachaise, D., M. L. Cariou, J. R. David, F. Lemeunier, L. Tsacas, and M. Ashburner. 1988. Historical biogeography of the Drosophila melanogaster species subgroup. Evol. Biol. 22:159–225.

    Li, Y. J., Y. Satta, and N. Takahata, 1999. Paleo-demography of the Drosophila melanogaster subgroup: application of the maximum likelihood method. Genes Genet. Syst. 74:117–127.

    Maynard-Smith, J., and J. Haigh. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23–35.

    McVean, G. A., and B. Charlesworth. 2000. The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155:929–944.

    Orengo, D. J., and M. Aguadé. 2004. Detecting the footprint of positive selection in a European population of Drosophila melanogaster: multilocus pattern of variation and distance to coding regions. Genetics 167:1759–1766.

    Przeworski, M., J. D. Wall, and P. Andolfatto. 2001. Recombination and the frequency spectrum in Drosophila melanogaster and Drosophila simulans. Mol. Biol. Evol. 18:291–298.

    Ramos-Onsins, S. E., B. E. Stranger, T. Mitchell-Olds, and M. Aguadé. 2004. Multilocus analysis of variation and speciation in the closely related species Arabidopsis halleri and A. lyrata. Genetics 166:373–388.

    Schmid, K. J., S. Ramos-Onsins, H. Ringys-Beckstein, B. Weisshaar, and T. Mitchell-Olds. 2005. A multilocus sequence survey in Arabidopsis thaliana reveals a genome-wide departure from a neutral model of DNA sequence polymorphism. Genetics 169:1601–1615.

    Stajich, J. E., and M. W. Hahn. 2005. Disentangling the effects of demography and selection in human history. Mol. Biol. Evol. 22:63–73.

    Stephan, W., L. Xing, D. A. Kirby, and J. M. Braverman. 1998. A test of the background selection hypothesis based on nucleotide data from Drosophila ananassae. Proc. Natl. Acad. Sci. USA 95:5649–5654.

    Storey, J. D. 2002. A direct approach to false discovery rates. J. R. Stat. Soc. B 64:479–498.

    Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.

    ———. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

    Vilella, A. J., A. Blanco-Garcia, S. Hutter, and J. Rozas. 2005. VariScan: analysis of evolutionary patterns from large-scale DNA sequence polymorphism data. Bioinformatics 21:2791–2793.

    Watterson, G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256–276.

    Webb, T. III, and P. J. Bartlein. 1992. Global changes during the last 3 million years: climatic controls and biotic responses. Annu. Rev. Ecol. Syst. 23:141–173.

    Weiss, G., and A. von Haeseler. 1998. Inference of population history using a likelihood approach. Genetics 149:1539–1546.(Lino Ometto, Sascha Glink)