Variation in the Pattern of Synonymous and Nonsynonymous Difference Between Two Fungal Genomes
     Department of Biological Sciences, University of South Carolina

    Correspondence: E-mail: austin@biol.sc.edu.

    Abstract

    The proportion of synonymous nucleotide differences per synonymous site (pS) and the proportion of nonsynonymous differences per nonsynonymous site (pN) were computed at 1,993,217 individual codons in 4,133 protein-coding genes between the two yeast species Saccharomyces cerevisiae and Saccharomyces paradoxus. When the modified Nei-Gojobori method was used, significantly more codons with pN > pS were observed than expected, based on random pairing of observed pS and pN values. However, this finding was most likely explained by the presence of a strong negative correlation between the number of synonymous differences and the number of nonsynonymous differences at codons with at least one difference. As a result of this correlation, codons with pN > pS were characterized not only by unusually high pN but also by unusually low pS. On the other hand, the number of codons with pN > p̄S (where p̄S is the mean pS for all codons) was very similar to the random expectation, and the observed number of 30-codon windows with pN > pS was significantly lower than the random expectation. These results imply that the occurrence of a certain number of codons or codon windows with pN > pS is expected given the nature of nucleotide substitution and need not imply the action of positive Darwinian selection.

    Key Words: genome evolution • nonsynonymous substitution • positive selection • synonymous substitution

    Introduction

    A number of statistical methods have been introduced in order to search, without incorporating any a priori knowledge of protein function, for regions of protein-coding genes that are subject to positive Darwinian selection. These methods rely on the theoretical prediction that most nonsynonymous (amino acid altering) nucleotide mutations are deleterious to protein structure and are thus quickly eliminated by purifying selection (Kimura 1977). Thus, in most protein-coding genes, the number of synonymous nucleotide substitutions per synonymous site (dS) is expected to exceed the number of nonsynonymous substitutions per nonsynonymous site (dN) (Kimura 1977). By contrast, under positive Darwinian selection favoring changes at the amino acid level, dN is expected to exceed dS (Hughes and Nei 1988). For example, there are widely used methods based on maximum parsimony (Suzuki and Gojobori 1999) or maximum likelihood (Yang et al. 2000) that estimate dN/dS at individual codons and search for those at which dN > dS. Similarly, Fares et al. (2002) proposed a method for estimating the dN/dS in sliding windows along coding regions and searching for the optimal window size to detect deviations from the expected pattern. In addition, likelihood ratio methods have been applied to test a model including a category of codons with dN > dS against a model lacking such a category of codons (Yang et al. 2000).

    A problem with any such method is that dS and dN are likely to vary stochastically along the sequence; thus, it is expected that in certain genomic regions, dN will exceed dS by chance alone, in the absence of positive selection. Available methods do not appear to include any effective controls for such stochastic variation. In order to assess the extent of such stochastic variation in a real data set, we used aligned protein-coding genes from complete genomes of two closely related species of fungi, Saccharomyces cerevisiae and Saccharomyces paradoxus. By comparing the observed distributions of synonymous and nonsynonymous differences in individual codons and in sliding windows with the random expectation, we tested whether codons and windows showing an excess of nonsynonymous differences occur more often than expected by chance.

    Methods

    Sequence data for 4,133 orthologous genes were derived from the aligned genomes of S. cerevisiae and S. paradoxus (Kellis et al. 2003). We computed the proportion of synonymous nucleotide differences per synonymous site (pS) and the proportion of nonsynonymous differences per nonsynonymous site (pN) between the two genomes at individual codons by the Nei and Gojobori (1986) (NG) method. No correction for multiple hits was applied, because all such correction formulas are large-sample approximations and thus are not applicable in the present case, in which pS and pN were computed on a codon-by-codon basis. We excluded the codons at which no synonymous sites were defined, i.e., codons at which both sequences included either ATG (Met) or TGG (Trp). We likewise computed pS and pN in sliding windows of 30 codons along each gene in the data set.
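
    As a rough illustration of the per-codon calculation described above, the following Python sketch counts NG synonymous sites for a codon and derives pS and pN for a pair of codons that differ at no more than one position. The function names are hypothetical. Two simplifications are assumptions made for brevity: changes to stop codons are counted as nonsynonymous, and codon pairs with two or three differences, which require averaging over mutational pathways, are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """NG-style number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        syn = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Assumption: a change to a stop codon counts as nonsynonymous.
            if CODE[mutant] == CODE[codon]:
                syn += 1
        total += syn / 3  # fraction of the three possible changes
    return total


def codon_ps_pn(codon1, codon2):
    """Return (pS, pN) for two codons differing at no more than one position.

    pS is None when no synonymous site is defined (e.g. ATG or TGG).
    """
    diffs = [i for i in range(3) if codon1[i] != codon2[i]]
    if len(diffs) > 1:
        raise ValueError("pairs with 2 or 3 differences need pathway averaging")
    s_sites = (synonymous_sites(codon1) + synonymous_sites(codon2)) / 2
    n_sites = 3 - s_sites
    syn_diff = nonsyn_diff = 0
    if diffs:
        if CODE[codon1] == CODE[codon2]:
            syn_diff = 1
        else:
            nonsyn_diff = 1
    ps = syn_diff / s_sites if s_sites > 0 else None
    pn = nonsyn_diff / n_sites
    return ps, pn
```

    With no correction for multiple hits, a single synonymous difference at a codon with only 1/3 of a synonymous site (for example AAA versus AAG) gives a per-codon pS well above 1, which is one reason the per-codon distributions are so far from smooth.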

    Because of the method by which it counts "fractional sites" in the case of twofold degenerate sites, the NG method will tend to underestimate the number of synonymous sites (and thus to overestimate pS) when there is a strong transitional bias (Li 1993). In order to control for this effect, we analyzed separately a data set consisting exclusively of codons at which both genomes used fourfold degenerate codons. In addition, we applied the modified Nei and Gojobori (ModNG) method (Zhang, Rosenberg, and Nei 1998) separately to twofold degenerate codons, to all codons, and to the sliding windows of 30 codons. Whereas the NG method counts the third position of a twofold degenerate codon as 1/3 synonymous site, the ModNG method counts such a site as R/(1 + R) synonymous site, where R is the transition:transversion ratio. We estimated R from the observed pattern of transitional and transversional differences at fourfold degenerate sites in our data set. Note that, in the case of fourfold degenerate codons, NG and ModNG methods are equivalent.
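
    The difference between the two site-counting rules can be made concrete with a small calculation. The sketch below (function name hypothetical) returns the synonymous-site weight given to the third position of a twofold degenerate codon under each method, using the formulas stated above.

```python
def twofold_syn_site_weight(method, R=None):
    """Synonymous-site weight of the third position of a twofold degenerate codon.

    method: "NG" or "ModNG"; R is the transition:transversion ratio (ModNG only).
    """
    if method == "NG":
        return 1 / 3
    if method == "ModNG":
        if R is None:
            raise ValueError("ModNG needs the transition:transversion ratio R")
        return R / (1 + R)
    raise ValueError("unknown method: " + method)


ng_weight = twofold_syn_site_weight("NG")
modng_weight = twofold_syn_site_weight("ModNG", R=4.5)
# The same synonymous difference is divided by a smaller site count under NG,
# so the NG estimate of pS is inflated by this factor relative to ModNG.
inflation = modng_weight / ng_weight
```

    With R = 4.5, the ModNG weight is about 0.82 of a site rather than 1/3, so a single synonymous difference at such a codon yields an NG estimate of pS roughly 2.5 times the ModNG estimate. With R = 0.5 the two weights coincide.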

    We compared the actual distribution of pN, pS, and their difference with that for 2,000,000 simulated codons. Each simulated codon was assigned a pN value drawn at random (with replacement) from the observed pN values and a pS value independently drawn at random from the observed pS values. A similar procedure was performed to create a data set of 2,000,000 simulated 30-codon windows. In addition, we conducted two computer simulations of codon evolution at 2,000,000 randomly generated codons using the Evolver program (Yang 1997). In each set, protein evolution followed a Dayhoff model, and codons were allowed to evolve 0.4 substitutions per site with a ratio of synonymous to nonsynonymous substitutions of 10:1 (consistent with the level of divergence observed in the actual data; table 1). One simulation was run with R = 0.5 and the other with R = 4.5 (as observed in the data; see below).
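
    The random-pairing expectation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study; the function name, the default seed, and the toy input values are assumptions.

```python
import random


def expected_fraction_pn_gt_ps(pn_values, ps_values, n_sim=2_000_000, seed=0):
    """Fraction of simulated codons with pN > pS under random pairing.

    Each simulated codon receives a pN and a pS drawn independently,
    with replacement, from the observed values.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        if rng.choice(pn_values) > rng.choice(ps_values):
            hits += 1
    return hits / n_sim


# Toy input (invented values): pN > pS only when pN = 1.0, i.e. 1 draw in 4.
toy_fraction = expected_fraction_pn_gt_ps(
    [0.0, 0.0, 0.0, 1.0], [0.0, 0.5], n_sim=20_000
)
```

    Multiplying the returned fraction by the number of observed codons (or windows) gives an expected count that can be set against the observed count.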

    Table 1 Comparisons of pS and pN in Individual Codons and in 30-Codon Sliding Windows in Aligned Genes of Saccharomyces cerevisiae and Saccharomyces paradoxus

    Results

    Differences at Individual Codons

    The proportion of synonymous differences per synonymous site (pS) and the proportion of nonsynonymous differences per nonsynonymous site (pN) were computed between S. cerevisiae and S. paradoxus at 1,993,217 individual codons in 4,133 genes. When pS and pN were estimated by the NG method, pN exceeded pS in 125,714 (6.3%) codons; this proportion was not significantly different from that expected on the basis of randomly pairing pS and pN values (table 1). However, in the present data set, there was evidence that the NG method substantially overestimated pS. We estimated pS at codons where both genomes used a twofold degenerate codon (N = 782,559) and at codons where both genomes used a fourfold degenerate codon (N = 758,893). Mean pS was estimated to be over twice as high at twofold degenerate codons as at fourfold degenerate codons (table 1).

    We estimated the transition:transversion ratio (R) from observed differences at fourfold degenerate sites. There were 151,254 transitions and 33,362 transversions, yielding an estimate of R = 4.5. Using this value in the ModNG method, we obtained a mean pS at twofold degenerate codons (0.2178) very close to that seen at fourfold degenerate sites (0.2536) (table 1). When the ModNG method was applied at all codons, pN exceeded pS in 128,912 codons (fig. 1A and table 1), significantly more frequently than expected by chance (table 1). There were 3,180 more codons with pN > pS than expected from random pairing of pS and pN values (table 1). This excess of 3,180 codons represents 0.16% of all codons compared and 2.4% of all codons with pN > pS by the ModNG method. When twofold degenerate codons were considered separately, the observed number of codons with pN > pS was significantly greater than expected, whether the NG or ModNG method was used (table 1).
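
    The arithmetic behind these figures can be reproduced directly from the counts reported above. The text does not state which significance test was used, so the z statistic below is only an illustrative normal approximation to a binomial count, which treats the random-pairing expectation as a fixed proportion (an assumption).

```python
from math import sqrt

# Counts reported in the text.
transitions, transversions = 151_254, 33_362
n_codons = 1_993_217
observed = 128_912        # codons with pN > pS (ModNG)
excess = 3_180            # observed minus random-pairing expectation

R = transitions / transversions              # transition:transversion ratio
expected = observed - excess                 # expected count under random pairing
excess_share_all = excess / n_codons         # excess as a share of all codons

# Illustrative test only: normal approximation, expectation treated as fixed.
p0 = expected / n_codons
z = (observed - expected) / sqrt(n_codons * p0 * (1 - p0))
```

    Here R comes out at about 4.53 and the excess at about 0.16% of all codons. Because the data set is so large, even an excess this small corresponds to a large z value under this approximation, which illustrates how a statistically significant excess can still involve very few codons.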

    FIG. 1.— The distribution of pN – pS, estimated by the ModNG method, in (A) 1,993,217 individual codons and (B) 1,933,457 sliding windows of 30 codons.

    Relationship Between Synonymous and Nonsynonymous Substitution

    One factor that might explain the apparent excess of codons with pN > pS would be a negative correlation between pS and pN. In fact, there was a strong negative correlation (r = –0.786; P < 0.001) between pS and pN at codons where at least one nucleotide difference occurred (N = 578,070). The main reason for this negative correlation was that there was a substantial proportion of codons with nonzero nonsynonymous difference but zero synonymous difference (123,424 or 21.4% of the total) and a still larger proportion of codons with nonzero synonymous difference but zero nonsynonymous difference (415,893 or 71.9%) but very few codons with nonzero nonsynonymous difference and nonzero synonymous difference (38,753 or 6.7%).
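
    To see how these three categories alone produce a strong negative association, one can reduce each codon to two indicators (any synonymous difference, any nonsynonymous difference) and compute the correlation between them. This phi coefficient is a simplified stand-in for the correlation between pS and pN reported above, not a reproduction of it; the function name is hypothetical.

```python
from math import sqrt


def phi_from_counts(both, nonsyn_only, syn_only, neither=0):
    """Pearson correlation between two presence/absence indicators.

    X = codon has a nonsynonymous difference, Y = codon has a synonymous one.
    """
    n = both + nonsyn_only + syn_only + neither
    p_x = (both + nonsyn_only) / n
    p_y = (both + syn_only) / n
    p_xy = both / n
    return (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))


# Category counts reported in the text for codons with at least one difference.
phi = phi_from_counts(both=38_753, nonsyn_only=123_424, syn_only=415_893)
```

    For the reported counts the indicator-based value is roughly -0.83, in the same strongly negative range as the reported r. Restricting attention to codons with at least one difference removes the "neither" category, which is what drives the association so far below zero.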

    A similar correlation was seen when only fourfold degenerate codons were considered (r = –0.770; P < 0.001; 221,817 codons). Here again the proportion of codons with nonzero nonsynonymous difference and nonzero synonymous difference (5.0%) was small relative to the proportion of codons with nonzero nonsynonymous difference and zero synonymous difference (11.8%) and the proportion of codons with nonzero synonymous difference and zero nonsynonymous difference (83.2%). A similar negative correlation was observed when twofold degenerate codons were considered (r = –0.891; P < 0.001; 178,349 codons). The correlations for fourfold degenerate codons and for twofold degenerate codons were significantly different from each other (P < 0.001). In the case of twofold degenerate codons, the proportion of codons with nonzero nonsynonymous difference and nonzero synonymous difference (3.0%) was small relative to the proportion of codons with nonzero nonsynonymous difference and zero synonymous difference (21.7%) and the proportion of codons with nonzero synonymous difference and zero nonsynonymous difference (75.4%).

    The negative correlation between pS and pN thus appears to result mainly from three factors: (1) the occurrence of nucleotide differences varies stochastically across codons; (2) because of purifying selection eliminating many nonsynonymous mutations, the rate of occurrence of nonsynonymous differences is much lower than that of synonymous differences; and (3) the overall rate of substitution is much less than one substitution per site. Thus, numerous codons with a synonymous difference but no nonsynonymous differences are expected to occur, but there will also be a substantial number of codons with nonsynonymous differences but no synonymous differences. In many cases, pN may exceed pS largely because pS happens to be low, as a result of stochastic error, rather than because pN is unusually elevated. We tested this interpretation by computer simulations, assuming only purifying selection and a rate of substitution comparable to that observed in the actual data; we observed correlations very similar to those seen in the actual data. For simulations with R = 0.5 and R = 4.5, pS and pN at codons where at least one nucleotide difference occurred were negatively correlated (r = –0.601 and r = –0.574, respectively; P < 0.001 in both cases).
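
    The reasoning above can be checked with a toy simulation that is far simpler than the Evolver runs: synonymous and nonsynonymous differences are placed on codons independently, each with a low probability, and the correlation is computed only over codons that received at least one difference. The probabilities 0.20 and 0.05 are illustrative assumptions, not estimates from the data, and the function names are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate_correlation(n_codons=100_000, p_syn=0.20, p_nonsyn=0.05, seed=1):
    """Correlation between difference indicators at codons with >= 1 difference."""
    rng = random.Random(seed)
    syn, nonsyn = [], []
    for _ in range(n_codons):
        s = 1 if rng.random() < p_syn else 0      # synonymous difference present
        n = 1 if rng.random() < p_nonsyn else 0   # nonsynonymous difference present
        if s or n:                                # condition on at least one difference
            syn.append(s)
            nonsyn.append(n)
    return pearson(syn, nonsyn)


r_toy = simulate_correlation()
```

    Even though the two kinds of difference are generated independently, conditioning on codons with at least one difference yields a strongly negative correlation, because codons carrying both kinds are rare when both probabilities are well below one.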

    Average Numbers of Nucleotide Differences

    In order to provide further evidence regarding the relationship between synonymous and nonsynonymous substitution, we computed mean values of pS and pN separately for codons with pN > pS and for codons with pN ≤ pS (table 2). Using both NG and ModNG methods, mean pS at codons with pN > pS was significantly lower than mean pS at codons with pN ≤ pS (table 2). This was true of all codons, fourfold degenerate sites, and twofold degenerate codons (table 2). As expected, using both NG and ModNG methods, mean pN at codons with pN > pS was significantly greater than mean pN at codons with pN ≤ pS (table 2).

    Table 2 Mean pS and pN (±SE) Compared Between Codons and 30-Codon Windows with pN > pS and those with pN ≤ pS, and Between Codons and 30-Codon Windows with pN > p̄S and those with pN ≤ p̄S

    These results support the hypothesis that codons with pN > pS frequently showed both lower than average pS and higher than average pN. Thus, a more conservative method of identifying codons with elevated pN (as might result from positive selection) would be to compare pN in a codon not with pS in the same codon but rather with the genome-wide mean pS (p̄S). Therefore, we computed mean values of pS and pN separately for codons with pN > p̄S and for codons with pN ≤ p̄S (table 2). Mean pS values for codons with pN > p̄S were not significantly greater than the mean pS for codons with pN ≤ p̄S except in the case of fourfold degenerate codons (table 2).

    In the case of all sites using the NG method and in the case of twofold degenerate sites using the ModNG method, mean pS values for codons with pN > p̄S were significantly lower than the mean pS for codons with pN ≤ p̄S (table 2). In the case of all sites using the ModNG method, the mean pS value for codons with pN > p̄S was not significantly different from the mean pS for codons with pN ≤ p̄S (table 2). On the other hand, mean pN for codons with pN > p̄S was in all cases significantly greater than mean pN for codons with pN ≤ p̄S (table 2). These results suggest that comparing pN in an individual codon with p̄S provides a way of identifying codons with elevated pN but without unusually low pS.

    We computed the numbers of codons with pN > p̄S and compared the observed values with the expectation derived from randomly pairing pS and pN values (table 3). The observed number of codons with pN > p̄S was significantly different from the expected value only in the case of twofold degenerate codons with the NG method (table 3). When the ModNG method was used, the observed and expected values were nearly identical in every case, and the observed and expected values were also identical for the NG method applied to all sites (table 3).

    Table 3 Observed and Expected Numbers of Codons and 30-Codon Windows with pN > p̄S
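
    The two ways of flagging codons can be contrasted in a short sketch: comparing pN with pS at the same codon, and comparing pN with the mean pS over all codons. The function name and the toy values are assumptions for illustration only.

```python
def flag_codons(pn_values, ps_values):
    """Flag codons by two criteria.

    Returns (mean_ps, same_codon_flags, mean_based_flags), where
    same_codon_flags[i] is True when pN > pS at codon i, and
    mean_based_flags[i] is True when pN at codon i exceeds the mean pS.
    """
    mean_ps = sum(ps_values) / len(ps_values)
    same_codon = [pn > ps for pn, ps in zip(pn_values, ps_values)]
    mean_based = [pn > mean_ps for pn in pn_values]
    return mean_ps, same_codon, mean_based


# Toy values (invented): codon 2 has a small pN but pS happens to be zero.
pn = [0.0, 0.1, 0.6, 0.0]
ps = [1.0, 0.0, 0.0, 1.0]
mean_ps, same_codon, mean_based = flag_codons(pn, ps)
```

    In this toy input the second codon is flagged by the same-codon criterion only because its pS is zero, whereas the mean-based criterion flags only the third codon, whose pN is high in its own right.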

    Sliding Window Analyses

    In sliding window analyses, the observed number of windows with pN > pS was significantly lower than expected whether the NG or ModNG method was used (table 1). Notably, the use of ModNG greatly increased both the observed and expected numbers of windows with pN > pS (table 1). This presumably occurred because of the reduction in estimates of pS using ModNG (table 1). The distribution of pN – pS for the sliding windows was much smoother and closer to a bell-shaped curve than in the case of individual codons (fig. 1). Windows with pN > pS had significantly lower mean pS than windows with pN ≤ pS and significantly higher mean pN than windows with pN ≤ pS (table 2). However, windows with pN > p̄S had significantly higher mean values of both pS and pN than windows with pN ≤ p̄S (table 2). Furthermore, the observed number of windows with pN > p̄S was very close to the expected number (table 3).
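
    A sliding-window calculation of this kind might look like the sketch below. The text does not specify how window values were obtained from the per-codon counts, so pooling differences and sites across the 30 codons of each window is an assumption, as are the function name and the input layout.

```python
def sliding_window_ps_pn(codons, width=30):
    """Window-level pS and pN along one gene.

    codons: list of (syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites) tuples,
            one per aligned codon.
    Returns one (pS, pN) pair per window; windows advance one codon at a time.
    Assumption: differences and sites are pooled (summed) within each window.
    """
    results = []
    for start in range(len(codons) - width + 1):
        window = codons[start:start + width]
        sd = sum(c[0] for c in window)
        s_sites = sum(c[1] for c in window)
        nd = sum(c[2] for c in window)
        n_sites = sum(c[3] for c in window)
        ps = sd / s_sites if s_sites > 0 else None
        pn = nd / n_sites if n_sites > 0 else None
        results.append((ps, pn))
    return results


# Toy gene of 31 codons with a single synonymous difference at the first codon.
toy_gene = [(1, 1.0, 0, 2.0)] + [(0, 1.0, 0, 2.0)] * 30
windows = sliding_window_ps_pn(toy_gene)
```

    Because each window pools many sites, a single difference shifts the window estimate only slightly, which is consistent with the smoother distribution of pN – pS seen for windows than for individual codons.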

    Discussion

    Comparison of the proportions of synonymous (pS) and nonsynonymous (pN) difference in orthologous genes of two yeast species revealed different patterns depending on the sites analyzed and the method used. The ModNG method showed advantages over the original NG method in analyzing this data set. There was a strong transitional bias, as indicated by a transition:transversion ratio of 4.5:1 at fourfold degenerate sites. The NG method gave an estimate of mean pS at twofold degenerate codons more than twice that for fourfold degenerate codons (table 1). By contrast, the ModNG method yielded estimates of pS that were, on average, very similar for fourfold degenerate codons, twofold degenerate codons, and sliding windows of 30 codons (table 1).

    Using the ModNG method, we found a significant excess of codons with pN > pS in comparison to the expectation based on random shuffling of pS and pN values. It might be concluded that the excess in codons with pN > pS is caused by a certain proportion of codons which are subject to positive selection favoring nonsynonymous changes. Even if this interpretation is correct, the number of codons affected would be very few because the apparent excess of codons with pN > pS represented only 0.16% of all codons and only 2.4% of codons with pN > pS.

    However, there are problems with the hypothesis that the observed excess of codons with pN > pS is due to positive selection. First, there was a strong negative correlation between the number of synonymous differences and the number of nonsynonymous differences at codons with at least one nucleotide difference. Second, codons with pN > pS tended to have lower than average values of pS as well as higher than average values of pN (table 2). These problems were avoided if pN in a given codon was compared not with pS in the same codon but with the mean value of pS for all codons compared (p̄S). However, in the latter case, there was no evidence that codons with elevated pN occurred at a greater rate than the random expectation.

    When we examined 30-codon sliding windows, we found a certain number with pN > pS and a certain number with pN > p̄S (tables 1 and 3). However, in neither case were the observed numbers greater than the expected numbers (tables 1 and 3).

    Our results imply that the presence of a certain number of codons and of codon windows with a greater proportion of nonsynonymous than of synonymous nucleotide differences is expected given the pattern of nucleotide substitution. Thus, the existence of a set of such codons or codon windows cannot in itself be taken as evidence of the presence of positive Darwinian selection. As a consequence, our results cast doubt on the validity of methods—whether based on parsimony or likelihood—that consider the presence of a set of codons with dN > dS as evidence of positive selection. Of course, widely used methods look at the pattern of substitution across a phylogenetic tree, whereas we compared two genomes. However, the problems we observed are expected to be present in phylogenetic trees as well; for example, numbers of synonymous and nonsynonymous differences are likely to show stochastic variation from codon to codon, and there is likely to be a negative correlation between the numbers of synonymous and nonsynonymous differences.

    The observed negative correlation between pS and pN at codons with at least one nucleotide difference appeared to result mainly from the underrepresentation of codons with both synonymous and nonsynonymous differences. Such codons are lacking because the occurrence of nucleotide differences varies stochastically across codons and because the overall rate of nucleotide substitution is less than one substitution per site. In addition, purifying selection reduces the overall occurrence of nonsynonymous difference, resulting in many codons with synonymous but not nonsynonymous differences. Computer simulation supported the hypothesis that these factors alone are sufficient to explain the observed negative correlation between pS and pN at codons with at least one nucleotide difference. However, factors such as codon usage bias (Sharp and Li 1986), which will lead to purifying selection even at certain synonymous sites, may also contribute to maintaining this correlation.

    These problems may account in part for the overly liberal nature of the commonly used likelihood ratio tests (LRT) for positive selection (Suzuki and Nei 2004; Zhang 2004). The LRT compares a model including a set of codons with dN/dS > 1.0 with a model lacking such a set. If the former model gives a better likelihood score, the hypothesis of positive selection is accepted. Our results suggest that this is not a valid test of the hypothesis of positive selection because the existence of a set of codons with dN/dS > 1.0 is expected even under neutrality. An alternative approach might be to search for codons at which dN exceeds mean dS for the entire gene. However, our results (table 3) suggest that even the latter approach would still not provide a valid test of positive selection because it is expected that there will be certain codons where, by chance alone, dN exceeds mean dS for the entire gene. Rather, these methods of testing for positive selection might be improved if one were to allow for a set of codons with dN/dS > 1.0 even in the model without positive selection. This model could be tested against a model in which the number of codons with dN/dS > 1.0 is greater than a threshold value expected under random substitution.

    Acknowledgements

    This research was supported by grant GM43940 from the National Institutes of Health to A.L.H.

    References

    Fares, M. A., S. F. Elena, J. Ortiz, A. Moya, and E. Barrio. 2002. A sliding window-based method to detect selective constraints in protein-coding genes and its application to RNA viruses. J. Mol. Evol. 55:509–521.

    Hughes, A. L., and M. Nei. 1988. Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335:167–170.

    Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.

    Kimura, M. 1977. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267:275–276.

    Li, W.-H. 1993. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96–99.

    Nei, M., and T. Gojobori. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3:418–426.

    Sharp, P. M., and W.-H. Li. 1986. An evolutionary perspective on synonymous codon usage in unicellular organisms. J. Mol. Evol. 24:28–38.

    Suzuki, Y., and T. Gojobori. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315–1328.

    Suzuki, Y., and M. Nei. 2004. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol. Biol. Evol. 21:914–921.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

    Yang, Z., R. Nielsen, N. Goldman, and A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.

    Zhang, J. 2004. Frequent false detection of positive selection by the likelihood method with branch-site models. Mol. Biol. Evol. 21:1332–1339.

    Zhang, J., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708–3713.

    (Austin L. Hughes and Robe)