Source: Molecular Biology and Evolution, 2004, No. 10
Successful Lateral Transfer Requires Codon Usage Compatibility Between Foreign Genes and Recipient Genomes
     * Program of Computational Genomics and Program of Molecular and Microbial Ecology, Centro de Investigación sobre Fijación de Nitrógeno (UNAM), Cuernavaca, Morelos, México; and Department of Probability and Statistics, Centro de Investigación en Matemáticas, Guanajuato, Guanajuato, México

    E-mail: amedrano@cifn.unam.mx.

    Abstract

    We present evidence supporting the notion that codon usage (CU) compatibility between foreign genes and recipient genomes is an important prerequisite to assess the selective advantage of imported functions, and therefore to increase the fixation probability of horizontal gene transfer (HGT) events. This contrasts with the current tendency in research to predict recent HGTs in prokaryotes by assuming that acquired genes generally display poor CU. By looking at the CU level (poor, typical, or rich) exhibited by putative xenologs still resembling their original CU, we found that most alien genes predominantly present typical CU immediately upon introgression, thereby suggesting that the role of CU amelioration in HGT has been overemphasized. In our strategy, we first scanned a representative set of 103 complete prokaryotic genomes for all pairs of candidate xenologs (exported/imported genes) displaying similar CU. We applied additional filtering criteria, including phylogenetic validations, to enhance the reliability of our predictions. Our approach makes no assumptions about the CU of foreign genes being typical or atypical within the recipient genome, thus providing a novel unbiased framework to study the evolutionary dynamics of HGT.

    Key Words: horizontal gene transfer • codon usage compatibility • comparative genomics • evolution • Bayesian model

    Introduction

    The incessant deluge of completely sequenced genomes has boosted the development of lateral genomics, providing new insights into the nature, prevalence, and evolutionary implications of horizontal gene transfer (HGT) in such critical biological processes as the occupation of new niches and speciation (Berg and Kurland 2002; Gogarten, Doolittle, and Lawrence 2002; Brown 2003). Laterally acquired genes are usually expected to display atypical codon usage (CU) mainly due to two lines of evidence. First, there are biases toward poor CU in genes with plasmid, phage, and transposon related functions (Sharp and Li 1987; Médigue et al. 1991). Second, some chromosomes exhibit distinctive regions with atypical sequence compositions (Lawrence and Ochman 1998; Ochman and Bergthorsson 1998; Kaneko et al. 2000). Given that CU is thought to reflect adaptation of genes to the translational machinery of the host (Ikemura 1981, 1982), and foreign genes have not been exposed to the same evolutionary forces as resident genes, it is commonly assumed that foreign genes should exhibit a poorly adapted codon composition. Thus, atypical CU, G+C content, and/or different oligonucleotide frequencies have been routinely used as critical parameters to predict HGT events (Médigue et al. 1991; Karlin, Mrázek, and Campbell 1998; Lawrence and Ochman 1998; Moszer, Rocha, and Danchin 1999; Garcia-Vallvé, Romeu, and Palau 2000). However, cautious evaluations conclude that current methodologies based on unusual codon abundances and base composition alone are poor indicators of HGT (Koski, Morton, and Golding 2001; Wang 2001) and that comparisons among different approaches generate fewer predictions in common than would be expected by chance (Ragan 2001b); nevertheless, methods testing compatible null hypotheses are expected to increase their level of agreement (Lawrence and Ochman 2002; Ragan 2002). 
    Based on these evaluations, the need to develop new quantitative models to interpret data and assess confidence has been stressed (Ragan 2001a).

    A gene is said to display rich CU if it uses preferentially the most abundant codons within the host genome. Accordingly, poor or atypical CU indicates the preferential use of rare codons, and typical CU reflects a balance between abundant and rare codons. In this work we sought to find out what the actual CU level (poor, typical, or rich) displayed by successfully imported genes was at the moment of acquisition. Thus, it was necessary to collect pairs of imported/exported genes that still conserve the compositional footprint of the donor DNA, without making any a priori assumption about their CU level. The four basic premises in our approach are: (1) candidate xenologous genes (CXGs), or pairs of horizontally exported/imported genes, must bear similar codon composition independently of whether the CU level of acquired genes is poor, typical, or rich in the recipient genomes; (2) CXGs should have similar lengths; (3) they should display the highest global identity values at the protein level, and thus they should be detected by current computational methods to infer orthology; and (4) the phylogenetic relationship between detected CXGs must contradict the hypothesis of vertical inheritance. The first premise was used to obtain an initial list of candidate xenologs, and the other three allowed the removal of potential false positives. To assess the CU of genes, we designed the Codon-Richness Index (CRI), which quantifies the degree to which individual genes use the most abundant codons within a reference genome. Genes showing low, average, and high CRI identify the three CU levels defined above.

    We followed the strategy depicted in figure 1 to detect pairs of CXGs. First, we calculated a CU profile of potential xenologs for 103 representative genomes (see table 1 for a complete list), that is, a tally of the CU level that each gene from the 102 other (donor) genomes would display within each (recipient) genome taken in turn. Second, we searched for pairs of exported/imported genes that still conserve their original CU, that is, genes that show similar codon frequencies (first working premise). Because CU similarity alone might not be enough evidence of HGT, we applied additional filtering criteria, including phylogenetic validations, to strengthen the predictions (i.e., the other three working premises; see Methods and Results). Then, we counted the number of detected xenologous genes at each CU level, from the perspective of their host genomes. Third, we compared the potential CU HGT profile, as obtained from the first step, with the actual CU HGT profile exhibited by the xenologs predicted in the second step. The comparison between the actual and the potential CU HGT profiles reveals that most horizontally transferred genes exhibited a typical codon composition at the moment of acquisition, which suggests that poor CU actually represents a strong barrier against successful acquisition and utilization of foreign genes.

    FIG. 1.— Strategy to assess the codon usage (CU) level of genes involved in horizontal transfer events. Putative orthologs (POs), quantification of CU values, candidate xenologous genes (CXGs), and phylogenetic analyses were performed as explained in Methods.

    Table 1 The Horizontal Gene Transfer (HGT) Potential Among the Representative Set of 103 Complete Prokaryote Genomes

    Methods

    The Codon-Richness Index

    To assess and compare the CU level of genes within and across genomes, we designed the CRI based on the overall genomic abundance of all 64 codons. The index quantifies the extent to which individual genes use the most abundant codons within a reference genome. Let $G_{a,i}$ be gene $i$ within genome $a$, $q_{a,i}(c)$ the relative frequency of codon $c$ in gene $G_{a,i}$, and $p_b(c)$ the probability or relative frequency of codon $c$ in genome $b$. The CRI of gene $G_{a,i}$ based on the codon abundances of genome $b$ is then defined as:

    $$\mathrm{CRI}_b(G_{a,i}) = \sum_{c=1}^{64} q_{a,i}(c)\,\log p_b(c) \tag{1}$$

    This index may be interpreted as the expected utility of a particular codon distribution and constitutes a local score function (Bernardo and Smith 1994), where higher values are obtained when gene $G_{a,i}$ uses the most abundant codons within genome $b$. To obtain the CRI values for genome $a$ relative to its own codon abundances, $b$ must refer to genome $a$ ($b = a$), that is, equation (1) should be evaluated for $\mathrm{CRI}_a(G_{a,i})$; figure 2 illustrates the CRI for all genes in Escherichia coli K12. By evaluating the index across genomes (when $a \neq b$), we may approach the question of whether or not the codon composition of foreign genes is atypical in recipient genomes.
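    As a minimal sketch of how such an index could be computed, the code below assumes the CRI is the expected logarithmic score, that is, the sum over codons of q(c) log p(c), which is the local score function in the sense of Bernardo and Smith. The natural logarithm, the function names, and the requirement that every codon of the gene occurs in the reference genome are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def codon_counts(seq):
    """Count codons in a coding sequence read in frame from position 0."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def relative_freqs(counts):
    """Turn codon counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def cri(gene_seq, genome_freqs):
    """Expected log score of a gene's codon usage under a genome's codon frequencies.

    Assumes every codon used by the gene has a nonzero frequency in
    genome_freqs (hypothetical simplification; no pseudocounts are applied).
    """
    q = relative_freqs(codon_counts(gene_seq))
    return sum(q_c * math.log(genome_freqs[codon]) for codon, q_c in q.items())


# Toy reference "genome" in which AAA is abundant and GGG is rare.
genome = relative_freqs(codon_counts("AAAAAAAAAGGG"))
rich_gene = cri("AAAAAA", genome)   # uses only the abundant codon
poor_gene = cri("GGGGGG", genome)   # uses only the rare codon
```

    In this toy case the gene built from the abundant codon scores higher than the gene built from the rare codon, which is the behavior the index is meant to capture. Passing the codon frequencies of a different genome as `genome_freqs` corresponds to the cross-genome case (a ≠ b).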

    FIG. 2.— The Codon-Richness Index (CRI) of genes sorted in an increasing order. The concepts of poor, typical, and rich codon usage (CU) become apparent. The thresholds for low and high CRI are located at the inflexion points of the curve, embracing about 80% of the genes (see Methods).

    Classification in Three Codon Usage Levels

    We classified all genes into three CU categories or levels containing genes with low, typical, and high CRI, as illustrated in figure 2. We initially observed, in E. coli K12, that by sorting the genes by their CRI, most genes display approximately a constant CRI difference (see the slope of the curve in fig. 2). Genes with the lowest and highest CRI show greater differences, thus changing the slope of the curve. When similar curves for all genomes were drawn, it became apparent for almost all cases that the inflexion points, where the slope starts to deviate from typical CRI, embrace about 80% of the genes, thereby suggesting a strategy for the definition of genome-specific thresholds for high and low CRI. For each genome, we estimated its CRI histogram and calculated the CRI values (low and high CRI thresholds) at which the interval of maximum density and minimum length contained 80% of the genes.
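    The threshold search can be illustrated with a short sketch. The paper works from a CRI histogram; the code below swaps in a simpler stand-in that slides a window over the sorted CRI values and keeps the shortest window holding the requested fraction of genes. The function names and the toy values are hypothetical.

```python
import math


def shortest_interval(values, coverage=0.80):
    """Return (low, high) for the shortest interval containing `coverage` of the values."""
    xs = sorted(values)
    n = len(xs)
    k = max(1, math.ceil(coverage * n))  # number of values the interval must hold
    # Among all windows of k consecutive sorted values, keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def cu_level(cri_value, low, high):
    """Classify a CRI value against genome-specific low and high thresholds."""
    if cri_value < low:
        return "poor"
    if cri_value > high:
        return "rich"
    return "typical"


# Toy CRI-like values; half of them are packed between 10 and 14.
toy_values = [0, 10, 11, 12, 13, 14, 30, 40, 50, 60]
low, high = shortest_interval(toy_values, coverage=0.5)
```

    With real data, `values` would be the CRI of every gene in one genome and `coverage` would stay at 0.80, so that the returned bounds play the role of the low and high CRI thresholds.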

    Detection of Candidate Xenologous Genes

    We devised a Bayesian method to compute the posterior probability that two putative orthologs (POs) are CXGs given the CU differences among all related POs and the probability of the null hypothesis (i.e., the chance that they are not xenologs preserving their original sequence composition). The underlying assumption is that CXGs satisfy all criteria of current methods to detect orthology based on protein sequence comparisons. Thus, CXGs were sought across all genomes as POs, which were chosen using the bidirectional best hit (BDBH) working definition of orthology as previously reported (Moreno-Hagelsieb and Collado-Vides 2002a). Whenever a BDBH was not detected in a given target genome, we used the top-scoring BlastP hit as the PO as long as there was no better hit within the reference genome. This is the Ortholog Higher than Paralog (OHTP) definition (Ermolaeva, White, and Salzberg 2001). All BlastP comparisons were run with a maximum E-value cutoff of 0.001, filtering low-information sequences and using the option for a final Smith-Waterman alignment (Schaffer et al. 2001). We assume that CRI differences between CXGs follow a normal distribution with mean zero and a very small standard deviation, $\sigma_h$. Similarly, CRI differences between nonxenologous genes (or POs) were assumed to follow a normal distribution with mean zero and a standard deviation, $\sigma_o$, much greater than $\sigma_h$. Hence, the required predictive probability would be

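    $$P\big(H(G_{a,i}) = b \mid \mathbf{D}_{a,i}\big) = K \, \mathcal{N}\big(D_{a,i}(b) \mid 0, \sigma_h\big) \prod_{j \neq b} \mathcal{N}\big(D_{a,i}(j) \mid 0, \sigma_o\big) \, P\big(H(G_{a,i}) = b\big) \tag{2}$$

    where $H(G_{a,i}) = b$ represents the hypothesis that gene $G_{a,i}$ was involved in an HGT event with genome $b$, that is, that genes $G_{a,i}$ and $G_{b,i}$ are CXGs. $\mathbf{D}_{a,i}$ is the vector of CRI differences between $G_{a,i}$ and each one of its POs in other genomes (e.g., $G_{b,i}$, $G_{c,i}$, $G_{d,i}$, etc.), based on the codon composition of genome $a$. $D_{a,i}(b)$ is the specific CRI difference between $G_{a,i}$ and its PO within genome $b$ ($G_{b,i}$), based on genome $a$. $\mathcal{N}(x \mid 0, \sigma)$ denotes the density at $x$ of a normal distribution with mean zero and standard deviation $\sigma$, where $\sigma_h$ and $\sigma_o$ are the standard deviations assumed for CXGs and for nonxenologous POs, respectively, and $j$ runs over the remaining POs. $K$ is the normalization constant. $P(H(G_{a,i}) = b)$ represents the independent prior belief that $G_{a,i}$ and $G_{b,i}$ are CXGs; that is, if a reference gene has $n$ POs, then we consider a priori that all $n$ POs have equal chances of being the CXG, and so, taking into account the null hypothesis $H(G_{a,i}) = 0$, that probability is $1/(n+1)$. The threshold for the initial detection of CXGs was set to the value of the posterior probability of the null hypothesis, that is, equation (2) evaluated for $P(H(G_{a,i}) = 0 \mid \mathbf{D}_{a,i})$ with $\sigma_h = \sigma_o$. We performed the predictions by stringently defining $\sigma_h = 0.002804$ and $\sigma_o = 0.1$, which means that two POs are considered CXGs if $|D_{a,i}(b)| < 0.0075$ (note that this difference represents on average 1/18 of the CRI scale; see figs. 2 and 3).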
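    The 0.0075 cutoff can be checked numerically. The sketch below is a minimal illustration, assuming the likelihood is a product of zero-mean normal densities (standard deviation 0.002804 for the candidate xenolog and 0.1 for every other PO) and equal priors of 1/(n+1). Under those assumptions a PO beats the null hypothesis exactly when its narrow-normal density exceeds its wide-normal density. The function names and the example CRI differences are hypothetical and not taken from the paper.

```python
import math


def normal_pdf(x, sd):
    """Density at x of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def cxg_threshold(sd_h, sd_o):
    """Largest |D| for which the narrow density (sd_h) still exceeds the wide one (sd_o)."""
    return math.sqrt(2.0 * math.log(sd_o / sd_h) / (sd_h ** -2 - sd_o ** -2))


def posteriors(diffs, sd_h, sd_o):
    """Posterior over hypothesis 0 (null) and hypotheses 1..n (PO j is the xenolog).

    Equal priors cancel in the normalization, so only likelihoods are needed.
    """
    null_likelihood = math.prod(normal_pdf(d, sd_o) for d in diffs)
    weights = [null_likelihood]
    for d in diffs:
        # Swap the wide density for the narrow one for the candidate PO only.
        weights.append(null_likelihood * normal_pdf(d, sd_h) / normal_pdf(d, sd_o))
    total = sum(weights)
    return [w / total for w in weights]


threshold = cxg_threshold(0.002804, 0.1)
# Hypothetical CRI differences between one reference gene and three POs.
post = posteriors([0.001, 0.05, -0.08], 0.002804, 0.1)
```

    Here `post[0]` is the posterior of the null hypothesis, and only the first PO, whose CRI difference lies inside the cutoff, obtains a posterior above it.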

    FIG. 3.— Codon-Richness Index (CRI) differences between pairs of putative orthologs (POs) in H. influenzae and N. meningitidis MC58. The CRI of all pairs of POs in both genomes was calculated based on the codon abundances of H. influenzae. Smaller CRI differences can be clearly observed between candidate xenologous genes (CXGs).

    Phylogenetic Analysis

    It has been demonstrated that the most similar genes in a local or global alignment are not necessarily the closest neighbors in a phylogenetic tree (Koski and Golding 2001). Accordingly, our filtering criteria, where we require that pairs of detected candidate xenologs show small CRI differences and the highest global identity at the protein level, might not be enough supporting evidence. Therefore, for each predicted pair of CXGs we estimated protein-based maximum-likelihood (ML) phylogenies and eliminated all predictions producing trees not consistent with the hypothesis of HGT.

    A reference topology to perform phylogenetic incongruity tests was obtained by combining three different strategies and then inferring a consensus tree. First, we used the standard taxonomy for prokaryotic genomes, as reported in the NCBI Taxonomy database (Wheeler et al. 2000). Second, a dendrogram was built based on the average protein similarity of all genes shared between each pair of genomes (see elimination of redundant genomes in Moreno-Hagelsieb and Collado-Vides [2002b]). A distance matrix was generated with these scores, and the program FITCH from the Phylip 3.6b suite of programs (Felsenstein 1989) was used to generate a tree, allowing for overall global rearrangements. The resulting topology matched the current taxonomy of prokaryotes surprisingly well, with only five genomes misplaced (data not shown). Third, from the set of genes previously proposed as good candidates to predict evolutionary relationships among prokaryotes (Zeigler 2003), we selected those with at least 90 BDBHs in our set of nonredundant genomes, namely atpD, cysS, ffh, glyA, recA, and serS. Phylogenetic analyses of the products encoded by these genes, as well as for all predicted CXGs, were performed as described below. Only those clades/lineages supported by the three approaches were considered as reference.

    Multiple alignments of putative orthologous proteins were performed using the program ClustalW (Thompson et al. 1997) with default settings. Saturated sites and potentially misaligned regions were removed by applying the program Gblocks (Castresana 2000) with default parameters. Protein trees were constructed under the ML optimality criterion, applying the JTT+Γ model of amino acid substitutions as implemented in PHYML (Guindon and Gascuel 2003), which accommodates among-site rate variation using a discrete gamma distribution with four rate categories. Nodal support for each JTT+Γ ML phylogeny was assessed by the analysis of 100 bootstrapped alignments and the resulting consensus tree, which were generated with the programs SEQBOOT and CONSENSE, respectively, in the Phylip 3.6b package. Concerning the set of HGT predictions, we initially parsed each tree automatically and regarded it as a good prediction only if the two putative xenologs shared a terminal node, its bootstrap support was at least 75%, and a third PO was present such that it rendered the association between the pair of predicted xenologs phylogenetically incongruent. Finally, all trees that successfully passed such filters were visually inspected for the quality of the topology and for potential problems such as the presence of long branches, poorly resolved clades, multiple paralogs, etc. All doubtful predictions were subsequently discarded.

    Results

    Most Potential Foreign Genes Exhibit Poor Codon Usage from the Perspective of Potential Recipient Genomes

    We calculated the CU level of genes by applying a genome-based CU measure, the CRI, designed to quantify the degree to which individual genes use the most abundant codons within a reference genome (see Methods). Then, we set two genome-specific CRI thresholds to classify the genes into three classes—the high-CRI (rich CU class), the typical-CRI (typical CU class), and the low-CRI (poor CU class) (see Methods and fig. 2). Next, we performed a massive in silico HGT, which consists in considering each genome in turn as recipient and all other genomes as potential donors. We then computed the CRI that each gene (or potential xenolog) from the donor genomes would display within the putative recipient genome. Finally, we counted the number of potential xenologs that fell into each of the three CU categories based on the CRI they would display in the recipient genomes (see Methods for details). The CU profile formed by the number of potential xenologs entering a genome as poor, typical, and rich is what we call the potential CU HGT profile.

    As shown in table 1, 74% of the genes in other genomes are currently low-CRI genes with respect to the potential recipient genomes, 22% would qualify as typical-CRI genes, and 4% as high-CRI genes. Interestingly, a couple of genomes accept as typical genes most genes in other genomes (see X. fastidiosa and P. gingivalis W83). If we restrict the HGT potential estimation to genomes belonging to the same taxonomic category, for instance only within the Proteobacteria, a similar pattern is observed, in that most potential xenologs (65%) would display poor CU in the recipient genome (data not shown). As expected, the number of potential xenologs with typical CU increases as the evolutionary distances decrease. It is worth noting that the HGT potential is consistent with the expectations and results of several authors who have based their work on atypical gene content (Karlin, Mrázek, and Campbell 1998; Lawrence and Ochman 1998; Garcia-Vallvé, Romeu, and Palau 2000). For instance, from the genes predicted by Lawrence and Ochman (1998) as horizontal acquisitions in E. coli K12 strain MG1655, 503 (84%) had a CRI lower than the genomic mean and 329 (55%) are low-CRI genes (we could only find 601 genes identified by name or position out of their 755 predictions; missing genes might have been removed from the current genome version as a result of over-annotations clean up). Similarly, 310 (82%) of the horizontal acquisitions in E. coli predicted by Garcia-Vallvé, Romeu, and Palau (2000) are genes below the genomic CRI mean, while 207 (55%) are low-CRI genes (we could only find 376 genes by their name, b-number, or position out of their 381 predictions, probably for similar reasons as before). However, we must emphasize that these numbers (the potential CU HGT profile) are quite different from the CRI distribution of detected HGT events (the actual CU HGT profile) as shown below.

    Pairs of Xenologs that Resemble Their Original Codon Usage Can Be Detected as Homologs with Very Similar Codon Usage

    To work only with representative genomes, we reduced the 148 available prokaryotic genomes to a nonredundant set of 103, by following a previously reported methodology based on the average protein similarity of shared genes between pairs of genomes (Moreno-Hagelsieb and Collado-Vides 2002b). We are aware that the current sample of completely sequenced genomes is not representative enough to ensure we can detect exact pairs of genomes that have been involved in HGT events. Thus, whenever we say we detect a pair of genes exchanged between two genomes, the recipient/donor genomes could well be close relatives of the actual genomes involved. Candidate xenologous genes (CXGs) across all genomes were identified among POs as described in Methods. The rationale is that CXGs are a subset of all genes detectable by current methods to identify orthologs. We are interested in HGT events where the transferred genes still resemble their original CU. Thus, we extracted an initial set of potential CXGs from the set of POs by searching for pairs of POs whose CRI difference is close to zero, when both CRIs are computed using the codon composition of either of the two genomes involved. In other words, only those pairs of POs that use to the same extent the most abundant codons within the donor and/or recipient genome are taken into consideration in order to predict CXGs. In figure 3 we show an example, using H. influenzae and N. meningitidis MC58, to illustrate that CRI differences between CXGs tend to be much smaller than between POs. As would be expected, the number of CXGs showing small CRI differences increases between closely related organisms. We apply a Bayesian method to perform an initial identification of CXGs. The method calculates the posterior probability that two POs with a small CRI difference are CXGs given the CRI differences between all other related POs and the null hypothesis that none of the POs are CXGs (see Methods).

    However, it is not the CU criterion alone that discriminates candidate xenologs from POs, but the simultaneous application of other filtering criteria as detailed below. The role of the CU similarity criterion is to guarantee that predicted xenologs will have similar CU, which we regard as an obligatory attribute of the type of HGT events we are interested in. The reliability of the predictions is enhanced by the following criteria: First, predicted xenologs must have approximately the same length (±10%). Second, the global identity, at the protein level, between predicted xenologs must be at least 40% (results with higher identity thresholds are shown below). Third, predicted xenologous genes must be the best hits when all related POs are globally compared with the Needleman-Wunsch algorithm (Needleman and Wunsch 1970), using default parameters as implemented within the EMBOSS package (Olson 2002). Fourth, predicted xenologs must be the closest neighbors in a phylogenetic tree, and the topology of the tree must contradict the reference phylogeny of the genomes analyzed (see Methods for a detailed explanation). To minimize ambiguous interpretations, we only considered predictions involving at least five POs (the pair of predicted xenologs plus three other POs), as previously suggested (Syvanen 1994). The general strategy for HGT detection is summarized in figure 1.

    From the analysis of 103 nonredundant genomes, we detected a total of 375 HGT events involving 730 genes (see table 2); some genes are involved in two or more events. Table S1 in the online Supplementary Material gives details on pairs of xenologous genes and their annotated function. About 36% of the predictions involve hypothetical, putative, or unknown proteins, 28% are enzymes (i.e., reductases, transferases, kinases, dehydrogenases, methylases, mutases, synthases), 19% are involved in transport (if putative genes are included, then the number is 27%), 11% are involved in transcriptional regulation, 4% are genes related to mobile elements, and 2% are drug resistance genes.

    Table 2 Predicted Number of Genes Involved in Horizontal Gene Transfer (HGT)

    To assess the level of conservation of predicted xenologs among closely related genomes, we took the 21 E. coli K12 genes involved in HGT events (see table 2) and observed how many of them are present in the other three sequenced E. coli strains (O157:H7, O157:H7 EDL933, and CFT073). Using the BDBH definition of orthology, we found two genes confined exclusively to strain K12 or two strains (K12 and one other), six shared by K12 and two other strains, and 13 genes shared by all strains. This is not surprising, as the number of genes shared between closely related genomes is huge. Furthermore, 12 out of the 21 E. coli K12 genes involved in HGT have at least two homologs in the other three strains. There are two possible explanations for this: foreign genes may coexist with their native homologs and/or, as suggested by other authors, duplication of foreign genes is effectively more common than duplication of indigenous genes (Hooper and Berg 2003). As expected, the number of conserved xenologs with BDBHs elsewhere decreases with increasing phylogenetic distance. For example, none of the 21 genes in E. coli K12 predicted to be involved in HGT events has a BDBH in H. influenzae, whereas only one gene has a BDBH in X. fastidiosa. Unfortunately, the lack of closely related sequenced genomes on both sides of the probable transfers makes it hard to reach reliable conclusions from this analysis.

    Most Horizontal Gene Transfers Involve Genes with Typical and Rich Codon Usage

    As shown in table 2, most HGT predictions involve typical-CRI genes (84%), with apparently little contribution of genes displaying low or high CRI. This contrasts with both the potential shown in table 1 and the common underlying assumption that most HGTs involve genes displaying predominantly poor CU in the recipient genome at the moment of acquisition. Table 2 does not specify which genes are imported or exported in the predictions, and thus about 365 genes (50%) are expected to be acquisitions. Even if we assume that the 81 HGTs involving low-CRI genes (11% of the total HGT predictions) are all gene acquisitions, we would still have 284 genes (the remaining 39% of imported genes) displaying typical to high CRI. That is, at least 78% (284/365) of all predicted gene acquisitions display typical to rich CU. Furthermore, the overall tendencies observed in table 2 are not significantly affected if we vary the stringency on the minimal identity threshold required between predicted xenologs to detect an HGT event (see table 3).
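    The lower bound quoted above follows from simple arithmetic on the reported counts. The snippet below only retraces that reasoning; the variable names are chosen here for illustration.

```python
# Counts reported in the text.
total_genes = 730        # genes involved in the predicted HGT events
low_cri_genes = 81       # genes displaying low CRI

# About half of the genes in the predicted pairs are expected to be acquisitions.
acquisitions = total_genes // 2

# Worst case for the argument: every low-CRI gene is an acquisition.
typical_or_rich = acquisitions - low_cri_genes

# Fraction of acquisitions that must still display typical to rich CU.
lower_bound = typical_or_rich / acquisitions

# Shares of the total number of genes, as quoted in the text.
low_cri_share = low_cri_genes / total_genes
typical_or_rich_share = typical_or_rich / total_genes
```

    Even under the worst-case assignment, roughly four out of five predicted acquisitions fall outside the poor CU class.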

    Table 3 Horizontal Gene Transfer (HGT) Predictions if the Identity Threshold for Candidate Xenologous Genes Is Gradually Increased

    Our conclusions rely on the assumption that pairs of candidate xenologs satisfying the four filtering conditions can be regarded as evidence of xenologous genes that still resemble their original CU. These conditions are: (1) CXGs must display similar CRI; (2) they must display similar length; (3) they must show the highest global protein identity; and (4) the relationship between CXGs must contradict the hypothesis of vertical inheritance. However, it could be argued that two genes with rather different CU might yield the same CRI score because it is a weighted average, and, consequently, similar CRIs do not necessarily indicate recent HGT events. Although such a scenario is mathematically possible, the combination of the four criteria makes it unlikely to be the case for CXGs. To corroborate this, we took all predicted xenologs showing at least 80% global protein identity and performed codon-wise alignments with their DNA sequences. If CU similarity decays quickly, then most aligned codons should be different. Even for the case of 80% identity, we observe long stretches of DNA sequence identity, which ensures that a substantial fraction of the aligned codons are the same. More specifically, predicted xenologs showing a global protein identity greater than 90% had on average 80% identical codons in the alignment and 85.6% nucleotide identity when running the Smith-Waterman algorithm from the EMBOSS package with default settings. Similarly, those CXGs with global protein identity between 80% and 90% had on average 53% identical codons and 77% identity with Smith-Waterman. CXGs with 70%–80% global protein identity showed on average 72% DNA sequence identity with Smith-Waterman. Consequently, it is apparent that genes with protein identities of at least 70% have similar CU vectors, leaving little room, if any, to argue that they are different but had a similar CRI by chance.

    The CRI of a gene is most sensitive to codon frequency changes when they affect the frequencies of the most abundant codons in the reference genome. Therefore, if changes occur in codons that contribute little to the score (i.e., rare codons), then the CRI will diverge at a slower rate than the DNA sequence identity. For predictions below 70% sequence identity, more differences in the codon usage vectors are observed, but they are located mostly on codons that have no significant influence on CRI. There are also cases where the most abundant codons are essentially the same for two genomes. In such scenarios, the CRI in both donor and acquired genes will vary even more slowly (e.g., E. coli K12 and N. europaea). On the other hand, genes exchanged between genomes with large differences in CU would display very poor CRI values (e.g., an HGT from M. loti to H. influenzae).

    For a better assessment of the role of HGT within each CU class ($c$), the total number of predicted HGT events per CU class ($N_c$) should be normalized. One possibility is to divide $N_c$ by the number of comparisons necessary to detect HGTs in the CU class $c$, that is, the product of the number of resident genes within class $c$ ($r_c$) and the number of potential foreign genes that would enter directly in the same class ($f_c$), as indicated by the HGT potential (see table 1); more precisely, $N_c/(r_c f_c)$. Though correct, this normalization would favor our interpretation that very few successful HGTs involve genes with poor CU, since the number of potential foreign genes that would arrive with low CRI is enormous (see table 1). Alternatively, we may relax the normalization criterion and divide $N_c$ only by the number of resident genes in the corresponding class ($N_c/r_c$). Such normalization favors low-CRI genes, since, by definition, the typical-CRI area contains 80% of the genes (see Methods), and thus the number of HGTs entering with typical CRI is diluted. Despite such dilution, HGTs with typical CRI were found to occur 1.2 times more frequently than HGTs involving low-CRI genes, whereas HGTs with low CRI were 1.6 times more frequent than HGTs with high CRI. However, if we use the former normalization criterion, $N_c/(r_c f_c)$, then for each detected HGT event involving a low-CRI gene there would be 3.7 and 12.7 HGT events involving genes with typical and high CRI, respectively.

    One source of potential bias in our results is derived from our phylogenetic congruency tests. If within our nonredundant set of genomes poor-CU genes tend to have fewer POs than genes with typical CU, then our predictions might be biased toward genes with typical CU; in our approach, fewer than five POs would exclude such transfers from consideration. Before performing the phylogenetic analyses, there were certainly more HGT predictions because the only filtering criterion was that candidate xenologs had to be the most similar at the protein level; however, even in such circumstances the proportion of genes with poor, typical, and rich CU was essentially the same as that reported in table 2. Although the number of POs will increase as more genomes are sequenced, the proportion of xenologs with poor CU is so low that it is unlikely that it will ever be higher than the proportion of genes with typical CU. Another source of bias might exist between genomes with similar CU; in such cases genes with similar CRI might be found by chance, and the criterion of codon similarity alone could not discriminate between xenologous and orthologous genes. Independently of whether or not the genomes have similar CU, recently exchanged pairs of genes will always display similar CU immediately upon introgression; this situation clearly illustrates and stresses the importance of the phylogenetic and global protein similarity analyses as additional CU-independent filters.

    Current methods based on atypical sequence characteristics also have important sources of bias. For example, if a given open reading frame (ORF) is highly atypical, it might not actually be a true gene. This seems to be the case for a number of reported HGT predictions (Lawrence and Ochman 1998) that have been eliminated from the current version of the E. coli K12 genome in the Entrez genome database. In addition, there are alternative phenomena, besides HGT, that might explain the low GC content and/or poor CU in some genes. For instance, a substantial number of genomes have a GC distribution that is skewed toward low GC, and it was suggested that a remote origin for genes showing low GC seems unlikely. Selection for low GC is a more parsimonious explanation, because functional scenarios involving replication or recombination are easily conceivable for these genes (Syvanen 1994). Genes with low GC might also result from structural constraints, as is the case for ribosomal proteins, which display an excess of the lysine residues required for RNA-protein interactions, coded preferentially by the AAA codon (Lawrence and Ochman 1997; Ramakrishnan and White 1998). This also explains why most ribosomal proteins do not show high CRI in E. coli K12 (AAA is not an abundant codon).

    Common Predictions with Previous Reports of Horizontal Gene Transfer

    Among the predictions obtained in this work, some agree with previous HGT reports. For example, there is strong evidence that sodC and bioC were transferred between H. influenzae and N. meningitidis (Kroll et al. 1998). We correctly detect the transfer of bioC. However, sodC is not present in the genome of H. influenzae strain Rd as reported in GenBank. Similarly, a substantial fraction of predicted CXGs is involved in transport (19%), regulation (11%), drug resistance, and mobile elements (6%). These functions are often regarded as exchangeable (Gray and Garey 2001; Scott 2002; Beaber, Hochhut, and Waldor 2004).

    Given the nature of our methodology, we have very few predictions in common with methods based on atypical sequence characteristics. This is expected, as our method is not designed to be exhaustive and the other methods favor genes with poor CU by definition.

    The Case of Horizontal Gene Transfer Between Archaea and Thermotoga maritima

    We sought another source of information that might either contradict or support the present interpretation. It has been suggested that the eubacterium Thermotoga maritima acquired 24% of its genes from an archaeal source (Nelson et al. 1999; Nesbo et al. 2001). Even though this number might be an overestimation (Kyrpides and Olsen 1999; Koski and Golding 2001), the six predictions we obtained involving T. maritima genes are all exchanges with Archaea (see table S1 in the online Supplementary Material), five showing typical CU and one showing rich CU (see table 2). The six genes in Archaea currently show typical or rich CU when seen from within T. maritima. Furthermore, some archaeal genomes (potential donors) show reasonably good CU compatibility with T. maritima. For instance, 64%, 60%, and 58% of all genes in A. fulgidus, P. abyssi, and M. jannaschii, respectively, would show typical to rich CU in T. maritima if they were transferred at this moment, despite the divergence that these genomes have undergone since the original speciation events. These results strengthen the idea that genes showing compatible CU with a potential recipient genome are more likely to be successfully transferred. The reader should be aware that we predict very few HGTs for five reasons: first, we did not attempt to detect every possible xenolog, just those relevant to answer the fundamental question in this analysis, namely what the actual CU level predominantly shown by foreign genes is at the moment of introgression; second, we only take into consideration completely sequenced genomes; third, taxonomic groups are not evenly represented in the set of complete genomes; fourth, detected xenologs are required to share a terminal node in protein trees; and fifth, there must be at least five POs to uphold our predictions.
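
    The compatibility percentages quoted above amount to asking what fraction of a donor genome's genes would fall outside the low-CRI class of the recipient. A minimal sketch of that calculation follows. It treats CRI as an already computed numeric score (its definition is given in Methods and is not reproduced here), and it assumes that the typical-CRI area, which holds 80% of the resident genes, is bounded by the 10th and 90th percentiles of the resident CRI distribution; that symmetric split, like the function names, is an assumption of this sketch.

```python
from statistics import quantiles
from typing import Sequence, Tuple

def cri_cutoffs(resident_cri: Sequence[float]) -> Tuple[float, float]:
    """Return the assumed lower and upper bounds of the typical-CRI area:
    the first and last decile cut points of the resident CRI scores."""
    deciles = quantiles(resident_cri, n=10)
    return deciles[0], deciles[-1]

def classify(cri: float, cutoffs: Tuple[float, float]) -> str:
    """Assign a gene to the low, typical, or high CRI class."""
    low_cut, high_cut = cutoffs
    if cri < low_cut:
        return "low"
    if cri > high_cut:
        return "high"
    return "typical"

def compatible_fraction(donor_cri_in_recipient: Sequence[float],
                        resident_cri: Sequence[float]) -> float:
    """Fraction of donor genes that would show typical to high CRI when
    scored against the codon abundances of the recipient genome."""
    if not donor_cri_in_recipient:
        raise ValueError("no donor genes supplied")
    cutoffs = cri_cutoffs(resident_cri)
    classes = [classify(x, cutoffs) for x in donor_cri_in_recipient]
    return sum(c != "low" for c in classes) / len(classes)

# Toy scores (invented for illustration, not CRI values from any genome).
toy_resident = [float(i) for i in range(1, 101)]
toy_donor = [5.0, 20.0, 50.0, 95.0]
toy_fraction = compatible_fraction(toy_donor, toy_resident)
```

    The donor scores must be computed against the recipient's codon abundances, not the donor's own, since the question is how the recipient would perceive the incoming genes.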

    Poor Codon Usage Represents a Barrier for Horizontal Gene Transfer

    Our results imply that foreign genes arriving with rich or typical CU face selection mainly at the functional level, whereas most genes entering the cell with poor CU would most likely be lost in the same way as pseudogenes are eroded, since their functions might not be fully available to assess any functional advantage due to poor translation. Thus, even though the great majority of genes may actually arrive with poor CU, as illustrated by the HGT potential (see table 1), the cell filters them out. Although it is not clear that barriers, beyond the restriction-modification systems, have evolved to prevent lateral gene exchange among prokaryotes (Gogarten, Doolittle, and Lawrence 2002), poor CU might well represent such a barrier as a side effect of defective translatability. Our interpretation makes biological sense, as integration of foreign DNA into the chromosome is a process subject to efficient surveillance and suppression. Studies of homologous recombination in bacteria show that the frequency of integration of exogenous DNA in the chromosome decreases exponentially as sequence divergence increases (Martin 1999; Denamur et al. 2000; Majewski 2001). In addition, mechanisms preventing illegitimate recombination in prokaryotes and eukaryotes have been reported (Hanada et al. 1997; Wu, Karow, and Hickson 1999). Under the assumption that most HGT events are not expected to be adaptive, but neutral or nearly neutral, the probability of fixation in large populations has been shown to be negligible, most likely leading to the ablation of foreign genes (Berg and Kurland 2002). This is in agreement with our observation that most laterally acquired genes potentially arrive displaying poor CU, but apparently they are strongly selected against due to poor translatability. 
In contrast, those genes arriving with typical CU have more opportunities to be successfully used by the recipient genome, and thus to persist as long as they provide a reasonably high selection coefficient. We propose that typical CU may well represent a safety or tolerance zone for genes to achieve adequate translation rates and expression levels. This notion is supported by our observation that, at least for E. coli K12, genomic codon frequencies (the reference values to calculate CRI) display correlation levels with tRNA concentrations similar to those exhibited by codon abundances in ribosomal proteins (unpublished material). As most highly expressed genes show typical to high CRI, it seems that displaying typical CU is sufficient to attain adequate translation rates.

    Our results are also consistent with the work of Smith and Eyre-Walker (2001) who, based on the fact that highly expressed genes in E. coli possess rare codons, posit that selection toward codon optimization is a relatively weak force. If a foreign gene, successfully integrated into the genome, displays an acceptable codon composition and provides an adequate amount of protein to perform its function satisfactorily (neutral, nearly neutral, or highly adaptive), then there is no need for an additional strong selective pressure to turn its rare codons into abundant codons.

    Codon Usage Amelioration Is Unnecessary

    We need to re-evaluate the notion that a foreign gene or fragment of DNA (assumed to be atypical in its sequence characteristics) becomes compositionally more similar to the host genome with increasing residence time. This process has been called "amelioration," after reasoning that it makes a gene "better" (Lawrence and Ochman 1997), implying better translatability. This concept is a natural consequence of methods assuming that most foreign genes display mainly a poor-CU profile. However, as our results indicate, most foreign genes with poor CU are counter-selected for successful integration, suggesting that CU amelioration might occur in a small fraction of genes. Of course, once integrated, foreign genes will be subjected to the same mutational biases as the rest of the indigenous genes, and if they had any atypical composition in terms other than CU, like GC content or oligonucleotide compositions (which are related to, but are not, CU), the mutational drift in the genome would "dilute" such differences without a dramatic impact on the CU of foreign genes relative to the genomic codon abundances. In fact, it has been observed that some genes may show rich CU but an average GC content (Garcia-Vallvé, Romeu, and Palau 2000; Garcia-Vallvé et al. 2003), and similarly we have observed genes that show deviant GC content but typical CU. Thus, the process is nothing other than the genetic drift that affects all genes within a genome, most probably making the genes fluctuate randomly within the typical CU area or safety zone without seriously affecting levels of expression.

    A substantial number of prophage, transposase, and insertion sequence-related genes display low GC content and poor CU relative to chromosomal genes. However, this is not the case for most genes linked to mobile elements, as more than half of them show typical CRI in the host genome. For example, 60% of the genes in the plasmids of A. tumefaciens C58 and M. loti display typical to high CRI within their respective host genomes. Similarly, we identified 154 genes whose annotated function is directly related to mobile elements in E. coli K12 (e.g., phages, transposons, and plasmids) and found that 96 (62%) genes display typical CRI. This CU compatibility between genes that map to mobile elements and their host genomes is consistent with the previous observation that plasmids tend to show genome signatures similar to those of potential hosts (Campbell, Mrázek, and Karlin 1999) and with the hypothesis that these genome signatures might play an important role in HGT (Karlin 2001).

    Concluding Remarks

    We have shown that, despite the high probability for a foreign gene to display poor CU, most detected HGTs involve genes with typical or rich CU. As there is no reason to assume that evolution today differs from what happened during most of evolutionary history, successful HGTs should have involved ready-to-use genes. Although it is accepted that a substantial fraction of acquired genes might not be sufficiently atypical to be detected by most published methods (Lawrence and Ochman 2002), our results indicate that the great majority of recently acquired genes exhibit typical CU. It might be argued that foreign genes enter a genome with poor CU and then move to the typical CU level by means of "fast amelioration," thus escaping detection by our method. Although more detailed studies are required to fully address this issue, our results provide evidence against this alternative. For example, T. maritima still perceives most archaeal CXGs as typical genes. That is, the CRI of the archaeal genes calculated based on the codon abundances of T. maritima falls mainly within the typical-CRI (safety) zone, suggesting that if the genes could be presently transferred, they would arrive mainly with typical CU. This is also true for about 60% of all archaeal genes, not predicted as exchanged with T. maritima, in genomes such as M. jannaschii, P. abyssi, and A. fulgidus (see Results above), which suggests that some potential archaeal donors currently show a strong CU compatibility with T. maritima, and any gene exchange with them would most likely involve typical-CRI genes.

    The number of genes predicted here as involved in HGT events is quite small, but we are confident that they constitute a representative sample of true xenologs. Such a reduced sample size is not unexpected given the stringent constraints imposed by our phylogenetic validations and given the fact that only those pairs of xenologs still presenting similar DNA sequence characteristics can clarify whether or not the CU of foreign genes is typical at the moment of introgression. Nonetheless, this methodology will benefit from the ever-growing number of sequenced genomes, as more pairs of xenologs will be detected. With a greater number of reliable xenologous genes we will be able to explore in greater detail the quantity and quality of lateral exchanges among genomes and thus to understand the behavior of HGT networks. The strategy presented herein provides a conceptual change in the way CU is used to gain a deeper knowledge of the processes involved in horizontal gene transfer.

    Supplementary Material

    Online Supplementary Material can be found at the journal's Web site (www.mbe.oupjournals.org).

    Acknowledgements

    We thank W. Lamboy, E. Morett, A. Garciarrubio, E. Merino, L. Martínez-Castilla, and two anonymous referees for valuable comments on the manuscript. A.M.-S. acknowledges a Ph.D. fellowship from CONACyT. This work has been supported by grant number 0028 from CONACyT to J.C.-V. We appreciate computer technical support from V. del Moral, E. Díaz, and C. Bonavides.

    References

    Beaber, J. W., B. Hochhut, and M. K. Waldor. 2004. SOS response promotes horizontal dissemination of antibiotic resistance genes. Nature 427:72–74.

    Berg, O. G., and C. G. Kurland. 2002. Evolution of microbial genomes: sequence acquisition and loss. Mol. Biol. Evol. 19:2265–2276.

    Bernardo, J. M., and A. F. M. Smith. 1994. Bayesian theory. John Wiley and Sons, New York.

    Brown, J. R. 2003. Ancient horizontal gene transfer. Nat. Rev. Genet. 4:121–132.

    Campbell, A., J. Mrázek, and S. Karlin. 1999. Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl. Acad. Sci. USA 96:9184–9189.

    Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.

    Denamur, E., G. Lecointre, P. Darlu et al. (12 co-authors). 2000. Evolutionary implications of the frequent horizontal transfer of mismatch repair genes. Cell 103:711–721.

    Ermolaeva, M. D., O. White, and S. L. Salzberg. 2001. Prediction of operons in microbial genomes. Nucleic Acids Res. 29:1216–1221.

    Felsenstein, J. 1989. PHYLIP: phylogeny inference package. Version 3.2. Cladistics 5:164–166.

    Garcia-Vallvé, S., E. Guzman, M. A. Montero, and A. Romeu. 2003. HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res. 31:187–189.

    Garcia-Vallvé, S., A. Romeu, and J. Palau. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10:1719–1725.

    Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19:2226–2238.

    Gray, K. M., and J. R. Garey. 2001. The evolution of bacterial LuxI and LuxR quorum sensing regulators. Microbiology 147:2379–2387.

    Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.

    Hanada, K., T. Ukita, Y. Kohno, K. Saito, J. Kato, and H. Ikeda. 1997. RecQ DNA helicase is a suppressor of illegitimate recombination in Escherichia coli. Proc. Natl. Acad. Sci. USA 94:3860–3865.

    Hooper, S. D., and O. G. Berg. 2003. Duplication is more common among laterally transferred genes than among indigenous genes. Genome Biol. 4:R48.

    Ikemura, T. 1981. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151:389–409.

    ———. 1982. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J. Mol. Biol. 158:573–597.

    Kaneko, T., Y. Nakamura, S. Sato et al. (24 co-authors). 2000. Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti. DNA Res. 7:331–338.

    Karlin, S. 2001. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 9:335–343.

    Karlin, S., J. Mrázek, and A. M. Campbell. 1998. Codon usages in different gene classes of the Escherichia coli genome. Mol. Microbiol. 29:1341–1355.

    Koski, L. B., and G. B. Golding. 2001. The closest Blast hit is often not the nearest neighbor. J. Mol. Evol. 52:540–542.

    Koski, L. B., R. A. Morton, and G. B. Golding. 2001. Codon bias and base composition are poor indicators of horizontally transferred genes. Mol. Biol. Evol. 18:404–412.

    Kroll, J. S., K. E. Wilks, J. L. Farrant, and P. R. Langford. 1998. Natural genetic exchange between Haemophilus and Neisseria: intergeneric transfer of chromosomal genes between major human pathogens. Proc. Natl. Acad. Sci. USA 95:12381–12385.

    Kyrpides, N. C., and G. J. Olsen. 1999. Archaeal and bacterial hyperthermophiles: horizontal gene exchange or common ancestry? Trends Genet. 15:298–299.

    Lawrence, J. G., and H. Ochman. 1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44:383–397.

    ———. 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95:9413–9417.

    ———. 2002. Reconciling the many faces of lateral gene transfer. Trends Microbiol. 10:1–4.

    Majewski, J. 2001. Sexual isolation in bacteria. FEMS Microbiol. Lett. 199:161–169.

    Martin, W. 1999. Mosaic bacterial chromosomes: a challenge en route to a tree of genomes. Bioessays 21:99–104.

    Médigue, C., T. Rouxel, P. Vigier, A. Henaut, and A. Danchin. 1991. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222:851–856.

    Moreno-Hagelsieb, G., and J. Collado-Vides. 2002a. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18(Suppl 1):S329–S336.

    ———. 2002b. Operon conservation from the point of view of Escherichia coli, and inference of functional interdependence of gene products from genome context. In Silico Biol. 2:87–95.

    Moszer, I., E. P. Rocha, and A. Danchin. 1999. Codon usage and lateral gene transfer in Bacillus subtilis. Curr. Opin. Microbiol. 2:524–528.

    Needleman, S. B., and C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443–453.

    Nelson, K. E., R. A. Clayton, S. R. Gill et al. (25 co-authors). 1999. Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 399:323–329.

    Nesbo, C. L., S. L'Haridon, K. O. Stetter, and W. F. Doolittle. 2001. Phylogenetic analyses of two "archaeal" genes in Thermotoga maritima reveal multiple transfers between archaea and bacteria. Mol. Biol. Evol. 18:362–375.

    Ochman, H., and U. Bergthorsson. 1998. Rates and patterns of chromosome evolution in enteric bacteria. Curr. Opin. Microbiol. 1:580–583.

    Olson, S. A. 2002. EMBOSS opens up sequence analysis. European molecular biology open software suite. Brief Bioinform. 3:87–91.

    Ragan, M. A. 2001a. Detection of lateral gene transfer among microbial genomes. Curr. Opin. Genet. Dev. 11:620–626.

    ———. 2001b. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett. 201:187–191.

    ———. 2002. Reconciling the many faces of lateral gene transfer. Trends Microbiol. 10:4.

    Ramakrishnan, V., and S. W. White. 1998. Ribosomal protein structures: insights into the architecture, machinery and evolution of the ribosome. Trends Biochem. Sci. 23:208–212.

    Schaffer, A. A., L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, and S. F. Altschul. 2001. Improving the accuracy of PSI-Blast protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29:2994–3005.

    Scott, K. P. 2002. The role of conjugative transposons in spreading antibiotic resistance between bacteria that inhabit the gastrointestinal tract. Cell Mol. Life Sci. 59:2071–2082.

    Sharp, P. M., and W. H. Li. 1987. The codon Adaptation Index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15:1281–1295.

    Smith, N. G., and A. Eyre-Walker. 2001. Why are translationally sub-optimal synonymous codons used in Escherichia coli? J. Mol. Evol. 53:225–236.

    Syvanen, M. 1994. Horizontal gene transfer: evidence and possible consequences. Annu. Rev. Genet. 28:237–261.

    Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876–4882.

    Wang, B. 2001. Limitations of compositional approach to identifying horizontally transferred genes. J. Mol. Evol. 53:244–250.

    Wheeler, D. L., C. Chappey, A. E. Lash, D. D. Leipe, T. L. Madden, G. D. Schuler, T. A. Tatusova, and B. A. Rapp. 2000. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 28:10–14.

    Wu, L., J. K. Karow, and I. D. Hickson. 1999. Genetic recombination: helicases and topoisomerases link up. Curr. Biol. 9:R518–R520.

    Zeigler, D. R. 2003. Gene sequences useful for predicting relatedness of whole genomes in bacteria. Int. J. Syst. Evol. Microbiol. 53:1893–1900.

    (Arturo Medrano-Soto*, Gab)