当前位置: 首页 > 医学版 > 期刊论文 > 基础医学 > 分子生物学进展 > 2004年 > 第4期 > 正文
编号:11259368
Deriving the Genomic Tree of Life in the Presence of Horizontal Gene Transfer: Conditioned Reconstruction
     * Molecular Biology Institute

    Molecular, Cell, and Developmental Biology

    Human Genetics

    Astrobiology Institute

    || Institute of Geophysics and Planetary Physics, University of California, Los Angeles

    E-mail: lake@mbi.ucla.edu.

    Abstract

    The horizontal gene transfer (HGT) being inferred within prokaryotic genomes appears to be sufficiently massive that many scientists think it may have effectively obscured much of the history of life recorded in DNA. Here, we demonstrate that the tree of life can be reconstructed even in the presence of extensive HGT, provided the processes of genome evolution are properly modeled. We show that the dynamic deletions and insertions of genes that occur during genome evolution, including those introduced by HGT, may be modeled using techniques similar to those used to model nucleotide substitutions that occur during sequence evolution. In particular, we show that appropriately designed general Markov models are reasonable tools for reconstructing genome evolution. These studies indicate that, provided genomes contain sufficiently many genes and that the Markov assumptions are met, it is possible to reconstruct the tree of life. We also consider the fusion of genomes, a process not encountered in gene sequence evolution, and derive a method for the identification and reconstruction of genome fusion events. Genomic reconstructions of a well-defined classical four-genome problem, the root of the multicellular animals, show that the method, when used in conjunction with paralinear/logdet distances, performs remarkably well and is relatively unaffected by the recently discovered big genome artifact.

    Key Words: Horizontal gene transfer ? tree of life ? conditioned reconstruction ? conditioning genomes ? paralinear/logdet distances ? genome fusions

    Introduction

    The massive horizontal gene transfer (HGT), which has recently become so apparent (Karlin, Mrazek, and Campbell 1997; Lawrence 1998; Rivera et al. 1998; Doolittle 1999a; Campbell 2000; Ochman and Jones 2000; Ochman 2001; Gogarten, Doolittle, and Lawrence 2002), is challenging our views of genome evolution. This genetic cross-talk is thought to have the potential to erase much of the history of life that has been recorded in DNA. Indeed, some, perhaps most, scientists think that HGT may have effectively erased the phylogenetic signal of the tree of life from genomes.

    Presences and absences of genes and gene products have been used for more than 2 decades to support parsimonious arguments about the tree of life (Dickerson 1980; Woese 1981; Lake et al. 1982; Charlebois et al. 2000). With the availability of complete genomes, pioneering studies have developed methods for whole-genome analyses. These methods include both distance-based gene content phylogenetic reconstruction methods (Snel, Bork, and Huynen 1999; Tekaia, Lazcano, and Dujon 1999; Montague and Hutchison 2000) and parsimony-based gene family reconstruction methods (Fitz-Gibbon and House 1999). Distance-based methods can, in principle, be significantly influenced by HGT (Eisen 2000; House and Fitz-Gibbon 2002), and gene family reconstructions can be sensitive to HGT through recent gene innovations (House and Fitz-Gibbon 2002), so it was thought that HGT could significantly affect trees reconstructed using any method of genome composition analysis.

    Doolittle (1999b), for example, reviewing the state of "Phylogenetic Classification and the Universal Tree" asks, "What if phylogenetic classification is just let go?" It is obviously a bold and appealing question, designed to make the field reflect on what is being learned about HGT. Doolittle points out some of the specific challenges to classification that HGT presents, "If, however, different genes give different trees, and there is no fair way to suppress this disagreement, then a species (or phylum) can ‘belong’ to many genera (or kingdoms) at the same time: There really can be no universal phylogenetic tree of organisms based on such a reduction to genes." In this paper, we demonstrate that phylogenetic trees can be reliably reconstructed in the presence of HGT. We show that the dynamic deletions and insertions of genes that occur during genome evolution, including those introduced by HGT, may be modeled using techniques similar to those used to model nucleotide substitutions that occur during sequence evolution. In particular, we show that general Markov models, appropriately modified and interpreted, are reasonable tools for modeling genome evolution. Provided that the genomes contain sufficiently many genes and that the Markov assumptions apply, it is possible to reconstruct the tree of life. We also consider the merger of genomes, a process not encountered in gene sequence evolution, and explicitly deduce how to test rigorously potential genome fusion events, such as the origin of the nucleus. This new method is called conditioned reconstruction.

    Theory

    Gene and Genome Alignments

    To understand conditioned reconstruction, it is important to appreciate the similarities between gene sequences and genome sequences. Aligned gene sequences consist of a series of nucleotides and gaps, A, C, G, T, and -, that describe the character state at each position in the sequence. In practice, only the four character states, A, C, G, and T, are analyzed, and gaps are analyzed separately (Rivera and Lake 1992; Gupta 1998). However, if gaps are sufficiently frequent, one can also analyze all five character states by using the conditioning methods described in this paper.

    To understand genome sequences, imagine a gene sequence in which there are only two types of nucleotides. This two-character-state gene sequence would be analogous to a genomic sequence. In genomic sequences, two character states, P and A, are present at each position, depending upon whether the corresponding gene ortholog is present or absent, respectively. There are a few differences between both types of sequences. One difference is that the order of nucleotides in a gene alignment is normally that observed when the genes are sequenced, whereas in genome alignments, the order is arbitrary because gene positions are not highly conserved in genome evolution. Thus, homologous nucleotides can be inferred in sequence alignments by shifting gaps to maximize alignment scores, whereas orthologous genes must be inferred in genome alignments by global comparisons of multiple genomes (see Materials and Methods).

    Genome Sequences and HGT

    The analysis of phylogenetic trees, whether they are from genes or genomes, proceeds through the analyses of alignments. Two hypothetical sequence alignments, both 10 long, are shown in figure 1. A gene alignment is on the left and a genome alignment is on the right. As is normally the case, the common ancestral sequence (not shown) would not be available, so only the observed sequences are shown. Character state changes inferred for both alignments are shown as bars connecting the two sequences. For the gene sequence alignment, the character state changes, or nucleotide substitutions, are C to T, T to A, and A to G, from left to right, respectively. During genome evolution, the character state of any gene ortholog can only alternate between being present or absent. For this example the alternations are A to P, P to A, and P to A, from left to right, respectively.

    FIG. 1. A comparison of a gene alignment (left) and a genome alignment (right). In the gene sequence alignment, four character states, A, C, G, and T, change, whereas only two character states, A (gene absent) and P (gene present) change in the genome alignment. In both alignments, the probabilities of character-state change can be analyzed by Markov models

    Markov analyses using closed-form solutions (Lake 1994; Lockhart et al. 1994; Lake 1997) have proved very useful for studies of gene evolution. These methods, which describe the probabilities of change between character states, do not depend upon an explicit knowledge of the physical/biological mechanisms responsible for the changes. Thus, it does not matter whether a nucleotide is damaged by ionizing radiation, alkylating agents, mutations affecting exonuclease proofreading, intercalating agents, UV irradiation, or other causes, because Markov methods infer the net nucleotide changes produced from all sources. In other words, any particular gene sequence position records only the nucleotide that is currently present. Once a nucleotide substitution has occurred, the nucleotide that occupies that position has the same chemical structure, whether it was caused by ionizing radiation or by any other mechanism. Hence, that sequence position contains no direct evidence about the mechanism that caused the change. The most one can infer from Markov comparisons are the probabilities of nucleotide change between character states.

    Genomic sequences are affected by different physical/biological mechanisms than those that affect gene sequences. Various mechanisms can insert or delete genes in prokaryotes. Genes can be created by gene duplication and subsequently modified. Genes can also be inactivated by mutation and subsequently deleted. New DNA can be inserted by conjugation, transformation, or transduction followed by homologous recombination and chromosomal integration. Depending on the genes present in the inserted DNA, new genes can be introduced, old genes can be deleted, or both. However, just as was the case for gene sequences, the Markov matrices summarizing the evolution of genomic sequences will contain no direct information about the mechanism that caused gene changes.

    To understand why HGT confounds the phylogenetic reconstruction of trees from gene sequences but does not affect the reconstruction of trees from genomic sequences, consider a specific example. Imagine that a gene for ribosomal protein S8 is horizontally transferred from a Bacillus to a Methanobacterium, thereby replacing the original methanobacterial gene. Now consider how this horizontal transfer will affect the gene tree. Because the gene found in the Methanobacterium came from the Bacillus, the Bacillus leaves and Methanobacterium leaves will branch from a common node on the reconstructed gene tree. Next, consider how this horizontal transfer will affect the genome tree. For the reasons just discussed, the S8 position in the genomic alignment only contains information about the probabilities of gene insertion or deletion; it cannot contain any information about the mechanism of transfer. Furthermore, all other genome positions may have different histories, although these cannot be ascertained, either. However, the genome alignment will contain information about the probabilities of gene loss and gain that can be used by Markov methods to reconstruct the evolutionary history of each genome. Because nothing is coded in the genomic alignment about the horizontal transfer histories of any gene set, any method that can properly model genomic evolution will be invariant to the confounding effects of HGT. Furthermore, the more slowly evolving characters of gene presence and absence are likely to be more useful for discerning deep phylogenetic signals than the more numerous, but more rapidly evolving, nucleotide characters.

    Reconstructing Trees from Genomic Alignments

    There are important similarities between gene alignments and genomic alignments. Both can be analyzed, without significant loss of generality, by Markov models. Gene alignments are typically analyzed using four character states, A, C, G, and T. Our genome alignments use two states, A and P. For both, Markov matrices can describe the probabilities of change between character states.

    There are also important differences between the two processes. A significant difference is that character states are more difficult to define for genome analyses than for gene analyses because gaps are used as a character state for genome sequences. For gene sequences, the definitions of the four nucleotide states, A, C, G, and T, are self evident, whereas for genomes, the definitions of the two states, A and P, while simple, are not immediately obvious. Yet Markov-based methods such as paralinear/logdet distances (Lake 1994; Lockhart et al. 1994), require that the A and P states be properly defined.

    To appreciate the importance of, and the difficulties in, assigning the A and P character states, consider how one would calculate the distance between two genomes, X and Y, using paralinear distances. The distance between the X and Y genomes is solely a function of the frequencies of the four possible character state patterns (i.e., PP, PA, AP, and AA). If an othologous gene is present in both X and Y, then the pattern for that ortholog is PP. Likewise if the X ortholog is present and the Y ortholog is absent, then the pattern is PA, and if the X ortholog is absent and the Y ortholog is present, the pattern is AP. However, no AA patterns are present in the alignment between X and Y, because the analysis is over the set of all genes present in X and Y; that is, the union of the set of genes present in X and the set present in Y. Hence the set of genes with the AA pattern is the empty set, and the frequency of the AA pattern is zero. Because the distance between X and Y is a complex number when the AA frequency is zero, it is clear that this definition of the AA frequency is inadequate, and a better solution is needed.

    It is not immediately obvious how one can experimentally determine the frequency of the AA pattern. How does one define the set of genes that is missing from genomes X and Y? In other words, which genes missing from genomes X and Y should be counted as AAs? Does one count all genes present in any sequenced genome but missing in genomes X and Y as absent? Does one count every gene currently present in specific gene data banks but missing in genomes X and Y as absent, or does one count every gene that ever existed but is missing in genomes X and Y as absent? Perhaps one could even estimate every gene that has ever existed on Earth and use these to estimate the number of AAs. If life is discovered on other planets, do we include these genes, too? If any of these solutions are chosen, then each discovery of a new gene will change the measured distance between genomes X and Y. Alternatives of this type are clearly unsatisfactory. It is important that when one calculates a tree, all of the data necessary to substantiate it should be presently available. It is not realistic to present a tree based on data that may not be available until the next decade or later. Furthermore, the method needs to be based on a formal mathematical justification.

    The Conditioning Genome

    Our solution to the probem of absent data is to define absences by using conditional probabilities for the joint probability distributions needed for Markov analyses. In other words, one uses the set of all gene orthologs present in a particular genome, or possibly several genomes, to define the universe of genes for the analysis. This genome is called the conditioning genome. Because genomic analyses are restricted to just those genes that are present in the conditioning genome, the resulting distances are uniquely defined and will not be affected by any new gene discoveries. Note that the use of a conditioning genome affects the definitions of all four of the joint probabilities states considered in our two-genome example (i.e., it affects the PP, PA, AP, and AA probabilities). For a gene to be coded present, it must be present in the conditioning genome and in the genome being coded. For a gene to be coded absent, it must be present in the conditioning genome and absent in the genome being coded. Even if a gene is present in genome X and absent in genome Y, but not present in the conditioning genome, it will not be included in the count of PA patterns. Similar reasoning applies to the PP, AP, and AA states. Thus, conditioning does not simply affect the definition of the AA state, as was initially assumed, but the definitions of all patterns are affected by using a conditioning genome. By uniquely defining the universe of genes to be analyzed, this definition provides self-consistency and makes the definition of distances independent of any genes to be discovered in the future. The mathematical approach of conditioning on a separate set of genes is similar to one used previously, with nucleotide character states, to derive the equations for Markov triple analysis (Lake 1997).

    Genome Bands and Gene Fibers

    Consider the evolution of three genomes, X, Y, and Z, conditioned using genome C. If one could follow the evolution of three genomes through time and observe the deletions and insertions of genes as they occur, then one might observe a band of multiple fibers that duplicated as it flowed through time, similar to that shown in figure 2. Each fiber in the band represents a gene ortholog that is present in the conditioning genome positioned at the base of the tree in figure 2. However, the conditioning genome may be located anywhere on the tree because the choice of a root affects neither paralinear nor triple Markov methods (Lake 1997). In general, bands will be much wider than can be shown in figure 2 because genomes contain thousands of genes.

    FIG. 2. The evolution of genomes is represented by genome bands containing multiple gene fibers. Each fiber corresponds to a gene present in the conditioning genome. Gene losses during the course of evolution are indicated by a change of color from gray (P) to white (A), and gene additions are represented by a change from white (A) to gray (P). Even though genomes X, Y, and Z may contain genes not found in the conditioning genome, only genes present in the conditioning genome are shown, because genes absent in the conditioning genomes are not analyzed. The order of genes corresponds to that found in the conditioning genome and not in the other genomes, because extensive synteny is generally not observed

    Bands represent the continuing flow of genes into and out of genomes as the tree of life evolves. When genes are present in a fiber, they are colored gray, and when they are absent, they are colored white. Note that at the location of the conditioning genome C, all the fibers are gray because each one represents a gene present in the conditioning genome. Also note that gene deletions and gene duplications do not affect the width of the bands, because the set of genes and, hence, the number of fibers, is solely determined by the conditioning genome. If a gene that was present in the conditioning genome is duplicated in genome X, this does not affect the number of fibers in the band, because only one copy of an ortholog contributes to genome X. If a gene is deleted, then the character state of that fiber goes from P (gray) to A (white). If a new gene is horizontally transferred into a genome and corresponds to a gene that is not present in the conditioning genome, then it will not affect the coloring of the band, because there will be no fiber corresponding to this gene. If the new gene is orthologous to a gene present in the conditioning genome and that gene is lacking in the current organism, then the coloring of the orthologous gene will change from white to gray, A to P. If the new gene corresponds to an orthologous gene that is already present in the current genome, then the coloring of the fiber will not change. Gene duplications do not change the character state of the fiber, because gene duplications must start from an existing gene, implying that its initial character state must have already been P, so the second gene does not alter its state. If a second copy of a gene is acquired by HGT, that copy, likewise, will not affect the coloring. Finally, as previously discussed, the source of an acquired gene is not coded and, hence, is irrelevant. These basic rules allow one to analyze gene fusions using conditioned reconstruction.

    Genome Fusions

    Genome analyses using the basic properties of genomic bands and fibers can allow one to detect and analyze genome fusions. Consider the fusion of two genomes that is illustrated in figure 3. Immediately before the fusion, genomes W, X, and Y and the conditioning genome C, are evolving as shown by the unrooted tree in figure 3a. Only those genes present in the conditioning genome will be analyzed, but because the focus of this discussion will be on genomes W, X, and Y, genome C will not be shown in subsequent panels of the figure.

    FIG. 3. The fusion of two genomes produces a cycle graph. In the left side of (a), three genomes, W, X, and Y, and the conditioning genome, C, are represented as an unrooted tree just before descendants of W and Y fuse. In the right side of (a), the immediate ancestors of W and Y, represented by dots, have divided and produced W, Y, and their progeny, which fuse to produce the new organism Z. The probabilities of the genome alignment patterns found in the three-genome (W, X, Y) tree and in the four-genome (W, X, Y, Z) graph are shown at the left side of (b). The two four-taxon trees that are simultaneously supported by the four genome patterns are shown on the right side of (b). (c) The two simultaneously supported four-taxon trees are consistent with a repeating arrangement of genomes W Z Y X W Z Y X, which corresponds to the cycle graph shown at the bottom of the figure

    Consider now the events just before the fusion between genomes W and Y illustrated on the left side of figure 3a. Immediately before the fusion, one has a three-genome problem involving two character states, so the eight possible patterns of presences and absences shown must be considered. Let the underlying probability of each pattern be represented by the eight letters a to h, so that their sum equals 1.0. All three taxon patterns and their corresponding probabilities are listed in table 1. Under the assumption that the sites (genes) are independent and identically distributed (i. i. d), these eight probabilities contain all the information needed to calculate the distances and the Markov matrices corresponding to the three-taxon tree (Lake 1997).

    Table 1 Four-Taxon Joint Probabilities for Fusion Organism Z, Calculated from Three-Taxon, W, X, and Y, Joint Probabilities.

    Consider what happens when the genomes of prokaryotic organisms W and Y divide, and one descendant from each contributes its genome to form a new fusion organism, Z, as shown on the right size of figure 3a. Let the new organism contain the total genomic content of W and Y. For these computations, we assume that both daughter cells of W have identical genomic sequences and, similarly, that both daughter cells of Y have identical genomic sequences. Also, we assume that the genomic content of the fusion organism Z is the sum of the contents of W and Y. In practice, small deviations from these assumptions would not significantly modify our computations.

    The rules for formation of this new genome are then quite simple, so we can calculate the joint probability distributions for this new four-axon (W, X, Y, and Z) problem. Note that every gene present in either genome W or genome Y is present in genome Z. If the same gene ortholog is present in both W and Y, then two copies of that ortholog will be introduced into genome Z, but only one copy of the gene ortholog will be counted, corresponding to the copy present in the conditioning genome. Hence, the second copy of each gene effectively disappears. Using these rules, one can calculate the frequencies of each of the patterns that appear in genome Z (see table 1).

    To illustrate the process, consider a gene that is present in genomes W, X, and Y, just before the fusion event, so that pattern PPP occurs with probability a. Immediately after the fusion event, genome Z will contain two copies of this ortholog, one from genome W and one from genome Y. However, because the conditioning genome has only one copy of the ortholog, only a single copy of the ortholog will be counted as present in Z. Thus, the probability of the new pattern obtained when genome Z is also included, PPPP, will be a. Similarly, all remaining ortholog patterns present in the four-genome set, WXYZ, may be calculated. Note, as a check on the calculations, the probabilities of the eight patterns present in the new four-genome set still sum to 1.0, and all other patterns have zero probability.

    Under the assumption that the genomic content of Z is the sum of the contents of W and Y, the probability of multiple losses or gains of any particular gene is zero. Hence, parsimony will not be biased by unequal rate effects when it is used immediately after the fusion event. Parsimony has the particular advantage that the results may be easily interpreted. For four taxa (genomes), it can distinguish among three unrooted trees. Pattern AAPP supports the tree in which genomes W and X are paired on one side of the central branch, and genomes Y and Z are paired on the other side of the central branch; this is tree (WX)(YZ). Pattern APAP supports tree (WY)(XZ), pattern APPA supports tree (WZ)(XY), and all other patterns are unrelated to tree topology. Note that pattern APAP does not appear (i.e., it has zero probability). Pattern PAAP appears with probability e, and pattern AAPP appears with probability g (fig. 3b). In other words, two trees are simultaneously supported.

    Simultaneous support for two quartets can be caused by short sequences that cannot resolve the branching order or by unequal rate effects. However, under our model, (1) the sequences are effectively infinitely long, so the branching order cannot be caused by insufficient branching-order resolution, and (2) unequal rate effects are nonexistent. Hence, the results are not explained by reconstruction artifacts. However, the results are precisely what one would predict from the graph illustrating the fusion of the two genomes shown at the right of figure 3a.

    To derive the graph, which is supported from the joint probabilities of the WXYZ patterns, consider the following four-taxon parsimony analysis. The PAAP pattern corresponds to the four-taxon topology shown on the upper right of figure 3b, and the AAPP pattern corresponds to the four-taxon topology shown on the lower right of figure 3b. The four-taxon tree shown at the upper right occurs with probability e, and that shown immediately below occurs with probability g. Together, the two patterns form a cycle graph, as shown in figure 3c. When the two trees are alternately placed below each other and successively shifted one taxon to the right, they form a repeating pattern. This pattern is the same as would be formed if the cycle graph at the bottom of figure 3c were continuously unrolled. In this theoretical example, only one generation separates genomes W and Y from the fusion genome Z, so the leaves leading to W, Z, and Y would be very short. In practice, however, even the closest available genome is likely to be quite distant from the fusion genome, as illustrated in figure 3c, because at present, only an infinitesimally small fraction of all genomes have been sequenced.

    Materials and Methods

    Gene Alignments

    The genome alignments used in this paper are available in the Supplementary Material online. Phylogenetic trees were calculated for parsimony, Jukes-Cantor distances, and paralinear/logdet distances using Bootstrappers Gambit (Lake 1995; Strimmer and von Haeseler 1996). Graphs can be calculated using modified quartet methods (see e.g., Lake [1995] and Strimmer and von Haeseler [1996]).

    The ortholog sets are based on Blast analyses, using bit scores, and satisfy the following criteria (Rivera and Lake, in preparation). Each set is a globally optimal, strongly connected digraph, and each gene is contained in one, and only one, set. Orthologs are explicitly distinguished from paralogs and recent duplications.

    Globally optimal ortholog searches are a generalization of reciprocal blast scores. Reciprocal Blast scores and directed Blast scores are defined as follows. Given gene x in genome A, and gene y in genome B, the directed Blast score between x and y, D(x, y), is the score relating gene x and gene y, when gene x is searched against genome B. The reciprocal Blast score between genes x and y, R(x, y), is the sum of the directed scores obtained when gene x is searched against genome B and when gene y is searched against genome A (i.e., R[x, y] = R[y, x] = D[x,y] + D[y, x]). The best global set of genes is defined as the set, containing a starting gene from a given genome and at most one gene from each of the remaining genomes, that maximizes the sum of reciprocal Blast scores between all members of the set. For this paper, the "best global sets" have been calculated three deep to determine these sets within a reasonable amount of time. Consider a set of N genomes, A, B, and so on. Using gene x in genome A as the starting gene, the three genes in each of the other genomes that have the largest directed Blast scores are calculated. From this set of starting genes, one calculates the sum of the reciprocal Blast scores between the N taxa. There are N (N-1) terms in this sum. These are calculated for all possible combinations, 3(N-1), of the starting genes, and the set of genes corresponding to the largest sum is selected. The process is repeated using every gene within every genome as the starting gene. This provides a collection of sets of putative orthologous genes ranked by global reciprocal Blast scores, from which ortholog sets are subsequently selected.

    Gene paralogs are detected through computations based on the properties of digraphs. Digraphs are constructed with nodes corresponding to each gene in a set of putative orthologs and with directed edges corresponding to whether a nonzero directed Blast score exists between gene x and gene y. Based on the observation that paralogs are frequently unilaterally connected, or even weakly connected, to other nodes in a Blast-based digraph, we systematically analyze each putative ortholog set to determine all nonempty, strongly connected subsets of the putative ortholog set, including sets containing only a single gene. Each putative ortholog set is then replaced by all of its strongly connected, nonempty, ortholog subsets to form a modified collection of putative ortholog sets. Each set in this modified collection is then ranked by the sum of its directed Blast scores and used as described in the next paragraph.

    The final step chooses the highest scoring ortholog sets, ensures that every gene occurs in only one ortholog set, and simultaneously detects recent gene duplications. Recent duplications have been separated from each other by less time than they have been separated from their orthologs, so Blast scores between recently duplicated genes, on average, are larger than those between orthologous genes. Potential recent duplications are calculated by Blast analyses of each gene against every other gene in its genome. A list is then made of the n top scoring candidates for recent duplications of each gene. These candidates are ranked from highest to lowest, and recent duplications are substituted for each gene within a putative ortholog set to detect recent orthologs. The search itself starts with the highest scoring putative ortholog set. The first group is automatically accepted, and all members of it are added to a continuously updated list of accepted orthologs. If any recent duplications are found, they are added to a list of recent duplications. Each successive group is accepted only if no members of it are contained in either the list of accepted orthologs or the list of recent duplications. If a group is accepted, it is also searched for recent duplications, and if any are found, they are added to the list of accepted duplications. The process is iterated until all putative ortholog sets have been examined. Finally, separate sets, containing single genes, are created for all genes found in the list of recent duplications.

    Computation of Compositional Biases

    In the absence of phylogenetic signal, gene patterns among taxa are not correlated, so the probabilities of the various parsimony-informative patterns are simply those that occur randomly. Even when the phylogenetic signal is significant, the relative ratio of the phylogenetic to the random signal determines which tree is favored. Hence, by estimating the compositional bias favoring each tree, assuming no phylogenetic signal, and comparing this result to the observed support for each tree, we can infer the relative contributions of signal and compositional bias.

    When conditioned on the Homo sapiens genome, the percentage of positions corresponding to gene absences are 90.65%, 84.41%, 81.03%, and 79.92% of the Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thalina genomes, respectively. The percentages of presences are 9.35%, 15.59%, 18.97%, and 20.08%, respectively (see Results). In the absence of phylogenetic information, the probability of observing a parsimony pattern supporting the big genome attraction, BGA, tree (fig. 4b) (i.e., either the AAPP pattern or the PPAA pattern) is Prob.(AAPP) + Prob.(PPAA) = (0.9065)(0.8441)(0.1897)(0.2008) + (0.0935)(0.1559)(0.8103)(0.7992) = 0.0385441. The probability of observing either of the two parsimony patterns supporting the traditional tree (fig. 4a) is Prob.(APPA) + Prob.(PAAP) = (0.9065)(0.1559)(0.1897)(0.7992) + (0.0935)(0.8441)(0.8103)(0.2008) = 0.0342673. The probability of observing either of the two patterns supporting the third tree (fig. 4c) is Prob.(APAP) + Prob.(PAPA) = (0.9065)(0.1559)(0.8103)(0.2008) + (0.0935)(0.8441)(0.1897)(0.7992) = 0.0349599. Thus, in the absence of phylogenetic signal, patterns supporting the BGA tree are about 12.5% more likely to occur by chance than those supporting the traditional tree and about 10.3% more likely than those supporting the third tree.

    FIG. 4. The effects of the big genome attraction (BGA) artifact on genomic reconstructions. Some of the experimental difficulties in the conditioned reconstruction of a classical eukaryotic tree are illustrated using several reconstruction algorithms and conditioning genomes. The analyses shown at the left side of the figure have been conditioned using the large human genome (38,051 genes), and those shown at the right column have been conditioned using the small yeast genome (5,000 genes). In the traditional tree, (a) and (d), the plant (Arabidopsis thalina, 25,545 genes) and the yeast (Saccharomyces cerevisiae, 6,218 genes) shown at the left of the tree, are outgroups to the bilateral animals, the worm (Caenorhabditis elegans, 20,535 genes), and the fly (Drosophila melanogaster, 13,381 genes) at the right of the tree. The letters A, S, C, and D refer to the plant, yeast, worm, and fly, respectively. The leaves are drawn with lengths proportional to the numbers of genes contained in each genome, to illustrate BGA

    When conditioned on the Schizosaccharomyces pombe genome, the percentage of positions corresponding to gene absences are 37.89%, 40.50%, 32.56%, and 32.12% of the S. cerevisiae, D. melanogaster, C. elegans, and A. thaliana genomes, respectively (see Results). The percentages of presences are 62.11%, 59.50%, 67.44%, and 67.88%, respectively. In the absence of phylogenetic information, the probability of observing either of the two parsimony patterns supporting the BGA tree (fig. 4e) (i.e., either the AAPP pattern or the PPAA pattern) is Prob.(AAPP) + Prob.(PPAA) = (0.3789)(0.4050)(0.6744)0.6788) + (0.6211)(0.5950)(0.3256)(0.3212) = 0.1088978. The probability of observing either of the two parsimony patterns supporting the traditional tree (fig. 4d) is Prob.(APPA) + Prob.(PAAP) = (0.3789)(0.5950)(0.6744)(0.3212) + (0.6211)(0.4050)(0.3256)(0.6788) = 0.10291994. The prob-ability of observing the patterns supporting the third tree (fig. 4f) is Prob.(APAP) + Prob.(PAPA) = (0.3789)(0.5950)(0.3256)(0.6788) + (0.6211)(0.4050)(0.6744)(0.3212) = 0.10431645. Thus, in the absence of phylogenetic signal, patterns supporting the BGA tree are about 5.8% more likely to occur by chance than those supporting the traditional tree.

    Results

    The theoretical principles for reconstructing trees representing the evolution of genomes have been outlined in the Theory section of this paper, but the practical problems associated with these analyses are considered in this section. The principal practical difficulty in genome analyses is the big genome attraction, or BGA, artifact. In the following section we provide two examples using eukaryotic genomes to illustrate the nature of BGA. These examples suggest that BGA is not related to long-branch attraction, but is similar to the nucleotide composition bias that is found in sequence analyses.

    Our preliminary phylogenetic analyses using parsimony showed that tree reconstruction is strongly biased by genome size, so large genomes can appear to be more closely related to other large genomes, and small genomes appear to be more closely related to other small genomes, no matter what the actual tree topology is. A similar effect has been noted for small genomes using the homology method of gene reconstruction. It is thought to result from "...substantial gene loss from a specific genome" (House and Fitz-Gibbon 2002). As we derived methods to more accurately model genome evolution, we noted that the BGA artifact was related to the choice of the conditioning genomes and affected some much more seriously than others.

    To study these phenomena further, we analyzed the effects of BGA and the size of the conditioning genome, using a traditional four-taxon test of an extremely well documented topology. In the traditional tree shown at the top of figure 4a and d, the plant (A. thaliana, 25,545 genes) and the yeast (S. cerevisiae, 6,218 genes) on the left side of the tree are considered to be outgroups to the bilateral animals, the worm (C. elegans, 20,535 genes), and the fly (D. melanogaster, 13,381 genes) on the right side of the tree. The letters A, S, C, and D refer to the plant, yeast, worm, and fly, respectively. In addition, the human (H. sapiens, 38,051 genes) and a yeast (S. pombe, 5,000 genes) genomes were included in the alignments, as noted at the top of the figure, to serve as large and small conditioning genomes, respectively.

    To schematically illustrate BGA, leaves are drawn with lengths proportional to the numbers of genes contained in each genome. In the traditional tree (fig. 4a and d), the numbers of genes per genome are not correlated with the topology. However, in the BGA tree (fig. 4b and e), the taxa with the largest numbers of genes are adjacent (A and C), and the taxa with the fewest numbers of genes are adjacent (S and D) so that genome size and topology are correlated. The third alternative tree is shown at the bottom (fig. 4c and e). The analyses shown in the left column of figure 4 have been conditioned using the large human genome (38,051 genes), and those shown in the right column have been conditioned using the small yeast genome (5,000 genes).

    For the analyses conditioned on the human genome, the traditional tree is supported by 70.3% of parsimony bootstraps, 77.3% of Jukes-Cantor distance bootstraps, and 100% of paralinear distance bootstraps. The BGA tree is supported by 29.7%, 22.7%, and 0.0% of the parsimony, Jukes-Cantor, and paralinear distance bootstraps, respectively. When these data are conditioned on the human genome, paralinear distance performs better than the other algorithms as measured by its support for the traditional tree.

    When the genomic alignments are conditioned using the much smaller S. pombe genome, all algorithms perform well. The traditional tree is supported by 97.3% of the parsimony bootstraps, 96.3% of the Jukes-Cantor distance bootstraps, and 99.4% of the paralinear distance bootstraps. The BGA tree is supported by only 2.7%, 3.7%, and 0.5% of the parsimony, Jukes-Cantor, and paralinear distance bootstraps, respectively. The third alternative tree is supported only by 0.1% of the paralinear distance bootstraps.

    To investigate the cause of BGA biases, we calculated branch lengths and gene compositions because long-branch attraction and composition biases are both known to cause unequal rate effects. Calculations indicated that long-branch attraction was not the cause of BGA for this example. Specifically, when conditioned on the H. sapiens genome, the paralinear distance lengths of the C. elegans and D. melanogaster leaves are similar to each other (0.1324 ± 0.0036 and 0.1354 ± 0.0034 gene alternations, respectively), and the A. thaliana and S. cerevisiae leaves are also similar (0.1984 ± 0.0036 and 0.1826 ± 0.0042 gene alternations, respectively), so if long-branch attraction were significant, it would preferentially support the traditional tree, and not the BGA tree. For the traditional tree conditioned on the H. sapiens genome, the central branch was 0.0220 ± 0.0036 gene alternations long. Similar values were found for the traditional tree conditioned on the S. pombe genome. Because both leaves attached to the left node and both leaves attached to the right node of the reference tree are similar lengths, we conclude that the observed big genome artifacts cannot be explained by long-branch attraction.

    Compositional biases were next investigated. These analyses relied upon the observation that one can calculate the probability of the informative parsimony patterns, in the absence of phylogenetic signal (see Materials and Methods for details). When the alignment was conditioned on either the H. sapiens or the S. pombe genome, it was found that the calculated compositional biases explained the parsimony bootstrap results quite well. Specifically, parsimony patterns favoring the BGA tree were 12.5% more likely to occur randomly than were those supporting the traditional tree and 10.3% more likely than the third alternative when H. sapiens was used to condition the alignment. When S. pombe was used to condition the alignment, parsimony patterns supporting the BGA tree were 5.8% and 4.4% more likely than those calculated for the traditional and third trees, respectively. Thus the better bootstrap performance of parsimony observed when S. pombe is used to condition the alignment appears to result from the fact that compositional biases are less for this conditioning. Alignment artifacts and site-to-site variation artifacts are not likely to be significant causes of bias in the paralinear distance analyses, because the traditional tree is so strongly supported in the absence of corrections for these artifacts.

    When conditioned by the large genome, the parsimony and Jukes-Cantor algorithms all showed moderate to considerable BGA. In contrast, under similar conditions, the paralinear distance algorithm did not. When conditioned by the small genome, all four algorithms perform quite well.

    These results provide information that can help us better understand the trade-offs involved in selecting conditioning genomes and, thereby, improve our genomic reconstructions. Clearly, it is important to condition alignments. Although parsimony can be used without conditioning, it is very sensitive to BGA, even when conditioned. Proper conditioning, however, can reduce the effects of BGA on parsimony and on Jukes-Cantor distances. In our example, the false-positive rate with parsimony was reduced from 29.7% to 2.7%, and with Jukes-Cantor distances, it was reduced from 22.7% to 3.7%, nearly an order of magnitude improvement, simply by using a different conditioning genome. Our calculations (not shown) indicate that the false-positive rate can be reduced even further, provided that the optimal conditioning genome contains approximately the same number of genes as the genome being analyzed. For this example, the conditioning genome would be intermediate in size between those of S. pombe and H. sapiens (about 15,000 genes).

    The results also show that if the genomes used for conditioning are too small, the statistical significance of the analyses will be reduced. The effect could not be experimentally measured for the parsimony and Jukes-Cantor methods, most likely because it was masked by BGA effects. In the case of paralinear distances, for which the BGA artifact is relatively small, decreasing the size of the conditioning genome also decreases the effective sequence length and increases the false-positive rate slightly, from 0.0% to about 0.5%. This suggests that the increase is cause by the short length of the conditioning sequence. The fact that paralinear distances performed well for data conditioned on both large and small genomes suggests that paralinear distances may be a particularly useful algorithm for genome analysis.

    Discussion

    This study has detailed the theory that underlies conditioned reconstruction and has shown the theoretical promise of the method, as well as some of its practical pitfalls. One of the most important practical problems is that of BGA.

    The examples and computations presented in this paper indicate that BGA is analogous to the nucleotide bias effects in gene sequence analysis. Just as sequences with high GC compositions are attracted in sequence analyses, small genomes that are high in absences are attracted to other small genomes, and large genomes that are high in presences are attracted to other large genomes. BGA does not seem to be a counterpart of long-branch attraction, although one could anticipate a genomic equivalent to this would occur, if gene deletions and additions are found to be much more prevalent in some organisms than in others. BGA will continue to be a major artifact affecting genome analyses, suggesting that studies should be restricted to genomes of similar sizes. Although paralinear distances are affected far less by BGA than parsimony and Jukes-Cantor distances, they can be affected when size differences are great and the phylogenetic signal is small.

    In principle, conditioned genomic reconstructions solve the complications introduced by HGT. In practice, conditioned reconstructions provide a potentially powerful tool to help reconstruct the tree of life, but their usefulness ultimately depends upon how well the fundamental assumptions on which they are based, the Markov and the i.i.d. assumptions, fit the process of genome evolution. First-order Markov process analyses assume that the process is time invariant, and that only the state of the most recent ancestral node is significant. Because the paralinear distance algorithm is based on "infinitesimal steps," corresponding to individual generations, time dependence is not likely to cause significant violations of the Markov assumptions. However, to apply Markov methods, one frequently assumes that all sites are i.i.d. In sequence analyses, the i.i.d. assumption is rarely met, and this probably applies to genome analyses as well. HGT might affect conditioned reconstructions indirectly by violating the i.i.d. assumption. For example, if an organism optimized for an environment enters a new one where all neighboring organisms contained a large common set of genes, this could potentially increase the probability of horizontal transfer of those genes, relative to other genes. An analogous i.i.d. violation in sequence analysis might occur if an organism optimized for a low-light enters a new environment with increased solar radiation, because this could potentially increase the probability of certain nucleotide substitutions relative to others. This particular HGT example probably results in a minor second-order effect, but i.i.d. violations from other sources might be significant. To reduce this possibility, pattern filtering (Lake 1998), based on the gene order present in the conditioning genome, has been applied to correct for potential i.i.d. violations. Other, perhaps more significant, sources of error include using the observed frequencies of the P and A patterns to estimate their expected frequencies; hence, our emphasis on the importance of genome sequence length. Perhaps the single most important potential source of error is in ortholog identification. For this reason, we think global ortholog searches are an essential part of conditioned reconstruction studies. Much additional work needs to be done on global ortholog searches.

    Conditioned reconstructions provide us with a method of following the evolution of a band of genes as it maps out the tree of life and, thereby, provides a formal theoretical framework for the classification of organisms. Even though individual genes may continuously join and leave genomes, we can now follow the dynamic evolution of genomes. The phylogenetic resolution at which one can study the evolution of prokaryotic and eukaryotic phyla may be limited by experimental factors such as the rates of gene exchange and by the finite number of genes within genomes. Nevertheless, we predict that conditioned reconstructions will allow us to determine the tree, or graph, of life with increased resolution and allow deeper relationships to be discovered.

    Equally important, we now have a tool that is capable of resolving past genomic mergers. This is essential if we are to understand important steps in the evolution of the eukaryotic cell. In theory, conditioned reconstructions can differentiate between events in which genomes are instantaneously fused, on a genomic time scale, and those in which genes are slowly transferred from one genome to another. This ability may be useful when attempting to differentiate between a slow, but continuous, trickle of genes from a mitochondrion or a chloroplast into the nucleus (Baldauf and Palmer 1990; Gray, Burger, and Lang 1999) and a rapid merger of genes, such as those that may have happened during the formation of the eukaryotic genome (Rivera and Lake 1992; Martin and Muller 1998). We anticipate conditioned reconstructions will significantly improve our understanding of life's genomic origins.

    Acknowledgements

    We thank A. Simonson, J. Moore, and R. Jain for frequent and helpful discussions and Margaret Kowalczyk for illustrations. Also we thank Ford Doolittle for showing a slide that stimulated us to better explain conditioning and resulted in figure 2. This work was supported by grants from the National Science Foundation, NASA Astrobiology Program, the Department of Energy, and the National Institutes of Health to J.A.L.

    Literature Cited

    Baldauf, S. L., and J. D. Palmer. 1990. Evolutionary transfer of the chloroplast tufa gene to the nucleus. Nature 344:262-265.

    Campbell, A. M. 2000. Lateral gene transfer in prokaryotes. Theoret. Popul. Biol. 57:71-77.

    Charlebois, R. L., R. K. Singh, and C. C.-Y. Chan-Weiher, et al. (26 co-authors). 2000. Gene content and organization of a 281-kbp contig from the genome of the extremely thermophilic archaeon, Sulfolobus solfataricus P2. Genome 43:116-136.

    Dickerson, R. E. 1980. Structural conservatism in proteins over three billion years: cytochrome with a touch of collagen. Pp. 227–249 in R. Srinivasan, ed. Diffraction and related studies. Pergamon Press, Oxford and New York.

    Doolittle, W. F. 1999a. Lateral genomics. Trends Genet. 15:M5-M8.

    Doolittle, W. F. 1999b. Phylogenetic classification and the universal tree. Science 284:2124-2128.

    Eisen, J. A. 2000. Assessing evolutionary relationships among microbes from whole-genome analysis. Curr. Opin. Microbiol. 3:475-480.

    Fitz-Gibbon, S. T., and C. H. House. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27:4218-4222.

    Gogarten, J. P., W. F. Doolittle, and J. G. Lawrence. 2002. Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol. 19:2226-2238.

    Gray, M. W., G. Burger, and B. F. Lang. 1999. Mitochondrial evolution. Science 283:1476-1481.

    Gupta, R. S. 1998. Protein phylogenies and signature sequences: a reappraisal of evolutionary relationships among archaebacteria, eubacteria, and eukaryotes. Microbiol. Mol. Biol. Rev. 62:1435-1491.

    House, C. H., and S. T. Fitz-Gibbon. 2002. Using homolog groups to create a whole-genomic tree of free-living organisms: an update. J. Mol. Evol. 54:539-547.

    Karlin, S., J. Mrazek, and A. M. Campbell. 1997. Compositional biases of bacterial genomes and evolutionary implications. J. Bacteriol. 179:3899-3913.

    Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences—paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.

    Lake, J. A. 1995. Calculating the probability of multitaxon evolutionary trees—bootstrappers gambit. Proc. Natl. Acad. Sci. USA 92:9662-9666.

    Lake, J. A. 1997. Phylogenetic inference: How much evolutionary history is knowable? Mol. Biol. Evol. 14:213-219.

    Lake, J. A. 1998. Optimally recovering rate variation information from genomes and sequences: pattern filtering. Mol. Biol. Evol. 15:1224-1231.

    Lake, J. A., E. Henderson, M. W. Clark, and A. T. Matheson. 1982. Mapping evolution with ribosome structure—intra-lineage constancy and inter-lineage variation. Proc. Natl. Acad. Sci. USA Biol. Sci. 79:5948-5952.

    Lawrence, J. G. 1998. Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95:9413-9417.

    Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.

    Martin, W., and M. Muller. 1998. The hydrogen hypothesis for the first eukaryote. Nature 392:37-41.

    Montague, M. G., and C. A. Hutchison. 2000. Gene content phylogeny of herpesviruses. Proc. Natl. Acad. Sci. USA 97:5334-5339.

    Ochman, H. 2001. Lateral and oblique gene transfer. Curr. Opin. Genet. Dev. 11:616-619.

    Ochman, H., and I. B. Jones. 2000. Evolutionary dynamics of full genome content in Escherichia coli. EMBO J. 19:6637-6643.

    Rivera, M. C., R. Jain, J. E. Moore, and J. A. Lake. 1998. Genomic evidence for two functionally distinct gene classes. Proc. Natl. Acad. Sci. USA 95:6239-6244.

    Rivera, M. C., and J. A. Lake. 1992. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science 257:74-76.

    Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21:108-110.

    Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.

    Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.

    Woese, C. R. 1981. Archaebacteria. Sci. Am. 244:98-105.(James A. Lake*,,, and Mar)