Origin and Phylogeny of Chloroplasts Revealed by a Simple Correlation Analysis of Complete Genomes
    Department of Biology, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China

    Institute of Theoretical Physics, The Chinese Academy of Sciences, Beijing, China

    Programs in Statistics and Operations Research, Queensland University of Technology, Brisbane, Australia

    Department of Mathematics, Xiangtan University, Hunan, China

    E-mail: kahouchu@cuhk.edu.hk.

    Abstract

    Completely sequenced chloroplast genomes have provided much information on the origin and evolution of this organelle. In this paper we attempt to use these sequences to test a novel approach for phylogenetic analysis of complete genomes based on correlation analysis of compositional vectors. All protein sequences from 21 complete chloroplast genomes are analyzed in comparison with selected archaea, eubacteria, and eukaryotes. The distance-based analysis shows that the chloroplast genomes are most closely related to cyanobacteria, consistent with the endosymbiotic origin of chloroplasts. The chloroplast genomes are separated into two major clades corresponding to chlorophytes (green plants) s.l. and rhodophytes (red algae) s.l. The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution. For instance, the analysis places the chloroplasts of two chromophytes (Guillardia and Odontella) within the rhodophyte lineage, supporting secondary endosymbiosis as the source of these chloroplasts. The relationships among the green algae and land plants in our tree also agree with results from traditional phylogenetic analyses. Thus, this study establishes the value of our simple correlation analysis in elucidating the evolutionary relationships among genomes. It is hoped that this approach will provide insights into comparative genome analysis.

    Key Words: chloroplast • genome • plant • phylogeny

    Introduction

    Chloroplast DNA is a primary source of molecular variations for phylogenetic analysis of photosynthetic eukaryotes. During the past decade, the availability of complete chloroplast genome sequences has provided a wealth of information to study the origin, including primary and secondary endosymbioses (Delwiche 1999; McFadden 2001a) and phylogeny of photosynthetic eukaryotes at the deep levels of evolution. There have been many phylogenetic analyses based on comparison of sequences of multiple protein-coding genes in chloroplast genomes (e.g., Martin et al. 1998, 2002; Turmel, Otis, and Lemieux 1999, 2002; Adachi et al. 2000; Lemieux, Otis, and Turmel 2000; De Las Rivas, Lozano, and Ortiz 2002). Alternative methodologies for phylogenetic analysis of complete genomes have been proposed, for example, based on the rearrangement of gene order (Sankoff et al. 1992), the presence and absence of protein-coding gene families (Fitz-Gibbon and House 1999), gene content and overall similarity (Tekaia, Lazcano, and Dujon 1999), and occurrence of folds and orthologs (Lin and Gerstein 2000). Yet, the above approaches are all based on alignment of homologous sequences, and it is apparent that much information (such as gene rearrangement and insertions/deletions) in these data sets is lost after sequence alignment, let alone the intrinsic problems of alignment algorithms (Li et al. 2001; Stuart, Moffet, and Baker 2002). There have been a number of recent attempts to develop methodologies that do not require sequence alignment for deriving species phylogeny based on overall similarities of the complete genome data (e.g., Li et al. 2001; Yu and Jiang 2001; Edwards et al. 2002; Stuart, Moffet, and Baker 2002; Stuart, Moffet, and Leader 2002). One author (Qi) and his colleagues have developed a simple correlation analysis of complete genome sequences based on compositional vectors without the need of sequence alignment. 
The compositional vectors calculated based on frequency of amino acid strings are converted to distance values for all taxa, and the phylogenetic relationships are inferred from the distance matrix using conventional tree-building methods (see Materials and Methods for details). An analysis based on this method using 103 prokaryotes and six eukaryotes has yielded a tree separating the three domains of life, Archaea, Eubacteria, and Eukarya, with the relationships among the taxa correlating with those based on traditional analyses (Qi, Wang, and Hao 2004). A correlation analysis based on a different transformation of compositional vectors was recently reported by Stuart, Moffet, and Baker (2002) and Stuart, Moffet, and Leader (2002) who demonstrated the applicability of the method in revealing phylogeny using vertebrate mitochondrial genomes. In the present study, we apply the above approach to analyze 21 complete chloroplast genomes, together with the genomes of two archaea, eight eubacteria (including two cyanobacteria), and three eukaryotes (see Materials and Methods for a list of complete nuclear and chloroplast genomes analyzed). The aim is to test the applicability of this correlation analysis in elucidating the origin and phylogeny of chloroplasts.

    Materials and Methods

    Genome Data Sets

    Complete sequences of 21 chloroplast genomes (Cyanophora paradoxa, Cyanidium caldarium, Porphyra purpurea, Guillardia theta, Odontella sinensis, Euglena gracilis, Chlorella vulgaris, Nephroselmis olivacea, Mesostigma viride, Chaetosphaeridium globosum, Marchantia polymorpha, Psilotum nudum, Pinus thunbergii, Oenothera elata, Lotus japonicus, Spinacia oleracea, Nicotiana tabacum, Arabidopsis thaliana, Oryza sativa, Triticum aestivum, and Zea mays) and genomes of two archaea (Archaeoglobus fulgidus and Sulfolobus solfataricus), eight eubacteria (Helicobacter pylori, Neisseria meningitidis, Rickettsia prowazekii, Borrelia burgdorferi, Chlamydophila pneumoniae, Mycobacterium leprae, Nostoc sp., and Synechocystis sp.), and three eukaryotes (Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans) were retrieved from the NCBI database.

    Composition Vectors and Distance Matrix

    We base our analysis on all protein sequences, including hypothetical reading frames from each genome, regarding sequences of the 20 amino acids as symbolic sequences. In such a sequence of length $L$, there are a total of $N = 20^K$ possible types of strings of length $K$. We use a window of length $K$ and slide it through the sequences by shifting one position at a time to determine the frequencies of each of the $N$ kinds of strings in each genome. A protein sequence is excluded if its length is shorter than $K$. The observed frequency $p(\alpha_1\alpha_2\cdots\alpha_K)$ of a K-string $\alpha_1\alpha_2\cdots\alpha_K$ is defined as $p(\alpha_1\alpha_2\cdots\alpha_K) = n(\alpha_1\alpha_2\cdots\alpha_K)/(L - K + 1)$, where $n(\alpha_1\alpha_2\cdots\alpha_K)$ is the number of times that $\alpha_1\alpha_2\cdots\alpha_K$ appears in this sequence. For example, in the protein sequence "MKRTFQPSILKRNRSHGFRIRMATKNGRYILSRRRAKLRTRLTVSSK," $p(\mathrm{R}) = 11/47$, $p(\mathrm{MR}) = 0$, $p(\mathrm{RR}) = 2/(47 - 2 + 1) = 1/23$, and $p(\mathrm{RRR}) = 1/(47 - 3 + 1) = 1/45$. Denoting by $m$ the number of protein sequences from each complete genome, the observed frequency of a K-string $\alpha_1\alpha_2\cdots\alpha_K$ is defined as

    \[ p(\alpha_1\alpha_2\cdots\alpha_K) = \frac{\sum_{j=1}^{m} n_j(\alpha_1\alpha_2\cdots\alpha_K)}{\sum_{j=1}^{m} (L_j - K + 1)}; \]

    here $n_j(\alpha_1\alpha_2\cdots\alpha_K)$ means the number of times that $\alpha_1\alpha_2\cdots\alpha_K$ appears in the $j$th protein sequence, and $L_j$ is the length of the $j$th protein sequence in this complete genome.
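
    The sliding-window count described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' program: the function name kstring_frequencies is hypothetical, and exact fractions are used only so that the worked example can be checked against the values quoted in the text.

```python
from collections import Counter
from fractions import Fraction


def kstring_frequencies(proteins, k):
    """Observed frequency of every K-string pooled over all proteins.

    Proteins shorter than k are skipped, as stated in the text.
    The name and signature are illustrative assumptions.
    """
    counts = Counter()
    windows = 0  # total number of window positions, sum of (L_j - k + 1)
    for seq in proteins:
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue
        windows += n_windows
        for i in range(n_windows):
            counts[seq[i:i + k]] += 1
    return {s: Fraction(n, windows) for s, n in counts.items()}


example = "MKRTFQPSILKRNRSHGFRIRMATKNGRYILSRRRAKLRTRLTVSSK"
p1 = kstring_frequencies([example], 1)
p2 = kstring_frequencies([example], 2)
p3 = kstring_frequencies([example], 3)
print(p1["R"], p2["RR"], p3["RRR"])  # 11/47 1/23 1/45
```

    Strings that never occur are simply absent from the returned dictionary, so p2.get("MR", 0) gives 0 as in the example.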

    Mutations occur in a random fashion at the molecular level, while selections shape the direction of evolution. There is always some randomness in the composition of protein sequences, revealed by statistical properties of protein sequences at single amino acid or oligopeptide level (see Weiss, Jimenez, and Herzel [2000] for a recent discussion on this point). To highlight the selective diversification of sequence composition, we subtract the random background from the simple counting results. If we perform direct counting for all strings of length (K – 1) and (K – 2), we can predict the expected frequency of appearance of K-strings by using a Markov model (Brendel, Beckmann, and Trifonov 1986):

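    \[ q(\alpha_1\alpha_2\cdots\alpha_K) = \frac{p(\alpha_1\alpha_2\cdots\alpha_{K-1})\, p(\alpha_2\alpha_3\cdots\alpha_K)}{p(\alpha_2\alpha_3\cdots\alpha_{K-1})}, \]

    where $q$ denotes the predicted frequency. When $p(\alpha_2\alpha_3\cdots\alpha_{K-1}) = 0$, then definitely $p(\alpha_1\alpha_2\cdots\alpha_{K-1}) = 0$ because a string will not appear if its substring does not appear; in this case we set $q(\alpha_1\alpha_2\cdots\alpha_K) = 0$. In the above example, $q(\mathrm{RRR}) = (1/23 \times 1/23)/(11/47)$. The above predictor via a Markov model has been used in biological sequence analyses (see Brendel, Beckmann, and Trifonov [1986] for example; see also page 47 of Percus [2002] for a theoretical development). A key step of our approach is to subtract the above random background before performing a cross-correlation analysis (similar to removing a time-varying mean in time series before computing the cross-correlation of two time series). We then calculate a new measure $X$ of the shaping role of selective evolution as

    \[ X(\alpha_1\alpha_2\cdots\alpha_K) = \begin{cases} \dfrac{p(\alpha_1\alpha_2\cdots\alpha_K) - q(\alpha_1\alpha_2\cdots\alpha_K)}{q(\alpha_1\alpha_2\cdots\alpha_K)}, & q(\alpha_1\alpha_2\cdots\alpha_K) \neq 0, \\ 0, & q(\alpha_1\alpha_2\cdots\alpha_K) = 0. \end{cases} \]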

    As an example, we display a segment of p for Cyanophora paradoxa chloroplast in figure 1a and the corresponding sequence X for the same set of K-strings in figure 1b. The transformation X = (p/q) – 1 has the desired effect of subtraction of random background in p and rendering it a stationary time series suitable for subsequent cross-correlation analysis.
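
    The Markov prediction and the transformation X = (p/q) - 1 can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the frequencies for lengths K, K-1, and K-2 are each computed with their own length cutoff (an assumption; the text does not say how shorter proteins are handled for the shorter strings), and K-strings whose predicted frequency is zero are left out of the result, which amounts to treating their component as zero.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def frequencies(proteins, k):
    """Pooled observed frequency of each k-string (hypothetical helper)."""
    counts, windows = Counter(), 0
    for seq in proteins:
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue
        windows += n_windows
        counts.update(seq[i:i + k] for i in range(n_windows))
    return {s: Fraction(c, windows) for s, c in counts.items()}


def markov_background(proteins, k):
    """Predicted frequency q for every K-string with q != 0 (requires k >= 3)."""
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    p_km1 = frequencies(proteins, k - 1)
    p_km2 = frequencies(proteins, k - 2)
    # Index the observed (K-1)-strings by their first K-2 letters.
    by_prefix = defaultdict(list)
    for v in p_km1:
        by_prefix[v[:-1]].append(v)
    q = {}
    for u, p_u in p_km1.items():
        middle = u[1:]  # shared (K-2)-string
        for v in by_prefix.get(middle, ()):
            q[u + v[-1]] = p_u * p_km1[v] / p_km2[middle]
    return q


def composition_vector(proteins, k):
    """Sparse composition vector: X = p/q - 1 wherever q != 0."""
    p = frequencies(proteins, k)
    q = markov_background(proteins, k)
    return {s: p.get(s, 0) / q_s - 1 for s, q_s in q.items()}


example = ["MKRTFQPSILKRNRSHGFRIRMATKNGRYILSRRRAKLRTRLTVSSK"]
q = markov_background(example, 3)
x = composition_vector(example, 3)
print(q["RRR"], x["RRR"], x["KRR"])
```

    A string such as "KRR", whose two overlapping substrings occur but which never occurs itself, receives the component -1, while strings that the background model cannot predict at all stay at zero.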

    FIG. 1. (a) A segment of p for Cyanophora paradoxa chloroplast. (b) The corresponding sequence X for the same set of K-strings

    For all possible strings $\alpha_1\alpha_2\cdots\alpha_K$, we use $X(\alpha_1\alpha_2\cdots\alpha_K)$ as components to form a composition vector for a genome. To further simplify the notation, we use $X_i$ for the $i$-th component corresponding to the string type $i$, $i = 1, \ldots, N$ (the $N$ strings are arranged in a fixed order, namely alphabetical order). Hence we construct a composition vector $X = (X_1, X_2, \ldots, X_N)$ for genome X and likewise $Y = (Y_1, Y_2, \ldots, Y_N)$ for genome Y.

    If we view the $N$ components in vectors $X$ and $Y$ as the samples of two zero-mean random variables, respectively, the correlation $C(X,Y)$ between any two genomes X and Y is defined in the usual way in probability theory as

    \[ C(X,Y) = \frac{\sum_{i=1}^{N} X_i Y_i}{\left( \sum_{i=1}^{N} X_i^2 \times \sum_{i=1}^{N} Y_i^2 \right)^{1/2}}. \]

    The distance $D(X,Y)$ between the two genomes is then defined as $D(X,Y) = (1 - C(X,Y))/2$. A distance matrix for all the genomes under study is then generated for construction of phylogenetic trees.
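
    As a small sketch of the correlation and distance just defined, the code below works on sparse composition vectors stored as dictionaries, with missing K-strings counting as zero components. The function names and the dictionary representation are assumptions made for illustration, and returning a correlation of zero for an all-zero vector is a choice of this sketch, not something stated in the text.

```python
import math
from itertools import combinations


def correlation(x, y):
    """Correlation C(X, Y) of two sparse composition vectors."""
    dot = sum(value * y[s] for s, value in x.items() if s in y)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty vector correlates with nothing
    return dot / (norm_x * norm_y)


def distance(x, y):
    """Distance D(X, Y) = (1 - C(X, Y)) / 2, which lies between 0 and 1."""
    return (1.0 - correlation(x, y)) / 2.0


def distance_matrix(vectors):
    """Pairwise distances for a dict {genome name: composition vector}."""
    names = sorted(vectors)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = distance(vectors[a], vectors[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Toy vectors, not real genome data.
toy = {
    "g1": {"AAA": 1.0, "AAC": -1.0},
    "g2": {"AAA": -1.0, "AAC": 1.0},
    "g3": {"CCC": 2.0},
}
print(distance_matrix(toy))
```

    Identical vectors give a distance of 0, perfectly anticorrelated vectors give 1, and vectors with no shared nonzero component give 0.5.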

    The vector p that we described is identical to the peptide frequency vector used by Stuart, Moffet, and Baker (2002) and Stuart, Moffet, and Leader (2002). However, their method of structure removal is entirely different from our method. Starting from the vector p, these authors used singular value decomposition (SVD) and then dimension reduction on their constructed matrix. The correlation distance is then used to construct the tree. In our method, we subtract random background via a Markov model for q and X. The SVD step is much more complicated than our method in both theoretical and practical considerations.

    Tree Construction and Statistical Test of the Trees

    Different distance methods, including Fitch-Margoliash (Fitch and Margoliash 1967), neighbor-joining (Saitou and Nei 1987), and minimum evolution (Saitou and Imanishi 1989), are used to construct the phylogenetic trees. A previous study on prokaryotes shows that the topology of the trees stabilized for K ≥ 5 (Qi, Wang, and Hao 2004). In the present study, we used K = 4 or 5 in our analysis, and the topologies of the resulting trees are similar. Here we present the results based on K = 5. We conducted the analysis on all the 34 genomes, as well as on the 21 chloroplast genomes alone using Synechocystis as the outgroup. The former analysis aims to explore the origin of the chloroplast genome, whereas the latter analysis is for comparison with previous phylogenetic analyses (Martin et al. 1998, 2002; Turmel, Otis, and Lemieux 1999; De Las Rivas, Lozano, and Ortiz 2002) that include most of the chloroplast genomes in our analysis and use the same outgroup taxon. The distance matrix generated from this analysis is available at http://www.itp.ac.cn/qiji.

    Bootstrapping is performed to give statistical support to the phylogenetic trees. Sequences of proteins are drawn randomly from a complete genome until the total number of proteins selected in each bootstrap is equal to the number of protein-coding genes of that particular genome. That is, in each bootstrap, some proteins may be selected more than once, whereas others may not be included at all. We generate a total of 100 bootstrap matrices and the bootstrap values are expressed as percentage of support for each branch.
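
    The resampling step can be sketched as below. It is only an outline under stated assumptions: the function name bootstrap_proteomes and the seed parameter are hypothetical, and each replicate would still have to be converted to a composition vector and a distance matrix before tree building, which is not shown here.

```python
import random


def bootstrap_proteomes(proteins, n_replicates=100, seed=None):
    """Resample a genome's proteins with replacement.

    Each replicate contains as many proteins as the original genome, so
    some proteins appear more than once and others not at all.
    """
    rng = random.Random(seed)
    size = len(proteins)
    return [rng.choices(proteins, k=size) for _ in range(n_replicates)]


# Toy protein list, not real data.
toy_genome = ["MKRTFQPSIL", "KRNRSHGFRI", "RMATKNGRYI", "LSRRRAKLRT"]
replicates = bootstrap_proteomes(toy_genome, n_replicates=100, seed=1)
print(len(replicates), len(replicates[0]))  # 100 4
```

    Fixing the seed makes a run repeatable; the 100 replicates match the number of bootstrap matrices mentioned in the text.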

    An IBM cluster of 64 CPUs with 3-GB memory is used for the computation of this study. All the calculations take more than 100 h.

    Analysis of the Subtraction Procedure

    To elucidate the biological meaning of the subtraction procedure, we have performed a concrete analysis on the example of Escherichia coli at string length K = 5. There are 1,343,887 nonzero five-strings belonging to 841,832 different string types. Among all the counts, the maximal one is 58 for the string "GKSTL." The counts of the substrings "GKST" and "KSTL" are 113 and 77, respectively. The count of the middle string "KST" is 247. Thus, the predicted value is (113 × 77)/247 = 35.2267 compared with the real count 58 (neglecting the normalization factor when L >> K). The corresponding component in the composition vector after subtraction is (58 – 35.2267)/35.2267 = 0.646478.

    On the contrary, the string "HAMSC" only appears once in E. coli. Its substrings "HAMS" and "AMSC" also appear only once; the count of the middle three-string "AMS" is 198. Its predicted value is (1 × 1)/198 = 0.00505051. The corresponding component after subtraction becomes (1 – 0.00505051)/0.00505051 = 197, making "HAMSC" the largest component in the vector.
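
    The two worked examples above can be checked with a short calculation on raw counts. The helper name subtraction_component is hypothetical, and, as in the text, the normalization factors are neglected, so counts are used in place of frequencies.

```python
def subtraction_component(n_string, n_prefix, n_suffix, n_middle):
    """Predicted count and component (observed - predicted) / predicted.

    Counts stand in for frequencies, which is the approximation the text
    makes when L >> K.
    """
    predicted = n_prefix * n_suffix / n_middle
    return predicted, (n_string - predicted) / predicted


# Counts quoted in the text for E. coli at K = 5.
examples = {
    "GKSTL": (58, 113, 77, 247),
    "HAMSC": (1, 1, 1, 198),
}
for name, counts in examples.items():
    predicted, component = subtraction_component(*counts)
    print(f"{name}: predicted = {predicted:.6g}, component = {component:.6g}")
```

    The printed values agree with the figures quoted above up to rounding in the last digit, and they show how a rare string whose substrings are also rare can dominate the vector.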

    To reveal the biological difference between the two strings "GKSTL" and "HAMSC," we search for the exact match of these two pentapeptides in the Protein Information Resource (PIR) database, which at present contains more than 1.2 million protein sequences. The string "HAMSC" has 15 matches, among which one comes from eukaryotic species, four (essentially the same protein) come from a virus, and 10 come from prokaryotes. Among those from prokaryotes, four are from E. coli and Shigella and two are from Salmonella, while the prokaryotes with the string are closely related to Enterobacteria. In sharp contrast to "HAMSC," the string "GKSTL" has 6,121 matches with proteins from organisms of a wide taxonomic assortment, ranging from virus to human. As a commonly occurring pentapeptide, the string "GKSTL" in the E. coli genome does not carry much phylogenetic information, although it appears most frequently. On the contrary, the pentapeptide "HAMSC" is more characteristic of prokaryotes, especially of Enterobacteria.

    It can be argued that frequently occurring strings per se may not be significant for inferring phylogenetic relationships. In the parlance of classic cladistics, they contribute to plesiomorphic characters and should be eliminated under strict treatment. On the other hand, some strings with small counts, which are of apomorphic characters, may be more significant, if their counts are largely different from what is predicted by a reasonable statistical model. The subtraction procedure helps to highlight these significant strings, although it is not always possible to evaluate the effect in a clear-cut way as we did above in the extreme cases.

    After the subtraction procedure, the frequency of some peptides is reduced to zero, although the number of such strings is not large. By counting the number of strings whose values after subtraction fall in the range –0.1 to 0.1, we find that they make up only a small proportion: 6% in Cyanophora and 7% in E. coli. We cannot say that these zero-strings are not important. Actually they provide necessary information on the degree of dissimilarity among the species that eventually contributes to systematics.

    From a mathematical point of view, the subtraction procedure can be considered as removing a multifractal structure before performing a cross-correlation analysis (similar to removing a time-varying mean in time series before computing the cross-correlation of two time series). The multifractal method has been discussed in Anh, Lau, and Yu (2001) and is not elaborated here.

    We consider the subtraction of random background an essential step in our analysis. The phylogenetic trees generated without using this procedure are quite different. In fact, without this procedure, the topology is inconsistent with the phylogenetic relationships elucidated by traditional approaches. In the study by Qi, Wang, and Hao (2004), a tree of 109 species was generated without the subtraction procedure. In this tree, species of archaea, bacteria, and eukaryotes intermingle with one another and do not clearly cluster into three groups as in the tree presented in Qi, Wang, and Hao (2004). In the tree without subtraction, the groupings in lower systematic levels are in most cases not in agreement with those based on traditional methods. We also generated the chloroplast tree without subtraction of random background. The tree shows that, although all the chloroplasts cluster together, species of archaea and bacteria do not cluster into separate groups. From this comparison, it is apparent that subtraction of random background is necessary and crucial in our correlation analysis.

    Results and Discussion

    The topologies of the trees generated by distance methods including Fitch-Margoliash (FM), neighbor-joining (NJ), and minimum evolution (ME) are very similar. Figure 2a shows the tree based on ME analysis with bootstrap values from both ME and NJ analyses. Discrepancies of the NJ and FM trees from the ME tree are also shown as alternative topologies in figure 2b. All the chloroplast genomes form a clade branching within the Eubacteria domain and share a most recent common ancestor with cyanobacteria, which is in accordance with the widely accepted endosymbiotic theory that chloroplasts arose from a cyanobacteria-like ancestor (Gray 1992, 1999; McFadden 2001b). Apparently, despite massive gene transfer from the endosymbiont to the nucleus of the host cell (Martin and Herrmann 1998; Martin et al. 1998, 2002), our analysis is able to identify cyanobacteria as the prokaryotes most closely related to chloroplasts. We have also attempted to include in our analyses complete genomes of nonphotosynthetic plastids of the parasitic flowering plant Epifagus virginiana (70 kb), the euglenophyte Astasia longa (73 kb), and the apicomplexan Toxoplasma gondii (35 kb). All three taxa appear to be closely related to the two cyanobacteria, with their branches diverging earlier than the other plastids (chloroplasts). We believe such branching positions of the nonphotosynthetic plastids are likely to be artifacts (particularly for Epifagus, a flowering plant whose plastids have lost all the genes for photosynthesis and chlororespiration [see Wolfe, Morden, and Palmer 1992]) of massive genome reduction (about 50% or more in the case of apicomplexan) in these degenerate plastids. Thus, we have not included these plastids in the tree (fig. 2). The effect of genome size on the resolving power of our method is under investigation in our laboratory.

    FIG. 2. Phylogeny of chloroplast genomes based on correlation analysis. (a) Topology of chloroplast genomes together with selected genomes from eubacteria, archaea, and eukaryotes using minimum evolution (ME) analysis. The numbers on each branch show the bootstrap support (100 replicates) based on ME and neighbor-joining (NJ, in italic) analyses. Values less than 50 are not shown. Values shown among the eubacteria, archaea, and eukaryotes are based on the analysis of all 34 genomes. Values shown among the chloroplasts are based on analysis of these 21 genomes using Synechocystis as outgroup. (b) Alternative topologies of the trees based on Fitch-Margoliash (FM) (for T1) or on both FM and NJ (for T2 and T3) analyses

    Our analysis shows that the chloroplasts are separated into two major clades. One of these corresponds to the green plants sensu lato, or chlorophytes s.l. (Palmer and Delwiche 1998), which include all taxa with a chlorophyte chloroplast, of both primary and secondary endosymbiotic origin. The other clade comprises the glaucophyte Cyanophora and members of rhodophytes s.l., which refers to rhodophytes (or red algae) and their secondary symbiotic derivatives, loosely termed chromophytes (including cryptophytes, heterokonts, haptophytes, and dinoflagellates) (Palmer and Delwiche 1998). The close relationship between Cyanophora and rhodophytes s.l. agrees with some of the previous analyses (Stirewalt et al. 1995; De Las Rivas, Lozano, and Ortiz 2002), although most recent studies suggest that the glaucophyte represents the earliest branch in chloroplast evolution with the green plants s.l. and rhodophytes s.l. as sister taxa (Martin et al. 1998, 2002; Stoebe and Kowallik 1999; Adachi et al. 2000; Moreira, Le Guyader, and Philippe 2000). Within the rhodophytes s.l. clade in our tree (including the two red algae Cyanidium and Porphyra, the cryptophyte Guillardia, and the heterokont Odontella), Porphyra and Guillardia are the most closely related taxa. This agrees with the results from gene cluster comparison between these two species, providing strong evidence that cryptophytes arose by secondary endosymbiosis of a primitive rhodophyte (Douglas and Penny 1999; Stoebe and Kowallik 1999). The paraphyly of Guillardia and Odontella, with respect to the two red algae, also suggests independent acquisition of secondary chloroplasts in the heterokont and cryptophyte, in contrast to the hypothesis of a single secondary endosymbiotic event among the chromophytes (Cavalier-Smith 2000). Although a single origin of the chloroplasts in this group is supported in some analyses (De Las Rivas, Lozano, and Ortiz 2002; Yoon et al. 2002), the topology of these four taxa in our tree is identical to that based on a recent, traditional analysis of protein-coding genes in the genomes (Martin et al. 2002). Analysis of small subunit ribosomal DNA in the chloroplasts from a wide variety of rhodophytes and chromophytes also indicates that chloroplasts of the latter group have independent origins (Oliveira and Bhattacharya 2000).

    The chlorophyte-like chloroplast of euglenophytes is generally believed to have arisen from secondary symbiosis by capture of a green alga in the kinetoplastid lineage (Palmer and Delwiche 1998; Cavalier-Smith 2000). The euglenophyte Euglena branches basal to chlorophytes s.l. in our tree, which is consistent with recent analyses of complete chloroplast genomes (De Las Rivas, Lozano, and Ortiz 2002; Martin et al. 2002), although other analyses have placed Euglena within the green algae (Van de Peer et al. 1996; Köhler et al. 1997; Turmel, Otis, and Lemieux 1999). The chloroplasts of green algae, including Chlorella, Nephroselmis, and Mesostigma, are more closely related to land plants than to other algae (Wakasugi et al. 1997; Martin et al. 2002). Our analysis, however, suggests that this assemblage is paraphyletic, but the branching order among the three species receives little bootstrap support. ME and NJ trees grouping Mesostigma with Nephroselmis as prasinophytes are consistent with results from another correlation analysis of complete chloroplast genomes (De Las Rivas, Lozano, and Ortiz 2002). Yet an alternate topology (T1) from the FM tree indicates that Mesostigma is closely related to the streptophytes (including the charophyte Chaetosphaeridium and land plants). Previous molecular phylogenetic studies have also produced conflicting results on the placement of Mesostigma. The first complete chloroplast genome analysis of this species showed that it is an ancestral branch of green plant evolution, representing a lineage that emerged before the divergence of green algae and streptophytes (Lemieux, Otis, and Turmel 2000). Yet a recent analysis on chloroplast genome sequences showed that it is basal to land plants above the green algae (Martin et al. 2002), in accordance with a multigene analysis on a wide variety of charophytes assigning Mesostigma to a basal group of charophytes (Karol et al. 2001). 
The difficulty in resolving the phylogeny of Mesostigma in relation to other members of chlorophytes s.l. in our analysis is possibly because of the limited taxon sampling of the chloroplasts in green algae and charophytes.

    The charophyte Chaetosphaeridium globosum represents a basal branch of the streptophyte clade in all analyses. This is consistent with the chloroplast genome analysis of this species (Turmel, Otis, and Lemieux 2002), suggesting that charophytes were the immediate ancestor of land plants, or embryophytes (Graham, Cook, and Busse 2000). Whereas the support for the angiosperm (flowering plants) clade is strong, its relationships with other land plants are not well resolved in our analysis. An alternative topology (T2) of both the NJ and FM trees suggests that the angiosperms are more closely related to the liverwort Marchantia and the psilophyte Psilotum than to the conifer Pinus. Interestingly, a recent correlation analysis on the complete chloroplast genomes also indicates the same topology (De Las Rivas, Lozano, and Ortiz 2002). Whether this anomaly is caused by the almost complete loss of a large inverted repeat in Pinus (Wakasugi et al. 1994) as compared with other photosynthetic eukaryotes remains to be investigated. Our analysis clearly separates the angiosperms into two clades corresponding to the monocotyledons and eudicots, the two large clades in current understanding of angiosperm phylogeny (Crane, Friis, and Pedersen 1995), although it should be noted that all the monocots included in the tree are members of a single family (Poaceae). The branching order within each clade is not well supported by bootstrapping. A different topology (T3) among three of the eudicots (Spinacia, Nicotiana, and Arabidopsis) is suggested by both the NJ and the FM trees as compared with the ME tree.

    Our simple correlation analysis on the complete chloroplast genomes has yielded a tree that is in good agreement with our current knowledge on the origin of the chloroplasts and the phylogenetic relationships of different groups of photosynthetic eukaryotes as elucidated previously by traditional analyses of the chloroplast genomes and other molecular/ultrastructural approaches (e.g., Martin et al. 2002; De Las Rivas, Lozano, and Ortiz 2002; see also Palmer and Delwiche [1998] and McFadden [2001a, 2001b] for reviews). Our approach circumvents the ambiguity in the selection of genes from complete genomes for phylogenetic reconstruction, and is also faster than the traditional approaches of phylogenetic analysis, particularly when dealing with a large number of genomes. Moreover, because multiple sequence alignment is not necessary, the intrinsic problems associated with this complex procedure can be avoided. In contrast to a recent similar analysis on mitochondrial genomes based on compositional vectors (Stuart, Moffet, and Baker 2002; Stuart, Moffet, and Leader 2002), our approach does not require prior information on gene families in the genome and is also simpler in the method used for subtraction of random background from the data set (see Materials and Methods). We have also shown that this approach is applicable for analyzing the much larger chloroplast genomes, as well as those of the prokaryotes (Qi, Wang, and Hao 2004). We believe that the present approach is an important step towards the analysis of the wealth of information provided by genome projects. In view of the lower resolving power (i.e., relatively low bootstrap support in most of the branches) as compared with the conventional analysis of chloroplast genomes (e.g., Martin et al. 2002), further refinements of the method are being explored in our laboratories, along with the question of the nature of the phylogenetic signals revealed in our method. 
It is hoped that efforts in this line of research will provide us with fast and useful tools in comparative genome analysis as well as insights on genome structure and evolution.

    Acknowledgements

    We thank C. P. Li and K. C. Cheung for technical assistance, B.-L. Hao for discussion, and C. K. Wong for comments on the draft manuscript. Suggestions from Mark Ragan, the Associate Editor and two anonymous reviewers significantly improved the manuscript. Financial support was provided by an AoE Fund of The Chinese University of Hong Kong (K.H.C.), Youth Foundation of the Chinese National Natural Science Foundation (grant no. 10101022), and Postdoctoral Research Support Grant (no. 9900658) of Queensland University of Technology (Z.-G.Y.). The use of the 64 CPU IBM Cluster at Peking University and the facilities at the Centre of Theoretical Biology of Fudan University are gratefully acknowledged.

    Literature Cited

    Adachi, J., P. J. Waddell, W. Martin, and M. Hasegawa. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J. Mol. Evol. 50:348-358.

    Anh, V. V., K. S. Lau, and Z. G. Yu. 2001. Multifractal characterization of complete genomes. J. Phys. A: Math. Gen. 34:7127-7139.

    Brendel, V., J. S. Beckmann, and E. N. Trifonov. 1986. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomol. Struct. Dyn. 4:11-21.

    Cavalier-Smith, T. 2000. Membrane heredity and early chloroplast evolution. Trends Plant Sci. 5:174-182.

    Crane, P. R., E. M. Friis, and K. R. Pedersen. 1995. The origin and early diversification of angiosperms. Nature 374:27-33.

    De Las Rivas, J., J. J. Lozano, and A. R. Ortiz. 2002. Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and deduced evolutionary patterns. Genome Res. 12:567-583.

    Delwiche, C. F. 1999. Tracing the thread of plastid diversity through the tapestry of life. Am. Nat. 154:S164-S177.

    Douglas, S. E., and S. L. Penny. 1999. The plastid genome of the cryptophyte alga, Guillardia theta: complete sequence and conserved synteny groups confirm its common ancestry with red algae. J. Mol. Evol. 48:236-244.

    Edwards, S. V., B. Fertil, A. Giron, and P. J. Deschavanne. 2002. A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst. Biol. 51:599-613.

    Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-284.

    Fitz-Gibbon, S. T., and C. H. House. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27:4218-4222.

    Graham, L. E., M. E. Cook, and J. E. Busse. 2000. The origin of plants: body plan changes contributing to a major evolutionary radiation. Proc. Natl. Acad. Sci. USA 97:4535-4540.

    Gray, M. W. 1992. The endosymbiont hypothesis revisited. Int. Rev. Cytol. 141:233-357.

    Gray, M. W. 1999. Evolution of organellar genomes. Curr. Opin. Genet. Dev. 9:678-687.

    Karol, K. G., R. M. McCourt, M. T. Cimino, and C. F. Delwiche. 2001. The closest living relatives of land plants. Science 294:2351-2353.

    Köhler, S., C. F. Delwiche, P. W. Denny, L. G. Tilney, P. Webster, R. J. M. Wilson, J. D. Palmer, and D. S. Roos. 1997. A plastid of probable green algal origin in apicomplexan parasites. Science 275:1485-1489.

    Lemieux, C., C. Otis, and M. Turmel. 2000. Ancestral chloroplast genome in Mesostigma viride reveals an early branch of green plant evolution. Nature 403:649-652.

    Li, M., J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. 2001. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17:149-154.

    Lin, J., and M. Gerstein. 2000. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes at different levels. Genome Res. 10:808-818.

    McFadden, G. I. 2001a. Primary and secondary endosymbiosis and the origin of plastids. J. Phycol. 37:951-959.

    McFadden, G. I. 2001b. Chloroplast origin and integration. Plant Physiol. 125:50-53.

    Martin, W., and R. G. Herrmann. 1998. Gene transfer from organelles to the nucleus: how much, what happens, and why? Plant Physiol. 118:9-17.

    Martin, W., B. Stoebe, V. Goremykin, S. Hansmann, M. Hasegawa, and K. V. Kowallik. 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393:162-165.

    Martin, W., T. Rujan, E. Richly, A. Hansen, S. Cornelsen, T. Lins, D. Leister, B. Stoebe, M. Hasegawa, and D. Penny. 2002. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc. Natl. Acad. Sci. USA 99:12246-12251.

    Moreira, D., H. Le Guyader, and H. Philippe. 2000. The origin of red algae and the evolution of chloroplasts. Nature 405:69-72.

    Oliveira, M. C., and D. Bhattacharya. 2000. Phylogeny of the Bangiophycidae (Rhodophyta) and the secondary endosymbiotic origin of algal plastids. Am. J. Bot. 87:482-492.

    Palmer, J. D., and C. F. Delwiche. 1998. The origin and evolution of plastids and their genomes. Pp. 345–409 in D. E. Soltis, P. S. Soltis, and J. J. Doyle, eds. Molecular systematics of plants II: DNA sequencing. Kluwer, London.

    Percus, J. K. 2002. Mathematics of genome analysis. Cambridge University Press, New York.

    Qi, J., B. Wang, and B. Hao. 2004. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J. Mol. Evol. 58:1-11.

    Saitou, N., and T. Imanishi. 1989. Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Mol. Biol. Evol. 6:514-525.

    Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.

    Sankoff, D., G. Leduc, N. Antoine, B. Paquin, B. F. Lang, and R. Cedergren. 1992. Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA 89:6575-6579.

    Stirewalt, V. L., C. B. Michalowski, W. Loffelhardt, H. J. Bohnert, and D. A. Bryant. 1995. Nucleotide sequence of the cyanelle genome from Cyanophora paradoxa. Plant Mol. Biol. Rep. 13:327-332.

    Stoebe, B., and K. V. Kowallik. 1999. Gene-cluster analysis in chloroplast genomics. Genome Analysis Outlook 15:344-347.

    Stuart, G. W., K. Moffet, and S. Baker. 2002. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics 18:100-108.

    Stuart, G. W., K. Moffet, and J. J. Leader. 2002. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol. Biol. Evol. 19:554-562.

    Tekaia, F., A. Lazcano, and B. Dujon. 1999. The genomic tree as revealed from whole proteome comparisons. Genome Res. 9:550-557.

    Turmel, M., C. Otis, and C. Lemieux. 1999. The complete chloroplast DNA sequence of the green alga Nephroselmis olivacea: insights into the architecture of ancestral chloroplast genomes. Proc. Natl. Acad. Sci. USA 96:10248-10253.

    Turmel, M., C. Otis, and C. Lemieux. 2002. The chloroplast and mitochondrial genome sequences of the charophyte Chaetosphaeridium globosum: insights into the timing of the events that restructured organelle DNAs within the green algal lineage that led to land plants. Proc. Natl. Acad. Sci. USA 99:11275-11280.

    Van de Peer, Y., S. A. Rensing, U. G. Maier, and R. De Wachter. 1996. Substitution rate calibration of small subunit ribosomal RNA identifies chlorarachniophyte endosymbionts as remnants of green algae. Proc. Natl. Acad. Sci. USA 93:7732-7736.

    Wakasugi, T., T. Nagai, M. Kapoor, et al. (15 co-authors). 1997. Complete nucleotide sequence of the chloroplast genome from the green alga Chlorella vulgaris: the existence of genes possibly involved in chloroplast division. Proc. Natl. Acad. Sci. USA 94:5967-5972.

    Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794-9798.

    Weiss, O., M. A. Jimenez, and H. Herzel. 2000. Information content of protein sequences. J. Theor. Biol. 206:379-386.

    Wolfe, K. H., C. W. Morden, and J. D. Palmer. 1992. Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc. Natl. Acad. Sci. USA 89:10648-10652.

    Yoon, H. S., J. D. Hackett, G. Pinto, and D. Bhattacharya. 2002. The single, ancient origin of chromist plastids. Proc. Natl. Acad. Sci. USA 99:15507-15512.

    Yu, Z.-G., and P. Jiang. 2001. Distance, correlation and mutual information among portraits of organisms based on complete genomes. Phys. Lett. A 286:34-46.

    (Ka Hou Chu*, Ji Qi, Zu-Gu)