PhyloGenie: automated phylome generation and analysis
Nucleic Acids Research, 2004, Issue 17
     Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstr. 35, D-72076 Tuebingen, Germany

    * To whom correspondence should be addressed. Tel: +49 7071 601 340; Fax: +49 7071 601 349; Email: andrei.lupas@tuebingen.mpg.de

    ABSTRACT

    Phylogenetic reconstruction is the method of choice to determine the homologous relationships between sequences. Difficulties in producing high-quality alignments, which are the basis of good trees, and in automating the analysis of trees have unfortunately limited the use of phylogenetic reconstruction methods to individual genes or gene families. Due to the large number of sequences involved, phylogenetic analyses of proteomes preclude manual steps and therefore require a high degree of automation in sequence selection, alignment, phylogenetic inference and analysis of the resulting set of trees. We present a set of programs that automates the steps from seed sequence to phylogeny and a utility to extract all phylogenies that match specific topological constraints from a database of trees. Two example applications that show the type of questions that can be answered by phylome analysis are provided. The generation and analysis of the Thermoplasma acidophilum phylome with regard to lateral gene transfer between Thermoplasmata and Sulfolobus showed best BLAST hits to be far less reliable indicators of lateral transfer than the corresponding protein phylogenies. The generation and analysis of the Danio rerio phylome provided more than twice as many proteins as described previously, supporting the hypothesis of an additional round of genome duplication in the actinopterygian lineage.

    INTRODUCTION

    The number of sequences being generated by genome projects far exceeds our ability to manually assign any meaningful annotation to them. To analyze the flood of ‘unknown’ or ‘hypothetical’ sequences in a reasonable time frame, automated methods are essential. These often rely on the assumption that sequences have the same function as their closest relative. The use of best BLAST hits to find these close relatives may often be a viable option (1). However, Koski and Golding showed that best BLAST hits do not necessarily represent the closest sequence relatives (2), thereby casting doubt on the reliability of this approach. The human genome consortium (3), for example, predicted 113 lateral gene transfers (LGTs) from bacteria to vertebrates based on BLAST results. Subsequent phylogenetic analysis of the genes in question, however, was unable to find support for any of these predictions (4–6).

    The use of trees to find the closest relatives, by inferring a phylogeny for each sequence, is a more robust but computationally demanding approach. It is difficult to automate reliably, as it involves two steps (selection of homologous sequences and multiple alignment) whose automated forms are error-prone. A program that automates the steps of similarity search, alignment and phylogenetic inference, Pyphy (7), uses a reduced sequence database with higher-quality annotation (8), fixed criteria of similarity to define homology (80% coverage and 50% identity, or identical annotation) and alignment of full-length sequences (9). Pyphy was specifically designed to detect and visualize LGT in prokaryotic genomes, and its restrictive settings were chosen to optimize its performance on this problem.

    We have developed a suite of programs, PhyloGenie, which also automates the steps from seed sequence to phylogenetic inference, but can be used to examine a much broader range of phylogenetic hypotheses. PhyloGenie can be used with any standard FASTA format database, is based on local alignments, offers full flexibility in setting the criteria for homology and filters phylomes for all trees matching specific, user-defined topological constraints. To illustrate its operation and scope, we apply PhyloGenie to two phylogenetic problems that have been studied previously by non-automated methods and compare its performance with Pyphy. The two problems are the apparent large-scale LGT between T.acidophilum and S.solfataricus (10), two phylogenetically distant Archaea that inhabit the same environment, and the presumed additional genome duplication in the actinopterygian lineage since its divergence from tetrapods (11).

    METHODS

    Genomes and databases

    NCBI taxonomy files and the non-redundant (nr) sequence database were obtained from the NCBI website (www.ncbi.nlm.nih.gov). The complete genome of T.acidophilum and all sequences for Danio rerio in the nr database of October 2003 were obtained from the same source.

    Sequence similarity detection and alignment

    Sequences were compared with the nr sequence database using BLASTP v2.2.6 and multiple sequence alignments were derived using the Java program Blammer. Blammer consists of five post-processing steps for BLAST result files that convert sets of high-scoring segment pairs (HSPs) to multiple alignments; this routine relieves the gapping problems that arise during the conversion of pairwise alignments to multiple alignments (Figures 1 and 2). All parameters (X to P) specified below can be customized and were chosen so as to maximize the number of BLASTP hits while providing reasonable support for sequence homology.

    Figure 1. Alignment excerpts showing the most commonly encountered problems when converting BLAST or PSIBLAST HSPs to multiple alignments. (A) Three BLAST HSPs combined to a multiple sequence alignment and the resulting gapping problems. (B) Extreme examples of excessive and inconsistent gapping.

    Figure 2. Layout showing the BLAST/PSIBLAST post-processing steps used to reduce excessive and inconsistent gapping. (1) All full-length sequences are gathered for HSPs and form the database used for HMM-searching in 5. (2) All HSPs matching E-value, score and coverage cutoff criteria are converted to a multiple sequence alignment. (3) The alignment sequences are filtered by maximum sequence identity to remove duplicate entries and gapped regions are realigned to resolve gapping problems. (4) A profile-HMM is derived from the multiple sequence alignment. (5) Sequences from step 1 are searched with the HMM generated in step 4 so as to better define the start and end of alignable regions and thereby improve alignment. (6) HMM-HSPs are converted to a multiple sequence alignment.

    First, full-length sequences for HSPs up to expectation values (E-values) of X (X = 10) are extracted, which enables the sequence database to be searched with a profile hidden Markov model (HMM) (12) in a later step. The HSPs of the query sequence with a coverage greater than Y (Y = 60%) and E-values better than Z (Z = 10⁻⁵) are extracted and the most dissimilar K (K = 150) of these are converted to a multiple alignment. The coverage and cutoff E-value are used to determine sequence homology and the most dissimilar HSPs are used to ensure that the HMM generated from the resulting alignment in a later step is representative of all of the relevant BLAST hits instead of only a large group of mostly identical sequences. Alignment regions with more than L (L = 100) consecutive ungapped columns are taken as alignment anchor points and all residues between such anchors are realigned using ClustalW, thus resolving inconsistent gapping problems.
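
    The two selection rules in this paragraph (HSP filtering by coverage and E-value, and detection of ungapped anchor regions) can be illustrated with a minimal Python sketch. This is not the Blammer code: the `Hsp` record, the function names and the definition of coverage as the fraction of the query spanned by the HSP are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical HSP record; query coordinates are 1-based and inclusive.
Hsp = namedtuple("Hsp", ["subject_id", "query_start", "query_end", "evalue"])


def select_hsps(hsps, query_length, min_coverage=0.60, max_evalue=1e-5):
    """Keep HSPs with coverage > Y and E-value better (smaller) than Z.

    Assumption: coverage is the fraction of the query spanned by the HSP.
    """
    kept = []
    for hsp in hsps:
        coverage = (hsp.query_end - hsp.query_start + 1) / query_length
        if coverage > min_coverage and hsp.evalue < max_evalue:
            kept.append(hsp)
    return kept


def find_anchor_regions(alignment, min_length=100, gap="-"):
    """Return (start, end) column ranges (0-based, end exclusive) of runs
    with more than min_length consecutive columns free of gap characters."""
    if not alignment:
        return []
    n_cols = len(alignment[0])
    anchors = []
    run_start = None
    # Iterate one column past the end so that a trailing run is closed.
    for col in range(n_cols + 1):
        ungapped = col < n_cols and all(seq[col] != gap for seq in alignment)
        if ungapped:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start > min_length:
                anchors.append((run_start, col))
            run_start = None
    return anchors
```

    In such a scheme, only the columns lying between consecutive anchor ranges would be handed to the realignment program; the anchors themselves stay fixed.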

    An HMM, derived from the resulting alignment, is used to search the database of full-length sequences generated in the first step. This removes false positive BLAST hits and better defines the beginning and end of alignable sequence regions due to the higher sensitivity of HMMs. The alignment from which phylogenies are inferred consists of the HMM-HSPs with E-values better than M (M = 10⁻⁶).

    Sequences of the same organism with more than N (N = 99%) sequence identity are thought to be redundant database entries and only one copy is retained. In cases where the HMM search returns more than P sequences (P = 150), only the best P matches are converted to a multiple alignment so as to keep the ensuing phylogenetic calculations and analyses in a reasonable time frame.
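
    The redundancy filter and the cap on the number of matches can be sketched as follows. The tuple layout, the function names and the way identity is computed (over columns in which both aligned sequences carry a residue) are assumptions for illustration; the source does not specify how Blammer measures identity.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def remove_redundant(hits, max_identity=99.0):
    """Drop same-organism hits with more than max_identity (N) percent identity
    to a hit that was already kept.

    hits: list of (organism, aligned_sequence, evalue) tuples.
    """
    kept = []
    for organism, seq, evalue in hits:
        duplicate = any(
            org == organism and percent_identity(seq, other) > max_identity
            for org, other, _ in kept
        )
        if not duplicate:
            kept.append((organism, seq, evalue))
    return kept


def best_matches(hits, max_hits=150):
    """Keep only the max_hits (P) matches with the best (smallest) E-values."""
    return sorted(hits, key=lambda hit: hit[2])[:max_hits]
```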

    Pyphy

    The program Pyphy was obtained from T. Sicheritz-Ponten (Technical University of Denmark, Lyngby) and installed under Gentoo Linux. To make the output of Pyphy comparable with PhyloGenie (specifically, to avoid distance versus parsimony issues), tree inference was handled in the same way for both programs by using the PhyloGenie routines.

    Phylogenetic inference

    Phylogenies were inferred using neighbor-joining (NJ) (13) in combination with the Poisson distance correction scheme and bootstrapped with 100 replicates.
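
    For reference, the Poisson correction converts the observed proportion p of differing sites between two aligned sequences into a distance d = -ln(1 - p), and a bootstrap replicate is obtained by resampling alignment columns with replacement. The sketch below illustrates both in Python; the function names are hypothetical and ignoring gapped columns when computing p is an assumption, not a description of the PhyloGenie NJ tool.

```python
import math
import random


def p_distance(a, b, gap="-"):
    """Proportion of differing residues over columns without gaps."""
    sites = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def poisson_distance(a, b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(a, b)
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(alignment[0])
    cols = rng.choices(range(n_cols), k=n_cols)
    return ["".join(seq[c] for c in cols) for seq in alignment]


# One replicate of a toy alignment, drawn with a fixed seed.
replicate = bootstrap_alignment(["ACDE", "ACDF"], random.Random(0))
```

    Repeating the resampling 100 times and inferring a tree from each replicate yields the bootstrap support values used later to collapse weakly supported nodes.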

    External programs

    Full functionality requires the NCBI taxonomy files ‘names.dmp’ and ‘nodes.dmp’ (used for tree analysis) as well as installed copies of BLAST (www.ncbi.nlm.nih.gov), HMMER (http://hmmer.wustl.edu) and ClustalW (www.ebi.ac.uk/clustalw). To further customize the utility, it is possible to replace the alignment and tree construction routines. Any program or script that accepts FASTA format sequences as input and generates Clustal format alignments can replace ClustalW as an alignment tool. Similarly, any tree construction program that accepts aligned FASTA format sequences and generates Newick format trees can replace the provided NJ tool.

    Tree analysis

    The T.acidophilum phylome was searched for trees showing LGT between Thermoplasmata and Sulfolobus using the query ‘(Thermoplasmata & Sulfolobus & !(*cellular organisms))’. Trees corresponding to this search string included those with at least one node containing Thermoplasmata and Sulfolobus sequence representatives but no other cellular organisms.

    For the zebrafish set of trees, the query ‘((Danio rerio {=2} & Homo sapiens {=1} & Mus musculus {=1} & Gallus gallus {=1} & Euteleostomi) & !(*Eukaryota))’ returned phylogenies containing nodes in which two genes were present in Danio rerio and exactly one in Homo sapiens, Mus musculus and Gallus gallus. In addition, sequences belonging to non-euteleostomi eukaryotes were not permitted in that node.
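
    The logic of the zebrafish query can be approximated, for a single node, by the check sketched below. This is an illustrative Python function and not the PhyloGenie query engine: `clade_matches`, its arguments and the abbreviated lineage sets are assumptions. A node passes if each listed species occurs exactly the stated number of times among its descendant leaves and every leaf belongs to the required taxonomic group.

```python
from collections import Counter


def clade_matches(leaf_species, exact_counts, required_group, lineages):
    """Check one node's descendant leaves against count and group constraints.

    leaf_species: species name of every leaf below the node.
    exact_counts: {species: required number of leaves}.
    required_group: taxon that every leaf must belong to.
    lineages: {species: set of taxa the species belongs to}.
    """
    counts = Counter(leaf_species)
    for species, required in exact_counts.items():
        if counts[species] != required:
            return False
    return all(required_group in lineages[species] for species in leaf_species)


# Abbreviated lineage sets, for illustration only.
LINEAGES = {
    "Danio rerio": {"Eukaryota", "Euteleostomi"},
    "Homo sapiens": {"Eukaryota", "Euteleostomi"},
    "Mus musculus": {"Eukaryota", "Euteleostomi"},
    "Gallus gallus": {"Eukaryota", "Euteleostomi"},
    "Takifugu rubripes": {"Eukaryota", "Euteleostomi"},
    "Drosophila melanogaster": {"Eukaryota"},
}

ZEBRAFISH_COUNTS = {
    "Danio rerio": 2,
    "Homo sapiens": 1,
    "Mus musculus": 1,
    "Gallus gallus": 1,
}
```

    A tree would be reported if at least one of its nodes passes this check.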

    Prior to analysis, sequences belonging to the NCBI taxonomic groups ‘Viruses’, ‘Viroids’, ‘other sequences’ and ‘unclassified’ were excluded and all nodes supported by bootstrap values below 50 were collapsed.
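
    Collapsing weakly supported nodes amounts to replacing each internal node whose bootstrap value is below the threshold by its children, which turns the affected region into a multifurcation. A minimal Python sketch on a hypothetical `Node` class (not the PhyloGenie data structure) could look like this:

```python
class Node:
    """Minimal rooted tree node; support is the bootstrap value of the
    branch leading to this node (None for leaves and for the root)."""

    def __init__(self, name=None, support=None, children=None):
        self.name = name
        self.support = support
        self.children = children or []

    def is_leaf(self):
        return not self.children


def collapse_low_support(node, min_support=50):
    """Splice the children of every weakly supported internal node into its
    parent. Works bottom-up, so chains of weak nodes collapse completely."""
    new_children = []
    for child in node.children:
        collapse_low_support(child, min_support)
        weak = (
            not child.is_leaf()
            and child.support is not None
            and child.support < min_support
        )
        if weak:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node


def leaf_names(node):
    """Names of all leaves below a node, left to right."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]
```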

    The analysis of unrooted trees is far more complex than that of rooted trees due to missing directionality (Figure 3a). However, automated rooting of trees is non-trivial. We have implemented the following rooting scheme that ensures correct directionality for at least the branch containing the seed sequence, i.e. the one the tree was calculated for, and frequently the complete tree. A tree is rooted by assigning a taxonomic ‘level’ to each node and rooting at the node with the lowest level (i.e. closest to ‘root of life’ or ‘root’) (Figure 3d). To assign a node's taxonomic level, the tree is first rooted with the seed sequence (Figure 3a: MAN) and the lowest common taxonomic denominator for all descendant species is calculated for each node (Figure 3b). Next, the tree is re-rooted at the leaf node that is least related to the seed sequence and separated from it by the highest number of nodes (Figure 3b: E.coli K12). All nodes are then reassigned a taxonomic level. If a node's new taxonomic level differs from the previous assignment, the level closest to ‘species’ is retained (Figure 3c). The second rooting and round of taxonomic assignments is done to remove directionality from the taxonomic assignments and ensure that they are independent of the way the tree was rooted. The node closest to ‘root of life’ (last common ancestor for all proteins in this tree) is used to root the tree (Figure 3d). If multiple nodes of the same ‘lowest’ taxonomic level exist, the tree is rooted at the node most distant from the seed sequence.
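
    The ‘lowest common taxonomic denominator’ of a node can be read as the longest lineage prefix shared by all of its descendant species. The Python sketch below illustrates this and the rule of retaining the level closest to ‘species’ when the two rooting rounds disagree. The function names and the abbreviated lineages are assumptions for illustration; in practice the lineages would be derived from the NCBI taxonomy files.

```python
def common_lineage(lineages):
    """Longest shared prefix of root-to-species lineages."""
    shared = []
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def reconcile(first_round, second_round):
    """Retain the assignment closest to 'species', i.e. the deeper lineage."""
    return first_round if len(first_round) >= len(second_round) else second_round


# Abbreviated lineages, for illustration only.
HUMAN = ["root", "cellular organisms", "Eukaryota", "Mammalia", "Homo sapiens"]
MOUSE = ["root", "cellular organisms", "Eukaryota", "Mammalia", "Mus musculus"]
ECOLI = ["root", "cellular organisms", "Bacteria", "Escherichia coli"]

mammal_node = common_lineage([HUMAN, MOUSE])       # ends at "Mammalia"
basal_node = common_lineage([HUMAN, MOUSE, ECOLI])  # ends at "cellular organisms"
```

    Under this reading, the node chosen as root is the one with the shortest reconciled lineage, with ties broken by distance from the seed sequence as described above.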

    Figure 3. Tree rooting scheme. (a) Unrooted tree. (b) Tree rooted at the seed sequence (Man) with taxonomic ‘level’ assignments for each node. (c) Tree rooted at the leaf node least related to, and most distant from, the seed sequence (counting nodes) after the second round of taxonomic assignment. (d) Final tree, rooted at the most basal node that is most distant from the seed sequence.

    Computing resources

    The T.acidophilum analysis was performed on a single-CPU AMD64 2400 workstation running Linux. Analysis of the Danio rerio proteome was done on a SUN V880 under Solaris9. All Pyphy analyses were performed on an AMD64 2400 workstation running Linux. Generation of the T.acidophilum phylome required 78 h. The BLAST searches for each protein took 14 h, the conversion of BLAST to multiple alignments took an additional 60 h, and 4 h were needed to infer phylogenetic trees and bootstrap each with 100 replicates. The analysis of the resulting phylome took 36 s.

    Availability

    The software can be downloaded from http://protevo.eb.tuebingen.mpg.de/download.

    RESULTS AND DISCUSSION

    The PhyloGenie program

    Analysis of phylomes, defined as the complete set of phylogenetic trees derived from the proteomes of organisms (7), requires four key steps: selection of homologs, multiple alignment, tree inference and filtering for specific tree topologies.

    In Pyphy, the selection of homologs is guided to a large extent by sequence annotation. This requires high-quality sequence databases that provide standardized annotation, such as Swissprot and TREMBL, which prevents the use of most public databases. Since both Swissprot and TREMBL lag substantially behind the nr sequence database, both in number of sequences and completeness, we have implemented a sequence selection routine in PhyloGenie that is completely driven by local pairwise similarity. First, we extract sequences with domain-sized regions of statistically significant sequence similarity, using the search tools BLAST or PSIBLAST, which are fast, reliable and sensitive. We then refine this set during the alignment process, using HMMs (see Methods; Figure 2).

    Good phylogenies require good alignments. The errors incurred in the alignment process cannot be corrected by the subsequent steps of analysis. Non-homologous sequences or domains in an alignment, misaligned residues or the unfortunate selection of sequence representatives can lead to errors and possibly invalidate the inferred tree. Generating high-quality sequence alignments can therefore be seen as the most critical step on the path from seed sequence to phylogeny. When producing alignments, it is necessary to decide between aligning full-length sequences and aligning only the conserved regions for which sequence similarity, presumably due to shared descent, is unambiguously determinable. Pyphy uses the global alignment program ClustalW to align full-length sequences, thus requiring that all sequences in the alignment match over most of their length. This precludes the application of Pyphy to many proteins, such as the histidine kinases and response regulators of two-component signal transduction systems, which show an enormous diversity in length and domain composition, but are nevertheless rewarding targets for phylogenetic analysis based on their conserved kinase and phospho-acceptor domains (14). For this reason, PhyloGenie contains a novel alignment routine, Blammer, which post-processes local pairwise sequence alignments obtained from BLAST or PSIBLAST (see Methods; Figure 2) to focus the resulting multiple alignment on conserved domains. Blammer extracts the BLAST HSPs that pass given significance and coverage cut-offs, converts them to a multiple alignment, identifies anchor regions of ungapped sequence and realigns the gapped regions in between using ClustalW. It then builds an HMM of the alignment and searches all full-length sequences that have BLAST HSPs in response to the original query for significant matches, which it realigns to obtain the final alignment. In addition, and unlike Pyphy, PhyloGenie allows users to customize all parameters in the search and alignment routines, thus making it possible to optimize PhyloGenie for specific questions.

    Many approaches to tree inference exist and different methods may be used depending on the available computing infrastructure, the average size and the quality of alignments. By default, PhyloGenie provides a neighbor-joining (NJ) method (13), a fast and robust way to infer trees. This can be replaced by any program or script that accepts aligned FASTA format sequences and generates New Hampshire Bracket Format (Newick) trees. For example, PhyloGenie contains a script (treepuzzle.pl), which allows the use of Tree-Puzzle (15), one of the faster maximum likelihood tree inference programs. We believe that this solution is preferable to that implemented in Pyphy, which uses the program PAUP (16) for tree inference. PAUP is a proprietary program and uses a program-specific tree format.

    A large repository of phylogenetic trees is of limited use unless a way of separating relevant from irrelevant trees for the question at hand is provided. For example, in evaluating the actinopterygian genome duplication hypothesis, Taylor et al. (17) examined large numbers of trees manually, as phylogenies proved difficult to analyze in an automated manner. To reduce the number of trees that have to be examined manually, PhyloGenie contains a tool that extracts phylogenies corresponding to specific, user-defined topological constraints from a database of trees. Pyphy circumvents this problem by focusing on a single phylogenetic hypothesis, namely LGT.

    Application to a LGT hypothesis

    Thermoplasma acidophilum is a thermoacidophilic euryarchaeon that lives at 59°C and pH 2, whose genome sequence suggests extensive LGT with a phylogenetically distant organism, the crenarchaeote S.solfataricus, which inhabits the same ecological niche (10). This transfer was deduced from the fact that 252 of 1478 genes of Thermoplasma had best BLAST matches to proteins of Sulfolobus. Since the Thermoplasma–Sulfolobus BLAST comparison was originally done before the completion of the Sulfolobus genome, we repeated it and obtained 303 genes for which best BLAST hits predicted a Sulfolobus sequence as the closest relative. A PhyloGenie search for LGTs between Thermoplasmata (T.acidophilum, T.volcanium, Ferroplasma acidarmanus, Picrophilus torridus) and Sulfolobus (S.solfataricus and S.tokodaii) returned 185 trees. Of the 252 LGTs originally predicted from BLAST similarities (10), fewer than half were recovered by the PhyloGenie approach. An analysis with Pyphy returned 148 trees.

    The potential LGTs are not distributed uniformly across the genome (Figure 4); the patterns of distribution are similar for the three methods, with local differences in the exact numbers. Globally, 93 LGTs were predicted by all three methods, 71 by PhyloGenie and BLAST, 40 by Pyphy and BLAST and 1 by PhyloGenie and Pyphy. A closer analysis as to why one method differed from the other two revealed that in the set of 40 proteins missed by PhyloGenie but predicted as LGTs by the other two methods, most were compatible with the lateral transfer hypothesis but were excluded due to low bootstrap support (Table 1). In the set of 71 proteins missed by Pyphy, 43 were due to the use of the reduced sequence database that Pyphy uses and the very stringent inclusion criteria for homologous sequences; this caused many alignments to miss relevant proteins and, in some cases, to consist solely of one protein from each of the two Thermoplasma species, T.acidophilum and T.volcanium. The one tree missed by BLAST is due to an archaeal sequence with a marginally better E-value than the closest Sulfolobus sequence relative.

    Figure 4. Chromosomal distribution of presumed laterally transferred ORFs between Thermoplasmata and Sulfolobus, according to PhyloGenie, Pyphy and best BLAST hits. The light gray, dark gray and black circles encompass the LGTs predicted by BLAST, Pyphy and PhyloGenie, respectively.

    Table 1. Overview of LGT events identified by BLAST, Pyphy, and PhyloGenie

    In summary (Table 1): (i) 93 LGT predictions were supported by all three methods, and a further 90 were supported by at least two of the three methods and not contradicted by any; (ii) 8 LGTs were predicted by BLAST and PhyloGenie but contradicted by Pyphy, 13 were supported by BLAST and Pyphy but contradicted by PhyloGenie and 1 was supported by both Pyphy and PhyloGenie but contradicted by BLAST. Taking LGT predictions supported by at least two methods and not contradicted by any phylogenetic approach as true positives (184 trees), BLAST was the most sensitive method with >99% sensitivity (183 of 184 true positives recovered) but also the least selective with 60% selectivity (303 predicted LGTs, 183 true positives). PhyloGenie showed a sensitivity of 85% and selectivity of 85% whereas Pyphy had a sensitivity of 66% and a selectivity of 82%. Thus, of the three methods, PhyloGenie seems to be the one best able to combine high sensitivity with high selectivity. In detail, our conclusions are: (i) the Pyphy criteria for defining homologous sequences are too strict, thereby excluding many relevant sequences from analysis, as is apparent from the 43 true positives that were not predicted by Pyphy because it missed the Sulfolobus homologs (Table 1); (ii) less strict search criteria, as in PhyloGenie, circumvent this problem, but the resulting sequence diversity may lower bootstrap support in some cases to <50%, thereby causing trees supporting the LGT hypothesis to be missed, as in 24 of the 40 true positives missed by PhyloGenie; (iii) finally, as pointed out by Koski and Golding (2), best BLAST hits are of only moderate accuracy when identifying the phylogenetically nearest homolog.
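
    The BLAST figures quoted above follow directly from the counts given in the text, with sensitivity taken as the fraction of true positives recovered and selectivity as the fraction of predictions that are true positives. A short Python check (the function names are hypothetical; only the counts 183, 184 and 303 come from the text):

```python
def sensitivity(true_positives_found, true_positives_total):
    """Fraction of all true positives that a method recovered."""
    return true_positives_found / true_positives_total


def selectivity(true_positives_found, predictions_total):
    """Fraction of a method's predictions that are true positives."""
    return true_positives_found / predictions_total


# Counts for best BLAST hits, as stated in the text.
blast_sensitivity = sensitivity(183, 184)  # about 0.995, i.e. >99%
blast_selectivity = selectivity(183, 303)  # about 0.604, i.e. 60%
```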

    The hypothesis of large-scale LGT between Thermoplasmata and Sulfolobus, proposed on the basis of best BLAST hits (10), is thus confirmed by phylogenetic analyses, albeit in a smaller number of cases than originally anticipated. The clustering of putatively transferred genes is also confirmed (Figure 4), pointing to a process that occurred mainly by the exchange of larger DNA regions.

    Application to a genome duplication hypothesis

    It has been proposed that the creation of metazoans and vertebrates from unicellular organisms would have been impossible without duplication of genes, as mechanisms evolving new functions at the price of discarding established ones would not provide an effective way to ‘progress’ in evolution (18,19). Genome duplication was advanced as the primary source for new, redundant genes as it increases gene number without changing gene dosage. On the basis of the large number of chromosomes found in species whose ancestors diverged early in the evolution of actinopterygii, Dingerkus and Howell (11) proposed that the actinopterygian lineage (ray-finned fish), containing over 22 000 species, arose by means of tetraploidization. Support for this hypothesis has also been found in the seven hox clusters present in zebrafish (20), almost twice as many as in tetrapods, and in the 49 clades of orthologous proteins found by Taylor et al. (17).

    An analysis of the set of trees derived by PhyloGenie from all zebrafish proteins present in the non-redundant GenBank database returned trees for 120 clades of orthologous genes, in which two Danio rerio proteins were present for one protein in H. sapiens, M.musculus and G.gallus. Of these, five had no discernible annotation information, 16 proteins seemed unlikely to be involved in development or gene regulation (synaptosome associated protein, chromobox4, photolyase, beta-carotene oxygenase, opsin I, Cytochrome P450, dystrophin, histocompatibility antigen class II, heat shock factor 1, heat shock factor 2, lamin1, lamin2, rhodopsin, troponin, and two subunits of a Na+/K+ transport ATPase) and the majority (99 proteins) consisted of morphogenic, growth factor and signal transduction proteins (33 HOX/PAX genes, 11 frizzled and other receptors, 9 FGF and other growth factors, various transcription factors, cyclases, kinases, etc.). In comparison, Pyphy returned 53 trees that matched the selection criteria. Of these, 8 had no discernible annotation, 11 seemed unlikely to be involved in development or regulation (laminin, axin, HSP, UQ-conjugator, etc.) and 34 were growth factors or signal transduction proteins (Frizzled, Hox, Pax, G-proteins, growth factors, etc.). The PhyloGenie analysis provides support for the genome duplication hypothesis advanced by Dingerkus and Howell (11) by more than doubling the number of supporting clades.

    The analysis also suggests that a subsequent massive loss of non-morphogenic genes may have occurred. However, the Danio rerio and G.gallus genomes have not yet been completely determined. The large number of morphogenic and regulatory proteins we observe may therefore reflect, at least in part, a historic bias of molecular genetics towards development and cell cycle regulation. Support for this view comes from the observation that, if only two of the three tetrapod species are required, PhyloGenie and Pyphy return 351 and 118 trees, respectively, for nodes containing mouse and chicken, 331 and 141 trees for nodes with man and chicken, and 630 and 292 trees for nodes with man and mouse. It will only be possible to form an exact picture of the number and types of genes showing this 2:1 ratio once completed genomes are available for a wide range of tetrapod and actinopterygian species.

    CONCLUSIONS

    We have introduced a new suite of programs for the generation and analysis of phylomes (PhyloGenie) and have compared its performance with that of a related software tool (Pyphy) on two previously studied phylogenetic problems. On attempting to detect LGTs in prokaryotic organisms, both methods seem to perform comparably. This ceases to be the case when analyses are attempted for which Pyphy was not designed, such as examining paralogous sequence relationships or sequence clades encompassing more than the immediate sequence relatives. In these cases, restrictive settings limit the ability of Pyphy to detect all relevant sequences. With regard to tree analysis, Pyphy is built to detect LGTs in a genome and graphically display the results. It does not support querying of more complex sequence relationships. In contrast, PhyloGenie is fully configurable in all parameters relating to sequence acquisition, alignment, and tree construction, and has the ability to filter the resulting database of trees for complex, user-defined tree topologies.

    Automated methods are powerful, but also have drawbacks. The search and alignment parameters used for generating phylomes rely on assumptions and prior knowledge that may introduce errors or systematic biases. It is therefore essential to manually re-evaluate, for biological relevance, the steps and results between seed sequence and the phylogenies of interest. The problems encountered by PhyloGenie in the example analyses were mostly due to suboptimal search parameters that caused some alignments to contain large coiled-coil or low-complexity regions (possibly convergently evolved features that mislead phylogenetic inference), and to splice isoforms or gene fragments that complicate automatic phylogenetic analysis. In addition, sampling bias, alignment errors, mutational saturation, long-branch attraction, methodological artifacts and differential gene loss can also account for the atypical placement of species in a tree. The results produced by PhyloGenie should therefore not be seen as the endpoint of an analysis but rather as the first step in reducing the number of genes or alignments for which more time-consuming, in-depth analyses would need to be performed before conclusions can be drawn with confidence.

    SUPPLEMENTARY MATERIAL

    Supplementary Material is available at NAR Online.

    ACKNOWLEDGEMENTS

    We wish to thank the referees for many helpful and constructive comments.

    REFERENCES

    1. Altschul,S.F., Madden,T.L., Schäffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    2. Koski,L.B. and Golding,G.B. (2001) The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol., 52, 540–542.

    3. International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    4. Stanhope,M.J., Lupas,A.N., Italia,M.J., Koretke,K.K., Volker,C. and Brown,J.R. (2001) Phylogenetic analyses do not support horizontal gene transfers from bacteria to vertebrates. Nature, 411, 940–944.

    5. Salzberg,S.L., White,O., Peterson,J. and Eisen,J.A. (2001) Microbial genes in the human genome: lateral transfer or gene loss? Science, 5523, 1903–1906.

    6. Roelofs,J. and Van Haastert,P.J. (2001) Genes lost during evolution. Nature, 411, 1013–1014.

    7. Sicheritz-Ponten,T. and Andersson,S.G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res., 15, 545–552.

    8. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.-C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I., Pilbout,S. and Schneider,M. (2003) The Swiss-Prot protein knowledgebase and its supplement TREMBL in 2003. Nucleic Acids Res., 31, 365–370.

    9. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

    10. Ruepp,A., Graml,W., Santos-Martinez,M.L., Koretke,K.K., Volker,C., Mewes,H.W., Frishman,D., Stocker,S., Lupas,A.N. and Baumeister,W. (2000) The genome sequence of the thermoacidophilic scavenger Thermoplasma acidophilum. Nature, 407, 508–513.

    11. Dingerkus,G. and Howell,W.M. (1976) Karyotypic analysis and evidence of tetraploidy in the North American paddlefish, Polyodon spathula. Science, 194, 842–844.

    12. Eddy,S.R. (1996) Hidden Markov Models. Curr. Opin. Struct. Biol., 6, 361–365.

    13. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.

    14. Koretke,K.K., Lupas,A.N., Warren,P.V., Rosenberg,M. and Brown,J.R. (2000) Evolution of two-component signal transduction. Mol. Biol. Evol., 17, 1956–1970.

    15. Schmidt,H.A., Strimmer,K., Vingron,M. and von Haeseler,A. (2002) TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics, 18, 502–504.

    16. Swofford,D.L. (1998) PAUP*, Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4.0. Sinauer Associates, Sunderland, MA.

    17. Taylor,J.S., Braasch,I., Frickey,T., Meyer,A. and Van de Peer,Y. (2003) Genome duplication, a trait shared by 22 000 species of ray-finned fish. Genome Res., 13, 382–390.

    18. Ohno,S. (1970) Evolution by Gene Duplication. Springer-Verlag, New York, NY.

    19. Stephens,S.G. (1951) Possible significance of duplications in evolution. Adv. Genet., 4, 247–265.

    20. Amores,A., Force,A., Yan,Y.-L., Joly,L., Amemiya,C., Fritz,A., Ho,R.K., Langeland,J., Prince,V., Wang,Y.-L., Westerfield,M., Ekker,M. and Postlethwait,J.H. (1998) Zebrafish hox clusters and vertebrate genome evolution. Science, 282, 1711–1714.

    (Tancred Frickey and Andrei N. Lupas*)