Gene and Genome Duplication in Acanthamoeba polyph
     Information Génomique et Structurale, UPR CNRS 2589, 31 Chemin Joseph-Aiguier, 13402 Marseille Cedex 20, France

    ABSTRACT

    Gene duplication is key to molecular evolution in all three domains of life and may be the first step in the emergence of new gene function. It is a well-recognized feature in large DNA viruses but has not been studied extensively in the largest known virus to date, the recently discovered Acanthamoeba polyphaga Mimivirus. Here, I present a systematic analysis of gene and genome duplication events in the mimivirus genome. I found that one-third of the mimivirus genes are related to at least one other gene in the mimivirus genome, either through a large segmental genome duplication event that occurred in the more remote past or through more recent gene duplication events, which often occur in tandem. This shows that gene and genome duplication played a major role in shaping the mimivirus genome. Using multiple alignments, together with remote-homology detection methods based on Hidden Markov Model comparison, I assign putative functions to some of the paralogous gene families. I suggest that a large part of the duplicated mimivirus gene families are likely to interfere with important host cell processes, such as transcription control, protein degradation, and cell regulatory processes. My findings support the view that large DNA viruses are complex evolving organisms, possibly deeply rooted within the tree of life, and oppose the paradigm that viral evolution is dominated by lateral gene acquisition, at least in regard to large DNA viruses.

    INTRODUCTION

    It has long been realized that new gene material frequently emerges through gene and genome duplication (25, 26). The precise mechanisms of these events are diverse, each leaving its own particular signature in the genome (for a recent review, see reference 36). Once a gene has been duplicated, it may be subject to three different types of fate: nonfunctionalization, where one of the two copies of a duplicate pair degenerates into a pseudogene and may subsequently be lost from the genome (18, 19); subfunctionalization, which consists of the division of the original functions of the ancestral gene between the two duplicates (9); and neofunctionalization, where one copy in a duplicate pair acquires a new function (37). Eventually, divergent evolution may lead to a point where homologies between two genes of common ancestry become difficult or impossible to detect (11). The unexpectedly small structural variation between different protein families that has been unveiled by the recent large structural genomics efforts (31) corroborates this observation, suggesting that the prevalence of gene duplication in all three domains of life (36) is even larger than previously thought.

    The recent discovery (16) and subsequent genome sequencing (29) of the largest known virus to date, Acanthamoeba polyphaga Mimivirus, has raised a number of fundamental questions about what had been thought to be established boundaries between viruses and cellular life forms (5, 7, 14). In particular, the size of the mimivirus virion is comparable to that of a mycobacterium. Its genome, containing close to 1.2 million nucleotides (nt) and coding for 911 predicted proteins, holds more than twice as much genetic information as small bacteria find sufficient for life. Moreover, the mimivirus genome hosts a wide spectrum of genes that have never been found in such combination in a virus, in particular, a large set of genes related to protein transcription and translation. On the other hand, what is rather common for a viral genome is the fact that a large fraction of the mimivirus genes display only weak or no homology to any other known genes in the databases. Raoult et al. (29) were able to assign putative functions to only one-third (298/911) of the mimivirus genes, while this ratio is much higher for the genomes of all fully sequenced "living" organisms.

    Here, we set out to investigate the question of how many of these genes of unknown origin may have been generated through duplication processes within the mimivirus genome itself and how these duplications may then have shaped the mimivirus genome. The aim of this work was to identify and characterize events of gene and genome duplication in the mimivirus genome in order to shed new light on the origin of the mimivirus' exceptionally large size and on the importance of gene duplication in large DNA viruses in general. I report evidence for an ancient event of duplication of a large part of the mimivirus chromosome, as well as for numerous tandem gene duplication events, and I will show that some of these duplication events may play a role in virus-host adaptation.

    MATERIALS AND METHODS

    Detection of paralogous genes was performed using programs from the BLAST package (1). For the detection of paralogous families, each of the 911 mimivirus genes was used to initiate a BLASTP search, followed by one or several PSI-BLAST iterations until convergence. For the identification of homologous genes, all mimivirus genes were compared to each other using BLASTP, where only the highest-scoring match passing a defined e-value cutoff was retained (best unidirectional match criterion). To test for possible dependence on the choice of the e-value threshold, three different e-values, 10^-5, 10^-10, and 10^-25, were used. If not otherwise stated, 10^-10 is used as the reference e-value throughout this paper.
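    The best unidirectional match criterion can be illustrated with a minimal sketch. It assumes that the BLASTP results are available in the standard 12-column tabular format (query, subject, ..., e-value, bit score); the function name, the gene names, and the sample rows are hypothetical and are not taken from the analysis itself.

```python
import csv
import io

def best_unidirectional_matches(tabular_text, evalue_cutoff=1e-10):
    """For each query, keep the highest-scoring non-self hit whose
    e-value is at or below the cutoff. Returns {query: (subject, evalue, bitscore)}."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if query == subject or evalue > evalue_cutoff:
            continue  # ignore self-hits and hits failing the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Invented example rows (placeholder gene names, made-up scores).
SAMPLE_ROWS = [
    ["geneA", "geneA", "100.0", "300", "0", "0", "1", "300", "1", "300", "0.0", "600"],
    ["geneA", "geneB", "52.1", "290", "139", "2", "5", "294", "3", "292", "2e-45", "170"],
    ["geneA", "geneC", "27.0", "250", "182", "4", "10", "259", "8", "257", "1e-12", "70"],
    ["geneD", "geneE", "24.5", "200", "151", "3", "1", "200", "1", "200", "3e-07", "50"],
]
SAMPLE = "\n".join("\t".join(r) for r in SAMPLE_ROWS)

if __name__ == "__main__":
    for cutoff in (1e-5, 1e-10, 1e-25):
        print(cutoff, best_unidirectional_matches(SAMPLE, cutoff))
```

    Running the function once per cutoff mirrors the threshold test described above: a query drops out of the result as soon as none of its non-self hits passes the chosen e-value.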

    Remote protein homology detection was done by pairwise Hidden Markov Model (HMM) comparison using the HHsearch package (32), together with HMMs based on multiple alignments from the conserved domain database CDD (20), i.e., COG (33), SMART (17), PFAM (3), and SCOP (22). Multiple alignments of the paralogous genes were computed using the latest version of the T-Coffee package with advanced alignment options (23, 27, 28). Secondary-structure predictions from PSIPRED (13) were included in the HMM-HMM comparison as described previously (32). Results of the HMM search and multiple alignments are available at http://igs-server.cnrs-mrs.fr/suhre/mimiparalogues/.

    Genome sequences of all fully sequenced viral genomes (as of November 2004) were downloaded from the National Center for Biotechnology Information viral genomes project (2) at http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html. All 223 genomes with more than 50 annotated genes were included in the analysis.

    RESULTS

    One-third of the mimivirus genes have at least one paralogue in the genome. We compared all 911 predicted mimivirus genes against each other using the sequence alignment software BLAST (1) to identify genes that have significant matches in the genome. The search for paralogous genes was iterated until convergence using position-specific weight matrices constructed from the set of homologous genes found in each previous step as implemented in the PSI-BLAST version of BLAST.
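    As a rough illustration of how pairwise matches translate into families, the sketch below groups genes by single-linkage clustering (connected components of the hit graph). This is a simplification and not the PSI-BLAST procedure itself; the function name and the example pairs are hypothetical.

```python
def group_into_families(pairs):
    """Group genes into families: two genes share a family if they are
    connected through a chain of pairwise hits (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for gene in list(parent):
        families.setdefault(find(gene), set()).add(gene)
    # Largest families first; members sorted for stable output.
    return sorted((sorted(f) for f in families.values()), key=len, reverse=True)

if __name__ == "__main__":
    hits = [("geneA", "geneB"), ("geneB", "geneC"), ("geneD", "geneE")]
    for family in group_into_families(hits):
        print(len(family), family)
```

    Counting the genes and the groups returned by such a function gives the two figures reported for each cutoff, the number of paralogous genes and the number of families.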

    A total of 347 paralogous genes in 77 families were detected by this method when a conservative detection cutoff e-value in the (PSI-)BLAST search of 10^-10 was applied. When a more permissive (10^-5) or a more stringent (10^-25) e-value was used, 398 and 244 paralogous genes in 86 and 58 families, respectively, were detected. Thus, between 26.3% and 35.0% of the mimivirus genes have at least one homologue in the virus' genome, depending on the choice of the e-value cutoff. To test for a possible dependence on gene annotation, the mimivirus genome was split into nonoverlapping segments 1,000 nucleotides in length. These segments were compared to segments of the same size but overlapping each other by 50%, using BLAST at the nucleotide level (BLASTN) and at the amino acid level after translation in all six reading frames (TBLASTX). The results were comparable to those found using BLAST at the gene level (BLASTP) in regard to our conclusions with respect to the overall genome and gene duplications, except that these methods were less sensitive and yielded fewer hits at lower sequence identity levels, especially in the BLASTN case. As these computations did not reveal any unexpected new insights but confirmed the robustness of the approach with respect to the applied detection algorithm (BLASTP), BLASTN and TBLASTX results will not be further presented in this paper.
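    The annotation-independent segmentation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, a toy sequence stands in for the genome, and shorter trailing segments are kept rather than discarded, a detail the text does not specify.

```python
def segments(sequence, size=1000, overlap=0.0):
    """Cut a sequence into windows of `size` nucleotides.
    overlap=0.0 gives nonoverlapping windows; overlap=0.5 gives
    windows that each share half their length with the next one."""
    step = int(size * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    return [(start, sequence[start:start + size])
            for start in range(0, len(sequence), step)]

if __name__ == "__main__":
    toy_genome = "ACGT" * 625  # 2,500 nt placeholder, not real sequence data
    queries = segments(toy_genome)                 # nonoverlapping windows
    targets = segments(toy_genome, overlap=0.5)    # 50% overlapping windows
    print([start for start, _ in queries])
    print([start for start, _ in targets])
```

    The nonoverlapping windows would serve as queries and the overlapping windows as the search database, so that a duplicated region is unlikely to be missed because it straddles a window boundary.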

    The orientation and location of gene duplication events are not random. The mimivirus genome is coded on a linear chromosome that may adopt a circular topology through noncovalent interactions between two 900-nt-long repeated sequences near the chromosome ends, as observed in some other large DNA viruses (29). The fraction of duplicated genes that are inserted in parallel orientation to the coding direction of the matching gene (cis) is, at 20.2% (22.1% for e = 10^-5; 16.4% for e = 10^-25), nearly twice as high as the fraction of genes that are duplicated in antiparallel orientation (trans), which is 11.7% (12.9% for e = 10^-5; 9.95% for e = 10^-25). Sixty-one percent of all pairs of genes that are duplicated in trans are located on different halves of the mimivirus chromosome, whereas 79% of the duplications in cis occur on the same chromosome half. A large number of tandem, or near-tandem, gene duplications were detected, the most striking case consisting of an 11-fold duplication of genes L175 to L185 (dubbed Lcluster here; see below for gene locations and orientations of the largest families of paralogues). The overall trend is that cis duplications are more localized (often tandem or near tandem), while trans duplications are more likely to occur across the chromosome center. This trend becomes visible when corresponding best-matching pairs that are duplicated in cis and trans, respectively, are connected (Fig. 1).
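    The cis/trans and same-half/different-half tallies can be sketched as below. The sketch assumes that the strand is read from the leading L or R of the gene name and that the chromosome center lies at half of an approximate genome length of 1.2 million nt; the function names, gene names, and coordinates are hypothetical placeholders.

```python
from collections import Counter

GENOME_LENGTH = 1_200_000  # approximate; the text gives "close to 1.2 million" nt

def orientation(gene_a, gene_b):
    """cis if both genes are coded on the same strand (same leading letter)."""
    return "cis" if gene_a[0] == gene_b[0] else "trans"

def same_half(pos_a, pos_b, genome_length=GENOME_LENGTH):
    """True if both positions fall on the same side of the chromosome center."""
    midpoint = genome_length / 2
    return (pos_a < midpoint) == (pos_b < midpoint)

def summarize(pairs, positions, genome_length=GENOME_LENGTH):
    """Count best-match pairs by orientation and relative location."""
    counts = Counter()
    for a, b in pairs:
        if same_half(positions[a], positions[b], genome_length):
            location = "same half"
        else:
            location = "different halves"
        counts[(orientation(a, b), location)] += 1
    return counts

if __name__ == "__main__":
    # Placeholder genes and coordinates, invented for illustration only.
    positions = {"L_a": 50_000, "L_b": 52_000, "R_c": 1_150_000}
    pairs = [("L_a", "L_b"), ("L_a", "R_c")]
    for key, n in summarize(pairs, positions).items():
        print(key, n)
```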

    Evidence for a segmental duplication of a large telomeric chromosome fraction. Figure 2 shows a zoom into the "telomeric" regions of the mimivirus chromosome. Remnants of chromosomal synteny can be identified between 5' position 0 to 5' position 110,000 and the corresponding 3' position 110,000 to 3' position 220,000, and also between 5' 120,000 to 5' 200,000 and 5' 0 to 5' 80,000. Overlapping with these is a synteny between 5' 20,000 to 5' 110,000 and 3' 0 to 3' 100,000. The exact history of this segmental genome duplication event(s) is difficult to reconstruct, as it is overlaid by numerous local cis-duplication events and no information is available on potential gene deletions in this context. One parsimonious explanation could be a segmental duplication of a 200,000-nt-long telomeric chromosome fraction, followed by a rearrangement (immediately or later) around its center. Interestingly, three tRNA-Leu genes are found duplicated in concert with this event(s). They are highly conserved (displaying only four point mutations), while the adjacent genome regions accumulated such a large number of mutations that homology at the nucleotide level has become difficult to identify. Figure 3 shows the frequency distribution of all gene duplication events. A pronounced maximum for trans duplications is observed at a sequence identity level of 25%, which characterizes the segmental gene duplication as a more ancient event. cis duplications also peak at this value and are likely to correspond to older tandem duplication events. A second pronounced maximum at the 50% sequence identity level for cis duplications suggests a more recent origin for the corresponding tandem duplications (i.e., the Lcluster).

    Duplicated genes can be used to detect remote homologies and to improve on functional gene annotation. In every genome-sequencing project, the question of how to annotate putative genes has to be addressed. It is standard procedure to compare all predicted genes to existing annotated databases (e.g., SWISS-PROT [4] or the nonredundant protein database at the National Center for Biotechnology Information) using sequence-to-sequence comparison tools, most often BLAST. More sensitive methods, which also allow the identification of more remote functional relationships, are based on sequence-to-profile comparison. These include tools like reverse position-specific BLAST (rpsBLAST) (1) and hmmer (8), which compare a query gene to an annotated aligned gene family rather than to a single gene (see Materials and Methods for a selection of generally used protein family databases). Depending on the quality of the resulting hits, manual quality checks and further refinement are done, usually based on multiple alignments and possibly phylogenetic-tree reconstructions, in order to verify the predicted orthologies to a gene or gene family of known function. The result of this procedure is what is commonly known as the "GenBank annotation" of a genome. In the case of mimivirus (and this is true for all virus genome-sequencing projects), no function could be attributed convincingly to a large number of genes using this procedure. These genes are thus annotated as "hypothetical," supplemented in some cases by a description of a generic feature of that gene, such as a specific type of repeat (ankyrin repeat, triple-helix collagen repeat, or leucine-rich repeat).

    However, in a case where multiple copies of a gene are found in the genome, the idea of using profile or HMM search methods can be taken a step further. Different methods of this type have recently been developed (30, 32, 35). They allow the comparison of an aligned set of genes (the paralogous genes) to a database of annotated profiles, or HMMs, with much higher sensitivity than sequence-to-sequence and sequence-to-profile comparisons. Here, we use the HHsearch software (32), which, in addition to HMM-HMM comparison, evaluates the correspondence between the predicted secondary protein structure of the query protein and those of the potential hits (using observed structure information from the Protein Data Bank where available). The result of an HHsearch for a single family of paralogues is then a list of hits, ranked by the probability that a hit is a true positive. For all families of paralogues, these results, together with the corresponding multiple alignments that were used to build the HMMs, are available at http://igs-server.cnrs-mrs.fr/suhre/mimiparalogues/. This data set may serve as a starting point for further analysis of a given mimivirus paralogue family.

    Some of the larger paralogous families are related to virus-host interactions. Figure 4 (left) shows the positions of all paralogous genes by their positions on the chromosome. Hot spots of local tandem duplication activities can be detected and are particularly pronounced for the gene family N172 (Lcluster). A clustered view of all genes is given in Fig. 4 (right). By far the largest paralogous gene family (N14), with 66 members, contains the ankyrin double-helix repeat proteins (L14 L22 L23 L25 L36 L42 L45 L56 L59 L62 L63 L66 L72 L88 L91 L93 L99 L100 L109 L112 L120 L121 L122 L148 R229 R267 L279 L482 L483 R579 L589 R600 R601 R602 R603 R634 L675 L715 R760 R777 R784 R787 R789 R791 R797 R810 R825 R835 R837 R838 R840 R844 R845 R846 R847 R848 L863 L864 R873 R875 R880 R886 R896 R901 R903 R911). (In these lists, genes are numbered in increasing order by their positions on the linear mimivirus chromosome. The letter L indicates genes that are transcribed to the left [negative strand], and the letter R stands for genes transcribed to the right [positive strand]. Tandem [cis] duplications can be identified by successive numbering and identical letters [e.g., L121 and L122 are adjacent genes that are coded on the same strand].) Ankyrin repeat-containing proteins are ubiquitously found in large paralogous families in both viral and bacterial genomes. These genes are thought to play structural roles in the cell and are not discussed further here.

    The second-largest family (N35) contains 26 genes (L35 L49 L55 R61 L67 L76 L85 L89 L98 L107 R154 R224 R225 L272 L344 R731 R738 R739 R765 R773 L783 L786 L788 R830 L834 R842) that are all annotated as unknown (4 of them contain WD repeats). However, using remote-homology detection methods, together with advanced multiple-alignment techniques (see Materials and Methods), I found that all of these proteins contain a common, 170 amino-acid-long N-terminal domain that clearly matches the BTB/POZ domain. The BTB/POZ domain mediates homomeric dimerization, and in some instances heteromeric dimerization. POZ domains from several zinc finger proteins have been shown to mediate transcriptional repression and to interact with components of histone deacetylase corepressor complexes. The best matches to proteins with known structure are the promyelocytic leukemia zinc finger protein (PDBid 1buo) and the B-cell lymphoma 6 protein (PDBid 1r28). The genes from the N35 paralogue family are thus likely to play a role in transcriptional regulation.

    The third-largest cluster (N172; Lcluster) (L172 L174 L175 L176 L177 L178 L179 L180 L181 L182 L183 L184 L185 L697) is also the most exceptional in regard to its 12-fold tandem repeat of proteins. A multiple alignment of these genes indicates that they code for real proteins and that these proteins are likely under selective pressure. For instance, the amino acid type is often conserved within aligned columns, and stretches without any insertions and deletions are followed by indel-rich regions (signatures of structure elements and loop regions, respectively). However, no clear function could be attributed to this cluster, and it has no significant match outside the mimivirus genome. The highest-scoring hits from remote-homology detection, albeit well below certainty levels in regard to the probability that these are true positives, are sometimes linked to interaction with RNA.
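    The naming convention (L or R for the strand, followed by the gene's rank along the chromosome) makes tandem duplications easy to spot programmatically. The following sketch, with a hypothetical function name, extracts runs of successively numbered genes on the same strand from the N172 list given above; adjacency in the numbering is used as a proxy for physical adjacency.

```python
import re

# Gene list of family N172 (Lcluster) as given in the text.
N172 = "L172 L174 L175 L176 L177 L178 L179 L180 L181 L182 L183 L184 L185 L697"

def tandem_runs(gene_list):
    """Return runs of two or more genes with successive numbers and the
    same strand letter, i.e., candidate tandem (cis) duplications."""
    genes = [(m.group(1), int(m.group(2)))
             for m in re.finditer(r"([LR])(\d+)", gene_list)]
    genes.sort(key=lambda g: g[1])  # order by position on the chromosome
    runs, current = [], []
    for strand, number in genes:
        if current and strand == current[-1][0] and number == current[-1][1] + 1:
            current.append((strand, number))  # extends the current run
        else:
            if len(current) > 1:
                runs.append(current)
            current = [(strand, number)]      # start a new run
    if len(current) > 1:
        runs.append(current)
    return [[f"{s}{n}" for s, n in run] for run in runs]

if __name__ == "__main__":
    for run in tandem_runs(N172):
        print(len(run), run[0], "to", run[-1])
```

    For the N172 list, this yields a single run of 12 successively numbered genes on the same strand, from L174 to L185, with L172 and L697 as isolated members.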

    The cluster N165 (L60 L162 L165 L166 L167 L168 L170 R286 L414 L415), which is found close to the Lcluster, also contains only genes that are annotated as unknown, most of them containing several Pfam FNIP repeats. Again, using remote-homology detection, we can identify an N-terminal domain that matches the Pfam F-box domain, which is a receptor for ubiquitination targets. This relatively conserved structural motif is present in numerous proteins and serves as a link between a target protein and a ubiquitin-conjugating enzyme. The SCF complex (i.e., Skp1-Cullin-F-box) plays a role similar to that of an E3 ligase in the ubiquitin protein degradation pathway. Different F-box proteins as a part of the SCF complex recruit particular substrates for ubiquitination through specific protein-protein interaction domains. Interestingly, several copies of ubiquitin-conjugating enzymes are also present in the mimivirus genome (e.g., gene L460), as well as a ubiquitin-specific protease (R319). Thus, the genes in cluster N165 can be predicted to play a role in protein degradation using the ubiquitin pathway.

    About cluster N226 (L226 L228 R734 L764 L766 L767 L768 L769 L774), little can be said at present. Cluster N232 (L232 L268 R436 R517 L670 L673 R818 R826 R831), on the other hand, contains genes that are predicted to encode protein kinases and that may thus play roles in different cell regulatory processes.

    Other notable families, not discussed in more detail here, are family N137, which contains proteins with glycosyltransferase domains; family N105, with remote homologies to potassium channel tetramerization domains; and families N73 and N430, which are similar to yeast and poxvirus transcription factors, respectively. Other interesting families that invite further investigation are N425, which contains the major capsid protein, and the family pair N79 (transposase)/N80 (site-specific integrase-resolvase), which contains three adjacent pairs of transposase/resolvase genes (L79/R80, R104/L103, L770/R771), as well as N238 (L71 R196 R238 R240 R241 L668 L669), which contains collagen triple helix repeats.

    DISCUSSION

    The ancient segmental duplications and massive ongoing individual gene duplications in mimivirus described here are consistent with the postulated early evolutionary origins of this virus (21, 24, 29). They explain the origin of a large part of the mimivirus genome without the need for disproportionate gene acquisition through horizontal gene transfer from a host organism, as is commonly thought to be the case for smaller viruses. In fact, the gene duplication rate of mimivirus, at 38%, lies well within the range of prevalence in the three domains of life, e.g., 17% for Haemophilus influenzae, 44% for Mycoplasma pneumoniae, 30% for Archaeoglobus fulgidus, 30% for Saccharomyces cerevisiae, 38% for Homo sapiens, and 65% for Arabidopsis thaliana (reference 36 and references therein). In this context, it is interesting that Ogata et al. (24) recently showed that horizontal gene transfer in mimivirus is no more elevated than what is detected in bacteria.

    Using multiple alignments, together with remote-homology detection methods based on Hidden Markov Model comparison, I attribute putative functions to some of the larger paralogous gene families. These attributions indicate that a number of these duplicated mimivirus genes are likely to interfere with important host processes, such as transcription control, protein degradation, and different cell regulatory processes. The toleration and fixation of such important genome expansions under selective conditions may be explained by mimivirus' particular lifestyle, that is, the fact that mimivirus mimics a microbial prey to its amoebal "predator" in order to enter its host by phagocytosis. Thus, in order to represent an interesting prey for the amoeba, mimivirus has to maintain bacterial size (15) and can thus more easily tolerate a large genome size than its smaller cousins. With this constraint comes the evolutionary advantage of being able to host a larger spectrum of genes capable of interfering with host defenses, very much in contrast to the situation of small viruses that are optimized for rapid and economic replication and that survive with a rather minimal gene set (for a detailed discussion, see reference 5). Interestingly, if the same detection algorithm is applied to other large DNA viruses, a log-linear trend becomes visible between the number of paralogous genes and the gene content of the genome (Fig. 5).

    It is interesting that the larger families of proteins that are frequently repeated in tandem contain functions that are likely to play a role in virus-host interactions. Notable examples are the protein kinase family (N232), which may interfere with the host signaling network or other regulatory processes; the F-box-containing cluster (N165), which may tag selected host proteins for destruction through the ubiquitin pathway; and the zinc finger (BTB/POZ) family (N35), which may interfere with host transcription regulation. Unsurprisingly for such a large virus, two other large mimivirus families of paralogues seem to play more structural roles, that is, the largest family of all, the ankyrin repeat-containing proteins, and the collagen triple-helix-containing repeat proteins. The two families N172 (Lcluster) and N226 are particularly intriguing, since no putative function could be associated with these genes. The families are exceptionally well clustered and have undergone more recent duplications. It may therefore be speculated that they are related to more recent and novel function acquisitions that may be specific to the lineage Acanthamoeba polyphaga Mimivirus.

    Searching the Sargasso Sea environmental genome shotgun-sequencing data set (34), Ghedin and Claverie (10) detected the presence of close relatives of mimivirus in this marine environment. While a large number of the mimivirus genes are found to have a BLAST hit to this data set, none of the genes from the N172 and N226 clusters (with the exception of a spurious match for gene L177) are found in the Sargasso Sea data set. This may be an indication of a more recent emergence of these two families.

    The large fraction of viral genes that exhibit no or only remote homology to genes in any other organism, including different viruses (12), is commonly attributed to an assumed faster evolution of viral genes than their bacterial and eukaryotic counterparts. If this assumption is correct, the genes of the two families N172 and N226 may have evolved from an ancient ancestor to a point where no similarity at the sequence level to their orthologues in other genomes can be detected. Determining the three-dimensional structures of members of these (and other) families may therefore answer the question of the origin of these at present mimivirus-specific genes. Comparing the structures of different paralogues may then contribute more generally to our understanding of the evolution of viral genes, as they have evolved in a unique environment in a single genome context, i.e., in a situation where differences in G+C content or constraints related to metabolic differences due to the availability of different amino acids need not be considered.

    I believe that gene and genome duplications in large DNA viruses can be analyzed much as is currently done for members of the other three domains of life. For example, reconstructing duplication history has received extensive attention recently. Zhang et al. (38) present a method for inferring the duplication history of tandem-repeat sequences that may be readily applied to mimivirus tandem gene duplications. Davis and Petrov (6) demonstrated that genes that have generated duplicates in the Caenorhabditis elegans and S. cerevisiae genomes were 25% to 50% more constrained prior to duplication than the genes that failed to leave duplicates. They further showed that conserved genes have been consistently prolific in generating duplicates for hundreds of millions of years in these two species, that is, that the set of duplicate genes is biased. This observation may allow us to narrow the range of putative roles of the duplicated mimivirus genes whose functions are still completely unknown.

    My analysis shows that a large fraction of the mimivirus genes originated from repeated tandem gene duplications and from segmental genome duplication events, the order of magnitude of the duplications being comparable to what is commonly observed in bacteria, archaea, and eukaryotes. This is compatible with the view that the large DNA viruses establish a deeply rooted branch on the tree of life rather than representing just a collection of genes gathered during their passage through diverse cellular host organisms (see also the discussion in references 21 and 24).

    ACKNOWLEDGMENTS

    This work has been supported by CNRS and Marseille-Nice Génopole.

    I thank Johannes Söding for assistance with the use of the HHsearch program and acknowledge helpful discussions with my colleagues at the Laboratory IGS, in particular, C. Abergel, S. Audic, G. Blanc, C. Notredame, H. Ogata, and J.-M. Claverie.

    Mailing address: Information Génomique et Structurale, UPR CNRS 2589, 31 Chemin Joseph-Aiguier, 13402 Marseille Cedex 20, France. Phone: 33 4 91 16 46 04. Fax: 33 4 91 16 45 49. E-mail: karsten.suhre@igs.cnrs-mrs.fr.

    REFERENCES

    1. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    2. Bao, Y., S. Federhen, D. Leipe, V. Pham, S. Resenchuk, M. Rozanov, R. Tatusov, and T. Tatusova. 2004. National Center for Biotechnology Information viral genomes project. J. Virol. 78:7291-7298.

    3. Bateman, A., L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138-D141.

    4. Boeckmann, B., A. Bairoch, R. Apweiler, M. C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31:365-370.

    5. Claverie, J. M., H. Ogata, S. Audic, C. Abergel, P. E. Fournier, and K. Suhre. 7 June 2005, posting date. Mimivirus and the emerging concept of "giant" virus. [Online.] http://arxiv.org/abs/q-bio/0506007.

    6. Davis, J. C., and D. A. Petrov. 2004. Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol. 2:E55.

    7. Desjardins, C., J. A. Eisen, and V. Nene. 2005. New evolutionary frontiers from unusual virus genomes. Genome Biol. 6:212.

    8. Eddy, S. R. 1998. Profile hidden Markov models. Bioinformatics 14:755-763.

    9. Force, A., M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531-1545.

    10. Ghedin, E., and J. M. Claverie. 2005. Mimivirus relatives in the Sargasso Sea. Virol. J. 2:62. [Online.] http://www.virologyj.com/content/2/1/62.

    11. Hurles, M. 2004. Gene duplication: the genomic trade in spare parts. PLoS Biol. 2:E206.

    12. Iyer, L. M., L. Aravind, and E. V. Koonin. 2001. Common origin of four diverse families of large eukaryotic DNA viruses. J. Virol. 75:11720-11734.

    13. Jones, D. T. 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292:195-202.

    14. Koonin, E. V. 2005. Virology: Gulliver among the Lilliputians. Curr. Biol. 15:R167-R169.

    15. Korn, E. D., and R. A. Weisman. 1967. Phagocytosis of latex beads by Acanthamoeba. II. Electron microscopic study of the initial events. J. Cell Biol. 34:219-227.

    16. La Scola, B., S. Audic, C. Robert, L. Jungang, X. de Lamballerie, M. Drancourt, R. Birtles, J. M. Claverie, and D. Raoult. 2003. A giant virus in amoebae. Science 299:2033.

    17. Letunic, I., R. R. Copley, S. Schmidt, F. D. Ciccarelli, T. Doerks, J. Schultz, C. P. Ponting, and P. Bork. 2004. SMART 4.0: towards genomic data integration. Nucleic Acids Res. 32:D142-D144.

    18. Lynch, M. 2002. Genomics. Gene duplication and evolution. Science 297:945-947.

    19. Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155.

    20. Marchler-Bauer, A., J. B. Anderson, P. F. Cherukuri, C. DeWeese-Scott, L. Y. Geer, M. Gwadz, S. He, D. I. Hurwitz, J. D. Jackson, Z. Ke, C. J. Lanczycki, C. A. Liebert, C. Liu, F. Lu, G. H. Marchler, M. Mullokandov, B. A. Shoemaker, V. Simonyan, J. S. Song, P. A. Thiessen, R. A. Yamashita, J. J. Yin, D. Zhang, and S. H. Bryant. 2005. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33:D192-D196.

    21. Moreira, D., and P. Lopez-Garcia. 2005. Comment on "The 1.2-megabase genome sequence of mimivirus." Science 308:1114.

    22. Murzin, A. G., S. E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540.

    23. Notredame, C., D. G. Higgins, and J. Heringa. 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302:205-217.

    24. Ogata, H., C. Abergel, D. Raoult, and J. M. Claverie. 2005. Response to comment on "the 1.2-megabase genome sequence of mimivirus." Science 308:1114b.

    25. Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, New York, N.Y.

    26. Ohno, S. 1999. Gene duplication and the uniqueness of vertebrate genomes circa 1970-1999. Semin. Cell Dev. Biol. 10:517-522.

    27. O'Sullivan, O., K. Suhre, C. Abergel, D. G. Higgins, and C. Notredame. 2004. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340:385-395.

    28. Poirot, O., K. Suhre, C. Abergel, E. O'Toole, and C. Notredame. 2004. 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Res. 32:W37-W40.

    29. Raoult, D., S. Audic, C. Robert, C. Abergel, P. Renesto, H. Ogata, B. La Scola, M. Suzan, and J. M. Claverie. 2004. The 1.2-megabase genome sequence of Mimivirus. Science 306:1344-1350.

    30. Sadreyev, R., and N. Grishin. 2003. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. 326:317-336.

    31. Service, R. 2005. Structural biology: a dearth of new folds. Science 307:1555.

    32. Söding, J. 2005. Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951-960.

    33. Tatusov, R. L., N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

    34. Venter, J. C., K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen, D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W. Lomas, K. Nealson, O. White, J. Peterson, J. Hoffman, R. Parsons, H. Baden-Tillson, C. Pfannkoch, Y. H. Rogers, and H. O. Smith. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66-74.

    35. Yona, G., and M. Levitt. 2002. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315:1257-1275.

    36. Zhang, J. 2003. Evolution by gene duplication: an update. Trends Ecol. Evol. 18:292-298.

    37. Zhang, J., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713.

    38. Zhang, L., B. Ma, L. Wang, and Y. Xu. 2003. Greedy method for inferring tandem duplication history. Bioinformatics 19:1497-1504.

    (Karsten Suhre)