Source: 《分子生物学进展》, 2005, No. 1
A General Tendency for Conservation of Protein Length Across Eukaryotic Kingdoms
     * Computational and Evolutionary Genomics, Center for Genomics Research, Academia Sinica, Taipei, Taiwan; and Department of Ecology and Evolution, University of Chicago

    Correspondence: E-mail: whli@uchicago.edu.

    Abstract

    Protein elongation can occur in many ways, such as domain duplication, domain insertion, or recruitment of a transposable element fragment into the coding region, and it is believed to be a general tendency in protein evolution. Indeed, a previous study showed that yeast proteins are, on average, longer than their orthologs in bacteria, and in this study, we found that proteins in yeast, nematode, Drosophila, human, and Arabidopsis are, on average, longer than their orthologs in Escherichia coli. Surprisingly, however, we found conservation of protein sequence length across eukaryotic kingdoms. We collected 1,252 orthologous proteins from yeast, nematode, Drosophila, human, and Arabidopsis and found that the total length of these proteins is very similar among the five species and that there is no general tendency for a protein to increase or decrease in length. Furthermore, although paralogous proteins tend to undergo more sequence-length changes, they also show no general tendency for length increase. However, proteins that are shared by Drosophila and human but not by yeast are, on average, substantially longer than proteins that are shared by yeast, Drosophila, and human. This is a puzzle that begs for an answer.

    Key Words: protein evolution; protein length; orthologous proteins; paralogous proteins; eukaryotes

    Introduction

    It is commonly thought that proteins were originally short and simple but have become longer and more complex during evolution (see Li [1997]). Indeed, analyses of the human proteome have revealed a more complex domain organization of many human proteins in comparison to their homologous invertebrate and yeast proteins (Li et al. 2001; Venter et al. 2001; Kaessmann et al. 2002). An increase in protein length can occur by domain duplication or domain shuffling (see Li [1997], Kinch and Grishin [2002], and Ponting and Russell [2002]). In addition, fragments of transposable elements have been found in a large number of human proteins (Makalowski 2000; Lander et al. 2001; Li et al. 2001; Venter et al. 2001; Hughes and Coffin 2002). Finally, studies using elongation mutagenesis suggested that insertion of a random C-terminal tail could increase protein stability, and, in some cases, insertion of a peptide segment can improve the function of a protein (Matsuura et al. 1999; Chow et al. 2003; Claverie and Ogata 2003). Therefore, it seems that proteins would tend to increase in sequence length during evolution. Indeed, there is evidence that yeast proteins are, on average, longer than bacterial proteins (Zhang 2000).

    On the other hand, an increase in peptide length may increase the energy cost of biosynthesis. A recent study of yeast genes suggested that natural selection favors shorter protein length for efficient synthesis (Akashi 2003). Studies using gene expression data showed strong correlation between codon usage bias and transcript abundance, suggesting that natural selection tends to increase the speed of protein synthesis (Moriyama and Powell 1998; Pal, Papp, and Hurst 2001; Akashi 2003). Also, it has been reported that proteins in the parasitic eukaryote Encephalitozoon cuniculi tend to be shorter than their eukaryotic orthologs (Katinka et al. 2001). In addition, spontaneous deletion tends to occur more often than spontaneous insertion in DNA sequences (de Jong and Ryden 1981). Thus, it is unclear whether in general proteins tend to increase in length.

    Because there seems to be no study of protein sequence-length evolution in eukaryotes, we conducted a genome-wide comparison of protein lengths across eukaryotic kingdoms, taking advantage of the increasing abundance of genomic data in eukaryotes.

    Materials and Methods

    The protein sequence data were obtained from the following sources: E. coli: (NCBI, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K12/, 03/20/2003 version); a total of 4,225 protein sequences were obtained from the database. Yeast (Saccharomyces cerevisiae): (Saccharomyces Genome Database, ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/genomic_sequence/orf_protein/orf_dubious_trans, 2003/08/04 version); 5,878 protein sequences were used. Nematode (Caenorhabditis elegans): (Sanger, ftp://ftp.sanger.ac.uk/pub/wormbase/current_release/, release 115); 22,227 protein sequences were used. Drosophila melanogaster: (Berkeley Drosophila Genome Project, http://www.fruitfly.org/cgi-bin/seq_tools/fasta_download.cgi, release 3); 18,498 protein sequences were used. Human (Homo sapiens): (European Bioinformatics Institute, http://www.ebi.ac.uk/proteome/index.html?http://www.ebi.ac.uk/proteome/HUMAN/), a nonredundant proteome set of SWISS-PROT, TrEMBL, and Ensembl entries; 28,657 protein sequences were obtained from the database. Arabidopsis thaliana: (TIGR, ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/, release 4.0); 28,583 protein sequences were obtained from the database. The eukaryotic orthologous groups of proteins (KOGs) database was downloaded from NCBI (http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi); a total of 4,852 KOGs were obtained.

    The commonly preserved (shared) proteins were identified by using yeast protein sequences as queries in BlastP searches against the protein databases of nematode, Drosophila, human, and Arabidopsis, separately. Orthologous proteins among the different species were defined by two criteria: a BlastP hit satisfying an E-value cutoff of e–10 and an alignable region between the two proteins greater than 50% of the longer protein. If more than one protein met the criteria, the protein with the highest hit score was chosen as the potential ortholog. Next, the bidirectional (reciprocal) best-hit method was applied to each of the other four species, and the proteins that failed to satisfy this criterion were removed from the data set of commonly preserved proteins.
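
    The ortholog criteria described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the Hit record, the function names, and the input layout are hypothetical, and the cutoff is read here as an E value of at most 1e-10, which is an assumption about how "e–10" was applied.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One BlastP hit (hypothetical record, not a BLAST output format)."""
    query: str
    subject: str
    evalue: float
    score: float
    aligned_len: int  # length of the alignable region, in residues


E_CUTOFF = 1e-10    # assumed reading of the "e–10" cutoff
MIN_COVERAGE = 0.5  # alignable region must exceed 50% of the longer protein


def best_hits(hits: List[Hit], lengths: Dict[str, int]) -> Dict[str, Hit]:
    """Keep, for each query, the highest-scoring hit that passes both criteria."""
    best: Dict[str, Hit] = {}
    for h in hits:
        longer = max(lengths[h.query], lengths[h.subject])
        if h.evalue > E_CUTOFF:
            continue
        if h.aligned_len <= MIN_COVERAGE * longer:
            continue
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


def reciprocal_best_hits(forward: List[Hit], backward: List[Hit],
                         lengths: Dict[str, int]) -> Dict[str, str]:
    """Return query -> subject pairs that are each other's best hit."""
    fwd = best_hits(forward, lengths)
    bwd = best_hits(backward, lengths)
    return {q: h.subject for q, h in fwd.items()
            if h.subject in bwd and bwd[h.subject].subject == q}
```

    Under this sketch, a protein would count as commonly preserved only if it appears in the reciprocal best-hit result for every one of the four other species.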

    The ancestral states of the proteins were reconstructed by using the proteins commonly shared by yeast, nematode, Drosophila, human, and Arabidopsis. For each set of orthologous protein sequences, a multiple sequence alignment was obtained using ClustalW with default parameters (Thompson, Higgins, and Gibson 1994). The length of the ancestral sequence was obtained from the multiple sequence alignment by the parsimony principle under the commonly accepted phylogeny of the five species, in which the plant lineage branched off before the divergence between the yeast and animal lineages, and the insects (arthropods) are closer to humans than to nematode worms (the coelomata hypothesis) (see Blair et al. [2002], Hedges [2002], and Hughes and Friedman [2004]). An amino acid position was assumed to be in the ancestral sequence if it was present in at least three species or if the position was present in Arabidopsis and in at least one of the four other species. In this procedure, the Arabidopsis lineage has been given more weight because in the phylogeny used, it is the most divergent among the five eukaryotes studied. If this assumption is wrong and the yeast lineage is actually closer to the plant lineage than to the animal lineage, it should have only a small effect on our inference because the three lineages are likely to be close to a trichotomy. Note that our inference makes no distinction between the coelomata hypothesis and the ecdysozoa hypothesis, which assumes that nematodes and insects form a clade and are equally distant from humans. Therefore, our inference holds under either hypothesis.
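
    The position-counting rule stated above is simple enough to write down directly. The following Python sketch assumes the alignment is given as a dictionary of equal-length gapped strings keyed by species name, with "-" as the gap character; the function name and data layout are hypothetical and are not taken from the original analysis.

```python
from typing import Dict


def ancestral_length(alignment: Dict[str, str],
                     outgroup: str = "Arabidopsis") -> int:
    """Count alignment columns inferred to be present in the ancestor.

    A column is counted if a residue is present in at least three species,
    or in the outgroup plus at least one other species.
    """
    n_cols = len(next(iter(alignment.values())))
    if any(len(seq) != n_cols for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    count = 0
    for col in range(n_cols):
        present = {sp for sp, seq in alignment.items() if seq[col] != "-"}
        if len(present) >= 3 or (outgroup in present and len(present) >= 2):
            count += 1
    return count
```

    For a toy alignment of five sequences, a column present only in yeast and human would not be counted, whereas a column present only in yeast and Arabidopsis would be.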

    The KOGs database includes orthologous and paralogous genes of eukaryotic species. Each group is associated with a conserved and specific function (Tatusov et al. 2003). To compare the sequence lengths of orthologous and paralogous proteins, we used the 1,252 proteins commonly preserved among yeast, nematode, Drosophila, human, and Arabidopsis as queries to search the corresponding KOG databases (a total of 4,852 KOGs) and identified the paralogs of each set of orthologous proteins. Take yeast as an example. By matching accession numbers, we found 1,173 yeast proteins that are both among the 1,252 proteins commonly preserved among the five genomes and present in the KOGs database. These 1,173 proteins formed the set of "orthologous" proteins that could be found in the KOGs database for the five genomes. Then, for each of the 1,173 proteins, we searched the KOGs database to find all other homologous yeast proteins and put them in the set of "paralogous proteins." If there were n yeast paralogous proteins in the set of 1,173 proteins and there were other yeast paralogous proteins in the KOGs database, then all these proteins were clustered into n groups according to the rule of best hits in the BlastP search. If a group had more than one protein, then their average length was used in later analyses.

    We also compared the lengths of the proteins commonly shared among yeast, Drosophila, and human with the proteins that were shared only by human and Drosophila but not yeast, which for simplicity are called derived proteins. The criteria for the BlastP search for commonly shared proteins were the same as described above.

    Protein-length comparison was based on the lengths of polypeptides. Normality of the protein-length distribution was examined by the Kolmogorov-Smirnov (K-S) test; normality was rejected when the P value was smaller than 0.05. In addition, in view of the large variance of protein length, log10 transformation was applied to compress the range and to stabilize the variance, and then the K-S test was applied to test the normality. If the normality of either the original data or the log10-transformed data was accepted, the pairwise t-test was applied; otherwise, the pairwise Wilcoxon rank sum test was used.
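
    The test-selection procedure in this paragraph can be illustrated with the standard library alone. The sketch below is an assumption-laden approximation: it fits the normal distribution to the sample mean and standard deviation and uses the asymptotic Kolmogorov series for the P value, neither of which is specified in the text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

ALPHA = 0.05  # normality is rejected when P < 0.05, as stated in the text


def ks_normal_pvalue(values):
    """Asymptotic one-sample K-S P value against a normal fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0  # largest gap between the empirical and fitted CDFs
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; P is essentially 1
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def choose_test(lengths):
    """Pick the test by the rule in the text: raw data, then log10 data."""
    if ks_normal_pvalue(lengths) >= ALPHA:
        return "pairwise t-test"
    logged = [math.log10(x) for x in lengths]
    if ks_normal_pvalue(logged) >= ALPHA:
        return "pairwise t-test"
    return "pairwise Wilcoxon rank sum test"
```

    Because the normal parameters are estimated from the same data, the P value from this sketch is conservative; a statistics package would normally be preferred in practice.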

    Results

    Longer Proteins in Eukaryotes Than in E. coli

    Using the BlastP search under the criteria described above, we collected the proteins commonly preserved between E. coli and eukaryotes. Here, we compare proteins from E. coli with their orthologous proteins in yeast (547 pairs), nematode (590 pairs), Drosophila (610 pairs), human (681 pairs), and Arabidopsis (870 pairs), separately, using the pairwise Wilcoxon rank sum test (table 1). The results show that the proteins in each of the five eukaryotic genomes studied are, on average, longer than their orthologous counterparts in E. coli. This tendency agrees with the finding of Zhang (2000), who compared yeast proteins with prokaryotic proteins.
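
    For readers unfamiliar with the rank sum statistic, a minimal Python sketch follows. It implements the two-sample Wilcoxon rank sum test with the normal approximation and average ranks for ties, without a tie correction to the variance; whether the original analysis used this exact variant is not stated, so the sketch should be read as an illustration only. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_test(xs, ys):
    """Return (z, two-sided P) for the Wilcoxon rank sum test on two samples."""
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    n1, n2 = len(xs), len(ys)

    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mu) / sigma
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p
```

    A negative z means the first sample tends to hold the smaller values, as would be expected if E. coli lengths were passed first and eukaryotic lengths second.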

    Table 1 Pairwise Wilcoxon Rank Sum Tests of the Lengths of Commonly Shared Proteins Between E. coli and Each of Five Eukaryotes

    Conservation of Protein Length among Eukaryotes

    From the BlastP search, a total of 1,252 proteins commonly shared by yeast, nematode, Drosophila, human, and Arabidopsis were collected and analyzed. Figure 1 shows that the lengths of these 1,252 proteins follow a right-skewed distribution, and the K-S test shows a departure from normality in the tails (P < 1.16 × 10^-6). The total numbers of amino acids in the 1,252 proteins for yeast, nematode, Drosophila, human, and Arabidopsis are 547,044, 538,131, 548,927, 545,027, and 541,636, respectively. Human is thus similar to the four other species in the total length of these commonly shared proteins. In view of the large length variation among proteins, the protein lengths were log10 transformed, but normality still did not hold. Therefore, pairwise Wilcoxon rank sum tests were applied to compare the lengths. Figure 2 shows that yeast, nematode, Drosophila, human, and Arabidopsis have similar average lengths in the commonly shared proteins; all P values of pairwise Wilcoxon rank sum tests are large (0.30), showing no significant difference between any two distributions.

    FIG. 1.— Length distributions of the 1,252 commonly preserved proteins in yeast, human, and Arabidopsis.

    FIG. 2.— Average length and standard error of the 1,252 proteins commonly shared by yeast, nematode, Drosophila, human, and Arabidopsis. The lengths of the ancestral proteins were inferred from the alignments by the parsimony principle.

    Figure 3 shows more detailed comparisons of the human proteins with the ancestral proteins of yeast, nematode, Drosophila, human, and Arabidopsis. The length differences between human and ancestral protein sequences are nearly normally distributed around 0, indicating that there is a nearly equal probability for a protein to increase or decrease in length. Note that 48.7% of the 1,252 human proteins have not changed or have changed less than ±10 amino acids since their separation from the ancestral proteins (fig. 3a) and that 68.4% of these human proteins differ from their ancestral counterparts by less than ±5% of the ancestral sequence lengths (fig. 3b). The same patterns are found for the 1,252 proteins in yeast, nematode, Drosophila, and Arabidopsis (data not shown). These observations suggest that there is no general tendency for eukaryotic proteins to increase in sequence length.
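
    The two summary proportions quoted above (changes of less than 10 amino acids and of less than 5% of the ancestral length) can be computed as in the following Python sketch. The function name, the input layout of (descendant length, ancestral length) pairs, and the strict inequalities are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def conservation_summary(pairs: List[Tuple[int, int]],
                         abs_tol: int = 10,
                         rel_tol: float = 0.05) -> Dict[str, float]:
    """Summarize length changes for (descendant, ancestral) length pairs."""
    n = len(pairs)
    diffs = [desc - anc for desc, anc in pairs]
    return {
        # fraction changed by fewer than abs_tol residues
        "within_abs": sum(abs(d) < abs_tol for d in diffs) / n,
        # fraction changed by less than rel_tol of the ancestral length
        "within_rel": sum(abs(desc - anc) < rel_tol * anc
                          for desc, anc in pairs) / n,
        "longer": sum(d > 0 for d in diffs) / n,
        "shorter": sum(d < 0 for d in diffs) / n,
    }
```

    Roughly equal "longer" and "shorter" fractions would correspond to the symmetric distribution around 0 described in the text.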

    FIG. 3.— Distributions of length differences between human and ancestral protein sequences for the 1,252 proteins commonly shared by yeast, nematode, Drosophila, human, and Arabidopsis. (a) Distribution of the length differences between human and ancestral proteins. (b) Distribution of percentages of length differences between human and ancestral proteins.

    Comparison of Paralogous Proteins

    After separating the 4,852 KOGs into orthologous and paralogous groups, we found 296 pairs (each orthologous versus paralogous group from one KOG was counted as a pair) in yeast, 298 pairs in nematode, 287 pairs in Drosophila, 598 pairs in human, and 730 pairs in Arabidopsis (table 2). All KOG proteins that did not match any of the 1,252 proteins commonly preserved among the five eukaryotes or that contained no paralog were excluded. The length ratios of paralogs to orthologs are shown in table 2. The pairwise Wilcoxon rank sum tests show that there is no significant difference in protein lengths between the paralogs and orthologs in the five eukaryote species studied, except for human. Human paralogous proteins are significantly shorter than their orthologous proteins by 14% (table 2). Because the difference is significant in only one of the five species, it provides at most weak evidence that paralogous proteins tend to be shorter than orthologous or ancestral proteins.

    Table 2 Pairwise Wilcoxon Rank Sum Tests of the Lengths of Orthologous and Paralogous Proteins of Yeast, Nematode, Drosophila, Human, and Arabidopsis

    Although paralogs, like orthologs, show no clear tendency to increase or decrease in sequence length, they have more frequent and larger length changes than do orthologs. This can be seen from a comparison of figure 4 with figure 3. For example, in figure 4, the proportion of proteins that have changed by more than 10 amino acids is approximately 80% instead of approximately 50%, and the proportion of proteins that have changed by more than 5% of their length is approximately 76% instead of approximately 30%. Clearly, the paralogs are more variable in length than the orthologs. Figure 4 shows only the data from human, but the same conclusion holds for data from each of the four other eukaryotic genomes.

    FIG. 4.— Distributions of length differences between orthologous and paralogous proteins of human (598 pairs). (a) Distribution of the length differences between orthologous and paralogous proteins. (b) Distribution of percentages of length differences between orthologous and paralogous proteins.

    Comparison of Young and Old Proteins

    From our BlastP search, we found 2,376 proteins commonly shared by yeast, Drosophila, and human, which are defined as "old" proteins, and 4,077 proteins shared by Drosophila and human but not by yeast, which are defined as "young" or "derived" proteins. (Note that a "young" protein as defined here may not actually be young but may instead reflect a gene loss in the yeast lineage. However, the number of such cases is likely small compared with the total number [4,077]; for example, Krylov et al. [2003] inferred fewer than 500 gene losses in two fungus lineages. Another possibility is that a young protein here might actually be present in the yeast proteome but was missed in the BlastP search because of sequence divergence.) Figure 5 shows that the 2,376 old proteins of yeast, Drosophila, and human are similar in length for the three species, and the 4,077 young proteins shared by Drosophila and human are also similar in length for the two species. These results are consistent with the above observation that protein length tends to be conserved among eukaryotes. When we compare the sizes of old (commonly preserved) and young (newly derived) proteins of Drosophila and human, the young proteins are significantly longer than the old proteins by approximately 22% (table 3). That is, the newly derived proteins tend to be longer.
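
    The old/young partition and the percentage difference in mean length can be expressed compactly. In the Python sketch below, each input dictionary maps a human protein identifier to its ortholog in the named species; the identifiers, function names, and the use of mean length as the summary are hypothetical choices for illustration, not the procedure reported in table 3.

```python
from statistics import mean
from typing import Dict, Set, Tuple


def split_old_young(fly_orthologs: Dict[str, str],
                    yeast_orthologs: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Split human proteins with a Drosophila ortholog into old and young sets.

    "Old" proteins also have a yeast ortholog; "young" (derived) ones do not.
    """
    old = {p for p in fly_orthologs if p in yeast_orthologs}
    young = set(fly_orthologs) - old
    return old, young


def percent_longer(lengths: Dict[str, int], young: Set[str], old: Set[str]) -> float:
    """Percentage by which the mean young length exceeds the mean old length."""
    young_mean = mean([lengths[p] for p in young])
    old_mean = mean([lengths[p] for p in old])
    return 100.0 * (young_mean / old_mean - 1.0)
```

    A human protein with a yeast ortholog but no Drosophila ortholog falls into neither set, which matches the definitions given in the text.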

    FIG. 5.— Average length and standard error of the 2,376 proteins commonly shared by yeast, Drosophila, and human and of the 4,077 (derived) proteins shared by Drosophila and human but not by yeast.

    Table 3 Wilcoxon Rank Sum Tests of Length Differences Between Proteins Commonly Shared by Yeast, Drosophila, and Human and Derived Proteins Present Only in Drosophila and Human

    Discussion

    In the present analysis of protein sequence length, we have not considered the possibility that an alternatively spliced form of a protein rather than the "wild type" form was used in a comparison. This possibility tends to increase the variation in sequence length. Also, in some cases, a paralog rather than an ortholog might have been used, despite the use of stringent criteria in the search for orthologs among genomes. This possibility also tends to increase the variation in sequence length. For these reasons, the sequence-length variation we inferred could have been overestimated, which would strengthen our conclusion on the sequence-length conservation of orthologous proteins among eukaryotes. Note, however, that our methods selected only proteins that have been well conserved in sequence among the five genomes studied. For orthologous proteins that were missed by our methods, length might not have been well conserved.

    Previous studies have suggested that present-day proteins have gone through several stages of evolution and that an evolving polypeptide chain may grow in length by insertion of residues into the chain or to its tail (Lupas, Ponting, and Russell 2001; Aravind et al. 2002; Trifonov and Berezovsky 2003). Empirical studies also suggested that insertion of a peptide segment can occasionally increase protein stability or even improve function (Matsuura et al. 1999; Chow et al. 2003; Claverie and Ogata 2003). Consistently, our results show that the lengths of proteins in eukaryotes are, on average, longer than those of E. coli proteins, extending the result by Zhang (2000). Surprisingly, we found conservation of protein sequence length since the common ancestor of yeast, nematode, Drosophila, human, and Arabidopsis. A simple explanation for the above observations is that although in general protein length had increased during the evolution from prokaryotes to eukaryotes, the length seems to have been largely optimized in the common ancestor of fungi, animals, and plants.

    It has been proposed that the first protein domains evolved by recombination from a limited number of polypeptides (Soding and Lupas 2003), followed by genetic events such as point mutations, insertions, and deletions. Insertion (including internal duplication) expands the protein sequence, providing opportunities for additional function or functional improvement (Matsuura et al. 1999; Trifonov and Berezovsky 2003). A survey of genes in eukaryotes showed that internal duplications have occurred frequently in evolution, sometimes increasing the number of active sites and, thus, enhancing the protein function (see Li [1997]). However, our study showed no general tendency for eukaryotic proteins to increase in length; indeed, the chance for a decrease in length is as high as that for an increase in length. Because only 402 of the 1,252 commonly preserved proteins are encoded by essential (deletion-lethal) genes in yeast, functional constraint may not be the primary factor for the conservation of sequence length. Rather, the conservation in sequence length is probably mainly caused by structural constraint.

    Some studies have indeed pointed to the importance of structural constraint in protein evolution. For example, thermodynamic stability and folding kinetics were shown to exert pressure on the course of protein evolution (Dokholyan and Shakhnovich 2001). The limited diversity of protein domain was also suggested to be caused by structural constraint (Jones et al. 1998; Hou et al. 2003). Furthermore, the study of Yang, Gu, and Li (2003) on the relationship between protein dispensability and the rate of evolution suggested that structural constraints are more important in determining the rate of amino acid substitution in proteins than functional requirement. Finally, as is well known, protein three-dimensional structure studies have shown that protein structure is much better conserved than sequence (Ponting and Russell 2002; Soding and Lupas 2003). Our observation of sequence-length conservation during the evolution of eukaryotes may be explained by protein-structure conservation because changes in sequence length may often cause changes in structure.

    We also note that although paralogous proteins often undergo deletions and insertions, probably because of relaxation in functional constraint, they show no general tendency to increase in sequence length. Also, although there seems to be a tendency for paralogous proteins to decrease in sequence length, the tendency is very weak because it is significant in only one (human) of the five species studied (table 2). This is somewhat surprising in view of the facts that spontaneous deletion occurs more often than spontaneous insertion and that a shorter protein requires a lower cost of biosynthesis.

    Interestingly, "newly evolved" or "derived" proteins are, on average, substantially longer than "old" proteins (fig. 5). It is not clear how this has happened, but we speculate on two possibilities. First, some of these proteins might already have been present in the common ancestor of yeast, Drosophila, and human but gained insertions and underwent a period of rapid amino acid change. Second, some of these proteins could have been derived from duplicate genes that had undergone gene elongation and rapid sequence changes. In both cases, the proteins have undergone so much sequence change that they can no longer be detected by the BlastP search using yeast proteins as queries. In any case, many of the new proteins might have gained new function, partly through sequence elongation. Whether these speculations have any merit requires further investigation.

    Acknowledgements

    We thank Dr. Henry Horng-Shing Lu and Dr. Hung-Mo Sung for help in the analysis and George Zhang for comments. This study was supported by Academia Sinica, Taiwan, and by NIH grants (GM30998 and GM66104) to WHL.

    References

    Akashi, H. 2003. Translational selection and yeast proteome evolution. Genetics 164:1291–1303.

    Aravind, L., R. Mazumder, S. Vasudevan, and E. V. Koonin. 2002. Trends in protein evolution inferred from sequence and structure analysis. Curr. Opin. Struct. Biol. 12:392–399.

    Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.

    Chow, C. C., C. Chow, V. Raghunathan, T. J. Huppert, E. B. Kimball, and S. Cavagnero. 2003. Chain length dependence of apomyoglobin folding: structure evolution from misfolded sheets to native helices. Biochemistry 42:7090–7099.

    Claverie, J., and H. Ogata. 2003. The insertion of palindromic repeats in the evolution of proteins. Trends Biochem. Sci. 28:75–80.

    de Jong, W. W., and L. Ryden. 1981. Causes of more frequent deletions than insertions in mutations and protein evolution. Nature 290:157–159.

    Dokholyan, N. V., and E. I. Shakhnovich. 2001. Understanding hierarchical protein evolution from first principles. J. Mol. Biol. 312:289–307.

    Hedges, S. B. 2002. The origin and evolution of model organisms. Nat. Rev. Genet. 3:838–849.

    Hou, J., G. E. Sims, C. Zhang, and S. H. Kim. 2003. A global representation of the protein fold space. Proc. Natl. Acad. Sci. USA 100:2386–2390.

    Hughes, A. L., and R. Friedman. 2004. Differential loss of ancestral gene families as a source of genomic divergence in animals. Proc. R. Soc. Lond. B Biol. Sci. 271(suppl. 3):S107–109.

    Hughes, J. F., and J. M. Coffin. 2002. A novel endogenous retrovirus-related element in the human genome resembles a DNA transposon: evidence for an evolutionary link? Genomics 80:453–455.

    Jones, S., M. Stewart, A. Michie, M. B. Swindells, C. Orengo, and J. M. Thornton. 1998. Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Sci. 7:233–242.

    Kaessmann, H., S. Zollner, A. Nekrutenko, and W. H. Li. 2002. Signatures of domain shuffling in the human genome. Genome Res. 12:1642–1650.

    Katinka, M. D., S. Duprat, E. Cornillot et al. (17 co-authors). 2001. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414:450–453.

    Kinch, L. N., and N. V. Grishin. 2002. Evolution of protein structures and functions. Curr. Opin. Struct. Biol. 12:400–408.

    Krylov, D. M., Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. 2003. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 13:2229–2235.

    Lander, E. S., L. M. Linton, B. Birren et al. (255 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

    Li, W. H. 1997. Molecular evolution. Sinauer Associates, Sunderland, Mass.

    Li, W. H., Z. Gu, H. Wang, and A. Nekrutenko. 2001. Evolutionary analyses of the human genome. Nature 409:847–849.

    Lupas, A. N., C. P. Ponting, and R. R. Russell. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J. Struct. Biol. 134:191–203.

    Makalowski, W. 2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259:61–67.

    Matsuura, T., K. Miyai, S. Trakulnaleamsai, T. Yomo, Y. Shima, S. Miki, K. Yamamoto, and I. Urabe. 1999. Evolutionary molecular engineering by random elongation mutagenesis. Nat. Biotechnol. 17:58–61.

    Moriyama, E. N., and J. R. Powell. 1998. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res. 26:3188–3193.

    Pal, C., B. Papp, and L. D. Hurst. 2001. Does the recombination rate affect the efficiency of purifying selection? The yeast genome provides a partial answer. Mol. Biol. Evol. 18:2323–2326.

    Ponting, C. P., and R. R. Russell. 2002. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31:45–71.

    Soding, J., and A. N. Lupas. 2003. More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25:837–846.

    Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41–54.

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

    Trifonov, E. N., and I. N. Berezovsky. 2003. Evolutionary aspects of protein structure and folding. Curr. Opin. Struct. Biol. 13:110–114.

    Venter, J. C., M. D. Adams, E. W. Myers et al. (274 co-authors). 2001. The sequence of the human genome. Science 291:1304–1351.

    Yang, J., Z. Gu, and W. H. Li. 2003. Rate of protein evolution versus fitness effect of gene deletion. Mol. Biol. Evol. 20:772–774.

    Zhang, J. 2000. Protein-length distributions for the three domains of life. Trends Genet. 16:107–109.

    (Daryi Wang*, Mufen Hsieh*)