Complex Spliceosomal Organization Ancestral to Extant Eukaryotes
Molecular Biology and Evolution, 2005, issue 4
     Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand

    Correspondence: E-mail: L.J.Collins@massey.ac.nz.

    Abstract

    In higher eukaryotes, introns are spliced out of protein-coding mRNAs by the spliceosome, a massive complex comprising five non-coding RNAs (ncRNAs) and about 200 proteins. By comparing the differences between spliceosomal proteins from many basal eukaryotic lineages, it is possible to infer properties of the splicing system in the last common ancestor of extant eukaryotes, the eukaryotic ancestor. We begin with the hypothesis that, like intron length (which appears to have increased in multicellular eukaryotes), the spliceosome has increased in complexity throughout eukaryotic evolution.

    However, examination of the distribution of spliceosomal components indicates that not only was a spliceosome present in the eukaryotic ancestor but it also contained most of the key components found in today's eukaryotes. All the protein components of the small nuclear ribonucleoproteins (snRNPs) are likely to have been present, as well as many splicing-related proteins. Both major and trans-splicing are likely to have been present, and the spliceosome had already formed links with other cellular processes such as transcription and capping. However, there is no evidence as yet to suggest that minor (U12-dependent) splicing was present in the eukaryotic ancestor.

    Although the last common ancestor of extant eukaryotes appears to show much of the molecular complexity seen today, we do not, from this work, infer anything of the properties of the earlier "first eukaryote."

    Key Words: spliceosome • snRNA • splicing • ancestral eukaryote

    Introduction

    Most genes in higher eukaryotes, such as plants, animals, and some fungi, are interrupted by introns that must be excised precisely from precursor mRNA (pre-mRNA) (Patel and Steitz 2003). Intron removal and the ligation of the coding sequences (exons) occur through two sequential trans-esterification reactions carried out by a massive ribonucleoprotein complex, the spliceosome (Nilsen 2003). The standard spliceosome is made up of five snRNPs (U1, U2, U4, U5, and U6snRNPs), each containing a small RNA bound by several proteins, together with >150 less-stably associated proteins (Jurica and Moore 2003). This makes the spliceosome considerably larger than the ribosome. In addition, the spliceosomal complex has been implicated in other RNA-processing functions such as mRNA capping and the addition of the polyA tail (Lynch and Richardson 2002) and is also closely linked to eukaryotic transcription (Kornblihtt et al. 2004).

    Some introns, snRNAs, and splicing-associated proteins have been characterized in a number of eukaryotes from deeply branching lineages, here collectively called "basal eukaryotes" (Wilihoeft et al. 2001; Archibald, O'Kelly, and Doolittle 2002; Nixon et al. 2002; Collins, Macke, and Penny 2004). Thus, introns and the basic spliceosomal machinery may have arisen early in the eukaryotic lineage and were likely present in the last common ancestor of living eukaryotes, the "eukaryotic ancestor."

    Investigating the distribution of splicing mechanisms and spliceosome components among eukaryotic lineages can reveal how splicing and the spliceosome evolved within eukaryotes. In this study, we investigate three hypotheses of spliceosome evolution.

    The first is that the spliceosome appeared in eukaryotes shortly after the eukaryotic ancestor, possibly by invasion by self-splicing introns (Lynch and Richardson 2002). It is possible under this hypothesis that some eukaryotic lineages do not contain introns or spliceosomal components.

    The second hypothesis is that the eukaryotic ancestor had a basic spliceosome that increased in complexity in multicellular eukaryotes. This increase in complexity through time would parallel that of intron length, which appears to have increased in multicellular eukaryotes (Lynch and Conery 2003). Under this scenario, we could expect to find some, but not many, highly conserved splicing proteins present throughout extant eukaryotes.

    These first two hypotheses are not mutually exclusive in that an invading self-splicing intron could lead to a spliceosome that increased in complexity over time.

    The third hypothesis is that the eukaryotic ancestor contained a spliceosome that is similar in complexity to the spliceosome present in today's eukaryotes, with the expectation that we could find many spliceosomal proteins throughout eukaryotic lineages.

    It is outside the scope of the present work to consider the origin of the eukaryotic ancestor (i.e., the evolution of the first eukaryote) or how ncRNAs and proteins evolved to this point; the question of interest is just which ncRNAs and proteins were likely to have been present. This study takes a parsimonious approach in that the larger the number of deep eukaryotic lineages that contained a feature, the more likely it was that the feature was present in the ancestor of those lineages. An alternative is that a common feature arose independently in each lineage; however, this hypothesis becomes less likely as the number of lineages in which a feature is found increases. Thus, by identifying spliceosomal features in many eukaryotic lineages, we can start to infer the properties of their ancestor.
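
    The parsimony reasoning above can be illustrated with a minimal sketch of Fitch parsimony for a single presence/absence character. This is one common formalization and is not necessarily how the software used in this study treats the data; the four-lineage binary tree, the lineage names, and the tip states are all hypothetical and chosen only to show how the inferred ancestral state shifts as a feature is found in more lineages.

```python
# Toy Fitch parsimony for one presence/absence character (1 = present, 0 = absent).
# The tree and the tip states below are invented for illustration only.

def fitch_state_set(node, tip_states):
    """Return the Fitch state set for `node` (a tip name or a (left, right) tuple)."""
    if isinstance(node, str):
        return {tip_states[node]}
    left, right = (fitch_state_set(child, tip_states) for child in node)
    shared = left & right
    # Keep the states shared by both subtrees; if none are shared, keep all candidates.
    return shared if shared else left | right


def describe_root(state_set):
    """Translate the root state set into a verbal call."""
    if state_set == {1}:
        return "present"
    if state_set == {0}:
        return "absent"
    return "equivocal"


# Hypothetical rooted binary tree over four deep lineages.
toy_tree = (("lineage_A", "lineage_B"), ("lineage_C", "lineage_D"))

found_in_three = {"lineage_A": 1, "lineage_B": 1, "lineage_C": 1, "lineage_D": 0}
found_in_one = {"lineage_A": 1, "lineage_B": 0, "lineage_C": 0, "lineage_D": 0}

print(describe_root(fitch_state_set(toy_tree, found_in_three)))  # present
print(describe_root(fitch_state_set(toy_tree, found_in_one)))    # absent
```

    On this toy tree, a feature found in three of four lineages is reconstructed as present at the root, whereas a feature found in a single lineage is more parsimoniously explained as a later gain.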

    A number of splicing mechanisms occur in eukaryotes. Splicing carried out by the "major" spliceosome (often called U2-type or U2-dependent splicing) is the predominant mechanism in vertebrates, yeasts, and plants. This spliceosome processes introns containing "canonical" splice site characteristics (i.e., 5' splice sites with the "GT" motif and 3' splice sites with the "AG" motif; often referred to as having GT-AG boundaries). This major spliceosome contains the U1, U2, U4, U5, and U6snRNPs and numerous associated proteins mentioned previously. Each snRNP consists of a specific snRNA, several snRNP-specific proteins, and the Sm core proteins (B/B', D1, D2, D3, E, F, and G) (Labourier and Rio 2001). A diagram summarizing the major splicing cycle (based on Gesteland, Cech, and Atkins 1999; Nagai et al. 2001; Valadkhan and Manley 2001) is shown in figure 1 and is briefly described here.

    FIG. 1.— The major spliceosomal cycle. The branch point adenosine is symbolized by "A" and the branch site by a small black box. snRNPs, although a complex of RNA and protein components, are represented as single entities. Other key splicing proteins are also indicated. This diagram has been adapted from Gesteland, Cech, and Atkins (1999) and Jurica and Moore (2003).

    The first step in major splicing is the formation of the prespliceosome complex, in which the U1snRNP (the U1snRNA plus its proteins) binds to the 5' splice site of the intron. The U2snRNP then binds to the branch site, positioned near the 3' end of the intron, resulting in the bulging out of an adenosine residue (the branch site adenosine) from the mRNA (fig. 1: A complex). Independently, the U4snRNP binds to the U6snRNP, and the resulting complex then binds the U5snRNP, forming the U4/U6.U5tri-snRNP. This tri-snRNP then joins the prespliceosome complex to form the B1 complex. During this stage, base pairing between the U4 and U6snRNAs is disrupted and a new base pairing between the U2 and U6snRNAs is established. In addition, the base pairing of the U1snRNA with the 5' splice site is exchanged for base pairing between the U6snRNA and the 5' splice site. After these rearrangements, the U1 and U4snRNPs are released from the spliceosome, forming the B2 complex. In the first catalytic splicing step (C1 complex), the bulged adenosine attacks the 5' splice site, resulting in the formation of a branched (lariat) intron. The second catalytic step (C2 complex) results in the ligation of the two exons (processed mRNA) and excision of the intron lariat (I complex). Spliceosomal components are then released from the intron lariat to be recycled back into the splicing process. The whole assembly-disassembly cycle is then repeated for the next intron.
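
    The ordered assembly just described can be summarized as a small data structure recording which snRNPs are present at each stage. This is a minimal sketch: the names MAJOR_CYCLE and transitions are illustrative, and the membership of the C1 and C2 complexes is assumed to be unchanged from the B2 complex, which the description implies but does not state explicitly.

```python
# snRNPs present at each stage of the major splicing cycle, following the text.
# Assumption: C1 and C2 retain the same snRNPs as B2 (implied, not stated).
MAJOR_CYCLE = [
    ("A", {"U1", "U2"}),                     # prespliceosome
    ("B1", {"U1", "U2", "U4", "U5", "U6"}),  # tri-snRNP has joined
    ("B2", {"U2", "U5", "U6"}),              # U1 and U4 released
    ("C1", {"U2", "U5", "U6"}),              # first catalytic step
    ("C2", {"U2", "U5", "U6"}),              # second catalytic step
]


def transitions(cycle):
    """Yield (from_stage, to_stage, joined, released) for consecutive stages."""
    for (name_a, set_a), (name_b, set_b) in zip(cycle, cycle[1:]):
        yield name_a, name_b, sorted(set_b - set_a), sorted(set_a - set_b)


for step in transitions(MAJOR_CYCLE):
    print(step)
```

    Listing the transitions this way makes the two key rearrangements explicit: the tri-snRNP components join between the A and B1 complexes, and the U1 and U4snRNPs leave between B1 and B2.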

    While the accepted view of ordered assembly has been supported by numerous studies, a number of reports have suggested interactions additional to the proposed chronology of events (Malca, Shomron, and Ast 2003). Although these interactions may be valid, for the purposes of this study, the "standard" spliceosomal cycle will be used as it separates the different stages of splicing and it does not affect the analysis of overall splicing requirements.

    Another class of introns containing noncanonical boundary sequences has been found in jellyfish, insects, animals, and plants and is spliced by a different machinery (Patel and Steitz 2003). The excision of these "minor" class introns is dependent on the U12snRNP and is known as minor, U12-type, or U12-dependent splicing. Minor spliceosomes contain a different set of snRNPs from that used in major splicing. The U11snRNP replaces the U1snRNP, the U12snRNP replaces the U2snRNP, and the U4atac and U6atac snRNPs replace the U4 and U6snRNPs, respectively. Only the U5snRNP is shared between the two spliceosomes. Although the first U12-type introns characterized had AT-AC boundaries (hence the naming of the U4atac and U6atac snRNAs), GT-AG boundaries appear to be more common (Burge, Tuschl, and Sharp 1999).
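
    As a compact restatement of the correspondence between the two spliceosomes, the following sketch maps each major snRNP to its minor counterpart and labels an intron by its terminal dinucleotides. The function names and the toy intron sequences are hypothetical. Because GT-AG boundaries occur in both classes, the boundary label on its own does not identify which spliceosome removes an intron.

```python
# Major-to-minor snRNP counterparts as listed in the text; U5 is shared.
MINOR_COUNTERPART = {
    "U1": "U11",
    "U2": "U12",
    "U4": "U4atac",
    "U5": "U5",
    "U6": "U6atac",
}


def minor_spliceosome(major_snrnps):
    """Return the minor-spliceosome counterparts of the given major snRNPs."""
    return [MINOR_COUNTERPART[name] for name in major_snrnps]


def boundary_class(intron):
    """Label an intron (DNA alphabet, 5' to 3') by its first and last two bases."""
    seq = intron.upper()
    return f"{seq[:2]}-{seq[-2:]}"


print(minor_spliceosome(["U1", "U2", "U4", "U5", "U6"]))
print(boundary_class("gtaagtctaacag"))  # toy sequence, not a real intron
print(boundary_class("atatccttttac"))   # toy sequence, not a real intron
```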

    In general, the minor and major snRNAs are engaged in analogous snRNA-snRNA and snRNA–pre-mRNA interactions such that a similar dynamic network is formed (Schneider et al. 2002). The U4atac-U6atac snRNPs undergo similar base pairing to that in the major spliceosomal U4 and U6snRNPs, forming very similar secondary structures. The main difference is that, unlike the separate binding of the U1 and U2snRNPs to the pre-mRNA, in the minor spliceosome the U11 and U12snRNPs form a stable complex and interact with the pre-mRNA as such. This mechanism is suggested to prevent the formation of mixed spliceosomes (Patel and Steitz 2003). Although there are few U12-type introns in the genome of any given species, their presence in insects, metazoa, and plants (although not so far in Caenorhabditis elegans, Saccharomyces cerevisiae, or another yeast, Schizosaccharomyces pombe) is consistent with the minor spliceosome having occurred in the common ancestor of plants and animals and having been lost from some lineages (Lynch and Richardson 2002; Zhu and Brendel 2003).

    A third form of splicing, SL–trans-splicing (shortened here to trans-splicing), is found in euglenozoa (e.g., Trypanosoma brucei, Euglena gracilis), flatworms (e.g., Echinococcus multilocularis), nematodes (e.g., C. elegans), and the sea squirt, Ciona intestinalis, and is used to process a polycistronic (multigene) pre-mRNA to form multiple mature single-gene transcripts (Tschudi and Ullu 2002). Trans-splicing requires the U2, U4/U6, and U5snRNAs as well as the SL-RNA, joining a small noncoding "miniexon" derived from the SL-RNA to each protein-coding exon in the pre-mRNA. There are mechanistic parallels between trans-splicing and cis-splicing, including the use of the same set of nucleotide sequence features to mark splice sites and structural similarity between SL-RNAs and spliceosomal snRNAs (Vandenberghe, Meedel, and Hastings 2001). These similarities imply an evolutionary relationship between cis-splicing and trans-splicing (Bonen 1993). However, the nature of this relationship is unclear because the phylogenetic distribution of trans-splicing has not yet been fully determined (Vandenberghe, Meedel, and Hastings 2001).

    In order to infer properties of spliceosomes in the ancestral eukaryote, it is necessary to have as much information as possible about the eukaryotic tree. A number of trees representing eukaryotic evolution (Embley and Hirt 1998; Dacks and Doolittle 2001; Simpson and Roger 2002) have been published, but there is still debate as to the placement of many lineages on these trees. This is expected because there are inherent problems associated with reconstructing the deeply diverging lineages. Theoretical studies show that although sequences are excellent for recovering major groups of eukaryotes (or bacteria or archaea), under current models of sequence evolution, primary sequence data should be losing all information about the deepest divergences including the placement of the root of the eukaryotic tree (Mossel 2003; Penny, Hendy, and Poole 2003; Mossel and Steel 2004).

    Our aim in this study is to be as independent as possible of the actual position of the root of the tree of extant eukaryotes. For this reason, we use the tree of Simpson and Roger (2002) to focus our research because this tree (considered as unrooted) can be resolved in a large number of ways (10^6). A recent alternative, with the root placed between the animal-fungi-choanozoa (choanoflagellate) group and all other eukaryotes, has also been advocated (Stechmann and Cavalier-Smith 2002). This rooting is based on a gene fusion between the dihydrofolate reductase and thymidylate synthase genes. These genes are fused in most eukaryotes but not in the animal-fungi-choanozoa group mentioned above. We will return to this interesting possibility in the Discussion but point out two things here. First, this alternative rooting would not affect our conclusions because of our experimental design; our conclusions are robust to many alternative rootings (see later). Second, we are cautious about using just a single gene fusion to root the eukaryote tree because it has been known for some time (Snel, Bork, and Huynen 2000) that separation (fission) of fused genes does occur. Thus, gene fusions are not an irreversible character, and relying on a single event could be premature.

    Given the uncertainty in the deep eukaryotic tree, our approach is to use a tree with all main lineages identified but with little resolution for deep branching order. As mentioned earlier, the tree we use (fig. 2) is based on Simpson and Roger (2002) and allows uncertainty to be taken into account when drawing conclusions. Our strategy is to identify a protein, or group of proteins, on as many lineages as possible.

    FIG. 2.— Eukaryotic phylogenetic tree used in this study (adapted from Simpson and Roger [2002] and http://hades.biochem.dal.ca/Rogerlab/Frontpage/tree_polytomy6.jpg). Species used directly in this study are underlined. The distribution of the different splicing mechanisms (major, minor, and trans) is also shown. "?" indicates branching order uncertainties.

    A large number of proteins have been identified from recent studies on human and yeast spliceosomes (Jurica and Moore 2003). To determine whether any of these proteins were likely to be present in the eukaryotic ancestor, standard protein and nucleotide databases as well as three basal eukaryotic genomes (Plasmodium falciparum, Entamoeba histolytica, and Giardia lamblia) were searched computationally with known human, S. cerevisiae, and S. pombe spliceosomal proteins. The genome of the microsporidian Encephalitozoon cuniculi was also used for searches as it represented a highly reduced genome between the animals and yeast (microsporidia are thought to have branched early within the fungi [Vivares et al. 2002]).

    Materials and Methods

    The eukaryotic species used in this study are given in table 1 and underlined in the phylogenetic tree in figure 2. The genomes of G. lamblia and Ecz. cuniculi were downloaded from the National Center for Biotechnology Information (NCBI) Web site (http://www.ncbi.nlm.nih.gov). The P. falciparum genome was downloaded from PlasmoDB (Bahl et al. 2002), and the Ent. histolytica genome was the one produced by the Pathogen Sequencing Unit at the Sanger Institute (ftp.sanger.ac.uk/pub/pathogens/E_histolytica/). Protein database searches at NCBI started with proteins from human, S. cerevisiae, and S. pombe spliceosomes and used the associated BLink function, which displays the graphical output of precomputed BlastP results against the protein nonredundant database. Protein homologues were also recovered from "KOG" (eukaryotic orthologous groups, a subset of the "clusters of orthologous groups" [COG] database) available at ftp.ncbi.nih.gov/pub/COG/KOG (Koonin et al. 2004).

    Table 1 Eukaryotic Species Used

    Protein homologues were selected if they met one of the following criteria: the protein had been confirmed experimentally, as stated either in the GenPept file itself or in the associated literature (designated "E" in the results tables); it was annotated as being similar in sequence to the query protein (designated "S" in the results tables); or it was annotated as a hypothetical open reading frame (ORF) with a BLink score greater than 300 and a length within 25% of that of the query protein (designated "H" in the results tables). In general, proteins are referred to by their human name, except where the human name is longer and/or more complicated than the corresponding yeast name (e.g., instead of human SF3b14b, we use the S. cerevisiae name Rds3) or where the protein has no human homologue (e.g., Aar2).
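
    The selection criteria can be restated as a short decision function. This is only a sketch under stated assumptions: the function and argument names are hypothetical, the order of precedence (E before S before H) is assumed, and "within 25%" is interpreted as relative to the length of the query protein.

```python
def classify_homologue(experimentally_confirmed=False, annotated_similar=False,
                       hypothetical_orf=False, blink_score=0,
                       hit_length=0, query_length=0):
    """Return "E", "S", or "H" for an accepted database hit, or None if rejected.

    Assumptions (not stated in the text): E takes precedence over S, S over H,
    and the 25% length window is measured against the query protein length.
    """
    if experimentally_confirmed:
        return "E"
    if annotated_similar:
        return "S"
    length_ok = abs(hit_length - query_length) <= 0.25 * query_length
    if hypothetical_orf and blink_score > 300 and length_ok:
        return "H"
    return None


# Hypothetical hit: an unannotated ORF of 900 residues against a 1,000-residue query.
print(classify_homologue(hypothetical_orf=True, blink_score=350,
                         hit_length=900, query_length=1000))
```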

    Genomic searches generally used the tBlastN program (version 2.2.5, BLOSUM62 matrix) (Altschul et al. 1997). Generally human, S. cerevisiae, S. pombe, and Arabidopsis thaliana proteins were used as queries searching nucleotide genomic databases from G. lamblia, Ecz. cuniculi, P. falciparum, and Ent. histolytica. Other protein homologues found through data mining of the literature and of protein and genomic databases were also used as queries where available. However, a negative result from these searches did not indicate that a protein was not present, just that it was not found using standard data mining techniques. All results from the different queries for each protein were compared to ensure consistency. Result rankings were based on the results from the human or S. cerevisiae queries as these proteins have been confirmed experimentally. Results were ranked (1–4; 1 having the highest confidence of validity) based on the following system: 1, a candidate sequence of similar length (within 100 amino acids) to the protein sequence and containing greater than 65% amino acid similarity; 2, a candidate sequence of similar length to the protein sequence and containing 50%–65% amino acid similarity; 3, a candidate sequence (which may be of a different length to the protein) but containing a protein motif present in the protein sequence; and 4, candidates that displayed low sequence homology across the whole protein length. In the situation where a query protein from different species returned different sequences from the target genome (e.g., human proteinA returned sequence1, but the homologous proteinA from C. elegans returned sequence2), the result was designated "?" indicating that the result was unclear. If no "significant" results were returned for a query protein against a genome, the result was designated "—." All candidate sequences were "back-Blasted" against the protein databases at NCBI and the genomes from which they were recovered. Back-Blasting could confirm a sequence's candidacy and also reveal any other closely related protein that could lead to ambiguity (see later).
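
    The four-level ranking of genomic hits can be sketched as follows. The function and argument names are hypothetical, and two points are assumptions rather than statements from the text: the 100-amino-acid window given for rank 1 is also applied to rank 2, and any remaining hit without a shared motif falls to rank 4. The "?" and no-hit designations are decided before ranking and are not handled here.

```python
def rank_candidate(similarity_pct, length_diff_aa, has_shared_motif,
                   max_length_diff=100):
    """Rank a tBlastN candidate from 1 (highest confidence) to 4 (lowest).

    Assumptions: "similar length" means within `max_length_diff` residues for
    both rank 1 and rank 2; a similarity of exactly 65% counts as rank 2.
    """
    similar_length = length_diff_aa <= max_length_diff
    if similar_length and similarity_pct > 65:
        return 1
    if similar_length and 50 <= similarity_pct <= 65:
        return 2
    if has_shared_motif:
        return 3
    return 4


# Hypothetical candidates: (similarity %, length difference, motif present).
for candidate in [(70, 20, False), (55, 20, False), (70, 300, True), (30, 20, False)]:
    print(candidate, rank_candidate(*candidate))
```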

    The ancestral sequence reconstruction (ASR) technique (Collins, Poole, and Penny 2003) was used on a selected number of proteins that could be reliably aligned. Ancestral sequences were predicted using PAML (Yang 1997) and then combined with Blast to search genomic databases. Results of snRNA and protein searches are shown in tables throughout this study. Information from published comparative genomic studies that included some splicing proteins (Anantharaman, Koonin, and Aravind 2002; Koonin et al. 2004) has also been included. The seven eukaryotic genomes searched in these publications are human, C. elegans, D. melanogaster, S. cerevisiae, S. pombe, A. thaliana, and Ecz. cuniculi (Koonin et al. 2004 only). For spliceosomal proteins used in these published studies, the eukaryotic genomes in which they were found are included in the results tables, as well as any indication of archaeal presence.

    Protein presence was traced to the eukaryotic ancestor using MacClade version 4.0 (http://macclade.org/). The phylogenetic tree from figure 2 was used for all three runs. The following settings were used: Run A: {E, S, H, 1, 2, 3} = 1; {–, 4} = 0; {?} = ?. Run B (slightly stricter): {E, S, 1, 2, 3} = 1; {–, 4} = 0; {?, H} = ?. Run C (strict): {E, S, 1, 2} = 1; {–, 4, H, 3, ?} = 0. The likely presence of a protein in the eukaryotic ancestor was scored as follows: 1 = protein highly likely to have been present in the eukaryotic ancestor (ancestor positive); 2 = protein likely to have been present in the eukaryotic ancestor (ancestor equivocal); 3 = protein with a low likelihood of having been present in the eukaryotic ancestor (ancestor negative but protein present in at least two basal eukaryotic lineages, i.e., lineages outside animals, yeast, or plants). MacClade results are shown in each of the results tables.
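
    The three coding schemes can be written out as a lookup that converts each result designation into a character state (1, 0, or ?) for a given run. This sketch only restates the settings listed above; the names RUNS and encode are illustrative, the ASCII hyphen "-" stands in for the dash used in the results tables for "no significant result," and nothing here reproduces MacClade itself.

```python
# Result designations grouped by the character state they receive in each run.
# "-" is an ASCII stand-in for the dash used in the results tables.
RUNS = {
    "A": {"1": {"E", "S", "H", "1", "2", "3"}, "0": {"-", "4"}, "?": {"?"}},
    "B": {"1": {"E", "S", "1", "2", "3"}, "0": {"-", "4"}, "?": {"?", "H"}},
    "C": {"1": {"E", "S", "1", "2"}, "0": {"-", "4", "H", "3", "?"}, "?": set()},
}


def encode(designation, run):
    """Return the character state ("1", "0", or "?") of a designation in a run."""
    for state, designations in RUNS[run].items():
        if designation in designations:
            return state
    raise ValueError(f"unknown designation {designation!r} for run {run}")


# A hypothetical row of results for one protein across five genomes.
row = ["E", "H", "3", "4", "?"]
for run in ("A", "B", "C"):
    print(run, [encode(d, run) for d in row])
```

    Coding the same row under all three runs shows how the stricter settings progressively convert weaker evidence (H, 3, and ?) from presence to unknown or absence.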

    Results

    Spliceosomal Proteins in the Eukaryotic Ancestor

    The 152 most conserved spliceosomal proteins and 10 proteins specific to minor splicing were examined in this study and grouped according to common snRNA-binding properties (e.g., the U1snRNA-specific proteins) or shared distinguishing sequence motifs (e.g., Sm-Lsm proteins). Because of the large number of proteins involved in this survey, each group is summarized separately, beginning with the groups of snRNP-associated proteins and followed by proteins that have other functions in the spliceosome. Results tables for the U5- and U2-specific proteins are shown as examples in tables 2 and 3, respectively. Other tables and details of candidate sequences can be downloaded as supplementary information.

    Table 2 Results for the U5-Specific Proteins

    Table 3 Results for the U2-Specific Proteins

    U5snRNP-Specific Proteins (table 2)

    The U5snRNP is required for both steps of splicing, interacting with both the 5' and 3' splice sites of the mRNA (Dix et al. 1998), and is the only snRNP found in all three types of splicing. The yeast U5snRNP has fewer proteins than its mammalian equivalent and contains Prp8, Brr2, Snu114, Prp28, Snu40, and the Sm proteins (Stevens et al. 2001), while the human U5snRNP additionally contains Prp6, the U5-40 protein, and the U5-15 protein (Zhou et al. 2002). The S. cerevisiae Dib1 (U5-15 homologue) has been found not in the U5snRNP but in the U4/U6.U5tri-snRNP (Stevens et al. 2001); for convenience, it is dealt with here.

    The U5snRNP-specific proteins Prp8 and Brr2 are found throughout basal eukaryotes including G. lamblia (Nixon et al. 2002), T. brucei (Lucke et al. 1997), and Trichomonas vaginalis (Fast and Doolittle 1999), and it is not hard to place them within the eukaryotic ancestor (table 2). This placement is also supported by the presence of the U5snRNA in trypanosomatids (Schnare and Gray 2000) and G. lamblia (Collins, Macke, and Penny 2004).

    Early steps in splicing catalysis are thought to be catalyzed by two DExD/H-box RNA helicases, Prp28 and Brr2 (Kuhn, Reichl, and Brow 2002). These proteins are highly conserved in eukaryotes and are again likely to have been present in the eukaryotic ancestor. DExD/H RNA helicases share highly conserved motifs making positive identification difficult, and without care, an incorrect identification could be made. In this case, however, Brr2 and Prp28 are highly conserved throughout their entire sequence, enabling the basal eukaryotic candidate sequences to be treated with a higher confidence than for the other potential DExD/H RNA helicases recovered during this study (see later). With candidate sequences found in basal eukaryotes, Snu114, Prp6, U5-15, and Snu40 are other proteins likely to have been present in the eukaryotic ancestor.

    From these results, nearly all the U5snRNA-associated proteins can be placed in the eukaryotic ancestor, indicating that this snRNP, which is required throughout splicing, was already well established within the eukaryotic ancestor (see also Collins, Macke, and Penny 2004). Perhaps it is not unexpected that U5snRNP-associated proteins are so widely distributed, given that the U5snRNP complex is involved in major, minor, and trans-splicing.

    U2snRNP-Specific Proteins (table 3)

    The U2snRNP binds to the branch site of the pre-mRNA early in splicing, resulting in the bulging out of the branch site adenosine and completing the prespliceosome (fig. 1). The majority of the U2snRNP-specific proteins belong to two U2snRNP-specific protein complexes (SF3a and SF3b). The first protein complex, SF3a, consists of Sap61, Sap62, and Sap114 (Will et al. 2001). All these SF3a proteins have been characterized throughout eukaryotes, and candidate sequences were found in basal eukaryotic genomes. The other complex, SF3b (containing the P14, Sap49, Sap130, Sap145, Sap155, Rds3/SF3b14b, and SF3b10 proteins), is present in both the major and minor spliceosomes (Golas et al. 2003). Proteins of the SF3b complex, P14, Sap155, Sap145, Sap49, and Rds3, are well conserved across eukaryotic species, with MacClade results inferring their presence in the eukaryotic ancestor. Sap130 was only recovered confidently from P. falciparum and thus has a lower likelihood of having been present in the eukaryotic ancestor. The U2-A' and U2-B'' proteins associate stably with U2snRNA (Will et al. 2001) and are found throughout the higher eukaryotes. Candidate sequences were recovered from basal eukaryotes, making them likely to have been present in the eukaryotic ancestor.

    U2snRNA is thought to be part of the spliceosome catalytic core, and most of the U2snRNP-associated proteins may have been present in the eukaryotic ancestor. The SF3a and SF3b complexes may have been similar to what is seen in extant eukaryotes. Overall, the results are evidence that the entire U2snRNP evolved into a sophisticated complex before, or within, the eukaryotic ancestor.

    U1snRNP-Specific Proteins (supplementary table 4)

    From this point onwards, individual results are given in the supplementary information. The U1snRNP binds to the mRNA in the prespliceosome and leaves the spliceosome before the first step of catalysis (fig. 1). Although the Sm core group of proteins (B/B', D1, D2, D3, E, F, and G) are associated with the U1snRNP, these proteins also bind to the other snRNPs and will be covered in the section "Sm and Lsm Proteins." The U1-70 and U1-C proteins interact with the Sm core proteins during U1snRNP assembly (Nelissen et al. 1994), whereas U1-70 and U1-A interact with the U1snRNA (Labourier and Rio 2001). U1-A, U1-C, and U1-70 proteins have been found throughout eukaryotes, and candidate sequences were found in basal eukaryotes. As these proteins show a wide distribution across the eukaryotic lineages, MacClade results place all three of these proteins in the eukaryotic ancestor.

    Yeasts contain a number of additional U1snRNP-specific proteins. One such protein, Prp40, has been found in S. cerevisiae, S. pombe, and N. crassa, and a candidate sequence was found in Ecz. cuniculi, indicating that Prp40 was likely present in the fungal ancestor. However, candidates were also found in P. falciparum and Ent. histolytica, and MacClade results indicate a possibility that Prp40 or a similar protein could have been present in the eukaryotic ancestor. The Prp39 protein has been found in yeasts and plants but not in animals. However, candidate Prp39 sequences were found in Ecz. cuniculi, P. falciparum, and Ent. histolytica. This suggests that Prp39 may have been present in the eukaryotic ancestor and has either been lost from animals or no longer retains enough sequence similarity there to be detected as a protein homologue. Candidate sequences for the S. cerevisiae–specific protein Nam8 were also found in basal eukaryotes, with MacClade results suggesting that this protein may have been present in the eukaryotic ancestor.

    In contrast to the positive results above, the S. cerevisiae proteins (Snu56, Snu65, Snu71, and Usa1) have not yet been found in any other eukaryote (Gottschalk et al. 1998) and were not found in any basal eukaryotes during this study. Such findings are reassuring in that these proteins act as important negative controls (i.e., to ensure that not every protein was found in basal eukaryotes).

    U4/U6snRNP-Specific Proteins and U4/U6.U5tri-snRNP-Specific Proteins (supplementary tables 5 and 6)

    The U4 and U6snRNPs exist as separate entities but form a complex (U4/U6snRNP complex) prior to binding to the U5snRNP to form the U4/U6.U5tri-snRNP complex that then attaches to the spliceosome. U4-specific proteins, U6-specific proteins, and proteins specific to the U4/U6 and U4/U6.U5 complexes are discussed in this section. The U4snRNP-specific proteins Prp3, Prp4, and Snu13 were detected in at least two basal eukaryotic genomes, with MacClade again inferring that these proteins were likely to have been present in the eukaryotic ancestor.

    Prp31 is required for the U4/U6.U5tri-snRNP assembly (Makarova et al. 2002) and has been characterized in animals and yeast with candidate sequences found in G. lamblia and Ent. histolytica. This protein has also been reported in some archaeal genomes (Anantharaman, Koonin, and Aravind 2002) and thus is very likely to have been present both in the eukaryotic ancestor and, in this case, in the first eukaryote.

    Cpr1 (also called USA-CypP) is a member of the highly conserved cyclophilin protein family. Although Cpr1 candidates were found in Ecz. cuniculi, P. falciparum, Ent. histolytica, and G. lamblia, it cannot be ruled out that these candidates may in fact be other closely related cyclophilins. The finding of these candidates suggests, however, that at least one cyclophilin (either Cpr1 or related to Cpr1) was present in the eukaryotic ancestor.

    The three U4/U6.U5tri-snRNP–specific SR (Ser-Arg rich)–related proteins (Tri-27, Sad1, and Snu66) recovered possible candidates from P. falciparum, but only Sad1 recovered a candidate from Ent. histolytica. The ASR technique (Collins, Poole, and Penny 2003) was applied with Snu66 and Sad1. A Sad1 sequence containing only a motif-associated area was then recovered from G. lamblia. ASR with Snu66 did not recover any significant hits against G. lamblia but recovered candidates from Ent. histolytica and P. falciparum. MacClade results are consistent with Sad1 and Snu66 being present in the eukaryotic ancestor.

    U11/U12snRNA (Minor Splicing)–Specific Proteins (supplementary table 7)

    Recently, a number of proteins specific to the minor splicing complex have been identified from the analysis of the human U11/U12snRNP (Will et al. 2004) and the fruitfly U11snRNP (Schneider et al. 2004). U11snRNP-specific proteins U11-25, U11-35, U11-48, and U11-59 and the U11/U12-specific proteins U11/12-20, U11/12-31, and U11/12-65 have similar sequences in the mouse and zebrafish genomes, and some of these are also found in the fruitfly, mosquito, and some plant genomes (Schneider et al. 2004). Searches against the C. intestinalis (sea-squirt) genome recovered candidates for most of these proteins (the U11-59 being the exception). Candidates for three other U11/U12-associated proteins, YB1, Toe-1, and C114, were also recovered from the sea-squirt. These protein candidates, together with the presence of candidate sequences for the U11, U12, and U6atac snRNAs (data not shown), strongly suggest the presence of minor splicing (as well as major and trans-splicing) in the sea-squirt.

    However, searches of basal eukaryotic genomes failed to find any clear candidates for any of the 10 minor splicing proteins used in this study. Thus, at this time, there is no evidence that minor splicing was present in the eukaryotic ancestor. This failure to detect any of the 10 proteins, in stark contrast to our results with proteins from the major spliceosome, is a reassuring negative control for our study.

    Sm and Lsm Proteins (supplementary table 8)

    Sm core proteins (B'/B, D3, D2, D1, E, F, and G) are found in both the major and minor spliceosomes (Hastings and Krainer 2001), where they bind to a conserved Sm-binding site found in snRNAs (Donahue and Jarrell 2002). Despite structural similarities, Lsm proteins play roles distinct from Sm proteins, assisting in the rearrangement of U6snRNP during splicing and in promoting U4/U6 formation during recycling of the spliceosome (Chan et al. 2003; Liu et al. 2004). Some of the Sm-Lsm proteins (SmE, SmF, SmG, Lsm3, and Lsm5) have been found in both eukaryotic and archaeal genomes (Anantharaman, Koonin, and Aravind 2002) and are thus good candidates for having been present in the eukaryotic ancestor as well as in the first eukaryote. In addition, other Sm-Lsm proteins (SmB/B', SmD1, SmD2, SmD3, and Lsm2) recovered candidate sequences in at least two basal eukaryotes. Of the remaining Lsm proteins, Lsm4 and Lsm8 recovered good candidates from P. falciparum, Lsm6 a possible candidate from G. lamblia, and Lsm7 a possible candidate from Ent. histolytica. Recently, Lsm2 to Lsm8 have been experimentally identified in T. brucei (Liu et al. 2004), making it highly likely that these proteins were present in the eukaryotic ancestor. MacClade results for all the Sm and Lsm proteins (except Lsm1, which was unclear for some species) suggested their presence in the eukaryotic ancestor.
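    The kind of inference behind statements such as "MacClade results suggested their presence in the eukaryotic ancestor" can be illustrated with a small parsimony calculation. The sketch below is only an illustration, assuming a Fitch-style parsimony pass over a single presence/absence character; the tree shape, tip names, and states are hypothetical and are not the tree or data used in this study, and the settings actually used in MacClade are not reproduced here.

```python
# Minimal Fitch parsimony for one presence/absence character.
# The tree and tip states below are hypothetical, for illustration only.

def fitch_root_states(tree, tip_states):
    """Return (candidate root states, minimum number of changes).

    tree: a tip name (str) or a 2-tuple (left, right) of subtrees.
    tip_states: dict mapping tip name -> 1 (candidate found) or 0 (not found).
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_root_states(left, tip_states)
    right_set, right_cost = fitch_root_states(right, tip_states)
    shared = left_set & right_set
    if shared:
        # Both subtrees agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Subtrees disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical five-tip tree in which one lineage lacks a candidate.
tree = ((("tipA", "tipB"), "tipC"), ("tipD", "tipE"))
states = {"tipA": 1, "tipB": 1, "tipC": 0, "tipD": 1, "tipE": 1}
root_set, changes = fitch_root_states(tree, states)
print(root_set, changes)
```

    In this toy case the root set is {1} with a single change, that is, presence at the root with one loss is the most parsimonious reading. Fitch parsimony treats gains and losses as equally costly, which is itself an assumption.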

    Catalytic Step II Proteins (supplementary table 9)

    Protein interaction in the second catalytic splicing step can be divided into two stages: Prp16 and Prp17 activate the first stage, then Prp18 and Slu7 activate the second (Chawla et al. 2003). Prp17 and Prp18 have candidate sequences in P. falciparum but only low homology or small motif areas could be found in G. lamblia and Ent. histolytica. Prp16, Prp22, and Prp43 are DExD box RNA helicases and are dealt with in the next section. Slu7 candidates were also found in basal eukaryotes. MacClade results are consistent with all the catalytic step II proteins mentioned here being present in the eukaryotic ancestor.

    Other DExD/H Proteins (supplementary table 10)

    In S. cerevisiae, eight DExD/H proteins (Prp2p, Prp16p, Prp22p, Prp43p, Brr2p, Prp5p, Prp28p, and Sub2p/UAP56) have been identified as being required for pre-mRNA splicing (Jurica and Moore 2003). Seven additional proteins have been found in mammalian spliceosomes (DICE1, Abstrakt, eIF4a3, DDX35, DDX9, KIAA0052, and p72) (Jurica and Moore 2003). These DExD/H motif-containing proteins are classed as RNA helicases and are required to change the mRNA structural conformation during the splicing cycle. Some of these proteins have been covered under different protein groups (e.g., Brr2 and Prp28 are also U5snRNP-specific proteins).

    Problems arise when searching for DExD/H proteins because they contain large conserved sequence motifs, and often searching with one DExD/H protein will find many proteins of the same family. By comparing the length of the candidate sequence and the position of a motif to that of the query proteins, it is sometimes possible to narrow down the choices of proteins that are the most similar for a particular ORF. However, this was often not possible with proteins containing the DExD/H motif because of their diversity. For example, the G. lamblia candidate sequence (AACB0100041.1:9432–13868) is 1,478 amino acids in length and was recovered with searches of Prp16, Prp22, Prp43, and Prp2. From the analysis of known proteins used in this study, this protein could be Prp16, Prp22, or Prp43. Again, these occasional complexities are good controls as the majority of searches found only one unambiguous candidate.
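    The comparison of candidate length and motif position against the query proteins can be sketched as follows. This is a hedged illustration only: the scoring function, the names `rank_queries`, `queryA`, `queryB`, and `queryC`, and all numeric values are hypothetical and are not taken from this study, where the comparison was made by inspection.

```python
def rank_queries(candidate, queries):
    """Rank query proteins by how closely length and motif position match.

    candidate and each query are dicts with 'length' (residues) and
    'motif_start' (1-based residue position of a shared motif).
    The score is a sum of relative differences; lower means more similar.
    This scoring scheme is an illustrative assumption.
    """
    def score(query):
        length_diff = abs(candidate["length"] - query["length"]) / query["length"]
        # Compare motif position as a fraction of total protein length.
        cand_frac = candidate["motif_start"] / candidate["length"]
        query_frac = query["motif_start"] / query["length"]
        return length_diff + abs(cand_frac - query_frac)

    return sorted(queries, key=lambda name: score(queries[name]))


# Hypothetical values, for illustration only.
candidate = {"length": 1000, "motif_start": 400}
queries = {
    "queryA": {"length": 1050, "motif_start": 420},
    "queryB": {"length": 700, "motif_start": 100},
    "queryC": {"length": 1000, "motif_start": 700},
}
ranking = rank_queries(candidate, queries)
print(ranking)
```

    As noted above, for DExD/H proteins such a ranking is often inconclusive, because several family members can match a candidate about equally well.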

    Searches of NCBI databases with DExD/H protein candidates recover many DExD proteins with very similar scores. Thus, it is likely that there may be other DExD/H proteins that are potential candidate sequences. Because of this complexity, only the DExD/H proteins that showed a high level of homology, with no conflict with other DExD/H proteins (e.g., Brr2), were placed in the eukaryotic ancestor, although it is likely that other DExD/H proteins were also present.

    SR Proteins (supplementary table 11)

    SR proteins are commonly found in mammalian splicing (Hastings and Krainer 2001) but are absent in yeast (Zhou et al. 2002). These proteins, in the species in which they are found, are required for splice site recognition in all three types of splicing (major, minor, and trans) (Hastings and Krainer 2001; Furuyama and Bruzik 2002; Graveley 2004). They contain a characteristic C-terminal "RS" domain of variable length, rich in serine-arginine repeats that can be extensively phosphorylated (Portal et al. 2003). Some novel SR proteins have been found in Trypanosoma cruzi (Ismaili et al. 1999, 2000; Portal et al. 2003), evidence that SR proteins may have been present in early eukaryotes (but may have been lost in a few later lineages [Portal et al. 2003]).

    Candidate sequences for the proteins ASF/SF2 and 9G8 were recovered from P. falciparum, but Blast searches with other SR proteins of P. falciparum, G. lamblia, and Ent. histolytica returned at best only motif-associated areas. ASR searching with 9G8 recovered a possible candidate in P. falciparum but again only motif-associated areas in the other two genomes. The RS motif (the predominant feature of SR proteins), however, has been found almost exclusively in splicing-related proteins (Portal et al. 2003), indicating that the motif-associated areas found in P. falciparum, G. lamblia, and Ent. histolytica may be part of SR proteins. The presence of the RS domain in these three basal eukaryotes as well as the novel T. cruzi SR proteins indicates that SR proteins as a group may have been in the eukaryotic ancestor, but no specific protein as yet fulfilled the criteria set in this study to enable it to be placed in the eukaryotic ancestor.
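    To make the notion of an "RS motif" concrete, the following sketch scans a protein sequence for runs of consecutive arginine-serine dipeptides. It is a crude illustration under stated assumptions and not the method used in this study (which relied on Blast and ASR searches): the function name `rs_rich_regions`, the threshold `min_repeats`, and the toy sequence are all hypothetical, and real RS domains need not consist of perfect tandem dipeptides.

```python
import re


def rs_rich_regions(sequence, min_repeats=4):
    """Return (start, end, text) for runs of consecutive RS/SR dipeptides.

    start and end are 0-based, end-exclusive positions in the sequence.
    min_repeats is an arbitrary illustrative threshold.
    """
    pattern = re.compile(r"(?:RS|SR){%d,}" % min_repeats)
    return [
        (m.start(), m.end(), m.group())
        for m in pattern.finditer(sequence.upper())
    ]


# Toy sequence, not a real protein.
toy = "MKTAYRSRSRSRSRSGDLK"
hits = rs_rich_regions(toy)
print(hits)
```

    A hit of this kind corresponds to what is called a motif-associated area in the text: it shows that an RS-like region exists but does not by itself identify which SR protein, if any, the open reading frame encodes.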

    Prp19-Associated Complex (supplementary table 12)

    The Prp19-associated complex (NTC or 19 complex) is required for the stable association of U5 and U6snRNPs with the spliceosome after U4snRNP dissociation (Chan et al. 2003). The NTC has been isolated as a distinct unit, indicating that its constituents bind directly with one another (Ohi and Gould 2002). The yeast NTC consists of Cef1p (CDC5L homologue), Snt309, Ntc31, Isy1, Ntc20, and at least another six uncharacterized proteins. Another 30 uncharacterized proteins have been copurified with the human Cdc5/Cef1 protein (Ohi and Gould 2002).

    Prp19 itself is required to maintain the organization of the NTC (Ohi and Gould 2002). A Prp19 candidate was found in P. falciparum and Dictyostelium discoideum, but only motif areas could be determined from Ent. histolytica and G. lamblia using both Blast and ASR. Nevertheless, MacClade results suggest that this protein may have been present in the eukaryotic ancestor. Similar results were found with Prp5, PLRG1, Crn, and Cdc5l. These proteins are core components of the mammalian NTC (Ohi and Gould 2002), and candidate sequences for these proteins were found in basal eukaryotes. Overall, our results indicate that a number of NTC-associated proteins, as well as Prp19 itself, were present in the eukaryotic ancestor, indicating that the NTC as a whole was present.

    Coupling of Splicing with Other Major Cellular Events (supplementary table 13)

    In today's eukaryotes, almost all the major events in the production of mature mRNAs are highly coupled with splicing (Lynch and Richardson 2002), and there are many interactions between splicing factors and elongation factors to promote transcription elongation, mRNA export, transcriptional termination, and polyadenylation. Some of the complexity of the spliceosome may be accounted for by proteins that are not essential for splicing but instead play important postsplicing roles (Nilsen 2003). Results from some of these proteins are summarized below.

    Prp4Kinase (not to be confused with Prp4) is present in the yeast S. pombe and mammals but has not been found in S. cerevisiae (Kuhn and Kaufer 2003). It plays a key role in regulating splicing and in connecting this process with the cell cycle. Candidate Prp4Kinase sequences were found in G. lamblia and T. brucei. Similarly, the Skip protein and polyA-binding protein recovered candidate sequences in basal eukaryotes and were also likely to have been present in the eukaryotic ancestor. Both Tex1 and UAP56 are components of the TREX-complex (involved in transcription elongation). Every protein that is in the TREX-complex may also be present in the spliceosome (Zhou et al. 2002), consistent with transcription, splicing, and export being coupled via this complex. UAP56 and Tex1 protein candidates were found in basal eukaryotes. Thus, the eukaryotic ancestor may have already contained strong links between pre-mRNA splicing and other cellular processes such as transcription and RNA nuclear export.

    Post-transcriptional Exon-Junction Complex Proteins (supplementary table 14)

    The exon-junction complex (EJC) consists of several proteins that, upon the completion of intron excision, are deposited on the mRNA product at a conserved position (Nott, Le Hir, and Moore 2004). Core components include the Y14 and Magoh proteins which remain stably associated with mRNA after nuclear export. Magoh is found in vertebrates, yeasts, and plants, and candidate sequences were recovered from P. falciparum and Ent. histolytica. MacClade results place Magoh in the eukaryotic ancestor. In contrast, Y14 recovered, at best, motif-associated areas from any basal eukaryotic genome and is not at this time suggested to have been present in the eukaryotic ancestor.

    Although it is likely that some of the proteins associated today with the EJC were present in the eukaryotic ancestor, there is as yet no evidence to suggest that the EJC as a whole was present. Given these results, experimental tests are now required on the EJC in basal eukaryotes.

    Other Essential Splicing Proteins (supplementary table 15)

    Some splicing factors cannot be conveniently grouped into any of the previous sections and are summarized here. Essential splicing factors SF1, Luc7a, and U2AF (U2AF65 and U2AF35 subunits) play important roles in splice site recognition during early spliceosome assembly (Fortes et al. 1999; Selenko et al. 2003). U2AF subunits (U2AF65 and U2AF35) and SF1 have been characterized throughout basal eukaryotes, and MacClade results suggested their presence in the eukaryotic ancestor. There were other splicing proteins (for example, fSap105, fSap79, Spf30, and Snp70) that did not recover any candidates in any of the basal eukaryotic genomes tested here. We cannot conclude yet that these proteins are absent in these genomes but merely conclude that they were not found with techniques used in this study.

    A summary of our results is shown in table 4, and it lists the 78 proteins for which we have found evidence that they were present in the eukaryotic ancestor. The only group of proteins that we are confident were not present are those specific to minor splicing. Otherwise, there are reasonable candidates for the full range of splicing-associated groupings shown in figure 1.

    Table 4 Seventy-Eight Spliceosomal Proteins Likely to be Present in the Eukaryotic Ancestor

    Discussion

    This study sets out to ascertain whether or not (Hypothesis 1) the spliceosome existed in the eukaryotic ancestor and, if so, whether it was a simplified version of today's spliceosomes (Hypothesis 2) or just as complex (Hypothesis 3). Table 4 shows that splicing-specific proteins from the full range of the spliceosomal cycle are conserved throughout eukaryotes. Thus, a major conclusion of this work is that the splicing process in the eukaryotic ancestor would be similar in overall complexity to that seen today in living eukaryotes, that is, not simplified but complex and thus supporting the third hypothesis stated in the Introduction. Ancestral snRNPs, far from being simplified versions, may have contained most of the U-snRNP–specific proteins (proteins that bind to U1, U2, U4, U5, and U6snRNAs) found today (table 4). Other groups of proteins such as the Sm core proteins (bound within each snRNP) and the Lsm proteins have also remained highly conserved throughout the eukaryotic lineage and probably have origins in the ancestral eukaryote, perhaps ancestral to the first eukaryote.

    Thus, conclusions from our work support the premise (Lynch and Richardson 2002) that introns and the spliceosomal machinery to process them were present in the eukaryotic ancestor. If the origin of eukaryotic introns and the spliceosomal complex does go back to bacterial self-splicing introns, then this must have been significantly earlier than the last ancestor of extant eukaryotes in order for the full range of RNPs to evolve.

    Some protein groups, however, have not been easy to characterize across eukaryotic lineages. Proteins that belong to highly conserved protein families (e.g., DExD/H, cyclophilin, and SR proteins) are often similar in sequence to other members of the family. This creates problems both with sequence annotation in general and in determining if a particular member of a protein family was likely to have been present in the eukaryotic ancestor. Sequence-linked properties such as length and predicted physicochemical properties (e.g., isoelectric point and amino acid composition) are of limited use in this situation because they are often shared by the other members of the family. Thus, biochemical analysis, including protein-RNA–binding studies, will be required to identify these spliceosomal proteins in basal eukaryotic species. Candidate sequences found using queries from members of a protein family indicate the likely presence of that protein family rather than the distribution of individual members. For example, the presence of RS motif sequences throughout eukaryotes indicates that proteins with this motif were likely to have been present in the eukaryotic ancestor.

    Not all proteins identified as belonging to spliceosomes (Kaufer and Potashkin 2000; Lorkovic et al. 2000; Zhou et al. 2002; Jurica and Moore 2003) were used in searches during this study. We concentrated on those that had reasonable conservation within eukaryotes. In addition, this "parts-list" (Nilsen 2003) of the spliceosome may still be incomplete because additional splicing-associated proteins are still being discovered and new functions identified. There are relatively few splicing-associated proteins biochemically characterized from any of the basal eukaryotes (compared with the numbers characterized from yeasts and vertebrates), and as yet no complete spliceosomes have been isolated. Given our results that a complex spliceosome is likely to be present in many basal eukaryotes, experimental studies are now a priority.

    As stated in the Introduction, our aim is to have a robust conclusion independent of any particular rooting of the eukaryote tree. From theoretical studies, it is well known that there is a major problem in using primary sequences for recovering deeper divergences. Theoretically, it is well established that primary sequences must, on our current models of sequence evolution, eventually lose virtually all phylogenetic information (Mossel 2003; Mossel and Steel 2004, 2005). This was first reported for simulations using established rates of mutations and realistic time periods (Penny et al. 2001). It was then, for simple models, demonstrated mathematically (Sober and Steel 2002). We then pointed out that biochemically based models, such as the covarion model implemented by the hidden Markov model in Tuffley and Steel (1997), were not covered by the original Steel theorem (Sober and Steel 2002) even though the conclusion about sequences losing information was still expected to hold (Penny, Hendy, and Poole 2003). There are major problems in getting consistent results for deep phylogeny of both prokaryotes and eukaryotes, exactly as predicted from the theory of Markov processes on trees (Mossel and Steel 2004, 2005). Although we would prefer to have additional genomes from deeply diverging eukaryotes, we think that our present conclusion is robust to many alternative rootings of the tree in figure 2, including the rooting of the eukaryotic tree between animals, fungi, and choanozoa and all other eukaryotes (Stechmann and Cavalier-Smith 2002). As mentioned earlier, this rooting is based on a single (reversible) gene fusion event. This rooting strongly supports the eukaryote ancestor having a full spliceosome because plants and animals are on opposite sides of the eukaryote tree on this rooting and even minor splicing would be in the eukaryotic ancestor. 

    However, it has been known for some time (Snel, Bork, and Huynen 2000) that fused genes can separate again (gene fission). Mechanisms by which this fission occurs have been suggested by comparing genomes of closely related species of Drosophila (Wang, Yu, and Long 2004). Thus, relying on a single fusion event to root the eukaryote tree is risky.

    Nevertheless, using parsimony on "rare genomic changes" (Rokas and Holland 2000) does have a sound theoretical basis, in that parsimony is a maximum likelihood estimator if the number of possible character states is extremely large (Steel and Penny 2004, 2005). On this basis, we definitely need new forms of data for the deepest divergences, including gene order, gene fusions, and major insertions and deletions, where there are a very large number of character states.

    To return then to the Stechmann and Cavalier-Smith (2002) rooting of the eukaryotic tree, there are four points.

    1. It is consistent with our conclusion about the complexity of the spliceosome in the eukaryotic ancestor.

    2. However, we are doubtful about using just a single gene fusion event when it is known that gene fissions can also occur (even though less frequently than fusions).

    3. Primary sequence data will lose information about the deepest divergences.

    4. However, rare genome changes (including gene fusion events and major insertions and deletions) are maximum likelihood characters that are expected to be useful when primary sequence data have lost most, or all, of their phylogenetic information.

    Perhaps the final point is that it is precisely when all phylogenetic information is lost from primary sequences that claims are made about "rampant" lateral transfer to explain the predicted (because of information loss) differences between gene trees. The results of Mossel and Steel (2005) show that there is no easy solution to finding the root of the eukaryotic tree, and we consider that using genomics on a wide range of basal eukaryotes is currently the most robust strategy.
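    The loss of information from primary sequences can be illustrated with the simplest possible case. As an assumption made purely for illustration (it is not one of the models analysed in the work cited above), take a symmetric two-state model in which each site changes state at rate \mu. For two sequences separated by a total path length t, the probability that a site shows the same state in both is:

```latex
P_{\text{same}}(t) = \tfrac{1}{2} + \tfrac{1}{2}\, e^{-2\mu t},
\qquad
P_{\text{same}}(t) - \tfrac{1}{2} = \tfrac{1}{2}\, e^{-2\mu t} \longrightarrow 0
\quad \text{as } t \to \infty .
```

    The excess over the chance value of 1/2 decays exponentially with divergence, so for sufficiently deep divergences a finite sequence carries almost no signal about the relationship. Characters with a very large number of possible states are far less likely to return to the same state by chance, which is the intuition behind preferring rare genomic changes for the deepest splits.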

    The distribution of major and trans-splicing indicates that both splicing mechanisms are likely to have been present in the eukaryotic ancestor. The mechanisms of major and trans-splicing differ, but each is conserved between highly diverse eukaryotic lineages. Thus, it is more likely that they were separate entities in the eukaryotic ancestor than the converse view that each instance evolved separately in each lineage. This is especially true because the present work leads to the inference that the ancestral spliceosome was complex. Another option is that similarities in splicing mechanisms may be the result of horizontal transfer. This is unlikely because it appears that genes involved in transcription, translation, and related processes, such as splicing, are rarely horizontally transferred (Jain, Rivera, and Lake 1999). The use of trans-splicing to process polycistronic (many genes) mRNAs may have been lost or "downgraded" in some lineages (such as mammals) with the proliferation of monocistronic (single gene) mRNAs. However, the ability to join two independently produced pre-mRNAs in a trans-splicing reaction has remained in lineages (such as in humans) that appear not to contain SL–trans-splicing (Garcia-Blanco 2003). Although our present study indicates that minor splicing may be present in the sea-squirt, it has not yet been demonstrated in any basal eukaryotes, and at this stage, minor splicing does not seem likely to have been present in the eukaryotic ancestor but rather to have evolved sometime before the separation of plants and animals.

    Our work complements and extends earlier work (Anantharaman, Koonin, and Aravind 2002; Koonin et al. 2004) that used general computational surveys of higher eukaryotic genomes (and archaea) to uncover proteins that may have been present in "ancestral" organisms. The results of our studies are consistent with these earlier ones but we extended them in two ways. We used basal eukaryotes from several lineages, and in addition, we checked all annotation back to the experimental literature (avoiding problems of errors in annotation). Although we get considerably more detail, we are limited in the number of genes we can cover and so the two approaches are genuinely complementary.

    Splicing can now be seen as a fundamental aspect of all modern eukaryotic life and appears to have evolved before the last ancestor of living eukaryotes. Contrary to the idea that splicing may have been a much simpler mechanism in this ancient organism, it now appears that splicing and the spliceosome had already evolved into a sophisticated cellular process, already linked to other cellular processes such as transcription, capping, mRNA export, and polyadenylation. At this point we can say nothing about the origin of the spliceosome or its nature in the first eukaryote. There must have been a significant period of time between this first eukaryote and the organism we have called the eukaryotic ancestor. In agreement with Martin (1999), a simple endosymbiotic event does not explain the origin of the nucleus with its complex RNA processing. To examine the nature of the much earlier first eukaryote, additional study will be required to compare the spliceosomal process found in eukaryotes and the self-splicing mechanism of prokaryotes, an interesting prospect for the future.

    Supplementary Material

    Supplementary information including result tables for all the proteins involved in this study is available online.

    Acknowledgements

    Many thanks to Mitchell L. Sogin and Andrew G. McArthur and their teams at the Giardia lamblia Genome Project, Marine Biological Laboratory at Woods Hole, for early access to nonpublic data. Thanks also to Allan Rodrigo for advice about MacClade. This work was supported by the New Zealand Marsden Fund and the Allan Wilson Centre for Molecular Ecology and Evolution.

    References

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.

    Anantharaman, V., E. V. Koonin, and L. Aravind. 2002. Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Res. 30:1427–1464.

    Archibald, J. M., C. J. O'Kelly, and W. F. Doolittle. 2002. The chaperonin genes of jakobid and jakobid-like flagellates: implications for eukaryotic evolution. Mol. Biol. Evol. 19:422–431.

    Bahl, A., B. Brunk, R. L. Coppel et al. (16 co-authors). 2002. PlasmoDB: the Plasmodium genome resource. An integrated database providing tools for accessing, analyzing and mapping expression and sequence data (both finished and unfinished). Nucleic Acids Res. 30:87–90.

    Bonen, L. 1993. Trans-splicing of pre-mRNA in plants, animals, and protists. FASEB J. 7:40–46.

    Burge, C. B., T. Tuschl, and P. A. Sharp. 1999. Splicing of precursors to mRNAs by the spliceosomes. In R. F. Gesteland, T. Cech, and J. F. Atkins, eds. The RNA world: the nature of modern RNA suggests a prebiotic RNA. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York; 1820–1821.

    Chan, S. P., D. I. Kao, W. Y. Tsai, and S. C. Cheng. 2003. The Prp19p-associated complex in spliceosome activation. Science 302:279–282.

    Chawla, G., A. K. Sapra, U. Surana, and U. Vijayraghavan. 2003. Dependence of pre-mRNA introns on PRP17, a non-essential splicing factor: implications for efficient progression through cell cycle transitions. Nucleic Acids Res. 31:2333–2343.

    Collins, L. J., T. J. Macke, and D. Penny. 2004. Searching for ncRNAs in eukaryotic genomes: maximizing biological input with RNAmotif. J. Integ. Bioinf. (http://journal.imbio.de/index.php?paper_id=6).

    Collins, L. J., A. M. Poole, and D. Penny. 2003. Using ancestral sequences to uncover potential gene homologues. Appl. Bioinf. 2:S85–S95.

    Dacks, J. B., and W. F. Doolittle. 2001. Reconstructing/deconstructing the earliest eukaryotes: how comparative genomics can help. Cell 107:419–425.

    Dix, I., C. S. Russell, R. T. O'Keefe, A. J. Newman, and J. D. Beggs. 1998. Protein-RNA interactions in the U5 snRNP of Saccharomyces cerevisiae. RNA 4:1675–1686.

    Donahue, W. F., and K. A. Jarrell. 2002. A BLAST from the past: ancient origin of human Sm proteins. Mol. Cell 9:7–8.

    Embley, T. M., and R. P. Hirt. 1998. Early branching eukaryotes? Curr. Opin. Genet. Dev. 8:624–629.

    Fast, N. M., and W. F. Doolittle. 1999. Trichomonas vaginalis possesses a gene encoding the essential spliceosomal component, PRP8. Mol. Biochem. Parasitol. 99:275–278.

    Fortes, P., D. Bilbao-Cortes, M. Fornerod, G. Rigaut, W. Raymond, B. Seraphin, and I. W. Mattaj. 1999. Luc7p, a novel yeast U1 snRNP protein with a role in 5' splice site recognition. Genes Dev. 13:2425–2438.

    Furuyama, S., and J. P. Bruzik. 2002. Multiple roles for SR proteins in trans splicing. Mol. Cell. Biol. 22:5337–5346.

    Garcia-Blanco, M. A. 2003. Messenger RNA reprogramming by spliceosome-mediated RNA trans-splicing. J. Clin. Investig. 112:474–480.

    Gesteland, R. F., T. Cech, and J. F. Atkins. 1999. The RNA world: the nature of modern RNA suggests a prebiotic RNA. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

    Golas, M. M., B. Sander, C. L. Will, R. Luhrmann, and H. Stark. 2003. Molecular architecture of the multiprotein splicing factor SF3b. Science 300:980–984.

    Gottschalk, A., J. Tang, O. Puig, J. Salgado, G. Neubauer, H. V. Golot, M. Mann, B. Seraphin, M. Rosbash, R. Luhrmann, and P. Fabrizio. 1998. A comprehensive biochemical and genetic analysis of the yeast U1 snRNP reveals five novel proteins. RNA 4:374–393.

    Graveley, B. R. 2004. A protein interaction domain contacts RNA in the prespliceosome. Mol. Cell 13:302–304.

    Hastings, M. L., and A. R. Krainer. 2001. Functions of SR proteins in the U12-dependent AT-AC pre-mRNA splicing pathway. RNA 7:471–482.

    Ismaili, N., D. Perez-Morga, P. Walsh, M. Cadogan, A. Pays, P. Tebabi, and E. Pays. 2000. Characterization of a Trypanosoma brucei SR domain-containing protein bearing homology to cis-spliceosomal U1 70 kDa proteins. Mol. Biochem. Parasitol. 106:109–120.

    Ismaili, N., D. Perez-Morga, P. Walsh, A. Mayeda, A. Pays, P. Tebabi, A. R. Krainer, and E. Pays. 1999. Characterization of a SR protein from Trypanosoma brucei with homology to RNA-binding cis-splicing proteins. Mol. Biochem. Parasitol. 102:103–115.

    Jain, R., M. C. Rivera, and J. A. Lake. 1999. Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96:3801–3806.

    Jurica, M. S., and M. J. Moore. 2003. Pre-mRNA splicing: awash in a sea of proteins. Mol. Cell 12:5–14.

    Kaufer, N. F., and J. Potashkin. 2000. Analysis of the splicing machinery in fission yeast: a comparison with budding yeast and mammals. Nucleic Acids Res. 28:3003–3010.

    Koonin, E. V., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2004. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5:R7.

    Kornblihtt, A. R., M. de la Mata, J. P. Fededa, M. J. Munoz, and G. Nogues. 2004. Multiple links between transcription and splicing. RNA 10:1489–1498.

    Kuhn, A. N., and N. F. Kaufer. 2003. Pre-mRNA splicing in Schizosaccharomyces pombe: regulatory role of a kinase conserved from fission yeast to mammals. Curr. Genet. 42:241–251.

    Kuhn, A. N., E. M. Reichl, and D. A. Brow. 2002. Distinct domains of splicing factor Prp8 mediate different aspects of spliceosome activation. Proc. Natl. Acad. Sci. USA 99:9145–9149.

    Labourier, E., and D. C. Rio. 2001. Purification of Drosophila snRNPs and characterization of two populations of functional U1 particles. RNA 7:457–470.

    Liu, Q., X. H. Liang, S. Uliel, M. Belahcen, R. Unger, and S. Michaeli. 2004. Identification and functional characterization of Lsm proteins in Trypanosoma brucei. J. Biol. Chem.

    Lorkovic, Z. J., D. A. Wieczorek Kirk, M. H. Lambermon, and W. Filipowicz. 2000. Pre-mRNA splicing in higher plants. Trends Plant Sci. 5:160–167.

    Lucke, S., T. Klockner, Z. Palfi, M. Boshart, and A. Bindereif. 1997. Trans mRNA splicing in trypanosomes: cloning and analysis of a PRP8-homologous gene from Trypanosoma brucei provides evidence for a U5-analogous RNP. EMBO J. 16:4433–4440.

    Lynch, M., and J. S. Conery. 2003. The origins of genome complexity. Science 302:1401–1404.

    Lynch, M., and A. O. Richardson. 2002. The evolution of spliceosomal introns. Curr. Opin. Genet. Dev. 12:701–710.

    Makarova, O. V., E. M. Makarov, S. Liu, H. P. Vornlocher, and R. Luhrmann. 2002. Protein 61K, encoded by a gene (PRPF31) linked to autosomal dominant retinitis pigmentosa, is required for U4/U6*U5 tri-snRNP formation and pre-mRNA splicing. EMBO J. 21:1148–1157.

    Malca, H., N. Shomron, and G. Ast. 2003. The U1 snRNP base pairs with the 5' splice site within a penta-snRNP complex. Mol. Cell. Biol. 23:3442–3455.

    Martin, W. 1999. A briefly argued case that mitochondria and plastids are descendants of endosymbionts, but that the nuclear compartment is not. Proc. R. Soc. Lond. B Biol. Sci. 266:1387–1395.

    Mossel, E. 2003. On the impossibility of reconstructing ancestral data and phylogenies. J. Comp. Biol. 10:669–676.

    Mossel, E., and M. Steel. 2004. A phase transition for a random cluster model on phylogenetic trees. Math. Biosci. 187:189–203.

    ———. 2005. How much can evolved characters tell us about the tree that generated them? In O. Gascuel, ed. Mathematics of evolution and phylogeny. Oxford University Press.

    Nagai, K., Y. Muto, D. A. Pomeranz Krummel, C. Kambach, T. Ignjatovic, S. Walke, and A. Kuglstatter. 2001. Structure and assembly of the spliceosomal snRNPs. Novartis Medal Lecture. Biochem. Soc. Trans. 29:15–26.

    Nelissen, R. L., C. L. Will, W. J. van Venrooij, and R. Luhrmann. 1994. The association of the U1-specific 70K and C proteins with U1 snRNPs is mediated in part by common U snRNP proteins. EMBO J. 13:4113–4125.

    Nilsen, T. W. 2003. The spliceosome: the most complex macromolecular machine in the cell? Bioessays 25:1147–1149.

    Nixon, J. E., A. Wang, H. G. Morrison, A. G. McArthur, M. L. Sogin, B. J. Loftus, and J. Samuelson. 2002. A spliceosomal intron in Giardia lamblia. Proc. Natl. Acad. Sci. USA 99:3701–3705.

    Nott, A., H. Le Hir, and M. J. Moore. 2004. Splicing enhances translation in mammalian cells: an additional function of the exon junction complex. Genes Dev. 18:210–222.

    Ohi, M. D., and K. L. Gould. 2002. Characterization of interactions among the Cef1p-Prp19p-associated splicing complex. RNA 8:798–815.

    Patel, A. A., and J. A. Steitz. 2003. Splicing double: insights from the second spliceosome. Nat. Rev. Mol. Cell Biol. 4:960–970.

    Penny, D., M. D. Hendy, and A. M. Poole. 2003. Testing fundamental evolutionary hypotheses. J. Theor. Biol. 223:377–385.

    Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. Research Reports in Mathematics and Statistics 98/04, Massey University. J. Mol. Evol. 53:711–723.

    Portal, D., J. M. Espinosa, G. S. Lobo et al. (11 co-authors) 2003. An early ancestor in the evolution of splicing: a Trypanosoma cruzi serine-arginine-rich protein (TcSR) is functional in cis-splicing. Mol. Biochem. Parasitol. 127:37–46.

    Rokas, A., and P. W. H. Holland. 2000. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15:454–459.

    Schnare, M. N., and M. W. Gray. 2000. Structural conservation and variation among U5 small nuclear RNAs from trypanosomatid protozoa. Biochim. Biophys. Acta. 1490:362–366.

    Schneider, C., C. L. Will, J. Brosius, M. J. Frilander, and R. Luhrmann. 2004. Identification of an evolutionarily divergent U11 small nuclear ribonucleoprotein particle in Drosophila. Proc. Natl. Acad. Sci. USA 101:9584–9589.

    Schneider, C., C. L. Will, O. V. Makarova, E. M. Makarov, and R. Luhrmann. 2002. Human U4/U6.U5 and U4atac/U6atac.U5 tri-snRNPs exhibit similar protein compositions. Mol. Cell. Biol. 22:3219–3229.

    Selenko, P., G. Gregorovic, R. Sprangers, G. Stier, Z. Rhani, A. Kramer, and M. Sattler. 2003. Structural basis for the molecular recognition between human splicing factors U2AF65 and SF1/mBBP. Mol. Cell 11:965–976.

    Simpson, A. G., and A. J. Roger. 2002. Eukaryotic evolution: getting to the root of the problem. Curr. Biol. 12:R691–R693.

    Snel, B., P. Bork, and M. Huynen. 2000. Genome evolution: gene fusion versus gene fission. Trends Genet. 16:9–11.

    Sober, E., and M. Steel. 2002. Testing the hypothesis of common ancestry. J. Theor. Biol. 218:395–408.

    Stechmann, A., and T. Cavalier-Smith. 2002. Rooting the eukaryote tree by using a derived gene fusion. Science 297:89–91.

    Steel, M. A., and D. Penny. 2004. Two further links between MP and ML under the Poisson model. Appl. Math. Lett. 17:785–790.

    ———. 2005. Maximum parsimony and the phylogenetic information in multistate characters. Pp. 163–178 in V. Albert, ed. Parsimony, phylogeny and genomics. Oxford University Press.

    Stevens, S. W., I. Barta, H. Y. Ge, R. E. Moore, M. K. Young, T. D. Lee, and J. Abelson. 2001. Biochemical and genetic analyses of the U5, U6, and U4/U6 x U5 small nuclear ribonucleoproteins from Saccharomyces cerevisiae. RNA 7:1543–1553.

    Tschudi, C., and E. Ullu. 2002. Unconventional rules of small nuclear RNA transcription and cap modification in trypanosomatids. Gene Expr. 10:3–16.

    Tuffley, C., and M. Steel. 1997. Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull. Math. Biol. 59:581–607.

    Valadkhan, S., and J. L. Manley. 2001. Splicing-related catalysis by protein-free snRNAs. Nature 413:701–707.

    Vandenberghe, A. E., T. H. Meedel, and K. E. Hastings. 2001. mRNA 5'-leader trans-splicing in the chordates. Genes Dev. 15:294–303.

    Vivares, C. P., M. Gouy, F. Thomarat, and G. Metenier. 2002. Functional and evolutionary analysis of a eukaryotic parasitic genome. Curr. Opin. Microbiol. 5:499–505.

    Wang, W., H. J. Yu, and M. Y. Long. 2004. Duplication-degeneration as a mechanism of gene fission and the origin of new genes in Drosophila species. Nat. Genet. 36:523–527.

    Wilihoeft, U., E. Campos-Gongora, S. Touzni, I. Bruchhaus, and E. Tannich. 2001. Introns of Entamoeba histolytica and Entamoeba dispar. Protist 152:149–156.

    Will, C. L., C. Schneider, M. Hossbach, H. Urlaub, R. Rauhut, S. Elbashir, T. Tuschl, and R. Luhrmann. 2004. The human 18S U11/U12 snRNP contains a set of novel proteins not found in the U2-dependent spliceosome. RNA 10:929–941.

    Will, C. L., C. Schneider, A. M. MacMillan, N. F. Katopodis, G. Neubauer, M. Wilm, R. Luhrmann, and C. C. Query. 2001. A novel U2 and U11/U12 snRNP protein that associates with the pre-mRNA branch site. EMBO J. 20:4536–4546.

    Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555–556.

    Zhou, Z., L. J. Licklider, S. P. Gygi, and R. Reed. 2002. Comprehensive proteomic analysis of the human spliceosome. Nature 419:182–185.

    Zhu, W., and V. Brendel. 2003. Identification, characterization and molecular phylogeny of U12-dependent introns in the Arabidopsis thaliana genome. Nucleic Acids Res. 31:4561–4572.

    (Lesley Collins and David )