
Evolution and Phylogenetic Utility of Alignment Gaps Within Intron Sequences of Three Nuclear Genes in Bumble Bees (Bombus)

Molecular Biology and Evolution, 2003, issue 1

    * Graduate School of Human and Environmental Studies

    Department of Zoology, Graduate School of Science, Kyoto University, Kyoto, Japan

    Department of Entomology, Comstock Hall, Cornell University, Ithaca, New York

    Sapporo Science and Technology College, Sapporo, Japan

    || Primate Research Institute, Kyoto University, Inuyama, Japan

    Abstract

    To test whether gaps resulting from sequence alignment contain phylogenetic signal concordant with that of base substitutions, we analyzed the occurrence of indel mutations upon a well-resolved, substitution-based tree for three nuclear genes in bumble bees (Bombus, Apidae: Bombini). The regions analyzed were exon and intron sequences of long-wavelength rhodopsin (LW Rh), arginine kinase (ArgK), and elongation factor–1 (EF-1) F2 copy genes. The LW Rh intron had only a few uninformative gaps, the ArgK intron had relatively long gaps that were easily aligned, and the EF-1 intron had many short gaps, resulting in multiple optimal alignments. The unambiguously aligned gaps within ArgK intron sequences showed no homoplasy upon the substitution-based tree, and phylogenetic signals within ambiguously aligned regions of the EF-1 intron were highly congruent with those of base substitutions. We further analyzed the contribution of gap characters to phylogenetic reconstruction by incorporating them in parsimony analysis. Inclusion of gap characters consistently improved support for nodes recovered by substitutions, and inclusion of ambiguously aligned regions of the EF-1 intron resolved several additional nodes, most of which were apical on the phylogeny. We conclude that gaps are an exceptionally reliable source of phylogenetic information that can be used to corroborate and refine phylogenies hypothesized by base substitutions, at least at lower taxonomic levels. At present, full use of gaps in phylogenetic reconstruction is best achieved in parsimony analysis, pending development of well-justified and generally applicable methods for incorporating indels in explicitly model-based methods.

    Key Words: arginine kinase • elongation factor–1 • long-wavelength rhodopsin • intron • phylogeny • bumble bee

    Introduction

    Phylogenetic analysis of nucleotide and amino acid sequence data often requires alignment of homologous sequences that vary in length. As a result, gaps are introduced to the data matrix, representing putative insertion or deletion events. As the products of particular evolutionary processes (mutations), indels are often considered as a class of phylogenetic characters to be incorporated in phylogenetic analysis or to be used to corroborate results derived from base substitutions. However, in most phylogenetic analyses gaps are ignored as missing data, or regions containing gaps are simply excluded from data sets. One reason for discarding gap characters in phylogenetic analyses is that use of gaps as characters is generally confined to the parsimony method, since well-justified and generally applicable methods for incorporating indels have yet to be developed for methods based on explicit models of sequence evolution, such as standard implementations of maximum likelihood. Another reason to exclude gaps is that their positions are often difficult to determine, especially when analyzing a broad range of taxa and/or highly diverged sequences. Phylogenetic reconstruction is highly sensitive to different alignment options (e.g., gap-to-substitution costs or alignment algorithms), which can lead to very different phylogenetic hypotheses. As a novel approach within the parsimony framework, Wheeler (1996) developed a direct optimization method for searching most parsimonious trees without prior sequence alignments. Subsequently, Wheeler (1999) and Lutzoni et al. (2000) independently developed similar methods for accommodating ambiguously aligned sequences without violating positional homology.

    Despite recent progress in analyzing gap characters, these have not been widely accepted as phylogenetic markers due, in part, to insufficient empirical study of the quality of gaps as characters. Some authors assume that gaps are less homoplastic and therefore more phylogenetically reliable than base substitutions, since gaps generally occur less frequently. However, others emphasize the potential for gaps to be misleading. Despite the need for further critical evaluation of gaps as phylogenetic characters, few studies have focused on investigating homoplasy levels of gaps as compared with base substitutions or assessing the contribution of gaps to phylogenetic resolution and nodal support. Additional empirical study of relative homoplasy levels among different types of gap characters and of the degree to which these contribute to phylogenetic reconstruction would facilitate the full and appropriate utilization of information potentially available in variable-length sequence data.

    In this study, we tested the potential of gaps as phylogenetic characters by analyzing their occurrence upon a well-resolved, substitution-based tree and assessing their contribution to further resolution among 66 species (23 subgenera) of bumble bees (Bombus). Bombus is a diverse, monophyletic genus comprising nearly 250 species (38 subgenera) and is therefore an ideal group for studying patterns of indel evolution among species and subgenera. We analyzed exon and intron sequences of three nuclear genes: long-wavelength rhodopsin (LW Rh), arginine kinase (ArgK), and elongation factor-1 F2 copy (EF-1).

    Materials and Methods

    A list of exemplar species studied, collection data for each exemplar, and GenBank accession numbers are given in supplementary online materials. Genomic DNA was extracted from thoracic muscles using the standard phenol-chloroform method. We PCR-amplified a total of ~2.4 kb of LW Rh, ArgK, and EF-1 F2 copy genes using the following forward (F) and reverse (R) primers: LW Rh, (F) 5'-AAT TGC TAT TAY GAR ACN TGG GT-3' and (R) 5'-ATA TGG AGT CCA NGC CAT RAA CCA-3'; ArgK, (F) 5'-GTT GAC CAA GCY GTY TTG GA-3' and (R) 5'-CAT GGA AAT AAT ACG RAG RTG-3' or (F) 5'-GA CAG CAA RTC TCT GCT GAA GAA-3' and (R) 5'-GGT YTT GGC ATC GTT GTG GTA GAT AC-3'; EF-1, (F) 5'-GGA CAC AGA GAT TTC ATC AAR AA-3' and (R) 5'-TTG CAA AGC TTC RTG RTG CAT TT-3'. PCR products were directly sequenced using the above primers.
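    The primers listed above contain IUPAC ambiguity codes (Y, R, N), so each one is a pool of oligonucleotides. The following is a minimal illustrative sketch, not part of the original methods, that counts how many distinct sequences a degenerate primer encodes. The function name degeneracy is hypothetical; the two primer strings are the LW Rh pair from the text.

```python
from math import prod

# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct oligonucleotides encoded by a degenerate primer."""
    bases = primer.replace(" ", "").upper()
    return prod(len(IUPAC[base]) for base in bases)


# LW Rh primers as given in the text (spaces are ignored).
lw_rh_forward = "AAT TGC TAT TAY GAR ACN TGG GT"
lw_rh_reverse = "ATA TGG AGT CCA NGC CAT RAA CCA"

for name, primer in [("LW Rh F", lw_rh_forward), ("LW Rh R", lw_rh_reverse)]:
    length = len(primer.replace(" ", ""))
    print(name, "length:", length, "degeneracy:", degeneracy(primer))
```

    The forward primer has one Y, one R, and one N (2 x 2 x 4 = 16 variants), and the reverse primer has one N and one R (4 x 2 = 8 variants).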

    For initial alignment, we used ClustalX version 1.81 (Jeanmougin et al. 1998) with the default parameter settings. The alignments obtained were then corrected manually for obvious misalignments. Sequence alignment for the introns within LW Rh was trivial and required only three simple gaps, which were parsimony-uninformative. However, alignment for introns within ArgK and EF-1 required gaps of various lengths, which differed markedly in terms of structural characteristics; relatively long gaps that were easily aligned occurred throughout ArgK intron sequences (fig. 1A), whereas numerous shorter gaps were required at five particular regions within EF-1 intron sequences, resulting in multiple optimal alignments (fig. 1B). This difference in gap characteristics provided an excellent opportunity to study different evolutionary modes of length mutations occurring within the diverse genus Bombus.

    Figure omitted.

    FIG. 1. A, Partial alignment of ArgK intron. Parsimony-informative gaps, treated as single indel mutations, are indicated by ===. B. nevadensis had an insertion of 323 bp replacing a 111–122 bp sequence present in other species. This sequence was apparently unrelated to the sequences at corresponding positions in other species, so we excluded the sequence and coded it as inapplicable in the alignment (poly-n). Similarly, B. mendax and B. defector shared identical 46 bp sequences that were ambiguously aligned with respect to other sequences of corresponding position (shaded sequences). Because inclusion of these sequences affected inference of indels in other species, we ignored these sequences in the alignment and instead interpreted their origin as a single insertion event (gap-MD). Only sequences that account for informative gaps are given, and only regions that contain these gaps are shown (separated by vertical bars). Numbers above sequences are positions within aligned intron sequence for the first and last codons of each segment. B, Partial alignment of EF-1 showing the entire intron sequences. Ambiguously aligned regions are indicated by ===. Examples of different delimitations that we tried are given for region 4 (a, b); the bottommost scheme was used in the analysis. Only sequences that account for alignment ambiguity are shown.

    All phylogenetic analyses were done using PAUP* version 4.0b10 (Swofford 2002). Many intron sequences of the outgroup, Trigona ventralis, were highly dissimilar to those of the ingroup. We therefore included only unambiguously determined regions (corresponding to 42% of the intron alignment) for the outgroup in the analysis. The results of the partition-homogeneity test (Farris et al. 1994), as implemented in PAUP*, suggested that phylogenetic signals within alignment-unambiguous regions of the three genes were highly congruent (P > 0.5 in 999 random partitionings for all data comparisons). We then performed a simultaneous analysis of the data set of substitution characters from all unambiguously aligned regions (with gaps treated as missing) to obtain a robust species relationship (the test phylogeny) using the maximum parsimony (MP) method. We conducted heuristic searches with 100 random addition analyses and tree bisection-reconnection (TBR) branch-swapping (steepest descent option in effect). In order to assess the robustness of the MP tree to the use of explicit models of sequence evolution, we also performed neighbor joining (NJ) and maximum likelihood (ML) analyses. We used the HKY85 model for distance correction in the NJ analysis. To search for an ML tree, we used the quartet puzzling option as implemented in PAUP* with the HKY85+Γ substitution model.

    We investigated the influence of gap characters on phylogenetic inference by conducting an additional parsimony analysis with gap information included as coded characters. Alignable gaps of ArgK intron were coded as binary (presence/absence) characters using the method of Simmons and Ochoterena (2000) and added to the data matrix. We weighted base substitutions and indel mutations equally, because there was no a priori reason to differentially weight gap costs relative to base substitutions. For the ambiguously aligned regions of EF-1 intron (fig. 1B), we employed the program INAASE (Integration of Ambiguously Aligned Sequences; Lutzoni et al. 2000) to accommodate these regions in parsimony analysis. Following the criteria detailed in Lutzoni et al. (2000), we delimited five ambiguously aligned regions within the EF-1 intron. However, coding one of these regions (region 4) resulted in a multistate character with more than 32 states, which exceeds the number of states that can be handled by PAUP*. We therefore divided the region into two subregions such that sequences within each subregion are likely homologous (region 4a and region 4b in fig. 1B). Several division schemes that we tried all yielded the same number of most-parsimonious trees of the same topology, indicating the validity of this approach.

    Results and Discussion
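    To illustrate the idea of coding alignable gaps as binary presence/absence characters, the following is a simplified sketch in the spirit of simple indel coding. It is an assumption-laden toy, not the procedure actually used in this study: the function names and the four-taxon alignment are hypothetical, terminal gaps are treated like internal ones, and a sequence whose longer gap spans a shorter gap is scored as inapplicable ("?") for that shorter gap.

```python
def gap_intervals(seq):
    """Return half-open (start, end) intervals for each run of '-' in seq."""
    intervals, start = [], None
    for i, ch in enumerate(seq):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(seq)))
    return intervals


def code_gaps(alignment):
    """Code every distinct gap (unique start and end) as one character.

    '1' = the sequence has exactly this gap
    '0' = the sequence has no gap covering this region
    '?' = a longer gap spans the region, so the state is inapplicable
    """
    per_taxon = {name: gap_intervals(seq) for name, seq in alignment.items()}
    gaps = sorted({g for gs in per_taxon.values() for g in gs})
    matrix = {}
    for name, gs in per_taxon.items():
        states = []
        for start, end in gaps:
            if (start, end) in gs:
                states.append("1")
            elif any(a <= start and end <= b for a, b in gs):
                states.append("?")
            else:
                states.append("0")
        matrix[name] = "".join(states)
    return gaps, matrix


# Hypothetical toy alignment (not data from this study).
toy = {
    "sp1": "ACGTTACGGA",
    "sp2": "ACG--ACGGA",
    "sp3": "ACG--ACGGA",
    "sp4": "AC-----GGA",
}
gaps, matrix = code_gaps(toy)
print(gaps)
for name, row in matrix.items():
    print(name, row)
```

    In this toy case the 2-bp gap shared by sp2 and sp3 becomes one parsimony-informative character, while the longer gap in sp4 is a separate character for which sp4 alone is scored as present.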

    Parsimony analysis of the unambiguously aligned region (2301 characters with 395 informative) resulted in 891 most-parsimonious trees of length 1338 (consistency index excluding uninformative characters [CI] = 0.47, retention index [RI] = 0.76, and rescaled consistency index [RC] = 0.45). The topology obtained (fig. 2) is, with few exceptions, consistent with traditional subgeneric classifications of bumble bees based on adult morphological characters. The topologies of the NJ and ML trees are similar to that of the MP tree (see supplementary online materials).

    Figure omitted.

    FIG. 2. Strict consensus of 891 most-parsimonious (MP) trees based upon base substitutions of the unambiguously aligned regions. The 16 unambiguously aligned indels within ArgK intron were mapped onto specific branches of the tree without homoplasy. Upward and downward arrowheads represent insertions and deletions, respectively. Numbers on arrowheads correspond to the gap codes in figure 1A. "MD" represents shaded sequences in figure 1A. Branches that collapse in the NJ and/or ML trees are presented as dotted lines. Nodal support is assessed by bootstrap values (above branches; MP/NJ/ML, respectively; shown only when >50%) and branch support indices (below branches).

    Overall, we inferred 16 parsimony-informative gaps within the ArgK intron (fig. 1A). All gaps were successfully assigned to specific branches of the MP tree without homoplasy (fig. 2), suggesting that the signal of gaps is concordant with that of base substitutions. This result holds for the NJ and ML trees with one exception. On the MP tree gap-4 was a synapomorphy uniting the species B. griseocollis, B. fraternus, B. crotchii, and B. rufocinctus, but monophyly of these four species was not recovered on the ML tree due to the sister relationship of B. rufocinctus to a species that lacks gap-4, B. wurflenii. Support in the ML analysis for paraphyly of the four species with gap-4 was ambiguous, as demonstrated by the extremely short lengths of the branches concerned. Therefore congruence of gap-4 with the phylogeny is not strongly rejected.
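    Mapping a binary gap character onto a tree "without homoplasy" means that the character needs only a single change on that tree. As an illustration only, the sketch below counts the minimum number of changes with Fitch parsimony on a rooted binary tree. The tree, taxon labels, and character states are hypothetical and are not taken from this study, and fitch_steps is a name chosen here rather than a PAUP* function.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character (Fitch parsimony).

    tree   : nested 2-tuples with taxon names (str) at the tips
    states : dict mapping taxon name -> character state
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):          # tip: its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed
            return common
        steps += 1                         # children disagree: one change
        return left | right

    state_set(tree)
    return steps


# Hypothetical five-taxon tree: (((A,B),C),(D,E))
tree = ((("A", "B"), "C"), ("D", "E"))

# Gap shared by the sister taxa A and B: one change, no homoplasy.
clean_gap = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}
# Gap present in the unrelated taxa A and D: two changes, homoplastic.
homoplastic_gap = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0}

print(fitch_steps(tree, clean_gap))
print(fitch_steps(tree, homoplastic_gap))
```

    A binary character that requires exactly one step on the tree is a homoplasy-free synapomorphy; any additional step indicates parallel gain or reversal.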

    Analysis of the data matrix including the 16 characters coding ArgK gaps ("GAP" characters throughout) resulted in the same shortest trees (1354 steps: CI = 0.48; RI = 0.77; RC = 0.46) as did the data matrix including only base substitutions. Inclusion of GAP characters (16 of 411 informative characters) did not contribute to resolution, but bootstrap values for the nodes concerned increased from >53% to >72%, and the total branch support increased from 323 to 335.

    The six multistate characters coding ambiguous regions of EF-1 ("INAASE characters" throughout) mapped onto the substitution-based tree with a total of 187 steps and a CI of 0.77 (RI = 0.87; RC = 0.67). This indicates that the INAASE characters contain rich information consistent with the phylogenetic signal of base substitutions. Simultaneous analysis of an expanded data matrix, including the 16 GAP characters and the six INAASE characters, resulted in four most-parsimonious trees of length 1535 (CI = 0.53; RI = 0.78; RC = 0.48) (supplementary online data) with lower homoplasy levels than the analysis based on substitutions and GAP characters. The six INAASE characters collectively had CI = 0.80, RI = 0.89, and RC = 0.71. Although the alignment-ambiguous EF-1 intron regions accounted for only ~5% of all sites in the original data matrix, INAASE characters added 181 steps to the tree (12% of the total steps) and contributed substantially to phylogenetic resolution, resulting in a drastic decrease in the number of shortest trees (from 891 to 4) and recovery of 62 of 64 possible nodes in the strict consensus tree (an addition of 9 nodes). INAASE characters also significantly improved nodal support, increasing average bootstrap values by 3.9% (based on nodes in common on the strict consensus trees), and summed branch support by 55 (from 335 to 390).
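    The homoplasy measures quoted above follow the standard definitions CI = m/s, RI = (g - s)/(g - m), and RC = CI x RI, where m is the minimum possible number of steps, s the observed number of steps on the tree, and g the maximum possible number of steps. The sketch below, offered only as an illustration, implements these formulas and checks the arithmetic of the figures reported in the text. The values of m, s, and g in the toy example are hypothetical. Note also that the whole-matrix RC values in the text do not equal the product of the reported CI and RI; this is presumably because the reported CI excludes uninformative characters whereas RC is based on the CI over all characters, but that explanation is an assumption.

```python
def consistency_index(min_steps, obs_steps):
    """CI = m / s."""
    return min_steps / obs_steps


def retention_index(min_steps, obs_steps, max_steps):
    """RI = (g - s) / (g - m)."""
    return (max_steps - obs_steps) / (max_steps - min_steps)


def rescaled_consistency_index(min_steps, obs_steps, max_steps):
    """RC = CI * RI."""
    return (consistency_index(min_steps, obs_steps)
            * retention_index(min_steps, obs_steps, max_steps))


# Hypothetical single character: m = 1, s = 2, g = 3.
toy_rc = rescaled_consistency_index(1, 2, 3)
print("toy RC:", toy_rc)

# INAASE characters: reported RC agrees with CI * RI to two decimals.
rc_on_substitution_tree = round(0.77 * 0.87, 2)   # reported RC = 0.67
rc_in_combined_analysis = round(0.80 * 0.89, 2)   # reported RC = 0.71
print(rc_on_substitution_tree, rc_in_combined_analysis)

# Tree lengths: 16 homoplasy-free GAP characters add one step each,
# and the INAASE characters add a further 181 steps.
length_with_gaps = 1338 + 16
length_with_inaase = length_with_gaps + 181
inaase_share_percent = round(100 * 181 / length_with_inaase)
print(length_with_gaps, length_with_inaase, inaase_share_percent)
```

    The increase of exactly 16 steps (1338 to 1354) for 16 binary GAP characters is itself the signature of zero homoplasy: each gap changes once on the tree.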

    Our analyses showed that gaps and ambiguously aligned regions of nuclear intron sequences contain useful phylogenetic signal concordant with that of base substitutions. Unambiguously aligned gaps exhibited minimal homoplasy and were consistently congruent with the substitution-based tree, whether derived from the MP, NJ, or ML method. Thus, our results reinforce several earlier suggestions that gaps are phylogenetically reliable characters. For example, all seven informative gaps longer than 1 bp within the ψη-globin pseudogene of primates have been assigned to specific branches of a substitution-based tree without homoplasy. Similarly, van Ham et al. (1994) showed that 13 of 15 informative gaps longer than 1 bp could be mapped onto a substitution-based trnL-trnF intergenic spacer gene tree in Crassulaceae plants without homoplasy. However, in these studies single-nucleotide indels were often homoplastic or ambiguously aligned. Multinucleotide indels are probably more reliable than single-nucleotide indels because the former are less frequent in many data sets and because homoplasies by parallel and back mutations can occur only when they match exactly in length and position (and sequence for insertions) with the corresponding indels.

    On the other hand, the results of several other studies contrast with those described above. For example, in an analysis of eukaryotic phylogeny, the observed pattern of amino acid indels within enolase and impdh genes likely resulted from recombination and lateral gene transfer as well as from convergence and reversal. In other cases, apparently identical indels seem to have originated independently in distantly related taxa. However, these results do not necessarily indicate that indels are generally of lesser utility than base substitutions, because homoplasy can be found in any class of molecular characters, especially when dealing with phylogeny on a large scale (e.g., across phyla). Considering these results together with ours, we conclude that indels can be highly reliable characters, especially at lower taxonomic levels, but recognize that gaps, like all classes of phylogenetic characters, are not devoid of homoplasy. It may therefore be inadvisable to identify higher monophyletic groups based solely on a single indel (or a few indels).

    Recent methodological progress in handling gap characters in phylogenetic analyses affords the opportunity to incorporate this useful phylogenetic information derived from sequence data. However, these new methods are confined to the parsimony optimality criterion, and a well-justified, statistically robust, generally applicable, and widely accepted method for incorporating indels within explicitly model-based phylogenetic methods such as standard implementations of ML is still lacking. The development of explicit models of indel evolution should not be particularly difficult, but selection of parameters to be included and statistical interpretation of resulting analyses will be far from straightforward. Even within the context of parsimony analysis, the difficulty of assuming gap-to-substitution costs or weighting gaps of different lengths has impeded extensive use of the methods. However, gaps are strong indicators of common descent, and full utilization of indel information to corroborate and refine phylogenies inferred primarily from substitution data would surely improve the accuracy and efficiency of phylogeny estimation for many data sets.

    Acknowledgements

    Supported by a Grant-in-Aid from the Japan Society for Promotion of Science (no. 11304056).

    Literature Cited

    Bapteste, E., and H. Philippe. 2002. The potential value of indels as phylogenetic markers: position of trichomonads as a case study. Mol. Biol. Evol. 19:972-977.

    Bremer, K. 1994. Branch support and tree stability. Cladistics 10:295-304.

    Danforth, B. N. 2002. Evolution of sociality in a primitively eusocial lineage of bees. Proc. Natl. Acad. Sci. 99:286-290.

    Farris, J. S. 1999. Likelihood and inconsistency. Cladistics 15:199-204.

    Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1994. Testing significance of incongruence. Cladistics 10:315-320.

    Giribet, G., and W. C. Wheeler. 1999. On gaps. Mol. Phylogenet. Evol. 13:132-143.

    Graham, S. W., P. A. Reeves, A. C. E. Burns, and R. G. Olmstead. 2000. Microstructural changes in noncoding chloroplast DNA: interpretation, evolution and utility of indels and inversions in basal angiosperm phylogenetic inference. Int. J. Plant Sci. 161:S83-S96.

    Jeanmougin, F., J. D. Thompson, M. Gouy, D. G. Higgins, and T. J. Gibson. 1998. Multiple sequence alignment with ClustalX. Trends Biochem. Sci. 23:403-405.

    Lloyd, D. G., and V. L. Calder. 1991. Multi-residue gaps, a class of molecular characters with exceptional reliability for phylogenetic analyses. J. Evol. Biol. 4:9-21.

    Lutzoni, F., P. Wagner, V. Reeb, and S. Zoller. 2000. Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol. 49:628-651.

    Mardulyn, P., and S. A. Cameron. 1999. The major opsin in bees (Insecta: Hymenoptera): a promising nuclear gene for higher level phylogenetics. Mol. Phylogenet. Evol. 12:168-176.

    Philippe, H., and J. Laurent. 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8:616-623.

    Rokas, A., and P. W. H. Holland. 2000. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15:454-459.

    Saitou, N., and S. Ueda. 1994. Evolutionary rates of insertion and deletion in noncoding nucleotide sequences of primates. Mol. Biol. Evol. 11:504-512.

    Sanchis, A., J. M. Michelena, A. Latorre, D. L. J. Quicke, U. Gardenfors, and R. Belshaw. 2001. The phylogenetic analysis of variable-length sequence data: elongation factor–1 introns in European populations of the parasitoid wasp genus Pauesia (Hymenoptera: Braconidae: Aphidiinae). Mol. Biol. Evol. 18:1117-1131.

    Sanderson, M. J., and J. Kim. 2000. Parametric phylogenetics? Syst. Biol. 49:817-829.

    Simmons, M. P., and H. Ochoterena. 2000. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49:369-381.

    Simmons, M. P., H. Ochoterena, and T. G. Carr. 2001. Incorporation, relative homoplasy, and effect of gap characters in sequence-based phylogenetic analyses. Syst. Biol. 50:454-462.

    Swofford, D. L. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Version 4.0. Sinauer, Sunderland, Mass.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.

    van Dijk, M. A., E. Paradis, F. Catzeflis, and W. W. de Jong. 1999. The virtues of gaps: Xenarthran (Edentate) monophyly supported by a unique deletion in αA-crystallin. Syst. Biol. 48:94-106.

    van Ham, C. H. J., H. 't Hart, T. H. M. Mes, and J. M. Sandbrink. 1994. Molecular evolution of noncoding regions of the chloroplast genome in the Crassulaceae and related species. Curr. Genet. 25:558-566.

    Wheeler, W. C. 1996. Optimization alignment: the end of multiple sequence alignment in phylogenetics? Cladistics 12:1-9.

    Wheeler, W. C. 1999. Fixed character states and the optimization of molecular sequence data. Cladistics 15:379-385.

    Williams, P. H. 1998. An annotated checklist of bumble bees with an analysis of patterns of description (Hymenoptera: Apidae, Bombini). Bull. Br. Mus. Nat. Hist. (Ent.) 67:79-152.

    (Atsushi Kawakita, Teiji Sota, John S. Ascher, Masao Ito, Hiroyuki Tanaka, and Makoto Kato)
