Long Perfect Dinucleotide Repeats Are Typical of Vertebrates, Show Motif Preferences and Size Convergence
     Instituto Gulbenkian de Ciência, Oeiras, Portugal

    E-mail: cpenha@igc.gulbenkian.pt.

    Abstract

    Microsatellites are simple sequence repeats (SSRs) showing complex patterns of length, motif sizes, motif sequences, and repeat perfection. We studied the structure of the dinucleotide SSR population at the genome level by analyzing assembled DNA sequence across species. Three dinucleotide populations were distinguished when SSR genome frequency was analyzed as a function of repeat length and repeat perfection. A population of low-perfection SSRs was identified, which consists of short repeats and represents the vast majority of genomic dinucleotide SSRs across eukaryotic genomes. In turn, the highly perfect repeats are 30 to 50 times less frequent and, in addition to short repeats, also contain a long repeat population that is uniquely represented in vertebrate species. Distinctive features of this population include the modal peak in the frequency distribution of repeat length and the strong preferential usage of the repeat motifs AC and AG. These results raise the hypothesis that the ability to carry a distinct population of long, highly perfect dinucleotide repeats in the genome is a late acquisition in chordate evolution. Our analysis also suggests that different dinucleotide repeat populations have different dynamics and are likely to be underlain by different molecular mechanisms of generation and maintenance in the genome. Thus, these observations imply that caution should be taken in extrapolating results from studies on SSR mutability and on SSR phylogenetic comparisons that do not take into account the stratification of dinucleotide populations in the eukaryotic genome.

    Key Words: microsatellites • dinucleotide population • identity • repeat motif

    Introduction

    Microsatellites, or simple sequence repeats (SSRs), are generally defined as genetic loci where one to six nucleotides are repeated in tandem. SSRs are present in eukaryotic genomes at higher frequencies than would be expected from random distribution, based on nucleotide composition (Tautz and Renz 1984; Metzgar, Bytof, and Wills 2000). The genomic abundance of the SSRs and their high degree of polymorphism have prompted many investigators to propose mechanisms of microsatellite mutability to explain observed patterns of SSR evolution across species and SSR variability within populations (Li et al. 2002). It is commonly assumed that SSR maintenance in the genome depends on a balance between the addition and deletion of repeat units, through replication slippage, and the correction of these errors by proofreading and mismatch repair (MMR) enzymes (Harfe and Jinks-Robertson 2000). In addition, point mutations could have an effect in limiting the expansion of microsatellites, by degrading perfect repeat sequences (Kruglyak et al. 2000). In turn, slippage is proposed to have the opposite effect, by purifying microsatellites of these interruptions (Harr and Schlötterer 2000). Recombination mechanisms are also suggested to play a role in SSR mutation, by unequal crossing over or gene conversion (Majewski and Ott 2000). However, a number of mutation biases related to the nature of the SSRs have been observed, including the increase of mutation rate with allele size and the higher rates of contraction for long alleles (Xu et al. 2000; Huang et al. 2002). These observations suggest that the mechanisms of SSR mutability are complex and may depend on several factors, including the structure of the SSRs.

    Studies on microsatellite variability are frequently focused on very long, highly mutable loci, which are unrepresentative of the majority of simple repeats in the genome. On the other hand, several studies on microsatellite evolution compare closely related species representing very limited branching of the phylogenetic tree. In fact, the comparisons of microsatellite length and structure at orthologous loci in related species tend to analyze sets of loci selected on the basis of high polymorphism in the species from which they were isolated. As reasoned by H. Ellegren and colleagues (Webster, Smith, and Ellegren 2002), such interspecific comparisons are likely to be flawed due to ascertainment bias (Fitzsimmons, Moritz, and Moore 1995; Forbes et al. 1995; Ellegren et al. 1997). In addition, the criteria used for defining a microsatellite locus differ from one study to another and are not always clearly stated, making it difficult to compare results and to evaluate claimed findings. Most studies only considered completely perfect repeats, although it has been observed that several long repeats contain one or a few base substitutions, causing them to be counted as two or more shorter repeats. For instance, long alleles of CA repeats were found to have increased stability due to a TA interspersion (Bacon, Farrington, and Dunlop 2000), but there are few studies allowing for mismatches when scanning for SSRs (Katti, Ranjekar, and Gupta 2001). Here, we pursued an in silico analysis of the population structure of dinucleotide repeats across eukaryotic species, repeat motifs, and repeat lengths, to which we added an extra dimension by searching for differences in repeat perfection. To capture the bulk of the dinucleotide repeat population we aimed to work without a priori repeat selection criteria and used assembled DNA sequence available from current genome projects. This analysis revealed the existence of distinct groups of dinucleotide repeats that have specific structural properties and particular phylogenetic distributions.

    Methods

    Genomic DNA Sequence

    All genome sequences were downloaded in FASTA format from the National Center for Biotechnology Information (NCBI) database, ftp://ftp.ncbi.nih.gov/genbank/ (table 1).

    Table 1 Length of DNA Sequences Scanned for Each Species.

    Microsatellite Identification

    Sequences were scanned for dinucleotide tandem repeats with etandem software, part of the EMBOSS package (http://www.emboss.org/), with a threshold score of 4, allowing identification of all SSRs spanning at least 14 nucleotides, with a minimum identity of 70%. These minimum values were set based on practicality (i.e., the time and processing power it would take to include smaller repeats) and also because accepting short sequences with a low threshold of identity would risk the inclusion of sequence data that do not represent SSRs. In particular, due to the imposed identity minimum, two dinucleotide repeated stretches separated by a large interruption are counted as two repeats (see table 2) and are assumed to behave independently of one another. For simplicity, (AC)n, (CA)n, (GT)n, and (TG)n were grouped together as they represent the same sequence, either inverted or on a complementary strand. Similarly, two further groups were considered for all variations of (AG)n and (AT)n, i.e., AG, GA, CT, TC and AT, TA, respectively. The rare (CG)n repeats were ignored.
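    The motif grouping described above can be illustrated with a minimal Python sketch. It is not part of etandem or EMBOSS; the function names `reverse_complement` and `motif_group` are hypothetical, and the rule used (take the alphabetically smallest of the unit, its phase shift, and their reverse complements) is simply one convenient way to reproduce the AC, AG, and AT groups.

```python
# Hypothetical helpers (not from EMBOSS): collapse a dinucleotide unit into a
# group that is independent of strand and of repeat phase.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit: str) -> str:
    """Return the reverse complement of a DNA string."""
    return unit.translate(_COMPLEMENT)[::-1]


def motif_group(unit: str) -> str:
    """Map a dinucleotide unit to its group label (AC, AG, AT, or CG)."""
    unit = unit.upper()
    if len(unit) != 2 or set(unit) - set("ACGT") or unit[0] == unit[1]:
        raise ValueError(f"not a dinucleotide repeat unit: {unit!r}")
    rc = reverse_complement(unit)
    # For a 2-nt unit, reversing the string is the same as shifting the phase
    # by one base (AC -> CA), so these four strings cover every reading.
    candidates = {unit, unit[::-1], rc, rc[::-1]}
    return min(candidates)


if __name__ == "__main__":
    for u in ("AC", "CA", "GT", "TG", "GA", "TC", "TA", "GC"):
        print(u, "->", motif_group(u))
```

    In this sketch, CG and GC map to a CG group, which a caller could then discard to mirror the exclusion of (CG)n repeats.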

    Table 2 Length of Repeats and Identity Scoring for Illustrative SSRs.

    A parallel analysis was made using the software Tandem Repeats Finder (Benson 1999), which identifies SSRs based on a different scanning approach. Using the parameters +2 for match score and –3 for mismatch and indel penalty allowed us to identify SSRs that span at least 20 nucleotides and have identity higher than 80%. The results of this analysis are in agreement with the results presented in this report and validate the SSR identification methodology.

    Results

    Low-Identity Repeats

    To investigate the heterogeneity of dinucleotide repeats across the eukaryotic genomes, we performed an analysis of significant amounts of assembled genomic sequence (table 1). We show the results of analyzing two mammals (human and mouse), one fish (Fugu rubripes), one invertebrate chordate (Ciona intestinalis), one insect (Drosophila melanogaster), and one nematode (Caenorhabditis elegans). The analysis incorporated three different parameters simultaneously: (1) the frequency of SSRs per megabase of DNA; (2) the number of repeated dinucleotide units; and (3) the identity of the repeated sequence. Identity is a measure of repeat perfection and was defined as the quotient between the length of repetitive DNA and the total size of the SSR. The length of repetitive DNA is found by counting, within the SSR, all the nucleotides belonging to a given repeat motif. This means that interspersed nucleotides will decrease identity (table 2).
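    As an illustration of the identity measure defined above, the following Python sketch counts the nucleotides of an SSR that lie in tandem copies of the motif (in either phase) and divides by the total SSR length. This is one possible reading of the definition, not the scoring used internally by etandem; the function name `identity` and the phase handling are assumptions made for the example.

```python
import re


def identity(ssr: str, motif: str) -> float:
    """Fraction of nucleotides in `ssr` covered by tandem copies of `motif`.

    Assumption: a nucleotide counts as repetitive if it lies inside a run of
    the motif read in either phase (e.g. AC or CA). Interspersed bases are
    left uncovered and therefore lower the identity.
    """
    ssr, motif = ssr.upper(), motif.upper()
    if not ssr:
        raise ValueError("empty SSR")
    if len(motif) != 2:
        raise ValueError("motif must be a dinucleotide")
    covered = [False] * len(ssr)
    for unit in (motif, motif[::-1]):  # both phases of the dinucleotide
        for match in re.finditer(f"(?:{unit})+", ssr):
            for i in range(match.start(), match.end()):
                covered[i] = True
    return sum(covered) / len(ssr)


if __name__ == "__main__":
    print(identity("ACACACACACACAC", "AC"))  # perfect repeat
    print(identity("ACACACTACACAC", "AC"))   # one interspersed T
```

    Under this reading, a perfect (AC)7 stretch scores 1.0, while the same kind of stretch interrupted by a single T scores 12/13, because the interspersed base adds to the SSR length without adding to the repetitive length.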

    Our analysis clearly shows that, in all species analyzed, the vast majority of dinucleotide SSRs are shorter than 15 repeat units and under 97% identity (fig. 1). Their frequency in the eukaryotic genome is fairly constant, varying within the range of 500 to 1,000 per Mb, depending on the species considered (table 3). These SSRs were here classified as low-identity repeats and represent a population in which the frequency distribution decays exponentially with both repeat length and repeat identity. To ensure that the low-identity repeats were not a set of random sequences, we performed several analyses with randomly generated sequences. Such simulations led to the conclusion that low-identity microsatellites occurred in the genome at much higher frequencies than would be expected by chance (data not shown).
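    The random-sequence control is not described in detail in the text, so the following Python sketch only illustrates the general idea under stated assumptions: uniform base composition, a fixed seed, and detection limited to perfect dinucleotide runs of at least 7 units (14 nucleotides). It does not reproduce the 70% identity scan performed with etandem, and all names in it are hypothetical.

```python
import random
import re

# A perfect dinucleotide run: two different bases followed by at least six
# more copies of the same pair (7 units, i.e. 14 nucleotides or more).
PERFECT_RUN = re.compile(r"([ACGT])(?!\1)([ACGT])(?:\1\2){6,}")


def count_perfect_runs(seq: str) -> int:
    """Count non-overlapping perfect dinucleotide runs of >= 7 units."""
    return sum(1 for _ in PERFECT_RUN.finditer(seq.upper()))


def runs_per_megabase(seq: str) -> float:
    """Frequency of perfect runs per megabase of scanned sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return count_perfect_runs(seq) / (len(seq) / 1e6)


def random_sequence(length: int, seed: int = 0) -> str:
    """Random DNA with equal base frequencies (an assumption of this sketch)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


if __name__ == "__main__":
    control = random_sequence(1_000_000)
    print("perfect runs per Mb in random DNA:", runs_per_megabase(control))
```

    A control closer to the analysis in the text would match the nucleotide composition of each genome and scan the simulated sequence with the same software and thresholds used for the real data.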

    FIG. 1. Distribution of dinucleotide repeats drawn from genomic DNA of the indicated species. Frequency represents the number of repeats per megabase of analyzed genomic DNA and is presented as a function of both the number of repeat units (repeat count) and the percent of repeated sequence (identity) found across the sequence of each SSR

    Table 3 Repeat Frequency Across Three Populations of Dinucleotide Repeatsa.

    High-Identity Repeats

    The dinucleotide SSRs showing identity above 97% have a relatively low frequency in the genome, as they are 30 to 50 times less frequent than the low-identity repeats (fig. 2). The high-identity SSRs were detected in all the eukaryotic genomes analyzed, from yeast to human. Among the nonvertebrate species, the frequency of high-identity SSRs decays exponentially with repeat length, along a curve similar to that of the low-identity repeats (fig. 2). This suggests that the mechanisms that limit the elongation of low- and high-identity repeats could be similar in nonvertebrate species. We noted that a long high-identity SSR population was present in the genome of vertebrates like human, mouse, and fish (Fugu rubripes). Strikingly, this population was virtually absent in the chordate C. intestinalis, in the fly D. melanogaster, and in the nematode C. elegans (fig. 2). This long high-identity population was also evident in the rat and in the fish Danio rerio, but was not present in the mosquito Anopheles gambiae, in the plant Arabidopsis thaliana, or in the yeast Saccharomyces cerevisiae (data not shown). When identity is included as a variable in this analysis, the long high-identity repeats show a population characterized by a repeat count higher than 15 and an identity higher than 91% (fig. 3). As the long high-identity dinucleotide repeats were found in the genome of vertebrates, but not in the genome of lower species, we suggest that the ability to generate and carry long, perfect SSRs was a late acquisition in eukaryotic genome evolution, within the chordate phylum.

    FIG. 2. Distribution of high-identity and low-identity dinucleotide repeats drawn from genomic DNA of the indicated species. Frequency represents the number of repeats per megabase of analyzed genomic DNA and is presented as a function of the number of repeat units (repeat count). Repeats up to 91% identity (low identity) are represented by a dashed line and repeats higher than 97% identity (high identity) are represented by a solid line

    FIG. 3. Distribution of long dinucleotide repeats drawn from genomic DNA of the indicated species. Frequency represents the number of repeats per megabase of analyzed genomic DNA and is presented as a function of both the number of repeat units (repeat count) and the percent of repeated sequence (identity) found across the sequence of each SSR

    Regardless of repeat perfection, the short repeat populations (under 15 repeat units) follow a power distribution in the species analyzed and the frequency of both low-identity and high-identity SSRs exponentially decays as repeat length increases (fig. 2). In contrast, the long high-identity SSRs display a modal peak in the frequency distribution (fig. 3). In fact, the average size of the long high-identity repeats varies from 21 to 25 repeat units in vertebrate species, suggesting that the convergence mechanisms constraining this SSR population are conserved. These observations suggest that the long high-identity SSRs are under distinct population dynamics and that the mechanisms underlying the generation and maintenance of these repeats are virtually absent in nonvertebrates.
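    The stratification described in this section can be summarized as a simple decision rule. The Python sketch below is illustrative only: the cut-offs (15 repeat units, 97% and 91% identity) are taken from the text, but how boundary cases were handled is not stated, so the treatment of repeats falling outside the three populations is an assumption, as is the function name `classify_ssr`.

```python
def classify_ssr(repeat_count: int, identity: float) -> str:
    """Assign a dinucleotide SSR to one of the three populations.

    `identity` is a fraction between 0 and 1. Thresholds follow the text;
    anything not covered by them is returned as "unclassified", which is a
    choice made for this sketch rather than a rule from the analysis.
    """
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must be between 0 and 1")
    if repeat_count > 15 and identity > 0.91:
        return "long high-identity"
    if repeat_count < 15 and identity > 0.97:
        return "short high-identity"
    if repeat_count < 15:
        return "low-identity"
    return "unclassified"


if __name__ == "__main__":
    for count, ident in [(8, 0.80), (8, 1.00), (22, 0.95), (20, 0.80)]:
        print(count, ident, "->", classify_ssr(count, ident))
```

    Counting the repeats in each class per megabase of scanned sequence would give per-population frequencies of the kind the text compares across species.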

    Repeat Motif Usage

    The low-identity repeats show no clear preferential use of repeat motifs, although it was noted that nonvertebrate species use AT repeats more frequently (table 3). We did not find any common trend in motif usage among nonvertebrate species when comparing low- and high-identity repeats. However, within each of the vertebrate species, the relative usage of repeat motifs in high-identity repeats differs markedly from that observed for low-identity repeats. Specifically, the use of non-AT repeats is increased both in short and long high-identity repeats, when comparing them to the low-identity SSRs (table 3). Our results clearly show that AC dinucleotide repeats are the major contributors to the long high-identity population in human, mouse, and fish (table 3 and fig. 4). AG and AT repeats were also present in the mouse but are almost absent in human and fish. This indicates that the composition of high-identity repeats in vertebrate species is influenced by the strong motif preference of the long perfect repeats and supports the notion that the mechanisms of generation of long high-identity repeats may have specific requirements. Together, these findings indicate that low- and high-identity populations have different structures, possibly with distinct mechanisms of generation and maintenance, which can be linked to the phylogenetic tree.

    FIG. 4. Distribution of long dinucleotide repeats with selected repeat motifs drawn from genomic DNA of the indicated species. Frequency represents the number of repeats per megabase of analyzed genomic DNA and is presented as a function of both the number of repeat units (repeat count) and the percent of repeated sequence (identity) found across the sequence of each SSR. Repeats were grouped as described in the methods section

    Discussion

    In contrast with previous analyses of dinucleotide repeats, which focused on perfect or near perfect repeats, we used the repeat identity as a variable in our analysis, together with repeat frequency and repeat length. This allowed us to discriminate three populations of dinucleotide repeats and to reveal a complex stratification of these SSRs in the genome. These dinucleotide populations display markedly different frequency in the genome, length distribution, and repeat motif usage, suggesting that distinct mechanisms are at play in their generation, expansion, and maintenance. The identification of a population of long high-identity repeats in mammals and fish, but not in nonvertebrate species, suggests that the ability to carry, generate, and/or maintain these SSRs is a late acquisition in chordate genome evolution. When scanning the genome for SSRs (especially in human studies), this population could be easily contaminated by other dinucleotide SSRs if locus selection does not use a stringent identity criterion, or if short repeats are included (see figs. 2 and 3). The identification of this population is in accordance with the findings by Xu et al. (2000), who reported an increase in contraction rates with allele length and suggested this mechanism might explain the stationary allele distribution of microsatellites from different species. Such a mechanism would also explain the modal peak in the population of long, perfect dinucleotides observed here.

    The findings reported here raise the issue that many studies, which do not take into account the stratification of the dinucleotide SSR population, may be flawed due to ascertainment bias or confounding and should be analyzed carefully. In fact, many discrepancies have been reported on SSR mutation rates within and between species, and between types of repeat motifs (Weber and Wong 1993; Chakraborty et al. 1997; Kruglyak et al. 1998; Schug et al. 1998). For instance, studies focused on Drosophila dinucleotide repeats to calculate mutation rates (Bachtrog et al. 2000) and to study repeat frequency (Katti, Ranjekar, and Gupta 2001) may lead to illegitimate interpretations when establishing comparisons with mammalian genomes that carry high-identity repeats.

    A similar argument can be made about theoretical models of SSR mutability that are applied to calculate time of divergence between species, among vertebrates and nonvertebrates. If the SSRs used in these calculations are drawn from SSR populations with different dynamics, this could partly explain the inconsistency of the results obtained (Calabrese, Durrett, and Aquadro 2001; Zhivotovsky, Rosenberg, and Feldman 2003). If there are mechanisms stabilizing the long high-identity SSRs and causing their length to converge, this effect might counter the slippage mutations on which the rationale for determining the distance between species or strains is based. Calabrese, Durrett, and Aquadro (2001) found that two different methods for determining genetic distances, based on microsatellite lengths, systematically underestimate divergence time. We noted that the markers used by these authors to compare human/chimp populations mostly fall in the 15–30 repeat units range, where the distribution is normal and repeat lengths are likely to be under convergence constraints.

    Another example of SSR ascertainment biases comes from studies based on microsatellite markers that are identified by laboratory methods that strongly favor the collection of perfect repeats. For instance, we verified that the average identity of mouse AC dinucleotides, in the Whitehead Institute/MIT database, is 96% and the length frequency distribution is similar to that observed for the long high-identity repeats. Several studies consistently showed that microsatellite markers, when isolated in one organism, tend to be longer than their orthologs in other organisms. This bias is likely caused by the laboratory methods that favor the selection of polymorphic markers and may be reflected in studies on SSR evolution using such genetic markers (Ellegren et al. 1997).

    In addition, a significant number of studies on microsatellites consider only perfect repeats, although it has been observed that several long repeats contain one or a few base substitutions, causing them to be counted as two or more shorter repeats (Katti, Ranjekar, and Gupta 2001). Such ascertainment methods artificially reduce the size of the repeats under analysis. In fact, we show that the perfection among the long high-identity SSR population varies from 91% to 100%. We treated SSRs of different motifs as independent repeats, although sometimes they may occur close together (see table 2). Such compound repeats do not fall into the identity criteria used in our analysis. However, the analysis of compound repeats has seldom been addressed, and it is a field for further investigation as it may indicate specific mechanisms underlying the generation of these repeats.

    Our findings raise the hypothesis that long high-identity SSRs represent a late acquisition of the eukaryotic genome. Several mechanisms have been proposed to regulate the evolution of microsatellites including a DNA repair mechanism involving strand discrimination that would be responsible for the equilibrium of SSR distributions (Xu et al. 2000). In fact, slippage rates alone cannot explain the overall mutation rate of these loci, which can be the opposite of the primary slippage mutation rate (Harr, Todorova, and Schlötterer 2002). A mechanism of strand discrimination could account for the presence of high-identity SSRs in human and mouse and it could also explain the absence of this SSR population in organisms like D. melanogaster, whose DNA is not methylated, except in embryonic stages (Lyko, Ramsahoye, and Jaenisch 2000). It is possible that the lack of a strand discrimination mechanism might lead to simple deletion of any loops formed during replication, thus causing Drosophila microsatellites to be generally shorter and to lack the high-identity population. Recent studies suggested the need for supplementary mutation mechanisms, other than the slippage/point mutation model, to explain differences in the distribution of dinucleotide populations with different length and with low identity (Dieringer and Schlötterer 2003; Sibly et al. 2003).

    Another possibility is that the size of the SSRs could be regulated by mismatch repair (MMR) enzymes that have not been isolated in organisms where long high-identity SSRs are absent. We noted that as opposed to human and mouse, the MLH3 and MSH3 orthologs of MutL and MutS (part of the Escherichia coli MMR system) have not been identified so far in the fly and C. elegans (Marti, Kunz, and Fleck 2002). These MMR enzymes are apparently involved in repairing frameshift intermediates and insertion/deletion loops and tend to cause insertion mutations (Harfe and Jinks-Robertson 2000; Kearney et al. 2001).

    In this context, the mismatch repair system is a good candidate to modulate the outcome of the mutation process and regulate the size of SSRs. Therefore, we speculate that components of the MMR machinery, like the MSH3 protein, may lead to the production of the observed long high-identity repeats.

    An alternative explanation for the existence of the normally distributed SSR population might be the selection of some function that could be assigned to long high-identity SSRs sized within a certain range. Candidate functions include influence on recombination, gene expression and transcription, perhaps by enabling DNA to assume specific conformational changes (Li et al. 2002).

    Acknowledgements

    We acknowledge Instituto Gulbenkian de Ciência for financing our work and the Fundação para a Ciência e Tecnologia for supporting Paulo Almeida and Carlos Penha-Gonçalves.

    Literature Cited

    Bachtrog, D., M. Agis, M. Imhof, and C. Schlötterer. 2000. Microsatellite variability differs between dinucleotide repeat motifs. Mol. Biol. Evol. 17:1277-1285.

    Bacon, A., S. Farrington, and M. Dunlop. 2000. Sequence interruptions confer differential stability at microsatellite alleles in mismatch repair-deficient cells. Hum. Mol. Genet. 9:2707-2713.

    Benson, G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573-580.

    Calabrese, P., R. Durrett, and C. Aquadro. 2001. Dynamics of microsatellite divergence under stepwise mutation and proportional slippage/point mutation models. Genetics 159:839-852.

    Chakraborty, M., M. Kimmel, D. Stivers, L. Davidson, and R. Deka. 1997. Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. Proc. Natl. Acad. Sci. USA 94:1041-1046.

    Dieringer, D., and C. Schlötterer. 2003. Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res. 13:2242-2251.

    Ellegren, H., S. Moore, N. Robinson, K. Byrne, W. Ward, and B. Sheldon. 1997. Microsatellite evolution—a reciprocal study of repeat lengths at homologous loci in cattle and sheep. Mol. Biol. Evol. 14:854-860.

    Fitzsimmons, N., C. Moritz, and S. Moore. 1995. Conservation and dynamics of microsatellite loci over 300 million years of marine turtle evolution. Mol. Biol. Evol. 12:432-440.

    Forbes, S., J. Hogg, F. Buchanan, A. Crawford, and F. Allendorf. 1995. Microsatellite evolution in congenic mammals: domestic and bighorn sheep. Mol. Biol. Evol. 12:1106-1113.

    Harfe, B., and S. Jinks-Robertson. 2000. Mismatch repair proteins and mitotic genome stability. Mutat. Res. 451:151-167.

    Harr, B., and C. Schlötterer. 2000. Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155:1213-1220.

    Harr, B., J. Todorova, and C. Schlötterer. 2002. Mismatch repair-driven mutational bias in D. melanogaster. Mol. Cell 10:199-205.

    Huang, Q., F. Xu, H. Shen, H. Deng, Y. Liu, Y. Liu, J. Li, R. Recker, and H. Deng. 2002. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70:625-634.

    Katti, M., P. Ranjekar, and V. Gupta. 2001. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18:1161-1167.

    Kearney, H., D. Kirkpatrick, J. Gerton, and T. Petes. 2001. Meiotic recombination involving heterozygous large insertions in Saccharomyces cerevisiae: formation and repair of large, unpaired DNA loops. Genetics 158:1457-1476.

    Kruglyak, S., R. Durrett, M. Schug, and C. Aquadro. 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774-10778.

    Kruglyak, S., R. Durrett, M. Schug, and C. Aquadro. 2000. Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol. Biol. Evol. 17:1210-1219.

    Li, Y., A. Korol, T. Fahima, A. Bailes, and E. Nevo. 2002. Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review. Mol. Ecol. 11:2453-2465.

    Lyko, F., B. Ramsahoye, and R. Jaenisch. 2000. DNA methylation in Drosophila melanogaster. Nature 408:538-540.

    Majewski, J., and J. Ott. 2000. GT repeats are associated with recombination on human chromosome 22. Genome Res. 10:1108-1114.

    Marti, T., C. Kunz, and O. Fleck. 2002. DNA mismatch repair and mutation avoidance pathways. J. Cell Physiol. 191:28-41.

    Metzgar, D., J. Bytof, and C. Wills. 2000. Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 10:72-80.

    Schug, M., C. Hutter, K. Wetterstrand, M. Gaudette, T. Mackay, and C. Aquadro. 1998. The mutation rates of di-, tri- and tetranucleotide repeats in Drosophila melanogaster. Mol. Biol. Evol. 15:1751-1760.

    Sibly, R., A. Meade, N. Boxall, M. Wilkinson, D. Corne, and J. Whittaker. 2003. The structure of interrupted human AC microsatellites. Mol. Biol. Evol. 20:453-459.

    Tautz, D., and M. Renz. 1984. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 12:4127-4138.

    Weber, J., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123-1128.

    Webster, M., N. Smith, and H. Ellegren. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc. Natl. Acad. Sci. USA 99:8748-8753.

    Xu, X., M. Peng, Z. Fang, and X. Xu. 2000. The direction of microsatellite mutations is dependent upon allele length. Nat. Genet. 24:396-399.

    Zhivotovsky, L., N. Rosenberg, and M. Feldman. 2003. Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am. J. Hum. Genet. 72:1171-1186.