Source: 《分子生物学进展》, 2005, Issue 3
Survey of Simple Sequence Repeats in Completed Fungal Genomes
     * School of Molecular and Microbial Biosciences, University of Sydney, Sydney, Australia; Molecular Mycology Research Laboratory, CIDM, Westmead Hospital, Westmead, Australia; and Department of Medicine, Western Clinical School, University of Sydney, Sydney, Australia

    Correspondence: E-mail: w.meyer@usyd.edu.au.

    Abstract

    The use of simple sequence repeats (SSRs), or microsatellites, as genetic markers has become very popular because of their abundance and length variation between different individuals. SSRs are tandem repeat units of 1 to 6 base pairs that are found abundantly in many prokaryotic and eukaryotic genomes. This is the first study examining and comparing SSRs in completely sequenced fungal genomes. We analyzed and compared the occurrence, relative abundance, relative density, most common motifs, and longest tracts of SSRs in nine taxonomically different fungal species: Aspergillus nidulans, Cryptococcus neoformans, Encephalitozoon cuniculi, Fusarium graminearum, Magnaporthe grisea, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Ustilago maydis. Our analysis revealed that, in all of the genomes studied, the occurrence, abundance, and relative density of SSRs varied and were not influenced by the genome sizes. No correlation between relative abundance and the genome sizes was observed, although N. crassa, the largest genome analyzed, had the highest relative abundance of SSRs. In most genomes, mononucleotide, dinucleotide, and trinucleotide repeats were more abundant than SSRs with longer repeat units. Generally, in each organism, the occurrence, relative abundance, and relative density of SSRs decreased as the repeat unit increased. Furthermore, each organism had its own most common and longest SSRs. Our analysis showed that the relative abundance of SSRs in fungi is low compared with the human genome and that longer SSRs in fungi are rare. In addition to providing new information concerning the abundance of SSRs for each of these fungi, the results provide a general source of molecular markers that could be useful for a variety of applications such as population genetics and strain identification of fungal organisms.

    Key Words: simple sequence repeat • microsatellite • fungi

    Introduction

    Simple sequence repeats (SSRs), also known as microsatellites, are genetic loci made up of tandemly repeated units of 1 to 6 base pairs (bp) (Tautz and Renz 1984). SSRs are highly abundant and exhibit extensive levels of polymorphisms in eukaryotic (Weber 1990; Toth, Gaspari, and Jurka 2000; Katti, Ranjekar, and Gupta 2001) and prokaryotic (Field and Wills 1996; Gur-Arie et al. 2000) genomes. They are found in protein-coding and noncoding regions (Katti, Ranjekar, and Gupta 2001; Toth, Gaspari, and Jurka 2000), with SSRs being more abundant in noncoding regions than in exons (Hancock 1995). However, recent studies have shown that certain trinucleotides and hexanucleotides are more abundant in coding regions than in noncoding regions of higher eukaryotic genomes (Borstnik and Pumpernik 2002; Metzgar, Bytof, and Wills 2000; Subramanian et al. 2003). Variation of SSRs is mainly caused by slipped-strand mispairing and the resulting errors during DNA replication, repair, and recombination (Levinson and Gutman 1987; Schlötterer and Tautz 1992; Tautz and Schlötterer 1994). Sequence polymorphisms in these loci arise by insertion or deletion mutations of one or more repeat units (Tautz and Renz 1984). SSR loci have high mutation rates ranging from 10^-3 to 10^-6 per generation (Schug, Mackay, and Aquadro 1997; Weber and Wong 1993; Xu, Peng, and Fang 2000). By comparison, eukaryotic DNA sequences mutate at a rate of approximately 10^-9 per nucleotide per generation (Crow 1993). The mutation rate of SSRs generally increases with the number of repeat units (Wierdl, Dominska, and Petes 1997). Fixation of de novo generated SSRs is determined by the interplay between the repeat type, the genomic position of a specific SSR, and the genetic/biochemical background of the cell (Toth, Gaspari, and Jurka 2000). It has been shown that different taxa exhibit different preferences for SSR types. 
For instance, in plants AG/CT repeats are most abundant (Morgante, Hanafey, and Powell 2002), and in mammals, A and AC repeats are the most common motifs (Toth, Gaspari, and Jurka 2000). The differential abundance of repeats in different eukaryotic genomes led to the suggestion that strand-slippage theories alone are insufficient to explain characteristic SSR distributions (Toth, Gaspari, and Jurka 2000). Several studies have examined the relationship between SSR content and genome size. Morgante, Hanafey, and Powell (2002) found that the overall SSR abundance is inversely proportional to the genome size, whereas others have shown a positive correlation between SSR content and genome size (Primmer et al. 1997; Hancock 1999). Moreover, the abundance of different types of SSRs varies between different taxa (Hancock 1999).

    Because of their high mutability, SSRs are thought to play an active role in genome evolution by creating and maintaining genetic variation (Tautz, Trick, and Dover 1986). The length of SSRs in promoter regions may influence transcriptional activity (Kashi, King, and Soller 1997). They may also affect protein-protein interactions via the length of polyglutamine or polyproline tracts encoded by SSRs (Gerber et al. 1994). There is also evidence that some SSRs serve a functional role in regulation of gene expression (Kunzler, Matsuo, and Schaffner 1995) and in the evolution of gene regulation (Huang et al. 2003). Dynamic mutations in trinucleotide repeats within or near specific genes have been found to be associated with several neurodegenerative diseases (Jin and Warren 2000; Sermon et al. 2001) and some human cancers (Wooster et al. 1994). The high levels of polymorphism observed in SSRs and the relative ease of detecting these polymorphisms via PCR amplification have led to the widespread application of SSRs as genetic markers today.

    However, despite this widespread use, little is known about SSRs in fungi. In fact, there are only a limited number of studies on these seemingly important and intriguing sets of sequences in fungal species. So far, SSRs have been analyzed in detail in only two fungal species: Saccharomyces cerevisiae and Schizosaccharomyces pombe. That analysis indicated that the AT motif is the most abundant in fungal genomes (Toth, Gaspari, and Jurka 2000). SSRs have been used as genetic markers in numerous DNA-fingerprinting and PCR-fingerprinting experiments for strain typing of a variety of filamentous fungi and yeasts without prior knowledge of their abundance and distribution in the investigated fungal genomes (Lieckfeld et al. 1992; Meyer et al. 1991, 1997, 1999; Meyer, Maszewska, and Sorrell 2001; Meyer et al. 2003). A recent study proposes the use of SSRs in a number of cell wall proteins for strain typing of wine yeasts (Marinangeli et al. 2003). Beyond their use as molecular markers, knowledge of the abundance and density of SSRs in fungi may help to clarify whether these sequences have any functional and evolutionary significance. The study of absolute numbers of SSRs in fungal genomes may also address whether SSR abundance is a direct function of the genome size in these organisms, as has been suggested for higher eukaryotes (Toth, Gaspari, and Jurka 2000). Such understanding would foster the proper use of SSRs in future studies.

    Frequencies of various SSR sequences in different genomes were originally estimated via hybridization experiments (Tautz and Renz 1984; Panaud, Chen, and McCouch 1995) or database searches (Richard and Dujon 1996; Toth, Gaspari, and Jurka 2000). These studies were mainly based on the overrepresented coding regions and limited by the partial genomic sequences available. Large-scale genome sequencing initiatives on a growing number of organisms are now providing the opportunity to evaluate the abundance and relative distribution of SSRs in different genera based on the whole genome. The specific aim of this study was to determine the abundance and diversity of SSRs in fungal genomes and to compare them with other organisms. In the present study, we describe the genome-wide analysis of SSR sequences from nine completely sequenced fungal species representing phylogenetically diverse fungal genera. Specifically, data are presented for the abundance, density, most common, and longest SSRs to see whether the SSR content of a fungal genome correlates with its genome size, as previously reported for other organisms.

    Methods

    Genome Sequences

    Nine fungal species have been chosen on the basis of their completed and easily accessible genomic sequences in public databases (see table 1). The Saccharomyces cerevisiae genome was fully annotated, and the sequences of the Aspergillus nidulans, Cryptococcus neoformans, Encephalitozoon cuniculi, Fusarium graminearum, Magnaporthe grisea, Neurospora crassa, Schizosaccharomyces pombe, and Ustilago maydis genomes were completed, but not yet fully annotated. The analyzed sequences are less than the estimated genome sizes because regions such as telomeres and centromeres have not as yet been accounted for in these organisms. Files were obtained from the respective genome project Web sites for each organism in FASTA format. The fully annotated chromosomal sequences of S. cerevisiae (strain S288C) were retrieved from the S. cerevisiae genome database at Stanford University (Stanford, CA). Partially annotated C. neoformans (strain JEC21) was obtained from the C. neoformans genome project at the Institute for Genomic Research (Rockville, MD). The A. nidulans (strain FGSC-A4), F. graminearum (strain PH-1), M. grisea (strain 70-15), N. crassa (strain OR74A), and U. maydis (strain 521) sequences were available as annotated long contiguous scaffolds from the Whitehead Institute's Center for Genome Research (Cambridge, MA). Individual chromosomal sequences of E. cuniculi (strain GB-M1) were obtained from Genoscope (Evry Cedex, France). The S. pombe (strain Urs Leupold 972h–) chromosome sequences were retrieved from the S. pombe genome project site at the Sanger Institute (Hinxton, Cambridge, UK). All information concerning the Web sites used in this study is listed in table 1.

    Table 1 List of Analyzed Fungal Genomes, Their Web Sites, and Genome Sizes

    SSR Analysis

    SSRs in genomes were identified using a Python-based program that was specifically developed for this project. It is available for downloading from the Molecular Mycology Research Laboratory's homepage (http://www.mmrl.med.usyd.edu.au/ssr.html). It searches for each of the six SSR motif classes (mononucleotide to hexanucleotide), one at a time, with lengths of 10 bp or more. It records repeat number and genome location and reports the results in an output file. The data were processed and counted with Microsoft Excel version SR2. The redundancy of SSR sequences present was minimized by recording only a single match when there was more than one hit for the same SSR locus. No distinction was made between the occurrence of repeats in exons, introns, and intergenic regions. We present the total numbers for all perfect repeat types. These total numbers have been normalized either as a percentage or as the number of SSRs per Mb of sequence to allow comparison among genome sequences of different sizes (= relative abundance). The estimated repeat density (bp/Mb) for each genome was calculated by dividing the number of base pairs of sequence contributed by each SSR by the total sequence analyzed (Mb). As we scanned for dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats, we considered combinations involving runs of the same nucleotide to be unique and, therefore, did not exclude them from the final count. For tetranucleotide and hexanucleotide repeats, combinations representing perfect dinucleotide and trinucleotide repeats were filtered from the final counts; for example, a (GTGT)8 was deemed a (GT)16 dinucleotide and not a tetranucleotide repeat. In the current survey, each of the SSRs was considered unique and was not classified according to the assumption described by Jurka and Pethiyagoda (1995). 
They assumed that (AC)n is the same as (CA)n, (TG)n is the same as (GT)n, and (AGC)n is the same as (GCA)n, (CAG)n, (CTG)n, (TGC)n, and (GCT)n, in different reading frames or on the complementary strand. Consequently all theoretically possible 501 SSR (absolute frequency) types (Jurka and Pethiyagoda 1995) were analyzed for their occurrence, relative abundance per Mb, and density per Mb. We believe that scanning all possible combinations of SSRs for this study will lead to a better knowledge of total SSR occurrence, and their genomic locations will be extremely useful in selecting SSRs representative of similar repeat classes from different genomic locations as potential markers.
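The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' program: the names `find_perfect_ssrs` and `is_primitive` are hypothetical, the minimum repeat count per class is assumed to be the smallest whole number of units that reaches 10 bp, and any motif that is itself a whole-number repeat of a shorter unit (such as GTGT or AAA) is dropped, which may differ from how the original program treated runs of the same nucleotide.

```python
import re
from math import ceil

MIN_TRACT_BP = 10  # minimum tract length stated in the Methods


def is_primitive(motif):
    """Return True if the motif is not a whole-number repeat of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_perfect_ssrs(seq, max_unit=6, min_bp=MIN_TRACT_BP):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSR tracts.

    Each motif length from 1 to max_unit is scanned separately. A tract such
    as (GTGT)8 is reported once, as (GT)16, because the non-primitive motif
    GTGT is filtered out (an assumption of this sketch).
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # Assumed rule: at least two units and at least min_bp in total.
        min_repeats = max(2, ceil(min_bp / k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)
```

Start positions are 0-based offsets into the sequence string. Because each motif is kept in the reading frame in which it is first encountered, (AC)n and (CA)n remain separate types, in line with the decision above not to merge equivalent motifs.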

    Results

    We analyzed perfect SSRs over 10 bp long, from nine completely sequenced haploid fungal genomes, ranging from 2.5 Mb (E. cuniculi) to 43 Mb (N. crassa) (table 1). These fungal species represent phylogenetically diverse genera (see figure 1). The total frequency in each genome and relative abundance of mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides across the selected fungal genomes are presented in table 2 and figure 2. Compared with the human (3,150 Mb) or other mammalian genomes, fungal genomes are generally small. The genome of the Baker's yeast, S. cerevisiae, for instance, makes up only 0.5% of the human genome. Although there is no direct correlation of the SSR content with the genome size, it is generally assumed that the larger genomes are expected to contain more SSRs than do the smaller genomes (Hancock 2002). On the contrary, results here indicate that the total SSR contents in fungal species are not directly proportional to the genome sizes (fig. 4). For example, in a comparison of similar-sized genomes, N. crassa has a fivefold SSR abundance over that of F. graminearum, although they have approximately the same genome size (fig. 4 and table 2). In most instances, the relative density of the SSRs across genomes with similar genome sizes is also similar (fig. 3 and table 3). The highest SSR density was found in the N. crassa genome (6,676 bp/Mb), followed by the M. grisea (4,785 bp/Mb) genome. The lowest SSR density was in the genome of E. cuniculi (767 bp/Mb), which has the smallest genome size among the fungi surveyed in this study. The relative abundance of the various repeat types differed across the genomes. For instance, there were more trinucleotides in N. crassa than dinucleotides. This observation was even more evident in four of the nine genomes, in which hexanucleotides were more abundant than pentanucleotides and in some cases more abundant than the tetranucleotides (table 2). 
The raw data sets for each of the studied fungal genomes are available at http://www.mmrl.med.usyd.edu.au/ssr.html.
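The two normalizations used throughout the Results, relative abundance (SSRs per Mb) and relative density (bp of SSR sequence per Mb), can be written out as a short sketch. The function names are hypothetical, and the tract lengths in the density example are made-up illustration values rather than data from this study. The abundance example uses the three hexanucleotide repeats reported for the 13.1 Mb of S. pombe sequence analyzed.

```python
def relative_abundance(n_ssrs, analyzed_bp):
    """Number of SSRs per Mb of sequence analyzed."""
    return n_ssrs / (analyzed_bp / 1_000_000)


def relative_density(tract_lengths_bp, analyzed_bp):
    """Base pairs contributed by SSR tracts per Mb of sequence analyzed."""
    return sum(tract_lengths_bp) / (analyzed_bp / 1_000_000)


# S. pombe hexanucleotides: 3 repeats in 13.1 Mb of analyzed sequence.
s_pombe_hexa = relative_abundance(3, 13_100_000)  # about 0.23 repeats/Mb

# Hypothetical tracts of 12, 18, and 30 bp in a 2 Mb sequence.
example_density = relative_density([12, 18, 30], 2_000_000)  # 30.0 bp/Mb
```

Both quantities divide by the length of sequence actually analyzed, not by the estimated genome size, which is why genomes with unsequenced telomeric or centromeric regions can still be compared.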

    FIG. 1.— The taxonomic positions of the fungal species studied (Berbee and Taylor 2001; Kurtzman and Sugiyama 2001).

    Table 2 Occurrence and Relative Abundance of SSRs in Fungal Genomes

    FIG. 2.— Relative abundance of SSRs across fungal genomes. Abundance is defined as the total number of SSRs per Mb of sequence analyzed.

    FIG. 4.— Relationship of relative abundance of the total SSRs in each genome, with the genome sizes of different fungal species indicated. Numbers on top of the bars indicate the actual genome size, and the analyzed sequence lengths are indicated in parentheses.

    FIG. 3.— Relative density of SSRs across fungal genomes. Density is defined as the total sequence length (bp) contributed by each SSR per Mb of DNA of sequence analyzed.

    Table 3 Relative Density of SSR Repeats

    Mononucleotide Repeats

    Mononucleotide repeats can be found in each of the genomes with relatively high frequency. Our study revealed a strong overrepresentation of A/T compared with C/G sequences (table 4). For example, 99.5% of the mononucleotides in S. cerevisiae are A/T repeats and only 0.5% are C/G repeats. Only in E. cuniculi, the smallest genome analyzed here, the C/G motifs (80.6%) were found to be more abundant than the A/T (19.4%) motifs. Compared with other genomes, M. grisea (3,063 bp/Mb) showed the highest density of mononucleotide repeats, followed by N. crassa (2,505 bp/Mb) and S. cerevisiae (2,071 bp/Mb). E. cuniculi with only 156 bp/Mb had the lowest mononucleotide density (table 3). The longest stretches of mononucleotide repeats were found in A. nidulans and N. crassa. Two T repeats, with lengths of 94 and 93 bp in A. nidulans, were the longest mononucleotides found among fungi so far (table 7). The longest A tracts, with lengths 74 and 64 bp, also resided in the genome of A. nidulans. However, overall it was observed that the T repeats were much longer than the A repeats.

    Table 4 Abundance and Base Pair Content of Mononucleotides in Fungal Genomes

    Table 7 Longest SSR Motifs in Fungal Genomes

    Dinucleotide Repeats

    As molecular markers, dinucleotides are more important than the other SSRs and are one of the most sought-after markers because of their higher mutation rates. We found the AG/GA repeats to be predominant in the larger genomes of M. grisea and N. crassa, whereas the AT/TA repeats are overrepresented in five of the nine species examined (table 5). The higher AT/TA frequencies in the majority of genomes can be assumed to be the result of the high A/T content of the genomes. A surprising finding was the lack of, or the lower abundance of, the CG/GC motifs in all of the genomes (table 5). In fact, neither of these repeats was found in the genome of S. pombe. They were found to occur only five and six times in the genomes of S. cerevisiae and C. neoformans, respectively. N. crassa had the highest (3,208) and E. cuniculi (91) had the lowest numbers of dinucleotides. However, the relative abundance of dinucleotide repeats was highest in N. crassa (84 repeats/Mb) and lowest in A. nidulans (25 repeats/Mb) genomes (table 2). The nucleotide composition of the dinucleotide repeats varied between species (tables 5 and 6). AT/TA repeats in A. nidulans, C. neoformans, N. crassa, S. cerevisiae, and S. pombe, CT/TC repeats in M. grisea and U. maydis, AG/GA repeats in F. graminearum, and AC/CA in E. cuniculi were the most predominant (table 5). The densities of dinucleotides between the genomes also varied significantly, with N. crassa (1,105 bp/Mb) having the highest and A. nidulans (282 bp/Mb) the lowest densities (table 3). Our data show that long dinucleotides do not occur in fungal genomes. The longest dinucleotides were found to be the (GA)92 (184 bp) and (TC)78 (156 bp) motifs in M. grisea and N. crassa, respectively (table 7). The results also revealed that the A/T-rich genomes, such as S. cerevisiae (Goffeau et al. 1996) and S. pombe (Wood et al. 2002) do not contain longer AT/TA repeats. The longest AT/TA motif was only 20 bp long and was found in the genome of S. 
cerevisiae. Generally, dinucleotide motifs that contained the nucleotide G were found to be relatively longer than the dinucleotides composed of other nucleotides.

    Table 5 Abundance of Dinucleotide SSRs in Fungal Genomes

    Table 6 Most Frequent SSR Motifs/Groupings in Fungal Genomes

    Trinucleotide Repeats

    In general, N. crassa contained the most trinucleotide repeats (4,084), followed by M. grisea (1,573) and U. maydis (865) (table 2). The lowest occurrence was found in the E. cuniculi genome, with a total of only 20 trinucleotides. The highest relative abundance was seen in N. crassa (107 repeats/Mb) and the lowest in the genome of E. cuniculi (8 repeats/Mb). Overall, trinucleotide repeat density was by far the highest in N. crassa (2,245 bp/Mb), followed by U. maydis (891 bp/Mb) (table 3). The lowest trinucleotide density was in the genome of E. cuniculi (121 bp/Mb). In terms of their abundance, A-containing and T-containing trinucleotides were not the most abundantly present. C-containing and G-containing trinucleotides were found to be equally abundant in fungi (table 6). The most common trinucleotide motifs in the analyzed fungal genomes showed no A/T-biased trend. Instead, each genome has its own frequent-repeat motif comprising various combinations (table 6). Furthermore, A/T-rich genomes (S. cerevisiae and C. neoformans) did not show A/T-biased abundant trinucleotides. In C. neoformans, the most common trinucleotide repeats are the CTT/TCT/TTC group, which occurs 43 times. In S. cerevisiae, another A/T-rich genome, the most common trinucleotide grouping is GAA/AGA/AAG, which occurs 45 times. In N. crassa, the most common group is GTT/TGT/TTG, which occurs no fewer than 361 times. The longest trinucleotide repeats were found in the genome of N. crassa, in which (TTA)93, (TTG)73, (TTC)73, and (ACA)73 motifs were 279, 219, 219, and 219 bp long, respectively (table 7). The (TTA)93 repeat in N. crassa was by far the longest trinucleotide repeat found in the fungal genomes analyzed here. There were also other long trinucleotide-repeat tracts that were not A/T biased, but rather rich in C and T nucleotides.
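The groupings quoted in this section, such as CTT/TCT/TTC, appear to collect the cyclic rotations of a single motif, that is, the same repeat read in different frames. A minimal sketch of that grouping follows, assuming rotation-only equivalence (reverse complements are not merged here). The names `rotation_group` and `count_by_group` are hypothetical, and members are joined in alphabetical order, so a label may list them in a different order than the text does.

```python
from collections import Counter


def rotation_group(motif):
    """Return a label listing all cyclic rotations of a motif, sorted."""
    rotations = {motif[i:] + motif[:i] for i in range(len(motif))}
    return "/".join(sorted(rotations))


def count_by_group(motifs):
    """Count SSR hits per rotation group, given one motif string per hit."""
    return Counter(rotation_group(m) for m in motifs)
```

Passing the motif of every trinucleotide hit in a genome to `count_by_group` would give one total per group, comparable in form to the per-group occurrences reported above.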

    Tetranucleotide Repeats

    N. crassa and M. grisea showed the greatest occurrence of tetranucleotides with 758 and 219 repeats, respectively. These organisms also had the highest relative abundance of tetranucleotides with 20 repeats/Mb and 5.8 repeats/Mb, respectively (table 2). The lowest relative abundance was one repeat/Mb in S. cerevisiae, even though E. cuniculi had the lowest frequency. The density of tetranucleotide repeats was found to be much lower than the densities of dinucleotide and trinucleotide repeat motifs. The N. crassa genome had the highest (524 bp/Mb) and the S. cerevisiae the lowest (24 bp/Mb) tetranucleotide densities (table 3). The most frequent tetranucleotide repeat motifs were much less in abundance than the lower-repeated units. The most common tetranucleotide repeat was found to be the (TACC)n motif in N. crassa and M. grisea, which occurs 39 and 32 times, respectively (table 6). The repeat (TAGG)n was also common to these genomes, occurring 23 and 24 times, respectively. Generally, most tetranucleotide repeats were rather short, with a repeat length of less than eight. However, there were some longer motifs represented in the genomes of N. crassa and M. grisea (table 7). The longest tetranucleotides were the (AGGA)51 (204 bp), (ACCT)39 (156 bp), and (ACAT)38 (152 bp) motifs in N. crassa. The motif (TACC)48 (192 bp) was the longest in M. grisea. In the A/T-rich genomes of S. pombe and S. cerevisiae, long tetranucleotides were mostly composed of the bases A and T, contrary to the composition of long dinucleotides in these genomes, which contain at least one G in the core motif (table 7).

    Pentanucleotide Repeats

    As expected, the occurrence of pentanucleotide repeats was less than that of the tetranucleotide repeats. The highest occurrence of pentanucleotides, with 192 repeats, was found in the genome of N. crassa (table 2). The lowest occurrence was in the genome of C. neoformans, with only three repeat motifs. No pentanucleotides were found in E. cuniculi. The highest relative abundance was found in N. crassa (5.1 repeats/Mb) and the lowest in C. neoformans (0.2 repeats/Mb) (table 2). Overall, the density of pentanucleotide repeats ranged between 156 bp/Mb in N. crassa and 6 bp/Mb in C. neoformans (table 3). Generally, the core composition of pentanucleotide repeats was found to deviate more than for the shorter SSR motifs. This variance is probably related to their length rather than to having a particular base bias. The most common motifs were also found to occur less frequently than for tetranucleotide repeats, as is the case with the most frequent TACAC motif, which occurs only five times in N. crassa (table 6). N. crassa also accommodates the longest pentanucleotide motifs (AAGGA)32, (ACTCT)28, (CTTTT)24, and (CTTGA)18, which extend to 160, 140, 120, and 80 bp, respectively (table 7). Other long pentanucleotides found were (GGCAA)29 (145 bp) in M. grisea, (TTGCT)25 (125 bp) in U. maydis, and (GTATG)18 (90 bp) in F. graminearum.

    Hexanucleotide Repeats

    Hexanucleotide repeats were also found abundantly in the genomes analyzed. U. maydis, with 196 hexanucleotides, had the highest occurrence, and S. pombe, with only three hexanucleotides in 13.1 Mb of sequence analyzed, had the lowest occurrence. As seen for pentanucleotides, there were no hexanucleotides in E. cuniculi (table 2). Moreover, in the genomes of C. neoformans, M. grisea, S. cerevisiae, and U. maydis, hexanucleotide repeats were more abundant than the pentanucleotides. U. maydis, with 9.9 repeats/Mb, had the highest relative abundance, and S. pombe, with 0.2 repeats/Mb, had the lowest relative abundance of hexanucleotides. The densities of hexanucleotide repeats were found to vary significantly among the nine genomes (table 3). The highest density of hexanucleotide was 442 bp/Mb for U. maydis and the lowest was 10 bp/Mb for S. pombe. The hexanucleotide density in U. maydis was even higher than the tetranucleotide and pentanucleotide densities in this species. The most abundant hexanucleotide repeats were found in the larger and medium-sized genomes, with the most common being the (TGCTGT)8 motif in U. maydis (table 6). The longest hexanucleotide repeats were the (GCCTGA)77, (TAGGGT)62, and (CCTTCT)52 motifs in M. grisea, U. maydis and C. neoformans, respectively (table 7). These hexanucleotide repeats were, in fact, by far the longest among all the SSRs found across these species and compare very well with long repeats occurring in humans and higher eukaryotes (Kruglyak et al. 1998). Overall, the longest hexanucleotides were represented in M. grisea and N. crassa genomes.

    Discussion

    Abundance/Density of SSRs

    The relative abundance of SSRs is not equally represented across the nine species. This observation is even more evident in taxonomically related genomes (fig. 2) such as A. nidulans, F. graminearum, M. grisea, and N. crassa. The two euascomycetous species, M. grisea and N. crassa, are more similar in relative SSR abundance to each other and to S. cerevisiae and S. pombe than to the closely related species F. graminearum and A. nidulans, whereas the basidiomycetous species, C. neoformans and U. maydis, show comparable relative abundance to each other as well as to the distantly related species A. nidulans and F. graminearum (table 5). The relative abundance of each SSR motif within a genome also varies significantly. For example, A. nidulans has twice as many dinucleotides (25 repeats/Mb) as trinucleotides (11 repeats/Mb), yet M. grisea, sister taxon of A. nidulans, has almost identical proportions of dinucleotides and trinucleotides (table 2). However, M. grisea differs from A. nidulans in that it has a relatively higher abundance of all repeat types with much longer motifs (trinucleotides to hexanucleotides). As expected, the analysis of motif patterns revealed that the shorter repeat units were predominant in each genome. As the repeat unit length increases, the occurrence decreases. This trend has been observed for a range of organisms. Overall, SSRs on average are more abundant in N. crassa (one every 2.7 kb) than in M. grisea (one every 3.3 kb), S. cerevisiae (one every 3.9 kb), S. pombe (one every 4.0 kb), U. maydis (one every 6.6 kb), C. neoformans (one every 9.6 kb), F. graminearum (one every 12.5 kb), A. nidulans (one every 12.5 kb), and E. cuniculi (one every 15.6 kb) (fig. 4). Generally, the observed occurrence of SSRs in the S. cerevisiae and S. pombe genomes was relatively close to the previously reported data obtained for S. cerevisiae and Ascochyta rabiei (Geistlinger et al. 1997; Kruglyak et al. 2000; Pere et al. 2001). 
The differences in densities could be explained by the different genomic organization of these species, assuming no systematic bias exists that may lead to overestimates of the SSR densities in fungi. Katti, Ranjekar, and Gupta (2001) analyzed the SSR frequencies in five diverse genomes and found that SSRs were most abundant in human, followed by Drosophila melanogaster, Arabidopsis thaliana, S. cerevisiae, and Caenorhabditis elegans. Toth, Gaspari, and Jurka (2000) found that the relative abundance of different repeats also differed extensively, depending on the species examined. This nonrandom distribution may be the result of differences in mutability and the bias in repair efficiency of the mismatch repair system, which could lead to the fact that SSRs are overrepresented in certain genomes (Harr, Todorova, and Schlötterer 2002).

    Our analysis revealed that SSRs are less abundant in fungi than in the human genome. A large portion of the noncoding genome is repetitive DNA, which can be of different types based on the length of the repeated elements. SSRs occupy between 0.08% and 0.67% of the fungal genomes, comparatively less than the 3% for the human genome (International Human Sequencing Consortium 2001). For instance, an SSR occurs every 3.9 kb in the S. cerevisiae genome, in contrast to estimates of the overall SSR density in the human genome, which was one repeat per 0.1 kb of sequence (Simple Sequence Repeats Database: [http://www.ingenovis.com/ssr]). Therefore, it can be assumed that the human genome contains about 39-fold more SSRs per unit length of sequence than does the S. cerevisiae genome. There is no known explanation for the lower occurrence of SSRs in fungi. Even if the small genome sizes in fungi are taken into account, it still remains to be explained why there are fewer SSRs in fungi. Because most SSRs occur in noncoding regions, the higher gene density of fungi compared with higher eukaryotes would imply that the relative amount of noncoding DNA is smaller in fungi, which may in turn lead to lower rates of SSR evolution. In humans, more than 98% of the genome does not code for proteins, and it is thought that most of these regions are made up of SSRs (Matula and Kypr 1999). Therefore, the smaller amount of noncoding DNA resulting from the smaller genome sizes of fungi may explain the lower occurrence of SSRs in fungi.
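The figures in this paragraph can be checked with simple arithmetic. The sketch below assumes that the 0.08% and 0.67% genome fractions correspond to the lowest and highest overall densities reported in the Results (767 bp/Mb for E. cuniculi and 6,676 bp/Mb for N. crassa); the text does not state this link explicitly, and the function names are hypothetical.

```python
def density_to_percent(bp_per_mb):
    """Convert an SSR density in bp/Mb to a percentage of the sequence."""
    return bp_per_mb / 1_000_000 * 100


def fold_difference(spacing_a_kb, spacing_b_kb):
    """How many times denser B is than A, given the average spacing between SSRs."""
    return spacing_a_kb / spacing_b_kb


lowest = density_to_percent(767)     # E. cuniculi, about 0.08 %
highest = density_to_percent(6676)   # N. crassa, about 0.67 %

# One SSR every 3.9 kb (S. cerevisiae) versus one every 0.1 kb (human).
human_vs_yeast = fold_difference(3.9, 0.1)  # about 39-fold
```

Under that assumption the density values and the quoted percentages agree to two decimal places, and the 39-fold figure is the ratio of the two average spacings.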

    Dinucleotide repeats are among the most sought-after molecular markers in most organisms because of their relatively high mutation rates. The higher AT/TA frequencies in the majority of genomes (table 5) may be a result of the high A/T content of the genomes and the relative ease of strand separation compared with C/G tracts (Gur-Arie et al. 2000). Interestingly, as the length of the core repeat unit increases (for example, from dinucleotide to trinucleotide), the percentage of C or G content in the repeat unit also increases (table 7). These repeats (e.g., (TACC)n and (GGCAA)n in M. grisea) still occur in abundance and with relatively long tracts. The high abundance of GT repeats in mammals has been linked to formation of Z-DNA (Majewski and Ott 2000) and regulation of gene expression (Moore et al. 1991). Abundantly occurring SSRs such as the AT repeats in fungi could serve a function similar to the one GT repeats may serve in mammals.

    Relative Abundance of SSRs and the Genome Size

    Our results suggest that the relative abundance of SSRs differs among species and does not track the genome size (fig. 4). It appears that the SSR abundance in fungi is neither inversely nor directly proportional to the genome size, in contrast to what has been reported for other genomes (Morgante, Hanafey, and Powell 2002; Hancock 1996, 2002). Similarly, the relative abundance of each class of SSRs (mononucleotides to hexanucleotides) also differs among the species. Several unexpected relationships between the genome sizes emerged, including the variance between the members of the ascomycetes. Values observed with different genome sizes indicate there are factors that impose limits upon compositional relative abundance variation in these genomes. This idea is supported by the observation of inconsistency of relative abundance in similar-sized genomes (Toth, Gaspari, and Jurka 2000). A surprising result is that the relative abundance of the SSRs in the larger genome of F. graminearum is significantly closer to that in A. nidulans than to that in the similar-sized genomes of M. grisea and N. crassa. It is not clear why a similar-sized genome such as that of M. grisea contains almost fourfold more SSRs than does that of F. graminearum. A more detailed analysis of differential distribution of SSRs and the genome organization of these fungal species is needed to shed more light on the SSR variance in fungal genomes.

    Most Common SSR Motifs

    The potential of SSRs, specifically of the most abundant repeats, in contributing to the evolution of genomes has been well documented in various organisms. Thus, the large number of common and long SSRs identified in this survey is also likely to be associated with possible or known functions. It has been reported that (GT)n is the most common repeat motif in animals and invertebrates (Stallings et al. 1991), whereas in plants and insects, the repeats (AT)n and (CT)n are the most common, respectively (Lagercrantz, Ellegren, and Andersson 1993; Paxton et al. 1996). In agreement with the early studies in fungi (Valle 1993; Geistlinger et al. 1997), a majority of sequences rich in A/T were observed, especially in dinucleotide repeats. As discussed, this is possibly a consequence of the high A/T content of fungal genomes. However, C/G-containing repeats are also frequently observed in fungal genomes. For example, in M. grisea and U. maydis, the most common dinucleotides are the CT/TC group, which are also the most abundant dinucleotides in insects (Estoup et al. 1993) and higher invertebrates (Katti, Ranjekar, and Gupta 2001). Most of the common trinucleotides in M. grisea, such as (CCG)n, (CGC)n, and (GGC)n, do not contain either A or T at all. As in the human genome, the repeat (AAC)n in N. crassa is very common and occurs more than 69 times, and its high frequency may indicate a similar function in both genomes. Furthermore, there is no known explanation for the uneven abundance of certain repeats in different genomes. One such SSR is the (CAA)n repeat, which appears 152 times in the N. crassa genome but only 17 times in F. graminearum, although the two genomes are approximately the same size and the same amount of each genome was analyzed (table 1).

    Longest SSR Motifs

    Our results revealed that the abundance of various SSR motifs varied considerably among the nine fungal species analyzed. However, the length-frequency distributions of SSRs were relatively constant across all species. The length distributions of all SSRs indicated that the frequency of repeats decreases rapidly with increasing repeat length. This may be because longer repeats have higher mutation rates and, hence, could be less stable (Wierdl, Dominska, and Petes 1997). Furthermore, in this study, SSRs are found on average to be much shorter than in higher organisms. Generally, dinucleotide and trinucleotide repeat stretches tended to be longer than other repeats. In addition, SSRs in larger genomes such as M. grisea and N. crassa seemed to be longer than in smaller fungal genomes. The lack of longer repeats could possibly be explained by their downward mutation bias and short persistence times (Harr and Schlötterer 2000). The SSR with the longest nucleotide stretch belonged to M. grisea and consisted of a (GCCTGA)77 motif of 462 bp. In contrast, the longest repeat in N. crassa, the organism with the most abundant SSRs, is the (TTA)93 repeat, which is 279 bp long (table 7). The M. grisea genome is 16 times larger than the E. cuniculi genome; hence, one would expect it to accommodate much longer SSRs. Our results indicate that larger genomes harbor the longest repeats, as seen for higher organisms with much larger genome sizes. Overall, N. crassa and M. grisea contained the longest dinucleotide, trinucleotide, tetranucleotide, and pentanucleotide repeats between them (table 7). However, long tracts of hexanucleotide repeats were also found frequently in the medium-sized genomes of C. neoformans and U. maydis. Cross-species comparisons indicate that SSR loci can be conserved over long evolutionary time periods, and the number of repeats never reaches extremely long values (Schlötterer, Amos, and Tautz 1991). On the other hand, differences in length distributions between organisms were explained by different SSR mutation rates (Kruglyak et al. 1998). Furthermore, a lack of very long SSRs has been taken as evidence that selection is also involved in maintaining SSRs within a certain size range (Nauta and Weissing 1996).
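
    The tract lengths quoted above follow from unit length times repeat number (6 x 77 = 462 bp and 3 x 93 = 279 bp). The sketch below shows one way such perfect tandem repeats could be located and measured with a regular expression. It is not the screening software developed for this study; the function names, the minimum of three repeats, and the synthetic test sequences are assumptions made for illustration.

```python
import re


def is_primitive(unit: str) -> bool:
    """True if the unit is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit[:k] * (n // k) == unit for k in range(1, n))


def find_ssrs(seq: str, min_repeats: int = 3, max_unit: int = 6):
    """Return (unit, repeat_count, start, length_bp) for perfect tandem repeats.

    min_repeats is an assumed threshold, not the criterion used in the paper.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # A unit of k bases followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                length = len(match.group(0))
                hits.append((unit, length // k, match.start(), length))
    return hits


def longest_ssr(seq: str):
    """Return the hit with the longest tract, or None if no SSR is found."""
    hits = find_ssrs(seq)
    return max(hits, key=lambda hit: hit[3]) if hits else None


# Synthetic sequences built from the motifs named in the text, with arbitrary flanks.
hexa = "TTGACC" + "GCCTGA" * 77 + "ACGT"
tri = "G" + "TTA" * 93 + "C"
print(longest_ssr(hexa))  # ('GCCTGA', 77, 6, 462)
print(longest_ssr(tri))   # ('TTA', 93, 1, 279)
```

    Filtering non-primitive units keeps a trinucleotide tract from being reported a second time as a hexanucleotide repeat. Imperfect or interrupted repeats are not handled by this sketch.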

    In conclusion, our study of SSRs in completely sequenced fungal genomes is a small step toward a better understanding of the nature of these important sequences. We showed that SSRs occur less frequently in fungi than in the human and other eukaryotic genomes. Since the discovery of their polymorphic nature, SSRs have become the main choice of molecular markers for a variety of genetic studies. The abundance and variance of SSR lengths may give a good indication of the expected variability. The data on the composition and length distribution of SSRs obtained in this study and the developed screening software can be used for choosing the optimal repeat motifs for SSR isolation in these and related fungal genera.

    The SSR sequences and the exact locations within the genomes of all of the SSR loci reported in this study are available at http://www.mmrl.med.usyd.edu.au/ssr.html. This information may become useful for a variety of purposes, including isolating and developing variable markers. The SSR data should also facilitate research on the role of SSRs in genome organization.

    Acknowledgements

    This study was supported by the NH&MRC grant 99738 to W.M.

    References

    Berbee, M. L., and J. W. Taylor. 2001. Fungal molecular evolution: gene trees and geologic time. Pp. 229–245 in K. Esser and P. A. Lemke, eds. The Mycota. Springer-Verlag, Heidelberg.

    Borstnik, B., and D. Pumpernik. 2002. Tandem repeats in protein coding regions of primate genes. Genome Res. 12:909–915.

    Dallas, J. F. 1992. Estimation of microsatellite mutation rates in recombinant inbred strains of mouse. Mamm. Genome 3:452–456.

    Estoup, A., M. Solignac, M. Harry, and J. M. Cornuet. 1993. Characterization of (GT)n and (CT)n microsatellites in two insect species: Apis mellifera and Bombus terrestris. Nucleic Acids Res. 21:1427–1431.

    Field, D., and C. Wills. 1996. Long, polymorphic microsatellites in simple organisms. Proc. R. Soc. Lond. B Biol. Sci. 263:209–215.

    Geistlinger, J., K. Weising, W. J. Kaiser, and G. Kahl. 1997. Allelic variation at a hypervariable compound microsatellite locus in the ascomycete Ascochyta rabiei. Mol. Gen. Genet. 256:298–305.

    Gerber, H. P., K. Seipel, O. Georgiev, M. Hofferer, M. Hug, S. Rusconi, and W. Schaffner. 1994. Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science 263:808–811.

    Goffeau, A., B. G. Barrell, H. Bussey et al. (16 co-authors) 1996. Life with 6000 genes. Science 274:563–570.

    Gur-Arie, R., C. J. Cohen, Y. Eitan, L. Shelef, E. M. Hallerman, and Y. Kashi. 2000. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res. 10:62–71.

    Hancock, J. M. 1995. The contribution of slippage-like processes to genome evolution. J. Mol. Evol. 41:1038–1047.

    ———. 1996. Simple sequences and the expanding genome. BioEssays 18:421–425.

    Hancock, J. M. 1999. Microsatellites and other simple sequences: genomic context and mutational mechanisms. Pp. 1–9 in D. B. Goldstein and C. Schlötterer, eds. Microsatellites: evolution and applications. Oxford University Press, Oxford.

    ———. 2002. Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects. Genetica 115:93–103.

    Harr, B., and C. Schlötterer. 2000. Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155:1213–1220.

    Harr, B., J. Todorova, and J. Schlötterer. 2002. Mismatch repair-driven mutational bias in D. melanogaster. Mol. Cell. 10:199–205.

    Huang, T. S., C. C. Lee, A. C. Chang, S. Lin, C. C. Chao, Y. S. Jou, Y. W. Chu, C. W. Wu, and J. Whang-Peng. 2003. Shortening of microsatellite deoxy(CA) repeats involved in GL331-induced down-regulation of matrix metalloproteinase-9 gene expression. Biochem. Biophys. Res. Commun. 300:901–907.

    International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

    Jin, P., and S. T. Warren. 2000. Understanding the molecular basis of fragile X syndrome. Hum. Mol. Genet. 9:901–908.

    Jurka, J., and C. Pethiyagoda. 1995. Simple repetitive DNA sequences from primates: compilation and analysis. J. Mol. Evol. 40:120–126.

    Kashi, Y., D. King, and M. Soller. 1997. Simple sequence repeats as a source of quantitative genetic variation. Trends Genet. 13:74–78.

    Katti, M. V., P. K. Ranjekar, and V. S. Gupta. 2001. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18:1161–1167.

    Kruglyak, S., R. Durrett, M. D. Schug, and C. F. Aquadro. 1998. Equilibrium distribution of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774–10778.

    Kruglyak, S., R. Durrett, M. D. Schug, and C. F. Aquadro. 2000. Distribution and abundance of microsatellites in the yeast genome can be explained by a balance between slippage events and point mutations. Mol. Biol. Evol. 17:1210–1219.

    Kunzler, P., K. Matsuo, and W. Schaffner. 1995. Pathological, physiological, and evolutionary aspects of short unstable DNA repeats in the human genome. Biol. Chem. Hoppe. Seyler. 6:201–211.

    Kurtzman, C. P., and J. Sugiyama. 2001. Ascomycetous yeasts and yeastlike taxa. Pp. 179–200 in K. Esser and P. A. Lemke, eds. The Mycota. Springer-Verlag, Heidelberg.

    Lagercrantz, U., H. Ellegren, and L. Andersson. 1993. The abundance of various polymorphic microsatellite motifs differs between plants and vertebrates. Nucleic Acids Res. 21:1111–1115.

    Levinson, G., and G. A. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.

    Lieckfeldt, E., W. Meyer, K. Kuhls, and T. Börner. 1992. Characterization of filamentous fungi and yeast by DNA fingerprinting and random amplified polymorphic DNA. Belg. J. Bot. 125:226–233.

    Majewski, J., and J. Ott. 2000. GT repeats are associated with recombination on human chromosome 22. Genome Res. 10:1108–1114.

    Marinangeli, P., D. Angelozzi, M. Ciani, F. Clementi, and I. Mannazzu. 2003. Minisatellites in Saccharomyces cerevisiae genes encoding cell wall proteins: a new way towards wine strain characterisation. FEMS Yeast Res. 4:427–435.

    Matula, M., and J. Kypr. 1999. Nucleotide sequences flanking dinucleotide microsatellites in the human, mouse and Drosophila genomes. J. Biomol. Struct. Dyn. 17:275–280.

    Metzgar, D., J. Bytof, and C. Wills. 2000. Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res. 10:72–80.

    Meyer, W., A. Castaneda, S. Jackson, M. Huynh, E. Castaneda, and the IberoAmerican Cryptococcal Study Group. 2003. Molecular typing of IberoAmerican Cryptococcus neoformans isolates. Emerging Infect. Dis. 9:189–195.

    Meyer, W., A. Koch, C. Niemann, B. Beyermann, J. T. Epplen, and T. Börner. 1991. Differentiation of species and strains among filamentous fungi by DNA fingerprinting. Curr. Genet. 19:239–242.

    Meyer, W., G. N. Latouche, H. Daniel, M. Thanos, T. G. Mitchell, D. Yarrow, G. Schönian, and T. C. Sorrell. 1997. Identification of pathogenic yeasts of the imperfect genus Candida by PCR-fingerprinting. Electrophoresis 18:1548–1559.

    Meyer, W., K. Marszewska, M. Amirmostofian, et al. (12 co-authors). 1999. Molecular typing of global isolates of Cryptococcus neoformans var. neoformans by PCR-fingerprinting and RAPD: a pilot study to standardize techniques on which to base a detailed epidemiological survey. Electrophoresis 20:1790–1799.

    Meyer, W., K. Maszewska, and T. C. Sorrell. 2001. PCR-fingerprinting: a convenient molecular tool to distinguish between Candida dubliniensis and Candida albicans. Med. Mycol. 39:185–193.

    Moore, S. S., L. L. Sargeant, T. G. King, J. S. Mattick, M. George, and D. J. S. Hetzel. 1991. The conservation of dinucleotide microsatellites among mammalian genomes allows the use of heterologous PCR primer pairs in closely related species. Genomics 10:654–660.

    Morgante, M., M. Hanafey, and W. Powell. 2002. Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nat. Genet. 30:194–200.

    Nauta, M. J., and F. J. Weissing. 1996. Constraints on allele size at microsatellite loci: implications for genetic differentiation. Genetics 143:1021–1032.

    Panaud, O., X. Chen, and S. R. McCouch. 1995. Frequency of microsatellite sequences in rice (Oryza sativa L.). Genome 38:1170–1176.

    Paxton, R. J., M. P. A. Thoren, J. Tengo, A. Estoup, and P. Pamilo. 1996. Mating structure and nestmate relatedness in a communal bee, Andrena jacobi (Hymenoptera, Andrenidae), using microsatellites. Mol. Ecol. 5:511–519.

    Pere, M. A., F. J. Gallego, I. Martinez, and P. Hidalgo. 2001. Detection, distribution and selection of microsatellites (SSRs) in the genome of the yeast Saccharomyces cerevisiae as molecular markers. Lett. Appl. Microbiol. 33:461–466.

    Primmer, C. R., T. Raudsepp, B. P. Chowdary, A. P. Moller, and H. Ellegren. 1997. Low frequency of microsatellites in the avian genome. Genome Res. 7:471–482.

    Richard, G. F., and B. Dujon. 1996. Distribution and variability of trinucleotide repeats in the genome of the yeast Saccharomyces cerevisiae. Gene 174:165–174.

    Schlötterer, C., B. Amos, and D. Tautz. 1991. Conservation of polymorphic simple sequence loci in cetacean species. Nature 354:63–65.

    Schlötterer, C., and D. Tautz. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211–215.

    Schug, M. D., T. F. Mackay, and C. F. Aquadro. 1997. Low mutation rates of microsatellite loci in Drosophila melanogaster. Nat. Genet. 15:99–102.

    Sermon, K., S. Seneca, M. De Rycke, V. Goossens, H. Van de Velde, A. De Vos, P. Platteau, W. Lissens, A. Van Steirteghem, and I. Liebaers. 2001. PGD in the lab for triplet repeat diseases: myotonic dystrophy, Huntington's disease and fragile-X syndrome. Mol. Cell. Endocrinol. 183:S77–85.

    Stallings, R. L., A. F. Ford, D. Nelson, D. C. Torney, C. E. Hildebrand, and R. K. Moyzis. 1991. Evolution and distribution of (GT)n repetitive sequences in mammalian genomes. Genomics 10:807–815.

    Subramanian, S., V. M. Madgula, R. George, R. K. Mishra, M. W. Pundit, C. S. Kumar, and L. Singh. 2003. Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics 19:549–552.

    Tautz, D., and M. Renz. 1984. Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res. 12:4127–4138.

    Tautz, D., and C. Schlötterer. 1994. Simple sequences. Curr. Opin. Genet. Dev. 4:832–837.

    Tautz, D., M. Trick, and G. A. Dover. 1986. Cryptic simplicity in DNA is a major source of genetic variation. Nature 322:652–656.

    Toth, G., Z. Gaspari, and J. Jurka. 2000. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 10:967–981.

    Weber, J. L. 1990. Informativeness of human poly(GT)n polymorphisms. Genomics 7:524–530.

    Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123–1128.

    Wierdl, M., M. Dominska, and T. D. Petes. 1997. Microsatellite instability in yeast: dependence on the length of the microsatellite. Genetics 146:769–779.

    Wood, V., R. Gwilliam, M. A. Rajandream, et al. 2002. The genome sequence of Schizosaccharomyces pombe. Nature 415:871–880.

    Wooster, R., A. M. Cleton-Jansen, N. Collins, et al. (14 co-authors). 1994. Instability of short tandem repeats (microsatellites) in human cancers. Nat. Genet. 6:152–156.

    Xu, X., M. Peng, and Z. Fang. 2000. The direction of microsatellite mutations is dependent upon allele length. Nat. Genet. 24:396–399.