The impact of SNPs on the interpretation of SAGE and MPSS experimental data
Nucleic Acids Research, 2004, Issue 20
     Laboratory of Molecular Biology and Genomics and 1 Laboratory of Computational Biology, Ludwig Institute for Cancer Research, 01509-010, São Paulo, SP, Brazil, 2 Interunit in Bioinformatics and 3 Department of Biochemistry, University of São Paulo, 05508-900, São Paulo, SP, Brazil and 4 Johns Hopkins University School of Medicine, 21224, Baltimore, MD, USA

    * To whom correspondence should be addressed at Rua Prof. Antonio Prudente 109, 4th floor, 01509-010 São Paulo, SP, Brazil. Tel: +55 11 3388 3248; Fax: +55 11 3207 7001; Email: anamaria@compbio.ludwig.org.br

    ABSTRACT

    Serial Analysis of Gene Expression (SAGE) and Massively Parallel Signature Sequencing (MPSS) are powerful techniques for gene expression analysis. A crucial step in analyzing SAGE and MPSS data is the assignment of experimentally obtained tags to a known transcript. However, tag to transcript assignment is not a straightforward process since alternative tags for a given transcript can also be experimentally obtained. Here, we have evaluated the impact of Single Nucleotide Polymorphisms (SNPs) on the generation of alternative SAGE and MPSS tags. This was achieved through the construction of a reference database of SNP-associated alternative tags, which has been integrated with SAGE Genie. A total of 2020 SNP-associated alternative tags were catalogued in our reference database and at least one SNP-associated alternative tag was observed for 8.6% of all known human genes. A significant fraction (61.9%) of these alternative tags matched a list of experimentally obtained tags, validating their existence. In addition, the origin of four out of five SNP-associated alternative MPSS tags was experimentally confirmed through the use of the GLGI-MPSS protocol (Generation of Long cDNA fragments for Gene Identification). The availability of our SNP-associated alternative tag database will certainly improve the interpretation of SAGE and MPSS experiments.

    INTRODUCTION

    The determination of gene expression profiles under normal and pathological conditions is one of the major challenges of the post-genomic era (1,2). A key point for this achievement is the development of techniques that are able to detect all transcripts expressed in a cell population in an unbiased manner, and to precisely determine significant differences in the expression level of all transcripts, including those expressed at very low levels (2).

    SAGE (3) and MPSS (4) are powerful techniques developed for a genome-wide analysis of gene expression. Both methods are capable of uniformly analyzing gene expression irrespective of mRNA abundance and without a priori knowledge of the transcript sequence. In the SAGE technique, a short sequence tag with 10 nt adjacent to the 3'-most NlaIII restriction site is extracted from each expressed sequence (3). The extracted tags are then concatenated for high-throughput sequencing analysis and tag counts are used to measure the relative abundance of their corresponding transcripts. Usually >50 000 tags are generated within a single SAGE experiment.

    Similar to SAGE, MPSS also relies on the production of short tags adjacent to the 3'-most DpnII restriction site in transcripts (4). However, due to the combination of in vitro cloning of cDNA molecules on the surface of microbeads (5) with non-gel-based high-throughput signature sequencing, a single MPSS experiment can generate over 10^7 tags, providing a 10-fold coverage of the transcripts expressed in a human cell (4).

    SAGE and MPSS data interpretation relies on efficient computational tools for the extraction and counting of tag sequences from raw sequence files, as well as for establishing comparisons of tag abundances between different libraries (3,4). Another important step in analyzing SAGE and MPSS data is the assignment of experimentally obtained tags to a known human transcript (6). This is achieved through the construction of a tag–transcript reference database (7,8). These databases are usually constructed by scanning publicly available mRNA sequences for the presence of the 3'-most restriction sites for the enzymes used for SAGE (NlaIII) and MPSS (DpnII) library construction. A virtual tag sequence downstream to the restriction site is then extracted from each mRNA sequence and stored, together with sequence annotation, in the tag–transcript reference database. Matching experimentally obtained tags to the tag–transcript reference database reveals the identity of the corresponding transcript (7,8).
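
    The virtual tag extraction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the recognition sequences CATG (NlaIII) and GATC (DpnII) are standard, but the function name `virtual_tag`, the tag lengths in `TAG_LEN` and the toy sequence are assumptions made for the example.

```python
# Recognition sequences of the two enzymes (standard values).
SITES = {"SAGE": "CATG", "MPSS": "GATC"}   # NlaIII, DpnII
# Number of bases taken downstream of the site (assumed for this sketch).
TAG_LEN = {"SAGE": 10, "MPSS": 13}


def virtual_tag(mrna, method):
    """Return (site_position, tag) for the 3'-most restriction site.

    Returns None when the sequence has no site or when the site lies too
    close to the 3' end to yield a full-length tag.
    """
    seq = mrna.upper()
    site = SITES[method]
    pos = seq.rfind(site)                  # 3'-most occurrence of the site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + TAG_LEN[method]]
    if len(tag) < TAG_LEN[method]:
        return None
    return pos, tag


# Toy transcript with two NlaIII sites (positions 2 and 18) and no DpnII site.
toy = "GGCATGAAACCCGGGTTTCATGACGTACGTAAGGCC"
print(virtual_tag(toy, "SAGE"))   # (18, 'ACGTACGTAA')
print(virtual_tag(toy, "MPSS"))   # None
```

    Storing the site position together with the tag makes it straightforward to ask later whether a polymorphism falls inside the site, inside the tag, or downstream of both.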

    A reliable tag to transcript assignment is, thus, crucial for the correct interpretation of SAGE and MPSS data. However, tag to transcript assignment is not a straightforward process, since many SAGE and MPSS tags can ambiguously match multiple known transcripts (usually due to the presence of repetitive elements in the 3' UTR of human transcripts) and a significant portion of these tags can have no match to the tag–transcript reference database (7,8). In addition, alternative tags other than the 3'-most predicted virtual tag can be experimentally obtained in SAGE and MPSS experiments. There are both artifactual and biological reasons why alternative tags are generated in SAGE and MPSS experiments. Artifactual alternative tags can be generated during SAGE and MPSS library construction if, e.g., cDNA synthesis is primed from internal poly-A stretches within the mRNA sequence, or if digestion with the corresponding restriction enzyme is incomplete, producing an alternative tag which is not adjacent to the 3'-most restriction site (7). On the other hand, genuine alternative tags can be generated in cases of alternative polyadenylation and alternative splicing near the 3' end of the transcript.

    In theory, alternative tags can also be associated with the presence of SNPs within the tag sequence or within the restriction enzyme sites used for SAGE and MPSS library construction. SNPs are the most common genetic variation present in the human genome, occurring once every 100–300 bases (9–12). In this work, we have evaluated the impact of SNPs on the generation of genuine alternative SAGE and MPSS tags. For the purpose of this analysis, we have considered single base substitutions and small insertion/deletion polymorphisms as SNPs and have named the genuine alternative tags as SNP-associated alternative tags.

    The identification of SNP-associated alternative tags was achieved through the construction of a reference database in which the analysis of mRNA sequences from UniGene was combined with information available from the NCBI SNP database. Our results highlight the importance of considering the occurrence of SNPs in tag to transcript assignments, since at least one SNP-associated alternative tag was observed for 8.6% of all known human genes. Our reference database contains 2020 SNP-associated alternative tags and can be accessed through SAGE Genie (http://cgap.nci.nih.gov/SAGE/).

    MATERIALS AND METHODS

    A reference database for SNP-associated alternative tags

    A total of 130 148 mRNA sequences catalogued at UniGene (Build #163) and 5 789 183 SNPs from the NCBI-SNP database (Build #118) were mapped onto the publicly available human genome sequence (Build #134). The mapping of mRNA sequences to the human genome was carried out as previously described (13,14), and the mapping of SNPs was achieved through the alignment of sequences flanking the SNPs according to the NCBI criteria for SNP mapping (http://www.ncbi.nlm.nih.gov/SNP). We have considered single base substitutions and small insertion/deletion polymorphisms as SNPs, and restricted our analysis to SNPs mapped only once to the human genome sequence. A MySQL database was loaded with mapping information of all mRNAs and SNPs that shared an overlap in genomic coordinates. To accurately represent the 3' end of a transcript, only mRNA sequences containing a poly-A tail were selected from the initial set of 130 148 sequences. A total of 54 645 mRNA sequences (corresponding to 20 300 human genes according to UniGene) was scanned for the presence of NlaIII (for the SAGE analysis) and DpnII (for the MPSS analysis) restriction sites, and virtual tags downstream to the 3'-most site were extracted and considered as the original tags. The analysis was conducted with the dataset containing the coordinates of the SNPs, the restriction site and the original tag position for each mRNA sequence.

    The identification of SNP-associated alternative tags was divided into three major categories as illustrated in Figure 1. First, we identified mRNA sequences in which the presence of an SNP generated a new restriction site downstream to the original tag, producing in this way a 3' SNP-associated alternative tag. Second, we identified mRNA sequences in which the presence of an SNP disrupted the 3'-most restriction site associated with the original tag and, as a consequence, the restriction site immediately upstream to the 3'-most site was used for the generation of the SNP-associated alternative tag. Finally, we identified mRNA sequences in which the SNP did not affect the restriction sites, but occurred within the adjacent tag sequence, producing an SNP-associated alternative tag with a single base substitution as compared to the original tag. SNP-associated alternative tags catalogued in the reference database were compared to a list of experimentally obtained SAGE and MPSS tags.

    Figure 1. The impact of SNPs on tag to gene assignments. For the analysis, SNPs were divided into three major categories: (A) SNPs that generate a new restriction enzyme site downstream to the original tag; (B) SNPs that disrupted the 3'-most restriction site associated with the original tag; (C) SNPs that did not affect the restriction sites, but occurred within the adjacent tag sequence. Restriction sites are represented by gray boxes, original tags by hatched boxes and SNP-associated alternative tags by open boxes. The location of the SNPs within mRNA sequences is indicated by arrows.
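
    The three categories can be illustrated by applying a substitution to a transcript and comparing the 3'-most tag before and after. The sketch below is a simplified reading of the scheme, not the authors' implementation: it handles single base substitutions only (no insertions or deletions), uses NlaIII with an assumed 10-base tag, and the function names and toy sequence are hypothetical.

```python
NLAIII = "CATG"   # NlaIII recognition sequence
TAG_LEN = 10      # bases taken downstream of the site (assumed)


def three_prime_tag(seq, site=NLAIII, tag_len=TAG_LEN):
    """Return (site_position, tag) for the 3'-most site, or None if absent."""
    pos = seq.rfind(site)
    if pos == -1:
        return None
    start = pos + len(site)
    return pos, seq[start:start + tag_len]


def classify_snp(ref, snp_pos, alt_base, site=NLAIII, tag_len=TAG_LEN):
    """Classify a substitution by its effect on the 3'-most tag."""
    var = ref[:snp_pos] + alt_base + ref[snp_pos + 1:]
    ref_hit = three_prime_tag(ref, site, tag_len)
    var_hit = three_prime_tag(var, site, tag_len)
    if ref_hit == var_hit:
        return "no effect", None
    if var_hit is None:
        return "site lost, no tag", None
    ref_pos = ref_hit[0] if ref_hit else -1
    var_pos, var_tag = var_hit
    if var_pos > ref_pos:
        return "A: new 3'-most site", var_tag
    if var_pos < ref_pos:
        return "B: 3'-most site disrupted", var_tag
    return "C: substitution within tag", var_tag


# Toy transcript: NlaIII sites at positions 2 and 18, original tag ACGTACGTAA.
ref = "GGCATGAAACCCGGGTTTCATGACGTACGTAAGGCACGTTTTGGGGCCCC"
print(classify_snp(ref, 36, "T"))  # ("A: new 3'-most site", 'TTTTGGGGCC')
print(classify_snp(ref, 19, "G"))  # ("B: 3'-most site disrupted", 'AAACCCGGGT')
print(classify_snp(ref, 24, "A"))  # ('C: substitution within tag', 'ACATACGTAA')
print(classify_snp(ref, 8, "G"))   # ('no effect', None)
```

    A substitution upstream of the original site leaves the 3'-most tag unchanged, which is why the last call reports no effect.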

    Experimental SAGE and MPSS databases

    SAGE and MPSS tags that have been reliably obtained from human mRNA samples were used as experimental evidence to validate the SNP-associated alternative tags. The criteria for the selection of these tags were previously described (7,15). Experimental SAGE data were obtained from SAGE Genie (http://cgap.nci.nih.gov/SAGE) and comprised 586 144 unique tags generated from 260 SAGE libraries, which were derived from 25 different tissues. MPSS experimental data were extracted from the Ludwig Institute for Cancer Research and the National Cancer Institute MPSS database, and comprised 84 555 unique tags generated from six MPSS libraries derived from two different tissues (colon and breast).

    Experimental validation of SNP-associated alternative MPSS tags

    The specificity of five SNP-associated alternative MPSS tags (Table 4) derived from the HB4a breast cell line was experimentally confirmed by GLGI-MPSS (16). This technique allows the conversion of MPSS tags into their corresponding 3' cDNA fragments. A sense primer including 17 bases of the MPSS tag sequence and an antisense primer (ACTATCTAGAGCGGCCGCTT) present in the 3' end of all cDNA molecules and incorporated from reverse transcription primers were used for GLGI-MPSS amplification. The reaction mixture was prepared in a final volume of 30 μl, including 1x Taq Platinum DNA polymerase buffer (Invitrogen), 2.0 mM MgCl2, 83 μM dNTPs, 2.3 ng/μl antisense primer, 2.3 ng/μl sense primer, 1.5 U of Taq Platinum DNA polymerase (Invitrogen) and 0.5–0.8 μl of the same cDNA source used for MPSS library construction. PCR conditions used for amplification were 94°C for 2 min, followed by 30 cycles at 94°C for 30 s, 64°C for 30 s, and 72°C for 35 s. Reactions were kept at 72°C for 5 min after the last cycle. The amplified products were ethanol precipitated and cloned into the pGEM®-T Easy vector (Promega). Eight colonies for each GLGI-MPSS fragment were screened by PCR using pGEM universal primers and positive colonies were sequenced using Big-Dye Terminator (Applied Biosystems) and an ABI3100 sequencer (Applied Biosystems). Sequences were searched against GenBank (nr and dbEST databases) using BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/) to confirm the identity of the fragments.

    Table 4. Experimental validation of SNP-associated MPSS tags by GLGI-MPSS

    SNP typing

    All SNPs associated with the MPSS alternative tags selected for GLGI-MPSS analysis were typed. Four SNPs (rs1053941, rs2362587, rs6961 and rs7110) were typed by genomic DNA amplification followed by restriction digestion with DpnII, and the remaining SNP (rs2422) was typed by direct DNA sequencing since the restriction analysis was not possible due to the presence of several DpnII sites within the amplified sequence. For both genotyping strategies, genomic DNA (100 ng) from the HB4a cell line was amplified by PCR using primers flanking the SNPs (rs1053941 FW 5'-GAT GGT TCT TGT CCT ATA TC-3', rs1053941 REV 5'-CAG CCT AAG ACC CCA CT-3', rs2362587 FW 5'-AGC ACA GGC CTG GTT AC-3', rs2362587 REV 5'-TGT ATG GCT CCA TGG TCC-3', rs2422 FW 5'-GAG CTT GGA AGA TGG CG-3', rs2422 REV 5'-CAT TCC TCT TTC AAA CAG CC-3', rs6961 FW 5'-TGA ATG TCA TGC TGG TGC-3', rs6961 REV 5'-AGA GTG CAG AAG CGT ATG-3', rs7110 FW 5'-GCA ACC CTA GCA ATA CCA-3', rs7110 REV 5'-TAG CAG TGA CCT AAG TCC-3'). The amplification mixture was prepared in a final volume of 25 μl, containing 1x Taq Platinum DNA polymerase buffer (Invitrogen), 1.4 mM MgCl2, 0.1 mM dNTPs, 20 μM of each primer and 1 U of Taq DNA polymerase (Invitrogen). PCR conditions used for amplification were 94°C for 4 min, followed by 40 cycles at 94°C for 40 s, 57°C for 40 s and 72°C for 1 min. Reactions were kept at 72°C for 6 min after the last cycle. The amplified products were then either digested with DpnII or used for direct sequencing as described above. DpnII digestion was carried out in a final volume of 20 μl, including 1x buffer, 10 U of DpnII (New England Biolabs) and 4 μl of each PCR product. Reactions were kept at 37°C for 2 h and were analyzed on an 8% polyacrylamide gel stained with silver.

    RESULTS

    In theory, genuine alternative tags can be associated with the presence of SNPs within the tag sequence or within the restriction enzyme sites used for SAGE and MPSS library construction. To analyze the impact of SNPs on the generation of genuine alternative SAGE and MPSS tags, we have constructed a reference database of SNP-associated alternative tags. For the construction of this database 54 645 mRNA sequences containing a poly-A tail and corresponding to 20 300 UniGene clusters (Build #163) were initially scanned for the presence of NlaIII and DpnII restriction sites. Of the 54 645 mRNA sequences analyzed, 54 124 (99.0%) contained an NlaIII restriction site and 52 779 (96.6%) contained a DpnII site. mRNA sequences were then searched for the presence of SNPs according to the NCBI SNP Database (Build #118). Of the 54 124 mRNA sequences presenting NlaIII sites, 44 033 (81.4%) contained at least one SNP and of the 52 779 mRNA sequences with DpnII sites, 43 125 (81.7%) contained at least one SNP. The analysis for the identification of SNP-associated alternative tags was divided into three major categories as illustrated in Figure 1, and described in Materials and Methods.

    Creation of a new 3'-most restriction site

    First, we have analyzed mRNA sequences in which the presence of an SNP generated a new restriction enzyme site downstream to the original tag, producing in this way a 3' SNP-associated alternative tag (Figure 1A). From the 44 033 mRNA sequences containing both an NlaIII site and an SNP, we have identified 573 (1.3%) sequences in which the presence of SNPs created a new NlaIII restriction site downstream to the original tag. These 573 mRNA sequences correspond to 294 unique human genes according to UniGene. A total of 305 unique SNP-associated alternative SAGE tags were extracted from these 573 mRNA sequences (Table 1). A similar analysis was carried out for the 43 125 mRNA sequences containing both a DpnII site and an SNP. In this case, we have identified 393 (0.9%) mRNA sequences corresponding to 205 UniGene clusters. The presence of an SNP within these sequences generated 217 unique SNP-associated alternative MPSS tags (Table 1). We also included in this category 56 mRNA sequences that did not have an NlaIII or a DpnII restriction site, but acquired one because of the presence of an SNP.

    Table 1. SNP-associated tags generated by the creation of a new restriction enzyme site downstream of the position of the original tag

    In order to validate the existence of these SNP-associated alternative tags, we have compared them to a list of experimentally obtained SAGE and MPSS tags. This list included 586 144 unique SAGE tags derived from 260 SAGE libraries and 84 555 unique MPSS tags derived from six MPSS libraries. Of the 305 SNP-associated alternative SAGE tags catalogued in our reference database, 275 (90.2%) were found in the list of experimentally obtained SAGE tags, and of the 217 SNP-associated alternative MPSS tags, 40 (18.4%) matched the list of experimentally obtained MPSS tags (Table 1).

    However, the presence of an SNP-associated alternative tag within a dataset of experimentally obtained tags is not always irrefutable evidence for the existence of the alternative tag, since it can also occur in cases of tag sequence ambiguity, when two distinct transcripts contain by chance an identical tag sequence. In order to further validate our analysis, we have determined the percentage of the SNP-associated alternative tags that also correspond to the 3'-most original tag of a distinct human transcript. A small percentage (12.8%) of the 305 SNP-associated alternative SAGE tags corresponded to the 3' original tag of another transcript. This percentage was even smaller (3.2%) for the SNP-associated alternative MPSS tags due to the longer size and higher specificity of the tag sequence (Table 1). The presence of a high percentage of unambiguous SNP-associated alternative tags within a list of experimentally obtained SAGE and MPSS tags can thus be used to show that the occurrence of SNPs within transcript sequences is, indeed, an important source for the generation of alternative tags.
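
    The two checks described above (presence in the experimental tag list, and coincidence with the original tag of a different transcript) amount to simple set lookups. The following Python sketch shows one way to compute both percentages; the function name, the input layout and the toy tags are assumptions made for illustration and do not come from the reference database itself.

```python
from collections import defaultdict


def summarize_alternative_tags(alt_tags, observed_tags, original_tags):
    """Count validated and ambiguous SNP-associated alternative tags.

    alt_tags:      {alternative tag: transcript it was predicted from}
    observed_tags: iterable of experimentally obtained tags
    original_tags: {transcript: its 3'-most original tag}
    """
    observed = set(observed_tags)

    # Map each original tag to the transcripts that carry it.
    owners = defaultdict(set)
    for transcript, tag in original_tags.items():
        owners[tag].add(transcript)

    n_observed = sum(1 for tag in alt_tags if tag in observed)
    # Ambiguous: the tag is also the original tag of some OTHER transcript.
    n_ambiguous = sum(
        1 for tag, source in alt_tags.items()
        if owners.get(tag, set()) - {source}
    )

    total = len(alt_tags)

    def pct(n):
        return round(100.0 * n / total, 1) if total else 0.0

    return {
        "total": total,
        "observed": n_observed,
        "observed_pct": pct(n_observed),
        "ambiguous": n_ambiguous,
        "ambiguous_pct": pct(n_ambiguous),
    }


# Toy data: four predicted tags, three observed, one shared with transcript T9.
alt = {"AAAAAAAAAA": "T1", "CCCCCCCCCC": "T2",
       "GGGGGGGGGG": "T3", "TTTTTTTTTT": "T4"}
seen = {"AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"}
originals = {"T1": "ACACACACAC", "T9": "CCCCCCCCCC"}
print(summarize_alternative_tags(alt, seen, originals))
```

    An observed tag that is not ambiguous is the stronger piece of evidence, because no other known transcript is predicted to produce it.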

    Destruction of the original 3' restriction site

    We have then analyzed mRNA sequences in which the presence of an SNP disrupted the 3'-most restriction site associated with the original tag and, as a consequence, the second 3'-most site was used for the generation of the SNP-associated alternative tag (Figure 1B). We have identified, from the 44 033 mRNA sequences containing both an NlaIII site and an SNP, 498 (1.1%) sequences in which the presence of an SNP disrupted the 3'-most restriction site associated with the original tag. These 498 mRNA sequences correspond to 236 unique human genes according to UniGene, and a total of 235 unique SNP-associated alternative SAGE tags were extracted from them (Table 2). Of these 235 tags, 218 (92.8%) matched our list of experimentally obtained SAGE tags, and only a small fraction (13.2%) corresponded to the 3'-most original tag of another transcript, thus validating them as genuine SNP-associated alternative tags (Table 2).

    Table 2. SNP-associated tags generated by the disruption of the 3'-most restriction site associated with the original tag

    For the 43 125 mRNA sequences containing both a DpnII site and an SNP, we have identified 422 (1%) sequences in which the 3'-most restriction site was disrupted by an SNP. These 422 mRNA sequences correspond to 208 UniGene clusters and the presence of an SNP within these sequences generated 196 unique SNP-associated alternative MPSS tags of which 78 (39.8%) were found within experimentally obtained MPSS tags. As expected, the frequency of tag ambiguity for these SNP-associated alternative MPSS tags was very low (4.6%) (Table 2).

    Single base substitutions within original tag sequence

    Finally, we have analyzed mRNA sequences in which the SNP did not affect the restriction sites, but occurred within the adjacent tag sequence, producing an SNP-associated alternative tag with a single base substitution as compared to the original tag (Figure 1C). From the 44 033 mRNA sequences containing both an NlaIII site and an SNP, we have identified 1136 (2.6%) sequences in which an SNP occurred within the adjacent tag sequence. These 1136 mRNA sequences correspond to 543 unique human genes according to UniGene, and generated 560 unique SNP-associated alternative SAGE tags (Table 3). Of these 560 tags, 512 (91.4%) were found within the experimentally obtained SAGE tag dataset and only 92 (16.4%) also corresponded to the 3'-most original tag of another transcript (Table 3).

    Table 3. SNP-associated tags generated by SNPs that occurred within the adjacent tag sequence

    Similarly, for the 43 125 mRNA sequences containing both a DpnII site and an SNP, we have identified 1009 (2.3%) mRNA sequences in which an SNP occurred within the adjacent tag sequence. These 1009 mRNA sequences correspond to 481 unique human transcripts according to UniGene, and generated 507 unique SNP-associated alternative MPSS tags, of which 127 (25.0%) were experimentally obtained and 33 (6.5%) corresponded to the 3'-most tag of another transcript (Table 3).

    Integration of the reference database of SNP-associated alternative tags to SAGE Genie

    To make our analysis accessible to the research community, we have integrated the database of SNP-associated alternative SAGE tags into SAGE Genie (http://cgap.nci.nih.gov/SAGE) (7). The data can be directly downloaded as flat-files or visualized in the ‘Ludwig Transcript Viewer’ as exemplified in Figure 2. The information related to the existence of an SNP-associated alternative tag for a given transcript is presented in the ‘Ludwig Transcript Viewer’ as a separate table just below the schematic representation of the transcript sequence and its corresponding virtual SAGE tags. This table describes the impact of the SNP on the transcript sequence (e.g. creates a new 3'-most NlaIII site), and includes additional information related to the SNP-associated alternative tag, such as its sequence and position within the transcript sequence and its frequency in the SAGE Genie database. Information about the SNP related to the alternative tag, such as the SNP accession number, the base substitution and the position of the SNP in the transcript sequence, is also provided in the table, as well as a direct link to the NCBI-SNP database. Using the ‘Ludwig Transcript Viewer’, the SAGE Genie user can now easily check whether an mRNA sequence presents an SNP-associated alternative tag or whether a specific SAGE tag corresponds to an SNP-associated alternative tag of a known human gene.

    Figure 2. Integration of the database of SNP-associated alternative SAGE tags into SAGE Genie. A representative example of the Ludwig Transcript Viewer showing the transcript encoded by the MAF1 gene (NM_032272) as a blue line; the colored boxes represent the last four virtual tags relative to the 3' end of the transcript. The expression levels for each of the four virtual tags as well as the tag position in the transcript sequence are provided in the Tag Info Summary. The existence of an SNP-associated alternative tag for the MAF1 transcript is shown in a specific table (as indicated by the arrow), which includes the tag sequence, tag position within the transcript, tag frequency in the SAGE Genie database, the SNP id associated with the alternative tag, the base substitution and the position of the SNP within the transcript sequence.

    Experimental validation of SNP-associated alternative MPSS tags

    To further confirm the impact of SNPs on the generation of genuine alternative tags, we have used the GLGI-MPSS technique (16) to convert five SNP-associated alternative tags observed in the HB4a MPSS library into their corresponding 3' cDNA fragments. These extended 3' cDNA fragments were then used in similarity searches against public databases in order to confirm their specificity (Table 4).

    A sense primer corresponding to the SNP-associated alternative MPSS tag was used for GLGI-MPSS amplification as described in Materials and Methods. As can be seen in Figure 3, a predominant band was obtained for all GLGI-MPSS reactions. Bands were excised from the gel, cloned, sequenced and searched for sequence similarity against GenBank (nr and dbEST). With the exception of the SNP-associated alternative tag corresponding to the AK092889 transcript, all the others were validated and produced a 3' cDNA fragment matching the expected transcript sequence and confirming the origin of the SNP-associated alternative tag. The 3' cDNA fragment generated with the SNP-associated alternative tag corresponding to the AK092889 transcript matched an unrelated cDNA sequence (BC064564) in which the sequence corresponding to the alternative tag could not be found. This fragment should then be considered as an artifact generated by unspecific GLGI-MPSS amplification.

    Figure 3. GLGI-MPSS amplifications of five SNP-associated alternative MPSS tags. GLGI-MPSS amplifications for SNP-associated alternative MPSS tags listed in Table 4 and corresponding to the mRNA sequences NM_002482 (1); AK023594 (2); D86973 (3); NM_004168 (4); AK092889 (5) were analyzed on 1% agarose gel stained with ethidium bromide; 100 bp ladder (M) was used as molecular weight marker.

    We then decided to genotype the HB4a cell line for the presence of the SNPs associated with these alternative tags. All five selected SNPs created a new restriction enzyme site downstream of the position of the original tag. Primers flanking the SNPs were designed and used to amplify HB4a genomic DNA. Amplified fragments were either digested with DpnII or used for direct sequencing. As can be seen in Figure 4, the occurrence of four SNPs (rs1053941, rs2362587, rs6961 and rs7110) in the HB4a cell line was confirmed after restriction digestion. The observed restriction digestion pattern suggests that the HB4a cell line is heterozygous for all SNPs analyzed by restriction digestion. The presence of the remaining SNP (rs2422) was confirmed by direct sequencing (data not shown), and the HB4a cell line turned out to be homozygous for this polymorphism. According to the SNP genotyping results, in cases of heterozygosis the occurrence of both the original and the SNP-associated tags within the HB4a MPSS library is expected. As shown in Table 4, both the original and SNP-associated tags were found in the HB4a MPSS library at approximately the same frequency for all cases of heterozygosis.

    Figure 4. SNP typing by genomic DNA amplification followed by restriction enzyme digestion. The genomic region flanking the SNPs rs1053941 (SNP1), rs2362587 (SNP2), rs6961 (SNP4) and rs7110 (SNP5) was amplified using specific primers and genomic DNA from the HB4a cell line. PCR fragments were digested with DpnII, and analyzed on 8% polyacrylamide gels stained with silver; 100 bp ladder (M) was used as molecular weight marker and bands corresponding to the restriction fragments are indicated by arrows.
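
    The logic of reading a genotype from a restriction digest can be made concrete with a small simulation. The sketch below assumes that DpnII cuts immediately 5' of each GATC, ignores overhangs and gel resolution, and uses invented 60 bp amplicons; none of the sequences or function names correspond to the real amplicons used for rs1053941, rs2362587, rs6961 or rs7110.

```python
DPNII = "GATC"  # DpnII recognition sequence; cut placed just 5' of the site


def digest(amplicon, site=DPNII):
    """Return the fragment lengths obtained after complete digestion."""
    cuts = []
    pos = amplicon.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = amplicon.find(site, pos + 1)
    bounds = [0] + cuts + [len(amplicon)]
    # Skip zero-length pieces (a site at the very start of the amplicon).
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def expected_bands(allele1, allele2):
    """Distinct band sizes expected on a gel for a diploid genotype."""
    return sorted(set(digest(allele1)) | set(digest(allele2)), reverse=True)


# Hypothetical amplicons: the SNP (C -> T at position 22) creates a DpnII site.
original = "A" * 20 + "GACC" + "T" * 36
variant = "A" * 20 + "GATC" + "T" * 36

print(expected_bands(original, original))  # [60]
print(expected_bands(original, variant))   # [60, 40, 20]
print(expected_bands(variant, variant))    # [40, 20]
```

    Under these assumptions, a heterozygous sample shows the uncut amplicon together with both digestion products, whereas a sample homozygous for the site-creating allele shows only the two products.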

    DISCUSSION

    The impact of alternative polyadenylation and alternative splicing on the generation of genuine alternative tags has already been studied. However, alternative tags can also be generated by the presence of SNPs within the tag sequence or within the restriction enzyme sites used for SAGE and MPSS library construction.

    In this work, we found that the presence of SNPs within human mRNA sequences was responsible for the generation of 2020 SNP-associated alternative tags (1100 SNP-associated alternative SAGE tags and 920 SNP-associated alternative MPSS tags) and that 8.6% of all known human genes present at least one SNP-associated alternative tag. It should be noted, however, that this number is certainly underestimated because the growth of the NCBI SNP database has not yet reached a plateau (statistics available at http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi), suggesting that just a fraction of the whole repertoire of SNPs present in the human genome has so far been reported. On the other hand, one should also be aware that only a fraction of the NCBI-SNP database (40.7%) is validated, which could potentially lead to the identification of artifactual SNP-associated alternative tags. However, the SNPs included in our database show a similar proportion of validation (44.5%), suggesting that they are a fair representation of the whole set of SNPs. In spite of that, we expect that the growth in the collection of validated SNPs, as well as the availability of information related to allele frequencies in specific populations, will enrich and better refine the analyses presented here. We have restricted our analysis to 54 645 mRNA sequences containing a poly-A tail and, thus, considered to represent the 3' end of a transcript. However, poly-A tails are sometimes removed from transcript sequences during the database submission process, and it is likely that among the 75 503 mRNA sequences excluded from our analysis (corresponding to 8677 UniGene clusters) there are several representing genuine 3' transcript ends. If we analyze all the 130 148 mRNA sequences catalogued at UniGene, irrespective of the presence of a poly-A tail, a total of 3520 SNP-associated alternative tags can be identified (1950 SNP-associated alternative SAGE tags and 1570 SNP-associated alternative MPSS tags). This number corresponds to an increase of 74.3% over the number of SNP-associated alternative tags identified in the analysis restricted to sequences with a poly-A tail.

    Interestingly, we did not observe a significant decrease in the percentage of SNP-associated alternative tags that are experimentally documented. Approximately 62% of the SNP-associated alternative tags identified from mRNA sequences with a poly-A tail were experimentally obtained and this number decreases to 57.8% if we consider all mRNA sequences. Taken together, these results suggest that the majority of the mRNA sequences without a poly-A tail indeed represent a genuine 3' transcript end and that the number of SNP-associated alternative tags reported here is conservative and probably underestimated.

    A significant fraction (91.4%) of the SNP-associated alternative SAGE tags were found within the CGAP SAGE database, but only 26.6% of the SNP-associated alternative MPSS tags could be experimentally obtained. This difference can be explained by the larger number of SAGE libraries used to generate our list of experimentally validated tags. If we assume that each of these SAGE and MPSS libraries was derived from a single individual, the genetic variability represented within the SAGE dataset of experimentally obtained tags (extracted from 260 different libraries) is much higher than that represented within the MPSS dataset (extracted from six different libraries), thus increasing the chance of a given polymorphism (and consequently the corresponding SNP-associated alternative tag) being represented within the experimentally obtained dataset.
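
    The pooled percentages quoted above follow directly from the per-category counts reported in the Results section. The short Python check below reproduces them; the counts are taken from the text, while the dictionary layout and function name are only a convenience for this illustration.

```python
# (tags catalogued, tags found experimentally) per category, from the Results.
COUNTS = {
    "SAGE": {"new_site": (305, 275),
             "site_disrupted": (235, 218),
             "within_tag": (560, 512)},
    "MPSS": {"new_site": (217, 40),
             "site_disrupted": (196, 78),
             "within_tag": (507, 127)},
}


def pooled(methods=("SAGE", "MPSS")):
    """Return (total tags, observed tags, observed percentage)."""
    total = sum(t for m in methods for t, _ in COUNTS[m].values())
    observed = sum(o for m in methods for _, o in COUNTS[m].values())
    return total, observed, round(100.0 * observed / total, 1)


print(pooled(("SAGE",)))   # (1100, 1005, 91.4)
print(pooled(("MPSS",)))   # (920, 245, 26.6)
print(pooled())            # (2020, 1250, 61.9)

# Relative increase when sequences without a poly-A tail are included.
print(round(100.0 * (3520 - 2020) / 2020, 1))   # 74.3
```

    The overall figure of 61.9% matches the fraction of experimentally matched tags given in the Abstract.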

    To overcome the problem of the limited number of MPSS libraries available, we have further confirmed the origin of five SNP-associated alternative MPSS tags found in the HB4a MPSS library through the use of GLGI-MPSS (16). With the exception of the SNP-associated alternative tag corresponding to the AK092889 mRNA, all the others produced a GLGI-MPSS 3' cDNA fragment matching the expected known human mRNA. The HB4a cell line was also genotyped for the presence of the SNPs and shown to be heterozygous for 4 out of the 5 SNPs analyzed. As expected, both the original and SNP-associated tags were found in the HB4a MPSS library in the cases of heterozygosis. These preliminary results suggest that the existence of SNP-associated alternative tags can be used to study allele-specific gene expression. Allele-specific variations in gene expression have been classically associated with X-chromosome inactivation and genomic imprinting, and recent studies have also shown that it is relatively common among non-imprinted autosomal genes (17,18). We are currently using our reference database of SNP-associated alternative tags to study allele-specific gene expression in a genome-wide context. To enhance the utility of the analysis presented in this work, we have integrated our database of SNP-associated alternative tags into SAGE Genie, a web site for the analysis and presentation of SAGE data (7). SNP-associated alternative tags can now be easily identified and correctly assigned to human transcripts allowing an improvement of the interpretation of SAGE experiments. Planned updates of our reference database with sequence data generated from full-length cDNA sequencing projects as well as with new releases of the NCBI SNP database will increase the accuracy of our analysis. These updates will be periodically available through SAGE Genie and will certainly improve the interpretation of SAGE and MPSS experiments.

    ACKNOWLEDGEMENTS

    The authors would like to thank Daniela Gerhard, Susan Greenhut and Carl Schaefer from the National Cancer Institute for help in making our data available through SAGE Genie. Funding was provided by the CEPID Program from the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP 98/14335-2). The Ludwig Institute for Cancer Research and the National Cancer Institute funded the construction of the MPSS libraries for breast and colon cell lines, respectively.

    REFERENCES

    1. Lander,E.S. (1996) The new genomics: global views of biology. Science, 274, 536–539.

    2. Collins,F.S., Patrinos,A., Jordan,E., Chakravarti,A., Gesteland,R. and Walters,L. (1998) New goals for the U.S. Human Genome Project: 1998–2003. Science, 282, 682–689.

    3. Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.

    4. Brenner,S., Johnson,M., Bridgham,J., Golda,G., Lloyd,D.H., Johnson,D., Luo,S., McCurdy,S., Foy,M., Ewan,M. et al. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol., 18, 630–634.

    5. Brenner,S., Williams,S.R., Vermaas,E.H., Storck,T., Moon,K., McCollum,C., Mao,J.I., Luo,S., Kirchner,J.J., Eletr,S. et al. (2000) In vitro cloning of complex mixtures of DNA on microbeads: physical separation of differentially expressed cDNAs. Proc. Natl Acad. Sci. USA, 97, 1665–1670.

    6. Madden,S.L., Wang,C.J. and Landes,G. (2000) Serial analysis of gene expression: from gene discovery to target identification. Drug Discov. Today, 9, 415–425.

    7. Boon,K., Osorio,E.C., Greenhut,S.F., Schaefer,C.F., Shoemaker,J., Polyak,K., Morin,P.J., Buetow,K.H., Strausberg,R.L., De Souza,S.J. et al. (2002) An anatomy of normal and malignant gene expression. Proc. Natl Acad. Sci. USA, 99, 11287–11292.

    8. Clark,T., Lee,S., Ridgway,S.L. and Wang,S.M. (2002) Computational analysis of gene identification with SAGE. J. Comput. Biol., 9, 513–526.

    9. Wang,D.G., Fan,J.B., Siao,C.J., Berno,A., Young,P., Sapolsky,R., Ghandour,G., Perkins,N., Winchester,E., Spencer,J. et al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280, 1077–1082.

    10. Cargill,M., Altshuler,D., Ireland,J., Sklar,P., Ardlie,K., Patil,N., Shaw,N., Lane,C.R., Lim,E.P., Kalyanaraman,N. et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet., 22, 231–238.

    11. Sachidanandam,R., Weissman,D., Schmidt,S.C., Kakol,J.M., Stein,L.D., Marth,G., Sherry,S., Mullikin,J.C., Mortimore,B.J., Willey,D.L. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.

    12. Shastry,B.S. (2002) SNP alleles in human disease and evolution. J. Hum. Genet., 47, 561–566.

    13. Galante,P.A., Sakabe,N.J., Kirschbaum-Slager,N. and de Souza,S.J. (2004) Detection and evaluation of intron retention events in the human transcriptome. RNA, 5, 757–765.

    14. Sakabe,N.J., de Souza,J.E., Galante,P.A., de Oliveira,P.S., Passetti,F., Brentani,H., Osorio,E.C., Zaiats,A.C., Leerkes,M.R., Kitajima,J.P. et al. (2003) ORESTES are enriched in rare exon usage variants affecting the encoded proteins. C R Biol., 326, 979–985.

    15. Jongeneel,C.V., Iseli,C., Stevenson,B.J., Riggins,G.J., Lal,A., Mackay,A., Harris,R.A., O'Hare,M.J., Neville,A.M., Simpson,A.J. et al. (2003) Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc. Natl Acad. Sci. USA, 100, 4702–4705.

    16. Silva,A.P.M., Chen,J., Carraro,D.M., Wang,S.M. and Camargo,A.A. (2004) Generation of longer 3' cDNA fragments from Massive Parallel Signature Sequencing Tags. Nucleic Acids Res., 32, e94.

    17. Knight,J.C. (2004) Allele-specific gene expression uncovered. Trends Genet., 20, 113–116.

    18. Lo,H.S., Wang,Z., Hu,Y., Yang,H.H., Gere,S., Buetow,K.H. and Lee,M.P. (2003) Allelic variation in gene expression is common in the human genome. Genome Res., 13, 1855–1862.

    (Ana Paula M. Silva, Jorge E. S. De Souza)