当前位置: 首页 > 期刊 > 《核酸研究》 > 2004年第20期 > 正文
编号:11370060
Novel algorithm for automated genotyping of microsatellites
http://www.100md.com 《核酸研究医学期刊》
     1 Japan Biological Information Research Center, Japan Biological Informatics Consortium, Tokyo, Japan, 2 Hitachi Software Engineering Co., Ltd, Tokyo, Japan, 3 Department of Molecular Life Science, Course of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine, Kanagawa, Japan, 4 Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan and 5 Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Shizuoka, Japan

    * To whom correspondence should be addressed. Tel: +81 3 5780 2111; Fax: +81 3 5780 2049; Email: tmatsumoto@hitachisoft.jp

    ABSTRACT

    Microsatellites or short tandem repeats (STRs) are abundant in the human genome with easily assayed polymorphisms, providing powerful genetic tools for mapping both Mendelian and complex traits. Microsatellite genotyping requires detection of the products of polymerase chain reaction (PCR) amplification by electrophoresis, and analysis of the peak data for discrimination of the true allele. A high-throughput genotyping approach requires computer-based automation at both the detection and analysis phases. In order to achieve this, complicated peak patterns from individual alleles must be interpreted in order to assign alleles. Previous methods consider limited types of noise peaks and cannot provide enough accuracy. By pattern recognition of various types of noise peaks, such as stutter peaks and additional peaks, we have achieved an overall average accuracy of 94% for allele calling in our actual data. Our algorithm is crucial for a high-throughput genotyping system for microsatellite markers by reducing manual editing and human errors.

    INTRODUCTION

    Recently, considerable effort has been put into the genetic mapping of complex diseases. Whole-genome association studies using polymorphic markers, such as SNP and microsatellite markers, have been proposed as an effective approach for dissecting major genetic factors underlying complex diseases. Microsatellites are frequently used as genetic markers. They consist of short tandem repeats, which are generally 2–6 bp in length. Microsatellites are frequently polymorphic with the number of repeat units varying between individuals. Additionally, they are abundant in the human genome, and readily identified. Furthermore, they are easily shared through sequence information, and are straightforward to assay via PCR amplification and subsequent size determination by gel or capillary electrophoresis (1,2). Due to these advantages, a number of microsatellite-based genetic and physical maps have been constructed for several species, including Homo sapiens (3–8).

    Recent advances in the field of DNA sequencing have allowed researchers to perform genotyping cost effectively and on a massive scale. This has allowed studies, such as those commonly performed for disease gene mapping, to genotype several hundred microsatellite markers. In order to analyze these large genotype datasets efficiently some form of high-throughput genotyping system is required. A key problem faced during analysis is coping with various types of noise bands, i.e. peaks, caused by PCR artifacts. A problem inherent in the process is PCR stutter (or ‘slippage’), an experimental error which can be caused by slipped strand mispairing during the PCR (9,10). PCR ‘slippage’ introduces additional DNA fragments which are one or several repeats shorter or longer than the true allele peak. The introduction of several repeats of a length similar but different from that of the allele mask its ‘true’ position. Another common cause of noise are ‘+A peaks’. These are PCR artifacts one nucleotide longer than the true allele peak (11). This artificial fragment is created by the DNA polymerase adding a nontemplated nucleotide residue (often an adenine) at the 3' end of a DNA fragment. The nontemplated nucleotide addition is observed in every peak, including the true allele and the stutter peaks. In addition to the presence of PCR artifacts, compound and interrupted microsatellite repeats can make peak pattern recognition a complicated and difficult process. Previously, detailed manual inspection was the only way to perform accurate allele identification through microsatellite genotyping. Manual curation can be a very time-consuming process.

    Previous studies have stated the requirement for algorithms which can discriminate stutter peaks from true allele peaks and automate the allele identification process (12,13). Additionally, software which uses heuristic mechanisms to perform rudimentary filtering inference of allele has been proposed to reduce the amount of manual confirmation required for positive allele identification (14). However, these programs do not take ‘+A peaks’ into account when calculating allele size (12–14). On the other hand, the Genotyper software (Applied Biosystems) for automated genotyping and the GeneMapper software (Applied Biosystems) for the allele call quality control are major programs that are commercially available.

    In the present study, we have developed a novel algorithm that predicts allele size while taking into consideration the noise created by both PCR stutter and the +A peaks. The algorithm is based on the assumption that there is a linear relationship between stutter peak pattern and allele size. Moreover, the algorithm also considers the exceptional +A peak pattern that is often higher than the true allele peak. We evaluated our algorithm by applying it to an actual data set from 174 microsatellite markers. The results from this analysis showed that our algorithm has an overall average accuracy of 94%—the highest so far reported for a tool of this kind.

    ALGORITHM AND METHODS

    Algorithm description

    Our algorithm attempts to remove some of the noise caused by artifacts created during the PCR process. It has been observed that stutter patterns are peculiar to each marker sequence, and the PCR generates reproducible stutter patterns for the same marker, among different individuals with the same allele (12). Additionally, it has been reported that for any one marker the height ratio of stutter peak to true allele peak when plotted against fragment size follows linear regression pattern (15). By appropriate utilization of these properties, we can determine the linear regression line for one marker using the data of several individuals with simple peak patterns. Simple peak patterns are those cases in which the true allele peak can be easily identified. In complex cases, there may be multiple stutter peaks and multiple true allele candidates. In cases such as these, if fragment size is plotted against the linear regression pattern created for the same marker using simple examples, then the expected ratio of stutter peak to true allele peak can be calculated. By comparing the calculated expected ratio with the ratios from the potential candidates, our algorithm can predict the most likely ‘true’ allele peak for that sample. In the case of +A peaks, the ratio of any +A peak to either its true allele peak or any associated stutter peak is specific for both the marker (16,17) and the sample (18). This means the ratios of different samples even though they are from the same marker and the same allele are sometimes different. However, for any sample there is a uniform pattern between the true allele peak and the associated +A peak. This is also true for any stutter peaks with associated +A peaks. The ratio of the height of any +A peak to either its corresponding true allele peak or corresponding stutter peak is approximately the same for all comparisons throughout the same sample. Because the height ratio of +A peaks to its corresponding peaks should be constant when the predicted true allele peak is expected to be correct, our algorithm can predict the most likely true allele peak and its associated +A peak in complex cases. In order to exclude those ratios given by misinterpreted peaks, our algorithm initially examines several individuals with easily identifiable peak patterns. Next our algorithm determines the maximum and minimum value ranges of all ratios for those easily identifiable peaks. Then our algorithm calculates the peak height ratios for the remaining replicates. When the ratio exceeds the maximum value or is less than the minimum, our algorithm substitutes its value for the maximum or minimum.

    Essentially the algorithm consists of two parts as shown in Figure 1: (i) examination of the pattern of stutter peaks and +A peaks for the marker, then the calculation of a ‘template’ defined by a stutter and +A peak pattern for any true allele peak, and (ii) inference of allele(s) within each individual by adjusting the calculated template(s) based upon the observed peak pattern. Further details of each part of this process are described in the next section.

    Figure 1. Algorithm summary. This algorithm consists of two parts: template calculation and genotype inference.

    Details of the algorithm for template calculation

    This part of the algorithm can be separated into three subsections (Figure 1):

    The selection of several individuals.

    Obtaining stutter patterns and the value range of the ratios for the +A peaks.

    The fixation of a template for all samples where the ratio for +A peaks was determined.

    Selection of individuals

    The algorithm selects heterozygotes with two alleles that are sufficiently separated and/or confirmed homozygotes. In both of these cases, peak interpretation is relatively easy. Homozygotes are represented as single cluster of peaks (Figure 2C). Heterozygotes with non-overlapping peaks can be easily identified due to the distinct patterns clearly visible as dual clusters of peaks (Figure 2A). Closely spaced heterozygotes represent a minefield of many complicated patterns with the composite cluster of peaks (Figure 2Ba and Bb). It is first necessary to distinguish homozygotes from heterozygotes with the composite cluster of peaks. We achieve this by using information from heterozygotes with the dual cluster of peaks (Figure 2A) and heterozygotes with the bimodal composite cluster of peaks (Figure 2Bb). The clusters of peaks are examined and the numbers of peaks longer and shorter than the highest peak in the cluster are counted. Assuming homozygotes have the same numbers of peaks longer and shorter than the highest peak, individuals with unimodal peak patterns are determined if the genotype in question is homozygous.

    Figure 2. Arrows indicate true allele peaks. A star indicates that the peak was not recognized. A circle indicates the +A peak of the longer true allele peak. (A) A heterozygote suitable for examination of the pattern of stutter peaks and +A peaks, since two clusters of peaks are not overlapping. (B) A heterozygote not suitable for examination of the pattern of stutter and +A peaks. (Ba) The peak pattern is unimodal because two alleles are closely spaced. The longer true allele peak is missed because it is combined with the +A peak of the shorter true allele peak. (Bb) Although the peak pattern is bimodal, two clusters of peaks are overlapping. (C) A homozygote suitable for examination of the pattern of stutter and +A peaks.

    Next, our algorithm decides whether or not the peak patterns observed in the selected individuals are suitable for template calculation, i.e. various types of noise peaks other than stutter and +A peaks are contained in the electropherograms. Our algorithm accesses individual suitability based upon three criteria. First in a heterozygote, both clusters of peaks should have approximately the same numbers of stutter peaks. When the difference between them is >2, we classify the individual as a homozygote with pseudo peak(s). Second, a true allele peak should be >20% of its corresponding +A peak. When this is not the case, the peak is assumed to be a +A peak and not a true allele peak. Using an empirical threshold set at 20%, we derived high level of accuracy. Third, all selected individuals should have approximately the same relative heights of +A peaks because the ratio is inherent in each marker sequence (16,17).

    To perform the confirmation of these criteria in the previous paragraph, each peak needs interpretation. For the markers with a repeat unit length >2 bp, pairs of peaks with fragment sizes 1 bp apart, are easily identified in a cluster. Observing a pattern like this, we can conclude that the shortest and longest peaks in the highest pair represent the true allele peak and its +A peak. If there are any other pairs, these represent simply stutter peaks which have associated +A peaks. In the case of markers with repetitive dinucleotide units, the peak pattern is complicated by the ladder of peaks which often exists in a cluster at 1 bp intervals. A peak next to the +A peak should be either a true allele peak or a stutter peak. As the stutter peaks are usually lower than the true allele peak, the highest peak should be the true allele peak or its +A peak. To determine the above, the variance of the height ratios from every pair of the true allele or stutter peaks and their +A peaks in cluster(s) is calculated under each assumption that the highest peak is the true allele peak or its +A peak. The peaks higher than 15% of the highest peak are used for the calculation since low peaks have poor quantification. The variance when the assumption is expected to be correct should be low because the ratios between pair members are constant (17). The assumption in which variance <0.01 is determined correct. When variances are <0.01 in both assumptions, the highest peak is considered as the true allele peak. This is based on our observation that markers without +A peaks tend to show small variances on both assumptions. Any individuals with a variance >0.01 on both assumptions is removed as in this case, it is highly likely that the highest peak is a noise peak.

    Determination of stutter patterns and ratio range for +A peaks

    To determine the patterns of stutter peaks, our algorithm calculates a linear regression line using the selected individuals as follows. Figure 3A and B show two selected homozygotes with different alleles. The size and height of the true allele peak of Figure 3A are represented by s0 and h0. Those of the k repeats shorter or longer stutter peak are represented by sk and hk, respectively. Coordinates (s0, hk/h0) are plotted as shown in Figure 3C. The coordinates from all selected individuals are plotted in the same way. The linear regression line representing the relationship between the height ratio of k repeat shorter/longer stutter peak to true allele peak for a fragment size is calculated as follows:

    (1)

    Using the linear regression line calculated with the above algorithm, we can estimate the relative height of the k repeat shorter/longer stutter peak for any allele. Our algorithm has a number of rules built in to prevent an incorrectly inferred linear regression line being calculated. For example, when the height of stutter peak is estimated to be negative or higher than the true allele peak, the regression line is discarded. The height of the stutter peak is approximated to be uniform from allele to allele, that is the gradient of the line is set to 0. This approximation is reasonable for estimating the stutter pattern adequately, given that the gradient of the linear regression slope is found to be <0.05 among all the datasets we analyzed. The correction is effective when individuals selected to build the regression line have fragment sizes across only a small range of values.

    Figure 3. (A and B) Arrows, squares and circles indicate true allele peaks, stutter peaks and +A peaks, respectively. The sizes of true allele peaks for individual (A) and (B) are s0 and s0', and the heights of the true allele peaks are h0 and h0', respectively. The sizes of one repeat shorter stutter peaks for individual (A) and (B) are s1 and s1', and the heights of the stutter peaks are h1 and h1'. (C) The one repeat shorter stutter peak of individual (A) and (B) is represented by coordinates (s0, h1/h0) and (s0', h1'/h0'). Those of other individuals are represented in the same way, and the linear regression line is calculated.

    The range in the height ratio of a +A peak to a true allele peak is fixed using the values derived from selected individuals. By calculating the ratio for each selected individual, we determined the maximum value rmax and minimum value rmin of the ratios.

    Calculation of template for each sample and each allele

    A template can be calculated for any sample with any allele using the previously mentioned linear regression lines of stutter peaks and the value range of ratio for +A peaks. Let the size and height of the true allele peak be s and h, and the ratio of +A peak height to true allele peak height be r. Then, the height of the k repeat's shorter stutter peaks of the template is equal to h*(ak*s + bk) and the height of a +A peak derived from the stutter peak is equal to r*h*(ak*s + bk). Longer stutter peaks from the template are calculated in the same way. In cases where r > rmax or r < rmin, r is substituted with rmax or rmin.

    Details of the algorithm for inference of genotype

    In this section, we describe how the genotype for each individual was inferred through the adjustment of templates to the observed peak pattern. The observed peak pattern for a homozygote should be similar to the derived template and a heterozygote's peak pattern should look like two overlapped templates. Our algorithm scores the difference between the observed and expected peak patterns for each of the six hypotheses: (i) the individual is homozygous and the true allele peak is higher than its +A peak, (ii) the individual is heterozygous and both two true allele peaks are higher than their +A peaks, (iii) the individual is heterozygous and the first true allele peak is higher than its +A peak, while the second true allele peak is lower than its +A peak, (iv) the individual is homozygous and the true allele peak is lower than its +A peak, (v) the individual is heterozygous and the first true allele peak is lower than its +A peak, while the second true allele peak is higher than its +A peak, and (vi) the individual is heterozygous and both two true allele peaks are lower than their +A peaks. We select the hypothesis with the minimum score.

    Figure 4 illustrates the procedure used to score the differences. First, our algorithm assumes the hypothesis (i), the highest peak, Pmax, is considered a true allele peak. The white triangles in Figure 4 represent a template under hypothesis (i). Let the difference between the heights of ith expected and observed peak be represented by hi. The arrows in Figure 4 indicate these differences. The score of the difference diff1 is defined by

    (2)

    Figure 4. A solid line indicates the observed peak pattern. White triangles indicate the template, assuming the highest peak, Pmax, is the true allele peak. Arrows indicates the difference between the template and the observed peak pattern.

    Second, our algorithm assumes the hypothesis (ii), Pmax and the peak Pmax' with the largest hi is considered to be a true allele peak. The score of the difference diff2 is defined as the sum of the squared height differences between the observed peak pattern and the sum of two templates determined from Pmax and Pmax', respectively. diff3 for hypothesis (iii) in which Pmax is a true allele peak, and Pmax' is a +A peak of the second true allele peak is calculated similarly. Subsequently, diff4, diff5 and diff6 for hypotheses (iv), (v) and (vi), where Pmax is assumed to be a +A peak of the true allele peak, are computed as described above.

    In order to make expected peak pattern fit to the observed one, the template(s) are adjusted as follows.

    The expected peak pattern is adjusted along the x-axis to fit the observed pattern, since the fragment sizes are provided as decimals, not integers. The fragment sizes of stutter peaks and +A peaks are estimated from that of Pmax or Pmax' and the unit length. The observed peak, the size of which is within the estimated fragment size +/–0.5, is adopted as a stutter peak or +A peak.

    When the two alleles overlap as in Figure 4, the peak identified as a true allele peak may be the sum of the true allele peak and the stutter peak of the other true allele peak. Thus, the height of stutter peaks in the template calculated with the overlapped peak is overestimated. In order to avoid the overestimation, our algorithm multiplies each of the heights of peaks of the template by a constant number adjusted to minimize diff. In hypotheses assuming the individual is heterozygous, two constant numbers are found independently.

    In order to avoid the consideration of a cluster of noise peaks as an allele, our algorithm assumes the individual is heterozygous only when the lower true allele peak is higher than 20% of the higher true allele peak. We found that in our dataset, the inferred genotypes matched the genotypes derived by visual inspection most accurately when the threshold was set at 20%. This threshold is identical to that of a previous study (13).

    Genotyping data

    The dataset used was obtained from 174 well-documented microsatellite markers (listed in Table 1) extracted from 250 Japanese individuals. The dataset had been produced by PCR and fragment detection (Tamiya,G., Shinya,M., Ikuta,T., Makino,S., Okamoto,K., Furugaki,K., Matsumoto,T., Mano,S., Ando,S., Nozaki,Y. et al., submitted). Our software can process data in plain text format containing height and fragment size for every peak in the electropherogram from each sample. For this study, we choose to process the ABI 3700 electropherogram data using the ABI GeneScan Analysis Software and the Genotyper software.

    Table 1. Marker information

    RESULTS

    We have evaluated our algorithm with 174 microsatellite markers. Based on the electropherogram data from three individuals, 174 microsatellite markers were categorized into four major classes by manual inspection (Figure 5). The four classes are as follows: (A) Markers for which the true allele peak is always higher/lower than the corresponding +A peak, and the stutter peaks only shorter than the true allele peak are observed. (B) Markers for which the true allele peak is always higher/lower than the corresponding +A peak, and possess stutter peaks both shorter and longer than the true allele peak. (C) Markers in which the +A peak is higher than the true allele peak for some samples and lower than the true allele peak for others. (D) Markers which have peak patterns so noisy or cryptic that even trained observers are undecided as to which peak is a stutter peak or a +A peak.

    Figure 5. Arrows, squares and circles indicate true allele peaks, stutter peaks, and +A peaks, respectively. (A) An example of markers satisfying two conditions: the true allele peak is always higher or always lower than its +A peak, and only stutter peaks shorter than the true allele peak are observed. (Aa) For this marker, almost no stutter peak or +A peak is observed. (Ab) For this marker, the true allele peak is always higher or always lower than its +A peak. Almost no stutter peaks longer than the true allele peak are observed, while stutter peaks shorter than the true allele peak are observed. (B) An example of markers satisfying two conditions: the true allele peak is always higher or always lower than its +A peak, and both stutter peaks shorter and longer than the true allele peak are observed. For this marker, the true allele peak is always lower than its +A peak and both stutter peaks shorter and longer than the true allele peak are observed. (C) An example of markers with a true allele peak approximately as high as its +A peak. Therefore, the true allele peak is higher than its +A peak for some samples and lower for others. (D) An example of markers for which experienced researchers cannot decide which peak is a stutter peak or a +A peak.

    One representative example marker in each class was chosen (Table 2). For class (A), two peak patterns (Aa) and (Ab) were selected. (Aa) has almost no stutter peaks and +A peaks. Using the genotype data from 250 individuals for these five representative markers, we compared our algorithm's automatic allele identification accuracy against that of manual allele calling by a trained expert. The accuracy rate obtained for our algorithm was then compared against those obtained for Genotyper and GeneMapper (Table 2). We selected these programs for comparative analysis against our algorithm because they are commonly used in microsatellite genotyping. We could not evaluate other programs because they are currently unavailable (12–14). Our algorithm achieved a higher accuracy rate than either GeneMapper or Genotyper for class (A), class (B) and (C). However, GeneMapper and Genotyper both outperformed our algorithm for class (D).

    Table 2. Accuracy rates for inference of genotype

    The accuracy rate was also calculated using samples labeled as ‘pass’ by GeneMapper (Table 3). ‘Pass’ means that a high Genotype Quality value is assigned for inference of genotype. For class (B) or (D), the accuracy rate was not examined because there were only a few ‘pass’ samples to make a meaningful comparison.

    Table 3. Accuracy rates for inference of genotype using samples labeled as ‘pass’

    Next, we estimated overall average accuracy rates for our algorithm, GeneMapper and Genotyper as follows. Table 4 shows the 174 markers divided into different class types and repeat unit length. We used a set of 20 431 microsatellite markers, which covered the entire human genome (Tamiya,G., Shinya,M., Ikuta,T., Makino,S., Okamoto,K., Furugaki,K., Matsumoto,T., Mano,S., Ando,S., Nozaki,Y., submitted). The markers could be classified at the following frequencies and into the following types: 73% dinucleotide, 7% trinucleotide, 17% tetranucleotide and 3% pentanucleotide. Based upon their peak patterns, we then divided the microsatellite markers into the pattern classes discussed above: 84% of the markers were predicted as belonging to class (A), 7% to (B), 7% to (C) and 1% to (D). If our class predictions were correct, then based upon the relative performances against Genotyper or GeneMapper, the comparative prediction accuracy of our algorithm across microsatellites in the human genome should be as follows:

    Significantly outperforms Genotyper and GeneMapper for 7% + 7% = 14% of all human microsatellites (class B and C).

    Moderately outperforms Genotyper and GeneMapper for 84% of the markers (class A).

    Moderately outperformed by both Genotyper and GeneMapper for 1% of the markers (class D).

    Table 4. Numbers of markers for each class and repeat unit length

    Using the relative frequencies of the microsatellite peak classes throughout the human genome and the efficiency of each algorithm at predicting the correct true allele peak for any given class member, we calculated the ‘true’ peak prediction rate for any given microsatellite from the human genome for each algorithm. Based upon our weighted average calculations, we estimated that Genotyper could predict the true allele peak for any randomly selected human microsatellite 85% of the time, GeneMapper fared better being able to identify the correct peak in 88% of all human microsatellites. However, our own algorithm outperformed both Genotyper and GeneMapper scoring a prediction accuracy of 94% for randomly selected microsatellite.

    DISCUSSION

    Our algorithm has four features that increase its accuracy for genotyping. First, it considers stutter peaks that are not only shorter, but also longer than the true allele peak. As a result, our algorithm is able to correctly genotype markers in class (B) with a high level of accuracy. However, the Genotyper software generated high number of false positives. For example Genotyper recognized some homozygotes as heterozygotes, assuming that the one repeat longer stutter peak was in fact a second true allele peak. Those errors were caused because the algorithm underpinning Genotyper currently only screens out stutter peaks which are shorter than the true allele peak. Our results showed that the current version of GeneMapper also unable to recognize stutter peaks longer than the true allele peak. Second, our algorithm determines the pattern of +A peaks for each sample. This leads to more accurate allele identification for markers in which the +A peak is higher than its true allele peak in some samples and lower in others such as class (C). Genotyper incorrectly identified true allele peaks in samples with +A peaks higher than the corresponding true allele peak. This may be caused by the fact that the algorithm presumes that +A peaks are always lower than the true allele peak. GeneMapper software was likely to label a sample ‘pass’ when the true allele peaks were higher than their corresponding +A peaks, providing as high accuracy as our algorithm in these samples. Third, the present algorithm does not make the assumption that the spacing of alleles is a multiple of a repeat unit length. This assumption cannot be used in the case of compound and/or interrupted markers, since some alleles appear in the intervals of non-multiple unit lengths. At this point, our algorithm is surely better than that by Perlin et al. (12), in which spacing of unit-length multiples are assumed. Finally, we expect the heights of true allele peaks in a heterozygote may be different. In contrast, a previous algorithm (12) assumes that a heterozygote has two true allele peaks with the same height. However, in practice peak data show that >30% of the heterozygotes have one true allele peak at least twice as high as its counterpart true allele peak, which draws into serious questions the validity of assuming true allele peak heights in heterozygotes are always the same.

    Our algorithm considers non-apical peaks. All three algorithms by Perlin et al. (12), Genotyper and the algorithm presented in this study use only coordinates of peak apexes. Two of them, the algorithms by Perlin et al. (12) and the Genotyper software, sometimes miscall when two neighbouring peaks are combined and appear to be one peak (Figure 2Ba; see the peak labeled with a star). In the case of Figure 2Ba, the true allele peak of the longer fragment size is missed because it overlaps slightly with the +A peak of the true allele peak of the shorter fragment size. However, our algorithm can infer the genotype correctly for such an individual, because it scores genotype based on the assumption that peaks always occur in pairs and that if a peak occurs on its own, then the software used to process the electropherogram data has missed a peak (the peak labeled with a circle in Figure 2Ba). In this scenario there are two possibilities. First, the filtering software has detected the true peak but failed to detect the +A peak—the missing peak is a +A peak (H0). Second, the filtering software has detected the +A peak but failed to detect the +A peak—the missing peak is the true allele peak (H1). Our algorithm automatically tests these two hypotheses. As peak height cannot be determined, we assume the missing peak is lower than the lone observed peak. Next, we locate a nearby stutter peak +A peak pairing. If the +A peak is higher than its corresponding stutter peak, then we assume that H1—the lone observed peak is a +A peak—is correct else we assume H0—the lone observed peak is a true allele peak—is correct. This assumption works due to the relationship between all +A peaks and their partner peaks across any one replicate. This relationship is as follows, the ratios of +A peaks to their corresponding peaks are approximately the same for each pairing across any replicate whether the pairings are between stutter peaks and +A peaks or true peaks and +A peaks (17). This relationship makes it easy for our algorithm to infer the identities of lone peaks.

    There is another advantage of our algorithm in the correctness of estimating stutter patterns. Perlin's algorithm (12), Stoughton's algorithm (13) as well as our own, all select heterozygotes with well-separated alleles and/or confirmed homozygotes to study stutter peak patterns. Our algorithm seems to distinguish homozygotes from heterozygotes with two overlapping alleles more accurately than either Perlin's or Stoughton's. Perlin's algorithm, as the algorithm assumes only the existence of stutter peaks shorter than the true allele peak, tends to skip homozygotes for markers with longer stutter peaks than the true allele peak. Stoughton's algorithm defines width of the cluster of peaks for every allele to select homozygotes as follows. First, heterozygotes with well-separated alleles are selected. However, when there is no such individual, the algorithm assumes that an individual with the smallest width is a homozygote for that allele. It may be a mistake to treat the heterozygote with the smallest allele width as a homozygote. This is especially true when all the individuals with the allele are heterozygous and possess two overlapping alleles.

    As stated in the results section, our algorithm outperforms Genotyper and GeneMapper for almost all markers except for class (D) markers for which Genotyper and GeneMapper predict more accurately. In order to achieve the best accuracy rate for true peak prediction in all markers given the available resources, we propose that the true allele peak for each marker is selected using the most optimal algorithms for that marker. For example, in the case of the microsatellite amplicon our analysis indicates tandem repeats of adenine tend to represent an overall peak pattern of class (D). The importance of sequence information in automated optimal algorithm selection should not be overlooked. Another way in which our genotyping system could be more efficient is through the incorporation of automated allele qualification, as seen in algorithm such as GeneMapper and that described by Palsson et al. (14). In the present algorithm, manual inspection is necessary, as incorrect allele calls are inevitable in some cases due to experimental error. By categorizing the allele calls into classes by quality and separating even some of the correct allele calls at 100% accuracy, samples which really require manual analysis can be determined. Thereby allele qualification may reduce the number of samples requiring manual evaluation, and lead to more efficient genotyping studies.

    In conclusion, based upon the class frequencies among a large number of microsatellites throughout the human genome, our novel algorithm would be most likely to predict correctly the true genotype among all the algorithms currently available for this task. It is hoped our algorithm will be a valuable addition to the arsenal of tools available to researchers for any genetic studies which requires large volumes of microsatellite genotypes to be accurately predicted.

    AVAILABILITY

    Our algorithm is implemented as a software program and will be commercially available from Hitachi Software Engineering Co., Ltd.

    ACKNOWLEDGEMENTS

    This work was supported by grants from the New Energy and Industrial Technology Development Organization (NEDO) of Japan.

    REFERENCES

    Tautz,D. ( (1989) ) Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Res., , 17, , 6463–6471.

    LeDuc,C., Miller,P., Lichter,J. and Parry,P. ( (1995) ) Batched analysis of genotypes. PCR Methods Appl., , 4, , 331–336.

    Dietrich,W., Katz,H., Lincoln,S.E., Shin,H.S., Friedman,J., Dracopoli,N.C. and Lander,E.S. ( (1992) ) A genetic map of the mouse suitable for typing intraspecific crosses. Genetics, , 131, , 423–447.

    Weissenbach,J., Gyapay,G., Dib,C., Vignal,A., Morissette,J., Millasseau,P., Vaysseix,G. and Lathrop,M. ( (1992) ) A second-generation linkage map of the human genome. Nature, , 359, , 794–801.

    Gyapay,G., Morissette,J., Vignal,A., Dib,C., Fizames,C., Millasseau,P., Marc,S., Bernardi,G., Lathrop,M. and Weissenbach,J. ( (1994) ) The 1993–94 Genethon human genetic linkage map. Nature Genet., , 7, , 246–339.

    Dib,C., Faure,S., Fizames,C., Samson,D., Drouot,N., Vignal,A., Millasseau,P., Marc,S., Hazan,J., Seboun,E. et al. ( (1996) ) A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, , 380, , 152–154.

    Knapik,E.W., Goodman,A., Ekker,M., Chevrette,M., Delgado,J., Neuhauss,S., Shimoda,N., Driever,W., Fishman,M.C. and Jacob,H.J. ( (1998) ) A microsatellite genetic linkage map for zebrafish (Danio rerio). Nature Genet., , 18, , 338–343.

    Shimoda,N., Knapik,E.W., Ziniti,J., Sim,C., Yamada,E., Kaplan,S., Jackson,D., de Sauvage,F., Jacob,H. and Fishman,M.C. ( (1999) ) Zebrafish genetic map with 2000 microsatellite markers. Genomics, , 58, , 219–232.

    Hauge,X.Y. and Litt,M. ( (1993) ) A study of the origin of ‘shadow bands’ seen when typing dinucleotide repeat polymorphisms by the PCR. Hum. Mol. Genet., , 2, , 411–415.

    Murray,V., Monchawin,C. and England,P.R. ( (1993) ) The determination of the sequences present in the shadow bands of a dinucleotide repeat PCR. Nucleic Acids Res., , 21, , 2395–2398.

    Clark,J.M. ( (1988) ) Novel non-templated nucleotide addition reactions catalyzed by procaryotic and eucaryotic DNA polymerases. Nucleic Acids Res., , 16, , 9677–9686.

    Perlin,M.W., Burks,M.B., Hoop,R.C. and Hoffman,E.P. ( (1994) ) Toward fully automated genotyping: allele assignment, pedigree construction, phase determination, and recombination detection in Duchenne muscular dystrophy. Am. J. Hum. Genet., , 55, , 777–787.

    Stoughton,R., Bumgarner,R., Frederick,W.J.,III and McIndoe,R.A. ( (1997) ) Data-adaptive algorithms for calling alleles in repeat polymorphisms. Electrophoresis, , 18, , 1–5.

    Palsson,B., Palsson,F., Perlin,M., Gudbjartsson,H., Stefansson,K. and Gulcher,J. ( (1999) ) Using quality measures to facilitate allele calling in high-throughput genotyping. Genome Res., , 9, , 1002–1012.

    Lipkin,E., Mosig,M.O., Darvasi,A., Ezra,E., Shalom,A., Friedmann,A. and Soller,M. ( (1998) ) Quantitative trait locus mapping in dairy cattle by means of selective milk DNA pooling using dinucleotide microsatellite markers: analysis of milk protein percentage. Genetics, , 149, , 1557–1567.

    Hu,G. ( (1993) ) DNA polymerase-catalyzed addition of nontemplated extra nucleotides to the 3' end of a DNA fragment. DNA Cell Biol., , 12, , 763–770.

    Magnuson,V.L., Ally,D.S., Nylund,S.J., Karanjawala,Z.E., Rayman,J.B., Knapp,J.I., Lowe,A.L., Ghosh,S. and Collins,F.S. ( (1996) ) Substrate nucleotide-determined non-templated addition of adenine by Taq DNA polymerase: implications for PCR-based genotyping and cloning. Biotechniques, , 21, , 700–709.

    Smith,J.R., Carpten,J.D., Brownstein,M.J., Ghosh,S., Magnuson,V.L., Gilbert,D.A., Trent,J.M. and Collins,F.S. ( (1995) ) Approach to genotyping errors caused by nontemplated nucleotide addition by Taq DNA polymerase. Genome Res., , 5, , 312–317.(Toshiko Matsumoto1,2,*, Wataru Yukawa1,2)