当前位置: 首页 > 期刊 > 《核酸研究》 > 2006年第22期 > 正文
编号:11367010
Identification of degenerate motifs using position restricted selectio
http://www.100md.com 《核酸研究医学期刊》
     Department of Computer Science, Institute of Molecular and Cellular Biology, National Tsing Hua University Taiwan 300, Republic of China 1 Department of Life Science, Institute of Molecular and Cellular Biology, National Tsing Hua University Taiwan 300, Republic of China 2 Department of Computer and Information Science, Fordham University New York, NY 10023, USA 3 DIMACS Center, Rutgers University 96 Frelinghuysen Road, Piscataway, NJ 08854-8018, USA

    *To whom correspondence should be addressed. Tel: +886 3 5731077; Fax: +886 3 5723694; Email: cytang@cs.nthu.edu.tw

    ABSTRACT

    The identification of regulatory elements recognized by transcription factors and chromatin remodeling factors is essential to studying the regulation of gene expression. When no auxiliary data, such as orthologous sequences or expression profiles, are used, the accuracy of most tools for motif discovery is strongly influenced by the motif degeneracy and the lengths of sequence. Since suitable auxiliary data may not always be available, more work must be conducted to enhance tool performance to identify transcription elements in the metazoan. A non-alignment-based algorithm, MotifSeeker, is proposed to enhance the accuracy of discovering degenerate motifs. MotifSeeker utilizes the property that variable sites of transcription elements are usually position-specific to reduce exposure to noise. Consequently, the efficiency and accuracy of motif identification are improved. Using data fusion, the ranking process integrates two measures of motif significance, resulting in a more robust significance measure. Testing results for the synthetic data reveal that the accuracy of MotifSeeker is less sensitive to the motif degeneracy and the length of input sequences. Furthermore, MotifSeeker has been tested on a well-known benchmark , yielding a correlation coefficient of 0.262, which compares favorably with those of other tools. The high applicability of MotifSeeker to biological data is further demonstrated experimentally on regulons of Saccharomyces cerevisiae and liver-specific genes with experimentally verified regulatory elements.

    INTRODUCTION

    One of the current challenges in biological research is to understand the regulatory mechanisms of gene expression. Transcription initiation, generally controlled by interactions between transcription elements and proteins, is at the top of the hierarchy of gene expression control. The same transcription elements are frequently present in the regulatory regions of co-regulated genes and are conserved among the orthologous genes of closely related species. Identification of the transcription elements in the regulatory regions is critical to deciphering the transcriptional regulation.

    A transcription factor may recognize a highly diversified set of transcription elements, which share only a conserved core sequence. Such conserved core sequences are what we called motif in this study. Degeneracy often tends to occur at specific positions of transcription elements. The significance of position specificity can be demonstrated by the severe effect of a point mutation in the Sp1-binding sequence present in PNH patients. The Sp1 factor recognizes 5'-YCCGYCCS-3' where only the 5'- and 3'-most positions as well as the Y in the central core are tolerant of specific variations. A mutation of hTERC, the human telomerase RNA gene, changes the sequence from 5'-YCCGYCCS-3' to 5'-YCCGYCGS-3' (the letter denoted by underline is the mutation site). The C–G mutation located at the non-variable position disrupts Sp1 binding to the hTERC promoter and results in reduced transcription of hTERC (1).

    A number of motif-finding algorithms have been developed over the past few years. Many recent successful works are based on sequence alignment or are aided by the use of orthologous genes or microarray expression data (2–7). Some effective methods do not require auxiliary data and are not alignment-based. Of these, some ensure or favor the position specificity of output motifs, while others do not. Examples of the latter class include MEME (8), Consensus (9) and Gibbs Sampler (10,11). Other well-known members of the latter class include (l, d)-motif finders which identify significant patterns over {A, T, C, G} of length l with at most d mismatches. The (l, d)-motif-finding algorithms include WINNOWER (12), SP-STAR (12), Multiprofiler (13), Projection (14) and PatternBranching (15). An (l, d)-motif is not position-specific because d or fewer mutations can occur at arbitrary positions in a motif occurrence.

    By employing (a subset of) IUPAC codes, YMF (16,17) ensures the position specificity of output motifs. YMF considers each possible motif over {A, T, C, G, R, Y, S, W, N} and evaluates motifs using the Z-score. Despite its exponential running time, YMF has been recognized as a highly effective tool. Wolfertstetter et al. (18) observed that functional motifs show a preferred pattern of mismatch locations. Position-specific motifs are awarded higher scores of consensus index (Ci) and are thus favored by CoreSearch (18). Unlike YMF, CoreSearch does not fix the positions or contents of mutations in an early stage. Consequently, CoreSearch is more sensitive to noise. Motif lengths and degeneracy, as well as sequence lengths, seriously limit the analysis of CoreSearch. Specifically, the motif must be 7 bases long with at most one mismatch within each of its occurrences to achieve reasonable performance of CoreSearch. Under such conditions, the recommended average input sequence length is 600 bp. Some of these restrictions arise from the high computational cost.

    All of the aforementioned algorithms work well in practice but have limitations. Increases in motif degeneracy, motif length and the lengths of input sequences strongly affect the performance of most tools because of increased random matches and computational costs. Most can tolerate 35% motif degeneracy and handle only a few thousand input nucleotides. However, the transcription elements of multicellular eukaryotes are often highly degenerated, with lengths of 5–25 bp, and are embedded in long regulatory regions. Hence, more sensitive methods are always sought to locate accurately meaningful transcription elements.

    This work presents a new method, MotifSeeker. Both the ability to separate noise from true motif occurrences and computational efficiency must be enhanced to ensure the successful detection of degenerate motifs embedded in long regulatory regions. The success of YMF reveals the possibility that noise may be reduced by restricting both the degenerate positions and the sequence contents in these positions. Since YMF exhaustively enumerates all possible IUPAC code motifs of some particular length, the computational demand is rather high when motifs of longer than 10 bp are analyzed. However, this issue may not be a problem for identifying motifs in the yeast genome, for which YMF was originally designed. MotifSeeker is designed to be applicable to motifs over a wider range of lengths. Each possible set of d variation positions, rather than each possible l-string over IUPAC code, is processed by MotifSeeker to reduce noise without sacrificing efficiency. Since the sequence contents of the degenerate positions are not initially restricted, MotifSeeker includes a further filtration step to take into account the degree of conservation among the collected candidate occurrences. Another approach for reducing noise uses the concept of data fusion (19,20), which is effective in providing a robust ranking method in a variety of application domains. This idea is realized by combining two different measures of motif significance: they are the degree of conservation relative to the background sequences and the copy number of the motif relative to the expected copy number of a random string of the same length. The combined ranking method reduces the variation of the performance of each measure and is shown to be effective. Synthetic datasets of various motif degeneration rates and sequence lengths are used to demonstrate the effectiveness and limits of MotifSeeker. Experiments on a commonly used yeast dataset and the well-known benchmark (21) yield results that show that MotifSeeker outperforms the other tools to which it is compared. Experiments also indicate that MotifSeeker can be successfully applied to identify highly degenerate transcription elements that direct liver-specific gene expression. Theseelements have lengths of at least 12 bp and are embedded in regulatory regions whose average length is 2.5 kb. These results demonstrate that MotifSeeker can be applied without the aid of auxiliary data; hence, it is more generally applicable than the other methods. However, as presented in Results, if suitable auxiliary data are available, including such auxiliary data can certainly enhance the performance.

    METHODS

    MotifSeeker consists of two main phases—motif generation and motif scoring, which integrates two scoring schemes. They are discussed in the following subsections.

    Definition and properties of degenerate (l, d)-motifs

    A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. A degenerate position is a position occupied by a character other than A, G, C or T. A match of a degenerate (l, d)-motif in the input sequences is called an occurrence of this motif. We also require that each possible character over {A, T, C, G} allowed by the symbol at the i-th position of a degenerate (l, d)-motif must appear at position i of some occurrence of the motif. The important property of a degenerate (l, d)-motif is that its occurrences can differ only in the d degenerate positions of the motif. Unlike (l, d)-motifs (12–15), the mismatched positions in each occurrence of a degenerate (l, d)-motif are not independent of those in other occurrences. Consequently, a degenerate (l, d)-motif cannot be derived directly from the occurrences of an (l, d)-motif. Furthermore, (l, d)-motifs have been defined over {A, G, C, T} instead of the IUPAC code. Therefore, algorithms specialized for (l, d)-motifs cannot be expected to perform well for degenerate (l, d)-motifs and vice versa.

    Generating significant degenerate motifs

    The discovery problem is addressed first.

    Degenerate motif discovery problem. Given a set of sequences S = {S1, S2, ... , Sm | Si belongs to {A, G, C, T}* for all i} and three non-negative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S.

    For convenience of presentation, assume that all sequences in S are of length n. An l-substring of a sequence Si is a substring of length l of Si. The number of l-substrings in any sequence Si is then r = n – l + 1. For convenience these l-substrings are denoted by Wi1, ... , Wir. The Hamming distance between two l-substrings is the number of positions that the two substrings disagree. For each Wij, where 1 i m – k + 1, the Hamming distance between Wij and each of the l-substrings of all sequences in S is computed. Meanwhile, the sets of mismatched positions between each of these Wij and all other l-substrings are kept. The Hamming distance between Wij and Wpq is denoted as dH(Wij, Wpq), and the set of mismatched positions between them is V(Wij, Wpq). Then, for each possible set X = {p1, ... , pd} of degenerate positions, all Wpq with V(Wij, Wpq) X are collected. The set of l-substrings collected in this way is denoted by G(Wij | X). Then, whether G(Wij | X) contains l-substrings from at least k different sequences is determined. If so, then G(Wij | X) is kept and used further to derive a degenerate motif. If not, it is discarded. Notably, any degenerate (l, d)-motif is required to have occurrences in at least k sequences. Hence, constructing G(Wij | X) for 1 i m – k + 1 suffices. Since all of the occurrences of any degenerate (l, d)-motif must form a subset of some G(Wij | X), with these G(Wij | X) all possible degenerate motifs can be identified. The sets V(Wij, Wpq) and X can be conveniently stored in computer words so that bit operations can be utilized to achieve good efficiency.

    The time complexity of the above procedure is . Once each G(Wij | X) is computed, it can be moved to secondary storage, and after G(Wij | X) has been computed for all X, dH(Wij, Wpq) and V(Wij, Wpq) can be discarded before the method proceeds to Wi,j+1 (or Wi+1,1). Therefore, the demand for primary memory is linear in the input size, O(mn).

    In each remaining G(Wij | X), noise may still exist. Suppose that G(Wij | X) has 10 motif occurrences, and that TATAWAW is the correct motif for the TATA-box. Clearly, some noise (any occurrence that contains underlined letter) exists among those 10 motif occurrences (Figure 1a).

    Figure 1 (a) Take the TATA-box as an example. The black strings are true transcription elements and the red strings are false motif occurrences. These 10 strings are collected from the initial position-restricted selection. In this group, the weakly conserved letters (denoted by underlines) can be observed in the fifth and the seventh positions. Obviously, motif occurrences that have weakly conserved letters are likely to be noise. (b) The matrix for relative frequency. Background letter probabilities are PA = 0.22, PT = 0.22, PC = 0.28 and PG = 0.28. A negative (p, q)-entry means that the letter p at position q is weakly conserved in G(Wij | X). Occurrences with weakly conserved letters are called pseudo-occurrence in this paper. (c) ‘TATAWAW’ is derived from the remaining occurrences.

    Transcription elements evolve more slowly than non-functional background sequences and may co-exist in regulon sequences. Based on this premise, noise was reduced further by the following process. First, the background letter probabilities in the original set S are computed. Let PA, PT, PC and PG be the background probabilities of nucleotides A, T, C and G, respectively. Then, G(Wij | X) is transformed into a 4 x l matrix L (Figure 1b). The entries Lpq, representing the log-odds weights, are defined by the following formula.

    where p {A, T, C, G}. A positive Lpq means that the probability of letter p at position q in G(Wij | X) exceeds the background probability of the letter.

    Since real motif instances are generally more conserved than arbitrary background sequences, a word x1x2xl in G(Wij | X) is said to be a pseudo-occurrence if for some p and q, xq = p and Lpq < 0. All pseudo-occurrences in G(Wij | X) are removed (Figure 1c), and the remaining elements are considered to be significant. The positive entries in L are taken into consideration in deriving the motif. The motif is generated column-wise from these positive entries. For example, if only LA3 and LG3 are positive, then the symbol in the third position of the resulting motif is ‘R’ (in IUPAC code).

    Motif scoring methods R1 and R2

    Two scoring methods R1 and R2 are proposed to evaluate all reported motifs generated by the previously described degenerate motif discovery method. On the basis that the regulatory motifs are more conserved than surrounding background sequences, the first scoring function s1 for the method R1 is defined as

    where the summation is over i {A, T, C, G} and j satisfying 1 j l such that Lij > 0, and pj is the number of positive entries in column j. The more the letter frequencies exceed the background probabilities, the higher the score is. This fact is used to measure the conservation and the significance of each reported motif.

    The second scoring function s2 for method R2 is formulated based on the observation that transcription elements often appear in statistically significant concentrations, suggesting that the transcription elements for a particular transcription factor may appear in multiple locations of a promoter region. Thus, the copy number of each reported motif is another important indicator of the significance of the motif. The second ranking method is as follows.

    The frequency of all possible consecutive two-symbol combinations in the input sequences is computed and used to compute background transition probabilities. For each predicted motif M = x1x2xl, background transition probabilities are applied to compute the probability that M occurs randomly in the background following a Markov process:

    where the summation is over all possible instantiations y1yl of M (recall that M can contain degenerate symbols), and P(yi | yi–1) is the probability that yi is present in the background given that the preceding nucleic acid is yi–1. The value E can be computed using dynamic programming in O(l) time for each M.

    The scoring function s2 for method R2 is defined as follows:

    where N is the observed copy number, including overlapping occurrences, of the reported motif x1x2xl. A higher value implies a greater concentration of motif occurrences. Thus, motifs of higher value are considered more important.

    A Markov process is frequently used to distinguish regulatory sequences from other neutral sequences (22). A general problem with the Markov model is that it cannot appropriately reflect local sequence composition, which is important for short motifs. This shortcoming of the Markov model is inevitable, but experiments herein demonstrated that it does not strongly affect the performance of MotifSeeker.

    These two scoring schemes R1 and R2 are adopted to assign scores to each of the reported motifs. These measures of motif significance, R1 and R2, are combined by the method of data fusion, which is presented in the next subsection.

    MotifSeeker uses methods of data fusion and hybrid ranking

    The features of transcription elements are often fuzzy, making the elements hard to predict by computation. The two scoring methods R1 and R2 capture different properties of motifs. Intuitively, the use of two properties helps to evaluate the significance of motifs more thoroughly. A combination of different measures is also expected to be more robust than a single ranking method. Indeed, in the study of data fusion, a general observation is that one can often benefit from combining different methods when they exhibit ‘diversity’ . The hybrid ranking method of MotifSeeker involves the following procedure:

    For each degenerate motif derived from the corresponding purified candidate set, two scoring functions s1 and s2 are obtained using the scoring methods R1 and R2, respectively. Sorting each of these scoring functions leads to the rank functions r1 and r2, respectively, where for a reported motif M, ri(M) is the rank of M with respect to si.

    Combine r1 and r2. The score function s12 for the resulting combination is the equally weighted combination of r1 and r2: s12(M) = r1(M) + r2(M).

    For each motif M, s12(M) is taken as the new score for the combined method R12. Sorting s12(M) into ascending order (the less the sum of the ranks, the better in the combined list) leads to a rank function r12 of the combined method.

    RESULTS

    This section compares the performance of MotifSeeker with that of other methods, such as YMF, MEME, Projection, Consensus and Gibbs Sampler, using synthetic data and biological data for 39 regulons of Saccharomyces cerevisiae. A performance coefficient (12) is used to evaluate the performance of different methods. The accuracy of MotifSeeker is assessed by a well-known benchmark (21) and a more difficult but useful set of regulatory sequences of liver-specific genes.

    Evaluation of performance on synthetic data

    MotifSeeker is compared with several effective methods, including Consensus, Projection, Gibbs Sampler and MEME,to evaluate the relative degrees of influence of motif degeneracy and input sequence length. YMF is excluded from this first experiment since it takes too much computational time when the motif is longer than 10 bp.

    Various test samples, each of which consists of 20 random sequences of 600 bp, are generated. The sequence identity in the samples is between 50 and 70% with an average identity of 65%. In each sample, degenerate (l, d)-motifs are generated and embedded (with random variations allowed by the degenerate form) into random positions in the sequences. To focus on the influence of motif degeneracy, each sequence in the sample contains at least one motif occurrence. Various motif lengths and degeneracy (d/l) are used in the test samples. The generated motifs range from 6 to 20 bp long, each with 10–50% degeneracy. Herein, for all tools, all of the parameters that are related to l, d and k are set to the exact values used for generating motifs. In practice, the parameters are unknown, as in the cases of biological and the benchmark datasets used in the following subsections. For unknown parameters, MotifSeeker simply iterates over possible ranges of parameter settings, and the ranking methods are applied to pick up the significant motifs. The measure used for comparison is the performance coefficient |K P|/|K P| (12), where K is the set of positions of the known motif occurrences in the input sequences, and P is the set of predicted positions. The best performance coefficients among the top 10 motifs found by these tools are compared.

    Most motif finders perform well when the motif degeneracy ranges from 10 to 35%. However, motifs become too subtle to be identified by most methods as the number of degenerate positions increases. Figure 2 shows that the performance coefficients of most tools decline with the growing degrees of motif degeneracy. The performance of MEME and MotifSeeker tends to remain stable over this range of degeneracy. However, on average, the performance coefficient of MEME seems to be lower than that of the other tools tested in this experiment.

    Figure 2 Comparison of performance coefficients. The lengths of embedded motifs range from 6 to 20 and the degeneracy is drawn from 10 to 50%. The average sequence identity is 0.65. The point in the figure for each degree of degeneracy represents the average performance coefficient for the various motif lengths tested for the degree of degeneracy. Among the five tools, only MotifSeeker and MEME have consistent performance when motif degeneracy is beyond 35%.

    Figure 3 displays the test results for specificity |K P|/|P| (Figure 3a) and sensitivity |K P|/|K| (Figure 3b), as defined by Pevner and Sze (12). The sensitivity and specificity curves are similar to those in Figure 2. Among all tools, the sensitivity and specificity of MEME tend to decrease more slowly and those of MotifSeeker appear to remain steady. The average specificity of MotifSeeker is 1.0, and the average sensitivity is also very close to 1.0.

    Figure 3 (a) Comparisons of specificities. (b) Comparisons of sensitivities. The average specificity of MotifSeeker is 1.0 and the average sensitivity is also close to 1.0. Sensitivities and specificities of the other tools except MEME show a clear trend of degradation over this range of degeneracy.

    The degeneracy tolerance of MotifSeeker is further tested on samples with motif degeneracy of >50%. All the other test conditions are the same as the aforementioned ones. As presented in Figure 4, the performance of MotifSeeker gradually decreases as the degree of degeneracy exceeds 50%. The performance coefficient falls to 0.96 in the group with degeneracy between 51 and 55%. At degeneracy of >70%, the performance declines rapidly from 0.825 to 0.56. The average performance coefficient is 0.42 when the degree of degeneracy exceeds 0.75. Specificity and sensitivity also decrease sharply when the degeneracy exceeds 70%. However, transcription elements seldom display such a high degeneracy in biological systems, such as degenerate (8, 7)-, (9, 7)- and (12, 10)-motifs. Therefore, the poor performance of MotifSeeker on highly degenerate motifs affects only the identification of a restrictedly few transcription elements.

    Figure 4 Performance coefficient, specificity and sensitivity of MotifSeeker are evaluated in highly degenerate cases to examine the limitations of MotifSeeker.

    A test paradigm that consists of 20 input sequences with lengths of between 500 and 10 000 bp and an average sequence identity of 60% is established to test the influence of the sequence length on the five programs. The degeneration rates of the embedded motifs are 25–30%, and the lengths of the motifs are between 10 and 15 bp. Since the number of spurious motifs increases, the performance coefficients of Consensus, Gibbs Sampler and Projection are <0.5 as the length of each input sequence is 10 000 bp. These results are consistent with those of Wang and Stormo (25). For MEME, the performance coefficient averaged over all tests where the length of each input sequence is no longer than 3000 bp is 0.487; MEME prohibits inputs with a total length of >60 000 bp. On the contrary, MotifSeeker has a good performance coefficient, 0.93, even when the length of each sequence is 10 000 bp. Accordingly, the accuracy of MotifSeeker is not susceptible to the sequence length. The degree of motif degeneracy more importantly affects the performance of tools. As presented in Figures 2 and 4, even when the length of each input sequence is only 600 bp, the performance still decreases as the degrees of motif degeneracy increases.

    As expected, position restriction combined with further filtration by the pseudo-occurrence elimination step render MotifSeeker less susceptible to noise. Occasionally, the candidate set has a certain number of embedded occurrences containing the character i in position j with Lij < 0, especially when the distribution of the contents of the randomly generated occurrences is strongly uneven (Figure 5). Such an insignificant occurrence is regarded as a pseudo-occurrence and is wrongly discarded. Consequently, the specificity of MotifSeeker tends to be better than its sensitivity (Figure 4).When the degeneracy exceeds 70%, even the best possible occurrence set G(W | X), where W is an embedded occurrence and X is the correct set of degenerate positions, would contain many random matches. Some embedded occurrences become insignificant relative to the background and may be wrongly eliminated, whereas random occurrences may be falsely kept. Hence, the number of false positives and false negatives considerably increase with the degree of degeneracy >70%. Similar phenomena are observed when the number of sequences with motif occurrences is less than one-third of the number of input sequences.

    Figure 5 An example for a strongly uneven content distribution. Three types of occurrences (CCTAT, CATAT and CGTAT) are allowed by the degenerate form CVTAT (IUPAC code symbol V represents nucleic acid A, G or C). The number of occurrences for each type is randomly determined in the experiment on synthetic data. Since only 2 out of 20 occurrences are CGTAT, the distribution of the occurrences is strongly uneven.

    The purpose of this experiment is to demonstrate potential limitations of MotifSeeker. However, the synthetic data are generated according to our motif model and inevitably favor MotifSeeker over other tools. To evaluate its true applicability we proceed to assess MotifSeeker on biological datasets.

    Evaluation of performance on yeast promoters

    MotifSeeker is applied to known regulons, which are sets of genes that are regulated by the same transcription factors. The material is obtained from the promoter database of S.cerevisiae (SCPD, http://cgsigma.cshl.org/jian/index.html). SCPD (26) provides information on regulons of S.cerevisiae and lists the experimentally verified transcription elements. The degenerate motifs of the transcription elements in most of the regulons are also supplied. MotifSeeker, YMF, Consensus, Projection, Gibbs Sampler and MEME are tested on each of the regulons (see Tables 1 and 2) that has at least three genes in SCPD. Table 1 shows the results for the 25 regulons with a consensus reported in SCPD, whereas Table 2 shows those without. The parameters cannot be easily set because no prior knowledge of the transcription elements is available. Nevertheless, a majority of yeast regulatory motifs are between 6 and 12 bp long. We choose as the range of lengths to be investigated for each dataset. As by definition, a degenerate (l, r l)-motif is also a degenerate (l, 0.6l)-motif for r < 0.6, the default value of parameter d is set to 0.6l, instead of running over a range. A factor not considered in the previous synthetic model is that some of the regulatory motifs occur only in k out of m sequences in the input sample. A reasonable assumption is that most of the regulon sequences share the same consensus of transcription elements. The default value of k, therefore, is set to m/2 (where m is the number of sequences in the sample). If YMF (16,17), Consensus (9), Gibbs Sampler (10,11), Projection (14) and MEME (8) have any parameters that are related to l, d or k, then these parameters are also set according to the aforementioned ranges used in MotifSeeker. Other parameters in these five tools are set to default values. For each program, among the 10 highest ranked predictions, the motif with the highest non-zero performance coefficient is selected as its predicted motif. Such predictions are referred to as the ‘best close matches’ in this study.

    Table 1 Comparison of MotifSeeker with other systems on S.cerevisiae datasets with consensus given in SCPDa

    Table 2 The performance coefficients of the predicted motifs for the 25 regulons with consensus given in SCPD

    Tables 1–3 present the best close matches to each published motif. Missing entries represent that no motif with non-zero performance coefficient can be found among the top 10 predictions. Of all the tested regulons, 24 out of 25 regulons with consensus reported in SCPD are identified within the top 10 by MotifSeeker (Table 1). As to UASHPR, the consensus listed at SCPD is CTTCCT, but an alignment of all documented sites suggests another consensus of SGWGGH. This latter consensus is identified by MotifSeeker at top 5, with performance coefficient 0.31 (Table 2). A similar consensus, GTGGNN, is also discovered by Consensus at top 2 (Table 1).

    Table 3 Comparison of MotifSeeker with other systems on S.cerevisiae datasets without consensus given in SCPDa

    The results obtained from the analysis of GCN4 are selected as an example to demonstrate the strength of data fusion. The ranks of GCN4 are 12 and 7 when scoring functions R1 and R2 are used, respectively. The rank of GCN4 is promoted to 4 after data fusion is performed. A similar improvement is also observed in the other regulons. Since each of the hypotheses used for our ranking functions cannot be expected to be effective in all cases, the method of data fusion provides the advantage of combining the merits of different scoring functions and balancing the individual demerits.

    With respect to the other tools tested on the regulons with consensus given in SCPD, Gibbs Sampler fails to report 14 out of 25 regulons within top 10. Although Projection can identify motifs with non-zero performance coefficients within the top 10 for most of the regulons, it takes longer execution time in most cases. In particular, the estimated running time of MIG1 and MCM1 exceeds 60 days and its execution cannot be completed (Table 1). Compared with the other five tools, YMF has better performance on most motifs with middle spacers, such as Gal4, ABF1 and Hap1, as a result of its motif model.

    For the regulons without consensus given in SCPD, only 1 out of 14 regulons does MotifSeeker not predict any motif with non-zero performance coefficient within top 10 (Table 3). For this and the previous sets of regulons, the average performance coefficients of MotifSeeker also compare favorably with the other tools (Tables 2 and 3). In summary, this experiment shows that MotifSeeker appears to be a reliable tool for discovering transcription elements in yeast promoters.

    Evaluation of performance on tissue-specific regulatory elements

    The identification of regulatory elements within the human genome is yet another challenge. Since published experimental data of liver-specific genes are abundant, a predictive procedure is presented here to identify transcription elements associated with liver-specific transcription. The data used were collected by Kriven and Wasserman (27). This dataset includes four liver-specific factors: HNF-1, HNF-3, HNF-4 and C/EBP. Each regulon consists of at least five genes. Longer promoter sequences are retrieved from GenBank based on the gene names listed in Kriven's data. The average length of the analyzed promoter sequences is 2.5 kb.

    MotifSeeker is run on each set of the four regulons, which contain only human liver-specific genes. Since human regulatory elements may be longer and more degenerate than those of yeast, is set as the range of the motif length l, and parameter d is set to 0.7l. Other parameter settings are the same as those used in the previous experiment on the yeast dataset. Despite a lack of comparative analyses across other species, matches to the published motifs of HNF-1, HNF-3 and HNF-4 can be identified within top 10 predictions (Table 4 and Figure 6). MotifSeeker has better performance on HNF-1, HNF-3 and HNF-4 than C/EBP because most of the corresponding transcription elements for these three transcription factors have conserved sub-patterns DGTTAWTNWWYDNH, MNTRTTKRYHY and NHCTTTGBHMND, respectively. As can be seen in Figure 6, the predicted motifs appear to contain these sub-patterns (Figure 6). For C/EBP, the best ranked prediction with non-zero performance coefficient is out of top 10, and only half of the binding sites are located by this prediction. However, under the range of parameter settings, the other five tools fail to identify any published motif; they also often give an empty output from all four co-regulated liver-specific gene sets. The degenerate rate of the motifs recognized by the four factors exceeds 50%, resulting in the failure of the five tools. Additionally, long and highly degenerate motifs detrimentally affect the running time of YMF and Projection. These results are consistent with those observed on synthetic models. Although MotifSeeker already performs well without auxiliary data, utilizing suitable auxiliary data further improves its performance. Two optional simple post-processes are proposed below.

    Table 4 Performance coefficients and ranks of the best close matches to the published motifs in three stagesa

    Figure 6 The best close matches to the published HNF-1, HNF-3, HNF-4 and C/EBP motifs. The left column lists logos of the published motifs from JASPAR (http://jaspar.cgb.ki.se). The published HNF-1, HNF-3 and HNF-4 motifs contain conserved sub-patterns DGTTAWD, TRTTKRY and HCTTTGBHM, respectively (IUPAC code D: A, T or G; W: A or T; R: A or G; Y: T or C; H: A, T or C; B: C, T or G; M: A or C). On the other hand, the published C/EBP motif is very weakly conserved. The right column lists logos of the corresponding best close matches from MotifSeeker without the proposed post-processes.

    Integrating knowledge from co-regulated genes in a single species and sequence conservation among orthologous genes of different organisms has been shown to help in identifying weak motifs (25). Therefore, the first post-process is to combine the original results with phylogenetic footprints. The sequences of orthologous genes are included in the input set. The differential conservation facilitates the identification of evolutionary conserved functional elements and the filtration of false occurrences. This evolution-based filter improves the rank and the performance coefficient of the best close match to each published motif (Table 4). Most of the false occurrences of these motifs are also removed. Nevertheless, C/EBP remains the most difficult case because transcription elements for C/EBP, unlike those for the other three transcription factors, have no obvious conserved sub-patterns.

    To further improve the performance on C/EBP, promoters of liver-specific genes are used as positive sequences and promoters of genes not expressed in liver are regarded as negative sequences. A motif found both in positive and negative sequences cannot generally be specific to positive set. Therefore, MotifSeeker is run on another well-known collection, which is a set of high-quality annotations of transcription elements for muscle-specific genes (28). Most of the sequences in this collection are 300 bases long, and may be too short to be an effective negative set. Hence, longer promoter sequences are retrieved from GenBank based on the gene names listed in the dataset. Motifs found in the muscle-specific genes are used as markers. The original outputs for liver-specific datasets are refined by eliminating each motif M1 whose edit distance to some marker M2 is at most 0.2 x max{|M1|, |M2|}. (The scoring function is detailed in the Supplementary Data.) The rank of the best close match to the published C/EBP motif is advanced from 10 to 6 following the refinement (Table 4).

    In summary, these results indicate that MotifSeeker performs better on S.cerevisiae and liver-specific datasets than do the other tools tested herein. With the aid of suitable data, the output of MotifSeeker can be further improved by the two optional post-processes proposed above.

    Evaluation of performance on a well-known benchmark

    Finally, MotifSeeker is tested on a public well-known benchmark (21). All the datasets in the benchmark are used. This benchmark has three types of background sequences. One-third of all of the datasets used are ‘real’; one-third are ‘generic’ and the rest are ‘Markov’. The ranges of parameter settings over which the iterations are performed are the same as those for the experiments on liver-specific genes.

    For each tested dataset, the output of MotifSeeker is refined by taking all other sequences of the same species in the benchmark as a negative set to remove the spurious motifs. After this refinement, motif scores are adjusted by favoring a higher number of similar motifs found in all top 20 ranking lists in the iterations. (Please see the Supplementary Data for a detailed description of the post-processing and the results.) The highest-scoring motif is then selected. The same process is adopted for all datasets. The average correlation coefficient (nCC) of MotifSeeker is 0.262, which compares favorably with the winner, Weeder (29), of the assessment on all 52 datasets. With respect to the human portion, the highest nCC that has been obtained to date is 0.149 (30). Although the nCC of MotifSeeker for humans, 0.212, is lower than those of all other species considered, MotifSeeker improves nCC by 42%.

    DISCUSSION

    Applications of synthetic and biological data have indicated that MotifSeeker appears to have promising applications in identifying degenerate transcription elements with specific variable sites directly from sequences of a single species. The accuracy of MotifSeeker is less sensitive to the length of input sequences and the degree of motif degeneracy. Without auxiliary data, the performance of MotifSeeker is already satisfactory, enabling more general applications. If suitable auxiliary data are available, two optional post-processes can be incorporated. One direction combines the original input with orthologous sequences. The other refines the predicted results by the output for gene sets with expression patterns that differ from those of the input genes. As shown in investigation of liver-specific gene sets, both refinement steps can further enhance performance.

    In the evaluation on the benchmark, of all the datasets used, the set of type ‘real’ is the most difficult for MotifSeeker. This difficulty was also encountered by 13 tools that were evaluated by Tompa et al. (21). These evaluations should not be taken as an indictment of the motif discovery tools, which point was also made by Tompa et al. (21). Despite this difficulty, MotifSeeker still compares favorably with other motif finders.

    The reasons for the good performance of MotifSeeker are as follows. MotifSeeker is designed based on the position specificity of transcription elements, since transcription elements often prefer certain variation patterns. Restricting variation positions makes the candidate occurrence less likely to be obscured by random matches. Pseudo-occurrences in candidate sets are further eliminated by considering the weakly conserved letters. In this manner, both the positions and the contents of the variations are explicitly considered, and noise is thus reduced. Furthermore, two scoring measures are combined by data fusion to improve the performance on ranking motifs. Since the performance of a single measure often varies from case to case, the hybrid ranking method provides a more general scheme with consistently high performance.

    To ensure position specificity, the distance between each pair of occurrences of a motif must be within d. A clique finding procedure is usually necessary to meet this requirement while keeping the search in the input sequences. However, clique finding takes exponential time with respect to the length of input sequences. If, instead, all possible patterns over IUPAC code or merely {A, T, C, G} are to be enumerated, then the computation time would at least be proportional to 4l. MotifSeeker represents a compromise between these two extremes. All possible sets of degenerate positions are considered, but . That is, is much smaller than 4l, which is the number of all nucleic acid patterns of length l, by an exponential factor. For each set of degenerate positions, MotifSeeker searches only in the input sequences and can find cliques in a manner that is as simple and efficient as finding stars. In experiments on synthetic samples, with an of <103, in most cases, the time required is only several seconds to minutes. When is 104 as in (16, 8) or (20, 15) cases or the total length of the input sequences is 105 bp, the running time is 1 h or above.

    MotifSeeker is designed for single motifs. However, extending MotifSeeker to identify composite motifs is not difficult. One reasonable step is to incorporate information on the distance between adjacent sites and the interactive relationships among transcription factors. A current trend for finding motifs involves genome-wide sequence analysis (31,32). Since large-scale situations have much more noise than others, generally, no single ranking method can satisfactorily reflect the significance of motifs. Data fusion is expected to provide a robust ranking method in genome-wide applications. Extensions in this direction include (i) considering the different properties of motifs (such as copy number and motif conservation considered herein, and the position of the transcription initiation site, etc.) and different measures for each property (such as various measures of motif conservation, including information content and s1); (ii) given these multiple scoring functions, finding the best subset to combine so as to optimize performance in reasonable running time; and (iii) finding the best method of combination (such as by rank or by score.) All such studies should help to find regulatory motifs more accurately and will allow us to have a better understanding of the mechanism of gene regulation.

    PROGRAM AVAILABILITY

    MotifSeeker is written in C and Perl. It is available at http://ariel.2y.idv.tw/motifseeker.html. Each output motif can be further explored on PatchTM, a web-based tool integrated in TRANSFAC.

    SUPPLEMENTARY DATA

    Supplementary Data are available at NAR Online.

    ACKNOWLEDGEMENTS

    The authors thank two anonymous referees for their critical comments that greatly enhanced this paper. The authors also thank Chun-Tien Chang for useful discussions and system assistance. D.F.H. wishes to thank Fordham University, DIMACS and National Tsing Hua University for a faculty fellowship and visiting professorship. Funding to pay the Open Access publication charges for this article was provided by the National Science Council of the Republic of China (NSC 94-2213-E-007-090 and NSC 94-2627-B-007-002).

    REFERENCES

    Keith, W.N., Vulliamy, T., Zhao, J., Ar, C., Erzik, C., Bilsland, A., Ulku, B., Marrone, A., Mason, P.J., Bessler, M., et al. (2004) A mutation in a functional Sp1 binding site of the telomerase RNA gene (hTERC) promoter in a patient with Paroxysmal Nocturnal Haemoglobinuria BMC Blood Disord, . 4, 3 .

    Elemento, O. and Tavazoie, S. (2005) Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach Genome Biol, . 6, R18 .

    Berezikov, E., Guryev, V., Cuppen, E. (2005) CONREAL web server: identification and visualization of conserved transcription factor binding sites Nucleic Acids Res, . 33, W447–W450 .

    Prakash, A. and Tompa, M. (2005) Discovery of regulatory elements in vertebrates through comparative genomics Nat. Biotechnol, . 23, 1249–1256 .

    Ho Sui, S.J., Mortimer, J.R., Arenillas, D.J., Brumm, J., Walsh, C.J., Kennedy, B.P., Wasserman, W.W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes Nucleic Acids Res, . 33, 3154–3164 .

    Wang, T. and Stormo, G.D. (2005) Identifying the conserved network of cis-regulatory sites of a eukaryotic genome Proc. Natl Acad. Sci. USA, 102, 17400–17405 .

    Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., Kellis, M. (2005) Systematic discovery of regulatory motifs in human promoters and 3'-UTRs by comparison of several mammals Nature, 434, 338–345 .

    Timothy, L.B. and Charles, E. (1995) The value of prior knowledge in discovering motifs with MEME Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, 16–19 JulyIn Rawlings, C., Clark, D., Altman, R., Hunter, L., Lengauer, T., Wodak, S. (Eds.). Cambridge, United Kingdom, Melno Park, CA AAAI Press pp. 21–29 .

    Hertz, G.Z. and Stormo, G.D. (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences Bioinformatics, 15, 563–577 .

    Lawrence, C.E., Altschul, S.F., Bogouski, M.S., Liu, J.S., Neuwald, A.F., Wooten, J.C. (1993) Detecting sublte sequence signals: a Gibbs sampling strategy for multiple alignment Science, 262, 208–214 .

    Thompson, W., Rouchka, E.C., Lawrence, C.E. (2003) Gibbs recursive sampler: finding transcription factor binding sites Nucleic Acids Res, . 31, 3580–3585 .

    Pevzner, P.A. and Sze, S.H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 20–23 AugustIn Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., Mitchell, J., Scheeff, E., Smith, C., Strarde, S., Weissig, H. (Eds.). La Jolla, Melno Park, CA AAAI Press pp. 269–278 .

    Keich, U. and Pevzner, P.A. (2002) Finding motifs in the twilight zone Bioinformatics, 18, 1374–1381 .

    Buhler, J. and Tompa, M. (2002) Finding motifs using random projections J. Comput. Biol, . 9, 225–242 .

    Price, A., Ramabhadran, S., Pevzner, P.A. (2003) Finding subtle motifs by branching from sample strings Bioinformatics, 19, Suppl. 2, ii149–ii155 .

    Sinha, S. and Tompa, M. (2000) A statistical method for finding transcription factor binding sites Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, 20–23 AugustIn Bourne, P., Gribskov, M., Altman, R., Jensen, N., Hope, D., Lengauer, T., Mitchell, J., Scheeff, E., Smith, C., Strarde, S., Weissig, H. (Eds.). La Jolla, CA, Melno Park, CA AAAI Press pp. 344–354 .

    Sinha, S. and Tompa, M. (2002) Discovery of novel transcription factor binding sites by statistical overrepresentation Nucleic Acids Res, . 20, 5549–5560 .

    Wolfertstetter, F., Frech, K., Gunter, H., Werner, T. (1996) Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm Comput. Appl. BioSci, . 12, 71–80 .

    Ng, K.B. and Kantor, P. (2000) Predicing the effectiveness of Native Data Fusion on the basis of system characteristics JASIS, 51, 1177–1189 .

    Hsu, D.F. and Taksa, I. (2005) Comparing rank and score combination methods for data fusion in information retrieval Inform. Retrieval, 8, 449–480 .

    Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites Nat. Biotechnol, . 23, 137–144 .

    Elnitski, L., Hardison, R.C., Li, J., Yang, S., Kolbe, D., Eswara, P., O'Connor, M.J., Schwartz, S., Miller, W., Chiaromonte, F. (2003) Distinguishing regulatory DNA from neutral sites Genome Res, . 13, 64–72 .

    Hsu, D.F., Chung, Y.-S., Kristal, B.S. (2006) Combinatorial fusion analysis: methods and practices of combining multiple scoring systems In Hsu, H. (Ed.). Advanced Data Mining Technologies in Bioinformatics, Hershey, PA Idea Group Publishing pp. 32–62 .

    Yang, J.M., Chen, Y.-F., Shen, T.-W., Kristal, B.S., Hsu, D.F. (2005) Consensus scoring criteria for improving enrichment in virtual screening J. Chem. Inf. Model, . 45, 1134–1146 .

    Wang, T. and Stormo, G.D. (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs Bioinformatics, 18, 2369–2380 .

    Zhu, J. and Zhang, M.Q. (1999) SCPD: a promoter database of the yeast Saccharomyces cerevisiae Bioinformatics, 15, 607–611 .

    Krivan, W. and Wasserman, W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription Genome Res, . 11, 1159–1566 .

    Wasserman, W.W. and Fickett, J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression J. Mol. Biol, . 278, 167–181 .

    Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G. (2004) Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes Nucleic Acids Res, . 32, W199–W203 .

    Down, T.A. and Hubbard, T.J.P. (2005) NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence Nucleic Acids Res, . 33, 1445–1453 .

    Wang, G. and Zhang, W. (2005) An iterative learning algorithm for deciphering stegoscripts: a grammatical approach for motif discovery. Technical Report No. 12, Department of Computer Science and Engineering St Louis, MO, USA Washington University .

    Wang, G., Yu, T., Zhang, W. (2005) WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar Nucleic Acids Res, . 33, W412–W416 .(Chien-Hua Peng, Jeh-Ting Hsu, Yun-Sheng )