当前位置: 首页 > 期刊 > 《核酸研究》 > 2005年第3期 > 正文
编号:11368955
Design of nucleic acid sequences for DNA computing based on a thermody
http://www.100md.com 《核酸研究医学期刊》
     1 Graduate School of Engineering, Hokkaido University North 13, West 8, Kita-ku, Sapporo 060-8628, Japan 2 CREST, Japan Science and Technology Corporation 4-1-8, Honmachi, Kawaguchi, Saitama, 332-0012, Japan

    *To whom correspondence should be addressed. Tel: +81 11 716 2111, ext 6498; Fax: +81 11 706 7834; Email: fumiaki@dna-comp.org

    ABSTRACT

    We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between the sequence and its complement and that do not hybridize non-specifically with each other based on the minimum free energy (G min). Sequences that satisfy these constraints can be utilized in computations, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate G min. This effectively excludes inappropriate sequences before G min is calculated, thereby reducing computation time drastically when compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (G exp) of 126 sequences correlated well with G min (|R| = 0.90) than with the hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java.

    INTRODUCTION

    Nucleic acids are now being utilized in computations (1–3), various engineering applications such as microarrays (4–6), and nano-fabrications (7–9). In these fields of research, called ‘DNA computing’, nucleic acid design or sequence design is a crucial problem in engineering using nucleic acids. Nucleic acid design is deciding the base sequences (e.g. ‘GCTAGCTAGGTTTA’,...,‘ATCGTACGCTATGTGCA’ in DNA) in order to satisfy the constraints based on the physicochemical properties of nucleic acids. In particular, it is essential to prevent undesired hybridization. In DNA computing, multiple sequences need to be designed that do not hybridize non-specifically with each other (10,11), while in RNA secondary structure design, a single sequence needs to be designed that folds into the desired secondary structure (12–15). For example, 40 DNA sequences of length 15 were designed to prevent undesired hybridization and used for solving a 20-variable instance of a three-satisfiability problem (2). In the field of microarrays, 69 122 DNA sequences of length 45–47 were designed to hybridize specifically to 24 502 transcripts from Arabidopsis thaliana (4). Furthermore, four DNA sequences of length 26 and four DNA sequences of length 48 were used to construct a periodic two-dimensional crystalline lattice (7). These sequences were carefully designed for the intended hybridization. Thus, an algorithm/program that generates multiple sequences, which do not hybridize non-specifically with each other, is useful for various applications of nucleic acid.

    We applied a thermodynamic approach to the nucleic acid design for DNA computing. In particular, we focused on designing a pool P containing n sequences of length l for which (i) the duplex melting temperature (T M) is in the range to for any pairwise duplex of a sequence in P and its complement and (ii) the minimum free energy (G min) is greater than a threshold () in any pairwise duplex of sequences in P and any concatenation of two sequences in P plus their complements except for the pairwise duplex of a sequence in P and its complement. Traditional approaches to the sequence design in DNA computing have approximated the stability between two sequences using the hamming distance (i.e. the number of base pairs) rather than G min. However, since the hamming distance is only an approximation of the stability, G min is preferable for predicting the stability. In practice, an RNA secondary structure can be adequately predicted using G min rather than the number of base pairs (16). However, the algorithm for calculating the G min for double-stranded DNA requires time complexity O(l 3), where l is the length of the sequence, as is the case with secondary structure prediction for single-stranded RNA. Since all the combinations of three sequences must be evaluated in this design, dozens of sequences cannot be designed within a reasonable computation time. Another approach that reduces the computation time is thus needed. Andronescu et al. (13) overcame this drawback in the secondary structure design by hierarchically decomposing the secondary structure into smaller substructures and integrating the partial sequences. By using this method, they designed sequences that a traditional program cannot.

    We have developed a random generate-and-test algorithm that generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. It stores the sequence in a sequence pool if and only if it satisfies all the constraints. To reduce the difficulty in computation time, we use a greedy search to calculate G min. The advantage of a greedy search is that it approximates G min in less time, with time complexity O(l 2), than a rigorous algorithm, which calculates G min with time complexity O(l 3). The G min approximated using the greedy search (denoted by G gre) correlated well with G min. The correlation coefficients were 0.95, 0.85 and 0.76 at 20mer, 40mer and 60mer lengths, respectively. Furthermore, since G gre is the upper bound for G min (i.e. G minG gre), the sequence such that is sure to satisfy . Therefore, using a greedy search excludes in advance most inappropriate sequences before G min is calculated without excluding an appropriate sequence. With this approach, our algorithm reduces computation time.

    To evaluate our algorithm, we investigated the effectiveness of the greedy search filtering. We compared the computation time of our algorithm with that of the same algorithm but without the greedy search filtering. The experimental results showed that the greedy search filtering reduces the total computation time drastically. For example, using the filtering reduced the computation time to 83% for 30 sequences with length 20 and to 87% for 20 sequences with length 15 when , and , and , and , respectively. In addition, we compared the greedy search with the traditional approach based on a hamming distance in terms of the filtering performance. We demonstrated that the greedy search can filter out inappropriate sequences better than using the hamming distance. In a laboratory experiment, we investigated the correlation coefficient between G min and the experimental free energy (G exp) and that between the hamming distance and the G exp using 126 duplexes. The G exp correlated better with G min (|R| = 0.90) than with the hamming distance (|R| = 0.80).

    To implement our algorithm, we developed a computer program called DNA-SDT, a graphic user interface (GUI)-based application written in Java. This program enables users to design DNA sequences with our algorithm and can be downloaded freely from the web site (http://ses3.complex.eng.hokudai.ac.jp/~fumi95/DNA-SDT/index.html).

    ALGORITHM

    Definition

    Let n be the number of sequences to be designed and li (0 i n – 1) be the length of each sequence. In this paper, we formulated li such that li = lj = l (0 i, j n – 1), although we can extend our algorithm easily for li lj (0 i j n – 1). We define P as the pool of n sequences, with length l, to be designed. Furthermore, let P = {U 0, U 1,..., U n–1} and Q = {V 0, V 1,..., V n–1} such that Vi (0 i n – 1) is the complement of Ui . and are defined as the lower and upper thresholds of T M for the duplex between a sequence and its complement. Moreover, is defined as the threshold of G min given by the sequence designer. Thus, our algorithm designs a pool P containing n sequences of length l for which (i) the duplex T M is in the range from to for any duplex of Ui (0 i n – 1) and Vi , and (ii) G min is greater than for any pairwise duplex of sequences in P and any concatenation of two sequences in P Q except for the pairwise duplex of Ui and Vi .

    Here, we describe more specifically the combination of sequences to be calculated using the G min. The sequence Ui (0 i n – 1) is denoted by a string of bases such as (5' to 3' direction), and similarly, Vi (0 i n – 1) is denoted as (5' to 3' direction). For example, if Ui = 5'-AAATTTCCCGGG-3', then Vi = 5'-CCCGGGAAATTT-3'. Furthermore, let X, Y be the combination of sequences X and Y, and XY be the concatenation of sequences X and Y in that order. For example, if Ui , Uj and Uk (0 i, j, k n – 1) are 5'-AAATTT-3', 5'-CCCGGG-3' and 5'-TCTCTC-3', respectively, then UiUj ,Uk means the combination of sequences 5'-AAATTTCCCGGG-3' and 5'-TCTCTC-3'. In our algorithm, the following combinations are considered for the G min calculation.

    Ui Uj Uk (0 i, j, k n – 1)

    Ui Uj Vk (0 i, j, k n – 1), i k

    Ui Vj Uk (0 i, j, k n – 1), i j

    Ui Vj Vk (0 i, j, k n – 1), (i j)(i k).

    For generality, we use two sequences, S(=s 0 s 1...sN –1) (5' to 3' direction) and T(=t 0 t 1...tM –1) (3' to 5' direction), to describe the G min calculation in detail. The sequences are defined to be antiparallel to each other. Note that any two sequences can be represented by S and T. In this paper, S and T represent Ui and the reverse sequence of XjYk (Xj {Uj , Vj }, Xk {Uk , Vk }).

    The notation si · tj represents the base pair between the i-th base in sequence S and the j-th base in sequence T, hence the structure between S and T is a set of base pairs such that each base is paired at most once. In addition, (si · tj , si' · tj' ) is defined as a structure in which base pairs si · tj and si' · tj' are formed and the sequences and do not form any base pairs. The term is defined as a structure in which base pair si' · tj' is formed and sequence in the case ( in the case ) do not form any base pair. Similarly, is defined as a structure in which base pair si · tj is formed and the sequence in the case ( in the case ) do not form any base pair.

    Outline

    Our algorithm is a random generate-and-test algorithm that generates a sequence randomly and then stores it in the pool if and only if the sequence satisfies all the constraints. The main feature of our algorithm is that it uses a greedy search for calculating G min to filter out inappropriate sequences. The advantage of a greedy search is that it approximates G min in less time when compared with a well-known dynamic programming algorithm (17). Therefore, using the greedy search before the G min calculation to exclude inappropriate sequences reduces the computation time. Hereafter, the approximated G min using a greedy search is denoted by G gre.

    The algorithm uses three filters:

    T M filter: checks whether the T M of candidate sequence Uc and Vc is in the range from to . If not, Uc is rejected.

    G gre filter: checks whether G gre is greater than the threshold, , for all the combinations above in P, provided that the candidate sequence Uc or Vc is included in that combination. If the G gre of any combination is less than or equal to , Uc is rejected.

    G min filter: checks whether G min is greater than the threshold, , for all the combinations above in P, provided that the candidate sequence Uc or Vc is included in that combination. If the G min of any combination is less than or equal to , Uc is rejected.

    The T M and G min filters are necessary for satisfying the constraints of sequence design, while the G gre filter reduces the computation time. The use of the G gre filter is based on the hypothesis that it can exclude most sequences that cannot pass through the G min filter. If this hypothesis is true, the G gre filter can exclude the inappropriate sequences in less time, resulting in reduced total computation time.

    The algorithm is defined as follows:

    Input: n, l, , , ,

    Output: pool P consisting of n sequences with length l.

    Procedure:

    Initialize pool P as an empty set.

    Iterate the following procedure until P has n sequences.

    Generate candidate sequence Uc with length l randomly, then add Uc to P.

    Evaluate Uc with T M filter. If Uc is rejected, exclude Uc from P and return to ii(a).

    Evaluate Uc with G gre filter. If Uc is rejected, exclude Uc from P and return to ii(a).

    Evaluate Uc with G min filter. If Uc passes, leave Uc in P; else exclude Uc from P and return to ii(a).

    The order of the filters is important. Each candidate sequence should be evaluated in this order to reduce the computation time. If pool P has m sequences, the time complexities to evaluate the (m + 1)-th candidate sequence are O(l), O(m 2 l 2) and O(m 2 l 3) at the T M, G gre and G min filters, respectively. By evaluating the candidate sequences in ascending order of time complexity, the inappropriate ones can be excluded sooner.

    Energy model

    The G min between S and T is calculated using a dynamic programming algorithm (17), while the G gre between S and T is calculated using a greedy search. Both G min and G gre are calculated based on the nearest-neighbor model, which calculates the total free energy as the summation of the contributions of various elementary structures (18,19). The elementary structures considered in this paper are stacking base pairs, bulge loops, internal loops, dangling ends and free ends.

    The contributions of the stacking base pairs, defined as , to the free energy are calculated using 12 parameters reported previously (19). The free energy contributions of the loop regions are sequence dependent (15,16,20). The free energies of single bulge loops, defined as or , are calculated using 64 parameters covering all the possible combinations of bulged base and flanking base pairs (19). The free energies of the other loops, bulge loops longer than one and internal loops, are calculated using conventional parameters and equations (20,21). Bulge loops longer than one are defined as or , and the internal loops are defined as . The free energies of dangling ends, defined as , , or , are calculated using 32 parameters covering all the possible combinations (22). The free ends are defined as the sequences and closing with such that both and do not form a base pair or and closing with such that both and do not form a base pair. The free energies of the free ends are also calculated using conventional parameters (20).

    The crossing base pairs are defined as a pair of base pairs and in a structure with {(i < i') (j' < j)} or {(i' < i) (j < j')}. We prohibit crossing base pairs because of the computation time and the lack of thermodynamic data. This constraint is equivalent to the pseudoknot-free constraint in the RNA secondary structure prediction. At this point, we do not consider intra-molecular base pairs or the interactions between loop regions.

    T M filter

    This filter checks whether a candidate sequence paired to its complement has a T M in the range to .

    (1)

    where R is the gas constant, C T is the concentration, H° is the enthalpy and S° is the entropy. Parameter is set to 1 for self-complementary and to 4 for non-self-complementary. Parameters H° and S° are calculated based on the nearest-neighbor model (18,19).

    G gre filter

    This filter checks whether G gre is greater than for all the combinations as mentioned in Algorithm.

    Using a ‘greedy search’ reduces the computation time for calculating G min. The greedy search works well because a structure with G min tends to include stable helices (i.e. continuous complementary regions). Therefore, the greedy search approximates G min by iteratively searching for the most stable helix and fixing the helix.

    The greedy search algorithm is as follows:

    First, calculate the free energies of all helices over the structure between sequence and sequence . Calculate the free energies of the helices with free ends using the following equations. Here, a helix is denoted by (5' 3') and (3' 5').

    G freeEnd is the free energy of a helix with free ends; G core is that without free ends. represents the free energy contribution of the dangling or free end between sequence and sequence closing with base pair . For , is zero because there is no dangling or free end. is the free energy of the stacked base pair between sequence and sequence .

    Search for a minimal value of G freeEnd over all helices. Then, fix the base pairs in the helix where G freeEnd is minimum. This region is freshly denoted as (5' 3') and (3' 5'). Furthermore, the free energies of the regions with and without free ends are freshly denoted as G freeEnd and G core, respectively.

    Iterate the following procedure d times, where d is a parameter defined below.

    If holds, calculate the free energies of all helices with the loop closing with base pair over the structure between sequence and the sequence . Thus, the free energy of helix, denoted as (5' 3') and (3' 5'), is calculated as follows:

    where is the free energy of the helix with a free end, is that of one without a free end and is the free energy contribution of a bulge or internal loop between sequence and sequence closing with base pairs and .

    Search for a region where is minimum over all helices. This region is freshly denoted as (5' 3') and (3' 5'). Furthermore, the free energies of the regions with and without a free end are freshly denoted as and , respectively. If holds, fix the base pairs, , and update i, j and G core to i', j' and , respectively. This means that the base pairs in the region are energetically favorable.

    If holds, calculate the free energies of all helices with a loop closing with base pair over the structure between sequence and sequence . Thus, the free energy of helix, denoted as (5' 3') and (3' 5'), is calculated as follows:

    where is the free energy of the helix with a free end and is that of one without a free end.

    Search for a region where is minimum over all helices. This region is freshly denoted as (5' 3') and (3' 5'). Furthermore, the free energies of the regions with and without a free end are freshly denoted as and , respectively. If holds, fix the base pairs, , and update k, l and G core to k'', l'' and , respectively. This means that the base pairs in the region are energetically favorable.

    Calculate G gre using

    where init is the energy penalty for forming double-stranded DNA. If , however, set G gre = 0. This is because the free energies are calculated relative to two non-interacting sequences, for which the free energy is defined as zero. Thus, two sequences must remain separate with zero free energy rather than form a structure with positive free energy.

    In the above procedure, the number of iterations for a search, d, is defined as ‘degree’. A structure with degree = 0 has at most one helix. Note that base pairing does not occur during a greedy search when more than two continuous complementary bases do not exist between two sequences. For degree = 1, there are at most three helices. Eventually, for degree = d, there are at most helices. If the length of sequences S and T are l and , respectively, the number of iterations is at most . Thus, the time complexity of greedy search is at worst. However, because the degree increases slowly with the sequence length, the time complexity of a greedy search can be in practice regarded as . To confirm this, we calculated the degree for 10 000 random pairs of sequences from 10mer to 100mer in steps of 10mer. As shown in Table 1, degree was at most 9 at 80mer and 90mer, while it was 33 ) at 100mer in the worst case. Furthermore, degree was rarely >6 (<1%), and, for >98% of the 10 000 random pairs, degree was in the range 0–4 at any length. Therefore, degree was nearly constant regardless of the sequence length on average, indicating that the time complexity of a greedy search is in practice.

    Table 1 Distribution of degree for greedy search for 10 000 random pairs of sequences from 10mer to 100mer in steps of 10mer

    G min filter

    This filter checks whether the G min is greater than for all the combinations as mentioned in Algorithm.

    G min between S and T can be decomposed into two terms:

    (2)

    where represents the minimum value of the free energy between sequence (5' 3' direction) and sequence (3' 5' direction) closing with . Recall that represents the free energy of the dangling or free ends between sequence (5' 3' direction) and sequence (3' 5' direction) closing with base pair .

    Furthermore, is calculated using

    (3)

    where represents the minimum value of the free energy forming a bulge or internal loops closing with base pair . Recall that is the free energy of the stacked base pair between sequence (5' 3' direction) and sequence (3' 5' direction). The first term represents the case in which the unpaired end consists of sequences (5' 3' direction) and (3' 5' direction) closing with base pair . The second term corresponds to the case in which the stacked base pair is energetically favorable. In this case, is calculated recursively. The third term is calculated using

    (4)

    Recall that represents the free energy contribution of the loops between sequence and sequence closing with base pairs and . The G min is calculated recursively using Equations 2– 4 by dynamic programming. If we compute the VBI term in a straightforward manner, its time complexity is . However, this can be reduced to using the algorithm of Lyngs? et al. (23).

    RESULTS

    Effectiveness of G gre filter

    To evaluate the effectiveness of the G gre filter, we compared the computation time of our algorithm with that of the algorithm without the G gre filter. That algorithm checks a randomly generated sequence by using the T M and G min filters and then stores the sequence in the pool if and only if the sequence passes both filters. All the computational experiments described in this section were performed using Windows 2000 on a computer with an Athlon 1.4 GHz CPU and 256 MB of memory. The results are shown in Figure 1. The computation time grew exponentially because the number of three-sequence combinations increased exponentially with the number of sequences.

    Figure 1 Number of sequences designed versus computation time for two design strategies up to 10 h. In (a) and (b), l = 20, and . In (a), ; and in (b), . In (c) and (d), l = 15, and . In (c), ; and in (d), .

    Figure 1 shows that using our algorithm reduced the computation time drastically. For example, our algorithm needed 1.3 h to design 30 sequences with length 20 for , while the algorithm without the G gre filter needed 7.7 h (Figure 1a and b). This means that the G gre filter effectively excludes the sequences that cannot pass through the G min filter.

    A comparison of Figure 1a and b clearly shows the primacy of our algorithm for versus . For instance, our algorithm reduced computation time to 83% (7.7 1.3 h) for 30 sequences and while it reduced computation time to 57% (9.1 3.9 h) for 69 sequences and . This is because both algorithms can find sequences that satisfy the constraints more easily for than for . A similar trend is seen in the sequences with length 15 (Figure 1c and d). Therefore, using the G gre filter enables sequences to be designed in less time, particularly when the threshold is high.

    Filtering performance of G gre filter versus hamming distance

    The function of the G gre filter is to filter out the inappropriate sequences from many sequences generated randomly; the promising sequences are then checked using the G min filter. This means that the filtering performance of the G gre filter can be evaluated using four terms: the number of sequences with both G min and G gre higher than the threshold (true positive: TP), that with G min higher but not with G gre (false negative: FN), that with G gre higher but not with G min (false positive: FP) and that with neither G min nor with G gre higher (true negative: TN) (Figure 2). There is a trade-off between FN and FP, which are the number of sequences incorrectly excluded by the G gre filter and that incorrectly passing through the G gre filter, respectively. To evaluate the G gre filter, we compared the filtering performance of the G gre filter with that using the hamming distance (called hamming filter) with respect to FN such that FP = 0 and to FP such that FN = 0 for 10 000 random pairs of sequences with length 20. Setting FP to 0 means that the G gre or hamming filter certainly excludes inappropriate sequences with a G min less than or equal to . Therefore, the lower the number in FN such that FP = 0, the better the filter. Similarly, because FN = 0 means that the G gre or hamming filter never excludes appropriate sequences with a G min greater than , the filter should have fewer sequences in FP such that FN = 0.

    Figure 2 Plot of G gre versus G min for 10 000 random pairs of sequences with length 20; and represent threshold of G gre and that of G min, respectively. Four terms were used for evaluating filtering performance: TP, ; FN, ; FP, ; and TN, .

    As shown in Figure 3, for FN = 0, the number of sequences classified into FP by the G gre filter was much smaller than that by the hamming filter. Because most sequences have a G min of more than –10.0 kcal/mol (Figure 2), the FP converged to zero for both the hamming filter and G gre filter for less than –10.0 kcal/mol. Similarly, for FP = 0, the number of sequences classified into FN by the G gre filter was smaller than that by the hamming filter. The reason there is little difference for greater than –5.0 kcal/mol is that a small number of sequences with the G gre actually had a G min such that . For example, sequences GGTCACCCTGGGCTACCGGA (5' 3' direction) and TTAAGGTCGCGTGCTATCTT (3' 5' direction) had a G min of –7.92 kcal/mol while the G gre was –1.96 kcal/mol. Because such sequences are rare, however, the primacy of the G gre filter over the hamming filter is clear for less than –5.0 kcal/mol.

    Figure 3 Threshold of G min versus number of sequences classified as FP and FN. (a) FP such that FN is zero and (b) FN such that FP is zero.

    Another advantage of the G gre filter is that it guarantees having the threshold where the number of FN is zero because the G gre is the upper bound for the G min (i.e. ). For example, a pair of sequences having a G gre less than or equal to –5.0 kcal/mol is guaranteed to have a G min of at most –5.0 kcal/mol. Thus, the sequences with a G gre of more than –5.0 kcal/mol include all sequences with a G min of more than –5.0 kcal/mol. Therefore, setting the threshold for a G gre filter () to that for the G min filter () (i.e. ) guarantees that the number of FN is zero. This is not the case for hamming filter.

    Comparison between hamming distance and G min in vitro

    We investigated the validity of the G min based approach compared with the traditional approach based on the hamming distance by using an in vitro experiment. First, to address the validity of the G min calculation, we calculated the average deviations from the experimental free energies (G exp). The average deviations were derived from 126 sequences (31 complementary sequences, 83 sequences with a single bulge loop and 12 sequences with free ends). The average deviations, calculated using ( and represent the i-th G min and G exp, respectively), were 3.2% (3.0, 2.8 and 6.5% for the complementary sequences, single bulges and free ends, respectively). These average deviations are within the limits of what can be expected for a nearest-neighbor model (18,19). Thus, we confirmed that our algorithm can predict the G min adequately. The 126 sequences with their G min and G exp are provided in the Supplementary Material. The number of complementary bases for each pair of sequences is also provided.

    Then, to compare the hamming distance with G min, we investigated the correlation coefficient with G exp. To fairly compare different length sequences, we used the number of complementary bases but not the hamming distance. For example, although sequences 5'-GGG-3' and 3'-CCC-5' with three complementary bases and sequences 5'-GGGGG-3' and 3'-CCCCC-5' with five complementary bases both have zero hamming distance, the stability of the latter must be higher than that of the former. Thus, the number of complementary bases is appropriate for comparing the stability of sequences with different lengths. Note that the number of complementary bases is equivalent to the hamming distance for sequences with the same length. The results are shown in Figure 4a and b. In Figure 4a, BP represents the number of complementary bases. The correlation coefficient, |R|, between –BP and G exp was 0.80, while that between G min and G exp was 0.90, indicating that the G min is better than the number of complementary bases (i.e. hamming distance) as a predictor of stability.

    Figure 4 (a) –BP versus experimental free energy. BP represents the number of complementary bases. (b) Predicted minimum versus experimental free energy. In (a) and (b), values and lines are correlation coefficients and regression lines, respectively.

    Bozdech et al. (5) compared the number of complementary bases with the binding energy (the calculation method and nearest-neighbor parameters differed from ours) by using 70mer oligonucleotides on microarray. They found that |R| between the binding energy and relative intensity of hybridization (= intensity of fluorescence) was 0.91, while it was 0.72 between the number of complementary bases and the relative intensity of hybridization. This is consistent with our findings, especially with respect to the correlation between the predicted energy (binding energy in theirs, G min in ours) and the stability derived from the experimental results (their |R| was 0.91, while ours was 0.90). With respect to the correlation between the number of complementary bases and the experimental stability, their |R| was 0.72, while ours was 0.80. This is because our data were derived from only sequences with a simple structure, i.e. with at most one loop (single bulge loop). In general, the more loops, the higher the discrepancy between the number of complementary bases and the experimental stability. Therefore, the correlation between the number of complementary bases and the experimental stability will be close to theirs with respect to more complex sequences, i.e. with more than two loops. These results demonstrate the ability of our program to approximate the experimental stability of double-stranded DNA based on the G min calculation.

    DISCUSSION

    In the previous section, we showed the effectiveness of the G gre filter for filtering out the sequences that cannot pass through the G min filter. To further address the question how the G gre is close to the G min, we investigated the average prediction error of G gre, calculated using . Figure 5 shows that the average prediction error can be restricted to <1.0 kcal/mol for sequences shorter than 50mer, which are frequently used in DNA computing.

    Figure 5 Average prediction error of G gre, calculated using .

    Figure 6 shows the number of sequences such that G min is equal to G gre. The number decreased as the sequences became longer. For example, in 9250 pairs from 10 000 pairs of sequences at 10mer, while in 157 pairs from 10 000 pairs at 100mer. Therefore, G gre is a good predictor of G min for short sequences, while it is only an approximation for long sequences.

    Figure 6 Number of sequences such that .

    For comparison with the hamming distance, the correlation coefficient (|R|) was also calculated. The correlation between G min and G gre decreased almost linearly from 0.99 at 10mer to 0.63 at 100mer. Because long sequences tend to have more helices, the discrepancy increased as the sequences became longer. The correlation coefficient between G min and the hamming distance, which decreased almost linearly from 0.46 at 10mer to 0.12 at 100mer, was much less than that between G min and G gre.

    IMPLEMENTATION

    We implemented our algorithm in a program called ‘DNA Sequence Design Tool (DNA-SDT)’. DNA-SDT can be downloaded freely from the web site (http://ses3.complex.eng.hokudai.ac.jp/~fumi95/DNA-SDT/index.html). It has two functions: sequence design and structure prediction.

    In sequence design, the program solves the problem of designing a pool P containing n sequences of length l for which (i) the duplex T M is in the range to for any duplex of Ui and Vi and (ii) G min is greater than in any pairwise duplex of sequences in P and any concatenation of two sequences in except for the pairwise duplex of Ui and Vi . This problem is solved using our algorithm with the threshold for a G gre filter, . The parameters, n, l, , , and , are user-defined. DNA-SDT has a table of average T M values at length l calculated from 10 000 sequences generated randomly. Parameters and are set to and , respectively, by default. The T M is calculated at 1 μM.

    In structure prediction, the program calculates the G min between two sequences, then displays the structure with the G min using a traceback algorithm. Users can thus determine the structure of any combination of sequences from the pool designed by the program. Of course, for any given combination of sequences, users can also determine the structure with the G min.

    The program is a GUI-based application written in Java, hence it can be executed on any computer that has the Java Runtime Environment (JRE) installed. Although older versions of JRE supposedly run without problem, the latest version is preferable. We tested the program with JRE 1.4.2 on a Windows 2000 workstation, JRE 1.4.1 on a Windows XP workstation and JRE 1.4.2 on a Turbolinux Workstation 7.0.

    The program interface consists of two screens: one for structure prediction and one for sequence design. The left-hand side of the window is the screen for structure prediction, which outputs the structure with the G min from the two sequences input in the text box. The sequences in the text box can be converted into reverse or complementary sequences. The right-hand side is the screen for sequence design, which outputs the sequences, GC content and T M based on the constraints input in the text boxes.

    To avoid impossible operations during design, the procedure for sequence design is executed as a separate thread. Therefore, the structure-prediction screen can be operated freely when sequences are being designed. Furthermore, the design procedure can be monitored and interrupted anytime, and the interim results can be output. However, because the button for running the sequence design procedure is locked during design, users cannot design another pool of sequences in parallel.

    The number of sequences can be set up to 100. If more than 100 sequences need to be designed, the user can choose ‘Unlimited’. In this case, the design procedure iterates until the user interrupts it. The sequence length is restricted at 8mer to 50mer because of the computation time. Although the thresholds for G min and G gre can be set freely, using irrelevant values resulting in zero sequences. We recommend setting these thresholds such that to guarantee zero false negatives (see Results).

    SUPPLEMENTARY MATERIAL

    Supplementary Material is available at NAR Online.

    ACKNOWLEDGEMENTS

    Funding to pay the Open Access publication charges for this article was provided by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas, 2002–2005, 14085201.

    REFERENCES

    Adleman, L.M. (1994) Molecular computation of solutions to combinatorial problems Science, 266, 1021–1024 .

    Braich, R.S., Chelyapov, N., Johnson, C., Rothemund, P.W., Adleman, L. (2002) Solution of a 20-variable 3-SAT problem on a DNA computer Science, 296, 499–502 .

    Faulhammer, D., Cukras, A.R., Lipton, R.J., Landweber, L.F. (2000) Molecular computation: RNA solutions to chess problems Proc. Natl Acad. Sci. USA, 97, 1385–1389 .

    Rouillard, J.M., Zuker, M., Gulari, E. (2003) OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach Nucleic Acids Res., 31, 3057–3062 .

    Bozdech, Z., Zhu, J., Joachimiak, M.P., Cohen, F.E., Pulliam, B., DeRisi, J.L. (2003) Expression profiling of the schizont and trophozoite stages of Plasmodium falciparum with a long-oligonucleotide microarray Genome Biol., 4, R9 .

    Kane, M.D., Jatkoe, T.A., Stumpf, C.R., Lu, J., Thomas, J.D., Madore, S.J. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays Nucleic Acids Res., 28, 4552–4557 .

    Winfree, E., Liu, F., Wenzler, L.A., Seeman, N.C. (1998) Design and self-assembly of two-dimensional DNA crystals Nature, 394, 539–544 .

    Shih, W.M., Quispe, J.D., Joyce, G.F. (2004) A 1.7-kilobase single-stranded DNA that folds into a nanoscale octahedron Nature, 427, 618–621 .

    Yan, H., Zhang, X., Shen, Z., Seeman, N.C. (2002) A robust DNA mechanical device controlled by hybridization topology Nature, 415, 62–65 .

    Arita, M. and Kobayashi, S. (2002) DNA sequence design using templates New Gen. Comput., 20, 263–277 .

    Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K. (2000) Improving sequence design for DNA computing In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000) pp. 875–882 .

    Dirks, R.M., Lin, M., Winfree, E., Pierce, N.A. (2004) Paradigms for computational nucleic acid design Nucleic Acids Res., 32, 1392–1403 .

    Andronescu, M., Fejes, A.P., Hutter, F., Hoos, H.H., Condon, A. (2004) A new algorithm for RNA secondary structure design J. Mol. Biol., 336, 607–624 .

    Andronescu, M., Aguirre-Hernandez, R., Condon, A., Hoos, H.H. (2003) RNAsoft: a suite of RNA secondary structure prediction and design software tools Nucleic Acids Res., 31, 3416–3422 .

    Hofacker, I.L. (2003) Vienna RNA secondary structure server Nucleic Acids Res., 31, 3429–3431 .

    Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure J. Mol. Biol., 288, 911–940 .

    Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information Nucleic Acids Res., 9, 133–148 .

    SantaLucia, J., Jr. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics Proc. Natl Acad. Sci. USA, 95, 1460–1465 .

    Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A. (2004) Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop Biochemistry, 43, 7143–7150 .

    Zuker, M. (2003) Mfold web server for nucleic acid folding and hybridization prediction Nucleic Acids Res., 31, 3406–3415 .

    Peritz, A.E., Kierzek, R., Sugimoto, N., Turner, D.H. (1991) Thermodynamic study of internal loops in oligoribonucleotides: symmetric loops are more stable than asymmetric loops Biochemistry, 30, 6428–6436 .

    Bommarito, S., Peyret, N., SantaLucia, J., Jr. (2000) Thermodynamic parameters for DNA sequences with dangling ends Nucleic Acids Res., 28, 1929–1934 .

    Lyngs?, R.B., Zuker, M., Pedersen, C.N.S. (1999) Fast evaluation of internal loops in RNA secondary structure prediction Bioinformatics, 15, 440–445 .(Fumiaki Tanaka1,*, Atsushi Kameda2, Masa)