Measuring the Fit of Sequence Data to Phylogenetic Model: Allowing for Missing Data(文章精)

Measuring the Fit of Sequence Data to Phylogenetic Model: Allowing for Missing Data

http://www.100md.com 《分子生物学进展》

     Department of Statistics, Department of Biological Sciences, University of South Carolina, Columbia

    Correspondence: E-mail: waddell@stat.sc.edu.

    Abstract

    It is fundamentally important to assess the fit of data to model in phylogenetic and evolutionary studies. Phylogenetic methods using molecular sequences typically start with a multiple alignment. It is possible to measure the fit of data to model expectations of data, for example, via the likelihood-ratio (G) test or the X2 test, if all sites in all sequences have an unambiguous residue. However, nearly all alignments of interest contain sites (columns of the alignment) with missing data, that is, ambiguous nucleotides, gaps, or unsequenced regions, which must presently be removed before using the above tests. Unfortunately, this is often either undesirable or impractical, as it will discard much of the data. Here, we show how iterative ML estimators may directly estimate the site-pattern probabilities for columns with missing data, given only standard i.i.d. assumptions. The optimization may use an EM or Newton algorithm, or any other hill-climbing approach. The resulting optimal likelihood under the unconstrained or multinomial model may be compared directly with the likelihood of the data coming from the model (a G statistic). Alternatively the modified observed and the expected frequencies of site patterns may be compared using a X2 test. The distribution of such statistics is best assessed using appropriate simulations. The new method is applicable to models using codons or paired sites. The methods are also useful with Hadamard conjugations (spectral analysis) and are illustrated with these and with ML evolutionary models that allow site-rate variability.

    Key Words: Phylogenetic likelihood-ratio test ? model fit ? ML with unequal site rates ? G statistic ? Hadamard conjugation ? spectral analysis

    Introduction

    Measuring fit of data to model is essential if phylogenetics is to advance as a statistical science. The fit of aligned sequence data to a phylogenetic, or evolutionary tree, model under independent and identical distribution (i.i.d.) assumptions, using statistics such as X2 and G, is an issue that has been studied and discussed by Reeves (1992), Goldman (1993), Waddell (1995), Adachi and Hasegawa (1996), and others. Asymptotically, as the sequence length goes to infinity, these statistics often converge to a 2 distribution with degrees of freedom (df) equal to that of the multinomial model (the number of distinct sequence patterns minus 1) minus the number of efficiently estimated parameters if the data indeed come from the model. (Here, efficient is used in the statistical sense; an estimator with minimum variance, such as maximum likelihood is under some conditions.)

    Unfortunately, the required conditions for the above convergence rarely hold. For example, sequence lengths tend to be tiny compared with the number of possible site patterns, and extreme sparseness often results. This problem may be countered by grouping site patterns together (e.g., Adachi and Hasegawa 1996; Waddell 1995) or by simulating data under the model and inferring the expected distribution of the fit statistic (Reeves 1992; Goldman 1993; Waddell 1995). Another requirement is that the phylogenetic model to be assessed is specified a priori. Thus, a posteriori methods such as model selection using AIC or BIC (Kishino, Miyata, and Hasegawa 1990) disrupt the expected distribution, and this factor needs to be assessed using simulations. Tree selection is also a form of model selection and may disrupt the convergence if a tree is not specified a priori. Finally, if using fully statistically efficient estimators or if parameter estimates are constrained by boundaries, there can be partial recovery of df lost in fitting model to data (Stuart and Ord 1991).

    In practice, one of the greatest obstacles to using these methods is that multiple alignments typically contain "missing" data. This prevents fit methods being used, for example with the G statistic, because this unconstrained likelihood is presently treated as unknown or unknowable by the field. Here, "missing" data include ambiguous nucleotides (caused by sequencing problems or polymorphisms), insertions or deletions (indels), and incomplete sequencing. These factors are all relatively common and tend to affect nearly all sites as an alignment grows in depth because of adding more taxa. Excluding taxa with missing data is doubly unfortunate, as data sets with a good sample of appropriate taxa tend to be both more informative and robust than are taxon-poor data sets (Felsenstein 1978; Hendy and Penny 1989; Swofford et al. 1996).

    Here, we address the problem of measuring fit of data to model using iterative maximum likelihood (ML) estimators of site-pattern probabilities in place of the standard closed-form ML multinomial estimators for complete site patterns. Once these are applied, a variety of fit statistics may be used with ML, or indeed any phylogenetic method, provided the distribution of these statistics is inferred using appropriate simulations. The new methods work with any i.i.d. model of evolution (see Swofford et al. 1996).

    A bonus of the new approach is that it also provides an excellent way of preparing data for Hadamard conjugations and spectral analysis. These are useful means of studying phylogenetic signals in a data set. The four main variants are two-state models (Hendy and Penny 1993), four-state models (Steel et al. 1993; Hendy, Penny, and Steel 1994), site-rate variability (Steel et al. 1993; Waddell, Penny, and Moore 1997), and projections from four down to two states (Waddell 1995; Waddell and Hendy 1997). For a statistical framework to these methods and a simple step-by-step example, see Waddell et al. (1994). All these approaches previously required deleting any site from the multiple sequence alignment that had missing data. Using the new methods introduced here, all sites may be used in producing spectra.

    Theory

    The problem and solution may be expressed succinctly as follows: given a sample of complete plus incomplete site patterns, find the vector of site-pattern probabilities that maximizes the likelihood of obtaining this sample. To do this we use the standard i.i.d. assumption; that is, all sites evolve independently and according to an identical distribution. This, in turn, will allow a direct estimate of likelihood of the data under the unconstrained or multinomial model.

    If all observed site patterns are complete, the ML vector of unconstrained site-pattern probability estimates is given by where is the frequency of the ith site pattern and c is the total sequence length. (Notation here follows Waddell et al. [1994]; for example, "hat" indicates an ML estimate for the unconstrained model.) The probability of a site pattern is also called a site-pattern likelihood (i.e., its probability under some model with fixed parameter values).

    Consider a simple example: a data set that contains only one incomplete site pattern, and the state is unknown in only one taxon (sequence). In this case, the unconstrained likelihood of this site pattern equals the sum of the likelihoods for all the observed complete patterns compatible with this incomplete site pattern. By compatible, it is meant that there is no disagreement in the known states when comparing this pattern with another. This definition applies equally to a site with a completely unknown state or one with an ambiguous nucleotide assignment (e.g., Y [meaning C or T but not A or G]). A special case should be noted and occurs when a site pattern is completely orphaned; that is, no other observed complete or incomplete site pattern is compatible with it. In this case, the ML estimate of its probability equals its observed frequency divided by the number of all the sites in the alignment; that is, in the where * denotes "based on all sites including sites with missing data." Note that it is unnecessary to try to resolve the ambiguity of orphaned sites further at this stage; indeed, there is no way to do this without invoking further assumptions. A further rule is that to maximize the unconstrained likelihood of the observed data, there is no need to consider unobserved site patterns to have probability greater than zero. This is a logical consequence of maximizing the likelihood by putting all the probability density where it is needed to explain the observed data and none elsewhere.

    For data sets with only one incomplete site pattern, the ML vector of unconstrained site-pattern frequencies, including missing data (vector ), can be written down in closed form. Its entries are

    (1)

    where fi is the observed frequency of a site pattern, k is the index of the nonorphaned ambiguous site pattern, and the summation is over any site pattern j such that j is a member of the set of site-pattern indices ‘ak’ that are compatible with the ambiguous site pattern and are not partial nonorphaned patterns. If the ambiguous site pattern is orphaned, then its frequency is the ML estimate. (By defining the members of set ‘a’ appropriately, sequencing ambiguities such as "A or T" may be dealt with.) It automatically follows that and this applies whenever is obtained. It follows that the unconstrained natural log-likelihood lnL of the data, given i.i.d. assumptions, then,

    (2)

    where n is the number of observed complete site patterns, m the number of distinct incomplete site patterns, and set ‘ak’ is as defined above. To clarify, the indexing used may be that used in a standard Hadamard conjugation for complete site patterns or any other indexing. If Hadamard indexing is used, then an extended sequence of numbers will be needed for the incomplete site patterns.

    When there is more than one ambiguous site pattern and the sets of complete patterns compatible with these incomplete site patterns overlap, there is no closed form ML estimator for site-pattern probabilities using all the data. However, it is still possible to seek the ML solution by directly modifying the affected entries in f* to maximize the value on the left side of equation 2.

    A number of approaches to the numerical optimization of equation 2 are possible. Quasi-Newton and conjugate-gradient methods are two approaches that we have employed (Press et al. 1995). However, when applying these methods, it is important to constrain the space of solutions, including that all site-pattern probabilities being optimized are greater than zero, that the last site pattern is equal to 1 minus the sum of all others, and that the sum of all site-pattern probabilities is 1. Occasionally these methods will fail to converge. In large alignments, with fairly randomly scattered ambiguities, site-pattern probability estimates remain close to uncorrelated. Therefore, cycles of line searches (e.g., by the Newton method) may be computationally more efficient than a full multivariate-optimization procedure.

    This problem is also suited to the expectation maximization (ML) algorithm (Dempster, Laird, and Rubin 1977). Although EM generally converges slower than a Newton optimization, it has advantages. It is simple to program and is guaranteed never to diverge. The estimated frequency of the ith complete pattern is updated each cycle as

    (3)

    where ' indicates the value on the previous cycle. The iteration may be stopped, for example, when all entries in and differ by less than (e.g., = 10–16).

    Once is obtained, a variety of fit statistics may be used between it and that predicted by an evolutionary tree model, f(T). This includes any member of the Cressie-Read family (Read and Cressie 1988), of which

    (4)

    and the X2 statistic,

    (5)

    are prominent examples. The distribution of these statistics should be estimated using simulations if there is concern that the required conditions for convergence do not hold or the P value is close to the critical value.

    Worked Example

    The following example illustrates the calculation of site-pattern probabilities. The same data are then subsequently used in a series of analyses to illustrate fitting evolutionary models to data with and without incomplete patterns.

    Unconstrained Likelihood with Missing Data

    For simplicity of illustration we use the symmetric two-state model (e.g., R/Y coding). This is like a Cavender or Farris model (Swofford et al. 1996). Table 1 shows the resulting s (site-pattern probability) and f (site-pattern frequency) vectors before and after taking into account three ambiguous site patterns. The first ambiguous site pattern considered is RR??. This pattern is consistent with site patterns of index 0, 3, 4, and 7, and the evidence for RR?? is distributed to these patterns. Site patterns 3 and 7 are included because under this model, patterns RRRR and YYYY are equivalent, as are RRYY and YYRR; hence, these pairs are grouped in estimating site-pattern probabilities. The second pattern considered is ?YRR. It augments patterns with index 2 and 3 (but more goes to 2 than 3 because the former is much more common). Finally, consider ambiguous pattern ?RYR, which is compatible with patterns indexed 4 and 5. In this case, this last pattern augments only site-pattern index 4 because the pattern with index 5 is not observed in the data.

    Table 1 ML Estimates of Site-Pattern Probabilities for a Two-State i.i.d. Model That Assumes a Symmetric Transition Matrix

    Notice that in table 1, the frequencies of three site patterns, those of index 1, 5, and 6, have not changed. The case of index 5 has already been explained. Those of index 1 and 6 have not changed, as these are not compatible with any of the three ambiguous site patterns. Although their ML frequencies have not changed, their estimated probabilities do change because the sequence length with (c) and without (c*), removing ambiguous sites, are 49 and 52, respectively.

    In our example, the vector of ML estimates of site-pattern frequencies, including incomplete patterns, denoted were arrived at using a quasi-Newton method and confirmed with a conjugate-gradient method (table 1). To illustrate the alternative EM method, the zeroth cycle starts with and vectors for the complete site patterns. The first cycle (cycle 1) of this algorithm to estimate gives the vector [20.5556, 5, 7.7778, 2.2778, 7.1667, 0, 1, 8.2222]. The updated frequency of the first site pattern of index 0 (all taxa having the same state) is arrived at by applying equation 3: = 20 + 1 x [20/(20 + 2 + 6 + 8)]. Similarly, = 7 + 1 x [7/(7 + 2)], = 2 + 1 x [2/(20 + 2 + 6 + 8)] + 1 x [2/(7 + 2)], and so on for the other complete site patterns compatible with the observed partial site patterns. On the second cycle = 20 + 1 x [20.5556/(20.5556 + 2.2778 + 7.1667 + 8.22220)], = 7 + 1 x [7.7778(7.7778 + 2.2778)], and so on. This cycling continues until the frequencies of cells on subsequent cycles change by a small enough amount to meet the stopping rule (e.g., all cells change by less than 10–16). In this particular example, the rate of convergence was very close to one decimal place per cycle; after 17 cycles all cells changed by less than 10–16 per cycle and had converged to the estimates found with the Newton and conjugate-gradient methods.

    Given the ML vectors in table 1, the unconstrained lnL of the complete site patterns equals 20.ln[0.4082] + 5.ln[0.1020] + 7.ln[0.1429] + 2.ln[0.0408] + 6.ln[0.1224] + 0.ln[0] + 1.ln[0.0204] + 8.ln[0.1633] = –80.3436 (according to L'Hopital's rule, we can ignore terms of 0.ln[0]). The unconstrained lnL of all sites including partial site patterns equals 20.ln[0.3949] + 5.ln[0.0962] + 7.ln[0.1495] + 2.ln[0.0440] + 6.ln[0.1382] + 0.ln[0] + 1.ln[0.0192] + 8.ln[0.1580] + 1.ln[0.3949 + 0.440 + 0.1382 + 0.1580] + 1.ln[0.1495 + 0.0440] + 1.ln[0.1382 + 0] = –84.3572. As expected, the unconstrained likelihood of the data incorporating incomplete site patterns is less than that of only the complete site patterns. This will always be the case because the calculation for only the complete site patterns involves less data and a shorter alignment.

    Evolutionary Model Likelihoods

    The vectors of site-pattern probabilities under an evolutionary tree model may be arrived at using either the standard algorithm (Neyman 1971; Felsenstein 1981) or that based on a Hadamard conjugation (Hendy and Penny 1989, 1993). The iterative optimization of parameter values to maximize the lnL of the data can be done using the EM algorithm (e.g., Felsenstein 1981) or a Newton method (e.g., Waddell and Penny 1996) or any other suitable numerical optimization routine. The site-pattern probability (or likelihood) of a site with ambiguity is the sum of the site-pattern likelihoods of all patterns it is compatible with. In the case of the standard approach, this summation is done implicitly; it is necessary to only calculate the likelihood of the partial pattern because the sum of the remaining unknown probabilities is 1. With a Hadamard conjugation, the likelihood of partial site patterns involves a sum of terms in the s vector for a tree (called s(T)) with which this partial pattern is compatible. This is essentially the same calculation that occurs in the estimation of the likelihood of ambiguous sites for the unconstrained model, but it must explicitly include all site patterns (even those unobserved in the data).

    Next, we calculate the constrained or tree lnL of the data without ambiguous sites. Table 2 shows the ML vectors of site likelihoods under an identical site-rate (i.r.) model, a proportion of invariant sites model (pinv, first implemented by Hasegawa, Kishino, and Yano [1985]), a model, and an invariant site plus model (first implemented by Waddell and Penny [1996] based on concurrent work by Steel et al. [1993]). Notice that the likelihoods of all four models in this case are identical; there was no improvement in fitting the extra parameters. The reason for this is that site-rate variability helps to explain excess homoplasy, but in these data, there appears to be no such excess. The total lnL of the data under all these tree models was –83.6635. The likelihood ratio with respect to the unconstrained model was lnL = –3.3199 and G = 2lnL = 6.6397. Asymptotically, this G statistic is approximately distributed as 2 with df = number of unique pieces of information minus number of fitted parameter = 7 – 5 = 2 (under the i.r. model). The critical value for is = 0.05 = 5.9915. Here, the observed value for the G statistic is close to the critical value, and interpretation of this is important to consider.

    Table 2 ML Estimates of Site-Patterns Probabilities (s) and Frequencies (f) Under the Identical Site-Rate (i.r.) and the Gamma Distributed () Site-Rate Phylogenetic Models Using Only Unambiguous Site Patterns and All Site Patterns

    Interpreting G Statistics

    Factors such as sparseness, a posteriori model selection, and partial recovery of degrees of freedom, as noted in the introduction, all distort the distribution of fit statistics away from a simple 2. In addition, if the or the pinv or the pinv + model had been chosen a priori, then the asymptotic distribution of G would be further complicated because these parameters are affected by a boundary, and this would only alter the G value in some data samples. Because of the boundary, such direct parameter estimates may reduce the df in G by less than 1. This will make G smaller than otherwise expected, a result consistent with the findings in Ota et al. (2000). Further, with the pinv + model, G nominally has no df, yet is not zero. Given these issues that complicate the interpretation of G, it may be desirable to simulate the expected distribution. A reasonable simulation procedure, following Reeves (1992) and Goldman (1993), is to take the ML parameter estimates on the AIC model, simulate the data under this model, and analyze each replicate exactly the same way the original data were analyzed, including both evolutionary model and tree selection.

    Fit with Missing Data

    The result of selecting trees and fitting models, including partial site patterns, is given in table 2. The lnL under the i.r., , pinv, and pinv + model was –88.0860; lnL = 3.7288 and G = 2lnL = 7.4577. (Note, the likelihood of the data under the phylogenetic model is maximized by minimizing this G; the "G" of 6.81 shown in table 2 and also the 2 value of 4.83 are based on a direct comparison of f(T) and, without taking into account the interdependence of values). All the same factors described above that modify convergence to a distribution with complete data apply here also. In addition, probably making the convergence better, is the fact that expected and observed values are larger (i.e., more sites are used). Countering this, and making convergence worse, the average correlation among site-pattern frequencies is expected to be larger as partial site patterns may contribute to more than one category (unlike the standard multinomial situation). The simulation procedure to assess the distribution of G with missing data is essentially the same as that with complete data, except that missing data must be simulated also. Maintaining the i.i.d. assumption requires that the missing data can occur equally in any site pattern. For the example data, this can be achieved by creating a data set of 52 simulated site patterns and randomly erasing a character for taxon 3 once, erasing a character for taxon 4 once, and erasing two characters for taxon 1. If we assumed the process of substitution was i.i.d. but the process generating the missing data was not i.i.d., then it would be appropriate to erase a character for taxon 3 and taxon 4 at the same site, and to erase a character for taxon 1 twice. An example of a non-i.i.d. proccess may be indels affecting more than one site. An effective way to simulate an alignment of site patterns where missing data is not i.i.d. is to generate a full set of simulated site patterns and then, where there are missing data in the original data set, erase these character states in the simulated data.

    Spectral Analysis

    This section illustrates use of as the starting point for a spectral analysis. One of the strengths of spectral analysis is that it allows a visualization of fit of data to model by projecting support, under an evolutionary model, to a set of edge lengths before selecting a tree. An issue that immediately arises using this vector is that an intermediate value in the conjugation, the observed generalized distance entry r7, exceeds the threshold for use with a standard logarithmic correction for multiple hits. One solution to having distances that are too large is to add a fraction of constant sites to all site patterns until the observed distances are within the expected range. The question of how many constant sites to add can be addressed in a number of ways. One approach is to use ML as discussed below. Another is to add just a fraction more than is needed for the transformed distance to be defined. In this case setting pinv = –0.1 (i.e., adding 10% more constant sites to the data) achieves the desired result and yields the spectra and shown in table 3. These spectra suggest that there is too little support in the data to resolve a tree as indicated by all internal edge-length estimates being negative.

    Table 3 Spectral Analysis Compared with Inspection of Predicted and Observed Values via ML

    It is also possible to use ML to estimate the optimal fraction of invariant sites to add. Doing this requires allowing the parameter pinv to become negative. In the Hadamard framework, this does not immediately affect the likelihood function (indeed nor does allowing edge lengths to become negative [Waddell 1995]). The ML estimate of pinv is –2.4378 and the lnL of the data –81.0732 for the complete site pattern data. For all sites, it is pinv = –3.344 and lnL = –81.1486; that is, the data fit a tree best when an additional 2.43 x 49 or 3.34 x 52 constant sites are added to the data. Allowing pinv to be negative, both AIC and BIC criteria include pinv in the optimal model. Pinv takes a negative value when there are fewer constant sites in the data than the standard i.i.d model expects. This could be the result of chance or an effect such as morphologists not reporting "uninformative characters" in their data matrices. Interestingly, although the coefficient of variation of the distribution 1/k cannot become negative with pinv allowed to be negative, the pinv + model finds a new optimum with 1/k = 1/3.046, pinv = –4.0995, and lnL –81.0595 with the complete site pattern data (when these two parameters are fitted together their values tend to be negatively correlated, Waddell [1995]). Despite 1/k being far from 0, it is not admitted into the model using either AIC or BIC. Fitting these models to the data including ambiguous site patterns yields pinv = –100, 1/k = 1/0.039, and lnL= –81.1468. Clearly 1/k and -pinv are acting in concert to try to fit the data slightly better, but the lnL has come close to a plateau (and –100 was a distant boundary set to allow the optimization to complete). The fit of the data to model is now very close. The spectral analysis interpretation is corroborated by the i.r. ML tree models predicting larger frequencies for the site patterns corresponding to internal edges than those observed in the data.

    The spectra with and without inclusion of sites with missing data are similar. This will often be the case. The expectation is that the greater amount of data used by incorporating partial patterns will reduce the effect of sampling error and, in doing so, also make the systematic error appear more clearly in the spectra. To calculate the variance-covariance matrix of spectra (Waddell et al. 1994) based on ML estimates of s(T) with partial patterns, methods like those in Waddell and Hendy (1997) may be used.

    Technical Issues

    A major concern is that sites with missing data may have different site-pattern probabilities from those without missing data (for example, indels tend to occur at sites with relaxed negative selection, so homoplasy in site patterns is expected to be greater). This factor will also affect any i.i.d. model–based method used for reconstructing the data. For example, the model assumes all sites are drawn from the same underlying distribution (Steel et al. 1993; Waddell, Penny, and Moore 1997), and the likelihood of partially complete patterns will be calculated according to this expectation. If gapped sites do come from a different distribution, mixed-model techniques are useful after sorting sites into sets that are closer to i.i.d. expectations.

    A problematic situation occurs when most or all of the site patterns are incomplete. Consider for example, the following data set, ??RR(2), YRR?(1), and RR??(2), with the frequencies indicated in the brackets. Two sets of site patterns explain the data with the fewest distinct patterns. The first, YRRR(3) plus RR??(2), has likelihood equal to (3/5)3(2/5)2 = 27 x 4/55. The second, RRRR(4) and YRR?(1,) has likelihood equal to 256 x 1/55. Clearly the latter has the highest likelihood and is the desired solution. In these cases, the ML solution involves comparing the likelihood of different sets of site patterns, which may be a non-deterministic polynomial-time complexity (NP) complete problem analogous to finding the largest possible clique of characters.

    The unconstrained methods with missing data described above works with codon (Muse and Gaut 1994) or paired-sites RNA models (Sch?niger and von Haeseler 1994). Note, in these cases, the character is no longer a single site, and it is character-pattern probabilities that are being calculated. If a mixed model is being used, then the calculation with missing site patterns should be repeated for each partition of the data that assumes a different process.

    Discussion

    The use of fit statistics allowing for missing data is especially appropriate in the analysis of long, noncoding genomic regions. Here, sites are expected to behave close to i.i.d., and missing data will often be caused by randomly occurring indels. These indels tend to affect a large fraction of all sites, but the affected sites still offer a good deal of important partial information when the indels do not affect more than one or two taxa.

    One situation, to perhaps be avoided, is estimating fit of data to model when there are long regions with just one sequence showing nucleotides. In this case, the modification to other site-pattern probabilities tends to be dominated by the base composition of such regions, something that may not be desirable. Thus, a threshold of missing data may be desirable, for example, use only sites in the final alignment with at least 50% of possible information present.

    Hadamard conjugations, with which the new technique was illustrated, may be considered a generalized distance method (as well as an ML method). In practice, pairwise distance methods are also very popular (Swofford et al. 1996). It is sometimes advisable to remove sites with gaps, because if these sites evolve faster than other sites, they may cause distances based on a greater than average proportion of them to be biased towards overestimation (and vice versa). The present method of inferring states for all sites may offer robustness against this effect and, thus, may be a useful preparatory step for distance analyses.

    The new methods developed here with sequence data are also applicable to morphological data sets. An added concern with such data is that the frequencies of the "uninformative" character patterns "all the same" or "one different" tend to be very badly estimated (or not at all), causing a problem for the current method. One solution is to estimate these character (site) pattern probabilities directly as parameters of the evolutionary tree model (see Worked Example), and put estimates of the frequencies of these character patterns back into the data matrix before estimating the unconstrained likelihood. It is more of a problem if singleton characters are poorly estimated, but these, too, may be optimized if needed (e.g., by the BIC criteria).

    An example of where the present method will have utility is in comparing the relative fit of different data sets as a guide as to those trees that are most reliable. For example, new phylogenies of mammals (e.g., Waddell, Okada, and Hasegawa 1999) fit the latest data much better than earlier phylogenies. The largest alignments containing good taxon sampling so far assembled to assess this question are those of Waddell, Kishino, and Ota (2001) comparing nuclear and mitochondrial data. The mtDNA-based data in figure 1a of that work contain 54 taxa by 3,516 codons, of which 3,500 (99.5%) contain no missing data in any aligned sequence, and of the total 54 x 3516 = 189,864 encoded amino acids, only 17 (0.009%) are ambiguous. For the nuclear data sets of figures 1b–d of Waddell, Kishino, and Ota (2001), the numbers are 45 x 925 codons (530, or 57.3% of codon columns, and total ambiguous just 1.2%), 22 x 1222 codons (631, or 51.6%, and total ambiguous just 7.2%), and 47 x 2020 codons (191, or 9.5%, and total ambiguous just 11.3%), respectively. Thus, many of the codon sites are affected, despite what are often relatively minor amounts of ambiguous data. The situation is worst for the concatenated data set of figure 2a of Waddell, Kishino, and Ota (2001), though perhaps that most rich in phylogenetic information, with 34 x 7999 codons in total, yet 0% of codons are complete in all taxa and total missing/ambiguous is 11.8%.

    The method described for constructing f and s vectors with missing data should be particularly useful in analysis of noncoding DNA. Even between closely related and easily aligned sequences, there tend to be many short deletions, and removing these sites can remove much of the data. These are also situations in which spectral methods based on Hadamard conjugations can "visualize" structure in the data very well (e.g., Hendy, Penny, and Steel 1994; Waddell 1995).

    In summary, this paper describes a method of assessing the absolute fit of data to model even when there are missing data. It is hoped that the method will encourage researchers to gauge which data sets best fit the model, even if fast approximations to the G statistic, such as those described in Waddell (1995), need to be used in place of the full simulations, such as those described in Reeves (1992) and Goldman (1993) and above.

    Acknowledgements

    Thanks to Joe Felsenstein, Thomas Buckley, and Hari Shri Palakurthi for helpful comments.

    References

    Adachi, J., and M. Hasegawa. 1996. MOLPHY version 2.3 manual. The Institute of Statistical Genetics, Tokyo, Japan.

    Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Royal Stat. Soc. B 39:1–38.

    Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401–410.

    ———. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.

    Goldman, N. 1993. Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182–198.

    Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.

    Hendy, M. D., and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297–309.

    ———. 1993. Spectral analysis of phylogenetic data. J. Classif. 10:5–24.

    Hendy, M. D., D. Penny, and M. A. Steel. 1994. Discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA 91:3339–3343.

    Kishino, H., Miyata, T., and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151–160.

    Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715–24.

    Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803.

    Neyman, J. 1971. Molecular studies of evolution: a source of novel statistical problems. Pp. 1–27 in S. S. Gupta and J. Yackel, eds. Statistical decision theory and related topics. Academic Press, New York.

    Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1995. Numerical recipes in C: the art of scientific computing. 2nd edition. Cambridge University Press, Cambridge.

    Read, T. R. C., and N. A. C. Cressie. 1988. Goodness-of-fit statistics for discrete multivariate data. Springer-Verlag, New York.

    Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J. Mol. Evol. 35:17–31.

    Sch?niger, M., and A. von Haeseler. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240–247.

    Steel, M. A., L. Székely, P. L. Erd?s, and P. J. Waddell. 1993. A complete family of phylogenetic invariants for any number of taxa under Kimura's 3ST model. NZ J. Bot. 31:289–296.

    Stuart, A., and J. K. Ord. 1991. Kendall's advanced theory of statistics, Vol. 2. Distribution theory. Classical inference and relationship. 5th ed. Edward Arnold, London.

    Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 450–572 in D. M. Hillis and C. Moritz, eds. Molecular systematics. 2nd edition. Sinauer Associates, Sunderland, Mass.

    Waddell, P. J. 1995. Statistical methods of phylogenetic analysis: including Hadamard conjugations, LogDet transforms, and maximum likelihood. Ph.D. thesis, Massey University, New Zealand.

    Waddell, P. J., and M. D. Hendy. 1997. Families of order 2t–1 bipartition invariants under the generalised Kimura 3P model. Information and Mathematical Sciences Report, Series B: 97/01, Massey University, New Zealand.

    Waddell, P. J., H. Kishino, and R. Ota. 2001. A phylogenetic foundation for comparative mammalian genomics. Genome Informat. Ser. 12:141–154.

    Waddell, P. J., N. Okada, and M. Hasegawa. 1999. Progress in resolving the interordinal relationships of placental mammals. Syst. Biol. 48:1–5.

    Waddell, P. J., and D. Penny. 1996. Evolutionary trees of apes and humans from DNA sequences. Pp. 53–73 in A. J. Lock and C. R. Peters, eds. Handbook of symbolic evolution. Clarendon Press, Oxford, UK.

    Waddell, P. J., D. Penny, M. D. Hendy, and G. Arnold. 1994. The sampling distributions and covariance matrix of phylogenetic spectra. Mol. Biol. Evol. 11:630–642

    Waddell, P. J., D. Penny, and T. Moore. 1997. Extending Hadamard conjugations to model sequence evolution with variable rates across sites. Mol. Phylogenet. Evol. 8:33–50.(Peter J. Waddell)

http://www.100md.com/html/DirDu/2006/09/20/17/60/49.htm