Nucleic Acids Research, 2004, Issue 18
A dynamic, web-accessible resource to process raw microarray scan data
     Ouest genopole, Institut du Thorax, Institut National de la Santé et de la Recherche Médicale (UMR 533), Faculté de Médecine, 44035 Nantes, France, 1 Centre National à la Recherche Scientifique (UMR 6061), Rennes, France and 2 Laboratoire d'Informatique de Nantes Atlantique, 44322 Nantes, France

    * To whom correspondence should be addressed at INSERM U533, Faculté Médecine, BP 53508, 44035 Nantes cedex 1, France. Tel: +33 240412957; Fax: +33 240412950; Email: nolwenn.lemeur@nantes.inserm.fr

    ABSTRACT

    We propose a freely accessible web-based pipeline, which processes raw microarray scan data to obtain experimentally consolidated gene expression values. The tool MADSCAN, which stands for MicroArray Data Suites of Computed ANalysis, makes a practical choice among the numerous methods available for filtering, normalizing and scaling of raw microarray expression data in a dynamic and automatic way. Different statistical methods have been adapted to extract reliable information from replicate gene spots as well as from replicate microarrays for each biological situation under study. A carefully constructed experimental design thus allows the detection of outlying expression values and the identification of statistically significant expression values, together with a list of quality controls with proposed threshold values. The integrated processing procedure described here, based on multiple measurements per gene, is decisive for reliably monitoring subtle gene expression changes typical for most biological events.

    INTRODUCTION

    During the last decade, cDNA microarray technology has been extensively applied to determine gene expression levels in diverse tissues, animals and diseases, at high throughput levels. As a result of the increasing knowledge of several genomes (especially the human genome), thousands of gene fragments have been spotted or in situ synthesized to globally monitor various gene expression situations. Attention has been paid to mathematical (statistical) methods pertinent for the analysis, organization and handling of the enormous quantities of gene expression data. When comparing different situations like patients and controls, or when analyzing ontogenic or kinetic events, the challenge is to identify the combinatorial and hierarchical complexity of the gene expression profiles. Parallel to those efforts on interpretation of the data, other studies have aimed to identify the different physical and biological factors which have to be controlled to improve the reliability of massive gene expression studies (2–4). The multiple experimental steps involved in microarray procedures are sources of often poorly controlled variation, which is superimposed on the biological variation under study. Experimental variation along the successive steps of preparation, purification and labeling of RNA samples, as well as the hybridization conditions, is inherent to all microarray experiments. Mechanical and optical distortions could locally or globally influence the raw expression values obtained after microarray image scanning. In addition, other factors like intrinsic heterogeneity, conditioning parameters and even erratic contamination of the biological samples may modify the gene expression results. To compensate for (and/or better evaluate) the importance of these composite experimental and biological noises in microarray experiments, diverse numerical treatment procedures of the raw microarray scan image values and quality measures have been proposed. 
These procedures include filtering of bad spots following segmentation methods, normalizing between two channels (or signal scaling within monocolor microarrays), comparative scaling between different chips, and diverse statistical methods for selecting (ranking) differentially expressed genes (1,5). The most reliable way of evaluating the ratios between the different experimental noises and the biological signals is to produce replicate gene measurements within each microarray and to hybridize replicate microarrays with replicate targets obtained from the same biological samples. The metrological importance of such replications in microarray gene expression studies (6–9) casts doubts on the numerous microarray analyses performed with only single gene spots and/or without sample replicates. The high cost associated with microarrays does not justify the metrological insufficiencies of any experiment. Access to high throughput spotting robots depositing up to 25 000 spots per chip, combined with a careful selection of a few thousand theme-relevant genes, now allows the use of such noise-informative chips and the design of corresponding reliable experiments.

    Here, we present a freely accessible web-based pipeline which processes raw microarray scan data to obtain experimentally consolidated gene expression values. The proposed module, MADSCAN, MicroArray Data Suites of Computed ANalysis (http://www.madtools.org), makes a practical choice among the numerous methods available for filtering, normalizing and scaling microarray data, in a dynamic and automatic way. Using a careful experimental design with replication information, we present different numerical and statistical methods, detection of outlying expression values and data integration with a list of quality controls, through proposed threshold values. A tutorial for MADSCAN is included on the website.

    MATERIALS AND METHODS

    Biological samples

    Cardiac tissue was obtained from the left ventricle of explanted hearts from two male patients who underwent heart transplantation. One patient (experiment 1) was affected by idiopathic dilated cardiomyopathy; the other patient (experiment 2) was affected by valvular heart disease and coronary artery disease. Both samples were compared to a common reference that was obtained from pooled RNAs extracted from the left and right ventricles of explanted hearts from 47 end-stage heart failure patients (G. Lamirault, N. Le Meur, M.F. Le Cunff, C. Chevalier, I. Guisle, A. Bihouée, J.J. Léger and M. Steenman, manuscript in preparation).

    RNA isolation and labeling

    Total RNA was isolated using TRIZOL® Reagent (Life Technologies). Two parallel RNA extractions from two different samples (spatially separated) of the same tissue were performed. Poly A+ RNA was isolated using the Oligotex mRNA kit (Qiagen) and quality was assessed using an Agilent 2100 bioanalyzer. Cy3- and Cy5-labeled cDNA was prepared using the CyScribe cDNA Post Labelling Kit (Amersham Pharmacia Biotech). Samples from the two patients under study were each labeled individually with Cy3. The reference pool was labeled with Cy5. No dye-swap was used.

    Microarrays

    Microarrays were prepared in-house using 50mer oligonucleotide probes (MWG Biotech). The probes were arrayed onto epoxy-silane-coated glass slides using the Lucidea printer from Amersham. The 4217 genes represented on the microarray were selected for involvement in skeletal muscle and/or cardiovascular normal and pathological functioning (10–13). Gene selection was based on (i) subtractive hybridization experiments, (ii) genome-wide microarray hybridizations, (iii) literature data. Each Cy3-labeled sample was mixed with equal amount of Cy5-labeled sample, pre-incubated with human Cot-I DNA (Gibco-BRL), yeast tRNA and poly A+ RNA, and hybridized to the microarrays. Two independent hybridizations were performed for each RNA sample, leading to four hybridizations per patient. Hybridized arrays were scanned by fluorescence confocal microscopy on a ScanArray 4000XL (GSI Lumonics, Downers Grove, IL) at laser power ranging from 75 to 100% and photo-multiplier tube gain settings ranging from 65 to 100%. Measurements were obtained separately for each fluorochrome at 10 μm/pixel resolution.

    Microarray data acquisition

    Signal intensities were extracted with the GenePix Pro 5.0 image analysis software (Axon Instruments, USA). Segmentation of the spots was done using the adaptive approach. Segmentation criteria were optimized visually for each slide. Alternate standard deviation (SD2) was chosen to quantify background SD. This setting uses the median of the background pixels as an estimator of the center of the distribution. This method is less susceptible to skewing by very bright pixels. Our data processing procedure uses background corrected median intensities; the given ratio corresponds to the ratio of background corrected median intensities. For further quality controls (see the preprocessing step in Results and Discussion, and in the tutorial), analytic parameters provided by the GenePix Pro 5.0 image analysis software were used. Other image analysis software like Quantarray (PE. Packard Biochip technologies, USA), Imagene (BioDiscovery, USA) or ScanAlyze (http://rana.lbl.gov/, Stanford University, USA) also deliver the minimal set of parameters required to perform the MADSCAN procedure. For a comparison between different image analysis software, see http://cardioserve.nantes.inserm.fr/ptf-puce/publications.php. Analysis files issued from different types of image software can be reformatted following procedures noted in the MADSCAN tutorial.
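
    As an illustration of how the M-values used throughout this work relate to the background corrected median intensities, the following minimal Python sketch computes M and A for one spot. The function name, the argument names and the orientation of the ratio (Cy3 sample over Cy5 reference) are assumptions made for this example; they are not taken from the MADSCAN code.

```python
import math

def ma_values(fg_cy3, bg_cy3, fg_cy5, bg_cy5):
    """Return (M, A) for one spot, or None if a corrected intensity is not positive.

    fg_* are median foreground intensities and bg_* the local background
    estimates (hypothetical argument names).
    """
    cy3 = fg_cy3 - bg_cy3  # background corrected median intensity, Cy3
    cy5 = fg_cy5 - bg_cy5  # background corrected median intensity, Cy5
    if cy3 <= 0 or cy5 <= 0:
        return None  # the log ratio is undefined for such spots
    # Assumed orientation: sample (Cy3) over reference (Cy5).
    m = math.log2(cy3 / cy5)
    # A is the mean log2 intensity, the x-axis of an M-A plot.
    a = 0.5 * math.log2(cy3 * cy5)
    return m, a
```

    With this convention, a spot whose corrected Cy3 signal is twice its corrected Cy5 signal has M = 1, and the expression ratio is recovered as 2^M.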

    Power study

    A power study of a standard t-test was performed on the heart expression data set of experiment 1 (four replicate spots for each oligonucleotide and four replicate chips). Only genes with at least two valid M-values for each array were selected for power analysis. We thus selected 3804 genes with reliable M-values. The ‘power t-test’, which is implemented in the ‘ctest’ package of R (14) was applied in five replication conditions: 4, 6, 8, 12 or 16 replicate M-values. Parameters of the power test were defined as follows:

    For each gene the mean level of differential expression between the two fluorochromes (δ) was defined as the arithmetic mean of the four arithmetic means of the 4 M-values in each of the quadruplicate chips.

    The parameter for data variability (SD) was arbitrarily set as identical for all genes. SD was calculated as the median of the 3804 SDs determined from the replicate M-values for each gene.

    Significance level (α) was a priori set to 0.05, but a Holm correction (15) was applied to α for each gene in order to account for multiple hypothesis testing. The 3804 genes under analysis were ranked according to their individual P-values, by application of a standard one-sample t-test. On the basis of the calculated rank of the corresponding gene, the basal α-value of 0.05 was then corrected individually.
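
    The rank-based correction of the significance level can be sketched as follows. This is a minimal Python illustration of the usual Holm step-down thresholds (the gene ranked k among m genes receives the basal level divided by m - k + 1); the function name is hypothetical and the exact implementation used in this study may differ.

```python
def holm_alpha(p_values, alpha=0.05):
    """Return the Holm-corrected significance level for each gene, in input order."""
    m = len(p_values)
    # Indices of the genes sorted by increasing P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    corrected = [0.0] * m
    for rank, idx in enumerate(order, start=1):
        # The smallest P-value gets alpha / m, the largest gets alpha.
        corrected[idx] = alpha / (m - rank + 1)
    return corrected
```

    For three genes with P-values 0.03, 0.001 and 0.2, the corrected levels are 0.025, 0.05/3 and 0.05, respectively.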

    Using the parameter values defined above for each gene, the individual power test was performed on the basis of one sample and a two-sided t-test. Power values of (1 – β) were deduced for each gene in the five replication conditions analyzed. To evaluate the effect of between-gene differences in sampling variation on the power test values, the power test was calculated with three different values of SD in two particular replication conditions (4 and 12 replicates). The three different SD values were defined as first quartile, median (as earlier) and third quartile of the 3804 SDs previously calculated. Other parameters of the power test were left unchanged. Power values (1 – β) were calculated for both replication conditions and the three variability levels.
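
    To give a feeling for how the power depends on the mean differential expression, the SD and the number of replicates, the sketch below computes an approximate power for a one-sample, two-sided test. It deliberately uses a normal approximation from the Python standard library rather than the non-central t distribution on which the power t-test of R relies, so it tends to overestimate the power for small numbers of replicates; the function and parameter names are assumptions for this example.

```python
from statistics import NormalDist

def approx_power(delta, sd, n, alpha=0.05):
    """Approximate power (1 - beta) of a one-sample, two-sided test.

    Normal approximation only: not equivalent to a t-based power
    calculation when n is small.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = abs(delta) * n ** 0.5 / sd          # standardized effect size
    # Probability of rejecting in either tail under the alternative.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
```

    Calling `approx_power(0.2, 0.3, n)` for n = 4 and n = 16 (arbitrary illustrative values of the mean M and SD) shows the expected increase in power with replication; with a mean M of zero the function returns the significance level itself.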

    Estimation of false positive and false negative rates

    The false positive (FP) rate is the proportion of negative cases that were incorrectly classified as positive in the predicted condition compared to the experimental observation. The false negative (FN) rate is the proportion of positive cases that were incorrectly classified as negative in the predicted condition compared to the experimental observation. Genes differentially expressed between the heart expression data sets of experiments 1 and 2 were first identified by a SAM modified two-class t-test (16), using 16 (4 within- and 4 between-chip) replicates for each data set. The number of differentially expressed genes was then determined based on six different replication conditions: 4, 8 or 12 replicates with various proportions of within- and between-chip replicates (see Supplementary Material). Six different two-by-two confusion matrices (17) were built to determine the FP and FN rates in the six simulated replication conditions with regard to the experimental situation based on 16 replicates.
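
    The FP and FN rates defined above follow directly from a two-by-two confusion matrix. The minimal Python sketch below takes the calls obtained with the full set of replicates as the reference and the calls obtained in a reduced replication condition as the prediction; the function name and the dictionary-based data layout are assumptions for this example.

```python
def fp_fn_rates(reference_calls, predicted_calls):
    """Return (FP rate, FN rate).

    Both arguments map a gene identifier to True (differentially
    expressed) or False (not differentially expressed).
    """
    tp = fp = tn = fn = 0
    for gene, truth in reference_calls.items():
        predicted = predicted_calls[gene]
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif not truth and predicted:
            fp += 1
        else:
            tn += 1
    # FP rate: negatives wrongly called positive; FN rate: positives missed.
    fp_rate = fp / (fp + tn) if (fp + tn) else float("nan")
    fn_rate = fn / (fn + tp) if (fn + tp) else float("nan")
    return fp_rate, fn_rate
```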

    MADSCAN implementation

    MADSCAN was written in R (14) and Perl. A user-friendly web-interface was implemented in PHP to allow easy access (http://www.madtools.org) and rapid handling of data on our local server (PowerEdge 4600, Dell, USA). Access is obtained through a password, given to any requester. The raw microarray data are uploaded as compressed tabulated text files. MADSCAN analysis can be done either step by step or from A to Z, i.e. one can either apply one test at a time or all steps in a single and complete procedure. The results can be downloaded from the web-interface, where a window of results displays a summary of the performed procedure. Alternatively, MADSCAN results can be recovered through an e-mail service.

    RESULTS AND DISCUSSION

    Outline of analysis procedure

    Our goal was to provide a practical, accessible, integrated suite of different analytic procedures for the handling of raw data issued from two-fluorochrome (color) image scanning of microarray glass slides and to obtain consolidated expression values. According to the MIAME (Minimum Information About a Microarray Experiment) glossary, data processing means ‘the set of steps taken to process the data, including the normalization strategy and the algorithm used to allow comparison of all data’ (18). Draghici (5) defined preprocessing as the initial step that extracts and enhances meaningful data characteristics from raw data files from scanned images. Preprocessing prepares the data for the application of successive procedures or analytical methods. Using tabulated text files of raw microarray image values issued from widely used scanners and related image analysis software, we have developed a four-step procedure to transform the raw data into consolidated, robust expression values: the first three steps concern each individual chip, whereas step 4 integrates the expression values issued from replicate chips if available (Figure 1).

    Figure 1. MADSCAN procedure steps within each chip and between replicate chips.

    The proposed integrated tool has been constructed around a few now well-accepted main analytic steps to numerically handle microarray image values within one chip and between replicate chips. Most of the used algorithms or methods—such as the rank invariant method (19), lowess fitness normalization (20,21) and outlier detection (although never used for microarrays) (22,23)—have been defined and documented by others. The corresponding methods/algorithms have been reformatted in a plug-in architecture system to make the whole numerical procedure reliable and fluent. Algorithmic approaches chosen for each step and modifications or adaptive processes made along the procedure are described in the following subsections. The computational tool, MADSCAN, is freely accessible via a web server (http://www.madtools.org), where detailed information is available in a tutorial.

    The importance of the experimental design

    Before describing the different steps of the MADSCAN procedure, we addressed the important issue of the reproducibility of microarray experiments. We proposed a ‘reference design’ with an experimental procedure based on replicate spots within each microarray, and replicated microarrays for two spatially separated samples from each tissue, compared in a hybridization to a reference sample (Figure 2). The replicate spots are issued from different print-tips and are therefore printed in different array blocks. This procedure allows the evaluation of the importance of the biological noise—due to sample heterogeneity—and numerous experimental noises. The latter could arise from variations in the molecular biology procedures for the extraction and labeling of RNA samples (e.g. dye quality, or possibly dye-swapping), from physical distortions in glass slides and from the scanner (optical irregularities in the laser performances and in the excitation of the fluorochromes). To be able to take into account such composite noises from one chip, it is obvious that microarrays need to contain at least triplicate spots. This allows the statistical evaluation of the internal variability of the signals corresponding to each gene (oligonucleotides or PCR products) and the detection of outlying values within one chip. Furthermore, a minimum of four (two pairs) replicate chips is necessary to evaluate the variations between the two independent RNA samples issued from the same tissue (6,8). The within-chip replicates reveal only technical noises, whereas the between-chip replicates give information on both technical and biological noises. Microarray experiments as designed in Figure 2 actually allow simultaneous measurements of the different technical noises, together with that of the biological signal under investigation. In addition, randomized print-tip usage allows a non-uniform distribution of the replicate spots throughout the array. 
Together with a randomization of the numerous experimental procedures and the use of replicates, this is crucial to obtain statistically significant data.

    Figure 2. Experimental design. Two independent RNA samples (a and b) from the same tissue, replicated spots within one chip and replicated chips for one biological point are necessary to discriminate between the signal under study and those due to the inherent experimental noises.

    Preprocessing of raw data files from image scan analysis

    Whatever the type of scanner or related software used, MADSCAN starts with tabulated text files composed of at least eight columns per hybridized microarray. These columns contain information on block designation, gene name, gene ID or annotation, measured intensities in both channels, local (or equivalent) background intensity values for both channels and image analysis software flags for a first determination of spot quality (diameter deviation of the spot, location). Nomenclature and gene annotation have to be carefully formatted. Replicate spots (if present) must be precisely annotated to be identified as such during the data processing procedure.

    Physical validation or quality filtration

    The overall quality of the raw image data (before any filtering) is calculated for each print-tip group (block) of spots according to the median values and the variation coefficients of signal and background intensities. Spot diameters and their SD are also determined. In spite of the importance of assessing spot quality, relatively few image analysis software packages provide systematic quality filtration (5). MADSCAN offers physical validation and quality filtration step by step, following a decision tree with a scoring procedure based on successive exclusion thresholds. Each feature is thus tested against a series of quality criteria (image analysis flags, signal-to-noise level, diameter variation and saturation level). Five different arbitrary scores can be attributed according to the spot quality. Score 0 is used for flawed spots whereas most of the good spots have score 2. Scores 3 and 4 are attributed to spots that are partially saturated for one of the channels. For those spots, the expression ratio is calculated from the regression ratios between the intensities of each pixel composing the spot. Score 5 is attributed to features partially saturated in both channels and their expression ratio is calculated as the ratio of their percentages of saturation. Fully saturated spots in both channels are flagged because no reliable information on the pixel values and distribution is available (see tutorial for details). Chips made in-house contain 15 000 to 20 000 spots. Using our conditions for hybridization and image scanning, 5–8% of the spots are flagged (=score 0) whereas 92–95% of the spots pass the quality control criteria. The percentage of partially saturated spots (score 3–5) is generally relatively low (0.05–0.1% of the spots).

    Within-chip normalization step

    Normalization issues have been addressed early in the development of microarray data treatment (19–21,24). Normalization is considered an essential step to minimize experimental systematic and random biases, arising from technical variations inherent to high throughput and complex experiments. The main aspects of any normalization process are whether or not to select a set of reporter (invariant) genes as a reference for the normalization process and whether or not to consider spatial and intensity value-dependent biases. Since most microarrays contain several thousands of spots, and since hybridization values are mostly distributed in an equilibrated (pseudo-gaussian) way in experiments comparing test and control tissues, we chose to adapt the rank invariant method developed by Tseng et al. (19) in our procedure. A set of invariant spots or non-differentially expressed genes (if no replicates were spotted) is a posteriori selected from all validated expression ratios for each chip. The rank of Cy3 and Cy5 intensities of each gene on the chip is computed separately. If the ranks of the two intensities for a given gene differ by less than a fixed threshold and the rank of their averaged intensity is not among the highest or lowest ranks, this gene is classified as a non-differentially expressed gene. Figure 3A shows an M–A plot, with a selection of such invariant spots, following the application of the rank invariant method. The invariant spots in Figure 3A are sandwiched between the differentially expressed genes. As has already been described (20), the distribution of expression ratios is intensity dependent and therefore a non-linear normalization method must be used. The lowess fitness method, using the set of identified invariant genes, has been incorporated in our MADSCAN procedure. 
To assess the efficiency (robustness) of coupling both rank invariant and lowess fitness methods, we removed all identified putative differentially expressed spots (genes) from the original raw expression file. We then applied both complementary methods for normalization on the reduced raw expression files. Figures 3B and 3B' show the new set of invariant genes and the raw and normalized expression data, respectively. Eighty-five percent of the invariant genes selected before and after the file reduction are identical. A very strong correlation coefficient of 0.99 is observed between both sets of independently normalized expression values (Figure 3C). This is obviously due to the high number of invariant spots (genes) present (50% of the total number of spots).
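
    The selection criterion described above can be sketched in a few lines of Python. This is a single-pass illustration of the rule as stated in the text (rank difference below a threshold, averaged intensity not among the extreme ranks), not the adapted procedure implemented in MADSCAN; the parameters `max_rank_diff` and `trim` are hypothetical and their values would have to be tuned.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def rank_invariant(cy3, cy5, max_rank_diff, trim):
    """Return the indices of spots classified as non-differentially expressed.

    max_rank_diff: rank differences must be strictly below this value.
    trim: number of lowest and highest averaged-intensity ranks to exclude.
    """
    n = len(cy3)
    rank_cy3 = ranks(cy3)
    rank_cy5 = ranks(cy5)
    rank_avg = ranks([(a + b) / 2 for a, b in zip(cy3, cy5)])
    invariant = []
    for i in range(n):
        similar_rank = abs(rank_cy3[i] - rank_cy5[i]) < max_rank_diff
        not_extreme = trim < rank_avg[i] <= n - trim
        if similar_rank and not_extreme:
            invariant.append(i)
    return invariant
```

    The spots returned by such a selection would then serve as the reference population for fitting the intensity-dependent normalization curve.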

    Figure 3. M–A plots before (A and B) and after (A' and B') global lowess normalization, using rank invariant spots. The spots that are potentially differential in graphs (A) and (A') were eliminated for the determination of invariant spots used for further data normalization in graphs (B) and (B'). ‘With’ and ‘without’ refer to the presence or absence, respectively, of potentially differentially expressed genes. The presented expression values were from experiment 1. (C) Represents the correlation between the 85% invariant genes, common to the gene populations in graphs A' and B'.

    As described by Yang et al. (21), the use of a spatial approach is also crucial. The signal as well as the background intensity is often heterogeneous within a slide. This is due to the unavoidable spot dispersion over a relatively large surface, the use of several spotting pins and possible geometrical variations within glass slides. An additional refinement of the normalization procedure thus has to be applied to chips containing more than a few thousand spots. A normalization procedure per zone, usually print-tip group, allows the correction of spatially dependent dye biases and probe delivery variations between the different pins as well as other geometrical and optical defects. Practically, in MADSCAN, the normalization is first attempted pin-by-pin (print-tip group), then by proximal or global approach, depending on the number of invariant spots present within individual blocks, contiguous blocks or the whole chip, respectively. We found that at least 50% of invariant genes among all genes under analysis are needed to obtain a robust normalization curve. To illustrate the procedure used and the importance of spatial normalization, lowess normalization procedures were applied based on invariant spots selected from either individual blocks (individual print-tip) or proximal blocks or all blocks in a 48-block chip with 420 spots per block. Five individual lowess fitness curves arbitrarily chosen among the 48 different ones obtained in each spatial condition are graphically represented in Figure 4. It is easily seen that the best superposition of the five curves is obtained when the rank invariant method was applied pin-by-pin rather than using proximal blocks or all blocks.

    Figure 4. Comparisons between global, proximal and local normalization procedures. Five individual lowess fitness curves corresponding to five arbitrarily chosen blocks (asterisks) are represented according to each of the three spatial normalization modes. Light gray blocks represent an example of blocks chosen for selection of invariant genes, to normalize the raw M-values in the dark gray block, in each of the three modes. Invariant genes within the dark gray block are part of the invariant population used in each mode. The superposition of the five selected curves shows how uncontrolled local variations may influence the final expression values. The expression values presented here were from experiment 2.

    Scaling and outlier detection

    In a metrologically controlled experiment, as described in Figure 2, the presence of replicated features within each slide and of replicated slides for each biological sample allows a statistical validation of the expression results after the first three steps of the procedure (Figure 1). First, scaling procedures have to be applied to bring the variances of filtered and normalized expression values between replicated chips to the same variation level (5,20). Outlying values within the series of expression values obtained for each gene from several spots can then be identified. Because of the low number of replicates in microarray experiments, we propose to apply modified statistical tests. A z-score modified by MAD is used to detect outliers within and between slides. In the MADSCAN procedure, we have implemented both the MAD and the ESD (Extreme Studentized Deviate or Grubbs' test) procedures (22,23). The procedure for detecting outliers requires a minimum of three replicated values. The replicates may be within chips and/or between chips. Outlier detection can be applied iteratively with both tests, until no further outliers are detected.
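
    A MAD-modified z-score with iterative removal can be sketched as follows in Python. The scaling constant 0.6745 and the cutoff of 3.5 are the values commonly quoted for this statistic and are assumed here, since the text does not state the thresholds applied by MADSCAN; the function name is hypothetical.

```python
from statistics import median

def mad_outliers(values, cutoff=3.5, min_n=3):
    """Iteratively remove outliers using the MAD-modified z-score.

    Returns (kept, removed). Fewer than min_n values are left untouched.
    """
    kept = list(values)
    removed = []
    while len(kept) >= min_n:
        med = median(kept)
        mad = median(abs(v - med) for v in kept)
        if mad == 0:
            break  # no spread to scale by; stop rather than divide by zero
        scores = [0.6745 * abs(v - med) / mad for v in kept]
        worst = max(range(len(kept)), key=lambda i: scores[i])
        if scores[worst] <= cutoff:
            break  # no further outlier detected
        removed.append(kept.pop(worst))  # drop the most extreme value, retest
    return kept, removed
```

    Applied to the five replicate M-values 0.10, 0.12, 0.11, 0.09 and 1.50, this sketch removes only the last value and stops at the second iteration.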

    Crucial steps in microarray data treatment

    The presence of replicate spots for each gene on each individual chip and on replicate chips allows the calculation of the within-chip as well as the between-chip coefficients of variation (CV) of the expression ratios (= 2^M), respectively, at each of the four steps described in Figure 1. Figure 5 shows the variations of the CV calculated from the medians of expression ratios for each gene, in a typical experiment involving 2 x 2 replicate chips with four replicate spots for each gene (Figure 2). Step 3, corresponding to the within-chip normalization procedure, is clearly the most decisive step for reducing the absolute value and the variation of the CV. The CV distributions around their median values are approximately Gaussian, even though they are obviously higher for low intensity values (25). First and third quartile values in each of the four CV distributions are central visual elements for evaluating and controlling the quality of the expression values obtained for each individual chip and for replicate chips in the MADSCAN procedures. In contrast, step 4 does not significantly alter the CV values and their relative variations. This has to be related to the very small number of outlying values usually detected for each gene. However, this does not mean that outlier detection and elimination do not play a role in the CV calculations.
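
    For a single gene, the CV of the expression ratios can be obtained from its replicate M-values as in the short Python sketch below. The use of the sample standard deviation (n - 1 denominator) and the function name are assumptions for this example.

```python
from statistics import mean, stdev

def ratio_cv(m_values):
    """Coefficient of variation of the expression ratios 2^M of one gene."""
    ratios = [2 ** m for m in m_values]  # back-transform log2 ratios
    return stdev(ratios) / mean(ratios)  # sample SD divided by the mean
```

    The same function applies to within-chip replicates (M-values of the replicate spots of one chip) and to between-chip replicates (one summary M-value per chip).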

    Figure 5. Decrease of the coefficients of variation of expression ratios, along the different MADSCAN analysis procedure steps. The expression values were from experiment 2, using 2 x 2 replicate chips with four replicate spots for each gene. The ‘box’ in a ‘box and whisker’ plot shows the median of the values as a line, the mean as an asterisk and the first (25th percentile) and third quartile (75th percentile) of the expression values distribution as the lower and upper parts of the box, respectively. The ‘whiskers’ shown above and below the boxes represent the largest and smallest observed values, respectively, that are less than 1.5 box lengths (interquartile range) from the end of the box. When the box is in the middle of the whiskers, the data are probably more evenly distributed (steps 3 and 4). Steps 1 to 4 are as in Figure 1.

    Spot replicates and the detection of subtle expression changes

    The robustness of the proposed ‘reference design’ with within- and between-chip replicates is illustrated by means of (i) a power analysis (8) performed using 4, 6, 8, 12 or 16 replicate M-values and (ii) an estimation of FP and FN rates at different replication conditions (4, 8 or 12 replicates) as compared to the 16 replications experimentally used.

    Power study

    Power values (1 – β) in each replication condition are plotted against the mean level of differential expression (δ), which is defined as the arithmetic mean of the four arithmetic means of the 4 M-values in each of the quadruplicate chips, for each of the 3804 analyzed genes (Figure 6). δ represents the most probable (informative) value for the expression ratio for each gene, since it results from the maximum number (in this analysis: 16) of experimental determinations (see Materials and Methods for the definition of the parameters used in the power t-test). Figures 6A and 6B show that two-digit replicates (in this analysis: 12 and 16 replicates) allow the detection of stable changes in the expression ratios as low as 15% (roughly a variation of 0.20 in M) with a probability value lower than 0.05. The methodological sensitivity to detect limited variations in gene expression decreases dramatically when the expression ratios are determined on <6 replicate values. The grayed area between the corresponding power values calculated for the first and the third quartiles (Figure 6C) represents the variations of the SD values of the expression ratios for each gene, deduced from 6 and 12 replicate spots. Only genes with relatively high differential expression levels (|M| > 1.5 at least) show sufficient reproducibility when only six replicates have been used. The present observations on the capacity of replicates to detect limited gene expression changes using DNA chips are in agreement with other studies (6–9). Replicate gene spots, as well as replicate chips, are crucial for reliable monitoring of subtle gene expression changes typical for most biological events. Only large expression changes can be obtained reproducibly from microarray studies performed with chips containing no, or a very limited number of within- and/or between-chip, replicates.

    Figure 6. Validation experiment and power analysis, using replicate spots and replicate chips. The set of expression values for power calculations were from experiment 2. (A) Power values (1 – β) calculated in five replication conditions (4, 6, 8, 12 or 16 replicate M-values) were plotted against δ, the mean level of expression values between the two fluorochromes, which is calculated as the arithmetic mean of the four arithmetic means of the 4 M-values in each of the quadruplicate chips, for each gene. (B) The same results as in (A) for four replication conditions, but zoomed to a smaller x-axis (δ values ranging from –0.6 to 0.6), to underline the capacity of 12 and 16 replicates to detect small gene expression changes. (C) The gray zones around the power values were defined from the power values calculated from the first and third quartiles of all the SD values of the M-values in the 6 and 12 replication conditions. The same results as in (A), but zoomed out to a larger x-axis (δ values ranging from –2.5 to 2.5).

    False positive and false negative rates

    The gain from replication can also be calculated from paired sets of FP and FN rates, determined from collections of differentially expressed genes with variable numbers (<16) of mixed within- and between-chip replicates. Significantly low FP rates were obtained only with repeated hybridizations (chips) (Figure 7). In parallel, increasing the number of within-chip replicates decreased the FN rates. The concomitant use of additional within- and between-chip replicates yielded balanced values of both FP and FN rates. The simulation of 4 slides with 3 replicates per chip generated a tolerable FP rate of 7% and an FN rate of 14%. This replication pattern, which allows the evaluation of both technical and biological variation, seems feasible with regard to labor intensity and cost of chips. Both the FP and FN rate analysis and the power analysis led to coherent conclusions on the importance of replication. This clearly defines the limitations of genome-wide microarrays, which contain many genes but almost no replicates. Any additional experimental variability inherent to other chip designs, particularly to all ‘even designs’ including the use of dye-swaps (26), could be evaluated in the same way.

    Figure 7. Gain from replication. FP and FN rates determined in six simulated replication conditions, relative to the experimental situation based on 16 replicates (N replicates obtained from i chips with j repeated spots each).
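
    The comparison of a reduced replication condition with the 16-replicate situation can be sketched as follows. The definitions used here (FP rate = FP/(FP + TN), FN rate = FN/(FN + TP)) are the usual confusion-matrix ones and are an assumption about the exact formulas given in Materials and Methods. The function name, the gene labels and the toy calls are hypothetical.

```python
def fp_fn_rates(reference_calls, test_calls):
    """Return (FP rate, FN rate) of test_calls relative to reference_calls.

    Both arguments map a gene name to True when the gene is called
    differentially expressed. reference_calls plays the role of the
    full 16-replicate situation.
    """
    tp = fp = tn = fn = 0
    for gene, truth in reference_calls.items():
        called = test_calls[gene]
        if truth and called:
            tp += 1
        elif truth and not called:
            fn += 1
        elif not truth and called:
            fp += 1
        else:
            tn += 1
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    fn_rate = fn / (fn + tp) if (fn + tp) else 0.0
    return fp_rate, fn_rate


# Toy example (hypothetical calls, not data from the study)
reference = {"g1": True, "g2": True, "g3": False, "g4": False, "g5": False}
reduced = {"g1": True, "g2": False, "g3": True, "g4": False, "g5": False}
print(fp_fn_rates(reference, reduced))
```

    Repeating this calculation for each simulated combination of i chips and j spots per chip gives paired FP and FN rates of the kind summarized in Figure 7.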

    Data integration and MADSCAN use

    MADSCAN offers multiple processing steps such as filtration, normalization and outlier detection. Raw scan data can be fully analyzed chip by chip or in chip batches. All procedures can be applied independently, i.e. step by step, or they can be run as a single complete procedure, according to the experimental design. Only the filtration and normalization procedures are applied to single chips without replicate spots or to experiments without replicate chips. All expression data for one experiment are summarized in a consolidated matrix, which allows further comparisons with other data sets in a complete experiment. MADSCAN creates an end file that contains, for each gene: its name, the within-slide median of expression ratios in log2 (M), the within-slide coefficient of variation (CV-M), the between-slide median expression ratio and its CV, and the same data for the geometric mean of the intensities. Intermediate files are accessible for any step. Figure 8A shows the start menu of web-accessible MADSCAN. Figure 8B displays a typical summary of data processing for an example of four replicate chips. A few empirically defined inclusion threshold values for the quality of the chip(s), some statistical parameters on expression ratios and the spatial mode used for normalization are also shown. Detailed information on the quality control parameters is available online (http://www.madtools.org). MADSCAN and its ‘A to Z’ approach were principally developed to handle replicate chip experiments with a ‘reference design’. Other experimental designs, such as ‘time series’, ‘dye-swap’ or ‘loop’ designs, can be analyzed in batches up to and including the normalization step. The final steps of the current online MADSCAN version, i.e. outlier detection and data consolidation, can then be performed step by step after reformatting the normalized data files (for details, see tutorial).

    Figure 8. Web-accessible starting menu (A) and data summary page (B) in the MADSCAN module. For details, see pp. 22 and 27 in the tutorial. Note the definition of threshold values concerning the quality of the chip(s) and related expression measurements.
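
    The per-gene summary written to the end file can be illustrated with a short sketch. It is not the MADSCAN implementation: the function and field names are hypothetical, the CV is taken as SD divided by the absolute mean, and the between-slide statistics are computed from the within-slide medians, all of which are assumptions. The M-values are invented for the example.

```python
from statistics import mean, median, stdev


def cv(values):
    """Coefficient of variation: sample SD divided by the absolute mean."""
    m = mean(values)
    return stdev(values) / abs(m) if m != 0 else float("nan")


def consolidate(gene, slides):
    """Summarize one gene from replicate spots on replicate slides.

    slides: one list of replicate-spot M-values (log2 ratios) per slide.
    """
    within_medians = [median(spots) for spots in slides]
    return {
        "gene": gene,
        "within_slide_median_M": within_medians,
        "within_slide_CV_M": [cv(spots) for spots in slides],
        # Assumption: between-slide values are derived from the slide medians
        "between_slide_median_M": median(within_medians),
        "between_slide_CV_M": cv(within_medians),
    }


# Four replicate chips with four replicate spots each (invented M-values)
example = consolidate("geneA", [
    [1.0, 1.2, 1.1, 0.9],
    [1.1, 1.3, 1.0, 1.2],
    [0.8, 1.0, 0.9, 1.1],
    [1.2, 1.0, 1.1, 1.3],
])
print(example["between_slide_median_M"], example["between_slide_CV_M"])
```

    The same summary would be repeated for the geometric mean of the intensities to complete the columns listed for the end file.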

    CONCLUSION

    The challenge of determining thousands of values of gene expression levels in a parallel but unique way using DNA microarrays forces the biologist of today to reliably manage and analyze a deluge of biological data. During the last few years, many alternative algorithms, based on relatively sophisticated and diverse mathematical methods, have been proposed and validated to successfully transform the image scan raw data into consolidated gene expression data. Based on a careful and pragmatic selection among the numerous methods and software available for filtering, normalizing and scaling the raw microarray data, the web-accessible MADSCAN resource presented here offers a dynamic and automatic procedure to obtain a set of reliable gene expression values. The incorporation of methods for within- and between-chip scaling and outlier detection, together with the online access to quality control parameters, complements the proposed plug-in architecture resource in an original way. A careful experimental design—including multiple measurements for each gene under each biological condition—is clearly central to the evaluation of most experimental noises inherently present in high throughput measurements. The significance and quality of any further biological interpretation—gene clustering, coexpression, etc.—are directly dependent on the reliability and significance of the set of consolidated gene expression values derived from image scan values. Obtaining such an initial set of metrologically relevant chip data is the exclusive scope of the MADSCAN procedure.

    More or less sophisticated computational tools offering various methods for microarray data processing are available today in many commercial and/or academic web-accessible software packages (for a list, see http://ihome.cuhk.edu.hk/~b400559/arraysoft.html or http://genopole.toulouse.inra.fr/bioinfo/microarray/index.php?page=logiciels). In the available software, the steps corresponding to the initial treatment of raw scan data are either limited to some basic and inadequate transformation algorithms (like a linear normalization based on a few house-keeping genes), or consist of numerous sophisticated algorithmic modules, interconnected or independent. In all cases, the biologist has to adjust a series of ‘default’ parameters, more or less adapted to their own experimental design and the variables measured (27). Some knowledge, and even understanding, of the details of the algorithms and languages used is necessary to fully appreciate how such changes in the parameters affect the expression results. To avoid these difficulties, we propose MADSCAN. MADSCAN, which has been successfully tested by diverse users on >2000 chips containing 500 to 24 000 spots, represents an intelligent and powerful tool for the many biologists using DNA chips (12,13,28). The MADSCAN procedure is now plugged into BASE (BioArray Software Environment) (29). Therefore, information on raw image data and their transformation into consolidated expression values is accessible to the entire scientific community, in agreement with the most recent recommendations of the MGED consortium (18).

    SUPPLEMENTARY MATERIAL

    Supplementary Material is available at NAR Online.

    ACKNOWLEDGEMENTS

    We thank Catherine Chevalier, Isabelle Guisle and Martine Le Cunff for assistance with facilities and experimentation. We are grateful to Geoffroy Golfier, Jean Mosser and Marie-Claude Potier for helpful discussions on methods. We are also grateful to the company Perkin Elmer for their support in the early phase of our study and exchanges concerning the interpretation of microarray data. This work was supported by the ‘Institut National de la Santé et de la Recherche Médicale’, ‘le Centre National à la Recherche Scientifique’, ‘l'Association Française contre les Myopathies’, ‘le Conseil Régional des Pays de la Loire’ and ‘Ouest Génopole’.

    REFERENCES

    1. Chipping forecast (2002) Nature Genet., 32 (Suppl.), 461–552.

    2. Kerr,M.K., Martin,M. and Churchill,G.A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837.

    3. Schuchhardt,J., Beule,D., Malik,A., Wolski,E., Eickhoff,H., Lehrach,H. and Herzel,H. (2000) Normalization strategies for cDNA microarrays. Nucleic Acids Res., 28, E47.

    4. Yang,Y.H., Buckley,M.J., Dudoit,S. and Speed,T. (2000) Comparison of methods for image analysis on cDNA microarray data. Technical Report 107, Department of Statistics, University of California, Berkeley, CA.

    5. Draghici,S. (2003) Data Analysis Tools for DNA Microarrays. 1st edn. Chapman & Hall, Boca Raton, FL.

    6. Lee,M.L., Kuo,F.C., Whitmore,G.A. and Sklar,J. (2000) Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl Acad. Sci. USA, 97, 9834–9839.

    7. Hwang,D., Schmitt,W.A. and Stephanopoulos,G. (2002) Determination of minimal sample size and discriminatory expression patterns in microarray data. Bioinformatics, 18, 1184–1193.

    8. Pan,W., Lin,J. and Le,C.T. (2002) How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol., 3, research0022.

    9. Pavlidis,P., Li,Q. and Noble,W.S. (2003) The effect of replication on gene expression microarray experiments. Bioinformatics, 19, 1620–1627.

    10. Tkatchenko,A.V., Le Cam,G., Leger,J.J. and Dechesne,C.A. (2000) Large-scale analysis of differential gene expression in the hindlimb muscles and diaphragm of mdx mouse. Biochim. Biophys. Acta, 1500, 17–30.

    11. Cros,N., Tkatchenko,A.V., Pisani,D.F., Leclerc,L., Leger,J.J., Marini,J.F. and Dechesne,C.A. (2001) Analysis of altered gene expression in rat soleus muscle atrophied by disuse. J. Cell. Biochem., 83, 508–51.

    12. Steenman,M., Chen,Y.W., Le Cunff,M., Lamirault,G., Varro,A., Hoffman,E. and Leger,J.J. (2003) Transcriptomal analysis of failing and nonfailing human hearts. Physiol. Genomics, 12, 97–112.

    13. Steenman,M., Lamirault,G., Le Meur,N., Le Cunff,M., Escande,D. and Léger,J.J. (2004) Distinct molecular portraits of human failing hearts identified by dedicated cDNA microarrays. Eur. J. Heart Fail., in press, doi:10.1016/j.ejheart.2004.05.008.

    14. Ihaka,R. and Gentleman,R. (1996) R: a language for data analysis and graphics. J. Comput. Graph. Stat., 5, 299–314.

    15. Holm,S. (1979) Simple sequentially rejective multiple test procedure. Scand. J. Stat., 6, 65–70.

    16. Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5120.

    17. Kohavi,R. and Provost,F. (1998) Glossary of terms. Machine Learning, 30, 271–274.

    18. Brazma,A., Hingamp,P., Quackenbush,J., Sherlock,G., Spellman,P., Stoeckert,C., Aach,J., Ansorge,W., Ball,C.A., Causton,H.C., Gaasterland,T., Glenisson,P., Holstege,F.C., Kim,I.F., Markowitz,V., Matese,J.C., Parkinson,H., Robinson,A., Sarkans,U., Schulze-Kremer,S., Stewart,J., Taylor,R., Vilo,J. and Vingron,M. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genet., 29, 365–371.

    19. Tseng,G.C., Oh,M.K., Rohlin,L., Liao,J.C. and Wong,W.H. (2001) Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res., 29, 2549–2557.

    20. Yang,Y.H., Dudoit,S., Luu,P. and Speed,T. (2001) Normalization for cDNA microarray data. Brief. Bioinformatics, 2, 341–349.

    21. Yang,Y.H., Dudoit,S., Luu,P., Lin,D.M., Peng,V., Ngai,J. and Speed,T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15.

    22. Burke,S. (2001) Missing values, outliers, robust statistics & non-parametric methods. Statistics and Data Analysis. LC.GC. Europe Online Supplement, 59, 19–24.

    23. Müller,J.W. (2000) Possible advantages of a robust evaluation of comparisons. J. Res. Natl Inst. Stand. Technol., 4, 551–555.

    24. Chen,Y., Dougherty,E. and Bittner,M. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Opt., 2, 364–374.

    25. Golfier,G., Tran,D.M., Dauphinot,L., Graison,E., Rossier,J. and Potier,M.C. (2004) VARAN: a web server for VARiability ANalysis of DNA microarray experiments. Bioinformatics, 20, 1641–1643.

    26. Kerr,M.K. (2003) Experimental design to make the most of microarray studies. Methods Mol. Biol., 224, 137–147.

    27. Bottomley,S. (2004) Bioinformatics: smartest software is still just a tool. Nature, 429, 241.

    28. Bédrine-Ferran,H., Le Meur,N., Gicquel,I., Le Cunff,M., Soriano,N., Guisle,I., Mottier,S., Monnier,A., Teusan,R., Fergelot,P., Le Gall,J.Y., Leger,J.J. and Mosser,J. (2004) Transcriptome variations in human Caco-2 cells: a model for enterocyte differentiation and its link to iron absorption. Genomics, 83, 747–950.

    29. Saal,L.H., Troein,C., Vallon-Christersson,J., Gruvberger,S., Borg,A. and Peterson,C. (2002) BioArray Software Environment (BASE): a platform for comprehensive management and analysis of microarray data. Genome Biol., 3, SOFTWARE0003.

    (Nolwenn Le Meur*, Guillaume Lamirault, A)