
BIOINFORMATICS
Vol. 17 no. 6 2001
Pages 520–525

Missing value estimation methods for DNA microarrays

Olga Troyanskaya¹, Michael Cantor¹, Gavin Sherlock², Pat Brown³, Trevor Hastie⁴, Robert Tibshirani⁴, David Botstein² and Russ B. Altman¹,∗

¹Stanford Medical Informatics, ²Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA, ³Department of Biochemistry, Stanford University School of Medicine, and Howard Hughes Medical Institute, Stanford, CA, USA and ⁴Departments of Statistics and Health Research and Policy, Stanford University, Stanford, CA, USA

Received on November 13, 2000; revised on February 22, 2001; accepted on February 26, 2001

ABSTRACT
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data.
Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values.
We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
Availability: The software is available at http://smi-web.stanford.edu/projects/helix/pubs/impute/
Contact: [email protected]

∗To whom correspondence should be addressed.

INTRODUCTION
DNA microarray technology allows for the monitoring of expression levels of thousands of genes under a variety of conditions (DeRisi et al., 1997; Spellman et al., 1998). Microarrays have been used to study a variety of biological processes, from differential gene expression in human tumors (Perou et al., 2000) to yeast sporulation (Chu et al., 1998). Various analysis techniques have been developed, aimed primarily at identifying regulatory patterns or similarities in expression under similar conditions. Commonly used analysis methods include clustering techniques (Eisen et al., 1998; Tamayo et al., 1999), techniques based on partitioning of data (Heyer et al., 1999; Tamayo et al., 1999), as well as various supervised learning algorithms (Alter et al., 2000; Brown et al., 2000; Golub et al., 1999; Raychaudhuri et al., 2000; Hastie et al., 2000).

The data from microarray experiments is usually in the form of large matrices of expression levels of genes (rows) under different experimental conditions (columns) and frequently with some values missing. Missing values occur for diverse reasons, including insufficient resolution, image corruption, or simply due to dust or scratches on the slide.
Missing data may also occur systematically as a result of the robotic methods used to create the arrays. Our informal analysis of the distribution of missing data in real samples shows a combination of all of these, but none dominating. Such suspicious data is usually manually flagged and excluded from subsequent analysis (Alizadeh et al., 2000). Many analysis methods, such as principal components analysis or singular value decomposition, require complete matrices (Alter et al., 2000; Raychaudhuri et al., 2000). Of course, one solution to the missing data problem is to repeat the experiment. This strategy can be expensive, but has been used in validation of microarray analysis algorithms (Butte et al., 2001). Missing log2 transformed data are often replaced by zeros (Alizadeh et al., 2000) or, less often, by an average expression over the row, or 'row average'. This approach is not optimal, since these methods do not take into consideration the correlation structure of the data. Thus, many analysis techniques, as well as other analysis methods such as hierarchical clustering, k-means clustering, and self-organizing maps, may benefit from using more accurately estimated missing values.

© Oxford University Press 2001

There is not a large published literature concerning missing value estimation for microarray data, but much work has been devoted to similar problems in other fields. The question has been studied in contexts of non-response issues in sample surveys and missing data in experiments (Little and Rubin, 1987). Common methods include filling in least squares estimates, iterative analysis of variance methods (Yates, 1933), randomized inference methods, and likelihood-based approaches (Wilkinson, 1958). An algorithm similar to nearest neighbors was used to handle missing values in CART-like algorithms (Loh and Vanichsetakul, 1988). Most commonly applied statistical techniques for dealing with missing data are model-based approaches.
We have tried to minimize the influence of specific modeling assumptions in our methods.

In this work, we describe and evaluate three methods of estimation for missing values in DNA microarrays. We compare our KNN- and SVD-based methods to the row average method, which is likely the most sophisticated estimation technique currently employed for microarray missing data estimation.

SYSTEM AND METHODS
Experimental methods
We implemented and evaluated three data imputation methods: a method based on K Nearest Neighbors (KNN) algorithm, a Singular Value Decomposition based method, and simple row (gene) average.

Three microarray data sets were used: a study in yeast Saccharomyces cerevisiae focusing on identification of cell-cycle regulated genes (Spellman et al., 1998), an exploration of temporal gene expression during the metabolic shift from fermentation to respiration in Saccharomyces cerevisiae (DeRisi et
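
As a rough illustration of the baselines named above (zero filling and row average) next to a neighbor-based estimate, the following Python sketch fills missing entries in a small genes-by-conditions matrix. It is a minimal sketch under stated assumptions, not the authors' software: the function names, the use of None to mark a missing value, the root-mean-square distance over shared columns, the inverse-distance weights, and the default k are all hypothetical choices made here for illustration.

```python
import math


def zero_impute(matrix):
    """Replace every missing value (None) with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in matrix]


def row_average_impute(matrix):
    """Replace each missing value with the mean of the observed values in its row (gene)."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption: a row with no observed values falls back to 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def weighted_knn_impute(matrix, k=2):
    """Estimate each missing value from the k most similar rows that have that column.

    Assumptions (not taken from the text): similarity is a root-mean-square
    difference over columns observed in both rows, and neighbors are
    combined with inverse-distance weights.
    """
    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(matrix):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbor: leave the entry missing
            # Small constant avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


if __name__ == "__main__":
    # Rows are genes, columns are conditions; None marks a missing value.
    genes = [
        [1.0, 2.0, None, 4.0],
        [1.0, 2.0, 3.0, 4.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    print(zero_impute(genes)[0])
    print(row_average_impute(genes)[0])
    print(weighted_knn_impute(genes, k=1)[0])
```

The two baselines use only the gene's own row (or nothing at all), while the neighbor-based variant borrows values from rows with similar expression profiles, which is the kind of correlation structure the introduction says zero filling and row average ignore.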

