Identification and Removal of Outlier Samples (Illumina)

Supplement for: "Functional Organization of the Transcriptome in Human Brain"

Michael C. Oldham, Steve Horvath, Genevieve Konopka, Kazuya Iwamoto, Peter Langfelder, Tadafumi Kato, and Daniel H. Geschwind

Summary

Here we present additional details on the microarray data pre-processing steps performed prior to the construction of gene coexpression networks in our study, "Functional Organization of the Transcriptome in Human Brain". To ensure full reproducibility of our research findings, below we provide an annotated supplement that contains all of the relevant R code and corresponding figure images that were used to guide our decisions to remove outlier samples in a previously processed Illumina microarray dataset consisting of 193 samples from human cerebral cortex (ref. 1). This dataset ("CTX_ILMN") was analyzed to provide additional validation across platforms and individuals for the significance of gene coexpression relationships in human cerebral cortex identified in our study.

Since network analysis and module detection can be severely biased by the presence of outlying microarray samples, it is important to carry out pre-processing steps to identify and remove such samples in each dataset prior to network construction. Our main statistical diagnostic for flagging potential outlying samples in this dataset was the inter-array correlation (IAC), which was defined as the Pearson correlation coefficient of the expression levels for a given pair of microarrays (using all probe sets for which complete data were available). The exclusion of samples purely on the basis of IACs represents an unbiased method for the identification and removal of microarray samples with divergent gene expression levels.
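As a minimal sketch of the IAC definition above, the following R snippet computes inter-array correlations on a small simulated matrix (probes in rows, arrays in columns). The matrix `expr`, its dimensions, and the sample names are illustrative assumptions and are not data from the study.

```r
## Simulated expression matrix: 100 probes x 6 arrays (hypothetical data).
set.seed(42)
profile <- rnorm(100, mean = 8, sd = 2)            # shared expression profile
expr <- sapply(1:6, function(i) profile + rnorm(100, sd = 0.5))
colnames(expr) <- paste0("S", 1:6)
expr[3, 2] <- NA                                   # introduce one missing value

## Keep only probes with complete data across all arrays.
complete <- expr[complete.cases(expr), ]

## IAC: Pearson correlation between every pair of arrays (columns).
IAC <- cor(complete, method = "pearson")

## Mean IAC over unique pairs (the upper triangle excludes the diagonal).
meanIAC <- mean(IAC[upper.tri(IAC)])
print(dim(IAC))
print(round(meanIAC, 3))
```

This sketch drops incomplete probes before correlating. The supplement's own code instead passes `use="p"` to `cor`, which R matches to pairwise complete observations; the two approaches coincide once the data are restricted to probes with no missing values.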
The distribution of IACs within a dataset can be visualized as a histogram (frequency plot), while the relationships between arrays can be visualized as a dendrogram using average linkage hierarchical clustering with 1-IAC as a distance metric.

Ideally, microarray pre-processing steps begin with the analysis of raw data. Unlike the other microarray datasets analyzed in our study, however, raw data for CTX_ILMN were unavailable. Therefore, the identification and removal of outlier samples in CTX_ILMN was performed using data that had been previously normalized by the authors of the original study via the rank-invariant normalization method offered by Illumina's BeadStudio software (ref. 1). These data consisted of expression levels for 14,078 transcripts that were detected in >= 5% of all 193 samples (ref. 1), with undetected transcripts coded as missing values (detailed sample information can be found in ref. 1). From this list, we selected 5,269 transcripts with no missing values for further pre-processing and network analysis.

Samples in CTX_ILMN that exhibited divergent clustering and/or samples with low mean IACs (n=34/193 [18%]) were excluded, and the mean IAC after removing all outlier samples and performing quantile normalization (ref. 2) was 0.943. We note that this quantity is lower than the mean IACs for the other datasets analyzed in this study (0.970 - 0.975). This discrepancy may reflect uncorrected batch effects or other technical artifacts in the present analysis, increased sample heterogeneity, or other factors. All analyses described below were performed in R.

Data Description

Dataset    Arrays                   # samples before pre-processing   # samples after pre-processing   Sample description*
CTX_ILMN   Illumina HumanRefseq-8   193                               159                              cerebral cortex

* For additional sample information, see Supplementary Table 1 from ref. 1.

CTX_ILMN

## Reading in the previously normalized expression data
## (14,078 probe sets, 193 samples; columns 1-3 contain
## probe set information and column 197 contains the number
## of missing values for each probe set):

library(cluster)
library(affy)
dat1=read.csv("CTXILMN_193samples_normalized_expression_data.csv",header=T)
dim(dat1)
# [1] 14078 197
dimnames(dat1)[[2]]

## First we will examine the overall distribution of expression values:

boxplot(log(dat1[,4:196],2),las=3,cex=0.7)

## There appears to be a fair amount of variation in the distribution.
## Calculating IACs for all pairs of samples and examining the
## distribution of IACs in the dataset:

IAC=cor(dat1[,4:196],use="p")
hist(IAC,sub=paste("Myers Illumina 193 CTX mean=",format(mean(IAC[upper.tri(IAC)]),digits=3)))

## Here we see that the mean IAC in the initial CTX_ILMN dataset, with no
## outlier samples removed, is 0.918. There is a long tail to the left of
## the distribution, indicating the presence of possible outliers.

## Performing hierarchical clustering (average linkage) using 1-IAC as a
## distance metric:

cluster1=hclust(as.dist(1-IAC),method="average")
plot(cluster1,cex=0.7,labels=dimnames(dat1)[[2]][4:196])

## This dendrogram suggests that there are outliers in the dataset
## (notably at left). However, we would like to further restrict the
## dataset to include only transcripts with no missing values (n=5,269):

datrest=as.matrix(dat1[dat1$No_NaN<1,4:196])
dim(datrest)
# [1] 5269 193

## Repeating the previous steps:

IAC=cor(datrest,use="p")
hist(IAC,sub=paste("Myers Illumina 193 CTX 5269 mean=",format(mean(IAC[upper.tri(IAC)]),digits=3)))
cluster1=hclust(as.dist(1-IAC),method="average")
plot(cluster1,cex=0.7,labels=dimnames(datrest)[[2]])
abline(h=0.1,col="red")

## Again, we see a long tail to the left of the distribution. Clustering:

## We will remove the outlying samples at the left of the dendrogram
## (n=24) by "cutting" the tree at a specified height (red line).
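The tree cut described above can also be sketched programmatically. The snippet below is a minimal illustration on simulated data, assuming that the largest branch below the cut is the main cluster and that everything else is a candidate outlier. The data, the sample names, and this largest-branch rule are assumptions for illustration; the supplement lists its excluded samples by name.

```r
## Simulated data (hypothetical): 10 similar arrays plus 2 noisy ones.
set.seed(1)
profile <- rnorm(200, mean = 8, sd = 2)
good <- sapply(1:10, function(i) profile + rnorm(200, sd = 0.3))
noisy <- sapply(1:2, function(i) profile + rnorm(200, sd = 2))
expr <- cbind(good, noisy)
colnames(expr) <- paste0("S", 1:12)

## Average-linkage clustering with 1 - IAC as the distance.
IAC <- cor(expr, use = "pairwise.complete.obs")
tree <- hclust(as.dist(1 - IAC), method = "average")

## Cut the tree at the chosen height; each branch below the cut is a group.
groups <- cutree(tree, h = 0.1)

## Treat the largest group as the main cluster and flag everything else.
mainGroup <- as.integer(names(which.max(table(groups))))
outliers <- names(groups)[groups != mainGroup]
print(outliers)
```

A vector such as `outliers` could then play the role of a hand-written list of sample names, although inspecting the dendrogram before excluding anything remains advisable.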
## These samples were excluded as follows:

removevec=c("H54","H62","H90","H91","H130","H140","H150","H163","H173","H174","H176","H190","H191","H230","H231","H232","H233","H234","H235","H240","H245","H251","H253","H260")
dim(datrest)
# [1] 5269 193
overlap1=is.element(dimnames(datrest)[[2]],removevec)
datrest2=datrest[,!overlap1]
dim(datrest2)
# [1] 5269 169
IAC=cor(datrest2,use="p")
hist(IAC,sub=paste("Myers Illumina 169 CTX mean=",format(mean(IAC[upper.tri(IAC)]),digits=3)))

## So after removing those samples, the mean IAC has improved
## considerably (from 0.916 to 0.929). Clustering again:

cluster1=hclust(as.dist(1-IAC),method="average")
plot(cluster1,cex=0.7,labels=dimnames(datrest2)[[2]])

## The dendrogram looks better, but there
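The remove-and-recompute step shown in the code above can be wrapped in a small helper that reports the mean IAC before and after dropping a set of samples. This is a sketch on simulated data; the function names `meanIAC` and `compareIAC`, the matrix `expr`, and the sample names are hypothetical and do not come from the supplement.

```r
## Mean IAC over all unique pairs of arrays (columns of x).
meanIAC <- function(x) {
  iac <- cor(x, use = "pairwise.complete.obs")
  mean(iac[upper.tri(iac)])
}

## Drop the named samples and report the mean IAC before and after.
## `x` is a probes-by-arrays matrix; `remove` is a vector of column names.
compareIAC <- function(x, remove) {
  keep <- !is.element(colnames(x), remove)
  c(before = meanIAC(x), after = meanIAC(x[, keep, drop = FALSE]))
}

## Simulated example (hypothetical): 10 similar arrays plus 2 noisy ones.
set.seed(1)
profile <- rnorm(200, mean = 8, sd = 2)
expr <- cbind(sapply(1:10, function(i) profile + rnorm(200, sd = 0.3)),
              sapply(1:2, function(i) profile + rnorm(200, sd = 2)))
colnames(expr) <- paste0("S", 1:12)

res <- compareIAC(expr, remove = c("S11", "S12"))
print(round(res, 3))
```

Reporting both values side by side makes it easy to confirm that each round of sample removal actually raises the mean IAC.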