YASMA - Yet Another Statistical Microarray Analysis
A quick tutorial (for version 0.19)

Lorenz Wernisch
May 14, 2002

Contents

1 Introduction
2 Installing YASMA under Unix
3 Running R, loading the packages and data
4 Reading and writing microarray data files
5 Data quality
6 Normalization
7 ANOVA analysis
8 Analysis of variance components
9 Optimal designs
10 Significance of differential expression
11 Missing data

1 Introduction

YASMA is an add-on library for the R statistical package and can be used to analyse simple replicated experiments. We are interested in bacterial genes over- or under-expressed in mutants as compared to the wild type. For this purpose, multiple mRNA samples from different cell cultures are hybridized on several arrays. As long as the same number of arrays is used for each sample, a straightforward ANOVA analysis can be applied to the series of experiments (a balanced factorial design). From the standard error of the ANOVA analysis (or a bootstrap estimate of it), p-values for differential expression can be derived.

Even if you are not yet familiar with R, this tutorial will give you a first idea of how R and YASMA work. Making full use of the features of YASMA and R requires some knowledge of R.
I recommend going through the R tutorial on the R online help page, especially the introductory chapters and the chapters on arrays, reading data, loops, writing functions, and statistical models (YASMA supports model formulas).

2 Installing YASMA under Unix

To run YASMA you need R, a statistical package which can be downloaded for free from

    http://cran.r-project.org/

Download the YASMA package from

    http://www.cryst.bbk.ac.uk/~wernisch/yasma/yasma_0.19.tgz

and, for some additional functions (not necessary for the YASMA core functions), the SMA package from

    http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html

Once R is installed, YASMA is added by

    R INSTALL yasma_0.19.tgz

and similarly for the SMA package. For more information on how to install packages, see the section on R Add-On Packages in the Frequently Asked Questions of the R online help (see section 3 for online help under R).

3 Running R, loading the packages and data

If R is installed properly, it is invoked by

    R

R is a command-line driven interpreter with convenient editing facilities on the command line. For example, to load the YASMA (and SMA) packages, type

    library(yasma)
    library(sma)

on the command line. Alternatively (the way I prefer to work), cut and paste the above commands directly from the PDF reader to the command line, if necessary line by line, entering each command by pressing return. Online help for R commands and all installed R packages is available with

    help.start()

which usually launches an HTML page in your web browser.

The first thing to do is to load some data.
The YASMA package contains microarray data on a trcS mutant of Mycobacterium tuberculosis provided by Sharon Kendall from Neil Stoker’s lab. Load them with

    data(trcs)

4 Reading and writing microarray data files

To load your own data from tab-delimited ASCII files, you need a more complicated command, which looks like

    my.RG <- read.rg("/home/wernisch/microarrays/data/trcs/",
                     gene.col="Gene ID",
                     x.col="X Coord", y.col="Y Coord",
                     R.filenames=c("70","70rs","71","71rs","72","72rs",
                                   "73","73rs","95","95rs","96","96rs"),
                     R.prefix="MT-cy3", R.suffix=".dat",
                     G.filenames=c("70","70rs","71","71rs","72","72rs",
                                   "73","73rs","95","95rs","96","96rs"),
                     G.prefix="MT-cy5", G.suffix=".dat",
                     R.col="Signal Median", Rb.col="Background Median",
                     G.col="Signal Median", Gb.col="Background Median",
                     do.prepare=TRUE,
                     start.phrase="Gene ID", end.phrase="End Raw Data",
                     experiments=list(S=1:3,A=1:2))

Note that this works only if a complete path name is given and the data files are in that directory. The details of the command are explained in the description of "read.rg" in the help tool (click "Packages", "yasma", "read.rg"); alternatively, you can view the help within R with help(read.rg).

Essentially, the procedure reads in all relevant data files and strips them of everything before a line containing the start phrase Gene ID and of everything following a line containing the end phrase End Raw Data. It then extracts signal and background expression levels for the "red" and "green" channels from the columns named Signal Median and Background Median, and gene names from the column Gene ID.

An important parameter of the function read.rg is experiments. It describes the layout of the experiments for later use in the ANOVA analysis. In the above example the data are obtained from 3 mRNA cultures (S=1:3), each on 2 microarrays (A=1:2), and for each array we have two spot quantifications (Q=1:2). This results in 12 = 3 × 2 × 2 array data sets: sets 1–4 from culture (sample) 1, sets 5–8 from culture 2, and sets 9–12 from culture 3.
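
As a rough illustration of this layout, the numbering of the 12 data sets can be tabulated in base R. This is only a sketch: it does not call any YASMA function, the variable name set.layout is made up for the example, and the factor names S, A and Q are simply taken from the description above.

```r
# Hypothetical illustration in base R only (no YASMA functions are used).
# expand.grid varies its first argument fastest, so the factors are listed
# from fastest (Q) to slowest (S) to reproduce the ordering of the data sets.
set.layout <- expand.grid(Q = 1:2,   # spot quantification
                          A = 1:2,   # array
                          S = 1:3)   # mRNA culture (sample)

# Number the data sets 1 to 12 and put the columns in a readable order.
set.layout$set <- seq_len(nrow(set.layout))
set.layout <- set.layout[, c("set", "S", "A", "Q")]
print(set.layout)

# Data sets that belong to culture 2
set.layout$set[set.layout$S == 2]   # 5 6 7 8
```

Each row of the printed table gives the culture, array and quantification of one data set, which makes it easy to check that the order of the file names passed to read.rg matches the intended design.
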
Sets 1 and 2 are two spot quantifications of the first array of sample 1, sets 3 and 4 are two spot quantifications of the second array of sample 1, and so on. This is also the layout of the data set trcs which you loaded above.

The next step is to get rid of spots containing material other than genes.

    trcs.RG <- rg.remove.containing(trcs.RG,
                   c("Cy3","Cy5","Carry-over","Spot","50%","GAPDH",
                     "B-actin","23s","16s","5s","10sa","PCR"))

Alternatively, use the command rg.keep.containing if, for example, all genes whose names contain "Rv" should be kept. We are left with 3924 genes, as is seen by typing

    length(trcs.RG$genes)

Write the RG structure to a series of files (e.g. "RGcopy") by

    rg.write(trcs.RG,"RGcopy")

This generates a series of files with names starting with "RGcopy-" in the current working directory. This command is useful if one wants to manipulate the RG structure and use the result in other analysis packages.

5 Data quality

Let us have a first plot of the data. Plot log(R/G) from experiment 1 against log(R/G) from experiment 3 (same sample, different arrays) by

    al <- rg.log.matrix(trcs.RG)
    plot(al[,1],al[,3])

[Figure: scatter plot of al[, 3] (vertical axis) against al[, 1] (horizontal axis); both axes run from about −5 to 5.]

Ideally all points should lie on a line, but there is only a weak hint of correlation in this plot.

The question arises how many points should be removed to get a good correlation between the experiments. We don’t want to remove too many and lose genes. An indication is given by a plot