U-M STATS 545 - Statistics 545 Problem Set 2 - D1711656

Home> Schools> University of Michigan> Statistics (STATS) > STATS 545> Statistics 545 Problem Set 2

U-M STATS 545 - Statistics 545 Problem Set 2

Pages 11

Download Save

Unformatted text preview:

Statistics 545 Problem Set 2Due November 22.1. The goal of this exercise is to assess the stability of the clusters defined by the hierar-chical clustering algorithm under small perturbations to the data. Recall that clusterscan be defined by making a cut through the hierarchical clustering dendrogram aftera specified number of merges.To identify the clusters at a particular cut of the dendrogram in R, you will need to dosome process ing with the x$merge array, where x is the hierarchical clustering objectgenerated by R (i.e. the result of a call of the form x <- hclust(...)). I will also beposting some Matlab hierarchical clustering code. In R, the information in x$mergehas the following meaning (from help(hclust)).merge: an n-1 by 2 matrix. Row i of ’merge’ describes the merging ofclusters at step i of the clustering. If an element j in therow is negative, then observation -j was merged at thisstage. If j is positive then the merge was with the clusterformed at the (earlier) stage j of the algorithm. Thusnegative entries in ’merge’ indicate agglomerations ofsingletons, and positive entries indicate agglomerations ofnon-singletons.Generate 20 bootstrapped data sets from the NCI60 lung data by resampling thegenes. For each bootstrapped data set, calculate the hierarchical clustering solution.The discrepancy within any pair of solutions can be measured using the proportionof all sample pairs that are assigned to the same class in exactly one of the solutions,relative to the number of sample pairs that are assigned to the same class in at leastone of the solutions.Determine how the mean and variance of the discrepancy measure over all202boot-strap sets varies with the level of the dendrogram at which the cut is made, and withthe linkage type of the hierarchical clustering algorithm that is used.Solution:## The path to all the data.PA <- ’/data/BigFiles/Stat545/NCI60/’## Read in the expression file.M <- scan(sprintf(’%sNov1’, PA), 0)M <- matrix(M, ncol=59, byrow=TRUE)1## Log-transform.M <- M * (M >= 0)M <- log(M+1) / log(2)## Cell line names.CL <- scan(sprintf(’%sCell_lines’, PA), ’’, sep=’\n’)## Tissue type labels.TT <- scan(sprintf(’%sTissues’, PA), ’’, sep=’\n’)## Unique tissue type labels.UT <- unique(TT)## Return a list of cluster indicators after K joins. This performs## the same function as ’cutree’.CL <- function(m, K){## A list of cluster labels for each sample.ci <- 59+seq(59)## The current label for the cluster formed at stage j.b <- seq(59)for (k in (1:K)){## Merge two singletons to create a new cluster.if (m[k,2] < 0){ci[abs(m[k,])] <- b[k]}## Merge a singleton with a non-singletonelse if (m[k,1] < 0){ci[abs(m[k,1])] <- b[m[k,2]]b[k] <- b[m[k,2]]}## Merge two non-singletons.else{ci[ci == b[m[k,2]]] <- b[m[k,1]]b[m[k,2]] <- b[m[k,1]]b[k] <- b[m[k,1]]}}2ci}## Count the proportion of distinct object pairs that are assigned to## the same class in either C1 or C2 but not both relative to all## pairs that are assigned to the same class in either C1 or C2. C1## and C2 are the class label vectors for the objects under two## clustering solutions.disc <- function(C1, C2){n <- length(C1)f <- 0m <- 0for (i in (1:(n-1))){for (j in ((i+1):n)){j1 <- (C1[i] == C1[j])j2 <- (C2[i] == C2[j])if (j1 | j2) { m <- m+1 }if ( (j1 | j2) & (!(j1 & j2)) ) { f <- f+1 }}}f/m}## Generate clustering solutions for bootstrapped data sets.L <- array(0, c(58*20,2))for (r in (1:20)){## Generate a bootstrapped data set.n <- dim(M)[1]ii <- ceiling(n*runif(n))B <- M[ii,]## Try one or the other of these as a dissimilarity measure.D <- dist(t(B))##D <- as.dist(1 - cor(M))x <- hclust(D, method = "single", members=NULL)L[(58*(r-1)+1):(58*r),] <- x$merge}3## Calculate the discrepancy measure for every pair of bootstrapped## data sets, and at every possible cutting level in the tree.F <- array(0, c(190,58))for (K in (1:58)){## Loop through all pairs among the 20 bootstrapped sets.ii <- 1for (r1 in (1:19)){## Get the class labels for the first solution.M1 <- L[(58*(r1-1)+1):(58*r1),]C1 <- CL(M1, K)for (r2 in ((r1+1):20)){## Get the class labels for the second solution.M2 <- L[(58*(r2-1)+1):(58*r2),]C2 <- CL(M2, K)## Calculate the discrepancy measure.F[ii,K] <- disc(C1, C2)ii <- ii+1}}}FM <- colMeans(F)pdf(’disc-single.pdf’)plot(FM, xlab=’Number of joins’, ylab=’Discrepancy’, main=’Single Linkage’)dev.off()The results are plotted below. If we focus on the final ten joins, which is the usualarea of interest, there are clear differences among the methods. For complete linkage,only the penultimate j oin is reasonably well defined (with a discrepancy areound 0.1),and the other joins among the top ten are quite ambiguous. For single linkage, allof the final ten joins are quite distinct. As expected, average linkage is somewhatintermediate, with the final three or four joins being reasonably clear.4●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0 10 20 30 40 50 600.0 0.1 0.2 0.3 0.4Complete LinkageNumber of joinsDiscrepancy●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0 10 20 30 40 50 600.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35Average LinkageNumber of joinsDiscrepancy●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0 10 20 30 40 50 600.0 0.1 0.2 0.3 0.4Single LinkageNumber of joinsDiscrepancy2. The goal of this exercise is to investigate whether an optimal number of genes canbe identified for prediction using the ridged logistic regression classifier applied to theMichigan lung data. You will also compare the performance of supervised gene selectionand unsupervised gene selection. It will be important to consider the uncertainty levelsof the cross-validated error rate estimates in order to determine whether any meaningfulconclusions can be drawn. You can choose any one of the response variables to analyze(we’ll arrange to have each of the four response variables studied by at least two people).To make things manageable, begin by selecting the genes in the top quarter based oneither variance or mean. Next get the cross-validated error rate estimates for predictionusing the top 25%,

View Full Document


School:
Email:
New Password:
Confirm Password:

U-M STATS 545 - Statistics 545 Problem Set 2

Sign up for free to view:

Please select your school