UCSD CSE 182 - Clustering

This preview shows page 1-2-3-18-19-36-37-38 out of 38 pages.


CSE182-L17

Slide titles:
Unsupervised Clustering
Distance
K-Means Clustering Problem: Formulation
1-Means Clustering Problem: an Easy Case
K-means: Lloyd’s algorithm
Conservative K-Means Algorithm
Microarray summary
Microarray non-summary
Population Genetics
Population Structure
What causes variation in a population?
Single Nucleotide Polymorphisms
Short Tandem Repeats
STR can be used as a DNA fingerprint
Recombination
What if there were no recombinations?
The Infinite Sites Assumption
Infinite sites assumption and Perfect Phylogeny
Perfect Phylogeny
The 4-gamete condition
4 Gamete Condition
4-gamete condition: proof
An algorithm for constructing a perfect phylogeny
Inclusion Property
Example
Sort columns
Add first column
Adding other columns
Unrooted case
Handling recombination
Linkage (Dis)-equilibrium (LD)

CSE182-L17
Clustering
Population Genetics: Basics

Unsupervised Clustering
• Given a set of points (in n dimensions) and k, compute the k “best clusters”.
• In k-means, clustering is done by choosing k centers (means).
• Each point is assigned to the closest center.
• The notion of “best” is defined by distances to the center.
• Question: How can we compute the k best centers?
[Figure: clusters]

Distance
• Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point of X.
• Given a set of n data points $V = \{v_1, \ldots, v_n\}$ and a set of k points X, define the Squared Error Distortion
  $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$

K-Means Clustering Problem: Formulation
• Input: A set V consisting of n points, and a parameter k.
• Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X.
This problem is NP-hard in general.

1-Means Clustering Problem: an Easy Case
• Input: A set V consisting of n points.
• Output: A single point X that minimizes d(V, X) over all possible choices of X.
This problem is easy.
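
The definitions of d(v, X) and d(V, X) above can be written out directly. The sketch below is illustrative only: the function names and the four sample points are assumptions, not part of the slides. It also uses a standard fact that the slides do not state, namely that for a single center the coordinate-wise mean (centroid) minimizes the squared error distortion, which is why the 1-means case is easy.

```python
import math

def dist_to_set(v, X):
    """d(v, X): Euclidean distance from v to the closest point of X."""
    return min(math.dist(v, x) for x in X)

def distortion(V, X):
    """Squared error distortion: mean over V of d(v_i, X) squared."""
    return sum(dist_to_set(v, X) ** 2 for v in V) / len(V)

def centroid(V):
    """Coordinate-wise mean of the points in V."""
    return tuple(sum(coords) / len(V) for coords in zip(*V))

# Hypothetical data: the four corners of a 2 x 2 square.
V = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
c = centroid(V)

print("centroid:", c)
print("distortion at centroid:", distortion(V, [c]))
print("distortion at a corner:", distortion(V, [(0.0, 0.0)]))
```

For these points the centroid is (1.0, 1.0), and the distortion there (about 2) is lower than at a corner (about 4), up to floating-point rounding.
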
However, it becomes very difficult for more than one center. An efficient heuristic method for k-means clustering is Lloyd’s algorithm.

K-means: Lloyd’s algorithm
• Choose k centers at random:
  – $X' = \{x_1, x_2, x_3, \ldots, x_k\}$
• Repeat
  – $X = X'$
  – Assign each $v \in V$ to the closest cluster $j$:
    • $d(v, x_j) = d(v, X) \Rightarrow C_j = C_j \cup \{v\}$
  – Recompute $X'$:
    • $x'_j \leftarrow \left(\sum_{v \in C_j} v\right) / |C_j|$
• until ($X' = X$)

[Figures: four scatter plots with axes "expression in condition 1" and "expression in condition 2" (both 0 to 5), each showing centers x1, x2, x3.]

Conservative K-Means Algorithm
• Lloyd’s algorithm is fast, but in each iteration it moves many data points, which does not necessarily improve convergence.
• A more conservative method would be to move one point at a time, and only if the move improves the overall clustering cost.
• The smaller the clustering cost of a partition of the data points, the better the clustering.
• Different methods can be used to measure this clustering cost (for example, the last algorithm used the squared error distortion).

Microarray summary
• Microarrays (like MS) are a technology for probing the dynamic state of the cell.
• We answered questions like the following:
  – Which genes are coordinately regulated (that is, have similar expression patterns in different conditions)?
  – How can we reduce the dimensionality of the system?
  – Using gene expression values from a sample, can you predict whether the sample is normal (state A) or diseased (state B)?
• The techniques employed for classification, clustering, etc. are general and can be employed in a number of contexts.

Microarray non-summary
• We did not cover:
  – How are the gene expression values measured (the technology)? (CSE183)
  – How do you control variability across different experiments (normalization)? (CSE183)
  – What controls the expression of a gene (gene regulation), or a set of genes? (CSE 181)

Population Genetics
• The sequence of an individual does not say anything about the diversity of a population.
• Small individual genetic differences can have a profound impact on “phenotypes”:
  – Response to drugs
  – Susceptibility to diseases
• Soon, we will have sequences of many individuals from the same species. Studying the differences will be a major challenge.

Population Structure
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2002).
[Figure labels: Africa, Eurasia, East Asia, America, Oceania]

Population Genetics
• What is it about our genetic makeup that makes us measurably different?
• These genetic differences are correlated with phenotypic differences.
• With cost reduction in sequencing and genotyping technologies, we will know the sequence for entire populations of individuals.
• Here, we will study the basics of this polymorphism data, and tools that are being developed to analyze it.

What causes variation in a population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (e.g., microsatellite repeats)
• Deletions, inversions

Single Nucleotide Polymorphisms
[Figure: binary SNP matrix, row breaks lost in extraction: 000001010111000110100101000101010010000000110001111000000101100110]
Infinite Sites Assumption: Each site mutates at most once.

Short Tandem Repeats
(sequence, number of repeats)
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5

STR can be used as a DNA fingerprint
• Consider a collection of regions with variable-length repeats.
• Variable-length repeats will lead to variable-length DNA.
• The vector of lengths is a fingerprint.
[Figure: matrix of repeat lengths with axes "positions" and "individuals"; extracted values, one pair per row: (4, 2), (3, 3), (5, 1), (3, 2), (3, 1), (5, 3)]

Recombination
[Figure: sequences 00000000, 11111111, and 00011111]

What if there were no recombinations?
• Life would be simpler.
• Each sequence would have a single parent.
• The relationship is expressed as a tree.

The Infinite Sites Assumption
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 1 0 0 0 0 1
0 0 1 0 1 0 0 0
[Figure labels: 3, 8, 5]
• The different sites are linked. A 1 in position 8 implies a 0 in position 5, and vice versa.
• Some phenotypes could be linked to the polymorphisms.
• Some of the linkage is “destroyed” by recombination.

Infinite sites assumption and Perfect Phylogeny
• Each site is mutated at most once in the history.
• All descendants must carry the mutated value, and all others must carry the ancestral value.
[Figure labels, cut off by the preview: i; 1 in position i; 0 in
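
Returning to the clustering half of the lecture, the steps listed under "K-means: Lloyd’s algorithm" can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the course's implementation: the slide only says "choose k centers at random", so drawing the initial centers from the data points, the `max_iter` safety cap, the handling of empty clusters, and the sample data are all choices made here for illustration.

```python
import math
import random

def assign(V, X):
    """Put each point of V into the cluster of its closest center in X."""
    clusters = [[] for _ in X]
    for v in V:
        j = min(range(len(X)), key=lambda i: math.dist(v, X[i]))
        clusters[j].append(v)
    return clusters

def mean(C):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(C) for coords in zip(*C))

def lloyd(V, k, seed=0, max_iter=100):
    """Alternate assignment and center recomputation until X' = X."""
    # Assumption: initial centers are k distinct data points chosen at random.
    X_new = random.Random(seed).sample(V, k)
    for _ in range(max_iter):  # safety cap, not part of the slide
        X = X_new
        clusters = assign(V, X)
        # Assumption: an empty cluster keeps its previous center.
        X_new = [mean(C) if C else x for C, x in zip(clusters, X)]
        if X_new == X:
            break
    return X_new, assign(V, X_new)

# Hypothetical data: two well-separated groups of three points each.
V = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = lloyd(V, 2)
print(centers)
print(clusters)
```

On this sample the two groups are far enough apart that the loop ends with one cluster per group. In general Lloyd's algorithm is a heuristic, so the result can depend on the initial centers.
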

