DOC PREVIEW
UMD CMSC 423 - Data clustering

This preview shows page 1-2-3-4-5-6 out of 18 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 18 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CMSC423: Bioinformatic Algorithms, Databases and ToolsData clusteringCMSC423 Fall 2009 2Why data clustering?What does this mean?CMSC423 Fall 2009 3Data clustering...>F4BT0V001CZSIM rank=0000138 x=1110.0 y=2700.0 length=57ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACGTCTG>F4BT0V001BBJQS rank=0000155 x=424.0 y=1826.0 length=47ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAA>F4BT0V001EDG35 rank=0000182 x=1676.0 y=2387.0 length=44ACTGACTGCATGCTGCCTCCCGTAGGAGTCGCCGTCCTCGACNC>F4BT0V001D2HQQ rank=0000196 x=1551.0 y=1984.0 length=42ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCGTCCCTCGAC>F4BT0V001CM392 rank=0000206 x=966.0 y=1240.0 length=82AANCAGCTCTCATGCTCGCCCTGACTTGGCATGTGTTAAGCCTGTAGGCTAGCGTTCATCCCTGAGCCAGGATCAAACTCTG>F4BT0V001EIMFX rank=0000250 x=1735.0 y=907.0 length=46ACTGACTGCATGCTGCCTCCCGTAGGAGTGTCGCGCCATCAGACTG>F4BT0V001ENDKR rank=0000262 x=1789.0 y=1513.0 length=56GACACTGTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG>F4BT0V001D91MI rank=0000288 x=1637.0 y=2088.0 length=56ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG>F4BT0V001D0Y5G rank=0000341 x=1534.0 y=866.0 length=75GTCTGTGACATGCTGCCTCCCGTAGGAGTCTACACAAGTTGTGGCCCAGAACCACTGAGCCAGGATCAAACTCTG>F4BT0V001EMLE1 rank=0000365 x=1780.0 y=1883.0 length=84ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAATGCTGCATGCTGCTCCCTGAGCCAGGATCAAACTCTGCMSC423 Fall 2009 4Data clustering...•Given a collection of data-points can we identify any patterns?• Data-points:–DNA sequences–Gene expression levels–Organism abundances in an environment–Vitals• Patterns:–do certain points group together?CMSC423 Fall 2009 5Example•16S rRNA sequences from gut microbiome• How many types of organisms are there?• Assumption: organisms that are similar have similar 16S sequences• Operational Taxonomic Unit – sequences within 2% distance from each other•Next step: given OTU abundances for a number of patients, do the patients cluster by age/disease/etc?CMSC423 Fall 2009 6Types of clustering algorithms• Agglomerative–Start with single observations–Group similar observations into the same cluster• Divisive–All datapoints start in the same cluster–Iteratively divide cluster until you find good clustering• Hierarchical–Build a tree – leaves are datapoints, internal nodes represent clustersCMSC423 Fall 2009 7Measures of goodness of clustering• Homogeneity–All points in a cluster must be similar• Separation–Points in different clusters are disimilarCMSC423 Fall 2009 8Microarray (gene abundance) clustering• Each gene can be viewed as an array of numbers–expression of gene at different time-points–expression of gene in different conditions (normal, variants of a disease, etc.)• Each time-point or tissue sample can also be viewed as an array of numbers–expression levels for all genes• Basic idea: cluster genes and/or samples to highlight genes involved in diseaseCMSC423 Fall 2009 9Hierarchical clustering•Need: definition of distance between data-points (e.g. individual genes).• Some measures:–Euclidean distance –Manhattan distance –Pearson correlation –Angle between vectors (centered Pearson correlation)• Clustering algorithm–group together data-points that are most similar–repeat...D x , y=∑i xi− yi2Dx , y=∑i∣xi− yi∣Dx , y=E [ x−x y−y]xyCMSC423 Fall 2009 10Hierarchical clustering• Key element: how do you compute distance between two clusters, or a point and a cluster ?• UPGMA/average neighbor (average linkage)– average distance between all genes in the two clusters •Furthest neighbor (complete linkage)– largest distance between all genes in clusters•Nearest neighbor (single linkage)–smallest distance between all genes in clusters•Ward's distance–inter-cluster distance is variance of inter-gene distancesCMSC423 Fall 2009 11Hierarchical clustering...cont•Irrespective of distance choice, algorithm is the same1. compute inter-gene/cluster distances2. join together pair of genes/clusters with smallest distance3. recompute distances to include the newly created cluster4. repeat until all points in one cluster•Output of program is a tree• Cluster sets – defined by “cut” nodes – any subset of internal tree nodes defines a set of clusters – the sets of leaves in the corresponding subtrees• Choice of cut can be tricky – usually problem-specifi• Note: all this can be done easily in R•CMSC423 Fall 2009 12Example: gut microbiome in childrenCMSC423 Fall 2009 13k-means clustering• Goal: split data into exactly k clusters•Basic algorithm:–Create k arbitrary clusters - pick k points as cluster centers and assign each other point to the closest center–Re-compute the center of each cluster–Re-assign points to clusters–Repeat• Another approach: pick a point at and see if moving it to a different cluster will improve the quality of the overall solution. Repeat!CMSC423 Fall 2009 14K-means clustering... K=2CMSC423 Fall 2009 15Other clustering approaches•Principal component analysis–Identify a direction (vector V) such that the projection of data on V has maximum variance (first principal component)–repeat (vector V' != V such that project of data on V' has maximum variance) –Usually plot the first 2 or 3 principal componentsCMSC423 Fall 2009 16Other clustering approaches•Self-organizing maps–Neural-network based approach–Output layer of network are points in a low-dimensional space•Graph theoretic–Points are connected by edges representing strength of "connection" (e.g. similarity or dissimilarity)–Pick clusters such that number of "similar" edges spanning boundaries is minimized, or number of "dissimilar" edges within each cluster is minimized• Markov chain clustering–basic idea – a random walk through a graph will stay within a local strongly connected regionCMSC423 Fall 2009 17Self organizing map of genomeshttp://www.jamstec.go.jp/esc/esc/publication/journal/jes_vol.6/pdf/JES6_22-Abe.pdfCMSC423 Fall 2009 18Questions•Compare nearest, farthest, average, and Ward's cluster distances in terms of how compact the resulting clusters might


View Full Document

UMD CMSC 423 - Data clustering

Documents in this Course
Midterm

Midterm

8 pages

Lecture 7

Lecture 7

15 pages

Load more
Download Data clustering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Data clustering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Data clustering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?