Matrix-Based Algorithms for Data Mining

Slide titles (full deck):
• Matrix-Based Algorithms for Data Mining
• Why Matrix and Eigenvectors?
• How did I get into this area as a computer scientist?
• Computing Systems
• Key Challenges for Automatic Management
• Log Data Organization
• Solutions: Common Situations
• Clustering Problem
• K-means clustering
• Log Data Example
• Clustering Model
• Talk Outline
• Optimization: Matrix Perspective
• First try: Iterative Feature and Data (IFD) Clustering
• Extension I: Adaptive Subspace Clustering
• Extension II: A General Model
• Quick Summary on Clustering
• Related Work Summary
• Nonnegative Matrix Factorization (NMF)
• NMF Gives Holistic Pictures I and II
• NMF is doing "Data Clustering"
• Reformulate K-means Clustering
• K-means Clustering Theorem
• NMF Generalizations (Li & Ding, ICDM 2006)
• NMF and PLSI
• Orthogonal Nonnegative Tri-Factorization
• Symmetric NMF
• Semi-NMF
• Can we extend NMF for other data mining problems?
• An Illustrating Example
• Combining Multiple Clusterings
• Consensus Clustering Applications (Li, Ma & Ogihara, CIKM 2004)
• Consensus Clustering
• Consensus Clustering is Equivalent to Clustering Consensus Association (Li, Ding & Jordan, ICDM 2007)
• NMF Formulations
• Weighted Consensus Clustering
• Semi-Supervised Clustering
• Constraints
• NMF-based Algorithm
• Dimension Reduction and Clustering
• Relationship between LDA and K-means Clustering
• LDA and K-means
• Combine LDA and K-means into a single algorithm (LDA-Km)
• LDA-Km Algorithm
• Relation to Earlier Approaches (I and II)
• Current Work
• Summary
• Acknowledgements
• Questions?

Slide 1: Matrix-Based Algorithms for Data Mining
Tao Li
School of Computing and Information Sciences
Florida International University
Joint work with Dr. Chris Ding and Dr. Michael I. Jordan

Slide 2: Matrix-Based Algorithms for Data Mining
• Motivation
• Matrix-Based Algorithms for Clustering
• Non-Negative Matrix Factorization (NMF) for Clustering
• Non-Negative Matrix Factorization (NMF) for Consensus Clustering and Semi-Supervised Clustering
• Adaptive Dimension Reduction

Slide 3:
A recent, fast-growing trend is to use eigenvectors and matrix algorithms to solve challenging problems in data mining.

Slide 4: Why Matrix and Eigenvectors?
Matrix methods and linear algebra are attractive because they are:
• Relatively simple, in comparison to probabilistic, information-theoretic, and graph-theoretic approaches
• A well-developed branch of mathematics, with knowledge accumulated since the 1700s
• Supported by many mature software tools developed by the scientific computing community

Slide 5: How did I get into this area as a computer scientist?
Motivated by a research project in computer science.

Slide 6: Computing Systems
(Diagram of an enterprise environment: IBM AIX, HP UX, Sun Solaris, Windows 2000, WebSphere Application Server, Microsoft Exchange, Oracle Database Server, HTTP Server, Enterprise Management Services, and network events and monitors.)
Current computing systems have:
• A growing number of components
• Heterogeneous components
• Dynamic behavior

Slide 7: System Management Overview
(Diagram: an event console connected to a mail server, name server, printer server, file server, database, routers, and a hub.)
System management is key for high performance and availability:
– It collects and processes data in real time to drive automatic actions
– Correlation knowledge needs to be constructed and maintained
– Example of a rebooting signature: a host_down event followed by a host_up event within five seconds => filter the pair out; a host_down without a matching host_up => an operator should be paged (an availability problem)

Slide 8: Key Challenges for Automatic Management
• Need automatic and efficient approaches
– IBM Autonomic Computing (AC) initiative
– System self-management
• Automatically acquiring the needed knowledge from log data
• Heterogeneous nature of the system
– Diversity and disparity in data reporting
– Different formats (syntax) and content (semantics)
– Difficult to correlate logs across different components
Slide 9: Log Data Organization
• Different components use different reporting schemes
– Component A: "A has started"
– Component B: "B has begun execution"
• Difficult to correlate logs across different components
• Consider implementing the rule "if any component has started, notify the system operators"
– You would need to know every component's scheme for "start"
– What happens when a new component is added?
• Problem: how to organize log messages with disparate formats and contents into a canonical form?

Slide 10: Solutions: Common Situations
• Common situations
– Encode semantics about the messages; situation assignment requires understanding the message
– A set of common categories, e.g., start, stop, dependency, create, connection

Slide 11: Computing System Management: a High-Level Conceptual View
(Diagram: component logs flow through log adapters and active data collection into situation identification and categorization; historical data collection feeds offline analysis, including summarization/visualization, rule construction, and temporal pattern discovery; real-time analysis supports anomaly detection, fault diagnosis, and problem determination; a knowledge base holds correlation/dependency knowledge and drives planning and actions.)

Slide 12: Solutions: Common Situations (continued)
• Common situations
– Encode semantics about the messages; situation assignment requires understanding the message
– A set of common categories, e.g., start, stop, dependency, create, connection
• Two challenges:
– Automatically infer the common semantic situations: a clustering problem
– Categorize the log messages into the set of common situations: a classification problem

Slide 13: Clustering Problem
• Find a partition of W such that the points within each cluster are "similar" to each other
K-means:
• Select k centers somehow
• Repeat until the k centers stop changing (or change very little):
– Partition the data according to the k centers
– Use the mean of each cluster to find k new centers
• Treats each attribute equally
• Distance computation: curse of dimensionality

Slide 14: K-means Clustering
Given n points in m dimensions, X = (x_1, x_2, ..., x_n)^T, the K-means objective is

  J_K = min \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2

• Computationally efficient (order mN)
• Most widely used in practice; a benchmark to evaluate other algorithms
• Also called "isodata" or "vector quantization"
• Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)

Slide 15: Log Data Example
• Sample messages from log files
– Start
• S1: "User profile application start successfully"
• S2: "Database application service started"
• S3: "The shell
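The K-means procedure and objective from the last two slides can be sketched in a few lines. This is an illustrative implementation, not code from the talk; the `kmeans` function, its parameters, and the optional `init` argument are assumptions made for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Lloyd-style K-means as described on the slides: repeatedly
    partition the data according to the k centers, then recompute
    each center as the mean of its cluster, until the centers stop
    changing.  Returns labels, centers, and the objective J_K."""
    rng = np.random.default_rng(seed)
    if init is not None:
        centers = np.array(init, dtype=float)
    else:
        # "select k centers somehow": here, k distinct data points
        centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # partition the data according to the k centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # use the mean of each cluster to find the k new centers
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer change
        centers = new_centers
    # K-means objective: J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2
    J = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
    return labels, centers, J
```

Every attribute enters the Euclidean distance equally, which is exactly the "treat each attribute equally" limitation noted on slide 13 and one motivation for the adaptive subspace and dimension-reduction extensions later in the talk.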

