Matrix-Based Algorithms for Data Mining

Slide titles (full deck):
• Matrix-Based Algorithms for Data Mining
• Why Matrix and Eigenvectors?
• How did I get into this area as a computer scientist?
• Computing Systems
• Key Challenges for Automatic Management
• Log Data Organization
• Solutions: Common Situations
• Clustering Problem
• K-means clustering
• Log Data Example
• Clustering Model
• Talk Outline
• Optimization: Matrix Perspective
• First try: Iterative Feature and Data (IFD) Clustering
• Extension I: Adaptive Subspace Clustering
• Extension II: A General Model
• Quick Summary on Clustering
• Related Work Summary
• Nonnegative Matrix Factorization (NMF)
• NMF Gives Holistic Pictures I and II
• NMF is doing "Data Clustering"
• Reformulate K-means Clustering
• K-means Clustering Theorem
• NMF Generalizations (Li & Ding, ICDM 2006)
• NMF and PLSI
• Orthogonal Nonnegative Tri-Factorization
• Symmetric NMF
• Semi-NMF
• Can we extend NMF for other data mining problems?
• An Illustrating Example
• Combining Multiple Clusterings
• Consensus Clustering Applications (Li, Ma & Ogihara, CIKM 2004)
• Consensus Clustering
• Consensus Clustering is Equivalent to Clustering Consensus Association (Li, Ding & Jordan, ICDM 2007)
• NMF Formulations
• Weighted Consensus Clustering
• Semi-Supervised Clustering
• Constraints
• NMF-based Algorithm
• Dimension Reduction and Clustering
• Relationship between LDA and K-means Clustering
• LDA and K-means
• Combine LDA and K-means into a single algorithm (LDA-Km)
• LDA-Km Algorithm
• Relation to Earlier Approaches (I and II)
• Current Work
• Summary
• Acknowledgements
• Questions?

Slide 1: Matrix-Based Algorithms for Data Mining
Tao Li
School of Computing and Information Sciences
Florida International University
Joint work with Dr. Chris Ding and Dr. Michael I. Jordan

Slide 2: Matrix-Based Algorithms for Data Mining
• Motivation
• Matrix-Based Algorithms for Clustering
• Non-Negative Matrix Factorization (NMF) for Clustering
• Non-Negative Matrix Factorization (NMF) for Consensus Clustering and Semi-Supervised Clustering
• Adaptive Dimension Reduction

Slide 3:
A recent, fast-growing trend is to use eigenvectors and matrix algorithms to solve challenging problems in data mining.

Slide 4: Why Matrix and Eigenvectors?
Matrix methods and linear algebra are attractive because they are:
• Relatively simple, in comparison to probabilistic, information-theoretic, and graph-theoretic approaches
• A well-developed branch of mathematics, with knowledge accumulated since the 1700s
• Supported by many mature software tools developed by the scientific computing community

Slide 5: How did I get into this area as a computer scientist?
Motivated by a research project in computer science.

Slide 6: Computing Systems
(Diagram of an enterprise environment: IBM AIX, HP UX, Sun Solaris, Windows 2000, WebSphere Application Server, Microsoft Exchange, Oracle Database Server, HTTP Server, Enterprise Management Services, and network events and monitors.)
Current computing systems have:
• A growing number of components
• Heterogeneous components
• Dynamic behavior

Slide 7: System Management Overview
(Diagram: an event console connected to a mail server, name server, printer server, file server, database, routers, and a hub.)
System management is key for high performance and availability:
– It collects and processes data in real time to drive automatic actions
– Correlation knowledge needs to be constructed and maintained
– Example of a rebooting signature: a host_down event followed by a host_up event within five seconds => filter the pair out; a host_down without a matching host_up => an operator should be paged (an availability problem)

Slide 8: Key Challenges for Automatic Management
• Need automatic and efficient approaches
– IBM Autonomic Computing (AC) initiative
– System self-management
• Automatically acquiring the needed knowledge from log data
• Heterogeneous nature of the system
– Diversity and disparity in data reporting
– Different formats (syntax) and content (semantics)
– Difficult to correlate logs across different components
Slide 9: Log Data Organization
• Different components use different reporting schemes
– Component A: "A has started"
– Component B: "B has begun execution"
• Difficult to correlate logs across different components
• Consider implementing the rule "if any component has started, notify the system operators"
– You would need to know every component's scheme for "start"
– What happens when a new component is added?
• Problem: how to organize log messages with disparate formats and contents into a canonical form?

Slide 10: Solutions: Common Situations
• Common situations
– Encode semantics about the messages; situation assignment requires understanding the message
– A set of common categories, e.g., start, stop, dependency, create, connection

Slide 11: Computing System Management: a High-Level Conceptual View
(Diagram: component logs flow through log adapters and active data collection into situation identification and categorization; historical data collection feeds offline analysis, including summarization/visualization, rule construction, and temporal pattern discovery; real-time analysis supports anomaly detection, fault diagnosis, and problem determination; a knowledge base holds correlation/dependency knowledge and drives planning and actions.)

Slide 12: Solutions: Common Situations (continued)
• Common situations
– Encode semantics about the messages; situation assignment requires understanding the message
– A set of common categories, e.g., start, stop, dependency, create, connection
• Two challenges:
– Automatically infer the common semantic situations: a clustering problem
– Categorize the log messages into the set of common situations: a classification problem

Slide 13: Clustering Problem
• Find a partition of W such that the points within each cluster are "similar" to each other
K-means:
• Select k centers somehow
• Repeat until the k centers stop changing (or change very little):
– Partition the data according to the k centers
– Use the mean of each cluster to find k new centers
• Treats each attribute equally
• Distance computation: curse of dimensionality

Slide 14: K-means Clustering
Given n points in m dimensions, X = (x_1, x_2, ..., x_n)^T, the K-means objective is

  J_K = min \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2

• Computationally efficient (order mN)
• Most widely used in practice; a benchmark to evaluate other algorithms
• Also called "isodata" or "vector quantization"
• Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)

Slide 15: Log Data Example
• Sample messages from log files
– Start
• S1: "User profile application start successfully"
• S2: "Database application service started"
• S3: "The shell
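The K-means procedure and objective from the last two slides can be sketched in a few lines. This is an illustrative implementation, not code from the talk; the `kmeans` function, its parameters, and the optional `init` argument are assumptions made for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, init=None, seed=0):
    """Lloyd-style K-means as described on the slides: repeatedly
    partition the data according to the k centers, then recompute
    each center as the mean of its cluster, until the centers stop
    changing.  Returns labels, centers, and the objective J_K."""
    rng = np.random.default_rng(seed)
    if init is not None:
        centers = np.array(init, dtype=float)
    else:
        # "select k centers somehow": here, k distinct data points
        centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # partition the data according to the k centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # use the mean of each cluster to find the k new centers
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer change
        centers = new_centers
    # K-means objective: J_K = sum_k sum_{i in C_k} ||x_i - c_k||^2
    J = sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))
    return labels, centers, J
```

Every attribute enters the Euclidean distance equally, which is exactly the "treat each attribute equally" limitation noted on slide 13 and one motivation for the adaptive subspace and dimension-reduction extensions later in the talk.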

