Spatial Outlier Detection and Implementation in Weka
Implemented by: Shan Huang, Jisu Oh
CSCI8715 Class Project, April 27, 2004
Presented by Jisu Oh (Group 2)
Slides available at http://www.users.cs.umn.edu/~joh/csci8715/HW-list.htm

Topics
- Motivation
- Problem Statement
- Key Concepts
- Major Contributions
- Validation Methodology
- Assumptions
- Conclusions
- Future Work

Motivation
Machine learning / data mining
- Enables a computer program to analyze large-scale data
- Identifies important information that can be used to make predictions or decisions faster and more accurately

Motivation (contd.)
Weka
- A collection of machine learning algorithms for solving real-world data mining problems
- Provides data mining functions (e.g., regression, association rules, and clustering algorithms)
- Limitation: operates on traditional non-spatial databases

Problem Statement
Input
- Minneapolis/St. Paul traffic data set
Output: detected outliers as
- Plain text (time slot, time, station, Zs(x))
- Overall traffic volume
- Neighbor relationship graph between stations

Problem Statement (contd.)
Constraints
- The algorithm comes from the paper "A Unified Approach to Detecting Spatial Outliers"
- The data set must be numeric
Objective
- To find sets of spatial outliers and show the results visually

Key Concepts
Spatial outliers
- Definition: spatially referenced objects whose non-spatial attribute values are significantly different from the values in their neighborhood
- Example: a new house in an old neighborhood of a growing metropolitan area
- In this project, an outlier is a station whose volume is high compared with its neighboring stations in a given time slot

Key Concepts (contd.)
Algorithm
- Proposed in the paper "A Unified Approach to Detecting Spatial Outliers" by S. Shekhar, C. T. Lu, and P. Zhang
- $S(x) = f(x) - E_{y \in N(x)}(f(y))$: the difference between the attribute value at x and the average over its neighbors
  - $f(x)$: attribute value of a sensor located at x
  - $E_{y \in N(x)}(f(y))$: average attribute value of x's neighbors N(x), written $E_y$ for short
- $Z_s(x) = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$: the spatial statistic, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of S(x) over all locations, and $\theta$ is the z-score for a user-specified confidence interval

Key Concepts (contd.)
Algorithm (example)
Data before detection:
   1   2   3   4   5
  20   6   7   8   9
   2   5  10  11  12
   7   8 100   2   1
   3   6   7   8   9
- For the cell holding 100: $S(x) = f(x) - E_y = 100 - (2 + 8)/2 = 95$
- $\mu_s = 0.22$, $\sigma_s = 23.8$
- $Z_s(x) = |S(x) - \mu_s| / \sigma_s = 3.98$
- The z-score for a 95% C.I. is about 2, and 3.98 > 2, so 100 is an outlier
- The outlier is replaced by $E_y$: 100 -> 5
Data after replacement:
   1   2   3   4   5
  20   6   7   8   9
   2   5  10  11  12
   7   8   5   2   1
   3   6   7   8   9

Major Contributions
- Top-k outlier query processing
- A user interface similar to Weka's
- Visualization of outliers
  - Plain text (time slot, time, station, Zs(x))
  - Overall traffic volume
  - Neighbor relationship graph between stations
- Keeping user-specified results

Major Contributions (contd.)
Top-k outlier query processing
Fig. 1. Top 3 outliers from data set 19970115N.dat

Major Contributions (contd.)
User interface
Fig. 2. User interface of the spatial outlier detection application vs. Weka

Major Contributions (contd.)
Visualization of outliers
Fig. 3. Plain text results of detected outliers

Major Contributions (contd.)
Visualization of outliers
Fig. 4. Overall traffic volume and neighbor relationship graph between stations, with detected outliers marked

Major Contributions (contd.)
Keeping results
- Users can save and print user-specified results
Let's go to the DEMO!

Validation Methodology
Experiments with three different data sets

  Data set         Station with the most outliers
  19970115N.dat    24
  19970116N.dat    24
  19970125N.dat    124

Assumptions
- The data format is fixed
  - The original data consists of traffic volume and occupancy
  - Outlier detection is based on volume
  - Data format:
      @relation 19970115N
      @station 150
      @timeslot 288
      1 3 4 7 45 100 ….
- Users are familiar with statistical concepts (e.g., confidence interval, C.I.)

Conclusion
- Added a new package to Weka for finding sets of spatial outliers
- Results are shown visually
  - in a user interface similar to Weka's
  - through top-k outlier query processing
  - with visualization of outliers
  - with the option to keep user-specified results

Future Work
- Support additional file formats and data types
- Experiment with different outlier detection algorithms to find a more efficient one
- Add more spatial data mining options, e.g., SAR (Spatial Auto-Regression) and co-location

Thanks!
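
The Zs(x) test from the Key Concepts slides can be sketched in a few lines of Python. This is a minimal illustration, not the project's Weka code. It assumes that each row of the example grid is a line of stations, that a cell's neighbors are the cells immediately to its left and right in the same row, and that edge cells have a single neighbor. The slides do not say how edges are handled or whether a population or sample standard deviation is used, so the mean and standard deviation computed here need not match the slide's 0.22 and 23.8. The function names are hypothetical.

```python
from statistics import mean, pstdev


def neighbor_differences(row):
    """S(x) = f(x) - average of f over x's neighbors.

    Assumption: neighbors are the left/right cells in the same row,
    and each row has at least two cells.
    """
    diffs = []
    for i, value in enumerate(row):
        neighbors = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]
        diffs.append(value - mean(neighbors))
    return diffs


def z_scores(grid):
    """Map (row, col) -> Zs(x) = |S(x) - mu_s| / sigma_s."""
    s = {
        (r, c): diff
        for r, row in enumerate(grid)
        for c, diff in enumerate(neighbor_differences(row))
    }
    values = list(s.values())
    mu_s = mean(values)
    sigma_s = pstdev(values)  # assumption: population standard deviation
    if sigma_s == 0:
        # All differences are identical, so nothing stands out.
        return {cell: 0.0 for cell in s}
    return {cell: abs(diff - mu_s) / sigma_s for cell, diff in s.items()}


def top_k_outliers(grid, k, theta=2.0):
    """Return up to k cells with the largest Zs(x) that exceed theta."""
    ranked = sorted(z_scores(grid).items(), key=lambda item: item[1], reverse=True)
    return [(cell, z) for cell, z in ranked[:k] if z > theta]


# Example grid from the slides; the cell holding 100 is at row 3, column 2.
grid = [
    [1, 2, 3, 4, 5],
    [20, 6, 7, 8, 9],
    [2, 5, 10, 11, 12],
    [7, 8, 100, 2, 1],
    [3, 6, 7, 8, 9],
]
print(top_k_outliers(grid, k=1))
```

With theta set to 2, the cell holding 100 ranks first, in line with the slide's conclusion. Ranking by Zs(x) and cutting at k is one simple way to read the "top-k outliers" query; the actual implementation may differ.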