U of M CSCI 8715 - Spatial Outlier Detection - D100829


School name University of Minnesota- Twin Cities

Course Csci 8715- Spatial Databases and Applications

Pages 8


Project Report (draft version)
“Spatial Outlier Detection”

1. Introduction
2. Motivation
2. Related works
3. Problem Statement
4. Implementation
4.1 Algorithm
References

Name: Jisu Oh, Shan Huang
Date: April 12, 2004
Course: Csci 8715
Professor: Shashi Shekhar

Shan Huang, Jisu Oh
Computer Science Department, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, U.S.A.
E-mail: [email protected], [email protected]
www-users.cs.umn.edu/~joh/csci8715/HW-list.htm

1. Introduction

A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis.

WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. Basic data mining functions as well as regression, association rule, and clustering algorithms have been implemented in WEKA, but these algorithms can only operate on traditional non-spatial databases. The purpose of this project is to build a new class that can detect spatial outliers in a spatial data set.

2. Motivation

Machine learning and data mining discover patterns or structure previously unknown to humans. They enable a computer program to automatically analyze large-scale data and decide what information is most important. We can then use this information to make predictions or to make decisions faster and more accurately.

Many organizations rely on spatial analysis to make business and agency decisions and to conduct research. The main difference between data mining in relational databases and in spatial databases is that the attributes of neighboring objects may influence the current object, so the neighboring objects have to be considered as well.
The explicit location and extension of spatial objects define implicit relations of spatial neighborhood, which are used by spatial data mining algorithms. Therefore, new techniques are required for effective and efficient spatial data mining.

As noted in the Introduction, the algorithms implemented in WEKA can only operate on traditional non-spatial databases. The aim of this project is to build new classes and algorithms that can handle spatial data, such as spatial regression, spatial association rules (co-location), and spatial outlier detection.

2. Related works

Detecting spatial outliers is useful in many applications of geographic information systems, including transportation, ecology, public safety, public health, climatology, and location-based services [2]. Shekhar et al. introduced a method for detecting spatial outliers in graph data sets based on the distribution property of the difference between an attribute value and the average attribute value of its neighbors [3]. Shekhar also proposed an algorithm to find all outliers in a dataset, which replaces many statistical discordance tests, regardless of any knowledge about the underlying distribution of the attributes [7].

Stephen D. Bay et al. introduced a simple nested loop algorithm to detect spatial outliers, which gives near-linear time performance when the data is in random order and a simple pruning rule is used [4]. Existing methods for finding outliers can only deal efficiently with two dimensions/attributes of a dataset.

A distance-based detection method was introduced by Sridhar Ramaswamy et al., which ranks each point on the basis of its distance to its kth nearest neighbor and declares the top n points in this ranking to be outliers.
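The kth-nearest-neighbor ranking described above can be sketched in a few lines. This is a minimal brute-force illustration under our own assumptions, not the implementation from the cited paper: the function name `top_n_knn_outliers` and the sample points are hypothetical, and Euclidean distance is assumed.

```python
import math

def top_n_knn_outliers(points, k, n):
    """Return the indices of the n points with the largest distance
    to their k-th nearest neighbor (brute force, Euclidean distance).
    Assumes 1 <= k <= len(points) - 1."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    # Largest k-th-neighbor distance first; ties broken by index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:n]]

# Hypothetical sample: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_knn_outliers(points, k=2, n=1))  # -> [4]
```

Sorting the distances for every point costs O(N^2 log N) overall, so a sketch like this only suits small data sets.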
A highly efficient partition-based algorithm was also introduced in the same paper [6]. Edwin M. Knorr et al. proposed another distance-based outlier detection method that can be done efficiently for large datasets, and for k-dimensional datasets with large values of k [9].

Spatial outliers are most often represented as point data, but they are also frequently represented as regions, i.e., groups of points. Jiang Zhao et al. proposed a wavelet-analysis-based approach to detect region outliers [5]. Markus M. Breunig et al. took a different approach to detecting outliers: each object is assigned a degree of being an outlier, called the local outlier factor of the object, which depends on how isolated the object is with respect to its surrounding neighborhood [10].

Currently, many spatial statistics software packages are available. S-PLUS spatial statistics is the first comprehensive, object-oriented software package for the analysis of spatial data. It includes a fairly wide range of techniques for spatial data analysis. R is a language similar to S for statistical data analysis, based on modern programming concepts and released under the GNU General Public License. It follows a broad outline of existing collections of functions for spatial statistics written for S. Functions for three types of spatial statistics are covered: spatially continuous data, point pattern data, and area data.

SAS is another powerful analytical and reporting system. The SAS Bridge for ESRI provides a new way to exchange spatial attribute data between ArcGIS, the market-leading geographic information system (GIS) software from ESRI, and SAS. This product links spatial, numeric, and textual data through a single interface to improve efficiency, produce more intelligent results, and communicate those results more effectively.

3. Problem Statement

The input data sets used in this project were collected from sensor stations embedded in the Interstate highways surrounding the Twin Cities area in Minnesota, U.S. Each station measures the traffic volume and occupancy on a particular stretch of highway at 5-minute intervals. Each data set consists of 288 rows of 5-minute detector records, starting from 0:00 AM; each row contains 300 tuples of (volume, occupancy) for 150 stations; each tuple in the row represents the traffic volume and occupancy of the detector within the 5-minute period. Neighborhood is defined in terms of topological rather than Euclidean distance. Our objective is to determine stations that are “outliers” based
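For illustration only, the neighborhood-difference idea cited in Related works [3] might be applied to station readings as sketched below. This is one possible reading of that idea, not the algorithm of this report: each station's value is compared with the average of its topological neighbors, and the differences are standardized. The function name `neighbor_difference_outliers`, the threshold of 2.0, the chain-shaped neighbor map, and the volume figures are all hypothetical.

```python
from statistics import mean, pstdev

def neighbor_difference_outliers(values, neighbors, threshold=2.0):
    """Flag stations whose value differs unusually from the average
    of their topological neighbors.

    values:    {station_id: attribute value, e.g. volume in one 5-min slot}
    neighbors: {station_id: list of adjacent station_ids}
    """
    # Difference between each station and its neighborhood average.
    diffs = {s: values[s] - mean(values[n] for n in neighbors[s])
             for s in values}
    mu = mean(diffs.values())
    sigma = pstdev(diffs.values())
    if sigma == 0:
        return []  # all differences identical: nothing stands out
    # Standardize the differences and keep those beyond the threshold.
    return sorted(s for s, d in diffs.items()
                  if abs((d - mu) / sigma) > threshold)

# Hypothetical example: seven stations along one stretch of highway,
# where each station's neighbors are the adjacent stations.
volumes = {0: 50, 1: 52, 2: 51, 3: 120, 4: 49, 5: 50, 6: 51}
adjacent = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4],
            4: [3, 5], 5: [4, 6], 6: [5]}
print(neighbor_difference_outliers(volumes, adjacent))  # -> [3]
```

Because neighbors come from an adjacency map rather than from coordinates, the sketch follows the topological notion of neighborhood used in this section.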
