Assumption-Free Anomaly Detection in Time Series

Li Wei, Nitin Kumar, Venkata Lolla, Eamonn Keogh, Stefano Lonardi, Chotirat Ann Ratanamahatana
University of California - Riverside
Department of Computer Science & Engineering
Riverside, CA 92521, USA
{wli, nkumar, vlolla, eamonn, stelo, ratana}@cs.ucr.edu

Abstract

Recent advancements in sensor technology have made it possible to collect enormous amounts of data in real time. However, because of the sheer volume of data, most of it will never be inspected by an algorithm, much less a human being. One way to mitigate this problem is to perform some type of anomaly (novelty / interestingness / surprisingness) detection and flag unusual patterns for further inspection by humans or more CPU-intensive algorithms. Most current solutions are “custom made” for particular domains, such as ECG monitoring, valve pressure monitoring, etc. This customization requires extensive effort by domain experts. Furthermore, hand-crafted systems tend to be very brittle to concept drift. In this demonstration, we will show an online anomaly detection system that does not need to be customized for individual domains, yet performs with exceptionally high precision/recall. The system is based on the recently introduced idea of time series bitmaps. To demonstrate the universality of our system, we will allow testing on independently annotated datasets from domains as diverse as ECGs, Space Shuttle telemetry monitoring, video surveillance, and respiratory data. In addition, we invite attendees to test our system with any dataset available on the web.

1. Introduction

Recent advancements in sensor technology have made it possible to collect enormous amounts of data in real time. However, because of the sheer volume of data, most of it is never inspected by an algorithm, much less a human being.
One way to mitigate this problem is to perform some type of anomaly (novelty / interestingness / surprisingness) detection and to flag unusual patterns for future inspection by humans or more CPU-intensive algorithms. Most current solutions are “custom made” for particular domains, such as ECG monitoring, valve pressure monitoring, etc. This customization requires extensive effort by domain experts. Furthermore, hand-crafted systems tend to be very brittle to concept drift. In this demonstration, we will show an online anomaly detection system that does not need to be customized for individual domains, yet performs with exceptionally high precision/recall. The system is based on the recently introduced idea of time series bitmaps [11]. It allows users to efficiently navigate through a time series of arbitrary length and identify portions that require further investigation. Figure 1 illustrates the graphical interface of our system (footnote 1).

Figure 1. A snapshot of the anomaly detection tool.

To demonstrate the universality of our system, we will allow testing on independently annotated datasets from domains as diverse as ECGs, Space Shuttle telemetry monitoring, video surveillance, and respiratory data. In addition, we invite attendees to test our system with any dataset available on the web.

2. Background and Related Work

In this section, we give brief reviews of chaos games and symbolic representations of time series, which are at the heart of our anomaly detection technique.

2.1 Chaos Game Representations

Our visualization technique is partly inspired by an algorithm to draw fractals called the Chaos game [1]. The method can produce a representation of DNA sequences, in which both local and global patterns are displayed. The basic idea is to map frequency counts of DNA substrings of length L into a 2^L by 2^L matrix as shown in Figure 2, then color-code these frequency counts.
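
The mapping just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadrant assigned to each symbol (QUADRANT) is a guess, since the layout of Figure 2 is not recoverable from the text, and the function names cell_for, frequency_matrix, and normalize are hypothetical. Each symbol of a substring selects one quadrant of the current square, so a substring of length L lands in exactly one cell of the 2^L by 2^L grid.

```python
from collections import Counter

# Assumed quadrant (row, col) for each symbol. The layout used in the
# paper's Figure 2 may differ; any fixed assignment works the same way.
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def cell_for(word):
    """Return the (row, col) of a word in the 2^L by 2^L grid, L = len(word)."""
    row = col = 0
    for symbol in word:
        r, c = QUADRANT[symbol]
        # Each symbol halves the current square and picks one quadrant.
        row = 2 * row + r
        col = 2 * col + c
    return row, col


def frequency_matrix(sequence, L):
    """Count every length-L substring and store the counts in the grid."""
    size = 2 ** L
    grid = [[0] * size for _ in range(size)]
    words = (sequence[i:i + L] for i in range(len(sequence) - L + 1))
    for word, count in Counter(words).items():
        # Skip substrings containing symbols outside the alphabet.
        if all(symbol in QUADRANT for symbol in word):
            row, col = cell_for(word)
            grid[row][col] = count
    return grid


def normalize(grid):
    """Scale counts to [0, 1] so they can be mapped onto a color scale."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0 for _ in row] for row in grid]
    return [[value / peak for value in row] for row in grid]


if __name__ == "__main__":
    for row in normalize(frequency_matrix("ACGTAC", 2)):
        print(row)
```

Normalizing by the largest count is only one possible way to prepare the matrix for color-coding; the text does not say which scaling the authors use.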
From our point of view, the crucial observation is that the CGR representation of a sequence allows the investigation of the patterns in sequences, giving the human eye a possibility to recognize hidden structures.

Footnote 1: We encourage the interested reader to visit [5] to view full color examples of all figures in this work.

Figure 2. The quad-tree representation of a sequence over the alphabet {A,C,G,T} at different levels of resolution.

We can get a hint of the potential utility of the approach if, for example, we take the first 5,000 symbols of the mitochondrial DNA sequences of four familiar species and use them to create their own file icons. Figure 3 below illustrates this. Note that Pan troglodytes is the familiar Chimpanzee, and Loxodonta africana and Elephas maximus are the African and Indian Elephants, respectively. Even if we did not know these particular animals, we would have no problem recognizing that there are two pairs of highly related species being considered.

Figure 3. The bitmap representation of the gene sequences of four animals.

With respect to non-genetic sequences, Joel Jeffrey noted, “The CGR algorithm produces a CGR for any sequence of letters” [4]. However, it is only defined for discrete sequences, and most time series are real valued. The results in Figure 3 encouraged us to try a similar technique on real valued time series data and investigate the utility of such a representation on the data mining task of anomaly detection. Since CGR involves treating a data input as an abstract string of symbols, a discretization method is necessary to transform continuous time series data into a discrete domain. For this purpose, we used the Symbolic Aggregate approXimation (SAX) [8], which we review below.

2.2 Symbolic Time Series Representations

While there are at least 200 techniques in the literature for converting real valued time series into discrete symbols, the SAX technique of Lin et al. [8] is unique and ideally suited for data mining.
SAX is the only symbolic representation that allows the lower bounding of the distances in the original space. The SAX representation is created by taking a real valued signal and dividing it into equal sized sections. The mean value of each section is then calculated. By substituting each section with its mean, a reduced dimensionality piecewise constant approximation of the data is obtained. This representation is then discretized in such a manner as to produce a word with approximately equi-probable symbols. Figure 4 shows a short time series being converted into the SAX word baabccbc.

Figure 4. A real valued time series can be converted to the SAX word baabccbc.

It has been pointed out that when processing
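
The SAX construction described in Section 2.2 (equal sized sections, section means, then discretization into roughly equi-probable symbols) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the series is z-normalized first and that the symbol boundaries are the equal-probability quantiles of a standard normal distribution; neither detail is stated in the text above. The function name sax_word and its parameters are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, n_segments, alphabet="abc"):
    """Convert a real valued series into a SAX-style word (sketch)."""
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")

    # Assumption: z-normalize so that Gaussian breakpoints are meaningful.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * n
    else:
        normalized = [(x - mu) / sigma for x in series]

    # Piecewise constant approximation: replace each section by its mean.
    section_means = [
        mean(normalized[i * n // n_segments:(i + 1) * n // n_segments])
        for i in range(n_segments)
    ]

    # Assumption: breakpoints split the standard normal distribution into
    # len(alphabet) regions of equal probability.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Each section mean becomes the symbol of the region it falls into.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in section_means)


if __name__ == "__main__":
    print(sax_word([0, 0, 0, 0, 10, 10, 10, 10, 5, 5, 5, 5], 3))  # prints "acb"
```

With a three-letter alphabet the low plateau maps to a, the high plateau to c, and the middle plateau to b. The resulting word can then be treated as a discrete sequence, which is the form of input that CGR requires.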

