**UFRJ**

# Data Mining in Time Series Databases (World Scientific)

(Parte **1** de 6)

Editors:H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)

Vol. 43:Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)

Vol. 4:Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)

Vol. 45:Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)

Vol. 46:Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)

Vol. 47:Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)

Vol. 48:Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)

Vol. 49:Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)

Vol. 50:Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips)

Vol. 51:Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer)

Vol. 52:Advances in Image Processing and Understanding

A Festschrift for Thomas S. Huwang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof)

Vol. 53:Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal)

Vol. 54:Fundamentals of Robotics — Linking Perception to Action (M. Xie)

Vol. 5:Web Document Analysis: Challenges and Opportunities (Eds. A. Antonacopoulos and J. Hu)

Vol. 56:Artificial Intelligence Methods in Software Testing (Eds. M. Last, A. Kandel and H. Bunke)

Vol. 57:Data Mining in Time Series Databases (Eds. M. Last, A. Kandel and H. Bunke)

Vol. 58:Computational Web Intelligence: Intelligent Technology for

Web Applications (Eds. Y. Zhang, A. Kandel, T. Y. Lin and Y. Yao)

Vol. 59:Fuzzy Neural Network Theory and Application (P. Liu and H. Li)

*For the complete list of titles in this series, please write to the Publisher.

Series in Machine Perception and Artificial Intelligence - Vol, 57

Editors

Mark Last Ben-Gurion LIniversity of the Negeu, Israel

Abraham Kandel Zl-Auiv University, Israel

University of South Florida, Tampa, LISA

Horst Bunke University of Bern, Switzerland

vp World Scientific * *

British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 2 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

ISBN 981-238-290-9

Typeset by Stallion Press Email: enquiries@stallionpress.com

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

Copyright © 2004 by World Scientific Publishing Co. Pte. Ltd.

Published by

World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Printed in Singapore by World Scientific Printers (S) Pte Ltd

DATA MINING IN TIME SERIES DATABASES Series in Machine Perception and Artificial Intelligence (Vol. 57)

Dedicated to

The Honorable Congressman C. W. Bill Young House of Representatives

For his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida

This page intentionally left blank This page intentionally left blank

Preface

Traditional data mining methods are designed to deal with “static” databases, i.e. databases where the ordering of records (or other database objects) has nothing to do with the patterns of interest. Though the assumption of order irrelevance may be suﬃciently accurate in some applications, there are certainly many other cases, where sequential information, such as a time-stamp associated with every record, can signiﬁcantly enhance our knowledge about the mined data. One example is a series of stock values: a speciﬁc closing price recorded yesterday has a completely diﬀerent meaning than the same value a year ago. Since most today’s databases already include temporal data in the form of “date created”, “date modiﬁed”, and other time-related ﬁelds, the only problem is how to exploit this valuable information to our beneﬁt. In other words, the question we are currently facing is: How to mine time series data?

The purpose of this volume is to present some recent advances in preprocessing, mining, and interpretation of temporal data that is stored by modern information systems. Adding the time dimension to a database produces a Time Series Database (TSDB) and introduces new aspects and challenges to the tasks of data mining and knowledge discovery. These new challenges include: ﬁnding the most eﬃcient representation of time series data, measuring similarity of time series, detecting change points in time series, and time series classiﬁcation and clustering. Some of these problems have been treated in the past by experts in time series analysis. However, statistical methods of time series analysis are focused on sequences of values representing a single numeric variable (e.g., price of a speciﬁc stock). In a real-world database, a time-stamped record may include several numerical and nominal attributes, which may depend not only on the time dimension but also on each other. To make the data mining task even more complicated, the objects in a time series may represent some complex graph structures rather than vectors of feature-values.

vii viii Preface

Our book covers the state-of-the-art research in several areas of time series data mining. Speciﬁc problems challenged by the authors of this volume are as follows.

Representation of Time Series. Eﬃcient and eﬀective representation of time series is a key to successful discovery of time-related patterns. The most frequently used representation of single-variable time series is piecewise linear approximation, where the original points are reduced to a set of straight lines (“segments”). Chapter 1 by Eamonn Keogh, Selina Chu, David Hart, and Michael Pazzani provides an extensive and comparative overview of existing techniques for time series segmentation. In the view of shortcomings of existing approaches, the same chapter introduces an improved segmentation algorithm called SWAB (Sliding Window and Bottom-up).

Indexing and Retrieval of Time Series. Since each time series is characterized by a large, potentially unlimited number of points, ﬁnding two identical time series for any phenomenon is hopeless. Thus, researchers have been looking for sets of similar data sequences that diﬀer only slightly from each other. The problem of retrieving similar series arises in many areas such as marketing and stock data analysis, meteorological studies, and medical diagnosis. An overview of current methods for eﬃcient retrieval of time series is presented in Chapter 2 by Magnus Lie Hetland. Chapter 3 (by Eugene Fink and Kevin B. Pratt) presents a new method for fast compression and indexing of time series. A robust similarity measure for retrieval of noisy time series is described and evaluated by Michail Vlachos, Dimitrios Gunopulos, and Gautam Das in Chapter 4.

Change Detection in Time Series. The problem of change point detection in a sequence of values has been studied in the past, especially in the context of time series segmentation (see above). However, the nature of real-world time series may be much more complex, involving multivariate and even graph data. Chapter 5 (by Gil Zeira, Oded Maimon, Mark Last, and Lior Rokach) covers the problem of change detection in a classiﬁcation model induced by a data mining algorithm from time series data. A change detection procedure for detecting abnormal events in time series of graphs is presented by Horst Bunke and Miro Kraetzl in Chapter 6. The procedure is applied to abnormal event detection in a computer network.

Classiﬁcation of Time Series. Rather than partitioning a time series into segments, one can see each time series, or any other sequence of data points, as a single object. Classiﬁcation and clustering of such complex

Preface ix

“objects” may be particularly beneﬁcial for the areas of process control, intrusion detection, and character recognition. In Chapter 7, Carlos J. Alonso Gonzalez and Juan J. Rodrıguez Diez present a new method for early classiﬁcation of multivariate time series. Their method is capable of learning from series of variable length and able of providing a classiﬁcation when only part of the series is presented to the classiﬁer. A novel concept of representing time series by median strings (see Chapter 8, by Xiaoyi Jiang, Horst Bunke, and Janos Csirik) opens new opportunities for applying classiﬁcation and clustering methods of data mining to sequential data.

As indicated above, the area of mining time series databases still includes many unexplored and insuﬃciently explored issues. Speciﬁc suggestions for future research can be found in individual chapters. In general, we believe that interesting and useful results can be obtained by applying the methods described in this book to real-world sets of sequential data.

Acknowledgments

The preparation of this volume was partially supported by the National Institute for Systems Test and Productivity at the University of South Florida under U.S. Space and Naval Warfare Systems Command grant number N00039-01-1-2248.

We also would like to acknowledge the generous support and cooperation of: Ben-Gurion University of the Negev, Department of Information Systems Engineering, University of South Florida, Department of Computer Science and Engineering, Tel-Aviv University, College of Engineering, The Fulbright Foundation, The US-Israel Educational Foundation.

January 2004 Mark Last

Abraham Kandel Horst Bunke

This page intentionally left blank This page intentionally left blank

Preface | vii |

Contents

and Novel Approach | 1 |

Chapter 1 Segmenting Time Series: A Survey E. Keogh, S. Chu, D. Hart and M. Pazzani

Retrieval of Similar Time Sequences | 23 |

Chapter 2 A Survey of Recent Methods for Eﬃcient M. L. Hetland

Chapter 3 Indexing of Compressed Time Series | 43 |

E.F inka ndK .B .P ratt

Chapter 4 Indexing Time-Series under Conditions of Noise | 67 |

M. Vlachos, D. Gunopulos and G. Das

Induced from Time Series Data | 101 |

Chapter 5 Change Detection in Classiﬁcation Models G. Zeira, O. Maimon, M. Last and L. Rokach

Abnormal Events in Time Series of Graphs | 127 |

Chapter 6 Classiﬁcation and Detection of H. Bunke and M. Kraetzl

Variable Length and Early Classiﬁcation | 149 |

Chapter 7 Boosting Interval-Based Literals: C. J. Alonso Gonzalez and J. J. Rodrıguez Diez

Chapter 8 Median Strings: A Review | 173 |

X. Jiang, H. Bunke and J. Csirik

This page intentionally left blank This page intentionally left blank

CHAPTER 1

Eamonn Keogh

Computer Science & Engineering Department, University of California —

Riverside, Riverside, California 92521, USA E-mail: eamonn@cs.ucr.edu

Selina Chu, David Hart, and Michael Pazzani

Department of Information and Computer Science, University of California,

Irvine, California 92697, USA E-mail: {selina, dhart, pazzani}@ics.uci.edu

In recent years, there has been an explosion of interest in mining time series databases. As with most computer science problems, representation of the data is the key to eﬃcient and eﬀective solutions. One of the most commonly used representations is piecewise linear approximation. This representation has been used by various researchers to support clustering, classiﬁcation, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain this representation, with several algorithms having been independently rediscovered several times. In this chapter, we undertake the ﬁrst extensive review and empirical comparison of all proposed techniques. We show that all these algorithms have fatal ﬂaws from a data mining perspective. We introduce a novel algorithm that we empirically show to be superior to all others in the literature.

Keywords: Time series; data mining; piecewise linear approximation; segmentation; regression.

1. Introduction

In recent years, there has been an explosion of interest in mining time series databases. As with most computer science problems, representation of the data is the key to eﬃcient and eﬀective solutions. Several high level

2 E. Keogh, S. Chu, D. Hart and M. Pazzani (a) (b)

Fig. 1. Two time series and their piecewise linear representation. (a) Space Shuttle Telemetry. (b) Electrocardiogram (ECG).

representations of time series have been proposed, including Fourier Transforms [Agrawal et al. (1993), Keogh et al. (2000)], Wavelets [Chan and Fu (1999)], Symbolic Mappings [Agrawal et al. (1995), Das et al. (1998), Perng et al. (2000)] and Piecewise Linear Representation (PLR). In this work, we conﬁne our attention to PLR, perhaps the most frequently used representation [Ge and Smyth (2001), Last et al. (2001), Hunter and McIntosh (1999), Koski et al. (1995), Keogh and Pazzani (1998), Keogh and Pazzani (1999), Keogh and Smyth (1997), Lavrenko et al. (2000), Li et al. (1998), Osaki et al. (1999), Park et al. (2001), Park et al. (1999), Qu et al. (1998), Shatkay (1995), Shatkay and Zdonik (1996), Vullings et al. (1997), Wang and Wang (2000)].

Intuitively, Piecewise Linear Representation refers to the approximation of a time series T, of length n, with K straight lines (hereafter known as segments). Figure 1 contains two examples. Because K is typically much smaller that n, this representation makes the storage, transmission and computation of the data more eﬃcient. Speciﬁcally, in the context of data mining, the piecewise linear representation has been used to:

• Support fast exact similarly search [Keogh et al. (2000)]. • Support novel distance measures for time series, including “fuzzy queries”

[Shatkay (1995), Shatkay and Zdonik (1996)], weighted queries [Keogh and Pazzani (1998)], multiresolution queries [Wang and Wang (2000), Li et al. (1998)], dynamic time warping [Park et al. (1999)] and relevance feedback [Keogh and Pazzani (1999)].

• Support concurrent mining of text and time series [Lavrenko et al. (2000)].

• Support novel clustering and classiﬁcation algorithms [Keogh and

• Support change point detection [Sugiura and Ogden (1994), Ge and Smyth (2001)].

Segmenting Time Series: A Survey and Novel Approach 3

Surprisingly, in spite of the ubiquity of this representation, with the exception of [Shatkay (1995)], there has been little attempt to understand and compare the algorithms that produce it. Indeed, there does not even appear to be a consensus on what to call such an algorithm. For clarity, we will refer to these types of algorithm, which input a time series and return a piecewise linear representation, as segmentation algorithms. The segmentation problem can be framed in several ways.

• Given a time series T, produce the best representation using only K segments.

• Given a time series T, produce the best representation such that the maximum error for any segment does not exceed some user-speciﬁed threshold, max error.

(Parte **1** de 6)