Advanced classification for streaming time series and data streams
Nguyen, Hai Long
Date of Issue: 2012
School of Computer Engineering
Centre for Advanced Information Systems
EADS Innovation Works South Asia & EDB (Economic Development Board of Singapore)
Overwhelming volumes of sequential data are now common in scientific and business applications such as biomedicine, stock markets, the retail industry, and communication networks. Time series and data streams are the two most popular types of sequential data. The main difference between them is that a time series is defined over a single-variable domain, while a data stream is generally defined over a multivariate domain. However, they share some characteristics: possibly infinite volume, time ordering, and dynamic change. In this dissertation, we propose classification algorithms for time series and data streams that satisfy strict constraints, such as bounded memory, single-pass processing, real-time response, and concept-drift detection. Here, a concept drift refers to the situation where the data's underlying distribution changes over time. For massive time series datasets, classification algorithms based on motifs (frequent subsequences) are preferable, since they not only have low complexity but can also achieve high accuracy. However, state-of-the-art algorithms can only find motifs of a predefined length, which greatly limits their performance and practicality. To overcome this challenge, we introduce the notion of a closed motif: a motif is closed if there is no longer motif with the same number of occurrences. We also propose a novel closed-motif-based classifier, which is lightweight, effective, and efficient for time series classification. We then examine the more challenging problem of classifying data streams in a multivariate domain. Here, we are confronted with the feature drift problem, where the relevance of a set of features changes over time. We propose a general framework that integrates feature selection and heterogeneous ensemble learning; it adapts to different types of concept drift and works well with various kinds of datasets.
The ensemble consists of well-chosen online classifiers and is equipped with an optimal weighting method. It updates its online classifier members for gradual drifts and replaces outdated members with new ones for feature drifts. Additionally, we extend our algorithms to a practical setting in which labeled data is very scarce and data streams must be mined concurrently in order to make full use of the single-pass data. Conventional stream mining algorithms focus only on stand-alone mining tasks. Therefore, we propose an incremental algorithm that performs clustering and classification concurrently, which not only maximizes throughput but also achieves better mining results. Moreover, enhanced with a novel active learning technique, our algorithm requires only a small number of queries to work well with very sparsely labeled data streams. Finally, as the volume of sequential data grows steadily, a single computer with limited computing power may soon be insufficient for the mining processes. Cloud computing, a cutting-edge technology that provides elastic computing on demand, will certainly facilitate large sequential data mining. Therefore, we plan to adapt and migrate our algorithms to a cloud computing platform in the future.
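The closed-motif definition given in the abstract can be illustrated with a minimal brute-force sketch. It is not the dissertation's algorithm, and it rests on several assumptions that are not stated in the abstract: the time series has already been discretized into a string of symbols, occurrences are counted as exact (possibly overlapping) matches, a motif is any subsequence with at least min_support occurrences, and "a longer motif" is read as a longer motif that contains the shorter one. The function names count_subsequences and closed_motifs, and the min_support parameter, are hypothetical.

```python
from collections import Counter


def count_subsequences(symbols):
    """Count every contiguous subsequence of a symbol string.

    Overlapping occurrences are counted separately (an assumption).
    Brute force: O(n^2) subsequences, suitable only for short inputs.
    """
    counts = Counter()
    n = len(symbols)
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            counts[symbols[start:start + length]] += 1
    return counts


def closed_motifs(symbols, min_support=2):
    """Return {motif: occurrences} for motifs that are closed.

    A motif is kept only if no one-symbol extension (to the left or to
    the right) occurs equally often. Checking one-symbol extensions is
    enough here: any longer subsequence containing the motif can occur
    at most as often as the intermediate extensions it passes through.
    """
    counts = count_subsequences(symbols)
    alphabet = set(symbols)
    closed = {}
    for motif, occurrences in counts.items():
        if occurrences < min_support:
            continue  # not frequent enough to be a motif
        extensions = [a + motif for a in alphabet] + [motif + a for a in alphabet]
        if not any(counts[ext] == occurrences for ext in extensions):
            closed[motif] = occurrences
    return closed


if __name__ == "__main__":
    print(closed_motifs("abcabcab", min_support=2))
```

For the toy input "abcabcab" with min_support=2, the closed motifs are "ab" (3 occurrences) and "abcab" (2 occurrences). Shorter motifs such as "a" or "abc" are dropped because a longer motif containing them occurs equally often, which shows how closedness removes the need to fix a motif length in advance.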
DRNTU::Engineering::Computer science and engineering::Information systems::Information systems applications