No.122 Analysing Large Collections of Time Series


NII Shonan Meeting Seminar 122

Overview

Shonan Village Center
12 – 15 February, 2018

Due to advances in sensor technology, there has been a tremendous increase in the availability of massive data streams, many of which are time series in nature. For example, sensors can measure parameters of a manufacturing environment, the vital signs of a medical patient, or the fitness of a healthy person; movement sensors can be installed in a fixed environment; and traffic sensors can count the people or vehicles in a network. Other settings where massive time series data are generated include web applications tracking user clicks, machine log data generated in an IT infrastructure, and point-of-sale data in a large store. The aim of capturing these data may be to monitor production quality, monitor the state of health of a patient, detect intruders climbing over fences, predict whether a user is likely to click on an advertisement, forecast the number of passengers on a particular train route, and so on.

These advances in sensor and cloud technologies have led to the Internet of Things (IoT) phenomenon, where huge sets of time series are collected in a distributed way and either the data themselves or some summary of them are transferred to a centralized cloud.

As a result of this deluge of information, new paradigms are needed for working with time series data. Instead of working at the level of individual observations, we can consider each time series as a single data point in a space of time series. Data analysis tasks such as forecasting, clustering, density estimation and outlier detection have largely been developed for Euclidean (feature) spaces, and cannot easily be applied in these spaces of time series. We need new algorithmic methods to handle the infinite-dimensional geometry of the space of time series.
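As a toy illustration of the "series as points" paradigm, the sketch below (in Python, with simulated data) maps each series in a collection to a point in a small feature space and clusters those points. The feature choices and all data here are assumptions for demonstration only, not a recommended pipeline.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical collection: 200 series of length 100 (random walks).
    series = rng.normal(size=(200, 100)).cumsum(axis=1)

    def features(x):
        # Map one series to a point in a low-dimensional feature space:
        # mean, standard deviation, and lag-1 autocorrelation.
        x = np.asarray(x, dtype=float)
        d = x - x.mean()
        acf1 = (d[:-1] * d[1:]).sum() / (d * d).sum()
        return np.array([x.mean(), x.std(), acf1])

    points = np.array([features(x) for x in series])  # one point per series
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(points)
    print(labels[:10])

Any per-point method (clustering here, but equally density estimation or outlier scoring) then operates on the collection as a whole rather than on individual observations.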

Indeed, forecasting has a long history and has been studied in various scientific communities, including statistics, econometrics, control engineering and computer science. Many of the widely-used techniques (such as exponential smoothing and the Kalman filter) were developed in the 1960s. From an algorithmic perspective, these methods are elegant and efficient, which makes them very appealing when computational power is scarce. Since then, much progress has been made with respect to both theoretical and computational aspects of forecasting. However, the focus has been limited to forecasting individual time series, or a small number of time series. New methods are required in order to develop algorithms and models designed for forecasting millions of related series.
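For concreteness, here is a minimal sketch of simple exponential smoothing, one of the 1960s-era methods mentioned above; the smoothing parameter and the demo data are illustrative. Its single O(n) pass per series hints at why such methods remain attractive when run over many series.

    import numpy as np

    def ses_forecast(y, alpha=0.3):
        """One-step forecast: level_t = alpha*y_t + (1-alpha)*level_{t-1}."""
        level = y[0]
        for obs in y[1:]:
            level = alpha * obs + (1 - alpha) * level
        return level  # the smoothed level serves as the next-step forecast

    y = np.array([12.0, 13.5, 13.1, 14.2, 13.8, 14.5])
    print(ses_forecast(y))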

Once we take the perspective of studying a space of time series, we can consider potential time series that have not yet been observed (e.g., the data that will be observed after we install a new sensor). We may wish to forecast these unobserved time series, but the existing paradigms provide no way of doing so.
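Purely as an illustration of what such a method might look like, and not an established technique, the sketch below forecasts a not-yet-observed series by averaging the latest values of its nearest neighbours in a space of known sensor covariates. All data, names, and the k-nearest-neighbour rule are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    covariates = rng.uniform(size=(50, 2))                  # known attributes per sensor
    histories = rng.normal(size=(50, 30)).cumsum(axis=1)    # observed series

    def cold_start_forecast(new_cov, k=5):
        # Find the k observed sensors closest to the new sensor's covariates.
        dist = np.linalg.norm(covariates - new_cov, axis=1)
        nearest = np.argsort(dist)[:k]
        # Borrow the neighbours' most recent values as a crude forecast.
        return histories[nearest, -1].mean()

    print(cold_start_forecast(np.array([0.4, 0.7])))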

Visualization of large collections of time series is also challenging, and is impossible using classical time series graphics. Similarly, identifying outliers in a space of time series, or defining the "median" of a large collection of time series, are difficult tasks for which existing tools are very limited or non-existent.
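One simple way to make the "median" idea concrete is a medoid: the series with minimal total distance to all others. The sketch below uses plain Euclidean distance between whole series, which is an assumption (other metrics may be more suitable), and ranks series by distance from the medoid as a crude outlier score.

    import numpy as np

    rng = np.random.default_rng(2)
    series = rng.normal(size=(100, 50)).cumsum(axis=1)  # toy collection

    # Pairwise Euclidean distances between whole series.
    diff = series[:, None, :] - series[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    medoid = dist.sum(axis=1).argmin()       # "median" of the collection
    outliers = dist[medoid].argsort()[-5:]   # five series farthest from it
    print(medoid, outliers)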

The workshop will bring together researchers in machine learning, statistics, econometrics and computer science along with industry practitioners to discuss the computational and conceptual challenges that we are facing in analysing very large collections of time series. In particular, we have invited researchers from the following communities, where we see huge potential for cross-fertilization under the lens of time series data.

  • Computational topology is a field at the intersection of computer science and mathematics that is concerned with algorithms to compute topological features of point clouds. Its focus on topology rather than geometry allows it to detect and reason about non-linear structures underlying the data. Topological data analysis (TDA) is an emerging field that draws on concepts and methods developed in computational topology and connects them to traditional statistical data analysis. (A sketch of how a time series becomes a point cloud amenable to these tools follows this list.)
  • Functional data analysis (FDA) is a field of statistics concerned with probability distributions over curves or functions. Researchers in this field have shown that traditional methods for point clouds, such as principal component analysis (PCA), can be extended to spaces of curves (see the second sketch after this list). To the best of our knowledge there has not been much interaction between the fields of FDA and TDA, although there is clearly considerable overlap.
  • Manifold learning is an umbrella term for techniques involving non-linear dimensionality reduction, metric embeddings and clustering. The underlying assumption of these methods is that the observed data points lie on an unknown low-dimensional manifold embedded in the high-dimensional space of observations, with the manifold's intrinsic coordinates unobserved.
  • Forecasting large collections of time series is a common problem in modern data-driven companies. Data scientists working in this area either use algorithms that run on many individual time series in parallel, or deep learning approaches that model a collection of time series as a group. The latter approach overlaps with the idea of studying a space of time series, but we know of no attempts to connect these models with the underlying implicit space of time series.
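To make the TDA connection concrete, the sketch below shows a sliding-window (Takens) embedding, the standard bridge from a single time series to the point clouds that computational topology works on. The window dimension and delay are illustrative choices; persistent homology of the resulting cloud would then be computed with a TDA library (not shown here).

    import numpy as np

    t = np.linspace(0, 8 * np.pi, 400)
    y = np.sin(t)  # a periodic toy series

    def sliding_window(y, dim=3, delay=10):
        # Each point collects dim samples spaced delay steps apart.
        n = len(y) - (dim - 1) * delay
        return np.stack([y[i:i + n] for i in range(0, dim * delay, delay)], axis=1)

    cloud = sliding_window(y)  # periodicity in y appears as a loop in the cloud
    print(cloud.shape)         # (380, 3)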
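Likewise, the FDA point that PCA extends to spaces of curves can be sketched by discretizing each curve on a common grid and taking the SVD of the centred data matrix, so that the principal components are themselves curves. The toy curves below are simulated with two smooth modes of variation.

    import numpy as np

    rng = np.random.default_rng(3)
    grid = np.linspace(0, 1, 100)
    # 60 noisy curves sharing two smooth modes of variation.
    scores = rng.normal(size=(60, 2))
    curves = (scores[:, :1] * np.sin(2 * np.pi * grid)
              + scores[:, 1:] * np.cos(2 * np.pi * grid)
              + 0.1 * rng.normal(size=(60, 100)))

    centred = curves - curves.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    pc_curves = vt[:2]                       # leading principal component curves
    var_share = (s[:2] ** 2) / (s ** 2).sum()
    print(var_share)                         # most variation captured by two components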