No.087 Processing Big Data Streams
NII Shonan Meeting: @ Shonan Village Center, June 58, 2017
Organizers

 Vladimir Braverman, John Hopkins University, USA
 David Woodruff, IBM Almaden, USA
 Ke Yi, Hong Kong University of Science and Technology, Hong Kong
Overview
The more data we have, the more data we need to process. Whether it’s Internet traffic or biological data, the hardware is never fast enough. The aim of this workshop is to focus on analyzing data under new models: when the data cannot be stored (e.g. identifying viruses in the Internet traffic), when we use multiple cores to analyze the data, and when we are generating short sketches of the data to be sent and analyzed by someone else. We believe a new set of algorithmic techniques (which rely mostly on statistics and on data structures) can be used in these models, and wish to find such techniques and employ them. An important focus of this workshop is to find algorithms which are elegant, and thus can also be used in practice. To develop these algorithms, it is critical to connect the key subareas of the algorithmic research on Big Data. These key subareas include streaming, sketching, and sampling. The goal of the workshop is to bring researchers together from these different subareas and to establish strong collaborations among the attendees.
Streaming model The main motivation for this model is cases where there is not enough memory (or storage space) to hold all the input. Thus, the input is examined sequentially, character by character, and each character is being processed on arrival. Algorithms in this model have sublinear storage, and typically use only polylogarithmic space and time (per character). Although the streaming model is quite restrictive, major breakthroughs were achieved for several sketching and statistical problems. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory, and communication complexity. Finding other problems that can be solved in this model, and new techniques for solving them, are fundamental open questions. An important and natural relaxation of the streaming model is the semistreaming model. In this model, the algorithm is now allowed nearlinear space, and sometimes we also allow it multiple passes over the input (usually a constant number of passes). This model has been mainly used for solving graph problems, in which it is presented as a stream of edges. Due to the space limitations only the vertices and a portion of the edges 1 can be stored. Reducing the space from quadratic to linear is a very important practical achievement of this model. We also consider a stream of edit operations – in this model we maintain a list of counters, and the meaning of the element (i,v) is to add v to the i’th counter, e.g., a stream of IP packets, which can be viewed as (src ip addr., data length).
Sampling The concept of a representative random sample is a central tool in data gathering and analysis. It has been a key concept in statistics and probability theory. More recently, with the progress made in randomized algorithms, it has proved to be a very useful and versatile tool in algorithm design (e.g., in computational geometry, graph optimization and data structures). The usefulness of random sampling in algorithms design comes from the fact that a sample often inherits important properties of the algorithm’s input, while at the same time it is very small. Thus one can discover these properties quickly by analyzing the sample, and then use this knowledge to improve the running time of the algorithm operating on the whole input. Although this approach has been successful, a vast majority of algorithms employing random sampling still need to examine the whole input data set to complete the computation. In the streaming model we aim to avoid the latter step and compute the answer without accessing the whole input. In addition, we aim at algorithms whose running time is also small. Altogether, we seek algorithms whose resource utilization (both space and time) is sublinear in the size of the dataset.
Sketching Sketching is a method of saving data using an extremely concise representation, which allows us to retrieve the information required for the processing application without going over the data again. These techniques are not standard compression techniques since they are often linear so that they can be easily updated in a streaming context. They create much shorter descriptions than compression methods achieve. Rather, sketches do not allow retrieving the original data (or even a close approximation of it), but only allow a very specific processing to be done on the data. There are two possible ways to reduce sketch size. One is by using randomized methods that allow answering the query correctly with high probability, and the second is to allow an approximate rather than exact result.