Seminars

NO.011 Large-scale Distributed Computation

Shonan Village Center

January 12 - 15, 2012 (Check-in: January 11, 2012)

Organizers

  • Graham Cormode
    • AT&T Labs-Research, USA
  • S. Muthukrishnan
    • Rutgers University, USA
  • Ke Yi
    • Hong Kong University of Science and Technology, Hong Kong

Overview

As the amount of data produced by large-scale systems such as environmental monitoring, scientific experiments, and communication networks grows rapidly, new approaches are needed to process and analyze such data effectively. Many promising directions lie in the area of large-scale distributed computation, in which multiple computing entities work together over partitions of the (huge) data to perform complex computations. Two important paradigms in this realm are distributed continuous monitoring, which continually maintains an accurate answer to a complex query, and MapReduce, a primarily batch approach to computation on large clusters. The aim of this meeting is to bring together computer scientists with interests in this field to present recent innovations, to find topics of common interest, and to stimulate further development of new approaches for dealing with massive data.

Description of the Meeting

If Computer Science in the 20th Century was defined by the inception and astounding growth of computing power, in the 21st Century it will be defined by the huge growth in data. The ability to capture vast quantities of data (from network traffic, scientific experiments, online activity) is expanding rapidly, at a rate far higher than the increase in our ability to process it. While Moore's law has held for many decades, tracking an exponential growth in computational power, recent estimates (see, for example, http://wikibon.org/blog/cloud-storage/) suggest that data production is growing even faster. The new challenge for Computer Science in the 21st Century is therefore how to deal with this data deluge. Several promising new directions have emerged in recent years:

Distributed Continuous Monitoring. In many settings, new data is observed locally: at a router in a network, or at a sensor in a larger sensor network. The volume of observations is too large to move to a central location and process together; instead, it is necessary to perform distributed computation over the data. Since new observations are continually arriving, we must maintain a continual answer to complex monitoring queries, while ensuring that both the communication cost of maintaining the result and the computational cost of the tracking are minimized to meet the data throughput demands. In recent years, there have been several advances in this field (a small illustrative code sketch follows the list):

  • The geometric monitoring approach, which views the values of the function being monitored as points in a metric space and covers the region with balls on which the monitored condition does not change. This enables arbitrarily complex functions to be decomposed into local conditions that can be monitored efficiently [16, 17, 15].
  • Use of sketches and randomized summaries to compactly summarize large complex distributions, allowing approximation of fundamental quantities such as order statistics, inner products, distinct items and frequency moments [7, 8].
  • New ways to incorporate time decay and sliding windows, allowing queries to be based only on the more recent observations and reducing or removing the contribution of outdated ones, while communicating far less than the full set of observations [4, 9].
  • Extension of monitoring techniques to more complex, non-linear functions such as the (empirical) entropy [16, 2].
  • Advances in sampling techniques that enable drawing uniform samples, both with and without replacement, over multiple streams arriving at different rates [3, 9].
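
To make the communication pattern concrete, here is a minimal sketch, in Python, of slack-based distributed count tracking. It is illustrative only and not taken from any of the cited papers; all class and parameter names are hypothetical. Each of k sites counts local events and reports to a coordinator only when its unreported count reaches a fixed slack, so the coordinator's estimate is always within k x slack of the truth while the number of messages shrinks by a factor of roughly the slack.

    import random

    class Coordinator:
        """Maintains an approximate global count from site reports."""
        def __init__(self):
            self.estimate = 0
            self.messages = 0

        def report(self, delta):
            self.estimate += delta
            self.messages += 1

    class Site:
        """Counts local events; reports only when local drift reaches the slack."""
        def __init__(self, coordinator, slack):
            self.coordinator = coordinator
            self.slack = slack
            self.unreported = 0

        def observe(self):
            self.unreported += 1
            if self.unreported >= self.slack:
                self.coordinator.report(self.unreported)
                self.unreported = 0

    # Simulation: 10 sites observe a stream of 100,000 events at random.
    coord = Coordinator()
    sites = [Site(coord, slack=100) for _ in range(10)]
    for _ in range(100_000):
        random.choice(sites).observe()

    # The estimate is within (number of sites) * slack of the true count,
    # using roughly 1/slack as many messages as events.
    print(coord.estimate, coord.messages)
    assert 100_000 - coord.estimate < 10 * 100

More refined protocols adjust the slacks adaptively or randomize the reporting conditions to obtain better communication bounds; the geometric approach above can be seen as generalizing this pattern of cheap local conditions, with communication only on violation, to arbitrary functions.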

However, there remain many challenging questions to address in this area:

  • A more general theory of distributed continuous computation: what functions and computations can be performed in this model? Are there hardness results and separation theorems that can be proved? Can a stronger set of lower bounds be provided?
  • A stronger sense of the connection to other areas in computer science, such as communication complexity, network coding, coding theory, compressed sensing, and so on. Can techniques from these areas be extended to the distributed continuous setting?
  • Systems issues have so far received only minor consideration. Can general-purpose systems be designed, comparable to centralized databases, which accept complex queries in a high-level language and then produce and deploy query monitoring plans?

Cluster Computing. As data sizes increase while the power of individual computing cores begins to saturate, there has been a move to adopt cluster computing: harnessing multiple computing nodes to work together on huge data processing tasks in parallel. The best-known example of this is the MapReduce paradigm and its open-source Hadoop implementation. Computationally speaking, each compute node processes data stored locally in a distributed file system, and Maps this data by deriving a new data set indexed by a key value. The system then collects all tuples with the same key at a second node, which proceeds to Reduce these tuples to obtain the (partial) output. This paradigm has proved highly successful for many large-scale computations, such as search engine log analysis, building large machine learning models, and data analysis tasks such as clustering; a minimal code sketch of the paradigm appears after the list below. Other key technical advances include:

  • A foundation for models of this style of computation, including the Massive Unordered Distributed (MUD) model and the Karloff-Suri-Vassilvitskii (KSV) model. These help us understand what computations can be performed effectively on clusters [11, 13].
  • Development of online versions of the protocols, which pipeline data between operators to move beyond the batch view of cluster computing and quickly provide partial results to complex queries, refined as computation progresses [6].
  • Initial attempts to understand which classes of algorithms can be performed effectively in these models, such as set cover, clustering, and database, geometric, and graph computations [5, 14, 10, 1, 12].
  • Overlays, such as Facebook's Hive, which provide advanced data processing and an abstract query language on top of Hadoop to enable data warehousing applications [18].
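
As a concrete, if toy, illustration of the paradigm referred to above, the following is a minimal single-process Python simulation of the Map, shuffle, and Reduce phases for word counting. A real Hadoop job would supply the same two functions and let the framework handle partitioning the input, shuffling tuples between nodes, and fault tolerance; the function names here are illustrative rather than any particular API.

    from collections import defaultdict
    from itertools import chain

    def map_fn(document):
        """Map: emit a (key, value) tuple for every word in one input record."""
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_fn(key, values):
        """Reduce: combine all values sharing a key into the (partial) output."""
        yield (key, sum(values))

    def mapreduce(records, map_fn, reduce_fn):
        # Shuffle: group mapped values by key, as the framework does when
        # routing all tuples with the same key to a single reducer node.
        groups = defaultdict(list)
        for key, value in chain.from_iterable(map(map_fn, records)):
            groups[key].append(value)
        return dict(chain.from_iterable(
            reduce_fn(key, values) for key, values in groups.items()))

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(docs, map_fn, reduce_fn))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

The KSV model mentioned above constrains such computations further, limiting the memory available to each mapper and reducer and charging for the number of rounds, which is what makes the cost tradeoffs listed below non-trivial.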

The main challenges in this area include:

  • Understanding the extent to which approximation and randomization can further improve the efficiency of computation.
  • Richer models which help us understand the tradeoff between the different cost parameters: the amount of communication, the number of compute nodes, the memory available at each node, the number of rounds of MapReduce, etc.
  • More examples of non-trivial computations that can be performed in MapReduce, leading to a more general theory of cluster computation and data processing under MapReduce.
  • More efficient implementations and variations of the model allowing real-time and more interactive use of clusters.

The aim of this workshop is to bring together researchers active in the area of distributed data processing to address these fundamental issues. We hope to encourage greater cooperation and interaction between currently separate communities, ultimately leading to new advances in these important developing areas.

Report

No-011.pdf