# Large-scale Distributed Computation

**NII Shonan Meeting:** Large-scale Distributed Computation

**@ Shonan Village Center, Jan. 11-15, 2012**

**NII Shonan Meeting Report (ISSN 2186-7437): No. 2012-1**

**Organizers:**

- Graham Cormode (AT&T Labs-Research)
- S. Muthukrishnan (Rutgers University)
- Ke Yi (Hong Kong University of Science and Technology)

**Overview**

As the amount of data produced by large scale systems such as environmental monitoring, scientific experiments and communication networks grows rapidly, new approaches are needed to effectively process and analyze such data. There are many promising directions in the area of large-scale distributed computation, that is, where multiple computing entities work together over partitions of the (huge) data to perform complex computations. Two important paradigms in this realm are distributed continuous monitoring, which continually maintains an accurate estimate of a complex query, and MapReduce, a primarily batch approach to large cluster computation. The aim of this meeting is to bring together computer scientists with interests in this field to present recent innovations, find topics of common interest and to stimulate further development of new approaches to deal with massive data.

**Description of the Meeting**

If Computer Science in the 20th century was defined by the inception and astounding growth of computing power, in the 21st century it will be defined by the huge growth in data. The ability to capture vast quantities of data (from network traffic, scientific experiments, online activity) is expanding rapidly, and at a rate far higher than the increase in our ability to process it. While Moore's law has held for many decades, tracking an exponential growth in computational power, data production has recently grown at an even faster rate: see recent estimates such as http://wikibon.org/blog/cloud-storage/.

Therefore, the new challenge for Computer Science in the 21st century is how to deal with such a data deluge. Several promising new directions have emerged in recent years:

**Distributed Continuous Monitoring.** In many settings the new data is observed locally: at a router in a network, or at a sensor in a larger sensor network. The volume of observations is too large to move to a central location and process together; instead, it is necessary to perform distributed computation over the data. Since new observations are continually arriving, we must produce a continual answer to complex monitoring queries, all while ensuring that the communication cost necessary to maintain the result, and the computational cost of the tracking, are minimized to meet the data throughput demands. In recent years, there have been several advances in this field:

- The geometric monitoring approach, which views the value of the function being monitored as points in a metric space, covered by monotonic balls in the region. This enables arbitrarily complex functions to be decomposed into local conditions that can be monitored efficiently [16, 17, 15].
- Use of sketches and randomized summaries to compactly summarize large complex distributions, allowing approximation of fundamental quantities such as order statistics, inner products, distinct items and frequency moments [7, 8].
- New ways to incorporate time decay and sliding windows, to allow queries to be based on only the more recent observations, and reduce or remove the contribution of outdated observations, while still communicating far fewer than the full set of observations [4, 9].
- Extension of monitoring techniques to more complex, non-linear functions such as the (empirical) entropy [16, 2].
- Advances in sampling technology to enable drawing uniform samples over multiple streams of arrivals, arriving at different rates, both with and without replacement [3, 9].
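To make the second point concrete, the following is a minimal sketch (not taken from any of the cited systems; the class name and parameters are illustrative) of how a randomized summary such as the Count-Min sketch supports distributed monitoring: each site summarizes its local stream compactly, and sketches built with the same hash functions can be merged cell-wise at a coordinator to estimate global frequencies.

```python
import random

class CountMinSketch:
    """A compact randomized summary of item frequencies.

    Sketches created with the same seed share hash functions and are
    mergeable by cell-wise addition, so each site can summarize its local
    stream and a coordinator can combine the summaries. Estimates never
    undercount; they overcount only when items collide in every row.
    """

    def __init__(self, width=256, depth=4, seed=42):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        # One hash function per row, simulated via per-row random salts.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        # Add `count` to one counter in each row.
        for r in range(self.depth):
            self.table[r][self._cell(r, item)] += count

    def estimate(self, item):
        # Minimum over rows: the least-contaminated counter.
        return min(self.table[r][self._cell(r, item)]
                   for r in range(self.depth))

    def merge(self, other):
        # Valid only when both sketches share the same hash functions.
        assert self.salts == other.salts
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]

# Two sites observe parts of a stream; the coordinator merges their sketches.
site_a, site_b = CountMinSketch(), CountMinSketch()
for _ in range(5):
    site_a.update("flow-x")
site_b.update("flow-x", 3)
site_a.merge(site_b)
# site_a.estimate("flow-x") is at least the true global count, 8.
```

The key property for the distributed setting is mergeability: the sketch of a union of streams equals the merge of the per-stream sketches, so only the small tables, not the observations, need to be communicated.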

However, there remain many challenging questions to address in this area:

- A more general theory of distributed continuous computation: what functions and computations can be performed in this model? Are there hardness results and separation theorems that can be proved? Can a stronger set of lower bounds be provided?
- A stronger sense of the connection to other areas in computer science, such as communication complexity, network coding, coding theory, compressed sensing, and so on. Can techniques from these areas be extended to the distributed continuous setting?
- Systems issues have so far received only minor consideration. Can general-purpose systems be designed, in a comparable way to centralized databases, which can accept complex queries in a high-level language, and then produce and deploy query monitoring plans?

**Cluster Computing.** As data sizes increase while the power of individual computing cores begins to saturate, there has been a move to adopt cluster computing: harnessing multiple computing nodes to work together to solve huge data processing tasks in parallel. The best known example of this is the MapReduce paradigm, and its open-source Hadoop implementation. Computationally speaking, the approach is for each compute node to process data stored locally to it in a distributed file system, and Map this data by deriving a new data set indexed by a key value. The system then collects all tuples with the same key together at a second node, which proceeds to Reduce these tuples to obtain the (partial) output. This paradigm has proved highly successful for many large-scale computations, such as search engine log analysis, building large machine learning models, and data analysis such as clustering. Other key technical advances include:

- A foundation for models of this style of computation, including the Massive Unordered Distributed (MUD) model and the Karloff-Suri-Vassilvitskii (KSV) model. These help to understand what computations can be performed effectively on clusters [11, 13].
- Development of online versions of the protocols, which use pipelining of data between operators, to move beyond the batch view of cluster computing, and to quickly provide partial results to complex queries which are refined as computation progresses [6].
- Initial attempts to understand what class of algorithms can be effectively performed in these models, such as set cover, clustering, database, geometric and graph computations [5, 14, 10, 1, 12].
- Overlays, such as Facebook's Hive, which provide advanced data processing and an abstract query language on top of Hadoop to enable data warehousing applications [18].
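The Map/shuffle/Reduce dataflow described above can be sketched in a few lines. This is a toy, single-process illustration of the paradigm, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are invented for exposition.

```python
from collections import defaultdict

def map_phase(chunks, map_fn):
    # Each node applies map_fn to its local chunk, emitting (key, value) pairs.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_fn(chunk))
    return pairs

def shuffle(pairs):
    # The system routes all pairs with the same key to one reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    # Each reducer collapses its group of values into the (partial) output.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example: each document is a chunk.
def wc_map(doc):
    return [(word, 1) for word in doc.split()]

def wc_reduce(word, counts):
    return sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(shuffle(map_phase(docs, wc_map)), wc_reduce)
# result["the"] == 3, result["fox"] == 2
```

In a real cluster the three phases run on different machines, the shuffle moves data over the network, and the map inputs and reduce outputs live in a distributed file system; the dataflow, however, is exactly this.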

The main challenges in this area include:

- Understanding the extent to which approximation and randomization can further advance the efficiency of computation.
- Richer models which help us understand the tradeoff between the different cost parameters: amount of communication, number of compute nodes, memory available at each node, number of rounds of MapReduce, etc.
- More examples of non-trivial computations that can be performed in MapReduce, leading to a more general theory of cluster computation and data processing under MapReduce.
- More efficient implementations and variations of the model, allowing real-time and more interactive use of clusters.

The aim of this workshop is to bring together researchers active in the areas of distributed data processing, to address these fundamental issues. We hope to encourage greater cooperation and interaction between currently separate communities, ultimately leading to new advances in these important developing areas.