Seminars

NO.049 Algorithms for Large Scale Graphs

Shonan Village Center

October 14 - 17, 2014 (Check-in: October 13, 2014)

Organizers

  • Andrew McGregor
    • University of Massachusetts, Amherst, USA
  • Gopal Pandurangan
    • University of Houston, USA
  • Sergei Vassilvitskii
    • Google Inc., USA

Overview

The growing interest in “Big Data” has led to an increased interest in designing algorithms and systems for problems with very large inputs. Graph problems are among the most challenging of these: basic algorithmic primitives, such as depth-first search, cannot be performed in the data stream model [10], and existing algorithms can be notoriously hard to parallelize, to the extent that numerous graph-specific parallel systems have been developed [15, 16]. Nevertheless, there have recently been exciting developments on graph problems, both in streaming and parallel models, as well as work on sketching and sparsifying graphs.

Sketching, Sparsification, and Streaming: One way to deal with the data deluge is to reduce the graph size to something more manageable, while at the same time preserving the key properties of the graph. Examples of this approach include linear sketches and sparsifiers:

  • The seminal work of Benczúr and Karger [8] introduced cut sparsification and showed that one can approximately preserve all of the cut sizes with a sparse weighted subgraph. This result has since been improved [11, 20, 21], and it is now known that it is possible to preserve not only cuts but also spectral properties, such as the behavior of random walks on the graph and natural clusterings. Spanners are another form of sparsification: small subgraphs that approximately preserve all of the pairwise distances. Recent work has focused on more efficient spanner (and “distance oracle”) constructions [7, 22] and on maintaining spanners for dynamic graphs; a minimal spanner construction is sketched after this list.
  • While spanners and sparsifiers are subgraphs of the original graph, graph sketches [2, 3] take a different approach to compressing the graph. First, the graph is encoded as a vector, e.g., as a linearization of the adjacency matrix, and then a random linear projection maps the vector into a lower-dimensional space. By choosing the distribution of the projection in different ways, different information can be estimated from the projected vector. An advantage of linear sketches is that they can be applied to streaming and distributed data.
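
To make the spanner idea concrete, the following is a minimal sketch of the classical greedy construction (the simple greedy algorithm, not the faster randomized algorithm of [7]): scan the edges in nondecreasing order of weight and keep an edge only if its endpoints are currently more than (2k-1) times its weight apart in the spanner built so far, which yields a (2k-1)-spanner.

    import heapq

    def greedy_spanner(n, edges, k):
        # Greedy (2k-1)-spanner: scan edges by nondecreasing weight and
        # keep (u, v, w) only if u and v are currently farther apart
        # than (2k-1)*w in the spanner built so far.
        adj = {v: [] for v in range(n)}

        def spanner_dist(s, t, limit):
            # Dijkstra from s, abandoning any path longer than `limit`.
            best = {s: 0.0}
            pq = [(0.0, s)]
            while pq:
                d, u = heapq.heappop(pq)
                if u == t:
                    return d
                if d > best.get(u, float("inf")):
                    continue
                for v, w in adj[u]:
                    nd = d + w
                    if nd <= limit and nd < best.get(v, float("inf")):
                        best[v] = nd
                        heapq.heappush(pq, (nd, v))
            return float("inf")

        spanner = []
        for u, v, w in sorted(edges, key=lambda e: e[2]):
            if spanner_dist(u, v, (2 * k - 1) * w) > (2 * k - 1) * w:
                adj[u].append((v, w))
                adj[v].append((u, w))
                spanner.append((u, v, w))
        return spanner

    print(greedy_spanner(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)], k=2))
    # [(0, 1, 1.0), (1, 2, 1.0)]

For example, on a unit-weight triangle with k = 2, the third edge is dropped because its endpoints are already at distance 2 <= 3 in the partial spanner.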

Many challenges arise when extending and applying these ideas to massive graphs:

  • Can the sparsification methods above be performed in a distributed or streaming manner? Recent work has shown that there exist low-dimensional sketches that preserve the sizes of all cuts, but a similar result for spectral sparsification requires new ideas that go beyond relating edge-connectivity to effective resistances [4]. Similarly, the only currently known sketches for estimating distances require multiple rounds of “adaptive” sketches [3, 12].
  • Can one find sparsifiers for a richer set of properties, for example, for problems concerning matchings or densely connected subgraphs, and can these constructions be extended to directed graphs or hypergraphs? For applications to massive graphs, these sparsifiers would need to be mergeable: a sparsification of disjoint graphs could then be combined to generate a sparsifier for the union of the graphs [1]. Such a property holds when approximating cuts and spectral properties, but little is known more generally. (The example after this list illustrates why linear sketches merge so easily.)
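
To illustrate why linearity gives mergeability, the following toy example (an illustration of the principle only; the actual sketches of [2, 3] use more structured random projections) sketches a graph's edge-indicator vector with a shared random projection. Because the map is linear, sketches of edge-disjoint graphs computed on different machines, or over different stream fragments, simply add.

    import numpy as np

    def edge_vector(n, edges):
        # Flatten the upper triangle of the adjacency matrix into a
        # 0/1 vector of length n*(n-1)/2.
        x = np.zeros(n * (n - 1) // 2)
        for u, v in edges:
            u, v = min(u, v), max(u, v)
            x[u * n - u * (u + 1) // 2 + (v - u - 1)] = 1.0
        return x

    rng = np.random.default_rng(0)
    n, d = 6, 10                                    # 6 vertices, 10-dim sketch
    S = rng.standard_normal((d, n * (n - 1) // 2))  # shared random projection

    g1 = [(0, 1), (1, 2)]                           # two edge-disjoint graphs
    g2 = [(3, 4), (4, 5)]
    sk1 = S @ edge_vector(n, g1)
    sk2 = S @ edge_vector(n, g2)

    # Linearity: the sketch of the union equals the sum of the sketches.
    assert np.allclose(sk1 + sk2, S @ edge_vector(n, g1 + g2))

The same linearity handles edge deletions in a stream: deleting an edge subtracts the corresponding column of the projection.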

Parallel and Distributed Computation: Computation on modern massive graphs often proceeds in a distributed manner. Whether computing PageRank, clustering, or simply finding shortest paths, practitioners employ many different systems to obtain their results. Each of these systems makes different tradeoffs depending on its priorities: Is it important to reduce communication? Can we reduce the computational power required of each node? How can we ensure good load balancing or guarantee real-time responses?

MapReduce has become a very successful distributed computing platform for a wide variety of large-scale computing applications, and there has been some success in parallelizing different graph algorithms, including finding connected components [17, 19], computing PageRank on a distributed stream [5], and finding dense subgraphs in parallel [6, 23].
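
To give a flavor of the round-based style in which such algorithms operate, here is a naive label-propagation sketch for connected components (this simple version needs a number of rounds proportional to the graph diameter; the algorithms of [17, 19] are considerably cleverer and finish in a logarithmic number of rounds):

    def connected_components(vertices, edges):
        label = {v: v for v in vertices}   # each vertex starts in its own component
        changed = True
        while changed:                     # each iteration is one MapReduce round
            changed = False
            # "Map": every edge proposes each endpoint's label to the other endpoint.
            proposals = {v: [label[v]] for v in vertices}
            for u, v in edges:
                proposals[u].append(label[v])
                proposals[v].append(label[u])
            # "Reduce": every vertex keeps the minimum label proposed to it.
            for v in vertices:
                m = min(proposals[v])
                if m < label[v]:
                    label[v] = m
                    changed = True
        return label

    print(connected_components(range(5), [(0, 1), (1, 2), (3, 4)]))
    # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}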

However, MapReduce is ill-suited for implementing some types of graph algorithms and can lead to “sub-optimal performance and usability issues” [16]. Indeed, there has been a recent proliferation of systems designed specifically for large-scale graph processing.

  • Pregel [16] is a system developed specifically for graph computation and is based on the message-passing distributed computing paradigm. It has been very influential in practice but has received scant attention in theory; a toy vertex-centric PageRank computation in this style is sketched after this list.
  • GraphLab [15] has been very influential for performing machine learning on graphs, but it also has not been studied analytically. Other similar systems are Giraph [9] (the open-source counterpart of Pregel) and Stanford’s GPS [18].
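
As a concrete illustration of the vertex-centric style, the following is a minimal single-machine simulation of a Pregel-like PageRank computation (the structure mirrors the running example of [16], but the names and the simulation loop here are illustrative, not Pregel's actual API): in each superstep, every vertex combines its incoming messages, updates its value, and sends its new rank share along its out-edges.

    def pregel_pagerank(out_edges, supersteps=30, damping=0.85):
        n = len(out_edges)
        rank = {v: 1.0 / n for v in out_edges}
        inbox = {v: [] for v in out_edges}
        for step in range(supersteps):
            outbox = {v: [] for v in out_edges}
            for v in out_edges:          # in a real system, this loop runs in parallel
                if step > 0:             # combine messages from the previous superstep
                    rank[v] = (1 - damping) / n + damping * sum(inbox[v])
                for u in out_edges[v]:   # send rank share to out-neighbors
                    outbox[u].append(rank[v] / len(out_edges[v]))
            inbox = outbox
        return rank

    graph = {0: [1, 2], 1: [2], 2: [0]}  # assumes every vertex has an out-edge
    print(pregel_pagerank(graph))

The barrier synchronization between supersteps gives the model a clean round structure that is convenient both for programmers and for analysis.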

Recently, there has been work [14] on developing a distributed computing theory for graph processing systems such as Pregel, in a similar way to the theory developed for message-passing distributed systems. Earlier work on abstracting the MapReduce model [13] has been very influential. Designing and analyzing algorithms for graph problems in this new abstraction is an important challenge. It will require algorithmic techniques from distributed and stream computation, as well as ideas from information theory and communication complexity for proving lower bounds.

Comparing and contrasting the MapReduce approach with the message-passing approach, and understanding when and why one works better than the other, will be of fundamental interest to theorists and practitioners alike.

Conclusions: In a growing number of applications, there is a need to efficiently process large-scale graphs. Developing a theory and a principled approach to designing such algorithms is a new challenge that calls for the expertise of researchers in data streams, distributed computing, and graph algorithms. The goal of this workshop is to bring together researchers from these diverse communities to brainstorm and to stimulate further development in this exciting and challenging area.

References

[1] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. In PODS, pages 23-34, 2012.

[2] K. J. Ahn, S. Guha, and A. McGregor. Analyzing graph structure via linear measurements. In SODA, pages 459-467, 2012.

[3] K. J. Ahn, S. Guha, and A. McGregor. Graph sketches: sparsification, spanners, and subgraphs. In PODS, pages 5-14, 2012.

[4] K. J. Ahn, S. Guha, and A. McGregor. Spectral sparsification in dynamic graph streams. In APPROX-RANDOM, pages 1-10, 2013.

[5] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and personalized PageRank. PVLDB, 4(3):173-184, 2010.

[6] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and MapReduce. PVLDB, 5(5):454-465, 2012.

[7] S. Baswana and S. Sen. A simple and linear time randomized algorithm for computing sparse spanners in weighted graphs. Random Struct. Algorithms, 30(4):532-563, 2007.

[8] A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in Õ(n²) time. In G. L. Miller, editor, STOC, pages 47-55. ACM, 1996.

[9] A. Ching and C. Kunz. Giraph: Large-scale graph processing on Hadoop. In Hadoop Summit, 2010.

[10] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2-3):207-216, 2005.

[11] W. S. Fung, R. Hariharan, N. J. A. Harvey, and D. Panigrahi. A general framework for graph sparsification. In STOC, pages 71-80, 2011.

[12] P. Indyk, E. Price, and D. P. Woodruff. On the power of adaptivity in sparse recovery. In FOCS, pages 285-294, 2011.

[13] H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In SODA, pages 938-948, 2010.

[14] H. Klauck, D. Nanongkai, G. Pandurangan, and P. Robinson. On the distributed complexity of large-scale graph processing. 2013.

[15] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. In UAI, pages 340-349, 2010.

[16] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 135-146, New York, NY, USA, 2010. ACM.

[17] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In C. S. Jensen, C. M. Jermaine, and X. Zhou, editors, ICDE, pages 50-61. IEEE Computer Society, 2013.

[18] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, page 22, 2013.

[19] T. Seidl, B. Boden, and S. Fries. CC-MR: finding connected components in huge graphs with MapReduce. In P. A. Flach, T. D. Bie, and N. Cristianini, editors, ECML/PKDD (1), volume 7523 of Lecture Notes in Computer Science, pages 458-473. Springer, 2012.

[20] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913-1926, 2011.

[21] D. A. Spielman and S.-H. Teng. Spectral sparsification of graphs. SIAM J. Comput., 40(4):981-1025, 2011.

[22] M. Thorup and U. Zwick. Approximate distance oracles. J. ACM, 52(1):1-24, 2005.

[23] C. E. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. A. Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In I. S. Dhillon, Y. Koren, R. Ghani, T. E. Senator, P. Bradley, R. Parekh, J. He, R. L. Grossman, and R. Uthurusamy, editors, KDD, pages 104-112. ACM, 2013.

Report

No-049.pdf