Seminars

NO.143 Programming Language Support for Data-intensive Applications

Shonan Village Center

July 1 - 5, 2019 (Check-in: June 30, 2019)

Organizers

  • Oleg Kiselyov
    • Tohoku University, Japan
  • Anil Madhavapeddy
    • University of Cambridge, United Kingdom
  • KC Sivaramakrishnan
    • University of Cambridge, United Kingdom
  • Suresh Jagannathan
    • Purdue University, USA

Overview

Abstract

The landscape of data-intensive applications today spans the gamut from large-scale Web applications (expected to provide persistent, fault-tolerant, highly available, and low-latency geo-distributed services) to unreliably connected IoT networks comprising millions of heterogeneous devices streaming and processing real-time data feeds. In both cases, application logic is usually expressed using high-level, often domain-specific, language abstractions, while data management issues are typically relegated to an opaque, monolithic data querying and storage service. While this architecture encourages separation of concerns, it provides little opportunity for synergies across the boundary between the application and the database or storage layer. In particular, it becomes difficult to apply well-understood programming language principles, compiler optimizations, and verification techniques to ensure that data management services enforce application-level invariants, which jeopardizes safety and maintainability.

To overcome these drawbacks, we require a radically different view of how applications and datastores interact with one another. We propose to organize a Shonan meeting to bring together programming language experts and practitioners of cutting-edge data-intensive applications to discuss how we can unify data representation issues across different layers of the application stack, and so exploit the benefits of program verification and optimization to realize correctness and performance in the data processing layer. Central to our approach is the application of declarative abstractions and methodologies to express storage requirements without exposing low-level system-specific details.

Description of the meeting

The growth of data-intensive applications has spurred interest among a number of disparate communities, all interested in various ways to describe, harness, and analyze large data streams. However, proposed techniques are often described in isolation: machine learning experts may view the problem in terms of labeling and training, database researchers may consider effective analytics to be the overarching challenge, while correctness and performance are the primary concerns of the programming language and systems communities. While all these perspectives are noteworthy and valid, long-lasting solutions are likely to be successful only when these issues are considered holistically. By bringing together leading researchers in these different communities to discuss approaches and present differing viewpoints, we anticipate making progress on defining a more comprehensive and systematic way of addressing large-scale data management concerns in an age where data, not computation, is the currency of greatest value.

1.1 Background

In recent years, application software has become increasingly centered around accumulating and processing large amounts of data across a vast array of distributed machines. This has given rise to diverse applications such as social networks, automated drug discovery, precision medicine, autonomous vehicles, and natural language processing. In the same time period, enormous advances have been made in programming language technology, with notable successes in automated theorem proving, scalable program analysis, metaprogramming, program verification, and synthesis.

However, the traditionally strict separation between application and data management layers means that lower layers of the application stack have not benefited from the advances in programming language technology. While developers may use high-level language abstractions and data structures to program application logic, they nonetheless implicitly rely on lower-level data aggregation, persistence, data management, and dissemination techniques designed and implemented without consideration of higher-level concerns. Increasingly, these systems store and process personal and sensitive data (such as GPS locations or heart rate data) or control safety-critical systems based on insights from the data (such as home security and self-driving cars). In these situations, issues such as correctness and security are paramount.

1.2 Squashing the stack

In response to these concerns, recent efforts have looked at shrinking the number of layers (and their complexity) in the software stack through the use of principles expounded by library OS designs [3, 1, 2]. These techniques lift many of the services traditionally performed by operating systems into the application layer, thus enabling application knowledge to be used for optimizing the control stack. In a similar vein, we can envision a database or data storage system as a library of specialized components rather than as a general-purpose service; the components required for a specific application depend entirely on the application’s needs. In this scenario, whole-system reasoning becomes feasible and effective, thus bridging the semantic mismatch between the application and the data and storage models.

Broadly construed, data complexity relates to various aspects of data management and manipulation that form the bulk of modern-day enterprise application activity. We are specifically interested in new techniques that allow us to understand and manage data as it transforms from application-level domain-specific models to storage-level representations, taking into account system-level concerns such as persistence, replication, and distribution. While programmers naturally prefer to reason about application semantics in the application layer, the architecture of the current software stack requires them to reason about a combination of properties across all the layers. There are two issues we believe conspire against building efficient yet verifiably correct data-centric applications:

  • A representation mismatch between the domain model of the application and the data model of the underlying data store, and
  • A semantic mismatch between high-level integrity specifications of the domain model and the low-level system-specific properties of the data storage model.

Progress towards defining frameworks that address these concerns would have immense foundational and practical significance, especially in the context of emerging application areas like IoT or scalable machine learning systems. The inherent diversity with respect to scale, heterogeneity, and functionality in such systems demands solutions that both simplify system complexity and facilitate end-to-end correctness arguments. Novel approaches towards eliminating and simplifying abstraction boundaries in the data path will be critical to realizing this goal.

1.3 Aims of the meeting

The central theme addressed by the proposed meeting will be:

How can we develop synergies between programming language technology, such as metaprogramming, automated program verification, and program synthesis, and system concerns related to data management, processing, and dissemination?

To address this question, we propose to bring together leading researchers in programming languages, systems, and databases with end-user application developers (e.g., in disciplines as diverse as IoT and machine learning) to exchange ideas and propose synergistic activities that address key challenges central to the construction of correct, verifiable, and scalable data-intensive applications. This meeting will provide a unique venue that allows diverse research communities to have an insightful exchange of perspectives and viewpoints.

To promote mutual understanding, we plan to structure the workshop around research talks, tutorials, and smaller breakout and working group sessions. We expect that a by-product of this workshop will be a blueprint for further collaborative research activities, inspired by real-world application demands.

References

(1)  D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: An operating system architecture for application-level resource management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, pages 251–266, New York, NY, USA, 1995. ACM.

(2)  I. M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. The design and implementation of an operating system to support distributed multimedia applications. IEEE J. Sel. Areas Commun., 14(7):1280–1297, September 1996.

(3)  Anil Madhavapeddy and David J. Scott. Unikernels: Rise of the virtual library operating system. Queue, 11(11):30:30–30:44, December 2013.