NO.098 Language integrated queries: towards standard logics for big data analytics
May 29 - June 1, 2017 (Check-in: May 28, 2017)
- Laurent Daynes
- Oracle Labs, France
- George Fletcher
- TU Eindhoven, Netherlands
- Wook Shin Han
- Pohang University of Science and Technology, South Korea
Database management systems (DBMSs) are typically optimized for a particular data model (e.g., relational, semi-structured, graph-based) and interfaced with a unique query language (e.g., SQL, HiveQL, XQuery, JSONiq, SPARQL). In contrast, database applications are written in general-purpose programming languages that offer developers a large choice of libraries (e.g., to simplify presentation to the end-users, the writing of business logic, etc.).
For various architectural reasons, database applications execute in an environment distinct from that of the DBMS, i.e., on a client machine (e.g., a connected mobile device), on a middle tier, or even within the database itself. This situation causes two main problems that have been the focus of research for several decades: (1) how to better integrate database querying with application programming languages to eliminate impedance mismatch while retaining the full power of the database querying capabilities; and (2) how to minimize the traffic between the database and its applications, both in terms of the number of interactions and the volume of data exchanged.
Current state of the art.
With respect to integrating querying with programming languages, the industrial landscape is currently dominated by ORM solutions (Hibernate, Ruby on Rails, SQLAlchemy, Django, Propel, and RedBeanPHP, to name the most prominent). These frameworks are based on popular architectural patterns (e.g., active records, data mappers) that greatly simplify application development by wrapping database operations in type-safe object-oriented interfaces, allowing developers to write code completely in the host language.
Unfortunately, these solutions encourage developers to write code that iterates over collections of objects representing database records, using idioms of the language to perform bulk operations like filters and joins that would be better done by the database. This often results in poor performance, as it both increases traffic with the database and can overwhelm the application’s memory with very large intermediate results.
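The contrast can be made concrete with a minimal sketch. It uses Python's standard sqlite3 module in place of a real ORM and a remote DBMS; the schema, the data, and the function names are assumptions chosen for illustration only. The first function filters and joins in the host language after fetching every row, while the second expresses the same intent as a single query.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'FR'), (2, 'NL'), (3, 'FR');
        INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0), (13, 1, 5.0);
        """
    )
    return conn


def total_in_application(conn, country):
    """ORM-style idiom: fetch both tables, then filter and join in Python."""
    customers = conn.execute("SELECT id, country FROM customers").fetchall()
    orders = conn.execute("SELECT customer_id, total FROM orders").fetchall()
    wanted = {cid for cid, c in customers if c == country}
    total = sum(t for cid, t in orders if cid in wanted)
    # Every row of both tables crossed the database boundary.
    return total, len(customers) + len(orders)


def total_in_database(conn, country):
    """Same intent, expressed as one query so the engine filters and joins."""
    row = conn.execute(
        "SELECT COALESCE(SUM(o.total), 0) "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "WHERE c.country = ?",
        (country,),
    ).fetchone()
    # Only a single result row is transferred.
    return row[0], 1
```

Both functions compute the same total, but the number of rows transferred grows with the table sizes in the first case and stays constant in the second.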
To avoid these problems, a number of language-integrated query techniques for embedding queries into general-purpose programming languages have emerged. These techniques seek to reconcile the goals of type safety and uniform programming idioms, on one hand, and better capturing of querying intent to optimize interactions with databases, on the other. Two directions are being investigated: (1) use some form of static analysis or type system to identify parts of programs that can be turned into queries; and (2) extend a conventional language with explicit quotation or surface syntax for expressing queries more directly.
This second approach has gained popularity with Microsoft’s LINQ, which offers programmers a unified API for querying arbitrary data providers and supports facilities to extend the syntax of the host language to add query constructs.
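The quotation idea can be sketched in a few lines of Python. This is not LINQ's API; the classes Column, Const, and BinOp and the function compile_query are hypothetical names introduced here. Operator overloading makes a host-language expression build a syntax tree instead of computing a value, and a small provider then translates that tree into a parameterized SQL string.

```python
class Expr:
    """Base class for quoted expressions; '&' combines two predicates."""

    def __and__(self, other):
        return BinOp("AND", self, other)


class Column(Expr):
    def __init__(self, name):
        self.name = name  # assumed to be a trusted identifier, not user input

    def __gt__(self, value):
        return BinOp(">", self, Const(value))

    def __lt__(self, value):
        return BinOp("<", self, Const(value))


class Const(Expr):
    def __init__(self, value):
        self.value = value


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


def to_sql(expr, params):
    """Translate the quoted tree to SQL text, collecting constants as parameters."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Const):
        params.append(expr.value)
        return "?"
    if isinstance(expr, BinOp):
        left = to_sql(expr.left, params)
        right = to_sql(expr.right, params)
        return f"({left} {expr.op} {right})"
    raise TypeError(f"cannot translate {expr!r}")


def compile_query(table, predicate):
    """Hypothetical provider entry point: returns (sql, params)."""
    params = []
    where = to_sql(predicate, params)
    return f"SELECT * FROM {table} WHERE {where}", params


total = Column("total")
sql, params = compile_query("orders", (total > 10) & (total < 100))
```

The predicate is written in ordinary host-language syntax, yet nothing is evaluated locally: the provider receives the tree and decides how to ship it to the back-end.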
Unfortunately, this approach is too restricted for a couple of reasons. First, although queries appear integrated syntactically, a driver still needs to translate the query into a form that can be shipped for execution at the back-end where the actual data resides. In practice, the query is just translated back into a SQL query. Consequently, deep integration of language expressions and queries is missing, and complex queries featuring user-defined functions often require several round-trips between the database and the runtime to exchange intermediate results.
Second, a heavy burden is put on the data provider designer, who has to resort to either (1) developing a superficial provider that only implements the most basic query primitives, or (2) investing a considerable amount of time, effort and expertise in developing a sophisticated data provider capable of analyzing the syntactic representation of queries and of translating it into one or more requests that can be executed by the back-end.
In all cases, sub-expressions from the host language that participate in the query must be translated into an equivalent expression in the interface of the data provider (e.g., SQL). This translation may not always be possible, or may require a substantial amount of work, such as providing an equivalent stored procedure at the database side for every user-defined function used in the application queries. When this isn’t possible, a complex query expression must be split into multiple queries, and intermediate results must be materialized at the application side in order to apply the host language’s sub-expressions. This is a trait shared by all of the solutions mentioned above, and one that cannot be solved as long as data providers fail to offer a querying interface that can accept foreign language expressions.
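A minimal sketch of the two situations follows, again using Python's sqlite3 module as a stand-in. The function risk_score, the schema, and the data are hypothetical. SQLite runs in-process and lets a host-language function be registered with create_function, so it only approximates what a remote data provider accepting foreign expressions might offer; the point is the difference in how many rows have to leave the engine.

```python
import sqlite3


def risk_score(total):
    """Hypothetical user-defined function written in the host language."""
    return total * 1.5 if total > 40 else total


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
        INSERT INTO orders VALUES (10, 50.0), (11, 75.0), (12, 20.0), (13, 5.0);
        """
    )
    return conn


def split_evaluation(conn, threshold):
    """The provider cannot translate risk_score, so every row is
    materialized at the application side and filtered there."""
    rows = conn.execute("SELECT id, total FROM orders").fetchall()
    ids = sorted(oid for oid, total in rows if risk_score(total) > threshold)
    return ids, len(rows)  # len(rows) = rows transferred


def pushed_down(conn, threshold):
    """The engine accepts the host-language function, so the filter runs
    where the data lives and only matching rows are returned."""
    conn.create_function("risk_score", 1, risk_score)
    rows = conn.execute(
        "SELECT id FROM orders WHERE risk_score(total) > ? ORDER BY id",
        (threshold,),
    ).fetchall()
    return [r[0] for r in rows], len(rows)
```

Both variants return the same order ids; the first transfers the whole table, the second only the qualifying rows.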
Towards an algebra for data analytics.
Since its introduction in the 1970’s, Codd’s relational algebra (RA) has served as an indispensable workhorse in the engineering of relational database systems. As the mediating layer between specification of queries by clients in their host language, on one hand, and compilation of optimized physical query execution plans, on the other, the RA is arguably one of the key technologies which led to the rise of practical data management solutions in the 1980’s. Generalizations and extensions of RA played an analogous role in the 1990’s and 2000’s, to address new challenges arising, for example, in the management of object-based and semi-structured data collections.
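The mediating role of an algebra can be illustrated with a toy logical plan and one classical rewrite, pushing a selection below a join. The node types Scan, Select, and Join and the function push_selection are hypothetical and far simpler than the algebra of any real system; the sketch assumes equality predicates on a single column and column names that are unique across tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple  # qualified column names produced by this scan


@dataclass(frozen=True)
class Select:
    column: str    # predicate is "column == value"
    value: object
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    on: tuple      # (left_column, right_column)


def columns_of(plan):
    """Set of column names available in the output of a plan."""
    if isinstance(plan, Scan):
        return set(plan.columns)
    if isinstance(plan, Select):
        return columns_of(plan.child)
    if isinstance(plan, Join):
        return columns_of(plan.left) | columns_of(plan.right)
    raise TypeError(f"unknown plan node {plan!r}")


def push_selection(plan):
    """Rewrite Select(Join(l, r)) so the selection runs on the one join
    input that provides its column; other plans are returned unchanged."""
    if isinstance(plan, Select) and isinstance(plan.child, Join):
        join = plan.child
        if plan.column in columns_of(join.left):
            pushed = push_selection(Select(plan.column, plan.value, join.left))
            return Join(pushed, join.right, join.on)
        if plan.column in columns_of(join.right):
            pushed = push_selection(Select(plan.column, plan.value, join.right))
            return Join(join.left, pushed, join.on)
    return plan


customers = Scan("customers", ("customers.id", "customers.country"))
orders = Scan("orders", ("orders.customer_id", "orders.total"))
on = ("customers.id", "orders.customer_id")
optimized = push_selection(Select("customers.country", "FR", Join(customers, orders, on)))
```

The rewrite is independent of both the surface language that produced the plan and the physical operators that will execute it, which is the property the algebra layer is valued for.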
In the last decade, we have witnessed a continued explosion of research and development of data-intensive systems and languages for big data analytics. These range, for example, from distributed computing frameworks such as Apache Spark and Apache Flink to document-centric data stores such as MongoDB or Microsoft Azure DocumentDB. To bridge the gap between the specification of analytic tasks by clients of these systems, on one hand, and compilation of optimized execution plans, on the other, an analogue of the relational algebra for big data analytics processing is called for. Although each of the systems in the contemporary data engineering landscape to some degree realizes its own flavor of a query algebra, there is currently no recognized logical language which serves this role. Recent efforts such as Apache Calcite are a step in the right direction, but are still focused on the relational paradigm.
A broad community discussion of the features and design of extended algebras for big data analytics, as integrated in general-purpose programming languages, is crucial to bring big data analytics solutions to the next level of maturity.
Goals and outcomes of the meeting.
The goal of this meeting is to take the first steps towards elaborating solutions for (1) a standard language-, data-model-, and platform-independent declarative interface to data providers which is able to leverage available multi-lingual capabilities of data providers; and (2) corresponding compilation and execution strategies. For this broad discussion, we aim to bring together relevant leading researchers from both academia and industry, across the domains of programming languages, data management systems, and distributed and parallel systems. In addition to an in-depth report on the discussions and results of the seminar, other possible outcomes include a community white paper and concrete action plans for collaborations in research and longer-term international projects of broad ambition.