Spark SQL: Relational Data Processing in Spark

09 Oct 2015

Takeaways from Spark SQL: Relational Data Processing in Spark, Armbrust et al., SIGMOD 2015.

Big data systems have grown from the procedural MapReduce interface to also offer relational interfaces, which make users more productive. However, the two kinds of interface have largely remained disjoint, forcing users to choose one or the other. Spark SQL lets users intermix the two seamlessly.

Spark SQL Goals:

Relational processing on both native Spark data (RDDs) and external data sources
High Performance
Extensible data sourcing
Seamless integration of advanced analytics algorithms

Spark SQL runs atop Spark. The two key components are:

A programming interface, the DataFrame API
Catalyst Optimizer

Programming Interface

DataFrame API. A DataFrame is a schema-aware distributed collection of rows that can also be viewed as an RDD of Row objects, so it supports both relational and procedural operations.
SQL data types as well as complex data types (structs, arrays, maps) are first-class entities in the API. The data model is extensible because types can be nested, which helps when modeling data from multiple sources.
Execution is lazy but analysis is eager. Lazy execution lets the optimizer see the whole chain of operations before anything runs, and eager analysis reports errors such as a misspelled column name as soon as the offending line is written, which helps with early debugging and ease of programming.
Native RDDs can be queried relationally at low cost: the schema is inferred using reflection and the objects are accessed in place.
Support for inline User-Defined Functions (UDFs).
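
The split between lazy execution and eager analysis can be illustrated without Spark at all. What follows is a minimal pure-Python sketch, not Spark's implementation: ToyFrame, AnalysisError, where, select and collect are hypothetical names made up for this post. Each operation checks column names against the schema immediately, but only records a step in a plan; nothing touches the rows until collect() is called.

```python
class AnalysisError(Exception):
    """Raised while the plan is being built, before any data is touched."""


class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a schema plus a deferred plan."""

    def __init__(self, rows, schema, plan=()):
        self._rows = rows
        self.schema = tuple(schema)
        self.plan = tuple(plan)  # recorded steps, not yet executed

    def _check(self, column):
        # Eager analysis: validate names now, not at execution time.
        if column not in self.schema:
            raise AnalysisError(f"unknown column {column!r}; schema is {self.schema}")

    def where(self, column, predicate):
        self._check(column)
        step = ("filter", column, predicate)
        return ToyFrame(self._rows, self.schema, self.plan + (step,))

    def select(self, *columns):
        for column in columns:
            self._check(column)
        step = ("project", columns)
        return ToyFrame(self._rows, columns, self.plan + (step,))

    def collect(self):
        # Lazy execution: the recorded steps only run here.
        rows = list(self._rows)
        for step in self.plan:
            if step[0] == "filter":
                _, column, predicate = step
                rows = [r for r in rows if predicate(r[column])]
            else:
                _, columns = step
                rows = [{c: r[c] for c in columns} for r in rows]
        return rows


users = ToyFrame(
    [{"name": "ann", "age": 34}, {"name": "bob", "age": 19}],
    schema=("name", "age"),
)

# Builds a two-step plan; no rows are read yet.
young = users.where("age", lambda age: age < 21).select("name")

# A misspelled column fails immediately, without running anything.
try:
    users.where("agee", lambda age: age < 21)
except AnalysisError as err:
    print("caught early:", err)

print(young.collect())
```

Because the plan is just data until collect() runs, a real system is free to rewrite it first, which is the job of the optimizer described next.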

Catalyst Optimizer

Extensible: new optimization techniques and data-source-specific rules can be added in Scala.
Performs relational query processing and handles the different phases of query execution.
The main data type in Catalyst is a tree of node objects, where every node type is a subclass of a common TreeNode class. Rules are functions that pattern match a subtree and replace it with a new subtree.
A tree goes through four transformation phases:
Analysis. The abstract syntax tree from the SQL parser (or a DataFrame object) is resolved against the catalog: attribute references are looked up and types are inferred.
Logical Optimization. Rule-based optimizations such as predicate pushdown, constant folding and so on.
Physical Planning. Cost-based optimization, used specifically to select join algorithms, for example converting a join with a small relation into a broadcast join.
Code Generation. Scala quasiquotes turn SQL expression trees into Scala ASTs that are compiled to bytecode at runtime, which avoids interpreting the expression tree for every row.
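
To make the tree-and-rule idea concrete, here is a small Python sketch of the expression x + (1 + 2), a constant folding rule, and a toy form of code generation. It is an analogy under stated assumptions rather than Catalyst itself: Catalyst is written in Scala and uses Scala pattern matching and quasiquotes, whereas the names below (Literal, Attribute, Add, transform, fold_constants, to_source, compile_expr) are hypothetical, the traversal is a simple bottom-up pass, and Python's eval stands in for runtime compilation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Optional[Expr]]  # returns None when the rule does not match


def transform(tree: Expr, rule: Rule) -> Expr:
    """Apply `rule` bottom-up; nodes the rule does not match are kept as-is."""
    if isinstance(tree, Add):
        tree = Add(transform(tree.left, rule), transform(tree.right, rule))
    replaced = rule(tree)
    return tree if replaced is None else replaced


def fold_constants(node: Expr) -> Optional[Expr]:
    """Rule: Add(Literal(a), Literal(b)) becomes Literal(a + b)."""
    if (
        isinstance(node, Add)
        and isinstance(node.left, Literal)
        and isinstance(node.right, Literal)
    ):
        return Literal(node.left.value + node.right.value)
    return None


def to_source(node: Expr) -> str:
    """Toy code generation: turn the tree into Python source text."""
    if isinstance(node, Literal):
        return repr(node.value)
    if isinstance(node, Attribute):
        return f"row[{node.name!r}]"
    return f"({to_source(node.left)} + {to_source(node.right)})"


def compile_expr(node: Expr) -> Callable[[dict], int]:
    """Compile once, then evaluate per row without walking the tree again."""
    return eval(f"lambda row: {to_source(node)}")


tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))  # x + (1 + 2)
optimized = transform(tree, fold_constants)              # x + 3
evaluate = compile_expr(optimized)

print(optimized)
print(to_source(optimized))
print(evaluate({"x": 4}))
```

The rule only describes the subtree it cares about and returns None otherwise, so adding a new optimization means writing one more small function and passing it to transform, which is the extensibility argument made above.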
