Spark SQL: Relational Data Processing in Spark
09 Oct 2015. Takeaways from Spark SQL: Relational Data Processing in Spark, Armbrust et al., SIGMOD 2015.
Big data systems have extended the procedural MapReduce interface with relational interfaces for a more productive user experience. However, the two interfaces have remained disjoint within these systems, forcing users to pick one or the other. Spark SQL seamlessly intermixes the two.
Spark SQL Goals:
- Relational processing both within Spark programs (on native RDDs) and on external data sources
- High performance using established DBMS techniques
- Easy support for new data sources, including semi-structured data and external databases
- Extension with advanced analytics algorithms such as graph processing and machine learning
Spark SQL runs atop Spark. The two key components are:
- A programming interface via the DataFrame API
- Catalyst Optimizer
Programming Interface
- DataFrame API. A DataFrame is a schema-aware distributed collection of rows that can also be viewed as an RDD of row objects, so it supports both relational and procedural operations (see the sketch after this list).
- SQL data types as well as complex types (structs, arrays, maps) are first-class entities in the API. Types can be nested, making the data model a good fit for semi-structured data drawn from diverse sources.
- Execution is lazy but analysis is eager. Lazy execution lets the optimizer see the whole computation (with no distinction between relational and procedural parts), while eager analysis surfaces errors such as unknown column names early, easing debugging and programming.
- Native RDDs of Scala/Java objects can be queried relationally at little cost: their schema is inferred via reflection.
- Support for inline user-defined functions (UDFs), registered without a separate packaging step.
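A minimal sketch of these points against the Spark 1.4-era API (the User class, table name, and UDF here are hypothetical illustrations, not from the paper):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Example {
  // Reflection-based schema inference works on ordinary case classes.
  case class User(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ex").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // An RDD of native objects becomes a DataFrame via reflection.
    val users = sc.parallelize(Seq(User("ann", 19), User("bob", 34))).toDF()

    // Relational side: declarative operators that Catalyst optimizes.
    val young = users.where(users("age") < 21).select(users("name"))

    // Inline UDF, registered without a separate packaging step, callable from SQL.
    sqlContext.udf.register("shout", (s: String) => s.toUpperCase)
    users.registerTempTable("users")
    sqlContext.sql("SELECT shout(name) FROM users WHERE age < 21").show()

    // Procedural side: the same data viewed as an RDD of Row objects.
    // Analysis was eager; execution is deferred until this action runs.
    young.rdd.map(_.getString(0)).foreach(println)
  }
}
```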
Catalyst Optimizer
- Extensible: new optimization techniques and data-source-specific rules are added as ordinary Scala code.
- Performs relational query processing, handling the successive phases of query execution.
- The main data type in Catalyst is a tree of node objects (node classes are subclasses of TreeNode), and rules are functions from a tree to a tree, typically written as pattern matches that replace a matched subtree with a new subtree (first sketch after the phases list).
- A tree goes through four transformation phases.
- Analysis (eager). The abstract syntax tree produced by the SQL parser, or a DataFrame object, is resolved against a catalog: attribute references are bound and types are inferred and propagated.
- Logical Optimization. Rule-based optimizations such as predicate pushdown, constant folding, and projection pruning.
- Physical Planning. One or more physical plans are generated and one is selected by a cost model; currently this is used mainly to choose join algorithms, e.g., broadcast joins for small relations using Spark's broadcast facility.
- Code Generation. Scala quasiquotes are used to build Scala ASTs for SQL expressions at runtime, which the Scala compiler turns into JVM bytecode; this avoids per-row interpretation overhead while keeping the generator simple (second sketch below).
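To make the tree-and-rules point concrete, here is a self-contained sketch of a Catalyst-style transform. The classes are simplified stand-ins, not the real Catalyst node types; the constant-folding rule itself is the example used in the paper.

```scala
// Simplified stand-ins for Catalyst tree nodes (the real ones subclass TreeNode).
sealed trait Expression
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

// A rule is a partial function from subtree to subtree, applied bottom-up;
// subtrees the rule does not match are left unchanged.
def transform(e: Expression)(rule: PartialFunction[Expression, Expression]): Expression = {
  val rewritten = e match {
    case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
    case leaf      => leaf
  }
  rule.applyOrElse(rewritten, identity[Expression])
}

// Constant folding: Add(Literal(1), Literal(2)) rewrites to Literal(3).
val folded = transform(Add(Attribute("x"), Add(Literal(1), Literal(2)))) {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
}
// folded == Add(Attribute("x"), Literal(3))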
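And a sketch of quasiquote-based code generation, close to the example in the paper, reusing the Expression classes from the previous sketch. `row` is a free identifier that gets resolved when the generated code is compiled against a concrete row class:

```scala
import scala.reflect.runtime.universe._

// Translate an expression tree into a Scala AST via quasiquotes. The
// resulting Tree can be fed to the Scala compiler at runtime to produce
// JVM bytecode, instead of interpreting the tree node by node per row.
def compile(node: Expression): Tree = node match {
  case Literal(value)   => q"$value"            // splice the constant in
  case Attribute(name)  => q"row.get($name)"    // look the field up in the row
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}

// compile(Add(Literal(1), Attribute("x"))) yields the AST for: 1 + row.get("x")
```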