Impala: A Modern, Open-Source SQL Engine for Hadoop
14 Oct 2015. Takeaways from Impala: A Modern, Open-Source SQL Engine for Hadoop, Kornacker et al., CIDR 2015.
Impala is a high-performance SQL-on-Hadoop system, particularly for multi-user and interactive workloads. Unlike a traditional RDBMS, it is decoupled from the underlying storage engine.
Impala operates on tables stored in a range of file formats. It supports both partitioned and unpartitioned tables, and provides functionality for creating and altering tables and partitions in different file formats.
Impala supports standard SELECT statement syntax, standard SQL scalar data types, analytic (window) functions, user-defined functions (UDFs), and user-defined aggregate functions (UDAs). UPDATE and DELETE statements are not supported; instead, Impala provides for bulk inserts and overwrites and for computing table statistics.
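As a rough sketch of what this SQL surface looks like in practice, the snippet below issues a few representative Impala statements through a Python DB-API client. The impyla client, host name, and the clicks/staging_clicks tables are assumptions made for illustration, not anything prescribed by the paper.

```python
# Minimal sketch of Impala's SQL surface via the `impyla` DB-API client
# (assumed installed). Host, tables, and data are illustrative only.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cur = conn.cursor()

# Create a partitioned table stored as Parquet; the file format is
# chosen per table (and can differ per partition).
cur.execute("""
    CREATE TABLE IF NOT EXISTS clicks (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (day STRING)
    STORED AS PARQUET
""")

# No UPDATE/DELETE: data is changed in bulk instead, e.g. by
# overwriting a whole partition from a (hypothetical) staging table.
cur.execute("""
    INSERT OVERWRITE TABLE clicks PARTITION (day='2015-10-14')
    SELECT user_id, url, ts FROM staging_clicks
""")

# Gather table and column statistics for the cost-based planner.
cur.execute("COMPUTE STATS clicks")

# Standard SELECT syntax, including analytic (window) functions.
cur.execute("""
    SELECT user_id,
           COUNT(*) OVER (PARTITION BY user_id) AS clicks_per_user
    FROM clicks
    WHERE day = '2015-10-14'
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```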
The architecture has three main components:
- Impala daemon (impalad)
- Symmetric daemons that run on every node of the cluster. Symmetric means that every impalad can take on every role; in particular, any impalad can act as the coordinator for a given query.
- Accepts queries from clients and orchestrates the execution of query fragments across the cluster
- Statestore daemon (statestored)
- A metadata publish-subscribe service that disseminates cluster-wide metadata (organized into topics) to all Impala processes. There is a single statestored instance per cluster.
- Sends two kinds of messages to every subscriber:
- Topic updates: deltas of a topic's entries since the subscriber's last update, sent to all subscribers of that topic. Subscribers respond with any changes they wish to make to their subscribed topics.
- Keepalive: used to check the liveness of each subscriber.
- Weak update semantics: subscribers are not guaranteed to hold the same version of a topic at any point in time. This is not a problem for impalads, since they do not coordinate decisions through the statestore; the only drawback is that decisions may be made on slightly stale metadata (a toy sketch of this protocol follows the list below).
- Catalog daemon (catalogd)
- Serves catalog metadata to Impala daemons via the statestore's broadcast mechanism.
- Pulls information from third-party metadata stores (such as the Hive Metastore) and aggregates it into an Impala-compatible catalog structure.
- Also maintains Impala-specific catalog state, such as UDFs.
- Since catalogs are often very large, complete table metadata is loaded lazily.
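To make the statestore's topic/delta protocol concrete, here is a minimal, illustrative sketch. Impala's actual statestored is a C++ service; the class, field names, and timeout below are assumptions, not its real interface.

```python
# Toy sketch of the statestore's topic-update protocol (illustrative
# only; the real statestored is a C++ service, not this class).
import time


class Statestore:
    def __init__(self, keepalive_timeout=30.0):
        self.topics = {}           # topic -> list of (version, key, value)
        self.next_version = {}     # topic -> latest version number
        self.subscribers = {}      # sub_id -> {"topics": {...}, "last_seen": t}
        self.keepalive_timeout = keepalive_timeout

    def register(self, sub_id, topic_names):
        # A new subscriber starts at version 0 for each subscribed topic.
        self.subscribers[sub_id] = {
            "topics": {t: 0 for t in topic_names},
            "last_seen": time.time(),
        }

    def publish(self, topic, key, value):
        # Every new entry gets the next version number for its topic.
        version = self.next_version.get(topic, 0) + 1
        self.next_version[topic] = version
        self.topics.setdefault(topic, []).append((version, key, value))

    def topic_update(self, sub_id):
        """Send only the entries this subscriber has not yet seen (a delta)."""
        sub = self.subscribers[sub_id]
        delta = {}
        for topic, last_seen in sub["topics"].items():
            new_entries = [(v, k, val) for (v, k, val)
                           in self.topics.get(topic, []) if v > last_seen]
            if new_entries:
                delta[topic] = new_entries
                sub["topics"][topic] = max(v for v, _, _ in new_entries)
        return delta

    def keepalive(self, sub_id):
        """Record that a subscriber is still alive."""
        self.subscribers[sub_id]["last_seen"] = time.time()

    def evict_dead_subscribers(self):
        # Subscribers that miss keepalives for too long are dropped.
        now = time.time()
        dead = [s for s, info in self.subscribers.items()
                if now - info["last_seen"] > self.keepalive_timeout]
        for s in dead:
            del self.subscribers[s]
        return dead
```

Because each subscriber tracks only the versions it has already seen, two impalads polled at different times can legitimately hold different views of a topic, which is exactly the weak update semantics described above.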
A typical workflow can be divided into two portions:
- Frontend
- Query parsing: SQL statement to parsed tree
- Semantic analysis
- Query planning
- First, a single-node plan is generated
- Then a parallelized, distributed plan is generated by dividing the single-node plan into fragments connected by exchange operators (see the sketch after this list)
- Query optimization: predicate pushdown, cost-based join ordering, and coalescing of analytic functions
- Backend
- Runtime code generation (using LLVM) to produce query-specific, inlined functions with minimal instructions, removing interpretation overhead
- I/O management
- Support for the different storage formats (e.g. Parquet, Avro, text) is managed here
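To make the two-phase planning concrete, the sketch below builds a toy single-node plan (scan, join, aggregate) and then cuts it into fragments wherever rows must be repartitioned, inserting exchange operators at the fragment boundaries. The classes and partitioning rule are simplified illustrations, not Impala's (Java) planner code.

```python
# Toy illustration of two-phase planning: build a single-node plan,
# then split it into fragments connected by EXCHANGE operators.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    op: str                                   # e.g. "SCAN", "HASH JOIN", "AGG"
    children: List["PlanNode"] = field(default_factory=list)
    detail: str = ""                          # table name or key column


@dataclass
class Fragment:
    root: PlanNode
    partitioning: str                         # "UNPARTITIONED" or "HASH(col)"


def parallelize(single_node_root: PlanNode) -> List[Fragment]:
    """Split a single-node plan into fragments, adding EXCHANGE nodes
    wherever rows must be shipped between fragments."""
    fragments: List[Fragment] = []

    def split(node: PlanNode) -> PlanNode:
        new_children = []
        for child in node.children:
            child = split(child)
            # Simplified rule: joins and aggregations consume data that
            # must be repartitioned, so each child becomes its own
            # fragment and an EXCHANGE takes its place in the parent.
            if node.op in ("HASH JOIN", "AGG"):
                fragments.append(Fragment(child, f"HASH({node.detail})"))
                child = PlanNode("EXCHANGE", [], node.detail)
            new_children.append(child)
        node.children = new_children
        return node

    root = split(single_node_root)
    # The final fragment returns results to the coordinator.
    fragments.append(Fragment(root, "UNPARTITIONED"))
    return fragments


# Phase 1: single-node plan for
#   SELECT ... FROM t1 JOIN t2 ON t1.id = t2.id GROUP BY t1.id
single_node = PlanNode("AGG", [
    PlanNode("HASH JOIN", [
        PlanNode("SCAN", detail="t1"),
        PlanNode("SCAN", detail="t2"),
    ], detail="id"),
], detail="id")

# Phase 2: parallelized plan.
for i, frag in enumerate(parallelize(single_node)):
    print(f"Fragment {i}: root={frag.root.op}, partitioning={frag.partitioning}")
```

In the real planner, scan fragments run on the nodes that hold the data and the cost-based optimizer chooses between broadcast and partitioned joins; the toy rule above always repartitions.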
Resource and workload management: Impala can integrate with YARN through Llama (Low Latency Application Master), a low-latency resource mediator, and also supports a built-in admission control mechanism, sketched below.
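The admission control idea is roughly: admit a query if its resource pool has headroom, queue it if not, and reject it once the queue is full. The sketch below illustrates that logic with made-up pool limits and a per-query memory estimate; none of the names or thresholds come from Impala's actual implementation.

```python
# Illustrative sketch of pool-based admission control: admit when the
# pool has memory headroom, queue when it does not, reject when the
# queue is full. Limits are made-up defaults, not Impala's.
from collections import deque


class AdmissionController:
    def __init__(self, pool_mem_limit_mb=8192, max_queued=50):
        self.pool_mem_limit_mb = pool_mem_limit_mb
        self.mem_admitted_mb = 0
        self.max_queued = max_queued
        self.queue = deque()

    def submit(self, query_id, estimated_mem_mb):
        # Admit immediately if the pool has room for the estimate.
        if self.mem_admitted_mb + estimated_mem_mb <= self.pool_mem_limit_mb:
            self.mem_admitted_mb += estimated_mem_mb
            return "ADMITTED"
        # Otherwise queue, unless the queue itself is full.
        if len(self.queue) < self.max_queued:
            self.queue.append((query_id, estimated_mem_mb))
            return "QUEUED"
        return "REJECTED"

    def release(self, estimated_mem_mb):
        """Called when an admitted query finishes; admit queued queries
        that now fit within the pool limit."""
        self.mem_admitted_mb -= estimated_mem_mb
        admitted = []
        while self.queue:
            qid, mem = self.queue[0]
            if self.mem_admitted_mb + mem > self.pool_mem_limit_mb:
                break
            self.queue.popleft()
            self.mem_admitted_mb += mem
            admitted.append(qid)
        return admitted
```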