BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data23 Nov 2015
Takeaways from BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data, Agarwal et al, Eurosys, 2013.
Aggregate/roll-up analytic workloads require near real-time response rate.
- Uniform Samplling. Issues: Make strong assumptions about the query workload.
- Online Aggregation. Issues: Highly variable performance.
BlinkDB proposes a distributed sampling-based query processing system. Users pose SQL-based aggregation queries along with response time and error bound constraints.
BlinkDB assumes the Predictable Query Column Set (QCS) workload model. QCSs are columns used for grouping or filtering. In production, QCSs adhere to a zipf distribution, thus naturally restricting the number of unique QCSs that need to be sampled. Interesting observation that the paper makes is that only 211 QCSs constitute almost all the queries at Facebook. (Doesn’t 211 seem to be a very small number?).
- Query Parser.
- Offline Sampling Module.
- Creates and maintains stratified samples for QCSs over time. Stratified samples ensure adequate representation to under-represented column-set keys as well.
- Decision on which QCSs to sample is determined by a MILP, giving weightage to higher probability of occurrence, coverage and sparsity.
- Run-time sample selection module.
- Also makes changes to Query Optimizer to decide on the optimal sample to choose for a particular query - preference is given to smallest possible QCS, suitable physical distribution of data and QCS that yield high selectivity.
- Creates Error-Latency Profiles to decide on the sample sizes.
- Bias Correction(appropriate weightage) to offset the over- / under-representation of column-set keys in the stratified sample.