kshiteej    CV    Posts    Contact

Global Analytics in the Face of Bandwidth and Regulatory Constraints

Takeaways from Global Analytics in the Face of Bandwidth and Regulatory Constraints, Vulimiri et al, NSDI, 2015.

Existing Approach

  • Existing Approach for Global Analytics: Transfer data to a central data center.
  • Issues:
    • Transfers significant data volumes.
    • Limit on the number of applications due to limited cross-border network bandwidth.
    • Network capacity growth is decelerating.
    • Regulatory constraints on data movement.

Problem Geode tries to solve:

Provide wide area analytics while minimizing bandwidth over geo-distributed data structured as SQL tables.

Assumption

Resources within a single data center are relatively cheap compared to cross-data center bandwidth.

Architecture.

  • Central Command Layer.
  • Thin proxy layer atop local analytics stack.
  • Central Workload Optimizer.

Details.

  • Subquery Deltas.
    • Cache query results at the source and destination with the query’s signature.
    • Instead of sending fresh results, send only a diff (delta) between the new and old results.
    • Cache results on a sub-query basis.
  • Workload Optimizer.
    • Identify optimal centralized plan.
    • Use default Calcite implementation.
    • Rough SQL table statistics in Hive to get an optimized join algorithm selection.
    • Use pseudo-distributed measurement to estimate data movement.
    • Simulate a virtual topology in which each base data partition is a separate datacenter, even if the data is centralized.
    • Achieved by rewriting SQL queries to simulate data partitions.
    • Evaluate alternative data partitions to get an indication of optimal. To reduce the sample space, do a worst case evaluation (worst data partition).
    • Combine all measurements to jointly solve site selection and data replication problem.
    • Formulate an ILP to jointly solve both problems to minimize total bandwidth cost.
    • Due to scalability constraints, alternatively use greedy approach.
    • Solve site selection problem in isolation based on sovereignity constraints.
    • Greedily pick the datacenter to copy input data needed per-task based on lowest cost.