Omega: Flexible, Scalable Schedulers for Large Compute Clusters
02 Oct 2015

Takeaways from Omega: Flexible, Scalable Schedulers for Large Compute Clusters, Schwarzkopf et al., EuroSys 2013.
This paper investigates a new cluster scheduling architecture, Omega, based on parallelism, shared state, and optimistic concurrency control.
The need for a new scheduling architecture is motivated by the following trends in Google’s compute clusters:
- Diverse workloads and policies. Workloads vary in resource requirements, priorities, latency (time-to-completion) constraints, and more.
- Increasing cluster sizes.
- Growing job arrival rates.
Google’s existing monolithic production job scheduler has, in keeping up with these trends, evolved into a complicated, sophisticated system that is hard to maintain and threatens to become a scalability bottleneck. This paper outlines the search for a better scheduler architecture.
The authors detail and compare three cluster scheduling architectures.
- Monolithic schedulers. Example: Google’s existing job scheduler. A single, centralized scheduler handles all jobs. Problems: hard to extend to diverse workloads and policies, hard to maintain, and a potential scalability bottleneck.
- Two-level schedulers. Example: Mesos. A single resource manager offers compute resources to multiple parallel, independent schedulers, each of which manages scheduling over the resources allotted to it. Problems: hoarding (a scheduler holding on to resources without freeing them) and information hiding (one scheduler may be under-utilizing resources that another scheduler needs, since no scheduler sees the full cluster state).
- Shared-state schedulers. Multiple parallel, independent schedulers work on shared state. Each scheduler has a local copy of the cluster state called the “cell state”. To schedule a job, a scheduler updates its cell state and computes an optimized allocation of the job’s tasks (note: each job consists of multiple tasks) against it. This opens up the possibility of a race: several parallel schedulers can claim overlapping machines in the cluster at the same time. Races can be resolved with pessimistic lock-based concurrency control or with optimistic lock-free concurrency control; Omega uses the latter. In case of a conflict, the first commit is honored, and the schedulers that lose out repeat the whole procedure for the conflicting tasks (see the sketch after this list).
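To make the optimistic-concurrency loop concrete, here is a minimal Python sketch of a shared-state scheduler’s retry cycle. All names (`shared_state.snapshot`, `try_commit`, `scheduler_logic.place`) are illustrative stand-ins, not Omega’s actual API:

```python
def schedule_job(job, shared_state, scheduler_logic, max_retries=10):
    """Optimistic, lock-free scheduling loop in the style of Omega.

    `shared_state` and `scheduler_logic` are hypothetical objects;
    the method names below are illustrative, not the paper's API.
    """
    pending_tasks = list(job.tasks)
    for _ in range(max_retries):
        # 1. Refresh the local copy of the shared cluster state
        #    (the scheduler's private "cell state").
        cell_state = shared_state.snapshot()

        # 2. Compute placements for the still-unscheduled tasks
        #    against the local cell state.
        placements = scheduler_logic.place(pending_tasks, cell_state)

        # 3. Try to commit atomically. Another scheduler may have
        #    claimed some of the same machines in the meantime; the
        #    shared state honors the first writer and rejects the rest.
        committed, conflicted = shared_state.try_commit(placements)

        # 4. Committed tasks stay placed; only the losers are retried.
        pending_tasks = [p.task for p in conflicted]
        if not pending_tasks:
            return True   # every task placed
    return False          # gave up after repeated conflicts
```

The point is that no lock is held across a scheduling decision; a conflict only costs a retry for the tasks that actually overlapped.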
The above three architectures are compared via simulations on Google-specific workloads.
The workloads consist of two broad categories of jobs: batch jobs and service jobs. In terms of distribution, most of the jobs in a workload are batch jobs, but most of the resources are consumed by service jobs. Batch jobs, as opposed to service jobs, are short-lived and have shorter inter-arrival times. Batch jobs are also expected to be scheduled interactively, i.e., with fast turnaround.
The first set of simulations uses a lightweight simulator and synthetic, near-real workloads. Per-job scheduler decision time is modeled as a fixed per-job time plus a variable component linearly proportional to the number of tasks in the job. Scheduler busyness (time spent scheduling / total elapsed time) is observed while varying the per-job and per-task times (a small sketch of this model follows).
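In code, the decision-time model and the busyness metric look roughly like this (a minimal sketch; the parameter names `t_job` and `t_task` are mine, not the simulator’s):

```python
def decision_time(num_tasks, t_job, t_task):
    # Per-job decision time: a fixed per-job cost plus a per-task
    # cost that grows linearly with the number of tasks in the job.
    return t_job + t_task * num_tasks

def scheduler_busyness(task_counts, t_job, t_task, elapsed_time):
    # Fraction of wall-clock time the scheduler spends deciding.
    busy = sum(decision_time(n, t_job, t_task) for n in task_counts)
    return busy / elapsed_time

# Example: 100 jobs of 10 tasks each arriving over 1000 seconds,
# with a 0.1 s per-job cost and 5 ms per task.
print(scheduler_busyness([10] * 100, 0.1, 0.005, 1000.0))  # 0.015
```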
- Monolithic schedulers with a single code path perform poorly as per-job/per-task decision times increase (for both batch and service jobs). Monolithic schedulers with a faster code path for batch jobs do better than their single-path counterparts; however, as service-job decision time increases, batch jobs suffer head-of-line blocking behind slow service-job decisions (see the toy illustration after this list).
- The two-level scheduler (Mesos) performs poorly, with a number of jobs being abandoned. This is because Mesos achieves fairness by alternately offering nearly all of the available cluster resources to one scheduler at a time. A scheduler that is currently starved of offers repeatedly attempts to schedule its jobs on the meager resources it holds, leading to high scheduler busyness.
- Shared-state scheduling performs at least as well as the multi-path monolithic scheduler. Head-of-line blocking is not observed here, since service jobs and batch jobs have separate, parallel schedulers. Moreover, mean scheduler busyness decreases with increasing parallelism (more parallel schedulers).
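Why head-of-line blocking hits a single sequential scheduler but not parallel per-type schedulers can be seen in a toy model (all numbers and names below are made up for illustration):

```python
def fifo_wait_times(jobs, cost):
    # One sequential scheduler with a single queue (the monolithic
    # case, including the multi-path variant): a slow service-job
    # decision at the head delays every batch job queued behind it.
    clock, waits = 0.0, []
    for arrival, kind in jobs:            # jobs sorted by arrival
        start = max(clock, arrival)
        waits.append(start - arrival)
        clock = start + cost(kind)
    return waits

def parallel_wait_times(jobs, cost):
    # One independent scheduler per job type (the shared-state case):
    # batch jobs never queue behind slow service decisions.
    clocks, waits = {}, []
    for arrival, kind in jobs:
        start = max(clocks.get(kind, 0.0), arrival)
        waits.append(start - arrival)
        clocks[kind] = start + cost(kind)
    return waits

# A slow-to-schedule service job arrives just before a burst of batch jobs.
jobs = [(0.0, "service")] + [(0.1 * i, "batch") for i in range(1, 5)]
cost = lambda kind: 10.0 if kind == "service" else 0.1
print(fifo_wait_times(jobs, cost))      # batch jobs each wait ~10 s
print(parallel_wait_times(jobs, cost))  # batch jobs barely wait at all
```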
The second set of simulations is done on a realistic, high-fidelity simulator driven by real traces; only the shared-state scheduler is tested with it. The proportion of scheduling conflicts for service jobs increases with increasing per-job decision times. This is improved by fine-grained conflict detection (a placement conflicts only if the specific machines it touches have changed, rather than rejecting a transaction whenever anything in the cell changed) and incremental conflict resolution (accepting the non-conflicting placements and retrying only the conflicting ones).
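A minimal sketch of the two conflict-detection granularities, assuming version (sequence) numbers on the whole cell and on each machine; the field names are illustrative, not the paper’s:

```python
def conflicts_coarse(txn, cell):
    # Coarse-grained: reject the whole transaction if *anything* in
    # the cell changed since the scheduler took its snapshot.
    if cell.version != txn.snapshot_version:
        return list(txn.placements)   # everything must be retried
    return []

def conflicts_fine(txn, cell):
    # Fine-grained: a placement conflicts only if its target machine
    # changed since the snapshot or can no longer fit the task.
    conflicted = []
    for p in txn.placements:
        machine = cell.machines[p.machine_id]
        if (machine.version > txn.snapshot_version
                or machine.free_resources < p.task.demand):
            conflicted.append(p)
    # Incremental resolution: commit every placement not returned
    # here and let the scheduler retry only the conflicting ones.
    return conflicted
```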
Lastly, the flexibility of the Omega cluster scheduling architecture is showcased via experiments with three different resource-allocation policies for MapReduce jobs. Simulations predict that all the policies improve performance, with the largest speedups, in the 3-4x range, coming from the max-parallelism policy (a rough sketch follows).
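As a rough sketch of the max-parallelism idea, under the assumption that the policy simply grants a MapReduce job as many workers as it can use within the cell’s spare capacity (parameter names are mine, not the paper’s):

```python
def max_parallelism_workers(runnable_tasks, spare_slots, per_job_cap=None):
    # Keep adding workers as long as the job can use more and the
    # cell has spare capacity. `per_job_cap` is a hypothetical knob.
    workers = min(runnable_tasks, spare_slots)
    if per_job_cap is not None:
        workers = min(workers, per_job_cap)
    return workers
```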