Reservation-based Scheduling: If You're Late Don't Blame Us!

26 Oct 2015

Takeaways from Reservation-based Scheduling: If You’re Late Don’t Blame Us!, C. Curino, D.E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, S. Rao, SoCC 2014.

The authors propose Rayon, a reservation-based scheduler.

Goals.

Maximize cluster utilization and hence Return-on-Investment (ROI)
Be workload aware. Typically adhere to the requirements of two classes of jobs in company workloads.
- Respect completion deadlines for Production jobs. These are large and long running with strict service-level agreements (SLA). These jobs constitute 5% of all the jobs but consume 90% of the resources (yet another instance of the Pareto principle!).
- Provide least possible completion latency for Best-effort jobs. These are abundant but smaller jobs without explicit SLAs due to their inherent interactive nature.

Existing schedulers are lacking on several fronts.

Current schedulers are time-agnostic offering sharing policies based on priority, fairness and capacity.
This leads to inefficiencies and ad-hoc workarounds such as overprovisioning and manual monitoring for completion of production jobs within deadlines.
Cannot support a mix of workloads, such as those requiring gang semantics or inter-job dependencies.

Rayon addresses these issues via a time-predictable resource allocation mechanism that meets SLAs, minimizes best effort job latency and maximizes cluster utilization.

Rayon consists of three main stages.

Reservation. This involves specifying and understanding the diverse workload requirements. To do this the authors propose the Reservation Definition Language (RDL).
- RDL is envisioned as a composition of atomic expressions (representing each task) of the form atom(b,g,h,l,w), where,
  - b is a vector of resource requirements per unit of the job
  - g is the minimum number of parallel bundles required
  - h is the maximum number of parallel bundles supported
  - l is the minimum lease duration of each allocation
  - w represents the total work to be done by the job
- These expressions can support malleable jobs as MapReduce (g, l = 1) as well as strict gang jobs as MPI (g = h, l = w/h).
- RDL also proposes several operators for composition of expressions.
  - window operator enables SLA specification by capturing start (s) and end (e) time.
  - order / all /any operators enable specification of inter-job dependencies or choices.
Planning. Planning involves assigning temporal and resource allocations for all the reservations. The planning problem is modeled as a Mixed-Integer Linear Program (MILP). The planning stage involves a combination of MILP and heuristics (GREE-L) to yield a practical, near-optimal and timely solution.
- MILP. The problem is formalized as an optimization problem minimizing the cost of an allocation subject to capacity (bounded $\Sigma b$ ), parallelism (g,h) and SLA (l,w,s,e) constraints. I discuss more about this in the latter part of this summary on MILP in Rayon. I also sketch a modification to the MILP formulation.
- Procrastinating heuristic (GREE). The job allocation is done as near to the job completion deadline as possible. GREE finds the tallest, rightmost allocation for each of the job.
- Lower preemption heuristic (GREE-L). The job allocation is much smoother to avoid job preemptions at the cost of reduced job acceptance (in other words job SLA violations). Allocation per job is as flat as possible.
- MILP is used most-often in an offline manner, to reorganize already accepted jobs.
- Heuristics are most-often used in an online manner, to commit to accept/reject incoming stream of jobs.
Adaptive Scheduling. This involves dynamic rearrangement of Plan in response to changes in cluster capacity (because of node failures) and job expectations (because reservation request are approximate in nature).

Rayon Architecture.

Rayon is implemented as an extension of YARN components and delegates several of its functional steps to different YARN components.
- Job Managers (JM) for each job. This mainly implements the Reservation stage.
- Node Managers (NM) running on each worker node.
- A central Resource Manager (RM). This mainly implements the Planning and Adaptive Scheduling stage.
Rayon consists of the following logical series of steps.
1. JM converts the estimated demand for incoming job to RDL expressions and submits to RM.
2. Planner in RM validates the reservation and determines feasibility of accepting the reservation. A realtime feedback is sent to the JM accepting or rejecting the job.
3. Scheduler in RM dispatches resources based on sharing policies and job preferences. PlanFollower monitors cluster conditions and converts job deliverables to modified runtime parameter assignments.
4. JM receives access to resources from the RM when the job is ready to run. JM coordinates with the NM.
5. Adapter and PlanFollower in RM are responsible for adaptive scheduling. Job progress and cluster conditions are monitored to rerun Planner in offline mode for already accepted jobs to re-adjust accepted jobs.

MILP in Rayon

Rayon formalizes the planning problem as follows. This part of the paper is interesting from a theoretical perspective and is an interesting formulation of the planning problem as an optimization problem.

$minimize \sum_{j,t} c_{j,t} * x_{j,t} \\ subject\ to: \\ \forall t : \sum_{j} x_{j,t} \le Cap_t \\ \forall j \forall t : x_{j,t} \le h_j \\ \forall j : \sum_{s_j \le t \lt f_j} x_{j,t} = w_j \\ \forall j,t : x_{j,t} \in R^{+}$

Here $x_{j,t}$ represents the resources allocated to job j during timestep t. $c_{j,t}$ is the cost associated with assigning the job j in the timestep t. Controlling $c_{j,t}$ across jobs and timesteps is a way to assign priorities to jobs (as we saw before was useful for the adapter in RM during offline re-adjustments of the plan) and also proffers ability to make certain timesteps expensive.

The authors also detail extensions for several RDL promised use cases and optimizations.

Supporting gang semantics.
- By changing allowed values of $x_{j,t}$ to an integral multiple of $g_j$
Supporting minimum lease duration.
- By conversion of RDL to composition of fully rigid expressions and thereafter constraining the sum of absolute difference of $x_{j,t}$ in consecutive timesteps to $2 * g_j$ .
Supporting any or all operators.
- By adding a per-job slack variable contributing to the total work done. The slack variable has an all or nothing assignment in contribution to the total work done. All or nothing assignment is an indication of whether the job is not-accepted or accepted respectively. For any operator, only one zero (i.e. nothing) assignment is allowed, whereas for all the assignments are bounded to be zero assignments.
Supporting order operator.
- New integer support variables signalling the start and the end of each job are introduced. These are constrained across jobs to achieve order.
Minimizing preemption.
- This is done by introducing the sum of absolute difference of $x_{j,t}$ in the objective function.
Supporting near-feasible solutions.
- Slack variables are indicative of job acceptance. Thus, weighted slack variables (with weightage proportional to cost of rejecting that job) are introduced in the objective function to achieve this.

However, the paper goes on to say the impracticality of using a MILP solver. MILP-based solutions are used in a limited offline scenario in Rayon. The paper also talks of considering data locality constraints when scheduling the job in the Scheduler of RM. Data-locality as a constraint and cost-based optimization is not present in the MILP formulation presented in the paper.

Out of pure theoretical interest, I sketch a possible way for data locality to be formally introduced in the MILP formulation.

Conversion of $x_{j,t}$ to a multi-dimensional vector, with number of dimensions equal to the number of nodes and introduction of a multi-dimensional $r_{j,t}$ resource vector. $r_{j,0}$ is initialized during the reservation stage, if the job is aware of the nodes on which its data sources reside, the corresponding dimensions of $r_{j,0}$ vector are given non-zero values (this can either be a fixed value or directly proportional to the size of the data source on that node).
The optimization problem is remodeled as follows: (in the following notation $u_{j,t}$ denotes a unitary vector - i.e. a vector of all 1’s for each node dimension)

$maximize\ \sum_{j,t} \frac{(r_{j,t} \cdot x_{j,t})}{c_{j,t}} \\ subject\ to: \\ \forall t : \sum_{j} x_{j,t} \le Cap_t \\ \forall j \forall t : x_{j,t} \cdot u_{j,t} \le h_j \\ \forall j : (\sum_{s_j \le t \lt f_j} x_{j,t}) \cdot u_{j,t} = w_j \\ \forall j,t : x_{j,t} \in R^{+^{n}}$

Note that in the above formulation, $Cap_t$ is also a multi-dimensional vector and we are able to capture heterogeneity across nodes.
$r_{j,0}$ can be perpetuated to $r_{j,t}\ \forall t$ . However intermediate data sources are dependent on the actual assignment of $x_{j,t}$ across timesteps. That is the non-zero dimensions in $x_{j,t}$ will be an indication of the data sources for $x_{j,t+1}$ . This can be captured by changing the objective function to something as follows: $maximize\ \frac{(r_{j,0} \cdot x_{j,0})}{c_{j,0}} + \sum_{j,t} \frac{(\delta (x_{j,t}) \cdot x_{j,t+1})}{c_{j,t}}$ . Here $\delta (x_{j,t})$ represents the signum function. However this changed objective function ceases to be a linear optimization problem. We can rather stick to the simplistic $maximize\ \sum_{j,t} \frac{(r_{j,t} \cdot x_{j,t})}{c_{j,t}}$ objective function with $r_{j,0} = r_{j,t} \forall t$ . This is still effective in that it prevents deviations from the nodes on which initial data sources were and prevents allocations from spilling beyond this initial set for any timestep.

kshiteej CV Posts Contact

Reservation-based Scheduling: If You're Late Don't Blame Us!

MILP in Rayon

Related Posts

Tracking my runs 29 Dec 2015

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data 23 Nov 2015

Project Adam: Building an Efficient and Scalable Deep Learning Training System. 20 Nov 2015