kshiteej    CV    Scrapbook

Discretized Streams: Fault-Tolerant Streaming Computation at Scale.

Takeaways from Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al, SOSP, 2013.

The main idea in Discritized Streams (D-Streams) is to structure a streaming computation as a series of stateless, deterministic batch computations on a fine granularity (small time intervals).

Existing Stream Processing Models.

Most previous systems used the continuous operator model, wherein streaming computation is divided into a set of long-lived stateful operators. However the caveats here include statefulness of the operators and nondeterminism due to record interleaving which complicate fault tolerance. Since operators are stateful, rebuilding state for recovery is essential and based on either of two schemes:

  • Replication. Here nodes are replicated to perform same computation on the same set of inputs. A synchronization protocol to avoid side-effects of record reordering is necessary here. Straggler mitigation is impossible here because of the synchronization with the straggler that prevents other nodes from progressing.
  • Upstream backup. Here messages since a checkpoint are retained and reprocessed in case of failure to rebuild state. Straggler mitigation can indirectly be handled by treating straggler as a failure, but slow state-recovery defeats the purpose.

D-Streams aims to address these issues, with key goals being:

  • Scalability.
  • Minimal cost.
  • Latency and recovery from fault-tolerance/stragglers at the granularity of seconds.

D-Streams Details.

  • Computation Model.
    • Each D-Stream in the program operates either on input data or RDDs every timestep.
    • Each D-Stream outputs RDDs or outputs data.
    • There is no difference between RDDs generated by Spark and D-Streams, hence batch and interactive processing can be unified.
  • Timing Considerations. The inherent timing semantic segregates data into timesteps on the order of arrival and not some inherent property of the data such as a timestep. D-Streams offers two ways to handle this:
    • By delaying the batch operation by a “slack time”. If the slack time is sufficient this enables regrouping of data into timesteps based on the particular property requested.
    • By application level corrections, wherein results are corrected for a timestep in the past based on new data received in the subsequent timesteps. The computation of the corrected output for the past timestep can be incremental on the existing output of that timestep.
  • D-Stream API. D-Stream supports following operations.
    • Read input streams (in a time-batched manner) from storage systems or outside (as in a network port) or RDDs.
    • Perform transformations - both stateless and stateful.
    • Stateless transformation such as map, reduce, groupBy and join.
    • Stateful transformations such as window, reduceByWindow or track operations.
    • Write output to external systems (such as HDFS).
  • Consistency semantics. D-Stream provides “exactly once” processing across the cluster guarantee.

Architecture.

Spark Streaming implements D-Streams via modifications to Spark and consist of three main components:

  • Master that tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions.
  • Worker nodes that receive data, store the partitions of input and execute tasks
  • Client library used to send data to the system

Fault and Straggler Recovery.

  • Fault Recovery
    • Lineage is tracked at the D-Streams as well as the RDD level. Lineage helps in recovery and parallelism in recovery at the time and operator level.
    • Operator level parallelism in recovery is aided by RDD lineage on a per-partition basis.
    • Time level parallelism in recovery is aided by D-Stream level lineage which is on a per timestep basis.
  • Straggler Mitigation
    • By running speculative backup of slow tasks. Slow task identification is done based on running-time thresholds.