
Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

Takeaways from Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, Li et al., SoCC 2014.

Tachyon is an in-memory storage system that achieves high write and read throughput. It uses persistently stored lineage for fault tolerance.

Why Tachyon?

  • Existing fault-tolerant distributed filesystems suffered from slow synchronous writes. These filesystems relied on replication, persistent storage, or both for fault tolerance; the accompanying network and disk I/O overheads made writes slow.

Architecture

Internally Tachyon consists of two layers:

  • Lineage
    • This layer tracks the sequence of jobs and the files each job consumes and generates - this record is called lineage.
    • Lineage helps recompute lost data.
  • Persistence
    • This layer is responsible for asynchronous checkpointing to persistent storage and for backward compatibility with existing storage systems.

Tachyon follows a master-slave architecture similar to HDFS. The master is called the workflow manager; the slaves are Tachyon worker daemons. The work of the two layers above is split across the master and the workers; a minimal sketch of this split follows the list below.

  • Master / Workflow Manager
    • Tracks metadata (similar to HDFS).
    • Tracks lineage information and computes checkpoint preference.
    • Interacts with cluster resource manager to allocate resources for recomputation.
    • Passive standby masters ensure master fault-tolerance.
  • Workers
    • Manage local resources, execute the master's directives, and send status updates to the master.
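
A minimal sketch of how this split of responsibilities could look, using hypothetical class, method, and directive names rather than Tachyon's actual code:

```python
# Hypothetical sketch of the master/worker heartbeat loop; class, method and
# directive names are illustrative, not Tachyon's real API.

class WorkflowManager:
    """Master: tracks worker status and metadata, hands back directives."""
    def __init__(self):
        self.worker_status = {}                 # worker id -> last reported status

    def heartbeat(self, worker_id, status):
        self.worker_status[worker_id] = status
        # Directives could ask a worker to checkpoint or evict specific files.
        if status["mem_used"] > 0.9 * status["mem_cap"]:
            return ["evict-lru"]
        return []

class TachyonWorker:
    """Slave daemon: manages local memory, reports status, applies directives."""
    def __init__(self, worker_id, master, mem_cap):
        self.worker_id, self.master = worker_id, master
        self.mem_used, self.mem_cap = 0, mem_cap

    def send_heartbeat(self):
        status = {"mem_used": self.mem_used, "mem_cap": self.mem_cap}
        for directive in self.master.heartbeat(self.worker_id, status):
            print(self.worker_id, "received directive:", directive)

master = WorkflowManager()
worker = TachyonWorker("worker-1", master, mem_cap=64 << 30)
worker.send_heartbeat()                         # memory is empty, so no directives
```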

Details

Lineage

  • Has a well-defined API for creating and retrieving lineage. This API is exposed to the execution framework, so applications running on the framework use Tachyon transparently.
  • Lineage storage overhead is reduced by deleting lineage records that precede the most recent checkpoint and by avoiding duplicate storage of binary programs.
  • Specifically, a lineage record consists of an ordered list of input and output files, the binary program and the associated configuration to be used for recomputation, and the input-output dependency type, as sketched below.
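
As a concrete illustration, a lineage record carrying the fields listed above might look like the following sketch (field and file names are mine, not the paper's exact API):

```python
# Sketch of the information a lineage record captures, per the fields listed
# above. Field and file names are illustrative, not the paper's exact API.

from dataclasses import dataclass

@dataclass
class Lineage:
    input_files: list        # files the job read
    output_files: list       # files the job produced
    binary_program: str      # program to re-run if an output is lost
    config: dict             # configuration the program ran with
    dependency_type: str     # input-output dependency type, e.g. "narrow" or "wide"

# Hypothetical example: a job that groups an event log by user.
record = Lineage(
    input_files=["/logs/events"],
    output_files=["/logs/events-by-user"],
    binary_program="group_by_user.jar",
    config={"partitions": 128},
    dependency_type="wide",
)
print(record.dependency_type)
```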

Storage policy

  • A workload's working set is preferentially stored in memory.
  • To handle data that does not fit in memory, files above a size threshold are synchronously written directly to persistent storage. In addition, an extensible LRU data-eviction policy is used (see the sketch after this list).
  • Before a planned change to the cluster environment, all data is persisted and synchronous mode is enabled.
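
A rough sketch of the write path this policy implies; the 1 GB threshold, capacity, and names below are assumptions for illustration only:

```python
# Sketch of the storage policy above: large files bypass memory and are
# persisted synchronously, everything else stays in memory with LRU eviction.

from collections import OrderedDict

LARGE_FILE_THRESHOLD = 1 << 30          # e.g. 1 GB; assumed, the real threshold is configurable

class MemoryStore:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()      # insertion order drives LRU eviction

    def write(self, path, size, persist_sync):
        if size > LARGE_FILE_THRESHOLD:
            persist_sync(path)          # bypass memory, write straight to persistent storage
            return
        while self.used + size > self.capacity and self.files:
            _, victim_size = self.files.popitem(last=False)   # evict least recently added file
            self.used -= victim_size
        self.files[path] = size
        self.used += size

store = MemoryStore(capacity_bytes=4 << 30)
store.write("/tmp/small", 100 << 20, persist_sync=lambda p: print("sync persist", p))
store.write("/tmp/huge", 2 << 30, persist_sync=lambda p: print("sync persist", p))
```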

Checkpointing

  • Key Goals: Checkpointing should be asynchronous so that in-memory writes are not stalled, frequent enough that recomputation time is bounded, and intelligent enough that hot files are checkpointed while unnecessary temporary files are not.
  • Edge Algorithm:
    • The execution is typically modeled as a DAG. Each vertex in the DAG represents a data file.
    • The algorithm checkpoints the leaf vertices (the "edge") of the DAG. Hot files are identified as vertices with degree greater than two, and the Edge algorithm preferentially checkpoints them, as illustrated in the sketch below.
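
The following sketch illustrates the checkpointing preference described above - leaves of the lineage DAG, with hot (high-degree) files first. It is a simplified illustration, not the paper's exact algorithm:

```python
# Simplified illustration of the Edge checkpointing preference: hot files
# (degree > 2 in the lineage DAG) first, then the current leaf vertices.

def edge_checkpoint_candidates(dag, hot_degree=2):
    """dag maps a file to the set of files derived from it.
    Returns files to checkpoint, hot files ahead of ordinary leaves."""
    hot = [f for f, children in dag.items() if len(children) > hot_degree]
    leaves = [f for f, children in dag.items() if not children]
    ordered = []
    for f in hot + leaves:          # de-duplicate while preserving preference order
        if f not in ordered:
            ordered.append(f)
    return ordered

dag = {
    "raw":    {"clean", "sample", "stats"},   # consumed by three jobs -> hot file
    "clean":  {"report"},
    "sample": set(),                          # leaf
    "stats":  set(),                          # leaf
    "report": set(),                          # leaf
}
print(edge_checkpoint_candidates(dag))        # ['raw', 'sample', 'stats', 'report']
```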

Resource Allocation

  • Key Goals: Recomputations should be priority-aware, resource-aware and dependency-aware.
  • Recomputation Resource Allocation strategy:
    • Priority Based: A recomputation inherits the priority of the job requesting the lost file, which avoids priority inversion.
    • Fair Sharing Based: The requesting job's resource share is split between the job itself and the recomputation job in a fixed proportion (a combined sketch follows below).
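
A small sketch of both ideas together; the field names and the fixed recomputation share below are assumptions for illustration:

```python
# Sketch of priority inheritance and fixed-proportion fair sharing for
# recomputation jobs. Names and the 20% split are illustrative assumptions.

RECOMP_SHARE = 0.2   # assumed fraction of the requesting job's share given to recomputation

def schedule_recomputation(requesting_job, lost_file):
    """Build a recomputation job whose priority and resource share are
    derived from the job that needs the lost file."""
    return {
        "task": f"recompute {lost_file}",
        # Priority based: inherit the requester's priority to avoid priority inversion.
        "priority": requesting_job["priority"],
        # Fair sharing based: carve a fixed proportion out of the requester's share.
        "share": requesting_job["share"] * RECOMP_SHARE,
    }

job = {"name": "report", "priority": 7, "share": 0.5}
print(schedule_recomputation(job, "/logs/events-by-user"))
```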