Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

20 Oct 2015

Takeaways from Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, Li et al, SoCC, 2015.

Tachyon is an in-memory storage system that achieves high throughput writes and reads. It uses persistently-stored lineage for fault tolerance.

Why Tachyon?

Existing fault tolerant distributed filesystems suffered from synchronous slow writes. These filesystems either relied on replication or persistent storage or both for fault tolerance. Accompanying network or disk I/O overheads meant slow write.

Architecture

Internally Tachyon consists of two layers:

Lineage
- This layer tracks the sequence of jobs and the corresponding file consumption and generation - which is also called lineage.
- Lineage helps recompute lost data.
Persistence
- This layer is responsible for asynchronous checkpointing and backward-compatibility.

Tachyon’s architecture is based on master-slave architecture similar to HDFS. The master is called workflow-manager, the slaves are tachyon worker daemons. The work of the above two layers is spread across the master and slaves.

Master / Workflow Manager
- Tracks metadata (similar to HDFS).
- Tracks lineage information and computes checkpoint preference.
- Interacts with cluster resource manager to allocate resources for recomputation.
- Passive standby masters ensure master fault-tolerance.
Workers
- Manage local resources and directives of the master and sends status updates to the master.

Details

Lineage

Has a well-defined API for creation and retrieval of Lineage. This API is exposed to the execution framework, so that application programmers can transparently use Tachyon.
Lineage storage overheads are reduced by deleting lineages preceding most recent checkpoint and avoiding duplicated binary program storage.
Specifically, the lineage consists of an ordered list of files (both input, output), the binary program and the associated configuration to be used for recomputation, and the input-output dependency type.

Storage policy

Workloads working sets are preferentially stored in-memory.
To take care of spilling, large files above a threshold are synchronously written directly to persistent storage. Also an extensible LRU data-eviction policy is used.
Before a cluster environment change, all data is persisted and synchronous mode is enabled.

Checkpointing.

Key Goals: Asynchronous so that in-memory writes are not stalled. Frequent enough so that recomputation time is bounded. Intelligent enough so that hot-files are checkpointed and unnecessary temporary files are not.
Edge Algorithm:
- The execution is typically modeled as a DAG. Each vertex in the DAG represents a data file.
- The algorithm checkpoints the leaf vertices of the DAG. Hot-files are identified as vertices with degree greater than two. Edge algorithm preferentially checkpoints hot-files.

Resource Allocation

Key Goals: Recomputations should be priority-aware, resource-aware and dependency-aware.
Recomputation Resource Allocation strategy:
- Priority Based: Recomputation inherits priority of the job requesting the file to be recomputed to avoid priority inversion.
- Fair Sharing Based: Requesting job’s share is shared by the actual job and the recomputation job in a fixed proportion.

kshiteej CV Posts Contact

Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks

Related Posts

Tracking my runs 29 Dec 2015

BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data 23 Nov 2015

Project Adam: Building an Efficient and Scalable Deep Learning Training System. 20 Nov 2015