Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
20 Oct 2015. Takeaways from Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, Li et al., SoCC 2014.
Tachyon is an in-memory storage system that achieves high-throughput writes and reads. It uses persistently stored lineage, rather than data replication, for fault tolerance.
Why Tachyon?
- Existing fault-tolerant distributed filesystems suffer from slow synchronous writes: they rely on replication, persistent storage, or both for fault tolerance, and the accompanying network and disk I/O overheads make writes slow.
Architecture
Internally Tachyon consists of two layers:
- Lineage
- This layer tracks the sequence of jobs and the files each job consumes and generates; this history is the lineage.
- Lineage helps recompute lost data.
- Persistence
- This layer is responsible for asynchronous checkpointing to persistent storage and for compatibility with existing storage systems.
Tachyon uses a master-slave architecture similar to HDFS. The master is the workflow manager; the slaves are Tachyon worker daemons. The work of the two layers above is split across the master and the workers.
- Master / Workflow Manager
- Tracks metadata (similar to HDFS).
- Tracks lineage information and computes checkpoint preference.
- Interacts with cluster resource manager to allocate resources for recomputation.
- Passive standby masters ensure master fault-tolerance.
- Workers
- Manage local resources, carry out directives from the master, and send status updates back to it (see the worker-loop sketch below).
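A minimal sketch of the worker side of this split, assuming an in-memory toy master: the worker owns its local memory, reports status on each heartbeat, and applies whatever directives the master returns. All class and method names here (Master, WorkerDaemon, Heartbeat) are illustrative, not Tachyon's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Status a worker reports to the master (illustrative fields)."""
    worker_id: str
    used_bytes: int
    stored_files: list


class Master:
    """Stand-in for the workflow manager: records worker state and
    replies with directives such as ('free', file_id)."""

    def __init__(self):
        self.worker_state = {}

    def heartbeat(self, hb: Heartbeat):
        self.worker_state[hb.worker_id] = hb
        return []   # no directives in this toy example


class WorkerDaemon:
    """Hypothetical Tachyon worker: owns local memory, obeys master directives."""

    def __init__(self, worker_id, master):
        self.worker_id = worker_id
        self.master = master
        self.memory = {}   # file_id -> bytes held in local RAM

    def run_once(self):
        # Report local state so the master can track metadata, then apply
        # whatever directives the master sends back (e.g. free a file).
        hb = Heartbeat(
            worker_id=self.worker_id,
            used_bytes=sum(len(b) for b in self.memory.values()),
            stored_files=list(self.memory),
        )
        for cmd, file_id in self.master.heartbeat(hb):
            if cmd == "free":
                self.memory.pop(file_id, None)


master = Master()
worker = WorkerDaemon("worker-1", master)
worker.memory["part-0"] = b"hello"
worker.run_once()
print(master.worker_state["worker-1"].used_bytes)   # 5
```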
Details
Lineage
- Lineage has a well-defined API for creation and retrieval. The API is exposed to the execution framework, so application programmers can use Tachyon transparently.
- Lineage storage overhead is reduced by deleting lineage entries that precede the most recent checkpoint and by avoiding duplicate copies of binary programs.
- Specifically, a lineage entry consists of an ordered list of input and output files, the binary program and configuration needed for recomputation, and the input-output dependency type (see the sketch after this list).
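Based purely on the description above, a lineage entry and its submit/retrieve operations could be sketched roughly as follows; the names (Lineage, LineageStore, DependencyType) and field layout are assumptions for illustration, not Tachyon's real API.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyType(Enum):
    NARROW = "narrow"   # each output depends on a subset of inputs
    WIDE = "wide"       # each output may depend on all inputs


@dataclass
class Lineage:
    """One lineage entry as described above (illustrative field names)."""
    input_files: list
    output_files: list
    binary_program: str        # path/identifier of the program to rerun
    configuration: dict        # arguments/config needed for recomputation
    dependency_type: DependencyType


class LineageStore:
    """Hypothetical master-side store keyed by output file."""

    def __init__(self):
        self._by_output = {}

    def submit(self, lineage: Lineage):
        for f in lineage.output_files:
            self._by_output[f] = lineage

    def for_lost_file(self, file_id) -> Lineage:
        """Retrieve the lineage needed to recompute a lost file."""
        return self._by_output[file_id]

    def prune_before_checkpoint(self, checkpointed_files: set):
        """Drop entries whose outputs are already checkpointed,
        reducing lineage storage overhead."""
        self._by_output = {
            f: lin for f, lin in self._by_output.items()
            if f not in checkpointed_files
        }


store = LineageStore()
store.submit(Lineage(
    input_files=["raw-logs"],
    output_files=["parsed-logs"],
    binary_program="parse_job.jar",
    configuration={"partitions": 8},
    dependency_type=DependencyType.WIDE,
))
print(store.for_lost_file("parsed-logs").binary_program)   # parse_job.jar
```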
Storage policy
- Workloads' working sets are preferentially kept in memory.
- To handle data that does not fit in memory, files above a size threshold are written synchronously and directly to persistent storage; otherwise an extensible eviction policy (LRU by default) frees memory.
- Before a planned change to the cluster environment, all data is persisted and writes switch to synchronous mode (sketched below).
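A minimal sketch of this write path, assuming a dict-like persistent store and an arbitrary size threshold: small files stay in RAM with LRU eviction, while oversized files (or all files in synchronous mode) go straight to persistent storage. Names and the threshold value are illustrative, not taken from the paper.

```python
from collections import OrderedDict

LARGE_FILE_THRESHOLD = 512 * 1024 * 1024   # assumed threshold, not the paper's value


class MemoryStore:
    """Sketch of the storage policy above: small files go to RAM
    (with LRU eviction), oversized files go straight to persistent storage."""

    def __init__(self, capacity_bytes, persistent_store):
        self.capacity = capacity_bytes
        self.used = 0
        self.lru = OrderedDict()            # file_id -> data, oldest first
        self.persistent = persistent_store  # dict-like stand-in for HDFS/S3

    def write(self, file_id, data, synchronous_mode=False):
        if len(data) > LARGE_FILE_THRESHOLD or synchronous_mode:
            # Large files (or a pending cluster change) bypass memory and are
            # written synchronously to the persistent layer.
            self.persistent[file_id] = data
            return
        while self.used + len(data) > self.capacity and self.lru:
            # Evict least-recently-used files until the new file fits;
            # in Tachyon evicted data stays recoverable via lineage/checkpoints.
            old_id, old_data = self.lru.popitem(last=False)
            self.used -= len(old_data)
        self.lru[file_id] = data
        self.used += len(data)

    def read(self, file_id):
        if file_id in self.lru:
            self.lru.move_to_end(file_id)   # mark as recently used
            return self.lru[file_id]
        return self.persistent[file_id]     # fall back to persistent storage


store = MemoryStore(capacity_bytes=10, persistent_store={})
store.write("a", b"12345678")
store.write("b", b"123456")                 # evicts "a" to make room
print("a" in store.lru, "b" in store.lru)   # False True
```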
Checkpointing
- Key Goals: Asynchronous, so in-memory writes are not stalled. Frequent enough that recomputation time is bounded. Intelligent enough that hot files are checkpointed and unnecessary temporary files are not.
- Edge Algorithm:
- Execution is modeled as a DAG in which each vertex represents a data file.
- The algorithm checkpoints the leaf vertices of the DAG. Hot files are identified as vertices with degree greater than two, and the Edge algorithm preferentially checkpoints them (see the sketch after this list).
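A rough sketch of the selection step described above, over a lineage DAG mapping each file to the files derived from it; the degree-based hotness test mirrors the summary here, and the function name is made up for illustration.

```python
def edge_checkpoint_candidates(dag, hot_degree=2):
    """Pick files to checkpoint: leaves of the lineage DAG first,
    plus hot files (vertices whose out-degree exceeds `hot_degree`)."""
    leaves = [f for f, children in dag.items() if not children]
    hot = [f for f, children in dag.items() if len(children) > hot_degree]
    # Deduplicate while keeping leaves first.
    return list(dict.fromkeys(leaves + hot))


# Example lineage DAG: a -> b, a -> c, a -> d, b -> e
dag = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
print(edge_checkpoint_candidates(dag))   # ['c', 'd', 'e', 'a']
```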
Resource Allocation
- Key Goals: Recomputations should be priority-aware, resource-aware and dependency-aware.
- Recomputation Resource Allocation strategy:
- Priority Based: The recomputation inherits the priority of the job requesting the lost file, to avoid priority inversion.
- Fair Sharing Based: The requesting job's share is split between the job itself and the recomputation job in a fixed proportion (sketched below).
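A small sketch combining both strategies, assuming a toy Job record: the recomputation job inherits the requester's priority, and the requester's fair share is split with it in a fixed proportion. The 50/50 split and all names are assumptions, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # higher number = higher priority
    fair_share: float    # fraction of the cluster under fair sharing


def schedule_recomputation(requesting_job: Job, lost_file: str,
                           recompute_fraction: float = 0.5):
    """Create a recomputation job that inherits the requester's priority
    (avoiding priority inversion) and takes a fixed fraction of its share."""
    recompute_job = Job(
        name=f"recompute:{lost_file}",
        priority=requesting_job.priority,                     # priority inheritance
        fair_share=requesting_job.fair_share * recompute_fraction,
    )
    requesting_job.fair_share *= (1 - recompute_fraction)     # remainder stays with requester
    return recompute_job


# Example: a high-priority job requests a lost file.
consumer = Job("query-17", priority=10, fair_share=0.2)
print(schedule_recomputation(consumer, "part-00042"))
```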