Replication vs. Erasure Coding

In eEKAS, Cluster Drives use Ceph’s data redundancy mechanisms to protect against hardware failures and ensure data availability. The two primary methods are Replication and Erasure Coding (EC). While both serve the same purpose—preventing data loss—they do so in very different ways, each with its own strengths, trade-offs, and ideal use cases.

Ceph Replication

How Ceph Replication works

Replication stores multiple identical copies of each piece of data across different drives and nodes. If one copy is lost due to a drive or node failure, the system immediately serves the data from another copy.
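
The failover behavior described above can be illustrated with a small Python sketch. It is a toy model, not Ceph code: the node names and the helper functions place_replicas and read_object are hypothetical, and placement is a simple random choice of distinct nodes, whereas Ceph itself decides placement with its CRUSH algorithm.

```python
import random

def place_replicas(nodes, copies, rng):
    """Pick `copies` distinct nodes to hold one object (toy placement, not CRUSH)."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    return rng.sample(nodes, copies)

def read_object(replica_nodes, failed_nodes):
    """Return the first surviving node that holds a copy, or None if every copy is lost."""
    for node in replica_nodes:
        if node not in failed_nodes:
            return node
    return None

# Hypothetical four-node cluster with 3x replication.
nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas = place_replicas(nodes, copies=3, rng=random.Random(0))

# Two of the three holders fail: the read is served from the remaining copy.
survivor = read_object(replicas, failed_nodes=set(replicas[:2]))
print("served from:", survivor)

# All three holders fail: no copy is left to serve.
print("served from:", read_object(replicas, failed_nodes=set(replicas)))
```

No data is reconstructed on the read path in this model: any surviving copy is a complete copy, which is why a replicated read can continue as soon as a failure is detected.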

Example configurations:

2× Replication – Two copies of each object are stored. Can tolerate the loss of one drive/node.
3× Replication – Three copies of each object are stored. Can tolerate the loss of two drives/nodes.

Advantages:

Fast recovery – No need to reconstruct data; another copy is instantly available.

Low CPU overhead – Minimal computation required.

Best performance – Particularly for workloads with high IOPS.

Trade-offs:

Higher storage usage – 3× replication uses 3 TB of raw storage for 1 TB of usable capacity.

Typical use cases:

High-performance block storage (iSCSI, NVMe-oF)
Virtual machine storage requiring low latency
Frequently updated databases

Erasure Coding

How Ceph Erasure Coding works

Erasure Coding splits data into a set number of data chunks and parity chunks, storing them across multiple drives and nodes. If one or more chunks are lost, the system uses the remaining chunks and parity to reconstruct the data.
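
To make the idea of reconstruction from parity concrete, here is a deliberately simplified Python sketch that uses a single XOR parity chunk, so it can rebuild only one lost chunk. This is an assumption made for readability: Ceph's erasure-code plugins use more general codes (such as Reed-Solomon) that support several parity chunks. The function names encode and reconstruct are hypothetical.

```python
from functools import reduce

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal-size chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]

def reconstruct(chunks):
    """Rebuild the single missing chunk (marked None) by XOR-ing the survivors."""
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing chunk")
    survivors = [chunk for chunk in chunks if chunk is not None]
    rebuilt = list(chunks)
    rebuilt[missing[0]] = reduce(xor_bytes, survivors)
    return rebuilt

original = b"cluster drive payload"
stored = encode(original, k=4)      # 4 data chunks + 1 parity chunk

damaged = list(stored)
damaged[1] = None                   # simulate losing the drive holding chunk 1

repaired = reconstruct(damaged)
recovered = b"".join(repaired[:4])[:len(original)]
print(recovered == original)        # True
```

Unlike replication, the lost chunk has to be computed from all the surviving chunks, which is where the extra CPU and network cost of Erasure Coding comes from.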

Example configurations:

4+2 EC – Data is split into 4 data chunks plus 2 parity chunks. Can tolerate the loss of 2 drives/nodes.
5+2 EC – Data is split into 5 data chunks plus 2 parity chunks. Can tolerate the loss of 2 drives/nodes.
8+3 EC – Data is split into 8 data chunks plus 3 parity chunks. Can tolerate the loss of 3 drives/nodes.

Advantages:

High storage efficiency – 5+2 EC uses 7 TB of raw storage for 5 TB of usable capacity.
Flexible redundancy levels – Can optimize for the desired balance between efficiency and fault tolerance.
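
The capacity figures quoted in this article follow from simple arithmetic: a layout with k data parts and n total parts needs n/k units of raw storage per unit of usable data and tolerates n - k losses. The Python sketch below applies that arithmetic to the example configurations listed above. The helper names are hypothetical, and the numbers ignore any metadata or filesystem overhead.

```python
def replication(copies):
    """Describe N-copy replication as (label, data_parts, total_parts)."""
    return (f"{copies}x replication", 1, copies)

def erasure_coding(k, m):
    """Describe a k+m erasure-coded layout as (label, data_parts, total_parts)."""
    return (f"{k}+{m} EC", k, k + m)

def summarize(layout, usable_tb):
    """Compute fault tolerance and raw capacity for a given usable capacity."""
    label, data_parts, total_parts = layout
    return {
        "layout": label,
        "tolerated_losses": total_parts - data_parts,
        "raw_tb": usable_tb * total_parts / data_parts,
        "efficiency": data_parts / total_parts,
    }

layouts = [
    replication(2),
    replication(3),
    erasure_coding(4, 2),
    erasure_coding(5, 2),
    erasure_coding(8, 3),
]

for layout in layouts:
    s = summarize(layout, usable_tb=10)
    print(
        f"{s['layout']:<16} tolerates {s['tolerated_losses']} losses, "
        f"{s['raw_tb']:.1f} TB raw for 10 TB usable "
        f"({s['efficiency']:.0%} efficient)"
    )
```

Comparing 3x replication with 4+2 EC in this output shows the trade-off directly: both tolerate two losses, but the erasure-coded layout needs half the raw capacity for the same usable data.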

Trade-offs:

Higher CPU and network overhead – Requires computation to encode/decode data.
Slightly higher latency – Particularly for small write operations.

Typical use cases:

Object storage (S3) for large, infrequently modified files
Backup archives
Media repositories

Choosing the Right Method

Criteria	Replication	Erasure Coding
Performance	Highest	Moderate (depends on EC profile)
Storage Efficiency	Low	High
Recovery Speed	Immediate (a surviving copy is served)	Slower (requires reconstruction)
Best for	Databases, VMs, low-latency workloads	Object storage, archives, large datasets

Rule of thumb for Ceph

Use Replication for performance-critical workloads where speed and instant failover are more important than raw capacity efficiency.

Use Erasure Coding for large datasets where storage cost efficiency is important and access patterns are less latency-sensitive.