Operational Guide: STCS vs LCS vs TWCS in Production Cassandra
Compaction strategy selection is a foundational infrastructure decision that dictates I/O profiles, repair windows, disk provisioning, and node stability under failure conditions. In Cassandra 4.x and 5.x deployments, SizeTieredCompactionStrategy (STCS) is still the default for newly created tables, but LeveledCompactionStrategy (LCS) is the usual choice for read-heavy OLTP workloads, while TimeWindowCompactionStrategy (TWCS) dominates time-series and TTL-driven architectures. STCS is best reserved for append-heavy, low-read datasets where tombstone accumulation is strictly governed. Ground your selection in Cassandra Architecture & Compaction Fundamentals before altering production keyspaces, because compaction directly governs how SSTables are merged, how disk space is reclaimed, and how the cluster handles backpressure during peak ingestion.
Strategy Mechanics & I/O Trade-offs
All three strategies operate on the same underlying LSM Tree Mechanics in Cassandra, but they diverge sharply in merge scheduling, disk footprint, and CPU utilization:
- STCS: Groups similarly sized SSTables into tiers and compacts a tier once it holds between min_threshold (default 4) and max_threshold (default 32) SSTables. It delivers the lowest write amplification of the three strategies but suffers from high read amplification because a partition can be spread across many SSTables. Tombstone-heavy delete patterns trigger severe compaction stalls, requiring aggressive tombstone_compaction_interval tuning and strict gc_grace_seconds alignment.
- LCS: Organizes data into leveled buckets (L0 through Ln) with uniform SSTable sizes. Read latency is highly predictable because queries typically touch only one or two levels. Write amplification is significantly higher (~10x). Note that the “10x” here is the level-size fan-out ratio between adjacent levels, not a free-disk requirement; transient compaction overhead is realistically closer to 1.5–2.5x the working set. Ideal for point reads and mixed OLTP workloads.
- TWCS: Merges data based on fixed time windows (e.g., 1 day, 1 week, 1 month). Optimized for TTL-driven ingestion, it prevents tombstone sprawl by dropping expired windows entirely. Requires strict alignment between compaction_window_size, gc_grace_seconds, and business retention policies. In Cassandra 4.0+, TWCS includes improved backpressure handling and window boundary enforcement.
The following decision tree maps a workload profile onto the strategy that best fits it.
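Expressed as code, that mapping might look like the sketch below. The profile fields, the order of the checks, and the choice to send delete-heavy workloads to LCS are illustrative assumptions drawn from the strategy descriptions above, not rules enforced by Cassandra.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    # Hypothetical descriptors, used for illustration only.
    time_series: bool   # data arrives in time order
    uses_ttl: bool      # rows expire through TTL rather than explicit deletes
    read_heavy: bool    # point reads or mixed OLTP traffic dominate
    delete_heavy: bool  # explicit deletes generate many tombstones


def choose_strategy(profile: WorkloadProfile) -> str:
    """Map a workload profile to a compaction strategy class name."""
    # TTL-driven time-series data: whole expired windows can be dropped.
    if profile.time_series and profile.uses_ttl:
        return "TimeWindowCompactionStrategy"
    # Point reads and mixed OLTP: predictable read latency matters most.
    if profile.read_heavy:
        return "LeveledCompactionStrategy"
    # Assumption: tombstone-heavy deletes stall STCS, so prefer LCS here.
    if profile.delete_heavy:
        return "LeveledCompactionStrategy"
    # Append-heavy, low-read data with few tombstones.
    return "SizeTieredCompactionStrategy"
```

Treat the result as a starting point for review rather than a final answer; real workloads often mix these traits and need measurement before a switch.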
Repair Cadence & Anti-Entropy Alignment
Compaction strategy dictates your repair cadence. It is critical to distinguish Read Repair vs Anti-Entropy Repair: the probabilistic read_repair_chance/dclocal_read_repair_chance table options were removed in Cassandra 4.0, but blocking read repair still runs on digest mismatch for reads at CL > ONE (the per-table read_repair option defaults to 'BLOCKING'). Scheduled anti-entropy repair remains the primary mechanism for durable cross-node consistency. Anti-entropy must never overlap with peak compaction windows, as both compete aggressively for disk I/O, CPU, and memory.
- STCS: Requires frequent nodetool repair -pr due to fragmented SSTables and high read amplification. Incremental repair (the default in 4.x+) keeps overhead bounded; full repairs (-full) will saturate I/O and trigger compaction backpressure.
- LCS: Benefits from incremental repair because leveled buckets maintain consistent data boundaries across the ring. Repair ranges align cleanly with vnode ownership, reducing duplicate streaming.
- TWCS: Pairs naturally with time-bound repair windows. Expired windows automatically purge stale data and tombstones, shrinking the repair payload. Schedule repairs immediately after window boundaries to minimize overlap with active compaction.
When coordinating repairs across a live ring, remember that Data Partitioning & Token Ring Basics govern how repair ranges map to vnode ownership. Misaligned ranges cause duplicate compaction work, inflate repair duration, and can trigger unnecessary streaming storms. Always use the -pr (primary range) flag in production; incremental repair is already the default, so no separate flag is required.
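For TWCS tables, the advice to start repairs just after a window boundary can be turned into a small scheduling helper. This is a minimal sketch that assumes windows are aligned to the Unix epoch in UTC and that a fixed settle delay (a hypothetical parameter) is enough for the closing window's compaction to finish; verify both against your own cluster.

```python
from datetime import datetime, timedelta, timezone

# Seconds per unit for the window units handled by this sketch.
UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def next_window_boundary(now: datetime, window_size: int, window_unit: str) -> datetime:
    """Return the first epoch-aligned window boundary strictly after `now`.

    `now` should be timezone-aware; naive datetimes are read as local time.
    """
    span = UNIT_SECONDS[window_unit] * window_size
    ts = int(now.timestamp())
    lower = ts - (ts % span)  # start of the window that contains `now`
    return datetime.fromtimestamp(lower + span, tz=timezone.utc)


def repair_start(now: datetime, window_size: int, window_unit: str,
                 settle: timedelta = timedelta(minutes=30)) -> datetime:
    """Boundary plus a settle delay so repair avoids the closing compaction."""
    return next_window_boundary(now, window_size, window_unit) + settle
```

A scheduler could call repair_start with the table's compaction_window_size and compaction_window_unit values and sleep until the returned time before launching the repair.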
Python Automation & Operational Workflows
Modern SRE teams automate compaction monitoring, repair scheduling, and backpressure mitigation using Python and JMX. Below is a validated workflow for Cassandra 4.x/5.x environments:
- Monitor Compaction Queue & Backpressure: Query JMX via requests (through an HTTP-to-JMX bridge) or py4j. Track the PendingTasks and CompletedTasks metrics under org.apache.cassandra.metrics:type=Compaction. If PendingTasks exceeds 2x the number of CPU cores, throttle throughput via nodetool setcompactionthroughput <mb/s> or adjust compaction_throughput_mb_per_sec in cassandra.yaml (renamed compaction_throughput in 4.1+).
- Automate Incremental Repair: Launch nodetool repair -pr through direct subprocess calls (incremental is the default) and use the cassandra-driver for follow-up CQL checks. Wrap execution in a retry loop with exponential backoff. Validate repair success via the process exit code and by querying system_distributed.repair_history for the session outcome (compaction_history does not track repair states).
- Tombstone & GC Monitoring: Track the TombstoneScannedHistogram and LiveScannedHistogram metrics under org.apache.cassandra.metrics:type=Table. If the ratio of tombstones to cells scanned exceeds 0.5, trigger nodetool compact on affected tables and verify gc_grace_seconds alignment. For programmatic parsing, request structured output from nodetool tablestats with -F json or -F yaml, or query the system_views virtual tables via CQL.
- JMX Authentication & Security: Cassandra binds JMX to localhost by default; once remote JMX is enabled (LOCAL_JMX=no in cassandra-env.sh), authentication must be configured. Ensure your automation uses jmxremote.password or certificate-based auth. The official Apache Cassandra Documentation on Compaction provides updated JMX metric paths and security configurations.
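The thresholds and the retry loop from the list above can be sketched as three small helpers. The function names are hypothetical, the metric values are expected to come from whatever JMX client you already use, and the tombstone ratio is assumed to mean tombstones divided by all cells scanned, which is one reasonable reading of the 0.5 rule.

```python
import os
import time
from typing import Callable, Optional


def should_throttle(pending_tasks: int, cpu_cores: Optional[int] = None) -> bool:
    """True when pending compactions exceed 2x the CPU core count."""
    cores = cpu_cores or os.cpu_count() or 1
    return pending_tasks > 2 * cores


def tombstone_ratio(tombstones_scanned: float, live_scanned: float) -> float:
    """Share of scanned cells that were tombstones (assumed definition)."""
    total = tombstones_scanned + live_scanned
    return tombstones_scanned / total if total else 0.0


def retry_with_backoff(action: Callable[[], bool], attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `action` until it returns a truthy value, doubling the wait each time."""
    for attempt in range(attempts):
        if action():
            return True
        if attempt < attempts - 1:
            # Waits are base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return False
```

The `action` callable would typically wrap a single repair invocation and return True only when it reports success; passing `sleep` as a parameter keeps the helper easy to test without real delays.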
import subprocess


def run_incremental_repair(node_ip, keyspace, table):
    """Run a primary-range repair on one node and report the outcome."""
    # Incremental repair is the default in 4.x+; -pr limits to primary ranges.
    # Do not combine -pr with -local (they are mutually exclusive).
    cmd = [
        "nodetool", "-h", node_ip, "repair",
        "-pr", keyspace, table,
    ]
    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return {"status": "success", "output": result.stdout}
    except subprocess.CalledProcessError as e:
        return {"status": "failed", "error": e.stderr}

Node Lifecycle, Gossip & Failure Handling
During node decommissioning, replacement, or bootstrap, streaming data merges into existing SSTables, causing compaction queues to spike. Monitor nodetool compactionstats and nodetool netstats concurrently. If Node Gossip & Failure Detection Protocols trigger false positives during heavy compaction/repair, the cluster may mark healthy nodes as DOWN, triggering unnecessary rebuilds.
To mitigate cascading failures:
- Increase phi_convict_threshold temporarily (e.g., from 8 to 12) during major repair windows.
- Throttle streaming with stream_throughput_outbound_megabits_per_sec and inter_dc_stream_throughput_outbound_megabits_per_sec.
- Ensure Tombstone Management & Garbage Collection is optimized before decommissioning. High tombstone density during streaming inflates network payload and extends nodetool decommission duration.
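Streaming limits can also be adjusted at runtime instead of editing cassandra.yaml. The sketch below builds the corresponding nodetool setstreamthroughput and setinterdcstreamthroughput invocations; the helper names and example values are hypothetical, and the numbers are passed through unchanged, so confirm the unit your nodetool version expects. phi_convict_threshold is not covered here because it is set in cassandra.yaml.

```python
import subprocess
from typing import Callable, List


def streaming_throttle_commands(node_ip: str, stream_limit: int,
                                inter_dc_limit: int) -> List[List[str]]:
    """Build nodetool argv lists that cap streaming throughput on one node."""
    return [
        ["nodetool", "-h", node_ip, "setstreamthroughput", str(stream_limit)],
        ["nodetool", "-h", node_ip, "setinterdcstreamthroughput", str(inter_dc_limit)],
    ]


def apply_commands(commands: List[List[str]],
                   runner: Callable = subprocess.run) -> None:
    """Run each command in order; a non-zero exit raises CalledProcessError."""
    for argv in commands:
        runner(argv, check=True)
```

Keeping command construction separate from execution makes it possible to log or review the exact invocations before a decommission starts.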
Multi-DC Consistency & Cross-Cluster Dynamics
Consistency Level Selection for Multi-DC Deployments directly impacts repair scope and compaction pressure. Using QUORUM across data centers pulls remote replicas into each request, so digest mismatches trigger cross-DC blocking read repair and add inter-DC I/O on top of scheduled anti-entropy repair. Prefer LOCAL_QUORUM for application reads and schedule cross-DC repairs during off-peak windows.
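The difference between the two levels is easy to see by working out replica counts. This sketch assumes a NetworkTopologyStrategy-style map of data center name to replication factor; the data center names are placeholders.

```python
from typing import Dict


def quorum(replication: Dict[str, int]) -> int:
    """Replicas required for QUORUM: a majority of all replicas in all DCs."""
    return sum(replication.values()) // 2 + 1


def local_quorum(replication: Dict[str, int], local_dc: str) -> int:
    """Replicas required for LOCAL_QUORUM: a majority within the local DC."""
    return replication[local_dc] // 2 + 1


def needs_remote_dc(replication: Dict[str, int], local_dc: str) -> bool:
    """True when QUORUM cannot be met by local replicas alone."""
    return quorum(replication) > replication[local_dc]
```

With a replication factor of 3 in each of two data centers, QUORUM needs 4 replicas and must therefore cross the WAN, while LOCAL_QUORUM needs only 2 local replicas.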
Cross-Cluster Replication & Conflict Resolution interacts heavily with compaction. When using LWT (Lightweight Transactions) or timestamp-based conflict resolution, tombstones propagate across clusters. TWCS handles this efficiently by isolating time-bound tombstones, while LCS may merge them into lower levels, increasing read amplification. Align compaction_window_size with replication lag tolerances to prevent stale data resurrection during network partitions.
Strategy Migration & Operational Validation
Migrating compaction strategies requires careful planning. Cassandra 4.x/5.x supports online strategy changes via ALTER TABLE, but background compaction will run concurrently, temporarily spiking I/O. For a comprehensive operational migration path, consult the Step-by-Step Guide to Switching from STCS to LCS. Key validation steps include:
- Verify free disk space provides sufficient headroom (roughly 1.5–2.5x the working set) before switching to LCS; the “10x” associated with LCS is the level fan-out ratio, not a disk-headroom requirement.
- Monitor nodetool compactionstats for pending tasks post-migration.
- Run nodetool repair -pr immediately after the strategy change to align leveled boundaries (incremental repair is the default).
- Validate Python automation scripts against updated JMX metric paths.
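Two of these checks lend themselves to a short pre-flight script: confirming disk headroom and generating the ALTER TABLE statement. The helper names are hypothetical, and the headroom test takes the conservative reading that free space alone should cover the chosen multiple of the working set; relax it if your provisioning policy measures headroom differently.

```python
import re

# Plain unquoted CQL identifiers only; anything else is rejected.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def has_headroom(free_bytes: int, working_set_bytes: int, factor: float = 1.5) -> bool:
    """Assumed rule: free space must be at least `factor` times the working set."""
    return free_bytes >= factor * working_set_bytes


def alter_compaction_cql(keyspace: str, table: str, strategy: str, **options) -> str:
    """Build an ALTER TABLE statement that switches the compaction strategy."""
    for name in (keyspace, table):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    settings = {"class": strategy}
    settings.update({key: str(value) for key, value in options.items()})
    body = ", ".join(f"'{key}': '{value}'" for key, value in settings.items())
    # Doubled braces emit literal { and } around the option map.
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{body}}};"
```

For example, alter_compaction_cql("ks", "events", "LeveledCompactionStrategy", sstable_size_in_mb=160) yields a statement that can be executed through cqlsh or a driver session once the headroom check passes.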
For Python-based cluster management, the DataStax Python Driver provides robust schema management and schema agreement polling, ensuring strategy changes propagate safely before initiating background compaction.
Compaction strategy is not a set-and-forget configuration. It is a dynamic operational lever that must align with workload patterns, repair cadence, disk provisioning, and failure tolerance. By grounding automation in Cassandra 4.x/5.x behavior, enforcing anti-entropy alignment, and monitoring I/O backpressure, SRE teams can keep read latencies low and predictable while preventing disk exhaustion and cascading node failures.