Understanding STCS vs LCS vs TWCS in Cassandra 4.x and 5.x

Picking a compaction strategy is one of the few table-level decisions that reaches every layer of a Cassandra deployment at once: it sets the read and write amplification a table pays, the disk headroom a node must keep free, how much a tombstone-heavy workload can stall, and how repair streaming lands back on disk. This page is for DBAs, distributed-systems engineers, and DevOps teams choosing between SizeTieredCompactionStrategy (STCS), LeveledCompactionStrategy (LCS), and TimeWindowCompactionStrategy (TWCS) — and, on Cassandra 5.0, deciding whether the new UnifiedCompactionStrategy (UCS) default replaces the choice entirely. It sits beneath the broader Cassandra architecture and compaction fundamentals guide, which frames how the storage engine and cluster fabric interact; here we drill into how the three classic strategies differ in mechanism, cost, and failure surface. Reach for it when a table’s read latency is climbing, when disk is filling faster than deletes reclaim it, or when you are about to ALTER TABLE ... WITH compaction on a live keyspace and want the trade-offs and safety gates first.

How the Three Strategies Arrange SSTables

All three strategies sit on top of the same LSM tree mechanics in Cassandra: writes land in a memtable, flush to immutable SSTables, and are reconciled by background compaction that merges overlapping SSTables and resolves versions by timestamp. What differs is which SSTables a strategy chooses to merge together, and that single choice drives everything downstream.

STCS (SizeTieredCompactionStrategy) buckets SSTables by size. When a size tier accumulates between min_threshold (default 4) and max_threshold (default 32) similarly sized SSTables, the strategy merges the whole bucket into one larger SSTable, which then belongs to a higher tier. Because it only ever merges when a tier fills, write amplification is low — most data is written to disk roughly once per tier promotion — but a single partition’s data can be scattered across many tiers, so a point read may have to touch several SSTables. STCS is the historical default and remains a good fit for write-heavy, append-mostly tables that are rarely read by primary key. Its weakness is space and reads: compacting the largest tier needs free space on the order of that tier’s total size, and a row overwritten repeatedly leaves obsolete fragments (and tombstones) spread across tiers until a large compaction sweeps them.
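The bucketing rule described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: it assumes the STCS defaults `bucket_low = 0.5` and `bucket_high = 1.5`, ignores the `min_sstable_size` floor, and the function names are hypothetical.

```python
def size_tier_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into tiers whose members sit within
    [bucket_low, bucket_high] times the running tier average."""
    buckets = []  # each entry: [average_size, [member sizes]]
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg, members = bucket
            if bucket_low * avg <= size <= bucket_high * avg:
                members.append(size)
                bucket[0] = sum(members) / len(members)  # update running average
                break
        else:
            # No existing tier is close enough in size: start a new one.
            buckets.append([size, [size]])
    return [members for _, members in buckets]


def compactable_tiers(sizes_mb, min_threshold=4, max_threshold=32):
    """Tiers with at least min_threshold members, capped at max_threshold each."""
    return [
        tier[:max_threshold]
        for tier in size_tier_buckets(sizes_mb)
        if len(tier) >= min_threshold
    ]
```

With sizes of 9, 10, 11, 12, 100, 110, and 500 MB, only the four small SSTables form a tier that reaches `min_threshold`; the two ~100 MB SSTables and the 500 MB SSTable wait until more similarly sized peers arrive.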

LCS (LeveledCompactionStrategy) enforces a strict, non-overlapping level hierarchy. New SSTables enter L0; from L1 upward, every SSTable is a fixed size (sstable_size_in_mb, default 160 MB) and, critically, the SSTables within a level never overlap in token range. Each level holds roughly fanout_size (default 10) times the data of the level below it. Because levels are non-overlapping, a point read touches at most one SSTable per level — typically one or two SSTables total — giving predictable, low read amplification. The cost is write amplification: promoting data upward rewrites overlapping ranges repeatedly. The often-quoted “10x” is the level fan-out ratio, not a free-disk requirement; the transient disk overhead of running LCS is realistically closer to 1.5–2.5x the working set. LCS is the right default for read-heavy and mixed OLTP tables where predictable latency matters more than write throughput.
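To make the level geometry concrete, the sketch below computes target level capacities from `sstable_size_in_mb` and `fanout_size`, and estimates how many levels a given data volume needs. It assumes level N targets `fanout ** N` SSTables of the fixed size and ignores L0; the function names are illustrative only.

```python
def lcs_level_capacity_mb(level, sstable_size_mb=160, fanout=10):
    """Target capacity of level `level` (>= 1) in MB: fanout**level SSTables."""
    if level < 1:
        raise ValueError("L0 has no fixed capacity; levels start at 1")
    return sstable_size_mb * fanout ** level


def lcs_levels_needed(data_mb, sstable_size_mb=160, fanout=10):
    """Smallest top level whose cumulative capacity (L1..Ln) holds data_mb."""
    level, cumulative = 0, 0
    while cumulative < data_mb:
        level += 1
        cumulative += lcs_level_capacity_mb(level, sstable_size_mb, fanout)
    return level
```

With the defaults, L1 targets 1,600 MB, L2 16,000 MB, and L3 160,000 MB, so roughly 500 GB of data per node settles into four levels. That is also the upper bound on the number of L1+ SSTables a point read has to consider.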

TWCS (TimeWindowCompactionStrategy) groups SSTables by the time window their data falls into, sized by compaction_window_unit and compaction_window_size (for example, one day). Inside the current, still-active window it compacts with STCS-like size tiering; once a window closes, its SSTables are compacted once into a single immutable SSTable and never touched again. When every row in a closed window passes its TTL and gc_grace_seconds, Cassandra drops the entire SSTable as a unit rather than rewriting it to purge tombstones. That makes TWCS the only strategy that reclaims expired time-series data without paying compaction to grind through tombstones — but it assumes writes arrive roughly in timestamp order and that you never delete or update old data out of band.
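The window assignment can be illustrated as flooring a timestamp to a multiple of the window length. The sketch below assumes windows are aligned to the Unix epoch and that timestamps are timezone-aware UTC values; `window_start` and `group_by_window` are hypothetical helper names, not Cassandra APIs.

```python
from datetime import datetime, timezone

_UNIT_SECONDS = {"MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def window_start(ts: datetime, unit: str = "DAYS", size: int = 1) -> datetime:
    """Floor a timezone-aware timestamp to the start of its compaction window."""
    span = _UNIT_SECONDS[unit] * size
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % span, tz=timezone.utc)


def group_by_window(max_timestamps, unit="DAYS", size=1):
    """Map each window start to the SSTable max-timestamps that fall in it."""
    groups = {}
    for ts in max_timestamps:
        groups.setdefault(window_start(ts, unit, size), []).append(ts)
    return groups
```

Two SSTables whose newest cells were written at 01:00 and 23:00 on the same UTC day share a one-day window and get compacted together once the window closes; an SSTable whose newest cell lands a minute after midnight belongs to the next window.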

Each strategy chooses which SSTables to merge: STCS promotes by size, LCS keeps non-overlapping fixed-size levels, and TWCS freezes closed time windows and drops expired ones whole.

The practical consequence is a division of labor: read-latency-sensitive tables want LCS, throughput-first append logs tolerate STCS, and anything TTL-driven and time-ordered belongs on TWCS. The decision tree below maps a workload profile onto the strategy that fits it.

Choosing a compaction strategy from workload characteristics — read-latency-sensitive tables land on LCS, TTL-driven time-series on TWCS, and write-first append logs on STCS.
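The same selection logic can be written as a small function. This is a sketch of the rule of thumb stated above; the three boolean inputs and the order in which they are checked are simplifying assumptions, and real workloads usually need measurement before a decision.

```python
def choose_strategy(
    ttl_driven: bool, time_ordered: bool, read_latency_sensitive: bool
) -> str:
    """Map a coarse workload profile onto a classic compaction strategy."""
    # TTL-bound data written in timestamp order can be dropped window by window.
    if ttl_driven and time_ordered:
        return "TimeWindowCompactionStrategy"
    # Predictable point-read latency is worth the extra write amplification.
    if read_latency_sensitive:
        return "LeveledCompactionStrategy"
    # Write-first, append-mostly tables tolerate scattered reads.
    return "SizeTieredCompactionStrategy"
```

For example, `choose_strategy(ttl_driven=True, time_ordered=False, read_latency_sensitive=False)` falls through to STCS, reflecting that TTLs alone do not make a table a TWCS candidate when writes arrive out of order.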

On Cassandra 5.0, UCS generalizes this: a single strategy with a scaling_parameters knob that behaves like STCS at one extreme and LCS at the other, tuned per level, so you can shift the read/write-amplification trade-off without a full strategy migration. UCS is the 5.0 default for new tables, but the STCS/LCS mental model above is exactly what its scaling parameters interpolate between, and TWCS remains the idiomatic choice for time-series even on 5.0.
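As an illustration of how one knob spans both behaviors, the sketch below decodes a single scaling parameter. It assumes the shorthand described in the UCS documentation: `T<f>` for tiered with fan factor f, `L<f>` for leveled with fan factor f, `N` for the neutral midpoint, or a bare integer W, with the fan factor equal to 2 + |W|. Verify these conventions against your exact 5.0 release; `parse_scaling_parameter` is a hypothetical helper.

```python
def parse_scaling_parameter(value: str) -> tuple[int, int]:
    """Return (W, fan_factor) for a UCS scaling parameter such as 'T4', 'L10', or 'N'.

    Assumed convention: W > 0 is tiered (STCS-like), W < 0 is leveled (LCS-like),
    W == 0 is neutral, and fan_factor == 2 + abs(W).
    """
    text = value.strip().upper()
    if text == "N":
        return 0, 2
    if text[:1] in ("T", "L") and text[1:].isdigit():
        fan = int(text[1:])
        if fan < 2:
            raise ValueError(f"fan factor must be >= 2, got {fan}")
        w = fan - 2 if text[0] == "T" else 2 - fan
        return w, fan
    w = int(text)  # a bare integer is W itself; anything else raises ValueError
    return w, 2 + abs(w)
```

Under that convention, `T4` corresponds to STCS with `min_threshold` 4 and `L10` to LCS with `fanout_size` 10, which is why the two classic strategies sit at opposite ends of the same axis.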

Configuration Reference

Compaction is a per-table schema property set in the compaction map. The options that matter differ by strategy; the table below lists the ones that change behavior in production on 4.x/5.x.

Key	Strategy	Default	Valid range	Impact on compaction / reads / disk
`class`	all	`SizeTieredCompactionStrategy` (4.x) / `UnifiedCompactionStrategy` (5.0 new tables)	STCS / LCS / TWCS / UCS	Selects the merge policy; changing it triggers a full background recompaction of the table.
`min_threshold`	STCS	`4`	`2`–`max_threshold`	Minimum same-size SSTables before a tier compacts; lower means more frequent, smaller merges.
`max_threshold`	STCS	`32`	`> min_threshold`	Maximum SSTables merged at once; caps the size of any single compaction.
`sstable_size_in_mb`	LCS	`160`	`1`–`2000`	Fixed size of L1+ SSTables; larger values reduce SSTable count but coarsen level boundaries.
`fanout_size`	LCS	`10`	`2`–`100`	Size ratio between adjacent levels; higher fan-out means fewer levels but larger per-level rewrites.
`compaction_window_unit`	TWCS	`DAYS`	`MINUTES` / `HOURS` / `DAYS`	Time unit for a window; pairs with the size below.
`compaction_window_size`	TWCS	`1`	positive integer	Window length in units; a window becomes immutable once it closes.
`unsafe_aggressive_sstable_expiration`	TWCS	`false`	`true` / `false`	Drops fully expired SSTables without checking overlap — faster reclaim, but risks resurrecting data if writes arrive late.
`tombstone_compaction_interval`	STCS / LCS	`86400` (s)	positive integer	Minimum age before a single SSTable is eligible for a tombstone-only compaction.
`tombstone_threshold`	STCS / LCS	`0.2`	`0.0`–`1.0`	Droppable-tombstone ratio that triggers single-SSTable compaction.
`gc_grace_seconds` (table option)	all	`864000` (10 days)	`0`–`2^31-1`	Tombstone purge deadline; must exceed the full-coverage repair interval or deletes resurrect.

Switch a read-heavy table to LCS for predictable point-read latency:

-- Read-mostly OLTP table: minimize read amplification with leveled compaction.
ALTER TABLE ks.accounts
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

Configure TWCS for a TTL-driven time-series table so closed windows drop whole:

-- One-day windows; align gc_grace_seconds so expired windows drop cleanly.
ALTER TABLE ks.sensor_readings
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
  }
  AND gc_grace_seconds = 43200;

Keep STCS but bound its merge size on a write-heavy append log:

-- Append-mostly log: fewer, smaller merges to smooth I/O spikes.
ALTER TABLE ks.event_log
  WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4, 'max_threshold': 16};

Prefer TWCS over LCS on any table whose writes are timestamp-ordered and TTL-bound: the time-series strategy selection guidance shows why leveled compaction wastes write amplification re-leveling data that TWCS would simply drop, and why out-of-band writes (including read repair) can defeat window isolation.

Selecting and Switching a Strategy Safely

An ALTER TABLE ... WITH compaction on 4.x/5.x is an online change, but the moment it commits, Cassandra begins recompacting every existing SSTable into the new layout — a full-table I/O event, not a metadata toggle. Run these steps in order and stop if any gate fails.

Confirm the workload actually calls for a switch. Measure current read amplification before changing anything: a high SSTables per read on a latency-sensitive table is the signal to move from STCS to LCS.
```
nodetool tablehistograms ks accounts
```
Gate: only proceed to LCS if the 95th-percentile SSTables-per-read is above ~2 on a read-heavy table. A write-heavy table reading 1 SSTable does not need LCS.
Verify disk headroom for the recompaction. LCS staging and STCS’s largest-tier merge both need transient free space; require at least 1.5x the table’s live size free before switching to LCS.
```
nodetool tablestats ks.accounts | grep -E "Space used \(live\)"
df -h /var/lib/cassandra/data
```
Gate: abort if free space is below live_size * 1.5. Expand storage or archive cold partitions first.
Confirm no major compaction or repair is already running. Overlapping the strategy change with an in-flight compaction saturates disk I/O and stalls client reads.
```
nodetool compactionstats | grep -i "pending tasks"
nodetool netstats | head -n 3
```
Expected when idle: pending tasks: 0 (or a small single-digit number) and Not sending any streams. / Not receiving any streams.

Throttle compaction, then apply the change on one node/table at a time. Cap throughput so the background recompaction leaves headroom for client I/O.

nodetool setcompactionthroughput 32

ALTER TABLE ks.accounts
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

The Python script below folds those gates into an idempotent routine. It parses the plain-text output of nodetool compactionstats, which has no JSON mode on 4.x/5.x (tablestats does accept -F json, but the script takes the live size as an argument instead), and refuses to switch strategy when disk is tight or a compaction is already in flight. For a deeper, rollback-capable walkthrough of the STCS→LCS path specifically, see the step-by-step guide to switching from STCS to LCS.

#!/usr/bin/env python3
# Requirements: Python 3.10+, cassandra-driver>=3.28, a local `nodetool` on PATH.
"""Gate a compaction-strategy switch on disk headroom and an idle compaction queue (Cassandra 4.x/5.x)."""

import re
import shutil
import subprocess

from cassandra.cluster import Cluster


def _nodetool(args: list[str], timeout: int = 30) -> str:
    """Run nodetool, raising on a non-zero exit or timeout."""
    proc = subprocess.run(
        ["nodetool", *args], capture_output=True, text=True, timeout=timeout
    )
    if proc.returncode != 0:
        raise RuntimeError(f"nodetool {' '.join(args)} failed: {proc.stderr.strip()}")
    return proc.stdout


def compaction_idle(max_pending: int = 5) -> bool:
    """True when the compaction queue is at or below max_pending."""
    stats = _nodetool(["compactionstats"])
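    match = re.search(r"pending tasks:\s*(\d+)", stats)
    if match is None:
        # Fail closed: unparseable output must not be mistaken for an idle queue.
        print("Could not parse pending tasks from nodetool compactionstats; deferring.")
        return False
    pending = int(match.group(1))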
    if pending > max_pending:
        print(f"Compaction backlog too high ({pending}); deferring.")
        return False
    return True


def disk_headroom_ok(live_bytes: int, data_dir: str, multiplier: float = 1.5) -> bool:
    """True when free space covers live_size * multiplier for the recompaction."""
    free = shutil.disk_usage(data_dir).free
    # Matches the manual gate: abort when free space is below live_size * 1.5.
    required = int(live_bytes * multiplier)
    if free < required:
        print(f"Insufficient headroom: {free/1e9:.1f}GB free, {required/1e9:.1f}GB required.")
        return False
    return True


def switch_to_lcs(
    keyspace: str,
    table: str,
    live_bytes: int,
    data_dir: str = "/var/lib/cassandra/data",
    throttle_mbps: int = 32,
) -> bool:
    """Gate on headroom + idle queue, then ALTER the table to LeveledCompactionStrategy."""
    if not (disk_headroom_ok(live_bytes, data_dir) and compaction_idle()):
        print("Pre-flight gate failed; not altering compaction.")
        return False
    # Cap compaction throughput first so the recompaction leaves headroom for clients.
    _nodetool(["setcompactionthroughput", str(throttle_mbps)])
    cluster = Cluster()  # No contact points given: connects to a node on localhost.
    try:
        session = cluster.connect(keyspace)
        # Idempotent: re-running when already on LCS is a harmless no-op ALTER.
        session.execute(
            f"ALTER TABLE {keyspace}.{table} "
            "WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}"
        )
    finally:
        # Always release driver connections, even if the ALTER fails.
        cluster.shutdown()
    print(f"Switched {keyspace}.{table} to LCS; monitor nodetool compactionstats to completion.")
    return True


if __name__ == "__main__":
    # live_bytes should come from `nodetool tablestats` Space used (live).
    switch_to_lcs("ks", "accounts", live_bytes=120_000_000_000)

Apply the change one table at a time and let each recompaction drain before moving on, so quorum stays responsive and you never have two full-table recompactions competing for the same disks.
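The live_bytes argument can be read from nodetool tablestats rather than typed by hand. The helper below is a minimal sketch that assumes the plain, non-human-readable output (no -H flag), where the line reads `Space used (live): <bytes>`; the function name is hypothetical, and it returns the first matching table in the text it is given.

```python
import re


def live_bytes_from_tablestats(output: str) -> int:
    """Extract 'Space used (live)' in bytes from plain `nodetool tablestats` text."""
    match = re.search(r"Space used \(live\):\s*(\d+)\s*$", output, re.MULTILINE)
    if match is None:
        # With -H the value looks like "111.76 GiB" and is deliberately rejected.
        raise ValueError("no byte-valued 'Space used (live)' line found; was -H used?")
    return int(match.group(1))
```

Fed the output of `nodetool tablestats ks.accounts`, the result can be passed straight to switch_to_lcs as live_bytes, so the headroom gate always reflects the current table size.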

Verification & Observability

Confirm the strategy took hold and the recompaction settled — the ALTER returns instantly, but the real work runs for minutes to hours.

Confirm the schema change and watch it converge. Query the schema, then watch the queue drain rather than trusting the ALTER return.
```
SELECT compaction FROM system_schema.tables
WHERE keyspace_name = 'ks' AND table_name = 'accounts';
```
```
nodetool compactionstats -H
```
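Expect a burst of Compaction/Anticompaction tasks that trend back toward pending tasks: 0. A queue that only grows means compaction is not keeping up with the rewrite plus incoming writes: raise setcompactionthroughput if the disks still have I/O headroom, or reduce ingest if they are already saturated.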
Confirm LCS actually leveled. After a switch to LCS, nodetool tablestats reports the SSTable count per level; a healthy table has almost everything above L0 with only a few L0 SSTables.
```
nodetool tablestats ks.accounts | grep -E "SSTables in each level"
```
Verify read amplification improved. Re-run nodetool tablehistograms ks accounts and compare SSTables-per-read against the pre-change baseline from step 1; LCS should drop the 95th percentile toward 1–2.
Audit compaction history. The system.compaction_history table records every merge with input/output sizes, so you can confirm the recompaction ran and see how much space it reclaimed:
```
SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history LIMIT 20;
```
Export the right JMX metrics. Through the Prometheus JMX Exporter, track org.apache.cassandra.metrics:type=Compaction,name=PendingTasks and per-table LiveSSTableCount; a PendingTasks line that never returns to baseline is the earliest sign a strategy is mismatched to the workload.
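The level check above can be automated by parsing the `SSTables in each level` line. The sketch below assumes the bracketed list format printed by nodetool tablestats, where a level over its target appears as `count/target` (for example `102/100`). The `max_l0` threshold of 4 is an illustrative assumption, not a Cassandra constant.

```python
import re


def parse_level_counts(output: str) -> list[int]:
    """Return per-level SSTable counts from `nodetool tablestats` text."""
    match = re.search(r"SSTables in each level:\s*\[([^\]]*)\]", output)
    if match is None:
        raise ValueError("no 'SSTables in each level' line found; is the table on LCS?")
    # Entries look like "4" or, when a level is over its target, "102/100".
    return [
        int(item.split("/")[0])
        for item in match.group(1).split(",")
        if item.strip()
    ]


def l0_backlog(counts: list[int], max_l0: int = 4) -> bool:
    """True when L0 holds more SSTables than the chosen threshold."""
    return bool(counts) and counts[0] > max_l0
```

A table reporting `[2, 10, 102/100, 0, ...]` is leveled and healthy by this measure, while a first entry in the dozens indicates that L0 is accumulating faster than it is being promoted.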

Failure Modes & Rollback

Disk exhaustion during an STCS largest-tier compaction. STCS must stage the merged output of its biggest tier alongside the originals, so a table on STCS can need free space on the order of its largest tier’s total size all at once; if the node is above ~50% full, that compaction can fail with a "Not enough space for compaction" error or wedge the queue. Detect it with nodetool compactionstats showing a stalled large compaction and df -h near capacity. Rollback: temporarily lower max_threshold so merges stay smaller (ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'max_threshold': 8}), add disk, or migrate the table to LCS, whose fixed-size SSTables cap any single compaction’s transient footprint.

Data resurrection under TWCS with out-of-band writes or unsafe_aggressive_sstable_expiration. TWCS assumes writes are timestamp-ordered and that closed windows are never modified. A late-arriving write, an update to old data, or read-repair writes can land in an already-closed window; with aggressive expiration enabled, TWCS may drop an SSTable whose data still overlaps live rows, resurrecting deleted values or losing recent ones. This connects directly to tombstone management and garbage collection: a window can only drop safely once every cell in it is past gc_grace_seconds. Detect it by watching for reappearing rows and by grepping system.log for Dropping expired SSTable. Rollback: set unsafe_aggressive_sstable_expiration = false, disable read repair on the table (ALTER TABLE ks.tbl WITH read_repair = 'NONE'), and run a full repair to reconcile the affected windows.
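The timing condition for a safe whole-SSTable drop is simple arithmetic: the newest cell's write time, plus its TTL, plus gc_grace_seconds. The sketch below assumes every cell in the window carries the same TTL and that no other SSTable overlaps the window with older data, which is the second condition Cassandra checks unless aggressive expiration is on. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def earliest_drop_time(
    newest_write: datetime, ttl_seconds: int, gc_grace_seconds: int
) -> datetime:
    """Earliest moment a window's SSTable is fully expired and purgeable."""
    return newest_write + timedelta(seconds=ttl_seconds + gc_grace_seconds)


def window_droppable(
    now: datetime, newest_write: datetime, ttl_seconds: int, gc_grace_seconds: int
) -> bool:
    """True once every cell has passed its TTL and gc_grace_seconds."""
    return now >= earliest_drop_time(newest_write, ttl_seconds, gc_grace_seconds)
```

With a 7-day TTL and gc_grace_seconds = 43200, a window whose last write landed at 23:59:59 UTC on 17 May becomes droppable at 11:59:59 UTC on 25 May. A single late write into that window pushes the date out for the whole SSTable.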

Write-amplification stall after switching a write-heavy table to LCS. LCS re-levels overlapping ranges continuously; on a table that ingests faster than compaction can level it, L0 grows unbounded, pending tasks climbs in lockstep with write volume, and reads slow as L0 SSTables all overlap. This often surfaces after a repair streams a burst of SSTables into L0. Detect it with nodetool tablestats showing a large L0 count and a persistently high PendingTasks. Rollback: raise compaction concurrency and throughput to let L0 drain, and if the table is genuinely write-first, revert to STCS (ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy'}) — the wrong strategy, not the tuning, is the problem.

FAQ

When should I switch from STCS to LCS?

Switch when a latency-sensitive, read-heavy table shows high SSTables-per-read — a 95th-percentile above roughly 2 in nodetool tablehistograms — and you have at least 1.5x the table’s live size free on disk. LCS trades higher write amplification for one or two SSTables touched per read, so it pays off on read-mostly and mixed OLTP tables. It is the wrong move for write-heavy or append-only tables, where the extra write amplification just burns I/O without improving the reads you rarely do.

Can I run different compaction strategies on different tables in one keyspace?

Yes. Compaction is a per-table property, so a single keyspace can mix STCS on an append log, LCS on an accounts table, and TWCS on a time-series table. That is normal and expected: choose per workload, not per keyspace. The only cross-table concern is aggregate disk and I/O headroom on each node, since every table’s compaction competes for the same compaction_throughput_mb_per_sec budget (named compaction_throughput in cassandra.yaml from 4.1 onward).

Why does my TWCS table keep tombstones instead of dropping expired data?

Almost always because data is not confined to closed windows: out-of-band updates, deletes, or read-repair writes land in old windows, so an SSTable spans more than one window and cannot be dropped whole. It can also happen when gc_grace_seconds has not yet elapsed for the newest cell in a window. Fix it by ensuring writes are timestamp-ordered, setting read_repair = 'NONE' on the table, and confirming your TTL plus gc_grace_seconds is shorter than your retention target so windows actually reach full expiry.

Should I just use UnifiedCompactionStrategy on Cassandra 5.0?

For new tables, UCS is a reasonable default: its scaling_parameters let one strategy behave like STCS or LCS per level, so you can tune the read/write-amplification trade-off without a migration. But TWCS remains the idiomatic choice for TTL-driven time-series even on 5.0, and if you already run well-tuned STCS or LCS tables there is no urgency to migrate them. Treat UCS as a way to avoid committing to a fixed point on the STCS↔LCS spectrum, not as a replacement for understanding what that spectrum means.

Does changing compaction strategy require downtime?

No. ALTER TABLE ... WITH compaction is an online change on 4.x/5.x — the table stays readable and writable throughout. What it does trigger is a full background recompaction that competes with client I/O, so throttle it with nodetool setcompactionthroughput, apply it one table at a time, and schedule the switch outside peak traffic. The risk is I/O saturation and disk pressure during the rewrite, not unavailability.

Related guides

Cassandra architecture and compaction fundamentals: the parent guide framing how the storage engine and cluster fabric fit together.
LSM tree mechanics in Cassandra: the write path, memtable flush, and compaction engine every strategy builds on.
Step-by-step guide to switching from STCS to LCS: the gated, rollback-capable migration runbook for the STCS→LCS path.
Tombstone management and garbage collection: the gc_grace_seconds mechanics that TWCS window expiry and STCS tombstone sweeps depend on.
Read repair vs anti-entropy repair: how each strategy absorbs repair-generated and out-of-band read-repair SSTables.
Strategy selection for time-series workloads: deeper TWCS window-sizing guidance for TTL-driven ingestion.