Compaction Backlog Analysis & Alerting in Cassandra 4.x/5.x

A growing compaction backlog is the primary precursor to disk exhaustion, elevated read latency, and cascading anti-entropy failures. In production Apache Cassandra 4.x and 5.x clusters, backlog analysis must transition from reactive nodetool polling to continuous, velocity-aware observability. This guide is for DBAs and SREs who need to turn raw compaction telemetry into an early-warning signal: it explains how to quantify backlog as a derived state, establish production-safe alerting thresholds, and automate remediation without disrupting node lifecycle operations or repair schedules. It sits under Advanced Compaction Strategy Tuning & Monitoring; read that first if you have not yet aligned your table-level strategy with your workload. Every threshold and command below is validated against Cassandra 4.0, 4.1, and 5.0, with version drift called out inline.

Concept: why backlog is a derived state, not a metric

Compaction backlog is not a single JMX metric; it is a derived operational state representing the delta between write ingestion velocity and disk I/O throughput. Cassandra submits compaction tasks to the bounded CompactionExecutor thread pool, sized by concurrent_compactors and throttled by the shared compaction_throughput budget. PendingTasks is the queue depth of that pool, and BytesCompacted is the cumulative output you differentiate to recover throughput. Backlog exists whenever the queue drains slower than mutations refill it.

When compaction falls behind, unmerged, overlapping SSTables accumulate, tombstones survive past their expiry, and read amplification spikes as each read fans across more SSTables per partition. That degradation loop often manifests as ReadTimeoutException or UnavailableException during peak traffic, the same failure surface explored under tombstone management and garbage collection. The compaction strategy in force shapes what a healthy backlog looks like: the trade-offs between STCS, LCS, and TWCS mean a size-tiered table bursts pending tasks on every tier merge, while a leveled table holds a tighter, flatter queue. Cassandra 4.x defaults new tables to SizeTieredCompactionStrategy (STCS); Cassandra 5.0 introduces UnifiedCompactionStrategy (UCS), but STCS remains the out-of-the-box default unless you override it. Strategy mismatches remain a common root cause of chronic backlog, so validate table-level parameters before adjusting cluster-wide thresholds.

The distinction between depth and velocity is the whole game. A node with PendingTasks climbing while BytesCompacted still rises is merely busy; a node with pending climbing while BytesCompacted is flat is stalled. Depth alone cannot tell these apart, which is why static count thresholds generate noise. The rest of this guide builds alerting on the pair.
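
To make the busy-versus-stalled distinction concrete, here is a minimal sketch that labels the interval between two scrapes. The `Sample` type, the state names, and the `min_rate_bytes_sec` cutoff are illustrative assumptions, not part of Cassandra or any exporter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One scrape of the two counters (hypothetical container)."""
    ts: float               # seconds from any monotonic clock
    pending_tasks: float    # PendingTasks gauge
    bytes_compacted: float  # BytesCompacted cumulative counter


def classify(prev: Sample, curr: Sample, min_rate_bytes_sec: float = 1.0) -> str:
    """Label the interval between two samples using depth AND velocity."""
    elapsed = curr.ts - prev.ts
    if elapsed <= 0 or curr.bytes_compacted < prev.bytes_compacted:
        return "unknown"  # clock problem or counter reset (node restart)
    if curr.pending_tasks == 0:
        return "idle"
    if curr.pending_tasks <= prev.pending_tasks:
        return "steady"   # queue is flat or shrinking
    rate = (curr.bytes_compacted - prev.bytes_compacted) / elapsed
    # Queue is growing: output still rising means busy, flat output means stalled.
    return "busy" if rate >= min_rate_bytes_sec else "stalled"


if __name__ == "__main__":
    before = Sample(ts=0.0, pending_tasks=10, bytes_compacted=1_000_000)
    after = Sample(ts=30.0, pending_tasks=18, bytes_compacted=1_000_000)
    print(classify(before, after))  # queue grew, no bytes retired
```

The same depth reading (18 pending) yields a different label depending on whether `bytes_compacted` moved, which is the reason a count-only threshold cannot separate the two cases.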

Quantifying backlog & establishing dynamic thresholds

Static alert thresholds fail under variable write loads. Effective backlog quantification correlates several metrics exposed via JMX or the Prometheus JMX Exporter:

org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks
org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted

The effective backlog ratio is calculated as:

(PendingTasks × avg_sstable_size) / available_data_volume_space

When this ratio exceeds 0.6, compaction cannot keep pace with ingestion and disk saturation becomes imminent. The configured throughput limit is read via nodetool getcompactionthroughput, and the actual work rate is the delta of the BytesCompacted counter between samples (bytes/sec). Implement dynamic alerting based on this derived compaction-rate decay rather than absolute counts. Trigger alerts when:

PendingTasks > 32 for > 15 minutes
the derived compaction rate (delta of BytesCompacted over time) drops below 40% of its 1-hour moving average
data-volume utilization exceeds 85% while pending tasks remain > 16
read p99 latency exceeds 2× baseline during compaction windows
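
The ratio and the four trigger conditions above can be sketched as one evaluation function. This is a hedged illustration for Python 3.9+: the `NodeState` fields and trigger names are assumptions, and the inputs are expected to come from your own collectors rather than from any Cassandra API.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot assembled by the monitoring pipeline."""
    pending_tasks: float
    pending_over_32_minutes: float  # how long PendingTasks has stayed above 32
    rate_bytes_sec: float           # delta(BytesCompacted) / elapsed seconds
    rate_1h_avg_bytes_sec: float    # 1-hour moving average of that rate
    avg_sstable_bytes: float
    free_data_bytes: float          # available space on the data volume
    disk_used_fraction: float       # 0.0 to 1.0
    read_p99_ms: float
    read_p99_baseline_ms: float


def backlog_ratio(s: NodeState) -> float:
    """(PendingTasks x avg_sstable_size) / available_data_volume_space."""
    if s.free_data_bytes <= 0:
        return float("inf")
    return s.pending_tasks * s.avg_sstable_bytes / s.free_data_bytes


def fired_triggers(s: NodeState) -> list[str]:
    """Return the names of the listed alert conditions that currently hold."""
    fired = []
    if s.pending_tasks > 32 and s.pending_over_32_minutes > 15:
        fired.append("sustained_depth")
    if s.rate_1h_avg_bytes_sec > 0 and s.rate_bytes_sec < 0.40 * s.rate_1h_avg_bytes_sec:
        fired.append("rate_decay")
    if s.disk_used_fraction > 0.85 and s.pending_tasks > 16:
        fired.append("disk_pressure")
    if s.read_p99_baseline_ms > 0 and s.read_p99_ms > 2 * s.read_p99_baseline_ms:
        fired.append("read_latency")
    return fired


if __name__ == "__main__":
    state = NodeState(40, 20, 10 * MIB, 50 * MIB, 160 * MIB,
                      20 * 1024 * MIB, 0.88, 12.0, 8.0)
    print(backlog_ratio(state), fired_triggers(state))
```

In this example 40 pending tasks at an average of 160 MiB against 20 GiB of free space gives a ratio of 0.3125, below the 0.6 line, while three of the four triggers still fire; the ratio and the triggers answer different questions and are worth tracking separately.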

Because these thresholds mirror the velocity computation used elsewhere, keep them aligned with the sampling patterns in async compaction tracking & metrics so a single scrape feeds both dashboards and pages.

Configuration reference

The node-scoped settings below govern how much compaction work can drain the backlog and, by extension, how you interpret its telemetry. Values in cassandra.yaml are per-node; strategy parameters are table-scoped via ALTER TABLE.

| Key | Default | Valid range | Impact on backlog & alerting |
| --- | --- | --- | --- |
| `concurrent_compactors` | `min(cores, disks)` (often 2–8) | `1` – core count | Sets the `ActiveTasks` ceiling; `PendingTasks` above `2 × concurrent_compactors` is the primary backlog trigger |
| `compaction_throughput` (4.1+) / `compaction_throughput_mb_per_sec` (≤4.0) | `64` MiB/s (`0` = unlimited) | `0` – disk write ceiling | Caps aggregate `BytesCompacted` growth; velocity is benchmarked as a fraction of this ceiling |
| `sstable_preemptive_open_interval` (4.x) | `50` MiB | `-1` (disabled) – 100s of MiB | Governs how early merged output is visible to reads; smooths read-amp while a backlog drains |
| `concurrent_validations` | `0` (tied to compactors) | `0` – core count | Caps concurrent repair validations, which compete with compaction for disk I/O; bounds the contention you must schedule alerts around |
| `compaction_large_partition_warning_threshold` | `100` MiB | tens–hundreds of MiB | Emits WARN log lines that correlate with slow-draining backlog on skewed partitions |

Note the 4.1 rename from compaction_throughput_mb_per_sec to compaction_throughput; automation that shells out to nodetool setcompactionthroughput <MiB/s> works on both, but YAML keys differ. A minimal throughput-sensitive block:

# cassandra.yaml — Cassandra 4.1+ / 5.0
# Pre-4.1 clusters: use compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4
compaction_throughput: 128MiB/s
sstable_preemptive_open_interval: 50MiB
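
For automation that templates cassandra.yaml across mixed-version clusters, the rename can be handled with a small version switch. The sketch below is an assumption-laden illustration: `throughput_yaml_entry` is a hypothetical helper, and it expects a plain numeric version string such as "4.0.11" or "5.0" (no "-rc1" style suffix in the first two components).

```python
def throughput_yaml_entry(version: str, mib_per_sec: int) -> str:
    """Return the cassandra.yaml line that caps compaction throughput.

    4.1 and later use `compaction_throughput` with an explicit unit;
    4.0 and earlier use `compaction_throughput_mb_per_sec` with a bare number.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (4, 1):
        return f"compaction_throughput: {mib_per_sec}MiB/s"
    return f"compaction_throughput_mb_per_sec: {mib_per_sec}"


if __name__ == "__main__":
    for v in ("4.0.11", "4.1.4", "5.0"):
        print(v, "->", throughput_yaml_entry(v, 128))
```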

Alerting architecture & dynamic routing

Scrape compaction metrics through the Prometheus JMX Exporter, which republishes Cassandra's built-in JMX metrics on an HTTP endpoint. Route alerts through a correlation engine that cross-references backlog velocity with active repair windows. When repair validation and compaction compete for disk I/O, the affected nodes experience validation timeouts and dropped anti-entropy repair sessions. Decouple these workloads by scheduling repairs during compaction lulls rather than forcing concurrent execution.

Implement tiered alert severity with automated routing:

Warning: pending tasks > 16, throughput decay detected. Route to Slack/PagerDuty for DBA review.
Critical: pending tasks > 64, disk > 90%, or read p99 latency > 2× baseline. Trigger automated throttling and page on-call SRE.

The tiered severity decisions are summarized below.

Figure: Tiered compaction-backlog alerting. Each escalation gate pairs queue depth with disk pressure so a page only fires when both align.
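
As a hedged sketch, the two tiers can be expressed as a single classification function. It reads the critical tier as "any one of the three conditions", which is one interpretation of the list above; swap the `or` for `and` if you want a page only when depth and disk pressure align. The function name and return labels are illustrative.

```python
def severity(pending: float, rate_decayed: bool, disk_used: float,
             read_p99_ms: float, baseline_p99_ms: float) -> str:
    """Map one node snapshot to 'ok', 'warning', or 'critical'."""
    latency_breach = baseline_p99_ms > 0 and read_p99_ms > 2 * baseline_p99_ms
    # Critical: pending > 64, disk > 90%, or read p99 above twice its baseline.
    if pending > 64 or disk_used > 0.90 or latency_breach:
        return "critical"
    # Warning: pending > 16 together with a detected throughput decay.
    if pending > 16 and rate_decayed:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(severity(pending=20, rate_decayed=True, disk_used=0.70,
                   read_p99_ms=8.0, baseline_p99_ms=8.0))
```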

Step-by-step: standing up backlog alerting

Manual intervention does not scale. The procedure below stands up a velocity-aware backlog monitor and safe remediation path. Run each step on a representative node first, then roll out node-by-node.

Confirm the telemetry surface is reachable. Capture a baseline before automating anything:
```
# Cassandra 4.x/5.x — human-readable queue snapshot + configured ceiling
nodetool compactionstats -H
nodetool getcompactionthroughput
```
The system_views virtual tables, introduced in Cassandra 4.0, expose in-progress compaction tasks without JMX, which is essential inside restricted networks (the exact column set varies by release):
```
SELECT keyspace_name, table_name, completion_ratio, unit
FROM system_views.sstable_tasks;
```
Expose the counters you will differentiate. Ensure the Prometheus JMX Exporter is running alongside Cassandra (this example assumes port 8080) and that cassandra_compaction_pendingtasks and cassandra_compaction_bytescompacted appear in its /metrics output.

Deploy the velocity-aware daemon. The pattern below queries the Prometheus scrape endpoint, computes backlog velocity, and gates remediation behind both a depth threshold and a rate-decay threshold. Never poll more aggressively than your scrape interval, or the differentiated rate becomes noise.

#!/usr/bin/env python3
# requirements: Python 3.10+ (for the `float | str` annotations), requests>=2.31
import time
import requests
import logging
from collections import deque
from datetime import datetime, timezone
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

JMX_EXPORTER_URL = "http://localhost:8080/metrics"
BACKLOG_THRESHOLD = 32
RATE_DECAY_RATIO = 0.40

# Rolling history of derived compaction rates (bytes/sec) for a 1-hour moving
# average. BytesCompacted is a monotonic counter, so each rate is the delta
# between consecutive samples divided by the elapsed seconds.
_rate_history: "deque[float]" = deque(maxlen=120)  # 120 samples @ 30s = 1h
_last_sample: Optional[tuple[float, float]] = None  # (monotonic ts, BytesCompacted)

def fetch_compaction_metrics() -> Optional[dict[str, float]]:
    """Fetch compaction metrics from the JMX Exporter Prometheus endpoint.

    Metric names are Prometheus-normalized from the JMX object names, e.g.:
      cassandra_compaction_pendingtasks
      cassandra_compaction_bytescompacted
    Adjust to match your exporter's metric naming configuration.
    """
    try:
        resp = requests.get(JMX_EXPORTER_URL, timeout=5)
        resp.raise_for_status()
        metrics: dict[str, float] = {}
        for line in resp.text.splitlines():
            if line.startswith("#") or not line.strip():
                continue
            parts = line.split()
            if len(parts) >= 2 and "compaction" in parts[0].lower():
                key = parts[0].split("{")[0]
                try:
                    metrics[key] = float(parts[1])
                except ValueError:
                    pass
        return metrics or None
    except requests.RequestException as e:
        logging.error(f"Failed to fetch compaction metrics: {e}")
        return None

def calculate_backlog_velocity(metrics: dict[str, float]) -> dict[str, float | str]:
    global _last_sample
    pending = metrics.get("cassandra_compaction_pendingtasks", 0.0)
    bytes_compacted = metrics.get("cassandra_compaction_bytescompacted", 0.0)

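    now = time.monotonic()
    current_rate = 0.0
    rate_valid = False  # stays False on the first sample or after a counter reset
    if _last_sample is not None:
        prev_ts, prev_bytes = _last_sample
        elapsed = now - prev_ts
        # BytesCompacted moving backwards means the node restarted; skip that delta.
        if elapsed > 0 and bytes_compacted >= prev_bytes:
            current_rate = (bytes_compacted - prev_bytes) / elapsed  # bytes/sec
            rate_valid = True
    _last_sample = (now, bytes_compacted)

    # Compare the instantaneous rate against the moving average of earlier samples,
    # then record it, so the current sample cannot dilute its own decay signal.
    moving_avg = sum(_rate_history) / len(_rate_history) if _rate_history else 0.0
    decay_ratio = current_rate / moving_avg if rate_valid and moving_avg > 0 else 1.0
    if rate_valid:
        _rate_history.append(current_rate)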

    return {
        "pending_tasks": pending,
        "rate_mb_sec": current_rate / (1024 * 1024),
        "decay_ratio": decay_ratio,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def evaluate_and_act(velocity: dict[str, float | str]) -> None:
    if velocity["pending_tasks"] > BACKLOG_THRESHOLD and velocity["decay_ratio"] < RATE_DECAY_RATIO:
        logging.warning(
            f"Backlog critical: {velocity['pending_tasks']} pending tasks, rate decay detected."
        )
        # Safe remediation: raise compaction throughput via nodetool or JMX.
        # In production, wrap this in a circuit breaker to prevent thrashing.
        logging.info("Triggering automated compaction throughput adjustment...")
        # subprocess.run(["nodetool", "setcompactionthroughput", "256"], check=True)
    else:
        logging.info(f"Backlog nominal: {velocity['pending_tasks']} pending tasks.")

if __name__ == "__main__":
    # Poll on an interval so the moving average accumulates real history.
    while True:
        metrics = fetch_compaction_metrics()
        if metrics:
            evaluate_and_act(calculate_backlog_velocity(metrics))
        time.sleep(30)

Gate remediation behind a circuit breaker. Before any automated nodetool setcompactionthroughput bump takes effect, require that the decay condition has held for two consecutive samples and that disk headroom exists. This prevents oscillation during transient I/O spikes. Deeper structured-telemetry patterns live in Python monitoring for Cassandra compaction.
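
A minimal sketch of such a circuit breaker follows. The class name, the 85% disk ceiling, and the 15-minute cooldown are assumptions chosen for illustration; only the two-consecutive-samples rule and the disk-headroom requirement come from the step above.

```python
import time
from typing import Callable, Optional


class ThroughputCircuitBreaker:
    """Permit remediation only after N consecutive decayed samples,
    with disk headroom available, and outside a cooldown window."""

    def __init__(self, required_samples: int = 2, max_disk_used: float = 0.85,
                 cooldown_sec: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.required_samples = required_samples
        self.max_disk_used = max_disk_used  # assumed headroom limit
        self.cooldown_sec = cooldown_sec    # assumed rate limit between actions
        self._clock = clock
        self._streak = 0
        self._last_fired: Optional[float] = None

    def allow(self, decayed: bool, disk_used: float) -> bool:
        """Record one sample; return True only if remediation may run now."""
        self._streak = self._streak + 1 if decayed else 0
        if self._streak < self.required_samples:
            return False  # transient spike: condition has not held long enough
        if disk_used >= self.max_disk_used:
            return False  # no headroom for extra compaction output
        now = self._clock()
        if self._last_fired is not None and now - self._last_fired < self.cooldown_sec:
            return False  # still cooling down from the previous adjustment
        self._last_fired = now
        self._streak = 0
        return True
```

In the daemon, `evaluate_and_act` would call `breaker.allow(decayed, disk_used)` once per sample and run the `nodetool setcompactionthroughput` command only when it returns True. Injecting the clock keeps the breaker testable without sleeping.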

Verification & observability

Confirm the monitor and its remediation actually moved the backlog rather than just logging. Watch the queue drain directly:

# Queue depth should trend down after a throughput bump; -H prints human units.
nodetool compactionstats -H

Correlate completion against system.compaction_history, which is available on every release covered here, to prove work is retiring:

SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history
WHERE keyspace_name = 'telemetry' ALLOW FILTERING;

Grep the logs for the throughput change and any starvation warnings the daemon should have caught first:

grep -E "compaction_throughput|Compacted .* to level|CompactionExecutor" /var/log/cassandra/system.log

A healthy post-remediation signature is PendingTasks falling, BytesCompacted rising steeply, and SSTablesPerReadHistogram (from nodetool tablehistograms) trending back toward baseline. Note the surface drift: nodetool tablestats/tablehistograms are current, while cfstats/cfhistograms are deprecated aliases and should not be baked into new automation.
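
The first two parts of that signature can be checked mechanically. The sketch below is a simplified illustration for Python 3.9+, assuming you already hold equally spaced samples of `PendingTasks` and of the derived compaction rate taken across the remediation window; `remediation_worked` is a hypothetical helper name.

```python
def remediation_worked(pending: list[float], rates: list[float]) -> bool:
    """True when queue depth fell and the average compaction rate rose.

    `pending` and `rates` are ordered oldest to newest. The window is split
    in half so the "before" and "after" rates can be compared.
    """
    if len(pending) < 2 or len(rates) < 2:
        return False  # not enough history to judge a trend
    half = len(rates) // 2
    rate_before = sum(rates[:half]) / half
    rate_after = sum(rates[half:]) / (len(rates) - half)
    return pending[-1] < pending[0] and rate_after > rate_before


if __name__ == "__main__":
    print(remediation_worked([40, 30, 12, 5], [10.0, 12.0, 60.0, 80.0]))
```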

Failure modes & rollback

Remediation thrash (throughput oscillation). A monitor that raises throughput on every transient decay will yank the I/O budget up and down, starving the read path it was meant to protect. Detection: repeated setcompactionthroughput log lines within minutes and a sawtooth in read p99. Rollback: disable the automation, reset to the baseline with nodetool setcompactionthroughput 64, and reintroduce the two-sample circuit breaker before re-enabling.

Repair/compaction I/O contention. In 4.x/5.x, nodetool repair defaults to parallel, incremental repair. When a backlog forces compaction to consume excess I/O, repair validation sessions time out, leaving replicas inconsistent. Detection: streaming MBeans under org.apache.cassandra.metrics:type=Streaming show active sessions exceeding concurrent_compactors, plus validation timed out in the logs. Rollback: pause non-critical repairs, let the daemon confirm PendingTasks < 8 and a stable BytesCompacted rate, then resume during the lull.

Corrupt-SSTable stall masquerading as backlog. A single unreadable SSTable can pin a compaction task and inflate the queue indefinitely. Detection: PendingTasks flat-high with ActiveTasks at 0 and CorruptSSTableException in debug.log; categorize it using compaction error categorization & logging. Rollback: auto-quarantine the file, run nodetool scrub during a maintenance window, and never let the throughput automation fire against a corruption stall — it cannot drain and will only waste I/O. For a full guided drain of a genuine backlog, follow resolving high compaction backlog without downtime.
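
The three failure modes above can be told apart before any automation fires. The following is a rough triage sketch: the function name, the input flags, and the "three throughput changes in ten minutes" cutoff for thrash are assumptions, and the inputs are expected to come from your own log and metric collectors.

```python
def triage(pending_tasks: float, active_tasks: float, corrupt_sstable_logged: bool,
           validation_timeouts: int, throughput_changes_last_10m: int) -> str:
    """Name the most likely failure mode behind a high PendingTasks reading."""
    if corrupt_sstable_logged and active_tasks == 0 and pending_tasks > 0:
        return "corrupt_sstable_stall"  # queue cannot drain; never raise throughput
    if throughput_changes_last_10m >= 3:
        return "remediation_thrash"     # disable automation, reset to baseline
    if validation_timeouts > 0:
        return "repair_contention"      # pause non-critical repairs first
    return "genuine_backlog"


def automation_allowed(mode: str) -> bool:
    """Only a genuine backlog is safe to remediate automatically."""
    return mode == "genuine_backlog"


if __name__ == "__main__":
    mode = triage(pending_tasks=48, active_tasks=0, corrupt_sstable_logged=True,
                  validation_timeouts=0, throughput_changes_last_10m=0)
    print(mode, automation_allowed(mode))
```

Checking for the corruption stall first reflects the rule that throughput automation must never fire against a queue that cannot drain.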

Backlog analysis is also a foundational input for capacity forecasting. Track compaction-throughput baselines over rolling 30-day windows; when the derived rate consistently sits at 80%+ of disk-controller limits, provision additional nodes or migrate high-write tables to NVMe-backed storage. Use historical PendingTasks and BytesCompacted data to model write amplification under projected growth so disk provisioning, concurrent_compactors, and throughput limits scale predictably alongside ingestion velocity.
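
As a hedged illustration of that capacity check (Python 3.9+), the function below flags a node when the daily average compaction rate sat at or above 80% of the disk limit on most days of the window. Interpreting "consistently" as 90% of sampled days is an assumption, as are the function and parameter names.

```python
def needs_more_capacity(daily_rate_mib_sec: list[float], disk_limit_mib_sec: float,
                        utilization: float = 0.80, share_of_days: float = 0.90) -> bool:
    """True when the derived rate reached `utilization` of the disk limit
    on at least `share_of_days` of the sampled days."""
    if not daily_rate_mib_sec or disk_limit_mib_sec <= 0:
        return False
    threshold = utilization * disk_limit_mib_sec
    hot_days = sum(1 for rate in daily_rate_mib_sec if rate >= threshold)
    return hot_days / len(daily_rate_mib_sec) >= share_of_days


if __name__ == "__main__":
    window = [170.0] * 28 + [100.0] * 2  # 30 daily averages in MiB/s
    print(needs_more_capacity(window, disk_limit_mib_sec=200.0))
```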

FAQ

Why should I alert on compaction-rate decay instead of a raw PendingTasks count?

PendingTasks is queue depth and spikes normally on STCS tier merges or TWCS window rollover, so a fixed count generates false pages. Rate decay is the derivative of BytesCompacted: it tells you whether the queue is actually draining. Alert on the pair — high depth and falling velocity — never on depth alone.

What PendingTasks threshold should trigger a page?

Scale it to the node, not a magic number. PendingTasks above 2 × concurrent_compactors sustained for 15 minutes is a solid warning trigger; combine it with disk > 85% or read p99 > 2× baseline before escalating to critical. On a 4-compactor node that is roughly 8 pending as a floor, 32+ as a warning, 64+ as critical.

How do I keep automated remediation from making backlog worse?

Wrap every setcompactionthroughput action in a circuit breaker: require the decay condition to hold for two consecutive samples, confirm disk headroom, and rate-limit adjustments. Raising throughput blindly during an I/O-bound or corrupt-SSTable stall wastes budget and starves reads, so gate on ActiveTasks > 0 first.

Does compaction backlog affect anti-entropy repair?

Yes. Repair validation runs through the same CompactionManager and draws on the same disk I/O budget, so a saturated queue starves repair streams and causes validation timeouts and inconsistent replicas. Schedule incremental repair during compaction lulls (when PendingTasks < 8 and the BytesCompacted rate is stable) rather than forcing concurrent execution.

Should the daemon read metrics from nodetool or JMX?

Read JMX (or a Prometheus JMX-exporter scrape) for counters you differentiate like BytesCompacted, because it avoids spawning a subprocess per sample. Use nodetool compactionstats -H for the human-verifiable snapshot and per-compaction progress. On Cassandra 4.0 and later, prefer the system_views.sstable_tasks virtual table over parsing text.

Related guides

Advanced Compaction Strategy Tuning & Monitoring: the parent guide covering strategy selection, tuning, and observability end to end.
Async compaction tracking & metrics: the JMX MBeans and velocity math that feed these thresholds.
Python monitoring for Cassandra compaction: structured telemetry ingestion and Prometheus/Datadog integration.
Compaction error categorization & logging: separating transient stalls from corrupt-SSTable failures behind a backlog.
Resolving high compaction backlog without downtime: the step-by-step drain procedure once an alert fires.