Compaction Backlog Analysis & Alerting

A growing compaction backlog is the primary precursor to disk exhaustion, elevated read latency, and cascading anti-entropy failures. In modern Apache Cassandra deployments, backlog analysis must transition from reactive nodetool polling to continuous, automated observability. This guide details how to quantify backlog, establish production-safe alerting thresholds, and automate remediation workflows without disrupting node lifecycle operations or repair schedules. All automation patterns and configuration parameters in this guide target Cassandra 4.x and 5.x.

The Operational Reality of Compaction Backlog

Compaction backlog is not a single JMX metric; it is a derived operational state representing the delta between write ingestion velocity and disk I/O throughput. When compaction falls behind, tombstone accumulation increases, read amplification spikes, and the cluster enters a degradation loop that often manifests as ReadTimeoutException or UnavailableException during peak traffic.

Cassandra 4.x defaults new tables to SizeTieredCompactionStrategy (STCS), while TimeWindowCompactionStrategy (TWCS) is strongly recommended for high-velocity time-series workloads. Cassandra 5.0 introduces UnifiedCompactionStrategy (UCS), but STCS remains the out-of-the-box default unless UCS is explicitly configured. Strategy mismatches remain the most common root cause of chronic backlog. Before adjusting cluster-wide thresholds, validate your table-level parameters and compaction window configurations against Advanced Compaction Strategy Tuning & Monitoring to ensure alignment with your data lifecycle and retention policies.

Quantifying Backlog & Establishing Dynamic Thresholds

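Static alert thresholds fail under variable write loads. Effective backlog quantification requires correlating the following metrics, which are exposed over JMX (for example through a JMX Exporter) or through an HTTP metrics endpoint (this guide assumes one at /api/v1/metrics, typically served by a sidecar or exporter rather than by Cassandra itself):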

  • org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
  • org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks
  • org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted

The effective backlog ratio is calculated as: (PendingTasks × avg_sstable_size) / available_data_volume_space
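As a minimal sketch of the formula above, the ratio can be computed directly from three inputs. The function name and the sample numbers are illustrative assumptions, not values from a real cluster; in practice the free space could come from shutil.disk_usage on the data directory, and the average SSTable size from your own table statistics.

```python
def effective_backlog_ratio(pending_tasks: int,
                            avg_sstable_size_bytes: float,
                            available_bytes: float) -> float:
    """(PendingTasks x avg_sstable_size) / available_data_volume_space."""
    if available_bytes <= 0:
        # No free space left on the data volume: treat the backlog as unbounded.
        return float("inf")
    return (pending_tasks * avg_sstable_size_bytes) / available_bytes


GIB = 1024 ** 3

# Illustrative numbers only: 40 pending tasks, 8 GiB average SSTable,
# 400 GiB free on the data volume.
ratio = effective_backlog_ratio(40, 8 * GIB, 400 * GIB)
print(f"backlog ratio = {ratio:.2f}")
```

With these sample inputs the pending work amounts to 320 GiB against 400 GiB of free space, so the ratio is 0.80.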

When this ratio exceeds 0.6, treat it as a sign that compaction is no longer keeping pace with ingestion and that disk saturation is approaching. There is no CompactionThroughput metric: the configured limit is read via nodetool getcompactionthroughput, and the actual work rate is derived from the BytesCompacted counter (bytes/sec between samples). Alert on decay in this derived compaction rate rather than on absolute counts alone. Trigger alerts when:

  • PendingTasks > 32 for > 15 minutes
  • The derived compaction rate (delta of BytesCompacted over time) drops below 40% of its 1-hour moving average
  • Data volume utilization exceeds 85% while pending tasks remain > 16
  • Read p99 latency exceeds 2× baseline during compaction windows
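The four trigger conditions above can be evaluated together against one snapshot of node state. The sketch below is one possible encoding; the Sample class, its field names, and the condition labels are assumptions made for illustration, and the caller is expected to track how long PendingTasks has stayed above 32.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One snapshot of node state; all field names are illustrative."""
    minutes_pending_above_32: float   # how long PendingTasks has stayed above 32
    pending_tasks: int
    rate_bytes_sec: float             # delta of BytesCompacted / elapsed seconds
    rate_1h_avg_bytes_sec: float      # 1-hour moving average of that rate
    disk_used_fraction: float         # 0.0 to 1.0 for the data volume
    read_p99_ms: float
    read_p99_baseline_ms: float
    compaction_active: bool = True


def triggered_conditions(s: Sample) -> List[str]:
    """Return the labels of every alert condition that currently holds."""
    fired = []
    if s.minutes_pending_above_32 > 15:
        fired.append("sustained_pending")
    if s.rate_1h_avg_bytes_sec > 0 and s.rate_bytes_sec < 0.40 * s.rate_1h_avg_bytes_sec:
        fired.append("rate_decay")
    if s.disk_used_fraction > 0.85 and s.pending_tasks > 16:
        fired.append("disk_pressure")
    if (s.compaction_active and s.read_p99_baseline_ms > 0
            and s.read_p99_ms > 2 * s.read_p99_baseline_ms):
        fired.append("read_latency")
    return fired


example = Sample(minutes_pending_above_32=20, pending_tasks=40,
                 rate_bytes_sec=10e6, rate_1h_avg_bytes_sec=50e6,
                 disk_used_fraction=0.88, read_p99_ms=30.0,
                 read_p99_baseline_ms=20.0)
print(triggered_conditions(example))
```

In this example the read latency condition does not fire, because 30 ms is below twice the 20 ms baseline.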

Alerting Architecture & Dynamic Routing

Scrape compaction metrics via JMX Exporter or the metrics endpoint. Route alerts through a correlation engine that cross-references backlog velocity with active repair windows. When repair validation and compaction compete for disk I/O, the cluster experiences validation timeouts and dropped anti-entropy sessions. Decouple these workloads by using the signals described in Async Compaction Tracking & Metrics to schedule repairs during compaction lulls rather than forcing concurrent execution.
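A correlation step of this kind can be as simple as tagging each alert with whether it fired inside a known repair window. The sketch below assumes repair windows are available as (start, end) datetime pairs from your scheduler; the function names and the alert dictionary shape are hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a scheduled repair


def in_repair_window(ts: datetime, windows: Iterable[Window]) -> bool:
    """True if ts falls inside any half-open [start, end) repair window."""
    return any(start <= ts < end for start, end in windows)


def annotate_alert(alert: Dict, windows: Iterable[Window]) -> Dict:
    """Return a copy of the alert tagged with repair-overlap context."""
    annotated = dict(alert)
    annotated["during_repair"] = in_repair_window(alert["timestamp"], windows)
    return annotated


# Illustrative schedule: one repair window from 02:00 to 04:00.
repair_windows = [(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 4, 0))]
alert = {"name": "rate_decay", "timestamp": datetime(2024, 1, 1, 3, 15)}
print(annotate_alert(alert, repair_windows)["during_repair"])
```

Downstream routing can then treat an alert raised during repair as I/O contention to investigate, rather than as evidence of a strategy problem.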

Implement tiered alert severity with automated routing:

  • Warning: Pending tasks > 16, throughput decay detected. Route to Slack/PagerDuty for DBA review.
  • Critical: Pending tasks > 64, disk > 90%, or read p99 latency > 2× baseline. Trigger automated throttling and page on-call SRE.

The tiered severity decisions are summarized below.

flowchart TD
    S["Sample pending tasks and disk usage"] --> Q1{"Pending over 2x concurrent compactors"}
    Q1 -->|"no"| OK["Healthy"]
    Q1 -->|"yes"| W["Warning"]
    W --> Q2{"Disk over 85 percent with high tombstones"}
    Q2 -->|"no"| W
    Q2 -->|"yes"| CR["Critical"]
    CR --> Q3{"Disk over 90 percent or thread starvation"}
    Q3 -->|"yes"| EM["Emergency"]
Tiered compaction-backlog alerting thresholds
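The Warning and Critical tiers listed in the bullets can be expressed as a small classifier. This is a sketch under two assumptions: the Warning tier is read as requiring both conditions (pending tasks above 16 and a detected rate decay), and the route labels are placeholders rather than real integration names. It follows the bullet thresholds, not the flowchart's Emergency tier.

```python
def classify_severity(pending_tasks: int,
                      rate_decay_detected: bool,
                      disk_used_fraction: float,
                      read_p99_ms: float,
                      read_p99_baseline_ms: float) -> str:
    """Map one snapshot to 'critical', 'warning', or 'ok'."""
    latency_breach = (read_p99_baseline_ms > 0
                      and read_p99_ms > 2 * read_p99_baseline_ms)
    # Critical: any single condition is enough.
    if pending_tasks > 64 or disk_used_fraction > 0.90 or latency_breach:
        return "critical"
    # Warning: assumed to need both a moderate backlog and a decaying rate.
    if pending_tasks > 16 and rate_decay_detected:
        return "warning"
    return "ok"


# Hypothetical routing targets; replace with your own integrations.
ROUTES = {
    "warning": "notify-dba-channel",
    "critical": "page-oncall-and-throttle",
}

severity = classify_severity(pending_tasks=20, rate_decay_detected=True,
                             disk_used_fraction=0.70, read_p99_ms=12.0,
                             read_p99_baseline_ms=10.0)
print(severity, ROUTES.get(severity, "no-action"))
```

Checking the Critical conditions first ensures that a node which satisfies both tiers is never downgraded to a Warning.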

Automated Triage & Python Workflows

Manual intervention does not scale. Deploy a lightweight Python daemon that queries the metrics endpoint, calculates backlog velocity, and emits structured telemetry. The following pattern demonstrates safe, automated triage compatible with Cassandra 4.x/5.x:

import time
import requests
import logging
from collections import deque
from datetime import datetime
from typing import Dict, Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

METRICS_ENDPOINT = "http://localhost:8080/api/v1/metrics"
BACKLOG_THRESHOLD = 32
RATE_DECAY_RATIO = 0.40

# Rolling history of derived compaction rates (bytes/sec), used for a 1-hour
# moving average. BytesCompacted is a monotonic counter, so each rate is the
# delta between consecutive samples divided by the elapsed seconds.
_rate_history: "deque[float]" = deque(maxlen=120)  # e.g. 120 samples @ 30s = 1h
# Previous (monotonic timestamp, BytesCompacted) sample used to compute the delta.
_last_sample: Optional[tuple] = None

def fetch_compaction_metrics() -> Optional[Dict]:
    try:
        resp = requests.get(f"{METRICS_ENDPOINT}/compaction", timeout=5)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException as e:
        logging.error(f"Failed to fetch compaction metrics: {e}")
        return None

def calculate_backlog_velocity(metrics: Dict) -> Dict[str, float]:
    global _last_sample
    pending = metrics.get("PendingTasks", 0)
    bytes_compacted = metrics.get("BytesCompacted", 0)  # monotonic counter

    now = time.monotonic()
    current_rate = 0.0
    decay_ratio = 1.0  # neutral until a valid rate and some history exist
    if _last_sample is not None:
        prev_ts, prev_bytes = _last_sample
        elapsed = now - prev_ts
        # Skip the sample if the counter went backwards (e.g. after a node
        # restart), so a reset is not mistaken for a collapse in throughput.
        if elapsed > 0 and bytes_compacted >= prev_bytes:
            current_rate = (bytes_compacted - prev_bytes) / elapsed  # bytes/sec
            # Compare against the moving average of earlier samples only, so
            # the current reading does not dilute its own baseline.
            moving_avg = sum(_rate_history) / len(_rate_history) if _rate_history else 0.0
            if moving_avg > 0:
                decay_ratio = current_rate / moving_avg
            _rate_history.append(current_rate)
    _last_sample = (now, bytes_compacted)

    return {
        "pending_tasks": pending,
        "rate_mb_sec": current_rate / (1024 * 1024),
        "decay_ratio": decay_ratio,
        # UTC timestamp; avoids datetime.utcnow(), deprecated since Python 3.12.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    }

def evaluate_and_act(velocity: Dict) -> None:
    if velocity["pending_tasks"] > BACKLOG_THRESHOLD and velocity["decay_ratio"] < RATE_DECAY_RATIO:
        logging.warning(f"Backlog critical: {velocity['pending_tasks']} pending tasks, compaction rate decay detected.")
        # Safe remediation: dynamically adjust compaction throughput via nodetool or JMX.
        # In production, wrap this in a circuit breaker to prevent thrashing.
        logging.info("Triggering automated compaction throughput adjustment...")
        # Uncommenting the call below also requires `import subprocess`. The value
        # is a throughput cap in megabytes per second and is illustrative only.
        # subprocess.run(["nodetool", "setcompactionthroughput", "256"], check=True)
    else:
        logging.info(f"Backlog nominal: {velocity['pending_tasks']} pending tasks.")

if __name__ == "__main__":
    # Poll on an interval so the moving average accumulates real history.
    while True:
        metrics = fetch_compaction_metrics()
        if metrics:
            velocity = calculate_backlog_velocity(metrics)
            evaluate_and_act(velocity)
        time.sleep(30)

This workflow aligns with established Python Monitoring for Cassandra Compaction patterns, ensuring telemetry is structured, idempotent, and safe for automated execution. Always wrap remediation actions in circuit breakers to prevent oscillation during transient I/O spikes.
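One way to implement the circuit breaker mentioned above is a small rate limiter that refuses further remediation once too many actions have fired in a short window. The class below is a sketch, not part of any Cassandra tooling; its name, defaults, and the injectable clock are assumptions chosen to keep it testable.

```python
import time
from typing import Callable, List


class RemediationBreaker:
    """Allow at most max_actions per window_sec, then block for cooldown_sec."""

    def __init__(self, max_actions: int = 2, window_sec: float = 900.0,
                 cooldown_sec: float = 1800.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_sec = window_sec
        self.cooldown_sec = cooldown_sec
        self._clock = clock
        self._recent: List[float] = []       # timestamps of allowed actions
        self._open_until = float("-inf")     # breaker is closed initially

    def allow(self) -> bool:
        """Return True if a remediation may run now, and record it if so."""
        now = self._clock()
        if now < self._open_until:
            return False  # still cooling down
        # Forget actions that have aged out of the window.
        self._recent = [t for t in self._recent if now - t < self.window_sec]
        if len(self._recent) >= self.max_actions:
            # Too many recent actions: trip the breaker.
            self._open_until = now + self.cooldown_sec
            self._recent.clear()
            return False
        self._recent.append(now)
        return True


breaker = RemediationBreaker()
if breaker.allow():
    print("remediation permitted")
```

The daemon would call breaker.allow() immediately before any throughput adjustment and skip the action when it returns False, which bounds how often a flapping metric can change node settings.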

Repair Scheduling & Node Management Integration

Compaction backlog directly impacts anti-entropy efficiency. In Cassandra 4.x/5.x, nodetool repair runs incrementally and in parallel by default. When backlog forces compaction to consume excessive I/O, repair sessions time out, leaving replicas inconsistent and eventually degrading the read path.

To maintain consistency without disrupting node operations:

  1. Schedule repairs during compaction lulls: Use the Python daemon to identify windows where PendingTasks < 8 and the derived compaction rate (delta of BytesCompacted) is stable.
  2. Validate repair streams: Monitor org.apache.cassandra.metrics:type=Repair,name=ActiveSessions. If active sessions exceed concurrent_compactors, pause non-critical repairs.
  3. Handle compaction errors gracefully: Implement structured logging for Compaction Error Categorization & Logging to distinguish between transient I/O stalls and corrupted SSTables. Auto-quarantine corrupted files and trigger nodetool scrub during maintenance windows.
  4. Compensate for degraded reads: When backlog forces fallback routing, tune the speculative_retry and read_repair table options (via ALTER TABLE) to prevent cascading timeouts. Note that read_repair_chance/dclocal_read_repair_chance were removed in Cassandra 4.0 and cannot be tuned on 4.x/5.x. Refer to Resolving High Compaction Backlog Without Downtime for step-by-step remediation workflows that preserve cluster availability.
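Step 1 depends on deciding when the compaction rate is "stable". A minimal sketch follows, assuming stability means a low coefficient of variation across recent rate samples; the 0.25 cutoff and the function name are illustrative choices, while the PendingTasks < 8 limit comes from the list above.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_compaction_lull(pending_samples: Sequence[int],
                       rate_samples: Sequence[float],
                       max_pending: int = 8,
                       max_rate_cv: float = 0.25) -> bool:
    """True when every pending sample is below max_pending and the derived
    compaction rate varies little across the observation window."""
    if not pending_samples or len(rate_samples) < 2:
        return False  # not enough data to call it a lull
    if max(pending_samples) >= max_pending:
        return False
    avg = mean(rate_samples)
    if avg == 0:
        return True  # compaction is idle, which is trivially stable
    # Coefficient of variation: standard deviation relative to the mean.
    return pstdev(rate_samples) / avg <= max_rate_cv


# Illustrative window: low backlog, rate hovering around 10 MB/s.
print(is_compaction_lull([2, 3, 1], [10.0, 11.0, 9.0]))
```

Feeding this function the daemon's recent samples gives the repair scheduler a single boolean to gate on, instead of re-deriving the rule in each job.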

Capacity Planning & Long-Term Baselines

Backlog analysis is not purely reactive; it is a foundational input for capacity forecasting. Track compaction throughput baselines over rolling 30-day windows to identify I/O saturation trends. When the derived compaction rate (from BytesCompacted) consistently operates at 80%+ of disk controller limits, provision additional nodes or migrate high-write tables to NVMe-backed storage classes.
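The 80% guideline can be checked against a rolling baseline of daily averages. The sketch below assumes you already store one average compaction rate per day and know the sustained throughput limit of the disk; interpreting "consistently" as at least 90% of days in the window is an assumption, as are the function names.

```python
from typing import Sequence


def share_of_saturated_days(daily_rate_mb_s: Sequence[float],
                            disk_limit_mb_s: float,
                            window_days: int = 30,
                            saturation: float = 0.80) -> float:
    """Fraction of days in the window whose average compaction rate reached
    the given share of the disk throughput limit."""
    recent = list(daily_rate_mb_s)[-window_days:]
    if not recent or disk_limit_mb_s <= 0:
        raise ValueError("need at least one sample and a positive disk limit")
    threshold = saturation * disk_limit_mb_s
    return sum(1 for rate in recent if rate >= threshold) / len(recent)


def needs_more_capacity(daily_rate_mb_s: Sequence[float],
                        disk_limit_mb_s: float,
                        min_share: float = 0.90) -> bool:
    """Flag the node when most days in the window ran near the disk limit."""
    return share_of_saturated_days(daily_rate_mb_s, disk_limit_mb_s) >= min_share


# Illustrative history: 28 busy days and 2 quiet days against a 480 MB/s disk.
history = [400.0] * 28 + [300.0] * 2
print(needs_more_capacity(history, disk_limit_mb_s=480.0))
```

Counting saturated days, rather than averaging the whole window, keeps a few idle days from masking a node that spends most of its time near the limit.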

Integrate backlog velocity into your Performance Benchmarking & Capacity Planning pipelines. Use historical PendingTasks and BytesCompacted data to model write amplification under projected traffic growth. This ensures disk provisioning, compaction thread allocation (concurrent_compactors), and throughput limits (compaction_throughput_mb_per_sec, named compaction_throughput in Cassandra 4.1 and later) scale predictably alongside ingestion velocity.

By transitioning from manual nodetool checks to automated, threshold-driven observability, SREs and DBAs can prevent compaction backlog from becoming a systemic failure vector. Continuous monitoring, dynamic alerting, and safe Python-driven remediation form the operational backbone of resilient Cassandra clusters at scale.
