Resolving High Compaction Backlog Without Downtime
High compaction backlog in production Cassandra clusters manifests as elevated PendingCompactions, degraded p99 read/write latency, and eventual CompactionExecutor thread starvation. In Cassandra v4.x/v5.x, Unified Compaction and modern I/O schedulers improve predictability, but aggressive write patterns, misaligned strategy parameters, or uncoordinated anti-entropy repairs can still saturate compaction queues. Resolving this without service interruption requires deterministic I/O arbitration, dynamic throughput scaling, and precise repair-compaction coordination. Operators should already have baseline telemetry established through Compaction Backlog Analysis & Alerting pipelines before executing the procedures below. The following workflow implements idempotent Python automation, explicit safety gates, and step-by-step validation to safely drain backlog while preserving cluster stability.
Phase 1: Quantitative Backlog Isolation & Metric Validation
Before modifying runtime parameters, isolate whether the backlog stems from write amplification, compaction strategy misalignment, or repair stream contention. Execute the following diagnostic sequence on a representative node.
# 1. Show pending tasks and the active-compaction table.
# Remaining bytes = sum of (total - completed) across the table rows;
# nodetool does NOT print an "active tasks:" / "pending bytes:" line.
nodetool compactionstats --human-readableSafety Check: Verify nodetool is responsive and the node is in UN state via nodetool status. Do not run if the node is DN or UJ.
Expected Output:
pending tasks: 312
id compaction type keyspace table completed total unit progress
c1d2e3f0-2a3b-11ef-8c4d-1a2b3c4d5e6f Compaction app events 120 GiB 300 GiB bytes 40.00%
d2e3f4a1-2a3b-11ef-8c4d-1a2b3c4d5e6f Compaction app sessions 60 GiB 198 GiB bytes 30.30%
Active compaction remaining time : 1h12m44s
Rollback Path: N/A (read-only diagnostics).
# 2. Verify CompactionExecutor queue saturation
# (keep the header row; gate on the Pending column, tpstats has no "Max" column)
nodetool tpstats | grep -E "Pool Name|CompactionExecutor"Safety Check: Ensure JMX port is accessible and no other nodetool operations are running concurrently.
Expected Output:
Pool Name Active Pending Completed Blocked All time blocked
CompactionExecutor 4 312 18452 0 0
Rollback Path: N/A.
# 3. Cross-reference with active repair streams
nodetool netstats | grep -i "repair"Safety Check: Confirm no ongoing nodetool rebuild or streaming operations that could be interrupted.
Expected Output: Either empty (no active repairs) or lines showing Repair streaming sessions with byte counts.
Rollback Path: N/A.
Intervention Thresholds:
PendingCompactions> 200 per nodeCompactionExecutorActive=concurrent_compactorsANDPending> 50- Remaining compaction bytes (sum of
total - completedfromcompactionstats) exceed what available disk I/O bandwidth can drain in the maintenance window (e.g., > 400 GB on 200 MB/s SSD arrays)
If thresholds are breached, proceed to dynamic scaling. Critical: Do not execute nodetool cleanup or nodetool scrub during active backlog. These operations compete for the same I/O scheduler and will trigger severe latency spikes or OOM conditions.
Phase 2: Idempotent Dynamic Throughput Scaling
Cassandra v4.x/v5.x supports live adjustment of compaction throughput and concurrency. The following Python module implements an idempotent resolver with explicit pre-flight validation, incremental scaling, and automatic rollback on metric regression.
#!/usr/bin/env python3
"""
Idempotent compaction backlog resolver for Cassandra v4.x/v5.x.
Scales compaction throughput and concurrency safely without downtime.
"""
import subprocess
import sys
import time
import logging
import re
from typing import Tuple, Optional
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
handlers=[logging.StreamHandler(sys.stdout)]
)
def run_nodetool(args: list, timeout: int = 15) -> Tuple[int, str, str]:
"""Execute nodetool with explicit timeout and error propagation."""
cmd = ["nodetool"] + args
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, check=True)
return result.returncode, result.stdout.strip(), result.stderr.strip()
except subprocess.CalledProcessError as e:
logging.error(f"nodetool {args} failed: {e.stderr.strip()}")
return e.returncode, "", e.stderr.strip()
except subprocess.TimeoutExpired:
logging.error(f"nodetool {args} timed out after {timeout}s")
return -1, "", "Timeout"
def get_pending_bytes() -> Optional[int]:
"""Estimate remaining compaction bytes from compactionstats.
nodetool does NOT emit a "pending bytes:" line. The real output has one row
per active compaction with `completed` and `total` byte columns, so the
remaining work is the sum of (total - completed) across those rows.
"""
rc, out, _ = run_nodetool(["compactionstats", "--human-readable"])
if rc != 0:
return None
multipliers = {
"bytes": 1, "B": 1,
"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
}
def to_bytes(value: str, unit: str) -> float:
return float(value) * multipliers.get(unit, 1)
# Match the completed/total pair on each table row, e.g.
# ... 120 GiB 300 GiB bytes 40.00%
# or the non-human-readable form: ... 128849018880 322122547200 bytes 40.00%
row_re = re.compile(
r"([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+([\d.]+)\s*(bytes|B|KiB|MiB|GiB|TiB)\s+\w+\s+[\d.]+%"
)
remaining = 0.0
for completed_v, completed_u, total_v, total_u in row_re.findall(out):
remaining += to_bytes(total_v, total_u) - to_bytes(completed_v, completed_u)
return int(max(remaining, 0))
def scale_compaction(target_throughput: int, target_concurrency: int,
original_throughput: int, original_concurrency: int,
max_attempts: int = 3) -> bool:
"""Incrementally scale compaction parameters with validation and rollback."""
logging.info(f"Target: {target_throughput} MB/s, {target_concurrency} threads")
for attempt in range(1, max_attempts + 1):
# Apply settings
rc1, _, _ = run_nodetool(["setcompactionthroughput", str(target_throughput)])
rc2, _, _ = run_nodetool(["setconcurrentcompactors", str(target_concurrency)])
if rc1 != 0 or rc2 != 0:
logging.error("Failed to apply scaling parameters. Rolling back.")
run_nodetool(["setcompactionthroughput", str(original_throughput)])
run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
return False
time.sleep(45) # Allow scheduler to stabilize
pending = get_pending_bytes()
if pending is None:
logging.warning("Unable to read pending bytes. Aborting.")
return False
logging.info(f"Attempt {attempt}: Pending bytes = {pending / (1024**3):.2f} GiB")
# Validation gate: expect >= 15% reduction per cycle
if attempt == 1:
baseline = pending
else:
reduction = (baseline - pending) / baseline
if reduction >= 0.15:
logging.info("Backlog draining successfully. Scaling complete.")
return True
else:
logging.warning(f"Insufficient drain rate ({reduction:.2%}). Rolling back.")
run_nodetool(["setcompactionthroughput", str(original_throughput)])
run_nodetool(["setconcurrentcompactors", str(original_concurrency)])
return False
return False
if __name__ == "__main__":
# Pre-flight safety check
rc, _, _ = run_nodetool(["status"])
if rc != 0:
logging.critical("Node not in UN state. Exiting.")
sys.exit(1)
# Capture current state for rollback
rc, out, _ = run_nodetool(["getcompactionthroughput"])
current_tp = int(out) if rc == 0 else 16
rc, out, _ = run_nodetool(["getconcurrentcompactors"])
current_cc = int(out) if rc == 0 else 2
# Scale to 2x throughput, +2 concurrency (cap at 8 threads to avoid I/O thrashing)
new_tp = min(current_tp * 2, 256)
new_cc = min(current_cc + 2, 8)
success = scale_compaction(new_tp, new_cc, current_tp, current_cc)
sys.exit(0 if success else 1)Safety Check: Script verifies nodetool status returns UN before execution. Captures baseline throughput/concurrency for guaranteed rollback. Caps concurrency at 8 to prevent CPU/IO thrashing on standard NVMe arrays.
Expected Output:
2024-05-12 14:02:11 [INFO] Target: 32 MB/s, 4 threads
2024-05-12 14:02:56 [INFO] Attempt 1: Pending bytes = 382.10 GiB
2024-05-12 14:02:56 [INFO] Backlog draining successfully. Scaling complete.
Rollback Path: Automatic on validation failure or nodetool error. Manual rollback: nodetool setcompactionthroughput <original_value> and nodetool setconcurrentcompactors <original_value>.
Phase 3: Anti-Entropy Repair Coordination & I/O Arbitration
Repair operations generate massive SSTable merges that directly compete with compaction. In Cassandra 4.x/5.x, streaming limits and repair scope must be explicitly bounded to prevent backlog regeneration.
# 1. Throttle repair streaming bandwidth (default is often unlimited)
nodetool setstreamthroughput 100Safety Check: Verify current streaming throughput via nodetool getstreamthroughput. Do not set below 50 MB/s if cluster is under heavy write load.
Expected Output: Stream throughput set to 100 MB/s
Rollback Path: nodetool setstreamthroughput <original_value>
# 2. Run scoped repair with compaction-friendly flags (-seq == --sequential)
nodetool repair --full --in-local-dc -seq --trace --ignore-unreplicated-keyspacesSafety Check: Ensure no other nodetool repair is running (nodetool compactionstats should show 0 Repair tasks). Run during off-peak windows if possible.
Expected Output: Repair progress logs streaming to stdout/JMX. Completion returns 0 exit code.
Rollback Path: Pressing Ctrl+C only detaches the nodetool client — it does not cleanly stop the server-side repair, and a full repair does not resume from a checkpoint. To actually stop in-flight validation/streaming, use nodetool stop VALIDATION (and nodetool stop COMPACTION if merges are saturating I/O) on the affected node, then revert the streaming throttle if needed. Plan to re-run the full repair from the start.
For advanced compaction strategy tuning, operators should align compaction_throughput_mb_per_sec with disk IOPS capacity and validate token distribution before scaling. Refer to Advanced Compaction Strategy Tuning & Monitoring for strategy-specific parameter matrices.
Phase 4: Post-Resolution Validation & Automated Rollback Triggers
Once backlog drains, validate system stability and revert temporary scaling to baseline to prevent resource exhaustion during normal operations.
# 1. Verify backlog normalization (pending count + remaining bytes from the table rows)
nodetool compactionstats --human-readableSafety Check: Confirm pending tasks is low and the remaining bytes (sum of total - completed across rows) are < 10% of total data size per node.
Expected Output:
pending tasks: 1
id compaction type keyspace table completed total unit progress
e3f4a5b6-2a3b-11ef-8c4d-1a2b3c4d5e6f Compaction app events 8 GiB 12.4 GiB bytes 64.52%
Active compaction remaining time : 0h01m03s
Rollback Path: If the remaining bytes rebound > 200 GiB, re-run Phase 2 script with conservative scaling (new_tp = current_tp * 1.5, new_cc = current_cc + 1).
# 2. Restore baseline compaction parameters
nodetool setcompactionthroughput 16
nodetool setconcurrentcompactors 2Safety Check: Monitor tpstats for 5 minutes post-revert. Ensure CompactionExecutor Pending does not spike > 50.
Expected Output: Compaction throughput set to 16 MB/s / Concurrent compactors set to 2
Rollback Path: If latency spikes or pending queue saturates, immediately re-apply scaled values: nodetool setcompactionthroughput 32 and nodetool setconcurrentcompactors 4.
# 3. Final latency & thread pool validation
nodetool tpstats | grep -E "Pool Name|MutationStage|ReadStage|CompactionExecutor"Safety Check: Cross-reference with application p99 latency dashboards. tpstats has no Max column — gate on the Pending and Blocked/All time blocked columns staying near zero for all pools.
Expected Output:
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 1048576 0 0
ReadStage 0 0 524288 0 0
CompactionExecutor 2 0 18460 0 0
Rollback Path: N/A. If thread starvation persists, investigate write amplification via nodetool tablestats and consider strategy migration (e.g., STCS → LCS or TWCS).
Operational Best Practices
- Never scale compaction throughput beyond 50% of sustained disk write bandwidth. Exceeding this threshold causes I/O queue saturation, triggering
DiskFailurePolicyevents. - Use
nodetool compactionstatsovernodetool tablestatsfor backlog tracking. The former aggregates across all strategies and provides real-time byte-level visibility. - Automate validation gates. Integrate the Python resolver into CI/CD or cron workflows with explicit alerting on
PendingCompactions > 150. - Document baseline parameters. Store original
compaction_throughputandconcurrent_compactorsvalues in infrastructure-as-code repositories for rapid recovery.
The end-to-end remediation phases are summarized below.
By enforcing deterministic scaling, isolating repair contention, and embedding explicit rollback triggers, operators can resolve compaction backlog without service degradation or manual intervention.