How to Tune compaction_throughput_mb_per_sec Safely
In Apache Cassandra 4.x and 5.x, background compaction is typically the largest consumer of disk I/O bandwidth. The compaction_throughput_mb_per_sec parameter caps the aggregate compaction throughput of a node across all keyspaces. Set too high, it can cause foreground read/write latency spikes, repair-induced streaming timeouts, and heap pressure during SSTable merges; set too low, SSTables accumulate and the compaction backlog grows. Static edits to cassandra.yaml require rolling restarts, whereas production tuning usually calls for dynamic, telemetry-driven adjustments. This guide describes an idempotent, safety-first workflow for tuning the parameter without destabilizing the cluster.
Pre-Flight Validation & Safety Boundaries
Blindly increasing throughput during active compaction backlogs or concurrent repairs risks I/O starvation. Run the following validation sequence before any adjustment.
1. Compaction Queue Health
nodetool compactionstats -H
- Safety Check: Inspect pending tasks and the per-row completed/total columns. Abort if pending tasks are high relative to active compactions, or if the remaining bytes (sum of total - completed across rows) cannot drain within a few hours at the current throughput. A high pending count usually indicates structural SSTable accumulation rather than I/O throttling.
- Expected Output (the sample shows raw byte counts, as printed without -H):
pending tasks: 12
id compaction type keyspace table completed total unit progress
a1b2c3d0-1f2e-11ef-9a3b-0f1e2d3c4b5a Compaction ks1 events 1073741824 4294967296 bytes 25.00%
b2c3d4e1-1f2e-11ef-9a3b-0f1e2d3c4b5a Compaction ks1 sessions 536870912 1610612736 bytes 33.33%
Active compaction remaining time : 0h05m12s
- Rollback Path: Do not proceed. Investigate tombstone ratios (nodetool tablestats) or trigger a targeted nodetool cleanup on over-provisioned nodes.
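The remaining-bytes check above can be sketched as a small parser. This is a minimal illustration, assuming the row layout shown in the sample output with raw byte counts (not the -H human-readable form); the function names are hypothetical, and the estimate covers only the rows currently listed, not tasks that are still pending.

```python
import re

# One active-compaction row ends with: <completed> <total> bytes <progress>%
ROW_RE = re.compile(r"(\d+)\s+(\d+)\s+bytes\s+[\d.]+%")


def remaining_bytes(stats: str) -> int:
    """Sum (total - completed) over every active compaction row."""
    return sum(int(total) - int(done) for done, total in ROW_RE.findall(stats))


def drain_hours(stats: str, throughput_mb_s: int) -> float:
    """Hours needed to drain the listed rows at the given throttle (MiB/s)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive (0 means unthrottled)")
    return remaining_bytes(stats) / (throughput_mb_s * 1024 * 1024) / 3600


SAMPLE = """pending tasks: 12
id compaction type keyspace table completed total unit progress
a1b2c3d0-1f2e-11ef-9a3b-0f1e2d3c4b5a Compaction ks1 events 1073741824 4294967296 bytes 25.00%
b2c3d4e1-1f2e-11ef-9a3b-0f1e2d3c4b5a Compaction ks1 sessions 536870912 1610612736 bytes 33.33%
Active compaction remaining time : 0h05m12s"""

if __name__ == "__main__":
    print(remaining_bytes(SAMPLE))             # 4294967296 (4 GiB left)
    print(round(drain_hours(SAMPLE, 64), 4))   # 0.0178 hours, about 64 seconds
```

A result measured in hours rather than seconds is the signal to stop and investigate before touching the throttle.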
2. Active Repair & Streaming State
nodetool netstats
- Safety Check: Verify Mode: NORMAL and that the output contains both "Not sending any streams." and "Not receiving any streams." Mode: NORMAL alone does not imply the node is idle; active streaming sessions can run while the node is NORMAL. Compaction, repair, and streaming compete for the same disk bandwidth, so raising compaction throughput while they run can surface as streaming timeouts (java.net.SocketTimeoutException) or heap pressure (java.lang.OutOfMemoryError).
- Expected Output:
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
- Rollback Path: Defer tuning until nodetool repair -pr completes, or terminate in-flight repair sessions via the JMX StorageService.forceTerminateAllRepairSessions operation (there is no nodetool repair --abort).
3. Storage Subsystem Saturation
iostat -x 1 5 | grep -E "^(Device|sd|nvme)"
- Safety Check: Ensure await < 20ms and %util < 75% on the data volume. Reference the iostat(1) manual for metric definitions. Sustained %util > 85% indicates the device is saturated; raising compaction throughput at that point only adds queueing and latency.
- Expected Output:
Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 12.0 45.0 120.0 890.0 35.41 0.12 2.10 1.05 2.50 0.80 4.50
- Rollback Path: Halt tuning. Investigate filesystem fragmentation, RAID controller cache policies, or migrate data to NVMe-backed volumes.
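The thresholds above can be checked mechanically. The following is a minimal sketch, assuming iostat -x output with a header row that starts with Device; parse_iostat and disk_is_safe are hypothetical helper names. Column names are read from the header rather than hard-coded, because newer sysstat releases print only r_await and w_await instead of a combined await column.

```python
from typing import Dict, List


def parse_iostat(report: str) -> Dict[str, Dict[str, float]]:
    """Map device name -> {column: value}; the last sample per device wins."""
    header: List[str] = []
    devices: Dict[str, Dict[str, float]] = {}
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields[1:]
        elif header and len(fields) == len(header) + 1:
            try:
                devices[fields[0]] = dict(zip(header, map(float, fields[1:])))
            except ValueError:
                continue  # not a numeric device row
    return devices


def disk_is_safe(metrics: Dict[str, float],
                 max_await_ms: float = 20.0,
                 max_util_pct: float = 75.0) -> bool:
    """Apply the await and %util limits from the safety check."""
    # Fall back to the worse of r_await/w_await when "await" is absent.
    fallback = max(metrics.get("r_await", 0.0), metrics.get("w_await", 0.0))
    await_ms = metrics.get("await", fallback)
    return await_ms < max_await_ms and metrics["%util"] < max_util_pct


SAMPLE = """Device rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
nvme0n1 0.00 0.00 12.0 45.0 120.0 890.0 35.41 0.12 2.10 1.05 2.50 0.80 4.50"""

if __name__ == "__main__":
    stats = parse_iostat(SAMPLE)
    print(disk_is_safe(stats["nvme0n1"]))  # True: await 2.10 ms, %util 4.50
```

When the command runs with an interval, as in iostat -x 1 5, the first report contains averages since boot, so keeping only the last sample per device is a deliberate choice here.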
4. Log Baseline Correlation
Cross-reference system.log and debug.log for CompactionExecutor saturation, DiskFull warnings, or SSTableRewriter stalls. Transient I/O stalls differ fundamentally from structural corruption. Formal signal-to-noise filtering and failure taxonomies for these events are documented in Compaction Error Categorization & Logging, establishing the baseline for safe intervention.
Dynamic Adjustment & Configuration
Edits to cassandra.yaml take effect only after a restart, so they are a poor fit for runtime tuning. Use nodetool (or the Management API, where deployed) for live adjustments, then persist the final value in cassandra.yaml.
Live Throughput Adjustment
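nodetool setcompactionthroughput 128
- Safety Check: Validate the new value does not exceed 50% of your disk’s sustained sequential write throughput. Never set to 0 (unlimited) in production. The live setting is not persisted and reverts to the cassandra.yaml value on restart.
- Expected Output: The command prints nothing on success. Confirm the live value with nodetool getcompactionthroughput (the unit label can vary by version):
Current compaction throughput: 128 MB/s
- Rollback Path: Immediately revert with nodetool setcompactionthroughput 64 (or the previous baseline). Monitor compactionstats for 60 seconds to confirm queue stabilization.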
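The 50% rule is simple arithmetic, sketched below. The disk figures are illustrative inputs, not measurements; throughput_cap_mb and is_safe_target are hypothetical names, and the 256 MB/s ceiling is an assumption that mirrors the MAX_SAFE_THROUGHPUT_MB bound used by the automation script in this guide.

```python
def throughput_cap_mb(disk_seq_write_mb_s: float,
                      fraction: float = 0.5,
                      ceiling_mb: int = 256) -> int:
    """Largest whole-MB/s throttle within `fraction` of the disk's write rate."""
    cap = int(disk_seq_write_mb_s * fraction)
    if cap < 1:
        # A throttle of 0 would mean "unlimited", which is never a safe result.
        raise ValueError("disk too slow to derive a non-zero throttle")
    return min(cap, ceiling_mb)


def is_safe_target(target_mb: int, disk_seq_write_mb_s: float) -> bool:
    """True when the target is non-zero and within the derived cap."""
    return 0 < target_mb <= throughput_cap_mb(disk_seq_write_mb_s)


if __name__ == "__main__":
    print(throughput_cap_mb(400))    # 200
    print(is_safe_target(128, 400))  # True
    print(is_safe_target(128, 180))  # False: the cap is 90 MB/s
    print(throughput_cap_mb(2000))   # 256, clamped by the ceiling
```

Measure the sustained sequential write rate on the actual data volume under production-like conditions; vendor peak figures tend to overstate what the device sustains.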
Persistent Configuration (Rolling Restart Required)
# cassandra.yaml
compaction_throughput_mb_per_sec: 128
- Safety Check: Apply via configuration management (Ansible/Puppet) with serial: 1 to prevent simultaneous restarts. Verify checksums match across nodes. On Cassandra 4.1 and later the setting is named compaction_throughput and takes a unit (for example 128MiB/s); the older name is still accepted but deprecated.
- Expected Output: Node restarts cleanly; nodetool getcompactionthroughput reports 128 MB/s.
- Rollback Path: Revert the YAML value, restart the node, and monitor system.log for compaction-related errors during startup.
Idempotent Python Automation Workflow
The following script enforces pre-flight guards, applies adjustments via nodetool, validates post-state, and auto-rolls back if thresholds breach. It avoids direct JMX socket manipulation to prevent connection thrashing and handles subprocess failures gracefully.
#!/usr/bin/env python3
"""
Idempotent compaction throughput tuner for Cassandra 4.x/5.x.
Enforces safety boundaries, validates queue/streaming state, and prevents thrashing.
"""
import re
import subprocess
import sys
import time
from typing import Optional, Tuple

NODETOOL = "/opt/cassandra/bin/nodetool"
MAX_SAFE_THROUGHPUT_MB = 256
MIN_SAFE_THROUGHPUT_MB = 16


def run_cmd(cmd: str, timeout: int = 30) -> Tuple[int, str]:
    """Run a command without a shell; return (exit code, stdout or error text)."""
    try:
        result = subprocess.run(
            cmd.split(), capture_output=True, text=True, timeout=timeout, check=True
        )
        return 0, result.stdout.strip()
    except subprocess.CalledProcessError as e:
        return e.returncode, (e.stderr or "").strip()
    except subprocess.TimeoutExpired:
        return 1, "Command timed out"
    except OSError as e:
        # Raised when the nodetool binary is missing or not executable.
        return 1, str(e)


def get_throughput() -> Optional[int]:
    """Read the live throttle in MB/s; None if it cannot be determined."""
    rc, out = run_cmd(f"{NODETOOL} getcompactionthroughput")
    match = re.search(r"Current compaction throughput:\s*(\d+)", out)
    return int(match.group(1)) if rc == 0 and match else None


def validate_pre_flight() -> bool:
    """Check the compaction queue and streaming state (disk I/O is checked separately)."""
    rc, out = run_cmd(f"{NODETOOL} compactionstats -H")
    if rc != 0:
        print(f"[FAIL] nodetool compactionstats failed: {out}")
        return False

    # Parse the real compactionstats format: a "pending tasks: N" line, then one
    # row per active compaction with completed/total byte columns.
    pending_match = re.search(r"pending tasks:\s*(\d+)", out)
    pending = int(pending_match.group(1)) if pending_match else 0
    active = len(re.findall(r"\bCompaction\b", out))

    # Abort when pending work dwarfs what is actively draining.
    if pending > 16 and pending > 4 * max(active, 1):
        print(f"[ABORT] Compaction backlog too high: {pending} pending, {active} active")
        return False

    rc, out = run_cmd(f"{NODETOOL} netstats")
    if rc != 0:
        print(f"[FAIL] nodetool netstats failed: {out}")
        return False

    # Mode: NORMAL alone does NOT mean idle; explicitly confirm no active streams.
    streaming_idle = (
        "Not sending any streams." in out and "Not receiving any streams." in out
    )
    if "Mode: NORMAL" not in out or not streaming_idle:
        print("[ABORT] Node not in NORMAL mode or active streaming detected")
        return False

    print("[OK] Pre-flight validation passed.")
    return True


def apply_throughput(value: int, rollback_value: int) -> bool:
    """Apply new throughput, validate, rollback on failure."""
    if not (MIN_SAFE_THROUGHPUT_MB <= value <= MAX_SAFE_THROUGHPUT_MB):
        print(f"[FAIL] Value {value} outside safe bounds [{MIN_SAFE_THROUGHPUT_MB}-{MAX_SAFE_THROUGHPUT_MB}]")
        return False

    # Idempotency: do nothing if the node already runs at the target value.
    if get_throughput() == value:
        print(f"[OK] Throughput already {value} MB/s; nothing to do.")
        return True

    # setcompactionthroughput prints nothing on success, so rely on the exit
    # code and read the value back instead of matching output text.
    rc, out = run_cmd(f"{NODETOOL} setcompactionthroughput {value}")
    if rc != 0:
        print(f"[FAIL] Adjustment failed: {out}")
        return False

    # Post-flight validation window
    time.sleep(15)
    rc, out = run_cmd(f"{NODETOOL} compactionstats -H")
    if rc == 0 and "pending tasks:" in out and get_throughput() == value:
        print("[OK] Throughput applied successfully. Current state stable.")
        return True

    # Rollback path
    print(f"[ROLLBACK] Post-validation failed. Reverting to {rollback_value} MB/s")
    run_cmd(f"{NODETOOL} setcompactionthroughput {rollback_value}")
    return False


def main():
    target = int(sys.argv[1]) if len(sys.argv) > 1 else 128
    current = get_throughput()
    # Roll back to the live value; fall back to 64 if it cannot be read.
    baseline = current if current is not None else 64

    if not validate_pre_flight():
        sys.exit(1)
    if not apply_throughput(target, baseline):
        sys.exit(1)
    print("[COMPLETE] Tuning cycle finished. Monitor compaction queue for 300s.")


if __name__ == "__main__":
    main()

Post-Adjustment Verification & Telemetry
After applying the new throughput, keep collecting telemetry. Track org.apache.cassandra.metrics:type=Compaction,name=PendingTasks, org.apache.cassandra.metrics:type=Table,name=PendingCompactions, and disk await via Prometheus/Grafana. If pending tasks drop below 5 while await remains under 10ms, keep the new value. If await climbs above 25ms or foreground write latency (WriteLatency) increases by more than 15%, reduce throughput in 25% steps until both recover.
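The decision rule in this paragraph can be written down as a pure function, which makes it easy to unit-test before wiring it to real metrics. This is a sketch under stated assumptions: the Telemetry fields are hypothetical names for values you would pull from your own monitoring stack, and the 16 MB/s floor is an assumption that mirrors the lower bound used elsewhere in this guide.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    pending_tasks: int
    await_ms: float
    write_latency_change_pct: float  # relative to the pre-change baseline


def is_settled(t: Telemetry) -> bool:
    """Pending tasks below 5 and await under 10 ms: keep the new value."""
    return t.pending_tasks < 5 and t.await_ms < 10.0


def next_throughput(current_mb: int, t: Telemetry, floor_mb: int = 16) -> int:
    """Cut by 25% when await or write latency breaches its limit, else hold."""
    if t.await_ms > 25.0 or t.write_latency_change_pct > 15.0:
        return max(floor_mb, int(current_mb * 0.75))
    return current_mb


if __name__ == "__main__":
    healthy = Telemetry(pending_tasks=3, await_ms=8.0, write_latency_change_pct=2.0)
    stressed = Telemetry(pending_tasks=40, await_ms=30.0, write_latency_change_pct=0.0)
    print(is_settled(healthy), next_throughput(128, healthy))    # True 128
    print(is_settled(stressed), next_throughput(128, stressed))  # False 96
```

Readings between the two bands (for example await of 15 ms) neither confirm nor reject the change, so the function holds the current value and leaves the decision to the next observation window.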
The safe-tuning procedure is an iterative loop: validate pre-flight state, apply a bounded change, observe telemetry, then keep the value, reduce it, or roll back.
Advanced strategy tuning requires correlating compaction metrics with read repair rates and tombstone expiration. Comprehensive telemetry pipelines and threshold alerting frameworks are detailed in Advanced Compaction Strategy Tuning & Monitoring, which provides the operational blueprint for long-term capacity planning.
Incident Rollback & Recovery Protocol
If dynamic tuning triggers I/O starvation, heap pressure, or repair desynchronization:
- Immediate Throttle: Execute nodetool setcompactionthroughput 16 to restore foreground I/O priority.
- Queue Drain: Allow pending compactions to complete naturally. Do not kill the Cassandra process with kill -9 to interrupt compaction; this risks CorruptSSTableException.
- State Verification: Run nodetool verify on affected keyspaces to confirm SSTable integrity.
- Configuration Revert: Restore cassandra.yaml to baseline, propagate via config management, and schedule a rolling restart during off-peak windows.
- Root Cause Analysis: Export system.log, gc.log, and iostat snapshots. Cross-reference with compaction strategy (SizeTiered vs Leveled) to identify structural mismatches rather than throughput misconfiguration.