How to Tune compaction_throughput_mb_per_sec Safely

In Apache Cassandra 4.x and 5.x, background compaction is often one of the largest consumers of disk I/O bandwidth. The compaction_throughput_mb_per_sec parameter (renamed compaction_throughput in 4.1, where the old name is still accepted) caps the aggregate throughput of all compaction threads on a node, across all keyspaces. A value that is too high lets compaction compete with foreground reads and writes, which can show up as latency spikes, streaming timeouts during repair, and added heap pressure during SSTable merges; a value that is too low lets SSTables accumulate. Edits to cassandra.yaml take effect only after a rolling restart, so production tuning is usually done at runtime and driven by telemetry. This guide describes a repeatable, safety-first workflow for tuning the parameter without destabilizing the cluster.

Pre-Flight Validation & Safety Boundaries

Increasing throughput blindly while a compaction backlog or a repair is in progress risks starving foreground I/O. Run the following validation sequence before any adjustment.

1. Compaction Queue Health

nodetool compactionstats
  • Safety Check: Inspect pending tasks and the per-row completed/total columns. Abort if pending tasks are high relative to active compactions, or if the remaining bytes (the sum of total - completed across rows) cannot drain within a few hours at the current throughput. A persistently high pending count often points to structural SSTable accumulation (for example, a compaction strategy that does not fit the workload) rather than to the throttle alone.
  • Expected Output:
 pending tasks: 12
 id                                   compaction type  keyspace  table    completed     total          unit   progress
 a1b2c3d0-1f2e-11ef-9a3b-0f1e2d3c4b5a  Compaction       ks1       events   1073741824    4294967296     bytes  25.00%
 b2c3d4e1-1f2e-11ef-9a3b-0f1e2d3c4b5a  Compaction       ks1       sessions 536870912     1610612736     bytes  33.33%
 Active compaction remaining time :   0h05m12s
  • Rollback Path: Do not proceed. Investigate tombstone ratios (nodetool tablestats), or run a targeted nodetool cleanup on nodes that still hold data for token ranges they no longer own (for example, after the cluster was expanded).
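
The remaining-bytes check above can be sketched in Python. This is a minimal illustration, not part of nodetool: the function names remaining_bytes and drain_hours are hypothetical, the parser assumes the row layout shown in the expected output (raw byte counts, so it will not match human-readable -H output), and the throttle is assumed to be expressed in MiB per second.

```python
import re

# Assumption: each active-compaction row ends with
# "<completed> <total> bytes <progress>%", as in the sample output above.
ROW_TAIL = re.compile(r"(\d+)\s+(\d+)\s+bytes\s+[\d.]+%\s*$")


def remaining_bytes(stats: str) -> int:
    """Sum (total - completed) over every compaction row measured in bytes."""
    remaining = 0
    for line in stats.splitlines():
        match = ROW_TAIL.search(line)
        if match:
            completed, total = int(match.group(1)), int(match.group(2))
            remaining += max(total - completed, 0)
    return remaining


def drain_hours(remaining: int, throughput_mb_s: int) -> float:
    """Estimate hours needed to drain `remaining` bytes at the given throttle."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive (0 means unthrottled)")
    # Assumption: the throttle is counted in MiB/s (1024 * 1024 bytes).
    return remaining / (throughput_mb_s * 1024 * 1024) / 3600
```

For the two sample rows, the remaining work is 4 GiB, which drains in roughly a minute at 64 MB/s. The estimate covers only active compactions; pending tasks add work that compactionstats does not size.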

2. Active Repair & Streaming State

nodetool netstats
  • Safety Check: Verify Mode: NORMAL and that the output contains both Not sending any streams. and Not receiving any streams. Mode: NORMAL alone does not imply the node is idle, because streaming sessions can run while the node is NORMAL. Compaction, repair validation, and streaming compete for the same disks; running them concurrently under a raised throttle increases the risk of streaming timeouts (java.net.SocketTimeoutException) and adds heap pressure.
  • Expected Output:
 Mode: NORMAL
 Not sending any streams.
 Not receiving any streams.
  • Rollback Path: Defer tuning until nodetool repair -pr completes, or terminate in-flight repair sessions via the JMX StorageService.forceTerminateAllRepairSessions operation (there is no nodetool repair --abort).

3. Storage Subsystem Saturation

iostat -x 1 5 | grep -E "^(Device|sd|nvme)"
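  • Safety Check: Ensure await < 20ms and %util < 75% on the data volume, and treat the 75-85% band as a warning zone. See the iostat(1) manual for metric definitions. Sustained %util > 85% suggests the device is saturated, in which case raising the throttle mostly adds queueing latency. On NVMe and other devices that serve requests in parallel, %util can overstate saturation, so weigh await more heavily.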
  • Expected Output:
 Device   rrqm/s wrqm/s   r/s   w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
 nvme0n1    0.00   0.00  12.0  45.0  120.0  890.0    35.41     0.12   2.10    1.05    2.50  0.80  4.50
  • Rollback Path: Halt tuning. Investigate filesystem fragmentation, RAID controller cache policies, or migrate data to NVMe-backed volumes.
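
The threshold check above can be automated along these lines. This is a sketch under stated assumptions: parse_iostat and io_headroom_ok are hypothetical helper names, the column layout is taken from the sample output above, and the fallback to r_await/w_await assumes a newer sysstat release that no longer prints a combined await column.

```python
from typing import Dict


def parse_iostat(report: str, device: str) -> Dict[str, float]:
    """Return the last sample for `device` as a {column: value} mapping."""
    header: list = []
    latest: Dict[str, float] = {}
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields[1:]  # column names follow the Device label
        elif fields[0] == device and len(fields) == len(header) + 1:
            latest = dict(zip(header, map(float, fields[1:])))
    return latest


def io_headroom_ok(sample: Dict[str, float],
                   max_await_ms: float = 20.0,
                   max_util_pct: float = 75.0) -> bool:
    """Apply the await and %util limits from the safety check."""
    if not sample or "%util" not in sample:
        return False  # no data for the device: fail closed
    if "await" in sample:
        await_ms = sample["await"]
    else:
        # Assumed fallback when only per-direction latencies are reported.
        await_ms = max(sample.get("r_await", 0.0), sample.get("w_await", 0.0))
    return await_ms < max_await_ms and sample["%util"] < max_util_pct
```

Because iostat -x 1 5 prints one block per interval, keeping the last matching row skips the first report, which averages over the time since boot rather than the current interval.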

4. Log Baseline Correlation

Cross-reference system.log and debug.log for CompactionExecutor saturation, DiskFull warnings, or SSTableRewriter stalls. Transient I/O stalls differ fundamentally from structural corruption. Formal signal-to-noise filtering and failure taxonomies for these events are documented in Compaction Error Categorization & Logging, establishing the baseline for safe intervention.

Dynamic Adjustment & Configuration

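Editing cassandra.yaml changes the value only after a restart, so it is a poor fit for runtime tuning. Use nodetool (or the Management API, where deployed) for live adjustments, and keep in mind that a value set at runtime is not persisted: it reverts to the cassandra.yaml setting on the next restart.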

Live Throughput Adjustment

nodetool setcompactionthroughput 128
  • Safety Check: Validate the new value does not exceed 50% of your disk’s sustained sequential write throughput. Never set to 0 (unlimited) in production.
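  • Expected Output: The command prints nothing on success and exits with status 0. Confirm the live value with nodetool getcompactionthroughput (the exact wording and unit label vary by version):
 Current compaction throughput: 128 MB/s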
  • Rollback Path: Immediately revert with nodetool setcompactionthroughput 64 (or previous baseline). Monitor compactionstats for 60 seconds to confirm queue stabilization.
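
The 50% rule in the safety check reduces to simple arithmetic, sketched below. The function names and the 16 MB/s floor are illustrative assumptions, and the sustained sequential write figure has to come from your own benchmark of the data volume.

```python
def throughput_ceiling_mb(disk_seq_write_mb_s: float, fraction: float = 0.5) -> int:
    """Largest throttle allowed: a fraction of sustained sequential write speed."""
    if disk_seq_write_mb_s <= 0:
        raise ValueError("disk throughput must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(disk_seq_write_mb_s * fraction)


def clamp_target(requested_mb: int, disk_seq_write_mb_s: float,
                 floor_mb: int = 16) -> int:
    """Clamp a requested throttle between the floor and the 50% ceiling."""
    if requested_mb <= 0:
        # 0 disables throttling entirely, which this workflow never allows.
        raise ValueError("refusing to disable the compaction throttle")
    ceiling = throughput_ceiling_mb(disk_seq_write_mb_s)
    # The ceiling wins if a very slow disk puts it below the floor.
    return min(max(requested_mb, floor_mb), ceiling)
```

For a volume that sustains 400 MB/s of sequential writes, a request for 512 is clamped to 200, while a request for 128 passes through unchanged.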

Persistent Configuration (Rolling Restart Required)

# cassandra.yaml (4.0 syntax; 4.1 and later prefer compaction_throughput: 128MiB/s
# but still accept the old name)
compaction_throughput_mb_per_sec: 128
  • Safety Check: Apply via configuration management (Ansible/Puppet) with serial: 1 to prevent simultaneous restarts. Verify checksums match across nodes.
  • Expected Output: Node restarts cleanly; nodetool getcompactionthroughput reports 128 MB/s.
  • Rollback Path: Revert the YAML value, restart the node, and check system.log for compaction-related errors during startup.

Idempotent Python Automation Workflow

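The following script enforces pre-flight guards on the compaction queue and streaming state, applies the adjustment via nodetool, validates the post-state, and rolls back if post-validation fails. It shells out to nodetool rather than opening its own JMX connections, and it handles subprocess failures gracefully. It does not inspect disk I/O, so run the iostat check separately before invoking it.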

#!/usr/bin/env python3
"""
Idempotent compaction throughput tuner for Cassandra 4.x/5.x.
Enforces safety boundaries, validates compaction and streaming state, and rolls back on failure.
"""
import subprocess
import sys
import time
import re
from typing import Tuple

NODETOOL = "/opt/cassandra/bin/nodetool"
MAX_SAFE_THROUGHPUT_MB = 256
MIN_SAFE_THROUGHPUT_MB = 16

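def run_cmd(cmd: str, timeout: int = 30) -> Tuple[int, str]:
    """Run a command without a shell; return (exit code, stdout or error text)."""
    try:
        result = subprocess.run(
            cmd.split(), capture_output=True, text=True, timeout=timeout, check=True
        )
        return 0, result.stdout.strip()
    except subprocess.CalledProcessError as e:
        # Fall back to stdout when the tool reports its error there.
        return e.returncode, (e.stderr or e.stdout or "").strip()
    except subprocess.TimeoutExpired:
        return 1, "Command timed out"
    except OSError as e:
        # Covers a missing or non-executable nodetool binary.
        return 1, f"Could not execute command: {e}"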

def validate_pre_flight() -> bool:
    """Check the compaction queue and streaming state (disk I/O is not checked here)."""
    rc, out = run_cmd(f"{NODETOOL} compactionstats -H")
    if rc != 0:
        print(f"[FAIL] nodetool compactionstats failed: {out}")
        return False
    
    # Parse the real compactionstats format: a "pending tasks: N" line, then one
    # row per active compaction with completed/total byte columns.
    pending_match = re.search(r"pending tasks:\s*(\d+)", out)
    pending = int(pending_match.group(1)) if pending_match else 0
    active = len(re.findall(r"\bCompaction\b", out))
    # Abort when pending work dwarfs what is actively draining.
    if pending > 16 and pending > 4 * max(active, 1):
        print(f"[ABORT] Compaction backlog too high: {pending} pending, {active} active")
        return False

    rc, out = run_cmd(f"{NODETOOL} netstats")
    if rc != 0:
        print(f"[FAIL] nodetool netstats failed: {out}")
        return False
    # Mode: NORMAL alone does NOT mean idle; explicitly confirm no active streams.
    streaming_idle = (
        "Not sending any streams." in out and "Not receiving any streams." in out
    )
    if "Mode: NORMAL" not in out or not streaming_idle:
        print("[ABORT] Node not in NORMAL mode or active streaming detected")
        return False

    print("[OK] Pre-flight validation passed.")
    return True

def apply_throughput(value: int, rollback_value: int) -> bool:
    """Apply new throughput, validate, rollback on failure."""
    if not (MIN_SAFE_THROUGHPUT_MB <= value <= MAX_SAFE_THROUGHPUT_MB):
        print(f"[FAIL] Value {value} outside safe bounds [{MIN_SAFE_THROUGHPUT_MB}-{MAX_SAFE_THROUGHPUT_MB}]")
        return False

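    rc, out = run_cmd(f"{NODETOOL} setcompactionthroughput {value}")
    # setcompactionthroughput prints nothing on success, so rely on the exit code.
    if rc != 0:
        print(f"[FAIL] Adjustment failed: {out}")
        return False

    # Post-flight validation window
    time.sleep(15)
    # Confirm the node reports the new value and still answers compactionstats.
    # The getcompactionthroughput wording varies by version, so match loosely.
    rc_get, current = run_cmd(f"{NODETOOL} getcompactionthroughput")
    applied = re.search(r"throughput[^:]*:\s*(\d+(?:\.\d+)?)", current)
    rc, out = run_cmd(f"{NODETOOL} compactionstats -H")
    value_ok = rc_get == 0 and applied is not None and float(applied.group(1)) == value
    if value_ok and rc == 0 and "pending tasks:" in out:
        print("[OK] Throughput applied successfully. Current state stable.")
        return True

    # Rollback path
    print(f"[ROLLBACK] Post-validation failed. Reverting to {rollback_value} MB/s")
    run_cmd(f"{NODETOOL} setcompactionthroughput {rollback_value}")
    return False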

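def main():
    try:
        target = int(sys.argv[1]) if len(sys.argv) > 1 else 128
    except ValueError:
        print(f"[FAIL] Target must be an integer MB/s value, got {sys.argv[1]!r}")
        sys.exit(2)
    baseline = 64  # Rollback value: default, or fetched from config management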
    
    if not validate_pre_flight():
        sys.exit(1)
        
    if not apply_throughput(target, baseline):
        sys.exit(1)
        
    print("[COMPLETE] Tuning cycle finished. Monitor compaction queue for 300s.")

if __name__ == "__main__":
    main()

Post-Adjustment Verification & Telemetry

After applying the new throughput, keep watching telemetry. Track org.apache.cassandra.metrics:type=Compaction,name=PendingTasks, org.apache.cassandra.metrics:type=Table,name=PendingCompactions, and disk await via Prometheus/Grafana. If pending tasks drop below 5 while await stays under 10ms, hold the new value. If await climbs above 25ms or foreground write latency (WriteLatency) rises by more than 15% over its pre-change baseline, reduce throughput in 25% steps until both recover.
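
The thresholds in the paragraph above can be written as a small decision function. This is a sketch only: next_throughput is a hypothetical name, the inputs are assumed to come from whatever metrics pipeline you already run, and the 16 MB/s floor is an assumption borrowed from the script's lower bound.

```python
from typing import Tuple


def next_throughput(current_mb: int, pending_tasks: int, await_ms: float,
                    write_latency_increase_pct: float,
                    floor_mb: int = 16) -> Tuple[int, str]:
    """Return (throttle to apply, action), where action is reduce, hold, or observe."""
    # Back off first: foreground latency matters more than backlog size.
    if await_ms > 25 or write_latency_increase_pct > 15:
        reduced = max(floor_mb, int(current_mb * 0.75))  # one 25% step down
        return reduced, "reduce"
    # Backlog drained and the disk is comfortable: keep the current value.
    if pending_tasks < 5 and await_ms < 10:
        return current_mb, "hold"
    # Neither threshold crossed: leave the value alone and keep watching.
    return current_mb, "observe"
```

With a 128 MB/s throttle and await at 30ms, the function steps down to 96. Calling it once per observation window, rather than on every scrape, avoids oscillating between values.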

The safe-tuning procedure is an iterative loop, summarized below.

flowchart LR
  A["Measure pending tasks and IO wait"] --> B["Adjust nodetool setcompactionthroughput"]
  B --> C["Observe backlog and latency"]
  C --> Q{"Stable within thresholds"}
  Q -->|"no"| A
  Q -->|"yes"| D["Hold baseline"]
Safe compaction-throughput tuning loop

Advanced strategy tuning requires correlating compaction metrics with read repair rates and tombstone expiration. Comprehensive telemetry pipelines and threshold alerting frameworks are detailed in Advanced Compaction Strategy Tuning & Monitoring, which provides the operational blueprint for long-term capacity planning.

Incident Rollback & Recovery Protocol

If dynamic tuning triggers I/O starvation, heap pressure, or repair desynchronization:

  1. Immediate Throttle: Execute nodetool setcompactionthroughput 16 to restore foreground I/O priority.
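  2. Queue Drain: Allow pending compactions to complete naturally. If a compaction must be interrupted, use nodetool stop COMPACTION rather than killing the Cassandra process with kill -9; an unclean shutdown forces commit log replay on restart and leaves partially written SSTables behind to clean up.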
  3. State Verification: Run nodetool verify on affected keyspaces to confirm SSTable integrity.
  4. Configuration Revert: Restore cassandra.yaml to baseline, propagate via config management, and schedule a rolling restart during off-peak windows.
  5. Root Cause Analysis: Export system.log, gc.log, and iostat snapshots. Cross-reference with compaction strategy (SizeTiered vs Leveled) to identify structural mismatches rather than throughput misconfiguration.