Python Monitoring for Cassandra Compaction (4.x/5.x)

Compaction is the dominant I/O consumer in Apache Cassandra and the primary driver of P99 latency degradation during maintenance windows, node decommissioning, and repair cycles. Manual nodetool compactionstats polling and static thresholds cannot keep pace with the dynamic reality of SSTable merging, anti-entropy synchronization, and background thread scheduling on a live 4.x or 5.x cluster. This guide is for DBAs, SREs, and Python automation builders who need to replace ad-hoc checks with a deterministic monitoring pipeline: structured telemetry ingestion, automated backlog computation, and safe orchestration of repair and node-lifecycle workflows. It sits under Advanced Compaction Strategy Tuning & Monitoring — read that first if you have not yet aligned your table-level strategy with your workload. Every command and threshold below is validated against Cassandra 4.0, 4.1, and 5.0, with version drift called out inline.

At a high level, the pipeline collects signals, evaluates compaction state, and dispatches a single safe action per interval.

Concept: what a Python monitor actually measures

A compaction monitor is not reading one number; it is reconstructing an operational state from several counters. Cassandra submits compaction tasks to the bounded CompactionExecutor thread pool — sized by concurrent_compactors and throttled by the shared compaction_throughput budget — and exposes the pool’s telemetry through JMX. The three counters a Python agent differentiates and correlates are:

org.apache.cassandra.metrics:type=Compaction,name=PendingTasks — the queue depth of the executor pool.
org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks — a monotonic count of finished tasks.
org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted — a monotonic byte counter you differentiate to recover throughput in bytes/sec.

The distinction between depth and velocity is the whole game. A node whose PendingTasks climbs while BytesCompacted still rises is merely busy; a node whose pending climbs while BytesCompacted is flat is stalled. Depth alone cannot separate the two, which is why raw count thresholds generate noise — the same reasoning that drives velocity-aware alerting in compaction backlog analysis & alerting. The strategy in force shapes what “healthy” looks like: the trade-offs between STCS, LCS, and TWCS mean a size-tiered table bursts pending tasks on every tier merge, while a leveled table holds a tighter, flatter queue. Poll cadence should track that: LeveledCompactionStrategy benefits from tighter 5-second intervals, while SizeTieredCompactionStrategy tolerates 10-second sampling. Because compaction reclaims tombstones on merge, a stalled queue also lets deleted data survive past expiry, coupling this monitor to tombstone management and garbage collection.
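
The depth-versus-velocity rule can be sketched as a small classifier over two consecutive samples. This is an illustration only: `Sample`, `classify`, and the state labels are hypothetical names, not part of Cassandra or any exporter, and the sketch simplifies "pending climbs" to "pending is nonzero".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of the two counters (hypothetical container)."""
    pending_tasks: int
    bytes_compacted: int


def classify(prev: Sample, curr: Sample) -> str:
    """Label the interval between two samples as idle, busy, stalled, or unknown."""
    bytes_moved = curr.bytes_compacted - prev.bytes_compacted
    if bytes_moved < 0:
        # The counter went backwards (for example after a restart): do not guess.
        return "unknown"
    if curr.pending_tasks == 0:
        return "idle"
    if bytes_moved > 0:
        # Work is queued but bytes are still flowing: merely busy.
        return "busy"
    # Work is queued and no bytes moved during the interval: stalled.
    return "stalled"
```

A single flat interval can also mean the sample window was too short, so in practice it is safer to require several consecutive "stalled" results before alerting.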

Modern Cassandra exposes these MBeans over JMX, and the Prometheus JMX Exporter normalizes them into scrapeable time-series such as cassandra_compaction_pendingtasks, cassandra_compaction_completedtasks, and cassandra_compaction_bytescompacted. The exact normalized names depend on your exporter configuration, so treat the names above as illustrative and confirm them against your own /metrics output before wiring alerts.
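
One way to confirm the names is to list every compaction-related sample name found in a saved copy of the /metrics output. The helper below is a sketch with a hypothetical name (`compaction_metric_names`); it assumes the standard Prometheus text format, where each sample line starts with the metric name followed by an optional label block and the value.

```python
def compaction_metric_names(metrics_text: str, needle: str = "compaction") -> list[str]:
    """Return the sorted, de-duplicated sample names that mention `needle`."""
    names: set[str] = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # The name ends at the label block or at the first space, whichever comes first.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if needle in name.lower():
            names.add(name)
    return sorted(names)
```

Feed it the body of a manual scrape and compare the result with the names your alerts expect before relying on them.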

Configuration reference

The node-scoped settings below govern how much compaction work is available to drain and, by extension, how a Python monitor should interpret its telemetry. Values in cassandra.yaml are per-node; strategy parameters are table-scoped via ALTER TABLE.

Key	Default	Valid range	Impact on monitoring & throughput
`concurrent_compactors`	`min(cores, disks)`, clamped to 2–8	`1` – core count	Sets the `ActiveTasks` ceiling; scale your `PendingTasks` alert to `2 × concurrent_compactors`
`compaction_throughput` (4.1+) / `compaction_throughput_mb_per_sec` (≤4.0)	`64` MiB/s (`0` = unlimited)	`0` – disk write ceiling	Caps aggregate `BytesCompacted` growth; benchmark measured velocity as a fraction of this ceiling
`sstable_preemptive_open_interval` (4.x)	`50` MiB	`-1` (disabled) – 100s of MiB	Controls how early merged output is visible to reads; smooths read-amp while a backlog drains
`concurrent_validations`	`0` (tied to compactors)	`0` – core count	Sizes the repair validation pool; validations compete with compaction for disk I/O, which is the contention your monitor must gate repair around
`compaction_large_partition_warning_threshold`	`100` MiB	tens–hundreds of MiB	Emits WARN log lines that correlate with slow-draining tasks on skewed partitions

Note the 4.1 rename from compaction_throughput_mb_per_sec to compaction_throughput; automation that shells out to nodetool setcompactionthroughput <MiB/s> works on both, but the YAML keys differ. A minimal throughput-sensitive block:

# cassandra.yaml — Cassandra 4.1+ / 5.0
# Pre-4.1 clusters: use compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4
compaction_throughput: 128MiB/s
sstable_preemptive_open_interval: 50MiB
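
To compare measured velocity with these settings, the monitor needs the ceiling in bytes/sec and a pending-task threshold derived from `concurrent_compactors`. The helpers below are a sketch under stated assumptions: the names are hypothetical, and the parser only accepts the `<number>MiB/s` form shown in the YAML block, not every unit Cassandra may allow.

```python
import re
from typing import Optional

MIB = 1024 * 1024


def throughput_ceiling_bytes(value: str) -> Optional[float]:
    """Convert a '<n>MiB/s' setting to bytes/sec; None means unthrottled (0)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*MiB/s\s*", value)
    if match is None:
        raise ValueError(f"unsupported throughput value: {value!r}")
    mib_per_sec = float(match.group(1))
    return None if mib_per_sec == 0 else mib_per_sec * MIB


def pending_alert_threshold(concurrent_compactors: int) -> int:
    """Apply the 2 x concurrent_compactors rule from the configuration table."""
    return 2 * concurrent_compactors


def velocity_fraction(measured_bytes_per_sec: float,
                      ceiling: Optional[float]) -> Optional[float]:
    """Measured velocity as a fraction of the ceiling; None when unthrottled."""
    return None if ceiling is None else measured_bytes_per_sec / ceiling
```

With the example configuration, `pending_alert_threshold(4)` gives 8, and a measured 64 MiB/s against the 128 MiB/s ceiling gives a fraction of 0.5.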

Step-by-step: standing up the monitoring pipeline

The procedure below builds the pipeline in three stages — collect, evaluate, gate — then wires them into a poll loop. Run each stage against a single representative node before rolling out cluster-wide.

1. Confirm the telemetry surface is reachable

Capture a baseline before automating anything. On Cassandra 4.x/5.x:

# Human-readable queue snapshot + the configured throughput ceiling
nodetool compactionstats -H
nodetool getcompactionthroughput

Expected output resembles:

pending tasks: 3
- keyspace.table: 2
id  compaction type  keyspace  table  completed  total   unit  progress
...                                    1.42 GiB   3.1 GiB bytes 45.80%
Current compaction throughput: 128 MB/s
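
If the baseline needs to be recorded programmatically, the pending-task lines can be pulled out of that text. This is a sketch that assumes the layout shown in the sample output; `parse_compactionstats` is a hypothetical helper, and the real format can differ between versions, so unmatched lines are ignored rather than guessed at.

```python
import re
from typing import Optional


def parse_compactionstats(output: str) -> tuple[Optional[int], dict[str, int]]:
    """Extract the total and per-table pending counts from compactionstats text."""
    total: Optional[int] = None
    per_table: dict[str, int] = {}
    for raw in output.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"pending tasks:\s*(\d+)", line)
        if match:
            total = int(match.group(1))
            continue
        # Per-table lines look like "- keyspace.table: 2" in the sample above.
        match = re.fullmatch(r"-\s*(\S+):\s*(\d+)", line)
        if match:
            per_table[match.group(1)] = int(match.group(2))
    return total, per_table
```

A `None` total means the expected line was not found, which the caller should treat as unknown state rather than as zero pending tasks.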

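Virtual tables, available since Cassandra 4.0, let you read in-flight task state over CQL without JMX, which is essential inside restricted networks. The column set can differ between releases, so confirm that the columns below exist on your version before automating against them: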

SELECT keyspace_name, table_name, completion_ratio, unit
FROM system_views.sstable_tasks;

2. Collect signals with a resilient scraper

Point the collector at the Prometheus JMX Exporter endpoint running alongside Cassandra. Connection pooling and bounded retries keep transient GC pauses or memtable-flush spikes from turning into false failures:

#!/usr/bin/env python3
# requirements: requests>=2.31, urllib3>=2.0
import logging
from requests import Session, RequestException
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")


class CassandraMetricsCollector:
    def __init__(self, node_ip: str, prom_port: int = 9103) -> None:
        self.base_url = f"http://{node_ip}:{prom_port}/metrics"
        self.session = Session()
        retry = Retry(total=3, backoff_factor=0.5,
                      status_forcelist=[429, 500, 502, 503, 504])
        self.session.mount("http://", HTTPAdapter(max_retries=retry))

    def scrape_compaction_metrics(self) -> dict[str, float]:
        """Parse the Prometheus text format for compaction-specific metrics."""
        try:
            resp = self.session.get(self.base_url, timeout=5)
            resp.raise_for_status()
        except RequestException as e:
            logging.error(f"Failed to scrape metrics from {self.base_url}: {e}")
            return {}

        metrics: dict[str, float] = {}
        for line in resp.text.splitlines():
            # Comment lines start with "#", so the prefix test already skips them.
            if not line.startswith("cassandra_compaction_"):
                continue
            parts = line.split()  # format: metric_name{labels} value
            if len(parts) >= 2:
                key = parts[0].split("{")[0]
                try:
                    metrics[key] = float(parts[1])
                except ValueError:
                    continue
        return metrics

3. Evaluate backlog velocity

Compaction velocity must be judged against write-ingestion rate, not in isolation. Differentiate the monotonic cassandra_compaction_bytescompacted counter over a fixed interval and smooth it with an exponential moving average so a single memtable flush or GC stall does not swing the decision. The evaluator returns a smoothed throughput and a backlog deficit — the same velocity math that feeds the thresholds in async compaction tracking & metrics:

#!/usr/bin/env python3
# requirements: (standard library only)
from collections import deque


class CompactionBacklogEvaluator:
    def __init__(self, window_size: int = 12, ema_alpha: float = 0.3) -> None:
        self.ema_alpha = ema_alpha
        self.bytes_history: deque[float] = deque(maxlen=window_size)
        self.time_history: deque[float] = deque(maxlen=window_size)
        self.ema_throughput = 0.0

    def update(self, bytes_compacted: float, timestamp: float) -> float | None:
        """Update the sliding window; return EMA-smoothed throughput (bytes/sec)."""
        if not self.bytes_history:
            self.bytes_history.append(bytes_compacted)
            self.time_history.append(timestamp)
            return None

        delta_bytes = bytes_compacted - self.bytes_history[-1]
        delta_time = timestamp - self.time_history[-1]
        self.bytes_history.append(bytes_compacted)
        self.time_history.append(timestamp)
        # Skip non-advancing clocks and counter resets: a node restart zeroes
        # BytesCompacted, which would otherwise produce a negative rate.
        if delta_time <= 0 or delta_bytes < 0:
            return self.ema_throughput

        current = delta_bytes / delta_time
        self.ema_throughput = (self.ema_alpha * current) + ((1 - self.ema_alpha) * self.ema_throughput)
        return self.ema_throughput

    def compute_backlog_deficit(self, write_throughput: float) -> float:
        """Return the deficit in bytes/sec. Positive means the backlog is growing."""
        return max(0.0, write_throughput - self.ema_throughput)
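
To see how the smoothing behaves, the same update rule can be applied to a short series of rates. The functions below are a standalone sketch with hypothetical names (`ema_series`, `backlog_deficit`) that mirror the formula in the evaluator; the input rates are made-up numbers, not measurements.

```python
def ema_series(rates: list[float], alpha: float = 0.3, seed: float = 0.0) -> list[float]:
    """Apply ema = alpha * rate + (1 - alpha) * ema to each rate in order."""
    smoothed: list[float] = []
    ema = seed
    for rate in rates:
        ema = alpha * rate + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def backlog_deficit(write_rate: float, smoothed_rate: float) -> float:
    """Deficit in the same unit as the inputs; zero when compaction keeps up."""
    return max(0.0, write_rate - smoothed_rate)
```

With alpha = 0.3 and a zero seed, the rates 100, 100, 0, 100 smooth to roughly 30, 51, 35.7, and 55. Two things follow: a single stalled interval drops the estimate by only 30%, and the zero seed biases the first few samples low, so decisions made during warm-up deserve less trust.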

4. Gate repair and lifecycle actions behind safety checks

Compaction state dictates safe execution windows for nodetool repair and node decommissioning. In Cassandra 4.x/5.x incremental repair is the default, but full anti-entropy sweeps remain necessary after topology changes — a trade-off covered in read-repair vs anti-entropy repair. Because a healthy node almost always has at least one pending task, gate on a sensible backlog ceiling (roughly MAX_PENDING = 20) rather than PendingTasks > 0. Defer repair if pending exceeds that ceiling, if measured throughput exceeds 70% of the configured limit, or if streaming is active — any of which would risk ReadTimeout exceptions or stalled streams:

#!/usr/bin/env python3
# requirements: (standard library only)
import logging
import subprocess


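def is_node_safe_for_repair(node_ip: str, pending_tasks: int, throughput: float,
                            max_throughput: float, max_pending: int = 20) -> bool:
    """Validate node state before triggering repair. All gates must pass.

    throughput and max_throughput must share a unit (bytes/sec here); convert the
    MiB/s value reported by nodetool getcompactionthroughput before comparing.
    """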
    if pending_tasks > max_pending:
        logging.warning(f"Deferring repair on {node_ip}: {pending_tasks} pending tasks (> {max_pending}).")
        return False
    if throughput > (max_throughput * 0.7):
        logging.warning(f"Deferring repair on {node_ip}: throughput at {throughput:.0f} bytes/sec.")
        return False

    # Verify streaming is idle via nodetool (v4.x/5.x compatible).
    try:
        result = subprocess.run(["nodetool", "-h", node_ip, "netstats"],
                                capture_output=True, text=True, timeout=10, check=True)
    except subprocess.SubprocessError as e:
        logging.error(f"Failed to check netstats on {node_ip}: {e}")
        return False  # fail closed: never repair on an unknown state
    # An idle node prints "Not sending any streams."; anything else counts as active.
    if "Not sending any streams" not in result.stdout:
        logging.warning(f"Deferring repair on {node_ip}: active streaming detected.")
        return False
    return True


def schedule_repair(node_ip: str, full_repair: bool = False) -> None:
    """Execute repair with v4.x/5.x-compatible flags (incremental is the default)."""
    cmd = ["nodetool", "-h", node_ip, "repair"]
    if full_repair:
        cmd.append("--full")
    logging.info(f"Executing: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

Wire the three stages into a single poll loop that samples on the interval, evaluates state once, and takes at most one action — never poll faster than your scrape cadence, or the differentiated rate becomes noise. For per-task runtime deltas and throughput estimates on individual compactions, layer in the child procedure, Python script for real-time compaction latency tracking.
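
A minimal sketch of such a loop follows. It is deliberately decoupled from the classes above: `run_poll_loop` and its parameters are hypothetical names, the three stages are passed in as callables, and `max_iterations` and the injectable `sleep` exist only to make the loop testable.

```python
import time
from typing import Callable, Optional


def run_poll_loop(
    scrape: Callable[[], dict],
    decide: Callable[[dict], Optional[str]],
    act: Callable[[str], None],
    interval_seconds: float,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Sample, evaluate once, and take at most one action per interval.

    Returns the number of actions taken.
    """
    actions = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        metrics = scrape()
        if metrics:  # an empty scrape is unknown state: take no action
            action = decide(metrics)
            if action is not None:
                act(action)
                actions += 1
        sleep(interval_seconds)
    return actions
```

In this arrangement `scrape` would wrap the collector, `decide` would combine the evaluator with the repair gate, and `act` would dispatch the chosen action; keeping them as plain callables lets each stage be replaced with a stub in tests.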

Verification & observability

Confirm the pipeline is reading true state and that its gating actually moved the backlog rather than just logging. Watch the queue drain directly:

# PendingTasks should trend down and BytesCompacted rise after a throughput bump.
nodetool compactionstats -H

On Cassandra 5.0, correlate completion against history to prove work is retiring:

SELECT keyspace_name, columnfamily_name, compacted_at, bytes_in, bytes_out
FROM system.compaction_history
WHERE keyspace_name = 'telemetry' ALLOW FILTERING;

Grep the logs for throughput changes and the starvation warnings the monitor should have caught first:

grep -E "compaction_throughput|Compacted .* to level|CompactionExecutor" /var/log/cassandra/system.log

A healthy signature is PendingTasks falling, BytesCompacted rising steeply, and SSTablesPerReadHistogram (from nodetool tablehistograms) trending back toward baseline. Note the surface drift: nodetool tablestats/tablehistograms are current, while cfstats/cfhistograms are deprecated aliases and must not be baked into new automation. Route structured failure logs through the categories defined in compaction error categorization & logging so a stall is never mistaken for a healthy busy period.

Failure modes & rollback

Scrape gap misread as a drained queue. If the JMX Exporter is briefly unreachable, scrape_compaction_metrics returns an empty dict and a naive loop treats “no metrics” as “no backlog,” silently skipping remediation. Detection: gaps in the Prometheus series and repeated Failed to scrape metrics log lines. Rollback: fail closed — treat an empty scrape as unknown state (never as safe-to-repair), hold the previous evaluation, and alert if the gap exceeds two intervals.
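
The fail-closed rule can be sketched as a small tracker that sits between the scraper and the evaluator. `ScrapeGapTracker` and its state labels are hypothetical; the only behavior assumed from the collector is that a failed scrape yields an empty dict.

```python
class ScrapeGapTracker:
    """Hold the last good scrape and escalate when gaps persist."""

    def __init__(self, max_gap_intervals: int = 2) -> None:
        self.max_gap_intervals = max_gap_intervals
        self.consecutive_gaps = 0
        self.last_good: dict[str, float] = {}

    def observe(self, metrics: dict[str, float]) -> tuple[str, dict[str, float]]:
        """Return (state, metrics_to_use); state is 'ok', 'unknown', or 'alert'."""
        if metrics:
            self.consecutive_gaps = 0
            self.last_good = dict(metrics)
            return "ok", self.last_good
        # Empty scrape: keep the previous evaluation and count the gap.
        self.consecutive_gaps += 1
        if self.consecutive_gaps > self.max_gap_intervals:
            return "alert", self.last_good
        return "unknown", self.last_good
```

Callers should treat both "unknown" and "alert" as not safe to repair; the held metrics are for display and continuity only, not for approving new actions.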

Repair fired during active streaming. Skipping the netstats gate — or running it against a node mid-bootstrap — lets schedule_repair compete with streaming for disk I/O, timing out validation sessions and leaving inconsistent replicas. Detection: validation timed out in the logs and streaming MBeans under type=Streaming showing active sessions. Rollback: pause non-critical repairs, confirm PendingTasks < 8 and idle streaming, then resume during the lull. If backlog is the root cause, route to fallback routing & read-path optimization to shed load first.

Corrupt-SSTable stall masquerading as backlog. A single unreadable SSTable pins a compaction task and inflates PendingTasks indefinitely, so a throughput bump wastes I/O without draining anything. Detection: PendingTasks flat-high with ActiveTasks at 0 and CorruptSSTableException in debug.log. Rollback: auto-quarantine the file, run nodetool scrub in a maintenance window, and gate throughput automation on ActiveTasks > 0 so it never fires against a corruption stall.
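
The ActiveTasks gate can be expressed as a single predicate evaluated before any throughput change. This is a sketch: `may_raise_throughput` is a hypothetical helper, and the default ceiling of 20 simply reuses the backlog ceiling suggested earlier in this guide.

```python
def may_raise_throughput(pending_tasks: int, active_tasks: int,
                         max_pending: int = 20) -> bool:
    """Allow a throughput bump only for a backlog that is actually being worked."""
    if pending_tasks <= max_pending:
        return False  # no backlog worth remediating
    if active_tasks <= 0:
        return False  # nothing is running: likely a stall, not a slow drain
    return True
```

When the predicate is false because `active_tasks` is zero, the right response is to inspect the logs for corruption rather than to retry the bump.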

FAQ

How often should the monitor poll compaction metrics?

Match the poll interval to your Prometheus scrape cadence — typically 5–10 seconds — and never faster, because the differentiated BytesCompacted rate becomes pure noise below the scrape resolution. Tighten toward 5 seconds for LeveledCompactionStrategy tables with aggressive multi-level merges, and relax toward 10 seconds for SizeTieredCompactionStrategy, whose bursty tier merges are better judged over a wider window.

Should the agent read from nodetool or JMX?

Read JMX (or a Prometheus JMX-exporter scrape) for counters you differentiate like BytesCompacted, because it avoids spawning a subprocess per sample and keeps the rate math clean. Reserve nodetool compactionstats -H for the human-verifiable snapshot and per-compaction progress. On Cassandra 5.0, prefer the system_views.sstable_tasks virtual table over parsing text output.

Why gate repair on a pending-task ceiling instead of zero pending tasks?

A healthy node almost always has at least one pending compaction task, so PendingTasks > 0 would defer repair forever. Gate on a ceiling scaled to the node — roughly 2 × concurrent_compactors, or about 20 on a well-provisioned node — combined with a throughput check, so repair proceeds during genuine lulls rather than never.

What happens if compaction throughput exceeds 70% of the configured limit during repair?

Repair validation compactions are scheduled by the same compaction manager and compete for the same disk I/O as normal compaction. When compaction is already consuming 70% or more of the ceiling, adding repair validation saturates disk I/O, causing validation timeouts, dropped streams, and read-path latency spikes. The safety gate defers repair until measured throughput falls back below that fraction.

Does UnifiedCompactionStrategy in Cassandra 5.0 change how I monitor?

Yes. UCS introduces adaptive thread scaling, so a static concurrent_compactors assumption no longer holds. Feed the monitor’s real-time backlog velocity back into throughput and compactor decisions rather than hard-coding them, and confirm your exporter still surfaces the same Compaction MBeans, since the strategy internals changed even though the metric surface is largely stable.

Related guides

Advanced Compaction Strategy Tuning & Monitoring: the parent guide covering strategy selection, tuning, and observability end to end.
Python script for real-time compaction latency tracking: per-task runtime deltas and throughput estimates that build on this pipeline.
Compaction backlog analysis & alerting: velocity-aware thresholds and tiered alert routing on the counters collected here.
Async compaction tracking & metrics: the JMX MBeans and sampling patterns behind background-thread observability.
Compaction error categorization & logging: routing CorruptSSTableException and stalls into structured, alertable categories.