Does nodetool compactionstats have a JSON output mode?

No. Across Cassandra 4.0, 4.1, and 5.0 the command emits only human-readable text: a 'pending tasks: N' line followed by a table. Automation must parse that text deterministically or read the org.apache.cassandra.metrics:type=Compaction JMX MBeans. The system_views.sstable_tasks virtual table, available since Cassandra 4.0, gives a structured alternative.

How do I tell a healthy backlog from a stalled compaction?

Compare the completed and progress fields across consecutive polls. A high pending count with progress advancing is normal backlog. The same completed value across three or more polls, especially with pending non-zero, is a stall, usually caused by saturated disk I/O or a corrupt SSTable.

Why should polling be gated behind node-health checks?

Spawning nodetool against a DN or UJ node, or against a disk near capacity, adds JMX and I/O load to a node already in trouble and can turn a monitoring job into an outage. Verify the local node is UN and the data volume has headroom before every poll, and exit non-zero without retrying if a gate fails.

How do I get per-compaction throughput from compactionstats?

You cannot; compactionstats reports completed and total, not a rate. Derive aggregate throughput by differentiating the monotonically increasing BytesCompacted JMX counter across two polls at a known interval, and use the progress column only to detect stalls.

Interpreting `nodetool compactionstats` Output for Cassandra 4.x/5.x Automation

nodetool compactionstats is the primary human-readable signal for Cassandra storage health, but wiring it into automation is where operators get burned: polling it in a tight loop can cause I/O starvation, JMX timeouts, and cascading disk pressure. This page covers one task: parsing the command's output safely inside a pipeline. That means extracting the pending tasks: N line and the active-compaction table, mapping each column to an operational decision, and setting thresholds that distinguish a healthy backlog from a genuine stall. It assumes Cassandra 4.0, 4.1, or 5.0, a reachable JMX/nodetool surface on the local node, and Python 3.10+ for the reference code. It sits under async compaction tracking and metrics; read that guide first for the JMX MBean model, because compactionstats is the text projection of the same org.apache.cassandra.metrics:type=Compaction counters.

There is no JSON mode for compactionstats, so automation must parse text deterministically or read the JMX MBeans directly. Because the shape of expected backlog depends on your table strategy — the tiering behavior of STCS vs LCS vs TWCS governs whether a pending spike is routine or alarming — the thresholds below are starting points you calibrate per keyspace, not universal constants.
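One way to calibrate a starting threshold is to derive it from pending-task counts you have already recorded for a node. The sketch below is illustrative only: the function name pending_threshold and the choice of the 95th percentile are assumptions, not something Cassandra prescribes, and the floor reuses the 2 × concurrent_compactors rule of thumb from this page.

```python
import statistics


def pending_threshold(samples: list[int], concurrent_compactors: int = 2) -> int:
    """Return an alert threshold for pending tasks from historical samples.

    Uses the 95th percentile of observed pending counts (an assumed choice),
    floored at the static 2 x concurrent_compactors starting point.
    """
    floor = 2 * concurrent_compactors
    if len(samples) < 2:
        return floor  # not enough history; fall back to the static rule
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return max(floor, round(p95))
```

A table on LCS or TWCS with a routinely deep queue would feed in larger samples and end up with a higher threshold, while a quiet node keeps the static floor.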

Pre-conditions and safety gates

Never poll blindly. Confirm the node is healthy and has I/O headroom before you spawn a nodetool subprocess; polling a DN/UJ node or a disk near capacity is how a monitoring job becomes an outage. Run these gates first and abort on any failure.

# Gate 1: the LOCAL node must be Up/Normal (UN). Scope the check to this node's
# address — a cluster-wide grep for "UN" passes even when the local node is down.
nodetool status | awk -v ip="127.0.0.1" '$2==ip {print $1}'
# Expected output: UN

# Gate 2: data volume must have headroom. Refuse to poll above ~85% used, since
# compaction needs scratch space and a stall here is a disk-pressure symptom.
df --output=pcent /var/lib/cassandra | tail -1 | tr -d ' %'
# Expected output: an integer < 85

# Gate 3: capture the current snapshot once, cheaply, to confirm JMX responds.
# --human-readable is for eyeballing only: it rewrites completed/total with unit
# suffixes. The extractor below parses the default form, which prints raw integers.
nodetool compactionstats --human-readable
# Expected output begins with a "pending tasks: N" line.

Pre-flight safety gates: every gate must pass before a poll; any failure aborts without side effects.

If any gate fails, the poller must exit non-zero without touching runtime parameters. The Python implementation below encodes all three as guard clauses.

Implementation: deterministic extraction and idempotent polling

The routine below verifies node state, checks disk pressure, parses the text columns, and records a snapshot digest so repeated identical readings are a no-op rather than duplicate work. It distinguishes a write failure from a genuine no-op via a dedicated exception, so a full disk can never masquerade as “nothing changed.”

#!/usr/bin/env python3
# requirements: Python 3.10+, Cassandra 4.0/4.1/5.0 with nodetool on PATH.
"""
Deterministic compactionstats extractor for Cassandra 4.x/5.x.
Safety: node-state validation, disk-pressure guardrails, atomic state files.
Output: parsed compaction-state dict as JSON, or a non-zero exit on failure.
Rollback: clears partial state and raises rather than reporting a false no-op.
"""
import hashlib
import json
import os
import re
import subprocess
import sys
from pathlib import Path
from typing import Optional

STATE_DIR = Path("/var/lib/cassandra/automation/compaction_state")
STATE_FILE = STATE_DIR / "compactionstats_last_run.sha256"
DISK_THRESHOLD = 0.85
NODETOOL_TIMEOUT = 30
LOCAL_ADDR = "127.0.0.1"  # this node's listen/broadcast address

PENDING_RE = re.compile(r"^pending tasks:\s*(\d+)", re.MULTILINE)
# Display names as printed in the "compaction type" column; some contain spaces.
ACTIVE_TYPES = {"Compaction", "Validation", "Cleanup", "Scrub",
                "Upgrade sstables", "Secondary index build"}


class IdempotencyError(RuntimeError):
    """Raised when state could not be persisted (distinct from a no-op match)."""


def verify_node_state() -> bool:
    """Ensure the LOCAL node is Up/Normal (UN) before invocation."""
    try:
        proc = subprocess.run(["nodetool", "status"],
                              capture_output=True, text=True, timeout=15)
    except (subprocess.SubprocessError, OSError):
        return False
    if proc.returncode != 0:
        return False
    for line in proc.stdout.splitlines():
        cols = line.split()
        # Scope the UN check to the local node's row, not any node in the ring.
        if len(cols) >= 2 and cols[1] == LOCAL_ADDR:
            return cols[0] == "UN"
    return False


def verify_disk_pressure() -> bool:
    """Halt polling if the data directory exceeds the used-space threshold."""
    try:
        stat = os.statvfs("/var/lib/cassandra")
    except OSError:
        return False
    if stat.f_blocks == 0:
        return False  # pseudo-filesystem or bad mount; treat as a failed gate
    used = 1.0 - (stat.f_bavail / stat.f_blocks)
    return used < DISK_THRESHOLD


def parse_compactionstats(text: str) -> dict:
    """Parse the default (non -H) column layout:
    id  compaction type  keyspace  table  completed  total  unit  progress.
    Rows are read from the right because the type can span several tokens."""
    match = PENDING_RE.search(text)
    pending = int(match.group(1)) if match else 0

    active = []
    for line in text.splitlines():
        cols = line.split()
        # Skip header/blank/non-row lines: a data row ends in a percent value.
        if len(cols) < 8 or not cols[-1].endswith("%"):
            continue
        # Everything between the id and the keyspace is the (possibly
        # multi-word) compaction type.
        ctype = " ".join(cols[1:-6])
        if ctype not in ACTIVE_TYPES:
            continue
        try:
            completed = int(cols[-4])
            total = int(cols[-3])
        except ValueError:
            continue  # human-readable form carries unit suffixes; skip here
        active.append({
            "id": cols[0],
            "compaction_type": ctype,
            "keyspace": cols[-6],
            "table": cols[-5],
            "completed": completed,
            "total": total,
            "unit": cols[-2],
            "progress": cols[-1],
        })
    return {"pending": pending, "active": active}


def fetch_compactionstats() -> Optional[dict]:
    """Idempotent fetch with the pre-flight gates applied as guard clauses."""
    if not verify_node_state() or not verify_disk_pressure():
        return None
    try:
        proc = subprocess.run(["nodetool", "compactionstats"],
                              capture_output=True, text=True,
                              timeout=NODETOOL_TIMEOUT)
    except (subprocess.TimeoutExpired, subprocess.SubprocessError, OSError):
        return None
    if proc.returncode != 0 or "pending tasks:" not in proc.stdout:
        return None
    return parse_compactionstats(proc.stdout)


def enforce_idempotency(payload: dict) -> bool:
    """Persist a snapshot digest. Return True if already seen (no-op), False if
    new and recorded. Raise IdempotencyError if any state I/O fails, so a disk
    error is never mistaken for a successful no-op."""
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    tmp = STATE_FILE.with_suffix(".tmp")
    try:
        # mkdir and the read are inside the try so their failures are also
        # reported as IdempotencyError rather than an unhandled OSError.
        STATE_DIR.mkdir(parents=True, exist_ok=True)
        if STATE_FILE.exists() and STATE_FILE.read_text().strip() == digest:
            return True
        tmp.write_text(digest)
        tmp.replace(STATE_FILE)  # atomic on POSIX
    except OSError as exc:
        tmp.unlink(missing_ok=True)  # rollback the partial write
        raise IdempotencyError("failed to persist compaction state") from exc
    return False


if __name__ == "__main__":
    data = fetch_compactionstats()
    if data is None:
        sys.exit(1)  # a gate failed or JMX did not respond
    try:
        already_seen = enforce_idempotency(data)
    except IdempotencyError:
        sys.exit(1)
    if already_seen:
        sys.exit(0)  # nothing changed since the last poll
    print(json.dumps(data, indent=2))

Decoding the columns to operational meaning

Once parsed, each field maps directly to a storage-subsystem decision. The table below is the interpretation layer your alerting rules should encode.

Field	Operational meaning	Alert threshold or use
`pending tasks`	Queue depth awaiting the compaction executor	`> 2 × concurrent_compactors` sustained indicates backpressure
`compaction type`	Operation name, e.g. `Compaction`, `Cleanup`, `Scrub`, `Upgrade sstables`, `Validation`, `Secondary index build`	`Validation` implies an anti-entropy repair is sharing the pool
`completed`	Work already processed (in `unit`)	`progress = completed / total`
`total`	Total work to process (in `unit`) for this compaction	Baseline for the progress calculation
`progress`	Percent string emitted by nodetool (e.g. `42.10%`)	Identical value across polls signals a stuck compaction

Always validate total > 0 before dividing, to guard against a division by zero on a freshly queued task. Per-task throughput is not reported here; derive the aggregate rate by differentiating the BytesCompacted JMX counter across polls, as covered in the async compaction tracking guide. Stalled progress while deleted data survives on disk can also be an early symptom of a tombstone garbage-collection problem, so correlate the two.

[Figure: mapping a nodetool compactionstats reading to a single operational decision.]

Turning the reading into a health verdict

Rather than poll on a fixed interval and hammer the thread pool, emit a structured verdict that downstream automation can act on. This consumes the dict from parse_compactionstats directly.

#!/usr/bin/env python3
# requirements: Python 3.10+. Pure function over the parsed dict; no I/O.
def validate_compaction_health(payload: dict, concurrent_compactors: int = 2) -> dict:
    """Return a health verdict with actionable thresholds. Uses progress to spot
    stalls because compactionstats does not report per-task throughput."""
    pending: int = payload.get("pending", 0)
    active: list[dict] = payload.get("active", [])

    min_progress = min(
        (row["completed"] / row["total"]
         for row in active if row.get("total", 0) > 0),
        default=1.0,
    )

    verdict: dict = {"status": "healthy", "actions": []}
    if pending > 2 * concurrent_compactors:
        verdict["status"] = "backpressure"
        verdict["actions"].append("throttle_throughput")
    if active and min_progress < 0.05 and pending > 5:
        verdict["status"] = "possible_stall"
        verdict["actions"].append("investigate_disk_io")
    return verdict

When the verdict calls for throttling — typically during a repair window, where repair-generated SSTables flood the queue — cap throughput rather than killing compaction outright, then confirm the change took effect.

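# setcompactionthroughput is silent on success; always read it back.
nodetool getcompactionthroughput      # record the current value before changing it
nodetool setcompactionthroughput 50   # MiB/s; 0 = unlimited, which is not the shipped default
nodetool getcompactionthroughput
# Rollback: nodetool setcompactionthroughput <recorded value>
# (setting 0 would remove the throttle entirely rather than restore it)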
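The same set-then-read-back sequence can be scripted so the previous value is captured for rollback. This is a minimal sketch: the helper names are hypothetical, and the regular expression assumes getcompactionthroughput prints a line shaped like "Current compaction throughput: 16 MB/s", which you should confirm against your version.

```python
import re
import subprocess
from typing import Callable

# Assumed output shape: "Current compaction throughput: <number> <unit>".
THROUGHPUT_RE = re.compile(r"throughput[^:\n]*:\s*(\d+(?:\.\d+)?)")


def run_nodetool(*args: str) -> str:
    """Run nodetool and return stdout; raises CalledProcessError on failure."""
    proc = subprocess.run(["nodetool", *args], capture_output=True,
                          text=True, timeout=30, check=True)
    return proc.stdout


def read_throughput(run: Callable[..., str] = run_nodetool) -> float:
    """Parse the current compaction throughput from nodetool output."""
    match = THROUGHPUT_RE.search(run("getcompactionthroughput"))
    if match is None:
        raise ValueError("unrecognised getcompactionthroughput output")
    return float(match.group(1))


def throttle(target: int, run: Callable[..., str] = run_nodetool) -> float:
    """Set the throughput cap, verify it, and return the previous value."""
    previous = read_throughput(run)
    run("setcompactionthroughput", str(target))
    applied = read_throughput(run)
    if applied != float(target):
        raise RuntimeError(f"throughput is {applied}, expected {target}")
    return previous
```

Keep the returned value and pass it back to setcompactionthroughput once the repair window closes. The run parameter exists so the logic can be exercised with a stub instead of a live node.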

Verification steps

Confirm the parser and gates behave against a live node before trusting the pipeline.

# 1. The extractor prints valid JSON with pending + active keys on a healthy node.
#    An unchanged snapshot prints nothing, so clear the stored digest first.
rm -f /var/lib/cassandra/automation/compaction_state/compactionstats_last_run.sha256
python3 compactionstats_extractor.py | python3 -m json.tool
# Expected: {"pending": <int>, "active": [ ... ]}

# 2. Running it again with no cluster change is a no-op (exit 0, no output).
python3 compactionstats_extractor.py; echo "exit=$?"
# Expected on the second run: exit=0 with no stdout

# 3. The digest file exists and is a 64-char hex sha256.
wc -c /var/lib/cassandra/automation/compaction_state/compactionstats_last_run.sha256
# Expected: 64 (plus a trailing newline if your editor added one)

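On Cassandra 4.0 and later you can cross-check the text parse against the system_views.sstable_tasks virtual table, which exposes the same in-flight compactions as structured rows and makes a useful oracle when validating the parser. In that table, progress holds the completed amount in the task's unit, not a percentage:

-- Virtual tables ship with Cassandra 4.0+; each node reports only its own tasks.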
SELECT keyspace_name, table_name, kind, progress, total
FROM system_views.sstable_tasks;
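A cross-check along those lines might look like the sketch below. It assumes the virtual-table rows have already been fetched into dicts keyed by column name (the driver call is not shown), and diff_against_vtable is a hypothetical helper, not part of any Cassandra tooling.

```python
def diff_against_vtable(parsed_active: list[dict],
                        vtable_rows: list[dict]) -> dict:
    """Compare parsed compactionstats rows with sstable_tasks rows.

    Rows are matched on (keyspace, table, total). The completed/progress
    values are deliberately ignored because they move between the two reads.
    """
    text_keys = {(r["keyspace"], r["table"], r["total"])
                 for r in parsed_active}
    vtable_keys = {(r["keyspace_name"], r["table_name"], r["total"])
                   for r in vtable_rows}
    return {
        "only_in_text": sorted(text_keys - vtable_keys),
        "only_in_vtable": sorted(vtable_keys - text_keys),
    }
```

The two snapshots are taken at different instants, so a task that starts or finishes in between shows up as a transient mismatch. Treat only a persistent difference as a parser defect.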

Troubleshooting

nodetool: Failed to connect to '127.0.0.1:7199' — ConnectException. JMX is unreachable: the node is starting, mid-drain, or the JMX port is firewalled. The extractor already treats this as a gate failure and exits non-zero, so the fix is operational — verify the process is up and nodetool status returns UN before re-enabling the poller. Do not loop-retry; a tight retry against a dead JMX endpoint is itself a load source.

Progress frozen at an identical completed value across three or more polls. The compaction is stuck, usually on saturated disk I/O or a corrupt SSTable. Detect it by comparing the completed value for the same compaction id across the last three snapshots; the extractor stores only the most recent digest, so the poller has to keep that short history itself. Then stop the operation cleanly: nodetool stop COMPACTION halts every running task of that type and is silent on success, so check the exit code rather than expecting output. Follow with nodetool enableautocompaction. If the stall recurs on the same table, run nodetool scrub off-peak.
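The three-poll comparison can be kept in memory by the poller. The sketch below consumes the dict shape produced by parse_compactionstats; the StallDetector class and its default window of three are illustrative assumptions.

```python
from collections import defaultdict, deque


class StallDetector:
    """Flag compactions whose completed value is unchanged across N polls."""

    def __init__(self, window: int = 3):
        self.window = window
        # One bounded history of completed values per compaction id.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, payload: dict) -> list[str]:
        """Record one parsed snapshot and return the ids that look stalled."""
        stalled = []
        seen = set()
        for row in payload.get("active", []):
            seen.add(row["id"])
            history = self._history[row["id"]]
            history.append(row["completed"])
            if len(history) == self.window and len(set(history)) == 1:
                stalled.append(row["id"])
        # Forget compactions that have finished so the map cannot grow.
        for compaction_id in list(self._history):
            if compaction_id not in seen:
                del self._history[compaction_id]
        return stalled
```

Feed it every snapshot, including ones the digest check would treat as unchanged, since an unchanged snapshot is exactly what a stall looks like.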

OutOfSpaceException in the logs while pending climbs. Compaction needs free scratch space roughly equal to the SSTables being merged; a large STCS merge on a full disk fails and re-queues, inflating pending tasks in a loop. Gate 2 above is meant to catch this before you throttle. The remedy is capacity — add space or drain backlog per the backlog-resolution runbook — not more aggressive polling, which only adds JMX load to a node already under pressure.

Async compaction tracking and metrics — the parent guide mapping every compactionstats column to its JMX MBean equivalent.
Compaction backlog analysis and alerting — dynamic thresholds, tiered severity, and automated remediation for a growing queue.
Python monitoring for Cassandra compaction — feeding parsed telemetry into Prometheus, Grafana, and custom pollers.
Advanced Compaction Strategy Tuning and Monitoring — the parent section on strategy selection, tuning, and observability end to end.
