Async Compaction Tracking & Metrics in Cassandra 4.x/5.x

Asynchronous compaction in Apache Cassandra decouples SSTable merging from the synchronous write path, letting ingestion proceed while background threads on the CompactionExecutor pool consolidate data. That separation protects write availability, but untracked async work silently degrades read latency, inflates disk utilization, and starves streaming during repair. This guide is for operators who need to turn raw compaction telemetry into an early-warning signal: it explains which JMX MBeans and nodetool compactionstats fields to sample, how to derive compaction velocity rather than trust instantaneous counts, and how to set thresholds that separate healthy backlog from genuine starvation. It sits under Advanced Compaction Strategy Tuning & Monitoring; read that first if you have not yet aligned your table-level strategy with your workload.

Everything here is validated against Cassandra 4.0, 4.1, and 5.0. Where behavior diverges between releases, the version-specific detail is called out inline: the compaction_throughput_mb_per_sec → compaction_throughput rename in 4.1, the introduction of UnifiedCompactionStrategy in 5.0 (the stock cassandra.yaml still defaults tables to STCS, and 5.0 adds a default_compaction setting to change that), and the system_views virtual tables, which have been available since 4.0.

How async compaction telemetry is produced

Compaction runs as a set of tasks submitted to a bounded thread pool. Each strategy implementation (STCS, LCS, TWCS, or UCS) inspects the SSTable set, decides which files to merge, and submits an AbstractCompactionTask to the CompactionExecutor. The executor’s parallelism is capped by concurrent_compactors, and its aggregate write rate is throttled by the shared compaction_throughput budget. Understanding this producer/consumer relationship is the whole game: pending tasks are the queue depth, active tasks are the in-flight work, and BytesCompacted is the cumulative output you differentiate to get throughput.

Cassandra publishes this state through the Java Management Extensions (JMX) interface. Two MBean domains matter:

org.apache.cassandra.metrics:type=Compaction exposes PendingTasks, CompletedTasks, TotalCompactionsCompleted, and BytesCompacted.
org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=CompactionExecutor exposes ActiveTasks, PendingTasks, and CompletedTasks for the executor itself.
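
As a minimal sketch of how a collector might address these two MBean groups, the helper below builds the full JMX ObjectName strings by appending a name= key to each prefix. The function and constant names are hypothetical; only the ObjectName patterns themselves come from the list above.

```python
# Sketch only: helper names are illustrative, not part of any Cassandra API.
COMPACTION_MBEAN = "org.apache.cassandra.metrics:type=Compaction"
EXECUTOR_MBEAN = (
    "org.apache.cassandra.metrics:type=ThreadPools,"
    "path=internal,scope=CompactionExecutor"
)


def compaction_metric(name: str) -> str:
    """ObjectName for a logical compaction metric, e.g. BytesCompacted."""
    return f"{COMPACTION_MBEAN},name={name}"


def executor_metric(name: str) -> str:
    """ObjectName for a CompactionExecutor thread-pool metric, e.g. ActiveTasks."""
    return f"{EXECUTOR_MBEAN},name={name}"


if __name__ == "__main__":
    print(compaction_metric("PendingTasks"))
    print(executor_metric("ActiveTasks"))
```

Keeping the two prefixes separate makes it harder to confuse Compaction PendingTasks (logical operations) with the executor's PendingTasks (thread-pool work items) when wiring a scrape config.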

The distinction is subtle but important: the Compaction domain counts logical compaction operations, while the ThreadPools CompactionExecutor scope counts thread-pool work items. A node with ActiveTasks pinned at 0 while Compaction.PendingTasks climbs is not busy; it is stalled, and that pattern is the single most reliable starvation signature. Parsing the human-readable nodetool compactionstats surface is covered in depth under interpreting nodetool compactionstats output, which maps every text column to its JMX equivalent.

Because the throughput budget is shared, tracking must correlate compaction against the write path. When compaction cannot keep pace, unmerged, overlapping SSTables accumulate and read amplification climbs as tombstones survive past their expiry — a failure mode explored in tombstone management and garbage collection. The velocity you compute here is therefore a leading indicator for read-latency regressions, not just a housekeeping metric.

The metrics path runs from each node through scraping into alert routing.

[Figure: Async compaction metrics pipeline to alert routing.]

[Figure: Detection signatures. A healthy node keeps workers busy and output rising, while a starved node queues work but produces nothing.]

Configuration reference

The settings below govern how much async compaction work a node can do and, by extension, how you interpret its telemetry. Values in cassandra.yaml are node-scoped; strategy parameters are table-scoped via ALTER TABLE.

Key	Default	Valid range	Impact on tracking
`concurrent_compactors`	`min(cores, disks)`, clamped to 2–8	`1` – core count	Sets `ActiveTasks` ceiling; `PendingTasks` above `2 × concurrent_compactors` is your primary backlog trigger
`compaction_throughput` (4.1+) / `compaction_throughput_mb_per_sec` (≤4.0)	`64` (MiB/s); `0` = unlimited	`0` – disk write ceiling	Caps aggregate `BytesCompacted` growth; velocity is benchmarked as a fraction of this
`sstable_preemptive_open_interval` (4.1+) / `sstable_preemptive_open_interval_in_mb` (≤4.0)	`50` (MiB)	`-1` (disabled) – 100s of MiB	Affects how early merged output is visible to reads; influences read-amp during long compactions
`concurrent_validations`	`0` (falls back to `concurrent_compactors`)	`0` – core count	Repair validation competes with compaction for disk I/O; bounds contention you must schedule around
Table `compaction` `class`	`SizeTieredCompactionStrategy` (4.x and stock 5.0; 5.0 adds a configurable `default_compaction`)	STCS / LCS / TWCS / UCS (UCS is 5.0+)	Determines the shape of expected backlog (see below)

A minimal cassandra.yaml block for a throughput-sensitive node:

# cassandra.yaml — Cassandra 4.1+ / 5.0
# Pre-4.1 clusters: use compaction_throughput_mb_per_sec: 64
concurrent_compactors: 4
compaction_throughput: 64MiB/s     # 0 disables the throttle (audit disk headroom first)
sstable_preemptive_open_interval: 50MiB
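
Because the throttle is written as a bare integer before 4.1 and as a unit-suffixed string from 4.1 onward, a tracker needs one normalisation step before it can compare velocity against the budget. The following is a minimal sketch of that step, not Cassandra's own parser: the function name is hypothetical, it assumes the suffixes B/s, KiB/s, and MiB/s, and it treats the legacy integer as MiB/s, as the table above does.

```python
import re
from typing import Union

# Assumed unit suffixes; extend if your configuration uses others.
_UNIT_BYTES = {"B/s": 1, "KiB/s": 1024, "MiB/s": 1024 ** 2}
_RATE_RE = re.compile(r"^\s*(\d+)\s*(B/s|KiB/s|MiB/s)\s*$")


def throughput_to_bps(value: Union[str, int]) -> float:
    """Convert a throughput setting to bytes/sec; 0 means unthrottled.

    Accepts the 4.1+ string form ("64MiB/s") or the legacy integer form
    used by compaction_throughput_mb_per_sec (interpreted as MiB/s here).
    """
    if isinstance(value, int):
        return float(value * _UNIT_BYTES["MiB/s"])
    match = _RATE_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised throughput value: {value!r}")
    amount, unit = match.groups()
    return float(int(amount) * _UNIT_BYTES[unit])


if __name__ == "__main__":
    print(throughput_to_bps("64MiB/s"))
    print(throughput_to_bps(64))
```

A result of 0 should be passed through as "no throttle" rather than used as a divisor when computing velocity as a fraction of the budget.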

Strategy choice changes what “normal” telemetry looks like. STCS produces bursty, size-tiered merges; LCS keeps PendingTasks low but steady; TWCS concentrates compaction at window rollover, so pending spikes are expected and periodic. Never apply one flat threshold across strategies — calibrate against the strategy in play, which is the reasoning behind strategy selection for time-series workloads and the broader comparison in STCS vs LCS vs TWCS.
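
One way to avoid a flat threshold is to scale the depth trigger by strategy. The sketch below illustrates the idea only: the multipliers are placeholder values invented for this example, not recommendations from Cassandra or from this guide, and the function name is hypothetical. Calibrate the numbers against your own baselines.

```python
# Placeholder multipliers for illustration; tune against observed baselines.
DEPTH_MULTIPLIER = {
    "SizeTieredCompactionStrategy": 3.0,   # bursty merges: tolerate more depth
    "LeveledCompactionStrategy": 2.0,      # low but steady pending is expected
    "TimeWindowCompactionStrategy": 4.0,   # periodic rollover spikes
    "UnifiedCompactionStrategy": 2.0,
}
DEFAULT_MULTIPLIER = 2.0  # mirrors the 2 x concurrent_compactors rule


def depth_threshold(strategy_class: str, concurrent_compactors: int) -> int:
    """Pending-task depth above which a table's backlog deserves a look."""
    # Accept both short names and fully qualified class names.
    short = strategy_class.rsplit(".", 1)[-1]
    multiplier = DEPTH_MULTIPLIER.get(short, DEFAULT_MULTIPLIER)
    return int(multiplier * concurrent_compactors)


if __name__ == "__main__":
    print(depth_threshold("TimeWindowCompactionStrategy", 4))
    print(depth_threshold("LeveledCompactionStrategy", 4))
```

Unknown strategy names fall back to the default multiplier, so a custom strategy class degrades to the generic rule instead of raising.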

Step-by-step: instrument async compaction tracking

The following procedure stands up a velocity-aware tracker on a single node, then generalizes it to the fleet. Each step includes a safety gate you should not skip.

1. Confirm the telemetry surface is reachable

Read the configured throughput and current pending count before writing any automation. This is your baseline and your safety gate — if these commands fail, fix access before proceeding.

# Safety gate: both commands must return cleanly on the target node.
nodetool getcompactionthroughput
nodetool compactionstats

Expected output (an idle-ish node; the task id and column spacing will differ):

Current compaction throughput: 64 MB/s
pending tasks: 3
- ks.orders: 3

id           compaction type  keyspace  table   completed  total    unit   progress
<task-uuid>  Compaction       ks        orders  1234567    4567890  bytes  27.03%

On Cassandra 4.0 and later you can also query the system_views.sstable_tasks virtual table, which avoids the text-parsing step entirely:

-- Cassandra 4.0+: structured compaction state, no JMX bridge required
SELECT keyspace_name, table_name, kind, progress, total, unit
FROM system_views.sstable_tasks;

2. Sample the JMX counters you will differentiate

BytesCompacted is a monotonically increasing counter. A single reading is meaningless; you need two samples across a known interval to derive throughput. Confirm you can read it before automating:

# Requires a JMX client; Cassandra ships nodetool over JMX by default on 7199.
nodetool sjk mxdump -q 'org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted' 2>/dev/null \
  || echo "Fall back to your JMX exporter / Prometheus scrape for BytesCompacted"
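
The two-sample procedure can be sketched as follows. This is an illustration under stated assumptions rather than a finished collector: read_counter is a hypothetical callable you supply (for example, a wrapper around your JMX exporter), and the clock and sleep parameters exist only so the function can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def sample_velocity(
    read_counter: Callable[[], int],
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Take two counter samples interval_s apart and return bytes/sec.

    Returns None when the counter went backwards (for example after a
    node restart) or when no time elapsed, so callers can skip the
    sample instead of recording a misleading rate.
    """
    t0, first = clock(), read_counter()
    sleep(interval_s)
    t1, second = clock(), read_counter()
    elapsed = t1 - t0
    if elapsed <= 0 or second < first:
        return None
    return (second - first) / elapsed


if __name__ == "__main__":
    fake = iter([1_000, 4_000])
    # Zero-length sleep with a real clock: demonstrates the call shape only.
    print(sample_velocity(lambda: next(fake), 0.01))
```

Dividing by the measured elapsed time rather than the requested interval keeps the rate honest when the second read is delayed by a slow JMX call.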

Safety gate: never poll JMX more aggressively than your scrape interval (15–30s). Sub-second polling on a loaded node adds GC and RPC pressure and can itself induce the latency you are trying to detect.
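
A simple way to enforce that gate in code is to refuse any poll that arrives too soon after the previous one. The class below is a hypothetical sketch of such a guard; its name and the 15-second default are assumptions chosen to match the lower bound of the scrape interval mentioned above.

```python
import time
from typing import Callable, Optional


class PollGate:
    """Refuse polls that arrive sooner than min_interval_s after the last one."""

    def __init__(
        self,
        min_interval_s: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None

    def allow(self) -> bool:
        """Return True and record the time if a poll is permitted now."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min:
            return False
        self._last = now
        return True


if __name__ == "__main__":
    gate = PollGate(min_interval_s=15.0)
    print(gate.allow())  # first poll is always permitted
    print(gate.allow())  # an immediate second poll is refused
```

Calling allow() before each JMX or nodetool read means a misconfigured scheduler cannot drive the node harder than the interval you chose.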

3. Deploy a velocity-aware poller

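The script below parses nodetool compactionstats for queue depth and provides a helper that differentiates two BytesCompacted JMX samples to compute bytes per second; reading the counter itself is left to the caller. It isolates each poll in a subprocess with a timeout, retries with exponential backoff, and logs to stdout in a fixed format that log shippers can parse. As written, the entry point polls once (retrying up to five times) rather than looping forever, so run it under a scheduler or wrap it in your own loop.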

#!/usr/bin/env python3
# requirements: Python 3.10+, Cassandra 4.0+/5.x nodetool on PATH.
# No third-party deps; JMX counter reads are injected by the caller.
"""Velocity-aware async-compaction tracker for Cassandra 4.x/5.x.

Parses `nodetool compactionstats` for queue depth and differentiates the
BytesCompacted JMX counter across polls to derive compaction throughput.
"""
import logging
import re
import subprocess
import sys
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger("cassandra_compaction_tracker")

PENDING_RE = re.compile(r"^pending tasks:\s*(\d+)", re.MULTILINE)
ACTIVE_TYPES = ("Compaction", "Validation", "Cleanup", "Scrub", "Upgradesstables")


def poll_compactionstats(node_ip: str, timeout: int = 15) -> dict | None:
    """Run `nodetool compactionstats` and parse its text output.

    There is no JSON mode; the command emits a `pending tasks: N` line
    followed by a table of active compactions. Returns None on failure.
    """
    cmd = ["nodetool", "-h", node_ip, "compactionstats"]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
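    except FileNotFoundError:
        # nodetool is not installed or not on PATH; retrying will not help.
        logger.error("nodetool not found on PATH")
        return None
    except subprocess.CalledProcessError as exc:
        logger.error("nodetool failed on %s: %s", node_ip, exc.stderr.strip())
        return None
    except subprocess.TimeoutExpired:
        logger.warning("Poll timeout exceeded for %s", node_ip)
        return None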
    return parse_compactionstats(result.stdout)


def parse_compactionstats(text: str) -> dict:
    """Extract pending count and active-compaction rows from text output."""
    match = PENDING_RE.search(text)
    pending: int = int(match.group(1)) if match else 0

    active: list[dict] = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) >= 8 and cols[1] in ACTIVE_TYPES:
            try:
                completed, total = int(cols[-4]), int(cols[-3])
            except ValueError:
                continue
            active.append(
                {
                    "keyspace": cols[2],
                    "table": cols[3],
                    "completed": completed,
                    "total": total,
                    "unit": cols[-2],
                    "progress": cols[-1],
                }
            )
    return {"pending": pending, "active": active}


def compaction_velocity(
    bytes_prev: int, bytes_now: int, elapsed_s: float
) -> float:
    """Derive bytes/sec from two BytesCompacted JMX samples (guards resets)."""
    if elapsed_s <= 0 or bytes_now < bytes_prev:
        return 0.0
    return (bytes_now - bytes_prev) / elapsed_s


def evaluate_backlog(
    metrics: dict, concurrent_compactors: int, velocity_bps: float,
    throughput_limit_bps: float,
) -> bool:
    """Return True if the node's compaction is healthy.

    Starvation is queue growth with no output: pending high, ActiveTasks
    empty, velocity near zero. A single high pending count is NOT enough.
    """
    pending: int = metrics.get("pending", 0)
    active: int = len(metrics.get("active", []))
    stalled = active == 0 and pending > concurrent_compactors
    slow = throughput_limit_bps > 0 and velocity_bps < 0.30 * throughput_limit_bps

    if stalled:
        logger.error("STARVATION: pending=%d but ActiveTasks=0", pending)
        return False
    if pending > concurrent_compactors * 2 and slow:
        logger.warning(
            "Backlog growing at %.1f MiB/s (<30%% of budget), pending=%d",
            velocity_bps / 1_048_576, pending,
        )
        return False
    return True


if __name__ == "__main__":
    target = "127.0.0.1"
    backoff = 2
    for _ in range(5):
        data = poll_compactionstats(target)
        if data is not None:
            # In production, feed real BytesCompacted deltas + your configured
            # concurrent_compactors and compaction_throughput here.
            evaluate_backlog(data, concurrent_compactors=4,
                             velocity_bps=0.0, throughput_limit_bps=64 * 1_048_576)
            break
        time.sleep(backoff)
        backoff = min(backoff * 2, 30)

Wiring this into Prometheus exporters, Datadog agents, or a custom pipeline is covered in Python monitoring for Cassandra compaction.

4. Set thresholds that encode velocity, not just depth

A raw pending count is a poor alert. Fire only on the combination of queue growth and stalled output:

PendingTasks > concurrent_compactors × 2 sustained for more than 5 minutes.
BytesCompacted delta approaching zero while PendingTasks grows.
ActiveTasks at 0 despite non-zero PendingTasks (unambiguous starvation).
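
The "sustained for more than 5 minutes" qualifier needs a small amount of state, since a single sample cannot express duration. Below is a minimal sketch of a hold timer; the class name is hypothetical and the 300-second default simply mirrors the 5-minute figure in the first rule.

```python
import time
from typing import Callable, Optional


class SustainedCondition:
    """Report True only once a condition has held continuously for hold_s."""

    def __init__(
        self,
        hold_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._hold = hold_s
        self._clock = clock
        self._since: Optional[float] = None  # when the condition first became true

    def update(self, condition: bool) -> bool:
        """Feed the latest observation; returns True if held for >= hold_s."""
        now = self._clock()
        if not condition:
            self._since = None  # any healthy sample resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self._hold


if __name__ == "__main__":
    timer = SustainedCondition(hold_s=300.0)
    pending, concurrent_compactors = 12, 4
    print(timer.update(pending > 2 * concurrent_compactors))
```

Feed it one observation per scrape, for example timer.update(pending > 2 * concurrent_compactors), and combine the result with the output-based rules instead of alerting on depth alone.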

Full threshold calibration, tiered severity, and routing to on-call live in compaction backlog analysis & alerting.

Verification & observability

Confirm the tracker reflects reality by cross-checking three independent surfaces:

Queue depth vs. output: run nodetool compactionstats twice, 30 seconds apart. The progress percentages on active rows must advance. If pending tasks grows while progress is flat, the node is stalled, which is the condition the output-based rules in step 4 are designed to catch.
Throughput sanity: compute velocity from two BytesCompacted samples and compare against nodetool getcompactionthroughput. Sustained velocity below 30% of the configured budget, with disk not idle, indicates I/O saturation or file-descriptor exhaustion rather than a throttle you can relax.
Structured state (4.0+): query SELECT * FROM system_views.sstable_tasks; and reconcile the row count against your parsed active list. In addition, nodetool tablestats <ks>.<table> (the rename of the deprecated cfstats) reports SSTable counts per table, which should trend down as compaction drains the queue.
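
The first check can be automated by comparing two snapshots of completed work per task. The sketch below assumes each snapshot is a plain mapping from task id (the first column of the compactionstats table) to its completed value; that shape and the function name are assumptions for illustration, and the parser shown earlier would need to retain the id column to produce it.

```python
from typing import Dict, List


def stalled_tasks(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return ids present in both snapshots whose completed count did not advance.

    Tasks that finished (missing from `after`) or started in between
    (missing from `before`) are ignored: neither is evidence of a stall.
    """
    return sorted(
        task_id
        for task_id, completed in after.items()
        if task_id in before and completed <= before[task_id]
    )


if __name__ == "__main__":
    first = {"task-a": 1_000, "task-b": 5_000}
    second = {"task-a": 2_500, "task-b": 5_000, "task-c": 10}
    print(stalled_tasks(first, second))
```

An empty result with a rising pending count points at backlog rather than a stall; a non-empty result across several consecutive intervals is the case worth paging on.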

Grep the logs to correlate spikes with real events rather than noise:

grep -E "CompactionTask|Compacted .* to level" /var/log/cassandra/system.log | tail -20

Failure modes & rollback

Compaction starvation masked by a high pending count

Symptom: PendingTasks climbs steadily while ActiveTasks sits at 0 and read p99 degrades.
Detection: nodetool compactionstats shows a large pending tasks value but no active rows, and BytesCompacted is flat across samples.
Root cause: the CompactionExecutor is blocked, usually by disk I/O saturation, exhausted file descriptors, or a wedged compaction holding a resource.
Rollback / remediation: raise concurrent_compactors or the throughput budget only after confirming disk headroom; if a single table is wedged, nodetool stop COMPACTION cancels in-flight tasks so the executor can drain the rest, then let it resubmit. If merge errors are in play, route them through compaction error categorization & logging rather than a generic channel.

False alarms from TWCS window rollover

Symptom: periodic pending spikes that alert on-call, then self-resolve.
Detection: the spike period matches your TWCS compaction_window_size, and ActiveTasks is non-zero throughout with BytesCompacted advancing.
Root cause: window expiry legitimately queues a burst of same-window SSTables.
Rollback: do not raise throughput; instead make the alert strategy-aware. Suppress depth-only triggers during rollover windows and rely on the output-based rules (flat BytesCompacted, ActiveTasks at 0) so you catch real starvation without paging on healthy bursts.
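
Suppressing depth-only triggers requires knowing when a rollover is in progress. The sketch below assumes window boundaries fall on multiples of the window size counted from the Unix epoch, and that bursts finish within a grace period after each boundary. The function name and the 900-second default grace are hypothetical; verify both assumptions against your own spike history before relying on them.

```python
def in_rollover_window(
    now_epoch_s: float, window_s: int, grace_s: int = 900
) -> bool:
    """True if now falls within grace_s seconds after a window boundary.

    Assumes boundaries at multiples of window_s since the Unix epoch;
    grace_s is an illustrative default, not a measured value.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return (now_epoch_s % window_s) < grace_s


if __name__ == "__main__":
    one_hour = 3600
    print(in_rollover_window(7_260, one_hour))  # 60 s after a boundary
    print(in_rollover_window(9_000, one_hour))  # 30 min after a boundary
```

Use the result only to mute depth-based alerts; the stalled-output checks should stay active during rollover, since a stall at that moment is still a stall.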

Poller-induced JMX pressure

Symptom: latency worsens shortly after deploying the tracker.
Detection: GC pauses and RPC timeouts correlate with your poll cadence; disabling the poller clears them.
Root cause: sub-scrape-interval polling or an unbounded number of concurrent nodetool subprocesses.
Rollback: stop the daemon, restore the 15–30s cadence with the timeout and backoff shown above, and prefer a long-lived JMX/Prometheus scrape over per-poll nodetool process spawns.
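
When the poller is generalized to a fleet, capping the number of polls in flight keeps the subprocess count bounded. The following sketch uses the standard-library thread pool for that; poll_fleet and its parameters are hypothetical names, and poll_one stands in for any per-node function that, like the poller above, returns None on failure instead of raising.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def poll_fleet(
    nodes: Iterable[str],
    poll_one: Callable[[str], Optional[dict]],
    max_workers: int = 4,
) -> Dict[str, Optional[dict]]:
    """Poll every node with at most max_workers polls running at once."""
    node_list = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with node_list.
        results = list(pool.map(poll_one, node_list))
    return dict(zip(node_list, results))


if __name__ == "__main__":
    def fake_poll(node: str) -> Optional[dict]:
        # Stand-in for a real per-node poll; returns a fixed snapshot.
        return {"pending": 0, "active": []}

    print(poll_fleet(["10.0.0.1", "10.0.0.2"], fake_poll, max_workers=2))
```

The worker cap is the bound on concurrent nodetool processes, so size it to what the monitoring host can spawn comfortably rather than to the fleet size.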

FAQ

How is compaction velocity different from PendingTasks?

PendingTasks is queue depth — a snapshot of how many merges are waiting. Velocity is the derivative of the BytesCompacted counter over time, i.e. how fast the queue is actually draining. Depth can be high transiently and still be healthy; velocity tells you whether work is moving. Alert on the two together, never on depth alone.

What ActiveTasks-versus-PendingTasks pattern signals real starvation?

ActiveTasks at 0 while PendingTasks is non-zero and BytesCompacted is flat. That combination means the CompactionExecutor has work queued but is producing nothing — the unambiguous starvation signature. A high pending count with active rows advancing is just backlog, and often expected under STCS bursts or TWCS rollover.

Does async compaction tracking affect anti-entropy repair scheduling?

Yes. Repair validation compactions compete with regular compaction for the same disk I/O, so a saturated compaction queue slows validation and can cause repair timeouts. Schedule incremental repair during compaction lulls, or pause it when pending exceeds a safety threshold. The interaction is detailed under read repair vs anti-entropy repair.

Should I read metrics from nodetool or JMX in automation?

Read JMX (or a Prometheus JMX-exporter scrape) for counters you differentiate, such as BytesCompacted, because it avoids spawning a subprocess per sample. Use nodetool compactionstats for the human-verifiable queue snapshot and for the per-compaction progress column. On Cassandra 4.0 and later, prefer the system_views.sstable_tasks virtual table over text parsing.

Why did my velocity drop below 30% of compaction_throughput with the throttle unchanged?

Because the configured budget is a ceiling, not a guarantee. When actual velocity falls below ~30% of the limit while disk is not idle, the node is I/O bound, under page-cache pressure, or out of file descriptors. Raising compaction_throughput will not help — investigate the disk subsystem and nodetool info first.

Related guides

Advanced Compaction Strategy Tuning & Monitoring: the parent guide covering strategy selection, tuning, and observability end to end.
Interpreting nodetool compactionstats output: column-by-column parsing and threshold mapping for the text surface.
Python monitoring for Cassandra compaction: integrating pollers with Prometheus, Datadog, and custom telemetry.
Compaction backlog analysis & alerting: dynamic thresholds, tiered severity, and automated remediation.
Strategy selection for time-series workloads: why TWCS rollover produces expected pending spikes.