Automating Tombstone Threshold Alerts with Python in Cassandra 4.x/5.x

Tombstone accumulation is a primary driver of read amplification, compaction backpressure, and the TombstoneOverwhelmingException failures that abruptly fail reads on a live cluster. This page delivers a complete, copy-paste Python 3.10+ engine that scrapes per-table tombstone metrics from nodetool tablestats, applies configurable warn/fail thresholds, enforces cooldown windows so a prolonged compaction stall cannot storm your on-call rotation, and routes structured alerts to an incident-management webhook.

It sits beneath tombstone management and garbage collection; read that first for the model of when a marker becomes purgeable, then use this page when you specifically need automated early warning before a table’s scan cost breaches the tombstone_failure_threshold.

Prerequisites: Cassandra 4.0, 4.1, or 5.0 with a reachable nodetool path, Python 3.10+ (for the union type-hint syntax below) plus the requests library, and a node in the UN state. The scrape itself is read-only, but the alerts it emits should gate real operational responses, so treat its output as an input to your runbook rather than a passive dashboard.

Pre-conditions & safety gates

Every check below is read-only. Run the full sequence on one representative node before rolling the alerter out cluster-wide, and stop if any gate fails.

1. Python runtime and dependency

python3 -c "import sys; assert sys.version_info >= (3,10), 'Python 3.10+ required'; import requests; print('PASS: runtime + requests validated')"

Safety Check: Fails fast if the interpreter is older than 3.10 or if requests is missing — the only third-party dependency the engine imports. Expected Output: PASS: runtime + requests validated Rollback Path: If validation fails, install a newer interpreter (apt install python3.11) and pip install requests into a dedicated virtualenv; do not symlink over the system python3.

2. nodetool reachability and the metric field

timeout 10 nodetool version >/dev/null 2>&1 \
  && nodetool tablestats -F json >/dev/null 2>&1 \
  && echo "PASS: nodetool responsive" || echo "WARN: fall back to plain tablestats parsing"

Safety Check: The 10-second timeout prevents the gate from hanging on a saturated JMX port. nodetool tablestats replaced the deprecated cfstats alias; the -F json flag exists on Cassandra 4.1+/5.0 but not 4.0, which is why the parser below defaults to the plain-text Maximum tombstones per slice (last five minutes) line that is present on every 4.x and 5.x release. Expected Output: PASS: nodetool responsive Rollback Path: On WARN, confirm the node is UN via nodetool status; the plain-text parser still works, so proceed. Do not run the alerter against a DN/UJ node whose stats are stale.
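The same gate can be expressed in Python when you would rather test for the metric label itself than for the exit code alone. This is a minimal sketch: has_tombstone_metric and probe are hypothetical helper names, and the only assumption about nodetool is the plain-text label quoted above.

```python
import re
import subprocess

# Label the plain-text parser depends on, as quoted in the gate description.
LABEL = re.compile(r"Maximum tombstones per slice \(last five minutes\):\s*\d+")


def has_tombstone_metric(tablestats_output: str) -> bool:
    """Return True if at least one per-table tombstone line is present."""
    return LABEL.search(tablestats_output) is not None


def probe(timeout_sec: int = 10) -> bool:
    """Run plain `nodetool tablestats` and look for the metric label.

    Hypothetical helper: returns False rather than raising when nodetool is
    missing, exceeds the timeout, or exits non-zero.
    """
    try:
        result = subprocess.run(
            ["nodetool", "tablestats"],
            capture_output=True, text=True, timeout=timeout_sec, check=True,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired,
            subprocess.CalledProcessError):
        return False
    return has_tombstone_metric(result.stdout)
```

A False result from probe() on a node that is UN suggests the label text differs on that build, which is worth resolving before relying on the parser.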

3. Least-privilege context and state directory

install -d -o cassandra-monitor -g cassandra-monitor -m 0750 /var/lib/cassandra-monitor \
  && [ "$(id -u)" -ne 0 ] && echo "PASS: non-root, state dir ready" \
  || echo "WARN: running as root or state dir not created; use a dedicated service account"

Safety Check: The engine only reads stats and writes a small JSON state file, so it never needs root. A dedicated, writable state directory is required for the cooldown deduplication to survive restarts. Expected Output: PASS: non-root, state dir ready Rollback Path: Create an unprivileged account (useradd -r -s /sbin/nologin cassandra-monitor), grant it a read-only sudoers rule scoped to nodetool tablestats, and re-run the gate.
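If you prefer to verify the same pre-condition from the interpreter the engine will run under, a check along these lines may help. It is a sketch only: state_dir_problems is a hypothetical helper, and it reports problems for the current effective user rather than fixing anything.

```python
import os
from pathlib import Path


def state_dir_problems(state_dir: Path) -> list[str]:
    """List reasons the current user should not run the engine against state_dir."""
    problems: list[str] = []
    if os.geteuid() == 0:
        problems.append("running as root")
    if not state_dir.is_dir():
        problems.append(f"{state_dir} is not a directory")
    elif not os.access(state_dir, os.W_OK | os.X_OK):
        # Write plus search permission is needed to create and replace the state file.
        problems.append(f"{state_dir} is not writable by uid {os.geteuid()}")
    return problems


if __name__ == "__main__":
    issues = state_dir_problems(Path("/var/lib/cassandra-monitor"))
    print("PASS" if not issues else "WARN: " + "; ".join(issues))
```

Run it as the service account (for example via sudo -u cassandra-monitor) so the result reflects the permissions the timer-driven run will actually have.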

Only proceed once all three gates pass. Threshold calibration is not arbitrary: because a marker is only reclaimed after gc_grace_seconds elapses and compaction merges every overlapping SSTable, a table’s tombstone-per-slice count reflects both your delete pattern and whether anti-entropy repair is keeping pace. Set warn/fail thresholds against the read-path budget, meaning the tombstone_warn_threshold and tombstone_failure_threshold configured in cassandra.yaml, rather than picking a round number.
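One way to anchor the alert thresholds to the read path is to derive them as fractions of the node's own cassandra.yaml limits. The sketch below assumes you have already read those two values from your configuration; derive_alert_thresholds and the 0.5 fractions are illustrative choices, not recommendations from Cassandra.

```python
def derive_alert_thresholds(
    yaml_warn: int,
    yaml_fail: int,
    warn_fraction: float = 0.5,
    fail_fraction: float = 0.5,
) -> tuple[int, int]:
    """Return (warn, fail) alert thresholds below the node's own limits.

    yaml_warn / yaml_fail are the tombstone_warn_threshold and
    tombstone_failure_threshold values taken from cassandra.yaml.
    The fractions are assumptions to tune per workload.
    """
    if not (0 < warn_fraction <= 1 and 0 < fail_fraction <= 1):
        raise ValueError("fractions must be in (0, 1]")
    warn = int(yaml_warn * warn_fraction)
    fail = int(yaml_fail * fail_fraction)
    if warn >= fail:
        raise ValueError("derived warn threshold must be below fail threshold")
    return warn, fail


# Example input values; substitute what your cassandra.yaml actually contains.
warn, fail = derive_alert_thresholds(1000, 100000)
print(f"--warn-threshold {warn} --fail-threshold {fail}")
```

Alerting at a fraction of the configured limit leaves headroom to react before reads start logging warnings or failing outright.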

Implementation

The engine runs one pass per invocation: shell out to nodetool tablestats, parse the maximum tombstones-per-slice per keyspace.table, evaluate each table against warn and fail thresholds, and, for any breach whose cooldown has elapsed, dispatch a structured alert and record the timestamp. State is a single JSON file written atomically (temp-file-plus-replace) so a crash mid-save cannot leave a half-written cooldown ledger, and a --dry-run gate lets you validate parsing against live output before any webhook fires. Choosing nodetool over a JMX exporter keeps the dependency surface to requests alone and keeps behaviour consistent across 4.x and 5.x, because the plain-text label the parser reads is present on every one of those releases.

Save the following as cassandra_tombstone_monitor.py.

#!/usr/bin/env python3
# Requires: Python 3.10+, requests, nodetool on PATH, target node in UN state.
"""
Cassandra tombstone threshold alerting engine (4.x/5.x compatible).
Idempotent and stateful: scrapes nodetool tablestats, applies warn/fail
thresholds, enforces per-table cooldown, and routes alerts to a webhook.
"""
import argparse
import json
import logging
import re
import subprocess
import sys
import time
from dataclasses import dataclass, field
from pathlib import Path

import requests

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger("cassandra_tombstone_monitor")


@dataclass
class AlertState:
    """Tracks last-alert time per table so cooldown suppresses alert storms."""
    last_alerted: dict[str, float] = field(default_factory=dict)
    cooldown_seconds: int = 3600

    def should_alert(self, table_key: str) -> bool:
        return (time.time() - self.last_alerted.get(table_key, 0.0)) >= self.cooldown_seconds

    def record_alert(self, table_key: str) -> None:
        self.last_alerted[table_key] = time.time()

    def save(self, path: Path) -> None:
        # Atomic write: temp file then replace, so a crash cannot leave partial state.
        tmp = path.with_suffix(".tmp")
        try:
            tmp.write_text(json.dumps(self.last_alerted, indent=2))
            tmp.replace(path)
        except OSError:
            tmp.unlink(missing_ok=True)
            raise

    @classmethod
    def load(cls, path: Path) -> "AlertState":
        if path.exists():
            state = cls()
            state.last_alerted = json.loads(path.read_text())
            return state
        return cls()


def run_nodetool(command: list[str], timeout_sec: int = 30) -> str:
    """Execute a read-only nodetool call with a hard timeout and exit-code check."""
    logger.info("Executing: %s", " ".join(command))
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_sec, check=True
        )
        return result.stdout
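    except FileNotFoundError:
        # nodetool is not on PATH for this account; fail with a clear message.
        logger.error("nodetool not found on PATH")
        sys.exit(1)
    except subprocess.TimeoutExpired:
        logger.error("nodetool timed out after %ds", timeout_sec)
        sys.exit(1)
    except subprocess.CalledProcessError as e:
        logger.error("nodetool failed (exit %d): %s", e.returncode, e.stderr.strip())
        sys.exit(1)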


def parse_tablestats(output: str) -> dict[str, int]:
    """Extract max tombstones-per-slice per keyspace.table from tablestats output."""
    tombstone_map: dict[str, int] = {}
    current_keyspace: str | None = None
    current_table: str | None = None

    keyspace_re = re.compile(r"^\s*Keyspace\s*:\s*(.+)$")
    table_re = re.compile(r"^\s*Table:\s*(.+)$")
    tombstone_re = re.compile(
        r"^\s*Maximum tombstones per slice \(last five minutes\):\s*(\d+)$"
    )

    for line in output.splitlines():
        if ks := keyspace_re.match(line):
            current_keyspace, current_table = ks.group(1).strip(), None
            continue
        if tbl := table_re.match(line):
            name = tbl.group(1).strip()
            current_table = f"{current_keyspace}.{name}" if current_keyspace else name
            continue
        if current_table and (tomb := tombstone_re.match(line)):
            tombstone_map[current_table] = int(tomb.group(1))
    return tombstone_map


def send_webhook_alert(url: str, payload: dict, retries: int = 3) -> bool:
    """POST the alert to an incident-management webhook with exponential backoff."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                url, json=payload,
                headers={"Content-Type": "application/json"}, timeout=10,
            )
            resp.raise_for_status()
            logger.info("Alert dispatched (HTTP %d)", resp.status_code)
            return True
        except requests.RequestException as e:
            logger.warning("Webhook attempt %d/%d failed: %s", attempt + 1, retries, e)
            # Back off before the next attempt, but do not sleep after the last one.
            if attempt < retries - 1:
                time.sleep(2 ** attempt)
    logger.error("Failed to dispatch alert after %d retries", retries)
    return False


def main() -> None:
    parser = argparse.ArgumentParser(description="Cassandra tombstone threshold monitor")
    parser.add_argument("--warn-threshold", type=int, default=5000)
    parser.add_argument("--fail-threshold", type=int, default=50000)
    parser.add_argument("--webhook-url", required=True, help="Incident-management webhook URL")
    parser.add_argument("--state-file", default="/var/lib/cassandra-monitor/tombstone_state.json")
    parser.add_argument("--cooldown", type=int, default=3600, help="Per-table cooldown seconds")
    parser.add_argument("--dry-run", action="store_true", help="Evaluate without sending alerts")
    args = parser.parse_args()

    state_path = Path(args.state_file)
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state = AlertState.load(state_path)
    state.cooldown_seconds = args.cooldown

    tombstones = parse_tablestats(run_nodetool(["nodetool", "tablestats"]))

    triggered = False
    for table_key, count in tombstones.items():
        severity = ("CRITICAL" if count >= args.fail_threshold
                    else "WARNING" if count >= args.warn_threshold else None)
        # Guard clause: skip healthy tables and any table still inside its cooldown window.
        if not severity or not state.should_alert(table_key):
            continue

        payload = {
            "service": "cassandra-tombstone-monitor",
            "table": table_key,
            "severity": severity,
            "tombstone_count": count,
            "threshold": args.fail_threshold if severity == "CRITICAL" else args.warn_threshold,
            "timestamp": time.time(),
        }
        logger.info("Threshold breached: %s (%s) — %d tombstones", table_key, severity, count)

        if args.dry_run:
            logger.info("[DRY-RUN] Would alert for %s (%s)", table_key, severity)
            triggered = True
        elif send_webhook_alert(args.webhook_url, payload):
            state.record_alert(table_key)
            triggered = True

    # Only persist state when a live alert actually fired, so dry-runs stay side-effect free.
    if triggered and not args.dry_run:
        state.save(state_path)
    else:
        logger.info("No new alerts dispatched; state preserved.")


if __name__ == "__main__":
    main()

Safety Check: subprocess.run enforces a hard timeout and check=True; the webhook retries with exponential backoff; AlertState.save is atomic; the cooldown guard clause makes repeat invocations idempotent; and --dry-run never writes the state file or hits the network (it does still create the state directory if it is missing). Expected Output: Structured [INFO] log lines per breach, and, outside --dry-run, an HTTP 2xx from the webhook. Rollback Path: Delete /var/lib/cassandra-monitor/tombstone_state.json to force immediate re-evaluation on the next run; the engine recreates it after the next live alert. To stop entirely, disable the timer (below); there is no other state to unwind.
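One limitation of a purely time-based cooldown is that a table which alerted at WARNING and then climbs to CRITICAL inside the window stays silent until the window expires. The sketch below shows one possible severity-aware variant. It is not part of the engine above: it assumes a ledger that stores the last severity alongside the timestamp, which is a different on-disk shape from the flat table-to-timestamp map the engine writes.

```python
import time
from typing import Optional

# Higher rank means more severe; names match the engine's severity strings.
SEVERITY_RANK = {"WARNING": 1, "CRITICAL": 2}


def should_alert(ledger: dict, table_key: str, severity: str,
                 cooldown_seconds: int, now: Optional[float] = None) -> bool:
    """Alert if the cooldown has elapsed OR severity rose since the last alert."""
    now = time.time() if now is None else now
    entry = ledger.get(table_key)
    if entry is None:
        return True  # never alerted for this table
    if SEVERITY_RANK[severity] > SEVERITY_RANK[entry["severity"]]:
        return True  # escalation bypasses the cooldown
    return (now - entry["ts"]) >= cooldown_seconds


def record_alert(ledger: dict, table_key: str, severity: str,
                 now: Optional[float] = None) -> None:
    """Store both the timestamp and the severity of the alert just sent."""
    ledger[table_key] = {
        "ts": time.time() if now is None else now,
        "severity": severity,
    }
```

Adopting this would mean migrating the existing state file, since entries become small objects rather than bare floats.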

Deploy under systemd

# /etc/systemd/system/cassandra-tombstone-monitor.service
[Unit]
Description=Cassandra tombstone threshold monitor
After=cassandra.service
ConditionPathExists=/usr/local/bin/cassandra_tombstone_monitor.py

[Service]
Type=oneshot
User=cassandra-monitor
ExecStart=/usr/bin/python3 /usr/local/bin/cassandra_tombstone_monitor.py \
  --warn-threshold 5000 --fail-threshold 50000 \
  --webhook-url "https://events.pagerduty.com/v2/enqueue" \
  --state-file /var/lib/cassandra-monitor/tombstone_state.json
TimeoutStartSec=45
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/cassandra-tombstone-monitor.timer
[Unit]
Description=Run the Cassandra tombstone monitor every 15 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=15min

[Install]
WantedBy=timers.target

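Safety Check: Type=oneshot with TimeoutStartSec=45 means systemd kills a hung run instead of letting it linger until the next timer tick; the unit runs as the unprivileged cassandra-monitor account from gate 3. Two caveats follow from the script itself. First, its worst case (a 30-second nodetool timeout plus up to three 10-second webhook attempts with backoff per alert) can exceed 45 seconds, and a run killed mid-loop never saves state, so raise TimeoutStartSec if your webhook is slow. Second, the engine posts a flat JSON body, whereas the PagerDuty Events API v2 endpoint shown in ExecStart expects its own envelope (routing_key, event_action, payload), so either adapt the payload or point --webhook-url at a receiver that accepts the flat body. Expected Output: systemctl enable --now cassandra-tombstone-monitor.timer then systemctl list-timers shows the next run; journalctl -u cassandra-tombstone-monitor.service streams the evaluations. Rollback Path: systemctl disable --now cassandra-tombstone-monitor.timer, remove both unit files, then systemctl daemon-reload.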
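Because the unit points at PagerDuty's Events API v2 endpoint, the engine's flat payload would need wrapping before that endpoint accepts it. The following is a hedged sketch of such a wrapper: to_pagerduty_event is a hypothetical name, the routing key and source are values you supply, and the field names reflect the Events API v2 envelope as commonly documented, so check them against PagerDuty's current reference before relying on them.

```python
def to_pagerduty_event(alert: dict, routing_key: str, source: str) -> dict:
    """Wrap the engine's flat alert dict in an Events API v2 style envelope.

    alert is the payload built in main(); routing_key is the integration key
    for your PagerDuty service; source identifies the reporting node.
    """
    # The engine emits CRITICAL or WARNING; the envelope uses lowercase names.
    severity = "critical" if alert["severity"] == "CRITICAL" else "warning"
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable key per table lets the receiver group repeat alerts.
        "dedup_key": f"tombstones:{alert['table']}",
        "payload": {
            "summary": (
                f"{alert['table']}: {alert['tombstone_count']} tombstones per slice "
                f"(threshold {alert['threshold']})"
            ),
            "source": source,
            "severity": severity,
            "custom_details": alert,
        },
    }
```

In the engine this would sit between building payload and calling send_webhook_alert, with the routing key passed in from the environment rather than the command line so it does not appear in the process list.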

Verification steps

Confirm the engine’s numbers agree with Cassandra before you trust them in automation.

# 1. Validate parsing end-to-end without sending alerts.
python3 cassandra_tombstone_monitor.py --dry-run \
  --warn-threshold 100 --fail-threshold 1000 --webhook-url http://localhost

Safety Check: --dry-run makes no network call and writes no state; it exercises only the scrape and threshold logic. Expected Output:

2026-07-04 10:15:02 [INFO] Executing: nodetool tablestats
2026-07-04 10:15:03 [INFO] Threshold breached: keyspace1.users (WARNING) — 245 tombstones
2026-07-04 10:15:03 [INFO] [DRY-RUN] Would alert for keyspace1.users (WARNING)

# 2. Cross-check a flagged table against nodetool's own view.
nodetool tablestats keyspace1.users | grep -i "tombstones per slice"

Safety Check: The Maximum tombstones per slice (last five minutes) value here must match the tombstone_count in the engine’s log for the same table within one poll window. Expected Output: Maximum tombstones per slice (last five minutes): 245

# 3. Confirm the cooldown ledger persisted after a live run.
sudo -u cassandra-monitor cat /var/lib/cassandra-monitor/tombstone_state.json

Safety Check: After a non-dry-run alert, the breaching table must appear with a recent Unix timestamp; a second immediate run must log "No new alerts dispatched; state preserved." instead of re-alerting for that table, proving cooldown suppression. Expected Output:

{
  "keyspace1.users": 1751623200.42,
  "analytics.events": 1751623500.18
}
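Raw Unix timestamps are hard to read at a glance, so it can help to convert the ledger into the time each table has left in its cooldown. This is a small sketch under the assumption that the ledger has the flat table-to-timestamp shape shown above; remaining_cooldown is a hypothetical helper name.

```python
import json
import time
from pathlib import Path


def remaining_cooldown(ledger: dict[str, float], cooldown_seconds: int,
                       now: float) -> dict[str, float]:
    """Seconds until each table may alert again (0.0 means it may alert now)."""
    return {
        table: max(0.0, cooldown_seconds - (now - last_alerted))
        for table, last_alerted in ledger.items()
    }


if __name__ == "__main__":
    state_file = Path("/var/lib/cassandra-monitor/tombstone_state.json")
    ledger = json.loads(state_file.read_text())
    for table, seconds in remaining_cooldown(ledger, 3600, time.time()).items():
        print(f"{table}: {seconds:.0f}s of cooldown remaining")
```

A table showing a non-zero remainder immediately after a live alert is the behaviour the safety check above is looking for.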

Pair every fired alert with a runbook rather than a manual scramble: nodetool compactionstats to confirm compaction is draining, nodetool repair -pr <keyspace> to synchronise replicas so markers past gc_grace_seconds become purgeable, then a re-scrape to confirm the count fell. When the count refuses to drop, the bottleneck is usually the merge itself — the STCS, LCS, and TWCS trade-offs decide how aggressively overlapping SSTables are reconciled, and a saturated queue is best worked through the drain procedure in resolving high compaction backlog without downtime.

Troubleshooting

TombstoneOverwhelmingException fires on reads even though no alert preceded it. The engine samples at a fixed cadence (15 min by default) while Maximum tombstones per slice (last five minutes) is a rolling five-minute maximum, so a short, intense scan burst can breach and decay between polls. Root cause: sampling interval wider than the metric window. Fix: shorten OnUnitActiveSec to 5min so the poll cadence matches the metric’s window, and lower --warn-threshold toward the tombstone_warn_threshold in cassandra.yaml rather than the failure ceiling.
Every table reports 0, or no tables are evaluated at all, and no alerts ever fire. Two different causes look alike here. If tables report 0, the node has probably served no reads in the last five minutes, because the metric only populates on scans. If the parsed map is empty, the regex matched nothing, for example because a custom locale or a different build reformatted the label. Root cause: idle or reformatted tablestats output, not a healthy cluster. Fix: generate read traffic against a known tombstone-heavy table, re-run with --dry-run, and if the label text differs, adjust tombstone_re; on 4.1+/5.0 you can instead parse nodetool tablestats -F json for a locale-stable field.
Alert storm: the same table pages every run during a long compaction. State is not persisting, so should_alert always sees a zero timestamp. Root cause: the service account cannot write --state-file, so AlertState.save raises and the ledger is never updated. Fix: re-run gate 3 to confirm /var/lib/cassandra-monitor is owned by cassandra-monitor and mode 0750, and check journalctl for an OSError on save; correct ownership restores cooldown deduplication immediately.
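The first case above comes down to simple arithmetic: each poll only sees the trailing metric window, so anything between windows is invisible. The sketch below makes that gap explicit. It assumes the metric behaves as a strict sliding five-minute maximum, which is a simplification, and unobserved_fraction is a hypothetical helper name.

```python
def unobserved_fraction(poll_interval_s: float,
                        metric_window_s: float = 300.0) -> float:
    """Fraction of wall-clock time that no poll's metric window covers."""
    if poll_interval_s <= 0 or metric_window_s <= 0:
        raise ValueError("intervals must be positive")
    # Each poll covers the trailing window; the rest of the interval is unseen.
    return max(0.0, 1.0 - metric_window_s / poll_interval_s)


for minutes in (15, 10, 5):
    gap = unobserved_fraction(minutes * 60)
    print(f"poll every {minutes} min: {gap:.0%} of the time is unobserved")
```

Under that assumption a 15-minute cadence leaves two thirds of each interval unobserved, while a 5-minute cadence closes the gap, which is the reasoning behind shortening OnUnitActiveSec.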

Tombstone Management & Garbage Collection — the parent guide on how a delete becomes a purgeable marker and how gc_grace_seconds gates removal.
Understanding Cassandra read repair vs anti-entropy repair — why the repair that clears tombstones must complete inside the grace window.
Resolving high compaction backlog without downtime — what to do when alerts fire because compaction, not deletes, is the bottleneck.
Python Monitoring for Cassandra Compaction — the broader telemetry model these threshold alerts plug into.
