Diagnosing Gossip Protocol Failures in Multi-DC Cassandra 4.x/5.x Clusters

Use this runbook when nodes in a multi-datacenter (DC) Apache Cassandra cluster are being marked DN on some peers but respond normally on others, when nodetool status disagrees between datacenters, or when cross-DC repair streams stall because a peer keeps flapping between UP and DOWN. Asymmetric routing, MTU fragmentation on the inter-DC link, JVM garbage-collection pauses, and I/O saturation from background maintenance each inflate the Phi Accrual suspicion score on a healthy remote node, and a single false conviction ripples into stalled streaming, compaction backlog, and silent replica divergence. This page assumes Cassandra 4.0, 4.1, or 5.0, a nodetool/JMX path to every target node, the DataStax Python Driver 3.25+ for the baseline collector, and every node reporting UN from at least one vantage point before you start (a node that is DN from everywhere is a real outage, not a gossip problem). It sits under Node Gossip & Failure Detection Protocols; read that first for the phi math and the gossip state machine, because every threshold below assumes you know how the gossip protocol converges. The diagnostics here are deterministic, idempotent, and read-only by default, and they correlate gossip degradation against the storage-engine load documented across Cassandra Architecture & Compaction Fundamentals.

The decision tree below outlines the triage path from an observed symptom through telemetry inspection to a likely root cause and its corrective action.

Pre-conditions & safety gates

Before touching any runtime parameter, confirm the deployment is healthy enough to diagnose safely and that the symptom is genuinely gossip degradation rather than a real node failure. These checks are read-only; run the full sequence and stop if any gate fails.

# 1. Cross-DC state agreement. Run on one node in EACH datacenter and diff the
#    output — a node that is UN locally but DN from the remote DC is the
#    signature of a cross-DC gossip/link problem, not a dead node.
nodetool status

Safety Check: If a node is genuinely DN from every vantage point, this is a real outage — stop and recover the node, do not tune thresholds. Expected Output:

Datacenter: dc-west
=====================
--  Address     Load    Tokens  Owns  Host ID    Rack
UN  10.0.1.11   412 GiB  16     ?     3f1a...     rack1
UN  10.0.1.12   408 GiB  16     ?     9c22...     rack1
Datacenter: dc-east
=====================
--  Address     Load    Tokens  Owns  Host ID    Rack
UN  10.0.2.21   410 GiB  16     ?     a71b...     rack1
DN  10.0.2.22   405 GiB  16     ?     b83c...     rack1

Rollback Path: N/A (read-only).
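The cross-DC diff in step 1 can be made mechanical. The sketch below parses two captures of nodetool status into address-to-state maps and reports only the endpoints on which the two vantage points disagree. The function names parse_status and diff_vantage_points are hypothetical, and the parser assumes the column layout shown in the expected output above (state code first, address second).

```python
import re

# Data rows start with a two-letter state code (U/D + N/L/J/M) and an address.
STATUS_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")

def parse_status(text: str) -> dict[str, str]:
    """Map each endpoint address to its state code, skipping header lines."""
    states: dict[str, str] = {}
    for line in text.splitlines():
        match = STATUS_ROW.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states

def diff_vantage_points(local: str, remote: str) -> dict[str, tuple[str, str]]:
    """Return {address: (state_seen_locally, state_seen_remotely)} for disagreements."""
    a, b = parse_status(local), parse_status(remote)
    return {
        ep: (a.get(ep, "??"), b.get(ep, "??"))
        for ep in sorted(set(a) | set(b))
        if a.get(ep) != b.get(ep)
    }

if __name__ == "__main__":
    # Captures taken from one node in each DC (sample data, not real output).
    east_view = "UN  10.0.2.21  410 GiB  16  ?  a71b...  rack1\nUN  10.0.2.22  405 GiB  16  ?  b83c...  rack1"
    west_view = "UN  10.0.2.21  410 GiB  16  ?  a71b...  rack1\nDN  10.0.2.22  405 GiB  16  ?  b83c...  rack1"
    print(diff_vantage_points(east_view, west_view))
```

An empty result means both datacenters agree. Any entry that is UN on one side and DN on the other is the asymmetric signature this runbook targets.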

# 2. Clock synchronization. Skew corrupts gossip generation/version vectors and
#    invalidates failure-detector timestamps, so enforce drift <= 50 ms first.
chronyc tracking | grep -E "System time|Last offset"   # or: ntpq -p

Safety Check: Absolute offset must be under 50 ms on every node. A skewed clock makes a healthy peer look permanently overdue and no threshold tuning will fix it. Expected Output: System time : 0.000012 seconds slow of NTP time. Rollback Path: N/A.
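To turn the clock gate into a pass/fail check, the offset can be extracted from the chronyc tracking output and compared against the 50 ms ceiling. This is a minimal sketch that assumes the "System time : N seconds slow|fast of NTP time" line format shown in the expected output; the names system_offset_ms and clock_gate_passes are illustrative only.

```python
import re

MAX_OFFSET_MS = 50.0  # drift ceiling enforced by this runbook

def system_offset_ms(tracking_output: str) -> float:
    """Return the absolute system clock offset in milliseconds."""
    match = re.search(
        r"System time\s*:\s*([0-9.]+)\s+seconds\s+(?:slow|fast)", tracking_output
    )
    if match is None:
        raise ValueError("no 'System time' line found in chronyc tracking output")
    return float(match.group(1)) * 1000.0

def clock_gate_passes(tracking_output: str) -> bool:
    """True when the offset is under the ceiling, regardless of direction."""
    return system_offset_ms(tracking_output) < MAX_OFFSET_MS

if __name__ == "__main__":
    sample = "System time     : 0.000012 seconds slow of NTP time"
    print(system_offset_ms(sample), clock_gate_passes(sample))
```

Run it against every node's output and stop at the first failure, since one skewed node is enough to invalidate the rest of the diagnosis.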

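# 3. Configuration parity across DCs. endpoint_snitch, the seed list,
#    phi_convict_threshold, and cross_dc_tcp_keep_alive must match exactly,
#    or peers disagree about who owns which tokens and who is a seed.
#    A key that is commented out or absent is running at its default, so
#    report that explicitly instead of printing nothing.
for k in endpoint_snitch seeds phi_convict_threshold cross_dc_tcp_keep_alive; do
  grep -E "^\s*(- )?${k}:" /etc/cassandra/cassandra.yaml || echo "${k}: <not set, default applies>"
done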

Safety Check: Values must be identical on all nodes. A DC running SimpleSnitch while another runs GossipingPropertyFileSnitch will mis-route and mis-convict. Expected Output: phi_convict_threshold: 8, endpoint_snitch: GossipingPropertyFileSnitch, cross_dc_tcp_keep_alive: true. Rollback Path: N/A.
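Comparing the grep output by eye does not scale past a handful of nodes. The sketch below groups nodes by the value each one reports for a key and returns only the keys that diverge. It assumes you have already collected the per-node grep output by whatever transport you use; parse_settings and parity_violations are hypothetical helper names.

```python
PARITY_KEYS = ("endpoint_snitch", "phi_convict_threshold", "cross_dc_tcp_keep_alive")

def parse_settings(grep_output: str) -> dict[str, str]:
    """Turn 'key: value' lines into a dict, keeping only the parity keys."""
    settings: dict[str, str] = {}
    for line in grep_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in PARITY_KEYS:
            settings[key.strip()] = value.strip()
    return settings

def parity_violations(per_node: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return {key: {value: [nodes]}} for every key whose value differs between nodes."""
    violations: dict[str, dict[str, list[str]]] = {}
    for key in PARITY_KEYS:
        by_value: dict[str, list[str]] = {}
        for node, settings in sorted(per_node.items()):
            # A missing key is treated as its own value so it still shows up as drift.
            by_value.setdefault(settings.get(key, "<unset>"), []).append(node)
        if len(by_value) > 1:
            violations[key] = by_value
    return violations

if __name__ == "__main__":
    nodes = {
        "10.0.1.11": parse_settings("endpoint_snitch: GossipingPropertyFileSnitch\nphi_convict_threshold: 8"),
        "10.0.2.21": parse_settings("endpoint_snitch: SimpleSnitch\nphi_convict_threshold: 8"),
    }
    print(parity_violations(nodes))
```

An empty dict means the gate passes. Anything else names the drifting key and which nodes hold which value.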

Prohibited during active gossip instability: never run nodetool assassinate, nodetool decommission, or nodetool stopdaemon while nodes are flapping. These bypass consensus and, on a partitioned topology view, risk data loss. Reconciliation here must respect the same failure-detection window that gates anti-entropy repair and streaming.

Implementation

Work through three read-only-by-default stages: baseline the phi and thread-pool telemetry, correlate any high-phi peers against compaction and repair I/O, then — only if the cause is confirmed network asymmetry — raise phi_convict_threshold reversibly.

Stage 1: Baseline gossip and failure-detector telemetry

Cassandra 4.x/5.x expose thread-pool saturation through the system_views.thread_pools virtual table over CQL. Per-endpoint phi values are not available through CQL — source them from the JMX FailureDetector MBean or by parsing nodetool failuredetector. The collector below does both without mutating state.

#!/usr/bin/env python3
# Requires: Python 3.10+, cassandra-driver 3.25+, nodetool on PATH.
"""
Idempotent gossip baseline validator for Cassandra v4.x/v5.x.
Reads thread-pool saturation from system_views.thread_pools and per-endpoint
phi values from `nodetool failuredetector`, without mutating state.
"""
import re
import subprocess
import sys
import json
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.auth import PlainTextAuthProvider
from cassandra.policies import WhiteListRoundRobinPolicy

SAFETY_CHECKS = {
    "min_version": 4,
    "max_phi_threshold": 12.0,
    "read_only": True,
}

def validate_cluster_version(session) -> int:
    """Refuse to run against releases that predate the system_views keyspace."""
    row = session.execute("SELECT release_version FROM system.local LIMIT 1").one()
    major = int(row.release_version.split('.')[0])
    if major < SAFETY_CHECKS["min_version"]:
        raise RuntimeError(f"Unsupported version {row.release_version}. system_views requires Cassandra 4.0+")
    return major

def collect_gossip_baseline(contact_points: list[str],
                            username: str | None = None,
                            password: str | None = None) -> dict:
    auth = PlainTextAuthProvider(username, password) if username else None
    # Virtual tables are node-local: rows describe whichever node coordinates
    # the query. Pin the driver to the contact points so the thread-pool data
    # comes from the same node that `nodetool failuredetector` inspects.
    profile = ExecutionProfile(load_balancing_policy=WhiteListRoundRobinPolicy(contact_points))
    cluster = Cluster(contact_points, auth_provider=auth, connect_timeout=10,
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    baseline: dict = {"peers": [], "thread_pools": []}
    try:
        session = cluster.connect()
        validate_cluster_version(session)

        # Per-endpoint phi values are not exposed via CQL; parse `nodetool failuredetector`.
        for ep, phi in parse_failuredetector().items():
            baseline["peers"].append({"peer": ep, "phi": phi})

        # Thread-pool saturation check via the real system_views.thread_pools table.
        monitored = ("GossipStage", "CompactionExecutor", "RepairSession")
        tp_rows = session.execute("""
            SELECT name, active_tasks, pending_tasks, completed_tasks, blocked_tasks
            FROM system_views.thread_pools
        """)
        for row in tp_rows:
            if row.name not in monitored:
                continue
            baseline["thread_pools"].append({
                "pool": row.name,
                "active": row.active_tasks,
                "pending": row.pending_tasks,
                "completed": row.completed_tasks,
                "blocked": row.blocked_tasks,
            })
    finally:
        # Release driver threads and sockets even when a check above raises.
        cluster.shutdown()
    return baseline

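def parse_failuredetector() -> dict[str, float]:
    """Return {endpoint: phi} parsed from `nodetool failuredetector` output."""
    result = subprocess.run(
        ["nodetool", "failuredetector"],
        capture_output=True, text=True, timeout=15,
    )
    if result.returncode != 0:
        raise RuntimeError(f"nodetool failuredetector failed: {result.stderr.strip()}")

    phi_map: dict[str, float] = {}
    # Row layout can differ between releases (whitespace- or comma-separated),
    # e.g. "/10.0.0.1   8.0   true" or "/10.0.0.1,   8.00000000", so split on both.
    for line in result.stdout.splitlines():
        fields = re.split(r"[,\s]+", line.strip())
        if len(fields) < 2:
            continue
        # Drop an optional "hostname/" prefix and keep the address part.
        endpoint = fields[0].rsplit("/", 1)[-1]
        # Strip a trailing ":port" from IPv4 endpoints; leave IPv6 literals alone.
        if endpoint.count(":") == 1:
            endpoint = endpoint.split(":", 1)[0]
        try:
            # float() accepts "Infinity" and "NaN" as well as plain decimals.
            phi = float(fields[1])
        except ValueError:
            continue  # header row or unrelated output
        if endpoint:
            phi_map[endpoint] = phi
    return phi_map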

if __name__ == "__main__":
    try:
        result = collect_gossip_baseline(["127.0.0.1"])
        print(json.dumps(result, indent=2))
    except Exception as e:
        print(f"[FATAL] Baseline collection failed: {e}", file=sys.stderr)
        sys.exit(1)

Safety Check: The script validates the Cassandra major version before querying system_views, caps the connect timeout at 10s to prevent thread exhaustion, and mutates nothing. Expected Output: JSON mapping each peer to its phi value plus active/pending/completed/blocked counts for GossipStage, CompactionExecutor, and RepairSession. Rollback Path: On any failure or malformed data, cluster state is untouched and no restart is required.

Stage 2: Correlate gossip degradation with compaction and repair load

False-positive gossip failures frequently originate from I/O starvation: when compaction or repair consumes available disk bandwidth, the GossipStage thread pool queues heartbeat packets and phi climbs past phi_convict_threshold on peers that are actually alive. The script below confirms whether high-phi peers coincide with backlog before you touch any threshold.

#!/usr/bin/env bash
# compaction_gossip_correlation.sh
# Requires: nodetool + iostat (sysstat) + bc on PATH.
# Correlates high phi values with compaction/repair backlog. Read-only.

set -euo pipefail

SAFETY_GATE() {
    local disk_util
    # -y omits the since-boot report (sysstat 10+), so %util reflects the sampled
    # second rather than the lifetime average. Match sd*, nvme*, vd*, xvd*, and
    # dm-* device names; %util is the last column.
    disk_util=$(iostat -dxy 1 1 | awk '/^(sd|nvme|vd|xvd|dm-)/{print $NF}' | sort -rn | head -1)
    if [ -z "$disk_util" ]; then
        echo "[WARN] Could not read disk utilization from iostat. Skipping I/O gate."
        return 0
    fi
    if (( $(echo "$disk_util > 90" | bc -l) )); then
        echo "[WARN] Disk I/O utilization >90%. Deferring diagnostics to prevent further contention."
        exit 0
    fi
}

SAFETY_GATE

echo "[INFO] Extracting high-phi nodes..."
# nodetool failuredetector rows look like "/10.0.0.1   8.0   true"; some
# releases separate the columns with commas instead, so normalize commas to
# spaces first. Strip everything up to the "/" and compare the phi column.
HIGH_PHI_NODES=$(nodetool failuredetector | tr ',' ' ' | awk '$2+0 > 7.0 {sub(/^.*\//,"",$1); print $1}')

if [ -z "$HIGH_PHI_NODES" ]; then
    echo "[OK] No nodes exceed phi threshold. Gossip healthy."
    exit 0
fi

echo "[INFO] Checking compaction/repair backlog for affected nodes..."
for node in $HIGH_PHI_NODES; do
    echo "--- Node: $node ---"
    # Query the affected peer itself with -h; without it every iteration would
    # re-read the local node. This needs the remote JMX path assumed up front.
    nodetool -h "$node" compactionstats | grep -E "pending|active" || echo "No compaction backlog"
    # nodetool 4.x/5.x: tpstats still valid; parse GossipStage/RepairSession rows.
    nodetool -h "$node" tpstats | grep -E "RepairSession|GossipStage" | awk '{print $1, $2, $3}' \
        || echo "[WARN] tpstats unavailable for $node"
done

echo "[INFO] Correlation complete."

Safety Check: SAFETY_GATE halts if disk utilization exceeds 90%, so the diagnostic itself never compounds I/O starvation. The script only reads nodetool output. Expected Output: A filtered list of IPs with phi > 7.0, followed by pending/active compaction counts and RepairSession/GossipStage thread-pool metrics for each. High phi that coincides with a large pending-compaction count points to I/O contention rather than a network fault — the same interaction described in the parent guide, and why repair cadence must track the STCS, LCS, and TWCS trade-offs in force on each table. Rollback Path: If you find compaction was manually paused (nodetool disableautocompaction), log the exact nodetool enableautocompaction command; resume only with operator confirmation.

Stage 3: Reversible phi_convict_threshold tuning

Only when Stage 2 rules out I/O starvation and the cause is confirmed network asymmetry should you raise phi_convict_threshold. There is no nodetool getter/setter for it: change it persistently in cassandra.yaml (requires a restart) or, for a runtime adjustment, invoke setPhiConvictThreshold on the JMX FailureDetector MBean (org.apache.cassandra.net:type=FailureDetector). The example drives JMX via jmxterm; the change is bounded and auto-reverts.

#!/usr/bin/env python3
# Requires: Python 3.10+, jmxterm on PATH, JMX reachable at 127.0.0.1:7199.
"""
Safe phi_convict_threshold adjustment with automatic rollback.
There is no nodetool command for this value; we use the JMX FailureDetector
MBean (org.apache.cassandra.net:type=FailureDetector) via jmxterm.
"""
import subprocess
import sys
import time

TARGET_THRESHOLD = 10.0
DEFAULT_THRESHOLD = 8.0
RESTORE_DELAY = 300  # seconds
JMX_HOST = "127.0.0.1:7199"
FD_MBEAN = "org.apache.cassandra.net:type=FailureDetector"

def jmxterm(script: str, timeout: int = 30) -> str:
    """Run a jmxterm command script against the local Cassandra JMX port."""
    result = subprocess.run(
        ["jmxterm", "-l", JMX_HOST, "-n", "-v", "silent"],
        input=script, capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"jmxterm failed: {result.stderr.strip()}")
    return result.stdout.strip()

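def read_phi_threshold() -> float:
    """Read the current PhiConvictThreshold attribute from the FailureDetector MBean."""
    raw = jmxterm(f"get -b {FD_MBEAN} PhiConvictThreshold\n")
    # jmxterm may print either a bare number or "PhiConvictThreshold = 8.0;".
    # Keep whatever follows the last "=" and drop the trailing semicolon.
    value = raw.rpartition("=")[2].strip().rstrip(";").strip()
    return float(value)

def set_phi_threshold(value: float) -> None:
    """Invoke setPhiConvictThreshold(double) on the FailureDetector MBean."""
    jmxterm(f"run -b {FD_MBEAN} setPhiConvictThreshold {value}\n")

def adjust_phi_threshold() -> None:
    print("[SAFETY] Reading current phi_convict_threshold via JMX...")
    original = read_phi_threshold()
    print(f"[INFO] Current PhiConvictThreshold: {original}")
    if original != DEFAULT_THRESHOLD:
        print(f"[WARN] Node is not at the default {DEFAULT_THRESHOLD}; will restore {original}.")

    print(f"[INFO] Applying temporary threshold: {TARGET_THRESHOLD}")
    set_phi_threshold(TARGET_THRESHOLD)
    try:
        print(f"[INFO] Monitoring gossip stabilization for {RESTORE_DELAY}s...")
        time.sleep(RESTORE_DELAY)
    finally:
        # Runs on normal completion and on Ctrl-C, so the override never
        # outlives the script. Restore the value that was read, not a constant.
        print("[ROLLBACK] Restoring original threshold...")
        set_phi_threshold(original)
        print(f"[OK] Threshold restored to {read_phi_threshold()}.")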

if __name__ == "__main__":
    try:
        adjust_phi_threshold()
    except Exception as e:
        print(f"[FATAL] Adjustment failed. Manual rollback required: {e}", file=sys.stderr)
        sys.exit(1)

Safety Check: The script reads the current threshold before modifying it; apply only if phi consistently exceeds 8.0 across multiple intervals. Timeout guards prevent hanging JMX calls. A JMX setPhiConvictThreshold change is in-memory only and is lost on restart — mirror the value into cassandra.yaml if you intend to keep it. Expected Output: Confirmation of the threshold change, a 300-second stabilization window, then automatic restoration to 8.0. Rollback Path: Restoration triggers automatically after RESTORE_DELAY. If the script crashes mid-run, re-invoke setPhiConvictThreshold 8.0 on the FailureDetector MBean or restart the node to fall back to the cassandra.yaml value. No data mutation occurs.

Verification steps

Confirm gossip has genuinely converged before you resume repair or streaming, and export the failure-detector metrics so the next flap is caught automatically.

# 1. Cross-DC agreement: every node must read UN from BOTH datacenters.
nodetool status | grep -E "^[UD][NLJM] "

Safety Check: No DN/DL lines, and the output matches when run from a node in each DC. Expected Output: Every data line begins with UN. Rollback Path: N/A.

# 2. Phi has settled below the convict threshold for the previously-flapping peer.
nodetool failuredetector

Safety Check: Every peer’s phi is well under 8.0 (typically < 1.0 on a quiet ring). Expected Output: /10.0.2.22 0.31 true. Rollback Path: N/A.

# 3. GossipStage queue has drained — no packets backing up behind I/O.
nodetool tpstats | grep -E "Pool Name|GossipStage"

Safety Check: GossipStage Pending and Blocked are at or near 0 for a continuous 30–60s window before launching repair. Expected Output:

Pool Name        Active   Pending   Completed   Blocked   All time blocked
GossipStage           0         0      184213         0                 0

Rollback Path: N/A.
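The "continuous 30–60s window" in step 3 is easy to misjudge from a single tpstats call. One way to check it is to sample GossipStage periodically and require that every sample in the trailing window is quiet. The sketch below operates on samples you have already collected; the tuple layout, the function name gossip_drained, and the tolerance parameter are assumptions made for illustration.

```python
def gossip_drained(samples: list[tuple[float, int, int]],
                   window_s: float = 30.0,
                   tolerance: int = 0) -> bool:
    """
    samples: (unix_timestamp, pending, blocked) tuples, oldest first.
    Returns True when the most recent unbroken run of quiet samples
    spans at least window_s seconds.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    quiet_since = None
    # Walk backwards from the newest sample until the first noisy one.
    for ts, pending, blocked in reversed(samples):
        if pending > tolerance or blocked > tolerance:
            break
        quiet_since = ts
    return quiet_since is not None and (newest_ts - quiet_since) >= window_s

if __name__ == "__main__":
    history = [(0.0, 4, 0), (10.0, 0, 0), (20.0, 0, 0), (30.0, 0, 0), (40.0, 0, 0)]
    print(gossip_drained(history))
```

A single noisy sample resets the window, which matches the intent of waiting for the queue to stay drained rather than merely touch zero.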

# 4. Confirm no fresh conviction lines are still being logged.
grep -E "InetAddress .* is now (DOWN|UP)|Cannot handshake" /var/log/cassandra/system.log | tail

Safety Check: No new is now DOWN lines after the tuning window closed. Expected Output: Only historical entries; nothing timestamped after remediation. Rollback Path: N/A.

For continuous coverage, export the FailureDetector MBean and GossipStage queue depth to your metrics backend via the Prometheus JMX exporter, and alert on phi breaching phi_convict_threshold or GossipStage Pending staying above zero. Wire a validation gate that blocks nodetool repair whenever phi exceeds 7.5 across more than 20% of peers, so repair streams never amplify flapping into cascading evictions — the same gossip-gated discipline that protects the token ring from mis-routed coordination during a partition.
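The repair gate described above reduces to a small predicate over the per-peer phi map. The following is a minimal sketch: repair_allowed is a hypothetical name, the 7.5 and 20% defaults come from the rule stated above, and failing closed on an empty map is a design choice of this example rather than something the runbook mandates.

```python
def repair_allowed(phi_by_peer: dict[str, float],
                   phi_limit: float = 7.5,
                   max_fraction: float = 0.20) -> bool:
    """Block repair when more than max_fraction of peers exceed phi_limit."""
    if not phi_by_peer:
        # No telemetry means no evidence of health: fail closed.
        return False
    # float("inf") compares greater than any limit, so convicted peers count as hot.
    hot = sum(1 for phi in phi_by_peer.values() if phi > phi_limit)
    return hot / len(phi_by_peer) <= max_fraction

if __name__ == "__main__":
    ring = {f"10.0.2.{i}": 0.3 for i in range(8)}
    ring["10.0.2.22"] = 9.1
    ring["10.0.2.23"] = float("inf")
    # 2 of 10 peers are hot, which is exactly 20% and therefore still allowed.
    print(repair_allowed(ring))
```

Call it from whatever wrapper launches nodetool repair and skip the run when it returns False.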

Troubleshooting

Unable to gossip with any peers on node startup. The joining node cannot reach any seed within ring_delay. In a multi-DC topology this almost always means the seed list points at addresses unreachable across the WAN, or a firewall/security group blocks the storage port (7000/7001) between DCs. Root cause: seeds not reachable cross-DC. Fix: ensure every DC’s seed list includes at least one reachable seed in another DC, confirm port 7000 (or 7001 for TLS) is open both directions, and verify the advertised broadcast_address is the routable inter-DC IP — not a private address that only resolves inside one DC.
A healthy node flaps DOWN then UP repeatedly across the WAN. The Phi Accrual detector treats a long stop-the-world GC pause or a latency spike on the inter-DC link exactly like a dead node, because heartbeats simply stop arriving. Root cause: the conviction window is shorter than the peer’s worst-case pause/jitter. Fix: correlate the is now DOWN/is now UP log lines against gc.log; if they line up with GC, tune the heap and collector first, and raise phi_convict_threshold to 10–12 only as a stopgap for genuine cross-DC jitter. Never drop it below 5: that buys slightly faster detection at the cost of false convictions, redundant streaming, and repair thrash.
RuntimeException: A node with address /x.x.x.x already exists, cancelling join, or gossip generation collisions. Two processes claim the same endpoint, or a replaced node rejoined with a stale gossip generation after a clock jump. Root cause: gossip generation/version vectors corrupted by clock skew or a duplicated address. Fix: enforce chronyd/ntpd drift under 50 ms cluster-wide (pre-condition check 2), confirm no two nodes share a broadcast_address, and let gossip converge for a full ring_delay before retrying the join rather than reaching for nodetool assassinate, which bypasses consensus.
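For the flapping case above, it helps to translate a threshold into seconds of tolerated silence. The sketch below assumes the exponential approximation of Phi Accrual, in which phi grows linearly with the time since the last heartbeat divided by the mean inter-arrival time, scaled by 1/ln(10). Treat that formula and the 1-second mean used in the example as assumptions and defer to the parent guide for the authoritative math; phi and silence_to_convict are illustrative names.

```python
import math

# Assumed scaling for the exponential approximation: phi = elapsed / (mean * ln 10).
PHI_FACTOR = 1.0 / math.log(10.0)

def phi(elapsed_s: float, mean_interval_s: float) -> float:
    """Suspicion score after elapsed_s seconds without a heartbeat."""
    if mean_interval_s <= 0:
        raise ValueError("mean_interval_s must be positive")
    return PHI_FACTOR * elapsed_s / mean_interval_s

def silence_to_convict(threshold: float, mean_interval_s: float) -> float:
    """Seconds of silence needed before phi reaches the given threshold."""
    if mean_interval_s <= 0:
        raise ValueError("mean_interval_s must be positive")
    return threshold * mean_interval_s / PHI_FACTOR

if __name__ == "__main__":
    # Assumes heartbeats normally arrive about once per second.
    for threshold in (5, 8, 10, 12):
        print(threshold, round(silence_to_convict(threshold, 1.0), 1))
```

Under these assumptions, a threshold of 8 tolerates roughly 18 seconds of silence and 12 roughly 28 seconds, so a GC pause or link stall has to fit inside that budget to avoid a conviction. A larger mean inter-arrival time on a jittery WAN link stretches the budget proportionally.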

Node Gossip & Failure Detection Protocols — the parent guide covering the Phi Accrual math, the gossip state machine, and the config surface every gate here builds on.
Read Repair vs Anti-Entropy Repair — why a wrongly-convicted replica is excluded from reconciliation, and how repair cadence must respect the failure-detection window.
Data Partitioning & Token Ring Basics — how gossip liveness gates which replicas coordinate reads and writes across the ring.
Understanding STCS vs LCS vs TWCS — how a table’s compaction strategy sets its tolerance for gossip convergence delay and stale replicas.
