Fallback Routing & Read Path Optimization in Cassandra 4.x/5.x

In production Apache Cassandra clusters, read latency degradation rarely originates from transient network partitions. The root cause usually traces back to degraded replica states, an unmanaged compaction backlog, or a misaligned retry policy. Optimizing the read path therefore means three things at once: deterministic fallback routing when a replica stalls, precise speculative-execution thresholds tuned to your real latency distribution, and coordinator routing that is aware of which replicas are drowning in compaction.

This guide is for operators who need read-path SLAs to hold even while background merging saturates disk. It sits under Advanced Compaction Strategy Tuning & Monitoring; read that first if you have not yet aligned table strategy with workload, because a misconfigured strategy is the fastest way to make every technique below ineffective. Everything here is validated against Cassandra 4.0, 4.1, and 5.0, with version-specific behavior called out inline.

Coordinator decision logic and speculative execution

Every read enters through a coordinator node that resolves the replica set from the partition key, orders those replicas by the snitch, and dispatches according to the requested consistency level. The coordinator does not blindly trust its first-choice replica: it consults the dynamic snitch, which scores each replica on recent latency and reported severity (compaction and I/O load), and the phi-accrual failure detector, which uses gossip heartbeats to decide whether a node is up or down. A replica that has been marked down, or that the dynamic snitch has scored poorly, is skipped before a single byte crosses the wire.

Speculative execution is the fallback mechanism layered on top of that ordering. When the first replica fails to respond within a computed window, the coordinator dispatches a redundant read to the next-fastest replica rather than waiting for a timeout. The window is governed by the per-table speculative_retry option. Cassandra 4.0 removed read_repair_chance and dclocal_read_repair_chance entirely; read repair is now controlled by the separate read_repair table option ('BLOCKING' by default, or 'NONE'), and speculation is controlled independently. The stock default, speculative_retry = '99PERCENTILE' (printed as '99p' by DESCRIBE TABLE on 4.x), is production-safe for general OLTP traffic: the coordinator maintains a latency histogram per table and fires the redundant read once elapsed time crosses the 99th percentile of recent reads. Latency-sensitive services often replace the percentile with a hard millisecond threshold, such as 50ms or 100ms, to bound tail latency deterministically rather than letting it float with the histogram.

The critical coupling is this: speculative retries multiply read load exactly when replicas are already struggling. If the reason a replica is slow is that its compaction backlog has driven SSTable fan-out and read amplification through the roof, then firing speculative reads at its peers, which are likely compacting too, amplifies I/O contention. The first symptom is usually ReadTimeoutException; once overloaded replicas start being marked down, it can escalate into cascading UnavailableException errors. Fallback routing must therefore be compaction-aware: retry thresholds have to be calibrated against real compaction throughput, and unhealthy replicas should be de-weighted at the routing layer, not just retried around at the query layer.

The coordinator's fallback-routing decision flow is captured in two figures.

Figure: Coordinator read-path fallback routing: health-and-compaction filter first, speculative retry second.

Figure: A single speculative read: the window sets when the redundant read to B fires, and both too-tight and too-loose settings have a cost.

Compaction backlog as the primary latency vector

Compaction directly dictates read-path efficiency. As SSTables accumulate, bloom-filter checks, partition-index lookups, tombstone scans, and row-cache misses stall the read stage thread pool, and the coordinator sees rising per-replica latency long before any node is marked down. This is why fallback routing cannot be reasoned about in isolation from merge health. The relevant JMX surface is org.apache.cassandra.metrics:type=Compaction — specifically PendingTasks, BytesCompacted, and the CompactionExecutor pending count. When PendingTasks on a node consistently exceeds 2 × concurrent_compactors, that node should be de-weighted for reads or shielded by application-side backpressure until it drains.
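
"Consistently exceeds" needs a concrete definition before automation can act on it. One minimal interpretation, assumed here rather than taken from Cassandra, is "every one of the last N samples was above the threshold". `BacklogWatch` and its `samples` parameter are hypothetical names.

```python
from collections import deque


class BacklogWatch:
    """Flag a node only after N consecutive PendingTasks samples breach the limit."""

    def __init__(self, concurrent_compactors: int, samples: int = 5):
        self.threshold = 2 * concurrent_compactors   # the 2x rule from the text
        self.window: deque[int] = deque(maxlen=samples)

    def observe(self, pending_tasks: int) -> bool:
        """Record one sample; return True when the node should be de-weighted."""
        self.window.append(pending_tasks)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(p > self.threshold for p in self.window)


watch = BacklogWatch(concurrent_compactors=2, samples=3)
verdicts = [watch.observe(p) for p in (9, 9, 9, 1)]
```

Requiring a full window of breaches keeps one noisy sample from flapping a node in and out of rotation; a single healthy sample clears the flag again.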

Data-model alignment minimizes the read amplification that feeds this loop in the first place. The trade-offs between STCS, LCS, and TWCS determine how many SSTables a partition spans, and therefore how expensive the read path is under fallback. For append-only, time-ordered data, the strategy selection for time-series workloads guide explains why TimeWindowCompactionStrategy isolates time-bound data and prevents cross-window tombstone scans that would otherwise degrade every fallback read, while LeveledCompactionStrategy keeps read latency predictable for random-access patterns at the cost of higher write amplification. Read amplification is worst when tombstones survive past expiry, so pairing routing decisions with disciplined tombstone management and garbage collection is what keeps the fallback path cheap. When backlog breaches safe thresholds, automation should dynamically lower compaction throughput via nodetool setcompactionthroughput so foreground reads reclaim I/O, rather than routing ever more traffic onto shrinking headroom.

Configuration reference

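The options below govern how aggressively the coordinator speculates and how long it waits before declaring a read failed. Speculation and read-repair options are table-scoped (set with ALTER TABLE); timeout and concurrency options are node-scoped in cassandra.yaml. The node-scoped names shown are the 4.0 forms; Cassandra 4.1 and later still load them but also accept unit-suffixed replacements such as read_request_timeout: 5000ms and dynamic_snitch_update_interval: 100ms.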

Key	Scope	Default	Valid range	Impact on the read path
`speculative_retry`	table	`99PERCENTILE`	`NONE`, `ALWAYS`, `<N>PERCENTILE`, `<N>ms`	Sets when the redundant fallback read fires; a fixed `ms` value bounds tail latency, a percentile floats with the histogram
`read_repair`	table	`BLOCKING`	`BLOCKING`, `NONE`	`BLOCKING` blocks the read on repairing divergent replicas; `NONE` favors latency over on-read convergence
`read_request_timeout_in_ms`	node	`5000`	above RTT + disk seek, up to a few seconds	Hard ceiling before the coordinator returns `ReadTimeoutException`; must exceed your speculative window
`concurrent_reads`	node	`32`	`16` – `128` (disk-bound)	Sizes the read stage pool; too low and speculation stalls before it can help
`dynamic_snitch_badness_threshold`	node	`0.1` (`1.0` on 4.1+)	`0.0` – `1.0`	How much worse a replica must score before the coordinator routes around it
`dynamic_snitch_update_interval_in_ms`	node	`100`	`100` – `1000`	How quickly load/latency scores react to a replica falling behind on compaction

Override speculative_retry at the table level only when a strict SLA demands it, and always pair it with a read_request_timeout_in_ms calibrated to baseline network RTT plus disk-seek latency. A minimal table override for a latency-sensitive table:

-- Cassandra 4.x/5.x: bound tail latency with a fixed speculative window,
-- and keep on-read convergence via BLOCKING read repair.
ALTER TABLE ks.orders WITH
  speculative_retry = '75ms'
  AND read_repair = 'BLOCKING';

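Note that there is no nodetool settableproperty; table options are changed only through CQL. For a temporary change, nodetool settimeout read <ms> adjusts the read timeout on a running node without persisting it. Persistent node-scoped values live in cassandra.yaml and take effect after a rolling restart: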

# cassandra.yaml — Cassandra 4.x/5.x
read_request_timeout_in_ms: 5000    # must exceed the widest speculative window
concurrent_reads: 32                # raise only after confirming disk can sustain it
dynamic_snitch_badness_threshold: 1.0
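
The rule that the read timeout must exceed the speculative window is easy to check mechanically before a change ships. This is a minimal sketch with a hypothetical helper name; it only understands fixed `<N>ms` values, because percentile, NONE, and ALWAYS settings have no fixed window to compare.

```python
import re


def window_exceeds_timeout(speculative_retry: str, read_timeout_ms: int) -> bool:
    """True when a fixed speculative window would never fire before the timeout."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*ms\s*", speculative_retry, re.IGNORECASE)
    if match is None:
        return False  # NONE, ALWAYS, or a percentile: nothing fixed to compare
    return float(match.group(1)) >= read_timeout_ms


safe = window_exceeds_timeout("75ms", read_timeout_ms=5000)
broken = window_exceeds_timeout("6000ms", read_timeout_ms=5000)
```

A True result means the coordinator would give up on the read before the redundant request could ever be sent, so the table override is doing nothing.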

Step-by-step: deploy compaction-aware fallback routing

The procedure below stands up dynamic routing that excludes compaction-degraded nodes from reads, then tightens speculation once the fleet is healthy. Each step includes a safety gate you should not skip.

1. Establish the read-latency baseline

Before touching any threshold, capture real percentiles so you tune against data, not a guess. nodetool tablestats reports only mean latency, so use histograms:

# Safety gate: both must return cleanly before you change anything.
nodetool tablehistograms ks.orders     # per-table read/write latency percentiles
nodetool proxyhistograms               # coordinator-level end-to-end latency

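Expected output (abridged) shows the p50/p95/p99 read column you will anchor speculative_retry to. nodetool reports these latencies in microseconds, so divide by 1,000 before comparing them with a millisecond window: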

ks/orders histograms
Percentile  Read Latency  Write Latency
50%             454.83         35.43
95%            1131.75         88.15
99%            4055.27        105.78
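
Because the histogram is in microseconds and speculative_retry takes milliseconds, a small parser avoids unit mistakes. The sketch below reads the abridged sample shown above; `read_percentiles_ms` is a hypothetical helper, and it assumes the read latency is the second whitespace-separated column, as it is in that sample.

```python
HISTOGRAM_TEXT = """\
ks/orders histograms
Percentile  Read Latency  Write Latency
50%             454.83         35.43
95%            1131.75         88.15
99%            4055.27        105.78
"""


def read_percentiles_ms(text: str) -> dict[str, float]:
    """Map each percentile label to its read latency, converted from µs to ms."""
    result: dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a label such as "95%"; headers do not.
        if len(parts) >= 2 and parts[0].endswith("%"):
            result[parts[0]] = float(parts[1]) / 1000.0
    return result


baseline = read_percentiles_ms(HISTOGRAM_TEXT)
```

For this sample the p95 works out to a little over 1 ms and the p99 to a little over 4 ms, so a fixed window in the tens of milliseconds sits well above both.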

2. Read compaction pressure per node

Confirm which replicas are backlogged so routing has something to act on. The goal is for redundant reads to land on replicas other than these nodes:

# Safety gate: a node with PendingTasks > 2 x concurrent_compactors is a de-weight candidate.
nodetool compactionstats -H
nodetool getcompactionthroughput

On Cassandra 4.0 and later you can read the in-flight task state structurally, avoiding text parsing:

-- Cassandra 4.0+ virtual table: live compaction task state.
-- progress and total are expressed in the units named by the unit column.
SELECT keyspace_name, table_name, kind, progress, total, unit
FROM system_views.sstable_tasks;

3. Drive routing from live backlog telemetry

The workflow below polls compaction backlog over the Jolokia HTTP agent and swaps the driver’s load-balancing policy to exclude degraded nodes. It uses asynchronous collection so telemetry never blocks the read path.

#!/usr/bin/env python3
# requirements: Python 3.10+, aiohttp>=3.9, cassandra-driver>=3.28
# Jolokia must be deployed as a Cassandra JVM agent for the /jolokia endpoint to exist.
"""Compaction-aware fallback routing for Cassandra 4.x/5.x.

Polls per-node compaction backlog via Jolokia and rebuilds the driver's
load-balancing policy so reads avoid replicas that are behind on merging.
"""
import asyncio

import aiohttp
from cassandra.cluster import Cluster, Session
from cassandra.policies import WhiteListRoundRobinPolicy

COMPACTION_MBEAN = "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks"


def _extract_jmx_value(payload: dict) -> int:
    """A Jolokia read nests the attribute under "value"; it may be a scalar
    or a {attribute: value} map. Coerce to a numeric pending-task count."""
    value = payload.get("value", 0)
    if isinstance(value, dict):
        for key in ("Value", "value", "Count"):
            if key in value:
                value = value[key]
                break
        else:
            value = next(iter(value.values()), 0)
    return int(value or 0)


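async def fetch_compaction_backlog(session: aiohttp.ClientSession, node_ip: str) -> int:
    """Return PendingTasks for one node, or a high sentinel on failure so a
    node we cannot reach is treated as degraded, never silently trusted."""
    url = f"http://{node_ip}:8778/jolokia/read/{COMPACTION_MBEAN}"
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            resp.raise_for_status()
            # Jolokia may answer as text/plain, so skip aiohttp's content-type check.
            payload = await resp.json(content_type=None)
        # Jolokia reports MBean errors inside the JSON body, often with HTTP 200.
        if not isinstance(payload, dict) or payload.get("status") != 200 or "value" not in payload:
            return 1_000_000
        return _extract_jmx_value(payload)
    except (aiohttp.ClientError, asyncio.TimeoutError, ValueError, TypeError):
        return 1_000_000  # unreachable or unparseable node -> treat as degraded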


async def healthy_nodes(cluster_ips: list[str], concurrent_compactors: int = 2) -> list[str]:
    """Keep only nodes whose backlog is at or under 2 x concurrent_compactors."""
    threshold = 2 * concurrent_compactors
    async with aiohttp.ClientSession() as session:
        backlogs = await asyncio.gather(
            *(fetch_compaction_backlog(session, ip) for ip in cluster_ips)
        )
    return [ip for ip, pending in zip(cluster_ips, backlogs) if pending <= threshold]


def connect_with_fallback(healthy: list[str]) -> "Session":
    """Open a session whose policy excludes degraded nodes from routing.

    Guard clause: never connect against an empty allow-list, which would
    strand the app; raise so the caller can fall back to its own contact points.
    """
    if not healthy:
        raise RuntimeError("no healthy replicas below the backlog threshold")
    policy = WhiteListRoundRobinPolicy(healthy)
    cluster = Cluster(contact_points=healthy, load_balancing_policy=policy)
    return cluster.connect()  # a Session; the Cluster is reachable as session.cluster

The WhiteListRoundRobinPolicy restricts the driver to the nodes you pass, which means it controls which nodes act as coordinators; replica selection behind each coordinator is still decided server-side by the dynamic snitch. Rebuild the policy on a schedule rather than per query so you are not reconnecting under load. Wiring this poller into Prometheus or a custom exporter is covered in Python monitoring for Cassandra compaction, and the velocity-versus-depth math that decides when a node counts as degraded lives in async compaction tracking and metrics.

4. Tighten speculation once the fleet is healthy

Only after routing keeps traffic off backlogged nodes should you narrow the speculative window. Apply a fixed ms value set above step 1's p95 (after converting the histogram's microseconds to milliseconds), verify tail latency, then decide whether to keep it:

ALTER TABLE ks.orders WITH speculative_retry = '75ms';

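Safety gate: never set a speculative window below your p95 read latency. At the p95 itself roughly one read in twenty already speculates, and the share climbs steeply as the window approaches the median, where about half of all reads fire a redundant request.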
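
The safety gate can be quantified: given a few measured percentiles, the share of reads slower than a candidate window is roughly the share that will speculate. The sketch below interpolates linearly between known percentiles, which is a simplifying assumption (real latency distributions are rarely linear between points), and `speculation_fraction` is a hypothetical helper. The input values are the step 1 sample converted to milliseconds.

```python
def speculation_fraction(percentiles_ms: dict[float, float], window_ms: float) -> float:
    """Estimate the share of reads slower than window_ms (0.0 to 1.0)."""
    points = sorted(percentiles_ms.items())          # (percentile, latency in ms)
    lowest_pct, lowest_ms = points[0]
    highest_pct, highest_ms = points[-1]
    if window_ms < lowest_ms:
        return 1.0 - lowest_pct / 100                # a lower bound on the share
    if window_ms >= highest_ms:
        return 1.0 - highest_pct / 100               # an upper bound on the share
    for (lo_pct, lo_ms), (hi_pct, hi_ms) in zip(points, points[1:]):
        if lo_ms <= window_ms <= hi_ms:
            span = hi_ms - lo_ms
            share = (window_ms - lo_ms) / span if span else 1.0
            return 1.0 - (lo_pct + share * (hi_pct - lo_pct)) / 100
    raise ValueError("percentile latencies must be non-decreasing")


step1 = {50: 0.45483, 95: 1.13175, 99: 4.05527}      # ms, from the sample histogram
at_p95 = speculation_fraction(step1, 1.13175)        # window exactly at p95
below_median = speculation_fraction(step1, 0.4)      # window under the median
generous = speculation_fraction(step1, 75.0)         # the 75 ms window from this guide
```

A window at the p95 speculates on about 5% of reads, a window under the median on at least half of them, and a window above the p99 on at most 1%.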

Verification and observability

Confirm the change did what you intended across three independent surfaces:

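Tail latency moved the right way: re-run nodetool proxyhistograms and compare p99 and Max before and after. nodetool's percentiles stop at 99%, so take p99.9 from client-side or JMX metrics if your SLA is written against it. A tighter speculative_retry should lower the tail without materially raising overall read load.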
Speculation is not thrashing: watch the per-table SpeculativeRetries and SpeculativeFailedRetries counters under org.apache.cassandra.metrics:type=Table. A speculative-retry rate climbing toward your total read rate means the window is too tight or replicas are genuinely degraded.
Routing is excluding the right nodes: cross-check the poller’s healthy_nodes() output against nodetool compactionstats -H on each node. On Cassandra 4.0 and later, reconcile against SELECT * FROM system_views.sstable_tasks;, which lists in-flight tasks rather than the pending count.
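
The "speculation is not thrashing" check comes down to a ratio of two counter deltas. In the sketch below the dictionary keys are hypothetical names for values you would scrape yourself: `speculative_retries` from the per-table SpeculativeRetries counter, and `reads` from a coordinator read count such as the Count of CoordinatorReadLatency (an assumption; use whichever read counter your exporter exposes).

```python
def speculative_ratio(prev: dict[str, int], curr: dict[str, int]) -> float:
    """Share of reads in the interval that fired a redundant request."""
    reads = curr["reads"] - prev["reads"]
    retries = curr["speculative_retries"] - prev["speculative_retries"]
    if reads <= 0:
        return 0.0  # idle interval or counter reset after a restart
    return retries / reads


earlier = {"reads": 1_000, "speculative_retries": 10}
later = {"reads": 2_000, "speculative_retries": 60}
ratio = speculative_ratio(earlier, later)
```

A ratio that stays near the share implied by your window (about 1% for a p99 window) is healthy; one that keeps climbing toward 1.0 matches the thrashing described above.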

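Grep the coordinator logs to correlate speculation spikes with real events rather than noise. Treat the pattern below as a starting point, since which of these strings appear depends on your version and log level: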

grep -E "SpeculativeRetry|ReadTimeout|isLatencyForSnitch" /var/log/cassandra/system.log | tail -20

Failure modes and rollback

Speculative-retry amplification storm

Symptom: read load roughly doubles and p99 worsens shortly after tightening speculative_retry. Detection: SpeculativeRetries rate approaches the total read rate; nodetool proxyhistograms shows every percentile shifting up. Root cause: the window sits so far below real p95, typically near or under the median, that a large share of reads fire a redundant request, and the redundant traffic itself induces the latency it was meant to avoid. Rollback: ALTER TABLE ks.orders WITH speculative_retry = '99PERCENTILE'; to restore the adaptive default, then re-derive a fixed value strictly above p95 from fresh histograms.

Cascading UnavailableException under compaction load

Symptom: reads begin returning UnavailableException on a subset of nodes during a maintenance window. Detection: the affected nodes show PendingTasks far above 2 × concurrent_compactors in nodetool compactionstats, and speculative reads are being routed onto their equally backlogged peers. Root cause: compaction-blind routing kept firing redundant reads at nodes with no I/O headroom. Rollback: enable the step-3 poller so degraded nodes are excluded, and temporarily lower compaction throughput with nodetool setcompactionthroughput so foreground reads reclaim disk. Categorize any accompanying merge errors through compaction error categorization and logging to separate transient I/O stalls from corrupted SSTables.

Tombstone resurrection during node replacement

Symptom: deleted rows reappear after a decommission or replace. Detection: nodetool status was not UN for all replicas, or nodetool repair had not completed within gc_grace_seconds before streaming began. Root cause: streaming from a replica that missed a delete past the grace window revives tombstoned data. Rollback: re-run anti-entropy repair across the replica set immediately, and gate future lifecycle operations on a pre-flight check that verifies UN status, zero pending compactions, and a completed repair inside the grace window.

FAQ

When should I use a fixed ms speculative_retry instead of a percentile?

Use a fixed ms value when you have a hard tail-latency SLA and a stable read-latency distribution, because it bounds the window deterministically regardless of histogram drift. Stay on 99PERCENTILE for general OLTP traffic with variable load, because it adapts as latency shifts. Never set a fixed value below your measured p95: at p95 about 5% of reads already speculate, and the share grows quickly as the window drops toward the median.

Why do speculative retries make latency worse instead of better?

Because speculation trades extra read load for lower tail latency, and that trade only pays off when peer replicas have spare I/O. When the slow replica is slow due to compaction backlog, its peers are usually backlogged too, so the redundant reads pile onto saturated disks and raise latency across the board. Fix routing and compaction headroom first, then speculate.

How does read_repair relate to speculative_retry in Cassandra 4.x/5.x?

They are independent options since Cassandra 4.0 removed read_repair_chance and dclocal_read_repair_chance. speculative_retry controls fallback routing — when a redundant read fires — while read_repair (BLOCKING or NONE) controls whether the coordinator repairs divergent replicas on the read path. Tune them separately: speculation for latency, read repair for convergence.

Can fallback routing compensate for a wrong compaction strategy?

No. Routing around a degraded node buys time, but if the table’s strategy produces chronic read amplification, every replica trends toward degraded and there is nowhere healthy to route. Align the strategy with the workload first; routing and speculation are latency insurance, not a substitute for the right merge behavior.

What must pass before I decommission or replace a node?

Three gates: nodetool status shows UN for the replica set, nodetool compactionstats reports zero pending compactions on the target, and nodetool repair completed within the last gc_grace_seconds. Skipping any of these risks tombstone resurrection during streaming.

Advanced Compaction Strategy Tuning & Monitoring — the parent guide covering strategy selection, tuning, and observability end to end.
Compaction backlog analysis & alerting — dynamic thresholds and tiered severity for the backlog signal that drives routing.
Async compaction tracking & metrics — velocity-versus-depth math that decides when a replica counts as degraded.
Python monitoring for Cassandra compaction — wiring pollers into Prometheus, Datadog, and custom telemetry.
Strategy selection for time-series workloads — choosing TWCS vs LCS to keep the fallback read cheap.
