Is there a hard 2 GB or 100 MB limit on a Cassandra partition?

No. The only hard limit is roughly 2 billion cells per partition, and Cassandra warns long before that. The practical ceiling is operational: degradation from compaction I/O, repair streaming, and tombstone scans typically starts between 100 MB and 500 MB depending on the compaction strategy, which is why the target is computed rather than fixed.

Why does the target partition size depend on the compaction strategy?

Each strategy tolerates a different partition size. LeveledCompactionStrategy needs partitions well under 100 MB to avoid L0/L1 overlap saturation, TimeWindowCompactionStrategy works best in a 10–50 MB range aligned to its windows, and SizeTieredCompactionStrategy tolerates up to about 500 MB at the cost of read amplification. The calculator takes the smallest of three values: the strategy ceiling, the throughput-derived budget, and a hard 200 MB streaming safety ceiling.

Where does the calculator read the compaction strategy from?

From system_schema.tables via cqlsh, because the strategy is a per-table property. nodetool describecluster and nodetool status do not report it. If the query fails, the script falls back to SizeTieredCompactionStrategy, the widest bound, so it never recommends an aggressively small size on bad data.

How do I shrink a partition that is already oversized?

Change the data model, not gc_grace_seconds. Add a bucketing or salting component to the partition key so writes spread across more tokens — for time series, a time bucket such as day or hour is typical. You cannot resize an existing partition in place; new writes land in the reshaped partitions and old data ages out or is migrated.
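
As a minimal sketch of the bucketing idea, assuming a hypothetical time-series table keyed by a sensor ID, the helper below derives a day or hour bucket to append to the partition key. The names `bucket_for` and `partition_key` are illustrative and are not part of the calculator.

```python
from datetime import datetime, timezone


def bucket_for(ts: datetime, granularity: str = "day") -> str:
    """Return a UTC time-bucket label to use as an extra partition-key column."""
    ts = ts.astimezone(timezone.utc)
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "hour":
        return ts.strftime("%Y-%m-%dT%H")
    raise ValueError(f"unsupported granularity: {granularity}")


def partition_key(sensor_id: str, ts: datetime, granularity: str = "day") -> tuple[str, str]:
    """Compose the hypothetical composite partition key (sensor_id, bucket)."""
    return (sensor_id, bucket_for(ts, granularity))


reading_time = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(partition_key("sensor-42", reading_time))          # one partition per sensor per day
print(partition_key("sensor-42", reading_time, "hour"))  # one partition per sensor per hour
```

A finer bucket yields smaller partitions but forces reads that span a long time range to touch more partitions, so pick the coarsest bucket that stays under the computed target.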

Is it safe to run the calculator on a live production cluster?

Yes. It is strictly read-only — it calls nodetool status, getcompactionthroughput, and a cqlsh SELECT, and never alters schema or node state, so it is idempotent and safe to re-run. It still refuses to compute when the ring is not all Up/Normal, because transient metrics during a topology change would produce a wrong recommendation.

How to Calculate Optimal Partition Sizes for Cassandra

Partition sizing in Apache Cassandra is an operational constraint, not a theoretical preference. There is no fixed 2 GB byte cap on a partition; the real hard limit is roughly 2 billion cells per partition, and Cassandra emits large-partition warnings well before that. Production degradation routinely begins between 100 MB and 500 MB because of compaction I/O saturation, repair streaming timeouts, and tombstone accumulation. This guide gives DBAs and automation engineers a deterministic method for computing a target partition size on Cassandra 4.x/5.x clusters — one derived from measured throughput rather than folklore, and aligned to the compaction strategy in play. It sits beneath Data Partitioning & Token Ring Basics, which explains how a partition key becomes a physical placement on the ring; here we turn that placement into a size budget. Prerequisites: nodetool and cqlsh on PATH, read access to system_schema.tables, and a healthy ring you can query before you calculate.

Pre-conditions & safety gates

Never compute or apply a sizing target against a deployment that is mid-topology-change or already saturated — the metrics you read back will be transient and the recommendation will be wrong. Clear these gates first.

Gate 1 — Ring health and node state

Every node must be Up/Normal. A node in DN (down) or UL/UJ/UM (leaving, joining, moving) skews ownership and makes any per-vnode partition estimate meaningless.

# Cassandra 4.x/5.x: neither nodetool status nor nodetool info emits JSON,
# so grep the state column directly.
nodetool status | grep -E "^(UN|DN|UL|UJ|UM)"

Expected — every line begins UN (Up/Normal):

UN  10.0.1.10  124.5 KiB  16  100.0%  abcdef01-...  rack1
UN  10.0.1.11  118.2 KiB  16  100.0%  bcdef012-...  rack1

If any node reports DN, UL, UJ, or UM, halt. Do not size or reshape partitions during an active bootstrap, decommission, or move.
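
For automation, the same gate can be expressed as a small parser over the text of nodetool status. This is a sketch that assumes the two-character state code is the first column of each node row, as in the expected output above; the function name `non_normal_nodes` and the extra sample rows are illustrative.

```python
def non_normal_nodes(status_output: str) -> list[str]:
    """Return the nodetool status rows whose state code is anything but UN."""
    statuses = {"U", "D", "?"}       # Up, Down, unknown (assumed marker)
    states = {"N", "L", "J", "M"}    # Normal, Leaving, Joining, Moving
    flagged = []
    for line in status_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        code = fields[0]
        is_state_code = len(code) == 2 and code[0] in statuses and code[1] in states
        if is_state_code and code != "UN":
            flagged.append(line.strip())
    return flagged


# Illustrative sample; the DN and UJ rows are made up for the example.
sample = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID       Rack
UN  10.0.1.10  124.5 KiB  16      100.0%  abcdef01-...  rack1
DN  10.0.1.11  118.2 KiB  16      100.0%  bcdef012-...  rack1
UJ  10.0.1.12  90.1 KiB   16      ?       cdef0123-...  rack1
"""

bad = non_normal_nodes(sample)
print(f"{len(bad)} node(s) block the gate")
for row in bad:
    print(row)
```

An empty list means the gate is clear; anything else should halt the run.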

Gate 2 — No active repair or streaming

Partition-size statistics read from a node that is streaming are distorted by in-flight SSTables. Confirm the node is idle before sampling.

nodetool netstats | grep -E "Mode|Streaming"

Expected — NORMAL mode with no active streams:

Mode: NORMAL
Not sending any streams.

If streams are active, wait for the anti-entropy repair session to finish; running repair inflates the very numbers the calculator depends on.
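
A scripted version of this gate might look like the sketch below. It assumes the idle output contains exactly the two lines shown above and treats anything else as "not idle"; the function name `node_is_idle` is illustrative.

```python
def node_is_idle(netstats_output: str) -> bool:
    """True only when the node reports NORMAL mode and no outbound streams."""
    mode = None
    no_streams = False
    for raw in netstats_output.splitlines():
        line = raw.strip()
        if line.startswith("Mode:"):
            mode = line.split(":", 1)[1].strip()
        elif line == "Not sending any streams.":
            no_streams = True
    # Fail closed: a missing Mode line or a missing idle marker counts as busy.
    return mode == "NORMAL" and no_streams


idle_sample = "Mode: NORMAL\nNot sending any streams.\n"
print(node_is_idle(idle_sample))
print(node_is_idle("Mode: JOINING\nNot sending any streams.\n"))
```

Failing closed keeps the calculator from sampling a node whose output it cannot interpret.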

Gate 3 — Drainable compaction queue

A large pending-compaction backlog means SSTable counts (and therefore per-partition byte estimates) are still in flux.

nodetool compactionstats | grep -E "pending tasks"

Expected — a small, drainable count:

pending tasks: 3

A persistently high backlog is a structural problem, not a sizing input — resolve it via compaction-backlog analysis and alerting before trusting any sizing math.

How partition size maps to compaction, repair, and the ring

Partition size directly dictates SSTable lifecycle and repair topology. When a single partition dominates a vnode, compaction must merge increasingly large SSTables, consuming disk bandwidth and occasionally triggering OutOfMemoryError during nodetool repair streaming. The way primary keys map to the token ring is foundational: improper clustering-key cardinality or unbounded time-series writes create hot partitions that defeat the even distribution described in Data Partitioning & Token Ring Basics. Because every partition is written, merged, and streamed as a unit by the LSM tree mechanics in Cassandra, an optimal size must fall inside three simultaneous budgets: the compaction strategy’s merge tolerance, the repair streaming buffer, and the tombstone purge cycle.

Those three budgets are what the sizing formula intersects:

Compaction strategy overhead. SizeTieredCompactionStrategy (STCS) tolerates larger partitions but suffers rising read amplification as SSTables grow. LeveledCompactionStrategy (LCS) needs partitions well below 100 MB to prevent L0/L1 overlap saturation. TimeWindowCompactionStrategy (TWCS) performs best when partitions align to window boundaries; aim for a 10–50 MB working range and treat the 250 MB strategy ceiling below as an upper bound, not a target. The trade-offs behind these numbers are covered in Understanding STCS vs LCS vs TWCS.
Repair streaming limits. nodetool repair streams the divergent ranges identified by Merkle-tree comparison. Large partitions in those ranges inflate the bytes streamed and frequently trigger StreamingTimeoutException or saturate stream_throughput_outbound_megabits_per_sec (named stream_throughput_outbound in the 4.1+ cassandra.yaml), failing the repair.
Tombstone compaction cost. Oversized partitions delay tombstone purging, making it likelier that a single read scans past the node-level tombstone_failure_threshold (a cassandra.yaml setting, default 100000) during a range scan. Sustained tombstone management keeps that ceiling out of reach.

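The target size is the smallest of a throughput-derived budget, the strategy ceiling, and a hard streaming safety ceiling. Because that hard ceiling is 200 MB, the 250 MB (TWCS) and 500 MB (STCS) strategy values can never be the result; they record each strategy's tolerance, and the hard ceiling or the throughput budget binds first: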

Optimal_Partition_MB = min(
    (Compaction_Throughput_MBps * Repair_Window_Secs) / (Avg_Partitions_Per_Repair * 1.2),
    Strategy_Max_MB,
    200  # Hard safety ceiling for repair streaming
)

Where:

Term	Meaning
`Compaction_Throughput_MBps`	Configured compaction throughput ceiling (MB/s), read live from the node with `nodetool getcompactionthroughput`
`Repair_Window_Secs`	Maximum allowed repair duration (typically 7200–14400)
`Avg_Partitions_Per_Repair`	Estimated partitions per vnode during a `-pr` or `--full` repair
`Strategy_Max_MB`	100 for LCS, 250 for TWCS, 500 for STCS
`1.2`	Safety factor for concurrent compaction and streaming overhead
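
To make the arithmetic concrete, here is a minimal sketch of the formula as a standalone function. The short labels LCS, TWCS, and STCS are illustrative dictionary keys (the calculator itself uses the full class names), and the inputs are example values from this guide, not measurements.

```python
STRATEGY_MAX_MB = {"LCS": 100, "TWCS": 250, "STCS": 500}
HARD_CEILING_MB = 200   # hard safety ceiling for repair streaming
SAFETY_FACTOR = 1.2     # concurrent compaction and streaming overhead


def optimal_partition_mb(
    throughput_mbps: float,
    repair_window_secs: int,
    avg_partitions_per_repair: int,
    strategy: str,
) -> float:
    """Smallest of the throughput budget, the strategy ceiling, and the hard ceiling."""
    budget = (throughput_mbps * repair_window_secs) / (avg_partitions_per_repair * SAFETY_FACTOR)
    return float(min(budget, STRATEGY_MAX_MB[strategy], HARD_CEILING_MB))


# 4-hour window (14400 s), 1500 partitions per repair.
slow_lcs = optimal_partition_mb(10, 14400, 1500, "LCS")    # budget of about 80 MB binds
fast_lcs = optimal_partition_mb(64, 14400, 1500, "LCS")    # budget is 512 MB, LCS ceiling binds
fast_stcs = optimal_partition_mb(64, 14400, 1500, "STCS")  # hard 200 MB ceiling binds
print(slow_lcs, fast_lcs, fast_stcs)
```

The three calls show each bound taking over in turn: a low throughput ceiling makes the budget the limit, a higher one hands control to the strategy ceiling, and for STCS the hard streaming ceiling wins.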

Implementation

Manual metric collection introduces drift. The calculator below is idempotent and read-only: it validates cluster health, reads the live compaction-throughput ceiling and the per-table strategy, applies the formula within hard bounds, and emits a deterministic JSON recommendation. It never mutates schema, so re-running it is always safe.

#!/usr/bin/env python3
# requirements: Python 3.10+, Cassandra 4.x/5.x with nodetool and cqlsh on PATH.
"""Cassandra partition-size calculator.

Read-only: computes an optimal target partition size from live compaction and
repair metrics. Safe to re-run — it never alters schema or node state.
"""

import json
import logging
import subprocess
import sys
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Per-strategy upper bound (MB). LCS is the strictest; STCS the most tolerant.
STRATEGY_LIMITS: dict[str, int] = {
    "SizeTieredCompactionStrategy": 500,
    "LeveledCompactionStrategy": 100,
    "TimeWindowCompactionStrategy": 250,
}
SAFETY_STREAMING_CEIL: int = 200  # Never exceed, regardless of strategy.


@dataclass(frozen=True)
class PartitionMetrics:
    compaction_throughput_mbps: float
    repair_window_hours: float
    avg_partitions_per_repair: int
    compaction_strategy: str


def run_nodetool(cmd: str) -> str:
    """Execute nodetool with a timeout; exit(2) on failure so callers can gate."""
    try:
        result = subprocess.run(
            ["nodetool", *cmd.split()],
            capture_output=True, text=True, timeout=30, check=True,
        )
        return result.stdout.strip()
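    except FileNotFoundError:
        # nodetool is not on PATH; subprocess raises this before running anything.
        logging.error("nodetool not found on PATH; check the Cassandra install.")
        sys.exit(2)
    except subprocess.CalledProcessError as exc:
        logging.error("nodetool %s failed: %s", cmd, exc.stderr.strip())
        sys.exit(2)
    except subprocess.TimeoutExpired:
        logging.error("nodetool %s timed out; cluster may be unresponsive.", cmd)
        sys.exit(2)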


def get_configured_compaction_throughput() -> float:
    """Read the configured throughput ceiling.

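    On Cassandra 4.0 the line reads 'Current compaction throughput: 64 MB/s';
    on 4.1/5.0 the value may be reported as a MiB/s size string. Parse the
    first numeric token either way. A reported 0 means throttling is disabled,
    so floor the value at 10 MB/s to keep the throughput budget conservative
    instead of collapsing it to zero.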
    """
    output = run_nodetool("getcompactionthroughput")
    for line in output.splitlines():
        if "compaction throughput" in line.lower():
            for token in line.replace(":", " ").replace("/", " ").split():
                try:
                    return max(float(token.rstrip("MiBmbs")), 10.0)
                except ValueError:
                    continue
    logging.warning("Could not parse getcompactionthroughput; using 10 MB/s.")
    return 10.0


def get_compaction_strategy(keyspace: str, table: str) -> str:
    """Read the per-table strategy from system_schema.tables via cqlsh.

    The strategy is a table property and is NOT reported by nodetool
    describecluster. Fall back to STCS (the widest bound) if parsing fails.
    """
    query = (
        "SELECT compaction FROM system_schema.tables "
        f"WHERE keyspace_name='{keyspace}' AND table_name='{table}';"
    )
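    try:
        result = subprocess.run(
            ["cqlsh", "-e", query],
            capture_output=True, text=True, timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # cqlsh missing or hung: honor the documented fallback instead of crashing.
        logging.warning("cqlsh unavailable (%s); assuming STCS.", exc)
        return "SizeTieredCompactionStrategy"
    if result.returncode == 0:
        for name in STRATEGY_LIMITS:
            if name in result.stdout:
                return name
    logging.warning("Could not read compaction strategy; assuming STCS.")
    return "SizeTieredCompactionStrategy"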


def calculate_optimal_size(metrics: PartitionMetrics) -> float:
    """Intersect the throughput budget, the strategy ceiling, and the hard ceiling."""
    strategy_max = STRATEGY_LIMITS.get(metrics.compaction_strategy, 500)
    repair_secs = metrics.repair_window_hours * 3600
    numerator = metrics.compaction_throughput_mbps * repair_secs
    denominator = max(metrics.avg_partitions_per_repair * 1.2, 1)
    return min(numerator / denominator, strategy_max, SAFETY_STREAMING_CEIL)


def main() -> None:
    # Guard clause: refuse to compute against an unhealthy ring (Gate 1).
    status = run_nodetool("status")
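    # Any state other than UN blocks the run, including down nodes that are
    # leaving, joining, or moving.
    if any(line[:2] in {"DN", "DL", "DJ", "DM", "UL", "UJ", "UM"} for line in status.splitlines()):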
        logging.critical("Ring not all Up/Normal; aborting calculation.")
        sys.exit(1)

    metrics = PartitionMetrics(
        compaction_throughput_mbps=get_configured_compaction_throughput(),
        repair_window_hours=4.0,          # override via CLI/env in production
        avg_partitions_per_repair=1500,   # derive from historical repair logs
        compaction_strategy=get_compaction_strategy("my_keyspace", "my_table"),
    )

    optimal = calculate_optimal_size(metrics)
    print(json.dumps({
        "optimal_partition_mb": round(optimal, 2),
        "strategy_limit_mb": STRATEGY_LIMITS.get(metrics.compaction_strategy, 500),
        "safety_ceiling_mb": SAFETY_STREAMING_CEIL,
        "compaction_strategy": metrics.compaction_strategy,
        # optimal is already capped at SAFETY_STREAMING_CEIL by min(), so this
        # always evaluates to "APPLY"; tighten the condition if you need a REVIEW path.
        "recommendation": "APPLY" if optimal <= SAFETY_STREAMING_CEIL else "REVIEW",
    }, indent=2))


if __name__ == "__main__":
    main()

Run it against an idle node:

chmod +x cassandra_partition_calculator.py
./cassandra_partition_calculator.py

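Expected output for an LCS table on a node reporting a 10 MB/s throughput ceiling, with the 4-hour window and 1500 partitions hardcoded above (10 × 14400 / 1800 = 80 MB; at 64 MB/s the budget would be 512 MB and the 100 MB LCS ceiling would bind):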

{
  "optimal_partition_mb": 80.0,
  "strategy_limit_mb": 100,
  "safety_ceiling_mb": 200,
  "compaction_strategy": "LeveledCompactionStrategy",
  "recommendation": "APPLY"
}

Verification steps

A recommendation is only useful once you confirm it against the table’s real partition-size distribution and hold it there.

1. Inspect the live distribution

# Cassandra 4.x/5.x: cfstats is deprecated — use tablestats.
nodetool tablestats my_keyspace.my_table | grep "Compacted partition"

Expected — mean and max comfortably below optimal_partition_mb:

Compacted partition minimum bytes: 124
Compacted partition maximum bytes: 148920
Compacted partition mean bytes: 8420

If Compacted partition maximum bytes exceeds optimal_partition_mb * 1048576, the schema is producing oversized partitions; add application-level bucketing or salting to the partition key rather than editing gc_grace_seconds.
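
The comparison can be automated with a small parser over the tablestats text. This sketch assumes the three "Compacted partition ... bytes" lines appear exactly as in the expected output above; the function names are illustrative.

```python
import re

BYTES_PER_MB = 1048576  # same multiplier as in the text above


def compacted_partition_bytes(tablestats_output: str) -> dict[str, int]:
    """Collect the minimum, maximum, and mean compacted partition sizes in bytes."""
    pattern = re.compile(r"Compacted partition (minimum|maximum|mean) bytes:\s*(\d+)")
    return {kind: int(value) for kind, value in pattern.findall(tablestats_output)}


def exceeds_target(tablestats_output: str, optimal_partition_mb: float) -> bool:
    """True when the largest compacted partition is above the computed target."""
    stats = compacted_partition_bytes(tablestats_output)
    if "maximum" not in stats:
        raise ValueError("no 'Compacted partition maximum bytes' line found")
    return stats["maximum"] > optimal_partition_mb * BYTES_PER_MB


sample = """\
Compacted partition minimum bytes: 124
Compacted partition maximum bytes: 148920
Compacted partition mean bytes: 8420
"""
print(compacted_partition_bytes(sample))
print(exceeds_target(sample, optimal_partition_mb=80.0))
```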

2. Watch the skew over time

nodetool tablehistograms my_keyspace my_table

The Partition Size percentiles should stay flat between samples; a rising p99 with a steady p50 is early evidence of a hot partition forming. Repeat the measure, compare, and remediate loop on a regular cadence rather than sampling once.
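
One way to turn "rising p99 with a steady p50" into a check is to compare two samples of the Partition Size percentiles. The sketch below takes the percentile values as already-extracted numbers (it does not parse tablehistograms output), and the growth and tolerance ratios are placeholder assumptions to tune per workload.

```python
def hot_partition_forming(
    previous: dict[str, float],
    current: dict[str, float],
    p99_growth: float = 1.5,     # assumed: p99 grew by 50% or more
    p50_tolerance: float = 1.1,  # assumed: p50 moved by at most 10%
) -> bool:
    """Flag a sample pair where the tail grows while the median stays put."""
    p50_steady = current["p50"] <= previous["p50"] * p50_tolerance
    p99_rising = current["p99"] >= previous["p99"] * p99_growth
    return p50_steady and p99_rising


# Partition sizes in bytes from two hypothetical samples taken a day apart.
yesterday = {"p50": 8_000, "p99": 120_000}
today = {"p50": 8_200, "p99": 310_000}
print(hot_partition_forming(yesterday, today))
```

When both percentiles rise together the table is simply growing, which is a capacity question rather than a skew question, so the check stays quiet in that case.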

3. Confirm streaming stays inside budget

nodetool netstats | grep "Streaming"

Keep stream_throughput_outbound_megabits_per_sec below ~70% of the NIC’s line rate so a repair of the sized ranges cannot starve client traffic.
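
As a quick worked example of the 70% guideline, the helper below converts a NIC line rate into a stream-throughput cap in megabits per second. The function name is illustrative and the 10 Gb/s and 1 Gb/s figures are example inputs.

```python
def stream_cap_megabits(nic_megabits: int, percent: int = 70) -> int:
    """Largest stream throughput (megabits/s) that stays within the given share of the NIC."""
    return nic_megabits * percent // 100


print(stream_cap_megabits(10_000))  # 10 Gb/s NIC
print(stream_cap_megabits(1_000))   # 1 Gb/s NIC
```

Integer arithmetic avoids floating-point rounding, so the result can be passed straight to nodetool setstreamthroughput.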

Troubleshooting

TombstoneOverwhelmingException / TombstoneFailureThreshold exceeded on range reads. Root cause: a partition grew large enough that a single slice scans past tombstone_failure_threshold before returning rows — a sizing failure, not a query bug. Fix: reduce the effective partition size with bucketing, and sustain deletes through the tombstone management purge cycle so dead cells clear before a read hits them.
StreamingTimeoutException during nodetool repair. Root cause: an oversized partition inside a divergent range inflated the bytes streamed past the timeout window. Fix: throttle at runtime with nodetool setstreamthroughput <megabits/s> (or lower stream_throughput_outbound_megabits_per_sec by 25% in cassandra.yaml), restart the repair with -local to isolate cross-DC traffic, then reshape the offending partition.
The calculator exits with code 2. Root cause: a nodetool call returned non-zero or timed out, usually because JMX is unreachable, JAVA_HOME/CASSANDRA_HOME is missing, or the node is unresponsive. A failed cqlsh query does not produce exit code 2; the script falls back to the STCS bound instead. Fix: verify the environment and node state, and until it clears, fall back to conservative static defaults (100 MB for LCS, 200 MB for TWCS and STCS once the hard streaming ceiling is applied) and recalculate once the compaction backlog drains.

Partition sizing is a continuous feedback loop, not a one-time calculation. By anchoring the formula to live compaction throughput, the repair window, and strategy-specific ceilings, you eliminate guesswork and keep streaming, compaction, and the read path stable across the token ring.

Data Partitioning & Token Ring Basics — the parent guide on how a partition key hashes to a token and becomes a physical placement.
Understanding STCS vs LCS vs TWCS — why each strategy imposes a different partition-size ceiling.
LSM tree mechanics in Cassandra — how per-partition data becomes SSTables and why oversized partitions stall compaction.
Tombstone management & garbage collection — keeping deletes flowing so large partitions never hit the failure threshold on read.
