How to Calculate Optimal Partition Sizes for Cassandra

Partition sizing in Apache Cassandra is an operational constraint, not a theoretical preference. There is no fixed 2 GB byte cap on a partition; the real hard limit is roughly 2 billion cells per partition, and Cassandra emits large-partition warnings well before that. Production degradation routinely begins between 100 MB and 500 MB due to compaction I/O saturation, repair streaming timeouts, and tombstone accumulation. This guide delivers a deterministic, automation-ready methodology for calculating optimal partition sizes, explicitly aligned with compaction strategy behavior, repair window constraints, and idempotent Python tooling for Cassandra 4.x/5.x environments.

Architecture Context & Token Ring Mapping

Partition size directly dictates SSTable lifecycle and repair topology. When a single partition dominates a vnode, compaction must merge increasingly large SSTables, consuming disk bandwidth and triggering OutOfMemoryError during nodetool repair streaming. Understanding how primary keys map to the token ring is foundational; improper clustering key cardinality or unbounded time-series writes create hot partitions that bypass Data Partitioning & Token Ring Basics distribution guarantees.

The relationship between partition boundaries and Cassandra Architecture & Compaction Fundamentals dictates that optimal sizing must be derived from measurable throughput, not arbitrary defaults. Every partition must fit within the compaction strategy’s merge tolerance, the repair streaming buffer, and the tombstone purge cycle.

Operational Constraints & Strategy Alignment

Optimal partition sizing must account for three operational vectors:

  1. Compaction Strategy Overhead: SizeTieredCompactionStrategy (STCS) tolerates larger partitions but suffers exponential read amplification as SSTables grow. LeveledCompactionStrategy (LCS) requires partitions ≤ 100 MB to prevent L0/L1 overlap saturation. TimeWindowCompactionStrategy (TWCS) performs optimally when partitions align with window boundaries; aim for the 10–50 MB typical range and treat the 250 MB strategy ceiling used below as an upper bound, not a target.
  2. Repair Streaming Limits: nodetool repair streams the divergent ranges identified by Merkle-tree comparison. Large partitions within those ranges inflate the data streamed and frequently trigger StreamingTimeoutException or saturate stream_throughput_outbound_megabits_per_sec, causing repair failures.
  3. Tombstone Compaction Cost: Oversized partitions delay tombstone purging, making it more likely a single read scans past the node-level tombstone_failure_threshold (a cassandra.yaml setting, default 100000) during range scans and forcing full table scans that bypass partition pruning.

Deterministic Calculation Methodology

The optimal partition size is derived from compaction throughput capacity, repair window duration, and target SSTable size. Use the following production formula:

Optimal_Partition_MB = min(
    (Compaction_Throughput_MBps * Repair_Window_Secs) / (Avg_Partitions_Per_Repair * 1.2),
    Strategy_Max_MB,
    200  # Hard safety ceiling for repair streaming
)

Where:

  • Compaction_Throughput_MBps = Sustained compaction write throughput (MB/s)
  • Repair_Window_Secs = Maximum allowed repair duration (typically 7200–14400)
  • Avg_Partitions_Per_Repair = Estimated partitions per vnode during -pr or --full repair
  • Strategy_Max_MB = 100 for LCS, 250 for TWCS, 500 for STCS
  • 1.2 = Safety factor for concurrent compaction and streaming overhead

Production-Ready Automation Workflow

Manual metric collection introduces drift. The following Python script implements an idempotent, safety-checked calculator that queries nodetool, validates cluster health, and outputs deterministic sizing recommendations.

#!/usr/bin/env python3
"""
Cassandra Partition Size Calculator (v4.x/v5.x compatible)
Calculates optimal partition size based on live compaction/repair metrics.
"""

import subprocess
import json
import sys
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

@dataclass
class PartitionMetrics:
    compaction_throughput_mbps: float
    repair_window_hours: float
    avg_partitions_per_repair: int
    compaction_strategy: str

STRATEGY_LIMITS = {
    "SizeTieredCompactionStrategy": 500,
    "LeveledCompactionStrategy": 100,
    "TimeWindowCompactionStrategy": 250
}

SAFETY_STREAMING_CEIL = 200

def run_nodetool(cmd: str) -> str:
    """Execute nodetool with timeout and validation."""
    try:
        result = subprocess.run(
            ["nodetool"] + cmd.split(),
            capture_output=True, text=True, timeout=30, check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        logging.error(f"nodetool {cmd} failed: {e.stderr.strip()}")
        sys.exit(2)
    except subprocess.TimeoutExpired:
        logging.error("nodetool execution timed out. Cluster may be unresponsive.")
        sys.exit(2)

def get_configured_compaction_throughput() -> float:
    """Read the configured compaction throughput cap via nodetool getcompactionthroughput.

    Output looks like: "Current compaction throughput: 64 MB/s"
    """
    output = run_nodetool("getcompactionthroughput")
    for line in output.splitlines():
        if "compaction throughput" in line.lower():
            for token in line.replace(":", " ").split():
                try:
                    value = float(token)
                    return max(value, 10.0)  # Floor at 10 MB/s to prevent division by zero
                except ValueError:
                    continue
    logging.warning("Could not parse getcompactionthroughput. Using conservative default.")
    return 10.0

def get_compaction_strategy(keyspace: str, table: str) -> str:
    """Retrieve the active compaction strategy from system_schema.tables via cqlsh.

    The strategy is per-table and is NOT reported by nodetool describecluster.
    """
    query = (
        "SELECT compaction FROM system_schema.tables "
        f"WHERE keyspace_name='{keyspace}' AND table_name='{table}';"
    )
    result = subprocess.run(
        ["cqlsh", "-e", query],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode == 0:
        for name in STRATEGY_LIMITS:
            if name in result.stdout:
                return name
    return "SizeTieredCompactionStrategy"

def calculate_optimal_size(metrics: PartitionMetrics) -> float:
    """Apply deterministic sizing formula with safety bounds."""
    strategy_max = STRATEGY_LIMITS.get(metrics.compaction_strategy, 500)
    repair_secs = metrics.repair_window_hours * 3600
    numerator = metrics.compaction_throughput_mbps * repair_secs
    denominator = max(metrics.avg_partitions_per_repair * 1.2, 1)
    calculated = numerator / denominator
    return min(calculated, strategy_max, SAFETY_STREAMING_CEIL)

def main():
    # Safety Check: Verify cluster connectivity before proceeding
    try:
        run_nodetool("status")
    except SystemExit:
        logging.critical("Cluster health check failed. Aborting calculation.")
        sys.exit(1)

    metrics = PartitionMetrics(
        compaction_throughput_mbps=get_configured_compaction_throughput(),
        repair_window_hours=4.0,  # Configurable via CLI/ENV in production
        avg_partitions_per_repair=1500,  # Derived from historical repair logs
        compaction_strategy=get_compaction_strategy("my_keyspace", "my_table")
    )

    optimal = calculate_optimal_size(metrics)
    print(json.dumps({
        "optimal_partition_mb": round(optimal, 2),
        "strategy_limit_mb": STRATEGY_LIMITS.get(metrics.compaction_strategy, 500),
        "safety_ceiling_mb": SAFETY_STREAMING_CEIL,
        "compaction_strategy": metrics.compaction_strategy,
        "recommendation": "APPLY" if optimal <= 200 else "REVIEW"
    }, indent=2))

if __name__ == "__main__":
    main()

Execution Protocol

Safety Check

# Verify cluster state and nodetool accessibility
nodetool status | grep -E "UN|DN"

Expected Output

UN  10.0.1.10  124.5 KiB  256  100.0%  abcdef01-...  rack1
UN  10.0.1.11  118.2 KiB  256  100.0%  bcdef012-...  rack1

Rollback Path If any node reports DN or UL, halt execution. Run nodetool repair -pr manually on healthy nodes before proceeding. Do not apply partition size changes during active topology changes.

Script Execution

chmod +x cassandra_partition_calculator.py
python3 cassandra_partition_calculator.py

Expected Output

{
  "optimal_partition_mb": 80.0,
  "strategy_limit_mb": 100,
  "safety_ceiling_mb": 200,
  "compaction_strategy": "LeveledCompactionStrategy",
  "recommendation": "APPLY"
}

Rollback Path If the script exits with code 2, verify JAVA_HOME and CASSANDRA_HOME environment variables. Revert to static defaults (100 MB for LCS, 500 MB for STCS) and schedule recalculation after compaction backlog clears.

Validation & Continuous Monitoring

After calculating the target size, validate against live SSTable distribution and adjust table parameters accordingly.

1. Verify Partition Size Distribution

# Inspect partition size statistics from a representative node
nodetool tablestats <keyspace>.<table> | grep "Compacted partition"

Safety Check Confirm the node is not currently running nodetool repair or nodetool cleanup. Expected Output

Compacted partition minimum bytes: 124
Compacted partition maximum bytes: 148920
Compacted partition mean bytes: 8420

Rollback Path If max exceeds optimal_partition_mb * 1048576, implement application-level partition key salting or bucketing. Do not alter gc_grace_seconds until partition skew is resolved.

The decision flow below captures this measure-compare-remediate loop.

flowchart TD M["Measure max partition bytes via nodetool tablestats"] --> C{"Within warn and target thresholds"} C -->|"Yes"| K["Keep current schema"] C -->|"No"| R["Remodel schema or add bucketing to partition key"] R --> M K --> Mon["Monitor distribution with tablehistograms"]
Partition sizing decision loop

2. Enforce Tombstone & Compaction Boundaries

tombstone_warn_threshold and tombstone_failure_threshold are node-level cassandra.yaml settings (they bound how many tombstones a single read may scan), not per-table properties. They are not set via nodetool. Edit cassandra.yaml on each node:

# cassandra.yaml (node-level; applies to all tables on the node)
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 50000

Safety Check Confirm the running value via JMX (org.apache.cassandra.db:type=StorageService) or by inspecting the active cassandra.yaml; it does not appear in system_schema.tables (which holds per-table properties). Expected Output

# A cassandra.yaml change takes effect after a rolling restart of each node.

Rollback Path If read latency spikes > 200ms, revert the value in cassandra.yaml to the default and roll the change back out:

tombstone_failure_threshold: 100000

3. Monitor Repair Streaming Stability

# Watch streaming throughput during incremental repair
nodetool netstats | grep "Streaming"

Safety Check Ensure stream_throughput_outbound_megabits_per_sec in cassandra.yaml does not exceed 70% of available NIC bandwidth. Expected Output

Streaming: 124.5 MB/s (active)

Rollback Path If streaming stalls or nodetool repair throws StreamingTimeoutException, reduce streaming bandwidth at runtime with nodetool setstreamthroughput <megabits/s> (or lower stream_throughput_outbound_megabits_per_sec in cassandra.yaml by 25%) and restart the repair with -local to isolate cross-DC traffic.

Operational Integration Guidelines

  1. Automate Metric Collection: Schedule the Python calculator via cron or systemd timers during low-traffic windows. Pipe output to a configuration management tool (Ansible/Terraform) to update table DDL templates.
  2. Align with TWCS Windows: For time-series workloads, ensure optimal_partition_mb divides evenly into compaction_window_size. Mismatched boundaries force TWCS to merge across windows, negating strategy benefits.
  3. Prevent Hot Partitions: Enforce application-level validation that rejects writes exceeding optimal_partition_mb. Review the schema for partition keys with unbounded cardinality and monitor real partition-size distribution with nodetool tablehistograms <keyspace> <table> to catch skew early.
  4. Reference Official Documentation: Cross-reference compaction tuning with the Apache Cassandra Compaction Documentation and repair best practices from the Cassandra Nodetool Repair Documentation. For Python subprocess security, consult the Python subprocess Module to prevent shell injection in production environments.

Partition sizing is a continuous feedback loop, not a one-time calculation. By anchoring the formula to live compaction throughput, repair windows, and strategy-specific ceilings, you eliminate guesswork and stabilize streaming, compaction, and read paths across the token ring.