How to Calculate Optimal Partition Sizes for Cassandra
Partition sizing in Apache Cassandra is an operational constraint, not a theoretical preference. There is no fixed 2 GB byte cap on a partition; the real hard limit is roughly 2 billion cells per partition, and Cassandra emits large-partition warnings well before that. Production degradation routinely begins between 100 MB and 500 MB due to compaction I/O saturation, repair streaming timeouts, and tombstone accumulation. This guide delivers a deterministic, automation-ready methodology for calculating optimal partition sizes, explicitly aligned with compaction strategy behavior, repair window constraints, and idempotent Python tooling for Cassandra 4.x/5.x environments.
Architecture Context & Token Ring Mapping
Partition size directly dictates SSTable lifecycle and repair topology. When a single partition dominates a vnode, compaction must merge increasingly large SSTables, consuming disk bandwidth and triggering OutOfMemoryError during nodetool repair streaming. Understanding how primary keys map to the token ring is foundational; improper clustering key cardinality or unbounded time-series writes create hot partitions that bypass Data Partitioning & Token Ring Basics distribution guarantees.
The relationship between partition boundaries and Cassandra Architecture & Compaction Fundamentals dictates that optimal sizing must be derived from measurable throughput, not arbitrary defaults. Every partition must fit within the compaction strategy’s merge tolerance, the repair streaming buffer, and the tombstone purge cycle.
Operational Constraints & Strategy Alignment
Optimal partition sizing must account for three operational vectors:
- Compaction Strategy Overhead:
SizeTieredCompactionStrategy(STCS) tolerates larger partitions but suffers exponential read amplification as SSTables grow.LeveledCompactionStrategy(LCS) requires partitions ≤ 100 MB to prevent L0/L1 overlap saturation.TimeWindowCompactionStrategy(TWCS) performs optimally when partitions align with window boundaries; aim for the 10–50 MB typical range and treat the 250 MB strategy ceiling used below as an upper bound, not a target. - Repair Streaming Limits:
nodetool repairstreams the divergent ranges identified by Merkle-tree comparison. Large partitions within those ranges inflate the data streamed and frequently triggerStreamingTimeoutExceptionor saturatestream_throughput_outbound_megabits_per_sec, causing repair failures. - Tombstone Compaction Cost: Oversized partitions delay tombstone purging, making it more likely a single read scans past the node-level
tombstone_failure_threshold(a cassandra.yaml setting, default100000) during range scans and forcing full table scans that bypass partition pruning.
Deterministic Calculation Methodology
The optimal partition size is derived from compaction throughput capacity, repair window duration, and target SSTable size. Use the following production formula:
Optimal_Partition_MB = min(
(Compaction_Throughput_MBps * Repair_Window_Secs) / (Avg_Partitions_Per_Repair * 1.2),
Strategy_Max_MB,
200 # Hard safety ceiling for repair streaming
)
Where:
Compaction_Throughput_MBps= Sustained compaction write throughput (MB/s)Repair_Window_Secs= Maximum allowed repair duration (typically 7200–14400)Avg_Partitions_Per_Repair= Estimated partitions per vnode during-pror--fullrepairStrategy_Max_MB= 100 for LCS, 250 for TWCS, 500 for STCS1.2= Safety factor for concurrent compaction and streaming overhead
Production-Ready Automation Workflow
Manual metric collection introduces drift. The following Python script implements an idempotent, safety-checked calculator that queries nodetool, validates cluster health, and outputs deterministic sizing recommendations.
#!/usr/bin/env python3
"""
Cassandra Partition Size Calculator (v4.x/v5.x compatible)
Calculates optimal partition size based on live compaction/repair metrics.
"""
import subprocess
import json
import sys
import logging
from dataclasses import dataclass
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
@dataclass
class PartitionMetrics:
compaction_throughput_mbps: float
repair_window_hours: float
avg_partitions_per_repair: int
compaction_strategy: str
STRATEGY_LIMITS = {
"SizeTieredCompactionStrategy": 500,
"LeveledCompactionStrategy": 100,
"TimeWindowCompactionStrategy": 250
}
SAFETY_STREAMING_CEIL = 200
def run_nodetool(cmd: str) -> str:
"""Execute nodetool with timeout and validation."""
try:
result = subprocess.run(
["nodetool"] + cmd.split(),
capture_output=True, text=True, timeout=30, check=True
)
return result.stdout.strip()
except subprocess.CalledProcessError as e:
logging.error(f"nodetool {cmd} failed: {e.stderr.strip()}")
sys.exit(2)
except subprocess.TimeoutExpired:
logging.error("nodetool execution timed out. Cluster may be unresponsive.")
sys.exit(2)
def get_configured_compaction_throughput() -> float:
"""Read the configured compaction throughput cap via nodetool getcompactionthroughput.
Output looks like: "Current compaction throughput: 64 MB/s"
"""
output = run_nodetool("getcompactionthroughput")
for line in output.splitlines():
if "compaction throughput" in line.lower():
for token in line.replace(":", " ").split():
try:
value = float(token)
return max(value, 10.0) # Floor at 10 MB/s to prevent division by zero
except ValueError:
continue
logging.warning("Could not parse getcompactionthroughput. Using conservative default.")
return 10.0
def get_compaction_strategy(keyspace: str, table: str) -> str:
"""Retrieve the active compaction strategy from system_schema.tables via cqlsh.
The strategy is per-table and is NOT reported by nodetool describecluster.
"""
query = (
"SELECT compaction FROM system_schema.tables "
f"WHERE keyspace_name='{keyspace}' AND table_name='{table}';"
)
result = subprocess.run(
["cqlsh", "-e", query],
capture_output=True, text=True, timeout=30,
)
if result.returncode == 0:
for name in STRATEGY_LIMITS:
if name in result.stdout:
return name
return "SizeTieredCompactionStrategy"
def calculate_optimal_size(metrics: PartitionMetrics) -> float:
"""Apply deterministic sizing formula with safety bounds."""
strategy_max = STRATEGY_LIMITS.get(metrics.compaction_strategy, 500)
repair_secs = metrics.repair_window_hours * 3600
numerator = metrics.compaction_throughput_mbps * repair_secs
denominator = max(metrics.avg_partitions_per_repair * 1.2, 1)
calculated = numerator / denominator
return min(calculated, strategy_max, SAFETY_STREAMING_CEIL)
def main():
# Safety Check: Verify cluster connectivity before proceeding
try:
run_nodetool("status")
except SystemExit:
logging.critical("Cluster health check failed. Aborting calculation.")
sys.exit(1)
metrics = PartitionMetrics(
compaction_throughput_mbps=get_configured_compaction_throughput(),
repair_window_hours=4.0, # Configurable via CLI/ENV in production
avg_partitions_per_repair=1500, # Derived from historical repair logs
compaction_strategy=get_compaction_strategy("my_keyspace", "my_table")
)
optimal = calculate_optimal_size(metrics)
print(json.dumps({
"optimal_partition_mb": round(optimal, 2),
"strategy_limit_mb": STRATEGY_LIMITS.get(metrics.compaction_strategy, 500),
"safety_ceiling_mb": SAFETY_STREAMING_CEIL,
"compaction_strategy": metrics.compaction_strategy,
"recommendation": "APPLY" if optimal <= 200 else "REVIEW"
}, indent=2))
if __name__ == "__main__":
main()Execution Protocol
Safety Check
# Verify cluster state and nodetool accessibility
nodetool status | grep -E "UN|DN"Expected Output
UN 10.0.1.10 124.5 KiB 256 100.0% abcdef01-... rack1
UN 10.0.1.11 118.2 KiB 256 100.0% bcdef012-... rack1
Rollback Path
If any node reports DN or UL, halt execution. Run nodetool repair -pr manually on healthy nodes before proceeding. Do not apply partition size changes during active topology changes.
Script Execution
chmod +x cassandra_partition_calculator.py
python3 cassandra_partition_calculator.pyExpected Output
{
"optimal_partition_mb": 80.0,
"strategy_limit_mb": 100,
"safety_ceiling_mb": 200,
"compaction_strategy": "LeveledCompactionStrategy",
"recommendation": "APPLY"
}Rollback Path
If the script exits with code 2, verify JAVA_HOME and CASSANDRA_HOME environment variables. Revert to static defaults (100 MB for LCS, 500 MB for STCS) and schedule recalculation after compaction backlog clears.
Validation & Continuous Monitoring
After calculating the target size, validate against live SSTable distribution and adjust table parameters accordingly.
1. Verify Partition Size Distribution
# Inspect partition size statistics from a representative node
nodetool tablestats <keyspace>.<table> | grep "Compacted partition"Safety Check
Confirm the node is not currently running nodetool repair or nodetool cleanup.
Expected Output
Compacted partition minimum bytes: 124
Compacted partition maximum bytes: 148920
Compacted partition mean bytes: 8420
Rollback Path
If max exceeds optimal_partition_mb * 1048576, implement application-level partition key salting or bucketing. Do not alter gc_grace_seconds until partition skew is resolved.
The decision flow below captures this measure-compare-remediate loop.
2. Enforce Tombstone & Compaction Boundaries
tombstone_warn_threshold and tombstone_failure_threshold are node-level cassandra.yaml settings (they bound how many tombstones a single read may scan), not per-table properties. They are not set via nodetool. Edit cassandra.yaml on each node:
# cassandra.yaml (node-level; applies to all tables on the node)
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 50000Safety Check
Confirm the running value via JMX (org.apache.cassandra.db:type=StorageService) or by inspecting the active cassandra.yaml; it does not appear in system_schema.tables (which holds per-table properties).
Expected Output
# A cassandra.yaml change takes effect after a rolling restart of each node.
Rollback Path If read latency spikes > 200ms, revert the value in cassandra.yaml to the default and roll the change back out:
tombstone_failure_threshold: 1000003. Monitor Repair Streaming Stability
# Watch streaming throughput during incremental repair
nodetool netstats | grep "Streaming"Safety Check
Ensure stream_throughput_outbound_megabits_per_sec in cassandra.yaml does not exceed 70% of available NIC bandwidth.
Expected Output
Streaming: 124.5 MB/s (active)
Rollback Path
If streaming stalls or nodetool repair throws StreamingTimeoutException, reduce streaming bandwidth at runtime with nodetool setstreamthroughput <megabits/s> (or lower stream_throughput_outbound_megabits_per_sec in cassandra.yaml by 25%) and restart the repair with -local to isolate cross-DC traffic.
Operational Integration Guidelines
- Automate Metric Collection: Schedule the Python calculator via cron or systemd timers during low-traffic windows. Pipe output to a configuration management tool (Ansible/Terraform) to update table DDL templates.
- Align with TWCS Windows: For time-series workloads, ensure
optimal_partition_mbdivides evenly intocompaction_window_size. Mismatched boundaries force TWCS to merge across windows, negating strategy benefits. - Prevent Hot Partitions: Enforce application-level validation that rejects writes exceeding
optimal_partition_mb. Review the schema for partition keys with unbounded cardinality and monitor real partition-size distribution withnodetool tablehistograms <keyspace> <table>to catch skew early. - Reference Official Documentation: Cross-reference compaction tuning with the Apache Cassandra Compaction Documentation and repair best practices from the Cassandra Nodetool Repair Documentation. For Python subprocess security, consult the Python subprocess Module to prevent shell injection in production environments.
Partition sizing is a continuous feedback loop, not a one-time calculation. By anchoring the formula to live compaction throughput, repair windows, and strategy-specific ceilings, you eliminate guesswork and stabilize streaming, compaction, and read paths across the token ring.