
How To Analyze ScyllaDB Cluster Capacity

Monitoring tips that can help reduce cluster size 2-5X without compromising latency

Editor’s note: The following is a guest post by Andrei Manakov, Senior Staff Software Engineer at ShareChat. It was originally published on Andrei’s blog.

I had the privilege of giving a talk at ScyllaDB Summit 2024, where I briefly addressed the challenge of analyzing the remaining capacity in ScyllaDB clusters. A good understanding of ScyllaDB internals is required to plan for growing compute costs as your product scales, or to reduce costs if the cluster turns out to be heavily over-provisioned. In my experience, clusters can be reduced by 2-5x without latency degradation after such an analysis. In this post, I provide more detail on how to properly analyze CPU and disk resources.

How Does ScyllaDB Use CPU?

ScyllaDB is a distributed database, and one cluster typically contains multiple nodes. Each node can contain multiple shards, and each shard is assigned to a single core. The database is built on the Seastar framework and uses a shared-nothing approach. All data is usually replicated in several copies, depending on the replication factor, and each copy is assigned to a specific shard. As a result, every shard can be analyzed as an independent unit and every shard efficiently utilizes all available CPU resources without any overhead from contention or context switching.

Each shard has different tasks, which we can divide into two categories: client request processing and maintenance tasks. All tasks are executed by a scheduler in one thread pinned to a core, and each task category gets its own CPU budget limit. Such clear task separation allows isolating and prioritizing latency-critical request processing. As a result of this design, the cluster handles load spikes more efficiently and provides gradual latency degradation under heavy load. More details about this architecture are available here.

Another interesting result of this design is that ScyllaDB supports workload prioritization. In my experience, it ensures that latency-critical workloads are not impacted during load spikes in less critical ones. I can't recall a similar feature in other databases; such problems are usually tackled by running two separate clusters for the different workloads. Keep in mind, though, that this feature is available only in ScyllaDB Enterprise.

However, background tasks may occupy all remaining resources, and overall CPU utilization in the cluster appears spiky. So, it’s not obvious how to find the real cluster capacity.

It's easy to see 100% CPU usage with no performance impact. If the critical load increases, it simply takes resources (CPU, I/O) away from the background tasks. The background tasks' duration can increase slightly, but that's totally manageable.
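
To see where the CPU budget actually goes, you can break down the scheduler runtime metric (scylla_scheduler_runtime_ms, also used in the queries below) by scheduler group. A sketch of such a query, showing the average per-shard percentage of CPU time spent in each group (adjust the label filters to your monitoring setup):

avg(sum(rate(scylla_scheduler_runtime_ms{})) by (instance, shard, group)) by (group) / 10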

The Best CPU Utilization Metric

How can we understand the remaining cluster capacity when CPU usage spikes up to 100% throughout the day, yet the system remains stable?

We need to exclude maintenance tasks and remove all these spikes from consideration. Since ScyllaDB distributes all the data across shards and every shard has its own core, we look at the maximum per-shard CPU utilization, excluding maintenance tasks (you can find the other task types here).

In my experience, you can keep the utilization up to 60-70% without visible degradation in tail latency.

Example of a Prometheus query:

max(sum(rate(scylla_scheduler_runtime_ms{group!~"compaction|streaming"})) by (instance, shard)) / 10

You can find more details about the ScyllaDB monitoring stack here. In this article, PromQL queries are used to demonstrate how to analyze the key metrics effectively.

However, I don’t recommend rapidly downscaling the cluster to the desired size just after looking at max CPU utilization excluding the maintenance tasks.

First, you need to look at the average CPU utilization, excluding maintenance tasks, across all shards. In an ideal world, it should be close to the max value. In case of significant skew, it definitely makes sense to find the root cause: it could be an inefficient schema with a poorly chosen partition key, or an incorrect token-aware/rack-aware configuration in the driver.
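
As a sketch, the average counterpart of the max query above (same metric and exclusions):

avg(sum(rate(scylla_scheduler_runtime_ms{group!~"compaction|streaming"})) by (instance, shard)) / 10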

Second, take a look at the average CPU utilization of the excluded maintenance tasks, since this part is specific to your workload. It's rarely more than 5-10%, but if it uses more CPU you need to keep a larger buffer. Otherwise, compaction will be starved of resources, and reads will become more expensive in terms of both CPU and disk.
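
A sketch of the mirror query for that maintenance share, using group=~ instead of group!~:

avg(sum(rate(scylla_scheduler_runtime_ms{group=~"compaction|streaming"})) by (instance, shard)) / 10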

Third, it's important to downscale your cluster gradually. ScyllaDB has an in-memory row cache, which is crucial for its read performance. It allocates all remaining memory to the cache, so when total memory shrinks, the hit rate might drop more than you expected. As a result, CPU utilization can grow non-linearly, and a low cache hit rate can harm your tail latency. You can track the hit rate with the following query:

1 - (sum(rate(scylla_cache_reads_with_misses{})) / sum(rate(scylla_cache_reads{})))
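
If you suspect the hit rate differs between nodes (for example, after replacing instance types), the same metrics can be broken down per instance — a sketch:

1 - (sum(rate(scylla_cache_reads_with_misses{})) by (instance) / sum(rate(scylla_cache_reads{})) by (instance))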

I haven’t mentioned RAM in this article as there are not many actionable points. However, since memory cache is crucial for efficient reading in ScyllaDB, I recommend always using memory-optimized virtual machines. The more memory, the better.

Disk Resources

ScyllaDB is an LSM-tree-based database. That means it is optimized for writes by design, and any mutation leads to new data being appended to disk. The database periodically rewrites the data (compaction) to keep read performance acceptable. Disk performance plays a crucial role in overall database performance. You can find more details about the write path and compaction in the ScyllaDB documentation. There are three important disk resources we will discuss here: throughput, IOPS, and free disk space.

All these resources depend on the type and number of disks attached to our ScyllaDB nodes. But how can we find the IOPS/throughput limits? There are two options:

  1. Cloud providers and disk manufacturers usually publish the performance characteristics of their disks on their websites; for example, NVMe disks from Google Cloud.
  2. The actual disk performance can differ from the numbers the manufacturers share, so the best option is often to measure it. Fortunately, ScyllaDB already does this: it benchmarks the disks during node installation and stores the results in the file io_properties.yaml. The database uses these limits internally to achieve optimal performance.
disks:
- mountpoint: /var/lib/scylla/data
  read_iops: 2400000            # IOPS
  read_bandwidth: 5921532416    # throughput, bytes/s
  write_iops: 1200000           # IOPS
  write_bandwidth: 4663037952   # throughput, bytes/s

file: io_properties.yaml

Disk Throughput

In the queries below, read_bandwidth, write_bandwidth (and read_iops, write_iops in the next section) are the values from io_properties.yaml, and nodeNumber is the number of nodes in the cluster; substitute them as constants:

sum(rate(node_disk_read_bytes_total{})) / (read_bandwidth * nodeNumber)
sum(rate(node_disk_written_bytes_total{})) / (write_bandwidth * nodeNumber)
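
For instance, plugging in the io_properties.yaml values shown above for a hypothetical three-node cluster, the substituted queries would look like this:

sum(rate(node_disk_read_bytes_total{})) / (5921532416 * 3)
sum(rate(node_disk_written_bytes_total{})) / (4663037952 * 3)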

In my experience, I haven’t seen any harm with utilization up to 80-90%.

Disk IOPS

sum(rate(node_disk_reads_completed_total{})) / (read_iops * nodeNumber)
sum(rate(node_disk_writes_completed_total{})) / (write_iops * nodeNumber)

Disk Free Space

It's crucial to keep a significant buffer of free space on every node. If a node runs out of space, it becomes essentially unavailable and is hard to restore. Keep in mind that extra space is needed for many operations:

  1. Every update, write, or delete is appended to disk and allocates new space.
  2. Compaction requires temporary space while it rewrites and cleans up data.
  3. The backup procedure also consumes space.

The best way to control disk usage is to set a Time To Live (TTL) on tables, if that matches your use case. Expired data will then be cleaned up during compaction.

I usually try to keep at least 50-60% of free space.

min(sum(node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"}) by (instance)/sum(node_filesystem_size_bytes{mountpoint="/var/lib/scylla"}) by (instance))
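
It can also be worth keeping an eye on the node with the least absolute free space, since a single full node is already a problem — a simple sketch:

min(node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"})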

Tablets

Most apps have significant load variations throughout the day or week. ScyllaDB is not elastic, so you need to provision the cluster for peak load, which means a lot of resources can sit idle at night or on weekends. But that could change soon.

A ScyllaDB cluster distributes data across its nodes; the smallest unit of data is a partition, uniquely identified by a partition key. A partitioner hash function computes a token for each partition to determine which nodes store it. Every node owns its own token ranges, and together the nodes form a ring. Previously, adding a new node was not a fast procedure, because it required copying data to the new node (this is called streaming), adjusting the token ranges of its neighbors, and so on. It was also a manual procedure.

However, ScyllaDB introduced tablets in version 6.0, and they open up new opportunities. A tablet is a range of tokens within a table; it contains partitions and can be replicated independently. This makes the overall process much smoother and increases elasticity significantly: adding new nodes takes minutes, and a new node starts processing requests even before full data synchronization. It looks like a significant step toward full elasticity, which could drastically reduce ScyllaDB server costs even further. You can read more about tablets here. I am looking forward to testing tablets closely soon.

Conclusion

Tablets look like a solid foundation for future pure elasticity, but for now, we’re planning clusters for peak load. To effectively analyze ScyllaDB cluster capacity, focus on these key recommendations:

  • Target max CPU utilization (excluding maintenance tasks) per shard at 60–70%.
  • Ensure sufficient free disk space to handle compaction and backups.
  • Gradually downsize clusters to avoid sudden cache degradation.

 

About Andrei Manakov

Andrei Manakov is a software engineer with over 12 years of experience. He specializes in designing high-load distributed systems. Currently, he works at ShareChat, India's largest social network, with over 300 million monthly active users.
