ScyllaDB Enterprise 2023.1 introduces Raft-based strongly consistent schema management, distributed aggregations, and much more.
ScyllaDB Enterprise 2023.1 is now available. It is based on ScyllaDB Open Source 5.2, introducing Raft-based strongly consistent schema management, distributed aggregations, Alternator TTL, partition-level rate limiting, distributed SELECT COUNT, and many more improvements and bug fixes.
It also resolves over 100 issues, bringing improved capabilities, stability and performance to our NoSQL database server and its APIs.
DOWNLOAD ScyllaDB Enterprise 2023.1 NOW
In this blog, we’ll share the TL;DR of what the move to Raft cluster management means for our users (two dedicated Raft blogs provide more details), then highlight the new features that our users have been asking about most frequently. For the complete details, read the release notes.
Strongly Consistent Schema Management with Raft
Consistent Schema Management is the first Raft-based feature in ScyllaDB, and ScyllaDB Enterprise 2023.1 is the first enterprise release to enable Raft by default. It applies to DDL operations that modify the schema – for example, CREATE, ALTER, or DROP for KEYSPACE, TABLE, INDEX, UDT, MV, etc.
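For instance, DDL statements like the following (the keyspace, table, and column names here are purely illustrative) all go through this consistent, Raft-managed path:
CREATE KEYSPACE IF NOT EXISTS app WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
CREATE TABLE IF NOT EXISTS app.users (id uuid PRIMARY KEY, name text);
ALTER TABLE app.users ADD email text;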
Unstable schema management has been a problem in past Apache Cassandra and ScyllaDB versions. The root cause is the unsafe propagation of schema updates over gossip: concurrent schema updates can lead to schema collisions.
Once Raft is enabled, all schema management operations are serialized by the Raft consensus algorithm. Quickly assembling a fresh cluster, performing concurrent schema changes, updating nodes’ IP addresses – all of this is now possible.
More specifically:
- It is now safe to perform concurrent schema change statements. Change requests no longer conflict, get overridden by “competing” requests, or risk data loss. Schema propagation happens much faster because the cluster leader actively pushes it to the nodes. In a healthy cluster, you can expect nodes to learn about a new schema within a few milliseconds (previously it took a second or two).
- If a node is partitioned away from the cluster, it can’t perform schema changes. That’s the main difference, or limitation, compared to pre-Raft clusters that you should keep in mind. You can still perform other operations on such nodes (such as reads and writes), so data availability is unaffected. We see the results of this change not only in simple regression tests, but also in our longevity tests that execute DDL: there are fewer errors in the logs, and systems running on Raft are more stable while DDL is running.
For more details, see our Raft blogs:
- ScyllaDB’s Path to Strong Consistency: A New Milestone
- What’s Next on ScyllaDB’s Path to Strong Consistency
Moving to Raft-Based Clusters: Important User Impacts
Starting with ScyllaDB Enterprise 2023.1, all new clusters are created with Raft enabled by default. Clusters upgraded from 2022.x will use Raft only if you explicitly enable it (see the upgrade docs). Once all nodes in the cluster opt in to using Raft, the cluster automatically migrates schema management to Raft (you should validate that this migration has completed).
Once Raft is enabled, every cluster-level operation – like updating schema, adding and removing nodes, and adding and removing data centers – requires a quorum to be executed. For example, in the following use cases, the cluster does not have a quorum and will not allow updating the schema:
- A cluster with 1 out of 3 nodes available
- A cluster with 2 out of 4 nodes available
- A cluster with two data centers (DCs), each with 3 nodes, where one of the DCs is not available
This is different from the behavior of a ScyllaDB cluster with Raft disabled.
Nodes might be unavailable due to network issues, node issues, or other reasons. To reduce the chance of quorum loss, it is recommended to have 3 or more nodes per DC and, for a multi-DC cluster, 3 or more DCs. To recover from a quorum loss, it’s best to revive the failed nodes or fix the network partitioning. If this is impossible, see the Raft manual recovery procedure.
The docs provide more details on handling failures in Raft.
Distributed Aggregations
ScyllaDB now automatically runs aggregation statements, like SELECT COUNT(*), on all nodes and all shards in parallel, which brings a considerable speedup, up to 20X in larger clusters. All types of aggregations are supported.
This feature is limited to queries that do not use GROUP BY or filtering.
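As a rough illustration (the table and column names below are hypothetical), only the first query is parallelized; the other two use GROUP BY or filtering and therefore take the regular execution path:
SELECT COUNT(*) FROM ks.events;
SELECT COUNT(*) FROM ks.events GROUP BY id;
SELECT COUNT(*) FROM ks.events WHERE value > 10 ALLOW FILTERING;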
The implementation includes a new level of coordination. A Super-Coordinator node splits aggregation queries into sub-queries, distributes them across a group of coordinators, and merges the results. Like a regular coordinator, the Super-Coordinator is a per-operation role, not a dedicated node.
Example results:
A 3-node cluster on powerful desktops (3×32 vCPUs)
Filled the cluster with ~2 * 10^8 rows using scylla-bench, then ran:
> time cqlsh <ip> <port> --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"
Before Distributed Select: 68s
After Distributed Select: 2s
You can disable this feature by setting the enable_parallelized_aggregation config parameter to false.
Read more in How ScyllaDB Distributed Aggregates Reduce Query Execution Time up to 20X
TTL for ScyllaDB’s DynamoDB API (Alternator)
ScyllaDB Alternator is an Amazon DynamoDB-compatible API that allows any application written for DynamoDB to be run, unmodified, against ScyllaDB (the fastest distributed database). ScyllaDB supports the same client SDKs, data modeling and queries as DynamoDB. However, you can deploy ScyllaDB wherever you want: on-premises, or on any public cloud. ScyllaDB provides lower latencies and solves DynamoDB’s high operational costs. You can deploy it however you want via Docker or Kubernetes, or use ScyllaDB Cloud for a fully-managed NoSQL database-as-a-service solution.
We previously introduced Time To Live (TTL) to Alternator as an experimental feature. It is now promoted to production ready. Like in DynamoDB, Alternator items that are set to expire at a specific time do not disappear precisely at that time, but only after some delay. DynamoDB guarantees that the expiration delay will be less than 48 hours (though for small tables, the delay is often much shorter). In Alternator, the expiration delay is configurable – it defaults to 24 hours and can be set with the --alternator-ttl-period-in-seconds configuration option.
MORE ON SCYLLADB AS A DYNAMODB ALTERNATIVE
Large Collection Detection
ScyllaDB has traditionally recorded large partitions, large rows, and large cells in system tables so they can be identified and addressed. Now, it also records collections with a large number of elements since they can also cause degraded performance.
Additionally, we introduced a new configurable warning threshold, compaction_collection_elements_count_warning_threshold, which defines how many elements are considered a “large” collection (default: 10,000 elements).
The information about large collections is stored in the large_cells table, with a new collection_elements column that contains the number of elements of the large collection.
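As a sketch of how you might inspect this data (the query below assumes the usual large_cells columns such as keyspace_name, table_name, and column_name; check the table schema in your version):
SELECT keyspace_name, table_name, column_name, collection_elements FROM system.large_cells;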
Automating Away the gc_grace_seconds Parameter
Optional automatic management of tombstone garbage collection, replacing gc_grace_seconds, has been promoted from an experimental feature to production ready. It drops tombstones more frequently if repairs are run on time, and prevents data resurrection if repairs are delayed beyond gc_grace_seconds. Tombstones older than the most recent repair become eligible for purging, and newer ones are kept. The feature is disabled by default and needs to be enabled via ALTER TABLE.
For example:
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
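If you need to switch back to the timing-based behavior, a statement along these lines (assuming the 'timeout' mode, which relies on gc_grace_seconds, is still what you want) does it:
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'};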
MORE ON REPAIR BASED TOMBSTONE GARBAGE COLLECTION
Preventing Timeouts When Processing Long Tombstone Sequences
Previously, the paging code required that each page contain at least one row before filtering. This could cause an unbounded amount of work if there was a long sequence of tombstones in a partition or token range, leading to timeouts. ScyllaDB now sends empty pages to the client, allowing progress to be made before a timeout occurs. This prevents analytics workloads from failing when processing long sequences of tombstones.
Secondary Index on Collections
Secondary indexes can now index collection columns. This allows you to index individual keys as well as values within maps, sets, and lists.
For example:
CREATE TABLE test(id int, somemap map<int, int>, somelist list<int>, someset set<int>, PRIMARY KEY(id));
INSERT INTO test (id, somemap, somelist, someset) VALUES (7, {7: 7}, [1,3,7], {7,8,9});
CREATE INDEX ON test(keys(somemap));
CREATE INDEX ON test(values(somemap));
CREATE INDEX ON test(entries(somemap));
CREATE INDEX ON test(values(somelist));
CREATE INDEX ON test(values(someset));
CREATE INDEX IF NOT EXISTS ON test(somelist);
CREATE INDEX IF NOT EXISTS ON test(someset);
CREATE INDEX IF NOT EXISTS ON test(somemap);
SELECT * FROM test WHERE someset CONTAINS 7;
SELECT * FROM test WHERE somelist CONTAINS 7;
SELECT * FROM test WHERE somemap CONTAINS KEY 7;
SELECT * FROM test WHERE somemap CONTAINS 7;
SELECT * FROM test WHERE somemap[7] = 7;
Additional Improvements
The new release also introduces numerous improvements across:
- CQL API
- Amazon DynamoDB Compatible API (Alternator)
- Correctness
- Performance and stability
- Operations
- Deployment and installations
- Tools
- Configuration
- Monitoring and tracing
For complete details, see the release notes.