ScyllaDB Change Data Capture (CDC) is an optional, automated recording of modifications made to one or more tables in your ScyllaDB cluster (database). CDC is primarily used for the scenarios described below.
CDC allows ScyllaDB users to build streaming data pipelines that enable real-time data processing and analysis. More and more use cases require an almost immediate reaction to modifications occurring in the database; in a fast-moving world, the amount of new information is so large that it has to be processed on the spot. For example, it might be desirable to send an SMS to a user if a login attempt is performed from a country the user does not live in – a typical fraud detection operation. Change Data Capture is also useful for replicating data stored in the database to other systems that create indexes or aggregations for it.
The following three diagrams depict how the write flow works for a table that already has CDC enabled: (1) a client writes to the base table, e.g. INSERT INTO base_table(…), (2) if required, the coordinator node reads the existing row data to generate the pre-/post-image, and (3) the coordinator then generates the CDC log table writes and piggybacks them on the base table write.
Change data capture is enabled per table. It records modifications made to CQL rows and lets you query the history of all changes made to enabled tables in your ScyllaDB database. When CDC is enabled on a table, a corresponding CDC log table is created, and mutations (or CDC records, if you prefer) are ordered by the timestamp of the original operation and the associated CQL batch sequence number. This ordering is important so that consumers know how to retrieve data and continuously poll the table, and the batch sequence number ensures that writes from the same batch are processed together, improving performance and assuring correctness. The log can optionally record one or more of the following for each change: the delta (the change itself), the pre-image (the row state before the change), and the post-image (the row state after the change).
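To make this concrete, here is a minimal sketch of enabling CDC on an existing table through the gocql Go driver. The keyspace and table name (ks.orders) are illustrative, and the preimage/postimage options shown correspond to the optional images discussed above; check the ScyllaDB CDC documentation for the full list of options.

```go
// Minimal sketch: enabling CDC on an existing table via gocql.
// ks.orders is an illustrative keyspace/table name.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Enable CDC and request the optional pre- and post-images.
	// This creates a companion CDC log table for ks.orders.
	if err := session.Query(
		`ALTER TABLE ks.orders WITH cdc = {'enabled': true, 'preimage': true, 'postimage': true}`,
	).Exec(); err != nil {
		log.Fatal(err)
	}
}
```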
When writing applications, you can use our language-specific, shard-aware libraries (Rust, Go, and Java) to simplify writing applications that read from ScyllaDB CDC; these libraries take care of much of the low-level work of consuming the log.
ScyllaDB CDC data tables are stored and distributed in the cluster, sharing all the same properties as you’re used to with your normal data. All considerations related to partition and clustering keys apply to CDC log tables. There’s no need for a separate specialized application to consume it.
The topology of the CDC log is meant to match the original data, so you get the same data distribution. It is synchronized with your writes and shares the same properties and distribution, all to minimize the overhead of adding the log.
Everything in a CDC log is transient: it represents a window back in time that you can consume and inspect, and records are then automatically erased. The default time-to-live (TTL) is 24 hours, which means the CDC log is not going to overwhelm your storage system.
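If 24 hours is more than you need, the retention window can be tuned with the cdc 'ttl' table option, expressed in seconds. The sketch below uses the same illustrative ks.orders table; verify the option name against the CDC documentation for your ScyllaDB version.

```go
// Hedged sketch: shortening the CDC log retention window to 12 hours.
// The default of 86400 seconds corresponds to the 24-hour window above.
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Keep CDC records for 43200 seconds (12 hours) instead of the default 24 hours.
	if err := session.Query(
		`ALTER TABLE ks.orders WITH cdc = {'enabled': true, 'ttl': 43200}`,
	).Exec(); err != nil {
		log.Fatal(err)
	}
}
```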
As described above, ScyllaDB CDC tables are normal tables and can thus be consumed at the lowest level through normal CQL operations. The data is already deduplicated, so there is no need to aggregate results from different replicas, because the CDC records always reside on the same shards as the base replicas.
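As a rough illustration of this low-level consumption, the following Go sketch polls the CDC log with plain CQL via gocql. It assumes CDC is enabled on the illustrative ks.orders table, whose log table is then named ks.orders_scylla_cdc_log; a real consumer would normally use the shard-aware libraries mentioned above, which are designed to handle details such as stream generations and progress tracking for you.

```go
// Minimal sketch: reading CDC records with plain CQL via gocql.
// Assumes CDC is enabled on ks.orders, so the log table is
// ks.orders_scylla_cdc_log.
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Each row carries CDC metadata columns (such as cdc$stream_id,
	// cdc$time, cdc$batch_seq_no and cdc$operation) alongside the
	// base table columns affected by the change.
	iter := session.Query(`SELECT * FROM ks.orders_scylla_cdc_log LIMIT 100`).Iter()
	for {
		row := map[string]interface{}{}
		if !iter.MapScan(row) {
			break
		}
		fmt.Printf("change: %v\n", row)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```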
You can also build integrations on top of CDC for more advanced use cases, which can be grouped into two categories: integration with other systems and real-time processing of data. Real-time processing can be done, for example, with Kafka Streams or Spark, and is useful for triggers and monitoring.
The ScyllaDB CDC Source Connector is a Debezium connector compatible with Kafka Connect (Kafka 2.6.0+). The connector reads the CDC log for specified tables and produces a Kafka message for each row-level INSERT, UPDATE, or DELETE operation. The connector is fault-tolerant, retrying reads from ScyllaDB in case of failure, and it periodically saves its current position in the ScyllaDB CDC log using Kafka Connect offset tracking. Each generated Kafka message contains information about the source, such as the timestamp and the table name.
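In practice, a connector instance is registered with a Kafka Connect worker over its REST API. The sketch below posts an example configuration from Go; the connector class and the scylla.* property names are assumptions to verify against the ScyllaDB CDC Source Connector documentation, and the host, cluster name, and table name are placeholders.

```go
// Hedged sketch: registering the ScyllaDB CDC Source Connector with a
// Kafka Connect worker via its REST API. Property names and the
// connector class should be verified against the connector docs.
package main

import (
	"bytes"
	"log"
	"net/http"
)

func main() {
	config := []byte(`{
	  "name": "orders-cdc-connector",
	  "config": {
	    "connector.class": "com.scylladb.cdc.debezium.connector.ScyllaConnector",
	    "scylla.name": "orders-cluster",
	    "scylla.cluster.ip.addresses": "127.0.0.1:9042",
	    "scylla.table.names": "ks.orders",
	    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
	    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
	  }
	}`)

	// POST the connector definition to the Kafka Connect worker.
	resp, err := http.Post("http://localhost:8083/connectors", "application/json", bytes.NewReader(config))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("Kafka Connect responded with:", resp.Status)
}
```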
Confluent is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. It expands the benefits of Apache Kafka with enterprise-grade features. Confluent makes it easy to build modern, event-driven applications, and gain a universal data pipeline, supporting scalability, performance, and reliability.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to define connectors that move large data sets in and out of Kafka. It can ingest entire databases or collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency.
Kafka Connect includes two types of connectors: source connectors, which ingest data from external systems into Kafka topics, and sink connectors, which deliver data from Kafka topics to other systems.
|                        | Cassandra     | DynamoDB            | MongoDB             | ScyllaDB            |
|------------------------|---------------|---------------------|---------------------|---------------------|
| Consumer location      | on-node       | off-node            | off-node            | off-node            |
| Replication            | duplicated    | deduplicated        | deduplicated        | deduplicated        |
| Deltas                 | yes           | no                  | partial             | yes                 |
| Pre-image              | no            | yes                 | no                  | yes, optional       |
| Post-image             | no            | yes                 | yes                 | yes, optional       |
| Slow consumer reaction | Table stopped | Consumer loses data | Consumer loses data | Consumer loses data |
| Ordering               | no            | yes                 | yes                 | yes                 |
Read the documentation for additional information to get started.
Apache® and Apache Cassandra® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Amazon DynamoDB® and Dynamo Accelerator® are trademarks of Amazon.com, Inc. No endorsements by The Apache Software Foundation or Amazon.com, Inc. are implied by the use of these marks.