
Distributed Database Architecture

Distributed Database Architecture Definition

A distributed database is a single logical database that is distributed across multiple physical databases, servers, data centers, or even separate networks.

Each node is a single computational whole — whether that node is a virtual instance carved out of a larger physical server (as you might often find in the public cloud) or a complete physical server running your database (as you might find typically on-premises). We define a cluster as comprising one or more nodes; thus, a distributed database needs to be running on a cluster of N nodes where N > 1.

There are a few key decisions to make when architecting a distributed database. The database designers first have to decide how to define a cluster and distribute data across it. Next, they have to determine the roles of the nodes in the cluster. Is every node a peer, or are some nodes leaders and others followers? And then, based on these roles, how do they deal with failover?
Lastly, they have to figure out, based on all of this, how to replicate and shard data as evenly and easily as possible.

Image shows distributed databases nodes creating a network.

Distributed Database Architecture FAQs

What are Clustering & Distribution Strategies In Distributed Database Architecture?

When you have a database engine running across these multiple nodes, what do you do with your data? Do you split it as evenly as possible between them? That’s known as sharding. Or do you keep full copies on each of the nodes? That’s called replication.
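
As a minimal sketch of the difference, the snippet below places a handful of keys on a three-node cluster both ways. The node names, the `user:N` keys, and the hash-modulo placement rule are all assumptions made for illustration; real databases use more elaborate partitioning schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def stable_hash(key: str) -> int:
    """Hash a key identically on every machine (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_owner(key: str, nodes=NODES) -> str:
    """Sharding: each key lives on exactly one node."""
    return nodes[stable_hash(key) % len(nodes)]


def replica_owners(key: str, nodes=NODES) -> list[str]:
    """Full replication: every node holds a copy of every key."""
    return list(nodes)


for key in (f"user:{i}" for i in range(6)):
    print(key, "| sharded on:", shard_owner(key), "| replicated on:", replica_owners(key))
```

With sharding alone, each node stores roughly a third of the keys. With full replication, each node stores all of them. Most production systems combine the two.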

There’s also the literal issue of physical distance between your servers because, as far as I know, databases need to obey the speed of light. So if you need to keep your database in sync quickly, you need to keep your cluster localized, in the same datacenter. A local cluster like this is a type of distributed database, but it’s just the beginning.
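
To put a rough number on that, here is a back-of-the-envelope sketch of the best-case network round trip between two cities. The city pair (San Francisco and London), the coordinates, and the figure of about 200,000 km/s for light in optical fiber (roughly two thirds of its speed in a vacuum) are assumptions for illustration. Real routes are longer than the great-circle path and add switching delays, so measured latencies are higher.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_PER_S = 200_000.0  # assumption: about 2/3 of the speed of light in a vacuum


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time: there and back at fiber speed, nothing else."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


# Approximate city-center coordinates (assumed values).
distance = great_circle_km(37.7749, -122.4194, 51.5074, -0.1278)
print(f"distance: {distance:.0f} km, best-case round trip: {min_round_trip_ms(distance):.0f} ms")
```

Under these assumptions the floor is in the tens of milliseconds before any database work happens, which is why a cluster that must stay tightly in sync tends to stay in one datacenter.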

If you want to serve data close to users spread over a geographic area, you may be able to have multiple local clusters — one in the US, one in Europe, one in Asia. That keeps local user latencies low.

But now your disparate local clusters may need to intercommunicate through some sort of synchronization or update mechanism. For example, this is how DNS or Active Directory work. Each system works on its own, and there’s a propagation delay between updates across the different systems.

That might not be good enough for some production use cases, though. Alternatively, if you can tolerate “speed of light” propagation delays and use what’s known as eventual consistency, you may be able to spread the cluster itself around the world. Some servers may be in the US, others in Europe or Asia. Yet it’s all considered the same logical cluster.

What are Node Roles (Active-Active, Active-Passive, etc.) In Distributed Database Architecture?

Another aspect of distributed database architecture involves the role of the nodes in your database. Are they all peers, each capable of full writes, or are any of them designated as leaders or primaries with others designated as read-only replicas?

Previously it was common to have a replica set aside as a “hot standby” only used in case the primary server went down — and that’s still a successful model for many systems. But that hot standby is not taking on any of the load. It’s just sitting there idly humming just in case of emergency.

That’s why many people prefer peer-to-peer leaderless topologies, where everyone gets to share the load. And there’s no single point of failure, and no need to spend time hiccupping during failover.

In these so-called active-active scenarios, keeping systems in sync is more complicated, but if you can solve for it, you’ve eliminated any single point of failure in your database.
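
One common way to reconcile peers that accepted writes independently is "last write wins": every value carries a timestamp, and the newer one survives a merge. The sketch below is an illustration of that idea only, not a description of how any particular database implements it. The `lww_merge` function, the keys, and the integer timestamps are hypothetical.

```python
def lww_merge(a: dict, b: dict) -> dict:
    """Merge two replicas' states, each mapping key -> (timestamp, value).

    The entry with the higher timestamp wins. Ties are broken by comparing
    the values themselves (strings here), so the result does not depend on
    which replica is merged into which.
    """
    merged = dict(a)
    for key, entry in b.items():
        if key not in merged or entry > merged[key]:
            merged[key] = entry
    return merged


# Two peers accepted different writes while out of contact (hypothetical data).
peer_us = {"cart": (1, "apple"), "name": (5, "Ann")}
peer_eu = {"cart": (3, "pear"), "email": (2, "ann@example.com")}

print(lww_merge(peer_us, peer_eu))
```

Because the merge gives the same answer in either direction, both peers converge on the same state once they have exchanged updates. The price is that the older of two conflicting writes is silently discarded.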

Also, even if you have a distributed database, that doesn’t mean that your clients are aware of the topology. You can either put load balancers in front of your distributed database, or implement client-side load balancing with intelligent clients that know how your database is sharded and route queries to the right nodes.
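
The two options can be contrasted in a small sketch. Both classes, the node names, and the hash-modulo ownership rule are hypothetical stand-ins; real drivers learn the actual partitioning scheme from the cluster.

```python
import hashlib
from itertools import cycle


class RoundRobinBalancer:
    """Topology-unaware: hands out nodes in turn, like a balancer in front of the database."""

    def __init__(self, nodes):
        self._nodes = cycle(list(nodes))

    def pick(self, key: str) -> str:
        # The key is ignored, so the chosen node may have to forward the request.
        return next(self._nodes)


class ShardAwareClient:
    """Topology-aware: computes the owning node and sends the query straight there."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def pick(self, key: str) -> str:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]


nodes = ["node-a", "node-b", "node-c"]
balancer, client = RoundRobinBalancer(nodes), ShardAwareClient(nodes)
for _ in range(3):
    print("balancer ->", balancer.pick("user:42"), "| aware client ->", client.pick("user:42"))
```

The balancer spreads requests evenly but sends the same key to a different node each time. The aware client always lands on the same node for a given key, which saves a network hop when that node owns the data.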

What are Data Replication and Sharding In Distributed Database Architecture?

Now let’s look at the replication and sharding strategies that are used across distributed database architectures. You can make each node a full replica of the entire database. You could have, for example, three full sets of data on three different servers. Or you can distribute different pieces across multiple servers, sharded somewhat differently on each server, so that it’s more difficult to lose any one piece of data even if two or more servers in the cluster die.

Image showing database sharding. 3 columns each with 2 letters, starting with A.

Example 1: Basic data sharding. Note that in this case, while data is sharded for load balancing across the different nodes, it does not provide high availability because none of the shards are replicated.

Data Replication (Primary/Replica) image showing 3 columns, each with letters A-F in them.

 

Example 2: A primary-replica method of data replication, where one node is used to write data, which then can be propagated out to other read-only nodes. This provides some level of high availability, with a replica being able to take over the cluster in case the primary goes offline. However, it does not properly load balance your workload, because all writes have to be handled at the primary, so it may be impractical for write-heavy workloads.
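
The primary-replica flow in Example 2 can be sketched as a toy in-memory cluster. The class, its method names, and the synchronous copy to replicas are simplifying assumptions; real systems often replicate asynchronously and elect the new primary rather than picking the first replica.

```python
class PrimaryReplicaCluster:
    """Toy model: one writable primary, read-only replicas, manual failover."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self.replicas = list(replicas)
        self.data = {name: {} for name in [primary, *replicas]}
        self._next_read = 0

    def write(self, key, value):
        # Every write is handled by the primary, then propagated to the replicas.
        self.data[self.primary][key] = value
        for replica in self.replicas:
            self.data[replica][key] = value

    def read(self, key):
        # Reads rotate across replicas; fall back to the primary if none are left.
        if not self.replicas:
            return self.data[self.primary].get(key)
        node = self.replicas[self._next_read % len(self.replicas)]
        self._next_read += 1
        return self.data[node].get(key)

    def fail_primary(self) -> str:
        """Simulate losing the primary and promote the first replica in its place."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        del self.data[self.primary]
        self.primary = self.replicas.pop(0)
        return self.primary


cluster = PrimaryReplicaCluster("primary", ["replica-1", "replica-2"])
cluster.write("greeting", "hello")
print("new primary:", cluster.fail_primary(), "| data survives:", cluster.read("greeting"))
```

Note that the write path never gets wider: however many replicas you add, `write` still funnels through a single node.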

Data Replication (Active/Active) image showing 3 columns, each with letters A-F in them.

Example 3: Here, all data is sharded and replicated in an active-active leaderless topology. Every node can accept read and write operations, so all are peers in managing workload. Also, because of replication, the loss of part of the cluster will not result in lost data.
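
A minimal sketch of the combination in Example 3: each key is hashed to a starting node and then copied to the next nodes around the cluster, up to a replication factor. The function names, the node names, and the "next node in the list" placement rule are assumptions for illustration.

```python
import hashlib


def replicas_for(key: str, nodes: list[str], rf: int = 2) -> list[str]:
    """Return the rf nodes holding a key: a hashed starting node plus its successors."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def surviving_keys(keys, nodes, failed: set, rf: int = 2) -> list[str]:
    """Keys that still have at least one live replica after the given nodes fail."""
    return [k for k in keys if any(n not in failed for n in replicas_for(k, nodes, rf))]


nodes = ["node-a", "node-b", "node-c"]
keys = [f"item:{i}" for i in range(100)]

for rf in (1, 2):
    alive = surviving_keys(keys, nodes, failed={"node-a"}, rf=rf)
    print(f"rf={rf}: {len(alive)} of {len(keys)} keys survive the loss of node-a")
```

With a replication factor of 1 this degenerates into Example 1, where losing a node loses its shard. With a factor of 2, any single node can fail without losing data.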

For horizontal scalability, how does your system decide how to shard data across nodes? At first that was always a manual process, difficult and problematic to manage. So distributed databases implemented algorithms to automagically shard your data across your nodes. While that is far more prevalent these days, there are still some distributed database architectures that haven’t solved for how, specifically, to auto-shard, or that make auto-sharding an advanced feature you don’t get out of the box.
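
One family of auto-sharding algorithms is consistent hashing, sketched below. Nodes own many small ranges ("virtual nodes") on a hash ring, and a key belongs to the next token clockwise from its hash. This is a generic illustration rather than the algorithm of any database named in this article. The `HashRing` class, the node names, and the choice of 64 virtual nodes are assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        # Each node contributes `vnodes` tokens scattered around the ring.
        self._ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self._tokens = [token for token, _ in self._ring]

    def owner(self, key: str) -> str:
        # The first token after the key's hash owns it, wrapping around at the end.
        index = bisect.bisect_right(self._tokens, _hash(key)) % len(self._ring)
        return self._ring[index][1]


def fraction_moved(keys, before, after) -> float:
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"order:{i}" for i in range(2000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]

old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = fraction_moved(keys, old_ring.owner, new_ring.owner)
modulo_moved = fraction_moved(
    keys,
    lambda k: old_nodes[_hash(k) % len(old_nodes)],
    lambda k: new_nodes[_hash(k) % len(new_nodes)],
)
print(f"keys moved when adding a node: ring={ring_moved:.0%}, modulo={modulo_moved:.0%}")
```

The point of the comparison is rebalancing cost. With the ring, only keys claimed by the new node move. With naive hash-modulo placement, most keys change owner whenever the node count changes, which is part of what made manual sharding so painful.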

What is Topology Awareness In Distributed Database Architecture?

A final element of distributed database architecture is topology awareness. Distributed databases need to understand their own physical deployments.

For example, assume that you have a local cluster, but it’s all on the same rack in the datacenter. Then, somehow, power is knocked out to it. Whoops! Your whole system is down.

So rack-awareness means that your database can try to make sure that each server in the cluster is on its own rack — or that, at least, they are spread around across the available racks as evenly as possible.
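
A rack-aware placement rule can be sketched in a few lines: take one node from each rack in turn until enough replicas have been chosen. The function name, node names, and rack labels are hypothetical.

```python
from collections import defaultdict


def rack_aware_replicas(node_racks: dict[str, str], rf: int) -> list[str]:
    """Pick rf nodes, spread across racks as evenly as possible."""
    by_rack = defaultdict(list)
    for node, rack in sorted(node_racks.items()):
        by_rack[rack].append(node)

    chosen = []
    while len(chosen) < rf:
        progressed = False
        # One pass takes at most one node per rack, so racks fill evenly.
        for rack in sorted(by_rack):
            if by_rack[rack] and len(chosen) < rf:
                chosen.append(by_rack[rack].pop(0))
                progressed = True
        if not progressed:
            raise ValueError("not enough nodes for the requested replication factor")
    return chosen


topology = {"n1": "rack-1", "n2": "rack-1", "n3": "rack-2", "n4": "rack-3"}
print(rack_aware_replicas(topology, rf=3))
```

In this example the three replicas land on three different racks, so a power failure on any one rack leaves two copies online. A second node from the same rack is used only when every rack already holds a replica.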

It’s the same with datacenter awareness, across availability zones or regions. You want to make sure that no single datacenter disaster means you’ve lost part or all of your database. That actually happened earlier this year to one of our customers, but because they were deployed across three different datacenters, they lost zero data.

What is the Postgres Database Architecture?

“Postgres” offers local clustering out of the box. However, Postgres seems to still be working on its cross-cluster and multi-datacenter clustering. Users may have to put some effort into getting it working. Because SQL is grounded in a strongly consistent transactional mindset, it doesn’t lend itself well to spanning a cluster across a wide geography. Each query would be held up by long latency delays between all the relevant datacenters.

Also, Postgres relies upon a primary-replica model. One node in the cluster is the leader, and the others are replicas. And while there are load balancers and active-active add-ons for it, those are also beyond the base offering. Finally, sharding in Postgres still remains manual for the most part, though advances are being made in auto-sharding which are, again, beyond the base offering.

PostgreSQL – distributed SQL

  • Clustering & Distribution Strategies
    • Local clustering – multiple nodes in the same data center share updates
    • Cross-cluster updates – multiple clusters can share data between them
    • Multi-datacenter clustering – geographically, even globally dispersed, but same logical cluster
  • Node Roles, High Availability & Failover Strategies
    • Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
    • Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
    • Load balancing (client side or service in front of database)
  • Data Replication & Sharding Strategies
    • Replication Factors & Consistency Levels
    • Horizontal Scalability: Manual Sharding vs. Auto-sharding
    • Topology Awareness: Rack-awareness, Datacenter-awareness

Bold = Part of base offering

Italic = Can be added, but not part of base

What is the CockroachDB Database Architecture?

CockroachDB bills itself as “NewSQL”: a SQL database designed with distribution in mind. It is designed to be survivable (hence the name CockroachDB). Note that CockroachDB uses the Postgres wire protocol, and borrows heavily from many concepts pioneered in Postgres. However, it doesn’t limit itself to the Postgres architecture.

Multi-datacenter clustering and a peer-to-peer leaderless topology are built in from the get-go. So are auto-sharding and data replication. It also has datacenter-awareness built in, and you can add rack-awareness too.

CockroachDB – NewSQL

  • Clustering & Distribution Strategies
    • Local clustering – multiple nodes in the same datacenter share updates
    • Cross-cluster updates – multiple clusters can share data between them
    • Multi-datacenter clustering – geographically, even globally dispersed, but same logical cluster
  • Node Roles, High Availability & Failover Strategies
    • Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
    • Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
    • Load balancing (client side or service in front of database)
  • Data Replication & Sharding Strategies
    • Replication Factors & Consistency Levels
    • Horizontal Scalability: Manual vs. Auto-sharding
    • Topology Awareness: Rack-awareness*, Datacenter-awareness

*Can be manually configured using localities

Bold = Part of base offering

Italic = Can be added, but not part of base

CockroachDB requires strong consistency on all its transactions. You don’t have the flexibility of eventual consistency or tunable consistency. This will lower throughput and impose higher baseline latencies in any cross-datacenter deployment.

What is the Redis Database Architecture?

Redis is a key-value store designed to act as an in-memory cache or datastore. While it can persist data, it suffers from a huge performance penalty if the dataset doesn’t fit into RAM. Because of that, it was designed with local clustering in mind: if you can’t afford to wait five milliseconds to get data off an SSD, you probably can’t wait 145 milliseconds for the network round trip from San Francisco to London.

Redis – key-value in-memory DB/cache

  • Clustering & Distribution Strategies
    • Local clustering – multiple nodes in the same datacenter share updates
    • Cross-cluster updates – multiple clusters can share data between them
    • Multi-datacenter clustering – geographically, even globally dispersed, but same logical cluster
  • Node Roles, High Availability & Failover Strategies
    • Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
    • Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)*
    • Load balancing (client side or service in front of database)
  • Data Replication & Sharding Strategies
    • Replication Factors & Consistency Levels (e.g., strong locally; causal consistency in active-active*)
    • Horizontal Scalability: Manual vs. Auto-sharding
    • Topology Awareness: Rack-awareness*, Datacenter-awareness

*Redis Enterprise feature

Bold = Part of base offering

Italic = Can be added, but not part of base

However, there are enterprise features that do allow multi-datacenter Redis clusters for those who do need geographic distribution.

What is the MongoDB Database Architecture?

MongoDB is the venerable leader of the NoSQL pack. Over time, as it developed, a lot of distributed database capabilities were added. It’s come a long way from its origins. Now MongoDB is capable of multi-datacenter clustering. It still follows a primary-replica model for the most part, but there are ways to make it peer-to-peer active-active.

MongoDB – the leading document store

  • Clustering & Distribution Strategies
    • Local clustering – multiple nodes in the same datacenter share updates
    • Cross-cluster updates – multiple clusters can share data between them
    • Multi-datacenter clustering – geographically, even globally dispersed, but same logical cluster
  • Node Roles, High Availability & Failover Strategies
    • Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
    • Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
    • Load balancing (client side or service in front of database)
  • Data Replication & Sharding Strategies
    • Replication Factors & Consistency Levels
    • Horizontal Scalability: Manual vs. Auto-sharding
    • Topology Awareness: Rack-awareness, Datacenter-awareness

Bold = Part of base offering

Italic = Can be added, but not part of base

What is the ScyllaDB Database Architecture?

ScyllaDB’s distributed database architecture was patterned after the distributed database model found in Apache Cassandra. It provides, by default, multi-datacenter clustering and leaderless active-active topology. It automatically shards and has tunable consistency per operation, and, if you want stronger consistency, even supports lightweight transactions to provide linearizability of writes.
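
The arithmetic behind tunable consistency is short enough to sketch. In Cassandra-style databases, each operation chooses how many replicas must respond, with consistency levels such as ONE, QUORUM, and ALL. A read is guaranteed to overlap the latest acknowledged write when the read and write replica counts together exceed the replication factor. The helper names below are hypothetical, and the overlap rule is a simplification that ignores failures during the operation.

```python
def quorum(replication_factor: int) -> int:
    """A majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    return {"ONE": 1, "QUORUM": quorum(replication_factor), "ALL": replication_factor}[level]


def read_overlaps_write(replication_factor: int, write_level: str, read_level: str) -> bool:
    """True when every read set must share at least one replica with every write set."""
    writes = replicas_required(write_level, replication_factor)
    reads = replicas_required(read_level, replication_factor)
    return reads + writes > replication_factor


RF = 3
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    overlap = read_overlaps_write(RF, write_level, read_level)
    print(f"write={write_level:<6} read={read_level:<6} overlap guaranteed: {overlap}")
```

This is the trade-off being tuned: lower levels answer faster and tolerate more node failures, while overlapping levels give up some latency for fresher reads.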

ScyllaDB

  • Clustering & Distribution Strategies
    • Local clustering – multiple nodes in the same datacenter share updates
    • Cross-cluster updates – multiple clusters can share data between them
    • Multi-datacenter clustering – geographically, even globally dispersed, but same logical cluster
  • Node Roles, High Availability & Failover Strategies
    • Primary-replica (Active-passive; writes to primary only; read-only replicas; “hot standby” modes)
    • Peer-to-peer, leaderless (Active-Active, multi primaries; can write to any replica; no SPOF)
    • Load balancing (client side or service in front of database*)
  • Data Replication & Sharding Strategies
    • Replication Factors & Consistency Levels
    • Horizontal Scalability: Manual vs. Auto-sharding
    • Topology Awareness: Rack-awareness*, Datacenter-awareness

*For DynamoDB-compatible API

Bold = Part of base offering

As for topology awareness, ScyllaDB supports rack-awareness and datacenter-awareness. It even supports token-awareness and shard-awareness, to know not only which node data will be stored on, but even which CPU is associated with that data.

