
IMPORTANT: Since the first publication of the Mutant Monitoring System we have made a number of updates to the concepts and code presented in this blog series. You can find the latest version in ScyllaDB University.
This is part 4 of a series of blog posts that provides a story arc for ScyllaDB Training.
If you have gotten this far, you have the Mutant Monitoring System up and running, with the Mutant Catalog and Tracking keyspaces populated with data. At Division 3, our mutant datacenters are experiencing more and more cyber attacks by evil mutants, and we sometimes experience downtime and cannot track our IoT sensors. We must now plan for disaster scenarios so that we know we can survive an attack. In this exercise, we will walk through a node failure scenario and learn about consistency levels.
Environment Setup
We will need to set up a fresh cluster by building the Docker containers using the files in the scylla-code-samples GitHub repository.
git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/mms
A new feature will automatically import the data and keyspaces from the first couple of blog posts for you. Edit docker-compose.yml and add IMPORT=IMPORT under the environment section for scylla-node1 so the existing mutant data is imported when the container starts:
environment:
- DC1=DC1
- IMPORT=IMPORT
- SEEDS=scylla-node1,scylla-node2
- CQLSH_HOST=scylla-node1
Build and start the containers:
docker-compose build
docker-compose up -d
After about 60 seconds, the catalog and tracking keyspaces will be imported.
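If you want to confirm that the import completed, you can list the keyspaces from any node (this assumes the keyspace names from the earlier posts, catalog and tracking):
docker exec -it mms_scylla-node1_1 cqlsh -e 'DESCRIBE KEYSPACES'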
Simulating the Attack
Let’s use the nodetool command to examine the status of the nodes in our cluster:
docker exec -it mms_scylla-node1_1 nodetool status
We can see that all three nodes are currently up and running because the status is set to UN (Up/Normal). Now let’s forcefully remove node 3 from the cluster:
docker pause mms_scylla-node3_1
Now use nodetool to check the status of the cluster again after about 30 seconds:
docker exec -it mms_scylla-node1_1 nodetool status
We can now see that node 3 is down because its status is DN (Down/Normal). The data is still safe because we created each keyspace with a replication factor of three, which places a replica of every row on each of the three nodes.
We should still be able to run queries on ScyllaDB:
docker exec -it mms_scylla-node1_1 cqlsh
select * from catalog.mutant_data;
All of the Mutant data is still there and accessible even though only two of the three replicas are available, because we are using a replication factor of three (RF=3).
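For reference, the keyspaces from the earlier posts were created with replication settings roughly like the following (the exact statement lives in the import scripts in the repository; DC1 is the datacenter name set in the environment section above):
CREATE KEYSPACE catalog WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};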
Consistency Levels
The data is still accessible because the Consistency Level is still being met. The Consistency Level (CL) determines how many replicas in a cluster must acknowledge a read or write operation before it is considered successful. QUORUM is the default Consistency Level: a read or write request is honored when a majority of the replicas respond. QUORUM works out to floor(n/2) + 1, where n is the Replication Factor, so with a replication factor of 3 only 2 replicas need to respond.
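You can check or change the consistency level of your cqlsh session at any time; CONSISTENCY with no argument prints the current level, and CONSISTENCY followed by a level sets it:
CONSISTENCY;
CONSISTENCY QUORUM;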
Let’s test how consistency levels work with the cqlsh client. In this exercise, we will try to write data to the cluster using a Consistency Level of ONE, which means the read or write request is honored as soon as a single replica responds.
docker exec -it mms_scylla-node1_1 cqlsh
CONSISTENCY ONE;
insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Jobs','1 Apple Road', 'http://www.facebook.com/jobs') ;
select * from catalog.mutant_data;
The query was successful!
Now let’s test using a Consistency Level of ALL. This means that all of the replicas must respond to the read or write request, otherwise the request will fail.
docker exec -it mms_scylla-node1_1 cqlsh
CONSISTENCY ALL;
insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Wozniak','2 Apple Road', 'http://www.facebook.com/woz') ;
select * from catalog.mutant_data;
The query failed with the error “NoHostAvailable” because only two of the three nodes are online and the Replication Factor (RF) is 3, so a Consistency Level of ALL cannot be satisfied.
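Since QUORUM only requires two of the three replicas, dropping back from ALL to QUORUM should let the same insert succeed while the node is still down:
CONSISTENCY QUORUM;
insert into catalog.mutant_data ("first_name","last_name","address","picture_location") VALUES ('Steve','Wozniak','2 Apple Road', 'http://www.facebook.com/woz') ;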
Adding a Node Back and Repairing the Cluster
We need to get the cluster back into a healthy state with three replicas by adding a new node.
cd scylla-code-samples/mms/scylla
docker build -t newnode .
docker run -d --name newnode --net=mms_web -e DC1=DC1 -e SEEDS=scylla-node1 -e CQLSH_HOST=newnode newnode
Next, we need to get the IP address of the node that is down:
docker inspect mms_scylla-node3_1 | grep IPAddress
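If the grep output is hard to read, docker inspect can also print just the address by using a format template:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' mms_scylla-node3_1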
Now we can log in to the new node and begin the node replacement procedure by adding replace_address to scylla.yaml. This parameter tells ScyllaDB that the new node is replacing the failed node. ScyllaDB will also remove the old node, so it will no longer be visible in the nodetool status output.
docker exec -it newnode bash
Run the command below to add replace_address to scylla.yaml, replacing the IP address with the correct IP of mms_scylla-node3_1:
echo "replace_address: '172.18.0.3'" >> /etc/scylla/scylla.yaml
Finally, exit the container and restart the new node:
docker restart newnode
After a minute or so, we should see all three nodes online in the cluster:
docker exec -it mms_scylla-node1_1 nodetool status
Great, now we have three nodes. Is everything ok now? Of course not! Until we run a repair operation, existing data will only be on two of the nodes.
We can repair the cluster with the following command:
docker exec -it mms_scylla-node1_1 nodetool repair
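Repairing the whole cluster is fine for this small data set; on larger clusters you can also repair a single keyspace at a time, for example just the catalog keyspace:
docker exec -it mms_scylla-node1_1 nodetool repair catalog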
When the repair finishes, every node will be in sync and the data will be able to survive additional failures. Since we no longer need the paused container named mms_scylla-node3_1, we can delete it:
docker rm -f mms_scylla-node3_1
Conclusion
We are now more knowledgeable about recovering from node failure scenarios and should be able to recover from a real attack by the mutants. With a Consistency Level of QUORUM in a three-node ScyllaDB cluster, we can afford to lose one of the three nodes and still access our data. With our datacenter intact, we can begin looking at new ways to analyze data and run SQL queries with Apache Zeppelin. Stay tuned for the Zeppelin post coming soon and be safe out there!