True-life tales of what not to do to your database!
Halloween is that frightening time of the year when the veil between the well-ordered world of DevOps and the dark chthonic void of dev/null is thinnest. When chaos monkeys reign supreme and Site Reliability Engineers lock the doors and bolt the shutters.
Today we will share with you true-life NoSQL horror stories our engineers have witnessed in the world of big data. But be forewarned! What you are about to read may just make you shudder deep down to your very hyperthreaded cores.
It was a dark and stormy night…
Picture being a user running in production, expecting enterprise-grade scalability and stability, frenetically complaining that nodes are being “terminated”… and then you wake up in a cold sweat realizing you only provisioned spot instances!!!
…
Or that one day when you have a boolean column and make a secondary index on it, only to realize belatedly that you just created terabytes of data split into two huge partitions!!!
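For the brave, here is a rough Python sketch of why a boolean makes such a frightful index key. It assumes an index whose partitions are keyed by the indexed value, so every row lands in one of only two buckets. The `index_partition_sizes` helper and the sample flags are hypothetical, invented purely for illustration.

```python
from collections import Counter

def index_partition_sizes(indexed_values):
    """Count how many rows land in each index partition.

    Assumption: the index partitions rows by the indexed value itself,
    so the number of partitions equals the number of distinct values.
    """
    return Counter(indexed_values)

# Hypothetical data: 7 rows flagged True, 3 rows flagged False.
flags = [True] * 7 + [False] * 3
sizes = index_partition_sizes(flags)

# No matter how many rows you add, a boolean only ever gives two partitions.
print(len(sizes))  # 2
```

Swap in a column with many distinct values and the same rows spread across many small partitions instead of two monstrous ones.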
…
Imagine you have the idea to push your client application with maximum parallelism. Unlimited parallelism! “They called me mad! But I’ll show them! Mwah hah hah!” Only then do you discover how unlimited parallelism creates a death spiral, generating queuing where each query’s final latency equals its own latency plus the sum of the latencies of all the requests ahead of it in the queue!!!
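The queuing arithmetic can be sketched in a few lines of Python. This is a deliberately simplified model, not a benchmark: it assumes every request arrives at the same moment and that the server works through them one at a time. The `queued_latencies` helper and the 2 ms service time are hypothetical.

```python
from itertools import accumulate

def queued_latencies(service_times_ms):
    """Latency seen by each request in a first-come, first-served queue.

    Assumption: all requests arrive at once and are served one at a time,
    so each request waits for everything ahead of it, then for itself.
    """
    return list(accumulate(service_times_ms))

# Five requests that would each take 2 ms on an idle server.
latencies = queued_latencies([2.0] * 5)
print(latencies)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Under these assumptions the last request in line pays for every request in front of it, so piling on more parallelism only makes the tail latency longer.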
…
Then there was the time that poor soul used the same disk for data as the OS/root drive.
Save Yourself While You Can!
Sadly, it is already too late for some developers. They plunged themselves into nightmarish situations from which their sanity can never fully recover, even if their data has been recovered. Yet it’s not too late for you, dear reader, to think about how you can avoid the same fate.
…
Never create a huge Cartesian product by concatenating multiple IN conditions in a single query. This path only leads to madness!
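To see how quickly this monster grows, here is a small Python sketch that counts the combinations implied by several IN lists in one query. The helper name `in_clause_combinations` and the list sizes are hypothetical; the arithmetic is simply the product of the list lengths.

```python
from itertools import product
from math import prod

def in_clause_combinations(*in_lists):
    """Number of value combinations implied by several IN lists in one query."""
    return prod(len(values) for values in in_lists)

# Two tiny IN lists: 2 values x 3 values.
small = list(product([1, 2], ["a", "b", "c"]))
print(len(small))  # 6

# Three IN lists of 100 values each look harmless on their own...
print(in_clause_combinations(range(100), range(100), range(100)))  # 1000000
```

Three modest lists already add up to a million combinations, which is why the count deserves a sanity check before the query ever leaves the client.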
…
Also, be wary of that innocuous soul from IT who “just wants to shut down unused instances,” but ends up killing your production instances! (It was just an accident, wasn’t it?)
…
You’ve always wanted to have that viral success. And you finally get it. But then the users… They won’t stop! They. Just. Keep. Coming! All targeting that one hot partition. (Read our related horror story on the Dress that Broke the Internet here.)
…
So, “ah hah!” you think. You’ll just redesign your database to have a cache in front of it. This time, you swear, things will be different! Then — the cache goes out of sync with the database! (We have heard horror stories of caches and databases so often we wrote this white paper on why it’s a bad idea.)
…
Perhaps disaster strikes. A catastrophic outage. But “Don’t panic!” you tell your team, “We religiously keep backups.” They relax a little. Until everyone realizes they actually never had enough time to test the backups or work out precise procedures on how to restore!
The feeling when you discover your multi-AZ deployment was actually done within the same AZ.
Don’t Open That Door to the Basement!
There are many more stories of woe deep in the trenches of NoSQL. Here are a few more brief examples of what NOT to do:
- EBS isn’t a panacea. Guess what? You got an email that your block storage got erased.
- Wish `DROP_TABLE` no longer had autocomplete and that you’d put your fingers on a diet?
- Selecting a smaller replication factor, thinking that makes a really good savings plan, until it’s not… AT ALL!
- Keeping your database on a spot instance for cost savings, until you discover it has downsides too.
- Data necromancy. Because data resurrection with a log structured merge tree is as frightening as resurrection of the dead.
- Misusing the TTL (Time To Live) attribute, either by setting it to `0` and never deleting any data (uh-oh, storage bloat!) or by setting it to too small a value and having some data deleted too early (where did my data go??)
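
To make the two TTL pitfalls in the last bullet concrete, here is a minimal Python sketch. It assumes, as the bullet describes, that a TTL of `0` means the data never expires. The `expires_at` helper, the sample write date, and the 30-day retention target are hypothetical, made up for illustration rather than taken from any database's API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60

def expires_at(written_at: datetime, ttl_seconds: int) -> Optional[datetime]:
    """Return when a write expires, or None if it never does (hypothetical helper)."""
    if ttl_seconds < 0:
        raise ValueError("TTL must not be negative")
    if ttl_seconds == 0:
        # Assumption taken from the text: a TTL of 0 means "keep forever".
        return None
    return written_at + timedelta(seconds=ttl_seconds)

written = datetime(2024, 10, 31, tzinfo=timezone.utc)
intended = expires_at(written, 30 * SECONDS_PER_DAY)  # 30 days, as planned
too_small = expires_at(written, 30)                   # oops: 30 seconds, not 30 days
never = expires_at(written, 0)                        # never expires: storage bloat ahead
```

Checking the computed expiry against the retention period you actually intended is a cheap way to catch a TTL given in the wrong unit before any data vanishes.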