In the run-up to ScyllaDB Summit 2018, we’re featuring our speakers and providing sneak peeks at their presentations. This interview in our ongoing series is with Ľuboš Koščo and Michal Šenkýř, both senior software engineers in the Streaming Infrastructure Team at Sizmek, the largest independent Demand-Side Platform (DSP) for ad tech. Their session is entitled Adventures in AdTech: Processing 50 Billion User Profiles in Real Time with ScyllaDB.
Thank you for taking the time to speak with me. I’d like to ask you each to describe your journey as technical professionals. How did you each get involved with ad tech and Sizmek in particular?
Ľuboš: I worked for Sun Microsystems and later for Oracle on applications and hosts in datacenter management software and cloud monitoring solutions. I am also interested in making source code readable and easily accessible, so I am part of the OpenGrok team.
I was looking for new challenges, and big data, artificial intelligence, mass processing and real-time processing were all interesting topics for me. AdTech is certainly an industry where all of them converge. So Sizmek was an obvious choice to satisfy my curiosity and open new horizons.
Michal: Frankly, getting into AdTech was a bit of a coincidence in my case, since before Sizmek I joined Seznam, a media company not unlike Google in its offerings, but focusing only on the Czech market. I intended to join the search engine team but there was a mixup and I ended up in the ad platform team. I decided to stick around and soon, due to my expertise in Scala, got to work on my first Big Data project using Spark. With that, a whole new world of distributed systems opened up to me. Some time (and several projects) later, I got contacted by Rocket Fuel (now Sizmek) to work on their real-time bidding system. They got me with the much bigger scale of operations, with a platform spanning the whole globe. It was a challenge I gladly took.
Last year you talked about how quickly you got up and running with ScyllaDB. “We picked ScyllaDB and just got it done–seven data centers up and running in two months.” What have you been up to since?
Ľuboš: We took on a bigger challenge: replace our user profile store with ScyllaDB. We knew back then it wouldn’t be that easy, since it’s one of the core parts of our real-time infrastructure. A lot of flows depend on it and the hardware takes a significant amount of space in our datacenters.
Preparing a proof of concept, doing capacity planning, making sure all pieces will work as designed were all tasks we had to do. However, we were able to tackle most of the challenges with the help of the ScyllaDB guys and we’re close to production now.
At the same time a similar task happened within our other department, where we replaced our page context proxy cache with ScyllaDB. This task is mostly getting to production now.
How about data management? How much data do you store, and what do your needs look like in terms of growth over time? How long do you keep your data?
Ľuboš: Currently we store roughly 30GB per node, part of it in ramdisk, on a total of 21 nodes across the globe. This data can grow up to 50GB per node by design. The TTL here depends on the use case deployed, but it ranges from a few hours to 3 or 7 days. The recent use case going to production will store much more data directly on SSDs, and right now it’s at 175GB per node on 20 nodes around the world. We can grow up to 1.7TB. This storage is persistent. Michal will comment on the upcoming profile store, which we will talk about in more detail at the Summit.
Michal: In terms of user profiles, we currently store about 50 billion records, which amounts to about 150TB of replicated data. It fluctuates quite a bit with increases for new enhancements and decreases due to optimizations, legislative changes, etc. We keep them for just a few months unless we detect further activity.
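[Editor’s note: As a rough back-of-the-envelope check on these figures, the sketch below divides the replicated volume by the record count. It assumes decimal units (1TB = 10^12 bytes). The replication factor of 3 in the usage line is a purely hypothetical value for illustration; it was not stated in the interview.]

```python
# Back-of-the-envelope sizing from the figures quoted above.
# Both numbers are approximate ("about") in the interview.
RECORDS = 50 * 10**9              # about 50 billion user profile records
REPLICATED_BYTES = 150 * 10**12   # about 150TB, assuming 1TB = 10**12 bytes


def replicated_bytes_per_record(total_bytes=REPLICATED_BYTES, records=RECORDS):
    """Average bytes stored per record, counting all replicas."""
    return total_bytes / records


def unique_bytes_per_record(replication_factor,
                            total_bytes=REPLICATED_BYTES, records=RECORDS):
    """Average bytes per record for a single copy.

    replication_factor is an assumption supplied by the caller;
    the interview does not say how many replicas are kept.
    """
    return replicated_bytes_per_record(total_bytes, records) / replication_factor


if __name__ == "__main__":
    print(replicated_bytes_per_record())   # average across all replicas
    print(unique_bytes_per_record(3))      # hypothetical replication factor
```

Under these assumptions the average comes to roughly 3KB per record across all replicas, which is only an order-of-magnitude estimate given that the data volume fluctuates.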
What about AdTech makes NoSQL databases like ScyllaDB so compelling for your architecture?
Ľuboš: Read latency of ScyllaDB is very good. There is also the good fit with SSDs, CPU and memory allocation, and the ease of node and cluster management. ScyllaDB also seems to scale nicely vertically.
Michal: We have a huge amount of data that needs to be referenced in a very short amount of time. There simply is no way to do it other than a distributed storage system like ScyllaDB. No centralized system can keep up with that. Using the profile data, we can make complex decisions when selecting and customising ads based on the audience.
How is Sizmek setting itself apart from other AdTech platforms?
Ľuboš: Sizmek is a high-performance platform. That means that we deliver on the campaign promise and we deliver with high quality. Sizmek’s AI models combined with real-time adjustments are among the best in targeting for programmatic marketing. Sizmek was named the best innovator in this space by Gartner, and lots of the features that make us different are well described on our blog, so look it up. It’s very interesting reading.
Michal: We think deeply about which ads to show to which person and in what context. Internet advertising tends to have sort of a stigma because users can get annoying ads that follow them around, are displayed in inappropriate places, multiple times, etc. Our dedicated AI team is constantly working on improvements to our machine learning models to ensure that this is not the case and every advertisement is shown at precisely the right time to precisely the right user to maximize the effect our client wants to achieve.
Tell us about the SLAs you have for real time bidding (RTB).
Ľuboš: Right now we have 4 milliseconds maximum, but 1-2 milliseconds is the usual, for the page context caching service. For the proxy cache use case we are on 10 milliseconds as maximum, but generally this is around 3 milliseconds. [Editor’s note: These times are at the database level.]
Michal: For each given bid request, our [end-to-end] response needs to come in 70 milliseconds. Any longer than that and our bid is discarded. We allocate no more than 60% of that time to the actual lookup, which can involve multiple profile lookups if they are part of a cluster, as well as the subsequent transformation of the returned result. All in all, ScyllaDB is left with less than 10 milliseconds at best to complete the actual query.
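[Editor’s note: The budget arithmetic described here can be sketched as below. This is a minimal illustration, not Sizmek’s implementation. It assumes the profile lookups for one bid run one after another, and the function names and sample timings are hypothetical.]

```python
# Latency budget sketch based on the figures quoted above.
BID_DEADLINE_MS = 70        # end-to-end deadline for a bid response
LOOKUP_SHARE_PERCENT = 60   # upper bound on the share spent on the lookup
DB_QUERY_BUDGET_MS = 10     # each database query should finish in less than this


def lookup_budget_ms(deadline_ms=BID_DEADLINE_MS,
                     share_percent=LOOKUP_SHARE_PERCENT):
    """Time available for all profile lookups plus result transformation."""
    return deadline_ms * share_percent / 100


def fits_budget(query_latencies_ms, transform_ms):
    """Check hypothetical timings against the budget.

    Assumes the queries are issued sequentially, so their latencies add up.
    Returns False if any single query reaches the per-query budget, or if
    the queries plus the transformation exceed the overall lookup budget.
    """
    if any(q >= DB_QUERY_BUDGET_MS for q in query_latencies_ms):
        return False
    return sum(query_latencies_ms) + transform_ms <= lookup_budget_ms()


if __name__ == "__main__":
    print(lookup_budget_ms())
    print(fits_budget([3, 4, 5], transform_ms=10))
```

With the quoted numbers the lookup stage gets at most 42 of the 70 milliseconds, which is why a profile that needs several lookups leaves each individual query only a few milliseconds.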
Anything you’re especially looking forward to at this year’s ScyllaDB Summit?
Ľuboš: ScyllaDB 3.0, in-memory ScyllaDB, Spark and ScyllaDB debugging are on my list so far.
Michal: I am very interested to hear about the progress the ScyllaDB team is making towards version 3.0. It is going to be a huge update with several features that we already plan to take advantage of. I am also looking forward to hearing from the other users of ScyllaDB about all the different use cases they are applying the technology to.
Thank you both for this glimpse into your talk! I am sure attendees are going to learn a lot.
This is it! ScyllaDB Summit is coming up next week. The Pre-Summit Training is Monday, November 5, followed by two days of sessions, Tuesday and Wednesday, November 6-7, 2018. Thank you for following our series of ScyllaDB Summit Previews. We’re now busily preparing and look forward to seeing all of you at the show. So if you haven’t registered yet, now’s your chance.