Hi, and welcome to my talk, Sink Your Teeth into Streaming at Any Scale. Today we're going to talk about how Apache Pulsar and ScyllaDB can help you drive streaming analytics applications. My name is David Kjerrumgaard. I'm an Apache Pulsar committer and also the author of Pulsar in Action.
I formerly worked at Splunk, where I was a member of the internal messaging team running Pulsar at scale, at up to 10 petabytes of data per day, and before that I was director of solution architecture at Streamlio. Now I want to introduce my co-presenter, Tim Spann.

Hi, I'm Tim Spann. I'm a developer advocate at StreamNative, and I cover what I call the FLiP stack: Flink, Pulsar, and NiFi working together. I've been working with different streaming and data systems for a number of years at a bunch of different big data companies, but today we're here to talk about Pulsar and a number of other tools, and how they work together to do some pretty amazing things.

Let's look at the agenda real quick so we know where we're going. We'll start off with the different tools that work together, then cover what Pulsar is in case you haven't heard of it, how we join all these different streams of data together with Flink SQL, how to get data as fast as possible into ScyllaDB, and finally a little reference architecture with some descriptions of how we would do all this.

Let's start it off. The stream team is pretty awesome. You really need more than one thing to build a real-time application, and there are a number of different open source tools that work together really well; we'll discuss a couple of them here today. Commonly, if I'm building an application, I'll have Pulsar as my central messaging hub, Flink in there to do some processing, maybe NiFi to get things going, and of course ScyllaDB as my sink. Maybe some extra analytics with Pinot, some extra ETL with Spark, and maybe some other applications on the side in Python, Java, Rust, C++, or Scala; there are lots of different language options out there.

Like I said, at the center of all this, and really important, is Apache Pulsar. It lets us do messaging and event streaming in a cloud-native environment at incredible scale, and it has been doing that for over 10 years. It started off internally at Yahoo, and today it has become a rapidly growing, well-adopted open source project with some unique features, which is why a lot of people are using it all over the globe, and that just keeps getting bigger and bigger. What's nice about the project is that it's open to as many people as want to contribute, tons of people are noticing it, it's getting millions of downloads, and lots of people love it; you'll see why.

One of the critical features that makes it that central messaging hub is that you don't get stuck with one way of working with your data. We let you get your data into and out of Pulsar via a lot of different protocols and mechanisms. You can use the Pulsar protocol, which makes sense if you're building new apps, but you may have Kafka apps, you may have RabbitMQ apps, you might have MQTT around IoT devices; you can use all of those protocols to get your data into different Pulsar topics. How you grab it out for your particular application determines whether it's a messaging app or a streaming app, and you can add as many of these as you want, which makes it pretty powerful.

To me, being able to do Key_Shared subscriptions, streaming very easily based on a key, is great for CDC. And in case you want one consumer grabbing every message in order, set up a Failover subscription: if that consumer stops working, it automatically fails over to a backup consumer right where you were, so you never lose data and you never slow down.
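As a rough sketch of what those subscription modes look like in code, here is a small example with the Python pulsar-client package. The broker address and the orders topic are assumptions for illustration; the consumer_type argument is where you choose Key_Shared, Failover, Shared, or Exclusive.

```python
import pulsar

# Connect over the native Pulsar protocol (a local broker is assumed here).
client = pulsar.Client('pulsar://localhost:6650')

# Key_Shared keeps messages with the same key in order for whichever consumer owns that key.
consumer = client.subscribe(
    'persistent://public/default/orders',            # hypothetical topic
    subscription_name='order-processors',
    consumer_type=pulsar.ConsumerType.KeyShared)     # or Failover, Shared, Exclusive

producer = client.create_producer('persistent://public/default/orders')
producer.send(b'{"order_id": 1, "status": "new"}', partition_key='customer-42')

msg = consumer.receive()
print(msg.partition_key(), msg.data())
consumer.acknowledge(msg)    # per-message acknowledgment is what drives Pulsar's retention
client.close()
```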
If you want someone grabbing all the messages as fast as possible and you don't care about the order, you can do that too. Exclusive keeps other people from looking at your data; let them get their own subscription. You know, it's like Netflix: don't let 12 people use your same account. Shared means I want a work queue: get me every message as fast as possible, and maybe I'll spin up 10,000 consumers, as many as I can run in my environment.

What's also cool with Pulsar, and why we can make it a central hub, is that you can have multiple tenants: lots of different companies, users, and apps all on the same cluster. You set up a tenant for each one, and under each tenant you can have as many namespaces as you like, so you can keep your applications discretely grouped together. That's great for security, great for moving to different environments; it's just a great modular idea. And then at the bottom, drop in a few hundred thousand topics, or a million topics, no problem.

Now, to make sure your data isn't just a bunch of junk floating around, we make sure you've got a schema. We give you a schema registry without your having to run anything else or figure anything out. I take my app in Python, Java, Go, whatever you have, and I define the schema in a very simple manner: a couple of fields and a type. I push that into a topic for the first time and, boom, I have a versioned schema. Whoever consumes from the topic can get that schema, and you can access it with REST or your DevOps tooling. It's a really easy way to version your data and to have a contract, so you know what this data is and what it's doing.

Now, if we did just that I'd be pretty happy, but there's a lot more in there that makes scaling and handling massive load pretty amazing. Automatic load balancing is possible in this system, so you don't overload a broker. That's a common way to get into trouble in other systems: one broker gets too many messages, too much traffic coming through. Here, consensus moves that data around. It's an important feature that Pulsar supports, it's just the way the architecture is, and it comes in really handy, so you don't have one broker slowing down your whole system, or crashing because it's getting too much load while other brokers, maybe new ones you just spun up, sit idle. In other systems you might have to rebalance manually, or it might not even be possible; you might have to move a whole workload from scratch or bring the system down. You don't have to do that here, and that's one of the key differentiators we'll point out.

Then there are things like acknowledgment-based retention of messages, which is more important than that huge phrase sounds. If I can't acknowledge a particular message, maybe because something is wrong with a downstream system, I can skip it and come back to it later. In other systems you have to go in order or fully stop; here, having that acknowledgment, that agreement that yes, this one is okay, I've processed it, is really important.

Tiered storage means I don't have to keep every message I might need in local, expensive storage. Send it out to S3: it's still fully available, you can still consume it, still use it in queries, but it's a lot cheaper, and you don't even have to think about it. That's great once you start getting into petabytes of data; who can afford to keep that on SSDs?

There are also different types of queuing: I can do round robin, I can use keys, and I can handle dead letters, which matters when a message has nowhere to go. We're not going to lose it, and I don't have to implement something on my own; it's built into the platform.
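Here is a minimal sketch of that schema contract with the Python client; the tenant, namespace, topic, and fields are hypothetical. The first publish with this class registers a versioned schema that consumers and the REST admin API can then retrieve.

```python
import pulsar
from pulsar.schema import Record, String, Float, JsonSchema

# A couple of fields and a type: that's the whole schema definition.
class SensorReading(Record):
    device_id = String()
    temperature = Float()

client = pulsar.Client('pulsar://localhost:6650')        # assumed local broker
producer = client.create_producer(
    'persistent://iot-tenant/sensors/readings',           # hypothetical tenant/namespace/topic
    schema=JsonSchema(SensorReading))

# The broker validates this message against the registered schema version.
producer.send(SensorReading(device_id='thermostat-7', temperature=21.5))
client.close()
```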
It's very easy to elastically scale this up, because the brokers that handle messaging and communications are separate from storage. Obviously you'll have more consumers and producers of data than storage for it, and the nodes that are just doing communications are a lot lighter; they don't need a giant disk on them. Like I mentioned before, millions of topics, which is really helpful, that multi-tenancy, and geo-replication for when I need to scale out, whether that's multiple data centers, multiple clouds, different availability zones; it's all very easy. And there's encryption everywhere if you need it, which is a pretty good idea.

So we've got a central hub, but how do I join it with other systems? How do I write applications against it? How do I do the fun stuff out there besides just distributing messages? There is a full Flink connector, supported by the Apache Flink community, obviously with help from Pulsar, that lets you easily read and write data from Pulsar without any extra difficulty. It makes anything you put in a Pulsar topic look like a table to Flink SQL, so it's very easy to join these real-time data sources as events come in. It couldn't be easier, whether you're getting data from Flink into Pulsar to do whatever you're going to do next, or the other way around, and obviously there are multiple places where I can connect this to those other parts of our team: things like Scylla, things like Pinot, whatever you have out there.

Flink itself is a really powerful project; you're going to be hearing more about it all the time, and it's being adopted everywhere. What's cool about Flink, and why it works really well with Pulsar, is that it's a unified computing engine, so I can do batch or stream, whatever makes sense for your use case, on the same very scalable system. We can have it store state if we need to, for things like aggregates, or computing over these massive sets of data coming in. Flink SQL is very full featured: you can do inserts, you can do joins, you can do updates, upserts, deletes, lots of cool stuff. Use it for continuous SQL, continuous analytics, so as events come in I can keep running; I never have to stop. It's a pretty powerful system.

Streaming tables make for very interesting applications. Your data, again, has to be pretty structured, because you have to set up tables, which is why we have those schemas, and the table just never ends: as events come in it gets larger and larger, and you can process events as they arrive or as a unit as they keep growing. What's cool is that rows don't change; you get new events, but the old ones are not updated. If you're doing an upsert it works a little differently, depending on which Flink connectors you have. And what's also cool is that once you make that connection, the whole namespace is mapped for you into the catalog, so you don't have to write each table by hand or create your own DDL. You certainly could, but most of the time we'll just let the catalog do it for us.
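To illustrate what that looks like in practice, here is a hedged PyFlink sketch of a Pulsar topic declared as a table and joined continuously with a second stream. The topic names are made up, and the connector option names ('service-url', 'topics', 'format') vary between versions of the Flink and Pulsar connector, so treat the WITH clauses as indicative rather than exact.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A Pulsar topic exposed as a Flink SQL table (option names depend on the connector version).
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        device_id   STRING,
        temperature DOUBLE,
        event_time  TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector'   = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'topics'      = 'persistent://public/default/sensor-readings',
        'format'      = 'json'
    )
""")

# A second topic carrying reference data about each device.
t_env.execute_sql("""
    CREATE TABLE device_metadata (
        device_id STRING,
        plant     STRING
    ) WITH (
        'connector'   = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'topics'      = 'persistent://public/default/device-metadata',
        'format'      = 'json'
    )
""")

# Continuous SQL: the join keeps emitting enriched rows as new events arrive on either topic.
t_env.execute_sql("""
    SELECT r.device_id, m.plant, r.temperature, r.event_time
    FROM sensor_readings AS r
    JOIN device_metadata AS m ON r.device_id = m.device_id
""").print()
```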
Now I'm going to hand this over to David so he can show you some exciting stuff that's been updated, and how we get data from Pulsar topics into ScyllaDB tables. Take it away, David.

All right, great job, Tim, thank you so much. So yeah, as Tim mentioned, we have a lot of different integrations with Apache Pulsar. Apache Pulsar is a central messaging bus; think of it as a data repository for your streaming, time-based data, for all the individual events themselves. Making use of this information requires integrating with a lot of systems, the stream team Tim mentioned before, and we store all these different connectors, shown here on the left, in what we call StreamNative Hub, at hub.streamnative.io. We have all these different processing engines and connectors there, as Tim mentioned; we highlighted Flink SQL, and some of this exists in Spark as well. But what we're going to talk about today are the connectors that sit sort of in the middle, where we allow data to go into and out of Apache Pulsar very easily. With these drop-in connectors you make a few configuration changes, press a few buttons, and boom, you're either reading data from these sources and bringing it into Pulsar, or you're taking data from Pulsar and writing it directly into these external systems.

Specifically, I've been working on the ScyllaDB sink connector, making some additional changes to it so that you can do the processing Tim talked about and stream real-time analytics data. Again, let's say IoT data is coming in: you can do your analytics on it, pre-compute queries, things like machine learning feature calculations, compute all this information ahead of time and then write it directly into Scylla, very easily and transparently. If you want to go to the next slide, Tim, we can talk about that.

So this drills down into the details. At that link, please go look at the quick starts for how to take data from Pulsar directly into a ScyllaDB database. It takes just a few minor configurations: again, as I mentioned before, your hostnames, your username and password, and the table you want to write the information into. You provide the bare minimum, and the data will automatically stream from whatever topics you specify and go directly into your ScyllaDB database tables. Next slide, please.

So there we have made some additions. We've done some presentations in the past and found some gaps in our ScyllaDB connector, and as Scylla emerges and evolves we've been trying to keep up with them and add new features. So we've added some additional capabilities to the ScyllaDB connector; you can follow them on the pull request in the Apache Pulsar project. We've added some interesting capabilities where we can dynamically interrogate the database schema to determine what the type definition is. Before, you had to do a static schema definition: I want this table, the table looks like this, and all the fields in the Pulsar schema that Tim mentioned earlier would have to map one to one. Well, we've made that a little looser and more intelligent, so that as your schemas evolve on either side, let's say you change your data profile in Pulsar, it doesn't break your dumping of the data into ScyllaDB, and vice versa: if you add tables or columns to your ScyllaDB, we don't want your connector to break just because there isn't an exact one-to-one mapping. So we made it more intelligent, and we support a lot of those different capabilities in there.

We've also added the ability to extract values from what we call generic schema types. Pulsar has strongly typed schemas; you can define them in Avro, JSON, and these other strongly typed formats. We also support what's called a generic record.
A generic record has a map-like, key/value interface, so if you want to dynamically put in just a bunch of keys and values, as long as the keys match the field names, this connector is smart enough to say, okay, that value goes in this particular table in this field, and that one goes in that field, and we do automatic type conversion for you as well. We've also added support for raw JSON strings: we'll parse the JSON for you automatically, so again, you can keep that decoupling between the schemas on the two sides. And then we've obviously added some performance improvements and bug fixes, some different schema capabilities, and some security capabilities. On the roadmap we're going to add some intelligent routing and sharding; I know that Scylla has the ability to route data to the proper nodes, and that's on our roadmap for the next release. So there are lots of interesting things coming down the pipeline. Next slide, please.
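To give a feel for how little configuration the sink needs, here is a rough, hypothetical sketch. The key names, hosts, keyspace, and table below are illustrative only, since the exact fields depend on the connector version you deploy, but the shape is a handful of connection settings plus the target table.

```python
import json

# Hypothetical ScyllaDB sink settings; the real key names come from the connector's docs.
sink_config = {
    "contactPoints": "scylla-node1:9042,scylla-node2:9042",
    "keyspace": "iot",
    "table": "sensor_features",
    "username": "scylla_user",
    "password": "********",
}

# Write it out so it can be handed to pulsar-admin, roughly along the lines of:
#   pulsar-admin sinks create --name scylla-sink \
#       --inputs persistent://public/default/sensor-features \
#       --sink-config-file scylla-sink.json --archive <path-to-connector-nar>
with open("scylla-sink.json", "w") as f:
    json.dump(sink_config, f, indent=2)
```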
And so, yeah, we think that by adding these capabilities, ScyllaDB and StreamNative, which is a provider of Apache Pulsar, make a great system for building real-time analytics applications. It has all the different capabilities you need: a very fast distributed database, a very fast stream storage system, and integration with all the different stream processing engines, so that you can service different use cases. Again, that's real-time analytics, real-time machine learning, any sort of high-speed data streaming pipelines you need for log analytics, IoT use cases, or calculating feature sets for machine learning models as the data comes in. All of that is now made very easy with these two technologies merged together. Next slide, please.

And just one last thing: a reference architecture for how we envision putting all these pieces together, again as best practices. Tim started off the talk with the different tools you want in your tool set, and this is sort of how we map them out together. So you can have data going into ScyllaDB directly, or coming in from Flink processing. These microservices, these MQTT protocols, these sensor devices get fed into Pulsar; they're the core of the sources at the bottom. You can then use Flink SQL to do some analytics on that information, for example calculating running averages over time windows, and then publish that back into an Apache Pulsar topic. From that particular topic you feed the data into ScyllaDB directly, again using the sink connector we talked about in the previous slides. So you have a continuous sort of ETL pipeline doing extraction, transformation, loading, and analytics on that data and making it available, directly from your IoT sensor devices to consumable ScyllaDB tables, almost as fast as Pulsar and Scylla can do it. You can also do some additional analytics with visualization tools, things like Apache Pinot or Superset; those other query tools can either go directly against Pulsar or, again, be pointed at other data sources as well. So this is sort of the data-in, data-out reference architecture we envision for Pulsar and ScyllaDB working together. Next slide, please.

Last but not least, and we've talked about this on the previous slide: the right tool for the right job. You bring a lot of tools to a real-time pipeline development exercise. The Flink SQL we mentioned here is for continuous analytics, for ingest, for real-time joins, joining the different data sets together. Let's say you have IoT sensor data: you may want to join it to some static data to say this is in my plant A or this is plant B, this is this sensor type or that sensor type, continuously enriching the data. You can use things like Flink for that. Once you have that information, you publish it to Apache Pulsar, where you can do some additional routing or transformation on it and store the data there, so it's backed up. And then, once you attach a sink, you can route those pre-calculated values into ScyllaDB so you can have instant, real-time lookups.
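Here is a rough sketch of that middle step: a running average over a one-minute tumbling window computed in Flink SQL and published back to a second Pulsar topic, which the ScyllaDB sink would then drain into a table. Topic names, column names, and connector options are assumptions carried over from the earlier sketches rather than anything from the talk.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source table, declared the same way as in the earlier join sketch.
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        device_id   STRING,
        temperature DOUBLE,
        event_time  TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector'   = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'topics'      = 'persistent://public/default/sensor-readings',
        'format'      = 'json'
    )
""")

# Output table backed by another Pulsar topic; the ScyllaDB sink would be attached to this topic.
t_env.execute_sql("""
    CREATE TABLE sensor_features (
        device_id       STRING,
        window_start    TIMESTAMP(3),
        avg_temperature DOUBLE
    ) WITH (
        'connector'   = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'topics'      = 'persistent://public/default/sensor-features',
        'format'      = 'json'
    )
""")

# Continuous ETL: one-minute averages per device, written back to Pulsar as each window closes.
t_env.execute_sql("""
    INSERT INTO sensor_features
    SELECT
        device_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        AVG(temperature) AS avg_temperature
    FROM sensor_readings
    GROUP BY device_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```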
These applications can, again, do things like feature calculations for a machine learning model: I go look up the key for a particular sensor type and get the last readings for the last hour, two hours, three hours, all these different averages. You can store massive amounts of data very quickly and access it in near real time. And then there's also Apache Pinot for low-latency, instant query results, if you want to attach that to these other systems as well.
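On the reading side, a lookup like the one just described might look roughly like this with the Python Cassandra/Scylla driver. The keyspace, table, and columns continue the hypothetical example above, assuming device_id is the partition key and window_start is a clustering column.

```python
from datetime import datetime, timedelta
from cassandra.cluster import Cluster   # the scylla-driver fork exposes the same API

cluster = Cluster(["scylla-node1"])     # hypothetical contact point
session = cluster.connect("iot")

# Pull the pre-computed averages for one sensor over the last hour.
rows = session.execute(
    """
    SELECT device_id, window_start, avg_temperature
    FROM sensor_features
    WHERE device_id = %s AND window_start > %s
    """,
    ("thermostat-7", datetime.utcnow() - timedelta(hours=1)))

for row in rows:
    print(row.device_id, row.window_start, row.avg_temperature)

cluster.shutdown()
```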
And thank you so much for attending our talk. Please stay in touch; this is both my contact information here as well as Tim's. Scan the QR code for more information and free downloads, and reach out to us on Twitter, email, our GitHub accounts, or LinkedIn, all different ways to get a hold of us. Hopefully you found this educational, and I'll let Tim give any parting thoughts as well.

Yeah, definitely stay in touch. There are a lot of different things you can do with these technologies, so reach out if you have questions about some innovative things you want to build, or if you'd like us to help solve some hard application problems.

[Applause]