Interested in getting data from DataPower to Kafka? Want to do more with your DataPower and MQ? In this blog post we will present an interesting architecture that combines proven IBM technologies with Open Source big data technologies.
Recently CROZ and APIS IT joined forces to create a fast, scalable and fault-tolerant system that would transfer messages from DataPower to Kafka and process them in Spark. Once we have messages in Kafka, they are available to Spark and many other technologies in the Hadoop ecosystem.
So basically, what was our goal? We wanted to use Spark Structured Streaming to calculate aggregated data in real time. That’s great! We love Spark since it is a fast and scalable in-memory engine that never lets us down. But it needs data! One of our favorite technologies to use as a data source for Spark streaming is Apache Kafka. Kafka is a scalable and fault-tolerant messaging system, very popular lately because it can handle trillions of events a day. Let’s do a quick overview of the product.
Kafka is generally considered a distributed streaming platform. It is often described as a publish-subscribe system, but it is actually a lot more than that. Like a message queue or an enterprise messaging system, it provides publish-subscribe functionality. It stores messages in topics that consist of partitions. When a message is written to a Kafka topic, it goes into a partition that we can specify explicitly, or we can let Kafka decide where the message will be stored. It is important to point out that partitions are the reason why Kafka scales so easily. We can add new partitions to a Kafka topic and scale our system.
Kafka is special because it stores streams of records in a fault-tolerant, durable way. Every message can be replicated to multiple Kafka servers and retained for a configurable period of time. With message retention, we can consume messages from the past whenever we want, as long as they are still within the retention period.
In the end, Kafka is a streaming platform that allows us to act on messages in near real time, with many data stores as sources and sinks.
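To make the partition idea a bit more concrete, here is a minimal sketch in plain Python. It is not Kafka client code: the function name `choose_partition`, the sample keys and the use of CRC32 are all our own assumptions for illustration (Kafka's own partitioner uses a different hash). It only shows the two cases described above: a partition given explicitly, or one derived from the message key.

```python
import zlib


def choose_partition(key, num_partitions, explicit_partition=None):
    """Pick a partition for a message (illustrative, not Kafka's algorithm)."""
    if explicit_partition is not None:
        # The caller named the partition explicitly.
        return explicit_partition
    # Otherwise derive the partition from the key, so that messages with
    # the same key always land in the same partition.
    # CRC32 is a stand-in hash chosen for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# A toy "topic" with three partitions, each an append-only list.
topic = {p: [] for p in range(3)}
for key, value in [("acct-1", 10), ("acct-2", 5), ("acct-1", 7)]:
    topic[choose_partition(key, 3)].append((key, value))
```

Because the partition depends only on the key, both `acct-1` messages end up in the same partition and keep their order there, while different keys can be spread across partitions and processed in parallel.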
Let’s review our status:
We will use Spark Structured Streaming
- we know how to use it and we know it’s great
We will use Kafka as source for Spark Structured Streaming
- we know how to use it and we know it’s great
We have data coming to our DataPower and we want to bring that data to Kafka
- we know how to use DataPower and we know it’s great
- we don’t know how to bring data from DataPower to Kafka!
That’s where the fun part began. We have a lot of experience with DataPower and many related technologies, but we had never used it to send data to Kafka. The most common solution for connecting Kafka with other data stores is the Kafka Connect API. We did some research and found that there is a “Kafka Connect source connector for IBM MQ” available on GitHub. The connector is written by IBM and released under the Apache License. It is also possible to use it as part of Confluent Platform.
Since we have extensive experience with IBM MQ, we were immediately intrigued and decided to investigate MQ -> Kafka connectivity further. First we set up a test MQ and Kafka. Then we configured a couple of parameters in the Kafka IBM MQ Connector and we were good to go. It was surprising how easy it was to get things going. After writing a few messages to MQ, we saw them being replicated to Kafka just as we expected. That brought us to the final architecture shown in the following figure.
Let’s describe what is actually going on in our architecture: what data we are using, and where and how it flows.
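For readers who want a feel for those "couple of parameters", here is a rough sketch of a connector definition, built as the JSON body that the Kafka Connect REST API accepts when creating a connector. The property names follow the connector's README as we recall it, and the class names have changed between connector versions, so check them against the version you install. Every value (queue manager, host, channel, queue, topic and connector name) is a made-up placeholder, not our actual configuration.

```python
import json

# Hypothetical connector definition; all values are placeholders.
connector = {
    "name": "mq-source",  # assumed connector name
    "config": {
        # Class names differ between connector versions; verify before use.
        "connector.class": "com.ibm.eventstreams.connect.mqsource.MQSourceConnector",
        "tasks.max": "1",
        "mq.queue.manager": "QM1",                  # assumed queue manager
        "mq.connection.name.list": "mq-host(1414)",  # assumed host(port)
        "mq.channel.name": "DEV.APP.SVRCONN",        # assumed channel
        "mq.queue": "DP.TO.KAFKA",                   # assumed source queue
        "mq.record.builder": "com.ibm.eventstreams.connect.mqsource.builders.DefaultRecordBuilder",
        "topic": "transactions",                     # assumed target Kafka topic
    },
}

# Serialize to the JSON body that would be sent to Kafka Connect.
payload = json.dumps(connector, indent=2)
print(payload)
```

The sketch only builds and prints the payload. Sending it to a running Kafka Connect worker, or putting the same keys into a properties file for standalone mode, is left out on purpose.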
Transactions coming into DataPower from source systems are translated into JSON messages that are written to MQ. The DataPower to MQ connection works extremely fast and reliably, and that’s why we love it. 🙂
The Kafka IBM MQ Connector writes JSON messages from MQ to Kafka with extremely low latency. It writes messages to Kafka by leveraging Kafka’s partition model, which provides high message throughput.
Spark Structured Streaming reads messages from Kafka and calculates aggregated data in real time. We wanted to calculate the sum of values in daily transactions, grouped by a certain key. Spark Structured Streaming provides a simple and efficient way to keep only today’s messages and calculate the aggregations really fast. It is also important to point out that latency between Kafka and Spark is minimal, just like on all the other connections in our architecture. Another reason why Spark and Kafka are a great combination is that Kafka partitions are mapped to Spark partitions, so we get the most parallelism out of every part of the architecture.
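The aggregation itself is simple, and it can be sketched in plain Python without any Spark API. The snippet below is not our Spark job; it only illustrates the logic of keeping today’s messages and summing values per key. The field names `key`, `value` and `timestamp`, the sample data and the helper `update_daily_sums` are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, datetime


def update_daily_sums(sums, message, today):
    """Add one message to the running per-key sums if it belongs to today."""
    ts = datetime.fromisoformat(message["timestamp"])
    if ts.date() != today:
        # Messages from other days are ignored.
        return sums
    sums[message["key"]] += message["value"]
    return sums


today = date(2018, 5, 14)  # fixed date so the example is reproducible
messages = [
    {"key": "branch-a", "value": 100.0, "timestamp": "2018-05-14T09:15:00"},
    {"key": "branch-b", "value": 40.0, "timestamp": "2018-05-14T09:16:00"},
    {"key": "branch-a", "value": 25.5, "timestamp": "2018-05-14T10:00:00"},
    {"key": "branch-a", "value": 999.0, "timestamp": "2018-05-13T23:59:00"},
]

sums = defaultdict(float)
for message in messages:
    update_daily_sums(sums, message, today)

print(dict(sums))
```

In a streaming engine the same running state is kept per key and updated as each micro-batch arrives, and the partitioning lets different keys be aggregated in parallel.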
Finally, the results of the Spark computation are saved into both Cassandra and Kafka. Having had a really positive experience with Cassandra, we made it our weapon of choice. It is a really fast peer-to-peer NoSQL database and it comes with one interesting feature that is a perfect fit for our use case: when you insert a row whose primary key already exists in the database, Cassandra updates that row instead. In some databases this is known as an upsert operation.
Cassandra is later used for ad-hoc queries and for the initial data load of a web application developed to track changes in real time. After the initial load of data from Cassandra, the web application continues by consuming new messages from Kafka, so it no longer uses Cassandra resources and reads data in a reactive fashion.
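Here is a tiny sketch of the upsert behavior, using a Python dictionary as a stand-in for a table. This is not Cassandra driver code; the `upsert` helper, the composite key of day and branch, and the `total` column are hypothetical names used only to show the semantics.

```python
def upsert(table, primary_key, columns):
    """Insert a row, or update the columns of an existing row with that key."""
    row = table.setdefault(primary_key, {})
    row.update(columns)
    return table


table = {}
# The first write creates the row.
upsert(table, ("2018-05-14", "branch-a"), {"total": 100.0})
# A second write with the same primary key overwrites the value in place.
upsert(table, ("2018-05-14", "branch-a"), {"total": 125.5})

print(table)
```

This is why the feature fits a running aggregation so well: every new result for the same key simply replaces the previous one, and no separate "does this row exist?" check is needed.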
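The "load once, then follow the stream" pattern can be sketched like this. It is only an illustration in plain Python: `snapshot_rows` stands in for the result of the initial Cassandra query, `live_updates` stands in for messages consumed from Kafka, and the field names are assumptions.

```python
def build_view(snapshot_rows, live_updates):
    """Start from a snapshot, then apply streamed updates on top of it."""
    # Initial load: one entry per key from the snapshot.
    view = {row["key"]: row["total"] for row in snapshot_rows}
    # After that, only the stream is read; each update replaces the old value.
    for update in live_updates:
        view[update["key"]] = update["total"]
    return view


snapshot_rows = [
    {"key": "branch-a", "total": 125.5},
    {"key": "branch-b", "total": 40.0},
]
live_updates = [
    {"key": "branch-a", "total": 150.0},
    {"key": "branch-c", "total": 12.0},
]

view = build_view(snapshot_rows, live_updates)
print(view)
```

In the real application the loop would not end, since new messages keep arriving, but the division of work is the same: the database is queried once and the stream carries everything after that.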
To conclude, every part of the architecture is scalable, and latency between the parts is very low, practically negligible. Our PoC phase is completed, and what is left to do now is serious performance testing on inbound DataPower traffic.
We hope you liked our post. Feel free to contact us if you are interested in any part of the architecture, from DataPower to Cassandra.
We will continue to share more interesting Big Data stories. 🙂