Apache Flink and Confluent: The Use Cases and Benefits of Integration with Confluent’s Data Streaming Platform
Author: Sam Ward
Release Date: 19/03/2024
What is Apache Flink?
Apache Flink is a distributed stream and batch processing framework for stateful processing. It comes with a layered set of APIs, from high-level SQL down to low-level process functions, that power stream processing platforms at a wide range of companies. This technical blog covers the basics of event streaming with Apache Kafka and builds on those concepts to explain how Confluent’s new Flink service adapts and improves its current system to provide the next level of stream processing.
Apache Kafka: An Event-Streaming Platform
Apache Kafka is a distributed event-streaming platform, initially released in 2011, that is used by some of the biggest companies in the world. But what does event streaming mean, and what is an event?
An event stream is a constant flow of data, where each event is a change of state in the system. Events travel from publishers to subscribers (in Kafka, these are the producers and the consumers). Some examples of events are the constantly changing coordinates of a car in a city, the confirmation of a payment, and the value of a stock; there is no limit. Related events are stored together in a topic, which is a category of events; the previous examples could be stored in topics called ‘car_location,’ ‘transactions,’ and ‘stock_values.’
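To make this concrete, here is a minimal Java sketch of a producer publishing a single car-location event to the ‘car_location’ topic above. The broker address and the JSON payload are assumptions for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CarLocationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is one event: the key identifies the car,
            // the value is its new position (hypothetical payload).
            producer.send(new ProducerRecord<>(
                    "car_location", "car-42", "{\"lat\": 51.5072, \"lon\": -0.1276}"));
        }
    }
}
```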
These topics are then broken down into smaller, more manageable chunks called partitions, which are spread and replicated across the Kafka cluster, a collection of servers called brokers. The main functions of a broker are to receive, store and send messages to consumers. Partitions are replicated across brokers to ensure high fault tolerance and high availability, and are further broken down into segments, which are the physical files in which the data is stored. Data is structured as immutable logs: append-only lists of elements.
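The sketch below shows how these pieces fit together, using Kafka’s AdminClient to create the ‘car_location’ topic. The partition count of six and replication factor of three are assumptions for illustration; the replication factor means each partition is copied to three brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions spread the topic across brokers; replication
            // factor 3 keeps a copy of each partition on three brokers
            // for fault tolerance and availability.
            NewTopic topic = new NewTopic("car_location", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```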
Confluent provides its users with a wide range of additional features, utilising Apache Kafka, Kafka Streams and a custom streaming SQL engine called ksqlDB to build an efficient, real-time data streaming platform. Paired with the vast number of custom-written connectors in the Confluent Hub, this makes Confluent a strong choice for any business, new or old, to view and manage its data. They say it best, defining their service as providing “enterprise-grade capabilities [needed to run] mission-critical use cases [which allows its users to] operationalise and scale all your data streaming projects so you never lose focus on your core business.”
Methods of Enriching Data
Kafka Streams and Apache Flink are two methods of performing stream processing tasks. Confluent currently uses ksqlDB, running on the Kafka Streams engine, to modify data being produced to the Kafka cluster and, by extension, the data consumed by consumers.
There are two main types of data transformation: stateful transformations (which rely on previous, persistent data in a stream that influences how the next event(s) are processed) and stateless transformations (which process each event independently of those before it). Some stateful transformation examples are reduce, aggregate and count, since each must remember earlier events. Some stateless transformation examples are map, filter, flatMap, branch and groupBy.
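To make the distinction concrete, here is a minimal Kafka Streams sketch in Java that combines both kinds: a stateless filter followed by a stateful count. The topic names, the ‘confirmed’ payload check and the application ID are hypothetical, and the broker address is assumed:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TransactionCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transaction-counts"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions = builder.stream("transactions");

        // Stateless: filter inspects each event in isolation.
        KStream<String, String> confirmed =
                transactions.filter((accountId, payload) -> payload.contains("\"status\":\"confirmed\""));

        // Stateful: count must remember previous events per key in a state store.
        KTable<String, Long> countsPerAccount = confirmed.groupByKey().count();

        countsPerAccount.toStream()
                .to("transaction_counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```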
Apache Flink: Stream Processing Evolved
There are a few key limitations to the current methods of stream processing being used, namely metadata declaration, checkpointing, and the limited complexity of statements. This is where Apache Flink and Confluent step in, offering a unified stream and batch processing framework where, with Apache Kafka as the storage layer, users can leverage four APIs at different levels of abstraction to filter, enrich and join data. Flink is also fully integrated with Confluent’s tooling for security, governance and observability.
Apache Flink comes with four different APIs, each suited to different use cases and degrees of control. Flink also supports a range of programming languages, including Scala, Python, SQL and Java. In decreasing levels of abstraction, these APIs are Flink SQL, the Table API, the DataStream API and ProcessFunction. With these APIs, Flink unifies stream processing and batch processing, meaning you can select the execution mode most appropriate for the data you want to process: batch mode for bounded data streams (like tables), and streaming mode for unbounded data streams (such as a constantly changing stock price). This means that you can mix real-time and historical data processing in the same application, and the same semantics, logic and code can be reused when switching between the two.
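The DataStream API sketch below illustrates this unification: the same pipeline runs in streaming or batch execution mode by changing a single setting. The hard-coded prices stand in for a real source and are purely illustrative:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING processes unbounded input as it arrives; switching this one
        // setting to RuntimeExecutionMode.BATCH runs the same pipeline over
        // bounded input, so the transformation code below is reused unchanged.
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

        env.fromElements(101.2, 99.8, 102.5) // stand-in for a live stock-price source
           .filter(price -> price > 100.0)
           .print();

        env.execute("runtime-mode-demo");
    }
}
```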
While Flink alone is useful, it can be relatively difficult to set up and manage. Confluent has built a fully managed Flink service with an assortment of additional features that make Flink more powerful and accessible than ever before:
- Serverless: Self-managed Flink clusters are difficult to set up and operate, so Confluent has made its Flink service cloud-native and serverless to ensure efficiency and speed
- Metadata Importation: Confluent removes the need to recreate tables – if your Kafka cluster already contains topics and schemas in the Schema Registry, they are immediately available to browse and query in Flink SQL (see the sketch after this list)
- Independent Scalability: With Kafka as the storage layer and Flink as the computational layer, the two layers are separated and scale independently of each other
- Evergreen Runtime: The Flink runtime is not versioned from the user’s perspective; as part of the fully managed service, updates and upgrades are applied automatically, so you never have to perform them yourself
- Autoscaling Workloads: On Confluent Cloud, workloads scale automatically and require no user intervention at all – the relevant metrics can be viewed in the Confluent Cloud Console or integrated with existing observability platforms
- Usage-Based Billing: Compute resources are allocated automatically in the Flink layer and deallocated once you stop using them, so you only pay for what you use. Relatedly, Flink-to-cluster communication is kept region-specific, meaning you cannot query across regions; this is by design, to avoid expensive cross-region data transfer charges
- Built-In Security Model: Confluent’s Flink service reuses the same security systems already in place for Kafka. RBAC (role-based access control) is also available with Flink and is easily defined in the Confluent Cloud Console
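As a rough illustration of the metadata importation point, the sketch below issues a Flink SQL query from Java’s TableEnvironment against a ‘transactions’ table. On Confluent Cloud such a table would already exist, because topics and Schema Registry schemas are imported automatically; in open-source Flink you would first have to register it with a CREATE TABLE statement. The column names here are hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class QueryExistingTopic {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // We assume a 'transactions' table is already visible to the session,
        // as it would be on Confluent Cloud via metadata importation, so no
        // CREATE TABLE statement is needed before querying it.
        tableEnv.executeSql(
                "SELECT account_id, amount " +   // hypothetical column names
                "FROM transactions " +
                "WHERE amount > 1000"
        ).print();
    }
}
```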
Confluent’s service also introduces a new concept called “Compute Pools,” which underpin both usage-based billing and autoscaling workloads. Compute Pools automatically expand and shrink as more or fewer resources are required, ensuring that costs stay low and resources are only used when necessary.
Use Cases and Potential Customers
Confluent Cloud for Apache Flink® has an incredibly wide range of potential customers and use cases, thanks to the breadth of features and additional services that Confluent ships with Flink. Not only will Confluent provide its users with Flink, but it will also maintain support for ksqlDB, which will continue to run on the Kafka Streams engine. By not replacing the previous stream processing system, users can migrate to Flink at their own pace, which makes the platform all the more compelling for managing data.
Some examples of data pipelines that would benefit from using Flink are as follows:
- A stock market application that tracks constantly changing prices (sketched below)
- A shopping market catalogue of all items and prices, as well as transactions and orders
- A transport application that uses customer data and driver locations to inform current prices
- A banking application that keeps a record of all transactions and current account values
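The stock market case, for instance, maps naturally onto a keyed, windowed aggregation. The sketch below reports the highest price per ticker over one-minute windows; the hard-coded elements stand in for a Kafka source reading the ‘stock_values’ topic, and the ticker symbols are made up:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class StockPriceHighs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for a Kafka source reading the 'stock_values' topic:
        // (ticker, price) pairs arriving continuously.
        env.fromElements(
                Tuple2.of("ACME", 101.2),
                Tuple2.of("ACME", 103.7),
                Tuple2.of("GLOBEX", 55.4))
           .keyBy(t -> t.f0)                                          // one stream per ticker
           .window(TumblingProcessingTimeWindows.of(Time.minutes(1))) // one-minute windows
           .max(1)                                                    // highest price per window
           .print();

        env.execute("stock-price-highs");
    }
}
```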
Any customer currently using Apache Kafka knows that keeping a cluster healthy, managing its resources and ensuring near-perfect uptime can be difficult. Confluent Cloud removes the elements that make managing a cluster difficult, letting users dedicate their time to getting the most out of their data instead of managing resources. Additionally, Flink SQL complies with the ANSI SQL standard, lowering the barrier to entry for even more companies to join the platform.
Due to the nature of Flink and Kafka, you can integrate your current systems into Confluent with relative ease. Whether you need a better system to manage an endless stream of data or a finite set of values, Confluent provides a solution, and its relationship with Flink is only just beginning.