Confluent (Pt.3) - Stream Processing TAM Expansion Potential
Related technologies and companies: Databricks, Snowflake, RisingWave, Alibaba

Summary

  • Stream processing is going to be a massive TAM expander for Confluent.
  • Tiered storage for Kafka with built-in stream processing via Flink is a powerful combination that can take a significant share of the real-time applications and analytics market.
  • If executed well, Confluent can be the dominant player in the streaming lakehouse space. Furthermore, it is well positioned to address HTAP and data application use cases too.
  • Similar to Kafka, Flink carries technical debt, though we're confident Confluent can replicate the success it achieved with Kafka and eventually deliver Flink-as-a-Service.
  • Centralizing storage on Iceberg is a smart move by Confluent. Tiered storage powered by Iceberg and Confluent's governance is a great mix that will resonate with many customers.
  • However, Databricks' acquisition of Tabular and Snowflake's release of the Polaris Catalog could be a headwind that Confluent needs to address in the future.

Quick Recap: Event Streaming vs. Stream Processing

For those unfamiliar with event streaming and stream processing, as outlined in Part 1, the former is the collection and distribution of events. In event streaming, there are producers and consumers, those systems/applications that produce events and those that consume those events as inputs. For instance, producer events could range from a customer clicking the payment button on a site, to an IoT sensor generating an updated measure for a manufacturing operation. A consumer system/application could range from a financial analytics dashboard that subscribes to stock market data feeds, to a ride-sharing app's notification service that subscribes to driver availability and ride request topics.

Such events are in their raw form, unprocessed and untransformed. If more complex analysis is required on these streams of events, then stream processing is needed to transform, filter, and aggregate the data, readying it for the discovery of deeper and broader insights. In essence, stream processing aims to provide something akin to a data warehouse but for real-time data streams, whereas traditionally a data warehouse stores historical data that gets updated periodically (referred to as batch processing).
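A core stream-processing primitive is the windowed aggregation: continuously grouping events into time windows and computing a statistic per window. The sketch below simulates a tumbling-window average over a finished list of events for illustration; a real stream processor like Flink computes this incrementally over an unbounded stream. The function name and sample data are hypothetical.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_ms):
    """Group (timestamp_ms, value) events into fixed-size tumbling
    windows and average each window -- the kind of continuous
    aggregation a stream processor performs, simulated here over a
    bounded list."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[ts // window_ms].append(value)
    # Key each result by the window's start timestamp
    return {w * window_ms: sum(v) / len(v) for w, v in sorted(windows.items())}

# IoT sensor readings: (timestamp in ms, temperature)
readings = [(100, 20.0), (900, 22.0), (1200, 30.0), (1800, 34.0)]
print(tumbling_window_avg(readings, 1000))  # {0: 21.0, 1000: 32.0}
```

The batch-processing equivalent would run this same aggregation once over yesterday's data; the stream-processing version emits each window's result seconds after the window closes.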

Kafka is the most popular open-source project for event streaming, and Flink is the same for stream processing. CFLT began by providing a managed service for Kafka and has more recently provided managed services for Flink, following its acquisition of Immerok, a commercial entity built around Flink, in January 2023.

A High-Level View on CFLT's Expansion to Stream Processing

If you think about the adoption of Kafka:

  • It is already being used by 80% of the Fortune 500
  • It is the standard for pub-sub, streaming, and event-driven architecture (EDA)
  • It is so widely adopted that even emerging competitors, like Redpanda, need to follow Kafka-API developments

Kafka is fast-approaching peak penetration of event streaming alone. And although event streaming is huge, with there being many more real-time use cases that the market will adopt, CFLT needs another S-curve to support its high growth and high valuation.

One way is to continuously dive deeper into the event streaming space, promoting concepts like EDA, and letting every company centralize on Kafka as the central nervous system for real-time data sharing. Another way is to find vertical integration opportunities, where CFLT has strong leverage and potential synergies, that can dramatically improve the ROI for customers.

One of the most critical features Kafka has introduced in recent years is tiered storage. This feature is frequently highlighted by next-gen competitors in their marketing materials as a significant advantage over Kafka. Although Kafka's implementation of tiered storage became widely available in 2024, with AWS MSK supporting it, Confluent Cloud (referred to as CC) has been offering tiered storage since 2020. As previously discussed, before the advent of tiered storage, Kafka could only save data on local disk storage, making it cost-prohibitive to retain data for longer than seven days. Tiered storage allows users to save data in cost-effective object storage that can scale independently of the Kafka cluster, akin to the separation of compute and storage pioneered by Snowflake (SNOW), which has revolutionized the traditional data warehouse space.
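As a sketch of how tiered storage looks in practice, Apache Kafka's implementation (KIP-405) is enabled per topic, pairing a short local retention window on broker disks with a much longer total retention served from object storage. The property names below follow open-source Kafka; the specific retention values are illustrative assumptions, and managed services may expose different knobs.

```properties
# Broker-level: turn on the tiered (remote) storage subsystem
remote.log.storage.system.enable=true

# Topic-level: offload closed log segments to object storage
remote.storage.enable=true
# Keep only ~1 hour of data on expensive local broker disks
local.retention.ms=3600000
# Retain ~1 year in total; older reads are served from object storage
retention.ms=31536000000
```

The economic point is the gap between the two retention settings: local disk holds hours of hot data, while cheap object storage holds the long tail, which is what makes week-plus (or indefinite) retention viable.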

Besides the scalability, ease of use, and cost advantages of tiered storage, it also opens up a huge opportunity space for CFLT. Tiered storage allows Kafka to function as an operational database or an analytical lakehouse. As 80% of the data on Kafka is already operational data, it makes sense for enterprises to simply keep their data in Kafka, eliminating the need to transfer it to another database, which is a costly operation due to the high network ingress and egress fees charged by cloud providers. Moreover, if you can simply transform the data within Kafka using Flink, save it in object storage, and then query it using CFLT's built-in Flink-as-a-Service, then you don't really need to send the data to places like Databricks or SNOW.

CFLT is the only event-streaming leader trying to expand into stream processing. On the other hand, we have vendors like Databricks and SNOW moving from batch processing to stream processing too. If you think about the evolution of big data analytics, it has moved from on-prem data warehouses and data lakes to their cloud-migrated successors. The next evolution involves more real-time processing: taking unstructured, semi-structured, and/or structured data and transforming/aggregating it for real-time analytics. This evolution is converging the concept of the lakehouse (storing unstructured data and transforming it into useful structured data for analysis within a single architecture), which is the domain of Databricks and SNOW, with real-time stream processing.

If we imagine a data pipeline, with the far left being the data sources where event streaming occurs, and the far right being where periodically updated (batch) data is stored and processed, the middle is where stream processing resides. In essence, CFLT and SNOW/Databricks are each expanding toward the middle, thus becoming closer competitors with one another.
