Join us for this talk, in which Yasmeen Ahmad, Managing Director of Data & Analytics at Google Cloud, shares her perspective on the innovations taking place in the data and analytics platform space.
Mariposa Grove
Project Shield is a service that counters distributed denial-of-service (DDoS) attacks. It is available free of charge to eligible websites that host news, elections, and human rights related content. Project Shield helps ensure unhindered access to election-related information during global democratic processes (such as the U.S. 2022 midterm election season and many others). It also enables critical infrastructure and news websites to defend against non-stop attacks and provides crucial services and information during crises (such as the invasion of Ukraine).
In this keynote, Marc Howard will explain why and how Project Shield uses Apache Beam and Google Cloud Dataflow to deliver some of their core value. Their streaming Apache Beam pipelines process more than 3 TB of log data daily at well over 10,000 queries per second, a rate that grows dramatically during large attacks of up to 400 million queries per second. These metrics power user-facing graphs and long-term attack analytics at scale, fine-tuning Project Shield’s defenses and supporting the effort to keep the web a safe and free space.
Mariposa Grove
In this presentation, we delve into the critical world of data lineage within Apache Beam, exploring its significance and demonstrating its practical implementation. We begin by establishing the motivation behind data lineage, highlighting its role in enhancing data governance, debugging, and impact analysis. Next, we introduce Google Cloud Dataplex, a unified data management platform, and its integration with Beam’s lineage capabilities.
We’ll then embark on a technical journey, showcasing how lineage support is built into Apache Beam’s core. Following this, we will dissect the process of constructing a lineage graph for an Apache Beam job and seamlessly reporting it to Dataplex for insightful visualization.
The presentation will empower the audience with actionable knowledge on how to integrate lineage tracking into their own I/O operations, ensuring greater transparency and control over their data pipelines. Finally, a live demonstration will bring these concepts to life, showcasing data lineage in action for an Apache Beam job executing on Dataflow, and visually exploring its lineage within Dataplex.
By the end of this talk, attendees will possess the knowledge and tools to effectively leverage Apache Beam’s lineage support, fostering transparency and trust within their data pipelines.
Hamina (MP4)
The session covers a recently added Beam SDK extension and applicable use cases. It also describes the techniques used to implement the transform.
Walker Canyon
Cruise leverages Apache Beam to manage and process petabytes of data monthly, essential for our autonomous vehicle model training. This talk will delve into the innovative features we’ve developed to enhance Beam’s capabilities, including a control plane for quota and user management, a C++ sandbox for running AV ROS nodes in the cloud, and shuffling optimization techniques to compress shuffled data.
Mariposa Grove
Prism is a local, portable Beam runner intended to assist end users and SDK developers alike by providing a common platform for all existing and new Beam SDKs.
Beam is unique among data processing systems in how it separates the user-facing SDK from the execution engine. Jobs can be authored in one SDK and executed on Flink, Spark, or cloud services like Dataflow. But these external systems can make prototyping, testing, and debugging complicated.
Prism is written in Go as a small, local runner built with Beam portability first, to better emulate how jobs from any SDK execute on those larger systems. Further, Prism serves as a model runner for all SDKs, providing a robust local experience.
This talk will go into how runners execute pipelines, and the design and implementation of Prism for the goals of testing and configurability. Knowledge of Go is encouraged, but not required.
Hamina (MP4)
This session introduces ordered list state, from concept to implementation.
Walker Canyon
Michelangelo is Uber’s centralized machine learning platform, designed to manage ML pipelines and their associated data processing. As the demand for batch predictions grows, the need for a flexible and efficient processing framework becomes imperative. This presentation explores Uber Michelangelo’s batch prediction processes, focusing on data processing, model prediction, and the transition from Spark to Ray.
At Michelangelo, data preparation and feature transformation are traditionally handled using Spark data transforms. The model prediction step involves a user-defined function within a Spark pipeline model. While Spark has been the backbone for our batch processing needs due to its robustness and ease of use, it has shown limitations in handling the complex machine learning tasks that Uber is increasingly deploying, such as natural language processing and Generative AI. These workloads often require GPUs to meet latency and throughput requirements, which Spark struggles to support efficiently.
Ray, an emerging distributed computing framework, offers better resource utilization, simpler parallelism, and more straightforward scalability. By leveraging Ray for batch processing, we can support large language model batch predictions reliably, efficiently, and scalably. We are also transitioning other machine learning batch prediction tasks to Ray for both data processing and model prediction. In this new setup, data processing is integrated as part of the model using native transformer techniques, allowing deployment on GPUs.
With Ray, we have developed a robust pipeline for batch prediction. Currently, streaming data is handled with separate pipelines. We are actively exploring the unification of these pipelines using open-source libraries. Apache Beam ML provides an opportunity to unify batch and streaming data processing pipelines.
Mariposa Grove
A StateBaseAsyncDoFn.java class and a full, production-grade SCIO implementation of State and Timers with HTTP clients to prevent duplicate requests, plus other aggregation use cases for asynchronous endpoints.
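The speakers’ implementation uses SCIO (Scala) and Java; as a rough, hypothetical illustration of the underlying idea, here is a minimal Python sketch of a stateful DoFn that keeps a per-key flag so that only the first request per key reaches an (assumed) HTTP endpoint:

```python
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class DedupRequestsFn(beam.DoFn):
    # Per-key flag remembering whether a request for this key was already issued.
    SEEN = ReadModifyWriteStateSpec('seen', BooleanCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        key, request = element  # stateful DoFns require keyed (KV) input
        if seen.read():
            return  # duplicate: the first request already went out
        seen.write(True)
        # A real implementation would issue the asynchronous HTTP call here.
        yield (key, request)
```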
Walker Canyon
We migrated ~5 TB (~2 billion rows) of a production NoSQL database to a Postgres database without application downtime. This involved exporting data from Cloud Datastore (the NoSQL DB), normalizing it (from JSON to a SQL schema), and then importing the data into Postgres. We used two different Apache Beam pipelines in the process.
The data migration took about 16 hours with this approach using the Beam pipelines, as opposed to our initial estimate of 5 days using other batch scripts with parallel computing.
Advantages:
* Pipeline 1: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/src/main/java/com/google/cloud/teleport/templates/DatastoreToText.java
** Pipeline 2: we wrote custom Beam DoFn transforms in Python (+ SQLAlchemy) and built a pipeline to ingest the data into Postgres with error handling (a minimal sketch of such a DoFn appears below).
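As a loose illustration of what such a transform might look like (not the authors’ actual code; the table name, columns, and connection URL are placeholders), a DoFn using SQLAlchemy with a dead-letter side output could be sketched as:

```python
import apache_beam as beam
import sqlalchemy


class WriteRowToPostgresFn(beam.DoFn):
    """Inserts one row per element; failing rows go to a dead-letter output."""

    DEAD_LETTER = 'dead_letter'

    def __init__(self, db_url):
        self.db_url = db_url  # e.g. 'postgresql+psycopg2://user:pass@host/db'
        self.engine = None

    def setup(self):
        self.engine = sqlalchemy.create_engine(self.db_url)

    def process(self, row):
        try:
            with self.engine.begin() as conn:
                conn.execute(
                    sqlalchemy.text(
                        'INSERT INTO entities (id, payload) VALUES (:id, :payload)'),
                    row)  # row is assumed to be a dict with 'id' and 'payload' keys
            yield row
        except Exception as exc:
            # Route the failing row to a dead-letter collection for later inspection.
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, (row, str(exc)))

    def teardown(self):
        if self.engine is not None:
            self.engine.dispose()
```

Applied with `beam.ParDo(WriteRowToPostgresFn(db_url)).with_outputs(WriteRowToPostgresFn.DEAD_LETTER, main='written')`, the main output carries the written rows and the tagged output carries the failures.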
Mariposa Grove
This talk unveils our design journey to streamline the ingestion of CDC changes into a data warehouse, enabling rapid data availability for users. We leverage Qlik to stream CDC events to Kafka, harness Dataflow’s processing power, and store the transformed data in BigQuery for efficient analysis.
We’ll walk through our iterative design process, showcasing how Apache Beam’s flexibility allowed us to address business requirements. We’ll highlight key architectural decisions, performance optimizations, and lessons learned along the way.
This blueprint serves as a valuable resource for others seeking to simplify their CDC ingestion pipelines and accelerate time-to-insight for their data-driven initiatives.
Mariposa Grove
In this session, we’ll take a deep dive into implementing a Beam SDK for a new language, using Swift as an example. We’ll cover both the internals and the externals, from implementing the Fn API using Swift’s surprisingly robust gRPC support to how we use Swift’s modern type system to provide a uniquely Swift way of expressing the Beam programming model.
Walker Canyon
This talk is for you if you are considering writing custom code to read from and write to a Web API, for which a solution does not yet exist. This session demonstrates step by step how to read from and write to Web APIs using the Beam Java and Python SDKs.
Web API providers design their services for application workloads, which presents unique challenges for large-scale parallelized workloads such as Beam pipelines. Challenges addressed include error handling, retries with backoff, and caching.
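As a minimal, hypothetical sketch of the retry-with-backoff pattern (the endpoint URL and retry policy are assumptions, not the session’s actual code):

```python
import time

import apache_beam as beam
import requests


class CallWebApiFn(beam.DoFn):
    """Posts each element to a Web API, retrying 5xx responses with exponential backoff."""

    def __init__(self, url, max_attempts=5):
        self.url = url
        self.max_attempts = max_attempts

    def process(self, element):
        delay = 1.0
        for _ in range(self.max_attempts):
            response = requests.post(self.url, json=element, timeout=30)
            if response.status_code < 500:
                response.raise_for_status()  # non-retryable 4xx surfaces to the runner
                yield response.json()
                return
            time.sleep(delay)  # retryable server error: back off and try again
            delay *= 2
        raise RuntimeError(f'{self.url} still failing after {self.max_attempts} attempts')
```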
Hamina (MP4)
Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool for doing so. At the same time, models are getting larger and larger, and it can be hard to fit them into CPU or GPU memory.
This talk will explore some of the mechanisms that Beam has put in place for large model management so that it can serve your models efficiently without requiring any additional work from the pipeline author. Attendees can expect to come away with an understanding of how Beam loads and serves models, how it optimizes its serving architecture for different model sizes/footprints, and how they can use Beam to serve their models (large or small).
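A minimal sketch of the pipeline-author side, assuming a hypothetical PyTorch model and weights path; the large_model flag asks Beam to load a single copy of the model per worker and share it across processes rather than loading one copy per process:

```python
import apache_beam as beam
import torch
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor


class ToyModel(torch.nn.Module):  # stand-in for a real (large) model definition
    def forward(self, x):
        return x * 2


model_handler = PytorchModelHandlerTensor(
    state_dict_path='gs://my-bucket/models/toy_state_dict.pth',  # placeholder path
    model_class=ToyModel,
    model_params={},
    large_model=True)  # share one model copy across worker processes

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([torch.tensor([1.0, 2.0, 3.0])])
        | RunInference(model_handler)
        | beam.Map(print))
```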
Mariposa Grove
The purpose of this session will be to introduce Beam YAML and its core capabilities. These include, but are not limited to:
The session will wrap up with some use-case examples and how to run the YAML pipelines on Google Cloud Dataflow.
Walker Canyon
Rate limiting and quota exhaustion are common issues for data processing pipelines running at scale. Overprovisioning the worker pool while an external resource is throttled not only increases cost but also puts additional pressure on that resource. This talk introduces recent improvements in tackling this issue, from tracking throttled states in a Beam pipeline to acting on these signals from the runner (particularly Dataflow) side. Finally, it explores options for how users can onboard their custom I/O connectors to the throttling detection features.
Hamina (MP4)
In this session, we’ll explore the transformative integration of Beam YAML with Protobuf, unlocking new possibilities within Apache Beam. Delve into practical applications and benefits, and gain insights from our journey of harnessing Apache Beam’s capabilities.
Walker Canyon
Large language models are well known for their performance on generation tasks like summarization, but they also excel at many classical tasks such as classification, named-entity recognition, and information extraction. Multi-modal LLMs similarly achieve state-of-the-art performance on document understanding. This makes them vital for modern data processing pipelines.
Apache Beam is a powerful framework for defining and executing batch and streaming data processing pipelines. Recent releases introduced many tools to facilitate machine learning workflows, such as MLTransform, RunInference, and the Enrichment transform.
In this talk we will introduce an application that combines Beam’s ML capabilities and LLMs to extract product requests from the various document types found in customer emails, facilitating the automatic fulfillment of orders.
Mariposa Grove
In an ideal scenario, a data processing pipeline performs without issues. When a runtime processing error occurs, Beam normally surfaces the error to the runner. However, in some cases the process running the user code might run out of memory, get stuck, or crash. This can prevent it from reporting the error, leaving the user unaware of the failure’s root cause. In this talk, I’ll discuss troubleshooting techniques for these situations. The techniques I cover can also be applied to debugging other Python applications.
Hamina (MP4)
Harness the power of cross-language transforms by combining the best of Java and Python in your data processing workflows. Discover how to seamlessly integrate Java transforms into your Python pipelines using the SDK’s newest utilities.
With just a few lines, you can also automatically generate well-documented, SDK-ready Python wrappers for existing Java transforms.
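As a minimal sketch of the cross-language mechanism (the Java class name and its builder method are hypothetical), a Java transform can be invoked from Python via JavaExternalTransform, which expands it through a Java expansion service:

```python
import apache_beam as beam
from apache_beam.transforms.external import JavaExternalTransform

# Hypothetical Java transform and builder method; Beam expands it via a Java
# expansion service and stitches it into the Python pipeline.
java_transform = JavaExternalTransform('org.example.MyJavaTransform').withThreshold(10)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b', 'c'])
        | java_transform)
```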
Walker Canyon
In this session, we will explore how Apache Beam can be leveraged to create a robust real-time fraud prevention system. Drawing from real-world implementations at Transmit Security, the presentation will cover the architecture and components of our Detection and Response solution. Attendees will learn about the challenges of analyzing high volumes of data in real-time, and how we utilize a combination of data collection points, enriched data pipelines, and stateful aggregation engines.
We will discuss our transition to using Google Cloud Bigtable for low-latency, high-throughput data management, highlighting its role in enhancing the system’s performance.
Additionally, the session will focus on two unique flows within our system:
Together with Solace, we have developed a new native streaming connector for Solace, a popular messaging platform used in manufacturing, finance and many other industries.
Solace has different APIs for different purposes (moving data around, managing queues, etc.), which can be leveraged together to create a Beam connector with accurate and timely backlog and watermark estimations.
The connector was developed by Solace and Google in collaboration with a customer, and it is an example of cross-industry collaboration for the benefit of all Apache Beam users.
In this talk we explain how Solace works, how we made it work with Beam with high throughput and low latency, and what lessons can be learnt for the design of complex streaming connectors for Beam.
Bonsai
Enabling dynamic topic destinations using the PubsubIO writeMessagesDynamic() function in a Java Dataflow pipeline is an interesting feature that appears to be available only in the Apache Beam Java SDK. This talk showcases a workaround implementation that uses the PubsubIO writeMessagesDynamic() function as an external transform.
Hamina (MP4)
In this session, we will explore our journey to improve the stability of our Flink application using the Python Beam SDK runner, with a particular focus on memory tuning. Our initial setup faced significant challenges, including frequent task manager disconnections and ambiguous error logs, often hinting at out-of-memory (OOM) issues. Despite no clear indicators of high memory usage, the instability worsened after transitioning from the Lyft K8s operator to the Apache Flink operator.
Key points include:
In this session, you will learn what we consider a Dead Letter in Beam, the high-level DLQ architecture we’ve implemented, and some example use cases showing how to incorporate DLQs in your pipelines.
Hamina (MP4)
This session will provide an overview of how to use large language models (LLMs) with Apache Beam’s RunInference framework. The prime example for the talk will be running the Gemma open model on Dataflow, outlining considerations and common pitfalls when writing pipelines with LLMs.
Mariposa Grove
Imagine attending an event: as you park your car, you get a notification telling you which entrance closest to your parking spot currently has the shortest line. Then, when you check in, you get a reminder (only if you’ve parked) that you can have your parking validated if you spend on concessions or merchandise as part of your membership benefits. And you have a hankering for nachos, hoping they don’t run out. This would make your experience amazing and unique!
How can you do this, though? You need to connect, in real time, your parking data to your membership data and to the current lines. How does the event organizer easily know how busy the lines are, or predict how fast they will run out of things?
You’ll build a pipeline, from scratch, that incorporates all of these things. You’ll need to put all of your Beam knowledge (or learn it along the way!) to the test. Read multiple inputs, in real time, and make sure you’re enriching them with the right information. Write multiple outputs, in real time, to actually do something with the data. And of course, we’ll show you how you can use an LLM (or any model!) in the same pipeline.
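As a minimal, toy sketch of the joining step (in-memory data stands in for the real-time sources; the field names are invented), two keyed inputs can be combined with CoGroupByKey so notifications can draw on both:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    parking = p | 'Parking' >> beam.Create([('member-1', {'lot': 'A', 'spot': 42})])
    members = p | 'Members' >> beam.Create([('member-1', {'tier': 'gold'})])

    _ = (
        {'parking': parking, 'membership': members}
        | beam.CoGroupByKey()  # join the two inputs on the member id
        | beam.MapTuple(lambda member_id, joined: {
              'member_id': member_id,
              'parking': list(joined['parking']),
              'membership': list(joined['membership']),
          })
        | beam.Map(print))  # a real pipeline would write notifications instead
```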
Bonsai
Bulk Inference in Machine Learning (ML) refers to the challenge of how to organize and compute model predictions for a large pool of available input data with no latency requirements. JAX is an open-source computation library commonly used by both engineers and researchers for flexible, high-performance ML development. This talk will illustrate how teams at Google are using Beam to ergonomically design, orchestrate, and scale JAX Bulk Inference workloads across various accelerator platforms.
Mariposa Grove
The rise of Generative AI (GenAI) has revolutionized Machine Learning (ML) Development Workflows. Starting from large pre-trained models introduces new challenges in resource management, large model evaluation (both human and automated), efficient bulk inference, and effective handling of massive model embeddings. This talk distills key lessons and best practices from Google’s experience in deploying GenAI models at scale, focusing on the adaptation and evolution of MLOps principles to tackle the unique demands of this emerging field.
Mariposa Grove
Do we even need PCollections? Or ProcessElements? Can we have the language fully typecheck the pipeline for us at compile time? Can we do that in Go?
Since Beam was designed, programming languages have continued to evolve and change, so why can’t our SDKs? We’ve now got ample experience with the Apache Beam Go SDK, but the language it was designed for is now very different.
This short talk will compare the current Go SDK with an experimental implementation that takes better advantage of the current strengths of Go, its approach to generic type parameters, and more.
Walker Canyon
A successful re-engineering in banking: a Drools ParDo using a KieContainer (JBoss) in Dataflow has been used to process a business rules .jar, merging several Spring microservices into Dataflow. This design, together with another Dataflow app and the pattern adopted in “Avoid HTTP requests duplicates in Apache Beam with SCIO, a custom BaseAsyncDoFn and State and Timers”, has allowed us to get rid of all those microservices and their infrastructure, thanks to Apache Beam and Dataflow.
Hamina (MP4)
The presentation will cover troubleshooting RunInference, GPU, OOM, and other related errors, offering practical insights from real-world customer support experiences.
Mariposa Grove
Dataflow has historically supported only Exactly Once processing, but it recently released an At Least Once mode. Understand each architecture, its pros and cons, and in which use cases one is better than the other.
Hamina (MP4)
We introduce you to Beamstack, an open-source framework currently under development, aimed at facilitating the deployment of Machine Learning and GenAI workflow pipelines with Apache Beam on Kubernetes, whether on-premises or in the cloud. It encompasses a holistic solution, featuring abstraction layers that optimize the deployment of various components of machine learning pipelines, data processing workflows, and deployment infrastructure.
At the core of Beamstack’s functionality lie Kubernetes Custom Resource Definitions (CRDs). These CRDs constitute a potent mechanism for extending the Kubernetes API, facilitating the seamless integration of ML-centric resources within the Kubernetes ecosystem. Through this approach, Beamstack empowers users to capitalize on the comprehensive capabilities and features offered by Kubernetes while unlocking the boundless potential of Apache Beam for Machine Learning development by various teams in any organization.
In this session, we will discuss BeamStack’s use cases and product roadmap, as well as features that have already been implemented, those currently under implementation, and those planned for the future. We will also address our current challenges and areas where we need support from contributors. Join us in shaping the future of ML development tooling around Apache Beam by becoming a part of the Beamstack community.
Mariposa Grove
Apache Beam and Apache Airflow are powerful tools in the data engineering ecosystem, often used separately but rarely in tandem. This talk explores the synergy between these “distant cousins” by demonstrating how to seamlessly integrate Beam pipelines within Airflow workflows.
We’ll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow’s scheduling capabilities with Beam’s robust data processing framework can create a more efficient and manageable data pipeline architecture.
Attendees will learn how to leverage Airflow’s DAG (Directed Acyclic Graph) to trigger Beam jobs seamlessly, enabling them to orchestrate sophisticated, distributed data processing tasks across data platforms, such as Google Cloud Dataflow. By the end of this session, participants will gain practical insights into integrating these technologies, enhancing their ability to build and maintain resilient, efficient data pipelines that meet the demands of modern data-driven applications.
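As a minimal sketch of one common integration path (the DAG id, pipeline file, and GCS paths are placeholders), Airflow’s Beam provider can submit a Python pipeline to Dataflow from a DAG task:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

with DAG(
    dag_id='beam_on_airflow_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually, or set a cron expression
    catchup=False,
) as dag:
    run_beam_job = BeamRunPythonPipelineOperator(
        task_id='run_beam_job',
        py_file='gs://my-bucket/pipelines/wordcount.py',  # placeholder pipeline
        runner='DataflowRunner',
        pipeline_options={
            'project': 'my-gcp-project',
            'region': 'us-central1',
            'temp_location': 'gs://my-bucket/temp/',
        },
    )
```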
Walker Canyon
We have developed BeamStack, a toolkit tailored to enhance the deployment and management of ML and GenAI workloads on Kubernetes. It diminishes the intrinsic complexities associated with managing Kubernetes clusters and deploying Apache Beam workloads. Fundamentally, BeamStack leverages Beam YAML, a structured format that enables the declarative definition of pipelines. This facilitates rapid deployment and scalability across diverse Kubernetes environments, encompassing cloud-based solutions and local clusters such as Minikube.
BeamStack distinguishes itself through its proficiency in orchestrating AI pipelines within Kubernetes environments. It provides streamlined workflows that enhance the efficiency of setup, deployment, and management processes. Moreover, BeamStack seamlessly integrates with monitoring tools such as Prometheus and Grafana. Its overarching objective is to democratize the deployment of AI workloads with Apache Beam on Kubernetes, empowering users with the confidence to deploy seamlessly while optimizing performance. Our platform’s user-friendly design simplifies the process, making the deployment of Apache Beam jobs universally attainable.
In this talk, we’ll delve into the development journey of BeamStack, a toolkit crafted to simplify the deployment and management of Apache Beam ML workloads on Kubernetes. We’ll explore the motivations behind BeamStack’s creation, the challenges it addresses, and the key components that make it a powerful tool for AI workload deployment.
Mariposa Grove
This talk will explain some of the capabilities of Dataflow that can help users save costs:
Level up your streaming pipelines by learning about the latest advancements in Dataflow Streaming, including new features such as At-Least-Once Mode, Active Load Balancing, Autoscaling Hints, and In-flight Autoscaling Updates, as well as best practices for Kafka source pipelines and ML pipelines.
Walker Canyon
While the basic features of Beam YAML allow one to write simple to moderately complex pipelines, there are limits to what can be developed relying solely on the built-in transforms and basic features. Luckily, Beam YAML was developed with these users in mind and offers multiple ways to leverage more advanced features of Beam to implement these sophisticated use cases.
The purpose of this session will be to dive into the more advanced features Beam YAML has to offer. Topics include, but are not limited to:
If time allows, there will be some use-cases demonstrating the features presented above.
This session builds upon the information presented in the introductory session, and it is recommended that attendees view that session before diving into the topics presented here.
Walker Canyon
Retrieval Augmented Generation (RAG) has emerged as a groundbreaking technique in the field of generative AI. By providing a large language model with relevant data to solve a given task, it can generate answers with much higher accuracy. Beyond enhanced performance, RAG allows us to work with sensitive data without resource-intensive in-house model training. However, efficiently preprocessing and ingesting document data into vector databases for RAG applications can be challenging, especially when dealing with real-time updates.
In most RAG applications, relevant data is fetched from a vector database through semantic search. From experience working on RAG applications, we have noticed that building a robust data processing pipeline to keep the data in the vector database up to date can be a challenge. Beam is a particularly powerful tool for this task since it supports both batch and streaming data, allowing us to reuse the same data processing pipeline both for processing large-scale datasets of relevant information and for keeping the vector database up to date with live streaming data.
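A minimal sketch of that reuse idea (the chunking/embedding logic and sources are placeholders, not the authors’ pipeline): the same composite transform can back both a one-off batch backfill and a streaming pipeline that keeps the vector store current:

```python
import apache_beam as beam


class PrepareForVectorDb(beam.PTransform):
    """Placeholder chunking/embedding step shared by batch and streaming pipelines."""

    def expand(self, docs):
        return docs | beam.Map(
            lambda doc: {'id': doc['id'], 'embedding': [float(len(doc['text']))]})


# One-off backfill over an existing corpus (batch)...
with beam.Pipeline() as backfill:
    _ = (
        backfill
        | beam.Create([{'id': 'doc-1', 'text': 'hello world'}])
        | PrepareForVectorDb()
        | beam.Map(print))  # a real pipeline would upsert into the vector database

# ...and the same transform reused unchanged in a streaming pipeline, e.g. after
# reading documents from a (hypothetical) Pub/Sub subscription:
#   p | beam.io.ReadFromPubSub(subscription=...) | beam.Map(parse) | PrepareForVectorDb()
```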
It is crucial to recognize that proper preprocessing and analysis of data are often underestimated yet fundamental components in building effective RAG applications. As the age-old adage goes, ‘garbage in, garbage out,’ emphasizing the significance of ensuring high-quality input data for optimal output results. Apache Beam can play an important role in this process, offering both batch and streaming data processing capabilities that are essential for developing production-ready RAG applications.
Mariposa Grove
Usage Billing is a billing concept in which charges are based on consumption. At LinkedIn, the Usage Billing system processes large volumes of consumption data, particularly for Ads and Jobs use cases. In this presentation, we will discuss how the team rearchitected the existing usage platform to adopt a more streaming-oriented approach using Apache Beam and the Samza Runner. The new system can aggregate usage data across various dimensions and supports multiple rules to handle different aggregation windows. It also includes change data capture and correction events to ensure the high level of accuracy required when handling customers’ money.
Hamina (MP4)
This session covers the cost components of Dataflow pipelines, different run configurations, common performance optimization techniques, and approaches to monitoring the cost and performance of pipelines.
Hamina (MP4)
In this session, I’ll share how Lyft uses Beam’s portability framework with the Flink execution engine for real-time demand and supply forecasting. We will dive into our architecture, scale, and lessons learned.
Walker Canyon
The session is aimed at practitioners looking to use streaming technologies to process data for use in near-real-time RAG architectures.
For context, the session is based on this article: https://beam.apache.org/blog/dyi-content-discovery-platform-genai-beam
Agenda:
In this talk we will walk through the process of building and deploying Beam Dataflow templates.
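As a minimal sketch of one such flow (classic templates; the project, bucket, and paths are placeholders), running a Python pipeline with template_location set stages the template instead of executing the job, so it can be launched later from that template; Flex Templates follow a different, container-based flow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp/',
    template_location='gs://my-bucket/templates/uppercase-template',  # where to stage
)

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | beam.io.ReadFromText('gs://my-bucket/input/*.txt')
        | beam.Map(str.upper)
        | beam.io.WriteToText('gs://my-bucket/output/result'))
```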
Walker Canyon
Building out a data ingestion and enrichment pipeline for a Retrieval Augmented Generation (RAG) system using Apache Beam. The pipeline built a knowledge base by ingesting text data into vector databases, specifically utilizing Redis and OpenSearch for efficient semantic search and data retrieval. These databases stored vectorized representations of text chunks, allowing the system to enhance user queries by matching them with relevant text fragments. The design of the pipeline ensured scalable and effective RAG operations, leveraging the strengths of Redis and OpenSearch in vector-based search.