3.0 and Beyond: The Future of Beam

by Danny McCormick & Kenneth Knowles

Beam has become a core part of the data processing ecosystem through a combination of innovation and hard work from the Beam community. As the data landscape continues to evolve, however, so too must Beam. During this talk, Kenn (Beam PMC chair) and Danny (Beam PMC) will explore some of the opportunities and challenges in front of Beam, culminating in a vision for the future of Beam. Attendees will gain a clear idea of where Beam is headed, how they can leverage Beam even more effectively moving forward, and how they can contribute to helping Beam become the best that it can be.

A Deep Dive into Beam Python Type Hinting

by Jack McCluskey

This session focuses on the mechanics of the Beam Python SDK’s type hinting infrastructure and best practices for users, along with a discussion of current limitations and future improvements.
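
As a small illustration of the kind of hints the session covers, here is a minimal sketch that decorates a DoFn with Beam’s typehints API; the element shapes and transform names are placeholders, not material from the talk.

    import apache_beam as beam
    from apache_beam import typehints


    @typehints.with_input_types(str)
    @typehints.with_output_types(typehints.KV[str, int])
    class ParseEvent(beam.DoFn):
        """Parses a line like 'alice,3' into a (user, count) pair."""
        def process(self, line):
            user, count = line.split(",")
            yield (user, int(count))


    with beam.Pipeline() as p:
        counts = (
            p
            | beam.Create(["alice,3", "bob,5"])
            | beam.ParDo(ParseEvent())
            # Pipeline construction can now check that downstream transforms
            # accept KV[str, int] elements.
            | beam.CombinePerKey(sum)
        )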

A Tour of Apache Beam’s New Iceberg Connector

by Ahmed Abualsaud

Join us for a comprehensive overview of Apache Beam’s Iceberg connector. This session will cover its current features and support across multiple SDKs (Java, Python, YAML, SQL). We’ll delve into key design decisions and share what’s coming next. We’ll also demonstrate how to use the connector in batch, streaming, and multi-language scenarios to build robust and scalable pipelines for your data lakehouse.
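
For readers who want a feel for the connector before the session, here is a minimal write sketch using the Python managed transforms API (one way the Iceberg connector is surfaced); the module path, config keys, and catalog settings below are assumptions to check against the current connector docs, and the warehouse, catalog, and table names are placeholders.

    import apache_beam as beam
    from apache_beam.transforms import managed

    # Placeholder Iceberg catalog/table configuration.
    iceberg_config = {
        "table": "db.events",
        "catalog_name": "local",
        "catalog_properties": {
            "type": "hadoop",
            "warehouse": "gs://my-bucket/warehouse",
        },
    }

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
            # Managed transforms expect schema'd rows, so map dicts to beam.Row.
            | beam.Map(lambda d: beam.Row(id=int(d["id"]), name=str(d["name"])))
            | managed.Write(managed.ICEBERG, config=iceberg_config)
        )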

Architecting Real-Time Blockchain Intelligence with Apache Beam and Apache Kafka

by Vijay Shekhawat

At TRM Labs, we manage petabyte-scale data from over 30 blockchains to deliver customer-facing analytics. Our platform processes high-throughput data to extract actionable intelligence for critical decision-making.

In this session, we will discuss how Apache Beam underpins our architecture by integrating with Apache Kafka for robust data ingestion and deploying on Google Cloud Dataflow to ensure scalability and fault tolerance. We will also delve into the complexities of handling massive volumes of blockchain data—peaking at up to one million events per second—in real time and computing complex metrics.

Become a Contributor: Making Changes, Running a Patched Pipeline, and Contributing Back to Beam

by Yi Hu

Apache Beam, as an open-source project, is built on community contributions. This talk shares how to make changes to Beam, run your pipeline with a patched Beam SDK, and finally contribute your changes back to Beam. The talk also introduces recent efforts to make contributing to, and deploying, a custom Beam build easier.

Reference: https://github.com/apache/beam/blob/master/contributor-docs/code-change-guide.md

Bridging BigQuery and ClickHouse with Apache Beam: A Google Dataflow Template for Batch Ingestion

by Bentsi Leviav

In this talk, I’ll walk through the process of developing and deploying a reusable Dataflow template using ClickHouseIO, the official Beam connector. The template enables users to replicate data from BigQuery to ClickHouse, a high-performance OLAP database increasingly adopted for real-time analytics.

In addition, the session will cover key considerations when working with the ClickHouseIO connector, such as schema mapping, handling large volumes of data, performance tuning, and more.

Build Seamless Data Ecosystems: Real-World Integrations with Apache Beam, Kafka, and Iceberg

by Rajesh Vayyala

Modern data architectures are no longer built around a single tool — they thrive on interoperability and community-driven integration. This session explores how Apache Beam serves as the flexible processing engine that connects streaming platforms like Kafka with modern, ACID-compliant data lakehouse solutions like Apache Iceberg.

Through real-world architecture patterns and practical examples, we’ll dive into how organizations are using Beam to unify disparate data sources, enable real-time and batch analytics, and future-proof their data platforms. You’ll also gain insights into how the open-source community continues to drive innovation across this ecosystem — from new connectors to performance optimizations and beyond.

Choosing The Right Boat For Your Stream

by Kamal Aboul-Hosn

Processing real-time data streams requires navigating a growing landscape of connective tissue that can move and transform data. The choices require careful consideration of tradeoffs across scalability, latency, capability, and operational overhead. These types of considerations also drive decision making in building the tools themselves. In this talk, Kamal will draw upon over a decade of building large-scale streaming services within Google Cloud to help you successfully architect your streaming data pipelines so that you aren’t left wishing you had brought a bigger boat.

Data Quality in ML Pipelines

by Pritam Dodeja

This session demonstrates two approaches for integrating data quality into ML pipelines: a schema-based approach and a UDF-based approach, in which Apache Beam performs the data-quality filtering. Time permitting, it will also show how to integrate data-quality features into the dataset using a PreTransform component that takes in a UDF.
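
To give a flavor of the UDF-based approach, here is a minimal sketch that uses a plain Python predicate with beam.Filter to drop records that fail a quality check; the record fields and thresholds are hypothetical, not the talk’s actual rules.

    import apache_beam as beam


    def passes_quality_check(record):
        """Hypothetical UDF: keep records with a present, non-negative value."""
        return record.get("value") is not None and record["value"] >= 0


    with beam.Pipeline() as p:
        clean = (
            p
            | beam.Create([
                {"id": 1, "value": 4.2},
                {"id": 2, "value": None},   # dropped
                {"id": 3, "value": -1.0},   # dropped
            ])
            | "FilterBadRecords" >> beam.Filter(passes_quality_check)
        )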

Dataflow Cost Calculator

by Svetak Sundhar & Aditya Saraf

A co-presentation by Google and Exabeam, demonstrating how Dataflow APIs and other Google Cloud products were used in conjunction with Beam metrics to minimize infrastructure costs.

Dataflow for Beginners

by Wei Hsia

This presentation is designed for those new to Dataflow, offering a comprehensive introduction to building powerful data pipelines with ease. We will cover Beam YAML, a tool for rapid pipeline development, and its advantages, and walk through examples showing how you can use the Cloud Shell editor to get started quickly. We will then transition from the foundational concepts to more advanced topics, including effective testing strategies, leveraging different providers, and integrating machine learning models. We will also cover Job Builder, a user-friendly interface that streamlines the creation of data movement and processing tasks, and its relationship with Beam YAML, including features for exporting, importing, and directly editing YAML configurations. We will explore a variety of common data processing patterns, such as filtering, mapping, and executing SQL transformations, all while incorporating robust error-handling techniques to ensure your pipelines are resilient and reliable. Leave this session equipped to build, customize, and manage sophisticated data pipelines with newfound speed and simplicity with Dataflow.

Dataflow Streaming Innovations

by Tom Stepp & Ryan Wigglesworth

At this year’s Beam Summit, we will highlight the latest advancements in Dataflow Streaming. Join us to learn about fast job updates, KafkaIO improvements, streaming AI/ML enhancements, and performance and observability improvements.

Efficient LLM-Based Data Transformations via Multiple LoRA Adapters

by Jasper Van den Bossche

While LLM-based data transformations are very powerful for processing unstructured data, organizations often struggle to deploy these tools at scale, particularly when tailoring them to specific custom use cases. This talk will explore how to efficiently serve multiple LoRA (Low-Rank Adaptation) adapters on a single base model, enabling task-specific transformations within Apache Beam pipelines and addressing the scalability challenges head-on.

LoRA adapters enable efficient fine-tuning of large language models by updating only a subset of parameters, making it possible to tailor models for specific tasks without the computational overhead of full fine-tuning. This approach is particularly valuable when working with private data or deploying cost-effective models. We will explore how inference servers like vLLM and NVIDIA NIM can dynamically swap LoRA adapters in real time, optimizing resource utilization while seamlessly integrating with Apache Beam for batch and streaming data processing. This integration ensures scalable, cost-effective, and adaptable workflows for various data sources.

We’ll demonstrate this approach through a real-world implementation where we process different document types, from invoices to legal contracts, using a custom LoRA adapter for each, achieving specialized extraction capabilities while maintaining a single model infrastructure.

The talk will cover: (1) an overview of LoRA adapters and their efficiency benefits, (2) configuring inference servers for dynamic adapter swapping, and (3) implementing a complete Apache Beam pipeline for production-ready unstructured data processing.

Enhancing Data Quality for AI Success

by Aarohi Tripathi

“Enhancing Data Quality for AI Success” focuses on the critical role that high-quality data plays in the effectiveness and accuracy of AI models. Since AI systems learn patterns from data, ensuring that the data is clean, diverse, accurately labeled, and regularly updated is essential for optimal performance. Poor-quality data can lead to inaccurate predictions, biased results, and underperforming models. By implementing strategies like data cleansing, augmentation, and proper annotation, organizations can improve the training process, resulting in more reliable, fair, and effective AI systems. The topic emphasizes that the success of AI initiatives depends as much on the data used as on the algorithms themselves.

Exabyte-scale Streaming Iceberg IO with Beam, Ray, and DeltaCAT

by Patrick Ames

This production case study highlights how Amazon uses Ray and DeltaCAT at exabyte scale to resolve longstanding performance and scale challenges in integrating streaming pipelines with Apache Iceberg. It also shows how the Apache Beam, Ray, Apache Flink, and Apache Spark communities can start bringing the same benefits to their workloads using the DeltaCAT project’s IO source/sink implementations for Apache Beam.

From taming energy market data to hyperparameter hunting at scale: leveraging Apache Beam & BigQuery

by Nicholas Bonfanti & Matteo Pacciani

This session explores how Apache Beam handles the end-to-end workflow for Italian energy market analytics, including forecasting electricity demand, natural gas demand, and renewable generation.

We’ll demonstrate its power as a robust ETL tool for parsing and processing diverse data sources, using Google BigQuery as the central data warehouse. These sources range from millions of XML files detailing point-of-delivery (POD) electricity demand to ECMWF-generated GRIB meteorological files, among others.

Building on this foundation, the session covers Beam’s role in scalable machine learning, detailing its use for distributed Bayesian hyperparameter searches over thousands of models, efficient retraining with optimized parameters over newly-parsed data, and large-scale distributed inference.
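
To make the ETL shape concrete, here is a minimal, hypothetical sketch of the pattern described above: matching XML files, parsing them with a UDF, and writing rows to BigQuery. The XML layout, field names, paths, and table are invented placeholders, not the presenters’ actual schema.

    import xml.etree.ElementTree as ET

    import apache_beam as beam
    from apache_beam.io import fileio


    def parse_pod_xml(readable_file):
        # Hypothetical XML layout; element and attribute names are placeholders.
        root = ET.fromstring(readable_file.read().decode("utf-8"))
        for rec in root.iter("measurement"):
            yield {
                "pod_id": rec.get("pod"),
                "ts": rec.get("ts"),
                "demand_kwh": float(rec.findtext("demand", default="0")),
            }


    with beam.Pipeline() as p:
        _ = (
            p
            | fileio.MatchFiles("gs://my-bucket/pod/*.xml")  # placeholder path
            | fileio.ReadMatches()
            | beam.FlatMap(parse_pod_xml)
            | beam.io.WriteToBigQuery(
                "my-project:energy.pod_demand",  # placeholder table
                schema="pod_id:STRING,ts:TIMESTAMP,demand_kwh:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )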

Growing the Apache Beam Community: Resources, Contributions, and Collaboration

by Jana Polianskaja & Danny McCormick

Contribute to the Apache Beam community! This presentation guides developers—from beginners to experts—through a structured path to meaningful community engagement. We’ll cover essential resources, real-world contribution examples, and diverse collaboration opportunities, offering actionable strategies and inspiration for all experience levels.

How Beam serves models with vLLM

by Danny McCormick

Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool to do this. At the same time, models are getting larger and larger, and it can be hard to fit them into your CPU or GPU and to serve them efficiently. In particular, serving large language models efficiently has grown in importance and difficulty as models have continued to grow.

vLLM is an open-source library specifically designed for high-throughput and low-latency LLM inference. It optimizes the serving of LLMs by employing several specialized techniques, including continuous batching.
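
Independently of whatever built-in support the talk covers, here is a hand-rolled sketch of the integration pattern: a custom RunInference ModelHandler that wraps vLLM’s offline LLM API. The handler name, model, and sampling settings are illustrative assumptions.

    import apache_beam as beam
    from apache_beam.ml.inference.base import ModelHandler, PredictionResult, RunInference
    from vllm import LLM, SamplingParams


    class VllmPromptHandler(ModelHandler):
        """Illustrative handler: loads a vLLM model once per worker."""

        def __init__(self, model_name):
            self._model_name = model_name

        def load_model(self):
            return LLM(model=self._model_name)

        def run_inference(self, batch, model, inference_args=None):
            outputs = model.generate(list(batch), SamplingParams(max_tokens=64))
            for prompt, output in zip(batch, outputs):
                yield PredictionResult(prompt, output.outputs[0].text)


    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(["Summarize: Apache Beam unifies batch and streaming."])
            | RunInference(VllmPromptHandler("facebook/opt-125m"))  # placeholder model
        )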

Integrating LLMs and Embedding models into Beam pipelines using langchain

by Ganesh Sivakumar

Large language models (LLMs) have transformed how we process and generate text. In this session, I’ll talk about Langchain-Beam, an open-source library that integrates LLMs and embedding models into Apache Beam pipelines as transforms using LangChain.

We will explore how the Langchain-Beam transform performs remote LLM inference with OpenAI and Anthropic models: you provide data processing logic as a prompt, and the models transform the data based on that prompt. We’ll also see how to use embedding models to generate vector embeddings for text in the pipeline, and learn about real-world use cases.
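
Langchain-Beam’s own transforms are the subject of the talk; purely as a hand-rolled illustration of the idea (and not the library’s API), the sketch below calls a LangChain embedding model from a DoFn. The model class, model name, and output fields are assumptions.

    import apache_beam as beam
    from langchain_openai import OpenAIEmbeddings  # assumes OPENAI_API_KEY is set


    class EmbedText(beam.DoFn):
        """Illustrative DoFn: embeds each text element with a LangChain model."""

        def setup(self):
            # One embeddings client per worker; the model name is a placeholder.
            self._embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

        def process(self, text):
            yield {"text": text, "embedding": self._embeddings.embed_query(text)}


    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(["Apache Beam unifies batch and streaming."])
            | beam.ParDo(EmbedText())
        )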

Integration of Batch and Streaming data processing with Apache Beam

by Yoichi Nagai

Mercari utilizes Apache Beam for batch and streaming processing for various purposes, such as transferring data to CRM systems and providing incentives to users. To avoid different departments within the company having to develop similar data pipelines, Mercari built Mercari Pipeline and released it as OSS: a tool that allows users to build pipelines by simply defining the processing in JSON or YAML. (https://github.com/mercari/pipeline)

In this session, we will introduce an example that uses a key feature of Apache Beam, the ability for batch and streaming processes to share the same code, to carry time-series aggregate values that were generated and verified in batch over into streaming processes.

Introduction to the Apache Beam RAG package

by Claude van der Merwe

Use the extensible Apache Beam RAG package to build pipelines that (a minimal sketch follows the list):

  • Generate vector embeddings
  • Ingest embeddings into the desired vector database
  • Perform semantic search
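
The sketch below shows only the first step, embedding generation, using MLTransform with a Hugging Face sentence-transformers config; the module paths and parameter names are assumptions to verify against the current RAG package documentation, and the model and artifact location are placeholders.

    import tempfile

    import apache_beam as beam
    from apache_beam.ml.transforms.base import MLTransform
    from apache_beam.ml.transforms.embeddings.huggingface import SentenceTransformerEmbeddings

    artifact_location = tempfile.mkdtemp()  # placeholder artifact path

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([{"text": "Beam pipelines can generate embeddings."}])
            | MLTransform(write_artifact_location=artifact_location).with_transform(
                SentenceTransformerEmbeddings(
                    model_name="all-MiniLM-L6-v2",  # placeholder model
                    columns=["text"],
                )
            )
            # Later steps would ingest the embeddings into a vector database and
            # serve semantic-search queries against it.
        )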

Leveraging Apache Beam for Enhanced Financial Insights

by Raj Katakam, Naresh Kumar Kotha & Venkatesh Poosarla

Credit Karma leverages Apache Beam to address a broad spectrum of data processing requirements, particularly real-time data transformation to bolster machine learning models. Key applications include:

  1. preprocessing data and constructing graphs for live model scoring,
  2. large-scale ETL (Extract, Transform, Load) operations for analytics, and
  3. real-time aggregation of features to furnish near-instantaneous insights to models.

We will walk through one use case from each of these pillars.

Leveraging LLMs for Agentic Workflow Orchestration in Apache Beam YAML Pipelines

by Charles Adetiloye

This session explores how Large Language Models (LLMs) can be integrated into Apache Beam tooling to enable agentic orchestration of YAML-defined workflows. We present a system where LLMs parse, validate, and execute Beam YAML pipelines, acting as autonomous agents that enhance workflow automation and reduce manual intervention. The talk covers architecture, pipeline translation, task planning, and integration strategies for embedding LLMs in declarative workflow environments. Attendees will learn how to build intelligent tooling layers for Beam that support dynamic pipeline generation, error resolution, and adaptive execution—all while maintaining the flexibility and scalability of the Beam programming model.

Managed transforms - power of Beam without maintenance overheads

by Chamikara Jayalath

Apache Beam offers a number of powerful transforms including a set of highly scalable I/O connectors. Usually, you have to regularly keep upgrading Beam to get critical fixes related to such transforms. With the recent addition of Java and Python managed APIs, Beam allows runners to fully manage supported transforms. Google Cloud Dataflow uses this API to manage and automatically upgrade widely used Beam I/O connectors. This allows you to focus on the business logic of your batch and streaming pipelines without worrying about associated maintenance overheads.
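
As an example of what the managed API looks like from user code, here is a minimal Kafka read sketch in Python; the module path, config keys, broker, and topic are assumptions and placeholders rather than a verified configuration.

    import apache_beam as beam
    from apache_beam.transforms import managed

    # Placeholder Kafka configuration; key names should be checked against the
    # managed KafkaIO documentation.
    kafka_config = {
        "bootstrap_servers": "broker-1:9092",
        "topic": "events",
        "format": "RAW",
    }

    with beam.Pipeline() as p:
        _ = (
            p
            | managed.Read(managed.KAFKA, config=kafka_config)
            | beam.Map(print)
        )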

Many Data Formats, One Data Lake

by Peter Wagener

Apache Beam has the flexibility to handle a wide variety of data formats: CSV, Avro, Parquet, Iceberg, … they can all be inputs and/or outputs for your data processing projects. The question quickly becomes: which do you choose?

Our answer is a bit surprising: All of them. If you can define the schema appropriately within the pipelines, you can use the file format that makes the most sense for each use case.

Optimize parallelism for reading from Apache Kafka to Dataflow

by Supriya Koppa

Reading from Apache Kafka into Google Cloud Dataflow can present performance challenges if not configured correctly. This session provides a practical guide to troubleshooting common parallelism issues and implementing best practices for optimal performance. We’ll cover key aspects such as understanding Dataflow’s Kafka source, effectively utilizing maxNumRecords and maxReadTime, and addressing potential bottlenecks. Learn how to diagnose and resolve issues related to uneven parallelism and latency, ensuring your real-time data pipelines operate smoothly and efficiently, with pointers to the official Google Cloud Dataflow documentation.
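
For orientation, here is a bounded-read sketch using the Python KafkaIO wrapper’s max_num_records and max_read_time options (the Python-side counterparts of maxNumRecords/maxReadTime); the broker, group, and topic values are placeholders.

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka

    with beam.Pipeline() as p:
        _ = (
            p
            | ReadFromKafka(
                consumer_config={
                    "bootstrap.servers": "broker-1:9092",  # placeholder broker
                    "group.id": "my-consumer-group",       # placeholder group
                },
                topics=["events"],
                # Bound the read for testing and debugging; omit these options
                # for an unbounded streaming read.
                max_num_records=10_000,
                max_read_time=60,  # seconds
            )
            | beam.Map(print)
        )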

Real-Time Medical Record Processing

by Austin Bennett

At Azra AI, we help with the treatment journeys of cancer patients. Beam is a core component of our platform, allowing us to process medical records in ‘real time’.

In this talk, we will share how we leverage Beam, along with a mix of other services, to achieve real, impactful results.

Real-Time Predictive Modeling with MLServer, MLFlow, and Apache Beam

by Devon Peticolas & Jeswanth Yadagani

Oden Technologies delivers real-time machine learning to manufacturing environments with Apache Beam. In this session, we’ll demonstrate how Oden aggregates hundreds of sensor streams into real-time tensors for predictive scoring against SKLearn pipelines hosted on MLServer. We’ll also discuss how we use MLFlow for model management and monitoring, along with the infrastructure we’ve developed to coordinate these systems, enabling reliable model testing, deployment, and code updates.

Real-time Threat Detection at Box with Apache Beam

by Abhishek Mishra, Mark Chin & Elango Prasad

Box faces a constant barrage of sophisticated cybersecurity threats. This session dives into how Box leverages the Apache Beam Python SDK, combined with cutting-edge machine learning techniques, to build a real-time threat detection system. We’ll explore the unique challenges of processing high-volume, real-time data streams to identify and mitigate threats before they can impact our customers. The presentation will focus on:

  • The architecture of our Beam-based unified threat detection pipeline, highlighting the integration of machine learning models.
  • How we utilize a transformer-like structure for time-series analysis in ransomware detection. This includes discussing the advantages of transformers for capturing long-range dependencies and contextual information in sequential data (e.g., system logs, network traffic).
  • Specific real-time data challenges encountered at Box’s scale (e.g., data velocity, variety, veracity, late-arriving data, schema evolution) and how they impact model training and inference.
  • Practical techniques and Beam patterns used to address these challenges (e.g., windowing, triggering, state management, handling out-of-order data), ensuring data is prepared effectively for the machine learning model (see the windowing sketch after this list).
  • Lessons learned and best practices for building robust, real-time threat detection systems with Apache Beam and transformer-based models.
  • How the Python SDK of Apache Beam facilitated the integration of machine learning components into the streaming pipeline, especially the effective utilization of RunInference within the Beam pipeline to serve the transformer-based model and perform real-time predictions.
  • Future directions for enhancing our threat detection capabilities, including exploring other advanced machine-learning architectures.
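
As a flavor of the windowing and triggering patterns listed above, here is a generic sketch (not Box’s pipeline) that applies fixed windows with early firings and an allowed-lateness budget before grouping events for downstream scoring; the keys, timestamps, and window sizes are placeholders.

    import apache_beam as beam
    from apache_beam.transforms import trigger, window
    from apache_beam.transforms.window import TimestampedValue

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([("host-1", "login"), ("host-1", "file_write")])
            # Assign placeholder event-time timestamps so windowing applies.
            | beam.Map(lambda kv: TimestampedValue(kv, 15))
            | beam.WindowInto(
                window.FixedWindows(60),  # 60-second windows
                trigger=trigger.AfterWatermark(early=trigger.AfterProcessingTime(10)),
                accumulation_mode=trigger.AccumulationMode.DISCARDING,
                allowed_lateness=300,  # tolerate up to 5 minutes of late data
            )
            | beam.GroupByKey()
        )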

Remote LLM Inference with Apache Beam: Practical Guide with Gemini and Gemma on Vertex AI

by Taka Shinagawa

Large Language Models offer powerful capabilities for data transformation, but reliably integrating them at scale into Apache Beam data pipelines presents challenges. Deploying powerful, large models (e.g., Gemma 27B, Llama 70B, DeepSeek R1) directly onto Beam workers via the RunInference API is often infeasible due to resource constraints, multi-GPU complexity, cost, and lack of serving optimizations. Furthermore, many frontier models like Gemini are only available via APIs. Therefore, this session focuses on effective Remote LLM inference integration with Apache Beam.
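
As a rough sketch of the remote-inference pattern (one option among several the session may cover), the DoFn below creates a Vertex AI Gemini client in setup() and calls it per element; the project, location, and model names are placeholders, and production use would add batching, retries, and rate limiting.

    import apache_beam as beam
    import vertexai
    from vertexai.generative_models import GenerativeModel


    class RemoteGeminiDoFn(beam.DoFn):
        """Illustrative remote-inference DoFn; identifiers are placeholders."""

        def __init__(self, project, location, model_name):
            self._project = project
            self._location = location
            self._model_name = model_name

        def setup(self):
            # One client per worker, so we do not re-initialize per element.
            vertexai.init(project=self._project, location=self._location)
            self._model = GenerativeModel(self._model_name)

        def process(self, text):
            response = self._model.generate_content(f"Summarize: {text}")
            yield {"input": text, "summary": response.text}


    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(["Apache Beam unifies batch and streaming processing."])
            | beam.ParDo(RemoteGeminiDoFn("my-project", "us-central1", "gemini-1.5-flash"))
        )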

Revisiting Splittable DoFn in KafkaIO

by Steven van Rossum

SDF and IO are tricky subjects, doubly so for SDF IO. This session dives deeper into the SDF read transform in KafkaIO and will highlight a few crucial performance issues that have been addressed in recent releases of Apache Beam. The session is aimed at IO contributors with the goal to emphasize the subtle nuances in SDF and IO code that can make or break performance.

Scalable Drug Discovery with Apache Beam: From R-Groups to Crystal Structures

by Joey Tran

In this talk, we’ll share how we use Apache Beam at Schrodinger to power key stages of the drug discovery pipeline, from R-group enumeration in lead optimization to crystal structure determination in drug formulation. Rather than relying on existing runners, we built our own execution engine on top of Beam’s powerful abstraction layer to better serve our domain-specific needs.

Scalable Prompt Optimization in Apache Beam LLM Workflows

by Tomi Ajakaiye

As Large Language Models (LLMs) become integral to data pipelines, optimizing prompts at scale is critical for consistency, cost control, and performance. In this session, you’ll learn how to embed prompt tuning and dynamic prompt generation into an LLM workflow that is executed as an Apache Beam pipeline.

Scaling Real-Time Feature Generation Platform @Lyft

by Rakesh Kumar

At Lyft, real-time feature generation is crucial for powering many business-critical use cases. This session describes how we leveraged Apache Beam to build a robust and scalable real-time feature generation platform for this purpose, capable of generating hundreds of millions of features per minute. We will delve into the critical factors that engineering teams should consider when designing a real-time feature generation platform, such as:

  • Data consistency and accuracy, with a focus on ownership and quality guarantees.
  • Latency requirements.
  • Performance optimization to ensure efficient feature serving.
  • Feature serving and downstream model execution pipelines.
  • Data lineage tools for improved traceability.
  • Strategies for designing for performance and minimizing infrastructure costs.

The presentation will discuss engineering challenges encountered while scaling the Beam pipeline to support our requirements and the lessons we learned along the way.

See the Full Picture: Integrating Beam/Dataflow into Your Distributed Traces

by Radek Stankiewicz, Steven van Rossum & Kenneth Knowles

Achieving holistic observability often hits a wall at asynchronous batch or streaming systems. While OpenTelemetry provides standards for tracing, integrating systems like Apache Beam/Dataflow requires specific considerations. This presentation details the successful integration of Beam pipelines with OpenTelemetry’s tracing APIs. We’ll explore the mechanisms for context propagation across Beam’s distributed workers and stages, enabling pipelines to join traces initiated by upstream services. Discover how spans generated within the Dataflow runner can be exported and visualized alongside the rest of your application traces in Google Cloud Trace, finally delivering the “full picture” of your system’s behavior, including its data processing components.
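
To illustrate one piece of the puzzle, here is a generic, hypothetical sketch (not the presenters’ implementation) of a DoFn that extracts a W3C trace context carried alongside each message and opens a child span for that element’s processing.

    import apache_beam as beam
    from opentelemetry import trace
    from opentelemetry.propagate import extract

    tracer = trace.get_tracer("beam.pipeline.example")  # placeholder tracer name


    class TracedProcess(beam.DoFn):
        """Hypothetical DoFn: each element is a (payload, carrier) pair, where the
        carrier dict holds W3C traceparent headers injected by the producer."""

        def process(self, element):
            payload, carrier = element
            parent_ctx = extract(carrier)  # rebuild the upstream trace context
            with tracer.start_as_current_span("process-element", context=parent_ctx):
                # ... domain logic goes here; the span is exported by whatever
                # OpenTelemetry exporter the worker is configured with ...
                yield payload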

Simplified Streaming Anomaly Detection with Apache Beam's Latest Transform

by Shunping Huang

This talk dives into the practical application of Apache Beam for constructing robust, scalable, and real-time anomaly detection pipelines. We’ll explore Beam’s latest anomaly detection transform, and how to integrate various anomaly detection algorithms (e.g., statistical methods, machine learning models) to identify critical outliers in continuous data streams.
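
The session centers on Beam’s new anomaly detection transform; as a point of comparison, here is a hand-rolled statistical sketch (not the new transform’s API) that flags per-key outliers by z-score within fixed windows, with placeholder data and thresholds.

    import statistics

    import apache_beam as beam
    from apache_beam.transforms import window


    def flag_outliers(kv, threshold=2.0):
        """Yield (key, value) pairs whose z-score within the window exceeds threshold."""
        key, values = kv
        vals = list(values)
        if len(vals) < 2:
            return
        mean = statistics.mean(vals)
        stdev = statistics.pstdev(vals) or 1e-9
        for v in vals:
            if abs(v - mean) / stdev > threshold:
                yield (key, v)


    with beam.Pipeline() as p:
        anomalies = (
            p
            | beam.Create([
                ("sensor-1", 1.0), ("sensor-1", 1.1), ("sensor-1", 0.9),
                ("sensor-1", 1.0), ("sensor-1", 1.2), ("sensor-1", 9.5),
            ])
            | beam.WindowInto(window.FixedWindows(60))
            | beam.GroupByKey()
            | beam.FlatMap(flag_outliers)
        )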

Streaming Databases with Bigtable and Apache Beam

by Christopher Crosbie

Discover how companies, including Google, leverage Apache Beam and Bigtable to instantly enrich data as it’s created. We’ll explore how Bigtable, Google’s powerful key-value database, serves as a perfect real-time data storage solution for Beam’s processing. Learn about the seamless integration between these services and see how you can take advantage of features like large-scale embedding generation and the Beam Enrichment transform with minimal coding.
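
As a pointer to the API involved, the sketch below wires Beam’s Enrichment transform to a Bigtable handler; the module paths and parameter names are assumptions to check against current documentation, and the project, instance, table, and row-key field are placeholders.

    import apache_beam as beam
    from apache_beam.transforms.enrichment import Enrichment
    from apache_beam.transforms.enrichment_handlers.bigtable import BigTableEnrichmentHandler

    # Placeholder Bigtable resources.
    handler = BigTableEnrichmentHandler(
        project_id="my-project",
        instance_id="my-instance",
        table_id="customer-profiles",
        row_key="customer_id",  # input-row field used as the Bigtable row key
    )

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([beam.Row(customer_id="c-42", event="purchase")])
            | Enrichment(handler)  # joins each element with its Bigtable row
        )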

Superpowering Agents with Apache Beam

by Konstantin Buschmeier & Jasper Van den Bossche

Large language models and agentic systems are rapidly changing how exploratory data analytics is performed. Natural language interactions and assisted querying unlock quick prototyping for business experts without data science or programming backgrounds, enabling them to get answers to business questions faster.

The limited context inherent in LLMs prevents direct analysis of large data volumes. Augmenting them with access to powerful tools, such as databases, query engines, and scalable data processing frameworks, overcomes this constraint.

Talk to your pipeline: how to use AI to create dynamic transforms in streaming

by Israel Herraiz

With some design, you can leverage Beam ML to have an LLM transform your questions into dynamic transformations that you can apply to your data. In this talk, we show how to use Gemma to accept natural-language questions about your data and offer an answer in real time, applied to the main stream of data.

The ASF Data Ecosystem: Bridging the Data Stream with Apache Beam

by JB Onofre

During this keynote, JB will provide an overview of the projects in the Apache Software Foundation that form a complete data ecosystem. From storage to analytics, including query engines and table formats, Apache projects are the building blocks of the data ecosystem. We will see the role Apache Beam plays in this ecosystem and how vendors like Dremio are powered by these projects.

Using Apache Beam to power and scale a data engineering transformation at a Financial Exchange

by Conall Bennett

At CME Group we are going through a multi-year transformation to run the world’s leading derivatives exchange on Google’s cloud platform.

This talk will focus on how we have been using Dataflow and Apache Beam as a core technology service to enable and transform how distributed data engineering teams across different business areas collaborate, build, and deliver data products for internal stakeholders and customers, at scale and with a much faster time to market.