A Deep Dive Into Beam Python Type Hinting

by Jack McCluskey
This session focuses on the mechanics of the Beam Python SDK’s type hinting infrastructure and best practices for users, along with a discussion of current limitations and future improvements.
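
As a quick illustration of the machinery the talk covers, here is a minimal sketch of decorator- and transform-level type hints in the Python SDK (the DoFn and element values are made up for the example):

    import apache_beam as beam
    from apache_beam.typehints import with_input_types, with_output_types

    # Declare input/output types on a DoFn so the SDK can check them at
    # pipeline construction time (and at runtime with --runtime_type_check).
    @with_input_types(str)
    @with_output_types(int)
    class WordLength(beam.DoFn):
        def process(self, element):
            yield len(element)

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(['beam', 'python']).with_output_types(str)
            | beam.ParDo(WordLength())
        )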

Architecting Real-Time Blockchain Intelligence with Apache Beam and Apache Kafka

by Vijay Shekhawat
At TRM Labs, we manage petabyte-scale data from over 30 blockchains to deliver customer-facing analytics. Our platform processes high-throughput data to extract actionable intelligence for critical decision-making. In this session, we will discuss how Apache Beam underpins our architecture by integrating with Apache Kafka for robust data ingestion and deploying on Google Cloud Dataflow to ensure scalability and fault tolerance. We will also delve into the complexities of handling massive volumes of blockchain data—peaking at up to one million events per second—in real time and computing complex metrics.

Data Quality in ML Pipelines

by Pritam Dodeja
This session demonstrates two approaches for integrating data quality into ML pipelines: a schema-based approach and a UDF-based approach, in which Apache Beam performs the data-quality filtering. Time permitting, it will also demonstrate how to integrate data-quality-related features into the dataset using a PreTransform component that takes in a UDF.
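
As an illustration of the UDF-based approach, here is a minimal sketch in which Beam filters records against a hypothetical quality predicate (the field names and checks are invented for the example):

    import apache_beam as beam

    def passes_quality_checks(record):
        # Hypothetical data-quality UDF: keep records with a non-empty
        # user_id and a positive amount.
        return bool(record.get('user_id')) and record.get('amount', 0) > 0

    with beam.Pipeline() as p:
        clean = (
            p
            | beam.Create([
                {'user_id': 'u1', 'amount': 3.5},
                {'user_id': '', 'amount': -1.0},  # dropped by the filter
            ])
            | 'FilterOnQuality' >> beam.Filter(passes_quality_checks)
        )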

Dataflow Streaming Innovations

by Tom Stepp
The presentation at this year’s Beam Summit will highlight the latest advancements in Dataflow Streaming. Join us to learn about fast job updates, KafkaIO improvements, streaming AI/ML enhancements, and performance and observability improvements.

Growing the Apache Beam Community: Resources, Contributions, and Collaboration

by Jana Polianskaja
Contribute to the Apache Beam community! This presentation guides developers—from beginners to experts—through a structured path to meaningful community engagement. We’ll cover essential resources, real-world contribution examples, and diverse collaboration opportunities, offering actionable strategies and inspiration for all experience levels.

How Beam serves models with vLLM

by Danny McCormick
Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool to do this. At the same time, models are getting larger and larger, and it can be hard to fit them into your CPU or GPU and to serve them efficiently. In particular, serving large language models efficiently has grown in importance and difficulty as models have continued to grow. vLLM is an open-source library specifically designed for high-throughput and low-latency LLM inference.
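
For context, a minimal sketch of serving an LLM through RunInference with the vLLM model handler shipped in recent Beam releases (the handler name, its arguments, and the model are assumptions to verify against your SDK version):

    import apache_beam as beam
    from apache_beam.ml.inference.base import RunInference
    from apache_beam.ml.inference.vllm_inference import VLLMCompletionsModelHandler

    # Assumed handler/arguments from recent Beam releases; the model name is
    # only illustrative.
    model_handler = VLLMCompletionsModelHandler(model_name='facebook/opt-125m')

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create(['What is Apache Beam?'])
            | RunInference(model_handler)
            | beam.Map(print)
        )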

Integrating LLMs and Embedding models into Beam pipelines using LangChain

by Ganesh Sivakumar
Large language models (LLMs) have transformed how we process and generate text. In this session, I’ll talk about Langchain-Beam, an open-source library that integrates LLMs and embedding models into Apache Beam pipelines as transforms using LangChain. We will explore how the Langchain-Beam transform performs remote LLM inference with OpenAI and Anthropic models, how to provide data processing logic as a prompt and use the models to transform data based on that prompt, and how to use embedding models to generate vector embeddings for text in the pipeline, along with real-world use cases.

Leveraging LLMs for Agentic Workflow Orchestration in Apache Beam YAML Pipelines

by Charles Adetiloye
This session explores how Large Language Models (LLMs) can be integrated into Apache Beam tooling to enable agentic orchestration of YAML-defined workflows. We present a system where LLMs parse, validate, and execute Beam YAML pipelines, acting as autonomous agents that enhance workflow automation and reduce manual intervention. The talk covers architecture, pipeline translation, task planning, and integration strategies for embedding LLMs in declarative workflow environments. Attendees will learn how to build intelligent tooling layers for Beam that support dynamic pipeline generation, error resolution, and adaptive execution—all while maintaining the flexibility and scalability of the Beam programming model.

Many Data Formats, One Data Lake

by Peter Wagener
Apache Beam has the flexibility to handle a wide variety of data formats: CSV, Avro, Parquet, Iceberg, … they can all be inputs and/or outputs for your data processing projects. The question quickly becomes: which do you choose? Our answer is a bit surprising: all of them. If you can define the schema appropriately within your pipelines, you can use the file format that makes the most sense for each use case.
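
A minimal sketch of that idea: read records in one schema’d format and write them out in another, with only the IO transforms changing (the paths and schema below are illustrative):

    import apache_beam as beam
    import pyarrow as pa

    # The same PCollection of schema'd records can land in whichever format
    # fits each use case; only the sink changes.
    parquet_schema = pa.schema([('user_id', pa.string()), ('amount', pa.float64())])

    with beam.Pipeline() as p:
        records = p | beam.io.ReadFromAvro('gs://my-bucket/events-*.avro')
        _ = records | beam.io.WriteToParquet(
            'gs://my-bucket/events',
            schema=parquet_schema,
            file_name_suffix='.parquet')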

Optimize parallelism for reading from Apache Kafka to Dataflow

by Supriya Koppa
Reading from Apache Kafka into Google Cloud Dataflow can present performance challenges if not configured correctly. This session provides a practical guide to troubleshooting common parallelism issues and implementing best practices for optimal performance. We’ll cover key aspects such as understanding Dataflow’s Kafka source, effectively utilizing maxNumRecords and maxReadTime, and addressing potential bottlenecks. Learn how to diagnose and resolve issues related to uneven parallelism and latency, ensuring your real-time data pipelines operate smoothly and efficiently, with references to the official Google Cloud Dataflow documentation.
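
For reference, a minimal sketch of the Python KafkaIO source with the bounding parameters the session mentions (the broker address and topic are placeholders; the bounds are mainly useful for testing, and a true streaming job would leave them unset):

    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka

    with beam.Pipeline() as p:
        messages = p | ReadFromKafka(
            consumer_config={
                'bootstrap.servers': 'broker-1:9092',
                'auto.offset.reset': 'earliest',
            },
            topics=['events'],
            # Bound the read for tests or batch-style runs.
            max_num_records=100_000,
            max_read_time=60,  # seconds
        )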

Real-time Threat Detection at Box with Apache Beam

by Abhishek Mishra
Box faces a constant barrage of sophisticated cybersecurity threats. This session dives into how Box leverages the Apache Beam Python SDK, combined with cutting-edge machine learning techniques, to build a real-time threat detection system. We’ll explore the unique challenges of processing high-volume, real-time data streams to identify and mitigate threats before they can impact our customers. The presentation will focus on the architecture of our Beam-based unified threat detection pipeline, highlighting the integration of machine learning models.

Sculpting Data for Machine Learning: Beam-Powered GenAI Edition

by Jigyasa Grover
While Generative AI models capture headlines, the foundation of any successful GenAI implementation remains quality data preparation at scale. Apache Beam provides the ideal framework for constructing robust data pipelines that can seamlessly process batch and streaming data for GenAI applications. This session will guide attendees through leveraging Beam’s unified model to curate, transform, and deliver data ready for modern GenAI systems. By combining Apache Beam with complementary technologies like Kafka for real-time streaming and Apache Iceberg for efficient data storage, we’ll demonstrate how to build an end-to-end GenAI data ecosystem.

Streaming Databases with Bigtable and Apache Beam

by Christopher Crosbie
Discover how companies, including Google, leverage Apache Beam and Bigtable to instantly enrich data as it’s created. We’ll explore how Bigtable, Google’s powerful key-value database, serves as a perfect real-time data storage solution for Beam’s processing. Learn about the seamless integration between these services and see how you can take advantage of features like large-scale embedding generation and the Beam Enrichment transform with minimal coding.
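
A minimal sketch of the Enrichment transform backed by Bigtable, assuming the handler available in recent Beam releases (the project, instance, table, and field names are illustrative, and the exact handler arguments should be checked against your SDK version):

    import apache_beam as beam
    from apache_beam.transforms.enrichment import Enrichment
    from apache_beam.transforms.enrichment_handlers.bigtable import (
        BigTableEnrichmentHandler)

    # Assumed handler/arguments; each input Row's customer_id is looked up as
    # the Bigtable row key and the matching columns are appended.
    handler = BigTableEnrichmentHandler(
        project_id='my-project',
        instance_id='my-instance',
        table_id='customer_profiles',
        row_key='customer_id')

    with beam.Pipeline() as p:
        enriched = (
            p
            | beam.Create([beam.Row(customer_id='c-123', amount=42.0)])
            | Enrichment(handler))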