| Title | Speaker(s) | Abstract |
|---|---|---|
| 'Beam'-ing Iceberg Tables Across Environments Using Project Nessie | Vineel Arekapudi | At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments, using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline: |
| Agentic Beam YAML Pipelines for Real-Time Predictive Intelligence on Streaming Data | Charles Adetiloye | This talk demonstrates how declarative Beam YAML pipelines, combined with LLM-powered agentic architectures, unlock real-time predictive intelligence on high-velocity sensor streams — using UAV flight telemetry as a live use case. We show how parameterized Beam YAML templates dynamically reconfigure processing logic at runtime without code changes, while AI agents perform anomaly detection, predictive trend analysis, preemptive failure reasoning, and adaptive alerting — all orchestrated within the pipeline itself. Attendees will learn practical patterns for embedding agentic workflows into Beam, wiring streaming data sources to ML inference transforms, and building pipelines that don’t just react to problems but anticipate them. The session highlights how Beam’s unified model powers the next generation of intelligent, self-reasoning data applications. |
| Agentic Workflows in Beam: The Opportunity Ahead of Us | Danny McCormick | Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well. |
| Beyond the Black Box: How Intuit Credit Karma Runs ML Explainability for 140M Members with Beam | Raj Katakam & Pallav Anand | Everyone talks about explainable AI. Almost no one runs it at scale in production. The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up. |
| Buffering Data by Timestamp: A Step Towards Time Series Processing in Beam | Shunping Huang & Claude van der Merwe | Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. While there is growing interest in processing time series data within Apache Beam, its inherently unordered and parallel execution model forces developers to implement complex custom logic to handle chronological events accurately. In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We will evaluate and compare various buffering approaches, weighing their trade-offs. Finally, we will demonstrate these concepts in action through a real-world anomaly detection use case utilizing the recently developed BigQuery CDC source. (A minimal illustrative sketch of timestamp-ordered buffering with a stateful DoFn appears after this table.) |
| Building a Real-Time Identity Graph: Beam + Cloud Spanner for Multi-Tenant Customer Resolution | Jonathan Sudhakar | Processing billions of monthly events across dozens of marketing clients requires resolving fragmented user identities in near real-time. This talk covers how we built a multi-tenant identity graph system using Apache Beam on Dataflow and Google Cloud Spanner, including composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. We’ll share lessons on schema design trade-offs, handling late-arriving data in identity merges, and how this foundation powers downstream ML models for predicted lifetime value. |
| Building Agentic Data Pipelines: Orchestrating AI Workflows with Apache Beam | Siddharth Gargava | AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making. We’ll cover practical patterns for building agent-driven pipelines, starting with agentic orchestration with Beam: structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources, and how Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints. |
| Closing the Loop: Powering Real-Time Clinical AI Automation with Apache Beam | Austin Bennett | The “last mile” of healthcare AI is often the hardest: turning a model prediction into an actual clinical action. Azra AI uses Apache Beam to close this loop. By unifying data ingestion and AI/ML within Beam pipelines, we automate the identification and navigation of patients needing urgent care. In this talk, we’ll outline our architecture for deploying real-time streams with AI/ML predictions to provide clinicians with up-to-the-minute insights. We will discuss the challenges of processing unstructured clinical text and how Beam enables us to scale our automation to serve patients across diverse hospital networks. |
| Evolution of the broad Beam Streaming IO Ecosystem towards production readiness | Yi Hu | While the Beam website lists 30+ built-in IO connectors that support streaming, the connectors are not equal in status (resilience, scalability, performance, etc.). This session highlights recent improvements to selected Beam streaming IO connectors where gaps between user demand and the status quo had been identified, including Debezium, Jms, Mqtt, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, with remarks on facilitating community engagement in the Beam IO ecosystem. |
| From Binary to Breach Detection: Normalizing OT and Mixed Logs for SecOps with Apache Beam | Canburak Tümer | Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation. In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into: |
| From Prompts to Pipelines: Scaling Data Engineering via Agent Skills | Canburak Tümer & Israel Herraiz | The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers. In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle. |
| From Raw Logs to Model Inference: Building a DGA Detection ML Pipeline with Beam | Aditya Patil & Raniya Rehman | Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models. (A small sketch of the Shannon-entropy feature as a Beam step appears after this table.) |
| Introducing a Modern SQL Experience in Apache Beam | Ahmed Abualsaud | For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures. No more! In this session, we’ll dive into the evolution of the Beam SQL story and introduce its new features. We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam. |
| Maximizing Performance, Reliability, and Scalability with Dataflow Streaming | Tom Stepp & Ryan Wigglesworth | This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures. |
| RAG-as-a-Service: Reusable AI Data Pipelines with Apache Beam | Vineel Arekapudi | In this session, we’ll explore how data engineering teams at Wells Fargo have built a RAG-as-a-Service platform that supports multiple AI applications across the enterprise. Using a real-world architecture, we’ll walk through the end-to-end pipeline: using Apache Beam to ingest PDFs and other documents submitted to a REST endpoint, preparing those documents for embedding, storing vectors in scalable vector databases, and orchestrating workflows using modern data platforms. Apache Beam handles the heavy lifting of the ingestion layer. Because different enterprise teams submit documents in very different ways, some in large overnight bulk loads and others one file at a time through the API, we needed something that could handle both without maintaining two separate codebases. Beam’s RunInference transform also lets us run embedding generation inside the pipeline itself rather than bolting on a separate service. (An illustrative embedding-step sketch appears after this table.) We’ll also discuss how to operationalize these pipelines using MLOps and AIOps practices, including versioning embeddings, monitoring retrieval quality, and managing prompt pipelines in production. We’ll also show how we integrated Google ADK to build agents that consume these embeddings, allowing enterprise teams to wire up LLMs to their document collections without having to understand what is happening underneath. Attendees will leave with a practical blueprint for building reusable document ingestion infrastructure that powers AI applications from copilots to intelligent search, while remaining scalable, governed, and production-ready. |
| Real-Time AI Pipelines at Scale: Embedding LLMs into Apache Beam for Live Inference | Samir Sengupta | As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams. We’ll cover: how to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference; designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data; handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization); deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration; and real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%. (An illustrative RunInference sketch appears after this table.) |
| Role of Apache Beam in Agentic AI Era | Ashwin Sampathkumar | As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence; it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events. This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration. |
| Scale Smarter: RateLimiting AI Inference at Dataflow Scale | Tarun Annapareddy | Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI). To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam, integrated directly into the RunInference transform and also available for custom DoFns. It moves quota management from reactive retry storms to proactive pacing. |
| Scaling near-duplicate detection using Apache Beam | Pablo Rodriguez Defino | Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug. This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows. (A simplified MinHash LSH sketch appears after this table.) |
| The Agent-Driven Pipeline: Real-Time Data Validation & Modeling using Apache Beam, MCP, and GenAI | Jay Jayakumar & Pablo Costamagna | Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data. We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today. |
| The Apache Lakehouse Ecosystem and Apache Beam's Role in It | JB Onofre | The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage. Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform. (A rough managed Iceberg write sketch appears after this table.) |
| The Autopilot Never Complained: Replay, Reason, Refly with Apache Beam | Charles Adetiloye & Ian MacDonald | A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly. A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced. |
| The Present and Future of Beam Python Performance | Jack McCluskey | A brief dive into improving Beam Python performance, looking both at currently available best practices and ahead to free-threaded Python runtimes. |
| Trust, But Verify: Petabyte scale validations using Beam & Dataflow | Manit Gupta | Migrating massive, mission-critical databases across heterogeneous systems is inherently terrifying because customers don’t just hope for a successful migration — they demand 100% data integrity. In this talk, we will explore why Cloud Spanner chose Apache Beam and Dataflow to implement scaled-out data validation for customers migrating to Spanner. We will dig into how Beam helps simplify executing complex data processing at scale, and how Dataflow reduces time-to-market by providing a stable, scaled-out Beam runner. |
| Using Python Memory Profilers with Apache Beam | Valentyn Tymofieiev | In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines. |
| When Fast Is Too Fast: Throughput Control Patterns in Apache Beam on Dataflow | Josi Aranda | Apache Beam and Google Cloud Dataflow are optimized for parallelism and autoscaling. In practice, that’s often exactly what we want, until it isn’t. What happens when your streaming pipeline suddenly scales from 5 to 50 workers and overwhelms a downstream REST API? Or a legacy service enforces strict QPS limits? Beam provides powerful primitives for windowing, batching, and scaling — but it does not offer first-class global throughput control. This talk explores advanced throughput control patterns specifically in the context of Apache Beam running on Google Cloud Dataflow. (A sketch of one such pattern appears after this table.) |
| Zero-Copy Iceberg Migrations with Apache Beam | Ahmed Abualsaud | Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers. This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte. In this session, we’ll explore: |
| Zero-loss Spanner to BigQuery Redeployment | Jiufeng Liu | Introducing a YAML-driven Dataflow Flex Template design that enables product teams to self-serve Spanner to BigQuery replication, supporting both current-state and append-only modes for analytics, downstream applications, and audit trails. This session focuses on streaming use cases. In a federated deployment model where each product team owns and operates its own pipeline, redeployments emerged as a recurring risk event. None of Dataflow’s existing restart mechanisms work perfectly for our use cases, leaving every redeployment at risk of data loss or duplicated data. |
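
Timestamp-ordered buffering (referenced from "Buffering Data by Timestamp" above) can be illustrated with a stateful DoFn. This is a minimal sketch under assumptions, not the speakers' implementation: it expects keyed, windowed string values, stores them in bag state, and re-emits them sorted by timestamp once the watermark passes the end of the window.

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder, TupleCoder, VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class OrderedBufferDoFn(beam.DoFn):
  """Buffers keyed elements and re-emits them in timestamp order once the
  watermark reaches the end of the window (illustrative sketch only)."""

  BUFFER = BagStateSpec('buffer', TupleCoder((VarIntCoder(), StrUtf8Coder())))
  FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

  def process(self,
              element,  # stateful DoFns require (key, value) input
              timestamp=beam.DoFn.TimestampParam,
              window=beam.DoFn.WindowParam,
              buffer=beam.DoFn.StateParam(BUFFER),
              flush=beam.DoFn.TimerParam(FLUSH)):
    _, value = element
    buffer.add((timestamp.micros, value))
    # Fire once the watermark passes the window end, i.e. when all
    # on-time data for this window should have arrived.
    flush.set(window.end)

  @on_timer(FLUSH)
  def emit_in_order(self, buffer=beam.DoFn.StateParam(BUFFER)):
    for _, value in sorted(buffer.read()):
      yield value
    buffer.clear()
```

In use, this would sit after a `beam.WindowInto(...)` on a keyed PCollection; late data and allowed lateness are ignored here for brevity.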
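
For the DGA-detection session, one of the lexical features mentioned, Shannon entropy, is easy to sketch as a plain Beam Map step. This is an illustrative snippet rather than the NJCCIC pipeline; the sample domains and feature dictionary keys are made up for the example.

```python
import math
from collections import Counter

import apache_beam as beam


def shannon_entropy(domain: str) -> float:
  """Shannon entropy (bits per character) of a domain string."""
  counts = Counter(domain)
  total = len(domain)
  return -sum((n / total) * math.log2(n / total) for n in counts.values())


with beam.Pipeline() as p:
  features = (
      p
      | beam.Create(['example.com', 'xjw3k9qzt0pl.net'])  # sample domains
      | beam.Map(lambda d: {
          'domain': d,
          'length': len(d),
          'entropy': round(shannon_entropy(d), 3),
      })
      | beam.Map(print))
```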
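
The embedding step described in the RAG-as-a-Service session can be approximated with Beam's MLTransform and a Hugging Face sentence-transformers embedder. A minimal sketch under assumptions: the model name, the `text` field, and the artifact location are placeholders, and this is not Wells Fargo's actual pipeline.

```python
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.huggingface import (
    SentenceTransformerEmbeddings)

# Placeholder values; swap in your own model, field names, and artifact path.
artifact_location = tempfile.mkdtemp()
embedder = SentenceTransformerEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    columns=['text'])

with beam.Pipeline() as p:
  (p
   | beam.Create([{'text': 'First document chunk from an ingested PDF.'},
                  {'text': 'Second chunk, submitted via the REST endpoint.'}])
   | MLTransform(
       write_artifact_location=artifact_location).with_transform(embedder)
   | beam.Map(print))  # each dict now carries an embedding for 'text'
```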
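
For the live-inference session, the basic shape of embedding an LLM in a Beam transform is RunInference with a Hugging Face model handler. A hedged sketch under assumptions: the task string, model id, and prompt are illustrative, and the production concerns the talk covers (batching, quantization, GPU placement, vLLM) are omitted.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.huggingface_inference import (
    HuggingFacePipelineModelHandler)

# Illustrative task/model; a real deployment would pin a vetted model and
# tune batching and device placement.
model_handler = HuggingFacePipelineModelHandler(
    task='text-generation',
    model='distilgpt2')

with beam.Pipeline() as p:
  (p
   | beam.Create(['Summarize: sensor 42 reported three voltage spikes today.'])
   | RunInference(model_handler)
   | beam.Map(lambda result: result.inference)  # PredictionResult -> raw output
   | beam.Map(print))
```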
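
The near-duplicate detection session describes a Java implementation of MinHash LSH; the banding idea itself is compact enough to sketch here in Python. The hash counts, shingle size, and candidate-pair step below are simplified assumptions for illustration, not the speaker's code.

```python
import hashlib

import apache_beam as beam

NUM_HASHES = 128          # MinHash signature length
BANDS = 32                # LSH bands
ROWS = NUM_HASHES // BANDS  # hashes per band


def shingles(text, n=5):
  """Character n-gram shingles of a document."""
  return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash(doc):
  """(doc_id, text) -> (doc_id, signature) using seeded SHA-1 as the hash family."""
  doc_id, text = doc
  grams = shingles(text)
  sig = [min(int.from_bytes(hashlib.sha1(f'{seed}:{g}'.encode()).digest()[:8],
                            'big') for g in grams)
         for seed in range(NUM_HASHES)]
  return doc_id, sig


def band_keys(doc):
  """Emit one (band_key, doc_id) pair per band so similar docs collide."""
  doc_id, sig = doc
  for b in range(BANDS):
    yield (b, hash(tuple(sig[b * ROWS:(b + 1) * ROWS]))), doc_id


def candidate_groups(kv):
  ids = sorted(kv[1])
  if len(ids) > 1:  # any bucket with 2+ docs is a near-duplicate candidate
    yield tuple(ids)


with beam.Pipeline() as p:
  (p
   | beam.Create([('doc-1', 'the quick brown fox jumps over the lazy dog'),
                  ('doc-2', 'the quick brown fox jumped over the lazy dog'),
                  ('doc-3', 'an entirely unrelated piece of text')])
   | 'MinHash' >> beam.Map(minhash)
   | 'Band' >> beam.FlatMap(band_keys)
   | 'Bucket' >> beam.GroupByKey()
   | 'Candidates' >> beam.FlatMap(candidate_groups)
   | 'Dedup' >> beam.Distinct()
   | beam.Map(print))
```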
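
The lakehouse talk mentions Beam's managed I/O connector for Apache Iceberg. As a rough sketch only: the catalog name, table name, and catalog properties below are placeholders, and the exact config keys may differ by Beam version.

```python
import apache_beam as beam
from apache_beam.transforms import managed

# Placeholder catalog/table configuration for illustration only.
iceberg_config = {
    'table': 'analytics.events',
    'catalog_name': 'demo_catalog',
    'catalog_properties': {
        'type': 'hadoop',
        'warehouse': 'gs://my-bucket/warehouse',
    },
}

with beam.Pipeline() as p:
  (p
   | beam.Create([beam.Row(event_id='e1', user='alice'),
                  beam.Row(event_id='e2', user='bob')])
   | managed.Write(managed.ICEBERG, config=iceberg_config))
```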
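
The throughput-control talk is about patterns rather than a single API, but one common shape is worth sketching: shard onto a small fixed key space, batch with GroupIntoBatches, and pace calls inside the DoFn. Everything here (shard count, QPS target, the `call_downstream_api` stub) is a made-up example of the general pattern, not the speaker's solution, and it only bounds per-instance throughput rather than enforcing a true global limit.

```python
import time

import apache_beam as beam


def call_downstream_api(batch):
  """Stub standing in for a rate-limited REST call."""
  return {'sent': len(batch)}


class PacedApiCallFn(beam.DoFn):
  """Sends batches downstream, sleeping so each DoFn instance stays under a QPS cap."""

  def __init__(self, max_qps=2.0):
    self._min_interval = 1.0 / max_qps
    self._last_call = 0.0

  def process(self, keyed_batch):
    _, batch = keyed_batch
    wait = self._min_interval - (time.monotonic() - self._last_call)
    if wait > 0:
      time.sleep(wait)
    self._last_call = time.monotonic()
    yield call_downstream_api(list(batch))


with beam.Pipeline() as p:
  (p
   | beam.Create([{'id': i} for i in range(100)])
   # A fixed, small shard count bounds how many bundles hit the API at once.
   | beam.Map(lambda e: (e['id'] % 4, e))
   | beam.GroupIntoBatches(batch_size=10)
   | beam.ParDo(PacedApiCallFn(max_qps=2.0))
   | beam.Map(print))
```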