| Title | Speaker(s) | Abstract |
|---|---|---|
| 'Beam'-ing Iceberg Tables Across Environments Using Project Nessie | Vineel Arekapudi | At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments, using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline: |
| Agentic Beam YAML Pipelines for Real-Time Predictive Intelligence on Streaming Data | Charles Adetiloye | This talk demonstrates how declarative Beam YAML pipelines, combined with LLM-powered agentic architectures, unlock real-time predictive intelligence on high-velocity sensor streams — using UAV flight telemetry as a live use case. We show how parameterized Beam YAML templates dynamically reconfigure processing logic at runtime without code changes, while AI agents perform anomaly detection, predictive trend analysis, preemptive failure reasoning, and adaptive alerting — all orchestrated within the pipeline itself. Attendees will learn practical patterns for embedding agentic workflows into Beam, wiring streaming data sources to ML inference transforms, and building pipelines that don’t just react to problems but anticipate them. The session highlights how Beam’s unified model powers the next generation of intelligent, self-reasoning data applications. |
| Agentic Workflows in Beam: The Opportunity Ahead of Us | Danny McCormick | Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well. |
| Beyond the Black Box: How Intuit Credit Karma Runs ML Explainability for 140M Members with Beam | Raj Katakam & Pallav Anand | Everyone talks about explainable AI. Almost no one runs it at scale in production. The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up. |
| Buffering Data by Timestamp: A Step Towards Time Series Processing in Beam | Shunping Huang & Claude van der Merwe | Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. While there is growing interest in processing time series data within Apache Beam, its inherently unordered and parallel execution model forces developers to implement complex custom logic to handle chronological events accurately. In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We will evaluate and compare various buffering approaches, weighing their trade-offs. Finally, we will demonstrate these concepts in action through a real-world anomaly detection use case utilizing the recently developed BigQuery CDC source. (A minimal illustrative sketch of timestamp-ordered buffering with a stateful DoFn appears after this table.) |
| Building a Real-Time Identity Graph: Beam + Cloud Spanner for Multi-Tenant Customer Resolution | Jonathan Sudhakar | Processing billions of monthly events across dozens of marketing clients requires resolving fragmented user identities in near real-time. This talk covers how we built a multi-tenant identity graph system using Apache Beam on Dataflow and Google Cloud Spanner, including composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. We’ll share lessons on schema design trade-offs, handling late-arriving data in identity merges, and how this foundation powers downstream ML models for predicted lifetime value. |
| Building Agentic Data Pipelines: Orchestrating AI Workflows with Apache Beam | Siddharth Gargava | AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making. We’ll cover practical patterns for building agent-driven pipelines, starting with agentic orchestration with Beam: structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources, and how Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints. |
| Closing the Loop: Powering Real-Time Clinical AI Automation with Apache Beam | Austin Bennett | The “last mile” of healthcare AI is often the hardest: turning a model prediction into an actual clinical action. Azra AI uses Apache Beam to close this loop. By unifying data ingestion and AI/ML within Beam pipelines, we automate the identification and navigation of patients needing urgent care. In this talk, we’ll outline our architecture for deploying real-time streams with AI/ML predictions to provide clinicians with up-to-the-minute insights. We will discuss the challenges of processing unstructured clinical text and how Beam enables us to scale our automation to serve patients across diverse hospital networks. |
| Evolution of the broad Beam Streaming IO Ecosystem towards production readiness | Yi Hu | While the Beam website lists 30+ built-in IO connectors that support streaming, the connectors are not equal in status (resilience, scalability, performance, etc.). This session highlights recent improvements to selected Beam streaming IO connectors where gaps between user demand and the status quo had been identified, including Debezium, Jms, Mqtt, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, with remarks on facilitating community engagement in the Beam IO ecosystem. |
| From Binary to Breach Detection: Normalizing OT and Mixed Logs for SecOps with Apache Beam | Canburak Tümer | Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation. In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into: |
| From Prompts to Pipelines: Scaling Data Engineering via Agent Skills | Canburak Tümer & Israel Herraiz | The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers. In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle. |
| From Raw Logs to Model Inference: Building a DGA Detection ML Pipeline with Beam | Aditya Patil & Raniya Rehman | Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models. (A small sketch of the Shannon-entropy feature as a Beam step appears after this table.) |
| Introducing a Modern SQL Experience in Apache Beam | Ahmed Abualsaud | For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures. No more! In this session, we’ll dive into the evolution of the Beam SQL story and introduce its new features. We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam. |
| Maximizing Performance, Reliability, and Scalability with Dataflow Streaming | Tom Stepp & Ryan Wigglesworth | This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures. |
| RAG-as-a-Service: Reusable AI Data Pipelines with Apache Beam | Vineel Arekapudi | In this session, we’ll explore how data engineering teams at Wells Fargo have built a RAG-as-a-Service platform that supports multiple AI applications across the enterprise. Using a real-world architecture, we’ll walk through the end-to-end pipeline: using Apache Beam to ingest PDFs and other documents submitted to a REST endpoint, preparing those documents for embedding, storing vectors in scalable vector databases, and orchestrating workflows using modern data platforms. Apache Beam handles the heavy lifting of the ingestion layer. Because different enterprise teams submit documents in very different ways, some in large overnight bulk loads and others one file at a time through the API, we needed something that could handle both without maintaining two separate codebases. Beam’s RunInference transform also lets us run embedding generation inside the pipeline itself rather than bolting on a separate service. (An illustrative embedding-step sketch appears after this table.) We’ll also discuss how to operationalize these pipelines using MLOps and AIOps practices, including versioning embeddings, monitoring retrieval quality, and managing prompt pipelines in production. We’ll also show how we integrated Google ADK to build agents that consume these embeddings, allowing enterprise teams to wire up LLMs to their document collections without having to understand what is happening underneath. Attendees will leave with a practical blueprint for building reusable document ingestion infrastructure that powers AI applications from copilots to intelligent search, while remaining scalable, governed, and production-ready. |
| Real-Time AI Pipelines at Scale: Embedding LLMs into Apache Beam for Live Inference | Samir Sengupta | As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams. We’ll cover: how to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference; designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data; handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization); deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration; and real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%. (An illustrative RunInference sketch appears after this table.) |
| Role of Apache Beam in Agentic AI Era | Ashwin Sampathkumar | As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence; it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events. This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration. |
| Scale Smarter: RateLimiting AI Inference at Dataflow Scale | Tarun Annapareddy | Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI). To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam, integrated directly into the RunInference transform and also available for custom DoFns. It moves quota management from reactive retry storms to proactive pacing. |
| Scaling near-duplicate detection using Apache Beam | Pablo Rodriguez Defino | Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug. This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows. (A simplified MinHash LSH sketch appears after this table.) |
| The Agent-Driven Pipeline: Real-Time Data Validation & Modeling using Apache Beam, MCP, and GenAI | Jay Jayakumar & Pablo Costamagna | Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data. We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today. |
| The Apache Lakehouse Ecosystem and Apache Beam's Role in It | JB Onofre | The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage. Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform. (A rough managed Iceberg write sketch appears after this table.) |
| The Autopilot Never Complained: Replay, Reason, Refly with Apache Beam | Charles Adetiloye & Ian MacDonald | A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly. A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced. |
| The Present and Future of Beam Python Performance | Jack McCluskey | A brief dive into improving Beam Python performance, looking both at currently available best practices and ahead to free-threaded Python runtimes. |
| Trust, But Verify: Petabyte scale validations using Beam & Dataflow | Manit Gupta | Migrating massive, mission-critical databases across heterogeneous systems is inherently terrifying because customers don’t just hope for a successful migration — they demand 100% data integrity. In this talk, we will explore why Cloud Spanner chose Apache Beam and Dataflow to implement scaled-out data validation for customers migrating to Spanner. We will dig into how Beam helps simplify executing complex data processing at scale, and how Dataflow reduces time-to-market by providing a stable, scaled-out Beam runner. |
| Using Python Memory Profilers with Apache Beam | Valentyn Tymofieiev | In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines. |
| When Fast Is Too Fast: Throughput Control Patterns in Apache Beam on Dataflow | Josi Aranda | Apache Beam and Google Cloud Dataflow are optimized for parallelism and autoscaling. In practice, that’s often exactly what we want, until it isn’t. What happens when your streaming pipeline suddenly scales from 5 to 50 workers and overwhelms a downstream REST API? Or a legacy service enforces strict QPS limits? Beam provides powerful primitives for windowing, batching, and scaling — but it does not offer first-class global throughput control. This talk explores advanced throughput control patterns specifically in the context of Apache Beam running on Google Cloud Dataflow. (A sketch of one such pattern appears after this table.) |
| Zero-Copy Iceberg Migrations with Apache Beam | Ahmed Abualsaud | Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers. This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte. In this session, we’ll explore: |
| Zero-loss Spanner to BigQuery Redeployment | Jiufeng Liu | Introducing a YAML-driven Dataflow Flex Template design that enables product teams to self-serve Spanner to BigQuery replication, supporting both current-state and append-only modes for analytics, downstream applications, and audit trails. This session focuses on streaming use cases. In a federated deployment model where each product team owns and operates its own pipeline, redeployments emerged as a recurring risk event. None of Dataflow’s existing restart mechanisms work perfectly for our use cases, leaving every redeployment at risk of data loss or duplicated data. |
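
Timestamp-ordered buffering (referenced from "Buffering Data by Timestamp" above) can be illustrated with a stateful DoFn. This is a minimal sketch under assumptions, not the speakers' implementation: it expects keyed, windowed string values, stores them in bag state, and re-emits them sorted by timestamp once the watermark passes the end of the window.

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder, TupleCoder, VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer


class OrderedBufferDoFn(beam.DoFn):
  """Buffers keyed elements and re-emits them in timestamp order once the
  watermark reaches the end of the window (illustrative sketch only)."""

  BUFFER = BagStateSpec('buffer', TupleCoder((VarIntCoder(), StrUtf8Coder())))
  FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

  def process(self,
              element,  # stateful DoFns require (key, value) input
              timestamp=beam.DoFn.TimestampParam,
              window=beam.DoFn.WindowParam,
              buffer=beam.DoFn.StateParam(BUFFER),
              flush=beam.DoFn.TimerParam(FLUSH)):
    _, value = element
    buffer.add((timestamp.micros, value))
    # Fire once the watermark passes the window end, i.e. when all
    # on-time data for this window should have arrived.
    flush.set(window.end)

  @on_timer(FLUSH)
  def emit_in_order(self, buffer=beam.DoFn.StateParam(BUFFER)):
    for _, value in sorted(buffer.read()):
      yield value
    buffer.clear()
```

In use, this would sit after a `beam.WindowInto(...)` on a keyed PCollection; late data and allowed lateness are ignored here for brevity.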
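
For the DGA-detection session, one of the lexical features mentioned, Shannon entropy, is easy to sketch as a plain Beam Map step. This is an illustrative snippet rather than the NJCCIC pipeline; the sample domains and feature dictionary keys are made up for the example.

```python
import math
from collections import Counter

import apache_beam as beam


def shannon_entropy(domain: str) -> float:
  """Shannon entropy (bits per character) of a domain string."""
  counts = Counter(domain)
  total = len(domain)
  return -sum((n / total) * math.log2(n / total) for n in counts.values())


with beam.Pipeline() as p:
  features = (
      p
      | beam.Create(['example.com', 'xjw3k9qzt0pl.net'])  # sample domains
      | beam.Map(lambda d: {
          'domain': d,
          'length': len(d),
          'entropy': round(shannon_entropy(d), 3),
      })
      | beam.Map(print))
```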
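
The embedding step described in the RAG-as-a-Service session can be approximated with Beam's MLTransform and a Hugging Face sentence-transformers embedder. A minimal sketch under assumptions: the model name, the `text` field, and the artifact location are placeholders, and this is not Wells Fargo's actual pipeline.

```python
import tempfile

import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.embeddings.huggingface import (
    SentenceTransformerEmbeddings)

# Placeholder values; swap in your own model, field names, and artifact path.
artifact_location = tempfile.mkdtemp()
embedder = SentenceTransformerEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    columns=['text'])

with beam.Pipeline() as p:
  (p
   | beam.Create([{'text': 'First document chunk from an ingested PDF.'},
                  {'text': 'Second chunk, submitted via the REST endpoint.'}])
   | MLTransform(
       write_artifact_location=artifact_location).with_transform(embedder)
   | beam.Map(print))  # each dict now carries an embedding for 'text'
```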
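
For the live-inference session, the basic shape of embedding an LLM in a Beam transform is RunInference with a Hugging Face model handler. A hedged sketch under assumptions: the task string, model id, and prompt are illustrative, and the production concerns the talk covers (batching, quantization, GPU placement, vLLM) are omitted.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.huggingface_inference import (
    HuggingFacePipelineModelHandler)

# Illustrative task/model; a real deployment would pin a vetted model and
# tune batching and device placement.
model_handler = HuggingFacePipelineModelHandler(
    task='text-generation',
    model='distilgpt2')

with beam.Pipeline() as p:
  (p
   | beam.Create(['Summarize: sensor 42 reported three voltage spikes today.'])
   | RunInference(model_handler)
   | beam.Map(lambda result: result.inference)  # PredictionResult -> raw output
   | beam.Map(print))
```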
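
The near-duplicate detection session describes a Java implementation of MinHash LSH; the banding idea itself is compact enough to sketch here in Python. The hash counts, shingle size, and candidate-pair step below are simplified assumptions for illustration, not the speaker's code.

```python
import hashlib

import apache_beam as beam

NUM_HASHES = 128          # MinHash signature length
BANDS = 32                # LSH bands
ROWS = NUM_HASHES // BANDS  # hashes per band


def shingles(text, n=5):
  """Character n-gram shingles of a document."""
  return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash(doc):
  """(doc_id, text) -> (doc_id, signature) using seeded SHA-1 as the hash family."""
  doc_id, text = doc
  grams = shingles(text)
  sig = [min(int.from_bytes(hashlib.sha1(f'{seed}:{g}'.encode()).digest()[:8],
                            'big') for g in grams)
         for seed in range(NUM_HASHES)]
  return doc_id, sig


def band_keys(doc):
  """Emit one (band_key, doc_id) pair per band so similar docs collide."""
  doc_id, sig = doc
  for b in range(BANDS):
    yield (b, hash(tuple(sig[b * ROWS:(b + 1) * ROWS]))), doc_id


def candidate_groups(kv):
  ids = sorted(kv[1])
  if len(ids) > 1:  # any bucket with 2+ docs is a near-duplicate candidate
    yield tuple(ids)


with beam.Pipeline() as p:
  (p
   | beam.Create([('doc-1', 'the quick brown fox jumps over the lazy dog'),
                  ('doc-2', 'the quick brown fox jumped over the lazy dog'),
                  ('doc-3', 'an entirely unrelated piece of text')])
   | 'MinHash' >> beam.Map(minhash)
   | 'Band' >> beam.FlatMap(band_keys)
   | 'Bucket' >> beam.GroupByKey()
   | 'Candidates' >> beam.FlatMap(candidate_groups)
   | 'Dedup' >> beam.Distinct()
   | beam.Map(print))
```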
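
The lakehouse talk mentions Beam's managed I/O connector for Apache Iceberg. As a rough sketch only: the catalog name, table name, and catalog properties below are placeholders, and the exact config keys may differ by Beam version.

```python
import apache_beam as beam
from apache_beam.transforms import managed

# Placeholder catalog/table configuration for illustration only.
iceberg_config = {
    'table': 'analytics.events',
    'catalog_name': 'demo_catalog',
    'catalog_properties': {
        'type': 'hadoop',
        'warehouse': 'gs://my-bucket/warehouse',
    },
}

with beam.Pipeline() as p:
  (p
   | beam.Create([beam.Row(event_id='e1', user='alice'),
                  beam.Row(event_id='e2', user='bob')])
   | managed.Write(managed.ICEBERG, config=iceberg_config))
```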
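
The throughput-control talk is about patterns rather than a single API, but one common shape is worth sketching: shard onto a small fixed key space, batch with GroupIntoBatches, and pace calls inside the DoFn. Everything here (shard count, QPS target, the `call_downstream_api` stub) is a made-up example of the general pattern, not the speaker's solution, and it only bounds per-instance throughput rather than enforcing a true global limit.

```python
import time

import apache_beam as beam


def call_downstream_api(batch):
  """Stub standing in for a rate-limited REST call."""
  return {'sent': len(batch)}


class PacedApiCallFn(beam.DoFn):
  """Sends batches downstream, sleeping so each DoFn instance stays under a QPS cap."""

  def __init__(self, max_qps=2.0):
    self._min_interval = 1.0 / max_qps
    self._last_call = 0.0

  def process(self, keyed_batch):
    _, batch = keyed_batch
    wait = self._min_interval - (time.monotonic() - self._last_call)
    if wait > 0:
      time.sleep(wait)
    self._last_call = time.monotonic()
    yield call_downstream_api(list(batch))


with beam.Pipeline() as p:
  (p
   | beam.Create([{'id': i} for i in range(100)])
   # A fixed, small shard count bounds how many bundles hit the API at once.
   | beam.Map(lambda e: (e['id'] % 4, e))
   | beam.GroupIntoBatches(batch_size=10)
   | beam.ParDo(PacedApiCallFn(max_qps=2.0))
   | beam.Map(print))
```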