Beam Summit 2026 schedule.

Monday, June 22, 2026

10:15 - 10:30
Welcome
10:30 - 11:00
Keynote
11:00 - 11:30
11:30 - 11:45
Coffee break
11:45 - 12:15
12:15 - 12:45
12:45 - 13:15
13:15 - 14:00
Lunch
14:00 - 14:30
14:30 - 15:00
15:00 - 15:30
15:30 - 15:45
Coffee Break
15:45 - 16:30
16:30 - 17:00
17:00 - 17:15
Reception
11:00 - 11:30. The Apache Lakehouse Ecosystem and Apache Beam's Role in It
By JB Onofre
Room: Pitch Pine

The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage.

Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform.
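
For readers unfamiliar with the managed I/O connector mentioned above, here is a minimal Python sketch of writing schema'd rows into an Iceberg table through Beam's managed transforms. The table, catalog name, and catalog properties are placeholders, exact configuration keys can vary by Beam release, and the transform expands through a cross-language service at runtime.

```python
import apache_beam as beam

# Placeholder table/catalog values; requires a Beam release with managed Iceberg I/O.
ICEBERG_CONFIG = {
    "table": "analytics.users",
    "catalog_name": "demo_catalog",
    "catalog_properties": {
        "type": "hadoop",
        "warehouse": "gs://example-bucket/warehouse",
    },
}

with beam.Pipeline() as p:
    (p
     | "CreateRows" >> beam.Create([beam.Row(id=1, name="alice"),
                                    beam.Row(id=2, name="bob")])
     | "WriteToIceberg" >> beam.managed.Write(beam.managed.ICEBERG, config=ICEBERG_CONFIG))
```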

11:45 - 12:15. Agentic Workflows in Beam: The Opportunity Ahead of Us
By Danny McCormick
Room: Pitch Pine

Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well.

This talk will discuss why streaming data processing systems are broadly well positioned to safely scale and deploy agentic workflows, and why Beam is a particularly good fit. It will then spend some time talking about the gaps in Beam (and other systems), and how moderate targeted investments can close those gaps.

Attendees can expect to come away with a high level understanding of agentic workflows, the current state of agentic workflows in Beam, and the path forward to take Beam’s agentic support from solid to excellent.

11:45 - 12:15. From Binary to Breach Detection: Normalizing OT and Mixed Logs for SecOps with Apache Beam
By Canburak Tumer
Room: Hackberry

Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation.

In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into:

Handling the “Mixed Stream” Challenge: How to use Beam to demultiplex and route mixed log types from a single Kafka or Pub/Sub source.

Decoding the Undecodable: Patterns for processing proprietary binary protocols and industrial equipment logs at scale.

Architecting for Reliability: Implementing Dead Letter Queues (DLQs) and rejected message sinks to ensure zero data loss during ingestion.

The Ingestion Decision Tree: A comparison of ingestion methods (API vs. GCS vs. Pub/Sub) and how to choose the right sink for your security requirements.

Attendees will leave with a blueprint for building resilient, scalable security data pipelines that turn raw industrial signals into actionable security intelligence.
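
As a rough illustration of the routing and dead-letter patterns described above, the sketch below tags each record by detected format and sends unparseable payloads to a dead-letter output. The format checks and sample payloads are hypothetical stand-ins, not real S7/Modbus or syslog framing.

```python
import json
import apache_beam as beam

class RouteAndNormalize(beam.DoFn):
    """Demultiplexes a mixed log stream, sending unparseable records to a DLQ."""
    DLQ = 'dead_letter'

    def process(self, raw):
        try:
            if raw.startswith(b'{'):              # structured JSON log
                yield beam.pvalue.TaggedOutput('json', json.loads(raw))
            elif raw[:1] == b'\x03':              # hypothetical binary framing marker
                yield beam.pvalue.TaggedOutput('binary', raw)
            else:                                 # plain-text / syslog style
                yield beam.pvalue.TaggedOutput('text', raw.decode('utf-8'))
        except Exception as err:
            yield beam.pvalue.TaggedOutput(self.DLQ, {'payload': raw, 'error': str(err)})

with beam.Pipeline() as p:
    routed = (
        p
        | 'SampleLogs' >> beam.Create([b'{"src": "fw01", "event": "login"}',
                                       b'\x03\x00\x00\x16',
                                       b'plain syslog line'])
        | 'Route' >> beam.ParDo(RouteAndNormalize())
                         .with_outputs('json', 'binary', 'text', RouteAndNormalize.DLQ))
    # routed.json / routed.binary / routed.text would feed format-specific UDM normalizers;
    # routed.dead_letter would go to a rejected-message sink for later replay.
    routed.json | 'Peek' >> beam.Map(print)
```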

12:15 - 12:45. Building Agentic Data Pipelines: Orchestrating AI Workflows with Apache Beam
By Siddharth Gargava
Room: Pitch Pine

AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making.

We’ll cover practical patterns for building agent-driven pipelines:

Agentic Orchestration with Beam: Structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources. How Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints.

Real-time ML Integration: Embedding models directly in Beam pipelines for low-latency inference. Patterns for fraud detection, anomaly detection, and personalized recommendations where agents act on streaming insights.

State and Error Handling: Managing agent state across pipeline stages, checkpoint strategies for long-running agentic workflows, and graceful recovery when tools or models fail.

Connecting Disparate Systems: Integrating Beam with Kafka, Iceberg, and cloud APIs to give agents access to the data they need without rebuilding your infrastructure.

Attendees will leave with:

  • Patterns for orchestrating agentic workflows in Beam pipelines
  • Strategies for embedding ML inference in streaming architectures
  • A framework for state management and error recovery in agent-driven systems

Built for data engineers ready to move from static pipelines to autonomous, ML-powered data workflows.

12:15 - 12:45. The Present and Future of Beam Python Performance
By Jack McCluskey
Room: Hackberry

A brief dive into improving Beam Python performance, covering currently available best practices and looking ahead to free-threaded Python runtimes.

12:45 - 13:15. Scaling near-duplicate detection using Apache Beam
By Pablo Rodriguez Defino
Room: Pitch Pine

Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug.

This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows.

Key Technical Takeaways:

Multi-Level Parallelism: Decomposing MinHash LSH into document-level, element-level, and global-level parallel transforms within Beam.

Unified State: Implementing an external “Global Memory” using high-throughput state stores to bridge the gap between processing modes.

Hardware Density: Offloading compute-intensive hashing to TPUs via Dataflow to drastically reduce the cost-per-token processed.

Reusable Components: Deploying low-latency membership inference to prevent “corpus poisoning” during live web crawls by reusing identical batch logic.

We will conclude with a live demonstration using the Common Crawl dataset, showcasing the pipeline’s ability to intercept near-duplicate documents in real-time with zero logic drift.
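
The talk's pipeline is built on a Java codebase; purely to illustrate how MinHash LSH decomposes into the parallel stages listed above, here is a small Python sketch: signature computation per document, band keys per element, and a global group-by to surface candidate near-duplicates. The constants and hashing choices are illustrative, not the presenter's.

```python
import hashlib
import re
import apache_beam as beam

NUM_PERM = 64   # hash permutations per MinHash signature
BANDS = 16      # LSH bands; NUM_PERM // BANDS rows per band

def shingles(text, k=5):
    text = re.sub(r'\s+', ' ', text.lower()).strip()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text):
    """Document-level work: one signature per document."""
    sig = [float('inf')] * NUM_PERM
    for sh in shingles(text):
        for i in range(NUM_PERM):
            h = int(hashlib.blake2b(f'{i}:{sh}'.encode(), digest_size=8).hexdigest(), 16)
            sig[i] = min(sig[i], h)
    return sig

def band_keys(element):
    """Element-level work: one stable string key per LSH band."""
    doc_id, text = element
    sig = minhash_signature(text)
    rows = NUM_PERM // BANDS
    for b in range(BANDS):
        band = sig[b * rows:(b + 1) * rows]
        yield (f'{b}:' + ','.join(map(str, band)), doc_id)

with beam.Pipeline() as p:
    (p
     | beam.Create([('doc-1', 'the quick brown fox jumps over the lazy dog'),
                    ('doc-2', 'the quick brown fox jumps over the lazy dog!'),
                    ('doc-3', 'an entirely unrelated document about beam pipelines')])
     | beam.FlatMap(band_keys)
     | beam.GroupByKey()                        # global-level work
     | beam.Map(lambda kv: sorted(set(kv[1])))
     | beam.Filter(lambda docs: len(docs) > 1)  # candidate groups (may repeat across bands)
     | beam.Map(print))
```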

12:45 - 13:15. Using Python Memory Profilers with Apache Beam
By Valentyn Tymofieiev
Room: Hackberry

In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines.

14:00 - 14:30. Beyond the Black Box: How Intuit Credit Karma Runs ML Explainability for 140M Members with Beam
By Raj Katakam & Pallav Anand
Room: Pitch Pine

Everyone talks about explainable AI. Almost no one runs it at scale in production.

The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users all need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up.

At Intuit Credit Karma, we built and operate a production ML explainability platform on Apache Beam and Dataflow — and we’ll show you exactly how. Beam’s unified programming model gave us the horizontal scalability to run attribution computation across massive prediction volumes without blowing up cost or latency. Its flexible I/O and pipeline composition patterns let us stitch together a three-stage architecture: feature log preparation, distributed attribution generation grouped by model version, and downstream delivery with correctness guardrails — all orchestrated on a daily cadence.

We’ll walk through the real engineering decisions: how we used Apache Beam pipelines to group predictions by model version and load the corresponding model dynamically — ensuring explanations always reflect the exact model that made each prediction even as models refresh continuously in production; how we aggregated feature-level attributions into human-readable reason codes configurable for any audience; and how we designed the pipeline to be privacy-preserving by construction, with no sensitive data leaking through intermediate stages.

We’ll also be honest about where Beam shines for this use case and where the rough edges are — including version compatibility with ML attribution libraries, managing model artifacts in distributed workers, and cost tradeoffs vs. managed explainability services.

Whether you’re building explainability infrastructure, running large-scale batch ML pipelines, or just trying to understand what Apache Beam can do beyond ETL — this talk will show you what production explainability looks like when you build it the right way, at real scale.
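
The version-matching step described above can be pictured with a short sketch: predictions are keyed by the model version that produced them, and each group is explained with that exact artifact. The loader and attribution calls below are hypothetical stand-ins, not Credit Karma's code.

```python
import apache_beam as beam

def load_model_artifact(version):
    # Stand-in for fetching the serialized model matching `version` from a registry.
    return f'model-{version}'

def explain(model, row):
    # Stand-in for a per-prediction feature attribution call (e.g. SHAP values).
    top = max(range(len(row['features'])), key=row['features'].__getitem__)
    return {'model': model, 'prediction_id': row['id'], 'top_feature_index': top}

class ExplainWithMatchingModel(beam.DoFn):
    """Loads one model version per key and attributes only its own predictions."""
    def process(self, element):
        version, rows = element
        model = load_model_artifact(version)
        for row in rows:
            yield explain(model, row)

with beam.Pipeline() as p:
    (p
     | beam.Create([{'id': 'p1', 'model_version': 'v42', 'features': [0.1, 0.9]},
                    {'id': 'p2', 'model_version': 'v41', 'features': [0.7, 0.2]}])
     | beam.Map(lambda row: (row['model_version'], row))
     | beam.GroupByKey()
     | beam.ParDo(ExplainWithMatchingModel())
     | beam.Map(print))
```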

14:00 - 14:30. Role of Apache Beam in Agentic AI Era
By Ashwin Sampathkumar
Room: Hackberry

As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence, it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events.

This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration.

14:30 - 15:20. Real-Time AI Pipelines at Scale: Embedding LLMs into Apache Beam for Live Inference
By Samir Sengupta
Room: Hackberry

As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams.

We’ll cover:

  • How to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference
  • Designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data
  • Handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization)
  • Deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration
  • Real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%

This isn’t a toy demo. It’s a battle-tested architecture handling 10M+ daily events with 99.9% uptime. Attendees will leave with concrete patterns they can apply to fraud detection, anomaly detection, semantic search, and personalized recommendation systems.
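
As a point of reference for embedding models in transforms, a minimal RunInference sketch with Beam's Hugging Face pipeline handler looks roughly like this, assuming a recent Beam release. The model name and handler options are placeholders, and the talk's vLLM/RAG stack involves considerably more than this.

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.huggingface_inference import HuggingFacePipelineModelHandler

# Placeholder model; RunInference batches elements and shares the loaded model per worker.
model_handler = HuggingFacePipelineModelHandler(task='text-generation', model='distilgpt2')

with beam.Pipeline() as p:
    (p
     | 'Prompts' >> beam.Create(['Summarize this transaction history:',
                                 'Classify this support ticket:'])
     | 'Infer' >> RunInference(model_handler)
     | 'Print' >> beam.Map(print))
```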

15:00 - 15:30. From Raw Logs to Model Inference: Building a DGA Detection ML Pipeline with Beam
By Aditya Patil & Raniya Rehman
Room: Pitch Pine

Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models.
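
One of the lexical features named above, Shannon entropy, is simple to compute inside a Beam transform. The sketch below pairs it with label length as two illustrative features; the production pipeline extracts 17, and the example domains are made up.

```python
import math
from collections import Counter
import apache_beam as beam

def shannon_entropy(label: str) -> float:
    """H = -sum(p_c * log2(p_c)) over the character distribution of the label."""
    counts = Counter(label)
    total = len(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def lexical_features(domain: str) -> dict:
    label = domain.split('.')[0]   # leftmost label of the domain
    return {'domain': domain,
            'length': len(label),
            'entropy': round(shannon_entropy(label), 3)}

with beam.Pipeline() as p:
    (p
     | beam.Create(['example.com', 'xkq7vt2hp9zr.biz'])
     | beam.Map(lexical_features)
     | beam.Map(print))
```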

15:45 - 16:15. 'Beam'-ing Iceberg Tables Across Environments Using Project Nessie
By Vineel Arekapudi
Room: Hackberry

In this session, we share Wells Fargo's approach to backing up and synchronizing Apache Iceberg tables across environments, using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline:

  1. Storage-Layer Replication: A Beam pipeline reads all Iceberg table data (Parquet data files, plus metadata files such as manifests and metadata JSON) directly from production S3 and writes it to non-production S3. Beam’s parallel execution model handles fan-out across thousands of tables efficiently, replacing ad-hoc tooling like rclone or distcp with a testable, portable, and observable pipeline.
  2. Catalog-Layer Replication with Nessie, Powered by Beam Rather Than Raw MongoDB Sync: Instead of directly synchronizing Nessie’s underlying MongoDB collections (objs2, refs2), an approach that is fragile, version-sensitive, and opaque, we use Apache Beam as an abstraction layer that operates at the Nessie API level. The Beam pipeline polls the production Nessie instance for new commits and snapshot changes via the Nessie REST API, extracts the relevant branch, tag, and table metadata, and applies those changes to the non-production Nessie instance in a controlled, auditable sequence.

This means replication is driven by Nessie’s own versioning semantics rather than by internal MongoDB implementation details, making the approach more resilient to Nessie upgrades.

Both layers run within the same Beam pipeline, giving us a single model for two distinct operational needs. In streaming mode, the pipeline continuously watches for new Nessie commits and triggers incremental storage and catalog replication, keeping non-production environments near-current with production. In batch mode, the same pipeline handles full environment bootstrapping or point-in-time recovery to a specific Nessie snapshot. Beam’s runner portability was essential in our regulated environment: pipelines are developed and validated locally using the Direct Runner before being deployed to our Spark cluster via the Spark Runner, without any rewrite. Once completed, the non-production Nessie catalog becomes a true, API-level mirror of production.

We will share practical lessons learned, including Nessie API pagination at scale, handling Beam pipeline failures mid-replication, and ensuring catalog consistency when storage and metadata sync are not atomic.

15:45 - 16:15. Beam + Spanner in Production: Multi-Tenant Identity Resolution and Resilient Ad Platform Activation
By Jonathan Sudhakar
Room: Pitch Pine

Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real-time - and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.

We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.

We’ll close with the resilience patterns that keep this graph flowing outward to Meta, Google, TikTok, and Snapchat - adaptive batching tuned to per-platform rate limits and payload constraints, circuit breakers that isolate failing destinations without stalling the pipeline, and a structured dead letter queue system with automated replay. Just enough DoFn-level detail to show how these patterns hold up under real third-party API volatility.

16:30 - 17:00. Introducing a Modern SQL Experience in Apache Beam
By Ahmed Abualsaud
Room: Hackberry

For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures.

No more! In this session, we’ll dive into the evolution of the Beam SQL story, introducing:

  • A New Catalog and Database Hierarchy, enabling centralized metadata management and data environment segregation
  • Seamless Cross-Catalog Queries, allowing joins and data movement across different catalogs and databases with ease
  • Standardized DDL Support (CREATE, SHOW, ALTER, DROP, USE), bringing Beam in line with the industry-standard SQL experience

We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam.

Tuesday, June 23, 2026

9:30 - 10:00
Keynote
10:00 - 10:30
Coffee Break
10:30 - 11:00
11:00 - 11:30
11:30 - 12:00
12:00 - 12:30
12:30 - 13:15
Lunch
13:15 - 13:45
13:45 - 14:10
14:10 - 14:40
09:00 - 09:30. The Autopilot Never Complained: Replay, Reason, Refly with Apache Beam
By Charles Adetiloye & Ian MacDonald
Room: Pitch Pine

A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly.

A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced.

This talk shows how Beam’s unified model runs the same pipeline against 14 months of archived logs in batch and against the live telemetry stream in real time — no rewrite, no reconciliation, one codebase. We build it live, extend it layer by layer from raw telemetry through operator commands to LLM-generated cost analysis and route recommendations. The route gets adjusted. The inefficiency disappears.

The question Beam lets you ask for the first time: what has your fleet been quietly compensating for that your dashboards never showed you?

10:30 - 11:00. Evolution of the broad Beam Streaming IO Ecosystem towards production readiness
By Yi Hu
Room: Pitch Pine

While the Beam website lists 30+ built-in I/O connectors that support streaming, these connectors are not all equal in resilience, scalability, and performance. This session highlights recent improvements to selected Beam streaming I/O connectors where gaps between user demand and the status quo had emerged, including Debezium, JMS, MQTT, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, and close with remarks on facilitating community engagement in the Beam I/O ecosystem.

10:30 - 11:20. From Prompts to Pipelines: Scaling Data Engineering via Agent Skills
By Canburak Tumer & Israel Herraiz
Room: Hackberry

The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers.

In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle.

Key areas of focus:

Encoding the Beam Model into Skills: How to build specialized skills that “understand” the nuances of PTransforms, Watermarks, and SideInputs to prevent common architectural anti-patterns.

Optimize Skill: Using agents to analyze Dataflow execution graphs and autonomously suggest performance tuning or cost-optimization fixes.

Agentic Testing Skills: Streamlining the creation of robust unit tests and TestStream scenarios to ensure pipeline reliability before deployment.

Skills in Action: A look at how a multi-agent workflow—using a suite of coordinated Beam Skills—can take a natural language requirement and turn it into a production-ready, multi-language pipeline.

By treating Beam expertise as a set of Agent Skills, we can lower the barrier to entry for new developers and allow seasoned experts to focus on high-level architecture rather than boilerplate logic.
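
For the testing skill mentioned above, the kind of TestStream scenario an agent might generate looks like the following sketch: deterministic event-time input, a fixed window, and an assertion on per-window counts. The element values and window size are arbitrary.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.testing.util import assert_that, equal_to
from apache_beam.transforms import window
from apache_beam.transforms.combiners import CountCombineFn
from apache_beam.transforms.window import TimestampedValue

def test_windowed_count():
    stream = (TestStream()
              .add_elements([TimestampedValue('a', 10), TimestampedValue('b', 20)])
              .advance_watermark_to(65)                 # closes the first 60s window
              .add_elements([TimestampedValue('c', 70)])
              .advance_watermark_to_infinity())

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True
    with TestPipeline(options=options) as p:
        counts = (p
                  | stream
                  | beam.WindowInto(window.FixedWindows(60))
                  | beam.CombineGlobally(CountCombineFn()).without_defaults())
        # Two elements land in [0, 60), one in [60, 120).
        assert_that(counts, equal_to([2, 1]))

if __name__ == '__main__':
    test_windowed_count()
```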

11:00 - 11:30. Zero-Copy Iceberg Migrations with Apache Beam
By Ahmed Abualsaud
Room: Pitch Pine

Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers.

This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte.

In this session, we’ll explore:

  • A practical framework for modernizing your lakehouse with minimal compute overhead.
  • Live demos showcasing (1) the Batch approach for migrating historical data and (2) the Streaming approach for registering incoming files in real-time
  • A decision matrix for choosing between traditional rewrites and zero-copy registration

11:30 - 12:00. Scale Smarter: RateLimiting AI Inference at Dataflow Scale
By Tarun Annapareddy
Room: Pitch Pine

Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI).

To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam. It is integrated directly into the RunInference transform and is also available for custom DoFns, moving quota management from reactive retry storms to proactive pacing.

This talk will explore how Beam coordinates rate limits across dispersed workers and communicates dynamic back pressure to the Runner Autoscaler to prevent compute waste. Attendees can expect to come away with an understanding of how global rate limiting works in distributed environments, how the autoscaler responds to rate signals, and how they can use Beam to scale their use cases safely without overwhelming external services.
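
The global rate limiter described above is a new Beam feature; for context only, here is the kind of naive per-worker throttle teams write today in a custom DoFn. This is not the new Beam API, the external call is a stub, and a coordinated limiter replaces this local sleeping with pacing shared across all workers.

```python
import time
import apache_beam as beam

def call_external_api(element):
    # Stub for a quota-limited call (e.g. a hosted model endpoint).
    return {'input': element, 'result': 'ok'}

class NaiveThrottle(beam.DoFn):
    """Spaces calls so each DoFn instance stays under max_qps (per worker only)."""
    def __init__(self, max_qps=5.0):
        self._min_interval = 1.0 / max_qps

    def setup(self):
        self._last_call = 0.0

    def process(self, element):
        wait = self._min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        self._last_call = time.monotonic()
        yield call_external_api(element)

with beam.Pipeline() as p:
    (p | beam.Create(range(10))
       | beam.ParDo(NaiveThrottle(max_qps=2.0))
       | beam.Map(print))
```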

11:30 - 12:00. The Agent-Driven Pipeline: Real-Time Data Validation & Modeling using Apache Beam, MCP, and GenAI
By Jay Jayakumar & Pablo Costamagna
Room: Hackberry

Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data.

We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today.

But the agent doesn’t process the data itself. Instead, it automatically triggers Apache Beam (Dataflow) to run these custom checks. The agent translates the business rules into logic specifically built for Apache Beam, allowing Beam to do what it does best: process huge amounts of data efficiently.

What You Will See:

Smart Triggering: How an AI agent figures out what needs checking and automatically spins up Apache Beam pipelines exactly when they are needed.

Building Beam-Ready Rules: How the agent translates everyday business rules and data catalog metadata into SQL and validation steps that plug right into your Apache Beam workflow.

Distributed Execution: How Apache Beam takes the handoff from the agent, using its distributed processing power to check massive datasets for errors and schema changes quickly and reliably.

12:00 - 12:30. Maximizing Performance, Reliability, and Scalability with Dataflow Streaming
By Tom Stepp & Ryan Wigglesworth
Room: Pitch Pine

This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures.

13:15 - 13:45. Buffering Data by Timestamp: A Step Towards Time Series Processing in Beam
By Shunping Huang & Claude van der Merwe
Room: Pitch Pine

Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. While there is growing interest in processing time series data within Apache Beam, its inherently unordered and parallel execution model forces developers to implement complex custom logic to handle chronological events accurately.

In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We will evaluate and compare various buffering approaches, weighing their trade-offs. Finally, we will demonstrate these concepts in action through a real-world anomaly detection use case utilizing the recently developed BigQuery CDC source.
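
One common buffering approach the talk will likely compare is a stateful DoFn that accumulates keyed elements and flushes them in timestamp order when an event-time timer fires. The simplified sketch below ignores windowing, late data, and repeated flushes, which are exactly the trade-offs the session examines.

```python
import apache_beam as beam
from apache_beam.coders.coders import FloatCoder, StrUtf8Coder, TupleCoder
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain

class EmitInTimestampOrder(beam.DoFn):
    """Buffers keyed (key, value) elements and emits values sorted by event time
    once the watermark passes the newest buffered timestamp."""
    BUFFER = userstate.BagStateSpec('buffer', TupleCoder((FloatCoder(), StrUtf8Coder())))
    FLUSH = userstate.TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH)):
        _, value = element
        buffer.add((timestamp.micros / 1e6, value))
        flush.set(timestamp)   # re-arm to the latest element's event time

    @userstate.on_timer(FLUSH)
    def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
        for _, value in sorted(buffer.read()):
            yield value
        buffer.clear()

with beam.Pipeline() as p:
    (p
     | beam.Create([('sensor-1', 'second', 20.0), ('sensor-1', 'first', 10.0)])
     | beam.Map(lambda e: beam.window.TimestampedValue((e[0], e[1]), e[2]))
     | beam.ParDo(EmitInTimestampOrder())
     | beam.Map(print))
```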

13:15 - 13:45. Closing the Loop: Powering Real-Time Clinical AI Automation with Apache Beam
By Austin Bennett
Room: Hackberry

The “last mile” of healthcare AI is often the hardest: turning a model prediction into an actual clinical action. Azra AI uses Apache Beam to close this loop. By unifying data ingestion and AI/ML within Beam pipelines, we automate the identification and navigation of patients needing urgent care.

In this talk, we’ll outline our architecture for deploying real-time streams with AI/ML predictions to provide clinicians with up-to-the-minute insights. We will discuss the challenges of processing unstructured clinical text and how Beam enables us to scale our automation to serve patients across diverse hospital networks.

13:15 - 13:45. When Fast Is Too Fast: Throughput Control Patterns in Apache Beam on Dataflow
By Josi Aranda
Room: Hackberry

Apache Beam and Google Cloud Dataflow are optimized for parallelism and autoscaling. In practice, that’s often exactly what we want, until it isn’t.

What happens when your streaming pipeline suddenly scales from 5 to 50 workers and overwhelms a downstream REST API? Or a legacy service enforces strict QPS limits? Beam provides powerful primitives for windowing, batching, and scaling — but it does not offer first-class global throughput control.

This talk explores advanced throughput control patterns specifically in the context of Apache Beam running on Google Cloud Dataflow.
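
One pattern in this family, sketched below under assumed numbers, caps effective fan-out by assigning elements to a small fixed set of shard keys and batching per key with GroupIntoBatches before calling the downstream API; the API call itself is stubbed out.

```python
import random
import apache_beam as beam

MAX_PARALLEL_CALLERS = 5   # fixed shard keys cap concurrent callers downstream
BATCH_SIZE = 50            # elements per downstream request

def send_batch(kv):
    shard, batch = kv
    # Stub for the rate-limited REST call; one request per batch.
    return f'shard {shard}: sent {len(batch)} records'

with beam.Pipeline() as p:
    (p
     | beam.Create(range(1000))
     | beam.Map(lambda x: (random.randrange(MAX_PARALLEL_CALLERS), x))
     | beam.GroupIntoBatches(BATCH_SIZE)
     | beam.Map(send_batch)
     | beam.Map(print))
```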

13:45 - 14:10.
By Jiufeng Liu
Room: Pitch Pine
06/23/2026 1:45 PM 06/23/2026 2:10 PM UTC BS26: Streaming CDC at Scale

This session introduces a YAML-driven Dataflow Flex Template design that enables product teams to self-serve Spanner-to-BigQuery replication, supporting both current-state and append-only modes for analytics, downstream applications, and audit trails. The focus is on streaming CDC (Change Data Capture) use cases.

In a federated deployment model where each product team owns and operates its own pipeline, redeployments emerged as a recurring risk event. None of Dataflow's existing restart mechanisms works perfectly for our use cases, so every redeployment risks data loss or duplicate data.

A proper fix belongs in Dataflow and SpannerIO. In the meantime, an interim solution has kept redeployments routine in production, with no data loss or duplication.

This session covers the problem, the approach, and the tradeoffs that remain.

14:10 - 14:40.
By Charles Adetiloye
Room: Pitch Pine
06/23/2026 2:10 PM 06/23/2026 2:40 PM UTC BS26: Agentic Beam YAML Pipelines for Real-Time Predictive Intelligence on Streaming Data

This talk demonstrates how declarative Beam YAML pipelines, combined with LLM-powered agentic architectures, unlock real-time predictive intelligence on high-velocity sensor streams — using UAV flight telemetry as a live use case. We show how parameterized Beam YAML templates dynamically reconfigure processing logic at runtime without code changes, while AI agents perform anomaly detection, predictive trend analysis, preemptive failure reasoning, and adaptive alerting — all orchestrated within the pipeline itself. Attendees will learn practical patterns for embedding agentic workflows into Beam, wiring streaming data sources to ML inference transforms, and building pipelines that don’t just react to problems but anticipate them. The session highlights how Beam’s unified model powers the next generation of intelligent, self-reasoning data applications.
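The YAML transforms and the agentic layer are the subject of the talk itself and are not reproduced here; the following is only a rough Python sketch of the underlying wiring of a streaming source into Beam's ML inference transform. The topic name, model path, featurizer, and downstream action are all placeholders.

```python
import json
import numpy as np
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy
from apache_beam.options.pipeline_options import PipelineOptions


def to_features(msg):
    # Placeholder featurizer: pick a couple of telemetry fields.
    return np.array([msg.get('altitude_m', 0.0), msg.get('battery_v', 0.0)])


# Placeholder model path; any pickled sklearn model would do for this sketch.
model_handler = SklearnModelHandlerNumpy(model_uri='gs://my-bucket/models/anomaly.pkl')

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    _ = (
        p
        | 'ReadTelemetry' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/uav-telemetry')  # placeholder topic
        | 'Parse' >> beam.Map(json.loads)
        | 'ToFeatures' >> beam.Map(to_features)
        | 'Predict' >> RunInference(model_handler)
        | 'Act' >> beam.Map(print)  # stand-in for alerting / downstream actions
    )
```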

14:10 - 14:40.
By Puneet Singh
Room: Hackberry
06/23/2026 2:10 PM 06/23/2026 2:40 PM UTC BS26: From Reactive to Proactive: How Intuit Credit Karma Solved Data Quality at Scale

At Intuit Credit Karma, the “Credit Ecosystem” team powers the financial progress of millions of members, relying on massive datasets from all three major credit bureaus and multiple partners. This ecosystem spans hundreds of tables and tens of thousands of columns, with ingestion frequencies ranging from real-time and intraday (3x daily) to monthly batch files. The sheer scale of daily loads—impacting over 140 million members—made manual monitoring impossible. This session explores how the Credit Ecosystem team leveraged Monte Carlo to transition from reactive firefighting to proactive observability as part of our Data Quality standards.

We designed our quality standards around five pillars: Timeliness, Completeness, Accuracy, Observability, and Governance. Like many organizations, we initially relied on custom rules and alerts, but quickly realized this approach was not scalable. We will discuss how we addressed this by automating observability using Monte Carlo’s out-of-the-box features, Field Health/Metrics Monitors, and custom SQL checks for complex DQ needs. We will also detail how we operationalized governance via the “Data Asset Registry”, a centralized management solution for hundreds of data assets across Credit Karma teams.

Lastly, we will discuss the human side of observability: adoption and training. We will share how we navigated early implementation challenges to build a reliable alerting structure, enabling our current model of paging on-call teams in real-time with high confidence and low alert fatigue.

Tuesday, June 23, 2026

9:30 - 10:00
Keynote
10:00 - 10:30
Coffee Break
12:30 - 13:15
Lunch
09:00 - 09:30. Pitch Pine
By Charles Adetiloye & Ian MacDonald

A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly.

A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced.
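For readers unfamiliar with this kind of replay analysis, here is a rough Beam sketch of the idea, assuming a hypothetical JSON-lines log schema; the speakers' actual pipeline and schema are not shown here.

```python
import json
import apache_beam as beam

# Hypothetical log schema: one JSON object per line, e.g.
# {"event": "attitude_recovery", "waypoint": "WP-17", "altitude_m": 212.4, "azimuth_deg": 93.0}

with beam.Pipeline() as p:
    _ = (
        p
        | 'ReadLogs' >> beam.io.ReadFromText('gs://my-bucket/flight-logs/*.jsonl')  # placeholder path
        | 'Parse' >> beam.Map(json.loads)
        | 'RecoveriesOnly' >> beam.Filter(lambda r: r.get('event') == 'attitude_recovery')
        | 'KeyByLocation' >> beam.Map(
            lambda r: ((r['waypoint'], int(r['altitude_m'] // 50)), 1))  # 50 m altitude bands
        | 'CountPerCluster' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.MapTuple(lambda loc, n: f'{loc}: {n} recoveries')
        | 'Write' >> beam.io.WriteToText('gs://my-bucket/analysis/recovery-clusters')
    )
```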

10:30 - 11:00. Pitch Pine
By Yi Hu

While the Beam website lists 30+ built-in IO connectors that support streaming, the connectors are not all equal in status (resilience, scalability, performance, etc.). This session highlights recent improvements to selected Beam streaming IO connectors where gaps between user demand and the status quo had emerged, including Debezium, JMS, MQTT, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, with remarks on facilitating community engagement in the Beam IO ecosystem.

10:30 - 11:20. Hackberry
By Canburak Tumer & Israel Herraiz

The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers.

In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle.

11:00 - 11:30. Pitch Pine
By Ahmed Abualsaud

Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers.

This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte.
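The Beam AddFiles feature itself is covered in the session and is not shown here. As a sketch of the underlying zero-copy idea, PyIceberg's add_files registers existing Parquet files in a table's metadata without copying or rewriting any bytes; the catalog name, table identifier, and file paths below are placeholders, and Table.add_files assumes pyiceberg 0.6 or later.

```python
from pyiceberg.catalog import load_catalog

# Placeholders: catalog name, table identifier, and file list are illustrative.
catalog = load_catalog('my_catalog')  # resolved from pyiceberg config or environment
table = catalog.load_table('analytics.events')

# Register existing Parquet files in the table's metadata; no data is rewritten.
table.add_files(file_paths=[
    'gs://my-bucket/events/date=2026-06-01/part-00000.parquet',
    'gs://my-bucket/events/date=2026-06-01/part-00001.parquet',
])
```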

In this session, we’ll explore:

11:30 - 12:00. Pitch Pine
By Tarun Annapareddy

Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI).

To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam, integrated directly into the RunInference transform and also available for custom DoFns. It moves quota management from reactive retry storms to proactive pacing.

11:30 - 12:00. Hackberry
By Jay Jayakumar & Pablo Costamagna

Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data.

We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today.
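To make the division of labor concrete, here is a minimal sketch with an entirely hypothetical rule format: the agent's job is to produce rules like these, and Beam's job is to apply them at scale. The source, sink, and rule schema below are placeholders, not the speakers' design.

```python
import json
import apache_beam as beam

# Hypothetical output of the validation agent: simple declarative rules.
RULES = [
    {'column': 'email', 'check': 'not_null'},
    {'column': 'age', 'check': 'range', 'min': 0, 'max': 130},
]


def violations(row, rules):
    """Yields one record per rule that the row violates."""
    for rule in rules:
        value = row.get(rule['column'])
        if rule['check'] == 'not_null' and value is None:
            yield {'rule': rule, 'row': row}
        elif rule['check'] == 'range' and value is not None:
            if not rule['min'] <= value <= rule['max']:
                yield {'rule': rule, 'row': row}


with beam.Pipeline() as p:
    _ = (
        p
        | 'ReadRows' >> beam.io.ReadFromText('gs://my-bucket/input/*.jsonl')  # placeholder source
        | 'Parse' >> beam.Map(json.loads)
        | 'CheckRules' >> beam.FlatMap(violations, RULES)
        | 'ToJson' >> beam.Map(json.dumps)
        | 'WriteViolations' >> beam.io.WriteToText('gs://my-bucket/dq/violations')  # placeholder sink
    )
```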

12:00 - 12:30. Pitch Pine
By Tom Stepp & Ryan Wigglesworth

This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures.

13:15 - 13:45. Pitch Pine
By Shunping Huang & Claude van der Merwe

13:15 - 13:45. Hackberry
By Austin Bennett

13:15 - 13:45. Hackberry
By Josi Aranda

13:45 - 14:10. Pitch Pine
By Jiufeng Liu

14:10 - 14:40. Pitch Pine
By Charles Adetiloye

14:10 - 14:40. Hackberry
By Puneet Singh

The schedule is subject to change.