The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage.
Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform.
Pitch Pine
Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well.
This talk will discuss why streaming data processing systems are broadly well positioned to safely scale and deploy agentic workflows, and why Beam is a particularly good fit. It will then spend some time talking about the gaps in Beam (and other systems), and how moderate targeted investments can close those gaps.
Attendees can expect to come away with a high level understanding of agentic workflows, the current state of agentic workflows in Beam, and the path forward to take Beam’s agentic support from solid to excellent.
Pitch Pine
Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation.
In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into:
Handling the “Mixed Stream” Challenge: How to use Beam to demultiplex and route mixed log types from a single Kafka or Pub/Sub source.
Decoding the Undecodable: Patterns for processing proprietary binary protocols and industrial equipment logs at scale.
Architecting for Reliability: Implementing Dead Letter Queues (DLQs) and rejected message sinks to ensure zero data loss during ingestion.
The Ingestion Decision Tree: A comparison of ingestion methods (API vs. GCS vs. Pub/Sub) and how to choose the right sink for your security requirements.
Attendees will leave with a blueprint for building resilient, scalable security data pipelines that turn raw industrial signals into actionable security intelligence.
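The demultiplexing and DLQ routing described above can be sketched independently of any runner; in a Beam pipeline the same decision would typically live in a DoFn emitting tagged outputs. The log categories and payloads below are illustrative, not taken from the talk.

```python
import json

DLQ = "dlq"

def route_log(raw: bytes):
    """Classify one record from a mixed stream and return (destination, payload).
    Unparseable records go to the dead letter queue instead of being dropped."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Binary payloads (e.g. proprietary OT protocols) get their own branch
        # for protocol-specific decoding downstream.
        return ("binary", raw)
    try:
        return ("structured", json.loads(text))
    except json.JSONDecodeError:
        if text.strip():
            return ("unstructured", text)
        return (DLQ, raw)

# In Beam, each destination would map to a beam.pvalue.TaggedOutput,
# with the DLQ branch written to a rejected-message sink for replay.
```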
Hackberry
AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making.
We’ll cover practical patterns for building agent-driven pipelines:
Agentic Orchestration with Beam: Structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources. How Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints.
Real-time ML Integration: Embedding models directly in Beam pipelines for low-latency inference. Patterns for fraud detection, anomaly detection, and personalized recommendations where agents act on streaming insights.
State and Error Handling: Managing agent state across pipeline stages, checkpoint strategies for long-running agentic workflows, and graceful recovery when tools or models fail.
Connecting Disparate Systems: Integrating Beam with Kafka, Iceberg, and cloud APIs to give agents access to the data they need without rebuilding your infrastructure.
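As a minimal illustration of the orchestration pattern above — an agent step that dispatches to a tool and records its result in state for later stages or retries. The tool names and state layout are hypothetical, not a Beam API.

```python
def run_agent_step(task, tools, state):
    """One agentic step: pick a tool for the task, call it, and record the
    result in state so a later pipeline stage (or a retry) can resume here."""
    tool = tools.get(task["tool"])
    if tool is None:
        state.setdefault("errors", []).append(task)
        return None  # in a pipeline this branch would feed an error/DLQ output
    result = tool(task["input"])
    state.setdefault("history", []).append((task["tool"], result))
    return result

# Hypothetical tools an agent might coordinate across data sources:
tools = {
    "lookup": lambda x: {"records": [x]},        # e.g. a batch repository read
    "score":  lambda x: {"score": len(str(x))},  # e.g. an ML inference endpoint
}

state = {}
run_agent_step({"tool": "lookup", "input": "user-42"}, tools, state)
run_agent_step({"tool": "score", "input": "user-42"}, tools, state)
```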
Attendees will leave with:
Built for data engineers ready to move from static pipelines to autonomous, ML-powered data workflows.
Pitch Pine
A brief dive into improving Beam Python performance, covering currently available best practices as well as looking ahead to free-threaded Python runtimes.
Hackberry
Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug.
This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows.
Key Technical Takeaways:
Multi-Level Parallelism: Decomposing MinHash LSH into document-level, element-level, and global-level parallel transforms within Beam.
Unified State: Implementing an external “Global Memory” using high-throughput state stores to bridge the gap between processing modes.
Hardware Density: Offloading compute-intensive hashing to TPUs via Dataflow to drastically reduce the cost-per-token processed.
Reusable Components: Deploying low-latency membership inference to prevent “corpus poisoning” during live web crawls by reusing identical batch logic.
We will conclude with a live demonstration using the Common Crawl dataset, showcasing the pipeline’s ability to intercept near-duplicate documents in real-time with zero logic drift.
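A toy, single-machine sketch of the MinHash LSH idea underlying the pipeline — the talk's version distributes these stages across Beam transforms; the hash functions, shingle size, and band count here are illustrative choices, not the production parameters.

```python
import hashlib

def shingles(text, k=3):
    """Word k-grams: the sets whose Jaccard similarity MinHash approximates."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function; equal positions across two
    signatures occur with probability equal to the Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for s in shingle_set))
    return sig

def lsh_bands(sig, bands=16):
    """Split the signature into bands; documents sharing any band bucket
    are candidate near-duplicates and get a full comparison."""
    rows = len(sig) // bands
    return [(i, tuple(sig[i * rows:(i + 1) * rows])) for i in range(bands)]

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
sig_a, sig_b = minhash_signature(shingles(a)), minhash_signature(shingles(b))
shared = set(lsh_bands(sig_a)) & set(lsh_bands(sig_b))  # non-empty => candidates
```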
Pitch Pine
In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines.
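As a starting point, Python's standard library already ships tracemalloc, which works inside a Beam worker as well as in a plain script. This generic snippet is not specific to the talk's techniques; the sample workload is made up.

```python
import tracemalloc

def build_big_structure(n):
    # Stand-in for a memory-hungry transform (e.g. buffering elements in a DoFn).
    return [str(i) * 10 for i in range(n)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
data = build_big_structure(100_000)
after = tracemalloc.take_snapshot()

# Attribute allocation growth to source lines — inside a Beam pipeline this
# could run around DoFn.process or a bundle boundary to spot leaky transforms.
top = after.compare_to(before, "lineno")
growth = sum(stat.size_diff for stat in top)
```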
Hackberry
Everyone talks about explainable AI. Almost no one runs it at scale in production.
The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users all need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up.
At Intuit Credit Karma, we built and operate a production ML explainability platform on Apache Beam and Dataflow — and we’ll show you exactly how. Beam’s unified programming model gave us the horizontal scalability to run attribution computation across massive prediction volumes without blowing up cost or latency. Its flexible I/O and pipeline composition patterns let us stitch together a three-stage architecture: feature log preparation, distributed attribution generation grouped by model version, and downstream delivery with correctness guardrails — all orchestrated on a daily cadence.
We’ll walk through the real engineering decisions: how we used Apache Beam pipelines to group predictions by model version and load the corresponding model dynamically — ensuring explanations always reflect the exact model that made each prediction even as models refresh continuously in production; how we aggregated feature-level attributions into human-readable reason codes configurable for any audience; and how we designed the pipeline to be privacy-preserving by construction, with no sensitive data leaking through intermediate stages.
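The version-alignment step reduces to a group-by before inference. A minimal sketch with a toy linear "attribution" (weight × feature value) — the field names, registry shape, and scoring model are illustrative, not Credit Karma's implementation.

```python
from collections import defaultdict

def group_by_model_version(predictions):
    """Group prediction logs so each group is scored and explained by the
    exact model version that produced it, even as models refresh."""
    groups = defaultdict(list)
    for p in predictions:
        groups[p["model_version"]].append(p)
    return groups

def attribute(groups, registry):
    """One model load per version (not per row), then per-feature contributions;
    a linear model's weight * value stands in for a real attribution library."""
    out = []
    for version, preds in groups.items():
        weights = registry[version]
        for p in preds:
            out.append({f: weights[f] * v for f, v in p["features"].items()})
    return out

registry = {"v1": {"income": 0.5, "age": 0.1}, "v2": {"income": 0.3, "age": 0.4}}
preds = [
    {"model_version": "v1", "features": {"income": 2.0, "age": 1.0}},
    {"model_version": "v2", "features": {"income": 2.0, "age": 1.0}},
]
attributions = attribute(group_by_model_version(preds), registry)
```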
We’ll also be honest about where Beam shines for this use case and where the rough edges are — including version compatibility with ML attribution libraries, managing model artifacts in distributed workers, and cost tradeoffs vs. managed explainability services.
Whether you’re building explainability infrastructure, running large-scale batch ML pipelines, or just trying to understand what Apache Beam can do beyond ETL — this talk will show you what production explainability looks like when you build it the right way, at real scale.
Pitch Pine
As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence, it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events.
This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration.
Hackberry
As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams.
We’ll cover:
How to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference.
Designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data.
Handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization).
Deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration.
Real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%.
This isn’t a toy demo. It’s a battle-tested architecture handling 10M+ daily events with 99.9% uptime. Attendees will leave with concrete patterns they can apply to fraud detection, anomaly detection, semantic search, and personalized recommendation systems.
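The retrieval step of such a RAG pipeline, stripped to its core: rank stored embeddings by cosine similarity to the query. This brute-force loop is a stand-in for a Pinecone/FAISS lookup, and the document IDs and vectors below are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, index, top_k=2):
    """Return the top_k most similar (doc_id, vector) entries — in production
    this whole call is replaced by a vector-database query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

index = [("fraud-faq",      [0.9, 0.1, 0.0]),
         ("returns-policy", [0.1, 0.9, 0.0]),
         ("shipping",       [0.0, 0.2, 0.9])]
hits = retrieve([1.0, 0.0, 0.1], index)  # context handed to the LLM prompt
```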
Hackberry
Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models.
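Two of the seventeen features can be computed directly from the domain string. A self-contained sketch — the smoothing, reference counts, and sample domains are illustrative, not the NJCCIC feature set.

```python
import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """High character entropy is typical of DGA-generated C2 domains."""
    counts = Counter(domain)
    total = len(domain)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bigram_log_prob(domain: str, bigram_counts: Counter, total: int) -> float:
    """Average log-probability of adjacent character pairs under a reference
    corpus — random-looking strings score lower than dictionary-like ones."""
    pairs = [domain[i:i + 2] for i in range(len(domain) - 1)]
    return sum(math.log((bigram_counts[p] + 1) / (total + 1))
               for p in pairs) / len(pairs)

# Reference counts would come from a corpus of benign domains (toy values here).
reference = Counter({"go": 50, "oo": 40, "og": 30, "le": 20})
total = sum(reference.values())

benign = bigram_log_prob("google", reference, total)
random_ish = bigram_log_prob("xq7zk9", reference, total)
```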
Pitch Pine
At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline:
This means replication is driven by Nessie’s own versioning semantics rather than internal MongoDB implementation details, making the approach more resilient to Nessie upgrades.
A Unified Pipeline with Batch and Streaming Modes: Both layers run within the same Beam pipeline, giving us a single model for two distinct operational needs. In streaming mode, the pipeline continuously watches for new Nessie commits and triggers incremental storage and catalog replication, keeping non-production environments near-current with production. In batch mode, the same pipeline handles full environment bootstrapping or point-in-time recovery to a specific Nessie snapshot. Beam’s runner portability was essential in our regulated environment: pipelines are developed and validated locally using the Direct Runner before being deployed to our Spark cluster via the Spark Runner, without any rewrite. Once complete, the non-production Nessie catalog becomes a true, API-level mirror of production. We will share practical lessons learned, including Nessie API pagination at scale, handling Beam pipeline failures mid-replication, and ensuring catalog consistency when storage and metadata sync are not atomic.
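The incremental layer boils down to diffing two commit logs and replaying the gap. A runner-agnostic simulation — in-memory lists of commit IDs stand in for Nessie's commit-log API, and the two-phase ordering in the comment reflects the storage-then-metadata sequencing described above.

```python
def commits_to_replay(source_log, target_log):
    """Return source commits missing from the target, oldest first.
    Driving replication off commit IDs (not storage internals) is what
    keeps the approach resilient to Nessie-internal changes."""
    seen = set(target_log)
    return [c for c in source_log if c not in seen]

def replicate(source_log, target_log, apply_commit):
    for commit in commits_to_replay(source_log, target_log):
        apply_commit(commit)       # 1) copy data files, 2) commit metadata —
        target_log.append(commit)  # record the commit only once both landed

prod = ["c1", "c2", "c3", "c4"]     # production commit log
nonprod = ["c1", "c2"]              # non-production mirror, lagging behind
applied = []
replicate(prod, nonprod, applied.append)
```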
Hackberry
Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real-time - and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.
We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.
We’ll close with the resilience patterns that keep this graph flowing outward to Meta, Google, TikTok, and Snapchat - adaptive batching tuned to per-platform rate limits and payload constraints, circuit breakers that isolate failing destinations without stalling the pipeline, and a structured dead letter queue system with automated replay. Just enough DoFn-level detail to show how these patterns hold up under real third-party API volatility.
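A minimal version of the circuit-breaker pattern described above — stop calling a failing destination so the rest of the pipeline keeps flowing, then retry after a cooldown. The thresholds are illustrative, and the injected clock is just for deterministic demonstration.

```python
import time

class CircuitBreaker:
    """Isolates one failing destination (e.g. a single ad platform's API)
    without stalling deliveries to the others."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures, self.cooldown_s, self.clock = max_failures, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip: stop sending to this sink

fake_now = [0.0]
cb = CircuitBreaker(max_failures=2, cooldown_s=10, clock=lambda: fake_now[0])
cb.record(False); cb.record(False)  # two consecutive failures trip the breaker
tripped = not cb.allow()
fake_now[0] = 11.0                  # after the cooldown it lets a probe through
recovered = cb.allow()
```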
Pitch Pine
For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures.
No more! In this session, we’ll dive into the evolution of the Beam SQL story, introducing:
We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam.
Hackberry
The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage.
Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform.
Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well.
Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation.
In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into:
AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making.
We’ll cover practical patterns for building agent-driven pipelines:
Agentic Orchestration with Beam: Structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources. How Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints.
A brief dive into improving Beam Python performance, looking at both currently available best practices as well as looking forward to free threaded Python runtimes.
Abstract Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug.
This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows.
In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines.
Everyone talks about explainable AI. Almost no one runs it at scale in production.
The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users all need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up.
As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence, it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events.
This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration
As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams.
We’ll cover: How to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference Designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data Handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization) Deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration Real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%
Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models.
Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real-time - and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.
We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.
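Weighted conflict resolution of the kind described can be sketched in a few lines. The source names, weights, and `Signal` shape below are hypothetical illustrations, not the production schema:

```python
from dataclasses import dataclass

# Hypothetical source weights: first-party data is trusted over
# server-side events, which are trusted over ad-platform signals.
SOURCE_WEIGHTS = {"first_party": 3, "server_side": 2, "ad_platform": 1}

@dataclass(frozen=True)
class Signal:
    field: str        # e.g. "email"
    value: str
    source: str       # key into SOURCE_WEIGHTS
    observed_at: int  # unix seconds; used as a tie-breaker

def resolve(signals: list) -> dict:
    """Pick one value per field: highest source weight wins,
    most recent observation breaks ties."""
    best = {}
    for s in signals:
        cur = best.get(s.field)
        key = (SOURCE_WEIGHTS.get(s.source, 0), s.observed_at)
        if cur is None or key > (SOURCE_WEIGHTS.get(cur.source, 0), cur.observed_at):
            best[s.field] = s
    return {f: s.value for f, s in best.items()}
```

In a streaming identity merge, the same comparison would run incrementally against the currently persisted winner rather than over a full list.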
At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline:
For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures.
No more! In this session, we’ll dive into the evolution of the Beam SQL story, introducing:
We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam.
A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly.
A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced.
This talk shows how Beam’s unified model runs the same pipeline against 14 months of archived logs in batch and against the live telemetry stream in real time — no rewrite, no reconciliation, one codebase. We build it live, extend it layer by layer from raw telemetry through operator commands to LLM-generated cost analysis and route recommendations. The route gets adjusted. The inefficiency disappears.
The question Beam lets you ask for the first time: what has your fleet been quietly compensating for that your dashboards never showed you?
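The batch half of this analysis, bucketing attitude-recovery events into coarse spatial cells and counting them, can be sketched outside of Beam. The record shape and bucket sizes below are illustrative assumptions, not the talk's actual schema:

```python
from collections import Counter

def cluster_recoveries(events, alt_band_m=50, az_bin_deg=15):
    """Bucket attitude-recovery events into coarse (waypoint, altitude band,
    azimuth bin) cells and count them. A cell holding a disproportionate
    share of events is the signature of a recurring, silently compensated
    disturbance like a terrain-induced rotor.

    `events` is an iterable of (waypoint_id, altitude_m, azimuth_deg) tuples.
    """
    cells = Counter(
        (wp, int(alt // alt_band_m), int(az // az_bin_deg))
        for wp, alt, az in events
    )
    return cells.most_common()
```

In Beam, the equivalent logic is a `Map` to the cell key followed by `Count.PerKey()`, which is exactly the kind of transform that runs unchanged over archived logs in batch and over live telemetry in streaming.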
While the Beam website lists 30+ built-in I/O connectors supporting streaming, the maturity (resilience, scalability, performance, etc.) of each connector is not equal. This session highlights recent improvements to selected Beam streaming I/O connectors where gaps between user demand and the status quo had emerged, including Debezium, JMS, MQTT, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, and close with remarks on facilitating community engagement in the Beam I/O ecosystem.
The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers.
In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle.
Key areas of focus:
Encoding the Beam Model into Skills: How to build specialized skills that “understand” the nuances of PTransforms, Watermarks, and SideInputs to prevent common architectural anti-patterns.
Optimize Skill: Using agents to analyze Dataflow execution graphs and autonomously suggest performance tuning or cost-optimization fixes.
Agentic Testing Skills: Streamlining the creation of robust unit tests and TestStream scenarios to ensure pipeline reliability before deployment.
Skills in Action: A look at how a multi-agent workflow—using a suite of coordinated Beam Skills—can take a natural language requirement and turn it into a production-ready, multi-language pipeline.
By treating Beam expertise as a set of Agent Skills, we can lower the barrier to entry for new developers and allow seasoned experts to focus on high-level architecture rather than boilerplate logic.
Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers.
This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte.
In this session, we’ll explore:
Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI).
To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam. It is integrated directly into the RunInference transform and is also available for custom DoFns, moving quota management from reactive retry storms to proactive pacing.
This talk will explore how Beam coordinates rate limits across dispersed workers and communicates dynamic back pressure to the Runner Autoscaler to prevent compute waste. Attendees can expect to come away with an understanding of how global rate limiting works in distributed environments, how the autoscaler responds to rate signals, and how they can use Beam to scale their use cases safely without overwhelming external services.
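The pacing idea can be illustrated with a single-worker sketch. This is not the Beam RateLimiter API; it is a minimal stand-in that assumes a static, equal quota share per worker, whereas the real component coordinates shares dynamically:

```python
import time

class QuotaSharePacer:
    """Proactive pacing sketch: each worker takes an equal share of a
    global QPS quota and sleeps *before* a call, rather than retrying
    after a 429. Assumes a fixed worker count for simplicity."""

    def __init__(self, global_qps: float, worker_count: int):
        # Seconds this worker must wait between its own calls.
        self.min_interval = worker_count / global_qps
        self._last = 0.0

    def acquire(self) -> float:
        """Block until this worker may issue its next call; return the wait applied."""
        now = time.monotonic()
        wait = max(0.0, self._last + self.min_interval - now)
        if wait:
            time.sleep(wait)
        self._last = time.monotonic()
        return wait
```

When the autoscaler changes the worker count, `min_interval` must be recomputed, which is precisely why back-pressure signaling between the limiter and the autoscaler matters.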
Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data.
We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today.
But the agent doesn’t process the data itself. Instead, it automatically triggers Apache Beam (Dataflow) to run these custom checks. The agent translates the business rules into logic specifically built for Apache Beam, allowing Beam to do what it does best: process huge amounts of data efficiently.
What You Will See:
Smart Triggering: How an AI agent figures out what needs checking and automatically spins up Apache Beam pipelines exactly when they are needed.
Building Beam-Ready Rules: How the agent translates everyday business rules and data catalog metadata into SQL and validation steps that plug right into your Apache Beam workflow.
Distributed Execution: How Apache Beam takes the handoff from the agent, using its distributed processing power to check massive datasets for errors and schema changes quickly and reliably.
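As a toy illustration of the rule-translation step described above, a declarative quality check can be compiled into a violation-counting SQL statement. The rule shape and check names here are hypothetical, not the agent's actual output format:

```python
# Hypothetical rule shape the agent might emit from catalog metadata:
#   {"column": "email", "check": "not_null"}
#   {"column": "age", "check": "range", "min": 0, "max": 130}
def rule_to_sql(table: str, rule: dict) -> str:
    """Translate one declarative quality rule into SQL that counts
    violating rows; a Beam pipeline (e.g. via SqlTransform or a
    warehouse read) can then execute it at scale."""
    col = rule["column"]
    if rule["check"] == "not_null":
        pred = f"{col} IS NULL"
    elif rule["check"] == "range":
        pred = f"{col} < {rule['min']} OR {col} > {rule['max']}"
    else:
        raise ValueError(f"unknown check: {rule['check']}")
    return f"SELECT COUNT(*) AS violations FROM {table} WHERE {pred}"
```

The point of the separation is visible even at this scale: the agent owns the rule vocabulary, while Beam owns the execution of whatever SQL the rules compile to.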
This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures.
Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. While there is growing interest in processing time series data within Apache Beam, its inherently unordered and parallel execution model forces developers to implement complex custom logic to handle chronological events accurately.
In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We will evaluate and compare various buffering approaches, weighing their trade-offs. Finally, we will demonstrate these concepts in action through a real-world anomaly detection use case utilizing the recently developed BigQuery CDC source.
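The simplest buffering approach, holding events in a min-heap and releasing them once the watermark passes, can be sketched as follows. In a real Beam pipeline this state would live in per-key state with event-time timers; the sketch below is a plain in-memory stand-in:

```python
import heapq

class OrderedBuffer:
    """Watermark-driven reordering sketch: events arrive out of order,
    are held in a min-heap keyed by event time, and are released in
    timestamp order once the watermark passes them."""

    def __init__(self):
        self._heap = []

    def add(self, timestamp: int, event) -> None:
        heapq.heappush(self._heap, (timestamp, event))

    def release(self, watermark: int) -> list:
        """Emit all buffered (timestamp, event) pairs with
        timestamp <= watermark, in timestamp order."""
        out = []
        while self._heap and self._heap[0][0] <= watermark:
            out.append(heapq.heappop(self._heap))
        return out
```

The trade-off the talk compares is visible here in miniature: a later watermark releases more events correctly ordered, but holds state longer and adds latency.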
The “last mile” of healthcare AI is often the hardest: turning a model prediction into an actual clinical action. Azra AI uses Apache Beam to close this loop. By unifying data ingestion and AI/ML within Beam pipelines, we automate the identification and navigation of patients needing urgent care.
In this talk, we’ll outline our architecture for deploying real-time streams with AI/ML predictions to provide clinicians with up-to-the-minute insights. We will discuss the challenges of processing unstructured clinical text and how Beam enables us to scale our automation to serve patients across diverse hospital networks.
Apache Beam and Google Cloud Dataflow are optimized for parallelism and autoscaling. In practice, that’s often exactly what we want, until it isn’t.
What happens when your streaming pipeline suddenly scales from 5 to 50 workers and overwhelms a downstream REST API? Or a legacy service enforces strict QPS limits? Beam provides powerful primitives for windowing, batching, and scaling — but it does not offer first-class global throughput control.
This talk explores advanced throughput control patterns specifically in the context of Apache Beam running on Google Cloud Dataflow.
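One classic pattern in this space is a client-side token bucket. The sketch below is per-process only and not a Beam API; coordinating a global limit across autoscaled workers is exactly the hard part the talk addresses:

```python
import time

class TokenBucket:
    """Token-bucket throughput control: tokens refill at `rate` per
    second up to `capacity` (the allowed burst). A caller may proceed
    only when a token is available, capping sustained QPS at `rate`.
    Per-process only; cross-worker coordination needs an external
    store or a keyed stateful DoFn."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A `DoFn` calling a rate-limited REST API could check `try_acquire()` before each request and sleep briefly on failure, smoothing bursts even when Dataflow scales the worker count.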
Introducing a YAML-driven Dataflow Flex Template design that enables product teams to self-serve Spanner-to-BigQuery replication, supporting both current-state and append-only modes for analytics, downstream applications, and audit trails. This session focuses on streaming CDC (Change Data Capture) use cases.
In a federated deployment model where each product team owns and operates its own pipeline, redeployments emerged as a recurring risk event. None of Dataflow’s existing restart mechanisms work perfectly for our use cases, leaving every redeployment at risk of data loss or duplication.
A proper fix belongs in Dataflow and SpannerIO. In the meantime, an interim solution has kept redeployments routine in production, with no data loss or duplication.
This session covers the problem, the approach, and the tradeoffs that remain.
This talk demonstrates how declarative Beam YAML pipelines, combined with LLM-powered agentic architectures, unlock real-time predictive intelligence on high-velocity sensor streams — using UAV flight telemetry as a live use case. We show how parameterized Beam YAML templates dynamically reconfigure processing logic at runtime without code changes, while AI agents perform anomaly detection, predictive trend analysis, preemptive failure reasoning, and adaptive alerting — all orchestrated within the pipeline itself. Attendees will learn practical patterns for embedding agentic workflows into Beam, wiring streaming data sources to ML inference transforms, and building pipelines that don’t just react to problems but anticipate them. The session highlights how Beam’s unified model powers the next generation of intelligent, self-reasoning data applications.
At Intuit Credit Karma, the “Credit Ecosystem” team powers the financial progress of millions of members, relying on massive datasets from all three major credit bureaus and multiple partners. This ecosystem spans hundreds of tables and tens of thousands of columns, with ingestion frequencies ranging from real-time and intraday (3x daily) to monthly batch files. The sheer scale of daily loads—impacting over 140 million members—made manual monitoring impossible. This session explores how the Credit Ecosystem team leveraged Monte Carlo to transition from reactive firefighting to proactive observability as part of our Data Quality standards.
We designed our quality standards around five pillars: Timeliness, Completeness, Accuracy, Observability, and Governance. Like many organizations, we initially relied on custom rules and alerts, but quickly realized this approach was not scalable. We will discuss how we solved this challenge by automating observability using Monte Carlo’s out-of-the-box features, Field Health/Metrics Monitors, and custom SQL checks to handle complex DQ needs. We will also detail how we operationalized governance via the “Data Asset Registry”, a centralized management solution for hundreds of data assets across Credit Karma teams.
Lastly, we will discuss the human side of observability: adoption and training. We will share how we navigated early implementation challenges to build a reliable alerting structure, enabling our current model of paging on-call teams in real-time with high confidence and low alert fatigue.