Welcome to the session program for Beam Summit 2024.

If you prefer, you can also see the program in the Sessionize layout.

Wednesday, September 04, 2024

09:00 - 09:45
Welcome
10:30 - 11:00
Morning break
12:30 - 13:25
Lunch
15:30 - 16:00
Afternoon break
17:00 - 19:30 | Event reception location: MP6, Game Room, 8th floor
09:10 - 09:40
By Yasmeen Ahmad
Room: Mariposa Grove
Innovating the Data & AI Platform

Join us in this talk where Yasmeen Ahmad, Managing Director Data & Analytics at Google Cloud, shares her perspective on the innovations taking place in the data & analytics platform space.

09:45 - 10:30
By Marc Howard
Room: Mariposa Grove
Project Shield: How we use Beam to defend democracy and free expression, and how we got started!

Project Shield is a service that counters distributed-denial-of-service (DDoS) attacks. It is available free of charge to eligible websites that have news, elections, and human rights related content. Project Shield helps ensure unhindered access to election-related information during global democratic processes (such as the U.S. 2022 midterm election season and many others). It also enables critical infrastructure and news websites to defend against non-stop attacks and provides crucial services and information during crises (such as the invasion of Ukraine).

In this keynote, Marc Howard will explain why and how Project Shield uses Apache Beam and Google Cloud Dataflow to deliver some of its core value. Its streaming Apache Beam pipelines process more than 3 TB of log data daily at well over 10,000 queries per second, a rate that grows sharply during large attacks of up to 400 million queries per second. These metrics power user-facing graphs and long-term attack analytics at scale, fine-tuning Project Shield’s defenses and supporting its effort to make the web a safe and free space.

11:00 - 11:25
By Rohit Sinha
Room: Hamina (MP4)
Data Lineage in Beam

In this presentation, we delve into the critical world of data lineage within Apache Beam, exploring its significance and demonstrating its practical implementation. We begin by establishing the motivation behind data lineage, highlighting its role in enhancing data governance, debugging, and impact analysis. Next, we introduce Google Cloud Dataplex, a unified data management platform, and its integration with Beam’s lineage capabilities.

We’ll then embark on a technical journey, showcasing how lineage support is built into Apache Beam’s core. Following this, we will dissect the process of constructing a lineage graph for an Apache Beam job and seamlessly reporting it to Dataplex for insightful visualization.

The presentation will empower the audience with actionable knowledge on how to integrate lineage tracking into their own I/O operations, ensuring greater transparency and control over their data pipelines. Finally, a live demonstration will bring these concepts to life, showcasing data lineage in action for an Apache Beam job executing on Dataflow, and visually exploring its lineage within Dataplex.

By the end of this talk, attendees will possess the knowledge and tools to effectively leverage Apache Beam’s lineage support, fostering transparency and trust within their data pipelines.

11:00 - 11:25
By Sergei Lilichenko
Room: Walker Canyon
Ordered processing in Apache Beam

The session covers a recently added Beam SDK extension and applicable use cases. It also describes the techniques used to implement the transform.

11:00 - 11:25
By Sayat Satybaldiyev & Arwin Tio
Room: Mariposa Grove
Scaling Autonomous Driving with Apache Beam

Cruise leverages Apache Beam to manage and process petabytes of data monthly, essential for our autonomous vehicle model training. This talk will delve into the innovative features we’ve developed to enhance Beam’s capabilities, including a control plane for quota and user management, a C++ sandbox for running AV ROS nodes in the cloud, and shuffling optimization techniques to compress shuffled data.

11:30 - 12:20
By Robert Burke
Room: Hamina (MP4)
A New Local Runner Appears: Deep dive on Prism

Prism is a local portable Beam Runner intended to assist end user and SDK developers alike, by providing a common platform for all existing and new Beam SDKs.

Beam is unique among data processing systems in how it divides the user-facing SDK from the execution engine. Jobs can be authored in one SDK and executed on Flink, Spark, or cloud services like Dataflow. But these external systems can make prototyping, testing, and debugging complicated.

Prism is written in Go to be a small, local runner built using Beam Portability first, to better emulate how jobs of any SDK execute on those larger systems. Further, Prism serves as a model runner for all SDKs for a robust local experience.

This talk will go into how runners execute pipelines, and the design and implementation of Prism for the goals of testing and configurability. Knowledge of Go is encouraged, but not required.

11:30 - 11:55
By Shunping Huang
Room: Walker Canyon
Introducing Ordered List States

This session introduces ordered list states, from concept to implementation.

11:30 - 11:55
By Baojun Liu
Room: Mariposa Grove
Transitioning Uber Michelangelo's Batch Prediction from Apache Spark to Ray

Michelangelo is Uber’s centralized machine learning platform, designed to manage ML pipelines and their associated data processing. As the demand for batch predictions grows, the need for a flexible and efficient processing framework becomes imperative. This presentation explores Uber Michelangelo’s batch prediction processes, focusing on data processing, model prediction, and the transition from Spark to Ray.

At Michelangelo, data preparation and feature transformation are traditionally handled using Spark data transforms. The model prediction step involves a user-defined function within a Spark pipeline model. While Spark has been the backbone for our batch processing needs due to its robustness and ease of use, it has shown limitations in handling the complex machine learning tasks that Uber is increasingly deploying, such as natural language processing and Generative AI. These workloads often require GPUs to meet latency and throughput requirements, which Spark struggles to support efficiently.

Ray, an emerging distributed computing framework, offers better resource utilization, simpler parallelism, and more straightforward scalability. By leveraging Ray for batch processing, we can support large language model batch predictions reliably, efficiently, and scalably. We are also transitioning other machine learning batch prediction tasks to Ray for both data processing and model prediction. In this new setup, data processing is integrated as part of the model using native transformer techniques, allowing deployment on GPUs.

With Ray, we have developed a robust pipeline for batch prediction. Currently, streaming data is handled with separate pipelines. We are actively exploring the unification of these pipelines using open-source libraries. Apache Beam ML provides an opportunity to unify batch and streaming data processing pipelines.

12:00 - 12:25
By Alberto López Serna
Room: Walker Canyon
Avoid HTTP Request Duplicates with SCIO, a custom AsyncHttpParDoFn and State & Timers

This session presents a StateBaseAsyncDoFn.java class and a full production SCIO implementation that uses State and Timers with HTTP clients to prevent duplicate requests, along with other aggregation use cases for asynchronous endpoints.

12:00 - 12:25
By Lakshmanan Arumugam
Room: Mariposa Grove
How we Migrated our JSON DB to a Relational DB using Apache Beam / Dataflow

We migrated ~5 TB (~2 billion rows) from a production NoSQL DB to a Postgres DB without application downtime. This involved data export from Cloud Datastore (NoSQL DB), normalization (from JSON to SQL schema), and then data import into Postgres. We used two different Apache Beam pipelines in the process.

  1. Exporting data from Cloud Datastore: for this we used a Google-developed Dataflow template (written in Java). *
  2. Normalizing and importing into Postgres: we wrote our own custom Apache Beam pipeline that transforms every JSON row into a Postgres-compatible schema and batches the normalized rows for ingestion into Postgres (written in Python). **

Data migration took ~16 hrs with this approach using the Beam pipeline (as opposed to our initial estimate of 5 days using other batch scripts with parallel computing).

Advantages:

  • Time: made our migration faster
  • Repeatability and error handling: easy to rerun for records that failed to import (and for any new records created in Datastore during the migration, etc.)
  • Managed scaling with Google Dataflow

* Pipeline 1: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/src/main/java/com/google/cloud/teleport/templates/DatastoreToText.java
** Pipeline 2: we wrote custom Beam DoFn transforms in Python (+ sqlalchemy) and built a pipeline to ingest data into Postgres with error handling.
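
The normalize-and-batch step of the second pipeline can be illustrated with a small plain-Python sketch. It is not the speakers' pipeline: it uses neither Beam nor sqlalchemy, and the function names `normalize` and `batches`, the underscore-joined column naming, and the sample document are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def normalize(raw: str) -> Dict[str, Any]:
    """Flatten one nested JSON document into a single flat row (column -> value)."""
    row: Dict[str, Any] = {}

    def walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            for name, child in value.items():
                walk(f"{prefix}_{name}" if prefix else name, child)
        else:
            # Scalars and lists are stored as-is in this sketch; a real
            # relational schema would likely move lists into child tables.
            row[prefix] = value

    walk("", json.loads(raw))
    return row


def batches(rows: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group rows into lists of at most `size` so each list can be one bulk insert."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # Final, possibly smaller, batch.


# Hypothetical document, used only to show the shape of the output.
doc = '{"id": 7, "user": {"name": "ada", "address": {"city": "Paris"}}}'
print(normalize(doc))
```

Batching matters here because one insert statement per row would make the database round trip the bottleneck; grouping rows lets each write carry many rows at once.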

13:30 - 14:20
By Bipin Upadhyaya
Room: Mariposa Grove
Accelerating CDC Data Ingestion with Apache Beam: A Qlik-to-BigQuery Journey

This talk unveils our design journey to streamline the ingestion of CDC changes into a data warehouse, enabling rapid data availability for users. We leverage Qlik to stream CDC events to Kafka, harness Dataflow’s processing power, and store the transformed data in BigQuery for efficient analysis.

We’ll walk through our iterative design process, showcasing how Apache Beam’s flexibility allowed us to address business requirements. We’ll highlight key architectural decisions, performance optimizations, and lessons learned along the way.

This blueprint serves as a valuable resource for others seeking to simplify their CDC ingestion pipelines and accelerate time-to-insight for their data-driven initiatives.

13:30 - 14:20
By Byron Ellis
Room: Walker Canyon
Implementing a Beam SDK: A Deep Dive into the Swift SDK

In this session we’ll take a deep dive into implementing a Beam SDK for a new language, using Swift as an example. We’ll cover both the internal and the external sides, from implementing FnApi with Swift’s surprisingly robust gRPC support to using Swift’s modern type system to provide a uniquely Swift way of expressing the Beam programming model.

13:30 - 14:20
By Damon Douglas
Room: Hamina (MP4)
Processing Data from a Web API: A step by step guide

This talk is for you if you are considering writing custom code to read from and write to a Web API, for which a solution does not yet exist. This session demonstrates step by step how to read from and write to Web APIs using the Beam Java and Python SDKs.

Web API providers design the service for application workloads, presenting unique challenges for large scale parallelized workloads such as Beam. Challenges addressed will include error handling, retry with backoff, and caching.
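
Two of the challenges named above, retry with backoff and caching, can be sketched in plain Python without any Beam SDK. This is only an illustration of the general techniques, not the approach shown in the session: `TransientError`, `call_with_backoff`, `CachedClient`, and the `fetch` callable are hypothetical names, and the delays and attempt counts are arbitrary.

```python
import random
import time
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, throttling)."""


def call_with_backoff(fetch: Callable[[str], str], key: str, max_attempts: int = 5,
                      base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(key), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # The delay doubles on each attempt; random jitter keeps many
            # parallel workers from retrying at the same instant.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise ValueError("max_attempts must be at least 1")


class CachedClient:
    """Caches successful responses so repeated keys do not hit the API again."""

    def __init__(self, fetch: Callable[[str], str], **retry_kwargs) -> None:
        self._fetch = fetch
        self._retry_kwargs = retry_kwargs
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key not in self._cache:
            self._cache[key] = call_with_backoff(self._fetch, key, **self._retry_kwargs)
        return self._cache[key]
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real waiting. In a parallel pipeline, each worker would hold its own cache unless a shared store is used.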

14:30 - 14:55
By Danny McCormick
Room: Mariposa Grove
How Beam ML Optimizes Serving Large Models

Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool to do this. At the same time, models are getting larger and larger, and it can be hard to fit them into your CPU or GPU.

This talk will explore some of the mechanisms that Beam has put in place for large model management so that it can serve your models efficiently without requiring any additional work from the pipeline author. Attendees can expect to come away with an understanding of how Beam loads and serves models, how it optimizes its serving architecture for different model sizes/footprints, and how they can use Beam to serve their models (large or small).

14:30 - 14:55
By Jeff Kinard
Room: Walker Canyon
Introduction to Beam YAML

The purpose of this session will be to introduce Beam YAML and its core capabilities. These include, but are not limited to:

  • Basic syntax for declaring pipeline spec and pipeline options
  • Mapping transforms and UDFs
  • Simple aggregations
  • Other turnkey transforms (LogForTesting, IOs, etc.)

The session will wrap up with some use-case examples and how to run the YAML pipelines on Google Cloud Dataflow.

14:30 - 14:55
By Yi Hu
Room: Hamina (MP4)
Throttling Detection and Reactive Worker Downscaling

Rate limiting and quota exhaustion are common issues for data processing pipelines running at scale. Overprovisioning the worker pool while the external resource is throttled not only increases cost but also puts additional pressure on that resource. This talk introduces recent improvements that tackle this issue, from tracking throttled states in a Beam pipeline to acting on these signals on the runner (particularly Dataflow) side. Finally, it explores how users can onboard their custom IO connectors to the throttling detection features.

15:00 - 15:25
By Ferran Fernandez & Austin Bennett
Room: Walker Canyon
Beam YAML and Protobuf

In this session, we’ll explore the transformative integration of Beam YAML with Protobuf, unlocking new possibilities within Apache Beam. Delve into practical applications and benefits, and gain insights from our journey of harnessing Apache Beam’s capabilities.

15:00 - 15:25
By Konstantin Buschmeier, Jasper Van den Bossche & Iris Luden
Room: Mariposa Grove
Multi-Modal LLM Data Processing with Apache Beam

Large language models are well known for their performance on generation tasks like summarization, but they also excel at many classical tasks like classification, named-entity recognition, or information extraction. Multi-modal LLMs similarly achieve state-of-the-art performance on document understanding. This makes them vital for modern data processing pipelines.

Apache Beam is a powerful framework to define and execute batch and streaming data processing pipelines. Recent releases introduced many tools to facilitate machine learning workflows like ML Transforms, RunInference, and Enrichment transform.

In this talk we will introduce an application that combines Beam’s ML capabilities and LLMs to extract product requests from various document types of customer emails to facilitate the automatic fulfillment of orders.

15:00 - 15:25
By Valentyn Tymofieiev
Room: Hamina (MP4)
Troubleshooting Python pipelines with process monitoring tools

In an ideal scenario, a data processing pipeline performs without issues. When a runtime processing error occurs, Beam normally surfaces the error to the runner. However, in some cases, the process running the user code might run out of memory, get stuck, or crash. This can prevent it from reporting the error, leaving the user unaware of the failure’s root cause. In this talk, I’ll discuss troubleshooting techniques for these situations. The techniques I cover can also be applied to debugging other Python applications.

16:00 - 16:25
By Ahmed Abualsaud
Room: Walker Canyon
Breaking the Language Barrier: Easy Cross-Language with Generated Python Wrappers

Harness the power of cross-language transforms by combining the best of Java and Python in your data processing workflows. Discover how to seamlessly integrate Java transforms into your Python pipelines using the SDK’s newest utilities.

With just a few lines, you can also automatically generate well-documented, SDK-ready Python wrappers for existing Java transforms.

16:00 - 16:25
By Hai Sadon
Room: Mariposa Grove
Real-Time Fraud Prevention with Apache Beam

In this session, we will explore how Apache Beam can be leveraged to create a robust real-time fraud prevention system. Drawing from real-world implementations at Transmit Security, the presentation will cover the architecture and components of our Detection and Response solution. Attendees will learn about the challenges of analyzing high volumes of data in real-time, and how we utilize a combination of data collection points, enriched data pipelines, and stateful aggregation engines.

We will discuss our transition to using Google BigTable for low-latency, high-throughput data management, highlighting its role in enhancing the system’s performance.

Additionally, the session will focus on two unique flows within our system:

  1. Running machine learning models that predict fraud in real time.
  2. Implementing a feedback loop that takes the results from the stream and re-ingests them back into the pipeline to continuously improve detection accuracy.
16:00 - 16:25
By Matt Mays
Room: Bonsai
The SolaceIO connector: how was it made and why

Together with Solace, we have developed a new native streaming connector for Solace, a popular messaging platform used in manufacturing, finance and many other industries.

Solace has different APIs for different purposes (moving data around, managing queues, etc.) that can be leveraged together to create a Beam connector with accurate and timely backlog and watermark estimations.

Developed by Solace and Google in collaboration with a customer, this connector is an example of cross-industry collaboration for the benefit of all Apache Beam users.

In this talk we explain how Solace works, how we made it work with Beam with high throughput and low latency, and what lessons can be learnt for the design of complex streaming connectors for Beam.

16:00 - 16:25
By Olu Akinlaja
Room: Hamina (MP4)
using pub/subIO writeMessageDynamic() function in a Python pipeline to use dynamic topic destination

Dynamic topic destinations, enabled by the pub/subIO writeMessageDynamic() function in a Java Dataflow pipeline, are an interesting feature that seems to be available only in the Apache Beam Java SDK. This talk showcases a workaround for Python pipelines that uses the pub/subIO writeMessageDynamic() function as an external transform.

16:30 - 16:55
By Lydian Lee
Room: Walker Canyon
Improving Stability for Running Python SDK with Flink Runner

In this session, we will explore our journey to improve the stability of our Flink application using the Python Beam SDK runner, with a particular focus on memory tuning. Our initial setup faced significant challenges, including frequent task manager disconnections and ambiguous error logs, often hinting at out-of-memory (OOM) issues. Despite no clear indicators of high memory usage, the instability worsened after transitioning from the Lyft K8s operator to the Apache Flink operator.

Key points include:

  • Initial Setup Challenges: Both the Python worker harness and the Flink task manager ran in the same container, leading to frequent disconnections.
  • Diagnosing the Problem: Despite no high overall memory usage, the task manager frequently reported being unavailable, suggesting potential OOM issues.
  • Operator Differences: The Lyft K8s operator reserved 20% of memory for the system, while the Apache Flink operator allocated all memory to taskmanager.memory.process.size, causing OOM on the Python worker harness due to lack of reserved system memory.
  • Solution Implementation: Separating the Python worker harness into a separate container and using external to connect to Python, resulting in enhanced stability.
  • Additional Benefits: Improved resource utilization and flexibility by assigning specific memory requests and limits to the sidecar container running Python.
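
The operator difference in the list above comes down to simple arithmetic, which the following sketch makes concrete. Only the 20% figure comes from the abstract; the function name `split_container_memory`, the MiB units, and the 4096 MiB container size are assumptions chosen for illustration and do not describe either operator's actual code.

```python
from typing import Tuple


def split_container_memory(container_limit_mib: int,
                           reserved_percent: int = 20) -> Tuple[int, int]:
    """Return (flink_process_size_mib, headroom_mib) for one container."""
    if not 0 <= reserved_percent < 100:
        raise ValueError("reserved_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding surprises.
    process_size = container_limit_mib * (100 - reserved_percent) // 100
    return process_size, container_limit_mib - process_size


# Hypothetical 4096 MiB container shared by Flink and the Python worker harness.
print(split_container_memory(4096))     # 20% reserved -> (3276, 820)
print(split_container_memory(4096, 0))  # nothing reserved -> (4096, 0)
```

With no headroom, any memory the Python worker harness needs pushes the container past its limit, which matches the OOM behavior described for the shared-container setup.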
16:30 - 16:55
By John Casey
Room: Hamina (MP4)
Using Dead Letter Queues with Beam

In this session, you will learn what we consider a Dead Letter in Beam, the high level DLQ architecture we’ve implemented, and some example use cases on how to incorporate DLQs in your pipelines.
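
As background, the general dead letter pattern can be sketched in a few lines of plain Python. This is not the DLQ architecture presented in the session, and it does not use any Beam API; the function name `process_with_dead_letters` and the choice to record the error as a string are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_with_dead_letters(
    records: Iterable[T], fn: Callable[[T], R]
) -> Tuple[List[R], List[Tuple[T, str]]]:
    """Apply fn to each record, routing failures to a dead letter list.

    Returns (results, dead_letters); each dead letter keeps the original
    record and a description of the error so it can be inspected or replayed.
    """
    results: List[R] = []
    dead_letters: List[Tuple[T, str]] = []
    for record in records:
        try:
            results.append(fn(record))
        except Exception as exc:
            # One bad record must not fail the whole run: set it aside.
            dead_letters.append((record, f"{type(exc).__name__}: {exc}"))
    return results, dead_letters


# Parsing integers from strings; "x" cannot be parsed and is set aside.
good, dead = process_with_dead_letters(["1", "x", "3"], int)
print(good)       # [1, 3]
print(len(dead))  # 1
```

Keeping the original record alongside the error is the key design choice: it allows failed records to be fixed and re-submitted later instead of being lost.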

16:30 - 16:55
By Reza Rokni
Room: Mariposa Grove
Using LLMs with Beam and RunInference

This session will provide an overview of how to utilize large language models (LLMs) using Apache Beam’s RunInference framework. The prime example for the talk will be running the Gemma open model on Dataflow, outlining considerations and common pitfalls when writing pipelines with LLMs.

13:30 - 14:50.
By Wei Hsia
Room: Bonsai
09/05/2024 1:30 PM 09/05/2024 2:50 PM America/Los_Angeles AS24: Multiple Input, Multiple output, Multi-Modal Inference: Streaming ML with Dataflow

Imagine when you’re attending an event, as you park your car, you get a notification telling you which entrance has the current shortest line that’s closest to your parking. Then when you check in, a reminder (only if you’ve parked), that you can get some parking validation done if you spend on concessions or merchandise as part of your membership benefit. And you have a hankering for nachos, hoping that they don’t run out. This would make your experience amazing and unique!

How can you do this, though? You need to connect your parking data to your membership data and to the current lines, all in real time. How does the event organizer easily know how busy the lines are, or predict how quickly items will run out?

You’ll build a pipeline, from scratch, that incorporates all of these things. You’ll need to put all of your Beam knowledge (or what you learn along the way!) to the test. Read multiple inputs, in real time, and make sure you’re enriching them with the right information. Write multiple outputs, in real time, to actually do something with the data. And of course, we’ll show you how you can use an LLM (or any model!) in the same pipeline.

Bonsai
Imagine when you’re attending an event, as you park your car, you get a notification telling you which entrance has the current shortest line that’s closest to your parking. Then when you check in, a reminder (only if you’ve parked), that you can get some parking validation done if you spend on concessions or merchandise as part of your membership benefit. And you have a hankering for nachos, hoping that they don’t run out.
9:00 - 9:10
Welcome
10:30 - 11:00
Morning break
12:30 - 13:30
Lunch
15:30 - 16:00
Afternoon break
17:00 - 19:30 | Event reception location: MP6, Game Room, 8th floor
09:10 - 09:40. Mariposa Grove
By Yasmeen Ahmad
Join us in this talk where Yasmeen Ahmad, Managing Director Data & Analytics at Google Cloud, shares her perspective on the innovations taking place in the data & analytics platform space.
09:45 - 10:30. Mariposa Grove
By Marc Howard
Project Shield is a service that counters distributed-denial-of-service (DDoS) attacks. It is available free of charge to eligible websites that have news, elections, and human rights related content. Project Shield helps ensure unhindered access to election-related information during global democratic processes (such as the U.S. 2022 midterm election season and many others). It also enables critical infrastructure and news websites to defend against non-stop attacks and provides crucial services and information during crises (such as the invasion of Ukraine).
11:00 - 11:25. Mariposa Grove
By Sayat Satybaldiyev & Arwin Tio
Cruise leverages Apache Beam to manage and process petabytes of data monthly, essential for our autonomous vehicle model training. This talk will delve into the innovative features we’ve developed to enhance Beam’s capabilities, including a control plane for quota and user management, a C++ sandbox for running AV ROS nodes in the cloud, and shuffling optimization techniques to compress shuffled data
11:00 - 11:25. Walker Canyon
By Sergei Lilichenko
The session covers a recently added Beam SDK extension and applicable use cases. It also describes the techniques used to implement the transform.
11:00 - 11:25. Hamina (MP4)
By Rohit Sinha
In this presentation, we delve into the critical world of data lineage within Apache Beam, exploring its significance and demonstrating its practical implementation. We begin by establishing the motivation behind data lineage, highlighting its role in enhancing data governance, debugging, and impact analysis. Next, we introduce Google Cloud Dataplex, a unified data management platform, and its integration with Beam’s lineage capabilities. We’ll then embark on a technical journey, showcasing how lineage support is built into Apache Beam’s core.
11:30 - 11:55. Mariposa Grove
By Baojun Liu
Michelangelo is Uber’s centralized machine learning platform, designed to manage ML pipelines and their associated data processing. As the demand for batch predictions grows, the need for a flexible and efficient processing framework becomes imperative. This presentation explores Uber Michelangelo’s batch prediction processes, focusing on data processing, model prediction, and the transition from Spark to Ray. At Michelangelo, data preparation and feature transformation are traditionally handled using Spark data transforms. The model prediction step involves a user-defined function within a Spark pipeline model.
11:30 - 11:55. Walker Canyon
By Shunping Huang
This talk introduces ordered list state, from concept to implementation.
11:30 - 12:20. Hamina (MP4)
By Robert Burke
Prism is a local portable Beam Runner intended to assist end user and SDK developers alike, by providing a common platform for all existing and new Beam SDKs. Beam is unique among data processing systems in how it divides the user facing SDK from the execution engine. Jobs can be authored in one SDK and executed on Flink, Spark, or on cloud services, like Dataflow. But these external systems can make prototyping, testing, and debugging complicated.
12:00 - 12:25. Mariposa Grove
By Lakshmanan Arumugam
We migrated ~5 TB (~2 billion rows) of a production NoSQL DB to a Postgres DB without application downtime. This involved data export from Cloud Datastore (NoSQL DB), normalization (from JSON to SQL schema), and then data import into Postgres. We used two different Apache Beam pipelines in the process:

  • Exporting data from Cloud Datastore: for this we used a Google-developed Dataflow template (written in Java).
  • Normalizing and importing into Postgres: we wrote our own custom Apache Beam pipeline that transforms every JSON row into a Postgres-compatible schema and batches the normalized rows for ingestion into Postgres (written in Python).
12:00 - 12:25. Walker Canyon
By Alberto López Serna
This talk presents a StateBaseAsyncDoFn.java class and a full production SCIO implementation that uses State and Timers with HTTP clients to prevent duplicate requests, along with other aggregation use cases for asynchronous endpoints.
13:30 - 14:20. Mariposa Grove
By Bipin Upadhyaya
This talk unveils our design journey to streamline the ingestion of CDC changes into a data warehouse, enabling rapid data availability for users. We leverage Qlik to stream CDC events to Kafka, harness Dataflow’s processing power, and store the transformed data in BigQuery for efficient analysis. We’ll walk through our iterative design process, showcasing how Apache Beam’s flexibility allowed us to address business requirements. We’ll highlight key architectural decisions, performance optimizations, and lessons learned along the way.
13:30 - 14:20. Walker Canyon
By Byron Ellis
In this session we’ll take a deep dive into implementing a Beam SDK for a new language, using Swift as an example. We’ll cover both the internals and the externals, from implementing the FnApi using Swift’s surprisingly robust gRPC support to how we use Swift’s modern type system to provide a uniquely Swift way of expressing the Beam programming model.
13:30 - 14:20. Hamina (MP4)
By Damon Douglas
This talk is for you if you are considering writing custom code to read from and write to a Web API, for which a solution does not yet exist. This session demonstrates step by step how to read from and write to Web APIs using the Beam Java and Python SDKs. Web API providers design the service for application workloads, presenting unique challenges for large scale parallelized workloads such as Beam.
13:30 - 14:50. Bonsai
By Wei Hsia
Imagine when you’re attending an event, as you park your car, you get a notification telling you which entrance has the current shortest line that’s closest to your parking. Then when you check in, a reminder (only if you’ve parked), that you can get some parking validation done if you spend on concessions or merchandise as part of your membership benefit. And you have a hankering for nachos, hoping that they don’t run out.
14:30 - 14:55. Mariposa Grove
By Danny McCormick
Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool to do this. At the same time, models are getting larger and larger, and it can be hard to fit them into your CPU or GPU. This talk will explore some of the mechanisms that Beam has put in place for large model management so that it can serve your models efficiently without requiring any additional work from the pipeline author.
14:30 - 14:55. Walker Canyon
By Jeff Kinard
The purpose of this session will be to introduce Beam YAML and its core capabilities. These include, but are not limited to:

  • Basic syntax for declaring pipeline spec and pipeline options
  • Mapping transforms and UDFs
  • Simple aggregations
  • Other turnkey transforms (LogForTesting, IOs, etc.)

The session will wrap up with some use-case examples and how to run the YAML pipelines on Google Cloud Dataflow.
14:30 - 14:55. Hamina (MP4)
By Yi Hu
Rate limiting and quota exhaustion are common issues for a data processing pipeline running at scale. Overprovisioning the worker pool when the external resource is throttled not only increases cost but also puts additional pressure on that resource. This talk introduces recent improvements for tackling this issue, from tracking throttled states in a Beam pipeline to acting on these signals from the runner (particularly Dataflow) side.
15:00 - 15:25. Mariposa Grove
By Konstantin Buschmeier, Jasper Van den Bossche & Iris Luden
Large language models are well known for their performance on generation tasks like summarization but they also excel at many classical tasks like classification, named-entity recognition, or information extraction. Multi-modal LLMs similarly achieve state of the art performance on document understanding. This makes them vital for modern data processing pipelines. Apache Beam is a powerful framework to define and execute batch and streaming data processing pipelines. Recent releases introduced many tools to facilitate machine learning workflows like ML Transforms, RunInference, and Enrichment transform.
15:00 - 15:25. Walker Canyon
By Ferran Fernandez & Austin Bennett
In this session, we’ll explore the transformative integration of Beam YAML with Protobuf, unlocking new possibilities within Apache Beam. Delve into practical applications and benefits, and gain insights from our journey of harnessing Apache Beam’s capabilities.
15:00 - 15:25. Hamina (MP4)
By Valentyn Tymofieiev
In an ideal scenario, a data processing pipeline performs without issues. When a runtime processing error occurs, normally Beam surfaces the error to the runner. However in some cases, the process running the user code might run out of memory, get stuck or crash. This can prevent it from reporting the error, leaving the user unaware of the failure’s root cause. In this talk, I’ll discuss troubleshooting techniques for these situations.
16:00 - 16:25. Mariposa Grove
By Hai Sadon
In this session, we will explore how Apache Beam can be leveraged to create a robust real-time fraud prevention system. Drawing from real-world implementations at Transmit Security, the presentation will cover the architecture and components of our Detection and Response solution. Attendees will learn about the challenges of analyzing high volumes of data in real-time, and how we utilize a combination of data collection points, enriched data pipelines, and stateful aggregation engines.
16:00 - 16:25. Walker Canyon
By Ahmed Abualsaud
Harness the power of cross-language transforms by combining the best of Java and Python in your data processing workflows. Discover how to seamlessly integrate Java transforms into your Python pipelines using the SDK’s newest utilities. With just a few lines, you can also automatically generate well-documented, SDK-ready Python wrappers for existing Java transforms.
16:00 - 16:25. Hamina (MP4)
By Olu Akinlaja
Enabling dynamic topic destinations with the PubsubIO writeMessagesDynamic() function in a Java Dataflow pipeline is a useful feature that appears to be available only in the Apache Beam Java SDK. This talk showcases a workaround implementation that uses the PubsubIO writeMessagesDynamic() function as an external transform.
16:00 - 16:25. Bonsai
By Matt Mays
Together with Solace, we have developed a new native streaming connector for Solace, a popular messaging platform used in manufacturing, finance and many other industries. Solace has different APIs for different purposes (moving data around, managing queues, etc.) that can be leveraged together to create a Beam connector with accurate and timely backlog and watermark estimations. Developed by Solace and Google in collaboration with a customer, this connector is an example of cross-industry collaboration for the benefit of all Apache Beam users.
16:30 - 16:55. Mariposa Grove
By Reza Rokni
This session will provide an overview of how to utilize large language models (LLMs) using Apache Beam’s RunInference framework. The prime example for the talk will be running the Gemma open model on Dataflow, outlining considerations and common pitfalls when writing pipelines with LLMs.
16:30 - 16:55. Walker Canyon
By Lydian Lee
16:30 - 16:55. Hamina (MP4)
By John Casey
In this session, you will learn what we consider a Dead Letter in Beam, the high level DLQ architecture we’ve implemented, and some example use cases on how to incorporate DLQs in your pipelines.

Thursday, September 05, 2024

09:00 - 9:40
09:45 - 10:15
10:30 - 11:00
Morning break
11:00 - 11:25
11:30 - 11:55.
12:00 - 12:25
Lunch
13:30 - 13:55
14:00 - 14:25
14:30 - 14:55
Closing remarks
09:00 - 09:40.
By Uday Kalra
Room: Mariposa Grove
09/05/2024 9:00 AM 09/05/2024 9:40 AM America/Los_Angeles AS24: Beam for Large-Scale, Accelerated ML Inference at Google

Bulk Inference in Machine Learning (ML) refers to the challenge of how to organize and compute model predictions for a large pool of available input data with no latency requirements. JAX is an open-source computation library commonly used by both engineers and researchers for flexible, high-performance ML development. This talk will illustrate how teams at Google are using Beam to ergonomically design, orchestrate, and scale JAX Bulk Inference workloads across various accelerator platforms.

Mariposa Grove
Bulk Inference in Machine Learning (ML) refers to the challenge of how to organize and compute model predictions for a large pool of available input data with no latency requirements. JAX is an open-source computation library commonly used by both engineers and researchers for flexible, high-performance ML development. This talk will illustrate how teams at Google are using Beam to ergonomically design, orchestrate, and scale JAX Bulk Inference workloads across various accelerator platforms.
09:45 - 10:15.
By Prakash Chockalingam
Room: Mariposa Grove
09/05/2024 9:45 AM 09/05/2024 10:15 AM America/Los_Angeles AS24: Lessons Learned from MLOps for GenAI at Google Scale

The rise of Generative AI (GenAI) has revolutionized Machine Learning (ML) Development Workflows. Starting from large pre-trained models introduces new challenges in resource management, large model evaluation (both human and automated), efficient bulk inference, and effective handling of massive model embeddings. This talk distills key lessons and best practices from Google’s experience in deploying GenAI models at scale, focusing on the adaptation and evolution of MLOps principles to tackle the unique demands of this emerging field.

Mariposa Grove
The rise of Generative AI (GenAI) has revolutionized Machine Learning (ML) Development Workflows. Starting from large pre-trained models introduces new challenges in resource management, large model evaluation (both human and automated), efficient bulk inference, and effective handling of massive model embeddings. This talk distills key lessons and best practices from Google’s experience in deploying GenAI models at scale, focusing on the adaptation and evolution of MLOps principles to tackle the unique demands of this emerging field.
11:00 - 11:25.
By Robert Burke
Room: Walker Canyon
09/05/2024 11:00 AM 09/05/2024 11:25 AM America/Los_Angeles AS24: Beam SDKs Don't Have to Look the Same

Do we even need PCollections? Or ProcessElements? Can we have the language fully typecheck the pipeline for us at compile time? Can we do that in Go?

Since Beam was designed, programming languages have continued to evolve and change, so why can’t our SDKs? We’ve now got ample experience with the Apache Beam Go SDK, but the language it was designed for is now very different.

This short talk will compare the current Go SDK with an experimental implementation that takes better advantage of the current strengths of Go, and its approach to generic type parameters, and more.

Walker Canyon
Do we even need PCollections? Or ProcessElements? Can we have the language fully typecheck the pipeline for us at compile time? Can we do that in Go? Since Beam was designed, programming languages have continued to evolve and change, so why can’t our SDKs? We’ve now got ample experience with the Apache Beam Go SDK, but the language it was designed for is now very different. This short talk will compare the current Go SDK with an experimental implementation that takes better advantage of the current strengths of Go, and its approach to generic type parameters, and more.
11:00 - 11:25.
By Alberto López Serna
Room: Hamina (MP4)
09/05/2024 11:00 AM 09/05/2024 11:25 AM America/Los_Angeles AS24: Drools ParDo and SCIO: a goodbye microservices tale

In a successful banking re-engineering project, a Drools ParDo using a KieContainer (JBoss) in Dataflow processes a business rules.jar, merging several Spring microservices into Dataflow. This design, together with another Dataflow app and the pattern adopted in “Avoid HTTP requests duplicates in Apache Beam with SCIO, a custom BaseAsyncDoFn and State and Timers”, has made it possible to get rid of all microservices and infrastructure thanks to Apache Beam and Dataflow.

Hamina (MP4)
In a successful banking re-engineering project, a Drools ParDo using a KieContainer (JBoss) in Dataflow processes a business rules.jar, merging several Spring microservices into Dataflow. This design, together with another Dataflow app and the pattern adopted in “Avoid HTTP requests duplicates in Apache Beam with SCIO, a custom BaseAsyncDoFn and State and Timers”, has made it possible to get rid of all microservices and infrastructure thanks to Apache Beam and Dataflow.
11:00 - 11:25.
By Rajkumar Gupta
Room: Mariposa Grove
09/05/2024 11:00 AM 09/05/2024 11:25 AM America/Los_Angeles AS24: Troubleshooting Beam/Dataflow ML Pipelines Related Common Issues

The presentation will cover troubleshooting RunInference, GPU, OOM, and other related errors, offering practical insights from real-world customer support experiences.

Mariposa Grove
11:30 - 11:55.
By Ihaffa Murtopo
Room: Hamina (MP4)
09/05/2024 11:30 AM 09/05/2024 11:55 AM America/Los_Angeles AS24: At Least Once Streaming vs Exactly Once: Cost saving vs data accuracy

Dataflow has historically supported only Exactly Once processing, but it has just released an At Least Once mode. Understand each architecture, its pros and cons, and the use cases in which one is better than the other.
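The trade-off can be illustrated with a small plain-Python sketch (this is not Dataflow code). Under at-least-once delivery a record may be processed more than once, so a non-idempotent aggregation such as a running sum can overcount, while an idempotent write keyed by a record ID is unaffected. The record layout and function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (record_id, amount); hypothetical layout


def running_sum(deliveries: List[Record]) -> int:
    """Non-idempotent: every delivery, including duplicates, is added."""
    return sum(amount for _, amount in deliveries)


def upsert_by_id(deliveries: List[Record]) -> Dict[str, int]:
    """Idempotent: re-delivering a record overwrites the same key."""
    table: Dict[str, int] = {}
    for record_id, amount in deliveries:
        table[record_id] = amount
    return table


# Record "b" is delivered twice, as at-least-once processing permits.
deliveries = [("a", 5), ("b", 7), ("b", 7), ("c", 1)]

overcounted = running_sum(deliveries)
deduplicated = sum(upsert_by_id(deliveries).values())
print(overcounted, deduplicated)
```

Pipelines whose sinks are idempotent, or whose consumers tolerate occasional duplicates, are often natural candidates for an at-least-once mode.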

Hamina (MP4)
Dataflow has historically supported only Exactly Once processing, but it has just released an At Least Once mode. Understand each architecture, its pros and cons, and the use cases in which one is better than the other.
11:30 - 11:55.
By Olufunbi Babalola
Room: Mariposa Grove
09/05/2024 11:30 AM 09/05/2024 11:55 AM America/Los_Angeles AS24: BeamStack: An open source Framework for running Machine Learning Pipelines with Apache Beam

We introduce you to Beamstack, an open-source framework currently under development, aimed at facilitating the deployment of Machine Learning and GenAI workflow pipelines with Apache Beam on Kubernetes, whether on-premises or in the cloud. It encompasses a holistic solution, featuring abstraction layers that optimize the deployment of various components of machine learning pipelines, data processing workflows, and deployment infrastructure.

At the core of Beamstack’s functionalities lie Kubernetes Custom Resource Definitions (CRDs). These CRDs constitute a potent mechanism for extending the Kubernetes API, facilitating the seamless integration of ML-centric resources within the Kubernetes ecosystem. With this approach, Beamstack empowers users to capitalize on the comprehensive capabilities and features offered by Kubernetes while unlocking the potential of Apache Beam for Machine Learning development by various teams in any organization.

In this session, we will discuss BeamStack’s use cases and product roadmap, as well as features that have already been implemented, those currently under implementation, and those planned for the future. We will also address our current challenges and areas where we need support from contributors. Join us in shaping the future of ML development tooling around Apache Beam by becoming a part of the Beamstack community.

Mariposa Grove
We introduce you to Beamstack, an open-source framework currently under development, aimed at facilitating the deployment of Machine Learning and GenAI workflow pipelines with Apache Beam on Kubernetes, whether on-premises or in the cloud. It encompasses a holistic solution, featuring abstraction layers that optimize the deployment of various components of machine learning pipelines, data processing workflows, and deployment infrastructure. At the core of Beamstack’s functionalities lie Kubernetes Custom Resource Definitions (CRDs).
11:30 - 11:55.
By Sadeeq Akintola
Room: Walker Canyon
09/05/2024 11:30 AM 09/05/2024 11:55 AM America/Los_Angeles AS24: Reuniting the Two Distant Cousins: Calling a Beam Pipeline from an Airflow Job

Apache Beam and Apache Airflow are powerful tools in the data engineering ecosystem, often used separately but rarely in tandem. This talk explores the synergy between these “distant cousins” by demonstrating how to seamlessly integrate Beam pipelines within Airflow workflows.

We’ll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow’s scheduling capabilities with Beam’s robust data processing framework can create a more efficient and manageable data pipeline architecture.

Attendees will learn how to leverage Airflow’s DAG (Directed Acyclic Graph) to trigger Beam jobs seamlessly, enabling them to orchestrate sophisticated, distributed data processing tasks across data platforms, such as Google Cloud Dataflow. By the end of this session, participants will gain practical insights into integrating these technologies, enhancing their ability to build and maintain resilient, efficient data pipelines that meet the demands of modern data-driven applications.

Walker Canyon
Apache Beam and Apache Airflow are powerful tools in the data engineering ecosystem, often used separately but rarely in tandem. This talk explores the synergy between these “distant cousins” by demonstrating how to seamlessly integrate Beam pipelines within Airflow workflows. We’ll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow’s scheduling capabilities with Beam’s robust data processing framework can create a more efficient and manageable data pipeline architecture.
12:00 - 12:25.
By Charles Adetiloye & Nate Salawe
Room: Mariposa Grove
09/05/2024 12:00 PM 09/05/2024 12:25 PM America/Los_Angeles AS24: A Low Code Structured Approach to Deploying Apache Beam ML Workloads on Kubernetes using BeamStack

We have developed BeamStack, a toolkit tailored to enhance the deployment and management processes of ML and GenAI workloads on Kubernetes. It diminishes the intrinsic complexities associated with the managing of Kubernetes clusters and the deployment of Apache Beam workloads. Fundamentally, BeamStack leverages Beam YAML, a structured format that enables the declarative definition of pipelines. This facilitates rapid deployment and scalability across diverse Kubernetes environments, encompassing cloud-based solutions and local clusters such as Minikube.

BeamStack distinguishes itself through its proficiency in orchestrating AI pipelines within Kubernetes environments. It provides streamlined workflows that enhance the efficiency of setup, deployment, and management processes. Moreover, BeamStack seamlessly integrates with monitoring tools such as Prometheus and Grafana. Its overarching objective is to democratize the deployment of AI workloads with Apache Beam on Kubernetes, empowering users with the confidence to deploy seamlessly while optimizing performance. Our platform’s user-friendly design simplifies the process, making the deployment of Apache Beam jobs universally attainable.

In this talk, we’ll delve into the development journey of BeamStack, a toolkit crafted to simplify the deployment and management of Apache Beam ML workloads on Kubernetes. We’ll explore the motivations behind BeamStack’s creation, the challenges it addresses, and the key components that make it a powerful tool for AI workload deployment.

Mariposa Grove
We have developed BeamStack, a toolkit tailored to enhance the deployment and management processes of ML and GenAI workloads on Kubernetes. It diminishes the intrinsic complexities associated with the managing of Kubernetes clusters and the deployment of Apache Beam workloads. Fundamentally, BeamStack leverages Beam YAML, a structured format that enables the declarative definition of pipelines. This facilitates rapid deployment and scalability across diverse Kubernetes environments, encompassing cloud-based solutions and local clusters such as Minikube.
12:00 - 12:25.
By Sharan Teja Malyala
Room: Hamina (MP4)
09/05/2024 12:00 PM 09/05/2024 12:25 PM America/Los_Angeles AS24: Cost Effective Solutions for Beam pipelines in Dataflow

This talk will explain some of the capabilities of Dataflow that can help users save costs:

  1. Dynamic Thread Scaling: Dataflow tries to maximize the worker utilisation with this feature and avoids using up more workers.
  2. Right fitting: Customize the resources at stage level and efficiently use the worker resources.
  3. Streaming autoscaling: Tune autoscaling behavior to save costs.
Hamina (MP4)
This talk will explain some of the capabilities of Dataflow that can help users save costs: Dynamic Thread Scaling: Dataflow tries to maximize the worker utilisation with this feature and avoids using up more workers. Right fitting: Customize the resources at stage level and efficiently use the worker resources. Streaming autoscaling: Tune autoscaling behavior to save costs.
12:00 - 12:25.
By Tom Stepp
Room: Walker Canyon
09/05/2024 12:00 PM 09/05/2024 12:25 PM America/Los_Angeles AS24: Dataflow Streaming: The evolution of real-time data processing

Level up your streaming pipelines by learning about the latest advancements in Dataflow Streaming. This includes new features such as At-Least-Once Mode, Active Load Balancing, Autoscaling Hints, and In-flight Autoscaling Updates, as well as best practices for Kafka source pipelines and ML pipelines.

Walker Canyon
Level up your streaming pipelines by learning about the latest advancements in Dataflow Streaming. This includes new features such as At-Least-Once Mode, Active Load Balancing, Autoscaling Hints, and In-flight Autoscaling Updates, as well as best practices for Kafka source pipelines and ML pipelines.
13:30 - 13:55.
By Jeff Kinard
Room: Walker Canyon
09/05/2024 1:30 PM 09/05/2024 1:55 PM America/Los_Angeles AS24: Beam YAML: Advanced topics

While the basic features of Beam YAML allow one to write simple to possibly intermediate pipelines, there are limitations on what can be developed relying solely on the built-in transforms and basic features. Luckily, Beam YAML had these users in mind during its development and offers multiple ways to leverage more advanced features of Beam to implement these sophisticated use-cases.

The purpose of this session will be to dive into the more advanced features Beam YAML has to offer. Topics include, but are not limited to:

  • Defining explicit transform output types
  • Advanced mapping
  • Advanced aggregation
  • Transform Providers
  • Inline Python
  • Jinja preprocessing
  • ML transforms

If time allows, there will be some use-cases demonstrating the features presented above.

This session builds upon the information presented in the introductory session and it is recommended that attendees view that session before diving into the topics presented in this session.

Walker Canyon
While the basic features of Beam YAML allow one to write simple to possibly intermediate pipelines, there are limitations on what can be developed relying solely on the built-in transforms and basic features. Luckily, Beam YAML had these users in mind during its development and offers multiple ways to leverage more advanced features of Beam to implement these sophisticated use-cases. The purpose of this session will be to dive into the more advanced features Beam YAML has to offer.
13:30 - 13:55.
By Jasper Van den Bossche & Konstantin Buschmeier
Room: Mariposa Grove
09/05/2024 1:30 PM 09/05/2024 1:55 PM America/Los_Angeles AS24: RAG Data Ingestion Using Apache Beam

Retrieval Augmented Generation (RAG) has emerged as a groundbreaking technique in the field of generative AI. By providing a large language model with relevant data to solve a given task, it can generate answers with much higher accuracy. Besides enhanced performance, RAG allows us to work with sensitive data without resource-intensive in-house model training. However, efficiently preprocessing and ingesting document data into vector databases for RAG applications can be challenging, especially when dealing with real-time updates.

In most RAG applications, relevant data is fetched from a vector database through semantic search. From experience working on RAG applications, we have noticed that building a robust data processing pipeline to keep the data in the vector database up to date can be a challenge. Beam is a particularly powerful tool for this task since it supports both batch and streaming data, allowing us to reuse the same data processing pipeline both for processing large-scale datasets of relevant information and for keeping the vector database up to date with live streaming data.

It is crucial to recognize that proper preprocessing and analysis of data are often underestimated yet fundamental components in building effective RAG applications. As the age-old adage goes, ‘garbage in, garbage out,’ emphasizing the significance of ensuring high-quality input data for optimal output results. Apache Beam can play an important role in this process, offering both batch and streaming data processing capabilities that are essential for developing production-ready RAG applications.

Mariposa Grove
Retrieval Augmented Generation (RAG) has emerged as a groundbreaking technique in the field of generative AI. By providing a large language model with relevant data to solve a given task, it can generate answers with much higher accuracy. Besides enhanced performance, RAG allows us to work with sensitive data without resource-intensive in-house model training. However, efficiently preprocessing and ingesting document data into vector databases for RAG applications can be challenging, especially when dealing with real-time updates.
13:30 - 13:55.
By Narayanan Venkiteswaran & Jinjing Bi
Room: Hamina (MP4)
09/05/2024 1:30 PM 09/05/2024 1:55 PM America/Los_Angeles AS24: Usage Billing with BEAM @ LinkedIn

Usage Billing is a billing concept where charges are based on consumption. At LinkedIn, the Usage Billing system processes large volumes of consumption data, particularly for Ads and Jobs use cases. In this presentation, we will discuss how the team rearchitected the existing usage platform to adopt a more streaming-oriented approach using Apache Beam and the Samza Runner. The new system can aggregate usage data across various dimensions and supports multiple rules to handle different aggregation windows. It also includes change data capture and correction events to ensure the high level of accuracy required when handling customers’ money.
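To make the idea of aggregating usage over windows concrete, here is a minimal plain-Python sketch; it is not LinkedIn's system and does not use Beam. It sums usage per customer per fixed window, and it models a correction event as a record with negative units, which is purely an assumption for illustration. The event layout and the name `aggregate_usage` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (customer, timestamp_seconds, units); hypothetical layout


def aggregate_usage(
    events: List[Event], window_seconds: int
) -> Dict[Tuple[str, int], int]:
    """Sum units per customer per fixed window, keyed by the window start time."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for customer, timestamp, units in events:
        window_start = timestamp - (timestamp % window_seconds)
        totals[(customer, window_start)] += units
    return dict(totals)


events = [
    ("acme", 10, 3),
    ("acme", 50, 2),
    ("acme", 70, 4),
    ("acme", 55, -1),  # assumed correction event: negative units adjust earlier usage
    ("globex", 65, 6),
]

per_minute = aggregate_usage(events, window_seconds=60)
print(per_minute)
```

Changing `window_seconds` gives a different aggregation window over the same events, which loosely mirrors the idea of multiple rules with different windows.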

14:00 - 14:50.
By Sergei Lilichenko
Room: Hamina (MP4)
09/05/2024 2:00 PM 09/05/2024 2:50 PM America/Los_Angeles AS24: Cost Optimization of Dataflow Pipelines

This session covers the cost components of Dataflow pipelines, different run configurations, common performance optimization techniques, and approaches to monitoring the cost and performance of pipelines.

14:00 - 14:25.
By Ravi Magham
Room: Walker Canyon
09/05/2024 2:00 PM 09/05/2024 2:25 PM America/Los_Angeles AS24: Realtime Forecasting using Beam

In this session, I’ll share how Lyft uses Beam’s portability framework with the Flink execution engine for real-time demand and supply forecasting. I’ll dive into our architecture, scale, and lessons learned.

14:00 - 14:25.
By Pablo Rodriguez Defino & Namita Sharma
Room: Mariposa Grove
09/05/2024 2:00 PM 09/05/2024 2:25 PM America/Los_Angeles AS24: Streaming Processing for RAG Architectures

This session is aimed at practitioners looking to use streaming technologies to process data for near-real-time RAG architectures.

For context, the session is based on this article: https://beam.apache.org/blog/dyi-content-discovery-platform-genai-beam

Agenda:

  • quick intro to RAG
  • why streaming processing
  • why beam
  • technology stack (gcp)
  • beam integrations (multilanguage, RunInference, IOs)
  • example / demo
  • close up
14:30 - 14:55.
By Surjit Singh
Room: Walker Canyon
09/05/2024 2:30 PM 09/05/2024 2:55 PM America/Los_Angeles AS24: Dataflow CI/CD

In this talk we will walk through the process of building and deploying Beam Dataflow templates.

14:30 - 14:55.
By Ayush Pandey
Room: Mariposa Grove
09/05/2024 2:30 PM 09/05/2024 2:55 PM America/Los_Angeles AS24: RAG Data Ingestion and Enrichment Pipeline using Redis and OpenSearch Vector Database in Apache Beam

This talk covers building out a data ingestion and enrichment pipeline for a Retrieval Augmented Generation (RAG) system using Apache Beam. The pipeline built a knowledge base by ingesting text data into vector databases, specifically utilizing Redis and OpenSearch for efficient semantic search and data retrieval. These databases stored vectorized representations of text chunks, allowing the system to enhance user queries by matching them with relevant text fragments. The design of the pipeline ensured scalable and effective RAG operations, leveraging the strengths of Redis and OpenSearch in vector-based search.

10:30 - 11:00
Morning break
12:30 - 13:30
Lunch
15:00 - 15:15
Closing remarks
09:00 - 09:40. Mariposa Grove
By Uday Kalra
Bulk Inference in Machine Learning (ML) refers to the challenge of how to organize and compute model predictions for a large pool of available input data with no latency requirements. JAX is an open-source computation library commonly used by both engineers and researchers for flexible, high-performance ML development. This talk will illustrate how teams at Google are using Beam to ergonomically design, orchestrate, and scale JAX Bulk Inference workloads across various accelerator platforms.
09:45 - 10:15. Mariposa Grove
By Prakash Chockalingam
The rise of Generative AI (GenAI) has revolutionized Machine Learning (ML) Development Workflows. Starting from large pre-trained models introduces new challenges in resource management, large model evaluation (both human and automated), efficient bulk inference, and effective handling of massive model embeddings. This talk distills key lessons and best practices from Google’s experience in deploying GenAI models at scale, focusing on the adaptation and evolution of MLOps principles to tackle the unique demands of this emerging field.
11:00 - 11:25. Mariposa Grove
By Rajkumar Gupta
11:00 - 11:25. Walker Canyon
By Robert Burke
Do we even need PCollections? Or ProcessElements? Can we have the language fully typecheck the pipeline for us at compile time? Can we do that in Go? Since Beam was designed, programming languages have continued to evolve and change, so why can’t our SDKs? We’ve now got ample experience with the Apache Beam Go SDK, but the language it was designed for is now very different. This short talk will compare the current Go SDK with an experimental implementation that takes better advantage of the current strengths of Go, and its approach to generic type parameters, and more.
11:00 - 11:25. Hamina (MP4)
By Alberto López Serna
In a successful re-engineering effort in banking, a Drools ParDo using a KieContainer (JBoss) in Dataflow has been used to process a business rules.jar, merging several Spring microservices into Dataflow. This design, together with another Dataflow app and the pattern adopted in “Avoid HTTP requests duplicates in Apache Beam with SCIO, a custom BaseAsyncDoFn and State and Timers”, has made it possible to get rid of all the microservices and their infrastructure thanks to Apache Beam and Dataflow.
11:30 - 11:55. Mariposa Grove
By Olufunbi Babalola
We introduce you to Beamstack, an open-source framework currently under development, aimed at facilitating the deployment of Machine Learning and GenAI workflow pipelines with Apache Beam on Kubernetes, whether on-premises or in the cloud. It encompasses a holistic solution, featuring abstraction layers that optimize the deployment of various components of machine learning pipelines, data processing workflows, and deployment infrastructure. At the core of Beamstack’s functionalities lie Kubernetes Custom Resource Definitions (CRDs).
11:30 - 11:55. Walker Canyon
By Sadeeq Akintola
Apache Beam and Apache Airflow are powerful tools in the data engineering ecosystem, often used separately but rarely in tandem. This talk explores the synergy between these “distant cousins” by demonstrating how to seamlessly integrate Beam pipelines within Airflow workflows. We’ll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow’s scheduling capabilities with Beam’s robust data processing framework can create a more efficient and manageable data pipeline architecture.
11:30 - 11:55. Hamina (MP4)
By Ihaffa Murtopo
Dataflow has always supported only Exactly Once processing, but it has just released an At Least Once mode. Understand each architecture, its pros and cons, and the use cases in which one is better than the other.
12:00 - 12:25. Mariposa Grove
By Charles Adetiloye & Nate Salawe
We have developed BeamStack, a toolkit tailored to enhance the deployment and management of ML and GenAI workloads on Kubernetes. It reduces the intrinsic complexities of managing Kubernetes clusters and deploying Apache Beam workloads. Fundamentally, BeamStack leverages Beam YAML, a structured format that enables the declarative definition of pipelines. This facilitates rapid deployment and scalability across diverse Kubernetes environments, encompassing cloud-based solutions and local clusters such as Minikube.
12:00 - 12:25. Walker Canyon
By Tom Stepp
Level up your streaming pipelines by learning about the latest advancements in Dataflow Streaming. This includes new features such as At-Least-Once Mode, Active Load Balancing, Autoscaling Hints, and In-flight Autoscaling Updates, as well as best practices for Kafka source pipelines and ML pipelines.
12:00 - 12:25. Hamina (MP4)
By Sharan Teja Malyala
This talk will explain some of the capabilities of Dataflow that can help users save costs. Dynamic Thread Scaling: Dataflow tries to maximize worker utilisation with this feature and avoids using more workers than needed. Right fitting: customize resources at the stage level to use worker resources efficiently. Streaming autoscaling: tune autoscaling behavior to save costs.
13:30 - 13:55. Mariposa Grove
By Jasper Van den Bossche & Konstantin Buschmeier
Retrieval Augmented Generation (RAG) has emerged as a groundbreaking technique in the field of generative AI. By providing a large language model with relevant data to solve a given task, it can generate answers with much higher accuracy. Besides enhanced performance, RAG allows us to work with sensitive data without resource-intensive in-house model training. However, efficiently preprocessing and ingesting document data into vector databases for RAG applications can be challenging, especially when dealing with real-time updates.
13:30 - 13:55. Walker Canyon
By Jeff Kinard
While the basic features of Beam YAML allow one to write simple to possibly intermediate pipelines, there are limitations on what can be developed relying solely on the built-in transforms and basic features. Luckily, Beam YAML had these users in mind during its development and offers multiple ways to leverage more advanced features of Beam to implement these sophisticated use-cases. The purpose of this session will be to dive into the more advanced features Beam YAML has to offer.
13:30 - 13:55. Hamina (MP4)
By Narayanan Venkiteswaran & Jinjing Bi
Usage Billing is a billing concept where charges are based on consumption. At LinkedIn, the Usage Billing system processes a large volume of consumption data, particularly for Ads and Jobs use cases. In this presentation, we will discuss how the team rearchitected the existing usage platform to adopt a more streaming-oriented approach using Apache Beam and the Samza runner. The new system can aggregate usage data across various dimensions and supports multiple rules to handle different aggregation windows.
14:00 - 14:25. Mariposa Grove
By Pablo Rodriguez Defino & Namita Sharma
This session is aimed at practitioners looking to use streaming technologies to process data for near-real-time RAG architectures. For context, the session is based on this article: https://beam.apache.org/blog/dyi-content-discovery-platform-genai-beam Agenda: quick intro to RAG; why streaming processing; why beam; technology stack (gcp); beam integrations (multilanguage, RunInference, IOs); example / demo; close up.
14:00 - 14:25. Walker Canyon
By Ravi Magham
In this session, I’ll share how Lyft uses Beam’s portability framework with the Flink execution engine for real-time demand and supply forecasting. I’ll dive into our architecture, scale, and lessons learned.
14:00 - 14:50. Hamina (MP4)
By Sergei Lilichenko
This session covers the cost components of Dataflow pipelines, different run configurations, common performance optimization techniques, and approaches to monitoring the cost and performance of pipelines.
14:30 - 14:55. Mariposa Grove
By Ayush Pandey
This talk covers building out a data ingestion and enrichment pipeline for a Retrieval Augmented Generation (RAG) system using Apache Beam. The pipeline built a knowledge base by ingesting text data into vector databases, specifically utilizing Redis and OpenSearch for efficient semantic search and data retrieval. These databases stored vectorized representations of text chunks, allowing the system to enhance user queries by matching them with relevant text fragments. The design of the pipeline ensured scalable and effective RAG operations, leveraging the strengths of Redis and OpenSearch in vector-based search.
14:30 - 14:55. Walker Canyon
By Surjit Singh
In this talk we will walk through the process of building and deploying Beam Dataflow templates.