The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage.
Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform.
Pitch Pine
Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well.
This talk will discuss why streaming data processing systems are broadly well positioned to safely scale and deploy agentic workflows, and why Beam is a particularly good fit. It will then spend some time talking about the gaps in Beam (and other systems), and how moderate targeted investments can close those gaps.
Attendees can expect to come away with a high level understanding of agentic workflows, the current state of agentic workflows in Beam, and the path forward to take Beam’s agentic support from solid to excellent.
Pitch Pine
Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation.
In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into:
Handling the “Mixed Stream” Challenge: How to use Beam to demultiplex and route mixed log types from a single Kafka or Pub/Sub source.
Decoding the Undecodable: Patterns for processing proprietary binary protocols and industrial equipment logs at scale.
Architecting for Reliability: Implementing Dead Letter Queues (DLQs) and rejected message sinks to ensure zero data loss during ingestion.
The Ingestion Decision Tree: A comparison of ingestion methods (API vs. GCS vs. Pub/Sub) and how to choose the right sink for your security requirements.
Attendees will leave with a blueprint for building resilient, scalable security data pipelines that turn raw industrial signals into actionable security intelligence.
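The demultiplexing and DLQ routing described above can be sketched independently of any runner; in a Beam pipeline the same decision would typically live in a DoFn emitting tagged outputs. The log categories and payloads below are illustrative, not taken from the talk.

```python
import json

DLQ = "dlq"

def route_log(raw: bytes):
    """Classify one record from a mixed stream and return (destination, payload).
    Unparseable records go to the dead letter queue instead of being dropped."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Binary payloads (e.g. proprietary OT protocols) get their own branch
        # for protocol-specific decoding downstream.
        return ("binary", raw)
    try:
        return ("structured", json.loads(text))
    except json.JSONDecodeError:
        if text.strip():
            return ("unstructured", text)
        return (DLQ, raw)

# In Beam, each destination would map to a beam.pvalue.TaggedOutput,
# with the DLQ branch written to a rejected-message sink for replay.
```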
Hackberry
AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making.
We’ll cover practical patterns for building agent-driven pipelines:
Agentic Orchestration with Beam: Structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources. How Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints.
Real-time ML Integration: Embedding models directly in Beam pipelines for low-latency inference. Patterns for fraud detection, anomaly detection, and personalized recommendations where agents act on streaming insights.
State and Error Handling: Managing agent state across pipeline stages, checkpoint strategies for long-running agentic workflows, and graceful recovery when tools or models fail.
Connecting Disparate Systems: Integrating Beam with Kafka, Iceberg, and cloud APIs to give agents access to the data they need without rebuilding your infrastructure.
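As a minimal illustration of the orchestration pattern above — an agent step that dispatches to a tool and records its result in state for later stages or retries. The tool names and state layout are hypothetical, not a Beam API.

```python
def run_agent_step(task, tools, state):
    """One agentic step: pick a tool for the task, call it, and record the
    result in state so a later pipeline stage (or a retry) can resume here."""
    tool = tools.get(task["tool"])
    if tool is None:
        state.setdefault("errors", []).append(task)
        return None  # in a pipeline this branch would feed an error/DLQ output
    result = tool(task["input"])
    state.setdefault("history", []).append((task["tool"], result))
    return result

# Hypothetical tools an agent might coordinate across data sources:
tools = {
    "lookup": lambda x: {"records": [x]},        # e.g. a batch repository read
    "score":  lambda x: {"score": len(str(x))},  # e.g. an ML inference endpoint
}

state = {}
run_agent_step({"tool": "lookup", "input": "user-42"}, tools, state)
run_agent_step({"tool": "score", "input": "user-42"}, tools, state)
```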
Attendees will leave with:
Built for data engineers ready to move from static pipelines to autonomous, ML-powered data workflows.
Pitch Pine
A brief dive into improving Beam Python performance, covering currently available best practices as well as looking ahead to free-threaded Python runtimes.
Hackberry
Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug.
This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows.
Key Technical Takeaways:
Multi-Level Parallelism: Decomposing MinHash LSH into document-level, element-level, and global-level parallel transforms within Beam.
Unified State: Implementing an external “Global Memory” using high-throughput state stores to bridge the gap between processing modes.
Hardware Density: Offloading compute-intensive hashing to TPUs via Dataflow to drastically reduce the cost-per-token processed.
Reusable Components: Deploying low-latency membership inference to prevent “corpus poisoning” during live web crawls by reusing identical batch logic.
We will conclude with a live demonstration using the Common Crawl dataset, showcasing the pipeline’s ability to intercept near-duplicate documents in real-time with zero logic drift.
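A toy, single-machine sketch of the MinHash LSH idea underlying the pipeline — the talk's version distributes these stages across Beam transforms; the hash functions, shingle size, and band count here are illustrative choices, not the production parameters.

```python
import hashlib

def shingles(text, k=3):
    """Word k-grams: the sets whose Jaccard similarity MinHash approximates."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function; equal positions across two
    signatures occur with probability equal to the Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for s in shingle_set))
    return sig

def lsh_bands(sig, bands=16):
    """Split the signature into bands; documents sharing any band bucket
    are candidate near-duplicates and get a full comparison."""
    rows = len(sig) // bands
    return [(i, tuple(sig[i * rows:(i + 1) * rows])) for i in range(bands)]

a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
sig_a, sig_b = minhash_signature(shingles(a)), minhash_signature(shingles(b))
shared = set(lsh_bands(sig_a)) & set(lsh_bands(sig_b))  # non-empty => candidates
```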
Pitch Pine
In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines.
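As a starting point, Python's standard library already ships tracemalloc, which works inside a Beam worker as well as in a plain script. This generic snippet is not specific to the talk's techniques; the sample workload is made up.

```python
import tracemalloc

def build_big_structure(n):
    # Stand-in for a memory-hungry transform (e.g. buffering elements in a DoFn).
    return [str(i) * 10 for i in range(n)]

tracemalloc.start()
before = tracemalloc.take_snapshot()
data = build_big_structure(100_000)
after = tracemalloc.take_snapshot()

# Attribute allocation growth to source lines — inside a Beam pipeline this
# could run around DoFn.process or a bundle boundary to spot leaky transforms.
top = after.compare_to(before, "lineno")
growth = sum(stat.size_diff for stat in top)
```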
Hackberry
Everyone talks about explainable AI. Almost no one runs it at scale in production.
The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users all need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up.
At Intuit Credit Karma, we built and operate a production ML explainability platform on Apache Beam and Dataflow — and we’ll show you exactly how. Beam’s unified programming model gave us the horizontal scalability to run attribution computation across massive prediction volumes without blowing up cost or latency. Its flexible I/O and pipeline composition patterns let us stitch together a three-stage architecture: feature log preparation, distributed attribution generation grouped by model version, and downstream delivery with correctness guardrails — all orchestrated on a daily cadence.
We’ll walk through the real engineering decisions: how we used Apache Beam pipelines to group predictions by model version and load the corresponding model dynamically — ensuring explanations always reflect the exact model that made each prediction even as models refresh continuously in production; how we aggregated feature-level attributions into human-readable reason codes configurable for any audience; and how we designed the pipeline to be privacy-preserving by construction, with no sensitive data leaking through intermediate stages.
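The version-alignment step reduces to a group-by before inference. A minimal sketch with a toy linear "attribution" (weight × feature value) — the field names, registry shape, and scoring model are illustrative, not Credit Karma's implementation.

```python
from collections import defaultdict

def group_by_model_version(predictions):
    """Group prediction logs so each group is scored and explained by the
    exact model version that produced it, even as models refresh."""
    groups = defaultdict(list)
    for p in predictions:
        groups[p["model_version"]].append(p)
    return groups

def attribute(groups, registry):
    """One model load per version (not per row), then per-feature contributions;
    a linear model's weight * value stands in for a real attribution library."""
    out = []
    for version, preds in groups.items():
        weights = registry[version]
        for p in preds:
            out.append({f: weights[f] * v for f, v in p["features"].items()})
    return out

registry = {"v1": {"income": 0.5, "age": 0.1}, "v2": {"income": 0.3, "age": 0.4}}
preds = [
    {"model_version": "v1", "features": {"income": 2.0, "age": 1.0}},
    {"model_version": "v2", "features": {"income": 2.0, "age": 1.0}},
]
attributions = attribute(group_by_model_version(preds), registry)
```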
We’ll also be honest about where Beam shines for this use case and where the rough edges are — including version compatibility with ML attribution libraries, managing model artifacts in distributed workers, and cost tradeoffs vs. managed explainability services.
Whether you’re building explainability infrastructure, running large-scale batch ML pipelines, or just trying to understand what Apache Beam can do beyond ETL — this talk will show you what production explainability looks like when you build it the right way, at real scale.
Pitch Pine
As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence, it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events.
This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration.
Hackberry
As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams.
We’ll cover:
How to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference.
Designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data.
Handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization).
Deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration.
Real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%.
This isn’t a toy demo. It’s a battle-tested architecture handling 10M+ daily events with 99.9% uptime. Attendees will leave with concrete patterns they can apply to fraud detection, anomaly detection, semantic search, and personalized recommendation systems.
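The retrieval step of such a RAG pipeline, stripped to its core: rank stored embeddings by cosine similarity to the query. This brute-force loop is a stand-in for a Pinecone/FAISS lookup, and the document IDs and vectors below are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, index, top_k=2):
    """Return the top_k most similar (doc_id, vector) entries — in production
    this whole call is replaced by a vector-database query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

index = [("fraud-faq",      [0.9, 0.1, 0.0]),
         ("returns-policy", [0.1, 0.9, 0.0]),
         ("shipping",       [0.0, 0.2, 0.9])]
hits = retrieve([1.0, 0.0, 0.1], index)  # context handed to the LLM prompt
```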
Hackberry
Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models.
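Two of the seventeen features can be computed directly from the domain string. A self-contained sketch — the smoothing, reference counts, and sample domains are illustrative, not the NJCCIC feature set.

```python
import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """High character entropy is typical of DGA-generated C2 domains."""
    counts = Counter(domain)
    total = len(domain)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bigram_log_prob(domain: str, bigram_counts: Counter, total: int) -> float:
    """Average log-probability of adjacent character pairs under a reference
    corpus — random-looking strings score lower than dictionary-like ones."""
    pairs = [domain[i:i + 2] for i in range(len(domain) - 1)]
    return sum(math.log((bigram_counts[p] + 1) / (total + 1))
               for p in pairs) / len(pairs)

# Reference counts would come from a corpus of benign domains (toy values here).
reference = Counter({"go": 50, "oo": 40, "og": 30, "le": 20})
total = sum(reference.values())

benign = bigram_log_prob("google", reference, total)
random_ish = bigram_log_prob("xq7zk9", reference, total)
```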
Pitch Pine
At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline:
This means replication is driven by Nessie’s own versioning semantics rather than internal MongoDB implementation details, making the approach more resilient to Nessie upgrades.
A Unified Pipeline with Batch and Streaming Modes: Both layers run within the same Beam pipeline, giving us a single model for two distinct operational needs. In streaming mode, the pipeline continuously watches for new Nessie commits and triggers incremental storage and catalog replication, keeping non-production environments near-current with production. In batch mode, the same pipeline handles full environment bootstrapping or point-in-time recovery to a specific Nessie snapshot. Beam’s runner portability was essential in our regulated environment: pipelines are developed and validated locally using the Direct Runner before being deployed to our Spark cluster via the Spark Runner, without any rewrite. Once complete, the non-production Nessie catalog becomes a true, API-level mirror of production. We will share practical lessons learned, including Nessie API pagination at scale, handling Beam pipeline failures mid-replication, and ensuring catalog consistency when storage and metadata sync are not atomic.
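The incremental layer boils down to diffing two commit logs and replaying the gap. A runner-agnostic simulation — in-memory lists of commit IDs stand in for Nessie's commit-log API, and the two-phase ordering in the comment reflects the storage-then-metadata sequencing described above.

```python
def commits_to_replay(source_log, target_log):
    """Return source commits missing from the target, oldest first.
    Driving replication off commit IDs (not storage internals) is what
    keeps the approach resilient to Nessie-internal changes."""
    seen = set(target_log)
    return [c for c in source_log if c not in seen]

def replicate(source_log, target_log, apply_commit):
    for commit in commits_to_replay(source_log, target_log):
        apply_commit(commit)       # 1) copy data files, 2) commit metadata —
        target_log.append(commit)  # record the commit only once both landed

prod = ["c1", "c2", "c3", "c4"]     # production commit log
nonprod = ["c1", "c2"]              # non-production mirror, lagging behind
applied = []
replicate(prod, nonprod, applied.append)
```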
Hackberry
Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real-time - and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.
We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.
We’ll close with the resilience patterns that keep this graph flowing outward to Meta, Google, TikTok, and Snapchat - adaptive batching tuned to per-platform rate limits and payload constraints, circuit breakers that isolate failing destinations without stalling the pipeline, and a structured dead letter queue system with automated replay. Just enough DoFn-level detail to show how these patterns hold up under real third-party API volatility.
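A minimal version of the circuit-breaker pattern described above — stop calling a failing destination so the rest of the pipeline keeps flowing, then retry after a cooldown. The thresholds are illustrative, and the injected clock is just for deterministic demonstration.

```python
import time

class CircuitBreaker:
    """Isolates one failing destination (e.g. a single ad platform's API)
    without stalling deliveries to the others."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures, self.cooldown_s, self.clock = max_failures, cooldown_s, clock
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, success: bool):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip: stop sending to this sink

fake_now = [0.0]
cb = CircuitBreaker(max_failures=2, cooldown_s=10, clock=lambda: fake_now[0])
cb.record(False); cb.record(False)  # two consecutive failures trip the breaker
tripped = not cb.allow()
fake_now[0] = 11.0                  # after the cooldown it lets a probe through
recovered = cb.allow()
```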
Pitch Pine
For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures.
No more! In this session, we’ll dive into the evolution of the Beam SQL story, introducing:
We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam.
Hackberry
The modern data landscape has converged on the lakehouse architecture — a paradigm that unifies the scalability of data lakes with the governance and performance of traditional data warehouses. Open table formats such as Apache Iceberg, Apache Hudi, and Delta Lake now serve as the foundation for this ecosystem, enabling ACID transactions, schema evolution, and time-travel queries directly on cloud object storage.
Yet as organizations adopt these formats at scale, a critical question emerges: how do data teams move data into and across these tables in a portable, reliable, and runtime-agnostic way? This talk explores where Apache Beam fits into the lakehouse picture. We will examine how Beam’s unified programming model and runner abstraction make it a natural choice for lakehouse ingestion and transformation pipelines — whether targeting Apache Iceberg via the managed I/O connector, streaming CDC events into Hudi, or orchestrating multi-table workflows across batch and streaming workloads in a single pipeline. Attendees will come away with a clear mental model of the lakehouse ecosystem’s key components, an understanding of the trade-offs between format choices, and practical insight into how Apache Beam’s portability layer complements — rather than competes with — engines like Apache Flink and Apache Spark in a modern data platform.
Use of LLMs and agents is steadily growing in prominence and importance across the data processing ecosystem and more broadly across all of software. Today, though, many agentic prototypes fail to reach production for familiar reasons, including: overly complex resource management, the inability to colocate the correct context, high deployment costs, and a lack of guardrails for non-deterministic agents. Agentic workflows can solve some of these problems by providing boundaries for an agent to operate within, but there is a lack of scalable tools to do this well.
Modern Security Operations Centers (SOCs) face a massive data problem: the “Data Wall.” As IT, OT, and IoT environments converge, security teams are forced to ingest a chaotic mix of structured, unstructured, and even proprietary binary logs (such as S7 or Modbus) into their SIEM platforms. Traditional ingestion tools often fail when faced with the need for real-time normalization, massive scale, and complex data transformation.
In this session, we will explore how Apache Beam serves as the critical “bridge” in this ecosystem, enabling the transformation of disparate manufacturing and enterprise logs into a standardized Unified Data Model (UDM). Drawing from real-world implementations using Google Cloud Dataflow and SecOps (Chronicle), we will dive into:
AI agents need more than model inference. They need reliable data orchestration. This session explores how Apache Beam enables agentic architectures that reason over streaming and batch data, execute multi-step workflows, and integrate ML models for real-time decision-making.
We’ll cover practical patterns for building agent-driven pipelines:
Agentic Orchestration with Beam: Structuring pipelines where agents decompose tasks, call external tools, and coordinate across data sources. How Beam’s unified model simplifies building workflows that span Kafka streams, batch repositories, and ML inference endpoints.
A brief dive into improving Beam Python performance, looking at both currently available best practices as well as looking forward to free threaded Python runtimes.
Abstract Deduplicating trillion-token corpora is a critical requirement for modern LLM pre-training, yet it remains a significant engineering bottleneck. Traditional approaches often rely on fragmented pipelines—massive batch jobs for historical cleaning paired with separate, lightweight scripts for real-time ingestion. This fragmentation leads to logic drift, where signatures generated in the stream may not align with batch history, creating inconsistencies that are difficult to debug.
This session presents a fully native Apache Beam implementation of the MinHash Locality-Sensitive Hashing (LSH) algorithm. We demonstrate a Unified Architecture that uses a single Java codebase to handle both large-scale “Batch Bootstrapping” and low-latency streaming workflows.
In this talk we will cover techniques for profiling memory usage in Apache Beam Python pipelines.
Everyone talks about explainable AI. Almost no one runs it at scale in production.
The dirty secret of ML explainability is that it’s easy in a notebook and brutally hard in the real world. Feature attribution libraries are compute-intensive and don’t parallelize out of the box. Models retrain and refresh continuously, making it impossible to guarantee that an explanation matches the model that actually made the prediction. Raw feature attributions are cryptic to everyone except the data scientist who built the model — but business stakeholders and end users all need to understand them in terms meaningful to them. And at the scale of billions of predictions a day across 140 million members, none of the off-the-shelf approaches hold up.
As we move from LLM chatbots to Autonomous AI Agents in 2026, the primary bottleneck isn’t model intelligence, it’s contextual latency. For an agent to act reliably on behalf of a user, it needs a living memory of both historical records and real-time events.
This session explores how Apache Beam is becoming the definitive context layer for the Agentic AI stack. While traditional RAG often relies on vector databases, Beam enables a new paradigm of streaming RAG and stateful orchestration
As AI moves from experimentation to production, the hardest challenge isn’t building a model. It’s getting it to run reliably on live data at scale. In this talk, I’ll walk through how I architected production-grade pipelines that embed LLMs and RAG systems directly into Apache Beam, enabling real-time inference on high-velocity data streams.
We’ll cover: How to integrate HuggingFace and vLLM models into Beam transforms for low-latency inference Designing a RAG pipeline inside Beam using vector databases (Pinecone, FAISS) for semantic search on streaming data Handling the cost and throughput challenges of running LLMs in a pipeline (quantization, batching, GPU optimization) Deploying the full stack on AWS Bedrock + SageMaker with Kubernetes orchestration Real benchmark results: how we cut inference costs by 50% while improving reasoning accuracy by 35%
Defending the Garden State, one pipeline at a time. In cybersecurity, speed is the ultimate currency, and manual detection simply can’t keep up with the velocity of modern malware. To bridge this gap, the NJCCIC developed an automated, ML-powered detection engine that analyzes ~700,000 domains daily, moving beyond a reliance on third-party threat feeds to identify ephemeral command-and-control channels in real-time. At the core of this defense, we leverage Apache Beam to orchestrate a high-throughput feature engineering engine, extracting 17 distinct lexical and statistical features—from Shannon entropy to bigram probabilities—across a massive 30-million-sample training set. By utilizing Beam’s ability to unify complex data transformation with production-scale inference, we’ve successfully deployed an ensemble of Random Forest and biologically-inspired NEAT (NeuroEvolution of Augmenting Topologies) models.
Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real-time - and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.
We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.
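Weighted conflict resolution of the kind described can be sketched in a few lines. The source names, weights, and `Signal` shape below are hypothetical illustrations, not the production schema:

```python
from dataclasses import dataclass

# Hypothetical source weights: first-party data is trusted over
# server-side events, which are trusted over ad-platform signals.
SOURCE_WEIGHTS = {"first_party": 3, "server_side": 2, "ad_platform": 1}

@dataclass(frozen=True)
class Signal:
    field: str        # e.g. "email"
    value: str
    source: str       # key into SOURCE_WEIGHTS
    observed_at: int  # unix seconds; used as a tie-breaker

def resolve(signals: list) -> dict:
    """Pick one value per field: highest source weight wins,
    most recent observation breaks ties."""
    best = {}
    for s in signals:
        cur = best.get(s.field)
        key = (SOURCE_WEIGHTS.get(s.source, 0), s.observed_at)
        if cur is None or key > (SOURCE_WEIGHTS.get(cur.source, 0), cur.observed_at):
            best[s.field] = s
    return {f: s.value for f, s in best.items()}
```

In a streaming identity merge, the same comparison would run incrementally against the currently persisted winner rather than over a full list.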
At Wells Fargo, we delve into our approach for backing up and synchronizing Apache Iceberg tables across environments using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie’s Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline:
For a long time, Beam SQL lagged behind other frameworks like Spark or Flink because it lacked support for any hierarchical metadata management. This point of friction limited Beam SQL’s interoperability, scalability, and ease of use within modern data architectures.
No more! In this session, we’ll dive into the evolution of the Beam SQL story, introducing:
We’ll end with a demo in a multi-catalog environment, demonstrating how these new features allow for a more intuitive, powerful, and “SQL-native” developer workflow in Beam.
A fully autonomous VTOL logistics fleet has been flying mathematically optimized delivery corridors for 14 months. Every flight clears health checks. Every autopilot recovery is clean. No alerts fire. No reports are filed. On paper the operation is running perfectly.
A single Apache Beam pipeline replaying 1,200 archived flight logs tells a different story. Attitude recovery events — small, clean, individually insignificant — cluster at one specific waypoint, one altitude band, one azimuth range. The autopilot has been silently fighting a terrain-induced atmospheric rotor on every affected flight. It always won. It never complained. But the cumulative cost is real — excess battery draw, elevated motor wear, compounding flight time losses that no dashboard ever surfaced.
This talk shows how Beam’s unified model runs the same pipeline against 14 months of archived logs in batch and against the live telemetry stream in real time — no rewrite, no reconciliation, one codebase. We build it live, extend it layer by layer from raw telemetry through operator commands to LLM-generated cost analysis and route recommendations. The route gets adjusted. The inefficiency disappears.
The question Beam lets you ask for the first time: what has your fleet been quietly compensating for that your dashboards never showed you?
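The batch half of this analysis, bucketing attitude-recovery events into coarse spatial cells and counting them, can be sketched outside of Beam. The record shape and bucket sizes below are illustrative assumptions, not the talk's actual schema:

```python
from collections import Counter

def cluster_recoveries(events, alt_band_m=50, az_bin_deg=15):
    """Bucket attitude-recovery events into coarse (waypoint, altitude band,
    azimuth bin) cells and count them. A cell holding a disproportionate
    share of events is the signature of a recurring, silently compensated
    disturbance like a terrain-induced rotor.

    `events` is an iterable of (waypoint_id, altitude_m, azimuth_deg) tuples.
    """
    cells = Counter(
        (wp, int(alt // alt_band_m), int(az // az_bin_deg))
        for wp, alt, az in events
    )
    return cells.most_common()
```

In Beam, the equivalent logic is a `Map` to the cell key followed by `Count.PerKey()`, which is exactly the kind of transform that runs unchanged over archived logs in batch and over live telemetry in streaming.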
While the Beam website lists 30+ built-in I/O connectors supporting streaming, the maturity (resilience, scalability, performance, etc.) of each connector is not equal. This session highlights recent improvements to selected Beam streaming I/O connectors where gaps between user demand and the status quo had emerged, including Debezium, JMS, MQTT, and Pulsar. We discuss how each connector went from “Day 1” existence to “Day 2” resilience, and close with remarks on facilitating community engagement in the Beam I/O ecosystem.
The “Beam Model” is incredibly powerful, but its complexity—balancing windowing, triggers, and stateful processing—often creates a steep learning curve. In the era of agentic development, we are moving beyond simple AI code completion toward a world of Agent Skills: modular, grounded capabilities that allow AI agents to act as specialized data engineers.
In this session, we explore how to build and deploy specific Agent Skills tailored for Apache Beam using modern tools like Claude Code, Cursor, and custom agentic frameworks. We will shift the focus from “writing code” to “orchestrating capabilities,” demonstrating how these skills can automate the most nuanced parts of the development lifecycle.
Key areas of focus:
Encoding the Beam Model into Skills: How to build specialized skills that “understand” the nuances of PTransforms, Watermarks, and SideInputs to prevent common architectural anti-patterns.
Optimize Skill: Using agents to analyze Dataflow execution graphs and autonomously suggest performance tuning or cost-optimization fixes.
Agentic Testing Skills: Streamlining the creation of robust unit tests and TestStream scenarios to ensure pipeline reliability before deployment.
Skills in Action: A look at how a multi-agent workflow—using a suite of coordinated Beam Skills—can take a natural language requirement and turn it into a production-ready, multi-language pipeline.
By treating Beam expertise as a set of Agent Skills, we can lower the barrier to entry for new developers and allow seasoned experts to focus on high-level architecture rather than boilerplate logic.
Traditionally, converting a Parquet-based data lake to Iceberg required a hidden tax of rewriting every single data file. For organizations managing petabyte-scale datasets, this compute overhead and the associated cloud bill are often dealbreakers.
This talk introduces a more efficient path using Apache Beam’s new AddFiles feature to perform zero-copy migrations, registering existing Parquet files directly into an Iceberg table without moving a single byte.
In this session, we’ll explore:
Scaling AI inference across thousands of workers to maximize throughput is a flagship feature of Apache Beam. However, this massive parallelism often collides head-on with strict external API quotas (e.g., Vertex AI, OpenAI).
To bridge this gap, we’ve introduced a Proactive Global RateLimiter for Apache Beam. It is integrated directly into the RunInference transform and is also available for custom DoFns, moving quota management from reactive retry storms to proactive pacing.
This talk will explore how Beam coordinates rate limits across dispersed workers and communicates dynamic back pressure to the Runner Autoscaler to prevent compute waste. Attendees can expect to come away with an understanding of how global rate limiting works in distributed environments, how the autoscaler responds to rate signals, and how they can use Beam to scale their use cases safely without overwhelming external services.
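The pacing idea can be illustrated with a single-worker sketch. This is not the Beam RateLimiter API; it is a minimal stand-in that assumes a static, equal quota share per worker, whereas the real component coordinates shares dynamically:

```python
import time

class QuotaSharePacer:
    """Proactive pacing sketch: each worker takes an equal share of a
    global QPS quota and sleeps *before* a call, rather than retrying
    after a 429. Assumes a fixed worker count for simplicity."""

    def __init__(self, global_qps: float, worker_count: int):
        # Seconds this worker must wait between its own calls.
        self.min_interval = worker_count / global_qps
        self._last = 0.0

    def acquire(self) -> float:
        """Block until this worker may issue its next call; return the wait applied."""
        now = time.monotonic()
        wait = max(0.0, self._last + self.min_interval - now)
        if wait:
            time.sleep(wait)
        self._last = time.monotonic()
        return wait
```

When the autoscaler changes the worker count, `min_interval` must be recomputed, which is precisely why back-pressure signaling between the limiter and the autoscaler matters.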
Data pipelines need reliable quality checks, but hardcoded validation rules struggle to keep up with changing business needs. This session shows how to simplify data quality by using an AI agent to figure out the rules, and Apache Beam to do the heavy lifting of actually checking the data.
We will walk through a practical setup where an AI Data Validation Agent takes the lead. Using tools like Retrieval-Augmented Generation (RAG) and the Model Context Protocol (MCP), the agent reads your live data catalogs and governance rules to understand exactly what your data should look like today.
But the agent doesn’t process the data itself. Instead, it automatically triggers Apache Beam (Dataflow) to run these custom checks. The agent translates the business rules into logic specifically built for Apache Beam, allowing Beam to do what it does best: process huge amounts of data efficiently.
What You Will See:
Smart Triggering: How an AI agent figures out what needs checking and automatically spins up Apache Beam pipelines exactly when they are needed.
Building Beam-Ready Rules: How the agent translates everyday business rules and data catalog metadata into SQL and validation steps that plug right into your Apache Beam workflow.
Distributed Execution: How Apache Beam takes the handoff from the agent, using its distributed processing power to check massive datasets for errors and schema changes quickly and reliably.
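As a toy illustration of the rule-translation step described above, a declarative quality check can be compiled into a violation-counting SQL statement. The rule shape and check names here are hypothetical, not the agent's actual output format:

```python
# Hypothetical rule shape the agent might emit from catalog metadata:
#   {"column": "email", "check": "not_null"}
#   {"column": "age", "check": "range", "min": 0, "max": 130}
def rule_to_sql(table: str, rule: dict) -> str:
    """Translate one declarative quality rule into SQL that counts
    violating rows; a Beam pipeline (e.g. via SqlTransform or a
    warehouse read) can then execute it at scale."""
    col = rule["column"]
    if rule["check"] == "not_null":
        pred = f"{col} IS NULL"
    elif rule["check"] == "range":
        pred = f"{col} < {rule['min']} OR {col} > {rule['max']}"
    else:
        raise ValueError(f"unknown check: {rule['check']}")
    return f"SELECT COUNT(*) AS violations FROM {table} WHERE {pred}"
```

The point of the separation is visible even at this scale: the agent owns the rule vocabulary, while Beam owns the execution of whatever SQL the rules compile to.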
This session explores recent innovations that enhance pipeline scalability, reliability, and availability. We will cover key updates in autoscaling, high availability, and reliability, alongside progress in streaming ML and IO excellence. Attendees will discover how these enhancements facilitate the building of robust, next-generation streaming architectures.
Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. While there is growing interest in processing time series data within Apache Beam, its inherently unordered and parallel execution model forces developers to implement complex custom logic to handle chronological events accurately.
In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We will evaluate and compare various buffering approaches, weighing their trade-offs. Finally, we will demonstrate these concepts in action through a real-world anomaly detection use case utilizing the recently developed BigQuery CDC source.
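The simplest buffering approach, holding events in a min-heap and releasing them once the watermark passes, can be sketched as follows. In a real Beam pipeline this state would live in per-key state with event-time timers; the sketch below is a plain in-memory stand-in:

```python
import heapq

class OrderedBuffer:
    """Watermark-driven reordering sketch: events arrive out of order,
    are held in a min-heap keyed by event time, and are released in
    timestamp order once the watermark passes them."""

    def __init__(self):
        self._heap = []

    def add(self, timestamp: int, event) -> None:
        heapq.heappush(self._heap, (timestamp, event))

    def release(self, watermark: int) -> list:
        """Emit all buffered (timestamp, event) pairs with
        timestamp <= watermark, in timestamp order."""
        out = []
        while self._heap and self._heap[0][0] <= watermark:
            out.append(heapq.heappop(self._heap))
        return out
```

The trade-off the talk compares is visible here in miniature: a later watermark releases more events correctly ordered, but holds state longer and adds latency.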
The “last mile” of healthcare AI is often the hardest: turning a model prediction into an actual clinical action. Azra AI uses Apache Beam to close this loop. By unifying data ingestion and AI/ML within Beam pipelines, we automate the identification and navigation of patients needing urgent care.
In this talk, we’ll outline our architecture for deploying real-time streams with AI/ML predictions to provide clinicians with up-to-the-minute insights. We will discuss the challenges of processing unstructured clinical text and how Beam enables us to scale our automation to serve patients across diverse hospital networks.
Apache Beam and Google Cloud Dataflow are optimized for parallelism and autoscaling. In practice, that’s often exactly what we want, until it isn’t.
What happens when your streaming pipeline suddenly scales from 5 to 50 workers and overwhelms a downstream REST API? Or a legacy service enforces strict QPS limits? Beam provides powerful primitives for windowing, batching, and scaling — but it does not offer first-class global throughput control.
This talk explores advanced throughput control patterns specifically in the context of Apache Beam running on Google Cloud Dataflow.
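One classic pattern in this space is a client-side token bucket. The sketch below is per-process only and not a Beam API; coordinating a global limit across autoscaled workers is exactly the hard part the talk addresses:

```python
import time

class TokenBucket:
    """Token-bucket throughput control: tokens refill at `rate` per
    second up to `capacity` (the allowed burst). A caller may proceed
    only when a token is available, capping sustained QPS at `rate`.
    Per-process only; cross-worker coordination needs an external
    store or a keyed stateful DoFn."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A `DoFn` calling a rate-limited REST API could check `try_acquire()` before each request and sleep briefly on failure, smoothing bursts even when Dataflow scales the worker count.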
Introducing a YAML-driven Dataflow Flex Template design that enables product teams to self-serve Spanner-to-BigQuery replication, supporting both current-state and append-only modes for analytics, downstream applications, and audit trails. This session focuses on streaming CDC (Change Data Capture) use cases.
In a federated deployment model where each product team owns and operates its own pipeline, redeployments emerged as a recurring risk event. None of Dataflow’s existing restart mechanisms work perfectly for our use cases, leaving every redeployment at risk of data loss or duplication.
A proper fix belongs in Dataflow and SpannerIO. In the meantime, an interim solution has kept redeployments routine in production, with no data loss or duplication.
This session covers the problem, the approach, and the tradeoffs that remain.
This talk demonstrates how declarative Beam YAML pipelines, combined with LLM-powered agentic architectures, unlock real-time predictive intelligence on high-velocity sensor streams — using UAV flight telemetry as a live use case. We show how parameterized Beam YAML templates dynamically reconfigure processing logic at runtime without code changes, while AI agents perform anomaly detection, predictive trend analysis, preemptive failure reasoning, and adaptive alerting — all orchestrated within the pipeline itself. Attendees will learn practical patterns for embedding agentic workflows into Beam, wiring streaming data sources to ML inference transforms, and building pipelines that don’t just react to problems but anticipate them. The session highlights how Beam’s unified model powers the next generation of intelligent, self-reasoning data applications.
At Intuit Credit Karma, the “Credit Ecosystem” team powers the financial progress of millions of members, relying on massive datasets from all three major credit bureaus and multiple partners. This ecosystem spans hundreds of tables and tens of thousands of columns, with ingestion frequencies ranging from real-time and intraday (3x daily) to monthly batch files. The sheer scale of daily loads—impacting over 140 million members—made manual monitoring impossible. This session explores how the Credit Ecosystem team leveraged Monte Carlo to transition from reactive firefighting to proactive observability as part of our Data Quality standards.
We designed our quality standards around five pillars: Timeliness, Completeness, Accuracy, Observability, and Governance. Like many organizations, we initially relied on custom rules and alerts, but quickly realized this approach was not scalable. We will discuss how we solved this challenge by automating observability using Monte Carlo’s out-of-the-box features, Field Health/Metrics Monitors, and custom SQL checks to handle complex DQ needs. We will also detail how we operationalized governance via the “Data Asset Registry”, a centralized management solution for hundreds of data assets across Credit Karma teams.
Lastly, we will discuss the human side of observability: adoption and training. We will share how we navigated early implementation challenges to build a reliable alerting structure, enabling our current model of paging on-call teams in real-time with high confidence and low alert fatigue.