Join us for this talk, in which Yasmeen Ahmad, Managing Director of Data & Analytics at Google Cloud, shares her perspective on the innovations taking place in the data and analytics platform space.
Mariposa Grove
Project Shield is a service that counters distributed denial-of-service (DDoS) attacks. It is available free of charge to eligible websites that host news, elections, and human rights related content. Project Shield helps ensure unhindered access to election-related information during global democratic processes (such as the U.S. 2022 midterm election season and many others). It also enables critical infrastructure and news websites to defend against non-stop attacks and provides crucial services and information during crises (such as the invasion of Ukraine).
In this keynote, Marc Howard will explain why and how Project Shield uses Apache Beam and Google Cloud Dataflow to deliver some of their core value. Their streaming Apache Beam pipelines process more than 3 TB of log data daily at well over 10,000 queries per second, a rate that grows dramatically during large attacks of up to 400 million queries per second. These metrics power user-facing graphs and long-term attack analytics at scale, fine-tuning Project Shield’s defenses and supporting the effort to keep the web a safe and free space.
Mariposa Grove
In this presentation, we delve into the critical world of data lineage within Apache Beam, exploring its significance and demonstrating its practical implementation. We begin by establishing the motivation behind data lineage, highlighting its role in enhancing data governance, debugging, and impact analysis. Next, we introduce Google Cloud Dataplex, a unified data management platform, and its integration with Beam’s lineage capabilities.
We’ll then embark on a technical journey, showcasing how lineage support is built into Apache Beam’s core. Following this, we will dissect the process of constructing a lineage graph for an Apache Beam job and seamlessly reporting it to Dataplex for insightful visualization.
The presentation will empower the audience with actionable knowledge on how to integrate lineage tracking into their own I/O operations, ensuring greater transparency and control over their data pipelines. Finally, a live demonstration will bring these concepts to life, showcasing data lineage in action for an Apache Beam job executing on Dataflow, and visually exploring its lineage within Dataplex.
By the end of this talk, attendees will possess the knowledge and tools to effectively leverage Apache Beam’s lineage support, fostering transparency and trust within their data pipelines.
Hamina (MP4)
The session covers a recently added Beam SDK extension and applicable use cases. It also describes the techniques used to implement the transform.
Walker Canyon
Cruise leverages Apache Beam to manage and process petabytes of data monthly, essential for our autonomous vehicle model training. This talk will delve into the innovative features we’ve developed to enhance Beam’s capabilities, including a control plane for quota and user management, a C++ sandbox for running AV ROS nodes in the cloud, and shuffling optimization techniques to compress shuffled data.
Mariposa Grove
Prism is a local, portable Beam runner intended to assist end users and SDK developers alike by providing a common platform for all existing and new Beam SDKs.
Beam is unique among data processing systems in how it separates the user-facing SDK from the execution engine. Jobs can be authored in one SDK and executed on Flink, Spark, or cloud services like Dataflow. But these external systems can make prototyping, testing, and debugging complicated.
Prism is written in Go as a small, local runner built with Beam portability first, to better emulate how jobs from any SDK execute on those larger systems. Further, Prism serves as a model runner for all SDKs, providing a robust local experience.
This talk will go into how runners execute pipelines, and the design and implementation of Prism for the goals of testing and configurability. Knowledge of Go is encouraged, but not required.
Hamina (MP4)
This session introduces ordered list state, from concept to implementation.
Walker Canyon
Michelangelo is Uber’s centralized machine learning platform, designed to manage ML pipelines and their associated data processing. As the demand for batch predictions grows, the need for a flexible and efficient processing framework becomes imperative. This presentation explores Uber Michelangelo’s batch prediction processes, focusing on data processing, model prediction, and the transition from Spark to Ray.
At Michelangelo, data preparation and feature transformation are traditionally handled using Spark data transforms. The model prediction step involves a user-defined function within a Spark pipeline model. While Spark has been the backbone for our batch processing needs due to its robustness and ease of use, it has shown limitations in handling the complex machine learning tasks that Uber is increasingly deploying, such as natural language processing and Generative AI. These workloads often require GPUs to meet latency and throughput requirements, which Spark struggles to support efficiently.
Ray, an emerging distributed computing framework, offers better resource utilization, simpler parallelism, and more straightforward scalability. By leveraging Ray for batch processing, we can support large language model batch predictions reliably, efficiently, and scalably. We are also transitioning other machine learning batch prediction tasks to Ray for both data processing and model prediction. In this new setup, data processing is integrated as part of the model using native transformer techniques, allowing deployment on GPUs.
With Ray, we have developed a robust pipeline for batch prediction. Currently, streaming data is handled with separate pipelines. We are actively exploring the unification of these pipelines using open-source libraries. Apache Beam ML provides an opportunity to unify batch and streaming data processing pipelines.
Mariposa Grove
A StateBaseAsyncDoFn.java class and a full, production-grade SCIO implementation of State and Timers with HTTP clients to prevent duplicate requests, plus other aggregation use cases for asynchronous endpoints.
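The speakers’ implementation uses SCIO (Scala) and Java; as a rough, hypothetical illustration of the underlying idea, here is a minimal Python sketch of a stateful DoFn that keeps a per-key flag so that only the first request per key reaches an (assumed) HTTP endpoint:

```python
import apache_beam as beam
from apache_beam.coders import BooleanCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec


class DedupRequestsFn(beam.DoFn):
    # Per-key flag remembering whether a request for this key was already issued.
    SEEN = ReadModifyWriteStateSpec('seen', BooleanCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        key, request = element  # stateful DoFns require keyed (KV) input
        if seen.read():
            return  # duplicate: the first request already went out
        seen.write(True)
        # A real implementation would issue the asynchronous HTTP call here.
        yield (key, request)
```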
Walker Canyon
We migrated ~5 TB (~2 billion rows) of a production NoSQL database to a Postgres database without application downtime. This involved exporting data from Cloud Datastore (the NoSQL DB), normalizing it (from JSON to a SQL schema), and then importing the data into Postgres. We used two different Apache Beam pipelines in the process.
The data migration took about 16 hours with this approach using the Beam pipelines, as opposed to our initial estimate of 5 days using other batch scripts with parallel computing.
Advantages:
* Pipeline 1: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v1/src/main/java/com/google/cloud/teleport/templates/DatastoreToText.java
** Pipeline 2: we wrote custom Beam DoFn transforms in Python (+ SQLAlchemy) and built a pipeline to ingest the data into Postgres with error handling (a minimal sketch of such a DoFn appears below).
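As a loose illustration of what such a transform might look like (not the authors’ actual code; the table name, columns, and connection URL are placeholders), a DoFn using SQLAlchemy with a dead-letter side output could be sketched as:

```python
import apache_beam as beam
import sqlalchemy


class WriteRowToPostgresFn(beam.DoFn):
    """Inserts one row per element; failing rows go to a dead-letter output."""

    DEAD_LETTER = 'dead_letter'

    def __init__(self, db_url):
        self.db_url = db_url  # e.g. 'postgresql+psycopg2://user:pass@host/db'
        self.engine = None

    def setup(self):
        self.engine = sqlalchemy.create_engine(self.db_url)

    def process(self, row):
        try:
            with self.engine.begin() as conn:
                conn.execute(
                    sqlalchemy.text(
                        'INSERT INTO entities (id, payload) VALUES (:id, :payload)'),
                    row)  # row is assumed to be a dict with 'id' and 'payload' keys
            yield row
        except Exception as exc:
            # Route the failing row to a dead-letter collection for later inspection.
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, (row, str(exc)))

    def teardown(self):
        if self.engine is not None:
            self.engine.dispose()
```

Applied with `beam.ParDo(WriteRowToPostgresFn(db_url)).with_outputs(WriteRowToPostgresFn.DEAD_LETTER, main='written')`, the main output carries the written rows and the tagged output carries the failures.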
Mariposa Grove
This talk unveils our design journey to streamline the ingestion of CDC changes into a data warehouse, enabling rapid data availability for users. We leverage Qlik to stream CDC events to Kafka, harness Dataflow’s processing power, and store the transformed data in BigQuery for efficient analysis.
We’ll walk through our iterative design process, showcasing how Apache Beam’s flexibility allowed us to address business requirements. We’ll highlight key architectural decisions, performance optimizations, and lessons learned along the way.
This blueprint serves as a valuable resource for others seeking to simplify their CDC ingestion pipelines and accelerate time-to-insight for their data-driven initiatives.
Mariposa Grove
In this session, we’ll take a deep dive into implementing a Beam SDK for a new language, using Swift as an example. We’ll cover both the internals and the externals, from implementing the Fn API using Swift’s surprisingly robust gRPC support to how we use Swift’s modern type system to provide a uniquely Swift way of expressing the Beam programming model.
Walker Canyon
This talk is for you if you are considering writing custom code to read from and write to a Web API, for which a solution does not yet exist. This session demonstrates step by step how to read from and write to Web APIs using the Beam Java and Python SDKs.
Web API providers design their services for application workloads, which presents unique challenges for large-scale parallelized workloads such as Beam pipelines. Challenges addressed include error handling, retries with backoff, and caching.
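As a minimal, hypothetical sketch of the retry-with-backoff pattern (the endpoint URL and retry policy are assumptions, not the session’s actual code):

```python
import time

import apache_beam as beam
import requests


class CallWebApiFn(beam.DoFn):
    """Posts each element to a Web API, retrying 5xx responses with exponential backoff."""

    def __init__(self, url, max_attempts=5):
        self.url = url
        self.max_attempts = max_attempts

    def process(self, element):
        delay = 1.0
        for _ in range(self.max_attempts):
            response = requests.post(self.url, json=element, timeout=30)
            if response.status_code < 500:
                response.raise_for_status()  # non-retryable 4xx surfaces to the runner
                yield response.json()
                return
            time.sleep(delay)  # retryable server error: back off and try again
            delay *= 2
        raise RuntimeError(f'{self.url} still failing after {self.max_attempts} attempts')
```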
Hamina (MP4)
Serving ML models at scale is increasingly important, and Beam’s RunInference transform is a great tool for doing so. At the same time, models are getting larger and larger, and it can be hard to fit them into CPU or GPU memory.
This talk will explore some of the mechanisms that Beam has put in place for large model management so that it can serve your models efficiently without requiring any additional work from the pipeline author. Attendees can expect to come away with an understanding of how Beam loads and serves models, how it optimizes its serving architecture for different model sizes/footprints, and how they can use Beam to serve their models (large or small).
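A minimal sketch of the pipeline-author side, assuming a hypothetical PyTorch model and weights path; the large_model flag asks Beam to load a single copy of the model per worker and share it across processes rather than loading one copy per process:

```python
import apache_beam as beam
import torch
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor


class ToyModel(torch.nn.Module):  # stand-in for a real (large) model definition
    def forward(self, x):
        return x * 2


model_handler = PytorchModelHandlerTensor(
    state_dict_path='gs://my-bucket/models/toy_state_dict.pth',  # placeholder path
    model_class=ToyModel,
    model_params={},
    large_model=True)  # share one model copy across worker processes

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([torch.tensor([1.0, 2.0, 3.0])])
        | RunInference(model_handler)
        | beam.Map(print))
```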
Mariposa Grove
The purpose of this session will be to introduce Beam YAML and its core capabilities. These include, but are not limited to:
The session will wrap up with some use-case examples and how to run the YAML pipelines on Google Cloud Dataflow.
Walker Canyon
Rate limiting and quota exhaustion are common issues for data processing pipelines running at scale. Overprovisioning the worker pool while an external resource is throttled not only increases cost but also puts additional pressure on that resource. This talk introduces recent improvements in tackling this issue, from tracking throttled states in a Beam pipeline to acting on these signals from the runner (particularly Dataflow) side. Finally, it explores options for how users can onboard their custom I/O connectors to the throttling detection features.
Hamina (MP4)
In this session, we’ll explore the transformative integration of Beam YAML with Protobuf, unlocking new possibilities within Apache Beam. Delve into practical applications and benefits, and gain insights from our journey of harnessing Apache Beam’s capabilities.
Walker Canyon
Large language models are well known for their performance on generation tasks like summarization, but they also excel at many classical tasks such as classification, named-entity recognition, and information extraction. Multi-modal LLMs similarly achieve state-of-the-art performance on document understanding. This makes them vital for modern data processing pipelines.
Apache Beam is a powerful framework for defining and executing batch and streaming data processing pipelines. Recent releases introduced many tools to facilitate machine learning workflows, such as MLTransform, RunInference, and the Enrichment transform.
In this talk we will introduce an application that combines Beam’s ML capabilities and LLMs to extract product requests from the various document types found in customer emails, facilitating the automatic fulfillment of orders.
Mariposa Grove
In an ideal scenario, a data processing pipeline performs without issues. When a runtime processing error occurs, Beam normally surfaces the error to the runner. However, in some cases the process running the user code might run out of memory, get stuck, or crash. This can prevent it from reporting the error, leaving the user unaware of the failure’s root cause. In this talk, I’ll discuss troubleshooting techniques for these situations. The techniques I cover can also be applied to debugging other Python applications.
Hamina (MP4)
Harness the power of cross-language transforms by combining the best of Java and Python in your data processing workflows. Discover how to seamlessly integrate Java transforms into your Python pipelines using the SDK’s newest utilities.
With just a few lines, you can also automatically generate well-documented, SDK-ready Python wrappers for existing Java transforms.
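As a minimal sketch of the cross-language mechanism (the Java class name and its builder method are hypothetical), a Java transform can be invoked from Python via JavaExternalTransform, which expands it through a Java expansion service:

```python
import apache_beam as beam
from apache_beam.transforms.external import JavaExternalTransform

# Hypothetical Java transform and builder method; Beam expands it via a Java
# expansion service and stitches it into the Python pipeline.
java_transform = JavaExternalTransform('org.example.MyJavaTransform').withThreshold(10)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b', 'c'])
        | java_transform)
```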
Walker Canyon
In this session, we will explore how Apache Beam can be leveraged to create a robust real-time fraud prevention system. Drawing from real-world implementations at Transmit Security, the presentation will cover the architecture and components of our Detection and Response solution. Attendees will learn about the challenges of analyzing high volumes of data in real-time, and how we utilize a combination of data collection points, enriched data pipelines, and stateful aggregation engines.
We will discuss our transition to using Google Cloud Bigtable for low-latency, high-throughput data management, highlighting its role in enhancing the system’s performance.
Additionally, the session will focus on two unique flows within our system:
Together with Solace, we have developed a new native streaming connector for Solace, a popular messaging platform used in manufacturing, finance and many other industries.
Solace has different APIs for different purposes (moving data around, managing queues, etc.), which can be leveraged together to create a Beam connector with accurate and timely backlog and watermark estimations.
The connector was developed by Solace and Google in collaboration with a customer, and it is an example of cross-industry collaboration for the benefit of all Apache Beam users.
In this talk we explain how Solace works, how we made it work with Beam with high throughput and low latency, and what lessons can be learnt for the design of complex streaming connectors for Beam.
Bonsai
Enabling dynamic topic destinations using the PubsubIO writeMessagesDynamic() function in a Java Dataflow pipeline is an interesting feature that appears to be available only in the Apache Beam Java SDK. This talk showcases a workaround implementation that uses the PubsubIO writeMessagesDynamic() function as an external transform.
Hamina (MP4)
In this session, we will explore our journey to improve the stability of our Flink application using the Python Beam SDK runner, with a particular focus on memory tuning. Our initial setup faced significant challenges, including frequent task manager disconnections and ambiguous error logs, often hinting at out-of-memory (OOM) issues. Despite no clear indicators of high memory usage, the instability worsened after transitioning from the Lyft K8s operator to the Apache Flink operator.
Key points include:
In this session, you will learn what we consider a Dead Letter in Beam, the high-level DLQ architecture we’ve implemented, and some example use cases showing how to incorporate DLQs in your pipelines.
Hamina (MP4)
This session will provide an overview of how to use large language models (LLMs) with Apache Beam’s RunInference framework. The prime example for the talk will be running the Gemma open model on Dataflow, outlining considerations and common pitfalls when writing pipelines with LLMs.
Mariposa Grove
Imagine attending an event: as you park your car, you get a notification telling you which entrance closest to your parking spot currently has the shortest line. Then, when you check in, you get a reminder (only if you’ve parked) that you can have your parking validated if you spend on concessions or merchandise as part of your membership benefits. And you have a hankering for nachos, hoping they don’t run out. This would make your experience amazing and unique!
How can you do this, though? You need to connect, in real time, your parking data to your membership data and to the current lines. How does the event organizer easily know how busy the lines are, or predict how fast they will run out of things?
You’ll build a pipeline, from scratch, that incorporates all of these things. You’ll need to put all of your Beam knowledge (or learn it along the way!) to the test. Read multiple inputs, in real time, and make sure you’re enriching them with the right information. Write multiple outputs, in real time, to actually do something with the data. And of course, we’ll show you how you can use an LLM (or any model!) in the same pipeline.
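As a minimal, toy sketch of the joining step (in-memory data stands in for the real-time sources; the field names are invented), two keyed inputs can be combined with CoGroupByKey so notifications can draw on both:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    parking = p | 'Parking' >> beam.Create([('member-1', {'lot': 'A', 'spot': 42})])
    members = p | 'Members' >> beam.Create([('member-1', {'tier': 'gold'})])

    _ = (
        {'parking': parking, 'membership': members}
        | beam.CoGroupByKey()  # join the two inputs on the member id
        | beam.MapTuple(lambda member_id, joined: {
              'member_id': member_id,
              'parking': list(joined['parking']),
              'membership': list(joined['membership']),
          })
        | beam.Map(print))  # a real pipeline would write notifications instead
```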
Bonsai
Bulk Inference in Machine Learning (ML) refers to the challenge of how to organize and compute model predictions for a large pool of available input data with no latency requirements. JAX is an open-source computation library commonly used by both engineers and researchers for flexible, high-performance ML development. This talk will illustrate how teams at Google are using Beam to ergonomically design, orchestrate, and scale JAX Bulk Inference workloads across various accelerator platforms.
Mariposa Grove
The rise of Generative AI (GenAI) has revolutionized Machine Learning (ML) Development Workflows. Starting from large pre-trained models introduces new challenges in resource management, large model evaluation (both human and automated), efficient bulk inference, and effective handling of massive model embeddings. This talk distills key lessons and best practices from Google’s experience in deploying GenAI models at scale, focusing on the adaptation and evolution of MLOps principles to tackle the unique demands of this emerging field.
Mariposa Grove
Do we even need PCollections? Or ProcessElements? Can we have the language fully typecheck the pipeline for us at compile time? Can we do that in Go?
Since Beam was designed, programming languages have continued to evolve and change, so why can’t our SDKs? We’ve now got ample experience with the Apache Beam Go SDK, but the language it was designed for is now very different.
This short talk will compare the current Go SDK with an experimental implementation that takes better advantage of the current strengths of Go, its approach to generic type parameters, and more.
Walker Canyon
A successful re-engineering in banking: a Drools ParDo using a KieContainer (JBoss) in Dataflow has been used to process a business rules .jar, merging several Spring microservices into Dataflow. This design, together with another Dataflow app and the pattern adopted in “Avoid HTTP requests duplicates in Apache Beam with SCIO, a custom BaseAsyncDoFn and State and Timers”, has allowed us to get rid of all those microservices and their infrastructure, thanks to Apache Beam and Dataflow.
Hamina (MP4)
The presentation will cover troubleshooting RunInference, GPU, OOM, and other related errors, offering practical insights from real-world customer support experiences.
Mariposa Grove
Dataflow has historically supported only Exactly Once processing, but it recently released an At Least Once mode. Understand each architecture, its pros and cons, and in which use cases one is better than the other.
Hamina (MP4)
We introduce you to Beamstack, an open-source framework currently under development, aimed at facilitating the deployment of Machine Learning and GenAI workflow pipelines with Apache Beam on Kubernetes, whether on-premises or in the cloud. It encompasses a holistic solution, featuring abstraction layers that optimize the deployment of various components of machine learning pipelines, data processing workflows, and deployment infrastructure.
At the core of Beamstack’s functionality lie Kubernetes Custom Resource Definitions (CRDs). These CRDs constitute a potent mechanism for extending the Kubernetes API, facilitating the seamless integration of ML-centric resources within the Kubernetes ecosystem. Through this approach, Beamstack empowers users to capitalize on the comprehensive capabilities and features offered by Kubernetes while unlocking the boundless potential of Apache Beam for Machine Learning development by various teams in any organization.
In this session, we will discuss BeamStack’s use cases and product roadmap, as well as features that have already been implemented, those currently under implementation, and those planned for the future. We will also address our current challenges and areas where we need support from contributors. Join us in shaping the future of ML development tooling around Apache Beam by becoming a part of the Beamstack community.
Mariposa Grove
Apache Beam and Apache Airflow are powerful tools in the data engineering ecosystem, often used separately but rarely in tandem. This talk explores the synergy between these “distant cousins” by demonstrating how to seamlessly integrate Beam pipelines within Airflow workflows.
We’ll dive into the challenges of orchestrating complex data processing tasks and show how combining Airflow’s scheduling capabilities with Beam’s robust data processing framework can create a more efficient and manageable data pipeline architecture.
Attendees will learn how to leverage Airflow’s DAG (Directed Acyclic Graph) to trigger Beam jobs seamlessly, enabling them to orchestrate sophisticated, distributed data processing tasks across data platforms, such as Google Cloud Dataflow. By the end of this session, participants will gain practical insights into integrating these technologies, enhancing their ability to build and maintain resilient, efficient data pipelines that meet the demands of modern data-driven applications.
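As a minimal sketch of one common integration path (the DAG id, pipeline file, and GCS paths are placeholders), Airflow’s Beam provider can submit a Python pipeline to Dataflow from a DAG task:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

with DAG(
    dag_id='beam_on_airflow_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # trigger manually, or set a cron expression
    catchup=False,
) as dag:
    run_beam_job = BeamRunPythonPipelineOperator(
        task_id='run_beam_job',
        py_file='gs://my-bucket/pipelines/wordcount.py',  # placeholder pipeline
        runner='DataflowRunner',
        pipeline_options={
            'project': 'my-gcp-project',
            'region': 'us-central1',
            'temp_location': 'gs://my-bucket/temp/',
        },
    )
```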
Walker Canyon
We have developed BeamStack, a toolkit tailored to enhance the deployment and management of ML and GenAI workloads on Kubernetes. It diminishes the intrinsic complexities associated with managing Kubernetes clusters and deploying Apache Beam workloads. Fundamentally, BeamStack leverages Beam YAML, a structured format that enables the declarative definition of pipelines. This facilitates rapid deployment and scalability across diverse Kubernetes environments, encompassing cloud-based solutions and local clusters such as Minikube.
BeamStack distinguishes itself through its proficiency in orchestrating AI pipelines within Kubernetes environments. It provides streamlined workflows that enhance the efficiency of setup, deployment, and management processes. Moreover, BeamStack seamlessly integrates with monitoring tools such as Prometheus and Grafana. Its overarching objective is to democratize the deployment of AI workloads with Apache Beam on Kubernetes, empowering users with the confidence to deploy seamlessly while optimizing performance. Our platform’s user-friendly design simplifies the process, making the deployment of Apache Beam jobs universally attainable.
In this talk, we’ll delve into the development journey of BeamStack, a toolkit crafted to simplify the deployment and management of Apache Beam ML workloads on Kubernetes. We’ll explore the motivations behind BeamStack’s creation, the challenges it addresses, and the key components that make it a powerful tool for AI workload deployment.
Mariposa Grove
This talk will explain some of the capabilities of Dataflow that can help users save costs:
Level up your streaming pipelines by learning about the latest advancements in Dataflow Streaming, including new features such as At-Least-Once Mode, Active Load Balancing, Autoscaling Hints, and In-flight Autoscaling Updates, as well as best practices for Kafka source pipelines and ML pipelines.
Walker Canyon
While the basic features of Beam YAML allow one to write simple to moderately complex pipelines, there are limits to what can be developed relying solely on the built-in transforms and basic features. Luckily, Beam YAML was developed with these users in mind and offers multiple ways to leverage more advanced features of Beam to implement these sophisticated use cases.
The purpose of this session will be to dive into the more advanced features Beam YAML has to offer. Topics include, but are not limited to:
If time allows, there will be some use-cases demonstrating the features presented above.
This session builds upon the information presented in the introductory session, and it is recommended that attendees view that session before diving into the topics presented here.
Walker Canyon
Retrieval Augmented Generation (RAG) has emerged as a groundbreaking technique in the field of generative AI. By providing a large language model with relevant data to solve a given task, it can generate answers with much higher accuracy. Beyond enhanced performance, RAG allows us to work with sensitive data without resource-intensive in-house model training. However, efficiently preprocessing and ingesting document data into vector databases for RAG applications can be challenging, especially when dealing with real-time updates.
In most RAG applications, relevant data is fetched from a vector database through semantic search. From experience working on RAG applications, we have noticed that building a robust data processing pipeline to keep the data in the vector database up to date can be a challenge. Beam is a particularly powerful tool for this task since it supports both batch and streaming data, allowing us to reuse the same data processing pipeline both for processing large-scale datasets of relevant information and for keeping the vector database up to date with live streaming data.
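A minimal sketch of that reuse idea (the chunking/embedding logic and sources are placeholders, not the authors’ pipeline): the same composite transform can back both a one-off batch backfill and a streaming pipeline that keeps the vector store current:

```python
import apache_beam as beam


class PrepareForVectorDb(beam.PTransform):
    """Placeholder chunking/embedding step shared by batch and streaming pipelines."""

    def expand(self, docs):
        return docs | beam.Map(
            lambda doc: {'id': doc['id'], 'embedding': [float(len(doc['text']))]})


# One-off backfill over an existing corpus (batch)...
with beam.Pipeline() as backfill:
    _ = (
        backfill
        | beam.Create([{'id': 'doc-1', 'text': 'hello world'}])
        | PrepareForVectorDb()
        | beam.Map(print))  # a real pipeline would upsert into the vector database

# ...and the same transform reused unchanged in a streaming pipeline, e.g. after
# reading documents from a (hypothetical) Pub/Sub subscription:
#   p | beam.io.ReadFromPubSub(subscription=...) | beam.Map(parse) | PrepareForVectorDb()
```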
It is crucial to recognize that proper preprocessing and analysis of data are often underestimated yet fundamental components in building effective RAG applications. As the age-old adage goes, ‘garbage in, garbage out,’ emphasizing the significance of ensuring high-quality input data for optimal output results. Apache Beam can play an important role in this process, offering both batch and streaming data processing capabilities that are essential for developing production-ready RAG applications.
Mariposa Grove
Usage Billing is a billing concept in which charges are based on consumption. At LinkedIn, the Usage Billing system processes large volumes of consumption data, particularly for Ads and Jobs use cases. In this presentation, we will discuss how the team rearchitected the existing usage platform to adopt a more streaming-oriented approach using Apache Beam and the Samza Runner. The new system can aggregate usage data across various dimensions and supports multiple rules to handle different aggregation windows. It also includes change data capture and correction events to ensure the high level of accuracy required when handling customers’ money.
Hamina (MP4)
This session covers the cost components of Dataflow pipelines, different run configurations, common performance optimization techniques, and approaches to monitoring the cost and performance of pipelines.
Hamina (MP4)
In this session, I’ll share how Lyft uses Beam’s portability framework with the Flink execution engine for real-time demand and supply forecasting. We will dive into our architecture, scale, and lessons learned.
Walker Canyon
The session is aimed at practitioners looking to use streaming technologies to process data for use in near-real-time RAG architectures.
For context, the session is based on this article: https://beam.apache.org/blog/dyi-content-discovery-platform-genai-beam
Agenda:
In this talk we will walk through the process of building and deploying Beam Dataflow templates.
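As a minimal sketch of one such flow (classic templates; the project, bucket, and paths are placeholders), running a Python pipeline with template_location set stages the template instead of executing the job, so it can be launched later from that template; Flex Templates follow a different, container-based flow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp/',
    template_location='gs://my-bucket/templates/uppercase-template',  # where to stage
)

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | beam.io.ReadFromText('gs://my-bucket/input/*.txt')
        | beam.Map(str.upper)
        | beam.io.WriteToText('gs://my-bucket/output/result'))
```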
Walker Canyon
Building out a data ingestion and enrichment pipeline for a Retrieval Augmented Generation (RAG) system using Apache Beam. The pipeline built a knowledge base by ingesting text data into vector databases, specifically utilizing Redis and OpenSearch for efficient semantic search and data retrieval. These databases stored vectorized representations of text chunks, allowing the system to enhance user queries by matching them with relevant text fragments. The design of the pipeline ensured scalable and effective RAG operations, leveraging the strengths of Redis and OpenSearch in vector-based search.