Speaker(s):

Deduplication: Where Beam Fits In

Aug-4 20:00-20:25
Add to Calendar 08/04/2021 8:00 PM 08/04/2021 8:25 PM America/Los_Angeles AS24: Deduplication: Where Beam Fits In

This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We’ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s open source codebase [0] for ingesting telemetry data from Firefox clients.

We’ll compare the robustness, performance, and operational experience of using the deduplication built in to PubsubIO vs. storing IDs in an external Redis cluster and why Mozilla switched from one approach to the other.

Finally, we’ll compare streaming deduplication to a much stronger end-to-end guarantee that Mozilla achieves via nightly scheduled queries [1] to serve historical analysis use cases.


This session will start with a brief overview of the problem of duplicate records and the different options available for handling them. We’ll then explore two concrete approaches to deduplication within a Beam streaming pipeline implemented in Mozilla’s open source codebase [0] for ingesting telemetry data from Firefox clients.

We’ll compare the robustness, performance, and operational experience of using the deduplication built in to PubsubIO vs. storing IDs in an external Redis cluster and why Mozilla switched from one approach to the other.

Finally, we’ll compare streaming deduplication to a much stronger end-to-end guarantee that Mozilla achieves via nightly scheduled queries [1] to serve historical analysis use cases.