Buffering Data by Timestamp: A Step Towards Time Series Processing in Beam

Time series data is a foundational and ubiquitous format in modern big data applications, driving insights in fields ranging from user activity tracking to IoT sensor monitoring. Interest in processing time series data within Apache Beam is growing, but Beam's parallel execution model does not guarantee element ordering, so developers must implement complex custom logic to process events in chronological order.

In this talk, we explore the crucial first step of time series processing in Beam: buffering data in precise timestamp order to enable accurate downstream analysis. We evaluate and compare several buffering approaches and weigh their trade-offs. Finally, we demonstrate these concepts in a real-world anomaly detection use case built on the recently developed BigQuery CDC source.

Shunping Huang

Claude van der Merwe
