Processing over 2 billion events per month across dozens of marketing clients requires resolving fragmented user identities in near real time, and then reliably delivering the resulting audiences to ad platforms that fail in creative new ways every week. This talk covers both ends of that pipeline.
We’ll dig into how we built a multi-tenant identity graph on Apache Beam (Dataflow) and Google Cloud Spanner: composite match key design, weighted conflict resolution across disparate signal sources (ad platforms, first-party data, server-side events), and the Beam pipeline architecture for continuous ingest and deduplication. Expect concrete lessons on schema design trade-offs, handling late-arriving data in identity merges, tenant isolation patterns in Spanner, and how this foundation powers downstream ML models for predicted lifetime value.
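To make the composite match key and weighted conflict resolution ideas concrete, here is a minimal sketch in plain Python. It is not the pipeline from the talk: the `Signal` fields, the `SOURCE_WEIGHTS` values, and the "highest summed weight wins" rule are all assumptions chosen for illustration, and in a real Beam job the grouping would happen per key rather than in a dictionary.

```python
from dataclasses import dataclass

# Hypothetical trust weights per signal source; real values would be tuned per tenant.
SOURCE_WEIGHTS = {"first_party": 1.0, "server_side": 0.8, "ad_platform": 0.5}


@dataclass(frozen=True)
class Signal:
    tenant_id: str          # tenant scoping keeps identities from leaking across clients
    match_type: str         # e.g. "email_sha256", "phone_sha256", "device_id"
    match_value: str
    source: str             # one of the SOURCE_WEIGHTS keys
    candidate_user_id: str  # the identity this signal claims the key belongs to


def match_key(sig):
    """Composite match key: tenant first, then identifier type, then value."""
    return (sig.tenant_id, sig.match_type, sig.match_value)


def resolve(signals):
    """For each match key, pick the candidate with the highest summed source weight."""
    scores = {}
    for sig in signals:
        per_key = scores.setdefault(match_key(sig), {})
        weight = SOURCE_WEIGHTS.get(sig.source, 0.0)  # unknown sources carry no weight
        per_key[sig.candidate_user_id] = per_key.get(sig.candidate_user_id, 0.0) + weight
    # sorted() makes ties deterministic: the lexicographically smallest ID wins.
    return {key: max(sorted(cands), key=cands.get) for key, cands in scores.items()}


signals = [
    Signal("t1", "email_sha256", "abc", "first_party", "u1"),
    Signal("t1", "email_sha256", "abc", "ad_platform", "u2"),
    Signal("t1", "email_sha256", "abc", "server_side", "u2"),
    Signal("t2", "email_sha256", "abc", "ad_platform", "u9"),
]
resolved = resolve(signals)
```

In this toy input, two weaker sources agreeing on `u2` outweigh a single first-party claim for `u1`, and the same hashed email under tenant `t2` resolves independently because the tenant ID leads the key.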
We’ll close with the resilience patterns that keep this graph flowing outward to Meta, Google, TikTok, and Snapchat: adaptive batching tuned to per-platform rate limits and payload constraints, circuit breakers that isolate failing destinations without stalling the pipeline, and a structured dead letter queue system with automated replay. We’ll include just enough DoFn-level detail to show how these patterns hold up under real third-party API volatility.
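As a rough illustration of how a circuit breaker and a dead letter queue can work together, the sketch below is plain Python rather than an actual DoFn. The `CircuitBreaker` class, the `deliver` helper, the thresholds, and the dead letter record shape are all hypothetical; adaptive batching and automated replay are left out to keep it short.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; permits a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)start the cool-down


def deliver(batches, send, breaker, dead_letters):
    """Send each batch; park anything undeliverable instead of stalling."""
    sent = 0
    for batch in batches:
        if not breaker.allow():
            dead_letters.append({"batch": batch, "reason": "circuit_open"})
            continue
        try:
            send(batch)
        except Exception as exc:  # a real pipeline would separate retryable errors
            breaker.record_failure()
            dead_letters.append({"batch": batch, "reason": repr(exc)})
        else:
            breaker.record_success()
            sent += 1
    return sent
```

One breaker per destination (and possibly per tenant) is what keeps a failing platform from blocking deliveries to the others; the parked records carry enough context for a later replay job to pick them up.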