At Wells Fargo, we back up and synchronize Apache Iceberg tables across environments using Project Nessie as a catalog-level control plane and Apache Beam as the unified replication engine. By combining object storage replication with Nessie's Git-like metadata versioning, orchestrated through a single Beam pipeline, we demonstrate how production Iceberg tables can be continuously mirrored into non-production catalogs without low-level database syncs. The architecture consists of two coordinated replication layers, implemented as a unified Apache Beam pipeline: a storage layer that copies Iceberg data and metadata files between object stores, and a catalog layer that replays Nessie commits against the target catalog through the Nessie API.
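As a rough sketch of how the two layers coordinate per commit (all names here are hypothetical, and dicts stand in for object stores and catalogs; the real pipeline expresses this as Apache Beam transforms over the Nessie REST API):

```python
# In-memory stand-ins: object stores and catalogs are plain dicts; a commit
# records the files it added and the table metadata pointers it updated.
def replicate_commit(commit, src_store, dst_store, dst_catalog):
    """Apply one Nessie commit to the non-production environment."""
    # Layer 1: object storage replication -- copy the Iceberg data and
    # metadata files this commit references into the target store.
    for path in commit["files"]:
        dst_store[path] = src_store[path]
    # Layer 2: catalog replication -- replay the commit's table updates
    # against the target catalog so it mirrors production at the API level.
    dst_catalog.update(commit["tables"])
    return commit["id"]


src_store = {
    "s3://prod/tbl/data-1.parquet": b"rows",
    "s3://prod/tbl/v2.metadata.json": b"meta",
}
dst_store, dst_catalog = {}, {}
commit = {
    "id": "c1",
    "files": list(src_store),
    "tables": {"analytics.tbl": "s3://prod/tbl/v2.metadata.json"},
}
replicate_commit(commit, src_store, dst_store, dst_catalog)
```

The ordering matters: files land in the target store before the catalog pointer is updated, so a reader of the mirrored catalog never sees a table whose files are missing.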
This means replication is driven by Nessie's own versioning semantics rather than internal MongoDB implementation details, making the approach more resilient to Nessie upgrades.

A Unified Pipeline with Batch and Streaming Modes

Both layers run within the same Beam pipeline, giving us a single model for two distinct operational needs. In streaming mode, the pipeline continuously watches for new Nessie commits and triggers incremental storage and catalog replication, keeping non-production environments near-current with production. In batch mode, the same pipeline handles full environment bootstrapping or point-in-time recovery to a specific Nessie snapshot. Beam's runner portability was essential in our regulated environment: pipelines are developed and validated locally using the Direct Runner before being deployed to our Spark cluster via the Spark Runner, without any rewrite. Once replication completes, the non-production Nessie catalog becomes a true, API-level mirror of production. We will share practical lessons learned, including Nessie API pagination at scale, handling Beam pipeline failures mid-replication, and ensuring catalog consistency when storage and metadata sync are not atomic.
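The pagination lesson can be sketched as a generic token-driven paginator. Here `fetch_page` is a hypothetical stand-in for a Nessie client call that returns a page of entries plus a continuation token; the fake client below only exercises the loop:

```python
def iter_pages(fetch_page, page_size=100):
    """Yield every entry from a token-paginated endpoint.

    fetch_page(max_records, page_token) -> (entries, next_token), where
    next_token is None once the last page has been served.
    """
    token = None
    while True:
        entries, token = fetch_page(page_size, token)
        yield from entries
        if token is None:
            return


# Fake client serving 250 commit ids in pages, to exercise the paginator.
def fake_fetch(max_records, token):
    start = int(token or 0)
    end = min(start + max_records, 250)
    return list(range(start, end)), (str(end) if end < 250 else None)


commits = list(iter_pages(fake_fetch))  # drains all three pages in order
```

Draining the token loop to completion, rather than assuming one page holds everything, is what keeps the commit log walk correct once a busy production catalog outgrows a single response.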
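Because storage and metadata sync are not atomic, one way to preserve catalog consistency is a gate that replays a commit's catalog updates only after every file it references has landed in the target store. A minimal sketch, with hypothetical names and dicts standing in for the store and catalog:

```python
def commit_is_consistent(commit, dst_store):
    """True once every file the commit references exists in the target store."""
    return all(path in dst_store for path in commit["files"])


def publish_if_consistent(commit, dst_store, dst_catalog):
    """Replay the commit's catalog updates only after its files are present."""
    if not commit_is_consistent(commit, dst_store):
        return False  # storage sync still in flight; retry on the next pass
    dst_catalog.update(commit["tables"])
    return True


store, catalog = {}, {}
commit = {
    "files": ["s3://np/tbl/data-1.parquet"],
    "tables": {"analytics.tbl": "s3://np/tbl/v3.metadata.json"},
}
early = publish_if_consistent(commit, store, catalog)  # files not copied yet
store["s3://np/tbl/data-1.parquet"] = b"rows"          # storage layer catches up
late = publish_if_consistent(commit, store, catalog)
```

The gate makes a mid-replication failure safe to retry: a commit whose storage copy was interrupted is simply withheld from the mirrored catalog until the next pass completes it.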