Scio in-depth workshop

Jul-20, 2022, 09:00-12:00 (America/Los_Angeles) in 202

This workshop combines several talks and a hands-on session on Scio, the open-source Scala API for Apache Beam.

Sessions:

  1. Scio & Scala to enhance the Beam experience

    An introduction to Scio and how it leverages features of the Scala programming language.

  2. A hands-on workshop for Scio

    We will work through a series of kata-like exercises for Scio, where we progressively reveal new concepts and SDK utilities, and build up our knowledge of how to use Scio in our applications.

  3. Algorithms for Join optimizations in Scio

    Joining large datasets is one of the main tasks when working with Beam and Scio. Joins are a big source of runtime and cost for these sorts of pipelines, as they cause most PCollection data to be serialized and transferred over to new workers. This talk studies how Scio can save you time and money with clever join strategies and approximate algorithms.

  4. How to optimize cost and runtime when doing rollup aggregations in Scio

    We will explain the use case and algorithm behind the rollupAndCount aggregation, which is part of the scio-extra package. When creating a dataset with rollup dimensions, a potentially huge fan-out transform precedes the aggregation step and can incur large shuffle costs. Rethinking the problem makes it possible to reduce this fan-out drastically. This talk covers the backstory of the use case we had at Spotify and explains how we developed the algorithm behind rollupAndCount to solve this problem more efficiently.
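
To make the fan-out mentioned in session 4 concrete, here is a minimal sketch of a naive rollup count in plain Scala collections. It does not use Scio or Beam and is not the rollupAndCount algorithm, which this page does not describe. The Event type, its three dimensions, and the "*" wildcard are assumptions made for illustration. Each record is expanded into one key per combination of kept and rolled-up dimensions, so n dimensions yield 2^n keys per record before any aggregation happens.

```scala
// Plain Scala collections only; no Scio or Beam dependency.
// Illustrates the naive fan-out, NOT Scio's rollupAndCount implementation.
object RollupFanOutSketch {
  // Hypothetical record type with three rollup dimensions.
  final case class Event(country: String, platform: String, product: String)

  // Wildcard standing in for a dimension that has been rolled up (assumed).
  val All = "*"

  // For each dimension, either keep its value or replace it with the
  // wildcard. n dimensions therefore produce 2^n keys per record.
  def rollupKeys(dims: List[String]): List[List[String]] =
    dims.foldRight(List(List.empty[String])) { (dim, acc) =>
      acc.flatMap(rest => List(dim :: rest, All :: rest))
    }

  // Naive approach: fan every record out to all of its rollup keys,
  // then group by key and count. In a distributed pipeline, the
  // fanned-out records are what gets shuffled.
  def naiveRollupCount(events: Seq[Event]): Map[List[String], Long] =
    events
      .flatMap(e => rollupKeys(List(e.country, e.platform, e.product)))
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size.toLong }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("SE", "ios", "premium"),
      Event("SE", "android", "free"),
      Event("US", "ios", "premium")
    )
    val fannedOut = events.size * rollupKeys(List("a", "b", "c")).size
    println(s"input records: ${events.size}, records after fan-out: $fannedOut")
    naiveRollupCount(events).toSeq
      .sortBy(_._1.mkString(","))
      .foreach { case (key, count) => println(s"${key.mkString(",")} -> $count") }
  }
}
```

With three dimensions, every input record becomes eight records ahead of the grouping step. That multiplier doubles with each added dimension, which is why reducing the fan-out can matter for shuffle cost.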
