Accelerating Machine Learning Predictions with NVIDIA TensorRT and Apache Beam

Jun-14 16:45-17:10 UTC
Room: Horizon

Loading and preprocessing data to run machine learning models at scale can be challenging: it requires seamlessly integrating the data processing framework with the inference engine. In this talk, we’ll explore how NVIDIA TensorRT can be integrated with the Apache Beam SDK to simplify embedding complex inference scenarios within a data processing pipeline. We’ll demonstrate how TensorRT and the Apache Beam RunInference API can accelerate machine learning predictions, particularly for large models such as transformers.

Developing machine learning systems requires managing several steps, from data ingestion and preprocessing to inference and post-processing, and keeping track of all these moving parts can be a significant challenge. By combining the power of NVIDIA TensorRT with the flexibility of the Apache Beam SDK, you can stitch the data processing framework and inference engine together seamlessly. This integration can help reduce production inference costs while improving NVIDIA GPU utilization, latency, and throughput.

We’ll walk through an end-to-end example of how the RunInference API in Apache Beam can be used with TensorRT to accelerate machine learning predictions. Our example uses a BERT-based text classification model for sentiment analysis, and we’ll demonstrate how our approach can lead to significant speed improvements over traditional methods.

We’ll also share benchmarks that quantify the performance improvements achieved by integrating NVIDIA TensorRT with the Apache Beam SDK. You’ll come away with a deeper understanding of how to use these tools to achieve high-throughput, low-latency model inference and make your machine learning pipeline more efficient and scalable.

Session: live, 25 minutes