The open source PyTorch machine learning (ML) framework is widely used today for AI training, but its applications extend beyond that. IBM is working on development initiatives to make PyTorch a more viable option for inference as well. In an interview with VentureBeat, Raghu Ganti, principal research staff member at IBM, discussed the research efforts that aim to establish PyTorch as an open source alternative for inference across a range of hardware platforms.
Accelerating Inference with PyTorch
PyTorch, originally developed by Meta (formerly Facebook) and now under the leadership of the PyTorch Foundation, provides a scalable and flexible framework for training AI models. However, deploying these models for production and delivering results to clients poses its own challenges. Ganti emphasized the need for fast inference with minimal latency to ensure rapid responses. IBM is addressing this challenge by combining three techniques within PyTorch: graph fusion, kernel optimizations, and parallel tensors.
- Graph Fusion: This technique reduces the volume of communication between the CPU and GPU, improving the efficiency of inference (see the torch.compile sketch after this list).
- Kernel Optimizations: By optimizing memory access during inference, PyTorch achieves better performance and streamlines the attention computation (see the fused-attention sketch below).
- Parallel Tensors: IBM leverages parallel tensors to accelerate inference by spreading memory utilization across multiple GPUs (see the tensor-parallel sketch below).
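To make the first technique concrete, here is a minimal sketch of graph fusion using torch.compile, PyTorch's graph-capture and fusion entry point in the 2.x releases. The model and shapes are illustrative placeholders, not IBM's actual workload.

```python
# A minimal sketch of graph fusion via torch.compile (PyTorch 2.x).
# The model and tensor shapes are illustrative, not IBM's workload.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).eval().to(device)

# torch.compile captures the model as a graph and fuses operations,
# cutting the per-op launch traffic between the CPU and GPU.
compiled_model = torch.compile(model)

x = torch.randn(8, 1024, device=device)
with torch.no_grad():
    out = compiled_model(x)  # first call compiles; later calls reuse the fused graph
```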
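For the second technique, PyTorch 2.x exposes a fused attention kernel through torch.nn.functional.scaled_dot_product_attention, which avoids materializing the full attention matrix in memory. The sketch below illustrates the general mechanism; it is not a claim about IBM's specific kernels, and the shapes are arbitrary.

```python
# A minimal sketch of a memory-optimized attention kernel via PyTorch's
# scaled_dot_product_attention (PyTorch 2.x); shapes are illustrative.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, seq, head_dim = 2, 8, 512, 64

q = torch.randn(batch, heads, seq, head_dim, device=device)
k = torch.randn(batch, heads, seq, head_dim, device=device)
v = torch.randn(batch, heads, seq, head_dim, device=device)

with torch.no_grad():
    # Dispatches to a fused kernel (e.g. FlashAttention) when available,
    # avoiding materializing the full seq x seq attention matrix.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```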
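For the third technique, the hand-rolled sketch below shows the core idea of tensor parallelism: each GPU holds a shard of a weight matrix, computes a partial result, and a collective operation stitches the pieces back together. It assumes one process per GPU launched with torchrun; the layer sizes and helper name are hypothetical, and this is a sketch of the general idea rather than IBM's implementation.

```python
# A hand-rolled sketch of tensor parallelism for one linear layer, assuming
# torch.distributed with one process per GPU (e.g. launched via torchrun).
# column_parallel_linear and the sizes are illustrative, not IBM's code.
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, weight_shard: torch.Tensor) -> torch.Tensor:
    """Each rank holds a column shard of the weight and computes a slice of
    the output; all_gather stitches the slices into the full result."""
    local_out = x @ weight_shard.t()  # (batch, out_features // world_size)
    gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_out)
    return torch.cat(gathered, dim=-1)  # (batch, out_features)

if __name__ == "__main__":
    dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo")
    rank, world = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank}" if torch.cuda.is_available() else "cpu")

    torch.manual_seed(0)  # same full weight on every rank before sharding
    full_weight = torch.randn(4096, 1024, device=device)
    shard = full_weight.chunk(world, dim=0)[rank]  # this rank's column shard

    x = torch.randn(8, 1024, device=device)
    out = column_parallel_linear(x, shard)  # every rank ends with the full output
    dist.destroy_process_group()
```

Because each GPU only stores and reads its own shard, the memory footprint and bandwidth pressure per device drop roughly in proportion to the number of GPUs, which is what makes this approach attractive for large models.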
Together, these techniques remove processing bottlenecks and improve memory access, resulting in faster inference. Using the combined optimizations on PyTorch nightly builds, IBM researchers achieved inference speeds of 29 milliseconds per token on a 100-GPU system for a large language model with 70 billion parameters.
Future Developments and Deployments
IBM acknowledges that its efforts to accelerate PyTorch for inference are still in progress and not yet ready for production deployment. However, the company plans to contribute these inference optimization capabilities to the open source project and integrate them into the mainline codebase. IBM is also working on dynamic batching, a technique for improving GPU utilization during model inference. By dynamically grouping multiple inference requests and processing them as a single batch on the GPU, PyTorch can scale its inference capabilities for enterprise deployments (a sketch of the idea follows below).
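To make the idea concrete, here is a minimal sketch of dynamic batching: a worker thread drains a request queue until it has a full batch or a timeout expires, then runs a single batched forward pass and routes each result back to its caller. The queue-based request format, the model, and parameters such as max_batch_size and max_wait_s are hypothetical and for illustration only, not IBM's serving stack.

```python
# A minimal sketch of dynamic batching: group queued inference requests
# and run them through the GPU (or CPU) in a single forward pass.
# The request format, model, and parameters are illustrative only.
import queue
import threading
import time
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).eval()
requests = queue.Queue()  # each item: (input tensor, per-request reply queue)

def serve(max_batch_size: int = 16, max_wait_s: float = 0.01) -> None:
    while True:
        batch = [requests.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        inputs = torch.stack([x for x, _ in batch])  # group requests into one batch
        with torch.no_grad():
            outputs = model(inputs)  # one forward pass serves the whole batch
        for (_, reply), out in zip(batch, outputs):
            reply.put(out)  # route each result back to its caller

threading.Thread(target=serve, daemon=True).start()

# Client side: enqueue an input with a private reply queue, wait for the result.
reply = queue.Queue()
requests.put((torch.randn(1024), reply))
result = reply.get()
```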
“From our perspective, making PyTorch really enterprise ready is key.”
– Raghu Ganti, Principal Research Staff Member at IBM
The upcoming PyTorch 2.1 update is expected to include some of the optimizations used by IBM, making them more widely available. IBM’s goal is to make PyTorch a robust framework for both training and inference, enabling enterprises to leverage the power of AI in their operations.