Training AI Models at Unprecedented Speed: MLPerf Training 3.1 Benchmark Results

The pace of innovation in the generative AI space is truly remarkable. In 2023, training AI models has become significantly faster, as revealed by the MLPerf Training 3.1 benchmark released today. MLPerf Training, a benchmark developed by MLCommons, tracks and measures the ability to rapidly train models, which is crucial for driving AI advancements.

The Power of MLPerf Training 3.1 Benchmark

Included in the MLPerf Training 3.1 benchmark are submissions from 19 vendors, showcasing over 200 performance results. Notable among these results are the benchmarks for large language model (LLM) training with GPT-3. The performance improvements are truly impressive, with MLCommons executive director David Kanter stating, “We’ve got over 200 performance results and the improvements in performance are fairly substantial, somewhere between 50% to almost up to 3x better.”

“It’s about 2.8x faster comparing the fastest LLM training benchmark in the first round [in June], to the fastest in this round,” Kanter said. “I don’t know if that’s going to keep up in the next round and the round after that, but that’s a pretty impressive improvement in performance and represents tremendous capabilities.”

The rapid progress in AI training, according to Kanter, is surpassing what Moore’s Law would predict. Moore’s Law anticipates a doubling of compute performance roughly every two years, but the AI industry is scaling hardware architecture and software at a faster pace.
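To put that claim in perspective, here is a hedged back-of-envelope comparison (not from the article): it annualizes the ~2.8x gain Kanter cited, assuming roughly five months between the June and November rounds, and sets it against a Moore's Law baseline of doubling every 24 months.

```python
# Back-of-envelope comparison (assumptions: ~2.8x gain over ~5 months,
# Moore's Law doubling every ~24 months -- neither figure is exact).
observed_gain = 2.8
months_elapsed = 5
moores_doubling_months = 24

# Annualize both growth rates for an apples-to-apples comparison.
observed_annualized = observed_gain ** (12 / months_elapsed)   # ~11.8x per year
moores_annualized = 2 ** (12 / moores_doubling_months)         # ~1.4x per year

print(f"Observed pace (annualized): {observed_annualized:.1f}x per year")
print(f"Moore's Law (annualized):   {moores_annualized:.1f}x per year")
```

Even with generous error bars on the assumptions, the observed pace is an order of magnitude ahead of the Moore's Law baseline.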

Scaling Hardware and Software Innovations

Intel, Nvidia, and Google are among the companies that have made significant strides in enabling faster LLM training results in the MLPerf Training 3.1 benchmarks.

  • Intel’s Habana Gaudi 2 accelerator achieved a 103% boost in training speed through techniques such as the 8-bit floating point (FP8) data type.
  • Google’s Cloud TPU v5e, which recently became generally available, also leveraged FP8 for optimal training performance. Google’s Cloud TPU multislice technology further enhanced scaling capabilities.
  • Nvidia, not to be outdone, used its supercomputer known as EOS to conduct MLPerf Training 3.1 benchmarks. The system features an impressive 10,752 GPUs connected via Nvidia Quantum-2 InfiniBand, offering over 40 exaflops of AI compute power.
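FP8 speeds up training by storing values in just 8 bits, trading precision for throughput and memory. The sketch below simulates round-to-nearest quantization for E4M3 (4 exponent bits, 3 mantissa bits, bias 7), one common FP8 format; it is an illustration only, and the actual accelerator kernels and scaling recipes used by the vendors differ.

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest representable FP8 E4M3 value (illustrative sketch)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    max_normal = 448.0               # largest E4M3 magnitude
    if mag > max_normal:
        return sign * max_normal     # saturate rather than overflow
    exp = math.floor(math.log2(mag))
    exp = max(exp, -6)               # clamp at the minimum normal exponent;
                                     # below that, steps stay at 2**-9 (subnormals)
    step = 2.0 ** (exp - 3)          # 3 mantissa bits => spacing of 2**(exp-3)
    return sign * round(mag / step) * step

print(quantize_e4m3(0.1))    # 0.1015625 -- nearest E4M3 neighbor of 0.1
print(quantize_e4m3(500.0))  # 448.0 -- saturated to the format's maximum
```

The coarse spacing between representable values is why FP8 training typically pairs the format with per-tensor scaling to keep activations and gradients inside the usable range.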

“Some of the speeds and feed numbers here are kind of mind-blowing,” said Nvidia’s Dave Salvator. “In terms of AI compute, it’s over 40 exaflops of AI compute, right, which is just extraordinary.”
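As a sanity check on those figures (my own arithmetic, not from the article), dividing the quoted aggregate by the GPU count gives the implied per-accelerator throughput:

```python
# Implied per-GPU throughput for EOS (assumption: "AI compute" here
# refers to peak low-precision throughput, as is typical in such claims).
total_exaflops = 40
num_gpus = 10_752

per_gpu_pflops = total_exaflops * 1000 / num_gpus  # exaflops -> petaflops
print(f"~{per_gpu_pflops:.1f} PFLOPS per GPU")     # ~3.7 PFLOPS per GPU
```

A figure on the order of a few petaflops per GPU is consistent with peak FP8 rates of current data-center accelerators, so the aggregate number holds together.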

These technological advancements are propelling the AI industry forward, surpassing expectations and setting new standards for performance and capability.
