Researchers at ETH Zurich have developed a technique that could dramatically improve the speed and efficiency of neural networks. By changing how inference is performed, they cut the computation required by a language model's feedforward layers by more than 99%. The technique was demonstrated on BERT, a transformer model used for language tasks, and could in principle also be applied to large language models like GPT-3, opening up new possibilities for faster, more efficient language processing.
Unleashing the Power of Fast Feedforward Layers
Transformers, the neural networks that drive large language models, consist of several kinds of layers, including attention and feedforward layers. The feedforward layers, which account for a large share of the model’s parameters, are computationally demanding because every input must be multiplied by the weights of every neuron. The researchers propose replacing traditional feedforward layers with “fast feedforward” (FFF) layers. FFF uses conditional matrix multiplication (CMM) instead of the dense matrix multiplication (DMM) found in conventional networks. Whereas DMM multiplies every input by the weights of every neuron in the layer, CMM selects only a handful of neurons for each input, substantially reducing the computational load.
“By identifying the right neurons for each computation, FFF can significantly reduce the computational load, leading to faster and more efficient language models.”
ETH Zurich Researchers
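To see how conditional matrix multiplication might play out in practice, here is a minimal, illustrative sketch of a tree-structured fast feedforward layer. It is not the authors’ implementation: the heap-style indexing, ReLU activation, and weight shapes are assumptions chosen to keep the example short. The point it demonstrates is that only the neurons on a single root-to-leaf path are ever evaluated.

```python
import numpy as np

class FastFeedforwardSketch:
    """Illustrative sketch of a fast feedforward (FFF) layer.

    Neurons are arranged in a balanced binary tree. At inference time the
    input descends the tree: the sign of each node's pre-activation decides
    whether to go left or right, so only the handful of neurons on the
    chosen root-to-leaf path are evaluated (conditional matrix
    multiplication) instead of every neuron (dense matrix multiplication).
    """

    def __init__(self, input_dim, output_dim, depth, seed=0):
        rng = np.random.default_rng(seed)
        self.depth = depth
        n_nodes = 2 ** depth - 1  # total neurons stored in the tree
        self.w_in = rng.standard_normal((n_nodes, input_dim)) / np.sqrt(input_dim)
        self.w_out = rng.standard_normal((n_nodes, output_dim)) / np.sqrt(depth)

    def forward(self, x):
        """Evaluate only `depth` of the 2**depth - 1 neurons for this input."""
        y = np.zeros(self.w_out.shape[1])
        node = 0  # root of the tree (heap-style indexing)
        for _ in range(self.depth):
            pre = self.w_in[node] @ x                 # single neuron's pre-activation
            y += max(pre, 0.0) * self.w_out[node]     # ReLU'd contribution of this neuron
            node = 2 * node + (1 if pre > 0 else 2)   # descend left or right
        return y


layer = FastFeedforwardSketch(input_dim=768, output_dim=768, depth=11)
x = np.random.default_rng(1).standard_normal(768)
print(layer.forward(x).shape)  # (768,), computed from 11 of 2047 neurons
```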
To validate their technique, the researchers developed UltraFastBERT, a modified version of Google’s BERT transformer model. They replaced the intermediate feedforward layers with fast feedforward layers whose neurons are arranged in a balanced binary tree. Evaluated on a range of language understanding tasks, UltraFastBERT performed comparably to BERT-base models of similar size and training procedure. Impressively, the best UltraFastBERT model matched the performance of the original BERT model while using only 0.3% of its own feedforward neurons for each inference.
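For intuition about where a figure like 0.3% comes from: in a tree-structured layer, the fraction of neurons used per inference is roughly the path length divided by the total neuron count. The depths below are assumptions for illustration, not figures taken from the paper.

```python
# Fraction of neurons touched per inference when neurons form a balanced
# binary tree and only one root-to-leaf path is evaluated (depths assumed
# for illustration).
for depth in (8, 10, 12):
    total = 2 ** depth - 1   # neurons in the tree
    used = depth             # neurons on a single root-to-leaf path
    print(f"depth {depth:2d}: {used}/{total} = {used / total:.2%}")
# depth 12 gives 12/4095 ≈ 0.29%, the same order as the reported 0.3%
```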
Accelerating Large Language Models
The potential for acceleration in large language models is immense. In GPT-3, for example, each transformer layer contains 49,152 neurons in its feedforward networks. The researchers suggest that replacing these with fast feedforward networks, which would evaluate only 16 neurons per inference, could preserve performance while using just 0.03% of GPT-3’s neurons.
“If trainable, this network could be replaced with a fast feedforward network of maximum depth 15, which would contain 65,536 neurons but use only 16 for inference. This amounts to about 0.03% of GPT-3’s neurons.”
ETH Zurich Researchers
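A quick back-of-the-envelope check of the quoted figures, using only the numbers cited above:

```python
# Sanity check of the figures quoted above for GPT-3's feedforward layers.
dense_neurons_per_layer = 49_152   # feedforward neurons per layer, as cited
neurons_used_per_inference = 16    # neurons a depth-15 FFF would evaluate, as cited
fraction = neurons_used_per_inference / dense_neurons_per_layer
print(f"{fraction:.4%}")           # ≈ 0.0326%, i.e. "about 0.03% of GPT-3's neurons"
```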
While dense matrix multiplication has been heavily optimized on modern hardware, the researchers note that no efficient implementation of conditional matrix multiplication currently exists. They stress the need for device programming interfaces to support conditional neural execution, which they estimate could yield a 341x speedup at the scale of BERT-base models.
“With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces.”
ETH Zurich Researchers
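To make the missing-primitive problem concrete, the sketch below contrasts the two access patterns. The sizes and the way each input picks its neurons are illustrative assumptions, not details from the paper: dense matrix multiplication is a single large matrix product that GPUs execute very efficiently, while conditional matrix multiplication gathers a different handful of weight rows per input, an access pattern for which no optimized device primitive exists today.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, n_neurons, path_len = 32, 768, 4095, 12  # illustrative sizes
x = rng.standard_normal((batch, d_in))
W = rng.standard_normal((n_neurons, d_in))

# Dense matrix multiplication (DMM): one large, highly optimized GEMM in
# which every neuron is applied to every input.
dense_pre = x @ W.T                                    # (batch, n_neurons)

# Conditional matrix multiplication (CMM), emulated: each input gathers its
# own small subset of weight rows (chosen randomly here as a stand-in for
# the tree descent) and computes only path_len dot products per input.
selected = rng.integers(0, n_neurons, size=(batch, path_len))
cond_pre = np.einsum("bd,bkd->bk", x, W[selected])     # (batch, path_len)

print(dense_pre.shape, cond_pre.shape)
print("dot products per input:", n_neurons, "(DMM) vs", path_len, "(CMM)")
```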
This research is a significant step toward addressing the compute bottleneck of large language models. It paves the way for more efficient and powerful AI systems, ultimately enhancing their capabilities and impact.