The Evolution of Language Models: Redesigning the Transformer

Large language models like ChatGPT and Llama-2 are well-known for their high memory requirements and computational demands, making them expensive to run. However, reducing the size of these models, even by a small fraction, can lead to significant cost reductions.

The Redesigned Transformer: A Promising Solution

In response to this challenge, researchers at ETH Zurich have introduced a revised version of the transformer, the deep learning architecture underlying language models. This new design significantly reduces the size of the transformer while maintaining accuracy and increasing inference speed, making it an ideal architecture for creating more efficient language models.

Language models rely on transformer blocks, specialized units capable of processing sequential data such as text passages. Each block contains two important sub-blocks: the “attention mechanism” and the multi-layer perceptron (MLP).

“Given the exorbitant cost of training and deploying large transformer models nowadays, any efficiency gains in the training and inference pipelines for the transformer architecture represent significant potential savings,” write the ETH Zurich researchers.

The attention mechanism acts like a highlighter, selectively focusing on different parts of the input data (e.g., words in a sentence) to capture their context and importance relative to each other. This allows the model to understand the relationships between words, even if they are far apart. After the attention mechanism processes the data, the MLP, a mini neural network, refines and further processes the highlighted information, creating a more sophisticated representation that captures complex relationships.
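The “highlighter” behavior described above is, concretely, scaled dot-product attention: each word’s query is compared against every key, the match scores become weights, and those weights blend the value vectors. A minimal pure-Python sketch (real models use optimized tensor libraries; the function names here are illustrative):

```python
import math

def softmax(scores):
    """Turn raw match scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries/keys/values: lists of d-dimensional vectors (lists of floats).
    Each output is a weighted mix of the value vectors, weighted by how
    strongly the query matches each key -- the "highlighting" step.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Dot-product match scores, scaled by sqrt(d) to keep them moderate.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        outputs.append(mixed)
    return outputs
```

Because the weights depend only on query-key similarity, a word can attend strongly to a related word many positions away, which is how distant relationships are captured.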

In addition to these core components, transformer blocks also include features like “residual connections” and “normalization layers.” These components contribute to faster learning and mitigate common issues in deep neural networks. As transformer blocks stack up to form a language model, the model becomes more adept at discerning complex relationships in the training data, enabling it to perform sophisticated tasks.
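In code terms, a residual connection adds a sub-block’s input back onto its output, and a normalization layer rescales vectors before processing. A minimal sketch of the common “pre-norm” arrangement (a generic illustration, not the ETH Zurich design specifically):

```python
def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / (var + eps) ** 0.5 for xi in x]

def residual_block(x, sublayer):
    """Pre-norm residual wrapper: x + sublayer(layer_norm(x)).

    Adding x back (the skip connection) gives gradients a direct path
    through the block, which is what keeps deep stacks trainable.
    """
    return [xi + yi for xi, yi in zip(x, sublayer(layer_norm(x)))]
```

With this wrapper, even if the sub-layer contributes nothing, the input still passes through unchanged, which is why deep stacks of such blocks remain stable to train.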

Trimming Down the Transformer Block

The conventional transformer block design has remained largely unchanged since its inception. However, the ETH Zurich researchers explored ways to simplify the transformer block without compromising its training speed or performance on tasks.

Standard transformer models consist of multiple attention heads, each with its own set of key (K), query (Q), and value (V) parameters. These parameters map the interactions among input tokens. The researchers found that they could eliminate the value parameters and the subsequent projection layer without sacrificing effectiveness. They also removed the skip connections, traditionally used to address the “vanishing gradients” issue in deep learning models.
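One way to picture removing the value parameters and the projection layer: the attention weights mix the input token vectors directly, rather than first transforming them through a learned value matrix and then projecting the result. The sketch below is an illustration of that idea, not the researchers’ exact formulation; `simplified_attention`, `wq`, and `wk` are hypothetical names:

```python
import math

def softmax(scores):
    """Convert raw scores into weights summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simplified_attention(x, wq, wk):
    """An attention head with the value and projection weights removed.

    x: list of token vectors; wq/wk: weight matrices (lists of rows).
    The value matrix is effectively the identity, so the attention
    weights blend the inputs x directly -- two fewer matrix multiplies
    (and their parameters) per head.
    """
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    d = len(x[0])
    qs = [matvec(wq, t) for t in x]
    ks = [matvec(wk, t) for t in x]
    outputs = []
    for q in qs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
        weights = softmax(scores)
        outputs.append([sum(w * t[j] for w, t in zip(weights, x)) for j in range(d)])
    return outputs
```

Only the query and key weights remain learnable here, which is the source of the parameter savings the researchers report.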

Furthermore, they redesigned the transformer block to process attention heads and the MLP concurrently, rather than sequentially. This parallel processing approach deviates from the conventional architecture. To compensate for the reduction in parameters, the researchers made adjustments to non-learnable parameters, refined the training methodology, and implemented architectural tweaks. These changes allowed the model to maintain its learning capabilities despite the leaner structure.
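The sequential-versus-parallel distinction can be sketched abstractly. In the conventional block the MLP consumes the attention output; in a parallel layout both sub-blocks read the same normalized input, so they can be computed concurrently. This is a generic illustration of the parallel idea (the ETH Zurich design also removes the skip connections and applies further adjustments, which are omitted here):

```python
def sequential_block(x, attn, mlp, norm):
    """Conventional layout: the MLP processes the attention output."""
    h = [xi + ai for xi, ai in zip(x, attn(norm(x)))]
    return [hi + mi for hi, mi in zip(h, mlp(norm(h)))]

def parallel_block(x, attn, mlp, norm):
    """Parallel layout: attention and MLP both read the same input,
    so the two sub-blocks can run at the same time."""
    nx = norm(x)
    a = attn(nx)
    m = mlp(nx)
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]
```

Because `attn(nx)` and `mlp(nx)` no longer depend on each other, hardware can overlap the two computations, which is where the inference speedup comes from.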

The ETH Zurich team evaluated their compact transformer block across language models of varying depths. Their results were significant: they managed to reduce the size of the conventional transformer by approximately 16% without sacrificing accuracy, while also achieving faster inference times.

“Our simplified models are able to not only train faster but also to utilize the extra capacity that more depth provides,” the researchers write.

For example, applying this new architecture to a large model like GPT-3, with its 175 billion parameters, could result in a memory saving of about 50 GB.
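That figure is consistent with a quick back-of-the-envelope calculation, assuming 16-bit (2-byte) parameters; the precision is an assumption here, not stated in the article:

```python
# Rough check of the "about 50 GB" figure for GPT-3.
gpt3_params = 175e9       # 175 billion parameters
reduction = 0.16          # ~16% fewer parameters in the simplified design
bytes_per_param = 2       # assumes fp16/bf16 storage (not stated in the article)

saved_gb = gpt3_params * reduction * bytes_per_param / 1e9
print(round(saved_gb))    # prints 56, in the ballpark of the quoted ~50 GB
```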

While this streamlined approach has been effective on smaller scales, its application to larger models has yet to be tested. Further enhancements, such as tailoring AI processors to this redesigned architecture, could amplify its impact.

“We believe our work can lead to simpler architectures being used in practice, thereby helping to bridge the gap between theory and practice in deep learning, and reducing the cost of large transformer models,” the researchers conclude.
