#103 How Does Google’s GLaM, a Trillion-Weight Language Model, Outperform GPT-3 with Less Computation?
Fresh & Hot curated AI happenings in one snack. Never miss a byte 🍔
This snack byte will take approx 5 minutes to consume.
AI BYTE #1 📢: How Does Google’s GLaM, a Trillion-Weight Language Model, Outperform GPT-3 with Less Computation?
⭐ Language models are powerful tools that can perform a variety of natural language processing tasks, such as reading comprehension, question answering, and text generation, with minimal or no supervision.
However, these models typically require a very large number of parameters and a great deal of computation to achieve high performance, which poses challenges for their training and deployment.
Allow me to introduce GLaM, a Generalist Language Model that achieves competitive results on multiple few-shot learning tasks with less computation and energy than GPT-3, a state-of-the-art dense language model.
GLaM is a sparsely activated model that uses a Mixture-of-Experts (MoE) architecture, which means that it has different sub-models (or experts) that are specialized for different inputs. The experts are controlled by a gating network that selects the most appropriate ones based on the input data.
This way, GLaM can leverage a large number of parameters (1.2 trillion in total) while activating only a small fraction of them (97 billion, roughly 8% of the total) per token prediction.
Compare that with GPT-3: if you ask GPT-3 something, the dense model activates all 175 billion of its parameters for every token of its answer.
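To make the routing idea concrete, here is a minimal, illustrative sketch of top-2 expert selection written in plain Python with NumPy. The toy dimensions, the function and variable names, and the details of the gating math are assumptions for illustration, not GLaM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2                  # toy sizes, far smaller than GLaM's
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d_model, n_experts))       # gating network weights

def moe_layer(token):
    """Route a single token through only its top-k experts."""
    logits = token @ gate_w                          # one gating score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                             # softmax over experts
    chosen = np.argsort(probs)[-top_k:]              # indices of the k highest-scoring experts
    # Only the chosen experts run; the rest stay idle, which is where the compute savings come from.
    mixed = sum(probs[i] * (token @ experts[i]) for i in chosen)
    return mixed / probs[chosen].sum()               # renormalize gate weights over the chosen experts

token = rng.normal(size=d_model)
print(moe_layer(token).shape)                        # -> (8,): same shape as a dense layer's output
```

The key design point is that adding more experts grows the total parameter count without growing the per-token work, because each token still touches only `top_k` of them.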
GLaM is trained on a high-quality dataset of 1.6 trillion tokens, which consists of web pages, books, and Wikipedia articles. The web pages are filtered by a text quality classifier that is trained on Wikipedia and books, which are generally higher quality sources. The dataset covers a wide range of language usage and domains, which enables GLaM to learn general and diverse skills.
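As a rough illustration of that filtering step, here is a toy sketch using scikit-learn. The tiny example corpora, the tf-idf plus logistic-regression pipeline, and the 0.5 threshold are assumptions for illustration, not the classifier Google actually trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy quality filter: curated text (Wikipedia/book style) is the positive class,
# junk web text is the negative class; web pages the model trusts are kept.
curated = [
    "The telescope collects light from distant galaxies and records it for later analysis.",
    "She closed the book slowly, aware that its final chapter would stay with her for years.",
]
junk = [
    "BEST deals click here buy now cheap cheap !!!",
    "lorem ipsum keyword keyword keyword subscribe subscribe",
]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(curated + junk, [1, 1, 0, 0])

web_pages = [
    "The new telescope gathered light for six hours before the faint galaxy became visible.",
    "click here best cheap deals now now now",
]
kept = [page for page in web_pages if classifier.predict_proba([page])[0, 1] > 0.5]
print(kept)
```

The real pipeline works at web scale, but the principle is the same: score pages with a classifier trained on trusted text and favor the ones that score well.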
GLaM is evaluated on 29 public NLP benchmarks in seven categories, including language completion, open-domain question answering, and natural language inference.
The evaluation uses zero-shot and one-shot settings, where the tasks are never seen during training and the model receives either no examples (zero-shot) or a single example (one-shot) at inference time. GLaM outperforms or is on par with GPT-3 on almost 80% of zero-shot tasks and almost 90% of one-shot tasks, while using significantly less computation at inference.
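For readers new to the terminology, the only difference between the two settings is what the prompt shows the model. The template below is a generic illustration, not the exact prompt format used in GLaM's evaluation.

```python
# Zero-shot: the model sees only the task, with no worked example.
# One-shot: the prompt includes a single solved example before the real question.
question = "Which element has the chemical symbol Fe?"

zero_shot_prompt = f"Q: {question}\nA:"

one_shot_prompt = (
    "Q: Which planet is known as the Red Planet?\nA: Mars\n\n"
    f"Q: {question}\nA:"
)

print(zero_shot_prompt)
print("---")
print(one_shot_prompt)
```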
GLaM also shows better learning efficiency: it needs less training data than GPT-3 to reach similar performance.
GLaM is not only effective but also efficient: it uses less power and energy to train than comparable dense models. Thanks to sparsity and the MoE architecture, GLaM can scale up to far more total parameters without the computation cost growing in proportion.
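As a back-of-the-envelope check on the inference claim, per-token compute for transformer-style models scales roughly with the number of parameters that are actually activated. Treating that as an approximation (it is a rule of thumb, not a figure from the paper) gives:

```python
# Rough per-token comparison, assuming compute scales with activated parameters.
glam_active_params = 97e9    # parameters GLaM activates per token
gpt3_active_params = 175e9   # GPT-3 is dense, so every parameter fires for every token

ratio = glam_active_params / gpt3_active_params
print(f"GLaM touches roughly {ratio:.0%} of the parameters GPT-3 does per token")
# -> roughly 55%
```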
GLaM is a breakthrough in language modeling, as it demonstrates that sparsely activated models can achieve competitive performance on few-shot learning tasks with less computation and energy use than dense models.
It also shows the importance of a high-quality dataset for large language models, as it enables them to learn general and diverse skills.
I hope that GLaM will inspire more research into compute-efficient language models that can benefit a wide range of applications.