#104 The Rise of Sparse Expert Models: A New Paradigm for Large-Scale Deep Learning
Fresh & Hot curated AI happenings in one snack. Never miss a byte 🍔
This snack byte will take approx 5 minutes to consume.
AI BYTE #1 📢: The Rise of Sparse Expert Models: A New Paradigm for Large-Scale Deep Learning
⭐ Sparse Expert Models are a new class of deep learning architectures that have emerged in recent years as a powerful and efficient way to train and deploy extremely large neural networks.
Unlike conventional dense models, which use the same set of parameters for every input example, sparse expert models partition the parameters into smaller groups called experts, and dynamically route each input to a subset of relevant experts.
This way, the effective model size can be much larger than the actual computation per example, enabling unprecedented scaling of deep learning models.
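To make this concrete, here is a minimal sketch of a sparse expert layer in PyTorch with top-1 routing; the class and parameter names (SparseMoELayer, d_model, n_experts) and sizes are illustrative assumptions rather than any specific published design, and real systems add expert capacity limits, load balancing, and parallelism across devices. The point it illustrates: all experts' parameters exist, but each token only pays the compute of the single expert it is routed to.

```python
# Minimal sketch of a sparse mixture-of-experts (MoE) layer (illustrative, not
# the exact layer from Switch Transformer / GShard).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8):
        super().__init__()
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The router (gate) scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                          # x: (n_tokens, d_model)
        gate_probs = F.softmax(self.gate(x), dim=-1)
        top_prob, top_idx = gate_probs.max(dim=-1)  # top-1 routing per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                     # tokens routed to expert e
            if mask.any():
                # Scale each expert output by its gate probability.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```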
Sparse expert models have shown remarkable results across various domains, such as natural language processing, computer vision, speech recognition, and multi-modal learning.
For instance, Switch Transformers scaled to a 1.6 trillion parameter model with state-of-the-art pre-training quality on natural language tasks, delivering 4-7x pre-training speedups over comparable dense models with the same computational resources.
Google’s GLaM outperformed the 175-billion-parameter GPT-3 model in zero-shot and one-shot performance, while using 49% fewer FLOPs per token at inference and 65% less power.
ST-MoE achieved state-of-the-art performance in transfer learning across a diverse set of tasks, including reasoning, summarization, question answering, and adversarial tasks.
Sparse expert models are not a new concept, but a thirty-year-old idea reinvigorated by advances in deep learning and distributed systems. The first work to propose a mixture-of-experts (MoE) architecture was Jacobs et al. (1991), which used an ensemble of neural networks and a gating function to select the best expert for each input.
Later works extended this idea to stacked MoE layers for image classification and language modeling. The breakthrough came when Lepikhin et al. and Fedus et al. replaced the feed-forward layers in Transformers with expert layers, creating GShard and the Switch Transformer, respectively.
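A hedged sketch of that substitution is below: a standard pre-norm Transformer block whose dense feed-forward sub-layer is swapped for a sparse expert layer (for example, the SparseMoELayer sketch above). The layer names and layout are illustrative assumptions, not the exact published GShard or Switch Transformer architectures.

```python
# Illustrative Transformer block where the dense FFN is replaced by an expert
# layer; any nn.Module mapping (n_tokens, d_model) -> (n_tokens, d_model) works.
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, expert_layer: nn.Module, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # The only change relative to a dense block: FFN -> expert layer.
        self.moe = expert_layer

    def forward(self, x):                          # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        b, s, d = x.shape
        x = x + self.moe(self.norm2(x).reshape(b * s, d)).reshape(b, s, d)
        return x

# Usage, reusing the SparseMoELayer sketch above:
# block = MoETransformerBlock(SparseMoELayer(d_model=512, n_experts=8))
```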
The key challenge in designing effective sparse expert models is the routing algorithm, which determines how to assign inputs to experts. The routing algorithm should balance several objectives, such as maximizing the utilization of experts, minimizing the communication overhead, and ensuring the diversity and quality of experts.
Several routing algorithms have been proposed, such as top-k routing, hash routing, reinforcement learning routing, and optimal transport routing. Each algorithm has its own advantages and disadvantages, and the optimal choice may depend on the application and the hardware system specifications.
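As a rough illustration of how routing interacts with the load-balancing objective mentioned above, the sketch below implements top-k routing with a Switch/GShard-style auxiliary balancing loss. The helper name top_k_route and the coefficient alpha are hypothetical, and published variants differ in the exact loss form.

```python
# Top-k routing with an auxiliary load-balancing loss (illustrative sketch).
import torch
import torch.nn.functional as F

def top_k_route(logits, k=2, alpha=0.01):
    """logits: (n_tokens, n_experts) raw router scores."""
    n_tokens, n_experts = logits.shape
    probs = F.softmax(logits, dim=-1)
    top_probs, top_idx = probs.topk(k, dim=-1)          # (n_tokens, k)

    # Fraction of tokens whose top-1 choice is each expert.
    top1 = top_idx[:, 0]
    frac_tokens = F.one_hot(top1, n_experts).float().mean(dim=0)
    # Average router probability assigned to each expert.
    frac_probs = probs.mean(dim=0)
    # The loss is smallest when both fractions are uniform (1 / n_experts),
    # i.e. when tokens are spread evenly across experts.
    balance_loss = alpha * n_experts * (frac_tokens * frac_probs).sum()

    return top_idx, top_probs, balance_loss
```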
Sparse expert models are still an active area of research, and there are many open questions and challenges to be addressed.
For example: how to scale the number, size, and frequency of expert layers; how to improve the transferability and calibration of sparse models; how to incorporate domain knowledge and task-specific experts; how to handle multi-modal and multi-task inputs; and how to make sparse models more robust and interpretable.