
Mixture of Experts (MoE) Model Notes

Tags: MoE, Large Model, AI, Paper Reading
Author: Weaxs

Introduction

This article organizes the concepts related to MoE models, summarizes several open-source MoE model papers, and briefly describes their overall architectures.

Concepts

The explanation of MoE concepts below is largely based on the article Mixture of Experts (MoE) Explained.

Transformer and MoE

Let’s review the Transformer architecture first

  • The complete Transformer structure consists of several blocks
  • Each block contains an encoder and a decoder
  • The encoder and decoder each consist of four parts: an attention layer (Attention), a position-wise feed-forward layer (FFN), residual connections (Add), and layer normalization (Norm)

Transformer.png

The Mixture of Experts (MoE) model is based on the Transformer architecture and consists of two key components:

  • Sparse MoE layer: replaces the feed-forward (FFN) layer in the Transformer architecture. The MoE layer contains multiple experts, each of which is an independent neural network, usually a feed-forward network (FFN), though it can also be a more complex structure or even another MoE layer.
  • Gating network or router: determines which expert(s) a token is sent to; a token can be sent to more than one expert. The router consists of learned parameters and is pre-trained together with the rest of the network (a minimal code sketch follows the figure below).

Switch-Transfomer.png
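
The sketch below makes the two components above concrete: a router scores the experts, and each token's output is the weighted sum of the top-k experts it is routed to. This is only an illustrative PyTorch sketch; the class, dimensions, and default values (8 experts, top-2) are my own assumptions and do not come from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative sparse MoE layer: a router plus several FFN experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward network (FFN).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router (gating network) is a learned linear map from token to expert logits.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                              # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep the top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)                                 # 10 tokens, d_model = 64
print(SimpleMoELayer(d_model=64, d_ff=256)(tokens).shape)    # torch.Size([10, 64])
```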

Sparsity

In contrast to a traditional dense Transformer model, in which all parameters are used to process every input, a sparse MoE model computes only parts of the whole system. In other words, depending on specific features or requirements of the input, only some of the parameters are run. This is known as conditional computation.

A 2022 paper points out that the FFNs in Transformer models also exhibit sparse activation: for a single input, only a small number of neurons in the FFN are activated (see MoEfication: Transformer Feed-forward Layers are Mixtures of Experts). To verify this, the authors split the FFN layers of a Transformer into multiple experts, a procedure called MoEfication. It has two main parts:

  • Expert segmentation: split the FFN into multiple functional partitions that act as experts. Two main methods are provided:
      • Parameter clustering: cluster the neuron vectors with the balanced K-Means algorithm (a rough sketch of this idea follows the figure below)
      • Co-activation graph: construct a co-activation graph of the neurons and partition it
  • Expert selection: determines how experts are chosen. The typical MoE gating network is not used here; three options are discussed briefly:
      • Groundtruth selection: a greedy algorithm computes each expert's score, and the expert with the highest score is selected
      • Similarity selection: cosine distance is used to measure similarity, and the most similar expert is selected
      • MLP selection (recommended): train a multi-layer perceptron (MLP) to predict the sum of the activations of each expert's neurons and use that as the score

MoEfication.png
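
As a rough illustration of the parameter-clustering idea, the sketch below groups the intermediate neurons of one FFN into experts by clustering the rows of its first weight matrix. MoEfication uses balanced K-Means so every expert receives the same number of neurons; the plain scikit-learn KMeans used here is a simplification, and the function name and shapes are my own.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_ffn_into_experts(w_in: np.ndarray, w_out: np.ndarray, num_experts: int):
    """Group FFN intermediate neurons into experts by clustering their input weights.

    w_in:  (d_ff, d_model), first FFN matrix, one row per intermediate neuron
    w_out: (d_model, d_ff), second FFN matrix, one column per intermediate neuron
    Returns one (w_in_part, w_out_part) pair per expert.
    Note: MoEfication uses balanced K-Means so all experts get equally many
    neurons; plain KMeans is used here purely as a simplification.
    """
    labels = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit_predict(w_in)
    experts = []
    for e in range(num_experts):
        neuron_ids = np.where(labels == e)[0]
        experts.append((w_in[neuron_ids, :], w_out[:, neuron_ids]))
    return experts

# Toy example: d_model = 16, d_ff = 64, split into 4 experts.
rng = np.random.default_rng(0)
parts = split_ffn_into_experts(rng.standard_normal((64, 16)), rng.standard_normal((16, 64)), 4)
print([p[0].shape[0] for p in parts])   # neurons per expert (not necessarily equal here)
```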

Gating networks

One thing to note is that the experts need to be load-balanced, otherwise tokens are distributed unevenly and resources are used inefficiently. Routing itself is handled by a learnable gating network (G) that determines which experts (E) each input is sent to:

$$ y = \sum^n_{i=1} G(x)_i E_i(x) $$

A typical learnable gating network (G) is a network ending in a \(softmax\) function, which returns a probability for each output class, i.e. for each expert. By default, \(softmax\) assigns a nonzero probability to every expert, so some tunable noise can be added and only the top k values kept, as follows (from Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer):

$$ H(x)_i = (xW_g)_i + StandardNormal() * SoftPlus((xW_{noise})_i) $$

$$ G(x) = Softmax(KeepTopK(H(x), k)) $$
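
A minimal PyTorch sketch of this noisy top-k gate, directly following the two formulas above (module and variable names are mine, not from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating: add tunable Gaussian noise to the gate logits,
    keep only the k largest entries, and apply softmax over them."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean = self.w_gate(x)                                      # (x W_g)_i
        noise_std = F.softplus(self.w_noise(x))                     # SoftPlus((x W_noise)_i)
        h = clean + torch.randn_like(clean) * noise_std             # H(x)_i with StandardNormal()
        # KeepTopK: everything outside the top k becomes -inf, so softmax maps it to 0.
        topk_vals, topk_idx = h.topk(self.k, dim=-1)
        kept = torch.full_like(h, float("-inf")).scatter(-1, topk_idx, topk_vals)
        return F.softmax(kept, dim=-1)                              # G(x), sparse gate weights

gate = NoisyTopKGate(d_model=64, num_experts=8, k=2)
print(gate(torch.randn(4, 64)))   # each row: exactly 2 nonzero weights summing to 1
```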

Finally, note that during MoE training the gating network tends to keep activating the same few experts, and this is self-reinforcing: the favored experts are trained faster and are therefore selected even more often. To mitigate this, an auxiliary loss is introduced to encourage all experts to receive roughly the same number of training examples. In the transformers library, this auxiliary loss can be controlled via the aux_loss parameter.
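
One common form of the auxiliary loss is the load-balancing loss from Switch Transformers: for each expert, multiply the fraction of tokens dispatched to it by the average router probability it receives, sum over experts, and scale by the number of experts. The sketch below is a hedged illustration of that formula, not the transformers library's implementation:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, expert_indices: torch.Tensor) -> torch.Tensor:
    """Switch-Transformers-style auxiliary loss (a sketch, not library code).

    router_logits:  (num_tokens, num_experts) raw gate logits
    expert_indices: (num_tokens,) expert chosen for each token
    Loss = num_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    dispatched to expert i and P_i is its mean router probability; the loss
    is minimized when routing is uniform across experts.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    f = F.one_hot(expert_indices, num_experts).float().mean(dim=0)  # f_i
    p = probs.mean(dim=0)                                           # P_i
    return num_experts * torch.sum(f * p)

logits = torch.randn(32, 8)                                  # 32 tokens, 8 experts
print(load_balancing_loss(logits, logits.argmax(dim=-1)))    # values near 1.0 mean roughly balanced
```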

Sparse MoE models are better suited to high-throughput scenarios with many machines; conversely, dense Transformer models are better suited to low-throughput scenarios with limited memory.

GShard: Top-2 gating

Google used the MoE architecture to scale model parameters up to 600 billion (see GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding).

GShard replaces every other feed-forward (FFN) layer in the encoder and decoder with an MoE layer that uses a top-2 gating network. In short, after a token enters the MoE layer, it is processed by the two experts selected by the gating network; each expert is itself a dense feed-forward network within the Transformer architecture.

GShard.png

GShard also introduces the following key changes:

  • Random routing: in the top-2 selection, the first expert is always the highest-ranked one, while the second expert is chosen randomly with probability proportional to its gating weight
  • Expert capacity: a threshold defines how many tokens one expert can handle. If both chosen experts are already at capacity, the token overflows: it is passed to the next layer through the residual connection, or in some cases dropped entirely. Expert capacity therefore limits how many tokens each expert can actually process (see the sketch after this list)
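
To illustrate expert capacity, the sketch below dispatches each token to a single expert (GShard actually uses top-2; one expert keeps the example short) and marks as overflow any token that arrives after its expert is full; overflowed tokens would simply skip the MoE layer via the residual connection. This is a conceptual illustration with made-up names, not GShard's dispatch code.

```python
import torch

def dispatch_with_capacity(expert_indices: torch.Tensor, num_experts: int, capacity: int):
    """Mark which tokens fit within each expert's capacity (conceptual sketch).

    expert_indices: (num_tokens,) expert chosen for each token.
    Returns a boolean mask: True if the token is processed by its expert,
    False if it overflows and falls back to the residual connection.
    """
    counts = torch.zeros(num_experts, dtype=torch.long)
    keep = torch.zeros_like(expert_indices, dtype=torch.bool)
    for t, e in enumerate(expert_indices.tolist()):   # tokens are handled in order
        if counts[e] < capacity:
            counts[e] += 1
            keep[t] = True
    return keep

# 16 tokens, 4 experts, each expert can hold at most 3 tokens.
idx = torch.randint(0, 4, (16,))
keep = dispatch_with_capacity(idx, num_experts=4, capacity=3)
print(idx.tolist())
print(keep.tolist())   # tokens beyond an expert's capacity overflow (False)
```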

Switch Transformers: single expert and expert capacity

The Switch Transformer open-sourced on Hugging Face is an MoE model modified from google/flan-t5-large, with 2048 experts, about 1.6 trillion parameters, and an expert capacity of 64. Each expert in Switch Transformer is a standard FFN, so the total number of parameters is roughly 2048 times that of a standard dense Transformer.

(See Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity; model: google/switch-c-2048.)

Switch-Transfomer.png

Unlike GShard, Switch Transformers simplifies the classic top-k gate and adopts a single-expert strategy: each token is processed by only one expert. This ① reduces the computation of the gating network, ② at least halves each expert's batch size, ③ reduces communication costs, and ④ preserves model quality.

Switch Transformers also studied expert capacity and recommends setting the capacity so that the tokens in a batch are spread evenly across the experts, scaled by a capacity factor (note: Switch Transformers performs well at low capacity factors, such as 1 to 1.25):

$$ \text{Expert Capacity} = \left(\frac{\text{tokens per batch}}{\text{number of experts}}\right) \times \text{capacity factor} $$

Taking the open-source model as an example: with an expert capacity of 64, 2048 experts, and a capacity factor of 1, each batch contains 131072 tokens.
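
Plugging these numbers back into the capacity formula as a quick check:

$$ 64 = \frac{131072}{2048} \times 1 \quad\Longrightarrow\quad \text{tokens per batch} = 64 \times 2048 = 131072 $$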

DeepSeek-MoE: Shared Experts

This section looks at DeepSeek-MoE, an open-source MoE model from China. The model released on Hugging Face has 16.4 billion parameters in total, with 64 routed experts and 2 shared experts; each token is routed to 6 experts. (See DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models; model: deepseek-ai/deepseek-moe-16b-base; fine-tuning code: https://github.com/deepseek-ai/DeepSeek-MoE)

DeepSeek-MoE.png

DeepSeek-MoE introduces the concepts of “fine-grained/vertical experts” and “shared experts”.

“Fine-grained/vertical experts” are obtained by splitting each FFN into \(m\) smaller experts (fine-grained expert segmentation). Although each expert has fewer parameters, the specialization of the experts improves.

“Shared experts” hold more generalized, common knowledge, which reduces knowledge redundancy among the fine-grained experts. The number of shared experts is fixed, and they are always active.
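
A minimal sketch of how shared and routed (fine-grained) experts can be combined in one forward pass: the shared experts run on every token, while a router picks the top-k fine-grained experts per token, and the outputs are summed. This only illustrates the idea; it is not DeepSeek-MoE's implementation, and the layer sizes, expert counts, and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPlusRoutedMoE(nn.Module):
    """Illustrative MoE layer: always-active shared experts plus top-k routed experts."""

    def __init__(self, d_model: int = 64, d_ff: int = 128,
                 num_routed: int = 8, num_shared: int = 2, top_k: int = 2):
        super().__init__()

        def make_ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

        self.shared = nn.ModuleList([make_ffn() for _ in range(num_shared)])  # always active
        self.routed = nn.ModuleList([make_ffn() for _ in range(num_routed)])  # selected per token
        self.router = nn.Linear(d_model, num_routed)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shared experts: every token passes through all of them.
        shared_out = sum(expert(x) for expert in self.shared)
        # Routed (fine-grained) experts: each token goes to its top-k experts only.
        gate = F.softmax(self.router(x), dim=-1)
        weights, indices = gate.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = indices[:, slot] == e
                if mask.any():
                    routed_out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return shared_out + routed_out

print(SharedPlusRoutedMoE()(torch.randn(5, 64)).shape)   # torch.Size([5, 64])
```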

LLaMA-MoE: Lightweight and configurable

LLaMA-MoE-v1 is an MoE model based on LLaMA2. Similar to MoEfication, LLaMA-MoE-v1 splits the feed-forward (FFN) layers of the LLaMA2 model into an MoE with multiple experts, so each expert in LLaMA-MoE-v1 has fewer parameters than the experts in other MoE models. (See [LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training](https://github.com/pjlab-sys4nlp/llama-moe/blob/main/docs/LLaMA_MoE.pdf); model: llama-moe; code: https://github.com/pjlab-sys4nlp/llama-moe)

For expert segmentation, several methods are provided: Random, Clustering, Co-activation Graph, Gradient, etc.

For expert routing, i.e. the gating network, LLaMA-MoE-v1 implements both the classic noisy top-k gate and the single-expert gate proposed in Switch Transformers.

LLaMA-MoE.png

References

Mixture of Experts (MoE) Explained

Learning Factored Representations in a Deep Mixture of Experts

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training