
The emergence of Mixture of Experts (MoE) in Large Language Models offers a promising trade-off: significantly increased model capacity and parameter count, with only a modest increase in computational cost. This efficiency stems from the fact that only a small subset of the model's parameters (i.e., a few experts) are activated for each input token. However, a major challenge arises: certain experts tend to be selected much more frequently than others. This imbalance leads to inefficiencies, particularly when the experts are distributed across different devices and expected to operate in parallel.
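To make the imbalance concrete, here is a minimal sketch (not the paper's code) of conventional top-k routing. The function name `topk_route` and the skewed router logits are illustrative assumptions; the point is that with an unconstrained router, a few experts can end up receiving most of the tokens.

```python
# Minimal sketch of standard top-k MoE routing (illustrative, not from the paper).
import torch

def topk_route(router_logits: torch.Tensor, k: int = 2):
    """Pick the top-k experts per token from unconstrained router logits."""
    probs = torch.softmax(router_logits, dim=-1)            # (tokens, num_experts)
    weights, expert_ids = torch.topk(probs, k, dim=-1)      # per-token expert choices
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize gate weights
    return weights, expert_ids

torch.manual_seed(0)
num_tokens, num_experts = 1024, 8
# A skewed router (e.g., early in training) concentrates traffic on a few experts.
logits = torch.randn(num_tokens, num_experts) + torch.tensor([3., 2., 0., 0., 0., 0., 0., 0.])
_, expert_ids = topk_route(logits)
load = torch.bincount(expert_ids.flatten(), minlength=num_experts)
print("tokens per expert:", load.tolist())  # a couple of experts dominate the load
```

When experts live on different devices, the over-selected ones become stragglers while the rest sit idle, which is the inefficiency described above.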

To address this issue, we propose Mixture of Grouped Experts (MoGE). Unlike traditional MoE, MoGE introduces expert grouping during the selection process, which naturally results in a more balanced workload distribution across experts. This design improves overall system efficiency while maintaining the benefits of sparse activation and high model capacity.
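The sketch below illustrates the grouped-selection idea in spirit only; it is an assumption-laden toy, not the authors' implementation. Experts are partitioned into equal-sized groups (e.g., one group per device), and each token selects the same number of experts from every group, so per-group load is balanced by construction.

```python
# Illustrative sketch of grouped top-k routing in the spirit of MoGE (not the authors' code).
import torch

def grouped_topk_route(router_logits: torch.Tensor, num_groups: int, k_per_group: int = 1):
    tokens, num_experts = router_logits.shape
    assert num_experts % num_groups == 0
    group_size = num_experts // num_groups
    probs = torch.softmax(router_logits, dim=-1)
    # Reshape to (tokens, groups, experts_per_group) and take top-k inside each group.
    grouped = probs.view(tokens, num_groups, group_size)
    weights, local_ids = torch.topk(grouped, k_per_group, dim=-1)
    # Map local (within-group) expert indices back to global expert ids.
    offsets = (torch.arange(num_groups) * group_size).view(1, num_groups, 1)
    expert_ids = (local_ids + offsets).flatten(1)
    weights = weights.flatten(1)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, expert_ids

torch.manual_seed(0)
logits = torch.randn(1024, 8) + torch.tensor([3., 2., 0., 0., 0., 0., 0., 0.])
_, expert_ids = grouped_topk_route(logits, num_groups=4, k_per_group=1)
load = torch.bincount(expert_ids.flatten(), minlength=8)
print("tokens per expert:", load.tolist())  # each group of 2 experts receives exactly 1024 tokens
```

Because every token contributes the same number of selections to every group, the work assigned to each group (and hence each device, if groups map to devices) is identical regardless of how skewed the router is; any remaining imbalance is confined to experts within a group.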
