The emergence of Mixture of Experts (MoE) in Large Language Models promises a much larger model parameter count and learning capacity at only a small additional execution cost, because only a small fraction of the parameters are activated for each input token. However, it is commonly observed that some experts are activated far more often than others, leading to system inefficiency when the experts are run in parallel on different devices. Therefore, we introduce Mixture of Grouped Experts (MoGE), which groups the experts during selection and, by design, balances the expert workload better than conventional MoE.
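To make the grouped selection concrete, below is a minimal PyTorch sketch of grouped top-k routing: the experts are partitioned into equal groups (e.g. one group per device), and the router picks the same number of experts within every group for each token, so each device receives an equal share of the activated experts. The function name, expert layout, group count, and the softmax normalization here are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of grouped top-k routing, assuming experts are laid out in
# contiguous blocks of equal size (one block per group/device). Details such as
# score normalization are illustrative, not the paper's exact design.
import torch


def grouped_topk_routing(router_logits: torch.Tensor,
                         num_groups: int,
                         k_per_group: int) -> torch.Tensor:
    """Select k experts within each group for every token.

    router_logits: (num_tokens, num_experts), with consecutive blocks of
    num_experts // num_groups experts belonging to the same group.
    Returns a (num_tokens, num_experts) tensor of routing weights.
    """
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_groups

    # View logits as (tokens, groups, experts_in_group) and take top-k
    # independently inside every group, so each group (device) gets the
    # same number of activated experts per token.
    grouped = router_logits.view(num_tokens, num_groups, experts_per_group)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)

    # Normalize the selected scores into routing weights (assumed choice).
    weights = torch.softmax(topk_vals, dim=-1)

    # Scatter the weights back into a dense (tokens, experts) tensor.
    routing = torch.zeros_like(grouped)
    routing.scatter_(-1, topk_idx, weights)
    return routing.view(num_tokens, num_experts)


# Example: 64 experts in 8 groups, 1 expert activated per group per token,
# so every group carries exactly the same per-token load.
logits = torch.randn(4, 64)
routing = grouped_topk_routing(logits, num_groups=8, k_per_group=1)
assert routing.view(4, 8, 8).count_nonzero(dim=-1).eq(1).all()
```

In contrast, a standard top-k router scores all experts jointly, so nothing prevents most of the selected experts from landing in the same group; the per-group top-k above is what yields the balanced workload described in the abstract.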