arxiv:2403.18926

Enhancing Efficiency in Sparse Models with Sparser Selection

Published on Feb 27, 2024

Authors:

Yuanhang Yang

Abstract

Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency, because a significant number of parameters are unnecessarily involved in computations by multiplying values by zero or by low activation values. To address this issue, we present XMoE, a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models. XMoE leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that XMoE can enhance model performance, and that it can decrease the computation load at MoE layers by over 50% without sacrificing performance. Furthermore, we demonstrate the versatility of XMoE by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://anonymous.4open.science/r/XMoE.
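
The abstract does not spell out the routing rule, so the following is only a minimal sketch of what a threshold-based router could look like, not the authors' implementation. It assumes that each token gets a softmax distribution over experts and that experts are added in descending probability until their cumulative probability reaches a threshold. The names `softmax`, `threshold_route`, `scores`, and `threshold` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_route(scores, threshold):
    """Pick experts for one token (assumed rule, not taken from the paper).

    Experts are visited from most to least probable and added until the
    cumulative routing probability reaches `threshold`. A confident token
    therefore engages few experts; an uncertain token engages more.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = []
    cumulative = 0.0
    for expert in order:
        chosen.append(expert)
        cumulative += probs[expert]
        if cumulative >= threshold:
            break
    return chosen


# A token with one dominant expert versus a token with a flat distribution.
confident = threshold_route([10.0, 0.0, 0.0, 0.0], threshold=0.9)
uncertain = threshold_route([0.0, 0.0, 0.0, 0.0], threshold=0.9)
print(len(confident), len(uncertain))
```

Under this assumed rule, the number of experts a token activates varies with how concentrated its routing distribution is, which is one way a router could let tokens engage only the parameters they need.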
