SparseLLM
/

prosparse-llama-2-7b

@@ -19,7 +19,7 @@ The utilization of activation sparsity, namely the existence of considerable wea
 Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.
-In this work, we introduce a lossless sparsification method named "ProSparse" to push for higher activation sparsity without performance degradation. By applying ProSparse to Swish-activated LLaMA2-7B and LLaMA2-13B, we obtain ReLU-activated LLaMA2 models with high sparsity of 89.32% and 88.80% respectively while their performance is comparable to the original version. Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).
 ### Training Dataset
@@ -97,7 +97,7 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProS
 ### Inference Acceleration Effects
-First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs).
 Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:

 Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.
+In this work, we introduce an effective sparsification method named "ProSparse" to push for higher activation sparsity without performance degradation. By applying ProSparse to Swish-activated LLaMA2-7B and LLaMA2-13B, we obtain ReLU-activated LLaMA2 models with high sparsity of 89.32% and 88.80% respectively while their performance is comparable to the original version. Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).
 ### Training Dataset
 ### Inference Acceleration Effects
+First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs). The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf)([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-gguf)) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor)([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-predictor)) respectively.
 Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN: