---
datasets:
- ulab-ai/FusionBench
---
# Fusing LLM Capabilities with Routing Data
<p align="center">
<a href="https://ulab-uiuc.github.io/FusionFactory/">
<img alt="Project Page" src="https://img.shields.io/badge/Project-Page-blue">
</a>
<a href="http://arxiv.org/abs/2507.10540">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2507.10540-red?logo=arxiv">
</a>
<a href="https://github.com/ulab-uiuc/FusionFactory/blob/master/LICENSE">
<img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
</a>
<br>
<a href="https://github.com/ulab-uiuc/FusionFactory">
<img alt="Stars" src="https://img.shields.io/github/stars/ulab-uiuc/FusionFactory">
</a>
<a href="https://github.com/ulab-uiuc/FusionFactory">
<img alt="Forks" src="https://img.shields.io/github/forks/ulab-uiuc/FusionFactory">
</a>
<a href="https://github.com/ulab-uiuc/FusionFactory">
<img alt="Issues" src="https://img.shields.io/github/issues/ulab-uiuc/FusionFactory">
</a>
</p>
<p align="center">
<a href="https://ulab-uiuc.github.io/FusionFactory/">🌐 Project Page</a> |
<a href="http://arxiv.org/abs/2507.10540">📜 arXiv</a> |
<a href="https://huggingface.co/datasets/ulab-ai/FusionBench">📂 Dataset</a> |
<a href="https://huggingface.co/ulab-ai/FusionFactory">🤖 Model</a> |
<a href="https://huggingface.co/spaces/ulab-ai/RoutePilot">🖥️ Demo</a>
</p>
<div align="center">
<img src="./figures/fusion.jpg" width="700" alt="FusionBench">
<p><b>Overview of LLM capability fusion via FusionFactory with three representative levels: Query-level, Thought-level, and Model-level.</b></p>
</div>
## News
**[2025.06]** 🌟 **FusionFactory** was released.
## 🛠️ Environment Setup
```bash
conda create -n fusionfactory python=3.9
conda activate fusionfactory
pip install pandas datasets tqdm transformers sentence_transformers torch numpy
```
## 🎯 Data Process
Run the following command to start data collection.
```bash
# split: train OR test
# case_num: 500 for train & 50 for partial test
# sample LLM description file: ./data_process/LLM_Descriptions.json
python data_process/data_combine.py \
--split train \
--case_num 500 \
--round 5 \
--llm_description_path [YOUR_LLM_PATH] \
--csv_save_path [YOUR_SAVE_PATH] \
--api_base [YOUR_API_BASE] \
--api_key [YOUR_API_KEY]
```
You may refer to the specific README in the [`data_process`](data_process/README.md) directory for detailed argument descriptions.
To add quality scores to the collected data using an LLM judge:
```bash
python data_process/add_llm_judge.py
```
This will evaluate each response and add quality scores to the dataset, which can be used for training and evaluation purposes. See the [`data_process/README.md`](data_process/README.md) for more details.
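If you would rather skip local collection, the released routing data can also be loaded directly from the Hugging Face Hub:
```python
from datasets import load_dataset

# Download the released FusionBench routing data from the Hugging Face Hub;
# printing the result shows the available splits and column names.
ds = load_dataset("ulab-ai/FusionBench")
print(ds)
```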
## 📊 Experiments
### Query-level Fusion
First, run the data preprocessing script to prepare the dataset:
```bash
# Preprocess the dataset and generate training/testing files
python query_level/data_processing.py
```
For more detailed information about the data preprocessing and model training process, please refer to the specific README in the [`query_level`](query_level/README.md) directory.
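As an illustration of the idea behind query-level fusion, the sketch below routes each query to the LLM whose description it most resembles in embedding space. It uses the `sentence_transformers` package from the environment setup; the model names, descriptions, and similarity heuristic are illustrative placeholders, not the trained router from this repository.
```python
from sentence_transformers import SentenceTransformer, util

# Placeholder LLM descriptions; in practice these come from
# LLM_Descriptions.json, and a trained router replaces this heuristic.
llm_descriptions = {
    "math-llm": "Strong at mathematical reasoning and arithmetic.",
    "code-llm": "Strong at code generation and debugging.",
}
encoder = SentenceTransformer("all-MiniLM-L6-v2")
desc_emb = encoder.encode(list(llm_descriptions.values()), convert_to_tensor=True)

def route(query: str) -> str:
    """Return the name of the LLM whose description best matches the query."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, desc_emb)[0]
    return list(llm_descriptions)[int(scores.argmax())]

print(route("Write a Python function that reverses a linked list."))  # code-llm
```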
### Thought-level Fusion
First, run the data preprocessing script to prepare the thought prompts:
```bash
# Preprocess the dataset and generate training/testing files
python query_level/data_processing.py
```
Alternatively, run the following script to generate thought-enhanced queries directly from the Hugging Face dataset:
```bash
python thought_level/get_thought_prompt.py
```
For more detailed information about the data preprocessing and model training process, please refer to the specific README in the [`thought_level`](thought_level/README.md) directory.
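Conceptually, thought-level fusion prepends reasoning ("thoughts") distilled from high-performing LLMs to each query before it is answered. Below is a minimal sketch of that prompt construction; the template is illustrative, not the exact format produced by `get_thought_prompt.py`.
```python
def build_thought_prompt(query: str, thoughts: list[str]) -> str:
    """Prepend distilled reasoning from other LLMs to the original query.

    In practice `thoughts` would come from top-performing models on
    similar queries; here they are plain strings for illustration.
    """
    thought_block = "\n".join(f"- {t}" for t in thoughts)
    return (
        "Here are some helpful thoughts from other models:\n"
        f"{thought_block}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_thought_prompt(
    "What is 17 * 24?",
    ["Break the product into 17 * 20 + 17 * 4."],
))
```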
### Model-level Fusion
You can refer to [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for detailed instructions on fine-tuning with model-level fusion data. First clone the LLaMA-Factory repository into the FusionBench directory, then execute the following commands to generate SFT data for model-level fusion:
```bash
# setting: perf, judge, hybrid, baseline
python model_level/sft_data_gen.py --setting perf --k 5 --save_path [YOUR_PATH] --csv_path_with_judge [YOUR_PATH]
python model_level/sft_test_gen.py --save_path [YOUR_PATH] --csv_path [YOUR_PATH]
```
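LLaMA-Factory consumes SFT data in an Alpaca-style JSON format and requires each dataset to be registered in its `data/dataset_info.json`. A hedged sketch of what a generated record might look like (the exact fields written by `sft_data_gen.py` may differ; inspect its output):
```python
import json

# Hypothetical Alpaca-style SFT records for LLaMA-Factory; the actual
# schema emitted by sft_data_gen.py may differ.
records = [
    {
        "instruction": "Answer the question concisely.",
        "input": "What is the capital of France?",
        "output": "Paris.",
    }
]
with open("router_sft.json", "w") as f:
    json.dump(records, f, indent=2)
# Register router_sft.json in LLaMA-Factory's data/dataset_info.json
# before launching llamafactory-cli train.
```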
Then, after completing the essential configuration described in the [LLaMA-Factory documentation](https://llamafactory.readthedocs.io/en/latest/), use the following commands to run SFT and inference:
```bash
# SFT
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=2,3,4,5 llamafactory-cli train examples/train_lora/[YOUR_YAML].yaml
# Inference
CUDA_VISIBLE_DEVICES=2,3,4,5 python scripts/vllm_infer.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --adapter_name_or_path saves/llama3.1-8b/lora/[YOUR_PATH] --dataset router_test --cutoff_len 2048
```
You may refer to the specific README in the [`model_level`](model_level/README.md) directory for detailed instructions.
## 📈 Evaluation
FusionBench provides a comprehensive evaluation framework to assess model performance across a range of task types, including:
- Mathematical Reasoning (GSM8K, MATH)
- Code Generation (MBPP, HumanEval)
- Commonsense Reasoning (CommonsenseQA, OpenBookQA, ARC Challenge, HellaSwag)
- World Knowledge (Natural Questions, TriviaQA)
- Reading Comprehension (SQuAD, BoolQ)
- Popular Benchmarks (MMLU, GPQA)
To evaluate your model's performance:
```bash
python eval/response_eval.py
```
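For intuition, a normalized exact-match score of the kind commonly used for QA-style tasks looks like the sketch below; the repository's actual metrics live in `eval/response_eval.py`.
```python
import re

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match between answers."""
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(pred) == norm(gold)

preds, golds = ["Paris", "  42 "], ["paris", "42"]
accuracy = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(preds)
print(f"Exact-match accuracy: {accuracy:.2%}")
```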
For detailed information about the evaluation framework, supported metrics, and usage instructions, please refer to the [Evaluation Documentation](eval/README.md).
## Citation
```bibtex
@article{FusionFactory,
title={Fusing LLM Capabilities with Routing Data},
author={Tao Feng and Haozhen Zhang and Zijie Lei and Pengrui Han and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jiaxuan You},
journal={arXiv preprint arXiv:2507.10540},
year={2025}
}
```