metadata

datasets:
  - ulab-ai/FusionBench

Fusing LLM Capabilities with Routing Data

🌐 Project Page | 📜 arXiv | 📂 Dataset | 🤖 Model | 🖥️ Demo

Overview of LLM capability fusion via FusionFactory with three representative levels: Query-level, Thought-level, and Model-level.

News

[2025.06] 🌟 FusionFactory was released.

🛠️Environment Setup

conda create -n fusionfactory python=3.9
conda activate fusionfactory
pip install pandas
pip install datasets
pip install tqdm
pip install transformers
pip install sentence_transformers
pip install torch
pip install numpy

🎯Data Process

Run the following command to start data collection.

# split: train OR test
# case num: 500 for train & 50 for partial test
# a sample of LLM description: ./data_process/LLM_Descriptions.json
python data_process/data_combine.py \
--split train \
--case_num 500 \
--round 5 \
--llm_description_path [YOUR_LLM_PATH] \
--csv_save_path [YOUR_SAVE_PATH] \
--api_base [YOUR_API_BASE] \
--api_key [YOUR_API_KEY]

You may refer to the specific README in the data_process directory for detailed argument descriptions.

To add quality scores to the collected data using an LLM judge:

python data_process/add_llm_judge.py

This will evaluate each response and add quality scores to the dataset, which can be used for training and evaluation purposes. See the data_process/README.md for more details.

📊Experiments

Query-level Fusion

First, run the data preprocessing script to prepare the dataset:

# Preprocess the dataset and generate training/testing files
python query_level/data_processing.py

For more detailed information about the data preprocessing and model training process, please refer to the specific README in the query_level directory.

Thought-level Fusion

First, run the data preprocessing script to prepare the thought prompts:

# Preprocess the dataset and generate training/testing files
python query_level/data_processing.py

Or run the script to directly use Huggingface datasets to generate thought-enhanced queries

python thought_level/get_thought_prompt.py

For more detailed information about the data preprocessing and model training process, please refer to the specific README in the thought_level directory.

Model-level Fusion

You can refer to LLaMA-Factory for detailed instructions to start fine-tuning on model-level fusion data. Make sure to first clone the LLaMA-Factory repository into the FusionBench directory, and then execute the following commands to generate SFT data for model-level fusion:

# setting: perf, judge, hybrid, baseline
python model_level/sft_data_gen.py --settin perf --k 5 --save_path [YOUR_PATH] --csv_path_with_judge [YOUR_PATH]

python model_level/sft_test_gen.py --save_path [YOUR_PATH] --csv_path [YOUR_PATH]

Then, you can use the following commands to start SFT and Inference after essential configuration described in LLaMA-Factory Doc

# SFT
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=2,3,4,5 llamafactory-cli train examples/train_lora/[YOUR_YAML].yaml

# Inference
CUDA_VISIBLE_DEVICES=2,3,4,5 python scripts/vllm_infer.py --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --adapter_name_or_path saves/llama3.1-8b/lora/[YOUR_PATH] --dataset router_test --cutoff_len 2048

You may refer to the specific README in the model_level directory for detailed instructions.

📈 Evaluation

FusionBench provides a comprehensive evaluation framework to assess model performance across various tasks. The evaluation framework supports multiple types of tasks including:

Mathematical Reasoning (GSM8K, MATH)
Code Generation (MBPP, HumanEval)
Commonsense Reasoning (CommonsenseQA, OpenBookQA, ARC Challenge, HellaSwag)
World Knowledge (Natural Questions, TriviaQA)
Reading Comprehension (SQuAD, BoolQ)
Popular Benchmarks (MMLU, GPQA)

To evaluate your model's performance:

python eval/response_eval.py

For detailed information about the evaluation framework, supported metrics, and usage instructions, please refer to the Evaluation Documentation.

Citation

@article{FusionFactory,
  title={Fusing LLM Capabilities with Routing Data},
  author={Tao Feng and Haozhen Zhang and Zijie Lei and Pengrui Han and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro and Jiaxuan You},
  journal={arXiv preprint arXiv:2507.10540},
  year={2025}
}