---
license: cc-by-4.0
library_name: transformers
pipeline_tag: text-generation
datasets:
- yongchao98/SymBench
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- Symbolic
- Code
- Text
---

# CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
<img src="./Figures/Tag.png" width="650px" alt="s" />

This repository contains the code, models, and datasets for the following papers:

- [CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance](https://arxiv.org/abs/2502.04350)  
- [Steering Large Language Models between Code Execution and Textual Reasoning (ICLR'2025)](https://arxiv.org/pdf/2410.03524)

Project page: https://github.com/yongchao98/CodeSteer-v1.0/

[Code](https://github.com/yongchao98/CodeSteer-v1.0) &emsp;&emsp; 
[Huggingface🤗](https://huggingface.co/yongchao98/CodeSteer-v1) &emsp;&emsp;
[Model Weights](https://drive.google.com/drive/folders/1qb_rec6f8rMYtFKm0eQpad0L0uHCwgpL?usp=share_link)  
[SymBench🤗](https://huggingface.co/datasets/yongchao98/SymBench) &emsp;&emsp;
[Finetune Datasets](https://drive.google.com/drive/folders/1Byn-99gFd5ckRkPMJ8-zagzW7XDfO8ie?usp=share_link) &emsp;&emsp;
[SymBench Datasets](https://github.com/yongchao98/CodeSteer-v1.0/tree/main/dataset_gather) &emsp;&emsp;
[SymBench Synthesis Scripts](https://github.com/yongchao98/CodeSteer-v1.0/tree/main/benchmark)

## Contents

- [Framework](#Framework)
- [Inspirations](#Inspirations)
- [Performance](#Performance)
- [Environment_Setup](#Environment_Setup)
- [LLM_API_Key_Setup](#LLM_API_Key_Setup)
- [Train_and_Test_Models](#Train_and_Test_Models)
- [Assistance](#Assistance)
- [Citation](#Citation)

## Framework
<img src="./Figures/CodeSteer-intro.png" width="800px" alt="s" />

<p align="center" style="font-size: 16px;">
Figure: CodeSteer guides LLM code/text generation to integrate symbolic computing. At each interaction with the TaskLLM, it reviews the current and previous answers, then provides guidance for the next round.
</p>
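
The interaction pattern in the figure can be summarized as a simple loop: the TaskLLM answers under the current guidance, generated code is executed, and the CodeSteerLLM reviews the full history to steer the next round. Below is a minimal Python sketch of that loop; the callables and the code-detection heuristic are hypothetical placeholders, not the repository's actual API.

```python
from typing import Callable, List, Tuple

def codesteer_loop(
    task_prompt: str,
    task_llm: Callable[[str, str], str],       # (prompt, guidance) -> answer (text or code)
    codesteer_llm: Callable[[str, List[Tuple[str, str]]], Tuple[str, bool]],  # -> (guidance, done)
    run_code: Callable[[str], str],            # executes generated code, returns its output
    max_rounds: int = 5,
) -> str:
    """Hypothetical sketch of the multi-round code/text steering loop; not the repo's API."""
    history: List[Tuple[str, str]] = []
    guidance = "Decide whether to answer with code or plain text."
    answer = ""
    for _ in range(max_rounds):
        answer = task_llm(task_prompt, guidance)        # TaskLLM answers under current guidance
        if answer.strip().startswith("```python"):      # crude placeholder check for code answers
            answer = run_code(answer)                   # run the generated code, keep its output
        history.append((guidance, answer))
        guidance, done = codesteer_llm(task_prompt, history)  # review all rounds, steer the next
        if done:
            break
    return answer
```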

## Inspirations
<img src="./Figures/LLM-makes-simple-mistakes-gather.png" width="800px" alt="s" />
<p align="center" style="font-size: 16px;">
Figure: Cases where GPT-4o makes simple mistakes with direct textual reasoning but reliably solves the problem when prompted to use code.
</p>


## Performance
We compare GPT-4o + CodeSteer with OpenAI o1 and DeepSeek R1 on SymBench, which contains 28 seen tasks and 9 unseen tasks. GPT-4o + CodeSteer surpasses o1 (82.7), R1 (76.8), and o1-preview (74.8), highlighting the importance of integrating symbolic computing into LLMs.

<img src="./Figures/Table-results.png" width="800px" alt="s" />

Token costs and runtimes for each method are shown below. GPT-4o + CodeSteer uses fewer tokens and less runtime than o1 and R1.
<img src="./Figures/Cost-token-runtime.png" width="800px" alt="s" />

## Environment_Setup
Fine-tuning and inference of the CodeSteerLLM are based on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), with some modules modified by us.
```bash
git clone https://github.com/yongchao98/CodeSteer-v1.0.git
cd CodeSteer-v1.0

conda create -n CodeSteer python=3.10
conda activate CodeSteer
pip install -r requirements.txt
```

## LLM_API_Key_Setup
If you want to use API-based LLMs as the TaskLLM or CodeSteerLLM, you need to set up the corresponding API keys.

1. First, create a .env file in your project root:
```
OPENAI_API_KEY='your_key_here'
CLAUDE_API_KEY='your_key_here'
MIXTRAL_API_KEY='your_key_here'
DEEPSEEK_API_KEY='your_key_here'
```
2. Add this .env file to your .gitignore to prevent accidentally committing it:
```bash
echo ".env" >> .gitignore
```
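
In Python, these keys are then typically read from the environment. A minimal sketch, assuming the `python-dotenv` package (the repository's own key-loading logic may differ):

```python
# Minimal sketch of reading the keys defined in .env, assuming python-dotenv
# (pip install python-dotenv); the repository's own loading code may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from .env in the project root into the environment
openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file.")
```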

## Train_and_Test_Models

### Create_test_samples
The synthesized test samples for the 37 SymBench tasks are in the [dataset_gather](https://github.com/yongchao98/CodeSteer-v1.0/tree/main/dataset_gather) directory. You can also synthesize samples yourself, with tunable complexity, using the scripts in [create_dataset](https://github.com/yongchao98/CodeSteer-v1.0/tree/main/create_dataset).

### Run inference without GPU, test closed-source LLMs as CodeSteerLLM
You can directly use an unfinetuned model such as GPT-4o as the CodeSteerLLM; in this case, run
```bash
python benchmark_test_baseline.py
```

### Run inference with GPU, test finetuned CodeSteerLLM
You can run inference with the fine-tuned Llama-3.1-8B on your own GPUs (the default setting in infer_CodeSteer.sh uses 4x H100 on the Harvard cluster; adapt it freely to your own cluster). You can also download the [Model Weights](https://drive.google.com/drive/folders/1qb_rec6f8rMYtFKm0eQpad0L0uHCwgpL?usp=share_link) locally and change the path in llama3_8B_CodeSteer.yaml.

```bash
bash infer_CodeSteer.sh
# default config file is ./llama3_8B_CodeSteer.yaml using the model uploaded on Huggingface.
```
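
Since the weights are also published on Hugging Face (linked above as yongchao98/CodeSteer-v1), they can alternatively be loaded with the standard `transformers` API. The snippet below is a minimal sketch under that assumption; the prompt shown is only a placeholder, and the actual guidance prompts and generation settings are defined in the repository's inference scripts.

```python
# Minimal sketch of loading the released CodeSteerLLM with the standard transformers API.
# The prompt below is a placeholder; the real guidance prompts live in the repo's scripts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yongchao98/CodeSteer-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Review the task and previous answers, then give guidance for the next round."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```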

### Fine-tuning CodeSteerLLM with synthesized data
Our synthesized datasets for both SFT and DPO fine-tuning are in [Finetune Datasets](https://drive.google.com/drive/folders/1Byn-99gFd5ckRkPMJ8-zagzW7XDfO8ie?usp=share_link).
We use LLaMA-Factory and DeepSpeed for the fine-tuning process. First, install LLaMA-Factory:
```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..
```

Then run the training script (the default setting in train_llama3-8B-CodeSteer.sh uses 4x H100 on the Harvard cluster; adapt it freely to your own cluster):
```bash
bash train_llama3-8B-CodeSteer.sh
```

## Assistance

We appreciate all feedback! Feel free to raise an issue for bugs, questions, or suggestions. Contact [Yongchao Chen](https://yongchao98.github.io/YongchaoChen/) and [Chuchu Fan](https://chuchu.mit.edu) for any questions and discussion.

## Citation
```bibtex
@misc{chen2025codesteersymbolicaugmentedlanguagemodels,
      title={CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance}, 
      author={Yongchao Chen and Yilun Hao and Yueying Liu and Yang Zhang and Chuchu Fan},
      year={2025},
      eprint={2502.04350},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.04350}, 
}
```

```bibtex
@article{chen2024steering,
  title={Steering Large Language Models between Code Execution and Textual Reasoning},
  author={Chen, Yongchao and Jhamtani, Harsh and Sharma, Srinagesh and Fan, Chuchu and Wang, Chi},
  journal={arXiv preprint arXiv:2410.03524},
  year={2024}
}
```