---
library_name: transformers
license: other
base_model: deepseek-ai/deepseek-coder-1.3b-instruct
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: asm2asm-deepseek-1.3b-500k-2ep-x86-O0-risc
results: []
datasets:
- ahmedheakl/asm2asm_O0_500000_gnueabi_gcc
metrics:
- exact_match
- accuracy
---
# CISC-to-RISC
A fine-tuned version of [deepseek-ai/deepseek-coder-1.3b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-instruct) specialized in converting x86 assembly code to 64-bit RISC-V (RISCv64) assembly.
## Model Overview
**asm2asm-deepseek1.3b-xtokenizer-risc** is designed to assist developers in converting x86 assembly instructions to RISCv64 assembly. Leveraging the capabilities of the base model, this fine-tuned variant improves the accuracy and efficiency of assembly code transpilation.
## Intended Use
This model is intended for:
- **Assembly Code Conversion**: Assisting developers in translating x86 assembly instructions to the RISCv64 architecture.
- **Educational Purposes**: Helping learners understand the differences and translation mechanisms between x86 and RISCv64 assembly.
- **Code Optimization**: Facilitating optimization processes by converting and refining assembly code across architectures.
## Limitations
- **Dataset Specificity**: The model is fine-tuned on a specific dataset, which may limit its performance on assembly instructions outside the training distribution.
- **Complex Instructions**: May struggle with highly complex or unconventional assembly instructions not well-represented in the training data.
- **Error Propagation**: Inaccuracies in the generated RISCv64 code can lead to functional discrepancies or bugs if not reviewed; a quick assembler-based syntax check is sketched below.
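As a first-pass validation step, generated code can be run through a cross assembler. The sketch below is illustrative and assumes a RISC-V GNU cross toolchain (e.g. `riscv64-linux-gnu-as`, shipped with the `gcc-riscv64-linux-gnu` package on Debian/Ubuntu) is installed; it checks syntax only and does not verify functional equivalence with the original x86 code.

```python
import subprocess
import tempfile

def assembles_ok(risc_asm: str) -> bool:
    """Return True if the RISC-V cross assembler accepts the generated code.

    This is a syntax check only: passing does not imply the output is
    functionally equivalent to the source x86 assembly.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".s", delete=False) as f:
        f.write(risc_asm)
        path = f.name
    result = subprocess.run(
        ["riscv64-linux-gnu-as", path, "-o", "/dev/null"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```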
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a sketch of an equivalent TRL setup follows the list):
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: paged_adamw (`OptimizerNames.PAGED_ADAMW`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 2
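For orientation, here is a minimal sketch of how these settings map onto TRL's `SFTTrainer` (the card is tagged `trl`/`sft`). The dataset column handling and prompt formatting are assumptions; the actual training script is in the linked repository.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumes the dataset exposes a text column SFTTrainer can consume directly;
# otherwise a formatting function is needed (see the repository for the real script).
dataset = load_dataset("ahmedheakl/asm2asm_O0_500000_gnueabi_gcc", split="train")

config = SFTConfig(
    output_dir="asm2asm-deepseek-1.3b-risc",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,  # effective train batch size: 2 x 4 = 8
    optim="paged_adamw_32bit",      # OptimizerNames.PAGED_ADAMW
    lr_scheduler_type="linear",
    num_train_epochs=2,
    seed=42,
)

trainer = SFTTrainer(
    model="deepseek-ai/deepseek-coder-1.3b-instruct",  # base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```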
## Usage
All models and datasets are available on [Hugging Face](https://huggingface.co/collections/ahmedheakl/cisc-to-risc-672727bd996db985473d146e). Below is an example of how to use the best model for converting x86 assembly to RISCv64.
### Inference Code
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Replace 'hf_token' with your Hugging Face token
hf_token = "your_hf_token_here"
model_name = "ahmedheakl/asm2asm-deepseek1.3b-risc"
instruction = """<|begin▁of▁sentence|>You are a helpful coding assistant assistant on converting from x86 to RISCv64 assembly.
### Instruction:
Convert this x86 assembly into RISCv64
```asm
{asm_x86}
```
### Response:
```asm
{asm_risc}
"""
# Load the fine-tuned model in bfloat16 and let Accelerate place it on the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = True
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=hf_token,
)

def inference(asm_x86: str) -> str:
    # Fill the prompt template, leaving the response slot empty
    prompt = instruction.format(asm_x86=asm_x86, asm_risc="")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        use_cache=True,
        num_return_sequences=1,
        max_new_tokens=8000,
        do_sample=False,
        num_beams=8,
        # temperature=0.7,  # only applies when do_sample=True
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    outputs = tokenizer.batch_decode(generated_ids)[0]
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    # Keep only the assembly between the response's ```asm fence and the EOS token
    return outputs.split("```asm\n")[-1].split(f"```{tokenizer.eos_token}")[0]

x86 = "DWORD PTR -248[rbp] movsx rdx"
converted_risc = inference(x86)
print(converted_risc)
```
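A note on decoding: the example uses beam search (`num_beams=8`) with sampling disabled, so the output is deterministic; `temperature` is commented out because it has no effect when `do_sample=False`. Beam search with `max_new_tokens=8000` can be slow and memory-hungry, so `num_beams=1` (greedy decoding) is a faster alternative at some cost in output quality.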
## Experiments and Results
| **Model** | **Average Edit Distance** (↓) | **Exact Match** (↑) | **Test Accuracy** (↑) |
|-----------------------------------------------|-------------------------------|---------------------|-----------------------|
| GPT4o | 1296 | 0% | 8.18% |
| DeepSeekCoder2-16B | 1633 | 0% | 7.36% |
| Yi-Coder-9B | 1653 | 0% | 6.33% |
| **Yi-Coder-1.5B** | 275 | 16.98% | 49.69% |
| **DeepSeekCoder-1.3B** | 107 | 45.91% | 77.23% |
| **DeepSeekCoder-1.3B-xTokenizer-int4** | 119 | 46.54% | 72.96% |
| **DeepSeekCoder-1.3B-xTokenizer-int8** | **96** | 49.69% | 75.47% |
| **DeepSeekCoder-1.3B-xTokenizer** | 165 | **50.32%** | **79.25%** |
*Table: Comparison of models' performance on the x86 to ARM transpilation task, measured by Edit Distance (lower is better), Exact Match (higher is better), and Test Accuracy (higher is better). The top section lists pre-existing models, while the bottom section lists models trained by us. The best results in each metric are highlighted in bold.*
| **Model** | **Average Edit Distance** (↓) | **Exact Match** (↑) | **Test Accuracy** (↑) |
|----------------------------------------|-------------------------------|---------------------|-----------------------|
| GPT4o | 1293 | 0% | 7.55% |
| DeepSeekCoder2-16B | 1483 | 0% | 6.29% |
| **DeepSeekCoder-1.3B-xTokenizer-int4** | 112 | 14.47% | 68.55% |
| **DeepSeekCoder-1.3B-xTokenizer-int8** | 31 | 69.81% | 88.05% |
| **DeepSeekCoder-1.3B-xTokenizer** | **27** | **69.81%** | **88.68%** |
*Table: Comparison of models' performance on the x86 to RISCv64 transpilation task. The top section lists pre-existing models, while the bottom section lists models trained by us. The best results in each metric are highlighted in bold.*
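For reference, Exact Match and Edit Distance are plain string-level metrics over the generated assembly. The sketch below is an illustrative reimplementation, not the paper's evaluation harness; Test Accuracy additionally requires executing the generated programs and is not shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def exact_match(prediction: str, reference: str) -> bool:
    """Whitespace-insensitive string equality."""
    return " ".join(prediction.split()) == " ".join(reference.split())
```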
### Framework versions
- Transformers 4.46.0
- PyTorch 2.4.0
- Datasets 3.0.2
- Tokenizers 0.20.1
**Please see paper & code for more information:**
- https://github.com/ahmedheakl/asm2asm
- https://arxiv.org/abs/2411.16341
## Citations
If you use this model in your research, please cite it as follows:
```
@article{heakl2024cisc,
title={From CISC to RISC: language-model guided assembly transpilation},
author={Heakl, Ahmed and Abi, Chaimaa and Hossam, Rania and Mahmoud, Abdulrahman},
journal={arXiv preprint arXiv:2411.16341},
year={2024}
}
``` |