arise-sustech/llm4decompile-6.7b-uo

@@ -12,6 +12,7 @@ LLM4Decompile aims to decompile x86 assembly instructions into C. It is finetune
 - **GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
 - **Paper link:** For more details check out the [paper](https://arxiv.org/abs/2403.05286).
 ### 2. Evaluation Results
 | Model              | Re-compilability |           |           |           |           | Re-executability |           |           |           |           |
@@ -26,6 +27,9 @@ LLM4Decompile aims to decompile x86 assembly instructions into C. It is finetune
 ### 3. How to Use
 Here is an example of how to use our model.
 First, compile the C code into a binary, then disassemble the binary into assembly instructions:
 ```python
@@ -33,11 +37,9 @@ import subprocess
 import os
 import re
-digit_pattern = r'\b0x[a-fA-F0-9]+\b'#  binary codes in Hexadecimal
 zeros_pattern = r'^0+\s'#0s
 OPT = ["O0", "O1", "O2", "O3"]
-before = f"# This is the assembly code with {opt_state} optimization:\n"
-after = "\n# What is the source code?\n"
 fileName = 'path/to/file'
 with open(fileName+'.c','r') as f:#original file
     c_func = f.read()
@@ -57,7 +59,8 @@ for opt_state in OPT:
         tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
         input_asm+=tmp_asm+'\n'
     input_asm = re.sub(zeros_pattern, '', input_asm)
     input_asm_prompt = before+input_asm.strip()+after
     with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
         f.write(input_asm_prompt)
@@ -76,7 +79,7 @@ with open(fileName +'_' + opt_state +'.asm','r') as f:#original file
     asm_func = f.read()
 inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
 with torch.no_grad():
-        outputs = model.generate(**inputs, max_new_tokens=512)
 c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])
 ```

 - **GitHub Repository:** [LLM4Decompile](https://github.com/albertan017/LLM4Decompile)
 - **Paper link:** For more details check out the [paper](https://arxiv.org/abs/2403.05286).
+Note: The unified optimization (UO) model is trained without prior knowledge of the optimization levels (O0~O3); its average re-executability is around 0.21.
 ### 2. Evaluation Results
 | Model              | Re-compilability |           |           |           |           | Re-executability |           |           |           |           |
 ### 3. How to Use
+Note: The UO model is trained without prior knowledge of the optimization levels (O0~O3), so its prompt is slightly different.
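To make the prompt difference concrete, the following is a minimal sketch that builds both variants from the two `before` strings that appear in this change. The helper name `make_prompt` and the sample assembly are hypothetical and exist only for illustration.

```python
# Shared suffix used by both prompt variants.
AFTER = "\n# What is the source code?\n"


def make_prompt(asm, opt_state=None):
    """Hypothetical helper: wrap assembly text in the decompilation prompt.

    With opt_state=None it builds the UO-style prompt (no optimization level);
    otherwise it builds the prompt that names the optimization level.
    """
    if opt_state is None:
        before = "# This is the assembly code:\n"
    else:
        before = f"# This is the assembly code with {opt_state} optimization:\n"
    return before + asm.strip() + AFTER


# Made-up assembly text, used only to show the two prompt shapes.
asm = "<func0>:\npush   %rbp\nmov    %rsp,%rbp"
uo_prompt = make_prompt(asm)
o2_prompt = make_prompt(asm, "O2")
print(uo_prompt)
print(o2_prompt)
```

Only the first line of the prompt changes between the two variants; the assembly body and the closing question are the same.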
 Here give an example of how to use our model.
 First compile the C code into binary, disassemble the binary into assembly instructions:
 ```python
 import os
 import re
+digit_pattern = r'\b0x[a-fA-F0-9]+\b'  # hexadecimal values written as 0x...
 zeros_pattern = r'^0+\s'  # leading zeros followed by whitespace
 OPT = ["O0", "O1", "O2", "O3"]
 fileName = 'path/to/file'
 with open(fileName+'.c','r') as f:#original file
     c_func = f.read()
         tmp_asm = tmp_asm.split('#')[0].strip()#remove the comments
         input_asm+=tmp_asm+'\n'
     input_asm = re.sub(zeros_pattern, '', input_asm)
+    before = "# This is the assembly code:\n"  # UO prompt: no optimization level is given
+    after = "\n# What is the source code?\n"  # closing part of the prompt
     input_asm_prompt = before+input_asm.strip()+after
     with open(fileName +'_' + opt_state +'.asm','w',encoding='utf-8') as f:
         f.write(input_asm_prompt)
     asm_func = f.read()
 inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
 with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=200)
 # keep only the newly generated tokens (drop the prompt and the final token)
 c_func_decompile = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):-1])
 ```
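The clean-up steps visible in the snippet above (dropping `#` comments and removing leading zeros) can be tried in isolation. The sketch below assumes a hypothetical helper `clean_asm` and made-up input lines. It reuses `zeros_pattern` exactly as shown, so it does not reproduce the parts of the script that are not visible in this diff.

```python
import re

zeros_pattern = r'^0+\s'  # same pattern as in the snippet above


def clean_asm(raw_lines):
    """Hypothetical helper mirroring the visible clean-up steps."""
    input_asm = ''
    for line in raw_lines:
        # Drop everything after '#', i.e. the trailing comment.
        tmp_asm = line.split('#')[0].strip()
        input_asm += tmp_asm + '\n'
    # Without re.MULTILINE, '^' matches only at the very start of the string,
    # so only the zeros at the beginning of the first line are removed.
    input_asm = re.sub(zeros_pattern, '', input_asm)
    return input_asm.strip()


# Made-up lines for illustration, not real disassembler output.
raw = [
    "0000 <func0>:",
    "push   %rbp            # save the frame pointer",
    "mov    %rsp,%rbp",
]
cleaned = clean_asm(raw)
print(cleaned)
```

If leading zeros should be removed from every line rather than only the first, passing `flags=re.MULTILINE` to `re.sub` would be one way to do so. That is a change from the script as shown here.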