Papers
arxiv:2306.02546

Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary

Published on Jun 5, 2023
Authors:
,
,
,
,
,
,
,
,
,

Abstract

Decompilation aims to recover the source code form of a binary executable. It has many security applications such as malware analysis, vulnerability detection and code hardening. A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases and potential hallucinations. We build a prototype, GenNm, from pre-trained generative models CodeGemma-2B and CodeLlama-7B. We finetune GenNm on decompiled functions, and mitigate model biases by incorporating symbol preference to the training pipeline. GenNm includes names from callers and callees while querying a function, providing rich contextual information within the model's input token limitation. It further leverages program analysis to validate the consistency of names produced by the generative model. Our results show that GenNm improves the state-of-the-art name recovery accuracy by 8.6 and 11.4 percentage points on two commonly used datasets, and improves the state-of-the-art from 8.5% to 22.8% in the most challenging setup where ground-truth variable names are not seen in the training dataset.

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2306.02546 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2306.02546 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2306.02546 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.