elsagranger committed on
Commit acc802b · verified · 1 Parent(s): 119f581

Update README.md

Files changed (1):
  1. README.md +106 -50
README.md CHANGED
---
license: apache-2.0
---

# Virtual Compiler Is All You Need For Assembly Code Search

## Introduction

This repo contains the models and the corresponding evaluation datasets of the ACL 2024 paper "Virtual Compiler Is All You Need For Assembly Code Search".

A virtual compiler is an LLM that is capable of compiling any programming language into the underlying assembly code. The virtual compiler model is available at [elsagranger/VirtualCompiler](https://huggingface.co/elsagranger/VirtualCompiler), based on the 34B CodeLlama.

We evaluate the similarity between the virtual assembly code generated by the virtual compiler and the real assembly code using forced execution with the script [force_exec.py](./force_exec.py); the corresponding evaluation dataset is available at [virtual_assembly_and_ground_truth](./virtual_assembly_and_ground_truth).

We evaluate the effectiveness of the virtual compiler through a downstream task -- assembly code search; the evaluation dataset is available at [elsagranger/AssemblyCodeSearchEval](https://huggingface.co/datasets/elsagranger/AssemblyCodeSearchEval).
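
As a quick sanity check, the downstream evaluation set can be pulled directly from the Hub. Below is a minimal sketch, assuming the dataset loads with the standard `datasets` API; the split names and field layout are not documented here, so the snippet just inspects the first record.

```python
# Hedged sketch: load the assembly code search evaluation data from the Hub
# and print one raw record instead of assuming a particular schema.
from datasets import load_dataset

ds = load_dataset("elsagranger/AssemblyCodeSearchEval")
print(ds)  # available splits and their sizes

first_split = next(iter(ds))
print(ds[first_split][0])  # one raw record
```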

## Usage

We use FastChat and a vLLM worker to host the model. Run the following commands in separate terminals (or separate `tmux` panes).

```shell
LOGDIR="" python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 --port 8080 \
    --controller-address http://localhost:21000

LOGDIR="" python3 -m fastchat.serve.controller \
    --host 0.0.0.0 --port 21000

LOGDIR="" RAY_LOG_TO_STDERR=1 \
    python3 -m fastchat.serve.vllm_worker \
    --model-path ./VirtualCompiler \
    --num-gpus 8 \
    --controller http://localhost:21000 \
    --max-num-batched-tokens 40960 \
    --disable-log-requests \
    --host 0.0.0.0 --port 22000 \
    --worker-address http://localhost:22000 \
    --model-names "VirtualCompiler"
```

Then, with the model hosted, use `do_request.py` to make requests to the model.

```shell
~/C/VirtualCompiler (main)> python3 do_request.py
test rdx, rdx
setz al
movzx eax, al
neg eax
retn
```
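
The request script itself is not reproduced on this page. As a rough illustration, a request against the OpenAI-compatible endpoint started above could look like the sketch below; the prompt wording and the example C function are assumptions, not the actual contents of `do_request.py`.

```python
# Hedged sketch: query the FastChat OpenAI-compatible server started above.
# The endpoint and model name come from the serving commands; the prompt
# format is an assumption and may differ from the real do_request.py.
import requests

SOURCE = """
int is_zero(long x) {
    return -(x == 0);
}
"""

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "VirtualCompiler",
        "messages": [
            {
                "role": "user",
                "content": "Compile the following code into x86-64 assembly:\n" + SOURCE,
            }
        ],
        "temperature": 0.0,
    },
    timeout=600,
)

# Print the generated (virtual) assembly.
print(resp.json()["choices"][0]["message"]["content"])
```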

## Assembly Code Search Encoder

As Hugging Face does not support loading a remote model from inside a folder, we host the model trained on the assembly code search dataset augmented by the Virtual Compiler at [vic-encoder](https://cloud.vul337.team:9443/s/t5Ltt8gy7kPfyw8). You can use `model.py` to test the custom model loading.
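
The authoritative loading code is the bundled `model.py`; the sketch below only illustrates one plausible way the two encoders and their tokenizers might be loaded once the checkpoint is downloaded. The subfolder names and the use of `trust_remote_code` are assumptions.

```python
# Hedged sketch: load the downloaded vic-encoder checkpoint with transformers.
# The subfolder layout ("text_encoder" / "asm_encoder") is assumed; defer to
# the bundled model.py for the actual loading logic.
from transformers import AutoModel, AutoTokenizer

text_tokenizer = AutoTokenizer.from_pretrained(
    "./vic-encoder/text_encoder", trust_remote_code=True)
asm_tokenizer = AutoTokenizer.from_pretrained(
    "./vic-encoder/asm_encoder", trust_remote_code=True)

text_encoder = AutoModel.from_pretrained(
    "./vic-encoder/text_encoder", trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained(
    "./vic-encoder/asm_encoder", trust_remote_code=True)
```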

Here is an example of using the text encoder and the asm encoder. Please refer to this script for how to extract assembly code from a binary: [process_asm.py](https://github.com/Hustcw/CLAP/blob/main/scripts/process_asm.py).

```python
import torch


def calc_map_at_k(logits, pos_cnt, ks=[10,]):
    # Rank all candidates for each query; the first `pos_cnt` columns of
    # `logits` correspond to the positive assembly functions.
    _, indices = torch.sort(logits, dim=1, descending=True)

    # [batch_size, pos_cnt]
    ranks = torch.nonzero(
        indices < pos_cnt,
        as_tuple=False
    )[:, 1].reshape(logits.shape[0], -1)

    # [batch_size]
    mrr = torch.mean(1 / (ranks + 1), dim=1)

    res = {}

    for k in ks:
        res[k] = (
            torch.sum((ranks < k).float(), dim=1) / min(k, pos_cnt)
        ).cpu().numpy()

    return ranks.cpu().numpy(), res, mrr.cpu().numpy()


pos_asm_cnt = 1

query = ["List all files in a directory"]
anchor_asm = [...]
neg_anchor_asm = [...]

query_embs = text_encoder(**text_tokenizer(query))
asm_embs = asm_encoder(**asm_tokenizer(anchor_asm))
asm_neg_emb = asm_encoder(**asm_tokenizer(neg_anchor_asm))

# query_embs: [query_cnt, emb_dim]
# asm_embs: [pos_asm_cnt, emb_dim]

# logits_pos: [query_cnt, pos_asm_cnt]
logits_pos = torch.einsum(
    "ic,jc->ij", [query_embs, asm_embs])
# logits_neg: [query_cnt, neg_asm_cnt]
logits_neg = torch.einsum(
    "ic,jc->ij", [query_embs, asm_neg_emb[pos_asm_cnt:]]
)
logits = torch.cat([logits_pos, logits_neg], dim=1)

ranks, map_at_k, mrr = calc_map_at_k(
    logits, pos_asm_cnt, [1, 5, 10, 20, 50, 100])
```
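
In the returned values, `ranks` holds the position of each positive assembly function in the ranked candidate list for every query, `map_at_k[k]` is the proportion of those positives retrieved within the top `k`, and `mrr` is the mean reciprocal rank.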