---
license: mit
pipeline_tag: zero-shot-classification
---

<h1 align="center">CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision</h1>

<h4 align="center">
  <p>
    <a href="#about">About</a> |
    <a href="#news">News</a> |
    <a href="#quickstart">QuickStart</a> |
    <a href="#citation">Citation</a>
  </p>
</h4>

## About
CLAP (Contrastive Language-Assembly Pre-training) is a framework that learns binary code representations through natural language supervision. By aligning binary code with natural language explanations, it improves analysis performance in few-shot and zero-shot scenarios. Trained on 195 million pairs of code snippets and descriptions produced by an automatic dataset engine, CLAP achieves exceptional transferability in binary code analysis. Our goal is to provide an effective tool for researchers and practitioners in binary code analysis, with our models available on the Hugging Face Model Hub.

<img alt="clap_model" src="https://cdn-uploads.huggingface.co/production/uploads/6342dd731bdd3dfa55d66931/qCNIjTlzOPtTpI3NLBY14.png">

## News

- [2024/2/29] CLAP is available on the Hugging Face Model Hub ([clap-asm](https://huggingface.co/hustcw/clap-asm) and [clap-text](https://huggingface.co/hustcw/clap-text)).
- [2024/2/28] CLAP is now on [arXiv](https://arxiv.org/abs/2402.16928).

## QuickStart

This guide will help you set up and start using the CLAP model for a variety of tasks, including fine-grained classification of sorting algorithms, malware, and cryptographic algorithms, all without any further training.

### Requirements
- Python 3.6 or higher
- [PyTorch](https://pytorch.org/get-started/locally/)
- [Transformers library](https://huggingface.co/docs/transformers/installation)
- A CUDA-enabled GPU is highly recommended for faster processing.

Ensure you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:
```bash
pip install transformers
```

### Preparing Tokenizers and Models
Import the necessary libraries and initialize the encoders and tokenizers:
```python
import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

asm_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)
```
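
Optionally, you can put both encoders in evaluation mode and disable gradient tracking during inference. This is standard PyTorch practice rather than anything CLAP-specific; a minimal sketch:
```python
# Standard PyTorch inference setup: eval() disables training-time behavior
# such as dropout, and torch.no_grad() skips autograd bookkeeping.
asm_encoder.eval()
text_encoder.eval()

# Wrap the encoding calls from the examples below in no_grad, e.g.:
# with torch.no_grad():
#     asm_embedding = asm_encoder(**asm_input)
```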

### Example Use Cases
**Fine-Grained Sorting Algorithm Classification (Zero-Shot)**

1. Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing assembly code snippets related to bubble sort:

```python
import json

with open("./CaseStudy/bubblesort.json") as fp:
    asm = json.load(fp)
```

2. Define your classification prompts:
```python
prompts = [
    "This is a function related to bubble sort",
    "This is a function related to selection sort",
    ...
]
```

3. Encode the assembly code and the prompts, then perform classification:

```python
# Encode the assembly code
asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
asm_embedding = asm_encoder(**asm_input)

# Encode the prompts
text_input = text_tokenizer(prompts, return_tensors="pt").to(device)
text_embeddings = text_encoder(**text_input)

# Similarity between the assembly embedding and each prompt embedding
logits = torch.einsum("nc,ck->nk", [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])

# CLIP-style softmax over prompts, with temperature 0.07
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()

# Output the prediction for each prompt
for i, prompt in enumerate(prompts):
    print(f"Probability: {preds[i]*100:.3f}%, Text: {prompt}")
```

Repeat this process for any other classification task, such as malware classification or cryptographic algorithm identification, by loading the corresponding dataset and defining task-specific natural language prompts, as sketched below.
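
For instance, a minimal sketch for cryptographic algorithm identification. The file name `./CaseStudy/crypto.json` and the prompt wording below are illustrative assumptions rather than files shipped with this repository; substitute your own dataset and labels:
```python
import json

# Hypothetical dataset path, assumed to use the same JSON format as bubblesort.json
with open("./CaseStudy/crypto.json") as fp:
    asm = json.load(fp)

# Illustrative task-specific prompts
prompts = [
    "This is a function related to AES encryption",
    "This is a function related to RSA encryption",
    "This is a function related to SHA-256 hashing",
]

# Same encode-and-compare pipeline as in the sorting example above
asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
asm_embedding = asm_encoder(**asm_input)
text_input = text_tokenizer(prompts, return_tensors="pt").to(device)
text_embeddings = text_encoder(**text_input)

logits = torch.einsum("nc,ck->nk", [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()
for prompt, p in zip(prompts, preds):
    print(f"Probability: {p*100:.3f}%, Text: {prompt}")
```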

## Citation
If this work is helpful for your research, please consider giving it a star 🌟 and citing our work.
```bibtex
@misc{wang2024clap,
      title={CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision},
      author={Hao Wang and Zeyu Gao and Chao Zhang and Zihan Sha and Mingyang Sun and Yuchen Zhou and Wenyu Zhu and Wenju Sun and Han Qiu and Xi Xiao},
      year={2024},
      eprint={2402.16928},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
```