Update README.md #1
by Judithvdw · opened

README.md CHANGED
Model Card
----------

_Who to contact:_ fbda [at] nfi [dot] nl \
_Version / Date:_ v1, 15/05/2025 \
TODO: add link to github repo

## General

### What is the purpose of the model?
The model is a BERT model trained on ARM64 assembly code. It can be used to find ARM64 functions that are similar to a given ARM64 function.

### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022). It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al. This architecture has subsequently been finetuned for semantic search, following the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).

### What is the output of the model?
The model returns a 768-dimensional vector for each function it is given. These vectors can be compared to get an indication of which functions are similar to each other.

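To illustrate how these vectors are used, here is a minimal sketch with the [sentence-transformers](https://www.sbert.net) library. The model identifier and the whitespace-separated instruction format are illustrative assumptions, not the published interface:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical checkpoint id: substitute the model published with this card.
model = SentenceTransformer("NetherlandsForensicInstitute/ARM64bert-embedding")

# Each "sentence" is one ARM64 function, rendered as a sequence of
# instructions (assumed preprocessing; see the github repo once linked).
query = "stp x29, x30, [sp, #-16]! mov x29, sp bl 0x400f00 ldp x29, x30, [sp], #16 ret"
candidates = [
    "stp x29, x30, [sp, #-32]! mov x29, sp bl 0x400f80 ldp x29, x30, [sp], #32 ret",
    "sub sp, sp, #16 str w0, [sp, #12] ldr w0, [sp, #12] add sp, sp, #16 ret",
]

query_emb = model.encode(query, convert_to_tensor=True)       # shape: (768,)
cand_embs = model.encode(candidates, convert_to_tensor=True)  # shape: (2, 768)

# Cosine similarity: a higher score suggests the same underlying function.
print(util.cos_sim(query_emb, cand_embs))
```
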
### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32, it ranks the positive example highest most of the time.
When the pool is significantly enlarged to 10,000 functions, it still ranks the positive example first or second in most cases. A sketch of how these metrics are computed follows the table below.

| Model   | Pool size | MRR  | Recall@1 |
|---------|-----------|------|----------|
| ASMBert | 32        | 0.99 | 0.99     |
| ASMBert | 10,000    | 0.87 | 0.83     |

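For concreteness, MRR and Recall@1 can be computed from a query-by-candidate matrix of similarity scores as sketched below; this is a generic illustration, not the exact evaluation script behind the numbers above:

```python
import numpy as np

def mrr_and_recall_at_1(sim: np.ndarray) -> tuple[float, float]:
    """sim[i, j] is the similarity between query i and candidate j;
    candidate i is assumed to be the positive example for query i."""
    positives = np.diag(sim)
    # Rank of the positive = 1 + number of candidates scoring strictly higher.
    ranks = 1 + (sim > positives[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean()), float((ranks == 1).mean())
```
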
## Purpose and use of the model

### For which problem has the model been designed?
The model has been designed to find similar ARM64 functions in a database of known ARM64 functions.

### What else could the model be used for?
We do not see other applications for this model.

### To what problems is the model not applicable?
This model has been finetuned for the semantic search task; for a generic ARM64 BERT model, please refer to the [other model](https://huggingface.co/NetherlandsForensicInstitute/ARM64bert) we have published.

## Data

### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. create BinaryCorp. A large set of binary code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/).
All this code is split into functions that are compiled with different optimization levels (O0, O1, O2, O3 and Os) and security settings (fortify or no-fortify). This results in a maximum of 10 (5 × 2) different versions of each function which are semantically similar, i.e. they represent the same functionality but are written differently.
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of either the train or the test set, not both (a sketch of this split follows the table below). We have not performed any deduplication on the dataset for training.

| set   | # functions |
|-------|------------:|
| train |  18,083,285 |
| test  |   3,375,741 |

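To make the project-level split concrete, here is a minimal sketch; the data layout and test fraction are illustrative assumptions, not the published pipeline:

```python
import random

def split_by_project(functions: dict[str, list[str]], test_fraction: float = 0.15):
    """functions maps a project name to its function strings (assumed layout).
    Splitting whole projects keeps every project in exactly one set."""
    projects = sorted(functions)
    random.Random(0).shuffle(projects)  # fixed seed for a reproducible split
    test_projects = set(projects[: int(len(projects) * test_fraction)])
    train = [f for p in projects if p not in test_projects for f in functions[p]]
    test = [f for p in test_projects for f in functions[p]]
    return train, test
```
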
### By whom was the dataset collected and annotated?
The dataset was collected by our team.

### Any remarks on data quality and bias?
After training our models, we found out that something had gone wrong when compiling our dataset: the last line (instruction) of the previous function was included in the next function. This has been fixed for the finetuning, but due to the long training process and the good performance of the model despite the mistake, we have decided not to retrain the base model.

## Fairness Metrics

### Which metrics have been used to measure bias in the data/model and why?
n.a.

### What do those metrics show?
n.a.

### Any other notable issues?
n.a.

## Analyses (optional)
n.a.