---
tags:
- software engineering
- ner
- named-entity recognition
- token-classification
widget:
- text: >-
    In the field of computer graphics, a graphics processing unit (GPU) utilizes algorithms such as ray tracing, a rendering technique, to create realistic lighting effects in applications like Adobe Acrobat and Microsoft Excel.
  example_title: example 1
- text: >-
    By utilizing the TensorFlow and FastAPI libraries with Python, we are optimizing neural network training on devices like the Samsung Gear S2 and Intel T5300 processor.
  example_title: example 2
language:
- en
datasets:
- wikiser
license: apache-2.0
---

# Software Entity Recognition with Noise-robust Learning

We train a BERT model for the task of software entity recognition (SER).
The training data comes from WikiSER, a corpus of 1.7M sentences extracted from Wikipedia.
The model uses _self-regularization_ during fine-tuning, which makes it robust to noise in software-domain text, such as misannotations and inconsistent naming conventions.
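
This card does not spell out the self-regularization objective. As a rough illustration only, one common way to regularize against label noise is to run the same input through the model twice with different dropout masks and penalize disagreement between the two predicted label distributions, on top of the usual cross-entropy. The sketch below shows that idea for a single token in plain Python. The function names, the weight `alpha`, and the symmetric-KL agreement term are assumptions made for illustration, not the authors' implementation; the paper and code listed under Model details describe the actual method.

```python
import math


def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label index under `probs`."""
    return -math.log(probs[gold])


def kl(p, q):
    """KL(p || q) for two discrete distributions; q must have no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_regularized_loss(probs_a, probs_b, gold, alpha=1.0):
    """Hypothetical per-token loss: averaged cross-entropy plus an agreement term.

    probs_a, probs_b: label distributions from two forward passes of the same
    input with different dropout masks (an assumed setup, not from the card).
    alpha: assumed weight on the symmetric-KL agreement term.
    """
    task = 0.5 * (cross_entropy(probs_a, gold) + cross_entropy(probs_b, gold))
    agreement = 0.5 * (kl(probs_a, probs_b) + kl(probs_b, probs_a))
    return task + alpha * agreement


# Two made-up distributions over three labels for one token.
pass_a = [0.7, 0.2, 0.1]
pass_b = [0.6, 0.3, 0.1]
loss = self_regularized_loss(pass_a, pass_b, gold=0)
print(round(loss, 4))
```

When both passes agree exactly, the agreement term vanishes and the loss reduces to plain cross-entropy; the more the two passes disagree, the larger the penalty.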

The model recognizes 12 fine-grained named entities: `Algorithm`, `Application`, `Architecture`, `Data_Structure`, `Device`, `Error_Name`, `General_Concept`, `Language`, `Library`, `License`, `Operating_System`, and `Protocol`.

| Type             | Examples                                              |
|------------------|-------------------------------------------------------|
| Algorithm        | Auction algorithm, Collaborative filtering            |
| Application      | Adobe Acrobat, Microsoft Excel                        |
| Architecture     | Graphics processing unit, Wishbone                    |
| Data_Structure   | Array, Hash table, XOR linked list                    |
| Device           | Samsung Gear S2, iPad, Intel T5300                    |
| Error_Name       | Buffer overflow, Memory leak                          |
| General_Concept  | Memory management, Nouvelle AI                        |
| Language         | C++, Java, Python, Rust                               |
| Library          | Beautiful Soup, FastAPI                               |
| License          | Cryptix General License, MIT License                  |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS                    |
| Protocol         | TLS, FTPS, HTTP 404                                   |

## Model details

Paper: https://arxiv.org/abs/2308.10564

Code: https://github.com/taidnguyen/software_entity_recognition

Finetuned from model: `bert-large-cased`

Checkpoint for base version: https://huggingface.co/taidng/wikiser-bert-base

## How to use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-large")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-large")

# Build an NER pipeline and run it on an example sentence.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Windows XP was originally bundled with Internet Explorer 6."

ner_results = nlp(example)
print(ner_results)
```
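
Without further options, the `ner` pipeline returns one prediction per word piece. The sketch below merges such predictions into whole entity spans using their character offsets. It assumes the labels follow a `B-`/`I-` prefix scheme (check `model.config.id2label` to confirm), and it runs on a hand-written sample in the pipeline's output format, not on real model output. `merge_entities` is a hypothetical helper name.

```python
def merge_entities(text, token_preds):
    """Merge token-level predictions into entity spans.

    token_preds: dicts with "entity" (e.g. "B-Library"), "start" and "end",
    sorted by position. Assumes a B-/I- tagging scheme (an assumption).
    """
    spans = []
    for pred in token_preds:
        if pred["entity"] == "O":
            continue  # skip non-entity tokens if they are present
        prefix, _, label = pred["entity"].partition("-")
        current = spans[-1] if spans else None
        continues = (
            current is not None
            and prefix == "I"
            and current["label"] == label
            # adjacent word piece, or next word after a single space
            and pred["start"] - current["end"] <= 1
        )
        if continues:
            current["end"] = pred["end"]
        else:
            spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "Windows XP was originally bundled with Internet Explorer 6."
# Hand-written sample in the pipeline's output format (illustrative only).
sample_preds = [
    {"entity": "B-Operating_System", "word": "Windows", "start": 0, "end": 7},
    {"entity": "I-Operating_System", "word": "XP", "start": 8, "end": 10},
    {"entity": "B-Application", "word": "Internet", "start": 39, "end": 47},
    {"entity": "I-Application", "word": "Explorer", "start": 48, "end": 56},
    {"entity": "I-Application", "word": "6", "start": 57, "end": 58},
]
entities = merge_entities(text, sample_preds)
for span in entities:
    print(span["label"], "->", span["text"])
```

The pipeline can also do this grouping itself when it is constructed with `aggregation_strategy="simple"`; a manual helper like this one is mainly useful when custom merging rules are needed.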

## Citation

```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```