Model description

Cased BERT model for Hungarian, fine-tuned on a dataset provided by the National Tax and Customs Administration of Hungary (NAV) under its Public Accessibility Programme. It is a refined version of the huBERTPlain model ('uvegesistvan/huBERTPlain'). The training data was cleaned further:

  • Minor corrections to the sentence segmentation results.
  • Training data filtered: within each document, sentence pairs (original - rephrased) whose Levenshtein distance was less than 3 were removed. These pairs were assumed to be spelling corrections and therefore potentially less helpful for Plain Language classification.
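The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes a character-level Levenshtein distance (the card does not say whether distance was measured over characters or tokens), and the names `levenshtein` and `drop_near_duplicates` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def drop_near_duplicates(pairs, min_distance=3):
    """Keep (original, rephrased) pairs that differ by at least `min_distance` edits.

    Assumption: pairs below the threshold are treated as spelling corrections.
    """
    return [(orig, reph) for orig, reph in pairs
            if levenshtein(orig, reph) >= min_distance]
```

With the default threshold, a pair such as ("adobevallas", "adóbevallás") differs by two substitutions and would be dropped, while a genuinely rephrased sentence would be kept.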

Intended uses & limitations

The model can be used like any other (cased) BERT model. It has been tested on distinguishing "accessible" and "original" sentences, where:

  • "accessible" - "Label_0": a sentence that can be considered comprehensible (with regard to Plain Language directives).
  • "original" - "Label_1": a sentence that needs to be rephrased in order to follow Plain Language guidelines.

Training

Fine-tuned version of the original huBERT model (SZTAKI-HLT/hubert-base-cc), trained on information materials provided by NAV linguistic experts.

Eval results

Class                  Precision  Recall  F-Score
Accessible / Label_0   0.75       0.72    0.73
Original / Label_1     0.74       0.77    0.75
accuracy                                  0.74
macro avg              0.74       0.74    0.74
weighted avg           0.74       0.74    0.74
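As a quick consistency check on the table, the per-class F-scores can be recomputed from the reported precision and recall. The sketch assumes that "F-Score" means the F1 score (the harmonic mean of precision and recall), which the card does not state explicitly. The inputs are the rounded values from the table, so this is only an approximate check.

```python
def f_score(precision: float, recall: float) -> float:
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "Accessible / Label_0": (0.75, 0.72),
    "Original / Label_1": (0.74, 0.77),
}

per_class = {name: round(f_score(p, r), 2) for name, (p, r) in reported.items()}
# Macro average: unweighted mean of the per-class scores.
macro_f = round(sum(per_class.values()) / len(per_class), 2)

print(per_class, macro_f)
```

Under that assumption the recomputed values are 0.73 and 0.75 for the two classes and 0.74 for the macro average, matching the table.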

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("uvegesistvan/huBERTPlain_v2")
model = AutoModelForSequenceClassification.from_pretrained("uvegesistvan/huBERTPlain_v2")
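Once the tokenizer and model are loaded, sentences can be classified roughly as sketched below. This is an illustrative sketch rather than the authors' evaluation code: it assumes PyTorch is installed, and that output index 0 corresponds to "accessible" and index 1 to "original", following the label description in this card. The helper names `decode` and `classify` are hypothetical.

```python
import math

MODEL_ID = "uvegesistvan/huBERTPlain_v2"
# Assumed index-to-name mapping, based on the Label_0 / Label_1 description.
LABEL_NAMES = {0: "accessible", 1: "original"}


def decode(logits_rows):
    """Turn rows of raw logits into (label name, softmax probability) pairs."""
    results = []
    for row in logits_rows:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in row]
        best = max(range(len(row)), key=lambda i: row[i])
        results.append((LABEL_NAMES[best], exps[best] / sum(exps)))
    return results


def classify(sentences):
    """Load the model (requires network access) and label each sentence."""
    # Imported here so that decode() stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**batch).logits
    return decode(logits.tolist())
```

Calling `classify(["..."])` with a list of Hungarian sentences returns one (label, probability) pair per sentence. Given the reported accuracy of 0.74, individual predictions are best treated as hints rather than verdicts.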

BibTeX entry and citation info

If you use the model, please cite the following dissertation (to be submitted for workshop discussion):

BibTeX:

@PhDThesis{ Uveges:2024,
  author = {{\"U}veges, Istv{\'a}n},
  title  = {K{\"o}z{\'e}rthet{\"o} {\'e}s automatiz{\'a}ci{\'o} - k{\'i}s{\'e}rletek a jog, term{\'e}szetesnyelv-feldolgoz{\'a}s {\'e}s informatika hat{\'a}r{\'a}n.},
  year   = {2024},
  school = {Szegedi Tudom{\'a}nyegyetem}
}