arxiv:2308.12635

Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

Published on Aug 24, 2023

Abstract

This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
