Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines
Abstract
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models are implemented in the spaCy framework and extend the HuSpaCy toolkit with several improvements to its architecture. Compared with existing NLP tools for Hungarian, each of our pipelines covers all basic text processing steps (tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition) with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.