arxiv:2209.14008

Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer

Published on Sep 28, 2022

Abstract

The paper explores the relevance of the Text-To-Text Transfer Transformer language model for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), which is released with this paper: a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project. We compare the results obtained by four different methods, i.e., plT5kw, extremeText, TermoPL and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on the POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.
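A natural starting point for the intrinsic evaluation the abstract mentions is exact-match overlap between the keyword set a model generates and the gold annotation. The sketch below is a simplified illustration of such a set-based F1 score, with basic normalization (lower-casing, whitespace stripping); it is an assumption for illustration, not necessarily the paper's exact evaluation protocol.

```python
def keyword_f1(predicted, gold):
    """Exact-match F1 between a predicted and a gold keyword set.

    Simplified metric: keywords are lower-cased and stripped before
    comparison, and duplicates are collapsed. The paper's evaluation
    details may differ (e.g. lemmatization of Polish keywords).
    """
    pred = {k.strip().lower() for k in predicted}
    ref = {k.strip().lower() for k in gold}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)           # true positives: shared keywords
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two of three predictions match the gold annotation.
score = keyword_f1(["keyword extraction", "T5", "Polish"],
                   ["keyword extraction", "polish", "short texts"])
print(round(score, 3))  # P = 2/3, R = 2/3, so F1 = 0.667
```

Exact matching is deliberately strict: a generated keyword that is a near-synonym or inflected variant of a gold keyword scores zero, which is one of the evaluation challenges the abstract raises for text-to-text models.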
