---
language: ca

tags:
- summarization

widget:
- text: "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
---


News Abstractive Summarization for Catalan (NASCA) is a Transformer encoder-decoder model, with the same hyperparameters as BART, designed to summarize Catalan news articles. It is pre-trained on a combination of several self-supervised tasks that help increase the abstractivity of the generated summaries. Four pre-training tasks are combined: sentence permutation, text infilling, Gap Sentence Generation, and Next Segment Generation. The pre-training data consists of Catalan newspapers, the Catalan subset of the OSCAR corpus, and Wikipedia articles in Catalan (9.3 GB of raw text, 2.5 million documents).
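
As a rough illustration of what the four pre-training tasks ask the model to do, the sketch below builds one (source, target) pair per task from a toy document. It is not the authors' implementation: the `<mask>` token, the function names, the fixed span and sentence choices, and the example sentences are all assumptions made for the example. In BART-style text infilling the span lengths are normally sampled at random, and Gap Sentence Generation usually picks the masked sentences with an importance heuristic; both are replaced here by explicit arguments so the behaviour is easy to follow.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the real model's token may differ


def sentence_permutation(sentences, rng):
    """Shuffle the sentence order; the target is the original document."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return " ".join(shuffled), " ".join(sentences)


def text_infilling(tokens, start, length):
    """Replace one token span with a single mask; the target is the original text."""
    corrupted = tokens[:start] + [MASK] + tokens[start + length:]
    return " ".join(corrupted), " ".join(tokens)


def gap_sentence_generation(sentences, gap_indices):
    """Mask the selected sentences; the target is those sentences, in order."""
    source = [MASK if i in gap_indices else s for i, s in enumerate(sentences)]
    target = [s for i, s in enumerate(sentences) if i in gap_indices]
    return " ".join(source), " ".join(target)


def next_segment_generation(sentences, split):
    """Use the first `split` sentences as source and the remainder as target."""
    return " ".join(sentences[:split]), " ".join(sentences[split:])


# Toy Catalan document, invented for this example.
doc = [
    "El ple aprova el pressupost.",
    "L'oposició hi vota en contra.",
    "Les obres començaran al març.",
    "El projecte durarà dos anys.",
]
rng = random.Random(0)  # fixed seed so the shuffle is reproducible

perm_src, perm_tgt = sentence_permutation(doc, rng)
infill_src, infill_tgt = text_infilling(" ".join(doc).split(), start=2, length=3)
gsg_src, gsg_tgt = gap_sentence_generation(doc, gap_indices={1})
nsg_src, nsg_tgt = next_segment_generation(doc, split=2)

pairs = {
    "sentence permutation": (perm_src, perm_tgt),
    "text infilling": (infill_src, infill_tgt),
    "gap sentence generation": (gsg_src, gsg_tgt),
    "next segment generation": (nsg_src, nsg_tgt),
}
for name, (src, tgt) in pairs.items():
    print(f"[{name}]\n  source: {src}\n  target: {tgt}")
```

The first two tasks train the model to reconstruct a corrupted input, while the last two require it to generate text that does not appear in the source, which is closer to what abstractive summarization demands.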

NASCA is fine-tuned for the summarization task on 636,596 (document, summary) pairs from the Dataset for Automatic summarization of Catalan and Spanish newspaper Articles (DACSA).

More details about the pre-training and fine-tuning datasets and the models will be available soon:

@unpublished{DACSA,
  author = "Vicent Ahuir and Lluís-F. Hurtado and José Ángel González and Encarna Segarra",
  title = "DACSA: a Dataset for Automatic summarization of Catalan and Spanish newspaper Articles",
  note = "Unsubmitted",
}

@unpublished{NAS,
  author = "Vicent Ahuir and Lluís-F. Hurtado and José Ángel González and Encarna Segarra",
  title = "NASCA and NASES: Two monolingual pre-trained models for abstractive summarization in Catalan and Spanish",
  note = "Submitted to the Special Issue on Current Approaches and Applications in Natural Language Processing (Applied Sciences)",
}