Marian Krotil committed on
Commit
03af426
1 Parent(s): 821ca2c

Create README.md

Files changed (1)
  1. README.md +62 -0
README.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ language:
+ - cs
+ tags:
+ - abstractive summarization
+ - mbart-cc25
+ - Czech
+ license: apache-2.0
+ datasets:
+ - private CNC dataset news-based
+ metrics:
+ - rouge
+ - rougeraw
+ ---
+
+ # mBART fine-tuned model for Czech abstractive summarization (HT2A-C)
+ This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on a Czech news dataset to produce abstractive summaries in Czech.
+ ## Task
+ The model addresses the ``Headline + Text to Abstract`` (HT2A) task: given the headline and full text of a Czech news article, it generates a multi-sentence summary that serves as the article's abstract.
+
+ ## Dataset
+ The model was trained on the private CNC dataset provided by Czech News Center. The dataset contains roughly 3/4M (about 750K) Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Sequences were truncated and padded to 512 tokens.
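As a rough illustration of what truncating and padding to 512 tokens means, the sketch below applies both to plain lists of token ids. The helper name `pad_or_truncate` and the pad id of 1 are assumptions made for this example, not details taken from the training code, and special tokens are ignored. With a Hugging Face tokenizer the same effect is usually obtained by passing `truncation=True, padding="max_length", max_length=512`.

```python
from typing import List

MAX_LEN = 512  # sequence length stated in the Dataset section
PAD_ID = 1     # assumed pad token id, for illustration only


def pad_or_truncate(token_ids: List[int], max_len: int = MAX_LEN, pad_id: int = PAD_ID) -> List[int]:
    """Return a copy of token_ids with exactly max_len entries."""
    clipped = token_ids[:max_len]  # truncation: drop everything past max_len
    return clipped + [pad_id] * (max_len - len(clipped))  # padding: fill the rest with pad_id


short_doc = pad_or_truncate(list(range(100, 110)))  # 10 tokens, so 502 pad ids are appended
long_doc = pad_or_truncate(list(range(1000)))       # 1000 tokens, cut to the first 512
print(len(short_doc), len(long_doc))
```

Both results have the same length, which is what lets documents of different sizes be stacked into one batch.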
+
+ ## Training
+ The model was trained on a single NVIDIA Tesla A100 40GB GPU for 60 hours. During training it saw 3712K documents, corresponding to roughly 5.5 epochs.
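The two training figures imply an approximate number of documents per epoch. The snippet below is only back-of-the-envelope arithmetic on the numbers quoted in this section; the variable names are illustrative.

```python
docs_seen = 3_712_000  # "3712K documents" from the Training section
epochs = 5.5           # "roughly 5.5 epochs"

docs_per_epoch = docs_seen / epochs
print(f"~{docs_per_epoch / 1000:.0f}K documents per epoch")
```

This comes out somewhat below the dataset size quoted in the Dataset section, which would be consistent with part of the data being held out, although the README does not say so.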
+
+ ## Use
+ The example below assumes you are working in the provided Summarizer.ipynb notebook, where the `Summarizer` class is available.
+ ```python
+ from collections import OrderedDict
+
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ # Summarizer is not imported here; it is assumed to be defined in Summarizer.ipynb.
+
+ def summ_config():
+     cfg = OrderedDict([
+         # summarization model: checkpoint on the Hugging Face Hub
+         ("model_name", "krotima1/mbart-ht2a-c"),
+         # generation settings passed to the Summarizer
+         ("inference_cfg", OrderedDict([
+             ("num_beams", 4),
+             ("top_k", 40),
+             ("top_p", 0.92),
+             ("do_sample", True),
+             ("temperature", 0.89),
+             ("repetition_penalty", 1.2),
+             ("no_repeat_ngram_size", None),
+             ("early_stopping", True),
+             ("max_length", 96),
+             ("min_length", 10),
+         ])),
+         # texts to summarize
+         ("text",
+             [
+                 "Input your Czech text",
+             ]
+         ),
+     ])
+     return cfg
+
+ cfg = summ_config()
+ # load model and tokenizer
+ model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
+ tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
+ # init summarizer
+ summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
+ summarize(cfg["text"])
+ ```
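For readers without the notebook, the following is a minimal sketch of how the `inference_cfg` entries could be passed straight to `generate` in the `transformers` library. The helper names `build_generate_kwargs` and `summarize_texts` are hypothetical, and this is not necessarily what `Summarizer` does internally; in particular, any language-code handling the notebook performs for mBART is not reproduced here.

```python
from typing import Any, Dict, List, Mapping


def build_generate_kwargs(inference_cfg: Mapping[str, Any]) -> Dict[str, Any]:
    """Drop entries set to None so generate() falls back to its own defaults."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


def summarize_texts(texts: List[str], model_name: str, inference_cfg: Mapping[str, Any]) -> List[str]:
    """Hypothetical stand-in for the notebook's Summarizer (downloads the checkpoint)."""
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Truncate inputs to the 512-token limit used during training.
    inputs = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
    output_ids = model.generate(**inputs, **build_generate_kwargs(inference_cfg))
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Only the config filtering runs here; no model is downloaded.
example_cfg = {"num_beams": 4, "no_repeat_ngram_size": None, "max_length": 96, "min_length": 10}
generate_kwargs = build_generate_kwargs(example_cfg)
print(generate_kwargs)
```

Calling `summarize_texts(cfg["text"], cfg["model_name"], cfg["inference_cfg"])` would then return one summary per input text, assuming the checkpoint can be downloaded.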