ctu-aic/mbart25-multilingual-summarization-multilarge-cs

@@ -36,6 +36,60 @@ This model is a fine-tuned checkpoint of [facebook/mbart-large-cc25](https://hug
 ## Task
 The model performs multi-sentence summarization in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization quality in Czech. Supported languages: 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'.
 ## Dataset
 The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
 ```
@@ -83,7 +137,3 @@ tloss: 3.365 - 1.445
 | mlsum-ru  | 1.25    | 1.54    | 1.31      | 0.46   | 0.46   | 0.44      | 1.25   | 1.54   | 1.31    |
 | cnewsum   | 26.43   | 29.44   | 26.38     | 7.38   | 8.52   | 7.46      | 25.99  | 28.94  | 25.92   |
-# USAGE
-```
-soon
-```

 ## Task
 The model performs multi-sentence summarization in eight languages. By adding documents in other languages to a considerable amount of Czech documents, we aimed to improve summarization quality in Czech. Supported languages: 'en_XX': 'en', 'de_DE': 'de', 'es_XX': 'es', 'fr_XX': 'fr', 'ru_RU': 'ru', 'tr_TR': 'tr'.
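The language list above pairs mBART language codes with short language codes. A minimal sketch of how that pairing could be inverted for lookup follows; the names `MBART_TO_SHORT`, `SHORT_TO_MBART`, and `mbart_code` are hypothetical and not part of the repository, and only the six pairs listed above are included.

```python
# Pairs copied from the "Supported languages" list in the Task section.
MBART_TO_SHORT = {
    "en_XX": "en",
    "de_DE": "de",
    "es_XX": "es",
    "fr_XX": "fr",
    "ru_RU": "ru",
    "tr_TR": "tr",
}

# Inverted view: short language code -> mBART language code.
SHORT_TO_MBART = {short: code for code, short in MBART_TO_SHORT.items()}


def mbart_code(language: str) -> str:
    """Hypothetical helper: look up the mBART code for a short language code."""
    try:
        return SHORT_TO_MBART[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None


print(mbart_code("de"))
```

Raising a `ValueError` for unknown codes is one possible design choice; it makes a mistyped `language` value fail early rather than silently falling back to another language.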
+# USAGE
+This example assumes that you are using the provided MultilingualSummarizer.ipynb notebook and the accompanying files from the git repository.
+```
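+## Configuration of the summarization pipeline
+# (MultiSummarizer is provided by the notebook and repository files)
+from collections import OrderedDict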
+def summ_config():
+    cfg = OrderedDict([
+        ## summarization model - checkpoint
+        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
+        ("model_name", "ctu-aic/mbart25-multilingual-summarization-multilarge-cs"),
+        ## language of summarization task
+        #   language : string : cs, en, de, fr, es, tr, ru, zh
+        ("language", "en"),
+        ## generation method parameters in dictionary
+        #
+        ("inference_cfg", OrderedDict([
+            ("num_beams", 4),
+            ("top_k", 40),
+            ("top_p", 0.92),
+            ("do_sample", True),
+            ("temperature", 0.95),
+            ("repetition_penalty", 1.23),
+            ("no_repeat_ngram_size", None),
+            ("early_stopping", True),
+            ("max_length", 128),
+            ("min_length", 10),
+        ])),
+        ## texts to summarize; accepted values: list of strings, a single string, or a dataset
+        ("texts",
+            [
+               "english text1 to summarize",
+               "english text2 to summarize",
+            ]
+        ),
+        ## OPTIONAL: target summaries; accepted values: list of strings, a single string, or None
+        ('golds',
+         [
+               "target english text1",
+               "target english text2",
+         ]),
+        #('golds', None),
+    ])
+    return cfg
+cfg = summ_config()
+msummarizer = MultiSummarizer(**cfg)
+ret = msummarizer(**cfg)
+```
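The keys of `inference_cfg` above match keyword arguments of the Hugging Face `generate()` method. The following is a minimal sketch, assuming (this is not confirmed by the source) that such a dictionary is eventually forwarded as keyword arguments; `to_generation_kwargs` is a hypothetical helper, not part of `MultiSummarizer`. It drops entries set to `None`, such as `no_repeat_ngram_size`, so that only explicitly chosen values are passed on.

```python
from collections import OrderedDict


def to_generation_kwargs(inference_cfg):
    """Hypothetical helper: keep only the options that were explicitly set."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


# Same values as the inference_cfg in the usage example.
inference_cfg = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.95),
    ("repetition_penalty", 1.23),
    ("no_repeat_ngram_size", None),
    ("early_stopping", True),
    ("max_length", 128),
    ("min_length", 10),
])

generation_kwargs = to_generation_kwargs(inference_cfg)
print(sorted(generation_kwargs))
```

With these settings, beam search (`num_beams`) and sampling (`do_sample`, `top_k`, `top_p`, `temperature`) are requested together, so both groups of options remain in the filtered dictionary.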
 ## Dataset
 The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news and daily mails. Training used the entire training set and 72% of the validation set.
 ```
 | mlsum-ru  | 1.25    | 1.54    | 1.31      | 0.46   | 0.46   | 0.44      | 1.25   | 1.54   | 1.31    |
 | cnewsum   | 26.43   | 29.44   | 26.38     | 7.38   | 8.52   | 7.46      | 25.99  | 28.94  | 25.92   |