julien-c HF staff committed on
Commit ded9297 · 1 Parent(s): 3cf4395

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/Musixmatch/umberto-wikipedia-uncased-v1/README.md

Files changed (1)
  1. README.md +117 -0
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ language: it
+ ---
+
+ # UmBERTo Wikipedia Uncased
+
+ [UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at [github.com/huggingface/transformers](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1).
+
+ <p align="center">
+ <img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> </br>
+ Marco Lodola, Monument to Umberto Eco, Alessandria 2019
+ </p>
+
+ ## Dataset
+ UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7GB) extracted from [Wikipedia-ITA](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
+
+ ## Pre-trained model
+
+ | Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
+ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+ | `umberto-wikipedia-uncased-v1` | YES | NO | SPM | 32K | 100k | [Link](http://bit.ly/35wbSj6) |
+
+ This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.
+
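To illustrate how the two techniques combine, here is a minimal sketch of Whole Word Masking over SentencePiece-style pieces. It assumes the SentencePiece convention that a piece beginning with `▁` starts a new word. The piece list, the `group_into_words` and `whole_word_mask` helpers, and the 15% masking rate are illustrative assumptions, not taken from the UmBERTo training code, and the sketch always substitutes the mask token rather than reproducing any particular replacement policy.

```python
import random

WORD_START = "\u2581"  # SentencePiece marks the first piece of each word with "▁"


def group_into_words(pieces):
    """Group piece indices so that each group covers one whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])      # a new word starts here
        else:
            words[-1].append(i)    # continuation piece of the current word
    return words


def whole_word_mask(pieces, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Mask whole words: every piece of a selected word is replaced together."""
    rng = random.Random(seed)
    words = group_into_words(pieces)
    n_to_mask = max(1, round(len(words) * mask_prob))
    masked = list(pieces)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = mask_token
    return masked


# Hypothetical segmentation of "umberto eco è stato un grande scrittore"
pieces = ["▁umberto", "▁eco", "▁è", "▁stato", "▁un", "▁grande", "▁scri", "ttore"]
print(whole_word_mask(pieces))
```

The point of grouping first is that a word split into several pieces (here the hypothetical `▁scri` + `ttore`) is either fully masked or left fully visible, so the model cannot recover it from a leftover fragment.
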
+ ## Downstream Tasks
+ These results refer to the umberto-wikipedia-uncased model. All details are on the [UmBERTo](https://github.com/musixmatchresearch/umberto) official page.
+
+ #### Named Entity Recognition (NER)
+
+ | Dataset | F1 | Precision | Recall | Accuracy |
+ | ------ | ------ | ------ | ------ | ------ |
+ | **ICAB-EvalITA07** | **86.240** | 85.939 | 86.544 | 98.534 |
+ | **WikiNER-ITA** | **90.483** | 90.328 | 90.638 | 98.661 |
+
+ #### Part of Speech (POS)
+
+ | Dataset | F1 | Precision | Recall | Accuracy |
+ | ------ | ------ | ------ | ------ | ------ |
+ | **UD_Italian-ISDT** | 98.563 | 98.508 | 98.618 | **98.717** |
+ | **UD_Italian-ParTUT** | 97.810 | 97.835 | 97.784 | **98.060** |
+
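As a quick consistency check on the NER and POS tables above, the sketch below recomputes F1 as the harmonic mean of precision and recall. That definition is an assumption here, since the card does not state how F1 was aggregated; the numbers are copied from the tables, and `f1_score` is a helper written for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the tables above
reported = {
    "ICAB-EvalITA07": (85.939, 86.544, 86.240),
    "WikiNER-ITA": (90.328, 90.638, 90.483),
    "UD_Italian-ISDT": (98.508, 98.618, 98.563),
    "UD_Italian-ParTUT": (97.835, 97.784, 97.810),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.3f}")
```

Under that assumption, the recomputed values agree with the reported ones to within about 0.001, which is what rounding the inputs to three decimals would allow.
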
+ ## Usage
+
+ ##### Load UmBERTo Wikipedia Uncased with AutoModel and AutoTokenizer:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
+ umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
+
+ encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
+ input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
+ outputs = umberto(input_ids)
+ last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
+ ```
+
+ ##### Predict masked token:
+
+ ```python
+ from transformers import pipeline
+
+ fill_mask = pipeline(
+     "fill-mask",
+     model="Musixmatch/umberto-wikipedia-uncased-v1",
+     tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
+ )
+
+ result = fill_mask("Umberto Eco è <mask> un grande scrittore")
+ # {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
+ # {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
+ # {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
+ # {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
+ # {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
+ ```
+
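The pipeline returns a list of dictionaries like the ones shown in the comments above. A small sketch of post-processing them follows. The `clean_sequence` helper is hypothetical, and the sample values are copied from the output above rather than produced by running the model. Other versions of transformers may format the `sequence` field differently (for example, without the special tokens), so treat the stripping step as an assumption.

```python
def clean_sequence(sequence, special_tokens=("<s>", "</s>")):
    """Remove special tokens from a predicted sequence and trim whitespace."""
    for token in special_tokens:
        sequence = sequence.replace(token, "")
    return sequence.strip()


# Two of the predictions listed above, copied verbatim
result = [
    {"sequence": "<s> umberto eco è stato un grande scrittore</s>", "score": 0.5784581303596497, "token": 361},
    {"sequence": "<s> umberto eco è anche un grande scrittore</s>", "score": 0.33813193440437317, "token": 269},
]

# Pick the highest-scoring prediction and show it without special tokens
best = max(result, key=lambda r: r["score"])
print(clean_sequence(best["sequence"]), round(best["score"], 3))
```
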
+ ## Citation
+ All of the original datasets are publicly available or were released with the owners' permission. All are released under a CC0 or CC BY license.
+
+ * UD Italian-ISDT Dataset [GitHub](https://github.com/UniversalDependencies/UD_Italian-ISDT)
+ * UD Italian-ParTUT Dataset [GitHub](https://github.com/UniversalDependencies/UD_Italian-ParTUT)
+ * I-CAB (Italian Content Annotation Bank), EvalITA [Page](http://www.evalita.it/)
+ * WikiNER [Page](https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500), [Paper](https://www.sciencedirect.com/science/article/pii/S0004370212000276?via%3Dihub)
+
+ ```
+ @inproceedings{magnini2006annotazione,
+   title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
+   author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
+   booktitle = {Proc. of SILFI 2006},
+   year = {2006}
+ }
+ @inproceedings{magnini2006cab,
+   title = {I-CAB: the Italian Content Annotation Bank.},
+   author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
+   booktitle = {LREC},
+   pages = {963--968},
+   year = {2006},
+   organization = {Citeseer}
+ }
+ ```
+
+ ## Authors
+
+ **Loreto Parisi**: `loreto at musixmatch dot com`, [loretoparisi](https://github.com/loretoparisi)
+ **Simone Francia**: `simone.francia at musixmatch dot com`, [simonefrancia](https://github.com/simonefrancia)
+ **Paolo Magnani**: `paul.magnani95 at gmail dot com`, [paulthemagno](https://github.com/paulthemagno)
+
+ ## About Musixmatch AI
+ ![Musixmatch AI mac app icon-128](https://user-images.githubusercontent.com/163333/72244273-396aa380-35ee-11ea-894b-4ea48230c02b.png)
+ We do Machine Learning and Artificial Intelligence @[musixmatch](https://twitter.com/Musixmatch)
+ Follow us on [Twitter](https://twitter.com/musixmatchai) and [GitHub](https://github.com/musixmatchresearch)
+