Fill-Mask
Transformers
PyTorch
Safetensors
distilbert
Generated from Trainer
Inference Endpoints
Sakonii committed on
Commit 723fe4e
1 Parent(s): e5bd71a

Update README.md

Files changed (1)
  1. README.md +90 -13
README.md CHANGED
@@ -1,38 +1,103 @@
  ---
  license: apache-2.0
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # distilbert-base-nepali

- This model is a fine-tuned version of [Sakonii/distilbert-base-nepali](https://huggingface.co/Sakonii/distilbert-base-nepali) on the None dataset.
  It achieves the following results on the evaluation set:
- - Loss: 2.5139

  ## Model description

- More information needed

  ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure

  ### Training hyperparameters

- The following hyperparameters were used during training:
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
@@ -44,9 +109,21 @@ The following hyperparameters were used during training:

  ### Training results

- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:-----:|:---------------:|
- | 2.6412 | 1.0 | 35715 | 2.5161 |

  ### Framework versions
 
  ---
  license: apache-2.0
+ mask_token: "<mask>"
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
+ widget:
+ - text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
+   example_title: "Example 1"
+ - text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
+   example_title: "Example 2"
+ - text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
+   example_title: "Example 3"
  ---

  # distilbert-base-nepali

+ This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a SentencePiece model (SPM) for text tokenization, similar to [XLM-ROBERTa](https://arxiv.org/abs/1911.02116), and trains the [distilbert model](https://arxiv.org/abs/1910.01108) for language modeling.
+
  It achieves the following results on the evaluation set:
+
+ | MLM Probability | Evaluation Loss | Evaluation Perplexity |
+ |----------------:|----------------:|----------------------:|
+ | 15% | 2.349 | 10.479 |
+ | 20% | 2.605 | 13.351 |
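
The reported perplexity is the exponential of the evaluation (cross-entropy) loss; a quick check for the 15% MLM-probability row, using the more precise loss value given in the *Training results* section below:

```python
import math

# perplexity = exp(cross-entropy loss)
print(math.exp(2.3494))  # ~10.479, matching the reported evaluation perplexity at 15% MLM probability
```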

  ## Model description

+ Refer to the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased).

  ## Intended uses & limitations

+ This backbone model is intended to be fine-tuned on Nepali-language-focused downstream tasks such as sequence classification, token classification or question answering.
+ As the language model was trained on data with texts grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
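
For example, a minimal fine-tuning sketch for sequence classification; the tiny in-memory dataset, label count and output directory here are placeholders for a real Nepali classification setup, not details from this model's training:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2)  # fresh classification head on the backbone

# Placeholder dataset with 'text' and 'label' columns; replace with a real labeled dataset.
train_ds = Dataset.from_dict({'text': ['नमूना वाक्य एक', 'नमूना वाक्य दुई'], 'label': [0, 1]})

def tokenize(batch):
    # Truncate to the 512-token limit the backbone was pre-trained with.
    return tokenizer(batch['text'], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='distilbert-nepali-classifier'),
    train_dataset=train_ds.map(tokenize, batched=True),
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```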
+
+ ## Usage
+
+ This model can be used directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
+ >>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
+
+ [{'score': 0.04128897562623024,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 2605,
+   'token_str': 'मौसम'},
+  {'score': 0.04100276157259941,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 2792,
+   'token_str': 'प्रकृति'},
+  {'score': 0.026525357738137245,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 387,
+   'token_str': 'पानी'},
+  {'score': 0.02340106852352619,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 1313,
+   'token_str': 'जल'},
+  {'score': 0.02055591531097889,
+   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
+   'token': 790,
+   'token_str': 'वातावरण'}]
+ ```
+
+ Here is how we can use the model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
+ model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
+
+ # prepare input
+ text = "चाहिएको text यता राख्नु होला।"
+ encoded_input = tokenizer(text, return_tensors='pt')
+
+ # forward pass
+ output = model(**encoded_input)
+ ```
+
+ ## Training data
+
+ This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines [OSCAR](https://huggingface.co/datasets/oscar), [cc100](https://huggingface.co/datasets/cc100) and a set of Nepali articles scraped from Wikipedia.
+ For training the language model, the texts in the training set are grouped into blocks of 512 tokens.
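
A sketch of that grouping step, following the grouping recipe used in the Hugging Face Transformers language-modeling examples (the actual preprocessing script for this model is not included in the card):

```python
from itertools import chain

block_size = 512  # tokens per training instance, as described above

def group_texts(examples):
    # Concatenate all tokenized texts, then cut the result into fixed 512-token blocks;
    # the leftover tail shorter than one block is dropped.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Typically applied to an already-tokenized dataset, e.g.:
# lm_dataset = tokenized_dataset.map(group_texts, batched=True)
```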
+
+ ## Tokenization
+
+ A SentencePiece model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer was trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.
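
A minimal sketch of training such a tokenizer with the Hugging Face tokenizers library: the reported vocab-size, min-frequency and limit-alphabet map directly onto trainer arguments, while the trainer class, training file and special tokens below are assumptions rather than details from the card:

```python
from tokenizers import SentencePieceBPETokenizer

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(
    files=['nepalitext_subset.txt'],  # placeholder path to the raw-text subset
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=['<s>', '</s>', '<unk>', '<pad>', '<mask>'],  # assumed special tokens
)
tokenizer.save('nepali-spm-tokenizer.json')
# model-max-length=512 is a property of the wrapping transformers tokenizer, not of this training step.
```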
 
  ## Training procedure

+ The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased); 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
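
As a schematic of that setup (not the actual training script), the MLM objective and the hyperparameters listed below translate to Transformers roughly as follows; the tiny in-memory corpus and the choice to start from the published checkpoint are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')

# Placeholder corpus; the real run used the 512-token blocks described under 'Training data'.
texts = ['नेपाल दक्षिण एसियामा अवस्थित एक देश हो।'] * 8
lm_dataset = Dataset.from_dict({'text': texts}).map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=512),
    batched=True, remove_columns=['text'])

# Random masking for the MLM objective; 15% or 20% depending on the epoch (see the tables below).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='distilbert-base-nepali-mlm',
    learning_rate=5e-05,
    per_device_train_batch_size=28,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=lm_dataset, eval_dataset=lm_dataset)
trainer.train()
print(trainer.evaluate())  # reports eval_loss; perplexity is exp(eval_loss)
```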
+
  ### Training hyperparameters

+ The following hyperparameters were used for training the final epoch (refer to the *Training results* table below for the hyperparameters that varied across epochs):
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
 
  ### Training results

+ The model is trained for 5 epochs with varying hyperparameters:
+
+ | Training Loss | Epoch | MLM Probability (%) | Train Batch Size | Step | Validation Loss | Perplexity |
+ |:-------------:|:-----:|:-------------------:|:----------------:|:-----:|:---------------:|:----------:|
+ | 3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
+ | 2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
+ | 2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
+ | 2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
+ | 2.6412 | 5.0 | 20 | 28 | 35715 | 2.5161 | 12.3802 |
+
+ The final model, evaluated with an MLM probability of 15%:
+
+ | Training Loss | Epoch | MLM Probability (%) | Train Batch Size | Step | Validation Loss | Perplexity |
+ |:-------------:|:-----:|:-------------------:|:----------------:|:-----:|:---------------:|:----------:|
+ | - | - | 15 | - | - | 2.3494 | 10.4791 |
  ### Framework versions