Fill-Mask
Transformers
PyTorch
Safetensors
distilbert
Generated from Trainer
Inference Endpoints
Sakonii committed on
Commit e5bd71a
1 Parent(s): 2d5b2ca

update model card README.md

Files changed (1)
  1. README.md +13 -89
README.md CHANGED
@@ -1,103 +1,38 @@
  ---
  license: apache-2.0
- mask_token: "<mask>"
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
- widget:
- - text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
-   example_title: "Example 1"
- - text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
-   example_title: "Example 2"
- - text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
-   example_title: "Example 3"
  ---

- # distilbert-base-nepali

- This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a SentencePiece Model (SPM) for text tokenization, similar to [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), and trains a [distilbert model](https://arxiv.org/abs/1910.01108) for language modeling.

  It achieves the following results on the evaluation set:
-
- | mlm probability | evaluation loss | evaluation perplexity |
- |--:|----:|-----:|
- | 15% | 2.439 | 11.459 |
- | 20% | 2.605 | 13.351 |

  ## Model description

- Refer to the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) model card.

  ## Intended uses & limitations

- This backbone model is intended to be fine-tuned on Nepali-language downstream tasks such as sequence classification, token classification, or question answering.
- Because the language model was trained on text grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
-
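As a rough illustration of this intended use, the backbone can be loaded with a task-specific head for fine-tuning. The snippet below is a minimal sketch rather than a recipe from this card; `num_labels=2` is a placeholder and the input sentence is only the card's placeholder text:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained backbone with a freshly initialized classification head.
# num_labels=2 is a placeholder; set it to match the downstream task.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2
)

# Inputs longer than the 512-token limit are truncated.
inputs = tokenizer("चाहिएको text यता राख्नु होला।", truncation=True, max_length=512, return_tensors='pt')
outputs = model(**inputs)  # outputs.logits: scores from the (not yet fine-tuned) head
```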
- ## Usage
-
- This model can be used directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
- >>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
-
- [{'score': 0.04128897562623024,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 2605,
-   'token_str': 'मौसम'},
-  {'score': 0.04100276157259941,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 2792,
-   'token_str': 'प्रकृति'},
-  {'score': 0.026525357738137245,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 387,
-   'token_str': 'पानी'},
-  {'score': 0.02340106852352619,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 1313,
-   'token_str': 'जल'},
-  {'score': 0.02055591531097889,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 790,
-   'token_str': 'वातावरण'}]
- ```
-
- Here is how we can use the model to get the features of a given text in PyTorch:
-
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
-
- tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
- model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
-
- # prepare input
- text = "चाहिएको text यता राख्नु होला।"
- encoded_input = tokenizer(text, return_tensors='pt')
-
- # forward pass
- output = model(**encoded_input)
- ```
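Note that `AutoModelForMaskedLM` returns MLM logits; if hidden-state features are what is needed, one option (an illustrative variant, not from the original card) is to load the bare encoder instead:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
encoder = AutoModel.from_pretrained('Sakonii/distilbert-base-nepali')

encoded_input = tokenizer("चाहिएको text यता राख्नु होला।", return_tensors='pt')
# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
features = encoder(**encoded_input).last_hidden_state
```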
-
- ## Training data
-
- This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of Nepali articles scraped from Wikipedia.
- For language model training, the texts in the training set are grouped into blocks of 512 tokens.
-
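Grouping texts into fixed 512-token blocks is typically a concatenate-and-chunk preprocessing step. The sketch below only illustrates that idea and is not the exact script used for this model; the function and field names are assumptions:

```python
block_size = 512

def group_texts(examples):
    # Concatenate all tokenized sequences in the batch, then split into fixed-size blocks.
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]
    return {"input_ids": blocks}

# Typically applied after tokenization, e.g. with datasets' Dataset.map(group_texts, batched=True).
```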
- ## Tokenization
-
- A SentencePiece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.
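The card does not include the tokenizer-training script. As a hedged sketch, the listed settings map onto the `tokenizers` library roughly as follows; the choice of `SentencePieceBPETokenizer`, the input file, and the special-token list are assumptions, not details from the card:

```python
from tokenizers import SentencePieceBPETokenizer

# Hypothetical reconstruction: the trainer class and corpus file are assumptions.
spm_tokenizer = SentencePieceBPETokenizer()
spm_tokenizer.train(
    files=["nepalitext_subset.txt"],  # placeholder path to a subset of the nepalitext corpus
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# model-max-length=512 is set on the wrapping transformers tokenizer rather than on the SPM itself.
```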
 
  ## Training procedure

- The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased): 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
-
  ### Training hyperparameters

- The following hyperparameters were used for training the final epoch (refer to the *Training results* table below for the hyperparameters that vary per epoch):
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
@@ -109,20 +44,9 @@ The following hyperparameters were used for training of the final epoch: [ Refer

  ### Training results

- The model is trained for 4 epochs with varying hyperparameters:
-
- | Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
- |:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
- | 3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
- | 2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
- | 2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
- | 2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
-
- Final model evaluated with MLM Probability of 15%:
-
- | Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
- |:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
- | - | - | 15 | - | - | 2.4388 | 11.4589 |
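The Perplexity column appears to be the exponential of the validation loss; a quick check against the final-model row (not part of the original card):

```python
import math

eval_loss = 2.4388
print(math.exp(eval_loss))  # ≈ 11.459, matching the reported perplexity of 11.4589
```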
 
  ### Framework versions
 
  ---
  license: apache-2.0
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # distilbert-base-nepali

+ This model is a fine-tuned version of [Sakonii/distilbert-base-nepali](https://huggingface.co/Sakonii/distilbert-base-nepali) on the None dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 2.5139

  ## Model description

+ More information needed

  ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

  ## Training procedure

  ### Training hyperparameters

+ The following hyperparameters were used during training:
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
44
 
45
  ### Training results
46
 
47
+ | Training Loss | Epoch | Step | Validation Loss |
48
+ |:-------------:|:-----:|:-----:|:---------------:|
49
+ | 2.6412 | 1.0 | 35715 | 2.5161 |
 
 
 
 
 
 
 
 
 
 
 
50
 
51
 
52
  ### Framework versions