Fill-Mask
Transformers
PyTorch
Safetensors
distilbert
Generated from Trainer
Inference Endpoints
Sakonii committed on
Commit e5bd71a
1 Parent(s): 2d5b2ca

update model card README.md

Files changed (1)
  1. README.md +13 -89
README.md CHANGED
@@ -1,103 +1,38 @@
  ---
  license: apache-2.0
- mask_token: "<mask>"
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
- widget:
- - text: "मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।"
-   example_title: "Example 1"
- - text: "अचेल विद्यालय र कलेजहरूले स्मारिका कत्तिको प्रकाशन गर्छन्, यकिन छैन । केही वर्षपहिलेसम्म गाउँसहरका सानाठूला <mask> संस्थाहरूमा पुग्दा शिक्षक वा कर्मचारीले संस्थाबाट प्रकाशित पत्रिका, स्मारिका र पुस्तक कोसेलीका रूपमा थमाउँथे ।"
-   example_title: "Example 2"
- - text: "जलविद्युत् विकासको ११० वर्षको इतिहास बनाएको नेपालमा हाल सरकारी र निजी क्षेत्रबाट गरी करिब २ हजार मेगावाट <mask> उत्पादन भइरहेको छ ।"
-   example_title: "Example 3"
  ---

- # distilbert-base-nepali

- This model is pre-trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset, which consists of over 13 million Nepali text sequences, using a masked language modeling (MLM) objective. Our approach trains a SentencePiece Model (SPM) for text tokenization, similar to [XLM-RoBERTa](https://arxiv.org/abs/1911.02116), and trains a [distilbert model](https://arxiv.org/abs/1910.01108) for language modeling.

  It achieves the following results on the evaluation set:
-
- | mlm probability | evaluation loss | evaluation perplexity |
- |--:|----:|-----:|
- | 15% | 2.439 | 11.459 |
- | 20% | 2.605 | 13.351 |

  ## Model description

- Refer to the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) model card.

  ## Intended uses & limitations

- This backbone model is intended to be fine-tuned on Nepali-language downstream tasks such as sequence classification, token classification, or question answering.
- Because the language model was trained on text grouped into blocks of 512 tokens, it handles text sequences of up to 512 tokens and may not perform satisfactorily on shorter sequences.
-
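As a rough illustration of this intended use, the backbone can be loaded with a task-specific head for fine-tuning. The snippet below is a minimal sketch rather than a recipe from this card; `num_labels=2` is a placeholder and the input sentence is only the card's placeholder text:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained backbone with a freshly initialized classification head.
# num_labels=2 is a placeholder; set it to match the downstream task.
tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
model = AutoModelForSequenceClassification.from_pretrained(
    'Sakonii/distilbert-base-nepali', num_labels=2
)

# Inputs longer than the 512-token limit are truncated.
inputs = tokenizer("चाहिएको text यता राख्नु होला।", truncation=True, max_length=512, return_tensors='pt')
outputs = model(**inputs)  # outputs.logits: scores from the (not yet fine-tuned) head
```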
- ## Usage
-
- This model can be used directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='Sakonii/distilbert-base-nepali')
- >>> unmasker("मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, <mask>, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।")
-
- [{'score': 0.04128897562623024,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, मौसम, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 2605,
-   'token_str': 'मौसम'},
-  {'score': 0.04100276157259941,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, प्रकृति, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 2792,
-   'token_str': 'प्रकृति'},
-  {'score': 0.026525357738137245,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, पानी, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 387,
-   'token_str': 'पानी'},
-  {'score': 0.02340106852352619,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, जल, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 1313,
-   'token_str': 'जल'},
-  {'score': 0.02055591531097889,
-   'sequence': 'मानविय गतिविधिले प्रातृतिक पर्यावरन प्रनालीलाई अपरिमेय क्षति पु्र्याएको छ। परिवर्तनशिल जलवायुले खाध, सुरक्षा, वातावरण, जमिन, मौसमलगायतलाई असंख्य तरिकाले प्रभावित छ।',
-   'token': 790,
-   'token_str': 'वातावरण'}]
- ```
-
- Here is how we can use the model to get the features of a given text in PyTorch:
-
- ```python
- from transformers import AutoTokenizer, AutoModelForMaskedLM
-
- tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
- model = AutoModelForMaskedLM.from_pretrained('Sakonii/distilbert-base-nepali')
-
- # prepare input
- text = "चाहिएको text यता राख्नु होला।"
- encoded_input = tokenizer(text, return_tensors='pt')
-
- # forward pass
- output = model(**encoded_input)
- ```
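Note that `AutoModelForMaskedLM` returns MLM logits; if hidden-state features are what is needed, one option (an illustrative variant, not from the original card) is to load the bare encoder instead:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Sakonii/distilbert-base-nepali')
encoder = AutoModel.from_pretrained('Sakonii/distilbert-base-nepali')

encoded_input = tokenizer("चाहिएको text यता राख्नु होला।", return_tensors='pt')
# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
features = encoder(**encoded_input).last_hidden_state
```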
-
- ## Training data
-
- This model is trained on the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) language modeling dataset, which combines the [OSCAR](https://huggingface.co/datasets/oscar) and [cc100](https://huggingface.co/datasets/cc100) datasets with a set of Nepali articles scraped from Wikipedia.
- For language model training, the texts in the training set are grouped into blocks of 512 tokens.
-
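Grouping texts into fixed 512-token blocks is typically a concatenate-and-chunk preprocessing step. The sketch below only illustrates that idea and is not the exact script used for this model; the function and field names are assumptions:

```python
block_size = 512

def group_texts(examples):
    # Concatenate all tokenized sequences in the batch, then split into fixed-size blocks.
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]
    return {"input_ids": blocks}

# Typically applied after tokenization, e.g. with datasets' Dataset.map(group_texts, batched=True).
```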
- ## Tokenization
-
- A SentencePiece Model (SPM) is trained on a subset of the [nepalitext](https://huggingface.co/datasets/Sakonii/nepalitext-language-model-dataset) dataset for text tokenization. The tokenizer is trained with vocab-size=24576, min-frequency=4, limit-alphabet=1000 and model-max-length=512.
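The card does not include the tokenizer-training script. As a hedged sketch, the listed settings map onto the `tokenizers` library roughly as follows; the choice of `SentencePieceBPETokenizer`, the input file, and the special-token list are assumptions, not details from the card:

```python
from tokenizers import SentencePieceBPETokenizer

# Hypothetical reconstruction: the trainer class and corpus file are assumptions.
spm_tokenizer = SentencePieceBPETokenizer()
spm_tokenizer.train(
    files=["nepalitext_subset.txt"],  # placeholder path to a subset of the nepalitext corpus
    vocab_size=24576,
    min_frequency=4,
    limit_alphabet=1000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# model-max-length=512 is set on the wrapping transformers tokenizer rather than on the SPM itself.
```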
 
  ## Training procedure

- The model is trained with the same configuration as the original [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased): 512 tokens per instance, 28 instances per batch, and around 35.7K training steps.
-
  ### Training hyperparameters

- The following hyperparameters were used for training the final epoch (refer to the *Training results* table below for the hyperparameters that vary per epoch):
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
@@ -109,20 +44,9 @@ The following hyperparameters were used for training of the final epoch: [ Refer

  ### Training results

- The model is trained for 4 epochs with varying hyperparameters:
-
- | Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
- |:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
- | 3.4477 | 1.0 | 15 | 26 | 38864 | 3.3067 | 27.2949 |
- | 2.9451 | 2.0 | 15 | 28 | 35715 | 2.8238 | 16.8407 |
- | 2.866 | 3.0 | 20 | 28 | 35715 | 2.7431 | 15.5351 |
- | 2.7287 | 4.0 | 20 | 28 | 35715 | 2.6053 | 13.5353 |
-
- Final model evaluated with MLM Probability of 15%:
-
- | Training Loss | Epoch | MLM Probability | Train Batch Size | Step | Validation Loss | Perplexity |
- |:-------------:|:-----:|:---------------:|:----------------:|:-----:|:---------------:|:----------:|
- | - | - | 15 | - | - | 2.4388 | 11.4589 |
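The Perplexity column appears to be the exponential of the validation loss; a quick check against the final-model row (not part of the original card):

```python
import math

eval_loss = 2.4388
print(math.exp(eval_loss))  # ≈ 11.459, matching the reported perplexity of 11.4589
```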
 
  ### Framework versions
 
  ---
  license: apache-2.0
  tags:
  - generated_from_trainer
  model-index:
  - name: distilbert-base-nepali
    results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->

+ # distilbert-base-nepali

+ This model is a fine-tuned version of [Sakonii/distilbert-base-nepali](https://huggingface.co/Sakonii/distilbert-base-nepali) on the None dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 2.5139

  ## Model description

+ More information needed

  ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

  ## Training procedure

  ### Training hyperparameters

+ The following hyperparameters were used during training:
  - learning_rate: 5e-05
  - train_batch_size: 28
  - eval_batch_size: 8
44
 
45
  ### Training results
46
 
47
+ | Training Loss | Epoch | Step | Validation Loss |
48
+ |:-------------:|:-----:|:-----:|:---------------:|
49
+ | 2.6412 | 1.0 | 35715 | 2.5161 |
 
 
 
 
 
 
 
 
 
 
 
50
 
51
 
52
  ### Framework versions