---
library_name: transformers
license: apache-2.0
language:
- hi
pipeline_tag: token-classification
---
## Model Details
### BertWordPieceTokenizer
- A WordPiece tokenizer for the Hindi language
#### Usage
```py
from transformers import AutoTokenizer
hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')
hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"
# encode text
encoded_str = hi_tokenizer.encode(hi_str)
# decode text
decoded_str = hi_tokenizer.decode(encoded_str)
```
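
To illustrate what a WordPiece tokenizer does with a word once a vocabulary exists, here is a minimal pure-Python sketch of greedy longest-match-first splitting. The function name `wordpiece_tokenize` and the entries in `toy_vocab` are hypothetical and chosen only for illustration; they are not this model's actual vocabulary, and the real tokenizer also applies normalization and pre-tokenization steps that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No vocabulary entry covers this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary (NOT the vocabulary shipped with this tokenizer).
toy_vocab = {"कित", "##ना", "प्या", "##रा", "है"}

print(wordpiece_tokenize("कितना", toy_vocab))   # ['कित', '##ना']
print(wordpiece_tokenize("प्यारा", toy_vocab))  # ['प्या', '##रा']
print(wordpiece_tokenize("शीतल", toy_vocab))    # ['[UNK]']
```

With the real tokenizer, the actual splits depend on the vocabulary learned from the training corpus, so they will generally differ from this toy output.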
#### Language
- Hindi (`hi`)
#### Training
- For the training procedure, see [Train BertWordPieceTokenizer](https://gist.github.com/kjdeveloper8/57d9e16848cd77df778804c9e2214a78)
#### Dataset
- Trained on BHAAV, a Hindi text corpus for emotion analysis
- Dataset source: [Bhaav](https://github.com/midas-research/bhaav)
- Corpus size: 20,304 sentences
#### Citation
```bibtex
@article{kumar2019bhaav,
title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
journal={arXiv preprint arXiv:1910.04073},
year={2019}
}
```