---
library_name: transformers
license: apache-2.0
language:
- hi
pipeline_tag: token-classification
---

## Model Details

### BertWordPieceTokenizer

- Tokenizer for the Hindi language

#### Usage

```py
from transformers import AutoTokenizer

hi_tokenizer = AutoTokenizer.from_pretrained('krinal/BertWordPieceTokenizer-hi')

hi_str = "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है"

# encode text into a list of token IDs
encoded_str = hi_tokenizer.encode(hi_str)

# decode the token IDs back into text
decoded_str = hi_tokenizer.decode(encoded_str)
```
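
To illustrate what a WordPiece tokenizer does conceptually when it encodes a word, here is a minimal pure-Python sketch of greedy longest-match-first splitting. The function `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical and exist only for illustration; they are not the vocabulary or the implementation shipped with this model.

```py
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"आज", "कित", "##ना", "प्या", "##रा"}

print(wordpiece_tokenize("कितना", toy_vocab))
print(wordpiece_tokenize("शीतल", toy_vocab))
```

A word that cannot be fully covered by vocabulary pieces is mapped to the unknown token as a whole, which is why vocabulary coverage of the training corpus matters.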

#### Language

- Hindi (`hi`)

#### Training

- For training, see [Train BertWordPieceTokenizer](https://gist.github.com/kjdeveloper8/57d9e16848cd77df778804c9e2214a78)
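
As a rough idea of what training such a tokenizer can look like, the following is a minimal sketch using `BertWordPieceTokenizer` from the `tokenizers` library. It is not the script from the linked gist: the two-sentence in-memory `corpus`, the `vocab_size` and `min_frequency` values, and the `lowercase=False` / `strip_accents=False` settings are all assumptions chosen for illustration.

```py
from tokenizers import BertWordPieceTokenizer

# Tiny in-memory stand-in corpus (hypothetical); the actual model was
# trained on the BHAAV dataset.
corpus = [
    "आज का सूर्य देखो, कितना प्यारा, कितना शीतल है",
    "आज का दिन कितना प्यारा है",
]

# Assumption: disable lowercasing and accent stripping, since accent
# stripping can remove some Devanagari combining marks (for example the virama).
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

# Learn a WordPiece vocabulary from the corpus. The sizes here are
# illustrative and far smaller than a real setup would use.
tokenizer.train_from_iterator(
    corpus,
    vocab_size=200,
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# Encode a short string with the freshly trained tokenizer.
encoding = tokenizer.encode("आज का सूर्य")
print(encoding.tokens)
print(encoding.ids)
```

With a real corpus, the same call pattern applies; only the input iterator and the vocabulary size would change.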
  
#### Dataset

- Trained on BHAAV, a Hindi text corpus for emotion analysis (20,304 sentences)
- Dataset source: [Bhaav](https://github.com/midas-research/bhaav)

#### Citation

```bibtex
@article{kumar2019bhaav,
  title={BHAAV-A Text Corpus for Emotion Analysis from Hindi Stories},
  author={Kumar, Yaman and Mahata, Debanjan and Aggarwal, Sagar and Chugh, Anmol and Maheshwari, Rajat and Shah, Rajiv Ratn},
  journal={arXiv preprint arXiv:1910.04073},
  year={2019}
}
```