system HF staff committed on
Commit
75d5fd6
·
1 Parent(s): 82c3c0c

Update README.md

  1. README.md +39 -7
README.md CHANGED
@@ -4,13 +4,25 @@ language: hi
4
 
5
  # Releasing Hindi ELECTRA model
6
 
7
- This is a first attempt at a Hindi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra). **I don't modify ELECTRA until we get into finetuning**
 
 
8
 
9
  <a href="https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_">Tokenization and training CoLab</a>
10
 
11
- <a href="https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81">Blog post</a>
 
 
 
 
 
 
 
12
 
13
- I was greatly influenced by: https://huggingface.co/blog/how-to-train
 
 
 
14
 
15
  ## Corpus
16
 
@@ -28,7 +40,7 @@ Bonus notes:
28
  https://drive.google.com/file/d/1-6tXrii3tVxjkbrpSJE9MOG_HhbvP66V/view?usp=sharing
29
 
30
  Bonus notes:
31
- - Created with HuggingFace Tokenizers; could be longer or shorter, review ELECTRA vocab_size param
32
 
33
  ## Training
34
 
@@ -50,8 +62,28 @@ CoLab notebook gives examples of GPU vs. TPU setup
50
 
51
  [configure_pretraining.py](https://github.com/google-research/electra/blob/master/configure_pretraining.py)
52
 
53
- ## Using this model with Transformers
 
 
 
 
 
 
 
 
 
 
 
 
54
 
55
- Sample movie reviews classifier: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w
 
 
 
 
56
 
57
- Slightly outperforms Multilingual BERT on these Hindi Movie Reviews from https://github.com/sid573/Hindi_Sentiment_Analysis
 
 
 
 
 
4
 
5
  # Releasing Hindi ELECTRA model
6
 
7
+ This is a first attempt at a Hindi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra).
8
+
9
+ **Consider using this newer, larger model: https://huggingface.co/monsoon-nlp/hindi-tpu-electra**
10
 
11
  <a href="https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_">Tokenization and training CoLab</a>
12
 
13
+ I originally used <a href="https://github.com/monsoonNLP/transformers">a modified ELECTRA</a> for finetuning, but now use SimpleTransformers.
14
+
15
+ <a href="https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81">Blog post</a> - I was greatly influenced by: https://huggingface.co/blog/how-to-train
16
+
17
+ ## Example Notebooks
18
+
19
+ This small model achieves results comparable to Multilingual BERT on <a href="https://colab.research.google.com/drive/18FQxp9QGOORhMENafQilEmeAo88pqVtP">BBC Hindi news classification</a>
20
+ and on <a href="https://colab.research.google.com/drive/1UYn5Th8u7xISnPUBf72at1IZIm3LEDWN">Hindi movie reviews / sentiment analysis</a> (using SimpleTransformers).
21
 
22
+ Question-answering on MLQA dataset: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar#scrollTo=IcFoAHgKCUiQ
23
+
24
+ A larger model (<a href="https://huggingface.co/monsoon-nlp/hindi-tpu-electra">Hindi-TPU-Electra</a>) using ELECTRA base size outperforms both models on Hindi movie reviews / sentiment analysis, but
25
+ does not perform as well on the BBC news classification task.
26
 
27
  ## Corpus
28
 
 
40
  https://drive.google.com/file/d/1-6tXrii3tVxjkbrpSJE9MOG_HhbvP66V/view?usp=sharing
41
 
42
  Bonus notes:
43
+ - Created with HuggingFace Tokenizers; if you increase the vocabulary size and re-train the tokenizer, remember to update ELECTRA's vocab_size to match
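
One way to keep the two values in sync is to derive `vocab_size` from the vocabulary file itself. The sketch below is only an illustration: it assumes a WordPiece-style `vocab.txt` with one token per line, and the helper names `count_vocab_entries` and `build_hparams` are hypothetical, not part of ELECTRA or Tokenizers.

```python
import json
import tempfile
from pathlib import Path


def count_vocab_entries(vocab_path):
    """Count non-empty lines in a vocab.txt (assumed: one token per line)."""
    with open(vocab_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.rstrip("\r\n"))


def build_hparams(vocab_path):
    """Return a JSON string whose vocab_size matches the vocabulary file."""
    return json.dumps({"vocab_size": count_vocab_entries(vocab_path)})


# Demo with a tiny stand-in vocabulary; a real file holds thousands of tokens.
with tempfile.TemporaryDirectory() as tmp:
    demo_vocab = Path(tmp) / "vocab.txt"
    demo_vocab.write_text(
        "[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nनमस्ते\n", encoding="utf-8"
    )
    demo_hparams = build_hparams(demo_vocab)
    print(demo_hparams)
```

The resulting JSON string could then be supplied wherever you override ELECTRA's pretraining hyperparameters, for example through the `--hparams` argument of `run_pretraining.py`, assuming your checkout of ELECTRA accepts it in that form.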
44
 
45
  ## Training
46
 
 
62
 
63
  [configure_pretraining.py](https://github.com/google-research/electra/blob/master/configure_pretraining.py)
64
 
65
+ ## Conversion
66
+
67
+ Use this process to convert an in-progress or completed ELECTRA checkpoint to a Transformers-ready model:
68
+
69
+ ```
+ git clone https://github.com/huggingface/transformers
+ # Convert the TensorFlow checkpoint; trailing backslashes continue the command
+ python ./transformers/src/transformers/convert_electra_original_tf_checkpoint_to_pytorch.py \
+   --tf_checkpoint_path=./models/checkpointdir \
+   --config_file=config.json \
+   --pytorch_dump_path=pytorch_model.bin \
+   --discriminator_or_generator=discriminator
+ # Start an interactive Python session for the next step
+ python
+ ```
78
 
79
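+ ```
+ from transformers import TFElectraForPreTraining
+ # Load the converted PyTorch weights into the TensorFlow model class
+ model = TFElectraForPreTraining.from_pretrained("./dir_with_pytorch", from_pt=True)
+ # Save the TensorFlow weights (tf_model.h5) into ./tf
+ model.save_pretrained("tf")
+ ```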
84
 
85
+ Once you have gathered config.json, pytorch_model.bin, tf_model.h5, special_tokens_map.json, tokenizer_config.json, and vocab.txt at the top level of a single directory, run:
86
+
87
+ ```
+ transformers-cli upload directory
+ ```
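
Before uploading, it may be worth confirming that the directory really contains every file listed above. This is a minimal sketch using only the Python standard library; `missing_model_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration, and the check looks only at file presence, not at file contents.

```python
import tempfile
from pathlib import Path

# File names taken from the list above; all must sit at the same level.
REQUIRED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "tf_model.h5",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
)


def missing_model_files(directory):
    """Return the required file names absent from the top level of directory."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo: a temporary directory holding only two of the six files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "vocab.txt"):
        (Path(tmp) / name).touch()
    demo_missing = missing_model_files(tmp)
    print(demo_missing)
```

An empty list from `missing_model_files("directory")` would suggest the directory is ready for the upload command.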