RichardErkhov committed on
Commit c82c181
1 Parent(s): 8eed9e3

uploaded readme

Files changed (1)
  1. README.md +88 -0
README.md ADDED
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


gpt-2-tamil - bnb 8bits
- Model creator: https://huggingface.co/abinayam/
- Original model: https://huggingface.co/abinayam/gpt-2-tamil/
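
A minimal sketch of loading this 8-bit bitsandbytes quantization with `transformers` (the repository id below is a placeholder for this repo's actual id; `bitsandbytes` and `accelerate` need to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id: substitute the actual id of this quantized repository.
repo_id = "RichardErkhov/gpt-2-tamil-bnb-8bit"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# The 8-bit (bitsandbytes) quantization config is typically stored in the
# checkpoint's config.json; device_map="auto" places weights on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
```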

Original model description:
---
language: ta
datasets:
- oscar
- IndicNLP
widget:
- text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
---

# GPT2-Tamil

This repository was created as part of the Flax/JAX community week organized by Hugging Face. The aim of this project is to pretrain a GPT-2 language model specifically for the Tamil language.

## Setup:
To set up the project, run the following command:
```bash
pip install -r requirements.txt
```

## Model:
A GPT-2 model pretrained on Tamil text with a causal language modeling (CLM) objective.
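
As a concrete illustration of the CLM objective (a minimal sketch, assuming the converted PyTorch checkpoint `abinayam/gpt-2-tamil`): the model is trained to predict each token from the tokens before it, so the training loss is obtained by passing the inputs themselves as labels.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abinayam/gpt-2-tamil")
model = AutoModelForCausalLM.from_pretrained("abinayam/gpt-2-tamil")

inputs = tokenizer("ஒரு ஊரிலே ஒரு காக்கைக்கு", return_tensors="pt")
# For causal LM the labels are the input ids; the model shifts them internally
# so each position is scored on predicting the token that follows it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average next-token cross-entropy for this text
```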

## Dataset Used:
The GPT-2 model is trained on the [OSCAR dataset (ta)](https://huggingface.co/datasets/oscar) and the [IndicNLP dataset (ta)](https://indicnlp.ai4bharat.org/corpora/).

## Intended uses & limitations:
You can use the raw model for text generation, but it is mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=gpt2) to look for fine-tuned versions on a task that interests you.

## How to pretrain the model:
To perform training, follow these steps:

- Export the model directory (where you want to store the model artifacts such as the config and tokenizer):
```bash
export MODEL_DIR=<model_dir>
```
- Create the config.json by running the following command (a sketch of what such a script typically does follows this list):
```bash
python src/create_config.py
```
- Create the tokenizer by running the following command:
```bash
python src/train_tokenizer.py
```
- Once the config and tokenizer are created, run the following script to start training the Flax model:
```bash
bash scripts/train_gpt2-oscar-tamil.sh
```
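
For orientation, here is a minimal sketch of what a config-creation step like `src/create_config.py` typically does with the `transformers` API; the actual script may choose different hyperparameters.

```python
import os
from transformers import GPT2Config

model_dir = os.environ["MODEL_DIR"]  # directory exported in the first step

# Illustrative values only; the project's real config may differ.
config = GPT2Config(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12, n_head=12)
config.save_pretrained(model_dir)
```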

## How to use:
To perform language generation with the model, the `pipeline` API can be used directly.

- First, convert the Flax model to PyTorch using the following command:
```bash
python src/convert_flax_to_pytorch.py
```
- Use the following snippet to perform language generation:
```python
>>> from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline, set_seed
>>> model_name = 'abinayam/gpt-2-tamil'
>>> model = AutoModelWithLMHead.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> set_seed(42)
>>> input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
>>> max_len = 300
>>> no_seq = 5
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> sequence = generator(input_text, max_length=max_len, num_return_sequences=no_seq)
```
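
Each element of `sequence` is a dictionary with a `generated_text` key; for example, `sequence[0]['generated_text']` holds the first generated continuation.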