sagorsarker committed on
Commit
5b651f5
1 Parent(s): e4b3682

Create README.md

  1. README.md +138 -0
README.md ADDED
@@ -0,0 +1,138 @@
1
+ ---
2
+ language:
3
+ - bn
4
+ library_name: transformers
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - hishab
8
+ - titulm
9
+ - pytorch
10
+ - llama
11
+ - llama-3
12
+ - llama-factory
13
+ license: llama3.2
14
+ ---
15
+
16
+ ## Model Information
17
+
18
+ This model is a continually pretrained version of [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) whose vocabulary was extended with about 42k Bangla tokens before further training on extensive Bangla datasets. The primary goal of continual pretraining with token extension was to enhance the model's ability to generate high-quality Bangla text. By extending pretraining specifically on Bangla data, the model is better suited to Bangla text generation and improves on some Bangla language understanding benchmarks (see the Benchmarks section).
19
+
20
+ **Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture.
21
+
22
+ | | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |
+ | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+ | Llama 3.2 (text only) | Hishab curated Bangla text corpus | 1B (1.23B) | Monolingual Text (Bangla) | Monolingual Text (Bangla) | 4096 | Yes | Yes | 37B tokens | |
25
+
26
+ **Supported Languages:** Bengali (primary) and English (secondary)
27
+
28
+ **Llama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
29
+
30
+ **Model Release Date:** October 24, 2024
31
+
32
+ **Status:** This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities.
33
+
34
+ **License:** This model is released under the same license as Llama 3.2. Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
35
+
36
+
37
+ ## How to use
38
+ - Use with transformers
39
+
40
+ With transformers >= 4.43.0, you can run inference using the Transformers pipeline abstraction or by using the Auto classes with the `generate()` function.
+
+ Make sure to update your transformers installation via `pip install --upgrade transformers`.
43
+
44
+ ```python
+ import torch
+ from transformers import pipeline
+
+ model_id = "hishab/titulm-llama-3.2-1b-v2.0"
+
+ pipe = pipeline(
+     "text-generation",
+     model=model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
+
+ pipe("আমাদের দেশের নাম")
+ ```
59
+
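The usage notes above also mention the Auto classes with `generate()`. A minimal sketch of that route follows; the helper name `generate_bangla`, the greedy decoding setting, and the default `max_new_tokens` value are illustrative assumptions, not settings taken from this README.

```python
MODEL_ID = "hishab/titulm-llama-3.2-1b-v2.0"


def generate_bangla(prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this helper does not itself
    # require torch and transformers to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # do_sample=False selects greedy decoding (an assumption of this sketch).
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
        )

    # The decoded string contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_bangla("আমাদের দেশের নাম")` downloads the weights on first use and returns the prompt followed by the generated continuation.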
60
+ ## Hardware and Software
61
+
62
+ **Training Factors:** We used the [llama-factory](https://github.com/hiyouga/LLaMA-Factory) training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
63
+
64
+
65
+ ## Training Data
66
+
67
+ **Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data totals roughly 268 GB, and the model was trained on 37B tokens.
68
+
69
+ Data sources summary:
70
+ - Web documents: Extracted, cleaned, and filtered Common Crawl data
+ - Books: Extracted, cleaned, and filtered book data
+ - Transcribed text: Used an in-house Bangla ASR model to transcribe Bangla audio data
+ - Translation data: We trained a Bangla-English translation LLM and used it to translate English data to Bangla
+ - Code-mixed data: We trained a Bangla-English code-mixed LLM and used it to generate code-mixed data
+ - Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
+ - Synthetic data: We generated synthetic data using a Bangla LLM
+ - Others: We scraped data from selected websites and used open-source and other data sources
78
+
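The README says the data is cleaned and filtered but does not list the criteria. As a purely illustrative sketch of what one such filter could look like, the snippet below keeps a document only if it is long enough and mostly written in Bengali script. The function names and both thresholds (`min_chars`, `min_ratio`) are hypothetical and are not taken from the authors' pipeline.

```python
def bangla_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Bengali Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0980..U+09FF is the Bengali block of the Unicode standard.
    bangla = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return bangla / len(chars)


def keep_document(text: str, min_chars: int = 20, min_ratio: float = 0.5) -> bool:
    """Hypothetical quality filter: enough text, and mostly Bengali script."""
    non_space = sum(1 for c in text if not c.isspace())
    return non_space >= min_chars and bangla_ratio(text) >= min_ratio


docs = [
    "আমাদের দেশের নাম বাংলাদেশ",          # Bengali script, long enough
    "This is an English sentence only.",  # no Bengali characters
    "নাম",                                 # too short
]
kept = [d for d in docs if keep_document(d)]
```

A real pipeline would typically combine several such checks (deduplication, language identification, quality heuristics); this sketch shows only the script-ratio idea.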
79
+ ## Token Extending
80
+ We trained a separate Bangla tokenizer using the [Tiktoken](https://github.com/openai/tiktoken) library on 48 GB of Bangla data (sampled from the main pretraining data) with a vocabulary size of 48k, and selected 42k of its tokens to add to the pretrained model. We extended the model's vocabulary with these tokens and continued pretraining on Bangla data. Token extending was done to enhance the model's ability to generate high-quality Bangla text. The updated vocabulary size is 170k, compared with 128k for the original Llama 3.2.
81
+
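The vocabulary arithmetic above (128k original + 42k added = 170k) can be illustrated with a toy sketch. The README does not say how the 42k tokens were chosen from the 48k-token Bangla tokenizer; the snippet assumes, for illustration only, that tokens already present in the base vocabulary are skipped. The function name and the toy vocabularies are hypothetical.

```python
def select_new_tokens(base_vocab, candidate_vocab):
    """Return candidate tokens absent from the base vocabulary, in order."""
    seen = set(base_vocab)
    new_tokens = []
    for tok in candidate_vocab:
        if tok not in seen:
            seen.add(tok)  # also guards against duplicates in the candidates
            new_tokens.append(tok)
    return new_tokens


# Toy stand-ins for the base vocabulary and the separately trained Bangla one.
base_vocab = ["the", "a", "ing", "বা"]
bangla_vocab = ["বা", "বাংলা", "দেশ", "the", "আমাদের"]

new_tokens = select_new_tokens(base_vocab, bangla_vocab)
extended_vocab = base_vocab + new_tokens

# Same bookkeeping at the scale reported in this README (approximate sizes).
reported_total = 128_000 + 42_000
```

With Hugging Face transformers, the analogous step is commonly done with `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the authors used exactly these calls is not stated here.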
82
+
83
+ ## Benchmarks \- Bangla Text
84
+
85
+ In this section, we report the results for the __titulm-llama-3.2-1b-v2.0__ model on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
86
+
87
+ ### Evaluation Datasets
88
+ We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
89
+
90
+ #### Bangla Benchmark datasets
91
+ We evaluated the models on the following datasets:
92
+ - [Bangla MMLU](): A private multiple-choice question dataset developed by Hishab, curated from various sources.
+ - [CommonsenseQA Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
94
+ - [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
95
+ - [BoolQ Bangla](https://huggingface.co/datasets/hishab/boolq_bn): The dataset contains 15,942 examples, with each entry consisting of a triplet: (question, passage, answer). The questions are naturally occurring, generated from unprompted and unconstrained settings. Input passages were sourced from Bangla Wikipedia, Banglapedia, and News Articles, and GPT-4 was used to generate corresponding yes/no questions with answers.
96
+
97
+ #### English Benchmark datasets
98
+ - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
99
+ - [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
+ - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
+ - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
102
+
103
+ ### Evaluation Results
104
+
105
+ #### Evaluation on Bangla Benchmark datasets
106
+ - **llama-3.2-1b** scores higher than **titulm-llama-3.2-1b-v2.0** on **Bangla MMLU** in both 0-shot and 5-shot settings; for **BoolQ BN**, a score is reported only for llama-3.2-1b (0-shot).
+ - However, **titulm-llama-3.2-1b-v2.0** performs better on **Commonsense QA** and **PIQA BN**, where it outperforms the original model in both 0-shot and 5-shot settings.
+ - The models perform similarly on **OpenBook QA**, with marginal differences.
109
+
110
+
111
+ | Model | Shots | Bangla MMLU | BoolQ BN | Commonsense QA | OpenBook QA | PIQA BN |
+ |------------------------------|---------|-------------|----------|----------------|-------------|---------|
+ | llama-3.2-1b | 0-shot | **0.29** | **0.55** | 0.22 | **0.33** | 0.53 |
+ | | 5-shot | **0.28** | - | 0.23 | 0.31 | 0.54 |
+ | titulm-llama-3.2-1b-v2.0 | 0-shot | 0.25 | - | **0.26** | 0.32 | **0.58** |
+ | | 5-shot | 0.25 | - | **0.28** | **0.33** | **0.57** |
117
+
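To make the Bangla comparison easier to read, the sketch below recomputes per-task score differences (titulm minus llama-3.2-1b) from the table above. The scores are copied from the table; BoolQ BN is left out because no titulm score is reported for it. The variable names are illustrative.

```python
# (task, shots) -> (llama-3.2-1b score, titulm-llama-3.2-1b-v2.0 score)
bangla_scores = {
    ("Bangla MMLU", "0-shot"): (0.29, 0.25),
    ("Bangla MMLU", "5-shot"): (0.28, 0.25),
    ("Commonsense QA", "0-shot"): (0.22, 0.26),
    ("Commonsense QA", "5-shot"): (0.23, 0.28),
    ("OpenBook QA", "0-shot"): (0.33, 0.32),
    ("OpenBook QA", "5-shot"): (0.31, 0.33),
    ("PIQA BN", "0-shot"): (0.53, 0.58),
    ("PIQA BN", "5-shot"): (0.54, 0.57),
}

# Positive delta means titulm scores higher than the original model.
deltas = {
    key: round(titulm - base, 2)
    for key, (base, titulm) in bangla_scores.items()
}

# Tasks where titulm is ahead in both the 0-shot and 5-shot settings.
tasks = sorted({task for task, _ in deltas})
titulm_wins = [
    t for t in tasks
    if deltas[(t, "0-shot")] > 0 and deltas[(t, "5-shot")] > 0
]

for (task, shots), delta in deltas.items():
    print(f"{task:15s} {shots}: {delta:+.2f}")
```

Under this bookkeeping, titulm is ahead in both settings on Commonsense QA and PIQA BN, behind on Bangla MMLU, and mixed on OpenBook QA, which matches the bullet summary above.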
118
+ #### Evaluation on English Benchmark datasets
119
+ - **llama-3.2-1b** shows consistently better performance on English datasets, especially in tasks like **MMLU**, **BoolQ**, **Commonsense QA**, and **PIQA**.
120
+ - In comparison, **titulm-llama-3.2-1b-v2.0** underperforms in both 0-shot and 5-shot settings across all tasks in English.
121
+ - This is expected, as the model was trained only on Bangla datasets.
122
+
123
+
124
+ | Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
+ |------------------------------|---------|-------------|---------|----------------|-------------|-------|
+ | llama-3.2-1b | 0-shot | **0.38** | **0.64** | **0.47** | **0.37** | **0.75** |
+ | | 5-shot | **0.31** | **0.66** | **0.32** | **0.40** | **0.76** |
+ | titulm-llama-3.2-1b-v2.0 | 0-shot | 0.23 | 0.45 | 0.20 | 0.24 | 0.55 |
+ | | 5-shot | 0.25 | 0.49 | 0.18 | 0.24 | 0.55 |
130
+
131
+ ### Instruction Tuned Models
132
+
133
+
134
+ ### Intended Use
135
+ - Bangla text generation
136
+ - Bangla language understanding tasks
137
+ - Bangla instruction fine-tuning tasks
138
+