SKNahin committed on
Commit
3be4c54
1 Parent(s): b1f1283

Update README.md

Files changed (1)
  1. README.md +12 -12
README.md CHANGED
@@ -16,9 +16,9 @@ base_model:
16
 
17
  ## Model Information
18
 
19
- This model is a continually pretrained version of the [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) architecture, fine-tuned on extensive Bangla datasets. The primary goal of the continual pretraining was to enhance the model's ability to generate high-quality Bangla text. By extending the pretraining process specifically on Bangla data, the model has demonstrated superior performance in tasks related to Bangla language understanding evaluation benchmarks and text generation.
20
 
21
- **Model Architecture:** Gemma 2 is an auto-regressive language model that uses an optimized transformer architecture.
22
 
23
  | | Training Data | Params | Input modalities | Output modalities | Context Length | Token count |
24
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
@@ -31,7 +31,7 @@ Below we share some code snippets on how to get quickly started with running the
31
  pip install -U transformers
32
  ```
33
 
34
- Then, copy the snippet from the section that is relevant for your usecase.
35
 
36
  #### Running with the `pipeline` API
37
 
@@ -59,17 +59,17 @@ print(response)
59
 
60
  ## Training Data
61
 
62
- **Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribe text, code-mixed text, conversations, and open sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size is roughly around 268 GB. We separated __22GB__ data from that using a ratio of the data actual data size. Total trained tokens are __3B__ tokens.
63
 
64
  Data sources summary:
65
- - Web documents: Extract, clean, filter common crawl data
66
- - Books: Extract, clean, filter books data
67
  - Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
68
- - Translation data: We trained a Bangla-English translation LLM model and used it to translate English data to Bangla
69
- - Code-mixed data: We trained a Bangla-English code-mixed LLM model and used it to generate code-mixed data
70
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
71
  - Synthetic data: We generated synthetic data using a Bangla LLM model
72
- - Others: We scraped data from some selected websites, used open-sources data, and used some other data sources
73
 
74
 
75
  ## Benchmarks \- Bangla Text
@@ -81,7 +81,7 @@ We evaluated our pretrained models on both Bangla and English benchmark datasets
81
 
82
  #### Bangla Benchmark datasets
83
  We evaluated the models on the following datasets:
84
- - [Bangla MMLU](): A privated multiple choice questions datasets developed by Hishab curated from various sources.
85
  - [CommonsenseQa Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
86
  - [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
87
  - [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
@@ -89,10 +89,10 @@ We evaluated the models on the following datasets:
89
 
90
  #### English Benchmark datasets
91
  - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
92
- - [CommonseQa](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers .
93
  - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
94
  - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
95
- - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring ---they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
96
 
97
  ### Evaluation Results
98
 
 
16
 
17
  ## Model Information
18
 
19
+ This model is a continually pretrained version of the [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) architecture, fine-tuned on extensive Bangla datasets. The primary goal of the continual pretraining was to enhance the model's ability to generate high-quality Bangla text. By extending the pretraining process specifically on Bangla data, the model has demonstrated superior performance in Bangla language understanding evaluation benchmarks and text generation tasks.
20
 
21
+ **Model Architecture:** Gemma 2 is an auto-regressive language model with an optimized transformer architecture.
22
 
23
  | | Training Data | Params | Input modalities | Output modalities | Context Length | Token count |
24
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
 
31
  pip install -U transformers
32
  ```
33
 
34
+ Then, copy the snippet from the section that is relevant to your use case.
35
 
36
  #### Running with the `pipeline` API
37
 
 
59
 
60
  ## Training Data
61
 
62
+ **Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size is roughly 268 GB. We separated __22 GB__ of data from that using a ratio of the actual data size. The model was trained on a total of __3B__ tokens.
63
 
64
  Data sources summary:
65
+ - Web documents: Extracted, cleaned, and filtered Common Crawl data
66
+ - Books: Extracted, cleaned, and filtered book data
67
  - Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
68
+ - Translation data: We trained an English-Bangla translation LLM model and used it to translate English data to Bangla
69
+ - Code-mixed data: We trained an English-Bangla code-mixed LLM model and used it to generate code-mixed data
70
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
71
  - Synthetic data: We generated synthetic data using a Bangla LLM model
72
+ - Others: We scraped data from some selected websites, used open-source data, and used some other data sources
73
 
74
 
75
  ## Benchmarks \- Bangla Text
 
81
 
82
  #### Bangla Benchmark datasets
83
  We evaluated the models on the following datasets:
84
+ - [Bangla MMLU](): A private multiple-choice question dataset developed by Hishab and curated from various sources.
85
  - [CommonsenseQa Bangla](https://huggingface.co/datasets/hishab/commonsenseqa-bn): A Bangla translation of the CommonsenseQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
86
  - [OpenbookQA Bangla](https://huggingface.co/datasets/hishab/openbookqa-bn): A Bangla translation of the OpenbookQA dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
87
  - [Piqa Bangla](https://huggingface.co/datasets/hishab/piqa-bn): A Bangla translation of the Piqa dataset. The dataset was translated using a new method called Expressive Semantic Translation (EST), which combines Google Machine Translation with LLM-based rewriting modifications.
 
89
 
90
  #### English Benchmark datasets
91
  - [MMLU](https://huggingface.co/datasets/cais/mmlu): This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge.
92
+ - [CommonsenseQA](https://huggingface.co/datasets/tau/commonsense_qa): CommonsenseQA is a new multiple-choice question-answering dataset that requires different types of commonsense knowledge to predict the correct answers.
93
  - [OpenbookQA](https://huggingface.co/datasets/allenai/openbookqa): OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in.
94
  - [Piqa](https://huggingface.co/datasets/ybisk/piqa): The PIQA dataset focuses on physical commonsense reasoning, challenging AI to handle everyday situations requiring practical knowledge and unconventional solutions. Inspired by instructables.com, it aims to enhance AI's ability to understand and reason about physical interactions.
95
+ - [BoolQ](https://huggingface.co/datasets/google/boolq): BoolQ is a question-answering dataset for yes/no questions containing 15942 examples. These questions are naturally occurring: they are generated in unprompted and unconstrained settings. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The text-pair classification setup is similar to existing natural language inference tasks.
96
 
97
  ### Evaluation Results
98