Commit b1f1283 (parent 5b94ed4) by SarwarShafee: Update README.md (#1)
## Hardware and Software

**Training Factors:** We used the [llama-factory]() training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
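For orientation, llama-factory drives training from a YAML configuration file via its CLI. The sketch below is hedged: the config path is a placeholder, and the actual pretraining settings used for this model are not published here; real config examples live in the llama-factory repository.

```shell
# Hedged sketch: llama-factory training is launched from a YAML config.
# The config path is a placeholder; see the examples/ directory of the
# llama-factory repository for real pretraining configurations.
llamafactory-cli train path/to/pretrain_config.yaml
```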
## Training Data

**Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset was cleaned and filtered against several quality criteria. The collected data totals roughly 268 GB; from this we sampled __22 GB__ in proportion to each source's actual size. The total number of trained tokens is __3B__.
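The cleaning and filtering step is described only at a high level above. A minimal, hypothetical sketch of one common approach, filtering on document length and Bangla-character ratio, might look like this; the thresholds and example documents are illustrative, not the actual pipeline settings.

```python
# Hypothetical document filter: keep documents that are long enough and
# consist mostly of Bangla characters. Thresholds are illustrative only
# and do not reflect the actual filtering criteria used for this corpus.

def is_bangla_char(ch):
    """True if ch falls in the Unicode Bengali block (U+0980-U+09FF)."""
    return "\u0980" <= ch <= "\u09ff"

def keep_document(text, min_chars=200, min_bangla_ratio=0.5):
    """Simple quality filter on length and Bangla-character ratio."""
    if len(text) < min_chars:
        return False
    letters = [c for c in text if not c.isspace()]
    if not letters:
        return False
    ratio = sum(is_bangla_char(c) for c in letters) / len(letters)
    return ratio >= min_bangla_ratio

docs = ["বাংলা " * 100, "english only text"]  # toy examples
kept = [d for d in docs if keep_document(d)]  # keeps only the Bangla doc
```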
Data sources summary:
- Web documents: Extracted, cleaned, and filtered Common Crawl data
- Code-mixed data: We trained a Bangla-English code-mixed LLM and used it to generate code-mixed data
- Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
- Synthetic data: We generated synthetic data using a Bangla LLM
- Others: We scraped data from selected websites, used open-source data, and drew on some other data sources
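The 22 GB split described in the overview is drawn from roughly 268 GB in proportion to each source's size. That allocation can be sketched as follows; the source names and sizes below are illustrative placeholders, not the actual corpus statistics.

```python
# Hedged sketch of proportional sampling: allocate a fixed sample budget
# across sources in proportion to each source's share of the full corpus.
# Source names and sizes are illustrative, not the real corpus statistics.

def proportional_allocation(source_sizes_gb, budget_gb):
    """Return per-source sample sizes (in GB) proportional to source size."""
    total = sum(source_sizes_gb.values())
    return {name: budget_gb * size / total
            for name, size in source_sizes_gb.items()}

corpus_gb = {               # hypothetical per-source sizes, 268 GB total
    "web_documents": 160.0,
    "books": 48.0,
    "code_mixed": 32.0,
    "transliterated": 28.0,
}

sample_gb = proportional_allocation(corpus_gb, budget_gb=22.0)
# Each source contributes budget * (size / total); shares sum to 22 GB.
```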
## Benchmarks \- Bangla Text

In this section, we report results for the __titulm-gemma-2-2b-v1.0__ model on standard automatic benchmarks. For all of these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
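As a concrete illustration, evaluations with lm-evaluation-harness are typically launched through its CLI. The command below is a hedged sketch: the Hugging Face repository id is assumed, and the task selection and batch size are placeholders rather than the exact settings used for this model card.

```shell
# Hedged sketch of an lm-evaluation-harness run. The repository id is
# assumed, and the task and batch size are placeholders only.
lm_eval --model hf \
  --model_args pretrained=hishab/titulm-gemma-2-2b-v1.0 \
  --tasks hellaswag \
  --batch_size 8
```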
### Evaluation Datasets

We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
#### Bangla Benchmark datasets
We evaluated the models on the following datasets: