Angika-LLM-1b
Angika-LLM-1b is the first generative language model developed for the Angika language by the Aarambh AI Research Group. It aims to preserve and promote Angika, an endangered language spoken in parts of Bihar and Jharkhand, India. The model is built on transformer architectures and can perform tasks such as text generation, translation, and conversational AI in Angika.

A key challenge in developing Angika-LLM-1b was the lack of annotated Angika datasets. To overcome this, the team used data augmentation, translation of existing resources, and crowdsourcing. The model captures the distinctive syntax and expressions of Angika, making it contextually accurate for a range of applications.

Angika-LLM-1b opens up opportunities for creating digital content, educational resources, and language-learning tools in Angika. It also promotes linguistic diversity by giving smaller language communities a presence in the digital world, sets a precedent for developing AI tools for other regional languages, and contributes to the global movement of using AI for social good and language preservation.
Model description
The example below loads the model from the Hugging Face Hub and generates Angika text with the transformers library:

```python
# In a notebook, first install the dependencies: !pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the Angika-LLM-1b tokenizer and model from the Hugging Face Hub
model_name = "Arambh/angika-llm-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,  # prevents repeated bigrams
        early_stopping=True,
    )

    # Decode and return the generated sequences
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

if __name__ == "__main__":
    prompt = "ये सब पहाड़ी पर पुरानो अभिलेख मिलै छै "  # example prompt in Angika
    generated_text = generate_text(prompt, max_length=100)
    for i, text in enumerate(generated_text):
        print(f"Generated Text {i+1}:\n{text}\n")
```
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
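For reference, these settings map onto a Hugging Face `TrainingArguments` configuration roughly as sketched below. This is a minimal sketch, not the team's actual training script; the output directory name is a placeholder:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; "angika-llm-1b-finetune" is a placeholder
training_args = TrainingArguments(
    output_dir="angika-llm-1b-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # gives a total train batch size of 4
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```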
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 4.5134 | 0.9999 | 11388 | nan |
Framework versions
- Transformers 4.42.4
- PyTorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
Contributors
| Contributor | Role |
|---|---|
| Satyajeet Azad | Data Scientist, IIT Delhi |
| Raj Kumar | Data Analyst, IIT Jodhpur |
| Sumant Azad | Business Analyst, IIT Patna |