Angika-LLM-1b

Angika-LLM-1b is the first generative language model developed for the Angika language by the Aarambh AI Research Group. It aims to preserve and promote Angika, an endangered language spoken in parts of Bihar and Jharkhand, India. The model is built on a transformer architecture and can perform tasks such as text generation, translation, and conversational AI in Angika.

A key challenge in developing Angika-LLM-1b was the lack of annotated Angika datasets. To overcome this, the team used data augmentation, translation of existing resources, and crowdsourcing. The resulting model captures the syntax and expressions characteristic of Angika, making it contextually accurate for a range of applications.

Angika-LLM-1b opens up opportunities for creating digital content, educational resources, and language-learning tools in Angika. It also promotes linguistic diversity by giving smaller language communities a presence in the digital world, sets a precedent for developing AI tools for other regional languages, and contributes to the global movement of using AI for social good and language preservation.

Model description

The example below loads the model with the Hugging Face transformers library and generates text from an Angika prompt.

# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Arambh/angika-llm-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text (greedy decoding; no_repeat_ngram_size discourages repetition)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
    )

    # Decode and return the generated sequences
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]


if __name__ == "__main__":
    prompt = "ये सब पहाड़ी पर पुरानो अभिलेख मिलै छै "
    generated_texts = generate_text(prompt, max_length=100)

    for i, text in enumerate(generated_texts):
        print(f"Generated Text {i+1}:\n{text}\n")

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of an equivalent TrainingArguments configuration follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
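
For reference, these settings roughly correspond to the Hugging Face TrainingArguments below. This is a sketch that assumes Trainer-based fine-tuning, which the card does not state explicitly; the output_dir name is a placeholder.

from transformers import TrainingArguments

# Sketch of TrainingArguments matching the hyperparameters listed above.
# Assumes Trainer-based fine-tuning; "angika-llm-1b-finetune" is a hypothetical output directory.
training_args = TrainingArguments(
    output_dir="angika-llm-1b-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=4,   # effective train batch size: 1 x 4 = 4
    lr_scheduler_type="linear",
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)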

Training results

Training Loss | Epoch  | Step  | Validation Loss
4.5134        | 0.9999 | 11388 | nan

Framework versions

  • Transformers 4.42.4
  • Pytorch 2.3.1+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1
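
To reproduce this environment, the versions above can be pinned at install time. This is a sketch; the cu121 build of PyTorch may require the platform-specific PyTorch wheel index rather than the command shown.

# pip install transformers==4.42.4 torch==2.3.1 datasets==2.20.0 tokenizers==0.19.1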

Contributors

  • Satyajeet Azad, Data Scientist, IIT Delhi
  • Raj Kumar, Data Analyst, IIT Jodhpur
  • Sumant Azad, Business Analyst, IIT Patna
Model size

995M parameters (Safetensors, F32)