Angika-LLM-1b

Angika-LLM-1b is the first generative language model developed for the Angika language by the Aarambh AI Research Group. It aims to preserve and promote Angika, an endangered language spoken in parts of Bihar and Jharkhand, India. The model is built on a transformer architecture and can perform tasks such as text generation, translation, and conversational AI in Angika.

A key challenge in developing Angika-LLM-1b was the lack of annotated Angika datasets. To overcome this, the team used data augmentation, translation of existing resources, and crowdsourcing. The resulting model captures the syntax and expressions characteristic of Angika, making it contextually accurate for a range of applications.

Angika-LLM-1b opens up opportunities for creating digital content, educational resources, and language-learning tools in Angika. It also promotes linguistic diversity by giving smaller language communities a presence in the digital world, sets a precedent for developing AI tools for other regional languages, and contributes to the global movement of using AI for social good and language preservation.

Model description

The example below loads the model with the Hugging Face transformers library and generates text from an Angika prompt.

# pip install transformers torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Arambh/angika-llm-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text (greedy decoding; no_repeat_ngram_size discourages repetition)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
    )

    # Decode and return the generated sequences
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]


if __name__ == "__main__":
    prompt = "ये सब पहाड़ी पर पुरानो अभिलेख मिलै छै "
    generated_texts = generate_text(prompt, max_length=100)

    for i, text in enumerate(generated_texts):
        print(f"Generated Text {i+1}:\n{text}\n")

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch of an equivalent TrainingArguments configuration follows the list):

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
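
For reference, these settings roughly correspond to the Hugging Face TrainingArguments below. This is a sketch that assumes Trainer-based fine-tuning, which the card does not state explicitly; the output_dir name is a placeholder.

from transformers import TrainingArguments

# Sketch of TrainingArguments matching the hyperparameters listed above.
# Assumes Trainer-based fine-tuning; "angika-llm-1b-finetune" is a hypothetical output directory.
training_args = TrainingArguments(
    output_dir="angika-llm-1b-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    seed=42,
    gradient_accumulation_steps=4,   # effective train batch size: 1 x 4 = 4
    lr_scheduler_type="linear",
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)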

Training results

Training Loss | Epoch  | Step  | Validation Loss
4.5134        | 0.9999 | 11388 | nan

Framework versions

  • Transformers 4.42.4
  • Pytorch 2.3.1+cu121
  • Datasets 2.20.0
  • Tokenizers 0.19.1
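
To reproduce this environment, the versions above can be pinned at install time. This is a sketch; the cu121 build of PyTorch may require the platform-specific PyTorch wheel index rather than the command shown.

# pip install transformers==4.42.4 torch==2.3.1 datasets==2.20.0 tokenizers==0.19.1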

Contributors

  • Satyajeet Azad, Data Scientist, IIT Delhi
  • Raj Kumar, Data Analyst, IIT Jodhpur
  • Sumant Azad, Business Analyst, IIT Patna
Model size

995M parameters (Safetensors, F32)