---
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: angika-llm-1b
  results: []
---

# Angika-LLM-1b

**Angika-LLM-1b**, developed by the **Aarambh AI Research Group**, is the first generative language model for **Angika**, an endangered language spoken in parts of Bihar and Jharkhand, India. The model is built on a transformer architecture and supports tasks such as text generation, translation, and conversational AI in Angika, with the broader goal of preserving and promoting the language.

A key challenge in developing Angika-LLM-1b was the lack of annotated datasets. To overcome this, the team used data augmentation, translation of existing resources, and crowdsourcing. The model captures the distinctive syntax and expressions of Angika, making it contextually accurate for a range of applications.

Angika-LLM-1b opens up opportunities for creating digital content, educational resources, and language-learning tools in Angika. It also promotes linguistic diversity by giving smaller language communities a presence in the digital world, sets a precedent for building AI tools for other regional languages, and contributes to the wider movement of using AI for social good and language preservation.

## Model description

The example below installs the dependencies, loads the model from the Hugging Face Hub, and generates Angika text from a prompt.

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Arambh/angika-llm-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text (no_repeat_ngram_size discourages repeated phrases)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
    )

    # Decode and return the generated sequences
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]


if __name__ == "__main__":
    prompt = "ये सब पहाड़ी पर पुरानो अभिलेख मिलै छै "  # example Angika prompt
    generated_text = generate_text(prompt, max_length=100)
    for i, text in enumerate(generated_text):
        print(f"Generated Text {i+1}:\n{text}\n")
```

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1

An illustrative `Trainer` configuration that mirrors these settings is sketched at the end of this card.

### Training results

| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 4.5134        | 0.9999 | 11388 | nan             |

### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1

### Contributors

| Name           | Role             | Affiliation |
|:---------------|:-----------------|:------------|
| Satyajeet Azad | Data Scientist   | IIT Delhi   |
| Raj Kumar      | Data Analyst     | IIT Jodhpur |
| Sumant Azad    | Business Analyst | IIT Patna   |
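
### Illustrative training configuration

As a rough illustration of how the hyperparameters listed under *Training hyperparameters* map onto a Hugging Face `Trainer` run, the sketch below fine-tunes a causal language model with those settings. It is not the published training pipeline: the data files (`angika_train.txt`, `angika_valid.txt`), the sequence length, and the base checkpoint are assumptions made for the example.

```python
# Illustrative only: mirrors the listed hyperparameters using the Trainer's
# default Adam(W) settings (betas=(0.9, 0.999), epsilon=1e-08). File names,
# sequence length, and the base checkpoint are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Arambh/angika-llm-1b"  # stand-in; the actual base checkpoint is not documented
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Causal-LM tokenizers often have no pad token; reuse EOS so batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical plain-text Angika corpus, one passage per line.
dataset = load_dataset(
    "text",
    data_files={"train": "angika_train.txt", "validation": "angika_valid.txt"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="angika-llm-1b-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # 1 device batch x 4 steps = total_train_batch_size of 4
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```

With `per_device_train_batch_size=1` and `gradient_accumulation_steps=4`, gradients are accumulated over four forward passes before each optimizer step, which gives the effective total_train_batch_size of 4 reported above.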