---
license: apache-2.0
tags:
- generated_from_trainer
model-index:
- name: angika-llm-1b
  results: []
---

# Angika-LLM-1b

**Angika-LLM-1b**, developed by the **Aarambh AI Research Group**, is the first generative language model for **Angika**, an endangered language spoken in parts of Bihar and Jharkhand, India. The model is built on a transformer architecture and supports tasks such as text generation, translation, and conversational AI in Angika, with the broader goal of preserving and promoting the language.

A key challenge in developing Angika-LLM-1b was the lack of annotated datasets. To overcome this, the team used data augmentation, translation of existing resources, and crowdsourcing. The model captures the distinctive syntax and expressions of Angika, making it contextually accurate for a range of applications.

Angika-LLM-1b opens up opportunities for creating digital content, educational resources, and language-learning tools in Angika. It also promotes linguistic diversity by giving smaller language communities a presence in the digital world, sets a precedent for building AI tools for other regional languages, and contributes to the wider movement of using AI for social good and language preservation.

## Model description

The example below installs the dependencies, loads the model from the Hugging Face Hub, and generates Angika text from a prompt.

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Arambh/angika-llm-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate text (no_repeat_ngram_size discourages repeated phrases)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
    )

    # Decode and return the generated sequences
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]


if __name__ == "__main__":
    prompt = "ये सब पहाड़ी पर पुरानो अभिलेख मिलै छै "  # example Angika prompt
    generated_text = generate_text(prompt, max_length=100)
    for i, text in enumerate(generated_text):
        print(f"Generated Text {i+1}:\n{text}\n")
```

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1

An illustrative `Trainer` configuration that mirrors these settings is sketched at the end of this card.

### Training results

| Training Loss | Epoch  | Step  | Validation Loss |
|:-------------:|:------:|:-----:|:---------------:|
| 4.5134        | 0.9999 | 11388 | nan             |

### Framework versions

- Transformers 4.42.4
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1

### Contributors

| Name           | Role             | Affiliation |
|:---------------|:-----------------|:------------|
| Satyajeet Azad | Data Scientist   | IIT Delhi   |
| Raj Kumar      | Data Analyst     | IIT Jodhpur |
| Sumant Azad    | Business Analyst | IIT Patna   |
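
### Illustrative training configuration

As a rough illustration of how the hyperparameters listed under *Training hyperparameters* map onto a Hugging Face `Trainer` run, the sketch below fine-tunes a causal language model with those settings. It is not the published training pipeline: the data files (`angika_train.txt`, `angika_valid.txt`), the sequence length, and the base checkpoint are assumptions made for the example.

```python
# Illustrative only: mirrors the listed hyperparameters using the Trainer's
# default Adam(W) settings (betas=(0.9, 0.999), epsilon=1e-08). File names,
# sequence length, and the base checkpoint are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "Arambh/angika-llm-1b"  # stand-in; the actual base checkpoint is not documented
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Causal-LM tokenizers often have no pad token; reuse EOS so batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Hypothetical plain-text Angika corpus, one passage per line.
dataset = load_dataset(
    "text",
    data_files={"train": "angika_train.txt", "validation": "angika_valid.txt"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="angika-llm-1b-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # 1 device batch x 4 steps = total_train_batch_size of 4
    num_train_epochs=1,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```

With `per_device_train_batch_size=1` and `gradient_accumulation_steps=4`, gradients are accumulated over four forward passes before each optimizer step, which gives the effective total_train_batch_size of 4 reported above.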