---
license: mit
datasets:
- saillab/taco-datasets
language:
- ar
- en
---

# Arabic Translator: Machine Learning Model

This repository contains a machine learning model designed to translate text into Arabic. The model is trained on a custom dataset and fine-tuned to optimize translation accuracy while balancing training and validation performance.

## 📄 Overview

The model is built using deep learning techniques to translate text effectively. It was trained and validated using loss metrics to monitor performance over multiple epochs. The training process is visualized through loss curves that show learning progress and highlight overfitting challenges.

### Key Features

- **Language Support:** Translates text into Arabic.
- **Model Architecture:** Based on [model architecture used, e.g., Transformer, RNN, etc.].
- **Preprocessing:** Includes tokenization and encoding steps for handling Arabic script.
- **Evaluation:** Monitored with training and validation loss to verify consistent improvement.

## 🚀 How to Use

### Installation

1. Clone this repository:

   ```bash
   git clone https://huggingface.co/MounikaAithagoni/Traanslator
   cd Traanslator
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

### Model Inference

A minimal inference example, assuming a sequence-to-sequence architecture loadable with `AutoModelForSeq2SeqLM` and using this repository's model ID:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model and tokenizer from this repository.
model = AutoModelForSeq2SeqLM.from_pretrained("MounikaAithagoni/Traanslator")
tokenizer = AutoTokenizer.from_pretrained("MounikaAithagoni/Traanslator")

# Translate a sample sentence.
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Translation: {translation}")
```

## 🧑‍💻 Training Details

- **Training Loss:** Decreased steadily across epochs, indicating effective learning.
- **Validation Loss:** Decreased initially but plateaued later, suggesting overfitting beyond epoch 5.
- **Epochs:** Trained for 10 epochs with an early stopping mechanism (see the sketch at the end of this card).

## 📝 Dataset

The model was trained on a custom dataset tailored for Arabic translation:
https://huggingface.co/datasets/saillab/taco-datasets/tree/main/multilingual-instruction-tuning-dataset%20/multilingual-alpaca-52k-gpt-4

Preprocessing steps included:

- Tokenizing and encoding the text data.
- Splitting it into training and validation sets.

For details on the dataset format, refer to the `data/` folder. A sketch of these preprocessing steps appears at the end of this card.

## 📊 Evaluation

- **Metrics:** Training and validation loss were monitored throughout training.
- **Performance:** Good initial generalization, with validation loss rising slightly after the 5th epoch, signaling overfitting.

## 🔧 Future Improvements

- Implement techniques to address overfitting, such as regularization or data augmentation (a sketch follows at the end of this card).
- Fine-tune on a larger, more diverse dataset for better generalization.
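
## 🧪 Appendix: Illustrative Sketches

The sketches below expand on steps described above. They are illustrative only: file names, column names, and hyperparameters are assumptions, not the documented configuration of this model.

First, a minimal sketch of the preprocessing described under **Dataset**: tokenizing/encoding the text and splitting it into training and validation sets. The file `data/train.json` and the `instruction`/`output` column names are hypothetical.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical file and column names -- adjust to the actual dataset schema.
raw = load_dataset("json", data_files="data/train.json")["train"]
tokenizer = AutoTokenizer.from_pretrained("MounikaAithagoni/Traanslator")

def preprocess(example):
    # Encode the English source text and the Arabic target text.
    model_inputs = tokenizer(example["instruction"], truncation=True, max_length=128)
    labels = tokenizer(text_target=example["output"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize, then split into training and validation sets (90/10).
tokenized = raw.map(preprocess, remove_columns=raw.column_names)
splits = tokenized.train_test_split(test_size=0.1, seed=42)
train_set, val_set = splits["train"], splits["test"]
```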
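
Next, a sketch of how the early stopping mentioned under **Training Details** could be wired up with the `transformers` `Trainer`. The patience value and other arguments are assumptions; here training stops once validation loss fails to improve for two consecutive epochs.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("MounikaAithagoni/Traanslator")
model = AutoModelForSeq2SeqLM.from_pretrained("MounikaAithagoni/Traanslator")

# Assumed setup: evaluate on the validation set each epoch, keep the best
# checkpoint, and compare checkpoints by validation loss (lower is better).
args = TrainingArguments(
    output_dir="arabic-translator",
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,  # from the preprocessing sketch above
    eval_dataset=val_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```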
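
Finally, a sketch of the regularization idea listed under **Future Improvements**: raising dropout and adding weight decay. The values are illustrative, and whether the config exposes a `dropout` field depends on the underlying architecture.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM, TrainingArguments

# Illustrative settings, not this model's actual configuration.
config = AutoConfig.from_pretrained("MounikaAithagoni/Traanslator")
config.dropout = 0.2  # assumes the architecture exposes a `dropout` field

model = AutoModelForSeq2SeqLM.from_pretrained(
    "MounikaAithagoni/Traanslator", config=config
)

# Weight decay applies L2-style regularization through the AdamW optimizer.
args = TrainingArguments(output_dir="arabic-translator", weight_decay=0.01)
```

Data augmentation, for example back-translation to synthesize extra parallel pairs, would be a complementary option.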