# Llama-3-3B CodeSearchNet Fine-tuned

This repository hosts a **Llama 3 (3B) model** fine-tuned on the **CodeSearchNet dataset**, which pairs code with natural-language documentation across six programming languages.

## 📝 Model Details

- **Base Model**: Llama 3 (3B)
- **Fine-tuning Dataset**: CodeSearchNet
- **Languages Covered**: Python, Java, JavaScript, PHP, Ruby, Go
- **Training Method**: Supervised fine-tuning (SFT) with a contrastive loss objective for code search tasks
- **Tokenization**: Llama 3 tokenizer with additional tokens for code-specific keywords
- **Frameworks Used**: Hugging Face `transformers`, PyTorch, PEFT (for LoRA-based tuning)

## 📚 Dataset

The model is trained on the **CodeSearchNet** dataset, which contains:

- Function-level code snippets
- Paired natural language descriptions
- Multiple programming languages for multi-language search support

### **Dataset Sources**

- [CodeSearchNet Dataset](https://github.com/github/CodeSearchNet) - Contains ~2M code snippets from open-source repositories

## 🚀 Training Setup

- **Hardware**: NVIDIA A100 GPUs
- **Batch Size**: 16
- **Learning Rate**: 2e-5 with cosine annealing
- **Max Sequence Length**: 512
- **Fine-tuning Duration**: 3 epochs

## 🔍 Intended Use

- **Code Search**: Retrieve relevant code snippets given a natural language query
- **Code Completion**: Provide context-aware code suggestions
- **Code-to-Text Generation**: Explain code functionality in natural language
- **Multi-language Code Retrieval**: Search across different programming languages
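
## 💻 Usage

A minimal code-search sketch. The card does not specify how embeddings are extracted, so this example assumes mean-pooling over the model's final hidden states; the repo id `your-org/llama-3-3b-codesearchnet` is a placeholder to replace with this repository's actual path.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Placeholder repo id -- substitute the actual model path for this repository.
MODEL_ID = "your-org/llama-3-3b-codesearchnet"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool the final hidden states into one L2-normalized vector per input."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = model(**batch).last_hidden_state             # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

query = "parse a JSON string into a dictionary"
snippets = [
    "def load_json(s):\n    import json\n    return json.loads(s)",
    "def add(a, b):\n    return a + b",
]

scores = embed([query]) @ embed(snippets).T  # cosine similarities, shape (1, 2)
print(snippets[scores.argmax().item()])
```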
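
## 🧪 Loading the Dataset

A sketch of pulling matched code/description pairs, assuming the community `code_search_net` loader on the Hugging Face Hub and its documented field names; availability and the `trust_remote_code` requirement depend on your `datasets` version.

```python
from datasets import load_dataset

# One config per language ("python", "java", "javascript", "php", "ruby", "go").
# Recent `datasets` releases may additionally require trust_remote_code=True.
ds = load_dataset("code_search_net", "python", split="train")

example = ds[0]
# Each record pairs a function body with its docstring (field names per the Hub loader).
code = example["func_code_string"]
doc = example["func_documentation_string"]
print(doc[:80], "...")
```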
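
## 🛠️ LoRA Fine-tuning Sketch

A sketch of the PEFT/LoRA setup and the hyperparameters listed under Training Setup. The base checkpoint id (`meta-llama/Llama-3.2-3B`) and all LoRA values are assumptions, since the card does not state them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Assumed base checkpoint: the 3B Llama release.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# Illustrative LoRA hyperparameters; the card does not specify them.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()

# Hyperparameters from the Training Setup section above.
args = TrainingArguments(
    output_dir="llama3-3b-codesearchnet",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    bf16=True,  # assumption: mixed precision on A100s
)
```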
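
## 🎯 Contrastive Objective (Sketch)

The card states a contrastive loss for code search but not its exact form; a common choice is an in-batch InfoNCE-style objective over (description, code) pairs, sketched below with an illustrative temperature.

```python
import torch
import torch.nn.functional as F

def info_nce(code_emb, text_emb, temperature=0.05):
    """In-batch contrastive loss: the i-th description matches the i-th snippet.

    Both inputs are L2-normalized (B, H) embedding matrices.
    """
    logits = (text_emb @ code_emb.T) / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric objective: text->code and code->text retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```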