Abstract
Code embeddings are essential for semantic code search; however, current approaches often struggle to capture the precise syntactic and contextual nuances inherent in code. Open-source models such as CodeBERT and UniXcoder exhibit limitations in scalability and efficiency, while high-performing proprietary systems impose substantial computational costs. We introduce a parameter-efficient fine-tuning method based on Low-Rank Adaptation (LoRA) to construct task-specific adapters for code retrieval. Our approach reduces the number of trainable parameters to less than two percent of the base model, enabling rapid fine-tuning on extensive code corpora (2 million samples in 25 minutes on two H100 GPUs). Experiments demonstrate improvements of up to 9.1% in Mean Reciprocal Rank (MRR) for Code2Code search and up to 86.69% for Text2Code search across multiple programming languages. Distinguishing between task-wise and language-wise adaptation also lets us probe how sensitive code retrieval is to syntactic and linguistic variation.
Community
LoRACode introduces a Low-Rank Adaptation (LoRA)-based fine-tuning framework for efficient and scalable code embeddings, significantly reducing trainable parameters while improving code retrieval performance in Code2Code and Text2Code search tasks.
LoRA-based Parameter-Efficient Fine-Tuning for Code Embeddings: Unlike traditional fine-tuning methods that require modifying the entire model, LoRACode applies LoRA by introducing low-rank adaptation matrices in the query and value projection layers of transformer models. This reduces trainable parameters to ~1.83%–1.85% of the base model while maintaining or improving retrieval accuracy. Fine-tuning is significantly faster -- 2 million samples in 25 minutes on two H100 GPUs.
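A minimal sketch of this setup, assuming the Hugging Face PEFT library and UniXcoder as an example base encoder; the rank, scaling factor, and dropout values below are illustrative choices, not the paper's reported hyperparameters:

```python
# Sketch: attach LoRA adapters to the query/value projections of a BERT-style
# code encoder. Hyperparameters are assumptions for illustration only.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("microsoft/unixcoder-base")  # example base encoder

lora_cfg = LoraConfig(
    r=16,                               # low-rank dimension (assumed)
    lora_alpha=32,                      # scaling factor (assumed)
    lora_dropout=0.05,                  # assumed
    target_modules=["query", "value"],  # q/v projections in the self-attention layers
    bias="none",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()      # expect roughly 1-2% of parameters to be trainable
```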
Task-Specific and Language-Specific Adapters: Adapters are fine-tuned for Code2Code retrieval (matching code snippets to similar code) and Text2Code retrieval (mapping natural language queries to code). Separate LoRA adapters are fine-tuned for six programming languages (Go, Java, JavaScript, PHP, Python, Ruby), outperforming generic task-based adapters. Language-specific fine-tuning captures syntactic and contextual variations better than multilingual training.
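A sketch of how per-language adapters could be served on top of a shared frozen base model; the adapter paths and this loading scheme are hypothetical placeholders, not released artifacts:

```python
# Sketch: route a request to the LoRA adapter for its programming language.
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "microsoft/unixcoder-base"          # assumed base encoder
LANG_ADAPTERS = {                                # hypothetical local adapter directories
    "python": "adapters/lora-text2code-python",
    "java":   "adapters/lora-text2code-java",
    "go":     "adapters/lora-text2code-go",
}

def load_encoder(lang: str):
    """Attach the language-specific LoRA adapter to a copy of the frozen base model."""
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModel.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(base, LANG_ADAPTERS[lang])
    return tokenizer, model.eval()
```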
Integration of LoRA with Contrastive Fine-Tuning: LoRACode employs a contrastive learning objective with a cosine similarity-based loss function to improve retrieval accuracy. The fine-tuning process is implemented using ContrastiveTrainer, a custom Hugging Face Trainer extension. Embeddings are extracted using a last-token pooling strategy to retain semantic richness.
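A minimal sketch of the two components described above: last-token pooling over the encoder output and an in-batch contrastive loss over cosine similarities (InfoNCE-style). The temperature value and the exact formulation inside ContrastiveTrainer are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Take the hidden state of the last non-padding token of each sequence."""
    last_idx = attention_mask.sum(dim=1) - 1                   # index of the final real token
    return hidden_states[torch.arange(hidden_states.size(0)), last_idx]

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: the i-th query should match the i-th code snippet."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.T / temperature                             # pairwise cosine similarities
    labels = torch.arange(q.size(0), device=q.device)          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```

Normalizing both sides keeps the logits on a cosine-similarity scale, so the temperature controls how sharply the loss separates each positive pair from the in-batch negatives.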
Performance Gains Over Existing Models: Up to 9.1% increase in Mean Reciprocal Rank (MRR) for Code2Code retrieval. Up to 86.69% increase in MRR for Text2Code retrieval (Python-specific model). Surpasses CodeBERT, GraphCodeBERT, and UniXcoder in retrieval accuracy while being significantly more computationally efficient.
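For reference, MRR averages the reciprocal rank of the first correct result over all queries. A small sketch, assuming a square similarity matrix where candidate i is the ground truth for query i:

```python
import torch

def mean_reciprocal_rank(similarity: torch.Tensor) -> float:
    """MRR for a (num_queries x num_candidates) score matrix with matched indices."""
    ranks = similarity.argsort(dim=1, descending=True)          # candidates ordered by score
    targets = torch.arange(similarity.size(0)).unsqueeze(1)     # ground-truth index per query
    positions = (ranks == targets).nonzero()[:, 1] + 1          # 1-based rank of the match
    return (1.0 / positions.float()).mean().item()
```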
Scalability and Computational Efficiency: LoRACode fine-tunes models using fewer resources than OpenAI’s proprietary embeddings while achieving comparable or better retrieval accuracy. It enables cross-language retrieval, allowing embeddings trained on one language to generalize to others.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings (2025)
- Language Fusion for Parameter-Efficient Cross-lingual Transfer (2025)
- GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer (2025)
- Training Sparse Mixture Of Experts Text Embedding Models (2025)
- Multilingual State Space Models for Structured Question Answering in Indic Languages (2025)
- UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings (2025)
- DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers (2025)