---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- legal
- USCode
license: apache-2.0
datasets:
- ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
- en
library_name: transformers
base_model:
- BAAI/bge-small-en-v1.5
fine_tuned_from: sentence-transformers/BGE-Small
task_categories:
- sentence-similarity
- embeddings
- feature-extraction
model-index:
- name: BGE-Small-LegalEmbeddings-USCode
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: USCode-QAPairs-Finetuning
      type: USCode-QAPairs-Finetuning
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 0.72
    - name: Recall
      type: Recall
      value: 0.75
    source:
      name: Evaluation on USLawQA Dataset
      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
---

# BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model, trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. The model is optimized for generating embeddings of legal text, reaching 72% accuracy and 75% recall on the held-out test set.

## Overview

- **Base Model**: BGE Small (`BAAI/bge-small-en-v1.5`)
- **Dataset**: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- **Training Details**:
  - **Hardware**: Google Colab (T4 GPU)
  - **Training Time**: 2 hours
  - **Test Metrics**: 72% accuracy, 75% recall on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## Applications

This model is ideal for:

- **Legal Text Retrieval**: Efficient semantic search across legal documents.
- **Question Answering**: Answering legal queries based on context from the US Code.
- **Embeddings Generation**: Producing high-quality embeddings for downstream legal NLP tasks.

## Usage

The model can be loaded directly with the `transformers` library, as shown below, or through `sentence-transformers` (via `model.encode`); a semantic-search sketch using the latter appears at the end of this card.

```python
# Load the model directly with transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# BGE models use the [CLS] token as the sentence embedding;
# normalize it so dot products equal cosine similarities.
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)

print(embedding.shape)  # torch.Size([1, 384])
```

## Evaluation

The model was evaluated on the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) and achieved the following results:

- **Accuracy**: 72%
- **Recall**: 75%
- **Task**: Semantic similarity and legal question answering.

## Related Resources

- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## 📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

**Archit Rastogi**
📧 [architrastogi20@gmail.com](mailto:architrastogi20@gmail.com)

---

## 📜 License

This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using it.
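
## Semantic Search Example

A minimal semantic-search sketch using `sentence-transformers` is shown below. The passages and the usage pattern are illustrative only; note that if the repository does not ship a sentence-transformers pooling config, the library falls back to mean pooling when loading a plain `transformers` checkpoint.

```python
# Minimal semantic-search sketch (assumes: pip install sentence-transformers).
# The passages below are illustrative, not drawn from the training data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

query = "Duties of the president"
passages = [
    "The President shall be Commander in Chief of the Army and Navy of the United States.",
    "The Congress shall have Power To lay and collect Taxes, Duties, Imposts and Excises.",
]

# Encode the query and candidate passages; normalized embeddings make
# cosine similarity equivalent to a dot product.
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# Higher cosine similarity means the passage is more relevant to the query.
scores = util.cos_sim(query_emb, passage_embs)
print(scores)
```

The same pattern scales to retrieval over the full US Code: pre-compute and store the passage embeddings once, then embed each incoming query and rank passages by cosine similarity.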