---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- legal
- USCode
license: apache-2.0
datasets:
- ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
- en
library_name: transformers
base_model:
- BAAI/bge-small-en-v1.5
fine_tuned_from: BAAI/bge-small-en-v1.5
task_categories:
- sentence-similarity
- embeddings
- feature-extraction
model-index:
- name: BGE-Small-LegalEmbeddings-USCode
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: USCode-QAPairs-Finetuning
      type: USCode-QAPairs-Finetuning
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 0.72
    - name: Recall
      type: Recall
      value: 0.75
    source:
      name: Evaluation on USLawQA Dataset
      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
---



# BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model ([BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)), trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. The model is optimized for generating embeddings for legal text and achieves 75% accuracy on the test set.

## Overview

- **Base Model**: BGE Small ([BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5))
- **Dataset**: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- **Training Details**:
  - **Hardware**: Google Colab (T4 GPU)
  - **Training Time**: 2 hours
- **Accuracy**: 75% on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## Applications

This model is ideal for:
- **Legal Text Retrieval**: Efficient semantic search across legal documents (see the sketch after this list).
- **Question Answering**: Answering legal queries based on context from the US Code.
- **Embeddings Generation**: Generating high-quality embeddings for downstream legal NLP tasks.
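
As a sketch of the retrieval use case, the snippet below ranks a few illustrative passages against a query by cosine similarity. It assumes the model loads through `sentence-transformers`; the passages and query are made up for demonstration and are not from the training data.

```python
# Illustrative retrieval sketch (assumes sentence-transformers can load the model;
# passages and query are examples only).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

passages = [
    "The President shall be Commander in Chief of the Army and Navy of the United States.",
    "The Congress shall have Power To lay and collect Taxes, Duties, Imposts and Excises.",
    "The judicial Power of the United States shall be vested in one supreme Court.",
]
query = "Who commands the armed forces?"

# Normalized embeddings make cosine similarity a simple dot product.
passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, passage_emb)[0]
best = int(scores.argmax())
print(f"Best match (score={scores[best].item():.3f}): {passages[best]}")
```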

## Usage

The model can be used either through `sentence-transformers` (via `model.encode`) or directly with `transformers`. Below is an example using `transformers`:

```python
# Load model and tokenizer directly with transformers
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# BGE models use the [CLS] token as the sentence embedding; normalize it for cosine similarity
embeddings = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(embeddings)
```
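
If the repository ships the usual `sentence-transformers` configuration (as its tags suggest), the same embeddings can be produced with `model.encode`; this is a minimal sketch under that assumption:

```python
# Minimal sketch, assuming the repo can be loaded by sentence-transformers (implied by its tags).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
embeddings = model.encode(["Duties of the president"], normalize_embeddings=True)
print(embeddings.shape)  # e.g. (1, 384) for a BGE-small backbone
```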

## Evaluation

The model was evaluated on the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) and achieved the following metrics:
- **Accuracy**: 75%
- **Task**: Semantic similarity and legal question answering.

## Related Resources

- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## 📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

**Archit Rastogi**  
📧 [[email protected]](mailto:[email protected])


---

## 📜 License

This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using this model or its training data.