---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- legal
- USCode
license: apache-2.0
datasets:
- ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
- en
library_name: transformers
base_model:
- BAAI/bge-small-en-v1.5
fine_tuned_from: BAAI/bge-small-en-v1.5
task_categories:
- sentence-similarity
- embeddings
- feature-extraction
model-index:
- name: BGE-Small-LegalEmbeddings-USCode
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: USCode-QAPairs-Finetuning
      type: USCode-QAPairs-Finetuning
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.72
    - name: Recall
      type: recall
      value: 0.75
    source:
      name: Evaluation on USLawQA Dataset
      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
---
# BGE-Small Fine-Tuned on USCode-QueryPairs
This is a fine-tuned version of the BGE Small embedding model, trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. The model is optimized for generating embeddings for legal text, achieving 72% accuracy and 75% recall on the test set.
## Overview
- **Base Model**: BGE Small ([BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5))
- **Dataset**: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- **Training Details**:
- **Hardware**: Google Colab (T4 GPU)
- **Training Time**: 2 hours
- **Results**: 72% accuracy and 75% recall on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)
## Applications
This model is ideal for:
- **Legal Text Retrieval**: Efficient semantic search across legal documents.
- **Question Answering**: Answering legal queries based on context from the US Code.
- **Embeddings Generation**: Generating high-quality embeddings for downstream legal NLP tasks.
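To make the retrieval use case above concrete, here is a minimal sketch of ranking passages by cosine similarity. The embeddings below are hypothetical placeholders; in practice they would be produced by this model.

```python
import numpy as np

def cosine_rank(query_emb, passage_embs):
    """Rank passages by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores), scores

# Hypothetical embeddings standing in for the model's output
query = np.array([0.9, 0.1, 0.0])
passages = np.array([
    [0.8, 0.2, 0.1],  # close to the query
    [0.0, 0.1, 0.9],  # unrelated
])
order, scores = cosine_rank(query, passages)
print(order[0])  # index of the best-matching passage
```

Because BGE embeddings are typically L2-normalized, the cosine similarity reduces to a dot product over the normalized vectors.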
## Usage
The model can be used to generate embeddings either through `sentence-transformers` (via `model.encode`) or directly with 🤗 Transformers, as in the snippet below:
```python
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# BGE models use the [CLS] token as the sentence embedding, L2-normalized
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(embedding.shape)  # torch.Size([1, 384])
```
## Evaluation
The model was evaluated on the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) and achieved the following metrics:
- **Accuracy**: 72%
- **Recall**: 75%
- **Task**: Semantic similarity and legal question answering
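As an illustration of how a retrieval accuracy figure like the one above can be computed, the sketch below scores each query against a set of passages and checks whether the top-ranked passage is the gold one. The similarity scores here are made up for illustration; this is not the actual evaluation code.

```python
import numpy as np

def top1_accuracy(similarity, gold):
    """Fraction of queries whose highest-scoring passage is the gold one.

    similarity: (n_queries, n_passages) score matrix
    gold: array of gold passage indices, one per query
    """
    predicted = similarity.argmax(axis=1)
    return float((predicted == gold).mean())

# Made-up similarity scores for 4 queries over 3 passages
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3],
    [0.2, 0.1, 0.7],
    [0.6, 0.7, 0.1],  # model ranks passage 1 first, gold is 0 -> miss
])
gold = np.array([0, 1, 2, 0])
print(top1_accuracy(sim, gold))  # 0.75
```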
## Related Resources
- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)
## 📧 Contact
For any inquiries, suggestions, or feedback, feel free to reach out:
**Archit Rastogi**
📧 [[email protected]](mailto:[email protected])
---
## 📜 License
This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using this model and its associated datasets.