---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- legal
- USCode
license: apache-2.0
datasets:
- ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
- en
library_name: transformers
base_model:
- BAAI/bge-small-en-v1.5
fine_tuned_from: sentence-transformers/BGE-Small
task_categories:
- sentence-similarity
- embeddings
- feature-extraction
model-index:
- name: BGE-Small-LegalEmbeddings-USCode
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: USCode-QAPairs-Finetuning
      type: USCode-QAPairs-Finetuning
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 0.72
    - name: Recall
      type: Recall
      value: 0.75
    source:
      name: Evaluation on USLawQA Dataset
      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
---

# BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model, trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. The model is optimized for generating embeddings of legal text and reaches 75% accuracy on the USLawQA test set.

## Overview

- **Base Model**: BGE Small (`BAAI/bge-small-en-v1.5`)
- **Dataset**: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- **Training Details**:
  - **Hardware**: Google Colab (T4 GPU)
  - **Training Time**: 2 hours
- **Accuracy**: 75% on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## Applications

This model is well suited to:

- **Legal Text Retrieval**: Efficient semantic search across legal documents.
- **Question Answering**: Answering legal queries based on context from the US Code.
- **Embeddings Generation**: Generating high-quality embeddings for downstream legal NLP tasks.
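
For the retrieval use case, passages are typically ranked by cosine similarity between a query embedding and each passage embedding. The sketch below illustrates only that ranking step in plain Python. The names `cosine_similarity` and `rank_passages` and the three-dimensional vectors are illustrative placeholders, not part of this model's API; in practice the vectors would be embeddings produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (index, score) pairs for the top_k passages most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder vectors standing in for real model embeddings (assumption).
query_vec = [0.9, 0.1, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],  # points in a different direction from the query
    [1.0, 0.0, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]

ranking = rank_passages(query_vec, passage_vecs, top_k=2)
print(ranking)
```

For a large corpus, passage embeddings would usually be computed once and stored, so that only the query needs to be embedded at search time.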
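
## Usage

The model can be loaded with the `transformers` library to generate embeddings. The example below takes the `[CLS]` token state as the sentence embedding and L2-normalizes it, following the convention of the BGE base model; this pooling choice is assumed to carry over to the fine-tuned checkpoint. If the checkpoint is loaded through `sentence-transformers` instead, `model.encode` returns embeddings directly.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from the Hugging Face Hub
model_id = "ArchitRastogi/BGE-Small-LegalEmbeddings-USCode"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "Duties of the president"
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Run the encoder without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Assumption: use the [CLS] token state as the sentence embedding (BGE convention)
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)

# Print the embedding shape and values
print(embedding.shape)
print(embedding)
```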

## Evaluation

The model was evaluated on the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA), with the following result:

- **Accuracy**: 75%
- **Task**: Semantic similarity and legal question answering.
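
This card does not state how the accuracy figure was computed. One common way to score an embedding model on question-to-passage retrieval is the top-k hit rate: the fraction of queries whose correct passage appears among the k highest-ranked results. The sketch below shows that calculation under stated assumptions. The function names, passage IDs, and rankings are hypothetical; they are not drawn from USLawQA or from the author's evaluation procedure.

```python
def hit_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the relevant passage is among the top k ranked IDs, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_hit_rate(all_ranked_ids, all_relevant_ids, k):
    """Average hit_at_k over all queries."""
    hits = [hit_at_k(ranked, rel, k) for ranked, rel in zip(all_ranked_ids, all_relevant_ids)]
    return sum(hits) / len(hits)


# Hypothetical ranked passage IDs for four queries, best match first.
rankings = [
    ["s12", "s7", "s3"],
    ["s4", "s9", "s1"],
    ["s8", "s2", "s5"],
    ["s6", "s10", "s11"],
]
# Hypothetical correct passage ID for each query.
gold = ["s12", "s9", "s0", "s13"]

top1 = mean_hit_rate(rankings, gold, k=1)
top3 = mean_hit_rate(rankings, gold, k=3)
print(f"top-1 hit rate: {top1:.2f}, top-3 hit rate: {top3:.2f}")
```

Reporting the value of k alongside the score makes results easier to compare across models.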

## Related Resources

- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## 📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

**Archit Rastogi**
📧 [[email protected]](mailto:[email protected])

---

## License

This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using this model.