pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - embeddings
  - legal
  - USCode
license: apache-2.0
datasets:
  - ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
  - en
library_name: transformers
base_model:
  - BAAI/bge-small-en-v1.5
fine_tuned_from: sentence-transformers/BGE-Small
task_categories:
  - sentence-similarity
  - embeddings
  - feature-extraction
model-index:
  - name: BGE-Small-LegalEmbeddings-USCode
    results:
      - task:
          type: sentence-similarity
        dataset:
          name: USCode-QAPairs-Finetuning
          type: USCode-QAPairs-Finetuning
        metrics:
          - name: Accuracy
            type: Accuracy
            value: 0.72
          - name: Recall
            type: Recall
            value: 0.75
        source:
          name: Evaluation on USLawQA Dataset
          url: https://huggingface.co/datasets/ArchitRastogi/USLawQA

BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model (BAAI/bge-small-en-v1.5), trained on the USCode-QueryPairs dataset, a subset of the USLawQA corpus. The model is optimized for generating embeddings of legal text and achieves 75% accuracy on the USLawQA test set.

Overview

  • Base Model: BGE Small
  • Dataset: USCode-QueryPairs
  • Training Details:
    • Hardware: Google Colab (T4 GPU)
    • Training Time: 2 hours
  • Accuracy: 75% on the test set from USLawQA

Applications

This model is intended for:

  • Legal Text Retrieval: Efficient semantic search across legal documents.
  • Question Answering: Answering legal queries based on context from the US Code.
  • Embeddings Generation: Generating high-quality embeddings for downstream legal NLP tasks.

Usage

The model can be loaded with the transformers library to generate embeddings. Below is an example usage snippet:

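# Load the tokenizer and model with the transformers library
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model.eval()

text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, no gradient tracking
    outputs = model(**inputs)

# BGE models use the [CLS] token's final hidden state as the sentence embedding;
# L2-normalize it so that dot products equal cosine similarities
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
print(embedding)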
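
Once texts are embedded, retrieval reduces to comparing vectors. The following is a minimal sketch, not taken from the original card, that ranks passages by cosine similarity to a query. The function names `cosine_similarity` and `rank_passages` are hypothetical, and the tiny 3-dimensional vectors are toy stand-ins for real model embeddings (the bge-small-en-v1.5 base model produces 384-dimensional vectors).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query, passages)
print(ranking[0][0])  # index of the best-matching passage
```

With real embeddings, `query` would be the vector for a legal question and `passages` the vectors for candidate US Code sections. For large collections, a vector index is usually preferable to this linear scan.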

Evaluation

The model was evaluated on the test set of USLawQA and achieved the following metrics:

  • Accuracy: 75%
  • Task: Semantic similarity and legal question answering.
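
The card does not state how accuracy was computed. As one plausible reading for a retrieval-style model, the sketch below computes top-k accuracy: the fraction of queries whose correct passage appears among the k highest-ranked results. The function name `top_k_accuracy` and the toy data are hypothetical; this is not the author's evaluation script.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_ids: Sequence[Sequence[int]], gold_ids: Sequence[int], k: int = 1
) -> float:
    """Fraction of queries whose gold passage id is in the top-k ranked ids."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one gold id per ranked list")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


# Toy example: four queries, each with passage ids ranked best-first.
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
gold = [2, 2, 0, 0]
print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=2))
```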

Related Resources

📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

Archit Rastogi
📧 [email protected]


📜 License

This model is distributed under the Apache 2.0 License. Please ensure compliance with applicable copyright laws when using this model.