---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- embeddings
- legal
- USCode
license: apache-2.0
datasets:
- ArchitRastogi/USCode-QAPairs-Finetuning
model_creator: Archit Rastogi
language:
- en
library_name: transformers
base_model:
- BAAI/bge-small-en-v1.5
fine_tuned_from: sentence-transformers/BGE-Small
task_categories:
- sentence-similarity
- embeddings
- feature-extraction
model-index:
- name: BGE-Small-LegalEmbeddings-USCode
  results:
  - task:
      type: sentence-similarity
    dataset:
      name: USCode-QAPairs-Finetuning
      type: USCode-QAPairs-Finetuning
    metrics:
    - name: Accuracy
      type: Accuracy
      value: 0.72
    - name: Recall
      type: Recall
      value: 0.75
    source:
      name: Evaluation on USLawQA Dataset
      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
---

# BGE-Small Fine-Tuned on USCode-QueryPairs

This is a fine-tuned version of the BGE Small embedding model, trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. The model is optimized for generating embeddings of legal text, reaching 72% accuracy and 75% recall on the held-out test set.

## Overview

- **Base Model**: BGE Small (`BAAI/bge-small-en-v1.5`)
- **Dataset**: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- **Training Details**:
  - **Hardware**: Google Colab (T4 GPU)
  - **Training Time**: 2 hours
  - **Test Metrics**: 72% accuracy, 75% recall on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## Applications

This model is ideal for:

- **Legal Text Retrieval**: Efficient semantic search across legal documents.
- **Question Answering**: Answering legal queries based on context from the US Code.
- **Embeddings Generation**: Producing high-quality embeddings for downstream legal NLP tasks.

## Usage

The model can be loaded directly with the `transformers` library, as shown below, or through `sentence-transformers` (via `model.encode`); a semantic-search sketch using the latter appears at the end of this card.

```python
# Load the model directly with transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

text = "Duties of the president"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# BGE models use the [CLS] token as the sentence embedding;
# normalize it so dot products equal cosine similarities.
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)

print(embedding.shape)  # torch.Size([1, 384])
```

## Evaluation

The model was evaluated on the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) and achieved the following results:

- **Accuracy**: 72%
- **Recall**: 75%
- **Task**: Semantic similarity and legal question answering.

## Related Resources

- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

## 📧 Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

**Archit Rastogi**
📧 [architrastogi20@gmail.com](mailto:architrastogi20@gmail.com)

---

## 📜 License

This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using it.
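
## Semantic Search Example

A minimal semantic-search sketch using `sentence-transformers` is shown below. The passages and the usage pattern are illustrative only; note that if the repository does not ship a sentence-transformers pooling config, the library falls back to mean pooling when loading a plain `transformers` checkpoint.

```python
# Minimal semantic-search sketch (assumes: pip install sentence-transformers).
# The passages below are illustrative, not drawn from the training data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")

query = "Duties of the president"
passages = [
    "The President shall be Commander in Chief of the Army and Navy of the United States.",
    "The Congress shall have Power To lay and collect Taxes, Duties, Imposts and Excises.",
]

# Encode the query and candidate passages; normalized embeddings make
# cosine similarity equivalent to a dot product.
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# Higher cosine similarity means the passage is more relevant to the query.
scores = util.cos_sim(query_emb, passage_embs)
print(scores)
```

The same pattern scales to retrieval over the full US Code: pre-compute and store the passage embeddings once, then embed each incoming query and rank passages by cosine similarity.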