
	---
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- embeddings
	- legal
	- USCode
	license: apache-2.0
	datasets:
	- ArchitRastogi/USCode-QAPairs-Finetuning
	model_creator: Archit Rastogi
	language:
	- en
	library_name: transformers
	base_model:
	- BAAI/bge-small-en-v1.5
	fine_tuned_from: sentence-transformers/BGE-Small
	task_categories:
	- sentence-similarity
	- embeddings
	- feature-extraction
	model-index:
	- name: BGE-Small-LegalEmbeddings-USCode
	  results:
	  - task:
	      type: sentence-similarity
	    dataset:
	      name: USCode-QAPairs-Finetuning
	      type: USCode-QAPairs-Finetuning
	    metrics:
	    - name: Accuracy
	      type: Accuracy
	      value: 0.72
	    - name: Recall
	      type: Recall
	      value: 0.75
	    source:
	      name: Evaluation on USLawQA Dataset
	      url: https://huggingface.co/datasets/ArchitRastogi/USLawQA
	---



	# BGE-Small Fine-Tuned on USCode-QueryPairs

	This is a fine-tuned version of the BGE Small embedding model (`BAAI/bge-small-en-v1.5`), trained on the [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs) dataset, a subset of the [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA) corpus. It generates embeddings for legal text and reaches 75% accuracy on the USLawQA test set.

	## Overview

	- Base Model: BGE Small (`BAAI/bge-small-en-v1.5`)
	- Dataset: [USCode-QueryPairs](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
	- Training Details:
	  - Hardware: Google Colab (T4 GPU)
	  - Training Time: 2 hours
	- Accuracy: 75% on the test set from [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

	## Applications

	This model is intended for:
	- Legal Text Retrieval: Efficient semantic search across legal documents.
	- Question Answering: Answering legal queries based on context from the US Code.
	- Embeddings Generation: Generating high-quality embeddings for downstream legal NLP tasks.
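
For the retrieval use case above, ranking usually comes down to comparing a query embedding against passage embeddings by cosine similarity. The following is a minimal sketch of that step, not the author's pipeline: the helper names `cosine_similarity` and `rank_passages` are hypothetical, and the three-dimensional vectors are made-up placeholders standing in for real model embeddings, which are much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_passages(query_embedding, passage_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k most similar passages."""
    scored = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(passage_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Placeholder vectors (assumption): real embeddings would come from the model.
query = [0.9, 0.1, 0.0]
passages = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With real embeddings, each tensor produced by the model can be converted to a plain list (for example with `.tolist()`) before being passed to `rank_passages`. For large corpora a vectorized or indexed search would likely be a better fit than this pure-Python loop.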

	## Usage

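	The model can be loaded with the `transformers` library. Like other BGE models, it uses the L2-normalized hidden state of the `[CLS]` token as the sentence embedding. Below is an example usage snippet:

	```python
	# Load the model and tokenizer directly
	import torch
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
	model = AutoModel.from_pretrained("ArchitRastogi/BGE-Small-LegalEmbeddings-USCode")
	model.eval()

	text = "Duties of the president"
	inputs = tokenizer(text, return_tensors="pt", truncation=True)

	# Run the forward pass without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)

	# Take the [CLS] token's hidden state and L2-normalize it to get the embedding
	embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)

	# Print the embedding
	print(embedding)
	```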

	## Evaluation

	The model was evaluated on semantic similarity and legal question answering using the test set of [USLawQA](https://huggingface.co/datasets/ArchitRastogi/USLawQA), with the following result:
	- Accuracy: 75%
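
The card does not say how the accuracy figure was computed. One common reading for retrieval models is top-k accuracy: the fraction of queries whose correct passage appears among the first k retrieved results. The sketch below illustrates that reading only. The function name `top_k_accuracy` and all passage ids are hypothetical, and this is not necessarily the metric behind the number reported here.

```python
def top_k_accuracy(ranked_ids, gold_ids, k=1):
    """Fraction of queries whose gold passage id is in the top-k ranked ids."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)

# Made-up rankings for four queries (assumption: ids are illustrative only).
ranked = [
    ["p1", "p7", "p3"],
    ["p4", "p2", "p9"],
    ["p5", "p6", "p8"],
    ["p2", "p1", "p0"],
]
gold = ["p1", "p2", "p8", "p9"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
print(top1, top3)
```

Reporting the value of k alongside the score makes results easier to compare, since the same model can score very differently at k=1 and k=3.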

	## Related Resources

	- [USCode-QueryPairs Dataset](https://huggingface.co/datasets/ArchitRastogi/USCode-QueryPairs)
	- [USLawQA Corpus](https://huggingface.co/datasets/ArchitRastogi/USLawQA)

	## 📧 Contact

	For any inquiries, suggestions, or feedback, feel free to reach out:

	Archit Rastogi
	📧 [[email protected]](mailto:[email protected])


	---

	## 📜 License

	This model is distributed under the [Apache 2.0 License](LICENSE). Please ensure compliance with applicable copyright laws when using this model.