# SRBedding: Information Retrieval Model for Serbian

## Model Description
This model is a multilingual-e5-base model fine-tuned for information retrieval in Serbian. It builds on the multilingual e5 base model and adapts it specifically to the Serbian language. The model uses a Sentence Transformer architecture, which is particularly effective for semantic search, clustering, and other text-similarity tasks.
## Key Features
- Base Model: multilingual-e5-base
- Task: Information Retrieval
- Language: Serbian
- Vector Dimension: 768
- Max Sequence Length: 512
## Detailed Description
The model maps sentences and paragraphs in Serbian to a 768-dimensional dense vector space. This dense representation allows for efficient and accurate similarity comparisons between different pieces of text. The maximum sequence length of 512 tokens ensures that the model can handle relatively long text inputs, making it suitable for a wide range of information retrieval tasks.
The fine-tuning process focused on adapting the model to the nuances and specifics of the Serbian language, enhancing its performance on Serbian text beyond the capabilities of the original multilingual model. This specialized training allows the model to capture semantic relationships and contextual information specific to Serbian, resulting in more accurate and relevant information retrieval results.
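To illustrate how a dense vector space supports similarity comparison, here is a minimal sketch using random 768-dimensional vectors as stand-ins for real sentence embeddings (NumPy only; not tied to this model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional sentence embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + 0.1 * rng.normal(size=768)  # a slightly perturbed, "semantically close" vector
emb_c = rng.normal(size=768)                # an unrelated vector

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # close to 0.0
```

In a real retrieval setup, the embeddings come from `model.encode(...)` and the same cosine (or dot-product) comparison ranks candidate passages against a query.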
## Training Data
The model was fine-tuned on the serbian_qa dataset, which can be found at: https://huggingface.co/datasets/smartcat/serbian_qa
This dataset comprises context-query pairs specifically curated for Serbian language tasks. The contexts are derived from diverse sources to ensure a broad coverage of topics and language use:
- Serbian Wikipedia: Providing a wide range of factual and encyclopedic content
- Serbian news articles: Offering current affairs and contemporary language use
- A Serbian novel: Including more literary and narrative language styles
The queries in the dataset were automatically generated using the GPT-4 model, ensuring a variety of question types and formulations. This approach helps in creating a robust model capable of handling diverse query styles and intentions.
## Evaluation
The model's performance was evaluated using the InformationRetrievalEvaluator from the Sentence Transformers library. The evaluation was conducted on three distinct datasets, all translated to Serbian to match the model's target language:
### MS MARCO (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/ms_marco_sr
- Scope: First 8000 samples
- Description: A large-scale dataset based on real-world queries and web documents, providing a realistic evaluation scenario.
### Natural Questions (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/natural_quesions_sr
- Scope: First 8000 samples
- Description: Contains real questions from Google Search, paired with answers from Wikipedia, offering a diverse range of question types and topics.
### SQuAD (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/squad_sr
- Scope: Full dataset
- Description: A reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
The use of these translated datasets ensures that the model's performance is thoroughly tested on a wide range of query types, topics, and complexity levels, all in the Serbian language context.
## Evaluation Results
With far fewer parameters, the model outperforms OpenAI's text-embedding-3-small on two of the three evaluation datasets and comes close to embedic-base.
| Model name | Dataset | dot-Recall@10 | dot-MRR@10 | dot-NDCG@10 | dot-MAP@100 |
|---|---|---|---|---|---|
| text-embedding-3-small | MSMARCO | 0.936 | 0.431 | 0.551 | 0.434 |
| text-embedding-3-small | NQ | 0.876 | 0.749 | 0.780 | 0.753 |
| text-embedding-3-small | SQuAD | 0.840 | 0.622 | 0.674 | 0.628 |
| multilingual-e5-large | MSMARCO | 0.957 | 0.487 | 0.601 | 0.490 |
| multilingual-e5-large | NQ | 0.894 | 0.761 | 0.794 | 0.765 |
| multilingual-e5-large | SQuAD | 0.945 | 0.774 | 0.816 | 0.776 |
| embedic-base | MSMARCO | **0.972** | **0.502** | **0.615** | **0.503** |
| embedic-base | NQ | **0.917** | **0.796** | **0.826** | **0.799** |
| embedic-base | SQuAD | **0.964** | **0.824** | **0.859** | **0.826** |
| SRBedding | MSMARCO | 0.938 | 0.471 | 0.582 | 0.473 |
| SRBedding | NQ | 0.835 | 0.690 | 0.725 | 0.695 |
| SRBedding | SQuAD | 0.860 | 0.646 | 0.698 | 0.652 |
Results of model evaluation on MSMARCO, Natural Questions (NQ), and SQuAD in Serbian. The best results overall (embedic-base) are shown in bold; our model, SRBedding, outperformed OpenAI's text-embedding-3-small on MSMARCO and SQuAD.
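The metrics above can be sketched as follows, under the simplifying assumption of exactly one relevant passage per query (the actual InformationRetrievalEvaluator also handles multiple relevant documents; `ranks` below is illustrative toy data, not from this evaluation):

```python
import math

def recall_at_k(rank, k=10):
    # rank is the 1-based position of the single relevant passage, or None if not retrieved
    return 1.0 if rank is not None and rank <= k else 0.0

def mrr_at_k(rank, k=10):
    # Reciprocal rank, counted only if the relevant passage appears in the top k
    return 1.0 / rank if rank is not None and rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    # With a single relevant passage the ideal DCG is 1, so NDCG reduces to the discount term
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0

# Ranks of the relevant passage for three toy queries (None = not retrieved)
ranks = [1, 3, None]
print(sum(recall_at_k(r) for r in ranks) / len(ranks))  # Recall@10 ≈ 0.667
print(sum(mrr_at_k(r) for r in ranks) / len(ranks))     # MRR@10 ≈ 0.444
print(sum(ndcg_at_k(r) for r in ranks) / len(ranks))    # NDCG@10 = 0.5
```

The "dot" prefix in the table indicates that passages were ranked by dot-product similarity between query and passage embeddings before these rank-based metrics were computed.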
## Usage
Here's how to load and use the model with Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('smartcat/SRBedding-base-v1')

# Example sentences
sentence1 = "Ово је пример реченице на српском језику."
sentence2 = "Ова реченица је слична првој."

# Encode sentences into 768-dimensional embeddings
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

# Calculate cosine similarity
similarity = util.cos_sim(embedding1, embedding2)
print(f"Similarity between the sentences: {similarity.item():.4f}")

# Example of finding the most similar sentence
sentences = [
    "Београд је главни град Србије.",
    "Нови Сад је град у Војводини.",
    "Србија је земља у југоисточној Европи."
]
query = "Који је највећи град у Србији?"

# Encode all sentences and the query
sentence_embeddings = model.encode(sentences)
query_embedding = model.encode(query)

# Calculate similarities between the query and each sentence
similarities = util.cos_sim(query_embedding, sentence_embeddings)[0]

# Find the most similar sentence
most_similar_idx = similarities.argmax().item()
print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx].item():.4f}")
```
This code demonstrates how to load the model, encode sentences, calculate similarities, and find the most similar sentence to a given query.
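Note that models in the e5 family are typically trained with `"query: "` and `"passage: "` text prefixes; whether this fine-tuned model expects them is not stated here, so treat the helper below as an assumption to verify against your own retrieval quality:

```python
def add_e5_prefix(texts, kind):
    """Prepend an e5-style role prefix to each text.

    Assumption: the base multilingual-e5 model was trained with
    "query: " / "passage: " prefixes; verify whether SRBedding
    retains this convention before relying on it.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

queries = add_e5_prefix(["Који је највећи град у Србији?"], "query")
passages = add_e5_prefix(["Београд је главни град Србије."], "passage")
print(queries[0])   # query: Који је највећи град у Србији?
```

If prefixes are required, pass the prefixed strings to `model.encode(...)` in place of the raw sentences shown above.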