SRBedding Information Retrieval Model for Serbian

Model Description

This model is a multilingual-e5-base model fine-tuned for information retrieval tasks in Serbian. It adapts the multilingual E5 base model specifically for Serbian-language processing. The model uses a Sentence Transformer architecture, which is particularly effective for semantic search, clustering, and other text-similarity tasks.

Key Features

  • Base Model: multilingual-e5-base
  • Task: Information Retrieval
  • Language: Serbian
  • Vector Dimension: 768
  • Max Sequence Length: 512

Detailed Description

The model maps sentences and paragraphs in Serbian to a 768-dimensional dense vector space. This dense representation allows for efficient and accurate similarity comparisons between different pieces of text. The maximum sequence length of 512 tokens ensures that the model can handle relatively long text inputs, making it suitable for a wide range of information retrieval tasks.
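As a quick sanity check, the minimal sketch below (using the same model id as in the Usage section further down) encodes a short Serbian sentence and prints the embedding dimension and maximum sequence length reported by the model.

from sentence_transformers import SentenceTransformer

# Load the model (same model id as in the Usage section below)
model = SentenceTransformer('smartcat/SRBedding-base-v1')

# Encode a short Serbian sentence and inspect the resulting vector
embedding = model.encode("Кратак пример текста на српском језику.")
print(embedding.shape)                           # (768,)
print(model.get_sentence_embedding_dimension())  # 768
print(model.max_seq_length)                      # 512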

The fine-tuning process focused on adapting the model to the nuances and specifics of the Serbian language, enhancing its performance on Serbian text beyond the capabilities of the original multilingual model. This specialized training allows the model to capture semantic relationships and contextual information specific to Serbian, resulting in more accurate and relevant information retrieval results.

Training Data

The model was fine-tuned on the serbian_qa dataset, which can be found at: https://huggingface.co/datasets/smartcat/serbian_qa

This dataset comprises context-query pairs specifically curated for Serbian language tasks. The contexts are derived from diverse sources to ensure broad coverage of topics and language use:

  • Serbian Wikipedia: Providing a wide range of factual and encyclopedic content
  • Serbian news articles: Offering current affairs and contemporary language use
  • A Serbian novel: Including more literary and narrative language styles

The queries in the dataset were automatically generated with GPT-4, which ensures a variety of question types and formulations and helps the model handle diverse query styles and intents.
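For reference, the dataset can be inspected with the Hugging Face datasets library. The sketch below prints the available splits and columns rather than assuming specific field names; the "train" split name is an assumption.

from datasets import load_dataset

# Load the fine-tuning data used for SRBedding
dataset = load_dataset("smartcat/serbian_qa")

print(dataset)              # available splits and column names
print(dataset["train"][0])  # one context-query example, assuming a "train" split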

Evaluation

The model's performance was evaluated using the InformationRetrievalEvaluator from the Sentence Transformers library, which ensures that the model performs well on standard information retrieval tasks. The evaluation was conducted on three distinct datasets, all translated to Serbian to match the model's target language (a short setup sketch follows the list below):

  1. MS MARCO (Serbian version)

  2. Natural Questions (Serbian version)

  3. SQuAD (Serbian version)

The use of these translated datasets ensures that the model's performance is thoroughly tested on a wide range of query types, topics, and complexity levels, all in the Serbian language context.
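The snippet below is a hedged sketch of this kind of evaluation setup with toy placeholder data; the actual evaluation used the translated MS MARCO, Natural Questions, and SQuAD datasets listed above.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer('smartcat/SRBedding-base-v1')

# Toy placeholder data, not the actual translated evaluation sets
queries = {"q1": "Који је главни град Србије?"}
corpus = {
    "d1": "Београд је главни град Србије.",
    "d2": "Нови Сад је град у Војводини.",
}
relevant_docs = {"q1": {"d1"}}  # query id -> set of relevant corpus ids

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy-ir")
print(evaluator(model))  # primary score or full metric dict, depending on the library version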

Evaluation Results

With far fewer parameters, the model outperforms OpenAI's text-embedding-3-small on two of the three evaluation datasets and comes close to embedic-base.

| Model name | Dataset | dot Recall@10 | dot MRR@10 | dot NDCG@10 | dot MAP@100 |
|---|---|---|---|---|---|
| text-embedding-3-small | MSMARCO | 0.936 | 0.431 | 0.551 | 0.434 |
| | NQ | 0.876 | 0.749 | 0.780 | 0.753 |
| | SQuAD | 0.840 | 0.622 | 0.674 | 0.628 |
| multilingual-e5-large | MSMARCO | 0.957 | 0.487 | 0.601 | 0.490 |
| | NQ | 0.894 | 0.761 | 0.794 | 0.765 |
| | SQuAD | 0.945 | 0.774 | 0.816 | 0.776 |
| embedic-base | MSMARCO | 0.972 | 0.502 | 0.615 | 0.503 |
| | NQ | 0.917 | 0.796 | 0.826 | 0.799 |
| | SQuAD | 0.964 | 0.824 | 0.859 | 0.826 |
| SRBedding | MSMARCO | 0.938 | 0.471 | 0.582 | 0.473 |
| | NQ | 0.835 | 0.690 | 0.725 | 0.695 |
| | SQuAD | 0.860 | 0.646 | 0.698 | 0.652 |

Results of model evaluation on the MSMARCO, Natural Questions (NQ), and SQuAD datasets in Serbian. embedic-base achieves the best overall results, while our model, SRBedding, outperforms OpenAI's text-embedding-3-small on MSMARCO and SQuAD.

Usage

Here's how to load and use the model with Sentence Transformers:

from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('smartcat/SRBedding-base-v1')

# Example sentences
sentence1 = "Ово је пример реченице на српском језику."
sentence2 = "Ова реченица је слична првој."

# Encode sentences
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

# Calculate cosine similarity
similarity = util.pytorch_cos_sim(embedding1, embedding2)

print(f"Similarity between the sentences: {similarity.item():.4f}")

# Example of finding the most similar sentence
sentences = [
    "Београд је главни град Србије.",
    "Нови Сад је град у Војводини.",
    "Србија је земља у југоисточној Европи."
]

query = "Који је највећи град у Србији?"

# Encode all sentences and the query
sentence_embeddings = model.encode(sentences)
query_embedding = model.encode(query)

# Calculate similarities
similarities = util.pytorch_cos_sim(query_embedding, sentence_embeddings)[0]

# Find the most similar sentence
most_similar_idx = similarities.argmax()

print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.4f}")

This code demonstrates how to load the model, encode sentences, calculate similarities, and find the most similar sentence to a given query.
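For retrieval over a larger corpus, the same embeddings can also be used with util.semantic_search; the following is a small sketch of that pattern rather than part of the original card.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartcat/SRBedding-base-v1')

corpus = [
    "Београд је главни град Србије.",
    "Нови Сад је град у Војводини.",
    "Србија је земља у југоисточној Европи."
]

# Encode the corpus once and reuse the embeddings for every query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Који је највећи град у Србији?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar passages for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")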
