# SRBedding: Information Retrieval Model for Serbian

## Model Description
This model is a multilingual-e5-base model fine-tuned for information retrieval in Serbian. It builds on the multilingual e5 base model and adapts it specifically to the Serbian language. The model uses a Sentence Transformer architecture, which is particularly effective for semantic search, clustering, and other text-similarity tasks.
## Key Features
- Base Model: multilingual-e5-base
- Task: Information Retrieval
- Language: Serbian
- Vector Dimension: 768
- Max Sequence Length: 512
## Detailed Description
The model maps sentences and paragraphs in Serbian to a 768-dimensional dense vector space. This dense representation allows for efficient and accurate similarity comparisons between different pieces of text. The maximum sequence length of 512 tokens ensures that the model can handle relatively long text inputs, making it suitable for a wide range of information retrieval tasks.
The fine-tuning process focused on adapting the model to the nuances and specifics of the Serbian language, enhancing its performance on Serbian text beyond the capabilities of the original multilingual model. This specialized training allows the model to capture semantic relationships and contextual information specific to Serbian, resulting in more accurate and relevant information retrieval results.
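To illustrate how a dense vector space supports similarity comparison, here is a minimal sketch using random 768-dimensional vectors as stand-ins for real sentence embeddings (NumPy only; not tied to this model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional sentence embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + 0.1 * rng.normal(size=768)  # a slightly perturbed, "semantically close" vector
emb_c = rng.normal(size=768)                # an unrelated vector

print(cosine_similarity(emb_a, emb_b))  # close to 1.0
print(cosine_similarity(emb_a, emb_c))  # close to 0.0
```

In a real retrieval setup, the embeddings come from `model.encode(...)` and the same cosine (or dot-product) comparison ranks candidate passages against a query.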
## Training Data
The model was fine-tuned on the serbian_qa dataset, which can be found at: https://huggingface.co/datasets/smartcat/serbian_qa
This dataset comprises context-query pairs specifically curated for Serbian language tasks. The contexts are derived from diverse sources to ensure a broad coverage of topics and language use:
- Serbian Wikipedia: Providing a wide range of factual and encyclopedic content
- Serbian news articles: Offering current affairs and contemporary language use
- A Serbian novel: Including more literary and narrative language styles
The queries in the dataset were automatically generated using the GPT-4 model, ensuring a variety of question types and formulations. This approach helps in creating a robust model capable of handling diverse query styles and intentions.
## Evaluation
The model's performance was evaluated using the InformationRetrievalEvaluator from the Sentence Transformers library. The evaluation was conducted on three distinct datasets, all translated to Serbian to match the model's target language:
### MS MARCO (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/ms_marco_sr
- Scope: First 8000 samples
- Description: A large-scale dataset based on real-world queries and web documents, providing a realistic evaluation scenario.
### Natural Questions (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/natural_quesions_sr
- Scope: First 8000 samples
- Description: Contains real questions from Google Search, paired with answers from Wikipedia, offering a diverse range of question types and topics.
### SQuAD (Serbian version)
- Dataset: https://huggingface.co/datasets/smartcat/squad_sr
- Scope: Full dataset
- Description: A reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
The use of these translated datasets ensures that the model's performance is thoroughly tested on a wide range of query types, topics, and complexity levels, all in the Serbian language context.
## Evaluation Results
With far fewer parameters, the model outperforms OpenAI's text-embedding-3-small on two of the three evaluation datasets and comes close to embedic-base.
| Model name | Dataset | dot-Recall@10 | dot-MRR@10 | dot-NDCG@10 | dot-MAP@100 |
|---|---|---|---|---|---|
| text-embedding-3-small | MSMARCO | 0.936 | 0.431 | 0.551 | 0.434 |
| text-embedding-3-small | NQ | 0.876 | 0.749 | 0.780 | 0.753 |
| text-embedding-3-small | SQuAD | 0.840 | 0.622 | 0.674 | 0.628 |
| multilingual-e5-large | MSMARCO | 0.957 | 0.487 | 0.601 | 0.490 |
| multilingual-e5-large | NQ | 0.894 | 0.761 | 0.794 | 0.765 |
| multilingual-e5-large | SQuAD | 0.945 | 0.774 | 0.816 | 0.776 |
| embedic-base | MSMARCO | **0.972** | **0.502** | **0.615** | **0.503** |
| embedic-base | NQ | **0.917** | **0.796** | **0.826** | **0.799** |
| embedic-base | SQuAD | **0.964** | **0.824** | **0.859** | **0.826** |
| SRBedding | MSMARCO | 0.938 | 0.471 | 0.582 | 0.473 |
| SRBedding | NQ | 0.835 | 0.690 | 0.725 | 0.695 |
| SRBedding | SQuAD | 0.860 | 0.646 | 0.698 | 0.652 |
Results of model evaluation on MSMARCO, Natural Questions (NQ), and SQuAD in Serbian. The best results overall (embedic-base) are shown in bold; our model, SRBedding, outperformed OpenAI's text-embedding-3-small on MSMARCO and SQuAD.
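The metrics above can be sketched as follows, under the simplifying assumption of exactly one relevant passage per query (the actual InformationRetrievalEvaluator also handles multiple relevant documents; `ranks` below is illustrative toy data, not from this evaluation):

```python
import math

def recall_at_k(rank, k=10):
    # rank is the 1-based position of the single relevant passage, or None if not retrieved
    return 1.0 if rank is not None and rank <= k else 0.0

def mrr_at_k(rank, k=10):
    # Reciprocal rank, counted only if the relevant passage appears in the top k
    return 1.0 / rank if rank is not None and rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    # With a single relevant passage the ideal DCG is 1, so NDCG reduces to the discount term
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0

# Ranks of the relevant passage for three toy queries (None = not retrieved)
ranks = [1, 3, None]
print(sum(recall_at_k(r) for r in ranks) / len(ranks))  # Recall@10 ≈ 0.667
print(sum(mrr_at_k(r) for r in ranks) / len(ranks))     # MRR@10 ≈ 0.444
print(sum(ndcg_at_k(r) for r in ranks) / len(ranks))    # NDCG@10 = 0.5
```

The "dot" prefix in the table indicates that passages were ranked by dot-product similarity between query and passage embeddings before these rank-based metrics were computed.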
## Usage
Here's how to load and use the model with Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('smartcat/SRBedding-base-v1')

# Example sentences
sentence1 = "Ово је пример реченице на српском језику."
sentence2 = "Ова реченица је слична првој."

# Encode sentences into 768-dimensional embeddings
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

# Calculate cosine similarity
similarity = util.cos_sim(embedding1, embedding2)
print(f"Similarity between the sentences: {similarity.item():.4f}")

# Example of finding the most similar sentence
sentences = [
    "Београд је главни град Србије.",
    "Нови Сад је град у Војводини.",
    "Србија је земља у југоисточној Европи."
]
query = "Који је највећи град у Србији?"

# Encode all sentences and the query
sentence_embeddings = model.encode(sentences)
query_embedding = model.encode(query)

# Calculate similarities between the query and each sentence
similarities = util.cos_sim(query_embedding, sentence_embeddings)[0]

# Find the most similar sentence
most_similar_idx = similarities.argmax().item()
print(f"Query: {query}")
print(f"Most similar sentence: {sentences[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx].item():.4f}")
```
This code demonstrates how to load the model, encode sentences, calculate similarities, and find the most similar sentence to a given query.
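Note that models in the e5 family are typically trained with `"query: "` and `"passage: "` text prefixes; whether this fine-tuned model expects them is not stated here, so treat the helper below as an assumption to verify against your own retrieval quality:

```python
def add_e5_prefix(texts, kind):
    """Prepend an e5-style role prefix to each text.

    Assumption: the base multilingual-e5 model was trained with
    "query: " / "passage: " prefixes; verify whether SRBedding
    retains this convention before relying on it.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

queries = add_e5_prefix(["Који је највећи град у Србији?"], "query")
passages = add_e5_prefix(["Београд је главни град Србије."], "passage")
print(queries[0])   # query: Који је највећи град у Србији?
```

If prefixes are required, pass the prefixed strings to `model.encode(...)` in place of the raw sentences shown above.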