---
language:
- eng
- hin
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
widget:
- source_sentence: प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए
  sentences:
  - प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए
  - Pranav studied law and became a politician at the age of 30.
  - Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye.
  - Pranav ne law ki padhai kari aur 30 ki umar mein politics se jud gaye.
  - प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था
  - Pranav was born in a family of politicians
  - Pranav ka janm rajneetigyon ke parivar mein hua tha
  - Pranav ka janm politicians ki family mein hua tha
- source_sentence: Is baar diwali par main 15 din ke liye ghar ja rahi hoon
  sentences:
  - october mein deepawali ki chhutiyan hai sabhi ke.
  - Mere parivaar mein tyoharon pe devi puja ki parampara hai
  - Pavan ne kanoon ki padhai ki aur 30 ki umar mein rajneeti se jud gaye
pipeline_tag: sentence-similarity
license: apache-2.0
---

## <font color="#488AC7"> Bhasha embed v0 model </font>

This is an embedding model for text in Hindi (Devanagari script), English, and Romanised Hindi.
Many multilingual embedding models work well on Hindi and English texts individually, but they lack the following capabilities:

1. **Romanised Hindi support**: This is the first embedding model to support Romanised Hindi (transliterated Hindi / hin_Latn).
2. **Cross-lingual alignment**: This model outputs language-agnostic embeddings. This enables querying a multilingual candidate pool containing a mix of Hindi, English, and Romanised Hindi texts.
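
As a rough illustration of querying a mixed-language candidate pool, the sketch below ranks candidates by cosine similarity. It assumes the embeddings have already been computed elsewhere (for example with `model.encode`) and substitutes toy 3-dimensional vectors for them; the helper name `rank_candidates` and all vector values are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_candidates(query_vec, pool_vecs, pool_texts, top_k=3):
    """Return the top_k (text, score) pairs by cosine similarity."""
    # L2-normalise both sides so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    scores = p @ q
    # Sort by descending score and keep the best top_k indices.
    order = np.argsort(-scores)[:top_k]
    return [(pool_texts[i], float(scores[i])) for i in order]


# Toy vectors standing in for real embeddings (assumption, not model output).
pool_texts = [
    "प्रणव ने कानून की पढ़ाई की",                      # Hindi (Devanagari)
    "Pranav studied law",                          # English
    "Pranav ne kanoon ki padhai kari",             # Romanised Hindi
    "Pranav was born in a family of politicians",  # unrelated English text
]
pool_vecs = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.20],
])
query_vec = np.array([1.0, 0.1, 0.0])

ranked = rank_candidates(query_vec, pool_vecs, pool_texts, top_k=3)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Because the pool is ranked purely on vector similarity, the script or language of each candidate plays no role in the ranking step itself; that only works in practice if the embeddings are language-agnostic, as described above.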

## <font color="#488AC7"> Model Details </font>
- **Supported Languages:** Hindi, English, Romanised Hindi
- **Base model:** [google/muril-base-cased](https://huggingface.co/google/muril-base-cased)
- **Training GPUs:** 1xRTX4090
- **Training methodology:** Distillation from an English embedding model and fine-tuning on triplet data.
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
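
To make the similarity function concrete, here is a minimal sketch of cosine similarity in NumPy. The random vectors are stand-ins for two 768-dimensional embeddings, and `cosine_similarity` is a hypothetical helper, not something shipped with the model. It also shows that, once vectors are L2-normalised, a plain dot product gives the same value.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for two 768-dimensional embeddings (not real model output).
rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# L2-normalise each vector to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_direct = cosine_similarity(a, b)
cos_via_dot = float(a_unit @ b_unit)

# The two routes agree up to floating-point error.
print(np.isclose(cos_direct, cos_via_dot))
```

This equivalence is why normalised embeddings can be compared with a single matrix multiplication.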

### Model Sources

- **Repository:** [github_link](https://github.com/akshita-sukhlecha/bhasha-embed)
- **Developer:** [Akshita Sukhlecha](https://www.linkedin.com/in/akshita-sukhlecha/)

---

## <font color="#488AC7"> Results </font>

<img src="assets/results_model_legend.png" width=350 height=100>

<b>Results for English-Hindi cross-lingual alignment</b> : Tasks with corpus containing texts in Hindi as well as English
<img src="assets/results_cross_lingual_alignment.png" width=1200 height=200>

<b>Results for Romanised Hindi tasks</b> : Tasks with texts in Romanised Hindi
<img src="assets/results_hin_latn.png" width=600 height=200>

<b>Results for retrieval tasks with multilingual corpus</b> : Retrieval task with corpus containing texts in Hindi, English as well as Romanised Hindi
<img src="assets/results_retrieval_belebele.png" width=400 height=200>

<b>Results for Hindi tasks</b> : Tasks with texts in Hindi (Devanagari script)
<img src="assets/results_hindi.png" width=1000 height=300>

### Additional information
- Some task dataset links: [Belebele](https://huggingface.co/datasets/facebook/belebele), [MLQA](https://huggingface.co/datasets/facebook/mlqa), [XQuAD](https://huggingface.co/datasets/google/xquad), [SemRel24](https://huggingface.co/datasets/SemRel/SemRel2024)
- hin_Latn tasks: Most hin_Latn tasks were created by transliterating Hindi texts using the [indic-trans library](https://github.com/libindic/indic-trans)
- Detailed results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/results/all_results.csv)
- Script to reproduce the results: [github_link](https://github.com/akshita-sukhlecha/bhasha-embed/blob/main/eval/evaluator.py)

---

## <font color="#488AC7"> Sample outputs </font>

### Example 1

<img src="assets/example_1.png" width=1200 height=200>

### Example 2

<img src="assets/example_2.png" width=1200 height=200>

### Example 3

<img src="assets/example_3.png" width=1200 height=200>

### Example 4

<img src="assets/example_4.png" width=1200 height=200>

---

## <font color="#488AC7"> Usage </font>

The examples below encode queries and passages and compute similarity scores using Sentence Transformers and 🤗 Transformers.

### Using Sentence Transformers

First install the Sentence Transformers library (`pip install -U sentence-transformers`), then run the following code:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AkshitaS/bhasha-embed-v0")

# The same sentence in Hindi (Devanagari), English and Romanised Hindi.
queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

# Normalised embeddings let the dot product serve as cosine similarity.
query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)

similarity_matrix = (query_embeddings @ document_embeddings.T)
print(similarity_matrix.shape)
# (3, 6)
print(np.round(similarity_matrix, 2))
#[[1.00 0.97 0.97 0.92 0.90 0.91]
# [0.97 1.00 0.96 0.90 0.91 0.91]
# [0.97 0.96 1.00 0.89 0.90 0.92]]
```

### Using 🤗 Transformers

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the real tokens only.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


model_id = "AkshitaS/bhasha-embed-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

queries = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye"
]
documents = [
    "प्रणव ने कानून की पढ़ाई की और ३० की उम्र में राजनीति से जुड़ गए",
    "Pranav studied law and became a politician at the age of 30.",
    "Pranav ne kanoon ki padhai kari aur 30 ki umar mein rajneeti se jud gaye",
    "प्रणव का जन्म राजनीतिज्ञों के परिवार में हुआ था",
    "Pranav was born in a family of politicians",
    "Pranav ka janm rajneetigyon ke parivar mein hua tha"
]

# Encode queries and documents in a single padded batch.
input_texts = queries + documents
batch_dict = tokenizer(input_texts, padding=True, truncation=True, return_tensors='pt')
# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# L2-normalise so the dot product equals cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
similarity_matrix = (embeddings[:len(queries)] @ embeddings[len(queries):].T).detach().numpy()
print(similarity_matrix.shape)
# (3, 6)
print(np.round(similarity_matrix, 2))
#[[1.00 0.97 0.97 0.92 0.90 0.91]
# [0.97 1.00 0.96 0.90 0.91 0.91]
# [0.97 0.96 1.00 0.89 0.90 0.92]]
```
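
The `average_pool` step above can be easier to follow on a tiny example. The sketch below re-expresses the same masked mean in NumPy on hand-written numbers; `average_pool_np` and the toy arrays are hypothetical and exist only to show that padded positions do not influence the pooled vector.

```python
import numpy as np


def average_pool_np(last_hidden_states, attention_mask):
    """Masked mean over the token axis, written with NumPy."""
    # Broadcast the mask over the hidden dimension: (batch, tokens, 1).
    mask = attention_mask[..., None].astype(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1)  # number of real tokens per sequence
    return summed / counts


# Batch of 2 sequences, 3 token positions, hidden size 2 (toy values).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = average_pool_np(hidden, mask)
# First row averages only [1, 2] and [3, 4]; the padded [100, 100] is ignored.
print(pooled)
```

Dividing by the count of real tokens rather than the padded length keeps short sentences from being scaled down when they are batched with longer ones.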

### Citation
To cite this model:
```
@misc{sukhlecha_2024_bhasha_embed_v0,
  author = {Sukhlecha, Akshita},
  title = {Bhasha-embed-v0},
  howpublished = {Hugging Face},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/AkshitaS/bhasha-embed-v0}
}
```