Osiria "Water" Series 💧
Collection
This collection is composed of adaptable and flexible models, which blend robustness and usability
This is a Universal Sentence Encoder [1] model for the Italian language. It was obtained by taking mDistilUSE (distiluse-base-multilingual-cased-v1) as a starting point and specializing it for Italian by modifying the embedding layer (as in [2], computing document-level frequencies over the Wikipedia dataset).
The resulting model has 67M parameters, a vocabulary of 30,785 tokens, and a size of ~270 MB.
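To make the idea of document-level frequencies concrete, here is a minimal sketch of frequency-based vocabulary selection. It is not the procedure actually used for this model: the toy corpus, the whitespace tokenizer, the helper names `document_frequencies` and `select_vocabulary`, and the `min_ratio` threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def document_frequencies(documents, tokenize):
    """Count, for each token, how many documents contain it at least once."""
    df = Counter()
    for doc in documents:
        # set() ensures a token is counted once per document
        df.update(set(tokenize(doc)))
    return df


def select_vocabulary(df, num_documents, min_ratio):
    """Keep tokens that occur in at least `min_ratio` of the documents."""
    return sorted(tok for tok, count in df.items()
                  if count / num_documents >= min_ratio)


# Toy stand-ins (assumptions): a three-document corpus and a whitespace tokenizer
corpus = [
    "il gatto dorme",
    "il cane corre",
    "the cat sleeps",
]
df = document_frequencies(corpus, str.split)
kept = select_vocabulary(df, len(corpus), min_ratio=0.5)
print(kept)
```

In a real setting the tokenizer would be the model's own subword tokenizer and the corpus would be an Italian text collection; the embedding rows of the tokens that are not kept could then be dropped to shrink the embedding layer.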
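As a rough sanity check, the parameter count and the file size are consistent with each other if the weights are stored as 32-bit floats. That storage format is an assumption, not something stated in this card.

```python
num_parameters = 67_000_000  # parameter count reported above
bytes_per_parameter = 4      # assumption: float32 weights

# Convert bytes to megabytes (1 MB = 1,000,000 bytes)
size_mb = num_parameters * bytes_per_parameter / 1_000_000
print(f"Estimated size: {size_mb:.0f} MB")
```

The estimate comes out at 268 MB, in line with the ~270 MB reported.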
It can be used to encode Italian texts and compute similarities between them.
from transformers import AutoTokenizer, AutoModel
import numpy as np

# Load the tokenizer and the encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("osiria/distiluse-base-italian")
model = AutoModel.from_pretrained("osiria/distiluse-base-italian")

text1 = "Alessandro Manzoni è stato uno scrittore italiano"
text2 = "Giacomo Leopardi è stato un poeta italiano"

# Use the hidden state of the first token as the sentence embedding
vec1 = model(tokenizer.encode(text1, return_tensors="pt")).last_hidden_state[0, 0, :].cpu().detach().numpy()
vec2 = model(tokenizer.encode(text2, return_tensors="pt")).last_hidden_state[0, 0, :].cpu().detach().numpy()

# Cosine similarity between the two sentence vectors
cosine_similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print("COSINE SIMILARITY:", cosine_similarity)
# COSINE SIMILARITY: 0.734292
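When more than two texts need to be compared, the same cosine formula can be applied to all pairs at once. The sketch below uses toy 2-dimensional vectors in place of real model embeddings, and `cosine_similarity_matrix` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Scale each row to unit length; the dot product of unit vectors is the cosine
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


# Toy vectors standing in for sentence embeddings (assumption)
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_matrix(toy)
print(sims)
```

Entry `sims[i, j]` holds the similarity between text `i` and text `j`. With real embeddings, the rows would be the vectors produced by the model for each text.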
[1] https://arxiv.org/abs/1907.04307
[2] https://arxiv.org/abs/2010.05609
The model is released under the Apache-2.0 license.