Matryoshka finetune of our original base similarity model

This is a sentence-transformers model finetuned from Ghani-25/LF_enrich_sim on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Ghani-25/LF_enrich_sim
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: multilingual
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
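
Internally, the XLM-RoBERTa encoder produces one 768-dimensional vector per token, and the Pooling module averages them (mean pooling) into a single sentence vector. The sketch below illustrates that computation with the plain transformers API; it assumes the standard mean-pooling recipe, and the sentence-transformers loader shown in Usage remains the supported path.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Ghani-25/LF-enrich-sim-matryoshka-64")
encoder = AutoModel.from_pretrained("Ghani-25/LF-enrich-sim-matryoshka-64")

batch = tokenizer(["Summer Job: Export Manager"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])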

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
# Run inference
sentences = [
    'Summer Job: Export Manager',
    'Responsable Export Afrique Amériques',
    'Clinical Project Leader',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

# Extract the diagonal to get per-pair similarity scores
# (against itself the diagonal is trivially ~1.0; encode two aligned
# lists separately to score each sentence1/sentence2 pair)
similarities_diagonal = similarities.diag().cpu().numpy()
print(similarities_diagonal)
# [1. 1. 1.]
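
The same embeddings also support semantic search, e.g. ranking a pool of job titles against a query. A minimal sketch (the corpus and query below are illustrative, not from the training data):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")

corpus = ["Export Manager", "Clinical Project Leader", "Responsable Export Afrique Amériques"]
query = "Summer Job: Export Manager"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query and keep the top 2
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))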

Evaluation

Metrics

Semantic Similarity

Metric           dim_768  dim_512  dim_256  dim_128  dim_64
pearson_cosine   0.9696   0.9693   0.9662   0.9606   0.9464
spearman_cosine  0.9472   0.9466   0.9408   0.9315   0.9101
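
Scores stay nearly flat from 768 down to 256 dimensions, which is the point of Matryoshka training: embeddings can be truncated for cheaper storage and faster search at a modest accuracy cost. A sketch of loading the model at a reduced dimensionality via the standard truncate_dim option (the 64-dim choice is illustrative):

from sentence_transformers import SentenceTransformer

# encode() now returns 64-dimensional embeddings
model_64 = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64", truncate_dim=64)
embeddings = model_64.encode(["Summer Job: Export Manager", "Clinical Project Leader"])
print(embeddings.shape)
# (2, 64)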

Training Details

Training Dataset

json

  • Dataset: json
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
                 sentence1           sentence2           label
    type         string              string              float
    details      min: 3 tokens       min: 3 tokens       min: -0.05
                 mean: 10.22 tokens  mean: 9.98 tokens   mean: 0.37
                 max: 30 tokens      max: 67 tokens      max: 0.98
  • Samples:
    sentence1                                    sentence2                          label
    Contributive filmer                          Doctorant contractuel (2016-2019)  0.20986526
    Responsable Développement et Communication   Bilingual Business Assistant       0.3238712
    Law Trainee                                  Sales Director contract manager    0.24983984
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CosineSimilarityLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
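
For reference, these parameters correspond to wrapping CosineSimilarityLoss in MatryoshkaLoss in sentence-transformers, roughly as sketched below (not the exact training script):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CosineSimilarityLoss, MatryoshkaLoss

model = SentenceTransformer("Ghani-25/LF_enrich_sim")  # base model being finetuned
base_loss = CosineSimilarityLoss(model)
loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
    n_dims_per_step=-1,
)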
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
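
These map onto SentenceTransformerTrainingArguments roughly as sketched below (output_dir is a placeholder; save_strategy is an assumption, added because load_best_model_at_end requires it to match eval_strategy):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/LF-enrich-sim-matryoshka-64",  # placeholder path
    eval_strategy="epoch",
    save_strategy="epoch",  # assumption: must match eval_strategy for load_best_model_at_end
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    load_best_model_at_end=True,
    optim="adamw_torch_fused",
)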

All Hyperparameters

For the complete list of hyperparameters, contact the author.

Training Logs

Epoch Step Training Loss dim_768_spearman_cosine dim_512_spearman_cosine dim_256_spearman_cosine dim_128_spearman_cosine dim_64_spearman_cosine
0.1624 10 0.0669 - - - - -
0.3249 20 0.0563 - - - - -
0.4873 30 0.0496 - - - - -
0.6497 40 0.0456 - - - - -
0.8122 50 0.0418 - - - - -
0.9746 60 0.0407 - - - - -
0.9909 61 - 0.9223 0.9199 0.9087 0.8920 0.8586
1.1371 70 0.0326 - - - - -
1.2995 80 0.0312 - - - - -
1.4619 90 0.0303 - - - - -
1.6244 100 0.03 - - - - -
1.7868 110 0.0291 - - - - -
1.9492 120 0.0301 - - - - -
1.9980 123 - 0.9393 0.9382 0.9304 0.9191 0.8946
2.1117 130 0.0257 - - - - -
2.2741 140 0.0243 - - - - -
2.4365 150 0.0246 - - - - -
2.5990 160 0.0235 - - - - -
2.7614 170 0.024 - - - - -
2.9239 180 0.023 - - - - -
2.9888 184 - 0.9464 0.9457 0.9396 0.9301 0.9083
3.0863 190 0.0222 - - - - -
3.2487 200 0.022 - - - - -
3.4112 210 0.022 - - - - -
3.5736 220 0.0226 - - - - -
3.7360 230 0.021 - - - - -
3.8985 240 0.0224 - - - - -
3.9635 244 - 0.9472 0.9466 0.9408 0.9315 0.9101
  • The final row (epoch 3.9635, step 244) denotes the saved checkpoint; its scores match the Metrics table above.

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.41.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1