metadata

language:
  - multilingual
license: apache-2.0
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - loss:MatryoshkaLoss
base_model: Ghani-25/LF_enrich_sim
widget:
  - source_sentence: CTO and co-Founder
    sentences:
      - Responsable surpervision des départements
      - Senior sales executive
      - >-
        Injection Operations Supervisor - Industrial Efficiency - Systems &
        Equipment
  - source_sentence: Commercial Account Executive
    sentences:
      - Automation Electrician
      - Love Coach Extra
      - Psychologue Clinicienne (Croix Rouge Française) Hébergements et ESAT
  - source_sentence: Chargée d'etudes actuarielles IFRS17
    sentences:
      - Visuel Merchandiser Shop In Shop
      - VIP Lounge Hostess
      - Directeur Adjoint des opérations
  - source_sentence: Cheffe de projet emailing
    sentences:
      - Experte Territoriale
      - Responsable Clientele / Commerciale  et  Communication /
      - STRATEGIC CONSULTANT - LIVE BUSINESS CASE
  - source_sentence: 'Summer Job: Export Manager'
    sentences:
      - Clinical Project Leader
      - Member and Maghreb Representative
      - Responsable Export Afrique Amériques
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: Our original base similarity Matryoshka
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: pearson_cosine
            value: 0.9696182810336916
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9472439476744547
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: dim 512
          type: dim_512
        metrics:
          - type: pearson_cosine
            value: 0.9692898932305203
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9466297549051846
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: pearson_cosine
            value: 0.9662306280132803
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9407689506959847
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: dim 128
          type: dim_128
        metrics:
          - type: pearson_cosine
            value: 0.960638838395904
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9314825034513964
            name: Spearman Cosine
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: dim 64
          type: dim_64
        metrics:
          - type: pearson_cosine
            value: 0.9463950305830967
            name: Pearson Cosine
          - type: spearman_cosine
            value: 0.9100801085031441
            name: Spearman Cosine

Our original base similarity Matryoshka

This is a sentence-transformers model fine-tuned from Ghani-25/LF_enrich_sim on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Ghani-25/LF_enrich_sim
Maximum Sequence Length: 128 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- json
Language: multilingual
License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
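The Pooling module above has only `pooling_mode_mean_tokens` enabled, so the sentence vector is the average of the token vectors that are not padding. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy 4-dimensional vectors are illustrative assumptions, not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy input (hypothetical): three tokens, the last one is padding.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0, 2.0]
```

The padding token is ignored, which is why the 9.0 values do not affect the result.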

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
# Run inference
sentences = [
    'Summer Job: Export Manager',
    'Responsable Export Afrique Amériques',
    'Clinical Project Leader',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

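# Extract the diagonal of the similarity matrix.
# Each diagonal entry compares a sentence with itself, so the values are ~1.0.
# Pairwise scores are the off-diagonal entries, e.g. similarities[0, 1].
similarities_diagonal = similarities.diag().cpu().numpy()
print(similarities_diagonal)
# approximately [1. 1. 1.]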
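Because the model was trained with MatryoshkaLoss on dimensions 768, 512, 256, 128 and 64, the leading components of an embedding can be used on their own at a small cost in accuracy. The sketch below shows the truncate-then-renormalize step on toy vectors in pure Python; the helper names and the 4-dimensional inputs are assumptions made for illustration. Sentence Transformers also accepts a `truncate_dim` argument when loading a model, which should achieve the same effect; check the documentation for your installed version.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for model outputs (hypothetical values).
emb_a = [3.0, 4.0, 1.0, 2.0]
emb_b = [4.0, 3.0, -2.0, 5.0]

small_a = truncate_and_normalize(emb_a, 2)
small_b = truncate_and_normalize(emb_b, 2)
score = dot(small_a, small_b)
print(len(small_a), round(score, 4))
```

With real embeddings, `dim` would be one of the trained sizes, for example 64 for the smallest index footprint or 256 for a middle ground.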

Evaluation

Metrics

Semantic Similarity

Datasets: dim_768, dim_512, dim_256, dim_128 and dim_64
Evaluated with EmbeddingSimilarityEvaluator

Metric	dim_768	dim_512	dim_256	dim_128	dim_64
pearson_cosine	0.9696	0.9693	0.9662	0.9606	0.9464
spearman_cosine	0.9472	0.9466	0.9408	0.9315	0.9101
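The two metrics above compare the cosine similarity of each sentence pair with its gold label: Pearson measures linear agreement, Spearman measures agreement of the rankings. The following pure-Python sketch illustrates that computation on toy data; it is not the EmbeddingSimilarityEvaluator code, the vectors and labels are invented, and the simple `ranks` helper assumes there are no tied values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy embedding pairs and gold similarity labels (hypothetical values).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 1.0]),
]
gold = [0.9, 0.8, 0.1, 0.0]

predicted = [cosine(u, v) for u, v in pairs]
pearson_cosine = pearson(predicted, gold)
spearman_cosine = spearman(predicted, gold)
print(round(pearson_cosine, 4), round(spearman_cosine, 4))
```

In this toy case the predicted and gold scores are ordered identically, so Spearman is 1 while Pearson is slightly lower because the relationship is not exactly linear.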

Training Details

Training Dataset

json

Dataset: json
Columns: sentence1, sentence2, and label
Approximate statistics based on the first 1000 samples:

	sentence1	sentence2	label
type	string	string	float
details	min: 3 tokens mean: 10.22 tokens max: 30 tokens	min: 3 tokens mean: 9.98 tokens max: 67 tokens	min: -0.05 mean: 0.37 max: 0.98

Samples:

sentence1	sentence2	label
`Contributive filmer`	`Doctorant contractuel (2016-2019)`	`0.20986526`
`Responsable Développement et Communication`	`Bilingual Business Assistant`	`0.3238712`
`Law Trainee`	`Sales Director contract manager`	`0.24983984`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "CosineSimilarityLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
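With these parameters, the loss applies CosineSimilarityLoss at every listed dimension and adds the results with equal weight; `n_dims_per_step: -1` means all dimensions are used at each step. A rough pure-Python sketch of that combination is shown below, assuming the inner loss is the mean squared error between the cosine similarity and the label. The function names, the toy dimensions `[4, 2]` and the data are illustrative assumptions rather than the library's code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def matryoshka_cosine_loss(pairs, labels, dims, weights):
    """Weighted sum over dims of the MSE between truncated-cosine and label."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        squared_errors = [
            (cosine(u[:dim], v[:dim]) - label) ** 2
            for (u, v), label in zip(pairs, labels)
        ]
        total += weight * sum(squared_errors) / len(squared_errors)
    return total


# Toy batch (hypothetical): 4-dimensional embeddings, truncated to 4 and 2.
pairs = [
    ([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]),
    ([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]),
]
labels = [1.0, 0.5]

loss = matryoshka_cosine_loss(pairs, labels, dims=[4, 2], weights=[1, 1])
print(round(loss, 4))
```

The second pair matches its label at full size but not after truncation to 2 dimensions, so only the truncated term contributes to the loss. This is the pressure that pushes useful information into the leading dimensions.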

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 16
gradient_accumulation_steps: 16
learning_rate: 2e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: True
tf32: True
load_best_model_at_end: True
optim: adamw_torch_fused
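Two of these settings interact: a per-device batch size of 32 with 16 gradient accumulation steps gives 512 samples per optimizer step on each device, and the cosine scheduler with `warmup_ratio: 0.1` ramps the learning rate up linearly before decaying it. The sketch below approximates that schedule in pure Python. It assumes 244 total optimizer steps (the last step in the training logs), a warmup length rounded up from the ratio, and a single half-cosine decay to zero; the function name `lr_at` is hypothetical and the number of devices is not stated in the source.

```python
import math

PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 16
# Samples contributing to one optimizer step on a single device.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def lr_at(step, base_lr=2e-5, total_steps=244, warmup_ratio=0.1):
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch)
for step in (0, 10, 25, 120, 244):
    print(step, lr_at(step))
```

Under these assumptions the peak learning rate of 2e-05 is reached at step 25 and the rate falls to zero at step 244.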

All Hyperparameters

Contact the author.

Training Logs

Epoch	Step	Training Loss	dim_768_spearman_cosine	dim_512_spearman_cosine	dim_256_spearman_cosine	dim_128_spearman_cosine	dim_64_spearman_cosine
0.1624	10	0.0669	-	-	-	-	-
0.3249	20	0.0563	-	-	-	-	-
0.4873	30	0.0496	-	-	-	-	-
0.6497	40	0.0456	-	-	-	-	-
0.8122	50	0.0418	-	-	-	-	-
0.9746	60	0.0407	-	-	-	-	-
0.9909	61	-	0.9223	0.9199	0.9087	0.8920	0.8586
1.1371	70	0.0326	-	-	-	-	-
1.2995	80	0.0312	-	-	-	-	-
1.4619	90	0.0303	-	-	-	-	-
1.6244	100	0.03	-	-	-	-	-
1.7868	110	0.0291	-	-	-	-	-
1.9492	120	0.0301	-	-	-	-	-
1.9980	123	-	0.9393	0.9382	0.9304	0.9191	0.8946
2.1117	130	0.0257	-	-	-	-	-
2.2741	140	0.0243	-	-	-	-	-
2.4365	150	0.0246	-	-	-	-	-
2.5990	160	0.0235	-	-	-	-	-
2.7614	170	0.024	-	-	-	-	-
2.9239	180	0.023	-	-	-	-	-
2.9888	184	-	0.9464	0.9457	0.9396	0.9301	0.9083
3.0863	190	0.0222	-	-	-	-	-
3.2487	200	0.022	-	-	-	-	-
3.4112	210	0.022	-	-	-	-	-
3.5736	220	0.0226	-	-	-	-	-
3.7360	230	0.021	-	-	-	-	-
3.8985	240	0.0224	-	-	-	-	-
3.9635	244	-	0.9472	0.9466	0.9408	0.9315	0.9101

The saved checkpoint is the final evaluation row (epoch 3.9635, step 244); its scores match the metrics reported in the Evaluation section.

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.3.1
Transformers: 4.41.2
PyTorch: 2.5.1+cu121
Accelerate: 1.1.1
Datasets: 2.19.1
Tokenizers: 0.19.1