---
language:
- multilingual
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MatryoshkaLoss
base_model: Ghani-25/LF_enrich_sim
widget:
- source_sentence: CTO and co-Founder
  sentences:
  - Responsable surpervision des départements
  - Senior sales executive
  - >-
    Injection Operations Supervisor - Industrial Efficiency - Systems &
    Equipment
- source_sentence: Commercial Account Executive
  sentences:
  - Automation Electrician
  - Love Coach Extra
  - Psychologue Clinicienne (Croix Rouge Française) Hébergements et ESAT
- source_sentence: Chargée d'etudes actuarielles IFRS17
  sentences:
  - Visuel Merchandiser Shop In Shop
  - VIP Lounge Hostess
  - Directeur Adjoint des opérations
- source_sentence: Cheffe de projet emailing
  sentences:
  - Experte Territoriale
  - Responsable Clientele / Commerciale et Communication /
  - STRATEGIC CONSULTANT - LIVE BUSINESS CASE
- source_sentence: 'Summer Job: Export Manager'
  sentences:
  - Clinical Project Leader
  - Member and Maghreb Representative
  - Responsable Export Afrique Amériques
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: Our original base similarity Matryoshka
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: dim 768
      type: dim_768
    metrics:
    - type: pearson_cosine
      value: 0.9696182810336916
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9472439476744547
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: dim 512
      type: dim_512
    metrics:
    - type: pearson_cosine
      value: 0.9692898932305203
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9466297549051846
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: dim 256
      type: dim_256
    metrics:
    - type: pearson_cosine
      value: 0.9662306280132803
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9407689506959847
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: dim 128
      type: dim_128
    metrics:
    - type: pearson_cosine
      value: 0.960638838395904
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9314825034513964
      name: Spearman Cosine
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      name: dim 64
      type: dim_64
    metrics:
    - type: pearson_cosine
      value: 0.9463950305830967
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.9100801085031441
      name: Spearman Cosine
---

# Our original base similarity Matryoshka

This is a [sentence-transformers](https://www.sbert.net) model finetuned from [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - json
- **Language:** multilingual
- **License:** apache-2.0

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
# Run inference
sentences = [
    'Summer Job: Export Manager',
    'Responsable Export Afrique Amériques',
    'Clinical Project Leader',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

# Extract the diagonal to get the corresponding pair similarities
# (here: the first sentence scored against the second; the diagonal of
# similarities above is ~1.0, since it compares each sentence with itself)
pair_similarities = model.similarity(embeddings[0:1], embeddings[1:2])
print(pair_similarities.diag().cpu().numpy())
# [0.896542]
```
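#### Truncating Embedding Dimensions

Because this model was trained with MatryoshkaLoss, its embeddings can be truncated to 512, 256, 128, or 64 dimensions with only a modest drop in quality (see the evaluation table below). A minimal sketch, using the `truncate_dim` argument available in recent Sentence Transformers releases:

```python
from sentence_transformers import SentenceTransformer

# Keep only the first 64 dimensions of every embedding
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64", truncate_dim=64)

embeddings = model.encode([
    "Summer Job: Export Manager",
    "Responsable Export Afrique Amériques",
])
print(embeddings.shape)
# (2, 64)

# Cosine similarity works the same way on truncated embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [2, 2]
```

Smaller dimensions trade a little accuracy (Spearman cosine drops from 0.9472 at 768 dimensions to 0.9101 at 64) for much cheaper storage and faster retrieval.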
## Evaluation

### Metrics

#### Semantic Similarity

* Datasets: `dim_768`, `dim_512`, `dim_256`, `dim_128` and `dim_64`
* Evaluated with [EmbeddingSimilarityEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

| Metric              | dim_768    | dim_512    | dim_256    | dim_128    | dim_64     |
|:--------------------|:-----------|:-----------|:-----------|:-----------|:-----------|
| pearson_cosine      | 0.9696     | 0.9693     | 0.9662     | 0.9606     | 0.9464     |
| **spearman_cosine** | **0.9472** | **0.9466** | **0.9408** | **0.9315** | **0.9101** |
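The table above can, in principle, be reproduced with the same evaluator. A minimal sketch, assuming a hypothetical held-out pairs file `eval.json` with the same columns as the training data (the actual evaluation split is not published); repeat with `truncate_dim` set to 768, 512, 256, or 128 for the other columns:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SimilarityFunction
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")

# Hypothetical held-out pairs with sentence1 / sentence2 / label columns
eval_dataset = load_dataset("json", data_files="eval.json", split="train")

# Evaluate at a single Matryoshka dimension (here: 64)
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=eval_dataset["sentence1"],
    sentences2=eval_dataset["sentence2"],
    scores=eval_dataset["label"],
    main_similarity=SimilarityFunction.COSINE,
    truncate_dim=64,
    name="dim_64",
)
print(evaluator(model))
# {'dim_64_pearson_cosine': ..., 'dim_64_spearman_cosine': ...}
```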
## Training Details

### Training Dataset

#### json

* Dataset: json
* Columns: `sentence1`, `sentence2`, and `label`
* Approximate statistics based on the first 1000 samples:

  |      | sentence1 | sentence2 | label |
  |:-----|:----------|:----------|:------|
  | type | string    | string    | float |

* Samples:

  | sentence1                                   | sentence2                         | label      |
  |:--------------------------------------------|:----------------------------------|:-----------|
  | Contributive filmer                         | Doctorant contractuel (2016-2019) | 0.20986526 |
  | Responsable Développement et Communication  | Bilingual Business Assistant      | 0.3238712  |
  | Law Trainee                                 | Sales Director contract manager   | 0.24983984 |

* Loss: [MatryoshkaLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
  ```json
  {
      "loss": "CosineSimilarityLoss",
      "matryoshka_dims": [768, 512, 256, 128, 64],
      "matryoshka_weights": [1, 1, 1, 1, 1],
      "n_dims_per_step": -1
  }
  ```

### Training Hyperparameters

#### Non-Default Hyperparameters
- `eval_strategy`: epoch
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 16
- `gradient_accumulation_steps`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 4
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `bf16`: True
- `tf32`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused

#### All Hyperparameters

For the full list of hyperparameters, contact the author.

### Training Logs

| Epoch      | Step    | Training Loss | dim_768_spearman_cosine | dim_512_spearman_cosine | dim_256_spearman_cosine | dim_128_spearman_cosine | dim_64_spearman_cosine |
|:----------:|:-------:|:-------------:|:-----------------------:|:-----------------------:|:-----------------------:|:-----------------------:|:----------------------:|
| 0.1624     | 10      | 0.0669        | -                       | -                       | -                       | -                       | -                      |
| 0.3249     | 20      | 0.0563        | -                       | -                       | -                       | -                       | -                      |
| 0.4873     | 30      | 0.0496        | -                       | -                       | -                       | -                       | -                      |
| 0.6497     | 40      | 0.0456        | -                       | -                       | -                       | -                       | -                      |
| 0.8122     | 50      | 0.0418        | -                       | -                       | -                       | -                       | -                      |
| 0.9746     | 60      | 0.0407        | -                       | -                       | -                       | -                       | -                      |
| 0.9909     | 61      | -             | 0.9223                  | 0.9199                  | 0.9087                  | 0.8920                  | 0.8586                 |
| 1.1371     | 70      | 0.0326        | -                       | -                       | -                       | -                       | -                      |
| 1.2995     | 80      | 0.0312        | -                       | -                       | -                       | -                       | -                      |
| 1.4619     | 90      | 0.0303        | -                       | -                       | -                       | -                       | -                      |
| 1.6244     | 100     | 0.03          | -                       | -                       | -                       | -                       | -                      |
| 1.7868     | 110     | 0.0291        | -                       | -                       | -                       | -                       | -                      |
| 1.9492     | 120     | 0.0301        | -                       | -                       | -                       | -                       | -                      |
| 1.9980     | 123     | -             | 0.9393                  | 0.9382                  | 0.9304                  | 0.9191                  | 0.8946                 |
| 2.1117     | 130     | 0.0257        | -                       | -                       | -                       | -                       | -                      |
| 2.2741     | 140     | 0.0243        | -                       | -                       | -                       | -                       | -                      |
| 2.4365     | 150     | 0.0246        | -                       | -                       | -                       | -                       | -                      |
| 2.5990     | 160     | 0.0235        | -                       | -                       | -                       | -                       | -                      |
| 2.7614     | 170     | 0.024         | -                       | -                       | -                       | -                       | -                      |
| 2.9239     | 180     | 0.023         | -                       | -                       | -                       | -                       | -                      |
| 2.9888     | 184     | -             | 0.9464                  | 0.9457                  | 0.9396                  | 0.9301                  | 0.9083                 |
| 3.0863     | 190     | 0.0222        | -                       | -                       | -                       | -                       | -                      |
| 3.2487     | 200     | 0.022         | -                       | -                       | -                       | -                       | -                      |
| 3.4112     | 210     | 0.022         | -                       | -                       | -                       | -                       | -                      |
| 3.5736     | 220     | 0.0226        | -                       | -                       | -                       | -                       | -                      |
| 3.7360     | 230     | 0.021         | -                       | -                       | -                       | -                       | -                      |
| 3.8985     | 240     | 0.0224        | -                       | -                       | -                       | -                       | -                      |
| **3.9635** | **244** | **-**         | **0.9472**              | **0.9466**              | **0.9408**              | **0.9315**              | **0.9101**             |

* The bold row denotes the saved checkpoint.

### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.3.1
- Transformers: 4.41.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.1.1
- Datasets: 2.19.1
- Tokenizers: 0.19.1
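### Training Sketch

For reference, the loss configuration and non-default hyperparameters above roughly correspond to the sketch below, assuming the Sentence Transformers v3 trainer API. The data file paths (`train.json`, `eval.json`), the `output_dir` name, and the added `save_strategy` (which `load_best_model_at_end` requires to match `eval_strategy`) are assumptions, not published settings:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss, MatryoshkaLoss

model = SentenceTransformer("Ghani-25/LF_enrich_sim")

# Hypothetical local files with sentence1 / sentence2 / label columns
train_dataset = load_dataset("json", data_files="train.json", split="train")
eval_dataset = load_dataset("json", data_files="eval.json", split="train")

# MatryoshkaLoss applies CosineSimilarityLoss to the first 768/512/256/128/64
# dimensions of each embedding, with equal weights
loss = MatryoshkaLoss(
    model,
    CosineSimilarityLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
    matryoshka_weights=[1, 1, 1, 1, 1],
)

args = SentenceTransformerTrainingArguments(
    output_dir="LF-enrich-sim-matryoshka-64",
    num_train_epochs=4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    optim="adamw_torch_fused",
    eval_strategy="epoch",
    save_strategy="epoch",  # assumption: must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```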