	---
	language:
	- multilingual
	license: apache-2.0
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- generated_from_trainer
	- loss:MatryoshkaLoss
	base_model: Ghani-25/LF_enrich_sim
	widget:
	- source_sentence: CTO and co-Founder
	  sentences:
	  - Responsable surpervision des départements
	  - Senior sales executive
	  - >-
	    Injection Operations Supervisor - Industrial Efficiency - Systems &
	    Equipment
	- source_sentence: Commercial Account Executive
	  sentences:
	  - Automation Electrician
	  - Love Coach Extra
	  - Psychologue Clinicienne (Croix Rouge Française) Hébergements et ESAT
	- source_sentence: Chargée d'etudes actuarielles IFRS17
	  sentences:
	  - Visuel Merchandiser Shop In Shop
	  - VIP Lounge Hostess
	  - Directeur Adjoint des opérations
	- source_sentence: Cheffe de projet emailing
	  sentences:
	  - Experte Territoriale
	  - Responsable Clientele / Commerciale et Communication /
	  - STRATEGIC CONSULTANT - LIVE BUSINESS CASE
	- source_sentence: 'Summer Job: Export Manager'
	  sentences:
	  - Clinical Project Leader
	  - Member and Maghreb Representative
	  - Responsable Export Afrique Amériques
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	metrics:
	- pearson_cosine
	- spearman_cosine
	model-index:
	- name: Our original base similarity Matryoshka
	  results:
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: dim 768
	      type: dim_768
	    metrics:
	    - type: pearson_cosine
	      value: 0.9696182810336916
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9472439476744547
	      name: Spearman Cosine
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: dim 512
	      type: dim_512
	    metrics:
	    - type: pearson_cosine
	      value: 0.9692898932305203
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9466297549051846
	      name: Spearman Cosine
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: dim 256
	      type: dim_256
	    metrics:
	    - type: pearson_cosine
	      value: 0.9662306280132803
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9407689506959847
	      name: Spearman Cosine
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: dim 128
	      type: dim_128
	    metrics:
	    - type: pearson_cosine
	      value: 0.960638838395904
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9314825034513964
	      name: Spearman Cosine
	  - task:
	      type: semantic-similarity
	      name: Semantic Similarity
	    dataset:
	      name: dim 64
	      type: dim_64
	    metrics:
	    - type: pearson_cosine
	      value: 0.9463950305830967
	      name: Pearson Cosine
	    - type: spearman_cosine
	      value: 0.9100801085031441
	      name: Spearman Cosine
	---

	# Our original base similarity Matryoshka

	This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim) on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [Ghani-25/LF_enrich_sim](https://huggingface.co/Ghani-25/LF_enrich_sim) <!-- at revision fb09bbe3ab4baafa2101c33989bf2ed8ffddf5cc -->
	- Maximum Sequence Length: 128 tokens
	- Output Dimensionality: 768 dimensions
	- Similarity Function: Cosine Similarity
	- Training Dataset:
	    - json
	- Language: multilingual
	- License: apache-2.0

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
	  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	)
	```
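The Pooling module above enables only `pooling_mode_mean_tokens`, so a sentence embedding is the average of the token embeddings at non-padding positions. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the function name `mean_pool` is hypothetical, and the toy vectors have 3 dimensions instead of 768.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (assumption): 3 real tokens followed by 1 padding token.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

Masking matters because inputs shorter than the 128-token limit are padded in a batch, and averaging over padding positions would distort the embedding.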

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
	# Run inference
	sentences = [
	    'Summer Job: Export Manager',
	    'Responsable Export Afrique Amériques',
	    'Clinical Project Leader',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# [3, 768]

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# [3, 3]

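	# Extract the diagonal of the similarity matrix. Because the matrix above compares
	# `embeddings` with itself, each diagonal entry is a sentence's similarity to itself
	# (close to 1.0). To score sentence pairs, encode two aligned lists and call
	# model.similarity(embeddings1, embeddings2) before taking the diagonal.
	similarities_diagonal = similarities.diag().cpu().numpy()
	print(similarities_diagonal)
	# With a single aligned pair this prints one score, e.g. [0.896542]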
	```
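Because the model was trained with `MatryoshkaLoss` at 768, 512, 256, 128 and 64 dimensions, an embedding can be shortened by keeping its leading components and re-normalizing. The sketch below shows that step in pure Python under stated assumptions: `truncate_and_normalize` and `dot` are hypothetical helper names, and the toy vectors have 8 dimensions rather than 768. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be the more convenient route in practice.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding size")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional vectors (assumption); real embeddings here have 768 dimensions.
a = [3.0, 4.0, 0.0, 0.0, 1.0, 2.0, 0.5, 0.5]
b = [4.0, 3.0, 0.0, 0.0, 2.0, 1.0, 0.5, 0.5]

a4 = truncate_and_normalize(a, 4)
b4 = truncate_and_normalize(b, 4)
score_4d = dot(a4, b4)
print(len(a4), score_4d)
```

The evaluation table in this card indicates how much correlation is lost at each truncation level, which can guide the choice of dimension.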

	<!--
	### Direct Usage (Transformers)

	<details><summary>Click to see the direct usage in Transformers</summary>

	</details>
	-->

	<!--
	### Downstream Usage (Sentence Transformers)

	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	</details>
	-->

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	## Evaluation

	### Metrics

	#### Semantic Similarity

	* Datasets: `dim_768`, `dim_512`, `dim_256`, `dim_128` and `dim_64`
	* Evaluated with [<code>EmbeddingSimilarityEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.EmbeddingSimilarityEvaluator)

	| Metric              | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
	|:--------------------|:--------|:--------|:--------|:--------|:-------|
	| pearson_cosine      | 0.9696  | 0.9693  | 0.9662  | 0.9606  | 0.9464 |
	| spearman_cosine     | 0.9472  | 0.9466  | 0.9408  | 0.9315  | 0.9101 |
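The two metrics above are correlations between the model's cosine similarity scores and the gold labels: Pearson measures linear agreement, while Spearman applies the same formula to ranks. A minimal pure-Python sketch of both follows. It is an illustration, not the evaluator's code; the helper names are hypothetical and the sample values are made up, not taken from the evaluation set.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores for illustration only.
predicted = [0.10, 0.35, 0.50, 0.90]  # cosine similarities from a model
gold = [0.20, 0.30, 0.80, 0.70]       # reference labels

print("pearson:", pearson(predicted, gold))
print("spearman:", spearman(predicted, gold))
```

Spearman depends only on the ordering of the scores, so it is the more relevant number when the embeddings are used for ranking candidates.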

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Dataset

	#### json

	* Dataset: json
	* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | sentence1 | sentence2 | label |
	  |:--------|:----------|:----------|:------|
	  | type    | string | string | float |
	  | details | <ul><li>min: 3 tokens</li><li>mean: 10.22 tokens</li><li>max: 30 tokens</li></ul> | <ul><li>min: 3 tokens</li><li>mean: 9.98 tokens</li><li>max: 67 tokens</li></ul> | <ul><li>min: -0.05</li><li>mean: 0.37</li><li>max: 0.98</li></ul> |
	* Samples:
	  | sentence1 | sentence2 | label |
	  |:----------|:----------|:------|
	  | <code>Contributive filmer</code> | <code>Doctorant contractuel (2016-2019)</code> | <code>0.20986526</code> |
	  | <code>Responsable Développement et Communication</code> | <code>Bilingual Business Assistant</code> | <code>0.3238712</code> |
	  | <code>Law Trainee</code> | <code>Sales Director contract manager</code> | <code>0.24983984</code> |
	* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
	```json
	{
	"loss": "CosineSimilarityLoss",
	"matryoshka_dims": [
	768,
	512,
	256,
	128,
	64
	],
	"matryoshka_weights": [
	1,
	1,
	1,
	1,
	1
	],
	"n_dims_per_step": -1
	}
	```
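The configuration above wraps `CosineSimilarityLoss` in `MatryoshkaLoss` with equal weights. Conceptually, for each entry in `matryoshka_dims` both embeddings are truncated to that size, the inner loss is computed on the truncated vectors, and the weighted results are summed. The sketch below illustrates this for a single sentence pair in pure Python. It assumes the inner loss is the squared error between cosine similarity and the label, uses toy dimensions `[4, 2]` in place of `[768, 512, 256, 128, 64]`, and uses hypothetical function names; it is not the library's implementation.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matryoshka_cosine_loss(
    u: Sequence[float],
    v: Sequence[float],
    label: float,
    dims: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted sum over dims of (cosine(u[:dim], v[:dim]) - label) ** 2."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        sim = cosine(u[:dim], v[:dim])
        total += weight * (sim - label) ** 2
    return total


# Toy 4-dimensional embeddings and a made-up label (assumptions).
u = [1.0, 0.0, 1.0, 0.0]
v = [1.0, 0.0, 0.0, 1.0]

# Full size: cosine is 0.5, matching the label, so that term is ~0.
# First 2 components: the vectors coincide, cosine is 1.0, term is 0.25.
loss = matryoshka_cosine_loss(u, v, label=0.5, dims=[4, 2], weights=[1, 1])
print(loss)
```

Summing the per-dimension terms is what pushes the leading components to carry most of the similarity signal, so truncated embeddings stay useful.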

	### Training Hyperparameters
	#### Non-Default Hyperparameters

	- `eval_strategy`: epoch
	- `per_device_train_batch_size`: 32
	- `per_device_eval_batch_size`: 16
	- `gradient_accumulation_steps`: 16
	- `learning_rate`: 2e-05
	- `num_train_epochs`: 4
	- `lr_scheduler_type`: cosine
	- `warmup_ratio`: 0.1
	- `bf16`: True
	- `tf32`: True
	- `load_best_model_at_end`: True
	- `optim`: adamw_torch_fused
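Two consequences of these settings are worth spelling out: the effective batch size per device is 32 × 16 = 512, and the learning rate ramps up linearly before following a cosine decay. The sketch below approximates that schedule in pure Python. It assumes the 244 optimizer steps shown in the training log, a warmup length of `ceil(0.1 * 244)` steps, and decay to zero at the final step; `lr_at_step` is a hypothetical helper, not the Trainer's scheduler code.

```python
import math

PEAK_LR = 2e-05          # `learning_rate`
TOTAL_STEPS = 244        # last optimizer step in the training log
WARMUP_RATIO = 0.1       # `warmup_ratio`
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)  # assumption: rounded up

# per_device_train_batch_size * gradient_accumulation_steps
EFFECTIVE_BATCH_SIZE = 32 * 16


def lr_at_step(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size per device:", EFFECTIVE_BATCH_SIZE)
for step in (0, 10, WARMUP_STEPS, 122, TOTAL_STEPS):
    print(step, lr_at_step(step))
```

With more than one device, the effective batch size would scale by the device count; the card does not state how many were used.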

	#### All Hyperparameters
	Contact the author.

	### Training Logs
	| Epoch  | Step | Training Loss | dim_768_spearman_cosine | dim_512_spearman_cosine | dim_256_spearman_cosine | dim_128_spearman_cosine | dim_64_spearman_cosine |
	|:------:|:----:|:-------------:|:-----------------------:|:-----------------------:|:-----------------------:|:-----------------------:|:----------------------:|
	| 0.1624 | 10   | 0.0669        | -                       | -                       | -                       | -                       | -                      |
	| 0.3249 | 20   | 0.0563        | -                       | -                       | -                       | -                       | -                      |
	| 0.4873 | 30   | 0.0496        | -                       | -                       | -                       | -                       | -                      |
	| 0.6497 | 40   | 0.0456        | -                       | -                       | -                       | -                       | -                      |
	| 0.8122 | 50   | 0.0418        | -                       | -                       | -                       | -                       | -                      |
	| 0.9746 | 60   | 0.0407        | -                       | -                       | -                       | -                       | -                      |
	| 0.9909 | 61   | -             | 0.9223                  | 0.9199                  | 0.9087                  | 0.8920                  | 0.8586                 |
	| 1.1371 | 70   | 0.0326        | -                       | -                       | -                       | -                       | -                      |
	| 1.2995 | 80   | 0.0312        | -                       | -                       | -                       | -                       | -                      |
	| 1.4619 | 90   | 0.0303        | -                       | -                       | -                       | -                       | -                      |
	| 1.6244 | 100  | 0.03          | -                       | -                       | -                       | -                       | -                      |
	| 1.7868 | 110  | 0.0291        | -                       | -                       | -                       | -                       | -                      |
	| 1.9492 | 120  | 0.0301        | -                       | -                       | -                       | -                       | -                      |
	| 1.9980 | 123  | -             | 0.9393                  | 0.9382                  | 0.9304                  | 0.9191                  | 0.8946                 |
	| 2.1117 | 130  | 0.0257        | -                       | -                       | -                       | -                       | -                      |
	| 2.2741 | 140  | 0.0243        | -                       | -                       | -                       | -                       | -                      |
	| 2.4365 | 150  | 0.0246        | -                       | -                       | -                       | -                       | -                      |
	| 2.5990 | 160  | 0.0235        | -                       | -                       | -                       | -                       | -                      |
	| 2.7614 | 170  | 0.024         | -                       | -                       | -                       | -                       | -                      |
	| 2.9239 | 180  | 0.023         | -                       | -                       | -                       | -                       | -                      |
	| 2.9888 | 184  | -             | 0.9464                  | 0.9457                  | 0.9396                  | 0.9301                  | 0.9083                 |
	| 3.0863 | 190  | 0.0222        | -                       | -                       | -                       | -                       | -                      |
	| 3.2487 | 200  | 0.022         | -                       | -                       | -                       | -                       | -                      |
	| 3.4112 | 210  | 0.022         | -                       | -                       | -                       | -                       | -                      |
	| 3.5736 | 220  | 0.0226        | -                       | -                       | -                       | -                       | -                      |
	| 3.7360 | 230  | 0.021         | -                       | -                       | -                       | -                       | -                      |
	| 3.8985 | 240  | 0.0224        | -                       | -                       | -                       | -                       | -                      |
	| 3.9635 | 244  | -             | 0.9472                  | 0.9466                  | 0.9408                  | 0.9315                  | 0.9101                 |

	* The bold row denotes the saved checkpoint.

	### Framework Versions
	- Python: 3.10.12
	- Sentence Transformers: 3.3.1
	- Transformers: 4.41.2
	- PyTorch: 2.5.1+cu121
	- Accelerate: 1.1.1
	- Datasets: 2.19.1
	- Tokenizers: 0.19.1

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->