tuan.ljn committed
Commit b39a198 · 1 Parent(s): ef54d41

Add: add README
.ipynb_checkpoints/README-checkpoint.md DELETED
@@ -1,95 +0,0 @@
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- sentence-embedding
license: apache-2.0
language:
- fr
- en
metrics:
- pearsonr
- spearmanr
---

# [bilingual-embedding-base](https://huggingface.co/Lajavaness/bilingual-embedding-base)

bilingual-embedding is an embedding model for bilingual French-English text. It is a sentence-embedding model built on [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), a pre-trained multilingual language model, and trained specifically for this language pair. The model encodes English and French sentences into a 768-dimensional vector space, enabling a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of English and French sentences, reflecting both the lexical and contextual layers of the languages.


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Training and Fine-tuning process

### Stage 1: NLI Training
- Dataset: SNLI and XNLI (English and French)
- Method: Training with Multiple Negatives Ranking Loss. This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics. A minimal sketch of this stage follows below.
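
A minimal sketch of this stage, assuming the classic `sentence-transformers` fit API; the example pairs, batch size, and epoch count are illustrative, not the exact training recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative (premise, entailed hypothesis) pairs; the real stage used
# SNLI+XNLI in English and French.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["Un homme mange.", "Une personne prend un repas."]),
]

model = SentenceTransformer("FacebookAI/xlm-roberta-base")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Multiple Negatives Ranking Loss treats every other in-batch example as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```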

### Stage 2: Continued Fine-tuning for Semantic Textual Similarity on the STS Benchmark
- Dataset: STSB (French and English)
- Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the `sentence-transformers` library, as sketched below.
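
A hedged sketch of this stage under the same API; the checkpoint path is a placeholder and the pairs and scores (gold similarity rescaled to [0, 1]) are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative STS pairs with gold similarity scores rescaled to [0, 1].
train_examples = [
    InputExample(texts=["A plane is taking off.", "An air plane is taking off."], label=1.0),
    InputExample(texts=["Un homme joue de la guitare.", "Un chat mange."], label=0.1),
]

model = SentenceTransformer("path/to/stage1-checkpoint")  # placeholder path
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# The cosine similarity of the two embeddings is regressed onto the gold score,
# the standard Siamese setup in sentence-transformers.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```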

### Stage 3: Advanced Augmentation Fine-tuning
- Dataset: STSB augmented with [silver samples generated from the gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
- Method: Employed an advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by enriching the training data dynamically, enhancing the model's robustness and accuracy. A condensed sketch of the silver-data step follows.
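
A condensed sketch of the silver-data step, assuming a generic cross-encoder checkpoint (`cross-encoder/stsb-roberta-base` here is an illustrative stand-in, and the pairs and paths are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

# 1) A cross-encoder scores unlabeled sentence pairs to create "silver" labels.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # illustrative stand-in
silver_pairs = [
    ["A woman is dancing.", "Une femme danse."],
    ["A dog runs in the park.", "Paris est en France."],
]
silver_scores = cross_encoder.predict(silver_pairs)

# 2) The gold + silver pairs then train the bi-encoder as in the previous stage.
train_examples = [
    InputExample(texts=pair, label=float(score))
    for pair, score in zip(silver_pairs, silver_scores)
]
bi_encoder = SentenceTransformer("path/to/stage2-checkpoint")  # placeholder path
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```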

## Usage

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]

model = SentenceTransformer('Lajavaness/bilingual-embedding-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
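
Because the model ends with a Normalize module, cosine similarity is the natural way to compare the vectors; continuing from the snippet above:

```python
from sentence_transformers import util

# The French and English sentences are paraphrases, so the score should be high.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```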

## Evaluation

TODO

## Citation

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised cross-lingual representation learning at scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv preprint arXiv:2010.08240},
  year={2020}
}
```
.ipynb_checkpoints/config-checkpoint.json DELETED
@@ -1,37 +0,0 @@
{
  "_name_or_path": "dangvantuan/bilingual_impl",
  "architectures": [
    "BilingualModel"
  ],
  "model_type": "bilingual",
  "auto_map": {
    "AutoConfig": "dangvantuan/bilingual_impl--config.BilingualConfig",
    "AutoModel": "dangvantuan/bilingual_impl--modeling.BilingualModel",
    "AutoModelForMaskedLM": "dangvantuan/bilingual_impl--modeling.BilingualForMaskedLM",
    "AutoModelForMultipleChoice": "dangvantuan/bilingual_impl--modeling.BilingualForMultipleChoice",
    "AutoModelForQuestionAnswering": "dangvantuan/bilingual_impl--modeling.BilingualForQuestionAnswering",
    "AutoModelForSequenceClassification": "dangvantuan/bilingual_impl--modeling.BilingualForSequenceClassification",
    "AutoModelForTokenClassification": "dangvantuan/bilingual_impl--modeling.BilingualForTokenClassification"
  },
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.39.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
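
The auto_map above routes the transformers auto-classes to the custom BilingualModel code hosted in dangvantuan/bilingual_impl, which is why loading requires trust_remote_code. A minimal sketch of loading the raw transformer this way:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code allows transformers to fetch the BilingualModel
# implementation referenced by the auto_map entries.
tokenizer = AutoTokenizer.from_pretrained("Lajavaness/bilingual-embedding-base")
model = AutoModel.from_pretrained("Lajavaness/bilingual-embedding-base", trust_remote_code=True)
```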
.ipynb_checkpoints/config_sentence_transformers-checkpoint.json DELETED
@@ -1,9 +0,0 @@
{
  "__version__": {
    "sentence_transformers": "2.7.0",
    "transformers": "4.38.2",
    "pytorch": "2.2.1+cu121"
  },
  "prompts": {},
  "default_prompt_name": null
}