tuan.ljn committed
Commit b39a198 · 1 Parent(s): ef54d41

Add: add README
.ipynb_checkpoints/README-checkpoint.md DELETED
@@ -1,95 +0,0 @@
---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- sentence-embedding
license: apache-2.0
language:
- fr
- en
metrics:
- pearsonr
- spearmanr
---

# [bilingual-embedding-base](https://huggingface.co/Lajavaness/bilingual-embedding-base)

bilingual-embedding is an embedding model for bilingual French-English text. It is a sentence-embedding model built on [XLM-RoBERTa](https://huggingface.co/FacebookAI/xlm-roberta-base), a pre-trained multilingual language model, and trained specifically for this language pair. The model encodes English and French sentences into a 768-dimensional vector space, enabling a wide range of applications from semantic search to text clustering. The embeddings capture the nuanced meanings of English and French sentences, reflecting both the lexical and contextual layers of the languages.


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```

## Training and Fine-tuning process

### Stage 1: NLI Training
- Dataset: SNLI and XNLI (English and French)
- Method: Training with Multiple Negatives Ranking Loss. This stage focused on improving the model's ability to discern and rank nuanced differences in sentence semantics. A minimal sketch of this stage follows below.
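
A minimal sketch of this stage, assuming the classic `sentence-transformers` fit API; the example pairs, batch size, and epoch count are illustrative, not the exact training recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative (premise, entailed hypothesis) pairs; the real stage used
# SNLI+XNLI in English and French.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["Un homme mange.", "Une personne prend un repas."]),
]

model = SentenceTransformer("FacebookAI/xlm-roberta-base")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Multiple Negatives Ranking Loss treats every other in-batch example as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```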

### Stage 2: Continued Fine-tuning for Semantic Textual Similarity on the STS Benchmark
- Dataset: STSB (French and English)
- Method: Fine-tuning specifically for the semantic textual similarity benchmark using Siamese BERT-Networks configured with the `sentence-transformers` library, as sketched below.
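
A hedged sketch of this stage under the same API; the checkpoint path is a placeholder and the pairs and scores (gold similarity rescaled to [0, 1]) are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative STS pairs with gold similarity scores rescaled to [0, 1].
train_examples = [
    InputExample(texts=["A plane is taking off.", "An air plane is taking off."], label=1.0),
    InputExample(texts=["Un homme joue de la guitare.", "Un chat mange."], label=0.1),
]

model = SentenceTransformer("path/to/stage1-checkpoint")  # placeholder path
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# The cosine similarity of the two embeddings is regressed onto the gold score,
# the standard Siamese setup in sentence-transformers.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```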

### Stage 3: Advanced Augmentation Fine-tuning
- Dataset: STSB augmented with [silver samples generated from the gold samples](https://www.sbert.net/examples/training/data_augmentation/README.html)
- Method: Employed an advanced strategy using [Augmented SBERT](https://arxiv.org/abs/2010.08240) with pair sampling strategies, integrating both Cross-Encoder and Bi-Encoder models. This stage further refined the embeddings by enriching the training data dynamically, enhancing the model's robustness and accuracy. A condensed sketch of the silver-data step follows.
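
A condensed sketch of the silver-data step, assuming a generic cross-encoder checkpoint (`cross-encoder/stsb-roberta-base` here is an illustrative stand-in, and the pairs and paths are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

# 1) A cross-encoder scores unlabeled sentence pairs to create "silver" labels.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # illustrative stand-in
silver_pairs = [
    ["A woman is dancing.", "Une femme danse."],
    ["A dog runs in the park.", "Paris est en France."],
]
silver_scores = cross_encoder.predict(silver_pairs)

# 2) The gold + silver pairs then train the bi-encoder as in the previous stage.
train_examples = [
    InputExample(texts=pair, label=float(score))
    for pair, score in zip(silver_pairs, silver_scores)
]
bi_encoder = SentenceTransformer("path/to/stage2-checkpoint")  # placeholder path
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```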

## Usage

Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Paris est une capitale de la France", "Paris is a capital of France"]

model = SentenceTransformer('Lajavaness/bilingual-embedding-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
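
Because the model ends with a Normalize module, cosine similarity is the natural way to compare the vectors; continuing from the snippet above:

```python
from sentence_transformers import util

# The French and English sentences are paraphrases, so the score should be high.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)
```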

## Evaluation

TODO

## Citation

```bibtex
@article{conneau2019unsupervised,
  title={Unsupervised cross-lingual representation learning at scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

@article{reimers2019sentence,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{thakur2020augmented,
  title={Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks},
  author={Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna},
  journal={arXiv preprint arXiv:2010.08240},
  year={2020}
}
```
.ipynb_checkpoints/config-checkpoint.json DELETED
@@ -1,37 +0,0 @@
{
  "_name_or_path": "dangvantuan/bilingual_impl",
  "architectures": [
    "BilingualModel"
  ],
  "model_type": "bilingual",
  "auto_map": {
    "AutoConfig": "dangvantuan/bilingual_impl--config.BilingualConfig",
    "AutoModel": "dangvantuan/bilingual_impl--modeling.BilingualModel",
    "AutoModelForMaskedLM": "dangvantuan/bilingual_impl--modeling.BilingualForMaskedLM",
    "AutoModelForMultipleChoice": "dangvantuan/bilingual_impl--modeling.BilingualForMultipleChoice",
    "AutoModelForQuestionAnswering": "dangvantuan/bilingual_impl--modeling.BilingualForQuestionAnswering",
    "AutoModelForSequenceClassification": "dangvantuan/bilingual_impl--modeling.BilingualForSequenceClassification",
    "AutoModelForTokenClassification": "dangvantuan/bilingual_impl--modeling.BilingualForTokenClassification"
  },
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.39.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
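
The auto_map above routes the transformers auto-classes to the custom BilingualModel code hosted in dangvantuan/bilingual_impl, which is why loading requires trust_remote_code. A minimal sketch of loading the raw transformer this way:

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code allows transformers to fetch the BilingualModel
# implementation referenced by the auto_map entries.
tokenizer = AutoTokenizer.from_pretrained("Lajavaness/bilingual-embedding-base")
model = AutoModel.from_pretrained("Lajavaness/bilingual-embedding-base", trust_remote_code=True)
```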
.ipynb_checkpoints/config_sentence_transformers-checkpoint.json DELETED
@@ -1,9 +0,0 @@
{
  "__version__": {
    "sentence_transformers": "2.7.0",
    "transformers": "4.38.2",
    "pytorch": "2.2.1+cu121"
  },
  "prompts": {},
  "default_prompt_name": null
}