---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
license: mit
datasets:
- airesearch/WangchanX-Legal-ThaiCCL-RAG
- VISAI-AI/nitibench
language:
- th
- en
base_model:
- BAAI/bge-m3
---

# Auto-Finetuned BGE-M3 CCL

This is a [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) model finetuned on the queries in [`airesearch/WangchanX-Legal-ThaiCCL-RAG`](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG).
Note, however, that we did not use the positives shown in the dataset card: our positives were collected automatically, without any human supervision.

## Finetuning Details

Unlike the original [`airesearch/WangchanX-Legal-ThaiCCL-RAG`](https://huggingface.co/datasets/airesearch/WangchanX-Legal-ThaiCCL-RAG) training set, which relies on human annotators to rerank and remove irrelevant documents, this model was finetuned in a fully automated environment.
Specifically, given a query in the WangchanX-Legal-ThaiCCL-RAG dataset and a set of law sections to be retrieved, we followed the procedure below (a code sketch follows the list):
1. Use [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) to retrieve N candidate law sections, keeping only those whose similarity score exceeds a 0.8 threshold.
2. Rerank those N sections with [`BAAI/bge-reranker-v2-m3`](https://huggingface.co/BAAI/bge-reranker-v2-m3) and filter out any section whose reranker score is below 0.8, yielding the final positive law sections.
3. Finetune the BGE-M3 model on the positives obtained in step (2).
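
As a minimal sketch (not the exact training script), the automated positive mining in steps 1–2 might look like the following; the helper name `mine_positives` is ours, and we assume both 0.8 thresholds apply to normalized scores (cosine similarity for retrieval, sigmoid-normalized reranker scores):

```python
from FlagEmbedding import BGEM3FlagModel, FlagReranker

# Off-the-shelf models used only for positive mining (not the finetuned model itself)
retriever = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

def mine_positives(query, law_sections, retrieval_threshold=0.8, rerank_threshold=0.8):
    # Step 1: dense retrieval - keep sections whose similarity to the query clears the threshold
    q_emb = retriever.encode([query])['dense_vecs']
    s_emb = retriever.encode(law_sections)['dense_vecs']
    scores = (q_emb @ s_emb.T)[0]
    candidates = [s for s, score in zip(law_sections, scores) if score >= retrieval_threshold]
    if not candidates:
        return []
    # Step 2: rerank the candidates and drop any that score below the reranker threshold
    rerank_scores = reranker.compute_score([[query, c] for c in candidates], normalize=True)
    if isinstance(rerank_scores, float):  # compute_score returns a bare float for a single pair
        rerank_scores = [rerank_scores]
    # Step 3 then finetunes BGE-M3 on the surviving sections as positives
    return [c for c, score in zip(candidates, rerank_scores) if score >= rerank_threshold]
```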
 
 
 
 
## Model Performance

| **Dataset** | **Top-K** | **HR@k** | **Multi HR@k** | **Recall@k** | **MRR@k** | **Multi MRR@k** |
|:-----------------:|:---------:|:--------:|:--------------:|:------------:|:---------:|:---------------:|
| **NitiBench-CCL** | 1 | 0.731 | – | 0.731 | 0.731 | – |
| **NitiBench-CCL** | 5 | 0.900 | – | 0.900 | 0.800 | – |
| **NitiBench-CCL** | 10 | 0.934 | – | 0.934 | 0.804 | – |
| **NitiBench-Tax** | 1 | 0.520 | 0.160 | 0.281 | 0.520 | 0.281 |
| **NitiBench-Tax** | 5 | 0.700 | 0.200 | 0.382 | 0.587 | 0.329 |
| **NitiBench-Tax** | 10 | 0.780 | 0.260 | 0.483 | 0.600 | 0.345 |
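
Here HR@k (hit rate) is the fraction of queries with at least one gold law section in the top k, Recall@k the fraction of gold sections recovered, and MRR@k the mean reciprocal rank of the first relevant hit; the "Multi" variants are NitiBench's multi-label generalizations for queries with several gold sections (see the paper cited below). As a sketch, the single-label metrics for one query can be computed as:

```python
def hit_rate_at_k(ranked_ids, gold_ids, k):
    # 1.0 if any gold section appears among the top-k retrieved sections
    return float(any(doc_id in gold_ids for doc_id in ranked_ids[:k]))

def recall_at_k(ranked_ids, gold_ids, k):
    # fraction of gold sections recovered within the top-k
    return len(set(ranked_ids[:k]) & set(gold_ids)) / len(gold_ids)

def mrr_at_k(ranked_ids, gold_ids, k):
    # reciprocal rank of the first gold section within the top-k, else 0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0
```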

## Usage

Install from source:
```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```
or from PyPI:
```
pip install -U FlagEmbedding
```

### Generate Embeddings for Text

- Dense Embedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-auto-finetuned-bge-m3',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.6265, 0.3477], [0.3499, 0.678 ]]
```
You can also use sentence-transformers or Hugging Face transformers to generate dense embeddings.
Refer to [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding#usage) for details.
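
For example, a minimal sentence-transformers sketch (dense embeddings only; sparse and ColBERT outputs still require FlagEmbedding):

```python
from sentence_transformers import SentenceTransformer

# Loads the dense encoder with the CLS-pooling + normalization head stored in the repo
model = SentenceTransformer('VISAI-AI/nitibench-ccl-auto-finetuned-bge-m3')
embeddings = model.encode(["What is BGE M3?", "Defination of BM25"])
print(embeddings.shape)
# (2, 1024)
```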

- Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-auto-finetuned-bge-m3',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
#  {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]

# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
```

- Multi-Vector (ColBERT)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-auto-finetuned-bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7797
# 0.4620
```

### Compute Scores for Text Pairs
Input a list of text pairs and you can get scores computed by different methods.
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('VISAI-AI/nitibench-ccl-auto-finetuned-bge-m3', use_fp16=True)

sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs,
                          max_passage_length=128,  # a smaller max length leads to lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes(w) is used for the weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

# {
#   'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
#   'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
#   'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
#   'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
# }
```
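
Note that the combined scores are weight-normalized: with `weights_for_different_modes=[0.4, 0.2, 0.4]`, `colbert+sparse+dense` is `0.4*dense + 0.2*sparse + 0.4*colbert`, while `sparse+dense` renormalizes over the weights actually used, i.e. `(0.4*dense + 0.2*sparse) / 0.6`. For the first pair above, `0.4*0.6260 + 0.2*0.1956 + 0.4*0.7796 ≈ 0.6014`, matching the printed output.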

## Acknowledgement
Thanks to Pirat Pothavorn for evaluating the model performance on NitiBench and to Supavish Punchun for finetuning the model and preparing the positives for finetuning. Additionally, we thank all the authors of this open-source project.

## Citation

### BibTeX
```
@misc{akarajaradwong2025nitibenchcomprehensivestudiesllm,
      title={NitiBench: A Comprehensive Studies of LLM Frameworks Capabilities for Thai Legal Question Answering},
      author={Pawitsapak Akarajaradwong and Pirat Pothavorn and Chompakorn Chaksangchaichot and Panuthep Tasawong and Thitiwat Nopparatbundit and Sarana Nutanong},
      year={2025},
      eprint={2502.10868},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10868},
}
```