---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- vi
---

# NghiemAbe/Vi-Legal-Bi-Encoder-v2

This is a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.


## Usage (Sentence-Transformers)

Using this model is straightforward once [sentence-transformers](https://www.SBERT.net) is installed. The example below also uses `pyvi` for Vietnamese word segmentation:

```
pip install -U sentence-transformers pyvi
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize

# pyvi segments Vietnamese text into words before encoding
sentences = [tokenize("This is an example sentence"), tokenize("Each sentence is converted")]

model = SentenceTransformer('NghiemAbe/Vi-Legal-Bi-Encoder-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
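
Once sentences are encoded, semantic search reduces to ranking candidates by cosine similarity to the query embedding. The following is a minimal sketch, assuming the embeddings are NumPy arrays like those returned by `model.encode`. The helper name `cosine_rank` and the toy 3-dimensional vectors are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores

# Toy vectors standing in for real embeddings (hypothetical data).
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # 45 degrees from the query
])

order, scores = cosine_rank(query, docs)
print(order.tolist())
```

In practice, `query` and `docs` would come from `model.encode(...)` on segmented text, and `order` gives the retrieval ranking.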



## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first pass the input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
from pyvi.ViTokenizer import tokenize  # needed for the tokenize() calls below
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = [tokenize("This is an example sentence"), tokenize("Each sentence is converted")]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NghiemAbe/Vi-Legal-Bi-Encoder-v2')
model = AutoModel.from_pretrained('NghiemAbe/Vi-Legal-Bi-Encoder-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
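
To see why the attention mask matters in `mean_pooling`, here is a small sketch of the same computation in NumPy on a hand-made batch. The function name `mean_pool_np`, the shapes, and the values are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_pool_np(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    # Expand the mask from (batch, seq) to (batch, seq, 1) so it broadcasts over the hidden size.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clamp the token count to avoid dividing by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One sentence, three positions, hidden size 2; the last position is padding.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool_np(tokens, mask)
print(pooled)
```

The padded vector `[100, 100]` does not influence the result, whereas a plain mean over all three positions would be skewed by it.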



## Evaluation Results

I evaluated the model on my [Dev-Legal-Dataset](https://huggingface.co/datasets/NghiemAbe/dev_legal), alongside several baselines. The results are:

| Model                                                                  | R@1  | R@5  | R@10 | R@20 | R@100 | MRR@5 | MRR@10 | MRR@20 | MRR@100 | Avg  |
|------------------------------------------------------------------------|------|------|------|------|-------|-------|--------|--------|---------|------|
| keepitreal/vietnamese-sbert                                            | 0.278| 0.552| 0.649| 0.734| 0.842 | 0.396 | 0.409  | 0.415  | 0.417   | 0.521|
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2            | 0.314| 0.486| 0.585| 0.662| 0.854 | 0.395 | 0.409  | 0.414  | 0.419   | 0.504|
| sentence-transformers/paraphrase-multilingual-mpnet-base-v2            | 0.354| 0.553| 0.646| 0.750| 0.896 | 0.449 | 0.461  | 0.468  | 0.472   | 0.561|
| intfloat/multilingual-e5-small                                         | 0.488| 0.746| 0.835| 0.906| 0.962 | 0.610 | 0.620  | 0.624  | 0.625   | 0.713|
| intfloat/multilingual-e5-base                                          | 0.466| 0.740| 0.840| 0.907| 0.952 | 0.596 | 0.608  | 0.612  | 0.613   | 0.704|
| bkai-foundation-models/vietnamese-bi-encoder                           | 0.644| 0.881| 0.924| 0.954| 0.986 | 0.752 | 0.757  | 0.758  | 0.759   | 0.824|
| Vi-Legal-Bi-Encoder-v2                                                 | 0.720| 0.884| 0.935| 0.963| 0.986 | 0.796 | 0.802  | 0.803  | 0.804   | 0.855|
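
For readers unfamiliar with the column names, the following is a minimal sketch of how Recall@k (R@k) and MRR@k are commonly computed for retrieval. It assumes each query has a set of relevant document IDs and a ranked list of retrieved IDs, and it counts a query as a hit when any relevant ID appears in the top k. The function names and toy data are hypothetical, and the exact evaluation script behind the table is not shown in this card. The Avg column is consistent with the plain mean of the nine metric columns.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant ID in the top k (assumed definition)."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if rel & set(r[:k]))
    return hits / len(ranked)

def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant ID within the top k (0 if none)."""
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(r[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)

# Hypothetical toy data: two queries with ranked results and gold IDs.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"q"}]

print(recall_at_k(ranked, relevant, 1))
print(recall_at_k(ranked, relevant, 3))
print(mrr_at_k(ranked, relevant, 3))
```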