prashantkumarbarman committed
Commit efa587c
Parent(s): c093cc1
Pushing the model to the Hugging Face hub
- README.md +210 -0
- config.json +25 -0
- flax_model.msgpack +3 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,210 @@
---
language:
- multilingual
- af
- sq
- ar
- an
- hy
- ast
- az
- ba
- eu
- bar
- be
- bn
- inc
- bs
- br
- bg
- my
- ca
- ceb
- ce
- zh
- cv
- hr
- cs
- da
- nl
- en
- et
- fi
- fr
- gl
- ka
- de
- el
- gu
- ht
- he
- hi
- hu
- is
- io
- id
- ga
- it
- ja
- jv
- kn
- kk
- ky
- ko
- la
- lv
- lt
- roa
- nds
- lm
- mk
- mg
- ms
- ml
- mr
- min
- ne
- new
- nb
- nn
- oc
- fa
- pms
- pl
- pt
- pa
- ro
- ru
- sco
- sr
- scn
- sk
- sl
- aze
- es
- su
- sw
- sv
- tl
- tg
- ta
- tt
- te
- tr
- uk
- ud
- uz
- vi
- vo
- war
- cy
- fry
- pnb
- yo
thumbnail: https://amberoad.de/images/logo_text.png
tags:
- msmarco
- multilingual
- passage reranking
license: apache-2.0
datasets:
- msmarco
metrics:
- MRR
widget:
- query: What is a corporation?
  passage: A company is incorporated in a specific nation, often within the bounds
    of a smaller subset of that nation, such as a state or province. The corporation
    is then governed by the laws of incorporation in that state. A corporation may
    issue stock, either private or public, or may be classified as a non-stock corporation.
    If stock is issued, the corporation will usually be governed by its shareholders,
    either directly or indirectly.
---

# Passage Reranking Multilingual BERT 🔃 🌍

## Model description

**Input:** Supports over 100 languages. See the [list of supported languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) for all available options.

**Purpose:** This module takes a search query [1] and a passage [2] and calculates whether the passage matches the query. It can be used to improve Elasticsearch results, boosting relevancy by up to 100%.

**Architecture:** On top of BERT there is a densely connected NN which takes the 768-dimensional [CLS] token as input and provides the output ([Arxiv](https://arxiv.org/abs/1901.04085)).

**Output:** A single value between -10 and 10. Better-matching query/passage pairs tend to have a higher score.

## Intended uses & limitations

Query [1] and passage [2] together have to fit into 512 tokens.
As you normally want to rerank the first few dozen search results, keep in mind the inference time of approximately 300 ms per query.

#### How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")

model = AutoModelForSequenceClassification.from_pretrained("amberoad/bert-multilingual-passage-reranking-msmarco")
```

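Once the tokenizer and model are loaded, the typical usage pattern is to score each query/passage pair and sort the candidates by score. The sketch below illustrates that reranking loop with a placeholder overlap-based `score` function so it runs without downloading the model; the function name, the overlap heuristic, and the logit-indexing comment are illustrative assumptions, not this model's exact API:

```python
# Sketch of reranking candidate passages with a cross-encoder score
# function. `score` here is a stand-in for the model call; with the real
# model it would be roughly:
#   inputs = tokenizer(query, passage, return_tensors="pt",
#                      truncation=True, max_length=512)
#   relevance = model(**inputs).logits  # read the relevance logit from here

def score(query: str, passage: str) -> float:
    # Placeholder relevance score: token overlap between query and passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, passages: list[str]) -> list[str]:
    # Sort candidates (e.g. the top results from Elasticsearch/BM25)
    # by descending relevance score.
    return sorted(passages, key=lambda p: score(query, p), reverse=True)

passages = [
    "The weather today is sunny.",
    "A corporation is a legal entity that may issue stock.",
]
print(rerank("What is a corporation?", passages)[0])
# -> A corporation is a legal entity that may issue stock.
```

In a search pipeline, only the first-stage results (e.g. the top 50 from BM25) are passed through `rerank`, keeping the per-request latency bounded.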
This model can be used as a drop-in replacement in the [Nboost library](https://github.com/koursaros-ai/nboost).
Through this you can directly improve your Elasticsearch results without any coding.


## Training data

This model is trained using the [**Microsoft MS Marco Dataset**](https://microsoft.github.io/msmarco/ "Microsoft MS Marco"). This training dataset contains approximately 400M tuples of a query together with relevant and non-relevant passages. All datasets used for training and evaluation are listed in this [table](https://github.com/microsoft/MSMARCO-Passage-Ranking#data-information-and-formating). The dataset used for training is called *Train Triples Large*, while the evaluation was made on *Top 1000 Dev*. There are 6,900 queries in total in the development dataset, where each query is mapped to the top 1,000 passages retrieved using BM25 from the MS MARCO corpus.

## Training procedure

The training is performed the same way as stated in this [README](https://github.com/nyu-dl/dl4marco-bert "NYU Github"). See their excellent paper on [Arxiv](https://arxiv.org/abs/1901.04085).

We changed the BERT model from an English-only one to the default multilingual uncased BERT model from [Google](https://huggingface.co/bert-base-multilingual-uncased).

Training was done for 400,000 steps, which took 12 hours on a TPU v3-8.


## Eval results

We see nearly the same performance as the English-only model on the English [Bing Queries Dataset](http://www.msmarco.org/). Although the training data is English-only, internal tests on private data showed a far higher accuracy in German than all other available models.


Fine-tuned Models | Dependency | Eval Set | Search Boost | Speed on GPU
----------------- | ---------- | -------- | ------------ | ------------
**`amberoad/Multilingual-uncased-MSMARCO`** (this model) | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-blue"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+61%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query
`nboost/pt-tinybert-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+45%** <sub><sup>(0.26 vs 0.18)</sup></sub> | ~50 ms/query
`nboost/pt-bert-base-uncased-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+62%** <sub><sup>(0.29 vs 0.18)</sup></sub> | ~300 ms/query
`nboost/pt-bert-large-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='http://www.msmarco.org/'>bing queries</a> | **+77%** <sub><sup>(0.32 vs 0.18)</sup></sub> | -
`nboost/pt-biobert-base-msmarco` | <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-red"/> | <a href='https://github.com/naver/biobert-pretrained'>biomed</a> | **+66%** <sub><sup>(0.17 vs 0.10)</sup></sub> | ~300 ms/query

This table is taken from [nboost](https://github.com/koursaros-ai/nboost) and extended by the first line.
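The numbers in parentheses above (e.g. 0.29 vs 0.18) are MRR scores: for each query, take the reciprocal of the rank at which the first relevant passage appears, then average over all queries. A minimal sketch of MRR@k (the function name and input shape are illustrative, not from an evaluation library):

```python
# Mean Reciprocal Rank (MRR@k): for each query, take 1/rank of the first
# relevant result within the top k (0 if none appears), then average.

def mrr_at_k(rankings: list[list[bool]], k: int = 10) -> float:
    # rankings: one list per query; each entry says whether the result
    # at that position is relevant, in ranked order.
    total = 0.0
    for ranked in rankings:
        for i, relevant in enumerate(ranked[:k], start=1):
            if relevant:
                total += 1.0 / i
                break
    return total / len(rankings)

# Query 1: first relevant hit at rank 2 -> 1/2; query 2: rank 1 -> 1/1.
print(mrr_at_k([[False, True, False], [True, False]]))  # -> 0.75
```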



## Contact info

![](https://amberoad.de/images/logo_text.png)

Amberoad is a company focusing on search and business intelligence.
We provide you:
* Advanced internal company search engines through NLP
* External search engines: find competitors, customers, suppliers

**Get in contact now to benefit from our expertise:**

The training and evaluation were performed by [**Philipp Reissel**](https://reissel.eu/) and [**Igli Manaj**](https://github.com/iglimanaj)

[![Amberoad](https://i.stack.imgur.com/gVE0j.png) Linkedin](https://de.linkedin.com/company/amberoad) | [Homepage](https://de.linkedin.com/company/amberoad) | [Email]([email protected])
config.json
ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 105879
}
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1d4cd912eb99c7d8d5a9e3a58a8cdecd47fda4fcf59bd0fec8a0b06b1584b099
size 669439034
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:188287a61bb87387f5a2783ccfffb4f649ceed50d8fc7bbff4a7cb964105cbc1
size 669478888
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:782ba55476d1387470ab4c0b8ecdf05b544fc86f484952044aa57469332a63ec
size 669702896
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"special_tokens_map_file": null, "full_tokenizer_file": null}
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff