clarine committed · Commit 0093cfb · verified · 1 Parent(s): d4a1a0e

Add metrics on Polish datasets (#1)

- Add metrics on Polish datasets (44a46d4ae7971db9750d9be14cf75334cc66bb71)
- Unified readme format (4115910358487c8bb4bd66e3f35114010158e643)

Files changed (1)
  1. README.md +41 -12
README.md CHANGED
@@ -7,9 +7,9 @@ language:
  - it
  - ja
  - nl
- - pl
  - pt
  - zh
+ - pl
  ---

  # Model Card for `passage-ranker.pistachio`
@@ -22,27 +22,28 @@ Model name: `passage-ranker.pistachio`

  The model was trained and tested in the following languages:

- - Chinese (simplified)
- - Dutch
  - English
  - French
  - German
+ - Spanish
  - Italian
+ - Dutch
  - Japanese
- - Polish
  - Portuguese
- - Spanish
+ - Chinese (simplified)
+ - Polish

  Besides the aforementioned languages, basic support can be expected for an additional 93 languages that were used during the pretraining of the base model (see
  [list of languages](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages)).

  ## Scores

- | Metric | Value |
- |:--------------------|------:|
- | Relevance (NDCG@10) | 0.480 |
+ | Metric | Value |
+ |:----------------------------|------:|
+ | English Relevance (NDCG@10) | 0.474 |
+ | Polish Relevance (NDCG@10) | 0.380 |

- Note that the relevance score is computed as an average over 14 retrieval datasets (see
+ Note that the relevance score is computed as an average over several retrieval datasets (see
  [details below](#evaluation-metrics)).

  ## Inference Times
@@ -93,6 +94,8 @@ can be around 0.5 to 1 GiB depending on the used GPU.

  ### Evaluation Metrics

+ #### English
+
  To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the
  [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in English.

@@ -115,12 +118,38 @@ To determine the relevance score, we averaged the results that we obtained when
  | TREC-COVID | 0.651 |
  | Webis-Touche-2020 | 0.312 |

- We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its multilingual capacities. Note that not all training languages are part of the benchmark, so we only report the metrics for the existing languages.
+ #### Polish
+
+ This model supports Polish; it was evaluated on a subset of
+ the [PIRBenchmark](https://github.com/sdadas/pirb) with BM25 as the first-stage retriever.
+
+
+ | Dataset | NDCG@10 |
+ |:--------------|--------:|
+ | Average | 0.380 |
+ | | |
+ | arguana-pl | 0.285 |
+ | dbpedia-pl | 0.283 |
+ | fiqa-pl | 0.223 |
+ | hotpotqa-pl | 0.603 |
+ | msmarco-pl | 0.259 |
+ | nfcorpus-pl | 0.293 |
+ | nq-pl | 0.355 |
+ | quora-pl | 0.613 |
+ | scidocs-pl | 0.128 |
+ | scifact-pl | 0.581 |
+ | trec-covid-pl | 0.560 |
+
+ #### Other languages
+
+ We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its
+ multilingual capacities. Note that not all training languages are part of the benchmark, so we only report the metrics
+ for the existing languages.

  | Language | NDCG@10 |
  |:----------------------|--------:|
- | Chinese (simplified) | 0.454 |
  | French | 0.439 |
  | German | 0.418 |
+ | Spanish | 0.487 |
  | Japanese | 0.517 |
- | Spanish | 0.487 |
+ | Chinese (simplified) | 0.454 |
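
The `Average` row that this commit adds to the Polish table can be cross-checked against the per-dataset scores. The following is a minimal sketch, assuming the average is the unweighted mean of the eleven NDCG@10 values (the README does not state the averaging method); the variable names are illustrative and not part of the repository.

```python
# NDCG@10 per dataset, copied from the Polish table added in this commit.
polish_ndcg_at_10 = {
    "arguana-pl": 0.285,
    "dbpedia-pl": 0.283,
    "fiqa-pl": 0.223,
    "hotpotqa-pl": 0.603,
    "msmarco-pl": 0.259,
    "nfcorpus-pl": 0.293,
    "nq-pl": 0.355,
    "quora-pl": 0.613,
    "scidocs-pl": 0.128,
    "scifact-pl": 0.581,
    "trec-covid-pl": 0.560,
}

# Assumption: "Average" is the plain (unweighted) mean over the datasets.
polish_average = sum(polish_ndcg_at_10.values()) / len(polish_ndcg_at_10)

print(f"Polish Relevance (NDCG@10): {polish_average:.3f}")  # prints 0.380
```

Under that assumption the mean rounds to 0.380, which agrees with both the `Average` row and the Polish entry in the `Scores` table.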