mmlw-roberta-large / README.md
sdadas's picture
Update README.md
51811d7 verified
|
raw
history blame
30.4 kB
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- mteb
model-index:
- name: mmlw-roberta-large
results:
- task:
type: Clustering
dataset:
type: PL-MTEB/8tags-clustering
name: MTEB 8TagsClustering
config: default
split: test
revision: None
metrics:
- type: v_measure
value: 31.16472823814849
- task:
type: Classification
dataset:
type: PL-MTEB/allegro-reviews
name: MTEB AllegroReviews
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 47.48508946322067
- type: f1
value: 42.33327527584009
- task:
type: Retrieval
dataset:
type: arguana-pl
name: MTEB ArguAna-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 38.834
- type: map_at_10
value: 55.22899999999999
- type: map_at_100
value: 55.791999999999994
- type: map_at_1000
value: 55.794
- type: map_at_3
value: 51.233
- type: map_at_5
value: 53.772
- type: mrr_at_1
value: 39.687
- type: mrr_at_10
value: 55.596000000000004
- type: mrr_at_100
value: 56.157000000000004
- type: mrr_at_1000
value: 56.157999999999994
- type: mrr_at_3
value: 51.66
- type: mrr_at_5
value: 54.135
- type: ndcg_at_1
value: 38.834
- type: ndcg_at_10
value: 63.402
- type: ndcg_at_100
value: 65.78
- type: ndcg_at_1000
value: 65.816
- type: ndcg_at_3
value: 55.349000000000004
- type: ndcg_at_5
value: 59.892
- type: precision_at_1
value: 38.834
- type: precision_at_10
value: 8.905000000000001
- type: precision_at_100
value: 0.9939999999999999
- type: precision_at_1000
value: 0.1
- type: precision_at_3
value: 22.428
- type: precision_at_5
value: 15.647
- type: recall_at_1
value: 38.834
- type: recall_at_10
value: 89.047
- type: recall_at_100
value: 99.36
- type: recall_at_1000
value: 99.644
- type: recall_at_3
value: 67.283
- type: recall_at_5
value: 78.236
- task:
type: Classification
dataset:
type: PL-MTEB/cbd
name: MTEB CBD
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 69.33
- type: ap
value: 22.972409521444508
- type: f1
value: 58.91072163784952
- task:
type: PairClassification
dataset:
type: PL-MTEB/cdsce-pairclassification
name: MTEB CDSC-E
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 89.8
- type: cos_sim_ap
value: 79.87039801032493
- type: cos_sim_f1
value: 68.53932584269663
- type: cos_sim_precision
value: 73.49397590361446
- type: cos_sim_recall
value: 64.21052631578948
- type: dot_accuracy
value: 86.1
- type: dot_ap
value: 63.684975861694035
- type: dot_f1
value: 63.61746361746362
- type: dot_precision
value: 52.57731958762887
- type: dot_recall
value: 80.52631578947368
- type: euclidean_accuracy
value: 89.8
- type: euclidean_ap
value: 79.7527126811392
- type: euclidean_f1
value: 68.46361185983827
- type: euclidean_precision
value: 70.1657458563536
- type: euclidean_recall
value: 66.84210526315789
- type: manhattan_accuracy
value: 89.7
- type: manhattan_ap
value: 79.64632771093657
- type: manhattan_f1
value: 68.4931506849315
- type: manhattan_precision
value: 71.42857142857143
- type: manhattan_recall
value: 65.78947368421053
- type: max_accuracy
value: 89.8
- type: max_ap
value: 79.87039801032493
- type: max_f1
value: 68.53932584269663
- task:
type: STS
dataset:
type: PL-MTEB/cdscr-sts
name: MTEB CDSC-R
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 92.1088892402831
- type: cos_sim_spearman
value: 92.54126377343101
- type: euclidean_pearson
value: 91.99022371986013
- type: euclidean_spearman
value: 92.55235973775511
- type: manhattan_pearson
value: 91.92170171331357
- type: manhattan_spearman
value: 92.47797623672449
- task:
type: Retrieval
dataset:
type: dbpedia-pl
name: MTEB DBPedia-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 8.683
- type: map_at_10
value: 18.9
- type: map_at_100
value: 26.933
- type: map_at_1000
value: 28.558
- type: map_at_3
value: 13.638
- type: map_at_5
value: 15.9
- type: mrr_at_1
value: 63.74999999999999
- type: mrr_at_10
value: 73.566
- type: mrr_at_100
value: 73.817
- type: mrr_at_1000
value: 73.824
- type: mrr_at_3
value: 71.875
- type: mrr_at_5
value: 73.2
- type: ndcg_at_1
value: 53.125
- type: ndcg_at_10
value: 40.271
- type: ndcg_at_100
value: 45.51
- type: ndcg_at_1000
value: 52.968
- type: ndcg_at_3
value: 45.122
- type: ndcg_at_5
value: 42.306
- type: precision_at_1
value: 63.74999999999999
- type: precision_at_10
value: 31.55
- type: precision_at_100
value: 10.440000000000001
- type: precision_at_1000
value: 2.01
- type: precision_at_3
value: 48.333
- type: precision_at_5
value: 40.5
- type: recall_at_1
value: 8.683
- type: recall_at_10
value: 24.63
- type: recall_at_100
value: 51.762
- type: recall_at_1000
value: 75.64999999999999
- type: recall_at_3
value: 15.136
- type: recall_at_5
value: 18.678
- task:
type: Retrieval
dataset:
type: fiqa-pl
name: MTEB FiQA-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 19.872999999999998
- type: map_at_10
value: 32.923
- type: map_at_100
value: 34.819
- type: map_at_1000
value: 34.99
- type: map_at_3
value: 28.500999999999998
- type: map_at_5
value: 31.087999999999997
- type: mrr_at_1
value: 40.432
- type: mrr_at_10
value: 49.242999999999995
- type: mrr_at_100
value: 50.014
- type: mrr_at_1000
value: 50.05500000000001
- type: mrr_at_3
value: 47.144999999999996
- type: mrr_at_5
value: 48.171
- type: ndcg_at_1
value: 40.586
- type: ndcg_at_10
value: 40.887
- type: ndcg_at_100
value: 47.701
- type: ndcg_at_1000
value: 50.624
- type: ndcg_at_3
value: 37.143
- type: ndcg_at_5
value: 38.329
- type: precision_at_1
value: 40.586
- type: precision_at_10
value: 11.497
- type: precision_at_100
value: 1.838
- type: precision_at_1000
value: 0.23700000000000002
- type: precision_at_3
value: 25.0
- type: precision_at_5
value: 18.549
- type: recall_at_1
value: 19.872999999999998
- type: recall_at_10
value: 48.073
- type: recall_at_100
value: 73.473
- type: recall_at_1000
value: 90.94
- type: recall_at_3
value: 33.645
- type: recall_at_5
value: 39.711
- task:
type: Retrieval
dataset:
type: hotpotqa-pl
name: MTEB HotpotQA-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 39.399
- type: map_at_10
value: 62.604000000000006
- type: map_at_100
value: 63.475
- type: map_at_1000
value: 63.534
- type: map_at_3
value: 58.870999999999995
- type: map_at_5
value: 61.217
- type: mrr_at_1
value: 78.758
- type: mrr_at_10
value: 84.584
- type: mrr_at_100
value: 84.753
- type: mrr_at_1000
value: 84.759
- type: mrr_at_3
value: 83.65700000000001
- type: mrr_at_5
value: 84.283
- type: ndcg_at_1
value: 78.798
- type: ndcg_at_10
value: 71.04
- type: ndcg_at_100
value: 74.048
- type: ndcg_at_1000
value: 75.163
- type: ndcg_at_3
value: 65.862
- type: ndcg_at_5
value: 68.77600000000001
- type: precision_at_1
value: 78.798
- type: precision_at_10
value: 14.949000000000002
- type: precision_at_100
value: 1.7309999999999999
- type: precision_at_1000
value: 0.188
- type: precision_at_3
value: 42.237
- type: precision_at_5
value: 27.634999999999998
- type: recall_at_1
value: 39.399
- type: recall_at_10
value: 74.747
- type: recall_at_100
value: 86.529
- type: recall_at_1000
value: 93.849
- type: recall_at_3
value: 63.356
- type: recall_at_5
value: 69.08800000000001
- task:
type: Retrieval
dataset:
type: msmarco-pl
name: MTEB MSMARCO-PL
config: default
split: validation
revision: None
metrics:
- type: map_at_1
value: 19.598
- type: map_at_10
value: 30.453999999999997
- type: map_at_100
value: 31.601000000000003
- type: map_at_1000
value: 31.66
- type: map_at_3
value: 27.118
- type: map_at_5
value: 28.943
- type: mrr_at_1
value: 20.1
- type: mrr_at_10
value: 30.978
- type: mrr_at_100
value: 32.057
- type: mrr_at_1000
value: 32.112
- type: mrr_at_3
value: 27.679
- type: mrr_at_5
value: 29.493000000000002
- type: ndcg_at_1
value: 20.158
- type: ndcg_at_10
value: 36.63
- type: ndcg_at_100
value: 42.291000000000004
- type: ndcg_at_1000
value: 43.828
- type: ndcg_at_3
value: 29.744999999999997
- type: ndcg_at_5
value: 33.024
- type: precision_at_1
value: 20.158
- type: precision_at_10
value: 5.811999999999999
- type: precision_at_100
value: 0.868
- type: precision_at_1000
value: 0.1
- type: precision_at_3
value: 12.689
- type: precision_at_5
value: 9.295
- type: recall_at_1
value: 19.598
- type: recall_at_10
value: 55.596999999999994
- type: recall_at_100
value: 82.143
- type: recall_at_1000
value: 94.015
- type: recall_at_3
value: 36.720000000000006
- type: recall_at_5
value: 44.606
- task:
type: Classification
dataset:
type: mteb/amazon_massive_intent
name: MTEB MassiveIntentClassification (pl)
config: pl
split: test
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
metrics:
- type: accuracy
value: 74.8117014122394
- type: f1
value: 72.0259730121889
- task:
type: Classification
dataset:
type: mteb/amazon_massive_scenario
name: MTEB MassiveScenarioClassification (pl)
config: pl
split: test
revision: 7d571f92784cd94a019292a1f45445077d0ef634
metrics:
- type: accuracy
value: 77.84465366509752
- type: f1
value: 77.73439218970051
- task:
type: Retrieval
dataset:
type: nfcorpus-pl
name: MTEB NFCorpus-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 5.604
- type: map_at_10
value: 12.684000000000001
- type: map_at_100
value: 16.274
- type: map_at_1000
value: 17.669
- type: map_at_3
value: 9.347
- type: map_at_5
value: 10.752
- type: mrr_at_1
value: 43.963
- type: mrr_at_10
value: 52.94
- type: mrr_at_100
value: 53.571000000000005
- type: mrr_at_1000
value: 53.613
- type: mrr_at_3
value: 51.032
- type: mrr_at_5
value: 52.193
- type: ndcg_at_1
value: 41.486000000000004
- type: ndcg_at_10
value: 33.937
- type: ndcg_at_100
value: 31.726
- type: ndcg_at_1000
value: 40.331
- type: ndcg_at_3
value: 39.217
- type: ndcg_at_5
value: 36.521
- type: precision_at_1
value: 43.034
- type: precision_at_10
value: 25.324999999999996
- type: precision_at_100
value: 8.022
- type: precision_at_1000
value: 2.0629999999999997
- type: precision_at_3
value: 36.945
- type: precision_at_5
value: 31.517
- type: recall_at_1
value: 5.604
- type: recall_at_10
value: 16.554
- type: recall_at_100
value: 33.113
- type: recall_at_1000
value: 62.832
- type: recall_at_3
value: 10.397
- type: recall_at_5
value: 12.629999999999999
- task:
type: Retrieval
dataset:
type: nq-pl
name: MTEB NQ-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 26.642
- type: map_at_10
value: 40.367999999999995
- type: map_at_100
value: 41.487
- type: map_at_1000
value: 41.528
- type: map_at_3
value: 36.292
- type: map_at_5
value: 38.548
- type: mrr_at_1
value: 30.156
- type: mrr_at_10
value: 42.853
- type: mrr_at_100
value: 43.742
- type: mrr_at_1000
value: 43.772
- type: mrr_at_3
value: 39.47
- type: mrr_at_5
value: 41.366
- type: ndcg_at_1
value: 30.214000000000002
- type: ndcg_at_10
value: 47.620000000000005
- type: ndcg_at_100
value: 52.486
- type: ndcg_at_1000
value: 53.482
- type: ndcg_at_3
value: 39.864
- type: ndcg_at_5
value: 43.645
- type: precision_at_1
value: 30.214000000000002
- type: precision_at_10
value: 8.03
- type: precision_at_100
value: 1.0739999999999998
- type: precision_at_1000
value: 0.117
- type: precision_at_3
value: 18.183
- type: precision_at_5
value: 13.105
- type: recall_at_1
value: 26.642
- type: recall_at_10
value: 67.282
- type: recall_at_100
value: 88.632
- type: recall_at_1000
value: 96.109
- type: recall_at_3
value: 47.048
- type: recall_at_5
value: 55.791000000000004
- task:
type: Classification
dataset:
type: laugustyniak/abusive-clauses-pl
name: MTEB PAC
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 64.69446857804807
- type: ap
value: 75.58028779280512
- type: f1
value: 62.3610392963539
- task:
type: PairClassification
dataset:
type: PL-MTEB/ppc-pairclassification
name: MTEB PPC
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 88.4
- type: cos_sim_ap
value: 93.56462741831817
- type: cos_sim_f1
value: 90.73634204275535
- type: cos_sim_precision
value: 86.94992412746586
- type: cos_sim_recall
value: 94.86754966887418
- type: dot_accuracy
value: 75.3
- type: dot_ap
value: 83.06945936688015
- type: dot_f1
value: 81.50887573964496
- type: dot_precision
value: 73.66310160427807
- type: dot_recall
value: 91.22516556291392
- type: euclidean_accuracy
value: 88.8
- type: euclidean_ap
value: 93.53974198044985
- type: euclidean_f1
value: 90.87947882736157
- type: euclidean_precision
value: 89.42307692307693
- type: euclidean_recall
value: 92.3841059602649
- type: manhattan_accuracy
value: 88.8
- type: manhattan_ap
value: 93.54209967780366
- type: manhattan_f1
value: 90.85072231139645
- type: manhattan_precision
value: 88.1619937694704
- type: manhattan_recall
value: 93.70860927152319
- type: max_accuracy
value: 88.8
- type: max_ap
value: 93.56462741831817
- type: max_f1
value: 90.87947882736157
- task:
type: PairClassification
dataset:
type: PL-MTEB/psc-pairclassification
name: MTEB PSC
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 97.03153988868274
- type: cos_sim_ap
value: 98.63208302459417
- type: cos_sim_f1
value: 95.06172839506173
- type: cos_sim_precision
value: 96.25
- type: cos_sim_recall
value: 93.90243902439023
- type: dot_accuracy
value: 86.82745825602969
- type: dot_ap
value: 83.77450133931302
- type: dot_f1
value: 79.3053545586107
- type: dot_precision
value: 75.48209366391184
- type: dot_recall
value: 83.53658536585365
- type: euclidean_accuracy
value: 97.03153988868274
- type: euclidean_ap
value: 98.80678168225653
- type: euclidean_f1
value: 95.20958083832335
- type: euclidean_precision
value: 93.52941176470588
- type: euclidean_recall
value: 96.95121951219512
- type: manhattan_accuracy
value: 97.21706864564007
- type: manhattan_ap
value: 98.82279484224186
- type: manhattan_f1
value: 95.44072948328268
- type: manhattan_precision
value: 95.15151515151516
- type: manhattan_recall
value: 95.73170731707317
- type: max_accuracy
value: 97.21706864564007
- type: max_ap
value: 98.82279484224186
- type: max_f1
value: 95.44072948328268
- task:
type: Classification
dataset:
type: PL-MTEB/polemo2_in
name: MTEB PolEmo2.0-IN
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 76.84210526315789
- type: f1
value: 75.49713789106988
- task:
type: Classification
dataset:
type: PL-MTEB/polemo2_out
name: MTEB PolEmo2.0-OUT
config: default
split: test
revision: None
metrics:
- type: accuracy
value: 53.7246963562753
- type: f1
value: 43.060592194322986
- task:
type: Retrieval
dataset:
type: quora-pl
name: MTEB Quora-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 67.021
- type: map_at_10
value: 81.362
- type: map_at_100
value: 82.06700000000001
- type: map_at_1000
value: 82.084
- type: map_at_3
value: 78.223
- type: map_at_5
value: 80.219
- type: mrr_at_1
value: 77.17
- type: mrr_at_10
value: 84.222
- type: mrr_at_100
value: 84.37599999999999
- type: mrr_at_1000
value: 84.379
- type: mrr_at_3
value: 83.003
- type: mrr_at_5
value: 83.834
- type: ndcg_at_1
value: 77.29
- type: ndcg_at_10
value: 85.506
- type: ndcg_at_100
value: 87.0
- type: ndcg_at_1000
value: 87.143
- type: ndcg_at_3
value: 82.17
- type: ndcg_at_5
value: 84.057
- type: precision_at_1
value: 77.29
- type: precision_at_10
value: 13.15
- type: precision_at_100
value: 1.522
- type: precision_at_1000
value: 0.156
- type: precision_at_3
value: 36.173
- type: precision_at_5
value: 23.988
- type: recall_at_1
value: 67.021
- type: recall_at_10
value: 93.943
- type: recall_at_100
value: 99.167
- type: recall_at_1000
value: 99.929
- type: recall_at_3
value: 84.55799999999999
- type: recall_at_5
value: 89.697
- task:
type: Retrieval
dataset:
type: scidocs-pl
name: MTEB SCIDOCS-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 4.523
- type: map_at_10
value: 11.584
- type: map_at_100
value: 13.705
- type: map_at_1000
value: 14.038999999999998
- type: map_at_3
value: 8.187999999999999
- type: map_at_5
value: 9.922
- type: mrr_at_1
value: 22.1
- type: mrr_at_10
value: 32.946999999999996
- type: mrr_at_100
value: 34.11
- type: mrr_at_1000
value: 34.163
- type: mrr_at_3
value: 29.633
- type: mrr_at_5
value: 31.657999999999998
- type: ndcg_at_1
value: 22.2
- type: ndcg_at_10
value: 19.466
- type: ndcg_at_100
value: 27.725
- type: ndcg_at_1000
value: 33.539
- type: ndcg_at_3
value: 18.26
- type: ndcg_at_5
value: 16.265
- type: precision_at_1
value: 22.2
- type: precision_at_10
value: 10.11
- type: precision_at_100
value: 2.204
- type: precision_at_1000
value: 0.36
- type: precision_at_3
value: 17.1
- type: precision_at_5
value: 14.44
- type: recall_at_1
value: 4.523
- type: recall_at_10
value: 20.497
- type: recall_at_100
value: 44.757000000000005
- type: recall_at_1000
value: 73.14699999999999
- type: recall_at_3
value: 10.413
- type: recall_at_5
value: 14.638000000000002
- task:
type: PairClassification
dataset:
type: PL-MTEB/sicke-pl-pairclassification
name: MTEB SICK-E-PL
config: default
split: test
revision: None
metrics:
- type: cos_sim_accuracy
value: 87.4235629841011
- type: cos_sim_ap
value: 84.46531935663157
- type: cos_sim_f1
value: 77.18910963944077
- type: cos_sim_precision
value: 79.83257229832572
- type: cos_sim_recall
value: 74.71509971509973
- type: dot_accuracy
value: 81.10476966979209
- type: dot_ap
value: 71.12231750543143
- type: dot_f1
value: 68.13455657492355
- type: dot_precision
value: 59.69989281886387
- type: dot_recall
value: 79.34472934472934
- type: euclidean_accuracy
value: 87.21973094170403
- type: euclidean_ap
value: 84.33077991405355
- type: euclidean_f1
value: 76.81931132410365
- type: euclidean_precision
value: 76.57466383581033
- type: euclidean_recall
value: 77.06552706552706
- type: manhattan_accuracy
value: 87.21973094170403
- type: manhattan_ap
value: 84.35651252115137
- type: manhattan_f1
value: 76.87004481213376
- type: manhattan_precision
value: 74.48229792919172
- type: manhattan_recall
value: 79.41595441595442
- type: max_accuracy
value: 87.4235629841011
- type: max_ap
value: 84.46531935663157
- type: max_f1
value: 77.18910963944077
- task:
type: STS
dataset:
type: PL-MTEB/sickr-pl-sts
name: MTEB SICK-R-PL
config: default
split: test
revision: None
metrics:
- type: cos_sim_pearson
value: 83.05629619004273
- type: cos_sim_spearman
value: 79.90632583043678
- type: euclidean_pearson
value: 81.56426663515931
- type: euclidean_spearman
value: 80.05439220131294
- type: manhattan_pearson
value: 81.52958181013108
- type: manhattan_spearman
value: 80.0387467163383
- task:
type: STS
dataset:
type: mteb/sts22-crosslingual-sts
name: MTEB STS22 (pl)
config: pl
split: test
revision: 6d1ba47164174a496b7fa5d3569dae26a6813b80
metrics:
- type: cos_sim_pearson
value: 35.93847200513348
- type: cos_sim_spearman
value: 39.31543525546526
- type: euclidean_pearson
value: 30.19743936591465
- type: euclidean_spearman
value: 39.966612599252095
- type: manhattan_pearson
value: 30.195614462473387
- type: manhattan_spearman
value: 39.822552043685754
- task:
type: Retrieval
dataset:
type: scifact-pl
name: MTEB SciFact-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 56.05
- type: map_at_10
value: 65.93299999999999
- type: map_at_100
value: 66.571
- type: map_at_1000
value: 66.60000000000001
- type: map_at_3
value: 63.489
- type: map_at_5
value: 64.91799999999999
- type: mrr_at_1
value: 59.0
- type: mrr_at_10
value: 67.026
- type: mrr_at_100
value: 67.559
- type: mrr_at_1000
value: 67.586
- type: mrr_at_3
value: 65.444
- type: mrr_at_5
value: 66.278
- type: ndcg_at_1
value: 59.0
- type: ndcg_at_10
value: 70.233
- type: ndcg_at_100
value: 72.789
- type: ndcg_at_1000
value: 73.637
- type: ndcg_at_3
value: 66.40700000000001
- type: ndcg_at_5
value: 68.206
- type: precision_at_1
value: 59.0
- type: precision_at_10
value: 9.367
- type: precision_at_100
value: 1.06
- type: precision_at_1000
value: 0.11299999999999999
- type: precision_at_3
value: 26.222
- type: precision_at_5
value: 17.067
- type: recall_at_1
value: 56.05
- type: recall_at_10
value: 82.089
- type: recall_at_100
value: 93.167
- type: recall_at_1000
value: 100.0
- type: recall_at_3
value: 71.822
- type: recall_at_5
value: 76.483
- task:
type: Retrieval
dataset:
type: trec-covid-pl
name: MTEB TRECCOVID-PL
config: default
split: test
revision: None
metrics:
- type: map_at_1
value: 0.21
- type: map_at_10
value: 1.7680000000000002
- type: map_at_100
value: 9.447999999999999
- type: map_at_1000
value: 21.728
- type: map_at_3
value: 0.603
- type: map_at_5
value: 0.9610000000000001
- type: mrr_at_1
value: 80.0
- type: mrr_at_10
value: 88.667
- type: mrr_at_100
value: 88.667
- type: mrr_at_1000
value: 88.667
- type: mrr_at_3
value: 87.667
- type: mrr_at_5
value: 88.667
- type: ndcg_at_1
value: 77.0
- type: ndcg_at_10
value: 70.814
- type: ndcg_at_100
value: 52.532000000000004
- type: ndcg_at_1000
value: 45.635999999999996
- type: ndcg_at_3
value: 76.542
- type: ndcg_at_5
value: 73.24000000000001
- type: precision_at_1
value: 80.0
- type: precision_at_10
value: 75.0
- type: precision_at_100
value: 53.879999999999995
- type: precision_at_1000
value: 20.002
- type: precision_at_3
value: 80.0
- type: precision_at_5
value: 76.4
- type: recall_at_1
value: 0.21
- type: recall_at_10
value: 2.012
- type: recall_at_100
value: 12.781999999999998
- type: recall_at_1000
value: 42.05
- type: recall_at_3
value: 0.644
- type: recall_at_5
value: 1.04
language: pl
license: apache-2.0
widget:
- source_sentence: "zapytanie: Jak dożyć 100 lat?"
sentences:
- "Trzeba zdrowo się odżywiać i uprawiać sport."
- "Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
- "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
---
<h1 align="center">MMLW-roberta-large</h1>
MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish.
This is a distilled model that can be used to generate embeddings applicable to many tasks such as semantic similarity, clustering, information retrieval. The model can also serve as a base for further fine-tuning.
It transforms texts to 1024 dimensional vectors.
The model was initialized with Polish RoBERTa checkpoint, and then trained with [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation.
## Usage (Sentence-Transformers)
⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix **"zapytanie: "** ⚠️
You can use the model like this with [sentence-transformers](https://www.SBERT.net):
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-roberta-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```
## Evaluation Results
- The model achieves an **Average Score** of **63.23** on the Polish Massive Text Embedding Benchmark (MTEB). See [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for detailed results.
- The model achieves **NDCG@10** of **55.95** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.
## Acknowledgements
This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.
## Citation
```bibtex
@article{dadas2024pirb,
title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods},
author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
year={2024},
eprint={2402.13350},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```