---
pipeline_tag: translation
library_name: comet
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: apache-2.0
base_model:
- FacebookAI/xlm-roberta-large
---
# PreCOMET-cons ([paper](https://arxiv.org/abs/2501.18251))
This is a source-only COMET model used for efficient evaluation subset selection.
Specifically, this model predicts the `consistency` of a single segment: how closely the system ordering induced by that segment alone matches the system ordering on the whole test set.
The higher the score, the more useful the segment is for evaluation, because fewer such segments are needed to arrive at the same system ordering.
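For intuition, here is a minimal sketch of what such a consistency score captures (a hypothetical helper for illustration, not the training code): the fraction of system pairs that a single segment orders the same way as the full test set.
```python
# Illustration only: pairwise agreement between the system ranking induced
# by one segment and the ranking on the whole test set.
import itertools

def pairwise_consistency(segment_scores, testset_scores):
    """segment_scores / testset_scores: dict mapping system -> quality score."""
    pairs = list(itertools.combinations(segment_scores, 2))
    agree = sum(
        (segment_scores[a] > segment_scores[b])
        == (testset_scores[a] > testset_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical scores for three systems on one segment vs. the full test set
print(pairwise_consistency(
    {"sysA": 0.9, "sysB": 0.5, "sysC": 0.7},
    {"sysA": 0.8, "sysB": 0.6, "sysC": 0.7},
))  # -> 1.0, the segment orders every system pair the same way
```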
This model is not compatible with Unbabel's original COMET; to run it, you have to install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):
```bash
pip install git+https://github.com/zouharvi/PreCOMET.git
```
You can then use it in Python:
```python
import precomet
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-cons"))
model.predict([
{"src": "This is an easy source sentence."},
{"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
])["scores"]
> [0.1797918677330017, 0.32624873518943787]
```
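Since higher predicted scores mark segments that are more informative for evaluation, one straightforward way to use the model on its own (the `subset2evaluate` package below wraps this for you) is to keep the top-scoring segments; a sketch with hypothetical data:
```python
# Sketch: rank candidate source segments by predicted usefulness, keep the top k.
candidates = [{"src": src} for src in [
    "The cat sat on the mat.",
    "Quantum decoherence limits qubit fidelity.",
    "Hello!",
]]
scores = model.predict(candidates)["scores"]
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_k = [segment for _, segment in ranked[:2]]  # evaluate systems on these segments
```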
The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:
```python
import subset2evaluate
data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
> 0.71
```
With a budget of only 100 segments, random selection yields a single system cluster (i.e., no systems can be statistically separated from one another) and a system-level Spearman correlation of 0.71. Using this model instead:
```python
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_cons")
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
> 0.81
```
we get a higher correlation (0.81 instead of 0.71) at the same budget of 100 segments.
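For reference, the reported correlation is the system-level Spearman correlation between system scores computed on the subset and on the full data; a rough illustration of that comparison (hypothetical numbers, not the internals of `subset2evaluate`):
```python
# Illustration: how closely a subset-based system ranking tracks the full one.
from scipy.stats import spearmanr

full_scores   = {"sysA": 0.81, "sysB": 0.74, "sysC": 0.69}  # averages over all segments
subset_scores = {"sysA": 0.79, "sysB": 0.70, "sysC": 0.72}  # averages over 100 segments

systems = list(full_scores)
corr, _ = spearmanr([full_scores[s] for s in systems],
                    [subset_scores[s] for s in systems])
print(corr)  # -> 0.5 here, since the subset swaps sysB and sysC
```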
You can expect a bigger effect on a larger scale, as described in the paper.
This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
Cite as:
```bibtex
@misc{zouhar2025selectdatapointsefficienthuman,
title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
year={2025},
eprint={2501.18251},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.18251},
}
``` |