Update README.md
README.md CHANGED
@@ -100,11 +100,60 @@ base_model:
- FacebookAI/xlm-roberta-large
---

# PreCOMET-cons [paper](https://arxiv.org/abs/2501.18251)

This is a source-only COMET model used for efficient evaluation subset selection.

Specifically, this model predicts the `consistency` of the system ordering induced by a single segment with the system ordering on the whole test set.

The higher the score, the more useful the segment is for evaluation, because fewer samples are then needed to arrive at the same system ordering.
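
To make the prediction target concrete, here is a minimal sketch of what such a consistency score measures, assuming pairwise agreement between the single-segment ranking and the full test-set ranking; the function and scores below are illustrative, not part of the PreCOMET API:

```python
import itertools

def pairwise_consistency(segment_scores, testset_scores):
    """Fraction of system pairs that a single segment orders
    the same way as the whole test set."""
    agree, total = 0, 0
    for a, b in itertools.combinations(segment_scores, 2):
        on_segment = segment_scores[a] - segment_scores[b]
        on_testset = testset_scores[a] - testset_scores[b]
        total += 1
        agree += (on_segment * on_testset) > 0
    return agree / total

# Hypothetical scores for three MT systems on one segment vs. the test-set average:
segment = {"sysA": 0.80, "sysB": 0.55, "sysC": 0.60}
testset = {"sysA": 0.72, "sysB": 0.69, "sysC": 0.64}
print(pairwise_consistency(segment, testset))  # ~0.67: the sysB/sysC pair flips
```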

It is not compatible with the original Unbabel COMET; to run it, you have to install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):

```bash
pip install git+https://github.com/zouharvi/PreCOMET.git
```

You can then use it in Python:

```python
import precomet

# Download the model from the Hugging Face hub and load it.
model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-cons"))

# The model scores source segments only; no translations or references are needed.
model.predict([
    {"src": "This is an easy source sentence."},
    {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
])["scores"]
> [0.1797918677330017, 0.32624873518943787]
```

The primary use of this model is through the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:

```python
import subset2evaluate

# Load WMT23 English-Czech data.
data_full = subset2evaluate.utils.load_data("wmt23/en-cs")

# Baseline: pick evaluation segments at random.
data_random = subset2evaluate.select_subset.basic(data_full, method="random")
subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
> 0.71
```

Random selection gives us only one cluster (a group of systems that cannot be statistically distinguished from each other) and a system-level Spearman correlation of 0.71 when we have a budget of only 100 segments. However, by using this model:

```python
# Select segments ranked by this model's predicted consistency instead.
data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_cons")
subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
> 1
subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
> 0.81
```

we get a higher correlation (0.81 instead of 0.71) at the same budget.
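
For intuition, the selection behind `method="precomet_cons"` can be approximated by hand: score every source segment with the model and keep the segments with the highest predicted consistency. The sketch below rests on that assumption rather than the package's internals, reuses `data_full` from above, and assumes each item carries a `"src"` field:

```python
import precomet

model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-cons"))

# Predict a consistency score for each source segment.
scores = model.predict([{"src": item["src"]} for item in data_full])["scores"]

# Keep the 100 segments predicted to be most consistent with the full test set.
ranked = sorted(zip(scores, range(len(data_full))), reverse=True)
data_manual = [data_full[i] for _, i in ranked[:100]]
```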

This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
Cite as:

```bibtex
@misc{zouhar2025selectdatapointsefficienthuman,
    title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
    author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
    year={2025},
    eprint={2501.18251},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2501.18251},
}
```