zouharvi committed on
Commit fdc1f59 (verified)
1 Parent(s): 7abb877

Update README.md

Files changed (1)
  1. README.md +53 -4
README.md CHANGED
@@ -100,11 +100,60 @@ base_model:
  - FacebookAI/xlm-roberta-large
  ---
 
- # PreCOMET-cons
+ # PreCOMET-cons [![Paper](https://img.shields.io/badge/📜%20paper-481.svg)](https://arxiv.org/abs/2501.18251)
 
  This is a source-only COMET model used for efficient evaluation subset selection.
- It is not compatible with the upstream [github.com/Unbabel/COMET/](https://github.com/Unbabel/COMET/) and to run it you have to install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET)
 
- The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package.
 
- Further description TODO.
+ Specifically, this model predicts the `consistency` of the system ordering induced by a single segment with the system ordering on the whole test set.
+ The higher the score, the more useful the segment is for evaluation, because fewer samples are then needed to arrive at the same system ordering.
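+ To build intuition for the prediction target: `consistency` here is about how often the ordering of MT systems on a single segment agrees with their ordering on the whole test set. A rough, hypothetical sketch of one such pairwise-agreement measure (the exact training target is defined in the paper linked below):
+ ```python
+ from itertools import combinations
+ 
+ def pairwise_ordering_agreement(segment_scores, testset_scores):
+     # Fraction of system pairs that a single segment orders the same way as the
+     # whole test set. Illustrative only, not necessarily the paper's exact definition.
+     systems = list(testset_scores)
+     pairs = list(combinations(systems, 2))
+     agree = sum(
+         (segment_scores[a] > segment_scores[b]) == (testset_scores[a] > testset_scores[b])
+         for a, b in pairs
+     )
+     return agree / len(pairs)
+ 
+ # hypothetical per-system scores on one segment vs. on the full test set
+ print(pairwise_ordering_agreement(
+     {"sysA": 0.8, "sysB": 0.6, "sysC": 0.7},
+     {"sysA": 0.75, "sysB": 0.70, "sysC": 0.65},
+ ))  # 0.67: two of the three system pairs are ordered consistently
+ ```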
+ It is not compatible with the original Unbabel COMET; to run it, you have to install [github.com/zouharvi/PreCOMET](https://github.com/zouharvi/PreCOMET):
+ ```bash
+ pip install git+https://github.com/zouharvi/PreCOMET.git
+ ```
 
+ You can then use it in Python:
+ ```python
+ import precomet
+ model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-cons"))
+ model.predict([
+     {"src": "This is an easy source sentence."},
+     {"src": "this is a much more complicated source sen-tence that will pro·bably lead to loww scores 🤪"}
+ ])["scores"]
+ > [0.1797918677330017, 0.32624873518943787]
+ ```
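+ If you just need a quick ad-hoc selection, you can also sort source segments by these scores yourself; a minimal sketch (the `select_top_sources` helper is only an illustration, not part of the package):
+ ```python
+ import precomet
+ 
+ model = precomet.load_from_checkpoint(precomet.download_model("zouharvi/PreCOMET-cons"))
+ 
+ def select_top_sources(sources, k):
+     # score each source segment and keep the k with the highest predicted consistency
+     scores = model.predict([{"src": s} for s in sources])["scores"]
+     ranked = sorted(zip(sources, scores), key=lambda x: x[1], reverse=True)
+     return [src for src, _ in ranked[:k]]
+ ```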
 
+ The primary use of this model is from the [subset2evaluate](https://github.com/zouharvi/subset2evaluate) package:
+ 
+ ```python
+ import subset2evaluate
+ 
+ data_full = subset2evaluate.utils.load_data("wmt23/en-cs")
+ data_random = subset2evaluate.select_subset.basic(data_full, method="random")
+ subset2evaluate.evaluate.eval_subset_clusters(data_random[:100])
+ > 1
+ subset2evaluate.evaluate.eval_subset_correlation(data_random[:100], data_full)
+ > 0.71
+ ```
+ Random selection gives us only one cluster and a system-level Spearman correlation of 0.71 when we have a budget of only 100 segments. However, by using this model:
+ ```python
+ data_precomet = subset2evaluate.select_subset.basic(data_full, method="precomet_cons")
+ subset2evaluate.evaluate.eval_subset_clusters(data_precomet[:100])
+ > 1
+ subset2evaluate.evaluate.eval_subset_correlation(data_precomet[:100], data_full)
+ > 0.81
+ ```
+ we get a higher correlation.
+ 
+ This work is described in [How to Select Datapoints for Efficient Human Evaluation of NLG Models?](https://arxiv.org/abs/2501.18251).
+ Cite as:
+ ```
+ @misc{zouhar2025selectdatapointsefficienthuman,
+     title={How to Select Datapoints for Efficient Human Evaluation of NLG Models?},
+     author={Vilém Zouhar and Peng Cui and Mrinmaya Sachan},
+     year={2025},
+     eprint={2501.18251},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL},
+     url={https://arxiv.org/abs/2501.18251},
+ }
+ ```