ymoslem committed
Commit 65b9a95 · verified · 1 Parent(s): b8d6f45

Update README.md

Files changed (1):
  1. README.md +145 -14
README.md CHANGED
@@ -37,7 +37,23 @@ datasets:
 - ymoslem/wmt-da-human-evaluation
 model-index:
 - name: Quality Estimation for Machine Translation
-  results: []
+  results:
+  - task:
+      type: regression
+    dataset:
+      name: ymoslem/wmt-da-human-evaluation
+      type: QE
+    metrics:
+    - name: Pearson Correlation
+      type: Pearson
+      value: 0.4589
+    - name: Mean Absolute Error
+      type: MAE
+      value: 0.1861
+    - name: Root Mean Squared Error
+      type: RMSE
+      value: 0.2375
+
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -45,22 +61,12 @@ should probably proofread and complete it, then remove this comment. -->
 
 # Quality Estimation for Machine Translation
 
-This model is a fine-tuned version of [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) on the ymoslem/wmt-da-human-evaluation dataset.
+This model is a fine-tuned version of [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large)
+on the [ymoslem/wmt-da-human-evaluation](https://huggingface.co/datasets/ymoslem/wmt-da-human-evaluation) dataset.
+
 It achieves the following results on the evaluation set:
 - Loss: 0.0564
 
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
 ## Training procedure
 
 ### Training hyperparameters
@@ -96,3 +102,128 @@ The following hyperparameters were used during training:
 - Pytorch 2.4.1+cu124
 - Datasets 3.2.0
 - Tokenizers 0.21.0
+
+## Inference
+
+1. Install the required libraries.
+
+```bash
+pip3 install -q --upgrade datasets accelerate transformers
+pip3 install -q --upgrade scikit-learn polars
+pip3 install -q --upgrade flash_attn triton
+```
+
+2. Load the test dataset.
+
+```python
+from datasets import load_dataset
+
+test_dataset = load_dataset("ymoslem/wmt-da-human-evaluation",
+                            split="test",
+                            trust_remote_code=True
+                            )
+print(test_dataset)
+```
+
+3. Load the model and tokenizer:
+
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+
+# Load the fine-tuned model and tokenizer
+model_name = "ymoslem/ModernBERT-large-qe-v1"
+model = AutoModelForSequenceClassification.from_pretrained(
+    model_name,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+# Move the model to the GPU if available
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model.to(device)
+model.eval()
+```
+
+4. Prepare the dataset. In each input text, the source segment `src` and the target segment `tgt` are separated by the model's separator token (use `tokenizer.sep_token` rather than hard-coding it).
+
+```python
+sep_token = tokenizer.sep_token
+input_test_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(test_dataset["src"], test_dataset["mt"])]
+```
+
+5. Generate predictions.
+
+Although `model.config.problem_type` is `regression`, you can still use the "text-classification" pipeline, which returns the regression value in the `score` field (cf. [pipeline documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.TextClassificationPipeline)):
+
+```python
+from transformers import pipeline
+
+classifier = pipeline("text-classification",
+                      model=model_name,
+                      tokenizer=tokenizer,
+                      device=0,
+                      )
+
+predictions = classifier(input_test_texts,
+                         batch_size=128,
+                         truncation=True,
+                         padding="max_length",
+                         max_length=tokenizer.model_max_length,
+                         )
+predictions = [prediction["score"] for prediction in predictions]
+```
+
+Alternatively, you can use a more elaborate version of the code, which is slightly faster and provides more control.
+
+```python
+from torch.utils.data import DataLoader
+import torch
+from tqdm.auto import tqdm
+
+# Tokenization function
+def process_batch(batch, tokenizer, device):
+    sep_token = tokenizer.sep_token
+    input_texts = [f"{src} {sep_token} {tgt}" for src, tgt in zip(batch["src"], batch["mt"])]
+    tokens = tokenizer(input_texts,
+                       truncation=True,
+                       padding="max_length",
+                       max_length=tokenizer.model_max_length,
+                       return_tensors="pt",
+                       ).to(device)
+    return tokens
+
+
+# Create a DataLoader for batching
+test_dataloader = DataLoader(test_dataset,
+                             batch_size=128,  # Adjust batch size as needed
+                             shuffle=False)
+
+# List to store all predictions
+predictions = []
+
+with torch.no_grad():
+    for batch in tqdm(test_dataloader, desc="Inference Progress", unit="batch"):
+
+        tokens = process_batch(batch, tokenizer, device)
+
+        # Forward pass: compute the model's logits
+        outputs = model(**tokens)
+
+        # Get logits (predictions)
+        logits = outputs.logits
+
+        # Extract the regression predictions
+        batch_predictions = logits.squeeze()
+
+        # Extend the list with this batch's predictions
+        predictions.extend(batch_predictions.tolist())
+```
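
To sanity-check the predictions, you can compute the same metrics reported in the model-index metadata added by this commit. The following is a minimal sketch, not taken from the card itself: it assumes the gold quality scores are stored in a `score` column of the test split (verify with `test_dataset.column_names`), and it uses scikit-learn and SciPy, both available after the installs in step 1.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assumption: the test split stores the gold DA scores in a "score" column;
# adjust the column name if the dataset schema differs.
gold_scores = np.array(test_dataset["score"], dtype=np.float32)
pred_scores = np.array(predictions, dtype=np.float32)

# Pearson correlation between predictions and human scores
pearson, _ = pearsonr(pred_scores, gold_scores)

# Mean Absolute Error and Root Mean Squared Error
mae = mean_absolute_error(gold_scores, pred_scores)
rmse = np.sqrt(mean_squared_error(gold_scores, pred_scores))

print(f"Pearson: {pearson:.4f}  MAE: {mae:.4f}  RMSE: {rmse:.4f}")
```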
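If those assumptions hold, the resulting values should land close to the figures in the model-index metadata above: a Pearson correlation of 0.4589, an MAE of 0.1861, and an RMSE of 0.2375.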