WEBVTT 0:00:56.957 --> 0:01:10.166 Today we are going to talk about evaluation, like how you can tell how well your system translates. 0:01:11.251 --> 0:01:23.175 We're going to start with some introduction about the difficulties and also 0:01:23.175 --> 0:01:27.783 the dimensions of evaluation. 0:01:28.248 --> 0:01:32.315 And then we come to human evaluation and to automatic evaluation. 0:01:32.315 --> 0:01:33.960 Automatic evaluation 0:01:33.893 --> 0:01:40.952 would be less costly in human effort, but it is probably not really as perfect. 0:01:42.702 --> 0:02:01.262 So on machine translation evaluation, the goal is to measure the quality of a translation. 0:02:03.003 --> 0:02:06.949 Why do we need machine translation evaluation? 0:02:06.949 --> 0:02:14.152 The first thing is for application scenarios, to know whether the system is reliable. 0:02:14.674 --> 0:02:22.911 The second thing is to guide our research, because given such metrics we will be able to find out 0:02:22.911 --> 0:02:30.875 which improvement direction is valuable for our machine translation system, and the last 0:02:30.875 --> 0:02:34.224 thing is for our system development. 0:02:36.116 --> 0:02:42.926 So now we will come to some difficulties of evaluation. 0:02:42.926 --> 0:02:50.952 The first thing is ambiguity, because usually for one source sentence there is more than one correct translation. 0:02:51.431 --> 0:03:04.031 Here you can see that, for example, we have the correct reference, but other phrasings could also be correct. 0:03:05.325 --> 0:03:19.124 The second difficulty is that small changes can be very important. 0:03:20.060 --> 0:03:22.531 The next difficulty is that evaluation is subjective. 0:03:23.123 --> 0:03:39.266 So it depends on each person's opinion whether a translation is correct. 0:03:41.041 --> 0:03:49.393 The last one is that evaluation sometimes is application dependent. 0:03:49.393 --> 0:03:54.745 How good a translation has to be depends on what it is used for. 0:03:57.437 --> 0:04:04.502 The first dimension is human versus automatic evaluation, which I already talked about 0:04:04.502 --> 0:04:06.151 in the introduction. 0:04:06.151 --> 0:04:13.373 The second thing is granularity, so evaluation could be on sentence level, document level, 0:04:13.373 --> 0:04:14.472 or task-based. 0:04:15.375 --> 0:04:28.622 The last dimension is adequacy versus fluency, so whether the translation captures the meaning correctly or whether it reads fluently. 0:04:30.630 --> 0:04:33.769 So on the first dimension, human versus automatic. 0:04:34.334 --> 0:04:45.069 Human evaluation is the gold standard, because in the end we give our machine 0:04:45.069 --> 0:04:48.647 translation system to people. 0:04:49.329 --> 0:04:55.040 But it is also expensive and time consuming for people to manually evaluate systems. 0:04:57.057 --> 0:05:05.575 Automatic evaluation is of course cheaper and faster, and it would use a human reference. 0:05:08.168 --> 0:05:16.971 The next dimension is granularity. 0:05:16.971 --> 0:05:25.529 The first level is sentence-based. 0:05:25.885 --> 0:05:33.003 But this is difficult, because if you translate a single sentence, it will be difficult to 0:05:33.003 --> 0:05:35.454 tell whether this translation is good on its own. 0:05:37.537 --> 0:05:40.633 The second level is document-based. 0:05:40.633 --> 0:05:46.051 This should be the most commonly used in automatic evaluation. 0:05:46.286 --> 0:06:00.750 The last level is task-based; this should be like the final goal of our machine translation. 0:06:01.061 --> 0:06:02.315 But it is complex and slow in general. 0:06:02.315 --> 0:06:07.753 We are not sure whether the errors come from the machine translation system itself or from some 0:06:07.753 --> 0:06:08.828 other components.
0:06:11.431 --> 0:06:21.300 The next dimension is adequacy versus fluency, so adequacy means the meaning is translated correctly, and fluency means the output reads as fluent language. 0:06:22.642 --> 0:06:25.384 You can see the example here. 0:06:25.384 --> 0:06:32.237 In the hypothesis "Different is everything now", basically the meaning is just carried over word by word. 0:06:32.852 --> 0:06:36.520 But then you can see it's not fluent. 0:06:36.520 --> 0:06:38.933 It sounds kind of weird. 0:06:38.933 --> 0:06:41.442 "Nothing is different now" 0:06:41.442 --> 0:06:43.179 sounds fluent, but it no longer carries the original meaning. 0:06:46.006 --> 0:06:50.650 Next we come to error analysis. 0:06:50.650 --> 0:07:02.407 When we evaluate the system and give a score, we want to have interpretable results. 0:07:03.083 --> 0:07:07.930 So usually there would be some test suites first in order to detect these errors. 0:07:08.448 --> 0:07:21.077 And usually they would be quite specific to some specific type of error, for example 0:07:21.077 --> 0:07:23.743 wrong translation. 0:07:24.344 --> 0:07:32.127 Or morphological agreement, whether the word form is correct, 0:07:32.127 --> 0:07:35.031 or whether you have the right article. 0:07:37.577 --> 0:07:45.904 So now we come to human evaluation, which is the gold standard of machine translation evaluation. 0:07:47.287 --> 0:07:50.287 So why do we perform human evaluation? 0:07:51.011 --> 0:08:00.115 The first thing is that automatic machine translation metrics are not sufficient. 0:08:00.480 --> 0:08:06.725 Existing automatic metrics are sometimes biased. 0:08:06.725 --> 0:08:16.033 For example, the BLEU score will usually just look at the surface overlap with the reference. 0:08:16.496 --> 0:08:24.018 So it doesn't take into account deeper meaning; it cares about word-to-word matching 0:08:24.018 --> 0:08:26.829 instead of rephrasing or synonyms. 0:08:27.587 --> 0:08:34.881 And biased, as in that metrics like that would usually depend a lot on the gold standard reference 0:08:34.881 --> 0:08:41.948 given by some human, and that person could have some specific style or language preferences, 0:08:41.948 --> 0:08:43.979 and then the metric would inherit that bias. 0:08:47.147 --> 0:08:55.422 The next thing is that automatic metrics don't provide sufficient insights for error analysis. 0:08:57.317 --> 0:09:04.096 Different types of errors would have different implications depending on the underlying task. 0:09:04.644 --> 0:09:09.895 So, for example, if you use machine translation for information retrieval, 0:09:10.470 --> 0:09:20.202 then if it makes an error omitting some words in the translation it would be very 0:09:20.202 --> 0:09:20.775 bad. 0:09:21.321 --> 0:09:30.305 Another example is if you use machine translation in a chatbot, then fluency would be very important 0:09:30.305 --> 0:09:50.253 because the user reads the output directly. And we also need human judgments in order to develop and assess automatic translation 0:09:50.253 --> 0:09:52.324 evaluation. 0:09:55.455 --> 0:10:01.872 Okay, so now we will come to the quality measures of human evaluation. 0:10:02.402 --> 0:10:05.165 The first thing is inter-annotator agreement. 0:10:05.825 --> 0:10:25.985 This is agreement between different annotators. 0:10:26.126 --> 0:10:31.496 So as you can see here, this would measure the reliability of the annotations. 0:10:32.252 --> 0:10:49.440 And here we have an example of how the kappa score is computed. 0:10:49.849 --> 0:10:57.700 And this is in contrast to intra-annotator agreement, which is agreement within one annotator. 0:10:58.118 --> 0:11:03.950 So instead of measuring reliability, here it measures consistency of a single annotator. 0:11:04.884 --> 0:11:07.027 And yep.
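The kappa computation from the slides is not reproduced in the transcript, so here is a minimal sketch of how inter-annotator agreement is typically measured with Cohen's kappa; the two annotators' labels below are invented for illustration.

from collections import Counter

def cohens_kappa(ann_a, ann_b):
    # Observed agreement: fraction of items where both annotators chose the same label.
    n = len(ann_a)
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: probability that both pick the same label independently.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(ann_a) | set(ann_b))
    return (observed - expected) / (1 - expected)

# Two annotators judging five translations as "good" or "bad" (invented labels).
annotator_1 = ["good", "good", "bad", "good", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "bad"]
print(cohens_kappa(annotator_1, annotator_2))  # about 0.62; 1.0 = perfect agreement, 0 = chance level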
0:11:07.027 --> 0:11:22.260 We also have an example here of the intra-annotator kappa score, which is quite high. 0:11:23.263 --> 0:11:42.120 So now we will come to the main types of human assessment: the first one is direct assessment. 0:11:42.842 --> 0:11:53.826 The second one is human ranking of the translations at sentence level. 0:11:56.176 --> 0:12:11.087 So in direct assessment we are given the source and the translation, and possibly also a reference translation. 0:12:12.612 --> 0:12:18.023 The goal here is to give scores that evaluate performance, adequacy and fluency. 0:12:18.598 --> 0:12:23.619 The problem here is that we need normalization across different judges, different human annotators. 0:12:24.604 --> 0:12:27.043 And here we have an example. 0:12:27.043 --> 0:12:33.517 The reference is: she was treated at the site by an emergency doctor and taken to hospital by ambulance. 0:12:34.334 --> 0:12:48.444 The hypothesis here is: she was treated on site and emergency medical rescue workers 0:12:48.444 --> 0:12:52.090 brought to a hospital. 0:12:52.472 --> 0:12:56.267 Let's say five is best and one is worst. 0:13:00.060 --> 0:13:04.716 I don't think it's a five, because I think it should be "brought her to a hospital", right? 0:13:05.905 --> 0:13:09.553 Yes, that is like a crucial error. 0:13:09.553 --> 0:13:19.558 Yeah, I think I would agree, because this sentence somehow gives us the idea of what the meaning 0:13:19.558 --> 0:13:21.642 of the sentence is. 0:13:21.642 --> 0:13:24.768 But then it loses the word "her". 0:13:27.027 --> 0:13:29.298 The next type of human evaluation is ranking. 0:13:30.810 --> 0:13:38.893 Here we rank different systems according to performance, like which one is better. 0:13:40.981 --> 0:13:43.914 So here now we have a second hypothesis: 0:13:43.914 --> 0:13:49.280 she was hospitalized on the spot and taken to hospital by ambulance crews. 0:13:50.630 --> 0:14:01.608 As you can see here, the second hypothesis seems to be more fluent, more smooth. 0:14:01.608 --> 0:14:09.096 The meaning captured seems to be a bit different, though. So yeah, it's difficult to compare different errors, 0:14:09.096 --> 0:14:11.143 to tell which error is more severe. 0:14:13.373 --> 0:14:16.068 The next type of human evaluation is post-editing. 0:14:17.817 --> 0:14:29.483 So we want to measure how much time and effort a human needs to spend in order to turn the output into 0:14:29.483 --> 0:14:32.117 a correct translation. 0:14:32.993 --> 0:14:47.905 This effort can be measured by time or by keystrokes. 0:14:49.649 --> 0:14:52.889 And the last one is task-based evaluation. 0:14:52.889 --> 0:14:56.806 Here we want to evaluate the complete system. 0:14:56.806 --> 0:15:03.436 For example, if you are using the lecture translator and you see my lecture in German, the final 0:15:03.436 --> 0:15:05.772 evaluation here would be: 0:15:05.772 --> 0:15:08.183 in the end, can you understand the lecture? 0:15:09.769 --> 0:15:15.301 The advantage here is that we get the overall performance, which is our final goal. 0:15:16.816 --> 0:15:25.850 But the disadvantage is that it could be complex, and again, if the score is low it might 0:15:25.850 --> 0:15:31.432 be due to other problems than the machine translation itself. 0:15:33.613 --> 0:15:42.941 So I guess that was about the human evaluation part. Any question so far? 0:15:42.941 --> 0:15:44.255 Yes?
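The normalization across judges mentioned in the direct assessment part is commonly done by z-scoring each judge's raw scores, as in WMT-style direct assessment; here is a minimal sketch with invented numbers, so that a strict and a lenient judge become comparable.

from statistics import mean, stdev

# Raw direct-assessment scores for the same four translations from two judges
# with different personal scales (numbers invented for illustration).
raw_scores = {
    "judge_1": [90, 85, 70, 95],  # a lenient judge
    "judge_2": [60, 55, 40, 65],  # a strict judge rating the same outputs
}

normalized = {}
for judge, scores in raw_scores.items():
    mu, sigma = mean(scores), stdev(scores)
    normalized[judge] = [round((s - mu) / sigma, 2) for s in scores]

print(normalized)  # after z-scoring, both judges assign the same value to each item here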
0:16:00.000 --> 0:16:15.655 Then we will come to automatic metrics, which assess the quality of the machine translation 0:16:15.655 --> 0:16:26.179 system by comparing its output to a reference. So the premise here is that the more similar the translation is to the reference, 0:16:26.179 --> 0:16:31.437 the better, and we want some algorithm that can approximate this similarity. 0:16:34.114 --> 0:16:47.735 The most famous measure is probably the BLEU score, which stands for bilingual evaluation understudy. 0:16:50.930 --> 0:16:56.358 So if we are given the goal that the more similar the translation is to the reference, the 0:16:56.358 --> 0:17:01.785 better, I think the most naive way would be to count the number of sentences that are identical to the 0:17:01.785 --> 0:17:02.472 reference. 0:17:02.472 --> 0:17:08.211 But as you can see, this would be very difficult, because a sentence being exactly the same as 0:17:08.211 --> 0:17:10.332 the reference would be very rare. 0:17:11.831 --> 0:17:24.222 You can see the example here in the reference and the machine translation output. 0:17:24.764 --> 0:17:31.930 So the idea here is that instead of comparing the two whole sentences, we consider the n-gram overlap. 0:17:35.255 --> 0:17:43.333 Now we can look at an example, so for the BLEU score we consider one- up to four-grams. 0:17:44.844 --> 0:17:52.611 For the unigram overlap we count how many words of the output, like "back", "to", "the", "future", "premieres", "thirty", "years", "ago", appear in the reference, 0:17:52.611 --> 0:17:59.524 so it should be like one, two, three, four, five, six, seven, eight. 0:17:59.459 --> 0:18:01.476 Eight unigrams overlap with the reference. 0:18:01.921 --> 0:18:03.366 So the unigram precision is eight over the output length. 0:18:06.666 --> 0:18:08.994 For higher-order n-grams it is kind of the same. 0:18:08.994 --> 0:18:18.529 Instead of considering only the word "back", for the four-gram the unit would be "back to the future". 0:18:19.439 --> 0:18:31.360 So that is basically the idea of the BLEU score, and in the end we calculate the geometric mean of the n-gram precisions. 0:18:32.812 --> 0:18:39.745 So as you can see here, when we look at the n-gram overlap we only look at the machine 0:18:39.745 --> 0:18:40.715 translation output. 0:18:41.041 --> 0:18:55.181 We only care about how many words in the machine translation output appear in the reference. 0:18:55.455 --> 0:19:02.370 So this metric is kind of precision based and not really recall based. 0:19:04.224 --> 0:19:08.112 This leads to a problem like the example here. 0:19:08.112 --> 0:19:14.828 The reference is "back to the future premieres 30 years ago" and the machine translation output 0:19:14.828 --> 0:19:16.807 is only "back to the future". 0:19:17.557 --> 0:19:28.722 The unigram precision will be perfect, because you can see "back to the future" overlaps entirely 0:19:28.722 --> 0:19:30.367 with the reference. 0:19:31.231 --> 0:19:38.314 That is not right, because one is the perfect score, but this is obviously not a good translation. 0:19:40.120 --> 0:19:47.160 So in order to tackle this they use something called the brevity penalty. 0:19:47.988 --> 0:19:59.910 It is a factor that is multiplied with the geometric mean. 0:19:59.910 --> 0:20:04.820 The formula uses the length of the output and the length of the reference. 0:20:05.525 --> 0:20:19.901 So the penalty is e to the power of one minus the reference length over the output length, when the output is shorter than the reference, and one otherwise. 0:20:21.321 --> 0:20:32.298 That is lower than one, and if we apply this to the example, the BLEU score is going to be much lower, 0:20:32.298 --> 0:20:36.462 reflecting that this is not a good translation. 0:20:38.999 --> 0:20:42.152 Yep, so any question at this point? 0:20:44.064 --> 0:21:00.947 Yes, exactly, that would be a problem as well, and it will be mentioned later on. 0:21:00.947 --> 0:21:01.990 But there is one more issue.
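A minimal Python sketch of the BLEU computation described above (n-gram precisions, geometric mean, brevity penalty); clipping is left out here because it is introduced below, and for real scoring one would use a toolkit such as sacreBLEU rather than this toy function.

import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = set(ngrams(ref, n))
        if not hyp_ngrams:
            return 0.0
        overlap = sum(1 for g in hyp_ngrams if g in ref_ngrams)
        precisions.append(overlap / len(hyp_ngrams))
    if min(precisions) == 0.0:
        return 0.0  # a single zero precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: exp(1 - ref_len / hyp_len) if the hypothesis is too short, else 1.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * geo_mean

# The short hypothesis from the slide: perfect precisions, but the brevity
# penalty exp(1 - 7/4), roughly 0.47, pulls the score down.
print(simple_bleu("back to the future", "back to the future premieres thirty years ago"))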
0:21:03.203 --> 0:21:08.239 The BLEU score is very sensitive to zero counts like that, so that is why we usually don't use the BLEU 0:21:08.239 --> 0:21:13.103 score at sentence level, because a sentence can be short and then there can be no overlap. 0:21:13.103 --> 0:21:16.709 That is why we usually use it on documents, as you can imagine. 0:21:16.709 --> 0:21:20.657 Documents are very long and there is very little chance to have zero overlap. 0:21:23.363 --> 0:21:28.531 Yeah, okay, so the next thing on the BLEU score is clipping. 0:21:29.809 --> 0:21:42.925 So you can see here we have two references, "the new movie" and "the new film", and we have 0:21:42.925 --> 0:21:47.396 a machine translation output that just repeats "the". 0:21:47.807 --> 0:21:54.735 Because "the" is also in the reference, the unigram precision would be two out of two, which is one. 0:21:56.236 --> 0:22:02.085 But this is not what we want, because the output is just repeating something that appears in the reference. 0:22:02.702 --> 0:22:06.058 So that's why we use clipping. 0:22:06.058 --> 0:22:15.368 Clipping here means that we only count each word up to its maximum count in any reference, so as you can see 0:22:15.368 --> 0:22:17.425 here in the references "the" appears at most once. 0:22:18.098 --> 0:22:28.833 So here when we do clipping we will just use that maximum count from the references. 0:22:29.809 --> 0:22:38.717 Yeah, just to avoid rewarding repetitive words in the translation. 0:22:41.641 --> 0:23:00.599 It could also happen that there is no overlap between the machine translation output and the reference for some n-gram order. 0:23:00.500 --> 0:23:01.917 Then everything is going to go to zero, because of the geometric mean. 0:23:02.402 --> 0:23:07.876 So that's why for the BLEU score we usually use a corpus-level score, where we aggregate the 0:23:07.876 --> 0:23:08.631 statistics. 0:23:12.092 --> 0:23:18.589 Some summary about BLEU: as you can see, it matches exact words. 0:23:18.589 --> 0:23:31.751 It can take several references into account. It measures adequacy by the unigram precision, and it measures 0:23:31.751 --> 0:23:36.656 fluency by the higher n-gram precisions. 0:23:37.437 --> 0:23:47.254 And as mentioned, it doesn't directly consider how much of the reference meaning is captured in the machine 0:23:47.254 --> 0:23:48.721 translation. 0:23:49.589 --> 0:23:53.538 So here they use the brevity penalty to penalize short translations. 0:23:54.654 --> 0:24:04.395 And we compute the score over the whole test set to avoid the zero issues, 0:24:04.395 --> 0:24:07.012 as we mentioned. 0:24:09.829 --> 0:24:22.387 Now to the weaknesses: as mentioned, BLEU works with multiple reference translations simultaneously, but it is a precision- 0:24:22.387 --> 0:24:24.238 based metric. 0:24:24.238 --> 0:24:27.939 So we are not sure if this is enough. 0:24:29.689 --> 0:24:37.423 The second thing is that the BLEU score compensates for recall only by the brevity penalty, and we 0:24:37.423 --> 0:24:38.667 are not sure that is sufficient. 0:24:39.659 --> 0:24:50.902 The next weakness is the exact matching, so one could still improve the similarity measure and improve the correlation 0:24:50.902 --> 0:24:51.776 to human judgment. 0:24:52.832 --> 0:25:01.673 The next is that all words have the same importance. 0:25:01.673 --> 0:25:07.101 What about a scheme for weighting words? 0:25:11.571 --> 0:25:26.862 And the last weakness is that BLEU relies on higher-order n-grams to account for fluency. 0:25:27.547 --> 0:25:32.101 So the question is whether that can really account for fluency and grammaticality; there are some other aspects. 0:25:35.956 --> 0:25:47.257 We have some further issues: words are not all created equal, and we can use stemming or knowledge 0:25:47.257 --> 0:25:48.156 bases. 0:25:50.730 --> 0:26:00.576 The next question is how we incorporate such information within the metrics.
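Going back to the clipping example from a moment ago, here is a small sketch of a clipped unigram precision; the two references are the ones from the slide, while the repetitive hypothesis is an invented stand-in.

from collections import Counter

def clipped_unigram_precision(hypothesis, references):
    hyp_counts = Counter(hypothesis.split())
    # Per word, the maximum count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # Each hypothesis word is only credited up to that maximum count.
    clipped = sum(min(count, max_ref_counts[word]) for word, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

refs = ["the new movie", "the new film"]
print(clipped_unigram_precision("the the", refs))       # 0.5 instead of 1.0
print(clipped_unigram_precision("the new film", refs))  # 1.0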
0:26:01.101 --> 0:26:07.101 A stop list can be used to somehow ignore the non-important words. 0:26:08.688 --> 0:26:12.687 There is also text normalization: spelling, conjugation, lowercase and mixed case. 0:26:12.687 --> 0:26:18.592 The next thing is that for some languages like Chinese there can be different word segmentations, 0:26:18.592 --> 0:26:23.944 so exact word matching might no longer be a good idea, so maybe it's better to compute the 0:26:23.944 --> 0:26:27.388 score at the character level instead of the word level. 0:26:29.209 --> 0:26:33.794 And the last thing is speech translation. 0:26:33.794 --> 0:26:38.707 Usually the input to speech translation is a continuous audio stream without sentence boundaries. 0:26:38.979 --> 0:26:51.399 And there should be some way to segment the output into sentences so that we can calculate the score 0:26:51.399 --> 0:26:52.090 against the reference. 0:26:52.953 --> 0:27:01.326 And the way to solve this is to use some tools like mWER segmentation to align the output 0:27:01.326 --> 0:27:01.896 with the reference. 0:27:06.306 --> 0:27:10.274 Yes, so I guess that was all about the BLEU score. Any question? 0:27:14.274 --> 0:27:28.292 Continuing on automatic metrics, we'll talk about properties of good metrics, trained automatic metrics, 0:27:28.292 --> 0:27:32.021 and use cases of evaluation. 0:27:34.374 --> 0:27:44.763 How do we measure the performance of a metric? A good metric would correlate well with human judgment. 0:27:49.949 --> 0:28:04.905 We would want the metric to be interpretable: if this is the ranking from a human, the metric should somehow 0:28:04.905 --> 0:28:08.247 rank the systems the same way. 0:28:12.132 --> 0:28:15.819 We would also want the evaluation metric to be sensitive, 0:28:15.819 --> 0:28:21.732 so that small differences between machine translation systems can be distinguished, and we would need it to 0:28:21.732 --> 0:28:22.686 be consistent: 0:28:22.686 --> 0:28:28.472 if the same machine translation system is used on a similar text, it should reproduce 0:28:28.472 --> 0:28:29.553 a similar score. 0:28:31.972 --> 0:28:40.050 Next, we would want the metric to be reliable: 0:28:40.050 --> 0:28:42.583 machine translation systems that score similarly should perform similarly. 0:28:43.223 --> 0:28:52.143 We want the metric to be easy to run in general, so it can be applied to many different machine translation systems. 0:28:55.035 --> 0:29:11.148 The difficulty of evaluating the metric itself is kind of similar to when you evaluate the 0:29:11.148 --> 0:29:13.450 translation. 0:29:18.638 --> 0:29:23.813 And here are some components of automatic machine translation metrics. 0:29:23.813 --> 0:29:28.420 For the matching metrics the components would be precision, 0:29:28.420 --> 0:29:30.689 recall, or Levenshtein distance. 0:29:30.689 --> 0:29:35.225 For the BLEU score you have seen it cares mostly about precision. 0:29:36.396 --> 0:29:45.613 And on the feature side, it is about on what level we measure the matches, for example word based or character based. 0:29:48.588 --> 0:30:01.304 Now we will talk about more metrics, because the BLEU score is just the most common one. 0:30:02.082 --> 0:30:10.863 One family of metrics compares the reference and hypothesis using edit operations. 0:30:10.863 --> 0:30:14.925 They count how many insertions, deletions and substitutions are needed. 0:30:23.143 --> 0:30:31.968 We already talked about how, beyond exact word matching, the matching could be character based, use lemmatization, 0:30:31.968 --> 0:30:34.425 or use linguistic information. 0:30:36.636 --> 0:30:41.502 The next metric is the METEOR metric. 0:30:41.502 --> 0:30:50.978 The name stands for Metric for Evaluation of Translation with Explicit ORdering.
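Before going into METEOR's details, here is a small sketch of the edit-operation comparison mentioned a moment ago, i.e. the idea behind WER/TER-style metrics (TER additionally allows block shifts, which this toy version omits); the example sentences reuse the hypothesis and reference from earlier in the lecture.

def edit_distance(hyp_tokens, ref_tokens):
    m, n = len(hyp_tokens), len(ref_tokens)
    # dp[i][j] = edits to turn the first i hypothesis tokens into the first j reference tokens.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

hyp = "she was treated on site".split()
ref = "she was treated at the site".split()
edits = edit_distance(hyp, ref)
print(edits, edits / len(ref))  # 2 edits, error rate of about 0.33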
0:30:51.331 --> 0:31:03.236 So METEOR's new idea is that they reintroduce recall and combine it with precision as the main score 0:31:03.236 --> 0:31:04.772 components. 0:31:05.986 --> 0:31:16.700 It compares the machine translation output with each reference individually and takes the score of the 0:31:16.700 --> 0:31:18.301 best pair. 0:31:20.940 --> 0:31:27.330 The next thing is that the matching takes into account inflection variation by stemming, so it's 0:31:27.330 --> 0:31:28.119 no longer only exact matching. 0:31:30.230 --> 0:31:40.165 To address fluency, they use a direct reordering penalty instead of n-grams, so they care 0:31:40.165 --> 0:31:40.929 about how fragmented the matches are. 0:31:45.925 --> 0:31:56.287 The next thing is trainable metrics, so for these metrics we want to extract some features. 0:31:56.936 --> 0:32:04.450 So for example here we have "the nice house is on the right" and "the building is on the right side", 0:32:04.450 --> 0:32:12.216 so we will have to extract some features, like for example how many words the reference and hypothesis 0:32:12.216 --> 0:32:14.158 have in common. 0:32:14.714 --> 0:32:19.163 They have one insertion, two deletions, and they have the same verb. 0:32:21.141 --> 0:32:31.530 So the idea is to use machine learning techniques to combine these features, and this machine 0:32:31.530 --> 0:32:37.532 learning model will be trained on human rankings. 0:32:39.819 --> 0:32:44.788 A common framework for this is COMET, 0:32:44.684 --> 0:32:48.094 which is a neural model. 0:32:48.094 --> 0:32:54.149 The features would be created using some pretrained model like XLM-RoBERTa. 0:32:54.149 --> 0:33:00.622 Here the input would be the source, the reference and the hypothesis, and then the model would try 0:33:00.622 --> 0:33:02.431 to produce an assessment. 0:33:03.583 --> 0:33:05.428 Yeah, it's trained to predict human scores. 0:33:06.346 --> 0:33:19.131 And they also have some additional versions, where they train this model in order to tell whether a 0:33:19.131 --> 0:33:20.918 translation is good without a reference. 0:33:21.221 --> 0:33:29.724 So instead of taking the source, the reference and the hypothesis as input, they take only the 0:33:29.724 --> 0:33:38.034 source and the hypothesis as input and try to predict the quality of the translation. 0:33:42.562 --> 0:33:49.836 As mentioned before, machine translation systems are often used within larger systems. 0:33:50.430 --> 0:33:57.713 So the question is how to evaluate the performance of the machine translation system in this larger 0:33:57.713 --> 0:34:04.997 scenario, and an example would be a speech translation system where you try to translate English audio 0:34:04.997 --> 0:34:05.798 to German. 0:34:06.506 --> 0:34:13.605 This would usually have two components, ASR and MT, where ASR is the speech recognition 0:34:13.605 --> 0:34:20.626 that transcribes English audio to English text, and then we have the machine translation 0:34:20.626 --> 0:34:24.682 system that translates English text to German text. 0:34:26.967 --> 0:34:33.339 So in order to measure the overall performance in this bigger scenario, there are several ways 0:34:33.339 --> 0:34:34.447 to evaluate it. 0:34:34.447 --> 0:34:41.236 The first one is to evaluate the individual components, like how good is the speech recognizer, 0:34:41.236 --> 0:34:46.916 how good are the analysis and generation engines, how good is the synthesizer. 0:34:47.727 --> 0:34:56.905 The second way is to evaluate translation quality from speech input to text output: 0:34:56.905 --> 0:35:00.729 how good is the final translation?
0:35:02.102 --> 0:35:10.042 The next thing is to evaluate the effectiveness of the architecture, like how 0:35:10.042 --> 0:35:12.325 well the components work together in general. 0:35:12.325 --> 0:35:19.252 The next one is task-based evaluation or a user study, where we simply ask the users what 0:35:19.252 --> 0:35:24.960 their experience is, whether the system works well and how well it works. 0:35:27.267 --> 0:35:32.646 So here we have an example of the IWSLT evaluation results. 0:35:33.153 --> 0:35:38.911 The first block is the human evaluation; I think they are asked to give a score 0:35:38.911 --> 0:35:44.917 from one to five again, where five is best and one is worst, and the lower block is the BLEU score, 0:35:44.917 --> 0:35:50.490 and they find out that the human evaluation actually correlates with the BLEU score 0:35:50.490 --> 0:35:51.233 quite well. 0:35:53.193 --> 0:36:02.743 Here you can also see that the systems from our university are actually on top in many subtasks. 0:36:05.605 --> 0:36:07.429 So yeah. 0:36:08.868 --> 0:36:14.401 The summary for this lecture is that machine translation evaluation is difficult. 0:36:14.401 --> 0:36:21.671 We talked about human versus automatic evaluation: human evaluation is costly but is the 0:36:21.671 --> 0:36:27.046 gold standard, while automatic evaluation is a faster and cheaper way. 0:36:27.547 --> 0:36:36.441 We talked about granularity: sentence-level, document-level or task-level evaluation of a machine 0:36:36.441 --> 0:36:38.395 translation system. 0:36:39.679 --> 0:36:51.977 And we talked about human evaluation and automatic metrics in detail. 0:36:54.034 --> 0:36:59.840 So we introduced a lot of metrics. 0:36:59.840 --> 0:37:10.348 How do they compare in terms of correlation with human assessment, so which one is better? 0:37:12.052 --> 0:37:16.294 I don't have the exact scores and references in my head. 0:37:16.294 --> 0:37:22.928 I would assume that METEOR should have a better correlation, because it also 0:37:22.928 --> 0:37:30.025 considers other aspects like recall, whether the information in the reference is captured 0:37:30.025 --> 0:37:31.568 in the translation. 0:37:32.872 --> 0:37:41.875 And things like synonyms, so I would assume that METEOR is better, but again I don't have the reference 0:37:41.875 --> 0:37:43.441 in my head. 0:37:43.903 --> 0:37:49.771 But I guess the reason people are still using the BLEU score is that in most literature on machine 0:37:49.771 --> 0:38:00.823 translation systems, they report BLEU. So if you now create a new machine translation system, 0:38:00.823 --> 0:38:07.990 it might be better to also report the BLEU score. 0:38:08.228 --> 0:38:11.472 Exactly, just so the results stay comparable, and then we can go ahead. 0:38:12.332 --> 0:38:14.745 Otherwise we don't know what we're comparing against. 0:38:17.457 --> 0:38:18.907 I want to talk quickly about COMET again. 0:38:19.059 --> 0:38:32.902 So it is based on a pretrained language model, so it's kind of the same idea. 0:38:33.053 --> 0:38:39.343 The idea is that we have this layer in order to embed the source, the reference 0:38:39.343 --> 0:38:39.713 and the hypothesis 0:38:40.000 --> 0:38:54.199 into some feature vectors that we can later on use to predict the human score. 0:38:58.618 --> 0:39:00.051 Okay, if there's nothing else, that's it for today.
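As a closing illustration of the COMET scoring recapped above, here is a hedged sketch using the unbabel-comet Python package; the model name, the predict call and the German source sentence are assumptions based on the package's public documentation (comet 2.x), not something from the lecture, so check the current docs before relying on it.

# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load a reference-based COMET checkpoint (name assumed from the docs).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Sie wurde vor Ort von einem Notarzt behandelt.",          # invented German source
    "mt":  "She was treated on site by an emergency doctor.",          # hypothesis
    "ref": "She was treated at the site by an emergency doctor.",      # reference
}]
# The reference-free quality-estimation variant mentioned in the lecture would
# use a QE checkpoint and omit the "ref" field.
prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction.scores, prediction.system_score)  # per-segment scores and the system-level average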