WEBVTT
0:00:56.957 --> 0:01:10.166
Today we are going to talk about evaluation,
like how you can tell how good your translation is.
0:01:11.251 --> 0:01:23.175
We're going to start with some introduction
about the difficulties and also
0:01:23.175 --> 0:01:27.783
the dimensions of evaluation.
0:01:28.248 --> 0:01:32.315
The first part is on human evaluation,
0:01:32.315 --> 0:01:33.960
and the second one is on automatic evaluation,
0:01:33.893 --> 0:01:40.952
which requires less costly human effort, but
is probably not as accurate.
0:01:42.702 --> 0:02:01.262
So in machine translation evaluation, the
goal is to measure the quality of a translation.
0:02:03.003 --> 0:02:06.949
Why do we need machine translation evaluation?
0:02:06.949 --> 0:02:14.152
The first reason is for application scenarios,
to tell whether the system is reliable.
0:02:14.674 --> 0:02:22.911
The second reason is to guide our research, because
given such metrics we will be able to find out
0:02:22.911 --> 0:02:30.875
which improvement direction is valuable for
our machine translation system, and the last
0:02:30.875 --> 0:02:34.224
reason is for our system development.
0:02:36.116 --> 0:02:42.926
So now we will come to some difficulties in
evaluation.
0:02:42.926 --> 0:02:50.952
The first one is ambiguity, because usually
for one source sentence there can be several correct translations.
0:02:51.431 --> 0:03:04.031
Here you can see that, for example, we have
the correct reference.
0:03:05.325 --> 0:03:19.124
The second difficulty is that small changes
can be very important.
0:03:20.060 --> 0:03:22.531
The next difficulty is that evaluation is subjective.
0:03:23.123 --> 0:03:39.266
So it depends on each person's opinion whether
a translation is correct.
0:03:41.041 --> 0:03:49.393
The last one is that evaluation sometimes is application
dependent.
0:03:49.393 --> 0:03:54.745
We're not sure how good is good enough.
0:03:57.437 --> 0:04:04.502
The first dimension is human versus automatic
evaluation, which I briefly talked about
0:04:04.502 --> 0:04:06.151
in the introduction.
0:04:06.151 --> 0:04:13.373
The second one is granularity, so evaluation
could be on sentence level, document level,
0:04:13.373 --> 0:04:14.472
or task-based.
0:04:15.375 --> 0:04:28.622
The last dimension is whether the translation
is correct in terms of capturing the meaning,
or whether it is fluent.
0:04:30.630 --> 0:04:33.769
So on the first dimension: human versus
automatic.
0:04:34.334 --> 0:04:45.069
So human evaluation is the gold
standard, because in the end we give our machine
0:04:45.069 --> 0:04:48.647
translation system to people.
0:04:49.329 --> 0:04:55.040
But it is also expensive and time consuming for
people to manually evaluate systems.
0:04:57.057 --> 0:05:05.575
Automatic evaluation is of course
cheaper and faster, and it would use a human reference.
0:05:08.168 --> 0:05:16.971
The next dimension is granularity.
0:05:16.971 --> 0:05:25.529
The first level is sentence-based.
0:05:25.885 --> 0:05:33.003
But this is difficult, because if you translate
a single sentence, it will be difficult to
0:05:33.003 --> 0:05:35.454
tell whether this translation is good.
0:05:37.537 --> 0:05:40.633
The second level is document-based.
0:05:40.633 --> 0:05:46.051
This should be the most commonly used in automatic
evaluation.
0:05:46.286 --> 0:06:00.750
The last level, task-based evaluation, should
be like the final goal of our machine translation.
0:06:01.061 --> 0:06:02.315
But it is slow in general,
0:06:02.315 --> 0:06:07.753
and we are not sure whether the errors come from
the machine translation system itself or some
0:06:07.753 --> 0:06:08.828
other components.
0:06:11.431 --> 0:06:21.300
The next dimension is adequacy versus
fluency; adequacy is whether the meaning is translated correctly.
0:06:22.642 --> 0:06:25.384
You can see the example here.
0:06:25.384 --> 0:06:32.237
The hypothesis "Different is everything now"
basically keeps the words of the reference.
0:06:32.852 --> 0:06:36.520
But then you can see it's not fluent.
0:06:36.520 --> 0:06:38.933
It sounds kind of weird.
0:06:38.933 --> 0:06:41.442
"Nothing is different now"
0:06:41.442 --> 0:06:43.179
sounds fluent.
0:06:46.006 --> 0:06:50.650
Next we come to error analysis.
0:06:50.650 --> 0:07:02.407
When we evaluate a system and give it a score,
we want to have interpretable results.
0:07:03.083 --> 0:07:07.930
So usually there would be some test suites first
in order to detect these errors.
0:07:08.448 --> 0:07:21.077
And usually they would be quite specific
to some specific type of error, for example
0:07:21.077 --> 0:07:23.743
wrong translation.
0:07:24.344 --> 0:07:32.127
Or morphological agreement: whether the
word form is correct,
0:07:32.127 --> 0:07:35.031
or whether you have the right article.
0:07:37.577 --> 0:07:45.904
So now we come to human evaluation, which
is the final goal of machine translation.
0:07:47.287 --> 0:07:50.287
So why do we perform human evaluation?
0:07:51.011 --> 0:08:00.115
The first reason is that automatic machine
translation metrics are not sufficient.
0:08:00.480 --> 0:08:06.725
Existing automatic metrics are sometimes
biased.
0:08:06.725 --> 0:08:16.033
For example, the BLEU score will usually
only look at the surface word overlap.
0:08:16.496 --> 0:08:24.018
So it doesn't take into account deeper
meaning; it cares about word-to-word matching
0:08:24.018 --> 0:08:26.829
instead of rephrasing or synonyms.
0:08:27.587 --> 0:08:34.881
And biased as in: metrics like that would
usually depend a lot on the gold standard reference
0:08:34.881 --> 0:08:41.948
given by some human, and that person could
have some specific style or language preferences,
0:08:41.948 --> 0:08:43.979
and then the metric would favor those.
0:08:47.147 --> 0:08:55.422
The next reason is that automatic metrics don't
provide sufficient insights for error analysis.
0:08:57.317 --> 0:09:04.096
Different types of errors have different
implications depending on the underlying task.
0:09:04.644 --> 0:09:09.895
So, for example, if you use machine translation
for information retrieval,
0:09:10.470 --> 0:09:20.202
then if it makes an error, omitting some
words in the translation, that would be very
0:09:20.202 --> 0:09:20.775
bad.
0:09:21.321 --> 0:09:30.305
Another example: if you use machine translation
in a chatbot, then fluency would be very important.
0:09:30.305 --> 0:09:50.253
And we also need human evaluation in
order to develop and assess automatic translation
0:09:50.253 --> 0:09:52.324
evaluation metrics.
0:09:55.455 --> 0:10:01.872
Okay, so now we will come to the quality measures
of human evaluation.
0:10:02.402 --> 0:10:05.165
The first one is inter-annotator agreement.
0:10:05.825 --> 0:10:25.985
This is agreement between different annotators.
0:10:26.126 --> 0:10:31.496
So as you can see here, this would measure
the reliability of the annotations.
0:10:32.252 --> 0:10:49.440
And here we have an example of how the kappa
score is calculated.
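(The concrete numbers from the slide are not recoverable from the recording, but inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch with made-up labels:)

```python
# Sketch: Cohen's kappa for two annotators (labels are made up here).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(count_a[lab] * count_b[lab] for lab in count_a) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Two annotators rating five translations as good/bad:
print(cohens_kappa(["good", "bad", "good", "good", "bad"],
                   ["good", "bad", "bad", "good", "bad"]))  # ~0.62
```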
0:10:49.849 --> 0:10:57.700
And this is in contrast to intra-annotator
agreement, which is agreement within one annotator.
0:10:58.118 --> 0:11:03.950
So instead of measuring reliability, here
it measures the consistency of a single annotator.
0:11:04.884 --> 0:11:07.027
And yep.
0:11:07.027 --> 0:11:22.260
We also have an example of this here.
0:11:23.263 --> 0:11:42.120
So now we will come to the main types of human
assessment: the first one is direct assessment.
0:11:42.842 --> 0:11:53.826
The second one is human ranking of the translations
at sentence level.
0:11:56.176 --> 0:12:11.087
So in direct assessment we are given the source and
the translation, and possibly the reference translation.
0:12:12.612 --> 0:12:18.023
The goal here is to give scores to evaluate
performance, adequacy and fluency.
0:12:18.598 --> 0:12:23.619
The problem here is that we need normalization
across different judges, different humans.
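(A common way to do this normalization, e.g. in WMT-style direct assessment, is to standardize each judge's raw scores into z-scores, so that strict and lenient judges become comparable. A minimal sketch with invented scores:)

```python
# Sketch: per-judge z-score normalization of direct-assessment scores
# (the raw scores below are invented for illustration).
from statistics import mean, stdev

def z_normalize(scores):
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]

lenient_judge = [90, 85, 95, 80]  # raw 0-100 scores for four outputs
strict_judge = [60, 50, 70, 40]   # same outputs, much harsher judge
print(z_normalize(lenient_judge))
print(z_normalize(strict_judge))  # comparable scale after normalization
```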
0:12:24.604 --> 0:12:27.043
And here we have an example.
0:12:27.043 --> 0:12:33.517
The reference: "She was treated at the site by an emergency
doctor and taken to hospital by."
0:12:34.334 --> 0:12:48.444
The hypothesis here is "She was treated
on site and emergency medical rescue workers
0:12:48.444 --> 0:12:52.090
brought to a hospital."
0:12:52.472 --> 0:12:56.267
Let's say five is best and one is worst.
0:13:00.060 --> 0:13:04.716
I don't think it's five, because I think it
should be "brought her to a hospital", right?
0:13:05.905 --> 0:13:09.553
Yes, that is like a crucial error.
0:13:09.553 --> 0:13:19.558
Yeah, I think I would agree, because this sentence
somehow gives us the idea of what the meaning
0:13:19.558 --> 0:13:21.642
of the sentence is.
0:13:21.642 --> 0:13:24.768
But then it lost the word "her".
0:13:27.027 --> 0:13:29.298
The next type of human evaluation is ranking,
0:13:30.810 --> 0:13:38.893
which is to grade different systems according
to performance, like which one is better.
0:13:40.981 --> 0:13:43.914
So here now we have a second hypothesis:
0:13:43.914 --> 0:13:49.280
"She was hospitalized on the spot and taken
to hospital by ambulance crews."
0:13:50.630 --> 0:14:01.608
As you can see here, the second hypothesis
seems to be more fluent, more smooth.
0:14:01.608 --> 0:14:09.096
The meaning captured seems to be less accurate, though.
So yeah, it's difficult to compare different errors
0:14:09.096 --> 0:14:11.143
and tell which error is more severe.
0:14:13.373 --> 0:14:16.068
The next type of human evaluation is post-editing.
0:14:17.817 --> 0:14:29.483
So we want to measure how much time and effort
a human needs to spend in order to turn the output into
0:14:29.483 --> 0:14:32.117
a correct translation.
0:14:32.993 --> 0:14:47.905
This effort can be measured by time or
keystrokes.
0:14:49.649 --> 0:14:52.889
And the last one is task-based evaluation.
0:14:52.889 --> 0:14:56.806
Here we want to evaluate the complete
system.
0:14:56.806 --> 0:15:03.436
For example, if you are using the lecture translator
and you see my lecture in German, the final
0:15:03.436 --> 0:15:05.772
evaluation here would be:
0:15:05.772 --> 0:15:08.183
in the end, can you understand the lecture?
0:15:09.769 --> 0:15:15.301
The advantage here is that we get the overall
performance, which is our final goal.
0:15:16.816 --> 0:15:25.850
But the disadvantage is that it could be
complex, and again, if the score is low, it might
0:15:25.850 --> 0:15:31.432
be due to other problems than the machine translation
itself.
0:15:33.613 --> 0:15:42.941
So I guess that was the human evaluation
part. Any questions so far?
0:15:42.941 --> 0:15:44.255
Yes?
0:16:00.000 --> 0:16:15.655
Then we will come to automatic metrics, which
assess the quality of a machine translation
0:16:15.655 --> 0:16:26.179
system by comparing its output to a reference. So the premise
here is that the more similar the translation is to the reference,
0:16:26.179 --> 0:16:31.437
the better, and we want some algorithm that
can approximate this similarity.
0:16:34.114 --> 0:16:47.735
The most famous measure would be the BLEU
score, the BiLingual Evaluation Understudy.
0:16:50.930 --> 0:16:56.358
So if we are given the goal that the more
similar the translation is to the reference, the
0:16:56.358 --> 0:17:01.785
better, I think the most naive way would be to
count the number of sentences equal to the
0:17:01.785 --> 0:17:02.472
reference.
0:17:02.472 --> 0:17:08.211
But as you can see, this would be very difficult,
because a sentence being exactly the same as
0:17:08.211 --> 0:17:10.332
the reference would be very rare.
0:17:11.831 --> 0:17:24.222
You can see the example here with the reference
and the machine translation output.
0:17:24.764 --> 0:17:31.930
So the idea here is that instead of comparing
the two whole sentences, we consider the n-gram overlap.
0:17:35.255 --> 0:17:43.333
Now we can look at an example: for the
BLEU score we consider one-grams up to four-grams.
0:17:44.844 --> 0:17:52.611
For the one-gram overlap we would have "back to
the future", "premiered", "thirty years ago",
0:17:52.611 --> 0:17:59.524
so it should be like one, two, three, four,
five, six, seven, eight words of
0:17:59.459 --> 0:18:01.476
one-gram overlap with the reference.
0:18:01.921 --> 0:18:03.366
So the overlap should be eight.
0:18:06.666 --> 0:18:08.994
For larger n-grams it is kind of the same:
0:18:08.994 --> 0:18:18.529
instead of considering only the word "back", for
the trigram we consider "back to the", and for
the four-gram "back to the future".
0:18:19.439 --> 0:18:31.360
So that is basically the idea of the BLEU
score, and in the end we calculate the geometric
mean of the n-gram precisions.
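(As a sketch of this idea: the reference below is the lecture's example, while the hypothesis is an invented paraphrase; clipping and the brevity penalty come later.)

```python
# Sketch: n-gram precisions and their geometric mean, the core of BLEU.
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(hyp, ref, n):
    hyp_grams, ref_grams = ngrams(hyp.split(), n), ngrams(ref.split(), n)
    return sum(g in ref_grams for g in hyp_grams) / len(hyp_grams)

ref = "back to the future premiered thirty years ago"
hyp = "back to the future hit theaters thirty years ago"  # invented output
precisions = [ngram_precision(hyp, ref, n) for n in range(1, 5)]
print(exp(sum(log(p) for p in precisions) / 4))  # geometric mean, ~0.43
```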
0:18:32.812 --> 0:18:39.745
So as you can see here, when we look at the
n-gram overlap, we only look at the machine
0:18:39.745 --> 0:18:40.715
translation output.
0:18:41.041 --> 0:18:55.181
We only care about how many words in the machine
translation output appear in the reference.
0:18:55.455 --> 0:19:02.370
So this metric is kind of precision
based and not really recall based.
0:19:04.224 --> 0:19:08.112
This leads to a problem like the example
here.
0:19:08.112 --> 0:19:14.828
The reference is "Back to the Future premiered
30 years ago" and the machine translation output
0:19:14.828 --> 0:19:16.807
is only "back to the future".
0:19:17.557 --> 0:19:28.722
The one-gram precision will be perfect, because
as you can see "back to the future" is contained entirely
0:19:28.722 --> 0:19:30.367
in the reference.
0:19:31.231 --> 0:19:38.314
This is not right, because one is the perfect score,
but this is obviously not a good translation.
0:19:40.120 --> 0:19:47.160
So in order to tackle this, they use something
called the brevity penalty.
0:19:47.988 --> 0:19:59.910
It is a factor that is multiplied
with the geometric mean.
0:19:59.910 --> 0:20:04.820
The formula uses the length of the output and
the length of the reference.
0:20:05.525 --> 0:20:19.901
So the penalty is e to the power of one minus
the reference length divided by the output length,
0:20:21.321 --> 0:20:32.298
which is lower than one. And if we apply this
to the example, the BLEU score is going to be much lower,
0:20:32.298 --> 0:20:36.462
reflecting that this is not a good translation.
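(A minimal sketch of the brevity penalty as it is usually defined: BP = exp(1 - r/c) when the candidate length c is shorter than the reference length r, and 1 otherwise.)

```python
# Sketch: BLEU's brevity penalty for too-short output.
from math import exp

def brevity_penalty(hyp_len, ref_len):
    if hyp_len >= ref_len:
        return 1.0
    return exp(1.0 - ref_len / hyp_len)

# The lecture's example: a 4-word output against an 8-word reference, so
# the perfect n-gram precision gets scaled down by exp(1 - 8/4) ~= 0.37.
print(brevity_penalty(4, 8))
```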
0:20:38.999 --> 0:20:42.152
Yep, so any questions at this point?
0:20:44.064 --> 0:21:00.947
Yes, exactly, that is a problem as well,
and it will be mentioned later on.
0:21:00.947 --> 0:21:01.990
But BLEU
0:21:03.203 --> 0:21:08.239
is very sensitive to zero scores like that;
that is why we usually don't use the BLEU
score at sentence level, because a sentence can be
short and then there can be no overlap.
That is why we usually use it on documents.
0:21:16.709 --> 0:21:20.657
As you can imagine, documents are very long,
so there is very little chance of zero overlap.
0:21:23.363 --> 0:21:28.531
Yeah okay, so the next thing on the BLEU
score is clipping.
0:21:29.809 --> 0:21:42.925
So you can see here we have two references,
"the new movie" and "the new film", and we have
0:21:42.925 --> 0:21:47.396
a machine translation output.
0:21:47.807 --> 0:21:54.735
Because "the" here is also in the reference,
the count would be, say, two out of two, which is one.
0:21:56.236 --> 0:22:02.085
But this is not what we want, because this is
just repeating something that appears in the reference.
0:22:02.702 --> 0:22:06.058
So that's why we use clipping.
0:22:06.058 --> 0:22:15.368
Clipping here means that we cap each count at the
maximum count in any single reference, as you can see
0:22:15.368 --> 0:22:17.425
here in the references.
0:22:18.098 --> 0:22:28.833
So when we do clipping, we will just use
the maximum counts in the references.
0:22:29.809 --> 0:22:38.717
Yeah, just to avoid rewarding repetitive
words in the translation.
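(A sketch of clipped counts with the slide's two references; the repetitive hypothesis is invented to show the effect.)

```python
# Sketch: clipped unigram precision -- each hypothesis word is credited at
# most as often as it occurs in any single reference.
from collections import Counter

def clipped_precision(hyp_tokens, references):
    hyp_counts = Counter(hyp_tokens)
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ref).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in hyp_counts.items())
    return clipped / len(hyp_tokens)

refs = [["the", "new", "movie"], ["the", "new", "film"]]
print(clipped_precision(["the", "the", "the"], refs))  # 1/3, not 3/3
```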
0:22:41.641 --> 0:23:00.599
It could happen that there is no overlap between
the machine translation output and the reference.
0:23:00.500 --> 0:23:01.917
Then everything is going to go to zero.
0:23:02.402 --> 0:23:07.876
So that's why for the BLEU score we usually use
a corpus-level score, where we aggregate the
0:23:07.876 --> 0:23:08.631
statistics.
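(In practice, corpus-level BLEU is usually computed with a standard tool; a sketch assuming the sacrebleu package is installed:)

```python
# Sketch: corpus-level BLEU with sacrebleu (assumes `pip install sacrebleu`).
# n-gram statistics are aggregated over all segments before scoring, which
# avoids the zero-count problem of very short individual sentences.
import sacrebleu

hypotheses = ["back to the future", "she was treated on site"]
references = [["back to the future premiered thirty years ago",
               "she was treated at the site by an emergency doctor"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # one 0-100 score for the whole test set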
0:23:12.092 --> 0:23:18.589
A short summary about BLEU: as you can see,
it matches exact words.
0:23:18.589 --> 0:23:31.751
It can take several references. It measures
adequacy by the word precision, and it measures
0:23:31.751 --> 0:23:36.656
fluency by the n-gram precision.
0:23:37.437 --> 0:23:47.254
And as mentioned, it doesn't consider how
much meaning is captured in the machine
0:23:47.254 --> 0:23:48.721
translation.
0:23:49.589 --> 0:23:53.538
They use the brevity penalty to penalize
short sentences.
0:23:54.654 --> 0:24:04.395
We compute the score over the whole test set to
avoid the zero issues,
0:24:04.395 --> 0:24:07.012
as we mentioned.
0:24:09.829 --> 0:24:22.387
Now the weaknesses. As mentioned, it matches multiple reference
translations simultaneously, and it's a precision-based
0:24:22.387 --> 0:24:24.238
metric.
0:24:24.238 --> 0:24:27.939
So we are not sure if it captures recall well.
0:24:29.689 --> 0:24:37.423
The second weakness is that BLEU compensates
for recall with the brevity penalty, and we
0:24:37.423 --> 0:24:38.667
are not sure that is enough.
0:24:39.659 --> 0:24:50.902
The next one is the exact word matching; we could still
improve the similarity measure and improve the correlation
0:24:50.902 --> 0:24:51.776
with human scores.
0:24:52.832 --> 0:25:01.673
The next one is that all words have the same
importance.
0:25:01.673 --> 0:25:07.101
What about a scheme for weighting words?
0:25:11.571 --> 0:25:26.862
And the last weakness is that BLEU uses higher-order
n-grams to account for fluency.
0:25:27.547 --> 0:25:32.101
So the question is whether that is really enough to account
for fluency and grammaticality; there are some other aspects.
0:25:35.956 --> 0:25:47.257
We have some further issues: words are not created
equal, so we can use stemming or knowledge
0:25:47.257 --> 0:25:48.156
bases.
0:25:50.730 --> 0:26:00.576
The next idea is to incorporate more information
within the metrics.
0:26:01.101 --> 0:26:07.101
And a stop list can be used to somehow
ignore the non-important words.
0:26:08.688 --> 0:26:12.687
Text normalization: spelling, conjugation, lower
case and mixed case.
0:26:12.687 --> 0:26:18.592
The next thing is that for some languages like
Chinese there can be different word segmentations,
0:26:18.592 --> 0:26:23.944
so exact word matching might no longer be a
good idea; maybe it's better to calculate the
0:26:23.944 --> 0:26:27.388
score at the character level instead of the
word level.
0:26:29.209 --> 0:26:33.794
And the last thing is speech translation.
0:26:33.794 --> 0:26:38.707
Usually the output in speech translation would
not come with sentence boundaries,
0:26:38.979 --> 0:26:51.399
and there should be some way to segment it into
sentences so that we can calculate the score
0:26:51.399 --> 0:26:52.090
against the reference.
0:26:52.953 --> 0:27:01.326
And the way to solve this is to use some tools
like MWER segmentation to align the output
0:27:01.326 --> 0:27:01.896
with the reference.
0:27:06.306 --> 0:27:10.274
Yes, so I guess that was all about the BLEU
score. Any questions?
0:27:14.274 --> 0:27:28.292
Back to automatic metrics: we'll talk about
properties of good metrics, trained automatic metrics,
0:27:28.292 --> 0:27:32.021
and use cases of evaluation.
0:27:34.374 --> 0:27:44.763
How do we measure the performance of a metric?
A good metric should correlate well with human judgment.
0:27:49.949 --> 0:28:04.905
We would want the metric to be interpretable:
if there is a ranking from humans, the metric should somehow
0:28:04.905 --> 0:28:08.247
rank the systems the same way.
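(This meta-evaluation is typically done by correlating metric scores with human judgments across several systems; a sketch with invented numbers:)

```python
# Sketch: meta-evaluating a metric via its correlation with human judgment
# over several MT systems (all numbers invented for illustration).
from scipy.stats import pearsonr, spearmanr

human = [3.9, 3.1, 2.4, 4.2]        # human assessment per system
metric = [31.0, 27.5, 22.1, 33.8]   # e.g. BLEU per system

print(pearsonr(human, metric)[0])   # linear correlation
print(spearmanr(human, metric)[0])  # rank (ordering) agreement
```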
0:28:12.132 --> 0:28:15.819
We would also want the evaluation metric to
be sensitive:
0:28:15.819 --> 0:28:21.732
small differences between machine translations
can be distinguished. We would also want it to
0:28:21.732 --> 0:28:22.686
be consistent:
0:28:22.686 --> 0:28:28.472
if the same machine translation system
is used on a similar text, it should reproduce
0:28:28.472 --> 0:28:29.553
a similar score.
0:28:31.972 --> 0:28:40.050
Next, we would want the evaluation metric
to be reliable:
0:28:40.050 --> 0:28:42.583
machine translation systems with similar scores should perform similarly.
0:28:43.223 --> 0:28:52.143
We want the metric to be easy to run in general,
so it can be applied to multiple different machine translation systems.
0:28:55.035 --> 0:29:11.148
The difficulty of evaluating the metric itself
is kind of similar to when you evaluate the
0:29:11.148 --> 0:29:13.450
translation.
0:29:18.638 --> 0:29:23.813
And here are some components of automatic
machine translation metrics.
0:29:23.813 --> 0:29:28.420
For the matching metrics, the components would
be precision,
0:29:28.420 --> 0:29:30.689
recall, or Levenshtein distance.
0:29:30.689 --> 0:29:35.225
For the BLEU score, you have seen it cares
mostly about precision.
0:29:36.396 --> 0:29:45.613
And for the features, it would be about how
to measure the matches: word based or character based.
0:29:48.588 --> 0:30:01.304
Now we will talk about more metrics, because
the BLEU score is only the most common one.
0:30:02.082 --> 0:30:10.863
The next metric compares the reference and hypothesis
using edit operations.
0:30:10.863 --> 0:30:14.925
They count how many insertions, deletions
and substitutions are needed.
0:30:23.143 --> 0:30:31.968
We already talked about it: beyond exact word matching,
it could care about character-based matching
0:30:31.968 --> 0:30:34.425
or linguistic information.
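(A sketch of the underlying word-level edit distance, as used in edit-operation metrics such as WER or TER; TER additionally normalizes the edit count by the reference length.)

```python
# Sketch: word-level Levenshtein (edit) distance between hypothesis and
# reference -- the basis of edit-operation metrics such as WER/TER.
def edit_distance(hyp, ref):
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits to turn the first i hyp words into the first j ref words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(r) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(h)][len(r)]

print(edit_distance("she was treated on site",
                    "she was treated at the site"))  # 2 edits
```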
0:30:36.636 --> 0:30:41.502
The next metric is the METEOR metric.
0:30:41.502 --> 0:30:50.978
The name stands for Metric for Evaluation
of Translation with Explicit ORdering.
0:30:51.331 --> 0:31:03.236
So METEOR's new idea is that they reintroduce
recall and combine it with precision as score
0:31:03.236 --> 0:31:04.772
components.
0:31:05.986 --> 0:31:16.700
It aligns the machine translation output with each
reference individually and takes the score of the
0:31:16.700 --> 0:31:18.301
best pairing.
0:31:20.940 --> 0:31:27.330
The next thing is that the matching takes into
account inflection variation by stemming, so it's
0:31:27.330 --> 0:31:28.119
no longer exact word matching.
0:31:30.230 --> 0:31:40.165
To address fluency, they use a direct
penalty instead of n-grams, so they would care
0:31:40.165 --> 0:31:40.929
about how fragmented the matching is.
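(A sketch of the original METEOR scoring formula, with the parameter choices from Banerjee & Lavie, 2005; the alignment step that produces the match and chunk counts is omitted, and the counts below are invented.)

```python
# Sketch: the original METEOR scoring formula. A recall-weighted F-mean is
# scaled down by a fragmentation penalty: the more separate chunks the
# matched words fall into, the less fluent the word order.
def meteor_score(matches, hyp_len, ref_len, chunks):
    precision = matches / hyp_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# 6 matched unigrams in 2 contiguous chunks, 7-word hypothesis,
# 8-word reference (made-up counts):
print(meteor_score(matches=6, hyp_len=7, ref_len=8, chunks=2))
```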
0:31:45.925 --> 0:31:56.287
The next thing is trainable metrics: for
these metrics we want to extract some features.
0:31:56.936 --> 0:32:04.450
So for example here the reference is "the nice house is on the
right" and the hypothesis is "the building is on the right side",
0:32:04.450 --> 0:32:12.216
so we will extract some features, for example
how much the reference and hypothesis
0:32:12.216 --> 0:32:14.158
have in common.
0:32:14.714 --> 0:32:19.163
They have one insertion, two deletions, and
they have the same verb.
0:32:21.141 --> 0:32:31.530
So the idea is to use machine learning
techniques to combine these features, and this
0:32:31.530 --> 0:32:37.532
model will be trained on human
rankings.
0:32:39.819 --> 0:32:44.788
A common framework for this is COMET,
0:32:44.684 --> 0:32:48.094
which is a neural model.
0:32:48.094 --> 0:32:54.149
The features are created using some pretrained
model like XLM-RoBERTa.
0:32:54.149 --> 0:33:00.622
Here the input would be the source, the reference
and the hypothesis, and then it would try
0:33:00.622 --> 0:33:02.431
to produce an assessment.
0:33:03.583 --> 0:33:05.428
Yeah, it's trained to predict human scores.
0:33:06.346 --> 0:33:19.131
And they also have some additional versions,
where the model is trained to tell the quality
0:33:19.131 --> 0:33:20.918
of a translation without a reference.
0:33:21.221 --> 0:33:29.724
So instead of taking the source, the reference and the
hypothesis as input, they take only the
0:33:29.724 --> 0:33:38.034
source and the hypothesis as input and try
to predict the quality of the translation.
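(A usage sketch, assuming the unbabel-comet Python package; the checkpoint names are published examples, not necessarily the ones shown in the lecture.)

```python
# Sketch: scoring with COMET (assumes `pip install unbabel-comet`).
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{
    "src": "Sie wurde vor Ort behandelt.",  # source
    "mt":  "She was treated on site.",      # hypothesis
    "ref": "She was treated at the site.",  # human reference
}]
print(model.predict(data, batch_size=8, gpus=0).scores)
# Reference-free (quality estimation) variants such as
# wmt22-cometkiwi-da take only "src" and "mt" entries.
```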
0:33:42.562 --> 0:33:49.836
As mentioned before, machine translation
systems are often used inside larger systems.
0:33:50.430 --> 0:33:57.713
So the question is how to evaluate the performance
of the machine translation system in this larger
0:33:57.713 --> 0:34:04.997
scenario, and an example would be a speech translation
system, where you try to translate English audio
0:34:04.997 --> 0:34:05.798
to German.
0:34:06.506 --> 0:34:13.605
This would usually have two components,
ASR and MT, where ASR is the speech recognition
0:34:13.605 --> 0:34:20.626
that transcribes English audio into English
text, and then we have the machine translation
0:34:20.626 --> 0:34:24.682
system that translates English text into German
text.
0:34:26.967 --> 0:34:33.339
So in order to measure the overall performance
in this bigger scenario, there are several ways
0:34:33.339 --> 0:34:34.447
to evaluate it.
0:34:34.447 --> 0:34:41.236
The first one is to evaluate the individual
components, like how good is the speech recognizer,
0:34:41.236 --> 0:34:46.916
how good are the analysis and generation
engines, how good is the synthesizer.
0:34:47.727 --> 0:34:56.905
The second way is to evaluate the translation
quality from speech input to text output:
0:34:56.905 --> 0:35:00.729
how good is the final translation?
0:35:02.102 --> 0:35:10.042
The next one is to evaluate the
effectiveness of the architecture:
0:35:10.042 --> 0:35:12.325
how does it affect the overall result in general?
0:35:12.325 --> 0:35:19.252
The next one is task-based evaluation or a user
study, where we simply ask the users what
0:35:19.252 --> 0:35:24.960
their experience is, like whether the system
works well and how well it is.
0:35:27.267 --> 0:35:32.646
So here we have an example of the IWSLT
evaluation results.
0:35:33.153 --> 0:35:38.911
So the first block would be the human evaluation;
I think they were asked to give a score
0:35:38.911 --> 0:35:44.917
from one to five, where five is best
and one is worst, and the lower block is the BLEU score,
0:35:44.917 --> 0:35:50.490
and they found out that the human evaluation
actually correlates with the BLEU score
0:35:50.490 --> 0:35:51.233
quite well.
0:35:53.193 --> 0:36:02.743
Here you can also see that the systems from
our university are actually on top in many subtasks.
0:36:05.605 --> 0:36:07.429
So yeah.
0:36:08.868 --> 0:36:14.401
The summary for this lecture is that machine translation
evaluation is difficult.
0:36:14.401 --> 0:36:21.671
We talked about human versus automatic evaluation:
human evaluation is costly, but it is the
0:36:21.671 --> 0:36:27.046
gold standard; automatic evaluation would be
a faster and cheaper way.
0:36:27.547 --> 0:36:36.441
We talked about granularity: sentence-level,
document-level or task-level evaluation of machine
0:36:36.441 --> 0:36:38.395
translation systems.
0:36:39.679 --> 0:36:51.977
And we talked about human evaluation versus
automatic metrics in detail.
0:36:54.034 --> 0:36:59.840
So we introduced a lot of metrics.
0:36:59.840 --> 0:37:10.348
How do they compare in terms of correlation with
human assessment, so which one is better?
0:37:12.052 --> 0:37:16.294
I don't have the exact scores and references
in my head.
0:37:16.294 --> 0:37:22.928
I would assume that METEOR should have
a better correlation, because here they also
0:37:22.928 --> 0:37:30.025
consider other aspects like recall, whether
the information in the reference is captured
0:37:30.025 --> 0:37:31.568
in the translation.
0:37:32.872 --> 0:37:41.875
And synonyms, so I would assume that METEOR
is better, but again I don't have the reference
0:37:41.875 --> 0:37:43.441
in my head.
0:37:43.903 --> 0:37:49.771
But I guess the reason people are still using
the BLEU score is that in most literature on machine
0:37:49.771 --> 0:38:00.823
translation systems, they report BLEU. So if you now
create a new machine translation system,
0:38:00.823 --> 0:38:07.990
it might be better to also report the BLEU score.
0:38:08.228 --> 0:38:11.472
Exactly just slice good, just spread white,
and then we're going to go ahead.
0:38:12.332 --> 0:38:14.745
And don't know what you're doing.
0:38:17.457 --> 0:38:18.907
I want to talk quickly about this model.
0:38:19.059 --> 0:38:32.902
So it is like a language model; it's kind
of the same usage.
0:38:33.053 --> 0:38:39.343
So the idea is that we have this layer in
order to embed the source and the reference,
0:38:39.343 --> 0:38:39.713
and
0:38:40.000 --> 0:38:54.199
turn them into some feature vectors that we can later
on use to predict the human score.
0:38:58.618 --> 0:39:00.051
Okay, if there's nothing else.