WEBVTT
0:00:56.957 --> 0:01:10.166
Today we are going to talk about evaluation,
that is, how you can tell how good your translation is.
0:01:11.251 --> 0:01:23.175
Today we're going to talk about first some
introduction about the difficulties and also
0:01:23.175 --> 0:01:27.783
the dimensions of the evaluation.
0:01:28.248 --> 0:01:32.315
And the second one is on automatic evaluation.
0:01:32.315 --> 0:01:33.960
The second one
0:01:33.893 --> 0:01:40.952
would require less human effort and cost, but it
is probably not as accurate.
0:01:42.702 --> 0:02:01.262
So for machine translation evaluation, the
goal is to measure the quality of a translation.
0:02:03.003 --> 0:02:06.949
Why do we need machine translation evaluation?
0:02:06.949 --> 0:02:14.152
The first thing is for application scenarios,
to tell whether the system is reliable enough.
0:02:14.674 --> 0:02:22.911
Second thing is to guide our research because
given such metrics we will be able to find out
0:02:22.911 --> 0:02:30.875
which improvement direction is valuable for
our machine translation system and the last
0:02:30.875 --> 0:02:34.224
thing is for our system development.
0:02:36.116 --> 0:02:42.926
So now we will come to some difficulties on
evaluation.
0:02:42.926 --> 0:02:50.952
The first thing is ambiguity, because usually
one sentence can have several correct translations.
0:02:51.431 --> 0:03:04.031
Here you can see that, for example, we have
the correct reference.
0:03:05.325 --> 0:03:19.124
The second difficulty is that small changes
can be very important.
0:03:20.060 --> 0:03:22.531
The next difficulty is that evaluation is subjective.
0:03:23.123 --> 0:03:39.266
So it depends on each person's opinion whether
a translation is correct.
0:03:41.041 --> 0:03:49.393
The last is that evaluation sometimes is application
dependent.
0:03:49.393 --> 0:03:54.745
so whether a translation is good enough depends on the application.
0:03:57.437 --> 0:04:04.502
The first dimension is human versus automatic
evaluation, which I already talked about
0:04:04.502 --> 0:04:06.151
in the introduction.
0:04:06.151 --> 0:04:13.373
The second thing is on granulity, so evaluation
could be on sentence level, document level,
0:04:13.373 --> 0:04:14.472
or task-based.
0:04:15.375 --> 0:04:28.622
The last thing is whether we judge the translation
on capturing the meaning correctly (adequacy) or on fluency.
0:04:30.630 --> 0:04:33.769
So on the first dimension: human versus
automatic evaluation.
0:04:34.334 --> 0:04:45.069
So human evaluation is the gold
standard because in the end we give our machine
0:04:45.069 --> 0:04:48.647
translation system to people.
0:04:49.329 --> 0:04:55.040
But it is also expensive and time consuming for
people to manually evaluate some systems.
0:04:57.057 --> 0:05:05.575
For automatic evaluation, it is of course
cheaper and faster, and it would use a human reference.
0:05:08.168 --> 0:05:16.971
The next dimension is granularity.
0:05:16.971 --> 0:05:25.529
The first level is sentence based.
0:05:25.885 --> 0:05:33.003
But this is difficult because if you translate
a single sentence, it will be difficult to
0:05:33.003 --> 0:05:35.454
tell whether this translation is good or not.
0:05:37.537 --> 0:05:40.633
The second level is document based.
0:05:40.633 --> 0:05:46.051
This should be the most commonly used in automatic
evaluation.
0:05:46.286 --> 0:06:00.750
The last level is task-based, which should be
like the final goal of our machine translation.
0:06:01.061 --> 0:06:02.315
But it is slow in general.
0:06:02.315 --> 0:06:07.753
We are not sure whether the errors come from
the machine translation system itself or some
0:06:07.753 --> 0:06:08.828
other components.
0:06:11.431 --> 0:06:21.300
The next dimension is adequacy versus fluency,
so adequacy is whether the meaning is translated correctly.
0:06:22.642 --> 0:06:25.384
You can see the example here.
0:06:25.384 --> 0:06:32.237
In the hypothesis, "Different is everything now",
basically the meaning is still captured.
0:06:32.852 --> 0:06:36.520
But then you can see it's not fluent.
0:06:36.520 --> 0:06:38.933
It sounds kind of weird.
0:06:38.933 --> 0:06:41.442
Nothing is different now.
0:06:41.442 --> 0:06:43.179
It sounds fluent.
0:06:46.006 --> 0:06:50.650
Next we come to error analysis.
0:06:50.650 --> 0:07:02.407
When we evaluate a system and give a score
we want to have interpretable results.
0:07:03.083 --> 0:07:07.930
So usually there would be some test suites first
in order to detect these errors.
0:07:08.448 --> 0:07:21.077
And usually they would be like quite specific
to some specific type of error, for example
0:07:21.077 --> 0:07:23.743
wrong translation.
0:07:24.344 --> 0:07:32.127
Or morphological agreement, as in whether the
word form is correct,
0:07:32.127 --> 0:07:35.031
or whether you use the right article.
0:07:37.577 --> 0:07:45.904
So now we come to human evaluation, which
is the final goal of machine translation.
0:07:47.287 --> 0:07:50.287
So why do we perform human evaluation?
0:07:51.011 --> 0:08:00.115
The first thing is that automatic machine
translation metrics are not sufficient.
0:08:00.480 --> 0:08:06.725
Existing automatic metrics are sometimes
biased.
0:08:06.725 --> 0:08:16.033
For example, the BLEU score will usually
just look at surface word matches with the reference.
0:08:16.496 --> 0:08:24.018
So it doesn't take into account deeper
meaning: it cares about word-to-word matching
0:08:24.018 --> 0:08:26.829
instead of rephrasing or synonyms.
0:08:27.587 --> 0:08:34.881
And bias, as in that metrics like that would
usually depend a lot on the gold standard reference
0:08:34.881 --> 0:08:41.948
given from some human, and that person could
have some specific style or language preferences,
0:08:41.948 --> 0:08:43.979
and then the metric would inherit that bias.
0:08:47.147 --> 0:08:55.422
The next thing is that automatic metrics don't
provide sufficient insights for error analysis.
0:08:57.317 --> 0:09:04.096
Different types of errors would have different
implications depending on the underlying task.
0:09:04.644 --> 0:09:09.895
So, for example, if you use machine translation
for information retrieval,
0:09:10.470 --> 0:09:20.202
Then if it makes some error omitting some
words in translation then it would be very
0:09:20.202 --> 0:09:20.775
bad.
0:09:21.321 --> 0:09:30.305
Another example is if you use machine translation
in a chatbot, then fluency would be very important.
0:09:30.305 --> 0:09:50.253
And we also need human judgments in
order to develop and assess automatic translation
0:09:50.253 --> 0:09:52.324
evaluation.
0:09:55.455 --> 0:10:01.872
Okay, so now we will come to the quality measures
of human evaluation.
0:10:02.402 --> 0:10:05.165
The first thing is inter-annotator agreement.
0:10:05.825 --> 0:10:25.985
This is agreement between different annotators.
0:10:26.126 --> 0:10:31.496
So as you can see here, this would measure
the reliability of the annotations.
0:10:32.252 --> 0:10:49.440
And here we have an example, where the kappa
score here measures the agreement.
0:10:49.849 --> 0:10:57.700
And this is in contrast to intra-annotator
agreement, so this is agreement within an annotator.
0:10:58.118 --> 0:11:03.950
So instead of measuring reliability, here
it measures the consistency of a single annotator.
0:11:04.884 --> 0:11:07.027
And yep.
0:11:07.027 --> 0:11:22.260
We also have an example of this here.
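A common statistic for this kind of agreement is Cohen's kappa, which corrects the raw agreement for agreement expected by chance; the transcript does not name the exact statistic used on the slide, so treat the following minimal Python sketch (with made-up labels) as an illustration only:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[l] / n) * (dist_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two annotators rating six translations.
print(cohens_kappa(["good", "bad", "good", "good", "bad", "bad"],
                   ["good", "bad", "bad", "good", "bad", "good"]))  # ~0.33
```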
0:11:23.263 --> 0:11:42.120
So now we will come to the main types of human
assessment: The first thing is direct assessment.
0:11:42.842 --> 0:11:53.826
The second thing is human ranking of the translation
at sentence level.
0:11:56.176 --> 0:12:11.087
So in direct assessment we are given the source
and the translation, and possibly the reference translation.
0:12:12.612 --> 0:12:18.023
The goal here is to give scores that evaluate
performance, adequacy and fluency.
0:12:18.598 --> 0:12:23.619
The problem here is that we need normalization
across different judges, different humans.
0:12:24.604 --> 0:12:27.043
And here we have an example.
0:12:27.043 --> 0:12:33.517
She was treated at the site by an emergency
doctor and taken to hospital by.
0:12:34.334 --> 0:12:48.444
The hypothesis here is that she was treated
on site and emergency medical rescue workers
0:12:48.444 --> 0:12:52.090
brought to a hospital.
0:12:52.472 --> 0:12:56.267
Let's say five is best and one is worst.
0:13:00.060 --> 0:13:04.716
I don't think it's perfect, because I think
it should be "brought her to a hospital", right?
0:13:05.905 --> 0:13:09.553
Yes, that is like a crucial error.
0:13:09.553 --> 0:13:19.558
Yeah, I think I would agree because this sentence
somehow gives us the idea of what the meaning
0:13:19.558 --> 0:13:21.642
of the sentence is.
0:13:21.642 --> 0:13:24.768
But then it lost the word "her".
0:13:27.027 --> 0:13:29.298
The next type of human evaluation is ranking.
0:13:30.810 --> 0:13:38.893
That is, we rate different systems according
to performance, like which one is better.
0:13:40.981 --> 0:13:43.914
So here now we have a second hypothesis.
0:13:43.914 --> 0:13:49.280
She was hospitalized on the spot and taken
to hospital by ambulance crews.
0:13:50.630 --> 0:14:01.608
As you can see here, the second hypothesis
seems to be more fluent, more smooth.
0:14:01.608 --> 0:14:09.096
The meaning seems to be captured comparably well. So yeah,
it's difficult to compare different errors
0:14:09.096 --> 0:14:11.143
and decide which error is more severe.
0:14:13.373 --> 0:14:16.068
The next type of human evaluation is post
editing.
0:14:17.817 --> 0:14:29.483
So we want to measure how much time and effort
a human needs to spend in order to turn it into
0:14:29.483 --> 0:14:32.117
a correct translation.
0:14:32.993 --> 0:14:47.905
So this effort can be measured by time or
keystrokes.
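One standard measure along these lines, not named explicitly in the lecture and given here only as background, is HTER (human-targeted translation edit rate), which counts the edits needed to turn the system output into its post-edited version:

```latex
\mathrm{HTER} = \frac{\#\,\text{edits (insertions + deletions + substitutions + shifts)}}{\#\,\text{words in the post-edited reference}}
```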
0:14:49.649 --> 0:14:52.889
And the last one is task based evaluation.
0:14:52.889 --> 0:14:56.806
Here we would want to evaluate the complete
system.
0:14:56.806 --> 0:15:03.436
For example, if you are using the lecture translator
and you see my lecture in German, the final
0:15:03.436 --> 0:15:05.772
evaluation here would be:
0:15:05.772 --> 0:15:08.183
In the end, can you understand?
0:15:09.769 --> 0:15:15.301
The advantage here is that we get the overall
performance, which is our final goal.
0:15:16.816 --> 0:15:25.850
But the disadvantage here is that it could be
complex, and again, if the score is low it might
0:15:25.850 --> 0:15:31.432
be due to other problems than the machine translation
itself.
0:15:33.613 --> 0:15:42.941
So I guess that was about the human evaluation
part. Any questions so far?
0:15:42.941 --> 0:15:44.255
Yes, and.
0:16:00.000 --> 0:16:15.655
Then we will come to automatic metrics, here
to assess the quality of the machine translation
0:16:15.655 --> 0:16:26.179
system by comparing it to a reference. So the premise here is
that the more similar the translation is to the reference,
0:16:26.179 --> 0:16:31.437
the better it is, and we want some algorithm that
can approximate this similarity.
0:16:34.114 --> 0:16:47.735
So the most famous measure could be the BLEU
score, the bilingual evaluation understudy.
0:16:50.930 --> 0:16:56.358
So if we are given the goal that the more
similar translation is to the reference, the
0:16:56.358 --> 0:17:01.785
better, I think the most naive way would be
to count the number of output sentences identical to the
0:17:01.785 --> 0:17:02.472
reference.
0:17:02.472 --> 0:17:08.211
But as you can see, this would be very difficult
because sentence being exactly the same to
0:17:08.211 --> 0:17:10.332
the reference would be very rare.
0:17:11.831 --> 0:17:24.222
You can see the example here in the reference
and machine translation output.
0:17:24.764 --> 0:17:31.930
So the idea here is that instead of comparing
the two whole sentences, we consider the overlap of n-grams.
0:17:35.255 --> 0:17:43.333
Now we can look at an example: for the
BLEU score we consider one- up to four-grams.
0:17:44.844 --> 0:17:52.611
For the unigram overlap, we count the output words, "back", "to",
"the", "future", "premieres", "thirty", "years", "ago", that also appear in the reference,
0:17:52.611 --> 0:17:59.524
so it should be like one, two, three, four,
five, six, seven, eight.
0:17:59.459 --> 0:18:01.476
Eight unigrams overlap with the reference.
0:18:01.921 --> 0:18:03.366
So the unigram precision should be eight over the number of words in the output.
0:18:06.666 --> 0:18:08.994
It is kind of the same for higher-order n-grams.
0:18:08.994 --> 0:18:18.529
Instead of considering only the word "back",
a four-gram unit would be "back to the future".
0:18:19.439 --> 0:18:31.360
So that is basically the idea of the BLEU
score, and in the end we calculate the geometric mean of the n-gram precisions.
0:18:32.812 --> 0:18:39.745
So as you can see here, when we look at the
n-gram overlap, we only look at the machine
0:18:39.745 --> 0:18:40.715
translation.
0:18:41.041 --> 0:18:55.181
We only care about how many words in the machine
translation output appear in the reference.
0:18:55.455 --> 0:19:02.370
So this metric is precision-based
and not really recall-based.
0:19:04.224 --> 0:19:08.112
So this would lead to a problem like the example
here.
0:19:08.112 --> 0:19:14.828
The reference is "'Back to the Future' premiered
30 years ago" and the machine translation output
0:19:14.828 --> 0:19:16.807
is only back to the future.
0:19:17.557 --> 0:19:28.722
The unigram precision will be one, because
you can see "back to the future" overlaps entirely
0:19:28.722 --> 0:19:30.367
with the reference.
0:19:31.231 --> 0:19:38.314
This is not right, because one is the perfect score,
but this is obviously not a good translation.
0:19:40.120 --> 0:19:47.160
So in order to tackle this they use something
called the brevity penalty.
0:19:47.988 --> 0:19:59.910
So it is a factor that is multiplied
with the geometric mean.
0:19:59.910 --> 0:20:04.820
It depends on the length of the output relative to the reference.
0:20:05.525 --> 0:20:19.901
So the penalty is e to the power of one minus
the reference length divided by the output length.
0:20:21.321 --> 0:20:32.298
This factor is lower than one, and if we apply it
to the example, the BLEU score is going to be much lower,
0:20:32.298 --> 0:20:36.462
reflecting that this is not a good translation.
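For reference, the usual way these pieces are combined in the standard BLEU definition is:

```latex
\mathrm{BLEU} = \mathrm{BP}\cdot \exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r\\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n are the n-gram precisions (with clipping, explained below), w_n = 1/4, c is the length of the system output and r the reference length.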
0:20:38.999 --> 0:20:42.152
Yep, so any questions at this point?
0:20:44.064 --> 0:21:00.947
Yes exactly that should be a problem as well,
and it will be mentioned later on.
0:21:00.947 --> 0:21:01.990
But.
0:21:03.203 --> 0:21:08.239
It is very sensitive to zero scores like that,
so that is why we usually don't use the BLEU
0:21:08.239 --> 0:21:13.103
score at sentence level, because sentences can be
short and then there can be no overlap.
0:21:13.103 --> 0:21:16.709
That is why we usually use it on documents
as you can imagine.
0:21:16.709 --> 0:21:20.657
Documents are very long and very little chance
to have zero overlap.
0:21:23.363 --> 0:21:28.531
Yeah okay, so the next thing on the BLEU
score is clipping.
0:21:29.809 --> 0:21:42.925
So you can see here we have two references,
the new movie and the new film, and we have
0:21:42.925 --> 0:21:47.396
a machine translation output.
0:21:47.807 --> 0:21:54.735
Because "the" here is also in the reference,
so yeah, it would be two over two, which is one:
0:21:56.236 --> 0:22:02.085
So but then this is not what we want because
this is just repeating something that appears in the reference.
0:22:02.702 --> 0:22:06.058
So that's why we use clipping.
0:22:06.058 --> 0:22:15.368
Clipping here means that we consider the maximum
counts in any reference, so as you can see
0:22:15.368 --> 0:22:17.425
here in reference.
0:22:18.098 --> 0:22:28.833
So here when we do clipping we will just use
the maximum count of the n-gram in the references.
0:22:29.809 --> 0:22:38.717
Yeah, just to avoid rewarding repetitive
words in the translation.
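To make the clipping concrete, here is a minimal Python sketch of clipped n-gram precision (illustrative only, not the reference BLEU implementation):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    """Each hypothesis n-gram counts at most as often as it appears
    in any single reference (the clipping limit)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)
    clipped = sum(min(c, max_ref_counts[ng]) for ng, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# The repetition example: without clipping "the the the" would look perfect.
refs = ["the new movie".split(), "the new film".split()]
print(clipped_precision("the the the".split(), refs, 1))  # 1/3 instead of 3/3
```

For the repeated-word case, the unclipped unigram precision would be 3/3 = 1, while the clipped precision is 1/3.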
0:22:41.641 --> 0:23:00.599
It could happen that there is no overlap between
the machine translation output and reference.
0:23:00.500 --> 0:23:01.917
Then everything is going to go to zero, because the geometric mean becomes zero.
0:23:02.402 --> 0:23:07.876
So that's why for the BLEU score we usually use
a corpus-level score, where we aggregate the
0:23:07.876 --> 0:23:08.631
statistics.
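In practice this corpus-level score is usually computed with an existing tool; a typical usage sketch with the sacrebleu package (assuming it is installed; exact options may vary between versions, and the sentences here are just the lecture's examples):

```python
import sacrebleu

hypotheses = ["Back to the future premiered thirty years ago.",
              "She was treated on site."]
# One list per reference set, aligned with the hypotheses.
references = [["'Back to the Future' premiered 30 years ago.",
               "She was treated at the site by an emergency doctor."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU, aggregated over all segments
```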
0:23:12.092 --> 0:23:18.589
Some summary about BLEU: as you can see,
it matches exact words.
0:23:18.589 --> 0:23:31.751
It can take several references. It measures
adequacy by the word precision and it measures
0:23:31.751 --> 0:23:36.656
the fluency by the n-gram precision.
0:23:37.437 --> 0:23:47.254
And as mentioned, it doesn't consider how
much meaning is captured in the machine
0:23:47.254 --> 0:23:48.721
translation.
0:23:49.589 --> 0:23:53.538
So here they use the brevity penalty to penalize
short sentences.
0:23:54.654 --> 0:24:04.395
We get the score over the whole test set to
avoid the zero issues.
0:24:04.395 --> 0:24:07.012
As we mentioned,
0:24:09.829 --> 0:24:22.387
it works with multiple reference
translations simultaneously, and it is a precision-
0:24:22.387 --> 0:24:24.238
based metric.
0:24:24.238 --> 0:24:27.939
So we are not sure if this is sufficient.
0:24:29.689 --> 0:24:37.423
The second thing is that BLEU compensates
for recall with the brevity penalty, and we
0:24:37.423 --> 0:24:38.667
are not sure.
0:24:39.659 --> 0:24:50.902
The next is that it only counts exact matches, so we could still improve the similarity
measure and improve the correlation score to
0:24:50.902 --> 0:24:51.776
human judgments.
0:24:52.832 --> 0:25:01.673
The next is that all words have the same
importance.
0:25:01.673 --> 0:25:07.101
What about a scheme for weighting words?
0:25:11.571 --> 0:25:26.862
And the last weakness is that BLEU relies on
higher-order n-grams to capture fluency and grammaticality.
0:25:27.547 --> 0:25:32.101
So n-gram precision accounts for fluency and
grammaticality only roughly; there are other aspects it misses.
0:25:35.956 --> 0:25:47.257
We have some further issues: words are not created
equally, so we can use stemming or knowledge
0:25:47.257 --> 0:25:48.156
bases.
0:25:50.730 --> 0:26:00.576
The next way we incorporate information is
within the metrics.
0:26:01.101 --> 0:26:07.101
And a stop list can be used to somehow
ignore the non-important words.
0:26:08.688 --> 0:26:12.687
Text normalization: spelling, conjugation, lower
case and mixed case.
0:26:12.687 --> 0:26:18.592
The next thing is that for some languages like
Chinese there can be different word segmentations,
0:26:18.592 --> 0:26:23.944
so exact word matching might no longer be a
good idea, so maybe it is better to compute the
0:26:23.944 --> 0:26:27.388
score at the character level instead of the
word level.
0:26:29.209 --> 0:26:33.794
And the last thing is speech translation.
0:26:33.794 --> 0:26:38.707
Usually the input in speech translation would not be segmented into sentences.
0:26:38.979 --> 0:26:51.399
And there should be some way to segment into
sentences so that we can calculate the score
0:26:51.399 --> 0:26:52.090
against the reference.
0:26:52.953 --> 0:27:01.326
And the way to solve this is to use some tools
like mwerSegmenter to align the output
0:27:01.326 --> 0:27:01.896
with the reference.
0:27:06.306 --> 0:27:10.274
Yes, so I guess that was all about the BLEU
score. Any questions?
0:27:14.274 --> 0:27:28.292
Back to automatic metrics: we'll talk about
properties of good metrics, trained automatic metrics,
0:27:28.292 --> 0:27:32.021
and use cases of evaluation.
0:27:34.374 --> 0:27:44.763
How do we measure the performance of a metric?
A good metric would have the following properties.
0:27:49.949 --> 0:28:04.905
We would want the metric to be interpretable:
if this is the ranking from a human, the metric somehow
0:28:04.905 --> 0:28:08.247
can rank the systems similarly.
0:28:12.132 --> 0:28:15.819
We would also want the evaluation metric to
be sensitive.
0:28:15.819 --> 0:28:21.732
Like small differences in the machine translation
can be distinguished. We would also need it to
0:28:21.732 --> 0:28:22.686
be consistent.
0:28:22.686 --> 0:28:28.472
Like if the same machine translation system
is used on a similar text, it should reproduce
0:28:28.472 --> 0:28:29.553
a similar score.
0:28:31.972 --> 0:28:40.050
Next, we would want the machine translation
metric to be reliable.
0:28:40.050 --> 0:28:42.583
Machine translation.
0:28:43.223 --> 0:28:52.143
We want the metric to be easy to run in general
and applicable to multiple different machine translation systems.
0:28:55.035 --> 0:29:11.148
The difficulty of evaluating the metric itself
is kind of similar to when you evaluate the
0:29:11.148 --> 0:29:13.450
translation.
0:29:18.638 --> 0:29:23.813
And here are some components of automatic
machine translation metrics.
0:29:23.813 --> 0:29:28.420
So for the matching metrics, the components would
be precision,
0:29:28.420 --> 0:29:30.689
recall, or Levenshtein distance.
0:29:30.689 --> 0:29:35.225
So for the BLEU score, you have seen it cares
mostly about the precision.
0:29:36.396 --> 0:29:45.613
And on the features, it would be about how
to measure the matches, word-based or character-based.
0:29:48.588 --> 0:30:01.304
Now we will talk about more metrics; the
BLEU score is just the most common one.
0:30:02.082 --> 0:30:10.863
So it compares the reference and hypothesis
using edit operations.
0:30:10.863 --> 0:30:14.925
They count how many insertions, deletions and substitutions are needed.
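This is the idea behind edit-distance-based metrics such as WER and TER; a minimal word-level Levenshtein sketch (without TER's block-shift operation, so it is only an approximation of that family of metrics):

```python
def edit_distance(hyp, ref):
    """Minimum number of insertions, deletions and substitutions (word level)."""
    # dp[i][j] = edits to turn the first i hypothesis words into the first j reference words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1]

hyp = "she was treated on site".split()
ref = "she was treated at the site".split()
print(edit_distance(hyp, ref))             # 2 edits
print(edit_distance(hyp, ref) / len(ref))  # normalized, WER/TER style
```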
0:30:23.143 --> 0:30:31.968
We already talked about what the matching could
care about beyond exact words: character-based matching, lemmatization,
0:30:31.968 --> 0:30:34.425
or linguistic information.
0:30:36.636 --> 0:30:41.502
The next metric is the METEOR metric.
0:30:41.502 --> 0:30:50.978
The name stands for Metric for Evaluation
of Translation with Explicit ORdering.
0:30:51.331 --> 0:31:03.236
So METEOR's new idea is that they reintroduce
recall and combine it with precision as score
0:31:03.236 --> 0:31:04.772
components.
0:31:05.986 --> 0:31:16.700
It aligns the translation output with each
reference individually and takes the score of the
0:31:16.700 --> 0:31:18.301
best match.
0:31:20.940 --> 0:31:27.330
The next thing is that the matching takes into
account inflection variation by stemming, so it is
0:31:27.330 --> 0:31:28.119
no longer exact word matching only.
0:31:30.230 --> 0:31:40.165
When they address fluency, they use a direct
penalty instead of n-grams, so they would care
0:31:40.165 --> 0:31:40.929
about the ordering of the matched chunks.
0:31:45.925 --> 0:31:56.287
The next thing is on trainable metrics, so
for this metric we want to extract some features.
0:31:56.936 --> 0:32:04.450
So for example here the nice house is on the
right and the building is on the right side
0:32:04.450 --> 0:32:12.216
so we will have to extract some features, for
example how many words the reference and hypothesis
0:32:12.216 --> 0:32:14.158
have in common.
0:32:14.714 --> 0:32:19.163
They have one insertion, two deletions, and
they have the same verb.
0:32:21.141 --> 0:32:31.530
So the idea is to use machine learning
techniques to combine these features, and this machine
0:32:31.530 --> 0:32:37.532
learning model will be trained on human
ranking.
0:32:39.819 --> 0:32:44.788
A common framework for this is COMET.
0:32:44.684 --> 0:32:48.094
It is a neural model.
0:32:48.094 --> 0:32:54.149
The features would be created using some pretrained
model like XLM-RoBERTa.
0:32:54.149 --> 0:33:00.622
Here the input would be the source, the reference
and the hypothesis and then they would try
0:33:00.622 --> 0:33:02.431
to produce an assessment.
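A typical way to run such a trained metric, sketched here with the unbabel-comet package (the model name and exact API are assumptions taken from that library's documentation, not from the lecture):

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed model name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Sie wurde vor Ort behandelt.",   # source sentence (made-up example)
    "mt":  "She was treated on site.",       # system hypothesis
    "ref": "She was treated at the site.",   # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one learned quality score per segment
```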
0:33:03.583 --> 0:33:05.428
Yeah, it's trained to predict human scores.
0:33:06.346 --> 0:33:19.131
And they also have some additional versions,
where they train this model in order to tell whether a
0:33:19.131 --> 0:33:20.918
translation is good.
0:33:21.221 --> 0:33:29.724
So instead of taking the source, the reference and the
hypothesis as input, they could take only the
0:33:29.724 --> 0:33:38.034
source and the hypothesis as input and try
to predict the quality of the translation.
0:33:42.562 --> 0:33:49.836
So as mentioned before, machine translation
systems are often used in larger systems.
0:33:50.430 --> 0:33:57.713
So the question is how to evaluate the performance
of the machine translation system in this larger
0:33:57.713 --> 0:34:04.997
scenario, and an example would be speech translation
system when you try to translate English audio
0:34:04.997 --> 0:34:05.798
to German.
0:34:06.506 --> 0:34:13.605
Then it would usually have two components,
ASR and MT, where ASR is like speech recognition
0:34:13.605 --> 0:34:20.626
that can transcribe English audio to English
text, and then we have the machine translation
0:34:20.626 --> 0:34:24.682
system that translates English text to German
text.
0:34:26.967 --> 0:34:33.339
So in order to have these overall performances
in this bigger scenario, there are several ways
0:34:33.339 --> 0:34:34.447
to evaluate it.
0:34:34.447 --> 0:34:41.236
So the first one is to evaluate the individual
components like how good is the speech recognizer,
0:34:41.236 --> 0:34:46.916
how good are the analysis and generation
engines, how good is the synthesizer.
0:34:47.727 --> 0:34:56.905
The second way is to evaluate translation
quality from speech input to text output.
0:34:56.905 --> 0:35:00.729
How good is the final translation?
0:35:02.102 --> 0:35:10.042
The next thing is to evaluate
the architecture effectiveness, like: how
0:35:10.042 --> 0:35:12.325
effective is the overall setup in general?
0:35:12.325 --> 0:35:19.252
The next one is task-based evaluation or a
user study, where we just simply ask the user what
0:35:19.252 --> 0:35:24.960
their experience is, like whether the system
works well and how well it works.
0:35:27.267 --> 0:35:32.646
So here we have an example of the IWSLT
evaluation test results.
0:35:33.153 --> 0:35:38.911
So the first block would be the human evaluation;
I think they are asked to give a score
0:35:38.911 --> 0:35:44.917
from one to five again, where five is best
and one is worst, and the lower block is the BLEU score,
0:35:44.917 --> 0:35:50.490
and they find out that the human evaluation
actually correlates with the BLEU score
0:35:50.490 --> 0:35:51.233
quite well.
0:35:53.193 --> 0:36:02.743
Here you can also see that the systems from
our university are actually on top in many sub-tasks.
0:36:05.605 --> 0:36:07.429
So Yeah.
0:36:08.868 --> 0:36:14.401
The summary for this lecture is that machine translation
evaluation is difficult.
0:36:14.401 --> 0:36:21.671
We talked about human versus automatic evaluation:
human evaluation would be costly, but it is the
0:36:21.671 --> 0:36:27.046
gold standard; automatic evaluation would be
a faster and cheaper way.
0:36:27.547 --> 0:36:36.441
We talked about granularity: sentence-level,
document-level or task-level evaluation of the machine
0:36:36.441 --> 0:36:38.395
translation system.
0:36:39.679 --> 0:36:51.977
And we talked about human evaluation versus
automatic metrics in detail.
0:36:54.034 --> 0:36:59.840
So we introduced a lot of different metrics.
0:36:59.840 --> 0:37:10.348
How do they compare in terms of correlation with
human assessment, so which one is better?
0:37:12.052 --> 0:37:16.294
I don't have the exact score and reference
in my head.
0:37:16.294 --> 0:37:22.928
I would assume that METEOR should have
a better correlation because here they also
0:37:22.928 --> 0:37:30.025
consider other aspects like the recall whether
the information in the reference is captured
0:37:30.025 --> 0:37:31.568
in the translation.
0:37:32.872 --> 0:37:41.875
Like synonyms, so I would assume that METEOR
is better, but again I don't have the reference
0:37:41.875 --> 0:37:43.441
in my head, so.
0:37:43.903 --> 0:37:49.771
But I guess the reason people are still using
the BLEU score is that in most literature, a machine
0:37:49.771 --> 0:38:00.823
translation system, they report it. So now if you
create a new machine translation system.
0:38:00.823 --> 0:38:07.990
It might be better to also report the BLEU score.
0:38:08.228 --> 0:38:11.472
Exactly just slice good, just spread white,
and then we're going to go ahead.
0:38:12.332 --> 0:38:14.745
And don't know what you're doing.
0:38:17.457 --> 0:38:18.907
I want to talk quickly about this.
0:38:19.059 --> 0:38:32.902
So it is like a language model, so it's kind
of the same uses as.
0:38:33.053 --> 0:38:39.343
So the idea is that we have this layer in
order to embed the source and the reference
0:38:39.343 --> 0:38:39.713
and.
0:38:40.000 --> 0:38:54.199
into some feature vectors that we can later
on use to predict the human score in the end.
0:38:58.618 --> 0:39:00.051
If there's nothing else, that's it.