WEBVTT
0:00:56.957 --> 0:01:10.166
Today we are going to talk about evaluation,
that is, how you can tell how good your translation is.
0:01:11.251 --> 0:01:23.175
Today we're going to talk about first some
introduction about the difficulties and also
0:01:23.175 --> 0:01:27.783
the dimensions of the evaluation.
0:01:28.248 --> 0:01:32.315
And the second one is on automatic evaluation.
0:01:32.315 --> 0:01:33.960
The second one
0:01:33.893 --> 0:01:40.952
would require less human effort and cost, but it
is probably not as accurate.
0:01:42.702 --> 0:02:01.262
So for machine translation evaluation, the
goal is to measure the quality of a translation.
0:02:03.003 --> 0:02:06.949
Why do we need machine translation evaluation?
0:02:06.949 --> 0:02:14.152
The first thing is for application scenarios,
to tell whether the system is reliable enough.
0:02:14.674 --> 0:02:22.911
Second thing is to guide our research because
given such metrics we will be able to find out
0:02:22.911 --> 0:02:30.875
which improvement direction is valuable for
our machine translation system and the last
0:02:30.875 --> 0:02:34.224
thing is for our system development.
0:02:36.116 --> 0:02:42.926
So now we will come to some difficulties on
evaluation.
0:02:42.926 --> 0:02:50.952
The first thing is ambiguity, because usually
one sentence can have several correct translations.
0:02:51.431 --> 0:03:04.031
Here you can see that, for example, we have
the correct reference.
0:03:05.325 --> 0:03:19.124
The second difficulty is that small changes
can be very important.
0:03:20.060 --> 0:03:22.531
The next difficulty is that evaluation is subjective.
0:03:23.123 --> 0:03:39.266
So it depends on each person's opinion whether
a translation is correct.
0:03:41.041 --> 0:03:49.393
The last is that evaluation sometimes is application
dependent.
0:03:49.393 --> 0:03:54.745
so whether a translation is good enough depends on the application.
0:03:57.437 --> 0:04:04.502
The first dimension is human versus automatic
evaluation, which I already talked about
0:04:04.502 --> 0:04:06.151
in the introduction.
0:04:06.151 --> 0:04:13.373
The second thing is on granulity, so evaluation
could be on sentence level, document level,
0:04:13.373 --> 0:04:14.472
or task-based.
0:04:15.375 --> 0:04:28.622
The last thing is whether we judge the translation
on capturing the meaning correctly (adequacy) or on fluency.
0:04:30.630 --> 0:04:33.769
So on the first dimension: human versus
automatic evaluation.
0:04:34.334 --> 0:04:45.069
So human evaluation is the gold
standard because in the end we give our machine
0:04:45.069 --> 0:04:48.647
translation system to people.
0:04:49.329 --> 0:04:55.040
But it is also expensive and time consuming for
people to manually evaluate some systems.
0:04:57.057 --> 0:05:05.575
For automatic evaluation, it is of course
cheaper and faster, and it would use a human reference.
0:05:08.168 --> 0:05:16.971
The next dimension is granularity.
0:05:16.971 --> 0:05:25.529
The first level is sentence based.
0:05:25.885 --> 0:05:33.003
But this is difficult because if you translate
a single sentence, it will be difficult to
0:05:33.003 --> 0:05:35.454
tell whether this translation is good or not.
0:05:37.537 --> 0:05:40.633
The second level is document based.
0:05:40.633 --> 0:05:46.051
This should be the most commonly used in automatic
evaluation.
0:05:46.286 --> 0:06:00.750
The last level is task-based, which should be
like the final goal of our machine translation.
0:06:01.061 --> 0:06:02.315
But it is slow in general.
0:06:02.315 --> 0:06:07.753
We are not sure whether the errors come from
the machine translation system itself or some
0:06:07.753 --> 0:06:08.828
other components.
0:06:11.431 --> 0:06:21.300
The next dimension is adequacy versus fluency,
so adequacy is whether the meaning is translated correctly.
0:06:22.642 --> 0:06:25.384
You can see the example here.
0:06:25.384 --> 0:06:32.237
In the hypothesis, "Different is everything now",
basically the meaning is still captured.
0:06:32.852 --> 0:06:36.520
But then you can see it's not fluent.
0:06:36.520 --> 0:06:38.933
It sounds kind of weird.
0:06:38.933 --> 0:06:41.442
Nothing is different now.
0:06:41.442 --> 0:06:43.179
It sounds fluent.
0:06:46.006 --> 0:06:50.650
Next we come to error analysis.
0:06:50.650 --> 0:07:02.407
When we evaluate a system and give a score
we want to have interpretable results.
0:07:03.083 --> 0:07:07.930
So usually there would be some test suites first
in order to detect these errors.
0:07:08.448 --> 0:07:21.077
And usually they would be like quite specific
to some specific type of error, for example
0:07:21.077 --> 0:07:23.743
wrong translation.
0:07:24.344 --> 0:07:32.127
Or morphological agreement, as in whether the
word form is correct,
0:07:32.127 --> 0:07:35.031
or whether you use the right article.
0:07:37.577 --> 0:07:45.904
So now we come to human evaluation, which
is the final goal of machine translation.
0:07:47.287 --> 0:07:50.287
So why do we perform human evaluation?
0:07:51.011 --> 0:08:00.115
The first thing is that automatic machine
translation metrics are not sufficient.
0:08:00.480 --> 0:08:06.725
Existing automatic metrics are sometimes
biased.
0:08:06.725 --> 0:08:16.033
For example, the BLEU score will usually
just look at surface word matches with the reference.
0:08:16.496 --> 0:08:24.018
So it doesn't take into account deeper
meaning: it cares about word-to-word matching
0:08:24.018 --> 0:08:26.829
instead of rephrasing or synonyms.
0:08:27.587 --> 0:08:34.881
And bias, as in that metrics like that would
usually depend a lot on the gold standard reference
0:08:34.881 --> 0:08:41.948
given from some human, and that person could
have some specific style or language preferences,
0:08:41.948 --> 0:08:43.979
and then the metric would inherit that bias.
0:08:47.147 --> 0:08:55.422
The next thing is that automatic metrics don't
provide sufficient insights for error analysis.
0:08:57.317 --> 0:09:04.096
Different types of errors would have different
implications depending on the underlying task.
0:09:04.644 --> 0:09:09.895
So, for example, if you use machine translation
for information retrieval,
0:09:10.470 --> 0:09:20.202
Then if it makes some error omitting some
words in translation then it would be very
0:09:20.202 --> 0:09:20.775
bad.
0:09:21.321 --> 0:09:30.305
Another example is if you use machine translation
in a chatbot, then fluency would be very important.
0:09:30.305 --> 0:09:50.253
And we also need human judgments in
order to develop and assess automatic translation
0:09:50.253 --> 0:09:52.324
evaluation.
0:09:55.455 --> 0:10:01.872
Okay, so now we will come to the quality measures
of human evaluation.
0:10:02.402 --> 0:10:05.165
The first thing is inter-annotator agreement.
0:10:05.825 --> 0:10:25.985
This is agreement between different annotators.
0:10:26.126 --> 0:10:31.496
So as you can see here, this would measure
the reliability of the annotations.
0:10:32.252 --> 0:10:49.440
And here we have an example, where the kappa
score here measures the agreement.
0:10:49.849 --> 0:10:57.700
And this is in contrast to intra-annotator
agreement, so this is agreement within an annotator.
0:10:58.118 --> 0:11:03.950
So instead of measuring reliability, here
it measures the consistency of a single annotator.
0:11:04.884 --> 0:11:07.027
And yep.
0:11:07.027 --> 0:11:22.260
We also have an example of this here.
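A common statistic for this kind of agreement is Cohen's kappa, which corrects the raw agreement for agreement expected by chance; the transcript does not name the exact statistic used on the slide, so treat the following minimal Python sketch (with made-up labels) as an illustration only:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[l] / n) * (dist_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two annotators rating six translations.
print(cohens_kappa(["good", "bad", "good", "good", "bad", "bad"],
                   ["good", "bad", "bad", "good", "bad", "good"]))  # ~0.33
```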
0:11:23.263 --> 0:11:42.120
So now we will come to the main types of human
assessment: The first thing is direct assessment.
0:11:42.842 --> 0:11:53.826
The second thing is human ranking of the translation
at sentence level.
0:11:56.176 --> 0:12:11.087
So in direct assessment we are given the source
and the translation, and possibly the reference translation.
0:12:12.612 --> 0:12:18.023
The goal here is to give scores that evaluate
performance, adequacy and fluency.
0:12:18.598 --> 0:12:23.619
The problem here is that we need normalization
across different judges, different humans.
0:12:24.604 --> 0:12:27.043
And here we have an example.
0:12:27.043 --> 0:12:33.517
She was treated at the site by an emergency
doctor and taken to hospital by.
0:12:34.334 --> 0:12:48.444
The hypothesis here is that she was treated
on site and emergency medical rescue workers
0:12:48.444 --> 0:12:52.090
brought to a hospital.
0:12:52.472 --> 0:12:56.267
Let's say five is best and one is worst.
0:13:00.060 --> 0:13:04.716
I don't think it's perfect, because I think
it should be "brought her to a hospital", right?
0:13:05.905 --> 0:13:09.553
Yes, that is like a crucial error.
0:13:09.553 --> 0:13:19.558
Yeah, I think I would agree because this sentence
somehow gives us the idea of what the meaning
0:13:19.558 --> 0:13:21.642
of the sentence is.
0:13:21.642 --> 0:13:24.768
But then it lost the word "her".
0:13:27.027 --> 0:13:29.298
The next type of human evaluation is ranking.
0:13:30.810 --> 0:13:38.893
That is, we rate different systems according
to performance, like which one is better.
0:13:40.981 --> 0:13:43.914
So here now we have a second hypothesis.
0:13:43.914 --> 0:13:49.280
She was hospitalized on the spot and taken
to hospital by ambulance crews.
0:13:50.630 --> 0:14:01.608
As you can see here, the second hypothesis
seems to be more fluent, more smooth.
0:14:01.608 --> 0:14:09.096
The meaning seems to be captured comparably well. So yeah,
it's difficult to compare different errors
0:14:09.096 --> 0:14:11.143
and decide which error is more severe.
0:14:13.373 --> 0:14:16.068
The next type of human evaluation is post
editing.
0:14:17.817 --> 0:14:29.483
So we want to measure how much time and effort
a human needs to spend in order to turn it into
0:14:29.483 --> 0:14:32.117
a correct translation.
0:14:32.993 --> 0:14:47.905
So this effort can be measured by time or
keystrokes.
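One standard measure along these lines, not named explicitly in the lecture and given here only as background, is HTER (human-targeted translation edit rate), which counts the edits needed to turn the system output into its post-edited version:

```latex
\mathrm{HTER} = \frac{\#\,\text{edits (insertions + deletions + substitutions + shifts)}}{\#\,\text{words in the post-edited reference}}
```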
0:14:49.649 --> 0:14:52.889
And the last one is task based evaluation.
0:14:52.889 --> 0:14:56.806
Here we would want to evaluate the complete
system.
0:14:56.806 --> 0:15:03.436
For example, if you are using the lecture translator
and you see my lecture in German, the final
0:15:03.436 --> 0:15:05.772
evaluation here would be:
0:15:05.772 --> 0:15:08.183
In the end, can you understand?
0:15:09.769 --> 0:15:15.301
The advantage here is that we get the overall
performance, which is our final goal.
0:15:16.816 --> 0:15:25.850
But the disadvantage here is that it could be
complex, and again, if the score is low it might
0:15:25.850 --> 0:15:31.432
be due to other problems than the machine translation
itself.
0:15:33.613 --> 0:15:42.941
So I guess that was about the human evaluation
part. Any questions so far?
0:15:42.941 --> 0:15:44.255
Yes, and.
0:16:00.000 --> 0:16:15.655
Then we will come to automatic metrics, here
to assess the quality of the machine translation
0:16:15.655 --> 0:16:26.179
system by comparing it to a reference. So the premise here is
that the more similar the translation is to the reference,
0:16:26.179 --> 0:16:31.437
the better it is, and we want some algorithm that
can approximate this similarity.
0:16:34.114 --> 0:16:47.735
So the most famous measure could be the BLEU
score, the bilingual evaluation understudy.
0:16:50.930 --> 0:16:56.358
So if we are given the goal that the more
similar translation is to the reference, the
0:16:56.358 --> 0:17:01.785
better, I think the most naive way would be
to count the number of output sentences identical to the
0:17:01.785 --> 0:17:02.472
reference.
0:17:02.472 --> 0:17:08.211
But as you can see, this would be very difficult
because sentence being exactly the same to
0:17:08.211 --> 0:17:10.332
the reference would be very rare.
0:17:11.831 --> 0:17:24.222
You can see the example here in the reference
and machine translation output.
0:17:24.764 --> 0:17:31.930
So the idea here is that instead of comparing
the two whole sentences, we consider the overlap of n-grams.
0:17:35.255 --> 0:17:43.333
Now we can look at an example: for the
BLEU score we consider one- up to four-grams.
0:17:44.844 --> 0:17:52.611
For the unigram overlap, we count the output words, "back", "to",
"the", "future", "premieres", "thirty", "years", "ago", that also appear in the reference,
0:17:52.611 --> 0:17:59.524
so it should be like one, two, three, four,
five, six, seven, eight.
0:17:59.459 --> 0:18:01.476
Eight unigrams overlap with the reference.
0:18:01.921 --> 0:18:03.366
So the unigram precision should be eight over the number of words in the output.
0:18:06.666 --> 0:18:08.994
It is kind of the same for higher-order n-grams.
0:18:08.994 --> 0:18:18.529
Instead of considering only the word "back",
a four-gram unit would be "back to the future".
0:18:19.439 --> 0:18:31.360
So that is basically the idea of the BLEU
score, and in the end we calculate the geometric mean of the n-gram precisions.
0:18:32.812 --> 0:18:39.745
So as you can see here, when we look at the
n-gram overlap, we only look at the machine
0:18:39.745 --> 0:18:40.715
translation.
0:18:41.041 --> 0:18:55.181
We only care about how many words in the machine
translation output appear in the reference.
0:18:55.455 --> 0:19:02.370
So this metric is precision-based
and not really recall-based.
0:19:04.224 --> 0:19:08.112
So this would lead to a problem like the example
here.
0:19:08.112 --> 0:19:14.828
The reference is "'Back to the Future' premiered
30 years ago" and the machine translation output
0:19:14.828 --> 0:19:16.807
is only back to the future.
0:19:17.557 --> 0:19:28.722
The unigram precision will be one, because
you can see "back to the future" overlaps entirely
0:19:28.722 --> 0:19:30.367
with the reference.
0:19:31.231 --> 0:19:38.314
This is not right, because one is the perfect score,
but this is obviously not a good translation.
0:19:40.120 --> 0:19:47.160
So in order to tackle this they use something
called the brevity penalty.
0:19:47.988 --> 0:19:59.910
So it is a factor that is multiplied
with the geometric mean.
0:19:59.910 --> 0:20:04.820
It depends on the length of the output relative to the reference.
0:20:05.525 --> 0:20:19.901
So the penalty is e to the power of one minus
the reference length divided by the output length.
0:20:21.321 --> 0:20:32.298
This factor is lower than one, and if we apply it
to the example, the BLEU score is going to be much lower,
0:20:32.298 --> 0:20:36.462
reflecting that this is not a good translation.
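For reference, the usual way these pieces are combined in the standard BLEU definition is:

```latex
\mathrm{BLEU} = \mathrm{BP}\cdot \exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r\\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n are the n-gram precisions (with clipping, explained below), w_n = 1/4, c is the length of the system output and r the reference length.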
0:20:38.999 --> 0:20:42.152
Yep, so any questions at this point?
0:20:44.064 --> 0:21:00.947
Yes exactly that should be a problem as well,
and it will be mentioned later on.
0:21:00.947 --> 0:21:01.990
But.
0:21:03.203 --> 0:21:08.239
It is very sensitive to zero scores like that,
so that is why we usually don't use the BLEU
0:21:08.239 --> 0:21:13.103
score at sentence level, because sentences can be
short and then there can be no overlap.
0:21:13.103 --> 0:21:16.709
That is why we usually use it on documents
as you can imagine.
0:21:16.709 --> 0:21:20.657
Documents are very long and very little chance
to have zero overlap.
0:21:23.363 --> 0:21:28.531
Yeah okay, so the next thing on the BLEU
score is clipping.
0:21:29.809 --> 0:21:42.925
So you can see here we have two references,
the new movie and the new film, and we have
0:21:42.925 --> 0:21:47.396
a machine translation output.
0:21:47.807 --> 0:21:54.735
Because "the" here is also in the reference,
so yeah, it would be two over two, which is one:
0:21:56.236 --> 0:22:02.085
So but then this is not what we want because
this is just repeating something that appears in the reference.
0:22:02.702 --> 0:22:06.058
So that's why we use clipping.
0:22:06.058 --> 0:22:15.368
Clipping here means that we consider the maximum
counts in any reference, so as you can see
0:22:15.368 --> 0:22:17.425
here in reference.
0:22:18.098 --> 0:22:28.833
So here when we do clipping we will just use
the maximum count of the n-gram in the references.
0:22:29.809 --> 0:22:38.717
Yeah, just to avoid rewarding repetitive
words in the translation.
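To make the clipping concrete, here is a minimal Python sketch of clipped n-gram precision (illustrative only, not the reference BLEU implementation):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, references, n):
    """Each hypothesis n-gram counts at most as often as it appears
    in any single reference (the clipping limit)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)
    clipped = sum(min(c, max_ref_counts[ng]) for ng, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# The repetition example: without clipping "the the the" would look perfect.
refs = ["the new movie".split(), "the new film".split()]
print(clipped_precision("the the the".split(), refs, 1))  # 1/3 instead of 3/3
```

For the repeated-word case, the unclipped unigram precision would be 3/3 = 1, while the clipped precision is 1/3.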
0:22:41.641 --> 0:23:00.599
It could happen that there is no overlap between
the machine translation output and reference.
0:23:00.500 --> 0:23:01.917
Then everything is going to go to zero, because the geometric mean becomes zero.
0:23:02.402 --> 0:23:07.876
So that's why for the BLEU score we usually use
a corpus-level score, where we aggregate the
0:23:07.876 --> 0:23:08.631
statistics.
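In practice this corpus-level score is usually computed with an existing tool; a typical usage sketch with the sacrebleu package (assuming it is installed; exact options may vary between versions, and the sentences here are just the lecture's examples):

```python
import sacrebleu

hypotheses = ["Back to the future premiered thirty years ago.",
              "She was treated on site."]
# One list per reference set, aligned with the hypotheses.
references = [["'Back to the Future' premiered 30 years ago.",
               "She was treated at the site by an emergency doctor."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU, aggregated over all segments
```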
0:23:12.092 --> 0:23:18.589
Some summary about BLEU: as you can see,
it matches exact words.
0:23:18.589 --> 0:23:31.751
It can take several references. It measures
adequacy by the word precision and it measures
0:23:31.751 --> 0:23:36.656
the fluency by the n-gram precision.
0:23:37.437 --> 0:23:47.254
And as mentioned, it doesn't consider how
much meaning is captured in the machine
0:23:47.254 --> 0:23:48.721
translation.
0:23:49.589 --> 0:23:53.538
So here they use the brevity penalty to penalize
short sentences.
0:23:54.654 --> 0:24:04.395
We get the score over the whole test set to
avoid the zero issues.
0:24:04.395 --> 0:24:07.012
As we mentioned,
0:24:09.829 --> 0:24:22.387
it works with multiple reference
translations simultaneously, and it is a precision-
0:24:22.387 --> 0:24:24.238
based metric.
0:24:24.238 --> 0:24:27.939
So we are not sure if this is sufficient.
0:24:29.689 --> 0:24:37.423
The second thing is that BLEU compensates
for recall with the brevity penalty, and we
0:24:37.423 --> 0:24:38.667
are not sure.
0:24:39.659 --> 0:24:50.902
The next is that it only counts exact matches, so we could still improve the similarity
measure and improve the correlation score to
0:24:50.902 --> 0:24:51.776
human judgments.
0:24:52.832 --> 0:25:01.673
The next is that all words have the same
importance.
0:25:01.673 --> 0:25:07.101
What about a scheme for weighting words?
0:25:11.571 --> 0:25:26.862
And the last weakness is that BLEU relies on
higher-order n-grams to capture fluency and grammaticality.
0:25:27.547 --> 0:25:32.101
So n-gram precision accounts for fluency and
grammaticality only roughly; there are other aspects it misses.
0:25:35.956 --> 0:25:47.257
We have some further issues: words are not created
equally, so we can use stemming or knowledge
0:25:47.257 --> 0:25:48.156
bases.
0:25:50.730 --> 0:26:00.576
The next way we incorporate information is
within the metrics.
0:26:01.101 --> 0:26:07.101
And a stop list can be used to somehow
ignore the non-important words.
0:26:08.688 --> 0:26:12.687
Text normalization: spelling, conjugation, lower
case and mixed case.
0:26:12.687 --> 0:26:18.592
The next thing is that for some languages like
Chinese there can be different word segmentations,
0:26:18.592 --> 0:26:23.944
so exact word matching might no longer be a
good idea, so maybe it is better to compute the
0:26:23.944 --> 0:26:27.388
score at the character level instead of the
word level.
0:26:29.209 --> 0:26:33.794
And the last thing is speech translation.
0:26:33.794 --> 0:26:38.707
Usually the input in speech translation would not be segmented into sentences.
0:26:38.979 --> 0:26:51.399
And there should be some way to segment into
sentences so that we can calculate the score
0:26:51.399 --> 0:26:52.090
against the reference.
0:26:52.953 --> 0:27:01.326
And the way to solve this is to use some tools
like mwerSegmenter to align the output
0:27:01.326 --> 0:27:01.896
with the reference.
0:27:06.306 --> 0:27:10.274
Yes, so I guess that was all about the BLEU
score. Any questions?
0:27:14.274 --> 0:27:28.292
Back to automatic metrics: we'll talk about
properties of good metrics, trained automatic metrics,
0:27:28.292 --> 0:27:32.021
and use cases of evaluation.
0:27:34.374 --> 0:27:44.763
How do we measure the performance of a metric?
A good metric would have the following properties.
0:27:49.949 --> 0:28:04.905
We would want the metric to be interpretable:
if this is the ranking from a human, the metric somehow
0:28:04.905 --> 0:28:08.247
can rank the systems similarly.
0:28:12.132 --> 0:28:15.819
We would also want the evaluation metric to
be sensitive.
0:28:15.819 --> 0:28:21.732
Like small differences in the machine translation
can be distinguished. We would also need it to
0:28:21.732 --> 0:28:22.686
be consistent.
0:28:22.686 --> 0:28:28.472
Like if the same machine translation system
is used on a similar text, it should reproduce
0:28:28.472 --> 0:28:29.553
a similar score.
0:28:31.972 --> 0:28:40.050
Next, we would want the machine translation
metric to be reliable.
0:28:40.050 --> 0:28:42.583
Machine translation.
0:28:43.223 --> 0:28:52.143
We want the metric to be easy to run in general
and applicable to multiple different machine translation systems.
0:28:55.035 --> 0:29:11.148
The difficulty of evaluating the metric itself
is kind of similar to when you evaluate the
0:29:11.148 --> 0:29:13.450
translation.
0:29:18.638 --> 0:29:23.813
And here are some components of automatic
machine translation metrics.
0:29:23.813 --> 0:29:28.420
So for the matching metrics, the components would
be precision,
0:29:28.420 --> 0:29:30.689
recall, or Levenshtein distance.
0:29:30.689 --> 0:29:35.225
So for the BLEU score, you have seen it cares
mostly about the precision.
0:29:36.396 --> 0:29:45.613
And on the features, it would be about how
to measure the matches, word-based or character-based.
0:29:48.588 --> 0:30:01.304
Now we will talk about more metrics; the
BLEU score is just the most common one.
0:30:02.082 --> 0:30:10.863
So it compares the reference and hypothesis
using edit operations.
0:30:10.863 --> 0:30:14.925
They count how many insertions, deletions and substitutions are needed.
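This is the idea behind edit-distance-based metrics such as WER and TER; a minimal word-level Levenshtein sketch (without TER's block-shift operation, so it is only an approximation of that family of metrics):

```python
def edit_distance(hyp, ref):
    """Minimum number of insertions, deletions and substitutions (word level)."""
    # dp[i][j] = edits to turn the first i hypothesis words into the first j reference words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1]

hyp = "she was treated on site".split()
ref = "she was treated at the site".split()
print(edit_distance(hyp, ref))             # 2 edits
print(edit_distance(hyp, ref) / len(ref))  # normalized, WER/TER style
```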
0:30:23.143 --> 0:30:31.968
We already talked about what the matching could
care about beyond exact words: character-based matching, lemmatization,
0:30:31.968 --> 0:30:34.425
or linguistic information.
0:30:36.636 --> 0:30:41.502
The next metric is the METEOR metric.
0:30:41.502 --> 0:30:50.978
The name stands for Metric for Evaluation
of Translation with Explicit ORdering.
0:30:51.331 --> 0:31:03.236
So METEOR's new idea is that they reintroduce
recall and combine it with precision as score
0:31:03.236 --> 0:31:04.772
components.
0:31:05.986 --> 0:31:16.700
It aligns the translation output with each
reference individually and takes the score of the
0:31:16.700 --> 0:31:18.301
best match.
0:31:20.940 --> 0:31:27.330
The next thing is that the matching takes into
account inflection variation by stemming, so it is
0:31:27.330 --> 0:31:28.119
no longer exact word matching only.
0:31:30.230 --> 0:31:40.165
When they address fluency, they use a direct
penalty instead of n-grams, so they would care
0:31:40.165 --> 0:31:40.929
about the ordering of the matched chunks.
0:31:45.925 --> 0:31:56.287
The next thing is on trainable metrics, so
for this metric we want to extract some features.
0:31:56.936 --> 0:32:04.450
So for example here the nice house is on the
right and the building is on the right side
0:32:04.450 --> 0:32:12.216
so we will have to extract some features, for
example how many words the reference and hypothesis
0:32:12.216 --> 0:32:14.158
have in common.
0:32:14.714 --> 0:32:19.163
They have one insertion, two deletions, and
they have the same verb.
0:32:21.141 --> 0:32:31.530
So the idea is to use machine learning
techniques to combine these features, and this machine
0:32:31.530 --> 0:32:37.532
learning model will be trained on human
ranking.
0:32:39.819 --> 0:32:44.788
A common framework for this is COMET.
0:32:44.684 --> 0:32:48.094
It is a neural model.
0:32:48.094 --> 0:32:54.149
The features would be created using some pretrained
model like XLM-RoBERTa.
0:32:54.149 --> 0:33:00.622
Here the input would be the source, the reference
and the hypothesis and then they would try
0:33:00.622 --> 0:33:02.431
to produce an assessment.
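A typical way to run such a trained metric, sketched here with the unbabel-comet package (the model name and exact API are assumptions taken from that library's documentation, not from the lecture):

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # assumed model name
model = load_from_checkpoint(model_path)

data = [{
    "src": "Sie wurde vor Ort behandelt.",   # source sentence (made-up example)
    "mt":  "She was treated on site.",       # system hypothesis
    "ref": "She was treated at the site.",   # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)  # one learned quality score per segment
```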
0:33:03.583 --> 0:33:05.428
Yeah, it's trained to predict human scores.
0:33:06.346 --> 0:33:19.131
And they also have some additional versions,
where they train this model in order to tell whether a
0:33:19.131 --> 0:33:20.918
translation is good.
0:33:21.221 --> 0:33:29.724
So instead of taking the source, the reference and the
hypothesis as input, they could take only the
0:33:29.724 --> 0:33:38.034
source and the hypothesis as input and try
to predict the quality of the translation.
0:33:42.562 --> 0:33:49.836
So as mentioned before, machine translation
systems are often used in larger systems.
0:33:50.430 --> 0:33:57.713
So the question is how to evaluate the performance
of the machine translation system in this larger
0:33:57.713 --> 0:34:04.997
scenario, and an example would be speech translation
system when you try to translate English audio
0:34:04.997 --> 0:34:05.798
to German.
0:34:06.506 --> 0:34:13.605
Then it would usually have two components,
ASR and MT, where ASR is like speech recognition
0:34:13.605 --> 0:34:20.626
that can transcribe English audio to English
text, and then we have the machine translation
0:34:20.626 --> 0:34:24.682
system that translates English text to German
text.
0:34:26.967 --> 0:34:33.339
So in order to have these overall performances
in this bigger scenario, there are several ways
0:34:33.339 --> 0:34:34.447
to evaluate it.
0:34:34.447 --> 0:34:41.236
So the first one is to evaluate the individual
components like how good is the speech recognizer,
0:34:41.236 --> 0:34:46.916
how good are the analysis and generation
engines, how good is the synthesizer.
0:34:47.727 --> 0:34:56.905
The second way is to evaluate translation
quality from speech input to text output.
0:34:56.905 --> 0:35:00.729
How good is the final translation?
0:35:02.102 --> 0:35:10.042
The next thing is to evaluate
the architecture effectiveness, like: how
0:35:10.042 --> 0:35:12.325
effective is the overall setup in general?
0:35:12.325 --> 0:35:19.252
The next one is task-based evaluation or a
user study, where we just simply ask the user what
0:35:19.252 --> 0:35:24.960
their experience is, like whether the system
works well and how well it works.
0:35:27.267 --> 0:35:32.646
So here we have an example of the IWSLT
evaluation test results.
0:35:33.153 --> 0:35:38.911
So the first block would be the human evaluation;
I think they are asked to give a score
0:35:38.911 --> 0:35:44.917
from one to five again, where five is best
and one is worst, and the lower block is the BLEU score,
0:35:44.917 --> 0:35:50.490
and they find out that the human evaluation
actually correlates with the BLEU score
0:35:50.490 --> 0:35:51.233
quite well.
0:35:53.193 --> 0:36:02.743
Here you can also see that the systems from
our university are actually on top in many sub-tasks.
0:36:05.605 --> 0:36:07.429
So Yeah.
0:36:08.868 --> 0:36:14.401
The summary for this lecture is that machine translation
evaluation is difficult.
0:36:14.401 --> 0:36:21.671
We talked about human versus automatic evaluation:
human evaluation would be costly, but it is the
0:36:21.671 --> 0:36:27.046
gold standard; automatic evaluation would be
a faster and cheaper way.
0:36:27.547 --> 0:36:36.441
We talked about granularity: sentence-level,
document-level or task-level evaluation of the machine
0:36:36.441 --> 0:36:38.395
translation system.
0:36:39.679 --> 0:36:51.977
And we talked about human evaluation versus
automatic metrics in detail.
0:36:54.034 --> 0:36:59.840
So we introduced a lot of different metrics.
0:36:59.840 --> 0:37:10.348
How do they compare in terms of correlation with
human assessment, so which one is better?
0:37:12.052 --> 0:37:16.294
I don't have the exact score and reference
in my head.
0:37:16.294 --> 0:37:22.928
I would assume that METEOR should have
a better correlation because here they also
0:37:22.928 --> 0:37:30.025
consider other aspects like the recall whether
the information in the reference is captured
0:37:30.025 --> 0:37:31.568
in the translation.
0:37:32.872 --> 0:37:41.875
Like synonyms, so I would assume that METEOR
is better, but again I don't have the reference
0:37:41.875 --> 0:37:43.441
in my head, so.
0:37:43.903 --> 0:37:49.771
But I guess the reason people are still using
the BLEU score is that in most literature, a machine
0:37:49.771 --> 0:38:00.823
translation system, they report it. So now if you
create a new machine translation system.
0:38:00.823 --> 0:38:07.990
It might be better to also report the BLEU score.
0:38:08.228 --> 0:38:11.472
Exactly just slice good, just spread white,
and then we're going to go ahead.
0:38:12.332 --> 0:38:14.745
And don't know what you're doing.
0:38:17.457 --> 0:38:18.907
I want to talk quickly about this.
0:38:19.059 --> 0:38:32.902
So it is like a language model, so it's kind
of the same uses as.
0:38:33.053 --> 0:38:39.343
So the idea is that we have this layer in
order to embed the source and the reference
0:38:39.343 --> 0:38:39.713
and.
0:38:40.000 --> 0:38:54.199
into some feature vectors that we can later
on use to predict the human score in the end.
0:38:58.618 --> 0:39:00.051
If there's nothing else, that's it.