Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /lectures /Lecture-19-21.07.2023 /English.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame contribute delete

69.4 kB

	WEBVTT

	0:00:01.121 --> 0:00:14.214
	Okay, so welcome to today's lecture, on Tuesday
	we started to talk about speech translation.

	0:00:14.634 --> 0:00:27.037
	And the idea is hopefully an idea of the basic
	ideas we have in speech translation, the two

	0:00:27.037 --> 0:00:29.464
	major approaches.

	0:00:29.829 --> 0:00:41.459
	And the other one is the end system where
	we have one large system which is everything

	0:00:41.459 --> 0:00:42.796
	together.

	0:00:43.643 --> 0:00:58.459
	Until now we mainly focus on text output that
	we'll see today, but you can extend these ideas

	0:00:58.459 --> 0:01:01.138
	to other speech.

	0:01:01.441 --> 0:01:08.592
	But since it's also like a machine translation
	lecture, you of course mainly focus a bit on

	0:01:08.592 --> 0:01:10.768
	the translation challenges.

	0:01:12.172 --> 0:01:25.045
	And what is the main focus of today's lecture
	is to look into why that is challenging speech

	0:01:25.045 --> 0:01:26.845
	translation.

	0:01:27.627 --> 0:01:33.901
	So a bit more focus on what is now really
	the difference to all you and how we can address.

	0:01:34.254 --> 0:01:39.703
	We'll start there by with the segmentation
	problem.

	0:01:39.703 --> 0:01:45.990
	We had that already of bits, but especially
	for end-to-end.

	0:01:46.386 --> 0:01:57.253
	So the problem is that until now it was easy
	to segment the input into sentences and then

	0:01:57.253 --> 0:02:01.842
	translate each sentence individually.

	0:02:02.442 --> 0:02:17.561
	When you're now translating audio, the challenge
	is that you have just a sequence of audio input

	0:02:17.561 --> 0:02:20.055
	and there's no.

	0:02:21.401 --> 0:02:27.834
	So you have this difference that your audio
	is a continuous stream, but the text is typically

	0:02:27.834 --> 0:02:28.930
	sentence based.

	0:02:28.930 --> 0:02:31.667
	So how can you match this gap in there?

	0:02:31.667 --> 0:02:37.690
	We'll see that is really essential, and if
	you're not using a decent good system there,

	0:02:37.690 --> 0:02:41.249
	then you can lose a lot of quality and performance.

	0:02:41.641 --> 0:02:44.267
	That is what also meant before.

	0:02:44.267 --> 0:02:51.734
	So if you have a more complex system out of
	several units, it's really essential that they

	0:02:51.734 --> 0:02:56.658
	all work together and it's very easy to lose
	significantly.

	0:02:57.497 --> 0:03:13.029
	The second challenge we'll talk about is disfluencies,
	so the style of speaking is very different

	0:03:13.029 --> 0:03:14.773
	from text.

	0:03:15.135 --> 0:03:24.727
	So if you translate or TedTalks, that's normally
	very good speakers.

	0:03:24.727 --> 0:03:30.149
	They will give you a very fluent text.

	0:03:30.670 --> 0:03:36.692
	When you want to translate a lecture, it might
	be more difficult or rednested.

	0:03:37.097 --> 0:03:39.242
	Mean people are not well that well.

	0:03:39.242 --> 0:03:42.281
	They should be prepared in giving the lecture
	and.

	0:03:42.362 --> 0:03:48.241
	But it's not that I mean, typically a lecture
	will have like rehearsal like five times before

	0:03:48.241 --> 0:03:52.682
	he is giving this lecture, and then like will
	it completely be fluent?

	0:03:52.682 --> 0:03:56.122
	He might at some point notice all this is
	not perfect.

	0:03:56.122 --> 0:04:00.062
	I want to rephrase, and he'll have to sing
	during the lecture.

	0:04:00.300 --> 0:04:04.049
	Might be also good that he's thinking, so
	he's not going too fast and things like.

	0:04:05.305 --> 0:04:07.933
	If you then go to the other extreme, it's
	more meetings.

	0:04:08.208 --> 0:04:15.430
	If you have a lively discussion, of course,
	people will interrupt, they will restart, they

	0:04:15.430 --> 0:04:22.971
	will think while they speak, and you know that
	sometimes you tell people first think and speak

	0:04:22.971 --> 0:04:26.225
	because they are changing their opinion.

	0:04:26.606 --> 0:04:31.346
	So the question of how can you deal with this?

	0:04:31.346 --> 0:04:37.498
	And there again it might be solutions for
	that, or at least.

	0:04:39.759 --> 0:04:46.557
	Then for the output we will look into simultaneous
	translation that is at least not very important

	0:04:46.557 --> 0:04:47.175
	in text.

	0:04:47.175 --> 0:04:53.699
	There might be some cases but normally you
	have all text available and then you're translating

	0:04:53.699 --> 0:04:54.042
	and.

	0:04:54.394 --> 0:05:09.220
	While for speech translation, since it's often
	a life interaction, then of course it's important.

	0:05:09.149 --> 0:05:12.378
	Otherwise it's hard to follow.

	0:05:12.378 --> 0:05:19.463
	You see what said five minutes ago and the
	slide is not as helpful.

	0:05:19.739 --> 0:05:35.627
	You have to wait very long before you can
	answer because you have to first wait for what

	0:05:35.627 --> 0:05:39.197
	is happening there.

	0:05:40.660 --> 0:05:46.177
	And finally, we can talk a bit about presentation.

	0:05:46.177 --> 0:05:54.722
	For example, mentioned that if you're generating
	subtitles, it's not possible.

	0:05:54.854 --> 0:06:01.110
	So in professional subtitles there are clear
	rules.

	0:06:01.110 --> 0:06:05.681
	Subtitle has to be shown for seconds.

	0:06:05.681 --> 0:06:08.929
	It's maximum of two lines.

	0:06:09.549 --> 0:06:13.156
	Because otherwise it's getting too long, it's
	not able to read it anymore, and so.

	0:06:13.613 --> 0:06:19.826
	So if you want to achieve that, of course,
	you might have to adjust and select what you

	0:06:19.826 --> 0:06:20.390
	really.

	0:06:23.203 --> 0:06:28.393
	The first date starts with the segmentation.

	0:06:28.393 --> 0:06:36.351
	On the one end it's an issue while training,
	on the other hand it's.

	0:06:38.678 --> 0:06:47.781
	What is the problem so when we train it's
	relatively easy to separate our data into sentence

	0:06:47.781 --> 0:06:48.466
	level.

	0:06:48.808 --> 0:07:02.241
	So if you have your example, you have the
	audio and the text, then you typically know

	0:07:02.241 --> 0:07:07.083
	that this sentence is aligned.

	0:07:07.627 --> 0:07:16.702
	You can use these time information to cut
	your audio and then you can train and then.

	0:07:18.018 --> 0:07:31.775
	Because what we need for an enchilada model
	is to be an output chart, in this case an audio

	0:07:31.775 --> 0:07:32.822
	chart.

	0:07:33.133 --> 0:07:38.551
	And even if this is a long speech, it's easy
	then since we have this time information to

	0:07:38.551 --> 0:07:39.159
	separate.

	0:07:39.579 --> 0:07:43.866
	But we are using therefore, of course, the
	target side information.

	0:07:45.865 --> 0:07:47.949
	The problem is now in runtime.

	0:07:47.949 --> 0:07:49.427
	This is not possible.

	0:07:49.427 --> 0:07:55.341
	Here we can do that based on the calculation
	marks and the sentence segmentation on the

	0:07:55.341 --> 0:07:57.962
	target side because that is splitting.

	0:07:57.962 --> 0:08:02.129
	But during transcript, during translation
	it is not possible.

	0:08:02.442 --> 0:08:10.288
	Because there is just a long audio signal,
	and of course if you have your test data to

	0:08:10.288 --> 0:08:15.193
	split it into: That has been done for some
	experience.

	0:08:15.193 --> 0:08:22.840
	It's fine, but it's not a realistic scenario
	because if you really apply it in real world,

	0:08:22.840 --> 0:08:25.949
	we won't have a manual segmentation.

	0:08:26.266 --> 0:08:31.838
	If a human has to do that then he can do the
	translation so you want to have a full automatic

	0:08:31.838 --> 0:08:32.431
	pipeline.

	0:08:32.993 --> 0:08:38.343
	So the question is how can we deal with this
	type of you know?

	0:09:09.309 --> 0:09:20.232
	So the question is how can we deal with this
	time of situation and how can we segment the

	0:09:20.232 --> 0:09:23.024
	audio into some units?

	0:09:23.863 --> 0:09:32.495
	And here is one further really big advantage
	of a cascaded sauce: Because how is this done

	0:09:32.495 --> 0:09:34.259
	in a cascade of systems?

	0:09:34.259 --> 0:09:38.494
	We are splitting the audio with some features
	we are doing.

	0:09:38.494 --> 0:09:42.094
	We can use similar ones which we'll discuss
	later.

	0:09:42.094 --> 0:09:43.929
	Then we run against chin.

	0:09:43.929 --> 0:09:48.799
	We have the transcript and then we can do
	what we talked last about.

	0:09:49.069 --> 0:10:02.260
	So if you have this is an audio signal and
	the training data it was good.

	0:10:02.822 --> 0:10:07.951
	So here we have a big advantage.

	0:10:07.951 --> 0:10:16.809
	We can use a different segmentation for the
	and for the.

	0:10:16.809 --> 0:10:21.316
	Why is that a big advantage?

	0:10:23.303 --> 0:10:34.067
	Will say for a team task is more important
	because we can then do the sentence transformation.

	0:10:34.955 --> 0:10:37.603
	See and Yeah, We Can Do the Same Thing.

	0:10:37.717 --> 0:10:40.226
	To save us, why is it not as important for
	us?

	0:10:40.226 --> 0:10:40.814
	Are maybe.

	0:10:43.363 --> 0:10:48.589
	We don't need that much context.

	0:10:48.589 --> 0:11:01.099
	We only try to restrict the word, but the
	context to consider is mainly small.

	0:11:03.283 --> 0:11:11.419
	Would agree with it in more context, but there
	is one more important: its.

	0:11:11.651 --> 0:11:16.764
	The is monotone, so there's no reordering.

	0:11:16.764 --> 0:11:22.472
	The second part of the signal is no reordering.

	0:11:22.472 --> 0:11:23.542
	We have.

	0:11:23.683 --> 0:11:29.147
	And of course if we are doing that we cannot
	really order across boundaries between segments.

	0:11:29.549 --> 0:11:37.491
	It might be challenging if we split the words
	so that it's not perfect for so that.

	0:11:37.637 --> 0:11:40.846
	But we need to do quite long range reordering.

	0:11:40.846 --> 0:11:47.058
	If you think about the German where the work
	has moved, and now the English work is in one

	0:11:47.058 --> 0:11:50.198
	part, but the end of the sentence is another.

	0:11:50.670 --> 0:11:59.427
	And of course this advantage we have now here
	that if we have a segment we have.

	0:12:01.441 --> 0:12:08.817
	And that this segmentation is important.

	0:12:08.817 --> 0:12:15.294
	Here are some motivations for that.

	0:12:15.675 --> 0:12:25.325
	What you are doing is you are taking the reference
	text and you are segmenting.

	0:12:26.326 --> 0:12:30.991
	And then, of course, your segments are exactly
	yeah cute.

	0:12:31.471 --> 0:12:42.980
	If you're now using different segmentation
	strategies, you're using significantly in blue

	0:12:42.980 --> 0:12:44.004
	points.

	0:12:44.004 --> 0:12:50.398
	If the segmentation is bad, you have a lot
	worse.

	0:12:52.312 --> 0:13:10.323
	And interesting, here you ought to see how
	it was a human, but people have in a competition.

	0:13:10.450 --> 0:13:22.996
	You can see that by working on the segmentation
	and using better segmentation you can improve

	0:13:22.996 --> 0:13:25.398
	your performance.

	0:13:26.006 --> 0:13:29.932
	So it's really essential.

	0:13:29.932 --> 0:13:41.712
	One other interesting thing is if you're looking
	into the difference between.

	0:13:42.082 --> 0:13:49.145
	So it really seems to be more important to
	have a good segmentation for our cascaded system.

	0:13:49.109 --> 0:13:56.248
	For an intra-end system because there you
	can't re-segment while it is less important

	0:13:56.248 --> 0:13:58.157
	for a cascaded system.

	0:13:58.157 --> 0:14:05.048
	Of course, it's still important, but the difference
	between the two segmentations.

	0:14:06.466 --> 0:14:18.391
	It was a shared task some years ago like it's
	just one system from different.

	0:14:22.122 --> 0:14:31.934
	So the question is how can we deal with this
	in speech translation and what people look

	0:14:31.934 --> 0:14:32.604
	into?

	0:14:32.752 --> 0:14:48.360
	Now we want to use different techniques to
	split the audio signal into segments.

	0:14:48.848 --> 0:14:54.413
	You have the disadvantage that you can't change
	it.

	0:14:54.413 --> 0:15:00.407
	Therefore, some of the quality might be more
	important.

	0:15:00.660 --> 0:15:15.678
	But in both cases, of course, the A's are
	better if you have a good segmentation.

	0:15:17.197 --> 0:15:23.149
	So any idea, how would you have this task
	now split this audio?

	0:15:23.149 --> 0:15:26.219
	What type of tool would you use?

	0:15:28.648 --> 0:15:41.513
	The fuse was a new network to segment half
	for instance supervise.

	0:15:41.962 --> 0:15:44.693
	Yes, that's exactly already the better system.

	0:15:44.693 --> 0:15:50.390
	So for long time people have done more simple
	things because we'll come to that a bit challenging

	0:15:50.390 --> 0:15:52.250
	as creating or having the data.

	0:15:53.193 --> 0:16:00.438
	The first thing is you use some tool out of
	the box like voice activity detection which

	0:16:00.438 --> 0:16:07.189
	has been there as a whole research field so
	people find when somebody's speaking.

	0:16:07.647 --> 0:16:14.952
	And then you use that in this different threshold
	you always have the ability that somebody's

	0:16:14.952 --> 0:16:16.273
	speaking or not.

	0:16:17.217 --> 0:16:19.889
	Then you split your signal.

	0:16:19.889 --> 0:16:26.762
	It will not be perfect, but you transcribe
	or translate each component.

	0:16:28.508 --> 0:16:39.337
	But as you see, a supervised classification
	task is even better, and that is now the most

	0:16:39.337 --> 0:16:40.781
	common use.

	0:16:41.441 --> 0:16:49.909
	The supervisor is doing that as a supervisor
	classification and then you'll try to use this

	0:16:49.909 --> 0:16:50.462
	type.

	0:16:50.810 --> 0:16:53.217
	We're going into a bit more detail on how
	to do that.

	0:16:53.633 --> 0:17:01.354
	So what you need to do first is, of course,
	you have to have some labels whether this is

	0:17:01.354 --> 0:17:03.089
	an end of sentence.

	0:17:03.363 --> 0:17:10.588
	You do that by using the alignment between
	the segments and the audio.

	0:17:10.588 --> 0:17:12.013
	You have the.

	0:17:12.212 --> 0:17:15.365
	The two people have not for each word, so
	these tank steps.

	0:17:15.365 --> 0:17:16.889
	This word is said this time.

	0:17:17.157 --> 0:17:27.935
	This word is said by what you typically have
	from this time to time to time.

	0:17:27.935 --> 0:17:34.654
	We have the second segment, the second segment.

	0:17:35.195 --> 0:17:39.051
	Which also used to trade for example your
	advanced system and everything.

	0:17:41.661 --> 0:17:53.715
	Based on that you can label each frame in
	there so if you have a green or blue that is

	0:17:53.715 --> 0:17:57.455
	our speech segment so you.

	0:17:58.618 --> 0:18:05.690
	And these labels will then later help you,
	but you extract exactly these types of.

	0:18:07.067 --> 0:18:08.917
	There's one big challenge.

	0:18:08.917 --> 0:18:15.152
	If you have two sentences which are directly
	connected to each other, then if you're doing

	0:18:15.152 --> 0:18:18.715
	this labeling, you would not have a break in
	later.

	0:18:18.715 --> 0:18:23.512
	If you tried to extract that, there should
	be something great or not.

	0:18:23.943 --> 0:18:31.955
	So what you typically do is in the last frame.

	0:18:31.955 --> 0:18:41.331
	You mark as outside, although it's not really
	outside.

	0:18:43.463 --> 0:18:46.882
	Yes, I guess you could also do that in more
	of a below check.

	0:18:46.882 --> 0:18:48.702
	I mean, this is the most simple.

	0:18:48.702 --> 0:18:51.514
	It's like inside outside, so it's related
	to that.

	0:18:51.514 --> 0:18:54.988
	Of course, you could have an extra startup
	segment, and so on.

	0:18:54.988 --> 0:18:57.469
	I guess this is just to make it more simple.

	0:18:57.469 --> 0:19:00.226
	You only have two labels, not a street classroom.

	0:19:00.226 --> 0:19:02.377
	But yeah, you could do similar things.

	0:19:12.432 --> 0:19:20.460
	Has caused down the roads to problems because
	it could be an important part of a segment

	0:19:20.460 --> 0:19:24.429
	which has some meaning and we do something.

	0:19:24.429 --> 0:19:28.398
	The good thing is frames are normally very.

	0:19:28.688 --> 0:19:37.586
	Like some milliseconds, so normally if you
	remove some milliseconds you can still understand

	0:19:37.586 --> 0:19:38.734
	everything.

	0:19:38.918 --> 0:19:46.999
	Mean the speech signal is very repetitive,
	and so you have information a lot of times.

	0:19:47.387 --> 0:19:50.730
	That's why we talked along there last time
	they could try to shrink the steak and.

	0:19:51.031 --> 0:20:00.995
	If you now have a short sequence where there
	is like which would be removed and that's not

	0:20:00.995 --> 0:20:01.871
	really.

	0:20:02.162 --> 0:20:06.585
	Yeah, but it's not a full letter is missing.

	0:20:06.585 --> 0:20:11.009
	It's like only the last ending of the vocal.

	0:20:11.751 --> 0:20:15.369
	Think it doesn't really happen.

	0:20:15.369 --> 0:20:23.056
	We have our audio signal and we have these
	gags that are not above.

	0:20:23.883 --> 0:20:29.288
	With this blue rectangulars the inside speech
	segment and with the guess it's all set yes.

	0:20:29.669 --> 0:20:35.736
	So then you have the full signal and you're
	meaning now labeling your task as a blue or

	0:20:35.736 --> 0:20:36.977
	white prediction.

	0:20:36.977 --> 0:20:39.252
	So that is your prediction task.

	0:20:39.252 --> 0:20:44.973
	You have the audio signal only and your prediction
	task is like label one or zero.

	0:20:45.305 --> 0:20:55.585
	Once you do that then based on this labeling
	you can extract each segment again like each

	0:20:55.585 --> 0:20:58.212
	consecutive blue area.

	0:20:58.798 --> 0:21:05.198
	See then removed maybe the non-speaking part
	already and duo speech translation only on

	0:21:05.198 --> 0:21:05.998
	the parts.

	0:21:06.786 --> 0:21:19.768
	Which is good because the training would have
	done similarly.

	0:21:20.120 --> 0:21:26.842
	So on the noise in between you never saw in
	the training, so it's good to throw it away.

	0:21:29.649 --> 0:21:34.930
	One challenge, of course, is now if you're
	doing that, what is your input?

	0:21:34.930 --> 0:21:40.704
	You cannot do the sequence labeling normally
	on the whole talk, so it's too long.

	0:21:40.704 --> 0:21:46.759
	So if you're doing this prediction of the
	label, you also have a window for which you

	0:21:46.759 --> 0:21:48.238
	do the segmentation.

	0:21:48.788 --> 0:21:54.515
	And that's the bedline we have in the punctuation
	prediction.

	0:21:54.515 --> 0:22:00.426
	If we don't have good borders, random splits
	are normally good.

	0:22:00.426 --> 0:22:03.936
	So what we do now is split the audio.

	0:22:04.344 --> 0:22:09.134
	So that would be our input, and then the part
	three would be our labels.

	0:22:09.269 --> 0:22:15.606
	This green would be the input and here we
	want, for example, blue labels and then white.

	0:22:16.036 --> 0:22:20.360
	Here only do labors and here at the beginning
	why maybe at the end why.

	0:22:21.401 --> 0:22:28.924
	So thereby you have now a fixed window always
	for which you're doing than this task of predicting.

	0:22:33.954 --> 0:22:43.914
	How you build your classifier that is based
	again.

	0:22:43.914 --> 0:22:52.507
	We had this wave to be mentioned last week.

	0:22:52.752 --> 0:23:00.599
	So in training you use labels to say whether
	it's in speech or outside speech.

	0:23:01.681 --> 0:23:17.740
	Inference: You give them always the chance
	and then predict whether this part like each

	0:23:17.740 --> 0:23:20.843
	label is afraid.

	0:23:23.143 --> 0:23:29.511
	Bit more complicated, so one challenge is
	if you randomly split off cognition, losing

	0:23:29.511 --> 0:23:32.028
	your context for the first brain.

	0:23:32.028 --> 0:23:38.692
	It might be very hard to predict whether this
	is now in or out of, and also for the last.

	0:23:39.980 --> 0:23:48.449
	You often need a bit of context whether this
	is audio or not, and at the beginning.

	0:23:49.249 --> 0:23:59.563
	So what you do is you put the audio in twice.

	0:23:59.563 --> 0:24:08.532
	You want to do it with splits and then.

	0:24:08.788 --> 0:24:15.996
	It is shown you have shifted the two offsets,
	so one is predicted with the other offset.

	0:24:16.416 --> 0:24:23.647
	And then averaging the probabilities so that
	at each time you have, at least for one of

	0:24:23.647 --> 0:24:25.127
	the predictions,.

	0:24:25.265 --> 0:24:36.326
	Because at the end of the second it might
	be very hard to predict whether this is now

	0:24:36.326 --> 0:24:39.027
	speech or nonspeech.

	0:24:39.939 --> 0:24:47.956
	Think it is a high parameter, but you are
	not optimizing it, so you just take two shifts.

	0:24:48.328 --> 0:24:54.636
	Of course try a lot of different shifts and
	so on.

	0:24:54.636 --> 0:24:59.707
	The thing is it's mainly a problem here.

	0:24:59.707 --> 0:25:04.407
	If you don't do two outsets you have.

	0:25:05.105 --> 0:25:14.761
	You could get better by doing that, but would
	be skeptical if it really matters, and also

	0:25:14.761 --> 0:25:18.946
	have not seen any experience in doing.

	0:25:19.159 --> 0:25:27.629
	Guess you're already good, you have maybe
	some arrows in there and you're getting.

	0:25:31.191 --> 0:25:37.824
	So with this you have your segmentation.

	0:25:37.824 --> 0:25:44.296
	However, there is a problem in between.

	0:25:44.296 --> 0:25:49.150
	Once the model is wrong then.

	0:25:49.789 --> 0:26:01.755
	The normal thing would be the first thing
	that you take some threshold and that you always

	0:26:01.755 --> 0:26:05.436
	label everything in speech.

	0:26:06.006 --> 0:26:19.368
	The problem is when you are just doing this
	one threshold that you might have.

	0:26:19.339 --> 0:26:23.954
	Those are the challenges.

	0:26:23.954 --> 0:26:31.232
	Short segments mean you have no context.

	0:26:31.232 --> 0:26:35.492
	The policy will be bad.

	0:26:37.077 --> 0:26:48.954
	Therefore, people use this probabilistic divided
	cocker algorithm, so the main idea is start

	0:26:48.954 --> 0:26:56.744
	with the whole segment, and now you split the
	whole segment.

	0:26:57.397 --> 0:27:09.842
	Then you split there and then you continue
	until each segment is smaller than the maximum

	0:27:09.842 --> 0:27:10.949
	length.

	0:27:11.431 --> 0:27:23.161
	But you can ignore some splits, and if you
	split one segment into two parts you first

	0:27:23.161 --> 0:27:23.980
	trim.

	0:27:24.064 --> 0:27:40.197
	So normally it's not only one signal position,
	it's a longer area of non-voice, so you try

	0:27:40.197 --> 0:27:43.921
	to find this longer.

	0:27:43.943 --> 0:27:51.403
	Now your large segment is split into two smaller
	segments.

	0:27:51.403 --> 0:27:56.082
	Now you are checking these segments.

	0:27:56.296 --> 0:28:04.683
	So if they are very, very short, it might
	be good not to spin at this point because you're

	0:28:04.683 --> 0:28:05.697
	ending up.

	0:28:06.006 --> 0:28:09.631
	And this way you continue all the time, and
	then hopefully you'll have a good stretch.

	0:28:10.090 --> 0:28:19.225
	So, of course, there's one challenge with
	this approach: if you think about it later,

	0:28:19.225 --> 0:28:20.606
	low latency.

	0:28:25.405 --> 0:28:31.555
	So in this case you have to have the full
	audio available.

	0:28:32.132 --> 0:28:38.112
	So you cannot continuously do that mean if
	you would do it just always.

	0:28:38.112 --> 0:28:45.588
	If the probability is higher you split but
	in this case you try to find a global optimal.

	0:28:46.706 --> 0:28:49.134
	A heuristic body.

	0:28:49.134 --> 0:28:58.170
	You find a global solution for your whole
	tar and not a local one.

	0:28:58.170 --> 0:29:02.216
	Where's the system most sure?

	0:29:02.802 --> 0:29:12.467
	So that's a bit of a challenge here, but the
	advantage of course is that in the end you

	0:29:12.467 --> 0:29:14.444
	have no segments.

	0:29:17.817 --> 0:29:23.716
	Any more questions like this.

	0:29:23.716 --> 0:29:36.693
	Then the next thing is we also need to evaluate
	in this scenario.

	0:29:37.097 --> 0:29:44.349
	So know machine translation is quite a long
	way.

	0:29:44.349 --> 0:29:55.303
	History now was the beginning of the semester,
	but hope you can remember.

	0:29:55.675 --> 0:30:09.214
	Might be with blue score, might be with comment
	or similar, but you need to have.

	0:30:10.310 --> 0:30:22.335
	But this assumes that you have this one-to-one
	match, so you always have an output and machine

	0:30:22.335 --> 0:30:26.132
	translation, which is nicely.

	0:30:26.506 --> 0:30:34.845
	So then it might be that our output has four
	segments, while our reference output has only

	0:30:34.845 --> 0:30:35.487
	three.

	0:30:36.756 --> 0:30:40.649
	And now is, of course, questionable like what
	should we compare in our metric.

	0:30:44.704 --> 0:30:53.087
	So it's no longer directly possible to directly
	do that because what should you compare?

	0:30:53.413 --> 0:31:00.214
	Just have four segments there and three segments
	there, and of course it seems to be that.

	0:31:00.920 --> 0:31:06.373
	The first one it likes to the first one when
	you see I can't speak Spanish, but you're an

	0:31:06.373 --> 0:31:09.099
	audience of the guests who is already there.

	0:31:09.099 --> 0:31:14.491
	So even like just a woman, the blue comparing
	wouldn't work, so you need to do something

	0:31:14.491 --> 0:31:17.157
	about that to take this type of evaluation.

	0:31:19.019 --> 0:31:21.727
	Still any suggestions what you could do.

	0:31:25.925 --> 0:31:44.702
	How can you calculate a blue score because
	you don't have one you want to see?

	0:31:45.925 --> 0:31:49.365
	Here you put another layer which spies to
	add in the second.

	0:31:51.491 --> 0:31:56.979
	It's even not aligning only, but that's one
	solution, so you need to align and resign.

	0:31:57.177 --> 0:32:06.886
	Because even if you have no alignment so this
	to this and this to that you see that it's

	0:32:06.886 --> 0:32:12.341
	not good because the audio would compare to
	that.

	0:32:13.453 --> 0:32:16.967
	That we'll discuss is even one simpler solution.

	0:32:16.967 --> 0:32:19.119
	Yes, it's a simpler solution.

	0:32:19.119 --> 0:32:23.135
	It's called document based blue or something
	like that.

	0:32:23.135 --> 0:32:25.717
	So you just take the full document.

	0:32:26.566 --> 0:32:32.630
	For some matrix it's good and it's not clear
	how good it is to the other, but there might

	0:32:32.630 --> 0:32:32.900
	be.

	0:32:33.393 --> 0:32:36.454
	Think of more simple metrics like blue.

	0:32:36.454 --> 0:32:40.356
	Do you have any idea what could be a disadvantage?

	0:32:49.249 --> 0:32:56.616
	Blue is matching ingrams so you start with
	the original.

	0:32:56.616 --> 0:33:01.270
	You check how many ingrams in here.

	0:33:01.901 --> 0:33:11.233
	If you're not doing that on the full document,
	you can also match grams from year to year.

	0:33:11.751 --> 0:33:15.680
	So you can match things very far away.

	0:33:15.680 --> 0:33:21.321
	Start doing translation and you just randomly
	randomly.

	0:33:22.142 --> 0:33:27.938
	And that, of course, could be a bit of a disadvantage
	or like is a problem, and therefore people

	0:33:27.938 --> 0:33:29.910
	also look into the segmentation.

	0:33:29.910 --> 0:33:34.690
	But I've recently seen some things, so document
	levels tours are also normally.

	0:33:34.690 --> 0:33:39.949
	If you have a relatively high quality system
	or state of the art, then they also have a

	0:33:39.949 --> 0:33:41.801
	good correlation of the human.

	0:33:46.546 --> 0:33:59.241
	So how are we doing that so we are putting
	end of sentence boundaries in there and then.

	0:33:59.179 --> 0:34:07.486
	Alignment based on a similar Livingston distance,
	so at a distance between our output and the

	0:34:07.486 --> 0:34:09.077
	reference output.

	0:34:09.449 --> 0:34:13.061
	And here is our boundary.

	0:34:13.061 --> 0:34:23.482
	We map the boundary based on the alignment,
	so in Lithuania you only have.

	0:34:23.803 --> 0:34:36.036
	And then, like all the words that are before,
	it might be since there is not a random.

	0:34:36.336 --> 0:34:44.890
	Mean it should be, but it can happen things
	like that, and it's not clear where.

	0:34:44.965 --> 0:34:49.727
	At the break, however, they are typically
	not that bad because they are words which are

	0:34:49.727 --> 0:34:52.270
	not matching between reference and hypothesis.

	0:34:52.270 --> 0:34:56.870
	So normally it doesn't really matter that
	much because they are anyway not matching.

	0:34:57.657 --> 0:35:05.888
	And then you take the mule as a T output and
	use that to calculate your metric.

	0:35:05.888 --> 0:35:12.575
	Then it's again a perfect alignment for which
	you can calculate.

	0:35:14.714 --> 0:35:19.229
	Any idea you could do it the other way around.

	0:35:19.229 --> 0:35:23.359
	You could resigment your reference to the.

	0:35:29.309 --> 0:35:30.368
	Which one would you select?

	0:35:34.214 --> 0:35:43.979
	I think segmenting the assertive also is much
	more natural because the reference sentence

	0:35:43.979 --> 0:35:46.474
	is the fixed solution.

	0:35:47.007 --> 0:35:52.947
	Yes, that's the right motivation if you do
	think about blue or so.

	0:35:52.947 --> 0:35:57.646
	Additionally important if you change your
	reference.

	0:35:57.857 --> 0:36:07.175
	You might have a different number of diagrams
	or diagrams because the sentences are different

	0:36:07.175 --> 0:36:08.067
	lengths.

	0:36:08.068 --> 0:36:15.347
	Here your five system, you're always comparing
	it to the same system, and you don't compare

	0:36:15.347 --> 0:36:16.455
	to different.

	0:36:16.736 --> 0:36:22.317
	The only different base of segmentation, but
	still it could make some do.

	0:36:25.645 --> 0:36:38.974
	Good, that's all about sentence segmentation,
	then a bit about disfluencies and what there

	0:36:38.974 --> 0:36:40.146
	really.

	0:36:42.182 --> 0:36:51.138
	So as said in daily life, you're not speaking
	like very nice full sentences every.

	0:36:51.471 --> 0:36:53.420
	He was speaking powerful sentences.

	0:36:53.420 --> 0:36:54.448
	We do repetitions.

	0:36:54.834 --> 0:37:00.915
	It's especially if it's more interactive,
	so in meetings, phone calls and so on.

	0:37:00.915 --> 0:37:04.519
	If you have multiple speakers, they also break.

	0:37:04.724 --> 0:37:16.651
	Each other, and then if you keep them, they
	are harder to translate because most of your

	0:37:16.651 --> 0:37:17.991
	training.

	0:37:18.278 --> 0:37:30.449
	It's also very difficult to read, so we'll
	have some examples there to transcribe everything

	0:37:30.449 --> 0:37:32.543
	as it was said.

	0:37:33.473 --> 0:37:36.555
	What type of things are there?

	0:37:37.717 --> 0:37:42.942
	So you have all these pillow works.

	0:37:42.942 --> 0:37:47.442
	These are very easy to remove.

	0:37:47.442 --> 0:37:52.957
	You can just use regular expressions.

	0:37:53.433 --> 0:38:00.139
	Is getting more difficult with some other
	type of filler works.

	0:38:00.139 --> 0:38:03.387
	In German you have this or in.

	0:38:04.024 --> 0:38:08.473
	And these ones you cannot just remove by regular
	expression.

	0:38:08.473 --> 0:38:15.039
	You shouldn't remove all yacht from a text
	because it might be very important information

	0:38:15.039 --> 0:38:15.768
	for well.

	0:38:15.715 --> 0:38:19.995
	It may be not as important as you are, but
	still it might be very important.

	0:38:20.300 --> 0:38:24.215
	So just removing them is there already more
	difficult.

	0:38:26.586 --> 0:38:29.162
	Then you have these repetitions.

	0:38:29.162 --> 0:38:32.596
	You have something like mean saw him there.

	0:38:32.596 --> 0:38:33.611
	There was a.

	0:38:34.334 --> 0:38:41.001
	And while for the first one that might be
	very easy to remove because you just look for

	0:38:41.001 --> 0:38:47.821
	double, the thing is that the repetition might
	not be exactly the same, so there is there

	0:38:47.821 --> 0:38:48.199
	was.

	0:38:48.199 --> 0:38:54.109
	So there is already getting a bit more complicated,
	of course still possible.

	0:38:54.614 --> 0:39:01.929
	You can remove Denver so the real sense would
	be like to have a ticket to Houston.

	0:39:02.882 --> 0:39:13.327
	But there the detection, of course, is getting
	more challenging as you want to get rid of.

	0:39:13.893 --> 0:39:21.699
	You don't have the data, of course, which
	makes all the tasks harder, but you probably

	0:39:21.699 --> 0:39:22.507
	want to.

	0:39:22.507 --> 0:39:24.840
	That's really meaningful.

	0:39:24.840 --> 0:39:26.185
	Current isn't.

	0:39:26.185 --> 0:39:31.120
	That is now a really good point and it's really
	there.

	0:39:31.051 --> 0:39:34.785
	The thing about what is your final task?

	0:39:35.155 --> 0:39:45.526
	If you want to have a transcript reading it,
	I'm not sure if we have another example.

	0:39:45.845 --> 0:39:54.171
	So there it's nicer if you have a clean transfer
	and if you see subtitles in, they're also not

	0:39:54.171 --> 0:39:56.625
	having all the repetitions.

	0:39:56.625 --> 0:40:03.811
	It's the nice way to shorten but also getting
	the structure you cannot even make.

	0:40:04.064 --> 0:40:11.407
	So in this situation, of course, they might
	give you information.

	0:40:11.407 --> 0:40:14.745
	There is a lot of stuttering.

	0:40:15.015 --> 0:40:22.835
	So in this case agree it might be helpful
	in some way, but meaning reading all the disfluencies

	0:40:22.835 --> 0:40:25.198
	is getting really difficult.

	0:40:25.198 --> 0:40:28.049
	If you have the next one, we have.

	0:40:28.308 --> 0:40:31.630
	That's a very long text.

	0:40:31.630 --> 0:40:35.883
	You need a bit of time to pass.

	0:40:35.883 --> 0:40:39.472
	This one is not important.

	0:40:40.480 --> 0:40:48.461
	It might be nice if you can start reading
	from here.

	0:40:48.461 --> 0:40:52.074
	Let's have a look here.

	0:40:52.074 --> 0:40:54.785
	Try to read this.

	0:40:57.297 --> 0:41:02.725
	You can understand it, but think you need
	a bit of time to really understand what was.

	0:41:11.711 --> 0:41:21.480
	And now we have the same text, but you have
	highlighted in bold, and not only read the

	0:41:21.480 --> 0:41:22.154
	bold.

	0:41:23.984 --> 0:41:25.995
	And ignore everything which is not bold.

	0:41:30.250 --> 0:41:49.121
	Would assume it's easier to read just the
	book part more faster and more faster.

	0:41:50.750 --> 0:41:57.626
	Yeah, it might be, but I'm not sure we have
	a master thesis of that.

	0:41:57.626 --> 0:41:59.619
	If seen my videos,.

	0:42:00.000 --> 0:42:09.875
	Of the recordings, I also have it more likely
	that it's like a fluent speak and I'm not like

	0:42:09.875 --> 0:42:12.318
	doing the hesitations.

	0:42:12.652 --> 0:42:23.764
	Don't know if somebody else has looked into
	the Cusera video, but notice that.

	0:42:25.005 --> 0:42:31.879
	For these videos spoke every minute, three
	times or something, and then people were there

	0:42:31.879 --> 0:42:35.011
	and cutting things and making hopefully.

	0:42:35.635 --> 0:42:42.445
	And therefore if you want to more achieve
	that, of course, no longer exactly what was

	0:42:42.445 --> 0:42:50.206
	happening, but if it more looks like a professional
	video, then you would have to do that and cut

	0:42:50.206 --> 0:42:50.998
	that out.

	0:42:50.998 --> 0:42:53.532
	But yeah, there are definitely.

	0:42:55.996 --> 0:42:59.008
	We're also going to do this thing again.

	0:42:59.008 --> 0:43:02.315
	First turn is like I'm going to have a very.

	0:43:02.422 --> 0:43:07.449
	Which in the end they start to slow down just
	without feeling as though they're.

	0:43:07.407 --> 0:43:10.212
	It's a good point for the next.

	0:43:10.212 --> 0:43:13.631
	There is not the one perfect solution.

	0:43:13.631 --> 0:43:20.732
	There's some work on destruction removal,
	but of course there's also disability.

	0:43:20.732 --> 0:43:27.394
	Removal is not that easy, so do you just remove
	that's in order everywhere.

	0:43:27.607 --> 0:43:29.708
	But how much like cleaning do you do?

	0:43:29.708 --> 0:43:31.366
	It's more a continuous thing.

	0:43:31.811 --> 0:43:38.211
	Is it more really you only remove stuff or
	are you also into rephrasing and here is only

	0:43:38.211 --> 0:43:38.930
	removing?

	0:43:39.279 --> 0:43:41.664
	But maybe you want to rephrase it.

	0:43:41.664 --> 0:43:43.231
	That's hearing better.

	0:43:43.503 --> 0:43:49.185
	So then it's going into what people are doing
	in style transfer.

	0:43:49.185 --> 0:43:52.419
	We are going from a speech style to.

	0:43:52.872 --> 0:44:07.632
	So there is more continuum, and of course
	Airconditioner is not the perfect solution,

	0:44:07.632 --> 0:44:10.722
	but exactly what.

	0:44:15.615 --> 0:44:19.005
	Yeah, we're challenging.

	0:44:19.005 --> 0:44:30.258
	You have examples where the direct copy is
	not as hard or is not exactly the same.

	0:44:30.258 --> 0:44:35.410
	That is, of course, more challenging.

	0:44:41.861 --> 0:44:49.889
	If it's getting really mean why it's so challenging,
	if it's really spontaneous even for the speaker,

	0:44:49.889 --> 0:44:55.634
	you need maybe even the video to really get
	that and at least the audio.

	0:45:01.841 --> 0:45:06.025
	Yeah what it also depends on.

	0:45:06.626 --> 0:45:15.253
	The purpose, of course, and very important
	thing is the easiest tasks just to removing.

	0:45:15.675 --> 0:45:25.841
	Of course you have to be very careful because
	if you remove some of the not, it's normally

	0:45:25.841 --> 0:45:26.958
	not much.

	0:45:27.227 --> 0:45:33.176
	But if you remove too much, of course, that's
	very, very bad because you're losing important.

	0:45:33.653 --> 0:45:46.176
	And this might be even more challenging if
	you think about rarer and unseen works.

	0:45:46.226 --> 0:45:56.532
	So when doing this removal, it's important
	to be careful and normally more conservative.

	0:46:03.083 --> 0:46:15.096
	Of course, also you have to again see if you're
	doing that now in a two step approach, not

	0:46:15.096 --> 0:46:17.076
	an end to end.

	0:46:17.076 --> 0:46:20.772
	So first you need a remote.

	0:46:21.501 --> 0:46:30.230
	But you have to somehow sing it in the whole
	type line.

	0:46:30.230 --> 0:46:36.932
	If you learn text or remove disfluencies,.

	0:46:36.796 --> 0:46:44.070
	But it might be that the ASR system is outputing
	something else or that it's more of an ASR

	0:46:44.070 --> 0:46:44.623
	error.

	0:46:44.864 --> 0:46:46.756
	So um.

	0:46:46.506 --> 0:46:52.248
	Just for example, if you do it based on language
	modeling scores, it might be that you're just

	0:46:52.248 --> 0:46:57.568
	the language modeling score because the has
	done some errors, so you really have to see

	0:46:57.568 --> 0:46:59.079
	the combination of that.

	0:46:59.419 --> 0:47:04.285
	And for example, we had like partial words.

	0:47:04.285 --> 0:47:06.496
	They are like some.

	0:47:06.496 --> 0:47:08.819
	We didn't have that.

	0:47:08.908 --> 0:47:18.248
	So these feelings cannot be that you start
	in the middle of the world and then you switch

	0:47:18.248 --> 0:47:19.182
	because.

	0:47:19.499 --> 0:47:23.214
	And of course, in text in perfect transcript,
	that's very easy to recognize.

	0:47:23.214 --> 0:47:24.372
	That's not a real word.

	0:47:24.904 --> 0:47:37.198
	However, when you really do it into an system,
	he will normally detect some type of word because

	0:47:37.198 --> 0:47:40.747
	he only can help the words.

	0:47:50.050 --> 0:48:03.450
	Example: We should think so if you have this
	in the transcript it's easy to detect as a

	0:48:03.450 --> 0:48:05.277
	disgusting.

	0:48:05.986 --> 0:48:11.619
	And then, of course, it's more challenging
	in a real world example where you have.

	0:48:12.492 --> 0:48:29.840
	Now to the approaches one thing is to really
	put it in between so you put your A's system.

	0:48:31.391 --> 0:48:45.139
	So what your task is like, so you have this
	text and the outputs in this text.

	0:48:45.565 --> 0:48:49.605
	There is different formulations of that.

	0:48:49.605 --> 0:48:54.533
	You might not be able to do everything like
	that.

	0:48:55.195 --> 0:49:10.852
	Or do you also allow, for example, rephrasing
	for reordering so in text you might have the

	0:49:10.852 --> 0:49:13.605
	word correctly.

	0:49:13.513 --> 0:49:24.201
	But the easiest thing is you only do it more
	like removing, so some things can be removed.

	0:49:29.049 --> 0:49:34.508
	Any ideas how to do that this is output.

	0:49:34.508 --> 0:49:41.034
	You have training data so we have training
	data.

	0:49:47.507 --> 0:49:55.869
	To put in with the spoon you can eat it even
	after it is out, but after the machine has.

	0:50:00.000 --> 0:50:05.511
	Was wearing rocks, so you have not just the
	shoes you remove but wearing them as input,

	0:50:05.511 --> 0:50:07.578
	as disfluent text and as output.

	0:50:07.578 --> 0:50:09.207
	It should be fueled text.

	0:50:09.207 --> 0:50:15.219
	It can be before or after recycling as you
	said, but you have this type of task, so technically

	0:50:15.219 --> 0:50:20.042
	how would you address this type of task when
	you have to solve this type of.

	0:50:24.364 --> 0:50:26.181
	That's exactly so.

	0:50:26.181 --> 0:50:28.859
	That's one way of doing it.

	0:50:28.859 --> 0:50:33.068
	It's a translation task and you train your.

	0:50:33.913 --> 0:50:34.683
	Can do.

	0:50:34.683 --> 0:50:42.865
	Then, of course, the bit of the challenge
	is that you automatically allow rephrasing

	0:50:42.865 --> 0:50:43.539
	stuff.

	0:50:43.943 --> 0:50:52.240
	Which of the one end is good so you have more
	opportunities but it might be also a bad thing

	0:50:52.240 --> 0:50:58.307
	because if you have more opportunities you
	have more opportunities.

	0:51:01.041 --> 0:51:08.300
	If you want to prevent that, it can also do
	more simple labeling, so for each word your

	0:51:08.300 --> 0:51:10.693
	label should not be removed.

	0:51:12.132 --> 0:51:17.658
	People have also been looked into parsley.

	0:51:17.658 --> 0:51:29.097
	You remember maybe the past trees at the beginning
	like the structure because the ideas.

	0:51:29.649 --> 0:51:45.779
	There's also more unsupervised approaches
	where you then phrase it as a style transfer

	0:51:45.779 --> 0:51:46.892
	task.

	0:51:50.310 --> 0:51:58.601
	At the last point since we have that yes,
	it has also been done in an end-to-end fashion

	0:51:58.601 --> 0:52:06.519
	so that it's really you have as input the audio
	signal and output you have than the.

	0:52:06.446 --> 0:52:10.750
	The text, without influence, is a clearly
	clear text.

	0:52:11.131 --> 0:52:19.069
	You model every single total, which of course
	has a big advantage.

	0:52:19.069 --> 0:52:25.704
	You can use these paralinguistic features,
	pauses, and.

	0:52:25.705 --> 0:52:34.091
	If you switch so you start something then
	oh it doesn't work continue differently so.

	0:52:34.374 --> 0:52:42.689
	So you can easily use in a fashion while in
	a cascade approach.

	0:52:42.689 --> 0:52:47.497
	As we saw there you have text input.

	0:52:49.990 --> 0:53:02.389
	But on the one end we have again, and in the
	more extreme case the problem before was endless.

	0:53:02.389 --> 0:53:06.957
	Of course there is even less data.

	0:53:11.611 --> 0:53:12.837
	Good.

	0:53:12.837 --> 0:53:30.814
	This is all about the input to a very more
	person, or maybe if you think about YouTube.

	0:53:32.752 --> 0:53:34.989
	Talk so this could use be very exciting.

	0:53:36.296 --> 0:53:42.016
	Is more viewed as style transferred.

	0:53:42.016 --> 0:53:53.147
	You can use ideas from machine translation
	where you have one language.

	0:53:53.713 --> 0:53:57.193
	So there is ways of trying to do this type
	of style transfer.

	0:53:57.637 --> 0:54:02.478
	Think is definitely also very promising to
	make it more and more fluent in a business.

	0:54:03.223 --> 0:54:17.974
	Because one major issue about all the previous
	ones is that you need training data and then

	0:54:17.974 --> 0:54:21.021
	you need training.

	0:54:21.381 --> 0:54:32.966
	So I mean, think that we are only really of
	data that we have for English.

	0:54:32.966 --> 0:54:39.453
	Maybe there is a very few data in German.

	0:54:42.382 --> 0:54:49.722
	Okay, then let's talk about low latency speech.

	0:54:50.270 --> 0:55:05.158
	So the idea is if we are doing life translation
	of a talker, so we want to start out.

	0:55:05.325 --> 0:55:23.010
	This is possible because there is typically
	some kind of monotony in many languages.

	0:55:24.504 --> 0:55:29.765
	And this is also what, for example, human
	interpreters are doing to have a really low

	0:55:29.765 --> 0:55:30.071
	leg.

	0:55:30.750 --> 0:55:34.393
	They are even going further.

	0:55:34.393 --> 0:55:40.926
	They guess what will be the ending of the
	sentence.

	0:55:41.421 --> 0:55:51.120
	Then they can already continue, although it's
	not sad it might be needed, but that is even

	0:55:51.120 --> 0:55:53.039
	more challenging.

	0:55:54.714 --> 0:55:58.014
	Why is it so difficult?

	0:55:58.014 --> 0:56:09.837
	There is this train of on the one end for
	a and you want to have more context because

	0:56:09.837 --> 0:56:14.511
	we learn if we have more context.

	0:56:15.015 --> 0:56:24.033
	And therefore to have more contacts you have
	to wait as long as possible.

	0:56:24.033 --> 0:56:27.689
	The best is to have the full.

	0:56:28.168 --> 0:56:35.244
	On the other hand, you want to have a low
	latency for the user to wait to generate as

	0:56:35.244 --> 0:56:35.737
	soon.

	0:56:36.356 --> 0:56:47.149
	So if you're doing no situation you have to
	find the best way to start in order to have

	0:56:47.149 --> 0:56:48.130
	a good.

	0:56:48.728 --> 0:56:52.296
	There's no longer the perfect solution.

	0:56:52.296 --> 0:56:56.845
	People will also evaluate what is the translation.

	0:56:57.657 --> 0:57:09.942
	While it's challenging in German to English,
	German has this very nice thing where the prefix

	0:57:09.942 --> 0:57:16.607
	of the word can be put at the end of the sentence.

	0:57:17.137 --> 0:57:24.201
	And you only know if the person registers
	or cancels his station at the end of the center.

	0:57:24.985 --> 0:57:33.690
	So if you want to start the translation in
	English you need to know at this point is the.

	0:57:35.275 --> 0:57:39.993
	So you would have to wait until the end of
	the year.

	0:57:39.993 --> 0:57:42.931
	That's not really what you want.

	0:57:43.843 --> 0:57:45.795
	What happened.

	0:57:47.207 --> 0:58:12.550
	Other solutions of doing that are: Have been
	motivating like how we can do that subject

	0:58:12.550 --> 0:58:15.957
	object or subject work.

	0:58:16.496 --> 0:58:24.582
	In German it's not always subject, but there
	are relative sentence where you have that,

	0:58:24.582 --> 0:58:25.777
	so it needs.

	0:58:28.808 --> 0:58:41.858
	How we can do that is, we'll look today into
	three ways of doing that.

	0:58:41.858 --> 0:58:46.269
	The one is to mitigate.

	0:58:46.766 --> 0:58:54.824
	And then the IVAR idea is to do retranslating,
	and there you can now use the text output.

	0:58:54.934 --> 0:59:02.302
	So the idea is you translate, and if you later
	notice it was wrong then you can retranslate

	0:59:02.302 --> 0:59:03.343
	and correct.

	0:59:03.803 --> 0:59:14.383
	Or you can do what is called extremely coding,
	so you can generically.

	0:59:17.237 --> 0:59:30.382
	Let's start with the optimization, so if you
	have a sentence, it may reach a conference,

	0:59:30.382 --> 0:59:33.040
	and in this time.

	0:59:32.993 --> 0:59:39.592
	So you have a good translation quality while
	still having low latency.

	0:59:39.699 --> 0:59:50.513
	You have an extra model which does your segmentation
	before, but your aim is not to have a segmentation.

	0:59:50.470 --> 0:59:53.624
	But you can somehow measure in training data.

	0:59:53.624 --> 0:59:59.863
	If do these types of segment lengths, that's
	my latency and that's my translation quality,

	0:59:59.863 --> 1:00:02.811
	and then you can try to search a good way.

	1:00:03.443 --> 1:00:20.188
	If you're doing that one, it's an extra component,
	so you can use your system as it was.

	1:00:22.002 --> 1:00:28.373
	The other idea is to directly output the first
	high processes always, so always when you have

	1:00:28.373 --> 1:00:34.201
	text or audio we translate, and if we then
	have more context available we can update.

	1:00:35.015 --> 1:00:50.195
	So imagine before, if get an eye register
	and there's a sentence continued, then.

	1:00:50.670 --> 1:00:54.298
	So you change the output.

	1:00:54.298 --> 1:01:07.414
	Of course, that might be also leading to bad
	user experience if you always flicker and change

	1:01:07.414 --> 1:01:09.228
	your output.

	1:01:09.669 --> 1:01:15.329
	The bit like human interpreters also are able
	to correct, so they're doing a more long text.

	1:01:15.329 --> 1:01:20.867
	If they are guessing how to continue to say
	and then he's saying something different, they

	1:01:20.867 --> 1:01:22.510
	also have to correct them.

	1:01:22.510 --> 1:01:26.831
	So here, since it's not all you, we can even
	change what we have said.

	1:01:26.831 --> 1:01:29.630
	Yes, that's exactly what we have implemented.

	1:01:31.431 --> 1:01:49.217
	So how that works is, we are aware, and then
	we translate it, and if we get more input like

	1:01:49.217 --> 1:01:51.344
	you, then.

	1:01:51.711 --> 1:02:00.223
	And so we can always continue to do that and
	improve the transcript that we have.

	1:02:00.480 --> 1:02:07.729
	So in the end we have the lowest possible
	latency because we always output what is possible.

	1:02:07.729 --> 1:02:14.784
	On the other hand, introducing a bit of a
	new problem is: There's another challenge when

	1:02:14.784 --> 1:02:20.061
	we first used that this one was first used
	for old and that it worked fine.

	1:02:20.061 --> 1:02:21.380
	You switch to NMT.

	1:02:21.380 --> 1:02:25.615
	You saw one problem that is even generating
	more flickering.

	1:02:25.615 --> 1:02:28.878
	The problem is the normal machine translation.

	1:02:29.669 --> 1:02:35.414
	So implicitly learn all the output that always
	ends with a dot, and it's always a full sentence.

	1:02:36.696 --> 1:02:42.466
	And this was even more important somewhere
	in the model than really what is in the input.

	1:02:42.983 --> 1:02:55.910
	So if you give him a partial sentence, it
	will still generate a full sentence.

	1:02:55.910 --> 1:02:58.201
	So encourage.

	1:02:58.298 --> 1:03:05.821
	It's like trying to just continue it somehow
	to a full sentence and if it's doing better

	1:03:05.821 --> 1:03:10.555
	guessing stuff then you have to even have more
	changes.

	1:03:10.890 --> 1:03:23.944
	So here we have a trained mismatch and that's
	maybe more a general important thing that the

	1:03:23.944 --> 1:03:28.910
	modem might learn a bit different.

	1:03:29.289 --> 1:03:32.636
	It's always ending with a dog, so you don't
	just guess something in general.

	1:03:33.053 --> 1:03:35.415
	So we have your trained test mismatch.

	1:03:38.918 --> 1:03:41.248
	And we have a trained test message.

	1:03:41.248 --> 1:03:43.708
	What is the best way to address that?

	1:03:46.526 --> 1:03:51.934
	That's exactly the right, so we have to like
	train also on that.

	1:03:52.692 --> 1:03:55.503
	The problem is for particle sentences.

	1:03:55.503 --> 1:03:59.611
	There's not training data, so it's hard to
	find all our.

	1:04:00.580 --> 1:04:06.531
	Hi, I'm ransom quite easy to generate artificial
	pottery scent or at least for the source.

	1:04:06.926 --> 1:04:15.367
	So you just take, you take all the prefixes
	of the source data.

	1:04:17.017 --> 1:04:22.794
	On the problem of course, with a bit what
	do you know lying?

	1:04:22.794 --> 1:04:30.845
	If you have a sentence, I encourage all of
	what should be the right target for that.

	1:04:31.491 --> 1:04:45.381
	And the constraints on the one hand, it should
	be as long as possible, so you always have

	1:04:45.381 --> 1:04:47.541
	a long delay.

	1:04:47.687 --> 1:04:55.556
	On the other hand, it should be also a suspect
	of the previous ones, and it should be not

	1:04:55.556 --> 1:04:57.304
	too much inventing.

	1:04:58.758 --> 1:05:02.170
	A very easy solution works fine.

	1:05:02.170 --> 1:05:05.478
	You can just do a length space.

	1:05:05.478 --> 1:05:09.612
	You also take two thirds of the target.

	1:05:10.070 --> 1:05:19.626
	His learning then implicitly to guess a bit
	if you think about the beginning of example.

	1:05:20.000 --> 1:05:30.287
	This one, if you do two sorts like half, in
	this case the target would be eye register.

	1:05:30.510 --> 1:05:39.289
	So you're doing a bit of implicit guessing,
	and if it's getting wrong you have rewriting,

	1:05:39.289 --> 1:05:43.581
	but you're doing a good amount of guessing.

	1:05:49.849 --> 1:05:53.950
	In addition, this would be like how it looks
	like if it was like.

	1:05:53.950 --> 1:05:58.300
	If it wasn't a housing game, then the target
	could be something like.

	1:05:58.979 --> 1:06:02.513
	One problem is that you just do that this
	way.

	1:06:02.513 --> 1:06:04.619
	It's most of your training.

	1:06:05.245 --> 1:06:11.983
	And in the end you're interested in the overall
	translation quality, so for full sentence.

	1:06:11.983 --> 1:06:19.017
	So if you train on that, it will mainly learn
	how to translate prefixes because ninety percent

	1:06:19.017 --> 1:06:21.535
	or more of your data is prefixed.

	1:06:22.202 --> 1:06:31.636
	That's why we'll see that it's better to do
	like a ratio.

	1:06:31.636 --> 1:06:39.281
	So half your training data are full sentences.

	1:06:39.759 --> 1:06:47.693
	Because if you're doing this well you see
	that for every word prefix and only one sentence.

	1:06:48.048 --> 1:06:52.252
	You also see that nicely here here are both.

	1:06:52.252 --> 1:06:56.549
	This is the blue scores and you see the bass.

	1:06:58.518 --> 1:06:59.618
	Is this one?

	1:06:59.618 --> 1:07:03.343
	It has a good quality because it's trained.

	1:07:03.343 --> 1:07:11.385
	If you know, train with all the partial sentences
	is more focusing on how to translate partial

	1:07:11.385 --> 1:07:12.316
	sentences.

	1:07:12.752 --> 1:07:17.840
	Because all the partial sentences will at
	some point be removed, because at the end you

	1:07:17.840 --> 1:07:18.996
	translate the full.

	1:07:20.520 --> 1:07:24.079
	There's many tasks to read, but you have the
	same performances.

	1:07:24.504 --> 1:07:26.938
	On the other hand, you see here the other
	problem.

	1:07:26.938 --> 1:07:28.656
	This is how many words got updated.

	1:07:29.009 --> 1:07:31.579
	You want to have as few updates as possible.

	1:07:31.579 --> 1:07:34.891
	Updates need to remove things which are once
	being shown.

	1:07:35.255 --> 1:07:40.538
	This is quite high for the baseline.

	1:07:40.538 --> 1:07:50.533
	If you know the partials that are going down,
	they should be removed.

	1:07:51.151 --> 1:07:58.648
	And then for moody tasks you have a bit like
	the best note of swim.

	1:08:02.722 --> 1:08:05.296
	Any more questions to this type of.

	1:08:09.309 --> 1:08:20.760
	The last thing is that you want to do an extremely.

	1:08:21.541 --> 1:08:23.345
	Again, it's a bit implication.

	1:08:23.345 --> 1:08:25.323
	Scenario is what you really want.

	1:08:25.323 --> 1:08:30.211
	As you said, we sometimes use this updating,
	and for text output it'd be very nice.

	1:08:30.211 --> 1:08:35.273
	But imagine if you want to audio output, of
	course you can't change it anymore because

	1:08:35.273 --> 1:08:37.891
	on one side you cannot change what was said.

	1:08:37.891 --> 1:08:40.858
	So in this time you more need like a fixed
	output.

	1:08:41.121 --> 1:08:47.440
	And then the style of street decoding is interesting.

	1:08:47.440 --> 1:08:55.631
	Where you, for example, get sourced, the seagullins
	are so stoked in.

	1:08:55.631 --> 1:09:00.897
	Then you decide oh, now it's better to wait.

	1:09:01.041 --> 1:09:14.643
	So you somehow need to have this type of additional
	information.

	1:09:15.295 --> 1:09:23.074
	Here you have to decide should know I'll put
	a token or should wait for my and feel.

	1:09:26.546 --> 1:09:32.649
	So you have to do this additional labels like
	weight, weight, output, output, wage and so

	1:09:32.649 --> 1:09:32.920
	on.

	1:09:33.453 --> 1:09:38.481
	There are different ways of doing that.

	1:09:38.481 --> 1:09:45.771
	You can have an additional model that does
	this decision.

	1:09:46.166 --> 1:09:53.669
	And then have a higher quality or better to
	continue and then have a lower latency in this

	1:09:53.669 --> 1:09:54.576
	different.

	1:09:55.215 --> 1:09:59.241
	Surprisingly, a very easy task also works,
	sometimes quite good.

	1:10:03.043 --> 1:10:10.981
	And that is the so called way care policy
	and the idea is there at least for text to

	1:10:10.981 --> 1:10:14.623
	text translation that is working well.

	1:10:14.623 --> 1:10:22.375
	It's like you wait for words and then you
	always output one and like one for each.

	1:10:22.682 --> 1:10:28.908
	So your weight slow works at the beginning
	of the sentence, and every time a new board

	1:10:28.908 --> 1:10:29.981
	is coming you.

	1:10:31.091 --> 1:10:39.459
	So you have the same times to beat as input,
	so you're not legging more or less, but to

	1:10:39.459 --> 1:10:41.456
	have enough context.

	1:10:43.103 --> 1:10:49.283
	Of course this for example for the unmarried
	will not solve it perfectly but if you have

	1:10:49.283 --> 1:10:55.395
	a bit of local reordering inside your token
	that you can manage very well and then it's

	1:10:55.395 --> 1:10:57.687
	a very simple solution but it's.

	1:10:57.877 --> 1:11:00.481
	The other one was dynamic.

	1:11:00.481 --> 1:11:06.943
	Depending on the context you can decide how
	long you want to wait.

	1:11:07.687 --> 1:11:21.506
	It also only works if you have a similar amount
	of tokens, so if your target is very short

	1:11:21.506 --> 1:11:22.113
	of.

	1:11:22.722 --> 1:11:28.791
	That's why it's also more challenging for
	audio input because the speaking rate is changing

	1:11:28.791 --> 1:11:29.517
	and so on.

	1:11:29.517 --> 1:11:35.586
	You would have to do something like I'll output
	a word for every second a year or something

	1:11:35.586 --> 1:11:35.981
	like.

	1:11:36.636 --> 1:11:45.459
	The problem is that the audio speaking speed
	is not like fixed but quite very, and therefore.

	1:11:50.170 --> 1:11:58.278
	Therefore, what you can also do is you can
	use a similar solution than we had before with

	1:11:58.278 --> 1:11:59.809
	the resetteling.

	1:12:00.080 --> 1:12:02.904
	You remember we were re-decoded all the time.

	1:12:03.423 --> 1:12:12.253
	And you can do something similar in this case
	except that you add something in that you're

	1:12:12.253 --> 1:12:16.813
	saying, oh, if I read it cold, I'm not always.

	1:12:16.736 --> 1:12:22.065
	Can decode as I want, but you can do this
	target prefix decoding, so what you say is

	1:12:22.065 --> 1:12:23.883
	in your achievement section.

	1:12:23.883 --> 1:12:26.829
	You can easily say generate a translation
	bus.

	1:12:27.007 --> 1:12:29.810
	The translation has to start with the prefix.

	1:12:31.251 --> 1:12:35.350
	How can you do that?

	1:12:39.839 --> 1:12:49.105
	In the decoder exactly you start, so if you
	do beam search you select always the most probable.

	1:12:49.349 --> 1:12:57.867
	And now you say oh, I'm not selecting the
	most perfect, but this is the fourth, so in

	1:12:57.867 --> 1:13:04.603
	the first step have to take this one, in the
	second start decoding.

	1:13:04.884 --> 1:13:09.387
	And then you're making sure that your second
	always starts with this prefix.

	1:13:10.350 --> 1:13:18.627
	And then you can use your immediate retranslation,
	but you're no longer changing the output.

	1:13:19.099 --> 1:13:31.595
	Out as it works, so it may get a speech signal
	and input, and it is not outputing any.

	1:13:32.212 --> 1:13:45.980
	So then if you got you get a translation maybe
	and then you decide yes output.

	1:13:46.766 --> 1:13:54.250
	And then you're translating as one as two
	as sweet as four, but now you say generate

	1:13:54.250 --> 1:13:55.483
	only outputs.

	1:13:55.935 --> 1:14:07.163
	And then you're translating and maybe you're
	deciding on and now a good translation.

	1:14:07.163 --> 1:14:08.880
	Then you're.

	1:14:09.749 --> 1:14:29.984
	Yes, but don't get to worry about what the
	effect is.

	1:14:30.050 --> 1:14:31.842
	We're generating your target text.

	1:14:32.892 --> 1:14:36.930
	But we're not always outputing the full target
	text now.

	1:14:36.930 --> 1:14:43.729
	What we are having is we have here some strategy
	to decide: Oh, is a system already sure enough

	1:14:43.729 --> 1:14:44.437
	about it?

	1:14:44.437 --> 1:14:49.395
	If it's sure enough and it has all the information,
	we can output it.

	1:14:49.395 --> 1:14:50.741
	And then the next.

	1:14:51.291 --> 1:14:55.931
	If we say here sometimes with better not to
	get output we won't output it already.

	1:14:57.777 --> 1:15:06.369
	And thereby the hope is in the uphill model
	should not yet outcut a register because it

	1:15:06.369 --> 1:15:10.568
	doesn't mean no yet if it's a case or not.

	1:15:13.193 --> 1:15:18.056
	So what we have to discuss is what is a good
	output strategy.

	1:15:18.658 --> 1:15:20.070
	So you could do.

	1:15:20.070 --> 1:15:23.806
	The output strategy could be something like.

	1:15:23.743 --> 1:15:39.871
	If you think of weight cape, this is an output
	strategy here that you always input.

	1:15:40.220 --> 1:15:44.990
	Good, and you can view your weight in a similar
	way as.

	1:15:45.265 --> 1:15:55.194
	But now, of course, we can also look at other
	output strategies where it's more generic and

	1:15:55.194 --> 1:15:59.727
	it's deciding whether in some situations.

	1:16:01.121 --> 1:16:12.739
	And one thing that works quite well is referred
	to as local agreement, and that means you're

	1:16:12.739 --> 1:16:13.738
	always.

	1:16:14.234 --> 1:16:26.978
	Then you're looking what is the same thing
	between my current translation and the one

	1:16:26.978 --> 1:16:28.756
	did before.

	1:16:29.349 --> 1:16:31.201
	So let's do that again in six hours.

	1:16:31.891 --> 1:16:45.900
	So your input is a first audio segment and
	your title text is all model trains.

	1:16:46.346 --> 1:16:53.231
	Then you're getting six opposites, one and
	two, and this time the output is all models.

	1:16:54.694 --> 1:17:08.407
	You see trains are different, but both of
	them agree that it's all so in those cases.

	1:17:09.209 --> 1:17:13.806
	So we can be hopefully a big show that really
	starts with all.

	1:17:15.155 --> 1:17:22.604
	So now we say we're output all, so at this
	time instead we'll output all, although before.

	1:17:23.543 --> 1:17:27.422
	We are getting one, two, three as input.

	1:17:27.422 --> 1:17:35.747
	This time we have a prefix, so now we are
	only allowing translations to start with all.

	1:17:35.747 --> 1:17:42.937
	We cannot change that anymore, so we now need
	to generate some translation.

	1:17:43.363 --> 1:17:46.323
	And then it can be that its now all models
	are run.

	1:17:47.927 --> 1:18:01.908
	Then we compare here and see this agrees on
	all models so we can output all models.

	1:18:02.882 --> 1:18:07.356
	So this by we can dynamically decide is a
	model is very anxious.

	1:18:07.356 --> 1:18:10.178
	We always talk with something different.

	1:18:11.231 --> 1:18:24.872
	Then it's, we'll wait longer, it's more for
	the same thing, and hope we don't need to wait.

	1:18:30.430 --> 1:18:40.238
	Is it clear again that the signal wouldn't
	be able to detect?

	1:18:43.203 --> 1:18:50.553
	The hope it is because if it's not sure of,
	of course, it in this kind would have to switch

	1:18:50.553 --> 1:18:51.671
	all the time.

	1:18:56.176 --> 1:19:01.375
	So if it would be the first step to register
	and the second time to cancel and they may

	1:19:01.375 --> 1:19:03.561
	register again, they wouldn't do it.

	1:19:03.561 --> 1:19:08.347
	Of course, it is very short because in register
	a long time, then it can't deal.

	1:19:08.568 --> 1:19:23.410
	That's why there's two parameters that you
	can use and which might be important, or how.

	1:19:23.763 --> 1:19:27.920
	So you do it like every one second, every
	five seconds or something like that.

	1:19:28.648 --> 1:19:37.695
	Put it more often as your latency will be
	because your weight is less long, but also

	1:19:37.695 --> 1:19:39.185
	you might do.

	1:19:40.400 --> 1:19:50.004
	So that is the one thing and the other thing
	is for words you might do everywhere, but if

	1:19:50.004 --> 1:19:52.779
	you think about audio it.

	1:19:53.493 --> 1:20:04.287
	And the other question you can do like the
	agreement, so the model is sure.

	1:20:04.287 --> 1:20:10.252
	If you say have to agree, then hopefully.

	1:20:10.650 --> 1:20:21.369
	What we saw is think there has been a really
	normally good performance and otherwise your

	1:20:21.369 --> 1:20:22.441
	latency.

	1:20:22.963 --> 1:20:42.085
	Okay, we'll just make more tests and we'll
	get the confidence.

	1:20:44.884 --> 1:20:47.596
	Have to completely agree with that.

	1:20:47.596 --> 1:20:53.018
	So when this was done, that was our first
	idea of using the confidence.

	1:20:53.018 --> 1:21:00.248
	The problem is that currently that's my assumption
	is that the modeling the model confidence is

	1:21:00.248 --> 1:21:03.939
	not that easy, and they are often overconfident.

	1:21:04.324 --> 1:21:17.121
	In the paper there is this type also where
	you try to use the confidence in some way to

	1:21:17.121 --> 1:21:20.465
	decide the confidence.

	1:21:21.701 --> 1:21:26.825
	But that gave worse results, and that's why
	we looked into that.

	1:21:27.087 --> 1:21:38.067
	So it's a very good idea think, but it seems
	not to at least how it was implemented.

	1:21:38.959 --> 1:21:55.670
	There is one way that maybe goes in more direction,
	which is very new.

	1:21:55.455 --> 1:22:02.743
	If this one, the last word is attending mainly
	to the end of the audio.

	1:22:02.942 --> 1:22:04.934
	You might you should not output it yet.

	1:22:05.485 --> 1:22:15.539
	Because they might think there is something
	more missing than you need to know, so they

	1:22:15.539 --> 1:22:24.678
	look at the attention and only output parts
	which look to not the audio signal.

	1:22:25.045 --> 1:22:40.175
	So there is, of course, a lot of ways how
	you can do it better or easier in some way.

	1:22:41.901 --> 1:22:53.388
	Instead tries to predict the next word with
	a large language model, and then for text translation

	1:22:53.388 --> 1:22:54.911
	you predict.

	1:22:55.215 --> 1:23:01.177
	Then you translate all of them and decide
	if there is a change so you can even earlier

	1:23:01.177 --> 1:23:02.410
	do your decision.

	1:23:02.362 --> 1:23:08.714
	The idea is that if we continue and then this
	will be to a change in the translation, then

	1:23:08.714 --> 1:23:10.320
	we should have opened.

	1:23:10.890 --> 1:23:18.302
	So it's more doing your estimate about possible
	continuations of the source instead of looking

	1:23:18.302 --> 1:23:19.317
	at previous.

	1:23:23.783 --> 1:23:31.388
	All that works is a bit here like one example.

	1:23:31.388 --> 1:23:39.641
	It has a legacy baselines and you are not
	putting.

	1:23:40.040 --> 1:23:47.041
	And you see in this case you have worse blood
	scores here.

	1:23:47.041 --> 1:23:51.670
	For equal one you have better latency.

	1:23:52.032 --> 1:24:01.123
	The how to and how does anybody have an idea
	of what could be challenging there or when?

	1:24:05.825 --> 1:24:20.132
	One problem of these models are hallucinations,
	and often very long has a negative impact on.

	1:24:24.884 --> 1:24:30.869
	If you don't remove the last four words but
	your model now starts to hallucinate and invent

	1:24:30.869 --> 1:24:37.438
	just a lot of new stuff then yeah you're removing
	the last four words of that but if it has invented

	1:24:37.438 --> 1:24:41.406
	ten words and you're still outputting six of
	these invented.

	1:24:41.982 --> 1:24:48.672
	Typically once it starts hallucination generating
	some output, it's quite long, so then it's

	1:24:48.672 --> 1:24:50.902
	no longer enough to just hold.

	1:24:51.511 --> 1:24:57.695
	And then, of course, a bit better if you compare
	to the previous ones.

	1:24:57.695 --> 1:25:01.528
	Their destinations are typically different.

	1:25:07.567 --> 1:25:25.939
	Yes, so we don't talk about the details, but
	for outputs, for presentations, there's different

	1:25:25.939 --> 1:25:27.100
	ways.

	1:25:27.347 --> 1:25:36.047
	So you want to have maximum two lines, maximum
	forty-two characters per line, and the reading

	1:25:36.047 --> 1:25:40.212
	speed is a maximum of twenty-one characters.

	1:25:40.981 --> 1:25:43.513
	How to Do That We Can Skip.

	1:25:43.463 --> 1:25:46.804
	Then you can generate something like that.

	1:25:46.886 --> 1:25:53.250
	Another challenge is, of course, that you
	not only need to generate the translation,

	1:25:53.250 --> 1:25:59.614
	but for subtlyning you also want to generate
	when to put breaks and what to display.

	1:25:59.619 --> 1:26:06.234
	Because it cannot be full sentences, as said
	here, if you have like maximum twenty four

	1:26:06.234 --> 1:26:10.443
	characters per line, that's not always a full
	sentence.

	1:26:10.443 --> 1:26:12.247
	So how can you make it?

	1:26:13.093 --> 1:26:16.253
	And then for speech there's not even a hint
	of wisdom.

	1:26:18.398 --> 1:26:27.711
	So what we have done today is yeah, we looked
	into maybe three challenges: We have this segmentation,

	1:26:27.711 --> 1:26:33.013
	which is a challenge both in evaluation and
	in the decoder.

	1:26:33.013 --> 1:26:40.613
	We talked about disfluencies and we talked
	about simultaneous translations and how to

	1:26:40.613 --> 1:26:42.911
	address these challenges.

	1:26:43.463 --> 1:26:45.507
	Any more questions.

	1:26:48.408 --> 1:26:52.578
	Good then new content.

	1:26:52.578 --> 1:26:58.198
	We are done for this semester.

	1:26:58.198 --> 1:27:04.905
	You can keep your knowledge in that.

	1:27:04.744 --> 1:27:09.405
	Repetition where we can try to repeat a bit
	what we've done all over the semester.

	1:27:10.010 --> 1:27:13.776
	Now prepare a bit of repetition to what think
	is important.

	1:27:14.634 --> 1:27:21.441
	But of course is also the chance for you to
	ask specific questions.

	1:27:21.441 --> 1:27:25.445
	It's not clear to me how things relate.

	1:27:25.745 --> 1:27:34.906
	So if you have any specific questions, please
	come to me or send me an email or so, then

	1:27:34.906 --> 1:27:36.038
	I'm happy.

	1:27:36.396 --> 1:27:46.665
	If should focus on it really in depth, it
	might be good not to come and send me an email

	1:27:46.665 --> 1:27:49.204
	on Wednesday evening.