Spaces:

retkowski
/

ytseg_demo

Running

File size: 69,444 Bytes

cb71ef5

WEBVTT

0:00:01.121 --> 0:00:14.214
Okay, so welcome to today's lecture, on Tuesday
we started to talk about speech translation.

0:00:14.634 --> 0:00:27.037
And the idea is hopefully an idea of the basic
ideas we have in speech translation, the two

0:00:27.037 --> 0:00:29.464
major approaches.

0:00:29.829 --> 0:00:41.459
And the other one is the end system where
we have one large system which is everything

0:00:41.459 --> 0:00:42.796
together.

0:00:43.643 --> 0:00:58.459
Until now we mainly focus on text output that
we'll see today, but you can extend these ideas

0:00:58.459 --> 0:01:01.138
to other speech.

0:01:01.441 --> 0:01:08.592
But since it's also like a machine translation
lecture, you of course mainly focus a bit on

0:01:08.592 --> 0:01:10.768
the translation challenges.

0:01:12.172 --> 0:01:25.045
And what is the main focus of today's lecture
is to look into why that is challenging speech

0:01:25.045 --> 0:01:26.845
translation.

0:01:27.627 --> 0:01:33.901
So a bit more focus on what is now really
the difference to all you and how we can address.

0:01:34.254 --> 0:01:39.703
We'll start there by with the segmentation
problem.

0:01:39.703 --> 0:01:45.990
We had that already of bits, but especially
for end-to-end.

0:01:46.386 --> 0:01:57.253
So the problem is that until now it was easy
to segment the input into sentences and then

0:01:57.253 --> 0:02:01.842
translate each sentence individually.

0:02:02.442 --> 0:02:17.561
When you're now translating audio, the challenge
is that you have just a sequence of audio input

0:02:17.561 --> 0:02:20.055
and there's no.

0:02:21.401 --> 0:02:27.834
So you have this difference that your audio
is a continuous stream, but the text is typically

0:02:27.834 --> 0:02:28.930
sentence based.

0:02:28.930 --> 0:02:31.667
So how can you match this gap in there?

0:02:31.667 --> 0:02:37.690
We'll see that is really essential, and if
you're not using a decent good system there,

0:02:37.690 --> 0:02:41.249
then you can lose a lot of quality and performance.

0:02:41.641 --> 0:02:44.267
That is what also meant before.

0:02:44.267 --> 0:02:51.734
So if you have a more complex system out of
several units, it's really essential that they

0:02:51.734 --> 0:02:56.658
all work together and it's very easy to lose
significantly.

0:02:57.497 --> 0:03:13.029
The second challenge we'll talk about is disfluencies,
so the style of speaking is very different

0:03:13.029 --> 0:03:14.773
from text.

0:03:15.135 --> 0:03:24.727
So if you translate or TedTalks, that's normally
very good speakers.

0:03:24.727 --> 0:03:30.149
They will give you a very fluent text.

0:03:30.670 --> 0:03:36.692
When you want to translate a lecture, it might
be more difficult or rednested.

0:03:37.097 --> 0:03:39.242
Mean people are not well that well.

0:03:39.242 --> 0:03:42.281
They should be prepared in giving the lecture
and.

0:03:42.362 --> 0:03:48.241
But it's not that I mean, typically a lecture
will have like rehearsal like five times before

0:03:48.241 --> 0:03:52.682
he is giving this lecture, and then like will
it completely be fluent?

0:03:52.682 --> 0:03:56.122
He might at some point notice all this is
not perfect.

0:03:56.122 --> 0:04:00.062
I want to rephrase, and he'll have to sing
during the lecture.

0:04:00.300 --> 0:04:04.049
Might be also good that he's thinking, so
he's not going too fast and things like.

0:04:05.305 --> 0:04:07.933
If you then go to the other extreme, it's
more meetings.

0:04:08.208 --> 0:04:15.430
If you have a lively discussion, of course,
people will interrupt, they will restart, they

0:04:15.430 --> 0:04:22.971
will think while they speak, and you know that
sometimes you tell people first think and speak

0:04:22.971 --> 0:04:26.225
because they are changing their opinion.

0:04:26.606 --> 0:04:31.346
So the question of how can you deal with this?

0:04:31.346 --> 0:04:37.498
And there again it might be solutions for
that, or at least.

0:04:39.759 --> 0:04:46.557
Then for the output we will look into simultaneous
translation that is at least not very important

0:04:46.557 --> 0:04:47.175
in text.

0:04:47.175 --> 0:04:53.699
There might be some cases but normally you
have all text available and then you're translating

0:04:53.699 --> 0:04:54.042
and.

0:04:54.394 --> 0:05:09.220
While for speech translation, since it's often
a life interaction, then of course it's important.

0:05:09.149 --> 0:05:12.378
Otherwise it's hard to follow.

0:05:12.378 --> 0:05:19.463
You see what said five minutes ago and the
slide is not as helpful.

0:05:19.739 --> 0:05:35.627
You have to wait very long before you can
answer because you have to first wait for what

0:05:35.627 --> 0:05:39.197
is happening there.

0:05:40.660 --> 0:05:46.177
And finally, we can talk a bit about presentation.

0:05:46.177 --> 0:05:54.722
For example, mentioned that if you're generating
subtitles, it's not possible.

0:05:54.854 --> 0:06:01.110
So in professional subtitles there are clear
rules.

0:06:01.110 --> 0:06:05.681
Subtitle has to be shown for seconds.

0:06:05.681 --> 0:06:08.929
It's maximum of two lines.

0:06:09.549 --> 0:06:13.156
Because otherwise it's getting too long, it's
not able to read it anymore, and so.

0:06:13.613 --> 0:06:19.826
So if you want to achieve that, of course,
you might have to adjust and select what you

0:06:19.826 --> 0:06:20.390
really.

0:06:23.203 --> 0:06:28.393
The first date starts with the segmentation.

0:06:28.393 --> 0:06:36.351
On the one end it's an issue while training,
on the other hand it's.

0:06:38.678 --> 0:06:47.781
What is the problem so when we train it's
relatively easy to separate our data into sentence

0:06:47.781 --> 0:06:48.466
level.

0:06:48.808 --> 0:07:02.241
So if you have your example, you have the
audio and the text, then you typically know

0:07:02.241 --> 0:07:07.083
that this sentence is aligned.

0:07:07.627 --> 0:07:16.702
You can use these time information to cut
your audio and then you can train and then.

0:07:18.018 --> 0:07:31.775
Because what we need for an enchilada model
is to be an output chart, in this case an audio

0:07:31.775 --> 0:07:32.822
chart.

0:07:33.133 --> 0:07:38.551
And even if this is a long speech, it's easy
then since we have this time information to

0:07:38.551 --> 0:07:39.159
separate.

0:07:39.579 --> 0:07:43.866
But we are using therefore, of course, the
target side information.

0:07:45.865 --> 0:07:47.949
The problem is now in runtime.

0:07:47.949 --> 0:07:49.427
This is not possible.

0:07:49.427 --> 0:07:55.341
Here we can do that based on the calculation
marks and the sentence segmentation on the

0:07:55.341 --> 0:07:57.962
target side because that is splitting.

0:07:57.962 --> 0:08:02.129
But during transcript, during translation
it is not possible.

0:08:02.442 --> 0:08:10.288
Because there is just a long audio signal,
and of course if you have your test data to

0:08:10.288 --> 0:08:15.193
split it into: That has been done for some
experience.

0:08:15.193 --> 0:08:22.840
It's fine, but it's not a realistic scenario
because if you really apply it in real world,

0:08:22.840 --> 0:08:25.949
we won't have a manual segmentation.

0:08:26.266 --> 0:08:31.838
If a human has to do that then he can do the
translation so you want to have a full automatic

0:08:31.838 --> 0:08:32.431
pipeline.

0:08:32.993 --> 0:08:38.343
So the question is how can we deal with this
type of you know?

0:09:09.309 --> 0:09:20.232
So the question is how can we deal with this
time of situation and how can we segment the

0:09:20.232 --> 0:09:23.024
audio into some units?

0:09:23.863 --> 0:09:32.495
And here is one further really big advantage
of a cascaded sauce: Because how is this done

0:09:32.495 --> 0:09:34.259
in a cascade of systems?

0:09:34.259 --> 0:09:38.494
We are splitting the audio with some features
we are doing.

0:09:38.494 --> 0:09:42.094
We can use similar ones which we'll discuss
later.

0:09:42.094 --> 0:09:43.929
Then we run against chin.

0:09:43.929 --> 0:09:48.799
We have the transcript and then we can do
what we talked last about.

0:09:49.069 --> 0:10:02.260
So if you have this is an audio signal and
the training data it was good.

0:10:02.822 --> 0:10:07.951
So here we have a big advantage.

0:10:07.951 --> 0:10:16.809
We can use a different segmentation for the
and for the.

0:10:16.809 --> 0:10:21.316
Why is that a big advantage?

0:10:23.303 --> 0:10:34.067
Will say for a team task is more important
because we can then do the sentence transformation.

0:10:34.955 --> 0:10:37.603
See and Yeah, We Can Do the Same Thing.

0:10:37.717 --> 0:10:40.226
To save us, why is it not as important for
us?

0:10:40.226 --> 0:10:40.814
Are maybe.

0:10:43.363 --> 0:10:48.589
We don't need that much context.

0:10:48.589 --> 0:11:01.099
We only try to restrict the word, but the
context to consider is mainly small.

0:11:03.283 --> 0:11:11.419
Would agree with it in more context, but there
is one more important: its.

0:11:11.651 --> 0:11:16.764
The is monotone, so there's no reordering.

0:11:16.764 --> 0:11:22.472
The second part of the signal is no reordering.

0:11:22.472 --> 0:11:23.542
We have.

0:11:23.683 --> 0:11:29.147
And of course if we are doing that we cannot
really order across boundaries between segments.

0:11:29.549 --> 0:11:37.491
It might be challenging if we split the words
so that it's not perfect for so that.

0:11:37.637 --> 0:11:40.846
But we need to do quite long range reordering.

0:11:40.846 --> 0:11:47.058
If you think about the German where the work
has moved, and now the English work is in one

0:11:47.058 --> 0:11:50.198
part, but the end of the sentence is another.

0:11:50.670 --> 0:11:59.427
And of course this advantage we have now here
that if we have a segment we have.

0:12:01.441 --> 0:12:08.817
And that this segmentation is important.

0:12:08.817 --> 0:12:15.294
Here are some motivations for that.

0:12:15.675 --> 0:12:25.325
What you are doing is you are taking the reference
text and you are segmenting.

0:12:26.326 --> 0:12:30.991
And then, of course, your segments are exactly
yeah cute.

0:12:31.471 --> 0:12:42.980
If you're now using different segmentation
strategies, you're using significantly in blue

0:12:42.980 --> 0:12:44.004
points.

0:12:44.004 --> 0:12:50.398
If the segmentation is bad, you have a lot
worse.

0:12:52.312 --> 0:13:10.323
And interesting, here you ought to see how
it was a human, but people have in a competition.

0:13:10.450 --> 0:13:22.996
You can see that by working on the segmentation
and using better segmentation you can improve

0:13:22.996 --> 0:13:25.398
your performance.

0:13:26.006 --> 0:13:29.932
So it's really essential.

0:13:29.932 --> 0:13:41.712
One other interesting thing is if you're looking
into the difference between.

0:13:42.082 --> 0:13:49.145
So it really seems to be more important to
have a good segmentation for our cascaded system.

0:13:49.109 --> 0:13:56.248
For an intra-end system because there you
can't re-segment while it is less important

0:13:56.248 --> 0:13:58.157
for a cascaded system.

0:13:58.157 --> 0:14:05.048
Of course, it's still important, but the difference
between the two segmentations.

0:14:06.466 --> 0:14:18.391
It was a shared task some years ago like it's
just one system from different.

0:14:22.122 --> 0:14:31.934
So the question is how can we deal with this
in speech translation and what people look

0:14:31.934 --> 0:14:32.604
into?

0:14:32.752 --> 0:14:48.360
Now we want to use different techniques to
split the audio signal into segments.

0:14:48.848 --> 0:14:54.413
You have the disadvantage that you can't change
it.

0:14:54.413 --> 0:15:00.407
Therefore, some of the quality might be more
important.

0:15:00.660 --> 0:15:15.678
But in both cases, of course, the A's are
better if you have a good segmentation.

0:15:17.197 --> 0:15:23.149
So any idea, how would you have this task
now split this audio?

0:15:23.149 --> 0:15:26.219
What type of tool would you use?

0:15:28.648 --> 0:15:41.513
The fuse was a new network to segment half
for instance supervise.

0:15:41.962 --> 0:15:44.693
Yes, that's exactly already the better system.

0:15:44.693 --> 0:15:50.390
So for long time people have done more simple
things because we'll come to that a bit challenging

0:15:50.390 --> 0:15:52.250
as creating or having the data.

0:15:53.193 --> 0:16:00.438
The first thing is you use some tool out of
the box like voice activity detection which

0:16:00.438 --> 0:16:07.189
has been there as a whole research field so
people find when somebody's speaking.

0:16:07.647 --> 0:16:14.952
And then you use that in this different threshold
you always have the ability that somebody's

0:16:14.952 --> 0:16:16.273
speaking or not.

0:16:17.217 --> 0:16:19.889
Then you split your signal.

0:16:19.889 --> 0:16:26.762
It will not be perfect, but you transcribe
or translate each component.

0:16:28.508 --> 0:16:39.337
But as you see, a supervised classification
task is even better, and that is now the most

0:16:39.337 --> 0:16:40.781
common use.

0:16:41.441 --> 0:16:49.909
The supervisor is doing that as a supervisor
classification and then you'll try to use this

0:16:49.909 --> 0:16:50.462
type.

0:16:50.810 --> 0:16:53.217
We're going into a bit more detail on how
to do that.

0:16:53.633 --> 0:17:01.354
So what you need to do first is, of course,
you have to have some labels whether this is

0:17:01.354 --> 0:17:03.089
an end of sentence.

0:17:03.363 --> 0:17:10.588
You do that by using the alignment between
the segments and the audio.

0:17:10.588 --> 0:17:12.013
You have the.

0:17:12.212 --> 0:17:15.365
The two people have not for each word, so
these tank steps.

0:17:15.365 --> 0:17:16.889
This word is said this time.

0:17:17.157 --> 0:17:27.935
This word is said by what you typically have
from this time to time to time.

0:17:27.935 --> 0:17:34.654
We have the second segment, the second segment.

0:17:35.195 --> 0:17:39.051
Which also used to trade for example your
advanced system and everything.

0:17:41.661 --> 0:17:53.715
Based on that you can label each frame in
there so if you have a green or blue that is

0:17:53.715 --> 0:17:57.455
our speech segment so you.

0:17:58.618 --> 0:18:05.690
And these labels will then later help you,
but you extract exactly these types of.

0:18:07.067 --> 0:18:08.917
There's one big challenge.

0:18:08.917 --> 0:18:15.152
If you have two sentences which are directly
connected to each other, then if you're doing

0:18:15.152 --> 0:18:18.715
this labeling, you would not have a break in
later.

0:18:18.715 --> 0:18:23.512
If you tried to extract that, there should
be something great or not.

0:18:23.943 --> 0:18:31.955
So what you typically do is in the last frame.

0:18:31.955 --> 0:18:41.331
You mark as outside, although it's not really
outside.

0:18:43.463 --> 0:18:46.882
Yes, I guess you could also do that in more
of a below check.

0:18:46.882 --> 0:18:48.702
I mean, this is the most simple.

0:18:48.702 --> 0:18:51.514
It's like inside outside, so it's related
to that.

0:18:51.514 --> 0:18:54.988
Of course, you could have an extra startup
segment, and so on.

0:18:54.988 --> 0:18:57.469
I guess this is just to make it more simple.

0:18:57.469 --> 0:19:00.226
You only have two labels, not a street classroom.

0:19:00.226 --> 0:19:02.377
But yeah, you could do similar things.

0:19:12.432 --> 0:19:20.460
Has caused down the roads to problems because
it could be an important part of a segment

0:19:20.460 --> 0:19:24.429
which has some meaning and we do something.

0:19:24.429 --> 0:19:28.398
The good thing is frames are normally very.

0:19:28.688 --> 0:19:37.586
Like some milliseconds, so normally if you
remove some milliseconds you can still understand

0:19:37.586 --> 0:19:38.734
everything.

0:19:38.918 --> 0:19:46.999
Mean the speech signal is very repetitive,
and so you have information a lot of times.

0:19:47.387 --> 0:19:50.730
That's why we talked along there last time
they could try to shrink the steak and.

0:19:51.031 --> 0:20:00.995
If you now have a short sequence where there
is like which would be removed and that's not

0:20:00.995 --> 0:20:01.871
really.

0:20:02.162 --> 0:20:06.585
Yeah, but it's not a full letter is missing.

0:20:06.585 --> 0:20:11.009
It's like only the last ending of the vocal.

0:20:11.751 --> 0:20:15.369
Think it doesn't really happen.

0:20:15.369 --> 0:20:23.056
We have our audio signal and we have these
gags that are not above.

0:20:23.883 --> 0:20:29.288
With this blue rectangulars the inside speech
segment and with the guess it's all set yes.

0:20:29.669 --> 0:20:35.736
So then you have the full signal and you're
meaning now labeling your task as a blue or

0:20:35.736 --> 0:20:36.977
white prediction.

0:20:36.977 --> 0:20:39.252
So that is your prediction task.

0:20:39.252 --> 0:20:44.973
You have the audio signal only and your prediction
task is like label one or zero.

0:20:45.305 --> 0:20:55.585
Once you do that then based on this labeling
you can extract each segment again like each

0:20:55.585 --> 0:20:58.212
consecutive blue area.

0:20:58.798 --> 0:21:05.198
See then removed maybe the non-speaking part
already and duo speech translation only on

0:21:05.198 --> 0:21:05.998
the parts.

0:21:06.786 --> 0:21:19.768
Which is good because the training would have
done similarly.

0:21:20.120 --> 0:21:26.842
So on the noise in between you never saw in
the training, so it's good to throw it away.

0:21:29.649 --> 0:21:34.930
One challenge, of course, is now if you're
doing that, what is your input?

0:21:34.930 --> 0:21:40.704
You cannot do the sequence labeling normally
on the whole talk, so it's too long.

0:21:40.704 --> 0:21:46.759
So if you're doing this prediction of the
label, you also have a window for which you

0:21:46.759 --> 0:21:48.238
do the segmentation.

0:21:48.788 --> 0:21:54.515
And that's the bedline we have in the punctuation
prediction.

0:21:54.515 --> 0:22:00.426
If we don't have good borders, random splits
are normally good.

0:22:00.426 --> 0:22:03.936
So what we do now is split the audio.

0:22:04.344 --> 0:22:09.134
So that would be our input, and then the part
three would be our labels.

0:22:09.269 --> 0:22:15.606
This green would be the input and here we
want, for example, blue labels and then white.

0:22:16.036 --> 0:22:20.360
Here only do labors and here at the beginning
why maybe at the end why.

0:22:21.401 --> 0:22:28.924
So thereby you have now a fixed window always
for which you're doing than this task of predicting.

0:22:33.954 --> 0:22:43.914
How you build your classifier that is based
again.

0:22:43.914 --> 0:22:52.507
We had this wave to be mentioned last week.

0:22:52.752 --> 0:23:00.599
So in training you use labels to say whether
it's in speech or outside speech.

0:23:01.681 --> 0:23:17.740
Inference: You give them always the chance
and then predict whether this part like each

0:23:17.740 --> 0:23:20.843
label is afraid.

0:23:23.143 --> 0:23:29.511
Bit more complicated, so one challenge is
if you randomly split off cognition, losing

0:23:29.511 --> 0:23:32.028
your context for the first brain.

0:23:32.028 --> 0:23:38.692
It might be very hard to predict whether this
is now in or out of, and also for the last.

0:23:39.980 --> 0:23:48.449
You often need a bit of context whether this
is audio or not, and at the beginning.

0:23:49.249 --> 0:23:59.563
So what you do is you put the audio in twice.

0:23:59.563 --> 0:24:08.532
You want to do it with splits and then.

0:24:08.788 --> 0:24:15.996
It is shown you have shifted the two offsets,
so one is predicted with the other offset.

0:24:16.416 --> 0:24:23.647
And then averaging the probabilities so that
at each time you have, at least for one of

0:24:23.647 --> 0:24:25.127
the predictions,.

0:24:25.265 --> 0:24:36.326
Because at the end of the second it might
be very hard to predict whether this is now

0:24:36.326 --> 0:24:39.027
speech or nonspeech.

0:24:39.939 --> 0:24:47.956
Think it is a high parameter, but you are
not optimizing it, so you just take two shifts.

0:24:48.328 --> 0:24:54.636
Of course try a lot of different shifts and
so on.

0:24:54.636 --> 0:24:59.707
The thing is it's mainly a problem here.

0:24:59.707 --> 0:25:04.407
If you don't do two outsets you have.

0:25:05.105 --> 0:25:14.761
You could get better by doing that, but would
be skeptical if it really matters, and also

0:25:14.761 --> 0:25:18.946
have not seen any experience in doing.

0:25:19.159 --> 0:25:27.629
Guess you're already good, you have maybe
some arrows in there and you're getting.

0:25:31.191 --> 0:25:37.824
So with this you have your segmentation.

0:25:37.824 --> 0:25:44.296
However, there is a problem in between.

0:25:44.296 --> 0:25:49.150
Once the model is wrong then.

0:25:49.789 --> 0:26:01.755
The normal thing would be the first thing
that you take some threshold and that you always

0:26:01.755 --> 0:26:05.436
label everything in speech.

0:26:06.006 --> 0:26:19.368
The problem is when you are just doing this
one threshold that you might have.

0:26:19.339 --> 0:26:23.954
Those are the challenges.

0:26:23.954 --> 0:26:31.232
Short segments mean you have no context.

0:26:31.232 --> 0:26:35.492
The policy will be bad.

0:26:37.077 --> 0:26:48.954
Therefore, people use this probabilistic divided
cocker algorithm, so the main idea is start

0:26:48.954 --> 0:26:56.744
with the whole segment, and now you split the
whole segment.

0:26:57.397 --> 0:27:09.842
Then you split there and then you continue
until each segment is smaller than the maximum

0:27:09.842 --> 0:27:10.949
length.

0:27:11.431 --> 0:27:23.161
But you can ignore some splits, and if you
split one segment into two parts you first

0:27:23.161 --> 0:27:23.980
trim.

0:27:24.064 --> 0:27:40.197
So normally it's not only one signal position,
it's a longer area of non-voice, so you try

0:27:40.197 --> 0:27:43.921
to find this longer.

0:27:43.943 --> 0:27:51.403
Now your large segment is split into two smaller
segments.

0:27:51.403 --> 0:27:56.082
Now you are checking these segments.

0:27:56.296 --> 0:28:04.683
So if they are very, very short, it might
be good not to spin at this point because you're

0:28:04.683 --> 0:28:05.697
ending up.

0:28:06.006 --> 0:28:09.631
And this way you continue all the time, and
then hopefully you'll have a good stretch.

0:28:10.090 --> 0:28:19.225
So, of course, there's one challenge with
this approach: if you think about it later,

0:28:19.225 --> 0:28:20.606
low latency.

0:28:25.405 --> 0:28:31.555
So in this case you have to have the full
audio available.

0:28:32.132 --> 0:28:38.112
So you cannot continuously do that mean if
you would do it just always.

0:28:38.112 --> 0:28:45.588
If the probability is higher you split but
in this case you try to find a global optimal.

0:28:46.706 --> 0:28:49.134
A heuristic body.

0:28:49.134 --> 0:28:58.170
You find a global solution for your whole
tar and not a local one.

0:28:58.170 --> 0:29:02.216
Where's the system most sure?

0:29:02.802 --> 0:29:12.467
So that's a bit of a challenge here, but the
advantage of course is that in the end you

0:29:12.467 --> 0:29:14.444
have no segments.

0:29:17.817 --> 0:29:23.716
Any more questions like this.

0:29:23.716 --> 0:29:36.693
Then the next thing is we also need to evaluate
in this scenario.

0:29:37.097 --> 0:29:44.349
So know machine translation is quite a long
way.

0:29:44.349 --> 0:29:55.303
History now was the beginning of the semester,
but hope you can remember.

0:29:55.675 --> 0:30:09.214
Might be with blue score, might be with comment
or similar, but you need to have.

0:30:10.310 --> 0:30:22.335
But this assumes that you have this one-to-one
match, so you always have an output and machine

0:30:22.335 --> 0:30:26.132
translation, which is nicely.

0:30:26.506 --> 0:30:34.845
So then it might be that our output has four
segments, while our reference output has only

0:30:34.845 --> 0:30:35.487
three.

0:30:36.756 --> 0:30:40.649
And now is, of course, questionable like what
should we compare in our metric.

0:30:44.704 --> 0:30:53.087
So it's no longer directly possible to directly
do that because what should you compare?

0:30:53.413 --> 0:31:00.214
Just have four segments there and three segments
there, and of course it seems to be that.

0:31:00.920 --> 0:31:06.373
The first one it likes to the first one when
you see I can't speak Spanish, but you're an

0:31:06.373 --> 0:31:09.099
audience of the guests who is already there.

0:31:09.099 --> 0:31:14.491
So even like just a woman, the blue comparing
wouldn't work, so you need to do something

0:31:14.491 --> 0:31:17.157
about that to take this type of evaluation.

0:31:19.019 --> 0:31:21.727
Still any suggestions what you could do.

0:31:25.925 --> 0:31:44.702
How can you calculate a blue score because
you don't have one you want to see?

0:31:45.925 --> 0:31:49.365
Here you put another layer which spies to
add in the second.

0:31:51.491 --> 0:31:56.979
It's even not aligning only, but that's one
solution, so you need to align and resign.

0:31:57.177 --> 0:32:06.886
Because even if you have no alignment so this
to this and this to that you see that it's

0:32:06.886 --> 0:32:12.341
not good because the audio would compare to
that.

0:32:13.453 --> 0:32:16.967
That we'll discuss is even one simpler solution.

0:32:16.967 --> 0:32:19.119
Yes, it's a simpler solution.

0:32:19.119 --> 0:32:23.135
It's called document based blue or something
like that.

0:32:23.135 --> 0:32:25.717
So you just take the full document.

0:32:26.566 --> 0:32:32.630
For some matrix it's good and it's not clear
how good it is to the other, but there might

0:32:32.630 --> 0:32:32.900
be.

0:32:33.393 --> 0:32:36.454
Think of more simple metrics like blue.

0:32:36.454 --> 0:32:40.356
Do you have any idea what could be a disadvantage?

0:32:49.249 --> 0:32:56.616
Blue is matching ingrams so you start with
the original.

0:32:56.616 --> 0:33:01.270
You check how many ingrams in here.

0:33:01.901 --> 0:33:11.233
If you're not doing that on the full document,
you can also match grams from year to year.

0:33:11.751 --> 0:33:15.680
So you can match things very far away.

0:33:15.680 --> 0:33:21.321
Start doing translation and you just randomly
randomly.

0:33:22.142 --> 0:33:27.938
And that, of course, could be a bit of a disadvantage
or like is a problem, and therefore people

0:33:27.938 --> 0:33:29.910
also look into the segmentation.

0:33:29.910 --> 0:33:34.690
But I've recently seen some things, so document
levels tours are also normally.

0:33:34.690 --> 0:33:39.949
If you have a relatively high quality system
or state of the art, then they also have a

0:33:39.949 --> 0:33:41.801
good correlation of the human.

0:33:46.546 --> 0:33:59.241
So how are we doing that so we are putting
end of sentence boundaries in there and then.

0:33:59.179 --> 0:34:07.486
Alignment based on a similar Livingston distance,
so at a distance between our output and the

0:34:07.486 --> 0:34:09.077
reference output.

0:34:09.449 --> 0:34:13.061
And here is our boundary.

0:34:13.061 --> 0:34:23.482
We map the boundary based on the alignment,
so in Lithuania you only have.

0:34:23.803 --> 0:34:36.036
And then, like all the words that are before,
it might be since there is not a random.

0:34:36.336 --> 0:34:44.890
Mean it should be, but it can happen things
like that, and it's not clear where.

0:34:44.965 --> 0:34:49.727
At the break, however, they are typically
not that bad because they are words which are

0:34:49.727 --> 0:34:52.270
not matching between reference and hypothesis.

0:34:52.270 --> 0:34:56.870
So normally it doesn't really matter that
much because they are anyway not matching.

0:34:57.657 --> 0:35:05.888
And then you take the mule as a T output and
use that to calculate your metric.

0:35:05.888 --> 0:35:12.575
Then it's again a perfect alignment for which
you can calculate.

0:35:14.714 --> 0:35:19.229
Any idea you could do it the other way around.

0:35:19.229 --> 0:35:23.359
You could resigment your reference to the.

0:35:29.309 --> 0:35:30.368
Which one would you select?

0:35:34.214 --> 0:35:43.979
I think segmenting the assertive also is much
more natural because the reference sentence

0:35:43.979 --> 0:35:46.474
is the fixed solution.

0:35:47.007 --> 0:35:52.947
Yes, that's the right motivation if you do
think about blue or so.

0:35:52.947 --> 0:35:57.646
Additionally important if you change your
reference.

0:35:57.857 --> 0:36:07.175
You might have a different number of diagrams
or diagrams because the sentences are different

0:36:07.175 --> 0:36:08.067
lengths.

0:36:08.068 --> 0:36:15.347
Here your five system, you're always comparing
it to the same system, and you don't compare

0:36:15.347 --> 0:36:16.455
to different.

0:36:16.736 --> 0:36:22.317
The only different base of segmentation, but
still it could make some do.

0:36:25.645 --> 0:36:38.974
Good, that's all about sentence segmentation,
then a bit about disfluencies and what there

0:36:38.974 --> 0:36:40.146
really.

0:36:42.182 --> 0:36:51.138
So as said in daily life, you're not speaking
like very nice full sentences every.

0:36:51.471 --> 0:36:53.420
He was speaking powerful sentences.

0:36:53.420 --> 0:36:54.448
We do repetitions.

0:36:54.834 --> 0:37:00.915
It's especially if it's more interactive,
so in meetings, phone calls and so on.

0:37:00.915 --> 0:37:04.519
If you have multiple speakers, they also break.

0:37:04.724 --> 0:37:16.651
Each other, and then if you keep them, they
are harder to translate because most of your

0:37:16.651 --> 0:37:17.991
training.

0:37:18.278 --> 0:37:30.449
It's also very difficult to read, so we'll
have some examples there to transcribe everything

0:37:30.449 --> 0:37:32.543
as it was said.

0:37:33.473 --> 0:37:36.555
What type of things are there?

0:37:37.717 --> 0:37:42.942
So you have all these pillow works.

0:37:42.942 --> 0:37:47.442
These are very easy to remove.

0:37:47.442 --> 0:37:52.957
You can just use regular expressions.

0:37:53.433 --> 0:38:00.139
Is getting more difficult with some other
type of filler works.

0:38:00.139 --> 0:38:03.387
In German you have this or in.

0:38:04.024 --> 0:38:08.473
And these ones you cannot just remove by regular
expression.

0:38:08.473 --> 0:38:15.039
You shouldn't remove all yacht from a text
because it might be very important information

0:38:15.039 --> 0:38:15.768
for well.

0:38:15.715 --> 0:38:19.995
It may be not as important as you are, but
still it might be very important.

0:38:20.300 --> 0:38:24.215
So just removing them is there already more
difficult.

0:38:26.586 --> 0:38:29.162
Then you have these repetitions.

0:38:29.162 --> 0:38:32.596
You have something like mean saw him there.

0:38:32.596 --> 0:38:33.611
There was a.

0:38:34.334 --> 0:38:41.001
And while for the first one that might be
very easy to remove because you just look for

0:38:41.001 --> 0:38:47.821
double, the thing is that the repetition might
not be exactly the same, so there is there

0:38:47.821 --> 0:38:48.199
was.

0:38:48.199 --> 0:38:54.109
So there is already getting a bit more complicated,
of course still possible.

0:38:54.614 --> 0:39:01.929
You can remove Denver so the real sense would
be like to have a ticket to Houston.

0:39:02.882 --> 0:39:13.327
But there the detection, of course, is getting
more challenging as you want to get rid of.

0:39:13.893 --> 0:39:21.699
You don't have the data, of course, which
makes all the tasks harder, but you probably

0:39:21.699 --> 0:39:22.507
want to.

0:39:22.507 --> 0:39:24.840
That's really meaningful.

0:39:24.840 --> 0:39:26.185
Current isn't.

0:39:26.185 --> 0:39:31.120
That is now a really good point and it's really
there.

0:39:31.051 --> 0:39:34.785
The thing about what is your final task?

0:39:35.155 --> 0:39:45.526
If you want to have a transcript reading it,
I'm not sure if we have another example.

0:39:45.845 --> 0:39:54.171
So there it's nicer if you have a clean transfer
and if you see subtitles in, they're also not

0:39:54.171 --> 0:39:56.625
having all the repetitions.

0:39:56.625 --> 0:40:03.811
It's the nice way to shorten but also getting
the structure you cannot even make.

0:40:04.064 --> 0:40:11.407
So in this situation, of course, they might
give you information.

0:40:11.407 --> 0:40:14.745
There is a lot of stuttering.

0:40:15.015 --> 0:40:22.835
So in this case agree it might be helpful
in some way, but meaning reading all the disfluencies

0:40:22.835 --> 0:40:25.198
is getting really difficult.

0:40:25.198 --> 0:40:28.049
If you have the next one, we have.

0:40:28.308 --> 0:40:31.630
That's a very long text.

0:40:31.630 --> 0:40:35.883
You need a bit of time to pass.

0:40:35.883 --> 0:40:39.472
This one is not important.

0:40:40.480 --> 0:40:48.461
It might be nice if you can start reading
from here.

0:40:48.461 --> 0:40:52.074
Let's have a look here.

0:40:52.074 --> 0:40:54.785
Try to read this.

0:40:57.297 --> 0:41:02.725
You can understand it, but think you need
a bit of time to really understand what was.

0:41:11.711 --> 0:41:21.480
And now we have the same text, but you have
highlighted in bold, and not only read the

0:41:21.480 --> 0:41:22.154
bold.

0:41:23.984 --> 0:41:25.995
And ignore everything which is not bold.

0:41:30.250 --> 0:41:49.121
Would assume it's easier to read just the
book part more faster and more faster.

0:41:50.750 --> 0:41:57.626
Yeah, it might be, but I'm not sure we have
a master thesis of that.

0:41:57.626 --> 0:41:59.619
If seen my videos,.

0:42:00.000 --> 0:42:09.875
Of the recordings, I also have it more likely
that it's like a fluent speak and I'm not like

0:42:09.875 --> 0:42:12.318
doing the hesitations.

0:42:12.652 --> 0:42:23.764
Don't know if somebody else has looked into
the Cusera video, but notice that.

0:42:25.005 --> 0:42:31.879
For these videos spoke every minute, three
times or something, and then people were there

0:42:31.879 --> 0:42:35.011
and cutting things and making hopefully.

0:42:35.635 --> 0:42:42.445
And therefore if you want to more achieve
that, of course, no longer exactly what was

0:42:42.445 --> 0:42:50.206
happening, but if it more looks like a professional
video, then you would have to do that and cut

0:42:50.206 --> 0:42:50.998
that out.

0:42:50.998 --> 0:42:53.532
But yeah, there are definitely.

0:42:55.996 --> 0:42:59.008
We're also going to do this thing again.

0:42:59.008 --> 0:43:02.315
First turn is like I'm going to have a very.

0:43:02.422 --> 0:43:07.449
Which in the end they start to slow down just
without feeling as though they're.

0:43:07.407 --> 0:43:10.212
It's a good point for the next.

0:43:10.212 --> 0:43:13.631
There is not the one perfect solution.

0:43:13.631 --> 0:43:20.732
There's some work on destruction removal,
but of course there's also disability.

0:43:20.732 --> 0:43:27.394
Removal is not that easy, so do you just remove
that's in order everywhere.

0:43:27.607 --> 0:43:29.708
But how much like cleaning do you do?

0:43:29.708 --> 0:43:31.366
It's more a continuous thing.

0:43:31.811 --> 0:43:38.211
Is it more really you only remove stuff or
are you also into rephrasing and here is only

0:43:38.211 --> 0:43:38.930
removing?

0:43:39.279 --> 0:43:41.664
But maybe you want to rephrase it.

0:43:41.664 --> 0:43:43.231
That's hearing better.

0:43:43.503 --> 0:43:49.185
So then it's going into what people are doing
in style transfer.

0:43:49.185 --> 0:43:52.419
We are going from a speech style to.

0:43:52.872 --> 0:44:07.632
So there is more continuum, and of course
Airconditioner is not the perfect solution,

0:44:07.632 --> 0:44:10.722
but exactly what.

0:44:15.615 --> 0:44:19.005
Yeah, we're challenging.

0:44:19.005 --> 0:44:30.258
You have examples where the direct copy is
not as hard or is not exactly the same.

0:44:30.258 --> 0:44:35.410
That is, of course, more challenging.

0:44:41.861 --> 0:44:49.889
If it's getting really mean why it's so challenging,
if it's really spontaneous even for the speaker,

0:44:49.889 --> 0:44:55.634
you need maybe even the video to really get
that and at least the audio.

0:45:01.841 --> 0:45:06.025
Yeah what it also depends on.

0:45:06.626 --> 0:45:15.253
The purpose, of course, and very important
thing is the easiest tasks just to removing.

0:45:15.675 --> 0:45:25.841
Of course you have to be very careful because
if you remove some of the not, it's normally

0:45:25.841 --> 0:45:26.958
not much.

0:45:27.227 --> 0:45:33.176
But if you remove too much, of course, that's
very, very bad because you're losing important.

0:45:33.653 --> 0:45:46.176
And this might be even more challenging if
you think about rarer and unseen works.

0:45:46.226 --> 0:45:56.532
So when doing this removal, it's important
to be careful and normally more conservative.

0:46:03.083 --> 0:46:15.096
Of course, also you have to again see if you're
doing that now in a two step approach, not

0:46:15.096 --> 0:46:17.076
an end to end.

0:46:17.076 --> 0:46:20.772
So first you need a remote.

0:46:21.501 --> 0:46:30.230
But you have to somehow sing it in the whole
type line.

0:46:30.230 --> 0:46:36.932
If you learn text or remove disfluencies,.

0:46:36.796 --> 0:46:44.070
But it might be that the ASR system is outputing
something else or that it's more of an ASR

0:46:44.070 --> 0:46:44.623
error.

0:46:44.864 --> 0:46:46.756
So um.

0:46:46.506 --> 0:46:52.248
Just for example, if you do it based on language
modeling scores, it might be that you're just

0:46:52.248 --> 0:46:57.568
the language modeling score because the has
done some errors, so you really have to see

0:46:57.568 --> 0:46:59.079
the combination of that.

0:46:59.419 --> 0:47:04.285
And for example, we had like partial words.

0:47:04.285 --> 0:47:06.496
They are like some.

0:47:06.496 --> 0:47:08.819
We didn't have that.

0:47:08.908 --> 0:47:18.248
So these feelings cannot be that you start
in the middle of the world and then you switch

0:47:18.248 --> 0:47:19.182
because.

0:47:19.499 --> 0:47:23.214
And of course, in text in perfect transcript,
that's very easy to recognize.

0:47:23.214 --> 0:47:24.372
That's not a real word.

0:47:24.904 --> 0:47:37.198
However, when you really do it into an system,
he will normally detect some type of word because

0:47:37.198 --> 0:47:40.747
he only can help the words.

0:47:50.050 --> 0:48:03.450
Example: We should think so if you have this
in the transcript it's easy to detect as a

0:48:03.450 --> 0:48:05.277
disgusting.

0:48:05.986 --> 0:48:11.619
And then, of course, it's more challenging
in a real world example where you have.

0:48:12.492 --> 0:48:29.840
Now to the approaches one thing is to really
put it in between so you put your A's system.

0:48:31.391 --> 0:48:45.139
So what your task is like, so you have this
text and the outputs in this text.

0:48:45.565 --> 0:48:49.605
There is different formulations of that.

0:48:49.605 --> 0:48:54.533
You might not be able to do everything like
that.

0:48:55.195 --> 0:49:10.852
Or do you also allow, for example, rephrasing
for reordering so in text you might have the

0:49:10.852 --> 0:49:13.605
word correctly.

0:49:13.513 --> 0:49:24.201
But the easiest thing is you only do it more
like removing, so some things can be removed.

0:49:29.049 --> 0:49:34.508
Any ideas how to do that this is output.

0:49:34.508 --> 0:49:41.034
You have training data so we have training
data.

0:49:47.507 --> 0:49:55.869
To put in with the spoon you can eat it even
after it is out, but after the machine has.

0:50:00.000 --> 0:50:05.511
Was wearing rocks, so you have not just the
shoes you remove but wearing them as input,

0:50:05.511 --> 0:50:07.578
as disfluent text and as output.

0:50:07.578 --> 0:50:09.207
It should be fueled text.

0:50:09.207 --> 0:50:15.219
It can be before or after recycling as you
said, but you have this type of task, so technically

0:50:15.219 --> 0:50:20.042
how would you address this type of task when
you have to solve this type of.

0:50:24.364 --> 0:50:26.181
That's exactly so.

0:50:26.181 --> 0:50:28.859
That's one way of doing it.

0:50:28.859 --> 0:50:33.068
It's a translation task and you train your.

0:50:33.913 --> 0:50:34.683
Can do.

0:50:34.683 --> 0:50:42.865
Then, of course, the bit of the challenge
is that you automatically allow rephrasing

0:50:42.865 --> 0:50:43.539
stuff.

0:50:43.943 --> 0:50:52.240
Which of the one end is good so you have more
opportunities but it might be also a bad thing

0:50:52.240 --> 0:50:58.307
because if you have more opportunities you
have more opportunities.

0:51:01.041 --> 0:51:08.300
If you want to prevent that, it can also do
more simple labeling, so for each word your

0:51:08.300 --> 0:51:10.693
label should not be removed.

0:51:12.132 --> 0:51:17.658
People have also been looked into parsley.

0:51:17.658 --> 0:51:29.097
You remember maybe the past trees at the beginning
like the structure because the ideas.

0:51:29.649 --> 0:51:45.779
There's also more unsupervised approaches
where you then phrase it as a style transfer

0:51:45.779 --> 0:51:46.892
task.

0:51:50.310 --> 0:51:58.601
At the last point since we have that yes,
it has also been done in an end-to-end fashion

0:51:58.601 --> 0:52:06.519
so that it's really you have as input the audio
signal and output you have than the.

0:52:06.446 --> 0:52:10.750
The text, without influence, is a clearly
clear text.

0:52:11.131 --> 0:52:19.069
You model every single total, which of course
has a big advantage.

0:52:19.069 --> 0:52:25.704
You can use these paralinguistic features,
pauses, and.

0:52:25.705 --> 0:52:34.091
If you switch so you start something then
oh it doesn't work continue differently so.

0:52:34.374 --> 0:52:42.689
So you can easily use in a fashion while in
a cascade approach.

0:52:42.689 --> 0:52:47.497
As we saw there you have text input.

0:52:49.990 --> 0:53:02.389
But on the one end we have again, and in the
more extreme case the problem before was endless.

0:53:02.389 --> 0:53:06.957
Of course there is even less data.

0:53:11.611 --> 0:53:12.837
Good.

0:53:12.837 --> 0:53:30.814
This is all about the input to a very more
person, or maybe if you think about YouTube.

0:53:32.752 --> 0:53:34.989
Talk so this could use be very exciting.

0:53:36.296 --> 0:53:42.016
Is more viewed as style transferred.

0:53:42.016 --> 0:53:53.147
You can use ideas from machine translation
where you have one language.

0:53:53.713 --> 0:53:57.193
So there is ways of trying to do this type
of style transfer.

0:53:57.637 --> 0:54:02.478
Think is definitely also very promising to
make it more and more fluent in a business.

0:54:03.223 --> 0:54:17.974
Because one major issue about all the previous
ones is that you need training data and then

0:54:17.974 --> 0:54:21.021
you need training.

0:54:21.381 --> 0:54:32.966
So I mean, think that we are only really of
data that we have for English.

0:54:32.966 --> 0:54:39.453
Maybe there is a very few data in German.

0:54:42.382 --> 0:54:49.722
Okay, then let's talk about low latency speech.

0:54:50.270 --> 0:55:05.158
So the idea is if we are doing life translation
of a talker, so we want to start out.

0:55:05.325 --> 0:55:23.010
This is possible because there is typically
some kind of monotony in many languages.

0:55:24.504 --> 0:55:29.765
And this is also what, for example, human
interpreters are doing to have a really low

0:55:29.765 --> 0:55:30.071
leg.

0:55:30.750 --> 0:55:34.393
They are even going further.

0:55:34.393 --> 0:55:40.926
They guess what will be the ending of the
sentence.

0:55:41.421 --> 0:55:51.120
Then they can already continue, although it's
not sad it might be needed, but that is even

0:55:51.120 --> 0:55:53.039
more challenging.

0:55:54.714 --> 0:55:58.014
Why is it so difficult?

0:55:58.014 --> 0:56:09.837
There is this train of on the one end for
a and you want to have more context because

0:56:09.837 --> 0:56:14.511
we learn if we have more context.

0:56:15.015 --> 0:56:24.033
And therefore to have more contacts you have
to wait as long as possible.

0:56:24.033 --> 0:56:27.689
The best is to have the full.

0:56:28.168 --> 0:56:35.244
On the other hand, you want to have a low
latency for the user to wait to generate as

0:56:35.244 --> 0:56:35.737
soon.

0:56:36.356 --> 0:56:47.149
So if you're doing no situation you have to
find the best way to start in order to have

0:56:47.149 --> 0:56:48.130
a good.

0:56:48.728 --> 0:56:52.296
There's no longer the perfect solution.

0:56:52.296 --> 0:56:56.845
People will also evaluate what is the translation.

0:56:57.657 --> 0:57:09.942
While it's challenging in German to English,
German has this very nice thing where the prefix

0:57:09.942 --> 0:57:16.607
of the word can be put at the end of the sentence.

0:57:17.137 --> 0:57:24.201
And you only know if the person registers
or cancels his station at the end of the center.

0:57:24.985 --> 0:57:33.690
So if you want to start the translation in
English you need to know at this point is the.

0:57:35.275 --> 0:57:39.993
So you would have to wait until the end of
the year.

0:57:39.993 --> 0:57:42.931
That's not really what you want.

0:57:43.843 --> 0:57:45.795
What happened.

0:57:47.207 --> 0:58:12.550
Other solutions of doing that are: Have been
motivating like how we can do that subject

0:58:12.550 --> 0:58:15.957
object or subject work.

0:58:16.496 --> 0:58:24.582
In German it's not always subject, but there
are relative sentence where you have that,

0:58:24.582 --> 0:58:25.777
so it needs.

0:58:28.808 --> 0:58:41.858
How we can do that is, we'll look today into
three ways of doing that.

0:58:41.858 --> 0:58:46.269
The one is to mitigate.

0:58:46.766 --> 0:58:54.824
And then the IVAR idea is to do retranslating,
and there you can now use the text output.

0:58:54.934 --> 0:59:02.302
So the idea is you translate, and if you later
notice it was wrong then you can retranslate

0:59:02.302 --> 0:59:03.343
and correct.

0:59:03.803 --> 0:59:14.383
Or you can do what is called extremely coding,
so you can generically.

0:59:17.237 --> 0:59:30.382
Let's start with the optimization, so if you
have a sentence, it may reach a conference,

0:59:30.382 --> 0:59:33.040
and in this time.

0:59:32.993 --> 0:59:39.592
So you have a good translation quality while
still having low latency.

0:59:39.699 --> 0:59:50.513
You have an extra model which does your segmentation
before, but your aim is not to have a segmentation.

0:59:50.470 --> 0:59:53.624
But you can somehow measure in training data.

0:59:53.624 --> 0:59:59.863
If do these types of segment lengths, that's
my latency and that's my translation quality,

0:59:59.863 --> 1:00:02.811
and then you can try to search a good way.

1:00:03.443 --> 1:00:20.188
If you're doing that one, it's an extra component,
so you can use your system as it was.

1:00:22.002 --> 1:00:28.373
The other idea is to directly output the first
high processes always, so always when you have

1:00:28.373 --> 1:00:34.201
text or audio we translate, and if we then
have more context available we can update.

1:00:35.015 --> 1:00:50.195
So imagine before, if get an eye register
and there's a sentence continued, then.

1:00:50.670 --> 1:00:54.298
So you change the output.

1:00:54.298 --> 1:01:07.414
Of course, that might be also leading to bad
user experience if you always flicker and change

1:01:07.414 --> 1:01:09.228
your output.

1:01:09.669 --> 1:01:15.329
The bit like human interpreters also are able
to correct, so they're doing a more long text.

1:01:15.329 --> 1:01:20.867
If they are guessing how to continue to say
and then he's saying something different, they

1:01:20.867 --> 1:01:22.510
also have to correct them.

1:01:22.510 --> 1:01:26.831
So here, since it's not all you, we can even
change what we have said.

1:01:26.831 --> 1:01:29.630
Yes, that's exactly what we have implemented.

1:01:31.431 --> 1:01:49.217
So how that works is, we are aware, and then
we translate it, and if we get more input like

1:01:49.217 --> 1:01:51.344
you, then.

1:01:51.711 --> 1:02:00.223
And so we can always continue to do that and
improve the transcript that we have.

1:02:00.480 --> 1:02:07.729
So in the end we have the lowest possible
latency because we always output what is possible.

1:02:07.729 --> 1:02:14.784
On the other hand, introducing a bit of a
new problem is: There's another challenge when

1:02:14.784 --> 1:02:20.061
we first used that this one was first used
for old and that it worked fine.

1:02:20.061 --> 1:02:21.380
You switch to NMT.

1:02:21.380 --> 1:02:25.615
You saw one problem that is even generating
more flickering.

1:02:25.615 --> 1:02:28.878
The problem is the normal machine translation.

1:02:29.669 --> 1:02:35.414
So implicitly learn all the output that always
ends with a dot, and it's always a full sentence.

1:02:36.696 --> 1:02:42.466
And this was even more important somewhere
in the model than really what is in the input.

1:02:42.983 --> 1:02:55.910
So if you give him a partial sentence, it
will still generate a full sentence.

1:02:55.910 --> 1:02:58.201
So encourage.

1:02:58.298 --> 1:03:05.821
It's like trying to just continue it somehow
to a full sentence and if it's doing better

1:03:05.821 --> 1:03:10.555
guessing stuff then you have to even have more
changes.

1:03:10.890 --> 1:03:23.944
So here we have a trained mismatch and that's
maybe more a general important thing that the

1:03:23.944 --> 1:03:28.910
modem might learn a bit different.

1:03:29.289 --> 1:03:32.636
It's always ending with a dog, so you don't
just guess something in general.

1:03:33.053 --> 1:03:35.415
So we have your trained test mismatch.

1:03:38.918 --> 1:03:41.248
And we have a trained test message.

1:03:41.248 --> 1:03:43.708
What is the best way to address that?

1:03:46.526 --> 1:03:51.934
That's exactly the right, so we have to like
train also on that.

1:03:52.692 --> 1:03:55.503
The problem is for particle sentences.

1:03:55.503 --> 1:03:59.611
There's not training data, so it's hard to
find all our.

1:04:00.580 --> 1:04:06.531
Hi, I'm ransom quite easy to generate artificial
pottery scent or at least for the source.

1:04:06.926 --> 1:04:15.367
So you just take, you take all the prefixes
of the source data.

1:04:17.017 --> 1:04:22.794
On the problem of course, with a bit what
do you know lying?

1:04:22.794 --> 1:04:30.845
If you have a sentence, I encourage all of
what should be the right target for that.

1:04:31.491 --> 1:04:45.381
And the constraints on the one hand, it should
be as long as possible, so you always have

1:04:45.381 --> 1:04:47.541
a long delay.

1:04:47.687 --> 1:04:55.556
On the other hand, it should be also a suspect
of the previous ones, and it should be not

1:04:55.556 --> 1:04:57.304
too much inventing.

1:04:58.758 --> 1:05:02.170
A very easy solution works fine.

1:05:02.170 --> 1:05:05.478
You can just do a length space.

1:05:05.478 --> 1:05:09.612
You also take two thirds of the target.

1:05:10.070 --> 1:05:19.626
His learning then implicitly to guess a bit
if you think about the beginning of example.

1:05:20.000 --> 1:05:30.287
This one, if you do two sorts like half, in
this case the target would be eye register.

1:05:30.510 --> 1:05:39.289
So you're doing a bit of implicit guessing,
and if it's getting wrong you have rewriting,

1:05:39.289 --> 1:05:43.581
but you're doing a good amount of guessing.

1:05:49.849 --> 1:05:53.950
In addition, this would be like how it looks
like if it was like.

1:05:53.950 --> 1:05:58.300
If it wasn't a housing game, then the target
could be something like.

1:05:58.979 --> 1:06:02.513
One problem is that you just do that this
way.

1:06:02.513 --> 1:06:04.619
It's most of your training.

1:06:05.245 --> 1:06:11.983
And in the end you're interested in the overall
translation quality, so for full sentence.

1:06:11.983 --> 1:06:19.017
So if you train on that, it will mainly learn
how to translate prefixes because ninety percent

1:06:19.017 --> 1:06:21.535
or more of your data is prefixed.

1:06:22.202 --> 1:06:31.636
That's why we'll see that it's better to do
like a ratio.

1:06:31.636 --> 1:06:39.281
So half your training data are full sentences.

1:06:39.759 --> 1:06:47.693
Because if you're doing this well you see
that for every word prefix and only one sentence.

1:06:48.048 --> 1:06:52.252
You also see that nicely here here are both.

1:06:52.252 --> 1:06:56.549
This is the blue scores and you see the bass.

1:06:58.518 --> 1:06:59.618
Is this one?

1:06:59.618 --> 1:07:03.343
It has a good quality because it's trained.

1:07:03.343 --> 1:07:11.385
If you know, train with all the partial sentences
is more focusing on how to translate partial

1:07:11.385 --> 1:07:12.316
sentences.

1:07:12.752 --> 1:07:17.840
Because all the partial sentences will at
some point be removed, because at the end you

1:07:17.840 --> 1:07:18.996
translate the full.

1:07:20.520 --> 1:07:24.079
There's many tasks to read, but you have the
same performances.

1:07:24.504 --> 1:07:26.938
On the other hand, you see here the other
problem.

1:07:26.938 --> 1:07:28.656
This is how many words got updated.

1:07:29.009 --> 1:07:31.579
You want to have as few updates as possible.

1:07:31.579 --> 1:07:34.891
Updates need to remove things which are once
being shown.

1:07:35.255 --> 1:07:40.538
This is quite high for the baseline.

1:07:40.538 --> 1:07:50.533
If you know the partials that are going down,
they should be removed.

1:07:51.151 --> 1:07:58.648
And then for moody tasks you have a bit like
the best note of swim.

1:08:02.722 --> 1:08:05.296
Any more questions to this type of.

1:08:09.309 --> 1:08:20.760
The last thing is that you want to do an extremely.

1:08:21.541 --> 1:08:23.345
Again, it's a bit implication.

1:08:23.345 --> 1:08:25.323
Scenario is what you really want.

1:08:25.323 --> 1:08:30.211
As you said, we sometimes use this updating,
and for text output it'd be very nice.

1:08:30.211 --> 1:08:35.273
But imagine if you want to audio output, of
course you can't change it anymore because

1:08:35.273 --> 1:08:37.891
on one side you cannot change what was said.

1:08:37.891 --> 1:08:40.858
So in this time you more need like a fixed
output.

1:08:41.121 --> 1:08:47.440
And then the style of street decoding is interesting.

1:08:47.440 --> 1:08:55.631
Where you, for example, get sourced, the seagullins
are so stoked in.

1:08:55.631 --> 1:09:00.897
Then you decide oh, now it's better to wait.

1:09:01.041 --> 1:09:14.643
So you somehow need to have this type of additional
information.

1:09:15.295 --> 1:09:23.074
Here you have to decide should know I'll put
a token or should wait for my and feel.

1:09:26.546 --> 1:09:32.649
So you have to do this additional labels like
weight, weight, output, output, wage and so

1:09:32.649 --> 1:09:32.920
on.

1:09:33.453 --> 1:09:38.481
There are different ways of doing that.

1:09:38.481 --> 1:09:45.771
You can have an additional model that does
this decision.

1:09:46.166 --> 1:09:53.669
And then have a higher quality or better to
continue and then have a lower latency in this

1:09:53.669 --> 1:09:54.576
different.

1:09:55.215 --> 1:09:59.241
Surprisingly, a very easy task also works,
sometimes quite good.

1:10:03.043 --> 1:10:10.981
And that is the so called way care policy
and the idea is there at least for text to

1:10:10.981 --> 1:10:14.623
text translation that is working well.

1:10:14.623 --> 1:10:22.375
It's like you wait for words and then you
always output one and like one for each.

1:10:22.682 --> 1:10:28.908
So your weight slow works at the beginning
of the sentence, and every time a new board

1:10:28.908 --> 1:10:29.981
is coming you.

1:10:31.091 --> 1:10:39.459
So you have the same times to beat as input,
so you're not legging more or less, but to

1:10:39.459 --> 1:10:41.456
have enough context.

1:10:43.103 --> 1:10:49.283
Of course this for example for the unmarried
will not solve it perfectly but if you have

1:10:49.283 --> 1:10:55.395
a bit of local reordering inside your token
that you can manage very well and then it's

1:10:55.395 --> 1:10:57.687
a very simple solution but it's.

1:10:57.877 --> 1:11:00.481
The other one was dynamic.

1:11:00.481 --> 1:11:06.943
Depending on the context you can decide how
long you want to wait.

1:11:07.687 --> 1:11:21.506
It also only works if you have a similar amount
of tokens, so if your target is very short

1:11:21.506 --> 1:11:22.113
of.

1:11:22.722 --> 1:11:28.791
That's why it's also more challenging for
audio input because the speaking rate is changing

1:11:28.791 --> 1:11:29.517
and so on.

1:11:29.517 --> 1:11:35.586
You would have to do something like I'll output
a word for every second a year or something

1:11:35.586 --> 1:11:35.981
like.

1:11:36.636 --> 1:11:45.459
The problem is that the audio speaking speed
is not like fixed but quite very, and therefore.

1:11:50.170 --> 1:11:58.278
Therefore, what you can also do is you can
use a similar solution than we had before with

1:11:58.278 --> 1:11:59.809
the resetteling.

1:12:00.080 --> 1:12:02.904
You remember we were re-decoded all the time.

1:12:03.423 --> 1:12:12.253
And you can do something similar in this case
except that you add something in that you're

1:12:12.253 --> 1:12:16.813
saying, oh, if I read it cold, I'm not always.

1:12:16.736 --> 1:12:22.065
Can decode as I want, but you can do this
target prefix decoding, so what you say is

1:12:22.065 --> 1:12:23.883
in your achievement section.

1:12:23.883 --> 1:12:26.829
You can easily say generate a translation
bus.

1:12:27.007 --> 1:12:29.810
The translation has to start with the prefix.

1:12:31.251 --> 1:12:35.350
How can you do that?

1:12:39.839 --> 1:12:49.105
In the decoder exactly you start, so if you
do beam search you select always the most probable.

1:12:49.349 --> 1:12:57.867
And now you say oh, I'm not selecting the
most perfect, but this is the fourth, so in

1:12:57.867 --> 1:13:04.603
the first step have to take this one, in the
second start decoding.

1:13:04.884 --> 1:13:09.387
And then you're making sure that your second
always starts with this prefix.

1:13:10.350 --> 1:13:18.627
And then you can use your immediate retranslation,
but you're no longer changing the output.

1:13:19.099 --> 1:13:31.595
Out as it works, so it may get a speech signal
and input, and it is not outputing any.

1:13:32.212 --> 1:13:45.980
So then if you got you get a translation maybe
and then you decide yes output.

1:13:46.766 --> 1:13:54.250
And then you're translating as one as two
as sweet as four, but now you say generate

1:13:54.250 --> 1:13:55.483
only outputs.

1:13:55.935 --> 1:14:07.163
And then you're translating and maybe you're
deciding on and now a good translation.

1:14:07.163 --> 1:14:08.880
Then you're.

1:14:09.749 --> 1:14:29.984
Yes, but don't get to worry about what the
effect is.

1:14:30.050 --> 1:14:31.842
We're generating your target text.

1:14:32.892 --> 1:14:36.930
But we're not always outputing the full target
text now.

1:14:36.930 --> 1:14:43.729
What we are having is we have here some strategy
to decide: Oh, is a system already sure enough

1:14:43.729 --> 1:14:44.437
about it?

1:14:44.437 --> 1:14:49.395
If it's sure enough and it has all the information,
we can output it.

1:14:49.395 --> 1:14:50.741
And then the next.

1:14:51.291 --> 1:14:55.931
If we say here sometimes with better not to
get output we won't output it already.

1:14:57.777 --> 1:15:06.369
And thereby the hope is in the uphill model
should not yet outcut a register because it

1:15:06.369 --> 1:15:10.568
doesn't mean no yet if it's a case or not.

1:15:13.193 --> 1:15:18.056
So what we have to discuss is what is a good
output strategy.

1:15:18.658 --> 1:15:20.070
So you could do.

1:15:20.070 --> 1:15:23.806
The output strategy could be something like.

1:15:23.743 --> 1:15:39.871
If you think of weight cape, this is an output
strategy here that you always input.

1:15:40.220 --> 1:15:44.990
Good, and you can view your weight in a similar
way as.

1:15:45.265 --> 1:15:55.194
But now, of course, we can also look at other
output strategies where it's more generic and

1:15:55.194 --> 1:15:59.727
it's deciding whether in some situations.

1:16:01.121 --> 1:16:12.739
And one thing that works quite well is referred
to as local agreement, and that means you're

1:16:12.739 --> 1:16:13.738
always.

1:16:14.234 --> 1:16:26.978
Then you're looking what is the same thing
between my current translation and the one

1:16:26.978 --> 1:16:28.756
did before.

1:16:29.349 --> 1:16:31.201
So let's do that again in six hours.

1:16:31.891 --> 1:16:45.900
So your input is a first audio segment and
your title text is all model trains.

1:16:46.346 --> 1:16:53.231
Then you're getting six opposites, one and
two, and this time the output is all models.

1:16:54.694 --> 1:17:08.407
You see trains are different, but both of
them agree that it's all so in those cases.

1:17:09.209 --> 1:17:13.806
So we can be hopefully a big show that really
starts with all.

1:17:15.155 --> 1:17:22.604
So now we say we're output all, so at this
time instead we'll output all, although before.

1:17:23.543 --> 1:17:27.422
We are getting one, two, three as input.

1:17:27.422 --> 1:17:35.747
This time we have a prefix, so now we are
only allowing translations to start with all.

1:17:35.747 --> 1:17:42.937
We cannot change that anymore, so we now need
to generate some translation.

1:17:43.363 --> 1:17:46.323
And then it can be that its now all models
are run.

1:17:47.927 --> 1:18:01.908
Then we compare here and see this agrees on
all models so we can output all models.

1:18:02.882 --> 1:18:07.356
So this by we can dynamically decide is a
model is very anxious.

1:18:07.356 --> 1:18:10.178
We always talk with something different.

1:18:11.231 --> 1:18:24.872
Then it's, we'll wait longer, it's more for
the same thing, and hope we don't need to wait.

1:18:30.430 --> 1:18:40.238
Is it clear again that the signal wouldn't
be able to detect?

1:18:43.203 --> 1:18:50.553
The hope it is because if it's not sure of,
of course, it in this kind would have to switch

1:18:50.553 --> 1:18:51.671
all the time.

1:18:56.176 --> 1:19:01.375
So if it would be the first step to register
and the second time to cancel and they may

1:19:01.375 --> 1:19:03.561
register again, they wouldn't do it.

1:19:03.561 --> 1:19:08.347
Of course, it is very short because in register
a long time, then it can't deal.

1:19:08.568 --> 1:19:23.410
That's why there's two parameters that you
can use and which might be important, or how.

1:19:23.763 --> 1:19:27.920
So you do it like every one second, every
five seconds or something like that.

1:19:28.648 --> 1:19:37.695
Put it more often as your latency will be
because your weight is less long, but also

1:19:37.695 --> 1:19:39.185
you might do.

1:19:40.400 --> 1:19:50.004
So that is the one thing and the other thing
is for words you might do everywhere, but if

1:19:50.004 --> 1:19:52.779
you think about audio it.

1:19:53.493 --> 1:20:04.287
And the other question you can do like the
agreement, so the model is sure.

1:20:04.287 --> 1:20:10.252
If you say have to agree, then hopefully.

1:20:10.650 --> 1:20:21.369
What we saw is think there has been a really
normally good performance and otherwise your

1:20:21.369 --> 1:20:22.441
latency.

1:20:22.963 --> 1:20:42.085
Okay, we'll just make more tests and we'll
get the confidence.

1:20:44.884 --> 1:20:47.596
Have to completely agree with that.

1:20:47.596 --> 1:20:53.018
So when this was done, that was our first
idea of using the confidence.

1:20:53.018 --> 1:21:00.248
The problem is that currently that's my assumption
is that the modeling the model confidence is

1:21:00.248 --> 1:21:03.939
not that easy, and they are often overconfident.

1:21:04.324 --> 1:21:17.121
In the paper there is this type also where
you try to use the confidence in some way to

1:21:17.121 --> 1:21:20.465
decide the confidence.

1:21:21.701 --> 1:21:26.825
But that gave worse results, and that's why
we looked into that.

1:21:27.087 --> 1:21:38.067
So it's a very good idea think, but it seems
not to at least how it was implemented.

1:21:38.959 --> 1:21:55.670
There is one way that maybe goes in more direction,
which is very new.

1:21:55.455 --> 1:22:02.743
If this one, the last word is attending mainly
to the end of the audio.

1:22:02.942 --> 1:22:04.934
You might you should not output it yet.

1:22:05.485 --> 1:22:15.539
Because they might think there is something
more missing than you need to know, so they

1:22:15.539 --> 1:22:24.678
look at the attention and only output parts
which look to not the audio signal.

1:22:25.045 --> 1:22:40.175
So there is, of course, a lot of ways how
you can do it better or easier in some way.

1:22:41.901 --> 1:22:53.388
Instead tries to predict the next word with
a large language model, and then for text translation

1:22:53.388 --> 1:22:54.911
you predict.

1:22:55.215 --> 1:23:01.177
Then you translate all of them and decide
if there is a change so you can even earlier

1:23:01.177 --> 1:23:02.410
do your decision.

1:23:02.362 --> 1:23:08.714
The idea is that if we continue and then this
will be to a change in the translation, then

1:23:08.714 --> 1:23:10.320
we should have opened.

1:23:10.890 --> 1:23:18.302
So it's more doing your estimate about possible
continuations of the source instead of looking

1:23:18.302 --> 1:23:19.317
at previous.

1:23:23.783 --> 1:23:31.388
All that works is a bit here like one example.

1:23:31.388 --> 1:23:39.641
It has a legacy baselines and you are not
putting.

1:23:40.040 --> 1:23:47.041
And you see in this case you have worse blood
scores here.

1:23:47.041 --> 1:23:51.670
For equal one you have better latency.

1:23:52.032 --> 1:24:01.123
The how to and how does anybody have an idea
of what could be challenging there or when?

1:24:05.825 --> 1:24:20.132
One problem of these models are hallucinations,
and often very long has a negative impact on.

1:24:24.884 --> 1:24:30.869
If you don't remove the last four words but
your model now starts to hallucinate and invent

1:24:30.869 --> 1:24:37.438
just a lot of new stuff then yeah you're removing
the last four words of that but if it has invented

1:24:37.438 --> 1:24:41.406
ten words and you're still outputting six of
these invented.

1:24:41.982 --> 1:24:48.672
Typically once it starts hallucination generating
some output, it's quite long, so then it's

1:24:48.672 --> 1:24:50.902
no longer enough to just hold.

1:24:51.511 --> 1:24:57.695
And then, of course, a bit better if you compare
to the previous ones.

1:24:57.695 --> 1:25:01.528
Their destinations are typically different.

1:25:07.567 --> 1:25:25.939
Yes, so we don't talk about the details, but
for outputs, for presentations, there's different

1:25:25.939 --> 1:25:27.100
ways.

1:25:27.347 --> 1:25:36.047
So you want to have maximum two lines, maximum
forty-two characters per line, and the reading

1:25:36.047 --> 1:25:40.212
speed is a maximum of twenty-one characters.

1:25:40.981 --> 1:25:43.513
How to Do That We Can Skip.

1:25:43.463 --> 1:25:46.804
Then you can generate something like that.

1:25:46.886 --> 1:25:53.250
Another challenge is, of course, that you
not only need to generate the translation,

1:25:53.250 --> 1:25:59.614
but for subtlyning you also want to generate
when to put breaks and what to display.

1:25:59.619 --> 1:26:06.234
Because it cannot be full sentences, as said
here, if you have like maximum twenty four

1:26:06.234 --> 1:26:10.443
characters per line, that's not always a full
sentence.

1:26:10.443 --> 1:26:12.247
So how can you make it?

1:26:13.093 --> 1:26:16.253
And then for speech there's not even a hint
of wisdom.

1:26:18.398 --> 1:26:27.711
So what we have done today is yeah, we looked
into maybe three challenges: We have this segmentation,

1:26:27.711 --> 1:26:33.013
which is a challenge both in evaluation and
in the decoder.

1:26:33.013 --> 1:26:40.613
We talked about disfluencies and we talked
about simultaneous translations and how to

1:26:40.613 --> 1:26:42.911
address these challenges.

1:26:43.463 --> 1:26:45.507
Any more questions.

1:26:48.408 --> 1:26:52.578
Good then new content.

1:26:52.578 --> 1:26:58.198
We are done for this semester.

1:26:58.198 --> 1:27:04.905
You can keep your knowledge in that.

1:27:04.744 --> 1:27:09.405
Repetition where we can try to repeat a bit
what we've done all over the semester.

1:27:10.010 --> 1:27:13.776
Now prepare a bit of repetition to what think
is important.

1:27:14.634 --> 1:27:21.441
But of course is also the chance for you to
ask specific questions.

1:27:21.441 --> 1:27:25.445
It's not clear to me how things relate.

1:27:25.745 --> 1:27:34.906
So if you have any specific questions, please
come to me or send me an email or so, then

1:27:34.906 --> 1:27:36.038
I'm happy.

1:27:36.396 --> 1:27:46.665
If should focus on it really in depth, it
might be good not to come and send me an email

1:27:46.665 --> 1:27:49.204
on Wednesday evening.