WEBVTT | |
0:00:01.121 --> 0:00:14.214 | |
Okay, so welcome to today's lecture, on Tuesday | |
we started to talk about speech translation. | |
0:00:14.634 --> 0:00:27.037 | |
And hopefully you got an idea of the basic | |
ideas we have in speech translation, the two | |
0:00:27.037 --> 0:00:29.464 | |
major approaches. | |
0:00:29.829 --> 0:00:41.459 | |
And the other one is the end-to-end system, | |
where we have one large system which does everything | |
0:00:41.459 --> 0:00:42.796 | |
together. | |
0:00:43.643 --> 0:00:58.459 | |
Until now we have mainly focused on text output, | |
as we'll see today, but you can extend these ideas | |
0:00:58.459 --> 0:01:01.138 | |
to speech output as well. | |
0:01:01.441 --> 0:01:08.592 | |
But since it's also a machine translation | |
lecture, we of course mainly focus on | |
0:01:08.592 --> 0:01:10.768 | |
the translation challenges. | |
0:01:12.172 --> 0:01:25.045 | |
The main focus of today's lecture | |
is to look into why speech | |
0:01:25.045 --> 0:01:26.845 | |
translation is challenging. | |
0:01:27.627 --> 0:01:33.901 | |
So a bit more focus on what is now really | |
different from text translation and how we can address it. | |
0:01:34.254 --> 0:01:39.703 | |
We'll start with the segmentation | |
problem. | |
0:01:39.703 --> 0:01:45.990 | |
We touched on that already a bit, but it is especially | |
important for end-to-end systems. | |
0:01:46.386 --> 0:01:57.253 | |
So the problem is that until now it was easy | |
to segment the input into sentences and then | |
0:01:57.253 --> 0:02:01.842 | |
translate each sentence individually. | |
0:02:02.442 --> 0:02:17.561 | |
When you're now translating audio, the challenge | |
is that you have just a sequence of audio input | |
0:02:17.561 --> 0:02:20.055 | |
and there's no sentence boundary given. | |
0:02:21.401 --> 0:02:27.834 | |
So you have this difference that your audio | |
is a continuous stream, but the text is typically | |
0:02:27.834 --> 0:02:28.930 | |
sentence based. | |
0:02:28.930 --> 0:02:31.667 | |
So how can you match this gap in there? | |
0:02:31.667 --> 0:02:37.690 | |
We'll see that this is really essential, and if | |
you're not using a decent segmentation system there, | |
0:02:37.690 --> 0:02:41.249 | |
then you can lose a lot of quality and performance. | |
0:02:41.641 --> 0:02:44.267 | |
That is what also meant before. | |
0:02:44.267 --> 0:02:51.734 | |
So if you have a more complex system out of | |
several units, it's really essential that they | |
0:02:51.734 --> 0:02:56.658 | |
all work together and it's very easy to lose | |
significantly. | |
0:02:57.497 --> 0:03:13.029 | |
The second challenge we'll talk about is disfluencies, | |
so the style of speaking is very different | |
0:03:13.029 --> 0:03:14.773 | |
from text. | |
0:03:15.135 --> 0:03:24.727 | |
So if you translate TED talks, those are normally | |
very good speakers. | |
0:03:24.727 --> 0:03:30.149 | |
They will give you a very fluent text. | |
0:03:30.670 --> 0:03:36.692 | |
When you want to translate a lecture, it might | |
be more difficult because it is less rehearsed. | |
0:03:37.097 --> 0:03:39.242 | |
I mean, people are, well, they should | |
0:03:39.242 --> 0:03:42.281 | |
be prepared when giving the lecture. | |
0:03:42.362 --> 0:03:48.241 | |
But it's not like a lecturer typically | |
rehearses like five times before | |
0:03:48.241 --> 0:03:52.682 | |
giving this lecture, so will | |
it be completely fluent? | |
0:03:52.682 --> 0:03:56.122 | |
He might at some point notice: oh, this is | |
not perfect. | |
0:03:56.122 --> 0:04:00.062 | |
I want to rephrase, and he'll have to think | |
during the lecture. | |
0:04:00.300 --> 0:04:04.049 | |
Might be also good that he's thinking, so | |
he's not going too fast and things like. | |
0:04:05.305 --> 0:04:07.933 | |
If you then go to the other extreme, it's | |
more meetings. | |
0:04:08.208 --> 0:04:15.430 | |
If you have a lively discussion, of course, | |
people will interrupt, they will restart, they | |
0:04:15.430 --> 0:04:22.971 | |
will think while they speak, and you know that | |
sometimes you tell people first think and speak | |
0:04:22.971 --> 0:04:26.225 | |
because they are changing their opinion. | |
0:04:26.606 --> 0:04:31.346 | |
So the question of how can you deal with this? | |
0:04:31.346 --> 0:04:37.498 | |
And there again there might be solutions for | |
that, or at least ways to mitigate it. | |
0:04:39.759 --> 0:04:46.557 | |
Then for the output we will look into simultaneous | |
translation that is at least not very important | |
0:04:46.557 --> 0:04:47.175 | |
in text. | |
0:04:47.175 --> 0:04:53.699 | |
There might be some cases, but normally you | |
have the whole text available and then you're translating | |
0:04:53.699 --> 0:04:54.042 | |
it. | |
0:04:54.394 --> 0:05:09.220 | |
While for speech translation, since it's often | |
a live interaction, it is of course important. | |
0:05:09.149 --> 0:05:12.378 | |
Otherwise it's hard to follow. | |
0:05:12.378 --> 0:05:19.463 | |
You see what said five minutes ago and the | |
slide is not as helpful. | |
0:05:19.739 --> 0:05:35.627 | |
You have to wait very long before you can | |
answer because you have to first wait for what | |
0:05:35.627 --> 0:05:39.197 | |
is happening there. | |
0:05:40.660 --> 0:05:46.177 | |
And finally, we can talk a bit about presentation. | |
0:05:46.177 --> 0:05:54.722 | |
For example, I mentioned that if you're generating | |
subtitles, it's not possible to show arbitrarily long text. | |
0:05:54.854 --> 0:06:01.110 | |
So in professional subtitles there are clear | |
rules. | |
0:06:01.110 --> 0:06:05.681 | |
A subtitle has to be shown for a certain number of seconds. | |
0:06:05.681 --> 0:06:08.929 | |
It's a maximum of two lines. | |
0:06:09.549 --> 0:06:13.156 | |
Because otherwise it's getting too long and people are | |
not able to read it anymore. | |
0:06:13.613 --> 0:06:19.826 | |
So if you want to achieve that, of course, | |
you might have to adjust and select what you | |
0:06:19.826 --> 0:06:20.390 | |
really show. | |
0:06:23.203 --> 0:06:28.393 | |
The first part is the segmentation. | |
0:06:28.393 --> 0:06:36.351 | |
On the one hand it's an issue while training, | |
on the other hand during inference. | |
0:06:38.678 --> 0:06:47.781 | |
What is the problem so when we train it's | |
relatively easy to separate our data into sentence | |
0:06:47.781 --> 0:06:48.466 | |
level. | |
0:06:48.808 --> 0:07:02.241 | |
So if you have your example, you have the | |
audio and the text, then you typically know | |
0:07:02.241 --> 0:07:07.083 | |
that this sentence is aligned. | |
0:07:07.627 --> 0:07:16.702 | |
You can use this time information to cut | |
your audio, and then you can train on it. | |
0:07:18.018 --> 0:07:31.775 | |
Because what we need for an end-to-end model | |
is an input-output pair, in this case an audio | |
0:07:31.775 --> 0:07:32.822 | |
chunk and its translation. | |
0:07:33.133 --> 0:07:38.551 | |
And even if this is a long speech, it's easy | |
then since we have this time information to | |
0:07:38.551 --> 0:07:39.159 | |
separate. | |
0:07:39.579 --> 0:07:43.866 | |
But we are using therefore, of course, the | |
target side information. | |
0:07:45.865 --> 0:07:47.949 | |
The problem is now in runtime. | |
0:07:47.949 --> 0:07:49.427 | |
This is not possible. | |
0:07:49.427 --> 0:07:55.341 | |
In training we can do that based on the punctuation | |
marks and the sentence segmentation on the | |
0:07:55.341 --> 0:07:57.962 | |
target side, because that gives the splitting. | |
0:07:57.962 --> 0:08:02.129 | |
But during transcription, during translation, | |
it is not possible. | |
0:08:02.442 --> 0:08:10.288 | |
Because there is just a long audio signal, | |
and of course you could manually | |
0:08:10.288 --> 0:08:15.193 | |
split your test data: that has been done for some | |
experiments. | |
0:08:15.193 --> 0:08:22.840 | |
It's fine, but it's not a realistic scenario | |
because if you really apply it in real world, | |
0:08:22.840 --> 0:08:25.949 | |
we won't have a manual segmentation. | |
0:08:26.266 --> 0:08:31.838 | |
If a human has to do that, then he could just as well do the | |
translation, so you want to have a fully automatic | |
0:08:31.838 --> 0:08:32.431 | |
pipeline. | |
0:08:32.993 --> 0:08:38.343 | |
So the question is how can we deal with this | |
type of situation? | |
0:09:09.309 --> 0:09:20.232 | |
So the question is how can we deal with this | |
type of situation and how can we segment the | |
0:09:20.232 --> 0:09:23.024 | |
audio into some units? | |
0:09:23.863 --> 0:09:32.495 | |
And here is one further really big advantage | |
of a cascaded system: how is this done | |
0:09:32.495 --> 0:09:34.259 | |
in a cascaded system? | |
0:09:34.259 --> 0:09:38.494 | |
We are splitting the audio based on some features | |
we compute. | |
0:09:38.494 --> 0:09:42.094 | |
We can use similar ones which we'll discuss | |
later. | |
0:09:42.094 --> 0:09:43.929 | |
Then we run the ASR. | |
0:09:43.929 --> 0:09:48.799 | |
We have the transcript and then we can do | |
what we talked about last time. | |
0:09:49.069 --> 0:10:02.260 | |
So here you have an audio signal, and in | |
the training data it was segmented well. | |
0:10:02.822 --> 0:10:07.951 | |
So here we have a big advantage. | |
0:10:07.951 --> 0:10:16.809 | |
We can use a different segmentation for the | |
speech recognition and for the translation. | |
0:10:16.809 --> 0:10:21.316 | |
Why is that a big advantage? | |
0:10:23.303 --> 0:10:34.067 | |
I would say for the MT task it is more important, | |
because we can then do the sentence segmentation. | |
0:10:34.955 --> 0:10:37.603 | |
See and Yeah, We Can Do the Same Thing. | |
0:10:37.717 --> 0:10:40.226 | |
For the ASR, why is it not as important | |
there? | |
0:10:40.226 --> 0:10:40.814 | |
Any ideas? | |
0:10:43.363 --> 0:10:48.589 | |
We don't need that much context. | |
0:10:48.589 --> 0:11:01.099 | |
We only try to recognize the words, and the | |
context to consider is mainly small. | |
0:11:03.283 --> 0:11:11.419 | |
I would agree that more context helps, but there | |
is one more important thing: | |
0:11:11.651 --> 0:11:16.764 | |
The ASR is monotone, so there's no reordering. | |
0:11:16.764 --> 0:11:22.472 | |
The second part of the signal is transcribed after the first, | |
0:11:22.472 --> 0:11:23.542 | |
with no reordering. | |
0:11:23.683 --> 0:11:29.147 | |
And of course if we are doing that we cannot | |
really reorder across boundaries between segments. | |
0:11:29.549 --> 0:11:37.491 | |
It might be challenging for the ASR if we split within | |
words, so it's not completely irrelevant there either. | |
0:11:37.637 --> 0:11:40.846 | |
But in MT we need to do quite long-range reordering. | |
0:11:40.846 --> 0:11:47.058 | |
If you think about German, where the verb | |
is moved to the end: the English verb is in one | |
0:11:47.058 --> 0:11:50.198 | |
segment, but the end of the German sentence is in another. | |
0:11:50.670 --> 0:11:59.427 | |
And of course this advantage we have now here | |
that if we have a segment we have. | |
0:12:01.441 --> 0:12:08.817 | |
And that this segmentation is important. | |
0:12:08.817 --> 0:12:15.294 | |
Here are some motivations for that. | |
0:12:15.675 --> 0:12:25.325 | |
What you are doing is you are taking the reference | |
text and you are segmenting. | |
0:12:26.326 --> 0:12:30.991 | |
And then, of course, your segments are exactly | |
yeah, accurate. | |
0:12:31.471 --> 0:12:42.980 | |
If you're now using different segmentation | |
strategies, you're losing significantly in BLEU | |
0:12:42.980 --> 0:12:44.004 | |
points. | |
0:12:44.004 --> 0:12:50.398 | |
If the segmentation is bad, you have a lot | |
worse. | |
0:12:52.312 --> 0:13:10.323 | |
And interestingly, here you can also see how | |
it was with a manual segmentation, and what people achieved in a competition. | |
0:13:10.450 --> 0:13:22.996 | |
You can see that by working on the segmentation | |
and using better segmentation you can improve | |
0:13:22.996 --> 0:13:25.398 | |
your performance. | |
0:13:26.006 --> 0:13:29.932 | |
So it's really essential. | |
0:13:29.932 --> 0:13:41.712 | |
One other interesting thing is if you're looking | |
into the difference between cascaded and end-to-end systems. | |
0:13:42.082 --> 0:13:49.145 | |
So it really seems to be more important to | |
have a good segmentation | |
0:13:49.109 --> 0:13:56.248 | |
for an end-to-end system, because there you | |
can't re-segment, while it is less important | |
0:13:56.248 --> 0:13:58.157 | |
for a cascaded system. | |
0:13:58.157 --> 0:14:05.048 | |
Of course, it's still important, but the difference | |
between the two segmentations is smaller. | |
0:14:06.466 --> 0:14:18.391 | |
This was a shared task some years ago, so these are | |
systems from different participants. | |
0:14:22.122 --> 0:14:31.934 | |
So the question is how can we deal with this | |
in speech translation and what people look | |
0:14:31.934 --> 0:14:32.604 | |
into? | |
0:14:32.752 --> 0:14:48.360 | |
Now we want to use different techniques to | |
split the audio signal into segments. | |
0:14:48.848 --> 0:14:54.413 | |
For end-to-end you have the disadvantage that you can't change | |
it afterwards. | |
0:14:54.413 --> 0:15:00.407 | |
Therefore, the segmentation quality might be even more | |
important. | |
0:15:00.660 --> 0:15:15.678 | |
But in both cases, of course, the results are | |
better if you have a good segmentation. | |
0:15:17.197 --> 0:15:23.149 | |
So any idea, how would you have this task | |
now split this audio? | |
0:15:23.149 --> 0:15:26.219 | |
What type of tool would you use? | |
0:15:28.648 --> 0:15:41.513 | |
You could use a neural network to segment, | |
for instance supervised. | |
0:15:41.962 --> 0:15:44.693 | |
Yes, that's exactly already the better system. | |
0:15:44.693 --> 0:15:50.390 | |
So for a long time people have done simpler | |
things because, as we'll come to, it's a bit challenging | |
0:15:50.390 --> 0:15:52.250 | |
to create or obtain the data. | |
0:15:53.193 --> 0:16:00.438 | |
The first thing is you use some tool out of | |
the box like voice activity detection which | |
0:16:00.438 --> 0:16:07.189 | |
has been a whole research field of its own: | |
finding when somebody's speaking. | |
0:16:07.647 --> 0:16:14.952 | |
And then you use that with some threshold: | |
you always have the probability that somebody's | |
0:16:14.952 --> 0:16:16.273 | |
speaking or not. | |
0:16:17.217 --> 0:16:19.889 | |
Then you split your signal. | |
0:16:19.889 --> 0:16:26.762 | |
It will not be perfect, but you transcribe | |
or translate each component. | |
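To make the threshold idea concrete, here is a minimal, illustrative sketch of energy-based splitting (a real VAD toolkit is more robust; the frame length, threshold, and pause length here are made-up values):

```python
import numpy as np

def simple_vad_segments(samples, rate, frame_ms=30, threshold=0.01, min_pause_frames=10):
    """Toy energy-based voice activity detection.

    Marks a frame as speech when its RMS energy exceeds a threshold,
    then merges consecutive speech frames into segments.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energy = np.array([
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    is_speech = energy > threshold

    segments, start, pause = [], None, 0
    for i, s in enumerate(is_speech):
        if s:
            if start is None:
                start = i
            pause = 0
        elif start is not None:
            pause += 1
            if pause >= min_pause_frames:  # silence long enough: close the segment
                segments.append((start * frame_len, (i - pause + 1) * frame_len))
                start, pause = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments  # list of (start_sample, end_sample) pairs
```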
0:16:28.508 --> 0:16:39.337 | |
But as you see, a supervised classification | |
task is even better, and that is now the most | |
0:16:39.337 --> 0:16:40.781 | |
commonly used approach. | |
0:16:41.441 --> 0:16:49.909 | |
So you are doing that as a supervised | |
classification, and then you try to use this | |
0:16:49.909 --> 0:16:50.462 | |
type of model. | |
0:16:50.810 --> 0:16:53.217 | |
We're going into a bit more detail on how | |
to do that. | |
0:16:53.633 --> 0:17:01.354 | |
So what you need to do first is, of course, | |
you have to have some labels whether this is | |
0:17:01.354 --> 0:17:03.089 | |
an end of sentence. | |
0:17:03.363 --> 0:17:10.588 | |
You do that by using the alignment between | |
the segments and the audio. | |
0:17:10.588 --> 0:17:12.013 | |
You have the. | |
0:17:12.212 --> 0:17:15.365 | |
Typically you do not have, for each word, | |
these time stamps, | |
0:17:15.365 --> 0:17:16.889 | |
like: this word is said at this time. | |
0:17:17.157 --> 0:17:27.935 | |
What you typically have is: | |
from this time to this time we have the first segment, | |
0:17:27.935 --> 0:17:34.654 | |
from this time to this time the second segment. | |
0:17:35.195 --> 0:17:39.051 | |
Which is also used to train, for example, your | |
ASR system and everything. | |
0:17:41.661 --> 0:17:53.715 | |
Based on that you can label each frame in | |
there: if it is green or blue, that is | |
0:17:53.715 --> 0:17:57.455 | |
our speech segment. | |
0:17:58.618 --> 0:18:05.690 | |
And these labels will then later help you, | |
to extract exactly these types of segments. | |
0:18:07.067 --> 0:18:08.917 | |
There's one big challenge. | |
0:18:08.917 --> 0:18:15.152 | |
If you have two sentences which are directly | |
connected to each other, then if you're doing | |
0:18:15.152 --> 0:18:18.715 | |
this labeling, you would not have a break in | |
between. | |
0:18:18.715 --> 0:18:23.512 | |
If you then tried to extract segments, the two sentences | |
would come out as one. | |
0:18:23.943 --> 0:18:31.955 | |
So what you typically do is: the last frames | |
0:18:31.955 --> 0:18:41.331 | |
you mark as outside, although they're not really | |
outside. | |
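As a rough illustration of this labeling trick, a sketch that turns aligned segment time stamps into per-frame labels and forces a short "outside" break at each segment end (the frame rate and break width are assumptions):

```python
import numpy as np

def frame_labels(segments, n_frames, frames_per_sec=100, boundary_frames=2):
    """Build per-frame speech/non-speech labels from aligned segments.

    segments: list of (start_sec, end_sec) pairs from the audio-text alignment.
    The last `boundary_frames` frames of every segment are relabeled as
    outside (0), so two directly adjacent sentences are still separated
    when we later extract consecutive speech regions.
    """
    labels = np.zeros(n_frames, dtype=np.int64)
    for start_sec, end_sec in segments:
        a = int(start_sec * frames_per_sec)
        b = int(end_sec * frames_per_sec)
        labels[a:b] = 1
        labels[max(a, b - boundary_frames):b] = 0  # force a break at the segment end
    return labels
```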
0:18:43.463 --> 0:18:46.882 | |
Yes, I guess you could also do that more | |
like a BIO scheme. | |
0:18:46.882 --> 0:18:48.702 | |
I mean, this is the most simple. | |
0:18:48.702 --> 0:18:51.514 | |
It's like inside outside, so it's related | |
to that. | |
0:18:51.514 --> 0:18:54.988 | |
Of course, you could have an extra start-of- | |
segment label, and so on. | |
0:18:54.988 --> 0:18:57.469 | |
I guess this is just to make it more simple. | |
0:18:57.469 --> 0:19:00.226 | |
You only have two labels, not a three-class problem. | |
0:19:00.226 --> 0:19:02.377 | |
But yeah, you could do similar things. | |
0:19:12.432 --> 0:19:20.460 | |
Could that down the road cause problems? Because | |
it could be an important part of a segment | |
0:19:20.460 --> 0:19:24.429 | |
which has some meaning, and we throw it away. | |
0:19:24.429 --> 0:19:28.398 | |
The good thing is frames are normally very short. | |
0:19:28.688 --> 0:19:37.586 | |
Like some milliseconds, so normally if you | |
remove some milliseconds you can still understand | |
0:19:37.586 --> 0:19:38.734 | |
everything. | |
0:19:38.918 --> 0:19:46.999 | |
I mean, the speech signal is very redundant, | |
and so you have information a lot of times. | |
0:19:47.387 --> 0:19:50.730 | |
That's why we talked along there last time | |
they could try to shrink the steak and. | |
0:19:51.031 --> 0:20:00.995 | |
But if you now have a short sound there | |
which would be removed, isn't that | |
0:20:00.995 --> 0:20:01.871 | |
a problem? | |
0:20:02.162 --> 0:20:06.585 | |
Yeah, but it's not a full letter that is missing. | |
0:20:06.585 --> 0:20:11.009 | |
It's only the last ending of the vowel. | |
0:20:11.751 --> 0:20:15.369 | |
I think it doesn't really hurt. | |
0:20:15.369 --> 0:20:23.056 | |
We have our audio signal and we have these | |
gaps that are labeled outside. | |
0:20:23.883 --> 0:20:29.288 | |
The blue rectangles are the inside-speech | |
segments, and the gaps are the outside parts. | |
0:20:29.669 --> 0:20:35.736 | |
So then you have the full signal and you're | |
now framing your task as a blue-or- | |
0:20:35.736 --> 0:20:36.977 | |
white prediction. | |
0:20:36.977 --> 0:20:39.252 | |
So that is your prediction task. | |
0:20:39.252 --> 0:20:44.973 | |
You have the audio signal only and your prediction | |
task is like label one or zero. | |
0:20:45.305 --> 0:20:55.585 | |
Once you do that then based on this labeling | |
you can extract each segment again like each | |
0:20:55.585 --> 0:20:58.212 | |
consecutive blue area. | |
0:20:58.798 --> 0:21:05.198 | |
You then maybe already remove the non-speaking parts | |
and do speech translation only on | |
0:21:05.198 --> 0:21:05.998 | |
the speech parts. | |
0:21:06.786 --> 0:21:19.768 | |
Which is good because the training would have | |
done similarly. | |
0:21:20.120 --> 0:21:26.842 | |
So on the noise in between you never saw in | |
the training, so it's good to throw it away. | |
0:21:29.649 --> 0:21:34.930 | |
One challenge, of course, is now if you're | |
doing that, what is your input? | |
0:21:34.930 --> 0:21:40.704 | |
You cannot do the sequence labeling normally | |
on the whole talk, so it's too long. | |
0:21:40.704 --> 0:21:46.759 | |
So if you're doing this prediction of the | |
label, you also have a window for which you | |
0:21:46.759 --> 0:21:48.238 | |
do the segmentation. | |
0:21:48.788 --> 0:21:54.515 | |
And that's the same problem we had in the punctuation | |
prediction. | |
0:21:54.515 --> 0:22:00.426 | |
If we don't have good borders, random splits | |
are normally fine. | |
0:22:00.426 --> 0:22:03.936 | |
So what we do now is randomly split the audio. | |
0:22:04.344 --> 0:22:09.134 | |
So that would be our input, and then the part | |
three would be our labels. | |
0:22:09.269 --> 0:22:15.606 | |
This green would be the input and here we | |
want, for example, blue labels and then white. | |
0:22:16.036 --> 0:22:20.360 | |
Here only do labors and here at the beginning | |
why maybe at the end why. | |
0:22:21.401 --> 0:22:28.924 | |
So thereby you have now a fixed window always | |
for which you're doing this task of predicting. | |
0:22:33.954 --> 0:22:43.914 | |
How do you build your classifier? That is based | |
again on pretrained models. | |
0:22:43.914 --> 0:22:52.507 | |
We had this wav2vec model mentioned last week. | |
0:22:52.752 --> 0:23:00.599 | |
So in training you use labels to say whether | |
it's in speech or outside speech. | |
0:23:01.681 --> 0:23:17.740 | |
At inference you always give it the chunks | |
and then predict, for each | |
0:23:17.740 --> 0:23:20.843 | |
frame, the label. | |
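A minimal sketch of such a classifier, assuming a HuggingFace-style pretrained wav2vec 2.0 encoder (the checkpoint name is illustrative; training would pair audio windows with the frame labels from above under a per-frame cross-entropy loss):

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechSegmenter(nn.Module):
    """Binary speech/non-speech frame classifier on top of wav2vec 2.0."""

    def __init__(self, encoder_name="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_values):
        # input_values: (batch, samples) raw 16 kHz waveform window
        hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, dim)
        return self.classifier(hidden)  # per-frame logits: inside / outside speech
```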
0:23:23.143 --> 0:23:29.511 | |
It's a bit more complicated; one challenge is | |
that if you randomly split, you're losing | |
0:23:29.511 --> 0:23:32.028 | |
your context for the first frame. | |
0:23:32.028 --> 0:23:38.692 | |
It might be very hard to predict whether this | |
is now inside or outside speech, and also for the last frame. | |
0:23:39.980 --> 0:23:48.449 | |
You often need a bit of context to decide whether this | |
is speech or not, and at the beginning that is missing. | |
0:23:49.249 --> 0:23:59.563 | |
So what you do is you put the audio in twice. | |
0:23:59.563 --> 0:24:08.532 | |
You do it with two different splits. | |
0:24:08.788 --> 0:24:15.996 | |
As is shown, you have two shifted offsets, | |
so each position is also predicted with the other offset. | |
0:24:16.416 --> 0:24:23.647 | |
And then you average the probabilities, so that | |
at each time you have, for at least one of | |
0:24:23.647 --> 0:24:25.127 | |
the predictions, enough context. | |
0:24:25.265 --> 0:24:36.326 | |
Because at the end of the second it might | |
be very hard to predict whether this is now | |
0:24:36.326 --> 0:24:39.027 | |
speech or nonspeech. | |
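Sketched in code, the two-offset trick could look like this (`model(chunk)` is assumed to return one speech probability per input position; the window size is illustrative):

```python
import numpy as np

def windowed_speech_probs(features, model, window=2000):
    """Average frame probabilities over two shifted window passes.

    Each position is covered by two windows, so at least one
    prediction sees it with reasonable context instead of right
    at a window border.
    """
    n = len(features)
    probs = np.zeros(n)
    counts = np.zeros(n)
    for offset in (0, window // 2):       # the two shifted passes
        for start in range(offset, n, window):
            chunk = features[start:start + window]
            p = model(chunk)              # shape: (len(chunk),)
            probs[start:start + len(p)] += p
            counts[start:start + len(p)] += 1
    return probs / np.maximum(counts, 1)
```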
0:24:39.939 --> 0:24:47.956 | |
I think it is a hyperparameter, but you are | |
not optimizing it, so you just take two shifts. | |
0:24:48.328 --> 0:24:54.636 | |
Of course try a lot of different shifts and | |
so on. | |
0:24:54.636 --> 0:24:59.707 | |
The thing is, it's mainly a problem at the borders. | |
0:24:59.707 --> 0:25:04.407 | |
If you don't do two offsets, you have no context there. | |
0:25:05.105 --> 0:25:14.761 | |
You could get better by doing that, but I would | |
be skeptical whether it really matters, and I also | |
0:25:14.761 --> 0:25:18.946 | |
have not seen any experiments doing it. | |
0:25:19.159 --> 0:25:27.629 | |
Guess you're already good, you have maybe | |
some arrows in there and you're getting. | |
0:25:31.191 --> 0:25:37.824 | |
So with this you have your segmentation. | |
0:25:37.824 --> 0:25:44.296 | |
However, there is a problem in between. | |
0:25:44.296 --> 0:25:49.150 | |
Once the model is wrong, the segments are wrong too. | |
0:25:49.789 --> 0:26:01.755 | |
The first, most obvious thing would be | |
that you take some threshold and that you | |
0:26:01.755 --> 0:26:05.436 | |
label everything above it as speech. | |
0:26:06.006 --> 0:26:19.368 | |
The problem is when you are just doing this | |
one threshold that you might get very long or very short segments. | |
0:26:19.339 --> 0:26:23.954 | |
Those are the challenges. | |
0:26:23.954 --> 0:26:31.232 | |
Short segments mean you have no context. | |
0:26:31.232 --> 0:26:35.492 | |
The quality will be bad. | |
0:26:37.077 --> 0:26:48.954 | |
Therefore, people use this probabilistic divide- | |
and-conquer algorithm. The main idea is to start | |
0:26:48.954 --> 0:26:56.744 | |
with the whole recording, and you split it where | |
the speech probability is lowest. | |
0:26:57.397 --> 0:27:09.842 | |
Then you split there and then you continue | |
until each segment is smaller than the maximum | |
0:27:09.842 --> 0:27:10.949 | |
length. | |
0:27:11.431 --> 0:27:23.161 | |
But you can ignore some splits, and if you | |
split one segment into two parts you first | |
0:27:23.161 --> 0:27:23.980 | |
trim the non-speech region around the split. | |
0:27:24.064 --> 0:27:40.197 | |
So normally it's not only one single position; | |
it's a longer area of non-speech, so you try | |
0:27:40.197 --> 0:27:43.921 | |
to find this longer pause. | |
0:27:43.943 --> 0:27:51.403 | |
Now your large segment is split into two smaller | |
segments. | |
0:27:51.403 --> 0:27:56.082 | |
Now you are checking these segments. | |
0:27:56.296 --> 0:28:04.683 | |
So if they are very, very short, it might | |
be good not to split at this point, because you're | |
0:28:04.683 --> 0:28:05.697 | |
ending up with too-short segments. | |
0:28:06.006 --> 0:28:09.631 | |
And this way you continue all the time, and | |
then hopefully you'll end up with a good segmentation. | |
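A compact sketch of this divide-and-conquer splitting (the trimming of the whole non-speech region around a cut is omitted; `max_len` and `min_len` are frame counts and illustrative):

```python
def split_segment(probs, start, end, max_len, min_len):
    """Probabilistic divide-and-conquer segmentation (a sketch).

    probs: per-frame speech probabilities from the classifier.
    Recursively split each too-long segment at the frame where the
    model is most confident there is no speech, skipping split points
    that would create segments shorter than min_len.
    """
    if end - start <= max_len:
        return [(start, end)]
    lo, hi = start + min_len, end - min_len   # keep both halves long enough
    if lo >= hi:                              # cannot split without a too-short piece
        return [(start, end)]
    cut = lo + min(range(hi - lo), key=lambda i: probs[lo + i])
    return (split_segment(probs, start, cut, max_len, min_len)
            + split_segment(probs, cut, end, max_len, min_len))
```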
0:28:10.090 --> 0:28:19.225 | |
So, of course, there's one challenge with | |
this approach: if you think about it later, | |
0:28:19.225 --> 0:28:20.606 | |
low latency. | |
0:28:25.405 --> 0:28:31.555 | |
So in this case you have to have the full | |
audio available. | |
0:28:32.132 --> 0:28:38.112 | |
So you cannot do it continuously; I mean, | |
you could do it just always: | |
0:28:38.112 --> 0:28:45.588 | |
if the probability is high enough you split, but | |
in this case you try to find a global optimum. | |
0:28:46.706 --> 0:28:49.134 | |
It's a heuristic. | |
0:28:49.134 --> 0:28:58.170 | |
You find a global solution for your whole | |
talk and not a local one. | |
0:28:58.170 --> 0:29:02.216 | |
Where's the system most sure? | |
0:29:02.802 --> 0:29:12.467 | |
So that's a bit of a challenge here, but the | |
advantage of course is that in the end you | |
0:29:12.467 --> 0:29:14.444 | |
have good segments. | |
0:29:17.817 --> 0:29:23.716 | |
Any more questions like this. | |
0:29:23.716 --> 0:29:36.693 | |
Then the next thing is we also need to evaluate | |
in this scenario. | |
0:29:37.097 --> 0:29:44.349 | |
So you know machine translation evaluation; it's quite a long | |
time ago now, | |
0:29:44.349 --> 0:29:55.303 | |
it was at the beginning of the semester, | |
but I hope you can remember. | |
0:29:55.675 --> 0:30:09.214 | |
It might be with BLEU score, might be with COMET | |
or similar, but you need a reference to compare to. | |
0:30:10.310 --> 0:30:22.335 | |
But this assumes that you have this one-to-one | |
match, so for each reference you have one machine | |
0:30:22.335 --> 0:30:26.132 | |
translation output, which is nicely aligned. | |
0:30:26.506 --> 0:30:34.845 | |
So then it might be that our output has four | |
segments, while our reference output has only | |
0:30:34.845 --> 0:30:35.487 | |
three. | |
0:30:36.756 --> 0:30:40.649 | |
And now is, of course, questionable like what | |
should we compare in our metric. | |
0:30:44.704 --> 0:30:53.087 | |
So it's no longer directly possible to directly | |
do that because what should you compare? | |
0:30:53.413 --> 0:31:00.214 | |
Just have four segments there and three segments | |
there, and of course it seems to be that. | |
0:31:00.920 --> 0:31:06.373 | |
Maybe the first one aligns to the first one, but | |
you see (I can't speak Spanish, but you can | |
0:31:06.373 --> 0:31:09.099 | |
see it) that the content is already shifted. | |
0:31:09.099 --> 0:31:14.491 | |
So even a simple one-to-one BLEU comparison | |
wouldn't work, so you need to do something | |
0:31:14.491 --> 0:31:17.157 | |
about that to enable this type of evaluation. | |
0:31:19.019 --> 0:31:21.727 | |
Still any suggestions what you could do. | |
0:31:25.925 --> 0:31:44.702 | |
How can you calculate a BLEU score when | |
you don't have a one-to-one alignment? | |
0:31:45.925 --> 0:31:49.365 | |
You could put another layer which tries to | |
align the segments. | |
0:31:51.491 --> 0:31:56.979 | |
It's even not aligning only, but that's one | |
solution, so you need to align and re-segment. | |
0:31:57.177 --> 0:32:06.886 | |
Because even if you have an alignment, say this | |
to this and this to that, you see that it's | |
0:32:06.886 --> 0:32:12.341 | |
not good, because part of the output would be compared to | |
the wrong reference. | |
0:32:13.453 --> 0:32:16.967 | |
That we'll discuss is even one simpler solution. | |
0:32:16.967 --> 0:32:19.119 | |
Yes, it's a simpler solution. | |
0:32:19.119 --> 0:32:23.135 | |
It's called document based blue or something | |
like that. | |
0:32:23.135 --> 0:32:25.717 | |
So you just take the full document. | |
0:32:26.566 --> 0:32:32.630 | |
For some metrics it's fine, and for others it's not clear | |
how good it is, but there might | |
0:32:32.630 --> 0:32:32.900 | |
be issues. | |
0:32:33.393 --> 0:32:36.454 | |
Think of simpler metrics like BLEU. | |
0:32:36.454 --> 0:32:40.356 | |
Do you have any idea what could be a disadvantage? | |
0:32:49.249 --> 0:32:56.616 | |
BLEU is matching n-grams, so you start with | |
the hypothesis. | |
0:32:56.616 --> 0:33:01.270 | |
You check how many of its n-grams are in the reference. | |
0:33:01.901 --> 0:33:11.233 | |
If you're now doing that on the full document, | |
you can also match n-grams from here to there. | |
0:33:11.751 --> 0:33:15.680 | |
So you can match things very far away. | |
0:33:15.680 --> 0:33:21.321 | |
An n-gram from the start of the translation can | |
match somewhere random at the end. | |
0:33:22.142 --> 0:33:27.938 | |
And that, of course, could be a bit of a disadvantage | |
or like is a problem, and therefore people | |
0:33:27.938 --> 0:33:29.910 | |
also look into the segmentation. | |
0:33:29.910 --> 0:33:34.690 | |
But I've recently seen some results, so document- | |
level scores are also normally fine. | |
0:33:34.690 --> 0:33:39.949 | |
If you have a relatively high quality system | |
or state of the art, then they also have a | |
0:33:39.949 --> 0:33:41.801 | |
good correlation with the human judgments. | |
0:33:46.546 --> 0:33:59.241 | |
So how are we doing that? We are putting | |
end-of-sentence boundaries in there and then do an | |
0:33:59.179 --> 0:34:07.486 | |
alignment based on the Levenshtein distance, | |
so an edit distance between our output and the | |
0:34:07.486 --> 0:34:09.077 | |
reference output. | |
0:34:09.449 --> 0:34:13.061 | |
And here is our boundary. | |
0:34:13.061 --> 0:34:23.482 | |
We map the boundary based on the alignment, | |
so in the Levenshtein alignment you only have matches, insertions, deletions, and substitutions. | |
0:34:23.803 --> 0:34:36.036 | |
And then, like all the words that are before, | |
it might be since there is not a random. | |
0:34:36.336 --> 0:34:44.890 | |
Mean it should be, but it can happen things | |
like that, and it's not clear where. | |
0:34:44.965 --> 0:34:49.727 | |
At the break, however, they are typically | |
not that bad because they are words which are | |
0:34:49.727 --> 0:34:52.270 | |
not matching between reference and hypothesis. | |
0:34:52.270 --> 0:34:56.870 | |
So normally it doesn't really matter that | |
much because they are anyway not matching. | |
0:34:57.657 --> 0:35:05.888 | |
And then you take the re-segmented MT output and | |
use that to calculate your metric. | |
0:35:05.888 --> 0:35:12.575 | |
Then you again have a perfect alignment for which | |
you can calculate it. | |
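The resegmentation idea, sketched with a plain edit-distance alignment (this is the idea behind tools like mwerSegmenter, not their exact implementation):

```python
def resegment(hyp_words, ref_segments):
    """Re-segment the hypothesis to match the reference segmentation.

    hyp_words: flat list of translated words.
    ref_segments: list of reference sentences, each a list of words.
    Reference sentence boundaries are projected onto the hypothesis
    through a word-level Levenshtein alignment.
    """
    ref_words, boundaries = [], set()
    for seg in ref_segments:
        ref_words += seg
        boundaries.add(len(ref_words))       # boundary after each reference sentence

    n, m = len(hyp_words), len(ref_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]  # standard edit-distance table
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp_words[i - 1] != ref_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)

    # Backtrace, remembering onto which hypothesis position each
    # reference boundary is mapped (first visit = largest position).
    cut_after, remaining = set(), set(boundaries)
    i, j = n, m
    while i > 0 or j > 0:
        if j in remaining:
            cut_after.add(i)
            remaining.discard(j)
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp_words[i - 1] != ref_words[j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1

    segments, prev = [], 0
    for cut in sorted(cut_after | {n}):       # empty segments are dropped
        if cut > prev:
            segments.append(hyp_words[prev:cut])
            prev = cut
    return segments
```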
0:35:14.714 --> 0:35:19.229 | |
You could also do it the other way around. | |
0:35:19.229 --> 0:35:23.359 | |
You could re-segment your reference to match the output. | |
0:35:29.309 --> 0:35:30.368 | |
Which one would you select? | |
0:35:34.214 --> 0:35:43.979 | |
I think segmenting the output is much | |
more natural because the reference sentence | |
0:35:43.979 --> 0:35:46.474 | |
is the fixed solution. | |
0:35:47.007 --> 0:35:52.947 | |
Yes, that's the right motivation. If you | |
think about BLEU or so, | |
0:35:52.947 --> 0:35:57.646 | |
it's additionally important that you don't change your | |
reference. | |
0:35:57.857 --> 0:36:07.175 | |
You might have a different number of bigrams | |
or trigrams because the sentences are different | |
0:36:07.175 --> 0:36:08.067 | |
lengths. | |
0:36:08.068 --> 0:36:15.347 | |
Here your five system, you're always comparing | |
it to the same system, and you don't compare | |
0:36:15.347 --> 0:36:16.455 | |
to different. | |
0:36:16.736 --> 0:36:22.317 | |
The only different base of segmentation, but | |
still it could make some do. | |
0:36:25.645 --> 0:36:38.974 | |
Good, that's all about sentence segmentation, | |
then a bit about disfluencies and what the | |
0:36:38.974 --> 0:36:40.146 | |
challenges there really are. | |
0:36:42.182 --> 0:36:51.138 | |
So as said in daily life, you're not speaking | |
like very nice full sentences every. | |
0:36:51.471 --> 0:36:53.420 | |
We speak partial sentences. | |
0:36:53.420 --> 0:36:54.448 | |
We do repetitions. | |
0:36:54.834 --> 0:37:00.915 | |
It's especially if it's more interactive, | |
so in meetings, phone calls and so on. | |
0:37:00.915 --> 0:37:04.519 | |
If you have multiple speakers, they also interrupt | |
0:37:04.724 --> 0:37:16.651 | |
each other, and then if you keep the disfluencies, they | |
are harder to translate because most of your | |
0:37:16.651 --> 0:37:17.991 | |
training data is fluent text. | |
0:37:18.278 --> 0:37:30.449 | |
It's also very difficult to read; we'll | |
see some examples where we transcribe everything | |
0:37:30.449 --> 0:37:32.543 | |
as it was said. | |
0:37:33.473 --> 0:37:36.555 | |
What type of things are there? | |
0:37:37.717 --> 0:37:42.942 | |
So you have all these filler words. | |
0:37:42.942 --> 0:37:47.442 | |
These are very easy to remove. | |
0:37:47.442 --> 0:37:52.957 | |
You can just use regular expressions. | |
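For instance, a toy regular-expression cleaner (the filler list is illustrative and language dependent):

```python
import re

# Simple hesitation fillers can be stripped directly; more meaningful
# fillers (German "ja", etc.) need context and cannot be handled this way.
FILLERS = re.compile(r"\b(uh|uhm|um|er|ah)\b[ ,]*", flags=re.IGNORECASE)

def remove_simple_fillers(text: str) -> str:
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

print(remove_simple_fillers("I, uh, want a ticket, uhm, to Houston"))
# -> "I, want a ticket, to Houston"
```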
0:37:53.433 --> 0:38:00.139 | |
It's getting more difficult with some other | |
types of filler words. | |
0:38:00.139 --> 0:38:03.387 | |
In German you have 'ja', for instance. | |
0:38:04.024 --> 0:38:08.473 | |
And these ones you cannot just remove by regular | |
expression. | |
0:38:08.473 --> 0:38:15.039 | |
You shouldn't remove every 'ja' from a text | |
because it might be very important information | |
0:38:15.039 --> 0:38:15.768 | |
as well. | |
0:38:15.715 --> 0:38:19.995 | |
It may be not as important as you are, but | |
still it might be very important. | |
0:38:20.300 --> 0:38:24.215 | |
So just removing them is there already more | |
difficult. | |
0:38:26.586 --> 0:38:29.162 | |
Then you have these repetitions. | |
0:38:29.162 --> 0:38:32.596 | |
You have something like: I mean, I saw him there, | |
there was a... | |
0:38:34.334 --> 0:38:41.001 | |
And while for the first one that might be | |
very easy to remove because you just look for | |
0:38:41.001 --> 0:38:47.821 | |
doubled words, the thing is that the repetition might | |
not be exactly the same, so 'there is' versus 'there | |
0:38:47.821 --> 0:38:48.199 | |
was'. | |
0:38:48.199 --> 0:38:54.109 | |
So there is already getting a bit more complicated, | |
of course still possible. | |
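A sketch of the easy case, exact immediate repetitions; the "there is / there was" kind would already slip through and needs fuzzy matching or a learned model:

```python
def remove_exact_repeats(words, max_n=3):
    """Delete immediately repeated phrases like 'i saw i saw'.

    Only exact repetitions up to max_n words are caught.
    """
    out = []
    for w in words:
        out.append(w)
        for n in range(1, max_n + 1):
            if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                del out[-n:]   # drop the duplicated phrase
                break
    return out

print(remove_exact_repeats("i saw i saw him there".split()))
# -> ['i', 'saw', 'him', 'there']
```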
0:38:54.614 --> 0:39:01.929 | |
You can remove 'to Denver', so the real sentence would | |
be like: I'd like to have a ticket to Houston. | |
0:39:02.882 --> 0:39:13.327 | |
But there the detection, of course, is getting | |
more challenging as you want to get rid of. | |
0:39:13.893 --> 0:39:21.699 | |
You don't have the data, of course, which | |
makes all the tasks harder, but you probably | |
0:39:21.699 --> 0:39:22.507 | |
want to. | |
0:39:22.507 --> 0:39:24.840 | |
That's really meaningful. | |
0:39:24.840 --> 0:39:26.185 | |
Current isn't. | |
0:39:26.185 --> 0:39:31.120 | |
That is now a really good point and it's really | |
there. | |
0:39:31.051 --> 0:39:34.785 | |
The thing is: what is your final task? | |
0:39:35.155 --> 0:39:45.526 | |
If you want to have a transcript reading it, | |
I'm not sure if we have another example. | |
0:39:45.845 --> 0:39:54.171 | |
So there it's nicer if you have a clean transcript, | |
and if you see subtitles in movies, they're also not | |
0:39:54.171 --> 0:39:56.625 | |
having all the repetitions. | |
0:39:56.625 --> 0:40:03.811 | |
It's the nice way to shorten but also getting | |
the structure you cannot even make. | |
0:40:04.064 --> 0:40:11.407 | |
In some situations, of course, the disfluencies might | |
give you information, | |
0:40:11.407 --> 0:40:14.745 | |
for example that there is a lot of stuttering. | |
0:40:15.015 --> 0:40:22.835 | |
So in such cases I agree it might be helpful | |
in some way, but meaning reading all the disfluencies | |
0:40:22.835 --> 0:40:25.198 | |
is getting really difficult. | |
0:40:25.198 --> 0:40:28.049 | |
If you have the next one, we have. | |
0:40:28.308 --> 0:40:31.630 | |
That's a very long text. | |
0:40:31.630 --> 0:40:35.883 | |
You need a bit of time to parse it. | |
0:40:35.883 --> 0:40:39.472 | |
This one is not important. | |
0:40:40.480 --> 0:40:48.461 | |
It might be nice if you can start reading | |
from here. | |
0:40:48.461 --> 0:40:52.074 | |
Let's have a look here. | |
0:40:52.074 --> 0:40:54.785 | |
Try to read this. | |
0:40:57.297 --> 0:41:02.725 | |
You can understand it, but I think you need | |
a bit of time to really understand what was said. | |
0:41:11.711 --> 0:41:21.480 | |
And now we have the same text, but with parts | |
highlighted in bold; now only read the | |
0:41:21.480 --> 0:41:22.154 | |
bold. | |
0:41:23.984 --> 0:41:25.995 | |
And ignore everything which is not bold. | |
0:41:30.250 --> 0:41:49.121 | |
I would assume it's easier to read just the | |
bold part, and faster. | |
0:41:50.750 --> 0:41:57.626 | |
Yeah, it might be; I'm not sure, but we have | |
a master's thesis on that. | |
0:41:57.626 --> 0:41:59.619 | |
If you've seen my videos: | |
0:42:00.000 --> 0:42:09.875 | |
Of the recordings, I also have it more likely | |
that it's like a fluent speak and I'm not like | |
0:42:09.875 --> 0:42:12.318 | |
doing the hesitations. | |
0:42:12.652 --> 0:42:23.764 | |
I don't know if somebody else has looked into | |
the Coursera videos, but you'll notice that. | |
0:42:25.005 --> 0:42:31.879 | |
For these videos I spoke every minute like three | |
times or something, and then people were there | |
0:42:31.879 --> 0:42:35.011 | |
cutting things and hopefully making it fluent. | |
0:42:35.635 --> 0:42:42.445 | |
And therefore if you want to more achieve | |
that, of course, no longer exactly what was | |
0:42:42.445 --> 0:42:50.206 | |
happening, but if it more looks like a professional | |
video, then you would have to do that and cut | |
0:42:50.206 --> 0:42:50.998 | |
that out. | |
0:42:50.998 --> 0:42:53.532 | |
But yeah, there are definitely. | |
0:42:55.996 --> 0:42:59.008 | |
We're also going to do this thing again. | |
0:42:59.008 --> 0:43:02.315 | |
First turn is like I'm going to have a very. | |
0:43:02.422 --> 0:43:07.449 | |
Which in the end they start to slow down just | |
without feeling as though they're. | |
0:43:07.407 --> 0:43:10.212 | |
It's a good point for the next. | |
0:43:10.212 --> 0:43:13.631 | |
There is not the one perfect solution. | |
0:43:13.631 --> 0:43:20.732 | |
There's some work on disfluency removal, | |
but of course disfluency | |
0:43:13.631 --> 0:43:20.732 | |
removal is not that easy: do you just remove | |
everything everywhere? | |
0:43:27.607 --> 0:43:29.708 | |
But how much like cleaning do you do? | |
0:43:29.708 --> 0:43:31.366 | |
It's more a continuous thing. | |
0:43:31.811 --> 0:43:38.211 | |
Is it more really you only remove stuff or | |
are you also into rephrasing and here is only | |
0:43:38.211 --> 0:43:38.930 | |
removing? | |
0:43:39.279 --> 0:43:41.664 | |
But maybe you want to rephrase it. | |
0:43:41.664 --> 0:43:43.231 | |
That sounds better. | |
0:43:43.503 --> 0:43:49.185 | |
So then it's going into what people are doing | |
in style transfer. | |
0:43:49.185 --> 0:43:52.419 | |
We are going from a spoken style to a written style. | |
0:43:52.872 --> 0:44:07.632 | |
So there is more of a continuum, and of course | |
there is not the one perfect solution; | |
0:44:07.632 --> 0:44:10.722 | |
it depends on exactly what you want. | |
0:44:15.615 --> 0:44:19.005 | |
Yeah, it is challenging. | |
0:44:19.005 --> 0:44:30.258 | |
You have examples where the correction is | |
not a direct copy, not exactly the same. | |
0:44:30.258 --> 0:44:35.410 | |
That is, of course, more challenging. | |
0:44:41.861 --> 0:44:49.889 | |
If it's getting really mean why it's so challenging, | |
if it's really spontaneous even for the speaker, | |
0:44:49.889 --> 0:44:55.634 | |
you need maybe even the video to really get | |
that and at least the audio. | |
0:45:01.841 --> 0:45:06.025 | |
Yeah what it also depends on. | |
0:45:06.626 --> 0:45:15.253 | |
The purpose, of course. A very important | |
thing: the easiest task is just removing. | |
0:45:15.675 --> 0:45:25.841 | |
Of course you have to be very careful because | |
if you remove some of the not, it's normally | |
0:45:25.841 --> 0:45:26.958 | |
not much. | |
0:45:27.227 --> 0:45:33.176 | |
But if you remove too much, of course, that's | |
very, very bad because you're losing important information. | |
0:45:33.653 --> 0:45:46.176 | |
And this might be even more challenging if | |
you think about rarer and unseen words. | |
0:45:46.226 --> 0:45:56.532 | |
So when doing this removal, it's important | |
to be careful and normally more conservative. | |
0:46:03.083 --> 0:46:15.096 | |
Of course, also you have to again see if you're | |
doing that now in a two step approach, not | |
0:46:15.096 --> 0:46:17.076 | |
an end to end. | |
0:46:17.076 --> 0:46:20.772 | |
So first you need a removal component. | |
0:46:21.501 --> 0:46:30.230 | |
But you have to somehow see it in the whole | |
pipeline. | |
0:46:30.230 --> 0:46:36.932 | |
If you learn on clean text to remove disfluencies, | |
0:46:36.796 --> 0:46:44.070 | |
it might be that the ASR system is outputting | |
something else or that it's more of an ASR | |
0:46:44.070 --> 0:46:44.623 | |
error. | |
0:46:44.864 --> 0:46:46.756 | |
So um. | |
0:46:46.506 --> 0:46:52.248 | |
Just for example, if you do it based on language | |
modeling scores, it might be that you're misled by | |
0:46:52.248 --> 0:46:57.568 | |
the language modeling score because the ASR has | |
done some errors, so you really have to see | |
0:46:57.568 --> 0:46:59.079 | |
the combination of that. | |
0:46:59.419 --> 0:47:04.285 | |
And for example, we had like partial words. | |
0:47:04.285 --> 0:47:06.496 | |
They are like some. | |
0:47:06.496 --> 0:47:08.819 | |
We didn't have that. | |
0:47:08.908 --> 0:47:18.248 | |
So it can be that you stop | |
in the middle of a word and then you switch, | |
0:47:18.248 --> 0:47:19.182 | |
because you changed your mind. | |
0:47:19.499 --> 0:47:23.214 | |
And of course, in text in perfect transcript, | |
that's very easy to recognize. | |
0:47:23.214 --> 0:47:24.372 | |
That's not a real word. | |
0:47:24.904 --> 0:47:37.198 | |
However, when you really run it through an ASR system, | |
it will normally output some real word, because | |
0:47:37.198 --> 0:47:40.747 | |
it can only output the words it knows. | |
0:47:50.050 --> 0:48:03.450 | |
For example: if you have this | |
in the transcript, it's easy to detect as a | |
0:48:03.450 --> 0:48:05.277 | |
disfluency. | |
0:48:05.986 --> 0:48:11.619 | |
And then, of course, it's more challenging | |
in a real-world example where you have ASR errors. | |
0:48:12.492 --> 0:48:29.840 | |
Now to the approaches: one thing is to really | |
put it in between, after your ASR system. | |
0:48:31.391 --> 0:48:45.139 | |
So your task is like: you have this disfluent | |
text as input, and the output is this clean text. | |
0:48:45.565 --> 0:48:49.605 | |
There is different formulations of that. | |
0:48:49.605 --> 0:48:54.533 | |
You might not be able to do everything like | |
that. | |
0:48:55.195 --> 0:49:10.852 | |
Or do you also allow, for example, rephrasing | |
for reordering so in text you might have the | |
0:49:10.852 --> 0:49:13.605 | |
word correctly. | |
0:49:13.513 --> 0:49:24.201 | |
But the easiest thing is you only do it more | |
like removing, so some things can be removed. | |
0:49:29.049 --> 0:49:34.508 | |
Any ideas how to do that? This is input, this is output. | |
0:49:34.508 --> 0:49:41.034 | |
You have training data so we have training | |
data. | |
0:49:47.507 --> 0:49:55.869 | |
To put in with the spoon you can eat it even | |
after it is out, but after the machine has. | |
0:50:00.000 --> 0:50:05.511 | |
Right, so you have not just the | |
parts you remove, but the whole sentence as input, | |
0:50:05.511 --> 0:50:07.578 | |
as disfluent text and as output. | |
0:50:07.578 --> 0:50:09.207 | |
It should be fluent text. | |
0:50:09.207 --> 0:50:15.219 | |
It can be before or after recycling as you | |
said, but you have this type of task, so technically | |
0:50:15.219 --> 0:50:20.042 | |
how would you address this type of task when | |
you have to solve it? | |
0:50:24.364 --> 0:50:26.181 | |
That's exactly so. | |
0:50:26.181 --> 0:50:28.859 | |
That's one way of doing it. | |
0:50:28.859 --> 0:50:33.068 | |
It's a translation task and you train your model on it. | |
0:50:33.913 --> 0:50:34.683 | |
Can do. | |
0:50:34.683 --> 0:50:42.865 | |
Then, of course, the bit of the challenge | |
is that you automatically allow rephrasing | |
0:50:42.865 --> 0:50:43.539 | |
stuff. | |
0:50:43.943 --> 0:50:52.240 | |
Which on the one hand is good, so you have more | |
opportunities but it might be also a bad thing | |
0:50:52.240 --> 0:50:58.307 | |
because if you have more freedom, you | |
can also make more errors. | |
0:51:01.041 --> 0:51:08.300 | |
If you want to prevent that, you can also do | |
simpler labeling: for each word, the | |
0:51:08.300 --> 0:51:10.693 | |
label says whether it should be removed or not. | |
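A minimal sketch of this labeling formulation (a real system would start from a pretrained encoder such as BERT; this toy model only shows the shape of the task):

```python
import torch.nn as nn

class DisfluencyTagger(nn.Module):
    """Binary keep/remove tagger over a word sequence."""

    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, 2)   # label 0 = keep, 1 = remove

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))
        return self.out(h)                 # (batch, len, 2) logits per word

# Cleaning then just keeps the words tagged 0; unlike the seq2seq
# formulation, the model cannot invent or reorder words.
```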
0:51:12.132 --> 0:51:17.658 | |
People have also looked into parsing. | |
0:51:17.658 --> 0:51:29.097 | |
You remember maybe the parse trees from the beginning, | |
which give the structure; that is the idea. | |
0:51:29.649 --> 0:51:45.779 | |
There's also more unsupervised approaches | |
where you then phrase it as a style transfer | |
0:51:45.779 --> 0:51:46.892 | |
task. | |
0:51:50.310 --> 0:51:58.601 | |
At the last point since we have that yes, | |
it has also been done in an end-to-end fashion | |
0:51:58.601 --> 0:52:06.519 | |
so that it's really you have as input the audio | |
signal, and as output you have the | |
0:52:06.446 --> 0:52:10.750 | |
text without disfluencies, a clean | |
text. | |
0:52:11.131 --> 0:52:19.069 | |
You model everything in one model, which of course | |
has a big advantage. | |
0:52:19.069 --> 0:52:25.704 | |
You can use these paralinguistic features, | |
pauses, and so on. | |
0:52:25.705 --> 0:52:34.091 | |
If you switch so you start something then | |
oh it doesn't work continue differently so. | |
0:52:34.374 --> 0:52:42.689 | |
So you can easily use those in an end-to-end fashion, while in | |
a cascaded approach, | |
0:52:42.689 --> 0:52:47.497 | |
as we saw, you only have text input there. | |
0:52:49.990 --> 0:53:02.389 | |
But on the other hand we again have, in an even | |
more extreme way, the problem from before: | |
0:53:02.389 --> 0:53:06.957 | |
Of course there is even less data. | |
0:53:11.611 --> 0:53:12.837 | |
Good. | |
0:53:12.837 --> 0:53:30.814 | |
This is all about the input to a very more | |
person, or maybe if you think about YouTube. | |
0:53:32.752 --> 0:53:34.989 | |
Talk so this could use be very exciting. | |
0:53:36.296 --> 0:53:42.016 | |
It is more viewed as style transfer. | |
0:53:42.016 --> 0:53:53.147 | |
You can use ideas from machine translation, | |
where you treat each style like one language. | |
0:53:53.713 --> 0:53:57.193 | |
So there is ways of trying to do this type | |
of style transfer. | |
0:53:57.637 --> 0:54:02.478 | |
Think is definitely also very promising to | |
make it more and more fluent in a business. | |
0:54:03.223 --> 0:54:17.974 | |
Because one major issue about all the previous | |
ones is that you need training data, and | |
0:54:17.974 --> 0:54:21.021 | |
such training data is rare. | |
0:54:21.381 --> 0:54:32.966 | |
So I mean, I think the only data we really | |
have is for English. | |
0:54:32.966 --> 0:54:39.453 | |
Maybe there is very little data in German. | |
0:54:42.382 --> 0:54:49.722 | |
Okay, then let's talk about low-latency speech translation. | |
0:54:50.270 --> 0:55:05.158 | |
So the idea is: if we are doing live translation | |
of a talk, we want to start outputting early. | |
0:55:05.325 --> 0:55:23.010 | |
This is possible because there is typically | |
some kind of monotony in many languages. | |
0:55:24.504 --> 0:55:29.765 | |
And this is also what, for example, human | |
interpreters are doing to have a really low | |
0:55:29.765 --> 0:55:30.071 | |
lag. | |
0:55:30.750 --> 0:55:34.393 | |
They are even going further. | |
0:55:34.393 --> 0:55:40.926 | |
They guess what will be the ending of the | |
sentence. | |
0:55:41.421 --> 0:55:51.120 | |
Then they can already continue, although it's | |
not yet said; but that is even | |
0:55:51.120 --> 0:55:53.039 | |
more challenging. | |
0:55:54.714 --> 0:55:58.014 | |
Why is it so difficult? | |
0:55:58.014 --> 0:56:09.837 | |
There is this train of on the one end for | |
a and you want to have more context because | |
0:56:09.837 --> 0:56:14.511 | |
we learn if we have more context. | |
0:56:15.015 --> 0:56:24.033 | |
And therefore, to have more context you have | |
to wait as long as possible. | |
0:56:24.033 --> 0:56:27.689 | |
The best is to have the full sentence. | |
0:56:28.168 --> 0:56:35.244 | |
On the other hand, you want to have a low | |
latency, so the user doesn't wait: generate as | |
0:56:35.244 --> 0:56:35.737 | |
soon as possible. | |
0:56:36.356 --> 0:56:47.149 | |
So in this situation you have to | |
find the best point to start in order to have | |
0:56:47.149 --> 0:56:48.130 | |
a good trade-off. | |
0:56:48.728 --> 0:56:52.296 | |
There's no longer the perfect solution. | |
0:56:52.296 --> 0:56:56.845 | |
People will also evaluate what is the translation. | |
0:56:57.657 --> 0:57:09.942 | |
Why it's challenging in German to English: | |
German has this very nice thing where the prefix | |
0:57:09.942 --> 0:57:16.607 | |
of the verb can be put at the end of the sentence. | |
0:57:17.137 --> 0:57:24.201 | |
And you only know if the person registers | |
or cancels his registration at the end of the sentence. | |
0:57:24.985 --> 0:57:33.690 | |
So if you want to start the translation in | |
English you need to know at this point: is the person registering or canceling? | |
0:57:35.275 --> 0:57:39.993 | |
So you would have to wait until the end of | |
the sentence. | |
0:57:39.993 --> 0:57:42.931 | |
That's not really what you want. | |
0:57:43.843 --> 0:57:45.795 | |
What happened. | |
0:57:47.207 --> 0:58:12.550 | |
Other solutions of doing that are: | |
this has been motivated by word order, like subject- | |
0:58:12.550 --> 0:58:15.957 | |
object-verb versus subject-verb-object. | |
0:58:16.496 --> 0:58:24.582 | |
In German it's not always subject, but there | |
are relative clauses where you have that, | |
0:58:24.582 --> 0:58:25.777 | |
so reordering is needed. | |
0:58:28.808 --> 0:58:41.858 | |
How we can do that is, we'll look today into | |
three ways of doing that. | |
0:58:41.858 --> 0:58:46.269 | |
The first is to optimize the segmentation to mitigate the problem. | |
0:58:46.766 --> 0:58:54.824 | |
And then the other idea is to do retranslation, | |
and there you can update the text output. | |
0:58:54.934 --> 0:59:02.302 | |
So the idea is you translate, and if you later | |
notice it was wrong then you can retranslate | |
0:59:02.302 --> 0:59:03.343 | |
and correct. | |
0:59:03.803 --> 0:59:14.383 | |
Or you can do what is called streaming decoding, | |
where you generate incrementally. | |
0:59:17.237 --> 0:59:30.382 | |
Let's start with the optimization: the idea | |
is to choose the segmentation of the input | |
0:59:30.382 --> 0:59:33.040 | |
in the right way. | |
0:59:32.993 --> 0:59:39.592 | |
So you have a good translation quality while | |
still having low latency. | |
0:59:39.699 --> 0:59:50.513 | |
You have an extra model which does your segmentation | |
before, but your aim is not to have a segmentation. | |
0:59:50.470 --> 0:59:53.624 | |
But you can somehow measure in training data. | |
0:59:53.624 --> 0:59:59.863 | |
If do these types of segment lengths, that's | |
my latency and that's my translation quality, | |
0:59:59.863 --> 1:00:02.811 | |
and then you can try to search a good way. | |
1:00:03.443 --> 1:00:20.188 | |
If you're doing that one, it's an extra component, | |
so you can use your system as it was. | |
1:00:22.002 --> 1:00:28.373 | |
The other idea is to always directly output the current | |
hypothesis: always when we have new | |
1:00:28.373 --> 1:00:34.201 | |
text or audio we translate, and if we then | |
have more context available we can update. | |
1:00:35.015 --> 1:00:50.195 | |
So imagine the example from before: we output 'I register', | |
and then the sentence continues and it turns out wrong. | |
1:00:50.670 --> 1:00:54.298 | |
So you change the output. | |
1:00:54.298 --> 1:01:07.414 | |
Of course, that might be also leading to bad | |
user experience if you always flicker and change | |
1:01:07.414 --> 1:01:09.228 | |
your output. | |
1:01:09.669 --> 1:01:15.329 | |
The bit like human interpreters also are able | |
to correct, so they're doing a more long text. | |
1:01:15.329 --> 1:01:20.867 | |
If they are guessing how to continue to say | |
and then he's saying something different, they | |
1:01:20.867 --> 1:01:22.510 | |
also have to correct them. | |
1:01:22.510 --> 1:01:26.831 | |
So here, since it's not audio, we can even | |
change what we have said. | |
1:01:26.831 --> 1:01:29.630 | |
Yes, that's exactly what we have implemented. | |
1:01:31.431 --> 1:01:49.217 | |
So how that works is: we have some input, and then | |
we translate it, and if we get more input, | |
1:01:49.217 --> 1:01:51.344 | |
then we translate again. | |
1:01:51.711 --> 1:02:00.223 | |
And so we can always continue to do that and | |
improve the transcript that we have. | |
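A sketch of this retranslation loop; `translate` stands for any MT system, and the flicker count illustrates the update problem discussed next:

```python
def retranslate_demo(translate, source_words):
    """Retranslation strategy (sketch): after every new source word,
    translate the whole prefix from scratch and replace the shown
    output; count how many already-shown words got changed."""
    shown, flicker = [], 0
    for t in range(1, len(source_words) + 1):
        new = translate(source_words[:t]).split()
        common = 0                      # length of the unchanged prefix
        for a, b in zip(shown, new):
            if a != b:
                break
            common += 1
        flicker += len(shown) - common  # words that had to be retracted
        shown = new
    return shown, flicker
```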
1:02:00.480 --> 1:02:07.729 | |
So in the end we have the lowest possible | |
latency because we always output what is possible. | |
1:02:07.729 --> 1:02:14.784 | |
On the other hand, it introduces a bit of a | |
new problem. There's another challenge: when | |
1:02:14.784 --> 1:02:20.061 | |
we first used this, it was used | |
with the old statistical systems, and there it worked fine. | |
1:02:20.061 --> 1:02:21.380 | |
Then you switch to NMT. | |
1:02:21.380 --> 1:02:25.615 | |
You saw one problem that is even generating | |
more flickering. | |
1:02:25.615 --> 1:02:28.878 | |
The problem is the normal machine translation. | |
1:02:29.669 --> 1:02:35.414 | |
It has implicitly learned that the output always | |
ends with a dot, and it's always a full sentence. | |
1:02:36.696 --> 1:02:42.466 | |
And this was even more important somewhere | |
in the model than really what is in the input. | |
1:02:42.983 --> 1:02:55.910 | |
So if you give him a partial sentence, it | |
will still generate a full sentence. | |
1:02:55.910 --> 1:02:58.201 | |
So it guesses an ending. | |
1:02:58.298 --> 1:03:05.821 | |
It's like trying to just continue it somehow | |
to a full sentence, and if it's | |
1:03:05.821 --> 1:03:10.555 | |
guessing stuff, then you later have even more | |
changes. | |
1:03:10.890 --> 1:03:23.944 | |
So here we have a train-test mismatch, and that's | |
maybe a more generally important thing: the | |
1:03:23.944 --> 1:03:28.910 | |
model might learn something a bit different. | |
1:03:29.289 --> 1:03:32.636 | |
It learns that output always ends with a dot, | |
which you never told it explicitly. | |
1:03:33.053 --> 1:03:35.415 | |
So we have your trained test mismatch. | |
1:03:38.918 --> 1:03:41.248 | |
We have a train-test mismatch. | |
1:03:41.248 --> 1:03:43.708 | |
What is the best way to address that? | |
1:03:46.526 --> 1:03:51.934 | |
That's exactly the right, so we have to like | |
train also on that. | |
1:03:52.692 --> 1:03:55.503 | |
The problem is with partial sentences. | |
1:03:55.503 --> 1:03:59.611 | |
There's no training data, so it's hard to | |
find any. | |
1:04:00.580 --> 1:04:06.531 | |
But it's quite easy to generate artificial | |
partial sentences, at least for the source. | |
1:04:06.926 --> 1:04:15.367 | |
So you just take, you take all the prefixes | |
of the source data. | |
1:04:17.017 --> 1:04:22.794 | |
The problem, of course, is: what is | |
the right label? | |
1:04:22.794 --> 1:04:30.845 | |
If you have a source prefix, | |
what should be the right target for it? | |
1:04:31.491 --> 1:04:45.381 | |
And the constraints are: on the one hand, it should | |
be as long as possible, so you don't always have | |
1:04:45.381 --> 1:04:47.541 | |
a long delay. | |
1:04:47.687 --> 1:04:55.556 | |
On the other hand, it should also be a prefix | |
of the final translation, and it should not do | |
1:04:55.556 --> 1:04:57.304 | |
too much inventing. | |
1:04:58.758 --> 1:05:02.170 | |
A very easy solution works fine. | |
1:05:02.170 --> 1:05:05.478 | |
You can just do it length-based. | |
1:05:05.478 --> 1:05:09.612 | |
For a source prefix you take, say, two thirds of the target. | |
1:05:10.070 --> 1:05:19.626 | |
It is then implicitly learning to guess a bit, | |
if you think about the example from the beginning. | |
1:05:20.000 --> 1:05:30.287 | |
For this one, if you take like half, in | |
this case the target would be 'I register'. | |
1:05:30.510 --> 1:05:39.289 | |
So you're doing a bit of implicit guessing, | |
and if it gets it wrong you have rewriting, | |
1:05:39.289 --> 1:05:43.581 | |
but you're doing a good amount of guessing. | |
1:05:49.849 --> 1:05:53.950 | |
In addition, this is how it would look | |
in that case. | |
1:05:53.950 --> 1:05:58.300 | |
If there were no guessing at all, then the target | |
would have to be something shorter, like this. | |
1:05:58.979 --> 1:06:02.513 | |
One problem arises if you just do it this | |
way: | |
1:06:02.513 --> 1:06:04.619 | |
prefixes are most of your training data. | |
1:06:05.245 --> 1:06:11.983 | |
And in the end you're interested in the overall | |
translation quality, so for full sentences. | |
1:06:11.983 --> 1:06:19.017 | |
So if you train on that, it will mainly learn | |
how to translate prefixes because ninety percent | |
1:06:19.017 --> 1:06:21.535 | |
or more of your data are prefixes. | |
1:06:22.202 --> 1:06:31.636 | |
That's why we'll see that it's better to do | |
like a ratio. | |
1:06:31.636 --> 1:06:39.281 | |
So half your training data are full sentences. | |
1:06:39.759 --> 1:06:47.693 | |
Because if you just take all prefixes, you have | |
a prefix for every word but only one full sentence. | |
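[Editor's note: to make the idea concrete, here is a minimal sketch of how such prefix training pairs could be generated, assuming a length-proportional target prefix and a 50/50 mix with full sentences; both knobs and the sentence pair are illustrative assumptions, not the exact recipe from the lecture.]

```python
import random

def make_training_pair(src_tokens, tgt_tokens, full_ratio=0.5, tgt_scale=2/3):
    """Build one training pair for prefix-to-prefix translation.

    With probability `full_ratio` we keep the full sentence pair, so the
    model also keeps learning to translate complete sentences. Otherwise
    we cut a random source prefix and pair it with a length-proportional
    target prefix, scaled by `tgt_scale` so the model is not forced to
    guess too much of the unseen source (assumed values, for illustration).
    """
    if random.random() < full_ratio:
        return src_tokens, tgt_tokens
    cut = random.randint(1, len(src_tokens))          # random source prefix
    frac = cut / len(src_tokens)
    tgt_len = max(1, round(len(tgt_tokens) * frac * tgt_scale))
    return src_tokens[:cut], tgt_tokens[:tgt_len]

# Illustrative (made-up) sentence pair:
src = "I encourage all of you to ask questions".split()
tgt = "ich ermutige Sie alle Fragen zu stellen".split()
print(make_training_pair(src, tgt))
```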
1:06:48.048 --> 1:06:52.252 | |
You also see that nicely here; here are both. | |
1:06:52.252 --> 1:06:56.549 | |
These are the BLEU scores, and you see the baseline. | |
1:06:58.518 --> 1:06:59.618 | |
It is this one. | |
1:06:59.618 --> 1:07:03.343 | |
It has a good quality because it's trained | |
on full sentences. | |
1:07:03.343 --> 1:07:11.385 | |
If you now train with all the partial sentences, | |
it is focusing more on how to translate partial | |
1:07:11.385 --> 1:07:12.316 | |
sentences. | |
1:07:12.752 --> 1:07:17.840 | |
Because all the partial sentences will at | |
some point be removed, because at the end you | |
1:07:17.840 --> 1:07:18.996 | |
translate the full sentence. | |
1:07:20.520 --> 1:07:24.079 | |
With the mixed training, in contrast, you have the | |
same performance. | |
1:07:24.504 --> 1:07:26.938 | |
On the other hand, you see here the other | |
problem. | |
1:07:26.938 --> 1:07:28.656 | |
This is how many words got updated. | |
1:07:29.009 --> 1:07:31.579 | |
You want to have as few updates as possible. | |
1:07:31.579 --> 1:07:34.891 | |
Updates mean you remove things which were once | |
already shown. | |
1:07:35.255 --> 1:07:40.538 | |
This is quite high for the baseline. | |
1:07:40.538 --> 1:07:50.533 | |
If you now train on the partials, that is going | |
down, as it should. | |
1:07:51.151 --> 1:07:58.648 | |
And then for the mixed training you have a bit like | |
the best of the two. | |
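[Editor's note: the "words got updated" number can be measured with a simple flicker count over the sequence of displayed outputs; this is an illustrative sketch, the metric used on the slide may differ in details.]

```python
def n_updated_words(prev_shown, new_shown):
    """Words of the previously shown output that had to be retracted:
    everything after the longest common prefix of the two outputs."""
    i = 0
    while i < min(len(prev_shown), len(new_shown)) and prev_shown[i] == new_shown[i]:
        i += 1
    return len(prev_shown) - i

# The displayed hypothesis over time for one utterance:
shown = [["all"], ["all", "model", "trains"], ["all", "models", "are"]]
total_updates = sum(n_updated_words(a, b) for a, b in zip(shown, shown[1:]))
print(total_updates)  # 2: "model trains" was retracted in the last step
```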
1:08:02.722 --> 1:08:05.296 | |
Any more questions on this type of approach? | |
1:08:09.309 --> 1:08:20.760 | |
The last thing is if you want to do it without | |
any updates at all. | |
1:08:21.541 --> 1:08:23.345 | |
Again, it's a bit about the application. | |
1:08:23.345 --> 1:08:25.323 | |
The scenario determines what you really want. | |
1:08:25.323 --> 1:08:30.211 | |
As you said, we sometimes use this updating, | |
and for text output it'd be very nice. | |
1:08:30.211 --> 1:08:35.273 | |
But imagine you want audio output: of | |
course you can't change it anymore, because | |
1:08:35.273 --> 1:08:37.891 | |
on one side you cannot change what was said. | |
1:08:37.891 --> 1:08:40.858 | |
So in this case you need more like a fixed | |
output. | |
1:08:41.121 --> 1:08:47.440 | |
And then this style of streaming decoding is interesting. | |
1:08:47.440 --> 1:08:55.631 | |
Where you, for example, get the source token | |
by token as it comes in. | |
1:08:55.631 --> 1:09:00.897 | |
Then you decide oh, now it's better to wait. | |
1:09:01.041 --> 1:09:14.643 | |
So you somehow need to have this type of additional | |
information. | |
1:09:15.295 --> 1:09:23.074 | |
Here you have to decide: should I now output | |
a token, or should I wait for more input? | |
1:09:26.546 --> 1:09:32.649 | |
So you have to do these additional labels like | |
wait, wait, output, output, wait and so | |
1:09:32.649 --> 1:09:32.920 | |
on. | |
1:09:33.453 --> 1:09:38.481 | |
There are different ways of doing that. | |
1:09:38.481 --> 1:09:45.771 | |
You can have an additional model that does | |
this decision. | |
1:09:46.166 --> 1:09:53.669 | |
And then you can either wait longer and have a higher | |
quality, or continue and then have a lower latency; in this | |
1:09:53.669 --> 1:09:54.576 | |
respect the scenarios are different. | |
1:09:55.215 --> 1:09:59.241 | |
Surprisingly, a very easy strategy also works | |
sometimes quite well. | |
1:10:03.043 --> 1:10:10.981 | |
And that is the so-called wait-k policy, | |
and the idea is there, at least for text-to-text | |
1:10:10.981 --> 1:10:14.623 | |
translation, that it is working well. | |
1:10:14.623 --> 1:10:22.375 | |
It's like: you wait for k words, and then you | |
always output one word for each new one. | |
1:10:22.682 --> 1:10:28.908 | |
So you only wait at the beginning | |
of the sentence, and every time a new word | |
1:10:28.908 --> 1:10:29.981 | |
is coming in, you output one. | |
1:10:31.091 --> 1:10:39.459 | |
So you have the same speed as the input, | |
so you're not lagging more and more, but you | |
1:10:39.459 --> 1:10:41.456 | |
have enough context. | |
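[Editor's note: a minimal sketch of the wait-k loop; `translate_step` is an assumed model interface that returns the next target word given the source read so far and the target emitted so far, not a real library function.]

```python
def wait_k_translate(source_stream, k, translate_step, eos="</s>"):
    """Wait-k policy sketch: read k source words first, then emit one
    target word for every new source word; when the source ends, flush
    the remaining target words until end-of-sentence."""
    src, out = [], []
    for word in source_stream:
        src.append(word)
        if len(src) >= k:                          # initial wait is over
            out.append(translate_step(src, out))   # one output per input
    while not out or out[-1] != eos:               # source finished: flush
        out.append(translate_step(src, out))
    return out
```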
1:10:43.103 --> 1:10:49.283 | |
Of course this will not solve, for example, | |
long-range reordering perfectly, but if you have | |
1:10:49.283 --> 1:10:55.395 | |
a bit of local reordering inside your k tokens, | |
that you can manage very well, and then it's | |
1:10:55.395 --> 1:10:57.687 | |
a very simple solution, but it's limited. | |
1:10:57.877 --> 1:11:00.481 | |
The other one was dynamic. | |
1:11:00.481 --> 1:11:06.943 | |
Depending on the context you can decide how | |
long you want to wait. | |
1:11:07.687 --> 1:11:21.506 | |
It also only works if you have a similar number | |
of tokens on both sides, so not if your target is | |
1:11:21.506 --> 1:11:22.113 | |
very short compared to the source. | |
1:11:22.722 --> 1:11:28.791 | |
That's why it's also more challenging for | |
audio input because the speaking rate is changing | |
1:11:28.791 --> 1:11:29.517 | |
and so on. | |
1:11:29.517 --> 1:11:35.586 | |
You would have to do something like: I output | |
a word for every second of audio, or something | |
1:11:35.586 --> 1:11:35.981 | |
like that. | |
1:11:36.636 --> 1:11:45.459 | |
The problem is that the speaking speed in audio | |
is not fixed but varies a lot, and therefore this fails. | |
1:11:50.170 --> 1:11:58.278 | |
Therefore, what you can also do is you can | |
use a similar solution to the one we had before with | |
1:11:58.278 --> 1:11:59.809 | |
the re-translation. | |
1:12:00.080 --> 1:12:02.904 | |
You remember, we were re-decoding all the time. | |
1:12:03.423 --> 1:12:12.253 | |
And you can do something similar in this case, | |
except that you add something, in that you're | |
1:12:12.253 --> 1:12:16.813 | |
saying: oh, if I re-decode, I'm not always free. | |
1:12:16.736 --> 1:12:22.065 | |
I can't decode however I want; instead you can do this | |
target prefix decoding, so what you say is, | |
1:12:22.065 --> 1:12:23.883 | |
in your search: | |
1:12:23.883 --> 1:12:26.829 | |
you can easily say, generate a translation, | |
but: | |
1:12:27.007 --> 1:12:29.810 | |
The translation has to start with the prefix. | |
1:12:31.251 --> 1:12:35.350 | |
How can you do that? | |
1:12:39.839 --> 1:12:49.105 | |
In the decoder, exactly. Normally, if you | |
do beam search, you always select the most probable tokens. | |
1:12:49.349 --> 1:12:57.867 | |
And now you say: oh, I'm not selecting the | |
most probable one, but the forced one, so in | |
1:12:57.867 --> 1:13:04.603 | |
the first steps I have to take the prefix tokens, | |
and after that I start free decoding. | |
1:13:04.884 --> 1:13:09.387 | |
And then you're making sure that your hypothesis | |
always starts with this prefix. | |
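[Editor's note: prefix-forced decoding can be sketched like this; greedy search for brevity, and `model.step` is an assumed interface returning next-token scores. Real toolkits expose similar prefix-constrained search, but this exact API is made up.]

```python
def decode_with_forced_prefix(model, src, prefix_ids, eos_id, max_len=200):
    """During the first len(prefix_ids) steps the committed token is
    forced regardless of the model's preference; afterwards decoding is
    free. With beam search you would instead keep only hypotheses that
    match the committed prefix."""
    out = []
    for t in range(max_len):
        scores = model.step(src, out)       # assumed: scores over vocabulary
        if t < len(prefix_ids):
            token = prefix_ids[t]           # forced: reproduce committed prefix
        else:
            token = max(range(len(scores)), key=scores.__getitem__)
        out.append(token)
        if token == eos_id and t >= len(prefix_ids):
            break
    return out
```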
1:13:10.350 --> 1:13:18.627 | |
And then you can use your immediate retranslation, | |
but you're no longer changing the output. | |
1:13:19.099 --> 1:13:31.595 | |
How it works: so it may get a speech signal | |
as input, and at first it is not outputting anything. | |
1:13:32.212 --> 1:13:45.980 | |
So then, if you get s1, you get a translation maybe, | |
and then you decide: yes, output it. | |
1:13:46.766 --> 1:13:54.250 | |
And then you're translating s1, s2, | |
s3, s4, but now you say: generate | |
1:13:54.250 --> 1:13:55.483 | |
only outputs that start with what was shown. | |
1:13:55.935 --> 1:14:07.163 | |
And then you're translating, and maybe you're | |
deciding: now this is a good translation. | |
1:14:07.163 --> 1:14:08.880 | |
Then you output it. | |
1:14:09.749 --> 1:14:29.984 | |
Yes, but I don't quite get what the | |
effect of that is. | |
1:14:30.050 --> 1:14:31.842 | |
We're generating the target text. | |
1:14:32.892 --> 1:14:36.930 | |
But we're not always outputting the full target | |
text now. | |
1:14:36.930 --> 1:14:43.729 | |
What we have here is some strategy | |
to decide: oh, is the system already sure enough | |
1:14:43.729 --> 1:14:44.437 | |
about it? | |
1:14:44.437 --> 1:14:49.395 | |
If it's sure enough and it has all the information, | |
we can output it. | |
1:14:49.395 --> 1:14:50.741 | |
And then comes the next part. | |
1:14:51.291 --> 1:14:55.931 | |
If we say here it's sometimes better not to | |
output yet, we won't output it already. | |
1:14:57.777 --> 1:15:06.369 | |
And thereby the hope is, in the example, the model | |
should not yet output 'register', because it | |
1:15:06.369 --> 1:15:10.568 | |
doesn't know yet if that's the case or not. | |
1:15:13.193 --> 1:15:18.056 | |
So what we have to discuss is what is a good | |
output strategy. | |
1:15:18.658 --> 1:15:20.070 | |
So what could you do? | |
1:15:20.070 --> 1:15:23.806 | |
The output strategy could be something like this. | |
1:15:23.743 --> 1:15:39.871 | |
If you think of wait-k, this is an output | |
strategy where you always output one per input. | |
1:15:40.220 --> 1:15:44.990 | |
Good, and you can view wait-k in a similar | |
way as such a strategy. | |
1:15:45.265 --> 1:15:55.194 | |
But now, of course, we can also look at other | |
output strategies which are more generic, where | |
1:15:55.194 --> 1:15:59.727 | |
it's decided dynamically in each situation. | |
1:16:01.121 --> 1:16:12.739 | |
And one thing that works quite well is referred | |
to as local agreement, and that means you're | |
1:16:12.739 --> 1:16:13.738 | |
always re-translating everything you have so far. | |
1:16:14.234 --> 1:16:26.978 | |
Then you're looking at what is the same | |
between my current translation and the one | |
1:16:26.978 --> 1:16:28.756 | |
I did before. | |
1:16:29.349 --> 1:16:31.201 | |
So let's go through that with an example. | |
1:16:31.891 --> 1:16:45.900 | |
So your input is a first audio segment, and | |
your target text is 'all model trains'. | |
1:16:46.346 --> 1:16:53.231 | |
Then you're getting segments one and | |
two, and this time the output is 'all models'. | |
1:16:54.694 --> 1:17:08.407 | |
You see the continuations are different, but both | |
of them agree that it starts with 'all'. | |
1:17:09.209 --> 1:17:13.806 | |
So we can hopefully be a bit sure that it really | |
starts with 'all'. | |
1:17:15.155 --> 1:17:22.604 | |
So now we say we output 'all', so at this | |
time step we'll output 'all', unlike before. | |
1:17:23.543 --> 1:17:27.422 | |
We are getting segments one, two, three as input. | |
1:17:27.422 --> 1:17:35.747 | |
This time we have a prefix, so now we are | |
only allowing translations that start with 'all'. | |
1:17:35.747 --> 1:17:42.937 | |
We cannot change that anymore, so we now need | |
to generate such a translation. | |
1:17:43.363 --> 1:17:46.323 | |
And then it can be that it's now 'all models | |
are run'. | |
1:17:47.927 --> 1:18:01.908 | |
Then we compare here and see this agrees on | |
'all models', so we can output 'all models'. | |
1:18:02.882 --> 1:18:07.356 | |
So thereby we can dynamically decide: if the | |
model is very unsure, | |
1:18:07.356 --> 1:18:10.178 | |
it always outputs something different. | |
1:18:11.231 --> 1:18:24.872 | |
Then we'll wait longer; if it outputs more of | |
the same thing, then hopefully we don't need to wait. | |
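[Editor's note: the local-agreement loop from the example can be sketched as follows; `translate` is an assumed model call that re-translates all input received so far while respecting the committed prefix.]

```python
def longest_common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def local_agreement(chunks, translate):
    """After each new chunk, re-translate everything received so far
    (constrained to start with the committed words) and commit only
    what the current and the previous hypothesis agree on."""
    committed, prev_hyp, received = [], None, []
    for chunk in chunks:
        received.append(chunk)
        hyp = translate(received, committed)   # full hypothesis
        if prev_hyp is not None:
            agreed = longest_common_prefix(prev_hyp, hyp)
            if len(agreed) > len(committed):
                print("output:", " ".join(agreed[len(committed):]))
                committed = agreed
        prev_hyp = hyp
    return committed
```

With the hypotheses from the example ('all model trains', then 'all models ...'), the first committed word would be exactly 'all'.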
1:18:30.430 --> 1:18:40.238 | |
Is it clear that this signal would really | |
be able to detect that? | |
1:18:43.203 --> 1:18:50.553 | |
The hope is that it is, because if it's not sure, | |
of course, in this case it would have to switch | |
1:18:50.553 --> 1:18:51.671 | |
all the time. | |
1:18:56.176 --> 1:19:01.375 | |
So if it would output in the first step 'register' | |
and the second time 'cancel' and then maybe | |
1:19:01.375 --> 1:19:03.561 | |
'register' again, it wouldn't output it. | |
1:19:03.561 --> 1:19:08.347 | |
Of course, if it by chance keeps saying 'register' | |
for a long time, then it can't detect that. | |
1:19:08.568 --> 1:19:23.410 | |
That's why there are two parameters that you | |
can use and which might be important: how often | |
you update, and how much has to agree. | |
1:19:23.763 --> 1:19:27.920 | |
So you do it like every one second, every | |
five seconds or something like that. | |
1:19:28.648 --> 1:19:37.695 | |
The more often you output, the better your latency | |
will be, because your waits are less long, but also | |
1:19:37.695 --> 1:19:39.185 | |
you might do more re-computation. | |
1:19:40.400 --> 1:19:50.004 | |
So that is the one thing, and the other thing is: | |
for text you might do it after every word, but if | |
1:19:50.004 --> 1:19:52.779 | |
you think about audio, it's less clear when. | |
1:19:53.493 --> 1:20:04.287 | |
And the other question is how much has to | |
agree, so that the model is sure. | |
1:20:04.287 --> 1:20:10.252 | |
If you say two have to agree, then hopefully | |
it's reliable. | |
1:20:10.650 --> 1:20:21.369 | |
What we saw is, I think, that two normally gives | |
a really good performance; otherwise your | |
1:20:21.369 --> 1:20:22.441 | |
latency goes up. | |
1:20:22.963 --> 1:20:42.085 | |
Okay, but couldn't we just take the model confidence | |
instead of making more of these tests? | |
1:20:44.884 --> 1:20:47.596 | |
I have to completely agree with that. | |
1:20:47.596 --> 1:20:53.018 | |
So when this was done, that was also our first | |
idea: using the confidence. | |
1:20:53.018 --> 1:21:00.248 | |
The problem, and that's my assumption, | |
is that modeling the model confidence is | |
1:21:00.248 --> 1:21:03.939 | |
not that easy, and models are often overconfident. | |
1:21:04.324 --> 1:21:17.121 | |
In the paper there is also this variant where | |
you try to use the confidence in some way to | |
1:21:17.121 --> 1:21:20.465 | |
decide when to output. | |
1:21:21.701 --> 1:21:26.825 | |
But that gave worse results, and that's why | |
we looked into that. | |
1:21:27.087 --> 1:21:38.067 | |
So it's a very good idea, I think, but it seems | |
not to work, at least how it was implemented. | |
1:21:38.959 --> 1:21:55.670 | |
There is one way that maybe goes more in this | |
direction, which is very new: you look at the attention. | |
1:21:55.455 --> 1:22:02.743 | |
If, in this one, the last word is attending mainly | |
to the end of the audio, | |
1:22:02.942 --> 1:22:04.934 | |
then you should not output it yet. | |
1:22:05.485 --> 1:22:15.539 | |
Because there might be something | |
more missing that you need to know, so they | |
1:22:15.539 --> 1:22:24.678 | |
look at the attention and only output parts | |
which do not look at the end of the audio signal. | |
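[Editor's note: sketched with a cross-attention matrix, in the spirit of attention-based policies such as EDAtt; the threshold and window size here are made-up illustrative values.]

```python
import numpy as np

def emit_by_attention(tokens, cross_attn, last_frames=20, alpha=0.3):
    """Emit a token only if little of its attention mass falls on the
    most recent audio frames; otherwise the word may still depend on
    audio we have not heard yet, so we hold it and everything after it.
    `cross_attn` has one row of attention weights per target token."""
    emitted = []
    for token, weights in zip(tokens, cross_attn):
        if np.sum(weights[-last_frames:]) > alpha:  # looking at the frontier
            break                                   # hold this word and the rest
        emitted.append(token)
    return emitted
```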
1:22:25.045 --> 1:22:40.175 | |
So there are, of course, a lot of ways how | |
you can do it better or easier in some way. | |
1:22:41.901 --> 1:22:53.388 | |
Another approach instead tries to predict the next | |
words with a large language model, and then for text translation | |
1:22:53.388 --> 1:22:54.911 | |
you predict possible continuations of the source. | |
1:22:55.215 --> 1:23:01.177 | |
Then you translate all of them and check | |
if there is a change, so you can even earlier | |
1:23:01.177 --> 1:23:02.410 | |
make your decision. | |
1:23:02.362 --> 1:23:08.714 | |
The idea is that if we continue and this | |
would lead to a change in the translation, then | |
1:23:08.714 --> 1:23:10.320 | |
we should not output yet. | |
1:23:10.890 --> 1:23:18.302 | |
So it's more doing an estimate about possible | |
continuations of the source instead of looking | |
1:23:18.302 --> 1:23:19.317 | |
at previous hypotheses. | |
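[Editor's note: a sketch of this continuation-based policy; `lm_continuations` and `translate` are assumed interfaces over token lists, and `longest_common_prefix` is the helper defined in the local-agreement sketch above.]

```python
def stable_output(source_so_far, lm_continuations, translate, n=3):
    """Let a language model propose n plausible continuations of the
    source, translate each extended source, and keep only the part of
    the translation that is identical under all of them: that part is
    unlikely to change when the real continuation arrives."""
    stable = translate(source_so_far)
    for continuation in lm_continuations(source_so_far, n):
        hyp = translate(source_so_far + continuation)
        stable = longest_common_prefix(stable, hyp)
    return stable
```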
1:23:23.783 --> 1:23:31.388 | |
How well that works, you see here in one example. | |
1:23:31.388 --> 1:23:39.641 | |
You have the latency here, the baselines, and | |
this approach. | |
1:23:40.040 --> 1:23:47.041 | |
And you see in this case you have worse BLEU | |
scores here. | |
1:23:47.041 --> 1:23:51.670 | |
For k equal to one you have a better latency. | |
1:23:52.032 --> 1:24:01.123 | |
Does anybody have an idea | |
of what could be challenging there? | |
1:24:05.825 --> 1:24:20.132 | |
One problem of these models are hallucinations, | |
which are often very long and have a negative impact. | |
1:24:24.884 --> 1:24:30.869 | |
If you now remove the last four words, but | |
your model starts to hallucinate and invent | |
1:24:30.869 --> 1:24:37.438 | |
just a lot of new stuff, then yeah, you're removing | |
the last four words of that, but if it has invented | |
1:24:37.438 --> 1:24:41.406 | |
ten words, you're still outputting six of | |
these invented words. | |
1:24:41.982 --> 1:24:48.672 | |
Typically, once it starts hallucinating and generating | |
some output, it's quite long, so then it's | |
1:24:48.672 --> 1:24:50.902 | |
no longer enough to just hold back a few words. | |
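[Editor's note: for comparison, the simple hold-n baseline being discussed is just this one-liner.]

```python
def hold_n(hypothesis, n=4):
    """Show everything except the last n words, assuming only the tail
    of the hypothesis is unstable. As discussed above, this fails under
    hallucination: if ten words were invented, holding back four still
    displays six invented ones."""
    return hypothesis[:-n] if len(hypothesis) > n else []
```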
1:24:51.511 --> 1:24:57.695 | |
And then, of course, it's a bit better if you compare | |
to the previous hypotheses. | |
1:24:57.695 --> 1:25:01.528 | |
Their hallucinations are typically different. | |
1:25:07.567 --> 1:25:25.939 | |
Yes, so we won't talk about the details, but | |
for the output, for presentation, there are different | |
1:25:25.939 --> 1:25:27.100 | |
ways. | |
1:25:27.347 --> 1:25:36.047 | |
So you want to have maximum two lines, maximum | |
forty-two characters per line, and the reading | |
1:25:36.047 --> 1:25:40.212 | |
speed is a maximum of twenty-one characters | |
per second. | |
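[Editor's note: these constraints are easy to check mechanically; a small sketch with the numbers just mentioned.]

```python
import textwrap

MAX_LINES = 2          # lines per subtitle block
MAX_CHARS = 42         # characters per line
MAX_CPS = 21           # reading speed: characters per second

def fits_subtitle(text, duration_seconds):
    """Check one candidate subtitle block against the constraints
    mentioned above (line count, line length, reading speed)."""
    lines = textwrap.wrap(text, width=MAX_CHARS)
    chars_per_second = len(text) / duration_seconds
    return len(lines) <= MAX_LINES and chars_per_second <= MAX_CPS

print(fits_subtitle("We talked about simultaneous translation today.", 3.0))  # True
```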
1:25:40.981 --> 1:25:43.513 | |
How to do that exactly, we can skip. | |
1:25:43.463 --> 1:25:46.804 | |
Then you can generate something like that. | |
1:25:46.886 --> 1:25:53.250 | |
Another challenge is, of course, that you | |
not only need to generate the translation, | |
1:25:53.250 --> 1:25:59.614 | |
but for subtitling you also want to generate | |
when to put breaks and what to display. | |
1:25:59.619 --> 1:26:06.234 | |
Because it cannot be full sentences: as said | |
here, if you have like maximum forty-two | |
1:26:06.234 --> 1:26:10.443 | |
characters per line, that's not always a full | |
sentence. | |
1:26:10.443 --> 1:26:12.247 | |
So how can you segment it? | |
1:26:13.093 --> 1:26:16.253 | |
And then for speech there's not even a hint | |
like punctuation for where to split. | |
1:26:18.398 --> 1:26:27.711 | |
So what we have done today is yeah, we looked | |
into maybe three challenges: We have this segmentation, | |
1:26:27.711 --> 1:26:33.013 | |
which is a challenge both in evaluation and | |
in the decoder. | |
1:26:33.013 --> 1:26:40.613 | |
We talked about disfluencies, and we talked | |
about simultaneous translation and how to | |
1:26:40.613 --> 1:26:42.911 | |
address these challenges. | |
1:26:43.463 --> 1:26:45.507 | |
Any more questions? | |
1:26:48.408 --> 1:26:52.578 | |
Good, then with new content | |
1:26:52.578 --> 1:26:58.198 | |
we are done for this semester. | |
1:26:58.198 --> 1:27:04.905 | |
What we'll do next is a | |
1:27:04.744 --> 1:27:09.405 | |
repetition, where we can try to repeat a bit | |
what we've done all over the semester. | |
1:27:10.010 --> 1:27:13.776 | |
I'll prepare a bit of repetition on what I think | |
is important. | |
1:27:14.634 --> 1:27:21.441 | |
But of course it is also the chance for you to | |
ask specific questions. | |
1:27:21.441 --> 1:27:25.445 | |
Like: it's not clear to me how things relate. | |
1:27:25.745 --> 1:27:34.906 | |
So if you have any specific questions, please | |
come to me or send me an email or so, then | |
1:27:34.906 --> 1:27:36.038 | |
I'm happy to cover them. | |
1:27:36.396 --> 1:27:46.665 | |
If I should focus on something really in depth, it | |
might be good to not just come and send me an email | |
1:27:46.665 --> 1:27:49.204 | |
on Wednesday evening, but earlier. | |