WEBVTT 0:00:01.121 --> 0:00:14.214 Okay, so welcome to today's lecture. On Tuesday we started to talk about speech translation. 0:00:14.634 --> 0:00:27.037 And hopefully you have an idea of the basic ideas we have in speech translation, the two 0:00:27.037 --> 0:00:29.464 major approaches. 0:00:29.829 --> 0:00:41.459 And the other one is the end-to-end system where we have one large system which does everything 0:00:41.459 --> 0:00:42.796 together. 0:00:43.643 --> 0:00:58.459 Until now we have mainly focused on text output, which is what we'll see today, but you can extend these ideas 0:00:58.459 --> 0:01:01.138 to speech output as well. 0:01:01.441 --> 0:01:08.592 But since this is also a machine translation lecture, we of course mainly focus on 0:01:08.592 --> 0:01:10.768 the translation challenges. 0:01:12.172 --> 0:01:25.045 And the main focus of today's lecture is to look into why speech 0:01:25.045 --> 0:01:26.845 translation is challenging. 0:01:27.627 --> 0:01:33.901 So a bit more focus on what is now really the difference to text translation and how we can address it. 0:01:34.254 --> 0:01:39.703 We'll start with the segmentation problem. 0:01:39.703 --> 0:01:45.990 We had that already a bit, but especially for end-to-end. 0:01:46.386 --> 0:01:57.253 So the problem is that until now it was easy to segment the input into sentences and then 0:01:57.253 --> 0:02:01.842 translate each sentence individually. 0:02:02.442 --> 0:02:17.561 When you're now translating audio, the challenge is that you have just a sequence of audio input 0:02:17.561 --> 0:02:20.055 and there's no. 0:02:21.401 --> 0:02:27.834 So you have this difference that your audio is a continuous stream, but the text is typically 0:02:27.834 --> 0:02:28.930 sentence based. 0:02:28.930 --> 0:02:31.667 So how can you bridge this gap?
0:02:31.667 --> 0:02:37.690 We'll see that this is really essential, and if you're not using a decent system there, 0:02:37.690 --> 0:02:41.249 then you can lose a lot of quality and performance. 0:02:41.641 --> 0:02:44.267 That is what I also meant before. 0:02:44.267 --> 0:02:51.734 So if you have a more complex system out of several units, it's really essential that they 0:02:51.734 --> 0:02:56.658 all work together, and it's very easy to lose significantly. 0:02:57.497 --> 0:03:13.029 The second challenge we'll talk about is disfluencies, so the style of speaking is very different 0:03:13.029 --> 0:03:14.773 from text. 0:03:15.135 --> 0:03:24.727 So if you translate TED Talks, those are normally very good speakers. 0:03:24.727 --> 0:03:30.149 They will give you a very fluent text. 0:03:30.670 --> 0:03:36.692 When you want to translate a lecture, it might be more difficult. 0:03:37.097 --> 0:03:39.242 I mean, people are not always that well prepared. 0:03:39.242 --> 0:03:42.281 They should be prepared in giving the lecture and. 0:03:42.362 --> 0:03:48.241 But it's not that, I mean, typically a lecturer will have rehearsed like five times before 0:03:48.241 --> 0:03:52.682 he is giving this lecture, and then will it be completely fluent? 0:03:52.682 --> 0:03:56.122 He might at some point notice, oh, this is not perfect. 0:03:56.122 --> 0:04:00.062 I want to rephrase, and he'll have to think during the lecture. 0:04:00.300 --> 0:04:04.049 It might also be good that he's thinking, so he's not going too fast and things like that. 0:04:05.305 --> 0:04:07.933 If you then go to the other extreme, it's more meetings. 0:04:08.208 --> 0:04:15.430 If you have a lively discussion, of course, people will interrupt, they will restart, they 0:04:15.430 --> 0:04:22.971 will think while they speak, and you know that sometimes you tell people to first think and then speak 0:04:22.971 --> 0:04:26.225 because they are changing their opinion.
0:04:26.606 --> 0:04:31.346 So the question is how can you deal with this? 0:04:31.346 --> 0:04:37.498 And there again there might be solutions for that, or at least. 0:04:39.759 --> 0:04:46.557 Then for the output we will look into simultaneous translation, which is at least not very important 0:04:46.557 --> 0:04:47.175 in text. 0:04:47.175 --> 0:04:53.699 There might be some cases, but normally you have all the text available and then you're translating 0:04:53.699 --> 0:04:54.042 and. 0:04:54.394 --> 0:05:09.220 While for speech translation, since it's often a live interaction, then of course it's important. 0:05:09.149 --> 0:05:12.378 Otherwise it's hard to follow. 0:05:12.378 --> 0:05:19.463 You see what was said five minutes ago and the slide is not as helpful. 0:05:19.739 --> 0:05:35.627 You have to wait very long before you can answer because you have to first wait for what 0:05:35.627 --> 0:05:39.197 is happening there. 0:05:40.660 --> 0:05:46.177 And finally, we can talk a bit about presentation. 0:05:46.177 --> 0:05:54.722 For example, I mentioned that if you're generating subtitles, it's not possible. 0:05:54.854 --> 0:06:01.110 So in professional subtitles there are clear rules. 0:06:01.110 --> 0:06:05.681 A subtitle has to be shown for seconds. 0:06:05.681 --> 0:06:08.929 It's a maximum of two lines. 0:06:09.549 --> 0:06:13.156 Because otherwise it's getting too long, you're not able to read it anymore, and so on. 0:06:13.613 --> 0:06:19.826 So if you want to achieve that, of course, you might have to adjust and select what you 0:06:19.826 --> 0:06:20.390 really. 0:06:23.203 --> 0:06:28.393 The first part starts with the segmentation. 0:06:28.393 --> 0:06:36.351 On the one hand it's an issue while training, on the other hand it's. 0:06:38.678 --> 0:06:47.781 What is the problem? When we train, it's relatively easy to separate our data at the sentence 0:06:47.781 --> 0:06:48.466 level.
0:06:48.808 --> 0:07:02.241 So if you have your example, you have the audio and the text, then you typically know 0:07:02.241 --> 0:07:07.083 that this sentence is aligned. 0:07:07.627 --> 0:07:16.702 You can use this time information to cut your audio and then you can train and then. 0:07:18.018 --> 0:07:31.775 Because what we need for an end-to-end model is an input and an output chunk, in this case an audio 0:07:31.775 --> 0:07:32.822 chunk. 0:07:33.133 --> 0:07:38.551 And even if this is a long speech, it's easy then, since we have this time information, to 0:07:38.551 --> 0:07:39.159 separate. 0:07:39.579 --> 0:07:43.866 But we are using therefore, of course, the target side information. 0:07:45.865 --> 0:07:47.949 The problem is now at runtime. 0:07:47.949 --> 0:07:49.427 This is not possible. 0:07:49.427 --> 0:07:55.341 Here we can do that based on the punctuation marks and the sentence segmentation on the 0:07:55.341 --> 0:07:57.962 target side because that is splitting. 0:07:57.962 --> 0:08:02.129 But during transcription, during translation, it is not possible. 0:08:02.442 --> 0:08:10.288 Because there is just a long audio signal, and of course you could split your test data 0:08:10.288 --> 0:08:15.193 manually. That has been done for some experiments. 0:08:15.193 --> 0:08:22.840 It's fine, but it's not a realistic scenario, because if you really apply it in the real world, 0:08:22.840 --> 0:08:25.949 we won't have a manual segmentation. 0:08:26.266 --> 0:08:31.838 If a human has to do that then he can do the translation, so you want to have a fully automatic 0:08:31.838 --> 0:08:32.431 pipeline. 0:08:32.993 --> 0:08:38.343 So the question is how can we deal with this type of, you know? 0:09:09.309 --> 0:09:20.232 So the question is how can we deal with this type of situation and how can we segment the 0:09:20.232 --> 0:09:23.024 audio into some units?
0:09:23.863 --> 0:09:32.495 And here is one further really big advantage of a cascaded system: how is this done 0:09:32.495 --> 0:09:34.259 in a cascaded system? 0:09:34.259 --> 0:09:38.494 We are splitting the audio with some features we are doing. 0:09:38.494 --> 0:09:42.094 We can use similar ones which we'll discuss later. 0:09:42.094 --> 0:09:43.929 Then we run speech recognition. 0:09:43.929 --> 0:09:48.799 We have the transcript and then we can do what we talked about last time. 0:09:49.069 --> 0:10:02.260 So if you have this is an audio signal and the training data it was good. 0:10:02.822 --> 0:10:07.951 So here we have a big advantage. 0:10:07.951 --> 0:10:16.809 We can use a different segmentation for the ASR and for the MT. 0:10:16.809 --> 0:10:21.316 Why is that a big advantage? 0:10:23.303 --> 0:10:34.067 I would say for the MT task it is more important because we can then do the sentence segmentation. 0:10:34.955 --> 0:10:37.603 I see, and yeah, we can do the same thing. 0:10:37.717 --> 0:10:40.226 For ASR, why is it not as important for ASR? 0:10:40.226 --> 0:10:40.814 Or maybe. 0:10:43.363 --> 0:10:48.589 We don't need that much context. 0:10:48.589 --> 0:11:01.099 We only try to recognize the word, and the context to consider is mainly small. 0:11:03.283 --> 0:11:11.419 I would agree that MT needs more context, but there is one more important point. 0:11:11.651 --> 0:11:16.764 ASR is monotone, so there's no reordering. 0:11:16.764 --> 0:11:22.472 In the second part of the signal there is no reordering. 0:11:22.472 --> 0:11:23.542 We have. 0:11:23.683 --> 0:11:29.147 And of course if we are doing that we cannot really reorder across boundaries between segments. 0:11:29.549 --> 0:11:37.491 It might be challenging if we split the words so that it's not perfect for so that. 0:11:37.637 --> 0:11:40.846 But in MT we need to do quite long-range reordering.
0:11:40.846 --> 0:11:47.058 If you think about German, where the verb has moved, and now the English verb is in one 0:11:47.058 --> 0:11:50.198 part, but the end of the sentence is in another. 0:11:50.670 --> 0:11:59.427 And of course this advantage we have now here that if we have a segment we have. 0:12:01.441 --> 0:12:08.817 And that this segmentation is important. 0:12:08.817 --> 0:12:15.294 Here are some motivations for that. 0:12:15.675 --> 0:12:25.325 What you are doing is you are taking the reference text and you are segmenting. 0:12:26.326 --> 0:12:30.991 And then, of course, your segments are exactly correct. 0:12:31.471 --> 0:12:42.980 If you're now using different segmentation strategies, you're losing significantly in BLEU 0:12:42.980 --> 0:12:44.004 points. 0:12:44.004 --> 0:12:50.398 If the segmentation is bad, your results are a lot worse. 0:12:52.312 --> 0:13:10.323 And interestingly, here you also see how it was for a human, but people have in a competition. 0:13:10.450 --> 0:13:22.996 You can see that by working on the segmentation and using better segmentation you can improve 0:13:22.996 --> 0:13:25.398 your performance. 0:13:26.006 --> 0:13:29.932 So it's really essential. 0:13:29.932 --> 0:13:41.712 One other interesting thing is if you're looking into the difference between. 0:13:42.082 --> 0:13:49.145 So it really seems to be more important to have a good segmentation 0:13:49.109 --> 0:13:56.248 for an end-to-end system, because there you can't re-segment, while it is less important 0:13:56.248 --> 0:13:58.157 for a cascaded system. 0:13:58.157 --> 0:14:05.048 Of course, it's still important, but the difference between the two segmentations. 0:14:06.466 --> 0:14:18.391 It was a shared task some years ago like it's just one system from different. 0:14:22.122 --> 0:14:31.934 So the question is how can we deal with this in speech translation and what people look 0:14:31.934 --> 0:14:32.604 into?
0:14:32.752 --> 0:14:48.360 Now we want to use different techniques to split the audio signal into segments. 0:14:48.848 --> 0:14:54.413 You have the disadvantage that you can't change it. 0:14:54.413 --> 0:15:00.407 Therefore, some of the quality might be more important. 0:15:00.660 --> 0:15:15.678 But in both cases, of course, the results are better if you have a good segmentation. 0:15:17.197 --> 0:15:23.149 So any idea: if you have this task now, how would you split this audio? 0:15:23.149 --> 0:15:26.219 What type of tool would you use? 0:15:28.648 --> 0:15:41.513 Use a neural network to segment, for instance supervised. 0:15:41.962 --> 0:15:44.693 Yes, that's exactly already the better system. 0:15:44.693 --> 0:15:50.390 So for a long time people have done simpler things because, we'll come to that, it's a bit challenging 0:15:50.390 --> 0:15:52.250 to create or have the data. 0:15:53.193 --> 0:16:00.438 The first thing is you use some tool out of the box like voice activity detection, which 0:16:00.438 --> 0:16:07.189 has been there as a whole research field, so people find when somebody's speaking. 0:16:07.647 --> 0:16:14.952 And then you use that with a threshold: you always have the probability that somebody's 0:16:14.952 --> 0:16:16.273 speaking or not. 0:16:17.217 --> 0:16:19.889 Then you split your signal. 0:16:19.889 --> 0:16:26.762 It will not be perfect, but you transcribe or translate each component. 0:16:28.508 --> 0:16:39.337 But as you see, a supervised classification task is even better, and that is now the most 0:16:39.337 --> 0:16:40.781 commonly used. 0:16:41.441 --> 0:16:49.909 So you're doing that as a supervised classification and then you'll try to use this 0:16:49.909 --> 0:16:50.462 type. 0:16:50.810 --> 0:16:53.217 We're going into a bit more detail on how to do that. 0:16:53.633 --> 0:17:01.354 So what you need to do first is, of course, you have to have some labels whether this is 0:17:01.354 --> 0:17:03.089 an end of sentence.
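The voice-activity-detection baseline described above (threshold the per-frame speech probability, then split the signal) can be sketched as follows. This is a minimal illustration, not a particular VAD tool: the function name `split_on_speech`, the probabilities in `probs`, and the threshold value 0.5 are all assumptions made for the example.

```python
from typing import List, Tuple


def split_on_speech(probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose speech probability is >= threshold.

    `probs` stands in for the per-frame output of some VAD model (hypothetical here).
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a run of speech frames begins
        elif p < threshold and start is not None:
            segments.append((start, i))    # the run ended just before frame i
            start = None
    if start is not None:                  # the audio ended while still in speech
        segments.append((start, len(probs)))
    return segments


probs = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.3]
print(split_on_speech(probs))  # [(1, 4), (6, 8)]
```

Each returned range would then be transcribed or translated on its own. A single fixed threshold gives no control over segment length, which is the weakness the later part of the lecture addresses.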
0:17:03.363 --> 0:17:10.588 You do that by using the alignment between the segments and the audio. 0:17:10.588 --> 0:17:12.013 You have the. 0:17:12.212 --> 0:17:15.365 You typically do not have that for each word, so these timestamps. 0:17:15.365 --> 0:17:16.889 This word is said at this time. 0:17:17.157 --> 0:17:27.935 But what you typically have is from this time to this time. 0:17:27.935 --> 0:17:34.654 We have the first segment, the second segment. 0:17:35.195 --> 0:17:39.051 Which is also used to train, for example, your ASR system and everything. 0:17:41.661 --> 0:17:53.715 Based on that you can label each frame in there, so if you have green or blue, that is 0:17:53.715 --> 0:17:57.455 our speech segment, so you. 0:17:58.618 --> 0:18:05.690 And these labels will then later help you, but you extract exactly these types of. 0:18:07.067 --> 0:18:08.917 There's one big challenge. 0:18:08.917 --> 0:18:15.152 If you have two sentences which are directly connected to each other, then if you're doing 0:18:15.152 --> 0:18:18.715 this labeling, you would not have a break in between later. 0:18:18.715 --> 0:18:23.512 If you tried to extract that, should there be a break or not? 0:18:23.943 --> 0:18:31.955 So what you typically do is the last frame 0:18:31.955 --> 0:18:41.331 you mark as outside, although it's not really outside. 0:18:43.463 --> 0:18:46.882 Yes, I guess you could also do that in more of a BIO scheme. 0:18:46.882 --> 0:18:48.702 I mean, this is the simplest. 0:18:48.702 --> 0:18:51.514 It's like inside outside, so it's related to that. 0:18:51.514 --> 0:18:54.988 Of course, you could have an extra start-of-segment label, and so on. 0:18:54.988 --> 0:18:57.469 I guess this is just to make it simpler. 0:18:57.469 --> 0:19:00.226 You only have two labels, not a three-class problem. 0:19:00.226 --> 0:19:02.377 But yeah, you could do similar things.
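The labeling step described above, including the trick of marking the last frame of each segment as outside so that two directly adjacent sentences stay separable, might look like the sketch below. The function name `label_frames`, the millisecond units, and the 20 ms frame size are assumptions chosen for illustration.

```python
from typing import List, Tuple


def label_frames(segments_ms: List[Tuple[int, int]], total_ms: int,
                 frame_ms: int = 20) -> List[int]:
    """Turn aligned segment times into per-frame labels (1 = inside speech, 0 = outside).

    The last frame of every segment is forced to 0, so that two segments that
    touch each other still show a break in the label sequence.
    """
    n_frames = total_ms // frame_ms
    labels = [0] * n_frames
    for start_ms, end_ms in segments_ms:
        first = start_ms // frame_ms
        last = min(end_ms // frame_ms, n_frames)   # exclusive end frame
        for i in range(first, last):
            labels[i] = 1
        if last > first:
            labels[last - 1] = 0                   # marked outside although it is speech
    return labels


# Two sentences that follow each other without any pause (0-80 ms and 80-160 ms).
print(label_frames([(0, 80), (80, 160)], total_ms=200))
# [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
```

Without the forced 0 the two sentences would merge into one run of eight 1-labels and the boundary could not be recovered from the labels alone.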
0:19:12.432 --> 0:19:20.460 Does that cause problems down the road, because it could be an important part of a segment 0:19:20.460 --> 0:19:24.429 which has some meaning and we drop something? 0:19:24.429 --> 0:19:28.398 The good thing is frames are normally very short. 0:19:28.688 --> 0:19:37.586 Like some milliseconds, so normally if you remove some milliseconds you can still understand 0:19:37.586 --> 0:19:38.734 everything. 0:19:38.918 --> 0:19:46.999 I mean, the speech signal is very repetitive, and so you have the information a lot of times. 0:19:47.387 --> 0:19:50.730 That's why we talked last time about how you could try to shrink the sequence and. 0:19:51.031 --> 0:20:00.995 If you now have a short sequence where there is like which would be removed and that's not 0:20:00.995 --> 0:20:01.871 really. 0:20:02.162 --> 0:20:06.585 Yeah, but it's not that a full letter is missing. 0:20:06.585 --> 0:20:11.009 It's like only the last ending of the vowel. 0:20:11.751 --> 0:20:15.369 I think it doesn't really happen. 0:20:15.369 --> 0:20:23.056 We have our audio signal and we have these gaps that are not above. 0:20:23.883 --> 0:20:29.288 With these blue rectangles we mark the inside of a speech segment, and with the gaps it's outside, yes. 0:20:29.669 --> 0:20:35.736 So then you have the full signal and you're now framing your task as a blue or 0:20:35.736 --> 0:20:36.977 white prediction. 0:20:36.977 --> 0:20:39.252 So that is your prediction task. 0:20:39.252 --> 0:20:44.973 You have the audio signal only and your prediction task is like label one or zero. 0:20:45.305 --> 0:20:55.585 Once you do that, then based on this labeling you can extract each segment again, like each 0:20:55.585 --> 0:20:58.212 consecutive blue area. 0:20:58.798 --> 0:21:05.198 You have then maybe removed the non-speaking parts already and do speech translation only on 0:21:05.198 --> 0:21:05.998 the speech parts. 0:21:06.786 --> 0:21:19.768 Which is good because in training you would have done it similarly.
0:21:20.120 --> 0:21:26.842 So the noise in between you never saw in training, so it's good to throw it away. 0:21:29.649 --> 0:21:34.930 One challenge, of course, is now if you're doing that, what is your input? 0:21:34.930 --> 0:21:40.704 You cannot normally do the sequence labeling on the whole talk because it's too long. 0:21:40.704 --> 0:21:46.759 So if you're doing this prediction of the label, you also have a window for which you 0:21:46.759 --> 0:21:48.238 do the segmentation. 0:21:48.788 --> 0:21:54.515 And that's like the baseline we had in the punctuation prediction. 0:21:54.515 --> 0:22:00.426 If we don't have good borders, random splits are normally good. 0:22:00.426 --> 0:22:03.936 So what we do now is split the audio. 0:22:04.344 --> 0:22:09.134 So that would be our input, and then the part three would be our labels. 0:22:09.269 --> 0:22:15.606 This green would be the input and here we want, for example, blue labels and then white. 0:22:16.036 --> 0:22:20.360 Here only blue labels, and here at the beginning white, maybe at the end white. 0:22:21.401 --> 0:22:28.924 So thereby you now always have a fixed window for which you're then doing this task of predicting. 0:22:33.954 --> 0:22:43.914 How you build your classifier, that is based again. 0:22:43.914 --> 0:22:52.507 We had this wav2vec mentioned last week. 0:22:52.752 --> 0:23:00.599 So in training you use labels to say whether it's in speech or outside speech. 0:23:01.681 --> 0:23:17.740 Inference: You always give it the chunk and then predict for this part the label of each 0:23:17.740 --> 0:23:20.843 frame. 0:23:23.143 --> 0:23:29.511 It's a bit more complicated, so one challenge is if you randomly split, of course, you are losing 0:23:29.511 --> 0:23:32.028 your context for the first frame. 0:23:32.028 --> 0:23:38.692 It might be very hard to predict whether this is now in or out of, and also for the last.
0:23:39.980 --> 0:23:48.449 You often need a bit of context whether this is audio or not, and at the beginning. 0:23:49.249 --> 0:23:59.563 So what you do is you put the audio in twice. 0:23:59.563 --> 0:24:08.532 You want to do it with splits and then. 0:24:08.788 --> 0:24:15.996 As shown, you have two shifted offsets, so one is predicted with the other offset. 0:24:16.416 --> 0:24:23.647 And then averaging the probabilities so that at each time you have, at least for one of 0:24:23.647 --> 0:24:25.127 the predictions. 0:24:25.265 --> 0:24:36.326 Because at the end of the segment it might be very hard to predict whether this is now 0:24:36.326 --> 0:24:39.027 speech or nonspeech. 0:24:39.939 --> 0:24:47.956 I think it is a hyperparameter, but you are not optimizing it, so you just take two shifts. 0:24:48.328 --> 0:24:54.636 Of course you could try a lot of different shifts and so on. 0:24:54.636 --> 0:24:59.707 The thing is it's mainly a problem here. 0:24:59.707 --> 0:25:04.407 If you don't do two offsets you have. 0:25:05.105 --> 0:25:14.761 You could get better by doing that, but I would be skeptical if it really matters, and I also 0:25:14.761 --> 0:25:18.946 have not seen any experiments doing that. 0:25:19.159 --> 0:25:27.629 I guess you're already good, you have maybe some errors in there and you're getting. 0:25:31.191 --> 0:25:37.824 So with this you have your segmentation. 0:25:37.824 --> 0:25:44.296 However, there is a problem in between. 0:25:44.296 --> 0:25:49.150 Once the model is wrong then. 0:25:49.789 --> 0:26:01.755 The normal thing would be, the first thing, that you take some threshold and that you always 0:26:01.755 --> 0:26:05.436 label everything above it as speech. 0:26:06.006 --> 0:26:19.368 The problem is when you are just doing this one threshold that you might have. 0:26:19.339 --> 0:26:23.954 Those are the challenges. 0:26:23.954 --> 0:26:31.232 Short segments mean you have no context. 0:26:31.232 --> 0:26:35.492 The quality will be bad.
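The idea of feeding the audio in twice with two shifted window offsets and averaging the per-frame probabilities can be illustrated as below. This is only a sketch: `toy_classifier` is a hypothetical stand-in for the trained frame classifier, built so that it is deliberately unsure (0.5) at the edges of every window, and the window size and offsets are assumed values.

```python
from typing import Callable, List, Sequence, Tuple


def window_bounds(n: int, window: int, offset: int) -> List[Tuple[int, int]]:
    """Cut n frames into windows of `window` frames, with the grid shifted by `offset`."""
    cuts = sorted({0, n, *range(offset, n, window)})
    return list(zip(cuts[:-1], cuts[1:]))


def predict_with_offsets(frames: Sequence[float], window: int,
                         classify: Callable[[Sequence[float]], List[float]],
                         offsets: Sequence[int]) -> List[float]:
    """Run the classifier once per offset and average the per-frame probabilities."""
    total = [0.0] * len(frames)
    for offset in offsets:
        for lo, hi in window_bounds(len(frames), window, offset):
            for i, p in zip(range(lo, hi), classify(frames[lo:hi])):
                total[i] += p
    return [t / len(offsets) for t in total]


def toy_classifier(chunk: Sequence[float]) -> List[float]:
    """Hypothetical model: correct inside a window, unsure (0.5) at both window edges."""
    probs = [float(x) for x in chunk]
    probs[0] = probs[-1] = 0.5
    return probs


truth = [1, 1, 1, 1, 0, 0, 0, 0]   # true labels, used here as fake input features
single = predict_with_offsets(truth, 4, toy_classifier, offsets=[0])
double = predict_with_offsets(truth, 4, toy_classifier, offsets=[0, 2])
print(single)  # [0.5, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0, 0.5]
print(double)  # [0.5, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.5]
```

With a single offset, frames 3 and 4 sit at a window edge and come out as 0.5. With the second offset they fall in the middle of a window, so the average moves to the correct side of the threshold.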
0:26:37.077 --> 0:26:48.954 Therefore, people use this probabilistic divide-and-conquer algorithm, so the main idea is: start 0:26:48.954 --> 0:26:56.744 with the whole segment, and now you split the whole segment. 0:26:57.397 --> 0:27:09.842 Then you split there and then you continue until each segment is smaller than the maximum 0:27:09.842 --> 0:27:10.949 length. 0:27:11.431 --> 0:27:23.161 But you can ignore some splits, and if you split one segment into two parts you first 0:27:23.161 --> 0:27:23.980 trim. 0:27:24.064 --> 0:27:40.197 So normally it's not only one single position, it's a longer area of non-voice, so you try 0:27:40.197 --> 0:27:43.921 to find this longer. 0:27:43.943 --> 0:27:51.403 Now your large segment is split into two smaller segments. 0:27:51.403 --> 0:27:56.082 Now you are checking these segments. 0:27:56.296 --> 0:28:04.683 So if they are very, very short, it might be good not to split at this point because you're 0:28:04.683 --> 0:28:05.697 ending up. 0:28:06.006 --> 0:28:09.631 And this way you continue all the time, and then hopefully you'll have a good split. 0:28:10.090 --> 0:28:19.225 So, of course, there's one challenge with this approach: if you think about it later, 0:28:19.225 --> 0:28:20.606 low latency. 0:28:25.405 --> 0:28:31.555 So in this case you have to have the full audio available. 0:28:32.132 --> 0:28:38.112 So you cannot do that continuously. I mean, you could if you would just always do it: 0:28:38.112 --> 0:28:45.588 if the probability is higher you split, but in this case you try to find a global optimum. 0:28:46.706 --> 0:28:49.134 A heuristic, but. 0:28:49.134 --> 0:28:58.170 You find a global solution for your whole talk and not a local one. 0:28:58.170 --> 0:29:02.216 Where's the system most sure? 0:29:02.802 --> 0:29:12.467 So that's a bit of a challenge here, but the advantage of course is that in the end you 0:29:12.467 --> 0:29:14.444 have no segments. 0:29:17.817 --> 0:29:23.716 Any more questions on this?
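A simplified sketch of the divide-and-conquer splitting described above: start from the whole recording, split at the frame where the classifier is least sure there is speech, trim the non-speech at the edges of both halves, skip split points that would leave a very short piece, and repeat until every segment fits the maximum length. The names `pdac` and `trim`, the parameters, and the midpoint fallback are assumptions for illustration, not a reference implementation.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # [start, end) in frames


def trim(probs: List[float], lo: int, hi: int, threshold: float) -> Segment:
    """Drop leading and trailing frames whose speech probability is below threshold."""
    while lo < hi and probs[lo] < threshold:
        lo += 1
    while hi > lo and probs[hi - 1] < threshold:
        hi -= 1
    return lo, hi


def pdac(probs: List[float], max_len: int, min_len: int = 1,
         threshold: float = 0.5) -> List[Segment]:
    """Recursively split until every segment is at most max_len frames long."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1 frame")
    done: List[Segment] = []
    stack = [trim(probs, 0, len(probs), threshold)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= max_len:
            if hi > lo:
                done.append((lo, hi))
            continue
        # Try split points from the least speech-like frame upwards.
        for split in sorted(range(lo + 1, hi - 1), key=lambda i: probs[i]):
            left = trim(probs, lo, split, threshold)
            right = trim(probs, split + 1, hi, threshold)
            if left[1] - left[0] >= min_len and right[1] - right[0] >= min_len:
                break          # both halves are long enough, accept this split
        else:
            # Assumed fallback: no acceptable split point, so cut in the middle.
            mid = (lo + hi) // 2
            left, right = (lo, mid), (mid, hi)
        stack.extend([right, left])
    return sorted(done)


probs = [0.9, 0.9, 0.8, 0.2, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9]
print(pdac(probs, max_len=4))  # [(0, 3), (4, 6), (8, 10)]
```

Because the split point is chosen over the whole current segment, the full audio has to be available up front, which is the latency limitation mentioned above.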
0:29:23.716 --> 0:29:36.693 Then the next thing is we also need to evaluate in this scenario. 0:29:37.097 --> 0:29:44.349 So I know, machine translation evaluation is quite a long way back. 0:29:44.349 --> 0:29:55.303 That was at the beginning of the semester, but I hope you can remember. 0:29:55.675 --> 0:30:09.214 Might be with BLEU score, might be with COMET or similar, but you need to have. 0:30:10.310 --> 0:30:22.335 But this assumes that you have this one-to-one match, so you always have an output and machine 0:30:22.335 --> 0:30:26.132 translation, which is nicely. 0:30:26.506 --> 0:30:34.845 So then it might be that our output has four segments, while our reference output has only 0:30:34.845 --> 0:30:35.487 three. 0:30:36.756 --> 0:30:40.649 And now it is, of course, questionable what we should compare in our metric. 0:30:44.704 --> 0:30:53.087 So it's no longer possible to do that directly, because what should you compare? 0:30:53.413 --> 0:31:00.214 You just have four segments there and three segments there, and of course it seems to be that. 0:31:00.920 --> 0:31:06.373 The first one it likes to the first one when you see I can't speak Spanish, but you're an 0:31:06.373 --> 0:31:09.099 audience of the guests who is already there. 0:31:09.099 --> 0:31:14.491 So even like this, the BLEU comparison wouldn't work, so you need to do something 0:31:14.491 --> 0:31:17.157 about that to do this type of evaluation. 0:31:19.019 --> 0:31:21.727 Still, any suggestions what you could do? 0:31:25.925 --> 0:31:44.702 How can you calculate a BLEU score because you don't have one you want to see? 0:31:45.925 --> 0:31:49.365 Here you put another layer which spies to add in the second. 0:31:51.491 --> 0:31:56.979 It's even not aligning only, but that's one solution, so you need to align and resegment.
0:31:57.177 --> 0:32:06.886 Because even if you have now an alignment, so this to this and this to that, you see that it's 0:32:06.886 --> 0:32:12.341 not good because the audio would compare to that. 0:32:13.453 --> 0:32:16.967 What we'll discuss, there is even one simpler solution. 0:32:16.967 --> 0:32:19.119 Yes, it's a simpler solution. 0:32:19.119 --> 0:32:23.135 It's called document-based BLEU or something like that. 0:32:23.135 --> 0:32:25.717 So you just take the full document. 0:32:26.566 --> 0:32:32.630 For some metrics it's good and it's not clear how good it is for the others, but there might 0:32:32.630 --> 0:32:32.900 be. 0:32:33.393 --> 0:32:36.454 Think of simpler metrics like BLEU. 0:32:36.454 --> 0:32:40.356 Do you have any idea what could be a disadvantage? 0:32:49.249 --> 0:32:56.616 BLEU is matching n-grams, so you start with the original. 0:32:56.616 --> 0:33:01.270 You check how many n-grams are in here. 0:33:01.901 --> 0:33:11.233 If you're now doing that on the full document, you can also match n-grams from here to here. 0:33:11.751 --> 0:33:15.680 So you can match things very far away. 0:33:15.680 --> 0:33:21.321 Start doing translation and you just randomly. 0:33:22.142 --> 0:33:27.938 And that, of course, could be a bit of a disadvantage or is a problem, and therefore people 0:33:27.938 --> 0:33:29.910 also look into the segmentation. 0:33:29.910 --> 0:33:34.690 But I've recently seen some things, so document-level scores are also normally. 0:33:34.690 --> 0:33:39.949 If you have a relatively high quality system or state of the art, then they also have a 0:33:39.949 --> 0:33:41.801 good correlation with human judgments. 0:33:46.546 --> 0:33:59.241 So how are we doing that? We are putting end of sentence boundaries in there and then. 0:33:59.179 --> 0:34:07.486 An alignment based on something similar to the Levenshtein distance, so an edit distance between our output and the 0:34:07.486 --> 0:34:09.077 reference output.
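The disadvantage of document-level n-gram matching raised above, namely that n-grams can match something very far away, can be seen in a small example. The sketch below only counts clipped n-gram matches, which is the matching step of BLEU and not a full BLEU implementation; the function names and the toy sentences are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hyp: Sequence[str], ref: Sequence[str], n: int) -> int:
    """Number of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    return sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())


ref_sents = ["the cat sleeps".split(), "a dog barks".split()]
hyp_sents = ["a dog barks".split(), "the cat sleeps".split()]   # sentences swapped

# Sentence by sentence, nothing matches.
sentence_level = sum(clipped_matches(h, r, 2) for h, r in zip(hyp_sents, ref_sents))

# On the concatenated document, most bigrams match across the whole text.
doc_ref: List[str] = [tok for sent in ref_sents for tok in sent]
doc_hyp: List[str] = [tok for sent in hyp_sents for tok in sent]
document_level = clipped_matches(doc_hyp, doc_ref, 2)

print(sentence_level, document_level)  # 0 4
```

The swapped output gets no credit sentence by sentence but four of five bigrams at document level, which is why a resegmentation of the output is often preferred for simple metrics.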
0:34:09.449 --> 0:34:13.061 And here is our boundary. 0:34:13.061 --> 0:34:23.482 We map the boundary based on the alignment, so in Levenshtein you only have. 0:34:23.803 --> 0:34:36.036 And then, like all the words that are before, it might be since there is not a random. 0:34:36.336 --> 0:34:44.890 I mean it should be, but things like that can happen, and it's not clear where. 0:34:44.965 --> 0:34:49.727 At the break, however, they are typically not that bad because they are words which are 0:34:49.727 --> 0:34:52.270 not matching between reference and hypothesis. 0:34:52.270 --> 0:34:56.870 So normally it doesn't really matter that much because they are anyway not matching. 0:34:57.657 --> 0:35:05.888 And then you take the newly segmented MT output and use that to calculate your metric. 0:35:05.888 --> 0:35:12.575 Then it's again a perfect alignment for which you can calculate. 0:35:14.714 --> 0:35:19.229 Any idea? You could do it the other way around. 0:35:19.229 --> 0:35:23.359 You could resegment your reference to the. 0:35:29.309 --> 0:35:30.368 Which one would you select? 0:35:34.214 --> 0:35:43.979 I think segmenting the hypothesis is much more natural because the reference sentence 0:35:43.979 --> 0:35:46.474 is the fixed solution. 0:35:47.007 --> 0:35:52.947 Yes, that's the right motivation if you think about BLEU or so. 0:35:52.947 --> 0:35:57.646 It is additionally important if you change your reference. 0:35:57.857 --> 0:36:07.175 You might have a different number of bigrams or trigrams because the sentences are of different 0:36:07.175 --> 0:36:08.067 lengths. 0:36:08.068 --> 0:36:15.347 Here, if you have five systems, you're always comparing to the same reference, and you don't compare 0:36:15.347 --> 0:36:16.455 to different ones. 0:36:16.736 --> 0:36:22.317 They only differ based on the segmentation, but still it could make some difference.
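The resegmentation described above (align the system output to the reference with an edit distance, then map the reference sentence boundaries onto the output) can be sketched with a plain word-level Levenshtein alignment. The function name `resegment` and the tie-breaking choice, where unaligned output words at a boundary go to the following segment, are assumptions made for this illustration.

```python
from typing import Dict, List


def resegment(hyp: List[str], ref_segments: List[List[str]]) -> List[List[str]]:
    """Cut the hypothesis word stream so it has one segment per reference segment."""
    ref = [tok for seg in ref_segments for tok in seg]
    n, m = len(hyp), len(ref)
    # Standard Levenshtein table: d[i][j] = edit distance between hyp[:i] and ref[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace; cut_at[j] is the hypothesis position aligned to reference position j.
    cut_at: Dict[int, int] = {m: n}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            i, j = i - 1, j - 1        # match or substitution
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            i -= 1                     # extra hypothesis word
        else:
            j -= 1                     # missing reference word
        cut_at[j] = i                  # keeps the smallest i seen for each j
    bounds, total = [0], 0
    for seg in ref_segments[:-1]:
        total += len(seg)
        bounds.append(cut_at[total])
    bounds.append(n)
    return [hyp[a:b] for a, b in zip(bounds, bounds[1:])]


reference = [["i", "cannot", "speak", "spanish"], ["but", "you", "can"]]
output = "i can not speak spanish but you can".split()   # arrived with other boundaries
print(resegment(output, reference))
# [['i', 'can', 'not', 'speak', 'spanish'], ['but', 'you', 'can']]
```

After this step the output has exactly as many segments as the reference, so a sentence-level metric can be computed segment by segment while the reference itself stays unchanged.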
0:36:25.645 --> 0:36:38.974 Good, that's all about sentence segmentation, then a bit about disfluencies and what there 0:36:38.974 --> 0:36:40.146 really. 0:36:42.182 --> 0:36:51.138 So as said, in daily life you're not speaking like very nice full sentences every. 0:36:51.471 --> 0:36:53.420 We are speaking partial sentences. 0:36:53.420 --> 0:36:54.448 We do repetitions. 0:36:54.834 --> 0:37:00.915 It's especially so if it's more interactive, so in meetings, phone calls and so on. 0:37:00.915 --> 0:37:04.519 If you have multiple speakers, they also break. 0:37:04.724 --> 0:37:16.651 Each other, and then if you keep them, they are harder to translate because most of your 0:37:16.651 --> 0:37:17.991 training. 0:37:18.278 --> 0:37:30.449 It's also very difficult to read, so we'll have some examples there, to transcribe everything 0:37:30.449 --> 0:37:32.543 as it was said. 0:37:33.473 --> 0:37:36.555 What type of things are there? 0:37:37.717 --> 0:37:42.942 So you have all these filler words. 0:37:42.942 --> 0:37:47.442 These are very easy to remove. 0:37:47.442 --> 0:37:52.957 You can just use regular expressions. 0:37:53.433 --> 0:38:00.139 It is getting more difficult with some other types of filler words. 0:38:00.139 --> 0:38:03.387 In German you have this or in. 0:38:04.024 --> 0:38:08.473 And these ones you cannot just remove by regular expression. 0:38:08.473 --> 0:38:15.039 You shouldn't remove all "ja" from a text because it might be very important information 0:38:15.039 --> 0:38:15.768 as well. 0:38:15.715 --> 0:38:19.995 It may be not as important as you are, but still it might be very important. 0:38:20.300 --> 0:38:24.215 So just removing them is there already more difficult. 0:38:26.586 --> 0:38:29.162 Then you have these repetitions. 0:38:29.162 --> 0:38:32.596 You have something like: I mean, I saw him there. 0:38:32.596 --> 0:38:33.611 There was a.
0:38:34.334 --> 0:38:41.001 And while for the first one that might be very easy to remove because you just look for 0:38:41.001 --> 0:38:47.821 doubled words, the thing is that the repetition might not be exactly the same, so there is, there 0:38:47.821 --> 0:38:48.199 was. 0:38:48.199 --> 0:38:54.109 So there it is already getting a bit more complicated, of course still possible. 0:38:54.614 --> 0:39:01.929 You can remove Denver, so the real sense would be like to have a ticket to Houston. 0:39:02.882 --> 0:39:13.327 But there the detection, of course, is getting more challenging as you want to get rid of. 0:39:13.893 --> 0:39:21.699 You don't have the data, of course, which makes all the tasks harder, but you probably 0:39:21.699 --> 0:39:22.507 want to. 0:39:22.507 --> 0:39:24.840 That's really meaningful. 0:39:24.840 --> 0:39:26.185 Current isn't. 0:39:26.185 --> 0:39:31.120 That is now a really good point and it's really there. 0:39:31.051 --> 0:39:34.785 The thing is, what is your final task? 0:39:35.155 --> 0:39:45.526 If you want to have a transcript for reading, I'm not sure if we have another example. 0:39:45.845 --> 0:39:54.171 So there it's nicer if you have a clean transcript, and if you see subtitles, they're also not 0:39:54.171 --> 0:39:56.625 having all the repetitions. 0:39:56.625 --> 0:40:03.811 It's the nice way to shorten but also getting the structure you cannot even make. 0:40:04.064 --> 0:40:11.407 So in this situation, of course, they might give you information. 0:40:11.407 --> 0:40:14.745 There is a lot of stuttering. 0:40:15.015 --> 0:40:22.835 So in this case I agree it might be helpful in some way, but I mean, reading all the disfluencies 0:40:22.835 --> 0:40:25.198 is getting really difficult. 0:40:25.198 --> 0:40:28.049 If you have the next one, we have. 0:40:28.308 --> 0:40:31.630 That's a very long text. 0:40:31.630 --> 0:40:35.883 You need a bit of time to parse. 0:40:35.883 --> 0:40:39.472 This one is not important.
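The two easy cases discussed above, plain filler words and exactly repeated words, can be handled with regular expressions as in the sketch below. The filler list, the patterns, and the function name `clean` are assumptions for illustration. The harder cases from the lecture (near repetitions such as "there is, there was", corrections such as Denver to Houston, or words that are only sometimes fillers) are intentionally not covered.

```python
import re

# Assumed English filler list; words that carry meaning in some contexts must not go here.
FILLERS = re.compile(r"\b(?:uh|um|uhm|er)\b[,.]?\s*", re.IGNORECASE)
# A word followed by one or more exact copies of itself.
REPEAT = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean(text: str) -> str:
    """Remove simple fillers and collapse exact word repetitions."""
    text = FILLERS.sub("", text)
    text = REPEAT.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean("I uh saw him there there and um there was a a dog"))
# I saw him there and there was a dog
print(clean("there is there was a problem"))
# there is there was a problem   (near repetition is left untouched)
```

Note that the repetition pattern would also collapse legitimate doubles such as "had had", so even this simple rule removes too much in some sentences.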
0:40:40.480 --> 0:40:48.461 It might be nice if you can start reading from here. 0:40:48.461 --> 0:40:52.074 Let's have a look here. 0:40:52.074 --> 0:40:54.785 Try to read this. 0:40:57.297 --> 0:41:02.725 You can understand it, but I think you need a bit of time to really understand what was. 0:41:11.711 --> 0:41:21.480 And now we have the same text, but parts are highlighted in bold; now only read the 0:41:21.480 --> 0:41:22.154 bold. 0:41:23.984 --> 0:41:25.995 And ignore everything which is not bold. 0:41:30.250 --> 0:41:49.121 I would assume it's easier and faster to read just the bold part. 0:41:50.750 --> 0:41:57.626 Yeah, it might be, but I'm not sure we have a master thesis of that. 0:41:57.626 --> 0:41:59.619 If you have seen my videos, 0:42:00.000 --> 0:42:09.875 in the recordings I would also rather have it be fluent speech, and that I'm not 0:42:09.875 --> 0:42:12.318 doing the hesitations. 0:42:12.652 --> 0:42:23.764 I don't know if somebody else has looked into the Coursera videos, but notice that. 0:42:25.005 --> 0:42:31.879 For these videos I spoke every minute three times or something, and then people were there 0:42:31.879 --> 0:42:35.011 cutting things and making hopefully. 0:42:35.635 --> 0:42:42.445 And therefore, if you want to achieve that, it's of course no longer exactly what was 0:42:42.445 --> 0:42:50.206 happening, but if it should look more like a professional video, then you would have to do that and cut 0:42:50.206 --> 0:42:50.998 that out. 0:42:50.998 --> 0:42:53.532 But yeah, there are definitely. 0:42:55.996 --> 0:42:59.008 We're also going to do this thing again. 0:42:59.008 --> 0:43:02.315 First turn is like I'm going to have a very. 0:43:02.422 --> 0:43:07.449 Which in the end they start to slow down just without feeling as though they're. 0:43:07.407 --> 0:43:10.212 It's a good point for the next. 0:43:10.212 --> 0:43:13.631 There is not the one perfect solution.
0:43:13.631 --> 0:43:20.732 There's some work on disfluency removal, but of course disfluency 0:43:20.732 --> 0:43:27.394 removal is not that easy, so do you just remove that's in order everywhere. 0:43:27.607 --> 0:43:29.708 But how much cleaning do you do? 0:43:29.708 --> 0:43:31.366 It's more a continuous thing. 0:43:31.811 --> 0:43:38.211 Do you really only remove stuff, or are you also rephrasing? Here it is only 0:43:38.211 --> 0:43:38.930 removing. 0:43:39.279 --> 0:43:41.664 But maybe you want to rephrase it. 0:43:41.664 --> 0:43:43.231 That's sounding better. 0:43:43.503 --> 0:43:49.185 So then it's going into what people are doing in style transfer. 0:43:49.185 --> 0:43:52.419 We are going from a speech style to. 0:43:52.872 --> 0:44:07.632 So there is more of a continuum, and of course there is not the perfect solution, 0:44:07.632 --> 0:44:10.722 but exactly what. 0:44:15.615 --> 0:44:19.005 Yeah, it's challenging. 0:44:19.005 --> 0:44:30.258 You have examples where it's a direct copy, which is not as hard, or where it's not exactly the same. 0:44:30.258 --> 0:44:35.410 That is, of course, more challenging. 0:44:41.861 --> 0:44:49.889 I mean, why it's so challenging: if it's really spontaneous, even for the speaker, 0:44:49.889 --> 0:44:55.634 you need maybe even the video to really get that, and at least the audio. 0:45:01.841 --> 0:45:06.025 Yeah, it also depends on 0:45:06.626 --> 0:45:15.253 the purpose, of course, and a very important thing is that the easiest task is just removing. 0:45:15.675 --> 0:45:25.841 Of course you have to be very careful, because if you don't remove some of them, it's normally 0:45:25.841 --> 0:45:26.958 not much. 0:45:27.227 --> 0:45:33.176 But if you remove too much, of course, that's very, very bad because you're losing important. 0:45:33.653 --> 0:45:46.176 And this might be even more challenging if you think about rare and unseen words.
0:45:46.226 --> 0:45:56.532 So when doing this removal, it's important to be careful and normally more conservative. 0:46:03.083 --> 0:46:15.096 Of course, you also have to see again that you're doing that now in a two-step approach, not 0:46:15.096 --> 0:46:17.076 end to end. 0:46:17.076 --> 0:46:20.772 So first you need a removal. 0:46:21.501 --> 0:46:30.230 But you have to somehow think of it in the whole pipeline. 0:46:30.230 --> 0:46:36.932 If you learn on text to remove disfluencies, 0:46:36.796 --> 0:46:44.070 it might be that the ASR system is outputting something else, or that it's more of an ASR 0:46:44.070 --> 0:46:44.623 error. 0:46:44.864 --> 0:46:46.756 So um. 0:46:46.506 --> 0:46:52.248 Just for example, if you do it based on language modeling scores, it might be that you just get 0:46:52.248 --> 0:46:57.568 that language modeling score because the ASR has made some errors, so you really have to see 0:46:57.568 --> 0:46:59.079 the combination of that. 0:46:59.419 --> 0:47:04.285 And for example, we had like partial words. 0:47:04.285 --> 0:47:06.496 They are like some. 0:47:06.496 --> 0:47:08.819 We didn't have that. 0:47:08.908 --> 0:47:18.248 So these disfluencies can be that you stop in the middle of the word and then you switch 0:47:18.248 --> 0:47:19.182 because. 0:47:19.499 --> 0:47:23.214 And of course, in text, in a perfect transcript, that's very easy to recognize. 0:47:23.214 --> 0:47:24.372 That's not a real word. 0:47:24.904 --> 0:47:37.198 However, when you really do it with an ASR system, it will normally detect some type of word because 0:47:37.198 --> 0:47:40.747 it can only output words. 0:47:50.050 --> 0:48:03.450 Example: We should think so if you have this in the transcript it's easy to detect as a 0:48:03.450 --> 0:48:05.277 disfluency. 0:48:05.986 --> 0:48:11.619 And then, of course, it's more challenging in a real world example where you have.
0:48:12.492 --> 0:48:29.840 Now to the approaches: one thing is to really put it in between, so you put your ASR system. 0:48:31.391 --> 0:48:45.139 So your task is like this: you have this text as input and the output is this text. 0:48:45.565 --> 0:48:49.605 There are different formulations of that. 0:48:49.605 --> 0:48:54.533 You might not be able to do everything like that. 0:48:55.195 --> 0:49:10.852 Or do you also allow, for example, rephrasing or reordering, so in text you might have the 0:49:10.852 --> 0:49:13.605 word correctly. 0:49:13.513 --> 0:49:24.201 But the easiest thing is that you only do removing, so some things can be removed. 0:49:29.049 --> 0:49:34.508 Any ideas how to do that? This is the output. 0:49:34.508 --> 0:49:41.034 You have training data, so we have training data. 0:49:47.507 --> 0:49:55.869 To put in with the spoon you can eat it even after it is out, but after the machine has. 0:50:00.000 --> 0:50:05.511 Was wearing rocks, so you have not just the shoes you remove but wearing them as input, 0:50:05.511 --> 0:50:07.578 as disfluent text and as output. 0:50:07.578 --> 0:50:09.207 It should be fluent text. 0:50:09.207 --> 0:50:15.219 It can be before or after recycling as you said, but you have this type of task, so technically 0:50:15.219 --> 0:50:20.042 how would you address this type of task when you have to solve this type of. 0:50:24.364 --> 0:50:26.181 That's exactly so. 0:50:26.181 --> 0:50:28.859 That's one way of doing it. 0:50:28.859 --> 0:50:33.068 It's a translation task and you train your. 0:50:33.913 --> 0:50:34.683 Can do. 0:50:34.683 --> 0:50:42.865 Then, of course, a bit of a challenge is that you automatically allow rephrasing 0:50:42.865 --> 0:50:43.539 stuff. 0:50:43.943 --> 0:50:52.240 Which on the one hand is good, so you have more opportunities, but it might also be a bad thing, 0:50:52.240 --> 0:50:58.307 because if you have more opportunities you have more opportunities.
0:51:01.041 --> 0:51:08.300 If you want to prevent that, you can also do a simpler labeling, so for each word your 0:51:08.300 --> 0:51:10.693 label is whether it should be removed or not. 0:51:12.132 --> 0:51:17.658 People have also looked into parsing. 0:51:17.658 --> 0:51:29.097 You remember maybe the parse trees at the beginning, like the structure, because the idea is. 0:51:29.649 --> 0:51:45.779 There are also more unsupervised approaches where you then phrase it as a style transfer 0:51:45.779 --> 0:51:46.892 task. 0:51:50.310 --> 0:51:58.601 As the last point, since we had that: yes, it has also been done in an end-to-end fashion, 0:51:58.601 --> 0:52:06.519 so that you really have as input the audio signal and as output you then have the 0:52:06.446 --> 0:52:10.750 text without disfluencies, a clean text. 0:52:11.131 --> 0:52:19.069 You model everything in one, which of course has a big advantage. 0:52:19.069 --> 0:52:25.704 You can use these paralinguistic features, pauses, and. 0:52:25.705 --> 0:52:34.091 If you switch, so you start something, then "oh, it doesn't work", and continue differently, so. 0:52:34.374 --> 0:52:42.689 So you can easily use that in an end-to-end fashion, while in a cascaded approach. 0:52:42.689 --> 0:52:47.497 As we saw there you have text input. 0:52:49.990 --> 0:53:02.389 But on the other hand we have again, and in a more extreme form, the problem from before with end-to-end. 0:53:02.389 --> 0:53:06.957 Of course there is even less data. 0:53:11.611 --> 0:53:12.837 Good. 0:53:12.837 --> 0:53:30.814 This is all about the input to a very more person, or maybe if you think about YouTube. 0:53:32.752 --> 0:53:34.989 Talk so this could use be very exciting. 0:53:36.296 --> 0:53:42.016 It is more viewed as style transfer. 0:53:42.016 --> 0:53:53.147 You can use ideas from machine translation where you have one language. 0:53:53.713 --> 0:53:57.193 So there are ways of trying to do this type of style transfer.
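The word-level labeling formulation mentioned above (one keep/remove decision per word, so the model cannot rephrase) can be sketched as follows. This only shows how training labels could be derived from a disfluent/fluent pair and how predicted labels are applied. The tagger itself is left out, and the function names and the example sentence are assumptions for illustration.

```python
from typing import List

KEEP, DELETE = "K", "D"


def labels_from_pair(disfluent: List[str], fluent: List[str]) -> List[str]:
    """Derive one label per disfluent word, assuming the fluent text is
    obtained purely by deleting words (no rephrasing, no reordering)."""
    labels = []
    j = 0  # position in the fluent reference
    for word in disfluent:
        if j < len(fluent) and word == fluent[j]:
            labels.append(KEEP)
            j += 1
        else:
            labels.append(DELETE)
    if j != len(fluent):
        raise ValueError("fluent text is not a subsequence of the disfluent text")
    return labels


def apply_labels(words: List[str], labels: List[str]) -> List[str]:
    """Produce the cleaned text by dropping every word labeled DELETE."""
    return [w for w, label in zip(words, labels) if label == KEEP]


if __name__ == "__main__":
    noisy = "I want a ticket to Denver uh to Houston".split()
    clean = "I want a ticket to Houston".split()
    tags = labels_from_pair(noisy, clean)
    print(list(zip(noisy, tags)))
    print(" ".join(apply_labels(noisy, tags)))
```

Compared with treating the task as translation, the output here is always a subsequence of the input, which is exactly the restriction that prevents unwanted rephrasing.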
0:53:57.637 --> 0:54:02.478 I think it's definitely also very promising to make it more and more fluent in a business. 0:54:03.223 --> 0:54:17.974 Because one major issue about all the previous ones is that you need training data and then 0:54:17.974 --> 0:54:21.021 you need training. 0:54:21.381 --> 0:54:32.966 So I mean, I think that we only really have data for English. 0:54:32.966 --> 0:54:39.453 Maybe there is a little data in German. 0:54:42.382 --> 0:54:49.722 Okay, then let's talk about low latency speech. 0:54:50.270 --> 0:55:05.158 So the idea is, if we are doing live translation of a talk, we want to start out. 0:55:05.325 --> 0:55:23.010 This is possible because there is typically some kind of monotonicity in many languages. 0:55:24.504 --> 0:55:29.765 And this is also what, for example, human interpreters are doing to have a really low 0:55:29.765 --> 0:55:30.071 lag. 0:55:30.750 --> 0:55:34.393 They are even going further. 0:55:34.393 --> 0:55:40.926 They guess what will be the ending of the sentence. 0:55:41.421 --> 0:55:51.120 Then they can already continue, although it's not said yet. It might be needed, but that is even 0:55:51.120 --> 0:55:53.039 more challenging. 0:55:54.714 --> 0:55:58.014 Why is it so difficult? 0:55:58.014 --> 0:56:09.837 There is this trade-off: on the one hand you want to have more context, because 0:56:09.837 --> 0:56:14.511 we learned that it helps if we have more context. 0:56:15.015 --> 0:56:24.033 And therefore, to have more context you have to wait as long as possible. 0:56:24.033 --> 0:56:27.689 The best is to have the full. 0:56:28.168 --> 0:56:35.244 On the other hand, you want to have a low latency for the user, so you want to generate as 0:56:35.244 --> 0:56:35.737 soon. 0:56:36.356 --> 0:56:47.149 So if you're doing low latency translation you have to find the best way to start in order to have 0:56:47.149 --> 0:56:48.130 a good. 0:56:48.728 --> 0:56:52.296 There's no longer the perfect solution.
0:56:52.296 --> 0:56:56.845 People will also evaluate what is the translation. 0:56:57.657 --> 0:57:09.942 Why it's challenging in German to English: German has this very nice thing where the prefix 0:57:09.942 --> 0:57:16.607 of the verb can be put at the end of the sentence. 0:57:17.137 --> 0:57:24.201 And you only know if the person registers or cancels his registration at the end of the sentence. 0:57:24.985 --> 0:57:33.690 So if you want to start the translation in English you need to know at this point is the. 0:57:35.275 --> 0:57:39.993 So you would have to wait until the end of the sentence. 0:57:39.993 --> 0:57:42.931 That's not really what you want. 0:57:43.843 --> 0:57:45.795 What happened. 0:57:47.207 --> 0:58:12.550 Other situations like that have been motivating how we can do that: subject-object-verb 0:58:12.550 --> 0:58:15.957 or subject-verb-object. 0:58:16.496 --> 0:58:24.582 In German it's not always subject, but there are relative clauses where you have that, 0:58:24.582 --> 0:58:25.777 so it needs. 0:58:28.808 --> 0:58:41.858 How we can do that is, we'll look today into three ways of doing that. 0:58:41.858 --> 0:58:46.269 The one is to mitigate. 0:58:46.766 --> 0:58:54.824 And then the other idea is to do retranslation, and there you can now use the text output. 0:58:54.934 --> 0:59:02.302 So the idea is you translate, and if you later notice it was wrong then you can retranslate 0:59:02.302 --> 0:59:03.343 and correct. 0:59:03.803 --> 0:59:14.383 Or you can do what is called stream decoding, so you can generically. 0:59:17.237 --> 0:59:30.382 Let's start with the optimization, so if you have a sentence, it may reach a conference, 0:59:30.382 --> 0:59:33.040 and in this time. 0:59:32.993 --> 0:59:39.592 So you have a good translation quality while still having low latency. 0:59:39.699 --> 0:59:50.513 You have an extra model which does your segmentation before, but your aim is not to have a segmentation.
0:59:50.470 --> 0:59:53.624 But you can somehow measure in training data. 0:59:53.624 --> 0:59:59.863 If I do these types of segment lengths, that's my latency and that's my translation quality, 0:59:59.863 --> 1:00:02.811 and then you can try to search a good way. 1:00:03.443 --> 1:00:20.188 If you're doing that one, it's an extra component, so you can use your system as it was. 1:00:22.002 --> 1:00:28.373 The other idea is to directly output the first hypothesis always, so always when you have 1:00:28.373 --> 1:00:34.201 text or audio we translate, and if we then have more context available we can update. 1:00:35.015 --> 1:00:50.195 So imagine the example from before: if you get an "I register" and then the sentence continues, then. 1:00:50.670 --> 1:00:54.298 So you change the output. 1:00:54.298 --> 1:01:07.414 Of course, that might also lead to a bad user experience if you always flicker and change 1:01:07.414 --> 1:01:09.228 your output. 1:01:09.669 --> 1:01:15.329 It's a bit like human interpreters, who are also able to correct, so they're doing a more long text. 1:01:15.329 --> 1:01:20.867 If they are guessing how the speaker will continue and then he's saying something different, they 1:01:20.867 --> 1:01:22.510 also have to correct themselves. 1:01:22.510 --> 1:01:26.831 So here, since it's not audio, we can even change what we have said. 1:01:26.831 --> 1:01:29.630 Yes, that's exactly what we have implemented. 1:01:31.431 --> 1:01:49.217 So how that works is, we are aware, and then we translate it, and if we get more input like 1:01:49.217 --> 1:01:51.344 you, then. 1:01:51.711 --> 1:02:00.223 And so we can always continue to do that and improve the transcript that we have. 1:02:00.480 --> 1:02:07.729 So in the end we have the lowest possible latency because we always output what is possible.
1:02:07.729 --> 1:02:14.784 On the other hand, this introduces a bit of a new problem. There's another challenge: when 1:02:14.784 --> 1:02:20.061 this one was first used, it was with older systems, and there it worked fine. 1:02:20.061 --> 1:02:21.380 You switch to NMT. 1:02:21.380 --> 1:02:25.615 You saw one problem, that it is generating even more flickering. 1:02:25.615 --> 1:02:28.878 The problem is the normal machine translation. 1:02:29.669 --> 1:02:35.414 So it implicitly learns that the output always ends with a dot, and it's always a full sentence. 1:02:36.696 --> 1:02:42.466 And this was even more important somewhere in the model than really what is in the input. 1:02:42.983 --> 1:02:55.910 So if you give it a partial sentence, it will still generate a full sentence. 1:02:55.910 --> 1:02:58.201 So encourage. 1:02:58.298 --> 1:03:05.821 It's like trying to just continue it somehow to a full sentence, and if it's guessing 1:03:05.821 --> 1:03:10.555 stuff then you even have more changes. 1:03:10.890 --> 1:03:23.944 So here we have a train-test mismatch, and that's maybe a more general important thing, that the 1:03:23.944 --> 1:03:28.910 model might learn something a bit different. 1:03:29.289 --> 1:03:32.636 It's always ending with a dot, so you don't just guess something in general. 1:03:33.053 --> 1:03:35.415 So we have a train-test mismatch. 1:03:38.918 --> 1:03:41.248 And we have a train-test mismatch. 1:03:41.248 --> 1:03:43.708 What is the best way to address that? 1:03:46.526 --> 1:03:51.934 That's exactly right, so we have to also train on that. 1:03:52.692 --> 1:03:55.503 The problem is for partial sentences. 1:03:55.503 --> 1:03:59.611 There's not training data, so it's hard to find all our. 1:04:00.580 --> 1:04:06.531 However, it's quite easy to generate artificial partial sentences, at least for the source. 1:04:06.926 --> 1:04:15.367 So you just take all the prefixes of the source data.
1:04:17.017 --> 1:04:22.794 The problem, of course, is the target side. 1:04:22.794 --> 1:04:30.845 If you have a prefix like "I encourage all of", what should be the right target for that? 1:04:31.491 --> 1:04:45.381 And the constraints: on the one hand, it should be as long as possible, so you don't always have 1:04:45.381 --> 1:04:47.541 a long delay. 1:04:47.687 --> 1:04:55.556 On the other hand, it should also be a superset of the previous ones, and it should not be 1:04:55.556 --> 1:04:57.304 inventing too much. 1:04:58.758 --> 1:05:02.170 A very easy solution works fine. 1:05:02.170 --> 1:05:05.478 You can just do it length-based. 1:05:05.478 --> 1:05:09.612 If you have two thirds of the source, you also take two thirds of the target. 1:05:10.070 --> 1:05:19.626 It is then learning implicitly to guess a bit, if you think about the example from the beginning. 1:05:20.000 --> 1:05:30.287 This one, if you do, say, half, in this case the target would be "I register". 1:05:30.510 --> 1:05:39.289 So you're doing a bit of implicit guessing, and if it's getting it wrong you have rewriting, 1:05:39.289 --> 1:05:43.581 but you're doing a good amount of guessing. 1:05:49.849 --> 1:05:53.950 In addition, this would be like how it looks like if it was like. 1:05:53.950 --> 1:05:58.300 If it wasn't a housing game, then the target could be something like. 1:05:58.979 --> 1:06:02.513 One problem is, if you just do it this way, 1:06:02.513 --> 1:06:04.619 it's most of your training. 1:06:05.245 --> 1:06:11.983 And in the end you're interested in the overall translation quality, so for full sentences. 1:06:11.983 --> 1:06:19.017 So if you train on that, it will mainly learn how to translate prefixes because ninety percent 1:06:19.017 --> 1:06:21.535 or more of your data is prefixes. 1:06:22.202 --> 1:06:31.636 That's why we'll see that it's better to do like a ratio. 1:06:31.636 --> 1:06:39.281 So half your training data are full sentences.
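The length-based construction of partial-sentence training data described above can be sketched like this. The German/English sentence pair, the function names, the choice to skip empty targets, and the sampling of exactly one prefix per sentence (to get roughly half full sentences) are assumptions for illustration, not the exact recipe used in the lecture's experiments.

```python
import random
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]


def prefix_pairs(source: List[str], target: List[str]) -> List[Pair]:
    """Pair every proper source prefix with a target prefix of the same
    relative length (e.g. half of the source -> half of the target)."""
    pairs = []
    for i in range(1, len(source)):
        tgt_len = (i * len(target)) // len(source)  # floor of the length ratio
        if tgt_len == 0:
            continue  # assumption: skip prefixes that would get an empty target
        pairs.append((source[:i], target[:tgt_len]))
    return pairs


def build_training_set(corpus: List[Pair], seed: int = 0) -> List[Pair]:
    """Mix full sentences and prefixes roughly 1:1 by sampling one prefix
    per sentence instead of adding all of them."""
    rng = random.Random(seed)
    data = []
    for src, tgt in corpus:
        data.append((src, tgt))
        candidates = prefix_pairs(src, tgt)
        if candidates:
            data.append(rng.choice(candidates))
    return data


if __name__ == "__main__":
    src = "ich melde mich zur Konferenz an".split()
    tgt = "I register for the conference".split()
    for s, t in prefix_pairs(src, tgt):
        print(" ".join(s), "->", " ".join(t))
```

Adding all prefixes instead of sampling one would make prefixes dominate the data, which is the imbalance the lecture warns about.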
1:06:39.759 --> 1:06:47.693 Because if you're doing this, you get one prefix for every word and only one full sentence. 1:06:48.048 --> 1:06:52.252 You also see that nicely here, here are both. 1:06:52.252 --> 1:06:56.549 These are the BLEU scores and you see the base. 1:06:58.518 --> 1:06:59.618 Is this one? 1:06:59.618 --> 1:07:03.343 It has a good quality because it's trained. 1:07:03.343 --> 1:07:11.385 If you now train with all the partial sentences, it is focusing more on how to translate partial 1:07:11.385 --> 1:07:12.316 sentences. 1:07:12.752 --> 1:07:17.840 Because all the partial sentences will at some point be removed, because at the end you 1:07:17.840 --> 1:07:18.996 translate the full. 1:07:20.520 --> 1:07:24.079 Then there's multi-task training, where you have the same performance. 1:07:24.504 --> 1:07:26.938 On the other hand, you see here the other problem. 1:07:26.938 --> 1:07:28.656 This is how many words got updated. 1:07:29.009 --> 1:07:31.579 You want to have as few updates as possible. 1:07:31.579 --> 1:07:34.891 Updates mean you need to remove things which have already been shown. 1:07:35.255 --> 1:07:40.538 This is quite high for the baseline. 1:07:40.538 --> 1:07:50.533 If you now train on the partials, that is going down. 1:07:51.151 --> 1:07:58.648 And then for multi-task you have a bit like the best of both. 1:08:02.722 --> 1:08:05.296 Any more questions to this type of. 1:08:09.309 --> 1:08:20.760 The last thing is that you want to do stream decoding. 1:08:21.541 --> 1:08:23.345 Again, it's a bit a question of the application 1:08:23.345 --> 1:08:25.323 scenario, what you really want. 1:08:25.323 --> 1:08:30.211 As you said, we sometimes use this updating, and for text output it'd be very nice. 1:08:30.211 --> 1:08:35.273 But imagine if you want audio output, of course you can't change it anymore because 1:08:35.273 --> 1:08:37.891 on one side you cannot change what was said.
1:08:37.891 --> 1:08:40.858 So in this case you need more of a fixed output. 1:08:41.121 --> 1:08:47.440 And then this style of stream decoding is interesting. 1:08:47.440 --> 1:08:55.631 Where you, for example, get source tokens in. 1:08:55.631 --> 1:09:00.897 Then you decide: oh, now it's better to wait. 1:09:01.041 --> 1:09:14.643 So you somehow need to have this type of additional information. 1:09:15.295 --> 1:09:23.074 Here you have to decide: should I now output a token, or should I wait for more input? 1:09:26.546 --> 1:09:32.649 So you have to do these additional labels like wait, wait, output, output, wait and so 1:09:32.649 --> 1:09:32.920 on. 1:09:33.453 --> 1:09:38.481 There are different ways of doing that. 1:09:38.481 --> 1:09:45.771 You can have an additional model that does this decision. 1:09:46.166 --> 1:09:53.669 And then have a higher quality or better to continue and then have a lower latency in this 1:09:53.669 --> 1:09:54.576 different. 1:09:55.215 --> 1:09:59.241 Surprisingly, a very easy approach also works, sometimes quite well. 1:10:03.043 --> 1:10:10.981 And that is the so-called wait-k policy, and the idea is there, at least for text to 1:10:10.981 --> 1:10:14.623 text translation that is working well. 1:10:14.623 --> 1:10:22.375 It's like you wait for k words and then you always output one, so one for each. 1:10:22.682 --> 1:10:28.908 So you wait k words at the beginning of the sentence, and every time a new word 1:10:28.908 --> 1:10:29.981 is coming you. 1:10:31.091 --> 1:10:39.459 So you have the same speed as the input, so you're not lagging more and more, but you 1:10:39.459 --> 1:10:41.456 have enough context.
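The wait-k policy described above fits in a few lines. The sketch below only computes the schedule of decisions (read one more source token, or write one target token); the translation model is left out, and the function name and the fixed target length are assumptions for illustration.

```python
from typing import List


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of a wait-k policy.

    The policy first reads k source tokens, then alternates: one target
    token is written for every further source token that arrives. Once the
    source is exhausted, the remaining target tokens are written in a row.
    """
    actions = []
    read = written = 0
    while written < num_target:
        # Stay k tokens ahead of the output as long as source is available.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


if __name__ == "__main__":
    print(wait_k_schedule(num_source=4, num_target=4, k=2))
    # ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

The schedule makes the limitation visible: the lag is always k tokens, so a reordering that spans more than k tokens (such as a verb prefix at the end of a long German sentence) cannot be handled, and very different source and target lengths break the one-in, one-out assumption.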
1:10:43.103 --> 1:10:49.283 Of course this, for example for the German example from before, will not solve it perfectly, but if you have 1:10:49.283 --> 1:10:55.395 a bit of local reordering inside your k tokens, that you can manage very well, and then it's 1:10:55.395 --> 1:10:57.687 a very simple solution but it's. 1:10:57.877 --> 1:11:00.481 The other one was dynamic. 1:11:00.481 --> 1:11:06.943 Depending on the context you can decide how long you want to wait. 1:11:07.687 --> 1:11:21.506 It also only works if you have a similar amount of tokens, so if your target is very short 1:11:21.506 --> 1:11:22.113 of. 1:11:22.722 --> 1:11:28.791 That's why it's also more challenging for audio input because the speaking rate is changing 1:11:28.791 --> 1:11:29.517 and so on. 1:11:29.517 --> 1:11:35.586 You would have to do something like: I'll output a word for every second of audio or something 1:11:35.586 --> 1:11:35.981 like. 1:11:36.636 --> 1:11:45.459 The problem is that the speaking speed in the audio is not fixed but varies quite a bit, and therefore. 1:11:50.170 --> 1:11:58.278 Therefore, what you can also do is use a similar solution to the one we had before with 1:11:58.278 --> 1:11:59.809 the retranslation. 1:12:00.080 --> 1:12:02.904 You remember we re-decoded all the time. 1:12:03.423 --> 1:12:12.253 And you can do something similar in this case, except that you add something in that you're 1:12:12.253 --> 1:12:16.813 saying: oh, if I re-decode, I cannot always 1:12:16.736 --> 1:12:22.065 decode as I want, but you can do this target prefix decoding, so what you can say 1:12:22.065 --> 1:12:23.883 to your translation system is this. 1:12:23.883 --> 1:12:26.829 You can easily say: generate a translation, but 1:12:27.007 --> 1:12:29.810 the translation has to start with the prefix. 1:12:31.251 --> 1:12:35.350 How can you do that? 1:12:39.839 --> 1:12:49.105 Exactly, in the decoder: normally, if you do beam search, you always select the most probable.
1:12:49.349 --> 1:12:57.867 And now you say: oh, I'm not selecting the most probable, but this one is forced, so in 1:12:57.867 --> 1:13:04.603 the first step I have to take this one, and in the second I start decoding. 1:13:04.884 --> 1:13:09.387 And then you're making sure that your translation always starts with this prefix. 1:13:10.350 --> 1:13:18.627 And then you can use your immediate retranslation, but you're no longer changing the output. 1:13:19.099 --> 1:13:31.595 How it works: you get a speech signal as input, and it is not outputting any. 1:13:32.212 --> 1:13:45.980 So then you get a translation maybe, and then you decide: yes, output. 1:13:46.766 --> 1:13:54.250 And then you're translating segments one, two, three, four, but now you say: generate 1:13:54.250 --> 1:13:55.483 only outputs. 1:13:55.935 --> 1:14:07.163 And then you're translating and maybe you're deciding on and now a good translation. 1:14:07.163 --> 1:14:08.880 Then you're. 1:14:09.749 --> 1:14:29.984 Yes, but don't get to worry about what the effect is. 1:14:30.050 --> 1:14:31.842 We're generating your target text. 1:14:32.892 --> 1:14:36.930 But we're not always outputting the full target text now. 1:14:36.930 --> 1:14:43.729 What we have here is some strategy to decide: oh, is the system already sure enough 1:14:43.729 --> 1:14:44.437 about it? 1:14:44.437 --> 1:14:49.395 If it's sure enough and it has all the information, we can output it. 1:14:49.395 --> 1:14:50.741 And then the next. 1:14:51.291 --> 1:14:55.931 If we say here it's better not to output yet, we won't output it already. 1:14:57.777 --> 1:15:06.369 And thereby the hope is that in the example the model should not yet output "register", because it 1:15:06.369 --> 1:15:10.568 doesn't know yet if that's the case or not. 1:15:13.193 --> 1:15:18.056 So what we have to discuss is what is a good output strategy. 1:15:18.658 --> 1:15:20.070 So you could do.
1:15:20.070 --> 1:15:23.806 The output strategy could be something like. 1:15:23.743 --> 1:15:39.871 If you think of wait-k, this is an output strategy here that you always input. 1:15:40.220 --> 1:15:44.990 Good, and you can view your wait-k in a similar way as. 1:15:45.265 --> 1:15:55.194 But now, of course, we can also look at other output strategies where it's more generic and 1:15:55.194 --> 1:15:59.727 it's deciding whether in some situations. 1:16:01.121 --> 1:16:12.739 And one thing that works quite well is referred to as local agreement, and that means you're 1:16:12.739 --> 1:16:13.738 always. 1:16:14.234 --> 1:16:26.978 Then you're looking at what is the same between my current translation and the one 1:16:26.978 --> 1:16:28.756 I did before. 1:16:29.349 --> 1:16:31.201 So let's do that again in six hours. 1:16:31.891 --> 1:16:45.900 So your input is a first audio segment and your target text is "all model trains". 1:16:46.346 --> 1:16:53.231 Then you're getting segments one and two, and this time the output is "all models". 1:16:54.694 --> 1:17:08.407 You see "trains" is different, but both of them agree that it's "all", so in those cases. 1:17:09.209 --> 1:17:13.806 So we can hopefully be a bit sure that it really starts with "all". 1:17:15.155 --> 1:17:22.604 So now we say we output "all"; so at this time step we'll output "all", although before. 1:17:23.543 --> 1:17:27.422 We are getting one, two, three as input. 1:17:27.422 --> 1:17:35.747 This time we have a prefix, so now we are only allowing translations that start with "all". 1:17:35.747 --> 1:17:42.937 We cannot change that anymore, so we now need to generate some translation. 1:17:43.363 --> 1:17:46.323 And then it can be that it's now "all models are wrong". 1:17:47.927 --> 1:18:01.908 Then we compare here and see this agrees on "all models", so we can output "all models". 1:18:02.882 --> 1:18:07.356 So this way we can decide dynamically: is the model very unsure?
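The local agreement strategy walked through above can be simulated without a real model. In the sketch below the hypotheses are given as a list, one per incoming audio chunk, modelled on the lecture's example. In a real system each hypothesis would come from re-decoding the longer audio with the already committed words forced as target prefix; that decoder call, the function names, and the exact hypotheses are assumptions for illustration.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared prefix of two token sequences."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def local_agreement(hypotheses: List[List[str]]) -> List[List[str]]:
    """Return the committed output after each chunk.

    A word is committed once the current and the previous hypothesis agree
    on it. Committed words are never retracted, which assumes that every
    new hypothesis starts with the committed prefix (prefix-forced decoding).
    """
    committed: List[str] = []
    previous: List[str] = []
    history = []
    for hyp in hypotheses:
        agreed = common_prefix(previous, hyp)
        if len(agreed) > len(committed):
            committed = agreed
        previous = hyp
        history.append(list(committed))
    return history


if __name__ == "__main__":
    steps = [
        "all model trains".split(),      # chunk 1
        "all models".split(),            # chunks 1-2
        "all models are wrong".split(),  # chunks 1-3, decoded with prefix "all"
    ]
    for i, out in enumerate(local_agreement(steps), start=1):
        print(f"after chunk {i}: {' '.join(out) or '(nothing yet)'}")
```

When the model keeps changing its mind, the shared prefix stays short and little is output, so the latency adapts to the model's stability without an explicit confidence score.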
1:18:07.356 --> 1:18:10.178 If it always comes up with something different, 1:18:11.231 --> 1:18:24.872 then we'll wait longer; if it's more the same thing, we hope we don't need to wait. 1:18:30.430 --> 1:18:40.238 Is it clear again that the signal wouldn't be able to detect? 1:18:43.203 --> 1:18:50.553 The hope is yes, because if it's not sure, of course, it would in this case have to switch 1:18:50.553 --> 1:18:51.671 all the time. 1:18:56.176 --> 1:19:01.375 So if it would be "register" the first time and "cancel" the second time and then maybe 1:19:01.375 --> 1:19:03.561 "register" again, it wouldn't output it. 1:19:03.561 --> 1:19:08.347 Of course, if it is very sure and says "register" for a long time, then it can't deal. 1:19:08.568 --> 1:19:23.410 That's why there are two parameters that you can use and which might be important, or how. 1:19:23.763 --> 1:19:27.920 So you do it like every one second, every five seconds or something like that. 1:19:28.648 --> 1:19:37.695 If you do it more often, your latency will be lower because your wait is less long, but also 1:19:37.695 --> 1:19:39.185 you might do. 1:19:40.400 --> 1:19:50.004 So that is the one thing, and the other thing is: for text you might do it for every word, but if 1:19:50.004 --> 1:19:52.779 you think about audio it. 1:19:53.493 --> 1:20:04.287 And the other thing you can vary is the agreement, so how sure the model is. 1:20:04.287 --> 1:20:10.252 If you say have to agree, then hopefully. 1:20:10.650 --> 1:20:21.369 What we saw, I think, is that there has normally been a really good performance, and otherwise your 1:20:21.369 --> 1:20:22.441 latency. 1:20:22.963 --> 1:20:42.085 Okay, we'll just make more tests and we'll get the confidence. 1:20:44.884 --> 1:20:47.596 I have to completely agree with that. 1:20:47.596 --> 1:20:53.018 So when this was done, that was our first idea, using the confidence.
1:20:53.018 --> 1:21:00.248 The problem currently, that's my assumption, is that modeling the model confidence is 1:21:00.248 --> 1:21:03.939 not that easy, and models are often overconfident. 1:21:04.324 --> 1:21:17.121 In the paper there is this type also where you try to use the confidence in some way to 1:21:17.121 --> 1:21:20.465 decide the confidence. 1:21:21.701 --> 1:21:26.825 But that gave worse results, and that's why we looked into that. 1:21:27.087 --> 1:21:38.067 So it's a very good idea, I think, but it seems not to at least how it was implemented. 1:21:38.959 --> 1:21:55.670 There is one way that maybe goes more in that direction, which is very new. 1:21:55.455 --> 1:22:02.743 If this one, the last word is attending mainly to the end of the audio. 1:22:02.942 --> 1:22:04.934 You should not output it yet. 1:22:05.485 --> 1:22:15.539 Because there might be something more missing that you need to know, so they 1:22:15.539 --> 1:22:24.678 look at the attention and only output parts which do not attend to the end of the audio signal. 1:22:25.045 --> 1:22:40.175 So there are, of course, a lot of ways how you can do it better or easier in some way. 1:22:41.901 --> 1:22:53.388 Instead, one tries to predict the next word with a large language model, and then for text translation 1:22:53.388 --> 1:22:54.911 you predict. 1:22:55.215 --> 1:23:01.177 Then you translate all of them and decide if there is a change, so you can do your decision even 1:23:01.177 --> 1:23:02.410 earlier. 1:23:02.362 --> 1:23:08.714 The idea is that if we continue and this will lead to a change in the translation, then 1:23:08.714 --> 1:23:10.320 we should have waited. 1:23:10.890 --> 1:23:18.302 So it's more doing your estimate about possible continuations of the source instead of looking 1:23:18.302 --> 1:23:19.317 at previous. 1:23:23.783 --> 1:23:31.388 How that works is shown here a bit in one example. 1:23:31.388 --> 1:23:39.641 It has a legacy baselines and you are not putting.
1:23:40.040 --> 1:23:47.041 And you see in this case you have worse BLEU scores here. 1:23:47.041 --> 1:23:51.670 For equal one you have better latency. 1:23:52.032 --> 1:24:01.123 With hold-n, does anybody have an idea of what could be challenging there or when? 1:24:05.825 --> 1:24:20.132 One problem of these models is hallucinations, and they are often very long, which has a negative impact on. 1:24:24.884 --> 1:24:30.869 If you, say, remove the last four words but your model now starts to hallucinate and invent 1:24:30.869 --> 1:24:37.438 just a lot of new stuff, then yeah, you're removing the last four words of that, but if it has invented 1:24:37.438 --> 1:24:41.406 ten words, you're still outputting six of these invented words. 1:24:41.982 --> 1:24:48.672 Typically, once it starts hallucinating and generating some output, it's quite long, so then it's 1:24:48.672 --> 1:24:50.902 no longer enough to just hold. 1:24:51.511 --> 1:24:57.695 And then it's, of course, a bit better if you compare to the previous ones. 1:24:57.695 --> 1:25:01.528 The hallucinations are typically different. 1:25:07.567 --> 1:25:25.939 Yes, so we won't talk about the details, but for the output, for the presentation, there are different 1:25:25.939 --> 1:25:27.100 ways. 1:25:27.347 --> 1:25:36.047 So you want to have a maximum of two lines, a maximum of forty-two characters per line, and the reading 1:25:36.047 --> 1:25:40.212 speed is a maximum of twenty-one characters per second. 1:25:40.981 --> 1:25:43.513 How to do that we can skip. 1:25:43.463 --> 1:25:46.804 Then you can generate something like that. 1:25:46.886 --> 1:25:53.250 Another challenge is, of course, that you not only need to generate the translation, 1:25:53.250 --> 1:25:59.614 but for subtitling you also want to generate when to put breaks and what to display. 1:25:59.619 --> 1:26:06.234 Because it cannot be full sentences, as said here, if you have like a maximum of forty-two 1:26:06.234 --> 1:26:10.443 characters per line, that's not always a full sentence.
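To make the subtitle constraints concrete (at most two lines, at most forty-two characters per line, at most twenty-one characters per second), here is a small Python sketch. It is only a naive illustration: the helper names `split_into_blocks` and `min_display_seconds` are hypothetical, the greedy wrapping ignores where a good linguistic break would be, and counting spaces as characters is an assumption.

```python
import textwrap
from typing import List

MAX_LINES = 2               # lines per subtitle block
MAX_CHARS_PER_LINE = 42     # characters per line
MAX_CHARS_PER_SECOND = 21   # reading speed


def split_into_blocks(text: str) -> List[List[str]]:
    """Greedily wrap text into lines and group them into subtitle blocks."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    return [lines[i:i + MAX_LINES] for i in range(0, len(lines), MAX_LINES)]


def min_display_seconds(block: List[str]) -> float:
    """Shortest display time that respects the reading speed limit."""
    chars = sum(len(line) for line in block)  # assumption: spaces count
    return chars / MAX_CHARS_PER_SECOND


TEXT = (
    "Another challenge is that you not only need to generate the translation "
    "but also when to put breaks and what to display."
)

for block in split_into_blocks(TEXT):
    print(block, round(min_display_seconds(block), 2))
```

A real subtitling system would rather learn where to break, since a line limit like this rarely coincides with a full sentence.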
1:26:10.443 --> 1:26:12.247 So how can you make it? 1:26:13.093 --> 1:26:16.253 And then for speech there's not even a hint of wisdom. 1:26:18.398 --> 1:26:27.711 So what we have done today is, yeah, we looked into maybe three challenges: we have this segmentation, 1:26:27.711 --> 1:26:33.013 which is a challenge both in evaluation and in the decoder. 1:26:33.013 --> 1:26:40.613 We talked about disfluencies and we talked about simultaneous translation and how to 1:26:40.613 --> 1:26:42.911 address these challenges. 1:26:43.463 --> 1:26:45.507 Any more questions? 1:26:48.408 --> 1:26:52.578 Good, then with new content 1:26:52.578 --> 1:26:58.198 we are done for this semester. 1:26:58.198 --> 1:27:04.905 You can keep your knowledge in that. 1:27:04.744 --> 1:27:09.405 Repetition where we can try to repeat a bit what we've done all over the semester. 1:27:10.010 --> 1:27:13.776 I will prepare a bit of repetition of what I think is important. 1:27:14.634 --> 1:27:21.441 But of course it is also the chance for you to ask specific questions. 1:27:21.441 --> 1:27:25.445 It's not clear to me how things relate. 1:27:25.745 --> 1:27:34.906 So if you have any specific questions, please come to me or send me an email or so, then 1:27:34.906 --> 1:27:36.038 I'm happy. 1:27:36.396 --> 1:27:46.665 If I should focus on it really in depth, it might be good not to come and send me an email 1:27:46.665 --> 1:27:49.204 on Wednesday evening.