WEBVTT 0:00:01.721 --> 0:00:05.064 Hey, and then welcome to today's lecture. 0:00:06.126 --> 0:00:13.861 What we want to do today is we will finish with what we have done last time, so we started 0:00:13.861 --> 0:00:22.192 looking at the new machine translation system, but we have had all the components of the sequence 0:00:22.192 --> 0:00:22.787 model. 0:00:22.722 --> 0:00:29.361 We're still missing is the transformer based architecture so that maybe the self attention. 0:00:29.849 --> 0:00:31.958 Then we want to look at the beginning today. 0:00:32.572 --> 0:00:39.315 And then the main part of the day's lecture will be decoding. 0:00:39.315 --> 0:00:43.992 That means we know how to train the model. 0:00:44.624 --> 0:00:47.507 So decoding sewage all they can be. 0:00:47.667 --> 0:00:53.359 Be useful that and the idea is how we find that and what challenges are there. 0:00:53.359 --> 0:00:59.051 Since it's unregressive, we will see that it's not as easy as for other tasks. 0:00:59.359 --> 0:01:08.206 While generating the translation step by step, we might make additional arrows that lead. 0:01:09.069 --> 0:01:16.464 But let's start with a self attention, so what we looked at into was an base model. 0:01:16.816 --> 0:01:27.931 And then in our based models you always take the last new state, you take your input, you 0:01:27.931 --> 0:01:31.513 generate a new hidden state. 0:01:31.513 --> 0:01:35.218 This is more like a standard. 0:01:35.675 --> 0:01:41.088 And one challenge in this is that we always store all our history in one signal hidden 0:01:41.088 --> 0:01:41.523 stick. 0:01:41.781 --> 0:01:50.235 We saw that this is a problem when going from encoder to decoder, and that is why we then 0:01:50.235 --> 0:01:58.031 introduced the attention mechanism so that we can look back and see all the parts. 0:01:59.579 --> 0:02:06.059 However, in the decoder we still have this issue so we are still storing all information 0:02:06.059 --> 0:02:12.394 in one hidden state and we might do things like here that we start to overwrite things 0:02:12.394 --> 0:02:13.486 and we forgot. 0:02:14.254 --> 0:02:23.575 So the idea is, can we do something similar which we do between encoder and decoder within 0:02:23.575 --> 0:02:24.907 the decoder? 0:02:26.526 --> 0:02:33.732 And the idea is each time we're generating here in New York State, it will not only depend 0:02:33.732 --> 0:02:40.780 on the previous one, but we will focus on the whole sequence and look at different parts 0:02:40.780 --> 0:02:46.165 as we did in attention in order to generate our new representation. 0:02:46.206 --> 0:02:53.903 So each time we generate a new representation we will look into what is important now to 0:02:53.903 --> 0:02:54.941 understand. 0:02:55.135 --> 0:03:00.558 You may want to understand what much is important. 0:03:00.558 --> 0:03:08.534 You might want to look to vary and to like so that it's much about liking. 0:03:08.808 --> 0:03:24.076 So the idea is that we are not staring everything in each time we are looking at the full sequence. 0:03:25.125 --> 0:03:35.160 And that is achieved by no longer going really secret, and the hidden states here aren't dependent 0:03:35.160 --> 0:03:37.086 on the same layer. 0:03:37.086 --> 0:03:42.864 But instead we are always looking at the previous layer. 0:03:42.942 --> 0:03:45.510 We will always have more information that we are coming. 0:03:47.147 --> 0:03:51.572 So how does this censor work in detail? 0:03:51.572 --> 0:03:56.107 So we started with our initial mistakes. 0:03:56.107 --> 0:04:08.338 So, for example: Now where we had the three terms already, the query, the key and the value, 0:04:08.338 --> 0:04:12.597 it was motivated by our database. 0:04:12.772 --> 0:04:20.746 We are comparing it to the keys to all the other values, and then we are merging the values. 0:04:21.321 --> 0:04:35.735 There was a difference between the decoder and the encoder. 0:04:35.775 --> 0:04:41.981 You can assume all the same because we are curving ourselves. 0:04:41.981 --> 0:04:49.489 However, we can make them different but just learning a linear projection. 0:04:49.529 --> 0:05:01.836 So you learn here some projection based on what need to do in order to ask which question. 0:05:02.062 --> 0:05:11.800 That is, the query and the key is to what do want to compare and provide others, and 0:05:11.800 --> 0:05:13.748 which values do. 0:05:14.014 --> 0:05:23.017 This is not like hand defined, but learn, so it's like three linear projections that 0:05:23.017 --> 0:05:26.618 you apply on all of these hidden. 0:05:26.618 --> 0:05:32.338 That is the first thing based on your initial hidden. 0:05:32.612 --> 0:05:37.249 And now you can do exactly as before, you can do the attention. 0:05:37.637 --> 0:05:40.023 How did the attention work? 0:05:40.023 --> 0:05:45.390 The first thing is we are comparing our query to all the keys. 0:05:45.445 --> 0:05:52.713 And that is now the difference before the quarry was from the decoder, the keys were 0:05:52.713 --> 0:05:54.253 from the encoder. 0:05:54.253 --> 0:06:02.547 Now it's like all from the same, so we started the first in state to the keys of all the others. 0:06:02.582 --> 0:06:06.217 We're learning some value here. 0:06:06.217 --> 0:06:12.806 How important are these information to better understand? 0:06:13.974 --> 0:06:19.103 And these are just like floating point numbers. 0:06:19.103 --> 0:06:21.668 They are normalized so. 0:06:22.762 --> 0:06:30.160 And that is the first step, so let's go first for the first curve. 0:06:30.470 --> 0:06:41.937 What we can then do is multiply each value as we have done before with the importance 0:06:41.937 --> 0:06:43.937 of each state. 0:06:45.145 --> 0:06:47.686 And then we have in here the new hit step. 0:06:48.308 --> 0:06:57.862 See now this new hidden status is depending on all the hidden state of all the sequences 0:06:57.862 --> 0:06:59.686 of the previous. 0:06:59.879 --> 0:07:01.739 One important thing. 0:07:01.739 --> 0:07:08.737 This one doesn't really depend, so the hidden states here don't depend on the. 0:07:09.029 --> 0:07:15.000 So it only depends on the hidden state of the previous layer, but it depends on all the 0:07:15.000 --> 0:07:18.664 hidden states, and that is of course a big advantage. 0:07:18.664 --> 0:07:25.111 So on the one hand information can directly flow from each hidden state before the information 0:07:25.111 --> 0:07:27.214 flow was always a bit limited. 0:07:28.828 --> 0:07:35.100 And the independence is important so we can calculate all these in the states in parallel. 0:07:35.100 --> 0:07:41.371 That's another big advantage of self attention that we can calculate all the hidden states 0:07:41.371 --> 0:07:46.815 in one layer in parallel and therefore it's the ad designed for GPUs and fast. 0:07:47.587 --> 0:07:50.235 Then we can do the same thing for the second in the state. 0:07:50.530 --> 0:08:06.866 And the only difference here is how we calculate what is occurring. 0:08:07.227 --> 0:08:15.733 Getting these values is different because we use the different query and then getting 0:08:15.733 --> 0:08:17.316 our new hidden. 0:08:18.258 --> 0:08:26.036 Yes, this is the word of words that underneath this case might, but this is simple. 0:08:26.036 --> 0:08:26.498 Not. 0:08:27.127 --> 0:08:33.359 That's a very good question that is like on the initial thing. 0:08:33.359 --> 0:08:38.503 That is exactly not one of you in the architecture. 0:08:38.503 --> 0:08:44.042 Maybe first you would think of a very big disadvantage. 0:08:44.384 --> 0:08:49.804 So this hidden state would be the same if the movie would be different. 0:08:50.650 --> 0:08:59.983 And of course this estate is a site someone should like, so if the estate would be here 0:08:59.983 --> 0:09:06.452 except for this correspondence the word order is completely. 0:09:06.706 --> 0:09:17.133 Therefore, just doing self attention wouldn't work at all because we know word order is important 0:09:17.133 --> 0:09:21.707 and there is a complete different meaning. 0:09:22.262 --> 0:09:26.277 We introduce the word position again. 0:09:26.277 --> 0:09:33.038 The main idea is if the position is already in your embeddings. 0:09:33.533 --> 0:09:39.296 Then of course the position is there and you don't lose it anymore. 0:09:39.296 --> 0:09:46.922 So mainly if your life representation here encodes at the second position and your output 0:09:46.922 --> 0:09:48.533 will be different. 0:09:49.049 --> 0:09:54.585 And that's how you encode it, but that's essential in order to get this work. 0:09:57.137 --> 0:10:08.752 But before we are coming to the next slide, one other thing that is typically done is multi-head 0:10:08.752 --> 0:10:10.069 attention. 0:10:10.430 --> 0:10:15.662 And it might be that in order to understand much, it might be good that in some way we 0:10:15.662 --> 0:10:19.872 focus on life, and in some way we can focus on vary, but not equally. 0:10:19.872 --> 0:10:25.345 But maybe it's like to understand again on different dimensions we should look into these. 0:10:25.905 --> 0:10:31.393 And therefore what we're doing is we're just doing the self attention at once, but we're 0:10:31.393 --> 0:10:35.031 doing it end times or based on your multi head attentions. 0:10:35.031 --> 0:10:41.299 So in typical examples, the number of heads people are talking about is like: So you're 0:10:41.299 --> 0:10:50.638 doing this process and have different queries and keys so you can focus. 0:10:50.790 --> 0:10:52.887 How can you generate eight different? 0:10:53.593 --> 0:11:07.595 Things it's quite easy here, so instead of having one linear projection you can have age 0:11:07.595 --> 0:11:09.326 different. 0:11:09.569 --> 0:11:13.844 And it might be that sometimes you're looking more into one thing, and sometimes you're Looking 0:11:13.844 --> 0:11:14.779 more into the other. 0:11:15.055 --> 0:11:24.751 So that's of course nice with this type of learned approach because we can automatically 0:11:24.751 --> 0:11:25.514 learn. 0:11:29.529 --> 0:11:36.629 And what you correctly said is its positional independence, so it doesn't really matter the 0:11:36.629 --> 0:11:39.176 order which should be important. 0:11:39.379 --> 0:11:47.686 So how can we do that and the idea is we are just encoding it directly into the embedding 0:11:47.686 --> 0:11:52.024 so into the starting so that a representation. 0:11:52.512 --> 0:11:55.873 How do we get that so we started with our embeddings? 0:11:55.873 --> 0:11:58.300 Just imagine this is embedding of eye. 0:11:59.259 --> 0:12:06.169 And then we are having additionally this positional encoding. 0:12:06.169 --> 0:12:10.181 In this position, encoding is just. 0:12:10.670 --> 0:12:19.564 With different wavelength, so with different lengths of your signal as you see here. 0:12:20.160 --> 0:12:37.531 And the number of functions you have is exactly the number of dimensions you have in your embedded. 0:12:38.118 --> 0:12:51.091 And what will then do is take the first one, and based on your position you multiply your 0:12:51.091 --> 0:12:51.955 word. 0:12:52.212 --> 0:13:02.518 And you see now if you put it in this position, of course it will get a different value. 0:13:03.003 --> 0:13:12.347 And thereby in each position a different function is multiplied. 0:13:12.347 --> 0:13:19.823 This is a representation for at the first position. 0:13:20.020 --> 0:13:34.922 If you have it in the input already encoded then of course the model is able to keep the 0:13:34.922 --> 0:13:38.605 position information. 0:13:38.758 --> 0:13:48.045 But your embeddings can also learn your embeddings in a way that they are optimal collaborating 0:13:48.045 --> 0:13:49.786 with these types. 0:13:51.451 --> 0:13:59.351 Is that somehow clear where he is there? 0:14:06.006 --> 0:14:13.630 Am the first position and second position? 0:14:16.576 --> 0:14:17.697 Have a long wait period. 0:14:17.697 --> 0:14:19.624 I'm not going to tell you how to turn the. 0:14:21.441 --> 0:14:26.927 Be completely issued because if you have a very short wavelength there might be quite 0:14:26.927 --> 0:14:28.011 big differences. 0:14:28.308 --> 0:14:33.577 And it might also be that then it depends, of course, like what type of world embedding 0:14:33.577 --> 0:14:34.834 you've learned like. 0:14:34.834 --> 0:14:37.588 Is the dimension where you have long changes? 0:14:37.588 --> 0:14:43.097 Is the report for your embedding or not so that's what I mean so that the model can somehow 0:14:43.097 --> 0:14:47.707 learn that by putting more information into one of the embedding dimensions? 0:14:48.128 --> 0:14:54.560 So incorporated and would assume it's learning it a bit haven't seen. 0:14:54.560 --> 0:14:57.409 Details studied how different. 0:14:58.078 --> 0:15:07.863 It's also a bit difficult because really measuring how similar or different a world isn't that 0:15:07.863 --> 0:15:08.480 easy. 0:15:08.480 --> 0:15:13.115 You can do, of course, the average distance. 0:15:14.114 --> 0:15:21.393 Them, so are the weight tags not at model two, or is there fixed weight tags that the 0:15:21.393 --> 0:15:21.986 model. 0:15:24.164 --> 0:15:30.165 To believe they are fixed and the mono learns there's a different way of doing it. 0:15:30.165 --> 0:15:32.985 The other thing you can do is you can. 0:15:33.213 --> 0:15:36.945 So you can learn the second embedding which says this is position one. 0:15:36.945 --> 0:15:38.628 This is position two and so on. 0:15:38.628 --> 0:15:42.571 Like for words you could learn fixed embeddings and then add them upwards. 0:15:42.571 --> 0:15:45.094 So then it would have the same thing it's done. 0:15:45.094 --> 0:15:46.935 There is one disadvantage of this. 0:15:46.935 --> 0:15:51.403 There is anybody an idea what could be the disadvantage of a more learned embedding. 0:15:54.955 --> 0:16:00.000 Here maybe extra play this finger and ethnic stuff that will be an art. 0:16:00.000 --> 0:16:01.751 This will be an art for. 0:16:02.502 --> 0:16:08.323 You would only be good at positions you have seen often and especially for long sequences. 0:16:08.323 --> 0:16:14.016 You might have seen the positions very rarely and then normally not performing that well 0:16:14.016 --> 0:16:17.981 while here it can better learn a more general representation. 0:16:18.298 --> 0:16:22.522 So that is another thing which we won't discuss here. 0:16:22.522 --> 0:16:25.964 Guess is what is called relative attention. 0:16:25.945 --> 0:16:32.570 And in this case you don't learn absolute positions, but in your calculation of the similarity 0:16:32.570 --> 0:16:39.194 you take again the relative distance into account and have a different similarity depending on 0:16:39.194 --> 0:16:40.449 how far they are. 0:16:40.660 --> 0:16:45.898 And then you don't need to encode it beforehand, but you would more happen within your comparison. 0:16:46.186 --> 0:16:53.471 So when you compare how similar things you print, of course also take the relative position. 0:16:55.715 --> 0:17:03.187 Because there are multiple ways to use the one, to multiply all the embedding, or to use 0:17:03.187 --> 0:17:03.607 all. 0:17:17.557 --> 0:17:21.931 The encoder can be bidirectional. 0:17:21.931 --> 0:17:30.679 We have everything from the beginning so we can have a model where. 0:17:31.111 --> 0:17:36.455 Decoder training of course has also everything available but during inference you always have 0:17:36.455 --> 0:17:41.628 only the past available so you can only look into the previous one and not into the future 0:17:41.628 --> 0:17:46.062 because if you generate word by word you don't know what it will be there in. 0:17:46.866 --> 0:17:53.180 And so we also have to consider this somehow in the attention, and until now we look more 0:17:53.180 --> 0:17:54.653 at the ecoder style. 0:17:54.653 --> 0:17:58.652 So if you look at this type of model, it's by direction. 0:17:58.652 --> 0:18:03.773 So for this hill state we are looking into the past and into the future. 0:18:04.404 --> 0:18:14.436 So the question is, can we have to do this like unidirectional so that you only look into 0:18:14.436 --> 0:18:15.551 the past? 0:18:15.551 --> 0:18:22.573 And the nice thing is, this is even easier than for our hands. 0:18:23.123 --> 0:18:29.738 So we would have different types of parameters and models because you have a forward direction. 0:18:31.211 --> 0:18:35.679 For attention, that is very simple. 0:18:35.679 --> 0:18:39.403 We are doing what is masking. 0:18:39.403 --> 0:18:45.609 If you want to have a backward model, these ones. 0:18:45.845 --> 0:18:54.355 So on the first hit stage it's been over, so it's maybe only looking at its health. 0:18:54.894 --> 0:19:05.310 By the second it looks on the second and the third, so you're always selling all values 0:19:05.310 --> 0:19:07.085 in the future. 0:19:07.507 --> 0:19:13.318 And thereby you can have with the same parameters the same model. 0:19:13.318 --> 0:19:15.783 You can have then a unique. 0:19:16.156 --> 0:19:29.895 In the decoder you do the masked self attention where you only look into the past and you don't 0:19:29.895 --> 0:19:30.753 look. 0:19:32.212 --> 0:19:36.400 Then we only have, of course, looked onto itself. 0:19:36.616 --> 0:19:50.903 So the question: How can we combine forward and decoder and then we can do a decoder and 0:19:50.903 --> 0:19:54.114 just have a second? 0:19:54.374 --> 0:20:00.286 And then we're doing the cross attention which attacks from the decoder to the anchoder. 0:20:00.540 --> 0:20:10.239 So in this time it's again that the queries is a current state of decoder, while the keys 0:20:10.239 --> 0:20:22.833 are: You can do both onto yourself to get the meaning on the target side and to get the meaning. 0:20:23.423 --> 0:20:25.928 So see then the full picture. 0:20:25.928 --> 0:20:33.026 This is now the typical picture of the transformer and where you use self attention. 0:20:33.026 --> 0:20:36.700 So what you have is have your power hidden. 0:20:37.217 --> 0:20:43.254 What you then apply is here the position they're coding: We have then doing the self attention 0:20:43.254 --> 0:20:46.734 to all the others, and this can be bi-directional. 0:20:47.707 --> 0:20:54.918 You normally do another feed forward layer just like to make things to learn additional 0:20:54.918 --> 0:20:55.574 things. 0:20:55.574 --> 0:21:02.785 You're just having also a feed forward layer which takes your heel stable and generates 0:21:02.785 --> 0:21:07.128 your heel state because we are making things deeper. 0:21:07.747 --> 0:21:15.648 Then this blue part you can stack over several times so you can have layers so that. 0:21:16.336 --> 0:21:30.256 In addition to these blue arrows, so we talked about this in R&S that if you are now back 0:21:30.256 --> 0:21:35.883 propagating your arrow from the top,. 0:21:36.436 --> 0:21:48.578 In order to prevent that we are not really learning how to transform that, but instead 0:21:48.578 --> 0:21:51.230 we have to change. 0:21:51.671 --> 0:22:00.597 You're calculating what should be changed with this one. 0:22:00.597 --> 0:22:09.365 The backwards clip each layer and the learning is just. 0:22:10.750 --> 0:22:21.632 The encoder before we go to the decoder. 0:22:21.632 --> 0:22:30.655 We have any additional questions. 0:22:31.471 --> 0:22:33.220 That's a Very Good Point. 0:22:33.553 --> 0:22:38.709 Yeah, you normally take always that at least the default architecture to only look at the 0:22:38.709 --> 0:22:38.996 top. 0:22:40.000 --> 0:22:40.388 Coder. 0:22:40.388 --> 0:22:42.383 Of course, you can do other things. 0:22:42.383 --> 0:22:45.100 We investigated, for example, the lowest layout. 0:22:45.100 --> 0:22:49.424 The decoder is looking at the lowest level of the incoder and not of the top. 0:22:49.749 --> 0:23:05.342 You can average or you can even learn theoretically that what you can also do is attending to all. 0:23:05.785 --> 0:23:11.180 Can attend to all possible layers and states. 0:23:11.180 --> 0:23:18.335 But what the default thing is is that you only have the top. 0:23:20.580 --> 0:23:31.999 The decoder when we're doing is firstly doing the same position and coding, then we're doing 0:23:31.999 --> 0:23:36.419 self attention in the decoder side. 0:23:37.837 --> 0:23:43.396 Of course here it's not important we're doing the mask self attention so that we're only 0:23:43.396 --> 0:23:45.708 attending to the past and we're not. 0:23:47.287 --> 0:24:02.698 Here you see the difference, so in this case the keys and values are from the encoder and 0:24:02.698 --> 0:24:03.554 the. 0:24:03.843 --> 0:24:12.103 You're comparing it to all the counter hidden states calculating the similarity and then 0:24:12.103 --> 0:24:13.866 you do the weight. 0:24:14.294 --> 0:24:17.236 And that is an edit to what is here. 0:24:18.418 --> 0:24:29.778 Then you have a linen layer and again this green one is sticked several times and then. 0:24:32.232 --> 0:24:36.987 Question, so each code is off. 0:24:36.987 --> 0:24:46.039 Every one of those has the last layer of thing, so in the. 0:24:46.246 --> 0:24:51.007 All with and only to the last or the top layer of the anchor. 0:24:57.197 --> 0:25:00.127 Good So That Would Be. 0:25:01.501 --> 0:25:12.513 To sequence models we have looked at attention and before we are decoding do you have any 0:25:12.513 --> 0:25:18.020 more questions to this type of architecture. 0:25:20.480 --> 0:25:30.049 Transformer was first used in machine translation, but now it's a standard thing for doing nearly 0:25:30.049 --> 0:25:32.490 any tie sequence models. 0:25:33.013 --> 0:25:35.984 Even large language models. 0:25:35.984 --> 0:25:38.531 They are a bit similar. 0:25:38.531 --> 0:25:45.111 They are just throwing away the anchor and cross the tension. 0:25:45.505 --> 0:25:59.329 And that is maybe interesting that it's important to have this attention because you cannot store 0:25:59.329 --> 0:26:01.021 everything. 0:26:01.361 --> 0:26:05.357 The interesting thing with the attention is now we can attend to everything. 0:26:05.745 --> 0:26:13.403 So you can again go back to your initial model and have just a simple sequence model and then 0:26:13.403 --> 0:26:14.055 target. 0:26:14.694 --> 0:26:24.277 There would be a more language model style or people call it Decoder Only model where 0:26:24.277 --> 0:26:26.617 you throw this away. 0:26:27.247 --> 0:26:30.327 The nice thing is because of your self attention. 0:26:30.327 --> 0:26:34.208 You have the original problem why you introduce the attention. 0:26:34.208 --> 0:26:39.691 You don't have that anymore because it's not everything is summarized, but each time you 0:26:39.691 --> 0:26:44.866 generate, you're looking back at all the previous words, the source and the target. 0:26:45.805 --> 0:26:51.734 And there is a lot of work on is a really important to have encoded a decoded model or 0:26:51.734 --> 0:26:54.800 is a decoded only model as good if you have. 0:26:54.800 --> 0:27:00.048 But the comparison is not that easy because how many parameters do you have? 0:27:00.360 --> 0:27:08.832 So think the general idea at the moment is, at least for machine translation, it's normally 0:27:08.832 --> 0:27:17.765 a bit better to have an encoded decoder model and not a decoder model where you just concatenate 0:27:17.765 --> 0:27:20.252 the source and the target. 0:27:21.581 --> 0:27:24.073 But there is not really a big difference anymore. 0:27:24.244 --> 0:27:29.891 Because this big issue, which we had initially with it that everything is stored in the working 0:27:29.891 --> 0:27:31.009 state, is nothing. 0:27:31.211 --> 0:27:45.046 Of course, the advantage maybe here is that you give it a bias at your same language information. 0:27:45.285 --> 0:27:53.702 While in an encoder only model this all is merged into one thing and sometimes it is good 0:27:53.702 --> 0:28:02.120 to give models a bit of bias okay you should maybe treat things separately and you should 0:28:02.120 --> 0:28:03.617 look different. 0:28:04.144 --> 0:28:11.612 And of course one other difference, one other disadvantage, maybe of an encoder owning one. 0:28:16.396 --> 0:28:19.634 You think about the suicide sentence and how it's treated. 0:28:21.061 --> 0:28:33.787 Architecture: Anchorer can both be in the sentence for every state and cause a little 0:28:33.787 --> 0:28:35.563 difference. 0:28:35.475 --> 0:28:43.178 If you only have a decoder that has to be unidirectional because for the decoder side 0:28:43.178 --> 0:28:51.239 for the generation you need it and so your input is read state by state so you don't have 0:28:51.239 --> 0:28:54.463 positional bidirection information. 0:28:56.596 --> 0:29:05.551 Again, it receives a sequence of embeddings with position encoding. 0:29:05.551 --> 0:29:11.082 The piece is like long vector has output. 0:29:11.031 --> 0:29:17.148 Don't understand how you can set footworks to this part of each other through inputs. 0:29:17.097 --> 0:29:20.060 Other than cola is the same as the food consume. 0:29:21.681 --> 0:29:27.438 Okay, it's very good bye, so this one hand coding is only done on the top layer. 0:29:27.727 --> 0:29:32.012 So this green one is only repeated. 0:29:32.012 --> 0:29:38.558 You have the word embedding or the position embedding. 0:29:38.558 --> 0:29:42.961 You have one layer of decoder which. 0:29:43.283 --> 0:29:48.245 Then you stick in the second one, the third one, the fourth one, and then on the top. 0:29:48.208 --> 0:29:55.188 Layer: You put this projection layer which takes a one thousand dimensional backtalk and 0:29:55.188 --> 0:30:02.089 generates based on your vocabulary maybe in ten thousand soft max layer which gives you 0:30:02.089 --> 0:30:04.442 the probability of all words. 0:30:06.066 --> 0:30:22.369 It's a very good part part of the mass tape ladies, but it wouldn't be for the X-rays. 0:30:22.262 --> 0:30:27.015 Aquarium filters to be like monsoon roding as they get by the river. 0:30:27.647 --> 0:30:33.140 Yes, there is work on that think we will discuss that in the pre-trained models. 0:30:33.493 --> 0:30:39.756 It's called where you exactly do that. 0:30:39.756 --> 0:30:48.588 If you have more metric side, it's like diagonal here. 0:30:48.708 --> 0:30:53.018 And it's a full metric, so here everybody's attending to each position. 0:30:53.018 --> 0:30:54.694 Here you're only attending. 0:30:54.975 --> 0:31:05.744 Then you can do the previous one where this one is decoded, not everything but everything. 0:31:06.166 --> 0:31:13.961 So you have a bit more that is possible, and we'll have that in the lecture on pre-train 0:31:13.961 --> 0:31:14.662 models. 0:31:18.478 --> 0:31:27.440 So we now know how to build a translation system, but of course we don't want to have 0:31:27.440 --> 0:31:30.774 a translation system by itself. 0:31:31.251 --> 0:31:40.037 Now given this model an input sentence, how can we generate an output mind? 0:31:40.037 --> 0:31:49.398 The general idea is still: So what we really want to do is we start with the model. 0:31:49.398 --> 0:31:53.893 We generate different possible translations. 0:31:54.014 --> 0:31:59.754 We score them the lock probability that we're getting, so for each input and output pair 0:31:59.754 --> 0:32:05.430 we can calculate the lock probability, which is a product of all probabilities for each 0:32:05.430 --> 0:32:09.493 word in there, and then we can find what is the most probable. 0:32:09.949 --> 0:32:15.410 However, that's a bit complicated we will see because we can't look at all possible translations. 0:32:15.795 --> 0:32:28.842 So there is infinite or a number of possible translations, so we have to do it somehow in 0:32:28.842 --> 0:32:31.596 more intelligence. 0:32:32.872 --> 0:32:37.821 So what we want to do today in the rest of the lecture? 0:32:37.821 --> 0:32:40.295 What is the search problem? 0:32:40.295 --> 0:32:44.713 Then we will look at different search algorithms. 0:32:45.825 --> 0:32:56.636 Will compare model and search errors, so there can be errors on the model where the model 0:32:56.636 --> 0:33:03.483 is not giving the highest score to the best translation. 0:33:03.903 --> 0:33:21.069 This is always like searching the best translation out of one model, which is often also interesting. 0:33:24.004 --> 0:33:29.570 And how do we do the search? 0:33:29.570 --> 0:33:41.853 We want to find the translation where the reference is minimal. 0:33:42.042 --> 0:33:44.041 So the nice thing is SMT. 0:33:44.041 --> 0:33:51.347 It wasn't the case, but in neuromachine translation we can't find any possible translation, so 0:33:51.347 --> 0:33:53.808 at least within our vocabulary. 0:33:53.808 --> 0:33:58.114 But if we have BPE we can really generate any possible. 0:33:58.078 --> 0:34:04.604 Translation and cereal: We could always minimize that, but yeah, we can't do it that easy because 0:34:04.604 --> 0:34:07.734 of course we don't have the reference at hand. 0:34:07.747 --> 0:34:10.384 If it has a reference, it's not a problem. 0:34:10.384 --> 0:34:13.694 We know what we are searching for, but we don't know. 0:34:14.054 --> 0:34:23.886 So how can we then model this by just finding the translation with the highest probability? 0:34:23.886 --> 0:34:29.015 Looking at it, we want to find the translation. 0:34:29.169 --> 0:34:32.525 Idea is our model is a good approximation. 0:34:32.525 --> 0:34:34.399 That's how we train it. 0:34:34.399 --> 0:34:36.584 What is a good translation? 0:34:36.584 --> 0:34:43.687 And if we find translation with the highest probability, this should also give us the best 0:34:43.687 --> 0:34:44.702 translation. 0:34:45.265 --> 0:34:56.965 And that is then, of course, the difference between the search error is that the model 0:34:56.965 --> 0:35:02.076 doesn't predict the best translation. 0:35:02.622 --> 0:35:08.777 How can we do the basic search first of all in basic search that seems to be very easy 0:35:08.777 --> 0:35:15.003 so what we can do is we can do the forward pass for the whole encoder and that's how it 0:35:15.003 --> 0:35:21.724 starts the input sentences known you can put the input sentence and calculate all your estates 0:35:21.724 --> 0:35:22.573 and hidden? 0:35:23.083 --> 0:35:35.508 Then you can put in your sentence start and you can generate. 0:35:35.508 --> 0:35:41.721 Here you have the probability. 0:35:41.801 --> 0:35:52.624 A good idea we would see later that as a typical algorithm is guess what you all would do, you 0:35:52.624 --> 0:35:54.788 would then select. 0:35:55.235 --> 0:36:06.265 So if you generate here a probability distribution over all the words in your vocabulary then 0:36:06.265 --> 0:36:08.025 you can solve. 0:36:08.688 --> 0:36:13.147 Yeah, this is how our auto condition is done in our system. 0:36:14.794 --> 0:36:19.463 Yeah, this is also why there you have to have a model of possible extending. 0:36:19.463 --> 0:36:24.314 It's more of a language model, but then this is one algorithm to do the search. 0:36:24.314 --> 0:36:26.801 They maybe have also more advanced ones. 0:36:26.801 --> 0:36:32.076 We will see that so this search and other completion should be exactly the same as the 0:36:32.076 --> 0:36:33.774 search machine translation. 0:36:34.914 --> 0:36:40.480 So we'll see that this is not optimal, so hopefully it's not that this way, but for this 0:36:40.480 --> 0:36:41.043 problem. 0:36:41.941 --> 0:36:47.437 And what you can do then you can select this word. 0:36:47.437 --> 0:36:50.778 This was the best translation. 0:36:51.111 --> 0:36:57.675 Because the decoder, of course, in the next step needs not to know what is the best word 0:36:57.675 --> 0:37:02.396 here, it inputs it and generates that flexibility distribution. 0:37:03.423 --> 0:37:14.608 And then your new distribution, and you can do the same thing, there's the best word there, 0:37:14.608 --> 0:37:15.216 and. 0:37:15.435 --> 0:37:22.647 So you can continue doing that and always get the hopefully the best translation in. 0:37:23.483 --> 0:37:30.839 The first question is, of course, how long are you doing it? 0:37:30.839 --> 0:37:33.854 Now we could go forever. 0:37:36.476 --> 0:37:52.596 We had this token at the input and we put the stop token at the output. 0:37:53.974 --> 0:38:07.217 And this is important because if we wouldn't do that then we wouldn't have a good idea. 0:38:10.930 --> 0:38:16.193 So that seems to be a good idea, but is it really? 0:38:16.193 --> 0:38:21.044 Do we find the most probable sentence in this? 0:38:23.763 --> 0:38:25.154 Or my dear healed proverb,. 0:38:27.547 --> 0:38:41.823 We are always selecting the highest probability one, so it seems to be that this is a very 0:38:41.823 --> 0:38:45.902 good solution to anybody. 0:38:46.406 --> 0:38:49.909 Yes, that is actually the problem. 0:38:49.909 --> 0:38:56.416 You might do early decisions and you don't have the global view. 0:38:56.796 --> 0:39:02.813 And this problem happens because it is an outer regressive model. 0:39:03.223 --> 0:39:13.275 So it happens because yeah, the output we generate is the input in the next step. 0:39:13.793 --> 0:39:19.493 And this, of course, is leading to problems. 0:39:19.493 --> 0:39:27.474 If we always take the best solution, it doesn't mean you have. 0:39:27.727 --> 0:39:33.941 It would be different if you have a problem where the output is not influencing your input. 0:39:34.294 --> 0:39:44.079 Then this solution will give you the best model, but since the output is influencing 0:39:44.079 --> 0:39:47.762 your next input and the model,. 0:39:48.268 --> 0:39:51.599 Because one question might not be why do we have this type of model? 0:39:51.771 --> 0:39:58.946 So why do we really need to put here in the last source word? 0:39:58.946 --> 0:40:06.078 You can also put in: And then always predict the word and the nice thing is then you wouldn't 0:40:06.078 --> 0:40:11.846 need to do beams or a difficult search because then the output here wouldn't influence what 0:40:11.846 --> 0:40:12.975 is inputted here. 0:40:15.435 --> 0:40:20.219 Idea whether that might not be the best idea. 0:40:20.219 --> 0:40:24.588 You'll just be translating each word and. 0:40:26.626 --> 0:40:37.815 The second one is right, yes, you're not generating a Korean sentence. 0:40:38.058 --> 0:40:48.197 We'll also see that later it's called non auto-progressive translation, so there is work 0:40:48.197 --> 0:40:49.223 on that. 0:40:49.529 --> 0:41:02.142 So you might know it roughly because you know it's based on this hidden state, but it can 0:41:02.142 --> 0:41:08.588 be that in the end you have your probability. 0:41:09.189 --> 0:41:14.633 And then you're not modeling the dependencies within a work within the target sentence. 0:41:14.633 --> 0:41:27.547 For example: You can express things in German, then you don't know which one you really select. 0:41:27.547 --> 0:41:32.156 That influences what you later. 0:41:33.393 --> 0:41:46.411 Then you try to find a better way not only based on the English sentence and the words 0:41:46.411 --> 0:41:48.057 that come. 0:41:49.709 --> 0:42:00.954 Yes, that is more like a two-step decoding, but that is, of course, a lot more like computational. 0:42:01.181 --> 0:42:15.978 The first thing you can do, which is typically done, is doing not really search. 0:42:16.176 --> 0:42:32.968 So first look at what the problem of research is to make it a bit more clear. 0:42:34.254 --> 0:42:53.163 And now you can extend them and you can extend these and the joint probabilities. 0:42:54.334 --> 0:42:59.063 The other thing is the second word. 0:42:59.063 --> 0:43:03.397 You can do the second word dusk. 0:43:03.397 --> 0:43:07.338 Now you see the problem here. 0:43:07.707 --> 0:43:17.507 It is true that these have the highest probability, but for these you have an extension. 0:43:18.078 --> 0:43:31.585 So the problem is just because in one position one hypothesis, so you can always call this 0:43:31.585 --> 0:43:34.702 partial translation. 0:43:34.874 --> 0:43:41.269 The blue one begin is higher, but the green one can be better extended and it will overtake. 0:43:45.525 --> 0:43:54.672 So the problem is if we are doing this greedy search is that we might not end up in really 0:43:54.672 --> 0:43:55.275 good. 0:43:55.956 --> 0:44:00.916 So the first thing we could not do is like yeah, we can just try. 0:44:00.880 --> 0:44:06.049 All combinations that are there, so there is the other direction. 0:44:06.049 --> 0:44:13.020 So if the solution to to check the first one is to just try all and it doesn't give us a 0:44:13.020 --> 0:44:17.876 good result, maybe what we have to do is just try everything. 0:44:18.318 --> 0:44:23.120 The nice thing is if we try everything, we'll definitely find the best translation. 0:44:23.463 --> 0:44:26.094 So we won't have a search error. 0:44:26.094 --> 0:44:28.167 We'll come to that later. 0:44:28.167 --> 0:44:32.472 The interesting thing is our translation performance. 0:44:33.353 --> 0:44:37.039 But we will definitely find the most probable translation. 0:44:38.598 --> 0:44:44.552 However, it's not really possible because the number of combinations is just too high. 0:44:44.764 --> 0:44:57.127 So the number of congregations is your vocabulary science times the lengths of your sentences. 0:44:57.157 --> 0:45:03.665 Ten thousand or so you can imagine that very soon you will have so many possibilities here 0:45:03.665 --> 0:45:05.597 that you cannot check all. 0:45:06.226 --> 0:45:13.460 So this is not really an implication or an algorithm that you can use for applying machine 0:45:13.460 --> 0:45:14.493 translation. 0:45:15.135 --> 0:45:24.657 So maybe we have to do something in between and yeah, not look at all but only look at 0:45:24.657 --> 0:45:25.314 some. 0:45:26.826 --> 0:45:29.342 And the easiest thing for that is okay. 0:45:29.342 --> 0:45:34.877 Just do sampling, so if we don't know what to look at, maybe it's good to randomly pick 0:45:34.877 --> 0:45:35.255 some. 0:45:35.255 --> 0:45:40.601 That's not only a very good algorithm, so the basic idea will always randomly select 0:45:40.601 --> 0:45:42.865 the word, of course, based on bits. 0:45:43.223 --> 0:45:52.434 We are doing that or times, and then we are looking which one at the end has the highest. 0:45:52.672 --> 0:45:59.060 So we are not doing anymore really searching for the best one, but we are more randomly 0:45:59.060 --> 0:46:05.158 doing selections with the idea that we always select the best one at the beginning. 0:46:05.158 --> 0:46:11.764 So maybe it's better to do random, but of course one important thing is how do we randomly 0:46:11.764 --> 0:46:12.344 select? 0:46:12.452 --> 0:46:15.756 If we just do uniform distribution, it would be very bad. 0:46:15.756 --> 0:46:18.034 You'll only have very bad translations. 0:46:18.398 --> 0:46:23.261 Because in each position if you think about it you have ten thousand possibilities. 0:46:23.903 --> 0:46:28.729 Most of them are really bad decisions and you shouldn't do that. 0:46:28.729 --> 0:46:35.189 There is always only a very small number, at least compared to the 10 000 translation. 0:46:35.395 --> 0:46:43.826 So if you have the sentence here, this is an English sentence. 0:46:43.826 --> 0:46:47.841 You can start with these and. 0:46:48.408 --> 0:46:58.345 You're thinking about setting legal documents in a legal document. 0:46:58.345 --> 0:47:02.350 You should not change the. 0:47:03.603 --> 0:47:11.032 The problem is we have a neural network, we have a black box, so it's anyway a bit random. 0:47:12.092 --> 0:47:24.341 It is considered, but you will see that if you make it intelligent for clear sentences, 0:47:24.341 --> 0:47:26.986 there is not that. 0:47:27.787 --> 0:47:35.600 Is an issue we should consider that this one might lead to more randomness, but it might 0:47:35.600 --> 0:47:39.286 also be positive for machine translation. 0:47:40.080 --> 0:47:46.395 Least can't directly think of a good implication where it's positive, but if you most think 0:47:46.395 --> 0:47:52.778 about dialogue systems, for example, whereas the similar architecture is nowadays also used, 0:47:52.778 --> 0:47:55.524 you predict what the system should say. 0:47:55.695 --> 0:48:00.885 Then you want to have randomness because it's not always saying the same thing. 0:48:01.341 --> 0:48:08.370 Machine translation is typically not you want to have consistency, so if you have the same 0:48:08.370 --> 0:48:09.606 input normally. 0:48:09.889 --> 0:48:14.528 Therefore, sampling is not a mathieu. 0:48:14.528 --> 0:48:22.584 There are some things you will later see as a preprocessing step. 0:48:23.003 --> 0:48:27.832 But of course it's important how you can make this process not too random. 0:48:29.269 --> 0:48:41.619 Therefore, the first thing is don't take a uniform distribution, but we have a very nice 0:48:41.619 --> 0:48:43.562 distribution. 0:48:43.843 --> 0:48:46.621 So I'm like randomly taking a word. 0:48:46.621 --> 0:48:51.328 We are looking at output distribution and now taking a word. 0:48:51.731 --> 0:49:03.901 So that means we are taking the word these, we are taking the word does, and all these. 0:49:04.444 --> 0:49:06.095 How can you do that? 0:49:06.095 --> 0:49:09.948 You randomly draw a number between zero and one. 0:49:10.390 --> 0:49:23.686 And then you have ordered your words in some way, and then you take the words before the 0:49:23.686 --> 0:49:26.375 sum of the words. 0:49:26.806 --> 0:49:34.981 So the easiest thing is you have zero point five, zero point two five, and zero point two 0:49:34.981 --> 0:49:35.526 five. 0:49:35.526 --> 0:49:43.428 If you have a number smaller than you take the first word, it takes a second word, and 0:49:43.428 --> 0:49:45.336 if it's higher than. 0:49:45.845 --> 0:49:57.707 Therefore, you can very easily get a distribution distributed according to this probability mass 0:49:57.707 --> 0:49:59.541 and no longer. 0:49:59.799 --> 0:50:12.479 You can't even do that a bit more and more focus on the important part if we are not randomly 0:50:12.479 --> 0:50:19.494 drawing from all words, but we are looking only at. 0:50:21.361 --> 0:50:24.278 You have an idea why this is an important stamp. 0:50:24.278 --> 0:50:29.459 Although we say I'm only throwing away the words which have a very low probability, so 0:50:29.459 --> 0:50:32.555 anyway the probability of taking them is quite low. 0:50:32.555 --> 0:50:35.234 So normally that shouldn't matter that much. 0:50:36.256 --> 0:50:38.830 There's ten thousand words. 0:50:40.300 --> 0:50:42.074 Of course, they admire thousand nine hundred. 0:50:42.074 --> 0:50:44.002 They're going to build a good people steal it up. 0:50:45.085 --> 0:50:47.425 Hi, I'm Sarah Hauer and I'm Sig Hauer and We're Professional. 0:50:47.867 --> 0:50:55.299 Yes, that's exactly why you do this most sampling or so that you don't take the lowest. 0:50:55.415 --> 0:50:59.694 Probability words, but you only look at the most probable ones and then like. 0:50:59.694 --> 0:51:04.632 Of course you have to rescale your probability mass then so that it's still a probability 0:51:04.632 --> 0:51:08.417 because now it's a probability distribution over ten thousand words. 0:51:08.417 --> 0:51:13.355 If you only take ten of them or so it's no longer a probability distribution, you rescale 0:51:13.355 --> 0:51:15.330 them and you can still do that and. 0:51:16.756 --> 0:51:20.095 That is what is done assembling. 0:51:20.095 --> 0:51:26.267 It's not the most common thing, but it's done several times. 0:51:28.088 --> 0:51:40.625 Then the search, which is somehow a standard, and if you're doing some type of machine translation. 0:51:41.181 --> 0:51:50.162 And the basic idea is that in research we select for the most probable and only continue 0:51:50.162 --> 0:51:51.171 with the. 0:51:51.691 --> 0:51:53.970 You can easily generalize this. 0:51:53.970 --> 0:52:00.451 We are not only continuing the most probable one, but we are continuing the most probable. 0:52:00.880 --> 0:52:21.376 The. 0:52:17.697 --> 0:52:26.920 You should say we are sampling how many examples it makes sense to take the one with the highest. 0:52:27.127 --> 0:52:33.947 But that is important that once you do a mistake you might want to not influence that much. 0:52:39.899 --> 0:52:45.815 So the idea is if we're keeping the end best hypotheses and not only the first fact. 0:52:46.586 --> 0:52:51.558 And the nice thing is in statistical machine translation. 0:52:51.558 --> 0:52:54.473 We have exactly the same problem. 0:52:54.473 --> 0:52:57.731 You would do the same thing, however. 0:52:57.731 --> 0:53:03.388 Since the model wasn't that strong you needed a quite large beam. 0:53:03.984 --> 0:53:18.944 Machine translation models are really strong and you get already a very good performance. 0:53:19.899 --> 0:53:22.835 So how does it work? 0:53:22.835 --> 0:53:35.134 We can't relate to our capabilities, but now we are not storing the most probable ones. 0:53:36.156 --> 0:53:45.163 Done that we extend all these hypothesis and of course there is now a bit difficult because 0:53:45.163 --> 0:53:54.073 now we always have to switch what is the input so the search gets more complicated and the 0:53:54.073 --> 0:53:55.933 first one is easy. 0:53:56.276 --> 0:54:09.816 In this case we have to once put in here these and then somehow delete this one and instead 0:54:09.816 --> 0:54:12.759 put that into that. 0:54:13.093 --> 0:54:24.318 Otherwise you could only store your current network states here and just continue by going 0:54:24.318 --> 0:54:25.428 forward. 0:54:26.766 --> 0:54:34.357 So now you have done the first two, and then you have known the best. 0:54:34.357 --> 0:54:37.285 Can you now just continue? 0:54:39.239 --> 0:54:53.511 Yes, that's very important, otherwise all your beam search doesn't really help because 0:54:53.511 --> 0:54:57.120 you would still have. 0:54:57.317 --> 0:55:06.472 So now you have to do one important step and then reduce again to end. 0:55:06.472 --> 0:55:13.822 So in our case to make things easier we have the inputs. 0:55:14.014 --> 0:55:19.072 Otherwise you will have two to the power of length possibilities, so it is still exponential. 0:55:19.559 --> 0:55:26.637 But by always throwing them away you keep your beans fixed. 0:55:26.637 --> 0:55:31.709 The items now differ in the last position. 0:55:32.492 --> 0:55:42.078 They are completely different, but you are always searching what is the best one. 0:55:44.564 --> 0:55:50.791 So another way of hearing it is like this, so just imagine you start with the empty sentence. 0:55:50.791 --> 0:55:55.296 Then you have three possible extensions: A, B, and end of sentence. 0:55:55.296 --> 0:55:59.205 It's throwing away the worst one, continuing with the two. 0:55:59.699 --> 0:56:13.136 Then you want to stay too, so in this state it's either or and then you continue. 0:56:13.293 --> 0:56:24.924 So you always have this exponential growing tree by destroying most of them away and only 0:56:24.924 --> 0:56:26.475 continuing. 0:56:26.806 --> 0:56:42.455 And thereby you can hopefully do less errors because in these examples you always see this 0:56:42.455 --> 0:56:43.315 one. 0:56:43.503 --> 0:56:47.406 So you're preventing some errors, but of course it's not perfect. 0:56:47.447 --> 0:56:56.829 You can still do errors because it could be not the second one but the fourth one. 0:56:57.017 --> 0:57:03.272 Now just the idea is that you make yeah less errors and prevent that. 0:57:07.667 --> 0:57:11.191 Then the question is how much does it help? 0:57:11.191 --> 0:57:14.074 And here is some examples for that. 0:57:14.074 --> 0:57:16.716 So for S & T it was really like. 0:57:16.716 --> 0:57:23.523 Typically the larger beam you have a larger third space and you have a better score. 0:57:23.763 --> 0:57:27.370 So the larger you get, the bigger your emails, the better you will. 0:57:27.370 --> 0:57:30.023 Typically maybe use something like three hundred. 0:57:30.250 --> 0:57:38.777 And it's mainly a trade-off between quality and speed because the larger your beams, the 0:57:38.777 --> 0:57:43.184 more time it takes and you want to finish it. 0:57:43.184 --> 0:57:49.124 So your quality improvements are getting smaller and smaller. 0:57:49.349 --> 0:57:57.164 So the difference between a beam of one and ten is bigger than the difference between a. 0:57:58.098 --> 0:58:14.203 And the interesting thing is we're seeing a bit of a different view, and we're seeing 0:58:14.203 --> 0:58:16.263 typically. 0:58:16.776 --> 0:58:24.376 And then especially if you look at the green ones, this is unnormalized. 0:58:24.376 --> 0:58:26.770 You're seeing a sharp. 0:58:27.207 --> 0:58:32.284 So your translation quality here measured in blue will go down again. 0:58:33.373 --> 0:58:35.663 That is now a question. 0:58:35.663 --> 0:58:37.762 Why is that the case? 0:58:37.762 --> 0:58:43.678 Why should we are seeing more and more possible translations? 0:58:46.226 --> 0:58:48.743 If we have a bigger stretch and we are going. 0:58:52.612 --> 0:58:56.312 I'm going to be using my examples before we also look at the bar. 0:58:56.656 --> 0:58:59.194 A good idea. 0:59:00.000 --> 0:59:18.521 But it's not everything because we in the end always in this list we're selecting. 0:59:18.538 --> 0:59:19.382 So this is here. 0:59:19.382 --> 0:59:21.170 We don't do any regions to do that. 0:59:21.601 --> 0:59:29.287 So the probabilities at the end we always give out the hypothesis with the highest probabilities. 0:59:30.250 --> 0:59:33.623 That is always the case. 0:59:33.623 --> 0:59:43.338 If you have a beam of this should be a subset of the items you look at. 0:59:44.224 --> 0:59:52.571 So if you increase your biomeat you're just looking at more and you're always taking the 0:59:52.571 --> 0:59:54.728 wine with the highest. 0:59:57.737 --> 1:00:07.014 Maybe they are all the probability that they will be comparable to don't really have. 1:00:08.388 --> 1:00:14.010 But the probabilities are the same, not that easy. 1:00:14.010 --> 1:00:23.931 One morning maybe you will have more examples where we look at some stuff that's not seen 1:00:23.931 --> 1:00:26.356 in the trading space. 1:00:28.428 --> 1:00:36.478 That's mainly the answer why we give a hyperability math we will see, but that is first of all 1:00:36.478 --> 1:00:43.087 the biggest issues, so here is a blue score, so that is somewhat translation. 1:00:43.883 --> 1:00:48.673 This will go down by the probability of the highest one that only goes out where stays 1:00:48.673 --> 1:00:49.224 at least. 1:00:49.609 --> 1:00:57.971 The problem is if we are searching more, we are finding high processes which have a high 1:00:57.971 --> 1:00:59.193 translation. 1:00:59.579 --> 1:01:10.375 So we are finding these things which we wouldn't find and we'll see why this is happening. 1:01:10.375 --> 1:01:15.714 So somehow we are reducing our search error. 1:01:16.336 --> 1:01:25.300 However, we also have a model error and we don't assign the highest probability to translation 1:01:25.300 --> 1:01:27.942 quality to the really best. 1:01:28.548 --> 1:01:31.460 They don't always add up. 1:01:31.460 --> 1:01:34.932 Of course somehow they add up. 1:01:34.932 --> 1:01:41.653 If your bottle is worse then your performance will even go. 1:01:42.202 --> 1:01:49.718 But sometimes it's happening that by increasing search errors we are missing out the really 1:01:49.718 --> 1:01:57.969 bad translations which have a high probability and we are only finding the decently good probability 1:01:57.969 --> 1:01:58.460 mass. 1:01:59.159 --> 1:02:03.859 So they are a bit independent of each other and you can make those types of arrows. 1:02:04.224 --> 1:02:09.858 That's why, for example, doing exact search will give you the translation with the highest 1:02:09.858 --> 1:02:15.245 probability, but there has been work on it that you then even have a lower translation 1:02:15.245 --> 1:02:21.436 quality because then you find some random translation which has a very high translation probability 1:02:21.436 --> 1:02:22.984 by which I'm really bad. 1:02:23.063 --> 1:02:29.036 Because our model is not perfect and giving a perfect translation probability over air,. 1:02:31.431 --> 1:02:34.537 So why is this happening? 1:02:34.537 --> 1:02:42.301 And one issue with this is the so called label or length spiral. 1:02:42.782 --> 1:02:47.115 And we are in each step of decoding. 1:02:47.115 --> 1:02:55.312 We are modeling the probability of the next word given the input and. 1:02:55.895 --> 1:03:06.037 So if you have this picture, so you always hear you have the probability of the next word. 1:03:06.446 --> 1:03:16.147 That's that's what your modeling, and of course the model is not perfect. 1:03:16.576 --> 1:03:22.765 So it can be that if we at one time do a bitter wrong prediction not for the first one but 1:03:22.765 --> 1:03:28.749 maybe for the 5th or 6th thing, then we're giving it an exceptional high probability we 1:03:28.749 --> 1:03:30.178 cannot recover from. 1:03:30.230 --> 1:03:34.891 Because this high probability will stay there forever and we just multiply other things to 1:03:34.891 --> 1:03:39.910 it, but we cannot like later say all this probability was a bit too high, we shouldn't have done. 1:03:41.541 --> 1:03:48.984 And this leads to that the more the longer your translation is, the more often you use 1:03:48.984 --> 1:03:51.637 this probability distribution. 1:03:52.112 --> 1:04:03.321 The typical example is this one, so you have the probability of the translation. 1:04:04.104 --> 1:04:12.608 And this probability is quite low as you see, and maybe there are a lot of other things. 1:04:13.053 --> 1:04:25.658 However, it might still be overestimated that it's still a bit too high. 1:04:26.066 --> 1:04:33.042 The problem is if you know the project translation is a very long one, but probability mask gets 1:04:33.042 --> 1:04:33.545 lower. 1:04:34.314 --> 1:04:45.399 Because each time you multiply your probability to it, so your sequence probability gets lower 1:04:45.399 --> 1:04:46.683 and lower. 1:04:48.588 --> 1:04:59.776 And this means that at some point you might get over this, and it might be a lower probability. 1:05:00.180 --> 1:05:09.651 And if you then have this probability at the beginning away, but it wasn't your beam, then 1:05:09.651 --> 1:05:14.958 at this point you would select the empty sentence. 1:05:15.535 --> 1:05:25.379 So this has happened because this short translation is seen and it's not thrown away. 1:05:28.268 --> 1:05:31.121 So,. 1:05:31.151 --> 1:05:41.256 If you have a very sore beam that can be prevented, but if you have a large beam, this one is in 1:05:41.256 --> 1:05:41.986 there. 1:05:42.302 --> 1:05:52.029 This in general seems reasonable that shorter pronunciations instead of longer sentences 1:05:52.029 --> 1:05:54.543 because non-religious. 1:05:56.376 --> 1:06:01.561 It's a bit depending on whether the translation should be a bit related to your input. 1:06:02.402 --> 1:06:18.053 And since we are always multiplying things, the longer the sequences we are getting smaller, 1:06:18.053 --> 1:06:18.726 it. 1:06:19.359 --> 1:06:29.340 It's somewhat right for human main too, but the models tend to overestimate because of 1:06:29.340 --> 1:06:34.388 this short translation of long translation. 1:06:35.375 --> 1:06:46.474 Then, of course, that means that it's not easy to stay on a computer because eventually 1:06:46.474 --> 1:06:48.114 it suggests. 1:06:51.571 --> 1:06:59.247 First of all there is another way and that's typically used but you don't have to do really 1:06:59.247 --> 1:07:07.089 because this is normally not a second position and if it's like on the 20th position you only 1:07:07.089 --> 1:07:09.592 have to have some bean lower. 1:07:10.030 --> 1:07:17.729 But you are right because these issues get larger, the larger your input is, and then 1:07:17.729 --> 1:07:20.235 you might make more errors. 1:07:20.235 --> 1:07:27.577 So therefore this is true, but it's not as simple that this one is always in the. 1:07:28.408 --> 1:07:45.430 That the translation for it goes down with higher insert sizes has there been more control. 1:07:47.507 --> 1:07:51.435 In this work you see a dozen knocks. 1:07:51.435 --> 1:07:53.027 Knots go down. 1:07:53.027 --> 1:08:00.246 That's light green here, but at least you don't see the sharp rock. 1:08:00.820 --> 1:08:07.897 So if you do some type of normalization, at least you can assess this probability and limit 1:08:07.897 --> 1:08:08.204 it. 1:08:15.675 --> 1:08:24.828 There is other reasons why, like initial, it's not only the length, but there can be 1:08:24.828 --> 1:08:26.874 other reasons why. 1:08:27.067 --> 1:08:37.316 And if you just take it too large, you're looking too often at ways in between, but it's 1:08:37.316 --> 1:08:40.195 better to ignore things. 1:08:41.101 --> 1:08:44.487 But that's more a hand gravy argument. 1:08:44.487 --> 1:08:47.874 Agree so don't know if the exact word. 1:08:48.648 --> 1:08:53.223 You need to do the normalization and there are different ways of doing it. 1:08:53.223 --> 1:08:54.199 It's mainly OK. 1:08:54.199 --> 1:08:59.445 We're just now not taking the translation with the highest probability, but we during 1:08:59.445 --> 1:09:04.935 the coding have another feature saying not only take the one with the highest probability 1:09:04.935 --> 1:09:08.169 but also prefer translations which are a bit longer. 1:09:08.488 --> 1:09:16.933 You can do that different in a way to divide by the center length. 1:09:16.933 --> 1:09:23.109 We take not the highest but the highest average. 1:09:23.563 --> 1:09:28.841 Of course, if both are the same lengths, it doesn't matter if M is the same lengths in 1:09:28.841 --> 1:09:34.483 all cases, but if you compare a translation with seven or eight words, there is a difference 1:09:34.483 --> 1:09:39.700 if you want to have the one with the highest probability or with the highest average. 1:09:41.021 --> 1:09:50.993 So that is the first one can have some reward model for each word, add a bit of the score, 1:09:50.993 --> 1:09:51.540 and. 1:09:51.711 --> 1:10:03.258 And then, of course, you have to find you that there is also more complex ones here. 1:10:03.903 --> 1:10:08.226 So there is different ways of doing that, and of course that's important. 1:10:08.428 --> 1:10:11.493 But in all of that, the main idea is OK. 1:10:11.493 --> 1:10:18.520 We are like knowing of the arrow that the model seems to prevent or prefer short translation. 1:10:18.520 --> 1:10:24.799 We circumvent that by OK we are adding we are no longer searching for the best one. 1:10:24.764 --> 1:10:30.071 But we're searching for the one best one and some additional constraints, so mainly you 1:10:30.071 --> 1:10:32.122 are doing here during the coding. 1:10:32.122 --> 1:10:37.428 You're not completely trusting your model, but you're adding some buyers or constraints 1:10:37.428 --> 1:10:39.599 into what should also be fulfilled. 1:10:40.000 --> 1:10:42.543 That can be, for example, that the length should be recently. 1:10:49.369 --> 1:10:51.071 Any More Questions to That. 1:10:56.736 --> 1:11:04.001 Last idea which gets recently quite a bit more interest also is what is called minimum 1:11:04.001 --> 1:11:11.682 base risk decoding and there is maybe not the one correct translation but there are several 1:11:11.682 --> 1:11:13.937 good correct translations. 1:11:14.294 --> 1:11:21.731 And the idea is now we don't want to find the one translation, which is maybe the highest 1:11:21.731 --> 1:11:22.805 probability. 1:11:23.203 --> 1:11:31.707 Instead we are looking at all the high translation, all translation with high probability and then 1:11:31.707 --> 1:11:39.524 we want to take one representative out of this so we're just most similar to all the other 1:11:39.524 --> 1:11:42.187 hydrobility translation again. 1:11:43.643 --> 1:11:46.642 So how does it work? 1:11:46.642 --> 1:11:55.638 First you could have imagined you have reference translations. 1:11:55.996 --> 1:12:13.017 You have a set of reference translations and then what you want to get is you want to have. 1:12:13.073 --> 1:12:28.641 As a probability distribution you measure the similarity of reference and the hypothesis. 1:12:28.748 --> 1:12:31.408 So you have two sets of translation. 1:12:31.408 --> 1:12:34.786 You have the human translations of a sentence. 1:12:35.675 --> 1:12:39.251 That's of course not realistic, but first from the idea. 1:12:39.251 --> 1:12:42.324 Then you have your set of possible translations. 1:12:42.622 --> 1:12:52.994 And now you're not saying okay, we have only one human, but we have several humans with 1:12:52.994 --> 1:12:56.294 different types of quality. 1:12:56.796 --> 1:13:07.798 You have to have two metrics here, the similarity between the automatic translation and the quality 1:13:07.798 --> 1:13:09.339 of the human. 1:13:10.951 --> 1:13:17.451 Of course, we have the same problem that we don't have the human reference, so we have. 1:13:18.058 --> 1:13:29.751 So when we are doing it, instead of estimating the quality based on the human, we use our 1:13:29.751 --> 1:13:30.660 model. 1:13:31.271 --> 1:13:37.612 So we can't be like humans, so we take the model probability. 1:13:37.612 --> 1:13:40.782 We take the set here first of. 1:13:41.681 --> 1:13:48.755 Then we are comparing each hypothesis to this one, so you have two sets. 1:13:48.755 --> 1:13:53.987 Just imagine here you take all possible translations. 1:13:53.987 --> 1:13:58.735 Here you take your hypothesis in comparing them. 1:13:58.678 --> 1:14:03.798 And then you're taking estimating the quality based on the outcome. 1:14:04.304 --> 1:14:06.874 So the overall idea is okay. 1:14:06.874 --> 1:14:14.672 We are not finding the best hypothesis but finding the hypothesis which is most similar 1:14:14.672 --> 1:14:17.065 to many good translations. 1:14:19.599 --> 1:14:21.826 Why would you do that? 1:14:21.826 --> 1:14:25.119 It's a bit like a smoothing idea. 1:14:25.119 --> 1:14:28.605 Imagine this is the probability of. 1:14:29.529 --> 1:14:36.634 So if you would do beam search or mini search or anything, if you just take the highest probability 1:14:36.634 --> 1:14:39.049 one, you would take this red one. 1:14:39.799 --> 1:14:45.686 Has this type of probability distribution. 1:14:45.686 --> 1:14:58.555 Then it might be better to take some of these models because it's a bit lower in probability. 1:14:58.618 --> 1:15:12.501 So what you're mainly doing is you're doing some smoothing of your probability distribution. 1:15:15.935 --> 1:15:17.010 How can you do that? 1:15:17.010 --> 1:15:20.131 Of course, we cannot do this again compared to all the hype. 1:15:21.141 --> 1:15:29.472 But what we can do is we have just two sets and we're just taking them the same. 1:15:29.472 --> 1:15:38.421 So we're having our penny data of the hypothesis and the sum of the soider references. 1:15:39.179 --> 1:15:55.707 And we can just take the same clue so we can just compare the utility of the. 1:15:56.656 --> 1:16:16.182 And then, of course, the question is how do we measure the quality of the hypothesis? 1:16:16.396 --> 1:16:28.148 Course: You could also take here the probability of this pee of given, but you can also say 1:16:28.148 --> 1:16:30.958 we only take the top. 1:16:31.211 --> 1:16:39.665 And where we don't want to really rely on how good they are, we filtered out all the 1:16:39.665 --> 1:16:40.659 bad ones. 1:16:40.940 --> 1:16:54.657 So that is the first question for the minimum base rhythm, and what are your pseudo references? 1:16:55.255 --> 1:17:06.968 So how do you set the quality of all these references here in the independent sampling? 1:17:06.968 --> 1:17:10.163 They all have the same. 1:17:10.750 --> 1:17:12.308 There's Also Work Where You Can Take That. 1:17:13.453 --> 1:17:17.952 And then the second question you have to do is, of course,. 1:17:17.917 --> 1:17:26.190 How do you prepare now two hypothesisms so you have now Y and H which are post generated 1:17:26.190 --> 1:17:34.927 by the system and you want to find the H which is most similar to all the other translations. 1:17:35.335 --> 1:17:41.812 So it's mainly like this model here, which says how similar is age to all the other whites. 1:17:42.942 --> 1:17:50.127 So you have to again use some type of similarity metric, which says how similar to possible. 1:17:52.172 --> 1:17:53.775 How can you do that? 1:17:53.775 --> 1:17:58.355 We luckily knew how to compare a reference to a hypothesis. 1:17:58.355 --> 1:18:00.493 We have evaluation metrics. 1:18:00.493 --> 1:18:03.700 You can do something like sentence level. 1:18:04.044 --> 1:18:13.501 But especially if you're looking into neuromodels you should have a stromometric so you can use 1:18:13.501 --> 1:18:17.836 a neural metric which directly compares to. 1:18:22.842 --> 1:18:29.292 Yes, so that is, is the main idea of minimum base risk to, so the important idea you should 1:18:29.292 --> 1:18:35.743 keep in mind is that it's doing somehow the smoothing by not taking the highest probability 1:18:35.743 --> 1:18:40.510 one, but by comparing like by taking a set of high probability one. 1:18:40.640 --> 1:18:45.042 And then looking for the translation, which is most similar to all of that. 1:18:45.445 --> 1:18:49.888 And thereby doing a bit more smoothing because you look at this one. 1:18:49.888 --> 1:18:55.169 If you have this one, for example, it would be more similar to all of these ones. 1:18:55.169 --> 1:19:00.965 But if you take this one, it's higher probability, but it's very dissimilar to all these. 1:19:05.445 --> 1:19:17.609 Hey, that is all for decoding before we finish with your combination of models. 1:19:18.678 --> 1:19:20.877 Sort of set of pseudo-reperences. 1:19:20.877 --> 1:19:24.368 Thomas Brown writes a little bit of type research or. 1:19:24.944 --> 1:19:27.087 For example, you can do beam search. 1:19:27.087 --> 1:19:28.825 You can do sampling for that. 1:19:28.825 --> 1:19:31.257 Oh yeah, we had mentioned sampling there. 1:19:31.257 --> 1:19:34.500 I don't know somebody asking for what sampling is good. 1:19:34.500 --> 1:19:37.280 So there's, of course, another important issue. 1:19:37.280 --> 1:19:40.117 How do you get a good representative set of age? 1:19:40.620 --> 1:19:47.147 If you do beam search, it might be that you end up with two similar ones, and maybe it's 1:19:47.147 --> 1:19:49.274 prevented by doing sampling. 1:19:49.274 --> 1:19:55.288 But maybe in sampling you find worse ones, but yet some type of model is helpful. 1:19:56.416 --> 1:20:04.863 Search method use more transformed based translation points. 1:20:04.863 --> 1:20:09.848 Nowadays beam search is definitely. 1:20:10.130 --> 1:20:13.749 There is work on this. 1:20:13.749 --> 1:20:27.283 The problem is that the MBR is often a lot more like heavy because you have to sample 1:20:27.283 --> 1:20:29.486 translations. 1:20:31.871 --> 1:20:40.946 If you are bustling then we take a pen or a pen for the most possible one. 1:20:40.946 --> 1:20:43.003 Now we put them. 1:20:43.623 --> 1:20:46.262 Bit and then we say okay, you don't have to be fine. 1:20:46.262 --> 1:20:47.657 I'm going to put it to you. 1:20:48.428 --> 1:20:52.690 Yes, so that is what you can also do. 1:20:52.690 --> 1:21:00.092 Instead of taking uniform per ability, you could take the modest. 1:21:01.041 --> 1:21:14.303 The uniform is a bit more robust because if you had this one it might be that there is 1:21:14.303 --> 1:21:17.810 some crazy exceptions. 1:21:17.897 --> 1:21:21.088 And then it would still relax. 1:21:21.088 --> 1:21:28.294 So if you look at this picture, the probability here would be higher. 1:21:28.294 --> 1:21:31.794 But yeah, that's a bit of tuning. 1:21:33.073 --> 1:21:42.980 In this case, and yes, it is like modeling also the ants that. 1:21:49.169 --> 1:21:56.265 The last thing is now we always have considered one model. 1:21:56.265 --> 1:22:04.084 It's also some prints helpful to not only look at one model but. 1:22:04.384 --> 1:22:10.453 So in general there's many ways of how you can make several models and with it's even 1:22:10.453 --> 1:22:17.370 easier you can just start three different random municipalizations you get three different models 1:22:17.370 --> 1:22:18.428 and typically. 1:22:19.019 --> 1:22:27.299 And then the question is, can we combine their strength into one model and use that then? 1:22:29.669 --> 1:22:39.281 And that can be done and it can be either online or ensemble, and the more offline thing 1:22:39.281 --> 1:22:41.549 is called reranking. 1:22:42.462 --> 1:22:52.800 So the idea is, for example, an ensemble that you combine different initializations. 1:22:52.800 --> 1:23:02.043 Of course, you can also do other things like having different architecture. 1:23:02.222 --> 1:23:08.922 But the easiest thing you can change always in generating two motors is to have different. 1:23:09.209 --> 1:23:24.054 And then the question is how can you combine that? 1:23:26.006 --> 1:23:34.245 And the easiest thing, as said, is the bottle of soda. 1:23:34.245 --> 1:23:39.488 What you mainly do is in parallel. 1:23:39.488 --> 1:23:43.833 You decode all of the money. 1:23:44.444 --> 1:23:59.084 So the probability of the output and you can join this one to a joint one by just summing 1:23:59.084 --> 1:24:04.126 up over your key models again. 1:24:04.084 --> 1:24:10.374 So you still have a pro bonding distribution, but you are not taking only one output here, 1:24:10.374 --> 1:24:10.719 but. 1:24:11.491 --> 1:24:20.049 So that's one you can easily combine different models, and the nice thing is it typically 1:24:20.049 --> 1:24:20.715 works. 1:24:21.141 --> 1:24:27.487 You additional improvement with only more calculation but not more human work. 1:24:27.487 --> 1:24:33.753 You just do the same thing for times and you're getting a better performance. 1:24:33.793 --> 1:24:41.623 Like having more layers and so on, the advantage of bigger models is of course you have to have 1:24:41.623 --> 1:24:46.272 the big models only joint and decoding during inference. 1:24:46.272 --> 1:24:52.634 There you have to load models in parallel because you have to do your search. 1:24:52.672 --> 1:24:57.557 Normally there is more memory resources for training than you need for insurance. 1:25:00.000 --> 1:25:12.637 You have to train four models and the decoding speed is also slower because you need to decode 1:25:12.637 --> 1:25:14.367 four models. 1:25:14.874 --> 1:25:25.670 There is one other very important thing and the models have to be very similar, at least 1:25:25.670 --> 1:25:27.368 in some ways. 1:25:27.887 --> 1:25:28.506 Course. 1:25:28.506 --> 1:25:34.611 You can only combine this one if you have the same words because you are just. 1:25:34.874 --> 1:25:43.110 So just imagine you have two different sizes because you want to compare them or a director 1:25:43.110 --> 1:25:44.273 based model. 1:25:44.724 --> 1:25:53.327 That's at least not easily possible here because once your output would be here a word and the 1:25:53.327 --> 1:25:56.406 other one would have to sum over. 1:25:56.636 --> 1:26:07.324 So this ensemble typically only works if you have the same output vocabulary. 1:26:07.707 --> 1:26:16.636 Your input can be different because that is only done once and then. 1:26:16.636 --> 1:26:23.752 Your hardware vocabulary has to be the same otherwise. 1:26:27.507 --> 1:26:41.522 There's even a surprising effect of improving your performance and it's again some kind of 1:26:41.522 --> 1:26:43.217 smoothing. 1:26:43.483 --> 1:26:52.122 So normally during training what we are doing is we can save the checkpoints after each epoch. 1:26:52.412 --> 1:27:01.774 And you have this type of curve where your Arab performance normally should go down, and 1:27:01.774 --> 1:27:09.874 if you do early stopping it means that at the end you select not the lowest. 1:27:11.571 --> 1:27:21.467 However, some type of smoothing is there again. 1:27:21.467 --> 1:27:31.157 Sometimes what you can do is take an ensemble. 1:27:31.491 --> 1:27:38.798 That is not as good, but you still have four different bottles, and they give you a little. 1:27:39.259 --> 1:27:42.212 So,. 1:27:43.723 --> 1:27:48.340 It's some are helping you, so now they're supposed to be something different, you know. 1:27:49.489 --> 1:27:53.812 Oh didn't do that, so that is a checkpoint. 1:27:53.812 --> 1:27:59.117 There is one thing interesting, which is even faster. 1:27:59.419 --> 1:28:12.255 Normally let's give you better performance because this one might be again like a smooth 1:28:12.255 --> 1:28:13.697 ensemble. 1:28:16.736 --> 1:28:22.364 Of course, there is also some problems with this, so I said. 1:28:22.364 --> 1:28:30.022 For example, maybe you want to do different web representations with Cherokee and. 1:28:30.590 --> 1:28:37.189 You want to do right to left decoding so you normally do like I go home but then your translation 1:28:37.189 --> 1:28:39.613 depends only on the previous words. 1:28:39.613 --> 1:28:45.942 If you want to model on the future you could do the inverse direction and generate the target 1:28:45.942 --> 1:28:47.895 sentence from right to left. 1:28:48.728 --> 1:28:50.839 But it's not easy to combine these things. 1:28:51.571 --> 1:28:56.976 In order to do this, or what is also sometimes interesting is doing in verse translation. 1:28:57.637 --> 1:29:07.841 You can combine these types of models in the next election. 1:29:07.841 --> 1:29:13.963 That is only a bit which we can do. 1:29:14.494 --> 1:29:29.593 Next time what you should remember is how search works and do you have any final questions. 1:29:33.773 --> 1:29:43.393 Then I wish you a happy holiday for next week and then Monday there is another practical 1:29:43.393 --> 1:29:50.958 and then Thursday in two weeks so we'll have the next lecture Monday.