Okay, so welcome to today's lecture. We are a bit squeezed into a small room, and I'm sorry for the inconvenience; sometimes there are project meetings in the usual room, and this will happen one more time.

What we want to talk about today is neural approaches to machine translation. I guess you have heard about neural models for other natural language processing tasks; what we look at today were some of the first steps in introducing neural networks into machine translation. They are similar to what you know as large language models. So today we look into what these neural language models are, what the difference is, and what the motivation is.

First we will use them inside statistical machine translation. If you remember, two or three weeks ago we had this log-linear model where you can easily integrate any feature: we just add another model which evaluates how good, how fluent, the language is. The main advantage compared to the statistical language models we saw on Tuesday is that they can generalize to unseen events. Next week we will then go to neural machine translation, where we replace the whole model with a neural network.

Just as a reminder from Tuesday: we have seen that the main challenge in language modeling was that most of the n-grams we have not seen. That made it difficult to estimate any probability, because normally, if you have not seen the n-gram, you would assign it a probability of zero. However, that is not very good, because we don't want to give zero probability to sentences which might still be perfectly good English. So we learned a lot of techniques, and that is the main challenge in statistical machine translation and statistical language modeling: how can we give a good probability estimate to events we haven't seen? We saw smoothing techniques, interpolation and back-off. People developed very specific techniques to deal with that; however, it might not be enough.

So maybe we can do things differently. If we have not seen an n-gram before in a statistical model, we can only get information from exactly the same words; there is no approximate matching, no notion that another sequence occurs in a similar way. So we would like something where n-grams are represented in a more general, continuous space, and where we can generalize from similar n-grams.
So if we learn something about 'walk', then maybe we can use this knowledge and also apply it to 'walks'. We do the same as before, but we can model much better how similar words are and transfer what we learned to other words.

And we may want to do that in a more hierarchical way: some words are similar, like 'go' and 'walk', and pronouns like 'I' and 'he' are similar, and therefore, if we then combine them into an n-gram, learning something about 'I walk' should also tell us something about 'he walks'. You see that there are some relations we still need to take into account, for example that we need to add the -s, but the knowledge should transfer.

And luckily there is one really convincing method for doing that, and that is using neural models. That is what we will introduce today: we can use this type of neural network to learn this similarity and to learn how to generalize. That is one of the main advantages we get by switching from the standard statistical models: we can learn similarities between words, generalize, and learn what are called hidden representations of words, where we can measure in which dimensions words are similar.

Before, words were just indices. The only thing we did was map each word to an index, but these indices don't carry any meaning; it is not the case that word five is more similar to word six than to word one hundred. So we couldn't learn anything about words in the statistical model, and that is a big challenge. Even for morphology: 'go' and 'goes' are somehow similar, it is just the third person singular, but the models we had so far have no idea about that, and 'goes' is as similar to 'go' as it is to 'sleep'.

So what we want to do today: in order to get there, we will have a short introduction into neural networks. Very short, just to see how we use them here; for most of you it will probably be repetition. Then we will first look into feed-forward neural network language models. There we still have the approximation we had before, that we only look at a fixed window of previous words; however, we have the word embeddings, and that is why these models are already better at generalizing. And then at the end we will look at recurrent language models, where we have the additional advantage that it is no longer the case that we need a fixed history; in theory we can model arbitrarily long dependencies.
And we talked on Tuesday about cases where it is not clear how much context information you really need.

So, in general, neural networks learn to perform some task. We define the structure, and we learn the parameters from samples; that is similar to what we had before. The task here is the same, a language model: given the previous words as input, predict the next word. Neural networks were originally somewhat motivated by the human brain; however, for the artificial neural networks we use nowadays it is hard to find much similarity, and that is not really the point. What they are mainly doing is summation, multiplication, and then one non-linear activation.

So the basic unit is the perceptron. It is the basic building block, and it does the following processing: we have a fixed number of input features, and that will be important, so we have numbers x1 to xn as input. This fixed size partly makes language processing difficult, because sentences do not have a fixed length. Then we have weights, which are the parameters, and the number of weights is exactly the same as the number of input features. Sometimes there is also a bias term in there, which is not really an input. What you then do is multiply each input with its weight, sum everything up, and then apply an activation function. It is important that this activation function is non-linear, because otherwise the whole thing collapses to a linear model, and later it will be important that it is differentiable, because otherwise the training does not work.

This model by itself is not very powerful; it was shown early on that a single perceptron cannot solve some simple problems. However, there is a very easy extension, the multi-layer perceptron, and then things get very powerful. The idea is that you connect a lot of these units in a layered structure: we have our input layer with the inputs, and at least one hidden layer where every input is connected to every hidden unit, and then we combine them all in the output. The input layer is of course given by the dimension of your problem, the output layer is also given by your task, but the hidden layer size is a hyperparameter you can choose. A minimal sketch of this weighted-sum-plus-activation computation follows below.
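To make the weighted-sum-plus-activation idea concrete, here is a minimal sketch in plain NumPy of a network with one hidden layer as described above; the sizes and the tanh activation are illustrative choices, not the exact ones from the lecture.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: weighted sum plus bias, non-linear activation, then output layer."""
    h = np.tanh(W1 @ x + b1)   # hidden layer: inputs times weights, plus bias, then non-linearity
    y = W2 @ h + b2            # output layer (an output activation such as softmax could follow)
    return y

# toy example: 4 input features, 3 hidden units, 2 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
print(mlp_forward(x, W1, b1, W2, b2))
```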
So let's start with the first question, which is now more language related: how do we represent words? We have this network with a fixed number of input neurons, but how can we put a word into it? The first idea might be to directly put in some continuous embedding vector, but that is not so easy, because where would this vector come from? We will come to that. If we need to input a word into the neural network, it has to be something which is easily defined, and the standard solution is the one-hot encoding: a one-out-of-n encoding where one value is one and all the others are zero.

That means we are always dealing with a fixed vocabulary, because you cannot easily extend it; if you wanted to extend the vocabulary, you would have to change the dimension of the input. That is also one motivation for byte-pair encoding, which we talked about: there you have a fixed vocabulary.

The big advantage of this one-hot encoding is that we do not implicitly assume any similarity between words; the similarity is really learned. If you first think about it, this looks very, very inefficient: to represent n words you need an n-dimensional vector. Imagine you did a binary encoding instead, representing words as binary vectors; that would be significantly more efficient. However, then you would have some implicit similarity, because some numbers share bits, and that would be bad, because you would force a similarity that somebody defined by hand, and it is not at all clear how to define it.

Therefore, currently the most successful approach is simply this one-hot representation: we take a fixed vocabulary, we map each word to an index, and then we represent a word like this: if 'home' has index one, the representation is one, zero, zero, zero, and so on.

But this dimension is the vocabulary size, and that is quite high, so we always try to be efficient. We then get the efficiency because typically the next layer is a lot smaller: it may still be two hundred or five hundred or a thousand neurons, but that is significantly smaller than the vocabulary. This mapping is learned directly, and there we then have similarity between words, so that some words end up with similar representations. The nice thing is that this is learned; we do not need to define it by hand. We will come later to the explicit architecture of the neural language model, and there we can see how it is used.

So we have seen that the one-hot representation always has the same similarity between any two words. Then we have this continuous vector, which has a much smaller dimension, and that is important for later. What we are doing is learning these representations so that they are best for language modeling.
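As a small illustration of the one-hot input and of how multiplying it with the first weight matrix is just a column selection (the vocabulary and the sizes here are made-up toy values):

```python
import numpy as np

vocab = {"home": 0, "house": 1, "go": 2, "walk": 3}   # toy vocabulary with fixed indices
V, d = len(vocab), 3                                   # vocabulary size and (much smaller) embedding size

def one_hot(word):
    v = np.zeros(V)
    v[vocab[word]] = 1.0                               # exactly one entry is 1, all others 0
    return v

E = np.random.default_rng(0).normal(size=(d, V))       # embedding matrix: one column per word

# multiplying the one-hot vector just selects one column of E
print(np.allclose(E @ one_hot("go"), E[:, vocab["go"]]))   # True
```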
So the representations are implicitly trained on the language modeling task: what is the best representation of a word in order to do language modeling? And the nice thing that was found out later is that these representations are really good in general. That is why they are now called word embeddings in their own right and are used for other tasks as well. They describe very different aspects of a word, for example semantic similarities. We will look at the famous example later today, where you can do maths in the vector space by adding and subtracting word vectors and get interesting results. These embeddings gave the first big improvement when switching to neural models, and in today's large models they are still part of the model; the representations get more complex, but this is the basic building block.

In the output layer we also have a one-hot-like structure and a softmax function. For language modeling we want to predict which word is the most probable next word. That can be done very well with this so-called softmax layer, where again the dimension is the vocabulary size, and again the k-th neuron represents the k-th word. So our target is again a one-hot representation, saying this is the correct word, and our output is a probability distribution over all words, where the k-th entry tells us how probable it is that the next word is word k.

So we need a probability distribution at the output, and we achieve that with the softmax activation: we take e to the power of each value and divide by the sum over all of them. By having this type of activation function we really get a probability distribution. At the beginning this was also computationally very challenging, because again we have this inefficient representation: you can imagine that a softmax over tens of thousands of entries is expensive, even on GPUs.

And then we need to train the model. We have the architecture, and we want to minimize the error: we take the output and compare it to our target. So one important question when training is: how do we measure the error? In natural language processing this is typically the cross-entropy, and that means we compare the target with the output by summing the target times the logarithm of the output. This gets optimized, and you will see that this is very nice and easy here, because our target is again a one-hot representation.
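A minimal sketch of the softmax output and of the cross-entropy against a one-hot target (the score values are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()                   # each entry in (0,1), all entries sum to 1

def cross_entropy(p, target_index):
    # with a one-hot target only one term of the sum survives:
    # -sum_k t_k * log p_k  ==  -log p[target_index]
    return -np.log(p[target_index])

scores = np.array([2.0, 0.5, -1.0, 0.1])     # raw network outputs for a 4-word toy vocabulary
p = softmax(scores)
print(p, cross_entropy(p, target_index=0))
```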
So all the target entries are zero except one, and we only need to multiply that single one with the logarithm of the corresponding output probability; that is the entire feedback signal we take here.

Of course, this loss is still influenced by all the other outputs. Why is it influenced by all the others? Because of the softmax: the activation is the current value divided by the sum over all of them. Otherwise the model could simply increase this one value and ignore all the others; but because of the normalization, if you increase one value, all the others go down.

Then we can do with neural networks one very nice and easy type of training that is used everywhere: we calculate our error, and especially the gradient, so in which direction the error grows. Since we want a smaller error, we take the opposite direction of the gradient and thereby try to minimize the error. We have to do that for all the weights. To calculate the gradient for all the weights we won't do the full derivation here, but what you can do is propagate the error, which is measured at the end, back through the network; it is basic maths, basic derivatives. For each weight in your model you measure how much it contributed to the error, and then you change it in a way that reduces the error.

So to summarize what you should remember, at least for machine translation, in order to understand the rest: this is how a multi-layer perceptron looks. There are fully connected layers and no connections that skip layers, and what each layer does is always just a weighted sum and then an activation function. And in order to train it, you have this forward and backward pass: we put in the inputs, we start with some random weights, we calculate the output, we measure how big our error is, we propagate the error back, and then we change our model in a way that we hopefully get a smaller error. A small sketch of one such training step is shown below.
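Here is a sketch of one forward/backward training step as just described, written with PyTorch; the two-layer network, the sizes and the random data are placeholders, not the lecture's exact setup.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 5))  # toy two-layer network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                      # cross-entropy as the error measure

x = torch.randn(8, 10)                               # a batch of 8 random inputs
target = torch.randint(0, 5, (8,))                   # target class index for each input

logits = model(x)                                    # forward pass with the current (random) weights
loss = loss_fn(logits, target)                       # measure the error
optimizer.zero_grad()
loss.backward()                                      # propagate the error back (compute gradients)
optimizer.step()                                     # move the weights against the gradient
```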
So, before we come to full neural machine translation: how can we use this type of neural network for language modeling in natural language processing, and especially in machine translation? The first idea was to use it to estimate probabilities. We have seen that the output can be modeled as a probability distribution, and if we have the full vocabulary there, we can estimate how probable each next word is, and then use that in the language model fashion we saw last time: we get the probability of a full sentence as the product of the individual word probabilities.

That was done quite early on, and it is very easy to integrate into the log-linear model. We said that this is how the log-linear model looks: we are searching for the best translation, the one that maximizes the weighted sum of the feature scores. We tune the weights with minimum error rate training, if you remember, where we search for the optimal weights. We have the language model and many other features, and we can just add a neural model as one more feature. So that is quite easy, and that is how statistical machine translation was first improved with neural models: you just add one more feature.

So how can we do language modeling with a neural network? What we have to model is the probability of the next word given the previous words. The general problem is that we mostly have not seen long sequences, so mostly we have to back off to very short sequences, and we are working in this discrete space where similarity between histories cannot be used. The idea is that with a neural network we can map words into a continuous representation.

And the structure then looks like this; it is still a basic feed-forward neural network. We are making the same approximation again: we are not putting in all previous words, but only a fixed number of them. This is done because, as we said, the network can only take a fixed-size input, so we take exactly n minus one previous words. So here you have, for example, three words, three different words, each encoded as a one-hot vector: one entry is one and all the others are zero. And then we have the first layer of the neural network, which, as you will see, learns the word embedding.

There is one thing which is maybe special compared to a standard feed-forward network. We want the representation of a word to be, first of all, position independent: we just want to learn the general meaning of the word, independent of its neighbors. So the representation a word gets in the first position should be the same as in the second position. The nice way to achieve that is that the weights you use here, you reuse here, and reuse here, so we force them to be the same. You then learn a word embedding which is context independent, the same for each position. A compact sketch of this architecture is shown below.
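A compact sketch of the feed-forward n-gram language model described here, with one embedding matrix shared across all n-1 input positions; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200, context=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)          # one shared embedding, same for every position
        self.hidden = nn.Linear(context * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)          # one score per word in the vocabulary

    def forward(self, prev_words):                            # prev_words: (batch, context) word indices
        e = self.emb(prev_words)                              # (batch, context, emb_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))             # concatenate the context embeddings
        return self.out(h)                                    # the softmax is applied inside the loss

model = FeedForwardLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (4, 3)))               # 4 histories of 3 previous words each
print(logits.shape)                                           # torch.Size([4, 10000])
```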
So, again, the idea is that you want to learn the representation of the word itself first, and at this point you do not want to use the context. The word might of course have a different meaning depending on where it stands, but we will handle that later in the network. So first we learn a representation per word.

Normally we said that all input neurons are connected to all hidden neurons, but here we reduce the complexity by saying that these input neurons are only connected to their own embedding block. Then we have a much denser representation, our three word embeddings, and now we learn the interaction between the words, an interaction which is not based on the surface forms but on these embeddings. So we have at least one fully connected layer which takes the three embeddings as input and learns a new hidden representation which now represents the full history.

And then there is the output layer, which again gives the probability distribution over all the words; so here is your target prediction. The nice thing is that you learn everything together, so you do not have to tell the model separately what a good word representation is: you train the whole network jointly, and it learns whatever word representation is good for the final task.

Yes, that is the main idea. This is nowadays often referred to as one form of self-supervised learning: the output is the next word and the input is the previous words. It is not that we manually created labels; we artificially created a task out of unlabeled data. We just had pure text, and then we created the training examples by predicting the next word, as in the small sketch shown after this passage.

Question: say we have two sentences, like 'I go home' and 'I go to KIT', and then we have to predict the next word; what are the labels? Do we model this as one vector with probabilities for all possible next words? Answer: no, these are simply multiple examples. You would train twice, once to predict 'home' and once to predict the other continuation, and then the model itself learns to distribute the probability. That is a very good point: you do not aggregate the examples beforehand, you take each occurrence as a separate training example.

So when you do this, you simultaneously learn the projection layer, the embeddings, and the n-gram probabilities. And as was analyzed later, these representations are very powerful; predicting the next word is just a very general task. It is motivated by the distributional idea: in order to get the meaning of a word, you have to look at the company it keeps, the contexts in which it occurs.
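A small sketch of how the self-supervised training examples can be created from plain text: every position gives one (history, next word) pair, and a repeated history simply yields several separate examples. The sentences and the context length are made up.

```python
sentences = [["i", "go", "home"], ["i", "go", "to", "kit"]]

context = 2                               # use the two previous words
examples = []
for sent in sentences:
    padded = ["<s>"] * context + sent + ["</s>"]
    for i in range(context, len(padded)):
        history = tuple(padded[i - context:i])
        examples.append((history, padded[i]))   # label = the next word

for h, y in examples:
    print(h, "->", y)
# ('i', 'go') appears twice, once with label 'home' and once with label 'to'
```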
If you read a text and there is a word you have never seen, you can often still estimate the meaning of this word, because you see how it is used, for example that it is typically used as a city name. Just imagine you read a text about some city: even if you have never heard of the city before, you often know from the context how it is used.

So what is now the big advantage of using neural networks? Just imagine we have to estimate the probability of the sentence 'I bought my first iPhone', so the probability of each word given its history, and imagine 'iPhone' is a word you have never seen. With all the techniques from last time, if you have not seen 'iPhone', you will always fall back to the shortest context; you have no idea how to deal with it, because you will not have seen the bigram, the trigram, and all the longer n-grams containing it.

If you have this type of neural model, what does it do if you have 'my first' and then something unknown? Maybe the representation at that position is somewhat messed up, because it is an out-of-vocabulary word; however, you still have the information that two words before was 'first', and so on. So you still have a lot of information in order to estimate how probable the next word is, and there could be even more information available if the model has learned how such contexts behave. All this type of modeling we could not do before, because we always needed to have seen exactly the same words.

Good point from the audience: typically you would have one token for out-of-vocabulary words, so you could for example map every unknown word to that token. Or you do byte-pair encoding, where you have a fixed inventory of subwords. Yes, you have to do something like that. With byte-pair encoding you do not have unknown words anymore; but then, of course, the previous context might get very long, because your sequence length grows a lot when rare words are split into many pieces.

Any more questions on the basic setup?

For this model, what we want to continue with is looking a bit into how complex things are and how we can make them more efficient. Because at the beginning this was definitely a major challenge, and it is still not trivial; there is also the whole discussion about the environmental footprint of these models and so on. This computation is really quite heavy, and if you build systems yourselves, you will have to wait for them to train. So it is good to know a bit about how complex things are, in order to build a good and efficient system.

One place where most of the calculation happens, if you do it in a naive way, is the following. In general, for all these layers, we talk about neural networks and it sounds fancy, but
in the end it is a matrix multiplication. What you have to do in order to calculate these activations, to simplify a bit, is: you have the outputs, and you just do a matrix multiplication between your weight matrix and your input vector. That is why GPUs are so powerful for neural networks: they are very good at doing exactly this.

However, for the embedding layer this is really very inefficient. Because, remember, we have this one-hot encoding at the input: it is always a single one and everything else is zero. If we do the full multiplication, almost all of the work is wasted. Therefore you can make at least the forward pass a lot more efficient if you do not really do this calculation, but simply select the one column where the one is. And that is also why this is called the word embedding: the weight matrix of the embedding layer is exactly that; in each column you have the embedding of one word. So this is how your initial weights look and how you can interpret them.

This is already relatively important, because remember that this is a huge matrix: typically the number of words is ten thousand or more, so the word embedding matrix is typically the largest matrix in the model. It has vocabulary-size many columns, while the hidden layers maybe have a few hundred units, so it pays off to treat it specially.

Then you can look at where else the calculations are expensive. So here we have our network: we have the word embeddings, we have one hidden layer, and then you can look at how expensive each part is. We already saved a lot of calculation by not really computing the embedding multiplication but doing a selection instead. The number of calculations you have to do in the hidden layer is the length of its input, which is n minus one times the projection size, times the hidden size; that is the cost of the first matrix multiplication. Then you apply an activation function, and then you have to do the output calculation, and there we need the vocabulary size, because we calculate a probability for each possible next word, so the cost is hidden size times vocabulary size.

And if you look at these numbers, say a projection size of a few hundred and a vocabulary size of tens of thousands, you see that the output layer completely dominates. That is why, especially at the beginning, there were several ideas on how to reduce this cost: do we really need to calculate all the probabilities, or can we calculate only some of them? A rough count with example numbers is sketched below.
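To make the cost comparison concrete, here is a rough multiply count for the three parts of the network with some example sizes; the numbers are purely illustrative assumptions.

```python
V = 10000      # vocabulary size (assumed)
d = 100        # embedding / projection size per word (assumed)
n = 4          # n-gram order, i.e. 3 previous words
h = 200        # hidden layer size (assumed)

emb_cost    = (n - 1) * 1          # embedding layer: just (n-1) table look-ups, no multiplication
hidden_cost = (n - 1) * d * h      # first fully connected layer
output_cost = h * V                # output layer: one score per vocabulary word

print(emb_cost, hidden_cost, output_cost)   # 3, 60000, 2000000 -> the output layer dominates
```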
And there, one important thing to think about is: what will I use my language model for? I can use it for generation, and that is what we will see next week with the neural decoder, where the model really guides the search. Or I use it just as a feature: then we do not want to generate with it, we only want to know how probable a given word or sentence is. In that case we might not be interested in all the probabilities; we already know the word and only want the probability of this one word, and then it can be very inefficient to compute the full distribution.

And how can you exploit that? Initially, for example, people looked into shortlists. The calculation at the end is really the expensive part, so can we make it cheaper? Most words occur very rarely, and maybe we do not need the neural estimate for them, so we can focus on the frequent words. So they used a smaller output vocabulary, the shortlist, which is maybe the two thousand most frequent words, and this output layer is only over those; then you merge with a traditional n-gram model. So if the word is in the shortlist, the two thousand most frequent words, you take the probability of the neural model, with some normalization, and otherwise you take the back-off probability from the n-gram model. It will not be quite as good, but the idea is: then we do not have to calculate all these probabilities at the end, we only have to calculate a few thousand of them.

This comes at some cost, because it means we do not model the probability of the infrequent words with the neural model, and maybe it is exactly those where good modeling is most important.

Another idea is what was reported as the structured output layer neural network language model; some years ago people were very creative in giving names to new models. The idea there is that we model the output vocabulary as a clustered tree. So you do not model all words directly, but you assign each word to a sequence of clusters: maybe a very infrequent word is first in cluster three, within that in subcluster seven, and so on, and this path identifies the word. And then you can calculate the probability of the word simply as the product of the probabilities along this path, starting with the probability of the first cluster.

Maybe it is clearer with the architecture: this is all the same as before, but then you first predict which main class the word is in, then you go to the appropriate subclass, calculate the probability of the subclass, and finally of the word within it; a small sketch of this factorization follows below.
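A tiny sketch of the class-factored idea with a single clustering level: the word probability is the class probability times the probability of the word within its class. The vocabulary, the clustering and the scores are toy assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

word_to_class = {"the": 0, "a": 0, "house": 1, "iphone": 1}     # toy clustering of the vocabulary
class_words = {0: ["the", "a"], 1: ["house", "iphone"]}

# pretend the network produced these scores from the current history:
class_scores = np.array([1.2, -0.3])                               # one score per class
word_scores = {0: np.array([0.5, 0.1]), 1: np.array([-0.2, 0.9])}  # scores within each class

def word_prob(word):
    c = word_to_class[word]
    p_class = softmax(class_scores)[c]                                    # P(class | history)
    p_in_class = softmax(word_scores[c])[class_words[c].index(word)]      # P(word | class, history)
    return p_class * p_in_class                                           # product along the path

print(word_prob("iphone"))
```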
Does anybody have an idea why this is more efficient? Because at first sight it looks like more work.

Exactly: you have to do fewer calculations. You compute, say, the ten class probabilities, but you do not have to compute all ten thousand word probabilities; you only compute the probabilities within the subclasses you actually go through, and not for all of them. Therefore it is only more efficient if you do not need all output probabilities, because then you only have to calculate the classes on one path. So it is only more efficient in scenarios where you really use the language model to evaluate a given word, not to generate.

How this was done in practice: you can first train a neural language model on the shortlist; on the input layer you keep the full vocabulary, because we saw that at the input this is not expensive. And then you can cluster all your words using these input embeddings and use that clustering as your classes.

That is one idea. There is also a second idea, and again it applies when we do not really need a normalized probability: sometimes the output does not really need to be a probability, it is only important that the score is comparable. Here it is called self-normalization. We have seen that the probability in the softmax is always e to the input divided by the normalization, and the normalization is the sum over the vocabulary of e to the scores. So the log probability is the score minus the log of this normalization term.

The idea of self-normalization is: if this log normalization term were zero, then we would not need to calculate it at all; the term would vanish, and we would not even have to compute the sum. How can we achieve that? That is the nice thing about neural networks: we just add a second loss term, weighted with some hyperparameter, and this second loss simply says that the model should be trained such that the log of the normalizer is close to zero. Then, if it is nearly zero in the end, we do not need to calculate it, and that is very efficient.

One important thing: this of course only helps at inference time; during training we still have to compute the full normalization. And you have a bit of a hyperparameter trade-off with the weighting: how good should the model be at estimating the probabilities, and how much effort goes into keeping the normalizer close to one? The only disadvantage is that there is no speed-up during training; there are other methods, for example sampling-based training objectives, which also speed up training. A sketch of the self-normalized loss is shown below.
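A sketch of the self-normalization objective: the usual cross-entropy plus a weighted penalty that pushes log Z towards zero, so that at inference time the unnormalized score can be used directly. The weight alpha and the scores are assumptions.

```python
import numpy as np

def self_normalized_loss(scores, target_index, alpha=0.1):
    # scores: raw network outputs for every word, given the history
    log_z = np.log(np.exp(scores - scores.max()).sum()) + scores.max()  # log of the softmax normalizer
    log_p = scores[target_index] - log_z                                # normal log-probability
    return -log_p + alpha * log_z ** 2                                  # extra term trains log_z towards 0

scores = np.array([2.0, 0.5, -1.0, 0.1])
print(self_normalized_loss(scores, target_index=0))

# at test time, if log_z is close to 0, scores[target_index] alone approximates log P(word | history)
```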
Then, very briefly, just one more idea, because there are more types of neural language models: we take a very short look at restricted-Boltzmann-machine-based language models, before we talk about recurrent neural networks for language modeling, which have the advantage that we can improve things further by not having a fixed context while still using continuous representations.

So there are different types of neural networks; these are the Boltzmann machines, and the interesting thing is that they define an energy function on the network, which can be calculated efficiently in restricted Boltzmann machines. Restricted means that you only have connections between the input and the hidden layer, but no connections within the input layer or within the hidden layer. You see that there is no separate input and output here; you just have an input, and you calculate how well it fits the model. Which of course nicely matches the idea we have, so you can use this as an n-gram language model, while keeping the flexibility of this type of network at the input.

And the advantage of this type of model was that it is very, very fast, so this was one of the first neural language models that was used directly during decoding. The feed-forward n-gram language models were very good and improved performance; however, their calculation, even with all these tricks, takes too long to use inside the search. We have talked about n-best lists: so people generated an n-best list of the most probable outputs, scored each entry in this n-best list with the neural network language model, and then only changed the order, re-ranked, and selected the best hypothesis based on that. The n-best list has maybe only a hundred entries, whereas during decoding you look at several thousand hypotheses. A small sketch of this rescoring step follows below.
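A sketch of the n-best rescoring idea: each hypothesis already has a score from the statistical system, and we add the neural language model log-probability with some weight and re-sort. The hypotheses, the scores and the weight are made up, and `neural_lm_logprob` is only a stand-in for whatever trained model is used.

```python
def neural_lm_logprob(sentence):
    # placeholder: in practice this sums log P(w_i | w_1..w_{i-1}) from the neural language model
    return -0.5 * len(sentence.split())

nbest = [                      # (hypothesis, score from the statistical MT system)
    ("he goes home today", -10.2),
    ("he go home today", -9.8),
    ("home he goes today", -11.5),
]

weight = 2.0                   # feature weight, in SMT tuned e.g. with minimum error rate training
rescored = [(hyp, score + weight * neural_lm_logprob(hyp)) for hyp, score in nbest]
rescored.sort(key=lambda x: x[1], reverse=True)
print(rescored[0][0])          # best hypothesis after adding the neural LM feature
```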
Now let's look at the context. We have seen neural language models, and the big advantage is that we can use this word similarity and the continuous representations. Remember, for n-gram language models it is not always really n minus one words of context, because sometimes you have to back off or interpolate with lower-order n-grams, and then you effectively ignore some of the previous words. In neural models, however, we always use all n minus one previous words.

The disadvantage is that you are still limited in your context, and if you remember the example sentence from the last lecture, sometimes you need more context; in principle there is unlimited context that you might need, and you can always construct sentences where you need more than the last few words to make a good estimate.

We can also look at it from a different angle, to understand why it makes sense to view language modeling this way. Sequence labeling tasks are a very common type of task in natural language processing, where you have an input sequence and you produce one output for each input. Machine translation is not a sequence labeling task, because the number of inputs and the number of outputs is different: you put in a German sentence with five words and the output can have a different length. In sequence labeling you always have the same number of inputs and outputs.

And you can model language modeling exactly like that: you just say the label for each word is the next word. More generally, you can think of part-of-speech tagging or named entity recognition as sequence labeling. If you look at how an output token may depend on the inputs, then in general sequence labeling it can depend on all input tokens. But we can look at one specific type, unidirectional sequence labeling: the probability of the next word only depends on the previous words. That is also not completely true in language; the right context might also be helpful, and there are bidirectional models, but for language modeling we always model the probability of a word given its history.

The approximation we currently make in this sequence labeling view is the windowing approach: in order to predict a word we always look at, say, the previous three words. If you are into neural networks, you may recognize this type of structure from other typical architectures. And yes, like n-gram models, this can only capture a fixed amount of context.

But there are other types of neural architectures for sequence labeling which might help us here, where we do not have this fixed-size restriction. The idea of recurrent neural networks is that we store the complete history in one hidden state. Again we have to use a fixed-size representation, because a neural network always needs fixed-size inputs, but now this representation summarizes the whole history. The network looks like this: we start with an initial value for our storage, we take the first input and calculate the new hidden state. So it is again a network, but with two types of inputs: the previous hidden state and the current word. Then you apply it to the next input, again taking the hidden state and the new word. The nice thing is that you can now do this step by step, all the way along the sequence, and we get context information from all the previous words. A minimal sketch of this recurrent update is shown below.
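A minimal sketch of the recurrent update: the same two weight matrices are applied at every position, and the hidden state carries the information about everything read so far. The sizes and the tanh activation are illustrative.

```python
import numpy as np

d_in, d_hid = 4, 5
rng = np.random.default_rng(0)
W_x = rng.normal(size=(d_hid, d_in))     # current input -> hidden state
W_h = rng.normal(size=(d_hid, d_hid))    # previous hidden state -> hidden state

def rnn_step(h_prev, x):
    # the new state depends on the current word and on the full history summarized in h_prev
    return np.tanh(W_h @ h_prev + W_x @ x)

h = np.zeros(d_hid)                      # initial value of the "storage"
for x in rng.normal(size=(6, d_in)):     # six input word vectors, processed left to right
    h = rnn_step(h, x)
print(h)                                 # this state would feed the softmax over the next word
```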
So if you ask on which words the probability of the next word is based: it depends on the current input, and it depends on the hidden state, and this hidden state was influenced by everything that came before. So now we have something new: we can model the word probability not only based on a fixed window, because the hidden states we have here are influenced by all the previous words.

Question: is this still the P of the word given its history that we had at the beginning? Yes, exactly; but now we really model the probability given the full history, so we do not need to cut the history down to n minus one words anymore, and you also see how things are put together in order to do that. The green box, the hidden state, is built up step by step as we go from left to right.

Question: but isn't there still some kind of clustering happening? Yes, that is right; some clustering still happens here, and the small difference does matter. If you have a lot of different histories, and two of these histories get very similar representations in here, then these representations will essentially be the same, and you treat them the same, because the final prediction is only based on the green box. So you are still learning some type of clustering of histories, but you are learning it implicitly; the only restriction you impose is that everything that is important has to be stored in this fixed-size state.

So it is a different type of limitation: before, you calculated the probability based on exactly the last n minus one words; now you still need to compress things together somehow in order to work efficiently, but this compression happens inside the hidden representation, and the probability of the next word depends only on this hidden representation. There is no hard restriction on how far back the information can come from, but there is this other bottleneck that everything has to fit into one vector in order to make a good estimation.

So the idea is that we can store our whole history in one vector. That is what makes the model strong, and it also leads us to the problems: at some point it gets difficult, if you have very long sequences and you always have to write all the information into this one block, then maybe things get overwritten or you simply cannot store everything in there. So for short inputs like single sentences this works well, but if you think of other tasks like summarization or document-level machine translation, where you need to consider a full document, things get more complicated, and we will later learn another type of architecture for that.

In order to understand these networks, it is good to always keep both views in mind.
So this is the unrolled view: over time, or in language over the words, you unroll the network. Here you have one copy per position, and here is the compact view of the network which is connected to itself, and that is why it is called recurrent.

There is one challenge with these networks, and that is training. At first it is not obvious how to train them, but if you unroll them like this, it is a feed-forward network; it is exactly the same, so you can measure your errors here at the outputs and backpropagate them. If you unroll it, it is a feed-forward network, and you can train it in the same way. The only thing to keep in mind is that, of course, for different inputs the unrolled network has a different length; but since the parameters are shared across positions, you can still train it, and the training algorithm is very similar.

One thing that makes things difficult is what is referred to as the vanishing gradient. That is a very important point, and it is a strong motivation for the architectures that came later. The influence of inputs that are far away gets smaller and smaller, and the models are not really able to capture these long-range dependencies: because the gradient gets smaller and smaller, the error signal that is propagated back to an early input which contributed to the mistake is tiny, and therefore you hardly change anything there anymore. That is why plain recurrent networks are difficult to train and are rarely used in their basic form.

So when people talk about RNNs nowadays, what they typically mean are LSTMs, long short-term memories. You see, by now they are already quite old. The idea is to be smarter about storing information: if you only look at the last few words, it is often no longer clear, for example, whether the sentence is a question or a normal statement. So there you have these mechanisms with gates in order to store things for a longer time in your hidden state.

They were used in natural language processing in quite a lot of work. For machine translation the standard nowadays is transformer-based models, which we will learn about later. But, for example, we will have one lecture about efficiency, on how to build very efficient models, and there, in the decoder, in parts of the network, they are still used. So it is not the case that recurrent networks are of no importance anymore. Before we continue, here is a tiny numerical illustration of the vanishing-gradient problem mentioned above.
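If each step scales the backpropagated signal by a factor smaller than one, the contribution of far-away positions becomes negligible; the factor 0.5 below is an arbitrary assumption, chosen only to show the effect.

```python
grad = 1.0
factor = 0.5                 # assumed per-step scaling of the gradient during backpropagation
for steps in [1, 5, 10, 20, 50]:
    print(steps, grad * factor ** steps)
# after 20 steps the signal is roughly 1e-6, after 50 steps roughly 1e-15:
# weights that far back receive essentially no update
```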
In order to make these networks strong, there are a few more things which are helpful and should be mentioned. One thing is that there is a nice trick to make them stronger and better. Of course it does not always work; you have to have enough training data. But in general, the easiest way of making your model bigger and stronger is simply to increase the number of parameters; you have seen that with the large language models, where everyone is bragging about parameter counts. So the question is how you get more parameters, and there are two ways of doing it: you can make the layers wider, and you can make your network deeper, so have more layers in between, and then you can also model more complex dependencies.

There is one problem with this, and it is very similar to what we just saw with recurrent networks: we again have this problem of gradient flow, that the gradient gets very small as it passes through many layers. Exactly the same thing happens in deep LSTMs: if you take the gradient at the top, which tells you what was right or wrong, then with three layers it is no problem, but if you go to ten, twenty or a hundred layers, it vanishes.

What people are typically doing is using what are called residual connections. That is a very helpful idea, and it is maybe surprising that it works so well. The idea is that the layers in between should no longer compute a completely new representation; instead they compute what to change about the representation. Therefore, in the end, the output of a layer is added to its input. The nice thing is that when you do backpropagation, the gradient can flow very quickly directly through this addition. Nowadays essentially every very deep architecture, not only recurrent ones, has these residual or highway connections. So it has two advantages: on the one hand these layers do not need to learn a full representation, they only need to learn how to change the representation, and on the other hand the gradients flow much better. A minimal sketch of such a residual block follows below.
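A minimal sketch of a residual connection: the layer output is added to its input, so the layer only has to learn the change, and gradients can flow directly through the addition. The inner layer here is just a placeholder transformation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # any transformation with matching size

    def forward(self, x):
        return x + self.layer(x)        # output = input + "what to change"

x = torch.randn(2, 16)
deep = nn.Sequential(*[ResidualBlock(16) for _ in range(20)])       # stacking many layers stays trainable
print(deep(x).shape)
```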
1:16:22.082 --> 1:16:24.172 Good.
1:16:23.843 --> 1:16:31.768 So much for the neural network basics; the last topic for today is this.
1:16:31.671 --> 1:16:53.558 Language models were used inside the translation models, and now we are seeing them again, but one thing which at the beginning was very essential about them was the word embeddings.
1:16:53.558 --> 1:16:59.999 So people really trained these language models partly only to get this type of embedding.
1:16:59.999 --> 1:17:04.193 Therefore, we want to look at them a bit more.
1:17:09.229 --> 1:17:15.678 So now some last words on the word embeddings.
1:17:15.678 --> 1:17:27.204 The interesting thing is that word embeddings can be used for very different tasks.
1:17:27.347 --> 1:17:31.329 The nice thing is that you can train them on just large amounts of data.
1:17:31.931 --> 1:17:41.569 And then, if you have these word embeddings, we have seen that they reduce the parameters.
1:17:41.982 --> 1:17:52.217 So then you can train a smaller model to do any other task, and therefore you are more efficient.
1:17:52.532 --> 1:17:55.218 For these initial word embeddings, one thing is important:
1:17:55.218 --> 1:18:08.709 they really depend only on the word itself. So if you look at the two meanings of "can", the can of beans or "I can do that", they will have the same embedding, so somehow the embedding has to keep this ambiguity inside.
1:18:09.189 --> 1:18:12.486 That cannot be resolved here.
1:18:12.486 --> 1:18:27.919 The ambiguity can only be resolved at the higher levels, where the context is taken into account, while the word embedding layer really depends only on the word itself.
1:18:29.489 --> 1:18:33.757 However, even this layer has quite interesting properties.
1:18:34.034 --> 1:18:47.208 People like to visualize the embeddings, which is always difficult, because if you look at such a word vector,
1:18:47.767 --> 1:18:52.879 drawing a five-hundred-dimensional vector is still a bit challenging.
1:18:53.113 --> 1:19:12.472 So you cannot do that directly; people have to apply some type of dimensionality reduction.
1:19:13.073 --> 1:19:17.209 And of course some information gets lost then, but you can still try it.
1:19:18.238 --> 1:19:37.854 And you see, for example, this is the most famous and common example: you can look at the difference between the male and the female word in English. This is the embedding of king, this is the embedding of queen, and this is the difference between them.
1:19:38.058 --> 1:19:40.394 You can do that for very different words.
1:19:40.780 --> 1:19:45.407 And that is where the math comes in; that is what people then looked into.
1:19:45.725 --> 1:19:51.410 So what you can now do, for example, is calculate the difference between man and woman.
1:19:52.232 --> 1:19:55.511 Then you can take the embedding of king.
1:19:55.511 --> 1:20:04.364 You can add to it the difference between man and woman, and then you can look at which words are similar to the result.
1:20:04.364 --> 1:20:10.512 Of course you will not directly hit the correct word; it is a continuous space.
1:20:10.790 --> 1:20:24.056 But you can look at what the nearest neighbors of this point are, and often the expected word, in this case queen, is among them.
1:20:24.224 --> 1:20:33.913 So the model somehow learns that the difference between these word pairs is always the same.
1:20:34.374 --> 1:20:37.746 You can do that for different relations.
1:20:37.746 --> 1:20:49.017 You can also see that it is not perfect; for example, the relation between swim and swimming tends to be the same as between walk and walking, but not exactly.
1:20:49.469 --> 1:20:51.639 So you can try to use them.
1:20:51.639 --> 1:20:59.001 It is not perfect, but the interesting thing is that this is completely unsupervised.
1:20:59.001 --> 1:21:03.961 Nobody taught the model the concept of gender in language.
1:21:04.284 --> 1:21:09.910 It is purely trained on the task of next word prediction.
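The embedding arithmetic can be sketched in a few lines of Python. This is an assumed toy setup, not the lecture's data: the vectors here are random placeholders, and in practice they would come from a trained language model, but the mechanics of "king - man + woman" followed by a nearest-neighbour lookup are as described above.

# Sketch of embedding analogy arithmetic with placeholder vectors.
import numpy as np

rng = np.random.default_rng(2)
words = ["king", "queen", "man", "woman", "walk", "walking"]
# placeholders only; real embeddings would come from a trained model
emb = {w: rng.normal(size=100) for w in words}

def nearest(vec, k=3):
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(words, key=lambda w: -cosine(vec, emb[w]))[:k]

# with well-trained embeddings this query tends to land near "queen"
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query))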
1:21:10.230 --> 1:21:23.638 And it works even for really semantic information like the capital relation: this is the difference between a country and its capital city.
1:21:23.823 --> 1:21:33.766 Here is another visualization, where the same has been done for the difference between country and capital.
1:21:33.853 --> 1:21:43.347 You see it is not perfect, but it is pointing in the right direction, so you can even use them,
1:21:43.347 --> 1:21:53.383 for example, for question answering: if you know this difference, you can apply it to a new country to find its capital.
1:21:54.834 --> 1:22:04.396 So it seems these models are able to really learn a lot of information and compress all of it into this representation,
1:22:05.325 --> 1:22:26.025 just by doing next word prediction. And that also explains, or at least motivates, what the main advantage of this type of neural model is: we can take these hidden representations, transfer them, and use them for different tasks.
1:22:28.568 --> 1:22:45.893 So, to summarize what we did today, what you should hopefully take with you is why language models are important for machine translation.
1:22:45.805 --> 1:22:49.149 Then, how we can do language modeling with neural networks.
1:22:49.449 --> 1:22:59.063 We looked at three different architectures: the feed-forward language model, the recurrent one, and the one based on restricted Boltzmann machines.
1:22:59.039 --> 1:23:14.404 And finally, there are different architectures for neural networks in general: we have seen feed-forward and recurrent networks, and we will see the last type of architecture in the next lectures.
1:23:15.915 --> 1:23:17.412 Do you have any questions?
1:23:20.680 --> 1:23:27.341 Then thanks a lot, and we will see each other again next Tuesday.
0:02:55.415 --> 0:03:10.397 And then we learned a lot of techniques and that is the main challenge in statistical language. 0:03:10.397 --> 0:03:15.391 How we can give somehow a good. 0:03:15.435 --> 0:03:23.835 And they developed very specific, very good techniques to deal with that. 0:03:23.835 --> 0:03:26.900 However, this is the best. 0:03:28.568 --> 0:03:33.907 And therefore we can do things different. 0:03:33.907 --> 0:03:44.331 If we have not seen an N gram before in statistical models, we have to have seen. 0:03:45.225 --> 0:03:51.361 Before, and we can only get information from exactly the same word. 0:03:51.411 --> 0:03:57.567 We don't have an approximate matching like that. 0:03:57.567 --> 0:04:10.255 Maybe it stood together in some way or similar, and in a sentence we might generalize the knowledge. 0:04:11.191 --> 0:04:21.227 Would like to have more something like that where engrams are represented more in a general 0:04:21.227 --> 0:04:21.990 space. 0:04:22.262 --> 0:04:29.877 So if you learn something about eyewalk then maybe we can use this knowledge and also. 0:04:30.290 --> 0:04:43.034 And thereby no longer treat all or at least a lot of the ingrams as we've done before. 0:04:43.034 --> 0:04:45.231 We can really. 0:04:47.047 --> 0:04:56.157 And we maybe want to even do that in a more hierarchical approach, but we know okay some 0:04:56.157 --> 0:05:05.268 words are similar like go and walk is somehow similar and and therefore like maybe if we 0:05:05.268 --> 0:05:07.009 then merge them. 0:05:07.387 --> 0:05:16.104 If we learn something about work, then it should tell us also something about Hugo or 0:05:16.104 --> 0:05:17.118 he walks. 0:05:17.197 --> 0:05:18.970 We see already. 0:05:18.970 --> 0:05:22.295 It's, of course, not so easy. 0:05:22.295 --> 0:05:31.828 We see that there is some relations which we need to integrate, for example, for you. 0:05:31.828 --> 0:05:35.486 We need to add the S, but maybe. 0:05:37.137 --> 0:05:42.984 And luckily there is one really yeah, convincing methods in doing that. 0:05:42.963 --> 0:05:47.239 And that is by using an evil neck or. 0:05:47.387 --> 0:05:57.618 That's what we will introduce today so we can use this type of neural networks to try 0:05:57.618 --> 0:06:04.042 to learn this similarity and to learn how some words. 0:06:04.324 --> 0:06:13.711 And that is one of the main advantages that we have by switching from the standard statistical 0:06:13.711 --> 0:06:15.193 models to the. 0:06:15.115 --> 0:06:22.840 To learn similarities between words and generalized and learn what we call hidden representations. 0:06:22.840 --> 0:06:29.707 So somehow representations of words where we can measure similarity in some dimensions. 0:06:30.290 --> 0:06:42.275 So in representations where as a tubically continuous vector or a vector of a fixed size. 0:06:42.822 --> 0:06:52.002 We had it before and we've seen that the only thing we did is we don't want to do. 0:06:52.192 --> 0:06:59.648 But these indices don't have any meaning, so it wasn't that word five is more similar 0:06:59.648 --> 0:07:02.248 to words twenty than to word. 0:07:02.582 --> 0:07:09.059 So we couldn't learn anything about words in the statistical model. 0:07:09.059 --> 0:07:12.107 That's a big challenge because. 0:07:12.192 --> 0:07:24.232 If you think about words even in morphology, so go and go is more similar because the person. 
0:07:24.264 --> 0:07:36.265 While the basic models we have up to now, they have no idea about that and goes as similar 0:07:36.265 --> 0:07:37.188 to go. 0:07:39.919 --> 0:07:53.102 So what we want to do today, in order to go to this, we will have a short introduction. 0:07:53.954 --> 0:08:06.667 It very short just to see how we use them here, but that's the good thing that are important 0:08:06.667 --> 0:08:08.445 for dealing. 0:08:08.928 --> 0:08:14.083 And then we'll first look into feet forward, new network language models. 0:08:14.454 --> 0:08:21.221 And there we will still have this approximation we had before, then we are looking only at 0:08:21.221 --> 0:08:22.336 fixed windows. 0:08:22.336 --> 0:08:28.805 So if you remember we have this classroom of language models, and to determine what is 0:08:28.805 --> 0:08:33.788 the probability of a word, we only look at the past and minus one. 0:08:34.154 --> 0:08:36.878 This is the theory of the case. 0:08:36.878 --> 0:08:43.348 However, we have the ability and that's why they're really better in order. 0:08:44.024 --> 0:08:51.953 And then at the end we'll look at current network language models where we then have 0:08:51.953 --> 0:08:53.166 a different. 0:08:53.093 --> 0:09:01.922 And thereby it is no longer the case that we need to have a fixed history, but in theory 0:09:01.922 --> 0:09:04.303 we can model arbitrary. 0:09:04.304 --> 0:09:06.854 And we can log this phenomenon. 0:09:06.854 --> 0:09:12.672 We talked about a Tuesday where it's not clear what type of information. 0:09:16.396 --> 0:09:24.982 So yeah, generally new networks are normally learned to improve and perform some tasks. 0:09:25.325 --> 0:09:38.934 We have this structure and we are learning them from samples so that is similar to what 0:09:38.934 --> 0:09:42.336 we had before so now. 0:09:42.642 --> 0:09:49.361 And is somehow originally motivated by the human brain. 0:09:49.361 --> 0:10:00.640 However, when you now need to know artificial neural networks, it's hard to get a similarity. 0:10:00.540 --> 0:10:02.884 There seems to be not that important. 0:10:03.123 --> 0:10:11.013 So what they are mainly doing is doing summoning multiplication and then one linear activation. 0:10:12.692 --> 0:10:16.078 So so the basic units are these type of. 0:10:17.937 --> 0:10:29.837 Perceptron is a basic block which we have and this does exactly the processing. 0:10:29.837 --> 0:10:36.084 We have a fixed number of input features. 0:10:36.096 --> 0:10:39.668 So we have here numbers six zero to x and as input. 0:10:40.060 --> 0:10:48.096 And this makes language processing difficult because we know that it's not the case. 0:10:48.096 --> 0:10:53.107 If we're dealing with language, it doesn't have any. 0:10:54.114 --> 0:10:57.609 So we have to model this somehow and understand how we model this. 0:10:58.198 --> 0:11:03.681 Then we have the weights, which are the parameters and the number of weights exactly the same. 0:11:04.164 --> 0:11:15.069 Of input features sometimes you have the spires in there that always and then it's not really. 0:11:15.195 --> 0:11:19.656 And what you then do is very simple. 0:11:19.656 --> 0:11:26.166 It's just like the weight it sounds, so you multiply. 0:11:26.606 --> 0:11:38.405 What is then additionally important is we have an activation function and it's important 0:11:38.405 --> 0:11:42.514 that this activation function. 0:11:43.243 --> 0:11:54.088 And later it will be important that this is differentiable because otherwise all the training. 
0:11:54.714 --> 0:12:01.471 This model by itself is not very powerful. 0:12:01.471 --> 0:12:10.427 We have the X Or problem and with this simple you can't. 0:12:10.710 --> 0:12:15.489 However, there is a very easy and nice extension. 0:12:15.489 --> 0:12:20.936 The multi layer perception and things get very powerful. 0:12:21.081 --> 0:12:32.953 The thing is you just connect a lot of these in these layers of structures where we have 0:12:32.953 --> 0:12:35.088 the inputs and. 0:12:35.395 --> 0:12:47.297 And then we can combine them, or to do them: The input layer is of course given by your 0:12:47.297 --> 0:12:51.880 problem with the dimension. 0:12:51.880 --> 0:13:00.063 The output layer is also given by your dimension. 0:13:01.621 --> 0:13:08.802 So let's start with the first question, now more language related, and that is how we represent. 0:13:09.149 --> 0:13:19.282 So we have seen here input to x, but the question is now okay. 0:13:19.282 --> 0:13:23.464 How can we put into this? 0:13:26.866 --> 0:13:34.123 The first thing that we're able to do is we're going to set it in the inspector. 0:13:34.314 --> 0:13:45.651 Yeah, and that is not that easy because the continuous vector will come to that. 0:13:45.651 --> 0:13:47.051 We can't. 0:13:47.051 --> 0:13:50.410 We don't want to do it. 0:13:50.630 --> 0:13:57.237 But if we need to input the word into the needle network, it has to be something easily 0:13:57.237 --> 0:13:57.912 defined. 0:13:59.079 --> 0:14:11.511 One is the typical thing, the one-hour encoded vector, so we have a vector where the dimension 0:14:11.511 --> 0:14:15.306 is the vocabulary, and then. 0:14:16.316 --> 0:14:25.938 So the first thing you are ready to see that means we are always dealing with fixed. 0:14:26.246 --> 0:14:34.961 So you cannot easily extend your vocabulary, but if you mean your vocabulary would increase 0:14:34.961 --> 0:14:37.992 the size of this input vector,. 0:14:39.980 --> 0:14:42.423 That's maybe also motivating. 0:14:42.423 --> 0:14:45.355 We'll talk about bike parade going. 0:14:45.355 --> 0:14:47.228 That's the nice thing. 0:14:48.048 --> 0:15:01.803 The big advantage of this one putt encoding is that we don't implement similarity between 0:15:01.803 --> 0:15:06.999 words, but we're really learning. 0:15:07.227 --> 0:15:11.219 So you need like to represent any words. 0:15:11.219 --> 0:15:15.893 You need a dimension of and dimensional vector. 0:15:16.236 --> 0:15:26.480 Imagine you could eat no binary encoding, so you could represent words as binary vectors. 0:15:26.806 --> 0:15:32.348 So you will be significantly more efficient. 0:15:32.348 --> 0:15:39.122 However, you have some more digits than other numbers. 0:15:39.559 --> 0:15:46.482 Would somehow be bad because you would force the one to do this and it's by hand not clear 0:15:46.482 --> 0:15:47.623 how to define. 0:15:48.108 --> 0:15:55.135 So therefore currently this is the most successful approach to just do this one patch. 0:15:55.095 --> 0:15:59.344 We take a fixed vocabulary. 0:15:59.344 --> 0:16:10.269 We map each word to the initial and then we represent a word like this. 0:16:10.269 --> 0:16:13.304 The representation. 0:16:14.514 --> 0:16:27.019 But this dimension here is a secondary size, and if you think ten thousand that's quite 0:16:27.019 --> 0:16:33.555 high, so we're always trying to be efficient. 0:16:33.853 --> 0:16:42.515 And we are doing the same type of efficiency because then we are having a very small one 0:16:42.515 --> 0:16:43.781 compared to. 
0:16:44.104 --> 0:16:53.332 It can be still a maybe or neurons, but this is significantly smaller, of course, as before. 0:16:53.713 --> 0:17:04.751 So you are learning there this word as you said, but you can learn it directly, and there 0:17:04.751 --> 0:17:07.449 we have similarities. 0:17:07.807 --> 0:17:14.772 But the nice thing is that this is then learned, and we do not need to like hand define. 0:17:17.117 --> 0:17:32.377 So yes, so that is how we're typically adding at least a single word into the language world. 0:17:32.377 --> 0:17:43.337 Then we can see: So we're seeing that you have the one hard representation always of 0:17:43.337 --> 0:17:44.857 the same similarity. 0:17:45.105 --> 0:18:00.803 Then we're having this continuous vector which is a lot smaller dimension and that's. 0:18:01.121 --> 0:18:06.984 What we are doing then is learning these representations so that they are best for language modeling. 0:18:07.487 --> 0:18:19.107 So the representations are implicitly because we're training on the language. 0:18:19.479 --> 0:18:30.115 And the nice thing was found out later is these representations are really, really good 0:18:30.115 --> 0:18:32.533 for a lot of other. 0:18:33.153 --> 0:18:39.729 And that is why they are now called word embedded space themselves, and used for other tasks. 0:18:40.360 --> 0:18:49.827 And they are somehow describing different things so they can describe and semantic similarities. 0:18:49.789 --> 0:18:58.281 We are looking at the very example of today that you can do in this vector space by adding 0:18:58.281 --> 0:19:00.613 some interesting things. 0:19:00.940 --> 0:19:11.174 And so they got really was a first big improvement when switching to neural staff. 0:19:11.491 --> 0:19:20.736 They are like part of the model still with more complex representation alert, but they 0:19:20.736 --> 0:19:21.267 are. 0:19:23.683 --> 0:19:34.975 Then we are having the output layer, and in the output layer we also have output structure 0:19:34.975 --> 0:19:36.960 and activation. 0:19:36.997 --> 0:19:44.784 That is the language we want to predict, which word should be the next. 0:19:44.784 --> 0:19:46.514 We always have. 0:19:47.247 --> 0:19:56.454 And that can be done very well with the softball softbacked layer, where again the dimension. 0:19:56.376 --> 0:20:03.971 Is the vocabulary, so this is a vocabulary size, and again the case neuro represents the 0:20:03.971 --> 0:20:09.775 case class, so in our case we have again a one-hour representation. 0:20:10.090 --> 0:20:18.929 Ours is a probability distribution and the end is a probability distribution of all works. 0:20:18.929 --> 0:20:28.044 The case entry tells us: So we need to have some of our probability distribution at our 0:20:28.044 --> 0:20:36.215 output, and in order to achieve that this activation function goes, it needs to be that all the 0:20:36.215 --> 0:20:36.981 outputs. 0:20:37.197 --> 0:20:47.993 And we can achieve that with a softmax activation we take each of the value and then. 0:20:48.288 --> 0:20:58.020 So by having this type of activation function we are really getting that at the end we always. 0:20:59.019 --> 0:21:12.340 The beginning was very challenging because again we have this inefficient representation 0:21:12.340 --> 0:21:15.184 of our vocabulary. 0:21:15.235 --> 0:21:27.500 And then you can imagine escalating over to something over a thousand is maybe a bit inefficient 0:21:27.500 --> 0:21:29.776 with cheap users. 
0:21:36.316 --> 0:21:43.664 And then yeah, for training the models, that is how we refine, so we have this architecture 0:21:43.664 --> 0:21:44.063 now. 0:21:44.264 --> 0:21:52.496 We need to minimize the arrow by taking the output. 0:21:52.496 --> 0:21:58.196 We are comparing it to our targets. 0:21:58.298 --> 0:22:07.670 So one important thing is, of course, how can we measure the error? 0:22:07.670 --> 0:22:12.770 So what if we're training the ideas? 0:22:13.033 --> 0:22:19.770 And how well when measuring it is in natural language processing, typically the cross entropy. 0:22:19.960 --> 0:22:32.847 That means we are comparing the target with the output, so we're taking the value multiplying 0:22:32.847 --> 0:22:35.452 with the horizons. 0:22:35.335 --> 0:22:43.454 Which gets optimized and you're seeing that this, of course, makes it again very nice and 0:22:43.454 --> 0:22:49.859 easy because our target, we said, is again a one-hound representation. 0:22:50.110 --> 0:23:00.111 So except for one, all of these are always zero, and what we are doing is taking the one. 0:23:00.100 --> 0:23:05.970 And we only need to multiply the one with the logarism here, and that is all the feedback. 0:23:06.946 --> 0:23:14.194 Of course, this is not always influenced by all the others. 0:23:14.194 --> 0:23:17.938 Why is this influenced by all? 0:23:24.304 --> 0:23:33.554 Think Mac the activation function, which is the current activation divided by some of the 0:23:33.554 --> 0:23:34.377 others. 0:23:34.354 --> 0:23:44.027 Because otherwise it could of course easily just increase this value and ignore the others, 0:23:44.027 --> 0:23:49.074 but if you increase one value or the other, so. 0:23:51.351 --> 0:24:04.433 And then we can do with neon networks one very nice and easy type of training that is 0:24:04.433 --> 0:24:07.779 done in all the neon. 0:24:07.707 --> 0:24:12.664 So in which direction does the arrow show? 0:24:12.664 --> 0:24:23.152 And then if we want to go to a smaller like smaller arrow, that's what we want to achieve. 0:24:23.152 --> 0:24:27.302 We're trying to minimize our arrow. 0:24:27.287 --> 0:24:32.875 And we have to do that, of course, for all the weights, and to calculate the error of 0:24:32.875 --> 0:24:36.709 all the weights we want in the back of the baggation here. 0:24:36.709 --> 0:24:41.322 But what you can do is you can propagate the arrow which you measured. 0:24:41.322 --> 0:24:43.792 At the end you can propagate it back. 0:24:43.792 --> 0:24:46.391 That's basic mass and basic derivation. 0:24:46.706 --> 0:24:59.557 Then you can do each weight in your model and measure how much it contributes to this 0:24:59.557 --> 0:25:01.350 individual. 0:25:04.524 --> 0:25:17.712 To summarize what your machine translation should be, to understand all this problem is 0:25:17.712 --> 0:25:20.710 that this is how a. 0:25:20.580 --> 0:25:23.056 The notes are perfect thrones. 0:25:23.056 --> 0:25:28.167 They are fully connected between two layers and no connections. 0:25:28.108 --> 0:25:29.759 Across layers. 0:25:29.829 --> 0:25:35.152 And what they're doing is always just to wait for some here and then an activation function. 0:25:35.415 --> 0:25:38.794 And in order to train you have this sword in backwards past. 0:25:39.039 --> 0:25:41.384 So we put in here. 0:25:41.281 --> 0:25:46.540 Our inputs have some random values at the beginning. 0:25:46.540 --> 0:25:49.219 They calculate the output. 
0:25:49.219 --> 0:25:58.646 We are measuring how big our error is, propagating the arrow back, and then changing our model 0:25:58.646 --> 0:25:59.638 in a way. 0:26:01.962 --> 0:26:14.267 So before we're coming into the neural networks, how can we use this type of neural network 0:26:14.267 --> 0:26:17.611 to do language modeling? 0:26:23.103 --> 0:26:25.520 So the question is now okay. 0:26:25.520 --> 0:26:33.023 How can we use them in natural language processing and especially in machine translation? 0:26:33.023 --> 0:26:38.441 The first idea of using them was to estimate the language model. 0:26:38.999 --> 0:26:42.599 So we have seen that the output can be monitored here as well. 0:26:43.603 --> 0:26:49.308 Has a probability distribution, and if we have a full vocabulary, we could mainly hear 0:26:49.308 --> 0:26:55.209 estimate how probable each next word is, and then use that in our language model fashion, 0:26:55.209 --> 0:27:02.225 as we've done it last time, we've got the probability of a full sentence as a product of all probabilities 0:27:02.225 --> 0:27:03.208 of individual. 0:27:04.544 --> 0:27:06.695 And UM. 0:27:06.446 --> 0:27:09.776 That was done and in ninety seven years. 0:27:09.776 --> 0:27:17.410 It's very easy to integrate it into this Locklear model, so we have said that this is how the 0:27:17.410 --> 0:27:24.638 Locklear model looks like, so we're searching the best translation, which minimizes each 0:27:24.638 --> 0:27:25.126 wage. 0:27:25.125 --> 0:27:26.371 The feature value. 0:27:26.646 --> 0:27:31.642 We have that with the minimum error training, if you can remember when we search for the 0:27:31.642 --> 0:27:32.148 optimal. 0:27:32.512 --> 0:27:40.927 We have the phrasetable probabilities, the language model, and we can just add here and 0:27:40.927 --> 0:27:41.597 there. 0:27:41.861 --> 0:27:46.077 So that is quite easy as said. 0:27:46.077 --> 0:27:54.101 That was how statistical machine translation was improved. 0:27:54.101 --> 0:27:57.092 Add one more feature. 0:27:58.798 --> 0:28:11.220 So how can we model the language mark for Belty with your network? 0:28:11.220 --> 0:28:22.994 So what we have to do is: And the problem in generally in the head is that most we haven't 0:28:22.994 --> 0:28:25.042 seen long sequences. 0:28:25.085 --> 0:28:36.956 Mostly we have to beg off to very short sequences and we are working on this discrete space where. 0:28:37.337 --> 0:28:48.199 So the idea is if we have a meal network we can map words into continuous representation 0:28:48.199 --> 0:28:50.152 and that helps. 0:28:51.091 --> 0:28:59.598 And the structure then looks like this, so this is the basic still feed forward neural 0:28:59.598 --> 0:29:00.478 network. 0:29:01.361 --> 0:29:10.744 We are doing this at Proximation again, so we are not putting in all previous words, but 0:29:10.744 --> 0:29:11.376 it's. 0:29:11.691 --> 0:29:25.089 And this is done because in your network we can have only a fixed type of input, so we 0:29:25.089 --> 0:29:31.538 can: Can only do a fixed set, and they are going to be doing exactly the same in minus 0:29:31.538 --> 0:29:31.879 one. 0:29:33.593 --> 0:29:41.026 And then we have, for example, three words and three different words, which are in these 0:29:41.026 --> 0:29:54.583 positions: And then we're having the first layer of the neural network, which learns words 0:29:54.583 --> 0:29:56.247 and words. 0:29:57.437 --> 0:30:04.976 There is one thing which is maybe special compared to the standard neural memory. 
0:30:05.345 --> 0:30:13.163 So the representation of this word we want to learn first of all position independence, 0:30:13.163 --> 0:30:19.027 so we just want to learn what is the general meaning of the word. 0:30:19.299 --> 0:30:26.244 Therefore, the representation you get here should be the same as if you put it in there. 0:30:27.247 --> 0:30:35.069 The nice thing is you can achieve that in networks the same way you achieve it. 0:30:35.069 --> 0:30:41.719 This way you're reusing ears so we are forcing them to always stay. 0:30:42.322 --> 0:30:49.689 And that's why you then learn your word embedding, which is contextual and independent, so. 0:30:49.909 --> 0:31:05.561 So the idea is you have the diagram go home and you don't want to use the context. 0:31:05.561 --> 0:31:07.635 First you. 0:31:08.348 --> 0:31:14.155 That of course it might have a different meaning depending on where it stands, but learn that. 0:31:14.514 --> 0:31:19.623 First, we're learning key representation of the words, which is just the representation 0:31:19.623 --> 0:31:20.378 of the word. 0:31:20.760 --> 0:31:37.428 So it's also not like normally all input neurons are connected to all neurons. 0:31:37.857 --> 0:31:47.209 This is the first layer of representation, and then we have a lot denser representation, 0:31:47.209 --> 0:31:56.666 that is, our three word embeddings here, and now we are learning this interaction between 0:31:56.666 --> 0:31:57.402 words. 0:31:57.677 --> 0:32:08.265 So now we have at least one connected, fully connected layer here, which takes the three 0:32:08.265 --> 0:32:14.213 imbedded input and then learns the new embedding. 0:32:15.535 --> 0:32:27.871 And then if you had one of several layers of lining which is your output layer, then. 0:32:28.168 --> 0:32:46.222 So here the size is a vocabulary size, and then you put as target what is the probability 0:32:46.222 --> 0:32:48.228 for each. 0:32:48.688 --> 0:32:56.778 The nice thing is that you learn everything together, so you're not learning what is a 0:32:56.778 --> 0:32:58.731 good representation. 0:32:59.079 --> 0:33:12.019 When you are training the whole network together, it learns what representation for a word you 0:33:12.019 --> 0:33:13.109 get in. 0:33:15.956 --> 0:33:19.176 It's Yeah That Is the Main Idea. 0:33:20.660 --> 0:33:32.695 Nowadays often referred to as one way of self-supervised learning, why self-supervisory learning? 0:33:33.053 --> 0:33:37.120 The output is the next word and the input is the previous word. 0:33:37.377 --> 0:33:46.778 But somehow it's self-supervised because it's not really that we created labels, but we artificially. 0:33:46.806 --> 0:34:01.003 We just have pure text, and then we created the task. 0:34:05.905 --> 0:34:12.413 Say we have two sentences like go home again. 0:34:12.413 --> 0:34:18.780 Second one is go to creative again, so both. 0:34:18.858 --> 0:34:31.765 The starboard bygo and then we have to predict the next four years and my question is: Be 0:34:31.765 --> 0:34:40.734 modeled this ability as one vector with like probability or possible works. 0:34:40.734 --> 0:34:42.740 We have musical. 0:34:44.044 --> 0:34:56.438 You have multiple examples, so you would twice train, once you predict, once you predict, 0:34:56.438 --> 0:35:02.359 and then, of course, the best performance. 0:35:04.564 --> 0:35:11.772 A very good point, so you're not aggregating examples beforehand, but you're taking each 0:35:11.772 --> 0:35:13.554 example individually. 
0:35:19.259 --> 0:35:33.406 So what you do is you simultaneously learn the projection layer which represents this 0:35:33.406 --> 0:35:39.163 word and the N gram probabilities. 0:35:39.499 --> 0:35:48.390 And what people then later analyzed is that these representations are very powerful. 0:35:48.390 --> 0:35:56.340 The task is just a very important task to model like what is the next word. 0:35:56.816 --> 0:36:09.429 It's a bit motivated by people saying in order to get the meaning of the word you have to 0:36:09.429 --> 0:36:10.690 look at. 0:36:10.790 --> 0:36:18.467 If you read the text in there, which you have never seen, you can still estimate the meaning 0:36:18.467 --> 0:36:22.264 of this word because you know how it is used. 0:36:22.602 --> 0:36:26.667 Just imagine you read this text about some city. 0:36:26.667 --> 0:36:32.475 Even if you've never seen the city before heard, you often know from. 0:36:34.094 --> 0:36:44.809 So what is now the big advantage of using neural networks? 0:36:44.809 --> 0:36:57.570 Just imagine we have to estimate this: So you have to monitor the probability of ad hip 0:36:57.570 --> 0:37:00.272 and now imagine iPhone. 0:37:00.600 --> 0:37:06.837 So all the techniques we have at the last time. 0:37:06.837 --> 0:37:14.243 At the end, if you haven't seen iPhone, you will always. 0:37:15.055 --> 0:37:19.502 Because you haven't seen the previous words, so you have no idea how to do that. 0:37:19.502 --> 0:37:24.388 You won't have seen the diagram, the trigram and all the others, so the probability here 0:37:24.388 --> 0:37:27.682 will just be based on the probability of ad, so it uses no. 0:37:28.588 --> 0:37:38.328 If you're having this type of model, what does it do so? 0:37:38.328 --> 0:37:43.454 This is the last three words. 0:37:43.483 --> 0:37:49.837 Maybe this representation is messed up because it's mainly on a particular word or source 0:37:49.837 --> 0:37:50.260 that. 0:37:50.730 --> 0:37:57.792 Now anyway you have these two information that were two words before was first and therefore: 0:37:58.098 --> 0:38:07.214 So you have a lot of information here to estimate how good it is. 0:38:07.214 --> 0:38:13.291 Of course, there could be more information. 0:38:13.593 --> 0:38:25.958 So all this type of modeling we can do and that we couldn't do beforehand because we always. 0:38:27.027 --> 0:38:31.905 Don't guess how we do it now. 0:38:31.905 --> 0:38:41.824 Typically you would have one talking for awkward vocabulary. 0:38:42.602 --> 0:38:45.855 All you're doing by carrying coding when it has a fixed dancing. 0:38:46.226 --> 0:38:49.439 Yeah, you have to do something like that that the opposite way. 0:38:50.050 --> 0:38:55.413 So yeah, all the vocabulary are by thankcoding where you don't have have all the vocabulary. 0:38:55.735 --> 0:39:07.665 But then, of course, the back pairing coating is better with arbitrary context because a 0:39:07.665 --> 0:39:11.285 problem with back pairing. 0:39:17.357 --> 0:39:20.052 Anymore questions to the basic same little things. 0:39:23.783 --> 0:39:36.162 This model we then want to continue is to look into how complex that is or can make things 0:39:36.162 --> 0:39:39.155 maybe more efficient. 0:39:40.580 --> 0:39:47.404 At the beginning there was definitely a major challenge. 0:39:47.404 --> 0:39:50.516 It's still not that easy. 0:39:50.516 --> 0:39:58.297 All guess follow the talk about their environmental fingerprint. 
0:39:58.478 --> 0:40:05.686 So this calculation is normally heavy, and if you build systems yourself, you have to 0:40:05.686 --> 0:40:06.189 wait. 0:40:06.466 --> 0:40:15.412 So it's good to know a bit about how complex things are in order to do a good or efficient. 0:40:15.915 --> 0:40:24.706 So one thing where most of the calculation really happens is if you're. 0:40:25.185 --> 0:40:34.649 So in generally all these layers, of course, we're talking about networks and the zones 0:40:34.649 --> 0:40:35.402 fancy. 0:40:35.835 --> 0:40:48.305 So what you have to do in order to calculate here these activations, you have this weight. 0:40:48.488 --> 0:41:05.021 So to make it simple, let's see we have three outputs, and then you just do a metric identification 0:41:05.021 --> 0:41:08.493 between your weight. 0:41:08.969 --> 0:41:19.641 That is why the use is so powerful for neural networks because they are very good in doing 0:41:19.641 --> 0:41:22.339 metric multiplication. 0:41:22.782 --> 0:41:28.017 However, for some type of embedding layer this is really very inefficient. 0:41:28.208 --> 0:41:37.547 So in this input we are doing this calculation. 0:41:37.547 --> 0:41:47.081 What we are mainly doing is selecting one color. 0:41:47.387 --> 0:42:03.570 So therefore you can do at least the forward pass a lot more efficient if you don't really 0:42:03.570 --> 0:42:07.304 do this calculation. 0:42:08.348 --> 0:42:20.032 So the weight metrics of the first embedding layer is just that in each color you have. 0:42:20.580 --> 0:42:30.990 So this is how your initial weights look like and how you can interpret or understand. 0:42:32.692 --> 0:42:42.042 And this is already relatively important because remember this is a huge dimensional thing, 0:42:42.042 --> 0:42:51.392 so typically here we have the number of words ten thousand, so this is the word embeddings. 0:42:51.451 --> 0:43:00.400 Because it's the largest one there, we have entries, while for the others we maybe have. 0:43:00.660 --> 0:43:03.402 So they are a little bit efficient and are important to make this in. 0:43:06.206 --> 0:43:10.529 And then you can look at where else the calculations are very difficult. 0:43:10.830 --> 0:43:20.294 So here we have our individual network, so here are the word embeddings. 0:43:20.294 --> 0:43:29.498 Then we have one hidden layer, and then you can look at how difficult. 0:43:30.270 --> 0:43:38.742 We could save a lot of calculations by calculating that by just doing like do the selection because: 0:43:40.600 --> 0:43:51.748 And then the number of calculations you have to do here is the length. 0:43:52.993 --> 0:44:06.206 Then we have here the hint size that is the hint size, so the first step of calculation 0:44:06.206 --> 0:44:10.260 for this metric is an age. 0:44:10.730 --> 0:44:22.030 Then you have to do some activation function which is this: This is the hidden size hymn 0:44:22.030 --> 0:44:29.081 because we need the vocabulary socks to calculate the probability for each. 0:44:29.889 --> 0:44:40.474 And if you look at this number, so if you have a projection sign of one hundred and a 0:44:40.474 --> 0:44:45.027 vocabulary sign of one hundred, you. 0:44:45.425 --> 0:44:53.958 And that's why there has been especially at the beginning some ideas on how we can reduce 0:44:53.958 --> 0:44:55.570 the calculation. 0:44:55.956 --> 0:45:02.352 And if we really need to calculate all our capabilities, or if we can calculate only some. 
0:45:02.582 --> 0:45:13.061 And there again one important thing to think about is for what you will use my language. 0:45:13.061 --> 0:45:21.891 One can use it for generations and that's where we will see the next week. 0:45:21.891 --> 0:45:22.480 And. 0:45:23.123 --> 0:45:32.164 Initially, if it's just used as a feature, we do not want to use it for generation, but 0:45:32.164 --> 0:45:32.575 we. 0:45:32.953 --> 0:45:41.913 And there we might not be interested in all the probabilities, but we already know all 0:45:41.913 --> 0:45:49.432 the probability of this one word, and then it might be very inefficient. 0:45:51.231 --> 0:45:53.638 And how can you do that so initially? 0:45:53.638 --> 0:45:56.299 For example, people look into shortlists. 0:45:56.756 --> 0:46:03.321 So the idea was this calculation at the end is really very expensive. 0:46:03.321 --> 0:46:05.759 So can we make that more. 0:46:05.945 --> 0:46:17.135 And the idea was okay, and most birds occur very rarely, and some beef birds occur very, 0:46:17.135 --> 0:46:18.644 very often. 0:46:19.019 --> 0:46:37.644 And so they use the smaller imagery, which is maybe very small, and then you merge a new. 0:46:37.937 --> 0:46:45.174 So you're taking if the word is in the shortness, so in the most frequent words. 0:46:45.825 --> 0:46:58.287 You're taking the probability of this short word by some normalization here, and otherwise 0:46:58.287 --> 0:46:59.656 you take. 0:47:00.020 --> 0:47:00.836 Course. 0:47:00.836 --> 0:47:09.814 It will not be as good, but then we don't have to calculate all the capabilities at the 0:47:09.814 --> 0:47:16.037 end, but we only have to calculate it for the most frequent. 0:47:19.599 --> 0:47:39.477 Machines about that, but of course we don't model the probability of the infrequent words. 0:47:39.299 --> 0:47:46.658 And one idea is to do what is reported as soles for the structure of the layer. 0:47:46.606 --> 0:47:53.169 You see how some years ago people were very creative in giving names to newer models. 0:47:53.813 --> 0:48:00.338 And there the idea is that we model the out group vocabulary as a clustered strip. 0:48:00.680 --> 0:48:08.498 So you don't need to mold all of your bodies directly, but you are putting words into. 0:48:08.969 --> 0:48:20.623 A very intricate word is first in and then in and then in and that is in sub-sub-clusters 0:48:20.623 --> 0:48:21.270 and. 0:48:21.541 --> 0:48:29.936 And this is what was mentioned in the past of the work, so these are the subclasses that 0:48:29.936 --> 0:48:30.973 always go. 0:48:30.973 --> 0:48:39.934 So if it's in cluster one at the first position then you only look at all the words which are: 0:48:40.340 --> 0:48:50.069 And then you can calculate the probability of a word again just by the product over these, 0:48:50.069 --> 0:48:55.522 so the probability of the word is the first class. 0:48:57.617 --> 0:49:12.331 It's maybe more clear where you have the sole architecture, so what you will do is first 0:49:12.331 --> 0:49:13.818 predict. 0:49:14.154 --> 0:49:26.435 Then you go to the appropriate sub-class, then you calculate the probability of the sub-class. 0:49:27.687 --> 0:49:34.932 Anybody have an idea why this is more, more efficient, or if people do it first, it looks 0:49:34.932 --> 0:49:35.415 more. 0:49:42.242 --> 0:49:56.913 Yes, so you have to do less calculations, or maybe here you have to calculate the element 0:49:56.913 --> 0:49:59.522 there, but you. 
0:49:59.980 --> 0:50:06.116 The capabilities in the set classes that you're going through and not for all of them. 0:50:06.386 --> 0:50:16.688 Therefore, it's only more efficient if you don't need all awkward preferences because 0:50:16.688 --> 0:50:21.240 you have to even calculate the class. 0:50:21.501 --> 0:50:30.040 So it's only more efficient in scenarios where you really need to use a language to evaluate. 0:50:35.275 --> 0:50:54.856 How this works is that on the output layer you only have a vocabulary of: But on the input 0:50:54.856 --> 0:51:05.126 layer you have always your full vocabulary because at the input we saw that this is not 0:51:05.126 --> 0:51:06.643 complicated. 0:51:06.906 --> 0:51:19.778 And then you can cluster down all your words, embedding series of classes, and use that as 0:51:19.778 --> 0:51:23.031 your classes for that. 0:51:23.031 --> 0:51:26.567 So yeah, you have words. 0:51:29.249 --> 0:51:32.593 Is one idea of doing it. 0:51:32.593 --> 0:51:44.898 There is also a second idea of doing it again, the idea that we don't need the probability. 0:51:45.025 --> 0:51:53.401 So sometimes it doesn't really need to be a probability to evaluate. 0:51:53.401 --> 0:52:05.492 It's only important that: And: Here is called self-normalization. 0:52:05.492 --> 0:52:19.349 What people have done so is in the softmax is always to the input divided by normalization. 0:52:19.759 --> 0:52:25.194 So this is how we calculate the soft mix. 0:52:25.825 --> 0:52:42.224 And in self-normalization now, the idea is that we don't need to calculate the logarithm. 0:52:42.102 --> 0:52:54.284 That would be zero, and then you don't even have to calculate the normalization. 0:52:54.514 --> 0:53:01.016 So how can we achieve that? 0:53:01.016 --> 0:53:08.680 And then there's the nice thing. 0:53:09.009 --> 0:53:14.743 And our novel Lots and more to maximize probability. 0:53:14.743 --> 0:53:23.831 We have this cross entry lot that probability is higher, and now we're just adding. 0:53:24.084 --> 0:53:31.617 And the second loss just tells us you're pleased training the way the lock set is zero. 0:53:32.352 --> 0:53:38.625 So then if it's nearly zero at the end you don't need to calculate this and it's also 0:53:38.625 --> 0:53:39.792 very efficient. 0:53:40.540 --> 0:53:57.335 One important thing is this is only an inference, so during tests we don't need to calculate. 0:54:00.480 --> 0:54:15.006 You can do a bit of a hyperparameter here where you do the waiting and how much effort 0:54:15.006 --> 0:54:16.843 should be. 0:54:18.318 --> 0:54:35.037 The only disadvantage is that it's no speed up during training and there are other ways 0:54:35.037 --> 0:54:37.887 of doing that. 0:54:41.801 --> 0:54:43.900 I'm with you all. 0:54:44.344 --> 0:54:48.540 Then we are coming very, very briefly like this one here. 0:54:48.828 --> 0:54:53.692 There are more things on different types of languages. 0:54:53.692 --> 0:54:58.026 We are having a very short view of a restricted. 0:54:58.298 --> 0:55:09.737 And then we'll talk about recurrent neural networks for our language minds because they 0:55:09.737 --> 0:55:17.407 have the advantage now that we can't even further improve. 0:55:18.238 --> 0:55:24.395 There's also different types of neural networks. 0:55:24.395 --> 0:55:30.175 These ballroom machines are not having input. 
0:55:30.330 --> 0:55:39.271 They have these binary units: And they define an energy function on the network, which can 0:55:39.271 --> 0:55:46.832 be in respect of bottom machines efficiently calculated, and restricted needs. 0:55:46.832 --> 0:55:53.148 You only have connections between the input and the hidden layer. 0:55:53.393 --> 0:56:00.190 So you see here you don't have input and output, you just have an input and you calculate what. 0:56:00.460 --> 0:56:16.429 Which of course nicely fits with the idea we're having, so you can use this for N gram 0:56:16.429 --> 0:56:19.182 language ones. 0:56:19.259 --> 0:56:25.187 Decaying this credibility of the input by this type of neural networks. 0:56:26.406 --> 0:56:30.582 And the advantage of this type of model of board that is. 0:56:30.550 --> 0:56:38.629 Very fast to integrate it, so that one was the first one which was used during decoding. 0:56:38.938 --> 0:56:50.103 The problem of it is that the Enron language models were very good at performing the calculation. 0:56:50.230 --> 0:57:00.114 So what people typically did is we talked about a best list, so they generated a most 0:57:00.114 --> 0:57:05.860 probable output, and then they scored each entry. 0:57:06.146 --> 0:57:10.884 A language model, and then only like change the order against that based on that which. 0:57:11.231 --> 0:57:20.731 The knifing is maybe only hundred entries, while during decoding you will look at several 0:57:20.731 --> 0:57:21.787 thousand. 0:57:26.186 --> 0:57:40.437 This but let's look at the context, so we have now seen your language models. 0:57:40.437 --> 0:57:43.726 There is the big. 0:57:44.084 --> 0:57:57.552 Remember ingram language is not always words because sometimes you have to back off or interpolation 0:57:57.552 --> 0:57:59.953 to lower ingrams. 0:58:00.760 --> 0:58:05.504 However, in neural models we always have all of these inputs and some of these. 0:58:07.147 --> 0:58:21.262 The disadvantage is that you are still limited in your context, and if you remember the sentence 0:58:21.262 --> 0:58:23.008 from last,. 0:58:22.882 --> 0:58:28.445 Sometimes you need more context and there's unlimited contexts that you might need and 0:58:28.445 --> 0:58:34.838 you can always create sentences where you need this file context in order to put a good estimation. 0:58:35.315 --> 0:58:44.955 Can we also do it different in order to better understand that it makes sense to view? 0:58:45.445 --> 0:58:57.621 So sequence labeling tasks are a very common type of towns in natural language processing 0:58:57.621 --> 0:59:03.438 where you have an input sequence and then. 0:59:03.323 --> 0:59:08.663 I've token so you have one output for each input so machine translation is not a secret 0:59:08.663 --> 0:59:14.063 labeling cast because the number of inputs and the number of outputs is different so you 0:59:14.063 --> 0:59:19.099 put in a string German which has five words and the output can be six or seven or. 0:59:19.619 --> 0:59:20.155 Secrets. 0:59:20.155 --> 0:59:24.083 Lately you always have the same number of and the same number of. 0:59:24.944 --> 0:59:40.940 And you can model language modeling as that, and you just say a label for each word is always 0:59:40.940 --> 0:59:43.153 a next word. 0:59:45.705 --> 0:59:54.823 This is the more general you can think of it, for example how to speech taking entity 0:59:54.823 --> 0:59:56.202 recognition. 
0:59:58.938 --> 1:00:08.081 And if you look at now fruit cut token in generally sequence, they can depend on import 1:00:08.081 --> 1:00:08.912 tokens. 1:00:09.869 --> 1:00:11.260 Nice thing. 1:00:11.260 --> 1:00:21.918 In our case, the output tokens are the same so we can easily model it that they only depend 1:00:21.918 --> 1:00:24.814 on all the input tokens. 1:00:24.814 --> 1:00:28.984 So we have this whether it's or so. 1:00:31.011 --> 1:00:42.945 But we can always do a look at what specific type of sequence labeling, unidirectional sequence 1:00:42.945 --> 1:00:44.188 labeling. 1:00:44.584 --> 1:00:58.215 And that's exactly how we want the language of the next word only depends on all the previous 1:00:58.215 --> 1:01:00.825 words that we're. 1:01:01.321 --> 1:01:12.899 Mean, of course, that's not completely true in a language that the bad context might also 1:01:12.899 --> 1:01:14.442 be helpful. 1:01:14.654 --> 1:01:22.468 We will model always the probability of a word given on its history, and therefore we 1:01:22.468 --> 1:01:23.013 need. 1:01:23.623 --> 1:01:29.896 And currently we did there this approximation in sequence labeling that we have this windowing 1:01:29.896 --> 1:01:30.556 approach. 1:01:30.951 --> 1:01:43.975 So in order to predict this type of word we always look at the previous three words and 1:01:43.975 --> 1:01:48.416 then to do this one we again. 1:01:49.389 --> 1:01:55.137 If you are into neural networks you recognize this type of structure. 1:01:55.137 --> 1:01:57.519 Also are the typical neural. 1:01:58.938 --> 1:02:09.688 Yes, so this is like Engram, Louis Couperus, and at least in some way compared to the original, 1:02:09.688 --> 1:02:12.264 you're always looking. 1:02:14.334 --> 1:02:30.781 However, there are also other types of neural network structures which we can use for sequence. 1:02:32.812 --> 1:02:34.678 That we can do so. 1:02:34.678 --> 1:02:39.686 The idea is in recurrent neural network structure. 1:02:39.686 --> 1:02:43.221 We are saving the complete history. 1:02:43.623 --> 1:02:55.118 So again we have to do like this fix size representation because neural networks always 1:02:55.118 --> 1:02:56.947 need to have. 1:02:57.157 --> 1:03:05.258 And then we start with an initial value for our storage. 1:03:05.258 --> 1:03:15.917 We are giving our first input and then calculating the new representation. 1:03:16.196 --> 1:03:26.328 If you look at this, it's just again your network was two types of inputs: in your work, 1:03:26.328 --> 1:03:29.743 in your initial hidden state. 1:03:30.210 --> 1:03:46.468 Then you can apply it to the next type of input and you're again having. 1:03:47.367 --> 1:03:53.306 Nice thing is now that you can do now step by step by step, so all the way over. 1:03:55.495 --> 1:04:05.245 The nice thing that we are having here now is that we are having context information from 1:04:05.245 --> 1:04:07.195 all the previous. 1:04:07.607 --> 1:04:13.582 So if you're looking like based on which words do you use here, calculate your ability of 1:04:13.582 --> 1:04:14.180 varying. 1:04:14.554 --> 1:04:20.128 It depends on is based on this path. 1:04:20.128 --> 1:04:33.083 It depends on and this hidden state was influenced by this one and this hidden state. 1:04:33.473 --> 1:04:37.798 So now we're having something new. 1:04:37.798 --> 1:04:46.449 We can really model the word probability not only on a fixed context. 1:04:46.906 --> 1:04:53.570 Because the in-states we're having here in our area are influenced by all the trivia. 
1:04:56.296 --> 1:05:00.909 So how is that to mean? 1:05:00.909 --> 1:05:16.288 If you're not thinking about the history of clustering, we said the clustering. 1:05:16.736 --> 1:05:24.261 So do not need to do any clustering here, and we also see how things are put together 1:05:24.261 --> 1:05:26.273 in order to really do. 1:05:29.489 --> 1:05:43.433 In the green box this way since we are starting from the left point to the right. 1:05:44.524 --> 1:05:48.398 And that's right, so they're clustered in some parts. 1:05:48.398 --> 1:05:58.196 Here is some type of clustering happening: It's continuous representations, but a smaller 1:05:58.196 --> 1:06:02.636 difference doesn't matter again. 1:06:02.636 --> 1:06:10.845 So if you have a lot of different histories, the similarity. 1:06:11.071 --> 1:06:15.791 Because in order to do the final restriction you only do it based on the green box. 1:06:16.156 --> 1:06:24.284 So you are now again still learning some type of clasp. 1:06:24.284 --> 1:06:30.235 You don't have to do this hard decision. 1:06:30.570 --> 1:06:39.013 The only restriction you are giving is you have to install everything that is important. 1:06:39.359 --> 1:06:54.961 So it's a different type of limitation, so you calculate the probability based on the 1:06:54.961 --> 1:06:57.138 last words. 1:06:57.437 --> 1:07:09.645 That is how you still need some cluster things in order to do it efficiently. 1:07:09.970 --> 1:07:25.311 But this is where things get merged together in this type of hidden representation, which 1:07:25.311 --> 1:07:28.038 is then merged. 1:07:28.288 --> 1:07:33.104 On the previous words, but they are some other bottleneck in order to make a good estimation. 1:07:34.474 --> 1:07:41.242 So the idea is that we can store all our history into one lecture. 1:07:41.581 --> 1:07:47.351 Which is very good and makes it more strong. 1:07:47.351 --> 1:07:51.711 Next we come to problems of that. 1:07:51.711 --> 1:07:57.865 Of course, at some point it might be difficult. 1:07:58.398 --> 1:08:02.230 Then maybe things get all overwritten, or you cannot store everything in there. 1:08:02.662 --> 1:08:04.514 So,. 1:08:04.184 --> 1:08:10.252 Therefore, yet for short things like signal sentences that works well, but especially if 1:08:10.252 --> 1:08:16.184 you think of other tasks like harmonisation where a document based on T where you need 1:08:16.184 --> 1:08:22.457 to consider a full document, these things got a bit more complicated and we learned another 1:08:22.457 --> 1:08:23.071 type of. 1:08:24.464 --> 1:08:30.455 For the further in order to understand these networks, it's good to have both views always. 1:08:30.710 --> 1:08:39.426 So this is the unroll view, so you have this type of network. 1:08:39.426 --> 1:08:48.532 Therefore, it can be shown as: We have here the output and here's your network which is 1:08:48.532 --> 1:08:52.091 connected by itself and that is a recurrent. 1:08:56.176 --> 1:09:11.033 There is one challenge in these networks and that is the training so the nice thing is train 1:09:11.033 --> 1:09:11.991 them. 1:09:12.272 --> 1:09:20.147 So the idea is we don't really know how to train them, but if you unroll them like this,. 1:09:20.540 --> 1:09:38.054 It's exactly the same so you can measure your arrows and then you propagate your arrows. 1:09:38.378 --> 1:09:45.647 Now the nice thing is if you unroll something, it's a feet forward and you can train it. 
1:09:46.106 --> 1:09:56.493 The only important thing is, of course, for different inputs you have to take that into 1:09:56.493 --> 1:09:57.555 account. 1:09:57.837 --> 1:10:07.621 But since parameters are shared, it's somehow similar and you can train that the training 1:10:07.621 --> 1:10:08.817 algorithm. 1:10:10.310 --> 1:10:16.113 One thing which makes things difficult is what is referred to as the vanishing gradient. 1:10:16.113 --> 1:10:21.720 So we are saying there is a big advantage of these models and that's why we are using 1:10:21.720 --> 1:10:22.111 that. 1:10:22.111 --> 1:10:27.980 The output here does not only depend on the current input of a last three but on anything 1:10:27.980 --> 1:10:29.414 that was said before. 1:10:29.809 --> 1:10:32.803 That's a very strong thing is the motivation of using art. 1:10:33.593 --> 1:10:44.599 However, if you're using standard, the influence here gets smaller and smaller, and the models. 1:10:44.804 --> 1:10:55.945 Because the gradients get smaller and smaller, and so the arrow here propagated to this one, 1:10:55.945 --> 1:10:59.659 this contributes to the arrow. 1:11:00.020 --> 1:11:06.710 And yeah, that's why standard R&S are difficult or have to become boosters. 1:11:07.247 --> 1:11:11.481 So if we are talking about our ends nowadays,. 1:11:11.791 --> 1:11:19.532 What we are typically meaning are long short memories. 1:11:19.532 --> 1:11:30.931 You see there by now quite old already, but they have special gating mechanisms. 1:11:31.171 --> 1:11:41.911 So in the language model tasks, for example in some other story information, all this sentence 1:11:41.911 --> 1:11:44.737 started with a question. 1:11:44.684 --> 1:11:51.886 Because if you only look at the five last five words, it's often no longer clear as a 1:11:51.886 --> 1:11:52.556 normal. 1:11:53.013 --> 1:12:06.287 So there you have these mechanisms with the right gate in order to store things for a longer 1:12:06.287 --> 1:12:08.571 time into your. 1:12:10.730 --> 1:12:20.147 Here they are used in, in, in, in selling quite a lot of works. 1:12:21.541 --> 1:12:30.487 For especially text machine translation now, the standard is to do transformer base models. 1:12:30.690 --> 1:12:42.857 But for example, this type of in architecture we have later one lecture about efficiency. 1:12:42.882 --> 1:12:53.044 And there in the decoder and partial networks they are still using our edges because then. 1:12:53.473 --> 1:12:57.542 So it's not that our ends are of no importance. 1:12:59.239 --> 1:13:08.956 In order to make them strong, there are some more things which are helpful and should be: 1:13:09.309 --> 1:13:19.668 So one thing is it's a very easy and nice trick to make this neon network stronger and better. 1:13:19.739 --> 1:13:21.619 So, of course, it doesn't work always. 1:13:21.619 --> 1:13:23.451 They have to have enough training to. 1:13:23.763 --> 1:13:29.583 But in general that is the easiest way of making your mouth bigger and stronger is to 1:13:29.583 --> 1:13:30.598 increase your. 1:13:30.630 --> 1:13:43.244 And you've seen that with a large size model they are always braggling about. 1:13:43.903 --> 1:13:53.657 This is one way so the question is how do you get more parameters? 1:13:53.657 --> 1:14:05.951 There's two ways you can make your representations: And the other thing is its octave deep learning, 1:14:05.951 --> 1:14:10.020 so the other thing is to make your networks. 1:14:11.471 --> 1:14:13.831 And then you can also get more work off. 
1:12:59.239 --> 1:13:08.956 In order to make them strong, there are some more things which are helpful and should be mentioned. 1:13:09.309 --> 1:13:19.668 One is a very easy and nice trick to make the neural network stronger and better. 1:13:19.739 --> 1:13:23.451 Of course it does not always work; you have to have enough training data. 1:13:23.763 --> 1:13:30.598 But in general the easiest way of making your model bigger and stronger is to increase the number of parameters. 1:13:30.630 --> 1:13:43.244 And you have seen that with the large language models, where people are always bragging about the number of parameters. 1:13:43.903 --> 1:14:05.951 So this is one way, and the question is how you get more parameters. There are two options: you can make your representations, the hidden layers, wider, or, and this is where the 'deep' in deep learning comes from, you can make your networks deeper. 1:14:11.471 --> 1:14:13.831 And then you also get more parameters. 1:14:14.614 --> 1:14:23.330 There is one problem with this, with deeper networks, and it is very similar to what we saw with RNNs. 1:14:23.603 --> 1:14:35.475 With RNNs we have this problem of gradient flow: over long distances the gradient gets very small. 1:14:35.795 --> 1:14:52.285 Exactly the same thing happens in deep networks: you compute at the output whether it is right or wrong, and then you propagate the gradient back through all the layers. 1:14:52.612 --> 1:14:56.440 With three layers that is no problem, but if you go to ten, twenty or a hundred layers, 1:14:57.797 --> 1:14:59.690 that typically becomes a problem. 1:15:00.060 --> 1:15:15.885 What people are doing is using what are called residual connections; that is a very helpful and simple idea. 1:15:15.956 --> 1:15:31.386 The idea is that the layers in between should not calculate a completely new representation, but only calculate what should be changed. 1:15:31.731 --> 1:15:37.585 And therefore, in the end, the output of a layer is always added to its input. 1:15:38.318 --> 1:15:48.824 The nice thing is that later, when you are doing backpropagation, the gradient can flow back very directly through these connections. 1:15:49.209 --> 1:16:04.229 So that is what you see nowadays in very deep architectures, not only in RNNs: you nearly always have these residual connections. 1:16:04.704 --> 1:16:18.792 They have two advantages: on the one hand it is easier to learn the representation, and on the other hand the gradients flow better.
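Before moving on to the word embeddings, here is a tiny sketch of the residual idea (a toy example with arbitrary sizes, not taken from the lecture): each layer only computes an update that is added to its input, so even a deep stack keeps a direct additive path for the signal and for the gradient.

```python
# Toy illustration of residual connections in a deep stack.
import numpy as np

def dense_layer(x, W, b):
    """A plain feed-forward layer with a tanh non-linearity."""
    return np.tanh(W @ x + b)

def residual_block(x, W, b):
    """Residual block: output = input + layer(input)."""
    return x + dense_layer(x, W, b)

H, depth = 16, 50                      # toy hidden size and a fairly deep stack
rng = np.random.default_rng(2)
x = rng.normal(size=H)
for _ in range(depth):
    W = rng.normal(scale=0.1, size=(H, H))
    x = residual_block(x, W, np.zeros(H))
print(x[:4].round(3))                  # first few activations after 50 residual blocks
```

The design choice is exactly the one described above: each layer learns what to change, not a completely new representation.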
1:16:23.843 --> 1:16:46.707 So much for the recurrent networks. The last topic for today is word embeddings: neural language models were first used inside the statistical models themselves, and now we are seeing them again, but one thing that was very essential from the beginning is exactly these embeddings. 1:16:46.967 --> 1:17:04.166 People really trained neural language models partly only to get this type of embeddings, and therefore we want to look a bit more into them. 1:17:09.229 --> 1:17:27.170 So some last words on the word embeddings. The interesting thing is that word embeddings can be used for very different tasks; the advantage is that we can train the word embeddings 1:17:27.347 --> 1:17:31.334 on just large amounts of data. 1:17:31.931 --> 1:17:41.566 And then, if you have these word embeddings, you no longer have an input layer of ten thousand dimensions. 1:17:41.982 --> 1:17:52.231 So you can then train a smaller model to do other tasks, and therefore you are more efficient. 1:17:52.532 --> 1:18:08.747 The initial word embeddings really depend only on the word itself: if you look at the two meanings of 'can', the can of beans or 'can they do that', they get the same embedding. 1:18:09.189 --> 1:18:27.916 That cannot be resolved at this level; for that you need to know the context, and the higher layers do take the context into account. 1:18:29.489 --> 1:18:33.757 However, even these context-independent embeddings have quite interesting properties. 1:18:34.034 --> 1:18:47.182 People like to visualize them, which is always a bit difficult, 1:18:47.767 --> 1:18:52.879 because drawing a five-hundred-dimensional vector is still a bit challenging. 1:18:53.113 --> 1:19:12.464 You cannot do that directly, so what people do is learn some type of dimensionality reduction. 1:19:13.073 --> 1:19:17.216 Of course some information gets lost then, but you can try it. 1:19:18.238 --> 1:19:37.892 And you see, in the most famous and common example, that you can look at the difference between the male and the female form of a word in English. 1:19:38.058 --> 1:19:40.389 And you can do that for very different words. 1:19:40.780 --> 1:19:45.403 And that is where the arithmetic comes in, which people then looked into. 1:19:45.725 --> 1:19:51.410 So what you can now do, for example, is calculate the difference between 'man' and 'woman'. 1:19:52.232 --> 1:20:10.495 Then you can take the embedding of 'king', add to it the difference between 'man' and 'woman', and, this is where people get really excited, look at which words are similar to that point. Of course you will not directly hit the correct word; it is a continuous space. 1:20:10.790 --> 1:20:24.062 But you can look at the nearest neighbors of this point, and often the word you expect, here 'queen', is among them. 1:20:24.224 --> 1:20:33.911 So it is somewhat surprising that the difference between such word pairs is always roughly the same vector. 1:20:34.374 --> 1:20:49.046 You can do different things with that: it also works for verb forms, for example with 'swimming' and 'swam', or with 'walking' and 'walked'. 1:20:49.469 --> 1:21:04.016 So you can try to use this, and the interesting thing is that nobody taught the model these principles. 1:21:04.284 --> 1:21:09.910 It is purely trained on the task of next-word prediction. 1:21:10.230 --> 1:21:23.669 And it even works for semantic information like capitals: there is a consistent difference between a country and its capital. 1:21:23.823 --> 1:21:33.760 There is another visualization here where the same thing is done for the difference between countries and their capitals. 1:21:33.853 --> 1:21:53.372 And you see it is not perfect, but the differences point in a similar direction, so you can even use that for question answering: if you know the capital for three countries, you can compute the difference and apply it to a new country. 1:21:54.834 --> 1:22:04.385 So these models are able to really learn a lot of information and compress it into this representation, 1:22:05.325 --> 1:22:07.679 just by doing next-word prediction. 1:22:07.707 --> 1:22:26.095 And that also explains, or at least strongly motivates, what the main advantage of this type of neural model is.
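The analogy arithmetic just described can be sketched as follows; the four-dimensional vectors below are invented purely for illustration (real embeddings are learned and have hundreds of dimensions), so this only shows the mechanics of vec(king) - vec(man) + vec(woman) and a nearest-neighbour lookup.

```python
# Toy illustration of word-embedding analogy arithmetic (vectors are made up).
import numpy as np

emb = {                                   # hypothetical 4-dimensional embeddings
    "king":   np.array([0.8, 0.9, 0.1, 0.2]),
    "queen":  np.array([0.8, 0.9, 0.9, 0.2]),
    "man":    np.array([0.1, 0.2, 0.1, 0.7]),
    "woman":  np.array([0.1, 0.2, 0.9, 0.7]),
    "walked": np.array([0.5, 0.1, 0.4, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query, exclude):
    """Nearest neighbour in the (tiny) vocabulary, skipping the query words."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: cosine(emb[w], query))

query = emb["king"] - emb["man"] + emb["woman"]
print(most_similar(query, exclude={"king", "man", "woman"}))   # -> "queen" here
```

With real trained embeddings the match is of course not exact, but the expected word is usually among the nearest neighbours.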
1:22:28.568 --> 1:22:49.148 So to summarize what we did today, what you should hopefully take with you is how we can do language modeling with neural networks. 1:22:49.449 --> 1:22:59.059 We looked at three different architectures: the feed-forward language model, the RNN, and the LSTM-based one. 1:22:59.039 --> 1:23:04.559 And finally, that there are different architectures for doing this with neural networks: 1:23:04.559 --> 1:23:14.389 we have seen feed-forward neural networks and recurrence-based neural networks, and we will see in the next lectures the last type of architecture. 1:23:15.915 --> 1:23:17.438 Any questions? 1:23:20.680 --> 1:23:27.360 Then thanks a lot, and we will see each other again at the next lecture.