diff --git "a/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" "b/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" @@ -0,0 +1,5104 @@ +WEBVTT + +0:00:01.301 --> 0:00:05.707 +Okay So Welcome to Today's Lecture. + +0:00:06.066 --> 0:00:12.592 +I'm sorry for the inconvenience. + +0:00:12.592 --> 0:00:19.910 +Sometimes they are project meetings. + +0:00:19.910 --> 0:00:25.843 +There will be one other time. + +0:00:26.806 --> 0:00:40.863 +So what we want to talk today about is want +to start with neural approaches to machine + +0:00:40.863 --> 0:00:42.964 +translation. + +0:00:43.123 --> 0:00:51.285 +I guess you have heard about other types of +neural models for other types of neural language + +0:00:51.285 --> 0:00:52.339 +processing. + +0:00:52.339 --> 0:00:59.887 +This was some of the first steps in introducing +neal networks to machine translation. + +0:01:00.600 --> 0:01:06.203 +They are similar to what you know they see +in as large language models. + +0:01:06.666 --> 0:01:11.764 +And today look into what are these neuro-language +models? + +0:01:11.764 --> 0:01:13.874 +What is the difference? + +0:01:13.874 --> 0:01:15.983 +What is the motivation? + +0:01:16.316 --> 0:01:21.445 +And first will use them in statistics and +machine translation. + +0:01:21.445 --> 0:01:28.935 +So if you remember how fully like two or three +weeks ago we had this likely model where you + +0:01:28.935 --> 0:01:31.052 +can integrate easily any. + +0:01:31.351 --> 0:01:40.967 +We just have another model which evaluates +how good a system is or how good a fluent language + +0:01:40.967 --> 0:01:41.376 +is. + +0:01:41.376 --> 0:01:53.749 +The main advantage compared to the statistical +models we saw on Tuesday is: Next week we will + +0:01:53.749 --> 0:02:06.496 +then go for a neural machine translation where +we replace the whole model. + +0:02:11.211 --> 0:02:21.078 +Just as a remember from Tuesday, we've seen +the main challenge in language world was that + +0:02:21.078 --> 0:02:25.134 +most of the engrams we haven't seen. + +0:02:26.946 --> 0:02:33.967 +So this was therefore difficult to estimate +any probability because you've seen that normally + +0:02:33.967 --> 0:02:39.494 +if you have not seen the endgram you will assign +the probability of zero. + +0:02:39.980 --> 0:02:49.420 +However, this is not really very good because +we don't want to give zero probabilities to + +0:02:49.420 --> 0:02:54.979 +sentences, which still might be a very good +English. + +0:02:55.415 --> 0:03:02.167 +And then we learned a lot of techniques and +that is the main challenging statistical machine + +0:03:02.167 --> 0:03:04.490 +translate statistical language. + +0:03:04.490 --> 0:03:10.661 +What's how we can give a good estimate of +probability to events that we haven't seen + +0:03:10.661 --> 0:03:12.258 +smoothing techniques? + +0:03:12.258 --> 0:03:15.307 +We've seen this interpolation and begoff. + +0:03:15.435 --> 0:03:21.637 +And they invent or develop very specific techniques. + +0:03:21.637 --> 0:03:26.903 +To deal with that, however, it might not be. + +0:03:28.568 --> 0:03:43.190 +And therefore maybe we can do things different, +so if we have not seen an gram before in statistical + +0:03:43.190 --> 0:03:44.348 +models. + +0:03:45.225 --> 0:03:51.361 +Before and we can only get information from +exactly the same words. 
+ +0:03:51.411 --> 0:04:06.782 +We don't have some on like approximate matching +like that, maybe in a sentence that cures similarly. + +0:04:06.782 --> 0:04:10.282 +So if you have seen a. + +0:04:11.191 --> 0:04:17.748 +And so you would like to have more something +like that where endgrams are represented, more + +0:04:17.748 --> 0:04:21.953 +in a general space, and we can generalize similar +numbers. + +0:04:22.262 --> 0:04:29.874 +So if you learn something about walk then +maybe we can use this knowledge and also apply. + +0:04:30.290 --> 0:04:42.596 +The same as we have done before, but we can +really better model how similar they are and + +0:04:42.596 --> 0:04:45.223 +transfer to other. + +0:04:47.047 --> 0:04:54.236 +And we maybe want to do that in a more hierarchical +approach that we know okay. + +0:04:54.236 --> 0:05:02.773 +Some words are similar but like go and walk +is somehow similar and I and P and G and therefore + +0:05:02.773 --> 0:05:06.996 +like maybe if we then merge them in an engram. + +0:05:07.387 --> 0:05:15.861 +If we learn something about our walk, then +it should tell us also something about Hugo. + +0:05:15.861 --> 0:05:17.113 +He walks or. + +0:05:17.197 --> 0:05:27.327 +You see that there is some relations which +we need to integrate for you. + +0:05:27.327 --> 0:05:35.514 +We need to add the s, but maybe walks should +also be here. + +0:05:37.137 --> 0:05:45.149 +And luckily there is one really convincing +method in doing that: And that is by using + +0:05:45.149 --> 0:05:47.231 +a neural mechanism. + +0:05:47.387 --> 0:05:58.497 +That's what we will introduce today so we +can use this type of neural networks to try + +0:05:58.497 --> 0:06:04.053 +to learn this similarity and to learn how. + +0:06:04.324 --> 0:06:14.355 +And that is one of the main advantages that +we have by switching from the standard statistical + +0:06:14.355 --> 0:06:15.200 +models. + +0:06:15.115 --> 0:06:22.830 +To learn similarities between words and generalized, +and learn what is called hidden representations + +0:06:22.830 --> 0:06:29.705 +or representations of words, where we can measure +similarity in some dimensions of words. + +0:06:30.290 --> 0:06:42.384 +So we can measure in which way words are similar. + +0:06:42.822 --> 0:06:48.902 +We had it before and we've seen that words +were just easier. + +0:06:48.902 --> 0:06:51.991 +The only thing we did is like. + +0:06:52.192 --> 0:07:02.272 +But this energies don't have any meaning, +so it wasn't that word is more similar to words. + +0:07:02.582 --> 0:07:12.112 +So we couldn't learn anything about words +in the statistical model and that's a big challenge. + +0:07:12.192 --> 0:07:23.063 +About words even like in morphology, so going +goes is somehow more similar because the person + +0:07:23.063 --> 0:07:24.219 +singular. + +0:07:24.264 --> 0:07:34.924 +The basic models we have to now have no idea +about that and goes as similar to go than it + +0:07:34.924 --> 0:07:37.175 +might be to sleep. + +0:07:39.919 --> 0:07:44.073 +So what we want to do today. + +0:07:44.073 --> 0:07:53.096 +In order to go to this we will have a short +introduction into. + +0:07:53.954 --> 0:08:05.984 +It very short just to see how we use them +here, but that's a good thing, so most of you + +0:08:05.984 --> 0:08:08.445 +think it will be. + +0:08:08.928 --> 0:08:14.078 +And then we will first look into a feet forward +neural network language models. + +0:08:14.454 --> 0:08:23.706 +And there we will still have this approximation. 
+ +0:08:23.706 --> 0:08:33.902 +We have before we are looking only at a fixed +window. + +0:08:34.154 --> 0:08:35.030 +The case. + +0:08:35.030 --> 0:08:38.270 +However, we have the umbellent here. + +0:08:38.270 --> 0:08:43.350 +That's why they're already better in order +to generalize. + +0:08:44.024 --> 0:08:53.169 +And then at the end we'll look at language +models where we then have the additional advantage. + +0:08:53.093 --> 0:09:04.317 +Case that we need to have a fixed history, +but in theory we can model arbitrary long dependencies. + +0:09:04.304 --> 0:09:12.687 +And we talked about on Tuesday where it is +not clear what type of information it is to. + +0:09:16.396 --> 0:09:24.981 +So in general molecular networks I normally +learn to prove that they perform some tasks. + +0:09:25.325 --> 0:09:33.472 +We have the structure and we are learning +them from samples so that is similar to what + +0:09:33.472 --> 0:09:34.971 +we have before. + +0:09:34.971 --> 0:09:42.275 +So now we have the same task here, a language +model giving input or forwards. + +0:09:42.642 --> 0:09:48.959 +And is somewhat originally motivated by human +brain. + +0:09:48.959 --> 0:10:00.639 +However, when you now need to know about artificial +neural networks, it's hard to get similarity. + +0:10:00.540 --> 0:10:02.889 +There seemed to be not that point. + +0:10:03.123 --> 0:10:11.014 +So what they are mainly doing is summoning +multiplication and then one non-linear activation. + +0:10:12.692 --> 0:10:16.085 +So the basic units are these type of. + +0:10:17.937 --> 0:10:29.891 +Perceptron basic blocks which we have and +this does processing so we have a fixed number + +0:10:29.891 --> 0:10:36.070 +of input features and that will be important. + +0:10:36.096 --> 0:10:39.689 +So we have here numbers to xn as input. + +0:10:40.060 --> 0:10:53.221 +And this makes partly of course language processing +difficult. + +0:10:54.114 --> 0:10:57.609 +So we have to model this time on and then +go stand home and model. + +0:10:58.198 --> 0:11:02.099 +Then we are having weights, which are the +parameters and the number of weights exactly + +0:11:02.099 --> 0:11:03.668 +the same as the number of weights. + +0:11:04.164 --> 0:11:06.322 +Of input features. + +0:11:06.322 --> 0:11:15.068 +Sometimes he has his fires in there, and then +it's not really an input from. + +0:11:15.195 --> 0:11:19.205 +And what you then do is multiply. + +0:11:19.205 --> 0:11:26.164 +Each input resists weight and then you sum +it up and then. + +0:11:26.606 --> 0:11:34.357 +What is then additionally later important +is that we have an activation function and + +0:11:34.357 --> 0:11:42.473 +it's important that this activation function +is non linear, so we come to just a linear. + +0:11:43.243 --> 0:11:54.088 +And later it will be important that this is +differentiable because otherwise all the training. + +0:11:54.714 --> 0:12:01.907 +This model by itself is not very powerful. + +0:12:01.907 --> 0:12:10.437 +It was originally shown that this is not powerful. + +0:12:10.710 --> 0:12:19.463 +However, there is a very easy extension, the +multi layer perceptual, and then things get + +0:12:19.463 --> 0:12:20.939 +very powerful. + +0:12:21.081 --> 0:12:27.719 +The thing is you just connect a lot of these +in this layer of structures and we have our + +0:12:27.719 --> 0:12:35.029 +input layer where we have the inputs and our +hidden layer at least one where there is everywhere. + +0:12:35.395 --> 0:12:39.817 +And then we can combine them all to do that. 
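To make this concrete, here is a minimal NumPy sketch of a single perceptron unit and of a small multi-layer perceptron forward pass; all sizes, weights and inputs are made up for illustration. Each unit is just a weighted sum of its fixed number of inputs plus a bias, followed by a non-linear activation, and a layer computes many such units at once.

```python
import numpy as np

def relu(x):
    # non-linear and (almost everywhere) differentiable activation
    return np.maximum(0.0, x)

# a single unit: weighted sum of a fixed number of inputs, plus bias, then activation
x = np.array([0.5, -1.0, 2.0])      # n input features
w = np.array([0.1, 0.4, -0.3])      # one weight per input
b = 0.2                             # bias
unit_output = relu(w @ x + b)

# a multi-layer perceptron: a whole layer is just a weight matrix
W1 = np.random.randn(4, 3) * 0.1    # hidden layer: 4 units, each sees all 3 inputs
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1    # output layer: 2 units on top of the hidden layer
b2 = np.zeros(2)
hidden = relu(W1 @ x + b1)
output = W2 @ hidden + b2
print(unit_output, hidden, output)
```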
+ +0:12:40.260 --> 0:12:48.320 +The input layer is of course somewhat given +by a problem of dimension. + +0:12:48.320 --> 0:13:00.013 +The outward layer is also given by your dimension, +but the hidden layer is of course a hyperparameter. + +0:13:01.621 --> 0:13:08.802 +So let's start with the first question, now +more language related, and that is how we represent. + +0:13:09.149 --> 0:13:23.460 +So we've seen here we have the but the question +is now how can we put in a word into this? + +0:13:26.866 --> 0:13:34.117 +Noise: The first thing we're able to be better +is by the fact that like you are said,. + +0:13:34.314 --> 0:13:43.028 +That is not that easy because the continuous +vector will come to that. + +0:13:43.028 --> 0:13:50.392 +So from the neo-network we can directly put +in the bedding. + +0:13:50.630 --> 0:13:57.277 +But if we need to input a word into the needle +network, it has to be something which is easily + +0:13:57.277 --> 0:13:57.907 +defined. + +0:13:59.079 --> 0:14:12.492 +The one hood encoding, and then we have one +out of encoding, so one value is one, and all + +0:14:12.492 --> 0:14:15.324 +the others is the. + +0:14:16.316 --> 0:14:25.936 +That means we are always dealing with fixed +vocabulary because what said is we cannot. + +0:14:26.246 --> 0:14:38.017 +So you cannot easily extend your vocabulary +because if you mean you would extend your vocabulary. + +0:14:39.980 --> 0:14:41.502 +That's also motivating. + +0:14:41.502 --> 0:14:43.722 +We're talked about biperriagoding. + +0:14:43.722 --> 0:14:45.434 +That's a nice thing there. + +0:14:45.434 --> 0:14:47.210 +We have a fixed vocabulary. + +0:14:48.048 --> 0:14:55.804 +The big advantage of this one encoding is +that we don't implicitly sum our implement + +0:14:55.804 --> 0:15:04.291 +similarity between words, but really re-learning +because if you first think about this, this + +0:15:04.291 --> 0:15:06.938 +is a very, very inefficient. + +0:15:07.227 --> 0:15:15.889 +So you need like to represent end words, you +need a dimension of an end dimensional vector. + +0:15:16.236 --> 0:15:24.846 +Imagine you could do binary encoding so you +could represent words as binary vectors. + +0:15:24.846 --> 0:15:26.467 +Then you would. + +0:15:26.806 --> 0:15:31.177 +Will be significantly more efficient. + +0:15:31.177 --> 0:15:36.813 +However, then you have some implicit similarity. + +0:15:36.813 --> 0:15:39.113 +Some numbers share. + +0:15:39.559 --> 0:15:46.958 +Would somehow be bad because you would force +someone to do this by hand or clear how to + +0:15:46.958 --> 0:15:47.631 +define. + +0:15:48.108 --> 0:15:55.135 +So therefore currently this is the most successful +approach to just do this one watch. + +0:15:55.095 --> 0:15:59.563 +Representations, so we take a fixed vocabulary. + +0:15:59.563 --> 0:16:06.171 +We map each word to the inise, and then we +represent a word like this. + +0:16:06.171 --> 0:16:13.246 +So if home will be one, the representation +will be one zero zero zero, and. + +0:16:14.514 --> 0:16:30.639 +But this dimension here is a vocabulary size +and that is quite high, so we are always trying + +0:16:30.639 --> 0:16:33.586 +to be efficient. + +0:16:33.853 --> 0:16:43.792 +We are doing then some type of efficiency +because typically we are having this next layer. + +0:16:44.104 --> 0:16:51.967 +It can be still maybe two hundred or five +hundred or one thousand neurons, but this is + +0:16:51.967 --> 0:16:53.323 +significantly. 
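Here is a small sketch of the one-hot encoding just described, with a made-up toy vocabulary: every word gets a fixed index, and its vector has the vocabulary size as its dimension, with a single one and zeros everywhere else.

```python
import numpy as np

# toy vocabulary; in practice this is a fixed list of e.g. 10,000 words
vocab = ["home", "go", "walk", "the", "<unk>"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # vector length = vocabulary size, exactly one entry is 1
    v = np.zeros(len(vocab))
    v[word2id.get(word, word2id["<unk>"])] = 1.0
    return v

print(one_hot("home"))   # [1. 0. 0. 0. 0.]
print(one_hot("walk"))   # [0. 0. 1. 0. 0.]
```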
+ +0:16:53.713 --> 0:17:03.792 +You can learn that directly and there we then +have similarity between words. + +0:17:03.792 --> 0:17:07.458 +Then it is that some words. + +0:17:07.807 --> 0:17:14.772 +But the nice thing is that this is then learned +that we are not need to hand define that. + +0:17:17.117 --> 0:17:32.742 +We'll come later to the explicit architecture +of the neural language one, and there we can + +0:17:32.742 --> 0:17:35.146 +see how it's. + +0:17:38.418 --> 0:17:44.857 +So we're seeing that the other one or our +representation always has the same similarity. + +0:17:45.105 --> 0:17:59.142 +Then we're having this continuous factor which +is a lot smaller dimension and that's important + +0:17:59.142 --> 0:18:00.768 +for later. + +0:18:01.121 --> 0:18:06.989 +What we are doing then is learning these representations +so that they are best for language. + +0:18:07.487 --> 0:18:14.968 +So the representations are implicitly training +the language for the cards. + +0:18:14.968 --> 0:18:19.058 +This is the best way for doing language. + +0:18:19.479 --> 0:18:32.564 +And the nice thing that was found out later +is these representations are really good. + +0:18:33.153 --> 0:18:39.253 +And that is why they are now even called word +embeddings by themselves and used for other + +0:18:39.253 --> 0:18:39.727 +tasks. + +0:18:40.360 --> 0:18:49.821 +And they are somewhat describing very different +things so they can describe and semantic similarities. + +0:18:49.789 --> 0:18:58.650 +Are looking at the very example of today mass +vector space by adding words and doing some + +0:18:58.650 --> 0:19:00.618 +interesting things. + +0:19:00.940 --> 0:19:11.178 +So they got really like the first big improvement +when switching to neurostaff. + +0:19:11.491 --> 0:19:20.456 +Are like part of the model, but with more +complex representation, but they are the basic + +0:19:20.456 --> 0:19:21.261 +models. + +0:19:23.683 --> 0:19:36.979 +In the output layer we are also having one +output layer structure and a connection function. + +0:19:36.997 --> 0:19:46.525 +That is, for language learning we want to +predict what is the most common word. + +0:19:47.247 --> 0:19:56.453 +And that can be done very well with this so +called soft back layer, where again the dimension. + +0:19:56.376 --> 0:20:02.825 +Vocabulary size, so this is a vocabulary size, +and again the case neural represents the case + +0:20:02.825 --> 0:20:03.310 +class. + +0:20:03.310 --> 0:20:09.759 +So in our case we have again one round representation, +someone saying this is a core report. + +0:20:10.090 --> 0:20:17.255 +Our probability distribution is a probability +distribution over all works, so the case entry + +0:20:17.255 --> 0:20:21.338 +tells us how probable is that the next word +is this. + +0:20:22.682 --> 0:20:33.885 +So we need to have some probability distribution +at our output in order to achieve that this + +0:20:33.885 --> 0:20:37.017 +activation function goes. + +0:20:37.197 --> 0:20:46.944 +And we can achieve that with a soft max activation +we take the input to the form of the value, + +0:20:46.944 --> 0:20:47.970 +and then. + +0:20:48.288 --> 0:20:58.021 +So by having this type of activation function +we are really getting this type of probability. + +0:20:59.019 --> 0:21:15.200 +At the beginning was also very challenging +because again we have this inefficient representation. + +0:21:15.235 --> 0:21:29.799 +You can imagine that something over is maybe +a bit inefficient with cheap users, but definitely. 
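A minimal sketch of the softmax activation described here, with made-up scores: it turns the arbitrary output values of the last layer into a probability distribution over the vocabulary, one probability per word, summing to one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the maximum for numerical stability
    return e / e.sum()          # normalizing makes the values sum to 1

scores = np.array([2.0, 0.5, -1.0, 0.1])   # one score per vocabulary word
probs = softmax(scores)
print(probs, probs.sum())                  # a valid probability distribution
```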
+ +0:21:36.316 --> 0:21:44.072 +And then for training the models that will +be fine, so we have to use architecture now. + +0:21:44.264 --> 0:21:48.491 +We need to minimize the arrow. + +0:21:48.491 --> 0:21:53.264 +Are we doing it taking the output? + +0:21:53.264 --> 0:21:58.174 +We are comparing it to our targets. + +0:21:58.298 --> 0:22:03.830 +So one important thing is by training them. + +0:22:03.830 --> 0:22:07.603 +How can we measure the error? + +0:22:07.603 --> 0:22:12.758 +So what is if we are training the ideas? + +0:22:13.033 --> 0:22:15.163 +And how well we are measuring. + +0:22:15.163 --> 0:22:19.768 +It is in natural language processing, typically +the cross entropy. + +0:22:19.960 --> 0:22:35.575 +And that means we are comparing the target +with the output. + +0:22:35.335 --> 0:22:44.430 +It gets optimized and you're seeing that this, +of course, makes it again very nice and easy + +0:22:44.430 --> 0:22:49.868 +because our target is again a one-hour representation. + +0:22:50.110 --> 0:23:00.116 +So all of these are always zero, and what +we are then doing is we are taking the one. + +0:23:00.100 --> 0:23:04.615 +And we only need to multiply the one with +the logarithm here, and that is all the feedback + +0:23:04.615 --> 0:23:05.955 +signal we are taking here. + +0:23:06.946 --> 0:23:13.885 +Of course, this is not always influenced by +all the others. + +0:23:13.885 --> 0:23:17.933 +Why is this influenced by all the. + +0:23:24.304 --> 0:23:34.382 +Have the activation function, which is the +current activation divided by some of the others. + +0:23:34.354 --> 0:23:45.924 +Otherwise it could easily just increase this +volume and ignore the others, but if you increase + +0:23:45.924 --> 0:23:49.090 +one value all the others. + +0:23:51.351 --> 0:23:59.912 +Then we can do with neometrics one very nice +and easy type of training that is done in all + +0:23:59.912 --> 0:24:07.721 +the neometrics where we are now calculating +our error and especially the gradient. + +0:24:07.707 --> 0:24:11.640 +So in which direction does the error show? + +0:24:11.640 --> 0:24:18.682 +And then if we want to go to a smaller arrow +that's what we want to achieve. + +0:24:18.682 --> 0:24:26.638 +We are taking the inverse direction of the +gradient and thereby trying to minimize our + +0:24:26.638 --> 0:24:27.278 +error. + +0:24:27.287 --> 0:24:31.041 +And we have to do that, of course, for all +the weights. + +0:24:31.041 --> 0:24:36.672 +And to calculate the error of all the weights, +we won't do the defectvagation here. + +0:24:36.672 --> 0:24:41.432 +But but what you can do is you can propagate +the arrow which measured. + +0:24:41.432 --> 0:24:46.393 +At the end you can propagate it back its basic +mass and basic derivation. + +0:24:46.706 --> 0:24:58.854 +For each way in your model measure how much +you contribute to the error and then change + +0:24:58.854 --> 0:25:01.339 +it in a way that. + +0:25:04.524 --> 0:25:11.625 +So to summarize what for at least machine +translation on your machine translation should + +0:25:11.625 --> 0:25:19.044 +remember, you know, to understand on this problem +is that this is how a multilayer first the + +0:25:19.044 --> 0:25:20.640 +problem looks like. + +0:25:20.580 --> 0:25:28.251 +There are fully two layers and no connections. + +0:25:28.108 --> 0:25:29.759 +Across layers. + +0:25:29.829 --> 0:25:35.153 +And what they're doing is always just a waited +sum here and then in activation production. 
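Because the target is a one-hot vector, the cross-entropy loss described above reduces to the negative log probability of the correct word. A small made-up example, together with the standard fact that the gradient of softmax plus cross-entropy with respect to the pre-softmax scores is simply the predicted distribution minus the target:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.05, 0.05])   # softmax output over a toy vocabulary
target = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot target: the correct word is index 1

# full cross-entropy: -sum(target * log(probs)) ...
loss = -np.sum(target * np.log(probs))
# ... but with a one-hot target only the term of the correct word survives
assert np.isclose(loss, -np.log(probs[1]))

# feedback signal that is backpropagated through the rest of the network
grad_scores = probs - target
print(loss, grad_scores)
```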
+ +0:25:35.415 --> 0:25:38.792 +And in order to train you have this forward +and backward pass. + +0:25:39.039 --> 0:25:41.384 +So We Put in Here. + +0:25:41.281 --> 0:25:41.895 +Inputs. + +0:25:41.895 --> 0:25:45.347 +We have some random values at the beginning. + +0:25:45.347 --> 0:25:47.418 +Then calculate the output. + +0:25:47.418 --> 0:25:54.246 +We are measuring how our error is propagating +the arrow back and then changing our model + +0:25:54.246 --> 0:25:57.928 +in a way that we hopefully get a smaller arrow. + +0:25:57.928 --> 0:25:59.616 +And then that is how. + +0:26:01.962 --> 0:26:12.893 +So before we're coming into our neural networks +language models, how can we use this type of + +0:26:12.893 --> 0:26:17.595 +neural network to do language modeling? + +0:26:23.103 --> 0:26:33.157 +So how can we use them in natural language +processing, especially machine translation? + +0:26:33.157 --> 0:26:41.799 +The first idea of using them was to estimate: +So we have seen that the output can be monitored + +0:26:41.799 --> 0:26:42.599 +here as well. + +0:26:43.603 --> 0:26:50.311 +A probability distribution and if we have +a full vocabulary we could mainly hear estimating + +0:26:50.311 --> 0:26:56.727 +how probable each next word is and then use +that in our language model fashion as we've + +0:26:56.727 --> 0:26:58.112 +done it last time. + +0:26:58.112 --> 0:27:03.215 +We got the probability of a full sentence +as a product of individual. + +0:27:04.544 --> 0:27:12.820 +And: That was done in the ninety seven years +and it's very easy to integrate it into this + +0:27:12.820 --> 0:27:14.545 +lot of the year model. + +0:27:14.545 --> 0:27:19.570 +So we have said that this is how the locker +here model looks like. + +0:27:19.570 --> 0:27:25.119 +So we are searching the best translation which +minimizes each waste time. + +0:27:25.125 --> 0:27:26.362 +The Future About You. + +0:27:26.646 --> 0:27:31.647 +We have that with minimum error rate training +if you can remember where we search for the + +0:27:31.647 --> 0:27:32.147 +optimal. + +0:27:32.512 --> 0:27:40.422 +The language model and many others, and we +can just add here a neuromodel, have a knock + +0:27:40.422 --> 0:27:41.591 +of features. + +0:27:41.861 --> 0:27:45.761 +So that is quite easy as said. + +0:27:45.761 --> 0:27:53.183 +That was how statistical machine translation +was improved. + +0:27:53.183 --> 0:27:57.082 +You just add one more feature. + +0:27:58.798 --> 0:28:07.631 +So how can we model the language modeling +with a network? + +0:28:07.631 --> 0:28:16.008 +So what we have to do is model the probability +of the. + +0:28:16.656 --> 0:28:25.047 +The problem in general in the head is that +mostly we haven't seen long sequences. + +0:28:25.085 --> 0:28:35.650 +Mostly we have to beg off to very short sequences +and we are working on this discrete space where + +0:28:35.650 --> 0:28:36.944 +similarity. + +0:28:37.337 --> 0:28:50.163 +So the idea is if we have now a real network, +we can make words into continuous representation. + +0:28:51.091 --> 0:29:00.480 +And the structure then looks like this, so +this is a basic still feed forward neural network. + +0:29:01.361 --> 0:29:10.645 +We are doing this at perximation again, so +we are not putting in all previous words, but + +0:29:10.645 --> 0:29:11.375 +it is. + +0:29:11.691 --> 0:29:25.856 +This is done because we said that in the real +network we can have only a fixed type of input. 
+ +0:29:25.945 --> 0:29:31.886 +You can only do a fixed step and then we'll +be doing that exactly in minus one. + +0:29:33.593 --> 0:29:39.536 +So here you are, for example, three words +and three different words. + +0:29:39.536 --> 0:29:50.704 +One and all the others are: And then we're +having the first layer of the neural network, + +0:29:50.704 --> 0:29:56.230 +which like you learns is word embedding. + +0:29:57.437 --> 0:30:04.976 +There is one thing which is maybe special +compared to the standard neural member. + +0:30:05.345 --> 0:30:11.918 +So the representation of this word we want +to learn first of all position independence. + +0:30:11.918 --> 0:30:19.013 +So we just want to learn what is the general +meaning of the word independent of its neighbors. + +0:30:19.299 --> 0:30:26.239 +And therefore the representation you get here +should be the same as if in the second position. + +0:30:27.247 --> 0:30:36.865 +The nice thing you can achieve is that this +weights which you're using here you're reusing + +0:30:36.865 --> 0:30:41.727 +here and reusing here so we are forcing them. + +0:30:42.322 --> 0:30:48.360 +You then learn your word embedding, which +is contextual, independent, so it's the same + +0:30:48.360 --> 0:30:49.678 +for each position. + +0:30:49.909 --> 0:31:03.482 +So that's the idea that you want to learn +the representation first of and you don't want + +0:31:03.482 --> 0:31:07.599 +to really use the context. + +0:31:08.348 --> 0:31:13.797 +That of course might have a different meaning +depending on where it stands, but we'll learn + +0:31:13.797 --> 0:31:14.153 +that. + +0:31:14.514 --> 0:31:20.386 +So first we are learning here representational +words, which is just the representation. + +0:31:20.760 --> 0:31:32.498 +Normally we said in neurons all input neurons +here are connected to all here, but we're reducing + +0:31:32.498 --> 0:31:37.338 +the complexity by saying these neurons. + +0:31:37.857 --> 0:31:47.912 +Then we have a lot denser representation that +is our three word embedded in here, and now + +0:31:47.912 --> 0:31:57.408 +we are learning this interaction between words, +a direction between words not based. + +0:31:57.677 --> 0:32:08.051 +So we have at least one connected layer here, +which takes a three embedding input and then + +0:32:08.051 --> 0:32:14.208 +learns a new embedding which now represents +the full. + +0:32:15.535 --> 0:32:16.551 +Layers. + +0:32:16.551 --> 0:32:27.854 +It is the output layer which now and then +again the probability distribution of all the. + +0:32:28.168 --> 0:32:48.612 +So here is your target prediction. + +0:32:48.688 --> 0:32:56.361 +The nice thing is that you learn everything +together, so you don't have to teach them what + +0:32:56.361 --> 0:32:58.722 +a good word representation. + +0:32:59.079 --> 0:33:08.306 +Training the whole number together, so it +learns what a good representation for a word + +0:33:08.306 --> 0:33:13.079 +you get in order to perform your final task. + +0:33:15.956 --> 0:33:19.190 +Yeah, that is the main idea. + +0:33:20.660 --> 0:33:32.731 +This is now a days often referred to as one +way of self supervise learning. + +0:33:33.053 --> 0:33:37.120 +The output is the next word and the input +is the previous word. + +0:33:37.377 --> 0:33:46.783 +But it's not really that we created labels, +but we artificially created a task out of unlabeled. 
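Here is a minimal NumPy sketch of the feed-forward n-gram language model just described; the vocabulary, sizes and weights are made up. The same embedding matrix is reused for every context position, one hidden layer combines the n-1 context embeddings, a softmax over the vocabulary predicts the next word, and the training pairs are created automatically from plain text.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "i", "go", "home", "to", "walk", "</s>"]   # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}
V, E, H, N = len(vocab), 8, 16, 3          # vocab, embedding, hidden size, n-gram order

emb = rng.normal(0, 0.1, (V, E))           # shared, position-independent word embeddings
W1  = rng.normal(0, 0.1, ((N - 1) * E, H)) # combines the N-1 context embeddings
b1  = np.zeros(H)
W2  = rng.normal(0, 0.1, (H, V))           # output layer over the full vocabulary
b2  = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    x = np.concatenate([emb[i] for i in context_ids])   # look up each context word
    h = np.tanh(x @ W1 + b1)
    return softmax(h @ W2 + b2)                         # P(next word | N-1 previous words)

# training pairs come for free from plain text: predict each word from its history
sentence = ["<s>", "i", "go", "home", "</s>"]
ids = [word2id[w] for w in sentence]
pairs = [(ids[i - (N - 1):i], ids[i]) for i in range(N - 1, len(ids))]
for ctx, tgt in pairs:
    p = next_word_probs(ctx)
    print([vocab[c] for c in ctx], "->", vocab[tgt], round(float(p[tgt]), 3))
```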
+ +0:33:46.806 --> 0:33:59.434 +We just had pure text, and then we created +the telescopes by predicting the next word, + +0:33:59.434 --> 0:34:18.797 +which is: Say we have like two sentences like +go home and the second one is go to prepare. + +0:34:18.858 --> 0:34:30.135 +And then we have to predict the next series +and my questions in the labels for the album. + +0:34:31.411 --> 0:34:42.752 +We model this as one vector with like probability +for possible weights starting again. + +0:34:44.044 --> 0:34:57.792 +Multiple examples, so then you would twice +train one to predict KRT, one to predict home, + +0:34:57.792 --> 0:35:02.374 +and then of course the easel. + +0:35:04.564 --> 0:35:13.568 +Is a very good point, so you are not aggregating +examples beforehand, but you are taking each. + +0:35:19.259 --> 0:35:37.204 +So when you do it simultaneously learn the +projection layer and the endgram for abilities + +0:35:37.204 --> 0:35:39.198 +and then. + +0:35:39.499 --> 0:35:47.684 +And later analyze it that these representations +are very powerful. + +0:35:47.684 --> 0:35:56.358 +The task is just a very important task to +model what is the next word. + +0:35:56.816 --> 0:35:59.842 +Is motivated by nowadays. + +0:35:59.842 --> 0:36:10.666 +In order to get the meaning of the word you +have to look at its companies where the context. + +0:36:10.790 --> 0:36:16.048 +If you read texts in days of word which you +have never seen, you often can still estimate + +0:36:16.048 --> 0:36:21.130 +the meaning of this word because you do not +know how it is used, and this is typically + +0:36:21.130 --> 0:36:22.240 +used as a city or. + +0:36:22.602 --> 0:36:25.865 +Just imagine you read a text about some city. + +0:36:25.865 --> 0:36:32.037 +Even if you've never seen the city before, +you often know from the context of how it's + +0:36:32.037 --> 0:36:32.463 +used. + +0:36:34.094 --> 0:36:42.483 +So what is now the big advantage of using +neural neckworks? + +0:36:42.483 --> 0:36:51.851 +So just imagine we have to estimate that I +bought my first iPhone. + +0:36:52.052 --> 0:36:56.608 +So you have to monitor the probability of +ad hitting them. + +0:36:56.608 --> 0:37:00.237 +Now imagine iPhone, which you have never seen. + +0:37:00.600 --> 0:37:11.588 +So all the techniques we had last time at +the end, if you haven't seen iPhone you will + +0:37:11.588 --> 0:37:14.240 +always fall back to. + +0:37:15.055 --> 0:37:26.230 +You have no idea how to deal that you won't +have seen the diagram, the trigram, and all + +0:37:26.230 --> 0:37:27.754 +the others. + +0:37:28.588 --> 0:37:43.441 +If you're having this type of model, what +does it do if you have my first and then something? + +0:37:43.483 --> 0:37:50.270 +Maybe this representation is really messed +up because it's mainly on a cavalry word. + +0:37:50.730 --> 0:37:57.793 +However, you have still these two information +that two words before was first and therefore. + +0:37:58.098 --> 0:38:06.954 +So you have a lot of information in order +to estimate how good it is. + +0:38:06.954 --> 0:38:13.279 +There could be more information if you know +that. + +0:38:13.593 --> 0:38:25.168 +So all this type of modeling we can do that +we couldn't do beforehand because we always + +0:38:25.168 --> 0:38:25.957 +have. + +0:38:27.027 --> 0:38:40.466 +Good point, so typically you would have one +token for a vocabulary so that you could, for + +0:38:40.466 --> 0:38:45.857 +example: All you're doing by parent coding +when you have a fixed thing. 
+ +0:38:46.226 --> 0:38:49.437 +Oh yeah, you have to do something like that +that that that's true. + +0:38:50.050 --> 0:38:55.420 +So yeah, auto vocabulary are by thanking where +you don't have other words written. + +0:38:55.735 --> 0:39:06.295 +But then, of course, you might be getting +very long previous things, and your sequence + +0:39:06.295 --> 0:39:11.272 +length gets very long for unknown words. + +0:39:17.357 --> 0:39:20.067 +Any more questions to the basic stable. + +0:39:23.783 --> 0:39:36.719 +For this model, what we then want to continue +is looking a bit into how complex or how we + +0:39:36.719 --> 0:39:39.162 +can make things. + +0:39:40.580 --> 0:39:49.477 +Because at the beginning there was definitely +a major challenge, it's still not that easy, + +0:39:49.477 --> 0:39:58.275 +and I mean our likeers followed the talk about +their environmental fingerprint and so on. + +0:39:58.478 --> 0:40:05.700 +So this calculation is not really heavy, and +if you build systems yourselves you have to + +0:40:05.700 --> 0:40:06.187 +wait. + +0:40:06.466 --> 0:40:14.683 +So it's good to know a bit about how complex +things are in order to do a good or efficient + +0:40:14.683 --> 0:40:15.405 +affair. + +0:40:15.915 --> 0:40:24.211 +So one thing where most of the calculation +really happens is if you're doing it in a bad + +0:40:24.211 --> 0:40:24.677 +way. + +0:40:25.185 --> 0:40:33.523 +So in generally all these layers we are talking +about networks and zones fancy. + +0:40:33.523 --> 0:40:46.363 +In the end it is: So what you have to do in +order to calculate here, for example, these + +0:40:46.363 --> 0:40:52.333 +activations: So make it simple a bit. + +0:40:52.333 --> 0:41:06.636 +Let's see where outputs and you just do metric +multiplication between your weight matrix and + +0:41:06.636 --> 0:41:08.482 +your input. + +0:41:08.969 --> 0:41:20.992 +So that is why computers are so powerful for +neural networks because they are very good + +0:41:20.992 --> 0:41:22.358 +in doing. + +0:41:22.782 --> 0:41:28.013 +However, for some type for the embedding layer +this is really very inefficient. + +0:41:28.208 --> 0:41:39.652 +So because remember we're having this one +art encoding in this input, it's always like + +0:41:39.652 --> 0:41:42.940 +one and everything else. + +0:41:42.940 --> 0:41:47.018 +It's zero if we're doing this. + +0:41:47.387 --> 0:41:55.552 +So therefore you can do at least the forward +pass a lot more efficient if you don't really + +0:41:55.552 --> 0:42:01.833 +do this calculation, but you can select the +one color where there is. + +0:42:01.833 --> 0:42:07.216 +Therefore, you also see this is called your +word embedding. + +0:42:08.348 --> 0:42:19.542 +So the weight matrix of the embedding layer +is just that in each color you have the embedding + +0:42:19.542 --> 0:42:20.018 +of. + +0:42:20.580 --> 0:42:30.983 +So this is like how your initial weights look +like and how you can interpret or understand. + +0:42:32.692 --> 0:42:39.509 +And this is already relatively important because +remember this is a huge dimensional thing. + +0:42:39.509 --> 0:42:46.104 +So typically here we have the number of words +is ten thousand or so, so this is the word + +0:42:46.104 --> 0:42:51.365 +embeddings metrics, typically the most expensive +to calculate metrics. + +0:42:51.451 --> 0:42:59.741 +Because it's the largest one there, we have +ten thousand entries, while for the hours we + +0:42:59.741 --> 0:43:00.393 +maybe. 
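A small sketch of why the embedding layer is cheap in the forward pass: with a one-hot input, the full matrix multiplication gives exactly the same result as selecting one row of the embedding matrix, so a table lookup replaces the multiplication (sizes are made up).

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = 10000, 300                  # vocabulary size, embedding size
emb = rng.normal(size=(V, E))      # embedding matrix: one row per word

word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

slow = one_hot @ emb               # full multiplication: touches all V * E weights
fast = emb[word_id]                # lookup: just read row 42
assert np.allclose(slow, fast)
```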
+ +0:43:00.660 --> 0:43:03.408 +So therefore the addition to a little bit +more to make this. + +0:43:06.206 --> 0:43:10.538 +Then you can go where else the calculations +are very difficult. + +0:43:10.830 --> 0:43:20.389 +So here we then have our network, so we have +the word embeddings. + +0:43:20.389 --> 0:43:29.514 +We have one hidden there, and then you can +look how difficult. + +0:43:30.270 --> 0:43:38.746 +Could save a lot of calculation by not really +calculating the selection because that is always. + +0:43:40.600 --> 0:43:46.096 +The number of calculations you have to do +here is so. + +0:43:46.096 --> 0:43:51.693 +The length of this layer is minus one type +projection. + +0:43:52.993 --> 0:43:56.321 +That is a hint size. + +0:43:56.321 --> 0:44:10.268 +So the first step of calculation for this +metrics modification is how much calculation. + +0:44:10.730 --> 0:44:18.806 +Then you have to do some activation function +and then you have to do again the calculation. + +0:44:19.339 --> 0:44:27.994 +Here we need the vocabulary size because we +need to calculate the probability for each + +0:44:27.994 --> 0:44:29.088 +next word. + +0:44:29.889 --> 0:44:43.155 +And if you look at these numbers, so if you +have a projector size of and a vocabulary size + +0:44:43.155 --> 0:44:53.876 +of, you see: And that is why there has been +especially at the beginning some ideas how + +0:44:53.876 --> 0:44:55.589 +we can reduce. + +0:44:55.956 --> 0:45:01.942 +And if we really need to calculate all of +our capabilities, or if we can calculate only + +0:45:01.942 --> 0:45:02.350 +some. + +0:45:02.582 --> 0:45:10.871 +And there again the one important thing to +think about is for what will use my language + +0:45:10.871 --> 0:45:11.342 +mom. + +0:45:11.342 --> 0:45:19.630 +I can use it for generations and that's what +we will see next week in an achiever which + +0:45:19.630 --> 0:45:22.456 +really is guiding the search. + +0:45:23.123 --> 0:45:30.899 +If it just uses a feature, we do not want +to use it for generations, but we want to only + +0:45:30.899 --> 0:45:32.559 +know how probable. + +0:45:32.953 --> 0:45:39.325 +There we might not be really interested in +all the probabilities, but we already know + +0:45:39.325 --> 0:45:46.217 +we just want to know the probability of this +one word, and then it might be very inefficient + +0:45:46.217 --> 0:45:49.403 +to really calculate all the probabilities. + +0:45:51.231 --> 0:45:52.919 +And how can you do that so? + +0:45:52.919 --> 0:45:56.296 +Initially, for example, the people look into +shortness. + +0:45:56.756 --> 0:46:02.276 +So this calculation at the end is really very +expensive. + +0:46:02.276 --> 0:46:05.762 +So can we make that more efficient. + +0:46:05.945 --> 0:46:17.375 +And most words occur very rarely, and maybe +we don't need anger, and so there we may want + +0:46:17.375 --> 0:46:18.645 +to focus. + +0:46:19.019 --> 0:46:29.437 +And so they use the smaller vocabulary, which +is maybe. + +0:46:29.437 --> 0:46:34.646 +This layer is used from to. + +0:46:34.646 --> 0:46:37.623 +Then you merge. + +0:46:37.937 --> 0:46:45.162 +So you're taking if the word is in the shortest, +so in the two thousand most frequent words. + +0:46:45.825 --> 0:46:58.299 +Of this short word by some normalization here, +and otherwise you take a back of probability + +0:46:58.299 --> 0:46:59.655 +from the. + +0:47:00.020 --> 0:47:04.933 +It will not be as good, but the idea is okay. 
+ +0:47:04.933 --> 0:47:14.013 +Then we don't have to calculate all these +probabilities here at the end, but we only + +0:47:14.013 --> 0:47:16.042 +have to calculate. + +0:47:19.599 --> 0:47:32.097 +With some type of cost because it means we +don't model the probability of the infrequent + +0:47:32.097 --> 0:47:39.399 +words, and maybe it's even very important to +model. + +0:47:39.299 --> 0:47:46.671 +And one idea is to do what is reported as +so so structured out there. + +0:47:46.606 --> 0:47:49.571 +Network language models you see some years +ago. + +0:47:49.571 --> 0:47:53.154 +People were very creative and giving names +to new models. + +0:47:53.813 --> 0:48:00.341 +And there the idea is that we model the output +vocabulary as a clustered treat. + +0:48:00.680 --> 0:48:06.919 +So you don't need to model all of our bodies +directly, but you are putting words into a + +0:48:06.919 --> 0:48:08.479 +sequence of clusters. + +0:48:08.969 --> 0:48:15.019 +So maybe a very intriguant world is first +in cluster three and then in cluster three. + +0:48:15.019 --> 0:48:21.211 +You have subclusters again and there is subclusters +seven and subclusters and there is. + +0:48:21.541 --> 0:48:40.134 +And this is the path, so that is what was +the man in the past. + +0:48:40.340 --> 0:48:52.080 +And then you can calculate the probability +of the word again just by the product of the + +0:48:52.080 --> 0:48:55.548 +first class of the world. + +0:48:57.617 --> 0:49:07.789 +That it may be more clear where you have this +architecture, so this is all the same. + +0:49:07.789 --> 0:49:13.773 +But then you first predict here which main +class. + +0:49:14.154 --> 0:49:24.226 +Then you go to the appropriate subclass, then +you calculate the probability of the subclass + +0:49:24.226 --> 0:49:26.415 +and maybe the cell. + +0:49:27.687 --> 0:49:35.419 +Anybody have an idea why this is more efficient +or if you do it first, it looks a lot more. + +0:49:42.242 --> 0:49:51.788 +You have to do less calculations, so maybe +if you do it here you have to calculate the + +0:49:51.788 --> 0:49:59.468 +element there, but you don't have to do all +the one hundred thousand. + +0:49:59.980 --> 0:50:06.115 +The probabilities in the set classes that +you're going through and not for all of them. + +0:50:06.386 --> 0:50:18.067 +Therefore, it's more efficient if you don't +need all output proficient because you have + +0:50:18.067 --> 0:50:21.253 +to calculate the class. + +0:50:21.501 --> 0:50:28.936 +So it's only more efficient and scenarios +where you really need to use a language model + +0:50:28.936 --> 0:50:30.034 +to evaluate. + +0:50:35.275 --> 0:50:52.456 +How this works was that you can train first +in your language one on the short list. + +0:50:52.872 --> 0:51:03.547 +But on the input layer you have your full +vocabulary because at the input we saw that + +0:51:03.547 --> 0:51:06.650 +this is not complicated. + +0:51:06.906 --> 0:51:26.638 +And then you can cluster down all your words +here into classes and use that as your glasses. + +0:51:29.249 --> 0:51:34.148 +That is one idea of doing it. + +0:51:34.148 --> 0:51:44.928 +There is also a second idea of doing it, and +again we don't need. + +0:51:45.025 --> 0:51:53.401 +So sometimes it doesn't really need to be +a probability to evaluate. + +0:51:53.401 --> 0:51:56.557 +It's only important that. + +0:51:58.298 --> 0:52:04.908 +And: Here it's called self normalization what +people have done so. 
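Before the self-normalization trick that the lecture turns to next, here is a minimal sketch of the class-factored output layer just described; the class assignment, sizes and weights are all made up. The probability of a word is the probability of its class times the probability of the word within that class, so only the class scores plus the scores inside one class have to be computed instead of the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, V, C = 64, 10000, 100                 # hidden size, vocabulary size, number of classes
words_per_class = V // C                 # assume an even split for simplicity
word_class = np.arange(V) // words_per_class   # made-up word-to-class assignment

W_class = rng.normal(0, 0.1, (H, C))                   # predicts the class
W_word  = rng.normal(0, 0.1, (C, H, words_per_class))  # one small softmax per class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_prob(h, w):
    c = word_class[w]
    p_class = softmax(h @ W_class)[c]                   # C scores instead of V
    p_word  = softmax(h @ W_word[c])[w % words_per_class]
    return p_class * p_word                             # P(w|h) = P(c|h) * P(w|c,h)

h = rng.normal(size=H)
print(word_prob(h, 4242))    # only C + V/C scores computed instead of V
```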
+ +0:52:04.908 --> 0:52:11.562 +We have seen that the probability is in this +soft mechanism always to the input divided + +0:52:11.562 --> 0:52:18.216 +by our normalization, and the normalization +is a summary of the vocabulary to the power + +0:52:18.216 --> 0:52:19.274 +of the spell. + +0:52:19.759 --> 0:52:25.194 +So this is how we calculate the software. + +0:52:25.825 --> 0:52:41.179 +In self normalization of the idea, if this +would be zero then we don't need to calculate + +0:52:41.179 --> 0:52:42.214 +that. + +0:52:42.102 --> 0:52:54.272 +Will be zero, and then you don't even have +to calculate the normalization because it's. + +0:52:54.514 --> 0:53:08.653 +So how can we achieve that and then the nice +thing in your networks? + +0:53:09.009 --> 0:53:23.928 +And now we're just adding a second note with +some either permitted here. + +0:53:24.084 --> 0:53:29.551 +And the second lost just tells us he'll be +strained away. + +0:53:29.551 --> 0:53:31.625 +The locks at is zero. + +0:53:32.352 --> 0:53:38.614 +So then if it's nearly zero at the end we +don't need to calculate this and it's also + +0:53:38.614 --> 0:53:39.793 +very efficient. + +0:53:40.540 --> 0:53:49.498 +One important thing is this, of course, is +only in inference. + +0:53:49.498 --> 0:54:04.700 +During tests we don't need to calculate that +because: You can do a bit of a hyperparameter + +0:54:04.700 --> 0:54:14.851 +here where you do the waiting, so how good +should it be estimating the probabilities and + +0:54:14.851 --> 0:54:16.790 +how much effort? + +0:54:18.318 --> 0:54:28.577 +The only disadvantage is no speed up during +training. + +0:54:28.577 --> 0:54:43.843 +There are other ways of doing that, for example: +Englishman is in case you get it. + +0:54:44.344 --> 0:54:48.540 +Then we are coming very, very briefly like +just one idea. + +0:54:48.828 --> 0:54:53.058 +That there is more things on different types +of language models. + +0:54:53.058 --> 0:54:58.002 +We are having a very short view on restricted +person-based language models. + +0:54:58.298 --> 0:55:08.931 +Talk about recurrent neural networks for language +mines because they have the advantage that + +0:55:08.931 --> 0:55:17.391 +we can even further improve by not having a +continuous representation on. + +0:55:18.238 --> 0:55:23.845 +So there's different types of neural networks. + +0:55:23.845 --> 0:55:30.169 +These are these boxing machines and the interesting. + +0:55:30.330 --> 0:55:39.291 +They have these: And they define like an energy +function on the network, which can be in restricted + +0:55:39.291 --> 0:55:44.372 +balsam machines efficiently calculated in general +and restricted needs. + +0:55:44.372 --> 0:55:51.147 +You only have connection between the input +and the hidden layer, but you don't have connections + +0:55:51.147 --> 0:55:53.123 +in the input or within the. + +0:55:53.393 --> 0:56:00.194 +So you see here you don't have an input output, +you just have an input, and you calculate. + +0:56:00.460 --> 0:56:15.612 +Which of course nicely fits with the idea +we're having, so you can then use this for + +0:56:15.612 --> 0:56:19.177 +an N Gram language. + +0:56:19.259 --> 0:56:25.189 +Retaining the flexibility of the input by +this type of neon networks. + +0:56:26.406 --> 0:56:30.589 +And the advantage of this type of model was +there's. + +0:56:30.550 --> 0:56:37.520 +Very, very fast to integrate it, so that one +was the first one which was used during the + +0:56:37.520 --> 0:56:38.616 +coding model. 
+ +0:56:38.938 --> 0:56:45.454 +The engram language models were that they +were very good and gave performance. + +0:56:45.454 --> 0:56:50.072 +However, calculation still with all these +tricks takes. + +0:56:50.230 --> 0:56:58.214 +We have talked about embest lists so they +generated an embest list of the most probable + +0:56:58.214 --> 0:57:05.836 +outputs and then they took this and best list +scored each entry with a new network. + +0:57:06.146 --> 0:57:09.306 +A language model, and then only change the +order again. + +0:57:09.306 --> 0:57:10.887 +Select based on that which. + +0:57:11.231 --> 0:57:17.187 +The neighboring list is maybe only like hundred +entries. + +0:57:17.187 --> 0:57:21.786 +When decoding you look at several thousand. + +0:57:26.186 --> 0:57:35.196 +Let's look at the context so we have now seen +your language models. + +0:57:35.196 --> 0:57:43.676 +There is the big advantage we can use this +word similarity and. + +0:57:44.084 --> 0:57:52.266 +Remember for engram language ones is not always +minus one words because sometimes you have + +0:57:52.266 --> 0:57:59.909 +to back off or interpolation to lower engrams +and you don't know the previous words. + +0:58:00.760 --> 0:58:04.742 +And however in neural models we always have +all of this importance. + +0:58:04.742 --> 0:58:05.504 +Can some of. + +0:58:07.147 --> 0:58:20.288 +The disadvantage is that you are still limited +in your context, and if you remember the sentence + +0:58:20.288 --> 0:58:22.998 +from last lecture,. + +0:58:22.882 --> 0:58:28.328 +Sometimes you need more context and there +is unlimited context that you might need and + +0:58:28.328 --> 0:58:34.086 +you can always create sentences where you may +need this five context in order to put a good + +0:58:34.086 --> 0:58:34.837 +estimation. + +0:58:35.315 --> 0:58:44.956 +Can also do it different in order to understand +that it makes sense to view language. + +0:58:45.445 --> 0:58:59.510 +So secret labeling tasks are a very common +type of task in language processing where you + +0:58:59.510 --> 0:59:03.461 +have the input sequence. + +0:59:03.323 --> 0:59:05.976 +So you have one output for each input. + +0:59:05.976 --> 0:59:12.371 +Machine translation is not a secret labeling +cast because the number of inputs and the number + +0:59:12.371 --> 0:59:14.072 +of outputs is different. + +0:59:14.072 --> 0:59:20.598 +So you put in a string German which has five +words and the output can be: See, for example, + +0:59:20.598 --> 0:59:24.078 +you always have the same number and the same +number of offices. + +0:59:24.944 --> 0:59:39.779 +And you can more language waddling as that, +and you just say the label for each word is + +0:59:39.779 --> 0:59:43.151 +always a next word. + +0:59:45.705 --> 0:59:50.312 +This is the more generous you can think of +it. + +0:59:50.312 --> 0:59:56.194 +For example, Paddle Speech Taking named Entity +Recognition. + +0:59:58.938 --> 1:00:08.476 +And if you look at now, this output token +and generally sequenced labeling can depend + +1:00:08.476 --> 1:00:26.322 +on: The input tokens are the same so we can +easily model it and they only depend on the + +1:00:26.322 --> 1:00:29.064 +input tokens. + +1:00:31.011 --> 1:00:42.306 +But we can always look at one specific type +of sequence labeling, unidirectional sequence + +1:00:42.306 --> 1:00:44.189 +labeling type. + +1:00:44.584 --> 1:01:00.855 +The probability of the next word only depends +on the previous words that we are having here. 
+ +1:01:01.321 --> 1:01:05.998 +That's also not completely true in language. + +1:01:05.998 --> 1:01:14.418 +Well, the back context might also be helpful +by direction of the model's Google. + +1:01:14.654 --> 1:01:23.039 +We will always admire the probability of the +word given on its history. + +1:01:23.623 --> 1:01:30.562 +And currently there is approximation and sequence +labeling that we have this windowing approach. + +1:01:30.951 --> 1:01:43.016 +So in order to predict this type of word we +always look at the previous three words. + +1:01:43.016 --> 1:01:48.410 +This is this type of windowing model. + +1:01:49.389 --> 1:01:54.780 +If you're into neural networks you recognize +this type of structure. + +1:01:54.780 --> 1:01:57.515 +Also, the typical neural networks. + +1:01:58.938 --> 1:02:11.050 +Yes, yes, so like engram models you can, at +least in some way, prepare for that type of + +1:02:11.050 --> 1:02:12.289 +context. + +1:02:14.334 --> 1:02:23.321 +Are also other types of neonamic structures +which we can use for sequins lately and which + +1:02:23.321 --> 1:02:30.710 +might help us where we don't have this type +of fixed size representation. + +1:02:32.812 --> 1:02:34.678 +That we can do so. + +1:02:34.678 --> 1:02:39.391 +The idea is in recurrent new networks traction. + +1:02:39.391 --> 1:02:43.221 +We are saving complete history in one. + +1:02:43.623 --> 1:02:56.946 +So again we have to do this fixed size representation +because the neural networks always need a habit. + +1:02:57.157 --> 1:03:09.028 +And then the network should look like that, +so we start with an initial value for our storage. + +1:03:09.028 --> 1:03:15.900 +We are giving our first input and calculating +the new. + +1:03:16.196 --> 1:03:35.895 +So again in your network with two types of +inputs: Then you can apply it to the next type + +1:03:35.895 --> 1:03:41.581 +of input and you're again having this. + +1:03:41.581 --> 1:03:46.391 +You're taking this hidden state. + +1:03:47.367 --> 1:03:53.306 +Nice thing is now that you can do now step +by step by step, so all the way over. + +1:03:55.495 --> 1:04:06.131 +The nice thing we are having here now is that +now we are having context information from + +1:04:06.131 --> 1:04:07.206 +all the. + +1:04:07.607 --> 1:04:14.181 +So if you're looking like based on which words +do you, you calculate the probability of varying. + +1:04:14.554 --> 1:04:20.090 +It depends on this part. + +1:04:20.090 --> 1:04:33.154 +It depends on and this hidden state was influenced +by two. + +1:04:33.473 --> 1:04:38.259 +So now we're having something new. + +1:04:38.259 --> 1:04:46.463 +We can model like the word probability not +only on a fixed. + +1:04:46.906 --> 1:04:53.565 +Because the hidden states we are having here +in our Oregon are influenced by all the trivia. + +1:04:56.296 --> 1:05:02.578 +So how is there to be Singapore? + +1:05:02.578 --> 1:05:16.286 +But then we have the initial idea about this +P of given on the history. + +1:05:16.736 --> 1:05:25.300 +So do not need to do any clustering here, +and you also see how things are put together + +1:05:25.300 --> 1:05:26.284 +in order. + +1:05:29.489 --> 1:05:43.449 +The green box this night since we are starting +from the left to the right. + +1:05:44.524 --> 1:05:51.483 +Voices: Yes, that's right, so there are clusters, +and here is also sometimes clustering happens. 
+ +1:05:51.871 --> 1:05:58.687 +The small difference does matter again, so +if you have now a lot of different histories, + +1:05:58.687 --> 1:06:01.674 +the similarity which you have in here. + +1:06:01.674 --> 1:06:08.260 +If two of the histories are very similar, +these representations will be the same, and + +1:06:08.260 --> 1:06:10.787 +then you're treating them again. + +1:06:11.071 --> 1:06:15.789 +Because in order to do the final restriction +you only do a good base on the green box. + +1:06:16.156 --> 1:06:28.541 +So you are now still learning some type of +clustering in there, but you are learning it + +1:06:28.541 --> 1:06:30.230 +implicitly. + +1:06:30.570 --> 1:06:38.200 +The only restriction you're giving is you +have to stall everything that is important + +1:06:38.200 --> 1:06:39.008 +in this. + +1:06:39.359 --> 1:06:54.961 +So it's a different type of limitation, so +you calculate the probability based on the + +1:06:54.961 --> 1:06:57.138 +last words. + +1:06:57.437 --> 1:07:04.430 +And that is how you still need to somehow +cluster things together in order to do efficiently. + +1:07:04.430 --> 1:07:09.563 +Of course, you need to do some type of clustering +because otherwise. + +1:07:09.970 --> 1:07:18.865 +But this is where things get merged together +in this type of hidden representation. + +1:07:18.865 --> 1:07:27.973 +So here the probability of the word first +only depends on this hidden representation. + +1:07:28.288 --> 1:07:33.104 +On the previous words, but they are some other +bottleneck in order to make a good estimation. + +1:07:34.474 --> 1:07:41.231 +So the idea is that we can store all our history +into or into one lecture. + +1:07:41.581 --> 1:07:44.812 +Which is the one that makes it more strong. + +1:07:44.812 --> 1:07:51.275 +Next we come to problems that of course at +some point it might be difficult if you have + +1:07:51.275 --> 1:07:57.811 +very long sequences and you always write all +the information you have on this one block. + +1:07:58.398 --> 1:08:02.233 +Then maybe things get overwritten or you cannot +store everything in there. + +1:08:02.662 --> 1:08:04.514 +So,. + +1:08:04.184 --> 1:08:09.569 +Therefore, yet for short things like single +sentences that works well, but especially if + +1:08:09.569 --> 1:08:15.197 +you think of other tasks and like symbolizations +with our document based on T where you need + +1:08:15.197 --> 1:08:20.582 +to consider the full document, these things +got got a bit more more more complicated and + +1:08:20.582 --> 1:08:23.063 +will learn another type of architecture. + +1:08:24.464 --> 1:08:30.462 +In order to understand these neighbors, it +is good to have all the bus use always. + +1:08:30.710 --> 1:08:33.998 +So this is the unrolled view. + +1:08:33.998 --> 1:08:43.753 +Somewhere you're over the type or in language +over the words you're unrolling a network. + +1:08:44.024 --> 1:08:52.096 +Here is the article and here is the network +which is connected by itself and that is recurrent. + +1:08:56.176 --> 1:09:04.982 +There is one challenge in this networks and +training. + +1:09:04.982 --> 1:09:11.994 +We can train them first of all as forward. + +1:09:12.272 --> 1:09:19.397 +So we don't really know how to train them, +but if you unroll them like this is a feet + +1:09:19.397 --> 1:09:20.142 +forward. + +1:09:20.540 --> 1:09:38.063 +Is exactly the same, so you can measure your +arrows here and be back to your arrows. 
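A minimal NumPy sketch, with toy sizes, of the recurrent language model described above: one hidden state is updated word by word and carries the whole history, so the prediction is no longer limited to a fixed window, and unrolling this loop over the sequence gives the feed-forward structure that can be trained with ordinary backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 7, 8, 16                        # toy vocabulary, embedding and hidden size
emb   = rng.normal(0, 0.1, (V, E))
W_in  = rng.normal(0, 0.1, (E, H))        # current word -> hidden
W_hh  = rng.normal(0, 0.1, (H, H))        # previous hidden -> hidden (the recurrence)
W_out = rng.normal(0, 0.1, (H, V))        # hidden -> scores over the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    h = np.zeros(H)                       # initial hidden state
    probs = []
    for w in word_ids:                    # unrolled over the sequence
        h = np.tanh(emb[w] @ W_in + h @ W_hh)   # new state sees the word and the history
        probs.append(softmax(h @ W_out))        # P(next word | all previous words)
    return probs

print(rnn_lm([0, 1, 2, 3])[-1])           # distribution after seeing four words
```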
+1:09:38.378 --> 1:09:45.646
+If you unroll it, it is a feed-forward network
+and you can train it the same way.
+
+1:09:46.106 --> 1:09:57.606
+The only important thing is again, of course,
+that it is unrolled differently for different inputs.
+
+1:09:57.837 --> 1:10:05.145
+But since the parameters are shared, it is
+somehow similar and you can train it.
+
+1:10:05.145 --> 1:10:08.800
+The training algorithm is very similar.
+
+1:10:10.310 --> 1:10:29.568
+One thing which makes things difficult is
+what is referred to as the vanishing gradient.
+
+1:10:29.809 --> 1:10:32.799
+That is a very important point in the motivation
+for using LSTMs.
+
+1:10:33.593 --> 1:10:44.604
+The influence here gets smaller and smaller,
+and the models are not really able to model that.
+
+1:10:44.804 --> 1:10:51.939
+Because the gradient gets smaller and smaller,
+the error propagated back to this one, the part
+it contributes to the error, is very small,
+and therefore you don't do any changes there
+anymore.
+
+1:11:00.020 --> 1:11:06.703
+And yeah, that is why standard RNNs are
+difficult to train.
+
+1:11:07.247 --> 1:11:11.462
+So when people are talking about RNNs nowadays,
+
+1:11:11.791 --> 1:11:23.333
+what we typically mean are LSTMs, or
+long short-term memories.
+
+1:11:23.333 --> 1:11:30.968
+You see they are by now quite old already.
+
+1:11:31.171 --> 1:11:39.019
+So there the motivation was the language modeling
+task.
+
+1:11:39.019 --> 1:11:44.784
+It is somehow about storing information longer.
+
+1:11:44.684 --> 1:11:51.556
+Because if you only look at the last words,
+it is often no longer clear whether this is a
+question or a normal sentence.
+
+1:11:53.013 --> 1:12:05.318
+So there you have these gating mechanisms
+in order to store things for a longer time
+in your hidden state.
+
+1:12:10.730 --> 1:12:20.162
+They are still used in quite a lot of works.
+
+1:12:21.541 --> 1:12:29.349
+Especially for machine translation, the standard
+now is to use transformer-based models, which
+we will learn about.
+
+1:12:30.690 --> 1:12:38.962
+But, for example, we will later have one lecture
+about efficiency.
+
+1:12:38.962 --> 1:12:42.830
+So how can we build very efficient models?
+
+1:12:42.882 --> 1:12:53.074
+And there, in the decoder, in parts of the networks,
+they are still used.
+
+1:12:53.473 --> 1:12:57.518
+So it is not that, yeah, RNNs are of no
+importance anymore.
+
+1:12:59.239 --> 1:13:08.956
+In order to make them strong, there are some
+more things which are helpful and should be mentioned:
+
+1:13:09.309 --> 1:13:19.683
+So one thing is, there is a nice trick to make
+these neural networks stronger and better.
+
+1:13:19.739 --> 1:13:21.523
+Of course it does not always work.
+
+1:13:21.523 --> 1:13:23.451
+You have to have enough training data.
+
+1:13:23.763 --> 1:13:28.959
+But in general, the easiest way of making your
+models bigger and stronger is just to increase
+your parameters.
+
+1:13:30.630 --> 1:13:43.236
+And you have seen that with the large language
+models, that is what they are always bragging about.
+
+1:13:43.903 --> 1:13:56.463
+This is one way, so the question is: how do
+you get more parameters?
+
+1:13:56.463 --> 1:14:01.265
+There are different ways of doing it.
+
+1:14:01.521 --> 1:14:10.029
+And the other thing is to make your networks
+deeper, so to have more layers in between.
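As a rough illustration of these two ways of adding parameters, wider recurrent layers or more stacked layers, here is a small sketch assuming PyTorch is available; the sizes are arbitrary and only serve to show how the parameter count grows.

```python
import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

small = nn.LSTM(input_size=256, hidden_size=256, num_layers=1)
wide  = nn.LSTM(input_size=256, hidden_size=1024, num_layers=1)  # wider hidden layer
deep  = nn.LSTM(input_size=256, hidden_size=256, num_layers=4)   # more stacked layers

print(n_params(small), n_params(wide), n_params(deep))
```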
+
+1:14:11.471 --> 1:14:13.827
+And thereby you can also get more parameters.
+
+1:14:14.614 --> 1:14:23.340
+There's one problem with this, with deeper networks, and it's
+very similar to what we just saw with RNNs.
+
+1:14:23.603 --> 1:14:34.253
+We have this problem of gradient flow: if it has to flow
+through many layers, the gradient gets very small.
+
+1:14:35.795 --> 1:14:42.704
+Exactly the same thing happens in deep LSTMs. If you take the
+gradient here at the top, which tells you what is right or
+wrong,
+
+1:14:52.612 --> 1:14:56.439
+with three layers it's no problem, but it is if you're going
+to ten, twenty or a hundred layers.
+
+1:14:57.797 --> 1:14:59.698
+So what people are typically doing
+
+1:15:00.060 --> 1:15:07.000
+is using what are called residual connections. That's a very
+helpful idea, and it's maybe surprising that it works so well.
+
+1:15:15.956 --> 1:15:20.309
+And the idea is that these layers
+
+1:15:20.320 --> 1:15:29.982
+in between should no longer calculate a completely new
+representation, but rather what to change about the current
+one.
+
+1:15:31.731 --> 1:15:37.588
+Therefore, in the end the output of a layer is always added to
+its input.
+
+1:15:38.318 --> 1:15:48.824
+The nice thing is that if you're doing backpropagation, the
+error can flow very directly back through this addition.
+
+1:15:49.209 --> 1:16:02.540
+Nowadays nearly every very deep architecture, not only RNNs,
+has these residual or highway connections.
+
+1:16:04.704 --> 1:16:06.616
+This has two advantages: on the one hand, these layers don't
+need to learn a full representation, only what to change about
+it; on the other hand, the gradient can flow back more easily.
+
+1:16:22.082 --> 1:16:24.172
+Good.
+
+1:16:23.843 --> 1:16:31.768
+So much for the neural network basics; now to the last thing
+for today.
+
+1:16:31.671 --> 1:16:33.750
+Language models were, yeah,
+
+1:16:33.750 --> 1:16:41.976
+used inside the translation models themselves, and now we're
+seeing them again, but one thing which at the beginning was
+very essential were the word embeddings. People really trained
+language models partly only to get this type of embedding.
+
+1:16:59.999 --> 1:17:04.193
+Therefore, we want to look at them a bit closer.
+
+1:17:09.229 --> 1:17:15.678
+So now some last words on the word embeddings. The interesting
+thing is that word embeddings can be used for very different
+tasks.
+
+1:17:27.347 --> 1:17:31.329
+The nice thing is you can train them on just large amounts of
+data.
+
+1:17:31.931 --> 1:17:41.569
+And then, if you have these word embeddings, we have seen that
+they give a much smaller representation than the one-hot
+vectors.
+
+1:17:41.982 --> 1:17:52.217
+So then you can train a smaller model for some other task, and
+therefore you are more efficient.
+
+1:17:52.532 --> 1:17:55.218
+One thing about these initial word embeddings is important:
+they really depend only on the word itself. So if you look at
+the two meanings of "can", the can of beans or "I can do
+that", they will have the same embedding, so somehow the
+embedding has to keep this ambiguity inside.
+
+1:18:09.189 --> 1:18:12.486
+It cannot be resolved at this level.
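The residual idea fits in a few lines of code. The sketch below is a generic illustration only (the block size and the inner layers are assumptions, not taken from the lecture): each block returns its input plus a learned change, so even a deep stack lets the gradient pass back through the additions.

```python
# Minimal sketch of a residual (skip) connection, with arbitrary sizes.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, size=512):
        super().__init__()
        # The inner layers only have to learn what to *change* about the input.
        self.layer = nn.Sequential(nn.Linear(size, size), nn.ReLU(), nn.Linear(size, size))

    def forward(self, x):
        return x + self.layer(x)   # output = input + learned change

x = torch.randn(2, 512)
deep = nn.Sequential(*[ResidualBlock() for _ in range(20)])  # stacking many blocks stays trainable
y = deep(x)
```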
+
+1:18:12.486 --> 1:18:24.753
+That can only be resolved in the higher layers, which look at
+the context; the word embedding layer really depends only on
+the word itself.
+
+1:18:29.489 --> 1:18:33.757
+However, even this one has quite interesting properties.
+
+1:18:34.034 --> 1:18:39.558
+So people like to visualize them. That's always difficult,
+because if you look at these vectors,
+
+1:18:47.767 --> 1:18:52.879
+drawing your five-hundred-dimensional vector is still a bit
+challenging.
+
+1:18:53.113 --> 1:19:12.472
+So you cannot do that directly; people have to do some type of
+dimensionality reduction down to two dimensions.
+
+1:19:13.073 --> 1:19:17.209
+And of course then some information is getting lost in this
+projection.
+
+1:19:18.238 --> 1:19:24.802
+And you see, for example, this is the most famous and common
+example: you can look at the difference between the male and
+the female word in English. This is the embedding of king,
+this is the embedding of queen, and this is their difference.
+
+1:19:38.058 --> 1:19:40.394
+You can do that for very different words.
+
+1:19:40.780 --> 1:19:45.407
+And that is where the math comes in; that is what people then
+look into.
+
+1:19:45.725 --> 1:19:50.995
+So what you can now do, for example, is calculate the
+difference between man and woman.
+
+1:19:52.232 --> 1:19:55.511
+Then you can take the embedding of king, add to it the
+difference between man and woman, and then look at which words
+are similar to the result. Of course you won't directly hit
+the correct word; it's a continuous space.
+
+1:20:10.790 --> 1:20:23.127
+But you can look at what the nearest neighbours of this point
+are, and often the expected word is among them.
+
+1:20:24.224 --> 1:20:33.913
+So it somehow learns that the difference between these word
+pairs is always roughly the same.
+
+1:20:34.374 --> 1:20:37.746
+You can do that for different things. You also see that it's
+not perfect; here, for example, with verb forms like swimming
+and swam, or walking and walked.
+
+1:20:49.469 --> 1:20:51.639
+So you can try to use them. The interesting thing is that this
+is completely unsupervised: nobody taught the model the
+principle of gender in language.
+
+1:21:04.284 --> 1:21:09.910
+It's purely trained on the task of doing next-word prediction.
+
+1:21:10.230 --> 1:21:20.658
+And even really semantic information is captured, like the
+capital relation: this is the difference between the country
+and its capital city.
+
+1:21:23.823 --> 1:21:25.518
+In this visualization we have done the same thing with the
+difference between country and capital.
+
+1:21:33.853 --> 1:21:41.991
+You see it's not perfect, but it points in roughly the right
+direction, so you can even use this, for example for question
+answering: if you have the difference between one country and
+its capital, you can apply it to a new country.
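The king − man + woman trick looks like this in code. The four-dimensional vectors below are invented toy values just to show the arithmetic; a real experiment would use embeddings trained on large amounts of text and would exclude the query words from the nearest-neighbour search.

```python
# Minimal sketch of the analogy-by-vector-arithmetic idea, with made-up embeddings.
import numpy as np

emb = {                                   # hypothetical 4-dimensional word embeddings
    "king":  np.array([0.8, 0.7, 0.1, 0.6]),
    "queen": np.array([0.8, 0.7, 0.9, 0.6]),
    "man":   np.array([0.2, 0.1, 0.1, 0.3]),
    "woman": np.array([0.2, 0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]          # move along the man -> woman direction
nearest = max(emb, key=lambda w: cosine(emb[w], target))  # nearest neighbour in this toy vocabulary
print(nearest)   # prints "queen" for these toy vectors
```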
+
+1:21:54.834 --> 1:22:02.741
+So it seems these embeddings are really able to learn a lot of
+information and compress all this information,
+
+1:22:05.325 --> 1:22:11.769
+just from doing next-word prediction. And that also explains a
+bit, or maybe not explains but motivates, what the main
+advantage of this type of neural model is: that we can use
+these hidden representations, transfer them, and use them for
+different tasks.
+
+1:22:28.568 --> 1:22:43.707
+So to summarize what we did today: what you should hopefully
+take with you is how language models are used in machine
+translation,
+
+1:22:45.805 --> 1:22:49.149
+and then how we can do language modeling with neural networks.
+
+1:22:49.449 --> 1:22:55.617
+We looked at three different architectures: the feed-forward
+language model, the one based on restricted Boltzmann machines,
+
+1:22:59.039 --> 1:23:05.366
+and the recurrent neural network. So we have seen feed-forward
+and recurrent networks, and in the next lectures we'll see the
+last type of architecture.
+
+1:23:15.915 --> 1:23:17.412
+Do you have any questions?
+
+1:23:20.680 --> 1:23:27.341
+Then thanks a lot, and next Tuesday we will meet again.
+ +0:02:55.415 --> 0:03:10.397 +And then we learned a lot of techniques and +that is the main challenge in statistical language. + +0:03:10.397 --> 0:03:15.391 +How we can give somehow a good. + +0:03:15.435 --> 0:03:23.835 +And they developed very specific, very good +techniques to deal with that. + +0:03:23.835 --> 0:03:26.900 +However, this is the best. + +0:03:28.568 --> 0:03:33.907 +And therefore we can do things different. + +0:03:33.907 --> 0:03:44.331 +If we have not seen an N gram before in statistical +models, we have to have seen. + +0:03:45.225 --> 0:03:51.361 +Before, and we can only get information from +exactly the same word. + +0:03:51.411 --> 0:03:57.567 +We don't have an approximate matching like +that. + +0:03:57.567 --> 0:04:10.255 +Maybe it stood together in some way or similar, +and in a sentence we might generalize the knowledge. + +0:04:11.191 --> 0:04:21.227 +Would like to have more something like that +where engrams are represented more in a general + +0:04:21.227 --> 0:04:21.990 +space. + +0:04:22.262 --> 0:04:29.877 +So if you learn something about eyewalk then +maybe we can use this knowledge and also. + +0:04:30.290 --> 0:04:43.034 +And thereby no longer treat all or at least +a lot of the ingrams as we've done before. + +0:04:43.034 --> 0:04:45.231 +We can really. + +0:04:47.047 --> 0:04:56.157 +And we maybe want to even do that in a more +hierarchical approach, but we know okay some + +0:04:56.157 --> 0:05:05.268 +words are similar like go and walk is somehow +similar and and therefore like maybe if we + +0:05:05.268 --> 0:05:07.009 +then merge them. + +0:05:07.387 --> 0:05:16.104 +If we learn something about work, then it +should tell us also something about Hugo or + +0:05:16.104 --> 0:05:17.118 +he walks. + +0:05:17.197 --> 0:05:18.970 +We see already. + +0:05:18.970 --> 0:05:22.295 +It's, of course, not so easy. + +0:05:22.295 --> 0:05:31.828 +We see that there is some relations which +we need to integrate, for example, for you. + +0:05:31.828 --> 0:05:35.486 +We need to add the S, but maybe. + +0:05:37.137 --> 0:05:42.984 +And luckily there is one really yeah, convincing +methods in doing that. + +0:05:42.963 --> 0:05:47.239 +And that is by using an evil neck or. + +0:05:47.387 --> 0:05:57.618 +That's what we will introduce today so we +can use this type of neural networks to try + +0:05:57.618 --> 0:06:04.042 +to learn this similarity and to learn how some +words. + +0:06:04.324 --> 0:06:13.711 +And that is one of the main advantages that +we have by switching from the standard statistical + +0:06:13.711 --> 0:06:15.193 +models to the. + +0:06:15.115 --> 0:06:22.840 +To learn similarities between words and generalized +and learn what we call hidden representations. + +0:06:22.840 --> 0:06:29.707 +So somehow representations of words where +we can measure similarity in some dimensions. + +0:06:30.290 --> 0:06:42.275 +So in representations where as a tubically +continuous vector or a vector of a fixed size. + +0:06:42.822 --> 0:06:52.002 +We had it before and we've seen that the only +thing we did is we don't want to do. + +0:06:52.192 --> 0:06:59.648 +But these indices don't have any meaning, +so it wasn't that word five is more similar + +0:06:59.648 --> 0:07:02.248 +to words twenty than to word. + +0:07:02.582 --> 0:07:09.059 +So we couldn't learn anything about words +in the statistical model. + +0:07:09.059 --> 0:07:12.107 +That's a big challenge because. 
+ +0:07:12.192 --> 0:07:24.232 +If you think about words even in morphology, +so go and go is more similar because the person. + +0:07:24.264 --> 0:07:36.265 +While the basic models we have up to now, +they have no idea about that and goes as similar + +0:07:36.265 --> 0:07:37.188 +to go. + +0:07:39.919 --> 0:07:53.102 +So what we want to do today, in order to go +to this, we will have a short introduction. + +0:07:53.954 --> 0:08:06.667 +It very short just to see how we use them +here, but that's the good thing that are important + +0:08:06.667 --> 0:08:08.445 +for dealing. + +0:08:08.928 --> 0:08:14.083 +And then we'll first look into feet forward, +new network language models. + +0:08:14.454 --> 0:08:21.221 +And there we will still have this approximation +we had before, then we are looking only at + +0:08:21.221 --> 0:08:22.336 +fixed windows. + +0:08:22.336 --> 0:08:28.805 +So if you remember we have this classroom +of language models, and to determine what is + +0:08:28.805 --> 0:08:33.788 +the probability of a word, we only look at +the past and minus one. + +0:08:34.154 --> 0:08:36.878 +This is the theory of the case. + +0:08:36.878 --> 0:08:43.348 +However, we have the ability and that's why +they're really better in order. + +0:08:44.024 --> 0:08:51.953 +And then at the end we'll look at current +network language models where we then have + +0:08:51.953 --> 0:08:53.166 +a different. + +0:08:53.093 --> 0:09:01.922 +And thereby it is no longer the case that +we need to have a fixed history, but in theory + +0:09:01.922 --> 0:09:04.303 +we can model arbitrary. + +0:09:04.304 --> 0:09:06.854 +And we can log this phenomenon. + +0:09:06.854 --> 0:09:12.672 +We talked about a Tuesday where it's not clear +what type of information. + +0:09:16.396 --> 0:09:24.982 +So yeah, generally new networks are normally +learned to improve and perform some tasks. + +0:09:25.325 --> 0:09:38.934 +We have this structure and we are learning +them from samples so that is similar to what + +0:09:38.934 --> 0:09:42.336 +we had before so now. + +0:09:42.642 --> 0:09:49.361 +And is somehow originally motivated by the +human brain. + +0:09:49.361 --> 0:10:00.640 +However, when you now need to know artificial +neural networks, it's hard to get a similarity. + +0:10:00.540 --> 0:10:02.884 +There seems to be not that important. + +0:10:03.123 --> 0:10:11.013 +So what they are mainly doing is doing summoning +multiplication and then one linear activation. + +0:10:12.692 --> 0:10:16.078 +So so the basic units are these type of. + +0:10:17.937 --> 0:10:29.837 +Perceptron is a basic block which we have +and this does exactly the processing. + +0:10:29.837 --> 0:10:36.084 +We have a fixed number of input features. + +0:10:36.096 --> 0:10:39.668 +So we have here numbers six zero to x and +as input. + +0:10:40.060 --> 0:10:48.096 +And this makes language processing difficult +because we know that it's not the case. + +0:10:48.096 --> 0:10:53.107 +If we're dealing with language, it doesn't +have any. + +0:10:54.114 --> 0:10:57.609 +So we have to model this somehow and understand +how we model this. + +0:10:58.198 --> 0:11:03.681 +Then we have the weights, which are the parameters +and the number of weights exactly the same. + +0:11:04.164 --> 0:11:15.069 +Of input features sometimes you have the spires +in there that always and then it's not really. + +0:11:15.195 --> 0:11:19.656 +And what you then do is very simple. + +0:11:19.656 --> 0:11:26.166 +It's just like the weight it sounds, so you +multiply. 
+ +0:11:26.606 --> 0:11:38.405 +What is then additionally important is we +have an activation function and it's important + +0:11:38.405 --> 0:11:42.514 +that this activation function. + +0:11:43.243 --> 0:11:54.088 +And later it will be important that this is +differentiable because otherwise all the training. + +0:11:54.714 --> 0:12:01.471 +This model by itself is not very powerful. + +0:12:01.471 --> 0:12:10.427 +We have the X Or problem and with this simple +you can't. + +0:12:10.710 --> 0:12:15.489 +However, there is a very easy and nice extension. + +0:12:15.489 --> 0:12:20.936 +The multi layer perception and things get +very powerful. + +0:12:21.081 --> 0:12:32.953 +The thing is you just connect a lot of these +in these layers of structures where we have + +0:12:32.953 --> 0:12:35.088 +the inputs and. + +0:12:35.395 --> 0:12:47.297 +And then we can combine them, or to do them: +The input layer is of course given by your + +0:12:47.297 --> 0:12:51.880 +problem with the dimension. + +0:12:51.880 --> 0:13:00.063 +The output layer is also given by your dimension. + +0:13:01.621 --> 0:13:08.802 +So let's start with the first question, now +more language related, and that is how we represent. + +0:13:09.149 --> 0:13:19.282 +So we have seen here input to x, but the question +is now okay. + +0:13:19.282 --> 0:13:23.464 +How can we put into this? + +0:13:26.866 --> 0:13:34.123 +The first thing that we're able to do is we're +going to set it in the inspector. + +0:13:34.314 --> 0:13:45.651 +Yeah, and that is not that easy because the +continuous vector will come to that. + +0:13:45.651 --> 0:13:47.051 +We can't. + +0:13:47.051 --> 0:13:50.410 +We don't want to do it. + +0:13:50.630 --> 0:13:57.237 +But if we need to input the word into the +needle network, it has to be something easily + +0:13:57.237 --> 0:13:57.912 +defined. + +0:13:59.079 --> 0:14:11.511 +One is the typical thing, the one-hour encoded +vector, so we have a vector where the dimension + +0:14:11.511 --> 0:14:15.306 +is the vocabulary, and then. + +0:14:16.316 --> 0:14:25.938 +So the first thing you are ready to see that +means we are always dealing with fixed. + +0:14:26.246 --> 0:14:34.961 +So you cannot easily extend your vocabulary, +but if you mean your vocabulary would increase + +0:14:34.961 --> 0:14:37.992 +the size of this input vector,. + +0:14:39.980 --> 0:14:42.423 +That's maybe also motivating. + +0:14:42.423 --> 0:14:45.355 +We'll talk about bike parade going. + +0:14:45.355 --> 0:14:47.228 +That's the nice thing. + +0:14:48.048 --> 0:15:01.803 +The big advantage of this one putt encoding +is that we don't implement similarity between + +0:15:01.803 --> 0:15:06.999 +words, but we're really learning. + +0:15:07.227 --> 0:15:11.219 +So you need like to represent any words. + +0:15:11.219 --> 0:15:15.893 +You need a dimension of and dimensional vector. + +0:15:16.236 --> 0:15:26.480 +Imagine you could eat no binary encoding, +so you could represent words as binary vectors. + +0:15:26.806 --> 0:15:32.348 +So you will be significantly more efficient. + +0:15:32.348 --> 0:15:39.122 +However, you have some more digits than other +numbers. + +0:15:39.559 --> 0:15:46.482 +Would somehow be bad because you would force +the one to do this and it's by hand not clear + +0:15:46.482 --> 0:15:47.623 +how to define. + +0:15:48.108 --> 0:15:55.135 +So therefore currently this is the most successful +approach to just do this one patch. + +0:15:55.095 --> 0:15:59.344 +We take a fixed vocabulary. 
+ +0:15:59.344 --> 0:16:10.269 +We map each word to the initial and then we +represent a word like this. + +0:16:10.269 --> 0:16:13.304 +The representation. + +0:16:14.514 --> 0:16:27.019 +But this dimension here is a secondary size, +and if you think ten thousand that's quite + +0:16:27.019 --> 0:16:33.555 +high, so we're always trying to be efficient. + +0:16:33.853 --> 0:16:42.515 +And we are doing the same type of efficiency +because then we are having a very small one + +0:16:42.515 --> 0:16:43.781 +compared to. + +0:16:44.104 --> 0:16:53.332 +It can be still a maybe or neurons, but this +is significantly smaller, of course, as before. + +0:16:53.713 --> 0:17:04.751 +So you are learning there this word as you +said, but you can learn it directly, and there + +0:17:04.751 --> 0:17:07.449 +we have similarities. + +0:17:07.807 --> 0:17:14.772 +But the nice thing is that this is then learned, +and we do not need to like hand define. + +0:17:17.117 --> 0:17:32.377 +So yes, so that is how we're typically adding +at least a single word into the language world. + +0:17:32.377 --> 0:17:43.337 +Then we can see: So we're seeing that you +have the one hard representation always of + +0:17:43.337 --> 0:17:44.857 +the same similarity. + +0:17:45.105 --> 0:18:00.803 +Then we're having this continuous vector which +is a lot smaller dimension and that's. + +0:18:01.121 --> 0:18:06.984 +What we are doing then is learning these representations +so that they are best for language modeling. + +0:18:07.487 --> 0:18:19.107 +So the representations are implicitly because +we're training on the language. + +0:18:19.479 --> 0:18:30.115 +And the nice thing was found out later is +these representations are really, really good + +0:18:30.115 --> 0:18:32.533 +for a lot of other. + +0:18:33.153 --> 0:18:39.729 +And that is why they are now called word embedded +space themselves, and used for other tasks. + +0:18:40.360 --> 0:18:49.827 +And they are somehow describing different +things so they can describe and semantic similarities. + +0:18:49.789 --> 0:18:58.281 +We are looking at the very example of today +that you can do in this vector space by adding + +0:18:58.281 --> 0:19:00.613 +some interesting things. + +0:19:00.940 --> 0:19:11.174 +And so they got really was a first big improvement +when switching to neural staff. + +0:19:11.491 --> 0:19:20.736 +They are like part of the model still with +more complex representation alert, but they + +0:19:20.736 --> 0:19:21.267 +are. + +0:19:23.683 --> 0:19:34.975 +Then we are having the output layer, and in +the output layer we also have output structure + +0:19:34.975 --> 0:19:36.960 +and activation. + +0:19:36.997 --> 0:19:44.784 +That is the language we want to predict, which +word should be the next. + +0:19:44.784 --> 0:19:46.514 +We always have. + +0:19:47.247 --> 0:19:56.454 +And that can be done very well with the softball +softbacked layer, where again the dimension. + +0:19:56.376 --> 0:20:03.971 +Is the vocabulary, so this is a vocabulary +size, and again the case neuro represents the + +0:20:03.971 --> 0:20:09.775 +case class, so in our case we have again a +one-hour representation. + +0:20:10.090 --> 0:20:18.929 +Ours is a probability distribution and the +end is a probability distribution of all works. 
+ +0:20:18.929 --> 0:20:28.044 +The case entry tells us: So we need to have +some of our probability distribution at our + +0:20:28.044 --> 0:20:36.215 +output, and in order to achieve that this activation +function goes, it needs to be that all the + +0:20:36.215 --> 0:20:36.981 +outputs. + +0:20:37.197 --> 0:20:47.993 +And we can achieve that with a softmax activation +we take each of the value and then. + +0:20:48.288 --> 0:20:58.020 +So by having this type of activation function +we are really getting that at the end we always. + +0:20:59.019 --> 0:21:12.340 +The beginning was very challenging because +again we have this inefficient representation + +0:21:12.340 --> 0:21:15.184 +of our vocabulary. + +0:21:15.235 --> 0:21:27.500 +And then you can imagine escalating over to +something over a thousand is maybe a bit inefficient + +0:21:27.500 --> 0:21:29.776 +with cheap users. + +0:21:36.316 --> 0:21:43.664 +And then yeah, for training the models, that +is how we refine, so we have this architecture + +0:21:43.664 --> 0:21:44.063 +now. + +0:21:44.264 --> 0:21:52.496 +We need to minimize the arrow by taking the +output. + +0:21:52.496 --> 0:21:58.196 +We are comparing it to our targets. + +0:21:58.298 --> 0:22:07.670 +So one important thing is, of course, how +can we measure the error? + +0:22:07.670 --> 0:22:12.770 +So what if we're training the ideas? + +0:22:13.033 --> 0:22:19.770 +And how well when measuring it is in natural +language processing, typically the cross entropy. + +0:22:19.960 --> 0:22:32.847 +That means we are comparing the target with +the output, so we're taking the value multiplying + +0:22:32.847 --> 0:22:35.452 +with the horizons. + +0:22:35.335 --> 0:22:43.454 +Which gets optimized and you're seeing that +this, of course, makes it again very nice and + +0:22:43.454 --> 0:22:49.859 +easy because our target, we said, is again +a one-hound representation. + +0:22:50.110 --> 0:23:00.111 +So except for one, all of these are always +zero, and what we are doing is taking the one. + +0:23:00.100 --> 0:23:05.970 +And we only need to multiply the one with +the logarism here, and that is all the feedback. + +0:23:06.946 --> 0:23:14.194 +Of course, this is not always influenced by +all the others. + +0:23:14.194 --> 0:23:17.938 +Why is this influenced by all? + +0:23:24.304 --> 0:23:33.554 +Think Mac the activation function, which is +the current activation divided by some of the + +0:23:33.554 --> 0:23:34.377 +others. + +0:23:34.354 --> 0:23:44.027 +Because otherwise it could of course easily +just increase this value and ignore the others, + +0:23:44.027 --> 0:23:49.074 +but if you increase one value or the other, +so. + +0:23:51.351 --> 0:24:04.433 +And then we can do with neon networks one +very nice and easy type of training that is + +0:24:04.433 --> 0:24:07.779 +done in all the neon. + +0:24:07.707 --> 0:24:12.664 +So in which direction does the arrow show? + +0:24:12.664 --> 0:24:23.152 +And then if we want to go to a smaller like +smaller arrow, that's what we want to achieve. + +0:24:23.152 --> 0:24:27.302 +We're trying to minimize our arrow. + +0:24:27.287 --> 0:24:32.875 +And we have to do that, of course, for all +the weights, and to calculate the error of + +0:24:32.875 --> 0:24:36.709 +all the weights we want in the back of the +baggation here. + +0:24:36.709 --> 0:24:41.322 +But what you can do is you can propagate the +arrow which you measured. + +0:24:41.322 --> 0:24:43.792 +At the end you can propagate it back. 
+ +0:24:43.792 --> 0:24:46.391 +That's basic mass and basic derivation. + +0:24:46.706 --> 0:24:59.557 +Then you can do each weight in your model +and measure how much it contributes to this + +0:24:59.557 --> 0:25:01.350 +individual. + +0:25:04.524 --> 0:25:17.712 +To summarize what your machine translation +should be, to understand all this problem is + +0:25:17.712 --> 0:25:20.710 +that this is how a. + +0:25:20.580 --> 0:25:23.056 +The notes are perfect thrones. + +0:25:23.056 --> 0:25:28.167 +They are fully connected between two layers +and no connections. + +0:25:28.108 --> 0:25:29.759 +Across layers. + +0:25:29.829 --> 0:25:35.152 +And what they're doing is always just to wait +for some here and then an activation function. + +0:25:35.415 --> 0:25:38.794 +And in order to train you have this sword +in backwards past. + +0:25:39.039 --> 0:25:41.384 +So we put in here. + +0:25:41.281 --> 0:25:46.540 +Our inputs have some random values at the +beginning. + +0:25:46.540 --> 0:25:49.219 +They calculate the output. + +0:25:49.219 --> 0:25:58.646 +We are measuring how big our error is, propagating +the arrow back, and then changing our model + +0:25:58.646 --> 0:25:59.638 +in a way. + +0:26:01.962 --> 0:26:14.267 +So before we're coming into the neural networks, +how can we use this type of neural network + +0:26:14.267 --> 0:26:17.611 +to do language modeling? + +0:26:23.103 --> 0:26:25.520 +So the question is now okay. + +0:26:25.520 --> 0:26:33.023 +How can we use them in natural language processing +and especially in machine translation? + +0:26:33.023 --> 0:26:38.441 +The first idea of using them was to estimate +the language model. + +0:26:38.999 --> 0:26:42.599 +So we have seen that the output can be monitored +here as well. + +0:26:43.603 --> 0:26:49.308 +Has a probability distribution, and if we +have a full vocabulary, we could mainly hear + +0:26:49.308 --> 0:26:55.209 +estimate how probable each next word is, and +then use that in our language model fashion, + +0:26:55.209 --> 0:27:02.225 +as we've done it last time, we've got the probability +of a full sentence as a product of all probabilities + +0:27:02.225 --> 0:27:03.208 +of individual. + +0:27:04.544 --> 0:27:06.695 +And UM. + +0:27:06.446 --> 0:27:09.776 +That was done and in ninety seven years. + +0:27:09.776 --> 0:27:17.410 +It's very easy to integrate it into this Locklear +model, so we have said that this is how the + +0:27:17.410 --> 0:27:24.638 +Locklear model looks like, so we're searching +the best translation, which minimizes each + +0:27:24.638 --> 0:27:25.126 +wage. + +0:27:25.125 --> 0:27:26.371 +The feature value. + +0:27:26.646 --> 0:27:31.642 +We have that with the minimum error training, +if you can remember when we search for the + +0:27:31.642 --> 0:27:32.148 +optimal. + +0:27:32.512 --> 0:27:40.927 +We have the phrasetable probabilities, the +language model, and we can just add here and + +0:27:40.927 --> 0:27:41.597 +there. + +0:27:41.861 --> 0:27:46.077 +So that is quite easy as said. + +0:27:46.077 --> 0:27:54.101 +That was how statistical machine translation +was improved. + +0:27:54.101 --> 0:27:57.092 +Add one more feature. + +0:27:58.798 --> 0:28:11.220 +So how can we model the language mark for +Belty with your network? + +0:28:11.220 --> 0:28:22.994 +So what we have to do is: And the problem +in generally in the head is that most we haven't + +0:28:22.994 --> 0:28:25.042 +seen long sequences. 
+ +0:28:25.085 --> 0:28:36.956 +Mostly we have to beg off to very short sequences +and we are working on this discrete space where. + +0:28:37.337 --> 0:28:48.199 +So the idea is if we have a meal network we +can map words into continuous representation + +0:28:48.199 --> 0:28:50.152 +and that helps. + +0:28:51.091 --> 0:28:59.598 +And the structure then looks like this, so +this is the basic still feed forward neural + +0:28:59.598 --> 0:29:00.478 +network. + +0:29:01.361 --> 0:29:10.744 +We are doing this at Proximation again, so +we are not putting in all previous words, but + +0:29:10.744 --> 0:29:11.376 +it's. + +0:29:11.691 --> 0:29:25.089 +And this is done because in your network we +can have only a fixed type of input, so we + +0:29:25.089 --> 0:29:31.538 +can: Can only do a fixed set, and they are +going to be doing exactly the same in minus + +0:29:31.538 --> 0:29:31.879 +one. + +0:29:33.593 --> 0:29:41.026 +And then we have, for example, three words +and three different words, which are in these + +0:29:41.026 --> 0:29:54.583 +positions: And then we're having the first +layer of the neural network, which learns words + +0:29:54.583 --> 0:29:56.247 +and words. + +0:29:57.437 --> 0:30:04.976 +There is one thing which is maybe special +compared to the standard neural memory. + +0:30:05.345 --> 0:30:13.163 +So the representation of this word we want +to learn first of all position independence, + +0:30:13.163 --> 0:30:19.027 +so we just want to learn what is the general +meaning of the word. + +0:30:19.299 --> 0:30:26.244 +Therefore, the representation you get here +should be the same as if you put it in there. + +0:30:27.247 --> 0:30:35.069 +The nice thing is you can achieve that in +networks the same way you achieve it. + +0:30:35.069 --> 0:30:41.719 +This way you're reusing ears so we are forcing +them to always stay. + +0:30:42.322 --> 0:30:49.689 +And that's why you then learn your word embedding, +which is contextual and independent, so. + +0:30:49.909 --> 0:31:05.561 +So the idea is you have the diagram go home +and you don't want to use the context. + +0:31:05.561 --> 0:31:07.635 +First you. + +0:31:08.348 --> 0:31:14.155 +That of course it might have a different meaning +depending on where it stands, but learn that. + +0:31:14.514 --> 0:31:19.623 +First, we're learning key representation of +the words, which is just the representation + +0:31:19.623 --> 0:31:20.378 +of the word. + +0:31:20.760 --> 0:31:37.428 +So it's also not like normally all input neurons +are connected to all neurons. + +0:31:37.857 --> 0:31:47.209 +This is the first layer of representation, +and then we have a lot denser representation, + +0:31:47.209 --> 0:31:56.666 +that is, our three word embeddings here, and +now we are learning this interaction between + +0:31:56.666 --> 0:31:57.402 +words. + +0:31:57.677 --> 0:32:08.265 +So now we have at least one connected, fully +connected layer here, which takes the three + +0:32:08.265 --> 0:32:14.213 +imbedded input and then learns the new embedding. + +0:32:15.535 --> 0:32:27.871 +And then if you had one of several layers +of lining which is your output layer, then. + +0:32:28.168 --> 0:32:46.222 +So here the size is a vocabulary size, and +then you put as target what is the probability + +0:32:46.222 --> 0:32:48.228 +for each. + +0:32:48.688 --> 0:32:56.778 +The nice thing is that you learn everything +together, so you're not learning what is a + +0:32:56.778 --> 0:32:58.731 +good representation. 
+ +0:32:59.079 --> 0:33:12.019 +When you are training the whole network together, +it learns what representation for a word you + +0:33:12.019 --> 0:33:13.109 +get in. + +0:33:15.956 --> 0:33:19.176 +It's Yeah That Is the Main Idea. + +0:33:20.660 --> 0:33:32.695 +Nowadays often referred to as one way of self-supervised +learning, why self-supervisory learning? + +0:33:33.053 --> 0:33:37.120 +The output is the next word and the input +is the previous word. + +0:33:37.377 --> 0:33:46.778 +But somehow it's self-supervised because it's +not really that we created labels, but we artificially. + +0:33:46.806 --> 0:34:01.003 +We just have pure text, and then we created +the task. + +0:34:05.905 --> 0:34:12.413 +Say we have two sentences like go home again. + +0:34:12.413 --> 0:34:18.780 +Second one is go to creative again, so both. + +0:34:18.858 --> 0:34:31.765 +The starboard bygo and then we have to predict +the next four years and my question is: Be + +0:34:31.765 --> 0:34:40.734 +modeled this ability as one vector with like +probability or possible works. + +0:34:40.734 --> 0:34:42.740 +We have musical. + +0:34:44.044 --> 0:34:56.438 +You have multiple examples, so you would twice +train, once you predict, once you predict, + +0:34:56.438 --> 0:35:02.359 +and then, of course, the best performance. + +0:35:04.564 --> 0:35:11.772 +A very good point, so you're not aggregating +examples beforehand, but you're taking each + +0:35:11.772 --> 0:35:13.554 +example individually. + +0:35:19.259 --> 0:35:33.406 +So what you do is you simultaneously learn +the projection layer which represents this + +0:35:33.406 --> 0:35:39.163 +word and the N gram probabilities. + +0:35:39.499 --> 0:35:48.390 +And what people then later analyzed is that +these representations are very powerful. + +0:35:48.390 --> 0:35:56.340 +The task is just a very important task to +model like what is the next word. + +0:35:56.816 --> 0:36:09.429 +It's a bit motivated by people saying in order +to get the meaning of the word you have to + +0:36:09.429 --> 0:36:10.690 +look at. + +0:36:10.790 --> 0:36:18.467 +If you read the text in there, which you have +never seen, you can still estimate the meaning + +0:36:18.467 --> 0:36:22.264 +of this word because you know how it is used. + +0:36:22.602 --> 0:36:26.667 +Just imagine you read this text about some +city. + +0:36:26.667 --> 0:36:32.475 +Even if you've never seen the city before +heard, you often know from. + +0:36:34.094 --> 0:36:44.809 +So what is now the big advantage of using +neural networks? + +0:36:44.809 --> 0:36:57.570 +Just imagine we have to estimate this: So +you have to monitor the probability of ad hip + +0:36:57.570 --> 0:37:00.272 +and now imagine iPhone. + +0:37:00.600 --> 0:37:06.837 +So all the techniques we have at the last +time. + +0:37:06.837 --> 0:37:14.243 +At the end, if you haven't seen iPhone, you +will always. + +0:37:15.055 --> 0:37:19.502 +Because you haven't seen the previous words, +so you have no idea how to do that. + +0:37:19.502 --> 0:37:24.388 +You won't have seen the diagram, the trigram +and all the others, so the probability here + +0:37:24.388 --> 0:37:27.682 +will just be based on the probability of ad, +so it uses no. + +0:37:28.588 --> 0:37:38.328 +If you're having this type of model, what +does it do so? + +0:37:38.328 --> 0:37:43.454 +This is the last three words. + +0:37:43.483 --> 0:37:49.837 +Maybe this representation is messed up because +it's mainly on a particular word or source + +0:37:49.837 --> 0:37:50.260 +that. 
+ +0:37:50.730 --> 0:37:57.792 +Now anyway you have these two information +that were two words before was first and therefore: + +0:37:58.098 --> 0:38:07.214 +So you have a lot of information here to estimate +how good it is. + +0:38:07.214 --> 0:38:13.291 +Of course, there could be more information. + +0:38:13.593 --> 0:38:25.958 +So all this type of modeling we can do and +that we couldn't do beforehand because we always. + +0:38:27.027 --> 0:38:31.905 +Don't guess how we do it now. + +0:38:31.905 --> 0:38:41.824 +Typically you would have one talking for awkward +vocabulary. + +0:38:42.602 --> 0:38:45.855 +All you're doing by carrying coding when it +has a fixed dancing. + +0:38:46.226 --> 0:38:49.439 +Yeah, you have to do something like that that +the opposite way. + +0:38:50.050 --> 0:38:55.413 +So yeah, all the vocabulary are by thankcoding +where you don't have have all the vocabulary. + +0:38:55.735 --> 0:39:07.665 +But then, of course, the back pairing coating +is better with arbitrary context because a + +0:39:07.665 --> 0:39:11.285 +problem with back pairing. + +0:39:17.357 --> 0:39:20.052 +Anymore questions to the basic same little +things. + +0:39:23.783 --> 0:39:36.162 +This model we then want to continue is to +look into how complex that is or can make things + +0:39:36.162 --> 0:39:39.155 +maybe more efficient. + +0:39:40.580 --> 0:39:47.404 +At the beginning there was definitely a major +challenge. + +0:39:47.404 --> 0:39:50.516 +It's still not that easy. + +0:39:50.516 --> 0:39:58.297 +All guess follow the talk about their environmental +fingerprint. + +0:39:58.478 --> 0:40:05.686 +So this calculation is normally heavy, and +if you build systems yourself, you have to + +0:40:05.686 --> 0:40:06.189 +wait. + +0:40:06.466 --> 0:40:15.412 +So it's good to know a bit about how complex +things are in order to do a good or efficient. + +0:40:15.915 --> 0:40:24.706 +So one thing where most of the calculation +really happens is if you're. + +0:40:25.185 --> 0:40:34.649 +So in generally all these layers, of course, +we're talking about networks and the zones + +0:40:34.649 --> 0:40:35.402 +fancy. + +0:40:35.835 --> 0:40:48.305 +So what you have to do in order to calculate +here these activations, you have this weight. + +0:40:48.488 --> 0:41:05.021 +So to make it simple, let's see we have three +outputs, and then you just do a metric identification + +0:41:05.021 --> 0:41:08.493 +between your weight. + +0:41:08.969 --> 0:41:19.641 +That is why the use is so powerful for neural +networks because they are very good in doing + +0:41:19.641 --> 0:41:22.339 +metric multiplication. + +0:41:22.782 --> 0:41:28.017 +However, for some type of embedding layer +this is really very inefficient. + +0:41:28.208 --> 0:41:37.547 +So in this input we are doing this calculation. + +0:41:37.547 --> 0:41:47.081 +What we are mainly doing is selecting one +color. + +0:41:47.387 --> 0:42:03.570 +So therefore you can do at least the forward +pass a lot more efficient if you don't really + +0:42:03.570 --> 0:42:07.304 +do this calculation. + +0:42:08.348 --> 0:42:20.032 +So the weight metrics of the first embedding +layer is just that in each color you have. + +0:42:20.580 --> 0:42:30.990 +So this is how your initial weights look like +and how you can interpret or understand. 
+ +0:42:32.692 --> 0:42:42.042 +And this is already relatively important because +remember this is a huge dimensional thing, + +0:42:42.042 --> 0:42:51.392 +so typically here we have the number of words +ten thousand, so this is the word embeddings. + +0:42:51.451 --> 0:43:00.400 +Because it's the largest one there, we have +entries, while for the others we maybe have. + +0:43:00.660 --> 0:43:03.402 +So they are a little bit efficient and are +important to make this in. + +0:43:06.206 --> 0:43:10.529 +And then you can look at where else the calculations +are very difficult. + +0:43:10.830 --> 0:43:20.294 +So here we have our individual network, so +here are the word embeddings. + +0:43:20.294 --> 0:43:29.498 +Then we have one hidden layer, and then you +can look at how difficult. + +0:43:30.270 --> 0:43:38.742 +We could save a lot of calculations by calculating +that by just doing like do the selection because: + +0:43:40.600 --> 0:43:51.748 +And then the number of calculations you have +to do here is the length. + +0:43:52.993 --> 0:44:06.206 +Then we have here the hint size that is the +hint size, so the first step of calculation + +0:44:06.206 --> 0:44:10.260 +for this metric is an age. + +0:44:10.730 --> 0:44:22.030 +Then you have to do some activation function +which is this: This is the hidden size hymn + +0:44:22.030 --> 0:44:29.081 +because we need the vocabulary socks to calculate +the probability for each. + +0:44:29.889 --> 0:44:40.474 +And if you look at this number, so if you +have a projection sign of one hundred and a + +0:44:40.474 --> 0:44:45.027 +vocabulary sign of one hundred, you. + +0:44:45.425 --> 0:44:53.958 +And that's why there has been especially at +the beginning some ideas on how we can reduce + +0:44:53.958 --> 0:44:55.570 +the calculation. + +0:44:55.956 --> 0:45:02.352 +And if we really need to calculate all our +capabilities, or if we can calculate only some. + +0:45:02.582 --> 0:45:13.061 +And there again one important thing to think +about is for what you will use my language. + +0:45:13.061 --> 0:45:21.891 +One can use it for generations and that's +where we will see the next week. + +0:45:21.891 --> 0:45:22.480 +And. + +0:45:23.123 --> 0:45:32.164 +Initially, if it's just used as a feature, +we do not want to use it for generation, but + +0:45:32.164 --> 0:45:32.575 +we. + +0:45:32.953 --> 0:45:41.913 +And there we might not be interested in all +the probabilities, but we already know all + +0:45:41.913 --> 0:45:49.432 +the probability of this one word, and then +it might be very inefficient. + +0:45:51.231 --> 0:45:53.638 +And how can you do that so initially? + +0:45:53.638 --> 0:45:56.299 +For example, people look into shortlists. + +0:45:56.756 --> 0:46:03.321 +So the idea was this calculation at the end +is really very expensive. + +0:46:03.321 --> 0:46:05.759 +So can we make that more. + +0:46:05.945 --> 0:46:17.135 +And the idea was okay, and most birds occur +very rarely, and some beef birds occur very, + +0:46:17.135 --> 0:46:18.644 +very often. + +0:46:19.019 --> 0:46:37.644 +And so they use the smaller imagery, which +is maybe very small, and then you merge a new. + +0:46:37.937 --> 0:46:45.174 +So you're taking if the word is in the shortness, +so in the most frequent words. + +0:46:45.825 --> 0:46:58.287 +You're taking the probability of this short +word by some normalization here, and otherwise + +0:46:58.287 --> 0:46:59.656 +you take. + +0:47:00.020 --> 0:47:00.836 +Course. 
+ +0:47:00.836 --> 0:47:09.814 +It will not be as good, but then we don't +have to calculate all the capabilities at the + +0:47:09.814 --> 0:47:16.037 +end, but we only have to calculate it for the +most frequent. + +0:47:19.599 --> 0:47:39.477 +Machines about that, but of course we don't +model the probability of the infrequent words. + +0:47:39.299 --> 0:47:46.658 +And one idea is to do what is reported as +soles for the structure of the layer. + +0:47:46.606 --> 0:47:53.169 +You see how some years ago people were very +creative in giving names to newer models. + +0:47:53.813 --> 0:48:00.338 +And there the idea is that we model the out +group vocabulary as a clustered strip. + +0:48:00.680 --> 0:48:08.498 +So you don't need to mold all of your bodies +directly, but you are putting words into. + +0:48:08.969 --> 0:48:20.623 +A very intricate word is first in and then +in and then in and that is in sub-sub-clusters + +0:48:20.623 --> 0:48:21.270 +and. + +0:48:21.541 --> 0:48:29.936 +And this is what was mentioned in the past +of the work, so these are the subclasses that + +0:48:29.936 --> 0:48:30.973 +always go. + +0:48:30.973 --> 0:48:39.934 +So if it's in cluster one at the first position +then you only look at all the words which are: + +0:48:40.340 --> 0:48:50.069 +And then you can calculate the probability +of a word again just by the product over these, + +0:48:50.069 --> 0:48:55.522 +so the probability of the word is the first +class. + +0:48:57.617 --> 0:49:12.331 +It's maybe more clear where you have the sole +architecture, so what you will do is first + +0:49:12.331 --> 0:49:13.818 +predict. + +0:49:14.154 --> 0:49:26.435 +Then you go to the appropriate sub-class, +then you calculate the probability of the sub-class. + +0:49:27.687 --> 0:49:34.932 +Anybody have an idea why this is more, more +efficient, or if people do it first, it looks + +0:49:34.932 --> 0:49:35.415 +more. + +0:49:42.242 --> 0:49:56.913 +Yes, so you have to do less calculations, +or maybe here you have to calculate the element + +0:49:56.913 --> 0:49:59.522 +there, but you. + +0:49:59.980 --> 0:50:06.116 +The capabilities in the set classes that you're +going through and not for all of them. + +0:50:06.386 --> 0:50:16.688 +Therefore, it's only more efficient if you +don't need all awkward preferences because + +0:50:16.688 --> 0:50:21.240 +you have to even calculate the class. + +0:50:21.501 --> 0:50:30.040 +So it's only more efficient in scenarios where +you really need to use a language to evaluate. + +0:50:35.275 --> 0:50:54.856 +How this works is that on the output layer +you only have a vocabulary of: But on the input + +0:50:54.856 --> 0:51:05.126 +layer you have always your full vocabulary +because at the input we saw that this is not + +0:51:05.126 --> 0:51:06.643 +complicated. + +0:51:06.906 --> 0:51:19.778 +And then you can cluster down all your words, +embedding series of classes, and use that as + +0:51:19.778 --> 0:51:23.031 +your classes for that. + +0:51:23.031 --> 0:51:26.567 +So yeah, you have words. + +0:51:29.249 --> 0:51:32.593 +Is one idea of doing it. + +0:51:32.593 --> 0:51:44.898 +There is also a second idea of doing it again, +the idea that we don't need the probability. + +0:51:45.025 --> 0:51:53.401 +So sometimes it doesn't really need to be +a probability to evaluate. + +0:51:53.401 --> 0:52:05.492 +It's only important that: And: Here is called +self-normalization. 
+ +0:52:05.492 --> 0:52:19.349 +What people have done so is in the softmax +is always to the input divided by normalization. + +0:52:19.759 --> 0:52:25.194 +So this is how we calculate the soft mix. + +0:52:25.825 --> 0:52:42.224 +And in self-normalization now, the idea is +that we don't need to calculate the logarithm. + +0:52:42.102 --> 0:52:54.284 +That would be zero, and then you don't even +have to calculate the normalization. + +0:52:54.514 --> 0:53:01.016 +So how can we achieve that? + +0:53:01.016 --> 0:53:08.680 +And then there's the nice thing. + +0:53:09.009 --> 0:53:14.743 +And our novel Lots and more to maximize probability. + +0:53:14.743 --> 0:53:23.831 +We have this cross entry lot that probability +is higher, and now we're just adding. + +0:53:24.084 --> 0:53:31.617 +And the second loss just tells us you're pleased +training the way the lock set is zero. + +0:53:32.352 --> 0:53:38.625 +So then if it's nearly zero at the end you +don't need to calculate this and it's also + +0:53:38.625 --> 0:53:39.792 +very efficient. + +0:53:40.540 --> 0:53:57.335 +One important thing is this is only an inference, +so during tests we don't need to calculate. + +0:54:00.480 --> 0:54:15.006 +You can do a bit of a hyperparameter here +where you do the waiting and how much effort + +0:54:15.006 --> 0:54:16.843 +should be. + +0:54:18.318 --> 0:54:35.037 +The only disadvantage is that it's no speed +up during training and there are other ways + +0:54:35.037 --> 0:54:37.887 +of doing that. + +0:54:41.801 --> 0:54:43.900 +I'm with you all. + +0:54:44.344 --> 0:54:48.540 +Then we are coming very, very briefly like +this one here. + +0:54:48.828 --> 0:54:53.692 +There are more things on different types of +languages. + +0:54:53.692 --> 0:54:58.026 +We are having a very short view of a restricted. + +0:54:58.298 --> 0:55:09.737 +And then we'll talk about recurrent neural +networks for our language minds because they + +0:55:09.737 --> 0:55:17.407 +have the advantage now that we can't even further +improve. + +0:55:18.238 --> 0:55:24.395 +There's also different types of neural networks. + +0:55:24.395 --> 0:55:30.175 +These ballroom machines are not having input. + +0:55:30.330 --> 0:55:39.271 +They have these binary units: And they define +an energy function on the network, which can + +0:55:39.271 --> 0:55:46.832 +be in respect of bottom machines efficiently +calculated, and restricted needs. + +0:55:46.832 --> 0:55:53.148 +You only have connections between the input +and the hidden layer. + +0:55:53.393 --> 0:56:00.190 +So you see here you don't have input and output, +you just have an input and you calculate what. + +0:56:00.460 --> 0:56:16.429 +Which of course nicely fits with the idea +we're having, so you can use this for N gram + +0:56:16.429 --> 0:56:19.182 +language ones. + +0:56:19.259 --> 0:56:25.187 +Decaying this credibility of the input by +this type of neural networks. + +0:56:26.406 --> 0:56:30.582 +And the advantage of this type of model of +board that is. + +0:56:30.550 --> 0:56:38.629 +Very fast to integrate it, so that one was +the first one which was used during decoding. + +0:56:38.938 --> 0:56:50.103 +The problem of it is that the Enron language +models were very good at performing the calculation. + +0:56:50.230 --> 0:57:00.114 +So what people typically did is we talked +about a best list, so they generated a most + +0:57:00.114 --> 0:57:05.860 +probable output, and then they scored each +entry. 
+ +0:57:06.146 --> 0:57:10.884 +A language model, and then only like change +the order against that based on that which. + +0:57:11.231 --> 0:57:20.731 +The knifing is maybe only hundred entries, +while during decoding you will look at several + +0:57:20.731 --> 0:57:21.787 +thousand. + +0:57:26.186 --> 0:57:40.437 +This but let's look at the context, so we +have now seen your language models. + +0:57:40.437 --> 0:57:43.726 +There is the big. + +0:57:44.084 --> 0:57:57.552 +Remember ingram language is not always words +because sometimes you have to back off or interpolation + +0:57:57.552 --> 0:57:59.953 +to lower ingrams. + +0:58:00.760 --> 0:58:05.504 +However, in neural models we always have all +of these inputs and some of these. + +0:58:07.147 --> 0:58:21.262 +The disadvantage is that you are still limited +in your context, and if you remember the sentence + +0:58:21.262 --> 0:58:23.008 +from last,. + +0:58:22.882 --> 0:58:28.445 +Sometimes you need more context and there's +unlimited contexts that you might need and + +0:58:28.445 --> 0:58:34.838 +you can always create sentences where you need +this file context in order to put a good estimation. + +0:58:35.315 --> 0:58:44.955 +Can we also do it different in order to better +understand that it makes sense to view? + +0:58:45.445 --> 0:58:57.621 +So sequence labeling tasks are a very common +type of towns in natural language processing + +0:58:57.621 --> 0:59:03.438 +where you have an input sequence and then. + +0:59:03.323 --> 0:59:08.663 +I've token so you have one output for each +input so machine translation is not a secret + +0:59:08.663 --> 0:59:14.063 +labeling cast because the number of inputs +and the number of outputs is different so you + +0:59:14.063 --> 0:59:19.099 +put in a string German which has five words +and the output can be six or seven or. + +0:59:19.619 --> 0:59:20.155 +Secrets. + +0:59:20.155 --> 0:59:24.083 +Lately you always have the same number of +and the same number of. + +0:59:24.944 --> 0:59:40.940 +And you can model language modeling as that, +and you just say a label for each word is always + +0:59:40.940 --> 0:59:43.153 +a next word. + +0:59:45.705 --> 0:59:54.823 +This is the more general you can think of +it, for example how to speech taking entity + +0:59:54.823 --> 0:59:56.202 +recognition. + +0:59:58.938 --> 1:00:08.081 +And if you look at now fruit cut token in +generally sequence, they can depend on import + +1:00:08.081 --> 1:00:08.912 +tokens. + +1:00:09.869 --> 1:00:11.260 +Nice thing. + +1:00:11.260 --> 1:00:21.918 +In our case, the output tokens are the same +so we can easily model it that they only depend + +1:00:21.918 --> 1:00:24.814 +on all the input tokens. + +1:00:24.814 --> 1:00:28.984 +So we have this whether it's or so. + +1:00:31.011 --> 1:00:42.945 +But we can always do a look at what specific +type of sequence labeling, unidirectional sequence + +1:00:42.945 --> 1:00:44.188 +labeling. + +1:00:44.584 --> 1:00:58.215 +And that's exactly how we want the language +of the next word only depends on all the previous + +1:00:58.215 --> 1:01:00.825 +words that we're. + +1:01:01.321 --> 1:01:12.899 +Mean, of course, that's not completely true +in a language that the bad context might also + +1:01:12.899 --> 1:01:14.442 +be helpful. + +1:01:14.654 --> 1:01:22.468 +We will model always the probability of a +word given on its history, and therefore we + +1:01:22.468 --> 1:01:23.013 +need. 
+ +1:01:23.623 --> 1:01:29.896 +And currently we did there this approximation +in sequence labeling that we have this windowing + +1:01:29.896 --> 1:01:30.556 +approach. + +1:01:30.951 --> 1:01:43.975 +So in order to predict this type of word we +always look at the previous three words and + +1:01:43.975 --> 1:01:48.416 +then to do this one we again. + +1:01:49.389 --> 1:01:55.137 +If you are into neural networks you recognize +this type of structure. + +1:01:55.137 --> 1:01:57.519 +Also are the typical neural. + +1:01:58.938 --> 1:02:09.688 +Yes, so this is like Engram, Louis Couperus, +and at least in some way compared to the original, + +1:02:09.688 --> 1:02:12.264 +you're always looking. + +1:02:14.334 --> 1:02:30.781 +However, there are also other types of neural +network structures which we can use for sequence. + +1:02:32.812 --> 1:02:34.678 +That we can do so. + +1:02:34.678 --> 1:02:39.686 +The idea is in recurrent neural network structure. + +1:02:39.686 --> 1:02:43.221 +We are saving the complete history. + +1:02:43.623 --> 1:02:55.118 +So again we have to do like this fix size +representation because neural networks always + +1:02:55.118 --> 1:02:56.947 +need to have. + +1:02:57.157 --> 1:03:05.258 +And then we start with an initial value for +our storage. + +1:03:05.258 --> 1:03:15.917 +We are giving our first input and then calculating +the new representation. + +1:03:16.196 --> 1:03:26.328 +If you look at this, it's just again your +network was two types of inputs: in your work, + +1:03:26.328 --> 1:03:29.743 +in your initial hidden state. + +1:03:30.210 --> 1:03:46.468 +Then you can apply it to the next type of +input and you're again having. + +1:03:47.367 --> 1:03:53.306 +Nice thing is now that you can do now step +by step by step, so all the way over. + +1:03:55.495 --> 1:04:05.245 +The nice thing that we are having here now +is that we are having context information from + +1:04:05.245 --> 1:04:07.195 +all the previous. + +1:04:07.607 --> 1:04:13.582 +So if you're looking like based on which words +do you use here, calculate your ability of + +1:04:13.582 --> 1:04:14.180 +varying. + +1:04:14.554 --> 1:04:20.128 +It depends on is based on this path. + +1:04:20.128 --> 1:04:33.083 +It depends on and this hidden state was influenced +by this one and this hidden state. + +1:04:33.473 --> 1:04:37.798 +So now we're having something new. + +1:04:37.798 --> 1:04:46.449 +We can really model the word probability not +only on a fixed context. + +1:04:46.906 --> 1:04:53.570 +Because the in-states we're having here in +our area are influenced by all the trivia. + +1:04:56.296 --> 1:05:00.909 +So how is that to mean? + +1:05:00.909 --> 1:05:16.288 +If you're not thinking about the history of +clustering, we said the clustering. + +1:05:16.736 --> 1:05:24.261 +So do not need to do any clustering here, +and we also see how things are put together + +1:05:24.261 --> 1:05:26.273 +in order to really do. + +1:05:29.489 --> 1:05:43.433 +In the green box this way since we are starting +from the left point to the right. + +1:05:44.524 --> 1:05:48.398 +And that's right, so they're clustered in +some parts. + +1:05:48.398 --> 1:05:58.196 +Here is some type of clustering happening: +It's continuous representations, but a smaller + +1:05:58.196 --> 1:06:02.636 +difference doesn't matter again. + +1:06:02.636 --> 1:06:10.845 +So if you have a lot of different histories, +the similarity. 
+
+1:04:56.296 --> 1:05:16.288
+So what does that mean? If you think back to the
+clustering of histories, we said we have to
+cluster histories somehow.
+
+1:05:16.736 --> 1:05:26.273
+Here we do not need to do any explicit
+clustering, and we also see how things are put
+together in order to really do the prediction.
+
+1:05:29.489 --> 1:05:43.433
+Everything ends up in the green box this way,
+since we are going from the left to the right.
+
+1:05:44.524 --> 1:05:48.398
+And that's right, so histories are still
+clustered in some way.
+
+1:05:48.398 --> 1:06:10.845
+Some type of clustering is happening here: it's
+a continuous representation, where a small
+difference doesn't matter much. So if you have a
+lot of different histories, similar ones get
+similar representations.
+
+1:06:11.071 --> 1:06:15.791
+Because in order to do the final prediction you
+only use the green box.
+
+1:06:16.156 --> 1:06:30.235
+So you are still learning some type of
+clustering, but you don't have to make a hard
+decision.
+
+1:06:30.570 --> 1:06:39.013
+The only restriction you impose is that
+everything that is important has to be stored
+in this hidden state.
+
+1:06:39.359 --> 1:06:57.138
+So it's a different type of limitation than
+calculating the probability based only on the
+last few words.
+
+1:06:57.437 --> 1:07:09.645
+Of course you still need to group things
+somehow in order to do it efficiently.
+
+1:07:09.970 --> 1:07:28.038
+But here things get merged together in this
+hidden representation, which then depends
+
+1:07:28.288 --> 1:07:33.104
+on all the previous words; the bottleneck for a
+good estimation is now a different one.
+
+1:07:34.474 --> 1:07:41.242
+So the idea is that we can store our whole
+history in one vector.
+
+1:07:41.581 --> 1:07:57.865
+Which is very good and makes the model stronger.
+Later we come to the problems of that: of
+course, at some point it might get difficult.
+
+1:07:58.398 --> 1:08:02.230
+Then maybe things get overwritten, or you cannot
+store everything in there.
+
+1:08:02.662 --> 1:08:04.514
+So.
+
+1:08:04.184 --> 1:08:23.071
+Therefore, for short things like single
+sentences this works well, but if you think of
+tasks like document-level translation, where
+you need to consider a full document, things
+get a bit more complicated, and we will learn
+another type of model for that.
+
+1:08:24.464 --> 1:08:30.455
+Furthermore, in order to understand these
+networks, it's good to always have both views.
+
+1:08:30.710 --> 1:08:48.532
+So this is the unrolled view, where you have
+this type of network. Alternatively, it can be
+shown as: we have here the output, and here is
+the network, which is connected to itself, and
+that is the recurrent view.
+
+1:08:56.176 --> 1:09:11.991
+There is one challenge in these networks, and
+that is the training: how do we train them?
+
+1:09:12.272 --> 1:09:20.147
+At first we don't really know how to train them,
+but if you unroll them like this,
+
+1:09:20.540 --> 1:09:38.054
+it's exactly the same as a feed-forward network,
+so you can measure your errors and then
+back-propagate them.
+
+1:09:38.378 --> 1:09:45.647
+The nice thing is, if you unroll it, it's a
+feed-forward network and you can train it.
+
+1:09:46.106 --> 1:09:57.555
+The only important thing is, of course, that for
+inputs of different lengths the unrolled network
+looks different, and you have to take that into
+account.
+
+1:09:57.837 --> 1:10:08.817
+But since the parameters are shared across
+steps, it is still similar and you can use the
+same training algorithm.
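+
+As a toy illustration of this training by unrolling (scalar weights and made-up
+numbers, so only a sketch, not the full algorithm): the unrolled chain is an
+ordinary feed-forward computation that reuses the same weight at every step,
+the error is propagated back through all steps, and the gradient contributions
+for the shared weight are summed. The repeated multiplication by the weight and
+the tanh derivative in the backward loop is also exactly where the
+vanishing-gradient problem discussed next comes from.
+
+```python
+import numpy as np
+
+# Toy "scalar RNN": forward through the unrolled chain, then backward through it.
+w_x, w_h = 0.5, 0.8          # shared parameters (made-up values)
+xs = [1.0, 0.2, -0.5]        # a short input sequence
+target = 0.3
+
+# forward pass through the unrolled chain
+h = 0.0
+hs = [h]
+for x in xs:
+    h = np.tanh(w_x * x + w_h * h)
+    hs.append(h)
+loss = 0.5 * (h - target) ** 2
+
+# backward pass: propagate the error back through every unrolled step
+grad_w_h = 0.0
+dh = h - target                          # dLoss/dh at the last step
+for t in reversed(range(len(xs))):
+    pre = w_x * xs[t] + w_h * hs[t]      # pre-activation of step t
+    dpre = dh * (1.0 - np.tanh(pre) ** 2)
+    grad_w_h += dpre * hs[t]             # contribution of step t to the SHARED weight
+    dh = dpre * w_h                      # error flowing on to the previous hidden state
+
+print(loss, grad_w_h)
+```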
+
+1:10:10.310 --> 1:10:16.113
+One thing which makes training difficult is what
+is referred to as the vanishing gradient.
+
+1:10:16.113 --> 1:10:22.111
+We said there is a big advantage of these
+models, and that's why we are using them.
+
+1:10:22.111 --> 1:10:29.414
+The output here does not only depend on the
+current input or the last three words, but on
+everything that was said before.
+
+1:10:29.809 --> 1:10:32.803
+That's a very strong property and the motivation
+for using RNNs.
+
+1:10:33.593 --> 1:10:44.599
+However, if you use a standard RNN, the
+influence of words far in the past gets smaller
+and smaller, and the model can hardly use them.
+
+1:10:44.804 --> 1:10:59.659
+Because the gradients get smaller and smaller,
+the error propagated from here back to this
+earlier step contributes less and less.
+
+1:11:00.020 --> 1:11:06.710
+And yeah, that's why standard RNNs are difficult
+to train.
+
+1:11:07.247 --> 1:11:11.481
+So if we are talking about RNNs nowadays,
+
+1:11:11.791 --> 1:11:30.931
+what we typically mean are long short-term
+memories (LSTMs). They are by now quite old
+already, but they have special gating mechanisms.
+
+1:11:31.171 --> 1:11:44.737
+In the language modeling task, for example, it
+can be important to store information like
+whether the sentence started as a question.
+
+1:11:44.684 --> 1:11:52.556
+Because if you only look at the last five words,
+it's often no longer clear whether it is a
+question or a normal statement.
+
+1:11:53.013 --> 1:12:08.571
+So there you have these gating mechanisms, with
+a write gate, in order to store things for a
+longer time in your memory.
+
+1:12:10.730 --> 1:12:20.147
+These are still used in quite a lot of work.
+
+1:12:21.541 --> 1:12:30.487
+Especially for text machine translation, the
+standard now is to use transformer-based models.
+
+1:12:30.690 --> 1:12:42.857
+But this type of architecture still matters; for
+example, we will later have one lecture about
+efficiency.
+
+1:12:42.882 --> 1:12:53.044
+And there, in the decoder of some networks, RNNs
+are still used for efficiency reasons.
+
+1:12:53.473 --> 1:12:57.542
+So it's not that RNNs are of no importance.
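+
+A minimal sketch of one such gated update (a single LSTM step in NumPy, with
+random weights and no bias terms, so purely illustrative rather than the exact
+formulation of any particular toolkit): the forget gate decides how much of the
+old memory to keep, the input gate how much new content to write, and the
+output gate how much of the memory to expose.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+d_in, d_hid = 8, 16
+rng = np.random.default_rng(1)
+# one weight matrix per gate, each acting on [input, previous hidden state]
+W_f, W_i, W_o, W_c = (rng.normal(size=(d_in + d_hid, d_hid)) for _ in range(4))
+
+def lstm_step(x, h_prev, c_prev):
+    z = np.concatenate([x, h_prev])
+    f = sigmoid(z @ W_f)          # forget gate: what to keep of the old memory
+    i = sigmoid(z @ W_i)          # input gate: what to write into the memory
+    o = sigmoid(z @ W_o)          # output gate: what to expose to the next layer
+    c_tilde = np.tanh(z @ W_c)    # candidate new content
+    c = f * c_prev + i * c_tilde  # updated cell state (the long-term memory)
+    h = o * np.tanh(c)            # new hidden state
+    return h, c
+
+h, c = np.zeros(d_hid), np.zeros(d_hid)
+for _ in range(5):                # run a few steps on random "word" inputs
+    h, c = lstm_step(rng.normal(size=d_in), h, c)
+```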
+
+1:12:59.239 --> 1:13:08.956
+In order to make them strong, there are some
+more things which are helpful and should be
+mentioned.
+
+1:13:09.309 --> 1:13:19.668
+One thing is a very easy and nice trick to make
+the neural network stronger and better.
+
+1:13:19.739 --> 1:13:23.451
+Of course, it doesn't always work; you have to
+have enough training data.
+
+1:13:23.763 --> 1:13:30.598
+But in general, the easiest way of making your
+model bigger and stronger is to increase the
+number of parameters.
+
+1:13:30.630 --> 1:13:43.244
+And you've seen that with the large language
+models, where people are always bragging about
+how many parameters they have.
+
+1:13:43.903 --> 1:13:53.657
+So the question is: how do you get more
+parameters?
+
+1:13:53.657 --> 1:14:10.020
+There are two ways: you can make your
+representations wider, and the other thing,
+that's of course deep learning, is to make your
+networks deeper.
+
+1:14:11.471 --> 1:14:13.831
+And thereby you also get more parameters.
+
+1:14:14.614 --> 1:14:23.330
+There is one problem with deeper networks; it's
+very similar to what we saw with RNNs.
+
+1:14:23.603 --> 1:14:35.475
+With the RNNs we have this problem of gradient
+flow: if it flows over many steps, the gradient
+gets very small.
+
+1:14:35.795 --> 1:14:41.114
+Exactly the same thing happens in deep networks.
+
+1:14:41.114 --> 1:14:52.285
+If you take the gradient at the output, telling
+the network whether it was right or wrong, then
+you propagate it back through all the layers.
+
+1:14:52.612 --> 1:14:56.440
+For three layers that's no problem, but if you
+go to ten, twenty or a hundred layers,
+
+1:14:57.797 --> 1:14:59.690
+that typically becomes a problem.
+
+1:15:00.060 --> 1:15:15.885
+What people are doing is using what are called
+residual connections. That's a very helpful idea.
+
+1:15:15.956 --> 1:15:20.309
+And the idea is that these layers
+
+1:15:20.320 --> 1:15:31.386
+in between should not calculate a completely new
+representation, but only what should be changed
+compared to their input.
+
+1:15:31.731 --> 1:15:37.585
+And therefore, in the end, the output of a layer
+is always added to its input.
+
+1:15:38.318 --> 1:15:48.824
+The nice thing is that later, when you do
+backpropagation, the gradient can flow very
+directly back through these connections.
+
+1:15:49.209 --> 1:16:04.229
+So that is what you see nowadays in all very
+deep architectures: you always have these
+residual connections.
+
+1:16:04.704 --> 1:16:18.792
+This has two advantages: on the one hand, it is
+easier to learn the representations; on the
+other hand, the gradients flow better.
+
+1:16:22.082 --> 1:16:24.114
+Good.
+
+1:16:23.843 --> 1:16:31.763
+So much for the recurrent networks; the last
+thing for today are the word embeddings.
+
+1:16:31.671 --> 1:16:46.707
+Here the language model was used in the system
+itself. We will see them again later, but one
+thing that at the beginning was very essential
+are these word embeddings.
+
+1:16:46.967 --> 1:17:04.166
+People really trained language models partly
+only to get this type of embedding, and
+therefore we want to look a bit more into them.
+
+1:17:09.229 --> 1:17:13.456
+Some last words on the word embeddings.
+
+1:17:13.456 --> 1:17:27.170
+The interesting thing is that word embeddings
+can be used for very different tasks; the
+advantage is that we can train the word
+embeddings once.
+
+1:17:27.347 --> 1:17:31.334
+The nice thing is you can train them on just
+large amounts of data.
+
+1:17:31.931 --> 1:17:41.566
+And then, if you have these word embeddings, you
+no longer have an input layer of ten thousand
+dimensions.
+
+1:17:41.982 --> 1:17:52.231
+Then you can train a smaller model to do the
+other task, and therefore you are more efficient.
+
+1:17:52.532 --> 1:18:08.747
+The initial word embeddings really depend only
+on the word itself. If you look at the two
+meanings of "can", the can of beans or "can they
+do that", both share the same embedding.
+
+1:18:09.189 --> 1:18:27.916
+That ambiguity cannot be resolved here.
+Therefore, you need to know the context, and
+that is what the higher layers do; they take the
+context into account.
+
+1:18:29.489 --> 1:18:33.757
+However, even this first level has some very
+interesting properties.
+
+1:18:34.034 --> 1:18:47.182
+People like to visualize them, which is always a
+bit difficult, because if you look at such a
+word vector,
+
+1:18:47.767 --> 1:18:52.879
+drawing a five-hundred-dimensional vector is
+still a bit challenging.
+
+1:18:53.113 --> 1:19:12.464
+So you cannot do that directly; what people have
+to do is apply some type of dimensionality
+reduction.
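+
+One common choice for such a dimensionality reduction is PCA; here is a minimal
+sketch of it via SVD in NumPy. The embedding matrix and word list are random
+placeholders only to keep the example self-contained; in practice the vectors
+would come from a trained language model.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+words = ["man", "woman", "king", "queen", "walk", "walks"]   # hypothetical vocabulary
+embeddings = rng.normal(size=(len(words), 512))              # 512-dim word vectors
+
+centered = embeddings - embeddings.mean(axis=0)
+# the top-2 right singular vectors of the centered data are the directions of largest variance
+_, _, vt = np.linalg.svd(centered, full_matrices=False)
+coords_2d = centered @ vt[:2].T                              # shape: (num_words, 2)
+
+for w, (x, y) in zip(words, coords_2d):
+    print(f"{w:>6}: ({x:+.2f}, {y:+.2f})")                   # points you could now plot
+```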
+
+1:19:13.073 --> 1:19:17.216
+And of course some information gets lost then,
+but you can still try it.
+
+1:19:18.238 --> 1:19:37.892
+And you see, for example, this is the most
+famous and common example: what you can look at
+is the difference between the male and the
+female form of a word in English.
+
+1:19:38.058 --> 1:19:40.389
+And you can do that for very different words.
+
+1:19:40.780 --> 1:19:45.403
+And that is where the math comes in, and what
+people then look into.
+
+1:19:45.725 --> 1:19:51.410
+So what you can now do, for example, is
+calculate the difference between man and woman.
+
+1:19:52.232 --> 1:20:10.495
+What you can do then is take the embedding of
+king and add to it the difference between man
+and woman; and, this is where people get really
+excited, then you can look at what the most
+similar words are. Of course you won't directly
+hit the correct word; it's a continuous space.
+
+1:20:10.790 --> 1:20:24.062
+But you can look at what the nearest neighbors
+of this point are, and often the word you would
+expect is among them.
+
+1:20:24.224 --> 1:20:33.911
+So it is somehow remarkable that the difference
+between these word pairs is always roughly the
+same.
+
+1:20:34.374 --> 1:20:49.046
+You can do that for different relations. You can
+also look at word forms, for example swimming
+and swim, or walking and walk.
+
+1:20:49.469 --> 1:21:04.016
+So you can try to use this. It's not perfect, of
+course, but the interesting thing is that nobody
+taught the model these principles.
+
+1:21:04.284 --> 1:21:09.910
+It is purely trained on the task of next-word
+prediction.
+
+1:21:10.230 --> 1:21:23.669
+And it even works for some world knowledge, like
+the capitals: this is the difference between a
+country and its capital.
+
+1:21:23.823 --> 1:21:33.760
+Here is another visualization where the same
+thing has been done for the difference between
+countries and their capitals.
+
+1:21:33.853 --> 1:21:53.372
+And you see it's not perfect, but the
+differences point mainly in the same direction,
+so you can even use that for question answering:
+if you know some countries and their capitals,
+you can compute the difference between them,
+apply it to a new country, and get its capital.
+
+1:21:54.834 --> 1:22:04.385
+So these models are able to learn a lot of
+information and compress it into this
+representation.
+
+1:22:05.325 --> 1:22:07.679
+And that just from doing next-word prediction.
+
+1:22:07.707 --> 1:22:26.095
+And that also explains a bit, or at least
+motivates, what the main advantage of this type
+of neural model is.
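+
+The analogy arithmetic described above can be sketched in a few lines; the
+vectors below are random placeholders, so the ranking is meaningless here,
+whereas with trained embeddings the top neighbour of the query would typically
+be the expected word (e.g. queen).
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+vocab = ["king", "queen", "man", "woman", "berlin", "germany", "paris", "france"]
+emb = {w: rng.normal(size=300) for w in vocab}   # placeholder 300-dim embeddings
+
+def cosine(a, b):
+    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
+
+# "king - man + woman": move from the male word by the man->woman difference
+query = emb["king"] - emb["man"] + emb["woman"]
+
+# rank all other words by cosine similarity to the query vector
+candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
+for w in sorted(candidates, key=lambda w: cosine(query, emb[w]), reverse=True):
+    print(w, round(cosine(query, emb[w]), 3))
+```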
+
+1:22:28.568 --> 1:22:49.148
+So to summarize what we did today, what you
+should hopefully take with you is a basic idea
+of neural networks, and then how we can do
+language modeling with neural networks.
+
+1:22:49.449 --> 1:22:59.059
+We looked at three different architectures: the
+feed-forward language model, the RNN, and the
+one based on the restricted Boltzmann machine.
+
+1:22:59.039 --> 1:23:04.559
+And finally, that there are different
+architectures for building neural networks.
+
+1:23:04.559 --> 1:23:14.389
+We have seen feed-forward neural networks and
+recurrent neural networks, and we'll see in the
+next lectures the last type of architecture.
+
+1:23:15.915 --> 1:23:17.438
+Any questions?
+
+1:23:20.680 --> 1:23:27.360
+Then thanks a lot, and next time we will
+continue from there.