diff --git "a/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" "b/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-07-16.05.2023/English.vtt" @@ -0,0 +1,5104 @@ +WEBVTT + +0:00:01.301 --> 0:00:05.707 +Okay So Welcome to Today's Lecture. + +0:00:06.066 --> 0:00:12.592 +I'm sorry for the inconvenience. + +0:00:12.592 --> 0:00:19.910 +Sometimes they are project meetings. + +0:00:19.910 --> 0:00:25.843 +There will be one other time. + +0:00:26.806 --> 0:00:40.863 +So what we want to talk today about is want +to start with neural approaches to machine + +0:00:40.863 --> 0:00:42.964 +translation. + +0:00:43.123 --> 0:00:51.285 +I guess you have heard about other types of +neural models for other types of neural language + +0:00:51.285 --> 0:00:52.339 +processing. + +0:00:52.339 --> 0:00:59.887 +This was some of the first steps in introducing +neal networks to machine translation. + +0:01:00.600 --> 0:01:06.203 +They are similar to what you know they see +in as large language models. + +0:01:06.666 --> 0:01:11.764 +And today look into what are these neuro-language +models? + +0:01:11.764 --> 0:01:13.874 +What is the difference? + +0:01:13.874 --> 0:01:15.983 +What is the motivation? + +0:01:16.316 --> 0:01:21.445 +And first will use them in statistics and +machine translation. + +0:01:21.445 --> 0:01:28.935 +So if you remember how fully like two or three +weeks ago we had this likely model where you + +0:01:28.935 --> 0:01:31.052 +can integrate easily any. + +0:01:31.351 --> 0:01:40.967 +We just have another model which evaluates +how good a system is or how good a fluent language + +0:01:40.967 --> 0:01:41.376 +is. + +0:01:41.376 --> 0:01:53.749 +The main advantage compared to the statistical +models we saw on Tuesday is: Next week we will + +0:01:53.749 --> 0:02:06.496 +then go for a neural machine translation where +we replace the whole model. + +0:02:11.211 --> 0:02:21.078 +Just as a remember from Tuesday, we've seen +the main challenge in language world was that + +0:02:21.078 --> 0:02:25.134 +most of the engrams we haven't seen. + +0:02:26.946 --> 0:02:33.967 +So this was therefore difficult to estimate +any probability because you've seen that normally + +0:02:33.967 --> 0:02:39.494 +if you have not seen the endgram you will assign +the probability of zero. + +0:02:39.980 --> 0:02:49.420 +However, this is not really very good because +we don't want to give zero probabilities to + +0:02:49.420 --> 0:02:54.979 +sentences, which still might be a very good +English. + +0:02:55.415 --> 0:03:02.167 +And then we learned a lot of techniques and +that is the main challenging statistical machine + +0:03:02.167 --> 0:03:04.490 +translate statistical language. + +0:03:04.490 --> 0:03:10.661 +What's how we can give a good estimate of +probability to events that we haven't seen + +0:03:10.661 --> 0:03:12.258 +smoothing techniques? + +0:03:12.258 --> 0:03:15.307 +We've seen this interpolation and begoff. + +0:03:15.435 --> 0:03:21.637 +And they invent or develop very specific techniques. + +0:03:21.637 --> 0:03:26.903 +To deal with that, however, it might not be. + +0:03:28.568 --> 0:03:43.190 +And therefore maybe we can do things different, +so if we have not seen an gram before in statistical + +0:03:43.190 --> 0:03:44.348 +models. + +0:03:45.225 --> 0:03:51.361 +Before and we can only get information from +exactly the same words. 
+ +0:03:51.411 --> 0:04:06.782 +We don't have some on like approximate matching +like that, maybe in a sentence that cures similarly. + +0:04:06.782 --> 0:04:10.282 +So if you have seen a. + +0:04:11.191 --> 0:04:17.748 +And so you would like to have more something +like that where endgrams are represented, more + +0:04:17.748 --> 0:04:21.953 +in a general space, and we can generalize similar +numbers. + +0:04:22.262 --> 0:04:29.874 +So if you learn something about walk then +maybe we can use this knowledge and also apply. + +0:04:30.290 --> 0:04:42.596 +The same as we have done before, but we can +really better model how similar they are and + +0:04:42.596 --> 0:04:45.223 +transfer to other. + +0:04:47.047 --> 0:04:54.236 +And we maybe want to do that in a more hierarchical +approach that we know okay. + +0:04:54.236 --> 0:05:02.773 +Some words are similar but like go and walk +is somehow similar and I and P and G and therefore + +0:05:02.773 --> 0:05:06.996 +like maybe if we then merge them in an engram. + +0:05:07.387 --> 0:05:15.861 +If we learn something about our walk, then +it should tell us also something about Hugo. + +0:05:15.861 --> 0:05:17.113 +He walks or. + +0:05:17.197 --> 0:05:27.327 +You see that there is some relations which +we need to integrate for you. + +0:05:27.327 --> 0:05:35.514 +We need to add the s, but maybe walks should +also be here. + +0:05:37.137 --> 0:05:45.149 +And luckily there is one really convincing +method in doing that: And that is by using + +0:05:45.149 --> 0:05:47.231 +a neural mechanism. + +0:05:47.387 --> 0:05:58.497 +That's what we will introduce today so we +can use this type of neural networks to try + +0:05:58.497 --> 0:06:04.053 +to learn this similarity and to learn how. + +0:06:04.324 --> 0:06:14.355 +And that is one of the main advantages that +we have by switching from the standard statistical + +0:06:14.355 --> 0:06:15.200 +models. + +0:06:15.115 --> 0:06:22.830 +To learn similarities between words and generalized, +and learn what is called hidden representations + +0:06:22.830 --> 0:06:29.705 +or representations of words, where we can measure +similarity in some dimensions of words. + +0:06:30.290 --> 0:06:42.384 +So we can measure in which way words are similar. + +0:06:42.822 --> 0:06:48.902 +We had it before and we've seen that words +were just easier. + +0:06:48.902 --> 0:06:51.991 +The only thing we did is like. + +0:06:52.192 --> 0:07:02.272 +But this energies don't have any meaning, +so it wasn't that word is more similar to words. + +0:07:02.582 --> 0:07:12.112 +So we couldn't learn anything about words +in the statistical model and that's a big challenge. + +0:07:12.192 --> 0:07:23.063 +About words even like in morphology, so going +goes is somehow more similar because the person + +0:07:23.063 --> 0:07:24.219 +singular. + +0:07:24.264 --> 0:07:34.924 +The basic models we have to now have no idea +about that and goes as similar to go than it + +0:07:34.924 --> 0:07:37.175 +might be to sleep. + +0:07:39.919 --> 0:07:44.073 +So what we want to do today. + +0:07:44.073 --> 0:07:53.096 +In order to go to this we will have a short +introduction into. + +0:07:53.954 --> 0:08:05.984 +It very short just to see how we use them +here, but that's a good thing, so most of you + +0:08:05.984 --> 0:08:08.445 +think it will be. + +0:08:08.928 --> 0:08:14.078 +And then we will first look into a feet forward +neural network language models. + +0:08:14.454 --> 0:08:23.706 +And there we will still have this approximation. 
+ +0:08:23.706 --> 0:08:33.902 +We have before we are looking only at a fixed +window. + +0:08:34.154 --> 0:08:35.030 +The case. + +0:08:35.030 --> 0:08:38.270 +However, we have the umbellent here. + +0:08:38.270 --> 0:08:43.350 +That's why they're already better in order +to generalize. + +0:08:44.024 --> 0:08:53.169 +And then at the end we'll look at language +models where we then have the additional advantage. + +0:08:53.093 --> 0:09:04.317 +Case that we need to have a fixed history, +but in theory we can model arbitrary long dependencies. + +0:09:04.304 --> 0:09:12.687 +And we talked about on Tuesday where it is +not clear what type of information it is to. + +0:09:16.396 --> 0:09:24.981 +So in general molecular networks I normally +learn to prove that they perform some tasks. + +0:09:25.325 --> 0:09:33.472 +We have the structure and we are learning +them from samples so that is similar to what + +0:09:33.472 --> 0:09:34.971 +we have before. + +0:09:34.971 --> 0:09:42.275 +So now we have the same task here, a language +model giving input or forwards. + +0:09:42.642 --> 0:09:48.959 +And is somewhat originally motivated by human +brain. + +0:09:48.959 --> 0:10:00.639 +However, when you now need to know about artificial +neural networks, it's hard to get similarity. + +0:10:00.540 --> 0:10:02.889 +There seemed to be not that point. + +0:10:03.123 --> 0:10:11.014 +So what they are mainly doing is summoning +multiplication and then one non-linear activation. + +0:10:12.692 --> 0:10:16.085 +So the basic units are these type of. + +0:10:17.937 --> 0:10:29.891 +Perceptron basic blocks which we have and +this does processing so we have a fixed number + +0:10:29.891 --> 0:10:36.070 +of input features and that will be important. + +0:10:36.096 --> 0:10:39.689 +So we have here numbers to xn as input. + +0:10:40.060 --> 0:10:53.221 +And this makes partly of course language processing +difficult. + +0:10:54.114 --> 0:10:57.609 +So we have to model this time on and then +go stand home and model. + +0:10:58.198 --> 0:11:02.099 +Then we are having weights, which are the +parameters and the number of weights exactly + +0:11:02.099 --> 0:11:03.668 +the same as the number of weights. + +0:11:04.164 --> 0:11:06.322 +Of input features. + +0:11:06.322 --> 0:11:15.068 +Sometimes he has his fires in there, and then +it's not really an input from. + +0:11:15.195 --> 0:11:19.205 +And what you then do is multiply. + +0:11:19.205 --> 0:11:26.164 +Each input resists weight and then you sum +it up and then. + +0:11:26.606 --> 0:11:34.357 +What is then additionally later important +is that we have an activation function and + +0:11:34.357 --> 0:11:42.473 +it's important that this activation function +is non linear, so we come to just a linear. + +0:11:43.243 --> 0:11:54.088 +And later it will be important that this is +differentiable because otherwise all the training. + +0:11:54.714 --> 0:12:01.907 +This model by itself is not very powerful. + +0:12:01.907 --> 0:12:10.437 +It was originally shown that this is not powerful. + +0:12:10.710 --> 0:12:19.463 +However, there is a very easy extension, the +multi layer perceptual, and then things get + +0:12:19.463 --> 0:12:20.939 +very powerful. + +0:12:21.081 --> 0:12:27.719 +The thing is you just connect a lot of these +in this layer of structures and we have our + +0:12:27.719 --> 0:12:35.029 +input layer where we have the inputs and our +hidden layer at least one where there is everywhere. + +0:12:35.395 --> 0:12:39.817 +And then we can combine them all to do that. 
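To make this concrete, here is a minimal NumPy sketch of a single perceptron unit and of a small multi-layer perceptron forward pass; all sizes, weights and inputs are made up for illustration. Each unit is just a weighted sum of its fixed number of inputs plus a bias, followed by a non-linear activation, and a layer computes many such units at once.

```python
import numpy as np

def relu(x):
    # non-linear and (almost everywhere) differentiable activation
    return np.maximum(0.0, x)

# a single unit: weighted sum of a fixed number of inputs, plus bias, then activation
x = np.array([0.5, -1.0, 2.0])      # n input features
w = np.array([0.1, 0.4, -0.3])      # one weight per input
b = 0.2                             # bias
unit_output = relu(w @ x + b)

# a multi-layer perceptron: a whole layer is just a weight matrix
W1 = np.random.randn(4, 3) * 0.1    # hidden layer: 4 units, each sees all 3 inputs
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1    # output layer: 2 units on top of the hidden layer
b2 = np.zeros(2)
hidden = relu(W1 @ x + b1)
output = W2 @ hidden + b2
print(unit_output, hidden, output)
```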
+ +0:12:40.260 --> 0:12:48.320 +The input layer is of course somewhat given +by a problem of dimension. + +0:12:48.320 --> 0:13:00.013 +The outward layer is also given by your dimension, +but the hidden layer is of course a hyperparameter. + +0:13:01.621 --> 0:13:08.802 +So let's start with the first question, now +more language related, and that is how we represent. + +0:13:09.149 --> 0:13:23.460 +So we've seen here we have the but the question +is now how can we put in a word into this? + +0:13:26.866 --> 0:13:34.117 +Noise: The first thing we're able to be better +is by the fact that like you are said,. + +0:13:34.314 --> 0:13:43.028 +That is not that easy because the continuous +vector will come to that. + +0:13:43.028 --> 0:13:50.392 +So from the neo-network we can directly put +in the bedding. + +0:13:50.630 --> 0:13:57.277 +But if we need to input a word into the needle +network, it has to be something which is easily + +0:13:57.277 --> 0:13:57.907 +defined. + +0:13:59.079 --> 0:14:12.492 +The one hood encoding, and then we have one +out of encoding, so one value is one, and all + +0:14:12.492 --> 0:14:15.324 +the others is the. + +0:14:16.316 --> 0:14:25.936 +That means we are always dealing with fixed +vocabulary because what said is we cannot. + +0:14:26.246 --> 0:14:38.017 +So you cannot easily extend your vocabulary +because if you mean you would extend your vocabulary. + +0:14:39.980 --> 0:14:41.502 +That's also motivating. + +0:14:41.502 --> 0:14:43.722 +We're talked about biperriagoding. + +0:14:43.722 --> 0:14:45.434 +That's a nice thing there. + +0:14:45.434 --> 0:14:47.210 +We have a fixed vocabulary. + +0:14:48.048 --> 0:14:55.804 +The big advantage of this one encoding is +that we don't implicitly sum our implement + +0:14:55.804 --> 0:15:04.291 +similarity between words, but really re-learning +because if you first think about this, this + +0:15:04.291 --> 0:15:06.938 +is a very, very inefficient. + +0:15:07.227 --> 0:15:15.889 +So you need like to represent end words, you +need a dimension of an end dimensional vector. + +0:15:16.236 --> 0:15:24.846 +Imagine you could do binary encoding so you +could represent words as binary vectors. + +0:15:24.846 --> 0:15:26.467 +Then you would. + +0:15:26.806 --> 0:15:31.177 +Will be significantly more efficient. + +0:15:31.177 --> 0:15:36.813 +However, then you have some implicit similarity. + +0:15:36.813 --> 0:15:39.113 +Some numbers share. + +0:15:39.559 --> 0:15:46.958 +Would somehow be bad because you would force +someone to do this by hand or clear how to + +0:15:46.958 --> 0:15:47.631 +define. + +0:15:48.108 --> 0:15:55.135 +So therefore currently this is the most successful +approach to just do this one watch. + +0:15:55.095 --> 0:15:59.563 +Representations, so we take a fixed vocabulary. + +0:15:59.563 --> 0:16:06.171 +We map each word to the inise, and then we +represent a word like this. + +0:16:06.171 --> 0:16:13.246 +So if home will be one, the representation +will be one zero zero zero, and. + +0:16:14.514 --> 0:16:30.639 +But this dimension here is a vocabulary size +and that is quite high, so we are always trying + +0:16:30.639 --> 0:16:33.586 +to be efficient. + +0:16:33.853 --> 0:16:43.792 +We are doing then some type of efficiency +because typically we are having this next layer. + +0:16:44.104 --> 0:16:51.967 +It can be still maybe two hundred or five +hundred or one thousand neurons, but this is + +0:16:51.967 --> 0:16:53.323 +significantly. 
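Here is a small sketch of the one-hot encoding just described, with a made-up toy vocabulary: every word gets a fixed index, and its vector has the vocabulary size as its dimension, with a single one and zeros everywhere else.

```python
import numpy as np

# toy vocabulary; in practice this is a fixed list of e.g. 10,000 words
vocab = ["home", "go", "walk", "the", "<unk>"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # vector length = vocabulary size, exactly one entry is 1
    v = np.zeros(len(vocab))
    v[word2id.get(word, word2id["<unk>"])] = 1.0
    return v

print(one_hot("home"))   # [1. 0. 0. 0. 0.]
print(one_hot("walk"))   # [0. 0. 1. 0. 0.]
```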
+ +0:16:53.713 --> 0:17:03.792 +You can learn that directly and there we then +have similarity between words. + +0:17:03.792 --> 0:17:07.458 +Then it is that some words. + +0:17:07.807 --> 0:17:14.772 +But the nice thing is that this is then learned +that we are not need to hand define that. + +0:17:17.117 --> 0:17:32.742 +We'll come later to the explicit architecture +of the neural language one, and there we can + +0:17:32.742 --> 0:17:35.146 +see how it's. + +0:17:38.418 --> 0:17:44.857 +So we're seeing that the other one or our +representation always has the same similarity. + +0:17:45.105 --> 0:17:59.142 +Then we're having this continuous factor which +is a lot smaller dimension and that's important + +0:17:59.142 --> 0:18:00.768 +for later. + +0:18:01.121 --> 0:18:06.989 +What we are doing then is learning these representations +so that they are best for language. + +0:18:07.487 --> 0:18:14.968 +So the representations are implicitly training +the language for the cards. + +0:18:14.968 --> 0:18:19.058 +This is the best way for doing language. + +0:18:19.479 --> 0:18:32.564 +And the nice thing that was found out later +is these representations are really good. + +0:18:33.153 --> 0:18:39.253 +And that is why they are now even called word +embeddings by themselves and used for other + +0:18:39.253 --> 0:18:39.727 +tasks. + +0:18:40.360 --> 0:18:49.821 +And they are somewhat describing very different +things so they can describe and semantic similarities. + +0:18:49.789 --> 0:18:58.650 +Are looking at the very example of today mass +vector space by adding words and doing some + +0:18:58.650 --> 0:19:00.618 +interesting things. + +0:19:00.940 --> 0:19:11.178 +So they got really like the first big improvement +when switching to neurostaff. + +0:19:11.491 --> 0:19:20.456 +Are like part of the model, but with more +complex representation, but they are the basic + +0:19:20.456 --> 0:19:21.261 +models. + +0:19:23.683 --> 0:19:36.979 +In the output layer we are also having one +output layer structure and a connection function. + +0:19:36.997 --> 0:19:46.525 +That is, for language learning we want to +predict what is the most common word. + +0:19:47.247 --> 0:19:56.453 +And that can be done very well with this so +called soft back layer, where again the dimension. + +0:19:56.376 --> 0:20:02.825 +Vocabulary size, so this is a vocabulary size, +and again the case neural represents the case + +0:20:02.825 --> 0:20:03.310 +class. + +0:20:03.310 --> 0:20:09.759 +So in our case we have again one round representation, +someone saying this is a core report. + +0:20:10.090 --> 0:20:17.255 +Our probability distribution is a probability +distribution over all works, so the case entry + +0:20:17.255 --> 0:20:21.338 +tells us how probable is that the next word +is this. + +0:20:22.682 --> 0:20:33.885 +So we need to have some probability distribution +at our output in order to achieve that this + +0:20:33.885 --> 0:20:37.017 +activation function goes. + +0:20:37.197 --> 0:20:46.944 +And we can achieve that with a soft max activation +we take the input to the form of the value, + +0:20:46.944 --> 0:20:47.970 +and then. + +0:20:48.288 --> 0:20:58.021 +So by having this type of activation function +we are really getting this type of probability. + +0:20:59.019 --> 0:21:15.200 +At the beginning was also very challenging +because again we have this inefficient representation. + +0:21:15.235 --> 0:21:29.799 +You can imagine that something over is maybe +a bit inefficient with cheap users, but definitely. 
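A minimal sketch of the softmax activation described here, with made-up scores: it turns the arbitrary output values of the last layer into a probability distribution over the vocabulary, one probability per word, summing to one.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the maximum for numerical stability
    return e / e.sum()          # normalizing makes the values sum to 1

scores = np.array([2.0, 0.5, -1.0, 0.1])   # one score per vocabulary word
probs = softmax(scores)
print(probs, probs.sum())                  # a valid probability distribution
```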
+ +0:21:36.316 --> 0:21:44.072 +And then for training the models that will +be fine, so we have to use architecture now. + +0:21:44.264 --> 0:21:48.491 +We need to minimize the arrow. + +0:21:48.491 --> 0:21:53.264 +Are we doing it taking the output? + +0:21:53.264 --> 0:21:58.174 +We are comparing it to our targets. + +0:21:58.298 --> 0:22:03.830 +So one important thing is by training them. + +0:22:03.830 --> 0:22:07.603 +How can we measure the error? + +0:22:07.603 --> 0:22:12.758 +So what is if we are training the ideas? + +0:22:13.033 --> 0:22:15.163 +And how well we are measuring. + +0:22:15.163 --> 0:22:19.768 +It is in natural language processing, typically +the cross entropy. + +0:22:19.960 --> 0:22:35.575 +And that means we are comparing the target +with the output. + +0:22:35.335 --> 0:22:44.430 +It gets optimized and you're seeing that this, +of course, makes it again very nice and easy + +0:22:44.430 --> 0:22:49.868 +because our target is again a one-hour representation. + +0:22:50.110 --> 0:23:00.116 +So all of these are always zero, and what +we are then doing is we are taking the one. + +0:23:00.100 --> 0:23:04.615 +And we only need to multiply the one with +the logarithm here, and that is all the feedback + +0:23:04.615 --> 0:23:05.955 +signal we are taking here. + +0:23:06.946 --> 0:23:13.885 +Of course, this is not always influenced by +all the others. + +0:23:13.885 --> 0:23:17.933 +Why is this influenced by all the. + +0:23:24.304 --> 0:23:34.382 +Have the activation function, which is the +current activation divided by some of the others. + +0:23:34.354 --> 0:23:45.924 +Otherwise it could easily just increase this +volume and ignore the others, but if you increase + +0:23:45.924 --> 0:23:49.090 +one value all the others. + +0:23:51.351 --> 0:23:59.912 +Then we can do with neometrics one very nice +and easy type of training that is done in all + +0:23:59.912 --> 0:24:07.721 +the neometrics where we are now calculating +our error and especially the gradient. + +0:24:07.707 --> 0:24:11.640 +So in which direction does the error show? + +0:24:11.640 --> 0:24:18.682 +And then if we want to go to a smaller arrow +that's what we want to achieve. + +0:24:18.682 --> 0:24:26.638 +We are taking the inverse direction of the +gradient and thereby trying to minimize our + +0:24:26.638 --> 0:24:27.278 +error. + +0:24:27.287 --> 0:24:31.041 +And we have to do that, of course, for all +the weights. + +0:24:31.041 --> 0:24:36.672 +And to calculate the error of all the weights, +we won't do the defectvagation here. + +0:24:36.672 --> 0:24:41.432 +But but what you can do is you can propagate +the arrow which measured. + +0:24:41.432 --> 0:24:46.393 +At the end you can propagate it back its basic +mass and basic derivation. + +0:24:46.706 --> 0:24:58.854 +For each way in your model measure how much +you contribute to the error and then change + +0:24:58.854 --> 0:25:01.339 +it in a way that. + +0:25:04.524 --> 0:25:11.625 +So to summarize what for at least machine +translation on your machine translation should + +0:25:11.625 --> 0:25:19.044 +remember, you know, to understand on this problem +is that this is how a multilayer first the + +0:25:19.044 --> 0:25:20.640 +problem looks like. + +0:25:20.580 --> 0:25:28.251 +There are fully two layers and no connections. + +0:25:28.108 --> 0:25:29.759 +Across layers. + +0:25:29.829 --> 0:25:35.153 +And what they're doing is always just a waited +sum here and then in activation production. 
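Because the target is a one-hot vector, the cross-entropy loss described above reduces to the negative log probability of the correct word. A small made-up example, together with the standard fact that the gradient of softmax plus cross-entropy with respect to the pre-softmax scores is simply the predicted distribution minus the target:

```python
import numpy as np

probs = np.array([0.7, 0.2, 0.05, 0.05])   # softmax output over a toy vocabulary
target = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot target: the correct word is index 1

# full cross-entropy: -sum(target * log(probs)) ...
loss = -np.sum(target * np.log(probs))
# ... but with a one-hot target only the term of the correct word survives
assert np.isclose(loss, -np.log(probs[1]))

# feedback signal that is backpropagated through the rest of the network
grad_scores = probs - target
print(loss, grad_scores)
```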
+ +0:25:35.415 --> 0:25:38.792 +And in order to train you have this forward +and backward pass. + +0:25:39.039 --> 0:25:41.384 +So We Put in Here. + +0:25:41.281 --> 0:25:41.895 +Inputs. + +0:25:41.895 --> 0:25:45.347 +We have some random values at the beginning. + +0:25:45.347 --> 0:25:47.418 +Then calculate the output. + +0:25:47.418 --> 0:25:54.246 +We are measuring how our error is propagating +the arrow back and then changing our model + +0:25:54.246 --> 0:25:57.928 +in a way that we hopefully get a smaller arrow. + +0:25:57.928 --> 0:25:59.616 +And then that is how. + +0:26:01.962 --> 0:26:12.893 +So before we're coming into our neural networks +language models, how can we use this type of + +0:26:12.893 --> 0:26:17.595 +neural network to do language modeling? + +0:26:23.103 --> 0:26:33.157 +So how can we use them in natural language +processing, especially machine translation? + +0:26:33.157 --> 0:26:41.799 +The first idea of using them was to estimate: +So we have seen that the output can be monitored + +0:26:41.799 --> 0:26:42.599 +here as well. + +0:26:43.603 --> 0:26:50.311 +A probability distribution and if we have +a full vocabulary we could mainly hear estimating + +0:26:50.311 --> 0:26:56.727 +how probable each next word is and then use +that in our language model fashion as we've + +0:26:56.727 --> 0:26:58.112 +done it last time. + +0:26:58.112 --> 0:27:03.215 +We got the probability of a full sentence +as a product of individual. + +0:27:04.544 --> 0:27:12.820 +And: That was done in the ninety seven years +and it's very easy to integrate it into this + +0:27:12.820 --> 0:27:14.545 +lot of the year model. + +0:27:14.545 --> 0:27:19.570 +So we have said that this is how the locker +here model looks like. + +0:27:19.570 --> 0:27:25.119 +So we are searching the best translation which +minimizes each waste time. + +0:27:25.125 --> 0:27:26.362 +The Future About You. + +0:27:26.646 --> 0:27:31.647 +We have that with minimum error rate training +if you can remember where we search for the + +0:27:31.647 --> 0:27:32.147 +optimal. + +0:27:32.512 --> 0:27:40.422 +The language model and many others, and we +can just add here a neuromodel, have a knock + +0:27:40.422 --> 0:27:41.591 +of features. + +0:27:41.861 --> 0:27:45.761 +So that is quite easy as said. + +0:27:45.761 --> 0:27:53.183 +That was how statistical machine translation +was improved. + +0:27:53.183 --> 0:27:57.082 +You just add one more feature. + +0:27:58.798 --> 0:28:07.631 +So how can we model the language modeling +with a network? + +0:28:07.631 --> 0:28:16.008 +So what we have to do is model the probability +of the. + +0:28:16.656 --> 0:28:25.047 +The problem in general in the head is that +mostly we haven't seen long sequences. + +0:28:25.085 --> 0:28:35.650 +Mostly we have to beg off to very short sequences +and we are working on this discrete space where + +0:28:35.650 --> 0:28:36.944 +similarity. + +0:28:37.337 --> 0:28:50.163 +So the idea is if we have now a real network, +we can make words into continuous representation. + +0:28:51.091 --> 0:29:00.480 +And the structure then looks like this, so +this is a basic still feed forward neural network. + +0:29:01.361 --> 0:29:10.645 +We are doing this at perximation again, so +we are not putting in all previous words, but + +0:29:10.645 --> 0:29:11.375 +it is. + +0:29:11.691 --> 0:29:25.856 +This is done because we said that in the real +network we can have only a fixed type of input. 
+ +0:29:25.945 --> 0:29:31.886 +You can only do a fixed step and then we'll +be doing that exactly in minus one. + +0:29:33.593 --> 0:29:39.536 +So here you are, for example, three words +and three different words. + +0:29:39.536 --> 0:29:50.704 +One and all the others are: And then we're +having the first layer of the neural network, + +0:29:50.704 --> 0:29:56.230 +which like you learns is word embedding. + +0:29:57.437 --> 0:30:04.976 +There is one thing which is maybe special +compared to the standard neural member. + +0:30:05.345 --> 0:30:11.918 +So the representation of this word we want +to learn first of all position independence. + +0:30:11.918 --> 0:30:19.013 +So we just want to learn what is the general +meaning of the word independent of its neighbors. + +0:30:19.299 --> 0:30:26.239 +And therefore the representation you get here +should be the same as if in the second position. + +0:30:27.247 --> 0:30:36.865 +The nice thing you can achieve is that this +weights which you're using here you're reusing + +0:30:36.865 --> 0:30:41.727 +here and reusing here so we are forcing them. + +0:30:42.322 --> 0:30:48.360 +You then learn your word embedding, which +is contextual, independent, so it's the same + +0:30:48.360 --> 0:30:49.678 +for each position. + +0:30:49.909 --> 0:31:03.482 +So that's the idea that you want to learn +the representation first of and you don't want + +0:31:03.482 --> 0:31:07.599 +to really use the context. + +0:31:08.348 --> 0:31:13.797 +That of course might have a different meaning +depending on where it stands, but we'll learn + +0:31:13.797 --> 0:31:14.153 +that. + +0:31:14.514 --> 0:31:20.386 +So first we are learning here representational +words, which is just the representation. + +0:31:20.760 --> 0:31:32.498 +Normally we said in neurons all input neurons +here are connected to all here, but we're reducing + +0:31:32.498 --> 0:31:37.338 +the complexity by saying these neurons. + +0:31:37.857 --> 0:31:47.912 +Then we have a lot denser representation that +is our three word embedded in here, and now + +0:31:47.912 --> 0:31:57.408 +we are learning this interaction between words, +a direction between words not based. + +0:31:57.677 --> 0:32:08.051 +So we have at least one connected layer here, +which takes a three embedding input and then + +0:32:08.051 --> 0:32:14.208 +learns a new embedding which now represents +the full. + +0:32:15.535 --> 0:32:16.551 +Layers. + +0:32:16.551 --> 0:32:27.854 +It is the output layer which now and then +again the probability distribution of all the. + +0:32:28.168 --> 0:32:48.612 +So here is your target prediction. + +0:32:48.688 --> 0:32:56.361 +The nice thing is that you learn everything +together, so you don't have to teach them what + +0:32:56.361 --> 0:32:58.722 +a good word representation. + +0:32:59.079 --> 0:33:08.306 +Training the whole number together, so it +learns what a good representation for a word + +0:33:08.306 --> 0:33:13.079 +you get in order to perform your final task. + +0:33:15.956 --> 0:33:19.190 +Yeah, that is the main idea. + +0:33:20.660 --> 0:33:32.731 +This is now a days often referred to as one +way of self supervise learning. + +0:33:33.053 --> 0:33:37.120 +The output is the next word and the input +is the previous word. + +0:33:37.377 --> 0:33:46.783 +But it's not really that we created labels, +but we artificially created a task out of unlabeled. 
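Here is a minimal NumPy sketch of the feed-forward n-gram language model just described; the vocabulary, sizes and weights are made up. The same embedding matrix is reused for every context position, one hidden layer combines the n-1 context embeddings, a softmax over the vocabulary predicts the next word, and the training pairs are created automatically from plain text.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "i", "go", "home", "to", "walk", "</s>"]   # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}
V, E, H, N = len(vocab), 8, 16, 3          # vocab, embedding, hidden size, n-gram order

emb = rng.normal(0, 0.1, (V, E))           # shared, position-independent word embeddings
W1  = rng.normal(0, 0.1, ((N - 1) * E, H)) # combines the N-1 context embeddings
b1  = np.zeros(H)
W2  = rng.normal(0, 0.1, (H, V))           # output layer over the full vocabulary
b2  = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_ids):
    x = np.concatenate([emb[i] for i in context_ids])   # look up each context word
    h = np.tanh(x @ W1 + b1)
    return softmax(h @ W2 + b2)                         # P(next word | N-1 previous words)

# training pairs come for free from plain text: predict each word from its history
sentence = ["<s>", "i", "go", "home", "</s>"]
ids = [word2id[w] for w in sentence]
pairs = [(ids[i - (N - 1):i], ids[i]) for i in range(N - 1, len(ids))]
for ctx, tgt in pairs:
    p = next_word_probs(ctx)
    print([vocab[c] for c in ctx], "->", vocab[tgt], round(float(p[tgt]), 3))
```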
+ +0:33:46.806 --> 0:33:59.434 +We just had pure text, and then we created +the telescopes by predicting the next word, + +0:33:59.434 --> 0:34:18.797 +which is: Say we have like two sentences like +go home and the second one is go to prepare. + +0:34:18.858 --> 0:34:30.135 +And then we have to predict the next series +and my questions in the labels for the album. + +0:34:31.411 --> 0:34:42.752 +We model this as one vector with like probability +for possible weights starting again. + +0:34:44.044 --> 0:34:57.792 +Multiple examples, so then you would twice +train one to predict KRT, one to predict home, + +0:34:57.792 --> 0:35:02.374 +and then of course the easel. + +0:35:04.564 --> 0:35:13.568 +Is a very good point, so you are not aggregating +examples beforehand, but you are taking each. + +0:35:19.259 --> 0:35:37.204 +So when you do it simultaneously learn the +projection layer and the endgram for abilities + +0:35:37.204 --> 0:35:39.198 +and then. + +0:35:39.499 --> 0:35:47.684 +And later analyze it that these representations +are very powerful. + +0:35:47.684 --> 0:35:56.358 +The task is just a very important task to +model what is the next word. + +0:35:56.816 --> 0:35:59.842 +Is motivated by nowadays. + +0:35:59.842 --> 0:36:10.666 +In order to get the meaning of the word you +have to look at its companies where the context. + +0:36:10.790 --> 0:36:16.048 +If you read texts in days of word which you +have never seen, you often can still estimate + +0:36:16.048 --> 0:36:21.130 +the meaning of this word because you do not +know how it is used, and this is typically + +0:36:21.130 --> 0:36:22.240 +used as a city or. + +0:36:22.602 --> 0:36:25.865 +Just imagine you read a text about some city. + +0:36:25.865 --> 0:36:32.037 +Even if you've never seen the city before, +you often know from the context of how it's + +0:36:32.037 --> 0:36:32.463 +used. + +0:36:34.094 --> 0:36:42.483 +So what is now the big advantage of using +neural neckworks? + +0:36:42.483 --> 0:36:51.851 +So just imagine we have to estimate that I +bought my first iPhone. + +0:36:52.052 --> 0:36:56.608 +So you have to monitor the probability of +ad hitting them. + +0:36:56.608 --> 0:37:00.237 +Now imagine iPhone, which you have never seen. + +0:37:00.600 --> 0:37:11.588 +So all the techniques we had last time at +the end, if you haven't seen iPhone you will + +0:37:11.588 --> 0:37:14.240 +always fall back to. + +0:37:15.055 --> 0:37:26.230 +You have no idea how to deal that you won't +have seen the diagram, the trigram, and all + +0:37:26.230 --> 0:37:27.754 +the others. + +0:37:28.588 --> 0:37:43.441 +If you're having this type of model, what +does it do if you have my first and then something? + +0:37:43.483 --> 0:37:50.270 +Maybe this representation is really messed +up because it's mainly on a cavalry word. + +0:37:50.730 --> 0:37:57.793 +However, you have still these two information +that two words before was first and therefore. + +0:37:58.098 --> 0:38:06.954 +So you have a lot of information in order +to estimate how good it is. + +0:38:06.954 --> 0:38:13.279 +There could be more information if you know +that. + +0:38:13.593 --> 0:38:25.168 +So all this type of modeling we can do that +we couldn't do beforehand because we always + +0:38:25.168 --> 0:38:25.957 +have. + +0:38:27.027 --> 0:38:40.466 +Good point, so typically you would have one +token for a vocabulary so that you could, for + +0:38:40.466 --> 0:38:45.857 +example: All you're doing by parent coding +when you have a fixed thing. 
+ +0:38:46.226 --> 0:38:49.437 +Oh yeah, you have to do something like that +that that that's true. + +0:38:50.050 --> 0:38:55.420 +So yeah, auto vocabulary are by thanking where +you don't have other words written. + +0:38:55.735 --> 0:39:06.295 +But then, of course, you might be getting +very long previous things, and your sequence + +0:39:06.295 --> 0:39:11.272 +length gets very long for unknown words. + +0:39:17.357 --> 0:39:20.067 +Any more questions to the basic stable. + +0:39:23.783 --> 0:39:36.719 +For this model, what we then want to continue +is looking a bit into how complex or how we + +0:39:36.719 --> 0:39:39.162 +can make things. + +0:39:40.580 --> 0:39:49.477 +Because at the beginning there was definitely +a major challenge, it's still not that easy, + +0:39:49.477 --> 0:39:58.275 +and I mean our likeers followed the talk about +their environmental fingerprint and so on. + +0:39:58.478 --> 0:40:05.700 +So this calculation is not really heavy, and +if you build systems yourselves you have to + +0:40:05.700 --> 0:40:06.187 +wait. + +0:40:06.466 --> 0:40:14.683 +So it's good to know a bit about how complex +things are in order to do a good or efficient + +0:40:14.683 --> 0:40:15.405 +affair. + +0:40:15.915 --> 0:40:24.211 +So one thing where most of the calculation +really happens is if you're doing it in a bad + +0:40:24.211 --> 0:40:24.677 +way. + +0:40:25.185 --> 0:40:33.523 +So in generally all these layers we are talking +about networks and zones fancy. + +0:40:33.523 --> 0:40:46.363 +In the end it is: So what you have to do in +order to calculate here, for example, these + +0:40:46.363 --> 0:40:52.333 +activations: So make it simple a bit. + +0:40:52.333 --> 0:41:06.636 +Let's see where outputs and you just do metric +multiplication between your weight matrix and + +0:41:06.636 --> 0:41:08.482 +your input. + +0:41:08.969 --> 0:41:20.992 +So that is why computers are so powerful for +neural networks because they are very good + +0:41:20.992 --> 0:41:22.358 +in doing. + +0:41:22.782 --> 0:41:28.013 +However, for some type for the embedding layer +this is really very inefficient. + +0:41:28.208 --> 0:41:39.652 +So because remember we're having this one +art encoding in this input, it's always like + +0:41:39.652 --> 0:41:42.940 +one and everything else. + +0:41:42.940 --> 0:41:47.018 +It's zero if we're doing this. + +0:41:47.387 --> 0:41:55.552 +So therefore you can do at least the forward +pass a lot more efficient if you don't really + +0:41:55.552 --> 0:42:01.833 +do this calculation, but you can select the +one color where there is. + +0:42:01.833 --> 0:42:07.216 +Therefore, you also see this is called your +word embedding. + +0:42:08.348 --> 0:42:19.542 +So the weight matrix of the embedding layer +is just that in each color you have the embedding + +0:42:19.542 --> 0:42:20.018 +of. + +0:42:20.580 --> 0:42:30.983 +So this is like how your initial weights look +like and how you can interpret or understand. + +0:42:32.692 --> 0:42:39.509 +And this is already relatively important because +remember this is a huge dimensional thing. + +0:42:39.509 --> 0:42:46.104 +So typically here we have the number of words +is ten thousand or so, so this is the word + +0:42:46.104 --> 0:42:51.365 +embeddings metrics, typically the most expensive +to calculate metrics. + +0:42:51.451 --> 0:42:59.741 +Because it's the largest one there, we have +ten thousand entries, while for the hours we + +0:42:59.741 --> 0:43:00.393 +maybe. 
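A small sketch of why the embedding layer is cheap in the forward pass: with a one-hot input, the full matrix multiplication gives exactly the same result as selecting one row of the embedding matrix, so a table lookup replaces the multiplication (sizes are made up).

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = 10000, 300                  # vocabulary size, embedding size
emb = rng.normal(size=(V, E))      # embedding matrix: one row per word

word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

slow = one_hot @ emb               # full multiplication: touches all V * E weights
fast = emb[word_id]                # lookup: just read row 42
assert np.allclose(slow, fast)
```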
+ +0:43:00.660 --> 0:43:03.408 +So therefore the addition to a little bit +more to make this. + +0:43:06.206 --> 0:43:10.538 +Then you can go where else the calculations +are very difficult. + +0:43:10.830 --> 0:43:20.389 +So here we then have our network, so we have +the word embeddings. + +0:43:20.389 --> 0:43:29.514 +We have one hidden there, and then you can +look how difficult. + +0:43:30.270 --> 0:43:38.746 +Could save a lot of calculation by not really +calculating the selection because that is always. + +0:43:40.600 --> 0:43:46.096 +The number of calculations you have to do +here is so. + +0:43:46.096 --> 0:43:51.693 +The length of this layer is minus one type +projection. + +0:43:52.993 --> 0:43:56.321 +That is a hint size. + +0:43:56.321 --> 0:44:10.268 +So the first step of calculation for this +metrics modification is how much calculation. + +0:44:10.730 --> 0:44:18.806 +Then you have to do some activation function +and then you have to do again the calculation. + +0:44:19.339 --> 0:44:27.994 +Here we need the vocabulary size because we +need to calculate the probability for each + +0:44:27.994 --> 0:44:29.088 +next word. + +0:44:29.889 --> 0:44:43.155 +And if you look at these numbers, so if you +have a projector size of and a vocabulary size + +0:44:43.155 --> 0:44:53.876 +of, you see: And that is why there has been +especially at the beginning some ideas how + +0:44:53.876 --> 0:44:55.589 +we can reduce. + +0:44:55.956 --> 0:45:01.942 +And if we really need to calculate all of +our capabilities, or if we can calculate only + +0:45:01.942 --> 0:45:02.350 +some. + +0:45:02.582 --> 0:45:10.871 +And there again the one important thing to +think about is for what will use my language + +0:45:10.871 --> 0:45:11.342 +mom. + +0:45:11.342 --> 0:45:19.630 +I can use it for generations and that's what +we will see next week in an achiever which + +0:45:19.630 --> 0:45:22.456 +really is guiding the search. + +0:45:23.123 --> 0:45:30.899 +If it just uses a feature, we do not want +to use it for generations, but we want to only + +0:45:30.899 --> 0:45:32.559 +know how probable. + +0:45:32.953 --> 0:45:39.325 +There we might not be really interested in +all the probabilities, but we already know + +0:45:39.325 --> 0:45:46.217 +we just want to know the probability of this +one word, and then it might be very inefficient + +0:45:46.217 --> 0:45:49.403 +to really calculate all the probabilities. + +0:45:51.231 --> 0:45:52.919 +And how can you do that so? + +0:45:52.919 --> 0:45:56.296 +Initially, for example, the people look into +shortness. + +0:45:56.756 --> 0:46:02.276 +So this calculation at the end is really very +expensive. + +0:46:02.276 --> 0:46:05.762 +So can we make that more efficient. + +0:46:05.945 --> 0:46:17.375 +And most words occur very rarely, and maybe +we don't need anger, and so there we may want + +0:46:17.375 --> 0:46:18.645 +to focus. + +0:46:19.019 --> 0:46:29.437 +And so they use the smaller vocabulary, which +is maybe. + +0:46:29.437 --> 0:46:34.646 +This layer is used from to. + +0:46:34.646 --> 0:46:37.623 +Then you merge. + +0:46:37.937 --> 0:46:45.162 +So you're taking if the word is in the shortest, +so in the two thousand most frequent words. + +0:46:45.825 --> 0:46:58.299 +Of this short word by some normalization here, +and otherwise you take a back of probability + +0:46:58.299 --> 0:46:59.655 +from the. + +0:47:00.020 --> 0:47:04.933 +It will not be as good, but the idea is okay. 
+ +0:47:04.933 --> 0:47:14.013 +Then we don't have to calculate all these +probabilities here at the end, but we only + +0:47:14.013 --> 0:47:16.042 +have to calculate. + +0:47:19.599 --> 0:47:32.097 +With some type of cost because it means we +don't model the probability of the infrequent + +0:47:32.097 --> 0:47:39.399 +words, and maybe it's even very important to +model. + +0:47:39.299 --> 0:47:46.671 +And one idea is to do what is reported as +so so structured out there. + +0:47:46.606 --> 0:47:49.571 +Network language models you see some years +ago. + +0:47:49.571 --> 0:47:53.154 +People were very creative and giving names +to new models. + +0:47:53.813 --> 0:48:00.341 +And there the idea is that we model the output +vocabulary as a clustered treat. + +0:48:00.680 --> 0:48:06.919 +So you don't need to model all of our bodies +directly, but you are putting words into a + +0:48:06.919 --> 0:48:08.479 +sequence of clusters. + +0:48:08.969 --> 0:48:15.019 +So maybe a very intriguant world is first +in cluster three and then in cluster three. + +0:48:15.019 --> 0:48:21.211 +You have subclusters again and there is subclusters +seven and subclusters and there is. + +0:48:21.541 --> 0:48:40.134 +And this is the path, so that is what was +the man in the past. + +0:48:40.340 --> 0:48:52.080 +And then you can calculate the probability +of the word again just by the product of the + +0:48:52.080 --> 0:48:55.548 +first class of the world. + +0:48:57.617 --> 0:49:07.789 +That it may be more clear where you have this +architecture, so this is all the same. + +0:49:07.789 --> 0:49:13.773 +But then you first predict here which main +class. + +0:49:14.154 --> 0:49:24.226 +Then you go to the appropriate subclass, then +you calculate the probability of the subclass + +0:49:24.226 --> 0:49:26.415 +and maybe the cell. + +0:49:27.687 --> 0:49:35.419 +Anybody have an idea why this is more efficient +or if you do it first, it looks a lot more. + +0:49:42.242 --> 0:49:51.788 +You have to do less calculations, so maybe +if you do it here you have to calculate the + +0:49:51.788 --> 0:49:59.468 +element there, but you don't have to do all +the one hundred thousand. + +0:49:59.980 --> 0:50:06.115 +The probabilities in the set classes that +you're going through and not for all of them. + +0:50:06.386 --> 0:50:18.067 +Therefore, it's more efficient if you don't +need all output proficient because you have + +0:50:18.067 --> 0:50:21.253 +to calculate the class. + +0:50:21.501 --> 0:50:28.936 +So it's only more efficient and scenarios +where you really need to use a language model + +0:50:28.936 --> 0:50:30.034 +to evaluate. + +0:50:35.275 --> 0:50:52.456 +How this works was that you can train first +in your language one on the short list. + +0:50:52.872 --> 0:51:03.547 +But on the input layer you have your full +vocabulary because at the input we saw that + +0:51:03.547 --> 0:51:06.650 +this is not complicated. + +0:51:06.906 --> 0:51:26.638 +And then you can cluster down all your words +here into classes and use that as your glasses. + +0:51:29.249 --> 0:51:34.148 +That is one idea of doing it. + +0:51:34.148 --> 0:51:44.928 +There is also a second idea of doing it, and +again we don't need. + +0:51:45.025 --> 0:51:53.401 +So sometimes it doesn't really need to be +a probability to evaluate. + +0:51:53.401 --> 0:51:56.557 +It's only important that. + +0:51:58.298 --> 0:52:04.908 +And: Here it's called self normalization what +people have done so. 
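Before the self-normalization trick that the lecture turns to next, here is a minimal sketch of the class-factored output layer just described; the class assignment, sizes and weights are all made up. The probability of a word is the probability of its class times the probability of the word within that class, so only the class scores plus the scores inside one class have to be computed instead of the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
H, V, C = 64, 10000, 100                 # hidden size, vocabulary size, number of classes
words_per_class = V // C                 # assume an even split for simplicity
word_class = np.arange(V) // words_per_class   # made-up word-to-class assignment

W_class = rng.normal(0, 0.1, (H, C))                   # predicts the class
W_word  = rng.normal(0, 0.1, (C, H, words_per_class))  # one small softmax per class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_prob(h, w):
    c = word_class[w]
    p_class = softmax(h @ W_class)[c]                   # C scores instead of V
    p_word  = softmax(h @ W_word[c])[w % words_per_class]
    return p_class * p_word                             # P(w|h) = P(c|h) * P(w|c,h)

h = rng.normal(size=H)
print(word_prob(h, 4242))    # only C + V/C scores computed instead of V
```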
+ +0:52:04.908 --> 0:52:11.562 +We have seen that the probability is in this +soft mechanism always to the input divided + +0:52:11.562 --> 0:52:18.216 +by our normalization, and the normalization +is a summary of the vocabulary to the power + +0:52:18.216 --> 0:52:19.274 +of the spell. + +0:52:19.759 --> 0:52:25.194 +So this is how we calculate the software. + +0:52:25.825 --> 0:52:41.179 +In self normalization of the idea, if this +would be zero then we don't need to calculate + +0:52:41.179 --> 0:52:42.214 +that. + +0:52:42.102 --> 0:52:54.272 +Will be zero, and then you don't even have +to calculate the normalization because it's. + +0:52:54.514 --> 0:53:08.653 +So how can we achieve that and then the nice +thing in your networks? + +0:53:09.009 --> 0:53:23.928 +And now we're just adding a second note with +some either permitted here. + +0:53:24.084 --> 0:53:29.551 +And the second lost just tells us he'll be +strained away. + +0:53:29.551 --> 0:53:31.625 +The locks at is zero. + +0:53:32.352 --> 0:53:38.614 +So then if it's nearly zero at the end we +don't need to calculate this and it's also + +0:53:38.614 --> 0:53:39.793 +very efficient. + +0:53:40.540 --> 0:53:49.498 +One important thing is this, of course, is +only in inference. + +0:53:49.498 --> 0:54:04.700 +During tests we don't need to calculate that +because: You can do a bit of a hyperparameter + +0:54:04.700 --> 0:54:14.851 +here where you do the waiting, so how good +should it be estimating the probabilities and + +0:54:14.851 --> 0:54:16.790 +how much effort? + +0:54:18.318 --> 0:54:28.577 +The only disadvantage is no speed up during +training. + +0:54:28.577 --> 0:54:43.843 +There are other ways of doing that, for example: +Englishman is in case you get it. + +0:54:44.344 --> 0:54:48.540 +Then we are coming very, very briefly like +just one idea. + +0:54:48.828 --> 0:54:53.058 +That there is more things on different types +of language models. + +0:54:53.058 --> 0:54:58.002 +We are having a very short view on restricted +person-based language models. + +0:54:58.298 --> 0:55:08.931 +Talk about recurrent neural networks for language +mines because they have the advantage that + +0:55:08.931 --> 0:55:17.391 +we can even further improve by not having a +continuous representation on. + +0:55:18.238 --> 0:55:23.845 +So there's different types of neural networks. + +0:55:23.845 --> 0:55:30.169 +These are these boxing machines and the interesting. + +0:55:30.330 --> 0:55:39.291 +They have these: And they define like an energy +function on the network, which can be in restricted + +0:55:39.291 --> 0:55:44.372 +balsam machines efficiently calculated in general +and restricted needs. + +0:55:44.372 --> 0:55:51.147 +You only have connection between the input +and the hidden layer, but you don't have connections + +0:55:51.147 --> 0:55:53.123 +in the input or within the. + +0:55:53.393 --> 0:56:00.194 +So you see here you don't have an input output, +you just have an input, and you calculate. + +0:56:00.460 --> 0:56:15.612 +Which of course nicely fits with the idea +we're having, so you can then use this for + +0:56:15.612 --> 0:56:19.177 +an N Gram language. + +0:56:19.259 --> 0:56:25.189 +Retaining the flexibility of the input by +this type of neon networks. + +0:56:26.406 --> 0:56:30.589 +And the advantage of this type of model was +there's. + +0:56:30.550 --> 0:56:37.520 +Very, very fast to integrate it, so that one +was the first one which was used during the + +0:56:37.520 --> 0:56:38.616 +coding model. 
+ +0:56:38.938 --> 0:56:45.454 +The engram language models were that they +were very good and gave performance. + +0:56:45.454 --> 0:56:50.072 +However, calculation still with all these +tricks takes. + +0:56:50.230 --> 0:56:58.214 +We have talked about embest lists so they +generated an embest list of the most probable + +0:56:58.214 --> 0:57:05.836 +outputs and then they took this and best list +scored each entry with a new network. + +0:57:06.146 --> 0:57:09.306 +A language model, and then only change the +order again. + +0:57:09.306 --> 0:57:10.887 +Select based on that which. + +0:57:11.231 --> 0:57:17.187 +The neighboring list is maybe only like hundred +entries. + +0:57:17.187 --> 0:57:21.786 +When decoding you look at several thousand. + +0:57:26.186 --> 0:57:35.196 +Let's look at the context so we have now seen +your language models. + +0:57:35.196 --> 0:57:43.676 +There is the big advantage we can use this +word similarity and. + +0:57:44.084 --> 0:57:52.266 +Remember for engram language ones is not always +minus one words because sometimes you have + +0:57:52.266 --> 0:57:59.909 +to back off or interpolation to lower engrams +and you don't know the previous words. + +0:58:00.760 --> 0:58:04.742 +And however in neural models we always have +all of this importance. + +0:58:04.742 --> 0:58:05.504 +Can some of. + +0:58:07.147 --> 0:58:20.288 +The disadvantage is that you are still limited +in your context, and if you remember the sentence + +0:58:20.288 --> 0:58:22.998 +from last lecture,. + +0:58:22.882 --> 0:58:28.328 +Sometimes you need more context and there +is unlimited context that you might need and + +0:58:28.328 --> 0:58:34.086 +you can always create sentences where you may +need this five context in order to put a good + +0:58:34.086 --> 0:58:34.837 +estimation. + +0:58:35.315 --> 0:58:44.956 +Can also do it different in order to understand +that it makes sense to view language. + +0:58:45.445 --> 0:58:59.510 +So secret labeling tasks are a very common +type of task in language processing where you + +0:58:59.510 --> 0:59:03.461 +have the input sequence. + +0:59:03.323 --> 0:59:05.976 +So you have one output for each input. + +0:59:05.976 --> 0:59:12.371 +Machine translation is not a secret labeling +cast because the number of inputs and the number + +0:59:12.371 --> 0:59:14.072 +of outputs is different. + +0:59:14.072 --> 0:59:20.598 +So you put in a string German which has five +words and the output can be: See, for example, + +0:59:20.598 --> 0:59:24.078 +you always have the same number and the same +number of offices. + +0:59:24.944 --> 0:59:39.779 +And you can more language waddling as that, +and you just say the label for each word is + +0:59:39.779 --> 0:59:43.151 +always a next word. + +0:59:45.705 --> 0:59:50.312 +This is the more generous you can think of +it. + +0:59:50.312 --> 0:59:56.194 +For example, Paddle Speech Taking named Entity +Recognition. + +0:59:58.938 --> 1:00:08.476 +And if you look at now, this output token +and generally sequenced labeling can depend + +1:00:08.476 --> 1:00:26.322 +on: The input tokens are the same so we can +easily model it and they only depend on the + +1:00:26.322 --> 1:00:29.064 +input tokens. + +1:00:31.011 --> 1:00:42.306 +But we can always look at one specific type +of sequence labeling, unidirectional sequence + +1:00:42.306 --> 1:00:44.189 +labeling type. + +1:00:44.584 --> 1:01:00.855 +The probability of the next word only depends +on the previous words that we are having here. 
+ +1:01:01.321 --> 1:01:05.998 +That's also not completely true in language. + +1:01:05.998 --> 1:01:14.418 +Well, the back context might also be helpful +by direction of the model's Google. + +1:01:14.654 --> 1:01:23.039 +We will always admire the probability of the +word given on its history. + +1:01:23.623 --> 1:01:30.562 +And currently there is approximation and sequence +labeling that we have this windowing approach. + +1:01:30.951 --> 1:01:43.016 +So in order to predict this type of word we +always look at the previous three words. + +1:01:43.016 --> 1:01:48.410 +This is this type of windowing model. + +1:01:49.389 --> 1:01:54.780 +If you're into neural networks you recognize +this type of structure. + +1:01:54.780 --> 1:01:57.515 +Also, the typical neural networks. + +1:01:58.938 --> 1:02:11.050 +Yes, yes, so like engram models you can, at +least in some way, prepare for that type of + +1:02:11.050 --> 1:02:12.289 +context. + +1:02:14.334 --> 1:02:23.321 +Are also other types of neonamic structures +which we can use for sequins lately and which + +1:02:23.321 --> 1:02:30.710 +might help us where we don't have this type +of fixed size representation. + +1:02:32.812 --> 1:02:34.678 +That we can do so. + +1:02:34.678 --> 1:02:39.391 +The idea is in recurrent new networks traction. + +1:02:39.391 --> 1:02:43.221 +We are saving complete history in one. + +1:02:43.623 --> 1:02:56.946 +So again we have to do this fixed size representation +because the neural networks always need a habit. + +1:02:57.157 --> 1:03:09.028 +And then the network should look like that, +so we start with an initial value for our storage. + +1:03:09.028 --> 1:03:15.900 +We are giving our first input and calculating +the new. + +1:03:16.196 --> 1:03:35.895 +So again in your network with two types of +inputs: Then you can apply it to the next type + +1:03:35.895 --> 1:03:41.581 +of input and you're again having this. + +1:03:41.581 --> 1:03:46.391 +You're taking this hidden state. + +1:03:47.367 --> 1:03:53.306 +Nice thing is now that you can do now step +by step by step, so all the way over. + +1:03:55.495 --> 1:04:06.131 +The nice thing we are having here now is that +now we are having context information from + +1:04:06.131 --> 1:04:07.206 +all the. + +1:04:07.607 --> 1:04:14.181 +So if you're looking like based on which words +do you, you calculate the probability of varying. + +1:04:14.554 --> 1:04:20.090 +It depends on this part. + +1:04:20.090 --> 1:04:33.154 +It depends on and this hidden state was influenced +by two. + +1:04:33.473 --> 1:04:38.259 +So now we're having something new. + +1:04:38.259 --> 1:04:46.463 +We can model like the word probability not +only on a fixed. + +1:04:46.906 --> 1:04:53.565 +Because the hidden states we are having here +in our Oregon are influenced by all the trivia. + +1:04:56.296 --> 1:05:02.578 +So how is there to be Singapore? + +1:05:02.578 --> 1:05:16.286 +But then we have the initial idea about this +P of given on the history. + +1:05:16.736 --> 1:05:25.300 +So do not need to do any clustering here, +and you also see how things are put together + +1:05:25.300 --> 1:05:26.284 +in order. + +1:05:29.489 --> 1:05:43.449 +The green box this night since we are starting +from the left to the right. + +1:05:44.524 --> 1:05:51.483 +Voices: Yes, that's right, so there are clusters, +and here is also sometimes clustering happens. 
+ +1:05:51.871 --> 1:05:58.687 +The small difference does matter again, so +if you have now a lot of different histories, + +1:05:58.687 --> 1:06:01.674 +the similarity which you have in here. + +1:06:01.674 --> 1:06:08.260 +If two of the histories are very similar, +these representations will be the same, and + +1:06:08.260 --> 1:06:10.787 +then you're treating them again. + +1:06:11.071 --> 1:06:15.789 +Because in order to do the final restriction +you only do a good base on the green box. + +1:06:16.156 --> 1:06:28.541 +So you are now still learning some type of +clustering in there, but you are learning it + +1:06:28.541 --> 1:06:30.230 +implicitly. + +1:06:30.570 --> 1:06:38.200 +The only restriction you're giving is you +have to stall everything that is important + +1:06:38.200 --> 1:06:39.008 +in this. + +1:06:39.359 --> 1:06:54.961 +So it's a different type of limitation, so +you calculate the probability based on the + +1:06:54.961 --> 1:06:57.138 +last words. + +1:06:57.437 --> 1:07:04.430 +And that is how you still need to somehow +cluster things together in order to do efficiently. + +1:07:04.430 --> 1:07:09.563 +Of course, you need to do some type of clustering +because otherwise. + +1:07:09.970 --> 1:07:18.865 +But this is where things get merged together +in this type of hidden representation. + +1:07:18.865 --> 1:07:27.973 +So here the probability of the word first +only depends on this hidden representation. + +1:07:28.288 --> 1:07:33.104 +On the previous words, but they are some other +bottleneck in order to make a good estimation. + +1:07:34.474 --> 1:07:41.231 +So the idea is that we can store all our history +into or into one lecture. + +1:07:41.581 --> 1:07:44.812 +Which is the one that makes it more strong. + +1:07:44.812 --> 1:07:51.275 +Next we come to problems that of course at +some point it might be difficult if you have + +1:07:51.275 --> 1:07:57.811 +very long sequences and you always write all +the information you have on this one block. + +1:07:58.398 --> 1:08:02.233 +Then maybe things get overwritten or you cannot +store everything in there. + +1:08:02.662 --> 1:08:04.514 +So,. + +1:08:04.184 --> 1:08:09.569 +Therefore, yet for short things like single +sentences that works well, but especially if + +1:08:09.569 --> 1:08:15.197 +you think of other tasks and like symbolizations +with our document based on T where you need + +1:08:15.197 --> 1:08:20.582 +to consider the full document, these things +got got a bit more more more complicated and + +1:08:20.582 --> 1:08:23.063 +will learn another type of architecture. + +1:08:24.464 --> 1:08:30.462 +In order to understand these neighbors, it +is good to have all the bus use always. + +1:08:30.710 --> 1:08:33.998 +So this is the unrolled view. + +1:08:33.998 --> 1:08:43.753 +Somewhere you're over the type or in language +over the words you're unrolling a network. + +1:08:44.024 --> 1:08:52.096 +Here is the article and here is the network +which is connected by itself and that is recurrent. + +1:08:56.176 --> 1:09:04.982 +There is one challenge in this networks and +training. + +1:09:04.982 --> 1:09:11.994 +We can train them first of all as forward. + +1:09:12.272 --> 1:09:19.397 +So we don't really know how to train them, +but if you unroll them like this is a feet + +1:09:19.397 --> 1:09:20.142 +forward. + +1:09:20.540 --> 1:09:38.063 +Is exactly the same, so you can measure your +arrows here and be back to your arrows. 
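A minimal NumPy sketch, with toy sizes, of the recurrent language model described above: one hidden state is updated word by word and carries the whole history, so the prediction is no longer limited to a fixed window, and unrolling this loop over the sequence gives the feed-forward structure that can be trained with ordinary backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 7, 8, 16                        # toy vocabulary, embedding and hidden size
emb   = rng.normal(0, 0.1, (V, E))
W_in  = rng.normal(0, 0.1, (E, H))        # current word -> hidden
W_hh  = rng.normal(0, 0.1, (H, H))        # previous hidden -> hidden (the recurrence)
W_out = rng.normal(0, 0.1, (H, V))        # hidden -> scores over the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    h = np.zeros(H)                       # initial hidden state
    probs = []
    for w in word_ids:                    # unrolled over the sequence
        h = np.tanh(emb[w] @ W_in + h @ W_hh)   # new state sees the word and the history
        probs.append(softmax(h @ W_out))        # P(next word | all previous words)
    return probs

print(rnn_lm([0, 1, 2, 3])[-1])           # distribution after seeing four words
```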
+1:09:38.378 --> 1:09:45.646
+If you unroll it, it is a feed-forward network
+and you can train it the same way.
+
+1:09:46.106 --> 1:09:57.606
+The only important thing is again, of course,
+that it is unrolled differently for different inputs.
+
+1:09:57.837 --> 1:10:05.145
+But since the parameters are shared, it is
+somehow similar and you can train it.
+
+1:10:05.145 --> 1:10:08.800
+The training algorithm is very similar.
+
+1:10:10.310 --> 1:10:29.568
+One thing which makes things difficult is
+what is referred to as the vanishing gradient.
+
+1:10:29.809 --> 1:10:32.799
+That is a very important point in the motivation
+for using LSTMs.
+
+1:10:33.593 --> 1:10:44.604
+The influence here gets smaller and smaller,
+and the models are not really able to model that.
+
+1:10:44.804 --> 1:10:51.939
+Because the gradient gets smaller and smaller,
+the error propagated back to this one, the part
+it contributes to the error, is very small,
+and therefore you don't do any changes there
+anymore.
+
+1:11:00.020 --> 1:11:06.703
+And yeah, that is why standard RNNs are
+difficult to train.
+
+1:11:07.247 --> 1:11:11.462
+So when people are talking about RNNs nowadays,
+
+1:11:11.791 --> 1:11:23.333
+what we typically mean are LSTMs, or
+long short-term memories.
+
+1:11:23.333 --> 1:11:30.968
+You see they are by now quite old already.
+
+1:11:31.171 --> 1:11:39.019
+So there the motivation was the language modeling
+task.
+
+1:11:39.019 --> 1:11:44.784
+It is somehow about storing information longer.
+
+1:11:44.684 --> 1:11:51.556
+Because if you only look at the last words,
+it is often no longer clear whether this is a
+question or a normal sentence.
+
+1:11:53.013 --> 1:12:05.318
+So there you have these gating mechanisms
+in order to store things for a longer time
+in your hidden state.
+
+1:12:10.730 --> 1:12:20.162
+They are still used in quite a lot of works.
+
+1:12:21.541 --> 1:12:29.349
+Especially for machine translation, the standard
+now is to use transformer-based models, which
+we will learn about.
+
+1:12:30.690 --> 1:12:38.962
+But, for example, we will later have one lecture
+about efficiency.
+
+1:12:38.962 --> 1:12:42.830
+So how can we build very efficient models?
+
+1:12:42.882 --> 1:12:53.074
+And there, in the decoder, in parts of the networks,
+they are still used.
+
+1:12:53.473 --> 1:12:57.518
+So it is not that, yeah, RNNs are of no
+importance anymore.
+
+1:12:59.239 --> 1:13:08.956
+In order to make them strong, there are some
+more things which are helpful and should be mentioned:
+
+1:13:09.309 --> 1:13:19.683
+So one thing is, there is a nice trick to make
+these neural networks stronger and better.
+
+1:13:19.739 --> 1:13:21.523
+Of course it does not always work.
+
+1:13:21.523 --> 1:13:23.451
+You have to have enough training data.
+
+1:13:23.763 --> 1:13:28.959
+But in general, the easiest way of making your
+models bigger and stronger is just to increase
+your parameters.
+
+1:13:30.630 --> 1:13:43.236
+And you have seen that with the large language
+models, that is what they are always bragging about.
+
+1:13:43.903 --> 1:13:56.463
+This is one way, so the question is: how do
+you get more parameters?
+
+1:13:56.463 --> 1:14:01.265
+There are different ways of doing it.
+
+1:14:01.521 --> 1:14:10.029
+And the other thing is to make your networks
+deeper, so to have more layers in between.
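As a rough illustration of these two ways of adding parameters, wider recurrent layers or more stacked layers, here is a small sketch assuming PyTorch is available; the sizes are arbitrary and only serve to show how the parameter count grows.

```python
import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

small = nn.LSTM(input_size=256, hidden_size=256, num_layers=1)
wide  = nn.LSTM(input_size=256, hidden_size=1024, num_layers=1)  # wider hidden layer
deep  = nn.LSTM(input_size=256, hidden_size=256, num_layers=4)   # more stacked layers

print(n_params(small), n_params(wide), n_params(deep))
```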
+
+1:14:11.471 --> 1:14:13.827
+And thereby you can also get more parameters.
+
+1:14:14.614 --> 1:14:23.340
+There's one problem with this, with deeper networks, and it's
+very similar to what we just saw with RNNs.
+
+1:14:23.603 --> 1:14:34.253
+We have this problem of gradient flow: if it has to flow
+through many layers, the gradient gets very small.
+
+1:14:35.795 --> 1:14:42.704
+Exactly the same thing happens in deep LSTMs. If you take the
+gradient here at the top, which tells you what is right or
+wrong,
+
+1:14:52.612 --> 1:14:56.439
+with three layers it's no problem, but it is if you're going
+to ten, twenty or a hundred layers.
+
+1:14:57.797 --> 1:14:59.698
+So what people are typically doing
+
+1:15:00.060 --> 1:15:07.000
+is using what are called residual connections. That's a very
+helpful idea, and it's maybe surprising that it works so well.
+
+1:15:15.956 --> 1:15:20.309
+And the idea is that these layers
+
+1:15:20.320 --> 1:15:29.982
+in between should no longer calculate a completely new
+representation, but rather what to change about the current
+one.
+
+1:15:31.731 --> 1:15:37.588
+Therefore, in the end the output of a layer is always added to
+its input.
+
+1:15:38.318 --> 1:15:48.824
+The nice thing is that if you're doing backpropagation, the
+error can flow very directly back through this addition.
+
+1:15:49.209 --> 1:16:02.540
+Nowadays nearly every very deep architecture, not only RNNs,
+has these residual or highway connections.
+
+1:16:04.704 --> 1:16:06.616
+This has two advantages: on the one hand, these layers don't
+need to learn a full representation, only what to change about
+it; on the other hand, the gradient can flow back more easily.
+
+1:16:22.082 --> 1:16:24.172
+Good.
+
+1:16:23.843 --> 1:16:31.768
+So much for the neural network basics; now to the last thing
+for today.
+
+1:16:31.671 --> 1:16:33.750
+Language models were, yeah,
+
+1:16:33.750 --> 1:16:41.976
+used inside the translation models themselves, and now we're
+seeing them again, but one thing which at the beginning was
+very essential were the word embeddings. People really trained
+language models partly only to get this type of embedding.
+
+1:16:59.999 --> 1:17:04.193
+Therefore, we want to look at them a bit closer.
+
+1:17:09.229 --> 1:17:15.678
+So now some last words on the word embeddings. The interesting
+thing is that word embeddings can be used for very different
+tasks.
+
+1:17:27.347 --> 1:17:31.329
+The nice thing is you can train them on just large amounts of
+data.
+
+1:17:31.931 --> 1:17:41.569
+And then, if you have these word embeddings, we have seen that
+they give a much smaller representation than the one-hot
+vectors.
+
+1:17:41.982 --> 1:17:52.217
+So then you can train a smaller model for some other task, and
+therefore you are more efficient.
+
+1:17:52.532 --> 1:17:55.218
+One thing about these initial word embeddings is important:
+they really depend only on the word itself. So if you look at
+the two meanings of "can", the can of beans or "I can do
+that", they will have the same embedding, so somehow the
+embedding has to keep this ambiguity inside.
+
+1:18:09.189 --> 1:18:12.486
+It cannot be resolved at this level.
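The residual idea fits in a few lines of code. The sketch below is a generic illustration only (the block size and the inner layers are assumptions, not taken from the lecture): each block returns its input plus a learned change, so even a deep stack lets the gradient pass back through the additions.

```python
# Minimal sketch of a residual (skip) connection, with arbitrary sizes.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, size=512):
        super().__init__()
        # The inner layers only have to learn what to *change* about the input.
        self.layer = nn.Sequential(nn.Linear(size, size), nn.ReLU(), nn.Linear(size, size))

    def forward(self, x):
        return x + self.layer(x)   # output = input + learned change

x = torch.randn(2, 512)
deep = nn.Sequential(*[ResidualBlock() for _ in range(20)])  # stacking many blocks stays trainable
y = deep(x)
```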
+
+1:18:12.486 --> 1:18:24.753
+That can only be resolved in the higher layers, which look at
+the context; the word embedding layer really depends only on
+the word itself.
+
+1:18:29.489 --> 1:18:33.757
+However, even this one has quite interesting properties.
+
+1:18:34.034 --> 1:18:39.558
+So people like to visualize them. That's always difficult,
+because if you look at these vectors,
+
+1:18:47.767 --> 1:18:52.879
+drawing your five-hundred-dimensional vector is still a bit
+challenging.
+
+1:18:53.113 --> 1:19:12.472
+So you cannot do that directly; people have to do some type of
+dimensionality reduction down to two dimensions.
+
+1:19:13.073 --> 1:19:17.209
+And of course then some information is getting lost in this
+projection.
+
+1:19:18.238 --> 1:19:24.802
+And you see, for example, this is the most famous and common
+example: you can look at the difference between the male and
+the female word in English. This is the embedding of king,
+this is the embedding of queen, and this is their difference.
+
+1:19:38.058 --> 1:19:40.394
+You can do that for very different words.
+
+1:19:40.780 --> 1:19:45.407
+And that is where the math comes in; that is what people then
+look into.
+
+1:19:45.725 --> 1:19:50.995
+So what you can now do, for example, is calculate the
+difference between man and woman.
+
+1:19:52.232 --> 1:19:55.511
+Then you can take the embedding of king, add to it the
+difference between man and woman, and then look at which words
+are similar to the result. Of course you won't directly hit
+the correct word; it's a continuous space.
+
+1:20:10.790 --> 1:20:23.127
+But you can look at what the nearest neighbours of this point
+are, and often the expected word is among them.
+
+1:20:24.224 --> 1:20:33.913
+So it somehow learns that the difference between these word
+pairs is always roughly the same.
+
+1:20:34.374 --> 1:20:37.746
+You can do that for different things. You also see that it's
+not perfect; here, for example, with verb forms like swimming
+and swam, or walking and walked.
+
+1:20:49.469 --> 1:20:51.639
+So you can try to use them. The interesting thing is that this
+is completely unsupervised: nobody taught the model the
+principle of gender in language.
+
+1:21:04.284 --> 1:21:09.910
+It's purely trained on the task of doing next-word prediction.
+
+1:21:10.230 --> 1:21:20.658
+And even really semantic information is captured, like the
+capital relation: this is the difference between the country
+and its capital city.
+
+1:21:23.823 --> 1:21:25.518
+In this visualization we have done the same thing with the
+difference between country and capital.
+
+1:21:33.853 --> 1:21:41.991
+You see it's not perfect, but it points in roughly the right
+direction, so you can even use this, for example for question
+answering: if you have the difference between one country and
+its capital, you can apply it to a new country.
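The king − man + woman trick looks like this in code. The four-dimensional vectors below are invented toy values just to show the arithmetic; a real experiment would use embeddings trained on large amounts of text and would exclude the query words from the nearest-neighbour search.

```python
# Minimal sketch of the analogy-by-vector-arithmetic idea, with made-up embeddings.
import numpy as np

emb = {                                   # hypothetical 4-dimensional word embeddings
    "king":  np.array([0.8, 0.7, 0.1, 0.6]),
    "queen": np.array([0.8, 0.7, 0.9, 0.6]),
    "man":   np.array([0.2, 0.1, 0.1, 0.3]),
    "woman": np.array([0.2, 0.1, 0.9, 0.3]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]          # move along the man -> woman direction
nearest = max(emb, key=lambda w: cosine(emb[w], target))  # nearest neighbour in this toy vocabulary
print(nearest)   # prints "queen" for these toy vectors
```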
+
+1:21:54.834 --> 1:22:02.741
+So it seems these embeddings are really able to learn a lot of
+information and compress all this information,
+
+1:22:05.325 --> 1:22:11.769
+just from doing next-word prediction. And that also explains a
+bit, or maybe not explains but motivates, what the main
+advantage of this type of neural model is: that we can use
+these hidden representations, transfer them, and use them for
+different tasks.
+
+1:22:28.568 --> 1:22:43.707
+So to summarize what we did today: what you should hopefully
+take with you is how language models are used in machine
+translation,
+
+1:22:45.805 --> 1:22:49.149
+and then how we can do language modeling with neural networks.
+
+1:22:49.449 --> 1:22:55.617
+We looked at three different architectures: the feed-forward
+language model, the one based on restricted Boltzmann machines,
+
+1:22:59.039 --> 1:23:05.366
+and the recurrent neural network. So we have seen feed-forward
+and recurrent networks, and in the next lectures we'll see the
+last type of architecture.
+
+1:23:15.915 --> 1:23:17.412
+Do you have any questions?
+
+1:23:20.680 --> 1:23:27.341
+Then thanks a lot, and next Tuesday we will meet again.
+ +0:02:55.415 --> 0:03:10.397 +And then we learned a lot of techniques and +that is the main challenge in statistical language. + +0:03:10.397 --> 0:03:15.391 +How we can give somehow a good. + +0:03:15.435 --> 0:03:23.835 +And they developed very specific, very good +techniques to deal with that. + +0:03:23.835 --> 0:03:26.900 +However, this is the best. + +0:03:28.568 --> 0:03:33.907 +And therefore we can do things different. + +0:03:33.907 --> 0:03:44.331 +If we have not seen an N gram before in statistical +models, we have to have seen. + +0:03:45.225 --> 0:03:51.361 +Before, and we can only get information from +exactly the same word. + +0:03:51.411 --> 0:03:57.567 +We don't have an approximate matching like +that. + +0:03:57.567 --> 0:04:10.255 +Maybe it stood together in some way or similar, +and in a sentence we might generalize the knowledge. + +0:04:11.191 --> 0:04:21.227 +Would like to have more something like that +where engrams are represented more in a general + +0:04:21.227 --> 0:04:21.990 +space. + +0:04:22.262 --> 0:04:29.877 +So if you learn something about eyewalk then +maybe we can use this knowledge and also. + +0:04:30.290 --> 0:04:43.034 +And thereby no longer treat all or at least +a lot of the ingrams as we've done before. + +0:04:43.034 --> 0:04:45.231 +We can really. + +0:04:47.047 --> 0:04:56.157 +And we maybe want to even do that in a more +hierarchical approach, but we know okay some + +0:04:56.157 --> 0:05:05.268 +words are similar like go and walk is somehow +similar and and therefore like maybe if we + +0:05:05.268 --> 0:05:07.009 +then merge them. + +0:05:07.387 --> 0:05:16.104 +If we learn something about work, then it +should tell us also something about Hugo or + +0:05:16.104 --> 0:05:17.118 +he walks. + +0:05:17.197 --> 0:05:18.970 +We see already. + +0:05:18.970 --> 0:05:22.295 +It's, of course, not so easy. + +0:05:22.295 --> 0:05:31.828 +We see that there is some relations which +we need to integrate, for example, for you. + +0:05:31.828 --> 0:05:35.486 +We need to add the S, but maybe. + +0:05:37.137 --> 0:05:42.984 +And luckily there is one really yeah, convincing +methods in doing that. + +0:05:42.963 --> 0:05:47.239 +And that is by using an evil neck or. + +0:05:47.387 --> 0:05:57.618 +That's what we will introduce today so we +can use this type of neural networks to try + +0:05:57.618 --> 0:06:04.042 +to learn this similarity and to learn how some +words. + +0:06:04.324 --> 0:06:13.711 +And that is one of the main advantages that +we have by switching from the standard statistical + +0:06:13.711 --> 0:06:15.193 +models to the. + +0:06:15.115 --> 0:06:22.840 +To learn similarities between words and generalized +and learn what we call hidden representations. + +0:06:22.840 --> 0:06:29.707 +So somehow representations of words where +we can measure similarity in some dimensions. + +0:06:30.290 --> 0:06:42.275 +So in representations where as a tubically +continuous vector or a vector of a fixed size. + +0:06:42.822 --> 0:06:52.002 +We had it before and we've seen that the only +thing we did is we don't want to do. + +0:06:52.192 --> 0:06:59.648 +But these indices don't have any meaning, +so it wasn't that word five is more similar + +0:06:59.648 --> 0:07:02.248 +to words twenty than to word. + +0:07:02.582 --> 0:07:09.059 +So we couldn't learn anything about words +in the statistical model. + +0:07:09.059 --> 0:07:12.107 +That's a big challenge because. 
+ +0:07:12.192 --> 0:07:24.232 +If you think about words even in morphology, +so go and go is more similar because the person. + +0:07:24.264 --> 0:07:36.265 +While the basic models we have up to now, +they have no idea about that and goes as similar + +0:07:36.265 --> 0:07:37.188 +to go. + +0:07:39.919 --> 0:07:53.102 +So what we want to do today, in order to go +to this, we will have a short introduction. + +0:07:53.954 --> 0:08:06.667 +It very short just to see how we use them +here, but that's the good thing that are important + +0:08:06.667 --> 0:08:08.445 +for dealing. + +0:08:08.928 --> 0:08:14.083 +And then we'll first look into feet forward, +new network language models. + +0:08:14.454 --> 0:08:21.221 +And there we will still have this approximation +we had before, then we are looking only at + +0:08:21.221 --> 0:08:22.336 +fixed windows. + +0:08:22.336 --> 0:08:28.805 +So if you remember we have this classroom +of language models, and to determine what is + +0:08:28.805 --> 0:08:33.788 +the probability of a word, we only look at +the past and minus one. + +0:08:34.154 --> 0:08:36.878 +This is the theory of the case. + +0:08:36.878 --> 0:08:43.348 +However, we have the ability and that's why +they're really better in order. + +0:08:44.024 --> 0:08:51.953 +And then at the end we'll look at current +network language models where we then have + +0:08:51.953 --> 0:08:53.166 +a different. + +0:08:53.093 --> 0:09:01.922 +And thereby it is no longer the case that +we need to have a fixed history, but in theory + +0:09:01.922 --> 0:09:04.303 +we can model arbitrary. + +0:09:04.304 --> 0:09:06.854 +And we can log this phenomenon. + +0:09:06.854 --> 0:09:12.672 +We talked about a Tuesday where it's not clear +what type of information. + +0:09:16.396 --> 0:09:24.982 +So yeah, generally new networks are normally +learned to improve and perform some tasks. + +0:09:25.325 --> 0:09:38.934 +We have this structure and we are learning +them from samples so that is similar to what + +0:09:38.934 --> 0:09:42.336 +we had before so now. + +0:09:42.642 --> 0:09:49.361 +And is somehow originally motivated by the +human brain. + +0:09:49.361 --> 0:10:00.640 +However, when you now need to know artificial +neural networks, it's hard to get a similarity. + +0:10:00.540 --> 0:10:02.884 +There seems to be not that important. + +0:10:03.123 --> 0:10:11.013 +So what they are mainly doing is doing summoning +multiplication and then one linear activation. + +0:10:12.692 --> 0:10:16.078 +So so the basic units are these type of. + +0:10:17.937 --> 0:10:29.837 +Perceptron is a basic block which we have +and this does exactly the processing. + +0:10:29.837 --> 0:10:36.084 +We have a fixed number of input features. + +0:10:36.096 --> 0:10:39.668 +So we have here numbers six zero to x and +as input. + +0:10:40.060 --> 0:10:48.096 +And this makes language processing difficult +because we know that it's not the case. + +0:10:48.096 --> 0:10:53.107 +If we're dealing with language, it doesn't +have any. + +0:10:54.114 --> 0:10:57.609 +So we have to model this somehow and understand +how we model this. + +0:10:58.198 --> 0:11:03.681 +Then we have the weights, which are the parameters +and the number of weights exactly the same. + +0:11:04.164 --> 0:11:15.069 +Of input features sometimes you have the spires +in there that always and then it's not really. + +0:11:15.195 --> 0:11:19.656 +And what you then do is very simple. + +0:11:19.656 --> 0:11:26.166 +It's just like the weight it sounds, so you +multiply. 
+ +0:11:26.606 --> 0:11:38.405 +What is then additionally important is we +have an activation function and it's important + +0:11:38.405 --> 0:11:42.514 +that this activation function. + +0:11:43.243 --> 0:11:54.088 +And later it will be important that this is +differentiable because otherwise all the training. + +0:11:54.714 --> 0:12:01.471 +This model by itself is not very powerful. + +0:12:01.471 --> 0:12:10.427 +We have the X Or problem and with this simple +you can't. + +0:12:10.710 --> 0:12:15.489 +However, there is a very easy and nice extension. + +0:12:15.489 --> 0:12:20.936 +The multi layer perception and things get +very powerful. + +0:12:21.081 --> 0:12:32.953 +The thing is you just connect a lot of these +in these layers of structures where we have + +0:12:32.953 --> 0:12:35.088 +the inputs and. + +0:12:35.395 --> 0:12:47.297 +And then we can combine them, or to do them: +The input layer is of course given by your + +0:12:47.297 --> 0:12:51.880 +problem with the dimension. + +0:12:51.880 --> 0:13:00.063 +The output layer is also given by your dimension. + +0:13:01.621 --> 0:13:08.802 +So let's start with the first question, now +more language related, and that is how we represent. + +0:13:09.149 --> 0:13:19.282 +So we have seen here input to x, but the question +is now okay. + +0:13:19.282 --> 0:13:23.464 +How can we put into this? + +0:13:26.866 --> 0:13:34.123 +The first thing that we're able to do is we're +going to set it in the inspector. + +0:13:34.314 --> 0:13:45.651 +Yeah, and that is not that easy because the +continuous vector will come to that. + +0:13:45.651 --> 0:13:47.051 +We can't. + +0:13:47.051 --> 0:13:50.410 +We don't want to do it. + +0:13:50.630 --> 0:13:57.237 +But if we need to input the word into the +needle network, it has to be something easily + +0:13:57.237 --> 0:13:57.912 +defined. + +0:13:59.079 --> 0:14:11.511 +One is the typical thing, the one-hour encoded +vector, so we have a vector where the dimension + +0:14:11.511 --> 0:14:15.306 +is the vocabulary, and then. + +0:14:16.316 --> 0:14:25.938 +So the first thing you are ready to see that +means we are always dealing with fixed. + +0:14:26.246 --> 0:14:34.961 +So you cannot easily extend your vocabulary, +but if you mean your vocabulary would increase + +0:14:34.961 --> 0:14:37.992 +the size of this input vector,. + +0:14:39.980 --> 0:14:42.423 +That's maybe also motivating. + +0:14:42.423 --> 0:14:45.355 +We'll talk about bike parade going. + +0:14:45.355 --> 0:14:47.228 +That's the nice thing. + +0:14:48.048 --> 0:15:01.803 +The big advantage of this one putt encoding +is that we don't implement similarity between + +0:15:01.803 --> 0:15:06.999 +words, but we're really learning. + +0:15:07.227 --> 0:15:11.219 +So you need like to represent any words. + +0:15:11.219 --> 0:15:15.893 +You need a dimension of and dimensional vector. + +0:15:16.236 --> 0:15:26.480 +Imagine you could eat no binary encoding, +so you could represent words as binary vectors. + +0:15:26.806 --> 0:15:32.348 +So you will be significantly more efficient. + +0:15:32.348 --> 0:15:39.122 +However, you have some more digits than other +numbers. + +0:15:39.559 --> 0:15:46.482 +Would somehow be bad because you would force +the one to do this and it's by hand not clear + +0:15:46.482 --> 0:15:47.623 +how to define. + +0:15:48.108 --> 0:15:55.135 +So therefore currently this is the most successful +approach to just do this one patch. + +0:15:55.095 --> 0:15:59.344 +We take a fixed vocabulary. 
+ +0:15:59.344 --> 0:16:10.269 +We map each word to the initial and then we +represent a word like this. + +0:16:10.269 --> 0:16:13.304 +The representation. + +0:16:14.514 --> 0:16:27.019 +But this dimension here is a secondary size, +and if you think ten thousand that's quite + +0:16:27.019 --> 0:16:33.555 +high, so we're always trying to be efficient. + +0:16:33.853 --> 0:16:42.515 +And we are doing the same type of efficiency +because then we are having a very small one + +0:16:42.515 --> 0:16:43.781 +compared to. + +0:16:44.104 --> 0:16:53.332 +It can be still a maybe or neurons, but this +is significantly smaller, of course, as before. + +0:16:53.713 --> 0:17:04.751 +So you are learning there this word as you +said, but you can learn it directly, and there + +0:17:04.751 --> 0:17:07.449 +we have similarities. + +0:17:07.807 --> 0:17:14.772 +But the nice thing is that this is then learned, +and we do not need to like hand define. + +0:17:17.117 --> 0:17:32.377 +So yes, so that is how we're typically adding +at least a single word into the language world. + +0:17:32.377 --> 0:17:43.337 +Then we can see: So we're seeing that you +have the one hard representation always of + +0:17:43.337 --> 0:17:44.857 +the same similarity. + +0:17:45.105 --> 0:18:00.803 +Then we're having this continuous vector which +is a lot smaller dimension and that's. + +0:18:01.121 --> 0:18:06.984 +What we are doing then is learning these representations +so that they are best for language modeling. + +0:18:07.487 --> 0:18:19.107 +So the representations are implicitly because +we're training on the language. + +0:18:19.479 --> 0:18:30.115 +And the nice thing was found out later is +these representations are really, really good + +0:18:30.115 --> 0:18:32.533 +for a lot of other. + +0:18:33.153 --> 0:18:39.729 +And that is why they are now called word embedded +space themselves, and used for other tasks. + +0:18:40.360 --> 0:18:49.827 +And they are somehow describing different +things so they can describe and semantic similarities. + +0:18:49.789 --> 0:18:58.281 +We are looking at the very example of today +that you can do in this vector space by adding + +0:18:58.281 --> 0:19:00.613 +some interesting things. + +0:19:00.940 --> 0:19:11.174 +And so they got really was a first big improvement +when switching to neural staff. + +0:19:11.491 --> 0:19:20.736 +They are like part of the model still with +more complex representation alert, but they + +0:19:20.736 --> 0:19:21.267 +are. + +0:19:23.683 --> 0:19:34.975 +Then we are having the output layer, and in +the output layer we also have output structure + +0:19:34.975 --> 0:19:36.960 +and activation. + +0:19:36.997 --> 0:19:44.784 +That is the language we want to predict, which +word should be the next. + +0:19:44.784 --> 0:19:46.514 +We always have. + +0:19:47.247 --> 0:19:56.454 +And that can be done very well with the softball +softbacked layer, where again the dimension. + +0:19:56.376 --> 0:20:03.971 +Is the vocabulary, so this is a vocabulary +size, and again the case neuro represents the + +0:20:03.971 --> 0:20:09.775 +case class, so in our case we have again a +one-hour representation. + +0:20:10.090 --> 0:20:18.929 +Ours is a probability distribution and the +end is a probability distribution of all works. 
+ +0:20:18.929 --> 0:20:28.044 +The case entry tells us: So we need to have +some of our probability distribution at our + +0:20:28.044 --> 0:20:36.215 +output, and in order to achieve that this activation +function goes, it needs to be that all the + +0:20:36.215 --> 0:20:36.981 +outputs. + +0:20:37.197 --> 0:20:47.993 +And we can achieve that with a softmax activation +we take each of the value and then. + +0:20:48.288 --> 0:20:58.020 +So by having this type of activation function +we are really getting that at the end we always. + +0:20:59.019 --> 0:21:12.340 +The beginning was very challenging because +again we have this inefficient representation + +0:21:12.340 --> 0:21:15.184 +of our vocabulary. + +0:21:15.235 --> 0:21:27.500 +And then you can imagine escalating over to +something over a thousand is maybe a bit inefficient + +0:21:27.500 --> 0:21:29.776 +with cheap users. + +0:21:36.316 --> 0:21:43.664 +And then yeah, for training the models, that +is how we refine, so we have this architecture + +0:21:43.664 --> 0:21:44.063 +now. + +0:21:44.264 --> 0:21:52.496 +We need to minimize the arrow by taking the +output. + +0:21:52.496 --> 0:21:58.196 +We are comparing it to our targets. + +0:21:58.298 --> 0:22:07.670 +So one important thing is, of course, how +can we measure the error? + +0:22:07.670 --> 0:22:12.770 +So what if we're training the ideas? + +0:22:13.033 --> 0:22:19.770 +And how well when measuring it is in natural +language processing, typically the cross entropy. + +0:22:19.960 --> 0:22:32.847 +That means we are comparing the target with +the output, so we're taking the value multiplying + +0:22:32.847 --> 0:22:35.452 +with the horizons. + +0:22:35.335 --> 0:22:43.454 +Which gets optimized and you're seeing that +this, of course, makes it again very nice and + +0:22:43.454 --> 0:22:49.859 +easy because our target, we said, is again +a one-hound representation. + +0:22:50.110 --> 0:23:00.111 +So except for one, all of these are always +zero, and what we are doing is taking the one. + +0:23:00.100 --> 0:23:05.970 +And we only need to multiply the one with +the logarism here, and that is all the feedback. + +0:23:06.946 --> 0:23:14.194 +Of course, this is not always influenced by +all the others. + +0:23:14.194 --> 0:23:17.938 +Why is this influenced by all? + +0:23:24.304 --> 0:23:33.554 +Think Mac the activation function, which is +the current activation divided by some of the + +0:23:33.554 --> 0:23:34.377 +others. + +0:23:34.354 --> 0:23:44.027 +Because otherwise it could of course easily +just increase this value and ignore the others, + +0:23:44.027 --> 0:23:49.074 +but if you increase one value or the other, +so. + +0:23:51.351 --> 0:24:04.433 +And then we can do with neon networks one +very nice and easy type of training that is + +0:24:04.433 --> 0:24:07.779 +done in all the neon. + +0:24:07.707 --> 0:24:12.664 +So in which direction does the arrow show? + +0:24:12.664 --> 0:24:23.152 +And then if we want to go to a smaller like +smaller arrow, that's what we want to achieve. + +0:24:23.152 --> 0:24:27.302 +We're trying to minimize our arrow. + +0:24:27.287 --> 0:24:32.875 +And we have to do that, of course, for all +the weights, and to calculate the error of + +0:24:32.875 --> 0:24:36.709 +all the weights we want in the back of the +baggation here. + +0:24:36.709 --> 0:24:41.322 +But what you can do is you can propagate the +arrow which you measured. + +0:24:41.322 --> 0:24:43.792 +At the end you can propagate it back. 
+ +0:24:43.792 --> 0:24:46.391 +That's basic mass and basic derivation. + +0:24:46.706 --> 0:24:59.557 +Then you can do each weight in your model +and measure how much it contributes to this + +0:24:59.557 --> 0:25:01.350 +individual. + +0:25:04.524 --> 0:25:17.712 +To summarize what your machine translation +should be, to understand all this problem is + +0:25:17.712 --> 0:25:20.710 +that this is how a. + +0:25:20.580 --> 0:25:23.056 +The notes are perfect thrones. + +0:25:23.056 --> 0:25:28.167 +They are fully connected between two layers +and no connections. + +0:25:28.108 --> 0:25:29.759 +Across layers. + +0:25:29.829 --> 0:25:35.152 +And what they're doing is always just to wait +for some here and then an activation function. + +0:25:35.415 --> 0:25:38.794 +And in order to train you have this sword +in backwards past. + +0:25:39.039 --> 0:25:41.384 +So we put in here. + +0:25:41.281 --> 0:25:46.540 +Our inputs have some random values at the +beginning. + +0:25:46.540 --> 0:25:49.219 +They calculate the output. + +0:25:49.219 --> 0:25:58.646 +We are measuring how big our error is, propagating +the arrow back, and then changing our model + +0:25:58.646 --> 0:25:59.638 +in a way. + +0:26:01.962 --> 0:26:14.267 +So before we're coming into the neural networks, +how can we use this type of neural network + +0:26:14.267 --> 0:26:17.611 +to do language modeling? + +0:26:23.103 --> 0:26:25.520 +So the question is now okay. + +0:26:25.520 --> 0:26:33.023 +How can we use them in natural language processing +and especially in machine translation? + +0:26:33.023 --> 0:26:38.441 +The first idea of using them was to estimate +the language model. + +0:26:38.999 --> 0:26:42.599 +So we have seen that the output can be monitored +here as well. + +0:26:43.603 --> 0:26:49.308 +Has a probability distribution, and if we +have a full vocabulary, we could mainly hear + +0:26:49.308 --> 0:26:55.209 +estimate how probable each next word is, and +then use that in our language model fashion, + +0:26:55.209 --> 0:27:02.225 +as we've done it last time, we've got the probability +of a full sentence as a product of all probabilities + +0:27:02.225 --> 0:27:03.208 +of individual. + +0:27:04.544 --> 0:27:06.695 +And UM. + +0:27:06.446 --> 0:27:09.776 +That was done and in ninety seven years. + +0:27:09.776 --> 0:27:17.410 +It's very easy to integrate it into this Locklear +model, so we have said that this is how the + +0:27:17.410 --> 0:27:24.638 +Locklear model looks like, so we're searching +the best translation, which minimizes each + +0:27:24.638 --> 0:27:25.126 +wage. + +0:27:25.125 --> 0:27:26.371 +The feature value. + +0:27:26.646 --> 0:27:31.642 +We have that with the minimum error training, +if you can remember when we search for the + +0:27:31.642 --> 0:27:32.148 +optimal. + +0:27:32.512 --> 0:27:40.927 +We have the phrasetable probabilities, the +language model, and we can just add here and + +0:27:40.927 --> 0:27:41.597 +there. + +0:27:41.861 --> 0:27:46.077 +So that is quite easy as said. + +0:27:46.077 --> 0:27:54.101 +That was how statistical machine translation +was improved. + +0:27:54.101 --> 0:27:57.092 +Add one more feature. + +0:27:58.798 --> 0:28:11.220 +So how can we model the language mark for +Belty with your network? + +0:28:11.220 --> 0:28:22.994 +So what we have to do is: And the problem +in generally in the head is that most we haven't + +0:28:22.994 --> 0:28:25.042 +seen long sequences. 
+ +0:28:25.085 --> 0:28:36.956 +Mostly we have to beg off to very short sequences +and we are working on this discrete space where. + +0:28:37.337 --> 0:28:48.199 +So the idea is if we have a meal network we +can map words into continuous representation + +0:28:48.199 --> 0:28:50.152 +and that helps. + +0:28:51.091 --> 0:28:59.598 +And the structure then looks like this, so +this is the basic still feed forward neural + +0:28:59.598 --> 0:29:00.478 +network. + +0:29:01.361 --> 0:29:10.744 +We are doing this at Proximation again, so +we are not putting in all previous words, but + +0:29:10.744 --> 0:29:11.376 +it's. + +0:29:11.691 --> 0:29:25.089 +And this is done because in your network we +can have only a fixed type of input, so we + +0:29:25.089 --> 0:29:31.538 +can: Can only do a fixed set, and they are +going to be doing exactly the same in minus + +0:29:31.538 --> 0:29:31.879 +one. + +0:29:33.593 --> 0:29:41.026 +And then we have, for example, three words +and three different words, which are in these + +0:29:41.026 --> 0:29:54.583 +positions: And then we're having the first +layer of the neural network, which learns words + +0:29:54.583 --> 0:29:56.247 +and words. + +0:29:57.437 --> 0:30:04.976 +There is one thing which is maybe special +compared to the standard neural memory. + +0:30:05.345 --> 0:30:13.163 +So the representation of this word we want +to learn first of all position independence, + +0:30:13.163 --> 0:30:19.027 +so we just want to learn what is the general +meaning of the word. + +0:30:19.299 --> 0:30:26.244 +Therefore, the representation you get here +should be the same as if you put it in there. + +0:30:27.247 --> 0:30:35.069 +The nice thing is you can achieve that in +networks the same way you achieve it. + +0:30:35.069 --> 0:30:41.719 +This way you're reusing ears so we are forcing +them to always stay. + +0:30:42.322 --> 0:30:49.689 +And that's why you then learn your word embedding, +which is contextual and independent, so. + +0:30:49.909 --> 0:31:05.561 +So the idea is you have the diagram go home +and you don't want to use the context. + +0:31:05.561 --> 0:31:07.635 +First you. + +0:31:08.348 --> 0:31:14.155 +That of course it might have a different meaning +depending on where it stands, but learn that. + +0:31:14.514 --> 0:31:19.623 +First, we're learning key representation of +the words, which is just the representation + +0:31:19.623 --> 0:31:20.378 +of the word. + +0:31:20.760 --> 0:31:37.428 +So it's also not like normally all input neurons +are connected to all neurons. + +0:31:37.857 --> 0:31:47.209 +This is the first layer of representation, +and then we have a lot denser representation, + +0:31:47.209 --> 0:31:56.666 +that is, our three word embeddings here, and +now we are learning this interaction between + +0:31:56.666 --> 0:31:57.402 +words. + +0:31:57.677 --> 0:32:08.265 +So now we have at least one connected, fully +connected layer here, which takes the three + +0:32:08.265 --> 0:32:14.213 +imbedded input and then learns the new embedding. + +0:32:15.535 --> 0:32:27.871 +And then if you had one of several layers +of lining which is your output layer, then. + +0:32:28.168 --> 0:32:46.222 +So here the size is a vocabulary size, and +then you put as target what is the probability + +0:32:46.222 --> 0:32:48.228 +for each. + +0:32:48.688 --> 0:32:56.778 +The nice thing is that you learn everything +together, so you're not learning what is a + +0:32:56.778 --> 0:32:58.731 +good representation. 
+ +0:32:59.079 --> 0:33:12.019 +When you are training the whole network together, +it learns what representation for a word you + +0:33:12.019 --> 0:33:13.109 +get in. + +0:33:15.956 --> 0:33:19.176 +It's Yeah That Is the Main Idea. + +0:33:20.660 --> 0:33:32.695 +Nowadays often referred to as one way of self-supervised +learning, why self-supervisory learning? + +0:33:33.053 --> 0:33:37.120 +The output is the next word and the input +is the previous word. + +0:33:37.377 --> 0:33:46.778 +But somehow it's self-supervised because it's +not really that we created labels, but we artificially. + +0:33:46.806 --> 0:34:01.003 +We just have pure text, and then we created +the task. + +0:34:05.905 --> 0:34:12.413 +Say we have two sentences like go home again. + +0:34:12.413 --> 0:34:18.780 +Second one is go to creative again, so both. + +0:34:18.858 --> 0:34:31.765 +The starboard bygo and then we have to predict +the next four years and my question is: Be + +0:34:31.765 --> 0:34:40.734 +modeled this ability as one vector with like +probability or possible works. + +0:34:40.734 --> 0:34:42.740 +We have musical. + +0:34:44.044 --> 0:34:56.438 +You have multiple examples, so you would twice +train, once you predict, once you predict, + +0:34:56.438 --> 0:35:02.359 +and then, of course, the best performance. + +0:35:04.564 --> 0:35:11.772 +A very good point, so you're not aggregating +examples beforehand, but you're taking each + +0:35:11.772 --> 0:35:13.554 +example individually. + +0:35:19.259 --> 0:35:33.406 +So what you do is you simultaneously learn +the projection layer which represents this + +0:35:33.406 --> 0:35:39.163 +word and the N gram probabilities. + +0:35:39.499 --> 0:35:48.390 +And what people then later analyzed is that +these representations are very powerful. + +0:35:48.390 --> 0:35:56.340 +The task is just a very important task to +model like what is the next word. + +0:35:56.816 --> 0:36:09.429 +It's a bit motivated by people saying in order +to get the meaning of the word you have to + +0:36:09.429 --> 0:36:10.690 +look at. + +0:36:10.790 --> 0:36:18.467 +If you read the text in there, which you have +never seen, you can still estimate the meaning + +0:36:18.467 --> 0:36:22.264 +of this word because you know how it is used. + +0:36:22.602 --> 0:36:26.667 +Just imagine you read this text about some +city. + +0:36:26.667 --> 0:36:32.475 +Even if you've never seen the city before +heard, you often know from. + +0:36:34.094 --> 0:36:44.809 +So what is now the big advantage of using +neural networks? + +0:36:44.809 --> 0:36:57.570 +Just imagine we have to estimate this: So +you have to monitor the probability of ad hip + +0:36:57.570 --> 0:37:00.272 +and now imagine iPhone. + +0:37:00.600 --> 0:37:06.837 +So all the techniques we have at the last +time. + +0:37:06.837 --> 0:37:14.243 +At the end, if you haven't seen iPhone, you +will always. + +0:37:15.055 --> 0:37:19.502 +Because you haven't seen the previous words, +so you have no idea how to do that. + +0:37:19.502 --> 0:37:24.388 +You won't have seen the diagram, the trigram +and all the others, so the probability here + +0:37:24.388 --> 0:37:27.682 +will just be based on the probability of ad, +so it uses no. + +0:37:28.588 --> 0:37:38.328 +If you're having this type of model, what +does it do so? + +0:37:38.328 --> 0:37:43.454 +This is the last three words. + +0:37:43.483 --> 0:37:49.837 +Maybe this representation is messed up because +it's mainly on a particular word or source + +0:37:49.837 --> 0:37:50.260 +that. 
+ +0:37:50.730 --> 0:37:57.792 +Now anyway you have these two information +that were two words before was first and therefore: + +0:37:58.098 --> 0:38:07.214 +So you have a lot of information here to estimate +how good it is. + +0:38:07.214 --> 0:38:13.291 +Of course, there could be more information. + +0:38:13.593 --> 0:38:25.958 +So all this type of modeling we can do and +that we couldn't do beforehand because we always. + +0:38:27.027 --> 0:38:31.905 +Don't guess how we do it now. + +0:38:31.905 --> 0:38:41.824 +Typically you would have one talking for awkward +vocabulary. + +0:38:42.602 --> 0:38:45.855 +All you're doing by carrying coding when it +has a fixed dancing. + +0:38:46.226 --> 0:38:49.439 +Yeah, you have to do something like that that +the opposite way. + +0:38:50.050 --> 0:38:55.413 +So yeah, all the vocabulary are by thankcoding +where you don't have have all the vocabulary. + +0:38:55.735 --> 0:39:07.665 +But then, of course, the back pairing coating +is better with arbitrary context because a + +0:39:07.665 --> 0:39:11.285 +problem with back pairing. + +0:39:17.357 --> 0:39:20.052 +Anymore questions to the basic same little +things. + +0:39:23.783 --> 0:39:36.162 +This model we then want to continue is to +look into how complex that is or can make things + +0:39:36.162 --> 0:39:39.155 +maybe more efficient. + +0:39:40.580 --> 0:39:47.404 +At the beginning there was definitely a major +challenge. + +0:39:47.404 --> 0:39:50.516 +It's still not that easy. + +0:39:50.516 --> 0:39:58.297 +All guess follow the talk about their environmental +fingerprint. + +0:39:58.478 --> 0:40:05.686 +So this calculation is normally heavy, and +if you build systems yourself, you have to + +0:40:05.686 --> 0:40:06.189 +wait. + +0:40:06.466 --> 0:40:15.412 +So it's good to know a bit about how complex +things are in order to do a good or efficient. + +0:40:15.915 --> 0:40:24.706 +So one thing where most of the calculation +really happens is if you're. + +0:40:25.185 --> 0:40:34.649 +So in generally all these layers, of course, +we're talking about networks and the zones + +0:40:34.649 --> 0:40:35.402 +fancy. + +0:40:35.835 --> 0:40:48.305 +So what you have to do in order to calculate +here these activations, you have this weight. + +0:40:48.488 --> 0:41:05.021 +So to make it simple, let's see we have three +outputs, and then you just do a metric identification + +0:41:05.021 --> 0:41:08.493 +between your weight. + +0:41:08.969 --> 0:41:19.641 +That is why the use is so powerful for neural +networks because they are very good in doing + +0:41:19.641 --> 0:41:22.339 +metric multiplication. + +0:41:22.782 --> 0:41:28.017 +However, for some type of embedding layer +this is really very inefficient. + +0:41:28.208 --> 0:41:37.547 +So in this input we are doing this calculation. + +0:41:37.547 --> 0:41:47.081 +What we are mainly doing is selecting one +color. + +0:41:47.387 --> 0:42:03.570 +So therefore you can do at least the forward +pass a lot more efficient if you don't really + +0:42:03.570 --> 0:42:07.304 +do this calculation. + +0:42:08.348 --> 0:42:20.032 +So the weight metrics of the first embedding +layer is just that in each color you have. + +0:42:20.580 --> 0:42:30.990 +So this is how your initial weights look like +and how you can interpret or understand. 
+ +0:42:32.692 --> 0:42:42.042 +And this is already relatively important because +remember this is a huge dimensional thing, + +0:42:42.042 --> 0:42:51.392 +so typically here we have the number of words +ten thousand, so this is the word embeddings. + +0:42:51.451 --> 0:43:00.400 +Because it's the largest one there, we have +entries, while for the others we maybe have. + +0:43:00.660 --> 0:43:03.402 +So they are a little bit efficient and are +important to make this in. + +0:43:06.206 --> 0:43:10.529 +And then you can look at where else the calculations +are very difficult. + +0:43:10.830 --> 0:43:20.294 +So here we have our individual network, so +here are the word embeddings. + +0:43:20.294 --> 0:43:29.498 +Then we have one hidden layer, and then you +can look at how difficult. + +0:43:30.270 --> 0:43:38.742 +We could save a lot of calculations by calculating +that by just doing like do the selection because: + +0:43:40.600 --> 0:43:51.748 +And then the number of calculations you have +to do here is the length. + +0:43:52.993 --> 0:44:06.206 +Then we have here the hint size that is the +hint size, so the first step of calculation + +0:44:06.206 --> 0:44:10.260 +for this metric is an age. + +0:44:10.730 --> 0:44:22.030 +Then you have to do some activation function +which is this: This is the hidden size hymn + +0:44:22.030 --> 0:44:29.081 +because we need the vocabulary socks to calculate +the probability for each. + +0:44:29.889 --> 0:44:40.474 +And if you look at this number, so if you +have a projection sign of one hundred and a + +0:44:40.474 --> 0:44:45.027 +vocabulary sign of one hundred, you. + +0:44:45.425 --> 0:44:53.958 +And that's why there has been especially at +the beginning some ideas on how we can reduce + +0:44:53.958 --> 0:44:55.570 +the calculation. + +0:44:55.956 --> 0:45:02.352 +And if we really need to calculate all our +capabilities, or if we can calculate only some. + +0:45:02.582 --> 0:45:13.061 +And there again one important thing to think +about is for what you will use my language. + +0:45:13.061 --> 0:45:21.891 +One can use it for generations and that's +where we will see the next week. + +0:45:21.891 --> 0:45:22.480 +And. + +0:45:23.123 --> 0:45:32.164 +Initially, if it's just used as a feature, +we do not want to use it for generation, but + +0:45:32.164 --> 0:45:32.575 +we. + +0:45:32.953 --> 0:45:41.913 +And there we might not be interested in all +the probabilities, but we already know all + +0:45:41.913 --> 0:45:49.432 +the probability of this one word, and then +it might be very inefficient. + +0:45:51.231 --> 0:45:53.638 +And how can you do that so initially? + +0:45:53.638 --> 0:45:56.299 +For example, people look into shortlists. + +0:45:56.756 --> 0:46:03.321 +So the idea was this calculation at the end +is really very expensive. + +0:46:03.321 --> 0:46:05.759 +So can we make that more. + +0:46:05.945 --> 0:46:17.135 +And the idea was okay, and most birds occur +very rarely, and some beef birds occur very, + +0:46:17.135 --> 0:46:18.644 +very often. + +0:46:19.019 --> 0:46:37.644 +And so they use the smaller imagery, which +is maybe very small, and then you merge a new. + +0:46:37.937 --> 0:46:45.174 +So you're taking if the word is in the shortness, +so in the most frequent words. + +0:46:45.825 --> 0:46:58.287 +You're taking the probability of this short +word by some normalization here, and otherwise + +0:46:58.287 --> 0:46:59.656 +you take. + +0:47:00.020 --> 0:47:00.836 +Course. 
+ +0:47:00.836 --> 0:47:09.814 +It will not be as good, but then we don't +have to calculate all the capabilities at the + +0:47:09.814 --> 0:47:16.037 +end, but we only have to calculate it for the +most frequent. + +0:47:19.599 --> 0:47:39.477 +Machines about that, but of course we don't +model the probability of the infrequent words. + +0:47:39.299 --> 0:47:46.658 +And one idea is to do what is reported as +soles for the structure of the layer. + +0:47:46.606 --> 0:47:53.169 +You see how some years ago people were very +creative in giving names to newer models. + +0:47:53.813 --> 0:48:00.338 +And there the idea is that we model the out +group vocabulary as a clustered strip. + +0:48:00.680 --> 0:48:08.498 +So you don't need to mold all of your bodies +directly, but you are putting words into. + +0:48:08.969 --> 0:48:20.623 +A very intricate word is first in and then +in and then in and that is in sub-sub-clusters + +0:48:20.623 --> 0:48:21.270 +and. + +0:48:21.541 --> 0:48:29.936 +And this is what was mentioned in the past +of the work, so these are the subclasses that + +0:48:29.936 --> 0:48:30.973 +always go. + +0:48:30.973 --> 0:48:39.934 +So if it's in cluster one at the first position +then you only look at all the words which are: + +0:48:40.340 --> 0:48:50.069 +And then you can calculate the probability +of a word again just by the product over these, + +0:48:50.069 --> 0:48:55.522 +so the probability of the word is the first +class. + +0:48:57.617 --> 0:49:12.331 +It's maybe more clear where you have the sole +architecture, so what you will do is first + +0:49:12.331 --> 0:49:13.818 +predict. + +0:49:14.154 --> 0:49:26.435 +Then you go to the appropriate sub-class, +then you calculate the probability of the sub-class. + +0:49:27.687 --> 0:49:34.932 +Anybody have an idea why this is more, more +efficient, or if people do it first, it looks + +0:49:34.932 --> 0:49:35.415 +more. + +0:49:42.242 --> 0:49:56.913 +Yes, so you have to do less calculations, +or maybe here you have to calculate the element + +0:49:56.913 --> 0:49:59.522 +there, but you. + +0:49:59.980 --> 0:50:06.116 +The capabilities in the set classes that you're +going through and not for all of them. + +0:50:06.386 --> 0:50:16.688 +Therefore, it's only more efficient if you +don't need all awkward preferences because + +0:50:16.688 --> 0:50:21.240 +you have to even calculate the class. + +0:50:21.501 --> 0:50:30.040 +So it's only more efficient in scenarios where +you really need to use a language to evaluate. + +0:50:35.275 --> 0:50:54.856 +How this works is that on the output layer +you only have a vocabulary of: But on the input + +0:50:54.856 --> 0:51:05.126 +layer you have always your full vocabulary +because at the input we saw that this is not + +0:51:05.126 --> 0:51:06.643 +complicated. + +0:51:06.906 --> 0:51:19.778 +And then you can cluster down all your words, +embedding series of classes, and use that as + +0:51:19.778 --> 0:51:23.031 +your classes for that. + +0:51:23.031 --> 0:51:26.567 +So yeah, you have words. + +0:51:29.249 --> 0:51:32.593 +Is one idea of doing it. + +0:51:32.593 --> 0:51:44.898 +There is also a second idea of doing it again, +the idea that we don't need the probability. + +0:51:45.025 --> 0:51:53.401 +So sometimes it doesn't really need to be +a probability to evaluate. + +0:51:53.401 --> 0:52:05.492 +It's only important that: And: Here is called +self-normalization. 
+ +0:52:05.492 --> 0:52:19.349 +What people have done so is in the softmax +is always to the input divided by normalization. + +0:52:19.759 --> 0:52:25.194 +So this is how we calculate the soft mix. + +0:52:25.825 --> 0:52:42.224 +And in self-normalization now, the idea is +that we don't need to calculate the logarithm. + +0:52:42.102 --> 0:52:54.284 +That would be zero, and then you don't even +have to calculate the normalization. + +0:52:54.514 --> 0:53:01.016 +So how can we achieve that? + +0:53:01.016 --> 0:53:08.680 +And then there's the nice thing. + +0:53:09.009 --> 0:53:14.743 +And our novel Lots and more to maximize probability. + +0:53:14.743 --> 0:53:23.831 +We have this cross entry lot that probability +is higher, and now we're just adding. + +0:53:24.084 --> 0:53:31.617 +And the second loss just tells us you're pleased +training the way the lock set is zero. + +0:53:32.352 --> 0:53:38.625 +So then if it's nearly zero at the end you +don't need to calculate this and it's also + +0:53:38.625 --> 0:53:39.792 +very efficient. + +0:53:40.540 --> 0:53:57.335 +One important thing is this is only an inference, +so during tests we don't need to calculate. + +0:54:00.480 --> 0:54:15.006 +You can do a bit of a hyperparameter here +where you do the waiting and how much effort + +0:54:15.006 --> 0:54:16.843 +should be. + +0:54:18.318 --> 0:54:35.037 +The only disadvantage is that it's no speed +up during training and there are other ways + +0:54:35.037 --> 0:54:37.887 +of doing that. + +0:54:41.801 --> 0:54:43.900 +I'm with you all. + +0:54:44.344 --> 0:54:48.540 +Then we are coming very, very briefly like +this one here. + +0:54:48.828 --> 0:54:53.692 +There are more things on different types of +languages. + +0:54:53.692 --> 0:54:58.026 +We are having a very short view of a restricted. + +0:54:58.298 --> 0:55:09.737 +And then we'll talk about recurrent neural +networks for our language minds because they + +0:55:09.737 --> 0:55:17.407 +have the advantage now that we can't even further +improve. + +0:55:18.238 --> 0:55:24.395 +There's also different types of neural networks. + +0:55:24.395 --> 0:55:30.175 +These ballroom machines are not having input. + +0:55:30.330 --> 0:55:39.271 +They have these binary units: And they define +an energy function on the network, which can + +0:55:39.271 --> 0:55:46.832 +be in respect of bottom machines efficiently +calculated, and restricted needs. + +0:55:46.832 --> 0:55:53.148 +You only have connections between the input +and the hidden layer. + +0:55:53.393 --> 0:56:00.190 +So you see here you don't have input and output, +you just have an input and you calculate what. + +0:56:00.460 --> 0:56:16.429 +Which of course nicely fits with the idea +we're having, so you can use this for N gram + +0:56:16.429 --> 0:56:19.182 +language ones. + +0:56:19.259 --> 0:56:25.187 +Decaying this credibility of the input by +this type of neural networks. + +0:56:26.406 --> 0:56:30.582 +And the advantage of this type of model of +board that is. + +0:56:30.550 --> 0:56:38.629 +Very fast to integrate it, so that one was +the first one which was used during decoding. + +0:56:38.938 --> 0:56:50.103 +The problem of it is that the Enron language +models were very good at performing the calculation. + +0:56:50.230 --> 0:57:00.114 +So what people typically did is we talked +about a best list, so they generated a most + +0:57:00.114 --> 0:57:05.860 +probable output, and then they scored each +entry. 
+ +0:57:06.146 --> 0:57:10.884 +A language model, and then only like change +the order against that based on that which. + +0:57:11.231 --> 0:57:20.731 +The knifing is maybe only hundred entries, +while during decoding you will look at several + +0:57:20.731 --> 0:57:21.787 +thousand. + +0:57:26.186 --> 0:57:40.437 +This but let's look at the context, so we +have now seen your language models. + +0:57:40.437 --> 0:57:43.726 +There is the big. + +0:57:44.084 --> 0:57:57.552 +Remember ingram language is not always words +because sometimes you have to back off or interpolation + +0:57:57.552 --> 0:57:59.953 +to lower ingrams. + +0:58:00.760 --> 0:58:05.504 +However, in neural models we always have all +of these inputs and some of these. + +0:58:07.147 --> 0:58:21.262 +The disadvantage is that you are still limited +in your context, and if you remember the sentence + +0:58:21.262 --> 0:58:23.008 +from last,. + +0:58:22.882 --> 0:58:28.445 +Sometimes you need more context and there's +unlimited contexts that you might need and + +0:58:28.445 --> 0:58:34.838 +you can always create sentences where you need +this file context in order to put a good estimation. + +0:58:35.315 --> 0:58:44.955 +Can we also do it different in order to better +understand that it makes sense to view? + +0:58:45.445 --> 0:58:57.621 +So sequence labeling tasks are a very common +type of towns in natural language processing + +0:58:57.621 --> 0:59:03.438 +where you have an input sequence and then. + +0:59:03.323 --> 0:59:08.663 +I've token so you have one output for each +input so machine translation is not a secret + +0:59:08.663 --> 0:59:14.063 +labeling cast because the number of inputs +and the number of outputs is different so you + +0:59:14.063 --> 0:59:19.099 +put in a string German which has five words +and the output can be six or seven or. + +0:59:19.619 --> 0:59:20.155 +Secrets. + +0:59:20.155 --> 0:59:24.083 +Lately you always have the same number of +and the same number of. + +0:59:24.944 --> 0:59:40.940 +And you can model language modeling as that, +and you just say a label for each word is always + +0:59:40.940 --> 0:59:43.153 +a next word. + +0:59:45.705 --> 0:59:54.823 +This is the more general you can think of +it, for example how to speech taking entity + +0:59:54.823 --> 0:59:56.202 +recognition. + +0:59:58.938 --> 1:00:08.081 +And if you look at now fruit cut token in +generally sequence, they can depend on import + +1:00:08.081 --> 1:00:08.912 +tokens. + +1:00:09.869 --> 1:00:11.260 +Nice thing. + +1:00:11.260 --> 1:00:21.918 +In our case, the output tokens are the same +so we can easily model it that they only depend + +1:00:21.918 --> 1:00:24.814 +on all the input tokens. + +1:00:24.814 --> 1:00:28.984 +So we have this whether it's or so. + +1:00:31.011 --> 1:00:42.945 +But we can always do a look at what specific +type of sequence labeling, unidirectional sequence + +1:00:42.945 --> 1:00:44.188 +labeling. + +1:00:44.584 --> 1:00:58.215 +And that's exactly how we want the language +of the next word only depends on all the previous + +1:00:58.215 --> 1:01:00.825 +words that we're. + +1:01:01.321 --> 1:01:12.899 +Mean, of course, that's not completely true +in a language that the bad context might also + +1:01:12.899 --> 1:01:14.442 +be helpful. + +1:01:14.654 --> 1:01:22.468 +We will model always the probability of a +word given on its history, and therefore we + +1:01:22.468 --> 1:01:23.013 +need. 
+ +1:01:23.623 --> 1:01:29.896 +And currently we did there this approximation +in sequence labeling that we have this windowing + +1:01:29.896 --> 1:01:30.556 +approach. + +1:01:30.951 --> 1:01:43.975 +So in order to predict this type of word we +always look at the previous three words and + +1:01:43.975 --> 1:01:48.416 +then to do this one we again. + +1:01:49.389 --> 1:01:55.137 +If you are into neural networks you recognize +this type of structure. + +1:01:55.137 --> 1:01:57.519 +Also are the typical neural. + +1:01:58.938 --> 1:02:09.688 +Yes, so this is like Engram, Louis Couperus, +and at least in some way compared to the original, + +1:02:09.688 --> 1:02:12.264 +you're always looking. + +1:02:14.334 --> 1:02:30.781 +However, there are also other types of neural +network structures which we can use for sequence. + +1:02:32.812 --> 1:02:34.678 +That we can do so. + +1:02:34.678 --> 1:02:39.686 +The idea is in recurrent neural network structure. + +1:02:39.686 --> 1:02:43.221 +We are saving the complete history. + +1:02:43.623 --> 1:02:55.118 +So again we have to do like this fix size +representation because neural networks always + +1:02:55.118 --> 1:02:56.947 +need to have. + +1:02:57.157 --> 1:03:05.258 +And then we start with an initial value for +our storage. + +1:03:05.258 --> 1:03:15.917 +We are giving our first input and then calculating +the new representation. + +1:03:16.196 --> 1:03:26.328 +If you look at this, it's just again your +network was two types of inputs: in your work, + +1:03:26.328 --> 1:03:29.743 +in your initial hidden state. + +1:03:30.210 --> 1:03:46.468 +Then you can apply it to the next type of +input and you're again having. + +1:03:47.367 --> 1:03:53.306 +Nice thing is now that you can do now step +by step by step, so all the way over. + +1:03:55.495 --> 1:04:05.245 +The nice thing that we are having here now +is that we are having context information from + +1:04:05.245 --> 1:04:07.195 +all the previous. + +1:04:07.607 --> 1:04:13.582 +So if you're looking like based on which words +do you use here, calculate your ability of + +1:04:13.582 --> 1:04:14.180 +varying. + +1:04:14.554 --> 1:04:20.128 +It depends on is based on this path. + +1:04:20.128 --> 1:04:33.083 +It depends on and this hidden state was influenced +by this one and this hidden state. + +1:04:33.473 --> 1:04:37.798 +So now we're having something new. + +1:04:37.798 --> 1:04:46.449 +We can really model the word probability not +only on a fixed context. + +1:04:46.906 --> 1:04:53.570 +Because the in-states we're having here in +our area are influenced by all the trivia. + +1:04:56.296 --> 1:05:00.909 +So how is that to mean? + +1:05:00.909 --> 1:05:16.288 +If you're not thinking about the history of +clustering, we said the clustering. + +1:05:16.736 --> 1:05:24.261 +So do not need to do any clustering here, +and we also see how things are put together + +1:05:24.261 --> 1:05:26.273 +in order to really do. + +1:05:29.489 --> 1:05:43.433 +In the green box this way since we are starting +from the left point to the right. + +1:05:44.524 --> 1:05:48.398 +And that's right, so they're clustered in +some parts. + +1:05:48.398 --> 1:05:58.196 +Here is some type of clustering happening: +It's continuous representations, but a smaller + +1:05:58.196 --> 1:06:02.636 +difference doesn't matter again. + +1:06:02.636 --> 1:06:10.845 +So if you have a lot of different histories, +the similarity. 
+
+1:04:56.296 --> 1:05:16.288
+So what does that mean? If you think back to the
+clustering of histories, we said we have to
+cluster histories somehow.
+
+1:05:16.736 --> 1:05:26.273
+Here we do not need to do any explicit
+clustering, and we also see how things are put
+together in order to really do the prediction.
+
+1:05:29.489 --> 1:05:43.433
+Everything ends up in the green box this way,
+since we are going from the left to the right.
+
+1:05:44.524 --> 1:05:48.398
+And that's right, so histories are still
+clustered in some way.
+
+1:05:48.398 --> 1:06:10.845
+Some type of clustering is happening here: it's
+a continuous representation, where a small
+difference doesn't matter much. So if you have a
+lot of different histories, similar ones get
+similar representations.
+
+1:06:11.071 --> 1:06:15.791
+Because in order to do the final prediction you
+only use the green box.
+
+1:06:16.156 --> 1:06:30.235
+So you are still learning some type of
+clustering, but you don't have to make a hard
+decision.
+
+1:06:30.570 --> 1:06:39.013
+The only restriction you impose is that
+everything that is important has to be stored
+in this hidden state.
+
+1:06:39.359 --> 1:06:57.138
+So it's a different type of limitation than
+calculating the probability based only on the
+last few words.
+
+1:06:57.437 --> 1:07:09.645
+Of course you still need to group things
+somehow in order to do it efficiently.
+
+1:07:09.970 --> 1:07:28.038
+But here things get merged together in this
+hidden representation, which then depends
+
+1:07:28.288 --> 1:07:33.104
+on all the previous words; the bottleneck for a
+good estimation is now a different one.
+
+1:07:34.474 --> 1:07:41.242
+So the idea is that we can store our whole
+history in one vector.
+
+1:07:41.581 --> 1:07:57.865
+Which is very good and makes the model stronger.
+Later we come to the problems of that: of
+course, at some point it might get difficult.
+
+1:07:58.398 --> 1:08:02.230
+Then maybe things get overwritten, or you cannot
+store everything in there.
+
+1:08:02.662 --> 1:08:04.514
+So.
+
+1:08:04.184 --> 1:08:23.071
+Therefore, for short things like single
+sentences this works well, but if you think of
+tasks like document-level translation, where
+you need to consider a full document, things
+get a bit more complicated, and we will learn
+another type of model for that.
+
+1:08:24.464 --> 1:08:30.455
+Furthermore, in order to understand these
+networks, it's good to always have both views.
+
+1:08:30.710 --> 1:08:48.532
+So this is the unrolled view, where you have
+this type of network. Alternatively, it can be
+shown as: we have here the output, and here is
+the network, which is connected to itself, and
+that is the recurrent view.
+
+1:08:56.176 --> 1:09:11.991
+There is one challenge in these networks, and
+that is the training: how do we train them?
+
+1:09:12.272 --> 1:09:20.147
+At first we don't really know how to train them,
+but if you unroll them like this,
+
+1:09:20.540 --> 1:09:38.054
+it's exactly the same as a feed-forward network,
+so you can measure your errors and then
+back-propagate them.
+
+1:09:38.378 --> 1:09:45.647
+The nice thing is, if you unroll it, it's a
+feed-forward network and you can train it.
+
+1:09:46.106 --> 1:09:57.555
+The only important thing is, of course, that for
+inputs of different lengths the unrolled network
+looks different, and you have to take that into
+account.
+
+1:09:57.837 --> 1:10:08.817
+But since the parameters are shared across
+steps, it is still similar and you can use the
+same training algorithm.
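+
+As a toy illustration of this training by unrolling (scalar weights and made-up
+numbers, so only a sketch, not the full algorithm): the unrolled chain is an
+ordinary feed-forward computation that reuses the same weight at every step,
+the error is propagated back through all steps, and the gradient contributions
+for the shared weight are summed. The repeated multiplication by the weight and
+the tanh derivative in the backward loop is also exactly where the
+vanishing-gradient problem discussed next comes from.
+
+```python
+import numpy as np
+
+# Toy "scalar RNN": forward through the unrolled chain, then backward through it.
+w_x, w_h = 0.5, 0.8          # shared parameters (made-up values)
+xs = [1.0, 0.2, -0.5]        # a short input sequence
+target = 0.3
+
+# forward pass through the unrolled chain
+h = 0.0
+hs = [h]
+for x in xs:
+    h = np.tanh(w_x * x + w_h * h)
+    hs.append(h)
+loss = 0.5 * (h - target) ** 2
+
+# backward pass: propagate the error back through every unrolled step
+grad_w_h = 0.0
+dh = h - target                          # dLoss/dh at the last step
+for t in reversed(range(len(xs))):
+    pre = w_x * xs[t] + w_h * hs[t]      # pre-activation of step t
+    dpre = dh * (1.0 - np.tanh(pre) ** 2)
+    grad_w_h += dpre * hs[t]             # contribution of step t to the SHARED weight
+    dh = dpre * w_h                      # error flowing on to the previous hidden state
+
+print(loss, grad_w_h)
+```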
+
+1:10:10.310 --> 1:10:16.113
+One thing which makes training difficult is what
+is referred to as the vanishing gradient.
+
+1:10:16.113 --> 1:10:22.111
+We said there is a big advantage of these
+models, and that's why we are using them.
+
+1:10:22.111 --> 1:10:29.414
+The output here does not only depend on the
+current input or the last three words, but on
+everything that was said before.
+
+1:10:29.809 --> 1:10:32.803
+That's a very strong property and the motivation
+for using RNNs.
+
+1:10:33.593 --> 1:10:44.599
+However, if you use a standard RNN, the
+influence of words far in the past gets smaller
+and smaller, and the model can hardly use them.
+
+1:10:44.804 --> 1:10:59.659
+Because the gradients get smaller and smaller,
+the error propagated from here back to this
+earlier step contributes less and less.
+
+1:11:00.020 --> 1:11:06.710
+And yeah, that's why standard RNNs are difficult
+to train.
+
+1:11:07.247 --> 1:11:11.481
+So if we are talking about RNNs nowadays,
+
+1:11:11.791 --> 1:11:30.931
+what we typically mean are long short-term
+memories (LSTMs). They are by now quite old
+already, but they have special gating mechanisms.
+
+1:11:31.171 --> 1:11:44.737
+In the language modeling task, for example, it
+can be important to store information like
+whether the sentence started as a question.
+
+1:11:44.684 --> 1:11:52.556
+Because if you only look at the last five words,
+it's often no longer clear whether it is a
+question or a normal statement.
+
+1:11:53.013 --> 1:12:08.571
+So there you have these gating mechanisms, with
+a write gate, in order to store things for a
+longer time in your memory.
+
+1:12:10.730 --> 1:12:20.147
+These are still used in quite a lot of work.
+
+1:12:21.541 --> 1:12:30.487
+Especially for text machine translation, the
+standard now is to use transformer-based models.
+
+1:12:30.690 --> 1:12:42.857
+But this type of architecture still matters; for
+example, we will later have one lecture about
+efficiency.
+
+1:12:42.882 --> 1:12:53.044
+And there, in the decoder of some networks, RNNs
+are still used for efficiency reasons.
+
+1:12:53.473 --> 1:12:57.542
+So it's not that RNNs are of no importance.
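+
+A minimal sketch of one such gated update (a single LSTM step in NumPy, with
+random weights and no bias terms, so purely illustrative rather than the exact
+formulation of any particular toolkit): the forget gate decides how much of the
+old memory to keep, the input gate how much new content to write, and the
+output gate how much of the memory to expose.
+
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+d_in, d_hid = 8, 16
+rng = np.random.default_rng(1)
+# one weight matrix per gate, each acting on [input, previous hidden state]
+W_f, W_i, W_o, W_c = (rng.normal(size=(d_in + d_hid, d_hid)) for _ in range(4))
+
+def lstm_step(x, h_prev, c_prev):
+    z = np.concatenate([x, h_prev])
+    f = sigmoid(z @ W_f)          # forget gate: what to keep of the old memory
+    i = sigmoid(z @ W_i)          # input gate: what to write into the memory
+    o = sigmoid(z @ W_o)          # output gate: what to expose to the next layer
+    c_tilde = np.tanh(z @ W_c)    # candidate new content
+    c = f * c_prev + i * c_tilde  # updated cell state (the long-term memory)
+    h = o * np.tanh(c)            # new hidden state
+    return h, c
+
+h, c = np.zeros(d_hid), np.zeros(d_hid)
+for _ in range(5):                # run a few steps on random "word" inputs
+    h, c = lstm_step(rng.normal(size=d_in), h, c)
+```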
+
+1:12:59.239 --> 1:13:08.956
+In order to make them strong, there are some
+more things which are helpful and should be
+mentioned.
+
+1:13:09.309 --> 1:13:19.668
+One thing is a very easy and nice trick to make
+the neural network stronger and better.
+
+1:13:19.739 --> 1:13:23.451
+Of course, it doesn't always work; you have to
+have enough training data.
+
+1:13:23.763 --> 1:13:30.598
+But in general, the easiest way of making your
+model bigger and stronger is to increase the
+number of parameters.
+
+1:13:30.630 --> 1:13:43.244
+And you've seen that with the large language
+models, where people are always bragging about
+how many parameters they have.
+
+1:13:43.903 --> 1:13:53.657
+So the question is: how do you get more
+parameters?
+
+1:13:53.657 --> 1:14:10.020
+There are two ways: you can make your
+representations wider, and the other thing,
+that's of course deep learning, is to make your
+networks deeper.
+
+1:14:11.471 --> 1:14:13.831
+And thereby you also get more parameters.
+
+1:14:14.614 --> 1:14:23.330
+There is one problem with deeper networks; it's
+very similar to what we saw with RNNs.
+
+1:14:23.603 --> 1:14:35.475
+With the RNNs we have this problem of gradient
+flow: if it flows over many steps, the gradient
+gets very small.
+
+1:14:35.795 --> 1:14:41.114
+Exactly the same thing happens in deep networks.
+
+1:14:41.114 --> 1:14:52.285
+If you take the gradient at the output, telling
+the network whether it was right or wrong, then
+you propagate it back through all the layers.
+
+1:14:52.612 --> 1:14:56.440
+For three layers that's no problem, but if you
+go to ten, twenty or a hundred layers,
+
+1:14:57.797 --> 1:14:59.690
+that typically becomes a problem.
+
+1:15:00.060 --> 1:15:15.885
+What people are doing is using what are called
+residual connections. That's a very helpful idea.
+
+1:15:15.956 --> 1:15:20.309
+And the idea is that these layers
+
+1:15:20.320 --> 1:15:31.386
+in between should not calculate a completely new
+representation, but only what should be changed
+compared to their input.
+
+1:15:31.731 --> 1:15:37.585
+And therefore, in the end, the output of a layer
+is always added to its input.
+
+1:15:38.318 --> 1:15:48.824
+The nice thing is that later, when you do
+backpropagation, the gradient can flow very
+directly back through these connections.
+
+1:15:49.209 --> 1:16:04.229
+So that is what you see nowadays in all very
+deep architectures: you always have these
+residual connections.
+
+1:16:04.704 --> 1:16:18.792
+This has two advantages: on the one hand, it is
+easier to learn the representations; on the
+other hand, the gradients flow better.
+
+1:16:22.082 --> 1:16:24.114
+Good.
+
+1:16:23.843 --> 1:16:31.763
+So much for the recurrent networks; the last
+thing for today are the word embeddings.
+
+1:16:31.671 --> 1:16:46.707
+Here the language model was used in the system
+itself. We will see them again later, but one
+thing that at the beginning was very essential
+are these word embeddings.
+
+1:16:46.967 --> 1:17:04.166
+People really trained language models partly
+only to get this type of embedding, and
+therefore we want to look a bit more into them.
+
+1:17:09.229 --> 1:17:13.456
+Some last words on the word embeddings.
+
+1:17:13.456 --> 1:17:27.170
+The interesting thing is that word embeddings
+can be used for very different tasks; the
+advantage is that we can train the word
+embeddings once.
+
+1:17:27.347 --> 1:17:31.334
+The nice thing is you can train them on just
+large amounts of data.
+
+1:17:31.931 --> 1:17:41.566
+And then, if you have these word embeddings, you
+no longer have an input layer of ten thousand
+dimensions.
+
+1:17:41.982 --> 1:17:52.231
+Then you can train a smaller model to do the
+other task, and therefore you are more efficient.
+
+1:17:52.532 --> 1:18:08.747
+The initial word embeddings really depend only
+on the word itself. If you look at the two
+meanings of "can", the can of beans or "can they
+do that", both share the same embedding.
+
+1:18:09.189 --> 1:18:27.916
+That ambiguity cannot be resolved here.
+Therefore, you need to know the context, and
+that is what the higher layers do; they take the
+context into account.
+
+1:18:29.489 --> 1:18:33.757
+However, even this first level has some very
+interesting properties.
+
+1:18:34.034 --> 1:18:47.182
+People like to visualize them, which is always a
+bit difficult, because if you look at such a
+word vector,
+
+1:18:47.767 --> 1:18:52.879
+drawing a five-hundred-dimensional vector is
+still a bit challenging.
+
+1:18:53.113 --> 1:19:12.464
+So you cannot do that directly; what people have
+to do is apply some type of dimensionality
+reduction.
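+
+One common choice for such a dimensionality reduction is PCA; here is a minimal
+sketch of it via SVD in NumPy. The embedding matrix and word list are random
+placeholders only to keep the example self-contained; in practice the vectors
+would come from a trained language model.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+words = ["man", "woman", "king", "queen", "walk", "walks"]   # hypothetical vocabulary
+embeddings = rng.normal(size=(len(words), 512))              # 512-dim word vectors
+
+centered = embeddings - embeddings.mean(axis=0)
+# the top-2 right singular vectors of the centered data are the directions of largest variance
+_, _, vt = np.linalg.svd(centered, full_matrices=False)
+coords_2d = centered @ vt[:2].T                              # shape: (num_words, 2)
+
+for w, (x, y) in zip(words, coords_2d):
+    print(f"{w:>6}: ({x:+.2f}, {y:+.2f})")                   # points you could now plot
+```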
+
+1:19:13.073 --> 1:19:17.216
+And of course some information gets lost then,
+but you can still try it.
+
+1:19:18.238 --> 1:19:37.892
+And you see, for example, this is the most
+famous and common example: what you can look at
+is the difference between the male and the
+female form of a word in English.
+
+1:19:38.058 --> 1:19:40.389
+And you can do that for very different words.
+
+1:19:40.780 --> 1:19:45.403
+And that is where the math comes in, and what
+people then look into.
+
+1:19:45.725 --> 1:19:51.410
+So what you can now do, for example, is
+calculate the difference between man and woman.
+
+1:19:52.232 --> 1:20:10.495
+What you can do then is take the embedding of
+king and add to it the difference between man
+and woman; and, this is where people get really
+excited, then you can look at what the most
+similar words are. Of course you won't directly
+hit the correct word; it's a continuous space.
+
+1:20:10.790 --> 1:20:24.062
+But you can look at what the nearest neighbors
+of this point are, and often the word you would
+expect is among them.
+
+1:20:24.224 --> 1:20:33.911
+So it is somehow remarkable that the difference
+between these word pairs is always roughly the
+same.
+
+1:20:34.374 --> 1:20:49.046
+You can do that for different relations. You can
+also look at word forms, for example swimming
+and swim, or walking and walk.
+
+1:20:49.469 --> 1:21:04.016
+So you can try to use this. It's not perfect, of
+course, but the interesting thing is that nobody
+taught the model these principles.
+
+1:21:04.284 --> 1:21:09.910
+It is purely trained on the task of next-word
+prediction.
+
+1:21:10.230 --> 1:21:23.669
+And it even works for some world knowledge, like
+the capitals: this is the difference between a
+country and its capital.
+
+1:21:23.823 --> 1:21:33.760
+Here is another visualization where the same
+thing has been done for the difference between
+countries and their capitals.
+
+1:21:33.853 --> 1:21:53.372
+And you see it's not perfect, but the
+differences point mainly in the same direction,
+so you can even use that for question answering:
+if you know some countries and their capitals,
+you can compute the difference between them,
+apply it to a new country, and get its capital.
+
+1:21:54.834 --> 1:22:04.385
+So these models are able to learn a lot of
+information and compress it into this
+representation.
+
+1:22:05.325 --> 1:22:07.679
+And that just from doing next-word prediction.
+
+1:22:07.707 --> 1:22:26.095
+And that also explains a bit, or at least
+motivates, what the main advantage of this type
+of neural model is.
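+
+The analogy arithmetic described above can be sketched in a few lines; the
+vectors below are random placeholders, so the ranking is meaningless here,
+whereas with trained embeddings the top neighbour of the query would typically
+be the expected word (e.g. queen).
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+vocab = ["king", "queen", "man", "woman", "berlin", "germany", "paris", "france"]
+emb = {w: rng.normal(size=300) for w in vocab}   # placeholder 300-dim embeddings
+
+def cosine(a, b):
+    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
+
+# "king - man + woman": move from the male word by the man->woman difference
+query = emb["king"] - emb["man"] + emb["woman"]
+
+# rank all other words by cosine similarity to the query vector
+candidates = [w for w in vocab if w not in {"king", "man", "woman"}]
+for w in sorted(candidates, key=lambda w: cosine(query, emb[w]), reverse=True):
+    print(w, round(cosine(query, emb[w]), 3))
+```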
+
+1:22:28.568 --> 1:22:49.148
+So to summarize what we did today, what you
+should hopefully take with you is a basic idea
+of neural networks, and then how we can do
+language modeling with neural networks.
+
+1:22:49.449 --> 1:22:59.059
+We looked at three different architectures: the
+feed-forward language model, the RNN, and the
+one based on the restricted Boltzmann machine.
+
+1:22:59.039 --> 1:23:04.559
+And finally, that there are different
+architectures for building neural networks.
+
+1:23:04.559 --> 1:23:14.389
+We have seen feed-forward neural networks and
+recurrent neural networks, and we'll see in the
+next lectures the last type of architecture.
+
+1:23:15.915 --> 1:23:17.438
+Any questions?
+
+1:23:20.680 --> 1:23:27.360
+Then thanks a lot, and next time we will
+continue from there.