WEBVTT | |
0:00:01.301 --> 0:00:05.707 | |
Okay So Welcome to Today's Lecture. | |
0:00:06.066 --> 0:00:12.592 | |
I'm sorry for the inconvenience. | |
0:00:12.592 --> 0:00:19.910 | |
Sometimes they are project meetings. | |
0:00:19.910 --> 0:00:25.843 | |
There will be one other time. | |
0:00:26.806 --> 0:00:40.863 | |
So what we want to talk today about is want | |
to start with neural approaches to machine | |
0:00:40.863 --> 0:00:42.964 | |
translation. | |
0:00:43.123 --> 0:00:51.285 | |
I guess you have heard about other types of | |
neural models for other types of neural language | |
0:00:51.285 --> 0:00:52.339 | |
processing. | |
0:00:52.339 --> 0:00:59.887 | |
This was some of the first steps in introducing | |
neural networks to machine translation. | |
0:01:00.600 --> 0:01:06.203 | |
They are similar to what you now see | |
in large language models. | |
0:01:06.666 --> 0:01:11.764 | |
And today we will look into what these neural language | |
models are. | |
0:01:11.764 --> 0:01:13.874 | |
What is the difference? | |
0:01:13.874 --> 0:01:15.983 | |
What is the motivation? | |
0:01:16.316 --> 0:01:21.445 | |
And first we will use them in statistical | |
machine translation. | |
0:01:21.445 --> 0:01:28.935 | |
So if you remember, two or three | |
weeks ago we had this log-linear model where you | |
0:01:28.935 --> 0:01:31.052 | |
can easily integrate any feature. | |
0:01:31.351 --> 0:01:40.967 | |
We just have another model which evaluates | |
how good a system is or how good a fluent language | |
0:01:40.967 --> 0:01:41.376 | |
is. | |
0:01:41.376 --> 0:01:53.749 | |
The main advantage compared to the statistical | |
models we saw on Tuesday is: Next week we will | |
0:01:53.749 --> 0:02:06.496 | |
then go for a neural machine translation where | |
we replace the whole model. | |
0:02:11.211 --> 0:02:21.078 | |
Just as a reminder from Tuesday, we've seen | |
that the main challenge in language modeling was that | |
0:02:21.078 --> 0:02:25.134 | |
most of the n-grams we haven't seen. | |
0:02:26.946 --> 0:02:33.967 | |
So this was therefore difficult to estimate | |
any probability because you've seen that normally | |
0:02:33.967 --> 0:02:39.494 | |
if you have not seen the n-gram you will assign | |
the probability of zero. | |
0:02:39.980 --> 0:02:49.420 | |
However, this is not really very good because | |
we don't want to give zero probabilities to | |
0:02:49.420 --> 0:02:54.979 | |
sentences, which still might be a very good | |
English. | |
0:02:55.415 --> 0:03:02.167 | |
And then we learned a lot of techniques and | |
that is the main challenge in statistical machine | |
0:03:02.167 --> 0:03:04.490 | |
translation and statistical language modeling: | |
0:03:04.490 --> 0:03:10.661 | |
namely how we can give a good estimate of | |
probability to events that we haven't seen | |
0:03:10.661 --> 0:03:12.258 | |
using smoothing techniques. | |
0:03:12.258 --> 0:03:15.307 | |
We've seen interpolation and backoff. | |
0:03:15.435 --> 0:03:21.637 | |
And people invented or developed very specific techniques | |
0:03:21.637 --> 0:03:26.903 | |
to deal with that; however, it might not be enough. | |
0:03:28.568 --> 0:03:43.190 | |
And therefore maybe we can do things differently, | |
because if we have not seen an n-gram before in statistical | |
0:03:43.190 --> 0:03:44.348 | |
models. | |
0:03:45.225 --> 0:03:51.361 | |
Before and we can only get information from | |
exactly the same words. | |
0:03:51.411 --> 0:04:06.782 | |
We don't have something like approximate matching, | |
where we could use a sentence that occurs similarly. | |
0:04:06.782 --> 0:04:10.282 | |
So if you have seen a. | |
0:04:11.191 --> 0:04:17.748 | |
And so you would like to have something more | |
like that, where n-grams are represented more | |
0:04:17.748 --> 0:04:21.953 | |
in a general space, and we can generalize to similar | |
n-grams. | |
0:04:22.262 --> 0:04:29.874 | |
So if you learn something about walk then | |
maybe we can use this knowledge and also apply. | |
0:04:30.290 --> 0:04:42.596 | |
The same as we have done before, but we can | |
really better model how similar they are and | |
0:04:42.596 --> 0:04:45.223 | |
transfer to other. | |
0:04:47.047 --> 0:04:54.236 | |
And we maybe want to do that in a more hierarchical | |
approach that we know okay. | |
0:04:54.236 --> 0:05:02.773 | |
Some words are similar but like go and walk | |
is somehow similar and I and P and G and therefore | |
0:05:02.773 --> 0:05:06.996 | |
maybe we can then merge them in an n-gram. | |
0:05:07.387 --> 0:05:15.861 | |
If we learn something about our walk, then | |
it should tell us also something about Hugo. | |
0:05:15.861 --> 0:05:17.113 | |
He walks or. | |
0:05:17.197 --> 0:05:27.327 | |
You see that there is some relations which | |
we need to integrate here. | |
0:05:27.327 --> 0:05:35.514 | |
We need to add the s, but maybe walks should | |
also be here. | |
0:05:37.137 --> 0:05:45.149 | |
And luckily there is one really convincing | |
method of doing that, and that is by using | |
0:05:45.149 --> 0:05:47.231 | |
neural networks. | |
0:05:47.387 --> 0:05:58.497 | |
That's what we will introduce today so we | |
can use this type of neural networks to try | |
0:05:58.497 --> 0:06:04.053 | |
to learn this similarity and to learn how. | |
0:06:04.324 --> 0:06:14.355 | |
And that is one of the main advantages that | |
we have by switching from the standard statistical | |
0:06:14.355 --> 0:06:15.200 | |
models. | |
0:06:15.115 --> 0:06:22.830 | |
To learn similarities between words and generalized, | |
and learn what is called hidden representations | |
0:06:22.830 --> 0:06:29.705 | |
or representations of words, where we can measure | |
similarity in some dimensions of words. | |
0:06:30.290 --> 0:06:42.384 | |
So we can measure in which way words are similar. | |
0:06:42.822 --> 0:06:48.902 | |
We had it before and we've seen that words | |
were just easier. | |
0:06:48.902 --> 0:06:51.991 | |
The only thing we did is like. | |
0:06:52.192 --> 0:07:02.272 | |
But these indices don't have any meaning, | |
so it isn't that one word is more similar to another word. | |
0:07:02.582 --> 0:07:12.112 | |
So we couldn't learn anything about words | |
in the statistical model and that's a big challenge. | |
0:07:12.192 --> 0:07:23.063 | |
About words, even like in morphology: go and | |
goes are somehow more similar because of the third person | |
0:07:23.063 --> 0:07:24.219 | |
singular. | |
0:07:24.264 --> 0:07:34.924 | |
The basic models we had until now have no idea | |
about that, and goes is as similar to go as it | |
0:07:34.924 --> 0:07:37.175 | |
might be to sleep. | |
0:07:39.919 --> 0:07:44.073 | |
So what we want to do today. | |
0:07:44.073 --> 0:07:53.096 | |
In order to get there we will have a short | |
introduction into neural networks. | |
0:07:53.954 --> 0:08:05.984 | |
It very short just to see how we use them | |
here, but that's a good thing, so most of you | |
0:08:05.984 --> 0:08:08.445 | |
think it will be. | |
0:08:08.928 --> 0:08:14.078 | |
And then we will first look into feed-forward | |
neural network language models. | |
0:08:14.454 --> 0:08:23.706 | |
And there we will still have the approximation | |
0:08:23.706 --> 0:08:33.902 | |
we had before, where we are looking only at a fixed | |
window. | |
0:08:34.154 --> 0:08:35.030 | |
The case. | |
0:08:35.030 --> 0:08:38.270 | |
However, we have the embeddings here. | |
0:08:38.270 --> 0:08:43.350 | |
That's why they're already better in order | |
to generalize. | |
0:08:44.024 --> 0:08:53.169 | |
And then at the end we'll look at language | |
models where we then have the additional advantage. | |
0:08:53.093 --> 0:09:04.317 | |
that we don't need to have a fixed history, | |
but in theory we can model arbitrarily long dependencies. | |
0:09:04.304 --> 0:09:12.687 | |
And we talked about on Tuesday where it is | |
not clear what type of information it is to. | |
0:09:16.396 --> 0:09:24.981 | |
So in general, neural networks normally | |
learn to perform some task. | |
0:09:25.325 --> 0:09:33.472 | |
We have the structure and we are learning | |
them from samples so that is similar to what | |
0:09:33.472 --> 0:09:34.971 | |
we have before. | |
0:09:34.971 --> 0:09:42.275 | |
So now we have the same task here, a language | |
model giving input or forwards. | |
0:09:42.642 --> 0:09:48.959 | |
And they are somewhat originally motivated by the human | |
brain. | |
0:09:48.959 --> 0:10:00.639 | |
However, for what you now need to know about artificial | |
neural networks, this similarity is not that important. | |
0:10:00.540 --> 0:10:02.889 | |
There seemed to be not that point. | |
0:10:03.123 --> 0:10:11.014 | |
So what they are mainly doing is summation and | |
multiplication and then one non-linear activation. | |
0:10:12.692 --> 0:10:16.085 | |
So the basic units are these types of perceptrons. | |
0:10:17.937 --> 0:10:29.891 | |
Perceptrons are the basic blocks which we have, and | |
they do the processing, so we have a fixed number | |
0:10:29.891 --> 0:10:36.070 | |
of input features and that will be important. | |
0:10:36.096 --> 0:10:39.689 | |
So we have here numbers x1 to xn as input. | |
0:10:40.060 --> 0:10:53.221 | |
And this, of course, partly makes language processing | |
difficult, because sentences have varying length. | |
0:10:54.114 --> 0:10:57.609 | |
So we have to model this time on and then | |
go stand home and model. | |
0:10:58.198 --> 0:11:02.099 | |
Then we are having weights, which are the | |
parameters, and the number of weights is exactly | |
0:11:02.099 --> 0:11:03.668 | |
the same as the number | |
0:11:04.164 --> 0:11:06.322 | |
of input features. | |
0:11:06.322 --> 0:11:15.068 | |
Sometimes you also have a bias in there, and then | |
it's not really an input. | |
0:11:15.195 --> 0:11:19.205 | |
And what you then do is multiply. | |
0:11:19.205 --> 0:11:26.164 | |
each input with its weight, and then you sum | |
it up. | |
0:11:26.606 --> 0:11:34.357 | |
What is then additionally later important | |
is that we have an activation function and | |
0:11:34.357 --> 0:11:42.473 | |
it's important that this activation function | |
is non-linear, because otherwise we would just have a linear model. | |
0:11:43.243 --> 0:11:54.088 | |
And later it will be important that this is | |
differentiable, because otherwise the training won't work. | |
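As a minimal illustration of the unit just described (a weighted sum of a fixed number of inputs plus a bias, followed by a non-linear, differentiable activation), here is a small numpy sketch; the sigmoid choice and the concrete numbers are only assumptions for the example, not the lecture's exact setup.

```python
import numpy as np

def sigmoid(z):
    # non-linear, differentiable activation
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_unit(x, w, b):
    # weighted sum of the inputs plus bias, then the activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 2.0])    # fixed number of input features x1..xn
w = np.array([0.5, -1.0, 0.25])  # one weight per input feature
print(perceptron_unit(x, w, b=0.1))
```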
0:11:54.714 --> 0:12:01.907 | |
This model by itself is not very powerful. | |
0:12:01.907 --> 0:12:10.437 | |
It was shown early on that a single perceptron alone is not very powerful. | |
0:12:10.710 --> 0:12:19.463 | |
However, there is a very easy extension, the | |
multi-layer perceptron, and then things get | |
0:12:19.463 --> 0:12:20.939 | |
very powerful. | |
0:12:21.081 --> 0:12:27.719 | |
The thing is you just connect a lot of these | |
in this layer of structures and we have our | |
0:12:27.719 --> 0:12:35.029 | |
input layer where we have the inputs, and our | |
hidden layer, at least one, which is fully connected. | |
0:12:35.395 --> 0:12:39.817 | |
And then we can combine them all to do that. | |
0:12:40.260 --> 0:12:48.320 | |
The input layer is of course somewhat given | |
by the dimension of your problem. | |
0:12:48.320 --> 0:13:00.013 | |
The output layer is also given by your dimension, | |
but the hidden layer is of course a hyperparameter. | |
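A sketch of the multi-layer perceptron just described: input and output dimensions are given by the problem, the hidden size is the hyperparameter, and every layer is again a weighted sum followed by a non-linearity. All sizes here are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3      # hidden size is the hyperparameter

W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), np.zeros(d_out)

def mlp(x):
    h = np.tanh(W1 @ x + b1)         # hidden layer: weighted sum + non-linearity
    return W2 @ h + b2               # output layer (softmax etc. comes later)

print(mlp(rng.normal(size=d_in)))
```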
0:13:01.621 --> 0:13:08.802 | |
So let's start with the first question, now | |
more language related, and that is how we represent words. | |
0:13:09.149 --> 0:13:23.460 | |
So we've seen here what we have, but the question | |
is now how can we put a word into this network? | |
0:13:26.866 --> 0:13:34.117 | |
Noise: The first thing we're able to be better | |
is by the fact that like you are said,. | |
0:13:34.314 --> 0:13:43.028 | |
That is not that easy because the continuous | |
vector will come to that. | |
0:13:43.028 --> 0:13:50.392 | |
So from the neural network side we can directly put | |
in the embedding. | |
0:13:50.630 --> 0:13:57.277 | |
But if we need to input a word into the neural | |
network, it has to be something which is easily | |
0:13:57.277 --> 0:13:57.907 | |
defined. | |
0:13:59.079 --> 0:14:12.492 | |
The solution is the one-hot encoding, a one-out-of-n | |
encoding, so one value is one, and all | |
0:14:12.492 --> 0:14:15.324 | |
the others are zero. | |
0:14:16.316 --> 0:14:25.936 | |
That means we are always dealing with fixed | |
vocabularies, because as said we cannot easily extend them. | |
0:14:26.246 --> 0:14:38.017 | |
So you cannot easily extend your vocabulary | |
because if you mean you would extend your vocabulary. | |
0:14:39.980 --> 0:14:41.502 | |
That's also motivating. | |
0:14:41.502 --> 0:14:43.722 | |
We talked about byte-pair encoding. | |
0:14:43.722 --> 0:14:45.434 | |
That's a nice thing there. | |
0:14:45.434 --> 0:14:47.210 | |
We have a fixed vocabulary. | |
0:14:48.048 --> 0:14:55.804 | |
The big advantage of this one encoding is | |
that we don't implicitly assume any | |
0:14:55.804 --> 0:15:04.291 | |
similarity between words, but really learn it, | |
because if you first think about this, this | |
0:15:04.291 --> 0:15:06.938 | |
is a very, very inefficient representation. | |
0:15:07.227 --> 0:15:15.889 | |
So to represent n words, you | |
need an n-dimensional vector. | |
0:15:16.236 --> 0:15:24.846 | |
Imagine you could do binary encoding so you | |
could represent words as binary vectors. | |
0:15:24.846 --> 0:15:26.467 | |
Then you would. | |
0:15:26.806 --> 0:15:31.177 | |
Will be significantly more efficient. | |
0:15:31.177 --> 0:15:36.813 | |
However, then you have some implicit similarity. | |
0:15:36.813 --> 0:15:39.113 | |
Some numbers would share bits. | |
0:15:39.559 --> 0:15:46.958 | |
That would somehow be bad because you would force | |
some implicit similarity, and it is not clear how to | |
0:15:46.958 --> 0:15:47.631 | |
define it. | |
0:15:48.108 --> 0:15:55.135 | |
So therefore currently this is the most successful | |
approach, to just use these one-hot | |
0:15:55.095 --> 0:15:59.563 | |
representations: so we take a fixed vocabulary. | |
0:15:59.563 --> 0:16:06.171 | |
We map each word to an index, and then we | |
represent a word like this. | |
0:16:06.171 --> 0:16:13.246 | |
So if home is word one, the representation | |
will be one zero zero zero, and so on. | |
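A minimal sketch of this one-hot (one-out-of-n) word encoding with a fixed vocabulary; the toy vocabulary is of course just an assumption for the example.

```python
import numpy as np

vocab = {"home": 0, "go": 1, "walk": 2, "he": 3}   # fixed vocabulary: word -> index

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0          # one entry is 1, all others stay 0
    return v

print(one_hot("home", vocab))     # [1. 0. 0. 0.]
```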
0:16:14.514 --> 0:16:30.639 | |
But this dimension here is a vocabulary size | |
and that is quite high, so we are always trying | |
0:16:30.639 --> 0:16:33.586 | |
to be efficient. | |
0:16:33.853 --> 0:16:43.792 | |
We are doing then some type of efficiency | |
because typically we are having this next layer. | |
0:16:44.104 --> 0:16:51.967 | |
It can be still maybe two hundred or five | |
hundred or one thousand neurons, but this is | |
0:16:51.967 --> 0:16:53.323 | |
significantly. | |
0:16:53.713 --> 0:17:03.792 | |
You can learn that directly and there we then | |
have similarity between words. | |
0:17:03.792 --> 0:17:07.458 | |
Then it is that some words. | |
0:17:07.807 --> 0:17:14.772 | |
But the nice thing is that this is then learned | |
that we are not need to hand define that. | |
0:17:17.117 --> 0:17:32.742 | |
We'll come later to the explicit architecture | |
of the neural language one, and there we can | |
0:17:32.742 --> 0:17:35.146 | |
see how it's. | |
0:17:38.418 --> 0:17:44.857 | |
So we're seeing that the one-hot | |
representation always has the same similarity between words. | |
0:17:45.105 --> 0:17:59.142 | |
Then we're having this continuous vector, which | |
has a much smaller dimension, and that's important | |
0:17:59.142 --> 0:18:00.768 | |
for later. | |
0:18:01.121 --> 0:18:06.989 | |
What we are doing then is learning these representations | |
so that they are best for language. | |
0:18:07.487 --> 0:18:14.968 | |
So the representations are implicitly trained | |
so that they are best for the language modeling task. | |
0:18:14.968 --> 0:18:19.058 | |
This is the best way for doing language. | |
0:18:19.479 --> 0:18:32.564 | |
And the nice thing that was found out later | |
is these representations are really good. | |
0:18:33.153 --> 0:18:39.253 | |
And that is why they are now even called word | |
embeddings by themselves and used for other | |
0:18:39.253 --> 0:18:39.727 | |
tasks. | |
0:18:40.360 --> 0:18:49.821 | |
And they are somewhat describing very different | |
things, so they can describe semantic similarities. | |
0:18:49.789 --> 0:18:58.650 | |
Are looking at the very example of today mass | |
vector space by adding words and doing some | |
0:18:58.650 --> 0:19:00.618 | |
interesting things. | |
0:19:00.940 --> 0:19:11.178 | |
So they got really like the first big improvement | |
when switching to neural models. | |
0:19:11.491 --> 0:19:20.456 | |
Are like part of the model, but with more | |
complex representation, but they are the basic | |
0:19:20.456 --> 0:19:21.261 | |
models. | |
0:19:23.683 --> 0:19:36.979 | |
In the output layer we are also having one | |
output structure and an activation function. | |
0:19:36.997 --> 0:19:46.525 | |
That is because for language modeling we want to | |
predict what the most probable next word is. | |
0:19:47.247 --> 0:19:56.453 | |
And that can be done very well with this so | |
called softmax layer, where again the dimension is the | |
0:19:56.376 --> 0:20:02.825 | |
vocabulary size, so this is the vocabulary size, | |
and again the k-th neuron represents the k-th | |
0:20:02.825 --> 0:20:03.310 | |
class. | |
0:20:03.310 --> 0:20:09.759 | |
So in our case we have again a one-hot representation | |
saying this is the correct word. | |
0:20:10.090 --> 0:20:17.255 | |
Our probability distribution is a probability | |
distribution over all words, so the k-th entry | |
0:20:17.255 --> 0:20:21.338 | |
tells us how probable it is that the next word | |
is this. | |
0:20:22.682 --> 0:20:33.885 | |
So we need to have some probability distribution | |
at our output in order to achieve that this | |
0:20:33.885 --> 0:20:37.017 | |
activation function goes. | |
0:20:37.197 --> 0:20:46.944 | |
And we can achieve that with a softmax activation: | |
we take e to the power of the value, | |
0:20:46.944 --> 0:20:47.970 | |
and then normalize by the sum. | |
0:20:48.288 --> 0:20:58.021 | |
So by having this type of activation function | |
we are really getting this type of probability. | |
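The softmax activation described here exponentiates each value and divides by the sum over the whole output layer, so the result is a probability distribution. A small, numerically stabilised sketch (the stabilisation trick is an implementation assumption, not from the lecture):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # for numerical stability, does not change the result
    e = np.exp(z)                 # e to the power of each value
    return e / e.sum()            # normalise by the sum over the vocabulary

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                 # probabilities, summing to 1
```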
0:20:59.019 --> 0:21:15.200 | |
At the beginning this was also very challenging | |
because again we have this inefficient representation. | |
0:21:15.235 --> 0:21:29.799 | |
You can imagine that summing over the whole vocabulary is maybe | |
a bit inefficient even with GPUs, but definitely doable. | |
0:21:36.316 --> 0:21:44.072 | |
And then for training the models that will | |
be fine, so we have to use architecture now. | |
0:21:44.264 --> 0:21:48.491 | |
We need to minimize the error. | |
0:21:48.491 --> 0:21:53.264 | |
How are we doing it? Taking the output, | |
0:21:53.264 --> 0:21:58.174 | |
We are comparing it to our targets. | |
0:21:58.298 --> 0:22:03.830 | |
So one important thing is by training them. | |
0:22:03.830 --> 0:22:07.603 | |
How can we measure the error? | |
0:22:07.603 --> 0:22:12.758 | |
So what is if we are training the ideas? | |
0:22:13.033 --> 0:22:15.163 | |
And how do we measure it? | |
0:22:15.163 --> 0:22:19.768 | |
It is in natural language processing, typically | |
the cross entropy. | |
0:22:19.960 --> 0:22:35.575 | |
And that means we are comparing the target | |
with the output. | |
0:22:35.335 --> 0:22:44.430 | |
It gets optimized and you're seeing that this, | |
of course, makes it again very nice and easy | |
0:22:44.430 --> 0:22:49.868 | |
because our target is again a one-hot representation. | |
0:22:50.110 --> 0:23:00.116 | |
So all of these are always zero, and what | |
we are then doing is we are taking the one. | |
0:23:00.100 --> 0:23:04.615 | |
And we only need to multiply the one with | |
the logarithm here, and that is all the feedback | |
0:23:04.615 --> 0:23:05.955 | |
signal we are taking here. | |
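A sketch of the cross-entropy with a one-hot target, as described: since all target entries except one are zero, the loss reduces to the negative log probability of the correct word.

```python
import numpy as np

def cross_entropy(p, target_index):
    # full form would be -sum_k t_k * log(p_k); with a one-hot target
    # only the entry of the correct word survives
    return -np.log(p[target_index])

p = np.array([0.7, 0.2, 0.1])     # model output (a probability distribution)
print(cross_entropy(p, target_index=0))
```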
0:23:06.946 --> 0:23:13.885 | |
Of course, this is not always influenced by | |
all the others. | |
0:23:13.885 --> 0:23:17.933 | |
Why is this influenced by all the others? | |
0:23:24.304 --> 0:23:34.382 | |
We have the softmax activation, which is the | |
current value divided by the sum over all the others. | |
0:23:34.354 --> 0:23:45.924 | |
Otherwise it could easily just increase this | |
value and ignore the others, but if you increase | |
0:23:45.924 --> 0:23:49.090 | |
one value, all the others get smaller. | |
0:23:51.351 --> 0:23:59.912 | |
Then we can do with neural networks one very nice | |
and easy type of training that is done in all | |
0:23:59.912 --> 0:24:07.721 | |
the neural networks, where we are now calculating | |
our error and especially the gradient. | |
0:24:07.707 --> 0:24:11.640 | |
So in which direction does the error show? | |
0:24:11.640 --> 0:24:18.682 | |
And then if we want to get a smaller error, | |
that's what we want to achieve. | |
0:24:18.682 --> 0:24:26.638 | |
We are taking the inverse direction of the | |
gradient and thereby trying to minimize our | |
0:24:26.638 --> 0:24:27.278 | |
error. | |
0:24:27.287 --> 0:24:31.041 | |
And we have to do that, of course, for all | |
the weights. | |
0:24:31.041 --> 0:24:36.672 | |
And to calculate the error of all the weights, | |
we won't do the full derivation of backpropagation here. | |
0:24:36.672 --> 0:24:41.432 | |
But what you can do is you can propagate | |
the error which you measured. | |
0:24:41.432 --> 0:24:46.393 | |
At the end you can propagate it back; it's basic | |
math and basic derivatives. | |
0:24:46.706 --> 0:24:58.854 | |
For each weight in your model you measure how much | |
it contributed to the error and then change | |
0:24:58.854 --> 0:25:01.339 | |
it in a way that. | |
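The update described here takes the inverse direction of the gradient for every weight; a minimal sketch, where the learning rate value is just a placeholder assumption.

```python
def sgd_step(weights, gradients, learning_rate=0.1):
    # move every weight a small step against its gradient to reduce the error
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

print(sgd_step([0.5, -1.0], [0.2, -0.4]))   # [0.48, -0.96]
```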
0:25:04.524 --> 0:25:11.625 | |
So to summarize what you should, at least for machine | |
translation and neural machine translation, | |
0:25:11.625 --> 0:25:19.044 | |
remember in order to understand the rest: | |
this is how a multi-layer perceptron | |
0:25:19.044 --> 0:25:20.640 | |
looks like. | |
0:25:20.580 --> 0:25:28.251 | |
There are fully connected layers and no connections | |
0:25:28.108 --> 0:25:29.759 | |
across layers. | |
0:25:29.829 --> 0:25:35.153 | |
And what they're doing is always just a weighted | |
sum here and then an activation function. | |
0:25:35.415 --> 0:25:38.792 | |
And in order to train you have this forward | |
and backward pass. | |
0:25:39.039 --> 0:25:41.384 | |
So We Put in Here. | |
0:25:41.281 --> 0:25:41.895 | |
Inputs. | |
0:25:41.895 --> 0:25:45.347 | |
We have some random values at the beginning. | |
0:25:45.347 --> 0:25:47.418 | |
Then calculate the output. | |
0:25:47.418 --> 0:25:54.246 | |
We are measuring how big our error is, propagating | |
the error back and then changing our model | |
0:25:54.246 --> 0:25:57.928 | |
in a way that we hopefully get a smaller error. | |
0:25:57.928 --> 0:25:59.616 | |
And then that is how. | |
0:26:01.962 --> 0:26:12.893 | |
So before we're coming into our neural networks | |
language models, how can we use this type of | |
0:26:12.893 --> 0:26:17.595 | |
neural network to do language modeling? | |
0:26:23.103 --> 0:26:33.157 | |
So how can we use them in natural language | |
processing, especially machine translation? | |
0:26:33.157 --> 0:26:41.799 | |
The first idea of using them was to estimate: | |
So we have seen that the output can be monitored | |
0:26:41.799 --> 0:26:42.599 | |
here as well. | |
0:26:43.603 --> 0:26:50.311 | |
A probability distribution and if we have | |
a full vocabulary we could here estimate | |
0:26:50.311 --> 0:26:56.727 | |
how probable each next word is and then use | |
that in our language model fashion as we've | |
0:26:56.727 --> 0:26:58.112 | |
done it last time. | |
0:26:58.112 --> 0:27:03.215 | |
We got the probability of a full sentence | |
as a product of individual. | |
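In log space the product of the individual word probabilities becomes a sum; a small sketch, where `lm_prob` is only a placeholder for whatever model (n-gram or neural) gives p(word | history).

```python
import math

def sentence_logprob(words, lm_prob):
    # p(w1..wn) = product over i of p(w_i | w_1..w_{i-1}); summed in log space
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(lm_prob(w, words[:i]))
    return total

# usage sketch with a dummy uniform model over a 10k vocabulary
print(sentence_logprob(["i", "go", "home"], lambda w, hist: 1.0 / 10_000))
```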
0:27:04.544 --> 0:27:12.820 | |
And: That was done in the ninety seven years | |
and it's very easy to integrate it into this | |
0:27:12.820 --> 0:27:14.545 | |
log-linear model. | |
0:27:14.545 --> 0:27:19.570 | |
So we have said that this is how the log-linear | |
model looks like. | |
0:27:19.570 --> 0:27:25.119 | |
So we are searching the best translation which | |
minimizes each waste time. | |
0:27:25.125 --> 0:27:26.362 | |
The Future About You. | |
0:27:26.646 --> 0:27:31.647 | |
We have that with minimum error rate training | |
if you can remember where we search for the | |
0:27:31.647 --> 0:27:32.147 | |
optimal weights. | |
0:27:32.512 --> 0:27:40.422 | |
The language model and many others, and we | |
can just add here a neural model as one more | |
0:27:40.422 --> 0:27:41.591 | |
of the features. | |
0:27:41.861 --> 0:27:45.761 | |
So that is quite easy as said. | |
0:27:45.761 --> 0:27:53.183 | |
That was how statistical machine translation | |
was improved. | |
0:27:53.183 --> 0:27:57.082 | |
You just add one more feature. | |
0:27:58.798 --> 0:28:07.631 | |
So how can we model the language modeling | |
with a network? | |
0:28:07.631 --> 0:28:16.008 | |
So what we have to do is model the probability | |
of the next word given the previous words. | |
0:28:16.656 --> 0:28:25.047 | |
The general problem is that | |
mostly we haven't seen long sequences. | |
0:28:25.085 --> 0:28:35.650 | |
Mostly we have to back off to very short sequences, | |
and we are working on this discrete space where | |
0:28:35.650 --> 0:28:36.944 | |
there is no notion of similarity. | |
0:28:37.337 --> 0:28:50.163 | |
So the idea is: if we have now a neural network, | |
we can map words into a continuous representation. | |
0:28:51.091 --> 0:29:00.480 | |
And the structure then looks like this, so | |
this is a basic still feed forward neural network. | |
0:29:01.361 --> 0:29:10.645 | |
We are doing this approximation again, so | |
we are not putting in all previous words, but | |
0:29:10.645 --> 0:29:11.375 | |
only a fixed number of them. | |
0:29:11.691 --> 0:29:25.856 | |
This is done because we said that in the neural | |
network we can have only a fixed size of input. | |
0:29:25.945 --> 0:29:31.886 | |
You can only use a fixed size, and here we | |
are doing that with exactly n minus one words. | |
0:29:33.593 --> 0:29:39.536 | |
So here you have, for example, three words, | |
each as a one-hot vector where one entry is | |
0:29:39.536 --> 0:29:50.704 | |
one and all the others are zero. And then we're | |
having the first layer of the neural network, | |
0:29:50.704 --> 0:29:56.230 | |
which, as you learned, is the word embedding. | |
0:29:57.437 --> 0:30:04.976 | |
There is one thing which is maybe special | |
compared to the standard neural network. | |
0:30:05.345 --> 0:30:11.918 | |
So the representation of this word we want | |
to learn, first of all, position-independently. | |
0:30:11.918 --> 0:30:19.013 | |
So we just want to learn what is the general | |
meaning of the word independent of its neighbors. | |
0:30:19.299 --> 0:30:26.239 | |
And therefore the representation you get here | |
should be the same as if the word were in the second position. | |
0:30:27.247 --> 0:30:36.865 | |
The nice thing you can achieve is that these | |
weights which you're using here you're reusing | |
0:30:36.865 --> 0:30:41.727 | |
here and reusing here, so we are forcing them to be the same. | |
0:30:42.322 --> 0:30:48.360 | |
You then learn your word embedding, which | |
is contextual, independent, so it's the same | |
0:30:48.360 --> 0:30:49.678 | |
for each position. | |
0:30:49.909 --> 0:31:03.482 | |
So that's the idea that you want to learn | |
the representation of the word first, and you don't want | |
0:31:03.482 --> 0:31:07.599 | |
to really use the context. | |
0:31:08.348 --> 0:31:13.797 | |
That of course might have a different meaning | |
depending on where it stands, but we'll learn | |
0:31:13.797 --> 0:31:14.153 | |
that. | |
0:31:14.514 --> 0:31:20.386 | |
So first we are learning here representational | |
words, which is just the representation. | |
0:31:20.760 --> 0:31:32.498 | |
Normally we said in neurons all input neurons | |
here are connected to all here, but we're reducing | |
0:31:32.498 --> 0:31:37.338 | |
the complexity by connecting these neurons only to this block. | |
0:31:37.857 --> 0:31:47.912 | |
Then we have a much denser representation, that | |
is, our three word embeddings here, and now | |
0:31:47.912 --> 0:31:57.408 | |
we are learning this interaction between words, | |
an interaction not based on the discrete symbols. | |
0:31:57.677 --> 0:32:08.051 | |
So we have at least one fully connected layer here, | |
which takes the three embeddings as input and then | |
0:32:08.051 --> 0:32:14.208 | |
learns a new embedding which now represents | |
the full context. | |
0:32:15.535 --> 0:32:16.551 | |
Layers. | |
0:32:16.551 --> 0:32:27.854 | |
Then there is the output layer, which again gives | |
the probability distribution over all the words. | |
0:32:28.168 --> 0:32:48.612 | |
So here is your target prediction. | |
0:32:48.688 --> 0:32:56.361 | |
The nice thing is that you learn everything | |
together, so you don't have to teach them what | |
0:32:56.361 --> 0:32:58.722 | |
a good word representation is. | |
0:32:59.079 --> 0:33:08.306 | |
You are training the whole network together, so it | |
learns what a good representation for a word | |
0:33:08.306 --> 0:33:13.079 | |
is in order to perform your final task. | |
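Putting the pieces together, a compact sketch of the feed-forward n-gram language model described above: the same embedding matrix is reused for every input position, one fully connected layer combines the n-1 embeddings, and a softmax over the vocabulary gives the next-word distribution. All sizes and the tanh choice are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid, n = 10_000, 100, 200, 4          # vocab, embedding, hidden size, n-gram order

E  = rng.normal(0, 0.1, (d_emb, V))               # embedding matrix, shared over positions
W1 = rng.normal(0, 0.1, (d_hid, (n - 1) * d_emb)) # combines the n-1 word embeddings
W2 = rng.normal(0, 0.1, (V, d_hid))               # output projection to the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):          # n-1 previous word indices
    embs = np.concatenate([E[:, i] for i in context_ids])  # position-independent lookup
    h = np.tanh(W1 @ embs)                        # hidden layer over the full context
    return softmax(W2 @ h)                        # p(next word | context)

p = next_word_distribution([12, 7, 431])
print(p.shape, p.sum())
```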
0:33:15.956 --> 0:33:19.190 | |
Yeah, that is the main idea. | |
0:33:20.660 --> 0:33:32.731 | |
This is nowadays often referred to as one | |
way of self-supervised learning. | |
0:33:33.053 --> 0:33:37.120 | |
The output is the next word and the input | |
is the previous word. | |
0:33:37.377 --> 0:33:46.783 | |
But it's not really that we created labels, | |
but we artificially created a task out of unlabeled data. | |
0:33:46.806 --> 0:33:59.434 | |
We just had pure text, and then we created | |
the task labels by predicting the next word, | |
0:33:59.434 --> 0:34:18.797 | |
which is: Say we have like two sentences like | |
go home and the second one is go to prepare. | |
0:34:18.858 --> 0:34:30.135 | |
And then we have to predict the next series | |
and my questions in the labels for the album. | |
0:34:31.411 --> 0:34:42.752 | |
We model this as one vector with like probability | |
for possible weights starting again. | |
0:34:44.044 --> 0:34:57.792 | |
Multiple examples, so then you would twice | |
train one to predict KRT, one to predict home, | |
0:34:57.792 --> 0:35:02.374 | |
and then of course the easel. | |
0:35:04.564 --> 0:35:13.568 | |
Is a very good point, so you are not aggregating | |
examples beforehand, but you are taking each. | |
0:35:19.259 --> 0:35:37.204 | |
So you simultaneously learn the | |
projection layer and the n-gram probabilities, | |
0:35:37.204 --> 0:35:39.198 | |
and then. | |
0:35:39.499 --> 0:35:47.684 | |
And people later analyzed that these representations | |
are very powerful. | |
0:35:47.684 --> 0:35:56.358 | |
The task is just a very important task to | |
model what is the next word. | |
0:35:56.816 --> 0:35:59.842 | |
Is motivated by nowadays. | |
0:35:59.842 --> 0:36:10.666 | |
In order to get the meaning of the word you | |
have to look at the company it keeps, that is, the context. | |
0:36:10.790 --> 0:36:16.048 | |
If you read a text with a word which you | |
have never seen, you often can still estimate | |
0:36:16.048 --> 0:36:21.130 | |
the meaning of this word because you see | |
how it is used, and that it is typically | |
0:36:21.130 --> 0:36:22.240 | |
used as a city or. | |
0:36:22.602 --> 0:36:25.865 | |
Just imagine you read a text about some city. | |
0:36:25.865 --> 0:36:32.037 | |
Even if you've never seen the city before, | |
you often know from the context of how it's | |
0:36:32.037 --> 0:36:32.463 | |
used. | |
0:36:34.094 --> 0:36:42.483 | |
So what is now the big advantage of using | |
neural networks? | |
0:36:42.483 --> 0:36:51.851 | |
So just imagine we have to estimate that I | |
bought my first iPhone. | |
0:36:52.052 --> 0:36:56.608 | |
So you have to monitor the probability of | |
ad hitting them. | |
0:36:56.608 --> 0:37:00.237 | |
Now imagine iPhone, which you have never seen. | |
0:37:00.600 --> 0:37:11.588 | |
So all the techniques we had last time at | |
the end, if you haven't seen iPhone you will | |
0:37:11.588 --> 0:37:14.240 | |
always fall back to very short contexts. | |
0:37:15.055 --> 0:37:26.230 | |
You have no idea how to deal with that; you won't | |
have seen the bigram, the trigram, and all | |
0:37:26.230 --> 0:37:27.754 | |
the others. | |
0:37:28.588 --> 0:37:43.441 | |
If you're having this type of model, what | |
does it do if you have my first and then something? | |
0:37:43.483 --> 0:37:50.270 | |
Maybe this representation is really messed | |
up because it's an out-of-vocabulary word. | |
0:37:50.730 --> 0:37:57.793 | |
However, you still have the information | |
that two words before was first and therefore. | |
0:37:58.098 --> 0:38:06.954 | |
So you have a lot of information in order | |
to estimate how good it is. | |
0:38:06.954 --> 0:38:13.279 | |
There could be more information if you know | |
that. | |
0:38:13.593 --> 0:38:25.168 | |
So all this type of modeling we can do that | |
we couldn't do beforehand because we always | |
0:38:25.168 --> 0:38:25.957 | |
have. | |
0:38:27.027 --> 0:38:40.466 | |
Good point, so typically you would have one | |
token for out-of-vocabulary words so that you could, for | |
0:38:40.466 --> 0:38:45.857 | |
example, handle them. Or you're doing byte-pair encoding, | |
where you have a fixed vocabulary. | |
0:38:46.226 --> 0:38:49.437 | |
Oh yeah, you have to do something like that | |
that that that's true. | |
0:38:50.050 --> 0:38:55.420 | |
So yeah, out-of-vocabulary words are handled by byte-pair encoding, where | |
you don't have unknown words anymore. | |
0:38:55.735 --> 0:39:06.295 | |
But then, of course, you might be getting | |
very long previous things, and your sequence | |
0:39:06.295 --> 0:39:11.272 | |
length gets very long for unknown words. | |
0:39:17.357 --> 0:39:20.067 | |
Any more questions on the basic setup? | |
0:39:23.783 --> 0:39:36.719 | |
For this model, what we then want to continue | |
is looking a bit into how complex or how we | |
0:39:36.719 --> 0:39:39.162 | |
can make things. | |
0:39:40.580 --> 0:39:49.477 | |
Because at the beginning there was definitely | |
a major challenge, it's still not that easy, | |
0:39:49.477 --> 0:39:58.275 | |
and I mean our likeers followed the talk about | |
their environmental fingerprint and so on. | |
0:39:58.478 --> 0:40:05.700 | |
So this calculation is not really heavy, and | |
if you build systems yourselves you have to | |
0:40:05.700 --> 0:40:06.187 | |
wait. | |
0:40:06.466 --> 0:40:14.683 | |
So it's good to know a bit about how complex | |
things are in order to do a good or efficient | |
0:40:14.683 --> 0:40:15.405 | |
affair. | |
0:40:15.915 --> 0:40:24.211 | |
So one thing where most of the calculation | |
really happens is if you're doing it in a bad | |
0:40:24.211 --> 0:40:24.677 | |
way. | |
0:40:25.185 --> 0:40:33.523 | |
So generally, in all these layers we are talking | |
about neural networks and it sounds fancy. | |
0:40:33.523 --> 0:40:46.363 | |
In the end it is just math. So what you have to do in | |
order to calculate here, for example, these | |
0:40:46.363 --> 0:40:52.333 | |
activations is, to make it a bit simple: | |
0:40:52.333 --> 0:41:06.636 | |
Let's say these are the outputs: you just do a matrix | |
multiplication between your weight matrix and | |
0:41:06.636 --> 0:41:08.482 | |
your input. | |
0:41:08.969 --> 0:41:20.992 | |
So that is why computers are so powerful for | |
neural networks because they are very good | |
0:41:20.992 --> 0:41:22.358 | |
at doing exactly this. | |
0:41:22.782 --> 0:41:28.013 | |
However, for the embedding layer | |
this is really very inefficient. | |
0:41:28.208 --> 0:41:39.652 | |
So because remember we're having this one | |
hot encoding in this input, it's always like | |
0:41:39.652 --> 0:41:42.940 | |
one and everything else. | |
0:41:42.940 --> 0:41:47.018 | |
It's zero if we're doing this. | |
0:41:47.387 --> 0:41:55.552 | |
So therefore you can do at least the forward | |
pass a lot more efficient if you don't really | |
0:41:55.552 --> 0:42:01.833 | |
do this calculation, but you can select the | |
one column where the one is. | |
0:42:01.833 --> 0:42:07.216 | |
Therefore, you also see this is called your | |
word embedding. | |
0:42:08.348 --> 0:42:19.542 | |
So the weight matrix of the embedding layer | |
is just built so that in each column you have the embedding | |
0:42:19.542 --> 0:42:20.018 | |
of. | |
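Because the input is one-hot, the multiplication with the embedding weight matrix just selects one column; a sketch of the cheap lookup compared to the full matrix-vector product (sizes are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb = 10_000, 100
E = rng.normal(size=(d_emb, V))        # embedding weight matrix: one column per word

word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

slow = E @ one_hot                     # full product over the whole vocabulary
fast = E[:, word_id]                   # just select the column: same result, far cheaper
print(np.allclose(slow, fast))         # True
```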
0:42:20.580 --> 0:42:30.983 | |
So this is like how your initial weights look | |
like and how you can interpret or understand. | |
0:42:32.692 --> 0:42:39.509 | |
And this is already relatively important because | |
remember this is a huge dimensional thing. | |
0:42:39.509 --> 0:42:46.104 | |
So typically here we have the number of words | |
is ten thousand or so, so this is the word | |
0:42:46.104 --> 0:42:51.365 | |
embedding matrix, typically the largest | |
matrix to calculate. | |
0:42:51.451 --> 0:42:59.741 | |
Because it's the largest one there, we have | |
ten thousand entries, while for the hours we | |
0:42:59.741 --> 0:43:00.393 | |
maybe. | |
0:43:00.660 --> 0:43:03.408 | |
So therefore the addition to a little bit | |
more to make this. | |
0:43:06.206 --> 0:43:10.538 | |
Then you can go where else the calculations | |
are really expensive. | |
0:43:10.830 --> 0:43:20.389 | |
So here we then have our network, so we have | |
the word embeddings. | |
0:43:20.389 --> 0:43:29.514 | |
We have one hidden there, and then you can | |
look at how expensive it is. | |
0:43:30.270 --> 0:43:38.746 | |
We could save a lot of calculation by not really | |
calculating the embedding layer but just selecting, because the input is always one-hot. | |
0:43:40.600 --> 0:43:46.096 | |
The number of calculations you have to do | |
here is as follows. | |
0:43:46.096 --> 0:43:51.693 | |
The length of this layer is n minus one times the | |
projection size. | |
0:43:52.993 --> 0:43:56.321 | |
That is a hint size. | |
0:43:56.321 --> 0:44:10.268 | |
So the first step of calculation for this | |
matrix multiplication, is that much calculation. | |
0:44:10.730 --> 0:44:18.806 | |
Then you have to do some activation function | |
and then you have to do again the calculation. | |
0:44:19.339 --> 0:44:27.994 | |
Here we need the vocabulary size because we | |
need to calculate the probability for each | |
0:44:27.994 --> 0:44:29.088 | |
next word. | |
0:44:29.889 --> 0:44:43.155 | |
And if you look at these numbers, so if you | |
have a projection size of and a vocabulary size | |
0:44:43.155 --> 0:44:53.876 | |
of, you see: And that is why there has been | |
especially at the beginning some ideas how | |
0:44:53.876 --> 0:44:55.589 | |
we can reduce. | |
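To make the rough operation count concrete: the hidden layer costs about (n-1)·P·H multiplications and the output layer H·V, so with plausible sizes (the numbers below are assumptions, not the ones elided in the lecture) the softmax over the vocabulary dominates by far.

```python
n, P, H, V = 4, 100, 200, 10_000      # assumed n-gram order, projection, hidden, vocabulary sizes
hidden_ops = (n - 1) * P * H          # input-to-hidden matrix multiplication
output_ops = H * V                    # hidden-to-output matrix multiplication
print(hidden_ops, output_ops)         # 60_000 vs 2_000_000: the output layer dominates
```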
0:44:55.956 --> 0:45:01.942 | |
And if we really need to calculate all of | |
our probabilities, or if we can calculate only | |
0:45:01.942 --> 0:45:02.350 | |
some. | |
0:45:02.582 --> 0:45:10.871 | |
And there again the one important thing to | |
think about is for what I will use my language | |
0:45:10.871 --> 0:45:11.342 | |
model. | |
0:45:11.342 --> 0:45:19.630 | |
I can use it for generations and that's what | |
we will see next week in an achiever which | |
0:45:19.630 --> 0:45:22.456 | |
really is guiding the search. | |
0:45:23.123 --> 0:45:30.899 | |
If we just use it as a feature, we do not want | |
to use it for generation, but we only want to | |
0:45:30.899 --> 0:45:32.559 | |
know how probable a translation is. | |
0:45:32.953 --> 0:45:39.325 | |
There we might not be really interested in | |
all the probabilities, but we already know | |
0:45:39.325 --> 0:45:46.217 | |
we just want to know the probability of this | |
one word, and then it might be very inefficient | |
0:45:46.217 --> 0:45:49.403 | |
to really calculate all the probabilities. | |
0:45:51.231 --> 0:45:52.919 | |
And how can you do that so? | |
0:45:52.919 --> 0:45:56.296 | |
Initially, for example, the people look into | |
shortlists. | |
0:45:56.756 --> 0:46:02.276 | |
So this calculation at the end is really very | |
expensive. | |
0:46:02.276 --> 0:46:05.762 | |
So can we make that more efficient. | |
0:46:05.945 --> 0:46:17.375 | |
And most words occur very rarely, and maybe | |
we don't need anger, and so there we may want | |
0:46:17.375 --> 0:46:18.645 | |
to focus on the frequent words. | |
0:46:19.019 --> 0:46:29.437 | |
And so they use the smaller vocabulary, which | |
is maybe. | |
0:46:29.437 --> 0:46:34.646 | |
This layer is used from to. | |
0:46:34.646 --> 0:46:37.623 | |
Then you merge. | |
0:46:37.937 --> 0:46:45.162 | |
So you're checking if the word is in the shortlist, | |
so in the two thousand most frequent words. | |
0:46:45.825 --> 0:46:58.299 | |
If so, you take the probability from this shortlist model with some normalization here, | |
and otherwise you take a backoff probability | |
0:46:58.299 --> 0:46:59.655 | |
from the n-gram model. | |
0:47:00.020 --> 0:47:04.933 | |
It will not be as good, but the idea is okay. | |
0:47:04.933 --> 0:47:14.013 | |
Then we don't have to calculate all these | |
probabilities here at the end, but we only | |
0:47:14.013 --> 0:47:16.042 | |
have to calculate. | |
0:47:19.599 --> 0:47:32.097 | |
This comes with some type of cost, because it means we | |
don't model the probability of the infrequent | |
0:47:32.097 --> 0:47:39.399 | |
words, and maybe it's even very important to | |
model them. | |
0:47:39.299 --> 0:47:46.671 | |
And one idea is to do what is referred to as | |
the structured output layer. | |
0:47:46.606 --> 0:47:49.571 | |
Network language models you see some years | |
ago. | |
0:47:49.571 --> 0:47:53.154 | |
People were very creative and giving names | |
to new models. | |
0:47:53.813 --> 0:48:00.341 | |
And there the idea is that we model the output | |
vocabulary as a clustered tree. | |
0:48:00.680 --> 0:48:06.919 | |
So you don't need to model the full vocabulary | |
directly, but you are putting words into a | |
0:48:06.919 --> 0:48:08.479 | |
sequence of clusters. | |
0:48:08.969 --> 0:48:15.019 | |
So maybe a very infrequent word is first | |
in cluster three, and then within cluster three | |
0:48:15.019 --> 0:48:21.211 | |
you have subclusters again, say subcluster | |
seven, and so on. | |
0:48:21.541 --> 0:48:40.134 | |
And this is the path, so that is what was | |
the man in the past. | |
0:48:40.340 --> 0:48:52.080 | |
And then you can calculate the probability | |
of the word again just by the product of the | |
0:48:52.080 --> 0:48:55.548 | |
class probabilities along the path. | |
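A minimal two-level sketch of this idea: each word belongs to a class, and p(word) is the product of p(class | history) and p(word | class, history), so only the class distribution and one within-class distribution have to be computed. The class assignment (by index) and the sizes are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_hid, n_classes, words_per_class = 200, 100, 100    # 100 * 100 = 10_000 words (assumed)

Wc = rng.normal(0, 0.1, (n_classes, d_hid))                   # class prediction
Ww = rng.normal(0, 0.1, (n_classes, words_per_class, d_hid))  # per-class word prediction

def word_prob(h, word_id):
    c, j = divmod(word_id, words_per_class)       # which class / position inside the class
    p_class = softmax(Wc @ h)[c]                  # p(class | history)
    p_word  = softmax(Ww[c] @ h)[j]               # p(word | class, history)
    return p_class * p_word                       # product along the path

print(word_prob(rng.normal(size=d_hid), word_id=4242))
```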
0:48:57.617 --> 0:49:07.789 | |
Here it may be more clear: you have this | |
architecture, so this is all the same. | |
0:49:07.789 --> 0:49:13.773 | |
But then you first predict here which main | |
class. | |
0:49:14.154 --> 0:49:24.226 | |
Then you go to the appropriate subclass, then | |
you calculate the probability of the subclass | |
0:49:24.226 --> 0:49:26.415 | |
and maybe the sub-subclass. | |
0:49:27.687 --> 0:49:35.419 | |
Anybody have an idea why this is more efficient | |
because at first it looks like a lot more computation. | |
0:49:42.242 --> 0:49:51.788 | |
You have to do less calculations, so maybe | |
if you do it here you have to calculate the | |
0:49:51.788 --> 0:49:59.468 | |
element there, but you don't have to do all | |
the one hundred thousand. | |
0:49:59.980 --> 0:50:06.115 | |
You only calculate the probabilities in the subclasses that | |
you're going through and not for all of them. | |
0:50:06.386 --> 0:50:18.067 | |
Therefore, it's more efficient if you don't | |
need all output probabilities, because you only have | |
0:50:18.067 --> 0:50:21.253 | |
to calculate the classes along the path. | |
0:50:21.501 --> 0:50:28.936 | |
So it's only more efficient in scenarios | |
where you really need to use a language model | |
0:50:28.936 --> 0:50:30.034 | |
to evaluate. | |
0:50:35.275 --> 0:50:52.456 | |
The way this works is that you first train | |
your neural language model on the shortlist. | |
0:50:52.872 --> 0:51:03.547 | |
But on the input layer you have your full | |
vocabulary because at the input we saw that | |
0:51:03.547 --> 0:51:06.650 | |
this is not complicated. | |
0:51:06.906 --> 0:51:26.638 | |
And then you can cluster down all your words | |
here into classes and use those as your classes. | |
0:51:29.249 --> 0:51:34.148 | |
That is one idea of doing it. | |
0:51:34.148 --> 0:51:44.928 | |
There is also a second idea of doing it, and | |
again we don't need. | |
0:51:45.025 --> 0:51:53.401 | |
So sometimes it doesn't really need to be | |
a normalized probability to evaluate with. | |
0:51:53.401 --> 0:51:56.557 | |
It's only important that. | |
0:51:58.298 --> 0:52:04.908 | |
Here, what people have done is called | |
self-normalization. | |
0:52:04.908 --> 0:52:11.562 | |
We have seen that the probability is in this | |
softmax always e to the power of the input divided | |
0:52:11.562 --> 0:52:18.216 | |
by our normalization, and the normalization | |
is a sum over the vocabulary of e to the power | |
0:52:18.216 --> 0:52:19.274 | |
of the values. | |
0:52:19.759 --> 0:52:25.194 | |
So this is how we calculate the softmax. | |
0:52:25.825 --> 0:52:41.179 | |
In self-normalization the idea is: if the log of this | |
normalization were zero, then we don't need to calculate | |
0:52:41.179 --> 0:52:42.214 | |
that. | |
0:52:42.102 --> 0:52:54.272 | |
Will be zero, and then you don't even have | |
to calculate the normalization because it's one. | |
0:52:54.514 --> 0:53:08.653 | |
So how can we achieve that and then the nice | |
thing in your networks? | |
0:53:09.009 --> 0:53:23.928 | |
And now we're just adding a second note with | |
some either permitted here. | |
0:53:24.084 --> 0:53:29.551 | |
And the second lost just tells us he'll be | |
strained away. | |
0:53:29.551 --> 0:53:31.625 | |
The locks at is zero. | |
0:53:32.352 --> 0:53:38.614 | |
So then if it's nearly zero at the end we | |
don't need to calculate this and it's also | |
0:53:38.614 --> 0:53:39.793 | |
very efficient. | |
0:53:40.540 --> 0:53:49.498 | |
One important thing: this, of course, only helps | |
at inference. | |
0:53:49.498 --> 0:54:04.700 | |
During tests we don't need to calculate that | |
because of this. You can do a bit of a hyperparameter | |
0:54:04.700 --> 0:54:14.851 | |
tuning here where you do the weighting, so how well | |
should it be estimating the probabilities and | |
0:54:14.851 --> 0:54:16.790 | |
how much effort? | |
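A sketch of the self-normalisation objective as described: next to the cross-entropy we add a second term that pushes the log of the softmax normaliser towards zero, weighted by a hyperparameter alpha (the weighting just mentioned); at inference one can then skip the normalisation. The exact form of the penalty and the alpha value are assumptions.

```python
import numpy as np

def self_normalized_loss(logits, target_index, alpha=0.1):
    # log of the normaliser, computed in a numerically stable way
    log_z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    cross_entropy = -(logits[target_index] - log_z)       # usual loss: -log softmax(target)
    return cross_entropy + alpha * log_z ** 2              # extra term: push log Z towards 0

print(self_normalized_loss(np.array([2.0, 1.0, 0.1]), target_index=0))
```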
0:54:18.318 --> 0:54:28.577 | |
The only disadvantage is no speed up during | |
training. | |
0:54:28.577 --> 0:54:43.843 | |
There are other ways of doing that as well. | |
0:54:44.344 --> 0:54:48.540 | |
Then we are coming very, very briefly like | |
just one idea. | |
0:54:48.828 --> 0:54:53.058 | |
That there is more things on different types | |
of language models. | |
0:54:53.058 --> 0:54:58.002 | |
We are having a very short view on restricted | |
Boltzmann machine based language models. | |
0:54:58.298 --> 0:55:08.931 | |
Talk about recurrent neural networks for language | |
models, because they have the advantage that | |
0:55:08.931 --> 0:55:17.391 | |
we can even further improve by not having a | |
continuous representation on. | |
0:55:18.238 --> 0:55:23.845 | |
So there's different types of neural networks. | |
0:55:23.845 --> 0:55:30.169 | |
These are the restricted Boltzmann machines, and the interesting thing is: | |
0:55:30.330 --> 0:55:39.291 | |
They have these: And they define like an energy | |
function on the network, which can be in restricted | |
0:55:39.291 --> 0:55:44.372 | |
Boltzmann machines be efficiently calculated. | |
In restricted Boltzmann machines | |
0:55:44.372 --> 0:55:51.147 | |
you only have connections between the input | |
and the hidden layer, but you don't have connections | |
0:55:51.147 --> 0:55:53.123 | |
within the input or within the hidden layer. | |
0:55:53.393 --> 0:56:00.194 | |
So you see here you don't have an input output, | |
you just have an input, and you calculate. | |
0:56:00.460 --> 0:56:15.612 | |
Which of course nicely fits with the idea | |
we're having, so you can then use this for | |
0:56:15.612 --> 0:56:19.177 | |
an n-gram language model. | |
0:56:19.259 --> 0:56:25.189 | |
Retaining the flexibility of the input by | |
this type of neural networks. | |
0:56:26.406 --> 0:56:30.589 | |
And the advantage of this type of model was | |
there's. | |
0:56:30.550 --> 0:56:37.520 | |
Very, very fast to integrate it, so that one | |
was the first one which was used during the | |
0:56:37.520 --> 0:56:38.616 | |
decoding. | |
0:56:38.938 --> 0:56:45.454 | |
The engram language models were that they | |
were very good and gave performance. | |
0:56:45.454 --> 0:56:50.072 | |
However, the calculation with all these | |
tricks still takes time. | |
0:56:50.230 --> 0:56:58.214 | |
We have talked about n-best lists, so they | |
generated an n-best list of the most probable | |
0:56:58.214 --> 0:57:05.836 | |
outputs, and then they took this n-best list | |
and scored each entry with a neural network. | |
0:57:06.146 --> 0:57:09.306 | |
A language model, and then only change the | |
order again. | |
0:57:09.306 --> 0:57:10.887 | |
Select based on that which. | |
0:57:11.231 --> 0:57:17.187 | |
The n-best list is maybe only like a hundred | |
entries. | |
0:57:17.187 --> 0:57:21.786 | |
When decoding you look at several thousand. | |
0:57:26.186 --> 0:57:35.196 | |
Let's look at the context so we have now seen | |
neural language models. | |
0:57:35.196 --> 0:57:43.676 | |
There is the big advantage we can use this | |
word similarity and. | |
0:57:44.084 --> 0:57:52.266 | |
Remember, for n-gram language models it is not always | |
n minus one words, because sometimes you have | |
0:57:52.266 --> 0:57:59.909 | |
to back off or interpolate to lower n-grams, | |
and then you don't use all the previous words. | |
0:58:00.760 --> 0:58:04.742 | |
And however in neural models we always have | |
all of this importance. | |
0:58:04.742 --> 0:58:05.504 | |
Can some of. | |
0:58:07.147 --> 0:58:20.288 | |
The disadvantage is that you are still limited | |
in your context, and if you remember the sentence | |
0:58:20.288 --> 0:58:22.998 | |
from last lecture,. | |
0:58:22.882 --> 0:58:28.328 | |
Sometimes you need more context and there | |
is unlimited context that you might need and | |
0:58:28.328 --> 0:58:34.086 | |
you can always create sentences where you may | |
need this full context in order to make a good | |
0:58:34.086 --> 0:58:34.837 | |
estimation. | |
0:58:35.315 --> 0:58:44.956 | |
We can also do it differently; in order to understand | |
that, it makes sense to view language modeling as sequence labeling. | |
0:58:45.445 --> 0:58:59.510 | |
So sequence labeling tasks are a very common | |
type of task in language processing where you | |
0:58:59.510 --> 0:59:03.461 | |
have the input sequence. | |
0:59:03.323 --> 0:59:05.976 | |
So you have one output for each input. | |
0:59:05.976 --> 0:59:12.371 | |
Machine translation is not a sequence labeling | |
task because the number of inputs and the number | |
0:59:12.371 --> 0:59:14.072 | |
of outputs is different. | |
0:59:14.072 --> 0:59:20.598 | |
So you put in a German sentence which has five | |
words and the output can be longer. Here, for example, | |
0:59:20.598 --> 0:59:24.078 | |
you always have the same number of inputs and the same | |
number of outputs. | |
0:59:24.944 --> 0:59:39.779 | |
And you can model language modeling as that, | |
and you just say the label for each word is | |
0:59:39.779 --> 0:59:43.151 | |
always the next word. | |
0:59:45.705 --> 0:59:50.312 | |
Sequence labeling is the more general task you can think of | |
it as. | |
0:59:50.312 --> 0:59:56.194 | |
For example, part-of-speech tagging or named entity | |
recognition. | |
0:59:58.938 --> 1:00:08.476 | |
And if you look at now, this output token | |
and generally sequenced labeling can depend | |
1:00:08.476 --> 1:00:26.322 | |
on: The input tokens are the same so we can | |
easily model it and they only depend on the | |
1:00:26.322 --> 1:00:29.064 | |
input tokens. | |
1:00:31.011 --> 1:00:42.306 | |
But we can always look at one specific type | |
of sequence labeling, unidirectional sequence | |
1:00:42.306 --> 1:00:44.189 | |
labeling type. | |
1:00:44.584 --> 1:01:00.855 | |
The probability of the next word only depends | |
on the previous words that we are having here. | |
1:01:01.321 --> 1:01:05.998 | |
That's also not completely true in language. | |
1:01:05.998 --> 1:01:14.418 | |
Well, the right context might also be helpful; | |
bidirectional models use it. | |
1:01:14.654 --> 1:01:23.039 | |
Here we will always model the probability of the | |
word given its history. | |
1:01:23.623 --> 1:01:30.562 | |
And the current approximation in sequence | |
labeling is that we have this windowing approach. | |
1:01:30.951 --> 1:01:43.016 | |
So in order to predict this type of word we | |
always look at the previous three words. | |
1:01:43.016 --> 1:01:48.410 | |
This is this type of windowing model. | |
1:01:49.389 --> 1:01:54.780 | |
If you're into neural networks you recognize | |
this type of structure. | |
1:01:54.780 --> 1:01:57.515 | |
Also, the typical neural networks. | |
1:01:58.938 --> 1:02:11.050 | |
Yes, yes, so like n-gram models you can, at | |
least in some way, prepare for that type of | |
1:02:11.050 --> 1:02:12.289 | |
context. | |
1:02:14.334 --> 1:02:23.321 | |
There are also other types of neural network structures | |
which we can use for sequence labeling and which | |
1:02:23.321 --> 1:02:30.710 | |
might help us where we don't have this type | |
of fixed size representation. | |
1:02:32.812 --> 1:02:34.678 | |
That we can do so. | |
1:02:34.678 --> 1:02:39.391 | |
The idea in recurrent neural networks is that | |
1:02:39.391 --> 1:02:43.221 | |
we are saving the complete history in one hidden state. | |
1:02:43.623 --> 1:02:56.946 | |
So again we have to do this fixed size representation | |
because the neural networks always need fixed-size input. | |
1:02:57.157 --> 1:03:09.028 | |
And then the network should look like that, | |
so we start with an initial value for our storage. | |
1:03:09.028 --> 1:03:15.900 | |
We are giving our first input and calculating | |
the new. | |
1:03:16.196 --> 1:03:35.895 | |
So it is again a neural network with two types of | |
inputs: Then you can apply it to the next type | |
1:03:35.895 --> 1:03:41.581 | |
of input and you're again having this. | |
1:03:41.581 --> 1:03:46.391 | |
You're taking this hidden state. | |
1:03:47.367 --> 1:03:53.306 | |
Nice thing is now that you can do now step | |
by step by step, so all the way over. | |
1:03:55.495 --> 1:04:06.131 | |
The nice thing we are having here now is that | |
now we are having context information from | |
1:04:06.131 --> 1:04:07.206 | |
all the. | |
1:04:07.607 --> 1:04:14.181 | |
So if you're looking like based on which words | |
do you, you calculate the probability of varying. | |
1:04:14.554 --> 1:04:20.090 | |
It depends on this part. | |
1:04:20.090 --> 1:04:33.154 | |
It depends on and this hidden state was influenced | |
by two. | |
1:04:33.473 --> 1:04:38.259 | |
So now we're having something new. | |
1:04:38.259 --> 1:04:46.463 | |
We can model like the word probability not | |
only on a fixed window. | |
1:04:46.906 --> 1:04:53.565 | |
Because the hidden states we are having here | |
in our RNN are influenced by all the previous words. | |
1:04:56.296 --> 1:05:02.578 | |
So how is there to be Singapore? | |
1:05:02.578 --> 1:05:16.286 | |
But then we have the initial idea about this | |
p of the word given the full history. | |
1:05:16.736 --> 1:05:25.300 | |
So we do not need to do any clustering here, | |
and you also see how things are put together | |
1:05:25.300 --> 1:05:26.284 | |
in order to do that. | |
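A sketch of the recurrent idea: one hidden vector is carried along and updated word by word, so the prediction at each step depends on the complete history rather than on a fixed window. The sizes and the tanh/softmax choices are the usual ones but still assumptions here, not the lecture's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 10_000, 100, 200

E   = rng.normal(0, 0.1, (d_emb, V))       # word embeddings
W_x = rng.normal(0, 0.1, (d_hid, d_emb))   # input-to-hidden
W_h = rng.normal(0, 0.1, (d_hid, d_hid))   # hidden-to-hidden (the recurrence)
W_o = rng.normal(0, 0.1, (V, d_hid))       # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    h = np.zeros(d_hid)                            # initial hidden state
    predictions = []
    for i in word_ids:
        h = np.tanh(W_x @ E[:, i] + W_h @ h)       # new state from input and old state
        predictions.append(softmax(W_o @ h))       # p(next word | all words so far)
    return predictions

print(len(rnn_lm([12, 7, 431])))
```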
1:05:29.489 --> 1:05:43.449 | |
The green box this night since we are starting | |
from the left to the right. | |
1:05:44.524 --> 1:05:51.483 | |
Voices: Yes, that's right, so there are clusters, | |
and here is also sometimes clustering happens. | |
1:05:51.871 --> 1:05:58.687 | |
The small difference does matter again, so | |
if you have now a lot of different histories, | |
1:05:58.687 --> 1:06:01.674 | |
the similarity which you have in here. | |
1:06:01.674 --> 1:06:08.260 | |
If two of the histories are very similar, | |
these representations will be the same, and | |
1:06:08.260 --> 1:06:10.787 | |
then you're treating them again. | |
1:06:11.071 --> 1:06:15.789 | |
Because in order to do the final prediction | |
you only base it on the green box. | |
1:06:16.156 --> 1:06:28.541 | |
So you are now still learning some type of | |
clustering in there, but you are learning it | |
1:06:28.541 --> 1:06:30.230 | |
implicitly. | |
1:06:30.570 --> 1:06:38.200 | |
The only restriction you're giving is you | |
have to stall everything that is important | |
1:06:38.200 --> 1:06:39.008 | |
in this. | |
1:06:39.359 --> 1:06:54.961 | |
So it's a different type of limitation, so | |
you calculate the probability based on the | |
1:06:54.961 --> 1:06:57.138 | |
last words. | |
1:06:57.437 --> 1:07:04.430 | |
And that is how you still need to somehow | |
cluster things together in order to do efficiently. | |
1:07:04.430 --> 1:07:09.563 | |
Of course, you need to do some type of clustering | |
because otherwise. | |
1:07:09.970 --> 1:07:18.865 | |
But this is where things get merged together | |
in this type of hidden representation. | |
1:07:18.865 --> 1:07:27.973 | |
So here the probability of the next word first of all
only depends on this hidden representation.
1:07:28.288 --> 1:07:33.104 | |
It still depends on the previous words, but there is now a different
bottleneck in order to make a good estimation.
1:07:34.474 --> 1:07:41.231 | |
So the idea is that we can store all our history
in one vector.
1:07:41.581 --> 1:07:44.812 | |
Which is, on the one hand, what makes it strong.
1:07:44.812 --> 1:07:51.275 | |
Next we come to problems that of course at | |
some point it might be difficult if you have | |
1:07:51.275 --> 1:07:57.811 | |
very long sequences and you always write all | |
the information you have on this one block. | |
1:07:58.398 --> 1:08:02.233 | |
Then maybe things get overwritten or you cannot | |
store everything in there. | |
1:08:02.662 --> 1:08:04.514 | |
So.
1:08:04.184 --> 1:08:09.569 | |
Therefore, for short things like single
sentences that works well, but especially if
1:08:09.569 --> 1:08:15.197 | |
you think of other tasks like summarization
or document-based MT, where you need
1:08:15.197 --> 1:08:20.582 | |
to consider the full document, these things
get a bit more complicated, and we
1:08:20.582 --> 1:08:23.063 | |
will learn another type of architecture. | |
1:08:24.464 --> 1:08:30.462 | |
In order to understand these networks, it
is good to always have both views in mind.
1:08:30.710 --> 1:08:33.998 | |
So this is the unrolled view. | |
1:08:33.998 --> 1:08:43.753 | |
Here, over time, or in language
over the words, you're unrolling the network.
1:08:44.024 --> 1:08:52.096 | |
Here is the other view, where the network
is connected to itself, and that is recurrent.
1:08:56.176 --> 1:09:04.982 | |
There is one challenge with these networks, and
that is the training.
1:09:04.982 --> 1:09:11.994 | |
How can we train them in the first place?
1:09:12.272 --> 1:09:19.397 | |
So at first we don't really know how to train them,
but if you unroll them like this, it is a feed
1:09:19.397 --> 1:09:20.142 | |
forward network.
1:09:20.540 --> 1:09:38.063 | |
It is exactly the same, so you can measure your
errors here and backpropagate your errors.
1:09:38.378 --> 1:09:45.646 | |
If you unroll something, it's a feed-forward
network and you can train it the same way.
1:09:46.106 --> 1:09:57.606 | |
The only important thing is again that, of course,
the parameters are shared for the different inputs.
1:09:57.837 --> 1:10:05.145 | |
But since the parameters are shared, it's somehow
the same, and you can train it.
1:10:05.145 --> 1:10:08.800 | |
The training algorithm is very similar. | |
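A rough, self-contained sketch of this "unroll it and train it like a feed-forward network" idea (backpropagation through time). The toy sentence, sizes and weight names are invented for illustration; the embedding gradient is left out for brevity.

```python
import numpy as np

np.random.seed(0)
V, E, H = 20, 8, 16
W_emb = np.random.randn(V, E) * 0.1
W_xh  = np.random.randn(E, H) * 0.1
W_hh  = np.random.randn(H, H) * 0.1
W_hy  = np.random.randn(H, V) * 0.1
words = [3, 7, 1, 9, 2]                      # one toy training sentence as word ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward: unroll over the sentence, remember the intermediate values.
hs, ps, loss = {-1: np.zeros(H)}, {}, 0.0
for t in range(len(words) - 1):
    x = W_emb[words[t]]
    hs[t] = np.tanh(x @ W_xh + hs[t - 1] @ W_hh)
    ps[t] = softmax(hs[t] @ W_hy)
    loss -= np.log(ps[t][words[t + 1]])      # cross-entropy against the next word

# Backward: ordinary backprop on the unrolled graph; the shared weights simply
# accumulate gradient contributions from every time step.
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(H)
for t in reversed(range(len(words) - 1)):
    dlogits = ps[t].copy()
    dlogits[words[t + 1]] -= 1.0             # gradient of softmax + cross-entropy
    dW_hy += np.outer(hs[t], dlogits)
    dh = W_hy @ dlogits + dh_next
    da = (1.0 - hs[t] ** 2) * dh             # through the tanh
    dW_xh += np.outer(W_emb[words[t]], da)
    dW_hh += np.outer(hs[t - 1], da)
    dh_next = W_hh @ da                      # error flowing back to the previous step

for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
    W -= 0.1 * dW                            # one SGD update on the shared weights
```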
1:10:10.310 --> 1:10:29.568 | |
One thing which makes things difficult is
what is referred to as the vanishing gradient.
1:10:29.809 --> 1:10:32.799 | |
That is a very important issue and a strong motivation
for the architectures used nowadays.
1:10:33.593 --> 1:10:44.604 | |
The influence here gets smaller and smaller,
and the models are not really able to model that.
1:10:44.804 --> 1:10:51.939 | |
Because the gradient gets smaller and smaller,
and so the error here propagated to this one
1:10:51.939 --> 1:10:58.919 | |
that contributes to the error is very small,
and therefore you don't make any changes there
1:10:58.919 --> 1:10:59.617 | |
anymore. | |
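A tiny, made-up numerical illustration of this vanishing-gradient effect; the weight scale and the constant standing in for the tanh derivative are arbitrary assumptions.

```python
import numpy as np

# The error that reaches a step k positions back is (roughly) the error at the
# last step multiplied by k Jacobians of the recurrent transition. If those
# factors are "small", the product dies out exponentially.
np.random.seed(0)
H = 16
W_hh = np.random.randn(H, H) * 0.1           # recurrent weights (illustrative scale)
grad = np.ones(H)                             # error signal at the last time step

for k in range(1, 21):
    grad = 0.5 * (W_hh.T @ grad)              # 0.5 stands in for the tanh derivative (<= 1)
    if k % 5 == 0:
        print(f"{k:2d} steps back: |error| = {np.linalg.norm(grad):.2e}")
# The norm shrinks exponentially, so early inputs receive almost no update.
```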
1:11:00.020 --> 1:11:06.703 | |
And yeah, that's why standard RNNs are
difficult to train.
1:11:07.247 --> 1:11:11.462 | |
So when people are talking about RNNs
nowadays,
1:11:11.791 --> 1:11:23.333 | |
what we are typically meaning are LSTMs, or
long short-term memories.
1:11:23.333 --> 1:11:30.968 | |
You see they are by now quite old already. | |
1:11:31.171 --> 1:11:39.019 | |
So here the motivation was the language modeling
task.
1:11:39.019 --> 1:11:44.784 | |
It's about storing information for longer.
1:11:44.684 --> 1:11:51.556 | |
Because if you only look at the last words,
it's often no longer clear whether this is a question
1:11:51.556 --> 1:11:52.548 | |
or a normal sentence.
1:11:53.013 --> 1:12:05.318 | |
So there you have these mechanisms with gates
in order to store things for a longer time
1:12:05.318 --> 1:12:08.563 | |
in your hidden state.
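A minimal sketch of what one such gated (LSTM) step looks like; the separate, un-fused weight matrices and their names are illustrative assumptions, not the exact formulation from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, X = 16, 8                                 # hidden and input sizes (illustrative)
Wf, Wi, Wo, Wc = (np.random.randn(H, X + H) * 0.1 for _ in range(4))
bf, bi, bo, bc = (np.zeros(H) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)        # forget gate: what to erase from the cell
    i = sigmoid(Wi @ z + bi)        # input gate: what new content to write
    o = sigmoid(Wo @ z + bo)        # output gate: what to expose as hidden state
    c_tilde = np.tanh(Wc @ z + bc)  # candidate content
    c = f * c_prev + i * c_tilde    # cell state can carry information for a long time
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.random.randn(X), np.zeros(H), np.zeros(H))
```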
1:12:10.730 --> 1:12:20.162 | |
They are still used in quite
a lot of works.
1:12:21.541 --> 1:12:29.349 | |
Especially for machine translation, the
standard now is to use transformer-based models, which
1:12:29.349 --> 1:12:30.477 | |
we'll learn about.
1:12:30.690 --> 1:12:38.962 | |
But, for example, regarding architectures, we have later
one lecture about efficiency.
1:12:38.962 --> 1:12:42.830 | |
So how can we build very efficient models?
1:12:42.882 --> 1:12:53.074 | |
And there, in the decoder and in parts of the networks,
they are still used.
1:12:53.473 --> 1:12:57.518 | |
So it's not that, yeah, RNNs are of no
importance anymore.
1:12:59.239 --> 1:13:08.956 | |
In order to make them strong, there are some
more things which are helpful and should be mentioned:
1:13:09.309 --> 1:13:19.683 | |
So one thing is: there is a nice trick to make
these neural networks stronger and better.
1:13:19.739 --> 1:13:21.523 | |
So of course it doesn't work always. | |
1:13:21.523 --> 1:13:23.451 | |
They have to have enough training data. | |
1:13:23.763 --> 1:13:28.959 | |
But in general the easiest way of
making your models bigger and stronger is just
1:13:28.959 --> 1:13:30.590 | |
to increase your parameters.
1:13:30.630 --> 1:13:43.236 | |
And you've seen that with large language
models, where they are always bragging about how many parameters they have.
1:13:43.903 --> 1:13:56.463 | |
This is one way, so the question is how do | |
you get more parameters? | |
1:13:56.463 --> 1:14:01.265 | |
There's ways of doing it. | |
1:14:01.521 --> 1:14:10.029 | |
And the other thing is to make your networks
deeper, so to have more layers in between.
1:14:11.471 --> 1:14:13.827 | |
And thereby you can also get more complex models.
1:14:14.614 --> 1:14:23.340 | |
There's one problem with this, and it's
very similar to what we just saw with RNNs.
1:14:23.603 --> 1:14:34.253 | |
We have this problem of gradient flow: if
it flows through so many layers, the gradient gets very
1:14:34.253 --> 1:14:35.477 | |
small.
1:14:35.795 --> 1:14:42.704 | |
Exactly the same thing happens in deep
LSTMs.
1:14:42.704 --> 1:14:52.293 | |
If you take here the gradient that tells you what
is right or wrong:
1:14:52.612 --> 1:14:56.439 | |
With three layers it's no problem, but if
you're going to ten, twenty or a hundred layers,
1:14:57.797 --> 1:14:59.698 | |
that typically becomes a problem.
1:15:00.060 --> 1:15:07.000 | |
What people are doing is using what are called residual
connections.
1:15:07.000 --> 1:15:15.855 | |
That's a very helpful idea, which is maybe | |
very surprising that it works. | |
1:15:15.956 --> 1:15:20.309 | |
And so the idea is that these layers
1:15:20.320 --> 1:15:29.982 | |
in between should no longer calculate what
is a completely new representation, but they're more
1:15:29.982 --> 1:15:31.378 | |
calculating what to change.
1:15:31.731 --> 1:15:37.588 | |
Therefore, in the end the output
of a layer is always added to its input.
1:15:38.318 --> 1:15:48.824 | |
The nice thing is that later, if you are doing backpropagation,
the error flows back very fast through these connections.
1:15:49.209 --> 1:16:02.540 | |
Nowadays every very deep architecture, not only
this one, always has these residual or highway
1:16:02.540 --> 1:16:04.224 | |
connections.
1:16:04.704 --> 1:16:06.616 | |
This has two advantages.
1:16:06.616 --> 1:16:15.409 | |
On the one hand, these layers don't need to
learn a completely new representation, they only need to learn
1:16:15.409 --> 1:16:18.754 | |
what to change in the representation.
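A minimal sketch of a residual connection, with a made-up one-layer ReLU network standing in for whatever sub-layer is used.

```python
import numpy as np

# The layer's output is added to its input, so the layer only has to learn the
# *change* to the representation, and the gradient can also flow back through
# the identity path unchanged.
def sub_layer(x, W, b):
    return np.maximum(0.0, W @ x + b)        # illustrative sub-layer (ReLU MLP)

def residual_block(x, W, b):
    return x + sub_layer(x, W, b)            # output = input + F(input)

H = 64
x = np.random.randn(H)
W, b = np.random.randn(H, H) * 0.05, np.zeros(H)
y = residual_block(x, W, b)
```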
1:16:22.082 --> 1:16:24.172 | |
Good. | |
1:16:23.843 --> 1:16:31.768 | |
So much for the neural networks; now to the last
thing for today.
1:16:31.671 --> 1:16:33.750 | |
Language models were, yeah,
1:16:33.750 --> 1:16:41.976 | |
used in the models themselves, and now we are
seeing them again, but one thing which at the
1:16:41.976 --> 1:16:53.558 | |
beginning was very essential was this: people
really trained language
1:16:53.558 --> 1:16:59.999 | |
models only to get this type of embedding. | |
1:16:59.999 --> 1:17:04.193 | |
Therefore, we want to look at these embeddings.
1:17:09.229 --> 1:17:15.678 | |
So now some last words on the word embeddings.
1:17:15.678 --> 1:17:27.204 | |
The interesting thing is that word embeddings | |
can be used for very different tasks. | |
1:17:27.347 --> 1:17:31.329 | |
The nice thing is you can train them on just
large amounts of data.
1:17:31.931 --> 1:17:41.569 | |
And then if you have these word embeddings,
we have seen that they already reduce the parameters.
1:17:41.982 --> 1:17:52.217 | |
So then you can train a smaller model to do
any other task, and therefore you are more efficient.
1:17:52.532 --> 1:17:55.218 | |
One thing about these initial word embeddings is important.
1:17:55.218 --> 1:18:00.529 | |
They really depend only on the word itself, | |
so if you look at the two meanings of can, | |
1:18:00.529 --> 1:18:06.328 | |
the can of beans or I can do that, they will
have the same embedding, so somehow the embedding
1:18:06.328 --> 1:18:08.709 | |
has to keep the ambiguity inside it.
1:18:09.189 --> 1:18:12.486 | |
That ambiguity cannot be resolved at this level.
1:18:12.486 --> 1:18:24.753 | |
It can only be resolved at the higher levels,
which look at the context; the word embedding layer
1:18:24.753 --> 1:18:27.919 | |
really depends only on the word itself.
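A tiny sketch of this point: a static word embedding is a pure table lookup, so the surrounding context cannot change it. The vocabulary and matrix below are random stand-ins, not a trained model.

```python
import numpy as np

vocab = {"i": 0, "can": 1, "of": 2, "beans": 3, "do": 4, "that": 5, "a": 6}
W_emb = np.random.randn(len(vocab), 50)

sentence_1 = ["a", "can", "of", "beans"]
sentence_2 = ["i", "can", "do", "that"]
vec_1 = W_emb[vocab["can"]]          # embedding of "can" in sentence 1
vec_2 = W_emb[vocab["can"]]          # embedding of "can" in sentence 2
assert np.array_equal(vec_1, vec_2)  # identical: the ambiguity stays inside the vector
```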
1:18:29.489 --> 1:18:33.757 | |
However, even these embeddings have quite interesting properties.
1:18:34.034 --> 1:18:39.558 | |
So people like to visualize them.
1:18:39.558 --> 1:18:47.208 | |
They're always difficult to visualize, because if you look
at these vectors:
1:18:47.767 --> 1:18:52.879 | |
drawing a five-hundred-dimensional
vector is still a bit challenging.
1:18:53.113 --> 1:19:12.472 | |
So you cannot directly do that, so people
have to look at some type of projection.
1:19:13.073 --> 1:19:17.209 | |
And of course then, yes, some information is
getting lost by this down-projection.
1:19:18.238 --> 1:19:24.802 | |
And you see, for example, this is the most | |
famous and common example, so what you can | |
1:19:24.802 --> 1:19:31.289 | |
look at is the difference between
the male and the female word in English.
1:19:31.289 --> 1:19:37.854 | |
This is here the embedding of king, and
this is the embedding of queen, and this is the difference.
1:19:38.058 --> 1:19:40.394 | |
You can do that for very different words.
1:19:40.780 --> 1:19:45.407 | |
And that is where the math comes in; that
is what people then look into.
1:19:45.725 --> 1:19:50.995 | |
So what you can now, for example, do is you | |
can calculate the difference between man and | |
1:19:50.995 --> 1:19:51.410 | |
woman? | |
1:19:52.232 --> 1:19:55.511 | |
Then you can take the embedding of a word.
1:19:55.511 --> 1:20:02.806 | |
You can add on it the difference between man | |
and woman, and then you can notice what are | |
1:20:02.806 --> 1:20:04.364 | |
the similar words. | |
1:20:04.364 --> 1:20:08.954 | |
So you won't, of course, directly hit the | |
correct word. | |
1:20:08.954 --> 1:20:10.512 | |
It's a continuous space.
1:20:10.790 --> 1:20:23.127 | |
But you can look at what the nearest neighbors
of this resulting vector are, and often these words are near
1:20:23.127 --> 1:20:24.056 | |
there. | |
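A small sketch of this nearest-neighbour analogy trick. The embedding matrix and vocabulary below are random stand-ins, so the printed output here is meaningless; with real trained embeddings the expected word would usually appear among the top hits.

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, "walking": 4}
emb = np.random.randn(len(vocab), 100)       # would come from language-model training

def nearest(vec, k=3):
    # cosine similarity of every vocabulary word against the query vector
    sims = emb @ vec / (np.linalg.norm(emb, axis=1) * np.linalg.norm(vec) + 1e-9)
    return [w for w, _ in sorted(vocab.items(), key=lambda kv: -sims[kv[1]])][:k]

query = emb[vocab["king"]] - emb[vocab["man"]] + emb[vocab["woman"]]
print(nearest(query))   # with trained embeddings, "queen" is usually among the neighbours
```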
1:20:24.224 --> 1:20:33.913 | |
So it somehow learns that the difference between | |
these words is always the same. | |
1:20:34.374 --> 1:20:37.746 | |
You can do that for different things. | |
1:20:37.746 --> 1:20:41.296 | |
Here you also see that it's not perfect.
1:20:41.296 --> 1:20:49.017 | |
Here, for example, the verb tenses, so swimming and
swam, and walking and walked.
1:20:49.469 --> 1:20:51.639 | |
So you can try to use them. | |
1:20:51.639 --> 1:20:59.001 | |
It's not perfect, but the interesting
thing is this is completely unsupervised.
1:20:59.001 --> 1:21:03.961 | |
So nobody taught the model the principle of
gender in language.
1:21:04.284 --> 1:21:09.910 | |
So it's purely trained on the task of doing
next word prediction.
1:21:10.230 --> 1:21:20.658 | |
And even for really semantic information,
like capitals, this is the difference between
1:21:20.658 --> 1:21:23.638 | |
the country and the capital.
1:21:23.823 --> 1:21:25.518 | |
Here is a visualization.
1:21:25.518 --> 1:21:33.766 | |
Here we have done the same thing with the difference
between country and capital.
1:21:33.853 --> 1:21:41.991 | |
You see it's not perfect, but it's pointing in
some kind of right direction, so you can
1:21:41.991 --> 1:21:43.347 | |
even use them. | |
1:21:43.347 --> 1:21:51.304 | |
For example, for question answering, if you | |
have the difference between them, you apply | |
1:21:51.304 --> 1:21:53.383 | |
that to a new country. | |
1:21:54.834 --> 1:22:02.741 | |
So it seems these embeddings are able to really
learn a lot of information and compress all
1:22:02.741 --> 1:22:04.396 | |
this information. | |
1:22:05.325 --> 1:22:11.769 | |
And all this just to do the next word prediction. And
that also explains a bit, or rather motivates,
1:22:11.769 --> 1:22:19.016 | |
what is the main
advantage of this type of neural models: that
1:22:19.016 --> 1:22:26.025 | |
we can use this type of hidden representation,
transfer them and use them in different tasks.
1:22:28.568 --> 1:22:43.707 | |
So, to summarize what we did today: what you
should hopefully take with you is how language models are used for machine
1:22:43.707 --> 1:22:45.893 | |
translation. | |
1:22:45.805 --> 1:22:49.149 | |
Then, how we can do language modeling with
neural networks.
1:22:49.449 --> 1:22:55.617 | |
We looked at three different architectures:
we looked into the feed-forward language model
1:22:55.617 --> 1:22:59.063 | |
and the one based on restricted Boltzmann machines.
1:22:59.039 --> 1:23:05.366 | |
And finally, there are different architectures
for neural networks.
1:23:05.366 --> 1:23:14.404 | |
We have seen feed-forward networks, and we'll
see in the next lectures the last type of architecture.
1:23:15.915 --> 1:23:17.412 | |
Do you have any questions?
1:23:20.680 --> 1:23:27.341 | |
Then thanks a lot, and next Tuesday we
will meet here again.
0:00:01.301 --> 0:00:05.687 | |
Okay, so we're welcome to today's lecture. | |
0:00:06.066 --> 0:00:18.128 | |
A bit desperate in a small room and I'm sorry | |
for the inconvenience. | |
0:00:18.128 --> 0:00:25.820 | |
Sometimes there are project meetings where. | |
0:00:26.806 --> 0:00:40.863 | |
So what we want to talk today about is want | |
to start with neural approaches to machine | |
0:00:40.863 --> 0:00:42.964 | |
translation. | |
0:00:43.123 --> 0:00:55.779 | |
Guess I've heard about other types of neural | |
models for natural language processing. | |
0:00:55.779 --> 0:00:59.948 | |
This was some of the first. | |
0:01:00.600 --> 0:01:06.203 | |
They are similar to what you know they see | |
in as large language models. | |
0:01:06.666 --> 0:01:14.810 | |
And we want today look into what are these | |
neural language models, how we can build them, | |
0:01:14.810 --> 0:01:15.986 | |
what is the. | |
0:01:16.316 --> 0:01:23.002 | |
And first we'll show how to use them in statistical | |
machine translation. | |
0:01:23.002 --> 0:01:31.062 | |
If you remember weeks ago, we had this log-linear | |
model where you can integrate easily. | |
0:01:31.351 --> 0:01:42.756 | |
And that was how they first were used, so | |
we just had another model that evaluates how | |
0:01:42.756 --> 0:01:49.180 | |
good a system is or how good a lot of languages. | |
0:01:50.690 --> 0:02:04.468 | |
And next week we will go for a neuromachine | |
translation where we replace the whole model | |
0:02:04.468 --> 0:02:06.481 | |
by one huge. | |
0:02:11.211 --> 0:02:20.668 | |
So just as a member from Tuesday we've seen, | |
the main challenge in language modeling was | |
0:02:20.668 --> 0:02:25.131 | |
that most of the anthrax we haven't seen. | |
0:02:26.946 --> 0:02:34.167 | |
So this was therefore difficult to estimate | |
any probability because we've seen that yet | |
0:02:34.167 --> 0:02:39.501 | |
normally if you've seen had not seen the N | |
gram you will assign. | |
0:02:39.980 --> 0:02:53.385 | |
However, this is not really very good because | |
we don't want to give zero probabilities to | |
0:02:53.385 --> 0:02:55.023 | |
sentences. | |
0:02:55.415 --> 0:03:10.397 | |
And then we learned a lot of techniques and | |
that is the main challenge in statistical language. | |
0:03:10.397 --> 0:03:15.391 | |
How we can give somehow a good. | |
0:03:15.435 --> 0:03:23.835 | |
And they developed very specific, very good | |
techniques to deal with that. | |
0:03:23.835 --> 0:03:26.900 | |
However, this is the best. | |
0:03:28.568 --> 0:03:33.907 | |
And therefore we can do things different. | |
0:03:33.907 --> 0:03:44.331 | |
If we have not seen an N gram before in statistical | |
models, we have to have seen. | |
0:03:45.225 --> 0:03:51.361 | |
Before, and we can only get information from | |
exactly the same word. | |
0:03:51.411 --> 0:03:57.567 | |
We don't have an approximate matching like | |
that. | |
0:03:57.567 --> 0:04:10.255 | |
Maybe it stood together in some way or similar, | |
and in a sentence we might generalize the knowledge. | |
0:04:11.191 --> 0:04:21.227 | |
Would like to have more something like that | |
where engrams are represented more in a general | |
0:04:21.227 --> 0:04:21.990 | |
space. | |
0:04:22.262 --> 0:04:29.877 | |
So if you learn something about eyewalk then | |
maybe we can use this knowledge and also. | |
0:04:30.290 --> 0:04:43.034 | |
And thereby no longer treat all or at least | |
a lot of the ingrams as we've done before. | |
0:04:43.034 --> 0:04:45.231 | |
We can really. | |
0:04:47.047 --> 0:04:56.157 | |
And we maybe want to even do that in a more | |
hierarchical approach, but we know okay some | |
0:04:56.157 --> 0:05:05.268 | |
words are similar like go and walk is somehow | |
similar and and therefore like maybe if we | |
0:05:05.268 --> 0:05:07.009 | |
then merge them. | |
0:05:07.387 --> 0:05:16.104 | |
If we learn something about work, then it | |
should tell us also something about Hugo or | |
0:05:16.104 --> 0:05:17.118 | |
he walks. | |
0:05:17.197 --> 0:05:18.970 | |
We see already. | |
0:05:18.970 --> 0:05:22.295 | |
It's, of course, not so easy. | |
0:05:22.295 --> 0:05:31.828 | |
We see that there is some relations which | |
we need to integrate, for example, for you. | |
0:05:31.828 --> 0:05:35.486 | |
We need to add the S, but maybe. | |
0:05:37.137 --> 0:05:42.984 | |
And luckily there is one really yeah, convincing | |
methods in doing that. | |
0:05:42.963 --> 0:05:47.239 | |
And that is by using an evil neck or. | |
0:05:47.387 --> 0:05:57.618 | |
That's what we will introduce today so we | |
can use this type of neural networks to try | |
0:05:57.618 --> 0:06:04.042 | |
to learn this similarity and to learn how some | |
words. | |
0:06:04.324 --> 0:06:13.711 | |
And that is one of the main advantages that | |
we have by switching from the standard statistical | |
0:06:13.711 --> 0:06:15.193 | |
models to the. | |
0:06:15.115 --> 0:06:22.840 | |
To learn similarities between words and generalized | |
and learn what we call hidden representations. | |
0:06:22.840 --> 0:06:29.707 | |
So somehow representations of words where | |
we can measure similarity in some dimensions. | |
0:06:30.290 --> 0:06:42.275 | |
So in representations where as a tubically | |
continuous vector or a vector of a fixed size. | |
0:06:42.822 --> 0:06:52.002 | |
We had it before and we've seen that the only | |
thing we did is we don't want to do. | |
0:06:52.192 --> 0:06:59.648 | |
But these indices don't have any meaning, | |
so it wasn't that word five is more similar | |
0:06:59.648 --> 0:07:02.248 | |
to words twenty than to word. | |
0:07:02.582 --> 0:07:09.059 | |
So we couldn't learn anything about words | |
in the statistical model. | |
0:07:09.059 --> 0:07:12.107 | |
That's a big challenge because. | |
0:07:12.192 --> 0:07:24.232 | |
If you think about words even in morphology, | |
so go and go is more similar because the person. | |
0:07:24.264 --> 0:07:36.265 | |
While the basic models we have up to now, | |
they have no idea about that and goes as similar | |
0:07:36.265 --> 0:07:37.188 | |
to go. | |
0:07:39.919 --> 0:07:53.102 | |
So what we want to do today, in order to go | |
to this, we will have a short introduction. | |
0:07:53.954 --> 0:08:06.667 | |
It very short just to see how we use them | |
here, but that's the good thing that are important | |
0:08:06.667 --> 0:08:08.445 | |
for dealing. | |
0:08:08.928 --> 0:08:14.083 | |
And then we'll first look into feet forward, | |
new network language models. | |
0:08:14.454 --> 0:08:21.221 | |
And there we will still have this approximation | |
we had before, then we are looking only at | |
0:08:21.221 --> 0:08:22.336 | |
fixed windows. | |
0:08:22.336 --> 0:08:28.805 | |
So if you remember we have this classroom | |
of language models, and to determine what is | |
0:08:28.805 --> 0:08:33.788 | |
the probability of a word, we only look at | |
the past and minus one. | |
0:08:34.154 --> 0:08:36.878 | |
This is the theory of the case. | |
0:08:36.878 --> 0:08:43.348 | |
However, we have the ability and that's why | |
they're really better in order. | |
0:08:44.024 --> 0:08:51.953 | |
And then at the end we'll look at current | |
network language models where we then have | |
0:08:51.953 --> 0:08:53.166 | |
a different. | |
0:08:53.093 --> 0:09:01.922 | |
And thereby it is no longer the case that | |
we need to have a fixed history, but in theory | |
0:09:01.922 --> 0:09:04.303 | |
we can model arbitrary. | |
0:09:04.304 --> 0:09:06.854 | |
And we can log this phenomenon. | |
0:09:06.854 --> 0:09:12.672 | |
We talked about a Tuesday where it's not clear | |
what type of information. | |
0:09:16.396 --> 0:09:24.982 | |
So yeah, generally new networks are normally | |
learned to improve and perform some tasks. | |
0:09:25.325 --> 0:09:38.934 | |
We have this structure and we are learning | |
them from samples so that is similar to what | |
0:09:38.934 --> 0:09:42.336 | |
we had before so now. | |
0:09:42.642 --> 0:09:49.361 | |
And is somehow originally motivated by the | |
human brain. | |
0:09:49.361 --> 0:10:00.640 | |
However, when you now need to know artificial | |
neural networks, it's hard to get a similarity. | |
0:10:00.540 --> 0:10:02.884 | |
There seems to be not that important. | |
0:10:03.123 --> 0:10:11.013 | |
So what they are mainly doing is doing summoning | |
multiplication and then one linear activation. | |
0:10:12.692 --> 0:10:16.078 | |
So so the basic units are these type of. | |
0:10:17.937 --> 0:10:29.837 | |
Perceptron is a basic block which we have | |
and this does exactly the processing. | |
0:10:29.837 --> 0:10:36.084 | |
We have a fixed number of input features. | |
0:10:36.096 --> 0:10:39.668 | |
So we have here numbers six zero to x and | |
as input. | |
0:10:40.060 --> 0:10:48.096 | |
And this makes language processing difficult | |
because we know that it's not the case. | |
0:10:48.096 --> 0:10:53.107 | |
If we're dealing with language, it doesn't | |
have any. | |
0:10:54.114 --> 0:10:57.609 | |
So we have to model this somehow and understand | |
how we model this. | |
0:10:58.198 --> 0:11:03.681 | |
Then we have the weights, which are the parameters | |
and the number of weights exactly the same. | |
0:11:04.164 --> 0:11:15.069 | |
Of input features sometimes you have the spires | |
in there that always and then it's not really. | |
0:11:15.195 --> 0:11:19.656 | |
And what you then do is very simple. | |
0:11:19.656 --> 0:11:26.166 | |
It's just like the weight it sounds, so you | |
multiply. | |
0:11:26.606 --> 0:11:38.405 | |
What is then additionally important is we | |
have an activation function and it's important | |
0:11:38.405 --> 0:11:42.514 | |
that this activation function. | |
0:11:43.243 --> 0:11:54.088 | |
And later it will be important that this is | |
differentiable because otherwise all the training. | |
0:11:54.714 --> 0:12:01.471 | |
This model by itself is not very powerful. | |
0:12:01.471 --> 0:12:10.427 | |
We have the X Or problem and with this simple | |
you can't. | |
0:12:10.710 --> 0:12:15.489 | |
However, there is a very easy and nice extension. | |
0:12:15.489 --> 0:12:20.936 | |
The multi layer perception and things get | |
very powerful. | |
0:12:21.081 --> 0:12:32.953 | |
The thing is you just connect a lot of these | |
in these layers of structures where we have | |
0:12:32.953 --> 0:12:35.088 | |
the inputs and. | |
0:12:35.395 --> 0:12:47.297 | |
And then we can combine them, or to do them: | |
The input layer is of course given by your | |
0:12:47.297 --> 0:12:51.880 | |
problem with the dimension. | |
0:12:51.880 --> 0:13:00.063 | |
The output layer is also given by your dimension. | |
0:13:01.621 --> 0:13:08.802 | |
So let's start with the first question, now | |
more language related, and that is how we represent. | |
0:13:09.149 --> 0:13:19.282 | |
So we have seen here input to x, but the question | |
is now okay. | |
0:13:19.282 --> 0:13:23.464 | |
How can we put into this? | |
0:13:26.866 --> 0:13:34.123 | |
The first thing that we're able to do is we're | |
going to set it in the inspector. | |
0:13:34.314 --> 0:13:45.651 | |
Yeah, and that is not that easy because the | |
continuous vector will come to that. | |
0:13:45.651 --> 0:13:47.051 | |
We can't. | |
0:13:47.051 --> 0:13:50.410 | |
We don't want to do it. | |
0:13:50.630 --> 0:13:57.237 | |
But if we need to input the word into the | |
needle network, it has to be something easily | |
0:13:57.237 --> 0:13:57.912 | |
defined. | |
0:13:59.079 --> 0:14:11.511 | |
One is the typical thing, the one-hour encoded | |
vector, so we have a vector where the dimension | |
0:14:11.511 --> 0:14:15.306 | |
is the vocabulary, and then. | |
0:14:16.316 --> 0:14:25.938 | |
So the first thing you are ready to see that | |
means we are always dealing with fixed. | |
0:14:26.246 --> 0:14:34.961 | |
So you cannot easily extend your vocabulary, | |
but if you mean your vocabulary would increase | |
0:14:34.961 --> 0:14:37.992 | |
the size of this input vector,. | |
0:14:39.980 --> 0:14:42.423 | |
That's maybe also motivating. | |
0:14:42.423 --> 0:14:45.355 | |
We'll talk about bike parade going. | |
0:14:45.355 --> 0:14:47.228 | |
That's the nice thing. | |
0:14:48.048 --> 0:15:01.803 | |
The big advantage of this one putt encoding | |
is that we don't implement similarity between | |
0:15:01.803 --> 0:15:06.999 | |
words, but we're really learning. | |
0:15:07.227 --> 0:15:11.219 | |
So you need like to represent any words. | |
0:15:11.219 --> 0:15:15.893 | |
You need a dimension of and dimensional vector. | |
0:15:16.236 --> 0:15:26.480 | |
Imagine you could eat no binary encoding, | |
so you could represent words as binary vectors. | |
0:15:26.806 --> 0:15:32.348 | |
So you will be significantly more efficient. | |
0:15:32.348 --> 0:15:39.122 | |
However, you have some more digits than other | |
numbers. | |
0:15:39.559 --> 0:15:46.482 | |
Would somehow be bad because you would force | |
the one to do this and it's by hand not clear | |
0:15:46.482 --> 0:15:47.623 | |
how to define. | |
0:15:48.108 --> 0:15:55.135 | |
So therefore currently this is the most successful | |
approach to just do this one patch. | |
0:15:55.095 --> 0:15:59.344 | |
We take a fixed vocabulary. | |
0:15:59.344 --> 0:16:10.269 | |
We map each word to the initial and then we | |
represent a word like this. | |
0:16:10.269 --> 0:16:13.304 | |
The representation. | |
0:16:14.514 --> 0:16:27.019 | |
But this dimension here is a secondary size, | |
and if you think ten thousand that's quite | |
0:16:27.019 --> 0:16:33.555 | |
high, so we're always trying to be efficient. | |
0:16:33.853 --> 0:16:42.515 | |
And we are doing the same type of efficiency | |
because then we are having a very small one | |
0:16:42.515 --> 0:16:43.781 | |
compared to. | |
0:16:44.104 --> 0:16:53.332 | |
It can be still a maybe or neurons, but this | |
is significantly smaller, of course, as before. | |
0:16:53.713 --> 0:17:04.751 | |
So you are learning there this word as you | |
said, but you can learn it directly, and there | |
0:17:04.751 --> 0:17:07.449 | |
we have similarities. | |
0:17:07.807 --> 0:17:14.772 | |
But the nice thing is that this is then learned, | |
and we do not need to like hand define. | |
0:17:17.117 --> 0:17:32.377 | |
So yes, so that is how we're typically adding | |
at least a single word into the language world. | |
0:17:32.377 --> 0:17:43.337 | |
Then we can see: So we're seeing that you | |
have the one hard representation always of | |
0:17:43.337 --> 0:17:44.857 | |
the same similarity. | |
0:17:45.105 --> 0:18:00.803 | |
Then we're having this continuous vector which | |
is a lot smaller dimension and that's. | |
0:18:01.121 --> 0:18:06.984 | |
What we are doing then is learning these representations | |
so that they are best for language modeling. | |
0:18:07.487 --> 0:18:19.107 | |
So the representations are implicitly because | |
we're training on the language. | |
0:18:19.479 --> 0:18:30.115 | |
And the nice thing was found out later is | |
these representations are really, really good | |
0:18:30.115 --> 0:18:32.533 | |
for a lot of other. | |
0:18:33.153 --> 0:18:39.729 | |
And that is why they are now called word embedded | |
space themselves, and used for other tasks. | |
0:18:40.360 --> 0:18:49.827 | |
And they are somehow describing different | |
things so they can describe and semantic similarities. | |
0:18:49.789 --> 0:18:58.281 | |
We are looking at the very example of today | |
that you can do in this vector space by adding | |
0:18:58.281 --> 0:19:00.613 | |
some interesting things. | |
0:19:00.940 --> 0:19:11.174 | |
And so they got really was a first big improvement | |
when switching to neural staff. | |
0:19:11.491 --> 0:19:20.736 | |
They are like part of the model still with | |
more complex representation alert, but they | |
0:19:20.736 --> 0:19:21.267 | |
are. | |
0:19:23.683 --> 0:19:34.975 | |
Then we are having the output layer, and in | |
the output layer we also have output structure | |
0:19:34.975 --> 0:19:36.960 | |
and activation. | |
0:19:36.997 --> 0:19:44.784 | |
That is the language we want to predict, which | |
word should be the next. | |
0:19:44.784 --> 0:19:46.514 | |
We always have. | |
0:19:47.247 --> 0:19:56.454 | |
And that can be done very well with the softball | |
softbacked layer, where again the dimension. | |
0:19:56.376 --> 0:20:03.971 | |
Is the vocabulary, so this is a vocabulary | |
size, and again the case neuro represents the | |
0:20:03.971 --> 0:20:09.775 | |
case class, so in our case we have again a | |
one-hour representation. | |
0:20:10.090 --> 0:20:18.929 | |
Ours is a probability distribution and the | |
end is a probability distribution of all works. | |
0:20:18.929 --> 0:20:28.044 | |
The case entry tells us: So we need to have | |
some of our probability distribution at our | |
0:20:28.044 --> 0:20:36.215 | |
output, and in order to achieve that this activation | |
function goes, it needs to be that all the | |
0:20:36.215 --> 0:20:36.981 | |
outputs. | |
0:20:37.197 --> 0:20:47.993 | |
And we can achieve that with a softmax activation | |
we take each of the value and then. | |
0:20:48.288 --> 0:20:58.020 | |
So by having this type of activation function | |
we are really getting that at the end we always. | |
0:20:59.019 --> 0:21:12.340 | |
The beginning was very challenging because | |
again we have this inefficient representation | |
0:21:12.340 --> 0:21:15.184 | |
of our vocabulary. | |
0:21:15.235 --> 0:21:27.500 | |
And then you can imagine escalating over to | |
something over a thousand is maybe a bit inefficient | |
0:21:27.500 --> 0:21:29.776 | |
with cheap users. | |
0:21:36.316 --> 0:21:43.664 | |
And then yeah, for training the models, that | |
is how we refine, so we have this architecture | |
0:21:43.664 --> 0:21:44.063 | |
now. | |
0:21:44.264 --> 0:21:52.496 | |
We need to minimize the arrow by taking the | |
output. | |
0:21:52.496 --> 0:21:58.196 | |
We are comparing it to our targets. | |
0:21:58.298 --> 0:22:07.670 | |
So one important thing is, of course, how | |
can we measure the error? | |
0:22:07.670 --> 0:22:12.770 | |
So what if we're training the ideas? | |
0:22:13.033 --> 0:22:19.770 | |
And how well when measuring it is in natural | |
language processing, typically the cross entropy. | |
0:22:19.960 --> 0:22:32.847 | |
That means we are comparing the target with | |
the output, so we're taking the value multiplying | |
0:22:32.847 --> 0:22:35.452 | |
with the horizons. | |
0:22:35.335 --> 0:22:43.454 | |
Which gets optimized and you're seeing that | |
this, of course, makes it again very nice and | |
0:22:43.454 --> 0:22:49.859 | |
easy because our target, we said, is again | |
a one-hound representation. | |
0:22:50.110 --> 0:23:00.111 | |
So except for one, all of these are always | |
zero, and what we are doing is taking the one. | |
0:23:00.100 --> 0:23:05.970 | |
And we only need to multiply the one with | |
the logarism here, and that is all the feedback. | |
0:23:06.946 --> 0:23:14.194 | |
Of course, this is not always influenced by | |
all the others. | |
0:23:14.194 --> 0:23:17.938 | |
Why is this influenced by all? | |
0:23:24.304 --> 0:23:33.554 | |
Think Mac the activation function, which is | |
the current activation divided by some of the | |
0:23:33.554 --> 0:23:34.377 | |
others. | |
0:23:34.354 --> 0:23:44.027 | |
Because otherwise it could of course easily | |
just increase this value and ignore the others, | |
0:23:44.027 --> 0:23:49.074 | |
but if you increase one value or the other, | |
so. | |
0:23:51.351 --> 0:24:04.433 | |
And then we can do with neon networks one | |
very nice and easy type of training that is | |
0:24:04.433 --> 0:24:07.779 | |
done in all the neon. | |
0:24:07.707 --> 0:24:12.664 | |
So in which direction does the arrow show? | |
0:24:12.664 --> 0:24:23.152 | |
And then if we want to go to a smaller like | |
smaller arrow, that's what we want to achieve. | |
0:24:23.152 --> 0:24:27.302 | |
We're trying to minimize our arrow. | |
0:24:27.287 --> 0:24:32.875 | |
And we have to do that, of course, for all | |
the weights, and to calculate the error of | |
0:24:32.875 --> 0:24:36.709 | |
all the weights we want in the back of the | |
baggation here. | |
0:24:36.709 --> 0:24:41.322 | |
But what you can do is you can propagate the | |
arrow which you measured. | |
0:24:41.322 --> 0:24:43.792 | |
At the end you can propagate it back. | |
0:24:43.792 --> 0:24:46.391 | |
That's basic mass and basic derivation. | |
0:24:46.706 --> 0:24:59.557 | |
Then you can do each weight in your model | |
and measure how much it contributes to this | |
0:24:59.557 --> 0:25:01.350 | |
individual. | |
0:25:04.524 --> 0:25:17.712 | |
To summarize what your machine translation | |
should be, to understand all this problem is | |
0:25:17.712 --> 0:25:20.710 | |
that this is how a. | |
0:25:20.580 --> 0:25:23.056 | |
The notes are perfect thrones. | |
0:25:23.056 --> 0:25:28.167 | |
They are fully connected between two layers | |
and no connections. | |
0:25:28.108 --> 0:25:29.759 | |
Across layers. | |
0:25:29.829 --> 0:25:35.152 | |
And what they're doing is always just to wait | |
for some here and then an activation function. | |
0:25:35.415 --> 0:25:38.794 | |
And in order to train you have this sword | |
in backwards past. | |
0:25:39.039 --> 0:25:41.384 | |
So we put in here. | |
0:25:41.281 --> 0:25:46.540 | |
Our inputs have some random values at the | |
beginning. | |
0:25:46.540 --> 0:25:49.219 | |
They calculate the output. | |
0:25:49.219 --> 0:25:58.646 | |
We are measuring how big our error is, propagating | |
the arrow back, and then changing our model | |
0:25:58.646 --> 0:25:59.638 | |
in a way. | |
0:26:01.962 --> 0:26:14.267 | |
So before we're coming into the neural networks, | |
how can we use this type of neural network | |
0:26:14.267 --> 0:26:17.611 | |
to do language modeling? | |
0:26:23.103 --> 0:26:25.520 | |
So the question is now okay. | |
0:26:25.520 --> 0:26:33.023 | |
How can we use them in natural language processing | |
and especially in machine translation? | |
0:26:33.023 --> 0:26:38.441 | |
The first idea of using them was to estimate | |
the language model. | |
0:26:38.999 --> 0:26:42.599 | |
So we have seen that the output can be monitored | |
here as well. | |
0:26:43.603 --> 0:26:49.308 | |
Has a probability distribution, and if we | |
have a full vocabulary, we could mainly hear | |
0:26:49.308 --> 0:26:55.209 | |
estimate how probable each next word is, and | |
then use that in our language model fashion, | |
0:26:55.209 --> 0:27:02.225 | |
as we've done it last time, we've got the probability | |
of a full sentence as a product of all probabilities | |
0:27:02.225 --> 0:27:03.208 | |
of individual. | |
0:27:04.544 --> 0:27:06.695 | |
And UM. | |
0:27:06.446 --> 0:27:09.776 | |
That was done and in ninety seven years. | |
0:27:09.776 --> 0:27:17.410 | |
It's very easy to integrate it into this Locklear | |
model, so we have said that this is how the | |
0:27:17.410 --> 0:27:24.638 | |
Locklear model looks like, so we're searching | |
the best translation, which minimizes each | |
0:27:24.638 --> 0:27:25.126 | |
wage. | |
0:27:25.125 --> 0:27:26.371 | |
The feature value. | |
0:27:26.646 --> 0:27:31.642 | |
We have that with the minimum error training, | |
if you can remember when we search for the | |
0:27:31.642 --> 0:27:32.148 | |
optimal. | |
0:27:32.512 --> 0:27:40.927 | |
We have the phrasetable probabilities, the | |
language model, and we can just add here and | |
0:27:40.927 --> 0:27:41.597 | |
there. | |
0:27:41.861 --> 0:27:46.077 | |
So that is quite easy as said. | |
0:27:46.077 --> 0:27:54.101 | |
That was how statistical machine translation | |
was improved. | |
0:27:54.101 --> 0:27:57.092 | |
Add one more feature. | |
0:27:58.798 --> 0:28:11.220 | |
So how can we model the language mark for | |
Belty with your network? | |
0:28:11.220 --> 0:28:22.994 | |
So what we have to do is: And the problem | |
in generally in the head is that most we haven't | |
0:28:22.994 --> 0:28:25.042 | |
seen long sequences. | |
0:28:25.085 --> 0:28:36.956 | |
Mostly we have to beg off to very short sequences | |
and we are working on this discrete space where. | |
0:28:37.337 --> 0:28:48.199 | |
So the idea is if we have a meal network we | |
can map words into continuous representation | |
0:28:48.199 --> 0:28:50.152 | |
and that helps. | |
0:28:51.091 --> 0:28:59.598 | |
And the structure then looks like this, so | |
this is the basic still feed forward neural | |
0:28:59.598 --> 0:29:00.478 | |
network. | |
0:29:01.361 --> 0:29:10.744 | |
We are doing this at Proximation again, so | |
we are not putting in all previous words, but | |
0:29:10.744 --> 0:29:11.376 | |
it's. | |
0:29:11.691 --> 0:29:25.089 | |
And this is done because in your network we | |
can have only a fixed type of input, so we | |
0:29:25.089 --> 0:29:31.538 | |
can: Can only do a fixed set, and they are | |
going to be doing exactly the same in minus | |
0:29:31.538 --> 0:29:31.879 | |
one. | |
0:29:33.593 --> 0:29:41.026 | |
And then we have, for example, three words | |
and three different words, which are in these | |
0:29:41.026 --> 0:29:54.583 | |
positions: And then we're having the first | |
layer of the neural network, which learns words | |
0:29:54.583 --> 0:29:56.247 | |
and words. | |
0:29:57.437 --> 0:30:04.976 | |
There is one thing which is maybe special | |
compared to the standard neural memory. | |
0:30:05.345 --> 0:30:13.163 | |
So the representation of this word we want | |
to learn first of all position independence, | |
0:30:13.163 --> 0:30:19.027 | |
so we just want to learn what is the general | |
meaning of the word. | |
0:30:19.299 --> 0:30:26.244 | |
Therefore, the representation you get here | |
should be the same as if you put it in there. | |
0:30:27.247 --> 0:30:35.069 | |
The nice thing is you can achieve that in | |
networks the same way you achieve it. | |
0:30:35.069 --> 0:30:41.719 | |
This way you're reusing ears so we are forcing | |
them to always stay. | |
0:30:42.322 --> 0:30:49.689 | |
And that's why you then learn your word embedding, | |
which is contextual and independent, so. | |
0:30:49.909 --> 0:31:05.561 | |
So the idea is you have the diagram go home | |
and you don't want to use the context. | |
0:31:05.561 --> 0:31:07.635 | |
First you. | |
0:31:08.348 --> 0:31:14.155 | |
That of course it might have a different meaning | |
depending on where it stands, but learn that. | |
0:31:14.514 --> 0:31:19.623 | |
First, we're learning key representation of | |
the words, which is just the representation | |
0:31:19.623 --> 0:31:20.378 | |
of the word. | |
0:31:20.760 --> 0:31:37.428 | |
So it's also not like normally all input neurons | |
are connected to all neurons. | |
0:31:37.857 --> 0:31:47.209 | |
This is the first layer of representation, | |
and then we have a lot denser representation, | |
0:31:47.209 --> 0:31:56.666 | |
that is, our three word embeddings here, and | |
now we are learning this interaction between | |
0:31:56.666 --> 0:31:57.402 | |
words. | |
0:31:57.677 --> 0:32:08.265 | |
So now we have at least one connected, fully | |
connected layer here, which takes the three | |
0:32:08.265 --> 0:32:14.213 | |
imbedded input and then learns the new embedding. | |
0:32:15.535 --> 0:32:27.871 | |
And then if you had one of several layers | |
of lining which is your output layer, then. | |
0:32:28.168 --> 0:32:46.222 | |
So here the size is a vocabulary size, and | |
then you put as target what is the probability | |
0:32:46.222 --> 0:32:48.228 | |
for each. | |
0:32:48.688 --> 0:32:56.778 | |
The nice thing is that you learn everything | |
together, so you're not learning what is a | |
0:32:56.778 --> 0:32:58.731 | |
good representation. | |
0:32:59.079 --> 0:33:12.019 | |
When you are training the whole network together, | |
it learns what representation for a word you | |
0:33:12.019 --> 0:33:13.109 | |
get in. | |
0:33:15.956 --> 0:33:19.176 | |
It's Yeah That Is the Main Idea. | |
0:33:20.660 --> 0:33:32.695 | |
Nowadays often referred to as one way of self-supervised | |
learning, why self-supervisory learning? | |
0:33:33.053 --> 0:33:37.120 | |
The output is the next word and the input | |
is the previous word. | |
0:33:37.377 --> 0:33:46.778 | |
But somehow it's self-supervised because it's | |
not really that we created labels, but we artificially. | |
0:33:46.806 --> 0:34:01.003 | |
We just have pure text, and then we created | |
the task. | |
0:34:05.905 --> 0:34:12.413 | |
Say we have two sentences like go home again. | |
0:34:12.413 --> 0:34:18.780 | |
Second one is go to creative again, so both. | |
0:34:18.858 --> 0:34:31.765 | |
The starboard bygo and then we have to predict | |
the next four years and my question is: Be | |
0:34:31.765 --> 0:34:40.734 | |
modeled this ability as one vector with like | |
probability or possible works. | |
0:34:40.734 --> 0:34:42.740 | |
We have musical. | |
0:34:44.044 --> 0:34:56.438 | |
You have multiple examples, so you would twice | |
train, once you predict, once you predict, | |
0:34:56.438 --> 0:35:02.359 | |
and then, of course, the best performance. | |
0:35:04.564 --> 0:35:11.772 | |
A very good point, so you're not aggregating | |
examples beforehand, but you're taking each | |
0:35:11.772 --> 0:35:13.554 | |
example individually. | |
0:35:19.259 --> 0:35:33.406 | |
So what you do is you simultaneously learn | |
the projection layer which represents this | |
0:35:33.406 --> 0:35:39.163 | |
word and the N gram probabilities. | |
0:35:39.499 --> 0:35:48.390 | |
And what people then later analyzed is that | |
these representations are very powerful. | |
0:35:48.390 --> 0:35:56.340 | |
The task is just a very important task to | |
model like what is the next word. | |
0:35:56.816 --> 0:36:09.429 | |
It's a bit motivated by people saying in order | |
to get the meaning of the word you have to | |
0:36:09.429 --> 0:36:10.690 | |
look at. | |
0:36:10.790 --> 0:36:18.467 | |
If you read the text in there, which you have | |
never seen, you can still estimate the meaning | |
0:36:18.467 --> 0:36:22.264 | |
of this word because you know how it is used. | |
0:36:22.602 --> 0:36:26.667 | |
Just imagine you read this text about some | |
city. | |
0:36:26.667 --> 0:36:32.475 | |
Even if you've never seen the city before | |
heard, you often know from. | |
0:36:34.094 --> 0:36:44.809 | |
So what is now the big advantage of using | |
neural networks? | |
0:36:44.809 --> 0:36:57.570 | |
Just imagine we have to estimate this: So | |
you have to monitor the probability of ad hip | |
0:36:57.570 --> 0:37:00.272 | |
and now imagine iPhone. | |
0:37:00.600 --> 0:37:06.837 | |
So all the techniques we have at the last | |
time. | |
0:37:06.837 --> 0:37:14.243 | |
At the end, if you haven't seen iPhone, you | |
will always. | |
0:37:15.055 --> 0:37:19.502 | |
Because you haven't seen the previous words, | |
so you have no idea how to do that. | |
0:37:19.502 --> 0:37:24.388 | |
You won't have seen the diagram, the trigram | |
and all the others, so the probability here | |
0:37:24.388 --> 0:37:27.682 | |
will just be based on the probability of ad, | |
so it uses no. | |
0:37:28.588 --> 0:37:38.328 | |
If you're having this type of model, what | |
does it do so? | |
0:37:38.328 --> 0:37:43.454 | |
This is the last three words. | |
0:37:43.483 --> 0:37:49.837 | |
Maybe this representation is messed up because | |
it's mainly on a particular word or source | |
0:37:49.837 --> 0:37:50.260 | |
that. | |
0:37:50.730 --> 0:37:57.792 | |
Now anyway you have these two information | |
that were two words before was first and therefore: | |
0:37:58.098 --> 0:38:07.214 | |
So you have a lot of information here to estimate | |
how good it is. | |
0:38:07.214 --> 0:38:13.291 | |
Of course, there could be more information. | |
0:38:13.593 --> 0:38:25.958 | |
So all this type of modeling we can do and | |
that we couldn't do beforehand because we always. | |
0:38:27.027 --> 0:38:31.905 | |
Don't guess how we do it now. | |
0:38:31.905 --> 0:38:41.824 | |
Typically you would have one talking for awkward | |
vocabulary. | |
0:38:42.602 --> 0:38:45.855 | |
All you're doing by carrying coding when it | |
has a fixed dancing. | |
0:38:46.226 --> 0:38:49.439 | |
Yeah, you have to do something like that that | |
the opposite way. | |
0:38:50.050 --> 0:38:55.413 | |
So yeah, all the vocabulary are by thankcoding | |
where you don't have have all the vocabulary. | |
0:38:55.735 --> 0:39:07.665 | |
But then, of course, the back pairing coating | |
is better with arbitrary context because a | |
0:39:07.665 --> 0:39:11.285 | |
problem with back pairing. | |
0:39:17.357 --> 0:39:20.052 | |
Anymore questions to the basic same little | |
things. | |
0:39:23.783 --> 0:39:36.162 | |
This model we then want to continue is to | |
look into how complex that is or can make things | |
0:39:36.162 --> 0:39:39.155 | |
maybe more efficient. | |
0:39:40.580 --> 0:39:47.404 | |
At the beginning there was definitely a major | |
challenge. | |
0:39:47.404 --> 0:39:50.516 | |
It's still not that easy. | |
0:39:50.516 --> 0:39:58.297 | |
All guess follow the talk about their environmental | |
fingerprint. | |
0:39:58.478 --> 0:40:05.686 | |
So this calculation is normally heavy, and | |
if you build systems yourself, you have to | |
0:40:05.686 --> 0:40:06.189 | |
wait. | |
0:40:06.466 --> 0:40:15.412 | |
So it's good to know a bit about how complex | |
things are in order to do a good or efficient. | |
0:40:15.915 --> 0:40:24.706 | |
So one thing where most of the calculation | |
really happens is if you're. | |
0:40:25.185 --> 0:40:34.649 | |
So in generally all these layers, of course, | |
we're talking about networks and the zones | |
0:40:34.649 --> 0:40:35.402 | |
fancy. | |
0:40:35.835 --> 0:40:48.305 | |
So what you have to do in order to calculate | |
here these activations, you have this weight. | |
0:40:48.488 --> 0:41:05.021 | |
So to make it simple, let's see we have three | |
outputs, and then you just do a metric identification | |
0:41:05.021 --> 0:41:08.493 | |
between your weight. | |
0:41:08.969 --> 0:41:19.641 | |
That is why the use is so powerful for neural | |
networks because they are very good in doing | |
0:41:19.641 --> 0:41:22.339 | |
metric multiplication. | |
0:41:22.782 --> 0:41:28.017 | |
However, for some type of embedding layer | |
this is really very inefficient. | |
0:41:28.208 --> 0:41:37.547 | |
So in this input we are doing this calculation. | |
0:41:37.547 --> 0:41:47.081 | |
What we are mainly doing is selecting one | |
color. | |
0:41:47.387 --> 0:42:03.570 | |
So therefore you can do at least the forward | |
pass a lot more efficient if you don't really | |
0:42:03.570 --> 0:42:07.304 | |
do this calculation. | |
0:42:08.348 --> 0:42:20.032 | |
So the weight metrics of the first embedding | |
layer is just that in each color you have. | |
0:42:20.580 --> 0:42:30.990 | |
So this is how your initial weights look like | |
and how you can interpret or understand. | |
0:42:32.692 --> 0:42:42.042 | |
And this is already relatively important because | |
remember this is a huge dimensional thing, | |
0:42:42.042 --> 0:42:51.392 | |
so typically here we have the number of words | |
ten thousand, so this is the word embeddings. | |
0:42:51.451 --> 0:43:00.400 | |
Because it's the largest one there, we have | |
entries, while for the others we maybe have. | |
0:43:00.660 --> 0:43:03.402 | |
So they are a little bit efficient and are | |
important to make this in. | |
0:43:06.206 --> 0:43:10.529 | |
And then you can look at where else the calculations | |
are very difficult. | |
0:43:10.830 --> 0:43:20.294 | |
So here we have our individual network, so | |
here are the word embeddings. | |
0:43:20.294 --> 0:43:29.498 | |
Then we have one hidden layer, and then you | |
can look at how difficult. | |
0:43:30.270 --> 0:43:38.742 | |
We could save a lot of calculations by calculating | |
that by just doing like do the selection because: | |
0:43:40.600 --> 0:43:51.748 | |
And then the number of calculations you have | |
to do here is the length. | |
0:43:52.993 --> 0:44:06.206 | |
Then we have here the hint size that is the | |
hint size, so the first step of calculation | |
0:44:06.206 --> 0:44:10.260 | |
for this metric is an age. | |
0:44:10.730 --> 0:44:22.030 | |
Then you have to do some activation function | |
which is this: This is the hidden size hymn | |
0:44:22.030 --> 0:44:29.081 | |
because we need the vocabulary socks to calculate | |
the probability for each. | |
0:44:29.889 --> 0:44:40.474 | |
And if you look at this number, so if you | |
have a projection sign of one hundred and a | |
0:44:40.474 --> 0:44:45.027 | |
vocabulary sign of one hundred, you. | |
0:44:45.425 --> 0:44:53.958 | |
And that's why there has been especially at | |
the beginning some ideas on how we can reduce | |
0:44:53.958 --> 0:44:55.570 | |
the calculation. | |
0:44:55.956 --> 0:45:02.352 | |
And if we really need to calculate all our | |
capabilities, or if we can calculate only some. | |
0:45:02.582 --> 0:45:13.061 | |
And there again one important thing to think | |
about is for what you will use my language. | |
0:45:13.061 --> 0:45:21.891 | |
One can use it for generations and that's | |
where we will see the next week. | |
0:45:21.891 --> 0:45:22.480 | |
And. | |
0:45:23.123 --> 0:45:32.164 | |
Initially, if it's just used as a feature, | |
we do not want to use it for generation, but | |
0:45:32.164 --> 0:45:32.575 | |
we. | |
0:45:32.953 --> 0:45:41.913 | |
And there we might not be interested in all | |
the probabilities, but we already know all | |
0:45:41.913 --> 0:45:49.432 | |
the probability of this one word, and then | |
it might be very inefficient. | |
0:45:51.231 --> 0:45:53.638 | |
And how can you do that so initially? | |
0:45:53.638 --> 0:45:56.299 | |
For example, people look into shortlists. | |
0:45:56.756 --> 0:46:03.321 | |
So the idea was this calculation at the end | |
is really very expensive. | |
0:46:03.321 --> 0:46:05.759 | |
So can we make that more. | |
0:46:05.945 --> 0:46:17.135 | |
And the idea was okay, and most birds occur | |
very rarely, and some beef birds occur very, | |
0:46:17.135 --> 0:46:18.644 | |
very often. | |
0:46:19.019 --> 0:46:37.644 | |
And so they use the smaller imagery, which | |
is maybe very small, and then you merge a new. | |
0:46:37.937 --> 0:46:45.174 | |
So you're taking if the word is in the shortness, | |
so in the most frequent words. | |
0:46:45.825 --> 0:46:58.287 | |
You're taking the probability of this short | |
word by some normalization here, and otherwise | |
0:46:58.287 --> 0:46:59.656 | |
you take. | |
0:47:00.020 --> 0:47:00.836 | |
Course. | |
0:47:00.836 --> 0:47:09.814 | |
It will not be as good, but then we don't | |
have to calculate all the capabilities at the | |
0:47:09.814 --> 0:47:16.037 | |
end, but we only have to calculate it for the | |
most frequent. | |
0:47:19.599 --> 0:47:39.477 | |
Machines about that, but of course we don't | |
model the probability of the infrequent words. | |
0:47:39.299 --> 0:47:46.658 | |
And one idea is to do what is reported as | |
soles for the structure of the layer. | |
0:47:46.606 --> 0:47:53.169 | |
You see how some years ago people were very | |
creative in giving names to newer models. | |
0:47:53.813 --> 0:48:00.338 | |
And there the idea is that we model the out | |
group vocabulary as a clustered strip. | |
0:48:00.680 --> 0:48:08.498 | |
So you don't need to mold all of your bodies | |
directly, but you are putting words into. | |
0:48:08.969 --> 0:48:20.623 | |
A very intricate word is first in and then | |
in and then in and that is in sub-sub-clusters | |
0:48:20.623 --> 0:48:21.270 | |
and. | |
0:48:21.541 --> 0:48:29.936 | |
And this is what was mentioned in the past | |
of the work, so these are the subclasses that | |
0:48:29.936 --> 0:48:30.973 | |
always go. | |
0:48:30.973 --> 0:48:39.934 | |
So if it's in cluster one at the first position | |
then you only look at all the words which are: | |
0:48:40.340 --> 0:48:50.069 | |
And then you can calculate the probability | |
of a word again just by the product over these, | |
0:48:50.069 --> 0:48:55.522 | |
so the probability of the word is the first | |
class. | |
0:48:57.617 --> 0:49:12.331 | |
It's maybe more clear where you have the sole | |
architecture, so what you will do is first | |
0:49:12.331 --> 0:49:13.818 | |
predict. | |
0:49:14.154 --> 0:49:26.435 | |
Then you go to the appropriate sub-class, | |
then you calculate the probability of the sub-class. | |
0:49:27.687 --> 0:49:34.932 | |
Anybody have an idea why this is more, more | |
efficient, or if people do it first, it looks | |
0:49:34.932 --> 0:49:35.415 | |
more. | |
0:49:42.242 --> 0:49:56.913 | |
Yes, so you have to do less calculations, | |
or maybe here you have to calculate the element | |
0:49:56.913 --> 0:49:59.522 | |
there, but you. | |
0:49:59.980 --> 0:50:06.116 | |
The capabilities in the set classes that you're | |
going through and not for all of them. | |
0:50:06.386 --> 0:50:16.688 | |
Therefore, it's only more efficient if you | |
don't need all awkward preferences because | |
0:50:16.688 --> 0:50:21.240 | |
you have to even calculate the class. | |
0:50:21.501 --> 0:50:30.040 | |
So it's only more efficient in scenarios where | |
you really need to use a language to evaluate. | |
0:50:35.275 --> 0:50:54.856 | |
How this works is that on the output layer | |
you only have a vocabulary of: But on the input | |
0:50:54.856 --> 0:51:05.126 | |
layer you have always your full vocabulary | |
because at the input we saw that this is not | |
0:51:05.126 --> 0:51:06.643 | |
complicated. | |
0:51:06.906 --> 0:51:19.778 | |
And then you can cluster down all your words, | |
embedding series of classes, and use that as | |
0:51:19.778 --> 0:51:23.031 | |
your classes for that. | |
0:51:23.031 --> 0:51:26.567 | |
So yeah, you have words. | |
0:51:29.249 --> 0:51:32.593 | |
This is one idea of doing it.
0:51:32.593 --> 0:51:44.898 | |
There is also a second idea, which again
uses the fact that we don't always need a normalized probability.
0:51:45.025 --> 0:51:53.401 | |
So sometimes it doesn't really need to be
a proper probability to evaluate;
0:51:53.401 --> 0:52:05.492 | |
it's only important that better outputs get higher scores. This idea is called
self-normalization.
0:52:05.492 --> 0:52:19.349 | |
What people have done is the following: the softmax
is always the exponential of the input divided by the normalization term.
0:52:19.759 --> 0:52:25.194 | |
So this is how we calculate the softmax.
0:52:25.825 --> 0:52:42.224 | |
And in self-normalization now, the idea is
that the logarithm of the normalization term
0:52:42.102 --> 0:52:54.284 | |
should be zero, and then you don't even
have to calculate the normalization.
0:52:54.514 --> 0:53:01.016 | |
So how can we achieve that? | |
0:53:01.016 --> 0:53:08.680 | |
And then there's the nice thing. | |
0:53:09.009 --> 0:53:14.743 | |
Our normal loss aims to maximize probability.
0:53:14.743 --> 0:53:23.831 | |
We have this cross-entropy loss so that the probability of the correct word
gets higher, and now we're just adding a second loss.
0:53:24.084 --> 0:53:31.617 | |
And the second loss just tells the model: please
train in such a way that the log of the normalization term is zero.
0:53:32.352 --> 0:53:38.625 | |
So then if it's nearly zero at the end you | |
don't need to calculate this and it's also | |
0:53:38.625 --> 0:53:39.792 | |
very efficient. | |
0:53:40.540 --> 0:53:57.335 | |
One important thing: this only helps at inference,
so during testing we don't need to calculate the normalization, but during training we still do.
0:54:00.480 --> 0:54:15.006 | |
You have a bit of a hyperparameter here
where you do the weighting of how much emphasis
0:54:15.006 --> 0:54:16.843 | |
should be on each loss.
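A minimal sketch of such a self-normalized training loss, assuming the commonly used squared-log-normalizer penalty with a weighting hyperparameter `alpha`; the exact form in the lecture's references may differ.

```python
import numpy as np

def self_normalized_loss(logits, target, alpha=0.1):
    """Cross-entropy plus a penalty pushing log Z towards zero (assumed form)."""
    m = logits.max()
    log_Z = m + np.log(np.exp(logits - m).sum())   # log of the normalizer Z
    cross_entropy = log_Z - logits[target]         # -log softmax(logits)[target]
    return cross_entropy + alpha * log_Z ** 2      # alpha weights the two losses

# At inference, if log Z is close to zero, the unnormalized score logits[target]
# can be used directly as an approximate log-probability.
```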
0:54:18.318 --> 0:54:35.037 | |
The only disadvantage is that there is no speed-up
during training, and there are other ways
0:54:35.037 --> 0:54:37.887 | |
of doing that.
0:54:41.801 --> 0:54:43.900 | |
I'm with you all. | |
0:54:44.344 --> 0:54:48.540 | |
Then we are coming, very briefly, to
this one here.
0:54:48.828 --> 0:54:53.692 | |
There are more things on different types of
language models.
0:54:53.692 --> 0:54:58.026 | |
We are having a very short look at restricted Boltzmann machines.
0:54:58.298 --> 0:55:09.737 | |
And then we'll talk about recurrent neural
networks for our language models, because they
0:55:09.737 --> 0:55:17.407 | |
have the advantage that we can even further
improve the context.
0:55:18.238 --> 0:55:24.395 | |
There are also different types of neural networks.
0:55:24.395 --> 0:55:30.175 | |
These Boltzmann machines don't have separate inputs and outputs.
0:55:30.330 --> 0:55:39.271 | |
They have these binary units, and they define
an energy function on the network, which can,
0:55:39.271 --> 0:55:46.832 | |
in the case of restricted Boltzmann machines, be efficiently
calculated; and restricted means:
0:55:46.832 --> 0:55:53.148 | |
You only have connections between the input | |
and the hidden layer. | |
0:55:53.393 --> 0:56:00.190 | |
So you see here you don't have an input and an output;
you just have an input and you calculate a score for it.
0:56:00.460 --> 0:56:16.429 | |
Which of course nicely fits with the idea
we're having, so you can use this for n-gram
0:56:16.429 --> 0:56:19.182 | |
language models.
0:56:19.259 --> 0:56:25.187 | |
So you can calculate the probability of the input with
this type of neural network.
0:56:26.406 --> 0:56:30.582 | |
And the advantage of this type of model is
that it is:
0:56:30.550 --> 0:56:38.629 | |
Very fast to integrate it, so that one was | |
the first one which was used during decoding. | |
0:56:38.938 --> 0:56:50.103 | |
The problem is that the other neural n-gram language
models were not fast enough at performing this calculation during decoding.
0:56:50.230 --> 0:57:00.114 | |
So what people typically did is, we talked
about an n-best list: they generated the n most
0:57:00.114 --> 0:57:05.860 | |
probable outputs, and then they scored each
entry with
0:57:06.146 --> 0:57:10.884 | |
the neural language model, and then only changed
the order of the entries based on that score.
0:57:11.231 --> 0:57:20.731 | |
The n-best list has maybe only a hundred entries,
while during decoding you would look at several
0:57:20.731 --> 0:57:21.787 | |
thousand hypotheses.
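A minimal sketch of this n-best rescoring step, assuming a hypothetical `lm_logprob` scoring function and a simple weighted combination with the decoder score.

```python
def rerank(nbest, lm_logprob, weight=0.5):
    """Rescore an n-best list of (hypothesis, decoder_score) pairs and reorder."""
    scored = [(score + weight * lm_logprob(hyp), hyp) for hyp, score in nbest]
    scored.sort(key=lambda t: t[0], reverse=True)   # best combined score first
    return [hyp for _, hyp in scored]
```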
0:57:26.186 --> 0:57:40.437 | |
So much for this, but let's look at the context: we
have now seen neural n-gram language models.
0:57:40.437 --> 0:57:43.726 | |
There is one big limitation.
0:57:44.084 --> 0:57:57.552 | |
Remember, in n-gram language models the context is not always n minus one words,
because sometimes you have to back off or interpolate
0:57:57.552 --> 0:57:59.953 | |
with lower-order n-grams.
0:58:00.760 --> 0:58:05.504 | |
However, in neural models we always use all
of these inputs.
0:58:07.147 --> 0:58:21.262 | |
The disadvantage is that you are still limited
in your context, and if you remember the example sentence
0:58:21.262 --> 0:58:23.008 | |
from last time:
0:58:22.882 --> 0:58:28.445 | |
Sometimes you need more context; there is in principle
unlimited context that you might need, and
0:58:28.445 --> 0:58:34.838 | |
you can always create sentences where you need
this full context in order to make a good estimation.
0:58:35.315 --> 0:58:44.955 | |
Can we also do it differently? In order to better
understand that, it makes sense to view language modeling as sequence labeling.
0:58:45.445 --> 0:58:57.621 | |
So sequence labeling tasks are a very common
type of task in natural language processing
0:58:57.621 --> 0:59:03.438 | |
where you have an input sequence and then one
0:59:03.323 --> 0:59:08.663 | |
output token, so you have one output for each
input. So machine translation is not a sequence
0:59:08.663 --> 0:59:14.063 | |
labeling task, because the number of inputs
and the number of outputs is different: you
0:59:14.063 --> 0:59:19.099 | |
put in a German sentence which has five words,
and the output can be six or seven words or more.
0:59:19.619 --> 0:59:20.155 | |
In sequence
0:59:20.155 --> 0:59:24.083 | |
labeling you always have the same number of inputs
and the same number of outputs.
0:59:24.944 --> 0:59:40.940 | |
And you can model language modeling as that:
you just say the label for each word is always
0:59:40.940 --> 0:59:43.153 | |
the next word.
0:59:45.705 --> 0:59:54.823 | |
This is the more general view; you can think of
it, for example, as part-of-speech tagging or named entity
0:59:54.823 --> 0:59:56.202 | |
recognition.
0:59:58.938 --> 1:00:08.081 | |
And if you look now at the output tokens, in
general sequence labeling they can depend on all input
1:00:08.081 --> 1:00:08.912 | |
tokens.
1:00:09.869 --> 1:00:11.260 | |
The nice thing is:
1:00:11.260 --> 1:00:21.918 | |
In our case, the output tokens are the same | |
so we can easily model it that they only depend | |
1:00:21.918 --> 1:00:24.814 | |
on all the input tokens. | |
1:00:24.814 --> 1:00:28.984 | |
So we have this whether it's or so. | |
1:00:31.011 --> 1:00:42.945 | |
But we can also look at one specific
type of sequence labeling: unidirectional sequence
1:00:42.945 --> 1:00:44.188 | |
labeling.
1:00:44.584 --> 1:00:58.215 | |
And that's exactly what we want for language modeling:
the next word only depends on all the previous
1:00:58.215 --> 1:01:00.825 | |
words that we have seen.
1:01:01.321 --> 1:01:12.899 | |
Of course, that's not completely true
in language, in that the following context might also
1:01:12.899 --> 1:01:14.442 | |
be helpful.
1:01:14.654 --> 1:01:22.468 | |
But we always model the probability of a
word given its history, and therefore we only
1:01:22.468 --> 1:01:23.013 | |
need the previous words.
1:01:23.623 --> 1:01:29.896 | |
And currently we did there this approximation | |
in sequence labeling that we have this windowing | |
1:01:29.896 --> 1:01:30.556 | |
approach. | |
1:01:30.951 --> 1:01:43.975 | |
So in order to predict this word we
always look at the previous three words, and
1:01:43.975 --> 1:01:48.416 | |
then to predict the next one we again shift the window.
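A minimal sketch of this windowing approach, assuming a hypothetical `model.predict_next` that takes exactly the n-1 previous tokens.

```python
def predict_sentence(words, model, n=4):
    """Slide a fixed window over the sentence: each position is predicted
    from exactly the n-1 previous tokens, as in a feed-forward n-gram LM."""
    history = ["<s>"] * (n - 1) + list(words)   # pad the beginning of the sentence
    return [model.predict_next(history[i:i + n - 1]) for i in range(len(words))]
```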
1:01:49.389 --> 1:01:55.137 | |
If you are into neural networks, you recognize
this type of structure.
1:01:55.137 --> 1:01:57.519 | |
This is also the typical feed-forward neural network.
1:01:58.938 --> 1:02:09.688 | |
Yes, so this is like the n-gram language model we saw before,
and at least in some way, compared to the original,
1:02:09.688 --> 1:02:12.264 | |
you're always looking at a fixed window.
1:02:14.334 --> 1:02:30.781 | |
However, there are also other types of neural
network structures which we can use for sequence labeling.
1:02:32.812 --> 1:02:34.678 | |
And that we can do as follows.
1:02:34.678 --> 1:02:39.686 | |
The idea in the recurrent neural network structure is that
1:02:39.686 --> 1:02:43.221 | |
we are saving the complete history.
1:02:43.623 --> 1:02:55.118 | |
So again we have to use this fixed-size
representation, because neural networks always
1:02:55.118 --> 1:02:56.947 | |
need to have fixed-size inputs.
1:02:57.157 --> 1:03:05.258 | |
And then we start with an initial value for | |
our storage. | |
1:03:05.258 --> 1:03:15.917 | |
We are giving our first input and then calculating | |
the new representation. | |
1:03:16.196 --> 1:03:26.328 | |
If you look at this, it's just again a neural
network with two types of inputs: your word
1:03:26.328 --> 1:03:29.743 | |
and your initial hidden state.
1:03:30.210 --> 1:03:46.468 | |
Then you can apply it to the next input,
and you again get a new hidden state.
1:03:47.367 --> 1:03:53.306 | |
The nice thing is that you can now do this step
by step by step, all the way over the sequence.
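A minimal sketch of this step-by-step update, using a plain tanh recurrence; real implementations add an output layer on top of each state and, as discussed later, gating.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrent step: the new state mixes the previous state and the input."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def encode(inputs, W_h, W_x, b, h0):
    """Run over the sequence; every state carries information from all earlier words."""
    h, states = h0, []
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states
```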
1:03:55.495 --> 1:04:05.245 | |
The nice thing that we are having here now
is that we are having context information from
1:04:05.245 --> 1:04:07.195 | |
all the previous words.
1:04:07.607 --> 1:04:13.582 | |
So if you look at which words you use here
to calculate the probability of this
1:04:13.582 --> 1:04:14.180 | |
word:
1:04:14.554 --> 1:04:20.128 | |
It is based on this hidden state,
1:04:20.128 --> 1:04:33.083 | |
and this hidden state was influenced
by the previous one, and so on along the whole path.
1:04:33.473 --> 1:04:37.798 | |
So now we're having something new. | |
1:04:37.798 --> 1:04:46.449 | |
We can really model the word probability not | |
only on a fixed context. | |
1:04:46.906 --> 1:04:53.570 | |
Because the hidden states we're having here in
our RNN are influenced by all the previous words.
1:04:56.296 --> 1:05:00.909 | |
So what does that mean?
1:05:00.909 --> 1:05:16.288 | |
If you're now thinking about the histories and
clustering: we said before that we need to cluster histories.
1:05:16.736 --> 1:05:24.261 | |
So we do not need to do any explicit clustering here,
and we also see how things are put together
1:05:24.261 --> 1:05:26.273 | |
in order to really do the estimation.
1:05:29.489 --> 1:05:43.433 | |
In the green box this way since we are starting | |
from the left point to the right. | |
1:05:44.524 --> 1:05:48.398 | |
And that's right, so histories are clustered in
some way.
1:05:48.398 --> 1:05:58.196 | |
Here some type of clustering is happening:
it's continuous representations, where a small
1:05:58.196 --> 1:06:02.636 | |
difference doesn't matter that much.
1:06:02.636 --> 1:06:10.845 | |
So if you have a lot of different histories, | |
the similarity. | |
1:06:11.071 --> 1:06:15.791 | |
Because in order to do the final prediction,
you only do it based on the green box.
1:06:16.156 --> 1:06:24.284 | |
So you are now again still learning some type
of clustering,
1:06:24.284 --> 1:06:30.235 | |
but you don't have to make a hard decision.
1:06:30.570 --> 1:06:39.013 | |
The only restriction you are imposing is that you
have to store everything that is important in this hidden state.
1:06:39.359 --> 1:06:54.961 | |
So it's a different type of limitation: you
no longer calculate the probability based only on the
1:06:54.961 --> 1:06:57.138 | |
last words.
1:06:57.437 --> 1:07:09.645 | |
But you still need to compress things
in order to do it efficiently.
1:07:09.970 --> 1:07:25.311 | |
But this is where things get merged together,
in this type of hidden representation, which
1:07:25.311 --> 1:07:28.038 | |
is based
1:07:28.288 --> 1:07:33.104 | |
on the previous words but is also some kind of
bottleneck in order to make a good estimation.
1:07:34.474 --> 1:07:41.242 | |
So the idea is that we can store all our history
in one vector.
1:07:41.581 --> 1:07:47.351 | |
Which is very good and makes it more strong. | |
1:07:47.351 --> 1:07:51.711 | |
Next we come to problems of that. | |
1:07:51.711 --> 1:07:57.865 | |
Of course, at some point it might be difficult. | |
1:07:58.398 --> 1:08:02.230 | |
Then maybe things get all overwritten, or | |
you cannot store everything in there. | |
1:08:02.662 --> 1:08:04.514 | |
So,
1:08:04.184 --> 1:08:10.252 | |
for short things like single
sentences that works well, but especially if
1:08:10.252 --> 1:08:16.184 | |
you think of other tasks like summarization,
or document-level MT, where you need
1:08:16.184 --> 1:08:22.457 | |
to consider a full document, these things get
a bit more complicated, and we will learn another
1:08:22.457 --> 1:08:23.071 | |
type of model.
1:08:24.464 --> 1:08:30.455 | |
Furthermore, in order to understand these
networks, it's good to always have both views.
1:08:30.710 --> 1:08:39.426 | |
So this is the unrolled view, so you have this
type of network.
1:08:39.426 --> 1:08:48.532 | |
Alternatively, it can be shown as: we have here
the output and here's your network, which is
1:08:48.532 --> 1:08:52.091 | |
connected to itself, and that is the recurrent view.
1:08:56.176 --> 1:09:11.033 | |
There is one challenge in these networks, and
that is the training; so the next question is how to train
1:09:11.033 --> 1:09:11.991 | |
them.
1:09:12.272 --> 1:09:20.147 | |
At first we don't really know how to
train them, but if you unroll them like this:
1:09:20.540 --> 1:09:38.054 | |
It's exactly the same as a feed-forward network, so you can measure your
errors and then back-propagate the errors.
1:09:38.378 --> 1:09:45.647 | |
Now the nice thing is, if you unroll something,
it's a feed-forward network and you can train it.
1:09:46.106 --> 1:09:56.493 | |
The only important thing is, of course, that for
different input lengths the unrolled network looks different; you have to take that into
1:09:56.493 --> 1:09:57.555 | |
account.
1:09:57.837 --> 1:10:07.621 | |
But since the parameters are shared, it's somehow
similar and you can train it with the same training
1:10:07.621 --> 1:10:08.817 | |
algorithm.
1:10:10.310 --> 1:10:16.113 | |
One thing which makes things difficult is | |
what is referred to as the vanishing gradient. | |
1:10:16.113 --> 1:10:21.720 | |
So we are saying there is a big advantage | |
of these models and that's why we are using | |
1:10:21.720 --> 1:10:22.111 | |
that. | |
1:10:22.111 --> 1:10:27.980 | |
The output here does not only depend on the | |
current input of a last three but on anything | |
1:10:27.980 --> 1:10:29.414 | |
that was said before. | |
1:10:29.809 --> 1:10:32.803 | |
That's a very strong property and is the motivation
for using RNNs.
1:10:33.593 --> 1:10:44.599 | |
However, if you're using standard RNNs, the influence
of words far in the past gets smaller and smaller.
1:10:44.804 --> 1:10:55.945 | |
Because the gradients get smaller and smaller,
the error here, propagated back to this early step,
1:10:55.945 --> 1:10:59.659 | |
contributes only very little to the update.
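A toy numeric illustration of this effect, assuming for simplicity a constant per-step factor on the gradient:

```python
# Back-propagating through many time steps multiplies the gradient by a
# per-step factor; with a factor below one it shrinks towards zero.
factor = 0.8          # assumed magnitude of the per-step Jacobian
grad = 1.0
for _ in range(50):
    grad *= factor
print(grad)           # ~1.4e-05: words 50 steps back barely affect the update
```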
1:11:00.020 --> 1:11:06.710 | |
And yeah, that's why standard RNNs are
difficult to train.
1:11:07.247 --> 1:11:11.481 | |
So if we are talking about RNNs nowadays:
1:11:11.791 --> 1:11:19.532 | |
What we typically mean are long short-term
memories, LSTMs.
1:11:19.532 --> 1:11:30.931 | |
You see they are by now quite old already, but
they have special gating mechanisms.
1:11:31.171 --> 1:11:41.911 | |
So in the language modeling task, for example, they can
store information like whether this sentence
1:11:41.911 --> 1:11:44.737 | |
started as a question.
1:11:44.684 --> 1:11:51.886 | |
Because if you only look at the last
five words, it's often no longer clear whether it is a
1:11:51.886 --> 1:11:52.556 | |
normal sentence or a question.
1:11:53.013 --> 1:12:06.287 | |
So there you have these gating mechanisms
in order to store things for a longer
1:12:06.287 --> 1:12:08.571 | |
time in your memory.
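A minimal sketch of such a gated update (a standard LSTM cell written from the usual textbook equations rather than from the slides); the packing of all four gate parameters into `W`, `U`, `b` is an assumption for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: gates decide what to forget, what to write, what to expose."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)                        # input, forget, output, candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # keep old memory and/or write new
    h = sigmoid(o) * np.tanh(c)                        # expose part of the memory cell
    return h, c
```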
1:12:10.730 --> 1:12:20.147 | |
They are still used in quite a lot of works.
1:12:21.541 --> 1:12:30.487 | |
Especially for text machine translation, the
standard now is to use Transformer-based models.
1:12:30.690 --> 1:12:42.857 | |
But, for example, for this type of architecture
we will have one lecture later about efficiency.
1:12:42.882 --> 1:12:53.044 | |
And there, in the decoder, parts of the networks
are still using RNNs.
1:12:53.473 --> 1:12:57.542 | |
So it's not that RNNs are of no importance.
1:12:59.239 --> 1:13:08.956 | |
In order to make them strong, there are some
more things which are helpful and should be mentioned:
1:13:09.309 --> 1:13:19.668 | |
So one thing is a very easy and nice trick
to make these neural networks stronger and better.
1:13:19.739 --> 1:13:21.619 | |
So, of course, it doesn't always work;
1:13:21.619 --> 1:13:23.451 | |
you have to have enough training data.
1:13:23.763 --> 1:13:29.583 | |
But in general, the easiest way of
making your model bigger and stronger is to
1:13:29.583 --> 1:13:30.598 | |
increase your number of parameters.
1:13:30.630 --> 1:13:43.244 | |
And you've seen that with the large language models:
people are always bragging about their size.
1:13:43.903 --> 1:13:53.657 | |
This is one way so the question is how do | |
you get more parameters? | |
1:13:53.657 --> 1:14:05.951 | |
There are two ways: you can make your representations
wider, or, and that is where deep learning gets its name,
1:14:05.951 --> 1:14:10.020 | |
you can make your networks deeper.
1:14:11.471 --> 1:14:13.831 | |
And then you also get more parameters that way.
1:14:14.614 --> 1:14:19.931 | |
There's one problem with this, with more and
deeper networks.
1:14:19.931 --> 1:14:23.330 | |
It's very similar to what we saw with RNNs.
1:14:23.603 --> 1:14:34.755 | |
With them we have this problem of gradient flow:
as it flows through many layers, the gradient gets
1:14:34.755 --> 1:14:35.475 | |
very small.
1:14:35.795 --> 1:14:41.114 | |
Exactly the same thing happens in deep networks.
1:14:41.114 --> 1:14:52.285 | |
If you take the gradient, which tells you whether the output was
right or wrong, then you're propagating it back through the layers.
1:14:52.612 --> 1:14:53.228 | |
For three layers
1:14:53.228 --> 1:14:56.440 | |
It's no problem, but if you're going to ten, | |
twenty or a hundred layers. | |
1:14:57.797 --> 1:14:59.690 | |
That is getting typically a problem. | |
1:15:00.060 --> 1:15:10.659 | |
What people are doing is they are using what are
called residual connections.
1:15:10.659 --> 1:15:15.885 | |
That's a very helpful idea.
1:15:15.956 --> 1:15:20.309 | |
And so the idea is that these layers
1:15:20.320 --> 1:15:30.694 | |
in between no longer calculate a completely
new representation, but they calculate
1:15:30.694 --> 1:15:31.386 | |
only what should change.
1:15:31.731 --> 1:15:37.585 | |
And therefore, in the end, the output of a layer
is always added to its input.
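A minimal sketch of a residual connection, with a stand-in `sublayer` function:

```python
import numpy as np

def residual_block(x, sublayer):
    """The sublayer only learns the change to its input; the identity path
    lets gradients flow unchanged through very deep stacks."""
    return x + sublayer(x)

# toy usage with a stand-in transformation
y = residual_block(np.ones(4), lambda v: np.tanh(v))
```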
1:15:38.318 --> 1:15:48.824 | |
The nice thing is that later, if you are doing
back-propagation, the gradient can flow back very fast along this path.
1:15:49.209 --> 1:16:01.896 | |
So that is what you're seeing nowadays in
very deep architectures, not only in RNNs:
1:16:01.896 --> 1:16:04.229 | |
you always have these residual connections.
1:16:04.704 --> 1:16:07.388 | |
This has two advantages.
1:16:07.388 --> 1:16:15.304 | |
On the one hand, it's easier to learn the
representation.
1:16:15.304 --> 1:16:18.792 | |
On the other hand, the gradients flow better.
1:16:22.082 --> 1:16:24.114 | |
Goods. | |
1:16:23.843 --> 1:16:31.763 | |
So much for the neural networks; the
last thing for today is this.
1:16:31.671 --> 1:16:36.400 | |
At first, neural language models were used inside the systems themselves.
1:16:36.400 --> 1:16:46.707 | |
Now we're seeing them again, but one thing
that at the beginning was very essential are the word embeddings.
1:16:46.967 --> 1:16:57.655 | |
So people really trained neural language
models only to get this type of embedding,
1:16:57.655 --> 1:17:04.166 | |
and therefore we want to look a bit more into
these.
1:17:09.229 --> 1:17:13.456 | |
Some last words on the word embeddings.
1:17:13.456 --> 1:17:22.117 | |
The interesting thing is that word embeddings | |
can be used for very different tasks. | |
1:17:22.117 --> 1:17:27.170 | |
The advantage lies in how we can train the word embeddings.
1:17:27.347 --> 1:17:31.334 | |
The nice thing is you can train them on just large
amounts of plain data.
1:17:31.931 --> 1:17:40.937 | |
And then, if you have these word embeddings,
you don't have an input layer of ten thousand dimensions any
1:17:40.937 --> 1:17:41.566 | |
more.
1:17:41.982 --> 1:17:52.231 | |
So then you can train a small model to do
any other task, and therefore you're more efficient.
1:17:52.532 --> 1:17:58.761 | |
Initial word embeddings really depend only | |
on the word itself. | |
1:17:58.761 --> 1:18:07.363 | |
If you look at the two meanings of can, the
can of beans, or can as in they can do that, both have
1:18:07.363 --> 1:18:08.747 | |
the same embedding.
1:18:09.189 --> 1:18:12.395 | |
That cannot be resolved. | |
1:18:12.395 --> 1:18:23.939 | |
Therefore, you need to know the context, and
the higher layers do take
1:18:23.939 --> 1:18:27.916 | |
the context into account, but the first embedding layer does not.
1:18:29.489 --> 1:18:33.757 | |
However, even this one has quite interesting properties.
1:18:34.034 --> 1:18:44.644 | |
People like to visualize them; that is always
a bit difficult, because if you look at this
1:18:44.644 --> 1:18:47.182 | |
word vector, or word embedding,
1:18:47.767 --> 1:18:52.879 | |
And drawing your five hundred dimensional | |
vector is still a bit challenging. | |
1:18:53.113 --> 1:19:12.464 | |
So you cannot directly do that, so what people
have to do is learn some type of dimensionality reduction.
1:19:13.073 --> 1:19:17.216 | |
And of course then yes some information gets | |
lost but you can try it. | |
1:19:18.238 --> 1:19:28.122 | |
And you see, for example, this is the most
famous and common example: what you can
1:19:28.122 --> 1:19:37.892 | |
do is look at the difference between
the male and the female form of an English word.
1:19:38.058 --> 1:19:40.389 | |
And you can do that for very different words.
1:19:40.780 --> 1:19:45.403 | |
And that is where the math comes into
it; that is what people then look into.
1:19:45.725 --> 1:19:50.995 | |
So what you can now, for example, do is you | |
can calculate the difference between man and | |
1:19:50.995 --> 1:19:51.410 | |
woman. | |
1:19:52.232 --> 1:19:56.356 | |
And what you can do then is you can take the
embedding of king.
1:19:56.356 --> 1:20:02.378 | |
You can add to it the difference between man
and woman, and here is where people get really excited.
1:20:02.378 --> 1:20:05.586 | |
Then you can look at what are the similar | |
words. | |
1:20:05.586 --> 1:20:09.252 | |
So you won't, of course, directly hit the | |
correct word. | |
1:20:09.252 --> 1:20:10.495 | |
It's a continuous space.
1:20:10.790 --> 1:20:24.062 | |
But you can look at what are the nearest neighbors
of this new vector, and often the word you are looking for is nearby.
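A minimal sketch of this nearest-neighbor lookup in an assumed `embeddings` dictionary, using the king - man + woman example:

```python
import numpy as np

def nearest(query, embeddings, exclude=()):
    """Cosine nearest neighbor; `embeddings` is an assumed dict word -> vector."""
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = float(vec @ query) / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-9)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# The classic analogy: king - man + woman should land near queen, e.g.
# query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# nearest(query, embeddings, exclude={"king", "man", "woman"})
```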
1:20:24.224 --> 1:20:33.911 | |
So it's somehow surprising that the difference
between these words is always roughly the same.
1:20:34.374 --> 1:20:37.308 | |
You can do similar things with other relations.
1:20:37.308 --> 1:20:47.520 | |
You can also imagine that this works for tenses,
like swimming and swam, or walking and
1:20:47.520 --> 1:20:49.046 | |
walked.
1:20:49.469 --> 1:20:53.040 | |
So you can try to use him. | |
1:20:53.040 --> 1:20:56.346 | |
It's no longer like say. | |
1:20:56.346 --> 1:21:04.016 | |
The interesting thing is that nobody taught the model
this principle.
1:21:04.284 --> 1:21:09.910 | |
It's purely trained on the task of doing
the next-word prediction.
1:21:10.230 --> 1:21:23.669 | |
And it even works for knowledge like capitals:
you can look at the difference between a country and its capital.
1:21:23.823 --> 1:21:33.760 | |
Here is another visualization where you have
done the same thing on that difference.
1:21:33.853 --> 1:21:41.342 | |
And you see it's not perfect, but it's pointing
in the right direction, so you can even use that for
1:21:41.342 --> 1:21:42.936 | |
question answering.
1:21:42.936 --> 1:21:50.345 | |
If you know, for three countries, the capital,
you can compute what is the difference between them.
1:21:50.345 --> 1:21:53.372 | |
You apply that to a new country, and you get a guess for its capital.
1:21:54.834 --> 1:22:02.280 | |
So these models are able to really learn a | |
lot of information and collapse this information | |
1:22:02.280 --> 1:22:04.385 | |
into this representation. | |
1:22:05.325 --> 1:22:07.679 | |
And that just from doing next-word prediction.
1:22:07.707 --> 1:22:22.358 | |
And that also explains a bit, or maybe not explains
but strongly motivates, what is the main advantage
1:22:22.358 --> 1:22:26.095 | |
of this type of neural model.
1:22:28.568 --> 1:22:46.104 | |
So to summarize what we did today, what
you should hopefully take with you is
1:22:46.104 --> 1:22:49.148 | |
how we can do language modeling with neural networks.
1:22:49.449 --> 1:22:55.445 | |
We looked at three different architectures:
we looked into the feed-forward language model,
1:22:55.445 --> 1:22:59.059 | |
the RNN, and the one based on the restricted Boltzmann machine.
1:22:59.039 --> 1:23:04.559 | |
And finally, there are different architectures
for neural networks.
1:23:04.559 --> 1:23:10.986 | |
We have seen feed-forward neural networks and
recurrent neural networks, and we'll see in the
1:23:10.986 --> 1:23:14.389 | |
next lectures the last type of architecture.
1:23:15.915 --> 1:23:17.438 | |
Any questions. | |
1:23:20.680 --> 1:23:27.360 | |
Then thanks a lot, and next time we'll
be here again to continue.