WEBVTT

0:00:01.301 --> 0:00:05.707
Okay So Welcome to Today's Lecture.

0:00:06.066 --> 0:00:12.592
I'm sorry for the inconvenience.

0:00:12.592 --> 0:00:19.910
Sometimes they are project meetings.

0:00:19.910 --> 0:00:25.843
There will be one other time.

0:00:26.806 --> 0:00:40.863
So what we want to talk today about is want
to start with neural approaches to machine

0:00:40.863 --> 0:00:42.964
translation.

0:00:43.123 --> 0:00:51.285
I guess you have heard about other types of
neural models for other types of neural language

0:00:51.285 --> 0:00:52.339
processing.

0:00:52.339 --> 0:00:59.887
This was some of the first steps in introducing
neal networks to machine translation.

0:01:00.600 --> 0:01:06.203
They are similar to what you know they see
in as large language models.

0:01:06.666 --> 0:01:11.764
And today look into what are these neuro-language
models?

0:01:11.764 --> 0:01:13.874
What is the difference?

0:01:13.874 --> 0:01:15.983
What is the motivation?

0:01:16.316 --> 0:01:21.445
And first will use them in statistics and
machine translation.

0:01:21.445 --> 0:01:28.935
So if you remember how fully like two or three
weeks ago we had this likely model where you

0:01:28.935 --> 0:01:31.052
can integrate easily any.

0:01:31.351 --> 0:01:40.967
We just have another model which evaluates
how good a system is or how good a fluent language

0:01:40.967 --> 0:01:41.376
is.

0:01:41.376 --> 0:01:53.749
The main advantage compared to the statistical
models we saw on Tuesday is: Next week we will

0:01:53.749 --> 0:02:06.496
then go for a neural machine translation where
we replace the whole model.

0:02:11.211 --> 0:02:21.078
Just as a remember from Tuesday, we've seen
the main challenge in language world was that

0:02:21.078 --> 0:02:25.134
most of the engrams we haven't seen.

0:02:26.946 --> 0:02:33.967
So this was therefore difficult to estimate
any probability because you've seen that normally

0:02:33.967 --> 0:02:39.494
if you have not seen the endgram you will assign
the probability of zero.

0:02:39.980 --> 0:02:49.420
However, this is not really very good because
we don't want to give zero probabilities to

0:02:49.420 --> 0:02:54.979
sentences, which still might be a very good
English.

0:02:55.415 --> 0:03:02.167
And then we learned a lot of techniques and
that is the main challenging statistical machine

0:03:02.167 --> 0:03:04.490
translate statistical language.

0:03:04.490 --> 0:03:10.661
What's how we can give a good estimate of
probability to events that we haven't seen

0:03:10.661 --> 0:03:12.258
smoothing techniques?

0:03:12.258 --> 0:03:15.307
We've seen this interpolation and begoff.

0:03:15.435 --> 0:03:21.637
And they invent or develop very specific techniques.

0:03:21.637 --> 0:03:26.903
To deal with that, however, it might not be.

0:03:28.568 --> 0:03:43.190
And therefore maybe we can do things different,
so if we have not seen an gram before in statistical

0:03:43.190 --> 0:03:44.348
models.

0:03:45.225 --> 0:03:51.361
Before and we can only get information from
exactly the same words.

0:03:51.411 --> 0:04:06.782
We don't have some on like approximate matching
like that, maybe in a sentence that cures similarly.

0:04:06.782 --> 0:04:10.282
So if you have seen a.

0:04:11.191 --> 0:04:17.748
And so you would like to have more something
like that where endgrams are represented, more

0:04:17.748 --> 0:04:21.953
in a general space, and we can generalize similar
numbers.

0:04:22.262 --> 0:04:29.874
So if you learn something about walk then
maybe we can use this knowledge and also apply.

0:04:30.290 --> 0:04:42.596
The same as we have done before, but we can
really better model how similar they are and

0:04:42.596 --> 0:04:45.223
transfer to other.

0:04:47.047 --> 0:04:54.236
And we maybe want to do that in a more hierarchical
approach that we know okay.

0:04:54.236 --> 0:05:02.773
Some words are similar but like go and walk
is somehow similar and I and P and G and therefore

0:05:02.773 --> 0:05:06.996
like maybe if we then merge them in an engram.

0:05:07.387 --> 0:05:15.861
If we learn something about our walk, then
it should tell us also something about Hugo.

0:05:15.861 --> 0:05:17.113
He walks or.

0:05:17.197 --> 0:05:27.327
You see that there is some relations which
we need to integrate for you.

0:05:27.327 --> 0:05:35.514
We need to add the s, but maybe walks should
also be here.

0:05:37.137 --> 0:05:45.149
And luckily there is one really convincing
method in doing that: And that is by using

0:05:45.149 --> 0:05:47.231
a neural mechanism.

0:05:47.387 --> 0:05:58.497
That's what we will introduce today so we
can use this type of neural networks to try

0:05:58.497 --> 0:06:04.053
to learn this similarity and to learn how.

0:06:04.324 --> 0:06:14.355
And that is one of the main advantages that
we have by switching from the standard statistical

0:06:14.355 --> 0:06:15.200
models.

0:06:15.115 --> 0:06:22.830
To learn similarities between words and generalized,
and learn what is called hidden representations

0:06:22.830 --> 0:06:29.705
or representations of words, where we can measure
similarity in some dimensions of words.

0:06:30.290 --> 0:06:42.384
So we can measure in which way words are similar.

0:06:42.822 --> 0:06:48.902
We had it before and we've seen that words
were just easier.

0:06:48.902 --> 0:06:51.991
The only thing we did is like.

0:06:52.192 --> 0:07:02.272
But this energies don't have any meaning,
so it wasn't that word is more similar to words.

0:07:02.582 --> 0:07:12.112
So we couldn't learn anything about words
in the statistical model and that's a big challenge.

0:07:12.192 --> 0:07:23.063
About words even like in morphology, so going
goes is somehow more similar because the person

0:07:23.063 --> 0:07:24.219
singular.

0:07:24.264 --> 0:07:34.924
The basic models we have to now have no idea
about that and goes as similar to go than it

0:07:34.924 --> 0:07:37.175
might be to sleep.

0:07:39.919 --> 0:07:44.073
So what we want to do today.

0:07:44.073 --> 0:07:53.096
In order to go to this we will have a short
introduction into.

0:07:53.954 --> 0:08:05.984
It very short just to see how we use them
here, but that's a good thing, so most of you

0:08:05.984 --> 0:08:08.445
think it will be.

0:08:08.928 --> 0:08:14.078
And then we will first look into a feet forward
neural network language models.

0:08:14.454 --> 0:08:23.706
And there we will still have this approximation.

0:08:23.706 --> 0:08:33.902
We have before we are looking only at a fixed
window.

0:08:34.154 --> 0:08:35.030
The case.

0:08:35.030 --> 0:08:38.270
However, we have the umbellent here.

0:08:38.270 --> 0:08:43.350
That's why they're already better in order
to generalize.

0:08:44.024 --> 0:08:53.169
And then at the end we'll look at language
models where we then have the additional advantage.

0:08:53.093 --> 0:09:04.317
Case that we need to have a fixed history,
but in theory we can model arbitrary long dependencies.

0:09:04.304 --> 0:09:12.687
And we talked about on Tuesday where it is
not clear what type of information it is to.

0:09:16.396 --> 0:09:24.981
So in general molecular networks I normally
learn to prove that they perform some tasks.

0:09:25.325 --> 0:09:33.472
We have the structure and we are learning
them from samples so that is similar to what

0:09:33.472 --> 0:09:34.971
we have before.

0:09:34.971 --> 0:09:42.275
So now we have the same task here, a language
model giving input or forwards.

0:09:42.642 --> 0:09:48.959
And is somewhat originally motivated by human
brain.

0:09:48.959 --> 0:10:00.639
However, when you now need to know about artificial
neural networks, it's hard to get similarity.

0:10:00.540 --> 0:10:02.889
There seemed to be not that point.

0:10:03.123 --> 0:10:11.014
So what they are mainly doing is summoning
multiplication and then one non-linear activation.

0:10:12.692 --> 0:10:16.085
So the basic units are these type of.

0:10:17.937 --> 0:10:29.891
Perceptron basic blocks which we have and
this does processing so we have a fixed number

0:10:29.891 --> 0:10:36.070
of input features and that will be important.

0:10:36.096 --> 0:10:39.689
So we have here numbers to xn as input.

0:10:40.060 --> 0:10:53.221
And this makes partly of course language processing
difficult.

0:10:54.114 --> 0:10:57.609
So we have to model this time on and then
go stand home and model.

0:10:58.198 --> 0:11:02.099
Then we are having weights, which are the
parameters and the number of weights exactly

0:11:02.099 --> 0:11:03.668
the same as the number of weights.

0:11:04.164 --> 0:11:06.322
Of input features.

0:11:06.322 --> 0:11:15.068
Sometimes he has his fires in there, and then
it's not really an input from.

0:11:15.195 --> 0:11:19.205
And what you then do is multiply.

0:11:19.205 --> 0:11:26.164
Each input resists weight and then you sum
it up and then.

0:11:26.606 --> 0:11:34.357
What is then additionally later important
is that we have an activation function and

0:11:34.357 --> 0:11:42.473
it's important that this activation function
is non linear, so we come to just a linear.

0:11:43.243 --> 0:11:54.088
And later it will be important that this is
differentiable because otherwise all the training.

0:11:54.714 --> 0:12:01.907
This model by itself is not very powerful.

0:12:01.907 --> 0:12:10.437
It was originally shown that this is not powerful.

0:12:10.710 --> 0:12:19.463
However, there is a very easy extension, the
multi layer perceptual, and then things get

0:12:19.463 --> 0:12:20.939
very powerful.

0:12:21.081 --> 0:12:27.719
The thing is you just connect a lot of these
in this layer of structures and we have our

0:12:27.719 --> 0:12:35.029
input layer where we have the inputs and our
hidden layer at least one where there is everywhere.

0:12:35.395 --> 0:12:39.817
And then we can combine them all to do that.

0:12:40.260 --> 0:12:48.320
The input layer is of course somewhat given
by a problem of dimension.

0:12:48.320 --> 0:13:00.013
The outward layer is also given by your dimension,
but the hidden layer is of course a hyperparameter.

0:13:01.621 --> 0:13:08.802
So let's start with the first question, now
more language related, and that is how we represent.

0:13:09.149 --> 0:13:23.460
So we've seen here we have the but the question
is now how can we put in a word into this?

0:13:26.866 --> 0:13:34.117
Noise: The first thing we're able to be better
is by the fact that like you are said,.

0:13:34.314 --> 0:13:43.028
That is not that easy because the continuous
vector will come to that.

0:13:43.028 --> 0:13:50.392
So from the neo-network we can directly put
in the bedding.

0:13:50.630 --> 0:13:57.277
But if we need to input a word into the needle
network, it has to be something which is easily

0:13:57.277 --> 0:13:57.907
defined.

0:13:59.079 --> 0:14:12.492
The one hood encoding, and then we have one
out of encoding, so one value is one, and all

0:14:12.492 --> 0:14:15.324
the others is the.

0:14:16.316 --> 0:14:25.936
That means we are always dealing with fixed
vocabulary because what said is we cannot.

0:14:26.246 --> 0:14:38.017
So you cannot easily extend your vocabulary
because if you mean you would extend your vocabulary.

0:14:39.980 --> 0:14:41.502
That's also motivating.

0:14:41.502 --> 0:14:43.722
We're talked about biperriagoding.

0:14:43.722 --> 0:14:45.434
That's a nice thing there.

0:14:45.434 --> 0:14:47.210
We have a fixed vocabulary.

0:14:48.048 --> 0:14:55.804
The big advantage of this one encoding is
that we don't implicitly sum our implement

0:14:55.804 --> 0:15:04.291
similarity between words, but really re-learning
because if you first think about this, this

0:15:04.291 --> 0:15:06.938
is a very, very inefficient.

0:15:07.227 --> 0:15:15.889
So you need like to represent end words, you
need a dimension of an end dimensional vector.

0:15:16.236 --> 0:15:24.846
Imagine you could do binary encoding so you
could represent words as binary vectors.

0:15:24.846 --> 0:15:26.467
Then you would.

0:15:26.806 --> 0:15:31.177
Will be significantly more efficient.

0:15:31.177 --> 0:15:36.813
However, then you have some implicit similarity.

0:15:36.813 --> 0:15:39.113
Some numbers share.

0:15:39.559 --> 0:15:46.958
Would somehow be bad because you would force
someone to do this by hand or clear how to

0:15:46.958 --> 0:15:47.631
define.

0:15:48.108 --> 0:15:55.135
So therefore currently this is the most successful
approach to just do this one watch.

0:15:55.095 --> 0:15:59.563
Representations, so we take a fixed vocabulary.

0:15:59.563 --> 0:16:06.171
We map each word to the inise, and then we
represent a word like this.

0:16:06.171 --> 0:16:13.246
So if home will be one, the representation
will be one zero zero zero, and.

0:16:14.514 --> 0:16:30.639
But this dimension here is a vocabulary size
and that is quite high, so we are always trying

0:16:30.639 --> 0:16:33.586
to be efficient.

0:16:33.853 --> 0:16:43.792
We are doing then some type of efficiency
because typically we are having this next layer.

0:16:44.104 --> 0:16:51.967
It can be still maybe two hundred or five
hundred or one thousand neurons, but this is

0:16:51.967 --> 0:16:53.323
significantly.

0:16:53.713 --> 0:17:03.792
You can learn that directly and there we then
have similarity between words.

0:17:03.792 --> 0:17:07.458
Then it is that some words.

0:17:07.807 --> 0:17:14.772
But the nice thing is that this is then learned
that we are not need to hand define that.

0:17:17.117 --> 0:17:32.742
We'll come later to the explicit architecture
of the neural language one, and there we can

0:17:32.742 --> 0:17:35.146
see how it's.

0:17:38.418 --> 0:17:44.857
So we're seeing that the other one or our
representation always has the same similarity.

0:17:45.105 --> 0:17:59.142
Then we're having this continuous factor which
is a lot smaller dimension and that's important

0:17:59.142 --> 0:18:00.768
for later.

0:18:01.121 --> 0:18:06.989
What we are doing then is learning these representations
so that they are best for language.

0:18:07.487 --> 0:18:14.968
So the representations are implicitly training
the language for the cards.

0:18:14.968 --> 0:18:19.058
This is the best way for doing language.

0:18:19.479 --> 0:18:32.564
And the nice thing that was found out later
is these representations are really good.

0:18:33.153 --> 0:18:39.253
And that is why they are now even called word
embeddings by themselves and used for other

0:18:39.253 --> 0:18:39.727
tasks.

0:18:40.360 --> 0:18:49.821
And they are somewhat describing very different
things so they can describe and semantic similarities.

0:18:49.789 --> 0:18:58.650
Are looking at the very example of today mass
vector space by adding words and doing some

0:18:58.650 --> 0:19:00.618
interesting things.

0:19:00.940 --> 0:19:11.178
So they got really like the first big improvement
when switching to neurostaff.

0:19:11.491 --> 0:19:20.456
Are like part of the model, but with more
complex representation, but they are the basic

0:19:20.456 --> 0:19:21.261
models.

0:19:23.683 --> 0:19:36.979
In the output layer we are also having one
output layer structure and a connection function.

0:19:36.997 --> 0:19:46.525
That is, for language learning we want to
predict what is the most common word.

0:19:47.247 --> 0:19:56.453
And that can be done very well with this so
called soft back layer, where again the dimension.

0:19:56.376 --> 0:20:02.825
Vocabulary size, so this is a vocabulary size,
and again the case neural represents the case

0:20:02.825 --> 0:20:03.310
class.

0:20:03.310 --> 0:20:09.759
So in our case we have again one round representation,
someone saying this is a core report.

0:20:10.090 --> 0:20:17.255
Our probability distribution is a probability
distribution over all works, so the case entry

0:20:17.255 --> 0:20:21.338
tells us how probable is that the next word
is this.

0:20:22.682 --> 0:20:33.885
So we need to have some probability distribution
at our output in order to achieve that this

0:20:33.885 --> 0:20:37.017
activation function goes.

0:20:37.197 --> 0:20:46.944
And we can achieve that with a soft max activation
we take the input to the form of the value,

0:20:46.944 --> 0:20:47.970
and then.

0:20:48.288 --> 0:20:58.021
So by having this type of activation function
we are really getting this type of probability.

0:20:59.019 --> 0:21:15.200
At the beginning was also very challenging
because again we have this inefficient representation.

0:21:15.235 --> 0:21:29.799
You can imagine that something over is maybe
a bit inefficient with cheap users, but definitely.

0:21:36.316 --> 0:21:44.072
And then for training the models that will
be fine, so we have to use architecture now.

0:21:44.264 --> 0:21:48.491
We need to minimize the arrow.

0:21:48.491 --> 0:21:53.264
Are we doing it taking the output?

0:21:53.264 --> 0:21:58.174
We are comparing it to our targets.

0:21:58.298 --> 0:22:03.830
So one important thing is by training them.

0:22:03.830 --> 0:22:07.603
How can we measure the error?

0:22:07.603 --> 0:22:12.758
So what is if we are training the ideas?

0:22:13.033 --> 0:22:15.163
And how well we are measuring.

0:22:15.163 --> 0:22:19.768
It is in natural language processing, typically
the cross entropy.

0:22:19.960 --> 0:22:35.575
And that means we are comparing the target
with the output.

0:22:35.335 --> 0:22:44.430
It gets optimized and you're seeing that this,
of course, makes it again very nice and easy

0:22:44.430 --> 0:22:49.868
because our target is again a one-hour representation.

0:22:50.110 --> 0:23:00.116
So all of these are always zero, and what
we are then doing is we are taking the one.

0:23:00.100 --> 0:23:04.615
And we only need to multiply the one with
the logarithm here, and that is all the feedback

0:23:04.615 --> 0:23:05.955
signal we are taking here.

0:23:06.946 --> 0:23:13.885
Of course, this is not always influenced by
all the others.

0:23:13.885 --> 0:23:17.933
Why is this influenced by all the.

0:23:24.304 --> 0:23:34.382
Have the activation function, which is the
current activation divided by some of the others.

0:23:34.354 --> 0:23:45.924
Otherwise it could easily just increase this
volume and ignore the others, but if you increase

0:23:45.924 --> 0:23:49.090
one value all the others.

0:23:51.351 --> 0:23:59.912
Then we can do with neometrics one very nice
and easy type of training that is done in all

0:23:59.912 --> 0:24:07.721
the neometrics where we are now calculating
our error and especially the gradient.

0:24:07.707 --> 0:24:11.640
So in which direction does the error show?

0:24:11.640 --> 0:24:18.682
And then if we want to go to a smaller arrow
that's what we want to achieve.

0:24:18.682 --> 0:24:26.638
We are taking the inverse direction of the
gradient and thereby trying to minimize our

0:24:26.638 --> 0:24:27.278
error.

0:24:27.287 --> 0:24:31.041
And we have to do that, of course, for all
the weights.

0:24:31.041 --> 0:24:36.672
And to calculate the error of all the weights,
we won't do the defectvagation here.

0:24:36.672 --> 0:24:41.432
But but what you can do is you can propagate
the arrow which measured.

0:24:41.432 --> 0:24:46.393
At the end you can propagate it back its basic
mass and basic derivation.

0:24:46.706 --> 0:24:58.854
For each way in your model measure how much
you contribute to the error and then change

0:24:58.854 --> 0:25:01.339
it in a way that.

0:25:04.524 --> 0:25:11.625
So to summarize what for at least machine
translation on your machine translation should

0:25:11.625 --> 0:25:19.044
remember, you know, to understand on this problem
is that this is how a multilayer first the

0:25:19.044 --> 0:25:20.640
problem looks like.

0:25:20.580 --> 0:25:28.251
There are fully two layers and no connections.

0:25:28.108 --> 0:25:29.759
Across layers.

0:25:29.829 --> 0:25:35.153
And what they're doing is always just a waited
sum here and then in activation production.

0:25:35.415 --> 0:25:38.792
And in order to train you have this forward
and backward pass.

0:25:39.039 --> 0:25:41.384
So We Put in Here.

0:25:41.281 --> 0:25:41.895
Inputs.

0:25:41.895 --> 0:25:45.347
We have some random values at the beginning.

0:25:45.347 --> 0:25:47.418
Then calculate the output.

0:25:47.418 --> 0:25:54.246
We are measuring how our error is propagating
the arrow back and then changing our model

0:25:54.246 --> 0:25:57.928
in a way that we hopefully get a smaller arrow.

0:25:57.928 --> 0:25:59.616
And then that is how.

0:26:01.962 --> 0:26:12.893
So before we're coming into our neural networks
language models, how can we use this type of

0:26:12.893 --> 0:26:17.595
neural network to do language modeling?

0:26:23.103 --> 0:26:33.157
So how can we use them in natural language
processing, especially machine translation?

0:26:33.157 --> 0:26:41.799
The first idea of using them was to estimate:
So we have seen that the output can be monitored

0:26:41.799 --> 0:26:42.599
here as well.

0:26:43.603 --> 0:26:50.311
A probability distribution and if we have
a full vocabulary we could mainly hear estimating

0:26:50.311 --> 0:26:56.727
how probable each next word is and then use
that in our language model fashion as we've

0:26:56.727 --> 0:26:58.112
done it last time.

0:26:58.112 --> 0:27:03.215
We got the probability of a full sentence
as a product of individual.

0:27:04.544 --> 0:27:12.820
And: That was done in the ninety seven years
and it's very easy to integrate it into this

0:27:12.820 --> 0:27:14.545
lot of the year model.

0:27:14.545 --> 0:27:19.570
So we have said that this is how the locker
here model looks like.

0:27:19.570 --> 0:27:25.119
So we are searching the best translation which
minimizes each waste time.

0:27:25.125 --> 0:27:26.362
The Future About You.

0:27:26.646 --> 0:27:31.647
We have that with minimum error rate training
if you can remember where we search for the

0:27:31.647 --> 0:27:32.147
optimal.

0:27:32.512 --> 0:27:40.422
The language model and many others, and we
can just add here a neuromodel, have a knock

0:27:40.422 --> 0:27:41.591
of features.

0:27:41.861 --> 0:27:45.761
So that is quite easy as said.

0:27:45.761 --> 0:27:53.183
That was how statistical machine translation
was improved.

0:27:53.183 --> 0:27:57.082
You just add one more feature.

0:27:58.798 --> 0:28:07.631
So how can we model the language modeling
with a network?

0:28:07.631 --> 0:28:16.008
So what we have to do is model the probability
of the.

0:28:16.656 --> 0:28:25.047
The problem in general in the head is that
mostly we haven't seen long sequences.

0:28:25.085 --> 0:28:35.650
Mostly we have to beg off to very short sequences
and we are working on this discrete space where

0:28:35.650 --> 0:28:36.944
similarity.

0:28:37.337 --> 0:28:50.163
So the idea is if we have now a real network,
we can make words into continuous representation.

0:28:51.091 --> 0:29:00.480
And the structure then looks like this, so
this is a basic still feed forward neural network.

0:29:01.361 --> 0:29:10.645
We are doing this at perximation again, so
we are not putting in all previous words, but

0:29:10.645 --> 0:29:11.375
it is.

0:29:11.691 --> 0:29:25.856
This is done because we said that in the real
network we can have only a fixed type of input.

0:29:25.945 --> 0:29:31.886
You can only do a fixed step and then we'll
be doing that exactly in minus one.

0:29:33.593 --> 0:29:39.536
So here you are, for example, three words
and three different words.

0:29:39.536 --> 0:29:50.704
One and all the others are: And then we're
having the first layer of the neural network,

0:29:50.704 --> 0:29:56.230
which like you learns is word embedding.

0:29:57.437 --> 0:30:04.976
There is one thing which is maybe special
compared to the standard neural member.

0:30:05.345 --> 0:30:11.918
So the representation of this word we want
to learn first of all position independence.

0:30:11.918 --> 0:30:19.013
So we just want to learn what is the general
meaning of the word independent of its neighbors.

0:30:19.299 --> 0:30:26.239
And therefore the representation you get here
should be the same as if in the second position.

0:30:27.247 --> 0:30:36.865
The nice thing you can achieve is that this
weights which you're using here you're reusing

0:30:36.865 --> 0:30:41.727
here and reusing here so we are forcing them.

0:30:42.322 --> 0:30:48.360
You then learn your word embedding, which
is contextual, independent, so it's the same

0:30:48.360 --> 0:30:49.678
for each position.

0:30:49.909 --> 0:31:03.482
So that's the idea that you want to learn
the representation first of and you don't want

0:31:03.482 --> 0:31:07.599
to really use the context.

0:31:08.348 --> 0:31:13.797
That of course might have a different meaning
depending on where it stands, but we'll learn

0:31:13.797 --> 0:31:14.153
that.

0:31:14.514 --> 0:31:20.386
So first we are learning here representational
words, which is just the representation.

0:31:20.760 --> 0:31:32.498
Normally we said in neurons all input neurons
here are connected to all here, but we're reducing

0:31:32.498 --> 0:31:37.338
the complexity by saying these neurons.

0:31:37.857 --> 0:31:47.912
Then we have a lot denser representation that
is our three word embedded in here, and now

0:31:47.912 --> 0:31:57.408
we are learning this interaction between words,
a direction between words not based.

0:31:57.677 --> 0:32:08.051
So we have at least one connected layer here,
which takes a three embedding input and then

0:32:08.051 --> 0:32:14.208
learns a new embedding which now represents
the full.

0:32:15.535 --> 0:32:16.551
Layers.

0:32:16.551 --> 0:32:27.854
It is the output layer which now and then
again the probability distribution of all the.

0:32:28.168 --> 0:32:48.612
So here is your target prediction.

0:32:48.688 --> 0:32:56.361
The nice thing is that you learn everything
together, so you don't have to teach them what

0:32:56.361 --> 0:32:58.722
a good word representation.

0:32:59.079 --> 0:33:08.306
Training the whole number together, so it
learns what a good representation for a word

0:33:08.306 --> 0:33:13.079
you get in order to perform your final task.

0:33:15.956 --> 0:33:19.190
Yeah, that is the main idea.

0:33:20.660 --> 0:33:32.731
This is now a days often referred to as one
way of self supervise learning.

0:33:33.053 --> 0:33:37.120
The output is the next word and the input
is the previous word.

0:33:37.377 --> 0:33:46.783
But it's not really that we created labels,
but we artificially created a task out of unlabeled.

0:33:46.806 --> 0:33:59.434
We just had pure text, and then we created
the telescopes by predicting the next word,

0:33:59.434 --> 0:34:18.797
which is: Say we have like two sentences like
go home and the second one is go to prepare.

0:34:18.858 --> 0:34:30.135
And then we have to predict the next series
and my questions in the labels for the album.

0:34:31.411 --> 0:34:42.752
We model this as one vector with like probability
for possible weights starting again.

0:34:44.044 --> 0:34:57.792
Multiple examples, so then you would twice
train one to predict KRT, one to predict home,

0:34:57.792 --> 0:35:02.374
and then of course the easel.

0:35:04.564 --> 0:35:13.568
Is a very good point, so you are not aggregating
examples beforehand, but you are taking each.

0:35:19.259 --> 0:35:37.204
So when you do it simultaneously learn the
projection layer and the endgram for abilities

0:35:37.204 --> 0:35:39.198
and then.

0:35:39.499 --> 0:35:47.684
And later analyze it that these representations
are very powerful.

0:35:47.684 --> 0:35:56.358
The task is just a very important task to
model what is the next word.

0:35:56.816 --> 0:35:59.842
Is motivated by nowadays.

0:35:59.842 --> 0:36:10.666
In order to get the meaning of the word you
have to look at its companies where the context.

0:36:10.790 --> 0:36:16.048
If you read texts in days of word which you
have never seen, you often can still estimate

0:36:16.048 --> 0:36:21.130
the meaning of this word because you do not
know how it is used, and this is typically

0:36:21.130 --> 0:36:22.240
used as a city or.

0:36:22.602 --> 0:36:25.865
Just imagine you read a text about some city.

0:36:25.865 --> 0:36:32.037
Even if you've never seen the city before,
you often know from the context of how it's

0:36:32.037 --> 0:36:32.463
used.

0:36:34.094 --> 0:36:42.483
So what is now the big advantage of using
neural neckworks?

0:36:42.483 --> 0:36:51.851
So just imagine we have to estimate that I
bought my first iPhone.

0:36:52.052 --> 0:36:56.608
So you have to monitor the probability of
ad hitting them.

0:36:56.608 --> 0:37:00.237
Now imagine iPhone, which you have never seen.

0:37:00.600 --> 0:37:11.588
So all the techniques we had last time at
the end, if you haven't seen iPhone you will

0:37:11.588 --> 0:37:14.240
always fall back to.

0:37:15.055 --> 0:37:26.230
You have no idea how to deal that you won't
have seen the diagram, the trigram, and all

0:37:26.230 --> 0:37:27.754
the others.

0:37:28.588 --> 0:37:43.441
If you're having this type of model, what
does it do if you have my first and then something?

0:37:43.483 --> 0:37:50.270
Maybe this representation is really messed
up because it's mainly on a cavalry word.

0:37:50.730 --> 0:37:57.793
However, you have still these two information
that two words before was first and therefore.

0:37:58.098 --> 0:38:06.954
So you have a lot of information in order
to estimate how good it is.

0:38:06.954 --> 0:38:13.279
There could be more information if you know
that.

0:38:13.593 --> 0:38:25.168
So all this type of modeling we can do that
we couldn't do beforehand because we always

0:38:25.168 --> 0:38:25.957
have.

0:38:27.027 --> 0:38:40.466
Good point, so typically you would have one
token for a vocabulary so that you could, for

0:38:40.466 --> 0:38:45.857
example: All you're doing by parent coding
when you have a fixed thing.

0:38:46.226 --> 0:38:49.437
Oh yeah, you have to do something like that
that that that's true.

0:38:50.050 --> 0:38:55.420
So yeah, auto vocabulary are by thanking where
you don't have other words written.

0:38:55.735 --> 0:39:06.295
But then, of course, you might be getting
very long previous things, and your sequence

0:39:06.295 --> 0:39:11.272
length gets very long for unknown words.

0:39:17.357 --> 0:39:20.067
Any more questions to the basic stable.

0:39:23.783 --> 0:39:36.719
For this model, what we then want to continue
is looking a bit into how complex or how we

0:39:36.719 --> 0:39:39.162
can make things.

0:39:40.580 --> 0:39:49.477
Because at the beginning there was definitely
a major challenge, it's still not that easy,

0:39:49.477 --> 0:39:58.275
and I mean our likeers followed the talk about
their environmental fingerprint and so on.

0:39:58.478 --> 0:40:05.700
So this calculation is not really heavy, and
if you build systems yourselves you have to

0:40:05.700 --> 0:40:06.187
wait.

0:40:06.466 --> 0:40:14.683
So it's good to know a bit about how complex
things are in order to do a good or efficient

0:40:14.683 --> 0:40:15.405
affair.

0:40:15.915 --> 0:40:24.211
So one thing where most of the calculation
really happens is if you're doing it in a bad

0:40:24.211 --> 0:40:24.677
way.

0:40:25.185 --> 0:40:33.523
So in generally all these layers we are talking
about networks and zones fancy.

0:40:33.523 --> 0:40:46.363
In the end it is: So what you have to do in
order to calculate here, for example, these

0:40:46.363 --> 0:40:52.333
activations: So make it simple a bit.

0:40:52.333 --> 0:41:06.636
Let's see where outputs and you just do metric
multiplication between your weight matrix and

0:41:06.636 --> 0:41:08.482
your input.

0:41:08.969 --> 0:41:20.992
So that is why computers are so powerful for
neural networks because they are very good

0:41:20.992 --> 0:41:22.358
in doing.

0:41:22.782 --> 0:41:28.013
However, for some type for the embedding layer
this is really very inefficient.

0:41:28.208 --> 0:41:39.652
So because remember we're having this one
art encoding in this input, it's always like

0:41:39.652 --> 0:41:42.940
one and everything else.

0:41:42.940 --> 0:41:47.018
It's zero if we're doing this.

0:41:47.387 --> 0:41:55.552
So therefore you can do at least the forward
pass a lot more efficient if you don't really

0:41:55.552 --> 0:42:01.833
do this calculation, but you can select the
one color where there is.

0:42:01.833 --> 0:42:07.216
Therefore, you also see this is called your
word embedding.

0:42:08.348 --> 0:42:19.542
So the weight matrix of the embedding layer
is just that in each color you have the embedding

0:42:19.542 --> 0:42:20.018
of.

0:42:20.580 --> 0:42:30.983
So this is like how your initial weights look
like and how you can interpret or understand.

0:42:32.692 --> 0:42:39.509
And this is already relatively important because
remember this is a huge dimensional thing.

0:42:39.509 --> 0:42:46.104
So typically here we have the number of words
is ten thousand or so, so this is the word

0:42:46.104 --> 0:42:51.365
embeddings metrics, typically the most expensive
to calculate metrics.

0:42:51.451 --> 0:42:59.741
Because it's the largest one there, we have
ten thousand entries, while for the hours we

0:42:59.741 --> 0:43:00.393
maybe.

0:43:00.660 --> 0:43:03.408
So therefore the addition to a little bit
more to make this.

0:43:06.206 --> 0:43:10.538
Then you can go where else the calculations
are very difficult.

0:43:10.830 --> 0:43:20.389
So here we then have our network, so we have
the word embeddings.

0:43:20.389 --> 0:43:29.514
We have one hidden there, and then you can
look how difficult.

0:43:30.270 --> 0:43:38.746
Could save a lot of calculation by not really
calculating the selection because that is always.

0:43:40.600 --> 0:43:46.096
The number of calculations you have to do
here is so.

0:43:46.096 --> 0:43:51.693
The length of this layer is minus one type
projection.

0:43:52.993 --> 0:43:56.321
That is a hint size.

0:43:56.321 --> 0:44:10.268
So the first step of calculation for this
metrics modification is how much calculation.

0:44:10.730 --> 0:44:18.806
Then you have to do some activation function
and then you have to do again the calculation.

0:44:19.339 --> 0:44:27.994
Here we need the vocabulary size because we
need to calculate the probability for each

0:44:27.994 --> 0:44:29.088
next word.

0:44:29.889 --> 0:44:43.155
And if you look at these numbers, so if you
have a projector size of and a vocabulary size

0:44:43.155 --> 0:44:53.876
of, you see: And that is why there has been
especially at the beginning some ideas how

0:44:53.876 --> 0:44:55.589
we can reduce.

0:44:55.956 --> 0:45:01.942
And if we really need to calculate all of
our capabilities, or if we can calculate only

0:45:01.942 --> 0:45:02.350
some.

0:45:02.582 --> 0:45:10.871
And there again the one important thing to
think about is for what will use my language

0:45:10.871 --> 0:45:11.342
mom.

0:45:11.342 --> 0:45:19.630
I can use it for generations and that's what
we will see next week in an achiever which

0:45:19.630 --> 0:45:22.456
really is guiding the search.

0:45:23.123 --> 0:45:30.899
If it just uses a feature, we do not want
to use it for generations, but we want to only

0:45:30.899 --> 0:45:32.559
know how probable.

0:45:32.953 --> 0:45:39.325
There we might not be really interested in
all the probabilities, but we already know

0:45:39.325 --> 0:45:46.217
we just want to know the probability of this
one word, and then it might be very inefficient

0:45:46.217 --> 0:45:49.403
to really calculate all the probabilities.

0:45:51.231 --> 0:45:52.919
And how can you do that so?

0:45:52.919 --> 0:45:56.296
Initially, for example, the people look into
shortness.

0:45:56.756 --> 0:46:02.276
So this calculation at the end is really very
expensive.

0:46:02.276 --> 0:46:05.762
So can we make that more efficient.

0:46:05.945 --> 0:46:17.375
And most words occur very rarely, and maybe
we don't need anger, and so there we may want

0:46:17.375 --> 0:46:18.645
to focus.

0:46:19.019 --> 0:46:29.437
And so they use the smaller vocabulary, which
is maybe.

0:46:29.437 --> 0:46:34.646
This layer is used from to.

0:46:34.646 --> 0:46:37.623
Then you merge.

0:46:37.937 --> 0:46:45.162
So you're taking if the word is in the shortest,
so in the two thousand most frequent words.

0:46:45.825 --> 0:46:58.299
Of this short word by some normalization here,
and otherwise you take a back of probability

0:46:58.299 --> 0:46:59.655
from the.

0:47:00.020 --> 0:47:04.933
It will not be as good, but the idea is okay.

0:47:04.933 --> 0:47:14.013
Then we don't have to calculate all these
probabilities here at the end, but we only

0:47:14.013 --> 0:47:16.042
have to calculate.

0:47:19.599 --> 0:47:32.097
With some type of cost because it means we
don't model the probability of the infrequent

0:47:32.097 --> 0:47:39.399
words, and maybe it's even very important to
model.

0:47:39.299 --> 0:47:46.671
And one idea is to do what is reported as
so so structured out there.

0:47:46.606 --> 0:47:49.571
Network language models you see some years
ago.

0:47:49.571 --> 0:47:53.154
People were very creative and giving names
to new models.

0:47:53.813 --> 0:48:00.341
And there the idea is that we model the output
vocabulary as a clustered treat.

0:48:00.680 --> 0:48:06.919
So you don't need to model all of our bodies
directly, but you are putting words into a

0:48:06.919 --> 0:48:08.479
sequence of clusters.

0:48:08.969 --> 0:48:15.019
So maybe a very intriguant world is first
in cluster three and then in cluster three.

0:48:15.019 --> 0:48:21.211
You have subclusters again and there is subclusters
seven and subclusters and there is.

0:48:21.541 --> 0:48:40.134
And this is the path, so that is what was
the man in the past.

0:48:40.340 --> 0:48:52.080
And then you can calculate the probability
of the word again just by the product of the

0:48:52.080 --> 0:48:55.548
first class of the world.

0:48:57.617 --> 0:49:07.789
That it may be more clear where you have this
architecture, so this is all the same.

0:49:07.789 --> 0:49:13.773
But then you first predict here which main
class.

0:49:14.154 --> 0:49:24.226
Then you go to the appropriate subclass, then
you calculate the probability of the subclass

0:49:24.226 --> 0:49:26.415
and maybe the cell.

0:49:27.687 --> 0:49:35.419
Anybody have an idea why this is more efficient
or if you do it first, it looks a lot more.

0:49:42.242 --> 0:49:51.788
You have to do less calculations, so maybe
if you do it here you have to calculate the

0:49:51.788 --> 0:49:59.468
element there, but you don't have to do all
the one hundred thousand.

0:49:59.980 --> 0:50:06.115
The probabilities in the set classes that
you're going through and not for all of them.

0:50:06.386 --> 0:50:18.067
Therefore, it's more efficient if you don't
need all output proficient because you have

0:50:18.067 --> 0:50:21.253
to calculate the class.

0:50:21.501 --> 0:50:28.936
So it's only more efficient and scenarios
where you really need to use a language model

0:50:28.936 --> 0:50:30.034
to evaluate.

0:50:35.275 --> 0:50:52.456
How this works was that you can train first
in your language one on the short list.

0:50:52.872 --> 0:51:03.547
But on the input layer you have your full
vocabulary because at the input we saw that

0:51:03.547 --> 0:51:06.650
this is not complicated.

0:51:06.906 --> 0:51:26.638
And then you can cluster down all your words
here into classes and use that as your glasses.

0:51:29.249 --> 0:51:34.148
That is one idea of doing it.

0:51:34.148 --> 0:51:44.928
There is also a second idea of doing it, and
again we don't need.

0:51:45.025 --> 0:51:53.401
So sometimes it doesn't really need to be
a probability to evaluate.

0:51:53.401 --> 0:51:56.557
It's only important that.

0:51:58.298 --> 0:52:04.908
And: Here it's called self normalization what
people have done so.

0:52:04.908 --> 0:52:11.562
We have seen that the probability is in this
soft mechanism always to the input divided

0:52:11.562 --> 0:52:18.216
by our normalization, and the normalization
is a summary of the vocabulary to the power

0:52:18.216 --> 0:52:19.274
of the spell.

0:52:19.759 --> 0:52:25.194
So this is how we calculate the software.

0:52:25.825 --> 0:52:41.179
In self normalization of the idea, if this
would be zero then we don't need to calculate

0:52:41.179 --> 0:52:42.214
that.

0:52:42.102 --> 0:52:54.272
Will be zero, and then you don't even have
to calculate the normalization because it's.

0:52:54.514 --> 0:53:08.653
So how can we achieve that and then the nice
thing in your networks?

0:53:09.009 --> 0:53:23.928
And now we're just adding a second note with
some either permitted here.

0:53:24.084 --> 0:53:29.551
And the second lost just tells us he'll be
strained away.

0:53:29.551 --> 0:53:31.625
The locks at is zero.

0:53:32.352 --> 0:53:38.614
So then if it's nearly zero at the end we
don't need to calculate this and it's also

0:53:38.614 --> 0:53:39.793
very efficient.

0:53:40.540 --> 0:53:49.498
One important thing is this, of course, is
only in inference.

0:53:49.498 --> 0:54:04.700
During tests we don't need to calculate that
because: You can do a bit of a hyperparameter

0:54:04.700 --> 0:54:14.851
here where you do the waiting, so how good
should it be estimating the probabilities and

0:54:14.851 --> 0:54:16.790
how much effort?

0:54:18.318 --> 0:54:28.577
The only disadvantage is no speed up during
training.

0:54:28.577 --> 0:54:43.843
There are other ways of doing that, for example:
Englishman is in case you get it.

0:54:44.344 --> 0:54:48.540
Then we are coming very, very briefly like
just one idea.

0:54:48.828 --> 0:54:53.058
That there is more things on different types
of language models.

0:54:53.058 --> 0:54:58.002
We are having a very short view on restricted
person-based language models.

0:54:58.298 --> 0:55:08.931
Talk about recurrent neural networks for language
mines because they have the advantage that

0:55:08.931 --> 0:55:17.391
we can even further improve by not having a
continuous representation on.

0:55:18.238 --> 0:55:23.845
So there's different types of neural networks.

0:55:23.845 --> 0:55:30.169
These are these boxing machines and the interesting.

0:55:30.330 --> 0:55:39.291
They have these: And they define like an energy
function on the network, which can be in restricted

0:55:39.291 --> 0:55:44.372
balsam machines efficiently calculated in general
and restricted needs.

0:55:44.372 --> 0:55:51.147
You only have connection between the input
and the hidden layer, but you don't have connections

0:55:51.147 --> 0:55:53.123
in the input or within the.

0:55:53.393 --> 0:56:00.194
So you see here you don't have an input output,
you just have an input, and you calculate.

0:56:00.460 --> 0:56:15.612
Which of course nicely fits with the idea
we're having, so you can then use this for

0:56:15.612 --> 0:56:19.177
an N Gram language.

0:56:19.259 --> 0:56:25.189
Retaining the flexibility of the input by
this type of neon networks.

0:56:26.406 --> 0:56:30.589
And the advantage of this type of model was
there's.

0:56:30.550 --> 0:56:37.520
Very, very fast to integrate it, so that one
was the first one which was used during the

0:56:37.520 --> 0:56:38.616
coding model.

0:56:38.938 --> 0:56:45.454
The engram language models were that they
were very good and gave performance.

0:56:45.454 --> 0:56:50.072
However, calculation still with all these
tricks takes.

0:56:50.230 --> 0:56:58.214
We have talked about embest lists so they
generated an embest list of the most probable

0:56:58.214 --> 0:57:05.836
outputs and then they took this and best list
scored each entry with a new network.

0:57:06.146 --> 0:57:09.306
A language model, and then only change the
order again.

0:57:09.306 --> 0:57:10.887
Select based on that which.

0:57:11.231 --> 0:57:17.187
The neighboring list is maybe only like hundred
entries.

0:57:17.187 --> 0:57:21.786
When decoding you look at several thousand.

0:57:26.186 --> 0:57:35.196
Let's look at the context so we have now seen
your language models.

0:57:35.196 --> 0:57:43.676
There is the big advantage we can use this
word similarity and.

0:57:44.084 --> 0:57:52.266
Remember for engram language ones is not always
minus one words because sometimes you have

0:57:52.266 --> 0:57:59.909
to back off or interpolation to lower engrams
and you don't know the previous words.

0:58:00.760 --> 0:58:04.742
And however in neural models we always have
all of this importance.

0:58:04.742 --> 0:58:05.504
Can some of.

0:58:07.147 --> 0:58:20.288
The disadvantage is that you are still limited
in your context, and if you remember the sentence

0:58:20.288 --> 0:58:22.998
from last lecture,.

0:58:22.882 --> 0:58:28.328
Sometimes you need more context and there
is unlimited context that you might need and

0:58:28.328 --> 0:58:34.086
you can always create sentences where you may
need this five context in order to put a good

0:58:34.086 --> 0:58:34.837
estimation.

0:58:35.315 --> 0:58:44.956
Can also do it different in order to understand
that it makes sense to view language.

0:58:45.445 --> 0:58:59.510
So secret labeling tasks are a very common
type of task in language processing where you

0:58:59.510 --> 0:59:03.461
have the input sequence.

0:59:03.323 --> 0:59:05.976
So you have one output for each input.

0:59:05.976 --> 0:59:12.371
Machine translation is not a secret labeling
cast because the number of inputs and the number

0:59:12.371 --> 0:59:14.072
of outputs is different.

0:59:14.072 --> 0:59:20.598
So you put in a string German which has five
words and the output can be: See, for example,

0:59:20.598 --> 0:59:24.078
you always have the same number and the same
number of offices.

0:59:24.944 --> 0:59:39.779
And you can more language waddling as that,
and you just say the label for each word is

0:59:39.779 --> 0:59:43.151
always a next word.

0:59:45.705 --> 0:59:50.312
This is the more generous you can think of
it.

0:59:50.312 --> 0:59:56.194
For example, Paddle Speech Taking named Entity
Recognition.

0:59:58.938 --> 1:00:08.476
And if you look at now, this output token
and generally sequenced labeling can depend

1:00:08.476 --> 1:00:26.322
on: The input tokens are the same so we can
easily model it and they only depend on the

1:00:26.322 --> 1:00:29.064
input tokens.

1:00:31.011 --> 1:00:42.306
But we can always look at one specific type
of sequence labeling, unidirectional sequence

1:00:42.306 --> 1:00:44.189
labeling type.

1:00:44.584 --> 1:01:00.855
The probability of the next word only depends
on the previous words that we are having here.

1:01:01.321 --> 1:01:05.998
That's also not completely true in language.

1:01:05.998 --> 1:01:14.418
Well, the back context might also be helpful
by direction of the model's Google.

1:01:14.654 --> 1:01:23.039
We will always admire the probability of the
word given on its history.

1:01:23.623 --> 1:01:30.562
And currently there is approximation and sequence
labeling that we have this windowing approach.

1:01:30.951 --> 1:01:43.016
So in order to predict this type of word we
always look at the previous three words.

1:01:43.016 --> 1:01:48.410
This is this type of windowing model.

1:01:49.389 --> 1:01:54.780
If you're into neural networks you recognize
this type of structure.

1:01:54.780 --> 1:01:57.515
Also, the typical neural networks.

1:01:58.938 --> 1:02:11.050
Yes, yes, so like engram models you can, at
least in some way, prepare for that type of

1:02:11.050 --> 1:02:12.289
context.

1:02:14.334 --> 1:02:23.321
Are also other types of neonamic structures
which we can use for sequins lately and which

1:02:23.321 --> 1:02:30.710
might help us where we don't have this type
of fixed size representation.

1:02:32.812 --> 1:02:34.678
That we can do so.

1:02:34.678 --> 1:02:39.391
The idea is in recurrent new networks traction.

1:02:39.391 --> 1:02:43.221
We are saving complete history in one.

1:02:43.623 --> 1:02:56.946
So again we have to do this fixed size representation
because the neural networks always need a habit.

1:02:57.157 --> 1:03:09.028
And then the network should look like that,
so we start with an initial value for our storage.

1:03:09.028 --> 1:03:15.900
We are giving our first input and calculating
the new.

1:03:16.196 --> 1:03:35.895
So again in your network with two types of
inputs: Then you can apply it to the next type

1:03:35.895 --> 1:03:41.581
of input and you're again having this.

1:03:41.581 --> 1:03:46.391
You're taking this hidden state.

1:03:47.367 --> 1:03:53.306
Nice thing is now that you can do now step
by step by step, so all the way over.

1:03:55.495 --> 1:04:06.131
The nice thing we are having here now is that
now we are having context information from

1:04:06.131 --> 1:04:07.206
all the.

1:04:07.607 --> 1:04:14.181
So if you're looking like based on which words
do you, you calculate the probability of varying.

1:04:14.554 --> 1:04:20.090
It depends on this part.

1:04:20.090 --> 1:04:33.154
It depends on and this hidden state was influenced
by two.

1:04:33.473 --> 1:04:38.259
So now we're having something new.

1:04:38.259 --> 1:04:46.463
We can model like the word probability not
only on a fixed.

1:04:46.906 --> 1:04:53.565
Because the hidden states we are having here
in our Oregon are influenced by all the trivia.

1:04:56.296 --> 1:05:02.578
So how is there to be Singapore?

1:05:02.578 --> 1:05:16.286
But then we have the initial idea about this
P of given on the history.

1:05:16.736 --> 1:05:25.300
So do not need to do any clustering here,
and you also see how things are put together

1:05:25.300 --> 1:05:26.284
in order.

1:05:29.489 --> 1:05:43.449
The green box this night since we are starting
from the left to the right.

1:05:44.524 --> 1:05:51.483
Voices: Yes, that's right, so there are clusters,
and here is also sometimes clustering happens.

1:05:51.871 --> 1:05:58.687
The small difference does matter again, so
if you have now a lot of different histories,

1:05:58.687 --> 1:06:01.674
the similarity which you have in here.

1:06:01.674 --> 1:06:08.260
If two of the histories are very similar,
these representations will be the same, and

1:06:08.260 --> 1:06:10.787
then you're treating them again.

1:06:11.071 --> 1:06:15.789
Because in order to do the final restriction
you only do a good base on the green box.

1:06:16.156 --> 1:06:28.541
So you are now still learning some type of
clustering in there, but you are learning it

1:06:28.541 --> 1:06:30.230
implicitly.

1:06:30.570 --> 1:06:38.200
The only restriction you're giving is you
have to stall everything that is important

1:06:38.200 --> 1:06:39.008
in this.

1:06:39.359 --> 1:06:54.961
So it's a different type of limitation, so
you calculate the probability based on the

1:06:54.961 --> 1:06:57.138
last words.

1:06:57.437 --> 1:07:04.430
And that is how you still need to somehow
cluster things together in order to do efficiently.

1:07:04.430 --> 1:07:09.563
Of course, you need to do some type of clustering
because otherwise.

1:07:09.970 --> 1:07:18.865
But this is where things get merged together
in this type of hidden representation.

1:07:18.865 --> 1:07:27.973
So here the probability of the word first
only depends on this hidden representation.

1:07:28.288 --> 1:07:33.104
On the previous words, but they are some other
bottleneck in order to make a good estimation.

1:07:34.474 --> 1:07:41.231
So the idea is that we can store all our history
into or into one lecture.

1:07:41.581 --> 1:07:44.812
Which is the one that makes it more strong.

1:07:44.812 --> 1:07:51.275
Next we come to problems that of course at
some point it might be difficult if you have

1:07:51.275 --> 1:07:57.811
very long sequences and you always write all
the information you have on this one block.

1:07:58.398 --> 1:08:02.233
Then maybe things get overwritten or you cannot
store everything in there.

1:08:02.662 --> 1:08:04.514
So,.

1:08:04.184 --> 1:08:09.569
Therefore, yet for short things like single
sentences that works well, but especially if

1:08:09.569 --> 1:08:15.197
you think of other tasks and like symbolizations
with our document based on T where you need

1:08:15.197 --> 1:08:20.582
to consider the full document, these things
got got a bit more more more complicated and

1:08:20.582 --> 1:08:23.063
will learn another type of architecture.

1:08:24.464 --> 1:08:30.462
In order to understand these neighbors, it
is good to have all the bus use always.

1:08:30.710 --> 1:08:33.998
So this is the unrolled view.

1:08:33.998 --> 1:08:43.753
Somewhere you're over the type or in language
over the words you're unrolling a network.

1:08:44.024 --> 1:08:52.096
Here is the article and here is the network
which is connected by itself and that is recurrent.

1:08:56.176 --> 1:09:04.982
There is one challenge in this networks and
training.

1:09:04.982 --> 1:09:11.994
We can train them first of all as forward.

1:09:12.272 --> 1:09:19.397
So we don't really know how to train them,
but if you unroll them like this is a feet

1:09:19.397 --> 1:09:20.142
forward.

1:09:20.540 --> 1:09:38.063
Is exactly the same, so you can measure your
arrows here and be back to your arrows.

1:09:38.378 --> 1:09:45.646
If you unroll something, it's a feature in
your laptop and you can train it the same way.

1:09:46.106 --> 1:09:57.606
The only important thing is again, of course,
for different inputs.

1:09:57.837 --> 1:10:05.145
But since parameters are shared, it's somehow
a similar point you can train it.

1:10:05.145 --> 1:10:08.800
The training algorithm is very similar.

1:10:10.310 --> 1:10:29.568
One thing which makes things difficult is
what is referred to as the vanish ingredient.

1:10:29.809 --> 1:10:32.799
That's a very strong thing in the motivation
of using hardness.

1:10:33.593 --> 1:10:44.604
The influence here gets smaller and smaller,
and the modems are not really able to monitor.

1:10:44.804 --> 1:10:51.939
Because the gradient gets smaller and smaller,
and so the arrow here propagated to this one

1:10:51.939 --> 1:10:58.919
that contributes to the arrow is very small,
and therefore you don't do any changes there

1:10:58.919 --> 1:10:59.617
anymore.

1:11:00.020 --> 1:11:06.703
And yeah, that's why standard art men are
undifficult or have to pick them at custard.

1:11:07.247 --> 1:11:11.462
So everywhere talking to me about fire and
ants nowadays,.

1:11:11.791 --> 1:11:23.333
What we are typically meaning are LSDN's or
long short memories.

1:11:23.333 --> 1:11:30.968
You see they are by now quite old already.

1:11:31.171 --> 1:11:39.019
So there was a model in the language model
task.

1:11:39.019 --> 1:11:44.784
It's some more storing information.

1:11:44.684 --> 1:11:51.556
Because if you only look at the last words,
it's often no longer clear this is a question

1:11:51.556 --> 1:11:52.548
or a normal.

1:11:53.013 --> 1:12:05.318
So there you have these mechanisms with ripgate
in order to store things for a longer time

1:12:05.318 --> 1:12:08.563
into your hidden state.

1:12:10.730 --> 1:12:20.162
Here they are used in in in selling quite
a lot of works.

1:12:21.541 --> 1:12:29.349
For especially machine translation now, the
standard is to do transform base models which

1:12:29.349 --> 1:12:30.477
we'll learn.

1:12:30.690 --> 1:12:38.962
But for example, in architecture we have later
one lecture about efficiency.

1:12:38.962 --> 1:12:42.830
So how can we build very efficient?

1:12:42.882 --> 1:12:53.074
And there in the decoder in parts of the networks
they are still using.

1:12:53.473 --> 1:12:57.518
So it's not that yeah our hands are of no
importance in the body.

1:12:59.239 --> 1:13:08.956
In order to make them strong, there are some
more things which are helpful and should be:

1:13:09.309 --> 1:13:19.683
So one thing is there is a nice trick to make
this new network stronger and better.

1:13:19.739 --> 1:13:21.523
So of course it doesn't work always.

1:13:21.523 --> 1:13:23.451
They have to have enough training data.

1:13:23.763 --> 1:13:28.959
But in general there's the easiest way of
making your models bigger and stronger just

1:13:28.959 --> 1:13:30.590
to increase your pyramids.

1:13:30.630 --> 1:13:43.236
And you've seen that with a large language
models they are always bragging about.

1:13:43.903 --> 1:13:56.463
This is one way, so the question is how do
you get more parameters?

1:13:56.463 --> 1:14:01.265
There's ways of doing it.

1:14:01.521 --> 1:14:10.029
And the other thing is to make your networks
deeper so to have more legs in between.

1:14:11.471 --> 1:14:13.827
And then you can also get to get more calm.

1:14:14.614 --> 1:14:23.340
There's more traveling with this and it's
very similar to what we just saw with our hand.

1:14:23.603 --> 1:14:34.253
We have this problem of radiant flow that
if it flows so fast like a radiant gets very

1:14:34.253 --> 1:14:35.477
swollen,.

1:14:35.795 --> 1:14:42.704
Exactly the same thing happens in deep LSD
ends.

1:14:42.704 --> 1:14:52.293
If you take here the gradient, tell you what
is the right or wrong.

1:14:52.612 --> 1:14:56.439
With three layers it's no problem, but if
you're going to ten, twenty or hundred layers.

1:14:57.797 --> 1:14:59.698
That's Getting Typically Young.

1:15:00.060 --> 1:15:07.000
Are doing is using what is called decisional
connections.

1:15:07.000 --> 1:15:15.855
That's a very helpful idea, which is maybe
very surprising that it works.

1:15:15.956 --> 1:15:20.309
And so the idea is that these networks.

1:15:20.320 --> 1:15:29.982
In between should no longer calculate what
is a new good representation, but they're more

1:15:29.982 --> 1:15:31.378
calculating.

1:15:31.731 --> 1:15:37.588
Therefore, in the end you're always the output
of a layer is added with the input.

1:15:38.318 --> 1:15:48.824
The knife is later if you are doing back propagation
with this very fast back propagation.

1:15:49.209 --> 1:16:02.540
Nowadays in very deep architectures, not only
on other but always has this residual or highway

1:16:02.540 --> 1:16:04.224
connection.

1:16:04.704 --> 1:16:06.616
Has two advantages.

1:16:06.616 --> 1:16:15.409
On the one hand, these layers don't need to
learn a representation, they only need to learn

1:16:15.409 --> 1:16:18.754
what to change the representation.

1:16:22.082 --> 1:16:24.172
Good.

1:16:23.843 --> 1:16:31.768
That much for the new map before, so the last
thing now means this.

1:16:31.671 --> 1:16:33.750
Language was are yeah.

1:16:33.750 --> 1:16:41.976
I were used in the molds itself and now were
seeing them again, but one thing which at the

1:16:41.976 --> 1:16:53.558
beginning they were reading was very essential
was: So people really train part of the language

1:16:53.558 --> 1:16:59.999
models only to get this type of embedding.

1:16:59.999 --> 1:17:04.193
Therefore, we want to look.

1:17:09.229 --> 1:17:15.678
So now some last words to the word embeddings.

1:17:15.678 --> 1:17:27.204
The interesting thing is that word embeddings
can be used for very different tasks.

1:17:27.347 --> 1:17:31.329
The knife wing is you can train that on just
large amounts of data.

1:17:31.931 --> 1:17:41.569
And then if you have these wooden beddings
we have seen that they reduce the parameters.

1:17:41.982 --> 1:17:52.217
So then you can train your small mark to do
any other task and therefore you are more efficient.

1:17:52.532 --> 1:17:55.218
These initial word embeddings is important.

1:17:55.218 --> 1:18:00.529
They really depend only on the word itself,
so if you look at the two meanings of can,

1:18:00.529 --> 1:18:06.328
the can of beans or I can do that, they will
have the same embedding, so some of the embedding

1:18:06.328 --> 1:18:08.709
has to save the ambiguity inside that.

1:18:09.189 --> 1:18:12.486
That cannot be resolved.

1:18:12.486 --> 1:18:24.753
Therefore, if you look at the higher levels
in the context, but in the word embedding layers

1:18:24.753 --> 1:18:27.919
that really depends on.

1:18:29.489 --> 1:18:33.757
However, even this one has quite very interesting.

1:18:34.034 --> 1:18:39.558
So that people like to visualize them.

1:18:39.558 --> 1:18:47.208
They're always difficult because if you look
at this.

1:18:47.767 --> 1:18:52.879
And drawing your five hundred damage, the
vector is still a bit challenging.

1:18:53.113 --> 1:19:12.472
So you cannot directly do that, so people
have to do it like they look at some type of.

1:19:13.073 --> 1:19:17.209
And of course then yes some information is
getting lost by a bunch of control.

1:19:18.238 --> 1:19:24.802
And you see, for example, this is the most
famous and common example, so what you can

1:19:24.802 --> 1:19:31.289
look is you can look at the difference between
the main and the female word English.

1:19:31.289 --> 1:19:37.854
This is here in your embedding of king, and
this is the embedding of queen, and this.

1:19:38.058 --> 1:19:40.394
You can do that for a very different work.

1:19:40.780 --> 1:19:45.407
And that is where the masks come into, that
is what people then look into.

1:19:45.725 --> 1:19:50.995
So what you can now, for example, do is you
can calculate the difference between man and

1:19:50.995 --> 1:19:51.410
woman?

1:19:52.232 --> 1:19:55.511
Then you can take the embedding of tea.

1:19:55.511 --> 1:20:02.806
You can add on it the difference between man
and woman, and then you can notice what are

1:20:02.806 --> 1:20:04.364
the similar words.

1:20:04.364 --> 1:20:08.954
So you won't, of course, directly hit the
correct word.

1:20:08.954 --> 1:20:10.512
It's a continuous.

1:20:10.790 --> 1:20:23.127
But you can look what are the nearest neighbors
to this same, and often these words are near

1:20:23.127 --> 1:20:24.056
there.

1:20:24.224 --> 1:20:33.913
So it somehow learns that the difference between
these words is always the same.

1:20:34.374 --> 1:20:37.746
You can do that for different things.

1:20:37.746 --> 1:20:41.296
He also imagines that it's not perfect.

1:20:41.296 --> 1:20:49.017
He says the world tends to be swimming and
swimming, and with walking and walking you.

1:20:49.469 --> 1:20:51.639
So you can try to use them.

1:20:51.639 --> 1:20:59.001
It's no longer like saying yeah, but the interesting
thing is this is completely unsupervised.

1:20:59.001 --> 1:21:03.961
So nobody taught him the principle of their
gender in language.

1:21:04.284 --> 1:21:09.910
So it's purely trained on the task of doing
the next work prediction.

1:21:10.230 --> 1:21:20.658
And even for really cementing information
like the capital, this is the difference between

1:21:20.658 --> 1:21:23.638
the city and the capital.

1:21:23.823 --> 1:21:25.518
Visualization.

1:21:25.518 --> 1:21:33.766
Here we have done the same things of the difference
between country and.

1:21:33.853 --> 1:21:41.991
You see it's not perfect, but it's building
some kinds of a right direction, so you can't

1:21:41.991 --> 1:21:43.347
even use them.

1:21:43.347 --> 1:21:51.304
For example, for question answering, if you
have the difference between them, you apply

1:21:51.304 --> 1:21:53.383
that to a new country.

1:21:54.834 --> 1:22:02.741
So it seems these ones are able to really
learn a lot of information and collapse all

1:22:02.741 --> 1:22:04.396
this information.

1:22:05.325 --> 1:22:11.769
At just to do the next word prediction: And
that also explains a bit maybe or not explains

1:22:11.769 --> 1:22:19.016
wrong life by motivating why what is the main
advantage of this type of neural models that

1:22:19.016 --> 1:22:26.025
we can use this type of hidden representation,
transfer them and use them in different.

1:22:28.568 --> 1:22:43.707
So summarize what we did today, so what you
should hopefully have with you is for machine

1:22:43.707 --> 1:22:45.893
translation.

1:22:45.805 --> 1:22:49.149
Then how we can do language modern Chinese
literature?

1:22:49.449 --> 1:22:55.617
We looked at three different architectures:
We looked into the feet forward language mode

1:22:55.617 --> 1:22:59.063
and the one based on Bluetooth machines.

1:22:59.039 --> 1:23:05.366
And finally there are different architectures
to do in your networks.

1:23:05.366 --> 1:23:14.404
We have seen feet for your networks and we'll
see the next lectures, the last type of architecture.

1:23:15.915 --> 1:23:17.412
Have Any Questions.

1:23:20.680 --> 1:23:27.341
Then thanks a lot, and next on Tuesday we
will be again in our order to know how to play.