WEBVTT | |
0:00:01.301 --> 0:00:05.707 | |
Okay So Welcome to Today's Lecture. | |
0:00:06.066 --> 0:00:12.592 | |
I'm sorry for the inconvenience. | |
0:00:12.592 --> 0:00:19.910 | |
Sometimes they are project meetings. | |
0:00:19.910 --> 0:00:25.843 | |
There will be one other time. | |
0:00:26.806 --> 0:00:40.863 | |
So what we want to talk today about is want | |
to start with neural approaches to machine | |
0:00:40.863 --> 0:00:42.964 | |
translation. | |
0:00:43.123 --> 0:00:51.285 | |
I guess you have heard about other types of | |
neural models for other types of neural language | |
0:00:51.285 --> 0:00:52.339 | |
processing. | |
0:00:52.339 --> 0:00:59.887 | |
This was some of the first steps in introducing | |
neural networks to machine translation. | |
0:01:00.600 --> 0:01:06.203 | |
They are similar to what you now see | |
in large language models. | |
0:01:06.666 --> 0:01:11.764 | |
And today we will look into what these neural language | |
models are. | |
0:01:11.764 --> 0:01:13.874 | |
What is the difference? | |
0:01:13.874 --> 0:01:15.983 | |
What is the motivation? | |
0:01:16.316 --> 0:01:21.445 | |
And first we will use them in statistical | |
machine translation. | |
0:01:21.445 --> 0:01:28.935 | |
So if you remember, two or three | |
weeks ago we had this log-linear model where you | |
0:01:28.935 --> 0:01:31.052 | |
can easily integrate any feature. | |
0:01:31.351 --> 0:01:40.967 | |
We just have another model which evaluates | |
how good a system is or how good a fluent language | |
0:01:40.967 --> 0:01:41.376 | |
is. | |
0:01:41.376 --> 0:01:53.749 | |
The main advantage compared to the statistical | |
models we saw on Tuesday is: Next week we will | |
0:01:53.749 --> 0:02:06.496 | |
then go for a neural machine translation where | |
we replace the whole model. | |
0:02:11.211 --> 0:02:21.078 | |
Just as a reminder from Tuesday, we've seen | |
that the main challenge in language modeling was that | |
0:02:21.078 --> 0:02:25.134 | |
most of the n-grams we haven't seen. | |
0:02:26.946 --> 0:02:33.967 | |
So this was therefore difficult to estimate | |
any probability because you've seen that normally | |
0:02:33.967 --> 0:02:39.494 | |
if you have not seen the n-gram you will assign | |
the probability of zero. | |
0:02:39.980 --> 0:02:49.420 | |
However, this is not really very good because | |
we don't want to give zero probabilities to | |
0:02:49.420 --> 0:02:54.979 | |
sentences, which still might be a very good | |
English. | |
0:02:55.415 --> 0:03:02.167 | |
And then we learned a lot of techniques and | |
that is the main challenge in statistical machine | |
0:03:02.167 --> 0:03:04.490 | |
translation and statistical language modeling: | |
0:03:04.490 --> 0:03:10.661 | |
namely how we can give a good estimate of | |
probability to events that we haven't seen | |
0:03:10.661 --> 0:03:12.258 | |
using smoothing techniques. | |
0:03:12.258 --> 0:03:15.307 | |
We've seen interpolation and backoff. | |
0:03:15.435 --> 0:03:21.637 | |
And people invented or developed very specific techniques | |
0:03:21.637 --> 0:03:26.903 | |
to deal with that; however, it might not be enough. | |
0:03:28.568 --> 0:03:43.190 | |
And therefore maybe we can do things differently, | |
because if we have not seen an n-gram before in statistical | |
0:03:43.190 --> 0:03:44.348 | |
models. | |
0:03:45.225 --> 0:03:51.361 | |
Before and we can only get information from | |
exactly the same words. | |
0:03:51.411 --> 0:04:06.782 | |
We don't have something like approximate matching, | |
where we could use a sentence that occurs similarly. | |
0:04:06.782 --> 0:04:10.282 | |
So if you have seen a. | |
0:04:11.191 --> 0:04:17.748 | |
And so you would like to have something more | |
like that, where n-grams are represented more | |
0:04:17.748 --> 0:04:21.953 | |
in a general space, and we can generalize to similar | |
n-grams. | |
0:04:22.262 --> 0:04:29.874 | |
So if you learn something about walk then | |
maybe we can use this knowledge and also apply. | |
0:04:30.290 --> 0:04:42.596 | |
The same as we have done before, but we can | |
really better model how similar they are and | |
0:04:42.596 --> 0:04:45.223 | |
transfer to other. | |
0:04:47.047 --> 0:04:54.236 | |
And we maybe want to do that in a more hierarchical | |
approach that we know okay. | |
0:04:54.236 --> 0:05:02.773 | |
Some words are similar but like go and walk | |
is somehow similar and I and P and G and therefore | |
0:05:02.773 --> 0:05:06.996 | |
maybe we can then merge them in an n-gram. | |
0:05:07.387 --> 0:05:15.861 | |
If we learn something about our walk, then | |
it should tell us also something about Hugo. | |
0:05:15.861 --> 0:05:17.113 | |
He walks or. | |
0:05:17.197 --> 0:05:27.327 | |
You see that there is some relations which | |
we need to integrate here. | |
0:05:27.327 --> 0:05:35.514 | |
We need to add the s, but maybe walks should | |
also be here. | |
0:05:37.137 --> 0:05:45.149 | |
And luckily there is one really convincing | |
method of doing that, and that is by using | |
0:05:45.149 --> 0:05:47.231 | |
neural networks. | |
0:05:47.387 --> 0:05:58.497 | |
That's what we will introduce today so we | |
can use this type of neural networks to try | |
0:05:58.497 --> 0:06:04.053 | |
to learn this similarity and to learn how. | |
0:06:04.324 --> 0:06:14.355 | |
And that is one of the main advantages that | |
we have by switching from the standard statistical | |
0:06:14.355 --> 0:06:15.200 | |
models. | |
0:06:15.115 --> 0:06:22.830 | |
To learn similarities between words and generalized, | |
and learn what is called hidden representations | |
0:06:22.830 --> 0:06:29.705 | |
or representations of words, where we can measure | |
similarity in some dimensions of words. | |
0:06:30.290 --> 0:06:42.384 | |
So we can measure in which way words are similar. | |
0:06:42.822 --> 0:06:48.902 | |
We had it before and we've seen that words | |
were just easier. | |
0:06:48.902 --> 0:06:51.991 | |
The only thing we did is like. | |
0:06:52.192 --> 0:07:02.272 | |
But these indices don't have any meaning, | |
so it isn't that one word is more similar to another word. | |
0:07:02.582 --> 0:07:12.112 | |
So we couldn't learn anything about words | |
in the statistical model and that's a big challenge. | |
0:07:12.192 --> 0:07:23.063 | |
About words, even like in morphology: go and | |
goes are somehow more similar because of the third person | |
0:07:23.063 --> 0:07:24.219 | |
singular. | |
0:07:24.264 --> 0:07:34.924 | |
The basic models we had until now have no idea | |
about that, and goes is as similar to go as it | |
0:07:34.924 --> 0:07:37.175 | |
might be to sleep. | |
0:07:39.919 --> 0:07:44.073 | |
So what we want to do today. | |
0:07:44.073 --> 0:07:53.096 | |
In order to get there we will have a short | |
introduction into neural networks. | |
0:07:53.954 --> 0:08:05.984 | |
It very short just to see how we use them | |
here, but that's a good thing, so most of you | |
0:08:05.984 --> 0:08:08.445 | |
think it will be. | |
0:08:08.928 --> 0:08:14.078 | |
And then we will first look into feed-forward | |
neural network language models. | |
0:08:14.454 --> 0:08:23.706 | |
And there we will still have the approximation | |
0:08:23.706 --> 0:08:33.902 | |
we had before, where we are looking only at a fixed | |
window. | |
0:08:34.154 --> 0:08:35.030 | |
The case. | |
0:08:35.030 --> 0:08:38.270 | |
However, we have the embeddings here. | |
0:08:38.270 --> 0:08:43.350 | |
That's why they're already better in order | |
to generalize. | |
0:08:44.024 --> 0:08:53.169 | |
And then at the end we'll look at language | |
models where we then have the additional advantage. | |
0:08:53.093 --> 0:09:04.317 | |
that we don't need to have a fixed history, | |
but in theory we can model arbitrarily long dependencies. | |
0:09:04.304 --> 0:09:12.687 | |
And we talked about on Tuesday where it is | |
not clear what type of information it is to. | |
0:09:16.396 --> 0:09:24.981 | |
So in general, neural networks normally | |
learn to perform some task. | |
0:09:25.325 --> 0:09:33.472 | |
We have the structure and we are learning | |
them from samples so that is similar to what | |
0:09:33.472 --> 0:09:34.971 | |
we have before. | |
0:09:34.971 --> 0:09:42.275 | |
So now we have the same task here, a language | |
model giving input or forwards. | |
0:09:42.642 --> 0:09:48.959 | |
And they are somewhat originally motivated by the human | |
brain. | |
0:09:48.959 --> 0:10:00.639 | |
However, for what you now need to know about artificial | |
neural networks, this similarity is not that important. | |
0:10:00.540 --> 0:10:02.889 | |
There seemed to be not that point. | |
0:10:03.123 --> 0:10:11.014 | |
So what they are mainly doing is summation and | |
multiplication and then one non-linear activation. | |
0:10:12.692 --> 0:10:16.085 | |
So the basic units are these types of perceptrons. | |
0:10:17.937 --> 0:10:29.891 | |
Perceptrons are the basic blocks which we have, and | |
they do the processing, so we have a fixed number | |
0:10:29.891 --> 0:10:36.070 | |
of input features and that will be important. | |
0:10:36.096 --> 0:10:39.689 | |
So we have here numbers x1 to xn as input. | |
0:10:40.060 --> 0:10:53.221 | |
And this, of course, partly makes language processing | |
difficult, because sentences have varying length. | |
0:10:54.114 --> 0:10:57.609 | |
So we have to model this time on and then | |
go stand home and model. | |
0:10:58.198 --> 0:11:02.099 | |
Then we are having weights, which are the | |
parameters, and the number of weights is exactly | |
0:11:02.099 --> 0:11:03.668 | |
the same as the number | |
0:11:04.164 --> 0:11:06.322 | |
of input features. | |
0:11:06.322 --> 0:11:15.068 | |
Sometimes you also have a bias in there, and then | |
it's not really an input. | |
0:11:15.195 --> 0:11:19.205 | |
And what you then do is multiply. | |
0:11:19.205 --> 0:11:26.164 | |
each input with its weight, and then you sum | |
it up. | |
0:11:26.606 --> 0:11:34.357 | |
What is then additionally later important | |
is that we have an activation function and | |
0:11:34.357 --> 0:11:42.473 | |
it's important that this activation function | |
is non-linear, because otherwise we would just have a linear model. | |
0:11:43.243 --> 0:11:54.088 | |
And later it will be important that this is | |
differentiable, because otherwise the training won't work. | |
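As a minimal illustration of the unit just described (a weighted sum of a fixed number of inputs plus a bias, followed by a non-linear, differentiable activation), here is a small numpy sketch; the sigmoid choice and the concrete numbers are only assumptions for the example, not the lecture's exact setup.

```python
import numpy as np

def sigmoid(z):
    # non-linear, differentiable activation
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_unit(x, w, b):
    # weighted sum of the inputs plus bias, then the activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 2.0])    # fixed number of input features x1..xn
w = np.array([0.5, -1.0, 0.25])  # one weight per input feature
print(perceptron_unit(x, w, b=0.1))
```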
0:11:54.714 --> 0:12:01.907 | |
This model by itself is not very powerful. | |
0:12:01.907 --> 0:12:10.437 | |
It was shown early on that a single perceptron alone is not very powerful. | |
0:12:10.710 --> 0:12:19.463 | |
However, there is a very easy extension, the | |
multi-layer perceptron, and then things get | |
0:12:19.463 --> 0:12:20.939 | |
very powerful. | |
0:12:21.081 --> 0:12:27.719 | |
The thing is you just connect a lot of these | |
in this layer of structures and we have our | |
0:12:27.719 --> 0:12:35.029 | |
input layer where we have the inputs, and our | |
hidden layer, at least one, which is fully connected. | |
0:12:35.395 --> 0:12:39.817 | |
And then we can combine them all to do that. | |
0:12:40.260 --> 0:12:48.320 | |
The input layer is of course somewhat given | |
by the dimension of your problem. | |
0:12:48.320 --> 0:13:00.013 | |
The output layer is also given by your dimension, | |
but the hidden layer is of course a hyperparameter. | |
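A sketch of the multi-layer perceptron just described: input and output dimensions are given by the problem, the hidden size is the hyperparameter, and every layer is again a weighted sum followed by a non-linearity. All sizes here are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3      # hidden size is the hyperparameter

W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), np.zeros(d_out)

def mlp(x):
    h = np.tanh(W1 @ x + b1)         # hidden layer: weighted sum + non-linearity
    return W2 @ h + b2               # output layer (softmax etc. comes later)

print(mlp(rng.normal(size=d_in)))
```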
0:13:01.621 --> 0:13:08.802 | |
So let's start with the first question, now | |
more language related, and that is how we represent words. | |
0:13:09.149 --> 0:13:23.460 | |
So we've seen here what we have, but the question | |
is now how can we put a word into this network? | |
0:13:26.866 --> 0:13:34.117 | |
Noise: The first thing we're able to be better | |
is by the fact that like you are said,. | |
0:13:34.314 --> 0:13:43.028 | |
That is not that easy because the continuous | |
vector will come to that. | |
0:13:43.028 --> 0:13:50.392 | |
So from the neural network side we can directly put | |
in the embedding. | |
0:13:50.630 --> 0:13:57.277 | |
But if we need to input a word into the neural | |
network, it has to be something which is easily | |
0:13:57.277 --> 0:13:57.907 | |
defined. | |
0:13:59.079 --> 0:14:12.492 | |
The solution is the one-hot encoding, a one-out-of-n | |
encoding, so one value is one, and all | |
0:14:12.492 --> 0:14:15.324 | |
the others are zero. | |
0:14:16.316 --> 0:14:25.936 | |
That means we are always dealing with fixed | |
vocabularies, because as said we cannot easily extend them. | |
0:14:26.246 --> 0:14:38.017 | |
So you cannot easily extend your vocabulary | |
because if you mean you would extend your vocabulary. | |
0:14:39.980 --> 0:14:41.502 | |
That's also motivating. | |
0:14:41.502 --> 0:14:43.722 | |
We talked about byte-pair encoding. | |
0:14:43.722 --> 0:14:45.434 | |
That's a nice thing there. | |
0:14:45.434 --> 0:14:47.210 | |
We have a fixed vocabulary. | |
0:14:48.048 --> 0:14:55.804 | |
The big advantage of this one encoding is | |
that we don't implicitly assume any | |
0:14:55.804 --> 0:15:04.291 | |
similarity between words, but really learn it, | |
because if you first think about this, this | |
0:15:04.291 --> 0:15:06.938 | |
is a very, very inefficient representation. | |
0:15:07.227 --> 0:15:15.889 | |
So to represent n words, you | |
need an n-dimensional vector. | |
0:15:16.236 --> 0:15:24.846 | |
Imagine you could do binary encoding so you | |
could represent words as binary vectors. | |
0:15:24.846 --> 0:15:26.467 | |
Then you would. | |
0:15:26.806 --> 0:15:31.177 | |
Will be significantly more efficient. | |
0:15:31.177 --> 0:15:36.813 | |
However, then you have some implicit similarity. | |
0:15:36.813 --> 0:15:39.113 | |
Some numbers would share bits. | |
0:15:39.559 --> 0:15:46.958 | |
That would somehow be bad because you would force | |
some implicit similarity, and it is not clear how to | |
0:15:46.958 --> 0:15:47.631 | |
define it. | |
0:15:48.108 --> 0:15:55.135 | |
So therefore currently this is the most successful | |
approach, to just use these one-hot | |
0:15:55.095 --> 0:15:59.563 | |
representations: so we take a fixed vocabulary. | |
0:15:59.563 --> 0:16:06.171 | |
We map each word to an index, and then we | |
represent a word like this. | |
0:16:06.171 --> 0:16:13.246 | |
So if home is word one, the representation | |
will be one zero zero zero, and so on. | |
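A minimal sketch of this one-hot (one-out-of-n) word encoding with a fixed vocabulary; the toy vocabulary is of course just an assumption for the example.

```python
import numpy as np

vocab = {"home": 0, "go": 1, "walk": 2, "he": 3}   # fixed vocabulary: word -> index

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0          # one entry is 1, all others stay 0
    return v

print(one_hot("home", vocab))     # [1. 0. 0. 0.]
```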
0:16:14.514 --> 0:16:30.639 | |
But this dimension here is a vocabulary size | |
and that is quite high, so we are always trying | |
0:16:30.639 --> 0:16:33.586 | |
to be efficient. | |
0:16:33.853 --> 0:16:43.792 | |
We are doing then some type of efficiency | |
because typically we are having this next layer. | |
0:16:44.104 --> 0:16:51.967 | |
It can be still maybe two hundred or five | |
hundred or one thousand neurons, but this is | |
0:16:51.967 --> 0:16:53.323 | |
significantly. | |
0:16:53.713 --> 0:17:03.792 | |
You can learn that directly and there we then | |
have similarity between words. | |
0:17:03.792 --> 0:17:07.458 | |
Then it is that some words. | |
0:17:07.807 --> 0:17:14.772 | |
But the nice thing is that this is then learned | |
that we are not need to hand define that. | |
0:17:17.117 --> 0:17:32.742 | |
We'll come later to the explicit architecture | |
of the neural language one, and there we can | |
0:17:32.742 --> 0:17:35.146 | |
see how it's. | |
0:17:38.418 --> 0:17:44.857 | |
So we're seeing that the one-hot | |
representation always has the same similarity between words. | |
0:17:45.105 --> 0:17:59.142 | |
Then we're having this continuous vector, which | |
has a much smaller dimension, and that's important | |
0:17:59.142 --> 0:18:00.768 | |
for later. | |
0:18:01.121 --> 0:18:06.989 | |
What we are doing then is learning these representations | |
so that they are best for language. | |
0:18:07.487 --> 0:18:14.968 | |
So the representations are implicitly trained | |
so that they are best for the language modeling task. | |
0:18:14.968 --> 0:18:19.058 | |
This is the best way for doing language. | |
0:18:19.479 --> 0:18:32.564 | |
And the nice thing that was found out later | |
is these representations are really good. | |
0:18:33.153 --> 0:18:39.253 | |
And that is why they are now even called word | |
embeddings by themselves and used for other | |
0:18:39.253 --> 0:18:39.727 | |
tasks. | |
0:18:40.360 --> 0:18:49.821 | |
And they are somewhat describing very different | |
things, so they can describe semantic similarities. | |
0:18:49.789 --> 0:18:58.650 | |
Are looking at the very example of today mass | |
vector space by adding words and doing some | |
0:18:58.650 --> 0:19:00.618 | |
interesting things. | |
0:19:00.940 --> 0:19:11.178 | |
So they got really like the first big improvement | |
when switching to neural models. | |
0:19:11.491 --> 0:19:20.456 | |
Are like part of the model, but with more | |
complex representation, but they are the basic | |
0:19:20.456 --> 0:19:21.261 | |
models. | |
0:19:23.683 --> 0:19:36.979 | |
In the output layer we are also having one | |
output structure and an activation function. | |
0:19:36.997 --> 0:19:46.525 | |
That is because for language modeling we want to | |
predict what the most probable next word is. | |
0:19:47.247 --> 0:19:56.453 | |
And that can be done very well with this so | |
called softmax layer, where again the dimension is the | |
0:19:56.376 --> 0:20:02.825 | |
vocabulary size, so this is the vocabulary size, | |
and again the k-th neuron represents the k-th | |
0:20:02.825 --> 0:20:03.310 | |
class. | |
0:20:03.310 --> 0:20:09.759 | |
So in our case we have again a one-hot representation | |
saying this is the correct word. | |
0:20:10.090 --> 0:20:17.255 | |
Our probability distribution is a probability | |
distribution over all words, so the k-th entry | |
0:20:17.255 --> 0:20:21.338 | |
tells us how probable it is that the next word | |
is this. | |
0:20:22.682 --> 0:20:33.885 | |
So we need to have some probability distribution | |
at our output in order to achieve that this | |
0:20:33.885 --> 0:20:37.017 | |
activation function goes. | |
0:20:37.197 --> 0:20:46.944 | |
And we can achieve that with a softmax activation: | |
we take e to the power of the value, | |
0:20:46.944 --> 0:20:47.970 | |
and then normalize by the sum. | |
0:20:48.288 --> 0:20:58.021 | |
So by having this type of activation function | |
we are really getting this type of probability. | |
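The softmax activation described here exponentiates each value and divides by the sum over the whole output layer, so the result is a probability distribution. A small, numerically stabilised sketch (the stabilisation trick is an implementation assumption, not from the lecture):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # for numerical stability, does not change the result
    e = np.exp(z)                 # e to the power of each value
    return e / e.sum()            # normalise by the sum over the vocabulary

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                 # probabilities, summing to 1
```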
0:20:59.019 --> 0:21:15.200 | |
At the beginning this was also very challenging | |
because again we have this inefficient representation. | |
0:21:15.235 --> 0:21:29.799 | |
You can imagine that summing over the whole vocabulary is maybe | |
a bit inefficient even with GPUs, but definitely doable. | |
0:21:36.316 --> 0:21:44.072 | |
And then for training the models that will | |
be fine, so we have to use architecture now. | |
0:21:44.264 --> 0:21:48.491 | |
We need to minimize the error. | |
0:21:48.491 --> 0:21:53.264 | |
How are we doing it? Taking the output, | |
0:21:53.264 --> 0:21:58.174 | |
We are comparing it to our targets. | |
0:21:58.298 --> 0:22:03.830 | |
So one important thing is by training them. | |
0:22:03.830 --> 0:22:07.603 | |
How can we measure the error? | |
0:22:07.603 --> 0:22:12.758 | |
So what is if we are training the ideas? | |
0:22:13.033 --> 0:22:15.163 | |
And how do we measure it? | |
0:22:15.163 --> 0:22:19.768 | |
It is in natural language processing, typically | |
the cross entropy. | |
0:22:19.960 --> 0:22:35.575 | |
And that means we are comparing the target | |
with the output. | |
0:22:35.335 --> 0:22:44.430 | |
It gets optimized and you're seeing that this, | |
of course, makes it again very nice and easy | |
0:22:44.430 --> 0:22:49.868 | |
because our target is again a one-hot representation. | |
0:22:50.110 --> 0:23:00.116 | |
So all of these are always zero, and what | |
we are then doing is we are taking the one. | |
0:23:00.100 --> 0:23:04.615 | |
And we only need to multiply the one with | |
the logarithm here, and that is all the feedback | |
0:23:04.615 --> 0:23:05.955 | |
signal we are taking here. | |
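A sketch of the cross-entropy with a one-hot target, as described: since all target entries except one are zero, the loss reduces to the negative log probability of the correct word.

```python
import numpy as np

def cross_entropy(p, target_index):
    # full form would be -sum_k t_k * log(p_k); with a one-hot target
    # only the entry of the correct word survives
    return -np.log(p[target_index])

p = np.array([0.7, 0.2, 0.1])     # model output (a probability distribution)
print(cross_entropy(p, target_index=0))
```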
0:23:06.946 --> 0:23:13.885 | |
Of course, this is not always influenced by | |
all the others. | |
0:23:13.885 --> 0:23:17.933 | |
Why is this influenced by all the others? | |
0:23:24.304 --> 0:23:34.382 | |
We have the softmax activation, which is the | |
current value divided by the sum over all the others. | |
0:23:34.354 --> 0:23:45.924 | |
Otherwise it could easily just increase this | |
value and ignore the others, but if you increase | |
0:23:45.924 --> 0:23:49.090 | |
one value, all the others get smaller. | |
0:23:51.351 --> 0:23:59.912 | |
Then we can do with neural networks one very nice | |
and easy type of training that is done in all | |
0:23:59.912 --> 0:24:07.721 | |
the neural networks, where we are now calculating | |
our error and especially the gradient. | |
0:24:07.707 --> 0:24:11.640 | |
So in which direction does the error show? | |
0:24:11.640 --> 0:24:18.682 | |
And then if we want to get a smaller error, | |
that's what we want to achieve. | |
0:24:18.682 --> 0:24:26.638 | |
We are taking the inverse direction of the | |
gradient and thereby trying to minimize our | |
0:24:26.638 --> 0:24:27.278 | |
error. | |
0:24:27.287 --> 0:24:31.041 | |
And we have to do that, of course, for all | |
the weights. | |
0:24:31.041 --> 0:24:36.672 | |
And to calculate the error of all the weights, | |
we won't do the full derivation of backpropagation here. | |
0:24:36.672 --> 0:24:41.432 | |
But what you can do is you can propagate | |
the error which you measured. | |
0:24:41.432 --> 0:24:46.393 | |
At the end you can propagate it back; it's basic | |
math and basic derivatives. | |
0:24:46.706 --> 0:24:58.854 | |
For each weight in your model you measure how much | |
it contributed to the error and then change | |
0:24:58.854 --> 0:25:01.339 | |
it in a way that. | |
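The update described here takes the inverse direction of the gradient for every weight; a minimal sketch, where the learning rate value is just a placeholder assumption.

```python
def sgd_step(weights, gradients, learning_rate=0.1):
    # move every weight a small step against its gradient to reduce the error
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

print(sgd_step([0.5, -1.0], [0.2, -0.4]))   # [0.48, -0.96]
```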
0:25:04.524 --> 0:25:11.625 | |
So to summarize what you should, at least for machine | |
translation and neural machine translation, | |
0:25:11.625 --> 0:25:19.044 | |
remember in order to understand the rest: | |
this is how a multi-layer perceptron | |
0:25:19.044 --> 0:25:20.640 | |
looks like. | |
0:25:20.580 --> 0:25:28.251 | |
There are fully connected layers and no connections | |
0:25:28.108 --> 0:25:29.759 | |
across layers. | |
0:25:29.829 --> 0:25:35.153 | |
And what they're doing is always just a weighted | |
sum here and then an activation function. | |
0:25:35.415 --> 0:25:38.792 | |
And in order to train you have this forward | |
and backward pass. | |
0:25:39.039 --> 0:25:41.384 | |
So We Put in Here. | |
0:25:41.281 --> 0:25:41.895 | |
Inputs. | |
0:25:41.895 --> 0:25:45.347 | |
We have some random values at the beginning. | |
0:25:45.347 --> 0:25:47.418 | |
Then calculate the output. | |
0:25:47.418 --> 0:25:54.246 | |
We are measuring how big our error is, propagating | |
the error back and then changing our model | |
0:25:54.246 --> 0:25:57.928 | |
in a way that we hopefully get a smaller error. | |
0:25:57.928 --> 0:25:59.616 | |
And then that is how. | |
0:26:01.962 --> 0:26:12.893 | |
So before we're coming into our neural networks | |
language models, how can we use this type of | |
0:26:12.893 --> 0:26:17.595 | |
neural network to do language modeling? | |
0:26:23.103 --> 0:26:33.157 | |
So how can we use them in natural language | |
processing, especially machine translation? | |
0:26:33.157 --> 0:26:41.799 | |
The first idea of using them was to estimate: | |
So we have seen that the output can be monitored | |
0:26:41.799 --> 0:26:42.599 | |
here as well. | |
0:26:43.603 --> 0:26:50.311 | |
A probability distribution and if we have | |
a full vocabulary we could here estimate | |
0:26:50.311 --> 0:26:56.727 | |
how probable each next word is and then use | |
that in our language model fashion as we've | |
0:26:56.727 --> 0:26:58.112 | |
done it last time. | |
0:26:58.112 --> 0:27:03.215 | |
We got the probability of a full sentence | |
as a product of individual. | |
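In log space the product of the individual word probabilities becomes a sum; a small sketch, where `lm_prob` is only a placeholder for whatever model (n-gram or neural) gives p(word | history).

```python
import math

def sentence_logprob(words, lm_prob):
    # p(w1..wn) = product over i of p(w_i | w_1..w_{i-1}); summed in log space
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(lm_prob(w, words[:i]))
    return total

# usage sketch with a dummy uniform model over a 10k vocabulary
print(sentence_logprob(["i", "go", "home"], lambda w, hist: 1.0 / 10_000))
```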
0:27:04.544 --> 0:27:12.820 | |
And: That was done in the ninety seven years | |
and it's very easy to integrate it into this | |
0:27:12.820 --> 0:27:14.545 | |
log-linear model. | |
0:27:14.545 --> 0:27:19.570 | |
So we have said that this is how the log-linear | |
model looks like. | |
0:27:19.570 --> 0:27:25.119 | |
So we are searching the best translation which | |
minimizes each waste time. | |
0:27:25.125 --> 0:27:26.362 | |
The Future About You. | |
0:27:26.646 --> 0:27:31.647 | |
We have that with minimum error rate training | |
if you can remember where we search for the | |
0:27:31.647 --> 0:27:32.147 | |
optimal weights. | |
0:27:32.512 --> 0:27:40.422 | |
The language model and many others, and we | |
can just add here a neural model as one more | |
0:27:40.422 --> 0:27:41.591 | |
of the features. | |
0:27:41.861 --> 0:27:45.761 | |
So that is quite easy as said. | |
0:27:45.761 --> 0:27:53.183 | |
That was how statistical machine translation | |
was improved. | |
0:27:53.183 --> 0:27:57.082 | |
You just add one more feature. | |
0:27:58.798 --> 0:28:07.631 | |
So how can we model the language modeling | |
with a network? | |
0:28:07.631 --> 0:28:16.008 | |
So what we have to do is model the probability | |
of the next word given the previous words. | |
0:28:16.656 --> 0:28:25.047 | |
The general problem is that | |
mostly we haven't seen long sequences. | |
0:28:25.085 --> 0:28:35.650 | |
Mostly we have to back off to very short sequences, | |
and we are working on this discrete space where | |
0:28:35.650 --> 0:28:36.944 | |
there is no notion of similarity. | |
0:28:37.337 --> 0:28:50.163 | |
So the idea is: if we have now a neural network, | |
we can map words into a continuous representation. | |
0:28:51.091 --> 0:29:00.480 | |
And the structure then looks like this, so | |
this is a basic still feed forward neural network. | |
0:29:01.361 --> 0:29:10.645 | |
We are doing this approximation again, so | |
we are not putting in all previous words, but | |
0:29:10.645 --> 0:29:11.375 | |
only a fixed number of them. | |
0:29:11.691 --> 0:29:25.856 | |
This is done because we said that in the neural | |
network we can have only a fixed size of input. | |
0:29:25.945 --> 0:29:31.886 | |
You can only use a fixed size, and here we | |
are doing that with exactly n minus one words. | |
0:29:33.593 --> 0:29:39.536 | |
So here you have, for example, three words, | |
each as a one-hot vector where one entry is | |
0:29:39.536 --> 0:29:50.704 | |
one and all the others are zero. And then we're | |
having the first layer of the neural network, | |
0:29:50.704 --> 0:29:56.230 | |
which, as you learned, is the word embedding. | |
0:29:57.437 --> 0:30:04.976 | |
There is one thing which is maybe special | |
compared to the standard neural network. | |
0:30:05.345 --> 0:30:11.918 | |
So the representation of this word we want | |
to learn, first of all, position-independently. | |
0:30:11.918 --> 0:30:19.013 | |
So we just want to learn what is the general | |
meaning of the word independent of its neighbors. | |
0:30:19.299 --> 0:30:26.239 | |
And therefore the representation you get here | |
should be the same as if the word were in the second position. | |
0:30:27.247 --> 0:30:36.865 | |
The nice thing you can achieve is that these | |
weights which you're using here you're reusing | |
0:30:36.865 --> 0:30:41.727 | |
here and reusing here, so we are forcing them to be the same. | |
0:30:42.322 --> 0:30:48.360 | |
You then learn your word embedding, which | |
is contextual, independent, so it's the same | |
0:30:48.360 --> 0:30:49.678 | |
for each position. | |
0:30:49.909 --> 0:31:03.482 | |
So that's the idea that you want to learn | |
the representation of the word first, and you don't want | |
0:31:03.482 --> 0:31:07.599 | |
to really use the context. | |
0:31:08.348 --> 0:31:13.797 | |
That of course might have a different meaning | |
depending on where it stands, but we'll learn | |
0:31:13.797 --> 0:31:14.153 | |
that. | |
0:31:14.514 --> 0:31:20.386 | |
So first we are learning here representational | |
words, which is just the representation. | |
0:31:20.760 --> 0:31:32.498 | |
Normally we said in neurons all input neurons | |
here are connected to all here, but we're reducing | |
0:31:32.498 --> 0:31:37.338 | |
the complexity by connecting these neurons only to this block. | |
0:31:37.857 --> 0:31:47.912 | |
Then we have a much denser representation, that | |
is, our three word embeddings here, and now | |
0:31:47.912 --> 0:31:57.408 | |
we are learning this interaction between words, | |
an interaction not based on the discrete symbols. | |
0:31:57.677 --> 0:32:08.051 | |
So we have at least one fully connected layer here, | |
which takes the three embeddings as input and then | |
0:32:08.051 --> 0:32:14.208 | |
learns a new embedding which now represents | |
the full context. | |
0:32:15.535 --> 0:32:16.551 | |
Layers. | |
0:32:16.551 --> 0:32:27.854 | |
Then there is the output layer, which again gives | |
the probability distribution over all the words. | |
0:32:28.168 --> 0:32:48.612 | |
So here is your target prediction. | |
0:32:48.688 --> 0:32:56.361 | |
The nice thing is that you learn everything | |
together, so you don't have to teach them what | |
0:32:56.361 --> 0:32:58.722 | |
a good word representation is. | |
0:32:59.079 --> 0:33:08.306 | |
You are training the whole network together, so it | |
learns what a good representation for a word | |
0:33:08.306 --> 0:33:13.079 | |
is in order to perform your final task. | |
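Putting the pieces together, a compact sketch of the feed-forward n-gram language model described above: the same embedding matrix is reused for every input position, one fully connected layer combines the n-1 embeddings, and a softmax over the vocabulary gives the next-word distribution. All sizes and the tanh choice are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid, n = 10_000, 100, 200, 4          # vocab, embedding, hidden size, n-gram order

E  = rng.normal(0, 0.1, (d_emb, V))               # embedding matrix, shared over positions
W1 = rng.normal(0, 0.1, (d_hid, (n - 1) * d_emb)) # combines the n-1 word embeddings
W2 = rng.normal(0, 0.1, (V, d_hid))               # output projection to the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_distribution(context_ids):          # n-1 previous word indices
    embs = np.concatenate([E[:, i] for i in context_ids])  # position-independent lookup
    h = np.tanh(W1 @ embs)                        # hidden layer over the full context
    return softmax(W2 @ h)                        # p(next word | context)

p = next_word_distribution([12, 7, 431])
print(p.shape, p.sum())
```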
0:33:15.956 --> 0:33:19.190 | |
Yeah, that is the main idea. | |
0:33:20.660 --> 0:33:32.731 | |
This is nowadays often referred to as one | |
way of self-supervised learning. | |
0:33:33.053 --> 0:33:37.120 | |
The output is the next word and the input | |
is the previous word. | |
0:33:37.377 --> 0:33:46.783 | |
But it's not really that we created labels, | |
but we artificially created a task out of unlabeled data. | |
0:33:46.806 --> 0:33:59.434 | |
We just had pure text, and then we created | |
the task labels by predicting the next word, | |
0:33:59.434 --> 0:34:18.797 | |
which is: Say we have like two sentences like | |
go home and the second one is go to prepare. | |
0:34:18.858 --> 0:34:30.135 | |
And then we have to predict the next series | |
and my questions in the labels for the album. | |
0:34:31.411 --> 0:34:42.752 | |
We model this as one vector with like probability | |
for possible weights starting again. | |
0:34:44.044 --> 0:34:57.792 | |
Multiple examples, so then you would twice | |
train one to predict KRT, one to predict home, | |
0:34:57.792 --> 0:35:02.374 | |
and then of course the easel. | |
0:35:04.564 --> 0:35:13.568 | |
Is a very good point, so you are not aggregating | |
examples beforehand, but you are taking each. | |
0:35:19.259 --> 0:35:37.204 | |
So you simultaneously learn the | |
projection layer and the n-gram probabilities, | |
0:35:37.204 --> 0:35:39.198 | |
and then. | |
0:35:39.499 --> 0:35:47.684 | |
And people later analyzed that these representations | |
are very powerful. | |
0:35:47.684 --> 0:35:56.358 | |
The task is just a very important task to | |
model what is the next word. | |
0:35:56.816 --> 0:35:59.842 | |
Is motivated by nowadays. | |
0:35:59.842 --> 0:36:10.666 | |
In order to get the meaning of the word you | |
have to look at the company it keeps, that is, the context. | |
0:36:10.790 --> 0:36:16.048 | |
If you read a text with a word which you | |
have never seen, you often can still estimate | |
0:36:16.048 --> 0:36:21.130 | |
the meaning of this word because you see | |
how it is used, and that it is typically | |
0:36:21.130 --> 0:36:22.240 | |
used as a city or. | |
0:36:22.602 --> 0:36:25.865 | |
Just imagine you read a text about some city. | |
0:36:25.865 --> 0:36:32.037 | |
Even if you've never seen the city before, | |
you often know from the context of how it's | |
0:36:32.037 --> 0:36:32.463 | |
used. | |
0:36:34.094 --> 0:36:42.483 | |
So what is now the big advantage of using | |
neural networks? | |
0:36:42.483 --> 0:36:51.851 | |
So just imagine we have to estimate that I | |
bought my first iPhone. | |
0:36:52.052 --> 0:36:56.608 | |
So you have to monitor the probability of | |
ad hitting them. | |
0:36:56.608 --> 0:37:00.237 | |
Now imagine iPhone, which you have never seen. | |
0:37:00.600 --> 0:37:11.588 | |
So all the techniques we had last time at | |
the end, if you haven't seen iPhone you will | |
0:37:11.588 --> 0:37:14.240 | |
always fall back to very short contexts. | |
0:37:15.055 --> 0:37:26.230 | |
You have no idea how to deal with that; you won't | |
have seen the bigram, the trigram, and all | |
0:37:26.230 --> 0:37:27.754 | |
the others. | |
0:37:28.588 --> 0:37:43.441 | |
If you're having this type of model, what | |
does it do if you have my first and then something? | |
0:37:43.483 --> 0:37:50.270 | |
Maybe this representation is really messed | |
up because it's an out-of-vocabulary word. | |
0:37:50.730 --> 0:37:57.793 | |
However, you still have the information | |
that two words before was first and therefore. | |
0:37:58.098 --> 0:38:06.954 | |
So you have a lot of information in order | |
to estimate how good it is. | |
0:38:06.954 --> 0:38:13.279 | |
There could be more information if you know | |
that. | |
0:38:13.593 --> 0:38:25.168 | |
So all this type of modeling we can do that | |
we couldn't do beforehand because we always | |
0:38:25.168 --> 0:38:25.957 | |
have. | |
0:38:27.027 --> 0:38:40.466 | |
Good point, so typically you would have one | |
token for out-of-vocabulary words so that you could, for | |
0:38:40.466 --> 0:38:45.857 | |
example, handle them. Or you're doing byte-pair encoding, | |
where you have a fixed vocabulary. | |
0:38:46.226 --> 0:38:49.437 | |
Oh yeah, you have to do something like that | |
that that that's true. | |
0:38:50.050 --> 0:38:55.420 | |
So yeah, out-of-vocabulary words are handled by byte-pair encoding, where | |
you don't have unknown words anymore. | |
0:38:55.735 --> 0:39:06.295 | |
But then, of course, you might be getting | |
very long previous things, and your sequence | |
0:39:06.295 --> 0:39:11.272 | |
length gets very long for unknown words. | |
0:39:17.357 --> 0:39:20.067 | |
Any more questions on the basic setup? | |
0:39:23.783 --> 0:39:36.719 | |
For this model, what we then want to continue | |
is looking a bit into how complex or how we | |
0:39:36.719 --> 0:39:39.162 | |
can make things. | |
0:39:40.580 --> 0:39:49.477 | |
Because at the beginning there was definitely | |
a major challenge, it's still not that easy, | |
0:39:49.477 --> 0:39:58.275 | |
and I mean our likeers followed the talk about | |
their environmental fingerprint and so on. | |
0:39:58.478 --> 0:40:05.700 | |
So this calculation is not really heavy, and | |
if you build systems yourselves you have to | |
0:40:05.700 --> 0:40:06.187 | |
wait. | |
0:40:06.466 --> 0:40:14.683 | |
So it's good to know a bit about how complex | |
things are in order to do a good or efficient | |
0:40:14.683 --> 0:40:15.405 | |
affair. | |
0:40:15.915 --> 0:40:24.211 | |
So one thing where most of the calculation | |
really happens is if you're doing it in a bad | |
0:40:24.211 --> 0:40:24.677 | |
way. | |
0:40:25.185 --> 0:40:33.523 | |
So generally, in all these layers we are talking | |
about neural networks and it sounds fancy. | |
0:40:33.523 --> 0:40:46.363 | |
In the end it is just math. So what you have to do in | |
order to calculate here, for example, these | |
0:40:46.363 --> 0:40:52.333 | |
activations is, to make it a bit simple: | |
0:40:52.333 --> 0:41:06.636 | |
Let's say these are the outputs: you just do a matrix | |
multiplication between your weight matrix and | |
0:41:06.636 --> 0:41:08.482 | |
your input. | |
0:41:08.969 --> 0:41:20.992 | |
So that is why computers are so powerful for | |
neural networks because they are very good | |
0:41:20.992 --> 0:41:22.358 | |
at doing exactly this. | |
0:41:22.782 --> 0:41:28.013 | |
However, for the embedding layer | |
this is really very inefficient. | |
0:41:28.208 --> 0:41:39.652 | |
So because remember we're having this one | |
hot encoding in this input, it's always like | |
0:41:39.652 --> 0:41:42.940 | |
one and everything else. | |
0:41:42.940 --> 0:41:47.018 | |
It's zero if we're doing this. | |
0:41:47.387 --> 0:41:55.552 | |
So therefore you can do at least the forward | |
pass a lot more efficient if you don't really | |
0:41:55.552 --> 0:42:01.833 | |
do this calculation, but you can select the | |
one column where the one is. | |
0:42:01.833 --> 0:42:07.216 | |
Therefore, you also see this is called your | |
word embedding. | |
0:42:08.348 --> 0:42:19.542 | |
So the weight matrix of the embedding layer | |
is just built so that in each column you have the embedding | |
0:42:19.542 --> 0:42:20.018 | |
of. | |
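Because the input is one-hot, the multiplication with the embedding weight matrix just selects one column; a sketch of the cheap lookup compared to the full matrix-vector product (sizes are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb = 10_000, 100
E = rng.normal(size=(d_emb, V))        # embedding weight matrix: one column per word

word_id = 42
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

slow = E @ one_hot                     # full product over the whole vocabulary
fast = E[:, word_id]                   # just select the column: same result, far cheaper
print(np.allclose(slow, fast))         # True
```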
0:42:20.580 --> 0:42:30.983 | |
So this is like how your initial weights look | |
like and how you can interpret or understand. | |
0:42:32.692 --> 0:42:39.509 | |
And this is already relatively important because | |
remember this is a huge dimensional thing. | |
0:42:39.509 --> 0:42:46.104 | |
So typically here we have the number of words | |
is ten thousand or so, so this is the word | |
0:42:46.104 --> 0:42:51.365 | |
embedding matrix, typically the largest | |
matrix to calculate. | |
0:42:51.451 --> 0:42:59.741 | |
Because it's the largest one there, we have | |
ten thousand entries, while for the hours we | |
0:42:59.741 --> 0:43:00.393 | |
maybe. | |
0:43:00.660 --> 0:43:03.408 | |
So therefore the addition to a little bit | |
more to make this. | |
0:43:06.206 --> 0:43:10.538 | |
Then you can go where else the calculations | |
are really expensive. | |
0:43:10.830 --> 0:43:20.389 | |
So here we then have our network, so we have | |
the word embeddings. | |
0:43:20.389 --> 0:43:29.514 | |
We have one hidden there, and then you can | |
look at how expensive it is. | |
0:43:30.270 --> 0:43:38.746 | |
We could save a lot of calculation by not really | |
calculating the embedding layer but just selecting, because the input is always one-hot. | |
0:43:40.600 --> 0:43:46.096 | |
The number of calculations you have to do | |
here is as follows. | |
0:43:46.096 --> 0:43:51.693 | |
The length of this layer is n minus one times the | |
projection size. | |
0:43:52.993 --> 0:43:56.321 | |
That is a hint size. | |
0:43:56.321 --> 0:44:10.268 | |
So the first step of calculation for this | |
matrix multiplication, is that much calculation. | |
0:44:10.730 --> 0:44:18.806 | |
Then you have to do some activation function | |
and then you have to do again the calculation. | |
0:44:19.339 --> 0:44:27.994 | |
Here we need the vocabulary size because we | |
need to calculate the probability for each | |
0:44:27.994 --> 0:44:29.088 | |
next word. | |
0:44:29.889 --> 0:44:43.155 | |
And if you look at these numbers, so if you | |
have a projection size of and a vocabulary size | |
0:44:43.155 --> 0:44:53.876 | |
of, you see: And that is why there has been | |
especially at the beginning some ideas how | |
0:44:53.876 --> 0:44:55.589 | |
we can reduce. | |
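To make the rough operation count concrete: the hidden layer costs about (n-1)·P·H multiplications and the output layer H·V, so with plausible sizes (the numbers below are assumptions, not the ones elided in the lecture) the softmax over the vocabulary dominates by far.

```python
n, P, H, V = 4, 100, 200, 10_000      # assumed n-gram order, projection, hidden, vocabulary sizes
hidden_ops = (n - 1) * P * H          # input-to-hidden matrix multiplication
output_ops = H * V                    # hidden-to-output matrix multiplication
print(hidden_ops, output_ops)         # 60_000 vs 2_000_000: the output layer dominates
```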
0:44:55.956 --> 0:45:01.942 | |
And if we really need to calculate all of | |
our probabilities, or if we can calculate only | |
0:45:01.942 --> 0:45:02.350 | |
some. | |
0:45:02.582 --> 0:45:10.871 | |
And there again the one important thing to | |
think about is for what I will use my language | |
0:45:10.871 --> 0:45:11.342 | |
model. | |
0:45:11.342 --> 0:45:19.630 | |
I can use it for generations and that's what | |
we will see next week in an achiever which | |
0:45:19.630 --> 0:45:22.456 | |
really is guiding the search. | |
0:45:23.123 --> 0:45:30.899 | |
If we just use it as a feature, we do not want | |
to use it for generation, but we only want to | |
0:45:30.899 --> 0:45:32.559 | |
know how probable a translation is. | |
0:45:32.953 --> 0:45:39.325 | |
There we might not be really interested in | |
all the probabilities, but we already know | |
0:45:39.325 --> 0:45:46.217 | |
we just want to know the probability of this | |
one word, and then it might be very inefficient | |
0:45:46.217 --> 0:45:49.403 | |
to really calculate all the probabilities. | |
0:45:51.231 --> 0:45:52.919 | |
And how can you do that so? | |
0:45:52.919 --> 0:45:56.296 | |
Initially, for example, the people look into | |
shortlists. | |
0:45:56.756 --> 0:46:02.276 | |
So this calculation at the end is really very | |
expensive. | |
0:46:02.276 --> 0:46:05.762 | |
So can we make that more efficient. | |
0:46:05.945 --> 0:46:17.375 | |
And most words occur very rarely, and maybe | |
we don't need anger, and so there we may want | |
0:46:17.375 --> 0:46:18.645 | |
to focus on the frequent words. | |
0:46:19.019 --> 0:46:29.437 | |
And so they use the smaller vocabulary, which | |
is maybe. | |
0:46:29.437 --> 0:46:34.646 | |
This layer is used from to. | |
0:46:34.646 --> 0:46:37.623 | |
Then you merge. | |
0:46:37.937 --> 0:46:45.162 | |
So you're checking if the word is in the shortlist, | |
so in the two thousand most frequent words. | |
0:46:45.825 --> 0:46:58.299 | |
If so, you take the probability from this shortlist model with some normalization here, | |
and otherwise you take a backoff probability | |
0:46:58.299 --> 0:46:59.655 | |
from the n-gram model. | |
0:47:00.020 --> 0:47:04.933 | |
It will not be as good, but the idea is okay. | |
0:47:04.933 --> 0:47:14.013 | |
Then we don't have to calculate all these | |
probabilities here at the end, but we only | |
0:47:14.013 --> 0:47:16.042 | |
have to calculate. | |
0:47:19.599 --> 0:47:32.097 | |
This comes with some type of cost, because it means we | |
don't model the probability of the infrequent | |
0:47:32.097 --> 0:47:39.399 | |
words, and maybe it's even very important to | |
model them. | |
0:47:39.299 --> 0:47:46.671 | |
And one idea is to do what is referred to as | |
the structured output layer. | |
0:47:46.606 --> 0:47:49.571 | |
Network language models you see some years | |
ago. | |
0:47:49.571 --> 0:47:53.154 | |
People were very creative and giving names | |
to new models. | |
0:47:53.813 --> 0:48:00.341 | |
And there the idea is that we model the output | |
vocabulary as a clustered tree. | |
0:48:00.680 --> 0:48:06.919 | |
So you don't need to model the full vocabulary | |
directly, but you are putting words into a | |
0:48:06.919 --> 0:48:08.479 | |
sequence of clusters. | |
0:48:08.969 --> 0:48:15.019 | |
So maybe a very infrequent word is first | |
in cluster three, and then within cluster three | |
0:48:15.019 --> 0:48:21.211 | |
you have subclusters again, say subcluster | |
seven, and so on. | |
0:48:21.541 --> 0:48:40.134 | |
And this is the path, so that is what was | |
the man in the past. | |
0:48:40.340 --> 0:48:52.080 | |
And then you can calculate the probability | |
of the word again just by the product of the | |
0:48:52.080 --> 0:48:55.548 | |
class probabilities along the path. | |
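A minimal two-level sketch of this idea: each word belongs to a class, and p(word) is the product of p(class | history) and p(word | class, history), so only the class distribution and one within-class distribution have to be computed. The class assignment (by index) and the sizes are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_hid, n_classes, words_per_class = 200, 100, 100    # 100 * 100 = 10_000 words (assumed)

Wc = rng.normal(0, 0.1, (n_classes, d_hid))                   # class prediction
Ww = rng.normal(0, 0.1, (n_classes, words_per_class, d_hid))  # per-class word prediction

def word_prob(h, word_id):
    c, j = divmod(word_id, words_per_class)       # which class / position inside the class
    p_class = softmax(Wc @ h)[c]                  # p(class | history)
    p_word  = softmax(Ww[c] @ h)[j]               # p(word | class, history)
    return p_class * p_word                       # product along the path

print(word_prob(rng.normal(size=d_hid), word_id=4242))
```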
0:48:57.617 --> 0:49:07.789 | |
Here it may be more clear: you have this | |
architecture, so this is all the same. | |
0:49:07.789 --> 0:49:13.773 | |
But then you first predict here which main | |
class. | |
0:49:14.154 --> 0:49:24.226 | |
Then you go to the appropriate subclass, then | |
you calculate the probability of the subclass | |
0:49:24.226 --> 0:49:26.415 | |
and maybe the sub-subclass. | |
0:49:27.687 --> 0:49:35.419 | |
Anybody have an idea why this is more efficient | |
because at first it looks like a lot more computation. | |
0:49:42.242 --> 0:49:51.788 | |
You have to do less calculations, so maybe | |
if you do it here you have to calculate the | |
0:49:51.788 --> 0:49:59.468 | |
element there, but you don't have to do all | |
the one hundred thousand. | |
0:49:59.980 --> 0:50:06.115 | |
You only calculate the probabilities in the subclasses that | |
you're going through and not for all of them. | |
0:50:06.386 --> 0:50:18.067 | |
Therefore, it's more efficient if you don't | |
need all output probabilities, because you only have | |
0:50:18.067 --> 0:50:21.253 | |
to calculate the classes along the path. | |
0:50:21.501 --> 0:50:28.936 | |
So it's only more efficient in scenarios | |
where you really need to use a language model | |
0:50:28.936 --> 0:50:30.034 | |
to evaluate. | |
0:50:35.275 --> 0:50:52.456 | |
The way this works is that you first train | |
your neural language model on the shortlist. | |
0:50:52.872 --> 0:51:03.547 | |
But on the input layer you have your full | |
vocabulary because at the input we saw that | |
0:51:03.547 --> 0:51:06.650 | |
this is not complicated. | |
0:51:06.906 --> 0:51:26.638 | |
And then you can cluster down all your words | |
here into classes and use those as your classes. | |
0:51:29.249 --> 0:51:34.148 | |
That is one idea of doing it. | |
0:51:34.148 --> 0:51:44.928 | |
There is also a second idea of doing it, and | |
again we don't need. | |
0:51:45.025 --> 0:51:53.401 | |
So sometimes it doesn't really need to be | |
a normalized probability to evaluate with. | |
0:51:53.401 --> 0:51:56.557 | |
It's only important that. | |
0:51:58.298 --> 0:52:04.908 | |
Here, what people have done is called | |
self-normalization. | |
0:52:04.908 --> 0:52:11.562 | |
We have seen that the probability is in this | |
softmax always e to the power of the input divided | |
0:52:11.562 --> 0:52:18.216 | |
by our normalization, and the normalization | |
is a sum over the vocabulary of e to the power | |
0:52:18.216 --> 0:52:19.274 | |
of the values. | |
0:52:19.759 --> 0:52:25.194 | |
So this is how we calculate the softmax. | |
0:52:25.825 --> 0:52:41.179 | |
In self-normalization the idea is: if the log of this | |
normalization were zero, then we don't need to calculate | |
0:52:41.179 --> 0:52:42.214 | |
that. | |
0:52:42.102 --> 0:52:54.272 | |
Will be zero, and then you don't even have | |
to calculate the normalization because it's one. | |
0:52:54.514 --> 0:53:08.653 | |
So how can we achieve that and then the nice | |
thing in your networks? | |
0:53:09.009 --> 0:53:23.928 | |
And now we're just adding a second note with | |
some either permitted here. | |
0:53:24.084 --> 0:53:29.551 | |
And the second lost just tells us he'll be | |
strained away. | |
0:53:29.551 --> 0:53:31.625 | |
The locks at is zero. | |
0:53:32.352 --> 0:53:38.614 | |
So then if it's nearly zero at the end we | |
don't need to calculate this and it's also | |
0:53:38.614 --> 0:53:39.793 | |
very efficient. | |
0:53:40.540 --> 0:53:49.498 | |
One important thing: this, of course, only helps | |
at inference. | |
0:53:49.498 --> 0:54:04.700 | |
During tests we don't need to calculate that | |
because of this. You can do a bit of a hyperparameter | |
0:54:04.700 --> 0:54:14.851 | |
tuning here where you do the weighting, so how well | |
should it be estimating the probabilities and | |
0:54:14.851 --> 0:54:16.790 | |
how much effort? | |
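A sketch of the self-normalisation objective as described: next to the cross-entropy we add a second term that pushes the log of the softmax normaliser towards zero, weighted by a hyperparameter alpha (the weighting just mentioned); at inference one can then skip the normalisation. The exact form of the penalty and the alpha value are assumptions.

```python
import numpy as np

def self_normalized_loss(logits, target_index, alpha=0.1):
    # log of the normaliser, computed in a numerically stable way
    log_z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    cross_entropy = -(logits[target_index] - log_z)       # usual loss: -log softmax(target)
    return cross_entropy + alpha * log_z ** 2              # extra term: push log Z towards 0

print(self_normalized_loss(np.array([2.0, 1.0, 0.1]), target_index=0))
```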
0:54:18.318 --> 0:54:28.577 | |
The only disadvantage is no speed up during | |
training. | |
0:54:28.577 --> 0:54:43.843 | |
There are other ways of doing that as well. | |
0:54:44.344 --> 0:54:48.540 | |
Then we are coming very, very briefly like | |
just one idea. | |
0:54:48.828 --> 0:54:53.058 | |
That there is more things on different types | |
of language models. | |
0:54:53.058 --> 0:54:58.002 | |
We are having a very short view on restricted | |
Boltzmann machine based language models. | |
0:54:58.298 --> 0:55:08.931 | |
Talk about recurrent neural networks for language | |
models, because they have the advantage that | |
0:55:08.931 --> 0:55:17.391 | |
we can even further improve by not having a | |
continuous representation on. | |
0:55:18.238 --> 0:55:23.845 | |
So there's different types of neural networks. | |
0:55:23.845 --> 0:55:30.169 | |
These are the restricted Boltzmann machines, and the interesting thing is: | |
0:55:30.330 --> 0:55:39.291 | |
They have these: And they define like an energy | |
function on the network, which can be in restricted | |
0:55:39.291 --> 0:55:44.372 | |
Boltzmann machines be efficiently calculated. | |
In restricted Boltzmann machines | |
0:55:44.372 --> 0:55:51.147 | |
you only have connections between the input | |
and the hidden layer, but you don't have connections | |
0:55:51.147 --> 0:55:53.123 | |
within the input or within the hidden layer. | |
0:55:53.393 --> 0:56:00.194 | |
So you see here you don't have an input output, | |
you just have an input, and you calculate. | |
0:56:00.460 --> 0:56:15.612 | |
Which of course nicely fits with the idea | |
we're having, so you can then use this for | |
0:56:15.612 --> 0:56:19.177 | |
an n-gram language model. | |
0:56:19.259 --> 0:56:25.189 | |
Retaining the flexibility of the input by | |
this type of neural networks. | |
0:56:26.406 --> 0:56:30.589 | |
And the advantage of this type of model was | |
there's. | |
0:56:30.550 --> 0:56:37.520 | |
Very, very fast to integrate it, so that one | |
was the first one which was used during the | |
0:56:37.520 --> 0:56:38.616 | |
decoding. | |
0:56:38.938 --> 0:56:45.454 | |
The engram language models were that they | |
were very good and gave performance. | |
0:56:45.454 --> 0:56:50.072 | |
However, the calculation with all these | |
tricks still takes time. | |
0:56:50.230 --> 0:56:58.214 | |
We have talked about n-best lists, so they | |
generated an n-best list of the most probable | |
0:56:58.214 --> 0:57:05.836 | |
outputs, and then they took this n-best list | |
and scored each entry with a neural network. | |
0:57:06.146 --> 0:57:09.306 | |
A language model, and then only change the | |
order again. | |
0:57:09.306 --> 0:57:10.887 | |
Select based on that which. | |
0:57:11.231 --> 0:57:17.187 | |
The n-best list is maybe only like a hundred | |
entries. | |
0:57:17.187 --> 0:57:21.786 | |
When decoding you look at several thousand. | |
0:57:26.186 --> 0:57:35.196 | |
Let's look at the context so we have now seen | |
neural language models. | |
0:57:35.196 --> 0:57:43.676 | |
There is the big advantage we can use this | |
word similarity and. | |
0:57:44.084 --> 0:57:52.266 | |
Remember, for n-gram language models it is not always | |
n minus one words, because sometimes you have | |
0:57:52.266 --> 0:57:59.909 | |
to back off or interpolate to lower n-grams, | |
and then you don't use all the previous words. | |
0:58:00.760 --> 0:58:04.742 | |
And however in neural models we always have | |
all of this importance. | |
0:58:04.742 --> 0:58:05.504 | |
Can some of. | |
0:58:07.147 --> 0:58:20.288 | |
The disadvantage is that you are still limited | |
in your context, and if you remember the sentence | |
0:58:20.288 --> 0:58:22.998 | |
from last lecture,. | |
0:58:22.882 --> 0:58:28.328 | |
Sometimes you need more context and there | |
is unlimited context that you might need and | |
0:58:28.328 --> 0:58:34.086 | |
you can always create sentences where you may | |
need this full context in order to make a good | |
0:58:34.086 --> 0:58:34.837 | |
estimation. | |
0:58:35.315 --> 0:58:44.956 | |
We can also do it differently; in order to understand | |
that, it makes sense to view language modeling as sequence labeling. | |
0:58:45.445 --> 0:58:59.510 | |
So sequence labeling tasks are a very common | |
type of task in language processing where you | |
0:58:59.510 --> 0:59:03.461 | |
have the input sequence. | |
0:59:03.323 --> 0:59:05.976 | |
So you have one output for each input. | |
0:59:05.976 --> 0:59:12.371 | |
Machine translation is not a sequence labeling | |
task because the number of inputs and the number | |
0:59:12.371 --> 0:59:14.072 | |
of outputs is different. | |
0:59:14.072 --> 0:59:20.598 | |
So you put in a German sentence which has five | |
words and the output can be longer. Here, for example, | |
0:59:20.598 --> 0:59:24.078 | |
you always have the same number of inputs and the same | |
number of outputs. | |
0:59:24.944 --> 0:59:39.779 | |
And you can model language modeling as that, | |
and you just say the label for each word is | |
0:59:39.779 --> 0:59:43.151 | |
always the next word. | |
0:59:45.705 --> 0:59:50.312 | |
Sequence labeling is the more general task you can think of | |
it as. | |
0:59:50.312 --> 0:59:56.194 | |
For example, part-of-speech tagging or named entity | |
recognition. | |
0:59:58.938 --> 1:00:08.476 | |
And if you look at now, this output token | |
and generally sequenced labeling can depend | |
1:00:08.476 --> 1:00:26.322 | |
on: The input tokens are the same so we can | |
easily model it and they only depend on the | |
1:00:26.322 --> 1:00:29.064 | |
input tokens. | |
1:00:31.011 --> 1:00:42.306 | |
But we can always look at one specific type | |
of sequence labeling, unidirectional sequence | |
1:00:42.306 --> 1:00:44.189 | |
labeling type. | |
1:00:44.584 --> 1:01:00.855 | |
The probability of the next word only depends | |
on the previous words that we are having here. | |
1:01:01.321 --> 1:01:05.998 | |
That's also not completely true in language. | |
1:01:05.998 --> 1:01:14.418 | |
Well, the right context might also be helpful; | |
bidirectional models use it. | |
1:01:14.654 --> 1:01:23.039 | |
Here we will always model the probability of the | |
word given its history. | |
1:01:23.623 --> 1:01:30.562 | |
And the current approximation in sequence | |
labeling is that we have this windowing approach. | |
1:01:30.951 --> 1:01:43.016 | |
So in order to predict this type of word we | |
always look at the previous three words. | |
1:01:43.016 --> 1:01:48.410 | |
This is this type of windowing model. | |
1:01:49.389 --> 1:01:54.780 | |
If you're into neural networks you recognize | |
this type of structure. | |
1:01:54.780 --> 1:01:57.515 | |
Also, the typical neural networks. | |
1:01:58.938 --> 1:02:11.050 | |
Yes, yes, so like n-gram models you can, at | |
least in some way, prepare for that type of | |
1:02:11.050 --> 1:02:12.289 | |
context. | |
1:02:14.334 --> 1:02:23.321 | |
There are also other types of neural network structures | |
which we can use for sequence labeling and which | |
1:02:23.321 --> 1:02:30.710 | |
might help us where we don't have this type | |
of fixed size representation. | |
1:02:32.812 --> 1:02:34.678 | |
That we can do so. | |
1:02:34.678 --> 1:02:39.391 | |
The idea in recurrent neural networks is that | |
1:02:39.391 --> 1:02:43.221 | |
we are saving the complete history in one hidden state. | |
1:02:43.623 --> 1:02:56.946 | |
So again we have to do this fixed size representation | |
because the neural networks always need fixed-size input. | |
1:02:57.157 --> 1:03:09.028 | |
And then the network should look like that, | |
so we start with an initial value for our storage. | |
1:03:09.028 --> 1:03:15.900 | |
We are giving our first input and calculating | |
the new. | |
1:03:16.196 --> 1:03:35.895 | |
So it is again a neural network with two types of | |
inputs: Then you can apply it to the next type | |
1:03:35.895 --> 1:03:41.581 | |
of input and you're again having this. | |
1:03:41.581 --> 1:03:46.391 | |
You're taking this hidden state. | |
1:03:47.367 --> 1:03:53.306 | |
Nice thing is now that you can do now step | |
by step by step, so all the way over. | |
1:03:55.495 --> 1:04:06.131 | |
The nice thing we are having here now is that | |
now we are having context information from | |
1:04:06.131 --> 1:04:07.206 | |
all the. | |
1:04:07.607 --> 1:04:14.181 | |
So if you're looking like based on which words | |
do you, you calculate the probability of varying. | |
1:04:14.554 --> 1:04:20.090 | |
It depends on this part. | |
1:04:20.090 --> 1:04:33.154 | |
It depends on and this hidden state was influenced | |
by two. | |
1:04:33.473 --> 1:04:38.259 | |
So now we're having something new. | |
1:04:38.259 --> 1:04:46.463 | |
We can model like the word probability not | |
only on a fixed window. | |
1:04:46.906 --> 1:04:53.565 | |
Because the hidden states we are having here | |
in our RNN are influenced by all the previous words. | |
1:04:56.296 --> 1:05:02.578 | |
So how is there to be Singapore? | |
1:05:02.578 --> 1:05:16.286 | |
But then we have the initial idea about this | |
p of the word given the full history. | |
1:05:16.736 --> 1:05:25.300 | |
So we do not need to do any clustering here, | |
and you also see how things are put together | |
1:05:25.300 --> 1:05:26.284 | |
in order to do that. | |
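A sketch of the recurrent idea: one hidden vector is carried along and updated word by word, so the prediction at each step depends on the complete history rather than on a fixed window. The sizes and the tanh/softmax choices are the usual ones but still assumptions here, not the lecture's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 10_000, 100, 200

E   = rng.normal(0, 0.1, (d_emb, V))       # word embeddings
W_x = rng.normal(0, 0.1, (d_hid, d_emb))   # input-to-hidden
W_h = rng.normal(0, 0.1, (d_hid, d_hid))   # hidden-to-hidden (the recurrence)
W_o = rng.normal(0, 0.1, (V, d_hid))       # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm(word_ids):
    h = np.zeros(d_hid)                            # initial hidden state
    predictions = []
    for i in word_ids:
        h = np.tanh(W_x @ E[:, i] + W_h @ h)       # new state from input and old state
        predictions.append(softmax(W_o @ h))       # p(next word | all words so far)
    return predictions

print(len(rnn_lm([12, 7, 431])))
```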
1:05:29.489 --> 1:05:43.449 | |
The green box this night since we are starting | |
from the left to the right. | |
1:05:44.524 --> 1:05:51.483 | |
Voices: Yes, that's right, so there are clusters, | |
and here is also sometimes clustering happens. | |
1:05:51.871 --> 1:05:58.687 | |
The small difference does matter again, so | |
if you have now a lot of different histories, | |
1:05:58.687 --> 1:06:01.674 | |
the similarity which you have in here. | |
1:06:01.674 --> 1:06:08.260 | |
If two of the histories are very similar, | |
these representations will be the same, and | |
1:06:08.260 --> 1:06:10.787 | |
then you're treating them again. | |
1:06:11.071 --> 1:06:15.789 | |
Because in order to do the final prediction | |
you only base it on the green box. | |
1:06:16.156 --> 1:06:28.541 | |
So you are now still learning some type of | |
clustering in there, but you are learning it | |
1:06:28.541 --> 1:06:30.230 | |
implicitly. | |
1:06:30.570 --> 1:06:38.200 | |
The only restriction you're giving is you | |
have to stall everything that is important | |
1:06:38.200 --> 1:06:39.008 | |
in this. | |
1:06:39.359 --> 1:06:54.961 | |
So it's a different type of limitation, so | |
you calculate the probability based on the | |
1:06:54.961 --> 1:06:57.138 | |
last words. | |
1:06:57.437 --> 1:07:04.430 | |
And that is how you still need to somehow | |
cluster things together in order to do efficiently. | |
1:07:04.430 --> 1:07:09.563 | |
Of course, you need to do some type of clustering | |
because otherwise. | |
1:07:09.970 --> 1:07:18.865 | |
But this is where things get merged together | |
in this type of hidden representation. | |
1:07:18.865 --> 1:07:27.973 | |
So here the probability of the next word first of all
only depends on this hidden representation.
1:07:28.288 --> 1:07:33.104 | |
It still depends on the previous words, but there is now a different
bottleneck in order to make a good estimation.
1:07:34.474 --> 1:07:41.231 | |
So the idea is that we can store all our history
in one vector.
1:07:41.581 --> 1:07:44.812 | |
Which is, on the one hand, what makes it strong.
1:07:44.812 --> 1:07:51.275 | |
Next we come to problems that of course at | |
some point it might be difficult if you have | |
1:07:51.275 --> 1:07:57.811 | |
very long sequences and you always write all | |
the information you have on this one block. | |
1:07:58.398 --> 1:08:02.233 | |
Then maybe things get overwritten or you cannot | |
store everything in there. | |
1:08:02.662 --> 1:08:04.514 | |
So.
1:08:04.184 --> 1:08:09.569 | |
Therefore, for short things like single
sentences that works well, but especially if
1:08:09.569 --> 1:08:15.197 | |
you think of other tasks like summarization
or document-based MT, where you need
1:08:15.197 --> 1:08:20.582 | |
to consider the full document, these things
get a bit more complicated, and we
1:08:20.582 --> 1:08:23.063 | |
will learn another type of architecture. | |
1:08:24.464 --> 1:08:30.462 | |
In order to understand these networks, it
is good to always have both views in mind.
1:08:30.710 --> 1:08:33.998 | |
So this is the unrolled view. | |
1:08:33.998 --> 1:08:43.753 | |
Here, over time, or in language
over the words, you're unrolling the network.
1:08:44.024 --> 1:08:52.096 | |
Here is the other view, where the network
is connected to itself, and that is recurrent.
1:08:56.176 --> 1:09:04.982 | |
There is one challenge with these networks, and
that is the training.
1:09:04.982 --> 1:09:11.994 | |
How can we train them in the first place?
1:09:12.272 --> 1:09:19.397 | |
So at first we don't really know how to train them,
but if you unroll them like this, it is a feed
1:09:19.397 --> 1:09:20.142 | |
forward network.
1:09:20.540 --> 1:09:38.063 | |
It is exactly the same, so you can measure your
errors here and backpropagate your errors.
1:09:38.378 --> 1:09:45.646 | |
If you unroll something, it's a feed-forward
network and you can train it the same way.
1:09:46.106 --> 1:09:57.606 | |
The only important thing is again that, of course,
the parameters are shared for the different inputs.
1:09:57.837 --> 1:10:05.145 | |
But since the parameters are shared, it's somehow
the same, and you can train it.
1:10:05.145 --> 1:10:08.800 | |
The training algorithm is very similar. | |
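A rough, self-contained sketch of this "unroll it and train it like a feed-forward network" idea (backpropagation through time). The toy sentence, sizes and weight names are invented for illustration; the embedding gradient is left out for brevity.

```python
import numpy as np

np.random.seed(0)
V, E, H = 20, 8, 16
W_emb = np.random.randn(V, E) * 0.1
W_xh  = np.random.randn(E, H) * 0.1
W_hh  = np.random.randn(H, H) * 0.1
W_hy  = np.random.randn(H, V) * 0.1
words = [3, 7, 1, 9, 2]                      # one toy training sentence as word ids

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Forward: unroll over the sentence, remember the intermediate values.
hs, ps, loss = {-1: np.zeros(H)}, {}, 0.0
for t in range(len(words) - 1):
    x = W_emb[words[t]]
    hs[t] = np.tanh(x @ W_xh + hs[t - 1] @ W_hh)
    ps[t] = softmax(hs[t] @ W_hy)
    loss -= np.log(ps[t][words[t + 1]])      # cross-entropy against the next word

# Backward: ordinary backprop on the unrolled graph; the shared weights simply
# accumulate gradient contributions from every time step.
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(H)
for t in reversed(range(len(words) - 1)):
    dlogits = ps[t].copy()
    dlogits[words[t + 1]] -= 1.0             # gradient of softmax + cross-entropy
    dW_hy += np.outer(hs[t], dlogits)
    dh = W_hy @ dlogits + dh_next
    da = (1.0 - hs[t] ** 2) * dh             # through the tanh
    dW_xh += np.outer(W_emb[words[t]], da)
    dW_hh += np.outer(hs[t - 1], da)
    dh_next = W_hh @ da                      # error flowing back to the previous step

for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
    W -= 0.1 * dW                            # one SGD update on the shared weights
```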
1:10:10.310 --> 1:10:29.568 | |
One thing which makes things difficult is
what is referred to as the vanishing gradient.
1:10:29.809 --> 1:10:32.799 | |
That is a very important issue and a strong motivation
for the architectures used nowadays.
1:10:33.593 --> 1:10:44.604 | |
The influence here gets smaller and smaller,
and the models are not really able to model that.
1:10:44.804 --> 1:10:51.939 | |
Because the gradient gets smaller and smaller,
and so the error here propagated to this one
1:10:51.939 --> 1:10:58.919 | |
that contributes to the error is very small,
and therefore you don't make any changes there
1:10:58.919 --> 1:10:59.617 | |
anymore. | |
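A tiny, made-up numerical illustration of this vanishing-gradient effect; the weight scale and the constant standing in for the tanh derivative are arbitrary assumptions.

```python
import numpy as np

# The error that reaches a step k positions back is (roughly) the error at the
# last step multiplied by k Jacobians of the recurrent transition. If those
# factors are "small", the product dies out exponentially.
np.random.seed(0)
H = 16
W_hh = np.random.randn(H, H) * 0.1           # recurrent weights (illustrative scale)
grad = np.ones(H)                             # error signal at the last time step

for k in range(1, 21):
    grad = 0.5 * (W_hh.T @ grad)              # 0.5 stands in for the tanh derivative (<= 1)
    if k % 5 == 0:
        print(f"{k:2d} steps back: |error| = {np.linalg.norm(grad):.2e}")
# The norm shrinks exponentially, so early inputs receive almost no update.
```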
1:11:00.020 --> 1:11:06.703 | |
And yeah, that's why standard RNNs are
difficult to train.
1:11:07.247 --> 1:11:11.462 | |
So when people are talking about RNNs
nowadays,
1:11:11.791 --> 1:11:23.333 | |
what we are typically meaning are LSTMs, or
long short-term memories.
1:11:23.333 --> 1:11:30.968 | |
You see they are by now quite old already. | |
1:11:31.171 --> 1:11:39.019 | |
So here the motivation was the language modeling
task.
1:11:39.019 --> 1:11:44.784 | |
It's about storing information for longer.
1:11:44.684 --> 1:11:51.556 | |
Because if you only look at the last words,
it's often no longer clear whether this is a question
1:11:51.556 --> 1:11:52.548 | |
or a normal sentence.
1:11:53.013 --> 1:12:05.318 | |
So there you have these mechanisms with gates
in order to store things for a longer time
1:12:05.318 --> 1:12:08.563 | |
in your hidden state.
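A minimal sketch of what one such gated (LSTM) step looks like; the separate, un-fused weight matrices and their names are illustrative assumptions, not the exact formulation from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, X = 16, 8                                 # hidden and input sizes (illustrative)
Wf, Wi, Wo, Wc = (np.random.randn(H, X + H) * 0.1 for _ in range(4))
bf, bi, bo, bc = (np.zeros(H) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z + bf)        # forget gate: what to erase from the cell
    i = sigmoid(Wi @ z + bi)        # input gate: what new content to write
    o = sigmoid(Wo @ z + bo)        # output gate: what to expose as hidden state
    c_tilde = np.tanh(Wc @ z + bc)  # candidate content
    c = f * c_prev + i * c_tilde    # cell state can carry information for a long time
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(np.random.randn(X), np.zeros(H), np.zeros(H))
```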
1:12:10.730 --> 1:12:20.162 | |
They are still used in quite
a lot of works.
1:12:21.541 --> 1:12:29.349 | |
Especially for machine translation, the
standard now is to use transformer-based models, which
1:12:29.349 --> 1:12:30.477 | |
we'll learn about.
1:12:30.690 --> 1:12:38.962 | |
But, for example, regarding architectures, we have later
one lecture about efficiency.
1:12:38.962 --> 1:12:42.830 | |
So how can we build very efficient models?
1:12:42.882 --> 1:12:53.074 | |
And there, in the decoder and in parts of the networks,
they are still used.
1:12:53.473 --> 1:12:57.518 | |
So it's not that, yeah, RNNs are of no
importance anymore.
1:12:59.239 --> 1:13:08.956 | |
In order to make them strong, there are some
more things which are helpful and should be mentioned:
1:13:09.309 --> 1:13:19.683 | |
So one thing is: there is a nice trick to make
these neural networks stronger and better.
1:13:19.739 --> 1:13:21.523 | |
So of course it doesn't work always. | |
1:13:21.523 --> 1:13:23.451 | |
They have to have enough training data. | |
1:13:23.763 --> 1:13:28.959 | |
But in general the easiest way of
making your models bigger and stronger is just
1:13:28.959 --> 1:13:30.590 | |
to increase your parameters.
1:13:30.630 --> 1:13:43.236 | |
And you've seen that with large language
models, where they are always bragging about how many parameters they have.
1:13:43.903 --> 1:13:56.463 | |
This is one way, so the question is how do | |
you get more parameters? | |
1:13:56.463 --> 1:14:01.265 | |
There's ways of doing it. | |
1:14:01.521 --> 1:14:10.029 | |
And the other thing is to make your networks
deeper, so to have more layers in between.
1:14:11.471 --> 1:14:13.827 | |
And thereby you can also get more complex models.
1:14:14.614 --> 1:14:23.340 | |
There's one problem with this, and it's
very similar to what we just saw with RNNs.
1:14:23.603 --> 1:14:34.253 | |
We have this problem of gradient flow: if
it flows through so many layers, the gradient gets very
1:14:34.253 --> 1:14:35.477 | |
small.
1:14:35.795 --> 1:14:42.704 | |
Exactly the same thing happens in deep
LSTMs.
1:14:42.704 --> 1:14:52.293 | |
If you take here the gradient that tells you what
is right or wrong:
1:14:52.612 --> 1:14:56.439 | |
With three layers it's no problem, but if
you're going to ten, twenty or a hundred layers,
1:14:57.797 --> 1:14:59.698 | |
that typically becomes a problem.
1:15:00.060 --> 1:15:07.000 | |
What people are doing is using what are called residual
connections.
1:15:07.000 --> 1:15:15.855 | |
That's a very helpful idea, which is maybe | |
very surprising that it works. | |
1:15:15.956 --> 1:15:20.309 | |
And so the idea is that these layers
1:15:20.320 --> 1:15:29.982 | |
in between should no longer calculate what
is a completely new representation, but they're more
1:15:29.982 --> 1:15:31.378 | |
calculating what to change.
1:15:31.731 --> 1:15:37.588 | |
Therefore, in the end the output
of a layer is always added to its input.
1:15:38.318 --> 1:15:48.824 | |
The nice thing is that later, if you are doing backpropagation,
the error flows back very fast through these connections.
1:15:49.209 --> 1:16:02.540 | |
Nowadays every very deep architecture, not only
this one, always has these residual or highway
1:16:02.540 --> 1:16:04.224 | |
connections.
1:16:04.704 --> 1:16:06.616 | |
This has two advantages.
1:16:06.616 --> 1:16:15.409 | |
On the one hand, these layers don't need to
learn a completely new representation, they only need to learn
1:16:15.409 --> 1:16:18.754 | |
what to change in the representation.
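A minimal sketch of a residual connection, with a made-up one-layer ReLU network standing in for whatever sub-layer is used.

```python
import numpy as np

# The layer's output is added to its input, so the layer only has to learn the
# *change* to the representation, and the gradient can also flow back through
# the identity path unchanged.
def sub_layer(x, W, b):
    return np.maximum(0.0, W @ x + b)        # illustrative sub-layer (ReLU MLP)

def residual_block(x, W, b):
    return x + sub_layer(x, W, b)            # output = input + F(input)

H = 64
x = np.random.randn(H)
W, b = np.random.randn(H, H) * 0.05, np.zeros(H)
y = residual_block(x, W, b)
```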
1:16:22.082 --> 1:16:24.172 | |
Good. | |
1:16:23.843 --> 1:16:31.768 | |
So much for the neural networks; now to the last
thing for today.
1:16:31.671 --> 1:16:33.750 | |
Language models were, yeah,
1:16:33.750 --> 1:16:41.976 | |
used in the models themselves, and now we are
seeing them again, but one thing which at the
1:16:41.976 --> 1:16:53.558 | |
beginning was very essential was this: people
really trained language
1:16:53.558 --> 1:16:59.999 | |
models only to get this type of embedding. | |
1:16:59.999 --> 1:17:04.193 | |
Therefore, we want to look at these embeddings.
1:17:09.229 --> 1:17:15.678 | |
So now some last words on the word embeddings.
1:17:15.678 --> 1:17:27.204 | |
The interesting thing is that word embeddings | |
can be used for very different tasks. | |
1:17:27.347 --> 1:17:31.329 | |
The nice thing is you can train them on just
large amounts of data.
1:17:31.931 --> 1:17:41.569 | |
And then if you have these word embeddings,
we have seen that they already reduce the parameters.
1:17:41.982 --> 1:17:52.217 | |
So then you can train a smaller model to do
any other task, and therefore you are more efficient.
1:17:52.532 --> 1:17:55.218 | |
One thing about these initial word embeddings is important.
1:17:55.218 --> 1:18:00.529 | |
They really depend only on the word itself, | |
so if you look at the two meanings of can, | |
1:18:00.529 --> 1:18:06.328 | |
the can of beans or I can do that, they will
have the same embedding, so somehow the embedding
1:18:06.328 --> 1:18:08.709 | |
has to keep the ambiguity inside it.
1:18:09.189 --> 1:18:12.486 | |
That ambiguity cannot be resolved at this level.
1:18:12.486 --> 1:18:24.753 | |
It can only be resolved at the higher levels,
which look at the context; the word embedding layer
1:18:24.753 --> 1:18:27.919 | |
really depends only on the word itself.
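A tiny sketch of this point: a static word embedding is a pure table lookup, so the surrounding context cannot change it. The vocabulary and matrix below are random stand-ins, not a trained model.

```python
import numpy as np

vocab = {"i": 0, "can": 1, "of": 2, "beans": 3, "do": 4, "that": 5, "a": 6}
W_emb = np.random.randn(len(vocab), 50)

sentence_1 = ["a", "can", "of", "beans"]
sentence_2 = ["i", "can", "do", "that"]
vec_1 = W_emb[vocab["can"]]          # embedding of "can" in sentence 1
vec_2 = W_emb[vocab["can"]]          # embedding of "can" in sentence 2
assert np.array_equal(vec_1, vec_2)  # identical: the ambiguity stays inside the vector
```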
1:18:29.489 --> 1:18:33.757 | |
However, even these embeddings have quite interesting properties.
1:18:34.034 --> 1:18:39.558 | |
So people like to visualize them.
1:18:39.558 --> 1:18:47.208 | |
They're always difficult to visualize, because if you look
at these vectors:
1:18:47.767 --> 1:18:52.879 | |
drawing a five-hundred-dimensional
vector is still a bit challenging.
1:18:53.113 --> 1:19:12.472 | |
So you cannot directly do that, so people
have to look at some type of projection.
1:19:13.073 --> 1:19:17.209 | |
And of course then, yes, some information is
getting lost by this down-projection.
1:19:18.238 --> 1:19:24.802 | |
And you see, for example, this is the most | |
famous and common example, so what you can | |
1:19:24.802 --> 1:19:31.289 | |
look at is the difference between
the male and the female word in English.
1:19:31.289 --> 1:19:37.854 | |
This is here the embedding of king, and
this is the embedding of queen, and this is the difference.
1:19:38.058 --> 1:19:40.394 | |
You can do that for very different words.
1:19:40.780 --> 1:19:45.407 | |
And that is where the math comes in; that
is what people then look into.
1:19:45.725 --> 1:19:50.995 | |
So what you can now, for example, do is you | |
can calculate the difference between man and | |
1:19:50.995 --> 1:19:51.410 | |
woman? | |
1:19:52.232 --> 1:19:55.511 | |
Then you can take the embedding of a word.
1:19:55.511 --> 1:20:02.806 | |
You can add on it the difference between man | |
and woman, and then you can notice what are | |
1:20:02.806 --> 1:20:04.364 | |
the similar words. | |
1:20:04.364 --> 1:20:08.954 | |
So you won't, of course, directly hit the | |
correct word. | |
1:20:08.954 --> 1:20:10.512 | |
It's a continuous space.
1:20:10.790 --> 1:20:23.127 | |
But you can look at what the nearest neighbors
of this resulting vector are, and often these words are near
1:20:23.127 --> 1:20:24.056 | |
there. | |
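A small sketch of this nearest-neighbour analogy trick. The embedding matrix and vocabulary below are random stand-ins, so the printed output here is meaningless; with real trained embeddings the expected word would usually appear among the top hits.

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, "walking": 4}
emb = np.random.randn(len(vocab), 100)       # would come from language-model training

def nearest(vec, k=3):
    # cosine similarity of every vocabulary word against the query vector
    sims = emb @ vec / (np.linalg.norm(emb, axis=1) * np.linalg.norm(vec) + 1e-9)
    return [w for w, _ in sorted(vocab.items(), key=lambda kv: -sims[kv[1]])][:k]

query = emb[vocab["king"]] - emb[vocab["man"]] + emb[vocab["woman"]]
print(nearest(query))   # with trained embeddings, "queen" is usually among the neighbours
```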
1:20:24.224 --> 1:20:33.913 | |
So it somehow learns that the difference between | |
these words is always the same. | |
1:20:34.374 --> 1:20:37.746 | |
You can do that for different things. | |
1:20:37.746 --> 1:20:41.296 | |
Here you also see that it's not perfect.
1:20:41.296 --> 1:20:49.017 | |
Here, for example, the verb tenses, so swimming and
swam, and walking and walked.
1:20:49.469 --> 1:20:51.639 | |
So you can try to use them. | |
1:20:51.639 --> 1:20:59.001 | |
It's not perfect, but the interesting
thing is this is completely unsupervised.
1:20:59.001 --> 1:21:03.961 | |
So nobody taught the model the principle of
gender in language.
1:21:04.284 --> 1:21:09.910 | |
So it's purely trained on the task of doing
next word prediction.
1:21:10.230 --> 1:21:20.658 | |
And even for really semantic information,
like capitals, this is the difference between
1:21:20.658 --> 1:21:23.638 | |
the country and the capital.
1:21:23.823 --> 1:21:25.518 | |
Here is a visualization.
1:21:25.518 --> 1:21:33.766 | |
Here we have done the same thing with the difference
between country and capital.
1:21:33.853 --> 1:21:41.991 | |
You see it's not perfect, but it's pointing in
some kind of right direction, so you can
1:21:41.991 --> 1:21:43.347 | |
even use them. | |
1:21:43.347 --> 1:21:51.304 | |
For example, for question answering, if you | |
have the difference between them, you apply | |
1:21:51.304 --> 1:21:53.383 | |
that to a new country. | |
1:21:54.834 --> 1:22:02.741 | |
So it seems these embeddings are able to really
learn a lot of information and compress all
1:22:02.741 --> 1:22:04.396 | |
this information. | |
1:22:05.325 --> 1:22:11.769 | |
And all this just to do the next word prediction. And
that also explains a bit, or rather motivates,
1:22:11.769 --> 1:22:19.016 | |
what is the main
advantage of this type of neural models: that
1:22:19.016 --> 1:22:26.025 | |
we can use this type of hidden representation,
transfer them and use them in different tasks.
1:22:28.568 --> 1:22:43.707 | |
So, to summarize what we did today: what you
should hopefully take with you is how language models are used for machine
1:22:43.707 --> 1:22:45.893 | |
translation. | |
1:22:45.805 --> 1:22:49.149 | |
Then, how we can do language modeling with
neural networks.
1:22:49.449 --> 1:22:55.617 | |
We looked at three different architectures:
we looked into the feed-forward language model
1:22:55.617 --> 1:22:59.063 | |
and the one based on restricted Boltzmann machines.
1:22:59.039 --> 1:23:05.366 | |
And finally, there are different architectures
for neural networks.
1:23:05.366 --> 1:23:14.404 | |
We have seen feed-forward networks, and we'll
see in the next lectures the last type of architecture.
1:23:15.915 --> 1:23:17.412 | |
Do you have any questions?
1:23:20.680 --> 1:23:27.341 | |
Then thanks a lot, and next Tuesday we
will meet here again.
0:00:01.301 --> 0:00:05.687 | |
Okay, so we're welcome to today's lecture. | |
0:00:06.066 --> 0:00:18.128 | |
A bit desperate in a small room and I'm sorry | |
for the inconvenience. | |
0:00:18.128 --> 0:00:25.820 | |
Sometimes there are project meetings where. | |
0:00:26.806 --> 0:00:40.863 | |
So what we want to talk today about is want | |
to start with neural approaches to machine | |
0:00:40.863 --> 0:00:42.964 | |
translation. | |
0:00:43.123 --> 0:00:55.779 | |
Guess I've heard about other types of neural | |
models for natural language processing. | |
0:00:55.779 --> 0:00:59.948 | |
This was some of the first. | |
0:01:00.600 --> 0:01:06.203 | |
They are similar to what you know they see | |
in as large language models. | |
0:01:06.666 --> 0:01:14.810 | |
And we want today look into what are these | |
neural language models, how we can build them, | |
0:01:14.810 --> 0:01:15.986 | |
what is the. | |
0:01:16.316 --> 0:01:23.002 | |
And first we'll show how to use them in statistical | |
machine translation. | |
0:01:23.002 --> 0:01:31.062 | |
If you remember weeks ago, we had this log-linear | |
model where you can integrate easily. | |
0:01:31.351 --> 0:01:42.756 | |
And that was how they first were used, so | |
we just had another model that evaluates how | |
0:01:42.756 --> 0:01:49.180 | |
good a system is or how good a lot of languages. | |
0:01:50.690 --> 0:02:04.468 | |
And next week we will go for a neuromachine | |
translation where we replace the whole model | |
0:02:04.468 --> 0:02:06.481 | |
by one huge. | |
0:02:11.211 --> 0:02:20.668 | |
So just as a member from Tuesday we've seen, | |
the main challenge in language modeling was | |
0:02:20.668 --> 0:02:25.131 | |
that most of the anthrax we haven't seen. | |
0:02:26.946 --> 0:02:34.167 | |
So this was therefore difficult to estimate | |
any probability because we've seen that yet | |
0:02:34.167 --> 0:02:39.501 | |
normally if you've seen had not seen the N | |
gram you will assign. | |
0:02:39.980 --> 0:02:53.385 | |
However, this is not really very good because | |
we don't want to give zero probabilities to | |
0:02:53.385 --> 0:02:55.023 | |
sentences. | |
0:02:55.415 --> 0:03:10.397 | |
And then we learned a lot of techniques and | |
that is the main challenge in statistical language. | |
0:03:10.397 --> 0:03:15.391 | |
How we can give somehow a good. | |
0:03:15.435 --> 0:03:23.835 | |
And they developed very specific, very good | |
techniques to deal with that. | |
0:03:23.835 --> 0:03:26.900 | |
However, this is the best. | |
0:03:28.568 --> 0:03:33.907 | |
And therefore we can do things different. | |
0:03:33.907 --> 0:03:44.331 | |
If we have not seen an N gram before in statistical | |
models, we have to have seen. | |
0:03:45.225 --> 0:03:51.361 | |
Before, and we can only get information from | |
exactly the same word. | |
0:03:51.411 --> 0:03:57.567 | |
We don't have an approximate matching like | |
that. | |
0:03:57.567 --> 0:04:10.255 | |
Maybe it stood together in some way or similar, | |
and in a sentence we might generalize the knowledge. | |
0:04:11.191 --> 0:04:21.227 | |
Would like to have more something like that | |
where engrams are represented more in a general | |
0:04:21.227 --> 0:04:21.990 | |
space. | |
0:04:22.262 --> 0:04:29.877 | |
So if you learn something about eyewalk then | |
maybe we can use this knowledge and also. | |
0:04:30.290 --> 0:04:43.034 | |
And thereby no longer treat all or at least | |
a lot of the ingrams as we've done before. | |
0:04:43.034 --> 0:04:45.231 | |
We can really. | |
0:04:47.047 --> 0:04:56.157 | |
And we maybe want to even do that in a more | |
hierarchical approach, but we know okay some | |
0:04:56.157 --> 0:05:05.268 | |
words are similar like go and walk is somehow | |
similar and and therefore like maybe if we | |
0:05:05.268 --> 0:05:07.009 | |
then merge them. | |
0:05:07.387 --> 0:05:16.104 | |
If we learn something about work, then it | |
should tell us also something about Hugo or | |
0:05:16.104 --> 0:05:17.118 | |
he walks. | |
0:05:17.197 --> 0:05:18.970 | |
We see already. | |
0:05:18.970 --> 0:05:22.295 | |
It's, of course, not so easy. | |
0:05:22.295 --> 0:05:31.828 | |
We see that there is some relations which | |
we need to integrate, for example, for you. | |
0:05:31.828 --> 0:05:35.486 | |
We need to add the S, but maybe. | |
0:05:37.137 --> 0:05:42.984 | |
And luckily there is one really yeah, convincing | |
methods in doing that. | |
0:05:42.963 --> 0:05:47.239 | |
And that is by using an evil neck or. | |
0:05:47.387 --> 0:05:57.618 | |
That's what we will introduce today so we | |
can use this type of neural networks to try | |
0:05:57.618 --> 0:06:04.042 | |
to learn this similarity and to learn how some | |
words. | |
0:06:04.324 --> 0:06:13.711 | |
And that is one of the main advantages that | |
we have by switching from the standard statistical | |
0:06:13.711 --> 0:06:15.193 | |
models to the. | |
0:06:15.115 --> 0:06:22.840 | |
To learn similarities between words and generalized | |
and learn what we call hidden representations. | |
0:06:22.840 --> 0:06:29.707 | |
So somehow representations of words where | |
we can measure similarity in some dimensions. | |
0:06:30.290 --> 0:06:42.275 | |
So in representations where as a tubically | |
continuous vector or a vector of a fixed size. | |
0:06:42.822 --> 0:06:52.002 | |
We had it before and we've seen that the only | |
thing we did is we don't want to do. | |
0:06:52.192 --> 0:06:59.648 | |
But these indices don't have any meaning, | |
so it wasn't that word five is more similar | |
0:06:59.648 --> 0:07:02.248 | |
to words twenty than to word. | |
0:07:02.582 --> 0:07:09.059 | |
So we couldn't learn anything about words | |
in the statistical model. | |
0:07:09.059 --> 0:07:12.107 | |
That's a big challenge because. | |
0:07:12.192 --> 0:07:24.232 | |
If you think about words even in morphology, | |
so go and go is more similar because the person. | |
0:07:24.264 --> 0:07:36.265 | |
While the basic models we have up to now, | |
they have no idea about that and goes as similar | |
0:07:36.265 --> 0:07:37.188 | |
to go. | |
0:07:39.919 --> 0:07:53.102 | |
So what we want to do today, in order to go | |
to this, we will have a short introduction. | |
0:07:53.954 --> 0:08:06.667 | |
It very short just to see how we use them | |
here, but that's the good thing that are important | |
0:08:06.667 --> 0:08:08.445 | |
for dealing. | |
0:08:08.928 --> 0:08:14.083 | |
And then we'll first look into feet forward, | |
new network language models. | |
0:08:14.454 --> 0:08:21.221 | |
And there we will still have this approximation | |
we had before, then we are looking only at | |
0:08:21.221 --> 0:08:22.336 | |
fixed windows. | |
0:08:22.336 --> 0:08:28.805 | |
So if you remember we have this classroom | |
of language models, and to determine what is | |
0:08:28.805 --> 0:08:33.788 | |
the probability of a word, we only look at | |
the past and minus one. | |
0:08:34.154 --> 0:08:36.878 | |
This is the theory of the case. | |
0:08:36.878 --> 0:08:43.348 | |
However, we have the ability and that's why | |
they're really better in order. | |
0:08:44.024 --> 0:08:51.953 | |
And then at the end we'll look at current | |
network language models where we then have | |
0:08:51.953 --> 0:08:53.166 | |
a different. | |
0:08:53.093 --> 0:09:01.922 | |
And thereby it is no longer the case that | |
we need to have a fixed history, but in theory | |
0:09:01.922 --> 0:09:04.303 | |
we can model arbitrary. | |
0:09:04.304 --> 0:09:06.854 | |
And we can log this phenomenon. | |
0:09:06.854 --> 0:09:12.672 | |
We talked about a Tuesday where it's not clear | |
what type of information. | |
0:09:16.396 --> 0:09:24.982 | |
So yeah, generally new networks are normally | |
learned to improve and perform some tasks. | |
0:09:25.325 --> 0:09:38.934 | |
We have this structure and we are learning | |
them from samples so that is similar to what | |
0:09:38.934 --> 0:09:42.336 | |
we had before so now. | |
0:09:42.642 --> 0:09:49.361 | |
And is somehow originally motivated by the | |
human brain. | |
0:09:49.361 --> 0:10:00.640 | |
However, when you now need to know artificial | |
neural networks, it's hard to get a similarity. | |
0:10:00.540 --> 0:10:02.884 | |
There seems to be not that important. | |
0:10:03.123 --> 0:10:11.013 | |
So what they are mainly doing is doing summoning | |
multiplication and then one linear activation. | |
0:10:12.692 --> 0:10:16.078 | |
So so the basic units are these type of. | |
0:10:17.937 --> 0:10:29.837 | |
Perceptron is a basic block which we have | |
and this does exactly the processing. | |
0:10:29.837 --> 0:10:36.084 | |
We have a fixed number of input features. | |
0:10:36.096 --> 0:10:39.668 | |
So we have here numbers six zero to x and | |
as input. | |
0:10:40.060 --> 0:10:48.096 | |
And this makes language processing difficult | |
because we know that it's not the case. | |
0:10:48.096 --> 0:10:53.107 | |
If we're dealing with language, it doesn't | |
have any. | |
0:10:54.114 --> 0:10:57.609 | |
So we have to model this somehow and understand | |
how we model this. | |
0:10:58.198 --> 0:11:03.681 | |
Then we have the weights, which are the parameters | |
and the number of weights exactly the same. | |
0:11:04.164 --> 0:11:15.069 | |
Of input features sometimes you have the spires | |
in there that always and then it's not really. | |
0:11:15.195 --> 0:11:19.656 | |
And what you then do is very simple. | |
0:11:19.656 --> 0:11:26.166 | |
It's just like the weight it sounds, so you | |
multiply. | |
0:11:26.606 --> 0:11:38.405 | |
What is then additionally important is we | |
have an activation function and it's important | |
0:11:38.405 --> 0:11:42.514 | |
that this activation function. | |
0:11:43.243 --> 0:11:54.088 | |
And later it will be important that this is | |
differentiable because otherwise all the training. | |
0:11:54.714 --> 0:12:01.471 | |
This model by itself is not very powerful. | |
0:12:01.471 --> 0:12:10.427 | |
We have the X Or problem and with this simple | |
you can't. | |
0:12:10.710 --> 0:12:15.489 | |
However, there is a very easy and nice extension. | |
0:12:15.489 --> 0:12:20.936 | |
The multi layer perception and things get | |
very powerful. | |
0:12:21.081 --> 0:12:32.953 | |
The thing is you just connect a lot of these | |
in these layers of structures where we have | |
0:12:32.953 --> 0:12:35.088 | |
the inputs and. | |
0:12:35.395 --> 0:12:47.297 | |
And then we can combine them, or to do them: | |
The input layer is of course given by your | |
0:12:47.297 --> 0:12:51.880 | |
problem with the dimension. | |
0:12:51.880 --> 0:13:00.063 | |
The output layer is also given by your dimension. | |
0:13:01.621 --> 0:13:08.802 | |
So let's start with the first question, now | |
more language related, and that is how we represent. | |
0:13:09.149 --> 0:13:19.282 | |
So we have seen here input to x, but the question | |
is now okay. | |
0:13:19.282 --> 0:13:23.464 | |
How can we put into this? | |
0:13:26.866 --> 0:13:34.123 | |
The first thing that we're able to do is we're | |
going to set it in the inspector. | |
0:13:34.314 --> 0:13:45.651 | |
Yeah, and that is not that easy because the | |
continuous vector will come to that. | |
0:13:45.651 --> 0:13:47.051 | |
We can't. | |
0:13:47.051 --> 0:13:50.410 | |
We don't want to do it. | |
0:13:50.630 --> 0:13:57.237 | |
But if we need to input the word into the | |
needle network, it has to be something easily | |
0:13:57.237 --> 0:13:57.912 | |
defined. | |
0:13:59.079 --> 0:14:11.511 | |
One is the typical thing, the one-hour encoded | |
vector, so we have a vector where the dimension | |
0:14:11.511 --> 0:14:15.306 | |
is the vocabulary, and then. | |
0:14:16.316 --> 0:14:25.938 | |
So the first thing you are ready to see that | |
means we are always dealing with fixed. | |
0:14:26.246 --> 0:14:34.961 | |
So you cannot easily extend your vocabulary, | |
but if you mean your vocabulary would increase | |
0:14:34.961 --> 0:14:37.992 | |
the size of this input vector,. | |
0:14:39.980 --> 0:14:42.423 | |
That's maybe also motivating. | |
0:14:42.423 --> 0:14:45.355 | |
We'll talk about bike parade going. | |
0:14:45.355 --> 0:14:47.228 | |
That's the nice thing. | |
0:14:48.048 --> 0:15:01.803 | |
The big advantage of this one putt encoding | |
is that we don't implement similarity between | |
0:15:01.803 --> 0:15:06.999 | |
words, but we're really learning. | |
0:15:07.227 --> 0:15:11.219 | |
So you need like to represent any words. | |
0:15:11.219 --> 0:15:15.893 | |
You need a dimension of and dimensional vector. | |
0:15:16.236 --> 0:15:26.480 | |
Imagine you could eat no binary encoding, | |
so you could represent words as binary vectors. | |
0:15:26.806 --> 0:15:32.348 | |
So you will be significantly more efficient. | |
0:15:32.348 --> 0:15:39.122 | |
However, you have some more digits than other | |
numbers. | |
0:15:39.559 --> 0:15:46.482 | |
Would somehow be bad because you would force | |
the one to do this and it's by hand not clear | |
0:15:46.482 --> 0:15:47.623 | |
how to define. | |
0:15:48.108 --> 0:15:55.135 | |
So therefore currently this is the most successful | |
approach to just do this one patch. | |
0:15:55.095 --> 0:15:59.344 | |
We take a fixed vocabulary. | |
0:15:59.344 --> 0:16:10.269 | |
We map each word to the initial and then we | |
represent a word like this. | |
0:16:10.269 --> 0:16:13.304 | |
The representation. | |
0:16:14.514 --> 0:16:27.019 | |
But this dimension here is a secondary size, | |
and if you think ten thousand that's quite | |
0:16:27.019 --> 0:16:33.555 | |
high, so we're always trying to be efficient. | |
0:16:33.853 --> 0:16:42.515 | |
And we are doing the same type of efficiency | |
because then we are having a very small one | |
0:16:42.515 --> 0:16:43.781 | |
compared to. | |
0:16:44.104 --> 0:16:53.332 | |
It can be still a maybe or neurons, but this | |
is significantly smaller, of course, as before. | |
0:16:53.713 --> 0:17:04.751 | |
So you are learning there this word as you | |
said, but you can learn it directly, and there | |
0:17:04.751 --> 0:17:07.449 | |
we have similarities. | |
0:17:07.807 --> 0:17:14.772 | |
But the nice thing is that this is then learned, | |
and we do not need to like hand define. | |
0:17:17.117 --> 0:17:32.377 | |
So yes, so that is how we're typically adding | |
at least a single word into the language world. | |
0:17:32.377 --> 0:17:43.337 | |
Then we can see: So we're seeing that you | |
have the one hard representation always of | |
0:17:43.337 --> 0:17:44.857 | |
the same similarity. | |
0:17:45.105 --> 0:18:00.803 | |
Then we're having this continuous vector which | |
is a lot smaller dimension and that's. | |
0:18:01.121 --> 0:18:06.984 | |
What we are doing then is learning these representations | |
so that they are best for language modeling. | |
0:18:07.487 --> 0:18:19.107 | |
So the representations are implicitly because | |
we're training on the language. | |
0:18:19.479 --> 0:18:30.115 | |
And the nice thing was found out later is | |
these representations are really, really good | |
0:18:30.115 --> 0:18:32.533 | |
for a lot of other. | |
0:18:33.153 --> 0:18:39.729 | |
And that is why they are now called word embedded | |
space themselves, and used for other tasks. | |
0:18:40.360 --> 0:18:49.827 | |
And they are somehow describing different | |
things so they can describe and semantic similarities. | |
0:18:49.789 --> 0:18:58.281 | |
We are looking at the very example of today | |
that you can do in this vector space by adding | |
0:18:58.281 --> 0:19:00.613 | |
some interesting things. | |
0:19:00.940 --> 0:19:11.174 | |
And so they got really was a first big improvement | |
when switching to neural staff. | |
0:19:11.491 --> 0:19:20.736 | |
They are like part of the model still with | |
more complex representation alert, but they | |
0:19:20.736 --> 0:19:21.267 | |
are. | |
0:19:23.683 --> 0:19:34.975 | |
Then we are having the output layer, and in | |
the output layer we also have output structure | |
0:19:34.975 --> 0:19:36.960 | |
and activation. | |
0:19:36.997 --> 0:19:44.784 | |
That is the language we want to predict, which | |
word should be the next. | |
0:19:44.784 --> 0:19:46.514 | |
We always have. | |
0:19:47.247 --> 0:19:56.454 | |
And that can be done very well with the softball | |
softbacked layer, where again the dimension. | |
0:19:56.376 --> 0:20:03.971 | |
Is the vocabulary, so this is a vocabulary | |
size, and again the case neuro represents the | |
0:20:03.971 --> 0:20:09.775 | |
case class, so in our case we have again a | |
one-hour representation. | |
0:20:10.090 --> 0:20:18.929 | |
Ours is a probability distribution and the | |
end is a probability distribution of all works. | |
0:20:18.929 --> 0:20:28.044 | |
The case entry tells us: So we need to have | |
some of our probability distribution at our | |
0:20:28.044 --> 0:20:36.215 | |
output, and in order to achieve that this activation | |
function goes, it needs to be that all the | |
0:20:36.215 --> 0:20:36.981 | |
outputs. | |
0:20:37.197 --> 0:20:47.993 | |
And we can achieve that with a softmax activation | |
we take each of the value and then. | |
0:20:48.288 --> 0:20:58.020 | |
So by having this type of activation function | |
we are really getting that at the end we always. | |
0:20:59.019 --> 0:21:12.340 | |
The beginning was very challenging because | |
again we have this inefficient representation | |
0:21:12.340 --> 0:21:15.184 | |
of our vocabulary. | |
0:21:15.235 --> 0:21:27.500 | |
And then you can imagine escalating over to | |
something over a thousand is maybe a bit inefficient | |
0:21:27.500 --> 0:21:29.776 | |
with cheap users. | |
0:21:36.316 --> 0:21:43.664 | |
And then yeah, for training the models, that | |
is how we refine, so we have this architecture | |
0:21:43.664 --> 0:21:44.063 | |
now. | |
0:21:44.264 --> 0:21:52.496 | |
We need to minimize the arrow by taking the | |
output. | |
0:21:52.496 --> 0:21:58.196 | |
We are comparing it to our targets. | |
0:21:58.298 --> 0:22:07.670 | |
So one important thing is, of course, how | |
can we measure the error? | |
0:22:07.670 --> 0:22:12.770 | |
So what if we're training the ideas? | |
0:22:13.033 --> 0:22:19.770 | |
And how well when measuring it is in natural | |
language processing, typically the cross entropy. | |
0:22:19.960 --> 0:22:32.847 | |
That means we are comparing the target with | |
the output, so we're taking the value multiplying | |
0:22:32.847 --> 0:22:35.452 | |
with the horizons. | |
0:22:35.335 --> 0:22:43.454 | |
Which gets optimized and you're seeing that | |
this, of course, makes it again very nice and | |
0:22:43.454 --> 0:22:49.859 | |
easy because our target, we said, is again | |
a one-hound representation. | |
0:22:50.110 --> 0:23:00.111 | |
So except for one, all of these are always | |
zero, and what we are doing is taking the one. | |
0:23:00.100 --> 0:23:05.970 | |
And we only need to multiply the one with | |
the logarism here, and that is all the feedback. | |
0:23:06.946 --> 0:23:14.194 | |
Of course, this is not always influenced by | |
all the others. | |
0:23:14.194 --> 0:23:17.938 | |
Why is this influenced by all? | |
0:23:24.304 --> 0:23:33.554 | |
Think Mac the activation function, which is | |
the current activation divided by some of the | |
0:23:33.554 --> 0:23:34.377 | |
others. | |
0:23:34.354 --> 0:23:44.027 | |
Because otherwise it could of course easily | |
just increase this value and ignore the others, | |
0:23:44.027 --> 0:23:49.074 | |
but if you increase one value or the other, | |
so. | |
0:23:51.351 --> 0:24:04.433 | |
And then we can do with neon networks one | |
very nice and easy type of training that is | |
0:24:04.433 --> 0:24:07.779 | |
done in all the neon. | |
0:24:07.707 --> 0:24:12.664 | |
So in which direction does the arrow show? | |
0:24:12.664 --> 0:24:23.152 | |
And then if we want to go to a smaller like | |
smaller arrow, that's what we want to achieve. | |
0:24:23.152 --> 0:24:27.302 | |
We're trying to minimize our arrow. | |
0:24:27.287 --> 0:24:32.875 | |
And we have to do that, of course, for all | |
the weights, and to calculate the error of | |
0:24:32.875 --> 0:24:36.709 | |
all the weights we want in the back of the | |
baggation here. | |
0:24:36.709 --> 0:24:41.322 | |
But what you can do is you can propagate the | |
arrow which you measured. | |
0:24:41.322 --> 0:24:43.792 | |
At the end you can propagate it back. | |
0:24:43.792 --> 0:24:46.391 | |
That's basic mass and basic derivation. | |
0:24:46.706 --> 0:24:59.557 | |
Then you can do each weight in your model | |
and measure how much it contributes to this | |
0:24:59.557 --> 0:25:01.350 | |
individual. | |
0:25:04.524 --> 0:25:17.712 | |
To summarize what your machine translation | |
should be, to understand all this problem is | |
0:25:17.712 --> 0:25:20.710 | |
that this is how a. | |
0:25:20.580 --> 0:25:23.056 | |
The notes are perfect thrones. | |
0:25:23.056 --> 0:25:28.167 | |
They are fully connected between two layers | |
and no connections. | |
0:25:28.108 --> 0:25:29.759 | |
Across layers. | |
0:25:29.829 --> 0:25:35.152 | |
And what they're doing is always just to wait | |
for some here and then an activation function. | |
0:25:35.415 --> 0:25:38.794 | |
And in order to train you have this sword | |
in backwards past. | |
0:25:39.039 --> 0:25:41.384 | |
So we put in here. | |
0:25:41.281 --> 0:25:46.540 | |
Our inputs have some random values at the | |
beginning. | |
0:25:46.540 --> 0:25:49.219 | |
They calculate the output. | |
0:25:49.219 --> 0:25:58.646 | |
We are measuring how big our error is, propagating | |
the arrow back, and then changing our model | |
0:25:58.646 --> 0:25:59.638 | |
in a way. | |
0:26:01.962 --> 0:26:14.267 | |
So before we're coming into the neural networks, | |
how can we use this type of neural network | |
0:26:14.267 --> 0:26:17.611 | |
to do language modeling? | |
0:26:23.103 --> 0:26:25.520 | |
So the question is now okay. | |
0:26:25.520 --> 0:26:33.023 | |
How can we use them in natural language processing | |
and especially in machine translation? | |
0:26:33.023 --> 0:26:38.441 | |
The first idea of using them was to estimate | |
the language model. | |
0:26:38.999 --> 0:26:42.599 | |
So we have seen that the output can be monitored | |
here as well. | |
0:26:43.603 --> 0:26:49.308 | |
Has a probability distribution, and if we | |
have a full vocabulary, we could mainly hear | |
0:26:49.308 --> 0:26:55.209 | |
estimate how probable each next word is, and | |
then use that in our language model fashion, | |
0:26:55.209 --> 0:27:02.225 | |
as we've done it last time, we've got the probability | |
of a full sentence as a product of all probabilities | |
0:27:02.225 --> 0:27:03.208 | |
of individual. | |
0:27:04.544 --> 0:27:06.695 | |
And UM. | |
0:27:06.446 --> 0:27:09.776 | |
That was done and in ninety seven years. | |
0:27:09.776 --> 0:27:17.410 | |
It's very easy to integrate it into this Locklear | |
model, so we have said that this is how the | |
0:27:17.410 --> 0:27:24.638 | |
Locklear model looks like, so we're searching | |
the best translation, which minimizes each | |
0:27:24.638 --> 0:27:25.126 | |
wage. | |
0:27:25.125 --> 0:27:26.371 | |
The feature value. | |
0:27:26.646 --> 0:27:31.642 | |
We have that with the minimum error training, | |
if you can remember when we search for the | |
0:27:31.642 --> 0:27:32.148 | |
optimal. | |
0:27:32.512 --> 0:27:40.927 | |
We have the phrasetable probabilities, the | |
language model, and we can just add here and | |
0:27:40.927 --> 0:27:41.597 | |
there. | |
0:27:41.861 --> 0:27:46.077 | |
So that is quite easy as said. | |
0:27:46.077 --> 0:27:54.101 | |
That was how statistical machine translation | |
was improved. | |
0:27:54.101 --> 0:27:57.092 | |
Add one more feature. | |
0:27:58.798 --> 0:28:11.220 | |
So how can we model the language mark for | |
Belty with your network? | |
0:28:11.220 --> 0:28:22.994 | |
So what we have to do is: And the problem | |
in generally in the head is that most we haven't | |
0:28:22.994 --> 0:28:25.042 | |
seen long sequences. | |
0:28:25.085 --> 0:28:36.956 | |
Mostly we have to beg off to very short sequences | |
and we are working on this discrete space where. | |
0:28:37.337 --> 0:28:48.199 | |
So the idea is if we have a meal network we | |
can map words into continuous representation | |
0:28:48.199 --> 0:28:50.152 | |
and that helps. | |
0:28:51.091 --> 0:28:59.598 | |
And the structure then looks like this, so | |
this is the basic still feed forward neural | |
0:28:59.598 --> 0:29:00.478 | |
network. | |
0:29:01.361 --> 0:29:10.744 | |
We are doing this at Proximation again, so | |
we are not putting in all previous words, but | |
0:29:10.744 --> 0:29:11.376 | |
it's. | |
0:29:11.691 --> 0:29:25.089 | |
And this is done because in your network we | |
can have only a fixed type of input, so we | |
0:29:25.089 --> 0:29:31.538 | |
can: Can only do a fixed set, and they are | |
going to be doing exactly the same in minus | |
0:29:31.538 --> 0:29:31.879 | |
one. | |
0:29:33.593 --> 0:29:41.026 | |
And then we have, for example, three words | |
and three different words, which are in these | |
0:29:41.026 --> 0:29:54.583 | |
positions: And then we're having the first | |
layer of the neural network, which learns words | |
0:29:54.583 --> 0:29:56.247 | |
and words. | |
0:29:57.437 --> 0:30:04.976 | |
There is one thing which is maybe special | |
compared to the standard neural memory. | |
0:30:05.345 --> 0:30:13.163 | |
So the representation of this word we want | |
to learn first of all position independence, | |
0:30:13.163 --> 0:30:19.027 | |
so we just want to learn what is the general | |
meaning of the word. | |
0:30:19.299 --> 0:30:26.244 | |
Therefore, the representation you get here | |
should be the same as if you put it in there. | |
0:30:27.247 --> 0:30:35.069 | |
The nice thing is you can achieve that in | |
networks the same way you achieve it. | |
0:30:35.069 --> 0:30:41.719 | |
This way you're reusing ears so we are forcing | |
them to always stay. | |
0:30:42.322 --> 0:30:49.689 | |
And that's why you then learn your word embedding, | |
which is contextual and independent, so. | |
0:30:49.909 --> 0:31:05.561 | |
So the idea is you have the diagram go home | |
and you don't want to use the context. | |
0:31:05.561 --> 0:31:07.635 | |
First you. | |
0:31:08.348 --> 0:31:14.155 | |
That of course it might have a different meaning | |
depending on where it stands, but learn that. | |
0:31:14.514 --> 0:31:19.623 | |
First, we're learning key representation of | |
the words, which is just the representation | |
0:31:19.623 --> 0:31:20.378 | |
of the word. | |
0:31:20.760 --> 0:31:37.428 | |
So it's also not like normally all input neurons | |
are connected to all neurons. | |
0:31:37.857 --> 0:31:47.209 | |
This is the first layer of representation, | |
and then we have a lot denser representation, | |
0:31:47.209 --> 0:31:56.666 | |
that is, our three word embeddings here, and | |
now we are learning this interaction between | |
0:31:56.666 --> 0:31:57.402 | |
words. | |
0:31:57.677 --> 0:32:08.265 | |
So now we have at least one connected, fully | |
connected layer here, which takes the three | |
0:32:08.265 --> 0:32:14.213 | |
imbedded input and then learns the new embedding. | |
0:32:15.535 --> 0:32:27.871 | |
And then if you had one of several layers | |
of lining which is your output layer, then. | |
0:32:28.168 --> 0:32:46.222 | |
So here the size is a vocabulary size, and | |
then you put as target what is the probability | |
0:32:46.222 --> 0:32:48.228 | |
for each. | |
0:32:48.688 --> 0:32:56.778 | |
The nice thing is that you learn everything | |
together, so you're not learning what is a | |
0:32:56.778 --> 0:32:58.731 | |
good representation. | |
0:32:59.079 --> 0:33:12.019 | |
When you are training the whole network together, | |
it learns what representation for a word you | |
0:33:12.019 --> 0:33:13.109 | |
get in. | |
0:33:15.956 --> 0:33:19.176 | |
It's Yeah That Is the Main Idea. | |
0:33:20.660 --> 0:33:32.695 | |
Nowadays often referred to as one way of self-supervised | |
learning, why self-supervisory learning? | |
0:33:33.053 --> 0:33:37.120 | |
The output is the next word and the input | |
is the previous word. | |
0:33:37.377 --> 0:33:46.778 | |
But somehow it's self-supervised because it's | |
not really that we created labels, but we artificially. | |
0:33:46.806 --> 0:34:01.003 | |
We just have pure text, and then we created | |
the task. | |
0:34:05.905 --> 0:34:12.413 | |
Say we have two sentences like go home again. | |
0:34:12.413 --> 0:34:18.780 | |
Second one is go to creative again, so both. | |
0:34:18.858 --> 0:34:31.765 | |
The starboard bygo and then we have to predict | |
the next four years and my question is: Be | |
0:34:31.765 --> 0:34:40.734 | |
modeled this ability as one vector with like | |
probability or possible works. | |
0:34:40.734 --> 0:34:42.740 | |
We have musical. | |
0:34:44.044 --> 0:34:56.438 | |
You have multiple examples, so you would twice | |
train, once you predict, once you predict, | |
0:34:56.438 --> 0:35:02.359 | |
and then, of course, the best performance. | |
0:35:04.564 --> 0:35:11.772 | |
A very good point, so you're not aggregating | |
examples beforehand, but you're taking each | |
0:35:11.772 --> 0:35:13.554 | |
example individually. | |
0:35:19.259 --> 0:35:33.406 | |
So what you do is you simultaneously learn | |
the projection layer which represents this | |
0:35:33.406 --> 0:35:39.163 | |
word and the N gram probabilities. | |
0:35:39.499 --> 0:35:48.390 | |
And what people then later analyzed is that | |
these representations are very powerful. | |
0:35:48.390 --> 0:35:56.340 | |
The task is just a very important task to | |
model like what is the next word. | |
0:35:56.816 --> 0:36:09.429 | |
It's a bit motivated by people saying in order | |
to get the meaning of the word you have to | |
0:36:09.429 --> 0:36:10.690 | |
look at. | |
0:36:10.790 --> 0:36:18.467 | |
If you read the text in there, which you have | |
never seen, you can still estimate the meaning | |
0:36:18.467 --> 0:36:22.264 | |
of this word because you know how it is used. | |
0:36:22.602 --> 0:36:26.667 | |
Just imagine you read this text about some | |
city. | |
0:36:26.667 --> 0:36:32.475 | |
Even if you've never seen the city before | |
heard, you often know from. | |
0:36:34.094 --> 0:36:44.809 | |
So what is now the big advantage of using | |
neural networks? | |
0:36:44.809 --> 0:36:57.570 | |
Just imagine we have to estimate this: So | |
you have to monitor the probability of ad hip | |
0:36:57.570 --> 0:37:00.272 | |
and now imagine iPhone. | |
0:37:00.600 --> 0:37:06.837 | |
So all the techniques we have at the last | |
time. | |
0:37:06.837 --> 0:37:14.243 | |
At the end, if you haven't seen iPhone, you | |
will always. | |
0:37:15.055 --> 0:37:19.502 | |
Because you haven't seen the previous words, | |
so you have no idea how to do that. | |
0:37:19.502 --> 0:37:24.388 | |
You won't have seen the diagram, the trigram | |
and all the others, so the probability here | |
0:37:24.388 --> 0:37:27.682 | |
will just be based on the probability of ad, | |
so it uses no. | |
0:37:28.588 --> 0:37:38.328 | |
If you're having this type of model, what | |
does it do so? | |
0:37:38.328 --> 0:37:43.454 | |
This is the last three words. | |
0:37:43.483 --> 0:37:49.837 | |
Maybe this representation is messed up because | |
it's mainly on a particular word or source | |
0:37:49.837 --> 0:37:50.260 | |
that. | |
0:37:50.730 --> 0:37:57.792 | |
Now anyway you have these two information | |
that were two words before was first and therefore: | |
0:37:58.098 --> 0:38:07.214 | |
So you have a lot of information here to estimate | |
how good it is. | |
0:38:07.214 --> 0:38:13.291 | |
Of course, there could be more information. | |
0:38:13.593 --> 0:38:25.958 | |
So all this type of modeling we can do and | |
that we couldn't do beforehand because we always. | |
0:38:27.027 --> 0:38:31.905 | |
Don't guess how we do it now. | |
0:38:31.905 --> 0:38:41.824 | |
Typically you would have one talking for awkward | |
vocabulary. | |
0:38:42.602 --> 0:38:45.855 | |
All you're doing by carrying coding when it | |
has a fixed dancing. | |
0:38:46.226 --> 0:38:49.439 | |
Yeah, you have to do something like that that | |
the opposite way. | |
0:38:50.050 --> 0:38:55.413 | |
So yeah, all the vocabulary are by thankcoding | |
where you don't have have all the vocabulary. | |
0:38:55.735 --> 0:39:07.665 | |
But then, of course, the back pairing coating | |
is better with arbitrary context because a | |
0:39:07.665 --> 0:39:11.285 | |
problem with back pairing. | |
0:39:17.357 --> 0:39:20.052 | |
Anymore questions to the basic same little | |
things. | |
0:39:23.783 --> 0:39:36.162 | |
This model we then want to continue is to | |
look into how complex that is or can make things | |
0:39:36.162 --> 0:39:39.155 | |
maybe more efficient. | |
0:39:40.580 --> 0:39:47.404 | |
At the beginning there was definitely a major | |
challenge. | |
0:39:47.404 --> 0:39:50.516 | |
It's still not that easy. | |
0:39:50.516 --> 0:39:58.297 | |
All guess follow the talk about their environmental | |
fingerprint. | |
0:39:58.478 --> 0:40:05.686 | |
So this calculation is normally heavy, and | |
if you build systems yourself, you have to | |
0:40:05.686 --> 0:40:06.189 | |
wait. | |
0:40:06.466 --> 0:40:15.412 | |
So it's good to know a bit about how complex | |
things are in order to do a good or efficient. | |
0:40:15.915 --> 0:40:24.706 | |
So one thing where most of the calculation | |
really happens is if you're. | |
0:40:25.185 --> 0:40:34.649 | |
So in generally all these layers, of course, | |
we're talking about networks and the zones | |
0:40:34.649 --> 0:40:35.402 | |
fancy. | |
0:40:35.835 --> 0:40:48.305 | |
So what you have to do in order to calculate | |
here these activations, you have this weight. | |
0:40:48.488 --> 0:41:05.021 | |
So to make it simple, let's see we have three | |
outputs, and then you just do a metric identification | |
0:41:05.021 --> 0:41:08.493 | |
between your weight. | |
0:41:08.969 --> 0:41:19.641 | |
That is why the use is so powerful for neural | |
networks because they are very good in doing | |
0:41:19.641 --> 0:41:22.339 | |
metric multiplication. | |
0:41:22.782 --> 0:41:28.017 | |
However, for some type of embedding layer | |
this is really very inefficient. | |
0:41:28.208 --> 0:41:37.547 | |
So in this input we are doing this calculation. | |
0:41:37.547 --> 0:41:47.081 | |
What we are mainly doing is selecting one | |
color. | |
0:41:47.387 --> 0:42:03.570 | |
So therefore you can do at least the forward | |
pass a lot more efficient if you don't really | |
0:42:03.570 --> 0:42:07.304 | |
do this calculation. | |
0:42:08.348 --> 0:42:20.032 | |
So the weight metrics of the first embedding | |
layer is just that in each color you have. | |
0:42:20.580 --> 0:42:30.990 | |
So this is how your initial weights look like | |
and how you can interpret or understand. | |
0:42:32.692 --> 0:42:42.042 | |
And this is already relatively important because | |
remember this is a huge dimensional thing, | |
0:42:42.042 --> 0:42:51.392 | |
so typically here we have the number of words | |
ten thousand, so this is the word embeddings. | |
0:42:51.451 --> 0:43:00.400 | |
Because it's the largest one there, we have | |
entries, while for the others we maybe have. | |
0:43:00.660 --> 0:43:03.402 | |
So they are a little bit efficient and are | |
important to make this in. | |
0:43:06.206 --> 0:43:10.529 | |
And then you can look at where else the calculations | |
are very difficult. | |
0:43:10.830 --> 0:43:20.294 | |
So here we have our individual network, so | |
here are the word embeddings. | |
0:43:20.294 --> 0:43:29.498 | |
Then we have one hidden layer, and then you | |
can look at how difficult. | |
0:43:30.270 --> 0:43:38.742 | |
We could save a lot of calculations by calculating | |
that by just doing like do the selection because: | |
0:43:40.600 --> 0:43:51.748 | |
And then the number of calculations you have | |
to do here is the length. | |
0:43:52.993 --> 0:44:06.206 | |
Then we have here the hint size that is the | |
hint size, so the first step of calculation | |
0:44:06.206 --> 0:44:10.260 | |
for this metric is an age. | |
0:44:10.730 --> 0:44:22.030 | |
Then you have to do some activation function | |
which is this: This is the hidden size hymn | |
0:44:22.030 --> 0:44:29.081 | |
because we need the vocabulary socks to calculate | |
the probability for each. | |
0:44:29.889 --> 0:44:40.474 | |
And if you look at this number, so if you | |
have a projection sign of one hundred and a | |
0:44:40.474 --> 0:44:45.027 | |
vocabulary sign of one hundred, you. | |
0:44:45.425 --> 0:44:53.958 | |
And that's why there has been especially at | |
the beginning some ideas on how we can reduce | |
0:44:53.958 --> 0:44:55.570 | |
the calculation. | |
0:44:55.956 --> 0:45:02.352 | |
And if we really need to calculate all our | |
capabilities, or if we can calculate only some. | |
0:45:02.582 --> 0:45:13.061 | |
And there again one important thing to think | |
about is for what you will use my language. | |
0:45:13.061 --> 0:45:21.891 | |
One can use it for generations and that's | |
where we will see the next week. | |
0:45:21.891 --> 0:45:22.480 | |
And. | |
0:45:23.123 --> 0:45:32.164 | |
Initially, if it's just used as a feature, | |
we do not want to use it for generation, but | |
0:45:32.164 --> 0:45:32.575 | |
we. | |
0:45:32.953 --> 0:45:41.913 | |
And there we might not be interested in all | |
the probabilities, but we already know all | |
0:45:41.913 --> 0:45:49.432 | |
the probability of this one word, and then | |
it might be very inefficient. | |
0:45:51.231 --> 0:45:53.638 | |
And how can you do that so initially? | |
0:45:53.638 --> 0:45:56.299 | |
For example, people look into shortlists. | |
0:45:56.756 --> 0:46:03.321 | |
So the idea was this calculation at the end | |
is really very expensive. | |
0:46:03.321 --> 0:46:05.759 | |
So can we make that more. | |
0:46:05.945 --> 0:46:17.135 | |
And the idea was okay, and most birds occur | |
very rarely, and some beef birds occur very, | |
0:46:17.135 --> 0:46:18.644 | |
very often. | |
0:46:19.019 --> 0:46:37.644 | |
And so they use the smaller imagery, which | |
is maybe very small, and then you merge a new. | |
0:46:37.937 --> 0:46:45.174 | |
So you're taking if the word is in the shortness, | |
so in the most frequent words. | |
0:46:45.825 --> 0:46:58.287 | |
You're taking the probability of this short | |
word by some normalization here, and otherwise | |
0:46:58.287 --> 0:46:59.656 | |
you take. | |
0:47:00.020 --> 0:47:00.836 | |
Course. | |
0:47:00.836 --> 0:47:09.814 | |
It will not be as good, but then we don't | |
have to calculate all the capabilities at the | |
0:47:09.814 --> 0:47:16.037 | |
end, but we only have to calculate it for the | |
most frequent. | |
0:47:19.599 --> 0:47:39.477 | |
Machines about that, but of course we don't | |
model the probability of the infrequent words. | |
0:47:39.299 --> 0:47:46.658 | |
And one idea is to do what is reported as | |
soles for the structure of the layer. | |
0:47:46.606 --> 0:47:53.169 | |
You see how some years ago people were very | |
creative in giving names to newer models. | |
0:47:53.813 --> 0:48:00.338 | |
And there the idea is that we model the out | |
group vocabulary as a clustered strip. | |
0:48:00.680 --> 0:48:08.498 | |
So you don't need to mold all of your bodies | |
directly, but you are putting words into. | |
0:48:08.969 --> 0:48:20.623 | |
A very intricate word is first in and then | |
in and then in and that is in sub-sub-clusters | |
0:48:20.623 --> 0:48:21.270 | |
and. | |
0:48:21.541 --> 0:48:29.936 | |
And this is what was mentioned in the past | |
of the work, so these are the subclasses that | |
0:48:29.936 --> 0:48:30.973 | |
always go. | |
0:48:30.973 --> 0:48:39.934 | |
So if it's in cluster one at the first position | |
then you only look at all the words which are: | |
0:48:40.340 --> 0:48:50.069 | |
And then you can calculate the probability | |
of a word again just by the product over these, | |
0:48:50.069 --> 0:48:55.522 | |
so the probability of the word is the first | |
class. | |
0:48:57.617 --> 0:49:12.331 | |
It's maybe more clear where you have the sole | |
architecture, so what you will do is first | |
0:49:12.331 --> 0:49:13.818 | |
predict. | |
0:49:14.154 --> 0:49:26.435 | |
Then you go to the appropriate sub-class, | |
then you calculate the probability of the sub-class. | |
0:49:27.687 --> 0:49:34.932 | |
Anybody have an idea why this is more, more | |
efficient, or if people do it first, it looks | |
0:49:34.932 --> 0:49:35.415 | |
more. | |
0:49:42.242 --> 0:49:56.913 | |
Yes, so you have to do less calculations, | |
or maybe here you have to calculate the element | |
0:49:56.913 --> 0:49:59.522 | |
there, but you. | |
0:49:59.980 --> 0:50:06.116 | |
The capabilities in the set classes that you're | |
going through and not for all of them. | |
0:50:06.386 --> 0:50:16.688 | |
Therefore, it's only more efficient if you | |
don't need all awkward preferences because | |
0:50:16.688 --> 0:50:21.240 | |
you have to even calculate the class. | |
0:50:21.501 --> 0:50:30.040 | |
So it's only more efficient in scenarios where | |
you really need to use a language to evaluate. | |
0:50:35.275 --> 0:50:54.856 | |
How this works is that on the output layer | |
you only have a vocabulary of: But on the input | |
0:50:54.856 --> 0:51:05.126 | |
layer you have always your full vocabulary | |
because at the input we saw that this is not | |
0:51:05.126 --> 0:51:06.643 | |
complicated. | |
0:51:06.906 --> 0:51:19.778 | |
And then you can cluster down all your words, | |
embedding series of classes, and use that as | |
0:51:19.778 --> 0:51:23.031 | |
your classes for that. | |
0:51:23.031 --> 0:51:26.567 | |
So yeah, you have words. | |
0:51:29.249 --> 0:51:32.593 | |
This is one idea of doing it.
0:51:32.593 --> 0:51:44.898 | |
There is also a second idea, which again
uses the fact that we don't always need a normalized probability.
0:51:45.025 --> 0:51:53.401 | |
So sometimes it doesn't really need to be
a proper probability to evaluate;
0:51:53.401 --> 0:52:05.492 | |
it's only important that better outputs get higher scores. This idea is called
self-normalization.
0:52:05.492 --> 0:52:19.349 | |
What people have done is the following: the softmax
is always the exponential of the input divided by the normalization term.
0:52:19.759 --> 0:52:25.194 | |
So this is how we calculate the softmax.
0:52:25.825 --> 0:52:42.224 | |
And in self-normalization now, the idea is
that the logarithm of the normalization term
0:52:42.102 --> 0:52:54.284 | |
should be zero, and then you don't even
have to calculate the normalization.
0:52:54.514 --> 0:53:01.016 | |
So how can we achieve that? | |
0:53:01.016 --> 0:53:08.680 | |
And then there's the nice thing. | |
0:53:09.009 --> 0:53:14.743 | |
Our normal loss aims to maximize probability.
0:53:14.743 --> 0:53:23.831 | |
We have this cross-entropy loss so that the probability of the correct word
gets higher, and now we're just adding a second loss.
0:53:24.084 --> 0:53:31.617 | |
And the second loss just tells the model: please
train in such a way that the log of the normalization term is zero.
0:53:32.352 --> 0:53:38.625 | |
So then if it's nearly zero at the end you | |
don't need to calculate this and it's also | |
0:53:38.625 --> 0:53:39.792 | |
very efficient. | |
0:53:40.540 --> 0:53:57.335 | |
One important thing: this only helps at inference,
so during testing we don't need to calculate the normalization, but during training we still do.
0:54:00.480 --> 0:54:15.006 | |
You have a bit of a hyperparameter here
where you do the weighting of how much emphasis
0:54:15.006 --> 0:54:16.843 | |
should be on each loss.
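A minimal sketch of such a self-normalized training loss, assuming the commonly used squared-log-normalizer penalty with a weighting hyperparameter `alpha`; the exact form in the lecture's references may differ.

```python
import numpy as np

def self_normalized_loss(logits, target, alpha=0.1):
    """Cross-entropy plus a penalty pushing log Z towards zero (assumed form)."""
    m = logits.max()
    log_Z = m + np.log(np.exp(logits - m).sum())   # log of the normalizer Z
    cross_entropy = log_Z - logits[target]         # -log softmax(logits)[target]
    return cross_entropy + alpha * log_Z ** 2      # alpha weights the two losses

# At inference, if log Z is close to zero, the unnormalized score logits[target]
# can be used directly as an approximate log-probability.
```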
0:54:18.318 --> 0:54:35.037 | |
The only disadvantage is that there is no speed-up
during training, and there are other ways
0:54:35.037 --> 0:54:37.887 | |
of doing that.
0:54:41.801 --> 0:54:43.900 | |
I'm with you all. | |
0:54:44.344 --> 0:54:48.540 | |
Then we are coming, very briefly, to
this one here.
0:54:48.828 --> 0:54:53.692 | |
There are more things on different types of
language models.
0:54:53.692 --> 0:54:58.026 | |
We are having a very short look at restricted Boltzmann machines.
0:54:58.298 --> 0:55:09.737 | |
And then we'll talk about recurrent neural
networks for our language models, because they
0:55:09.737 --> 0:55:17.407 | |
have the advantage that we can even further
improve the context.
0:55:18.238 --> 0:55:24.395 | |
There are also different types of neural networks.
0:55:24.395 --> 0:55:30.175 | |
These Boltzmann machines don't have separate inputs and outputs.
0:55:30.330 --> 0:55:39.271 | |
They have these binary units, and they define
an energy function on the network, which can,
0:55:39.271 --> 0:55:46.832 | |
in the case of restricted Boltzmann machines, be efficiently
calculated; and restricted means:
0:55:46.832 --> 0:55:53.148 | |
You only have connections between the input | |
and the hidden layer. | |
0:55:53.393 --> 0:56:00.190 | |
So you see here you don't have an input and an output;
you just have an input and you calculate a score for it.
0:56:00.460 --> 0:56:16.429 | |
Which of course nicely fits with the idea
we're having, so you can use this for n-gram
0:56:16.429 --> 0:56:19.182 | |
language models.
0:56:19.259 --> 0:56:25.187 | |
So you can calculate the probability of the input with
this type of neural network.
0:56:26.406 --> 0:56:30.582 | |
And the advantage of this type of model is
that it is:
0:56:30.550 --> 0:56:38.629 | |
Very fast to integrate it, so that one was | |
the first one which was used during decoding. | |
0:56:38.938 --> 0:56:50.103 | |
The problem is that the other neural n-gram language
models were not fast enough at performing this calculation during decoding.
0:56:50.230 --> 0:57:00.114 | |
So what people typically did is, we talked
about an n-best list: they generated the n most
0:57:00.114 --> 0:57:05.860 | |
probable outputs, and then they scored each
entry with
0:57:06.146 --> 0:57:10.884 | |
the neural language model, and then only changed
the order of the entries based on that score.
0:57:11.231 --> 0:57:20.731 | |
The n-best list has maybe only a hundred entries,
while during decoding you would look at several
0:57:20.731 --> 0:57:21.787 | |
thousand hypotheses.
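A minimal sketch of this n-best rescoring step, assuming a hypothetical `lm_logprob` scoring function and a simple weighted combination with the decoder score.

```python
def rerank(nbest, lm_logprob, weight=0.5):
    """Rescore an n-best list of (hypothesis, decoder_score) pairs and reorder."""
    scored = [(score + weight * lm_logprob(hyp), hyp) for hyp, score in nbest]
    scored.sort(key=lambda t: t[0], reverse=True)   # best combined score first
    return [hyp for _, hyp in scored]
```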
0:57:26.186 --> 0:57:40.437 | |
So much for this, but let's look at the context: we
have now seen neural n-gram language models.
0:57:40.437 --> 0:57:43.726 | |
There is one big limitation.
0:57:44.084 --> 0:57:57.552 | |
Remember, in n-gram language models the context is not always n minus one words,
because sometimes you have to back off or interpolate
0:57:57.552 --> 0:57:59.953 | |
with lower-order n-grams.
0:58:00.760 --> 0:58:05.504 | |
However, in neural models we always use all
of these inputs.
0:58:07.147 --> 0:58:21.262 | |
The disadvantage is that you are still limited
in your context, and if you remember the example sentence
0:58:21.262 --> 0:58:23.008 | |
from last time:
0:58:22.882 --> 0:58:28.445 | |
Sometimes you need more context; there is in principle
unlimited context that you might need, and
0:58:28.445 --> 0:58:34.838 | |
you can always create sentences where you need
this full context in order to make a good estimation.
0:58:35.315 --> 0:58:44.955 | |
Can we also do it differently? In order to better
understand that, it makes sense to view language modeling as sequence labeling.
0:58:45.445 --> 0:58:57.621 | |
So sequence labeling tasks are a very common
type of task in natural language processing
0:58:57.621 --> 0:59:03.438 | |
where you have an input sequence and then one
0:59:03.323 --> 0:59:08.663 | |
output token, so you have one output for each
input. So machine translation is not a sequence
0:59:08.663 --> 0:59:14.063 | |
labeling task, because the number of inputs
and the number of outputs is different: you
0:59:14.063 --> 0:59:19.099 | |
put in a German sentence which has five words,
and the output can be six or seven words or more.
0:59:19.619 --> 0:59:20.155 | |
In sequence
0:59:20.155 --> 0:59:24.083 | |
labeling you always have the same number of inputs
and the same number of outputs.
0:59:24.944 --> 0:59:40.940 | |
And you can model language modeling as that:
you just say the label for each word is always
0:59:40.940 --> 0:59:43.153 | |
the next word.
0:59:45.705 --> 0:59:54.823 | |
This is the more general view; you can think of
it, for example, as part-of-speech tagging or named entity
0:59:54.823 --> 0:59:56.202 | |
recognition.
0:59:58.938 --> 1:00:08.081 | |
And if you look now at the output tokens, in
general sequence labeling they can depend on all input
1:00:08.081 --> 1:00:08.912 | |
tokens.
1:00:09.869 --> 1:00:11.260 | |
The nice thing is:
1:00:11.260 --> 1:00:21.918 | |
In our case, the output tokens are the same | |
so we can easily model it that they only depend | |
1:00:21.918 --> 1:00:24.814 | |
on all the input tokens. | |
1:00:24.814 --> 1:00:28.984 | |
So we have this whether it's or so. | |
1:00:31.011 --> 1:00:42.945 | |
But we can also look at one specific
type of sequence labeling: unidirectional sequence
1:00:42.945 --> 1:00:44.188 | |
labeling.
1:00:44.584 --> 1:00:58.215 | |
And that's exactly what we want for language modeling:
the next word only depends on all the previous
1:00:58.215 --> 1:01:00.825 | |
words that we have seen.
1:01:01.321 --> 1:01:12.899 | |
Of course, that's not completely true
in language, in that the following context might also
1:01:12.899 --> 1:01:14.442 | |
be helpful.
1:01:14.654 --> 1:01:22.468 | |
But we always model the probability of a
word given its history, and therefore we only
1:01:22.468 --> 1:01:23.013 | |
need the previous words.
1:01:23.623 --> 1:01:29.896 | |
And currently we did there this approximation | |
in sequence labeling that we have this windowing | |
1:01:29.896 --> 1:01:30.556 | |
approach. | |
1:01:30.951 --> 1:01:43.975 | |
So in order to predict this word we
always look at the previous three words, and
1:01:43.975 --> 1:01:48.416 | |
then to predict the next one we again shift the window.
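A minimal sketch of this windowing approach, assuming a hypothetical `model.predict_next` that takes exactly the n-1 previous tokens.

```python
def predict_sentence(words, model, n=4):
    """Slide a fixed window over the sentence: each position is predicted
    from exactly the n-1 previous tokens, as in a feed-forward n-gram LM."""
    history = ["<s>"] * (n - 1) + list(words)   # pad the beginning of the sentence
    return [model.predict_next(history[i:i + n - 1]) for i in range(len(words))]
```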
1:01:49.389 --> 1:01:55.137 | |
If you are into neural networks, you recognize
this type of structure.
1:01:55.137 --> 1:01:57.519 | |
This is also the typical feed-forward neural network.
1:01:58.938 --> 1:02:09.688 | |
Yes, so this is like the n-gram language model we saw before,
and at least in some way, compared to the original,
1:02:09.688 --> 1:02:12.264 | |
you're always looking at a fixed window.
1:02:14.334 --> 1:02:30.781 | |
However, there are also other types of neural
network structures which we can use for sequence labeling.
1:02:32.812 --> 1:02:34.678 | |
And that we can do as follows.
1:02:34.678 --> 1:02:39.686 | |
The idea in the recurrent neural network structure is that
1:02:39.686 --> 1:02:43.221 | |
we are saving the complete history.
1:02:43.623 --> 1:02:55.118 | |
So again we have to use this fixed-size
representation, because neural networks always
1:02:55.118 --> 1:02:56.947 | |
need to have fixed-size inputs.
1:02:57.157 --> 1:03:05.258 | |
And then we start with an initial value for | |
our storage. | |
1:03:05.258 --> 1:03:15.917 | |
We are giving our first input and then calculating | |
the new representation. | |
1:03:16.196 --> 1:03:26.328 | |
If you look at this, it's just again a neural
network with two types of inputs: your word
1:03:26.328 --> 1:03:29.743 | |
and your initial hidden state.
1:03:30.210 --> 1:03:46.468 | |
Then you can apply it to the next input,
and you again get a new hidden state.
1:03:47.367 --> 1:03:53.306 | |
The nice thing is that you can now do this step
by step by step, all the way over the sequence.
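A minimal sketch of this step-by-step update, using a plain tanh recurrence; real implementations add an output layer on top of each state and, as discussed later, gating.

```python
import numpy as np

def rnn_step(h_prev, x, W_h, W_x, b):
    """One recurrent step: the new state mixes the previous state and the input."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

def encode(inputs, W_h, W_x, b, h0):
    """Run over the sequence; every state carries information from all earlier words."""
    h, states = h0, []
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states
```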
1:03:55.495 --> 1:04:05.245 | |
The nice thing that we are having here now
is that we are having context information from
1:04:05.245 --> 1:04:07.195 | |
all the previous words.
1:04:07.607 --> 1:04:13.582 | |
So if you look at which words you use here
to calculate the probability of this
1:04:13.582 --> 1:04:14.180 | |
word:
1:04:14.554 --> 1:04:20.128 | |
It is based on this hidden state,
1:04:20.128 --> 1:04:33.083 | |
and this hidden state was influenced
by the previous one, and so on along the whole path.
1:04:33.473 --> 1:04:37.798 | |
So now we're having something new. | |
1:04:37.798 --> 1:04:46.449 | |
We can really model the word probability not | |
only on a fixed context. | |
1:04:46.906 --> 1:04:53.570 | |
Because the hidden states we're having here in
our RNN are influenced by all the previous words.
1:04:56.296 --> 1:05:00.909 | |
So what does that mean?
1:05:00.909 --> 1:05:16.288 | |
If you're now thinking about the histories and
clustering: we said before that we need to cluster histories.
1:05:16.736 --> 1:05:24.261 | |
So we do not need to do any explicit clustering here,
and we also see how things are put together
1:05:24.261 --> 1:05:26.273 | |
in order to really do the estimation.
1:05:29.489 --> 1:05:43.433 | |
In the green box this way since we are starting | |
from the left point to the right. | |
1:05:44.524 --> 1:05:48.398 | |
And that's right, so histories are clustered in
some way.
1:05:48.398 --> 1:05:58.196 | |
Here some type of clustering is happening:
it's continuous representations, where a small
1:05:58.196 --> 1:06:02.636 | |
difference doesn't matter that much.
1:06:02.636 --> 1:06:10.845 | |
So if you have a lot of different histories, | |
the similarity. | |
1:06:11.071 --> 1:06:15.791 | |
Because in order to do the final prediction,
you only do it based on the green box.
1:06:16.156 --> 1:06:24.284 | |
So you are now again still learning some type
of clustering,
1:06:24.284 --> 1:06:30.235 | |
but you don't have to make a hard decision.
1:06:30.570 --> 1:06:39.013 | |
The only restriction you are imposing is that you
have to store everything that is important in this hidden state.
1:06:39.359 --> 1:06:54.961 | |
So it's a different type of limitation: you
no longer calculate the probability based only on the
1:06:54.961 --> 1:06:57.138 | |
last words.
1:06:57.437 --> 1:07:09.645 | |
But you still need to compress things
in order to do it efficiently.
1:07:09.970 --> 1:07:25.311 | |
But this is where things get merged together,
in this type of hidden representation, which
1:07:25.311 --> 1:07:28.038 | |
is based
1:07:28.288 --> 1:07:33.104 | |
on the previous words but is also some kind of
bottleneck in order to make a good estimation.
1:07:34.474 --> 1:07:41.242 | |
So the idea is that we can store all our history
in one vector.
1:07:41.581 --> 1:07:47.351 | |
Which is very good and makes it more strong. | |
1:07:47.351 --> 1:07:51.711 | |
Next we come to problems of that. | |
1:07:51.711 --> 1:07:57.865 | |
Of course, at some point it might be difficult. | |
1:07:58.398 --> 1:08:02.230 | |
Then maybe things get all overwritten, or | |
you cannot store everything in there. | |
1:08:02.662 --> 1:08:04.514 | |
So,
1:08:04.184 --> 1:08:10.252 | |
for short things like single
sentences that works well, but especially if
1:08:10.252 --> 1:08:16.184 | |
you think of other tasks like summarization,
or document-level MT, where you need
1:08:16.184 --> 1:08:22.457 | |
to consider a full document, these things get
a bit more complicated, and we will learn another
1:08:22.457 --> 1:08:23.071 | |
type of model.
1:08:24.464 --> 1:08:30.455 | |
Furthermore, in order to understand these
networks, it's good to always have both views.
1:08:30.710 --> 1:08:39.426 | |
So this is the unrolled view, so you have this
type of network.
1:08:39.426 --> 1:08:48.532 | |
Alternatively, it can be shown as: we have here
the output and here's your network, which is
1:08:48.532 --> 1:08:52.091 | |
connected to itself, and that is the recurrent view.
1:08:56.176 --> 1:09:11.033 | |
There is one challenge in these networks, and
that is the training; so the next question is how to train
1:09:11.033 --> 1:09:11.991 | |
them.
1:09:12.272 --> 1:09:20.147 | |
At first we don't really know how to
train them, but if you unroll them like this:
1:09:20.540 --> 1:09:38.054 | |
It's exactly the same as a feed-forward network, so you can measure your
errors and then back-propagate the errors.
1:09:38.378 --> 1:09:45.647 | |
Now the nice thing is, if you unroll something,
it's a feed-forward network and you can train it.
1:09:46.106 --> 1:09:56.493 | |
The only important thing is, of course, that for
different input lengths the unrolled network looks different; you have to take that into
1:09:56.493 --> 1:09:57.555 | |
account.
1:09:57.837 --> 1:10:07.621 | |
But since the parameters are shared, it's somehow
similar and you can train it with the same training
1:10:07.621 --> 1:10:08.817 | |
algorithm.
1:10:10.310 --> 1:10:16.113 | |
One thing which makes things difficult is | |
what is referred to as the vanishing gradient. | |
1:10:16.113 --> 1:10:21.720 | |
So we are saying there is a big advantage | |
of these models and that's why we are using | |
1:10:21.720 --> 1:10:22.111 | |
that. | |
1:10:22.111 --> 1:10:27.980 | |
The output here does not only depend on the | |
current input of a last three but on anything | |
1:10:27.980 --> 1:10:29.414 | |
that was said before. | |
1:10:29.809 --> 1:10:32.803 | |
That's a very strong property and is the motivation
for using RNNs.
1:10:33.593 --> 1:10:44.599 | |
However, if you're using standard RNNs, the influence
of words far in the past gets smaller and smaller.
1:10:44.804 --> 1:10:55.945 | |
Because the gradients get smaller and smaller,
the error here, propagated back to this early step,
1:10:55.945 --> 1:10:59.659 | |
contributes only very little to the update.
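A toy numeric illustration of this effect, assuming for simplicity a constant per-step factor on the gradient:

```python
# Back-propagating through many time steps multiplies the gradient by a
# per-step factor; with a factor below one it shrinks towards zero.
factor = 0.8          # assumed magnitude of the per-step Jacobian
grad = 1.0
for _ in range(50):
    grad *= factor
print(grad)           # ~1.4e-05: words 50 steps back barely affect the update
```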
1:11:00.020 --> 1:11:06.710 | |
And yeah, that's why standard RNNs are
difficult to train.
1:11:07.247 --> 1:11:11.481 | |
So if we are talking about RNNs nowadays:
1:11:11.791 --> 1:11:19.532 | |
What we typically mean are long short-term
memories, LSTMs.
1:11:19.532 --> 1:11:30.931 | |
You see they are by now quite old already, but
they have special gating mechanisms.
1:11:31.171 --> 1:11:41.911 | |
So in the language modeling task, for example, they can
store information like whether this sentence
1:11:41.911 --> 1:11:44.737 | |
started as a question.
1:11:44.684 --> 1:11:51.886 | |
Because if you only look at the last
five words, it's often no longer clear whether it is a
1:11:51.886 --> 1:11:52.556 | |
normal sentence or a question.
1:11:53.013 --> 1:12:06.287 | |
So there you have these gating mechanisms
in order to store things for a longer
1:12:06.287 --> 1:12:08.571 | |
time in your memory.
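A minimal sketch of such a gated update (a standard LSTM cell written from the usual textbook equations rather than from the slides); the packing of all four gate parameters into `W`, `U`, `b` is an assumption for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: gates decide what to forget, what to write, what to expose."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)                        # input, forget, output, candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # keep old memory and/or write new
    h = sigmoid(o) * np.tanh(c)                        # expose part of the memory cell
    return h, c
```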
1:12:10.730 --> 1:12:20.147 | |
They are still used in quite a lot of works.
1:12:21.541 --> 1:12:30.487 | |
Especially for text machine translation, the
standard now is to use Transformer-based models.
1:12:30.690 --> 1:12:42.857 | |
But, for example, for this type of architecture
we will have one lecture later about efficiency.
1:12:42.882 --> 1:12:53.044 | |
And there, in the decoder, parts of the networks
are still using RNNs.
1:12:53.473 --> 1:12:57.542 | |
So it's not that RNNs are of no importance.
1:12:59.239 --> 1:13:08.956 | |
In order to make them strong, there are some
more things which are helpful and should be mentioned:
1:13:09.309 --> 1:13:19.668 | |
So one thing is a very easy and nice trick
to make these neural networks stronger and better.
1:13:19.739 --> 1:13:21.619 | |
So, of course, it doesn't always work;
1:13:21.619 --> 1:13:23.451 | |
you have to have enough training data.
1:13:23.763 --> 1:13:29.583 | |
But in general, the easiest way of
making your model bigger and stronger is to
1:13:29.583 --> 1:13:30.598 | |
increase your number of parameters.
1:13:30.630 --> 1:13:43.244 | |
And you've seen that with the large language models:
people are always bragging about their size.
1:13:43.903 --> 1:13:53.657 | |
This is one way so the question is how do | |
you get more parameters? | |
1:13:53.657 --> 1:14:05.951 | |
There are two ways: you can make your representations
wider, or, and that is where deep learning gets its name,
1:14:05.951 --> 1:14:10.020 | |
you can make your networks deeper.
1:14:11.471 --> 1:14:13.831 | |
And then you also get more parameters that way.
1:14:14.614 --> 1:14:19.931 | |
There's one problem with this, with more and
deeper networks.
1:14:19.931 --> 1:14:23.330 | |
It's very similar to what we saw with RNNs.
1:14:23.603 --> 1:14:34.755 | |
With them we have this problem of gradient flow:
as it flows through many layers, the gradient gets
1:14:34.755 --> 1:14:35.475 | |
very small.
1:14:35.795 --> 1:14:41.114 | |
Exactly the same thing happens in deep networks.
1:14:41.114 --> 1:14:52.285 | |
If you take the gradient, which tells you whether the output was
right or wrong, then you're propagating it back through the layers.
1:14:52.612 --> 1:14:53.228 | |
For three layers
1:14:53.228 --> 1:14:56.440 | |
It's no problem, but if you're going to ten, | |
twenty or a hundred layers. | |
1:14:57.797 --> 1:14:59.690 | |
That is getting typically a problem. | |
1:15:00.060 --> 1:15:10.659 | |
What people are doing is they are using what are
called residual connections.
1:15:10.659 --> 1:15:15.885 | |
That's a very helpful idea.
1:15:15.956 --> 1:15:20.309 | |
And so the idea is that these layers
1:15:20.320 --> 1:15:30.694 | |
in between no longer calculate a completely
new representation, but they calculate
1:15:30.694 --> 1:15:31.386 | |
only what should change.
1:15:31.731 --> 1:15:37.585 | |
And therefore, in the end, the output of a layer
is always added to its input.
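A minimal sketch of a residual connection, with a stand-in `sublayer` function:

```python
import numpy as np

def residual_block(x, sublayer):
    """The sublayer only learns the change to its input; the identity path
    lets gradients flow unchanged through very deep stacks."""
    return x + sublayer(x)

# toy usage with a stand-in transformation
y = residual_block(np.ones(4), lambda v: np.tanh(v))
```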
1:15:38.318 --> 1:15:48.824 | |
The nice thing is that later, if you are doing
back-propagation, the gradient can flow back very fast along this path.
1:15:49.209 --> 1:16:01.896 | |
So that is what you're seeing nowadays in
very deep architectures, not only in RNNs:
1:16:01.896 --> 1:16:04.229 | |
you always have these residual connections.
1:16:04.704 --> 1:16:07.388 | |
This has two advantages.
1:16:07.388 --> 1:16:15.304 | |
On the one hand, it's easier to learn the
representation.
1:16:15.304 --> 1:16:18.792 | |
On the other hand, the gradients flow better.
1:16:22.082 --> 1:16:24.114 | |
Goods. | |
1:16:23.843 --> 1:16:31.763 | |
So much for the neural networks; the
last thing for today is this.
1:16:31.671 --> 1:16:36.400 | |
At first, neural language models were used inside the systems themselves.
1:16:36.400 --> 1:16:46.707 | |
Now we're seeing them again, but one thing
that at the beginning was very essential are the word embeddings.
1:16:46.967 --> 1:16:57.655 | |
So people really trained neural language
models only to get this type of embedding,
1:16:57.655 --> 1:17:04.166 | |
and therefore we want to look a bit more into
these.
1:17:09.229 --> 1:17:13.456 | |
Some last words on the word embeddings.
1:17:13.456 --> 1:17:22.117 | |
The interesting thing is that word embeddings | |
can be used for very different tasks. | |
1:17:22.117 --> 1:17:27.170 | |
The advantage lies in how we can train the word embeddings.
1:17:27.347 --> 1:17:31.334 | |
The nice thing is you can train them on just large
amounts of plain data.
1:17:31.931 --> 1:17:40.937 | |
And then, if you have these word embeddings,
you don't have an input layer of ten thousand dimensions any
1:17:40.937 --> 1:17:41.566 | |
more.
1:17:41.982 --> 1:17:52.231 | |
So then you can train a small model to do
any other task, and therefore you're more efficient.
1:17:52.532 --> 1:17:58.761 | |
Initial word embeddings really depend only | |
on the word itself. | |
1:17:58.761 --> 1:18:07.363 | |
If you look at the two meanings of can, the
can of beans, or can as in they can do that, both have
1:18:07.363 --> 1:18:08.747 | |
the same embedding.
1:18:09.189 --> 1:18:12.395 | |
That cannot be resolved. | |
1:18:12.395 --> 1:18:23.939 | |
Therefore, you need to know the context, and
the higher layers do take
1:18:23.939 --> 1:18:27.916 | |
the context into account, but the first embedding layer does not.
1:18:29.489 --> 1:18:33.757 | |
However, even this one has quite interesting properties.
1:18:34.034 --> 1:18:44.644 | |
People like to visualize them; that is always
a bit difficult, because if you look at this
1:18:44.644 --> 1:18:47.182 | |
word vector, or word embedding,
1:18:47.767 --> 1:18:52.879 | |
And drawing your five hundred dimensional | |
vector is still a bit challenging. | |
1:18:53.113 --> 1:19:12.464 | |
So you cannot directly do that, so what people
have to do is learn some type of dimensionality reduction.
1:19:13.073 --> 1:19:17.216 | |
And of course then yes some information gets | |
lost but you can try it. | |
1:19:18.238 --> 1:19:28.122 | |
And you see, for example, this is the most
famous and common example: what you can
1:19:28.122 --> 1:19:37.892 | |
do is look at the difference between
the male and the female form of an English word.
1:19:38.058 --> 1:19:40.389 | |
And you can do that for very different words.
1:19:40.780 --> 1:19:45.403 | |
And that is where the math comes into
it; that is what people then look into.
1:19:45.725 --> 1:19:50.995 | |
So what you can now, for example, do is you | |
can calculate the difference between man and | |
1:19:50.995 --> 1:19:51.410 | |
woman. | |
1:19:52.232 --> 1:19:56.356 | |
And what you can do then is you can take the
embedding of king.
1:19:56.356 --> 1:20:02.378 | |
You can add to it the difference between man
and woman, and here is where people get really excited.
1:20:02.378 --> 1:20:05.586 | |
Then you can look at what are the similar | |
words. | |
1:20:05.586 --> 1:20:09.252 | |
So you won't, of course, directly hit the | |
correct word. | |
1:20:09.252 --> 1:20:10.495 | |
It's a continuous space.
1:20:10.790 --> 1:20:24.062 | |
But you can look at what are the nearest neighbors
of this new vector, and often the word you are looking for is nearby.
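A minimal sketch of this nearest-neighbor lookup in an assumed `embeddings` dictionary, using the king - man + woman example:

```python
import numpy as np

def nearest(query, embeddings, exclude=()):
    """Cosine nearest neighbor; `embeddings` is an assumed dict word -> vector."""
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = float(vec @ query) / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-9)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# The classic analogy: king - man + woman should land near queen, e.g.
# query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# nearest(query, embeddings, exclude={"king", "man", "woman"})
```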
1:20:24.224 --> 1:20:33.911 | |
So it's somehow surprising that the difference
between these words is always roughly the same.
1:20:34.374 --> 1:20:37.308 | |
You can do similar things with other relations.
1:20:37.308 --> 1:20:47.520 | |
You can also imagine that this works for tenses,
like swimming and swam, or walking and
1:20:47.520 --> 1:20:49.046 | |
walked.
1:20:49.469 --> 1:20:53.040 | |
So you can try to use him. | |
1:20:53.040 --> 1:20:56.346 | |
It's no longer like say. | |
1:20:56.346 --> 1:21:04.016 | |
The interesting thing is that nobody taught the model
this principle.
1:21:04.284 --> 1:21:09.910 | |
It's purely trained on the task of doing
the next-word prediction.
1:21:10.230 --> 1:21:23.669 | |
And it even works for knowledge like capitals:
you can look at the difference between a country and its capital.
1:21:23.823 --> 1:21:33.760 | |
Here is another visualization where you have
done the same thing on that difference.
1:21:33.853 --> 1:21:41.342 | |
And you see it's not perfect, but it's pointing
in the right direction, so you can even use that for
1:21:41.342 --> 1:21:42.936 | |
question answering.
1:21:42.936 --> 1:21:50.345 | |
If you know, for three countries, the capital,
you can compute what is the difference between them.
1:21:50.345 --> 1:21:53.372 | |
You apply that to a new country, and you get a guess for its capital.
1:21:54.834 --> 1:22:02.280 | |
So these models are able to really learn a | |
lot of information and collapse this information | |
1:22:02.280 --> 1:22:04.385 | |
into this representation. | |
1:22:05.325 --> 1:22:07.679 | |
And that just from doing next-word prediction.
1:22:07.707 --> 1:22:22.358 | |
And that also explains a bit, or maybe not explains
but strongly motivates, what is the main advantage
1:22:22.358 --> 1:22:26.095 | |
of this type of neural model.
1:22:28.568 --> 1:22:46.104 | |
So to summarize what we did today, what
you should hopefully take with you is
1:22:46.104 --> 1:22:49.148 | |
how we can do language modeling with neural networks.
1:22:49.449 --> 1:22:55.445 | |
We looked at three different architectures:
we looked into the feed-forward language model,
1:22:55.445 --> 1:22:59.059 | |
the RNN, and the one based on the restricted Boltzmann machine.
1:22:59.039 --> 1:23:04.559 | |
And finally, there are different architectures
for neural networks.
1:23:04.559 --> 1:23:10.986 | |
We have seen feed-forward neural networks and
recurrent neural networks, and we'll see in the
1:23:10.986 --> 1:23:14.389 | |
next lectures the last type of architecture.
1:23:15.915 --> 1:23:17.438 | |
Any questions. | |
1:23:20.680 --> 1:23:27.360 | |
Then thanks a lot, and next time we'll
be here again to continue.