Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /lectures /Lecture-06-09.05.2023 /English.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame

73.7 kB

	WEBVTT

	0:00:01.721 --> 0:00:08.584
	Hey, then welcome to today's lecture on language
	modeling.

	0:00:09.409 --> 0:00:21.608
	We had not a different view on machine translation,
	which was the evaluation path it's important

	0:00:21.608 --> 0:00:24.249
	to evaluate and see.

	0:00:24.664 --> 0:00:33.186
	We want to continue with building the MT system
	and this will be the last part before we are

	0:00:33.186 --> 0:00:36.668
	going into a neural step on Thursday.

	0:00:37.017 --> 0:00:45.478
	So we had the the broader view on statistical
	machine translation and the.

	0:00:45.385 --> 0:00:52.977
	Thursday: A week ago we talked about the statistical
	machine translation and mainly the translation

	0:00:52.977 --> 0:00:59.355
	model, so how we model how probable is it that
	one word is translated into another.

	0:01:00.800 --> 0:01:15.583
	However, there is another component when doing
	generation tasks in general and machine translation.

	0:01:16.016 --> 0:01:23.797
	There are several characteristics which you
	only need to model on the target side in the

	0:01:23.797 --> 0:01:31.754
	traditional approach where we talked about
	the generation from more semantic or synthectic

	0:01:31.754 --> 0:01:34.902
	representation into the real world.

	0:01:35.555 --> 0:01:51.013
	And the challenge is that there's some constructs
	which are only there in the target language.

	0:01:52.132 --> 0:01:57.908
	You cannot really get that translation, but
	it's more something that needs to model on

	0:01:57.908 --> 0:01:58.704
	the target.

	0:01:59.359 --> 0:02:05.742
	And this is done typically by a language model
	and this concept of language model.

	0:02:06.326 --> 0:02:11.057
	Guess you can assume nowadays very important.

	0:02:11.057 --> 0:02:20.416
	You've read a lot about large language models
	recently and they are all somehow trained or

	0:02:20.416 --> 0:02:22.164
	the idea behind.

	0:02:25.986 --> 0:02:41.802
	What we'll look today at if get the next night
	and look what a language model is and today's

	0:02:41.802 --> 0:02:42.992
	focus.

	0:02:43.363 --> 0:02:49.188
	This was the common approach to the language
	model for twenty or thirty years, so a lot

	0:02:49.188 --> 0:02:52.101
	of time it was really the state of the art.

	0:02:52.101 --> 0:02:58.124
	And people have used that in many applications
	in machine translation and automatic speech

	0:02:58.124 --> 0:02:58.985
	recognition.

	0:02:59.879 --> 0:03:11.607
	Again you are measuring the performance, but
	this is purely the performance of the language

	0:03:11.607 --> 0:03:12.499
	model.

	0:03:13.033 --> 0:03:23.137
	And then we will see that the traditional
	language will have a major drawback in how

	0:03:23.137 --> 0:03:24.683
	we can deal.

	0:03:24.944 --> 0:03:32.422
	So if you model language you will see that
	in most of the sentences and you have not really

	0:03:32.422 --> 0:03:39.981
	seen and you're still able to assess if this
	is good language or if this is native language.

	0:03:40.620 --> 0:03:45.092
	And this is challenging if you do just like
	parameter estimation.

	0:03:45.605 --> 0:03:59.277
	We are using two different techniques to do:
	interpolation, and these are essentially in

	0:03:59.277 --> 0:04:01.735
	order to build.

	0:04:01.881 --> 0:04:11.941
	It also motivates why things might be easier
	if we are going into neural morals as we will.

	0:04:12.312 --> 0:04:18.203
	And at the end we'll talk a bit about some
	additional type of language models which are

	0:04:18.203 --> 0:04:18.605
	also.

	0:04:20.440 --> 0:04:29.459
	So where our language was used, or how are
	they used in the machine translations?

	0:04:30.010 --> 0:04:38.513
	So the idea of a language model is that we
	are modeling what is the fluency of language.

	0:04:38.898 --> 0:04:49.381
	So if you have, for example, sentence will,
	then you can estimate that there are some words:

	0:04:49.669 --> 0:05:08.929
	For example, the next word is valid, but will
	card's words not?

	0:05:09.069 --> 0:05:13.673
	And we can do that.

	0:05:13.673 --> 0:05:22.192
	We have seen that the noise channel.

	0:05:22.322 --> 0:05:33.991
	That we have seen someone two weeks ago, and
	today we will look into how can we model P

	0:05:33.991 --> 0:05:36.909
	of Y or how possible.

	0:05:37.177 --> 0:05:44.192
	Now this is completely independent of the
	translation process.

	0:05:44.192 --> 0:05:49.761
	How fluent is a sentence and how you can express?

	0:05:51.591 --> 0:06:01.699
	And this language model task has one really
	big advantage and assume that is even the big

	0:06:01.699 --> 0:06:02.935
	advantage.

	0:06:03.663 --> 0:06:16.345
	The big advantage is the data we need to train
	that so normally we are doing supervised learning.

	0:06:16.876 --> 0:06:20.206
	So machine translation will talk about.

	0:06:20.206 --> 0:06:24.867
	That means we have the source center and target
	center.

	0:06:25.005 --> 0:06:27.620
	They need to be aligned.

	0:06:27.620 --> 0:06:31.386
	We look into how we can model them.

	0:06:31.386 --> 0:06:39.270
	Generally, the problem with this is that:
	Machine translation: You still have the advantage

	0:06:39.270 --> 0:06:45.697
	that there's quite huge amounts of this data
	for many languages, not all but many, but other

	0:06:45.697 --> 0:06:47.701
	classes even more difficult.

	0:06:47.701 --> 0:06:50.879
	There's very few data where you have summary.

	0:06:51.871 --> 0:07:02.185
	So the big advantage of language model is
	we're only modeling the centers, so we only

	0:07:02.185 --> 0:07:04.103
	need pure text.

	0:07:04.584 --> 0:07:11.286
	And pure text, especially since we have the
	Internet face melting large amounts of text.

	0:07:11.331 --> 0:07:17.886
	Of course, it's still, it's still maybe only
	for some domains, some type.

	0:07:18.198 --> 0:07:23.466
	Want to have data for speech about machine
	translation.

	0:07:23.466 --> 0:07:27.040
	Maybe there's only limited data that.

	0:07:27.027 --> 0:07:40.030
	There's always and also you go to some more
	exotic languages and then you will have less

	0:07:40.030 --> 0:07:40.906
	data.

	0:07:41.181 --> 0:07:46.803
	And in language once we can now look, how
	can we make use of these data?

	0:07:47.187 --> 0:07:54.326
	And: Nowadays this is often also framed as
	self supervised learning because on the one

	0:07:54.326 --> 0:08:00.900
	hand here we'll see it's a time of classification
	cast or supervised learning but we create some

	0:08:00.900 --> 0:08:02.730
	other data science itself.

	0:08:02.742 --> 0:08:13.922
	So it's not that we have this pair of data
	text and labels, but we have only the text.

	0:08:15.515 --> 0:08:21.367
	So the question is how can we use this modeling
	data and how can we train our language?

	0:08:22.302 --> 0:08:35.086
	The main goal is to produce fluent English,
	so we want to somehow model that something

	0:08:35.086 --> 0:08:38.024
	is a sentence of a.

	0:08:38.298 --> 0:08:44.897
	So there is no clear separation about semantics
	and syntax, but in this case it is not about

	0:08:44.897 --> 0:08:46.317
	a clear separation.

	0:08:46.746 --> 0:08:50.751
	So we will monitor them somehow in there.

	0:08:50.751 --> 0:08:56.091
	There will be some notion of semantics, some
	notion of.

	0:08:56.076 --> 0:09:08.748
	Because you say you want to water how fluid
	or probable is that the native speaker is producing

	0:09:08.748 --> 0:09:12.444
	that because of the one in.

	0:09:12.512 --> 0:09:17.711
	We are rarely talking like things that are
	semantically wrong, and therefore there is

	0:09:17.711 --> 0:09:18.679
	also some type.

	0:09:19.399 --> 0:09:24.048
	So, for example, the house is small.

	0:09:24.048 --> 0:09:30.455
	It should be a higher stability than the house
	is.

	0:09:31.251 --> 0:09:38.112
	Because home and house are both meaning German,
	they are used differently.

	0:09:38.112 --> 0:09:43.234
	For example, it should be more probable that
	the plane.

	0:09:44.444 --> 0:09:51.408
	So this is both synthetically correct, but
	cementically not.

	0:09:51.408 --> 0:09:58.372
	But still you will see much more often the
	probability that.

	0:10:03.883 --> 0:10:14.315
	So more formally, it's about like the language
	should be some type of function, and it gives

	0:10:14.315 --> 0:10:18.690
	us the probability that this sentence.

	0:10:19.519 --> 0:10:27.312
	Indicating that this is good English or more
	generally English, of course you can do that.

	0:10:28.448 --> 0:10:37.609
	And earlier times people have even done try
	to do that deterministic that was especially

	0:10:37.609 --> 0:10:40.903
	used for more dialogue systems.

	0:10:40.840 --> 0:10:50.660
	You have a very strict syntax so you can only
	use like turn off the, turn off the radio.

	0:10:50.690 --> 0:10:56.928
	Something else, but you have a very strict
	deterministic finance state grammar like which

	0:10:56.928 --> 0:10:58.107
	type of phrases.

	0:10:58.218 --> 0:11:04.791
	The problem of course if we're dealing with
	language is that language is variable, we're

	0:11:04.791 --> 0:11:10.183
	not always talking correct sentences, and so
	this type of deterministic.

	0:11:10.650 --> 0:11:22.121
	That's why for already many, many years people
	look into statistical language models and try

	0:11:22.121 --> 0:11:24.587
	to model something.

	0:11:24.924 --> 0:11:35.096
	So something like what is the probability
	of the sequences of to, and that is what.

	0:11:35.495 --> 0:11:43.076
	The advantage of doing it statistically is
	that we can train large text databases so we

	0:11:43.076 --> 0:11:44.454
	can train them.

	0:11:44.454 --> 0:11:52.380
	We don't have to define it and most of these
	cases we don't want to have the hard decision.

	0:11:52.380 --> 0:11:55.481
	This is a sentence of the language.

	0:11:55.815 --> 0:11:57.914
	Why we want to have some type of probability?

	0:11:57.914 --> 0:11:59.785
	How probable is this part of the center?

	0:12:00.560 --> 0:12:04.175
	Because yeah, even for a few minutes, it's
	not always clear.

	0:12:04.175 --> 0:12:06.782
	Is this a sentence that you can use or not?

	0:12:06.782 --> 0:12:12.174
	I mean, I just in this presentation gave several
	sentences, which are not correct English.

	0:12:12.174 --> 0:12:17.744
	So it might still happen that people speak
	sentences or write sentences that I'm not correct,

	0:12:17.744 --> 0:12:19.758
	and you want to deal with all of.

	0:12:20.020 --> 0:12:25.064
	So that is then, of course, a big advantage
	if you use your more statistical models.

	0:12:25.705 --> 0:12:35.810
	The disadvantage is that you need a subtitle
	of large text databases which might exist from

	0:12:35.810 --> 0:12:37.567
	many languages.

	0:12:37.857 --> 0:12:46.511
	Nowadays you see that there is of course issues
	that you need large computational resources

	0:12:46.511 --> 0:12:47.827
	to deal with.

	0:12:47.827 --> 0:12:56.198
	You need to collect all these crawlers on
	the internet which can create enormous amounts

	0:12:56.198 --> 0:12:57.891
	of training data.

	0:12:58.999 --> 0:13:08.224
	So if we want to build this then the question
	is of course how can we estimate the probability?

	0:13:08.448 --> 0:13:10.986
	So how probable is the sentence good morning?

	0:13:11.871 --> 0:13:15.450
	And you all know basic statistics.

	0:13:15.450 --> 0:13:21.483
	So if you see this you have a large database
	of sentences.

	0:13:21.901 --> 0:13:28.003
	Made this a real example, so this was from
	the TED talks.

	0:13:28.003 --> 0:13:37.050
	I guess most of you have heard about them,
	and if you account for all many sentences,

	0:13:37.050 --> 0:13:38.523
	good morning.

	0:13:38.718 --> 0:13:49.513
	It happens so the probability of good morning
	is sweet point times to the power minus.

	0:13:50.030 --> 0:13:53.755
	Okay, so this is a very easy thing.

	0:13:53.755 --> 0:13:58.101
	We can directly model the language model.

	0:13:58.959 --> 0:14:03.489
	Does anybody see a problem why this might
	not be the final solution?

	0:14:06.326 --> 0:14:14.962
	Think we would need a folder of more sentences
	to make anything useful of this.

	0:14:15.315 --> 0:14:29.340
	Because the probability of the talk starting
	with good morning, good morning is much higher

	0:14:29.340 --> 0:14:32.084
	than ten minutes.

	0:14:33.553 --> 0:14:41.700
	In all the probability presented in this face,
	not how we usually think about it.

	0:14:42.942 --> 0:14:55.038
	The probability is even OK, but you're going
	into the right direction about the large data.

	0:14:55.038 --> 0:14:59.771
	Yes, you can't form a new sentence.

	0:15:00.160 --> 0:15:04.763
	It's about a large data, so you said it's
	hard to get enough data.

	0:15:04.763 --> 0:15:05.931
	It's impossible.

	0:15:05.931 --> 0:15:11.839
	I would say we are always saying sentences
	which have never been said and we are able

	0:15:11.839 --> 0:15:12.801
	to deal with.

	0:15:13.133 --> 0:15:25.485
	The problem with the sparsity of the data
	will have a lot of perfect English sentences.

	0:15:26.226 --> 0:15:31.338
	And this is, of course, not what we want to
	deal with.

	0:15:31.338 --> 0:15:39.332
	If we want to model that, we need to have
	a model which can really estimate how good.

	0:15:39.599 --> 0:15:47.970
	And if we are just like counting this way,
	most of it will get a zero probability, which

	0:15:47.970 --> 0:15:48.722
	is not.

	0:15:49.029 --> 0:15:56.572
	So we need to make things a bit different.

	0:15:56.572 --> 0:16:06.221
	For the models we had already some idea of
	doing that.

	0:16:06.486 --> 0:16:08.058
	And that we can do here again.

	0:16:08.528 --> 0:16:12.866
	So we can especially use the gel gel.

	0:16:12.772 --> 0:16:19.651
	The chain rule and the definition of conditional
	probability solve the conditional probability.

	0:16:19.599 --> 0:16:26.369
	Of an event B given in an event A is the probability
	of A and B divided to the probability of A.

	0:16:26.369 --> 0:16:32.720
	Yes, I recently had a exam on a manic speech
	recognition and Mister Rival said this is not

	0:16:32.720 --> 0:16:39.629
	called a chain of wood because I use this terminology
	and he said it's just applying base another.

	0:16:40.500 --> 0:16:56.684
	But this is definitely the definition of the
	condition of probability.

	0:16:57.137 --> 0:17:08.630
	The probability is defined as P of A and P
	of supposed to be divided by the one.

	0:17:08.888 --> 0:17:16.392
	And that can be easily rewritten into and
	times given.

	0:17:16.816 --> 0:17:35.279
	And the nice thing is, we can easily extend
	it, of course, into more variables so we can

	0:17:35.279 --> 0:17:38.383
	have: And so on.

	0:17:38.383 --> 0:17:49.823
	So more generally you can do that for now
	any length of sequence.

	0:17:50.650 --> 0:18:04.802
	So if we are now going back to words, we can
	model that as the probability of the sequence

	0:18:04.802 --> 0:18:08.223
	is given its history.

	0:18:08.908 --> 0:18:23.717
	Maybe it's more clear if we're looking at
	real works, so if we have pee-off, it's water

	0:18:23.717 --> 0:18:26.914
	is so transparent.

	0:18:26.906 --> 0:18:39.136
	So this way we are able to model the ability
	of the whole sentence given the sequence by

	0:18:39.136 --> 0:18:42.159
	looking at each word.

	0:18:42.762 --> 0:18:49.206
	And of course the big advantage is that each
	word occurs less often than the full sect.

	0:18:49.206 --> 0:18:54.991
	So hopefully we see that still, of course,
	the problem the word doesn't occur.

	0:18:54.991 --> 0:19:01.435
	Then this doesn't work, but let's recover
	most of the lectures today about dealing with

	0:19:01.435 --> 0:19:01.874
	this.

	0:19:02.382 --> 0:19:08.727
	So by first of all, we generally is at least
	easier as the thing we have before.

	0:19:13.133 --> 0:19:23.531
	That we really make sense easier, no, because
	those jumps get utterly long and we have central.

	0:19:23.943 --> 0:19:29.628
	Yes exactly, so when we look at the last probability
	here, we still have to have seen the full.

	0:19:30.170 --> 0:19:38.146
	So if we want a molecule of transparent, if
	water is so we have to see the food sequence.

	0:19:38.578 --> 0:19:48.061
	So in first step we didn't really have to
	have seen the full sentence.

	0:19:48.969 --> 0:19:52.090
	However, a little bit of a step nearer.

	0:19:52.512 --> 0:19:59.673
	So this is still a problem and we will never
	have seen it for all the time.

	0:20:00.020 --> 0:20:08.223
	So you can look at this if you have a vocabulary
	of words.

	0:20:08.223 --> 0:20:17.956
	Now, for example, if the average sentence
	is, you would leave to the.

	0:20:18.298 --> 0:20:22.394
	And we are quite sure we have never seen that
	much date.

	0:20:22.902 --> 0:20:26.246
	So this is, we cannot really compute this
	probability.

	0:20:26.786 --> 0:20:37.794
	However, there's a trick how we can do that
	and that's the idea between most of the language.

	0:20:38.458 --> 0:20:44.446
	So instead of saying how often does this work
	happen to exactly this history, we are trying

	0:20:44.446 --> 0:20:50.433
	to do some kind of clustering and cluster a
	lot of different histories into the same class,

	0:20:50.433 --> 0:20:55.900
	and then we are modeling the probability of
	the word given this class of histories.

	0:20:56.776 --> 0:21:06.245
	And then, of course, the big design decision
	is how to be modeled like how to cluster history.

	0:21:06.666 --> 0:21:17.330
	So how do we put all these histories together
	so that we have seen each of one off enough

	0:21:17.330 --> 0:21:18.396
	so that.

	0:21:20.320 --> 0:21:25.623
	So there is quite different types of things
	people can do.

	0:21:25.623 --> 0:21:33.533
	You can add some speech texts, you can do
	semantic words, you can model the similarity,

	0:21:33.533 --> 0:21:46.113
	you can model grammatical content, and things
	like: However, like quite often in these statistical

	0:21:46.113 --> 0:21:53.091
	models, if you have a very simple solution.

	0:21:53.433 --> 0:21:58.455
	And this is what most statistical models do.

	0:21:58.455 --> 0:22:09.616
	They are based on the so called mark of assumption,
	and that means we are assuming all this history

	0:22:09.616 --> 0:22:12.183
	is not that important.

	0:22:12.792 --> 0:22:25.895
	So we are modeling the probability of zirkins
	is so transparent that or we have maybe two

	0:22:25.895 --> 0:22:29.534
	words by having a fixed.

	0:22:29.729 --> 0:22:38.761
	So the class of all our history from word
	to word minus one is just the last two words.

	0:22:39.679 --> 0:22:45.229
	And by doing this classification, which of
	course does need any additional knowledge.

	0:22:45.545 --> 0:22:51.176
	It's very easy to calculate we have no limited
	our our histories.

	0:22:51.291 --> 0:23:00.906
	So instead of an arbitrary long one here,
	we have here only like.

	0:23:00.906 --> 0:23:10.375
	For example, if we have two grams, a lot of
	them will not occur.

	0:23:10.930 --> 0:23:20.079
	So it's a very simple trick to make all these
	classes into a few classes and motivated by,

	0:23:20.079 --> 0:23:24.905
	of course, the language the nearest things
	are.

	0:23:24.944 --> 0:23:33.043
	Like a lot of sequences, they mainly depend
	on the previous one, and things which are far

	0:23:33.043 --> 0:23:33.583
	away.

	0:23:38.118 --> 0:23:47.361
	In our product here everything is just modeled
	not by the whole history but by the last and

	0:23:47.361 --> 0:23:48.969
	minus one word.

	0:23:50.470 --> 0:23:54.322
	So and this is typically expressed by people.

	0:23:54.322 --> 0:24:01.776
	They're therefore also talking by an N gram
	language model because we are always looking

	0:24:01.776 --> 0:24:06.550
	at these chimes of N words and modeling the
	probability.

	0:24:07.527 --> 0:24:10.485
	So again start with the most simple case.

	0:24:10.485 --> 0:24:15.485
	Even extreme is the unigram case, so we're
	ignoring the whole history.

	0:24:15.835 --> 0:24:24.825
	The probability of a sequence of words is
	just the probability of each of the words in

	0:24:24.825 --> 0:24:25.548
	there.

	0:24:26.046 --> 0:24:32.129
	And therefore we are removing the whole context.

	0:24:32.129 --> 0:24:40.944
	The most probable sequence would be something
	like one of them is the.

	0:24:42.162 --> 0:24:44.694
	Most probable wordsuit by itself.

	0:24:44.694 --> 0:24:49.684
	It might not make sense, but it, of course,
	can give you a bit of.

	0:24:49.629 --> 0:24:52.682
	Intuition like which types of words should
	be more frequent.

	0:24:53.393 --> 0:25:00.012
	And if you what you can do is train such a
	button and you can just automatically generate.

	0:25:00.140 --> 0:25:09.496
	And this sequence is generated by sampling,
	so we will later come in the lecture too.

	0:25:09.496 --> 0:25:16.024
	The sampling is that you randomly pick a word
	but based on.

	0:25:16.096 --> 0:25:22.711
	So if the probability of one word is zero
	point two then you'll put it on and if another

	0:25:22.711 --> 0:25:23.157
	word.

	0:25:23.483 --> 0:25:36.996
	And if you see that you'll see here now, for
	example, it seems that these are two occurring

	0:25:36.996 --> 0:25:38.024
	posts.

	0:25:38.138 --> 0:25:53.467
	But you see there's not really any continuing
	type of structure because each word is modeled

	0:25:53.467 --> 0:25:55.940
	independently.

	0:25:57.597 --> 0:26:03.037
	This you can do better even though going to
	a biograph, so then we're having a bit of context.

	0:26:03.037 --> 0:26:08.650
	Of course, it's still very small, so the probability
	of your word of the actual word only depends

	0:26:08.650 --> 0:26:12.429
	on the previous word and all the context before
	there is ignored.

	0:26:13.133 --> 0:26:18.951
	This of course will come to that wrong, but
	it models a regular language significantly

	0:26:18.951 --> 0:26:19.486
	better.

	0:26:19.779 --> 0:26:28.094
	Seeing some things here still doesn't really
	make a lot of sense, but you're seeing some

	0:26:28.094 --> 0:26:29.682
	typical phrases.

	0:26:29.949 --> 0:26:39.619
	In this hope doesn't make sense, but in this
	issue is also frequent.

	0:26:39.619 --> 0:26:51.335
	Issue is also: Very nice is this year new
	car parking lot after, so if you have the word

	0:26:51.335 --> 0:26:53.634
	new then the word.

	0:26:53.893 --> 0:27:01.428
	Is also quite common, but new car they wouldn't
	put parking.

	0:27:01.428 --> 0:27:06.369
	Often the continuation is packing lots.

	0:27:06.967 --> 0:27:12.417
	And now it's very interesting because here
	we see the two cementic meanings of lot: You

	0:27:12.417 --> 0:27:25.889
	have a parking lot, but in general if you just
	think about the history, the most common use

	0:27:25.889 --> 0:27:27.353
	is a lot.

	0:27:27.527 --> 0:27:33.392
	So you see that he's really not using the
	context before, but he's only using the current

	0:27:33.392 --> 0:27:33.979
	context.

	0:27:38.338 --> 0:27:41.371
	So in general we can of course do that longer.

	0:27:41.371 --> 0:27:43.888
	We can do unigrams, bigrams, trigrams.

	0:27:45.845 --> 0:27:52.061
	People typically went up to four or five grams,
	and then it's getting difficult because.

	0:27:52.792 --> 0:27:56.671
	There are so many five grams that it's getting
	complicated.

	0:27:56.671 --> 0:28:02.425
	Storing all of them and storing these models
	get so big that it's no longer working, and

	0:28:02.425 --> 0:28:08.050
	of course at some point the calculation of
	the probabilities again gets too difficult,

	0:28:08.050 --> 0:28:09.213
	and each of them.

	0:28:09.429 --> 0:28:14.777
	If you have a small corpus, of course you
	will use a smaller ingram length.

	0:28:14.777 --> 0:28:16.466
	You will take a larger.

	0:28:18.638 --> 0:28:24.976
	What is important to keep in mind is that,
	of course, this is wrong.

	0:28:25.285 --> 0:28:36.608
	So we have long range dependencies, and if
	we really want to model everything in language

	0:28:36.608 --> 0:28:37.363
	then.

	0:28:37.337 --> 0:28:46.965
	So here is like one of these extreme cases,
	the computer, which has just put into the machine

	0:28:46.965 --> 0:28:49.423
	room in the slow crash.

	0:28:49.423 --> 0:28:55.978
	Like somehow, there is a dependency between
	computer and crash.

	0:28:57.978 --> 0:29:10.646
	However, in most situations these are typically
	rare and normally most important things happen

	0:29:10.646 --> 0:29:13.446
	in the near context.

	0:29:15.495 --> 0:29:28.408
	But of course it's important to keep that
	in mind that you can't model the thing so you

	0:29:28.408 --> 0:29:29.876
	can't do.

	0:29:33.433 --> 0:29:50.200
	The next question is again how can we train
	so we have to estimate these probabilities.

	0:29:51.071 --> 0:30:00.131
	And the question is how we do that, and again
	the most simple thing.

	0:30:00.440 --> 0:30:03.168
	The thing is exactly what's maximum legal
	destination.

	0:30:03.168 --> 0:30:12.641
	What gives you the right answer is: So how
	probable is that the word is following minus

	0:30:12.641 --> 0:30:13.370
	one?

	0:30:13.370 --> 0:30:20.946
	You just count how often does this sequence
	happen?

	0:30:21.301 --> 0:30:28.165
	So guess this is what most of you would have
	intuitively done, and this also works best.

	0:30:28.568 --> 0:30:39.012
	So it's not a complicated train, so you once
	have to go over your corpus, you have to count

	0:30:39.012 --> 0:30:48.662
	our diagrams and unigrams, and then you can
	directly train the basic language model.

	0:30:49.189 --> 0:30:50.651
	Who is it difficult?

	0:30:50.651 --> 0:30:58.855
	There are two difficulties: The basic language
	well doesn't work that well because of zero

	0:30:58.855 --> 0:31:03.154
	counts and how we address that and the second.

	0:31:03.163 --> 0:31:13.716
	Because we saw that especially if you go for
	larger you have to store all these engrams

	0:31:13.716 --> 0:31:15.275
	efficiently.

	0:31:17.697 --> 0:31:21.220
	So how we can do that?

	0:31:21.220 --> 0:31:24.590
	Here's some examples.

	0:31:24.590 --> 0:31:33.626
	For example, if you have the sequence your
	training curve.

	0:31:33.713 --> 0:31:41.372
	You see that the word happens, ascends the
	star and the sequence happens two times.

	0:31:42.182 --> 0:31:45.651
	We have three times.

	0:31:45.651 --> 0:31:58.043
	The same starts as the probability is to thirds
	and the other probability.

	0:31:58.858 --> 0:32:09.204
	Here we have what is following so you have
	twice and once do so again two thirds and one.

	0:32:09.809 --> 0:32:20.627
	And this is all that you need to know here
	about it, so you can do this calculation.

	0:32:23.723 --> 0:32:35.506
	So the question then, of course, is what do
	we really learn in these types of models?

	0:32:35.506 --> 0:32:45.549
	Here are examples from the Europycopterus:
	The green, the red, and the blue, and here

	0:32:45.549 --> 0:32:48.594
	you have the probabilities which is the next.

	0:32:48.989 --> 0:33:01.897
	That there is a lot more than just like the
	syntax because the initial phrase is all the

	0:33:01.897 --> 0:33:02.767
	same.

	0:33:03.163 --> 0:33:10.132
	For example, you see the green paper in the
	green group.

	0:33:10.132 --> 0:33:16.979
	It's more European palaman, the red cross,
	which is by.

	0:33:17.197 --> 0:33:21.777
	What you also see that it's like sometimes
	Indian, sometimes it's more difficult.

	0:33:22.302 --> 0:33:28.345
	So, for example, following the rats, in one
	hundred cases it was a red cross.

	0:33:28.668 --> 0:33:48.472
	So it seems to be easier to guess the next
	word.

	0:33:48.528 --> 0:33:55.152
	So there is different types of information
	coded in that you also know that I guess sometimes

	0:33:55.152 --> 0:33:58.675
	you directly know all the speakers will continue.

	0:33:58.675 --> 0:34:04.946
	It's not a lot of new information in the next
	word, but in other cases like blue there's

	0:34:04.946 --> 0:34:06.496
	a lot of information.

	0:34:11.291 --> 0:34:14.849
	Another example is this Berkeley restaurant
	sentences.

	0:34:14.849 --> 0:34:21.059
	It's collected at Berkeley and you have sentences
	like can you tell me about any good spaghetti

	0:34:21.059 --> 0:34:21.835
	restaurant.

	0:34:21.835 --> 0:34:27.463
	Big price title is what I'm looking for so
	it's more like a dialogue system and people

	0:34:27.463 --> 0:34:31.215
	have collected this data and of course you
	can also look.

	0:34:31.551 --> 0:34:46.878
	Into this and get the counts, so you count
	the vibrants in the top, so the color is the.

	0:34:49.409 --> 0:34:52.912
	This is a bigram which is the first word of
	West.

	0:34:52.912 --> 0:34:54.524
	This one fuzzy is one.

	0:34:56.576 --> 0:35:12.160
	One because want to hyperability, but want
	a lot less, and there where you see it, for

	0:35:12.160 --> 0:35:17.004
	example: So here you see after I want.

	0:35:17.004 --> 0:35:23.064
	It's very often for I eat, but an island which
	is not just.

	0:35:27.347 --> 0:35:39.267
	The absolute counts of how often each road
	occurs, and then you can see here the probabilities

	0:35:39.267 --> 0:35:40.145
	again.

	0:35:42.422 --> 0:35:54.519
	Then do that if you want to do iwan Dutch
	food you get the sequence you have to multiply

	0:35:54.519 --> 0:35:55.471
	olive.

	0:35:55.635 --> 0:36:00.281
	And then you of course get a bit of interesting
	experience on that.

	0:36:00.281 --> 0:36:04.726
	For example: Information is there.

	0:36:04.726 --> 0:36:15.876
	So, for example, if you compare I want Dutch
	or I want Chinese, it seems that.

	0:36:16.176 --> 0:36:22.910
	That the sentence often starts with eye.

	0:36:22.910 --> 0:36:31.615
	You have it after two is possible, but after
	one it.

	0:36:31.731 --> 0:36:39.724
	And you cannot say want, but you have to say
	want to spend, so there's grammical information.

	0:36:40.000 --> 0:36:51.032
	To main information and source: Here before
	we're going into measuring quality, is there

	0:36:51.032 --> 0:36:58.297
	any questions about language model and the
	idea of modeling?

	0:37:02.702 --> 0:37:13.501
	Hope that doesn't mean everybody sleeping,
	and so when we're doing the training these

	0:37:13.501 --> 0:37:15.761
	language models,.

	0:37:16.356 --> 0:37:26.429
	You need to model what is the engrum length
	should we use a trigram or a forkrum.

	0:37:27.007 --> 0:37:34.040
	So in order to decide how can you now decide
	which of the two models are better?

	0:37:34.914 --> 0:37:40.702
	And if you would have to do that, how would
	you decide taking language model or taking

	0:37:40.702 --> 0:37:41.367
	language?

	0:37:43.263 --> 0:37:53.484
	I take some test text and see which model
	assigns a higher probability to me.

	0:37:54.354 --> 0:38:03.978
	It's very good, so that's even the second
	thing, so the first thing maybe would have

	0:38:03.978 --> 0:38:04.657
	been.

	0:38:05.925 --> 0:38:12.300
	The problem is the and then you take the language
	language language and machine translation.

	0:38:13.193 --> 0:38:18.773
	Problems: First of all you have to build a
	whole system which is very time consuming and

	0:38:18.773 --> 0:38:21.407
	it might not only depend on the language.

	0:38:21.407 --> 0:38:24.730
	On the other hand, that's of course what the
	end is.

	0:38:24.730 --> 0:38:30.373
	The end want and the pressure will model each
	component individually or do you want to do

	0:38:30.373 --> 0:38:31.313
	an end to end.

	0:38:31.771 --> 0:38:35.463
	What can also happen is you'll see your metric
	model.

	0:38:35.463 --> 0:38:41.412
	This is a very good language model, but it
	somewhat doesn't really work well with your

	0:38:41.412 --> 0:38:42.711
	translation model.

	0:38:43.803 --> 0:38:49.523
	But of course it's very good to also have
	this type of intrinsic evaluation where the

	0:38:49.523 --> 0:38:52.116
	assumption should be as a pointed out.

	0:38:52.116 --> 0:38:57.503
	If we have Good English it shouldn't be a
	high probability and it's bad English.

	0:38:58.318 --> 0:39:07.594
	And this is measured by the take a held out
	data set, so some data which you don't train

	0:39:07.594 --> 0:39:12.596
	on then calculate the probability of this data.

	0:39:12.912 --> 0:39:26.374
	Then you're just looking at the language model
	and you take the language model.

	0:39:27.727 --> 0:39:33.595
	You're not directly using the probability,
	but you're taking the perplexity.

	0:39:33.595 --> 0:39:40.454
	The perplexity is due to the power of the
	cross entropy, and you see in the cross entropy

	0:39:40.454 --> 0:39:46.322
	you're doing something like an average probability
	of always coming to this.

	0:39:46.846 --> 0:39:54.721
	Not so how exactly is that define perplexity
	is typically what people refer to all across.

	0:39:54.894 --> 0:40:02.328
	The cross edge is negative and average, and
	then you have the lock of the probability of

	0:40:02.328 --> 0:40:03.246
	the whole.

	0:40:04.584 --> 0:40:10.609
	We are modeling this probability as the product
	of each of the words.

	0:40:10.609 --> 0:40:18.613
	That's how the end gram was defined and now
	you hopefully can remember the rules of logarism

	0:40:18.613 --> 0:40:23.089
	so you can get the probability within the logarism.

	0:40:23.063 --> 0:40:31.036
	The sum here so the cross entry is minus one
	by two by n, and the sum of all your words

	0:40:31.036 --> 0:40:35.566
	and the lowerism of the probability of each
	word.

	0:40:36.176 --> 0:40:39.418
	And then the perplexity is just like two to
	the power.

	0:40:41.201 --> 0:40:44.706
	Why can this be interpreted as a branching
	factor?

	0:40:44.706 --> 0:40:50.479
	So it gives you a bit like the average thing,
	like how many possibilities you have.

	0:40:51.071 --> 0:41:02.249
	You have a digit task and you have no idea,
	but the probability of the next digit is like

	0:41:02.249 --> 0:41:03.367
	one ten.

	0:41:03.783 --> 0:41:09.354
	And if you then take a later perplexity, it
	will be exactly ten.

	0:41:09.849 --> 0:41:24.191
	And that is like this perplexity gives you
	a million interpretations, so how much randomness

	0:41:24.191 --> 0:41:27.121
	is still in there?

	0:41:27.307 --> 0:41:32.433
	Of course, now it's good to have a lower perplexity.

	0:41:32.433 --> 0:41:36.012
	We have less ambiguity in there and.

	0:41:35.976 --> 0:41:48.127
	If you have a hundred words and you only have
	to uniformly compare it to ten different, so

	0:41:48.127 --> 0:41:49.462
	you have.

	0:41:49.609 --> 0:41:53.255
	Yes, think so it should be.

	0:41:53.255 --> 0:42:03.673
	You had here logarism and then to the power
	and that should then be eliminated.

	0:42:03.743 --> 0:42:22.155
	So which logarism you use is not that important
	because it's a constant factor to reformulate.

	0:42:23.403 --> 0:42:28.462
	Yes and Yeah So the Best.

	0:42:31.931 --> 0:42:50.263
	The best model is always like you want to
	have a high probability.

	0:42:51.811 --> 0:43:04.549
	Time you see here, so here the probabilities
	would like to commend the rapporteur on his

	0:43:04.549 --> 0:43:05.408
	work.

	0:43:05.285 --> 0:43:14.116
	You have then locked two probabilities and
	then the average, so this is not the perplexity

	0:43:14.116 --> 0:43:18.095
	but the cross entropy as mentioned here.

	0:43:18.318 --> 0:43:26.651
	And then due to the power of that we'll give
	you the perplexity of the center.

	0:43:29.329 --> 0:43:40.967
	And these metrics of perplexity are essential
	in modeling that and we'll also see nowadays.

	0:43:41.121 --> 0:43:47.898
	You also measure like equality often in perplexity
	or cross entropy, which gives you how good

	0:43:47.898 --> 0:43:50.062
	is it in estimating the same.

	0:43:50.010 --> 0:43:53.647
	The better the model is, the more information
	you have about this.

	0:43:55.795 --> 0:44:03.106
	Talked about isomic ability or quit sentences,
	but don't most have to any much because.

	0:44:03.463 --> 0:44:12.512
	You are doing that in this way implicitly
	because of the correct word.

	0:44:12.512 --> 0:44:19.266
	If you are modeling this one, the sun over
	all next.

	0:44:20.020 --> 0:44:29.409
	Therefore, you have that implicitly in there
	because in each position you're modeling the

	0:44:29.409 --> 0:44:32.957
	probability of this witch behind.

	0:44:35.515 --> 0:44:43.811
	You have a very large number of negative examples
	because all the possible extensions which are

	0:44:43.811 --> 0:44:49.515
	not there are incorrect, which of course might
	also be a problem.

	0:44:52.312 --> 0:45:00.256
	And the biggest challenge of these types of
	models is how to model unseen events.

	0:45:00.840 --> 0:45:04.973
	So that can be unknown words or it can be
	unknown vibrants.

	0:45:05.245 --> 0:45:10.096
	So that's important also like you've seen
	all the words.

	0:45:10.096 --> 0:45:17.756
	But if you have a bigram language model, if
	you haven't seen the bigram, you'll still get

	0:45:17.756 --> 0:45:23.628
	a zero probability because we know that the
	bigram's divided by the.

	0:45:24.644 --> 0:45:35.299
	If you have unknown words, the problem gets
	even bigger because one word typically causes

	0:45:35.299 --> 0:45:37.075
	a lot of zero.

	0:45:37.217 --> 0:45:41.038
	So if you, for example, if your vocabulary
	is go to and care it,.

	0:45:41.341 --> 0:45:43.467
	And you have not a sentence.

	0:45:43.467 --> 0:45:47.941
	I want to pay a T, so you have one word, which
	is here 'an'.

	0:45:47.887 --> 0:45:54.354
	It is unknow then you have the proper.

	0:45:54.354 --> 0:46:02.147
	It is I get a sentence star and sentence star.

	0:46:02.582 --> 0:46:09.850
	To model this probability you always have
	to take the account from these sequences divided

	0:46:09.850 --> 0:46:19.145
	by: Since when does it occur, all of these
	angrams can also occur because of the word

	0:46:19.145 --> 0:46:19.961
	middle.

	0:46:20.260 --> 0:46:27.800
	So all of these probabilities are directly
	zero.

	0:46:27.800 --> 0:46:33.647
	You see that just by having a single.

	0:46:34.254 --> 0:46:47.968
	Tells you it might not always be better to
	have larger grams because if you have a gram

	0:46:47.968 --> 0:46:50.306
	language more.

	0:46:50.730 --> 0:46:57.870
	So sometimes it's better to have a smaller
	angram counter because the chances that you're

	0:46:57.870 --> 0:47:00.170
	seeing the angram is higher.

	0:47:00.170 --> 0:47:07.310
	On the other hand, you want to have a larger
	account because the larger the count is, the

	0:47:07.310 --> 0:47:09.849
	longer the context is modeling.

	0:47:10.670 --> 0:47:17.565
	So how can we address this type of problem?

	0:47:17.565 --> 0:47:28.064
	We address this type of problem by somehow
	adjusting our accounts.

	0:47:29.749 --> 0:47:40.482
	We have often, but most of your entries in
	the table are zero, and if one of these engrams

	0:47:40.482 --> 0:47:45.082
	occurs you'll have a zero probability.

	0:47:46.806 --> 0:48:06.999
	So therefore we need to find some of our ways
	in order to estimate this type of event because:

	0:48:07.427 --> 0:48:11.619
	So there are different ways of how to model
	it and how to adjust it.

	0:48:11.619 --> 0:48:15.326
	The one I hear is to do smoocing and that's
	the first thing.

	0:48:15.326 --> 0:48:20.734
	So in smoocing you're saying okay, we take
	a bit of the probability we have to our scene

	0:48:20.734 --> 0:48:23.893
	events and distribute this thing we're taking
	away.

	0:48:23.893 --> 0:48:26.567
	We're distributing to all the other events.

	0:48:26.946 --> 0:48:33.927
	The nice thing is in this case oh now each
	event has a non zero probability and that is

	0:48:33.927 --> 0:48:39.718
	of course very helpful because we don't have
	zero probabilities anymore.

	0:48:40.180 --> 0:48:48.422
	It smoothed out, but at least you have some
	kind of probability everywhere, so you take

	0:48:48.422 --> 0:48:50.764
	some of the probability.

	0:48:53.053 --> 0:49:05.465
	You can also do that more here when you have
	the endgram, for example, and this is your

	0:49:05.465 --> 0:49:08.709
	original distribution.

	0:49:08.648 --> 0:49:15.463
	Then you are taking some mass away from here
	and distributing this mass to all the other

	0:49:15.463 --> 0:49:17.453
	words that you have seen.

	0:49:18.638 --> 0:49:26.797
	And thereby you are now making sure that it's
	yeah, that it's now possible to model that.

	0:49:28.828 --> 0:49:36.163
	The other idea we're coming into more detail
	on how we can do this type of smoking, but

	0:49:36.163 --> 0:49:41.164
	one other idea you can do is to do some type
	of clustering.

	0:49:41.501 --> 0:49:48.486
	And that means if we are can't model go Kit's,
	for example because we haven't seen that.

	0:49:49.349 --> 0:49:56.128
	Then we're just looking at the full thing
	and we're just going to live directly how probable.

	0:49:56.156 --> 0:49:58.162
	Go two ways or so.

	0:49:58.162 --> 0:50:09.040
	Then we are modeling just only the word interpolation
	where you're interpolating all the probabilities

	0:50:09.040 --> 0:50:10.836
	and thereby can.

	0:50:11.111 --> 0:50:16.355
	These are the two things which are helpful
	in order to better calculate all these types.

	0:50:19.499 --> 0:50:28.404
	Let's start with what counts news so the idea
	is okay.

	0:50:28.404 --> 0:50:38.119
	We have not seen an event and then the probability
	is zero.

	0:50:38.618 --> 0:50:50.902
	It's not that high, but you should always
	be aware that there might be new things happening

	0:50:50.902 --> 0:50:55.308
	and somehow be able to estimate.

	0:50:56.276 --> 0:50:59.914
	So the idea is okay.

	0:50:59.914 --> 0:51:09.442
	We can also assign a positive probability
	to a higher.

	0:51:10.590 --> 0:51:23.233
	We are changing so currently we worked on
	imperial accounts so how often we have seen

	0:51:23.233 --> 0:51:25.292
	the accounts.

	0:51:25.745 --> 0:51:37.174
	And now we are going on to expect account
	how often this would occur in an unseen.

	0:51:37.517 --> 0:51:39.282
	So we are directly trying to model that.

	0:51:39.859 --> 0:51:45.836
	Of course, the empirical accounts are a good
	starting point, so if you've seen the world

	0:51:45.836 --> 0:51:51.880
	very often in your training data, it's a good
	estimation of how often you would see it in

	0:51:51.880 --> 0:51:52.685
	the future.

	0:51:52.685 --> 0:51:58.125
	However, it might make sense to think about
	it only because you haven't seen it.

	0:51:58.578 --> 0:52:10.742
	So does anybody have a very simple idea how
	you start with smoothing it?

	0:52:10.742 --> 0:52:15.241
	What count would you give?

	0:52:21.281 --> 0:52:32.279
	Now you have the probability to calculation
	how often have you seen the biogram with zero

	0:52:32.279 --> 0:52:33.135
	count.

	0:52:33.193 --> 0:52:39.209
	So what count would you give in order to still
	do this calculation?

	0:52:39.209 --> 0:52:41.509
	We have to smooth, so we.

	0:52:44.884 --> 0:52:52.151
	We could clump together all the rare words,
	for example everywhere we have only seen ones.

	0:52:52.652 --> 0:52:56.904
	And then just we can do the massive moment
	of those and don't.

	0:52:56.936 --> 0:53:00.085
	So remove the real ones.

	0:53:00.085 --> 0:53:06.130
	Yes, and then every unseen word is one of
	them.

	0:53:06.130 --> 0:53:13.939
	Yeah, but it's not only about unseen words,
	it's even unseen.

	0:53:14.874 --> 0:53:20.180
	You can even start easier and that's what
	people do at the first thing.

	0:53:20.180 --> 0:53:22.243
	That's at one smooth thing.

	0:53:22.243 --> 0:53:28.580
	You'll see it's not working good but the variation
	works fine and we're just as here.

	0:53:28.580 --> 0:53:30.644
	We've seen everything once.

	0:53:31.771 --> 0:53:39.896
	That's similar to this because you're clustering
	the one and the zero together and you just

	0:53:39.896 --> 0:53:45.814
	say you've seen everything once or have seen
	them twice and so on.

	0:53:46.386 --> 0:53:53.249
	And if you've done that wow, there's no probability
	because each event has happened once.

	0:53:55.795 --> 0:54:02.395
	If you otherwise have seen the bigram five
	times, you would not now do five times but

	0:54:02.395 --> 0:54:03.239
	six times.

	0:54:03.363 --> 0:54:09.117
	So the nice thing is to have seen everything.

	0:54:09.117 --> 0:54:19.124
	Once the probability of the engrap is now
	out, you have seen it divided by the.

	0:54:20.780 --> 0:54:23.763
	How long ago there's one big big problem with
	it?

	0:54:24.064 --> 0:54:38.509
	Just imagine that you have a vocabulary of
	words, and you have a corpus of thirty million

	0:54:38.509 --> 0:54:39.954
	bigrams.

	0:54:39.954 --> 0:54:42.843
	So if you have a.

	0:54:43.543 --> 0:54:46.580
	Simple Things So You've Seen Them Thirty Million
	Times.

	0:54:47.247 --> 0:54:49.818
	That is your count, your distributing.

	0:54:49.818 --> 0:54:55.225
	According to your gain, the problem is yet
	how many possible bigrams do you have?

	0:54:55.225 --> 0:55:00.895
	You have seven point five billion possible
	bigrams, and each of them you are counting

	0:55:00.895 --> 0:55:04.785
	now as give up your ability, like you give
	account of one.

	0:55:04.785 --> 0:55:07.092
	So each of them is saying a curse.

	0:55:07.627 --> 0:55:16.697
	Then this number of possible vigrams is many
	times larger than the number you really see.

	0:55:17.537 --> 0:55:21.151
	You're mainly doing equal distribution.

	0:55:21.151 --> 0:55:26.753
	Everything gets the same because this is much
	more important.

	0:55:26.753 --> 0:55:31.541
	Most of your probability mass is used for
	smoothing.

	0:55:32.412 --> 0:55:37.493
	Because most of the probability miles have
	to be distributed that you at least give every

	0:55:37.493 --> 0:55:42.687
	biogram at least a count of one, and the other
	counts are only the thirty million, so seven

	0:55:42.687 --> 0:55:48.219
	point five billion counts go to like a distribute
	around all the engrons, and only thirty million

	0:55:48.219 --> 0:55:50.026
	are according to your frequent.

	0:55:50.210 --> 0:56:02.406
	So you put a lot too much mass on your smoothing
	and you're doing some kind of extreme smoothing.

	0:56:02.742 --> 0:56:08.986
	So that of course is a bit bad then and will
	give you not the best performance.

	0:56:10.130 --> 0:56:16.160
	However, there's a nice thing and that means
	to do probability calculations.

	0:56:16.160 --> 0:56:21.800
	We are doing it based on counts, but to do
	this division we don't need.

	0:56:22.302 --> 0:56:32.112
	So we can also do that with floating point
	values and there is still a valid type of calculation.

	0:56:32.392 --> 0:56:39.380
	So we can have less probability mass to unseen
	events.

	0:56:39.380 --> 0:56:45.352
	We don't have to give one because if we count.

	0:56:45.785 --> 0:56:50.976
	But to do our calculation we can also give
	zero point zero to something like that, so

	0:56:50.976 --> 0:56:56.167
	very small value, and thereby we have less
	value on the smooth thing, and we are more

	0:56:56.167 --> 0:56:58.038
	focusing on the actual corpus.

	0:56:58.758 --> 0:57:03.045
	And that is what people refer to as Alpha
	Smoozing.

	0:57:03.223 --> 0:57:12.032
	You see that we are now adding not one to
	it but only alpha, and then we are giving less

	0:57:12.032 --> 0:57:19.258
	probability to the unseen event and more probability
	to the really seen.

	0:57:20.780 --> 0:57:24.713
	Questions: Of course, how do you find see
	also?

	0:57:24.713 --> 0:57:29.711
	I'm here to either use some help out data
	and optimize them.

	0:57:30.951 --> 0:57:35.153
	So what what does it now really mean?

	0:57:35.153 --> 0:57:40.130
	This gives you a bit of an idea behind that.

	0:57:40.700 --> 0:57:57.751
	So here you have the grams which occur one
	time, for example all grams which occur one.

	0:57:57.978 --> 0:58:10.890
	So, for example, that means that if you have
	engrams which occur one time, then.

	0:58:11.371 --> 0:58:22.896
	If you look at all the engrams which occur
	two times, then they occur.

	0:58:22.896 --> 0:58:31.013
	If you look at the engrams that occur zero,
	then.

	0:58:32.832 --> 0:58:46.511
	So if you are now doing the smoothing you
	can look what is the probability estimating

	0:58:46.511 --> 0:58:47.466
	them.

	0:58:47.847 --> 0:59:00.963
	You see that for all the endbreaks you heavily
	underestimate how often they occur in the test

	0:59:00.963 --> 0:59:01.801
	card.

	0:59:02.002 --> 0:59:10.067
	So what you want is very good to estimate
	this distribution, so for each Enron estimate

	0:59:10.067 --> 0:59:12.083
	quite well how often.

	0:59:12.632 --> 0:59:16.029
	You're quite bad at that for all of them.

	0:59:16.029 --> 0:59:22.500
	You're apparently underestimating only for
	the top ones which you haven't seen.

	0:59:22.500 --> 0:59:24.845
	You'll heavily overestimate.

	0:59:25.645 --> 0:59:30.887
	If you're doing alpha smoothing and optimize
	that to fit on the zero count because that's

	0:59:30.887 --> 0:59:36.361
	not completely fair because this alpha is now
	optimizes the test counter, you see that you're

	0:59:36.361 --> 0:59:37.526
	doing a lot better.

	0:59:37.526 --> 0:59:42.360
	It's not perfect, but you're a lot better
	in estimating how often they will occur.

	0:59:45.545 --> 0:59:49.316
	So this is one idea of doing it.

	0:59:49.316 --> 0:59:57.771
	Of course there's other ways and this is like
	a large research direction.

	0:59:58.318 --> 1:00:03.287
	So there is this needed estimation.

	1:00:03.287 --> 1:00:11.569
	What you are doing is filling your trading
	data into parts.

	1:00:11.972 --> 1:00:19.547
	Looking at how many engrams occur exactly
	are types, which engrams occur are times in

	1:00:19.547 --> 1:00:20.868
	your training.

	1:00:21.281 --> 1:00:27.716
	And then you look for these ones.

	1:00:27.716 --> 1:00:36.611
	How often do they occur in your training data?

	1:00:36.611 --> 1:00:37.746
	It's.

	1:00:38.118 --> 1:00:45.214
	And then you say oh this engram, the expector
	counts how often will see.

	1:00:45.214 --> 1:00:56.020
	It is divided by: Some type of clustering
	you're putting all the engrams which occur

	1:00:56.020 --> 1:01:04.341
	are at times in your data together and in order
	to estimate how often.

	1:01:05.185 --> 1:01:12.489
	And if you do half your data related to your
	final estimation by just using those statistics,.

	1:01:14.014 --> 1:01:25.210
	So this is called added estimation, and thereby
	you are not able to estimate better how often

	1:01:25.210 --> 1:01:25.924
	does.

	1:01:28.368 --> 1:01:34.559
	And again we can do the same look and compare
	it to the expected counts.

	1:01:34.559 --> 1:01:37.782
	Again we have exactly the same table.

	1:01:38.398 --> 1:01:47.611
	So then we're having to hear how many engrams
	that does exist.

	1:01:47.611 --> 1:01:55.361
	So, for example, there's like engrams which
	you can.

	1:01:55.835 --> 1:02:08.583
	Then you look into your other half and how
	often do these N grams occur in your 2nd part

	1:02:08.583 --> 1:02:11.734
	of the training data?

	1:02:12.012 --> 1:02:22.558
	For example, an unseen N gram I expect to
	occur, an engram which occurs one time.

	1:02:22.558 --> 1:02:25.774
	I expect that it occurs.

	1:02:27.527 --> 1:02:42.564
	Yeah, the number of zero counts are if take
	my one grams and then just calculate how many

	1:02:42.564 --> 1:02:45.572
	possible bigrams.

	1:02:45.525 --> 1:02:50.729
	Yes, so in this case we are now not assuming
	about having a more larger cattle because then,

	1:02:50.729 --> 1:02:52.127
	of course, it's getting.

	1:02:52.272 --> 1:02:54.730
	So you're doing that given the current gram.

	1:02:54.730 --> 1:03:06.057
	The cavalry is better to: So yeah, there's
	another problem in how to deal with them.

	1:03:06.057 --> 1:03:11.150
	This is more about how to smuse the engram
	counts to also deal.

	1:03:14.394 --> 1:03:18.329
	Certainly as I Think The.

	1:03:18.198 --> 1:03:25.197
	Yes, the last idea of doing is so called good
	cheering, and and the I hear here is in it

	1:03:25.197 --> 1:03:32.747
	similar, so there is a typical mathematic approve,
	but you can show that a very good estimation

	1:03:32.747 --> 1:03:34.713
	for the expected counts.

	1:03:34.654 --> 1:03:42.339
	Is that you take the number of engrams which
	occur one time more divided by the number of

	1:03:42.339 --> 1:03:46.011
	engram which occur R times and R plus one.

	1:03:46.666 --> 1:03:49.263
	So this is then the estimation of.

	1:03:49.549 --> 1:04:05.911
	So if you are looking now at an engram which
	occurs times then you are looking at how many

	1:04:05.911 --> 1:04:08.608
	engrams occur.

	1:04:09.009 --> 1:04:18.938
	It's very simple, so in this one you only
	have to count all the bigrams, how many different

	1:04:18.938 --> 1:04:23.471
	bigrams out there, and that is very good.

	1:04:23.903 --> 1:04:33.137
	So if you are saying now about end drums which
	occur or times,.

	1:04:33.473 --> 1:04:46.626
	It might be that there are some occurring
	times, but no times, and then.

	1:04:46.866 --> 1:04:54.721
	So what you normally do is you are doing for
	small R, and for large R you do some curve

	1:04:54.721 --> 1:04:55.524
	fitting.

	1:04:56.016 --> 1:05:07.377
	In general this type of smoothing is important
	for engrams which occur rarely.

	1:05:07.377 --> 1:05:15.719
	If an engram occurs so this is more important
	for events.

	1:05:17.717 --> 1:05:25.652
	So here again you see you have the counts
	and then based on that you get the adjusted

	1:05:25.652 --> 1:05:26.390
	counts.

	1:05:26.390 --> 1:05:34.786
	This is here and if you compare it's a test
	count you see that it really works quite well.

	1:05:35.035 --> 1:05:41.093
	But for the low numbers it's a very good modeling
	of how much how good this works.

	1:05:45.005 --> 1:05:50.018
	Then, of course, the question is how good
	does it work in language modeling?

	1:05:50.018 --> 1:05:51.516
	We also want tomorrow.

	1:05:52.372 --> 1:05:54.996
	We can measure that perplexity.

	1:05:54.996 --> 1:05:59.261
	We learned that before and then we have everyone's.

	1:05:59.579 --> 1:06:07.326
	You saw that a lot of too much probability
	mass is put to the events which have your probability.

	1:06:07.667 --> 1:06:11.098
	Then you have an alpha smoothing.

	1:06:11.098 --> 1:06:16.042
	Here's a start because it's not completely
	fair.

	1:06:16.042 --> 1:06:20.281
	The alpha was maximized on the test data.

	1:06:20.480 --> 1:06:25.904
	But you see that like the leaded estimation
	of the touring gives you a similar performance.

	1:06:26.226 --> 1:06:29.141
	So they seem to really work quite well.

	1:06:32.232 --> 1:06:41.552
	So this is about all assigning probability
	mass to aimed grams, which we have not seen

	1:06:41.552 --> 1:06:50.657
	in order to also estimate their probability
	before we're going to the interpolation.

	1:06:55.635 --> 1:07:00.207
	Good, so now we have.

	1:07:00.080 --> 1:07:11.818
	Done this estimation, and the problem is we
	have this general.

	1:07:11.651 --> 1:07:19.470
	We want to have a longer context because we
	can model longer than language better because

	1:07:19.470 --> 1:07:21.468
	long range dependency.

	1:07:21.701 --> 1:07:26.745
	On the other hand, we have limited data so
	we want to have stored angrums because they

	1:07:26.745 --> 1:07:28.426
	reach angrums at first more.

	1:07:29.029 --> 1:07:43.664
	And about the smooth thing in the discounting
	we did before, it always treats all angrams.

	1:07:44.024 --> 1:07:46.006
	So we didn't really look at the end drums.

	1:07:46.006 --> 1:07:48.174
	They were all classed into how often they
	are.

	1:07:49.169 --> 1:08:00.006
	However, sometimes this might not be very
	helpful, so for example look at the engram

	1:08:00.006 --> 1:08:06.253
	Scottish beer drinkers and Scottish beer eaters.

	1:08:06.686 --> 1:08:12.037
	Because we have not seen the trigram, so you
	will estimate the trigram probability by the

	1:08:12.037 --> 1:08:14.593
	probability you assign to the zero county.

	1:08:15.455 --> 1:08:26.700
	However, if you look at the background probability
	that you might have seen and might be helpful,.

	1:08:26.866 --> 1:08:34.538
	So be a drinker is more probable to see than
	Scottish be a drinker, and be a drinker should

	1:08:34.538 --> 1:08:36.039
	be more probable.

	1:08:36.896 --> 1:08:39.919
	So this type of information is somehow ignored.

	1:08:39.919 --> 1:08:45.271
	So if we have the Trigram language model,
	we are only looking at trigrams divided by

	1:08:45.271 --> 1:08:46.089
	the Vigrams.

	1:08:46.089 --> 1:08:49.678
	But if we have not seen the Vigrams, we are
	not looking.

	1:08:49.678 --> 1:08:53.456
	Oh, maybe we will have seen the Vigram and
	we can back off.

	1:08:54.114 --> 1:09:01.978
	And that is what people do in interpolation
	and back off.

	1:09:01.978 --> 1:09:09.164
	The idea is if we don't have seen the large
	engrams.

	1:09:09.429 --> 1:09:16.169
	So don't have to go to a shorter sequence
	and try to see if we came on in this probability.

	1:09:16.776 --> 1:09:20.730
	And this is the idea of interpolation.

	1:09:20.730 --> 1:09:25.291
	There's like two different ways of doing it.

	1:09:25.291 --> 1:09:26.507
	One is the.

	1:09:26.646 --> 1:09:29.465
	The easiest thing is like okay.

	1:09:29.465 --> 1:09:32.812
	If we have bigrams, we have trigrams.

	1:09:32.812 --> 1:09:35.103
	If we have programs, why?

	1:09:35.355 --> 1:09:46.544
	Mean, of course, we have the larger ones,
	the larger context, but the short amounts are

	1:09:46.544 --> 1:09:49.596
	maybe better estimated.

	1:09:50.090 --> 1:10:00.487
	Time just by taking the probability of just
	the word class of probability of and.

	1:10:01.261 --> 1:10:07.052
	And of course we need to know because otherwise
	we don't have a probability distribution, but

	1:10:07.052 --> 1:10:09.332
	we can somehow optimize the weights.

	1:10:09.332 --> 1:10:15.930
	For example, the health out data set: And
	thereby we have now a probability distribution

	1:10:15.930 --> 1:10:17.777
	which takes both into account.

	1:10:18.118 --> 1:10:23.705
	The thing about the Scottish be a drink business.

	1:10:23.705 --> 1:10:33.763
	The dry rum probability will be the same for
	the post office because they both occur zero

	1:10:33.763 --> 1:10:34.546
	times.

	1:10:36.116 --> 1:10:45.332
	But the two grand verability will hopefully
	be different because we might have seen beer

	1:10:45.332 --> 1:10:47.611
	eaters and therefore.

	1:10:48.668 --> 1:10:57.296
	The idea that sometimes it's better to have
	different models and combine them instead.

	1:10:58.678 --> 1:10:59.976
	Another idea in style.

	1:11:00.000 --> 1:11:08.506
	Of this overall interpolation is you can also
	do this type of recursive interpolation.

	1:11:08.969 --> 1:11:23.804
	The probability of the word given its history
	is in the current language model probability.

	1:11:24.664 --> 1:11:30.686
	Thus one minus the weights of this two some
	after one, and here it's an interpolated probability

	1:11:30.686 --> 1:11:36.832
	from the n minus one breath, and then of course
	it goes recursively on until you are at a junigram

	1:11:36.832 --> 1:11:37.639
	probability.

	1:11:38.558 --> 1:11:49.513
	What you can also do, you can not only do
	the same weights for all our words, but you

	1:11:49.513 --> 1:12:06.020
	can for example: For example, for engrams,
	which you have seen very often, you put more

	1:12:06.020 --> 1:12:10.580
	weight on the trigrams.

	1:12:13.673 --> 1:12:29.892
	The other thing you can do is the back off
	and the difference in back off is we are not

	1:12:29.892 --> 1:12:32.656
	interpolating.

	1:12:32.892 --> 1:12:41.954
	If we have seen the trigram probability so
	if the trigram hound is bigger then we take

	1:12:41.954 --> 1:12:48.412
	the trigram probability and if we have seen
	this one then we.

	1:12:48.868 --> 1:12:54.092
	So that is the difference.

	1:12:54.092 --> 1:13:06.279
	We are always taking all the angle probabilities
	and back off.

	1:13:07.147 --> 1:13:09.941
	Why do we need to do this just a minute?

	1:13:09.941 --> 1:13:13.621
	So why have we here just take the probability
	of the.

	1:13:15.595 --> 1:13:18.711
	Yes, because otherwise the probabilities from
	some people.

	1:13:19.059 --> 1:13:28.213
	In order to make them still sound one, we
	have to take away a bit of a probability mass

	1:13:28.213 --> 1:13:29.773
	for the scene.

	1:13:29.709 --> 1:13:38.919
	The difference is we are no longer distributing
	it equally as before to the unseen, but we

	1:13:38.919 --> 1:13:40.741
	are distributing.

	1:13:44.864 --> 1:13:56.220
	For example, this can be done with gutturing,
	so the expected counts in goodturing we saw.

	1:13:57.697 --> 1:13:59.804
	The adjusted counts.

	1:13:59.804 --> 1:14:04.719
	They are always lower than the ones we see
	here.

	1:14:04.719 --> 1:14:14.972
	These counts are always: See that so you can
	now take this different and distribute this

	1:14:14.972 --> 1:14:18.852
	weights to the lower based input.

	1:14:23.323 --> 1:14:29.896
	Is how we can distribute things.

	1:14:29.896 --> 1:14:43.442
	Then there is one last thing people are doing,
	especially how much.

	1:14:43.563 --> 1:14:55.464
	And there's one thing which is called well
	written by Mozilla.

	1:14:55.315 --> 1:15:01.335
	In the background, like in the background,
	it might make sense to look at the words and

	1:15:01.335 --> 1:15:04.893
	see how probable it is that you need to background.

	1:15:05.425 --> 1:15:11.232
	So look at these words five and one cent.

	1:15:11.232 --> 1:15:15.934
	Those occur exactly times in the.

	1:15:16.316 --> 1:15:27.804
	They would be treated exactly the same because
	both occur at the same time, and it would be

	1:15:27.804 --> 1:15:29.053
	the same.

	1:15:29.809 --> 1:15:48.401
	However, it shouldn't really model the same.

	1:15:48.568 --> 1:15:57.447
	If you compare that for constant there are
	four hundred different continuations of this

	1:15:57.447 --> 1:16:01.282
	work, so there is nearly always this.

	1:16:02.902 --> 1:16:11.203
	So if you're now seeing a new bigram or a
	biogram with Isaac Constant or Spite starting

	1:16:11.203 --> 1:16:13.467
	and then another word,.

	1:16:15.215 --> 1:16:25.606
	In constant, it's very frequent that you see
	new angrups because there are many different

	1:16:25.606 --> 1:16:27.222
	combinations.

	1:16:27.587 --> 1:16:35.421
	Therefore, it might look not only to look
	at the counts, the end grams, but also how

	1:16:35.421 --> 1:16:37.449
	many extensions does.

	1:16:38.218 --> 1:16:43.222
	And this is done by witt velk smoothing.

	1:16:43.222 --> 1:16:51.032
	The idea is we count how many possible extensions
	in this case.

	1:16:51.371 --> 1:17:01.966
	So we had for spive, we had possible extensions,
	and for constant we had a lot more.

	1:17:02.382 --> 1:17:09.394
	And then how much we put into our backup model,
	how much weight we put into the backup is,

	1:17:09.394 --> 1:17:13.170
	depending on this number of possible extensions.

	1:17:14.374 --> 1:17:15.557
	Style.

	1:17:15.557 --> 1:17:29.583
	We have it here, so this is the weight you
	put on your lower end gram probability.

	1:17:29.583 --> 1:17:46.596
	For example: And if you compare these two
	numbers, so for Spike you do how many extensions

	1:17:46.596 --> 1:17:55.333
	does Spike have divided by: While for constant
	you have zero point three, you know,.

	1:17:55.815 --> 1:18:05.780
	So you're putting a lot more weight to like
	it's not as bad to fall off to the back of

	1:18:05.780 --> 1:18:06.581
	model.

	1:18:06.581 --> 1:18:10.705
	So for the spy it's really unusual.

	1:18:10.730 --> 1:18:13.369
	For Constant there's a lot of probability
	medicine.

	1:18:13.369 --> 1:18:15.906
	The chances that you're doing that is quite
	high.

	1:18:20.000 --> 1:18:26.209
	Similarly, but just from the other way around,
	it's now looking at this probability distribution.

	1:18:26.546 --> 1:18:37.103
	So now when we back off the probability distribution
	for the lower angrums, we calculated exactly

	1:18:37.103 --> 1:18:40.227
	the same as the probability.

	1:18:40.320 --> 1:18:48.254
	However, they are used in a different way,
	so the lower order end drums are only used

	1:18:48.254 --> 1:18:49.361
	if we have.

	1:18:50.410 --> 1:18:54.264
	So it's like you're modeling something different.

	1:18:54.264 --> 1:19:01.278
	You're not modeling how probable this engram
	if we haven't seen the larger engram and that

	1:19:01.278 --> 1:19:04.361
	is tried by the diversity of histories.

	1:19:04.944 --> 1:19:14.714
	For example, if you look at York, that's a
	quite frequent work.

	1:19:14.714 --> 1:19:18.530
	It occurs as many times.

	1:19:19.559 --> 1:19:27.985
	However, four hundred seventy three times
	it was followed the way before it was mute.

	1:19:29.449 --> 1:19:40.237
	So if you now think the unigram model is only
	used, the probability of York as a unigram

	1:19:40.237 --> 1:19:49.947
	model should be very, very low because: So
	you should have a lower probability for your

	1:19:49.947 --> 1:19:56.292
	than, for example, for foods, although you
	have seen both of them at the same time, and

	1:19:56.292 --> 1:20:02.853
	this is done by Knesser and Nye Smoothing where
	you are not counting the words itself, but

	1:20:02.853 --> 1:20:05.377
	you count the number of mysteries.

	1:20:05.845 --> 1:20:15.233
	So how many other way around was it followed
	by how many different words were before?

	1:20:15.233 --> 1:20:28.232
	Then instead of the normal way you count the
	words: So you don't need to know all the formulas

	1:20:28.232 --> 1:20:28.864
	here.

	1:20:28.864 --> 1:20:33.498
	The more important thing is this intuition.

	1:20:34.874 --> 1:20:44.646
	More than it means already that I haven't
	seen the larger end grammar, and therefore

	1:20:44.646 --> 1:20:49.704
	it might be better to model it differently.

	1:20:49.929 --> 1:20:56.976
	So if there's a new engram with something
	in New York that's very unprofitable compared

	1:20:56.976 --> 1:20:57.297
	to.

	1:21:00.180 --> 1:21:06.130
	And yeah, this modified Kneffer Nice music
	is what people took into use.

	1:21:06.130 --> 1:21:08.249
	That's the fall approach.

	1:21:08.728 --> 1:21:20.481
	Has an absolute discounting for small and
	grams, and then bells smoothing, and for it

	1:21:20.481 --> 1:21:27.724
	uses the discounting of histories which we
	just had.

	1:21:28.028 --> 1:21:32.207
	And there's even two versions of it, like
	the backup and the interpolator.

	1:21:32.472 --> 1:21:34.264
	So that may be interesting.

	1:21:34.264 --> 1:21:40.216
	These are here even works well for interpolation,
	although your assumption is even no longer

	1:21:40.216 --> 1:21:45.592
	true because you're using the lower engrams
	even if you've seen the higher engrams.

	1:21:45.592 --> 1:21:49.113
	But since you're then focusing on the higher
	engrams,.

	1:21:49.929 --> 1:21:53.522
	So if you see that some beats on the perfectities,.

	1:21:54.754 --> 1:22:00.262
	So you see normally what interpolated movement
	class of nineties gives you some of the best

	1:22:00.262 --> 1:22:00.980
	performing.

	1:22:02.022 --> 1:22:08.032
	You see the larger your end drum than it is
	with interpolation.

	1:22:08.032 --> 1:22:15.168
	You also get significant better so you can
	not only look at the last words.

	1:22:18.638 --> 1:22:32.725
	Good so much for these types of things, and
	we will finish with some special things about

	1:22:32.725 --> 1:22:34.290
	language.

	1:22:38.678 --> 1:22:44.225
	One thing we talked about the unknown words,
	so there is different ways of doing it because

	1:22:44.225 --> 1:22:49.409
	in all the estimations we were still assuming
	mostly that we have a fixed vocabulary.

	1:22:50.270 --> 1:23:06.372
	So you can often, for example, create an unknown
	choken and use that while statistical language.

	1:23:06.766 --> 1:23:16.292
	It was mainly useful language processing since
	newer models are coming, but maybe it's surprising.

	1:23:18.578 --> 1:23:30.573
	What is also nice is that if you're going
	to really hard launch and ramps, it's more

	1:23:30.573 --> 1:23:33.114
	about efficiency.

	1:23:33.093 --> 1:23:37.378
	And then you have to remember lock it in your
	model.

	1:23:37.378 --> 1:23:41.422
	In a lot of situations it's not really important.

	1:23:41.661 --> 1:23:46.964
	It's more about ranking so which one is better
	and if they don't sum up to one that's not

	1:23:46.964 --> 1:23:47.907
	that important.

	1:23:47.907 --> 1:23:53.563
	Of course then you cannot calculate any perplexity
	anymore because if this is not a probability

	1:23:53.563 --> 1:23:58.807
	mass then the thing we had about the negative
	example doesn't fit anymore and that's not

	1:23:58.807 --> 1:23:59.338
	working.

	1:23:59.619 --> 1:24:02.202
	However, anification is also very helpful.

	1:24:02.582 --> 1:24:13.750
	And that is why there is this stupid bag-off
	presented remove all this complicated things

	1:24:13.750 --> 1:24:14.618
	which.

	1:24:15.055 --> 1:24:28.055
	And it just does once we directly take the
	absolute account, and otherwise we're doing.

	1:24:28.548 --> 1:24:41.867
	Is no longer any discounting anymore, so it's
	very, very simple and however they show you

	1:24:41.867 --> 1:24:47.935
	have to calculate a lot less statistics.

	1:24:50.750 --> 1:24:57.525
	In addition you can have other type of language
	models.

	1:24:57.525 --> 1:25:08.412
	We had word based language models and they
	normally go up to four or five for six brands.

	1:25:08.412 --> 1:25:10.831
	They are too large.

	1:25:11.531 --> 1:25:20.570
	So what people have then looked also into
	is what is referred to as part of speech language

	1:25:20.570 --> 1:25:21.258
	model.

	1:25:21.258 --> 1:25:29.806
	So instead of looking at the word sequence
	you're modeling directly the part of speech

	1:25:29.806 --> 1:25:30.788
	sequence.

	1:25:31.171 --> 1:25:34.987
	Then of course now you're only being modeling
	syntax.

	1:25:34.987 --> 1:25:41.134
	There's no cemented information anymore in
	the paddle speech test but now you might go

	1:25:41.134 --> 1:25:47.423
	to a larger context link so you can do seven
	H or nine grams and then you can write some

	1:25:47.423 --> 1:25:50.320
	of the long range dependencies in order.

	1:25:52.772 --> 1:25:59.833
	And there's other things people have done
	like cash language models, so the idea in cash

	1:25:59.833 --> 1:26:07.052
	language model is that yes words that you have
	recently seen are more frequently to do are

	1:26:07.052 --> 1:26:11.891
	more probable to reoccurr if you want to model
	the dynamics.

	1:26:12.152 --> 1:26:20.734
	If you're just talking here, we talked about
	language models in my presentation.

	1:26:20.734 --> 1:26:23.489
	There will be a lot more.

	1:26:23.883 --> 1:26:37.213
	Can do that by having a dynamic and a static
	component, and then you have a dynamic component

	1:26:37.213 --> 1:26:41.042
	which looks at the bigram.

	1:26:41.261 --> 1:26:49.802
	And thereby, for example, if you once generate
	language model of probability, it's increased

	1:26:49.802 --> 1:26:52.924
	and you're modeling that problem.

	1:26:56.816 --> 1:27:03.114
	Said the dynamic component is trained on the
	text translated so far.

	1:27:04.564 --> 1:27:12.488
	To train them what you just have done, there's
	no human feet there.

	1:27:12.712 --> 1:27:25.466
	The speech model all the time and then it
	will repeat its errors and that is, of course,.

	1:27:25.966 --> 1:27:31.506
	A similar idea is people have looked into
	trigger language model whereas one word occurs

	1:27:31.506 --> 1:27:34.931
	then you increase the probability of some other
	words.

	1:27:34.931 --> 1:27:40.596
	So if you're talking about money that will
	increase the probability of bank saving account

	1:27:40.596 --> 1:27:41.343
	dollar and.

	1:27:41.801 --> 1:27:47.352
	Because then you have to somehow model this
	dependency, but it's somehow also an idea of

	1:27:47.352 --> 1:27:52.840
	modeling long range dependency, because if
	one word occurs very often in your document,

	1:27:52.840 --> 1:27:58.203
	you like somehow like learning which other
	words to occur because they are more often

	1:27:58.203 --> 1:27:59.201
	than by chance.

	1:28:02.822 --> 1:28:10.822
	Yes, then the last thing is, of course, especially
	for languages which are, which are morphologically

	1:28:10.822 --> 1:28:11.292
	rich.

	1:28:11.292 --> 1:28:18.115
	You can do something similar to BPE so you
	can now do more themes or so, and then more

	1:28:18.115 --> 1:28:22.821
	the morphine sequence because the morphines
	are more often.

	1:28:23.023 --> 1:28:26.877
	However, the program is opposed that your
	sequence length also gets longer.

	1:28:27.127 --> 1:28:33.185
	And so if they have a four gram language model,
	it's not counting the last three words but

	1:28:33.185 --> 1:28:35.782
	only the last three more films, which.

	1:28:36.196 --> 1:28:39.833
	So of course then it's a bit challenging and
	know how to deal with.

	1:28:40.680 --> 1:28:51.350
	What about language is finished by the idea
	of a position at the end of the world?

	1:28:51.350 --> 1:28:58.807
	Yeah, but there you can typically do something
	like that.

	1:28:59.159 --> 1:29:02.157
	It is not the one perfect solution.

	1:29:02.157 --> 1:29:05.989
	You have to do a bit of testing what is best.

	1:29:06.246 --> 1:29:13.417
	One way of dealing with a large vocabulary
	that you haven't seen is to split these words

	1:29:13.417 --> 1:29:20.508
	into parts and themes that either like more
	linguistic motivated in more themes or more

	1:29:20.508 --> 1:29:25.826
	statistically motivated like we have in the
	bike pair and coding.

	1:29:28.188 --> 1:29:33.216
	The representation of your text is different.

	1:29:33.216 --> 1:29:41.197
	How you are later doing all the counting and
	the statistics is the same.

	1:29:41.197 --> 1:29:44.914
	What you assume is your sequence.

	1:29:45.805 --> 1:29:49.998
	That's the same thing for the other things
	we had here.

	1:29:49.998 --> 1:29:55.390
	Here you don't have words, but everything
	you're doing is done exactly.

	1:29:57.857 --> 1:29:59.457
	Some practical issues.

	1:29:59.457 --> 1:30:05.646
	Typically you're doing things on the lock
	and you're adding because mild decline in very

	1:30:05.646 --> 1:30:09.819
	small values gives you sometimes problems with
	calculation.

	1:30:10.230 --> 1:30:16.687
	Good thing is you don't have to care with
	this mostly so there is very good two kids

	1:30:16.687 --> 1:30:23.448
	like Azarayan or Kendalan which when you can
	just give your data and they will train the

	1:30:23.448 --> 1:30:30.286
	language more then do all the complicated maths
	behind that and you are able to run them.

	1:30:31.911 --> 1:30:39.894
	So what you should keep from today is what
	is a language model and how we can do maximum

	1:30:39.894 --> 1:30:44.199
	training on that and different language models.

	1:30:44.199 --> 1:30:49.939
	Similar ideas we use for a lot of different
	statistical models.

	1:30:50.350 --> 1:30:52.267
	Where You Always Have the Problem.

	1:30:53.233 --> 1:31:01.608
	Different way of looking at it and doing it
	will do it on Thursday when we will go to language.