Hey, welcome to today's lecture on language modeling.

Last time we had a different view on machine translation, namely the evaluation part: it is important to evaluate and see how well a system really works. Today we want to continue with building the MT system, and this will be the last part before we go into the neural models on Thursday.

So we had the broader view on statistical machine translation, and a week ago on Thursday we talked about statistical machine translation and mainly the translation model, so how we model how probable it is that one word is translated into another.

However, there is another component when doing generation tasks in general, and machine translation in particular. There are several characteristics which you only need to model on the target side; in the traditional approach we talked about the generation from a more semantic or syntactic representation into the real words. And the challenge is that there are some constructs which exist only in the target language. You cannot really get them from the translation; it is more something that needs to be modeled on the target side.

This is typically done by a language model, and this concept of a language model is, I guess you can assume, very important nowadays. You have read a lot about large language models recently, and they are all somehow trained on, or built around, this idea.

What we will look at today: we will first see what a language model is, and today's focus will be the statistical, count-based n-gram language model. This was the common approach to language modeling for twenty or thirty years, so for a long time it was really the state of the art, and people have used it in many applications, in machine translation and in automatic speech recognition.

Then, again, we will measure performance, but this time purely the performance of the language model. And then we will see that the traditional language model has a major drawback in how it deals with unseen events. If you model language, you will see that most sentences you encounter you have never seen before, and you are still able to assess whether they are good, native-sounding language. That is challenging if you just do plain parameter estimation.

We will use two different techniques to deal with that, smoothing and interpolation, and these are essential in order to build a good language model. This also motivates why things might be easier once we go to neural models, as we will see. And at the end we will talk a bit about some additional types of language models which are also used.

So where are language models used, or how are they used in machine translation?
The idea of a language model is that we are modeling the fluency of language. So if you have, for example, the beginning of a sentence, then you can estimate which words can come next: some continuations are valid, while other words are not.

And we have seen where we can use that: in the noisy channel model from two weeks ago. Today we will look into how we can model P of Y, that is, how probable the target sentence is. This is completely independent of the translation process; it only asks how fluent a sentence is and how you would express it.

And this language modeling task has one really big advantage, and I would assume it is even the biggest advantage: the data we need to train it. Normally we are doing supervised learning. For machine translation, as we will talk about, that means we have the source sentence and the target sentence, and they need to be aligned; we will look into how we can model them. Generally, the problem with this is getting such data. For machine translation you still have the advantage that there are quite huge amounts of this data for many languages, not all but many; for other tasks it is even more difficult, for example there is very little data where you have a text together with its summary.

So the big advantage of a language model is that we are only modeling the target sentences, so we only need pure text. And pure text, especially since we have the internet, is available in very large amounts. Of course, it may still cover only some domains and some types of text: if you want data for speech about machine translation, maybe there is only limited data for that, and if you go to more exotic languages, you will have less data. But in language modeling we can now look at how we can make use of these data.

Nowadays this is often also framed as self-supervised learning, because on the one hand, as we will see, it is a kind of classification task, so supervised learning, but we create the training signal from the data itself. So it is not that we have pairs of text and labels; we only have the text.

So the question is how we can use this monolingual data and how we can train our language model. The main goal is to model fluent English, so we want to somehow model whether something is a sentence of the language. There is no clear separation between semantics and syntax here, and in this case it is not about a clear separation: we will model both of them somehow in there. There will be some notion of semantics, some notion of syntax.
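As a reminder of where this P(Y) sits: in the noisy channel setup mentioned above, the language model is one of the two factors. This is the standard formulation, assuming X is the source sentence and Y the target sentence:

```latex
\hat{Y} = \arg\max_{Y} P(Y \mid X)
        = \arg\max_{Y} \frac{P(X \mid Y)\,P(Y)}{P(X)}
        = \arg\max_{Y} \underbrace{P(X \mid Y)}_{\text{translation model}} \; \underbrace{P(Y)}_{\text{language model}}
```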
Coming back to why there is some semantics in there: what you want to model is how fluent, or how probable, it is that a native speaker produces a sentence, and we rarely say things that are semantically wrong. Therefore there is also some notion of semantics in the model.

So, for example, "the house is small" should have a higher probability than "the house is home", because "home" and "house" both correspond to roughly the same German word but are used differently. Similarly, it should be more probable that the plane is landing than that the plane is dancing. Both are syntactically correct, but one is semantically odd; still, you will see the first much more often, so it should get the higher probability.

More formally, the language model should be some type of function which gives us the probability that this sentence is produced, indicating that it is good English, or more generally good language; of course you can do that for any language.

In earlier times people even tried to do that deterministically; that was especially used for dialogue systems. You have a very strict syntax, so you can only say things like "turn off the radio", "turn off the light" and so on, and you have a very strict, deterministic finite-state grammar defining which types of phrases are allowed. The problem, of course, when dealing with language is that language is variable: we are not always speaking in correct sentences, and so this type of deterministic model does not really work.

That is why for many, many years people have looked into statistical language models, where we model something like the probability of the sequence of words w one to w n. The advantage of doing it statistically is that we can train on large text collections; we do not have to define the grammar by hand, and in most cases we do not want a hard decision "this is a sentence of the language or it is not", but some type of probability: how probable is this sentence?

Because, even for humans, it is not always clear whether a sentence is acceptable or not. I mean, in this presentation I just gave several sentences which are not correct English. So it will still happen that people speak or write sentences which are not correct, and you want to deal with all of them. That is a big advantage of the statistical models.

The disadvantage is that you need suitably large text collections, which might not exist for all languages. And nowadays you also see that there are issues in that you need large computational resources to deal with the data.
You need to collect all these crawls from the internet, which can give you enormous amounts of training data.

So if we want to build this, the question is of course: how can we estimate the probability? For example, how probable is the sentence "good morning"?

You all know basic statistics: if you have a large database of sentences, you count. I made this a real example from the TED talks, I guess most of you have heard of them. You count in how many sentences "good morning" occurs and divide by the total number of sentences, and you get a very small probability for "good morning".

Okay, so this looks like a very easy thing: we can directly model the language model this way. Does anybody see a problem why this might not be the final solution?

One answer from the audience: "I think we would need a whole lot more sentences to make anything useful of this, because the probability of a talk starting with 'good morning' is surely much higher than that." The probability estimate itself is actually okay, but you are going in the right direction regarding the large data. Another answer: "You cannot score a new sentence." Yes, it is about the amount of data: you said it is hard to get enough data; I would say it is impossible. We are constantly saying sentences which have never been said before, and we are still able to deal with them.

The problem is the sparsity of the data: there will be a lot of perfectly fine English sentences that we have simply never seen. And that is, of course, not what we want. If we want to model language, we need a model which can estimate how good a sentence is even if it has never been seen, and if we just count full sentences this way, most of them will get a zero probability, which is not useful.

So we need to do things a bit differently. For the translation models we already had some ideas for doing that, and we can do the same here: we can use the chain rule and the definition of conditional probability. The conditional probability of an event B given an event A is the probability of A and B divided by the probability of A.

Yes, I recently had an exam on automatic speech recognition, and the professor said this is not called the chain rule, because I used this terminology; he said it is just applying Bayes' rule. But this is the definition of conditional probability: P of B given A is defined as P of A and B divided by P of A, and that can easily be rewritten into P of A and B equals P of A times P of B given A.
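In symbols, these are the two identities being used (standard probability notation):

```latex
P(B \mid A) = \frac{P(A, B)}{P(A)}
\qquad\Longleftrightarrow\qquad
P(A, B) = P(A)\,P(B \mid A)
```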
And the nice thing is that we can easily extend this to more variables: P of A, B and C is P of A, times P of B given A, times P of C given A and B, and so on. More generally, you can do that for any length of sequence.

So if we now go back to words, we can model the probability of a sequence as the product over positions of the probability of each word given its history, that is, given all the preceding words.

Maybe it is clearer with real words: if we have P of "its water is so transparent", that is P of "its", times P of "water" given "its", times P of "is" given "its water", and so on. This way we are able to model the probability of the whole sentence by looking at each word given the sequence before it.

And of course the big advantage is that each word occurs more often than the full sentence, so hopefully we have seen each piece often enough. Of course, if a word does not occur at all this still does not work, but most of today's lecture is about dealing with that. So, first of all, is this at least easier than what we had before?

From the audience: "Does that really make it easier? Those histories get arbitrarily long, and we still condition on the whole sentence." Yes, exactly. When we look at the last probability here, we still have to have seen the full history: if we want to model the probability of "transparent" given "its water is so", we must have seen the full sequence. So in this first step we did not really gain anything; we still essentially need to have seen the full sentence. However, we are a little step closer.

So this is still a problem, and we will never have seen all histories. You can look at it this way: if you have a vocabulary of V words and sentences of length n, there are on the order of V to the power of n minus one possible histories, and we are quite sure we have never seen that much data. So we cannot really compute this probability directly.

However, there is a trick, and that is the idea behind most language models. Instead of asking how often this word appears after exactly this history, we do some kind of clustering: we cluster a lot of different histories into the same class, and then we model the probability of the word given this class of histories.

And then, of course, the big design decision is how to cluster the histories: how do we put all these histories together so that we have seen each class often enough to estimate the probability?

There are quite different things people can do: you can add part-of-speech tags, you can use semantic word classes, you can model similarity, you can model grammatical content, and so on. However, as quite often with these statistical models, a very simple solution is what works best, and this is what most statistical models do.
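Written out once, the chain-rule decomposition described above is the following (standard notation); the clustering idea then only changes what we condition on:

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
% for example:
P(\text{its water is so transparent}) = P(\text{its})\, P(\text{water} \mid \text{its})\, P(\text{is} \mid \text{its water})\, P(\text{so} \mid \text{its water is})\, P(\text{transparent} \mid \text{its water is so})
```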
Most statistical language models are based on the so-called Markov assumption, and that means we assume that most of the history is not that important.

So we model the probability of "transparent" given only "is so", that is, we keep maybe just the last two words, a fixed-size window. The class of all histories from word one to word i minus one is then simply the last two words. This classification does not need any additional knowledge and is very easy to compute, and we have now limited our histories: instead of arbitrarily long ones, we have, for example, only two-word histories. Of course, a lot of them will still not occur, but far fewer than with full sentences.

So it is a very simple trick to map all these histories into a manageable number of classes, and it is motivated by the language itself: the nearest words matter most. A lot of sequences mainly depend on the previous words, and things which are far away matter less.

In our product, everything is now modeled not by the whole history but by the last n minus one words. This is why people also talk about an n-gram language model: we are always looking at these chunks of n words and modeling their probability.

Let us start with the most simple, even extreme, case: the unigram model, where we ignore the whole history. The probability of a sequence of words is just the product of the probabilities of the individual words. Thereby we remove the whole context; the most probable sequence would be something like "the the the the", just repeating the most probable word.

The most probable words by themselves might not make sense, but they can, of course, give you a bit of intuition about which types of words should be more frequent. What you can do is train such a model and then automatically generate text from it. Such a sequence is generated by sampling; we will come back to sampling later in the lecture. Sampling means that you randomly pick a word, but according to the probability distribution: if the probability of one word is zero point two, you pick it in roughly twenty percent of the cases, and so on for the other words.

And if you look at text generated this way, you see that some frequent words keep occurring, but there is not really any continuing structure, because each word is modeled independently.

We can do better by going to a bigram model; then we have a bit of context. Of course it is still very small: the probability of the current word only depends on the previous word, and all the context before that is ignored.
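In formulas, the Markov (n-gram) approximation replaces the full history by the last n minus one words (standard notation):

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
% unigram: P(w_i), bigram: P(w_i \mid w_{i-1}), trigram: P(w_i \mid w_{i-2}, w_{i-1})
```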
This is of course still wrong, but it models language significantly better. Looking at generated text, some things still do not really make a lot of sense, but you are seeing some typical phrases. The whole output does not make sense, but locally the word pairs are frequent. A very nice example is "new car parking lot": if you have the word "new", then "car" is a common continuation, but after "new car" a human would not put "parking"; after "car", however, a frequent continuation is "parking lot".

And this is interesting because here we see the two meanings of "lot": there is the parking lot, but in general, if you only look at one word of history, the most common use is "a lot". So you see that the model is really not using the context further back, only the immediately preceding word.

In general we can of course make the context longer: unigrams, bigrams, trigrams. People typically went up to four- or five-grams, and then it gets difficult, because there are so many five-grams that it gets complicated: storing all of them makes the models so big that it no longer works, and at some point estimating the probabilities also gets too difficult because each individual n-gram is seen too rarely. If you have a small corpus, you will use a smaller n-gram length; with more data you can take a larger one.

What is important to keep in mind is that, of course, this assumption is wrong. There are long-range dependencies, and if we really want to model everything in language, a fixed window is not enough. Here is one of these extreme cases: "the computer which I had just put into the machine room on the fifth floor crashed". Somehow there is a dependency between "computer" and "crashed" that spans many words.

However, in most situations these cases are rare, and normally the most important things happen in the near context. But it is important to keep in mind that you cannot model everything this way; you cannot capture arbitrary long-range dependencies.

The next question is how we can train this, so how we estimate these probabilities. And again, the most simple thing is exactly the right thing: maximum likelihood estimation gives you the answer. How probable is it that a word follows the n minus one previous words? You just count how often the whole sequence happens and divide by how often the history happens.

I guess this is what most of you would have intuitively done, and this also works best.
So it is not complicated to train: you have to go over your corpus once, you count all the bigrams and unigrams, and then you can directly build the basic language model.

Where is it difficult? There are two difficulties. First, the basic language model does not work that well because of zero counts, and we will see how to address that. And second, especially if you go to larger n, you have to store all these n-grams efficiently.

So how can we do that? Here is a small example. Say your training corpus consists of the sentences "<s> I am Sam </s>", "<s> Sam I am </s>" and "<s> I do not like green eggs and ham </s>". The sentence start symbol occurs three times, and the word "I" follows it two times, so the probability of "I" given the sentence start is two thirds, and the probability of "Sam" given the sentence start is one third. Similarly, you can look at what follows "I": twice it is "am" and once it is "do", so again two thirds and one third.

And this is really all you need to know here; you can do this calculation directly from the counts.

The question then, of course, is what we really learn in these types of models. Here are examples from the Europarl corpus: the contexts "the green", "the red" and "the blue", and for each you see the probabilities of the next word. You see that there is a lot more in there than just syntax, because the initial phrase is always structured the same. For example, after "the green" you see "the green paper" and "the green group" in the European Parliament, and after "the red" you see "the red cross".

What you also see is that sometimes it is easy to guess the next word and sometimes it is more difficult. For example, following "the red", in nearly all cases the continuation was "cross". So for some contexts it is much easier to guess the next word than for others. There are different types of information encoded in this; you also know this from speaking: sometimes you directly know how the speaker will continue, so there is not a lot of new information in the next word, while in other cases, like after "the blue", there is a lot of information in the next word.

Another example is the Berkeley restaurant corpus. It was collected at Berkeley and has sentences like "can you tell me about any good spaghetti restaurants" or "mid-priced Thai food is what I'm looking for", so it is more like a dialogue system, and people have collected this data and of course you can also look into it and get the counts. So you count the bigrams: the rows give the first word and the columns the word that follows.
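As a small aside before looking further at these count tables: the maximum-likelihood counting from the "I am Sam" toy example above can be written in a few lines. This is a minimal sketch, not the exact tool used in the lecture, and the helper names are mine:

```python
from collections import Counter

# Toy corpus with explicit sentence boundary markers, as in the example above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, history):
    """Maximum-likelihood bigram estimate: C(history word) / C(history)."""
    return bigram_counts[(history, word)] / unigram_counts[history]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3
print(p_mle("am", "I"))     # 2/3
```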
Back to the restaurant counts: here you see, for example, that after "I" the word "want" is very frequent, while most other combinations occur rarely or not at all.

Then you have the absolute counts of how often each word occurs, and from those you can compute the probabilities.

If you then want, say, "I want Dutch food", you get the probability of the sequence by multiplying all the bigram probabilities along it. And then you get some interesting insights from that. For example, if you compare "I want Dutch" and "I want Chinese", you see which one is more common, so there is some world and domain knowledge in there. You also see that a sentence very often starts with "I"; that "eat" is possible after "to", but after "want" you cannot directly say "eat": you have to say "want to eat". So there is grammatical information, domain information and more in these simple counts.

Before we go into measuring quality, are there any questions about language models and this way of modeling?

I hope that does not mean everybody is sleeping. So, when we are training these language models, you need to decide, for example, which n-gram length to use: should we use a trigram or a four-gram model? How can you decide which of two models is better? If you had to do that, how would you decide whether to take language model A or language model B?

One answer from the audience: "I would take some test text and see which model assigns a higher probability to it." Very good; that is essentially the second option; the first thing one might think of is to take the language models and build a machine translation system with each.

The problems with that: first of all, you have to build a whole system, which is very time-consuming, and the result might not only depend on the language model. On the other hand, that is of course what you want in the end, and there is always the question of whether you evaluate each component individually or do an end-to-end evaluation. What can also happen is that by your metric something is a very good language model, but it somehow does not work well together with your translation model.

But of course it is very good to also have this type of intrinsic evaluation, where the assumption is, as was pointed out: good English should get a high probability and bad English a low probability. This is measured by taking a held-out data set, so some data which you do not train on, then calculating the probability of this data; then you are looking only at the language model, and you take the language model that assigns the higher probability.

In practice you are not directly using the probability, but the perplexity.
The perplexity is two to the power of the cross-entropy, and in the cross-entropy you are computing something like the average negative log probability per word.

So how exactly is that defined? Perplexity is typically what people report, and it is based on the cross-entropy. The cross-entropy is the negative average of the log probability of the whole text. We model this probability as the product over the words, as the n-gram model defines it, and if you remember the rules for logarithms, the log of a product becomes a sum: so the cross-entropy is minus one over n times the sum over all words of the log probability of each word given its history, and the perplexity is then just two to the power of that.

Why can this be interpreted as a branching factor? It gives you something like the average number of possibilities you have at each step. Imagine a digit task where you have no idea which digit comes next, so the probability of the next digit is one tenth for each; if you then compute the perplexity, it will be exactly ten. So the perplexity gives you a nice interpretation of how much ambiguity, how much randomness, is still in there.

Of course, it is good to have a lower perplexity: there is less ambiguity. If you have a vocabulary of a hundred words and the perplexity is ten, it is as if you only had to choose uniformly between ten different words at each position.

There was a question about the logarithm: yes, you have the logarithm here and then the power, and they should cancel out; which base you use is not that important, as long as it matches the base of the exponentiation, it is just a constant factor when you reformulate it.

So the best model is always the one that gives a high probability to the test data, which means a low perplexity.

Here you see an example: for the sentence "I would like to commend the rapporteur on his work" you have the log-two probability of each word, then the average, which is not yet the perplexity but the cross-entropy as mentioned, and two to the power of that gives you the perplexity of the sentence.

And this metric of perplexity is essential when working with language models, and we will also see it nowadays: quality is often measured in perplexity or cross-entropy, which tells you how good the model is at estimating the next word; the better the model, the more information it has about what comes next.
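To make the definition concrete, here is a tiny sketch (the helper is mine; it assumes you already have the per-word log-two probabilities from your model):

```python
import math

def perplexity(log2_word_probs):
    """Perplexity = 2 ** cross-entropy, where the cross-entropy is the
    negative average log2 probability per word over the evaluation text."""
    cross_entropy = -sum(log2_word_probs) / len(log2_word_probs)
    return 2 ** cross_entropy

# The "digit task" from above: ten positions, each digit has probability 0.1.
print(perplexity([math.log2(0.1)] * 10))  # approximately 10: the branching factor
```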
There was a question earlier about whether we also have to compare against all the other sentences we could have produced. You are doing that implicitly, because at each position you model the probability of the correct word, and this probability is normalized over all possible next words. So you have that implicitly in there: in each position you are modeling the probability of this word against all alternatives. In a sense you even have a very large number of negative examples, because all the possible continuations which are not there count as incorrect, which of course can also be a problem.

And the biggest challenge of these types of models is how to model unseen events. These can be unknown words, or unknown n-grams. That is important: even if you have seen all the words, with a bigram language model you still get a zero probability for any bigram you have not seen, because the bigram count is in the numerator.

If you have unknown words, the problem gets even bigger, because one unknown word typically causes several zero probabilities. For example, if your vocabulary contains "go", "to" and "KIT", and you now get the sentence "I go to KIT", then the word "I" is unknown. To model the sentence probability you multiply n-gram probabilities such as P of "I" given the sentence start and P of "go" given "I", and since the unknown word never occurred in training, every n-gram containing it has a zero count. So all of these probabilities are directly zero: a single unknown word already kills several n-gram probabilities.

This also tells you that it might not always be better to use larger n-grams: with a larger n-gram language model, a single unseen word or n-gram destroys even more of the probabilities. So sometimes it is better to have a smaller n-gram order, because the chance that you have seen the n-gram is higher; on the other hand, you want a larger order, because the larger it is, the longer the context you are modeling.

So how can we address this type of problem? We address it by somehow adjusting our counts. We have many n-grams, but most of the entries in the count table are zero, and if one of these n-grams occurs you get a zero probability. Therefore we need to find some other way to estimate these types of events.

There are different ways of modeling and adjusting this. The first one is smoothing. In smoothing you say: okay, we take a bit of the probability mass we gave to our seen events, and the mass we take away we distribute to all the other, unseen events.
The nice thing is that in this case each event now has a non-zero probability, and that is of course very helpful, because we do not have zero probabilities anymore. Everything is smoothed out: you take some probability mass away from the seen events, and you have at least some probability everywhere.

You can also picture it for an n-gram distribution: this is your original distribution; you take some mass away from the seen n-grams and distribute this mass over all the other words. Thereby you make sure that it is now possible to model unseen events.

The other idea, and we will come to it in more detail later, is to do some type of clustering or backing off. That means, if we cannot model "go to KIT", for example, because we have not seen the full sequence, then we do not look at the full thing but directly estimate how probable the shorter context is, "to KIT" or so. We can then use interpolation, where you interpolate the probabilities of the different orders, and thereby still get a reasonable estimate.

These are the two ideas which are helpful for better calculating all these types of probabilities.

Let us start with adjusting the counts. The idea is: we have not seen an event, so its count-based probability is zero. The true probability is probably not that high, but you should always be aware that new things can happen, and you should somehow be able to estimate a probability for them. So the idea is that we can also assign a positive probability to events we have never seen.

What we are changing is this: currently we worked with empirical counts, so how often we have actually seen the n-grams, and now we move to expected counts, how often an n-gram would occur in unseen data. So we are directly trying to model that. Of course, the empirical counts are a good starting point: if you have seen a word very often in your training data, that is a good estimate of how often you will see it in the future. However, it makes sense to adjust them, because just because you have not seen something does not mean it cannot occur.

So, does anybody have a very simple idea of how to start with smoothing? You do the probability calculation with counts, and a bigram count can be zero; what count would you give in order to still do this calculation? We have to smooth, so we adjust the counts.

One suggestion from the audience: "We could clump together all the rare words, for example everything we have only seen once, estimate the probability mass of those, and treat the rare ones as a class."
Yes, and then every unseen word is treated as one of them. That works, but it is not only about unseen words, it is also about unseen n-grams.

You can even start easier, and that is what people did first: add-one smoothing. You will see it does not work well, but a variation of it works fine. Here we simply pretend we have seen everything once more than we actually did. That is similar to your suggestion, because you are clustering the zero counts and the one counts together: you just say you have seen everything at least once, things you saw once you count twice, and so on.

And once you have done that, there is no zero probability anymore, because every event has happened at least once. If you had seen a bigram five times, you would now count it six times. So the nice thing is: having "seen" everything once, the probability of the n-gram is the count plus one, divided by the history count plus the vocabulary size.

However, there is one big problem with it. Just imagine that you have a vocabulary of around eighty-six thousand words and a corpus of thirty million bigram tokens. Simple counting: you have seen thirty million bigram occurrences; that is the count mass you are distributing according to the data. The problem is how many possible bigrams there are: about seven point five billion, and each of them now gets an extra count of one.

So the number of possible bigrams is many times larger than the number of bigrams you actually see. That means you are mainly doing a uniform distribution: everything gets roughly the same probability, because the added counts dominate, and most of your probability mass is used for smoothing.

Most of the probability mass has to be spent on giving every possible bigram at least a count of one, while the real counts are only thirty million: seven point five billion counts are distributed uniformly over all the bigrams, and only thirty million are according to the actual frequencies. So you put far too much mass on the smoothing; it is some kind of extreme smoothing. That, of course, is bad and will not give you the best performance.

However, there is a nice observation: we do the probability calculation based on counts, but to do this division we do not need integers. We can also use floating-point values, and it is still a valid probability calculation. So we can give less probability mass to the unseen events; we do not have to add a full one.
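For reference, the add-one estimate just described, and its generalization where you add a small value instead of one, are usually written like this (V is the vocabulary size):

```latex
P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}
\qquad
P_{\text{add-}\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + \alpha}{C(w_{i-1}) + \alpha V}
```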
For our calculation we can also add, say, zero point zero two or some other very small value instead of one, and thereby put less weight on the smoothing and focus more on the actual corpus.

That is what people refer to as add-alpha smoothing. You see that we are now adding not one but only alpha to each count, and then we give less probability to the unseen events and more probability to the events we have really seen.

The question is, of course, how to find this alpha; typically you take some held-out data and optimize it on that.

So what does this really mean? This table gives you a bit of an idea. Here you have, for example, all the n-grams which occur exactly one time in the training data, all the n-grams which occur two times, and so on, down to the n-grams which occur zero times. For each group you can look at how often these n-grams occur on average in a test corpus.

If you now apply add-one smoothing, you can compare which expected counts it assigns to them. You see that for all the n-grams you have actually seen, you heavily underestimate how often they occur in the test corpus. What you want is to estimate this distribution well, so for each group of n-grams estimate quite well how often they will occur. Add-one smoothing is quite bad at that: it clearly underestimates all the seen ones, and only for the top row, the n-grams you have not seen, it heavily overestimates.

If you do add-alpha smoothing, with alpha optimized to fit the zero-count row, which is not completely fair because this alpha is now optimized on the test counts, you see that you do a lot better. It is not perfect, but you are much better at estimating how often the n-grams will occur.

So this is one way of doing it. Of course, there are other ways; this has been a large research direction.

One of them is deleted estimation. What you do is split your training data into two parts. You look at which n-grams occur exactly r times in the first part of your training data, and then you look at how often exactly these n-grams occur in the second part. Then you say: for an n-gram that occurred r times in the first half, the expected count is the total number of occurrences of all such n-grams in the second half, divided by the number of these n-gram types. It is again some type of clustering: you put all the n-grams which occur r times together in order to estimate how often n-grams of this group really occur. And then your final estimation is done by just using these statistics.
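Written as a formula, in my own notation: let N_r be the number of n-gram types that occur exactly r times in the first half, and T_r the total number of times those same n-grams occur in the second half; the adjusted count for an n-gram seen r times is then

```latex
r^{*} = \frac{T_r}{N_r}
```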
So this is called deleted estimation, and with it you are able to estimate better how often an n-gram really occurs.

And again we can do the same check and compare with the expected counts; we have exactly the same kind of table. Here you have, for each count, how many n-gram types exist, for example how many n-grams occur once, twice and so on. Then you look into your other half: how often do these n-grams occur in the second part of the training data? For example, an n-gram which was unseen in the first half I expect to occur only a small fraction of a time, an n-gram which occurred one time I expect to occur roughly once, and so on.

There was a question about where the number of zero-count n-grams comes from: you take the vocabulary from the unigrams and just calculate how many possible bigrams there are; everything you have not seen is a zero count. And in this case we are not assuming a larger vocabulary, because then, of course, the numbers get even bigger; you do that given the current vocabulary. How to deal with unknown words is a separate problem; this here is about how to smooth the n-gram counts.

The last idea is the so-called Good-Turing estimation, and the idea is similar. There is a proper mathematical proof behind it, but the result is that a very good estimation of the expected count is: you take the number of n-grams which occur one time more often, times r plus one, divided by the number of n-grams which occur r times. So if you are looking at an n-gram which occurs r times, then you look at how many n-gram types occur r times and how many occur r plus one times.

That is very simple, because you only have to count, for each r, how many different n-grams occur r times, and that is easy to get.

There is one issue: for n-grams which occur very often, it might be that there are some n-grams occurring r times but none occurring r plus one times, and then the estimate breaks down. So what you normally do is use this formula for small r, and for large r you do some curve fitting. In general, this type of smoothing matters mainly for n-grams which occur rarely; if an n-gram occurs very often, the empirical estimate is already fine, so this is more important for rare events.

Here again you see the counts, and based on that you get the adjusted counts, and if you compare with the test counts you see that it really works quite well: for the low counts it models very well how often these n-grams really occur.
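The Good-Turing adjusted count, in the usual notation where N_r is the number of n-gram types seen exactly r times, is

```latex
r^{*} = (r + 1)\,\frac{N_{r+1}}{N_r}
```

As a consequence, the total probability mass given to all unseen n-grams is N_1 divided by the total number of observed n-gram tokens.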
Then, of course, the question is how well this works in language modeling; we also want to measure that. We can measure it with perplexity, which we introduced before, and here are the results. With add-one smoothing you saw that far too much probability mass is put on the events with zero counts, and the perplexity is correspondingly bad. Then you have add-alpha smoothing; here, with the caveat that it is not completely fair because the alpha was optimized on the test data, it is clearly better. But you see that deleted estimation and Good-Turing give you a similar performance, so they really seem to work quite well.

So this was all about assigning probability mass to n-grams which we have not seen, in order to also estimate their probability; now we are coming to interpolation.

Good. So now we have done this estimation, and the problem is that we still have the general trade-off: we want a longer context, because we can model language better with it, because of longer-range dependencies; on the other hand, we have limited data, so we want short n-grams, because short n-grams occur more often and can be estimated more reliably.

And the smoothing and discounting we did before always treats all unseen n-grams the same: we did not really look at the n-grams themselves, they were all just grouped by how often they occur.

However, sometimes this is not very helpful. For example, look at the n-grams "Scottish beer drinkers" and "Scottish beer eaters". Because we have seen neither trigram, we estimate both trigram probabilities by the probability assigned to the zero count, so they get the same value.

However, if you look at the bigram probabilities, you might have seen those, and they might be helpful: "beer drinkers" is much more probable than "beer eaters", so "Scottish beer drinkers" should also be more probable than "Scottish beer eaters".

This type of information is ignored so far: with a trigram language model we only look at trigram counts divided by bigram counts, and if we have not seen the trigram we do not check whether we have maybe seen the bigram and could back off to it.

That is what people do in interpolation and back-off. The idea is: if we have not seen the large n-gram, we go to a shorter sequence and try to use the information in its probability.

And this is the idea of interpolation; there are two different ways of doing it. The easiest one is to say: okay, we have unigrams, we have bigrams, we have trigrams, why not use them all? Of course, the larger ones give the larger context, but the shorter ones are maybe better estimated.
So you take, for each position, lambda one times the unigram probability of the word, plus lambda two times the bigram probability, plus lambda three times the trigram probability. And of course the lambdas need to sum to one, because otherwise we do not have a probability distribution, but we can optimize the weights, for example on a held-out data set. Thereby we now have a probability distribution which takes all of them into account.

Coming back to the Scottish beer drinkers example: the trigram probability will be the same for both phrases, because both occur zero times, but the bigram probability will hopefully be different, because we might have seen "beer drinkers" but not "beer eaters", and therefore the combined probabilities differ.

So the idea is that sometimes it is better to have different models and combine them, instead of relying on a single one.

Another idea, instead of this overall interpolation, is recursive interpolation. The probability of the word given its history is lambda times the n-gram language model probability, plus one minus lambda, so that the two weights sum to one, times the interpolated probability of the n minus one gram model, and this goes on recursively until you reach the unigram probability. What you can also do is not use the same weights for all words, but let the lambdas depend on the history: for example, for n-grams which you have seen very often, you put more weight on the trigram.

The other thing you can do is back-off, and the difference in back-off is that we are not interpolating, we are selecting. If we have seen the trigram, so if the trigram count is bigger than zero, then we take the trigram probability, and only if we have not seen it do we back off to the lower-order probability.

So that is the difference: in interpolation we always combine all the n-gram probabilities; in back-off we only fall back when needed.

Why do we need this extra factor here, why can we not just take the lower-order probability directly? Yes, because otherwise the probabilities no longer sum to one. In order to make them still sum to one, we have to take away a bit of probability mass from the seen events; the difference to before is that we are no longer distributing it equally to the unseen events, but according to the lower-order model.

For example, this can be done with Good-Turing: the adjusted counts in Good-Turing are always a bit lower than the raw counts, so you can take this difference and distribute this weight over the lower-order back-off distribution.

So that is how we can distribute the mass when backing off.
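Going back for a moment to the simple linear interpolation described above, here is a minimal sketch in code; the three estimator functions and the lambda values are placeholders (in practice the weights are tuned on held-out data):

```python
def interpolated_prob(w, h2, h1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Linear interpolation of unigram/bigram/trigram estimates:
    P(w | h2 h1) = l1*P(w) + l2*P(w | h1) + l3*P(w | h2, h1),
    with l1 + l2 + l3 = 1 so the result is still a distribution."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, h1) + l3 * p_tri(w, h2, h1)
```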
1:14:43.563 --> 1:14:55.464 And there's one technique which is called Witten-Bell smoothing. 1:14:55.315 --> 1:15:01.335 In the back-off, it might make sense to look at the words and 1:15:01.335 --> 1:15:04.893 see how probable it is that you need to back off at all. 1:15:05.425 --> 1:15:11.232 So look at these two words, 'spite' and 'constant'. 1:15:11.232 --> 1:15:15.934 Those occur exactly the same number of times in the corpus. 1:15:16.316 --> 1:15:27.804 They would be treated exactly the same because both occur the same number of times, and the smoothing would be 1:15:27.804 --> 1:15:29.053 the same. 1:15:29.809 --> 1:15:48.401 However, they shouldn't really be modeled the same. 1:15:48.568 --> 1:15:57.447 If you compare them: for 'constant' there are around four hundred different continuations of this 1:15:57.447 --> 1:16:01.282 word, while for 'spite' there is nearly always the same continuation. 1:16:02.902 --> 1:16:11.203 So if you're now seeing a new bigram, a bigram with 'constant' or 'spite' as the starting word 1:16:11.203 --> 1:16:13.467 and then another word: 1:16:15.215 --> 1:16:25.606 for 'constant' it's very frequent that you see new n-grams, because there are many different 1:16:25.606 --> 1:16:27.222 combinations. 1:16:27.587 --> 1:16:35.421 Therefore, it might be good not only to look at the counts of the n-grams, but also at how 1:16:35.421 --> 1:16:37.449 many extensions a word has. 1:16:38.218 --> 1:16:43.222 And this is done by Witten-Bell smoothing. 1:16:43.222 --> 1:16:51.032 The idea is that we count how many possible extensions a history has. 1:16:51.371 --> 1:17:01.966 So for 'spite' we had only a few possible extensions, and for 'constant' we had a lot more. 1:17:02.382 --> 1:17:09.394 And then how much weight we put on our back-off model depends 1:17:09.394 --> 1:17:13.170 on this number of possible extensions. 1:17:14.374 --> 1:17:15.557 So: 1:17:15.557 --> 1:17:29.583 we have it here, this is the weight you put on your lower-order n-gram probability. 1:17:29.583 --> 1:17:46.596 For example, if you compare these two numbers: for 'spite' you take how many extensions 1:17:46.596 --> 1:17:55.333 'spite' has, divided by that number plus the count of 'spite', which gives a very small weight, while for 'constant' you get around zero point three. 1:17:55.815 --> 1:18:05.780 So for 'constant' you're putting a lot more weight on the lower order; it's not as bad to fall back to the back-off 1:18:05.780 --> 1:18:06.581 model. 1:18:06.581 --> 1:18:10.705 For 'spite' it's really unusual to need the back-off. 1:18:10.730 --> 1:18:13.369 For 'constant' there's a lot of probability mass on it. 1:18:13.369 --> 1:18:15.906 The chance that you're backing off is quite high. 1:18:20.000 --> 1:18:26.209 Similarly, but just the other way around, we can now look at the lower-order probability distribution itself. 1:18:26.546 --> 1:18:37.103 So far, when we back off, the probability distribution for the lower-order n-grams is calculated exactly 1:18:37.103 --> 1:18:40.227 the same way as the normal probability. 1:18:40.320 --> 1:18:48.254 However, they are used in a different way: the lower-order n-grams are only used 1:18:48.254 --> 1:18:49.361 if we have not seen the higher-order n-gram. 1:18:50.410 --> 1:18:54.264 So it's like you're modeling something different. 1:18:54.264 --> 1:19:01.278 You're modeling how probable this n-gram is given that we haven't seen the larger n-gram, and that 1:19:01.278 --> 1:19:04.361 is captured by the diversity of histories. 1:19:04.944 --> 1:19:14.714 For example, if you look at 'York', that's a quite frequent word. 1:19:14.714 --> 1:19:18.530 It occurs many times. 1:19:19.559 --> 1:19:27.985 However, four hundred seventy-three times the word right before it was 'New'.
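Going back to the Witten-Bell weighting described just before the 'York' example, here is a small sketch of how the back-off weight grows with the number of distinct continuations; the 'spite'/'constant' counts below are invented to mirror the lecture's point, not real corpus statistics.

```python
from collections import Counter

# Invented bigram counts: 'spite' is almost always followed by the same word,
# while 'constant' has many different continuations.
bigram_counts = Counter({("spite", "of"): 990, ("spite", "and"): 2, ("spite", "."): 1})
for i in range(400):
    bigram_counts[("constant", f"word{i}")] = 2

def backoff_weight(history):
    """Witten-Bell style weight on the lower-order model:
       N1+(history, *) / (N1+(history, *) + c(history))."""
    continuation_counts = [c for (h, _), c in bigram_counts.items() if h == history]
    distinct = len(continuation_counts)      # how many different words followed this history
    occurrences = sum(continuation_counts)   # how often the history occurred in total
    return distinct / (distinct + occurrences)

print(backoff_weight("spite"))     # ~0.003: backing off is rare, so little mass goes to the lower order
print(backoff_weight("constant"))  # ~0.33: many new continuations expected, so much more back-off mass
```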
1:19:29.449 --> 1:19:40.237 So if you now think that the unigram model is only used when we back off, the probability of 'York' in the unigram 1:19:40.237 --> 1:19:49.947 model should be very, very low, because it almost only occurs after 'New'. So you should have a lower probability for 'York' 1:19:49.947 --> 1:19:56.292 than, for example, for 'foods', although you have seen both of them the same number of times, and 1:19:56.292 --> 1:20:02.853 this is done by Kneser-Ney smoothing, where you are not counting the words themselves, but 1:20:02.853 --> 1:20:05.377 you count the number of histories. 1:20:05.845 --> 1:20:15.233 So, the other way around: by how many different words was it preceded? 1:20:15.233 --> 1:20:28.232 Then you use that count instead of the normal word count. So you don't need to know all the formulas 1:20:28.232 --> 1:20:28.864 here. 1:20:28.864 --> 1:20:33.498 The more important thing is this intuition. 1:20:34.874 --> 1:20:44.646 Using the lower order already means that I haven't seen the larger n-gram, and therefore 1:20:44.646 --> 1:20:49.704 it might be better to model it differently. 1:20:49.929 --> 1:20:56.976 So if there's a new n-gram with some other word and then 'York', that's very improbable compared 1:20:56.976 --> 1:20:57.297 to other continuations. 1:21:00.180 --> 1:21:06.130 And yeah, this modified Kneser-Ney smoothing is what people took into use. 1:21:06.130 --> 1:21:08.249 That's the default approach. 1:21:08.728 --> 1:21:20.481 It has absolute discounting for the n-grams, and for the lower orders it 1:21:20.481 --> 1:21:27.724 uses the counting of histories which we just had. 1:21:28.028 --> 1:21:32.207 And there are even two versions of it, the back-off one and the interpolated one. 1:21:32.472 --> 1:21:34.264 What may be interesting: 1:21:34.264 --> 1:21:40.216 it even works well with interpolation, although the assumption is then no longer 1:21:40.216 --> 1:21:45.592 true, because you're using the lower-order n-grams even if you've seen the higher-order ones. 1:21:45.592 --> 1:21:49.113 But since most of the weight is then on the higher-order n-grams, it still works. 1:21:49.929 --> 1:21:53.522 So if you look at some results on the perplexities, 1:21:54.754 --> 1:22:00.262 you see that normally interpolated modified Kneser-Ney gives you some of the best 1:22:00.262 --> 1:22:00.980 performance. 1:22:02.022 --> 1:22:08.032 You also see that the larger your n-gram order is, with interpolation, 1:22:08.032 --> 1:22:15.168 the significantly better you get, so it pays off to look at more than just the last few words. 1:22:18.638 --> 1:22:32.725 Good, so much for these types of smoothing, and we will finish with some special things about 1:22:32.725 --> 1:22:34.290 language models. 1:22:38.678 --> 1:22:44.225 One thing we talked about is the unknown words; there are different ways of handling them, because 1:22:44.225 --> 1:22:49.409 in all the estimations we were still mostly assuming that we have a fixed vocabulary. 1:22:50.270 --> 1:23:06.372 So you can, for example, create an unknown token and use that while training the statistical language model. 1:23:06.766 --> 1:23:16.292 This was mainly an issue in statistical language processing before the newer neural models came along. 1:23:18.578 --> 1:23:30.573 What is also nice: if you're going to really large n-gram models, it's more 1:23:30.573 --> 1:23:33.114 about efficiency. 1:23:33.093 --> 1:23:37.378 And then you have to store a lot in your model, while the exact probability 1:23:37.378 --> 1:23:41.422 is in a lot of situations not really important.
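Returning to the Kneser-Ney idea of counting histories instead of raw occurrences, here is a minimal sketch of the continuation probability that replaces the plain unigram estimate; all counts are invented for illustration, loosely following the 'York' example above.

```python
from collections import Counter

# Invented bigram counts: 'york' is frequent but almost always preceded by 'new',
# while 'foods' is rarer but appears after many different words.
bigram_counts = Counter({
    ("new", "york"): 470, ("in", "york"): 2,
    ("some", "foods"): 5, ("many", "foods"): 4, ("fried", "foods"): 3,
    ("healthy", "foods"): 2, ("cheap", "foods"): 1,
})

def continuation_probability(word):
    """P_cont(w) = |{v : c(v, w) > 0}| / number of distinct bigram types."""
    distinct_histories = sum(1 for (v, w) in bigram_counts if w == word)
    return distinct_histories / len(bigram_counts)

print(continuation_probability("york"))   # low despite ~472 occurrences: only 2 distinct histories
print(continuation_probability("foods"))  # higher despite far fewer occurrences: 5 distinct histories
```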
1:23:41.661 --> 1:23:46.964 It's more about ranking, so which one is better, and if they don't sum up to one that's not 1:23:46.964 --> 1:23:47.907 that important. 1:23:47.907 --> 1:23:53.563 Of course then you cannot calculate any perplexity anymore, because if this is not a probability 1:23:53.563 --> 1:23:58.807 distribution, then the definition we had based on the negative log probability doesn't fit anymore, and that's not 1:23:58.807 --> 1:23:59.338 working. 1:23:59.619 --> 1:24:02.202 However, this simplification is also very helpful. 1:24:02.582 --> 1:24:13.750 And that is why this 'stupid back-off' was presented: it removes all these complicated things 1:24:13.750 --> 1:24:14.618 which we discussed. 1:24:15.055 --> 1:24:28.055 It just does this: if we have seen the n-gram, we directly take the relative frequency from the raw counts, and otherwise we back off with a fixed weight. 1:24:28.548 --> 1:24:41.867 There is no discounting anymore, so it's very, very simple, and yet, as they showed, you 1:24:41.867 --> 1:24:47.935 have to calculate a lot fewer statistics. 1:24:50.750 --> 1:24:57.525 In addition, you can have other types of language models. 1:24:57.525 --> 1:25:08.412 We had word-based language models, and they normally go up to four-, five-, or six-grams. 1:25:08.412 --> 1:25:10.831 Beyond that they get too large. 1:25:11.531 --> 1:25:20.570 So what people have then also looked into is what is referred to as a part-of-speech language 1:25:20.570 --> 1:25:21.258 model. 1:25:21.258 --> 1:25:29.806 So instead of looking at the word sequence, you're modeling directly the part-of-speech 1:25:29.806 --> 1:25:30.788 sequence. 1:25:31.171 --> 1:25:34.987 Then of course you're only modeling syntax. 1:25:34.987 --> 1:25:41.134 There's no semantic information anymore in the part-of-speech tags, but now you might go 1:25:41.134 --> 1:25:47.423 to a larger context length, so you can do seven- or nine-grams, and then you can capture some 1:25:47.423 --> 1:25:50.320 of the long-range dependencies. 1:25:52.772 --> 1:25:59.833 And there are other things people have done, like cache language models. The idea in a cache 1:25:59.833 --> 1:26:07.052 language model is that words that you have recently seen are 1:26:07.052 --> 1:26:11.891 more probable to reoccur, so you want to model these dynamics. 1:26:12.152 --> 1:26:20.734 For example, if I'm talking here and have mentioned language models in my presentation, 1:26:20.734 --> 1:26:23.489 this term will occur a lot more often. 1:26:23.883 --> 1:26:37.213 You can do that by having a static and a dynamic component, where the dynamic component 1:26:37.213 --> 1:26:41.042 looks, for example, at the bigrams of the recent text. 1:26:41.261 --> 1:26:49.802 And thereby, for example, if you have once generated a word, its language model probability is increased, 1:26:49.802 --> 1:26:52.924 and you're modeling that phenomenon. 1:26:56.816 --> 1:27:03.114 As said, the dynamic component is trained on the text translated so far. 1:27:04.564 --> 1:27:12.488 It is trained on what you have just produced; there's no human feedback there. 1:27:12.712 --> 1:27:25.466 So the model sees its own output all the time, and then it will repeat its errors, and that is, of course, a danger. 1:27:25.966 --> 1:27:31.506 A similar idea: people have looked into trigger language models, where if one word occurs, 1:27:31.506 --> 1:27:34.931 then you increase the probability of some other words. 1:27:34.931 --> 1:27:40.596 So if you're talking about money, that will increase the probability of 'bank', 'savings account', 1:27:40.596 --> 1:27:41.343 'dollar', and so on.
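For the 'stupid back-off' scheme mentioned above, a minimal sketch; note that it returns scores rather than normalized probabilities, which is exactly why perplexity can no longer be computed from it. The fixed factor 0.4 is a commonly used value, and the corpus is again a toy assumption.

```python
from collections import Counter

corpus = "scottish beer drinkers like scottish beer and beer drinkers like beer".split()
counts = {n: Counter(zip(*[corpus[i:] for i in range(n)])) for n in (1, 2, 3)}
total = len(corpus)

BACKOFF = 0.4  # fixed back-off factor; no discounting of the seen events at all

def stupid_backoff_score(w, history):
    """Relative frequency if the n-gram was seen, otherwise a fixed fraction of the lower-order score."""
    n = len(history) + 1
    if n == 1:
        return counts[1][(w,)] / total
    c = counts[n][tuple(history) + (w,)]
    if c > 0:
        return c / counts[n - 1][tuple(history)]
    return BACKOFF * stupid_backoff_score(w, history[1:])

print(stupid_backoff_score("drinkers", ("scottish", "beer")))  # seen: plain relative frequency
print(stupid_backoff_score("like", ("scottish", "beer")))      # unseen trigram/bigram: 0.4 * 0.4 * unigram score
```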
1:27:41.801 --> 1:27:47.352 Then you have to somehow model this dependency, but it's also an idea for 1:27:47.352 --> 1:27:52.840 modeling long-range dependencies, because if one word occurs very often in your document, 1:27:52.840 --> 1:27:58.203 you are somehow learning which other words tend to co-occur with it more often 1:27:58.203 --> 1:27:59.201 than by chance. 1:28:02.822 --> 1:28:10.822 Yes, then the last thing is, of course, especially for languages which are morphologically 1:28:10.822 --> 1:28:11.292 rich. 1:28:11.292 --> 1:28:18.115 You can do something similar to BPE, so you can now split into morphemes or so, and then model 1:28:18.115 --> 1:28:22.821 the morpheme sequence, because the individual morphemes occur more often. 1:28:23.023 --> 1:28:26.877 However, the problem is of course that your sequence length also gets longer. 1:28:27.127 --> 1:28:33.185 And so if you have a four-gram language model, it's not conditioning on the last three words but 1:28:33.185 --> 1:28:35.782 only on the last three morphemes, which is a shorter context. 1:28:36.196 --> 1:28:39.833 So of course then it's a bit challenging to know how to deal with that. 1:28:40.680 --> 1:28:51.350 What about a language like Finnish, with, for example, inflections at the end of the word? 1:28:51.350 --> 1:28:58.807 Yeah, there you can typically do something like that. 1:28:59.159 --> 1:29:02.157 There is not the one perfect solution. 1:29:02.157 --> 1:29:05.989 You have to do a bit of testing to see what is best. 1:29:06.246 --> 1:29:13.417 One way of dealing with a large vocabulary, with words you haven't seen, is to split these words 1:29:13.417 --> 1:29:20.508 into parts, either more linguistically motivated, into morphemes, or more 1:29:20.508 --> 1:29:25.826 statistically motivated, like we have in byte pair encoding. 1:29:28.188 --> 1:29:33.216 Only the representation of your text is different. 1:29:33.216 --> 1:29:41.197 How you are later doing all the counting and the statistics is the same. 1:29:41.197 --> 1:29:44.914 What changes is what you assume to be your sequence of units. 1:29:45.805 --> 1:29:49.998 That's the same for the other variants we had here. 1:29:49.998 --> 1:29:55.390 Here you don't have words, but everything else you're doing is done exactly the same. 1:29:57.857 --> 1:29:59.457 Some practical issues: 1:29:59.457 --> 1:30:05.646 typically you're doing things in log space and adding, because multiplying very 1:30:05.646 --> 1:30:09.819 small values sometimes gives you numerical problems. 1:30:10.230 --> 1:30:16.687 The good thing is you mostly don't have to take care of this yourself, as there are very good toolkits 1:30:16.687 --> 1:30:23.448 like SRILM or KenLM where you can just give your data and they will train the 1:30:23.448 --> 1:30:30.286 language model, do all the complicated maths behind that, and you are able to run them. 1:30:31.911 --> 1:30:39.894 So what you should keep from today is what a language model is, how we can do maximum 1:30:39.894 --> 1:30:44.199 likelihood training on that, and the different smoothing techniques for language models. 1:30:44.199 --> 1:30:49.939 Similar ideas are used for a lot of different statistical models, 1:30:50.350 --> 1:30:52.267 where you always have the problem of events you have never seen. 1:30:53.233 --> 1:31:01.608 A different way of looking at it and doing it we will see on Thursday, when we will go to neural language models.
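As a tiny illustration of the log-space remark above: multiplying many small per-word probabilities underflows to zero in floating point, while summing their logarithms stays perfectly usable as a score; the numbers are arbitrary.

```python
import math

word_probs = [1e-7] * 60   # hypothetical per-word probabilities of a 60-word sentence

product = 1.0
for p in word_probs:
    product *= p
print(product)             # 0.0: the product has underflowed

log_score = sum(math.log(p) for p in word_probs)
print(log_score)           # about -967: still fine for comparing hypotheses
```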