WEBVTT
0:00:01.721 --> 0:00:08.584
Hey, and welcome to today's lecture on language modeling.
0:00:09.409 --> 0:00:21.608
Last time we had a different view on machine translation, which was the evaluation part:
0:00:21.608 --> 0:00:24.249
it's important to evaluate and see how well systems work.
0:00:24.664 --> 0:00:33.186
We want to continue with building the MT system, and this will be the last part before we are
0:00:33.186 --> 0:00:36.668
going into the neural step on Thursday.
0:00:37.017 --> 0:00:45.478
So we had the broader view on statistical machine translation.
0:00:45.385 --> 0:00:52.977
A week ago, on Thursday, we talked about statistical machine translation and mainly the translation
0:00:52.977 --> 0:00:59.355
model, so how we model how probable it is that one word is translated into another.
0:01:00.800 --> 0:01:15.583
However, there is another component when doing generation tasks in general, and machine translation in particular.
0:01:16.016 --> 0:01:23.797
There are several characteristics which you only need to model on the target side; in the
0:01:23.797 --> 0:01:31.754
traditional approach we talked about the generation from a more semantic or syntactic
0:01:31.754 --> 0:01:34.902
representation into the real words.
0:01:35.555 --> 0:01:51.013
And the challenge is that there are some constructs which are only there in the target language.
0:01:52.132 --> 0:01:57.908
You cannot really get those from the translation; it's more something that needs to be modeled on
0:01:57.908 --> 0:01:58.704
the target side.
0:01:59.359 --> 0:02:05.742
And this is typically done by a language model, and this concept of a language model
0:02:06.326 --> 0:02:11.057
is, I guess you can assume, nowadays very important.
0:02:11.057 --> 0:02:20.416
You've read a lot about large language models recently, and they are all somehow trained on,
0:02:20.416 --> 0:02:22.164
or based on, this idea.
0:02:25.986 --> 0:02:41.802
What we'll look at today, if we go to the next slide, is what a language model is and today's
0:02:41.802 --> 0:02:42.992
focus.
0:02:43.363 --> 0:02:49.188
This was the common approach to language modeling for twenty or thirty years, so for a long
0:02:49.188 --> 0:02:52.101
time it was really the state of the art.
0:02:52.101 --> 0:02:58.124
And people have used that in many applications, in machine translation and automatic speech
0:02:58.124 --> 0:02:58.985
recognition.
0:02:59.879 --> 0:03:11.607
Again you are measuring the performance, but this is purely the performance of the language
0:03:11.607 --> 0:03:12.499
model.
0:03:13.033 --> 0:03:23.137
And then we will see that the traditional language model has a major drawback in how
0:03:23.137 --> 0:03:24.683
we can deal with unseen events.
0:03:24.944 --> 0:03:32.422
So if you model language, you will see that most of the sentences you have not really
0:03:32.422 --> 0:03:39.981
seen before, and you're still able to assess if this is good language or if this is native language.
0:03:40.620 --> 0:03:45.092
And this is challenging if you just do plain parameter estimation.
0:03:45.605 --> 0:03:59.277
We are using two different techniques to address that, smoothing and interpolation, and these are essentially there in
0:03:59.277 --> 0:04:01.735
order to build a good model.
0:04:01.881 --> 0:04:11.941
It also motivates why things might be easier if we are going to neural models, as we will.
0:04:12.312 --> 0:04:18.203
And at the end we'll talk a bit about some additional types of language models which are
0:04:18.203 --> 0:04:18.605
also used.
0:04:20.440 --> 0:04:29.459
So where are language models used, or how are they used in machine translation?
0:04:30.010 --> 0:04:38.513
The idea of a language model is that we are modeling the fluency of language.
0:04:38.898 --> 0:04:49.381
So if you have, for example, the beginning of a sentence, then you can estimate that some words
0:04:49.669 --> 0:05:08.929
are valid as the next word, while other words are not.
0:05:09.069 --> 0:05:13.673
And we can do that.
0:05:13.673 --> 0:05:22.192
We have seen the noisy channel model,
0:05:22.322 --> 0:05:33.991
which we saw some two weeks ago, and today we will look into how we can model P
0:05:33.991 --> 0:05:36.909
of Y, or how probable a sentence is.
0:05:37.177 --> 0:05:44.192
Now this is completely independent of the translation process:
0:05:44.192 --> 0:05:49.761
how fluent is a sentence, and how can you express that?
0:05:51.591 --> 0:06:01.699
And this language modeling task has one really big advantage, and I assume that is even the biggest
0:06:01.699 --> 0:06:02.935
advantage.
0:06:03.663 --> 0:06:16.345
The big advantage is the data we need to train it; normally we are doing supervised learning.
0:06:16.876 --> 0:06:20.206
So machine translation, as we talked about:
0:06:20.206 --> 0:06:24.867
that means we have the source sentence and the target sentence.
0:06:25.005 --> 0:06:27.620
They need to be aligned.
0:06:27.620 --> 0:06:31.386
We looked into how we can model them.
0:06:31.386 --> 0:06:39.270
Generally, the problem with this is the data. In machine translation you still have the advantage
0:06:39.270 --> 0:06:45.697
that there are quite huge amounts of this data for many languages, not all but many, but for other
0:06:45.697 --> 0:06:47.701
tasks it is even more difficult.
0:06:47.701 --> 0:06:50.879
There is very little data where you have, say, summaries.
0:06:51.871 --> 0:07:02.185
So the big advantage of a language model is that we're only modeling the sentences, so we only
0:07:02.185 --> 0:07:04.103
need pure text.
0:07:04.584 --> 0:07:11.286
And pure text, especially since we have the Internet, is available in large amounts.
0:07:11.331 --> 0:07:17.886
Of course, it's still maybe only for some domains, some types of text.
0:07:18.198 --> 0:07:23.466
If you want to have data for, say, speech about machine translation,
0:07:23.466 --> 0:07:27.040
maybe there's only limited data for that.
0:07:27.027 --> 0:07:40.030
And also, if you go to some more exotic languages, then you will have less
0:07:40.030 --> 0:07:40.906
data.
0:07:41.181 --> 0:07:46.803
And in language models we can now look at how we can make use of this data.
0:07:47.187 --> 0:07:54.326
Nowadays this is often also framed as self-supervised learning, because on the one
0:07:54.326 --> 0:08:00.900
hand, as we'll see, it's a kind of classification task, or supervised learning, but we create the
0:08:00.900 --> 0:08:02.730
labels from the data itself.
0:08:02.742 --> 0:08:13.922
So it's not that we have pairs of text and labels, but we have only the text.
0:08:15.515 --> 0:08:21.367
So the question is how we can use this data for modeling and how we can train our language model.
0:08:22.302 --> 0:08:35.086
The main goal is to produce fluent English, so we want to somehow model that something
0:08:35.086 --> 0:08:38.024
is a sentence of a language.
0:08:38.298 --> 0:08:44.897
So there is no clear separation between semantics and syntax, and in this case we don't need
0:08:44.897 --> 0:08:46.317
a clear separation.
0:08:46.746 --> 0:08:50.751
So we will model them somehow in there.
0:08:50.751 --> 0:08:56.091
There will be some notion of semantics, some notion of syntax.
0:08:56.076 --> 0:09:08.748
Because you want to model how fluent or probable it is that a native speaker is producing
0:09:08.748 --> 0:09:12.444
that sentence.
0:09:12.512 --> 0:09:17.711
We are rarely saying things that are semantically wrong, and therefore there is
0:09:17.711 --> 0:09:18.679
also some semantics in there.
0:09:19.399 --> 0:09:24.048
So, for example, 'the house is small'
0:09:24.048 --> 0:09:30.455
should have a higher probability than 'the home is small'.
0:09:31.251 --> 0:09:38.112
Because 'home' and 'house' both mean the same in German, but they are used differently.
0:09:38.112 --> 0:09:43.234
For example, it should be more probable that the plane…
0:09:44.444 --> 0:09:51.408
So this is both syntactically correct, but semantically not.
0:09:51.408 --> 0:09:58.372
But still you will see one much more often, and the probability should reflect that.
0:10:03.883 --> 0:10:14.315
So more formally, the language model should be some type of function, and it gives
0:10:14.315 --> 0:10:18.690
us the probability that this sentence occurs,
0:10:19.519 --> 0:10:27.312
indicating that this is good English; we say English here, but of course you can do that for any language.
0:10:28.448 --> 0:10:37.609
In earlier times people have even tried to do that deterministically; that was especially
0:10:37.609 --> 0:10:40.903
used for dialogue systems.
0:10:40.840 --> 0:10:50.660
You have a very strict syntax, so you can only use things like 'turn off the…', 'turn off the radio',
0:10:50.690 --> 0:10:56.928
or something else, but you have a very strict deterministic finite-state grammar defining which
0:10:56.928 --> 0:10:58.107
types of phrases are allowed.
0:10:58.218 --> 0:11:04.791
The problem, of course, if we're dealing with language is that language is variable; we're
0:11:04.791 --> 0:11:10.183
not always speaking correct sentences, and so this type of deterministic approach doesn't really work.
0:11:10.650 --> 0:11:22.121
That's why, for many, many years already, people have looked into statistical language models and tried
0:11:22.121 --> 0:11:24.587
to model this with probabilities.
0:11:24.924 --> 0:11:35.096
So, something like: what is the probability of a sequence of words? That is what we want to model.
0:11:35.495 --> 0:11:43.076
The advantage of doing it statistically is that we can train on large text databases, so we
0:11:43.076 --> 0:11:44.454
can learn it from data.
0:11:44.454 --> 0:11:52.380
We don't have to define it by hand, and in most of these cases we don't want to have the hard decision:
0:11:52.380 --> 0:11:55.481
this is a sentence of the language or not.
0:11:55.815 --> 0:11:57.914
Instead, we want to have some type of probability:
0:11:57.914 --> 0:11:59.785
how probable is this part of the sentence?
0:12:00.560 --> 0:12:04.175
Because, yeah, even for humans it's not always clear:
0:12:04.175 --> 0:12:06.782
is this a sentence that you can use or not?
0:12:06.782 --> 0:12:12.174
I mean, just in this presentation I gave several sentences which are not correct English.
0:12:12.174 --> 0:12:17.744
So it might still happen that people speak or write sentences that are not correct,
0:12:17.744 --> 0:12:19.758
and you want to deal with all of them.
0:12:20.020 --> 0:12:25.064
So that is then, of course, a big advantage if you use the more statistical models.
0:12:25.705 --> 0:12:35.810
The disadvantage is that you need suitably large text databases, which exist for
0:12:35.810 --> 0:12:37.567
many languages, but not all.
0:12:37.857 --> 0:12:46.511
Nowadays you see that there are, of course, issues in that you need large computational resources
0:12:46.511 --> 0:12:47.827
to deal with it.
0:12:47.827 --> 0:12:56.198
You need to collect all of this with crawlers on the internet, which can create enormous amounts
0:12:56.198 --> 0:12:57.891
of training data.
0:12:58.999 --> 0:13:08.224
So if we want to build this, then the question is, of course: how can we estimate the probability?
0:13:08.448 --> 0:13:10.986
So how probable is the sentence 'good morning'?
0:13:11.871 --> 0:13:15.450
And you all know basic statistics.
0:13:15.450 --> 0:13:21.483
So if you see this, you take a large database of sentences.
0:13:21.901 --> 0:13:28.003
I made this a real example, so this was from the TED talks.
0:13:28.003 --> 0:13:37.050
I guess most of you have heard about them, and you count in how many sentences
0:13:37.050 --> 0:13:38.523
'good morning' occurs.
0:13:38.718 --> 0:13:49.513
From how often it happens, the probability of 'good morning' is three point something times ten to the power of minus…
0:13:50.030 --> 0:13:53.755
Okay, so this is a very easy thing.
0:13:53.755 --> 0:13:58.101
We can directly model the language model this way.
0:13:58.959 --> 0:14:03.489
Does anybody see a problem why this might not be the final solution?
0:14:06.326 --> 0:14:14.962
I think we would need a whole lot more sentences to make anything useful of this.
0:14:15.315 --> 0:14:29.340
Because the probability of a talk starting with 'good morning' is much higher
0:14:29.340 --> 0:14:32.084
than ten to the minus…
0:14:33.553 --> 0:14:41.700
The probability represented in this way is not how we usually think about it.
0:14:42.942 --> 0:14:55.038
The probability is even OK, but you're going in the right direction with the large data.
0:14:55.038 --> 0:14:59.771
Yes, you can't form a new sentence.
0:15:00.160 --> 0:15:04.763
It's about the large data: you said it's hard to get enough data.
0:15:04.763 --> 0:15:05.931
It's impossible.
0:15:05.931 --> 0:15:11.839
I would say we are always saying sentences which have never been said, and we are able
0:15:11.839 --> 0:15:12.801
to deal with them.
0:15:13.133 --> 0:15:25.485
The problem is the sparsity of the data: there will be a lot of perfect English sentences we have never seen.
0:15:26.226 --> 0:15:31.338
And this is, of course, not what we want.
0:15:31.338 --> 0:15:39.332
If we want to model that, we need to have a model which can really estimate how good an unseen sentence is.
0:15:39.599 --> 0:15:47.970
And if we are just counting this way, most sentences will get a zero probability, which
0:15:47.970 --> 0:15:48.722
is not what we want.
0:15:49.029 --> 0:15:56.572
So we need to do things a bit differently.
0:15:56.572 --> 0:16:06.221
For the translation models we already had some idea of doing that.
0:16:06.486 --> 0:16:08.058
And that we can do here again.
0:16:08.528 --> 0:16:12.866
So we can especially use the chain rule:
0:16:12.772 --> 0:16:19.651
the chain rule and the definition of conditional probability. So the conditional probability
0:16:19.599 --> 0:16:26.369
of an event B given an event A is the probability of A and B divided by the probability of A.
0:16:26.369 --> 0:16:32.720
Yes, I recently had an exam on automatic speech recognition, and the professor said this is not
0:16:32.720 --> 0:16:39.629
called the chain rule, because I used this terminology, and he said it's just applying Bayes' rule.
0:16:40.500 --> 0:16:56.684
But this is definitely the definition of conditional probability.
0:16:57.137 --> 0:17:08.630
The conditional probability of B given A is defined as P(A and B) divided by P(A).
0:17:08.888 --> 0:17:16.392
And that can easily be rewritten into P(A and B) equals P(A) times P(B given A).
0:17:16.816 --> 0:17:35.279
And the nice thing is, we can easily extend it, of course, to more variables, so we can
0:17:35.279 --> 0:17:38.383
have P(A, B, C) = P(A) · P(B | A) · P(C | A, B), and so on.
0:17:38.383 --> 0:17:49.823
So more generally, you can do that for any length of sequence.
0:17:50.650 --> 0:18:04.802
So if we are now going back to words, we can model the probability of the sequence
0:18:04.802 --> 0:18:08.223
as the probability of each word given its history.
0:18:08.908 --> 0:18:23.717
Maybe it's clearer if we're looking at real words: so if we have P of 'its water is
0:18:23.717 --> 0:18:26.914
so transparent',
0:18:26.906 --> 0:18:39.136
this way we are able to model the probability of the whole sentence by
0:18:39.136 --> 0:18:42.159
looking at each word.
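To make the decomposition concrete, the chain rule can be written out as follows (standard notation, not from the slides):

    P(w_1, ..., w_n) = \prod_{i=1}^{n} P(w_i | w_1, ..., w_{i-1})

For the example: P(its water is so transparent) = P(its) · P(water | its) · P(is | its water) · P(so | its water is) · P(transparent | its water is so).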
0:18:42.762 --> 0:18:49.206
And of course the big advantage is that each word occurs less often than the full sentence.
0:18:49.206 --> 0:18:54.991
So hopefully we have seen it. Still, of course, there is the problem that if a word doesn't occur,
0:18:54.991 --> 0:19:01.435
then this doesn't work; but most of the lecture today is about dealing with
0:19:01.435 --> 0:19:01.874
this.
0:19:02.382 --> 0:19:08.727
But first of all, this is generally at least easier than the thing we had before.
0:19:13.133 --> 0:19:23.531
Does that really make it easier? No, because those histories get utterly long, and we have the same problem.
0:19:23.943 --> 0:19:29.628
Yes, exactly: when we look at the last probability here, we still have to have seen the full history.
0:19:30.170 --> 0:19:38.146
So if we want to model 'that' given 'its water is so transparent', we have to have seen the full sequence.
0:19:38.578 --> 0:19:48.061
So in this first step we don't gain that much; we really still have to have seen the full sentence.
0:19:48.969 --> 0:19:52.090
However, we are a little bit of a step nearer.
0:19:52.512 --> 0:19:59.673
So this is still a problem, and we will never have seen all of these histories.
0:20:00.020 --> 0:20:08.223
You can look at it like this: if you have a vocabulary of so many words
0:20:08.223 --> 0:20:17.956
and, for example, a certain average sentence length, the number of possible sequences is enormous.
0:20:18.298 --> 0:20:22.394
And we are quite sure we have never seen that much data.
0:20:22.902 --> 0:20:26.246
So we cannot really compute this probability.
0:20:26.786 --> 0:20:37.794
However, there's a trick how we can do that, and that's the idea behind most of the language models.
0:20:38.458 --> 0:20:44.446
So instead of saying how often this word happens after exactly this history, we are trying
0:20:44.446 --> 0:20:50.433
to do some kind of clustering and cluster a lot of different histories into the same class,
0:20:50.433 --> 0:20:55.900
and then we are modeling the probability of the word given this class of histories.
0:20:56.776 --> 0:21:06.245
And then, of course, the big design decision is how to cluster the histories.
0:21:06.666 --> 0:21:17.330
So how do we put all these histories together so that we have seen each of them often enough
0:21:17.330 --> 0:21:18.396
so that we can estimate the probabilities?
0:21:20.320 --> 0:21:25.623
There are quite different types of things people can do.
0:21:25.623 --> 0:21:33.533
You can add part-of-speech tags, you can use semantic word classes, you can model similarity,
0:21:33.533 --> 0:21:46.113
you can model grammatical content, and things like that. However, as quite often with these statistical
0:21:46.113 --> 0:21:53.091
models, there is a very simple solution that works well.
0:21:53.433 --> 0:21:58.455
And this is what most statistical models do.
0:21:58.455 --> 0:22:09.616
They are based on the so-called Markov assumption, and that means we are assuming all this history
0:22:09.616 --> 0:22:12.183
is not that important.
0:22:12.792 --> 0:22:25.895
So we are modeling the probability of 'that' given just 'so transparent', so maybe the last two
0:22:25.895 --> 0:22:29.534
words, by having a fixed history length.
0:22:29.729 --> 0:22:38.761
So the class of the whole history from word one to word i minus one is just the last two words.
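As a formula, this Markov assumption for a trigram model reads (standard notation):

    P(w_i | w_1, ..., w_{i-1}) \approx P(w_i | w_{i-2}, w_{i-1})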
0:22:39.679 --> 0:22:45.229
And doing this classification, of course, does not need any additional knowledge.
0:22:45.545 --> 0:22:51.176
It's very easy to calculate, and we have now limited our histories.
0:22:51.291 --> 0:23:00.906
So instead of an arbitrarily long history here, we have only a short one.
0:23:00.906 --> 0:23:10.375
For example, if we have two-grams, still a lot of them will not occur, but far fewer.
0:23:10.930 --> 0:23:20.079
So it's a very simple trick to merge all these histories into a few classes, motivated,
0:23:20.079 --> 0:23:24.905
of course, by language: the nearest things matter most.
0:23:24.944 --> 0:23:33.043
A lot of sequences mainly depend on the previous words, and things which are far
0:23:33.043 --> 0:23:33.583
away matter less.
0:23:38.118 --> 0:23:47.361
In our product here, everything is modeled not by the whole history but by the last
0:23:47.361 --> 0:23:48.969
n minus one words.
0:23:50.470 --> 0:23:54.322
And this is typically how people express it.
0:23:54.322 --> 0:24:01.776
They're therefore also talking about an n-gram language model, because we are always looking
0:24:01.776 --> 0:24:06.550
at these chunks of n words and modeling the probability.
0:24:07.527 --> 0:24:10.485
So let's again start with the simplest case.
0:24:10.485 --> 0:24:15.485
The most extreme is the unigram case: we're ignoring the whole history.
0:24:15.835 --> 0:24:24.825
The probability of a sequence of words is just the product of the probabilities of each of the words in
0:24:24.825 --> 0:24:25.548
there.
0:24:26.046 --> 0:24:32.129
And thereby we are removing the whole context.
0:24:32.129 --> 0:24:40.944
The most probable sequence would be something like the most probable word repeated, which is 'the'.
0:24:42.162 --> 0:24:44.694
The most probable word just by itself.
0:24:44.694 --> 0:24:49.684
It might not make sense, but it can, of course, give you a bit of
0:24:49.629 --> 0:24:52.682
intuition, like which types of words should be more frequent.
0:24:53.393 --> 0:25:00.012
And what you can do is train such a model, and then you can just automatically generate text.
0:25:00.140 --> 0:25:09.496
This sequence is generated by sampling; we will come to that later in the lecture, too.
0:25:09.496 --> 0:25:16.024
Sampling means that you randomly pick a word, but based on its probability.
0:25:16.096 --> 0:25:22.711
So if the probability of one word is 0.2, then you'll pick it in twenty percent of the cases, and
0:25:22.711 --> 0:25:23.157
so on for every other word.
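As a small illustration of this sampling procedure, here is a minimal Python sketch; the toy distribution and the function are my own, not from the lecture:

    import random

    # Toy unigram distribution (invented numbers; they sum to 1.0).
    unigram = {"the": 0.30, "of": 0.20, "is": 0.15,
               "so": 0.25, "water": 0.05, "transparent": 0.05}

    def sample_word(dist):
        """Pick one word at random, weighted by its probability."""
        r = random.random()            # uniform draw in [0, 1)
        cumulative = 0.0
        for word, p in dist.items():
            cumulative += p
            if r < cumulative:
                return word
        return word                    # fallback against rounding error

    # With a unigram model each word is sampled independently.
    print(" ".join(sample_word(unigram) for _ in range(10)))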
0:25:23.483 --> 0:25:36.996
And if you look at the generated output here, for example, you'll see that these are frequently occurring
0:25:36.996 --> 0:25:38.024
words.
0:25:38.138 --> 0:25:53.467
But you see there's not really any continuous type of structure, because each word is modeled
0:25:53.467 --> 0:25:55.940
independently.
0:25:57.597 --> 0:26:03.037
You can do better, though, by going to a bigram, so then we're having a bit of context.
0:26:03.037 --> 0:26:08.650
Of course, it's still very small: the probability of the actual word only depends
0:26:08.650 --> 0:26:12.429
on the previous word, and all the context before that is ignored.
0:26:13.133 --> 0:26:18.951
This of course is also wrong, we will come to that, but it models regular language significantly
0:26:18.951 --> 0:26:19.486
better.
0:26:19.779 --> 0:26:28.094
Some things here still don't really make a lot of sense, but you're seeing some
0:26:28.094 --> 0:26:29.682
typical phrases.
0:26:29.949 --> 0:26:39.619
'In this hope' doesn't make sense, but 'in this issue' is also frequent.
0:26:39.619 --> 0:26:51.335
Very nice is this one here, 'new car parking lot': if you have the word
0:26:51.335 --> 0:26:53.634
'new', then the word 'car'
0:26:53.893 --> 0:27:01.428
is also quite common; but after 'new car' you wouldn't put 'parking'.
0:27:01.428 --> 0:27:06.369
After 'car', though, a frequent continuation is 'parking lot'.
0:27:06.967 --> 0:27:12.417
And now it's very interesting, because here we see the two semantic meanings of 'lot': you
0:27:12.417 --> 0:27:25.889
have a parking lot, but in general, if you just think about the history, the most common use
0:27:25.889 --> 0:27:27.353
is 'a lot'.
0:27:27.527 --> 0:27:33.392
So you see that it's really not using the context from before, but only the immediate
0:27:33.392 --> 0:27:33.979
context.
0:27:38.338 --> 0:27:41.371
So in general we can, of course, make that longer.
0:27:41.371 --> 0:27:43.888
We can do unigrams, bigrams, trigrams.
0:27:45.845 --> 0:27:52.061
People typically went up to four- or five-grams, and then it's getting difficult, because
0:27:52.792 --> 0:27:56.671
there are so many five-grams that it's getting complicated.
0:27:56.671 --> 0:28:02.425
Storing all of them makes these models so big that it's no longer working, and
0:28:02.425 --> 0:28:08.050
of course at some point the estimation of the probabilities also gets too difficult,
0:28:08.050 --> 0:28:09.213
because each of them occurs rarely.
0:28:09.429 --> 0:28:14.777
If you have a small corpus, of course, you will use a smaller n-gram length.
0:28:14.777 --> 0:28:16.466
With a larger corpus you can take a larger one.
0:28:18.638 --> 0:28:24.976
What is important to keep in mind is that, of course, this assumption is wrong.
0:28:25.285 --> 0:28:36.608
We do have long-range dependencies, and if we really want to model everything in language,
0:28:36.608 --> 0:28:37.363
then a fixed context is not enough.
0:28:37.337 --> 0:28:46.965
So here is one of these extreme cases: 'the computer which I had just put into the machine
0:28:46.965 --> 0:28:49.423
room on the fifth floor crashed'.
0:28:49.423 --> 0:28:55.978
Somehow, there is a dependency between 'computer' and 'crashed'.
0:28:57.978 --> 0:29:10.646
However, in most situations these are typically rare, and normally the most important things happen
0:29:10.646 --> 0:29:13.446
in the near context.
0:29:15.495 --> 0:29:28.408
But of course it's important to keep in mind that you can't model these things, so you can't do
0:29:28.408 --> 0:29:29.876
everything with n-grams.
0:29:33.433 --> 0:29:50.200
The next question is again how we can train this, so we have to estimate these probabilities.
0:29:51.071 --> 0:30:00.131
And the question is how we do that; again, we take the most simple thing.
0:30:00.440 --> 0:30:03.168
That is exactly the maximum likelihood estimation.
0:30:03.168 --> 0:30:12.641
What gives you the right answer is: how probable is it that this word follows word i minus
0:30:12.641 --> 0:30:13.370
one?
0:30:13.370 --> 0:30:20.946
You just count how often this sequence happens.
0:30:21.301 --> 0:30:28.165
I guess this is what most of you would have intuitively done, and this also works best.
0:30:28.568 --> 0:30:39.012
So it's not complicated to train: you have to go over your corpus once, you have to count
0:30:39.012 --> 0:30:48.662
all bigrams and unigrams, and then you can directly train the basic language model.
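A minimal sketch of this counting in Python (my own illustration with a made-up toy corpus; <s> and </s> mark sentence boundaries):

    from collections import Counter

    corpus = [["<s>", "i", "want", "dutch", "food", "</s>"],
              ["<s>", "i", "want", "chinese", "food", "</s>"],
              ["<s>", "tell", "me", "about", "dutch", "food", "</s>"]]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)                    # count single words
        bigrams.update(zip(sentence, sentence[1:]))  # count adjacent pairs

    def p_mle(word, history):
        """Maximum likelihood estimate: count(history, word) / count(history)."""
        return bigrams[(history, word)] / unigrams[history]

    print(p_mle("i", "<s>"))       # 2/3: two of three sentences start with "i"
    print(p_mle("dutch", "want"))  # 1/2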
0:30:49.189 --> 0:30:50.651
Where is it difficult?
0:30:50.651 --> 0:30:58.855
There are two difficulties: the basic language model doesn't work that well because of zero
0:30:58.855 --> 0:31:03.154
counts, and how we address that; and the second is efficiency,
0:31:03.163 --> 0:31:13.716
because we saw that especially if you go for larger n, you have to store all these n-grams
0:31:13.716 --> 0:31:15.275
efficiently.
0:31:17.697 --> 0:31:21.220
So how can we do that?
0:31:21.220 --> 0:31:24.590
Here are some examples.
0:31:24.590 --> 0:31:33.626
For example, take this sequence as your training corpus.
0:31:33.713 --> 0:31:41.372
You see that this word happens after the sentence start, and the sequence happens two times.
0:31:42.182 --> 0:31:45.651
We have three sentence starts in total.
0:31:45.651 --> 0:31:58.043
The same start occurs twice, so the one probability is two thirds and the other probability one third.
0:31:58.858 --> 0:32:09.204
Here we have what is following: one continuation occurs twice and one once, so again two thirds and one third.
0:32:09.809 --> 0:32:20.627
And this is all that you need to know here about it, so you can do this calculation.
0:32:23.723 --> 0:32:35.506
So the question then, of course, is: what do we really learn in these types of models?
0:32:35.506 --> 0:32:45.549
Here are examples from the Europarl corpus: 'the green', 'the red', and 'the blue', and here
0:32:45.549 --> 0:32:48.594
you have the probabilities of which word is next.
0:32:48.989 --> 0:33:01.897
You see that there is a lot more in there than just syntax, because the initial phrase is always the
0:33:01.897 --> 0:33:02.767
same.
0:33:03.163 --> 0:33:10.132
For example, you see 'the Green Paper' and 'the Green Group',
0:33:10.132 --> 0:33:16.979
it's from the European Parliament, and 'the Red Cross'.
0:33:17.197 --> 0:33:21.777
What you also see is that sometimes it's easy, and sometimes it's more difficult.
0:33:22.302 --> 0:33:28.345
So, for example, following 'the red', in one hundred cases it was 'the Red Cross'.
0:33:28.668 --> 0:33:48.472
So there it seems to be easier to guess the next word.
0:33:48.528 --> 0:33:55.152
So there are different types of information encoded in there. You also know that, I guess: sometimes
0:33:55.152 --> 0:33:58.675
you directly know how the speaker will continue;
0:33:58.675 --> 0:34:04.946
there's not a lot of new information in the next word. But in other cases, like 'blue', there's
0:34:04.946 --> 0:34:06.496
a lot of information.
0:34:11.291 --> 0:34:14.849
Another example is the Berkeley Restaurant sentences.
0:34:14.849 --> 0:34:21.059
They were collected at Berkeley, and you have sentences like 'can you tell me about any good spaghetti
0:34:21.059 --> 0:34:21.835
restaurants'
0:34:21.835 --> 0:34:27.463
or 'mid-priced Thai food is what I'm looking for', so it's more like a dialogue system; people
0:34:27.463 --> 0:34:31.215
have collected this data, and of course you can also look
0:34:31.551 --> 0:34:46.878
into it and get the counts. So you count the bigrams in the top table; the row is the first word and the column the second.
0:34:49.409 --> 0:34:52.912
So this is the bigram where this is the first word and that the second.
0:34:52.912 --> 0:34:54.524
This entry here is one.
0:34:56.576 --> 0:35:12.160
'Want to' has a high probability, but 'want a' a lot less, and there you see it, for
0:35:12.160 --> 0:35:17.004
example: so here you see, after 'I',
0:35:17.004 --> 0:35:23.064
'want' is very frequent, as is 'I eat', but other continuations hardly occur.
0:35:27.347 --> 0:35:39.267
These are the absolute counts of how often each word occurs, and then you can see here the probabilities
0:35:39.267 --> 0:35:40.145
again.
0:35:42.422 --> 0:35:54.519
If you then want to do 'I want Dutch food', you take the sequence and you have to multiply all of
0:35:54.519 --> 0:35:55.471
the probabilities.
0:35:55.635 --> 0:36:00.281
And then you, of course, get a bit of interesting insight from that.
0:36:00.281 --> 0:36:04.726
For example, what information is in there.
0:36:04.726 --> 0:36:15.876
So, for example, if you compare 'I want Dutch' and 'I want Chinese', it seems that one is more common.
0:36:16.176 --> 0:36:22.910
You see that a sentence often starts with 'I'.
0:36:22.910 --> 0:36:31.615
Some continuations are possible after one word but not after another.
0:36:31.731 --> 0:36:39.724
And you cannot say 'want spend', but you have to say 'want to spend', so there's grammatical information.
0:36:40.000 --> 0:36:51.032
There's domain information, and so on. Before we're going into measuring quality, are there
0:36:51.032 --> 0:36:58.297
any questions about language models and the idea of modeling?
0:37:02.702 --> 0:37:13.501
I hope that doesn't mean everybody's sleeping. So when we're doing the training of these
0:37:13.501 --> 0:37:15.761
language models,
0:37:16.356 --> 0:37:26.429
you need to decide on the n-gram length: should we use a trigram or a four-gram?
0:37:27.007 --> 0:37:34.040
So how can you now decide which of the two models is better?
0:37:34.914 --> 0:37:40.702
If you had to do that, how would you decide whether to take this language model or that
0:37:40.702 --> 0:37:41.367
language model?
0:37:43.263 --> 0:37:53.484
I would take some test text and see which model assigns a higher probability to it.
0:37:54.354 --> 0:38:03.978
Very good; that's even the second thing. The first thing maybe would have
0:38:03.978 --> 0:38:04.657
been:
0:38:05.925 --> 0:38:12.300
you take the language model, put it into machine translation, and measure the translation quality.
0:38:13.193 --> 0:38:18.773
The problems: first of all, you have to build a whole system, which is very time-consuming, and
0:38:18.773 --> 0:38:21.407
it might not only depend on the language model.
0:38:21.407 --> 0:38:24.730
On the other hand, that's of course what you want in the end.
0:38:24.730 --> 0:38:30.373
And the question is whether you want to model each component individually or do
0:38:30.373 --> 0:38:31.313
an end-to-end evaluation.
0:38:31.771 --> 0:38:35.463
What can also happen is that, by your metric, your model
0:38:35.463 --> 0:38:41.412
is a very good language model, but it somehow doesn't really work well with your
0:38:41.412 --> 0:38:42.711
translation model.
0:38:43.803 --> 0:38:49.523
But of course it's very good to also have this type of intrinsic evaluation, where the
0:38:49.523 --> 0:38:52.116
assumption is, as you pointed out:
0:38:52.116 --> 0:38:57.503
if we have good English it should get a high probability, and bad English a low one.
0:38:58.318 --> 0:39:07.594
And this is measured by taking a held-out data set, so some data which you don't train
0:39:07.594 --> 0:39:12.596
on, and then calculating the probability of this data.
0:39:12.912 --> 0:39:26.374
Then you're just looking at the language models, and you take the one that assigns the higher probability.
0:39:27.727 --> 0:39:33.595
You're not directly using the probability, but you're taking the perplexity.
0:39:33.595 --> 0:39:40.454
The perplexity is two to the power of the cross-entropy, and you see in the cross-entropy
0:39:40.454 --> 0:39:46.322
you're doing something like an average log probability per word.
0:39:46.846 --> 0:39:54.721
So how exactly is that defined? Perplexity is typically what people refer to, or also the cross-entropy.
0:39:54.894 --> 0:40:02.328
The cross-entropy is a negative average: you have the log of the probability of the whole
0:40:02.328 --> 0:40:03.246
sequence.
0:40:04.584 --> 0:40:10.609
We are modeling this probability as the product over each of the words.
0:40:10.609 --> 0:40:18.613
That's how the n-gram model was defined, and now you hopefully can remember the rules of logarithms,
0:40:18.613 --> 0:40:23.089
so you can turn the product inside the logarithm
0:40:23.063 --> 0:40:31.036
into the sum here. So the cross-entropy is minus one over n times the sum over all your words
0:40:31.036 --> 0:40:35.566
of the logarithm of the probability of each word.
0:40:36.176 --> 0:40:39.418
And then the perplexity is just two to the power of that.
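Written out (standard definitions, matching what was just said):

    H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i | w_1, ..., w_{i-1}), \qquad PPL = 2^{H}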
0:40:41.201 --> 0:40:44.706
Why can this be interpreted as a branching factor?
0:40:44.706 --> 0:40:50.479
It gives you something like the average number of possibilities you have at each step.
0:40:51.071 --> 0:41:02.249
Say you have a digit task and you have no idea about the data, so the probability of the next digit is
0:41:02.249 --> 0:41:03.367
one tenth.
0:41:03.783 --> 0:41:09.354
If you then calculate the perplexity, it will be exactly ten.
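As a worked version of the digit example: if every next digit has probability 1/10, then

    H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{10} = \log_2 10 \approx 3.32, \qquad PPL = 2^{\log_2 10} = 10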
0:41:09.849 --> 0:41:24.191
And that is how perplexity gives you a nice interpretation: how much randomness
0:41:24.191 --> 0:41:27.121
is still in there?
0:41:27.307 --> 0:41:32.433
Of course, it's good to have a lower perplexity.
0:41:32.433 --> 0:41:36.012
Then we have less ambiguity in there, and
0:41:35.976 --> 0:41:48.127
if you have a hundred words but effectively only have to choose uniformly between ten different ones,
0:41:48.127 --> 0:41:49.462
you have less uncertainty.
0:41:49.609 --> 0:41:53.255
Yes, I think so; it should be.
0:41:53.255 --> 0:42:03.673
You had here the logarithm and then two to the power, and that should then cancel out.
0:42:03.743 --> 0:42:22.155
So which logarithm you use is not that important, because it's only a constant factor to reformulate.
0:42:23.403 --> 0:42:28.462
Yes, and yeah, so:
0:42:31.931 --> 0:42:50.263
the best model is always the one where you have a high probability,
0:42:51.811 --> 0:43:04.549
and thereby a low perplexity. You see that here for the sentence 'I would like to commend the rapporteur on his
0:43:04.549 --> 0:43:05.408
work.'
0:43:05.285 --> 0:43:14.116
You have the log2 probabilities and then the average; so this is not the perplexity
0:43:14.116 --> 0:43:18.095
but the cross-entropy, as mentioned here.
0:43:18.318 --> 0:43:26.651
And then two to the power of that will give you the perplexity of the sentence.
0:43:29.329 --> 0:43:40.967
These metrics of perplexity are essential in language modeling, and we'll also see them nowadays.
0:43:41.121 --> 0:43:47.898
You also often measure quality in perplexity or cross-entropy, which tells you how good
0:43:47.898 --> 0:43:50.062
the model is at estimating the next word.
0:43:50.010 --> 0:43:53.647
The better the model is, the more information you have about this.
0:43:55.795 --> 0:44:03.106
You talked about the probability of good sentences, but don't you also have to model the bad ones?
0:44:03.463 --> 0:44:12.512
You are doing that in this way implicitly, because for the correct word,
0:44:12.512 --> 0:44:19.266
if you are modeling this one, the sum over all next words is one.
0:44:20.020 --> 0:44:29.409
Therefore, you have that implicitly in there, because in each position you're modeling the
0:44:29.409 --> 0:44:32.957
probability distribution over which word comes next.
0:44:35.515 --> 0:44:43.811
You have a very large number of negative examples, because all the possible extensions which are
0:44:43.811 --> 0:44:49.515
not there are incorrect, which of course might also be a problem.
0:44:52.312 --> 0:45:00.256
And the biggest challenge of these types of models is how to model unseen events.
0:45:00.840 --> 0:45:04.973
That can be unknown words, or it can be unknown n-grams.
0:45:05.245 --> 0:45:10.096
That's important: you may have seen all the words,
0:45:10.096 --> 0:45:17.756
but if you have a bigram language model and you haven't seen a particular bigram, you'll still get
0:45:17.756 --> 0:45:23.628
a zero probability, because the probability is the bigram count divided by the unigram count.
0:45:24.644 --> 0:45:35.299
If you have unknown words, the problem gets even bigger, because one word typically causes
0:45:35.299 --> 0:45:37.075
a lot of zero counts.
0:45:37.217 --> 0:45:41.038
So if, for example, your vocabulary is 'go', 'to', and 'KIT',
0:45:41.341 --> 0:45:43.467
and you now have a sentence
0:45:43.467 --> 0:45:47.941
like 'I want to KIT', then you have one word which
0:45:47.887 --> 0:45:54.354
is here unknown. Then you have probabilities
0:45:54.354 --> 0:46:02.147
like P(I | sentence start).
0:46:02.582 --> 0:46:09.850
To model this probability you always have to take the count of these sequences divided
0:46:09.850 --> 0:46:19.145
by the history count. Since the unknown word never occurs, all of these n-grams can also not occur, because the word
0:46:19.145 --> 0:46:19.961
is in the middle of them.
0:46:20.260 --> 0:46:27.800
So all of these probabilities are directly zero.
0:46:27.800 --> 0:46:33.647
You see that, just by having a single unknown word.
0:46:34.254 --> 0:46:47.968
This tells you it might not always be better to have larger n-grams, because with a larger n-gram
0:46:47.968 --> 0:46:50.306
language model this happens more often.
0:46:50.730 --> 0:46:57.870
So sometimes it's better to have a smaller n-gram order, because the chance that you've
0:46:57.870 --> 0:47:00.170
seen the n-gram is higher.
0:47:00.170 --> 0:47:07.310
On the other hand, you want to have a larger order, because the larger it is, the
0:47:07.310 --> 0:47:09.849
longer the context you are modeling.
0:47:10.670 --> 0:47:17.565
So how can we address this type of problem?
0:47:17.565 --> 0:47:28.064
We address this type of problem by somehow adjusting our counts.
0:47:29.749 --> 0:47:40.482
Most of the entries in the table are zero, and if one of these n-grams
0:47:40.482 --> 0:47:45.082
occurs at test time, you'll have a zero probability.
0:47:46.806 --> 0:48:06.999
So therefore we need to find some other ways in order to estimate the probability of these events.
0:48:07.427 --> 0:48:11.619
So there are different ways of how to model it and how to adjust it.
0:48:11.619 --> 0:48:15.326
The one here is to do smoothing, and that's the first thing.
0:48:15.326 --> 0:48:20.734
In smoothing you're saying: OK, we take a bit of the probability mass we gave to our seen
0:48:20.734 --> 0:48:23.893
events, and this mass we're taking away
0:48:23.893 --> 0:48:26.567
we're distributing to all the other events.
0:48:26.946 --> 0:48:33.927
The nice thing is that in this case each event now has a non-zero probability, and that is,
0:48:33.927 --> 0:48:39.718
of course, very helpful, because we don't have zero probabilities anymore.
0:48:40.180 --> 0:48:48.422
It is smoothed out, but at least you have some kind of probability everywhere, so you take
0:48:48.422 --> 0:48:50.764
away some of the probability mass.
0:48:53.053 --> 0:49:05.465
You can also see that here: when you have the n-grams, for example, and this is your
0:49:05.465 --> 0:49:08.709
original distribution,
0:49:08.648 --> 0:49:15.463
then you are taking some mass away from here and distributing this mass to all the other
0:49:15.463 --> 0:49:17.453
words that you have not seen.
0:49:18.638 --> 0:49:26.797
And thereby you are now making sure that it's possible to model unseen events.
0:49:28.828 --> 0:49:36.163
We're coming to more detail on how we can do this type of smoothing, but
0:49:36.163 --> 0:49:41.164
one other idea you can use is to do some type of clustering.
0:49:41.501 --> 0:49:48.486
And that means: if we can't model 'go KIT', for example because we haven't seen it,
0:49:49.349 --> 0:49:56.128
then we're not looking at the full thing, but we just model directly how probable the word is by itself.
0:49:56.156 --> 0:49:58.162
We go to shorter contexts, or so.
0:49:58.162 --> 0:50:09.040
Then we are modeling just the word, or we do interpolation, where you're interpolating all the probabilities
0:50:09.040 --> 0:50:10.836
and thereby can model it.
0:50:11.111 --> 0:50:16.355
These are the two things which are helpful in order to better calculate all these types of probabilities.
0:50:19.499 --> 0:50:28.404
Let's start with adjusting the counts. So the idea is: OK,
0:50:28.404 --> 0:50:38.119
we have not seen an event, and then its probability is zero.
0:50:38.618 --> 0:50:50.902
It's maybe not that high, but you should always be aware that there might be new things happening,
0:50:50.902 --> 0:50:55.308
and you should somehow be able to estimate them.
0:50:56.276 --> 0:50:59.914
So the idea is:
0:50:59.914 --> 0:51:09.442
we also assign a positive probability to unseen events.
0:51:10.590 --> 0:51:23.233
We are changing things: so far we worked on empirical counts, so how often we have seen
0:51:23.233 --> 0:51:25.292
things in the data.
0:51:25.745 --> 0:51:37.174
And now we are going to expected counts: how often would this occur in unseen data?
0:51:37.517 --> 0:51:39.282
So we are directly trying to model that.
0:51:39.859 --> 0:51:45.836
Of course, the empirical counts are a good starting point: if you've seen a word
0:51:45.836 --> 0:51:51.880
very often in your training data, that's a good estimate of how often you would see it in
0:51:51.880 --> 0:51:52.685
the future.
0:51:52.685 --> 0:51:58.125
However, just because you haven't seen something doesn't mean it cannot occur.
0:51:58.578 --> 0:52:10.742
So, does anybody have a very simple idea of how you could start with smoothing?
0:52:10.742 --> 0:52:15.241
What count would you give?
0:52:21.281 --> 0:52:32.279
Right now you have the probability calculation, and you have seen the bigram zero
0:52:32.279 --> 0:52:33.135
times.
0:52:33.193 --> 0:52:39.209
So what count would you give in order to still do this calculation?
0:52:39.209 --> 0:52:41.509
We have to smooth somehow.
0:52:44.884 --> 0:52:52.151
We could clump together all the rare words, for example all the words we have only seen once.
0:52:52.652 --> 0:52:56.904
And then we can just estimate the mass for all of those together.
0:52:56.936 --> 0:53:00.085
So we treat the rare ones together.
0:53:00.085 --> 0:53:06.130
Yes, and then every unseen word is one of them.
0:53:06.130 --> 0:53:13.939
Yeah, but it's not only about unseen words, it's also about unseen n-grams.
0:53:14.874 --> 0:53:20.180
You can even start easier, and that's what people do as the first thing.
0:53:20.180 --> 0:53:22.243
That's add-one smoothing.
0:53:22.243 --> 0:53:28.580
You'll see it's not working well, but a variation works fine, and we just say here:
0:53:28.580 --> 0:53:30.644
we've seen everything once.
0:53:31.771 --> 0:53:39.896
That's similar to this, because you're clustering the ones and the zeros together and you just
0:53:39.896 --> 0:53:45.814
say you've seen everything once, or have seen them twice, and so on.
0:53:46.386 --> 0:53:53.249
And if you've done that, there's no zero probability anymore, because each event has happened once.
0:53:55.795 --> 0:54:02.395
If you have otherwise seen a bigram five times, you would now count not five but
0:54:02.395 --> 0:54:03.239
six times.
0:54:03.363 --> 0:54:09.117
So the nice thing is, you have seen everything at least once.
0:54:09.117 --> 0:54:19.124
The probability of the n-gram is now the count you have seen it, plus one, divided by the history count plus the vocabulary size.
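For a bigram model this is the standard add-one formula, with V the vocabulary size:

    P(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}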
0:54:20.780 --> 0:54:23.763
However, there's one big problem with it.
0:54:24.064 --> 0:54:38.509
Just imagine that you have a large vocabulary of words, and you have a corpus of thirty million
0:54:38.509 --> 0:54:39.954
bigrams.
0:54:39.954 --> 0:54:42.843
So if you have a
0:54:43.543 --> 0:54:46.580
simple count, you've seen thirty million bigram events.
0:54:47.247 --> 0:54:49.818
That is the count you are distributing.
0:54:49.818 --> 0:54:55.225
The problem is: how many possible bigrams do you have?
0:54:55.225 --> 0:55:00.895
You have seven point five billion possible bigrams, and each of them you are counting
0:55:00.895 --> 0:55:04.785
now as well, like you give it a count of one.
0:55:04.785 --> 0:55:07.092
So each of them is said to occur.
0:55:07.627 --> 0:55:16.697
Then this number of possible bigrams is many times larger than the number you really see.
0:55:17.537 --> 0:55:21.151
You're mainly doing an equal distribution.
0:55:21.151 --> 0:55:26.753
Everything gets nearly the same, because this part is much bigger.
0:55:26.753 --> 0:55:31.541
Most of your probability mass is used for smoothing.
0:55:32.412 --> 0:55:37.493
Because most of the probability mass has to be distributed so that you give every
0:55:37.493 --> 0:55:42.687
bigram at least a count of one, and the real counts are only the thirty million: seven
0:55:42.687 --> 0:55:48.219
point five billion counts are distributed over all the n-grams, and only thirty million
0:55:48.219 --> 0:55:50.026
go according to your actual frequencies.
0:55:50.210 --> 0:56:02.406
So you put far too much mass on your smoothing; you're doing some kind of extreme smoothing.
0:56:02.742 --> 0:56:08.986
So that, of course, is bad and will not give you the best performance.
0:56:10.130 --> 0:56:16.160
However, there's a nice thing about doing probability calculations:
0:56:16.160 --> 0:56:21.800
we are doing it based on counts, but to do this division we don't need integers.
0:56:22.302 --> 0:56:32.112
So we can also do that with floating-point values, and it is still a valid type of calculation.
0:56:32.392 --> 0:56:39.380
So we can give less probability mass to unseen events;
0:56:39.380 --> 0:56:45.352
we don't have to give a count of one.
0:56:45.785 --> 0:56:50.976
To do our calculation we can also give 0.02 or something like that, so a
0:56:50.976 --> 0:56:56.167
very small value, and thereby we put less weight on the smoothing and we are more
0:56:56.167 --> 0:56:58.038
focused on the actual corpus.
0:56:58.758 --> 0:57:03.045
And that is what people refer to as alpha smoothing.
0:57:03.223 --> 0:57:12.032
You see that we are now adding not one but only alpha, and then we are giving less
0:57:12.032 --> 0:57:19.258
probability to the unseen events and more probability to the really seen ones.
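A minimal Python sketch of add-alpha smoothing, reusing the toy counts from the earlier sketch (my own illustration; with alpha = 1 it reduces to the add-one case):

    from collections import Counter

    corpus = [["<s>", "i", "want", "dutch", "food", "</s>"],
              ["<s>", "i", "want", "chinese", "food", "</s>"]]
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    V = len(unigrams)  # vocabulary size

    def p_add_alpha(word, history, alpha=0.02):
        """Add a small pseudo-count alpha (tuned on held-out data)
        instead of the full count of one used by add-one smoothing."""
        return (bigrams[(history, word)] + alpha) / (unigrams[history] + alpha * V)

    print(p_add_alpha("chinese", "want"))  # seen once: a bit below 1/2
    print(p_add_alpha("thai", "want"))     # unseen: small but non-zero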
0:57:20.780 --> 0:57:24.713
The question is, of course: how do you find this alpha?
0:57:24.713 --> 0:57:29.711
There you typically use some held-out data and optimize it on that.
0:57:30.951 --> 0:57:35.153
So what does it now really mean?
0:57:35.153 --> 0:57:40.130
This gives you a bit of the idea behind it.
0:57:40.700 --> 0:57:57.751
So here you have the n-grams which occur one time, for example all n-grams which occur once in the training data.
0:57:57.978 --> 0:58:10.890
You can then check how often n-grams which occur one time in training occur in the test data.
0:58:11.371 --> 0:58:22.896
You do the same for all the n-grams which occur two times,
0:58:22.896 --> 0:58:31.013
and likewise for the n-grams that occur zero times.
0:58:32.832 --> 0:58:46.511
So if you are now doing the smoothing, you can look at how well you estimate
0:58:46.511 --> 0:58:47.466
these counts.
0:58:47.847 --> 0:59:00.963
You see that with add-one smoothing, for all the seen n-grams you heavily underestimate how often they occur in the test
0:59:00.963 --> 0:59:01.801
data.
0:59:02.002 --> 0:59:10.067
What you want is to estimate this distribution very well, so for each n-gram to estimate
0:59:10.067 --> 0:59:12.083
quite well how often it will occur.
0:59:12.632 --> 0:59:16.029
And add-one is quite bad at that for all of them:
0:59:16.029 --> 0:59:22.500
you're apparently underestimating, and only for the zero ones, which you haven't seen,
0:59:22.500 --> 0:59:24.845
you heavily overestimate.
0:59:25.645 --> 0:59:30.887
If you're doing alpha smoothing and optimize the alpha to fit the zero count (that's
0:59:30.887 --> 0:59:36.361
not completely fair, because this alpha is now optimized on the test counts), you see that you're
0:59:36.361 --> 0:59:37.526
doing a lot better.
0:59:37.526 --> 0:59:42.360
It's not perfect, but you're a lot better at estimating how often they will occur.
0:59:45.545 --> 0:59:49.316
So this is one idea of doing it.
0:59:49.316 --> 0:59:57.771
Of course there are other ways, and this has been a large research direction.
0:59:58.318 --> 1:00:03.287
So there is the deleted estimation.
1:00:03.287 --> 1:00:11.569
What you are doing is splitting your training data into two parts.
1:00:11.972 --> 1:00:19.547
You look at how many n-gram types occur exactly r times, so which n-grams occur r times in
1:00:19.547 --> 1:00:20.868
your training part.
1:00:21.281 --> 1:00:27.716
And then for these ones you look:
1:00:27.716 --> 1:00:36.611
how often do they occur in the other part of the data?
1:00:36.611 --> 1:00:37.746
It's
1:00:38.118 --> 1:00:45.214
a way to say: for this n-gram, the expected count of how often we will see it
1:00:45.214 --> 1:00:56.020
is this total divided by the number of such n-grams. It is some type of clustering: you're putting all the n-grams which occur
1:00:56.020 --> 1:01:04.341
r times in your data together in order to estimate how often they will really occur.
1:01:05.185 --> 1:01:12.489
And then you do your final estimation on the other half of your data by just using those statistics.
1:01:14.014 --> 1:01:25.210
So this is called deleted estimation, and thereby you are now able to estimate better how often
1:01:25.210 --> 1:01:25.924
an n-gram will really occur.
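A rough Python sketch of this idea (my own simplification: only one direction, and n-grams unseen in the first half are left out; in practice both halves are swapped and the results averaged):

    from collections import Counter

    def deleted_estimation(half_a, half_b):
        """half_a, half_b: lists of n-grams (e.g. bigram tuples) from the two halves.
        For each frequency r in half A, estimate the expected count as the
        total occurrences in half B of the n-grams seen r times in A,
        divided by the number of such n-grams."""
        count_a, count_b = Counter(half_a), Counter(half_b)
        by_r = {}
        for ngram, r in count_a.items():
            by_r.setdefault(r, []).append(ngram)
        # n-grams with r = 0 in half A would need the full possible
        # n-gram set and are omitted in this sketch.
        return {r: sum(count_b[g] for g in grams) / len(grams)
                for r, grams in by_r.items()}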
1:01:28.368 --> 1:01:34.559
And again we can do the same check and compare it to the expected counts.
1:01:34.559 --> 1:01:37.782
Again we have exactly the same table.
1:01:38.398 --> 1:01:47.611
So then we have here how many n-grams with each count exist;
1:01:47.611 --> 1:01:55.361
for example, the n-grams which occur r times.
1:01:55.835 --> 1:02:08.583
Then you look into your other half: how often do these n-grams occur in your second part
1:02:08.583 --> 1:02:11.734
of the training data?
1:02:12.012 --> 1:02:22.558
For example, an unseen n-gram I expect to occur this often; an n-gram which occurs one time,
1:02:22.558 --> 1:02:25.774
I expect to occur that often.
1:02:27.527 --> 1:02:42.564
Yeah, the number of zero counts: you take your unigrams and then just calculate how many
1:02:42.564 --> 1:02:45.572
possible bigrams there are.
1:02:45.525 --> 1:02:50.729
Yes, so in this case we are not assuming a larger vocabulary, because then,
1:02:50.729 --> 1:02:52.127
of course, it's getting more difficult.
1:02:52.272 --> 1:02:54.730
So you're doing that given the current vocabulary.
1:02:54.730 --> 1:03:06.057
So yeah, how to deal with unknown words is another problem;
1:03:06.057 --> 1:03:11.150
this is more about how to smooth the n-gram counts.
1:03:14.394 --> 1:03:18.329
Then the last one.
1:03:18.198 --> 1:03:25.197
Yes, the last idea is the so-called Good-Turing smoothing, and the idea here is
1:03:25.197 --> 1:03:32.747
similar. There is a proper mathematical proof, but you can show that a very good estimation
1:03:32.747 --> 1:03:34.713
for the expected counts
1:03:34.654 --> 1:03:42.339
is that you take the number of n-grams which occur one time more, times r plus one, divided by the number of
1:03:42.339 --> 1:03:46.011
n-grams which occur r times.
1:03:46.666 --> 1:03:49.263
So this is then the estimation of the adjusted count.
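Written as a formula (standard Good-Turing notation, with N_r the number of distinct n-grams occurring exactly r times):

    r^{*} = (r + 1) \frac{N_{r+1}}{N_r}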
1:03:49.549 --> 1:04:05.911 | |
So if you are looking now at an engram which | |
occurs times then you are looking at how many | |
1:04:05.911 --> 1:04:08.608 | |
n-grams occur r times and how many occur r plus one times. | |
1:04:09.009 --> 1:04:18.938 | |
It's very simple, so for this you only | |
have to count all the bigrams, how many different | |
1:04:18.938 --> 1:04:23.471 | |
bigrams are out there, and that is very easy. | |
1:04:23.903 --> 1:04:33.137 | |
There is a problem if you are looking at n-grams which | |
occur very often: | |
1:04:33.473 --> 1:04:46.626 | |
it might be that there are some occurring r | |
times, but none occurring r plus one times, and then the estimate would be zero. | |
1:04:46.866 --> 1:04:54.721 | |
So what you normally do is use this formula for | |
small r, and for large r you do some curve | |
1:04:54.721 --> 1:04:55.524 | |
fitting. | |
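As an illustration, a minimal Good-Turing sketch on a made-up toy corpus (the fallback for large r is an assumption; real implementations fit a curve there):

```python
# Good-Turing sketch on a made-up toy corpus: the adjusted count of an
# n-gram seen r times is r* = (r + 1) * N_{r+1} / N_r, where N_r is the
# number of n-gram types occurring exactly r times.
from collections import Counter

tokens = "the cat sat on the mat the cat ran to the mat".split()
N = Counter(Counter(zip(tokens, tokens[1:])).values())  # count-of-counts N_r

def adjusted_count(r: int) -> float:
    # Fall back to r itself when N_r or N_{r+1} is empty; in practice one
    # would fit a smooth curve to the N_r values for large r instead.
    if N[r] == 0 or N[r + 1] == 0:
        return float(r)
    return (r + 1) * N[r + 1] / N[r]

for r in sorted(N):
    print(f"r={r}: N_r={N[r]}, adjusted count r*={adjusted_count(r):.3f}")
```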
1:04:56.016 --> 1:05:07.377 | |
In general this type of smoothing is important | |
for n-grams which occur rarely. | |
1:05:07.377 --> 1:05:15.719 | |
If an n-gram occurs very often it changes little, so this is more important | |
for rare events. | |
1:05:17.717 --> 1:05:25.652 | |
So here again you see you have the counts | |
and then based on that you get the adjusted | |
1:05:25.652 --> 1:05:26.390 | |
counts. | |
1:05:26.390 --> 1:05:34.786 | |
This is here, and if you compare it to the test | |
counts you see that it really works quite well. | |
1:05:35.035 --> 1:05:41.093 | |
But for the low counts it's a very good model | |
of how often these n-grams really occur. | |
1:05:45.005 --> 1:05:50.018 | |
Then, of course, the question is how good | |
does it work in language modeling? | |
1:05:50.018 --> 1:05:51.516 | |
We also want to measure that. | |
1:05:52.372 --> 1:05:54.996 | |
We can measure that with perplexity. | |
1:05:54.996 --> 1:05:59.261 | |
We learned that before, and here we have a comparison. | |
1:05:59.579 --> 1:06:07.326 | |
You saw that there way too much probability | |
mass is put on the events which have zero counts. | |
1:06:07.667 --> 1:06:11.098 | |
Then you have the add-alpha smoothing. | |
1:06:11.098 --> 1:06:16.042 | |
Here there is an asterisk because it's not completely | |
fair: | |
1:06:16.042 --> 1:06:20.281 | |
the alpha was optimized on the test data. | |
1:06:20.480 --> 1:06:25.904 | |
But you see that the deleted estimation | |
and the Good-Turing give you a similar performance. | |
1:06:26.226 --> 1:06:29.141 | |
So they seem to really work quite well. | |
1:06:32.232 --> 1:06:41.552 | |
So this was all about assigning probability | |
mass to n-grams which we have not seen, | |
1:06:41.552 --> 1:06:50.657 | |
in order to also estimate their probability | |
before we're going to the interpolation. | |
1:06:55.635 --> 1:07:00.207 | |
Good, so now we have | |
1:07:00.080 --> 1:07:11.818 | |
done this estimation, and the problem is we | |
have this general trade-off: | |
1:07:11.651 --> 1:07:19.470 | |
We want to have a longer context because we | |
can model the language better thanks to | |
1:07:19.470 --> 1:07:21.468 | |
long-range dependencies. | |
1:07:21.701 --> 1:07:26.745 | |
On the other hand, we have limited data, so | |
we want to have short n-grams because we | |
1:07:26.745 --> 1:07:28.426 | |
see them more often and can estimate them more reliably. | |
1:07:29.029 --> 1:07:43.664 | |
And the smoothing and the discounting | |
we did before always treat all n-grams the same. | |
1:07:44.024 --> 1:07:46.006 | |
So we didn't really look at the n-grams themselves. | |
1:07:46.006 --> 1:07:48.174 | |
They were all classed only by how often they | |
occur. | |
1:07:49.169 --> 1:08:00.006 | |
However, sometimes this might not be very | |
helpful, so for example look at the n-grams | |
1:08:00.006 --> 1:08:06.253 | |
Scottish beer drinkers and Scottish beer eaters. | |
1:08:06.686 --> 1:08:12.037 | |
Because we have not seen either trigram, you | |
will estimate both trigram probabilities by the | |
1:08:12.037 --> 1:08:14.593 | |
probability you assign to the zero counts. | |
1:08:15.455 --> 1:08:26.700 | |
However, if you look at the bigram probability, | |
that you might have seen, and it might be helpful. | |
1:08:26.866 --> 1:08:34.538 | |
So 'beer drinkers' is more probable to see than | |
'Scottish beer drinkers', and 'beer drinkers' should | |
1:08:34.538 --> 1:08:36.039 | |
be more probable than 'beer eaters'. | |
1:08:36.896 --> 1:08:39.919 | |
So this type of information is somehow ignored. | |
1:08:39.919 --> 1:08:45.271 | |
So if we have the trigram language model, | |
we are only looking at trigram counts divided by | |
1:08:45.271 --> 1:08:46.089 | |
the bigram counts. | |
1:08:46.089 --> 1:08:49.678 | |
But if we have not seen the trigram, we are | |
not checking: | |
1:08:49.678 --> 1:08:53.456 | |
oh, maybe we will have seen the bigram and | |
we can back off to that. | |
1:08:54.114 --> 1:09:01.978 | |
And that is what people do in interpolation | |
and back off. | |
1:09:01.978 --> 1:09:09.164 | |
The idea is: if we haven't seen the large | |
n-gram, | |
1:09:09.429 --> 1:09:16.169 | |
we go to a shorter sequence | |
and try to estimate the probability from that. | |
1:09:16.776 --> 1:09:20.730 | |
And this is the idea of interpolation. | |
1:09:20.730 --> 1:09:25.291 | |
There's like two different ways of doing it. | |
1:09:25.291 --> 1:09:26.507 | |
One is the linear interpolation. | |
1:09:26.646 --> 1:09:29.465 | |
The easiest thing is like okay: | |
1:09:29.465 --> 1:09:32.812 | |
we have bigrams, we have trigrams, | |
1:09:32.812 --> 1:09:35.103 | |
if we have unigrams, why not use them all? | |
1:09:35.355 --> 1:09:46.544 | |
I mean, of course, we have the larger ones with | |
the larger context, but the short n-grams are | |
1:09:46.544 --> 1:09:49.596 | |
maybe better estimated. | |
1:09:50.090 --> 1:10:00.487 | |
So we combine them: lambda one times the trigram | |
probability, plus lambda two times the bigram probability, plus lambda three times the unigram probability. | |
1:10:01.261 --> 1:10:07.052 | |
And of course the weights need to sum to one, because otherwise | |
we don't have a probability distribution, but | |
1:10:07.052 --> 1:10:09.332 | |
we can somehow optimize the weights. | |
1:10:09.332 --> 1:10:15.930 | |
For example, on a held-out data set. And | |
thereby we have now a probability distribution | |
1:10:15.930 --> 1:10:17.777 | |
which takes both into account. | |
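A minimal sketch of this linear interpolation (toy corpus; the lambda values are assumptions and would normally be tuned on held-out data):

```python
# Linear interpolation of unigram, bigram and trigram maximum-likelihood
# estimates on a made-up toy corpus. The lambda weights are assumptions
# here; they must sum to one and would normally be tuned on held-out data.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

def p_interp(w3, w2, w1, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    p_uni = uni[w3] / total
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_interp("sat", "cat", "the"))  # seen trigram: all three terms fire
print(p_interp("mat", "cat", "the"))  # unseen trigram: unigram still helps
```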
1:10:18.118 --> 1:10:23.705 | |
Think again about the Scottish beer drinkers example. | |
1:10:23.705 --> 1:10:33.763 | |
The trigram probability will be the same for | |
both phrases, because they both occur zero | |
1:10:33.763 --> 1:10:34.546 | |
times. | |
1:10:36.116 --> 1:10:45.332 | |
But the bigram probability will hopefully | |
be different, because we might have seen 'beer | |
1:10:45.332 --> 1:10:47.611 | |
drinkers' but not 'beer eaters', and therefore we can distinguish them. | |
1:10:48.668 --> 1:10:57.296 | |
The idea is that sometimes it's better to have | |
different models and combine them instead of relying on one. | |
1:10:58.678 --> 1:10:59.976 | |
Another idea, instead | |
1:11:00.000 --> 1:11:08.506 | |
of this overall interpolation, is that you can | |
do this type of recursive interpolation. | |
1:11:08.969 --> 1:11:23.804 | |
The interpolated probability of the word given its history | |
is lambda times the current n-gram language model probability, | |
1:11:24.664 --> 1:11:30.686 | |
plus one minus lambda, where the two weights sum | |
to one, times the interpolated probability | |
1:11:30.686 --> 1:11:36.832 | |
from the n minus one gram, and then of course | |
it goes recursively on until you are at the unigram | |
1:11:36.832 --> 1:11:37.639 | |
probability. | |
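The recursion can be sketched like this (toy corpus; the fixed lambda is an assumption, and the next point in the lecture relaxes exactly that):

```python
# Recursive interpolation sketch on a made-up toy corpus:
#   P_I(w | h) = lambda * P_ML(w | h) + (1 - lambda) * P_I(w | shorter h),
# bottoming out at the unigram. The fixed lambda is an assumption; the
# weight can also be made to depend on the history.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
ngrams = {n: Counter(zip(*[tokens[i:] for i in range(n)])) for n in (1, 2, 3)}
total = len(tokens)

def p_ml(w, history):
    # Maximum-likelihood estimate P(w | history) for the given order.
    n = len(history) + 1
    hist_count = ngrams[n - 1][history] if history else total
    return ngrams[n][history + (w,)] / hist_count if hist_count else 0.0

def p_interp(w, history, lam=0.7):
    if not history:  # recursion base: plain unigram probability
        return ngrams[1][(w,)] / total
    return lam * p_ml(w, history) + (1 - lam) * p_interp(w, history[1:], lam)

print(p_interp("sat", ("the", "cat")))
```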
1:11:38.558 --> 1:11:49.513 | |
What you can also do: you need not use | |
the same weights for all words, but you | |
1:11:49.513 --> 1:12:06.020 | |
can adapt them. For example, for n-grams | |
which you have seen very often, you put more | |
1:12:06.020 --> 1:12:10.580 | |
weight on the trigrams. | |
1:12:13.673 --> 1:12:29.892 | |
The other thing you can do is the back-off, | |
and the difference in back-off is we are not | |
1:12:29.892 --> 1:12:32.656 | |
interpolating. | |
1:12:32.892 --> 1:12:41.954 | |
If we have seen the trigram, so | |
if the trigram count is bigger than zero, then we take | |
1:12:41.954 --> 1:12:48.412 | |
the trigram probability; only if we have not seen | |
it do we back off. | |
1:12:48.868 --> 1:12:54.092 | |
So that is the difference. | |
1:12:54.092 --> 1:13:06.279 | |
In interpolation we are always combining all the n-gram probabilities, | |
while in back-off we take only one. | |
1:13:07.147 --> 1:13:09.941 | |
Why do we need to do this? Think a minute. | |
1:13:09.941 --> 1:13:13.621 | |
So why can't we here just take the probability | |
of the lower-order n-gram directly? | |
1:13:15.595 --> 1:13:18.711 | |
Yes, because otherwise the probabilities don't | |
sum up to one. | |
1:13:19.059 --> 1:13:28.213 | |
In order to make them still sum to one, we | |
have to take away a bit of the probability mass | |
1:13:28.213 --> 1:13:29.773 | |
from the seen events. | |
1:13:29.709 --> 1:13:38.919 | |
The difference is we are no longer distributing | |
it equally as before to the unseen, but we | |
1:13:38.919 --> 1:13:40.741 | |
are distributing it according to the lower-order model. | |
1:13:44.864 --> 1:13:56.220 | |
For example, this can be done with Good-Turing, | |
using the expected counts in Good-Turing that we saw. | |
1:13:57.697 --> 1:13:59.804 | |
The adjusted counts. | |
1:13:59.804 --> 1:14:04.719 | |
They are always lower than the ones we see | |
here. | |
1:14:04.719 --> 1:14:14.972 | |
These counts are always lower; you see that, so you can | |
now take this difference and distribute this | |
1:14:14.972 --> 1:14:18.852 | |
probability mass to the lower-order back-off. | |
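A hedged back-off sketch: a fixed absolute discount stands in for the Good-Turing difference, and the freed mass is redistributed over unseen words via the unigram model (corpus and discount value are assumptions):

```python
# Back-off sketch on a made-up toy corpus: a fixed absolute discount
# stands in for the Good-Turing adjusted-count difference. The freed
# probability mass (alpha) goes to the unseen words, weighted by the
# unigram distribution; the discount value is an assumption.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
DISCOUNT = 0.5

def p_backoff(w, prev):
    seen = {v: c for (u, v), c in bi.items() if u == prev}
    if w in seen:  # seen bigram: discounted relative frequency
        return (seen[w] - DISCOUNT) / uni[prev]
    alpha = DISCOUNT * len(seen) / uni[prev]  # mass taken from seen bigrams
    unseen_mass = sum(c for v, c in uni.items() if v not in seen)
    return alpha * uni[w] / unseen_mass  # redistribute over unseen words

print(p_backoff("sat", "cat"))  # seen bigram "cat sat"
print(p_backoff("mat", "cat"))  # unseen bigram: falls back to the unigram
```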
1:14:23.323 --> 1:14:29.896 | |
That is how we can distribute things. | |
1:14:29.896 --> 1:14:43.442 | |
Then there is one last thing people are doing, | |
especially for deciding how much weight to put on the back-off. | |
1:14:43.563 --> 1:14:55.464 | |
And there's one smoothing which is called | |
Witten-Bell smoothing. | |
1:14:55.315 --> 1:15:01.335 | |
When backing off, | |
it might make sense to look at the words and | |
1:15:01.335 --> 1:15:04.893 | |
see how probable it is that you need to back off. | |
1:15:05.425 --> 1:15:11.232 | |
So look at these two words, 'spite' and 'constant'. | |
1:15:11.232 --> 1:15:15.934 | |
Those occur exactly the same number of times in the corpus. | |
1:15:16.316 --> 1:15:27.804 | |
They would be treated exactly the same, because | |
both occur equally often, and the back-off weight would be | |
1:15:27.804 --> 1:15:29.053 | |
the same. | |
1:15:29.809 --> 1:15:48.401 | |
However, we shouldn't really model them the same. | |
1:15:48.568 --> 1:15:57.447 | |
If you compare them: for 'constant' there are | |
four hundred different continuations of this | |
1:15:57.447 --> 1:16:01.282 | |
word, while for 'spite' there is nearly always the same continuation. | |
1:16:02.902 --> 1:16:11.203 | |
So if you're now seeing a new bigram, a | |
bigram starting with 'constant' or 'spite' | |
1:16:11.203 --> 1:16:13.467 | |
and then another word: | |
1:16:15.215 --> 1:16:25.606 | |
for 'constant' it's very frequent that you see | |
new n-grams, because there are many different | |
1:16:25.606 --> 1:16:27.222 | |
combinations. | |
1:16:27.587 --> 1:16:35.421 | |
Therefore, it might be good not only to look | |
at the counts of the n-grams, but also at how | |
1:16:35.421 --> 1:16:37.449 | |
many extensions a word has. | |
1:16:38.218 --> 1:16:43.222 | |
And this is done by Witten-Bell smoothing. | |
1:16:43.222 --> 1:16:51.032 | |
The idea is we count how many possible extensions | |
a history has in this case. | |
1:16:51.371 --> 1:17:01.966 | |
So for 'spite' we had only a few possible extensions, | |
and for 'constant' we had a lot more. | |
1:17:02.382 --> 1:17:09.394 | |
And then how much we put into our back-off model, | |
how much weight we put onto the back-off, is | |
1:17:09.394 --> 1:17:13.170 | |
depending on this number of possible extensions. | |
1:17:15.557 --> 1:17:29.583 | |
We have it here: this is the weight you | |
put on your lower-order n-gram probability. | |
1:17:29.583 --> 1:17:46.596 | |
For example, if you compare these two | |
numbers: for 'spite' you take how many extensions | |
1:17:46.596 --> 1:17:55.333 | |
'spite' has, divided by this number plus its count, which is tiny; while for 'constant' | |
you get about zero point three. | |
1:17:55.815 --> 1:18:05.780 | |
So you're putting a lot more weight there, meaning | |
it's not as bad to fall back to the back-off | |
1:18:05.780 --> 1:18:06.581 | |
model. | |
1:18:06.581 --> 1:18:10.705 | |
So for 'spite' backing off is really unusual. | |
1:18:10.730 --> 1:18:13.369 | |
For 'constant' there's a lot of probability | |
mass on the back-off. | |
1:18:13.369 --> 1:18:15.906 | |
The chance that you need to do that is quite | |
high. | |
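The weight computation itself is tiny; the concrete counts below are assumptions chosen in the spirit of the 'spite'/'constant' numbers mentioned here:

```python
# Witten-Bell back-off weight: number of distinct continuations divided
# by (distinct continuations + token count). The concrete counts below
# are assumptions in the spirit of the 'spite'/'constant' example.
def witten_bell_weight(distinct_continuations: int, count: int) -> float:
    return distinct_continuations / (distinct_continuations + count)

print(witten_bell_weight(9, 993))    # 'spite': ~0.009, rarely back off
print(witten_bell_weight(415, 993))  # 'constant': ~0.295, back off often
```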
1:18:20.000 --> 1:18:26.209 | |
Similarly, but just the other way around, | |
we can now look at the probability distribution we back off to. | |
1:18:26.546 --> 1:18:37.103 | |
So far, when we back off, the probability distribution | |
for the lower-order n-grams was calculated exactly | |
1:18:37.103 --> 1:18:40.227 | |
the same as the normal probability. | |
1:18:40.320 --> 1:18:48.254 | |
However, they are used in a different way, | |
so the lower-order n-grams are only used | |
1:18:48.254 --> 1:18:49.361 | |
if we have not seen the higher-order one. | |
1:18:50.410 --> 1:18:54.264 | |
So it's like you're modeling something different. | |
1:18:54.264 --> 1:19:01.278 | |
You're not modeling how probable this n-gram is, | |
but how probable it is if we haven't seen the larger n-gram, and that | |
1:19:01.278 --> 1:19:04.361 | |
is captured by the diversity of histories. | |
1:19:04.944 --> 1:19:14.714 | |
For example, if you look at 'York', that's a | |
quite frequent word. | |
1:19:14.714 --> 1:19:18.530 | |
It occurs many times in the corpus. | |
1:19:19.559 --> 1:19:27.985 | |
However, four hundred seventy three times | |
the word right before it was 'New'. | |
1:19:29.449 --> 1:19:40.237 | |
So if you now think about when the unigram model is | |
used, the probability of York as a unigram | |
1:19:40.237 --> 1:19:49.947 | |
model should be very, very low, because it almost | |
always follows 'New'. So you should have a lower probability for 'York' | |
1:19:49.947 --> 1:19:56.292 | |
than, for example, for 'foods', although you | |
have seen both of them equally often, and | |
1:19:56.292 --> 1:20:02.853 | |
this is done by Kneser-Ney smoothing, where | |
you are not counting the words themselves, but | |
1:20:02.853 --> 1:20:05.377 | |
you count the number of histories. | |
1:20:05.845 --> 1:20:15.233 | |
So, the other way around: by how many | |
different words was it preceded? | |
1:20:15.233 --> 1:20:28.232 | |
Then, instead of the normal counts, you count the | |
distinct histories. You don't need to know all the formulas | |
1:20:28.232 --> 1:20:28.864 | |
here. | |
1:20:28.864 --> 1:20:33.498 | |
The more important thing is this intuition. | |
1:20:34.874 --> 1:20:44.646 | |
Using the lower order already means that I haven't | |
seen the larger n-gram, and therefore | |
1:20:44.646 --> 1:20:49.704 | |
it might be better to model it differently. | |
1:20:49.929 --> 1:20:56.976 | |
So if there's a new n-gram with something other | |
than 'New' before 'York', that's very improbable compared | |
1:20:56.976 --> 1:20:57.297 | |
to other continuations. | |
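A small sketch of the continuation-count idea behind Kneser-Ney (the toy corpus is an assumption):

```python
# Continuation-count sketch in the spirit of Kneser-Ney on a made-up toy
# corpus: the unigram back-off probability of a word depends on how many
# distinct words precede it, not on its raw frequency. 'york' is frequent
# here but almost always follows 'new', so its continuation count stays low.
tokens = "new york new york new york he lives in new york city".split()
bigram_types = set(zip(tokens, tokens[1:]))

def p_continuation(w: str) -> float:
    # distinct left contexts of w, normalized by the number of bigram types
    return sum(1 for (u, v) in bigram_types if v == w) / len(bigram_types)

for w in ("york", "new", "city"):
    print(w, tokens.count(w), round(p_continuation(w), 3))
```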
1:21:00.180 --> 1:21:06.130 | |
And yeah, this modified Kneser-Ney smoothing | |
is what people took into use. | |
1:21:06.130 --> 1:21:08.249 | |
That's the default approach. | |
1:21:08.728 --> 1:21:20.481 | |
It has an absolute discounting for small n-gram | |
counts, then Witten-Bell smoothing, and for the lower orders it | |
1:21:20.481 --> 1:21:27.724 | |
uses the counting of histories which we | |
just had. | |
1:21:28.028 --> 1:21:32.207 | |
And there's even two versions of it, like | |
the back-off and the interpolated one. | |
1:21:32.472 --> 1:21:34.264 | |
What may be interesting: | |
1:21:34.264 --> 1:21:40.216 | |
this even works well for interpolation, | |
although your assumption is then no longer | |
1:21:40.216 --> 1:21:45.592 | |
true, because you're using the lower n-grams | |
even if you've seen the higher n-grams. | |
1:21:45.592 --> 1:21:49.113 | |
But since you're then focusing on the higher | |
n-grams, it still works. | |
1:21:49.929 --> 1:21:53.522 | |
So here you see some results on the perplexities. | |
1:21:54.754 --> 1:22:00.262 | |
You see that normally interpolated modified | |
Kneser-Ney gives you some of the best | |
1:22:00.262 --> 1:22:00.980 | |
performance. | |
1:22:02.022 --> 1:22:08.032 | |
You also see that the larger your n-gram order is, | |
with interpolation, | |
1:22:08.032 --> 1:22:15.168 | |
the significantly better you get, so it pays off | |
to not only look at the last words. | |
1:22:18.638 --> 1:22:32.725 | |
Good so much for these types of things, and | |
we will finish with some special things about | |
1:22:32.725 --> 1:22:34.290 | |
language models. | |
1:22:38.678 --> 1:22:44.225 | |
One thing we talked about is the unknown words; | |
there are different ways of doing it, because | |
1:22:44.225 --> 1:22:49.409 | |
in all the estimations we were still assuming | |
mostly that we have a fixed vocabulary. | |
1:22:50.270 --> 1:23:06.372 | |
So you can, for example, create an unknown | |
token and use that while training the statistical language model. | |
1:23:06.766 --> 1:23:16.292 | |
It was mainly used in statistical language processing | |
before the newer models came up. | |
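A common recipe for this unknown-token trick, sketched with made-up data and an assumed frequency threshold:

```python
# Unknown-token sketch with made-up data: words below an assumed frequency
# threshold are replaced by '<unk>' before training, so the model has a
# probability for unknown words at test time.
from collections import Counter

train = "the cat sat on the mat the cat ran".split()
counts = Counter(train)
vocab = {w for w, c in counts.items() if c >= 2}  # threshold is an assumption

mapped = [w if w in vocab else "<unk>" for w in train]
print(mapped)  # 'sat', 'on' and 'ran' become '<unk>'
```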
1:23:18.578 --> 1:23:30.573 | |
What is also nice is that if you're going | |
to really large n-gram models, it's more | |
1:23:30.573 --> 1:23:33.114 | |
about efficiency. | |
1:23:33.093 --> 1:23:37.378 | |
And then, whether you have a real probability in your | |
model: | |
1:23:37.378 --> 1:23:41.422 | |
in a lot of situations that's not really important. | |
1:23:41.661 --> 1:23:46.964 | |
It's more about ranking so which one is better | |
and if they don't sum up to one that's not | |
1:23:46.964 --> 1:23:47.907 | |
that important. | |
1:23:47.907 --> 1:23:53.563 | |
Of course then you cannot calculate any perplexity | |
anymore because if this is not a probability | |
1:23:53.563 --> 1:23:58.807 | |
distribution, then the thing we had with the negative | |
log-likelihood doesn't fit anymore, and that's not | |
1:23:58.807 --> 1:23:59.338 | |
working. | |
1:23:59.619 --> 1:24:02.202 | |
However, such a simplification is also very helpful. | |
1:24:02.582 --> 1:24:13.750 | |
And that is why this stupid back-off was | |
presented: it removes all these complicated things | |
1:24:13.750 --> 1:24:14.618 | |
which we had before. | |
1:24:15.055 --> 1:24:28.055 | |
And it is simple: if we have seen the n-gram, we directly take the | |
relative frequency, and otherwise we back off with a fixed weight. | |
1:24:28.548 --> 1:24:41.867 | |
There is no discounting anymore, so it's | |
very, very simple; however, they showed it works well, and you | |
1:24:41.867 --> 1:24:47.935 | |
have to calculate a lot fewer statistics. | |
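A minimal stupid back-off sketch (toy corpus; the scores are deliberately not normalized, and 0.4 is the fixed factor reported for the original method):

```python
# Stupid back-off sketch on a made-up toy corpus: no discounting and no
# normalization, so these are scores rather than probabilities. If the
# n-gram was seen, take the plain relative frequency; otherwise recurse
# on the shorter context scaled by a fixed factor.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
ngrams = {n: Counter(zip(*[tokens[i:] for i in range(n)])) for n in (1, 2, 3)}
total = len(tokens)

def score(w, history, back=0.4):
    if not history:
        return ngrams[1][(w,)] / total
    n = len(history) + 1
    if ngrams[n][history + (w,)] > 0:
        return ngrams[n][history + (w,)] / ngrams[n - 1][history]
    return back * score(w, history[1:], back)

print(score("sat", ("the", "cat")))  # seen trigram
print(score("mat", ("the", "cat")))  # unseen: 0.4 * shorter-context score
```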
1:24:50.750 --> 1:24:57.525 | |
In addition you can have other types of language | |
models. | |
1:24:57.525 --> 1:25:08.412 | |
We had word-based language models, and they | |
normally go up to four-, five- or six-grams. | |
1:25:08.412 --> 1:25:10.831 | |
Otherwise they get too large. | |
1:25:11.531 --> 1:25:20.570 | |
So what people have then looked also into | |
is what is referred to as part of speech language | |
1:25:20.570 --> 1:25:21.258 | |
model. | |
1:25:21.258 --> 1:25:29.806 | |
So instead of looking at the word sequence | |
you're modeling directly the part of speech | |
1:25:29.806 --> 1:25:30.788 | |
sequence. | |
1:25:31.171 --> 1:25:34.987 | |
Then of course you're now only modeling | |
syntax. | |
1:25:34.987 --> 1:25:41.134 | |
There's no semantic information anymore in | |
the part-of-speech tags, but now you might go | |
1:25:41.134 --> 1:25:47.423 | |
to a larger context length, so you can do seven-, | |
eight- or nine-grams, and then you can capture some | |
1:25:47.423 --> 1:25:50.320 | |
of the long-range dependencies. | |
1:25:52.772 --> 1:25:59.833 | |
And there's other things people have done, | |
like cache language models; the idea in a cache | |
1:25:59.833 --> 1:26:07.052 | |
language model is that words that you have | |
recently seen are | |
1:26:07.052 --> 1:26:11.891 | |
more probable to reoccur, so you want to model | |
these dynamics. | |
1:26:12.152 --> 1:26:20.734 | |
If I'm talking here and we talked about | |
language models in my presentation, | |
1:26:20.734 --> 1:26:23.489 | |
the term will occur a lot more often. | |
1:26:23.883 --> 1:26:37.213 | |
You can do that by having a dynamic and a static | |
component, and then you have a dynamic component | |
1:26:37.213 --> 1:26:41.042 | |
which looks at the recently seen bigrams. | |
1:26:41.261 --> 1:26:49.802 | |
And thereby, for example, once you generated a word, | |
its language model probability is increased, | |
1:26:49.802 --> 1:26:52.924 | |
and you're modeling that behavior. | |
1:26:56.816 --> 1:27:03.114 | |
So the dynamic component is trained on the | |
text translated so far. | |
1:27:04.564 --> 1:27:12.488 | |
It is trained on what you just have produced; there's | |
no human feedback there. | |
1:27:12.712 --> 1:27:25.466 | |
So the model sees its own output all the time, and then it | |
will repeat its errors; and that is, of course, a danger. | |
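A cache language model can be sketched as follows (toy data; the interpolation weight and cache size are assumptions):

```python
# Cache language model sketch with made-up data: interpolate a static
# unigram model with a dynamic cache of recently produced words. The
# interpolation weight and the cache size are assumptions.
from collections import Counter, deque

static_counts = Counter("the cat sat on the mat".split())
static_total = sum(static_counts.values())
cache = deque(maxlen=50)  # the dynamic component: recently seen words

def p_cached(w: str, lam: float = 0.9) -> float:
    p_static = static_counts[w] / static_total
    p_cache = cache.count(w) / len(cache) if cache else 0.0
    return lam * p_static + (1 - lam) * p_cache

cache.extend("we talked about language models and language models".split())
print(p_cached("models"))  # boosted by the cache despite zero static count
```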
1:27:25.966 --> 1:27:31.506 | |
A similar idea: people have looked into | |
trigger language models, where, when one word occurs, | |
1:27:31.506 --> 1:27:34.931 | |
then you increase the probability of some other | |
words. | |
1:27:34.931 --> 1:27:40.596 | |
So if you're talking about money, that will | |
increase the probability of 'bank', 'savings account', | |
1:27:40.596 --> 1:27:41.343 | |
'dollar', and so on. | |
1:27:41.801 --> 1:27:47.352 | |
Because then you have to somehow model this | |
dependency, but it's somehow also an idea of | |
1:27:47.352 --> 1:27:52.840 | |
modeling long-range dependencies, because if | |
one word occurs very often in your document, | |
1:27:52.840 --> 1:27:58.203 | |
you somehow learn which other | |
words tend to occur, because they appear more often | |
1:27:58.203 --> 1:27:59.201 | |
than by chance. | |
1:28:02.822 --> 1:28:10.822 | |
Yes, then the last thing is, of course, especially | |
for languages which are morphologically | |
1:28:10.822 --> 1:28:11.292 | |
rich. | |
1:28:11.292 --> 1:28:18.115 | |
You can do something similar to BPE, so you | |
can now use morphemes or so, and then model | |
1:28:18.115 --> 1:28:22.821 | |
the morpheme sequence, because the morphemes | |
occur more often. | |
1:28:23.023 --> 1:28:26.877 | |
However, the problem is of course that your | |
sequence length also gets longer. | |
1:28:27.127 --> 1:28:33.185 | |
And so if you have a four-gram language model, | |
it's not counting the last three words but | |
1:28:33.185 --> 1:28:35.782 | |
only the last three morphemes, which is less context. | |
1:28:36.196 --> 1:28:39.833 | |
So of course it's a bit challenging to | |
know how to deal with that. | |
1:28:40.680 --> 1:28:51.350 | |
What about languages like Finnish, where | |
you have suffixes at the end of the word? | |
1:28:51.350 --> 1:28:58.807 | |
Yeah, but there you can typically do something | |
like that. | |
1:28:59.159 --> 1:29:02.157 | |
It is not the one perfect solution. | |
1:29:02.157 --> 1:29:05.989 | |
You have to do a bit of testing what is best. | |
1:29:06.246 --> 1:29:13.417 | |
One way of dealing with a large vocabulary | |
that you haven't seen is to split these words | |
1:29:13.417 --> 1:29:20.508 | |
into parts that are either more | |
linguistically motivated, like morphemes, or more | |
1:29:20.508 --> 1:29:25.826 | |
statistically motivated, like we have in | |
byte pair encoding. | |
1:29:28.188 --> 1:29:33.216 | |
Only the representation of your text is different. | |
1:29:33.216 --> 1:29:41.197 | |
How you are later doing all the counting and | |
the statistics is the same. | |
1:29:41.197 --> 1:29:44.914 | |
What changes is what you assume as your sequence units. | |
1:29:45.805 --> 1:29:49.998 | |
That's the same thing for the other things | |
we had here. | |
1:29:49.998 --> 1:29:55.390 | |
Here you don't have words, but everything | |
you're doing is done exactly the same. | |
1:29:57.857 --> 1:29:59.457 | |
Some practical issues. | |
1:29:59.457 --> 1:30:05.646 | |
Typically you're doing things in log space | |
and you're adding, because multiplying very | |
1:30:05.646 --> 1:30:09.819 | |
small values sometimes gives you problems with | |
numerical calculation. | |
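A tiny sanity check of why log space helps (toy values):

```python
# Why log space: multiplying many small probabilities underflows float64,
# while summing their logs stays perfectly representable. Toy values.
import math

probs = [1e-5] * 100  # one hundred small toy word probabilities

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflow

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -1151.3 -- no problem at all
```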
1:30:10.230 --> 1:30:16.687 | |
The good thing is you mostly don't have to take care | |
of this yourself, so there are very good toolkits | |
1:30:16.687 --> 1:30:23.448 | |
like SRILM or KenLM, to which you can | |
just give your data, and they will train the | |
1:30:23.448 --> 1:30:30.286 | |
language model, do all the complicated maths | |
behind that, and you are able to run them. | |
1:30:31.911 --> 1:30:39.894 | |
So what you should keep from today is what | |
is a language model and how we can do maximum likelihood | |
1:30:39.894 --> 1:30:44.199 | |
training on that, and the different smoothing techniques. | |
1:30:44.199 --> 1:30:49.939 | |
Similar ideas we use for a lot of different | |
statistical models. | |
1:30:50.350 --> 1:30:52.267 | |
where you always have the problem of unseen events. | |
1:30:53.233 --> 1:31:01.608 | |
A different way of looking at it and doing it | |
we will see on Thursday, when we go to neural language models. | |