Hey, welcome to today's lecture on language modeling.

Last time we had a different view on machine translation, namely the evaluation part: it is important to evaluate and see how well a system really works. Today we want to continue with building the MT system, and this will be the last part before we go into the neural models on Thursday.

So we had the broader view on statistical machine translation, and a week ago on Thursday we talked about statistical machine translation and mainly the translation model, so how we model how probable it is that one word is translated into another.

However, there is another component when doing generation tasks in general, and machine translation in particular. There are several characteristics which you only need to model on the target side; in the traditional approach we talked about the generation from a more semantic or syntactic representation into the real words. And the challenge is that there are some constructs which exist only in the target language. You cannot really get them from the translation; it is more something that needs to be modeled on the target side.

This is typically done by a language model, and this concept of a language model is, I guess you can assume, very important nowadays. You have read a lot about large language models recently, and they are all somehow trained on, or built around, this idea.

What we will look at today: we will first see what a language model is, and today's focus will be the statistical, count-based n-gram language model. This was the common approach to language modeling for twenty or thirty years, so for a long time it was really the state of the art, and people have used it in many applications, in machine translation and in automatic speech recognition.

Then, again, we will measure performance, but this time purely the performance of the language model. And then we will see that the traditional language model has a major drawback in how it deals with unseen events. If you model language, you will see that most sentences you encounter you have never seen before, and you are still able to assess whether they are good, native-sounding language. That is challenging if you just do plain parameter estimation.

We will use two different techniques to deal with that, smoothing and interpolation, and these are essential in order to build a good language model. This also motivates why things might be easier once we go to neural models, as we will see. And at the end we will talk a bit about some additional types of language models which are also used.

So where are language models used, or how are they used in machine translation?
The idea of a language model is that we are modeling the fluency of language. So if you have, for example, the beginning of a sentence, then you can estimate which words can come next: some continuations are valid, while other words are not.

And we have seen where we can use that: in the noisy channel model from two weeks ago. Today we will look into how we can model P of Y, that is, how probable the target sentence is. This is completely independent of the translation process; it only asks how fluent a sentence is and how you would express it.

And this language modeling task has one really big advantage, and I would assume it is even the biggest advantage: the data we need to train it. Normally we are doing supervised learning. For machine translation, as we will talk about, that means we have the source sentence and the target sentence, and they need to be aligned; we will look into how we can model them. Generally, the problem with this is getting such data. For machine translation you still have the advantage that there are quite huge amounts of this data for many languages, not all but many; for other tasks it is even more difficult, for example there is very little data where you have a text together with its summary.

So the big advantage of a language model is that we are only modeling the target sentences, so we only need pure text. And pure text, especially since we have the internet, is available in very large amounts. Of course, it may still cover only some domains and some types of text: if you want data for speech about machine translation, maybe there is only limited data for that, and if you go to more exotic languages, you will have less data. But in language modeling we can now look at how we can make use of these data.

Nowadays this is often also framed as self-supervised learning, because on the one hand, as we will see, it is a kind of classification task, so supervised learning, but we create the training signal from the data itself. So it is not that we have pairs of text and labels; we only have the text.

So the question is how we can use this monolingual data and how we can train our language model. The main goal is to model fluent English, so we want to somehow model whether something is a sentence of the language. There is no clear separation between semantics and syntax here, and in this case it is not about a clear separation: we will model both of them somehow in there. There will be some notion of semantics, some notion of syntax.
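As a reminder of where this P(Y) sits: in the noisy channel setup mentioned above, the language model is one of the two factors. This is the standard formulation, assuming X is the source sentence and Y the target sentence:

```latex
\hat{Y} = \arg\max_{Y} P(Y \mid X)
        = \arg\max_{Y} \frac{P(X \mid Y)\,P(Y)}{P(X)}
        = \arg\max_{Y} \underbrace{P(X \mid Y)}_{\text{translation model}} \; \underbrace{P(Y)}_{\text{language model}}
```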
Coming back to why there is some semantics in there: what you want to model is how fluent, or how probable, it is that a native speaker produces a sentence, and we rarely say things that are semantically wrong. Therefore there is also some notion of semantics in the model.

So, for example, "the house is small" should have a higher probability than "the house is home", because "home" and "house" both correspond to roughly the same German word but are used differently. Similarly, it should be more probable that the plane is landing than that the plane is dancing. Both are syntactically correct, but one is semantically odd; still, you will see the first much more often, so it should get the higher probability.

More formally, the language model should be some type of function which gives us the probability that this sentence is produced, indicating that it is good English, or more generally good language; of course you can do that for any language.

In earlier times people even tried to do that deterministically; that was especially used for dialogue systems. You have a very strict syntax, so you can only say things like "turn off the radio", "turn off the light" and so on, and you have a very strict, deterministic finite-state grammar defining which types of phrases are allowed. The problem, of course, when dealing with language is that language is variable: we are not always speaking in correct sentences, and so this type of deterministic model does not really work.

That is why for many, many years people have looked into statistical language models, where we model something like the probability of the sequence of words w one to w n. The advantage of doing it statistically is that we can train on large text collections; we do not have to define the grammar by hand, and in most cases we do not want a hard decision "this is a sentence of the language or it is not", but some type of probability: how probable is this sentence?

Because, even for humans, it is not always clear whether a sentence is acceptable or not. I mean, in this presentation I just gave several sentences which are not correct English. So it will still happen that people speak or write sentences which are not correct, and you want to deal with all of them. That is a big advantage of the statistical models.

The disadvantage is that you need suitably large text collections, which might not exist for all languages. And nowadays you also see that there are issues in that you need large computational resources to deal with the data.
You need to collect all these crawls from the internet, which can give you enormous amounts of training data.

So if we want to build this, the question is of course: how can we estimate the probability? For example, how probable is the sentence "good morning"?

You all know basic statistics: if you have a large database of sentences, you count. I made this a real example from the TED talks, I guess most of you have heard of them. You count in how many sentences "good morning" occurs and divide by the total number of sentences, and you get a very small probability for "good morning".

Okay, so this looks like a very easy thing: we can directly model the language model this way. Does anybody see a problem why this might not be the final solution?

One answer from the audience: "I think we would need a whole lot more sentences to make anything useful of this, because the probability of a talk starting with 'good morning' is surely much higher than that." The probability estimate itself is actually okay, but you are going in the right direction regarding the large data. Another answer: "You cannot score a new sentence." Yes, it is about the amount of data: you said it is hard to get enough data; I would say it is impossible. We are constantly saying sentences which have never been said before, and we are still able to deal with them.

The problem is the sparsity of the data: there will be a lot of perfectly fine English sentences that we have simply never seen. And that is, of course, not what we want. If we want to model language, we need a model which can estimate how good a sentence is even if it has never been seen, and if we just count full sentences this way, most of them will get a zero probability, which is not useful.

So we need to do things a bit differently. For the translation models we already had some ideas for doing that, and we can do the same here: we can use the chain rule and the definition of conditional probability. The conditional probability of an event B given an event A is the probability of A and B divided by the probability of A.

Yes, I recently had an exam on automatic speech recognition, and the professor said this is not called the chain rule, because I used this terminology; he said it is just applying Bayes' rule. But this is the definition of conditional probability: P of B given A is defined as P of A and B divided by P of A, and that can easily be rewritten into P of A and B equals P of A times P of B given A.
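In symbols, these are the two identities being used (standard probability notation):

```latex
P(B \mid A) = \frac{P(A, B)}{P(A)}
\qquad\Longleftrightarrow\qquad
P(A, B) = P(A)\,P(B \mid A)
```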
And the nice thing is that we can easily extend this to more variables: P of A, B and C is P of A, times P of B given A, times P of C given A and B, and so on. More generally, you can do that for any length of sequence.

So if we now go back to words, we can model the probability of a sequence as the product over positions of the probability of each word given its history, that is, given all the preceding words.

Maybe it is clearer with real words: if we have P of "its water is so transparent", that is P of "its", times P of "water" given "its", times P of "is" given "its water", and so on. This way we are able to model the probability of the whole sentence by looking at each word given the sequence before it.

And of course the big advantage is that each word occurs more often than the full sentence, so hopefully we have seen each piece often enough. Of course, if a word does not occur at all this still does not work, but most of today's lecture is about dealing with that. So, first of all, is this at least easier than what we had before?

From the audience: "Does that really make it easier? Those histories get arbitrarily long, and we still condition on the whole sentence." Yes, exactly. When we look at the last probability here, we still have to have seen the full history: if we want to model the probability of "transparent" given "its water is so", we must have seen the full sequence. So in this first step we did not really gain anything; we still essentially need to have seen the full sentence. However, we are a little step closer.

So this is still a problem, and we will never have seen all histories. You can look at it this way: if you have a vocabulary of V words and sentences of length n, there are on the order of V to the power of n minus one possible histories, and we are quite sure we have never seen that much data. So we cannot really compute this probability directly.

However, there is a trick, and that is the idea behind most language models. Instead of asking how often this word appears after exactly this history, we do some kind of clustering: we cluster a lot of different histories into the same class, and then we model the probability of the word given this class of histories.

And then, of course, the big design decision is how to cluster the histories: how do we put all these histories together so that we have seen each class often enough to estimate the probability?

There are quite different things people can do: you can add part-of-speech tags, you can use semantic word classes, you can model similarity, you can model grammatical content, and so on. However, as quite often with these statistical models, a very simple solution is what works best, and this is what most statistical models do.
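Written out once, the chain-rule decomposition described above is the following (standard notation); the clustering idea then only changes what we condition on:

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
% for example:
P(\text{its water is so transparent}) = P(\text{its})\, P(\text{water} \mid \text{its})\, P(\text{is} \mid \text{its water})\, P(\text{so} \mid \text{its water is})\, P(\text{transparent} \mid \text{its water is so})
```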
Most statistical language models are based on the so-called Markov assumption, and that means we assume that most of the history is not that important.

So we model the probability of "transparent" given only "is so", that is, we keep maybe just the last two words, a fixed-size window. The class of all histories from word one to word i minus one is then simply the last two words. This classification does not need any additional knowledge and is very easy to compute, and we have now limited our histories: instead of arbitrarily long ones, we have, for example, only two-word histories. Of course, a lot of them will still not occur, but far fewer than with full sentences.

So it is a very simple trick to map all these histories into a manageable number of classes, and it is motivated by the language itself: the nearest words matter most. A lot of sequences mainly depend on the previous words, and things which are far away matter less.

In our product, everything is now modeled not by the whole history but by the last n minus one words. This is why people also talk about an n-gram language model: we are always looking at these chunks of n words and modeling their probability.

Let us start with the most simple, even extreme, case: the unigram model, where we ignore the whole history. The probability of a sequence of words is just the product of the probabilities of the individual words. Thereby we remove the whole context; the most probable sequence would be something like "the the the the", just repeating the most probable word.

The most probable words by themselves might not make sense, but they can, of course, give you a bit of intuition about which types of words should be more frequent. What you can do is train such a model and then automatically generate text from it. Such a sequence is generated by sampling; we will come back to sampling later in the lecture. Sampling means that you randomly pick a word, but according to the probability distribution: if the probability of one word is zero point two, you pick it in roughly twenty percent of the cases, and so on for the other words.

And if you look at text generated this way, you see that some frequent words keep occurring, but there is not really any continuing structure, because each word is modeled independently.

We can do better by going to a bigram model; then we have a bit of context. Of course it is still very small: the probability of the current word only depends on the previous word, and all the context before that is ignored.
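In formulas, the Markov (n-gram) approximation replaces the full history by the last n minus one words (standard notation):

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
% unigram: P(w_i), bigram: P(w_i \mid w_{i-1}), trigram: P(w_i \mid w_{i-2}, w_{i-1})
```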
This is of course still wrong, but it models language significantly better. Looking at generated text, some things still do not really make a lot of sense, but you are seeing some typical phrases. The whole output does not make sense, but locally the word pairs are frequent. A very nice example is "new car parking lot": if you have the word "new", then "car" is a common continuation, but after "new car" a human would not put "parking"; after "car", however, a frequent continuation is "parking lot".

And this is interesting because here we see the two meanings of "lot": there is the parking lot, but in general, if you only look at one word of history, the most common use is "a lot". So you see that the model is really not using the context further back, only the immediately preceding word.

In general we can of course make the context longer: unigrams, bigrams, trigrams. People typically went up to four- or five-grams, and then it gets difficult, because there are so many five-grams that it gets complicated: storing all of them makes the models so big that it no longer works, and at some point estimating the probabilities also gets too difficult because each individual n-gram is seen too rarely. If you have a small corpus, you will use a smaller n-gram length; with more data you can take a larger one.

What is important to keep in mind is that, of course, this assumption is wrong. There are long-range dependencies, and if we really want to model everything in language, a fixed window is not enough. Here is one of these extreme cases: "the computer which I had just put into the machine room on the fifth floor crashed". Somehow there is a dependency between "computer" and "crashed" that spans many words.

However, in most situations these cases are rare, and normally the most important things happen in the near context. But it is important to keep in mind that you cannot model everything this way; you cannot capture arbitrary long-range dependencies.

The next question is how we can train this, so how we estimate these probabilities. And again, the most simple thing is exactly the right thing: maximum likelihood estimation gives you the answer. How probable is it that a word follows the n minus one previous words? You just count how often the whole sequence happens and divide by how often the history happens.

I guess this is what most of you would have intuitively done, and this also works best.
So it is not complicated to train: you have to go over your corpus once, you count all the bigrams and unigrams, and then you can directly build the basic language model.

Where is it difficult? There are two difficulties. First, the basic language model does not work that well because of zero counts, and we will see how to address that. And second, especially if you go to larger n, you have to store all these n-grams efficiently.

So how can we do that? Here is a small example. Say your training corpus consists of the sentences "<s> I am Sam </s>", "<s> Sam I am </s>" and "<s> I do not like green eggs and ham </s>". The sentence start symbol occurs three times, and the word "I" follows it two times, so the probability of "I" given the sentence start is two thirds, and the probability of "Sam" given the sentence start is one third. Similarly, you can look at what follows "I": twice it is "am" and once it is "do", so again two thirds and one third.

And this is really all you need to know here; you can do this calculation directly from the counts.

The question then, of course, is what we really learn in these types of models. Here are examples from the Europarl corpus: the contexts "the green", "the red" and "the blue", and for each you see the probabilities of the next word. You see that there is a lot more in there than just syntax, because the initial phrase is always structured the same. For example, after "the green" you see "the green paper" and "the green group" in the European Parliament, and after "the red" you see "the red cross".

What you also see is that sometimes it is easy to guess the next word and sometimes it is more difficult. For example, following "the red", in nearly all cases the continuation was "cross". So for some contexts it is much easier to guess the next word than for others. There are different types of information encoded in this; you also know this from speaking: sometimes you directly know how the speaker will continue, so there is not a lot of new information in the next word, while in other cases, like after "the blue", there is a lot of information in the next word.

Another example is the Berkeley restaurant corpus. It was collected at Berkeley and has sentences like "can you tell me about any good spaghetti restaurants" or "mid-priced Thai food is what I'm looking for", so it is more like a dialogue system, and people have collected this data and of course you can also look into it and get the counts. So you count the bigrams: the rows give the first word and the columns the word that follows.
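As a small aside before looking further at these count tables: the maximum-likelihood counting from the "I am Sam" toy example above can be written in a few lines. This is a minimal sketch, not the exact tool used in the lecture, and the helper names are mine:

```python
from collections import Counter

# Toy corpus with explicit sentence boundary markers, as in the example above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, history):
    """Maximum-likelihood bigram estimate: C(history word) / C(history)."""
    return bigram_counts[(history, word)] / unigram_counts[history]

print(p_mle("I", "<s>"))    # 2/3
print(p_mle("Sam", "<s>"))  # 1/3
print(p_mle("am", "I"))     # 2/3
```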
Back to the restaurant counts: here you see, for example, that after "I" the word "want" is very frequent, while most other combinations occur rarely or not at all.

Then you have the absolute counts of how often each word occurs, and from those you can compute the probabilities.

If you then want, say, "I want Dutch food", you get the probability of the sequence by multiplying all the bigram probabilities along it. And then you get some interesting insights from that. For example, if you compare "I want Dutch" and "I want Chinese", you see which one is more common, so there is some world and domain knowledge in there. You also see that a sentence very often starts with "I"; that "eat" is possible after "to", but after "want" you cannot directly say "eat": you have to say "want to eat". So there is grammatical information, domain information and more in these simple counts.

Before we go into measuring quality, are there any questions about language models and this way of modeling?

I hope that does not mean everybody is sleeping. So, when we are training these language models, you need to decide, for example, which n-gram length to use: should we use a trigram or a four-gram model? How can you decide which of two models is better? If you had to do that, how would you decide whether to take language model A or language model B?

One answer from the audience: "I would take some test text and see which model assigns a higher probability to it." Very good; that is essentially the second option; the first thing one might think of is to take the language models and build a machine translation system with each.

The problems with that: first of all, you have to build a whole system, which is very time-consuming, and the result might not only depend on the language model. On the other hand, that is of course what you want in the end, and there is always the question of whether you evaluate each component individually or do an end-to-end evaluation. What can also happen is that by your metric something is a very good language model, but it somehow does not work well together with your translation model.

But of course it is very good to also have this type of intrinsic evaluation, where the assumption is, as was pointed out: good English should get a high probability and bad English a low probability. This is measured by taking a held-out data set, so some data which you do not train on, then calculating the probability of this data; then you are looking only at the language model, and you take the language model that assigns the higher probability.

In practice you are not directly using the probability, but the perplexity.
The perplexity is two to the power of the cross-entropy, and in the cross-entropy you are computing something like the average negative log probability per word.

So how exactly is that defined? Perplexity is typically what people report, and it is based on the cross-entropy. The cross-entropy is the negative average of the log probability of the whole text. We model this probability as the product over the words, as the n-gram model defines it, and if you remember the rules for logarithms, the log of a product becomes a sum: so the cross-entropy is minus one over n times the sum over all words of the log probability of each word given its history, and the perplexity is then just two to the power of that.

Why can this be interpreted as a branching factor? It gives you something like the average number of possibilities you have at each step. Imagine a digit task where you have no idea which digit comes next, so the probability of the next digit is one tenth for each; if you then compute the perplexity, it will be exactly ten. So the perplexity gives you a nice interpretation of how much ambiguity, how much randomness, is still in there.

Of course, it is good to have a lower perplexity: there is less ambiguity. If you have a vocabulary of a hundred words and the perplexity is ten, it is as if you only had to choose uniformly between ten different words at each position.

There was a question about the logarithm: yes, you have the logarithm here and then the power, and they should cancel out; which base you use is not that important, as long as it matches the base of the exponentiation, it is just a constant factor when you reformulate it.

So the best model is always the one that gives a high probability to the test data, which means a low perplexity.

Here you see an example: for the sentence "I would like to commend the rapporteur on his work" you have the log-two probability of each word, then the average, which is not yet the perplexity but the cross-entropy as mentioned, and two to the power of that gives you the perplexity of the sentence.

And this metric of perplexity is essential when working with language models, and we will also see it nowadays: quality is often measured in perplexity or cross-entropy, which tells you how good the model is at estimating the next word; the better the model, the more information it has about what comes next.
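To make the definition concrete, here is a tiny sketch (the helper is mine; it assumes you already have the per-word log-two probabilities from your model):

```python
import math

def perplexity(log2_word_probs):
    """Perplexity = 2 ** cross-entropy, where the cross-entropy is the
    negative average log2 probability per word over the evaluation text."""
    cross_entropy = -sum(log2_word_probs) / len(log2_word_probs)
    return 2 ** cross_entropy

# The "digit task" from above: ten positions, each digit has probability 0.1.
print(perplexity([math.log2(0.1)] * 10))  # approximately 10: the branching factor
```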
There was a question earlier about whether we also have to compare against all the other sentences we could have produced. You are doing that implicitly, because at each position you model the probability of the correct word, and this probability is normalized over all possible next words. So you have that implicitly in there: in each position you are modeling the probability of this word against all alternatives. In a sense you even have a very large number of negative examples, because all the possible continuations which are not there count as incorrect, which of course can also be a problem.

And the biggest challenge of these types of models is how to model unseen events. These can be unknown words, or unknown n-grams. That is important: even if you have seen all the words, with a bigram language model you still get a zero probability for any bigram you have not seen, because the bigram count is in the numerator.

If you have unknown words, the problem gets even bigger, because one unknown word typically causes several zero probabilities. For example, if your vocabulary contains "go", "to" and "KIT", and you now get the sentence "I go to KIT", then the word "I" is unknown. To model the sentence probability you multiply n-gram probabilities such as P of "I" given the sentence start and P of "go" given "I", and since the unknown word never occurred in training, every n-gram containing it has a zero count. So all of these probabilities are directly zero: a single unknown word already kills several n-gram probabilities.

This also tells you that it might not always be better to use larger n-grams: with a larger n-gram language model, a single unseen word or n-gram destroys even more of the probabilities. So sometimes it is better to have a smaller n-gram order, because the chance that you have seen the n-gram is higher; on the other hand, you want a larger order, because the larger it is, the longer the context you are modeling.

So how can we address this type of problem? We address it by somehow adjusting our counts. We have many n-grams, but most of the entries in the count table are zero, and if one of these n-grams occurs you get a zero probability. Therefore we need to find some other way to estimate these types of events.

There are different ways of modeling and adjusting this. The first one is smoothing. In smoothing you say: okay, we take a bit of the probability mass we gave to our seen events, and the mass we take away we distribute to all the other, unseen events.
The nice thing is that in this case each event now has a non-zero probability, and that is of course very helpful, because we do not have zero probabilities anymore. Everything is smoothed out: you take some probability mass away from the seen events, and you have at least some probability everywhere.

You can also picture it for an n-gram distribution: this is your original distribution; you take some mass away from the seen n-grams and distribute this mass over all the other words. Thereby you make sure that it is now possible to model unseen events.

The other idea, and we will come to it in more detail later, is to do some type of clustering or backing off. That means, if we cannot model "go to KIT", for example, because we have not seen the full sequence, then we do not look at the full thing but directly estimate how probable the shorter context is, "to KIT" or so. We can then use interpolation, where you interpolate the probabilities of the different orders, and thereby still get a reasonable estimate.

These are the two ideas which are helpful for better calculating all these types of probabilities.

Let us start with adjusting the counts. The idea is: we have not seen an event, so its count-based probability is zero. The true probability is probably not that high, but you should always be aware that new things can happen, and you should somehow be able to estimate a probability for them. So the idea is that we can also assign a positive probability to events we have never seen.

What we are changing is this: currently we worked with empirical counts, so how often we have actually seen the n-grams, and now we move to expected counts, how often an n-gram would occur in unseen data. So we are directly trying to model that. Of course, the empirical counts are a good starting point: if you have seen a word very often in your training data, that is a good estimate of how often you will see it in the future. However, it makes sense to adjust them, because just because you have not seen something does not mean it cannot occur.

So, does anybody have a very simple idea of how to start with smoothing? You do the probability calculation with counts, and a bigram count can be zero; what count would you give in order to still do this calculation? We have to smooth, so we adjust the counts.

One suggestion from the audience: "We could clump together all the rare words, for example everything we have only seen once, estimate the probability mass of those, and treat the rare ones as a class."
Yes, and then every unseen word is treated as one of them. That works, but it is not only about unseen words, it is also about unseen n-grams.

You can even start easier, and that is what people did first: add-one smoothing. You will see it does not work well, but a variation of it works fine. Here we simply pretend we have seen everything once more than we actually did. That is similar to your suggestion, because you are clustering the zero counts and the one counts together: you just say you have seen everything at least once, things you saw once you count twice, and so on.

And once you have done that, there is no zero probability anymore, because every event has happened at least once. If you had seen a bigram five times, you would now count it six times. So the nice thing is: having "seen" everything once, the probability of the n-gram is the count plus one, divided by the history count plus the vocabulary size.

However, there is one big problem with it. Just imagine that you have a vocabulary of around eighty-six thousand words and a corpus of thirty million bigram tokens. Simple counting: you have seen thirty million bigram occurrences; that is the count mass you are distributing according to the data. The problem is how many possible bigrams there are: about seven point five billion, and each of them now gets an extra count of one.

So the number of possible bigrams is many times larger than the number of bigrams you actually see. That means you are mainly doing a uniform distribution: everything gets roughly the same probability, because the added counts dominate, and most of your probability mass is used for smoothing.

Most of the probability mass has to be spent on giving every possible bigram at least a count of one, while the real counts are only thirty million: seven point five billion counts are distributed uniformly over all the bigrams, and only thirty million are according to the actual frequencies. So you put far too much mass on the smoothing; it is some kind of extreme smoothing. That, of course, is bad and will not give you the best performance.

However, there is a nice observation: we do the probability calculation based on counts, but to do this division we do not need integers. We can also use floating-point values, and it is still a valid probability calculation. So we can give less probability mass to the unseen events; we do not have to add a full one.
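For reference, the add-one estimate just described, and its generalization where you add a small value instead of one, are usually written like this (V is the vocabulary size):

```latex
P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}
\qquad
P_{\text{add-}\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + \alpha}{C(w_{i-1}) + \alpha V}
```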
For our calculation we can also add, say, zero point zero two or some other very small value instead of one, and thereby put less weight on the smoothing and focus more on the actual corpus.

That is what people refer to as add-alpha smoothing. You see that we are now adding not one but only alpha to each count, and then we give less probability to the unseen events and more probability to the events we have really seen.

The question is, of course, how to find this alpha; typically you take some held-out data and optimize it on that.

So what does this really mean? This table gives you a bit of an idea. Here you have, for example, all the n-grams which occur exactly one time in the training data, all the n-grams which occur two times, and so on, down to the n-grams which occur zero times. For each group you can look at how often these n-grams occur on average in a test corpus.

If you now apply add-one smoothing, you can compare which expected counts it assigns to them. You see that for all the n-grams you have actually seen, you heavily underestimate how often they occur in the test corpus. What you want is to estimate this distribution well, so for each group of n-grams estimate quite well how often they will occur. Add-one smoothing is quite bad at that: it clearly underestimates all the seen ones, and only for the top row, the n-grams you have not seen, it heavily overestimates.

If you do add-alpha smoothing, with alpha optimized to fit the zero-count row, which is not completely fair because this alpha is now optimized on the test counts, you see that you do a lot better. It is not perfect, but you are much better at estimating how often the n-grams will occur.

So this is one way of doing it. Of course, there are other ways; this has been a large research direction.

One of them is deleted estimation. What you do is split your training data into two parts. You look at which n-grams occur exactly r times in the first part of your training data, and then you look at how often exactly these n-grams occur in the second part. Then you say: for an n-gram that occurred r times in the first half, the expected count is the total number of occurrences of all such n-grams in the second half, divided by the number of these n-gram types. It is again some type of clustering: you put all the n-grams which occur r times together in order to estimate how often n-grams of this group really occur. And then your final estimation is done by just using these statistics.
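Written as a formula, in my own notation: let N_r be the number of n-gram types that occur exactly r times in the first half, and T_r the total number of times those same n-grams occur in the second half; the adjusted count for an n-gram seen r times is then

```latex
r^{*} = \frac{T_r}{N_r}
```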
So this is called deleted estimation, and with it you are able to estimate better how often an n-gram really occurs.

And again we can do the same check and compare with the expected counts; we have exactly the same kind of table. Here you have, for each count, how many n-gram types exist, for example how many n-grams occur once, twice and so on. Then you look into your other half: how often do these n-grams occur in the second part of the training data? For example, an n-gram which was unseen in the first half I expect to occur only a small fraction of a time, an n-gram which occurred one time I expect to occur roughly once, and so on.

There was a question about where the number of zero-count n-grams comes from: you take the vocabulary from the unigrams and just calculate how many possible bigrams there are; everything you have not seen is a zero count. And in this case we are not assuming a larger vocabulary, because then, of course, the numbers get even bigger; you do that given the current vocabulary. How to deal with unknown words is a separate problem; this here is about how to smooth the n-gram counts.

The last idea is the so-called Good-Turing estimation, and the idea is similar. There is a proper mathematical proof behind it, but the result is that a very good estimation of the expected count is: you take the number of n-grams which occur one time more often, times r plus one, divided by the number of n-grams which occur r times. So if you are looking at an n-gram which occurs r times, then you look at how many n-gram types occur r times and how many occur r plus one times.

That is very simple, because you only have to count, for each r, how many different n-grams occur r times, and that is easy to get.

There is one issue: for n-grams which occur very often, it might be that there are some n-grams occurring r times but none occurring r plus one times, and then the estimate breaks down. So what you normally do is use this formula for small r, and for large r you do some curve fitting. In general, this type of smoothing matters mainly for n-grams which occur rarely; if an n-gram occurs very often, the empirical estimate is already fine, so this is more important for rare events.

Here again you see the counts, and based on that you get the adjusted counts, and if you compare with the test counts you see that it really works quite well: for the low counts it models very well how often these n-grams really occur.
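The Good-Turing adjusted count, in the usual notation where N_r is the number of n-gram types seen exactly r times, is

```latex
r^{*} = (r + 1)\,\frac{N_{r+1}}{N_r}
```

As a consequence, the total probability mass given to all unseen n-grams is N_1 divided by the total number of observed n-gram tokens.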
Then, of course, the question is how well this works in language modeling; we also want to measure that. We can measure it with perplexity, which we introduced before, and here are the results. With add-one smoothing you saw that far too much probability mass is put on the events with zero counts, and the perplexity is correspondingly bad. Then you have add-alpha smoothing; here, with the caveat that it is not completely fair because the alpha was optimized on the test data, it is clearly better. But you see that deleted estimation and Good-Turing give you a similar performance, so they really seem to work quite well.

So this was all about assigning probability mass to n-grams which we have not seen, in order to also estimate their probability; now we are coming to interpolation.

Good. So now we have done this estimation, and the problem is that we still have the general trade-off: we want a longer context, because we can model language better with it, because of longer-range dependencies; on the other hand, we have limited data, so we want short n-grams, because short n-grams occur more often and can be estimated more reliably.

And the smoothing and discounting we did before always treats all unseen n-grams the same: we did not really look at the n-grams themselves, they were all just grouped by how often they occur.

However, sometimes this is not very helpful. For example, look at the n-grams "Scottish beer drinkers" and "Scottish beer eaters". Because we have seen neither trigram, we estimate both trigram probabilities by the probability assigned to the zero count, so they get the same value.

However, if you look at the bigram probabilities, you might have seen those, and they might be helpful: "beer drinkers" is much more probable than "beer eaters", so "Scottish beer drinkers" should also be more probable than "Scottish beer eaters".

This type of information is ignored so far: with a trigram language model we only look at trigram counts divided by bigram counts, and if we have not seen the trigram we do not check whether we have maybe seen the bigram and could back off to it.

That is what people do in interpolation and back-off. The idea is: if we have not seen the large n-gram, we go to a shorter sequence and try to use the information in its probability.

And this is the idea of interpolation; there are two different ways of doing it. The easiest one is to say: okay, we have unigrams, we have bigrams, we have trigrams, why not use them all? Of course, the larger ones give the larger context, but the shorter ones are maybe better estimated.
So you take, for each position, lambda one times the unigram probability of the word, plus lambda two times the bigram probability, plus lambda three times the trigram probability. And of course the lambdas need to sum to one, because otherwise we do not have a probability distribution, but we can optimize the weights, for example on a held-out data set. Thereby we now have a probability distribution which takes all of them into account.

Coming back to the Scottish beer drinkers example: the trigram probability will be the same for both phrases, because both occur zero times, but the bigram probability will hopefully be different, because we might have seen "beer drinkers" but not "beer eaters", and therefore the combined probabilities differ.

So the idea is that sometimes it is better to have different models and combine them, instead of relying on a single one.

Another idea, instead of this overall interpolation, is recursive interpolation. The probability of the word given its history is lambda times the n-gram language model probability, plus one minus lambda, so that the two weights sum to one, times the interpolated probability of the n minus one gram model, and this goes on recursively until you reach the unigram probability. What you can also do is not use the same weights for all words, but let the lambdas depend on the history: for example, for n-grams which you have seen very often, you put more weight on the trigram.

The other thing you can do is back-off, and the difference in back-off is that we are not interpolating, we are selecting. If we have seen the trigram, so if the trigram count is bigger than zero, then we take the trigram probability, and only if we have not seen it do we back off to the lower-order probability.

So that is the difference: in interpolation we always combine all the n-gram probabilities; in back-off we only fall back when needed.

Why do we need this extra factor here, why can we not just take the lower-order probability directly? Yes, because otherwise the probabilities no longer sum to one. In order to make them still sum to one, we have to take away a bit of probability mass from the seen events; the difference to before is that we are no longer distributing it equally to the unseen events, but according to the lower-order model.

For example, this can be done with Good-Turing: the adjusted counts in Good-Turing are always a bit lower than the raw counts, so you can take this difference and distribute this weight over the lower-order back-off distribution.

So that is how we can distribute the mass when backing off.
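Going back for a moment to the simple linear interpolation described above, here is a minimal sketch in code; the three estimator functions and the lambda values are placeholders (in practice the weights are tuned on held-out data):

```python
def interpolated_prob(w, h2, h1, p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """Linear interpolation of unigram/bigram/trigram estimates:
    P(w | h2 h1) = l1*P(w) + l2*P(w | h1) + l3*P(w | h2, h1),
    with l1 + l2 + l3 = 1 so the result is still a distribution."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, h1) + l3 * p_tri(w, h2, h1)
```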
1:14:43.563 --> 1:14:55.464 And there's one technique which is called Witten-Bell smoothing. 1:14:55.315 --> 1:15:01.335 In the back-off, it might make sense to look at the words and 1:15:01.335 --> 1:15:04.893 see how probable it is that you need to back off at all. 1:15:05.425 --> 1:15:11.232 So look at these two words, 'spite' and 'constant'. 1:15:11.232 --> 1:15:15.934 Those occur exactly the same number of times in the corpus. 1:15:16.316 --> 1:15:27.804 They would be treated exactly the same because both occur the same number of times, and the smoothing would be 1:15:27.804 --> 1:15:29.053 the same. 1:15:29.809 --> 1:15:48.401 However, they shouldn't really be modeled the same. 1:15:48.568 --> 1:15:57.447 If you compare them: for 'constant' there are around four hundred different continuations of this 1:15:57.447 --> 1:16:01.282 word, while for 'spite' there is nearly always the same continuation. 1:16:02.902 --> 1:16:11.203 So if you're now seeing a new bigram, a bigram with 'constant' or 'spite' as the starting word 1:16:11.203 --> 1:16:13.467 and then another word: 1:16:15.215 --> 1:16:25.606 for 'constant' it's very frequent that you see new n-grams, because there are many different 1:16:25.606 --> 1:16:27.222 combinations. 1:16:27.587 --> 1:16:35.421 Therefore, it might be good not only to look at the counts of the n-grams, but also at how 1:16:35.421 --> 1:16:37.449 many extensions a word has. 1:16:38.218 --> 1:16:43.222 And this is done by Witten-Bell smoothing. 1:16:43.222 --> 1:16:51.032 The idea is that we count how many possible extensions a history has. 1:16:51.371 --> 1:17:01.966 So for 'spite' we had only a few possible extensions, and for 'constant' we had a lot more. 1:17:02.382 --> 1:17:09.394 And then how much weight we put on our back-off model depends 1:17:09.394 --> 1:17:13.170 on this number of possible extensions. 1:17:14.374 --> 1:17:15.557 So: 1:17:15.557 --> 1:17:29.583 we have it here, this is the weight you put on your lower-order n-gram probability. 1:17:29.583 --> 1:17:46.596 For example, if you compare these two numbers: for 'spite' you take how many extensions 1:17:46.596 --> 1:17:55.333 'spite' has, divided by that number plus the count of 'spite', which gives a very small weight, while for 'constant' you get around zero point three. 1:17:55.815 --> 1:18:05.780 So for 'constant' you're putting a lot more weight on the lower order; it's not as bad to fall back to the back-off 1:18:05.780 --> 1:18:06.581 model. 1:18:06.581 --> 1:18:10.705 For 'spite' it's really unusual to need the back-off. 1:18:10.730 --> 1:18:13.369 For 'constant' there's a lot of probability mass on it. 1:18:13.369 --> 1:18:15.906 The chance that you're backing off is quite high. 1:18:20.000 --> 1:18:26.209 Similarly, but just the other way around, we can now look at the lower-order probability distribution itself. 1:18:26.546 --> 1:18:37.103 So far, when we back off, the probability distribution for the lower-order n-grams is calculated exactly 1:18:37.103 --> 1:18:40.227 the same way as the normal probability. 1:18:40.320 --> 1:18:48.254 However, they are used in a different way: the lower-order n-grams are only used 1:18:48.254 --> 1:18:49.361 if we have not seen the higher-order n-gram. 1:18:50.410 --> 1:18:54.264 So it's like you're modeling something different. 1:18:54.264 --> 1:19:01.278 You're modeling how probable this n-gram is given that we haven't seen the larger n-gram, and that 1:19:01.278 --> 1:19:04.361 is captured by the diversity of histories. 1:19:04.944 --> 1:19:14.714 For example, if you look at 'York', that's a quite frequent word. 1:19:14.714 --> 1:19:18.530 It occurs many times. 1:19:19.559 --> 1:19:27.985 However, four hundred seventy-three times the word right before it was 'New'.
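Going back to the Witten-Bell weighting described just before the 'York' example, here is a small sketch of how the back-off weight grows with the number of distinct continuations; the 'spite'/'constant' counts below are invented to mirror the lecture's point, not real corpus statistics.

```python
from collections import Counter

# Invented bigram counts: 'spite' is almost always followed by the same word,
# while 'constant' has many different continuations.
bigram_counts = Counter({("spite", "of"): 990, ("spite", "and"): 2, ("spite", "."): 1})
for i in range(400):
    bigram_counts[("constant", f"word{i}")] = 2

def backoff_weight(history):
    """Witten-Bell style weight on the lower-order model:
       N1+(history, *) / (N1+(history, *) + c(history))."""
    continuation_counts = [c for (h, _), c in bigram_counts.items() if h == history]
    distinct = len(continuation_counts)      # how many different words followed this history
    occurrences = sum(continuation_counts)   # how often the history occurred in total
    return distinct / (distinct + occurrences)

print(backoff_weight("spite"))     # ~0.003: backing off is rare, so little mass goes to the lower order
print(backoff_weight("constant"))  # ~0.33: many new continuations expected, so much more back-off mass
```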
1:19:29.449 --> 1:19:40.237 So if you now think that the unigram model is only used when we back off, the probability of 'York' in the unigram 1:19:40.237 --> 1:19:49.947 model should be very, very low, because it almost only occurs after 'New'. So you should have a lower probability for 'York' 1:19:49.947 --> 1:19:56.292 than, for example, for 'foods', although you have seen both of them the same number of times, and 1:19:56.292 --> 1:20:02.853 this is done by Kneser-Ney smoothing, where you are not counting the words themselves, but 1:20:02.853 --> 1:20:05.377 you count the number of histories. 1:20:05.845 --> 1:20:15.233 So, the other way around: by how many different words was it preceded? 1:20:15.233 --> 1:20:28.232 Then you use that count instead of the normal word count. So you don't need to know all the formulas 1:20:28.232 --> 1:20:28.864 here. 1:20:28.864 --> 1:20:33.498 The more important thing is this intuition. 1:20:34.874 --> 1:20:44.646 Using the lower order already means that I haven't seen the larger n-gram, and therefore 1:20:44.646 --> 1:20:49.704 it might be better to model it differently. 1:20:49.929 --> 1:20:56.976 So if there's a new n-gram with some other word and then 'York', that's very improbable compared 1:20:56.976 --> 1:20:57.297 to other continuations. 1:21:00.180 --> 1:21:06.130 And yeah, this modified Kneser-Ney smoothing is what people took into use. 1:21:06.130 --> 1:21:08.249 That's the default approach. 1:21:08.728 --> 1:21:20.481 It has absolute discounting for the n-grams, and for the lower orders it 1:21:20.481 --> 1:21:27.724 uses the counting of histories which we just had. 1:21:28.028 --> 1:21:32.207 And there are even two versions of it, the back-off one and the interpolated one. 1:21:32.472 --> 1:21:34.264 What may be interesting: 1:21:34.264 --> 1:21:40.216 it even works well with interpolation, although the assumption is then no longer 1:21:40.216 --> 1:21:45.592 true, because you're using the lower-order n-grams even if you've seen the higher-order ones. 1:21:45.592 --> 1:21:49.113 But since most of the weight is then on the higher-order n-grams, it still works. 1:21:49.929 --> 1:21:53.522 So if you look at some results on the perplexities, 1:21:54.754 --> 1:22:00.262 you see that normally interpolated modified Kneser-Ney gives you some of the best 1:22:00.262 --> 1:22:00.980 performance. 1:22:02.022 --> 1:22:08.032 You also see that the larger your n-gram order is, with interpolation, 1:22:08.032 --> 1:22:15.168 the significantly better you get, so it pays off to look at more than just the last few words. 1:22:18.638 --> 1:22:32.725 Good, so much for these types of smoothing, and we will finish with some special things about 1:22:32.725 --> 1:22:34.290 language models. 1:22:38.678 --> 1:22:44.225 One thing we talked about is the unknown words; there are different ways of handling them, because 1:22:44.225 --> 1:22:49.409 in all the estimations we were still mostly assuming that we have a fixed vocabulary. 1:22:50.270 --> 1:23:06.372 So you can, for example, create an unknown token and use that while training the statistical language model. 1:23:06.766 --> 1:23:16.292 This was mainly an issue in statistical language processing before the newer neural models came along. 1:23:18.578 --> 1:23:30.573 What is also nice: if you're going to really large n-gram models, it's more 1:23:30.573 --> 1:23:33.114 about efficiency. 1:23:33.093 --> 1:23:37.378 And then you have to store a lot in your model, while the exact probability 1:23:37.378 --> 1:23:41.422 is in a lot of situations not really important.
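Returning to the Kneser-Ney idea of counting histories instead of raw occurrences, here is a minimal sketch of the continuation probability that replaces the plain unigram estimate; all counts are invented for illustration, loosely following the 'York' example above.

```python
from collections import Counter

# Invented bigram counts: 'york' is frequent but almost always preceded by 'new',
# while 'foods' is rarer but appears after many different words.
bigram_counts = Counter({
    ("new", "york"): 470, ("in", "york"): 2,
    ("some", "foods"): 5, ("many", "foods"): 4, ("fried", "foods"): 3,
    ("healthy", "foods"): 2, ("cheap", "foods"): 1,
})

def continuation_probability(word):
    """P_cont(w) = |{v : c(v, w) > 0}| / number of distinct bigram types."""
    distinct_histories = sum(1 for (v, w) in bigram_counts if w == word)
    return distinct_histories / len(bigram_counts)

print(continuation_probability("york"))   # low despite ~472 occurrences: only 2 distinct histories
print(continuation_probability("foods"))  # higher despite far fewer occurrences: 5 distinct histories
```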
1:23:41.661 --> 1:23:46.964 It's more about ranking, so which one is better, and if they don't sum up to one that's not 1:23:46.964 --> 1:23:47.907 that important. 1:23:47.907 --> 1:23:53.563 Of course then you cannot calculate any perplexity anymore, because if this is not a probability 1:23:53.563 --> 1:23:58.807 distribution, then the definition we had based on the negative log probability doesn't fit anymore, and that's not 1:23:58.807 --> 1:23:59.338 working. 1:23:59.619 --> 1:24:02.202 However, this simplification is also very helpful. 1:24:02.582 --> 1:24:13.750 And that is why this 'stupid back-off' was presented: it removes all these complicated things 1:24:13.750 --> 1:24:14.618 which we discussed. 1:24:15.055 --> 1:24:28.055 It just does this: if we have seen the n-gram, we directly take the relative frequency from the raw counts, and otherwise we back off with a fixed weight. 1:24:28.548 --> 1:24:41.867 There is no discounting anymore, so it's very, very simple, and yet, as they showed, you 1:24:41.867 --> 1:24:47.935 have to calculate a lot fewer statistics. 1:24:50.750 --> 1:24:57.525 In addition, you can have other types of language models. 1:24:57.525 --> 1:25:08.412 We had word-based language models, and they normally go up to four-, five-, or six-grams. 1:25:08.412 --> 1:25:10.831 Beyond that they get too large. 1:25:11.531 --> 1:25:20.570 So what people have then also looked into is what is referred to as a part-of-speech language 1:25:20.570 --> 1:25:21.258 model. 1:25:21.258 --> 1:25:29.806 So instead of looking at the word sequence, you're modeling directly the part-of-speech 1:25:29.806 --> 1:25:30.788 sequence. 1:25:31.171 --> 1:25:34.987 Then of course you're only modeling syntax. 1:25:34.987 --> 1:25:41.134 There's no semantic information anymore in the part-of-speech tags, but now you might go 1:25:41.134 --> 1:25:47.423 to a larger context length, so you can do seven- or nine-grams, and then you can capture some 1:25:47.423 --> 1:25:50.320 of the long-range dependencies. 1:25:52.772 --> 1:25:59.833 And there are other things people have done, like cache language models. The idea in a cache 1:25:59.833 --> 1:26:07.052 language model is that words that you have recently seen are 1:26:07.052 --> 1:26:11.891 more probable to reoccur, so you want to model these dynamics. 1:26:12.152 --> 1:26:20.734 For example, if I'm talking here and have mentioned language models in my presentation, 1:26:20.734 --> 1:26:23.489 this term will occur a lot more often. 1:26:23.883 --> 1:26:37.213 You can do that by having a static and a dynamic component, where the dynamic component 1:26:37.213 --> 1:26:41.042 looks, for example, at the bigrams of the recent text. 1:26:41.261 --> 1:26:49.802 And thereby, for example, if you have once generated a word, its language model probability is increased, 1:26:49.802 --> 1:26:52.924 and you're modeling that phenomenon. 1:26:56.816 --> 1:27:03.114 As said, the dynamic component is trained on the text translated so far. 1:27:04.564 --> 1:27:12.488 It is trained on what you have just produced; there's no human feedback there. 1:27:12.712 --> 1:27:25.466 So the model sees its own output all the time, and then it will repeat its errors, and that is, of course, a danger. 1:27:25.966 --> 1:27:31.506 A similar idea: people have looked into trigger language models, where if one word occurs, 1:27:31.506 --> 1:27:34.931 then you increase the probability of some other words. 1:27:34.931 --> 1:27:40.596 So if you're talking about money, that will increase the probability of 'bank', 'savings account', 1:27:40.596 --> 1:27:41.343 'dollar', and so on.
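For the 'stupid back-off' scheme mentioned above, a minimal sketch; note that it returns scores rather than normalized probabilities, which is exactly why perplexity can no longer be computed from it. The fixed factor 0.4 is a commonly used value, and the corpus is again a toy assumption.

```python
from collections import Counter

corpus = "scottish beer drinkers like scottish beer and beer drinkers like beer".split()
counts = {n: Counter(zip(*[corpus[i:] for i in range(n)])) for n in (1, 2, 3)}
total = len(corpus)

BACKOFF = 0.4  # fixed back-off factor; no discounting of the seen events at all

def stupid_backoff_score(w, history):
    """Relative frequency if the n-gram was seen, otherwise a fixed fraction of the lower-order score."""
    n = len(history) + 1
    if n == 1:
        return counts[1][(w,)] / total
    c = counts[n][tuple(history) + (w,)]
    if c > 0:
        return c / counts[n - 1][tuple(history)]
    return BACKOFF * stupid_backoff_score(w, history[1:])

print(stupid_backoff_score("drinkers", ("scottish", "beer")))  # seen: plain relative frequency
print(stupid_backoff_score("like", ("scottish", "beer")))      # unseen trigram/bigram: 0.4 * 0.4 * unigram score
```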
1:27:41.801 --> 1:27:47.352 Then you have to somehow model this dependency, but it's also an idea for 1:27:47.352 --> 1:27:52.840 modeling long-range dependencies, because if one word occurs very often in your document, 1:27:52.840 --> 1:27:58.203 you are somehow learning which other words tend to co-occur with it more often 1:27:58.203 --> 1:27:59.201 than by chance. 1:28:02.822 --> 1:28:10.822 Yes, then the last thing is, of course, especially for languages which are morphologically 1:28:10.822 --> 1:28:11.292 rich. 1:28:11.292 --> 1:28:18.115 You can do something similar to BPE, so you can now split into morphemes or so, and then model 1:28:18.115 --> 1:28:22.821 the morpheme sequence, because the individual morphemes occur more often. 1:28:23.023 --> 1:28:26.877 However, the problem is of course that your sequence length also gets longer. 1:28:27.127 --> 1:28:33.185 And so if you have a four-gram language model, it's not conditioning on the last three words but 1:28:33.185 --> 1:28:35.782 only on the last three morphemes, which is a shorter context. 1:28:36.196 --> 1:28:39.833 So of course then it's a bit challenging to know how to deal with that. 1:28:40.680 --> 1:28:51.350 What about a language like Finnish, with, for example, inflections at the end of the word? 1:28:51.350 --> 1:28:58.807 Yeah, there you can typically do something like that. 1:28:59.159 --> 1:29:02.157 There is not the one perfect solution. 1:29:02.157 --> 1:29:05.989 You have to do a bit of testing to see what is best. 1:29:06.246 --> 1:29:13.417 One way of dealing with a large vocabulary, with words you haven't seen, is to split these words 1:29:13.417 --> 1:29:20.508 into parts, either more linguistically motivated, into morphemes, or more 1:29:20.508 --> 1:29:25.826 statistically motivated, like we have in byte pair encoding. 1:29:28.188 --> 1:29:33.216 Only the representation of your text is different. 1:29:33.216 --> 1:29:41.197 How you are later doing all the counting and the statistics is the same. 1:29:41.197 --> 1:29:44.914 What changes is what you assume to be your sequence of units. 1:29:45.805 --> 1:29:49.998 That's the same for the other variants we had here. 1:29:49.998 --> 1:29:55.390 Here you don't have words, but everything else you're doing is done exactly the same. 1:29:57.857 --> 1:29:59.457 Some practical issues: 1:29:59.457 --> 1:30:05.646 typically you're doing things in log space and adding, because multiplying very 1:30:05.646 --> 1:30:09.819 small values sometimes gives you numerical problems. 1:30:10.230 --> 1:30:16.687 The good thing is you mostly don't have to take care of this yourself, as there are very good toolkits 1:30:16.687 --> 1:30:23.448 like SRILM or KenLM where you can just give your data and they will train the 1:30:23.448 --> 1:30:30.286 language model, do all the complicated maths behind that, and you are able to run them. 1:30:31.911 --> 1:30:39.894 So what you should keep from today is what a language model is, how we can do maximum 1:30:39.894 --> 1:30:44.199 likelihood training on that, and the different smoothing techniques for language models. 1:30:44.199 --> 1:30:49.939 Similar ideas are used for a lot of different statistical models, 1:30:50.350 --> 1:30:52.267 where you always have the problem of events you have never seen. 1:30:53.233 --> 1:31:01.608 A different way of looking at it and doing it we will see on Thursday, when we will go to neural language models.
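As a tiny illustration of the log-space remark above: multiplying many small per-word probabilities underflows to zero in floating point, while summing their logarithms stays perfectly usable as a score; the numbers are arbitrary.

```python
import math

word_probs = [1e-7] * 60   # hypothetical per-word probabilities of a 60-word sentence

product = 1.0
for p in word_probs:
    product *= p
print(product)             # 0.0: the product has underflowed

log_score = sum(math.log(p) for p in word_probs)
print(log_score)           # about -967: still fine for comparing hypotheses
```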