WEBVTT 0:00:03.663 --> 0:00:07.970 Okay, then I should switch back to English, sorry,. 0:00:08.528 --> 0:00:18.970 So welcome to today's lecture in the cross machine translation and today we're planning 0:00:18.970 --> 0:00:20.038 to talk. 0:00:20.880 --> 0:00:31.845 Which will be without our summary of power translation was done from around till. 0:00:32.872 --> 0:00:38.471 Fourteen, so this was an approach which was quite long. 0:00:38.471 --> 0:00:47.070 It was the first approach where at the end the quality was really so good that it was 0:00:47.070 --> 0:00:49.969 used as a commercial system. 0:00:49.990 --> 0:00:56.482 Or something like that, so the first systems there was using the statistical machine translation. 0:00:57.937 --> 0:01:02.706 So when I came into the field this was the main part of the lecture, so there would be 0:01:02.706 --> 0:01:07.912 not be one lecture, but in more detail than half of the full course would be about statistical 0:01:07.912 --> 0:01:09.063 machine translation. 0:01:09.369 --> 0:01:23.381 So what we try to do today is like get the most important things, which think our part 0:01:23.381 --> 0:01:27.408 is still very important. 0:01:27.267 --> 0:01:31.196 Four State of the Art Box. 0:01:31.952 --> 0:01:45.240 Then we'll have the presentation about how to evaluate the other part of the machine translation. 0:01:45.505 --> 0:01:58.396 The other important thing is the language modeling part will explain later how they combine. 0:01:59.539 --> 0:02:04.563 Shortly mentioned this one already. 0:02:04.824 --> 0:02:06.025 On Tuesday. 0:02:06.246 --> 0:02:21.849 So in a lot of these explanations, how we model translation process, it might be surprising: 0:02:22.082 --> 0:02:27.905 Later some people say it's for four eight words traditionally came because the first models 0:02:27.905 --> 0:02:32.715 which you'll discuss here also when they are referred to as the IVM models. 0:02:32.832 --> 0:02:40.043 They were trained on French to English translation directions and that's why they started using 0:02:40.043 --> 0:02:44.399 F and E and then this was done for the next twenty years. 0:02:44.664 --> 0:02:52.316 So while we are trying to wait, the source words is: We have a big eye, typically the 0:02:52.316 --> 0:03:02.701 lengths of the sewer sentence in small eye, the position, and similarly in the target and 0:03:02.701 --> 0:03:05.240 the lengths of small. 0:03:05.485 --> 0:03:13.248 Things will get a bit complicated in this way because it is not always clear what is 0:03:13.248 --> 0:03:13.704 the. 0:03:14.014 --> 0:03:21.962 See that there is this noisy channel model which switches the direction in your model, 0:03:21.962 --> 0:03:25.616 but in the application it's the target. 0:03:26.006 --> 0:03:37.077 So that is why if you especially read these papers, it might sometimes be a bit disturbing. 0:03:37.437 --> 0:03:40.209 Try to keep it here always. 0:03:40.209 --> 0:03:48.427 The source is, and even if we use a model where it's inverse, we'll keep this way. 0:03:48.468 --> 0:03:55.138 Don't get disturbed by that, and I think it's possible to understand all that without this 0:03:55.138 --> 0:03:55.944 confusion. 0:03:55.944 --> 0:04:01.734 But in some of the papers you might get confused because they switched to the. 0:04:04.944 --> 0:04:17.138 In general, in statistics and machine translation, the goal is how we do translation. 0:04:17.377 --> 0:04:25.562 But first we are seeing all our possible target sentences as possible translations. 
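For reference, the notation that keeps coming up can be summarized as follows. This is a sketch of the assumed convention (I/i for the source side as stated above; the letters J/j for the target side are an assumption here and may differ on the slides):

```latex
f = f_1, \dots, f_I \quad \text{(source sentence, length } I,\ \text{positions } i\text{)}
\qquad
e = e_1, \dots, e_J \quad \text{(target sentence, length } J,\ \text{positions } j\text{)}
```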
0:04:26.726 --> 0:04:37.495 And we are assigning some probability to the combination, so we are modeling. 0:04:39.359 --> 0:04:49.746 And then we are doing a search over all possible things or at least theoretically, and we are 0:04:49.746 --> 0:04:56.486 trying to find the translation with the highest probability. 0:04:56.936 --> 0:05:05.116 And this general idea is also true for neuromachine translation. 0:05:05.116 --> 0:05:07.633 They differ in how. 0:05:08.088 --> 0:05:10.801 So these were then of course the two big challenges. 0:05:11.171 --> 0:05:17.414 On the one hand, how can we estimate this probability? 0:05:17.414 --> 0:05:21.615 How is the translation of the other? 0:05:22.262 --> 0:05:32.412 The other challenge is the search, so we cannot, of course, say we want to find the most probable 0:05:32.412 --> 0:05:33.759 translation. 0:05:33.759 --> 0:05:42.045 We cannot go over all possible English sentences and calculate the probability. 0:05:43.103 --> 0:05:45.004 So,. 0:05:45.165 --> 0:05:53.423 What we have to do there is some are doing intelligent search and look for the ones and 0:05:53.423 --> 0:05:54.268 compare. 0:05:54.734 --> 0:05:57.384 That will be done. 0:05:57.384 --> 0:06:07.006 This process of finding them is called the decoding process because. 0:06:07.247 --> 0:06:09.015 They will be covered well later. 0:06:09.015 --> 0:06:11.104 Today we will concentrate on the mile. 0:06:11.451 --> 0:06:23.566 The model is trained using data, so in the first step we're having data, we're somehow 0:06:23.566 --> 0:06:30.529 having a definition of what the model looks like. 0:06:34.034 --> 0:06:42.913 And in statistical machine translation the common model is behind. 0:06:42.913 --> 0:06:46.358 That is what is referred. 0:06:46.786 --> 0:06:55.475 And this is motivated by the initial idea from Shannon. 0:06:55.475 --> 0:07:02.457 We have this that you can think of decoding. 0:07:02.722 --> 0:07:10.472 So think of it as we have this text in maybe German. 0:07:10.472 --> 0:07:21.147 Originally it was an English text, but somebody used some nice decoding. 0:07:21.021 --> 0:07:28.579 Task is to decipher it again, this crazy cyborg expressing things in German, and to decipher 0:07:28.579 --> 0:07:31.993 the meaning again and doing that between. 0:07:32.452 --> 0:07:35.735 And that is the idea about this noisy channel when it. 0:07:36.236 --> 0:07:47.209 It goes through some type of channel which adds noise to the source and then you receive 0:07:47.209 --> 0:07:48.811 the message. 0:07:49.429 --> 0:08:00.190 And then the idea is, can we now construct the original message out of these messages 0:08:00.190 --> 0:08:05.070 by modeling some of the channels here? 0:08:06.726 --> 0:08:15.797 There you know to see a bit the surface of the source message with English. 0:08:15.797 --> 0:08:22.361 It went through some channel and received the message. 0:08:22.682 --> 0:08:31.381 If you're not looking at machine translation, your source language is English. 0:08:31.671 --> 0:08:44.388 Here you see now a bit of this where the confusion starts while English as a target language is 0:08:44.388 --> 0:08:47.700 also the source message. 0:08:47.927 --> 0:08:48.674 You can see. 0:08:48.674 --> 0:08:51.488 There is also a mathematics of how we model the. 0:08:52.592 --> 0:08:56.888 It's a noisy channel model from a mathematic point of view. 0:08:56.997 --> 0:09:00.245 So this is again our general formula. 
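The general formula referred to here, written out explicitly (a sketch in the notation above): we search for the target sentence with the highest conditional probability given the source sentence.

```latex
\hat{e} = \operatorname*{argmax}_{e} \; P(e \mid f)
```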
0:09:00.245 --> 0:12:35.094 We are looking for the most probable translation, that is, the translation that has the highest probability. We are not interested in the probability itself, but in the target sentence e for which this probability is highest. Therefore, we can use the definition of conditional probability and apply the Bayes rule, so this probability equals the probability of f given e, times the probability of e, divided by the probability of f. Now you see this confusion mathematically: originally we are interested in the probability of the target sentence given the source sentence, but if we are modeling things now, we are looking at the inverse direction, so the probability of f given e, that is the probability of the source sentence given the target sentence, times the probability of the target sentence, divided by the probability of the source sentence. Why are we doing this? On the one hand it is motivated by our model, by this noisy channel view of how we are modeling the process. The other interesting thing is that we can simplify: the probability in the denominator, the probability of f, we can remove. If we are searching for the best translation, this term is fixed; it doesn't change. We have an input, the source sentence, and we cannot change it, so its probability is always the same, and we can ignore it in the argmax because the denominator is exactly the same for every candidate. And then we have P of f given e times P of e. That means we are modeling the translation process on the one hand with the translation model, which models how probable the sentence f is given e, and on the other hand with the language model, which models only how probable this English sentence is, how likely it is that somebody wrote this sentence in that language. From the translation point of view, this part is about fluency. In German, for example, you should have agreement; if the agreement is not right, that is probably not said by anybody in German. Nobody would say something like "das schönstes Haus", with the wrong ending, because it is not according to the German rules. So this can be modeled by the language model, and you have the translation model, which models how things get translated between the languages. And here you see our confusion again: the translation model is now P of f given e, which is a bit counterintuitive, because it is the probability of the source sentence given the target sentence. We have to do that for the Bayes formula, but in the following slides I'll again talk about the intuitive direction.
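To summarize the derivation just walked through (a sketch; the denominator P(f) is constant for a fixed input and therefore drops out of the argmax):

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \operatorname*{argmax}_{e} \; \underbrace{P(f \mid e)}_{\text{translation model}} \cdot \underbrace{P(e)}_{\text{language model}}
```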
0:12:35.535 --> 0:12:45.414 Because yeah, that's more intuitive that you model the translation of the target sentence 0:12:45.414 --> 0:12:48.377 given the source sentence. 0:12:50.930 --> 0:12:55.668 And this is what we want to talk about today. 0:12:55.668 --> 0:13:01.023 We later talk about language models how to do that. 0:13:00.940 --> 0:13:04.493 And maybe also how to combine them. 0:13:04.493 --> 0:13:13.080 But the focus on today would be how can we model this probability to how to generate a 0:13:13.080 --> 0:13:16.535 translation from source to target? 0:13:19.960 --> 0:13:24.263 How can we do that and the easiest thing? 0:13:24.263 --> 0:13:33.588 Maybe if you think about statistics, you count how many examples you have, how many target 0:13:33.588 --> 0:13:39.121 sentences go occur, and that gives you an estimation. 0:13:40.160 --> 0:13:51.632 However, like in another model that is not possible because most sentences you will never 0:13:51.632 --> 0:13:52.780 see, so. 0:13:53.333 --> 0:14:06.924 So what we have to do is break up the translation process into smaller models and model each 0:14:06.924 --> 0:14:09.555 of the decisions. 0:14:09.970 --> 0:14:26.300 So this simple solution with how you throw a dice is like you have a and that gives you 0:14:26.300 --> 0:14:29.454 the probability. 0:14:29.449 --> 0:14:40.439 But here's the principle because each event is so rare that most of them never have helped. 0:14:43.063 --> 0:14:48.164 Although it might be that in all your training data you have never seen this title of set. 0:14:49.589 --> 0:14:52.388 How can we do that? 0:14:52.388 --> 0:15:04.845 We look in statistical machine translation into two different models, a generative model 0:15:04.845 --> 0:15:05.825 where. 0:15:06.166 --> 0:15:11.736 So the idea was to really model model like each individual translation between words. 0:15:12.052 --> 0:15:22.598 So you break down the translation of a full sentence into the translation of each individual's 0:15:22.598 --> 0:15:23.264 word. 0:15:23.264 --> 0:15:31.922 So you say if you have the black cat, if you translate it, the full sentence. 0:15:32.932 --> 0:15:38.797 Of course, this has some challenges, any ideas where this type of model could be very challenging. 0:15:40.240 --> 0:15:47.396 Vocabularies and videos: Yes, we're going to be able to play in the very color. 0:15:47.867 --> 0:15:51.592 Yes, but you could at least use a bit of the context around it. 0:15:51.592 --> 0:15:55.491 It will not only depend on the word, but it's already challenging. 0:15:55.491 --> 0:15:59.157 You make things very hard, so that's definitely one challenge. 0:16:00.500 --> 0:16:07.085 One other, what did you talk about that we just don't want to say? 0:16:08.348 --> 0:16:11.483 Yes, they are challenging. 0:16:11.483 --> 0:16:21.817 You have to do something like words, but the problem is that you might introduce errors. 0:16:21.841 --> 0:16:23.298 Later and makes things very comfortable. 0:16:25.265 --> 0:16:28.153 Wrong splitting is the worst things that are very complicated. 0:16:32.032 --> 0:16:35.580 Saints, for example, and also maybe Japanese medicine. 0:16:35.735 --> 0:16:41.203 In German, yes, especially like these are all right. 0:16:41.203 --> 0:16:46.981 The first thing is maybe the one which is most obvious. 0:16:46.981 --> 0:16:49.972 It is raining cats and dogs. 0:16:51.631 --> 0:17:01.837 To German, the cat doesn't translate this whole chunk into something because there is 0:17:01.837 --> 0:17:03.261 not really. 
0:17:03.403 --> 0:17:08.610 Mean, of course, in generally there is this type of alignment, so there is a correspondence 0:17:08.610 --> 0:17:11.439 between words in English and the words in German. 0:17:11.439 --> 0:17:16.363 However, that's not true for all sentences, so in some sentences you cannot really say 0:17:16.363 --> 0:17:18.174 this word translates into that. 0:17:18.498 --> 0:17:21.583 But you can only let more locate this whole phrase. 0:17:21.583 --> 0:17:23.482 This model into something else. 0:17:23.563 --> 0:17:30.970 If you think about the don't in English, the do is not really clearly where should that 0:17:30.970 --> 0:17:31.895 be allied. 0:17:32.712 --> 0:17:39.079 Then for a long time the most successful approach was this phrase based translation model where 0:17:39.079 --> 0:17:45.511 the idea is your block is not a single word but a longer phrase if you try to build translations 0:17:45.511 --> 0:17:46.572 based on these. 0:17:48.768 --> 0:17:54.105 But let's start with a word based and what you need. 0:17:54.105 --> 0:18:03.470 There is two main knowledge sources, so on the one hand we have a lexicon where we translate 0:18:03.470 --> 0:18:05.786 possible translations. 0:18:06.166 --> 0:18:16.084 The main difference between the lexicon and statistical machine translation and lexicon 0:18:16.084 --> 0:18:17.550 as you know. 0:18:17.837 --> 0:18:23.590 Traditional lexicon: You know how word is translated and mainly it's giving you two or 0:18:23.590 --> 0:18:26.367 three examples with any example sentence. 0:18:26.367 --> 0:18:30.136 So in this context it gets translated like that henceon. 0:18:30.570 --> 0:18:38.822 In order to model that and work with probabilities what we need in a machine translation is these: 0:18:39.099 --> 0:18:47.962 So if we have the German word bargain, it sends me out with a probability of zero point five. 0:18:47.962 --> 0:18:51.545 Maybe it's translated into a vehicle. 0:18:52.792 --> 0:18:58.876 And of course this is not easy to be created by a shoveman. 0:18:58.876 --> 0:19:07.960 If ask you and give probabilities for how probable this vehicle is, there might: So how 0:19:07.960 --> 0:19:12.848 we are doing is again that the lexicon is automatically will be created from a corpus. 0:19:13.333 --> 0:19:18.754 And we're just counting here, so we count how often does it work, how often does it co 0:19:18.754 --> 0:19:24.425 occur with vehicle, and then we're taking the ratio and saying in the house of time on the 0:19:24.425 --> 0:19:26.481 English side there was vehicles. 0:19:26.481 --> 0:19:31.840 There was a probability of vehicles given back, and there's something like zero point 0:19:31.840 --> 0:19:32.214 five. 0:19:33.793 --> 0:19:46.669 That we need another concept, and that is this concept of alignment, and now you can 0:19:46.669 --> 0:19:47.578 have. 0:19:47.667 --> 0:19:53.113 Since this is quite complicated, the alignment in general can be complex. 0:19:53.113 --> 0:19:55.689 It can be that it's not only like. 0:19:55.895 --> 0:20:04.283 It can be that two words of a surrender target sign and it's also imbiguous. 0:20:04.283 --> 0:20:13.761 It can be that you say all these two words only are aligned together and our words are 0:20:13.761 --> 0:20:15.504 aligned or not. 0:20:15.875 --> 0:20:21.581 Is should the do be aligned to the knot in German? 0:20:21.581 --> 0:20:29.301 It's only there because in German it's not, so it should be aligned. 
0:20:30.510 --> 0:20:39.736 However, typically it's formalized and it's formalized by a function from the target language. 0:20:40.180 --> 0:20:44.051 And that is to make these models get easier and clearer. 0:20:44.304 --> 0:20:49.860 That means what means does it mean that you have a fence that means that each. 0:20:49.809 --> 0:20:58.700 A sewer's word gives target word and the alliance to only one source word because the function 0:20:58.700 --> 0:21:00.384 is also directly. 0:21:00.384 --> 0:21:05.999 However, a source word can be hit or like by signal target. 0:21:06.286 --> 0:21:11.332 So you are allowing for one to many alignments, but not for many to one alignment. 0:21:11.831 --> 0:21:17.848 That is a bit of a challenge because you assume a lightning should be symmetrical. 0:21:17.848 --> 0:21:24.372 So if you look at a parallel sentence, it should not matter if you look at it from German 0:21:24.372 --> 0:21:26.764 to English or English to German. 0:21:26.764 --> 0:21:34.352 So however, it makes these models: Yea possible and we'll like to see yea for the phrase bass 0:21:34.352 --> 0:21:36.545 until we need these alignments. 0:21:36.836 --> 0:21:41.423 So this alignment was the most important of the world based models. 0:21:41.423 --> 0:21:47.763 For the next twenty years you need the world based models to generate this type of alignment, 0:21:47.763 --> 0:21:50.798 which is then the first step for the phrase. 0:21:51.931 --> 0:21:59.642 Approach, and there you can then combine them again like both directions into one we'll see. 0:22:00.280 --> 0:22:06.850 This alignment is very important and allows us to do this type of separation. 0:22:08.308 --> 0:22:15.786 And yet the most commonly used word based models are these models referred to as IBM 0:22:15.786 --> 0:22:25.422 models, and there is a sequence of them with great names: And they were like yeah very commonly 0:22:25.422 --> 0:22:26.050 used. 0:22:26.246 --> 0:22:31.719 We'll mainly focus on the simple one here and look how this works and then not do all 0:22:31.719 --> 0:22:34.138 the details about the further models. 0:22:34.138 --> 0:22:38.084 The interesting thing is also that all of them are important. 0:22:38.084 --> 0:22:43.366 So if you want to train this alignment what you normally do is train an IVM model. 0:22:43.743 --> 0:22:50.940 Then you take that as your initialization to then train the IBM model too and so on. 0:22:50.940 --> 0:22:53.734 The motivation for that is yeah. 0:22:53.734 --> 0:23:00.462 The first model gives you: Is so simple that you can even find a global optimum, so it gives 0:23:00.462 --> 0:23:06.403 you a good starting point for the next one where the optimization in finding the right 0:23:06.403 --> 0:23:12.344 model is more difficult and therefore like the defore technique was to make your model 0:23:12.344 --> 0:23:13.641 step by step more. 0:23:15.195 --> 0:23:27.333 In these models we are breaking down the probability into smaller steps and then we can define: 0:23:27.367 --> 0:23:38.981 You see it's not a bit different, so it's not the curability and one specific alignment given. 0:23:39.299 --> 0:23:42.729 We'll let us learn how we can then go from one alignment to the full set. 0:23:43.203 --> 0:23:52.889 The probability of target sentences and one alignment between the source and target sentences 0:23:52.889 --> 0:23:56.599 alignment is this type of function. 0:23:57.057 --> 0:24:14.347 That every word is aligned in order to ensure that every word is aligned. 
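Written out, the model being built up here is the IBM Model 1 form of P(e, a | f), in the notation assumed above, with a the alignment function just introduced and f_0 standing for the empty (NULL) source word; epsilon is the normalization constant mentioned next:

```latex
a : \{1, \dots, J\} \rightarrow \{0, 1, \dots, I\},
\qquad
P(e, a \mid f) = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t\big(e_j \mid f_{a(j)}\big)
```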
0:24:15.835 --> 0:28:31.371 So first of all there is this epsilon; the epsilon is just a normalization factor so that everything sums up to a proper probability. Then you divide by the length of the source sentence plus one, to the power of the length of the target sentence. And this part is the probability of the alignment. So, is this alignment probable or not? Of course you can have some intuition: if there is a lot of crossing, it may not be a good alignment; if all of the words align to the same source word, it might not be a good alignment. But generally it is difficult to really describe what a good alignment is. So what do we say for the first model, the most simple thing? What can be the most simple thing if you think about giving a probability to some event? Yes, exactly: just take the uniform distribution. If we don't really know, the easiest way of modeling is to say all alignments are equally probable. Of course that is not true, but it gives you a good starting point. And so this term is just one over the number of all possible alignments for this sentence pair. So how many alignments are possible? The first target word can be aligned to any of the source words or to the empty word, the second one can also be aligned to all source words, the third one as well, and so on. That gives you this number of possible alignments. The second part is to model the probability of the translation itself. There we take the product over all target positions, and we are making a very strong independence assumption, because in these models we assume the translation probability of one word is independent of all the others. So how you translate 'visit' is independent of all the other parts of the sentence. That is a very strong and not very good assumption; we know that it is wrong, because how you translate a word depends on its context. However, it is a first easy solution and again a good starting point. So what you do is take the product over all target words of the translation probability of this target word given the source word it is aligned to, and because of the alignment function we know that there is always exactly one source word aligned to it. So, for example, the probability of 'visit' given the source word it is aligned to, and so on for the other words. Formally, the probability is then epsilon divided by the source length plus one to the power of the target length, times this product of word translation probabilities. And then there is a small error in the last line on the slide: the two are switched, so it should be the other way around. Then you have your translation model. Let's assume your model is already trained, so applying it is only assigning probabilities: then you can compute the probability of generating 'I visit a friend' given that you have the source sentence 'ich besuche einen Freund'.
0:28:32.012 --> 0:28:34.498 Time stand to the power of minus five. 0:28:35.155 --> 0:28:36.098 So this is your model. 0:28:36.098 --> 0:28:37.738 This is how you're applying your model. 0:28:39.479 --> 0:28:44.220 As you said, it's the most simple bottle you assume that all word translations are. 0:28:44.204 --> 0:28:46.540 Independent of each other. 0:28:46.540 --> 0:28:54.069 You assume that all alignments are equally important, and then the only thing you need 0:28:54.069 --> 0:29:00.126 for this type of model is to have this lexicon in order to calculate. 0:29:00.940 --> 0:29:04.560 And that is, of course, now the training process. 0:29:04.560 --> 0:29:08.180 The question is how do we get this type of lexic? 0:29:09.609 --> 0:29:15.461 But before we look into the training, do you have any questions about the model itself? 0:29:21.101 --> 0:29:26.816 The problem in training is that we have incomplete data. 0:29:26.816 --> 0:29:32.432 So if you want to count, I mean said you want to count. 0:29:33.073 --> 0:29:39.348 However, if you don't have the alignment, on the other hand, if you would have a lexicon 0:29:39.348 --> 0:29:44.495 you could maybe generate the alignment, which is the most probable word. 0:29:45.225 --> 0:29:55.667 And this is the very common problem that you have this type of incomplete data where you 0:29:55.667 --> 0:29:59.656 have not one type of information. 0:30:00.120 --> 0:30:08.767 And you can model this by considering the alignment as your hidden variable and then 0:30:08.767 --> 0:30:17.619 you can use the expectation maximization algorithm in order to generate the alignment. 0:30:17.577 --> 0:30:26.801 So the nice thing is that you only need your parallel data, which is aligned on sentence 0:30:26.801 --> 0:30:29.392 level, but you normally. 0:30:29.389 --> 0:30:33.720 Is just a lot of work we saw last time. 0:30:33.720 --> 0:30:39.567 Typically what you have is this type of corpus where. 0:30:41.561 --> 0:30:50.364 And yeah, the ERM algorithm sounds very fancy. 0:30:50.364 --> 0:30:58.605 However, again look at a little high level. 0:30:58.838 --> 0:31:05.841 So you're initializing a model by uniform distribution. 0:31:05.841 --> 0:31:14.719 You're just saying if have lexicon, if all words are equally possible. 0:31:15.215 --> 0:31:23.872 And then you apply your model to the data, and that is your expectation step. 0:31:23.872 --> 0:31:30.421 So given this initial lexicon, we are now calculating the. 0:31:30.951 --> 0:31:36.043 So we can now take all our parallel sentences, and of course ought to check what is the most 0:31:36.043 --> 0:31:36.591 probable. 0:31:38.338 --> 0:31:49.851 And then, of course, at the beginning maybe houses most often in line. 0:31:50.350 --> 0:31:58.105 Once we have done this expectation step, we can next do the maximization step and based 0:31:58.105 --> 0:32:06.036 on this guest alignment, which we have, we can now learn better translation probabilities 0:32:06.036 --> 0:32:09.297 by just counting how often do words. 0:32:09.829 --> 0:32:22.289 And then it's rated these steps: We can make this whole process even more stable, only taking 0:32:22.289 --> 0:32:26.366 the most probable alignment. 0:32:26.346 --> 0:32:36.839 Second step, but in contrast we calculate for all possible alignments the alignment probability 0:32:36.839 --> 0:32:40.009 and weigh the correcurrence. 0:32:40.000 --> 0:32:41.593 Then Things Are Most. 
0:32:42.942 --> 0:37:42.577 Why could that be very challenging if we do it naively and really calculate the probabilities for all alignments? How many alignments are there for a sentence? Yes, we just saw that in the formula, if you remember: it is exponential in the length of the target sentence, so calculating all of them explicitly would be very inefficient and not really possible. The nice thing is that we can again use some type of dynamic programming, so we can do this without really enumerating all alignments. We have the next five slides or so with the most equations in the whole lecture, so don't worry, it stays manageable. So we said we first have the expectation step, where it is about calculating the alignment probability. And we can do this with our initial definition, because this formula can be rewritten: we can define the probability of an alignment given the sentence pair as the probability of the target sentence and the alignment, divided by the probability of the target sentence. This is just the normal definition of a conditional probability. And what we then need to be able to calculate is P of e given f, and P of e given f is still quite simple: the probability of the target sentence given the source sentence is quite intuitive. So let's look at how to calculate this probability. In here we can put in our original formula: we sum over all possible alignments of the first word, and so on, up to the sum over all possible alignments of the last word. And inside we have the alignment probability and this product of translation probabilities. Now the first factor is independent of the alignment, so we can pull it to the front. And now this is where dynamic programming comes in: we can change the order and thereby make things a lot easier. We can reformulate it just as a product over all target positions, and inside it is then a sum over all source positions. Maybe the intuition why this is equal is a lot easier if you look at it graphically. So what we have here is a table: we have the target positions and the source positions, and we have to sum up over all possible paths through that table. The nice thing is that within each of these paths the probabilities are independent of each other. So in order to get the sum over all paths through this table you can use dynamic programming and say: this sum is exactly the same as the sum of this column, times the sum of this column, and times the sum of this column. That is the same as if you go through all possible paths and always multiply the elements along the path.
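In formulas, the rearrangement just described is the following (a sketch; swapping the sum over alignments and the product over target positions is exactly the dynamic-programming trick):

```latex
P(e \mid f) = \sum_{a} P(e, a \mid f)
            = \frac{\epsilon}{(I+1)^{J}} \sum_{a(1)=0}^{I} \cdots \sum_{a(J)=0}^{I} \; \prod_{j=1}^{J} t\big(e_j \mid f_{a(j)}\big)
            = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} \sum_{i=0}^{I} t\big(e_j \mid f_i\big)
```

Dividing P(e, a | f) by this marginal then gives the alignment posterior that the next step uses:

```latex
P(a \mid e, f) = \frac{P(e, a \mid f)}{P(e \mid f)}
               = \prod_{j=1}^{J} \frac{t\big(e_j \mid f_{a(j)}\big)}{\sum_{i=0}^{I} t\big(e_j \mid f_i\big)}
```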
0:37:43.923 --> 0:42:17.983 And that is a simplification, because now we only have a quadratic number of terms and we don't have to go over all alignments explicitly. Similar, I guess, to something you may have seen before: the same type of algorithm is used elsewhere; where? Yes, exactly, that is the same idea. But I think graphically it is easy to see why this works, even if you don't go through the exact math. Now we put both together: if you really want to, you can take these two formulas, put them into each other, some terms cancel, and then you get your final formula. And that formula now really makes sense intuitively again: the probability of an alignment is the product over all target positions of the probability of translating the source word it is aligned to into this target word, divided by the sum of the translation probabilities over all the other source words in the sentence. If you look at this again, it makes real sense: you are looking at how probable this translation is compared to all the other words you could have aligned to, and that gives you the alignment probability. So it is not only mathematically correct; it is also intuitive. If you ask how good it is to align, say, 'Zoo' to 'visit', it should depend on how good this translation probability is compared to the translation probabilities of the other words in the sentence, and on how probable it is to align to those instead. So that is the expectation step, and the next thing is the maximization step; we now have the probability of an alignment. Intuitively, that means: how often are words aligned to each other given these alignment probabilities, or in a more formal definition, what is the expected count that they are aligned to each other? So if there are a lot of alignments with high probability in which they are aligned to each other, this count gets large. So the count of e given f, given our parallel data, is a sum over all possible alignments, and you don't just count with absolute numbers, but you always count weighted by the alignment probability. And to turn that into a translation probability you of course have to normalize it by the total counts. And that is then the whole model. It may look a bit mathematically complex now, but the whole training process is described here: you really just have to collect these counts and later normalize them. So, repeating that until convergence: as we said, the EM iteration is done again and again. You initialize everything uniformly, then you go over all sentence pairs and all words and calculate the translation posteriors. And then you go once again over.
0:42:17.983 --> 0:42:22.522 It counted this count, count given, and totally e-given. 0:42:22.702 --> 0:42:35.316 Initially how probable is the E translated to something else, and you normalize your translation 0:42:35.316 --> 0:42:37.267 probabilities. 0:42:38.538 --> 0:42:45.761 So this is an old training process for this type of. 0:42:46.166 --> 0:43:00.575 How that then works is shown here a bit, so we have a very simple corpus. 0:43:01.221 --> 0:43:12.522 And as we said, you initialize your translation with yes or possible translations, so dusk 0:43:12.522 --> 0:43:16.620 can be aligned to the bookhouse. 0:43:16.997 --> 0:43:25.867 And the other ones are missing because only a curse with and book, and then the others 0:43:25.867 --> 0:43:26.988 will soon. 0:43:27.127 --> 0:43:34.316 In the initial way your vocabulary is for works, so the initial probabilities are all: 0:43:34.794 --> 0:43:50.947 And then if you iterate you see that the things which occur often and then get alignments get 0:43:50.947 --> 0:43:53.525 more and more. 0:43:55.615 --> 0:44:01.506 In reality, of course, you won't get like zero alignments, but you would normally get 0:44:01.506 --> 0:44:02.671 there sometimes. 0:44:03.203 --> 0:44:05.534 But as the probability increases. 0:44:05.785 --> 0:44:17.181 The training process is also guaranteed that the probability of your training data is always 0:44:17.181 --> 0:44:20.122 increased in iteration. 0:44:21.421 --> 0:44:27.958 You see that the model tries to model your training data and give you at least good models. 0:44:30.130 --> 0:44:37.765 Okay, are there any more questions to the training of these type of word-based models? 0:44:38.838 --> 0:44:54.790 Initially there is like forwards in the source site, so it's just one force to do equal distribution. 0:44:55.215 --> 0:45:01.888 So each target word, the probability of the target word, is at four target words, so the 0:45:01.888 --> 0:45:03.538 uniform distribution. 0:45:07.807 --> 0:45:14.430 However, there is problems with this initial order and we have this already mentioned at 0:45:14.430 --> 0:45:15.547 the beginning. 0:45:15.547 --> 0:45:21.872 There is for example things that yeah you want to allow for reordering but there are 0:45:21.872 --> 0:45:27.081 definitely some alignments which should be more probable than others. 0:45:27.347 --> 0:45:42.333 So a friend visit should have a lower probability than visit a friend. 0:45:42.302 --> 0:45:50.233 It's not always monitoring, there is some reordering happening, but if you just mix it 0:45:50.233 --> 0:45:51.782 crazy, it's not. 0:45:52.252 --> 0:46:11.014 You have slings like one too many alignments and they are not really models. 0:46:11.491 --> 0:46:17.066 But it shouldn't be that you align one word to all the others, and that is, you don't want 0:46:17.066 --> 0:46:18.659 this type of probability. 0:46:19.199 --> 0:46:27.879 You don't want to align to null, so there's nothing about that and how to deal with other 0:46:27.879 --> 0:46:30.386 words on the source side. 0:46:32.272 --> 0:46:45.074 And therefore this was only like the initial model in there. 0:46:45.325 --> 0:46:47.639 Models, which we saw. 0:46:47.639 --> 0:46:57.001 They only model the translation probability, so how probable is it to translate one word 0:46:57.001 --> 0:46:58.263 to another? 0:46:58.678 --> 0:47:05.915 What you could then add is the absolute position. 0:47:05.915 --> 0:47:16.481 Yeah, the second word should more probable align to the second position. 
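Going back one step, the whole Model 1 training loop just described (uniform initialization, an expectation step that collects fractional counts weighted by the word-level alignment posterior, and a maximization step that renormalizes them) can be sketched roughly as follows. This is a minimal illustration under the assumptions above, not the exact implementation or corpus from the lecture:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """Minimal IBM Model 1 EM sketch.

    corpus: list of (source_words, target_words) sentence pairs,
            e.g. [(["das", "Haus"], ["the", "house"]), ...].
    Returns t[(e_word, f_word)] = t(e | f); None stands for the NULL word.
    """
    NULL = None
    src_vocab = {w for f, _ in corpus for w in f} | {NULL}
    tgt_vocab = {w for _, e in corpus for w in e}

    # Initialization: uniform translation probabilities t(e | f).
    t = {(e, f): 1.0 / len(tgt_vocab) for f in src_vocab for e in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)   # expected count(e, f)
        total = defaultdict(float)   # expected total(f)

        # Expectation step: fractional counts weighted by the word-level
        # alignment posterior t(e | f_i) / sum_i' t(e | f_i').
        for f_sent, e_sent in corpus:
            f_words = [NULL] + f_sent
            for e in e_sent:
                denom = sum(t[(e, f)] for f in f_words)
                for f in f_words:
                    frac = t[(e, f)] / denom
                    count[(e, f)] += frac
                    total[f] += frac

        # Maximization step: re-estimate t(e | f) by normalizing the counts.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}

    return t
```

On a tiny toy corpus such as [(["das", "Haus"], ["the", "house"]), (["das", "Buch"], ["the", "book"])] (a made-up example), a few iterations are enough to see t("house" | "Haus") and t("book" | "Buch") grow at the expense of the alternatives, which is exactly the behaviour described above. Back to the shortcomings of this simple model and what the higher IBM models add: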
0:47:17.557 --> 0:47:22.767 We add a fertility model that means one word is mostly translated into one word. 0:47:23.523 --> 0:47:29.257 For example, we saw it there that should be translated into two words, but most words should 0:47:29.257 --> 0:47:32.463 be one to one, and it's even modeled for each word. 0:47:32.463 --> 0:47:37.889 So for each source word, how probable is it that it is translated to one, two, three or 0:47:37.889 --> 0:47:38.259 more? 0:47:40.620 --> 0:47:50.291 Then either one of four acts relative positions, so it's asks: Maybe instead of modeling, how 0:47:50.291 --> 0:47:55.433 probable is it that you translate from position five to position twenty five? 0:47:55.433 --> 0:48:01.367 It's not a very good way, but in a relative position instead of what you try to model it. 0:48:01.321 --> 0:48:06.472 How probable is that you are jumping Swiss steps forward or Swiss steps back? 0:48:07.287 --> 0:48:15.285 However, this makes sense more complex because what is a jump forward and a jump backward 0:48:15.285 --> 0:48:16.885 is not that easy. 0:48:18.318 --> 0:48:30.423 You want to have a model that describes reality, so every sentence that is not possible should 0:48:30.423 --> 0:48:37.304 have the probability zero because that cannot happen. 0:48:37.837 --> 0:48:48.037 However, with this type of IBM model four this has a positive probability, so it makes 0:48:48.037 --> 0:48:54.251 a sentence more complex and you can easily check it. 0:48:57.457 --> 0:49:09.547 So these models were the first models which tried to directly model and where they are 0:49:09.547 --> 0:49:14.132 the first to do the translation. 0:49:14.414 --> 0:49:19.605 So in all of these models, the probability of a word translating into another word is 0:49:19.605 --> 0:49:25.339 always independent of all the other translations, and that is a challenge because we know that 0:49:25.339 --> 0:49:26.486 this is not right. 0:49:26.967 --> 0:49:32.342 And therefore we will come now to then the phrase-based translation models. 0:49:35.215 --> 0:49:42.057 However, this word alignment is the very important concept which was used in phrase based. 0:49:42.162 --> 0:49:50.559 Even when people use phrase based, they first would always train a word based model not to 0:49:50.559 --> 0:49:56.188 get the really model but only to get this type of alignment. 0:49:57.497 --> 0:50:01.343 What was the main idea of a phrase based machine translation? 0:50:03.223 --> 0:50:08.898 It's not only that things got mathematically a lot more simple here because you don't try 0:50:08.898 --> 0:50:13.628 to express the whole translation process, but it's a discriminative model. 0:50:13.628 --> 0:50:19.871 So what you only try to model is this translation probability or is this translation more probable 0:50:19.871 --> 0:50:20.943 than some other. 0:50:24.664 --> 0:50:28.542 The main idea is that the basic units are are the phrases. 0:50:28.542 --> 0:50:31.500 That's why it's called phrase phrase phrase. 0:50:31.500 --> 0:50:35.444 You have to be aware that these are not linguistic phrases. 0:50:35.444 --> 0:50:39.124 I guess you have some intuition about what is a phrase. 0:50:39.399 --> 0:50:45.547 You would express as a phrase. 0:50:45.547 --> 0:50:58.836 However, you wouldn't say that is a very good phrase because it's. 0:50:59.339 --> 0:51:06.529 However, in this machine learning-based motivated thing, phrases are just indicative. 0:51:07.127 --> 0:51:08.832 So it can be any split. 
0:51:08.832 --> 0:51:12.455 We don't consider linguistically motivated or not. 0:51:12.455 --> 0:51:15.226 It can be any sequence of consecutive. 0:51:15.335 --> 0:51:16.842 That's the Only Important Thing. 0:51:16.977 --> 0:51:25.955 The phrase is always a thing of consecutive words, and the motivation behind that is getting 0:51:25.955 --> 0:51:27.403 computational. 0:51:27.387 --> 0:51:35.912 People have looked into how you can also discontinuous phrases, which might be very helpful if you 0:51:35.912 --> 0:51:38.237 think about German harbor. 0:51:38.237 --> 0:51:40.046 Has this one phrase? 0:51:40.000 --> 0:51:47.068 There's two phrases, although there's many things in between, but in order to make things 0:51:47.068 --> 0:51:52.330 still possible and runner will, it's always like consecutive work. 0:51:53.313 --> 0:52:05.450 The nice thing is that on the one hand you don't need this word to word correspondence 0:52:05.450 --> 0:52:06.706 anymore. 0:52:06.906 --> 0:52:17.088 You now need to invent some type of alignment that in this case doesn't really make sense. 0:52:17.417 --> 0:52:21.710 So you can just learn okay, you have this phrase and this phrase and their translation. 0:52:22.862 --> 0:52:25.989 Secondly, we can add a bit of context into that. 0:52:26.946 --> 0:52:43.782 You're saying, for example, of Ultimate Customs and of My Shift. 0:52:44.404 --> 0:52:51.443 And this was difficult to model and work based models because they always model the translation. 0:52:52.232 --> 0:52:57.877 Here you can have phrases where you have more context and just jointly translate the phrases, 0:52:57.877 --> 0:53:03.703 and if you then have seen all by the question as a phrase you can directly use that to generate. 0:53:08.468 --> 0:53:19.781 Okay, before we go into how to do that, then we start, so the start is when we start with 0:53:19.781 --> 0:53:21.667 the alignment. 0:53:22.022 --> 0:53:35.846 So that is what we get from the work based model and we are assuming to get the. 0:53:36.356 --> 0:53:40.786 So that is your starting point. 0:53:40.786 --> 0:53:47.846 You have a certain sentence and one most probable. 0:53:48.989 --> 0:54:11.419 The challenge you now have is that these alignments are: On the one hand, a source word like hit 0:54:11.419 --> 0:54:19.977 several times with one source word can be aligned to several: So in this case you see that for 0:54:19.977 --> 0:54:29.594 example Bisher is aligned to three words, so this can be the alignment from English to German, 0:54:29.594 --> 0:54:32.833 but it cannot be the alignment. 0:54:33.273 --> 0:54:41.024 In order to address for this inconsistency and being able to do that, what you typically 0:54:41.024 --> 0:54:49.221 then do is: If you have this inconsistency and you get different things in both directions,. 0:54:54.774 --> 0:55:01.418 In machine translation to do that you just do it in both directions and somehow combine 0:55:01.418 --> 0:55:08.363 them because both will do arrows and the hope is yeah if you know both things you minimize. 0:55:08.648 --> 0:55:20.060 So you would also do it in the other direction and get a different type of lineup, for example 0:55:20.060 --> 0:55:22.822 that you now have saw. 0:55:23.323 --> 0:55:37.135 So in this way you are having two alignments and the question is now how do get one alignment 0:55:37.135 --> 0:55:38.605 and what? 0:55:38.638 --> 0:55:45.828 There were a lot of different types of heuristics. 
0:55:45.828 --> 0:55:55.556 They normally start with intersection because you should trust them. 0:55:55.996 --> 0:55:59.661 And your maximum will could take this, the union thought,. 0:55:59.980 --> 0:56:04.679 If one of the systems says they are not aligned then maybe you should not align them. 0:56:05.986 --> 0:56:12.240 The only question they are different is what should I do about things where they don't agree? 0:56:12.240 --> 0:56:18.096 So where only one of them enlines and then you have heuristics depending on other words 0:56:18.096 --> 0:56:22.288 around it, you can decide should I align them or should I not. 0:56:24.804 --> 0:56:34.728 So that is your first step and then the second step in your model. 0:56:34.728 --> 0:56:41.689 So now you have one alignment for the process. 0:56:42.042 --> 0:56:47.918 And the idea is that we will now extract all phrase pairs to combinations of source and 0:56:47.918 --> 0:56:51.858 target phrases where they are consistent within alignment. 0:56:52.152 --> 0:56:57.980 The idea is a consistence with an alignment that should be a good example and that we can 0:56:57.980 --> 0:56:58.563 extract. 0:56:59.459 --> 0:57:14.533 And there are three conditions where we say an alignment has to be consistent. 0:57:14.533 --> 0:57:17.968 The first one is. 0:57:18.318 --> 0:57:24.774 So if you add bisher, then it's in your phrase. 0:57:24.774 --> 0:57:32.306 All the three words up till and now should be in there. 0:57:32.492 --> 0:57:42.328 So Bisheret Till would not be a valid phrase pair in this case, but for example Bisheret 0:57:42.328 --> 0:57:43.433 Till now. 0:57:45.525 --> 0:58:04.090 Does anybody now have already an idea about the second rule that should be there? 0:58:05.325 --> 0:58:10.529 Yes, that is exactly the other thing. 0:58:10.529 --> 0:58:22.642 If a target verse is in the phrase pair, there are also: Then there is one very obvious one. 0:58:22.642 --> 0:58:28.401 If you strike a phrase pair, at least one word in the phrase. 0:58:29.069 --> 0:58:32.686 And this is a knife with working. 0:58:32.686 --> 0:58:40.026 However, in reality a captain will select some part of the sentence. 0:58:40.380 --> 0:58:47.416 You can take any possible combination of sewers and target words for this part, and that of 0:58:47.416 --> 0:58:54.222 course is not very helpful because you just have no idea, and therefore it says at least 0:58:54.222 --> 0:58:58.735 one sewer should be aligned to one target word to prevent. 0:58:59.399 --> 0:59:09.615 But still, it means that if you have normally analyzed words, the more analyzed words you 0:59:09.615 --> 0:59:10.183 can. 0:59:10.630 --> 0:59:13.088 That's not true for the very extreme case. 0:59:13.088 --> 0:59:17.603 If no word is a line you can extract nothing because you can never fulfill it. 0:59:17.603 --> 0:59:23.376 However, if only for example one word is aligned then you can align a lot of different possibilities 0:59:23.376 --> 0:59:28.977 because you can start with this word and then add source words or target words or any combination 0:59:28.977 --> 0:59:29.606 of source. 0:59:30.410 --> 0:59:37.585 So there was typically a problem that if you have too few works in light you can really 0:59:37.585 --> 0:59:38.319 extract. 0:59:38.558 --> 0:59:45.787 If you think about this already here you can extract very, very many phrase pairs from: 0:59:45.845 --> 0:59:55.476 So what you can extract is, for example, what we saw up and so on. 0:59:55.476 --> 1:00:00.363 So all of them will be extracted. 
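The extraction rule just described (every source word that is aligned to a target word inside the phrase must itself be inside, and vice versa, and at least one word inside must be aligned) can be sketched as follows; a rough, brute-force illustration, not the actual extraction tool used in practice:

```python
def extract_phrases(f_words, e_words, alignment):
    """Extract all phrase pairs consistent with a symmetrized word alignment.

    alignment: set of (i, j) links, with i indexing f_words (source)
    and j indexing e_words (target).
    """
    phrases = []
    for f_start in range(len(f_words)):
        for f_end in range(f_start, len(f_words)):
            for e_start in range(len(e_words)):
                for e_end in range(e_start, len(e_words)):
                    inside = [(i, j) for (i, j) in alignment
                              if f_start <= i <= f_end and e_start <= j <= e_end]
                    # Rule 3: at least one word pair inside the box is aligned.
                    if not inside:
                        continue
                    # Rules 1 and 2: no alignment link may leave the box, i.e.
                    # connect a word inside the spans to a word outside them.
                    crossing = any((f_start <= i <= f_end) != (e_start <= j <= e_end)
                                   for (i, j) in alignment)
                    if crossing:
                        continue
                    phrases.append((tuple(f_words[f_start:f_end + 1]),
                                    tuple(e_words[e_start:e_end + 1])))
    return phrases
```

Even on a single short sentence pair this enumerates a long list of overlapping phrase pairs.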
1:00:00.400 --> 1:00:08.379 In order to limit this you typically have a length limit so you can only extract phrases 1:00:08.379 --> 1:00:08.738 up. 1:00:09.049 --> 1:00:18.328 But still there these phrases where you have all these phrases extracted. 1:00:18.328 --> 1:00:22.968 You have to think about how to deal. 1:00:26.366 --> 1:00:34.966 Now we have the phrases, so the other question is what is a good phrase pair and not so good. 1:00:35.255 --> 1:00:39.933 You might be that you sometimes extract one which is explaining this sentence but is not 1:00:39.933 --> 1:00:44.769 really a good one because there is something ever in there or something special so it might 1:00:44.769 --> 1:00:47.239 not be a good phase pair in another situation. 1:00:49.629 --> 1:00:59.752 And therefore the easiest thing is again just count, and if a phrase pair occurs very often 1:00:59.752 --> 1:01:03.273 seems to be a good phrase pair. 1:01:03.743 --> 1:01:05.185 So if we have this one. 1:01:05.665 --> 1:01:09.179 And if you have the exam up till now,. 1:01:09.469 --> 1:01:20.759 Then you look how often does up till now to this hair occur? 1:01:20.759 --> 1:01:28.533 How often does up until now to this hair? 1:01:30.090 --> 1:01:36.426 So this is one way of yeah describing the quality of the phrase book. 1:01:37.257 --> 1:01:47.456 So one difference is now, and that is the advantage of these primitive models. 1:01:47.867 --> 1:01:55.442 But instead we are trying to have a lot of features describing how good a phrase parent 1:01:55.442 --> 1:01:55.786 is. 1:01:55.786 --> 1:02:04.211 One of these features is this one describing: But in this model we'll later see how to combine 1:02:04.211 --> 1:02:04.515 it. 1:02:04.515 --> 1:02:10.987 The nice thing is we can invent any other type of features and add that and normally 1:02:10.987 --> 1:02:14.870 if you have two or three metrics to describe then. 1:02:15.435 --> 1:02:18.393 And therefore the spray spray sprays. 1:02:18.393 --> 1:02:23.220 They were not only like evaluated by one type but by several. 1:02:23.763 --> 1:02:36.580 So this could, for example, have a problem because your target phrase here occurs only 1:02:36.580 --> 1:02:37.464 once. 1:02:38.398 --> 1:02:46.026 It will of course only occur with one other source trait, and that probability will be 1:02:46.026 --> 1:02:53.040 one which might not be a very good estimation because you've only seen it once. 1:02:53.533 --> 1:02:58.856 Therefore, we use additional ones to better deal with that, and the first thing is we're 1:02:58.856 --> 1:02:59.634 doing again. 1:02:59.634 --> 1:03:01.129 Yeah, we know it by now. 1:03:01.129 --> 1:03:06.692 If you look at it in the one direction, it's helpful to us to look into the other direction. 1:03:06.692 --> 1:03:11.297 So you take also the inverse probability, so you not only take in peer of E. 1:03:11.297 --> 1:03:11.477 G. 1:03:11.477 --> 1:03:11.656 M. 1:03:11.656 --> 1:03:12.972 F., but also peer of. 1:03:13.693 --> 1:03:19.933 And then in addition you say maybe for the especially prolonged phrases they occur rarely, 1:03:19.933 --> 1:03:25.898 and then you have very high probabilities, and that might not be always the right one. 1:03:25.898 --> 1:03:32.138 So maybe it's good to also look at the word based probabilities to represent how good they 1:03:32.138 --> 1:03:32.480 are. 1:03:32.692 --> 1:03:44.202 So in addition you take the work based probabilities of this phrase pair as an additional model. 
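Written out, the relative-frequency scores just described for a phrase pair are the following (a sketch; the two word-based lexical scores lex(e-phrase | f-phrase) and lex(f-phrase | e-phrase) are computed analogously from the word translation probabilities inside the phrase pair):

```latex
\phi(\bar{e} \mid \bar{f}) = \frac{\operatorname{count}(\bar{f}, \bar{e})}{\sum_{\bar{e}'} \operatorname{count}(\bar{f}, \bar{e}')},
\qquad
\phi(\bar{f} \mid \bar{e}) = \frac{\operatorname{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}'} \operatorname{count}(\bar{f}', \bar{e})}
```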
1:03:44.704 --> 1:07:29.302 So then you would have in total four different values describing how good the phrase pair is: the relative frequencies in both directions and the lexical probabilities in both directions. So four values describing how probable a phrase translation is. Then the next challenge is how we can combine these different types of probabilities into one global score saying how good the translation is. That is the next part of the model, but before we do that: are there any questions about this phrase extraction and phrase scoring? And the motivation for the combination is our initial model: if you remember, at the beginning of the lecture we had the probability written as P of f given e times P of e. Now the problem is that this is of course correct, however we have made a lot of simplifications, for example that the translation probability of a word is independent of the other translations. Therefore our estimates of P of f given e and of P of e might not be right, and then the combination might not be right either. So it can be, for example, that in the end you get a fluent but not accurate translation. And then there could be an easy way around it: if our output is fluent but not accurate, it might be that we put too much weight on the language model and too little weight on the translation model. So we can weight them, saying this one should count a bit more strongly, this one is more important than the other. And based on that we can extend this idea to the log-linear model. The log-linear model now says the translation score is just a combination of features describing how good this translation is; these are the feature functions h, which depend on e and f. In the noisy channel model one of them depends only on e, but in general they depend on both e and f. Each of these features has a weight saying how much you trust it; it is like asking a lot of people for their opinion, where you might weight some opinions more because they are a good indication, and others you would not trust that much. And you can do exactly that here too: you can add any number of features, depending on how many you want to have, and each of the features gives you a value. The nice thing is that we can normally ignore the normalization constant, because we are not interested in the probability itself. And again, if the score is not normalized, that is fine: if this value is the highest, that is the translation we take. So how can we do that in practice?
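Before the worked example, the log-linear model just described, written compactly (a sketch; Z(f) is the normalization term that can be ignored during search because it is the same for all candidate translations):

```latex
P(e \mid f) = \frac{1}{Z(f)} \exp\Big( \sum_{m=1}^{M} \lambda_m \, h_m(e, f) \Big),
\qquad
\hat{e} = \operatorname*{argmax}_{e} \; \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```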
1:07:29.302 --> 1:07:34.510 Let's start with two simple things. 1:07:34.510 --> 1:07:39.864 Then you have one translation model. 1:07:40.000 --> 1:07:43.102 Which gives you the peer of eagerness. 1:07:43.383 --> 1:07:49.203 It can be typically as a feature it would take the liberalism of this ability, so mine 1:07:49.203 --> 1:07:51.478 is nine hundred and fourty seven. 1:07:51.451 --> 1:07:57.846 And the language model which says you how clue in the English side is how you can calculate 1:07:57.846 --> 1:07:59.028 the probability. 1:07:58.979 --> 1:08:03.129 In some future lectures we'll give you all superbology. 1:08:03.129 --> 1:08:10.465 You can feature again the luck of the purbology, then you have minus seven and then give different 1:08:10.465 --> 1:08:11.725 weights to them. 1:08:12.292 --> 1:08:19.243 And that means that your probability is one divided by said to the power of this. 1:08:20.840 --> 1:08:38.853 You're not really interested in the probability, so you just calculate on the score to the exponendum. 1:08:40.000 --> 1:08:41.668 Maximal Maximal I Think. 1:08:42.122 --> 1:08:57.445 You can, for example, try different translations, calculate all their scores and take in the 1:08:57.445 --> 1:09:00.905 end the translation. 1:09:03.423 --> 1:09:04.661 Why to do that. 1:09:05.986 --> 1:09:10.698 We've done that now for two, but of course you cannot only do it with two. 1:09:10.698 --> 1:09:16.352 You can do it now with any fixed number, so of course you have to decide in the beginning 1:09:16.352 --> 1:09:21.944 I want to have ten features or something like that, but you can take all these features. 1:09:22.002 --> 1:09:29.378 And yeah, based on them, they calculate your model probability or the model score. 1:09:31.031 --> 1:09:40.849 A big advantage over the initial. 1:09:40.580 --> 1:09:45.506 A model because now we can add a lot of features and there was diamond machine translation, 1:09:45.506 --> 1:09:47.380 a statistical machine translation. 1:09:47.647 --> 1:09:57.063 So how can develop new features, new ways of evaluating them so that can hopefully better 1:09:57.063 --> 1:10:00.725 describe what is good translation? 1:10:01.001 --> 1:10:16.916 If you have a new great feature you can calculate these features and then how much better do 1:10:16.916 --> 1:10:18.969 they model? 1:10:21.741 --> 1:10:27.903 There is one challenge which haven't touched upon yet. 1:10:27.903 --> 1:10:33.505 So could you easily build your model if you have. 1:10:38.999 --> 1:10:43.016 Assumed here something which just gazed, but which might not be that easy. 1:10:49.990 --> 1:10:56.333 The weight for the translation model is and the weight for the language model is. 1:10:56.716 --> 1:11:08.030 That's a bit arbitrary, so why should you use this one and guess normally you won't be 1:11:08.030 --> 1:11:11.801 able to select that by hand? 1:11:11.992 --> 1:11:19.123 So typically we didn't have like or features in there, but features is very common. 1:11:19.779 --> 1:11:21.711 So how do you select them? 1:11:21.711 --> 1:11:24.645 There was a second part of the training. 1:11:24.645 --> 1:11:27.507 These models were trained in two steps. 1:11:27.507 --> 1:11:32.302 On the one hand, we had the training of the individual components. 1:11:32.302 --> 1:11:38.169 We saw that now how to build the phrase based system, how to extract the phrases. 1:11:38.738 --> 1:11:46.223 But then if you have these different components you need a second training to learn the optimal. 
1:11:46.926 --> 1:11:51.158 And typically this is referred to as the tuning of the system. 1:11:51.431 --> 1:12:07.030 So now, if you have different types of models describing what a good translation is, you need 1:12:07.030 --> 1:12:10.760 to find good weights for them. 1:12:12.312 --> 1:12:14.315 So how can you do that? 1:12:14.315 --> 1:12:20.871 The easiest thing is, of course, that you just try different weight settings out. 1:12:21.121 --> 1:12:27.496 For each setting you can then always select the best hypothesis. 1:12:27.496 --> 1:12:38.089 You can evaluate it with some metric: you score all your outputs, always select 1:12:38.089 --> 1:12:42.543 the best one, and then check how good this translation is. 1:12:42.983 --> 1:12:45.930 And you can do that for a lot of different possible weight combinations. 1:12:47.067 --> 1:12:59.179 However, the challenge is the complexity: even if you have only a handful of parameters, and for each of 1:12:59.179 --> 1:13:04.166 them only a handful of values you try out, the number of combinations explodes. 1:13:04.804 --> 1:13:16.895 We won't be able to try all of these possible combinations, so what we have to do is some 1:13:16.895 --> 1:13:19.313 more intelligent search. 1:13:20.540 --> 1:13:34.027 And what has been done there in machine translation is referred to as minimum error rate training. 1:13:34.534 --> 1:13:41.743 The underlying Powell-style search is a very intuitive one: you have all these different parameters, so how do you optimize them? 1:13:42.522 --> 1:13:44.358 And the idea is: okay, 1:13:44.358 --> 1:13:52.121 I start with an initial guess and then I optimize one single parameter; that is always easier. 1:13:52.121 --> 1:13:54.041 That is essentially a line search. 1:13:54.041 --> 1:13:58.882 So you are searching for the best value of that one parameter. 1:13:59.759 --> 1:14:04.130 This is often visualized with a map of San Francisco. 1:14:04.130 --> 1:14:13.786 Just imagine you want to get to the highest spot in San Francisco and you are standing somewhere 1:14:13.786 --> 1:14:14.395 here. 1:14:14.574 --> 1:14:21.220 You walk along one street until you find its highest point; then you switch the dimension, go in the other direction, and again find the highest point. 1:14:21.661 --> 1:14:33.804 Now you are on a different street, and its highest point is at a different place, so you go there, 1:14:33.804 --> 1:14:36.736 and so you can iterate. 1:14:36.977 --> 1:14:56.368 The one problem, of course, is that you may only find a local optimum; if you start from two different positions you may end up in different places. 1:14:56.536 --> 1:15:10.030 So yes, there is a heuristic in there; typically it is therefore repeated with different 1:15:10.030 --> 1:15:16.059 starting points to check whether you land in different positions. 1:15:16.516 --> 1:15:29.585 What is different, or what is the addition of minimum error rate training compared to this standard search? 1:15:29.729 --> 1:15:37.806 So, as we said, you can now evaluate different values for one parameter. 1:15:38.918 --> 1:15:42.857 And the question is: which values should you try out for that one parameter? 1:15:42.857 --> 1:15:47.281 Should you just do zero point one, zero point two, zero point three, or anything? 1:15:49.029 --> 1:16:03.880 If you change only one parameter, then you can write the score of a translation as a linear 1:16:03.880 --> 1:16:05.530 function of that parameter. 1:16:05.945 --> 1:16:17.258 So for each hypothesis there is one line, and if you change the parameter, the score of this hypothesis moves along that line. 1:16:17.397 --> 1:16:26.506 The offset comes from the features whose weights you do not change, 1:16:26.826 --> 1:16:30.100 and the feature value of the changed weight gives the steepness of the line. 1:16:30.750 --> 1:16:38.887 And now we look at different possible translations.
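In my notation (not from the slides): if only the weight lambda_k is varied and everything else is kept fixed, the score of one fixed hypothesis e is a straight line in lambda_k; the feature value h_k(e,f) is its slope, and the remaining weighted features form a constant offset. The lines compared in the next step differ exactly in this slope.

\[
\mathrm{score}_e(\lambda_k) \;=\; h_k(e,f)\,\lambda_k \;+\; \sum_{i \neq k} \lambda_i\, h_i(e,f)
\]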
1:16:38.887 --> 1:16:46.692 Therefore, how steeply they go up here differs. 1:16:47.247 --> 1:16:59.289 So in this case, if you look at which hypothesis has the best score, that only changes at the points where the lines intersect. 1:17:00.300 --> 1:17:10.642 So it is enough to check once here and once here, because between those points nothing changes. 1:17:11.111 --> 1:17:24.941 And that is the idea in minimum error rate training: you only evaluate at the points where a different hypothesis gets selected. 1:17:29.309 --> 1:17:34.378 So, yes, minimum error rate training is a Powell-style search. 1:17:34.378 --> 1:17:37.453 Then we use an intelligent step size. 1:17:37.453 --> 1:17:39.364 We do random restarts. 1:17:39.364 --> 1:17:46.428 Then things are still too slow, because we would have to decode a lot of 1:17:46.428 --> 1:17:47.009 times. 1:17:46.987 --> 1:17:54.460 So what we can do to make things even faster is that we decode once with the current parameters, 1:17:54.460 --> 1:18:01.248 but then we are not generating only the most probable translation; we are generating 1:18:01.248 --> 1:18:05.061 the hundred or so most probable translations. 1:18:06.006 --> 1:18:18.338 And then we are optimizing our weights by only looking at these hundred translations 1:18:18.338 --> 1:18:23.725 and finding the optimal values there. 1:18:24.564 --> 1:18:39.284 Of course, it might be a problem that at some point your new weights would prefer translations 1:18:39.284 --> 1:18:42.928 that are not inside your n-best list. 1:18:43.143 --> 1:18:52.357 So you have to iterate that a few times, but the important thing is that you don't have to decode 1:18:52.357 --> 1:18:56.382 every time you try new weights; you only re-rank the n-best list. 1:18:57.397 --> 1:19:11.325 This is mainly a speed-up in order to make things even faster. 1:19:15.515 --> 1:19:20.160 Good, then we'll finish with 1:19:20.440 --> 1:19:25.289 looking at how you really calculate the scores and everything, 1:19:25.289 --> 1:19:32.121 because the translation of a full sentence doesn't really consist of 1:19:32.121 --> 1:19:37.190 only one single phrase; of course you have to combine different phrase pairs. 1:19:37.637 --> 1:19:40.855 So how does that now really look, and what do we have to do? 1:19:41.361 --> 1:19:48.252 Just think again of the translation we have done before. 1:19:48.252 --> 1:19:59.708 So: what is the probability of translating this sentence into "what we saw up till 1:19:59.708 --> 1:20:00.301 now"? 1:20:00.301 --> 1:20:03.501 We are doing this by using phrase pairs. 1:20:03.883 --> 1:20:07.157 So we have the phrase pairs: 1:20:07.157 --> 1:20:12.911 "was wir" is the phrase pair for "what we", then "up till now", and "gesehen haben" goes into "saw". 1:20:13.233 --> 1:20:18.970 In addition, the order is important, because translation is not monotone. 1:20:18.970 --> 1:20:26.311 We are not putting the phrase pairs in the same order on the source and 1:20:26.311 --> 1:20:31.796 on the target side; in order to generate the correct translation 1:20:31.771 --> 1:20:34.030 we have to shuffle the phrase pairs around. 1:20:34.294 --> 1:20:39.747 The blue one is at the front on the source side but not at the front on the target side. 1:20:40.200 --> 1:20:49.709 This reordering makes statistical machine translation really complicated, because if you 1:20:49.709 --> 1:20:53.313 could just do it monotonically, things would be much simpler. 1:20:53.593 --> 1:21:05.288 The problem is that if you would consider all possible combinations of reshuffling them, then again the search space explodes.
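Coming back to the tuning step described a moment ago, here is a simplified sketch of optimizing weights on a fixed n-best list; all names and numbers are made up, and instead of the exact line search of minimum error rate training it just tries a small grid of values per weight.

```python
# Simplified tuning sketch on a fixed n-best list (illustrative only).
# nbest: one list per source sentence, each entry = (feature_vector, error),
# where error says how bad that hypothesis is under some evaluation metric.

def corpus_error(weights, nbest):
    """Re-rank every n-best list with the given weights and sum up the errors
    of the hypotheses that would be selected."""
    total = 0.0
    for hyps in nbest:
        features, error = max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += error
    return total

def coordinate_search(weights, nbest, grid=(0.1, 0.3, 0.5, 1.0, 2.0), rounds=5):
    """Optimize one weight at a time while keeping the others fixed."""
    weights = list(weights)
    for _ in range(rounds):
        for i in range(len(weights)):
            weights[i] = min(grid, key=lambda v: corpus_error(weights[:i] + [v] + weights[i + 1:], nbest))
    return weights

# Tiny made-up example: two source sentences with three hypotheses each.
nbest = [
    [([-9.5, -7.0], 0.4), ([-10.1, -6.2], 0.2), ([-8.9, -9.0], 0.7)],
    [([-4.2, -3.1], 0.1), ([-5.0, -2.5], 0.3), ([-3.9, -4.4], 0.5)],
]
print(coordinate_search([1.0, 1.0], nbest))
```

Real minimum error rate training additionally exploits that, along one weight, the selected hypothesis only changes at the intersection points of the score lines, so it does not need a fixed value grid.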
1:21:05.565 --> 1:21:11.508 So you again have to use some type of heuristic about which reorderings you allow and which you don't 1:21:11.508 --> 1:21:11.955 allow. 1:21:12.472 --> 1:21:27.889 That was relatively challenging since, for example, if you think of German you would 1:21:27.889 --> 1:21:32.371 have to allow very long-range reorderings. 1:21:33.033 --> 1:21:52.218 But if we now have this, how do we calculate the translation score? 1:21:52.432 --> 1:21:55.792 For that we sum up the scores at the end. 1:21:56.036 --> 1:22:08.524 So we said our first feature is the translation probability of the full sentence. 1:22:08.588 --> 1:22:13.932 We say the translation of each phrase pair is independent of the others, and then 1:22:13.932 --> 1:22:19.959 the probability of the full sentence is the product of the phrase translation probabilities: the probability of "what we" given "was wir", times the probability of "saw" given "gesehen haben", times 1:22:19.959 --> 1:22:24.246 the probability of "up till now" given its source phrase. 1:22:24.664 --> 1:22:29.379 Now we can again use the logarithm for the calculation. 1:22:29.609 --> 1:22:36.563 We take the logarithm of each probability and sum them up, 1:22:36.563 --> 1:22:48.153 and we get our first score, the translation model score, which will be some negative value. 1:22:49.970 --> 1:22:56.586 And we are not doing that only once, but exactly the same with all our translation model features. 1:22:56.957 --> 1:23:03.705 So we said we also have the relative frequency in the inverse direction and the lexical probabilities. 1:23:03.843 --> 1:23:06.226 So in the end you will have four scores. 1:23:06.226 --> 1:23:09.097 How you combine them is exactly the same; 1:23:09.097 --> 1:23:12.824 the only difference is how you look them up for each phrase pair. 1:23:12.824 --> 1:23:18.139 We said in the beginning that we are storing four scores describing how good a phrase pair is. 1:23:19.119 --> 1:23:25.415 And these give us four scores describing how probable this sentence translation is. 1:23:27.427 --> 1:23:31.579 Then we can have more scores. 1:23:31.579 --> 1:23:37.806 For example, we can have a distortion model: 1:23:37.806 --> 1:23:41.820 how much reordering is done? 1:23:41.841 --> 1:23:47.322 There were different types of distortion models; we won't go into detail, but just imagine you now have a score 1:23:47.322 --> 1:23:47.748 for that. 1:23:48.548 --> 1:23:56.651 Then you have a language model score, which is the probability of the target sequence "what we saw up till now". 1:23:56.651 --> 1:24:06.580 How we calculate this language model probability we will cover later. And there were even more scores. 1:24:06.580 --> 1:24:11.841 One, for example, was a phrase count score, which just counts how many phrase pairs are used, 1:24:12.072 --> 1:24:19.555 in order to learn whether it is better to use more short phrases or to bias towards fewer 1:24:19.555 --> 1:24:20.564 and longer ones. 1:24:20.940 --> 1:24:28.885 You can easily add this by just counting the phrase pairs, and the weight then tells you 1:24:28.885 --> 1:24:32.217 how good it typically is to use more or fewer of them. 1:24:32.932 --> 1:24:44.887 Similarly, for the language model the probability normally gets smaller the longer the sequence is, so you want something 1:24:44.887 --> 1:24:46.836 to counteract that. 1:24:47.827 --> 1:24:59.717 And then you get your final score by multiplying each of the scores we had before with the weight from the 1:24:59.619 --> 1:25:07.339 optimization and summing them up; that gives you a final score of maybe twenty-three point seven eight five, 1:25:07.339 --> 1:25:13.278 and then you can do that with several possible translations and compare.
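A minimal sketch of how such a final score could be put together for one hypothesis; all phrase-table values, the language model and distortion scores, and the weights are invented for illustration.

```python
# Illustrative score combination for one phrase-based hypothesis (made-up numbers).
phrase_pairs = [
    # four log-scale phrase-table scores per used phrase pair
    (-1.2, -0.8, -1.5, -0.9),
    (-0.4, -0.6, -0.3, -0.7),
    (-2.1, -1.9, -2.4, -1.6),
]
log_lm = -7.0          # log probability of the target sentence under the language model
distortion = -2.0      # score for how much reordering was done
phrase_count = len(phrase_pairs)

# One weight per feature, as found in the tuning step (hypothetical values).
w_tm = (1.0, 0.5, 1.0, 0.5)
w_lm, w_dist, w_count = 0.8, 0.3, -0.2

# Sum the logs of each phrase-table column over the used phrase pairs,
# then combine everything into one weighted sum.
tm_scores = [sum(pair[i] for pair in phrase_pairs) for i in range(4)]
final_score = (sum(w * s for w, s in zip(w_tm, tm_scores))
               + w_lm * log_lm + w_dist * distortion + w_count * phrase_count)
print(final_score)
```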
1:25:14.114 --> 1:25:23.949 One maybe important point here is that the score not only depends on the target side, but 1:25:23.949 --> 1:25:32.444 it also depends on which phrase pairs you have used, so the same output could have been generated in different ways. 1:25:32.772 --> 1:25:38.076 So you would have the same translation, but with a different split into phrases. 1:25:38.979 --> 1:25:45.636 And this was normally ignored, so you would just look at all of them and then select the 1:25:45.636 --> 1:25:52.672 one which has the highest probability, and ignore that this translation could be generated by 1:25:52.672 --> 1:25:54.790 several splits into phrases. 1:25:57.497 --> 1:26:06.097 So, to summarize what we looked into today and what you should hopefully remember: statistical 1:26:06.097 --> 1:26:11.440 models for how to generate machine translation output. There were the word-based statistical 1:26:11.440 --> 1:26:11.915 models, 1:26:11.915 --> 1:26:16.962 the IBM models, at the beginning, and then we have the phrase-based approach, where 1:26:16.962 --> 1:26:22.601 it is about building the translation by putting together these blocks of phrases and combining them. 1:26:23.283 --> 1:26:34.771 Then, if you have a model which has several features, not millions of them but a manageable number of features, 1:26:34.834 --> 1:26:42.007 you can combine them with the log-linear model, which allows you to have a variable 1:26:42.007 --> 1:26:45.186 number of features and to combine them easily, 1:26:45.365 --> 1:26:47.920 with weights saying how much you can trust each of these models. 1:26:51.091 --> 1:26:54.584 Do you have any further questions on this topic? 1:26:58.378 --> 1:27:08.715 On Tuesday there will be a lecture by Tuan about evaluation, and then next Thursday 1:27:08.715 --> 1:27:12.710 there will be the practical part. 1:27:12.993 --> 1:27:21.461 So please come to the practical part here, but you can also do something yourself if you are not 1:27:21.461 --> 1:27:22.317 able to attend. 1:27:23.503 --> 1:27:26.848 In that case please tell us, and we will have to see how we handle that.