Okay, good. So today's lecture is on unsupervised machine translation. What you have seen so far are different techniques for supervised MT, so you always had parallel data: say an English corpus in one file, and then a German file which is aligned sentence by sentence, and you build systems around that.

What's different about this lecture is that we assume we have no parallel data at all. You only have monolingual data, and the question is how we can build systems that translate between the two languages. This is a somewhat more realistic scenario, because there are so many languages in the world that you cannot expect parallel data between every pair of them; but in typical cases you do have newspapers and so on, which are monolingual files, and the question is whether we can build something from those.

As for the agenda today: first we'll start with the introduction, so why do we need this, and also some intuition on how these models work before going into the technical details. I also want to go through an example, which should give you a better understanding of how people arrived at these models.

Then the rest of the lecture has two parts. In the first we are going to translate words: we do not care yet about translating full sentences, but, given two monolingual files, how can we get a dictionary, which is much easier than generating something at the sentence level. Then we go to the harder case, which is unsupervised sentence-level translation. Here you will see the training objectives, which are quite different from word translation, and also where it does not work; this is important, because it is one of the reasons why unsupervised MT is not used much anymore: its limitations keep it away from realistic use cases.

That then leads to multilingual models. What people currently do to build systems for languages without any parallel data is to use multilingual models and combine them with these training objectives. So people are not really building bilingual unsupervised MT systems anymore, but I think it's good to know how the field got to this point and what people are doing now; you will also see some patterns that keep being reused.
As said before, and you have probably heard it multiple times by now, we have around seven thousand languages. There can be different dialects and so on, so it's quite hard to pin down exactly what counts as a language, but seven thousand is a reasonable approximation, and that leads to roughly twenty-five million language pairs, which is the obvious reason why we do not have parallel data for all of them. So if you want to build an MT system for all possible language pairs, the question is how.

That is the typical use case, but there are actually a few more interesting use cases than you might expect. One is animal languages, which is a real thing happening right now, not with dogs but with dolphins and so on. I couldn't find a picture that shows it well, but if you are interested in this kind of thing you can look up the projects where people are actually trying to understand how animals communicate. It's more about figuring out what the animals want to say, and it may not quite work yet, but people are trying.

A more realistic thing that is happening is the translation of programming languages. This is quite a good scenario for unsupervised MT: you have a lot of code available online, in C++ and in Python, and the question is whether we can translate by looking at the code alone, without parallel functions and so on. This actually works quite well right now, and you can see how these techniques were applied to programming-language translation.

You can also stretch what counts as a language: think of formal English sentences as one language and informal English sentences as another, learn to translate between them, and it becomes a style-transfer problem. So although we call it translation, you can take different characteristics of a language, treat them as two different languages, and try to map between them. It's not only about languages in the narrow sense; you can do quite cool things with unsupervised techniques.

So this is roughly the setting for many of the use cases we have for unsupervised MT. But before we go into how these systems are modelled, I want you to look at these dummy languages. We have text in language one and text in language two, and nobody knows what these languages mean; they are completely made up.
And importantly, they are not parallel: the first line in one file and the first line in the other are not aligned; they are just monolingual files. Now think about how you could translate this one word from language one into language two; this will show you how we try to model the problem. Take your time and think about how you would translate it into language two.

(Students work on the example.) If you asked somebody who knows nothing about machine translation to do this, it would not be easy; but the way I constructed this example makes it relatively easy. Basically, the first two sentences in the two files correspond to each other word for word, and so do these two. When you join these two words you get something like 'English German', which is the third line and the last line; the fourth line is built like the first one, so roughly 'German language' and 'English language', and then 'speak English' and 'speak German'.

That is how I made up the example, and the intuition is that we assume languages have a fundamental structure which is the same across all of them. No matter which language you think of, words are formed in similar ways, they are joined together in similar ways, and things like plural marking work in similar ways. This is certainly not a fully realistic assumption, but it is a decent one to make, and if you can assume it, then we can model systems in an unsupervised way. That is the intuition I want to give, and you will see that whenever the assumption fails, the systems fail; in practice, the further we move away from these assumptions, the more the systems tend to break.

The example I gave was a perfect mapping, which never really happens: same number of words, same sentence structure, a perfect one-to-one correspondence. That does not occur in reality, but let's assume it does and see how we can model it.

Okay, now let's get a bit more formal. What we want to do is unsupervised word translation. The task is that our input is monolingual data: a bunch of sentences in one file and a bunch of sentences in another file, in two different languages, and the question is how we can get a bilingual lexicon out of that. If you look at the picture, the embeddings are just projected down onto a two-dimensional plane, and when you put them into one plot you see that words that are translations of each other lie close together; the question is how we can achieve this just by looking at the two files.
So what we basically want to do is create a dictionary, given two files; that is the task. The first step is to learn word vectors, and this is again whatever technique you have seen before, word2vec, GloVe and so on. You take the monolingual data and learn word embeddings. Then you plot them, and what you typically see is that they are not aligned at all: one embedding space sits somewhere, the other somewhere else, and that is what you would expect to see in the image.

Now, our assumption was that both languages share the same structure, so we can use this to learn a mapping between the two spaces. Before we get to how, recall something that is quite well known by now: word embeddings capture semantic relations. The distance between 'man' and 'woman' is approximately the same as between 'king' and 'queen'. The same holds for other relations, like country and capital, so there are clear relationships in the word embedding space, at least within one language.

Now think of the English word embeddings and the German word embeddings: the way king, queen, man and woman are arranged is the same as for the German translations of these words. That is the main idea: although the two spaces sit in different places, the relationships are the same in both languages, and we can use this to learn the mapping. And it is not only these four words; it happens for all the words in the language. So the main idea is that both embedding spaces have a similar shape; they are just not aligned, they live in different places, and what we need to do is map them into a common space.

Concretely, we want a matrix W such that, when we multiply W with X, the two embedding spaces become aligned.

(Student question.) That's true, but there are also many words that do have such relationships, and we hope that this is enough to learn the mapping. There is always going to be some noise; when we align the spaces they will not match exactly. But we can expect that there are enough of these anchor words to let us learn the mapping, so it will not be perfect, but it is an approximation we make, and in practice it works. Also, the fact that a few words do not have any clear relationship does not hurt that much.
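To make the idea concrete, here is a minimal sketch of what "aligned by W" buys you: once some W is available, translating a word is just mapping its vector and doing a nearest-neighbour search in the other space. The matrices X, Y, W and the translate helper below are hypothetical placeholders (random data, identity mapping), not the lecture's actual setup.

    # A minimal sketch, assuming W has already been obtained somehow.
    import numpy as np

    d = 300                                   # embedding dimension
    X = np.random.randn(50_000, d)            # source embeddings, one row per word (placeholder)
    Y = np.random.randn(60_000, d)            # target embeddings (placeholder)
    W = np.eye(d)                             # mapping; in practice learned, not the identity

    def translate(src_index: int, k: int = 5):
        """Map one source word into the target space and return the k nearest target words."""
        mapped = X[src_index] @ W.T                                                   # W x
        sims = (Y @ mapped) / (np.linalg.norm(Y, axis=1) * np.linalg.norm(mapped) + 1e-9)
        return np.argsort(-sims)[:k]                                                  # candidate translations

    print(translate(42))

The whole unsupervised word-translation problem is then reduced to finding a good W.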
A lot of words do have such relations, so it works out in practice.

(Student question.) I have not heard about that, but if you want to say something about it I would be interested; we can do it later.

The usual case is supervised: the first way to do word translation is supervised, where you have a dictionary and use it to learn the mapping. But in our case we assume we have nothing, only monolingual data. Then we need unsupervised learning to figure out W, and we are going to use GANs to find W, which is quite a nice way to do it.

Before we get to our use case, a brief recap of GANs: we have two components, a generator and a discriminator. The generator tries to generate something, obviously, and the discriminator tries to tell whether it is real data or something produced by the generator. It is a two-player game, where one side tries to fool the other and the other tries not to be fooled, and by training these two components against each other we can learn W.

Okay, so let's say we have two languages X and Y. Language X has N words, each with an embedding of some dimension, so we can stack them into a matrix; the target language Y has M words, with the same embedding dimension, giving another matrix. Basically we run word2vec and learn our word embeddings. Now we have the X embeddings and the Y embeddings, and what we want is a W such that WX and Y are aligned.

With GANs we have two steps: a discriminator step and a mapping step. The discriminator step is about telling whether an embedding comes from the mapped source or from the target; it becomes much clearer with the figure. We have monolingual documents in two different languages; from them we get our source-language embeddings and our target-language embeddings; then we randomly initialize the transformation matrix W.

Then we have the discriminator, which tries to tell whether a vector is WX or Y, so it should recognize that this one is a mapped embedding and this one is an original target-language embedding. If you look at the loss function, it is basically the probability that 'source' equals one given WX, meaning it comes from the source language, and 'source' equals zero given Y, meaning it is a target-language embedding. My figure is not that great, but you can assume the two sets are distinct.
So that is the loss function: we have N source words and M target words, which is why you see the one-over-N and one-over-M terms, and the discriminator simply has to decide whether an embedding is mapped or comes from the original target distribution.

Then we have the mapping step, where we train W to fool the discriminator. It is the same setup, but we essentially invert the loss function: now we freeze the discriminator. It is important to note that in the previous step we froze the transformation matrix, and here we freeze the discriminator. The goal now is to fool it: it should predict 'source' equals zero given the mapped embedding and 'source' equals one given the target embedding, which is exactly wrong, and that is what trains W.

Any questions on this? Okay, so how do we know when to stop? We just train until we reach convergence, and then we hopefully have a W that maps the two spaces into one aligned space.

The question is how we can evaluate this mapping. Does anybody know what we could use to evaluate how good the word translation is? (Student answer.) Yes, we use a dictionary, but only at the end: we need a dictionary to evaluate, so that is our only gold data, and we do not use it at all during training. One option is precision at one against the dictionary: take the nearest neighbour and check whether the correct translation is there. But this is quite strict, because there is a lot of noise in the embedding space; the first neighbour is not always the translation, so people also report precision at five and so on: you take the five nearest neighbours and check whether the translation is among them. The larger you make k, the more likely the translation is included, because word embeddings are quite noisy.

What is interesting is that people have also used a dictionary to learn the word translation in a supervised way, and this unsupervised way of doing it turns out to be better; somehow our assumption helps us build something better than a supervised system. As you see at the top, you have precision at one, five and ten; these are the typical numbers reported for word translation.
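Putting the two alternating steps together, a rough PyTorch sketch of this adversarial training could look as follows. X_emb and Y_emb stand in for pre-trained embedding matrices (random placeholders here), and the hyperparameters, batch sampling and the orthogonality tricks used in practice are simplified assumptions, not the exact recipe from the slides.

    # Sketch of the alternating discriminator / mapping steps for learning W.
    import torch
    import torch.nn as nn

    d = 300
    X_emb = torch.randn(50_000, d)    # source word embeddings (placeholder)
    Y_emb = torch.randn(60_000, d)    # target word embeddings (placeholder)

    W = nn.Linear(d, d, bias=False)                                                  # the mapping to learn
    disc = nn.Sequential(nn.Linear(d, 512), nn.LeakyReLU(), nn.Linear(512, 1))       # discriminator
    bce = nn.BCEWithLogitsLoss()
    opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
    opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)

    def sample(emb, batch=128):
        idx = torch.randint(0, emb.size(0), (batch,))
        return emb[idx]

    for step in range(10_000):
        x, y = sample(X_emb), sample(Y_emb)

        # Discriminator step: W is frozen (detached); learn to tell "mapped source" (label 1)
        # from "real target" (label 0).
        logits = torch.cat([disc(W(x).detach()), disc(y)])
        labels = torch.cat([torch.ones(x.size(0), 1), torch.zeros(y.size(0), 1)])
        loss_d = bce(logits, labels)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Mapping step: only W is updated, with flipped labels, so that mapped source
        # embeddings start to look like real target embeddings to the discriminator.
        logits_map = torch.cat([disc(W(x)), disc(y)])
        loss_w = bce(logits_map, 1.0 - labels)
        opt_w.zero_grad(); loss_w.backward(); opt_w.step()

In the discriminator step only the discriminator is updated (the mapped embeddings are detached), and in the mapping step only W is updated, which matches the two frozen/unfrozen phases described above.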
But GANs are usually quite tricky to train, and this does not converge for every language pair, which goes back to our assumption that the languages share the same structure. If you take English and some other language that is very low-resource and quite different from English, then whenever our assumption fails, these unsupervised techniques either do not converge or give really bad scores. The monolingual embeddings of distant languages are simply too far apart; they do not share the same structure, and so the training does not converge.

I also want to mention that there is a better retrieval technique than plain nearest neighbour, called CSLS. It is a bit more involved mathematically, so I will not go into it now, but if you are interested in good retrieval techniques you can look it up.

Okay, so that was word translation. Does anybody have questions so far? Word translation was the easier part: you just assume there is a mapping and try to learn it. Now it gets more difficult, because we also need to generate text, which is much trickier. The task is that our input is again monolingual data for both languages, but instead of translating word by word we now want to translate sentences.

We have word2vec and similar methods to learn word embeddings, but sentence embeddings were not really available, at least when people first worked on unsupervised MT. Nowadays they are okay; as you saw in the practical session where we used LASER, they were quite decent, but it also depends on what data they were trained on and so on. Sentence embeddings are definitely much harder to get than word embeddings, so this task is more complicated than the one before.

Before we go into how unsupervised NMT works, recall the typical supervised system: we have parallel data, source sentences and target sentences, a source encoder and a target decoder, and we minimize the cross-entropy loss on the parallel data. That is how we train a standard system. But now we have no parallel data, and the intuition is that if we can learn language-independent representations at the encoder output, we can pass them to whichever decoder we want. This will become clearer soon; I just want to give some intuition before showing you the training objectives.
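For reference, since all the objectives below reuse it, a bare-bones version of that supervised training step might look like this. The model interface (an encoder-decoder called as model(src, tgt_in)), the token layout and the pad id are assumptions for the sketch, not code from the lecture.

    # A minimal supervised NMT update on one batch of parallel data.
    import torch
    import torch.nn.functional as F

    def supervised_step(model, src, tgt, optimizer, pad_id=0):
        # Teacher forcing: the decoder sees tgt[:, :-1] and is trained to predict tgt[:, 1:].
        logits = model(src, tgt[:, :-1])                       # (batch, len-1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tgt[:, 1:].reshape(-1),
            ignore_index=pad_id,                               # padding does not contribute to the loss
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()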
So assume we have these separate components: not just two, but a set of source-language encoders and a set of target-language decoders, and also assume that the encoders all produce outputs in the same representation space. If you feed in a sentence in English and the same sentence in German, the embeddings come out roughly the same, just like with the multilingual word embeddings; then, depending on the language we want, we can pass the encoding to the appropriate decoder. So the goal is to find a way to create language-independent representations and then feed them to the decoder of our choice. Keep in mind that we are aiming for language independence for a reason; it will become clearer once we see how everything fits together.

In total we have three objectives that we train our systems on, and all of them use monolingual data; there is no parallel data at all. The first is denoising autoencoding: you add noise to a sentence and then reconstruct the original. The second is on-the-fly back-translation, where you take a sentence, generate a translation, and then train on it; I will show pictures in a moment. And the third is adversarial training, to learn the language-independent representations. By training on these three tasks we somehow end up with an unsupervised MT system.

Okay, so first the denoising autoencoding. As I said, we add noise to the sentence: we take a sentence, and there are different ways to corrupt it. You can shuffle words around, you can drop words, whatever you like, as long as enough information remains to reconstruct the original sentence. Then we treat the noisy sentence and the original as if they were parallel data and train just as in the supervised case. So we have a source sentence and a noisy source; here the words got shuffled and one word was dropped. We treat the noisy source and the source as a sentence pair and train by optimizing the same cross-entropy loss as before.
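As an illustration, a noise function of this kind could be as simple as the sketch below; the drop probability and the shuffle window are arbitrary illustrative values, not the exact settings from the papers.

    # Word dropout plus a local shuffle, as one possible noise model for denoising autoencoding.
    import random

    def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
        # Randomly drop a few words (keep at least one token).
        kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
        # Local shuffle: each word can move at most `shuffle_window` positions.
        keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
        return [t for _, t in sorted(zip(keys, kept))]

    sentence = "the cat sat on the mat".split()
    print(add_noise(sentence))   # e.g. ['cat', 'the', 'on', 'sat', 'mat']

The pair (add_noise(sentence), sentence) is then fed to the same supervised training step sketched above.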
Here is the picture of what happens: we have the noisy source and the noisy target, and we reconstruct the original source and the original target; and since the languages are different, we have a source encoder, a target encoder, a source decoder and a target decoder. For this task we only need monolingual data; there is no parallel data involved, because we just take a sentence, corrupt it, and reconstruct the original. And we have four separate blocks; keep that in mind, because it matters for how we rewire the connections later. The more formal version is simply that we predict the source given the noisy source.

So that was denoising autoencoding. The second step is on-the-fly back-translation. We put the model in inference mode, take a source sentence, and generate a translation. It might be completely wrong, or only partially correct, but we just take whatever the model produces. We generate t-hat, and then we treat t-hat and s as a sentence pair; that is how we handle the translation direction. So we train a supervised system on this synthetic sentence pair: we run inference and then train the reverse direction.

To be more concrete: we have a source sentence, we generate the translation, and then we feed the generated translation as input and try to predict the original sentence. In practice it looks like this: before, the source encoder was connected to the source decoder, but now we interchange the connections, so the source encoder feeds the target decoder and the target encoder feeds the source decoder. Given s we get t-hat, and given t we get s-hat; that is the first pass. In the second pass we reverse it: given s-hat we try to predict t, and given t-hat we try to predict s.

Is this clear? Any questions? A bit more mathematically, we again optimize the cross-entropy, so it is always the supervised NMT recipe; we just create these synthetic pairs that let us build an unsupervised system. What you can perhaps also see here is that if the source and target encoders are language independent, we can shuffle the connections and still get sensible translations; that is why it was important to find a way to produce language-independent representations. And the way we try to enforce this language independence is the GAN step.
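A sketch of one on-the-fly back-translation update, reusing the supervised_step helper from above. The two models src2tgt and tgt2src and their translate() method (greedy decoding of a batch of token ids, no gradients) are hypothetical interfaces for the sketch, not a real library API.

    # One on-the-fly back-translation step in the unsupervised setup.
    import torch

    def backtranslation_step(src_batch, tgt_batch, src2tgt, tgt2src, opt_s2t, opt_t2s):
        # 1) Inference mode: produce synthetic translations with the current models.
        with torch.no_grad():
            t_hat = src2tgt.translate(src_batch)   # synthetic target for real source sentences
            s_hat = tgt2src.translate(tgt_batch)   # synthetic source for real target sentences

        # 2) Train each direction on (synthetic input -> real monolingual output).
        loss_t2s = supervised_step(tgt2src, t_hat, src_batch, opt_t2s)   # t_hat -> s
        loss_s2t = supervised_step(src2tgt, s_hat, tgt_batch, opt_s2t)   # s_hat -> t
        return loss_s2t, loss_t2s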
So the third step, which ties everything together, is where we use a GAN to make the encoder outputs language independent. It is the same setup, but this figure comes from a different paper. We have the source and target sentences, which are monolingual data; we add noise; we encode them with the source and target encoders and obtain the latent representations z-source and z-target; then we decode and try to reconstruct the originals. That is the autoencoding loss, which compares the original sentence with the predicted output.

So far that is just the autoencoding step; the GAN sits in between, on the encoder outputs. We have a discriminator that tries to predict which language a latent representation came from: given z-source it should say the representation comes from the source language, and given z-target that it comes from the target language. The encoders effectively play the role of the generator here, and a separate discriminator network tries to guess the language of the latent vectors.

And then we combine the GAN with the other training steps. When we add the on-the-fly back-translation step, what we do is exactly the same, but during training we also add the adversarial loss. We take the source sentence, use the intermediate translations y-target and y-source from the previous time step, encode the new sentences, and train the representations to be language independent. The hope is that if we do all of this with monolingual data alone, we can simply switch the connections and get our translation.
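A rough sketch of that adversarial term on the encoder outputs, under the same kind of assumptions as before: lang_disc is a small hypothetical discriminator network, and z_src / z_tgt are pooled latent vectors, one per sentence. Pooling, batching and the decoder losses are left out.

    # Adversarial losses on the latent space: the discriminator learns to tell the
    # two languages apart, the encoders learn to make that impossible.
    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()

    def discriminator_loss(lang_disc, z_src, z_tgt):
        # Train the discriminator: source latents get label 1, target latents label 0.
        logits = torch.cat([lang_disc(z_src.detach()), lang_disc(z_tgt.detach())])
        labels = torch.cat([torch.ones(len(z_src), 1), torch.zeros(len(z_tgt), 1)])
        return bce(logits, labels)

    def adversarial_loss(lang_disc, z_src, z_tgt):
        # Train the encoders with flipped labels, so the discriminator cannot tell
        # which language a latent vector came from.
        logits = torch.cat([lang_disc(z_src), lang_disc(z_tgt)])
        flipped = torch.cat([torch.zeros(len(z_src), 1), torch.ones(len(z_tgt), 1)])
        return bce(logits, flipped)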
Now, as I said before, GANs are quite good for vision; this is similar to the CycleGAN approach you may have seen in a computer vision course. For machine translation it turned out to be less reliable, so to enforce the language independence people instead used a shared encoder rather than separate encoders. The training objectives stay the same, but you learn cross-lingual embeddings and use a single encoder for both languages, which also forces the representations into the same space, and then you choose whichever decoder you want. So you can use GANs, or you can just use a shared encoder, and try to build your unsupervised MT system that way.

These are the enhancements you can add on top of such a system: you can use a shared encoder, and on top of the shared encoder you can still add the GAN loss and so on; there are many options. The other one that is more relevant is that you can create parallel data by word-by-word translation, because you already know how to do unsupervised word translation. The first step is to create parallel data, assuming the word translations are reasonably good; then you train a supervised NMT model on this most likely noisy data, which still gives you a useful starting point. You build the supervised NMT system on the word-translated data and use it as the initialization before doing the unsupervised training. The hope is that when you then run the back-translation, you start from a decent point; it is one technique to improve your unsupervised system.

In the word-translation case, the way we knew when to stop was to watch the convergence of the GAN training: all we needed was for W to converge, which is easy to detect. But in the realistic case we have no parallel data, so there is no validation set; we may have test data at the very end, but nothing to validate on. So how do we tune our hyperparameters? There is no gold data to check against. How do you think we can evaluate such systems, or tune hyperparameters, in this setting?

What people do is use the back-translation idea again; it is the standard trick when you have nothing. The main idea is to validate on how good the reconstruction is. If you have a good system, the intermediate translation is reasonable and going back is easy; but if the forward step only produces noise, then it is very hard to get back. So we take a source sentence, translate it into the target language, then translate the generated sentence back and compare it with the original; if they are close, the system is good, and if they are far apart, it is not. This works as an unsupervised validation criterion.
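A small sketch of that criterion, assuming the same hypothetical models but with a translate() that maps a sentence string to a sentence string, and using sacrebleu as one convenient way to score the reconstruction:

    # Round-trip reconstruction score as an unsupervised validation criterion.
    import sacrebleu

    def round_trip_bleu(src_sentences, src2tgt, tgt2src):
        forward = [src2tgt.translate(s) for s in src_sentences]      # source -> target
        back = [tgt2src.translate(t) for t in forward]               # target -> source
        return sacrebleu.corpus_bleu(back, [src_sentences]).score    # higher = better checkpoint

You would compute this on a held-out monolingual set after each epoch and keep the checkpoint with the highest score.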
As for the amount of data you need: in the first initial results on these systems, for English and French, they had fifteen million monolingual sentences per language, so quite a lot, and they were able to get around thirty-two BLEU with this kind of setup. But, perhaps unsurprisingly, if you have 0.1 million parallel sentences you get the same performance. So it is a lot of training and a lot of monolingual data; monolingual data is relatively easy to obtain, but the training also takes quite a bit longer than for a supervised system. It is unsupervised, so that is the trade-off you are making.

The other thing to note is that this is English and French, which fits our assumptions very well, and the monolingual data they used came from similar domains; that is why they were able to build such a good system, but you will see later that it fails elsewhere.

What people usually do in practice is first build a system using whatever parallel data they have, then use monolingual data and do back-translation; that has always been the standard way to improve. And what people have seen is that you do not even need 0.1 million parallel sentences: you just need something like ten thousand, then you do back-translation with the monolingual data, and you are still better than the unsupervised system. So the question is whether it is really worth doing this, or whether it is always better to find some parallel data, spend a bit of money on getting a small amount of it, and use that as the starting point for your system. The general understanding became that bilingual unsupervised systems are not really that useful.

(Student question.) The thing is that with unlabeled data there is no training signal. When we are starting out, what we want is first to get a good translation system and then use the unlabeled monolingual data to improve it. But if you start from the unsupervised MT model, it might be really bad; it could be translating completely wrong. And then when you train further on your unlabeled data, it might even be harmful, or at best about the same as the unsupervised approach. So the idea is to fine-tune on labeled data first to get a good initialization, and then use the unsupervised techniques to get better. If your starting point is really bad, it does not work.
Yeah, so as we said before, this is how this self-supervised training usually works. First we have some parallel data; the source language is X, the target language is Y. In the end we want a system that translates X to Y, not Y to X, but first we train a backward model, Y to X, so target to source. Then we take our monolingual target sentences, use the backward model to generate synthetic source sentences, and join them with our original data. So now we have noisy inputs but always gold outputs, which is really important when you are doing back-translation. Then you concatenate this data, train your X-to-Y system, and you can repeat this for multiple rounds, usually three or four, which keeps improving things, and finally you get your best system.

The point I am trying to make is that although the unsupervised MT scores I showed before were quite good, you can probably get the same performance with fifty thousand parallel sentences; and also the languages shown are quite similar and the texts came from the same domain.
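As a high-level sketch of that loop: the train() helper (returning a model with a translate() method) and the plain lists of sentence strings are placeholders, and the mirror update of the backward model is a common extension rather than something spelled out on the slide.

    # Iterative back-translation, at the level of pseudocode-with-placeholders.
    def iterative_back_translation(parallel, mono_tgt, mono_src, rounds=3):
        # parallel: list of (src, tgt) pairs; mono_tgt / mono_src: monolingual sentences.
        forward = train(parallel)                                    # X -> Y
        backward = train([(t, s) for s, t in parallel])              # Y -> X
        for _ in range(rounds):
            # Synthetic source, gold target: noisy inputs, clean outputs.
            synth_src = [(backward.translate(t), t) for t in mono_tgt]
            forward = train(parallel + synth_src)
            # Mirror image, to keep improving the backward model as well.
            synth_tgt = [(forward.translate(s), s) for s in mono_src]
            backward = train([(t, s) for s, t in parallel] + synth_tgt)
        return forward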
Any questions on unsupervised NMT? Okay. So after this realization that back-translation was already better than unsupervised MT, what people tried is to use the idea of multilinguality, as you have seen in the previous lecture. The question is how we can transfer knowledge from a high-resource language pair to a low-resource one. One way to promote language-independent representations is to share the encoder and decoder across all available languages, and that hopefully enables the knowledge transfer.

When we do multilinguality, there are two questions to think about: how does the encoder or decoder know which language it is dealing with, and how can we push the encoder to generate language-independent representations? By solving these two problems we can take the help of high-resource languages to do unsupervised translation. A typical example: you want to do unsupervised translation between English and Dutch, but you have parallel data between English and German; the question is whether we can use that parallel data to help build an unsupervised English-Dutch system.

For the first question, we use language embeddings or language tokens, which is a straightforward way to tell the model which language it is dealing with. For the second one we look at some pretraining objectives, which are also unsupervised in the sense that they mostly need monolingual data, and which help promote language-independent representations.

The first pretrained model we will look at is XLM, which is quite famous if you have not heard of it yet. It is basically a transformer encoder, just the encoder module, no decoder. We mask tokens in a sequence and try to predict these masked tokens; this is called masked language modelling. The typical language modelling you have seen before is causal language modelling, where you predict the next token. Here we have the position embeddings, then the token embeddings, then the masked token, and then the transformer encoder blocks that predict the masked words. We do this for all languages with the same shared encoder, and that helps push the sentence embeddings, the encoder outputs, into a common space across multiple languages.

So first we train an MLM on both the source and the target language, and then we use it as the starting point for the encoder and the decoder of a UNMT system. We take the monolingual data, build a masked language model on both languages, and use it to initialize the UNMT system. Here we look at two languages, but you can also do it with one hundred languages at once; there are pretrained checkpoints you can use, which have seen quite a lot of data, and you can use them as the starting point for your UNMT system, which works well in practice. One detail is that this is an encoder-only model, and your UNMT system is an encoder-decoder, so the cross-attention is missing; but you can just initialize that randomly and it is fine. Not everything is initialized, but it is still decent.
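A minimal sketch of how the masked-language-modelling inputs are typically prepared; the fifteen percent rate and the mask/random/keep split follow the common BERT-style recipe, which XLM also uses, and the reserved ids are assumptions for the sketch.

    # Prepare MLM inputs and labels from a batch of token ids.
    import torch

    def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
        inputs = token_ids.clone()
        labels = token_ids.clone()
        masked = torch.rand_like(inputs, dtype=torch.float) < mlm_prob
        labels[~masked] = -100                        # only masked positions contribute to the loss
        # 80% of masked positions become [MASK], 10% a random token, 10% stay unchanged.
        replace = masked & (torch.rand_like(inputs, dtype=torch.float) < 0.8)
        inputs[replace] = mask_id
        randomize = masked & ~replace & (torch.rand_like(inputs, dtype=torch.float) < 0.5)
        inputs[randomize] = torch.randint(vocab_size, (int(randomize.sum()),))
        return inputs, labels    # feed `inputs` to the encoder, compute cross-entropy against `labels`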
Then the other one is mBART, and you can see that it builds on the unsupervised training objective we saw, namely denoising autoencoding. The idea is that for pretraining you do not even need the GAN or the back-translation (you can still add them later); you just do denoising autoencoding on all the different languages, and that alone already gives good out-of-the-box performance. So what we have here is a transformer encoder-decoder: you are trying to generate a reconstructed sequence, so you need a decoder. We give an input sentence and try to reconstruct the original sentence from the corrupted input, which is the same denoising objective you have seen before. This one is for English, this one for Japanese, and you do it for all languages; there are versions with twenty-five or fifty languages and so on, and then you can fine-tune on your sentence-level or document-level task. They used this for supervised setups, but you can also use it as the initialization for unsupervised systems built on top, and in practice that works too.

So far we have not really seen the direct benefit of the high-resource language. As I said, for English to Dutch you can use English-German as help, and for English to Catalan you could use English-French. One typical way to exploit this is pivot translation: you take, say, Finnish, translate from Finnish to English, then from English to the target language, and that gives you the translation. What is important is that you have these different techniques, and you can always think about which one to use given the data situation. If it is Finnish to Greek, maybe pivoting is better, because you can probably get a good Finnish-English and a good English-Greek system. Sometimes it also depends on the language pair; there can be information loss through the pivot and so on, so there are quite a few variables to consider when deciding which system to use.

Then there is zero-shot translation, which you have probably also seen in the multilingual lecture: if you can improve the language independence, your zero-shot translation gets better, so maybe if you use multilingual models and do zero-shot directly, it is already quite good. So we have zero-shot, we have pivoting, and we have unsupervised translation, which we can use when there is no parallel data at all.
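Pivoting itself is just function composition over two trained systems; a tiny sketch, with hypothetical models and the Finnish-English-Greek example from above:

    # Pivot translation through English, assuming two trained systems with a translate() method.
    def pivot_translate(sentence_fi, fi_en_model, en_el_model):
        english = fi_en_model.translate(sentence_fi)   # Finnish -> English (high-resource pair)
        return en_el_model.translate(english)          # English -> Greek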
To sum up what we have seen so far: from these two monolingual files alone you can create a dictionary, and you can build an unsupervised MT system. Not always: if the domains are similar and the languages are similar it works, but for distant languages the unsupervised setting usually does not work very well. What would be better is, if you can get some parallel data from somewhere, or do bitext mining as we saw in the LASER practical, to use that to initialize your system and then train a semi-supervised MT system; that would be better than building a purely unsupervised one.

And with that we are basically at the end. Any quick questions?

(Student question about large language models, partly inaudible.) LLMs are trained on next-token prediction, and that somehow gives them many abilities, not only translation; there are quite a few things they can do. But for translation itself, an LLM out of the box is usually not better than a system you build specifically for your case. I would guess the dedicated system is usually better than the LLM, but you can always adapt the LLM to the task you want, and then it could be better; out of the box it might not be the best choice for your task. For me, I am working on translation for software, so it is more about translating software, and it is quite a niche domain; if you use LLMs out of the box there, they are actually quite bad compared to the systems we built. But you can use different techniques, like prompting. What people usually do is few-shot prompting, where they put similar translation pairs into the prompt and then ask the model to translate, and that kind of adaptation improves the performance a lot. So there are different techniques you can use to adapt your LLMs, and then they might be better than the task-specific system. But if you are looking at niche things, I do not think LLMs are that good out of the box. And if you want to do, let's say, unsupervised translation with them, you can never be sure they have not seen the data: whether they have seen data in that language or not; and if it is on the internet, they probably did see it.
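To make the prompting idea concrete, a few-shot translation prompt might be assembled like this; the example pairs and the domain are made-up placeholders, and in practice you would retrieve similar pairs from your own data rather than hard-code them.

    # Building a few-shot ("hard") prompt for domain-specific translation with an LLM.
    examples = [
        ("Open the settings menu.", "Öffnen Sie das Einstellungsmenü."),
        ("Click Save to apply the changes.", "Klicken Sie auf Speichern, um die Änderungen zu übernehmen."),
    ]
    source = "Restart the application to finish the installation."

    prompt = "Translate from English to German.\n\n"
    for en, de in examples:
        prompt += f"English: {en}\nGerman: {de}\n\n"
    prompt += f"English: {source}\nGerman:"

    # `prompt` is then sent to whatever LLM you use; the in-context examples usually
    # improve domain-specific translations noticeably.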
(Student comment: it feels like these models already have a pretty good understanding of the languages.) It depends on the language. I am pretty surprised when it works on a low-resource language; I would expect it to work on German and similar languages, but if you take a real low-resource language, I do not think it works. There are also quite a few papers showing that if you build a system yourself, in the typical way, it is still quite a bit better than the LLM. You can always do things with LLMs to get better, but that is probably a separate discussion.

Any more questions? If not, we will end the lecture here, and on Thursday we will have document-level MT, which is also given by me. Thanks for coming.