What today is about is how to use additional resources to improve translation. In the first part of the semester, roughly two thirds of it, we looked at how to build a basic machine translation system. By now you know the basic components, both for statistical and for neural machine translation with the encoder-decoder architecture.

Of course, that is not where it stops. It is still what nearly every machine translation system has at its core, but there are a lot of additional challenges that need to be addressed and solved. So now we want to look at what else you can do around this basic system.

One important question is: what do you train your models on? Parallel data of the kind we need is harder to obtain than the data used in many other tasks, so an important question is whether we can also learn from other sources. Because, if you remember from right at the beginning of the lecture, this is how we train all our machine learning models, from statistical to neural, and that has not changed: we need parallel data, where a source sentence is aligned with a target sentence. We now have a strong model, a very good model, to learn from that, but we always rely on this kind of data.

For high-resource language pairs, say German to English or other European languages, there is a decent amount of parallel data, at least for similar domains. But even there, if you go to very specific domains it can get difficult, and your system's performance might drop. If you want to translate medical text, for example, you also need parallel data in the medical domain to know how to translate these types of phrases, how to use the vocabulary, the style and so on. And if you go to other languages, the challenge is a lot bigger, and the question is the same: is this really the only resource we can use? Can the training be adapted so that we also make use of other types of data and models, which might enable us to build strong systems with other types of information?

That is what we will look into now, starting with this lecture. One idea we already covered on Tuesday: a very successful approach is multilingual machine translation, so that we are no longer doing translation between two languages only, but between many languages, and can share common knowledge between them.
You also learned about things like zero-shot machine translation, so you can translate between language pairs for which you have no parallel data at all, which is the case for many, many language pairs. Even with German, you do not have parallel data to all languages around the world; for most of them you have it to the European languages, maybe also to Japanese. There is quite a lot of data for English to Japanese, for example, but for German to Japanese or German to Vietnamese there is only a little, and multilingual systems help there. So this is a very promising direction if you want to build translation systems between language pairs that typically do not involve English.

The other idea, of course, is that we do not only have to search for existing corpora. There has been some work on data crawling: if I do not have a corpus directly, or I do not have a high-quality corpus like the European Parliament or the TED corpus, maybe it makes sense to crawl more data and get additional sources so you can build stronger systems. There has been quite a big effort in Europe to collect really large parallel data sets this way. The interesting thing from the machine translation point of view is not general data crawling, but how we can explicitly crawl data that is parallel. There is quite a lot of data on the Internet, such as company websites that have been translated, so how can you extract these parallel fragments? This data is typically noisier than data prepared more by hand: for the Parliament data you can write rules for how to extract parallel sentences, while here there is more to it, so the quality may not be as good, but the scale is a way to address that, because you simply have so much more data.

The other thing we can use is monolingual data. Monolingual data has the big advantage that huge amounts of it are available, it can be crawled automatically from the Internet, and you can typically also get it for many domains. There are just orders of magnitude more monolingual data, so it can be very helpful. In statistical machine translation it was easy to integrate, using language models. In neural machine translation we have the advantage of one overall architecture that trains everything together, but here that is also a disadvantage, because there is no obvious separate place for monolingual data. We will look today at two things.
On the one hand, you can still try to do a bit of language modelling and add an additional language model into the system; there is some work on that. The other, very successful way, which I think is used in most systems at the moment, is to create synthetic data. It is a very simple idea: you just translate monolingual data and use it as training data, and normally that helps. Thereby you are able to use some type of monolingual data.

Another direction is unsupervised machine translation, the extreme case: you have only monolingual data and no parallel data at all. Can you still build a translation system? If you have large amounts of data and the languages are not too dissimilar, you can build translation systems without parallel data. That we will see next Thursday.

And then there is a third type, pre-trained models, which recently became very successful, and even more so now with large language models. The idea is that we are no longer only sharing the raw data; a trained model itself can also help to train another model. That is a big advantage of deep-learning-based approaches: you can train a model on one task and then apply it, or parts of it, to another task. The question is then: can we find an initial task where there are huge amounts of data? The task you typically pre-train on is similar to a language modelling task, either directly next-word prediction or a related masking task. The idea is: I can train on this data, and the knowledge about words and how they relate to each other I can then reuse. So it is a different way of using language models; in the end it is more transfer learning.

So first we will start with how we can use monolingual data in machine translation. The big difference, as you should remember from what I mentioned before, is that in statistical machine translation we directly have the opportunity: there is parallel data for the translation model and monolingual data for the language model, and you combine the translation model and the language model, so you can make use of both. That way you can use these large amounts of monolingual data, but of course it also has a disadvantage: the two parts are optimized somewhat independently of each other, whereas we said the big advantage of neural machine translation is that we optimize the overall architecture, everything together, to perform best.
But then, of course, we cannot do that here: with a purely end-to-end neural system we can only use parallel data. So the question is whether this advantage of training everything jointly is really so important when we only have small amounts of parallel data but large amounts of monolingual data. And with data we know it is not only the amount that matters but also how similar it is to your test data; the monolingual data may be quite small but fit very well, and then it is still very helpful.

So it is not really surprising that, since we were successful in integrating a language model into a statistical translation system, maybe we can also integrate some type of language model into our NMT system in order to make it perform better.

The first thing we can do is: we know there are language models, so let us try to integrate them. Here it is an RNN language model, because this work was mainly done before transformer-based models. In general, of course, you can do the same thing with transformer-based models; there is nothing that prevents it. It is just that it was mainly done while people were still using RNNs.

So what we have here is the RNN-based NMT system; you remember the attention mechanism: you take the last decoder state, calculate the attention, get the context vector back, combine both, compute the next hidden state, and then predict the next word. This is our system, and the question is: can we integrate a language model into it?

It makes sense to take a neural language model, because we are in the neural space anyway; in contrast to statistical MT, where n-gram models were used, here a neural language model fits naturally. That would be something like an RNN-based neural language model: you have a target word, you put it in, you get a new hidden state, then you keep putting in words and getting new hidden states, and at the output you predict the next word.

If we have this type of language model, there are two main questions to answer. First, how do we combine our NMT system on the one hand and our language model on the other? As was mentioned before, when we started talking about encoder-decoder models, the decoder can itself be viewed as a language model. The one is an unconditioned language model; it is just modelling the target side.
The other one is a conditional language model, a language model conditioned on the source. So how can you combine two language models? Of course, the translation model will be more important, because it has access to the source.

If we have that, the other question is: now that we have the two models, how do we train them? We have two sets of data: parallel data, on which we can train the translation model, and monolingual data for the language model.

The first idea is a parallel combination: we just keep both running. Here you see your NMT system, which runs completely independently of your language model sitting next to it. The only thing that is shared is the target words, which of course are put into both systems. So both see the same input, and then we make our decision by merging the two outputs. For example, each model produces a probability distribution over the next word, and we take the average of the two probability distributions to do our prediction.

The question was whether you could also weight the outputs, to have more control over the mixture. Yes, you can do that too; that is more of a gating mechanism, where you learn how to weight the two. Another option would be to concatenate the hidden states and then have another layer on top.

Think about what happens if you do the concatenation of the hidden states instead of merging the probability distributions: you introduce many new parameters, and these parameters are somewhat special compared to the rest. Before, all the other parameters could be trained independently: the language model can be trained on its own and the translation model on its own. If you have a joint layer, you need to train them jointly, because it has inputs from both models.
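To make the output-level combination concrete, here is a minimal sketch, assuming both models expose a probability distribution over the same target vocabulary at each decoding step. The interpolation weight is my own illustrative knob, not a value from the lecture, and the log-linear variant mentioned in the comment is a commonly used alternative formulation, not something the lecture prescribes.

```python
# Minimal sketch of the parallel (output-level) combination: average the
# per-step distributions of the NMT model and the language model.
import numpy as np

def combine_step(p_nmt, p_lm, nmt_weight=0.5):
    """p_nmt, p_lm: probability distributions over the same target vocabulary.
    Returns their weighted average (simple linear interpolation).
    A log-linear alternative: exp(log p_nmt + w * log p_lm), renormalized.
    """
    p = nmt_weight * np.asarray(p_nmt) + (1.0 - nmt_weight) * np.asarray(p_lm)
    return p / p.sum()

# toy vocabulary of four symbols
print(combine_step([0.7, 0.1, 0.1, 0.1], [0.3, 0.5, 0.1, 0.1]))
```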
Not surprisingly, if there is a parallel combination, the other option is a serial combination. How can you do that? Since the final decision should be made by the NMT system, it makes sense to put it on top. So on top you have your normal NMT system; the only difference is what you feed into it: you are no longer inputting the word embeddings, but the hidden states of the language model.

So your lower layers are trained more in a pure language-model style, and on top sits the NMT system, which now already receives the language-model information. You can also view it like this: these are contextual embeddings, which no longer depend only on the word but also on the context on the target side. Take the word "can", for example: with normal embeddings it is always mapped to the same vector, whether it is used as in "a can of beans" or as in "I can do it". With the language-model layers underneath, this may already be disambiguated a bit, and you give that information directly to the NMT system.

If you think of the transformer-based approach instead, you have a stack of layers: the lower layers are purely language modelling, while the upper ones also attend to the source. So you can view it as lower layers that do not attend to the source, which is purely a language model, and at some point you start attending to the source and using it.

So this is how you can combine them: in parallel, or serially by first running the language model and then the NMT system. Questions about the integration?

There was a question about what exactly the input to the model is. The actual word is converted into a numerical vector, the one-hot encoding, and that is the input; whether you view the word embedding as part of the language model is a matter of definition. The steps are: you take the word, then the one-hot encoding, then the word embedding, then the RNN. You can see the embedding plus the RNN together as your language model, or draw the boundary before the embedding; which of these parts you call the language model is not that important.

So the next question is how you can train them and make this work. In the case where you combine the output probabilities, you can train the two models independently and just put them together. That might not be the best option, because we no longer have the guarantee we had before that they are optimized to perform well together.
It is not clear that they really work best together; at the very least you need to learn how much to trust the one model and how much the other. Still, it can be useful in some cases, for example if you have only very little data.

However, in MT we have one specific situation: the target side of the parallel data is itself monolingual data, so what we definitely can do is train the language model on that as well. And what we also can do is more like the pre-training approach: we first train the language model on the monolingual data, and then we train the joint system on the parallel data. The model sizes go one way, but the data sizes typically go the other way around: you often have a lot more monolingual data than parallel data. In which scenario can you imagine that this type of pre-training is a problem? Any ideas?

One example where this matters is domain adaptation. Let us say you want to translate medical sentences and your monolingual data is medical. In that case what will most probably happen is that you learn during pre-training what medical text looks like, but in your fine-tuning step the model forgets everything about the medical domain, so you may lose all the information you gained. This type of pre-training step is good if your pre-training data is general and very large and you then adapt to your task. But if the monolingual data is what you want to use to adapt the system to some topic or style, then this is not a good strategy, because the model may forget everything from the pre-training and you do not get the benefit.

So then you have to think about what you can do instead. You can freeze the pre-trained part so it is not changed any more and you do not lose the ability, or you can do a joint training, where you train the NMT system on the parallel data and the language model on the monolingual data in parallel, so that the model does not forget. How much you forget depends on the setup: if you pre-train on large, general data that gives you good general knowledge, you normally do not really lose it; if you use the monolingual data to adapt to something specific, you do.
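As a concrete illustration of the "freeze the pre-trained part" option, here is a minimal PyTorch sketch under my own assumptions: the module names (pretrained_lm, task_head) are placeholders, not components from the lecture.

```python
# Sketch: keep a pre-trained sub-module fixed while fine-tuning the rest.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pretrained_lm = nn.GRU(dim, dim, batch_first=True)  # imagine this was pre-trained
        self.task_head = nn.Linear(dim, vocab)                   # trained on the parallel data

    def forward(self, x):
        h, _ = self.pretrained_lm(self.embed(x))
        return self.task_head(h)

model = TinyModel()

# Freeze the pre-trained part so fine-tuning cannot overwrite it.
for p in model.pretrained_lm.parameters():
    p.requires_grad = False

# Only the remaining parameters are handed to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```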
So this is one way to make use of monolingual data. It seems to be the easiest one in some sense; it is the most similar to what we did in statistical machine translation.

However, there is another approach that normally beats this type of model, and which conceptually is even simpler, and also easier from the computational side. The idea is: we have monolingual data, we just translate it and thereby generate synthetic parallel data, and we use that for training. So if you want to build a German-to-English system, you first take a large amount of monolingual data and translate it. Then you have more parallel data, and you train on the joint set: on the original parallel data and on the artificial data where you have generated the translations yourself. This can work because the system does not make the same error every time, and it already has some knowledge.

With this first approach, however, there is one issue, a reason why it might not work best. Do you see it? It was hinted at in the picture: you are training on the output of your own system, and here that is the problem, because this English side is not really good. And if the system always mistranslates something, then you will learn that this mistranslation is correct, because now it is in the training data, and you will encourage the model to produce it even more often. So the problem with training on your own errors is: random errors you rarely make may average out, but errors you make systematically you reinforce.

So that might not be the best solution. Any idea how to do it better? There is a way, and it is actually even simpler. The problem is that the translations are not perfect, so the outputs are noisy and you learn something wrong. Normally it is less bad if your inputs are noisy but your outputs are correct: from a wrong input you may still learn to generate something correct, but you are not learning to generate incorrect output. So it is often more important that your target side is correct, and you can assume that in your application scenario you will hopefully only get correct inputs, so noisy training inputs harm you less.

In machine translation we have one very nice advantage: the reverse direction is a very similar task. There is a task to translate from German to English, and the task to translate from English to German is very similar. So we can simply switch the direction and generate the data the other way around. What we are doing is: we start with an English-to-German system.
Then we translate the English monolingual data into German, where the German side is maybe not very nice. And then we train our German-to-English system on the original parallel data and on the back-translated data. Here we have the advantage that our target side is human quality, and only the input is synthetic. This helps us to get really good systems.

There is one difference if you think about the data resources, and it is fairly obvious: here we need target-side monolingual data, while in the first approach we needed source-side monolingual data. So back translation normally works when you have target-side monolingual data, not source-side monolingual data. Intuitively this also makes sense: on the source side you only have to understand the content, while on the target side you have to generate real sentences, and it is somehow more difficult to generate something than to only understand it, so human-quality targets matter more.

This works well, but you have to decide how much back-translated data to use, because there is usually a lot more monolingual data available. Should you take all of it? There are two problems with that: it is expensive, because you have to translate all this data, and the balance with the real data matters. If you do not know better, a good starting point is to take an equal amount of back-translated data and real parallel data. It depends on the use case: if you have very little parallel data, it makes sense to use relatively more; it also depends on the quality of the reverse system, because the better it is, the better your synthetic data, and the more of it you can use. So it depends on a lot of things, but the rule of thumb that often works is roughly equal amounts.
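To make the data side of back translation concrete, here is a schematic sketch. Everything in it is a placeholder of mine: `reverse_model.translate` stands for whatever target-to-source system you have, the corpus variables are hypothetical, and the 1:1 default simply encodes the rule of thumb mentioned above.

```python
# Schematic assembly of a training set with back-translated data.
import random

def build_training_data(parallel, mono_target, reverse_model, ratio=1.0):
    """parallel: list of (src, tgt) pairs with human-quality targets.
    mono_target: target-language monolingual sentences.
    reverse_model: a target->source system used only to create synthetic sources.
    """
    n_synth = int(len(parallel) * ratio)
    sampled = random.sample(mono_target, min(n_synth, len(mono_target)))
    synthetic = [(reverse_model.translate(t), t) for t in sampled]  # noisy src, clean tgt
    data = parallel + synthetic
    random.shuffle(data)
    return data
```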
And you can, of course, take this further. I already said that quality matters: the result depends on the reverse system, because the better that system is, the better your synthetic data, and the better your final model. That leads to what is referred to as iterative back translation. You train an English-to-German system and translate monolingual data with it; then you train a German-to-English model with the additional data; then you translate German data with that model and retrain the first direction. In the second iteration the quality is better, because the system is no longer trained only on the small parallel data but additionally on back-translated data. And so you can keep improving. However, typically you can stop quite early: one iteration is usually good, and after two or three iterations the gains diminish, because you need a quite big improvement in quality in each round to keep gaining. One nice property of back translation is that it already works with a relatively weak reverse system.

There was a question whether you keep the old synthetic data or replace it in each iteration. That is a design decision; I would replace it, because the new translations have higher quality, but I assume it does not make too much of a difference.

That is about using monolingual data. Before we go into the pre-trained models, are there any more questions?

So the other thing we can do, which has recently become more and more successful, even more so since we have these really large language models that can even do the translation task themselves, is to use pre-trained models. You learn a representation on one task, and then you use this representation for another task. Maybe one of the first models where this was used at really large scale is something like BERT, which you pre-train purely on text and then fine-tune. One big advantage, of course, is that people cannot only share data but also pre-trained models. The recent models, the large language models, are available, and I think it costs several million to train one of them, just for the GPUs if you bought the compute from some cloud company. As a student project you will not have the budget to build these models. So the idea is: if these models are available, we can take them and use them as an additional resource, similar to pure text, and we can build models that learn not only from data but also from other models. It is a quite new way of thinking about training: we are not only learning from examples, we might also learn from other models. The nice thing is that this type of training, where we are not learning directly from data, works well.

So the main idea is: you have an initial pre-training task.
If you are working in NLP, that means you train on pure text, because that is where you have the largest amount of data. Then you define some type of task on this data in order to do your pre-training. The typical task is the language modelling task, predicting the next word, or a related task of predicting something in the middle, a masking task; we will see both, depending on the architecture. The point is that predicting something you have removed from the input is a task whose labels are easy to generate: you just need your text. That is why it is called self-supervised: you are creating your supervised training data yourself.

On the other hand, solving this task requires a lot of knowledge, and that is the other important property. There is this idea that the meaning of a word heavily depends on the context it appears in. I can give you a sentence with some gibberish word or an unknown name, and although you have never heard it, you will be able to assume quite a lot about it from the context. In exactly the same way, the models can learn something about words, and even about the world, just from text.

So that is typically the goal, and then we can use this model when training our MT system. Of course, we might need to adapt it: we may have to change the architecture, or use only some part of the pre-trained model. We have seen a bit of this already in the RNN case, where pre-training was also mentioned: you train the RNN language model on large amounts of data and then you put it somewhere into your system. This gives you the ability to build a system that contains knowledge learned purely from large amounts of text.

So the question is what type of information, what type of models, we can use, and today we want to look briefly at three things. The first, which was done initially, was not as prominent in machine translation as in other tasks, but it is also used there: static word embeddings, so just the first step we know, the mapping from the one-hot vector to a small continuous word representation. You can use this in your NMT system, for example by replacing or initializing the embedding layer with the pre-trained one. That is especially helpful if you have only a small amount of parallel data, because this part is then already trained on much more data and can be better.
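A small sketch of what "replacing the embedding layer" can look like in PyTorch. The pre-trained matrix here is a random stand-in of mine; in practice it would come from whichever toolkit produced the static embeddings, with rows ordered to match the NMT vocabulary, and the sizes are only illustrative.

```python
# Plug pre-trained static word embeddings into an NMT embedding layer.
import numpy as np
import torch
import torch.nn as nn

vocab_size, emb_dim = 32000, 300
pretrained = np.random.rand(vocab_size, emb_dim).astype("float32")  # stand-in for real vectors

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained),
    freeze=False,   # set True to keep the pre-trained vectors fixed during MT training
)
# `embedding` can now replace the randomly initialized embedding layer of the encoder.
```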
So should you pre-train or not? Does anybody see what might be the disadvantage of using something like this? What was mentioned today as a big advantage of these systems compared to previous ones? Right: one advantage was the end-to-end training, so that all parameters and all components are optimized to play together. If you pre-train something on one task, it may no longer fit optimally with everything else. So whether to pre-train or not depends on how important it is that everything is optimal together, versus how much you gain from the large data: if the pre-trained component is so much better because of the additional data, then it is helpful and outweighs losing a bit of the joint optimization.

There was a question whether random initialization would be better, since we optimize everything jointly anyway. The problem with a pre-trained starting point could be that you are already in some area of the parameter space from which it is not easy to get out. But usually the pre-trained model is not really worse; it only hurts if it pushes you in a direction that is not optimal for your task, and if you are not really gaining because you already have a decent amount of task data and the task is very different. Initially this was not done so much in machine translation, because there is more training data in MT than in many other tasks, but now, with really large amounts of monolingual data, we do some type of pre-training in essentially all state-of-the-art systems.

The other question is how much of the model you pre-train. The next option is contextual word embeddings: that is something like BERT or RoBERTa, where you already train a sequence model, and the embeddings you use are no longer specific to a word but also take the context into account. The embedding you use no longer depends only on the word itself but on the whole sentence. You can use similar things also for the decoder, with layers that do not have access to the source; these are typically GPT-style models. And finally, as we will see at the end, you can also have models that are already sequence-to-sequence, so you pre-train the whole encoder-decoder model; you have to make the task a bit challenging, but the idea is really that you pre-train your whole model and then fine-tune it.

But let us first take a step back and look at the different levels. The first one is the word embeddings: they are just this first layer, and you can train them with feed-forward neural networks, but you can also train them with an RNN language model, and by now you have hopefully also seen that you can train them with a transformer language model. You train them, for example, to predict the next word; that is the easiest task, and it is what is now referred to as self-supervised learning. All the big large language models, like ChatGPT and so on, are trained this way, and that is where you hopefully learn how a word is used, because you always try to predict the next word.
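Since the next-word objective keeps coming up, here is a minimal sketch of what "self-supervised" means in code: the targets are nothing but the input shifted by one position, so no annotation is needed. The model, the sizes and the random batch are purely illustrative assumptions.

```python
# Minimal next-word-prediction (language modelling) training step.
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
out = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (4, 20))        # a batch of token-id sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next word

h, _ = rnn(embed(inputs))
logits = out(h)                                  # (batch, seq-1, vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```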
Why do we first look at the word embeddings and their use for our task? The main advantage is that it may be only the first layer, but it is typically where most of the parameters are. If most of your parameters are already trained on the large data, then on your target data you have to train much less. The reason is the big difference in size: your vocabulary is orders of magnitude larger than your hidden dimension, so the embedding matrix, vocabulary size times embedding size, is much larger than the weight matrices inside the other layers. So most of your parameters sit in the embeddings, which means that even though replacing the word embeddings may look like a small part of your overall NMT architecture, it covers most of the parameters, and you can already get quite big gains by doing only that.

We have also seen that these word embeddings can be very useful for other purposes: you learn general relations between words by doing this type of language modelling task. Two things are needed: on the one hand you need a lot of data, and on the other hand the task needs to be useful. If you only had to predict the first letter of the next word, you would not learn anything about the meaning of words.

The interesting thing is that people have looked at these word embeddings, asked how they look, and visualized them by dimension reduction; I do not know whether any of you attend Advanced Artificial Intelligence, where this type of representation was covered yesterday. If you project the embeddings into a low-dimensional space with some dimension reduction, you see interesting things. For example, the relation between the male and female version of a word: the vector between the two is not always exactly the same, but it is clearly related. So you can do a bit of maths: you take "king", subtract the one vector, add the other, and you end up near the female counterpart. That means there really is information stored in these vectors. You can do something similar with verb forms: swimming and swam, walking and walked; again the vectors are not identical, but they are related, so something about going from one form to the other is learned. Or semantic relations: the relation between a country and its capital behaves in exactly the same way. People have even done analogy-style question answering with these embeddings. And you should not only trust the dimension-reduced picture, because the reduction may hide things; you can also look at what happens in the original space, for example at nearest neighbours. You can take the relation between France and Paris, add it to Italy, and you get Rome; you can do big and bigger, small and smaller, and so on. It does not work everywhere: there is, for example, the country-to-typical-dish relation, with the German example shown here, which does not always come out right. You can also ask what a famous person does: for Einstein you get scientist, while for famous midfielders it is not completely correct. You see the examples are a bit old; the politicians that come out are no longer in office, but of course the data is from back then.
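The embedding arithmetic can be checked with a few lines of vector maths. This sketch assumes a dictionary `vectors` mapping words to pre-trained embedding arrays, which is not provided here; cosine-similarity nearest-neighbour search is the usual way such analogies are evaluated.

```python
# Answer "a is to b as c is to ?" with word-vector arithmetic.
import numpy as np

def nearest(query_vec, vectors, exclude=()):
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy(a, b, c, vectors):
    """e.g. analogy('france', 'paris', 'italy', vectors) should land near 'rome'."""
    query = vectors[b] - vectors[a] + vectors[c]
    return nearest(query, vectors, exclude={a, b, c})
```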
What people did, especially at the beginning, was driven by the fact that training an RNN language model was very expensive. One famous observation was: we are not really interested in the language model performance itself. That is something good to keep in mind: what are we really interested in? We do not really need the RNN; in this case we are only interested in the mapping from words to vectors. Very successful in this direction was word2vec. The idea is: we are not training a real language model, we make it even simpler, for example with the continuous bag of words (CBOW): you have, say, four input tokens, the surrounding words, and you predict the word in the middle, and the model is essentially just two linear layers. So it simplifies things further and makes the computation faster, because the embedding is what we are interested in. Along with the continuous bag of words there is the continuous skip-gram model, the other model referred to as word2vec: there you have one input word and, the other way around, you predict the four words around it. In the end the task is very similar.
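A minimal sketch of the continuous-bag-of-words idea as just described: average the embeddings of the surrounding words and predict the middle word with a single linear output layer. All sizes and the random batch are illustrative.

```python
# Minimal CBOW: context words in, centre word out; the embedding matrix is what we keep.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # the product we actually want
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context):                # context: (batch, 2*window) token ids
        ctx = self.embed(context).mean(dim=1)  # average the context embeddings
        return self.out(ctx)                   # scores for the middle word

model = CBOW()
context = torch.randint(0, 10000, (8, 4))      # e.g. two words left, two words right
loss = nn.CrossEntropyLoss()(model(context), torch.randint(0, 10000, (8,)))
loss.backward()
```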
Before we go to the next point, are there any questions about word vectors or word embeddings?

The next thing is contextual word embeddings. The idea is: static embeddings are helpful, but we might be able to get even more out of the monolingual data. Take the word "can": in the static embedding there is an overlap of its two meanings, so the vector represents both the container and the modal verb in "can do it". But we might be able to disambiguate this already in the pre-trained model, because the two meanings are used in different contexts. So if we can have a model that does not only represent a word but represents the meaning of the word within its context, then we get contextual word embeddings: we really have a representation of the word in the sentence.

And we already have a very good architecture for that: the RNN. The hidden state represents what has been said so far, with a focus on the last word, so it is a kind of representation of the word in its context. The first widely used model doing that is the ELMo paper. They take the normal language model setup: given words one to three, predict word four, and so on, so you are always predicting the next word. The architecture is the word embedding layer and then the RNN layers. And now, instead of using only the final output, you use the hidden state at each position: it represents the meaning of that word, mainly in the context of what has been seen before. You train it in language-model style, always predicting the next word, but more information is now stored in the representation, and therefore the downstream system has to learn less on top of it.

And this is essentially what is done currently in GPT: the only differences are that there are more layers, a bigger size, and transformer self-attention instead of the RNN. But that is how the large language models are trained.

However, if you look at this contextual representation, it might not be perfect. If you think of this hidden state as the contextual representation of the third word, what is the issue?
It represents word three in the context of the sentence, but only in the context of the previous words. However, we have an architecture that can also take both sides into account, and we have used it already in the encoder: we can easily run the RNN also in the backward direction, just by processing the states the other way around, and then combine the forward and the backward state into a joint one and make the prediction from that. So for each position you have the word embedding, then two hidden states, one from the forward RNN and one from the backward RNN, and you can, for example, take the concatenation of the two. This state now mainly represents this word, because both RNNs saw it last, and we know the RNN focuses on what happened last.

However, there is a problem when you try to train this as a language model. Maybe with the masking again? That is one solution; but first, why can we not do it directly? The reason is information leakage: you cannot just predict the next word. If we predict the next word with this type of model, it is a trivial task, because the backward state has already seen the next word and directly influences this hidden state. So predicting it is not a good training signal: what will happen is that the system simply ignores the forward states and learns to copy the information about the next word from the backward state into the output. It would get a nearly perfect language model, because it only needs to find an encoding with which it can pass every word through this upper hidden state, but the representations would be useless; the only thing it learns is how to encode the next word in that hidden state. Therefore it is not really useful, and we need a somewhat different way out.

One is masking, and I will come to that shortly, but other things have also been done. The other approach, used in the ELMo paper, is not to combine them directly: you have the forward RNN and the backward RNN and you keep them completely separate, so during language-model training you never merge the states. At the end, the representation of a word is built from the forward hidden state and the backward hidden state next to that word, and these two you join into the representation. Then you have a representation of the word that takes the whole sentence into account, but there is no information leakage, because each direction was trained on its own. You can do that in all layers: you run the forward layers and the backward layers separately and only join the hidden states for the downstream task. However, it is a bit complicated, because you have to keep both separate and merge things at the right point. So what else can you do?
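A sketch of the "keep the two directions separate" idea: two independent GRUs read the sentence in opposite directions, each could be trained as its own next-word predictor, and only the downstream representation concatenates their states. This is a simplified illustration of the principle, not the exact ELMo architecture, and all names and sizes are my own.

```python
# Two separate directional LMs; states are only merged for the representation.
import torch
import torch.nn as nn

class TwoWayLM(nn.Module):
    def __init__(self, vocab=10000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # trained to predict the next word
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # trained to predict the previous word

    def representations(self, tokens):                  # tokens: (batch, seq_len)
        e = self.embed(tokens)
        h_fwd, _ = self.fwd(e)                           # left context of each position
        h_bwd_rev, _ = self.bwd(torch.flip(e, dims=[1])) # read the sentence right-to-left
        h_bwd = torch.flip(h_bwd_rev, dims=[1])          # re-align with the original order
        return torch.cat([h_fwd, h_bwd], dim=-1)         # contextual embedding per token
```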
And that is the moment of the big success of the BERT model. The observation was: in the bidirectional case it is not good to do next-word prediction, but we can do masking. Masking mainly means we predict something in the middle, some of the words: we take the input, put noise into it by removing some words, and then the model has to predict exactly those. Now there can be no information leakage, because the word to be predicted is no longer in the input, so it is a real challenge. And it makes no assumption about the model: it does not need to be a forward model or a backward model or anything; you can always predict the masked word.

There is one bit of a disadvantage, though. Do you see what could be a problem? You can of course mask more, but to see it clearly, first assume you only mask one word. Then for the whole sentence you get one feedback signal: what is word three? So you have one training example per sentence. If you do the language modelling task, you predict at every position, so for each token you get a feedback signal. In this sense masking is less efficient, because you get fewer feedback signals per sentence.

So in BERT the main ideas are that you train this bidirectional model with masking and that it uses the transformer architecture. There are two more, somewhat smaller, additions. One is that next-sentence prediction is used as an additional task: the idea is that you want the model to learn more about what language is, to really understand how sentences follow each other in a story, not just treat them as independent token sequences.

The input uses subword units, as we do in MT, and there are special tokens: one token is there for the next-sentence prediction, and more generally for classification tasks, so that the model learns a general representation of the full sentence at that position. And there are segment embeddings, so you have an embedding indicating this is the first sentence and this is the second.

Now, what is more challenging is the masking itself: what do you mask? There has been follow-up work on this, for example RoBERTa. It is not super sensitive, but if you do it completely wrong, the model does not learn anything. One question is: should you always mask full words, or, if a word is split into subwords, mask only a single subword and predict it based on the others? That is of course a somewhat different, easier task, because if you know three parts of a word it may be easy to guess the last. BERT took the easiest selection: it does not consider words at all, because the subword splitting is done in preprocessing, and it simply masks subword tokens; I think in other groups it is done differently and they always mask the full word, but it is not a huge difference. And then, what do you do with a position selected for masking? In eighty percent of the cases they replace the word with the special mask token, in ten percent they put in some random other token, and in ten percent they keep it unchanged.
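The masking recipe just described is easy to write down. Here is a small sketch over lists of token ids; the MASK_ID value, the ignore label and the 15% selection rate are the usual choices and my assumptions, while the 80/10/10 split is the one from the lecture.

```python
# BERT-style masking of an input sequence for masked-word prediction.
import random

MASK_ID = 0
IGNORE = -100   # label value meaning "no prediction at this position"

def mask_tokens(token_ids, vocab_size, mask_rate=0.15):
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_rate:
            continue
        labels[i] = tok                               # the model must recover this token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: random other token
        # remaining 10%: keep the token unchanged
    return inputs, labels
```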
1:07:24.684 --> 1:07:29.099 Now, what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already had the question of how much you should mask. 1:07:35.275 --> 1:07:42.836 So there has been quite some work on that afterwards, for example RoBERTa. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not learning anything. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:04.594 Should I always mask the full word, or, if I have a subword, 1:08:04.594 --> 1:08:10.630 mask only that subword and predict it based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the word, it might be easier to guess the last one; here they 1:08:21.210 --> 1:08:27.594 took the easiest solution, so not considering full words at all anymore, because you're doing that 1:08:27.594 --> 1:08:32.280 in the preprocessing and just always treating subwords like words. 1:08:32.672 --> 1:08:36.089 I think in some later work this is done differently. 1:08:36.089 --> 1:08:40.401 They always mask the full words, but I guess it's not that critical. 1:08:41.001 --> 1:08:46.044 And then, what to do with the masked word: in eighty percent of the cases, 1:08:46.044 --> 1:08:50.803 if the word is masked, they replace it with a special token, 1:08:50.803 --> 1:08:57.197 this is the mask token; in ten percent they put in some random other token, and in ten 1:08:57.197 --> 1:08:59.470 percent they keep it unchanged.
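This eighty/ten/ten replacement rule can be written down in a few lines; the helper below is only a sketch with a toy vocabulary, but it mirrors the three cases just described.

```python
import random

def corrupt(token, vocab):
    """For a position selected for prediction: 80% -> [MASK],
    10% -> some random other token, 10% -> keep the original token."""
    r = random.random()
    if r < 0.8:
        return "[MASK]"
    if r < 0.9:
        return random.choice(vocab)
    return token

print(corrupt("store", vocab=["the", "man", "went", "to", "milk"]))
```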
1:09:02.202 --> 1:09:10.846 And then what you can also do is this next sentence prediction. 1:09:10.846 --> 1:09:14.880 The man went to the store. 1:09:14.880 --> 1:09:17.761 He bought a gallon of milk. 1:09:18.418 --> 1:09:24.088 So maybe you see you're joining them: you're doing both the masking and this sentence prediction. 1:09:24.564 --> 1:09:29.449 Versus something like: penguins are flightless birds. 1:09:29.449 --> 1:09:41.390 These two sentences have nothing to do with each other, so you can also do this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:57.043 And then the whole BERT model: so here you have the input, here the transformer layers, 1:09:57.043 --> 1:09:58.170 and then the output. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general NLP applications. 1:10:17.937 --> 1:10:27.644 However, there is like a huge family of different types of models coming from it. 1:10:27.827 --> 1:10:38.709 So based on this, a whole set of self-supervised models came out of there, and now 1:10:38.709 --> 1:10:42.086 this is getting even more important 1:10:42.082 --> 1:10:46.640 with the availability of large language models and their success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it changed a bit 1:10:50.910 --> 1:10:57.847 again, from more the bidirectional models to unidirectional models, which 1:10:57.847 --> 1:11:02.710 are at the moment maybe a bit more common; we're coming to them now. 1:11:02.710 --> 1:11:09.168 Do you see one advantage? What is, besides the efficiency, 1:11:09.509 --> 1:11:15.901 one other reason why you are sometimes more interested in unidirectional models than 1:11:15.901 --> 1:11:17.150 in bidirectional ones? 1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional one is not really 1:11:30.220 --> 1:11:30.872 usable. 1:11:32.192 --> 1:11:40.924 It doesn't work: if you want to do generation, like in the decoder, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 perhaps if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models. 1:12:12.761 --> 1:12:16.161 That's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 The second type is the decoder-based models, so the autoregressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next word prediction. 1:12:43.403 --> 1:12:52.439 And there you can also have a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model, 1:12:54.354 --> 1:13:05.039 because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 And that is somehow what the prefix language model does: 1:13:12.240 --> 1:13:19.076 on the first tokens you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge the two, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 in an RNN that would not work: 1:13:33.039 --> 1:13:34.836 there we would need an extra backward RNN. 1:13:34.975 --> 1:13:38.533 In the transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that in the encoder and decoder the number of parameters is different, because 1:13:44.918 --> 1:13:50.235 you do cross attention, but if you do forward and backward, or unidirectional and bidirectional, 1:13:50.650 --> 1:13:58.736 it's only that you mask your attention to only look at the past or to also look into the 1:13:58.736 --> 1:13:59.471 future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is a unidirectional or causal one where you can only look at the past, and there is the prefix one where you can, for example, attend 1:14:23.516 --> 1:14:25.649 bidirectionally to the first three words.
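These variants differ only in the attention mask; here is a minimal sketch using PyTorch (the function name and the prefix_len parameter are my own illustrative choices). With prefix_len=3, the first three positions can attend to each other in both directions, and everything after them stays causal, matching the example above.

```python
import torch

def attention_mask(seq_len, kind="causal", prefix_len=0):
    """mask[i, j] == True means position i may attend to position j.
    'full'   : bidirectional (encoder-style), attend to everything
    'causal' : unidirectional language model, attend to the past only
    'prefix' : bidirectional over the first prefix_len tokens, causal after."""
    if kind == "full":
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal part
    if kind == "prefix":
        mask[:, :prefix_len] = True  # every position may look at the prefix
    return mask

print(attention_mask(5, kind="prefix", prefix_len=3).int())
```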
1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, then, of course, you can also do the other combination. 1:14:43.163 --> 1:14:50.623 So the idea is: we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same input to both, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking here. 1:15:12.912 --> 1:15:17.696 Here we don't need to do that masking, but here we need the masking so that it doesn't see the future. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that is the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train your model. 1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in the source sentence, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the sentence in twice, and so that it's not a trivial task, 1:16:01.382 --> 1:16:02.432 we can change it. 1:16:03.003 --> 1:16:12.777 And they do different corruption techniques there, so you can, for example, also delete tokens. 1:16:13.233 --> 1:16:19.692 That you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict it anywhere. 1:16:20.970 --> 1:16:26.353 So in the encoder-only model, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in it. 1:16:30.110 --> 1:16:38.268 Here, since the decoder side is unidirectional generation, we can also delete a token and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation. 1:16:45.250 --> 1:16:54.285 We can do document rotation and text infilling, so there is quite a bit you can do. 1:16:55.615 --> 1:17:06.568 So you see, there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, there is again, like for the language model, 1:17:14.985 --> 1:17:21.079 the other question: how do you integrate it? 1:17:21.761 --> 1:17:26.636 And there are also, like, yeah, quite a few different techniques. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is: you take your word embeddings or your pre-trained model, 1:17:39.068 --> 1:17:47.971 you freeze them, stack your decoder layers on top, and keep only these trainable. 1:17:48.748 --> 1:17:54.495 That can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is you freeze, for example, your word embeddings or some other parts, and train the rest. 1:18:05.865 --> 1:18:17.296 The other thing is you just initialize with them, so you initialize your model with the pre-trained weights, but you train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART for translation, you want to have the source language in the encoder and the target 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 The one you get is trained on English, so what you can do there is try to do some adaptation 1:18:46.366 --> 1:18:52.562 of the BART in order to learn some language-specific stuff, or there's a multilingual BART, 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but only on the monolingual data of each language, 1:18:58.823 --> 1:19:03.388 so it may be trained on German and on English, but not on German-English. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to better 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It was only trained on the same language, but it mainly only has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Now, a certain thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART 1:19:29.709 --> 1:19:35.218 and you put some adapters inside the network, so there are small new layers 1:19:35.218 --> 1:19:40.790 which are put in between, and then you only train these adapters, or you also train 1:19:40.790 --> 1:19:41.815 these adapters. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could see that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that. 1:19:59.793 --> 1:20:05.225 These are quite small and so very efficient.
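A typical adapter is just a small bottleneck layer with a residual connection, inserted after a frozen pre-trained sub-layer. The sketch below is only illustrative; the sizes d_model=512 and bottleneck=64 are assumptions, not values from the lecture.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck with a residual connection; only these few parameters
    are trained while the surrounding pre-trained layers stay frozen."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual: with small adapter weights, the pre-trained behaviour
        # is almost unchanged at the start of training.
        return x + self.up(self.act(self.down(x)))
```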
1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language. 1:20:22.247 --> 1:20:33.777 The model then has both: language-specific parts and, at the same time, the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one catch in general in the multilingual systems. 1:20:39.058 --> 1:20:40.439 It works quite well, 1:20:40.439 --> 1:20:46.161 but there's one case or one specific use case for multilingual systems where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Because having here some components which might be very language specific hurts: in zero-shot, 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language independent, and 1:21:09.015 --> 1:21:10.184 with the adapters 1:21:10.184 --> 1:21:15.601 you of course get representations again which are more language specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And then there is also the idea of doing more knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is okay: 1:21:42.923 --> 1:21:54.157 we are training it the same way, but what we want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare, for example, the first hidden state with that of the pre-trained model and try to make them similar. 1:22:12.192 --> 1:22:18.144 For example, by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This needs the same vocabulary. Why does it need the same vocabulary, any idea? 1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, typically you also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states here and four states there, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare to which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So therefore, at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like saying, yeah, the average should be the same, but of course that's 1:23:30.030 --> 1:23:33.071 a less strong signal. 1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a direct training signal on the encoder, 1:23:42.979 --> 1:23:51.455 so you can directly push the encoder towards already giving a good representation, while normally 1:23:51.455 --> 1:23:52.407 in an NMT system it only gets an indirect signal.
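This encoder distillation boils down to an auxiliary loss between the hidden states of the NMT encoder being trained and those of a frozen pre-trained model; below is a minimal sketch in PyTorch, assuming both models use the same vocabulary so the two sequences have the same length and positions align one to one. Such a term would simply be added to the normal translation loss.

```python
import torch.nn.functional as F

def encoder_distillation_loss(student_states, teacher_states):
    """L2/MSE loss pulling the trainable encoder's hidden states towards the
    frozen pre-trained encoder's states, position by position.
    Both tensors have shape (batch, seq_len, hidden); identical lengths are
    only guaranteed if the same vocabulary/tokenization is used."""
    assert student_states.shape == teacher_states.shape
    return F.mse_loss(student_states, teacher_states.detach())
```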
1:23:56.936 --> 1:24:13.197 Yes, I think this is mostly it for today, so what you should keep in mind: 1:24:13.393 --> 1:24:18.400 the one thing is the back-translation idea, 1:24:18.400 --> 1:24:29.561 if you have monolingual data and want to use that; the other one is to use pre-trained models. And generally it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained models, but then you can even still do back translation, where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 There we have the advantage that we are training everything working together on the task, so 1:24:45.502 --> 1:24:51.093 it might be helpful even to back-translate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup, because in pre-training, of course, the big challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on a different task. 1:24:58.058 --> 1:25:03.327 There are different ways of how you integrate this knowledge, 1:25:03.327 --> 1:25:08.089 even if you just use the full model; so in this case, 1:25:08.748 --> 1:25:11.128 this is the most similar you can get: 1:25:11.128 --> 1:25:13.945 you're doing no changes to the architecture. 1:25:13.945 --> 1:25:19.643 You're really taking the model and just fine-tuning it on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to learn completely from scratch how to do the attention and so on. 1:25:24.464 --> 1:25:29.971 And there it might, for example, be helpful to have more back-translated data to learn that. 1:25:32.192 --> 1:25:34.251 That's it for today. 1:25:34.251 --> 1:25:44.661 There's one important thing: next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail, if you're in ILIAS, that there's a room change for Tuesday 1:25:56.769 --> 1:25:57.426 and where it is. 1:25:57.637 --> 1:26:03.890 Are there more questions? Yeah, I have a more general question: in computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your data set with data augmentation. 1:26:07.347 --> 1:26:08.295 Is there anything 1:26:08.388 --> 1:26:15.301 similar to enlarge the data for text? 1:26:15.755 --> 1:26:29.176 Yes, you can use this back translation and also masking, but back translation is some 1:26:29.176 --> 1:26:31.228 way of data augmentation. 1:26:31.371 --> 1:26:35.629 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have a good MT system, it can also be used for parallel data. 1:26:54.834 --> 1:26:59.139 So I would say this is the most similar one. 1:26:59.139 --> 1:27:03.143 There are also ways you can do paraphrasing. 1:27:05.025 --> 1:27:12.057 But, for example, it is very hard to do this by rules, like which words to replace, because 1:27:12.057 --> 1:27:18.936 there is no rule like: this word can always be replaced by that one. 1:27:19.139 --> 1:27:27.225 I mean, there aren't many perfect synonyms; normally they are good in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do it rule-based, you have to train your own model for that. 1:27:38.058 --> 1:27:57.236 Does it have to be the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension 1:28:00.000 --> 1:28:01.590 and architecture. 1:28:01.590 --> 1:28:05.452 We will later see, in the efficiency lecture, that 1:28:05.452 --> 1:28:12.948 you can also do knowledge distillation with, for example, a smaller model. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible, but yeah, I agree it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yeah, to your question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 Good, then thanks a lot, and we'll see each other again on Tuesday.
0:08:00.380 --> 0:08:05.271 Another way to do it is unsupervised and the extreme case. 0:08:05.271 --> 0:08:11.158 If you have a scenario then you only have data, only monolingual data. 0:08:11.158 --> 0:08:13.976 Can you still build translations? 0:08:14.754 --> 0:08:27.675 If you have large amounts of data and languages are not too dissimilar, you can build translation 0:08:27.675 --> 0:08:31.102 systems without parallel. 0:08:32.512 --> 0:08:36.267 That we will see you then next Thursday. 0:08:37.857 --> 0:08:50.512 And then there is now a third type of pre-trained model that recently became very successful 0:08:50.512 --> 0:08:55.411 and now with large language models. 0:08:55.715 --> 0:09:03.525 So the idea is we are no longer sharing the real data, but it can also help to train a 0:09:03.525 --> 0:09:04.153 model. 0:09:04.364 --> 0:09:11.594 And that is now a big advantage of deep learning based approaches. 0:09:11.594 --> 0:09:22.169 There you have this ability that you can train a model in some task and then apply it to another. 0:09:22.722 --> 0:09:33.405 And then, of course, the question is, can I have an initial task where there's huge amounts 0:09:33.405 --> 0:09:34.450 of data? 0:09:34.714 --> 0:09:40.251 And the test that typically you pre train on is more like similar to a language moral 0:09:40.251 --> 0:09:45.852 task either direct to a language moral task or like a masking task which is related so 0:09:45.852 --> 0:09:51.582 the idea is oh I can train on this data and the knowledge about words how they relate to 0:09:51.582 --> 0:09:53.577 each other I can use in there. 0:09:53.753 --> 0:10:00.276 So it's a different way of using language models. 0:10:00.276 --> 0:10:06.276 There's more transfer learning at the end of. 0:10:09.029 --> 0:10:17.496 So first we will start with how can we use monolingual data to do a Yeah to do a machine 0:10:17.496 --> 0:10:18.733 translation? 0:10:20.040 --> 0:10:27.499 That: Big difference is you should remember from what I mentioned before is. 0:10:27.499 --> 0:10:32.783 In statistical machine translation we directly have the opportunity. 0:10:32.783 --> 0:10:39.676 There's peril data for the translation model and monolingual data for the language model. 0:10:39.679 --> 0:10:45.343 And you combine your translation model and language model, and then you can make use of 0:10:45.343 --> 0:10:45.730 both. 0:10:46.726 --> 0:10:53.183 That you can make use of these large large amounts of monolingual data, but of course 0:10:53.183 --> 0:10:55.510 it has also some disadvantage. 0:10:55.495 --> 0:11:01.156 Because we say the problem is we are optimizing both parts a bit independently to each other 0:11:01.156 --> 0:11:06.757 and we say oh yeah the big disadvantage of newer machine translations now we are optimizing 0:11:06.757 --> 0:11:10.531 the overall architecture everything together to perform best. 0:11:10.890 --> 0:11:16.994 And then, of course, we can't do there, so Leo we can can only do a mural like use power 0:11:16.994 --> 0:11:17.405 data. 0:11:17.897 --> 0:11:28.714 So the question is, but this advantage is not so important that we can train everything, 0:11:28.714 --> 0:11:35.276 but we have a moral legal data or even small amounts. 
0:11:35.675 --> 0:11:43.102 So in data we know it's not only important the amount of data we have but also like how 0:11:43.102 --> 0:11:50.529 similar it is to your test data so it can be that this modeling data is quite small but 0:11:50.529 --> 0:11:55.339 it's very well fitting and then it's still very helpful. 0:11:55.675 --> 0:12:02.691 At the first year of surprisingness, if we are here successful with integrating a language 0:12:02.691 --> 0:12:09.631 model into a translation system, maybe we can also integrate some type of language models 0:12:09.631 --> 0:12:14.411 into our empty system in order to make it better and perform. 0:12:16.536 --> 0:12:23.298 The first thing we can do is we know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:31.096 There was our language model because these works were mainly done before transformer-based 0:12:31.096 --> 0:12:31.753 models. 0:12:32.152 --> 0:12:38.764 In general, of course, you can do the same thing with transformer baseball. 0:12:38.764 --> 0:12:50.929 There is nothing about whether: It's just that it has mainly been done before people 0:12:50.929 --> 0:13:01.875 started using R&S and they tried to do this more in cases. 0:13:07.087 --> 0:13:22.938 So what we're happening here is in some of this type of idea, and in key system you remember 0:13:22.938 --> 0:13:25.495 the attention. 0:13:25.605 --> 0:13:29.465 Gets it was your last in this day that you calculate easy attention. 0:13:29.729 --> 0:13:36.610 We get the context back, then combine both and then base the next in state and then predict. 0:13:37.057 --> 0:13:42.424 So this is our system, and the question is, can we send our integrated language model? 0:13:42.782 --> 0:13:49.890 And somehow it makes sense to take out a neural language model because we are anyway in the 0:13:49.890 --> 0:13:50.971 neural space. 0:13:50.971 --> 0:13:58.465 It's not surprising that it contrasts to statistical work used and grants it might make sense to 0:13:58.465 --> 0:14:01.478 take a bit of a normal language model. 0:14:01.621 --> 0:14:06.437 And there would be something like on Tubbles Air, a neural language model, and our man based 0:14:06.437 --> 0:14:11.149 is you have a target word, you put it in, you get a new benchmark, and then you always put 0:14:11.149 --> 0:14:15.757 in the words and get new hidden states, and you can do some predictions at the output to 0:14:15.757 --> 0:14:16.948 predict the next word. 0:14:17.597 --> 0:14:26.977 So if we're having this type of in language model, there's like two main questions we have 0:14:26.977 --> 0:14:34.769 to answer: So how do we combine now on the one hand our system and on the other hand our 0:14:34.769 --> 0:14:35.358 model? 0:14:35.358 --> 0:14:42.004 You see that was mentioned before when we started talking about ENCODA models. 0:14:42.004 --> 0:14:45.369 They can be viewed as a language model. 0:14:45.805 --> 0:14:47.710 The wine is lengthened, unconditioned. 0:14:47.710 --> 0:14:49.518 It's just modeling the target sides. 0:14:49.970 --> 0:14:56.963 And the other one is a conditional language one, which is a language one conditioned on 0:14:56.963 --> 0:14:57.837 the Sewer. 0:14:58.238 --> 0:15:03.694 So how can you combine to language models? 0:15:03.694 --> 0:15:14.860 Of course, it's like the translation model will be more important because it has access 0:15:14.860 --> 0:15:16.763 to the source. 0:15:18.778 --> 0:15:22.571 If we have that, the other question is okay. 
0:15:22.571 --> 0:15:24.257 Now we have models. 0:15:24.257 --> 0:15:25.689 How do we train? 0:15:26.026 --> 0:15:30.005 Pickers integrated them. 0:15:30.005 --> 0:15:34.781 We have now two sets of data. 0:15:34.781 --> 0:15:42.741 We have parallel data where you can do the lower. 0:15:44.644 --> 0:15:53.293 So the first idea is we can do something more like a parallel combination. 0:15:53.293 --> 0:15:55.831 We just keep running. 0:15:56.036 --> 0:15:59.864 So here you see your system that is running. 0:16:00.200 --> 0:16:09.649 It's normally completely independent of your language model, which is up there, so down 0:16:09.649 --> 0:16:13.300 here we have just our NMT system. 0:16:13.313 --> 0:16:26.470 The only thing which is used is we have the words, and of course they are put into both 0:16:26.470 --> 0:16:30.059 systems, and out there. 0:16:30.050 --> 0:16:42.221 So we use them somehow for both, and then we are doing our decision just by merging these 0:16:42.221 --> 0:16:42.897 two. 0:16:43.343 --> 0:16:53.956 So there can be, for example, we are doing a probability distribution here, and then we 0:16:53.956 --> 0:17:03.363 are taking the average of post-perability distribution to do our predictions. 0:17:11.871 --> 0:17:18.923 You could also take the output with Steve's to be more in chore about the mixture. 0:17:20.000 --> 0:17:32.896 Yes, you could also do that, so it's more like engaging mechanisms that you're not doing. 0:17:32.993 --> 0:17:41.110 Another one would be cochtrinate the hidden states, and then you would have another layer 0:17:41.110 --> 0:17:41.831 on top. 0:17:43.303 --> 0:17:56.889 You think about if you do the conqueredination instead of taking the instead and then merging 0:17:56.889 --> 0:18:01.225 the probability distribution. 0:18:03.143 --> 0:18:16.610 Introduce many new parameters, and these parameters have somehow something special compared to 0:18:16.610 --> 0:18:17.318 the. 0:18:23.603 --> 0:18:37.651 So before all the error other parameters can be trained independent, the language model 0:18:37.651 --> 0:18:42.121 can be trained independent. 0:18:43.043 --> 0:18:51.749 If you have a joint layer, of course you need to train them because you have now inputs. 0:18:54.794 --> 0:19:02.594 Not surprisingly, if you have a parallel combination of whether you could, the other way is to do 0:19:02.594 --> 0:19:04.664 more serial combinations. 0:19:04.924 --> 0:19:10.101 How can you do a similar combination? 0:19:10.101 --> 0:19:18.274 Your final decision makes sense to do a face on the system. 0:19:18.438 --> 0:19:20.996 So you have on top of your normal and system. 0:19:21.121 --> 0:19:30.678 The only thing is now you're inputting into your system. 0:19:30.678 --> 0:19:38.726 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:45.588 So you're training your mainly what you have your lower layers here which are trained more 0:19:45.588 --> 0:19:52.183 on the purely language model style and then on top your putting into the NMT system where 0:19:52.183 --> 0:19:55.408 it now has already here the language model. 0:19:55.815 --> 0:19:58.482 So here you can also view it. 0:19:58.482 --> 0:20:06.481 Here you have more contextual embeddings which no longer depend only on the word but they 0:20:06.481 --> 0:20:10.659 also depend on the context of the target site. 0:20:11.051 --> 0:20:19.941 But you have more understanding of the source word, so you have a language in the current 0:20:19.941 --> 0:20:21.620 target sentence. 
0:20:21.881 --> 0:20:27.657 So if it's like the word can, for example, will be put in here always the same independent 0:20:27.657 --> 0:20:31.147 of its user can of beans, or if it's like I can do it. 0:20:31.147 --> 0:20:37.049 However, because you are having your language model style, you have maybe disintegrated this 0:20:37.049 --> 0:20:40.984 already a bit, and you give this information directly to the. 0:20:41.701 --> 0:20:43.095 An empty cyst. 0:20:44.364 --> 0:20:49.850 You, if you're remembering more the transformer based approach, you have some layers. 0:20:49.850 --> 0:20:55.783 The lower layers are purely languaged while the other ones are with attention to the source. 0:20:55.783 --> 0:21:01.525 So you can view it also that you just have lower layers which don't attend to the source. 0:21:02.202 --> 0:21:07.227 This is purely a language model, and then at some point you're starting to attend to 0:21:07.227 --> 0:21:08.587 the source and use it. 0:21:13.493 --> 0:21:20.781 Yes, so this is how you combine them in peril or first do the language model and then do. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.102 Model, and in this case in the sequence. 0:21:38.278 --> 0:21:54.854 Case so the actual word that we transferred into a numerical lecture, and this is an input. 0:21:56.176 --> 0:22:03.568 That depends on if you view the word embedding as part of the language model. 0:22:03.568 --> 0:22:10.865 So if you first put the word target word then you do the one hot end coding. 0:22:11.691 --> 0:22:13.805 And then the word embedding there is the r& 0:22:13.805 --> 0:22:13.937 n. 0:22:14.314 --> 0:22:21.035 So you can use this together as your language model when you first do the word embedding. 0:22:21.401 --> 0:22:24.346 All you can say is like before. 0:22:24.346 --> 0:22:28.212 It's more a definition, but you're right. 0:22:28.212 --> 0:22:30.513 So what's the steps out? 0:22:30.513 --> 0:22:36.128 You take the word, the one hut encoding, the word embedding. 0:22:36.516 --> 0:22:46.214 What one of these parrots, you know, called a language model is definition wise and not 0:22:46.214 --> 0:22:47.978 that important. 0:22:53.933 --> 0:23:02.264 So the question is how can you then train them and make this this one work? 0:23:02.264 --> 0:23:02.812 The. 0:23:03.363 --> 0:23:15.201 So in the case where you combine the language one of the abilities you can train them independently 0:23:15.201 --> 0:23:18.516 and just put them together. 0:23:18.918 --> 0:23:27.368 Might not be the best because we have no longer the stability that we had before that optimally 0:23:27.368 --> 0:23:29.128 performed together. 0:23:29.128 --> 0:23:33.881 It's not clear if they really work the best together. 0:23:34.514 --> 0:23:41.585 At least you need to somehow find how much do you trust the one model and how much. 0:23:43.323 --> 0:23:45.058 Still in some cases useful. 0:23:45.058 --> 0:23:48.530 It might be helpful if you have only data and software. 0:23:48.928 --> 0:23:59.064 However, in MT we have one specific situation that at least for the MT part parallel is also 0:23:59.064 --> 0:24:07.456 always monolingual data, so what we definitely can do is train the language. 0:24:08.588 --> 0:24:18.886 So what we also can do is more like the pre-training approach. 0:24:18.886 --> 0:24:24.607 We first train the language model. 0:24:24.704 --> 0:24:27.334 The pre-training approach. 
0:24:27.334 --> 0:24:33.470 You first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:41.143 Of course, the model size is this way, but the data size is too bigly the other way around. 0:24:41.143 --> 0:24:47.883 You often have a lot more monolingual data than you have here parallel data, in which 0:24:47.883 --> 0:24:52.350 scenario can you imagine where this type of pretraining? 0:24:56.536 --> 0:24:57.901 Any Ideas. 0:25:04.064 --> 0:25:12.772 One example where this might also be helpful if you want to adapt to domains. 0:25:12.772 --> 0:25:22.373 So let's say you do medical sentences and if you want to translate medical sentences. 0:25:23.083 --> 0:25:26.706 In this case it could be or its most probable happen. 0:25:26.706 --> 0:25:32.679 You're learning here up there what medical means, but in your fine tuning step the model 0:25:32.679 --> 0:25:38.785 is forgotten everything about Medicare, so you may be losing all the information you gain. 0:25:39.099 --> 0:25:42.366 So this type of priest training step is good. 0:25:42.366 --> 0:25:47.978 If your pretraining data is more general, very large and then you're adapting. 0:25:48.428 --> 0:25:56.012 But in the task with moral lingual data, which should be used to adapt the system to some 0:25:56.012 --> 0:25:57.781 general topic style. 0:25:57.817 --> 0:26:06.795 Then, of course, this is not a good strategy because you might forgot about everything up 0:26:06.795 --> 0:26:09.389 there and you don't have. 0:26:09.649 --> 0:26:14.678 So then you have to check what you can do for them. 0:26:14.678 --> 0:26:23.284 You can freeze this part and change it any more so you don't lose the ability or you can 0:26:23.284 --> 0:26:25.702 do a direct combination. 0:26:25.945 --> 0:26:31.028 Where you jointly train both of them, so you train the NMT system on the, and then you train 0:26:31.028 --> 0:26:34.909 the language model always in parallels so that you don't forget about. 0:26:35.395 --> 0:26:37.684 And what you learn of the length. 0:26:37.937 --> 0:26:46.711 Depends on what you want to combine because it's large data and you have a good general 0:26:46.711 --> 0:26:48.107 knowledge in. 0:26:48.548 --> 0:26:55.733 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:55.733 --> 0:26:57.295 something specific. 0:26:57.295 --> 0:26:58.075 Then you. 0:27:01.001 --> 0:27:06.676 Then this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:12.116 It seems to be the easiest one somehow. 0:27:12.116 --> 0:27:20.103 It's more similar to what we are doing with statistical machine translation. 0:27:21.181 --> 0:27:31.158 Normally always beats this type of model, which in some view can be like from the conceptual 0:27:31.158 --> 0:27:31.909 thing. 0:27:31.909 --> 0:27:36.844 It's even easier from the computational side. 0:27:40.560 --> 0:27:42.078 And the idea is OK. 0:27:42.078 --> 0:27:49.136 We have monolingual data that we just translate and then generate some type of parallel data 0:27:49.136 --> 0:27:50.806 and use that then to. 0:27:51.111 --> 0:28:00.017 So if you want to build a German-to-English system first, take the large amount of data 0:28:00.017 --> 0:28:02.143 you have translated. 0:28:02.402 --> 0:28:10.446 Then you have more peril data and the interesting thing is if you then train on the joint thing 0:28:10.446 --> 0:28:18.742 or on the original peril data and on what is artificial where you have generated the translations. 
0:28:18.918 --> 0:28:26.487 So you can because you are not doing the same era all the times and you have some knowledge. 0:28:28.028 --> 0:28:43.199 With this first approach, however, there is one issue why it might not work the best. 0:28:49.409 --> 0:28:51.177 Very a bit shown in the image to you. 0:28:53.113 --> 0:28:58.153 You trade on that quality data. 0:28:58.153 --> 0:29:02.563 Here is a bit of a problem. 0:29:02.563 --> 0:29:08.706 Your English style is not really good. 0:29:08.828 --> 0:29:12.213 And as you're saying, the system always mistranslates. 0:29:13.493 --> 0:29:19.798 Something then you will learn that this is correct because now it's a training game and 0:29:19.798 --> 0:29:23.022 you will encourage it to make it more often. 0:29:23.022 --> 0:29:29.614 So the problem with training on your own areas yeah you might prevent some areas you rarely 0:29:29.614 --> 0:29:29.901 do. 0:29:30.150 --> 0:29:31.749 But errors use systematically. 0:29:31.749 --> 0:29:34.225 Do you even enforce more and will even do more? 0:29:34.654 --> 0:29:40.145 So that might not be the best solution to have any idea how you could do it better. 0:29:44.404 --> 0:29:57.754 Is one way there is even a bit of more simple idea. 0:30:04.624 --> 0:30:10.975 The problem is yeah, the translations are not perfect, so the output and you're learning 0:30:10.975 --> 0:30:12.188 something wrong. 0:30:12.188 --> 0:30:17.969 Normally it's less bad if your inputs are not bad, but your outputs are perfect. 0:30:18.538 --> 0:30:24.284 So if your inputs are wrong you may learn that if you're doing this wrong input you're 0:30:24.284 --> 0:30:30.162 generating something correct, but you're not learning to generate something which is not 0:30:30.162 --> 0:30:30.756 correct. 0:30:31.511 --> 0:30:47.124 So often the case it is that it is more important than your target is correct. 0:30:47.347 --> 0:30:52.182 But you can assume in your application scenario you hope that you may only get correct inputs. 0:30:52.572 --> 0:31:02.535 So that is not harming you, and in machine translation we have one very nice advantage: 0:31:02.762 --> 0:31:04.648 And also the other way around. 0:31:04.648 --> 0:31:10.062 It's a very similar task, so there's a task to translate from German to English, but the 0:31:10.062 --> 0:31:13.894 task to translate from English to German is very similar, and. 0:31:14.094 --> 0:31:19.309 So what we can do is we can just switch it initially and generate the data the other way 0:31:19.309 --> 0:31:19.778 around. 0:31:20.120 --> 0:31:25.959 So what we are doing here is we are starting with an English to German system. 0:31:25.959 --> 0:31:32.906 Then we are translating the English data into German where the German is maybe not very nice. 0:31:33.293 --> 0:31:51.785 And then we are training on our original data and on the back translated data. 0:31:52.632 --> 0:32:02.332 So here we have the advantage that our target side is human quality and only the input. 0:32:03.583 --> 0:32:08.113 Then this helps us to get really good. 0:32:08.113 --> 0:32:15.431 There is one difference if you think about the data resources. 0:32:21.341 --> 0:32:27.336 Too obvious here we need a target site monolingual layer. 0:32:27.336 --> 0:32:31.574 In the first example we had source site. 0:32:31.931 --> 0:32:45.111 So back translation is normally working if you have target size peril later and not search 0:32:45.111 --> 0:32:48.152 side modeling later. 
0:32:48.448 --> 0:32:56.125 Might be also, like if you think about it, understand a little better to understand the 0:32:56.125 --> 0:32:56.823 target. 0:32:57.117 --> 0:33:01.469 On the source side you have to understand the content. 0:33:01.469 --> 0:33:08.749 On the target side you have to generate really sentences and somehow it's more difficult to 0:33:08.749 --> 0:33:12.231 generate something than to only understand. 0:33:17.617 --> 0:33:30.734 This works well if you have to select how many back translated data do you use. 0:33:31.051 --> 0:33:32.983 Because only there's like a lot more. 0:33:33.253 --> 0:33:42.136 Question: Should take all of my data there is two problems with it? 0:33:42.136 --> 0:33:51.281 Of course it's expensive because you have to translate all this data. 0:33:51.651 --> 0:34:00.946 So if you don't know the normal good starting point is to take equal amount of data as many 0:34:00.946 --> 0:34:02.663 back translated. 0:34:02.963 --> 0:34:04.673 It depends on the used case. 0:34:04.673 --> 0:34:08.507 If we have very few data here, it makes more sense to have more. 0:34:08.688 --> 0:34:15.224 Depends on how good your quality is here, so the better the more data you might use because 0:34:15.224 --> 0:34:16.574 quality is better. 0:34:16.574 --> 0:34:22.755 So it depends on a lot of things, but your rule of sum is like which general way often 0:34:22.755 --> 0:34:24.815 is to have equal amounts of. 0:34:26.646 --> 0:34:29.854 And you can, of course, do that now. 0:34:29.854 --> 0:34:34.449 I said already that it's better to have the quality. 0:34:34.449 --> 0:34:38.523 At the end, of course, depends on this system. 0:34:38.523 --> 0:34:46.152 Also, because the better this system is, the better your synthetic data is, the better. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.917 So you play them on English to German, and you translate the data on. 0:34:56.957 --> 0:35:03.198 Then you train a model on German to English with the additional data. 0:35:03.198 --> 0:35:09.796 Then you translate German data and then you train to gain your first one. 0:35:09.796 --> 0:35:14.343 So in the second iteration this quality is better. 0:35:14.334 --> 0:35:19.900 System is better because it's not only trained on the small data but additionally on back 0:35:19.900 --> 0:35:22.003 translated data with this system. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:28.053 However, typically you can stop quite early. 0:35:28.053 --> 0:35:35.068 Maybe one iteration is good, but then you have diminishing gains after two or three iterations. 0:35:35.935 --> 0:35:46.140 There is very slight difference because you need a quite big difference in the quality 0:35:46.140 --> 0:35:46.843 here. 0:35:47.207 --> 0:36:02.262 Language is also good because it means you can already train it with relatively bad profiles. 0:36:03.723 --> 0:36:10.339 It's a design decision would advise so guess because it's easy to get it. 0:36:10.550 --> 0:36:20.802 Replace that because you have a higher quality real data, but then I think normally it's okay 0:36:20.802 --> 0:36:22.438 to replace it. 0:36:22.438 --> 0:36:28.437 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:42.014 That's about like using monolingual data before we go into the pre-train models to have any 0:36:42.014 --> 0:36:43.005 more crash. 
0:36:49.029 --> 0:36:55.740 Yes, so the other thing which we can do and which is recently more and more successful 0:36:55.740 --> 0:37:02.451 and even more successful since we have this really large language models where you can 0:37:02.451 --> 0:37:08.545 even do the translation task with this is the way of using pre-trained models. 0:37:08.688 --> 0:37:16.135 So you learn a representation of one task, and then you use this representation from another. 0:37:16.576 --> 0:37:26.862 It was made maybe like one of the first words where it really used largely is doing something 0:37:26.862 --> 0:37:35.945 like a bird which you pre trained on purely text era and you take it in fine tune. 0:37:36.496 --> 0:37:42.953 And one big advantage, of course, is that people can only share data but also pre-trained. 0:37:43.423 --> 0:37:59.743 The recent models and the large language ones which are available. 0:37:59.919 --> 0:38:09.145 Where I think it costs several millions to train them all, just if you would buy the GPUs 0:38:09.145 --> 0:38:15.397 from some cloud company and train that the cost of training. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.598 So another idea is what you can do is okay. 0:38:24.598 --> 0:38:27.330 Maybe if these months are once available,. 0:38:27.467 --> 0:38:36.598 Can take them and use them as an also resource similar to pure text, and you can now build 0:38:36.598 --> 0:38:44.524 models which somehow learn not only from from data but also from other models. 0:38:44.844 --> 0:38:49.127 So it's a quite new way of thinking of how to train. 0:38:49.127 --> 0:38:53.894 We are not only learning from examples, but we might also. 0:38:54.534 --> 0:39:05.397 The nice thing is that this type of training where we are not learning directly from data 0:39:05.397 --> 0:39:07.087 but learning. 0:39:07.427 --> 0:39:17.647 So the main idea this go is you have a person initial task. 0:39:17.817 --> 0:39:26.369 And if you're working with anLP, that means you're training pure taxator because that's 0:39:26.369 --> 0:39:30.547 where you have the largest amount of data. 0:39:30.951 --> 0:39:35.854 And then you're defining some type of task in order to you do your creek training. 0:39:36.176 --> 0:39:43.092 And: The typical task you can train on on that is like the language waddling task. 0:39:43.092 --> 0:39:50.049 So to predict the next word or we have a related task to predict something in between, we'll 0:39:50.049 --> 0:39:52.667 see depending on the architecture. 0:39:52.932 --> 0:39:58.278 But somehow to predict something which you have not in the input is a task which is easy 0:39:58.278 --> 0:40:00.740 to generate, so you just need your data. 0:40:00.740 --> 0:40:06.086 That's why it's called self supervised, so you're creating your supervised pending data. 0:40:06.366 --> 0:40:07.646 By yourself. 0:40:07.646 --> 0:40:15.133 On the other hand, you need a lot of knowledge and that is the other thing. 0:40:15.735 --> 0:40:24.703 Because there is this idea that the meaning of a word heavily depends on the context that. 0:40:25.145 --> 0:40:36.846 So can give you a sentence with some giverish word and there's some name and although you've 0:40:36.846 --> 0:40:41.627 never heard the name you will assume. 0:40:42.062 --> 0:40:44.149 And exactly the same thing. 0:40:44.149 --> 0:40:49.143 The models can also learn something about the world by just using. 
0:40:49.649 --> 0:40:53.651 So that is typically the mule. 0:40:53.651 --> 0:40:59.848 Then we can use this model to train the system. 0:41:00.800 --> 0:41:03.368 Course we might need to adapt the system. 0:41:03.368 --> 0:41:07.648 To do that we have to change the architecture we might use only some. 0:41:07.627 --> 0:41:09.443 Part of the pre-trained model. 0:41:09.443 --> 0:41:14.773 In there we have seen that a bit already in the R&N case you can also see that we have 0:41:14.773 --> 0:41:17.175 also mentioned the pre-training already. 0:41:17.437 --> 0:41:22.783 So you can use the R&N as one of these approaches. 0:41:22.783 --> 0:41:28.712 You train the R&M language more on large pre-train data. 0:41:28.712 --> 0:41:32.309 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:53.924 So you can build a system which is knowledge, which is just trained on large amounts of data. 0:41:56.376 --> 0:42:01.564 So the question is maybe what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.277 And we want today to look at briefly at swings. 0:42:05.725 --> 0:42:08.850 That was what was initially done. 0:42:08.850 --> 0:42:17.213 It wasn't as famous as in machine translation as in other things, but it's also used there 0:42:17.213 --> 0:42:21.072 and that is to use static word embedding. 0:42:21.221 --> 0:42:28.981 So we have this mapping from the one hot to a small continuous word representation. 0:42:29.229 --> 0:42:38.276 Using this one in your NG system, so you can, for example, replace the embedding layer by 0:42:38.276 --> 0:42:38.779 the. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.517 And we're always in this pre-training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:52.411 More data than the trade off, so you can get better. 0:42:52.411 --> 0:42:59.107 The disadvantage is, does anybody have an idea of what might be the disadvantage of using 0:42:59.107 --> 0:43:00.074 things like. 0:43:04.624 --> 0:43:12.175 What was one mentioned today giving like big advantage of the system compared to previous. 0:43:20.660 --> 0:43:25.134 Where one advantage was the enter end training, so you have the enter end training so that 0:43:25.134 --> 0:43:27.937 all parameters and all components play optimal together. 0:43:28.208 --> 0:43:33.076 If you know pre-train something on one fast, it may be no longer optimal fitting to everything 0:43:33.076 --> 0:43:33.384 else. 0:43:33.893 --> 0:43:37.862 So what do pretending or not? 0:43:37.862 --> 0:43:48.180 It depends on how important everything is optimal together and how important. 0:43:48.388 --> 0:43:51.874 Is a iquality of large amount. 0:43:51.874 --> 0:44:00.532 The pre-change one is so much better that it's helpful and the advantage of. 0:44:00.600 --> 0:44:11.211 Getting everything optimal together, yes, we would use random instructions for raising. 0:44:11.691 --> 0:44:26.437 The problem is you might be already in some area where it's not easy to get. 0:44:26.766 --> 0:44:35.329 But often in some way right, so often it's not about your really worse pre trained monolepsy. 0:44:35.329 --> 0:44:43.254 If you're going already in some direction, and if this is not really optimal for you,. 0:44:43.603 --> 0:44:52.450 But if you're not really getting better because you have a decent amount of data, it's so different 0:44:52.450 --> 0:44:52.981 that. 
0:44:53.153 --> 0:44:59.505 Initially it wasn't a machine translation done so much because there are more data in 0:44:59.505 --> 0:45:06.153 MPs than in other tasks, but now with really large amounts of monolingual data we do some 0:45:06.153 --> 0:45:09.403 type of pretraining in currently all state. 0:45:12.632 --> 0:45:14.302 The other one is okay now. 0:45:14.302 --> 0:45:18.260 It's always like how much of the model do you plea track a bit? 0:45:18.658 --> 0:45:22.386 To the other one you can do contextural word embedded. 0:45:22.386 --> 0:45:28.351 That is something like bird or Roberta where you train already a sequence model and the 0:45:28.351 --> 0:45:34.654 embeddings you're using are no longer specific for word but they are also taking the context 0:45:34.654 --> 0:45:35.603 into account. 0:45:35.875 --> 0:45:50.088 The embedding you're using is no longer depending on the word itself but on the whole sentence, 0:45:50.088 --> 0:45:54.382 so you can use this context. 0:45:55.415 --> 0:46:02.691 You can use similar things also in the decoder just by having layers which don't have access 0:46:02.691 --> 0:46:12.430 to the source, but there it still might have and these are typically models like: And finally 0:46:12.430 --> 0:46:14.634 they will look at the end. 0:46:14.634 --> 0:46:19.040 You can also have models which are already sequenced. 0:46:19.419 --> 0:46:28.561 So you may be training a sequence to sequence models. 0:46:28.561 --> 0:46:35.164 You have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you'll find tuning. 0:46:47.227 --> 0:46:59.614 But let's first do a bit of step back and look into what are the different things. 0:46:59.614 --> 0:47:02.151 The first thing. 0:47:02.382 --> 0:47:11.063 The wooden bettings are just this first layer and you can train them with feedback annual 0:47:11.063 --> 0:47:12.028 networks. 0:47:12.212 --> 0:47:22.761 But you can also train them with an N language model, and by now you hopefully have also seen 0:47:22.761 --> 0:47:27.699 that you cannot transform a language model. 0:47:30.130 --> 0:47:37.875 So this is how you can train them and you're training them. 0:47:37.875 --> 0:47:45.234 For example, to speak the next word that is the easiest. 0:47:45.525 --> 0:47:55.234 And that is what is now referred to as South Supervised Learning and, for example, all the 0:47:55.234 --> 0:48:00.675 big large language models like Chad GPT and so on. 0:48:00.675 --> 0:48:03.129 They are trained with. 0:48:03.823 --> 0:48:15.812 So that is where you can hopefully learn how a word is used because you always try to previct 0:48:15.812 --> 0:48:17.725 the next word. 0:48:19.619 --> 0:48:27.281 Word embedding: Why do you keep the first look at the word embeddings and the use of 0:48:27.281 --> 0:48:29.985 word embeddings for our task? 0:48:29.985 --> 0:48:38.007 The main advantage was it might be only the first layer where you typically have most of 0:48:38.007 --> 0:48:39.449 the parameters. 0:48:39.879 --> 0:48:57.017 Most of your parameters already on the large data, then on your target data you have to 0:48:57.017 --> 0:48:59.353 train less. 0:48:59.259 --> 0:49:06.527 Big difference that your input size is so much bigger than the size of the novel in size. 0:49:06.626 --> 0:49:17.709 So it's a normally sign, maybe like, but your input and banning size is something like. 0:49:17.709 --> 0:49:20.606 Then here you have to. 
0:49:23.123 --> 0:49:30.160 While here, in a hidden layer, you only have a small fraction of that number of parameters. 0:49:30.750 --> 0:49:40.367 So here is where most of your parameters are, which means if you already pre-train the word 0:49:40.367 --> 0:49:48.915 embeddings you cover a large part of the parameters, even though the embedding layer might look like a small part of your overall NMT architecture. 0:49:57.637 --> 0:50:01.249 The thing is, we have seen these word embeddings 0:50:01.249 --> 0:50:04.295 can also be put to very good use for other purposes. 0:50:04.784 --> 0:50:08.994 You learn some general relations between words 0:50:08.994 --> 0:50:17.454 if you're doing this type of language modeling task, where you predict the next word. The one thing is you have 0:50:17.454 --> 0:50:24.084 a lot of data, so one requirement is fulfilled: we want enough data to train a model. 0:50:24.084 --> 0:50:28.734 The other thing is that the task needs to be somehow useful. 0:50:29.169 --> 0:50:43.547 If you would only predict the first letter of the word, then you wouldn't learn anything about 0:50:43.547 --> 0:50:45.144 the meaning of the word. 0:50:45.545 --> 0:50:53.683 And the interesting thing is, people have looked closely at these word embeddings. 0:50:53.954 --> 0:50:58.550 And looking at the word embeddings, 0:50:58.550 --> 0:51:09.276 you can ask yourself how they look and visualize them by doing dimension reduction. 0:51:09.489 --> 0:51:13.236 I don't know if you are attending Artificial Intelligence 0:51:13.236 --> 0:51:15.110 or Advanced Artificial Intelligence; 0:51:15.515 --> 0:51:23.217 we covered there yesterday how to do this type of dimensionality reduction. If you do this type 0:51:23.217 --> 0:51:29.635 of visualization, you see some interesting things. 0:51:30.810 --> 0:51:41.027 Now you can represent the embeddings here in a two- or three-dimensional space with some dimension reduction 0:51:41.027 --> 0:51:46.881 and see, for example, the relation between male and female words. 0:51:47.447 --> 0:51:56.625 So this vector between the male and female version of something is not always exactly the same, 0:51:56.625 --> 0:51:58.502 but it's clearly related. 0:51:58.718 --> 0:52:14.522 So you can do a bit of maths: you take king, you subtract the male vector, add the female vector, and you land near queen. 0:52:14.894 --> 0:52:17.591 So that means, okay, there is really something stored. 0:52:17.591 --> 0:52:19.689 Some information is stored in these vectors. 0:52:20.040 --> 0:52:22.492 Similarly, you can do it with verb forms. 0:52:22.492 --> 0:52:25.004 You see here swimming, swam, walking, walked. 0:52:25.265 --> 0:52:34.620 So again these vectors are not exactly the same, but they are related. 0:52:34.620 --> 0:52:42.490 So you learn something from going from here to here. 0:52:43.623 --> 0:52:49.761 Or semantically, the relation between country and capital behaves in exactly the same way. 0:52:51.191 --> 0:52:56.854 And people have even done a kind of question answering with that, based on these embeddings 0:52:56.854 --> 0:52:57.839 and this vector arithmetic. 0:52:58.218 --> 0:53:06.711 Of course, you shouldn't fully trust the dimension reduction, because maybe the projection distorts something. 0:53:06.967 --> 0:53:16.863 So you can also look at what really happens in the original space: 0:53:16.863 --> 0:53:22.247 what is the nearest neighbor of the resulting vector? 0:53:22.482 --> 0:53:29.608 So you can take the relationship between France and Paris and add it to Italy, and you get Rome. 0:53:30.010 --> 0:53:33.078 You can do big and bigger, you have small and smaller, and so on. 0:53:33.593 --> 0:53:49.417 It doesn't work everywhere, though; there is also an example here with typical German dishes.
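Here is a toy sketch of that vector arithmetic (king minus man plus woman lands near queen) on made-up three-dimensional vectors, just to show the mechanics; real experiments of course use embeddings such as word2vec trained on large corpora and search the nearest neighbor over the whole vocabulary.

import numpy as np

emb = {                                        # hypothetical tiny embeddings, for illustration only
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.6, 0.1, 0.1]),
    "woman": np.array([0.6, 0.1, 0.9]),
    "queen": np.array([0.8, 0.7, 0.9]),
}

def nearest(vec, exclude):
    cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cosine(emb[w], vec))

query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))    # prints: queen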
0:53:51.491 --> 0:54:01.677 You can do what profession a person has, of course only for famous ones: Einstein maps to scientist, 0:54:01.677 --> 0:54:06.716 and for a footballer it finds midfielder, which is not completely correct. 0:54:06.846 --> 0:54:10.134 You see the examples are a bit old; 0:54:10.134 --> 0:54:15.066 the politicians shown are no longer in office, but of course the idea stays the same. 0:54:16.957 --> 0:54:26.759 What people then did there: especially at the beginning, training an RNN language model 0:54:26.759 --> 0:54:28.937 was very expensive. 0:54:29.309 --> 0:54:38.031 So one famous observation was: we are not really interested in the language model performance itself. 0:54:38.338 --> 0:54:40.581 I think that's something good to keep in mind. 0:54:40.581 --> 0:54:42.587 What are we really interested in? 0:54:42.587 --> 0:54:45.007 Do we really want to have an RNN? No. 0:54:45.007 --> 0:54:48.607 In this case we are only interested in this type of mapping, the word embeddings. 0:54:49.169 --> 0:54:55.500 And very successful with this idea was word2vec. 0:54:55.535 --> 0:54:56.865 The idea is: okay, 0:54:56.865 --> 0:55:03.592 we are not training a real language model, but making it even simpler and doing, for example, 0:55:03.592 --> 0:55:05.513 continuous bag of words (CBOW). 0:55:05.513 --> 0:55:12.313 We just have four input tokens and we're predicting what the word in the middle is, and 0:55:12.313 --> 0:55:15.048 this is just two linear layers. 0:55:15.615 --> 0:55:21.627 So it simplifies things even further and makes the computation faster, because that is all 0:55:21.627 --> 0:55:22.871 we're interested in. 0:55:23.263 --> 0:55:32.897 Or the continuous skip-gram model; these are the two models which together are referred to as 0:55:32.897 --> 0:55:34.004 word2vec. 0:55:34.234 --> 0:55:42.394 There you have one input word and it's the other way around: you're predicting the words 0:55:42.394 --> 0:55:43.585 around it. 0:55:43.585 --> 0:55:45.327 It's very similar. 0:55:45.327 --> 0:55:48.720 The task is in the end very similar. 0:55:51.131 --> 0:56:01.407 Before we go to the next point, any questions about these static word vectors or word embeddings? 0:56:04.564 --> 0:56:07.794 The next thing is contextual 0:56:07.794 --> 0:56:12.208 word embeddings, and the idea is: the above is helpful, 0:56:12.208 --> 0:56:19.206 however, we might be able to get even more out of monolingual data. 0:56:19.419 --> 0:56:31.732 Take a word like "can": in a static embedding there is an overlap of its two meanings, so it represents both the container and 0:56:31.732 --> 0:56:33.585 being able to do something. 0:56:34.834 --> 0:56:40.410 But we might be able to already disambiguate this in the pre-trained model, because the two senses 0:56:40.410 --> 0:56:41.044 are used in different contexts. 0:56:41.701 --> 0:56:53.331 So we would like a model which can not only represent a word but can also represent the 0:56:53.331 --> 0:56:58.689 meaning of the word within its context. 0:56:59.139 --> 0:57:03.769 So then we are going to contextual word embeddings. 0:57:03.769 --> 0:57:07.713 We are really having a representation of the word in its sentence. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:23.791 The RNN hidden state represents what has been said so far, but it's focusing on the last 0:57:23.791 --> 0:57:29.303 word, so it is some representation of that word in context. 0:57:29.509 --> 0:57:43.758 One of the first works doing that is the ELMo paper. This here is 0:57:43.758 --> 0:57:48.129 the normal language model: 0:57:48.008 --> 0:57:50.714 with the third word, predicting the fourth, and so on.
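Before continuing with the language-model picture below, here is a minimal sketch of the CBOW variant of word2vec mentioned above, in PyTorch with made-up sizes: the four context words are embedded (the first linear mapping), averaged, and a second linear layer predicts the middle word; after training, only the embedding table is kept.

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # first mapping: one-hot index to embedding
        self.out = nn.Linear(emb_dim, vocab_size)      # second mapping: embedding to vocabulary scores

    def forward(self, context_ids):                    # context_ids: (batch, 4) surrounding words
        ctx = self.emb(context_ids).mean(dim=1)        # average the four context embeddings
        return self.out(ctx)                           # logits for the word in the middle

model = CBOW(vocab_size=10_000, emb_dim=128)
context = torch.randint(0, 10_000, (32, 4))            # random stand-in data
logits = model(context)                                # (32, 10000); train with cross-entropy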
0:57:50.714 --> 0:57:53.004 So you are always predicting the next word. 0:57:53.193 --> 0:57:57.335 The architecture is: you have a word embedding layer and then RNN layers, 0:57:57.335 --> 0:58:03.901 for example. And now, instead of using only the output at the end, you're using here this 0:58:03.901 --> 0:58:04.254 hidden state. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word, mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:18.610 We can train it in a language model style, always predicting the next word, but we have 0:58:18.610 --> 0:58:21.088 trained more information into the representation. 0:58:21.088 --> 0:58:26.123 Therefore, in the MT system it has to learn fewer additional things. 0:58:27.167 --> 0:58:31.261 And this is essentially what is done currently in GPT models. 0:58:31.261 --> 0:58:38.319 The only difference is that we have more layers, bigger sizes, and we're using transformer self- 0:58:38.319 --> 0:58:40.437 attention instead of the RNN. 0:58:40.437 --> 0:58:45.095 But that is how you train these large language models at the moment. 0:58:46.746 --> 0:58:55.044 However, if you look at these contextual representations, they might not be perfect. 0:58:55.044 --> 0:59:02.942 So if you think of this one as a contextual representation of the third word, 0:59:07.587 --> 0:59:16.686 it is representing word three in the context of the sentence, however only in the context of 0:59:16.686 --> 0:59:18.185 the previous words. 0:59:18.558 --> 0:59:27.413 However, we have an architecture which can also take both sides, and we have used that 0:59:27.413 --> 0:59:30.193 already in the encoder. 0:59:30.630 --> 0:59:34.264 So we could easily run the RNN also in the backward direction, 0:59:34.874 --> 0:59:42.826 by just computing the states the other way around, and then we could combine the forward and 0:59:42.826 --> 0:59:49.135 the backward states into a joint one where we are doing this type of prediction. 0:59:49.329 --> 0:59:50.858 So you have the word embedding. 0:59:51.011 --> 1:00:02.095 Then you have two hidden states, one from the forward RNN and one from the backward RNN, and 1:00:02.095 --> 1:00:10.314 then you can, for example, take the concatenation of both of them. 1:00:10.490 --> 1:00:23.257 Now this state here mainly represents this word, because it is the last input to both directions, 1:00:23.257 --> 1:00:30.573 and we know the RNN is focusing on what happened last. 1:00:31.731 --> 1:00:40.469 However, there is a bit of a problem when training that as a language model: you already 1:00:40.469 --> 1:00:41.059 have the answer. 1:00:43.203 --> 1:00:44.956 Maybe there's again this masking. 1:00:46.546 --> 1:00:47.748 That is one solution. 1:00:47.748 --> 1:00:52.995 But first of all, why can't we do it as before? The information leaks, so you cannot just predict the 1:00:52.995 --> 1:00:53.596 next word. 1:00:53.596 --> 1:00:58.132 If we just predict the next word in this type of model, that's a very simple task. 1:00:58.738 --> 1:01:09.581 You already know the next word because it's influencing this backward hidden state, and predicting something you already know is not 1:01:09.581 --> 1:01:11.081 a good task. 1:01:11.081 --> 1:01:18.455 You have to define the task carefully, because in this case what will happen is that the system will just ignore these 1:01:18.455 --> 1:01:22.966 other states, and what it will learn is to copy this information directly in here. 1:01:23.343 --> 1:01:31.218 So the state would just be representing this word, and you would have a nearly perfect model, because 1:01:31.218 --> 1:01:38.287 you only need to find an encoding where you can encode every word somehow in this hidden state.
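For reference, a minimal sketch (PyTorch, made-up hyper-parameters) of such a forward RNN language model in the ELMo spirit: the hidden state at position t is a contextual representation of word t given only the preceding words, and these states can later be reused as contextual embeddings.

import torch
import torch.nn as nn

class ForwardLM(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        states, _ = self.rnn(self.emb(token_ids))    # one contextual state per position
        return self.out(states), states              # next-word logits plus reusable representations

lm = ForwardLM()
logits, contextual = lm(torch.randint(0, 10_000, (8, 20)))
# Training shifts by one position: the prediction at position t is scored against the token at t+1.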
1:01:38.458 --> 1:01:44.050 The only thing it can learn is to encode each word in this upper hidden state and copy it over. 1:01:44.985 --> 1:01:53.779 Therefore, it's not really useful, so we need to find a somewhat different way out. 1:01:55.295 --> 1:01:57.090 There is the masking one; 1:01:57.090 --> 1:02:03.747 I'll come to that shortly. But other things have also been done, so the other 1:02:03.747 --> 1:02:06.664 option is not to directly combine the two directions. 1:02:06.664 --> 1:02:13.546 That was done in the ELMo paper: you have the forward RNN and the backward RNN and you keep them completely 1:02:13.546 --> 1:02:14.369 separated. 1:02:14.594 --> 1:02:20.458 So you never merge the states inside the model. 1:02:20.458 --> 1:02:33.749 At the end, the representation of the word is built from the forward state and the backward state. 1:02:33.873 --> 1:02:35.953 So in each direction it is the hidden state right at the word. 1:02:36.696 --> 1:02:41.286 These two you then join into the representation. 1:02:42.022 --> 1:02:48.685 And then you have a representation of the word that also covers the whole sentence, 1:02:48.685 --> 1:02:51.486 but there is no information leakage. 1:02:51.486 --> 1:02:58.149 So one way of doing this is: instead of training a bidirectional RNN jointly, you do a forward pass and a backward pass and 1:02:58.149 --> 1:02:59.815 only then join the hidden states. 1:03:00.380 --> 1:03:05.960 You can do that in all layers: 1:03:05.960 --> 1:03:16.300 you run the forward layers and the backward layers and only combine the hidden states afterwards. 1:03:16.596 --> 1:03:19.845 However, it's a bit complicated. 1:03:19.845 --> 1:03:25.230 You have to keep both directions separate and then merge things, so what else can you do? 1:03:27.968 --> 1:03:33.030 And that is the moment where 1:03:34.894 --> 1:03:39.970 the big success of the BERT model came in, with the idea: okay, 1:03:39.970 --> 1:03:47.281 maybe in the bidirectional case it's not good to do next word prediction, but we can 1:03:47.281 --> 1:03:48.314 do masking. 1:03:48.308 --> 1:03:56.019 Masking mainly means we predict something in the middle, some of the words. 1:03:56.019 --> 1:04:04.388 So the idea is: if we have the input, we put noise into the input by removing some words, 1:04:04.388 --> 1:04:07.961 and then the model has to reconstruct exactly those. 1:04:08.048 --> 1:04:15.327 Now there can be no information leakage, because the word we are predicting is no longer in the input. 1:04:16.776 --> 1:04:19.957 And we don't make any assumption about our model: 1:04:19.957 --> 1:04:26.410 it doesn't need to be a forward model or a backward model or anything. 1:04:26.410 --> 1:04:29.500 You can always predict word three from both sides. 1:04:30.530 --> 1:04:34.844 There's maybe one bit of a disadvantage. 1:04:34.844 --> 1:04:40.105 Do you see what could be a bit of a problem with this? 1:05:00.000 --> 1:05:06.429 Yes, so yeah, you can of course mask more, but to see it more globally, just first assume 1:05:06.429 --> 1:05:08.143 you only mask one word. 1:05:08.143 --> 1:05:13.930 For the whole sentence, we then get one feedback signal, like: what is word three? 1:05:13.930 --> 1:05:22.882 So we have one training example. If you do the language modeling task, we predict here, 1:05:22.882 --> 1:05:24.679 we predict here, at every position. 1:05:25.005 --> 1:05:26.735 So we have as many training signals as tokens. 1:05:26.735 --> 1:05:30.970 For each token we get feedback on what the correct prediction would have been. 1:05:31.211 --> 1:05:43.300 So in this case masking is less efficient, because we are getting fewer feedback signals on what 1:05:43.300 --> 1:05:45.797 we should predict.
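A rough sketch of the "keep the two directions separate" idea just described (PyTorch, made-up sizes): each direction is trained as its own language model, and only afterwards are the hidden states concatenated into one contextual representation, so no future information leaks into the prediction during training.

import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 256)
fwd = nn.LSTM(256, 512, batch_first=True)        # reads the sentence left to right
bwd = nn.LSTM(256, 512, batch_first=True)        # reads the reversed sentence

ids = torch.randint(0, 10_000, (4, 12))
x = emb(ids)

h_fwd, _ = fwd(x)                                # state at position t has seen words 1..t
h_bwd_rev, _ = bwd(torch.flip(x, dims=[1]))      # run on the reversed sequence
h_bwd = torch.flip(h_bwd_rev, dims=[1])          # flip back: state at position t has seen words t..N

contextual = torch.cat([h_fwd, h_bwd], dim=-1)   # (4, 12, 1024) joint representation per word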
1:05:48.348 --> 1:05:56.373 So in BERT, the main ideas are that you're doing this bidirectional model with masking, 1:05:56.373 --> 1:05:59.709 and it's using the transformer architecture. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that next sentence prediction is added as another task. 1:06:16.957 --> 1:06:30.394 The idea is you want to learn more about what language is, to really understand whether two sentences follow each other or 1:06:30.394 --> 1:06:35.127 are independent of each other. 1:06:38.158 --> 1:06:42.723 The input is using subword units as we use them. 1:06:42.723 --> 1:06:50.193 It has a special token that is needed for the next sentence prediction 1:06:50.470 --> 1:07:04.075 and more generally for classification tasks, because there you learn a general representation 1:07:04.075 --> 1:07:07.203 of the full sentence. 1:07:07.607 --> 1:07:19.290 You're adding segment embeddings, so you have an embedding saying: 1:07:19.290 --> 1:07:24.323 this is the first sentence, this is the second. 1:07:24.684 --> 1:07:29.099 Now what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already had the question of how much you should mask. 1:07:35.275 --> 1:07:42.836 There has been quite some work on that afterwards, for example RoBERTa. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not learning anything. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:04.594 Should I always mask the full word, or, if a word is split into subwords, should I 1:08:04.594 --> 1:08:10.630 mask only a single subword and predict it based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the word, it might be easier to guess the last one. Here they 1:08:21.210 --> 1:08:27.594 took the easiest selection, so not considering full words at all, because the subword splitting is done 1:08:27.594 --> 1:08:32.280 in the preprocessing and masking always happens on subwords. 1:08:32.672 --> 1:08:36.089 I think in some follow-up work it is done differently: 1:08:36.089 --> 1:08:40.401 they always mask the full word, but I guess it doesn't make a huge difference. 1:08:41.001 --> 1:08:46.044 And then, what to do with the masked word: in eighty percent of the cases, 1:08:46.044 --> 1:08:50.803 if the word is masked, they replace it with a special token, 1:08:50.803 --> 1:08:57.197 the mask token; in ten percent they put in some random other token, and in ten 1:08:57.197 --> 1:08:59.470 percent they keep it unchanged. 1:09:02.202 --> 1:09:10.846 And then what you can also do is this next sentence prediction: 1:09:10.846 --> 1:09:14.880 "The man went to the store." 1:09:14.880 --> 1:09:17.761 "He bought a gallon of milk." 1:09:18.418 --> 1:09:24.088 So you see you're joining them: you're doing both masking and next sentence prediction at the same time. 1:09:24.564 --> 1:09:29.449 Or the second sentence is "Penguins are flightless birds": 1:09:29.449 --> 1:09:41.390 these two sentences have nothing to do with each other, so you can also do this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:56.572 And then the whole BERT model: here you have the input, the transformer layers, and 1:09:56.572 --> 1:09:58.164 you can train it. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general NLP applications. 1:10:17.937 --> 1:10:27.644 And there is a huge family of different types of models coming from it.
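Here is a small sketch of the masking recipe just described (plain Python; the token ids, the mask id and the 15 percent rate are assumed for illustration): a fraction of positions is selected, and of those, 80 percent become the mask token, 10 percent a random token, and 10 percent stay unchanged, while the model is asked to recover the original token at all selected positions.

import random

MASK_ID, VOCAB_SIZE = 4, 30_000                             # assumed ids, not BERT's real vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [None] * len(token_ids)   # None marks positions that are not scored
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:                     # select roughly 15% of the positions
            labels[i] = tok                                 # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                         # 80%: replace with the mask token
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)    # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return inputs, labels

corrupted, targets = mask_tokens([101, 2158, 2253, 2000, 3573, 102])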
1:10:27.827 --> 1:10:38.709 So based on BERT, a whole family of these self-supervised models came out of there, and now 1:10:38.709 --> 1:10:42.086 this is getting even bigger 1:10:42.082 --> 1:10:46.640 with the availability and success of large language models. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it changed a bit 1:10:50.910 --> 1:10:57.847 again, from the more bidirectional models back to unidirectional models, 1:10:57.847 --> 1:11:02.710 which are at the moment maybe a bit more popular; we're coming to them now. 1:11:02.710 --> 1:11:09.168 Do you see one advantage? We have the efficiency; 1:11:09.509 --> 1:11:15.901 what is one other reason why you are sometimes more interested in unidirectional models than 1:11:15.901 --> 1:11:17.150 in bidirectional ones? 1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional model is not 1:11:30.220 --> 1:11:30.872 really usable. 1:11:32.192 --> 1:11:40.924 Exactly, it doesn't work: if you want to do generation, like in the decoder, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models; 1:12:12.761 --> 1:12:16.161 that's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 The second type is the decoder-based models, so autoregressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next word prediction. 1:12:43.403 --> 1:12:52.439 And there you can also have a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model, 1:12:54.354 --> 1:13:05.039 because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 That is what is called a prefix language model: 1:13:12.240 --> 1:13:19.076 on the first tokens, the given prefix, you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge the two, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 there the number of parameters does not differ between the directions; in an RNN 1:13:33.039 --> 1:13:34.836 we would need a separate backward RNN. 1:13:34.975 --> 1:13:38.533 In a transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that in the encoder and decoder the number of parameters is different because 1:13:44.918 --> 1:13:50.235 you do cross-attention, but if you only compare forward, backward or bidirectional self-attention, 1:13:50.650 --> 1:13:58.419 it's only that you mask your attention to look only at the past or also into 1:13:58.419 --> 1:13:59.466 the future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is the unidirectional or causal one where you can only look at the past, and the prefix one where, say, the first 1:14:23.516 --> 1:14:25.649 three words attend to each other bidirectionally.
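A small sketch of the three attention masks just discussed (PyTorch booleans, where True means the position may be attended to; the sequence and prefix lengths are arbitrary examples). The point is that the parameters are identical in all three cases and only the mask changes.

import torch

seq_len, prefix_len = 6, 3

bidirectional = torch.ones(seq_len, seq_len).bool()          # encoder / BERT style: attend to everything

causal = torch.tril(torch.ones(seq_len, seq_len)).bool()     # decoder / LM style: attend only to the past

prefix = causal.clone()
prefix[:, :prefix_len] = True    # prefix LM: the given prefix is visible everywhere, so it is bidirectional within itself

print(prefix.int())              # rows are query positions, columns are the positions they may look at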
1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, there is of course still the third class of models. 1:14:43.163 --> 1:14:50.623 So the idea is we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same sentence as input to both sides, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking or corruption here. 1:15:12.912 --> 1:15:17.591 Here on the decoder side we don't need extra masking, but here on the encoder input we need the masking so that the model doesn't already know 1:15:17.591 --> 1:15:17.910 everything. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful as well, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that was the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train. 1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in a source sentence in another language, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the sentence in twice, and to make it not a trivial task, 1:16:01.382 --> 1:16:02.432 we corrupt the input. 1:16:03.003 --> 1:16:12.777 And they use different corruption techniques, so you can, for example, also do token deletion. 1:16:13.233 --> 1:16:19.692 That you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict it anywhere. 1:16:20.970 --> 1:16:26.353 In the encoder-only case, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in the input. 1:16:30.110 --> 1:16:38.268 Here, since the decoder side is autoregressive, we can also delete tokens and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation, 1:16:45.250 --> 1:16:54.285 document rotation and text infilling, so there is quite a variety. 1:16:55.615 --> 1:17:06.568 So you see there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, as for the language model before, 1:17:14.985 --> 1:17:21.079 the other question is how you integrate the pre-trained model. 1:17:21.761 --> 1:17:26.636 And there are also quite a few different techniques. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is you take your word embeddings or your pre-trained model. 1:17:39.068 --> 1:17:47.971 You freeze them, stack your decoder layers on top, and keep only these new ones trainable. 1:17:48.748 --> 1:17:54.495 The same can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is freeze parts of it, for example the word embeddings. 1:18:05.865 --> 1:18:17.296 The other option is that you only initialize: you initialize your model with the pre-trained weights, but you train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART for translation, you want to have the source language in the encoder and the target 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 The one you typically get is trained on English, so what you can do there is adapt it.
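Before continuing with the multilingual variant below, here is a rough sketch of such corruption functions (plain Python on token lists; the masking symbol, probabilities and helper names are illustrative, not BART's exact recipe): the encoder reads the corrupted text and the autoregressive decoder is trained to regenerate the original.

import random

def token_mask(tokens, p=0.15):
    return [("[MASK]" if random.random() < p else t) for t in tokens]

def token_delete(tokens, p=0.15):
    return [t for t in tokens if random.random() >= p]      # the corrupted input may be shorter than the target

def sentence_permute(sentences):
    shuffled = sentences[:]
    random.shuffle(shuffled)                                 # reorder the sentences of a document
    return shuffled

original = "the cat sat on the mat".split()
corrupted = token_delete(token_mask(original))
# Encoder input: corrupted tokens; decoder target: the original, uncorrupted sentence.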
1:18:46.366 --> 1:18:52.562 You can fine-tune BART in order to learn some language-specific stuff, or there's multilingual BART (mBART), 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but it's trained only on the monolingual data of each language, 1:18:58.823 --> 1:19:03.388 so it may have seen German and English, but never German-English parallel data. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It has only seen attention within the same language, but it mainly has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Another thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART 1:19:29.709 --> 1:19:35.218 and you put some adapters inside the network, so small new layers 1:19:35.218 --> 1:19:40.790 which are inserted in between, and then you only train these adapters, or you at least also train 1:19:40.790 --> 1:19:41.815 these adapters. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could see that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change the rest that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that, 1:19:59.793 --> 1:20:05.225 and these adapters are quite small and so very efficient. 1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language-specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language, 1:20:22.247 --> 1:20:33.777 while the rest of the model is shared and has the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one challenge: in general, in multilingual systems, 1:20:39.058 --> 1:20:40.439 this works quite well. 1:20:40.439 --> 1:20:46.161 There's one specific use case for multilingual models where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Exactly, because these adapters might be very language-specific, and for zero-shot 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language-independent; with the 1:21:09.015 --> 1:21:10.184 adapters, however, 1:21:10.184 --> 1:21:15.601 you again get representations which are more language-specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And there is also the idea of doing knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is: okay, 1:21:42.923 --> 1:21:54.157 we are training the MT system the same as before, but what we additionally want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these hidden states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare the hidden states of your encoder with those of the pre-trained model and try to make them similar, 1:22:12.192 --> 1:22:18.144 for example by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This requires the same vocabulary. Why does it need the same vocabulary, any idea?
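Coming back briefly to the adapters described above (the vocabulary question is taken up right below), here is a minimal adapter sketch in PyTorch with illustrative dimensions: a small bottleneck with a residual connection that is inserted after a frozen pre-trained sub-layer, so only these few parameters need to be trained, for example one adapter per language in a modular multilingual setup.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # project down to a small dimension
        self.up = nn.Linear(bottleneck, d_model)      # and back up
        self.act = nn.ReLU()

    def forward(self, hidden):                        # hidden: (batch, seq, d_model)
        return hidden + self.up(self.act(self.down(hidden)))   # residual keeps the pre-trained signal intact

adapter = Adapter()
out = adapter(torch.randn(2, 10, 512))                # would be plugged in after a frozen transformer layer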
1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, you typically also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states on one side and four states here, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare with which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like requiring the average over the sequence to be the same, but that is 1:23:30.030 --> 1:23:33.071 a less strong signal. 1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a direct training signal on the encoder, 1:23:42.979 --> 1:23:51.455 so you can directly push the encoder towards good representations, while normally 1:23:51.455 --> 1:23:52.407 in MT the encoder only gets an indirect signal through the decoder (a small sketch of this idea follows a bit further below). 1:23:56.936 --> 1:24:13.197 Yes, I think this is most of it for today, so what you should keep in mind are two main ideas. 1:24:13.393 --> 1:24:18.400 The one is the back-translation idea, 1:24:18.400 --> 1:24:29.561 if you have monolingual data and want to use it; the other one is to use pre-trained models. And it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained models, but then you can still do back-translation, where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 There we have the advantage that we are training everything to work together on the task, so 1:24:45.502 --> 1:24:51.093 it might be helpful even to back-translate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup, because in pre-training, of course, the big challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on a different task. 1:24:58.058 --> 1:25:03.327 There are different ways of how you integrate this knowledge, 1:25:03.327 --> 1:25:08.089 even if you just use the full model. 1:25:08.748 --> 1:25:11.128 This is the most similar you can get: 1:25:11.128 --> 1:25:13.945 you're doing no changes to the architecture, 1:25:13.945 --> 1:25:19.643 you're really taking the model and just fine-tuning it on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to learn completely from scratch how to do the cross-lingual attention and so on. 1:25:24.464 --> 1:25:29.971 And for that it might, for example, be helpful to have more back-translated data to learn it. 1:25:32.192 --> 1:25:34.251 That's it for today. 1:25:34.251 --> 1:25:44.661 There's one important thing: next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail via ILIAS that there's a room change for Tuesday and 1:25:56.769 --> 1:25:57.426 which room it is. 1:25:57.637 --> 1:26:03.890 Are there more questions? Yes, I have a more general question: in computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your dataset with data augmentation. 1:26:07.347 --> 1:26:08.295 Is there anything 1:26:08.388 --> 1:26:15.301 similar to enlarge speech or text data? 1:26:15.755 --> 1:26:29.176 You can use this back-translation and also masking, but back-translation is in some way data augmentation.
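Returning to the hidden-state knowledge distillation described above, here is a minimal sketch (PyTorch; the sizes are illustrative and random tensors stand in for the real encoder outputs) of the auxiliary L2/MSE loss that pulls the NMT encoder states towards those of a frozen pre-trained encoder over the same input; the position-wise comparison is what requires equal sequence lengths and hence the same vocabulary.

import torch
import torch.nn.functional as F

batch, seq_len, d_model = 4, 12, 512                                   # illustrative sizes
nmt_states = torch.randn(batch, seq_len, d_model, requires_grad=True)  # stand-in for the NMT encoder output
with torch.no_grad():
    teacher_states = torch.randn(batch, seq_len, d_model)              # stand-in for the frozen pre-trained encoder

distill_loss = F.mse_loss(nmt_states, teacher_states)                  # compares state t with state t, so equal lengths are needed
# total_loss = translation_cross_entropy + weight * distill_loss       # the weighting factor is a tuning knob, not from the lecture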
1:26:31.371 --> 1:26:35.629 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have a good MT system, it can also be applied to parallel data. 1:26:54.834 --> 1:26:59.139 So I would say this is the most similar thing. 1:26:59.139 --> 1:27:03.143 There are also ways to do paraphrasing. 1:27:05.025 --> 1:27:12.057 But, for example, it is very hard to do this by rules, like deciding which words to replace, because 1:27:12.057 --> 1:27:18.936 you cannot always say that this word can always be replaced by that one. 1:27:19.139 --> 1:27:27.225 I mean, although there are many near-synonyms, normally they fit in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do it rule-based, you have to train a model for the paraphrasing. 1:27:38.058 --> 1:27:57.236 Does the model need the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension. 1:28:00.000 --> 1:28:01.590 As for the architecture, 1:28:01.590 --> 1:28:05.452 we will later learn, in the lecture on efficiency, 1:28:05.452 --> 1:28:12.948 that you can also do knowledge distillation with, for example, smaller models. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible, but yes, I agree it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yeah, to the other question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 Good, then thanks a lot, and we'll see each other again on Tuesday.
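Finally, as a rough illustration of the back-translation idea referred to in the summary and in the last answer above, a sketch of the data-augmentation loop. The helper names (reverse_model.translate and the commented training call) are hypothetical placeholders, not a real API.

def back_translate(target_monolingual, reverse_model):
    # reverse_model is an assumed target-to-source MT system with a translate() method
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        src_guess = reverse_model.translate(tgt_sentence)   # machine-generated, possibly noisy source side
        synthetic_pairs.append((src_guess, tgt_sentence))   # the target side stays human-quality
    return synthetic_pairs

# training_data = real_parallel + back_translate(target_monolingual, reverse_model)
# A common starting point is to mix roughly equal amounts of real and synthetic pairs.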
0:05:52.132 --> 0:05:58.461 So there is in the Internet quite a lot of data which has been company websites which 0:05:58.461 --> 0:06:01.626 have been translated and things like that. 0:06:01.626 --> 0:06:05.158 So how can you extract them parallel fragments? 0:06:06.566 --> 0:06:13.404 That is typically more noisy than where you do more at hands where mean if you have Parliament. 0:06:13.693 --> 0:06:17.680 You can do some rules how to extract parallel things. 0:06:17.680 --> 0:06:24.176 Here there is more to it, so the quality is later maybe not as good, but normally scale 0:06:24.176 --> 0:06:26.908 is then a possibility to address it. 0:06:26.908 --> 0:06:30.304 So you just have so much more data that even. 0:06:33.313 --> 0:06:40.295 The other thing can be used monolingual data and monolingual data has a big advantage that 0:06:40.295 --> 0:06:46.664 we can have a huge amount of that so that you can be autocrawed from the Internet. 0:06:46.664 --> 0:06:51.728 The nice thing is you can also get it typically for many domains. 0:06:52.352 --> 0:06:59.558 There is just so much more magnitude of monolingual data so that it might be very helpful. 0:06:59.559 --> 0:07:03.054 We can do that in statistical machine translation. 0:07:03.054 --> 0:07:06.755 It was quite easy to integrate using language models. 0:07:08.508 --> 0:07:16.912 In neural machine translation we have the advantage that we have this overall architecture 0:07:16.912 --> 0:07:22.915 that does everything together, but it has also the disadvantage. 0:07:23.283 --> 0:07:25.675 We'll look today at two things. 0:07:25.675 --> 0:07:32.925 On the one end you can still try to do a bit of language modeling in there and add an additional 0:07:32.925 --> 0:07:35.168 language model into in there. 0:07:35.168 --> 0:07:38.232 There is some work, one very successful. 0:07:38.178 --> 0:07:43.764 A way in which I think is used in most systems at the moment is to do some scientific data. 0:07:43.763 --> 0:07:53.087 Is a very easy thing, but you can just translate there and use it as training gator, and normally. 0:07:53.213 --> 0:07:59.185 And thereby you are able to use like some type of monolingual a day. 0:08:00.380 --> 0:08:05.271 Another way to do it is unsupervised and the extreme case. 0:08:05.271 --> 0:08:11.158 If you have a scenario then you only have data, only monolingual data. 0:08:11.158 --> 0:08:13.976 Can you still build translations? 0:08:14.754 --> 0:08:27.675 If you have large amounts of data and languages are not too dissimilar, you can build translation 0:08:27.675 --> 0:08:31.102 systems without parallel. 0:08:32.512 --> 0:08:36.267 That we will see you then next Thursday. 0:08:37.857 --> 0:08:50.512 And then there is now a third type of pre-trained model that recently became very successful 0:08:50.512 --> 0:08:55.411 and now with large language models. 0:08:55.715 --> 0:09:03.525 So the idea is we are no longer sharing the real data, but it can also help to train a 0:09:03.525 --> 0:09:04.153 model. 0:09:04.364 --> 0:09:11.594 And that is now a big advantage of deep learning based approaches. 0:09:11.594 --> 0:09:22.169 There you have this ability that you can train a model in some task and then apply it to another. 0:09:22.722 --> 0:09:33.405 And then, of course, the question is, can I have an initial task where there's huge amounts 0:09:33.405 --> 0:09:34.450 of data? 
0:09:34.714 --> 0:09:40.251 And the test that typically you pre train on is more like similar to a language moral 0:09:40.251 --> 0:09:45.852 task either direct to a language moral task or like a masking task which is related so 0:09:45.852 --> 0:09:51.582 the idea is oh I can train on this data and the knowledge about words how they relate to 0:09:51.582 --> 0:09:53.577 each other I can use in there. 0:09:53.753 --> 0:10:00.276 So it's a different way of using language models. 0:10:00.276 --> 0:10:06.276 There's more transfer learning at the end of. 0:10:09.029 --> 0:10:17.496 So first we will start with how can we use monolingual data to do a Yeah to do a machine 0:10:17.496 --> 0:10:18.733 translation? 0:10:20.040 --> 0:10:27.499 That: Big difference is you should remember from what I mentioned before is. 0:10:27.499 --> 0:10:32.783 In statistical machine translation we directly have the opportunity. 0:10:32.783 --> 0:10:39.676 There's peril data for the translation model and monolingual data for the language model. 0:10:39.679 --> 0:10:45.343 And you combine your translation model and language model, and then you can make use of 0:10:45.343 --> 0:10:45.730 both. 0:10:46.726 --> 0:10:53.183 That you can make use of these large large amounts of monolingual data, but of course 0:10:53.183 --> 0:10:55.510 it has also some disadvantage. 0:10:55.495 --> 0:11:01.156 Because we say the problem is we are optimizing both parts a bit independently to each other 0:11:01.156 --> 0:11:06.757 and we say oh yeah the big disadvantage of newer machine translations now we are optimizing 0:11:06.757 --> 0:11:10.531 the overall architecture everything together to perform best. 0:11:10.890 --> 0:11:16.994 And then, of course, we can't do there, so Leo we can can only do a mural like use power 0:11:16.994 --> 0:11:17.405 data. 0:11:17.897 --> 0:11:28.714 So the question is, but this advantage is not so important that we can train everything, 0:11:28.714 --> 0:11:35.276 but we have a moral legal data or even small amounts. 0:11:35.675 --> 0:11:43.102 So in data we know it's not only important the amount of data we have but also like how 0:11:43.102 --> 0:11:50.529 similar it is to your test data so it can be that this modeling data is quite small but 0:11:50.529 --> 0:11:55.339 it's very well fitting and then it's still very helpful. 0:11:55.675 --> 0:12:02.691 At the first year of surprisingness, if we are here successful with integrating a language 0:12:02.691 --> 0:12:09.631 model into a translation system, maybe we can also integrate some type of language models 0:12:09.631 --> 0:12:14.411 into our empty system in order to make it better and perform. 0:12:16.536 --> 0:12:23.298 The first thing we can do is we know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:31.096 There was our language model because these works were mainly done before transformer-based 0:12:31.096 --> 0:12:31.753 models. 0:12:32.152 --> 0:12:38.764 In general, of course, you can do the same thing with transformer baseball. 0:12:38.764 --> 0:12:50.929 There is nothing about whether: It's just that it has mainly been done before people 0:12:50.929 --> 0:13:01.875 started using R&S and they tried to do this more in cases. 0:13:07.087 --> 0:13:22.938 So what we're happening here is in some of this type of idea, and in key system you remember 0:13:22.938 --> 0:13:25.495 the attention. 0:13:25.605 --> 0:13:29.465 Gets it was your last in this day that you calculate easy attention. 
0:13:29.729 --> 0:13:36.610 We get the context back, then combine both and then base the next in state and then predict. 0:13:37.057 --> 0:13:42.424 So this is our system, and the question is, can we send our integrated language model? 0:13:42.782 --> 0:13:49.890 And somehow it makes sense to take out a neural language model because we are anyway in the 0:13:49.890 --> 0:13:50.971 neural space. 0:13:50.971 --> 0:13:58.465 It's not surprising that it contrasts to statistical work used and grants it might make sense to 0:13:58.465 --> 0:14:01.478 take a bit of a normal language model. 0:14:01.621 --> 0:14:06.437 And there would be something like on Tubbles Air, a neural language model, and our man based 0:14:06.437 --> 0:14:11.149 is you have a target word, you put it in, you get a new benchmark, and then you always put 0:14:11.149 --> 0:14:15.757 in the words and get new hidden states, and you can do some predictions at the output to 0:14:15.757 --> 0:14:16.948 predict the next word. 0:14:17.597 --> 0:14:26.977 So if we're having this type of in language model, there's like two main questions we have 0:14:26.977 --> 0:14:34.769 to answer: So how do we combine now on the one hand our system and on the other hand our 0:14:34.769 --> 0:14:35.358 model? 0:14:35.358 --> 0:14:42.004 You see that was mentioned before when we started talking about ENCODA models. 0:14:42.004 --> 0:14:45.369 They can be viewed as a language model. 0:14:45.805 --> 0:14:47.710 The wine is lengthened, unconditioned. 0:14:47.710 --> 0:14:49.518 It's just modeling the target sides. 0:14:49.970 --> 0:14:56.963 And the other one is a conditional language one, which is a language one conditioned on 0:14:56.963 --> 0:14:57.837 the Sewer. 0:14:58.238 --> 0:15:03.694 So how can you combine to language models? 0:15:03.694 --> 0:15:14.860 Of course, it's like the translation model will be more important because it has access 0:15:14.860 --> 0:15:16.763 to the source. 0:15:18.778 --> 0:15:22.571 If we have that, the other question is okay. 0:15:22.571 --> 0:15:24.257 Now we have models. 0:15:24.257 --> 0:15:25.689 How do we train? 0:15:26.026 --> 0:15:30.005 Pickers integrated them. 0:15:30.005 --> 0:15:34.781 We have now two sets of data. 0:15:34.781 --> 0:15:42.741 We have parallel data where you can do the lower. 0:15:44.644 --> 0:15:53.293 So the first idea is we can do something more like a parallel combination. 0:15:53.293 --> 0:15:55.831 We just keep running. 0:15:56.036 --> 0:15:59.864 So here you see your system that is running. 0:16:00.200 --> 0:16:09.649 It's normally completely independent of your language model, which is up there, so down 0:16:09.649 --> 0:16:13.300 here we have just our NMT system. 0:16:13.313 --> 0:16:26.470 The only thing which is used is we have the words, and of course they are put into both 0:16:26.470 --> 0:16:30.059 systems, and out there. 0:16:30.050 --> 0:16:42.221 So we use them somehow for both, and then we are doing our decision just by merging these 0:16:42.221 --> 0:16:42.897 two. 0:16:43.343 --> 0:16:53.956 So there can be, for example, we are doing a probability distribution here, and then we 0:16:53.956 --> 0:17:03.363 are taking the average of post-perability distribution to do our predictions. 0:17:11.871 --> 0:17:18.923 You could also take the output with Steve's to be more in chore about the mixture. 0:17:20.000 --> 0:17:32.896 Yes, you could also do that, so it's more like engaging mechanisms that you're not doing. 
0:17:32.993 --> 0:17:41.110 Another one would be cochtrinate the hidden states, and then you would have another layer 0:17:41.110 --> 0:17:41.831 on top. 0:17:43.303 --> 0:17:56.889 You think about if you do the conqueredination instead of taking the instead and then merging 0:17:56.889 --> 0:18:01.225 the probability distribution. 0:18:03.143 --> 0:18:16.610 Introduce many new parameters, and these parameters have somehow something special compared to 0:18:16.610 --> 0:18:17.318 the. 0:18:23.603 --> 0:18:37.651 So before all the error other parameters can be trained independent, the language model 0:18:37.651 --> 0:18:42.121 can be trained independent. 0:18:43.043 --> 0:18:51.749 If you have a joint layer, of course you need to train them because you have now inputs. 0:18:54.794 --> 0:19:02.594 Not surprisingly, if you have a parallel combination of whether you could, the other way is to do 0:19:02.594 --> 0:19:04.664 more serial combinations. 0:19:04.924 --> 0:19:10.101 How can you do a similar combination? 0:19:10.101 --> 0:19:18.274 Your final decision makes sense to do a face on the system. 0:19:18.438 --> 0:19:20.996 So you have on top of your normal and system. 0:19:21.121 --> 0:19:30.678 The only thing is now you're inputting into your system. 0:19:30.678 --> 0:19:38.726 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:45.588 So you're training your mainly what you have your lower layers here which are trained more 0:19:45.588 --> 0:19:52.183 on the purely language model style and then on top your putting into the NMT system where 0:19:52.183 --> 0:19:55.408 it now has already here the language model. 0:19:55.815 --> 0:19:58.482 So here you can also view it. 0:19:58.482 --> 0:20:06.481 Here you have more contextual embeddings which no longer depend only on the word but they 0:20:06.481 --> 0:20:10.659 also depend on the context of the target site. 0:20:11.051 --> 0:20:19.941 But you have more understanding of the source word, so you have a language in the current 0:20:19.941 --> 0:20:21.620 target sentence. 0:20:21.881 --> 0:20:27.657 So if it's like the word can, for example, will be put in here always the same independent 0:20:27.657 --> 0:20:31.147 of its user can of beans, or if it's like I can do it. 0:20:31.147 --> 0:20:37.049 However, because you are having your language model style, you have maybe disintegrated this 0:20:37.049 --> 0:20:40.984 already a bit, and you give this information directly to the. 0:20:41.701 --> 0:20:43.095 An empty cyst. 0:20:44.364 --> 0:20:49.850 You, if you're remembering more the transformer based approach, you have some layers. 0:20:49.850 --> 0:20:55.783 The lower layers are purely languaged while the other ones are with attention to the source. 0:20:55.783 --> 0:21:01.525 So you can view it also that you just have lower layers which don't attend to the source. 0:21:02.202 --> 0:21:07.227 This is purely a language model, and then at some point you're starting to attend to 0:21:07.227 --> 0:21:08.587 the source and use it. 0:21:13.493 --> 0:21:20.781 Yes, so this is how you combine them in peril or first do the language model and then do. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.102 Model, and in this case in the sequence. 0:21:38.278 --> 0:21:53.199 Case so the actual word that we transferred into a numerical lecture, and this is an input 0:21:53.199 --> 0:21:54.838 into the. 
0:21:56.176 --> 0:22:03.568 That depends on if you view the word embedding as part of the language model. 0:22:03.568 --> 0:22:10.865 So if you first put the word target word then you do the one hot end coding. 0:22:11.691 --> 0:22:13.805 And then the word embedding there is the r& 0:22:13.805 --> 0:22:13.937 n. 0:22:14.314 --> 0:22:21.035 So you can use this together as your language model when you first do the word embedding. 0:22:21.401 --> 0:22:24.346 All you can say is like before. 0:22:24.346 --> 0:22:28.212 It's more a definition, but you're right. 0:22:28.212 --> 0:22:30.513 So what's the steps out? 0:22:30.513 --> 0:22:36.128 You take the word, the one hut encoding, the word embedding. 0:22:36.516 --> 0:22:46.214 What one of these parrots, you know, called a language model is definition wise and not 0:22:46.214 --> 0:22:47.978 that important. 0:22:53.933 --> 0:23:02.264 So the question is how can you then train them and make this this one work? 0:23:02.264 --> 0:23:02.812 The. 0:23:03.363 --> 0:23:15.201 So in the case where you combine the language one of the abilities you can train them independently 0:23:15.201 --> 0:23:18.516 and just put them together. 0:23:18.918 --> 0:23:27.368 Might not be the best because we have no longer the stability that we had before that optimally 0:23:27.368 --> 0:23:29.128 performed together. 0:23:29.128 --> 0:23:33.881 It's not clear if they really work the best together. 0:23:34.514 --> 0:23:41.585 At least you need to somehow find how much do you trust the one model and how much. 0:23:43.323 --> 0:23:45.058 Still in some cases useful. 0:23:45.058 --> 0:23:48.530 It might be helpful if you have only data and software. 0:23:48.928 --> 0:23:59.064 However, in MT we have one specific situation that at least for the MT part parallel is also 0:23:59.064 --> 0:24:07.456 always monolingual data, so what we definitely can do is train the language. 0:24:08.588 --> 0:24:18.886 So what we also can do is more like the pre-training approach. 0:24:18.886 --> 0:24:24.607 We first train the language model. 0:24:24.704 --> 0:24:27.334 The pre-training approach. 0:24:27.334 --> 0:24:33.470 You first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:41.143 Of course, the model size is this way, but the data size is too bigly the other way around. 0:24:41.143 --> 0:24:47.883 You often have a lot more monolingual data than you have here parallel data, in which 0:24:47.883 --> 0:24:52.350 scenario can you imagine where this type of pretraining? 0:24:56.536 --> 0:24:57.901 Any Ideas. 0:25:04.064 --> 0:25:12.772 One example where this might also be helpful if you want to adapt to domains. 0:25:12.772 --> 0:25:22.373 So let's say you do medical sentences and if you want to translate medical sentences. 0:25:23.083 --> 0:25:26.706 In this case it could be or its most probable happen. 0:25:26.706 --> 0:25:32.679 You're learning here up there what medical means, but in your fine tuning step the model 0:25:32.679 --> 0:25:38.785 is forgotten everything about Medicare, so you may be losing all the information you gain. 0:25:39.099 --> 0:25:42.366 So this type of priest training step is good. 0:25:42.366 --> 0:25:47.978 If your pretraining data is more general, very large and then you're adapting. 0:25:48.428 --> 0:25:56.012 But in the task with moral lingual data, which should be used to adapt the system to some 0:25:56.012 --> 0:25:57.781 general topic style. 
0:25:57.817 --> 0:26:06.795 Then, of course, this is not a good strategy because you might forgot about everything up 0:26:06.795 --> 0:26:09.389 there and you don't have. 0:26:09.649 --> 0:26:14.678 So then you have to check what you can do for them. 0:26:14.678 --> 0:26:23.284 You can freeze this part and change it any more so you don't lose the ability or you can 0:26:23.284 --> 0:26:25.702 do a direct combination. 0:26:25.945 --> 0:26:31.028 Where you jointly train both of them, so you train the NMT system on the, and then you train 0:26:31.028 --> 0:26:34.909 the language model always in parallels so that you don't forget about. 0:26:35.395 --> 0:26:37.684 And what you learn of the length. 0:26:37.937 --> 0:26:46.711 Depends on what you want to combine because it's large data and you have a good general 0:26:46.711 --> 0:26:48.107 knowledge in. 0:26:48.548 --> 0:26:55.733 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:55.733 --> 0:26:57.295 something specific. 0:26:57.295 --> 0:26:58.075 Then you. 0:27:01.001 --> 0:27:06.676 Then this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:12.116 It seems to be the easiest one somehow. 0:27:12.116 --> 0:27:20.103 It's more similar to what we are doing with statistical machine translation. 0:27:21.181 --> 0:27:31.158 Normally always beats this type of model, which in some view can be like from the conceptual 0:27:31.158 --> 0:27:31.909 thing. 0:27:31.909 --> 0:27:36.844 It's even easier from the computational side. 0:27:40.560 --> 0:27:42.078 And the idea is OK. 0:27:42.078 --> 0:27:49.136 We have monolingual data that we just translate and then generate some type of parallel data 0:27:49.136 --> 0:27:50.806 and use that then to. 0:27:51.111 --> 0:28:00.017 So if you want to build a German-to-English system first, take the large amount of data 0:28:00.017 --> 0:28:02.143 you have translated. 0:28:02.402 --> 0:28:10.446 Then you have more peril data and the interesting thing is if you then train on the joint thing 0:28:10.446 --> 0:28:18.742 or on the original peril data and on what is artificial where you have generated the translations. 0:28:18.918 --> 0:28:26.487 So you can because you are not doing the same era all the times and you have some knowledge. 0:28:28.028 --> 0:28:43.199 With this first approach, however, there is one issue why it might not work the best. 0:28:49.409 --> 0:28:51.177 Very a bit shown in the image to you. 0:28:53.113 --> 0:28:58.153 You trade on that quality data. 0:28:58.153 --> 0:29:02.563 Here is a bit of a problem. 0:29:02.563 --> 0:29:08.706 Your English style is not really good. 0:29:08.828 --> 0:29:12.213 And as you're saying, the system always mistranslates. 0:29:13.493 --> 0:29:19.798 Something then you will learn that this is correct because now it's a training game and 0:29:19.798 --> 0:29:23.022 you will encourage it to make it more often. 0:29:23.022 --> 0:29:29.614 So the problem with training on your own areas yeah you might prevent some areas you rarely 0:29:29.614 --> 0:29:29.901 do. 0:29:30.150 --> 0:29:31.749 But errors use systematically. 0:29:31.749 --> 0:29:34.225 Do you even enforce more and will even do more? 0:29:34.654 --> 0:29:40.145 So that might not be the best solution to have any idea how you could do it better. 0:29:44.404 --> 0:29:57.754 Is one way there is even a bit of more simple idea. 
0:30:04.624 --> 0:30:10.975 The problem is, yeah, the translations are not perfect, so the outputs are noisy and you're learning 0:30:10.975 --> 0:30:12.188 something wrong. 0:30:12.188 --> 0:30:17.969 Normally it's less bad if your inputs are not perfect but your outputs are perfect. 0:30:18.538 --> 0:30:24.284 So if your inputs are wrong, you may learn that from this wrong input you're 0:30:24.284 --> 0:30:30.162 generating something correct, but you're not learning to generate something which is not 0:30:30.162 --> 0:30:30.756 correct. 0:30:31.511 --> 0:30:47.124 So it is often the case that it is more important that your target side is correct. 0:30:47.347 --> 0:30:52.182 And you can assume in your application scenario that you hopefully only get correct inputs. 0:30:52.572 --> 0:31:02.535 So that is not harming you, and in machine translation we have one very nice advantage: 0:31:02.762 --> 0:31:04.648 the task also exists the other way around. 0:31:04.648 --> 0:31:10.062 It's a very similar task: there is the task to translate from German to English, and the 0:31:10.062 --> 0:31:13.894 task to translate from English to German is very similar. 0:31:14.094 --> 0:31:19.309 So what we can do is just switch it initially and generate the data the other way 0:31:19.309 --> 0:31:19.778 around. 0:31:20.120 --> 0:31:25.959 So what we are doing here is we are starting with an English-to-German system. 0:31:25.959 --> 0:31:32.906 Then we are translating the English monolingual data into German, where the German is maybe not very nice. 0:31:33.293 --> 0:31:51.785 And then we are training on our original data and on the back-translated data. 0:31:52.632 --> 0:32:02.332 So here we have the advantage that our target side is human quality and only the input is synthetic. 0:32:03.583 --> 0:32:08.113 Then this helps us to get a really good system. 0:32:08.113 --> 0:32:15.431 There is one difference if you think about the data resources. 0:32:21.341 --> 0:32:27.336 The obvious one: here we need target-side monolingual data. 0:32:27.336 --> 0:32:31.574 In the first example we had source-side monolingual data. 0:32:31.931 --> 0:32:45.111 So back-translation normally works if you have target-side monolingual data, and not source- 0:32:45.111 --> 0:32:48.152 side monolingual data. 0:32:48.448 --> 0:32:56.125 It might also, if you think about it, be intuitive: it is a bit more important to model the 0:32:56.125 --> 0:32:56.823 target well. 0:32:57.117 --> 0:33:01.469 On the source side you have to understand the content. 0:33:01.469 --> 0:33:08.749 On the target side you have to generate real sentences, and somehow it's more difficult to 0:33:08.749 --> 0:33:12.231 generate something than to only understand it. 0:33:17.617 --> 0:33:30.734 This works quite well; you then have to select how much back-translated data you use. 0:33:31.051 --> 0:33:32.983 Because normally there is a lot more monolingual data. 0:33:33.253 --> 0:33:42.136 So the question is: should I take all of my monolingual data? There are two problems with it. 0:33:42.136 --> 0:33:51.281 Of course it's expensive, because you have to translate all this data. 0:33:51.651 --> 0:34:00.946 So if you don't know, a normally good starting point is to take an equal amount of 0:34:00.946 --> 0:34:02.663 back-translated data as parallel data. 0:34:02.963 --> 0:34:04.673 It depends on the use case. 0:34:04.673 --> 0:34:08.507 If we have very little parallel data, it makes more sense to have more back-translated data. 0:34:08.688 --> 0:34:15.224 It also depends on how good your reverse system is: the better it is, the more data you might use, because 0:34:15.224 --> 0:34:16.574 the quality of the synthetic data is better.
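As a concrete sketch of the back-translation setup just described (the `.translate()` method and the helper names are assumptions for illustration): you translate target-side monolingual data with the reverse system and keep the human text on the target side.

```python
def back_translate(reverse_mt, target_mono_sents):
    """Create synthetic parallel data for a German->English system.

    reverse_mt       : an already trained English->German system
                       (any object with a .translate(sentence) method;
                       the interface is made up for this sketch)
    target_mono_sents: monolingual sentences in the *target* language
                       (English), which stay human-quality.
    Returns (source, target) pairs whose source side is synthetic."""
    synthetic_pairs = []
    for en_sentence in target_mono_sents:
        de_synthetic = reverse_mt.translate(en_sentence)  # may be noisy
        synthetic_pairs.append((de_synthetic, en_sentence))
    return synthetic_pairs

# Mix real and synthetic data, e.g. roughly 1:1 as suggested in the lecture:
# train_data = real_parallel + back_translate(en2de_model,
#                                             english_mono[:len(real_parallel)])
```

Iterating the procedure, i.e. retraining the reverse system on the enlarged data and translating again, gives the iterated back-translation described below; how much synthetic data to mix in remains the ratio question.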
0:34:16.574 --> 0:34:22.755 So it depends on a lot of things, but your rule of sum is like which general way often 0:34:22.755 --> 0:34:24.815 is to have equal amounts of. 0:34:26.646 --> 0:34:29.854 And you can, of course, do that now. 0:34:29.854 --> 0:34:34.449 I said already that it's better to have the quality. 0:34:34.449 --> 0:34:38.523 At the end, of course, depends on this system. 0:34:38.523 --> 0:34:46.152 Also, because the better this system is, the better your synthetic data is, the better. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.917 So you play them on English to German, and you translate the data on. 0:34:56.957 --> 0:35:03.198 Then you train a model on German to English with the additional data. 0:35:03.198 --> 0:35:09.796 Then you translate German data and then you train to gain your first one. 0:35:09.796 --> 0:35:14.343 So in the second iteration this quality is better. 0:35:14.334 --> 0:35:19.900 System is better because it's not only trained on the small data but additionally on back 0:35:19.900 --> 0:35:22.003 translated data with this system. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:28.053 However, typically you can stop quite early. 0:35:28.053 --> 0:35:35.068 Maybe one iteration is good, but then you have diminishing gains after two or three iterations. 0:35:35.935 --> 0:35:46.140 There is very slight difference because you need a quite big difference in the quality 0:35:46.140 --> 0:35:46.843 here. 0:35:47.207 --> 0:36:02.262 Language is also good because it means you can already train it with relatively bad profiles. 0:36:03.723 --> 0:36:10.339 It's a design decision would advise so guess because it's easy to get it. 0:36:10.550 --> 0:36:20.802 Replace that because you have a higher quality real data, but then I think normally it's okay 0:36:20.802 --> 0:36:22.438 to replace it. 0:36:22.438 --> 0:36:28.437 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:42.014 That's about like using monolingual data before we go into the pre-train models to have any 0:36:42.014 --> 0:36:43.005 more crash. 0:36:49.029 --> 0:36:55.740 Yes, so the other thing which we can do and which is recently more and more successful 0:36:55.740 --> 0:37:02.451 and even more successful since we have this really large language models where you can 0:37:02.451 --> 0:37:08.545 even do the translation task with this is the way of using pre-trained models. 0:37:08.688 --> 0:37:16.135 So you learn a representation of one task, and then you use this representation from another. 0:37:16.576 --> 0:37:26.862 It was made maybe like one of the first words where it really used largely is doing something 0:37:26.862 --> 0:37:35.945 like a bird which you pre trained on purely text era and you take it in fine tune. 0:37:36.496 --> 0:37:42.953 And one big advantage, of course, is that people can only share data but also pre-trained. 0:37:43.423 --> 0:37:59.743 The recent models and the large language ones which are available. 0:37:59.919 --> 0:38:09.145 Where I think it costs several millions to train them all, just if you would buy the GPUs 0:38:09.145 --> 0:38:15.397 from some cloud company and train that the cost of training. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.598 So another idea is what you can do is okay. 
0:38:24.598 --> 0:38:27.330 Maybe if these months are once available,. 0:38:27.467 --> 0:38:36.598 Can take them and use them as an also resource similar to pure text, and you can now build 0:38:36.598 --> 0:38:44.524 models which somehow learn not only from from data but also from other models. 0:38:44.844 --> 0:38:49.127 So it's a quite new way of thinking of how to train. 0:38:49.127 --> 0:38:53.894 We are not only learning from examples, but we might also. 0:38:54.534 --> 0:39:05.397 The nice thing is that this type of training where we are not learning directly from data 0:39:05.397 --> 0:39:07.087 but learning. 0:39:07.427 --> 0:39:17.647 So the main idea this go is you have a person initial task. 0:39:17.817 --> 0:39:26.369 And if you're working with anLP, that means you're training pure taxator because that's 0:39:26.369 --> 0:39:30.547 where you have the largest amount of data. 0:39:30.951 --> 0:39:35.857 And then you're defining some type of task in order to do your creek training. 0:39:36.176 --> 0:39:43.092 And: The typical task you can train on on that is like the language waddling task. 0:39:43.092 --> 0:39:50.049 So to predict the next word or we have a related task to predict something in between, we'll 0:39:50.049 --> 0:39:52.667 see depending on the architecture. 0:39:52.932 --> 0:39:58.278 But somehow to predict something which you have not in the input is a task which is easy 0:39:58.278 --> 0:40:00.740 to generate, so you just need your data. 0:40:00.740 --> 0:40:06.086 That's why it's called self supervised, so you're creating your supervised pending data. 0:40:06.366 --> 0:40:07.646 By yourself. 0:40:07.646 --> 0:40:15.133 On the other hand, you need a lot of knowledge and that is the other thing. 0:40:15.735 --> 0:40:24.703 Because there is this idea that the meaning of a word heavily depends on the context that. 0:40:25.145 --> 0:40:36.846 So can give you a sentence with some giverish word and there's some name and although you've 0:40:36.846 --> 0:40:41.627 never heard the name you will assume. 0:40:42.062 --> 0:40:44.149 And exactly the same thing. 0:40:44.149 --> 0:40:49.143 The models can also learn something about the world by just using. 0:40:49.649 --> 0:40:53.651 So that is typically the mule. 0:40:53.651 --> 0:40:59.848 Then we can use this model to train the system. 0:41:00.800 --> 0:41:03.368 Course we might need to adapt the system. 0:41:03.368 --> 0:41:07.648 To do that we have to change the architecture we might use only some. 0:41:07.627 --> 0:41:09.443 Part of the pre-trained model. 0:41:09.443 --> 0:41:14.773 In there we have seen that a bit already in the R&N case you can also see that we have 0:41:14.773 --> 0:41:17.175 also mentioned the pre-training already. 0:41:17.437 --> 0:41:22.783 So you can use the R&N as one of these approaches. 0:41:22.783 --> 0:41:28.712 You train the R&M language more on large pre-train data. 0:41:28.712 --> 0:41:32.309 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:53.924 So you can build a system which is knowledge, which is just trained on large amounts of data. 0:41:56.376 --> 0:42:01.564 So the question is maybe what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.277 And we want today to look at briefly at swings. 0:42:05.725 --> 0:42:08.704 First, that was what was initially done. 
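The self-supervised pre-training task described here, predicting the next word from plain text, can be written down as a small recurrent language model. This is an illustrative PyTorch sketch with assumed sizes, not the exact setup from the lecture.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal next-word-prediction model: the training signal is
    created from plain text alone, which is why it is self-supervised."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len) ids
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)             # (batch, seq_len, vocab)

def lm_loss(model, tokens):
    """Predict token t+1 from the tokens up to t; no labels needed."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

The first and simplest thing to reuse from such a pre-trained model is just its first layer, the word embeddings.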
0:42:08.704 --> 0:42:15.314 It wasn't as famous as in machine translation as in other things, but it's also used there 0:42:15.314 --> 0:42:21.053 and that is to use static word embedding, so just the first step we know here. 0:42:21.221 --> 0:42:28.981 So we have this mapping from the one hot to a small continuous word representation. 0:42:29.229 --> 0:42:38.276 Using this one in your NG system, so you can, for example, replace the embedding layer by 0:42:38.276 --> 0:42:38.779 the. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.517 And we're always in this pre-training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:52.411 More data than the trade off, so you can get better. 0:42:52.411 --> 0:42:59.107 The disadvantage is, does anybody have an idea of what might be the disadvantage of using 0:42:59.107 --> 0:43:00.074 things like. 0:43:04.624 --> 0:43:12.175 What was one mentioned today giving like big advantage of the system compared to previous. 0:43:20.660 --> 0:43:25.134 Where one advantage was the enter end training, so you have the enter end training so that 0:43:25.134 --> 0:43:27.937 all parameters and all components play optimal together. 0:43:28.208 --> 0:43:33.076 If you know pre-train something on one fast, it may be no longer optimal fitting to everything 0:43:33.076 --> 0:43:33.384 else. 0:43:33.893 --> 0:43:37.862 So what do pretending or not? 0:43:37.862 --> 0:43:48.180 It depends on how important everything is optimal together and how important. 0:43:48.388 --> 0:43:50.454 Of large amount. 0:43:50.454 --> 0:44:00.541 The pre-change one is so much better that it's helpful, and the advantage of that. 0:44:00.600 --> 0:44:11.211 Getting everything optimal together, yes, we would use random instructions for raising. 0:44:11.691 --> 0:44:26.437 The problem is you might be already in some area where it's not easy to get. 0:44:26.766 --> 0:44:35.329 But often in some way right, so often it's not about your really worse pre trained monolepsy. 0:44:35.329 --> 0:44:43.254 If you're going already in some direction, and if this is not really optimal for you,. 0:44:43.603 --> 0:44:52.450 But if you're not really getting better because you have a decent amount of data, it's so different 0:44:52.450 --> 0:44:52.981 that. 0:44:53.153 --> 0:44:59.505 Initially it wasn't a machine translation done so much because there are more data in 0:44:59.505 --> 0:45:06.153 MPs than in other tasks, but now with really large amounts of monolingual data we do some 0:45:06.153 --> 0:45:09.403 type of pretraining in currently all state. 0:45:12.632 --> 0:45:14.302 The other one is okay now. 0:45:14.302 --> 0:45:18.260 It's always like how much of the model do you plea track a bit? 0:45:18.658 --> 0:45:22.386 To the other one you can do contextural word embedded. 0:45:22.386 --> 0:45:28.351 That is something like bird or Roberta where you train already a sequence model and the 0:45:28.351 --> 0:45:34.654 embeddings you're using are no longer specific for word but they are also taking the context 0:45:34.654 --> 0:45:35.603 into account. 0:45:35.875 --> 0:45:50.088 The embedding you're using is no longer depending on the word itself but on the whole sentence, 0:45:50.088 --> 0:45:54.382 so you can use this context. 
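Staying with the static case for a moment: plugging pre-trained word embeddings into the translation model typically just means initialising, and optionally freezing, the first layer. A minimal sketch, assuming the pre-trained matrix uses the same vocabulary or subword inventory as the MT system:

```python
import torch
import torch.nn as nn

def embedding_from_pretrained(weights: torch.Tensor,
                              freeze: bool = False) -> nn.Embedding:
    """Initialise the first layer of the NMT model with static word
    embeddings trained on large monolingual data.

    weights: (vocab_size, emb_dim) matrix, e.g. exported from a word2vec
             run on the monolingual corpus (assumed to match the MT vocabulary).
    freeze : True keeps the embeddings fixed, False lets fine-tuning
             adjust them together with the rest of the network."""
    return nn.Embedding.from_pretrained(weights, freeze=freeze)

# Hypothetical usage:
# model.encoder.embed = embedding_from_pretrained(w2v_matrix, freeze=False)
```

For the contextual case just introduced, the pre-trained representation depends on the whole sentence rather than on the word alone.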
0:45:55.415 --> 0:46:02.691 You can use similar things also in the decoder just by having layers which don't have access 0:46:02.691 --> 0:46:12.430 to the source, but there it still might have and these are typically models like: And finally 0:46:12.430 --> 0:46:14.634 they will look at the end. 0:46:14.634 --> 0:46:19.040 You can also have models which are already sequenced. 0:46:19.419 --> 0:46:28.561 So you may be training a sequence to sequence models. 0:46:28.561 --> 0:46:35.164 You have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you'll find tuning. 0:46:47.227 --> 0:46:59.614 But let's first do a bit of step back and look into what are the different things. 0:46:59.614 --> 0:47:02.151 The first thing. 0:47:02.382 --> 0:47:11.063 The wooden bettings are just this first layer and you can train them with feedback annual 0:47:11.063 --> 0:47:12.028 networks. 0:47:12.212 --> 0:47:22.761 But you can also train them with an N language model, and by now you hopefully have also seen 0:47:22.761 --> 0:47:27.699 that you cannot transform a language model. 0:47:30.130 --> 0:47:37.875 So this is how you can train them and you're training them. 0:47:37.875 --> 0:47:45.234 For example, to speak the next word that is the easiest. 0:47:45.525 --> 0:47:55.234 And that is what is now referred to as South Supervised Learning and, for example, all the 0:47:55.234 --> 0:48:00.675 big large language models like Chad GPT and so on. 0:48:00.675 --> 0:48:03.129 They are trained with. 0:48:03.823 --> 0:48:15.812 So that is where you can hopefully learn how a word is used because you always try to previct 0:48:15.812 --> 0:48:17.725 the next word. 0:48:19.619 --> 0:48:27.281 Word embedding: Why do you keep the first look at the word embeddings and the use of 0:48:27.281 --> 0:48:29.985 word embeddings for our task? 0:48:29.985 --> 0:48:38.007 The main advantage was it might be only the first layer where you typically have most of 0:48:38.007 --> 0:48:39.449 the parameters. 0:48:39.879 --> 0:48:57.017 Most of your parameters already on the large data, then on your target data you have to 0:48:57.017 --> 0:48:59.353 train less. 0:48:59.259 --> 0:49:06.527 Big difference that your input size is so much bigger than the size of the novel in size. 0:49:06.626 --> 0:49:17.709 So it's a normally sign, maybe like, but your input and banning size is something like. 0:49:17.709 --> 0:49:20.606 Then here you have to. 0:49:23.123 --> 0:49:30.160 While here you see it's only like zero point five times as much in the layer. 0:49:30.750 --> 0:49:36.534 So here is where most of your parameters are, which means if you already replace the word 0:49:36.534 --> 0:49:41.739 embeddings, they might look a bit small in your overall and in key architecture. 0:49:41.739 --> 0:49:47.395 It's where most of the things are, and if you're doing that you already have really big 0:49:47.395 --> 0:49:48.873 games and can do that. 0:49:57.637 --> 0:50:01.249 The thing is we have seen these were the bettings. 0:50:01.249 --> 0:50:04.295 They can be very good use for other types. 0:50:04.784 --> 0:50:08.994 You learn some general relations between words. 0:50:08.994 --> 0:50:17.454 If you're doing this type of language modeling cast, you predict: The one thing is you have 0:50:17.454 --> 0:50:24.084 a lot of data, so the one question is we want to have data to trade a model. 0:50:24.084 --> 0:50:28.734 The other thing, the tasks need to be somehow useful. 
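To make the earlier point about parameter counts concrete, here is a small back-of-the-envelope calculation with assumed sizes (the exact figures mentioned in the lecture differ): the embedding matrix scales with the vocabulary size and therefore dominates a single hidden layer by a large factor.

```python
# Illustrative parameter count with assumed sizes.
vocab_size, emb_dim, hidden_dim = 50_000, 512, 512

embedding_params = vocab_size * emb_dim          # 25,600,000
hidden_layer_params = hidden_dim * hidden_dim    # 262,144 (weights only)

print(embedding_params / hidden_layer_params)    # roughly two orders of magnitude more
```

Besides the parameter argument, the pre-training task itself has to be informative, as the next example shows.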
0:50:29.169 --> 0:50:43.547 If you would predict the first letter of the word, then you wouldn't learn anything about 0:50:43.547 --> 0:50:45.144 the word. 0:50:45.545 --> 0:50:53.683 And the interesting thing is people have looked at these wood embeddings. 0:50:53.954 --> 0:50:58.550 And looking at the word embeddings. 0:50:58.550 --> 0:51:09.276 You can ask yourself how they look and visualize them by doing dimension reduction. 0:51:09.489 --> 0:51:13.236 Don't know if you and you are listening to artificial intelligence. 0:51:13.236 --> 0:51:15.110 Advanced artificial intelligence. 0:51:15.515 --> 0:51:23.217 We had on yesterday there how to do this type of representation, but you can do this time 0:51:23.217 --> 0:51:29.635 of representation, and now you're seeing interesting things that normally. 0:51:30.810 --> 0:51:41.027 Now you can represent a here in a three dimensional space with some dimension reduction. 0:51:41.027 --> 0:51:46.881 For example, the relation between male and female. 0:51:47.447 --> 0:51:56.625 So this vector between the male and female version of something is always not the same, 0:51:56.625 --> 0:51:58.502 but it's related. 0:51:58.718 --> 0:52:14.522 So you can do a bit of maths, so you do take king, you subtract this vector, add this vector. 0:52:14.894 --> 0:52:17.591 So that means okay, there is really something stored. 0:52:17.591 --> 0:52:19.689 Some information are stored in that book. 0:52:20.040 --> 0:52:22.621 Similar, you can do it with Bob Hansen. 0:52:22.621 --> 0:52:25.009 See here swimming slam walking walk. 0:52:25.265 --> 0:52:34.620 So again these vectors are not the same, but they are related. 0:52:34.620 --> 0:52:42.490 So you learn something from going from here to here. 0:52:43.623 --> 0:52:49.761 Or semantically, the relations between city and capital have exactly the same sense. 0:52:51.191 --> 0:52:56.854 And people had even done that question answering about that if they showed the diembeddings 0:52:56.854 --> 0:52:57.839 and the end of. 0:52:58.218 --> 0:53:06.711 All you can also do is don't trust the dimensions of the reaction because maybe there is something. 0:53:06.967 --> 0:53:16.863 You can also look into what happens really in the individual space. 0:53:16.863 --> 0:53:22.247 What is the nearest neighbor of the. 0:53:22.482 --> 0:53:29.608 So you can take the relationship between France and Paris and add it to Italy and you'll. 0:53:30.010 --> 0:53:33.078 You can do big and bigger and you have small and smaller and stuff. 0:53:33.593 --> 0:53:49.417 Because it doesn't work everywhere, there is also some typical dish here in German. 0:53:51.491 --> 0:54:01.677 You can do what the person is doing for famous ones, of course only like Einstein scientists 0:54:01.677 --> 0:54:06.716 that find midfielders not completely correct. 0:54:06.846 --> 0:54:10.134 You see the examples are a bit old. 0:54:10.134 --> 0:54:15.066 The politicians are no longer they am, but of course. 0:54:16.957 --> 0:54:26.759 What people have done there, especially at the beginning training our end language model, 0:54:26.759 --> 0:54:28.937 was very expensive. 0:54:29.309 --> 0:54:38.031 So one famous model was, but we are not really interested in the language model performance. 0:54:38.338 --> 0:54:40.581 Think something good to keep in mind. 0:54:40.581 --> 0:54:42.587 What are we really interested in? 0:54:42.587 --> 0:54:45.007 Do we really want to have an R&N no? 0:54:45.007 --> 0:54:48.607 In this case we are only interested in this type of mapping. 
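The vector-arithmetic analogies described earlier in this passage can be reproduced with a few lines over any set of word vectors. This is a toy sketch (the `embeddings` dictionary of numpy vectors is an assumption); real toolkits do the same thing with an efficient nearest-neighbour search over the whole vocabulary.

```python
import numpy as np

def analogy(embeddings: dict, a: str, b: str, c: str) -> str:
    """Solve 'a is to b as c is to ?' by vector arithmetic, e.g.
    analogy(E, 'man', 'king', 'woman') should return 'queen'."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                       # exclude the query words themselves
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

Cheaper training objectives that still produce such embeddings are exactly what word2vec introduced.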
0:54:49.169 --> 0:54:55.500 And so very successful for this was word2vec. 0:54:55.535 --> 0:54:56.865 The idea is okay. 0:54:56.865 --> 0:55:03.592 We are not training a real language model, but making it even simpler and doing, for example, 0:55:03.592 --> 0:55:05.513 continuous bag of words. 0:55:05.513 --> 0:55:12.313 We're just having four input tokens and we're predicting what is the word in the middle, and 0:55:12.313 --> 0:55:15.048 this is just like two linear layers. 0:55:15.615 --> 0:55:21.627 So it's even simplifying things and making the calculation faster, because that is what 0:55:21.627 --> 0:55:22.871 we're interested in. 0:55:23.263 --> 0:55:32.897 Or there is the continuous skip-gram model; these are the models which are referred to as 0:55:32.897 --> 0:55:34.004 word2vec. 0:55:34.234 --> 0:55:42.394 There you have one input word and, the other way around, you're predicting the four words 0:55:42.394 --> 0:55:43.585 around it. 0:55:43.585 --> 0:55:45.327 It's very similar. 0:55:45.327 --> 0:55:48.720 The task is in the end very similar. 0:55:51.131 --> 0:56:01.407 Before we are going to the next point, any questions about normal word vectors or word embeddings? 0:56:04.564 --> 0:56:07.794 The next thing is contextual 0:56:07.794 --> 0:56:12.208 word embeddings, and the idea is: word embeddings are helpful. 0:56:12.208 --> 0:56:19.206 However, we might even be able to get more out of monolingual data. 0:56:19.419 --> 0:56:31.732 Because in a static word embedding there is an overlap of the different meanings of an ambiguous word, so it represents, for example, both meanings 0:56:31.732 --> 0:56:33.585 of "can" in it. 0:56:34.834 --> 0:56:40.410 But we might be able to disambiguate this already in the pre-trained model, because the meanings 0:56:40.410 --> 0:56:41.044 are used differently. 0:56:41.701 --> 0:56:53.331 So if we can have a model which can not only represent a word but can also represent the 0:56:53.331 --> 0:56:58.689 meaning of the word within the context, 0:56:59.139 --> 0:57:03.769 then we are going to contextual word embeddings. 0:57:03.769 --> 0:57:07.713 We are really having a representation in the context. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:23.791 The hidden state of an RNN represents what was said so far, but it's focusing on what is the last 0:57:23.791 --> 0:57:29.303 word, so it is some kind of representation of that word in context. 0:57:29.509 --> 0:57:43.758 The first ones doing that were something like the ELMo paper, where instead of the final output they use the hidden states; this here is 0:57:43.758 --> 0:57:48.129 the normal language model setup. 0:57:48.008 --> 0:57:50.714 With the third word you predict the fourth, and so on. 0:57:50.714 --> 0:57:53.004 So you are always predicting the next word. 0:57:53.193 --> 0:57:57.335 The architecture is: you have the word embedding layer and then RNN layers, 0:57:57.335 --> 0:58:03.901 as you see here, for example. And now instead of using only the output at the end, you're using here this 0:58:03.901 --> 0:58:04.254 hidden state. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word, mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:18.610 We can train it in a language-model style, always predicting the next word, but we have 0:58:18.610 --> 0:58:21.088 more information trained in there. 0:58:21.088 --> 0:58:26.123 Therefore, in the MT system it has to learn fewer additional things. 0:58:27.167 --> 0:58:31.261 And this is essentially also what is done currently in GPT.
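Going back to the word2vec objectives at the start of this passage, here is a minimal sketch of the continuous bag-of-words model, showing why it is so cheap: apart from the embedding lookup there is essentially one linear layer (the sizes are assumed).

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words sketch: average the embeddings of the
    surrounding tokens and predict the centre word."""
    def __init__(self, vocab_size: int, emb_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # first linear map
        self.out = nn.Linear(emb_dim, vocab_size)        # second linear map

    def forward(self, context):      # context: (batch, 2*window) token ids
        return self.out(self.embed(context).mean(dim=1))

# loss = nn.functional.cross_entropy(model(context_ids), centre_word_ids)
```

The skip-gram variant simply swaps input and output. Current GPT-style models keep the plain next-word objective instead, just at a much larger scale.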
0:58:31.261 --> 0:58:38.319 The only difference is that we have more layers, bigger size, and we're using transformer neurocell 0:58:38.319 --> 0:58:40.437 potential instead of the RNA. 0:58:40.437 --> 0:58:45.095 But that is how you train like some large language models at the. 0:58:46.746 --> 0:58:55.044 However, if you look at this contextual representation, they might not be perfect. 0:58:55.044 --> 0:59:02.942 So if you think of this one as a contextual representation of the third word,. 0:59:07.587 --> 0:59:16.686 Is representing a three in the context of a sentence, however only in the context of 0:59:16.686 --> 0:59:18.185 the previous. 0:59:18.558 --> 0:59:27.413 However, we have an architecture which can also take both sides and we have used that 0:59:27.413 --> 0:59:30.193 already in the ink holder. 0:59:30.630 --> 0:59:34.264 So we could do the iron easily on your, also in the backward direction. 0:59:34.874 --> 0:59:42.826 By just having the states the other way around and then we couldn't combine the forward and 0:59:42.826 --> 0:59:49.135 the forward into a joint one where we are doing this type of prediction. 0:59:49.329 --> 0:59:50.858 So you have the word embedding. 0:59:51.011 --> 1:00:02.095 Then you have two in the states, one on the forward arm and one on the backward arm, and 1:00:02.095 --> 1:00:10.314 then you can, for example, take the cocagenation of both of them. 1:00:10.490 --> 1:00:23.257 Now this same here represents mainly this word because this is what both puts in it last 1:00:23.257 --> 1:00:30.573 and we know is focusing on what is happening last. 1:00:31.731 --> 1:00:40.469 However, there is a bit of difference when training that as a language model you already 1:00:40.469 --> 1:00:41.059 have. 1:00:43.203 --> 1:00:44.956 Maybe There's Again This Masking. 1:00:46.546 --> 1:00:47.748 That is one solution. 1:00:47.748 --> 1:00:52.995 First of all, why we can't do it is the information you leak it, so you cannot just predict the 1:00:52.995 --> 1:00:53.596 next word. 1:00:53.596 --> 1:00:58.132 If we just predict the next word in this type of model, that's a very simple task. 1:00:58.738 --> 1:01:09.581 You know the next word because it's influencing this hidden state predicting something is not 1:01:09.581 --> 1:01:11.081 a good task. 1:01:11.081 --> 1:01:18.455 You have to define: Because in this case what will end with the system will just ignore these 1:01:18.455 --> 1:01:22.966 estates and what will learn is copy this information directly in here. 1:01:23.343 --> 1:01:31.218 So it would be representing this word and you would have nearly a perfect model because 1:01:31.218 --> 1:01:38.287 you only need to find encoding where you can encode all words somehow in this. 1:01:38.458 --> 1:01:44.050 The only thing can learn is that turn and encode all my words in this upper hidden. 1:01:44.985 --> 1:01:53.779 Therefore, it's not really useful, so we need to find a bit of different ways out. 1:01:55.295 --> 1:01:57.090 There is a masking one. 1:01:57.090 --> 1:02:03.747 I'll come to that shortly just a bit that other things also have been done, so the other 1:02:03.747 --> 1:02:06.664 thing is not to directly combine them. 1:02:06.664 --> 1:02:13.546 That was in the animal paper, so you have them forward R&M and you keep them completely 1:02:13.546 --> 1:02:14.369 separated. 1:02:14.594 --> 1:02:20.458 So you never merged to state. 1:02:20.458 --> 1:02:33.749 At the end, the representation of the word is now from the forward. 
1:02:33.873 --> 1:02:35.953 So it's always the hidden state before the good thing. 1:02:36.696 --> 1:02:41.286 These two you join now to your to the representation. 1:02:42.022 --> 1:02:48.685 And then you have now a representation also about like the whole sentence for the word, 1:02:48.685 --> 1:02:51.486 but there is no information leakage. 1:02:51.486 --> 1:02:58.149 One way of doing this is instead of doing a bidirection along you do a forward pass and 1:02:58.149 --> 1:02:59.815 then join the hidden. 1:03:00.380 --> 1:03:05.960 So you can do that in all layers. 1:03:05.960 --> 1:03:16.300 In the end you do the forwarded layers and you get the hidden. 1:03:16.596 --> 1:03:19.845 However, it's a bit of a complicated. 1:03:19.845 --> 1:03:25.230 You have to keep both separate and merge things so can you do. 1:03:27.968 --> 1:03:33.030 And that is the moment where like the big. 1:03:34.894 --> 1:03:39.970 The big success of the burnt model was used where it okay. 1:03:39.970 --> 1:03:47.281 Maybe in bite and rich case it's not good to do the next word prediction, but we can 1:03:47.281 --> 1:03:48.314 do masking. 1:03:48.308 --> 1:03:56.019 Masking mainly means we do a prediction of something in the middle or some words. 1:03:56.019 --> 1:04:04.388 So the idea is if we have the input, we are putting noise into the input, removing them, 1:04:04.388 --> 1:04:07.961 and then the model we are interested. 1:04:08.048 --> 1:04:15.327 Now there can be no information leakage because this wasn't predicting that one is a big challenge. 1:04:16.776 --> 1:04:19.957 Do any assumption about our model? 1:04:19.957 --> 1:04:26.410 It doesn't need to be a forward model or a backward model or anything. 1:04:26.410 --> 1:04:29.500 You can always predict the three. 1:04:30.530 --> 1:04:34.844 There's maybe one bit of a disadvantage. 1:04:34.844 --> 1:04:40.105 Do you see what could be a bit of a problem this? 1:05:00.000 --> 1:05:06.429 Yes, so yeah, you can of course mask more, but to see it more globally, just first assume 1:05:06.429 --> 1:05:08.143 you're only masked one. 1:05:08.143 --> 1:05:13.930 For the whole sentence, we get one feedback signal, like what is the word three. 1:05:13.930 --> 1:05:22.882 So we have one training example: If you do the language modeling taste, we predicted here, 1:05:22.882 --> 1:05:24.679 we predicted here. 1:05:25.005 --> 1:05:26.735 So we have number of tokens. 1:05:26.735 --> 1:05:30.970 For each token we have a feet pad and say what is the best correction. 1:05:31.211 --> 1:05:43.300 So in this case this is less efficient because we are getting less feedback signals on what 1:05:43.300 --> 1:05:45.797 we should predict. 1:05:48.348 --> 1:05:56.373 So and bird, the main ideas are that you're doing this bidirectional model with masking. 1:05:56.373 --> 1:05:59.709 It's using transformer architecture. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that this next word prediction is another task. 1:06:16.957 --> 1:06:30.394 You want to learn more about what language is to really understand following a story or 1:06:30.394 --> 1:06:35.127 their independent tokens into. 1:06:38.158 --> 1:06:42.723 The input is using word units as we use it. 1:06:42.723 --> 1:06:50.193 It has some special token that is framing for the next word prediction. 1:06:50.470 --> 1:07:04.075 It's more for classification task because you may be learning a general representation 1:07:04.075 --> 1:07:07.203 as a full sentence. 
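A sketch of the masked-prediction objective introduced above: corrupt the input, remember the original tokens at the corrupted positions, and let only those positions contribute to the loss. The 15% rate is an assumed, commonly used value, and the 80/10/10 replacement scheme described just below can be layered on top of this.

```python
import random

def create_masked_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Self-supervised masked prediction: hide some positions and keep
    their original tokens as the training targets."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted[i] = mask_token
            targets[i] = tok       # only these positions give a loss signal
    return corrupted, targets

# Note the efficiency point from above: a causal LM gets a prediction
# target at every position, this objective only at the masked ones.
```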
1:07:07.607 --> 1:07:19.290 You're doing segment embedding, so you have an embedding for it. 1:07:19.290 --> 1:07:24.323 This is the first sentence. 1:07:24.684 --> 1:07:29.099 Now what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already have the crush enough or should. 1:07:35.275 --> 1:07:42.836 So there has been afterwards eating some work like, for example, a bearer. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not letting anything. 1:07:52.572 --> 1:07:54.590 That's Then Another Question There. 1:07:56.756 --> 1:08:04.594 Should I mask all types of should I always mask the footwork or if I have a subword to 1:08:04.594 --> 1:08:10.630 mask only like a subword and predict them based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the words, it might be easier to guess the last because they here 1:08:21.210 --> 1:08:27.594 took the easiest selection, so not considering words anymore at all because you're doing that 1:08:27.594 --> 1:08:32.280 in the preprocessing and just taking always words and like subwords. 1:08:32.672 --> 1:08:36.089 Think in group there is done differently. 1:08:36.089 --> 1:08:40.401 They mark always the full words, but guess it's not. 1:08:41.001 --> 1:08:46.044 And then what to do with the mask word in eighty percent of the cases. 1:08:46.044 --> 1:08:50.803 If the word is masked, they replace it with a special token thing. 1:08:50.803 --> 1:08:57.197 This is a mask token in ten percent they put in some random other token in there, and ten 1:08:57.197 --> 1:08:59.470 percent they keep it on change. 1:09:02.202 --> 1:09:10.846 And then what you can do is also this next word prediction. 1:09:10.846 --> 1:09:14.880 The man went to Mass Store. 1:09:14.880 --> 1:09:17.761 He bought a gallon. 1:09:18.418 --> 1:09:24.088 So may you see you're joining them, you're doing both masks and prediction that you're. 1:09:24.564 --> 1:09:29.449 Is a penguin mask or flyless birds. 1:09:29.449 --> 1:09:41.390 These two sentences have nothing to do with each other, so you can do also this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:57.043 And then the whole bird model, so here you have the input here to transform the layers, 1:09:57.043 --> 1:09:58.170 and then. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general applications. 1:10:17.937 --> 1:10:27.644 However, there is like a huge thing of different types of models coming from them. 1:10:27.827 --> 1:10:38.709 So based on others these supervised molds like a whole setup came out of there and now 1:10:38.709 --> 1:10:42.086 this is getting even more. 1:10:42.082 --> 1:10:46.640 With availability of a large language model than the success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it goes a bit. 1:10:50.910 --> 1:10:57.847 Change the bit again from like more the spider action model to uni directional models. 1:10:57.847 --> 1:11:02.710 Are at the moment maybe a bit more we're coming to them now? 1:11:02.710 --> 1:11:09.168 Do you see one advantage while what is another event and we have the efficiency? 1:11:09.509 --> 1:11:15.901 Is one other reason why you are sometimes more interested in uni-direction models than 1:11:15.901 --> 1:11:17.150 in bi-direction. 
1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional encoder is not 1:11:30.220 --> 1:11:30.872 really usable. 1:11:32.192 --> 1:11:40.924 It doesn't work: if you want to do generation like the decoder does, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 perhaps if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models. 1:12:12.761 --> 1:12:16.161 That's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 What we saw at the beginning are the decoder-based models, so the auto-regressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next-word prediction. 1:12:43.403 --> 1:12:52.439 And then there is also a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model. 1:12:54.354 --> 1:13:05.039 Because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 And that is what the prefix language model is doing: 1:13:12.240 --> 1:13:19.076 on the first tokens you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge both, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 there is then no different number of parameters; in an RNN 1:13:33.039 --> 1:13:34.836 we would need an extra backward RNN. 1:13:34.975 --> 1:13:38.533 In a transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that for the encoder and decoder the number of parameters is different because 1:13:44.918 --> 1:13:50.235 you do cross-attention, but if you do forward and backward, or only one direction, 1:13:50.650 --> 1:13:58.736 it's only that you mask your attention to either look only at the past or to also look into the 1:13:58.736 --> 1:13:59.471 future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is a unidirectional or causal one where you can only look at the past, and there is the prefix one where, for example, the first 1:14:23.516 --> 1:14:25.649 three words attend to each other bidirectionally. 1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, we can then of course also look at the remaining type. 1:14:43.163 --> 1:14:50.623 So the idea is: we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same input to both sides, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking here on the input. 1:15:12.912 --> 1:15:17.696 Here in the encoder we don't need attention masking, but here in the decoder we need the masking so that it doesn't already know 1:15:17.696 --> 1:15:17.911 the future. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that is the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train your model.
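Since the difference between the encoder-style, decoder-style and prefix variants is really only the attention mask, here is a small sketch that builds the three mask types as a boolean matrix saying which positions may attend to which.

```python
import torch

def attention_mask(seq_len: int, style: str, prefix_len: int = 0) -> torch.Tensor:
    """Entry (i, j) is True if position i may attend to position j.
    The model classes above differ only in this mask, not in parameters."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    if style == "bidirectional":      # encoder-style, BERT-like
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    if style == "causal":             # decoder-style, auto-regressive
        return j <= i
    if style == "prefix":             # prefix LM: full attention within and onto the prefix
        return (j <= i) | (j < prefix_len)
    raise ValueError(f"unknown style: {style}")

# attention_mask(5, "prefix", prefix_len=3) lets the first three positions
# see each other in both directions; the rest only look backwards.
```

BART, mentioned just above, combines a bidirectional encoder with a causal decoder.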
1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in the source sentence on the one side, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the same sentence in twice, and then, so that it is not a trivial task, 1:16:01.382 --> 1:16:02.432 we can corrupt it. 1:16:03.003 --> 1:16:12.777 And they use different corruption techniques, so you can, for example, also do token deletion. 1:16:13.233 --> 1:16:19.692 That is something you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict anything for it. 1:16:20.970 --> 1:16:26.353 In the encoder, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in it. 1:16:30.110 --> 1:16:38.268 Here on the decoder side it's unidirectional generation, so we can also delete a token and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation. 1:16:45.250 --> 1:16:54.285 We can do document rotation and text infilling, so there is quite a bit of variety. 1:16:55.615 --> 1:17:06.568 So you see there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, as for the language model, 1:17:14.985 --> 1:17:21.079 the other question is: how do you integrate it? 1:17:21.761 --> 1:17:26.636 And there are also quite some different techniques for that. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is you take your word embeddings or your pre-trained model, 1:17:39.068 --> 1:17:47.971 you freeze them, stack your decoder layers on top and keep only these trainable. 1:17:48.748 --> 1:17:54.495 That can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is you freeze, for example, your word embeddings or some other parts. 1:18:05.865 --> 1:18:17.296 The other thing is you only initialize with them, so you initialize your model with the pre-trained weights but then train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART, you want to have one language in the encoder and another 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 And the one you get is trained on English, so what you can do there is try to do some adaptation 1:18:46.366 --> 1:18:52.562 around the BART in order to learn some language-specific stuff, or there is mBART, 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but it's trained only on the monolingual data of each 1:18:58.823 --> 1:19:03.388 language, so it may be trained on German and on English, but not on German-English parallel data. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It was pre-trained only on the same language, but now it mainly has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Now, a certain thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART.
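Before going into adapters, here is a rough sketch of two of the BART-style corruption operations mentioned above, token deletion and sentence permutation; the denoising training pair is then (corrupted text in, original text out). The function names and rates are illustrative assumptions.

```python
import random

def delete_tokens(tokens, p=0.15):
    """Token deletion (rate assumed): the decoder must recover both what
    is missing and where, which a pure encoder cannot be asked to do."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens[:1]          # never return an empty input

def permute_sentences(sentences):
    """Sentence permutation: shuffle the sentence order of a document;
    the decoder has to reproduce the original order."""
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

# A denoising training example is then, for instance:
# src = delete_tokens(tokens); tgt = tokens
```

Back to the adapter idea, starting from a pre-trained model such as mBART.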
1:19:29.709 --> 1:19:35.218 And you put some adapters inside the network, so there are small new layers 1:19:35.218 --> 1:19:40.790 which are put in between, and then you only train these adapters, or you train these adapters 1:19:40.790 --> 1:19:41.815 together with the rest. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could say that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change the rest that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that. 1:19:59.793 --> 1:20:05.225 These adapters are quite small and so very efficient. 1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language-specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language. 1:20:22.247 --> 1:20:33.777 The model then has both: language-specific parts, and at the same time the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one catch: in general, in multilingual systems 1:20:39.058 --> 1:20:40.439 this works quite well. 1:20:40.439 --> 1:20:46.161 There's one case, or one specific use case for multilingual models, where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Because these adapters might be very language-specific, and for zero-shot 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language-independent; with the adapters, 1:21:09.015 --> 1:21:10.184 however, 1:21:10.184 --> 1:21:15.601 you of course again get representations which are more language-specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And there is also the idea of doing knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is okay: 1:21:42.923 --> 1:21:54.157 we are training the MT model as before, but what we want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these hidden states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare the hidden states of the pre-trained model and of your encoder and try to make them similar. 1:22:12.192 --> 1:22:18.144 For example, by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This needs the same vocabulary. Why does it need the same vocabulary, any idea? 1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, typically you also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states there and four states here, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare to which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So therefore, at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like saying the average should be the same, but that is of course 1:23:30.030 --> 1:23:33.071 a less strong signal.
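The adapter idea described at the start of this passage is usually a small bottleneck layer with a residual connection, inserted between the frozen pre-trained blocks; only these few parameters are trained, for example one adapter per language. A minimal sketch with assumed dimensions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: a small trainable layer placed between
    frozen pre-trained blocks (dimensions are assumed values)."""
    def __init__(self, model_dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(model_dim, bottleneck)
        self.up = nn.Linear(bottleneck, model_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # The residual connection keeps the original representation intact;
        # the adapter only learns a small correction on top of it.
        return x + self.up(self.act(self.down(x)))
```

Returning to the comparison of encoder states just described: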
1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a diameter training signal here on the handquarter 1:23:42.979 --> 1:23:51.455 so you can directly make some of the encoder already giving a good signal while normally 1:23:51.455 --> 1:23:52.407 an empty. 1:23:56.936 --> 1:24:13.197 Yes, think this is most things for today, so what you should keep in mind is remind me. 1:24:13.393 --> 1:24:18.400 The one is a back translation idea. 1:24:18.400 --> 1:24:29.561 If you have monolingual and use that, the other one is to: And mentally it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them so you can even use both of that. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained walls, but then you can even still do back translation where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 We have the advantage we are training like everything working together on the task so 1:24:45.502 --> 1:24:51.093 it might be helpful even to backtranslate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup because in pretraining of course the beach challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on different. 1:24:58.058 --> 1:25:03.327 Different ways of how you integrate this knowledge. 1:25:03.327 --> 1:25:08.089 Even if you just use a full model, so in this. 1:25:08.748 --> 1:25:11.128 This is the most similar you can get. 1:25:11.128 --> 1:25:13.945 You're doing no changes to the architecture. 1:25:13.945 --> 1:25:19.643 You're really taking the model and just fine tuning them on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to completely newly learn how to do the attention and how to do that. 1:25:24.464 --> 1:25:29.971 And that might be, for example, helpful to have more back-translated data to learn them. 1:25:32.192 --> 1:25:34.251 That's for today. 1:25:34.251 --> 1:25:44.661 There's one important thing that next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail if you're in Elias that there's a room change for Tuesdays and 1:25:56.769 --> 1:25:57.426 it's. 1:25:57.637 --> 1:26:03.890 There are more questions, yeah, have a more general position, especially: In computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your data center data orientation. 1:26:07.347 --> 1:26:08.295 Is there any? 1:26:08.388 --> 1:26:15.301 It's similar to a large speech for text for the data of an edge. 1:26:15.755 --> 1:26:29.176 And you can use this back translation and also masking, but back translation is some 1:26:29.176 --> 1:26:31.228 way of data. 1:26:31.371 --> 1:26:35.629 So it has also been, for example, even its used not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have good MP system, it can also be used for parallel data. 1:26:54.834 --> 1:26:59.139 So would say this is the most similar one. 1:26:59.139 --> 1:27:03.143 There's ways you can do power phrasing. 1:27:05.025 --> 1:27:12.057 But for example there is very hard to do this by rules like which words to replace because 1:27:12.057 --> 1:27:18.936 there is not a coup like you cannot always say this word can always be replaced by that. 1:27:19.139 --> 1:27:27.225 Mean, although they are many perfect synonyms, normally they are good in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do a rule based, you have to train your model and then the freshness. 
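As a concrete sketch of the encoder-state knowledge distillation discussed above: the easy variant is just an L2/MSE term between the MT encoder's hidden states and those of the pre-trained model, which is why matching sequence lengths, and hence a shared vocabulary, are assumed. The weight `alpha` in the comment is a hypothetical tuning parameter.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def encoder_distillation_loss(student_states: torch.Tensor,
                              teacher_states: torch.Tensor) -> torch.Tensor:
    """L2-style distillation between the MT encoder (student) and a
    pre-trained encoder (teacher).  States are compared position by
    position (shape: batch x length x dim), so both models are assumed
    to use the same vocabulary/segmentation."""
    return mse(student_states, teacher_states.detach())

# total_loss = translation_loss + alpha * encoder_distillation_loss(h_student, h_teacher)
```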
1:27:38.058 --> 1:27:57.236 Does it need the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension 1:28:00.000 --> 1:28:01.590 and architecture. 1:28:01.590 --> 1:28:05.452 We will later, when we talk about efficiency, see that 1:28:05.452 --> 1:28:12.948 you can also do knowledge distillation with, for example, smaller models. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible; but yes, agreed, it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yes, good question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 If that's it, then thanks a lot, and we'll see each other again on Tuesday.
0:03:05.825 --> 0:03:09.647 So is really this the only resource you can use. 0:03:09.889 --> 0:03:20.667 Or can we adapt our models in order to also make use of other types of models that might 0:03:20.667 --> 0:03:27.328 enable us to build strong systems with other types of. 0:03:27.707 --> 0:03:35.283 And that's what we will look into now in the next, starting from Tuesday in the next. 0:03:35.515 --> 0:03:43.437 So this idea we already have covered on Tuesday, so one very successful idea for this is to 0:03:43.437 --> 0:03:45.331 do more multilingual. 0:03:45.645 --> 0:03:52.010 So that we're no longer only doing translation between two languages, but we can do translation 0:03:52.010 --> 0:03:55.922 between many languages and share common knowledge between. 0:03:56.296 --> 0:04:06.477 And you also learned about that you can even do things like zero shot machine translations. 0:04:06.786 --> 0:04:09.792 Which is the case for many many language pairs. 0:04:10.030 --> 0:04:17.406 Even with German, you have not translation parallel data to all languages around the world, 0:04:17.406 --> 0:04:22.698 or most of them you have it to the Europeans, maybe for Japanese. 0:04:22.698 --> 0:04:26.386 But even for Japanese, it will get difficult. 0:04:26.746 --> 0:04:32.862 There is quite a lot of data, for example English to Japanese, but German to Vietnamese. 0:04:32.862 --> 0:04:39.253 There is some data from Multilingual Corpora where you can extract the name, but your amount 0:04:39.253 --> 0:04:41.590 really is dropping significantly. 0:04:42.042 --> 0:04:54.907 So that is a very promising direction if you want to build translation systems between language 0:04:54.907 --> 0:05:00.134 pairs, typically not English, because. 0:05:01.221 --> 0:05:05.888 And the other ideas, of course, we don't have data, just search for data. 0:05:06.206 --> 0:05:15.755 There is some work on data crawling so if don't have a corpus directly or don't have 0:05:15.755 --> 0:05:23.956 a high quality corpus from the European Parliament for TED corpus maybe. 0:05:24.344 --> 0:05:35.528 There has been a big effort in Europe to collect data sets for parallel data. 0:05:35.528 --> 0:05:40.403 How can we do this data crawling? 0:05:40.600 --> 0:05:46.103 There the interesting thing from the machine translation point is not just general data 0:05:46.103 --> 0:05:46.729 crawling. 0:05:47.067 --> 0:05:52.067 But how can we explicitly crawl data, which is somewhat parallel? 0:05:52.132 --> 0:05:58.538 So there is in the Internet quite a lot of data which has been like company websites which 0:05:58.538 --> 0:06:01.565 have been translated and things like that. 0:06:01.565 --> 0:06:05.155 So how can you extract them and then extract them? 0:06:06.566 --> 0:06:13.406 There is typically more noisy than where you do more, hence mean if you have your Parliament. 0:06:13.693 --> 0:06:21.305 You can do some rules how to extract the parallel things. 0:06:21.305 --> 0:06:30.361 Here there is more to it, so the quality is later maybe not as good. 0:06:33.313 --> 0:06:39.927 The other thing is can we use monolingual data and monolingual data has a big advantage 0:06:39.927 --> 0:06:46.766 that we can have a huge amount of that so that you can be able to crawl from the internet. 0:06:46.766 --> 0:06:51.726 The nice thing is you can also get it typically for many domains. 0:06:52.352 --> 0:06:58.879 There is just so much more magnitude more of monolingual data so that it might be very 0:06:58.879 --> 0:06:59.554 helpful. 
0:06:59.559 --> 0:07:06.187 We can do that in statistical machine translation was quite easy to integrate using language 0:07:06.187 --> 0:07:06.757 models. 0:07:08.508 --> 0:07:14.499 In neural machine translation we have the advantage that we have this overall and architecture 0:07:14.499 --> 0:07:18.850 that does everything together, but it has also the disadvantage now. 0:07:18.850 --> 0:07:22.885 It's more difficult to put in this type of information or make. 0:07:23.283 --> 0:07:26.427 We'll look to two things. 0:07:26.427 --> 0:07:37.432 You can still try to do a bit of language modeling in there and add an additional language 0:07:37.432 --> 0:07:38.279 model. 0:07:38.178 --> 0:07:43.771 A way which I think is used in most systems at the moment is to do synthetic data. 0:07:43.763 --> 0:07:53.095 It's a very easy thing, but you can just translate there and then use it as training data. 0:07:53.213 --> 0:07:59.192 And thereby you are able to use like some type of moonlighting. 0:08:00.380 --> 0:08:09.521 Another way to do it is to ensure that some are in the extreme case. 0:08:09.521 --> 0:08:14.026 If you have a scenario that only. 0:08:14.754 --> 0:08:24.081 The impressive thing is if you have large amounts of data and the languages are not too 0:08:24.081 --> 0:08:31.076 dissimilar, you can even in this case build a translation system. 0:08:32.512 --> 0:08:36.277 That we will see then next Thursday. 0:08:37.857 --> 0:08:55.462 And then there is now a fourth type of restorer that recently became very successful and now. 0:08:55.715 --> 0:09:02.409 So the idea is we are no longer sharing the real data such as text data, but it can also 0:09:02.409 --> 0:09:04.139 help to train a model. 0:09:04.364 --> 0:09:08.599 And that is now a big advantage of deep learning based approaches. 0:09:08.599 --> 0:09:14.414 There you have this ability that you can train a model on some task and then you can modify 0:09:14.414 --> 0:09:19.913 it maybe and then apply it to another task and you can somewhat transfer the knowledge 0:09:19.913 --> 0:09:22.125 from the first task to the second. 0:09:22.722 --> 0:09:31.906 And then, of course, the question is, can it have an initial task where it's very easy 0:09:31.906 --> 0:09:34.439 to train on the second? 0:09:34.714 --> 0:09:53.821 The task that you pre-train on is more similar to a language. 0:09:53.753 --> 0:10:06.293 A bit of a different way of using language malls in this more transfer learning set. 0:10:09.029 --> 0:10:18.747 So first we will start with how can we use monolingual data to do a machine translation? 0:10:20.040 --> 0:10:22.542 The. 0:10:22.062 --> 0:10:28.924 This big difference is you should remember from what I mentioned before is in statistical 0:10:28.924 --> 0:10:30.525 machine translation. 0:10:30.525 --> 0:10:33.118 We directly have the opportunity. 0:10:33.118 --> 0:10:39.675 There's peril data for a translation model and monolingual data for a language model. 0:10:39.679 --> 0:10:45.735 And you combine your translation model and your language model, and then you can make. 0:10:46.726 --> 0:10:54.263 That has big advantages that you can make use of these large amounts of monolingual data, 0:10:54.263 --> 0:10:55.519 but of course. 0:10:55.495 --> 0:11:02.198 Because we said the problem is, we are optimizing both parts independently to each other, and 0:11:02.198 --> 0:11:09.329 we say the big advantage of newer machine translation is we are optimizing the overall architecture 0:11:09.329 --> 0:11:10.541 to perform best. 
0:11:10.890 --> 0:11:17.423 And then, of course, we can't do that, so here we can only use power there. 0:11:17.897 --> 0:11:25.567 So the question is, but if this advantage is not so important, we can train everything, 0:11:25.567 --> 0:11:33.499 but we have large amounts of monolingual data or small amounts, but they fit perfectly, so 0:11:33.499 --> 0:11:35.242 they are very good. 0:11:35.675 --> 0:11:41.438 So in data we know it's not only important the amount of data we have but also like how 0:11:41.438 --> 0:11:43.599 similar it is to your test data. 0:11:43.599 --> 0:11:49.230 So it can be that this volume is even only quite small but it's very well fitting and 0:11:49.230 --> 0:11:51.195 then it's still very helpful. 0:11:51.195 --> 0:11:55.320 So the question is if this is the case how can we make use of? 0:11:55.675 --> 0:12:03.171 And the first year of surprisingness, if we are here successful with integrating a language 0:12:03.171 --> 0:12:10.586 model into a translation system, maybe we can also integrate some types of language models 0:12:10.586 --> 0:12:14.415 into our MT system in order to make it better. 0:12:16.536 --> 0:12:19.000 The first thing we can do is okay. 0:12:19.000 --> 0:12:23.293 We know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:30.693 There was mainly used language models because these works were mainly done before transformer 0:12:30.693 --> 0:12:31.746 based models. 0:12:32.152 --> 0:12:41.567 And generally, of course, you can do the same thing with all the Transformers baseballs. 0:12:41.721 --> 0:12:58.900 It has mainly been done before people started using R&S, and they tried to do this more 0:12:58.900 --> 0:13:01.888 in cases where. 0:13:07.087 --> 0:13:17.508 So what we're having here is some of this type of idea. 0:13:17.508 --> 0:13:25.511 This is a key system here as you remember. 0:13:25.605 --> 0:13:29.470 Gets in with your last instinct and calculates your attention. 0:13:29.729 --> 0:13:36.614 We get the context and combine both and then based on that and then predict the target. 0:13:37.057 --> 0:13:42.423 So this is our anti-system, and the question is, can we somehow integrate the language? 0:13:42.782 --> 0:13:55.788 And of course, if someone makes sense to take out a neural language model because we're anyway 0:13:55.788 --> 0:14:01.538 in the neural space, it's not surprising. 0:14:01.621 --> 0:14:15.522 And there would be something like on top of there and you're a language model and you have 0:14:15.522 --> 0:14:17.049 a target. 0:14:17.597 --> 0:14:27.007 So if we're having this type of language model, there's two main questions we have to answer. 0:14:27.007 --> 0:14:28.108 How do we? 0:14:28.208 --> 0:14:37.935 So how do we combine now on the one hand our NMT system and on the other hand our RNA you 0:14:37.935 --> 0:14:45.393 see that was mentioned before when we started talking about encoder. 0:14:45.805 --> 0:14:49.523 The wild is like unconditioned, it's just modeling the targets side. 0:14:49.970 --> 0:14:57.183 And the other one is a conditional language, which is a language condition on the sewer 0:14:57.183 --> 0:14:57.839 center. 0:14:58.238 --> 0:15:03.144 So the question is how can you not combine two language models? 0:15:03.144 --> 0:15:09.813 Of course, it's like the translation model will some will be more important because it 0:15:09.813 --> 0:15:11.806 has access to the source. 0:15:11.806 --> 0:15:16.713 We want to generate something which corresponds to your source. 
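As a reminder of the decoder step being referred to here (the last decoder hidden state is used to compute attention over the encoder states, the resulting context vector is combined with the state, and the next target word is predicted), here is a minimal sketch. The bilinear scoring function, all tensor names and all sizes are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(dec_state, enc_states, W_att, W_out, E_vocab):
    """One attention-based decoder step (illustrative shapes only).

    dec_state:  (d,)     current decoder hidden state
    enc_states: (T, d)   encoder hidden states of the source sentence
    W_att:      (d, d)   attention parameters (bilinear scoring, one possible choice)
    W_out:      (d, 2d)  combines state and context
    E_vocab:    (V, d)   projection to vocabulary logits
    """
    # 1) attention scores between the decoder state and every encoder state
    scores = enc_states @ (W_att @ dec_state)          # (T,)
    alphas = softmax(scores)                           # attention weights
    # 2) context vector = weighted sum of encoder states
    context = alphas @ enc_states                      # (d,)
    # 3) combine decoder state and context, then predict the target word
    combined = np.tanh(W_out @ np.concatenate([dec_state, context]))
    logits = E_vocab @ combined                        # (V,)
    return softmax(logits)                             # p(y_t | y_<t, x)

# toy usage with random parameters
d, T, V = 8, 5, 20
rng = np.random.default_rng(0)
p = decoder_step(rng.normal(size=d), rng.normal(size=(T, d)),
                 rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d)),
                 rng.normal(size=(V, d)))
print(p.shape, p.sum())   # (20,) and approximately 1.0
```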
0:15:18.778 --> 0:15:20.918 If we had that, the other question is OK. 0:15:20.918 --> 0:15:22.141 Now we have two models. 0:15:22.141 --> 0:15:25.656 If we even have integrated them, the answer is how do we train them? 0:15:26.026 --> 0:15:39.212 Because we have integrated them, we have no two sets of data with parallel data where you 0:15:39.212 --> 0:15:42.729 can do the lower thing. 0:15:44.644 --> 0:15:47.575 So the first idea is okay. 0:15:47.575 --> 0:15:53.436 We can do something more like a parallel combination. 0:15:53.436 --> 0:15:55.824 We just keep running. 0:15:56.036 --> 0:15:59.854 So a year you see your NMT system that is running. 0:16:00.200 --> 0:16:08.182 First of all, it's normally completely independent of your language model, which is up there. 0:16:08.182 --> 0:16:13.278 So down here we have just our NMT system, which is running. 0:16:13.313 --> 0:16:26.439 The only thing which is used is we have the words inputted, and of course they are put 0:16:26.439 --> 0:16:28.099 into both. 0:16:28.099 --> 0:16:41.334 We also put: So we use them in parallel, and then we are doing our decision just by merging 0:16:41.334 --> 0:16:42.905 these two. 0:16:43.343 --> 0:16:52.288 So there can be, for example, we are doing a probability distribution here, we are doing 0:16:52.288 --> 0:17:01.032 a purability distribution here, and then we are taking the average of both per ability 0:17:01.032 --> 0:17:03.343 to do our predictions. 0:17:11.871 --> 0:17:18.929 You could also take the output which seems to be more short about the answer. 0:17:20.000 --> 0:17:23.272 Yes, you could also do that. 0:17:23.272 --> 0:17:27.222 It's more like a gating mechanism. 0:17:27.222 --> 0:17:32.865 You're not doing everything, but you're focusing. 0:17:32.993 --> 0:17:38.927 Another one would be you could also just concatenate the hidden states and then you have another 0:17:38.927 --> 0:17:41.802 layer on top which based on the concatenation. 0:17:43.303 --> 0:17:58.634 If you think about it, you do the coordination instead of taking the instead and then merging 0:17:58.634 --> 0:18:01.244 the perability. 0:18:03.143 --> 0:18:15.027 Yes, in the end you introduce many new parameters and these parameters have somehow something 0:18:15.027 --> 0:18:17.303 special compared. 0:18:23.603 --> 0:18:33.657 So before all the other parameters can be trained independently of each other, the language 0:18:33.657 --> 0:18:42.071 one can be trained independent and an antisystem can be trained independent. 0:18:43.043 --> 0:18:51.198 If you have a joint layer of course you need to train them because you have inputs so you 0:18:51.198 --> 0:19:01.560 need: Not surprisingly, if you have a parallel combination or whether you could, the other 0:19:01.560 --> 0:19:04.664 way is to do more serial combinations. 0:19:04.924 --> 0:19:10.382 How can you do a similar combination? 0:19:10.382 --> 0:19:18.281 Your final decision makes sense to do it based on the. 0:19:18.438 --> 0:19:20.997 So you have on top of your normal an system. 0:19:21.121 --> 0:19:30.826 The only thing is now your inputting into your NIT system. 0:19:30.826 --> 0:19:38.723 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:47.819 You're training the lower layers here which are trained more on the purely language model 0:19:47.819 --> 0:19:55.434 and on top you're putting into the NMT system where it now has the language. 0:19:55.815 --> 0:19:59.003 So here you can also view it here. 
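To make the first, parallel combination concrete before continuing with the serial one: each model produces a distribution over the next target word, and the two distributions are merged, for example by simple averaging or a weighted (log-linear) interpolation. This is only a minimal sketch; the interpolation weight `lam` is an assumed hyperparameter, and the gating or concatenation variants mentioned above would put trainable parameters in place of it.

```python
import numpy as np

def combine_parallel(p_nmt, p_lm, lam=0.5, log_linear=False):
    """Merge the next-word distributions of an NMT model and a language model.

    p_nmt, p_lm: arrays of shape (V,), each summing to 1.
    lam:         how much we trust the NMT model (illustrative value).
    """
    if log_linear:
        # weighted combination in log space, then renormalize
        log_p = lam * np.log(p_nmt + 1e-12) + (1 - lam) * np.log(p_lm + 1e-12)
        p = np.exp(log_p - log_p.max())
        return p / p.sum()
    # simple weighted average of the two probability distributions
    return lam * p_nmt + (1 - lam) * p_lm

# toy example with a vocabulary of four words
p_nmt = np.array([0.70, 0.10, 0.10, 0.10])
p_lm  = np.array([0.25, 0.25, 0.40, 0.10])
print(combine_parallel(p_nmt, p_lm))                      # plain average
print(combine_parallel(p_nmt, p_lm, lam=0.8, log_linear=True))
```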
0:19:59.003 --> 0:20:06.836 You have more contextual embeddings which no longer depend on the word, but they also 0:20:06.836 --> 0:20:10.661 depend on the context of the target site. 0:20:11.051 --> 0:20:21.797 More understanding of the source word. 0:20:21.881 --> 0:20:34.761 So if it's like the word can, for example, will be put in here always the same, independent 0:20:34.761 --> 0:20:41.060 of its use of can of beans, or if can do it. 0:20:41.701 --> 0:20:43.165 Empties. 0:20:44.364 --> 0:20:54.959 So another view, if you're remembering more the transformer based approach, is you have 0:20:54.959 --> 0:21:01.581 some layers, and the lower layers are purely language. 0:21:02.202 --> 0:21:08.052 This is purely language model and then at some point you're starting to attend to the 0:21:08.052 --> 0:21:08.596 source. 0:21:13.493 --> 0:21:20.774 Yes, so these are two ways of how you combine it, so run them in peril, or first do the language. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.123 And this case with a sequence. 0:21:38.278 --> 0:21:50.721 Is the input and bedding, the target word embedding, or the actual word, and then we 0:21:50.721 --> 0:21:54.821 transfer it to a numerical. 0:21:56.176 --> 0:22:08.824 That depends on if you view the word embedding as part of the language model, so of course 0:22:08.824 --> 0:22:10.909 you first put. 0:22:11.691 --> 0:22:13.938 And then the word embedding there is the r&n. 0:22:14.314 --> 0:22:20.296 So of course you can view this together as your language model when you first do the word 0:22:20.296 --> 0:22:21.027 embedding. 0:22:21.401 --> 0:22:28.098 All you can say are the RNAs and this is like before. 0:22:28.098 --> 0:22:36.160 It's more a definition, but you're right, so what are the steps? 0:22:36.516 --> 0:22:46.655 One of these parts, you know, called a language model is definitionally not that important, 0:22:46.655 --> 0:22:47.978 but that's. 0:22:53.933 --> 0:23:02.812 So the question is how can you then train them and make make this this one work? 0:23:03.363 --> 0:23:15.492 So in the case where you combine the language of our abilities you can train them independently 0:23:15.492 --> 0:23:18.524 and then just put them. 0:23:18.918 --> 0:23:29.623 It might not be the best because we have no longer this ability before that. 0:23:29.623 --> 0:23:33.932 They optimal perform together. 0:23:34.514 --> 0:23:41.050 At least you need to summarize how much do you trust the one model and how much do you 0:23:41.050 --> 0:23:41.576 trust. 0:23:43.323 --> 0:23:48.529 But still in some cases usually it might be helpful if you have only data and so on. 0:23:48.928 --> 0:24:06.397 However, we have one specific situation that leads to the pearl leader is always mono legal 0:24:06.397 --> 0:24:07.537 data. 0:24:08.588 --> 0:24:17.693 So what we can also do is more the pre-training approach. 0:24:17.693 --> 0:24:24.601 We first train the language model and then. 0:24:24.704 --> 0:24:33.468 So the pre-training approach you first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:45.077 Of course, the model size is this way, but the data size is of course too big. 0:24:45.077 --> 0:24:52.413 You often have more monolingual data than parallel. 0:24:56.536 --> 0:24:57.901 Any ideas. 
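A minimal sketch of that pre-training approach: first train a target-side language model on the monolingual data, then reuse its weights when training the translation model on the parallel data. The module names, the matching decoder structure and the freezing step are illustrative assumptions, not code from the lecture.

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    """Target-side language model; its weights can later seed the NMT decoder."""
    def __init__(self, vocab_size=10000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)   # next-token logits at every position

# Stage 1: pre-train on monolingual target data (next-word prediction).
lm = DecoderLM()
# ... train lm on monolingual batches with cross-entropy ...

# Stage 2: copy the pre-trained weights into the decoder of the NMT model
# (`nmt.decoder` is a hypothetical module with matching submodules):
# nmt.decoder.embed.load_state_dict(lm.embed.state_dict())
# nmt.decoder.rnn.load_state_dict(lm.rnn.state_dict())

# Optionally freeze pre-trained parts so that fine-tuning on the small
# parallel data cannot overwrite ("forget") what was learned before:
for p in lm.embed.parameters():
    p.requires_grad = False
```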
0:25:04.064 --> 0:25:10.108 Had one example where this might also be helpful if you want to adapt to a domain so let's say 0:25:10.108 --> 0:25:16.281 you do medical sentences and if you want to translate medical sentences and you have monolingual 0:25:16.281 --> 0:25:22.007 data on the target side for medical sentences but you only have parallel data for general 0:25:22.007 --> 0:25:22.325 use. 0:25:23.083 --> 0:25:30.601 In this case it could be, or it's the most probable happen if you're learning out there 0:25:30.601 --> 0:25:38.804 what medical means, but then in your fine tuning step the model is forgetting everything about. 0:25:39.099 --> 0:25:42.340 So this type of priest training step is good. 0:25:42.340 --> 0:25:47.978 If your pretraining data is more general, very large, and then you're adapting. 0:25:48.428 --> 0:25:55.545 But in the task we have monolingual data, which should be used to adapt the system to 0:25:55.545 --> 0:25:57.780 some genre of topic style. 0:25:57.817 --> 0:26:08.572 Then, of course, this is not a good strategy because you might forget about everything up 0:26:08.572 --> 0:26:09.408 there. 0:26:09.649 --> 0:26:17.494 So then you have to check what you can do for them to see. 0:26:17.494 --> 0:26:25.738 You can freeze this part and you can do a direct combination. 0:26:25.945 --> 0:26:33.796 Where you train both of them, and then you train the language more and parallel on their 0:26:33.796 --> 0:26:34.942 one so that. 0:26:35.395 --> 0:26:37.687 Eh What You Learn in the Length. 0:26:37.937 --> 0:26:48.116 So the bit depends on what you want to combine is that you use a language model because it's. 0:26:48.548 --> 0:26:56.380 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:56.380 --> 0:26:58.083 something specific. 0:27:01.001 --> 0:27:06.662 Then there is so this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:11.787 It seems to be the easiest one somehow. 0:27:11.787 --> 0:27:19.140 It's more similar to what we are doing with statistical machine translation. 0:27:19.140 --> 0:27:20.095 However,. 0:27:21.181 --> 0:27:27.211 Normally always beats this type of model, which in some view can be from the conceptual 0:27:27.211 --> 0:27:27.691 thing. 0:27:27.691 --> 0:27:31.460 At least it's even easier from the computational side. 0:27:31.460 --> 0:27:36.805 Sometimes it has a disadvantage that it's more problematic or more difficult. 0:27:40.560 --> 0:27:42.576 And the idea is okay. 0:27:42.576 --> 0:27:45.141 We have a monolingual data. 0:27:45.141 --> 0:27:50.822 We just translate it and then generate some type of parallel. 0:27:51.111 --> 0:28:00.465 So if you want to build a German to English system, your first trained German to English 0:28:00.465 --> 0:28:02.147 system on your. 0:28:02.402 --> 0:28:05.217 Then you have more pearl data. 0:28:05.217 --> 0:28:13.482 The interesting thing is if you then train on the joint thing, on the original pearl data, 0:28:13.482 --> 0:28:18.749 and on that one is artificial, it even normally improves. 0:28:18.918 --> 0:28:26.490 You can because you're not doing the same error all the time and you have some knowledge. 0:28:28.028 --> 0:28:40.080 With this first approach, however, there's one issue: why it might not work the best, 0:28:40.080 --> 0:28:43.163 so could you imagine? 0:28:49.409 --> 0:28:51.186 Ready a bit shown in image two. 0:28:53.113 --> 0:29:00.637 Have a few trains on bad quality data. 
0:29:00.637 --> 0:29:08.741 The system will learn also in the states. 0:29:08.828 --> 0:29:12.210 And as you're saying, it's a system always mistranslates. 0:29:13.493 --> 0:29:14.497 Something. 0:29:14.497 --> 0:29:23.623 Then you will learn that this is correct because now it's training data and you will even encourage 0:29:23.623 --> 0:29:25.996 it to make it more often. 0:29:25.996 --> 0:29:29.921 So the problem on training on your own is. 0:29:30.150 --> 0:29:34.222 But however, as you systematically do, you even enforce more and will even do more. 0:29:34.654 --> 0:29:37.401 So that might not be the best solution. 0:29:37.401 --> 0:29:40.148 Do any idea how you could do it better? 0:29:44.404 --> 0:29:57.653 If you had something else to prevent some systematic problems, yes, that is one way. 0:30:04.624 --> 0:30:10.809 The problem is yeah, the translations are not perfect, so the output and you're learning 0:30:10.809 --> 0:30:11.990 something wrong. 0:30:11.990 --> 0:30:17.967 Normally it's less bad if your inputs are somewhat bad, but your outputs are perfect. 0:30:18.538 --> 0:30:26.670 So if your inputs are wrong you maybe learn that if you're doing this wrong input you're 0:30:26.670 --> 0:30:30.782 generating something correct but you're not. 0:30:31.511 --> 0:30:40.911 So often the case is that it's more important that your target is correct. 0:30:40.911 --> 0:30:47.052 If on the source there is something crazy, then. 0:30:47.347 --> 0:30:52.184 But you can assume in your application scenario you hope that you mainly get correct input. 0:30:52.572 --> 0:31:02.126 So that is not harming you as much, and in machine translation we have some of these symmetries, 0:31:02.126 --> 0:31:02.520 so. 0:31:02.762 --> 0:31:04.578 And also the other way around. 0:31:04.578 --> 0:31:09.792 It's a very similar task, so there's a task to translate from German to English, but the 0:31:09.792 --> 0:31:13.892 task to translate from English to German is very similar and helpful. 0:31:14.094 --> 0:31:19.313 So what we can do is, we can just switch it initially and generate the data the other way 0:31:19.313 --> 0:31:19.777 around. 0:31:20.120 --> 0:31:25.699 So what we are doing here is we are starting with an English to German system. 0:31:25.699 --> 0:31:32.126 Then we are translating the English data into German, where the German is maybe not really 0:31:32.126 --> 0:31:32.903 very nice. 0:31:33.293 --> 0:31:46.045 And then we're training on our original data and on the back translated data where only 0:31:46.045 --> 0:31:51.696 the input is good and it's like human. 0:31:52.632 --> 0:32:01.622 So here we have now the advantage that always our target site is of human quality and the 0:32:01.622 --> 0:32:02.322 input. 0:32:03.583 --> 0:32:08.998 And then this helps us to get really good form. 0:32:08.998 --> 0:32:15.428 There's one important difference if you think about the. 0:32:21.341 --> 0:32:31.604 It's too obvious here we need a target side monolingual layer and the first. 0:32:31.931 --> 0:32:47.143 So back translation is normally working if you have target size parallel and not search 0:32:47.143 --> 0:32:48.180 side. 0:32:48.448 --> 0:32:55.493 Might be also a bit if you think about it understandable that it's more important to 0:32:55.493 --> 0:32:56.819 be like better. 
0:32:57.117 --> 0:33:04.472 On the source side you have to understand the content, on the target side you have to generate 0:33:04.472 --> 0:33:12.232 real sentences, and somehow it's more difficult to generate something than to only understand it. 0:33:17.617 --> 0:33:29.916 One other thing: it's drawn here as if both parts were of similar size, but normally 0:33:29.916 --> 0:33:30.701 in practice 0:33:31.051 --> 0:33:32.978 there's a lot more monolingual data than parallel data. 0:33:33.253 --> 0:33:36.683 So the question is, should I really take all of my data? 0:33:36.683 --> 0:33:38.554 There's two problems with it. 0:33:38.554 --> 0:33:42.981 Of course, it's expensive because you have to translate all this data. 0:33:42.981 --> 0:33:48.407 And secondly, even though now only your synthetic source side is wrong, it might be that you 0:33:48.407 --> 0:33:51.213 still bring wrong correlations in there. 0:33:51.651 --> 0:34:01.061 So if you don't know better, a normally good starting point is to take an equal amount of back-translated data as you have 0:34:01.061 --> 0:34:02.662 parallel data. 0:34:02.963 --> 0:34:05.366 Of course, it depends on the use case. 0:34:05.366 --> 0:34:07.215 If there is very little parallel data, 0:34:07.215 --> 0:34:08.510 it makes more sense to use more. 0:34:08.688 --> 0:34:14.273 It also depends on how good your quality is here, so the better the reverse model is, the 0:34:14.273 --> 0:34:17.510 more data you might use, because the quality is better. 0:34:17.510 --> 0:34:23.158 So it depends on a lot of things, but yeah, a rule of thumb and a good general way often 0:34:23.158 --> 0:34:24.808 is to have equal amounts. 0:34:26.646 --> 0:34:31.233 And you can of course do that iteratively. 0:34:31.233 --> 0:34:39.039 I said already that the quality at the end, of course, depends on this reverse system, 0:34:39.039 --> 0:34:46.163 because the better this system is, the better your synthetic data. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.911 So you're training a model on English to German and you translate the data. 0:34:56.957 --> 0:35:03.397 Then you train a model on German to English with the additional data. 0:35:03.397 --> 0:35:11.954 Then you translate German data again, and then you train again your first 0:35:11.954 --> 0:35:12.414 one. 0:35:12.414 --> 0:35:14.346 So you iterate that. 0:35:14.334 --> 0:35:19.653 Because now your system is better, because it's not only trained on the small data but 0:35:19.653 --> 0:35:22.003 additionally on back-translated data. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:31.739 However, typically you can stop quite early, so maybe one iteration is good, but then you 0:35:31.739 --> 0:35:35.072 have diminishing gains after two or three. 0:35:35.935 --> 0:35:44.094 There's only a very slight difference then, because you of course need quite a big difference 0:35:44.094 --> 0:35:45.937 in the quality here 0:35:45.937 --> 0:35:46.814 in order to see gains. 0:35:47.207 --> 0:35:59.810 Which is not too bad, because it means you can already do it with a system that has relatively 0:35:59.810 --> 0:36:02.245 bad performance. 0:36:03.723 --> 0:36:10.323 And whether you replace or append the synthetic data in the next iteration, that's a design decision, I would say. 0:36:10.550 --> 0:36:16.617 It's better to replace it because you then have higher quality synthetic data, but you of course keep your 0:36:16.617 --> 0:36:18.310 high-quality real data. 0:36:18.310 --> 0:36:21.626 Then I think normally it's okay to replace it.
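To make the recipe concrete, here is a minimal sketch of one back-translation round as just described: target-side monolingual data is translated with a reverse (target-to-source) system, the synthetic source is paired with the human-written target, and roughly equal amounts of real and synthetic data are used for training. The callables `translate` and `train_nmt` are hypothetical placeholders, not real library functions.

```python
import random

def back_translation_round(parallel, mono_tgt, reverse_model, translate, train_nmt):
    """One round of back-translation (all callables are hypothetical placeholders).

    parallel:      list of (src, tgt) human-translated pairs
    mono_tgt:      list of target-language monolingual sentences
    reverse_model: a tgt->src system used only to create synthetic sources
    """
    # 1) rule of thumb from the lecture: use about as much monolingual data
    #    as you have real parallel data
    sample = random.sample(mono_tgt, min(len(mono_tgt), len(parallel)))

    # 2) synthetic pairs: machine-translated source, human-written target
    synthetic = [(translate(reverse_model, tgt), tgt) for tgt in sample]

    # 3) train the forward system on real plus synthetic data together
    return train_nmt(parallel + synthetic)

# toy demo with dummy stand-ins, just to show the data flow;
# iterated back-translation would alternate the two directions, each round
# re-translating the monolingual data with the improved reverse system.
par = [("ein Haus", "a house"), ("ein Baum", "a tree")]
mono = ["a dog", "a cat", "a car"]
result = back_translation_round(
    par, mono,
    reverse_model=None,
    translate=lambda model, sent: "<synthetic " + sent + ">",  # stand-in MT output
    train_nmt=lambda data: data)                               # stand-in training
print(result)
```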
0:36:21.626 --> 0:36:24.518 Of course you can also try to append it. 0:36:24.518 --> 0:36:28.398 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:40.567 That's about like using monolingual data before we go into the pre-train models. 0:36:40.567 --> 0:36:42.998 Do you have any more questions? 0:36:49.029 --> 0:36:57.521 Yes, so the other thing we can do and which is recently more and more successful and even 0:36:57.521 --> 0:37:05.731 more successful since we have these really large language models where you can even do 0:37:05.731 --> 0:37:08.562 a translation task with this. 0:37:08.688 --> 0:37:16.132 So here the idea is you learn a representation of one task and then you use this representation. 0:37:16.576 --> 0:37:27.276 It was made maybe like one of the first where it's really used largely is doing something 0:37:27.276 --> 0:37:35.954 like a bird which you pre-train on purely text editor and then you take. 0:37:36.496 --> 0:37:42.952 And the one big advantage, of course, is that people can only share data but also pre-train. 0:37:43.423 --> 0:37:53.247 So if you think of the recent models and the large language models which are available, 0:37:53.247 --> 0:37:59.611 it is not possible for universities often to train them. 0:37:59.919 --> 0:38:09.413 Think it costs several millions to train the model just if you rent the GPS from some cloud 0:38:09.413 --> 0:38:15.398 company and train that the cost of training these models. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.630 So another idea is what you can do is okay. 0:38:24.630 --> 0:38:27.331 Maybe if these months are once available. 0:38:27.467 --> 0:38:34.723 You can take them and use them as a resource similar to pure text, and you can now build 0:38:34.723 --> 0:38:41.734 models which some will learn not only from from data but also from other models which 0:38:41.734 --> 0:38:44.506 are maybe trained on other tasks. 0:38:44.844 --> 0:38:48.647 So it's a quite new way of thinking of how to train. 0:38:48.647 --> 0:38:53.885 So we are not only learning from examples, but we might also learn from. 0:38:54.534 --> 0:39:03.937 The nice thing is that this type of training where we are not learning directly from data 0:39:03.937 --> 0:39:07.071 by learning from other tasks. 0:39:07.427 --> 0:39:15.581 So the main idea to start with is to have a personal initial task, and typically this 0:39:15.581 --> 0:39:24.425 initial task is for: And if you're working with, that means you're training pure taxator 0:39:24.425 --> 0:39:30.547 because you have the largest amount of data from the Internet. 0:39:30.951 --> 0:39:35.857 And then you're defining some type of task in order to do your quick training. 0:39:36.176 --> 0:39:42.056 And: There's a typical task you can train on. 0:39:42.056 --> 0:39:52.709 That is like the language modeling text, so to predict the next word, all we have related. 0:39:52.932 --> 0:40:04.654 But to predict something which you have not in the input is a task which is easy to generate. 0:40:04.654 --> 0:40:06.150 That's why. 0:40:06.366 --> 0:40:14.005 By yourself, on the other hand, you need a lot of knowledge, and that is the other thing 0:40:14.005 --> 0:40:15.120 you need to. 0:40:15.735 --> 0:40:23.690 Because there is this idea that the meaning of the word heavily depends on the context 0:40:23.690 --> 0:40:24.695 it's used. 
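The "predict the next word" objective described here can be written down very compactly: the labels are just the input tokens shifted by one position, so no annotation beyond the raw text is needed. A minimal sketch with illustrative sizes and a small recurrent model; the real large language models use the same objective with much bigger Transformer architectures.

```python
import torch
import torch.nn as nn

# Self-supervised next-word prediction: inputs are the tokens,
# labels are the same tokens shifted by one position.
vocab, d = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d), nn.LSTM(d, d, batch_first=True))
proj = nn.Linear(d, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 20))        # a batch of 8 toy "sentences"
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from the prefix

hidden, _ = model(inputs)                        # (8, 19, d)
logits = proj(hidden)                            # (8, 19, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                  # no labels needed beyond the text itself
print(float(loss))
```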
0:40:25.145 --> 0:40:36.087 So can give you a sentence with some gibberish word and there's some name, and although you've 0:40:36.087 --> 0:40:41.616 never read the name, you will just assume that. 0:40:42.062 --> 0:40:48.290 Exactly the same thing, the models can also learn something about the words in there by 0:40:48.290 --> 0:40:49.139 just using. 0:40:49.649 --> 0:40:53.246 So that is typically the new. 0:40:53.246 --> 0:40:59.839 Then we can use this model, use our data to train the. 0:41:00.800 --> 0:41:04.703 Of course, it might need to adapt the system. 0:41:04.703 --> 0:41:07.672 To do that we might use only some. 0:41:07.627 --> 0:41:16.326 Part of the pre-train model in there is that we have seen that a bit already in the RNA 0:41:16.326 --> 0:41:17.215 case is. 0:41:17.437 --> 0:41:22.670 So you can view the RN as one of these approaches. 0:41:22.670 --> 0:41:28.518 You train the RN language while on large pre-train data. 0:41:28.518 --> 0:41:32.314 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:49.027 So that you can build a system which uses knowledge, which is just trained on large amounts 0:41:49.027 --> 0:41:52.299 of data and extracting it. 0:41:52.299 --> 0:41:53.874 So it knows. 0:41:56.376 --> 0:42:01.561 So the question is that yeah, what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.278 And we want to today look at briefly at three. 0:42:05.725 --> 0:42:08.474 Was initially done. 0:42:08.474 --> 0:42:21.118 It wasn't as famous as in machine translation as in other things, but it's also used there. 0:42:21.221 --> 0:42:28.974 So where you have this mapping from the one hot to a small continuous word representation? 0:42:29.229 --> 0:42:37.891 Using this one in your anthrax you can, for example, replace the embedding layer by the 0:42:37.891 --> 0:42:38.776 trained. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.520 You're always in this pre training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:55.515 More data, that's the trade off so you can get better. 0:42:55.515 --> 0:43:00.128 Disadvantage is, does anybody have? 0:43:04.624 --> 0:43:12.173 Was one of the mentioned today, even like big advantages of the system compared to previous. 0:43:20.660 --> 0:43:26.781 Where one advantage was the end to end training so that all parameters and all components are 0:43:26.781 --> 0:43:27.952 optimal together. 0:43:28.208 --> 0:43:33.386 If you know pre-train something on one pass, it's maybe no longer optimal fitting to everything. 0:43:33.893 --> 0:43:40.338 So that is similar to what should do pretaining or not. 0:43:40.338 --> 0:43:48.163 It depends on how important everything is optimal together and how. 0:43:48.388 --> 0:44:00.552 If the state is a high quality of large amount, the pre trained one is just so much better. 0:44:00.600 --> 0:44:11.215 Standing everything optimal together, we would use random actions for amazing vices. 0:44:11.691 --> 0:44:18.791 Mean, we assume some structures that are trained basically. 0:44:18.791 --> 0:44:26.364 Yes, if you're fine tuning everything, it might be the problem. 0:44:26.766 --> 0:44:31.139 But often yeah, in some way right, so often it's not about. 
0:44:31.139 --> 0:44:37.624 You're really worse with some pre-trained molecules because you're going already in some 0:44:37.624 --> 0:44:43.236 direction, and if this is not really optimal for you, it might be difficult. 0:44:43.603 --> 0:44:51.774 But the bigger is, if you're not getting better because you have a decent amount of data, it's 0:44:51.774 --> 0:44:52.978 so different. 0:44:53.153 --> 0:45:04.884 But mean initially it wasn't a machine translation done so much because there was more data in 0:45:04.884 --> 0:45:09.452 the task, but now it's really large. 0:45:12.632 --> 0:45:14.188 The other one is then OK. 0:45:14.188 --> 0:45:18.258 Now it's always like how much of the model do your pre-track a bit? 0:45:18.658 --> 0:45:25.057 The other one you can do is tack contextual words and then something like bird or a robota 0:45:25.057 --> 0:45:31.667 where you train more already as sequence models and the embeddings you're using are no longer 0:45:31.667 --> 0:45:35.605 specific for words but they're also taking the context. 0:45:35.875 --> 0:45:54.425 Embedding you're using is no longer only depending on the word itself but on the whole sentence. 0:45:55.415 --> 0:46:03.714 And of course you can use similar things also in the decoder just by having layers which 0:46:03.714 --> 0:46:09.122 don't have access to the source but there it's still not. 0:46:11.451 --> 0:46:19.044 And finally, and then we'll look at the end, you can also have models which are already. 0:46:19.419 --> 0:46:28.605 So you may be training a sequence model, but not a monolingual data. 0:46:28.605 --> 0:46:35.128 Of course you have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you're fine tuning. 0:46:47.227 --> 0:46:59.487 But let's first do a bit of step back and look into what are the differences. 0:46:59.487 --> 0:47:02.159 The first thing. 0:47:02.382 --> 0:47:06.870 The word embeddings are just this first layer. 0:47:06.870 --> 0:47:12.027 You can train them with feed-forward neural networks. 0:47:12.212 --> 0:47:25.683 But you can also train them in language model, and by now you hopefully have also seen that 0:47:25.683 --> 0:47:27.733 you can also. 0:47:30.130 --> 0:47:41.558 So this is how you can train them, and you are training them to predict the next word, 0:47:41.558 --> 0:47:45.236 the typical language model. 0:47:45.525 --> 0:47:52.494 And that is what is now referred to as a South Supervised Learning, and for example all the 0:47:52.494 --> 0:47:56.357 big large language models like Chat, gp and so on. 0:47:56.357 --> 0:48:03.098 They are trained at an end or feet, but exactly with this objective to predict the next. 0:48:03.823 --> 0:48:12.847 So that is where you can hopefully learn what a word is used because you always try to predict 0:48:12.847 --> 0:48:17.692 the next word and then you have a ready intuition. 0:48:19.619 --> 0:48:25.374 In the word embedding, why do people first look at the word embeddings and the use of 0:48:25.374 --> 0:48:27.582 word embeddings for other tasks? 0:48:27.582 --> 0:48:32.600 The main advantage is it might be only the first layer you would think of. 0:48:32.600 --> 0:48:34.474 What does it really matter? 0:48:34.474 --> 0:48:39.426 However, it is the layer where you typically have most of the parameters. 
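A small back-of-the-envelope calculation makes that point about the parameters concrete. The numbers are round, illustrative values in line with the figures mentioned next (a vocabulary of about fifty thousand and a hidden size of a few hundred); the LSTM count is only approximate.

```python
# Rough parameter counts for a small recurrent translation/language model;
# the numbers are round, illustrative values.
vocab_size = 50_000
hidden     = 256

embedding_params = vocab_size * hidden                    # input embedding table
rnn_params       = 4 * (hidden * hidden * 2 + hidden)     # one LSTM layer (approx.)
output_params    = hidden * vocab_size                    # projection back to the vocabulary

total = embedding_params + rnn_params + output_params
print(f"embedding: {embedding_params/1e6:.1f}M parameters "
      f"({100*embedding_params/total:.0f}% of {total/1e6:.1f}M total)")
# -> the embedding and output tables dominate; the recurrent layer is tiny by
#    comparison, so reusing pre-trained embeddings already covers a large
#    share of the model's parameters.
```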
0:48:39.879 --> 0:48:52.201 Of course, if you have trained on most of your parameters already on the large data, 0:48:52.201 --> 0:48:59.304 then on your target data you have to train less. 0:48:59.259 --> 0:49:05.841 This big difference that your input size is so much bigger than the size of the normal 0:49:05.841 --> 0:49:06.522 in size. 0:49:06.626 --> 0:49:16.551 So it's a normal size, maybe two hundred and fifty, but your input embedding besides vocabulary 0:49:16.551 --> 0:49:20.583 size is something like fifty thousand. 0:49:23.123 --> 0:49:30.163 And bending while here you see, it's only like times as much in the layer. 0:49:30.750 --> 0:49:36.747 So here's where most of your parameters are, which means if you already replace the word 0:49:36.747 --> 0:49:41.329 embeddings, it might look a bit small in your overall architecture. 0:49:41.329 --> 0:49:47.056 It's where most of the things are, and if you're doing that, you already have really 0:49:47.056 --> 0:49:48.876 big games and can do that. 0:49:57.637 --> 0:50:04.301 The thing is we have seen these wooden beddings can be very good used for other taps. 0:50:04.784 --> 0:50:08.921 Now you learn some relation between words. 0:50:08.921 --> 0:50:14.790 If you're doing this type of language modeling, you predict. 0:50:15.215 --> 0:50:21.532 The one thing is, of course, you have a lot of data, so the one question is we want to 0:50:21.532 --> 0:50:25.961 have a lot of data to good training models, the other thing. 0:50:25.961 --> 0:50:28.721 The tasks need to be somewhat useful. 0:50:29.169 --> 0:50:41.905 If you would predict the first letter of the word, it has to be a task where you need some 0:50:41.905 --> 0:50:45.124 syntactic information. 0:50:45.545 --> 0:50:53.066 The interesting thing is people have looked at these world embeddings here in a language 0:50:53.066 --> 0:50:53.658 model. 0:50:53.954 --> 0:51:04.224 And you're looking at the word embeddings, which are these vectors here. 0:51:04.224 --> 0:51:09.289 You can ask yourself, do they look? 0:51:09.489 --> 0:51:15.122 Don't know if your view is listening to artificial advance artificial intelligence. 0:51:15.515 --> 0:51:23.994 We had on yesterday how to do this type of representation, but you can do this kind of 0:51:23.994 --> 0:51:29.646 representation, and now you're seeing interesting things. 0:51:30.810 --> 0:51:41.248 Now you can represent it here in a three dimensional space with a dimension reduction. 0:51:41.248 --> 0:51:46.886 Then you can look into it and the interesting. 0:51:47.447 --> 0:51:57.539 So this vector between the male and the female version of something is not the same, but it's 0:51:57.539 --> 0:51:58.505 related. 0:51:58.718 --> 0:52:11.256 So you can do a bit of nuts, you subtract this vector, add this vector, and then you 0:52:11.256 --> 0:52:14.501 look around this one. 0:52:14.894 --> 0:52:19.691 So that means okay, there is really something stored, some information stored in that book. 0:52:20.040 --> 0:52:25.003 Similar you can do it with Buck and since you see here swimming slam walk and walk. 0:52:25.265 --> 0:52:42.534 So again these vectors are not the same, but they're related for going from here to here. 0:52:43.623 --> 0:52:47.508 Are semantically the relations between city and capital? 0:52:47.508 --> 0:52:49.757 You have exactly the same thing. 0:52:51.191 --> 0:52:57.857 People having done question answering about that if they show these embeddings and. 
0:52:58.218 --> 0:53:05.198 Or you can also, if you don't trust the dimensionality reduction, because you say maybe 0:53:05.198 --> 0:53:06.705 something gets lost there, 0:53:06.967 --> 0:53:16.473 you can also look into what happens really in the high-dimensional space. 0:53:16.473 --> 0:53:22.227 You can look at what is the nearest neighbor. 0:53:22.482 --> 0:53:29.605 So you can take the relationship between France and Paris and add it to Italy, and you nicely see that you end up near Rome. 0:53:30.010 --> 0:53:33.082 You can do big and bigger, and you have small and smaller. 0:53:33.593 --> 0:53:38.202 It doesn't work everywhere. 0:53:38.202 --> 0:53:49.393 There are also some which only sometimes work. 0:53:51.491 --> 0:53:56.832 You can do what a person is doing, the profession, for famous ones. 0:53:56.832 --> 0:54:05.800 Of course, only ones like Einstein and scientist; that Messi comes out as midfielder is not completely 0:54:05.800 --> 0:54:06.707 correct. 0:54:06.846 --> 0:54:09.781 You see the examples are a bit old. 0:54:09.781 --> 0:54:15.050 The politicians are no longer in office, but the model of course doesn't know that. 0:54:16.957 --> 0:54:29.003 What people have done there, of course, especially at the beginning, is to say: 0:54:29.309 --> 0:54:36.272 we're not really interested in the language model performance, 0:54:36.272 --> 0:54:38.013 we're only interested in the embeddings. 0:54:38.338 --> 0:54:40.634 I think that's something good to keep in mind. 0:54:40.634 --> 0:54:42.688 What are we really interested in? 0:54:42.688 --> 0:54:44.681 Do we really want to have an RNN? 0:54:44.681 --> 0:54:44.923 No. 0:54:44.923 --> 0:54:48.608 In this case we are only interested in this type of mapping. 0:54:49.169 --> 0:54:55.536 And so very successful was this word2vec. 0:54:55.535 --> 0:55:02.597 We are not training a real language model but making it even simpler, doing for example 0:55:02.597 --> 0:55:04.660 continuous bag of words. 0:55:04.660 --> 0:55:11.801 We are just having four input tokens and we are predicting what is the word in the middle, 0:55:11.801 --> 0:55:15.054 and this is just like two linear layers. 0:55:15.615 --> 0:55:22.019 It's simplifying things and making the calculation faster, because that is all we're 0:55:22.019 --> 0:55:22.873 interested in. 0:55:23.263 --> 0:55:34.059 Or this continuous skip-gram model, the other of these two models. 0:55:34.234 --> 0:55:38.273 You have one input word and it's the other way around: 0:55:38.273 --> 0:55:41.651 you're predicting the four words around it. 0:55:41.651 --> 0:55:43.047 It's very similar. 0:55:43.047 --> 0:55:48.702 The task is in the end very similar, but in all of them it's about learning word embeddings. 0:55:51.131 --> 0:56:01.416 Before we go into the next part, are there any questions about these normal word vectors? 0:56:04.564 --> 0:56:07.562 The next thing is contextual word embeddings. 0:56:07.562 --> 0:56:08.670 The idea is yes, 0:56:08.670 --> 0:56:09.778 this is helpful. 0:56:09.778 --> 0:56:14.080 However, we might be able to get more out of the monolingual data. 0:56:14.080 --> 0:56:19.164 For example, if you think about the word can, it can have different meanings. 0:56:19.419 --> 0:56:32.619 And now in the word embedding you have an overlap of these two meanings, so it represents 0:56:32.619 --> 0:56:33.592 both of them. 0:56:34.834 --> 0:56:40.318 But we might be able to disambiguate these already in the pre-trained model, because they are used in completely 0:56:40.318 --> 0:56:41.041 different contexts.
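Before moving on to contextual embeddings, here is a minimal sketch of the continuous bag-of-words model just described: four context words go in, the middle word is predicted, and the model is essentially two linear maps. The vocabulary size, dimensions and the training snippet are illustrative assumptions, not the original word2vec code.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag of words: predict the middle word from its context.
    Essentially two linear maps; the first one is the embedding table we keep."""
    def __init__(self, vocab_size=5000, dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # the embeddings we care about
        self.out = nn.Linear(dim, vocab_size)        # only needed during training

    def forward(self, context):                      # context: (batch, 4) word ids
        avg = self.embed(context).mean(dim=1)        # average the context vectors
        return self.out(avg)                         # logits for the middle word

model = CBOW()
context = torch.randint(0, 5000, (32, 4))            # e.g. two words left, two right
middle = torch.randint(0, 5000, (32,))
loss = nn.functional.cross_entropy(model(context), middle)
loss.backward()
# After training, model.embed.weight is the word-embedding matrix; the
# skip-gram variant just flips the task: given the middle word, predict
# the surrounding context words.
```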
0:56:41.701 --> 0:56:50.998 So if we can have a model which can not only represent the word, but it can also represent 0:56:50.998 --> 0:56:58.660 the meaning of the word within the context, it might be even more helpful. 0:56:59.139 --> 0:57:03.342 So then we're going to contextual word embeddings. 0:57:03.342 --> 0:57:07.709 We're really having a representation of the context. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:20.551 It's like our base language model where you have to do the hidden state. 0:57:20.551 --> 0:57:29.290 The hidden state represents what is apparently said, but it's focusing. 0:57:29.509 --> 0:57:43.814 The first one doing that is in something like the Elmo paper where they instead of like this 0:57:43.814 --> 0:57:48.121 is a normal language model. 0:57:48.008 --> 0:57:52.735 Put in the third predicting the fourth and so on, so you're always predicting the next 0:57:52.735 --> 0:57:53.007 one. 0:57:53.193 --> 0:57:57.919 The architecture of the heaven works embedding layer, and then two are an layer here. 0:57:57.919 --> 0:58:04.255 For example: And now instead of using this one in the end you're using here this one. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:22.909 We can train it in a language model or predicting the next word, but we have more information, 0:58:22.909 --> 0:58:26.162 train there, and therefore. 0:58:27.167 --> 0:58:31.168 And there is one even done currently in. 0:58:31.168 --> 0:58:40.536 The only difference is that we have more layers, bigger size, and we're using transform on here 0:58:40.536 --> 0:58:44.634 or self-attention instead of the R&F. 0:58:44.634 --> 0:58:45.122 But. 0:58:46.746 --> 0:58:52.737 However, if you look at this contextual representation, they might not be perfect. 0:58:52.737 --> 0:58:58.584 So what do you think of this one as contextual representation of the third word? 0:58:58.584 --> 0:59:02.914 Do you see anything which is not really considered in this? 0:59:07.587 --> 0:59:11.492 Only one way yes, so that is not a big issue here. 0:59:11.492 --> 0:59:18.154 It's representing a string in the context of a sentence, however, only in the context. 0:59:18.558 --> 0:59:28.394 However, we have an architecture which can also take both sides and we have used it in 0:59:28.394 --> 0:59:30.203 the ink holder. 0:59:30.630 --> 0:59:34.269 So we could do the and easily only us in the backboard direction. 0:59:34.874 --> 0:59:46.889 By just having the other way around, and then we couldn't combine the forward and into a 0:59:46.889 --> 0:59:49.184 joint one where. 0:59:49.329 --> 0:59:50.861 So You Have a Word embedding. 0:59:51.011 --> 1:00:03.910 Then you have two states, one with a forward, and then one with a backward. 1:00:03.910 --> 1:00:10.359 For example, take the representation. 1:00:10.490 --> 1:00:21.903 Now this same here represents mainly this word because this is where what both focuses 1:00:21.903 --> 1:00:30.561 on is what is happening last but is also looking at the previous. 1:00:31.731 --> 1:00:41.063 However, there is a bit different when training that as a language model you already have. 1:00:43.203 --> 1:00:44.956 Maybe there's again this masking. 1:00:46.546 --> 1:00:47.814 That is one solution. 
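The contextual-embedding idea just described can be sketched very directly: instead of taking the static embedding of word t, you take the recurrent hidden state at position t of a language model. All names and sizes below are illustrative; note in the code why this still only covers the left context, which is exactly the limitation discussed next.

```python
import torch
import torch.nn as nn

class ForwardLMEncoder(nn.Module):
    """Contextual word representations from a left-to-right RNN language model."""
    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def contextual_embeddings(self, tokens):          # tokens: (batch, T)
        states, _ = self.rnn(self.embed(tokens))      # (batch, T, dim)
        # states[:, t] represents token t *in context* -- but only the context
        # to its left, since the recurrence runs left to right.
        return states

enc = ForwardLMEncoder()
sentence = torch.randint(0, 5000, (1, 6))
reps = enc.contextual_embeddings(sentence)
print(reps.shape)   # torch.Size([1, 6, 128])
```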
1:00:47.814 --> 1:00:53.407 First of all, why we can't do it is the information you leave it, so you cannot just predict the 1:00:53.407 --> 1:00:54.041 next word. 1:00:54.041 --> 1:00:58.135 If we just predict the next word in this type of model, that's a very. 1:00:58.738 --> 1:01:04.590 You know the next word because it's influencing this hidden stage and then it's very easy so 1:01:04.590 --> 1:01:07.736 predicting something you know is not a good task. 1:01:07.736 --> 1:01:09.812 This is what I mentioned before. 1:01:09.812 --> 1:01:13.336 You have to define somehow a task which is challenging. 1:01:13.753 --> 1:01:19.007 Because in this case one would, I mean, the system would just ignore the states and what 1:01:19.007 --> 1:01:22.961 it would learn is that you copy this information directly in here. 1:01:23.343 --> 1:01:31.462 So it would mainly be representing this word and you would have a perfect model because 1:01:31.462 --> 1:01:38.290 you only need to find an encoding where you can encode all words somehow. 1:01:38.458 --> 1:01:44.046 The only thing that will learn is that tenor and coat all my words in this upper hidden. 1:01:44.985 --> 1:01:49.584 And then, of course, it's not really useful. 1:01:49.584 --> 1:01:53.775 We need to find a bit of different ways. 1:01:55.295 --> 1:01:59.440 There is a masking one. 1:01:59.440 --> 1:02:06.003 I'll come to that shortly just a bit. 1:02:06.003 --> 1:02:14.466 The other thing is not to directly combine them. 1:02:14.594 --> 1:02:22.276 So you never merge the states only at the end. 1:02:22.276 --> 1:02:33.717 The representation of the words is now from the forward and the next. 1:02:33.873 --> 1:02:35.964 So it's always a hidden state before that. 1:02:36.696 --> 1:02:41.273 And these two you're joined now to your to the representation. 1:02:42.022 --> 1:02:50.730 And then you have now a representation also about the whole sentence for the word, but 1:02:50.730 --> 1:02:53.933 there's no information leakage. 1:02:53.933 --> 1:02:59.839 One way of doing this is instead of doing a bidirectional. 1:03:00.380 --> 1:03:08.079 You can do that, of course, in all layers. 1:03:08.079 --> 1:03:16.315 In the end you have different bedding states. 1:03:16.596 --> 1:03:20.246 However, it's a bit of a complicated. 1:03:20.246 --> 1:03:25.241 You have to keep up separate and then merge things. 1:03:27.968 --> 1:03:33.007 And that is is the moment where, like the, the peak. 1:03:34.894 --> 1:03:42.018 Idea of the big success of the bird model was used, maybe in bidirector case. 1:03:42.018 --> 1:03:48.319 It's not good to do the next word prediction, but we can do masking. 1:03:48.308 --> 1:03:59.618 And masking maybe means we do a prediction of something in the middle or some words. 1:03:59.618 --> 1:04:08.000 If we have the input, we're just putting noise into the input. 1:04:08.048 --> 1:04:14.040 Now there can be no information leakage because this wasn't in the input. 1:04:14.040 --> 1:04:15.336 Now predicting. 1:04:16.776 --> 1:04:20.524 So thereby we don't do any assumption again about our models. 1:04:20.524 --> 1:04:24.815 It doesn't need to be a forward model or a backward model or anything. 1:04:24.815 --> 1:04:29.469 You can have any type of architecture and you can always predict the street. 1:04:30.530 --> 1:04:39.112 There is maybe one disadvantage: do you see what could be a bit of a problem this type 1:04:39.112 --> 1:04:40.098 compared? 
1:05:00.000 --> 1:05:05.920 Yes, so yeah mean you cannot cross mass more, but to see it more globally just twist assume 1:05:05.920 --> 1:05:07.142 you only mask one. 1:05:07.142 --> 1:05:12.676 For the whole sentence we get one feedback signal like what is the word street, so we 1:05:12.676 --> 1:05:16.280 have one training sample, a model for the whole center. 1:05:17.397 --> 1:05:19.461 The language modeling paste. 1:05:19.461 --> 1:05:21.240 We predicted here three. 1:05:21.240 --> 1:05:22.947 We predicted here four. 1:05:22.947 --> 1:05:24.655 We predicted here five. 1:05:25.005 --> 1:05:26.973 So we have a number of tokens. 1:05:26.973 --> 1:05:30.974 For each token we have a feet bed and saying what is the best. 1:05:31.211 --> 1:05:39.369 So in this case of course this is a lot less efficient because we are getting less feedback 1:05:39.369 --> 1:05:45.754 signals on what we should predict compared to models where we're doing. 1:05:48.348 --> 1:05:54.847 So in birth the main idea this bidirectional model was masking. 1:05:54.847 --> 1:05:59.721 It was the first large model using transformer. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that this next word prediction is another task. 1:06:16.957 --> 1:06:25.395 Again you want to learn more about what language is to really understand. 1:06:25.395 --> 1:06:35.089 Are these two sentences like following a story or they're independent of each other? 1:06:38.158 --> 1:06:43.026 The input is using subword units as we're using it and we're using it. 1:06:43.026 --> 1:06:48.992 It has some special token, the beginning, the CLS token that is straining for the next 1:06:48.992 --> 1:06:50.158 word prediction. 1:06:50.470 --> 1:06:57.296 It's more for machine translation. 1:06:57.296 --> 1:07:07.242 It's more for classification tasks because you're. 1:07:07.607 --> 1:07:24.323 You have two sentences, and then you have a position of encoding as we know them in general. 1:07:24.684 --> 1:07:28.812 Now what is more challenging is masking. 1:07:28.812 --> 1:07:30.927 So what do you mask? 1:07:30.927 --> 1:07:35.055 We already have to question like should. 1:07:35.275 --> 1:07:44.453 So there has been afterwards eating some work like, for example, Urbana, which tries to improve. 1:07:44.453 --> 1:07:52.306 It's not super sensitive, but of course if you do it completely wrong then you're. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:03.285 All types should always mask the poor word. 1:08:03.285 --> 1:08:14.562 If have a subword, it's good to mask only like a subword and predict based. 1:08:14.894 --> 1:08:20.755 You know, like three parts of the words, it might be easier to get the last because they 1:08:20.755 --> 1:08:27.142 here took the easiest selections, not considering words anymore at all because you're doing that 1:08:27.142 --> 1:08:32.278 in the pre-processing and just taking always words like subwords and masking. 1:08:32.672 --> 1:08:36.286 Their thinking will bear them differently. 1:08:36.286 --> 1:08:40.404 They mark always the full words, but guess it's. 1:08:41.001 --> 1:08:46.969 And then what to do with the mask work in eighty percent of the cases is the word is 1:08:46.969 --> 1:08:47.391 mask. 1:08:47.391 --> 1:08:50.481 They replace it with a special token thing. 1:08:50.481 --> 1:08:52.166 This is the mask token. 
1:08:52.166 --> 1:08:58.486 In ten percent they put in some random other token in there, and in ten percent they keep 1:08:58.486 --> 1:08:59.469 it unchanged. 1:09:02.202 --> 1:09:11.519 And then what you can do is also this next prediction. 1:09:11.519 --> 1:09:17.786 So if you have the man went to mass. 1:09:18.418 --> 1:09:24.090 So may you see you're joining that you're doing both masks and next prediction that. 1:09:24.564 --> 1:09:34.402 And if the sentence is pinguine masks are flyless birds, then these two sentences have 1:09:34.402 --> 1:09:42.995 nothing to do with each other, and so in this case it's not the next token. 1:09:47.127 --> 1:09:56.184 And that is the whole bird model, so here is the input, here the transformable layers, 1:09:56.184 --> 1:09:58.162 and you can train. 1:09:58.598 --> 1:10:08.580 And this model was quite successful in general applications. 1:10:08.580 --> 1:10:17.581 It was not as successful as people are nowadays using. 1:10:17.937 --> 1:10:27.644 However, there is like a huge thing of different types of models coming from that. 1:10:27.827 --> 1:10:39.109 So based on bird and other semi-supervised models like a whole setup came out of there 1:10:39.109 --> 1:10:42.091 and there's different. 1:10:42.082 --> 1:10:46.637 With the availability of large languages more than the success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it goes a bit. 1:10:50.910 --> 1:10:59.321 Change the bit again from like more this spider action model to unidirectional models, or at 1:10:59.321 --> 1:11:03.843 the moment maybe a bit more we're coming to them. 1:11:03.843 --> 1:11:09.179 Now do you see one advantage,, and we have the efficiency. 1:11:09.509 --> 1:11:16.670 There's one other reason why you sometimes are more interested in unidirectional models 1:11:16.670 --> 1:11:17.158 than. 1:11:22.882 --> 1:11:30.882 Mean it depends on the task, but for example for a language generation task, the task. 1:11:32.192 --> 1:11:34.574 It's not only interesting, it doesn't work. 1:11:34.574 --> 1:11:39.283 So if you want to do a generation like the decoder so you want to generate a sentence, 1:11:39.283 --> 1:11:42.856 you don't know the future so you cannot apply this type of model. 1:11:43.223 --> 1:11:49.498 This time off model can be used for the encoder in an encoder model but cannot be used for 1:11:49.498 --> 1:11:55.497 the decoder because it is trained that only works and it has information on both sides 1:11:55.497 --> 1:11:56.945 and if you're doing. 1:12:00.000 --> 1:12:05.559 Yeah, that's a good view to the next overall task of models. 1:12:05.559 --> 1:12:08.839 We have so if you view it from the. 1:12:09.009 --> 1:12:13.137 Of you we have the encoder baseball. 1:12:13.137 --> 1:12:16.372 That's what we just look at. 1:12:16.372 --> 1:12:20.612 They are bidirectional and typically. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:35.217 At the beginning is the decoder-based model, so the outer-regressive mounts which are unit 1:12:35.217 --> 1:12:42.619 based model, and there we can do the next prediction. 1:12:43.403 --> 1:12:52.421 And what you can also do first, and there you can also have special things called prefix 1:12:52.421 --> 1:12:53.434 language. 
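A minimal sketch of the masking rule described above: about 15 percent of the tokens are selected, and of those 80 percent are replaced by the special mask token, 10 percent by a random other token, and 10 percent are kept unchanged; the model is then trained to recover the original token at the selected positions. The token ids and vocabulary size are illustrative assumptions.

```python
import random

MASK_ID = 0          # illustrative id for the special mask token
VOCAB_SIZE = 30_000  # illustrative vocabulary size

def mask_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_input, labels); labels are None where no loss is taken."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                                 # not selected: no prediction here
        labels[i] = tok                              # model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                      # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)    # 10%: random other token
        # else: 10% keep the token unchanged
    return inputs, labels

corrupted, labels = mask_for_mlm([12, 523, 77, 4031, 9, 288], seed=3)
print(corrupted, labels)
# Note how few positions give a training signal per sentence -- the efficiency
# point raised earlier when comparing masking with next-word prediction.
```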
1:12:54.354 --> 1:13:04.079 Because we are saying it might be helpful that some of your inputs you can use by direction 1:13:04.079 --> 1:13:17.334 because: That is what is called a prefix where you say on the first tokens you have bidirectional 1:13:17.334 --> 1:13:19.094 connections. 1:13:19.219 --> 1:13:28.768 You somehow merge that mainly works only in transformer based models because the uni direction. 1:13:29.629 --> 1:13:34.894 There is no different number of parameters. 1:13:34.975 --> 1:13:38.533 Transformer: The only difference is how you mask your attention. 1:13:38.878 --> 1:13:47.691 We have seen that in the encoder, in the decoder, the number of parameters is different because 1:13:47.691 --> 1:13:50.261 you do the cross-attention. 1:13:50.650 --> 1:13:58.389 It's only like you mask your attention to only look at the bad past or also look into 1:13:58.389 --> 1:13:59.469 the future. 1:14:00.680 --> 1:14:03.323 And now you can, of course, also do mixing. 1:14:03.563 --> 1:14:08.307 So this is a bidirectional attention metric where you can attend to everything. 1:14:08.588 --> 1:14:23.477 That is a unidirection or causal where you can only look at the past and you can do this 1:14:23.477 --> 1:14:25.652 with prefix. 1:14:29.149 --> 1:14:42.829 Some are all clear based on that, then of course you can also do the other thing. 1:14:43.163 --> 1:14:54.497 So the idea is we have our encoder, decoder architecture, can we also train them completely 1:14:54.497 --> 1:14:57.700 in a side supervised way? 1:14:58.238 --> 1:15:06.206 In this case we have the same input to both, so in this case we would have the sentence 1:15:06.206 --> 1:15:08.470 as input in the decoder. 1:15:08.470 --> 1:15:12.182 Then we need to do some type of masking. 1:15:12.912 --> 1:15:16.245 Here we don't need to do the masking, but here we need to do. 1:15:16.245 --> 1:15:17.911 The masking doesn't know ever. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful also, especially for pre-training machine translation. 1:15:30.330 --> 1:15:45.934 This is the first model of the BART model, which is one successful way to pre-train your 1:15:45.934 --> 1:15:47.162 model. 1:15:47.427 --> 1:15:52.858 Where you put in source sentence, we can't do that here. 1:15:52.858 --> 1:15:55.430 We only have one language. 1:15:55.715 --> 1:16:00.932 But we can just put this twice in there, and that is not a trivial task. 1:16:00.932 --> 1:16:08.517 We can change it in: They do quite a bit of different corruption techniques. 1:16:08.517 --> 1:16:12.751 You can do token masking and you can also. 1:16:13.233 --> 1:16:20.785 That you couldn't do and go the only system because then it wouldn't be there if you cannot 1:16:20.785 --> 1:16:22.345 predict somewhere. 1:16:22.345 --> 1:16:26.368 So the number of input and output tokens always. 1:16:26.906 --> 1:16:29.820 You cannot do a prediction for something which isn't it? 1:16:30.110 --> 1:16:39.714 Here in the decoder side it's uni-direction so we can also delete and then generate the 1:16:39.714 --> 1:16:40.369 full. 1:16:41.061 --> 1:16:48.628 We can do sentence per rotation where you change the sentence. 1:16:48.628 --> 1:16:54.274 We can document rotation and text and filling. 1:16:55.615 --> 1:17:05.870 So you see there's quite a lot of types of models that you can use in order to pre-train 1:17:05.870 --> 1:17:06.561 your. 1:17:07.507 --> 1:17:12.512 And these are the models you can use. 
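The three attention patterns just contrasted can be written out explicitly as masks: the parameters stay the same, only which positions may attend to which changes. A minimal sketch (1 means position i may attend to position j); the prefix length is an arbitrary example value.

```python
import numpy as np

def attention_mask(T, kind="bidirectional", prefix_len=0):
    """1 means position i (row) may attend to position j (column)."""
    if kind == "bidirectional":          # encoder / BERT-style: see everything
        return np.ones((T, T), dtype=int)
    if kind == "causal":                 # decoder / autoregressive: only the past
        return np.tril(np.ones((T, T), dtype=int))
    if kind == "prefix":                 # prefix LM: full attention within the
        m = np.tril(np.ones((T, T), dtype=int))      # first `prefix_len` tokens,
        m[:, :prefix_len] = 1                        # causal attention afterwards
        return m
    raise ValueError(kind)

for k in ["bidirectional", "causal", "prefix"]:
    print(k)
    print(attention_mask(5, k, prefix_len=2))
```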
1:17:12.512 --> 1:17:21.072 Of course, the other question is how do you integrate them into? 1:17:21.761 --> 1:17:26.638 And there's also like yeah quite some different ways of techniques. 1:17:27.007 --> 1:17:28.684 It's a Bit Similar to Before. 1:17:28.928 --> 1:17:39.307 So the easiest thing is you take your word embeddings or your pre-train model. 1:17:39.307 --> 1:17:47.979 If you're contextual embedding several layers you freeze them in. 1:17:48.748 --> 1:17:53.978 Can also be done if you have a bark model. 1:17:53.978 --> 1:18:03.344 You freeze your wooden beddings, for example, and only train the top layers. 1:18:05.865 --> 1:18:14.965 The other thing is you initialize them so you initialize your models but then you train 1:18:14.965 --> 1:18:19.102 everything so you're not only training. 1:18:22.562 --> 1:18:32.600 When you have then one thing, if you think about Bart, there's one thing, so you want 1:18:32.600 --> 1:18:35.752 to have the same language. 1:18:36.516 --> 1:18:46.013 Typically mean the one you get is from English, so you can not try to do some language. 1:18:46.366 --> 1:18:55.165 Below the barge, in order to learn some language specific stuff or there's a multilingual barge 1:18:55.165 --> 1:19:03.415 which is trained on many languages, it's trained only on like it's more or less language. 1:19:03.923 --> 1:19:09.745 So then you would still need to find June and the model needs to learn how to better 1:19:09.745 --> 1:19:12.074 do the attention cross lingually. 1:19:12.074 --> 1:19:18.102 It's only on the same language but it mainly only has to learn this mapping and not all 1:19:18.102 --> 1:19:18.787 the rest. 1:19:21.982 --> 1:19:27.492 A third thing which is is very commonly used is what is frequent to it as adapters. 1:19:27.607 --> 1:19:29.749 So, for example, you take and bark. 1:19:29.709 --> 1:19:35.502 And you put some adapters on the inside of the network so that it's small new layers which 1:19:35.502 --> 1:19:41.676 are in between put in there and then you only train these adapters or also train these adapters. 1:19:41.676 --> 1:19:47.724 So for example in Embry you could see that this learns to map the Seus language representation 1:19:47.724 --> 1:19:50.333 to the targeted language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change that luck. 1:19:52.792 --> 1:20:04.197 Ideas that you give it some extra ability to really perform well on that, and then it's 1:20:04.197 --> 1:20:05.234 easier. 1:20:05.905 --> 1:20:15.117 Is also very commonly used, for example, in multilingual systems where the idea is you 1:20:15.117 --> 1:20:16.282 have some. 1:20:16.916 --> 1:20:23.505 So they are trained only for one language pair, so the model has some of those it once 1:20:23.505 --> 1:20:27.973 has the abilities to do multilingually to share knowledge. 1:20:27.973 --> 1:20:33.729 But then there is some knowledge which is very language specific, and then. 1:20:34.914 --> 1:20:39.291 But there's one chance in general, the multilingual systems. 1:20:39.291 --> 1:20:40.798 It works quite well. 1:20:40.798 --> 1:20:47.542 There's one specific use case for multilingual, where this normally doesn't really work well. 1:20:47.542 --> 1:20:49.981 Do you have an idea of what that? 1:20:55.996 --> 1:20:57.534 It's for Zero Short Cases. 
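As a side note on the adapter idea described above, here is a minimal sketch of one common adapter design: a small bottleneck layer with a residual connection is inserted into an otherwise frozen pre-trained model, and only the adapter parameters are trained. The bottleneck shape and the stand-in `pretrained_layer` are illustrative assumptions, not the specific setup from the lecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck inserted inside a frozen pre-trained layer."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        # residual connection: at initialization the adapter changes very little
        return x + self.up(torch.relu(self.down(x)))

# Usage sketch: freeze the big pre-trained model and train only the adapters
# (`pretrained_layer` is a hypothetical stand-in for e.g. one Transformer block
# of a multilingual pre-trained model).
pretrained_layer = nn.Linear(512, 512)
for p in pretrained_layer.parameters():
    p.requires_grad = False

adapter = Adapter()
x = torch.randn(2, 10, 512)
y = adapter(pretrained_layer(x))      # only the adapter parameters receive gradients
print(sum(p.numel() for p in adapter.parameters()), "trainable adapter parameters")
```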
1:20:57.998 --> 1:21:06.051 In the zero-shot case this doesn't work as well, because then you're again adding something which might be very language-specific; 1:21:06.051 --> 1:21:15.046 in zero-shot, the idea is always to learn representations which are more language-independent, and with 1:21:15.046 --> 1:21:17.102 the adapters you of course go against that. 1:21:20.260 --> 1:21:37.655 And there's also the idea of doing more like a knowledge distillation setup, so in this case the pre-trained model serves as a teacher. 1:21:39.179 --> 1:21:41.177 And now the idea is: 1:21:41.177 --> 1:21:48.095 we are training the model the same way as before, but what we want to achieve, as an additional objective, is that the hidden states of the 1:21:48.095 --> 1:21:54.090 encoder are as similar as possible to the ones of the pre-trained model. 1:21:54.414 --> 1:22:07.569 So you should learn faster by telling the model to make these states as similar as possible. 1:22:07.569 --> 1:22:11.813 You compare the first hidden state to the first, the second to the second, and so on, 1:22:12.192 --> 1:22:18.549 for example by using the L2 norm, so by just pushing these two representations to be the same. 1:22:20.020 --> 1:22:22.880 Now, this requires the same vocabulary. 1:22:22.880 --> 1:22:25.468 Why does it need the same vocabulary? 1:22:25.468 --> 1:22:26.354 Can you give me the reason? 1:22:34.754 --> 1:22:39.132 If you have a different vocabulary, 1:22:39.132 --> 1:22:50.711 you also have different sequence lengths, because you get a different segmentation. 1:22:51.231 --> 1:22:55.680 Then what happens is that we have a different number of states, 1:22:55.680 --> 1:23:01.097 and it's no longer straightforward which states to compare. 1:23:02.322 --> 1:23:05.892 And then it's just easier to have the same number, 1:23:05.892 --> 1:23:08.952 so you can always compare the first to the first, the second to the second, and so on. 1:23:09.709 --> 1:23:16.836 So at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.871 Of course you could do things like requiring the averages to be the same, but of course that's a less 1:23:30.871 --> 1:23:33.080 strong signal. 1:23:34.314 --> 1:23:47.087 But the advantage here is that you have a direct training signal on the encoder, 1:23:47.087 --> 1:23:52.457 so you can directly guide it with this signal. 1:23:56.936 --> 1:24:11.208 Yes, I think this is mostly it for today, so what you should keep in mind today is two 1:24:11.208 --> 1:24:18.147 techniques: The one is the back-translation idea. 1:24:18.147 --> 1:24:26.598 If you have monolingual data, you back-translate it and use it as additional training data. 1:24:26.886 --> 1:24:33.608 And yeah, it is often helpful to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:39.669 You can use pre-trained models, but then you can still also do back-translation where 1:24:39.669 --> 1:24:40.066 it helps. 1:24:40.160 --> 1:24:47.058 With back-translation we have the advantage that we are training everything working together on the task, 1:24:47.058 --> 1:24:54.422 so it might be helpful to back-translate some data and then use it in the real translation task, 1:24:54.422 --> 1:24:57.755 because in pre-training the big challenge is that the model never sees the translation task itself. 1:24:58.058 --> 1:25:07.392 You can see there are different ways of integrating this knowledge, but even if you use the full 1:25:07.392 --> 1:25:08.087 model: 1:25:08.748 --> 1:25:11.713 this is the most similar you can get. 1:25:11.713 --> 1:25:15.224 You're doing no changes to the architecture. 1:25:15.224 --> 1:25:20.608 You're really taking the model and just fine-tuning it on the new task. 1:25:20.608 --> 1:25:24.041 But it still has to completely newly learn how to actually translate.
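To make the hidden-state matching described above a bit more concrete, here is a minimal PyTorch sketch (not from the lecture). The squared L2 / mean-squared-error term pulls the student encoder's states towards those of the frozen pre-trained model; names such as pretrained_encoder, mt_encoder, src_tokens and translation_loss are placeholders.

```python
import torch
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hidden: torch.Tensor,
                                   teacher_hidden: torch.Tensor,
                                   weight: float = 1.0) -> torch.Tensor:
    # Both tensors have shape (batch, seq_len, d_model). Comparing position by
    # position only makes sense if both models use the same vocabulary and
    # segmentation, so that state i of the student lines up with state i of
    # the teacher.
    return weight * F.mse_loss(student_hidden, teacher_hidden)

# Schematic use inside the training loop:
#   with torch.no_grad():
#       teacher_hidden = pretrained_encoder(src_tokens)   # frozen teacher
#   student_hidden = mt_encoder(src_tokens)
#   loss = translation_loss + hidden_state_distillation_loss(student_hidden,
#                                                            teacher_hidden)
```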
1:25:24.464 --> 1:25:29.978 So it might be helpful, for example, to also have some back-translated data so the model can learn that. 1:25:32.192 --> 1:25:45.096 Good. One important thing: next Tuesday there is a conference or a workshop in this 1:25:45.096 --> 1:25:45.947 room. 1:25:47.127 --> 1:25:54.405 You should get an email, if you're on the mailing list, that there is a room change for Tuesday, only 1:25:54.405 --> 1:25:57.398 for Tuesday, and afterwards it's again the normal room. 1:25:57.637 --> 1:26:03.714 Some more questions, again from a more general perspective: in computer vision 1:26:03.714 --> 1:26:07.246 you can enlarge your data set with data augmentation. 1:26:07.246 --> 1:26:08.293 Is there something 1:26:08.388 --> 1:26:15.306 similar to enlarge speech or text data, so data augmentation? 1:26:15.755 --> 1:26:27.013 You can use this back-translation and also the masking, but I would say 1:26:27.013 --> 1:26:31.201 that back-translation is the most similar thing. 1:26:31.371 --> 1:26:35.632 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:40.958 If you have a good MT system, it can also be used for parallel data by augmenting 1:26:40.958 --> 1:26:46.061 your data with more data, because then you have the human translation and the automatic translation, 1:26:46.061 --> 1:26:46.783 and both are good. 1:26:46.783 --> 1:26:51.680 You're just having more data and a better feedback signal in different ways, because there's not 1:26:51.680 --> 1:26:53.845 only one correct translation but several (see the short sketch at the end). 1:26:54.834 --> 1:26:58.327 I would say this is the most similar one. 1:26:58.327 --> 1:27:00.947 For images you can just rotate things and so on. 1:27:00.947 --> 1:27:03.130 There are some ways you can do that here as well. 1:27:05.025 --> 1:27:07.646 But, for example, rule-based replacement is rarely used. 1:27:07.646 --> 1:27:13.907 It's very hard to do this by rules, like deciding which words to replace, because there's not 1:27:13.907 --> 1:27:14.490 a clear rule. 1:27:14.490 --> 1:27:18.931 You cannot always say this word can always be replaced by that one. 1:27:19.139 --> 1:27:28.824 I mean, even if they look like perfect synonyms, they are good in some cases, but not in all 1:27:28.824 --> 1:27:29.585 cases. 1:27:29.585 --> 1:27:36.985 And if you don't do it rule-based, you have to train a model again. 1:27:38.058 --> 1:27:57.050 Here we compare the hidden states, so it is easiest if it is the same architecture as the pre-trained model, normally. 1:27:57.457 --> 1:27:59.817 They should be of the same dimension, so it's easiest to have the same 1:28:00.000 --> 1:28:03.780 architecture. We will later see, in the lecture on efficiency, that 1:28:03.780 --> 1:28:08.949 you can also do knowledge distillation with, for example, smaller models. 1:28:08.949 --> 1:28:15.816 So you can have twelve layers and then only five, and then you try to learn the same within five 1:28:15.816 --> 1:28:16.433 layers. 1:28:17.477 --> 1:28:22.945 Or eight layers; so that is possible, but yes, I agree, it should be of the same hidden size. 1:28:23.623 --> 1:28:35.963 The question then, of course, is whether you do it as an initialization or you do it during 1:28:35.963 --> 1:28:37.305 training, 1:28:37.305 --> 1:28:41.195 in parallel to your main training objective. 1:28:45.865 --> 1:28:53.964 Good, then thanks a lot, and then we'll see each other again on Tuesday.
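To connect the back-translation takeaway and the augmentation answer above, a schematic sketch follows; translate_target_to_source is a hypothetical callable standing in for whatever trained MT system is available, so this illustrates the idea rather than a particular toolkit's API.

```python
def back_translate(monolingual_target_sentences, translate_target_to_source):
    """Turn target-side monolingual text into synthetic parallel data.

    translate_target_to_source: hypothetical callable wrapping a trained
    target-to-source MT system (no specific system is assumed here).
    """
    pairs = []
    for tgt in monolingual_target_sentences:
        src = translate_target_to_source(tgt)  # machine-translated, possibly noisy
        pairs.append((src, tgt))               # clean human text stays on the target side
    return pairs

# The same machinery can also forward-translate the source side of existing
# parallel data to add alternative automatic translations next to the human
# reference. The synthetic pairs are then mixed with the real parallel data
# when training the source-to-target system.
```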