diff --git "a/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" "b/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" @@ -0,0 +1,12362 @@ +WEBVTT + +0:00:00.981 --> 0:00:20.036 +Today about is how to use some type of additional +resources to improve the translation. + +0:00:20.300 --> 0:00:28.188 +We have in the first part of the semester +two thirds of the semester how to build some + +0:00:28.188 --> 0:00:31.361 +of your basic machine translation. + +0:00:31.571 --> 0:00:42.317 +Now the basic components are both for statistical +and for neural, with the encoded decoding. + +0:00:43.123 --> 0:00:46.000 +Now, of course, that's not where it stops. + +0:00:46.000 --> 0:00:51.286 +It's still what nearly every machine translation +system is currently in there. + +0:00:51.286 --> 0:00:57.308 +However, there's a lot of challenges which +you need to address in addition and which need + +0:00:57.308 --> 0:00:58.245 +to be solved. + +0:00:58.918 --> 0:01:09.858 +And there we want to start to tell you what +else can you do around this, and partly. + +0:01:10.030 --> 0:01:14.396 +And one important question there is on what +do you train your models? + +0:01:14.394 --> 0:01:32.003 +Because like this type of parallel data, it's +easier in machine translation than in other + +0:01:32.003 --> 0:01:33.569 +trusts. + +0:01:33.853 --> 0:01:41.178 +And therefore an important question is, can +we also learn from like other sources and through? + +0:01:41.701 --> 0:01:47.830 +Because if you remember strongly right at +the beginning of the election,. + +0:01:51.171 --> 0:01:53.801 +This Is How We Train All Our. + +0:01:54.194 --> 0:01:59.887 +Machine learning models from statistical to +neural. + +0:01:59.887 --> 0:02:09.412 +This doesn't have changed so we need this +type of parallel data where we have a source + +0:02:09.412 --> 0:02:13.462 +sentence aligned with a target data. + +0:02:13.493 --> 0:02:19.135 +We have now a strong model here, a very good +model to do that. + +0:02:19.135 --> 0:02:22.091 +However, we always rely on this. + +0:02:22.522 --> 0:02:28.395 +For languages, high risk language pairs say +from German to English or other European languages, + +0:02:28.395 --> 0:02:31.332 +there is decent amount, at least for similarly. + +0:02:31.471 --> 0:02:37.630 +But even there if we are going to very specific +domains it might get difficult and then your + +0:02:37.630 --> 0:02:43.525 +system performance might drop because if you +want to translate now some medical text for + +0:02:43.525 --> 0:02:50.015 +example of course you need to also have peril +data in the medical domain to know how to translate + +0:02:50.015 --> 0:02:50.876 +these types. + +0:02:51.231 --> 0:02:55.264 +Phrases how to use the vocabulary and so on +in the style. + +0:02:55.915 --> 0:03:04.887 +And if you are going to other languages, there +is a lot bigger challenge and the question + +0:03:04.887 --> 0:03:05.585 +there. + +0:03:05.825 --> 0:03:09.649 +So is really this the only resource we can +use. + +0:03:09.889 --> 0:03:19.462 +Can be adapted or training phase in order +to also make use of other types of models that + +0:03:19.462 --> 0:03:27.314 +might enable us to build strong systems with +other types of information. + +0:03:27.707 --> 0:03:35.276 +And that we will look into now in the next +starting from from just saying the next election. 
+ +0:03:35.515 --> 0:03:40.697 +So this idea we already have covered on Tuesday. + +0:03:40.697 --> 0:03:45.350 +One very successful idea for this is to do. + +0:03:45.645 --> 0:03:51.990 +So that we're no longer doing translation +between languages, but we can do translation + +0:03:51.990 --> 0:03:55.928 +between languages and share common knowledge +between. + +0:03:56.296 --> 0:04:04.703 +And you also learned about things like zero +shots machine translation so you can translate + +0:04:04.703 --> 0:04:06.458 +between languages. + +0:04:06.786 --> 0:04:09.790 +Which is the case for many, many language +pairs. + +0:04:10.030 --> 0:04:19.209 +Like even with German, you have not translation +parallel data to all languages around the world, + +0:04:19.209 --> 0:04:26.400 +or most of them you have it to the Europeans +once, maybe even for Japanese. + +0:04:26.746 --> 0:04:35.332 +There is quite a lot of data, for example +English to Japanese, but German to Japanese + +0:04:35.332 --> 0:04:37.827 +or German to Vietnamese. + +0:04:37.827 --> 0:04:41.621 +There is some data from Multilingual. + +0:04:42.042 --> 0:04:54.584 +So there is a very promising direction if +you want to build translation systems between + +0:04:54.584 --> 0:05:00.142 +language peers, typically not English. + +0:05:01.221 --> 0:05:05.887 +And the other ideas, of course, we don't have +to either just search for it. + +0:05:06.206 --> 0:05:12.505 +Some work on a data crawling so if I don't +have a corpus directly or I don't have an high + +0:05:12.505 --> 0:05:19.014 +quality corpus like from the European Parliament +for a TED corpus so maybe it makes sense to + +0:05:19.014 --> 0:05:23.913 +crawl more data and get additional sources +so you can build stronger. + +0:05:24.344 --> 0:05:35.485 +There has been quite a big effort in Europe +to collect really large data sets for parallel + +0:05:35.485 --> 0:05:36.220 +data. + +0:05:36.220 --> 0:05:40.382 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:50.037 +But how can we explicitly crawl data? + +0:05:50.037 --> 0:05:52.070 +Which is some of a peril? + +0:05:52.132 --> 0:05:58.461 +So there is in the Internet quite a lot of +data which has been company websites which + +0:05:58.461 --> 0:06:01.626 +have been translated and things like that. + +0:06:01.626 --> 0:06:05.158 +So how can you extract them parallel fragments? + +0:06:06.566 --> 0:06:13.404 +That is typically more noisy than where you +do more at hands where mean if you have Parliament. + +0:06:13.693 --> 0:06:17.680 +You can do some rules how to extract parallel +things. + +0:06:17.680 --> 0:06:24.176 +Here there is more to it, so the quality is +later maybe not as good, but normally scale + +0:06:24.176 --> 0:06:26.908 +is then a possibility to address it. + +0:06:26.908 --> 0:06:30.304 +So you just have so much more data that even. + +0:06:33.313 --> 0:06:40.295 +The other thing can be used monolingual data +and monolingual data has a big advantage that + +0:06:40.295 --> 0:06:46.664 +we can have a huge amount of that so that you +can be autocrawed from the Internet. + +0:06:46.664 --> 0:06:51.728 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:59.558 +There is just so much more magnitude of monolingual +data so that it might be very helpful. 
+ +0:06:59.559 --> 0:07:03.054 +We can do that in statistical machine translation. + +0:07:03.054 --> 0:07:06.755 +It was quite easy to integrate using language +models. + +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. + +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. 
+ +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. + +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. + +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. 
+ +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. + +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. 
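[Editor's note] As a reference for the parallel combination discussed above, here is a minimal PyTorch sketch of interpolating the NMT decoder's distribution with an external language model's distribution over the same target vocabulary. The function name and the interpolation weight are illustrative; with lm_weight = 0.5 this is exactly the plain averaging of the two probability distributions mentioned in the lecture, and interpolating log-probabilities instead is the closely related "shallow fusion" variant.

```python
import torch

def combine_parallel(nmt_logits, lm_logits, lm_weight=0.5):
    """Parallel combination of an NMT decoder and a separately trained
    language model for one decoding step.

    Both tensors have shape [batch, vocab] over the same target vocabulary;
    lm_weight = 0.5 reproduces the plain average of the two distributions.
    """
    nmt_prob = torch.softmax(nmt_logits, dim=-1)
    lm_prob = torch.softmax(lm_logits, dim=-1)
    mixed = (1.0 - lm_weight) * nmt_prob + lm_weight * lm_prob
    return mixed.argmax(dim=-1)  # greedy choice of the next target token
```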
+ +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. + +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:53.199 +Case so the actual word that we transferred +into a numerical lecture, and this is an input + +0:21:53.199 --> 0:21:54.838 +into the. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. 
+ +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? + +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. 
+ +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. + +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. 
+ +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. + +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. 
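[Editor's note] A minimal sketch of the back-translation recipe described above, for building a German-to-English system: English monolingual data is translated into German with a reverse (English-to-German) system, so the noisy machine output only ever appears on the source side, while the human-written English stays on the target side. The `en_de_model.translate()` method and the mixing helper are hypothetical names, not from the lecture; the 1:1 mixing ratio is the rule of thumb mentioned above.

```python
import random

def back_translate(en_monolingual, en_de_model):
    """Create synthetic German->English pairs from English monolingual text."""
    synthetic = []
    for en_sentence in en_monolingual:
        de_synthetic = en_de_model.translate(en_sentence)  # may be imperfect
        synthetic.append((de_synthetic, en_sentence))       # (source, target)
    return synthetic

def build_training_data(real_parallel, synthetic, ratio=1.0):
    """Mix real and synthetic pairs; equal amounts is a common starting point."""
    n = int(len(real_parallel) * ratio)
    mixed = real_parallel + random.sample(synthetic, min(n, len(synthetic)))
    random.shuffle(mixed)
    return mixed
```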
+ +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. + +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. 
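[Editor's note] A sketch of the iterated back-translation loop just discussed, under the assumption of a hypothetical `train(pairs)` helper that returns a model with a `.translate()` method. Each round retrains one direction on the real parallel data plus the freshest synthetic data produced by the other direction; as noted above, the gains usually flatten after one or two rounds.

```python
def iterative_back_translation(parallel, mono_de, mono_en, train, rounds=2):
    """Alternate back-translation between the two directions.

    `parallel` holds (de, en) pairs; mono_de / mono_en are monolingual
    sentence lists; `train` is any routine that fits a translation model.
    """
    de_en = train(parallel)                                  # German->English
    en_de = train([(en, de) for de, en in parallel])         # English->German
    for _ in range(rounds):
        # Synthetic data for German->English: back-translate English mono data.
        synth_de_en = [(en_de.translate(en), en) for en in mono_en]
        de_en = train(parallel + synth_de_en)
        # Synthetic data for English->German: back-translate German mono data.
        synth_en_de = [(de_en.translate(de), de) for de in mono_de]
        en_de = train([(en, de) for de, en in parallel] + synth_en_de)
    return de_en, en_de
```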
+ +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. + +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. 
+ +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. + +0:41:37.877 --> 0:41:53.924 +So you can build a system which is knowledge, +which is just trained on large amounts of data. + +0:41:56.376 --> 0:42:01.564 +So the question is maybe what type of information +so what type of models can you? + +0:42:01.821 --> 0:42:05.277 +And we want today to look at briefly at swings. + +0:42:05.725 --> 0:42:08.704 +First, that was what was initially done. + +0:42:08.704 --> 0:42:15.314 +It wasn't as famous as in machine translation +as in other things, but it's also used there + +0:42:15.314 --> 0:42:21.053 +and that is to use static word embedding, so +just the first step we know here. + +0:42:21.221 --> 0:42:28.981 +So we have this mapping from the one hot to +a small continuous word representation. + +0:42:29.229 --> 0:42:38.276 +Using this one in your NG system, so you can, +for example, replace the embedding layer by + +0:42:38.276 --> 0:42:38.779 +the. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.517 +And we're always in this pre-training phase +and have the thing the advantage is. + +0:42:48.468 --> 0:42:52.411 +More data than the trade off, so you can get +better. + +0:42:52.411 --> 0:42:59.107 +The disadvantage is, does anybody have an +idea of what might be the disadvantage of using + +0:42:59.107 --> 0:43:00.074 +things like. + +0:43:04.624 --> 0:43:12.175 +What was one mentioned today giving like big +advantage of the system compared to previous. + +0:43:20.660 --> 0:43:25.134 +Where one advantage was the enter end training, +so you have the enter end training so that + +0:43:25.134 --> 0:43:27.937 +all parameters and all components play optimal +together. + +0:43:28.208 --> 0:43:33.076 +If you know pre-train something on one fast, +it may be no longer optimal fitting to everything + +0:43:33.076 --> 0:43:33.384 +else. + +0:43:33.893 --> 0:43:37.862 +So what do pretending or not? + +0:43:37.862 --> 0:43:48.180 +It depends on how important everything is +optimal together and how important. + +0:43:48.388 --> 0:43:50.454 +Of large amount. + +0:43:50.454 --> 0:44:00.541 +The pre-change one is so much better that +it's helpful, and the advantage of that. + +0:44:00.600 --> 0:44:11.211 +Getting everything optimal together, yes, +we would use random instructions for raising. + +0:44:11.691 --> 0:44:26.437 +The problem is you might be already in some +area where it's not easy to get. + +0:44:26.766 --> 0:44:35.329 +But often in some way right, so often it's +not about your really worse pre trained monolepsy. + +0:44:35.329 --> 0:44:43.254 +If you're going already in some direction, +and if this is not really optimal for you,. + +0:44:43.603 --> 0:44:52.450 +But if you're not really getting better because +you have a decent amount of data, it's so different + +0:44:52.450 --> 0:44:52.981 +that. + +0:44:53.153 --> 0:44:59.505 +Initially it wasn't a machine translation +done so much because there are more data in + +0:44:59.505 --> 0:45:06.153 +MPs than in other tasks, but now with really +large amounts of monolingual data we do some + +0:45:06.153 --> 0:45:09.403 +type of pretraining in currently all state. + +0:45:12.632 --> 0:45:14.302 +The other one is okay now. + +0:45:14.302 --> 0:45:18.260 +It's always like how much of the model do +you plea track a bit? + +0:45:18.658 --> 0:45:22.386 +To the other one you can do contextural word +embedded. 
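[Editor's note] Before the lecture moves on to contextual embeddings, a minimal PyTorch sketch of the static-embedding idea just described: initialize the NMT embedding layer with vectors pre-trained on large monolingual data, optionally freezing them. The `pretrained` mapping is an assumption of the sketch (word index to vector), not an interface from the lecture.

```python
import torch
import torch.nn as nn

def load_pretrained_embeddings(vocab_size, emb_dim, pretrained, freeze=False):
    """Build an embedding layer and overwrite rows with pre-trained vectors.

    `pretrained` maps word indices to vectors of length emb_dim (e.g. from
    word2vec); words not covered keep their random initialization.
    """
    emb = nn.Embedding(vocab_size, emb_dim)
    with torch.no_grad():
        for idx, vector in pretrained.items():
            emb.weight[idx] = torch.tensor(vector)
    emb.weight.requires_grad = not freeze   # freeze to keep the pre-trained knowledge
    return emb
```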
+ +0:45:22.386 --> 0:45:28.351 +That is something like bird or Roberta where +you train already a sequence model and the + +0:45:28.351 --> 0:45:34.654 +embeddings you're using are no longer specific +for word but they are also taking the context + +0:45:34.654 --> 0:45:35.603 +into account. + +0:45:35.875 --> 0:45:50.088 +The embedding you're using is no longer depending +on the word itself but on the whole sentence, + +0:45:50.088 --> 0:45:54.382 +so you can use this context. + +0:45:55.415 --> 0:46:02.691 +You can use similar things also in the decoder +just by having layers which don't have access + +0:46:02.691 --> 0:46:12.430 +to the source, but there it still might have +and these are typically models like: And finally + +0:46:12.430 --> 0:46:14.634 +they will look at the end. + +0:46:14.634 --> 0:46:19.040 +You can also have models which are already +sequenced. + +0:46:19.419 --> 0:46:28.561 +So you may be training a sequence to sequence +models. + +0:46:28.561 --> 0:46:35.164 +You have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you'll find tuning. + +0:46:47.227 --> 0:46:59.614 +But let's first do a bit of step back and +look into what are the different things. + +0:46:59.614 --> 0:47:02.151 +The first thing. + +0:47:02.382 --> 0:47:11.063 +The wooden bettings are just this first layer +and you can train them with feedback annual + +0:47:11.063 --> 0:47:12.028 +networks. + +0:47:12.212 --> 0:47:22.761 +But you can also train them with an N language +model, and by now you hopefully have also seen + +0:47:22.761 --> 0:47:27.699 +that you cannot transform a language model. + +0:47:30.130 --> 0:47:37.875 +So this is how you can train them and you're +training them. + +0:47:37.875 --> 0:47:45.234 +For example, to speak the next word that is +the easiest. + +0:47:45.525 --> 0:47:55.234 +And that is what is now referred to as South +Supervised Learning and, for example, all the + +0:47:55.234 --> 0:48:00.675 +big large language models like Chad GPT and +so on. + +0:48:00.675 --> 0:48:03.129 +They are trained with. + +0:48:03.823 --> 0:48:15.812 +So that is where you can hopefully learn how +a word is used because you always try to previct + +0:48:15.812 --> 0:48:17.725 +the next word. + +0:48:19.619 --> 0:48:27.281 +Word embedding: Why do you keep the first +look at the word embeddings and the use of + +0:48:27.281 --> 0:48:29.985 +word embeddings for our task? + +0:48:29.985 --> 0:48:38.007 +The main advantage was it might be only the +first layer where you typically have most of + +0:48:38.007 --> 0:48:39.449 +the parameters. + +0:48:39.879 --> 0:48:57.017 +Most of your parameters already on the large +data, then on your target data you have to + +0:48:57.017 --> 0:48:59.353 +train less. + +0:48:59.259 --> 0:49:06.527 +Big difference that your input size is so +much bigger than the size of the novel in size. + +0:49:06.626 --> 0:49:17.709 +So it's a normally sign, maybe like, but your +input and banning size is something like. + +0:49:17.709 --> 0:49:20.606 +Then here you have to. + +0:49:23.123 --> 0:49:30.160 +While here you see it's only like zero point +five times as much in the layer. + +0:49:30.750 --> 0:49:36.534 +So here is where most of your parameters are, +which means if you already replace the word + +0:49:36.534 --> 0:49:41.739 +embeddings, they might look a bit small in +your overall and in key architecture. 
+ +0:49:41.739 --> 0:49:47.395 +It's where most of the things are, and if +you're doing that you already have really big + +0:49:47.395 --> 0:49:48.873 +games and can do that. + +0:49:57.637 --> 0:50:01.249 +The thing is we have seen these were the bettings. + +0:50:01.249 --> 0:50:04.295 +They can be very good use for other types. + +0:50:04.784 --> 0:50:08.994 +You learn some general relations between words. + +0:50:08.994 --> 0:50:17.454 +If you're doing this type of language modeling +cast, you predict: The one thing is you have + +0:50:17.454 --> 0:50:24.084 +a lot of data, so the one question is we want +to have data to trade a model. + +0:50:24.084 --> 0:50:28.734 +The other thing, the tasks need to be somehow +useful. + +0:50:29.169 --> 0:50:43.547 +If you would predict the first letter of the +word, then you wouldn't learn anything about + +0:50:43.547 --> 0:50:45.144 +the word. + +0:50:45.545 --> 0:50:53.683 +And the interesting thing is people have looked +at these wood embeddings. + +0:50:53.954 --> 0:50:58.550 +And looking at the word embeddings. + +0:50:58.550 --> 0:51:09.276 +You can ask yourself how they look and visualize +them by doing dimension reduction. + +0:51:09.489 --> 0:51:13.236 +Don't know if you and you are listening to +artificial intelligence. + +0:51:13.236 --> 0:51:15.110 +Advanced artificial intelligence. + +0:51:15.515 --> 0:51:23.217 +We had on yesterday there how to do this type +of representation, but you can do this time + +0:51:23.217 --> 0:51:29.635 +of representation, and now you're seeing interesting +things that normally. + +0:51:30.810 --> 0:51:41.027 +Now you can represent a here in a three dimensional +space with some dimension reduction. + +0:51:41.027 --> 0:51:46.881 +For example, the relation between male and +female. + +0:51:47.447 --> 0:51:56.625 +So this vector between the male and female +version of something is always not the same, + +0:51:56.625 --> 0:51:58.502 +but it's related. + +0:51:58.718 --> 0:52:14.522 +So you can do a bit of maths, so you do take +king, you subtract this vector, add this vector. + +0:52:14.894 --> 0:52:17.591 +So that means okay, there is really something +stored. + +0:52:17.591 --> 0:52:19.689 +Some information are stored in that book. + +0:52:20.040 --> 0:52:22.621 +Similar, you can do it with Bob Hansen. + +0:52:22.621 --> 0:52:25.009 +See here swimming slam walking walk. + +0:52:25.265 --> 0:52:34.620 +So again these vectors are not the same, but +they are related. + +0:52:34.620 --> 0:52:42.490 +So you learn something from going from here +to here. + +0:52:43.623 --> 0:52:49.761 +Or semantically, the relations between city +and capital have exactly the same sense. + +0:52:51.191 --> 0:52:56.854 +And people had even done that question answering +about that if they showed the diembeddings + +0:52:56.854 --> 0:52:57.839 +and the end of. + +0:52:58.218 --> 0:53:06.711 +All you can also do is don't trust the dimensions +of the reaction because maybe there is something. + +0:53:06.967 --> 0:53:16.863 +You can also look into what happens really +in the individual space. + +0:53:16.863 --> 0:53:22.247 +What is the nearest neighbor of the. + +0:53:22.482 --> 0:53:29.608 +So you can take the relationship between France +and Paris and add it to Italy and you'll. + +0:53:30.010 --> 0:53:33.078 +You can do big and bigger and you have small +and smaller and stuff. + +0:53:33.593 --> 0:53:49.417 +Because it doesn't work everywhere, there +is also some typical dish here in German. 
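[Editor's note] More analogy examples follow; as a small aside, the vector arithmetic just described can be written down directly. A numpy sketch, assuming `vectors` maps words to arrays; nearest neighbours are ranked by cosine similarity and the query words themselves are excluded.

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Return words x such that a is to b as c is to x,
    e.g. analogy('man', 'king', 'woman', vectors) -> ['queen'] (ideally)."""
    query = vectors[b] - vectors[a] + vectors[c]
    query = query / np.linalg.norm(query)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue                          # exclude the query words themselves
        scores[word] = np.dot(query, vec) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)[:topn]
```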
+ +0:53:51.491 --> 0:54:01.677 +You can do what the person is doing for famous +ones, of course only like Einstein scientists + +0:54:01.677 --> 0:54:06.716 +that find midfielders not completely correct. + +0:54:06.846 --> 0:54:10.134 +You see the examples are a bit old. + +0:54:10.134 --> 0:54:15.066 +The politicians are no longer they am, but +of course. + +0:54:16.957 --> 0:54:26.759 +What people have done there, especially at +the beginning training our end language model, + +0:54:26.759 --> 0:54:28.937 +was very expensive. + +0:54:29.309 --> 0:54:38.031 +So one famous model was, but we are not really +interested in the language model performance. + +0:54:38.338 --> 0:54:40.581 +Think something good to keep in mind. + +0:54:40.581 --> 0:54:42.587 +What are we really interested in? + +0:54:42.587 --> 0:54:45.007 +Do we really want to have an R&N no? + +0:54:45.007 --> 0:54:48.607 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.500 +And so successful and very successful was +this word to vet. + +0:54:55.535 --> 0:54:56.865 +The idea is okay. + +0:54:56.865 --> 0:55:03.592 +We are not training real language one, making +it even simpler and doing this, for example, + +0:55:03.592 --> 0:55:05.513 +continuous peck of words. + +0:55:05.513 --> 0:55:12.313 +We're just having four input tokens and we're +predicting what is the word in the middle and + +0:55:12.313 --> 0:55:15.048 +this is just like two linear layers. + +0:55:15.615 --> 0:55:21.627 +So it's even simplifying things and making +the calculation faster because that is what + +0:55:21.627 --> 0:55:22.871 +we're interested. + +0:55:23.263 --> 0:55:32.897 +All this continuous skip ground models with +these other models which refer to as where + +0:55:32.897 --> 0:55:34.004 +to where. + +0:55:34.234 --> 0:55:42.394 +Where you have one equal word and the other +way around, you're predicting the four words + +0:55:42.394 --> 0:55:43.585 +around them. + +0:55:43.585 --> 0:55:45.327 +It's very similar. + +0:55:45.327 --> 0:55:48.720 +The task is in the end very similar. + +0:55:51.131 --> 0:56:01.407 +Before we are going to the next point, anything +about normal weight vectors or weight embedding. + +0:56:04.564 --> 0:56:07.794 +The next thing is contexture. + +0:56:07.794 --> 0:56:12.208 +Word embeddings and the idea is helpful. + +0:56:12.208 --> 0:56:19.206 +However, we might even be able to get more +from one lingo layer. + +0:56:19.419 --> 0:56:31.732 +And now in the word that is overlap of these +two meanings, so it represents both the meaning + +0:56:31.732 --> 0:56:33.585 +of can do it. + +0:56:34.834 --> 0:56:40.410 +But we might be able to in the pre-trained +model already disambiguate this because they + +0:56:40.410 --> 0:56:41.044 +are used. + +0:56:41.701 --> 0:56:53.331 +So if we can have a model which can not only +represent a word but can also represent the + +0:56:53.331 --> 0:56:58.689 +meaning of the word within the context,. + +0:56:59.139 --> 0:57:03.769 +So then we are going to context your word +embeddings. + +0:57:03.769 --> 0:57:07.713 +We are really having a representation in the. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:23.791 +The hidden state represents what is currently +said, but it's focusing on what is the last + +0:57:23.791 --> 0:57:29.303 +one, so it's some of the representation. 
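[Editor's note] A minimal PyTorch sketch of the continuous bag-of-words setup just described: essentially two linear layers, averaging the context word vectors and predicting the centre word. Window size and dimensions are illustrative; training minimises cross-entropy against the actual centre word, and afterwards the embedding matrix holds the word vectors one is actually interested in.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """word2vec-style CBOW: predict the centre word from its context words."""

    def __init__(self, vocab_size, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # input projection
        self.out = nn.Linear(emb_dim, vocab_size)        # output projection

    def forward(self, context_ids):
        # context_ids: [batch, window] indices of the surrounding words
        ctx = self.embed(context_ids).mean(dim=1)        # average context vectors
        return self.out(ctx)                             # scores for the centre word
```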
+ +0:57:29.509 --> 0:57:43.758 +The first one doing that is something like +the Elmo paper where they instead of this is + +0:57:43.758 --> 0:57:48.129 +the normal language model. + +0:57:48.008 --> 0:57:50.714 +Within the third, predicting the fourth, and +so on. + +0:57:50.714 --> 0:57:53.004 +So you are always predicting the next work. + +0:57:53.193 --> 0:57:57.335 +The architecture is the heaven words embedding +layer and then layers. + +0:57:57.335 --> 0:58:03.901 +See you, for example: And now instead of using +this one in the end, you're using here this + +0:58:03.901 --> 0:58:04.254 +one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:18.610 +We can train it in a language model style +always predicting the next word, but we have + +0:58:18.610 --> 0:58:21.088 +more information trained there. + +0:58:21.088 --> 0:58:26.123 +Therefore, in the system it has to learn less +additional things. + +0:58:27.167 --> 0:58:31.261 +And there is one Edendang which is done currently +in GPS. + +0:58:31.261 --> 0:58:38.319 +The only difference is that we have more layers, +bigger size, and we're using transformer neurocell + +0:58:38.319 --> 0:58:40.437 +potential instead of the RNA. + +0:58:40.437 --> 0:58:45.095 +But that is how you train like some large +language models at the. + +0:58:46.746 --> 0:58:55.044 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:55.044 --> 0:59:02.942 +So if you think of this one as a contextual +representation of the third word,. + +0:59:07.587 --> 0:59:16.686 +Is representing a three in the context of +a sentence, however only in the context of + +0:59:16.686 --> 0:59:18.185 +the previous. + +0:59:18.558 --> 0:59:27.413 +However, we have an architecture which can +also take both sides and we have used that + +0:59:27.413 --> 0:59:30.193 +already in the ink holder. + +0:59:30.630 --> 0:59:34.264 +So we could do the iron easily on your, also +in the backward direction. + +0:59:34.874 --> 0:59:42.826 +By just having the states the other way around +and then we couldn't combine the forward and + +0:59:42.826 --> 0:59:49.135 +the forward into a joint one where we are doing +this type of prediction. + +0:59:49.329 --> 0:59:50.858 +So you have the word embedding. + +0:59:51.011 --> 1:00:02.095 +Then you have two in the states, one on the +forward arm and one on the backward arm, and + +1:00:02.095 --> 1:00:10.314 +then you can, for example, take the cocagenation +of both of them. + +1:00:10.490 --> 1:00:23.257 +Now this same here represents mainly this +word because this is what both puts in it last + +1:00:23.257 --> 1:00:30.573 +and we know is focusing on what is happening +last. + +1:00:31.731 --> 1:00:40.469 +However, there is a bit of difference when +training that as a language model you already + +1:00:40.469 --> 1:00:41.059 +have. + +1:00:43.203 --> 1:00:44.956 +Maybe There's Again This Masking. + +1:00:46.546 --> 1:00:47.748 +That is one solution. + +1:00:47.748 --> 1:00:52.995 +First of all, why we can't do it is the information +you leak it, so you cannot just predict the + +1:00:52.995 --> 1:00:53.596 +next word. + +1:00:53.596 --> 1:00:58.132 +If we just predict the next word in this type +of model, that's a very simple task. + +1:00:58.738 --> 1:01:09.581 +You know the next word because it's influencing +this hidden state predicting something is not + +1:01:09.581 --> 1:01:11.081 +a good task. 
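[Editor's note] The next part of the lecture introduces the ELMo-style version of this idea: run a second RNN right to left and concatenate the forward and backward hidden states at every position. As a preview, a minimal PyTorch sketch of that combination; layer sizes are illustrative and this shows only the representation side, not the training objective discussed below.

```python
import torch
import torch.nn as nn

class ContextualEmbedder(nn.Module):
    """Concatenate forward and backward RNN states per position."""

    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):
        # token_ids: [batch, seq_len]
        x = self.embed(token_ids)
        h_fwd, _ = self.fwd(x)                        # left-to-right states
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))  # right-to-left states
        h_bwd = torch.flip(h_bwd, dims=[1])           # realign to token positions
        return torch.cat([h_fwd, h_bwd], dim=-1)      # [batch, seq, 2*hidden]
```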
+ +1:01:11.081 --> 1:01:18.455 +You have to define: Because in this case what +will end with the system will just ignore these + +1:01:18.455 --> 1:01:22.966 +estates and what will learn is copy this information +directly in here. + +1:01:23.343 --> 1:01:31.218 +So it would be representing this word and +you would have nearly a perfect model because + +1:01:31.218 --> 1:01:38.287 +you only need to find encoding where you can +encode all words somehow in this. + +1:01:38.458 --> 1:01:44.050 +The only thing can learn is that turn and +encode all my words in this upper hidden. + +1:01:44.985 --> 1:01:53.779 +Therefore, it's not really useful, so we need +to find a bit of different ways out. + +1:01:55.295 --> 1:01:57.090 +There is a masking one. + +1:01:57.090 --> 1:02:03.747 +I'll come to that shortly just a bit that +other things also have been done, so the other + +1:02:03.747 --> 1:02:06.664 +thing is not to directly combine them. + +1:02:06.664 --> 1:02:13.546 +That was in the animal paper, so you have +them forward R&M and you keep them completely + +1:02:13.546 --> 1:02:14.369 +separated. + +1:02:14.594 --> 1:02:20.458 +So you never merged to state. + +1:02:20.458 --> 1:02:33.749 +At the end, the representation of the word +is now from the forward. + +1:02:33.873 --> 1:02:35.953 +So it's always the hidden state before the +good thing. + +1:02:36.696 --> 1:02:41.286 +These two you join now to your to the representation. + +1:02:42.022 --> 1:02:48.685 +And then you have now a representation also +about like the whole sentence for the word, + +1:02:48.685 --> 1:02:51.486 +but there is no information leakage. + +1:02:51.486 --> 1:02:58.149 +One way of doing this is instead of doing +a bidirection along you do a forward pass and + +1:02:58.149 --> 1:02:59.815 +then join the hidden. + +1:03:00.380 --> 1:03:05.960 +So you can do that in all layers. + +1:03:05.960 --> 1:03:16.300 +In the end you do the forwarded layers and +you get the hidden. + +1:03:16.596 --> 1:03:19.845 +However, it's a bit of a complicated. + +1:03:19.845 --> 1:03:25.230 +You have to keep both separate and merge things +so can you do. + +1:03:27.968 --> 1:03:33.030 +And that is the moment where like the big. + +1:03:34.894 --> 1:03:39.970 +The big success of the burnt model was used +where it okay. + +1:03:39.970 --> 1:03:47.281 +Maybe in bite and rich case it's not good +to do the next word prediction, but we can + +1:03:47.281 --> 1:03:48.314 +do masking. + +1:03:48.308 --> 1:03:56.019 +Masking mainly means we do a prediction of +something in the middle or some words. + +1:03:56.019 --> 1:04:04.388 +So the idea is if we have the input, we are +putting noise into the input, removing them, + +1:04:04.388 --> 1:04:07.961 +and then the model we are interested. + +1:04:08.048 --> 1:04:15.327 +Now there can be no information leakage because +this wasn't predicting that one is a big challenge. + +1:04:16.776 --> 1:04:19.957 +Do any assumption about our model? + +1:04:19.957 --> 1:04:26.410 +It doesn't need to be a forward model or a +backward model or anything. + +1:04:26.410 --> 1:04:29.500 +You can always predict the three. + +1:04:30.530 --> 1:04:34.844 +There's maybe one bit of a disadvantage. + +1:04:34.844 --> 1:04:40.105 +Do you see what could be a bit of a problem +this? + +1:05:00.000 --> 1:05:06.429 +Yes, so yeah, you can of course mask more, +but to see it more globally, just first assume + +1:05:06.429 --> 1:05:08.143 +you're only masked one. 
+ +1:05:08.143 --> 1:05:13.930 +For the whole sentence, we get one feedback +signal, like what is the word three. + +1:05:13.930 --> 1:05:22.882 +So we have one training example: If you do +the language modeling taste, we predicted here, + +1:05:22.882 --> 1:05:24.679 +we predicted here. + +1:05:25.005 --> 1:05:26.735 +So we have number of tokens. + +1:05:26.735 --> 1:05:30.970 +For each token we have a feet pad and say +what is the best correction. + +1:05:31.211 --> 1:05:43.300 +So in this case this is less efficient because +we are getting less feedback signals on what + +1:05:43.300 --> 1:05:45.797 +we should predict. + +1:05:48.348 --> 1:05:56.373 +So and bird, the main ideas are that you're +doing this bidirectional model with masking. + +1:05:56.373 --> 1:05:59.709 +It's using transformer architecture. + +1:06:00.320 --> 1:06:06.326 +There are two more minor changes. + +1:06:06.326 --> 1:06:16.573 +We'll see that this next word prediction is +another task. + +1:06:16.957 --> 1:06:30.394 +You want to learn more about what language +is to really understand following a story or + +1:06:30.394 --> 1:06:35.127 +their independent tokens into. + +1:06:38.158 --> 1:06:42.723 +The input is using word units as we use it. + +1:06:42.723 --> 1:06:50.193 +It has some special token that is framing +for the next word prediction. + +1:06:50.470 --> 1:07:04.075 +It's more for classification task because +you may be learning a general representation + +1:07:04.075 --> 1:07:07.203 +as a full sentence. + +1:07:07.607 --> 1:07:19.290 +You're doing segment embedding, so you have +an embedding for it. + +1:07:19.290 --> 1:07:24.323 +This is the first sentence. + +1:07:24.684 --> 1:07:29.099 +Now what is more challenging is this masking. + +1:07:29.099 --> 1:07:30.827 +What do you mask? + +1:07:30.827 --> 1:07:35.050 +We already have the crush enough or should. + +1:07:35.275 --> 1:07:42.836 +So there has been afterwards eating some work +like, for example, a bearer. + +1:07:42.836 --> 1:07:52.313 +It's not super sensitive, but if you do it +completely wrong then you're not letting anything. + +1:07:52.572 --> 1:07:54.590 +That's Then Another Question There. + +1:07:56.756 --> 1:08:04.594 +Should I mask all types of should I always +mask the footwork or if I have a subword to + +1:08:04.594 --> 1:08:10.630 +mask only like a subword and predict them based +on the other ones? + +1:08:10.630 --> 1:08:14.504 +Of course, it's a bit of a different task. + +1:08:14.894 --> 1:08:21.210 +If you know three parts of the words, it might +be easier to guess the last because they here + +1:08:21.210 --> 1:08:27.594 +took the easiest selection, so not considering +words anymore at all because you're doing that + +1:08:27.594 --> 1:08:32.280 +in the preprocessing and just taking always +words and like subwords. + +1:08:32.672 --> 1:08:36.089 +Think in group there is done differently. + +1:08:36.089 --> 1:08:40.401 +They mark always the full words, but guess +it's not. + +1:08:41.001 --> 1:08:46.044 +And then what to do with the mask word in +eighty percent of the cases. + +1:08:46.044 --> 1:08:50.803 +If the word is masked, they replace it with +a special token thing. + +1:08:50.803 --> 1:08:57.197 +This is a mask token in ten percent they put +in some random other token in there, and ten + +1:08:57.197 --> 1:08:59.470 +percent they keep it on change. + +1:09:02.202 --> 1:09:10.846 +And then what you can do is also this next +word prediction. + +1:09:10.846 --> 1:09:14.880 +The man went to Mass Store. 
+ +1:09:14.880 --> 1:09:17.761 +He bought a gallon. + +1:09:18.418 --> 1:09:24.088 +So may you see you're joining them, you're +doing both masks and prediction that you're. + +1:09:24.564 --> 1:09:29.449 +Is a penguin mask or flyless birds. + +1:09:29.449 --> 1:09:41.390 +These two sentences have nothing to do with +each other, so you can do also this type of + +1:09:41.390 --> 1:09:43.018 +prediction. + +1:09:47.127 --> 1:09:57.043 +And then the whole bird model, so here you +have the input here to transform the layers, + +1:09:57.043 --> 1:09:58.170 +and then. + +1:09:58.598 --> 1:10:17.731 +And this model was quite successful in general +applications. + +1:10:17.937 --> 1:10:27.644 +However, there is like a huge thing of different +types of models coming from them. + +1:10:27.827 --> 1:10:38.709 +So based on others these supervised molds +like a whole setup came out of there and now + +1:10:38.709 --> 1:10:42.086 +this is getting even more. + +1:10:42.082 --> 1:10:46.640 +With availability of a large language model +than the success. + +1:10:47.007 --> 1:10:48.436 +We have now even larger ones. + +1:10:48.828 --> 1:10:50.961 +Interestingly, it goes a bit. + +1:10:50.910 --> 1:10:57.847 +Change the bit again from like more the spider +action model to uni directional models. + +1:10:57.847 --> 1:11:02.710 +Are at the moment maybe a bit more we're coming +to them now? + +1:11:02.710 --> 1:11:09.168 +Do you see one advantage while what is another +event and we have the efficiency? + +1:11:09.509 --> 1:11:15.901 +Is one other reason why you are sometimes +more interested in uni-direction models than + +1:11:15.901 --> 1:11:17.150 +in bi-direction. + +1:11:22.882 --> 1:11:30.220 +It depends on the pass, but for example for +a language generation pass, the eccard is not + +1:11:30.220 --> 1:11:30.872 +really. + +1:11:32.192 --> 1:11:40.924 +It doesn't work so if you want to do a generation +like the decoder you don't know the future + +1:11:40.924 --> 1:11:42.896 +so you cannot apply. + +1:11:43.223 --> 1:11:53.870 +So this time of model can be used for the +encoder in an encoder model, but it cannot + +1:11:53.870 --> 1:11:57.002 +be used for the decoder. + +1:12:00.000 --> 1:12:05.012 +That's a good view to the next overall cast +of models. + +1:12:05.012 --> 1:12:08.839 +Perhaps if you view it from the sequence. + +1:12:09.009 --> 1:12:12.761 +We have the encoder base model. + +1:12:12.761 --> 1:12:16.161 +That's what we just look at. + +1:12:16.161 --> 1:12:20.617 +They are bidirectional and typically. + +1:12:20.981 --> 1:12:22.347 +That Is the One We Looked At. + +1:12:22.742 --> 1:12:34.634 +At the beginning is the decoder based model, +so see out in regressive models which are unidirective + +1:12:34.634 --> 1:12:42.601 +like an based model, and there we can do the +next word prediction. + +1:12:43.403 --> 1:12:52.439 +And what you can also do first, and there +you can also have a special things called prefix + +1:12:52.439 --> 1:12:53.432 +language. + +1:12:54.354 --> 1:13:05.039 +Because we are saying it might be helpful +that some of your input can also use bi-direction. + +1:13:05.285 --> 1:13:12.240 +And that is somehow doing what it is called +prefix length. + +1:13:12.240 --> 1:13:19.076 +On the first tokens you directly give your +bidirectional. + +1:13:19.219 --> 1:13:28.774 +So you somehow merge that and that mainly +works only in transformer based models because. + +1:13:29.629 --> 1:13:33.039 +There is no different number of parameters +in our end. 
+ +1:13:33.039 --> 1:13:34.836 +We need a back foot our end. + +1:13:34.975 --> 1:13:38.533 +Transformer: The only difference is how you +mask your attention. + +1:13:38.878 --> 1:13:44.918 +We have seen that in the anchoder and decoder +the number of parameters is different because + +1:13:44.918 --> 1:13:50.235 +you do cross attention, but if you do forward +and backward or union directions,. + +1:13:50.650 --> 1:13:58.736 +It's only like you mask your attention to +only look at the bad past or to look into the + +1:13:58.736 --> 1:13:59.471 +future. + +1:14:00.680 --> 1:14:03.326 +And now you can of course also do mixing. + +1:14:03.563 --> 1:14:08.306 +So this is a bi-directional attention matrix +where you can attend to everything. + +1:14:08.588 --> 1:14:23.516 +There is a uni-direction or causal where you +can look at the past and you can do the first + +1:14:23.516 --> 1:14:25.649 +three words. + +1:14:29.149 --> 1:14:42.831 +That somehow clear based on that, then of +course you cannot do the other things. + +1:14:43.163 --> 1:14:50.623 +So the idea is we have our anchor to decoder +architecture. + +1:14:50.623 --> 1:14:57.704 +Can we also train them completely in a side +supervisor? + +1:14:58.238 --> 1:15:09.980 +And in this case we have the same input to +both, so in this case we need to do some type + +1:15:09.980 --> 1:15:12.224 +of masking here. + +1:15:12.912 --> 1:15:17.696 +Here we don't need to do the masking, but +here we need to masking that doesn't know ever + +1:15:17.696 --> 1:15:17.911 +so. + +1:15:20.440 --> 1:15:30.269 +And this type of model got quite successful +also, especially for pre-training machine translation. + +1:15:30.330 --> 1:15:39.059 +The first model doing that is a Bart model, +which exactly does that, and yes, it's one + +1:15:39.059 --> 1:15:42.872 +successful way to pre train your one. + +1:15:42.872 --> 1:15:47.087 +It's pretraining your full encoder model. + +1:15:47.427 --> 1:15:54.365 +Where you put in contrast to machine translation, +where you put in source sentence, we can't + +1:15:54.365 --> 1:15:55.409 +do that here. + +1:15:55.715 --> 1:16:01.382 +But we can just put the second twice in there, +and then it's not a trivial task. + +1:16:01.382 --> 1:16:02.432 +We can change. + +1:16:03.003 --> 1:16:12.777 +And there is like they do different corruption +techniques so you can also do. + +1:16:13.233 --> 1:16:19.692 +That you couldn't do in an agricultural system +because then it wouldn't be there and you cannot + +1:16:19.692 --> 1:16:20.970 +predict somewhere. + +1:16:20.970 --> 1:16:26.353 +So the anchor, the number of input and output +tokens always has to be the same. + +1:16:26.906 --> 1:16:29.818 +You cannot do a prediction for something which +isn't in it. + +1:16:30.110 --> 1:16:38.268 +Here in the decoder side it's unidirection +so we can also delete the top and then try + +1:16:38.268 --> 1:16:40.355 +to generate the full. + +1:16:41.061 --> 1:16:45.250 +We can do sentence permutation. + +1:16:45.250 --> 1:16:54.285 +We can document rotation and text infilling +so there is quite a bit. + +1:16:55.615 --> 1:17:06.568 +So you see there's quite a lot of types of +models that you can use in order to pre-train. + +1:17:07.507 --> 1:17:14.985 +Then, of course, there is again for the language +one. + +1:17:14.985 --> 1:17:21.079 +The other question is how do you integrate? + +1:17:21.761 --> 1:17:26.636 +And there's also, like yeah, quite some different +ways of techniques. 
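To make the three attention patterns just discussed concrete, here is a small sketch with an assumed toy sequence length and prefix length; in a Transformer the parameters are identical in all three cases, and only the mask decides which positions may attend to which.

```python
import torch

T, prefix_len = 6, 3                              # toy sizes for illustration
ones = torch.ones(T, T)
bidirectional = ones.bool()                       # encoder / BERT style: attend everywhere
causal = torch.tril(ones).bool()                  # decoder / GPT style: only the past
prefix = causal.clone()
prefix[:, :prefix_len] = True                     # prefix LM: the first tokens are fully
                                                  # visible, the rest stays left-to-right
# True means "may attend"; the mask is applied inside the attention softmax.
```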
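And as a sketch of the corruption operations mentioned for this encoder-decoder pre-training (simplified: the real BART works on subwords and samples whole spans, so treat this purely as an illustration of the idea):

```python
import random

def token_mask(tokens, p=0.15, mask="<mask>"):
    return [mask if random.random() < p else t for t in tokens]

def token_delete(tokens, p=0.15):
    # only possible with a decoder: the output may be longer than the corrupted input
    return [t for t in tokens if random.random() > p]

def sentence_permutation(sentences):
    shuffled = list(sentences)
    random.shuffle(shuffled)      # the model must restore the original order
    return shuffled

# The encoder reads the corrupted text; the autoregressive decoder is trained
# to regenerate the original, uncorrupted text.
```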
+ +1:17:27.007 --> 1:17:28.684 +It's a Bit Similar to Before. + +1:17:28.928 --> 1:17:39.068 +So the easiest thing is you take your word +embeddings or your free trained model. + +1:17:39.068 --> 1:17:47.971 +You freeze them and stack your decoder layers +and keep these ones free. + +1:17:48.748 --> 1:17:54.495 +Can also be done if you have this type of +bark model. + +1:17:54.495 --> 1:18:03.329 +What you can do is you freeze your word embeddings, +for example some products and. + +1:18:05.865 --> 1:18:17.296 +The other thing is you initialize them so +you initialize your models but you train everything + +1:18:17.296 --> 1:18:19.120 +so you're not. + +1:18:22.562 --> 1:18:29.986 +Then one thing, if you think about Bart, you +want to have the Chinese language, the Italian + +1:18:29.986 --> 1:18:32.165 +language, and the deconer. + +1:18:32.165 --> 1:18:35.716 +However, in Bart we have the same language. + +1:18:36.516 --> 1:18:46.010 +The one you get is from English, so what you +can do there is so you cannot try to do some. + +1:18:46.366 --> 1:18:52.562 +Below the barge, in order to learn some language +specific stuff, or there's a masculine barge, + +1:18:52.562 --> 1:18:58.823 +which is trained on many languages, but it's +trained only on like the Old Coast Modern Language + +1:18:58.823 --> 1:19:03.388 +House, which may be trained in German and English, +but not on German. + +1:19:03.923 --> 1:19:08.779 +So then you would still need to find June +and the model needs to learn how to better + +1:19:08.779 --> 1:19:10.721 +do the attention cross lingually. + +1:19:10.721 --> 1:19:15.748 +It's only on the same language but it mainly +only has to learn this mapping and not all + +1:19:15.748 --> 1:19:18.775 +the rest and that's why it's still quite successful. + +1:19:21.982 --> 1:19:27.492 +Now certain thing which is very commonly used +is what is required to it as adapters. + +1:19:27.607 --> 1:19:29.754 +So for example you take and buy. + +1:19:29.709 --> 1:19:35.218 +And you put some adapters on the inside of +the networks so that it's small new layers + +1:19:35.218 --> 1:19:40.790 +which are in between put in there and then +you only train these adapters or also train + +1:19:40.790 --> 1:19:41.815 +these adapters. + +1:19:41.815 --> 1:19:47.900 +For example, an embryo you could see that +this learns to map the Sears language representation + +1:19:47.900 --> 1:19:50.334 +to the Tiger language representation. + +1:19:50.470 --> 1:19:52.395 +And then you don't have to change that luck. + +1:19:52.792 --> 1:19:59.793 +You give it extra ability to really perform +well on that. + +1:19:59.793 --> 1:20:05.225 +These are quite small and so very efficient. + +1:20:05.905 --> 1:20:12.632 +That is also very commonly used, for example +in modular systems where you have some adaptors + +1:20:12.632 --> 1:20:16.248 +in between here which might be language specific. + +1:20:16.916 --> 1:20:22.247 +So they are trained only for one language. + +1:20:22.247 --> 1:20:33.777 +The model has some or both and once has the +ability to do multilingually to share knowledge. + +1:20:34.914 --> 1:20:39.058 +But there's one chance in general in the multilingual +systems. + +1:20:39.058 --> 1:20:40.439 +It works quite well. + +1:20:40.439 --> 1:20:46.161 +There's one case or one specific use case +for multilingual where this normally doesn't + +1:20:46.161 --> 1:20:47.344 +really work well. + +1:20:47.344 --> 1:20:49.975 +Do you have an idea what that could be? 
+ +1:20:55.996 --> 1:20:57.536 +It's for Zero Shot Cases. + +1:20:57.998 --> 1:21:03.660 +Because having here some situation with this +might be very language specific and zero shot, + +1:21:03.660 --> 1:21:09.015 +the idea is always to learn representations +view which are more language dependent and + +1:21:09.015 --> 1:21:10.184 +with the adaptors. + +1:21:10.184 --> 1:21:15.601 +Of course you get in representations again +which are more language specific and then it + +1:21:15.601 --> 1:21:17.078 +doesn't work that well. + +1:21:20.260 --> 1:21:37.730 +And there is also the idea of doing more knowledge +pistolation. + +1:21:39.179 --> 1:21:42.923 +And now the idea is okay. + +1:21:42.923 --> 1:21:54.157 +We are training it the same, but what we want +to achieve is that the encoder. + +1:21:54.414 --> 1:22:03.095 +So you should learn faster by trying to make +these states as similar as possible. + +1:22:03.095 --> 1:22:11.777 +So you compare the first-hit state of the +pre-trained model and try to make them. + +1:22:12.192 --> 1:22:18.144 +For example, by using the out two norms, so +by just making these two representations the + +1:22:18.144 --> 1:22:26.373 +same: The same vocabulary: Why does it need +the same vocabulary with any idea? + +1:22:34.754 --> 1:22:46.137 +If you have different vocabulary, it's typical +you also have different sequenced lengths here. + +1:22:46.137 --> 1:22:50.690 +The number of sequences is different. + +1:22:51.231 --> 1:22:58.888 +If you now have pipe stains and four states +here, it's no longer straightforward which + +1:22:58.888 --> 1:23:01.089 +states compare to which. + +1:23:02.322 --> 1:23:05.246 +And that's just easier if you have like the +same number. + +1:23:05.246 --> 1:23:08.940 +You can always compare the first to the first +and second to the second. + +1:23:09.709 --> 1:23:16.836 +So therefore at least the very easy way of +knowledge destination only works if you have. + +1:23:17.177 --> 1:23:30.030 +Course: You could do things like yeah, the +average should be the same, but of course there's + +1:23:30.030 --> 1:23:33.071 +a less strong signal. + +1:23:34.314 --> 1:23:42.979 +But the advantage here is that you have a +diameter training signal here on the handquarter + +1:23:42.979 --> 1:23:51.455 +so you can directly make some of the encoder +already giving a good signal while normally + +1:23:51.455 --> 1:23:52.407 +an empty. + +1:23:56.936 --> 1:24:13.197 +Yes, think this is most things for today, +so what you should keep in mind is remind me. + +1:24:13.393 --> 1:24:18.400 +The one is a back translation idea. + +1:24:18.400 --> 1:24:29.561 +If you have monolingual and use that, the +other one is to: And mentally it is often helpful + +1:24:29.561 --> 1:24:33.614 +to combine them so you can even use both of +that. + +1:24:33.853 --> 1:24:38.908 +So you can use pre-trained walls, but then +you can even still do back translation where + +1:24:38.908 --> 1:24:40.057 +it's still helpful. + +1:24:40.160 --> 1:24:45.502 +We have the advantage we are training like +everything working together on the task so + +1:24:45.502 --> 1:24:51.093 +it might be helpful even to backtranslate some +data and then use it in a real translation + +1:24:51.093 --> 1:24:56.683 +setup because in pretraining of course the +beach challenge is always that you're training + +1:24:56.683 --> 1:24:57.739 +it on different. + +1:24:58.058 --> 1:25:03.327 +Different ways of how you integrate this knowledge. 
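For the adapter layers mentioned a moment ago, a minimal sketch (assumed bottleneck design and sizes): a small down- and up-projection with a residual connection is inserted into the frozen pre-trained network, and only these few extra parameters are trained.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # residual connection: the adapter starts out close to the identity
        return x + self.up(self.act(self.down(x)))

# Typical use: freeze the large pre-trained model and train only the adapters, e.g.
# for p in pretrained_model.parameters():
#     p.requires_grad = False
```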
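And for the hidden-state knowledge distillation described above, a minimal sketch of the auxiliary loss (the weight and the names are assumptions): the encoder states of the MT model are pulled towards the states of a frozen pre-trained encoder with an L2 term, which only lines up position by position if both models use the same vocabulary and therefore produce the same number of states.

```python
import torch

def hidden_state_distillation(nmt_states, pretrained_states):
    # both tensors: (batch, seq_len, hidden); the pre-trained states are fixed targets
    return torch.mean((nmt_states - pretrained_states.detach()) ** 2)

# total_loss = translation_loss + distill_weight * hidden_state_distillation(enc_out, teacher_out)
```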
+
+1:25:03.327 --> 1:25:08.089
+Even if you just use a full model, so in this.
+
+1:25:08.748 --> 1:25:11.128
+This is the most similar you can get.
+
+1:25:11.128 --> 1:25:13.945
+You're doing no changes to the architecture.
+
+1:25:13.945 --> 1:25:19.643
+You're really taking the model and just fine tuning it on the new task, but it still has
+
+1:25:19.643 --> 1:25:24.026
+to completely newly learn how to do the attention and how to do that.
+
+1:25:24.464 --> 1:25:29.971
+And there it might be, for example, helpful to have more back-translated data to learn that.
+
+1:25:32.192 --> 1:25:34.251
+That's it for today.
+
+1:25:34.251 --> 1:25:44.661
+There's one important thing: next Tuesday there is a conference or a workshop or so in
+
+1:25:44.661 --> 1:25:45.920
+this room.
+
+1:25:47.127 --> 1:25:56.769
+You should get an e-mail if you're in ILIAS that there's a room change for Tuesdays and
+
+1:25:56.769 --> 1:25:57.426
+it's.
+
+1:25:57.637 --> 1:26:03.890
+Are there more questions? Yeah, I have a more general question, especially: in computer vision
+you can enlarge your data set with data augmentation.
+
+1:26:07.347 --> 1:26:08.295
+Is there anything
+
+1:26:08.388 --> 1:26:15.301
+similar to enlarge speech or text data, for data augmentation?
+
+1:26:15.755 --> 1:26:29.176
+And you can use this back translation and also masking, but back translation is some
+
+1:26:29.176 --> 1:26:31.228
+way of data augmentation.
+
+1:26:31.371 --> 1:26:35.629
+So it has also been, for example, even used not only for monolingual data.
+
+1:26:36.216 --> 1:26:54.060
+If you have a good MT system, it can also be used for parallel data.
+
+1:26:54.834 --> 1:26:59.139
+So I would say this is the most similar one.
+
+1:26:59.139 --> 1:27:03.143
+There are ways you can do paraphrasing.
+
+1:27:05.025 --> 1:27:12.057
+But for example it is very hard to do this by rules, like which words to replace, because
+
+1:27:12.057 --> 1:27:18.936
+there is not a good rule; you cannot always say this word can always be replaced by that.
+
+1:27:19.139 --> 1:27:27.225
+I mean, although there are many perfect synonyms, normally they are good in some cases, but not
+
+1:27:27.225 --> 1:27:29.399
+in all cases, and so on.
+
+1:27:29.399 --> 1:27:36.963
+And if you don't do it rule based, you have to train your model, and then the freshness.
+
+1:27:38.058 --> 1:27:57.236
+The same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:27:59.810
+Should be of the same dimension, so it's easiest to have the same dimension.
+
+1:28:00.000 --> 1:28:01.590
+Architecture.
+
+1:28:01.590 --> 1:28:05.452
+We will later learn in the efficiency lecture.
+
+1:28:05.452 --> 1:28:12.948
+You can also do knowledge distillation with, for example, smaller models.
+
+1:28:12.948 --> 1:28:16.469
+You can learn the same with, say,
+
+1:28:17.477 --> 1:28:22.949
+eight layers for it, so that is possible, but yeah, I agree it should be of the same.
+
+1:28:23.623 --> 1:28:32.486
+Yeah, yeah, to that question: of course you can do it like as an initialization, or
+
+1:28:32.486 --> 1:28:41.157
+you can do it during training, but normally it makes most sense during the normal training.
+
+1:28:45.865 --> 1:28:53.963
+Okay, then thanks a lot, and then we'll see each other again on Tuesday.
+
+0:06:59.559 --> 0:07:03.054
+We can do that in statistical machine translation.
+
+0:07:03.054 --> 0:07:06.755
+It was quite easy to integrate using language models.
+ +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. + +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. + +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. 
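Written out, the statistical combination referred to here is usually presented as the noisy-channel / log-linear model, with the translation model estimated on parallel data and the language model on monolingual data; this is the textbook form rather than a formula from the slides, and the interpolation weights are tuned.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
        \;\approx\; \arg\max_{e} \big(\lambda_{TM}\log P(f \mid e) + \lambda_{LM}\log P(e)\big)
```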
+ +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. + +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. + +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. 
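As a minimal sketch of the recurrent language model just described (toy sizes, PyTorch-style, illustrative names): the previous target words are embedded, a hidden state is updated step by step, and at every position a distribution over the next word is predicted.

```python
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, prev_tokens, state=None):        # prev_tokens: (batch, T)
        h, state = self.rnn(self.emb(prev_tokens), state)
        return self.proj(h), state                      # next-word logits at each step
```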
+ +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. + +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. 
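A minimal sketch of this parallel combination at decoding time, often called shallow fusion: both models score the next word independently and the scores are merged. The example above averages the two probability distributions; a weighted log-linear combination, shown here with an assumed interpolation weight, is a common variant.

```python
import torch

def fused_next_word_scores(nmt_logits, lm_logits, lm_weight=0.3):
    nmt_logprobs = torch.log_softmax(nmt_logits, dim=-1)
    lm_logprobs = torch.log_softmax(lm_logits, dim=-1)
    # the translation model stays dominant; the language model adds a tuned bonus
    return nmt_logprobs + lm_weight * lm_logprobs
```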
+ +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:54.854 +Case so the actual word that we transferred +into a numerical lecture, and this is an input. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. + +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? 
+ +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. + +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. 
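A minimal sketch of the freezing option just mentioned (the parameter-name prefix is purely illustrative): the pre-trained part is excluded from the gradient updates during fine-tuning, so the knowledge learned on the large data cannot be forgotten.

```python
def freeze_pretrained(model, prefix="pretrained_lm."):
    # keep the pre-trained block fixed, train only the remaining MT parameters
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False
```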
+ +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. + +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. 
+ +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. + +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. 
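Putting the back-translation recipe just described into one place, a sketch with hypothetical helper names (train_system and translate stand in for whatever toolkit is actually used); the important detail is that the human-written English always stays on the target side.

```python
def back_translation(parallel_de_en, mono_en):
    # 1) train a reverse system (English -> German) on the real parallel data
    reverse = train_system(src=[en for de, en in parallel_de_en],
                           tgt=[de for de, en in parallel_de_en])
    # 2) translate target-side monolingual English into (possibly noisy) German
    synthetic = [(translate(reverse, en), en) for en in mono_en]
    # 3) train the final German -> English system on real + synthetic pairs,
    #    typically with roughly equal amounts of both
    data = parallel_de_en + synthetic
    return train_system(src=[de for de, en in data], tgt=[en for de, en in data])
```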
+ +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. + +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. 
+ +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.854 +And then you're defining some type of task +in order to you do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. 
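The self-supervised setup described here can be made concrete with a tiny sketch: the training pairs are created from raw text alone, every prefix of a sentence is an input and the word that follows it is the label, so no manual annotation is needed.

```python
def next_word_examples(tokens):
    # each prefix of the sentence predicts the word that follows it
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# next_word_examples(["the", "cat", "sat"])
# -> [(["the"], "cat"), (["the", "cat"], "sat")]
```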
+
+0:41:56.376 --> 0:42:05.277
+So the question is what type of information, what type of models, can you use, and today we want to look briefly at three things.
+
+0:42:05.725 --> 0:42:21.072
+The first is what was initially done. It wasn't as prominent in machine translation as in other areas, but it is also used there, and that is static word embeddings.
+
+0:42:21.221 --> 0:42:28.981
+So we have this mapping from the one-hot vector to a small continuous word representation.
+
+0:42:29.229 --> 0:42:41.832
+You use this in your NMT system; you can, for example, replace the embedding layer by the pre-trained embeddings, and that is helpful if you only have a really small amount of parallel data.
+
+0:42:42.922 --> 0:43:00.074
+We are always in this pre-training setting, and the advantage is that you can train on more data and thus get better representations. What is the trade-off? Does anybody have an idea what might be the disadvantage of using something like this?
+
+0:43:04.624 --> 0:43:12.175
+What was mentioned today as a big advantage of this system compared to previous approaches?
+
+0:43:20.660 --> 0:43:27.937
+One advantage was the end-to-end training: all parameters and all components are trained to play together optimally.
+
+0:43:28.208 --> 0:43:33.384
+If you now pre-train something on one task, it may no longer fit optimally to everything else.
+
+0:43:33.893 --> 0:44:11.211
+So whether to pre-train or not depends on how important it is that everything is optimal together, and how much better the representations from the large pre-training data are. If we do not pre-train, we would just use random initialization instead.
+
+0:44:11.691 --> 0:44:43.254
+The problem with pre-training is that you might already be in some area of the parameter space from which it is not easy to get out. Often it is not that the pre-trained model is really bad, but you are already going in some direction, and if that direction is not optimal for your task, it can hurt.
+
+0:44:43.603 --> 0:44:52.981
+Or you might simply not get better, because you already have a decent amount of parallel data, or because the pre-training data is too different from your task.
+
+0:44:53.153 --> 0:45:09.403
+Initially this wasn't done so much in machine translation, because there is more data in MT than in many other tasks, but now, with really large amounts of monolingual data, currently all state-of-the-art systems do some type of pre-training.
+
+0:45:12.632 --> 0:45:18.260
+The other question is then always how much of the model you pre-train.
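+
+A minimal PyTorch sketch of this first option, plugging frozen pre-trained static embeddings into the NMT model; the random tensor simply stands in for real pre-trained vectors:
+
+```python
+import torch
+import torch.nn as nn
+
+vocab_size, emb_dim = 10000, 256
+pretrained_vectors = torch.randn(vocab_size, emb_dim)  # stand-in for e.g. word2vec vectors
+
+# replace the randomly initialised embedding layer by the pre-trained one;
+# freeze=True keeps it fixed, freeze=False would let fine-tuning update it
+embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
+
+token_ids = torch.tensor([[3, 17, 256]])
+word_vectors = embedding(token_ids)  # shape (1, 3, 256), fed into the rest of the model
+```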
+
+0:45:18.658 --> 0:45:35.603
+The second option is contextual word embeddings. That is something like BERT or RoBERTa, where you already train a sequence model, and the embeddings you use are no longer specific to a word but also take the context into account.
+
+0:45:35.875 --> 0:45:54.382
+The embedding you use no longer depends on the word alone but on the whole sentence, so you can make use of this context.
+
+0:45:55.415 --> 0:46:14.634
+You can use similar things also on the decoder side, by having layers which don't have access to the source; these are typically models like GPT. And finally, as the third option, you can have models which are already sequence-to-sequence.
+
+0:46:19.419 --> 0:46:43.445
+So you pre-train a sequence-to-sequence model; you have to make the task a bit challenging, but the idea is really that you pre-train your whole model and then fine-tune it.
+
+0:46:47.227 --> 0:47:02.151
+But let's first take a step back and look at the different options. The first one is the word embeddings.
+
+0:47:02.382 --> 0:47:12.028
+The word embeddings are just this first layer, and you can train them with feed-forward neural networks.
+
+0:47:12.212 --> 0:47:27.699
+But you can also train them with an RNN language model, and by now you have hopefully also seen that you can train a transformer language model.
+
+0:47:30.130 --> 0:47:45.234
+So this is how you can train them, for example by predicting the next word; that is the easiest task.
+
+0:47:45.525 --> 0:48:03.129
+And that is what is now referred to as self-supervised learning; for example, all the big large language models like ChatGPT and so on are trained with this objective.
+
+0:48:03.823 --> 0:48:17.725
+That is where the model can hopefully learn how a word is used, because you always try to predict the next word.
+
+0:48:19.619 --> 0:48:39.449
+Why do we first look at the word embeddings and their use for our task? The main advantage is that, although it is only the first layer, it is typically where you have most of the parameters.
+
+0:48:39.879 --> 0:48:59.353
+If most of your parameters are already trained on the large data, then on your target data you have to train much less.
+
+0:48:59.259 --> 0:49:30.160
+The big difference is that your input size, the vocabulary, is so much bigger than the hidden layer size: the embedding matrix is vocabulary size times embedding dimension, while a hidden layer is only on the order of the hidden dimension squared, which is far smaller.
+
+0:49:30.750 --> 0:49:48.915
+So the embedding layer is where most of your parameters are, which means that if you can already pre-train the word embeddings, what remains to be trained is comparatively small in your overall NMT architecture.
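+
+A quick back-of-the-envelope comparison of where the parameters sit; the vocabulary size and dimensions below are assumed for illustration and ignore biases and layer norms:
+
+```python
+vocab_size = 50_000          # assumed subword vocabulary
+d_model    = 512             # assumed embedding / hidden dimension
+d_ff       = 4 * d_model     # assumed feed-forward width
+
+embedding_params   = vocab_size * d_model            # 25,600,000
+attention_params   = 4 * d_model * d_model           # Q, K, V and output projections
+feedforward_params = 2 * d_model * d_ff              # the two linear layers
+one_layer_params   = attention_params + feedforward_params   # 3,145,728
+
+print(embedding_params, one_layer_params)
+```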
+
+0:49:57.637 --> 0:50:04.295
+The thing is, as we have seen, these word embeddings can be put to very good use for other tasks.
+
+0:50:04.784 --> 0:50:28.734
+You learn some general relations between words if you do this kind of language modeling task and predict the next word. One requirement is that you have a lot of data to train the model; the other is that the task needs to be somehow useful.
+
+0:50:29.169 --> 0:50:45.144
+If you would only predict the first letter of the next word, then you wouldn't learn much about the word itself.
+
+0:50:45.545 --> 0:51:09.276
+And the interesting thing is that people have looked at these word embeddings: you can ask yourself how they look and visualize them by doing dimensionality reduction.
+
+0:51:09.489 --> 0:51:29.635
+I don't know if you are attending Artificial Intelligence or Advanced Artificial Intelligence; we covered there yesterday how to do this type of dimensionality reduction. If you do it, you see interesting regularities.
+
+0:51:30.810 --> 0:51:46.881
+You can project the embeddings into a three-dimensional space with some dimensionality reduction and look, for example, at the relation between the male and female version of a word.
+
+0:51:47.447 --> 0:51:58.502
+The vector between the male and the female version of something is not always exactly the same, but it is clearly related.
+
+0:51:58.718 --> 0:52:19.689
+So you can do a bit of math: you take king, subtract this vector and add that vector, and you land near queen. That means there is really something stored; some information is encoded in these vectors.
+
+0:52:20.040 --> 0:52:42.490
+Similarly, you can do it with verb forms, for example swimming and swam, walking and walked. Again these vectors are not identical, but they are related, so you learn something from going from one to the other.
+
+0:52:43.623 --> 0:52:49.761
+Or, semantically, the relation between a country and its capital behaves in exactly the same way.
+
+0:52:51.191 --> 0:52:57.839
+People have even used these embeddings for analogy-style question answering.
+
+0:52:58.218 --> 0:53:06.711
+Of course, you shouldn't blindly trust the dimensionality-reduced picture, because the projection can distort things.
+
+0:53:06.967 --> 0:53:22.247
+You can also look at what really happens in the original high-dimensional space and check what the nearest neighbor of the resulting vector is.
+
+0:53:22.482 --> 0:53:33.078
+So you can take the relationship between France and Paris, add it to Italy, and you get Rome; you can do big and bigger, small and smaller, and so on.
+
+0:53:33.593 --> 0:53:49.417
+It doesn't work everywhere; there are also examples like the typical dish of a country, here for Germany.
+
+0:53:51.491 --> 0:54:06.716
+You can ask what a famous person is known for: Einstein maps to scientist, a famous footballer to midfielder, which is not always completely correct.
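+
+A tiny numpy sketch of this vector arithmetic; the 3-dimensional embeddings are hand-made toy values chosen so that the analogy comes out, purely to show the mechanics (a real system would use trained word2vec or GloVe vectors):
+
+```python
+import numpy as np
+
+# toy embeddings, hand-picked so that the gender direction is roughly consistent
+emb = {
+    "king":  np.array([0.9, 0.8, 0.1]),
+    "queen": np.array([0.9, 0.1, 0.8]),
+    "man":   np.array([0.5, 0.9, 0.0]),
+    "woman": np.array([0.5, 0.1, 0.9]),
+    "paris": np.array([0.1, 0.5, 0.5]),
+}
+
+def analogy(a, b, c):
+    """Return the word whose vector is closest (cosine) to b - a + c."""
+    target = emb[b] - emb[a] + emb[c]
+    def cos(x, y):
+        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
+    candidates = [w for w in emb if w not in (a, b, c)]
+    return max(candidates, key=lambda w: cos(emb[w], target))
+
+print(analogy("man", "king", "woman"))   # expected: queen
+```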
+
+0:54:06.846 --> 0:54:15.066
+You see the examples on the slide are a bit old; the politicians shown are no longer in office, but of course the idea stays the same.
+
+0:54:16.957 --> 0:54:28.937
+What people noticed there, especially at the beginning, is that training an RNN language model was very expensive.
+
+0:54:29.309 --> 0:54:48.607
+And one famous observation was that we are not really interested in the language model performance itself. That is something good to keep in mind: what are we really interested in? Do we really want an RNN? No, in this case we are only interested in this mapping from words to vectors.
+
+0:54:49.169 --> 0:54:55.500
+And very successful at exactly this was word2vec.
+
+0:54:55.535 --> 0:55:22.871
+The idea is: we are not training a real language model; we make it even simpler, for example with continuous bag of words. We just take, say, four input tokens and predict the word in the middle, with essentially just two linear layers. So it simplifies things and makes the computation faster, because the embeddings are all we are interested in.
+
+0:55:23.263 --> 0:55:48.720
+Alongside this there is the continuous skip-gram model; together these are the models referred to as word2vec. There you have one input word and, the other way around, you predict the words around it. In the end the task is very similar.
+
+0:55:51.131 --> 0:56:01.407
+Before we go to the next point, are there any questions about static word vectors or word embeddings?
+
+0:56:04.564 --> 0:56:19.206
+The next thing is contextual word embeddings. The static idea is helpful; however, we might be able to get even more out of the monolingual data.
+
+0:56:19.419 --> 0:56:41.044
+Take a word like "can": its static embedding is an overlap of its different meanings, so it represents the modal verb in "I can do it" just as much as the container. But we might be able to disambiguate this already in the pre-trained model, because the meanings are used in different contexts.
+
+0:56:41.701 --> 0:57:07.713
+So if we can have a model which not only represents a word but also represents the meaning of the word within its context, then we get contextual word embeddings: we really have a representation of the word in the sentence.
+
+0:57:07.787 --> 0:57:29.303
+And we already have a very good architecture for that: in an RNN, the hidden state represents what has been said so far, with a focus on the most recent word, so it is a kind of contextual representation.
+
+0:57:29.509 --> 0:57:50.714
+The first work doing that was something like the ELMo paper. You start from a normal language model: given the words up to the third, you predict the fourth, and so on.
+
+0:57:50.714 --> 0:57:53.004
+So you are always predicting the next word.
+
+0:57:53.193 --> 0:58:04.254
+The architecture is that you have the word embedding layer and then the recurrent layers, as you see here, for example. And now, instead of using only the output at the end, you use this hidden state as the representation.
+
+0:58:04.364 --> 0:58:11.245
+It represents the meaning of this word, mainly in the context of what we have seen before it.
+
+0:58:11.871 --> 0:58:26.123
+We can train it in a language-model style, always predicting the next word, but much more information is already trained into it; therefore the downstream system has to learn fewer additional things.
+
+0:58:27.167 --> 0:58:45.095
+And this is essentially what is currently done in GPT: the only differences are that there are more layers, bigger sizes, and transformer self-attention instead of the RNN. But that is how these large language models are trained at the moment.
+
+0:58:46.746 --> 0:59:02.942
+However, if you look at this contextual representation, it might not be perfect. Think of this state as the contextual representation of the third word:
+
+0:59:07.587 --> 0:59:18.185
+it represents word three in the context of the sentence, but only in the context of the previous words.
+
+0:59:18.558 --> 0:59:30.193
+However, we have an architecture which can also take both sides into account, and we have used that already in the encoder.
+
+0:59:30.630 --> 0:59:49.135
+So we could easily run the RNN also in the backward direction, just by processing the states the other way around, and then combine the forward and the backward states into a joint one with which we do the prediction.
+
+0:59:49.329 --> 1:00:10.314
+So you have the word embedding, then you have two hidden states, one from the forward RNN and one from the backward RNN, and then you can, for example, take the concatenation of both of them.
+
+1:00:10.490 --> 1:00:30.573
+Now this state mainly represents this word, because this word is what both directions saw last, and we know an RNN focuses on what happened last.
+
+1:00:31.731 --> 1:00:41.059
+However, there is a problem when training this as a language model: you already have the answer in the input.
+
+1:00:43.203 --> 1:00:44.956
+Maybe there is again this masking?
+
+1:00:46.546 --> 1:00:53.596
+That is one solution. But first, why can't we do it naively? The information leaks, so you cannot just predict the next word; in this type of model that becomes a trivial task.
+
+1:00:58.738 --> 1:01:22.966
+You already know the next word because it influences the backward hidden state, and predicting something you can see is not a useful task; you have to define a different one. Otherwise, what will happen is that the system will just ignore the other states and simply learn to copy this information directly over.
+
+1:01:23.343 --> 1:01:38.287
+It would then represent that word, and you would get a nearly perfect model, because the network only needs to find an encoding with which it can store each word in this state.
+
+1:01:38.458 --> 1:01:44.050
+The only thing it learns is to encode the word itself in this upper hidden state.
+
+1:01:44.985 --> 1:01:53.779
+Therefore it's not really useful, so we need a somewhat different way out.
+
+1:01:55.295 --> 1:02:14.369
+There is the masking solution; I'll come to that shortly. But other things have also been done. One is not to combine the directions directly: that was done in the ELMo paper, where you have the forward and the backward RNN and keep them completely separated.
+
+1:02:14.594 --> 1:02:41.286
+So you never merge the states during training. In the end, the representation of a word is taken from the forward direction and the backward direction, always the hidden state just before the word in each direction, and these two are then joined into the representation.
+
+1:02:42.022 --> 1:02:59.815
+Then you have a representation of the word that covers the whole sentence, but there is no information leakage. So one way of doing it is, instead of a bidirectional RNN, to run a forward pass and a backward pass and join the hidden states afterwards.
+
+1:03:00.380 --> 1:03:16.300
+You can do that in all layers: you run the forward and backward layers separately and then take the hidden states.
+
+1:03:16.596 --> 1:03:25.230
+However, it's a bit complicated; you have to keep both directions separate and merge things at the end. So what else can you do?
+
+1:03:27.968 --> 1:03:48.314
+And that is where the big success of the BERT model came in: the observation that in the bidirectional case it's not a good idea to do next-word prediction, but we can do masking instead.
+
+1:03:48.308 --> 1:04:07.961
+Masking mainly means we predict something in the middle, some of the words. So the idea is: we take the input, put noise into it by removing words, and then the model has to reconstruct what we are interested in.
+
+1:04:08.048 --> 1:04:15.327
+Now there can be no information leakage, because the word you predict is no longer visible in the input.
+
+1:04:16.776 --> 1:04:29.500
+And we make no assumption about our model: it doesn't need to be a forward model or a backward model or anything; you can always predict the masked position.
+
+1:04:30.530 --> 1:04:40.105
+There is maybe one small disadvantage. Do you see what could be a bit of a problem with this?
+
+1:05:00.000 --> 1:05:24.679
+Yes, you can of course mask more, but to see it more globally, first assume you only mask one word. Then for the whole sentence we get one feedback signal, namely what word three is, so we have one training example. If you do the language modeling task instead, we predict at every position.
+
+1:05:25.005 --> 1:05:45.797
+So there we have as many training signals as tokens: for each token we get feedback about what the correct prediction is. In this sense masking is less efficient, because we get fewer feedback signals per sentence.
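+
+Going back to the ELMo-style construction described above (two separately run unidirectional LSTMs whose states are joined), here is a minimal PyTorch sketch of a leakage-free representation for predicting position t; the dimensions, the toy batch and the zero padding at the sequence borders are illustrative choices:
+
+```python
+import torch
+import torch.nn as nn
+
+vocab_size, d = 1000, 64
+emb = nn.Embedding(vocab_size, d)
+fwd_lstm = nn.LSTM(d, d, batch_first=True)
+bwd_lstm = nn.LSTM(d, d, batch_first=True)
+
+tokens = torch.randint(0, vocab_size, (2, 7))       # toy batch: 2 sentences, 7 tokens
+x = emb(tokens)                                     # (B, T, d)
+
+h_fwd, _ = fwd_lstm(x)                              # h_fwd[:, t] has seen tokens 0..t
+h_bwd, _ = bwd_lstm(torch.flip(x, dims=[1]))
+h_bwd = torch.flip(h_bwd, dims=[1])                 # h_bwd[:, t] has seen tokens t..T-1
+
+# leakage-free state for predicting token t: forward state up to t-1 and
+# backward state from t+1, i.e. token t itself is excluded from both sides
+B, T, _ = h_fwd.shape
+pad = torch.zeros(B, 1, d)
+left  = torch.cat([pad, h_fwd[:, :-1]], dim=1)
+right = torch.cat([h_bwd[:, 1:], pad], dim=1)
+context = torch.cat([left, right], dim=-1)          # (B, T, 2d)
+```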
+
+1:05:48.348 --> 1:05:59.709
+So in BERT the main ideas are that you do this bidirectional model with masking, and that it uses the transformer architecture.
+
+1:06:00.320 --> 1:06:16.573
+There are two more minor additions; we'll see that next-sentence prediction is used as a second task.
+
+1:06:16.957 --> 1:06:35.127
+The motivation is to learn more about what language is, to really understand whether sentences follow each other as in a story or are independent of each other.
+
+1:06:38.158 --> 1:06:50.193
+The input uses subword units as we use them, and it has a special classification token that is used for the next-sentence prediction.
+
+1:06:50.470 --> 1:07:07.203
+That token is mainly there for classification tasks, because at its position you learn a general representation of the full sentence.
+
+1:07:07.607 --> 1:07:24.323
+And you add segment embeddings, so there is an embedding marking whether a token belongs to the first or the second sentence.
+
+1:07:24.684 --> 1:07:35.050
+Now, what is more challenging is the masking itself: what do you mask? We already had the question of how much you should mask.
+
+1:07:35.275 --> 1:07:52.313
+There has been follow-up work on this afterwards, for example RoBERTa. It's not super sensitive, but if you do it completely wrong then you're not learning anything.
+
+1:07:52.572 --> 1:07:54.590
+And then there is another question.
+
+1:07:56.756 --> 1:08:14.504
+Should I always mask the full word, or, if a word is split into subwords, mask only one subword and predict it based on the other ones? Of course, that is a somewhat different task.
+
+1:08:14.894 --> 1:08:32.280
+If you already know three parts of a word, it might be easier to guess the last one. Here they took the simplest option: they do not consider full words at all, because the splitting happens in preprocessing, and they just always mask subwords.
+
+1:08:32.672 --> 1:08:40.401
+I think in some later versions this is done differently and full words are always masked, but I guess it is not crucial.
+
+1:08:41.001 --> 1:08:59.470
+And then, what do you do with a masked word? In eighty percent of the cases they replace it with a special mask token, in ten percent they put in some random other token, and in ten percent they keep it unchanged.
+
+1:09:02.202 --> 1:09:17.761
+And then you can also do this next-sentence prediction: "The man went to the store." "He bought a gallon of milk." These follow each other.
+
+1:09:18.418 --> 1:09:24.088
+And you see you can join them: you do both masking and next-sentence prediction at the same time.
+
+1:09:24.564 --> 1:09:43.018
+Whereas for "Penguins are flightless birds", these two sentences have nothing to do with each other, so you can also learn this type of prediction.
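+
+A minimal Python sketch of this corruption scheme; the 15 percent masking rate is BERT's usual default rather than a number given in the lecture, and the tiny vocabulary is made up:
+
+```python
+import random
+
+VOCAB = ["the", "man", "went", "to", "store", "he", "bought", "a", "gallon", "of", "milk"]
+MASK = "[MASK]"
+
+def corrupt(tokens, mask_prob=0.15):
+    """Pick ~mask_prob of the positions as prediction targets; of those,
+    80% become [MASK], 10% a random token, 10% stay unchanged."""
+    corrupted, targets = list(tokens), [None] * len(tokens)
+    for i, tok in enumerate(tokens):
+        if random.random() >= mask_prob:
+            continue
+        targets[i] = tok                     # the model must recover the original token
+        r = random.random()
+        if r < 0.8:
+            corrupted[i] = MASK
+        elif r < 0.9:
+            corrupted[i] = random.choice(VOCAB)
+        # else: leave the token as it is
+    return corrupted, targets
+
+print(corrupt("the man went to the store".split()))
+```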
+
+1:09:47.127 --> 1:09:58.164
+And then the whole BERT model: here you have the input, then the transformer layers, and you can train it this way.
+
+1:09:58.598 --> 1:10:17.731
+This model was quite successful in many general NLP applications.
+
+1:10:17.937 --> 1:10:46.640
+And a huge family of different model types came out of it: based on this, a whole ecosystem of self-supervised models developed, and this has become even more important with the availability and success of large language models.
+
+1:10:47.007 --> 1:10:48.436
+We now have even larger ones.
+
+1:10:48.828 --> 1:11:09.168
+Interestingly, the field shifted a bit again, away from bidirectional models back towards unidirectional models, which are at the moment maybe a bit more prominent; we're coming to them now. One advantage we already mentioned is efficiency.
+
+1:11:09.509 --> 1:11:17.150
+Is there another reason why you are sometimes more interested in unidirectional models than in bidirectional ones?
+
+1:11:22.882 --> 1:11:30.872
+It depends on the task, but for example for a language generation task, the bidirectional model is not really usable.
+
+1:11:32.192 --> 1:11:42.896
+It doesn't work: if you want to do generation, like in the decoder, you don't know the future, so you cannot apply it.
+
+1:11:43.223 --> 1:11:57.002
+So this type of model can be used for the encoder in an encoder-decoder model, but it cannot be used for the decoder.
+
+1:12:00.000 --> 1:12:08.839
+That's a good transition to the overall classes of models, perhaps viewed from the sequence-to-sequence perspective.
+
+1:12:09.009 --> 1:12:22.347
+We have the encoder-based models; that's what we just looked at. They are bidirectional and typically trained with masking.
+
+1:12:22.742 --> 1:12:42.601
+Then there are the decoder-based models, so autoregressive models which are unidirectional, like a GPT-style model, and there we can do next-word prediction.
+
+1:12:43.403 --> 1:13:05.039
+And in addition there is a special variant called the prefix language model, because it might be helpful that some of your input can also use bidirectional context.
+
+1:13:05.285 --> 1:13:28.774
+That is what the prefix language model does: on the first tokens you directly allow bidirectional attention. So you merge the two ideas, and that mainly works in transformer-based models, because
+
+1:13:29.629 --> 1:13:38.533
+there the number of parameters does not change; in an RNN you would need an additional backward RNN. In a transformer, the only difference is how you mask your attention.
+
+1:13:38.878 --> 1:13:59.466
+We have seen that the encoder and the decoder differ in their number of parameters because of cross-attention, but whether you go forward, backward or in both directions only changes the attention mask, that is, whether you may only look at the past or also into the future.
+
+1:14:00.680 --> 1:14:25.649
+And you can of course also mix: there is the bidirectional attention matrix where you can attend to everything, the unidirectional or causal one where you can only look at the past, and the prefix variant where, say, the first three words can attend to each other bidirectionally while the rest stays causal.
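+
+A small PyTorch sketch of these three attention patterns (sequence length 5 and prefix length 3 are arbitrary choices for illustration); entry (i, j) is True if query position i may attend to key position j:
+
+```python
+import torch
+
+T, prefix_len = 5, 3   # toy sequence length and prefix length
+
+# encoder / BERT-style: every position may attend to every other position
+bidirectional = torch.ones(T, T).bool()
+
+# decoder / GPT-style (causal): row i may only attend to columns j <= i
+causal = torch.tril(torch.ones(T, T)).bool()
+
+# prefix LM: bidirectional inside the first `prefix_len` tokens, causal afterwards
+prefix = causal.clone()
+prefix[:prefix_len, :prefix_len] = True
+
+print(causal.int())
+print(prefix.int())
+```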
+
+1:14:29.149 --> 1:14:42.831
+If that is somehow clear, then based on this we can also look at the third type of model.
+
+1:14:43.163 --> 1:14:57.704
+The idea is: we have our encoder-decoder architecture; can we also train it completely in a self-supervised way?
+
+1:14:58.238 --> 1:15:17.910
+In this case we give the same input to both sides. On the encoder side we don't need any masking of the attention, but on the decoder side we need the causal masking so that it doesn't see the future.
+
+1:15:20.440 --> 1:15:30.269
+This type of model also got quite successful, especially for pre-training machine translation.
+
+1:15:30.330 --> 1:15:47.087
+The first model doing that is the BART model, which does exactly this, and it is one successful way of pre-training your full encoder-decoder model.
+
+1:15:47.427 --> 1:16:02.432
+In contrast to machine translation, where you put in a source sentence, we can't do that here. But we can just put the same sentence in twice, and to make it a non-trivial task we corrupt the input.
+
+1:16:03.003 --> 1:16:12.777
+They use different corruption techniques, so you can, for example, also delete tokens.
+
+1:16:13.233 --> 1:16:29.818
+That is something you couldn't do in an encoder-only model, because then the position wouldn't be there and you couldn't predict anything for it: in the encoder, the number of input and output tokens always has to be the same, and you cannot make a prediction for something that isn't in the input.
+
+1:16:30.110 --> 1:16:40.355
+Here, on the decoder side, it is unidirectional generation, so we can also delete tokens and then try to generate the full sentence.
+
+1:16:41.061 --> 1:16:54.285
+We can do sentence permutation, document rotation and text infilling, so there is quite a range of options.
+
+1:16:55.615 --> 1:17:06.568
+So you see there are quite a lot of types of models that you can use in order to pre-train.
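+
+As a small illustration of two of the corruption techniques just listed, token deletion and sentence permutation; the naive splitting on "." and the toy document are simplifying assumptions:
+
+```python
+import random
+
+def delete_tokens(tokens, p=0.15):
+    """Token deletion: randomly drop tokens; the decoder must still generate the full text."""
+    kept = [t for t in tokens if random.random() >= p]
+    return kept if kept else tokens[:1]
+
+def permute_sentences(text):
+    """Sentence permutation: shuffle the order of the sentences in a document."""
+    sentences = [s.strip() for s in text.split(".") if s.strip()]
+    random.shuffle(sentences)
+    return ". ".join(sentences) + "."
+
+doc = "The man went to the store. He bought a gallon of milk."
+encoder_input = permute_sentences(" ".join(delete_tokens(doc.split())))
+decoder_target = doc   # the model is trained to reconstruct the original document
+```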
+
+1:17:07.507 --> 1:17:26.636
+Then, just as for the language models, the next question is how you integrate the pre-trained model, and there are again quite a few different techniques.
+
+1:17:27.007 --> 1:17:47.971
+It's a bit similar to before. The easiest thing is: you take your word embeddings or your pre-trained model, freeze them, stack your decoder layers on top, and keep the pre-trained parts fixed.
+
+1:17:48.748 --> 1:18:03.329
+The same can be done if you have this type of BART model: you can freeze your word embeddings or, for example, some of the layers, and train the rest.
+
+1:18:05.865 --> 1:18:19.120
+The other option is that you only initialize with the pre-trained weights but then train everything, so nothing stays frozen.
+
+1:18:22.562 --> 1:18:35.716
+Then there is one issue if you think about BART: for translation you want the source language in the encoder and the target language in the decoder; in BART, however, we have the same language on both sides.
+
+1:18:36.516 --> 1:19:03.388
+The model you can download may be trained only on English, so you need to do some adaptation to learn language-specific things, or you use mBART, which is trained on many languages, but only on the monolingual data of each language: it may have been trained on German and on English, but not on translating German into English.
+
+1:19:03.923 --> 1:19:18.775
+So you still need to fine-tune, and the model needs to learn how to do the attention cross-lingually, since in pre-training it only ever attended within the same language. But it mainly has to learn this mapping and not everything else, and that's why it's still quite successful.
+
+1:19:21.982 --> 1:19:27.492
+Now, one thing which is very commonly used is what is referred to as adapters.
+
+1:19:27.607 --> 1:19:41.815
+So, for example, you take mBART and you put some adapters inside the network: small new layers which are inserted in between, and then you train only these adapters, or at least also these adapters.
+
+1:19:41.815 --> 1:19:50.334
+For example, in mBART you can see it this way: the adapter learns to map the source language representation to the target language representation.
+
+1:19:50.470 --> 1:20:05.225
+Then you don't have to change the rest of the model; you give it a bit of extra capacity to really perform well on the task, and since the adapters are quite small this is very efficient.
+
+1:20:05.905 --> 1:20:16.248
+This is also very commonly used, for example in modular systems, where you have adapters in between which might be language-specific.
+
+1:20:16.916 --> 1:20:33.777
+So they are trained only for one language, while the rest of the model is shared, and this way the model still has the ability to work multilingually and share knowledge.
+
+1:20:34.914 --> 1:20:49.975
+But there is one caveat: in general this works quite well in multilingual systems, but there is one specific use case of multilinguality where it normally doesn't work well. Do you have an idea what that could be?
+
+1:20:55.996 --> 1:20:57.536
+It's the zero-shot case.
+
+1:20:57.998 --> 1:21:17.078
+Because these adapters might be very language-specific, and in zero-shot translation the idea is always to learn representations which are more language-independent. With the adapters you again get representations which are more language-specific, and then it doesn't work that well.
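+
+A minimal PyTorch sketch of such a bottleneck adapter; the dimensions and the residual design follow a common adapter recipe and are illustrative assumptions, not the exact layers of any particular system:
+
+```python
+import torch
+import torch.nn as nn
+
+class Adapter(nn.Module):
+    """Small bottleneck layer with a residual connection, inserted between frozen blocks."""
+    def __init__(self, d_model=512, d_bottleneck=64):
+        super().__init__()
+        self.down = nn.Linear(d_model, d_bottleneck)
+        self.up = nn.Linear(d_bottleneck, d_model)
+
+    def forward(self, x):
+        return x + self.up(torch.relu(self.down(x)))
+
+# typical training pattern: freeze the pre-trained block, train only the adapter
+pretrained_block = nn.Linear(512, 512)      # stand-in for a pre-trained transformer layer
+for p in pretrained_block.parameters():
+    p.requires_grad = False
+
+adapter = Adapter()
+h = torch.randn(2, 7, 512)                  # toy hidden states (batch, length, d_model)
+out = adapter(pretrained_block(h))
+```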
+
+1:21:20.260 --> 1:21:37.730
+And there is also the idea of doing knowledge distillation.
+
+1:21:39.179 --> 1:21:54.157
+The idea is: we train the MT system as before, but what we additionally want to achieve is that its encoder behaves like the pre-trained model.
+
+1:21:54.414 --> 1:22:11.777
+It should learn faster by trying to make these hidden states as similar as possible, so you compare, say, the first hidden state of the pre-trained model with the corresponding state of your encoder and try to make them match.
+
+1:22:12.192 --> 1:22:26.373
+For example, by using the L2 norm, so by simply pushing these two representations to be the same. This requires the same vocabulary. Why does it need the same vocabulary, any idea?
+
+1:22:34.754 --> 1:22:50.690
+If you have different vocabularies, you typically also get different sequence lengths; the number of states is different.
+
+1:22:51.231 --> 1:23:01.089
+If you now have five states on one side and four states on the other, it's no longer straightforward which state to compare to which.
+
+1:23:02.322 --> 1:23:08.940
+It's just easier if you have the same number: you can always compare the first to the first and the second to the second.
+
+1:23:09.709 --> 1:23:16.836
+So, at least in this very simple form, knowledge distillation only works if you have the same vocabulary.
+
+1:23:17.177 --> 1:23:33.071
+Of course, you could do things like requiring the averages to be the same, but that is a much weaker training signal.
+
+1:23:34.314 --> 1:23:52.407
+The advantage here is that you get a direct training signal on the encoder, so you can directly push the encoder to produce good representations, while normally an NMT system only gets its training signal at the output.
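+
+A minimal PyTorch sketch of this auxiliary distillation loss on the encoder states; the random tensors stand in for the frozen pre-trained encoder and the trainable MT encoder, and the dimensions are assumed:
+
+```python
+import torch
+import torch.nn as nn
+
+batch, seq_len, d_model = 2, 6, 512
+mse = nn.MSELoss()
+
+# same tokenisation on both sides, so the states match position by position
+h_pretrained = torch.randn(batch, seq_len, d_model)
+h_mt_encoder = torch.randn(batch, seq_len, d_model, requires_grad=True)
+
+# auxiliary loss pushing the MT encoder states towards the pre-trained ones;
+# during training it would be added to the usual translation loss
+distill_loss = mse(h_mt_encoder, h_pretrained)
+distill_loss.backward()
+```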
+
+1:23:56.936 --> 1:24:13.197
+Yes, I think that is most of it for today, so let me remind you what you should keep in mind.
+
+1:24:13.393 --> 1:24:33.614
+The one idea is back-translation, if you have monolingual data and want to use it; the other one is to use pre-trained models. And generally it is often helpful to combine them, so you can even use both.
+
+1:24:33.853 --> 1:24:40.057
+So you can use pre-trained models, and then it can still be helpful to additionally do back-translation.
+
+1:24:40.160 --> 1:24:57.739
+Back-translation has the advantage that everything is trained to work together on the translation task, so it might be helpful to back-translate some data and use it in a real translation setup, because in pre-training the big challenge is always that you're training on a different task.
+
+1:24:58.058 --> 1:25:08.089
+And there are different ways of integrating this knowledge, even if you just take the full pre-trained model.
+
+1:25:08.748 --> 1:25:24.026
+That is the most similar you can get: you make no changes to the architecture, you really take the model and just fine-tune it on the new task, but it still has to learn completely from scratch how to do the cross-attention and so on.
+
+1:25:24.464 --> 1:25:29.971
+And for that it might, for example, be helpful to have additional back-translated data to learn from.
+
+1:25:32.192 --> 1:25:45.920
+That's it for today. There is one important thing: next Tuesday there is a conference or a workshop or something in this room.
+
+1:25:47.127 --> 1:25:57.426
+You should get an e-mail, if you're registered in ILIAS, that there's a room change for Tuesday.
+
+1:25:57.637 --> 1:26:15.301
+Are there more questions? Yes, a more general one: in computer vision you can enlarge your data set with data augmentation; is there anything similar for text?
+
+1:26:15.755 --> 1:26:31.228
+You can use back-translation and also masking; back-translation in particular is a form of data augmentation.
+
+1:26:31.371 --> 1:26:54.060
+It has, for example, even been used not only for monolingual data: if you have a good MT system, it can also be used to augment parallel data.
+
+1:26:54.834 --> 1:27:03.143
+So I would say that is the most similar thing. There are also ways to do paraphrasing.
+
+1:27:05.025 --> 1:27:18.936
+But it is very hard to do that with rules, deciding which words to replace, because you cannot simply say that this word can always be replaced by that one.
+
+1:27:19.139 --> 1:27:36.963
+I mean, although there are many near-perfect synonyms, normally they fit in some contexts but not in all of them, and so on. And if you don't do it rule-based, you have to train a paraphrasing model, and then you again need data for that.
+
+1:27:38.058 --> 1:27:57.236
+Does it need the same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:28:16.469
+It should have the same dimension, so it's easiest to have the same dimension and architecture. We will later see, in the lecture on efficiency, that you can also do knowledge distillation with smaller models; you can learn the same thing with, for example,
+
+1:28:17.477 --> 1:28:22.949
+eight layers instead, so that is possible, but yes, I agree, it should have the same dimensionality.
+
+1:28:23.623 --> 1:28:41.157
+And to the other question: you can of course do it as an initialization, or you can do it during training, but normally it makes the most sense during the normal training.
+
+1:28:45.865 --> 1:28:53.963
+If that is all, then thanks a lot, and we'll see each other again on Tuesday.
+ +0:04:06.786 --> 0:04:09.790 +Which is the case for many, many language +pairs. + +0:04:10.030 --> 0:04:19.209 +Like even with German, you have not translation +parallel data to all languages around the world, + +0:04:19.209 --> 0:04:26.400 +or most of them you have it to the Europeans +once, maybe even for Japanese. + +0:04:26.746 --> 0:04:35.332 +There is quite a lot of data, for example +English to Japanese, but German to Japanese + +0:04:35.332 --> 0:04:37.827 +or German to Vietnamese. + +0:04:37.827 --> 0:04:41.621 +There is some data from Multilingual. + +0:04:42.042 --> 0:04:54.584 +So there is a very promising direction if +you want to build translation systems between + +0:04:54.584 --> 0:05:00.142 +language peers, typically not English. + +0:05:01.221 --> 0:05:05.887 +And the other ideas, of course, we don't have +to either just search for it. + +0:05:06.206 --> 0:05:12.505 +Some work on a data crawling so if I don't +have a corpus directly or I don't have an high + +0:05:12.505 --> 0:05:19.014 +quality corpus like from the European Parliament +for a TED corpus so maybe it makes sense to + +0:05:19.014 --> 0:05:23.913 +crawl more data and get additional sources +so you can build stronger. + +0:05:24.344 --> 0:05:35.485 +There has been quite a big effort in Europe +to collect really large data sets for parallel + +0:05:35.485 --> 0:05:36.220 +data. + +0:05:36.220 --> 0:05:40.382 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:50.037 +But how can we explicitly crawl data? + +0:05:50.037 --> 0:05:52.070 +Which is some of a peril? + +0:05:52.132 --> 0:05:58.461 +So there is in the Internet quite a lot of +data which has been company websites which + +0:05:58.461 --> 0:06:01.626 +have been translated and things like that. + +0:06:01.626 --> 0:06:05.158 +So how can you extract them parallel fragments? + +0:06:06.566 --> 0:06:13.404 +That is typically more noisy than where you +do more at hands where mean if you have Parliament. + +0:06:13.693 --> 0:06:17.680 +You can do some rules how to extract parallel +things. + +0:06:17.680 --> 0:06:24.176 +Here there is more to it, so the quality is +later maybe not as good, but normally scale + +0:06:24.176 --> 0:06:26.908 +is then a possibility to address it. + +0:06:26.908 --> 0:06:30.304 +So you just have so much more data that even. + +0:06:33.313 --> 0:06:40.295 +The other thing can be used monolingual data +and monolingual data has a big advantage that + +0:06:40.295 --> 0:06:46.664 +we can have a huge amount of that so that you +can be autocrawed from the Internet. + +0:06:46.664 --> 0:06:51.728 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:59.558 +There is just so much more magnitude of monolingual +data so that it might be very helpful. + +0:06:59.559 --> 0:07:03.054 +We can do that in statistical machine translation. + +0:07:03.054 --> 0:07:06.755 +It was quite easy to integrate using language +models. + +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. 
+ +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. + +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. + +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. 
+ +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. + +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. 
+ +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. + +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. 
+ +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:53.199 +Case so the actual word that we transferred +into a numerical lecture, and this is an input + +0:21:53.199 --> 0:21:54.838 +into the. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. + +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? 
+ +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. + +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. 
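A minimal sketch of the freezing option just mentioned (the joint-training alternative is described right after); `decoder_lm` is a hypothetical name for the pre-trained language-model part of the network.

```python
import torch

def freeze_pretrained_lm(model):
    """Keep the pre-trained language-model layers fixed during fine-tuning,
    so the knowledge learned on monolingual data is not forgotten."""
    for p in model.decoder_lm.parameters():
        p.requires_grad = False

def trainable_parameters(model):
    # only the still-trainable parameters go into the optimizer
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.Adam(trainable_parameters(model), lr=1e-4)
```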
+ +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. + +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. 
+ +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. + +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. 
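A minimal sketch of the back-translation setup described above for a German-to-English system; `reverse_model.translate` is a hypothetical wrapper around the English-to-German system.

```python
def back_translate(english_monolingual, reverse_model):
    """Create synthetic parallel data: the source side is machine-generated
    (and possibly noisy), the target side is the human-written sentence."""
    synthetic_pairs = []
    for en_sentence in english_monolingual:
        de_synthetic = reverse_model.translate(en_sentence)
        synthetic_pairs.append((de_synthetic, en_sentence))
    return synthetic_pairs

# training data = original parallel data + synthetic pairs
# train_data = parallel_data + back_translate(mono_en, en_de_model)
```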
+ +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. + +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. 
+ +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. 
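As a small illustration of "creating your supervised training data yourself", here is a sketch that turns raw text into next-word prediction examples; whitespace tokenisation and the fixed context size are simplifying assumptions.

```python
def next_word_examples(corpus_sentences, context_size=3):
    """Self-supervised examples from raw text: the previous words are the
    input, the following word is the label. No human annotation is needed,
    which is why this scales to huge monolingual corpora."""
    examples = []
    for sentence in corpus_sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            examples.append((words[i - context_size:i], words[i]))
    return examples

# next_word_examples(["the cat sat on the mat"])
# -> [(['the', 'cat', 'sat'], 'on'), (['cat', 'sat', 'on'], 'the'), ...]
```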
+ +0:41:37.877 --> 0:41:53.924 +So you can build a system which is knowledge, +which is just trained on large amounts of data. + +0:41:56.376 --> 0:42:01.564 +So the question is maybe what type of information +so what type of models can you? + +0:42:01.821 --> 0:42:05.277 +And we want today to look at briefly at swings. + +0:42:05.725 --> 0:42:08.704 +First, that was what was initially done. + +0:42:08.704 --> 0:42:15.314 +It wasn't as famous as in machine translation +as in other things, but it's also used there + +0:42:15.314 --> 0:42:21.053 +and that is to use static word embedding, so +just the first step we know here. + +0:42:21.221 --> 0:42:28.981 +So we have this mapping from the one hot to +a small continuous word representation. + +0:42:29.229 --> 0:42:38.276 +Using this one in your NG system, so you can, +for example, replace the embedding layer by + +0:42:38.276 --> 0:42:38.779 +the. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.517 +And we're always in this pre-training phase +and have the thing the advantage is. + +0:42:48.468 --> 0:42:52.411 +More data than the trade off, so you can get +better. + +0:42:52.411 --> 0:42:59.107 +The disadvantage is, does anybody have an +idea of what might be the disadvantage of using + +0:42:59.107 --> 0:43:00.074 +things like. + +0:43:04.624 --> 0:43:12.175 +What was one mentioned today giving like big +advantage of the system compared to previous. + +0:43:20.660 --> 0:43:25.134 +Where one advantage was the enter end training, +so you have the enter end training so that + +0:43:25.134 --> 0:43:27.937 +all parameters and all components play optimal +together. + +0:43:28.208 --> 0:43:33.076 +If you know pre-train something on one fast, +it may be no longer optimal fitting to everything + +0:43:33.076 --> 0:43:33.384 +else. + +0:43:33.893 --> 0:43:37.862 +So what do pretending or not? + +0:43:37.862 --> 0:43:48.180 +It depends on how important everything is +optimal together and how important. + +0:43:48.388 --> 0:43:50.454 +Of large amount. + +0:43:50.454 --> 0:44:00.541 +The pre-change one is so much better that +it's helpful, and the advantage of that. + +0:44:00.600 --> 0:44:11.211 +Getting everything optimal together, yes, +we would use random instructions for raising. + +0:44:11.691 --> 0:44:26.437 +The problem is you might be already in some +area where it's not easy to get. + +0:44:26.766 --> 0:44:35.329 +But often in some way right, so often it's +not about your really worse pre trained monolepsy. + +0:44:35.329 --> 0:44:43.254 +If you're going already in some direction, +and if this is not really optimal for you,. + +0:44:43.603 --> 0:44:52.450 +But if you're not really getting better because +you have a decent amount of data, it's so different + +0:44:52.450 --> 0:44:52.981 +that. + +0:44:53.153 --> 0:44:59.505 +Initially it wasn't a machine translation +done so much because there are more data in + +0:44:59.505 --> 0:45:06.153 +MPs than in other tasks, but now with really +large amounts of monolingual data we do some + +0:45:06.153 --> 0:45:09.403 +type of pretraining in currently all state. + +0:45:12.632 --> 0:45:14.302 +The other one is okay now. + +0:45:14.302 --> 0:45:18.260 +It's always like how much of the model do +you plea track a bit? + +0:45:18.658 --> 0:45:22.386 +To the other one you can do contextural word +embedded. 
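A minimal sketch of this first option, plugging pre-trained static word embeddings into the NMT embedding layer (the contextual variant is discussed next); how the vectors are mapped onto the system's vocabulary is assumed and not shown here.

```python
import torch
import torch.nn as nn

def embedding_from_pretrained(vectors, freeze=True):
    """Build the NMT embedding layer from pre-trained static word vectors.

    `vectors` is a (vocab_size, emb_dim) float tensor, e.g. loaded from a
    word2vec-style file already aligned with the system vocabulary.
    With freeze=True the layer is kept fixed; with freeze=False it is
    only initialized and then fine-tuned with the rest of the model."""
    return nn.Embedding.from_pretrained(vectors, freeze=freeze)
```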
+ +0:45:22.386 --> 0:45:28.351 +That is something like bird or Roberta where +you train already a sequence model and the + +0:45:28.351 --> 0:45:34.654 +embeddings you're using are no longer specific +for word but they are also taking the context + +0:45:34.654 --> 0:45:35.603 +into account. + +0:45:35.875 --> 0:45:50.088 +The embedding you're using is no longer depending +on the word itself but on the whole sentence, + +0:45:50.088 --> 0:45:54.382 +so you can use this context. + +0:45:55.415 --> 0:46:02.691 +You can use similar things also in the decoder +just by having layers which don't have access + +0:46:02.691 --> 0:46:12.430 +to the source, but there it still might have +and these are typically models like: And finally + +0:46:12.430 --> 0:46:14.634 +they will look at the end. + +0:46:14.634 --> 0:46:19.040 +You can also have models which are already +sequenced. + +0:46:19.419 --> 0:46:28.561 +So you may be training a sequence to sequence +models. + +0:46:28.561 --> 0:46:35.164 +You have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you'll find tuning. + +0:46:47.227 --> 0:46:59.614 +But let's first do a bit of step back and +look into what are the different things. + +0:46:59.614 --> 0:47:02.151 +The first thing. + +0:47:02.382 --> 0:47:11.063 +The wooden bettings are just this first layer +and you can train them with feedback annual + +0:47:11.063 --> 0:47:12.028 +networks. + +0:47:12.212 --> 0:47:22.761 +But you can also train them with an N language +model, and by now you hopefully have also seen + +0:47:22.761 --> 0:47:27.699 +that you cannot transform a language model. + +0:47:30.130 --> 0:47:37.875 +So this is how you can train them and you're +training them. + +0:47:37.875 --> 0:47:45.234 +For example, to speak the next word that is +the easiest. + +0:47:45.525 --> 0:47:55.234 +And that is what is now referred to as South +Supervised Learning and, for example, all the + +0:47:55.234 --> 0:48:00.675 +big large language models like Chad GPT and +so on. + +0:48:00.675 --> 0:48:03.129 +They are trained with. + +0:48:03.823 --> 0:48:15.812 +So that is where you can hopefully learn how +a word is used because you always try to previct + +0:48:15.812 --> 0:48:17.725 +the next word. + +0:48:19.619 --> 0:48:27.281 +Word embedding: Why do you keep the first +look at the word embeddings and the use of + +0:48:27.281 --> 0:48:29.985 +word embeddings for our task? + +0:48:29.985 --> 0:48:38.007 +The main advantage was it might be only the +first layer where you typically have most of + +0:48:38.007 --> 0:48:39.449 +the parameters. + +0:48:39.879 --> 0:48:57.017 +Most of your parameters already on the large +data, then on your target data you have to + +0:48:57.017 --> 0:48:59.353 +train less. + +0:48:59.259 --> 0:49:06.527 +Big difference that your input size is so +much bigger than the size of the novel in size. + +0:49:06.626 --> 0:49:17.709 +So it's a normally sign, maybe like, but your +input and banning size is something like. + +0:49:17.709 --> 0:49:20.606 +Then here you have to. + +0:49:23.123 --> 0:49:30.160 +While here you see it's only like zero point +five times as much in the layer. + +0:49:30.750 --> 0:49:36.534 +So here is where most of your parameters are, +which means if you already replace the word + +0:49:36.534 --> 0:49:41.739 +embeddings, they might look a bit small in +your overall and in key architecture. 
+ +0:49:41.739 --> 0:49:47.395 +It's where most of the things are, and if +you're doing that you already have really big + +0:49:47.395 --> 0:49:48.873 +games and can do that. + +0:49:57.637 --> 0:50:01.249 +The thing is we have seen these were the bettings. + +0:50:01.249 --> 0:50:04.295 +They can be very good use for other types. + +0:50:04.784 --> 0:50:08.994 +You learn some general relations between words. + +0:50:08.994 --> 0:50:17.454 +If you're doing this type of language modeling +cast, you predict: The one thing is you have + +0:50:17.454 --> 0:50:24.084 +a lot of data, so the one question is we want +to have data to trade a model. + +0:50:24.084 --> 0:50:28.734 +The other thing, the tasks need to be somehow +useful. + +0:50:29.169 --> 0:50:43.547 +If you would predict the first letter of the +word, then you wouldn't learn anything about + +0:50:43.547 --> 0:50:45.144 +the word. + +0:50:45.545 --> 0:50:53.683 +And the interesting thing is people have looked +at these wood embeddings. + +0:50:53.954 --> 0:50:58.550 +And looking at the word embeddings. + +0:50:58.550 --> 0:51:09.276 +You can ask yourself how they look and visualize +them by doing dimension reduction. + +0:51:09.489 --> 0:51:13.236 +Don't know if you and you are listening to +artificial intelligence. + +0:51:13.236 --> 0:51:15.110 +Advanced artificial intelligence. + +0:51:15.515 --> 0:51:23.217 +We had on yesterday there how to do this type +of representation, but you can do this time + +0:51:23.217 --> 0:51:29.635 +of representation, and now you're seeing interesting +things that normally. + +0:51:30.810 --> 0:51:41.027 +Now you can represent a here in a three dimensional +space with some dimension reduction. + +0:51:41.027 --> 0:51:46.881 +For example, the relation between male and +female. + +0:51:47.447 --> 0:51:56.625 +So this vector between the male and female +version of something is always not the same, + +0:51:56.625 --> 0:51:58.502 +but it's related. + +0:51:58.718 --> 0:52:14.522 +So you can do a bit of maths, so you do take +king, you subtract this vector, add this vector. + +0:52:14.894 --> 0:52:17.591 +So that means okay, there is really something +stored. + +0:52:17.591 --> 0:52:19.689 +Some information are stored in that book. + +0:52:20.040 --> 0:52:22.621 +Similar, you can do it with Bob Hansen. + +0:52:22.621 --> 0:52:25.009 +See here swimming slam walking walk. + +0:52:25.265 --> 0:52:34.620 +So again these vectors are not the same, but +they are related. + +0:52:34.620 --> 0:52:42.490 +So you learn something from going from here +to here. + +0:52:43.623 --> 0:52:49.761 +Or semantically, the relations between city +and capital have exactly the same sense. + +0:52:51.191 --> 0:52:56.854 +And people had even done that question answering +about that if they showed the diembeddings + +0:52:56.854 --> 0:52:57.839 +and the end of. + +0:52:58.218 --> 0:53:06.711 +All you can also do is don't trust the dimensions +of the reaction because maybe there is something. + +0:53:06.967 --> 0:53:16.863 +You can also look into what happens really +in the individual space. + +0:53:16.863 --> 0:53:22.247 +What is the nearest neighbor of the. + +0:53:22.482 --> 0:53:29.608 +So you can take the relationship between France +and Paris and add it to Italy and you'll. + +0:53:30.010 --> 0:53:33.078 +You can do big and bigger and you have small +and smaller and stuff. + +0:53:33.593 --> 0:53:49.417 +Because it doesn't work everywhere, there +is also some typical dish here in German. 
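The vector arithmetic just illustrated can be sketched as follows; `emb` is a hypothetical dictionary mapping words to unit-length numpy vectors.

```python
import numpy as np

def analogy(emb, a, b, c, topk=1):
    """Return the word(s) closest to emb[b] - emb[a] + emb[c],
    e.g. analogy(emb, "man", "king", "woman") is expected to give "queen"."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    scores = {w: float(v @ query) for w, v in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topk]
```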
+ +0:53:51.491 --> 0:54:01.677 +You can do what the person is doing for famous +ones, of course only like Einstein scientists + +0:54:01.677 --> 0:54:06.716 +that find midfielders not completely correct. + +0:54:06.846 --> 0:54:10.134 +You see the examples are a bit old. + +0:54:10.134 --> 0:54:15.066 +The politicians are no longer they am, but +of course. + +0:54:16.957 --> 0:54:26.759 +What people have done there, especially at +the beginning training our end language model, + +0:54:26.759 --> 0:54:28.937 +was very expensive. + +0:54:29.309 --> 0:54:38.031 +So one famous model was, but we are not really +interested in the language model performance. + +0:54:38.338 --> 0:54:40.581 +Think something good to keep in mind. + +0:54:40.581 --> 0:54:42.587 +What are we really interested in? + +0:54:42.587 --> 0:54:45.007 +Do we really want to have an R&N no? + +0:54:45.007 --> 0:54:48.607 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.500 +And so successful and very successful was +this word to vet. + +0:54:55.535 --> 0:54:56.865 +The idea is okay. + +0:54:56.865 --> 0:55:03.592 +We are not training real language one, making +it even simpler and doing this, for example, + +0:55:03.592 --> 0:55:05.513 +continuous peck of words. + +0:55:05.513 --> 0:55:12.313 +We're just having four input tokens and we're +predicting what is the word in the middle and + +0:55:12.313 --> 0:55:15.048 +this is just like two linear layers. + +0:55:15.615 --> 0:55:21.627 +So it's even simplifying things and making +the calculation faster because that is what + +0:55:21.627 --> 0:55:22.871 +we're interested. + +0:55:23.263 --> 0:55:32.897 +All this continuous skip ground models with +these other models which refer to as where + +0:55:32.897 --> 0:55:34.004 +to where. + +0:55:34.234 --> 0:55:42.394 +Where you have one equal word and the other +way around, you're predicting the four words + +0:55:42.394 --> 0:55:43.585 +around them. + +0:55:43.585 --> 0:55:45.327 +It's very similar. + +0:55:45.327 --> 0:55:48.720 +The task is in the end very similar. + +0:55:51.131 --> 0:56:01.407 +Before we are going to the next point, anything +about normal weight vectors or weight embedding. + +0:56:04.564 --> 0:56:07.794 +The next thing is contexture. + +0:56:07.794 --> 0:56:12.208 +Word embeddings and the idea is helpful. + +0:56:12.208 --> 0:56:19.206 +However, we might even be able to get more +from one lingo layer. + +0:56:19.419 --> 0:56:31.732 +And now in the word that is overlap of these +two meanings, so it represents both the meaning + +0:56:31.732 --> 0:56:33.585 +of can do it. + +0:56:34.834 --> 0:56:40.410 +But we might be able to in the pre-trained +model already disambiguate this because they + +0:56:40.410 --> 0:56:41.044 +are used. + +0:56:41.701 --> 0:56:53.331 +So if we can have a model which can not only +represent a word but can also represent the + +0:56:53.331 --> 0:56:58.689 +meaning of the word within the context,. + +0:56:59.139 --> 0:57:03.769 +So then we are going to context your word +embeddings. + +0:57:03.769 --> 0:57:07.713 +We are really having a representation in the. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:23.791 +The hidden state represents what is currently +said, but it's focusing on what is the last + +0:57:23.791 --> 0:57:29.303 +one, so it's some of the representation. 
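A minimal sketch of the continuous bag-of-words model described above: essentially two linear layers, averaging the context embeddings and predicting the word in the middle. The sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words: average the context word embeddings and
    predict the word in the middle with a single output layer."""
    def __init__(self, vocab_size, emb_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # first "linear" layer
        self.out = nn.Linear(emb_dim, vocab_size)     # second linear layer

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) surrounding word ids
        h = self.emb(context_ids).mean(dim=1)         # average the context
        return self.out(h)                            # scores for middle word
```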
+ +0:57:29.509 --> 0:57:43.758 +The first one doing that is something like +the Elmo paper where they instead of this is + +0:57:43.758 --> 0:57:48.129 +the normal language model. + +0:57:48.008 --> 0:57:50.714 +Within the third, predicting the fourth, and +so on. + +0:57:50.714 --> 0:57:53.004 +So you are always predicting the next work. + +0:57:53.193 --> 0:57:57.335 +The architecture is the heaven words embedding +layer and then layers. + +0:57:57.335 --> 0:58:03.901 +See you, for example: And now instead of using +this one in the end, you're using here this + +0:58:03.901 --> 0:58:04.254 +one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:18.610 +We can train it in a language model style +always predicting the next word, but we have + +0:58:18.610 --> 0:58:21.088 +more information trained there. + +0:58:21.088 --> 0:58:26.123 +Therefore, in the system it has to learn less +additional things. + +0:58:27.167 --> 0:58:31.261 +And there is one Edendang which is done currently +in GPS. + +0:58:31.261 --> 0:58:38.319 +The only difference is that we have more layers, +bigger size, and we're using transformer neurocell + +0:58:38.319 --> 0:58:40.437 +potential instead of the RNA. + +0:58:40.437 --> 0:58:45.095 +But that is how you train like some large +language models at the. + +0:58:46.746 --> 0:58:55.044 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:55.044 --> 0:59:02.942 +So if you think of this one as a contextual +representation of the third word,. + +0:59:07.587 --> 0:59:16.686 +Is representing a three in the context of +a sentence, however only in the context of + +0:59:16.686 --> 0:59:18.185 +the previous. + +0:59:18.558 --> 0:59:27.413 +However, we have an architecture which can +also take both sides and we have used that + +0:59:27.413 --> 0:59:30.193 +already in the ink holder. + +0:59:30.630 --> 0:59:34.264 +So we could do the iron easily on your, also +in the backward direction. + +0:59:34.874 --> 0:59:42.826 +By just having the states the other way around +and then we couldn't combine the forward and + +0:59:42.826 --> 0:59:49.135 +the forward into a joint one where we are doing +this type of prediction. + +0:59:49.329 --> 0:59:50.858 +So you have the word embedding. + +0:59:51.011 --> 1:00:02.095 +Then you have two in the states, one on the +forward arm and one on the backward arm, and + +1:00:02.095 --> 1:00:10.314 +then you can, for example, take the cocagenation +of both of them. + +1:00:10.490 --> 1:00:23.257 +Now this same here represents mainly this +word because this is what both puts in it last + +1:00:23.257 --> 1:00:30.573 +and we know is focusing on what is happening +last. + +1:00:31.731 --> 1:00:40.469 +However, there is a bit of difference when +training that as a language model you already + +1:00:40.469 --> 1:00:41.059 +have. + +1:00:43.203 --> 1:00:44.956 +Maybe There's Again This Masking. + +1:00:46.546 --> 1:00:47.748 +That is one solution. + +1:00:47.748 --> 1:00:52.995 +First of all, why we can't do it is the information +you leak it, so you cannot just predict the + +1:00:52.995 --> 1:00:53.596 +next word. + +1:00:53.596 --> 1:00:58.132 +If we just predict the next word in this type +of model, that's a very simple task. + +1:00:58.738 --> 1:01:09.581 +You know the next word because it's influencing +this hidden state predicting something is not + +1:01:09.581 --> 1:01:11.081 +a good task. 
+ +1:01:11.081 --> 1:01:18.455 +You have to define: Because in this case what +will end with the system will just ignore these + +1:01:18.455 --> 1:01:22.966 +estates and what will learn is copy this information +directly in here. + +1:01:23.343 --> 1:01:31.218 +So it would be representing this word and +you would have nearly a perfect model because + +1:01:31.218 --> 1:01:38.287 +you only need to find encoding where you can +encode all words somehow in this. + +1:01:38.458 --> 1:01:44.050 +The only thing can learn is that turn and +encode all my words in this upper hidden. + +1:01:44.985 --> 1:01:53.779 +Therefore, it's not really useful, so we need +to find a bit of different ways out. + +1:01:55.295 --> 1:01:57.090 +There is a masking one. + +1:01:57.090 --> 1:02:03.747 +I'll come to that shortly just a bit that +other things also have been done, so the other + +1:02:03.747 --> 1:02:06.664 +thing is not to directly combine them. + +1:02:06.664 --> 1:02:13.546 +That was in the animal paper, so you have +them forward R&M and you keep them completely + +1:02:13.546 --> 1:02:14.369 +separated. + +1:02:14.594 --> 1:02:20.458 +So you never merged to state. + +1:02:20.458 --> 1:02:33.749 +At the end, the representation of the word +is now from the forward. + +1:02:33.873 --> 1:02:35.953 +So it's always the hidden state before the +good thing. + +1:02:36.696 --> 1:02:41.286 +These two you join now to your to the representation. + +1:02:42.022 --> 1:02:48.685 +And then you have now a representation also +about like the whole sentence for the word, + +1:02:48.685 --> 1:02:51.486 +but there is no information leakage. + +1:02:51.486 --> 1:02:58.149 +One way of doing this is instead of doing +a bidirection along you do a forward pass and + +1:02:58.149 --> 1:02:59.815 +then join the hidden. + +1:03:00.380 --> 1:03:05.960 +So you can do that in all layers. + +1:03:05.960 --> 1:03:16.300 +In the end you do the forwarded layers and +you get the hidden. + +1:03:16.596 --> 1:03:19.845 +However, it's a bit of a complicated. + +1:03:19.845 --> 1:03:25.230 +You have to keep both separate and merge things +so can you do. + +1:03:27.968 --> 1:03:33.030 +And that is the moment where like the big. + +1:03:34.894 --> 1:03:39.970 +The big success of the burnt model was used +where it okay. + +1:03:39.970 --> 1:03:47.281 +Maybe in bite and rich case it's not good +to do the next word prediction, but we can + +1:03:47.281 --> 1:03:48.314 +do masking. + +1:03:48.308 --> 1:03:56.019 +Masking mainly means we do a prediction of +something in the middle or some words. + +1:03:56.019 --> 1:04:04.388 +So the idea is if we have the input, we are +putting noise into the input, removing them, + +1:04:04.388 --> 1:04:07.961 +and then the model we are interested. + +1:04:08.048 --> 1:04:15.327 +Now there can be no information leakage because +this wasn't predicting that one is a big challenge. + +1:04:16.776 --> 1:04:19.957 +Do any assumption about our model? + +1:04:19.957 --> 1:04:26.410 +It doesn't need to be a forward model or a +backward model or anything. + +1:04:26.410 --> 1:04:29.500 +You can always predict the three. + +1:04:30.530 --> 1:04:34.844 +There's maybe one bit of a disadvantage. + +1:04:34.844 --> 1:04:40.105 +Do you see what could be a bit of a problem +this? + +1:05:00.000 --> 1:05:06.429 +Yes, so yeah, you can of course mask more, +but to see it more globally, just first assume + +1:05:06.429 --> 1:05:08.143 +you're only masked one. 
+ +1:05:08.143 --> 1:05:13.930 +For the whole sentence, we get one feedback +signal, like what is the word three. + +1:05:13.930 --> 1:05:22.882 +So we have one training example: If you do +the language modeling taste, we predicted here, + +1:05:22.882 --> 1:05:24.679 +we predicted here. + +1:05:25.005 --> 1:05:26.735 +So we have number of tokens. + +1:05:26.735 --> 1:05:30.970 +For each token we have a feet pad and say +what is the best correction. + +1:05:31.211 --> 1:05:43.300 +So in this case this is less efficient because +we are getting less feedback signals on what + +1:05:43.300 --> 1:05:45.797 +we should predict. + +1:05:48.348 --> 1:05:56.373 +So and bird, the main ideas are that you're +doing this bidirectional model with masking. + +1:05:56.373 --> 1:05:59.709 +It's using transformer architecture. + +1:06:00.320 --> 1:06:06.326 +There are two more minor changes. + +1:06:06.326 --> 1:06:16.573 +We'll see that this next word prediction is +another task. + +1:06:16.957 --> 1:06:30.394 +You want to learn more about what language +is to really understand following a story or + +1:06:30.394 --> 1:06:35.127 +their independent tokens into. + +1:06:38.158 --> 1:06:42.723 +The input is using word units as we use it. + +1:06:42.723 --> 1:06:50.193 +It has some special token that is framing +for the next word prediction. + +1:06:50.470 --> 1:07:04.075 +It's more for classification task because +you may be learning a general representation + +1:07:04.075 --> 1:07:07.203 +as a full sentence. + +1:07:07.607 --> 1:07:19.290 +You're doing segment embedding, so you have +an embedding for it. + +1:07:19.290 --> 1:07:24.323 +This is the first sentence. + +1:07:24.684 --> 1:07:29.099 +Now what is more challenging is this masking. + +1:07:29.099 --> 1:07:30.827 +What do you mask? + +1:07:30.827 --> 1:07:35.050 +We already have the crush enough or should. + +1:07:35.275 --> 1:07:42.836 +So there has been afterwards eating some work +like, for example, a bearer. + +1:07:42.836 --> 1:07:52.313 +It's not super sensitive, but if you do it +completely wrong then you're not letting anything. + +1:07:52.572 --> 1:07:54.590 +That's Then Another Question There. + +1:07:56.756 --> 1:08:04.594 +Should I mask all types of should I always +mask the footwork or if I have a subword to + +1:08:04.594 --> 1:08:10.630 +mask only like a subword and predict them based +on the other ones? + +1:08:10.630 --> 1:08:14.504 +Of course, it's a bit of a different task. + +1:08:14.894 --> 1:08:21.210 +If you know three parts of the words, it might +be easier to guess the last because they here + +1:08:21.210 --> 1:08:27.594 +took the easiest selection, so not considering +words anymore at all because you're doing that + +1:08:27.594 --> 1:08:32.280 +in the preprocessing and just taking always +words and like subwords. + +1:08:32.672 --> 1:08:36.089 +Think in group there is done differently. + +1:08:36.089 --> 1:08:40.401 +They mark always the full words, but guess +it's not. + +1:08:41.001 --> 1:08:46.044 +And then what to do with the mask word in +eighty percent of the cases. + +1:08:46.044 --> 1:08:50.803 +If the word is masked, they replace it with +a special token thing. + +1:08:50.803 --> 1:08:57.197 +This is a mask token in ten percent they put +in some random other token in there, and ten + +1:08:57.197 --> 1:08:59.470 +percent they keep it on change. + +1:09:02.202 --> 1:09:10.846 +And then what you can do is also this next +word prediction. + +1:09:10.846 --> 1:09:14.880 +The man went to Mass Store. 
+ +1:09:14.880 --> 1:09:17.761 +He bought a gallon. + +1:09:18.418 --> 1:09:24.088 +So may you see you're joining them, you're +doing both masks and prediction that you're. + +1:09:24.564 --> 1:09:29.449 +Is a penguin mask or flyless birds. + +1:09:29.449 --> 1:09:41.390 +These two sentences have nothing to do with +each other, so you can do also this type of + +1:09:41.390 --> 1:09:43.018 +prediction. + +1:09:47.127 --> 1:09:57.043 +And then the whole bird model, so here you +have the input here to transform the layers, + +1:09:57.043 --> 1:09:58.170 +and then. + +1:09:58.598 --> 1:10:17.731 +And this model was quite successful in general +applications. + +1:10:17.937 --> 1:10:27.644 +However, there is like a huge thing of different +types of models coming from them. + +1:10:27.827 --> 1:10:38.709 +So based on others these supervised molds +like a whole setup came out of there and now + +1:10:38.709 --> 1:10:42.086 +this is getting even more. + +1:10:42.082 --> 1:10:46.640 +With availability of a large language model +than the success. + +1:10:47.007 --> 1:10:48.436 +We have now even larger ones. + +1:10:48.828 --> 1:10:50.961 +Interestingly, it goes a bit. + +1:10:50.910 --> 1:10:57.847 +Change the bit again from like more the spider +action model to uni directional models. + +1:10:57.847 --> 1:11:02.710 +Are at the moment maybe a bit more we're coming +to them now? + +1:11:02.710 --> 1:11:09.168 +Do you see one advantage while what is another +event and we have the efficiency? + +1:11:09.509 --> 1:11:15.901 +Is one other reason why you are sometimes +more interested in uni-direction models than + +1:11:15.901 --> 1:11:17.150 +in bi-direction. + +1:11:22.882 --> 1:11:30.220 +It depends on the pass, but for example for +a language generation pass, the eccard is not + +1:11:30.220 --> 1:11:30.872 +really. + +1:11:32.192 --> 1:11:40.924 +It doesn't work so if you want to do a generation +like the decoder you don't know the future + +1:11:40.924 --> 1:11:42.896 +so you cannot apply. + +1:11:43.223 --> 1:11:53.870 +So this time of model can be used for the +encoder in an encoder model, but it cannot + +1:11:53.870 --> 1:11:57.002 +be used for the decoder. + +1:12:00.000 --> 1:12:05.012 +That's a good view to the next overall cast +of models. + +1:12:05.012 --> 1:12:08.839 +Perhaps if you view it from the sequence. + +1:12:09.009 --> 1:12:12.761 +We have the encoder base model. + +1:12:12.761 --> 1:12:16.161 +That's what we just look at. + +1:12:16.161 --> 1:12:20.617 +They are bidirectional and typically. + +1:12:20.981 --> 1:12:22.347 +That Is the One We Looked At. + +1:12:22.742 --> 1:12:34.634 +At the beginning is the decoder based model, +so see out in regressive models which are unidirective + +1:12:34.634 --> 1:12:42.601 +like an based model, and there we can do the +next word prediction. + +1:12:43.403 --> 1:12:52.439 +And what you can also do first, and there +you can also have a special things called prefix + +1:12:52.439 --> 1:12:53.432 +language. + +1:12:54.354 --> 1:13:05.039 +Because we are saying it might be helpful +that some of your input can also use bi-direction. + +1:13:05.285 --> 1:13:12.240 +And that is somehow doing what it is called +prefix length. + +1:13:12.240 --> 1:13:19.076 +On the first tokens you directly give your +bidirectional. + +1:13:19.219 --> 1:13:28.774 +So you somehow merge that and that mainly +works only in transformer based models because. + +1:13:29.629 --> 1:13:33.039 +There is no different number of parameters +in our end. 
+ +1:13:33.039 --> 1:13:34.836 +We need a back foot our end. + +1:13:34.975 --> 1:13:38.533 +Transformer: The only difference is how you +mask your attention. + +1:13:38.878 --> 1:13:44.918 +We have seen that in the anchoder and decoder +the number of parameters is different because + +1:13:44.918 --> 1:13:50.235 +you do cross attention, but if you do forward +and backward or union directions,. + +1:13:50.650 --> 1:13:58.736 +It's only like you mask your attention to +only look at the bad past or to look into the + +1:13:58.736 --> 1:13:59.471 +future. + +1:14:00.680 --> 1:14:03.326 +And now you can of course also do mixing. + +1:14:03.563 --> 1:14:08.306 +So this is a bi-directional attention matrix +where you can attend to everything. + +1:14:08.588 --> 1:14:23.516 +There is a uni-direction or causal where you +can look at the past and you can do the first + +1:14:23.516 --> 1:14:25.649 +three words. + +1:14:29.149 --> 1:14:42.831 +That somehow clear based on that, then of +course you cannot do the other things. + +1:14:43.163 --> 1:14:50.623 +So the idea is we have our anchor to decoder +architecture. + +1:14:50.623 --> 1:14:57.704 +Can we also train them completely in a side +supervisor? + +1:14:58.238 --> 1:15:09.980 +And in this case we have the same input to +both, so in this case we need to do some type + +1:15:09.980 --> 1:15:12.224 +of masking here. + +1:15:12.912 --> 1:15:17.696 +Here we don't need to do the masking, but +here we need to masking that doesn't know ever + +1:15:17.696 --> 1:15:17.911 +so. + +1:15:20.440 --> 1:15:30.269 +And this type of model got quite successful +also, especially for pre-training machine translation. + +1:15:30.330 --> 1:15:39.059 +The first model doing that is a Bart model, +which exactly does that, and yes, it's one + +1:15:39.059 --> 1:15:42.872 +successful way to pre train your one. + +1:15:42.872 --> 1:15:47.087 +It's pretraining your full encoder model. + +1:15:47.427 --> 1:15:54.365 +Where you put in contrast to machine translation, +where you put in source sentence, we can't + +1:15:54.365 --> 1:15:55.409 +do that here. + +1:15:55.715 --> 1:16:01.382 +But we can just put the second twice in there, +and then it's not a trivial task. + +1:16:01.382 --> 1:16:02.432 +We can change. + +1:16:03.003 --> 1:16:12.777 +And there is like they do different corruption +techniques so you can also do. + +1:16:13.233 --> 1:16:19.692 +That you couldn't do in an agricultural system +because then it wouldn't be there and you cannot + +1:16:19.692 --> 1:16:20.970 +predict somewhere. + +1:16:20.970 --> 1:16:26.353 +So the anchor, the number of input and output +tokens always has to be the same. + +1:16:26.906 --> 1:16:29.818 +You cannot do a prediction for something which +isn't in it. + +1:16:30.110 --> 1:16:38.268 +Here in the decoder side it's unidirection +so we can also delete the top and then try + +1:16:38.268 --> 1:16:40.355 +to generate the full. + +1:16:41.061 --> 1:16:45.250 +We can do sentence permutation. + +1:16:45.250 --> 1:16:54.285 +We can document rotation and text infilling +so there is quite a bit. + +1:16:55.615 --> 1:17:06.568 +So you see there's quite a lot of types of +models that you can use in order to pre-train. + +1:17:07.507 --> 1:17:14.985 +Then, of course, there is again for the language +one. + +1:17:14.985 --> 1:17:21.079 +The other question is how do you integrate? + +1:17:21.761 --> 1:17:26.636 +And there's also, like yeah, quite some different +ways of techniques. 
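A small sketch of the attention masks discussed above; `True` marks positions that may be attended, and the prefix variant corresponds to the prefix language model where the first tokens see each other bidirectionally while the rest remains causal.

```python
import torch

def attention_mask(seq_len, kind="causal", prefix_len=0):
    """Return a (seq_len, seq_len) boolean mask: entry (i, j) is True if
    query position i may attend to key position j."""
    full = torch.ones(seq_len, seq_len).bool()          # bidirectional
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    if kind == "bidirectional":
        return full
    if kind == "causal":
        return causal
    if kind == "prefix":
        mask = causal.clone()
        mask[:, :prefix_len] = True   # everyone may look at the prefix
        return mask
    raise ValueError(kind)
```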
+ +1:17:27.007 --> 1:17:28.684 +It's a Bit Similar to Before. + +1:17:28.928 --> 1:17:39.068 +So the easiest thing is you take your word +embeddings or your free trained model. + +1:17:39.068 --> 1:17:47.971 +You freeze them and stack your decoder layers +and keep these ones free. + +1:17:48.748 --> 1:17:54.495 +Can also be done if you have this type of +bark model. + +1:17:54.495 --> 1:18:03.329 +What you can do is you freeze your word embeddings, +for example some products and. + +1:18:05.865 --> 1:18:17.296 +The other thing is you initialize them so +you initialize your models but you train everything + +1:18:17.296 --> 1:18:19.120 +so you're not. + +1:18:22.562 --> 1:18:29.986 +Then one thing, if you think about Bart, you +want to have the Chinese language, the Italian + +1:18:29.986 --> 1:18:32.165 +language, and the deconer. + +1:18:32.165 --> 1:18:35.716 +However, in Bart we have the same language. + +1:18:36.516 --> 1:18:46.010 +The one you get is from English, so what you +can do there is so you cannot try to do some. + +1:18:46.366 --> 1:18:52.562 +Below the barge, in order to learn some language +specific stuff, or there's a masculine barge, + +1:18:52.562 --> 1:18:58.823 +which is trained on many languages, but it's +trained only on like the Old Coast Modern Language + +1:18:58.823 --> 1:19:03.388 +House, which may be trained in German and English, +but not on German. + +1:19:03.923 --> 1:19:08.779 +So then you would still need to find June +and the model needs to learn how to better + +1:19:08.779 --> 1:19:10.721 +do the attention cross lingually. + +1:19:10.721 --> 1:19:15.748 +It's only on the same language but it mainly +only has to learn this mapping and not all + +1:19:15.748 --> 1:19:18.775 +the rest and that's why it's still quite successful. + +1:19:21.982 --> 1:19:27.492 +Now certain thing which is very commonly used +is what is required to it as adapters. + +1:19:27.607 --> 1:19:29.754 +So for example you take and buy. + +1:19:29.709 --> 1:19:35.218 +And you put some adapters on the inside of +the networks so that it's small new layers + +1:19:35.218 --> 1:19:40.790 +which are in between put in there and then +you only train these adapters or also train + +1:19:40.790 --> 1:19:41.815 +these adapters. + +1:19:41.815 --> 1:19:47.900 +For example, an embryo you could see that +this learns to map the Sears language representation + +1:19:47.900 --> 1:19:50.334 +to the Tiger language representation. + +1:19:50.470 --> 1:19:52.395 +And then you don't have to change that luck. + +1:19:52.792 --> 1:19:59.793 +You give it extra ability to really perform +well on that. + +1:19:59.793 --> 1:20:05.225 +These are quite small and so very efficient. + +1:20:05.905 --> 1:20:12.632 +That is also very commonly used, for example +in modular systems where you have some adaptors + +1:20:12.632 --> 1:20:16.248 +in between here which might be language specific. + +1:20:16.916 --> 1:20:22.247 +So they are trained only for one language. + +1:20:22.247 --> 1:20:33.777 +The model has some or both and once has the +ability to do multilingually to share knowledge. + +1:20:34.914 --> 1:20:39.058 +But there's one chance in general in the multilingual +systems. + +1:20:39.058 --> 1:20:40.439 +It works quite well. + +1:20:40.439 --> 1:20:46.161 +There's one case or one specific use case +for multilingual where this normally doesn't + +1:20:46.161 --> 1:20:47.344 +really work well. + +1:20:47.344 --> 1:20:49.975 +Do you have an idea what that could be? 
+ +1:20:55.996 --> 1:20:57.536 +It's for Zero Shot Cases. + +1:20:57.998 --> 1:21:03.660 +Because having here some situation with this +might be very language specific and zero shot, + +1:21:03.660 --> 1:21:09.015 +the idea is always to learn representations +view which are more language dependent and + +1:21:09.015 --> 1:21:10.184 +with the adaptors. + +1:21:10.184 --> 1:21:15.601 +Of course you get in representations again +which are more language specific and then it + +1:21:15.601 --> 1:21:17.078 +doesn't work that well. + +1:21:20.260 --> 1:21:37.730 +And there is also the idea of doing more knowledge +pistolation. + +1:21:39.179 --> 1:21:42.923 +And now the idea is okay. + +1:21:42.923 --> 1:21:54.157 +We are training it the same, but what we want +to achieve is that the encoder. + +1:21:54.414 --> 1:22:03.095 +So you should learn faster by trying to make +these states as similar as possible. + +1:22:03.095 --> 1:22:11.777 +So you compare the first-hit state of the +pre-trained model and try to make them. + +1:22:12.192 --> 1:22:18.144 +For example, by using the out two norms, so +by just making these two representations the + +1:22:18.144 --> 1:22:26.373 +same: The same vocabulary: Why does it need +the same vocabulary with any idea? + +1:22:34.754 --> 1:22:46.137 +If you have different vocabulary, it's typical +you also have different sequenced lengths here. + +1:22:46.137 --> 1:22:50.690 +The number of sequences is different. + +1:22:51.231 --> 1:22:58.888 +If you now have pipe stains and four states +here, it's no longer straightforward which + +1:22:58.888 --> 1:23:01.089 +states compare to which. + +1:23:02.322 --> 1:23:05.246 +And that's just easier if you have like the +same number. + +1:23:05.246 --> 1:23:08.940 +You can always compare the first to the first +and second to the second. + +1:23:09.709 --> 1:23:16.836 +So therefore at least the very easy way of +knowledge destination only works if you have. + +1:23:17.177 --> 1:23:30.030 +Course: You could do things like yeah, the +average should be the same, but of course there's + +1:23:30.030 --> 1:23:33.071 +a less strong signal. + +1:23:34.314 --> 1:23:42.979 +But the advantage here is that you have a +diameter training signal here on the handquarter + +1:23:42.979 --> 1:23:51.455 +so you can directly make some of the encoder +already giving a good signal while normally + +1:23:51.455 --> 1:23:52.407 +an empty. + +1:23:56.936 --> 1:24:13.197 +Yes, think this is most things for today, +so what you should keep in mind is remind me. + +1:24:13.393 --> 1:24:18.400 +The one is a back translation idea. + +1:24:18.400 --> 1:24:29.561 +If you have monolingual and use that, the +other one is to: And mentally it is often helpful + +1:24:29.561 --> 1:24:33.614 +to combine them so you can even use both of +that. + +1:24:33.853 --> 1:24:38.908 +So you can use pre-trained walls, but then +you can even still do back translation where + +1:24:38.908 --> 1:24:40.057 +it's still helpful. + +1:24:40.160 --> 1:24:45.502 +We have the advantage we are training like +everything working together on the task so + +1:24:45.502 --> 1:24:51.093 +it might be helpful even to backtranslate some +data and then use it in a real translation + +1:24:51.093 --> 1:24:56.683 +setup because in pretraining of course the +beach challenge is always that you're training + +1:24:56.683 --> 1:24:57.739 +it on different. + +1:24:58.058 --> 1:25:03.327 +Different ways of how you integrate this knowledge. 
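Going back to the knowledge-distillation idea above, a minimal sketch of the hidden-state loss; the tensor shapes and the plain L2 formulation are assumptions of this sketch, and it only works when both models use the same vocabulary, so the sequence lengths match.

```python
import torch

def hidden_state_distillation_loss(student_states, teacher_states):
    """Pull the MT encoder states towards the states of a frozen pre-trained
    encoder. Both tensors are assumed to have shape (batch, seq_len, hidden)."""
    return torch.mean((student_states - teacher_states.detach()) ** 2)
```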
+ +1:25:03.327 --> 1:25:08.089 +Even if you just use a full model, so in this. + +1:25:08.748 --> 1:25:11.128 +This is the most similar you can get. + +1:25:11.128 --> 1:25:13.945 +You're doing no changes to the architecture. + +1:25:13.945 --> 1:25:19.643 +You're really taking the model and just fine +tuning them on the new task, but it still has + +1:25:19.643 --> 1:25:24.026 +to completely newly learn how to do the attention +and how to do that. + +1:25:24.464 --> 1:25:29.971 +And that might be, for example, helpful to +have more back-translated data to learn them. + +1:25:32.192 --> 1:25:34.251 +That's for today. + +1:25:34.251 --> 1:25:44.661 +There's one important thing that next Tuesday +there is a conference or a workshop or so in + +1:25:44.661 --> 1:25:45.920 +this room. + +1:25:47.127 --> 1:25:56.769 +You should get an e-mail if you're in Elias +that there's a room change for Tuesdays and + +1:25:56.769 --> 1:25:57.426 +it's. + +1:25:57.637 --> 1:26:03.890 +There are more questions, yeah, have a more +general position, especially: In computer vision + +1:26:03.890 --> 1:26:07.347 +you can enlarge your data center data orientation. + +1:26:07.347 --> 1:26:08.295 +Is there any? + +1:26:08.388 --> 1:26:15.301 +It's similar to a large speech for text for +the data of an edge. + +1:26:15.755 --> 1:26:29.176 +And you can use this back translation and +also masking, but back translation is some + +1:26:29.176 --> 1:26:31.228 +way of data. + +1:26:31.371 --> 1:26:35.629 +So it has also been, for example, even its +used not only for monolingual data. + +1:26:36.216 --> 1:26:54.060 +If you have good MP system, it can also be +used for parallel data. + +1:26:54.834 --> 1:26:59.139 +So would say this is the most similar one. + +1:26:59.139 --> 1:27:03.143 +There's ways you can do power phrasing. + +1:27:05.025 --> 1:27:12.057 +But for example there is very hard to do this +by rules like which words to replace because + +1:27:12.057 --> 1:27:18.936 +there is not a coup like you cannot always +say this word can always be replaced by that. + +1:27:19.139 --> 1:27:27.225 +Mean, although they are many perfect synonyms, +normally they are good in some cases, but not + +1:27:27.225 --> 1:27:29.399 +in all cases, and so on. + +1:27:29.399 --> 1:27:36.963 +And if you don't do a rule based, you have +to train your model and then the freshness. + +1:27:38.058 --> 1:27:57.236 +The same architecture as the pre-trained mount. + +1:27:57.457 --> 1:27:59.810 +Should be of the same dimension, so it's easiest +to have the same dimension. + +1:28:00.000 --> 1:28:01.590 +Architecture. + +1:28:01.590 --> 1:28:05.452 +We later will learn inefficiency. + +1:28:05.452 --> 1:28:12.948 +You can also do knowledge cessulation with, +for example, smaller. + +1:28:12.948 --> 1:28:16.469 +You can learn the same within. + +1:28:17.477 --> 1:28:22.949 +Eight layers for it so that is possible, but +yeah agree it should be of the same. + +1:28:23.623 --> 1:28:32.486 +Yeah yeah you need the question then of course +you can do it like it's an initialization or + +1:28:32.486 --> 1:28:41.157 +you can do it doing training but normally it +most makes sense during the normal training. + +1:28:45.865 --> 1:28:53.963 +Do it, then thanks a lot, and then we'll see +each other again on Tuesday. + +0:00:00.981 --> 0:00:17.559 +What we want today about is how to use some +type of additional resources to improve the + +0:00:17.559 --> 0:00:20.008 +translation. 
+ +0:00:20.300 --> 0:00:31.387 +We have in the first part of the semester +how to build some of your basic machine translation. + +0:00:31.571 --> 0:00:40.743 +You know now the basic components both for +statistical and for neural, with the encoder + +0:00:40.743 --> 0:00:42.306 +decoder model. + +0:00:43.123 --> 0:00:45.950 +Now, of course, that's not where it stops. + +0:00:45.950 --> 0:00:51.340 +It's still what in nearly every machine translation +system is currently in India. + +0:00:51.340 --> 0:00:57.323 +However, there is a lot of challenges which +you need to address in addition and which need + +0:00:57.323 --> 0:00:58.243 +to be solved. + +0:00:58.918 --> 0:01:03.031 +We want to start with these parts. + +0:01:03.031 --> 0:01:07.614 +What else can you do around this part? + +0:01:07.614 --> 0:01:09.847 +You can be honest. + +0:01:10.030 --> 0:01:14.396 +And one important question there is on what +do you train your models? + +0:01:14.394 --> 0:01:27.237 +Because this type of parallel data is easier +in machine translation than many other tasks + +0:01:27.237 --> 0:01:33.516 +where you have a decent amount of training. + +0:01:33.853 --> 0:01:40.789 +And therefore an important question is: Can +we also learn from other sources and improve + +0:01:40.789 --> 0:01:41.178 +our. + +0:01:41.701 --> 0:01:47.840 +Because if you remember from quite the beginning +of the lecture,. + +0:01:51.171 --> 0:01:53.801 +This is how we train all our. + +0:01:54.194 --> 0:02:01.318 +Machine learning models, all the corpus bases +from statistical to neural. + +0:02:01.318 --> 0:02:09.694 +This doesn't have change, so we need this +type of parallel data where we have a source + +0:02:09.694 --> 0:02:13.449 +sentence aligned with the target data. + +0:02:13.493 --> 0:02:19.654 +We have now a strong model here, a very good +model to do that. + +0:02:19.654 --> 0:02:22.099 +However, we always rely. + +0:02:22.522 --> 0:02:27.376 +More languages, higher resource languages, +prayers that say from German to English or + +0:02:27.376 --> 0:02:31.327 +other European languages, there is a decent +amount at least for some. + +0:02:31.471 --> 0:02:46.131 +But even there, if we're going to very specific +domains, it might get difficult and then your + +0:02:46.131 --> 0:02:50.966 +system performance might drop. + +0:02:51.231 --> 0:02:55.261 +Phrases how to use the vocabulary, and so +on, and the style. + +0:02:55.915 --> 0:03:04.104 +And if you're going to other languages, there +is of course a lot bigger challenge. + +0:03:04.104 --> 0:03:05.584 +Why can't you? + +0:03:05.825 --> 0:03:09.647 +So is really this the only resource you can +use. + +0:03:09.889 --> 0:03:20.667 +Or can we adapt our models in order to also +make use of other types of models that might + +0:03:20.667 --> 0:03:27.328 +enable us to build strong systems with other +types of. + +0:03:27.707 --> 0:03:35.283 +And that's what we will look into now in the +next, starting from Tuesday in the next. + +0:03:35.515 --> 0:03:43.437 +So this idea we already have covered on Tuesday, +so one very successful idea for this is to + +0:03:43.437 --> 0:03:45.331 +do more multilingual. + +0:03:45.645 --> 0:03:52.010 +So that we're no longer only doing translation +between two languages, but we can do translation + +0:03:52.010 --> 0:03:55.922 +between many languages and share common knowledge +between. + +0:03:56.296 --> 0:04:06.477 +And you also learned about that you can even +do things like zero shot machine translations. 
+ +0:04:06.786 --> 0:04:09.792 +Which is the case for many many language pairs. + +0:04:10.030 --> 0:04:17.406 +Even with German, you have not translation +parallel data to all languages around the world, + +0:04:17.406 --> 0:04:22.698 +or most of them you have it to the Europeans, +maybe for Japanese. + +0:04:22.698 --> 0:04:26.386 +But even for Japanese, it will get difficult. + +0:04:26.746 --> 0:04:32.862 +There is quite a lot of data, for example +English to Japanese, but German to Vietnamese. + +0:04:32.862 --> 0:04:39.253 +There is some data from Multilingual Corpora +where you can extract the name, but your amount + +0:04:39.253 --> 0:04:41.590 +really is dropping significantly. + +0:04:42.042 --> 0:04:54.907 +So that is a very promising direction if you +want to build translation systems between language + +0:04:54.907 --> 0:05:00.134 +pairs, typically not English, because. + +0:05:01.221 --> 0:05:05.888 +And the other ideas, of course, we don't have +data, just search for data. + +0:05:06.206 --> 0:05:15.755 +There is some work on data crawling so if +don't have a corpus directly or don't have + +0:05:15.755 --> 0:05:23.956 +a high quality corpus from the European Parliament +for TED corpus maybe. + +0:05:24.344 --> 0:05:35.528 +There has been a big effort in Europe to collect +data sets for parallel data. + +0:05:35.528 --> 0:05:40.403 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:52.067 +But how can we explicitly crawl data, which +is somewhat parallel? + +0:05:52.132 --> 0:05:58.538 +So there is in the Internet quite a lot of +data which has been like company websites which + +0:05:58.538 --> 0:06:01.565 +have been translated and things like that. + +0:06:01.565 --> 0:06:05.155 +So how can you extract them and then extract +them? + +0:06:06.566 --> 0:06:13.406 +There is typically more noisy than where you +do more, hence mean if you have your Parliament. + +0:06:13.693 --> 0:06:21.305 +You can do some rules how to extract the parallel +things. + +0:06:21.305 --> 0:06:30.361 +Here there is more to it, so the quality is +later maybe not as good. + +0:06:33.313 --> 0:06:39.927 +The other thing is can we use monolingual +data and monolingual data has a big advantage + +0:06:39.927 --> 0:06:46.766 +that we can have a huge amount of that so that +you can be able to crawl from the internet. + +0:06:46.766 --> 0:06:51.726 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:58.879 +There is just so much more magnitude more +of monolingual data so that it might be very + +0:06:58.879 --> 0:06:59.554 +helpful. + +0:06:59.559 --> 0:07:06.187 +We can do that in statistical machine translation +was quite easy to integrate using language + +0:07:06.187 --> 0:07:06.757 +models. + +0:07:08.508 --> 0:07:14.499 +In neural machine translation we have the +advantage that we have this overall and architecture + +0:07:14.499 --> 0:07:18.850 +that does everything together, but it has also +the disadvantage now. + +0:07:18.850 --> 0:07:22.885 +It's more difficult to put in this type of +information or make. + +0:07:23.283 --> 0:07:26.427 +We'll look to two things. + +0:07:26.427 --> 0:07:37.432 +You can still try to do a bit of language +modeling in there and add an additional language + +0:07:37.432 --> 0:07:38.279 +model. 
+ +0:07:38.178 --> 0:07:43.771 +A way which I think is used in most systems +at the moment is to do synthetic data. + +0:07:43.763 --> 0:07:53.095 +It's a very easy thing, but you can just translate +there and then use it as training data. + +0:07:53.213 --> 0:07:59.192 +And thereby you are able to use like some +type of moonlighting. + +0:08:00.380 --> 0:08:09.521 +Another way to do it is to ensure that some +are in the extreme case. + +0:08:09.521 --> 0:08:14.026 +If you have a scenario that only. + +0:08:14.754 --> 0:08:24.081 +The impressive thing is if you have large +amounts of data and the languages are not too + +0:08:24.081 --> 0:08:31.076 +dissimilar, you can even in this case build +a translation system. + +0:08:32.512 --> 0:08:36.277 +That we will see then next Thursday. + +0:08:37.857 --> 0:08:55.462 +And then there is now a fourth type of restorer +that recently became very successful and now. + +0:08:55.715 --> 0:09:02.409 +So the idea is we are no longer sharing the +real data such as text data, but it can also + +0:09:02.409 --> 0:09:04.139 +help to train a model. + +0:09:04.364 --> 0:09:08.599 +And that is now a big advantage of deep learning +based approaches. + +0:09:08.599 --> 0:09:14.414 +There you have this ability that you can train +a model on some task and then you can modify + +0:09:14.414 --> 0:09:19.913 +it maybe and then apply it to another task +and you can somewhat transfer the knowledge + +0:09:19.913 --> 0:09:22.125 +from the first task to the second. + +0:09:22.722 --> 0:09:31.906 +And then, of course, the question is, can +it have an initial task where it's very easy + +0:09:31.906 --> 0:09:34.439 +to train on the second? + +0:09:34.714 --> 0:09:53.821 +The task that you pre-train on is more similar +to a language. + +0:09:53.753 --> 0:10:06.293 +A bit of a different way of using language +malls in this more transfer learning set. + +0:10:09.029 --> 0:10:18.747 +So first we will start with how can we use +monolingual data to do a machine translation? + +0:10:20.040 --> 0:10:22.542 +The. + +0:10:22.062 --> 0:10:28.924 +This big difference is you should remember +from what I mentioned before is in statistical + +0:10:28.924 --> 0:10:30.525 +machine translation. + +0:10:30.525 --> 0:10:33.118 +We directly have the opportunity. + +0:10:33.118 --> 0:10:39.675 +There's peril data for a translation model +and monolingual data for a language model. + +0:10:39.679 --> 0:10:45.735 +And you combine your translation model and +your language model, and then you can make. + +0:10:46.726 --> 0:10:54.263 +That has big advantages that you can make +use of these large amounts of monolingual data, + +0:10:54.263 --> 0:10:55.519 +but of course. + +0:10:55.495 --> 0:11:02.198 +Because we said the problem is, we are optimizing +both parts independently to each other, and + +0:11:02.198 --> 0:11:09.329 +we say the big advantage of newer machine translation +is we are optimizing the overall architecture + +0:11:09.329 --> 0:11:10.541 +to perform best. + +0:11:10.890 --> 0:11:17.423 +And then, of course, we can't do that, so +here we can only use power there. + +0:11:17.897 --> 0:11:25.567 +So the question is, but if this advantage +is not so important, we can train everything, + +0:11:25.567 --> 0:11:33.499 +but we have large amounts of monolingual data +or small amounts, but they fit perfectly, so + +0:11:33.499 --> 0:11:35.242 +they are very good. 
+ +0:11:35.675 --> 0:11:41.438 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:41.438 --> 0:11:43.599 +similar it is to your test data. + +0:11:43.599 --> 0:11:49.230 +So it can be that this volume is even only +quite small but it's very well fitting and + +0:11:49.230 --> 0:11:51.195 +then it's still very helpful. + +0:11:51.195 --> 0:11:55.320 +So the question is if this is the case how +can we make use of? + +0:11:55.675 --> 0:12:03.171 +And the first year of surprisingness, if we +are here successful with integrating a language + +0:12:03.171 --> 0:12:10.586 +model into a translation system, maybe we can +also integrate some types of language models + +0:12:10.586 --> 0:12:14.415 +into our MT system in order to make it better. + +0:12:16.536 --> 0:12:19.000 +The first thing we can do is okay. + +0:12:19.000 --> 0:12:23.293 +We know there is language models, so let's +try to integrate. + +0:12:23.623 --> 0:12:30.693 +There was mainly used language models because +these works were mainly done before transformer + +0:12:30.693 --> 0:12:31.746 +based models. + +0:12:32.152 --> 0:12:41.567 +And generally, of course, you can do the same +thing with all the Transformers baseballs. + +0:12:41.721 --> 0:12:58.900 +It has mainly been done before people started +using R&S, and they tried to do this more + +0:12:58.900 --> 0:13:01.888 +in cases where. + +0:13:07.087 --> 0:13:17.508 +So what we're having here is some of this +type of idea. + +0:13:17.508 --> 0:13:25.511 +This is a key system here as you remember. + +0:13:25.605 --> 0:13:29.470 +Gets in with your last instinct and calculates +your attention. + +0:13:29.729 --> 0:13:36.614 +We get the context and combine both and then +based on that and then predict the target. + +0:13:37.057 --> 0:13:42.423 +So this is our anti-system, and the question +is, can we somehow integrate the language? + +0:13:42.782 --> 0:13:55.788 +And of course, if someone makes sense to take +out a neural language model because we're anyway + +0:13:55.788 --> 0:14:01.538 +in the neural space, it's not surprising. + +0:14:01.621 --> 0:14:15.522 +And there would be something like on top of +there and you're a language model and you have + +0:14:15.522 --> 0:14:17.049 +a target. + +0:14:17.597 --> 0:14:27.007 +So if we're having this type of language model, +there's two main questions we have to answer. + +0:14:27.007 --> 0:14:28.108 +How do we? + +0:14:28.208 --> 0:14:37.935 +So how do we combine now on the one hand our +NMT system and on the other hand our RNA you + +0:14:37.935 --> 0:14:45.393 +see that was mentioned before when we started +talking about encoder. + +0:14:45.805 --> 0:14:49.523 +The wild is like unconditioned, it's just +modeling the targets side. + +0:14:49.970 --> 0:14:57.183 +And the other one is a conditional language, +which is a language condition on the sewer + +0:14:57.183 --> 0:14:57.839 +center. + +0:14:58.238 --> 0:15:03.144 +So the question is how can you not combine +two language models? + +0:15:03.144 --> 0:15:09.813 +Of course, it's like the translation model +will some will be more important because it + +0:15:09.813 --> 0:15:11.806 +has access to the source. + +0:15:11.806 --> 0:15:16.713 +We want to generate something which corresponds +to your source. + +0:15:18.778 --> 0:15:20.918 +If we had that, the other question is OK. + +0:15:20.918 --> 0:15:22.141 +Now we have two models. + +0:15:22.141 --> 0:15:25.656 +If we even have integrated them, the answer +is how do we train them? 
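[Editor's illustration] The simplest answer to the combination question just raised, and the parallel combination the lecture turns to next, is to interpolate the output distribution of the conditional translation model with that of the unconditional target-side language model at each decoding step (shallow-fusion style). All numbers below are made up; a weight of 0.5 recovers the plain average mentioned in the lecture.

```python
import torch

def combined_next_word_distribution(p_nmt, p_lm, weight=0.5):
    """Parallel combination at one decoding step: interpolate the conditional
    translation probability with the unconditional language-model probability.
    `weight` controls how much the translation model is trusted."""
    return weight * p_nmt + (1.0 - weight) * p_lm

# toy example with a 5-word target vocabulary (made-up probabilities)
p_nmt = torch.tensor([0.10, 0.60, 0.10, 0.10, 0.10])  # P(y_t | y_<t, source)
p_lm  = torch.tensor([0.05, 0.30, 0.50, 0.10, 0.05])  # P(y_t | y_<t)

p = combined_next_word_distribution(p_nmt, p_lm, weight=0.7)
print(p, p.sum())       # still a valid distribution (sums to 1)
print(int(p.argmax()))  # word chosen at this decoding step
```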
+ +0:15:26.026 --> 0:15:39.212 +Because we have integrated them, we have no +two sets of data with parallel data where you + +0:15:39.212 --> 0:15:42.729 +can do the lower thing. + +0:15:44.644 --> 0:15:47.575 +So the first idea is okay. + +0:15:47.575 --> 0:15:53.436 +We can do something more like a parallel combination. + +0:15:53.436 --> 0:15:55.824 +We just keep running. + +0:15:56.036 --> 0:15:59.854 +So a year you see your NMT system that is +running. + +0:16:00.200 --> 0:16:08.182 +First of all, it's normally completely independent +of your language model, which is up there. + +0:16:08.182 --> 0:16:13.278 +So down here we have just our NMT system, +which is running. + +0:16:13.313 --> 0:16:26.439 +The only thing which is used is we have the +words inputted, and of course they are put + +0:16:26.439 --> 0:16:28.099 +into both. + +0:16:28.099 --> 0:16:41.334 +We also put: So we use them in parallel, and +then we are doing our decision just by merging + +0:16:41.334 --> 0:16:42.905 +these two. + +0:16:43.343 --> 0:16:52.288 +So there can be, for example, we are doing +a probability distribution here, we are doing + +0:16:52.288 --> 0:17:01.032 +a purability distribution here, and then we +are taking the average of both per ability + +0:17:01.032 --> 0:17:03.343 +to do our predictions. + +0:17:11.871 --> 0:17:18.929 +You could also take the output which seems +to be more short about the answer. + +0:17:20.000 --> 0:17:23.272 +Yes, you could also do that. + +0:17:23.272 --> 0:17:27.222 +It's more like a gating mechanism. + +0:17:27.222 --> 0:17:32.865 +You're not doing everything, but you're focusing. + +0:17:32.993 --> 0:17:38.927 +Another one would be you could also just concatenate +the hidden states and then you have another + +0:17:38.927 --> 0:17:41.802 +layer on top which based on the concatenation. + +0:17:43.303 --> 0:17:58.634 +If you think about it, you do the coordination +instead of taking the instead and then merging + +0:17:58.634 --> 0:18:01.244 +the perability. + +0:18:03.143 --> 0:18:15.027 +Yes, in the end you introduce many new parameters +and these parameters have somehow something + +0:18:15.027 --> 0:18:17.303 +special compared. + +0:18:23.603 --> 0:18:33.657 +So before all the other parameters can be +trained independently of each other, the language + +0:18:33.657 --> 0:18:42.071 +one can be trained independent and an antisystem +can be trained independent. + +0:18:43.043 --> 0:18:51.198 +If you have a joint layer of course you need +to train them because you have inputs so you + +0:18:51.198 --> 0:19:01.560 +need: Not surprisingly, if you have a parallel +combination or whether you could, the other + +0:19:01.560 --> 0:19:04.664 +way is to do more serial combinations. + +0:19:04.924 --> 0:19:10.382 +How can you do a similar combination? + +0:19:10.382 --> 0:19:18.281 +Your final decision makes sense to do it based +on the. + +0:19:18.438 --> 0:19:20.997 +So you have on top of your normal an system. + +0:19:21.121 --> 0:19:30.826 +The only thing is now your inputting into +your NIT system. + +0:19:30.826 --> 0:19:38.723 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:47.819 +You're training the lower layers here which +are trained more on the purely language model + +0:19:47.819 --> 0:19:55.434 +and on top you're putting into the NMT system +where it now has the language. + +0:19:55.815 --> 0:19:59.003 +So here you can also view it here. 
+ +0:19:59.003 --> 0:20:06.836 +You have more contextual embeddings which +no longer depend on the word, but they also + +0:20:06.836 --> 0:20:10.661 +depend on the context of the target site. + +0:20:11.051 --> 0:20:21.797 +More understanding of the source word. + +0:20:21.881 --> 0:20:34.761 +So if it's like the word can, for example, +will be put in here always the same, independent + +0:20:34.761 --> 0:20:41.060 +of its use of can of beans, or if can do it. + +0:20:41.701 --> 0:20:43.165 +Empties. + +0:20:44.364 --> 0:20:54.959 +So another view, if you're remembering more +the transformer based approach, is you have + +0:20:54.959 --> 0:21:01.581 +some layers, and the lower layers are purely +language. + +0:21:02.202 --> 0:21:08.052 +This is purely language model and then at +some point you're starting to attend to the + +0:21:08.052 --> 0:21:08.596 +source. + +0:21:13.493 --> 0:21:20.774 +Yes, so these are two ways of how you combine +it, so run them in peril, or first do the language. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.123 +And this case with a sequence. + +0:21:38.278 --> 0:21:50.721 +Is the input and bedding, the target word +embedding, or the actual word, and then we + +0:21:50.721 --> 0:21:54.821 +transfer it to a numerical. + +0:21:56.176 --> 0:22:08.824 +That depends on if you view the word embedding +as part of the language model, so of course + +0:22:08.824 --> 0:22:10.909 +you first put. + +0:22:11.691 --> 0:22:13.938 +And then the word embedding there is the r&n. + +0:22:14.314 --> 0:22:20.296 +So of course you can view this together as +your language model when you first do the word + +0:22:20.296 --> 0:22:21.027 +embedding. + +0:22:21.401 --> 0:22:28.098 +All you can say are the RNAs and this is like +before. + +0:22:28.098 --> 0:22:36.160 +It's more a definition, but you're right, +so what are the steps? + +0:22:36.516 --> 0:22:46.655 +One of these parts, you know, called a language +model is definitionally not that important, + +0:22:46.655 --> 0:22:47.978 +but that's. + +0:22:53.933 --> 0:23:02.812 +So the question is how can you then train +them and make make this this one work? + +0:23:03.363 --> 0:23:15.492 +So in the case where you combine the language +of our abilities you can train them independently + +0:23:15.492 --> 0:23:18.524 +and then just put them. + +0:23:18.918 --> 0:23:29.623 +It might not be the best because we have no +longer this ability before that. + +0:23:29.623 --> 0:23:33.932 +They optimal perform together. + +0:23:34.514 --> 0:23:41.050 +At least you need to summarize how much do +you trust the one model and how much do you + +0:23:41.050 --> 0:23:41.576 +trust. + +0:23:43.323 --> 0:23:48.529 +But still in some cases usually it might be +helpful if you have only data and so on. + +0:23:48.928 --> 0:24:06.397 +However, we have one specific situation that +leads to the pearl leader is always mono legal + +0:24:06.397 --> 0:24:07.537 +data. + +0:24:08.588 --> 0:24:17.693 +So what we can also do is more the pre-training +approach. + +0:24:17.693 --> 0:24:24.601 +We first train the language model and then. + +0:24:24.704 --> 0:24:33.468 +So the pre-training approach you first train +on the monolingual data and then you join the. + +0:24:33.933 --> 0:24:45.077 +Of course, the model size is this way, but +the data size is of course too big. + +0:24:45.077 --> 0:24:52.413 +You often have more monolingual data than +parallel. 
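[Editor's illustration] A minimal sketch of the pre-train-then-fine-tune recipe just described for the target side: the same decoder parameters are first trained as a plain language model on monolingual text, and training then simply continues on the (smaller) parallel data. The toy GRU decoder and all sizes are assumptions; in a real system the second stage would of course train jointly with the encoder and attention.

```python
import torch
import torch.nn as nn

vocab, d_model = 1000, 128
embed = nn.Embedding(vocab, d_model)
rnn = nn.GRU(d_model, d_model, batch_first=True)
out = nn.Linear(d_model, vocab)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def lm_step(target_ids):
    """Next-word prediction on target-side text; the same objective is used for
    monolingual pre-training and for the target side of the parallel data."""
    h, _ = rnn(embed(target_ids[:, :-1]))     # predict token t from tokens < t
    logits = out(h)
    loss = loss_fn(logits.reshape(-1, vocab), target_ids[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

mono = torch.randint(0, vocab, (32, 20))          # large monolingual target data
parallel_tgt = torch.randint(0, vocab, (8, 20))   # small parallel target side

for _ in range(3):                 # stage 1: pre-train as a language model
    lm_step(mono)
for _ in range(3):                 # stage 2: continue training on the parallel data
    lm_step(parallel_tgt)          # (jointly with the encoder in a real system)
```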
+ +0:24:56.536 --> 0:24:57.901 +Any ideas. + +0:25:04.064 --> 0:25:10.108 +Had one example where this might also be helpful +if you want to adapt to a domain so let's say + +0:25:10.108 --> 0:25:16.281 +you do medical sentences and if you want to +translate medical sentences and you have monolingual + +0:25:16.281 --> 0:25:22.007 +data on the target side for medical sentences +but you only have parallel data for general + +0:25:22.007 --> 0:25:22.325 +use. + +0:25:23.083 --> 0:25:30.601 +In this case it could be, or it's the most +probable happen if you're learning out there + +0:25:30.601 --> 0:25:38.804 +what medical means, but then in your fine tuning +step the model is forgetting everything about. + +0:25:39.099 --> 0:25:42.340 +So this type of priest training step is good. + +0:25:42.340 --> 0:25:47.978 +If your pretraining data is more general, +very large, and then you're adapting. + +0:25:48.428 --> 0:25:55.545 +But in the task we have monolingual data, +which should be used to adapt the system to + +0:25:55.545 --> 0:25:57.780 +some genre of topic style. + +0:25:57.817 --> 0:26:08.572 +Then, of course, this is not a good strategy +because you might forget about everything up + +0:26:08.572 --> 0:26:09.408 +there. + +0:26:09.649 --> 0:26:17.494 +So then you have to check what you can do +for them to see. + +0:26:17.494 --> 0:26:25.738 +You can freeze this part and you can do a +direct combination. + +0:26:25.945 --> 0:26:33.796 +Where you train both of them, and then you +train the language more and parallel on their + +0:26:33.796 --> 0:26:34.942 +one so that. + +0:26:35.395 --> 0:26:37.687 +Eh What You Learn in the Length. + +0:26:37.937 --> 0:26:48.116 +So the bit depends on what you want to combine +is that you use a language model because it's. + +0:26:48.548 --> 0:26:56.380 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:56.380 --> 0:26:58.083 +something specific. + +0:27:01.001 --> 0:27:06.662 +Then there is so this is a way of how we can +make use of monolingual data. + +0:27:07.968 --> 0:27:11.787 +It seems to be the easiest one somehow. + +0:27:11.787 --> 0:27:19.140 +It's more similar to what we are doing with +statistical machine translation. + +0:27:19.140 --> 0:27:20.095 +However,. + +0:27:21.181 --> 0:27:27.211 +Normally always beats this type of model, +which in some view can be from the conceptual + +0:27:27.211 --> 0:27:27.691 +thing. + +0:27:27.691 --> 0:27:31.460 +At least it's even easier from the computational +side. + +0:27:31.460 --> 0:27:36.805 +Sometimes it has a disadvantage that it's +more problematic or more difficult. + +0:27:40.560 --> 0:27:42.576 +And the idea is okay. + +0:27:42.576 --> 0:27:45.141 +We have a monolingual data. + +0:27:45.141 --> 0:27:50.822 +We just translate it and then generate some +type of parallel. + +0:27:51.111 --> 0:28:00.465 +So if you want to build a German to English +system, your first trained German to English + +0:28:00.465 --> 0:28:02.147 +system on your. + +0:28:02.402 --> 0:28:05.217 +Then you have more pearl data. + +0:28:05.217 --> 0:28:13.482 +The interesting thing is if you then train +on the joint thing, on the original pearl data, + +0:28:13.482 --> 0:28:18.749 +and on that one is artificial, it even normally +improves. + +0:28:18.918 --> 0:28:26.490 +You can because you're not doing the same +error all the time and you have some knowledge. 
+ +0:28:28.028 --> 0:28:40.080 +With this first approach, however, there's +one issue: why it might not work the best, + +0:28:40.080 --> 0:28:43.163 +so could you imagine? + +0:28:49.409 --> 0:28:51.186 +Ready a bit shown in image two. + +0:28:53.113 --> 0:29:00.637 +Have a few trains on bad quality data. + +0:29:00.637 --> 0:29:08.741 +The system will learn also in the states. + +0:29:08.828 --> 0:29:12.210 +And as you're saying, it's a system always +mistranslates. + +0:29:13.493 --> 0:29:14.497 +Something. + +0:29:14.497 --> 0:29:23.623 +Then you will learn that this is correct because +now it's training data and you will even encourage + +0:29:23.623 --> 0:29:25.996 +it to make it more often. + +0:29:25.996 --> 0:29:29.921 +So the problem on training on your own is. + +0:29:30.150 --> 0:29:34.222 +But however, as you systematically do, you +even enforce more and will even do more. + +0:29:34.654 --> 0:29:37.401 +So that might not be the best solution. + +0:29:37.401 --> 0:29:40.148 +Do any idea how you could do it better? + +0:29:44.404 --> 0:29:57.653 +If you had something else to prevent some +systematic problems, yes, that is one way. + +0:30:04.624 --> 0:30:10.809 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.809 --> 0:30:11.990 +something wrong. + +0:30:11.990 --> 0:30:17.967 +Normally it's less bad if your inputs are +somewhat bad, but your outputs are perfect. + +0:30:18.538 --> 0:30:26.670 +So if your inputs are wrong you maybe learn +that if you're doing this wrong input you're + +0:30:26.670 --> 0:30:30.782 +generating something correct but you're not. + +0:30:31.511 --> 0:30:40.911 +So often the case is that it's more important +that your target is correct. + +0:30:40.911 --> 0:30:47.052 +If on the source there is something crazy, +then. + +0:30:47.347 --> 0:30:52.184 +But you can assume in your application scenario +you hope that you mainly get correct input. + +0:30:52.572 --> 0:31:02.126 +So that is not harming you as much, and in +machine translation we have some of these symmetries, + +0:31:02.126 --> 0:31:02.520 +so. + +0:31:02.762 --> 0:31:04.578 +And also the other way around. + +0:31:04.578 --> 0:31:09.792 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:09.792 --> 0:31:13.892 +task to translate from English to German is +very similar and helpful. + +0:31:14.094 --> 0:31:19.313 +So what we can do is, we can just switch it +initially and generate the data the other way + +0:31:19.313 --> 0:31:19.777 +around. + +0:31:20.120 --> 0:31:25.699 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.699 --> 0:31:32.126 +Then we are translating the English data into +German, where the German is maybe not really + +0:31:32.126 --> 0:31:32.903 +very nice. + +0:31:33.293 --> 0:31:46.045 +And then we're training on our original data +and on the back translated data where only + +0:31:46.045 --> 0:31:51.696 +the input is good and it's like human. + +0:31:52.632 --> 0:32:01.622 +So here we have now the advantage that always +our target site is of human quality and the + +0:32:01.622 --> 0:32:02.322 +input. + +0:32:03.583 --> 0:32:08.998 +And then this helps us to get really good +form. + +0:32:08.998 --> 0:32:15.428 +There's one important difference if you think +about the. + +0:32:21.341 --> 0:32:31.604 +It's too obvious here we need a target side +monolingual layer and the first. 
+ +0:32:31.931 --> 0:32:47.143 +So back translation is normally working if +you have target size parallel and not search + +0:32:47.143 --> 0:32:48.180 +side. + +0:32:48.448 --> 0:32:55.493 +Might be also a bit if you think about it +understandable that it's more important to + +0:32:55.493 --> 0:32:56.819 +be like better. + +0:32:57.117 --> 0:33:04.472 +On the suicide you have to understand the +content, on the target side you have to generate + +0:33:04.472 --> 0:33:12.232 +really sentences and somehow it's more difficult +to generate something than to only understand. + +0:33:17.617 --> 0:33:29.916 +One other thing, so typically it's shown here +differently, but typically it's like this works + +0:33:29.916 --> 0:33:30.701 +well. + +0:33:31.051 --> 0:33:32.978 +Because normally there's like a lot more. + +0:33:33.253 --> 0:33:36.683 +So the question is, should really take all +of my data? + +0:33:36.683 --> 0:33:38.554 +There's two problems with it. + +0:33:38.554 --> 0:33:42.981 +Of course, it's expensive because you have +to translate all this data. + +0:33:42.981 --> 0:33:48.407 +And secondly, if you had, although now your +packet site is wrong, it might be that you + +0:33:48.407 --> 0:33:51.213 +still have your wrong correlations in there. + +0:33:51.651 --> 0:34:01.061 +So if you don't know the normally good starting +point is to take equal amount of data as many + +0:34:01.061 --> 0:34:02.662 +backtranslated. + +0:34:02.963 --> 0:34:05.366 +Of course, it depends on the use case. + +0:34:05.366 --> 0:34:07.215 +There are very few data here. + +0:34:07.215 --> 0:34:08.510 +It makes more sense. + +0:34:08.688 --> 0:34:14.273 +It depends on how good your quality is here, +so the better the model is observable, the + +0:34:14.273 --> 0:34:17.510 +more data you might use because quality is +better. + +0:34:17.510 --> 0:34:23.158 +So it depends on a lot of things, but yeah, +a rule of sample like good general way often + +0:34:23.158 --> 0:34:24.808 +is to have equal amounts. + +0:34:26.646 --> 0:34:31.233 +And you can of course do that now iteratively. + +0:34:31.233 --> 0:34:39.039 +It said already that the quality at the end, +of course, depends on this system. + +0:34:39.039 --> 0:34:46.163 +Also, because the better this system is, the +better your synthetic data. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.911 +So you're playing a model on English to German +and you translate the data. + +0:34:56.957 --> 0:35:03.397 +Then you train a model on German to English +with the additional data. + +0:35:03.397 --> 0:35:11.954 +Then you translate German when you translate +German data and then you train again your first + +0:35:11.954 --> 0:35:12.414 +one. + +0:35:12.414 --> 0:35:14.346 +So you iterate that. + +0:35:14.334 --> 0:35:19.653 +Because now your system is better because +it's not only trained on the small data but + +0:35:19.653 --> 0:35:22.003 +additionally on back translated data. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:31.739 +However, typically you can stop quite early, +so maybe one iteration is good, but then you + +0:35:31.739 --> 0:35:35.072 +have diminishing gains after two or three. + +0:35:35.935 --> 0:35:44.094 +There's very slight difference and then yeah +because you need of course quite big difference + +0:35:44.094 --> 0:35:45.937 +in the quality here. + +0:35:45.937 --> 0:35:46.814 +In order. 
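[Editor's illustration] A small sketch of the back-translation recipe discussed above, including the equal-amounts rule of thumb: a reverse (target-to-source) system turns target-side monolingual text into synthetic source sentences, while the human-written target side stays clean. Here `reverse_translate` stands in for any trained target-to-source model, and the lambda used in the example is only a placeholder.

```python
import random

def back_translate(target_monolingual, reverse_translate, real_parallel):
    """Build training data as (source, target) pairs: synthetic source from the
    reverse system, human-quality target, mixed with the real parallel data."""
    synthetic = [(reverse_translate(t), t) for t in target_monolingual]
    # rule of thumb from the lecture: start with roughly as much synthetic
    # data as real parallel data
    synthetic = random.sample(synthetic, min(len(synthetic), len(real_parallel)))
    return real_parallel + synthetic

# toy usage with a placeholder "model" that just tags the sentence
real = [("ein haus", "a house"), ("ein auto", "a car")]
mono_target = ["a tree", "a river", "a mountain"]
mixed = back_translate(mono_target, lambda s: f"<synthetic German for: {s}>", real)
for src, tgt in mixed:
    print(src, "->", tgt)
```

Iterated back-translation, as described above, would repeat this with the roles of the two directions swapped, retraining each system on the other one's synthetic data.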
+ +0:35:47.207 --> 0:35:59.810 +Which is not too good because it means you +can already have to train it with relatively + +0:35:59.810 --> 0:36:02.245 +bad performance. + +0:36:03.723 --> 0:36:10.323 +And they don't yeah, a design decision would +advise so guess because it's easy to get it. + +0:36:10.550 --> 0:36:16.617 +Better to replace that because you have a +higher quality, but you of course keep your + +0:36:16.617 --> 0:36:18.310 +high quality real data. + +0:36:18.310 --> 0:36:21.626 +Then I think normally it's okay to replace +it. + +0:36:21.626 --> 0:36:24.518 +Of course you can also try to append it. + +0:36:24.518 --> 0:36:28.398 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:40.567 +That's about like using monolingual data before +we go into the pre-train models. + +0:36:40.567 --> 0:36:42.998 +Do you have any more questions? + +0:36:49.029 --> 0:36:57.521 +Yes, so the other thing we can do and which +is recently more and more successful and even + +0:36:57.521 --> 0:37:05.731 +more successful since we have these really +large language models where you can even do + +0:37:05.731 --> 0:37:08.562 +a translation task with this. + +0:37:08.688 --> 0:37:16.132 +So here the idea is you learn a representation +of one task and then you use this representation. + +0:37:16.576 --> 0:37:27.276 +It was made maybe like one of the first where +it's really used largely is doing something + +0:37:27.276 --> 0:37:35.954 +like a bird which you pre-train on purely text +editor and then you take. + +0:37:36.496 --> 0:37:42.952 +And the one big advantage, of course, is that +people can only share data but also pre-train. + +0:37:43.423 --> 0:37:53.247 +So if you think of the recent models and the +large language models which are available, + +0:37:53.247 --> 0:37:59.611 +it is not possible for universities often to +train them. + +0:37:59.919 --> 0:38:09.413 +Think it costs several millions to train the +model just if you rent the GPS from some cloud + +0:38:09.413 --> 0:38:15.398 +company and train that the cost of training +these models. + +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.630 +So another idea is what you can do is okay. + +0:38:24.630 --> 0:38:27.331 +Maybe if these months are once available. + +0:38:27.467 --> 0:38:34.723 +You can take them and use them as a resource +similar to pure text, and you can now build + +0:38:34.723 --> 0:38:41.734 +models which some will learn not only from +from data but also from other models which + +0:38:41.734 --> 0:38:44.506 +are maybe trained on other tasks. + +0:38:44.844 --> 0:38:48.647 +So it's a quite new way of thinking of how +to train. + +0:38:48.647 --> 0:38:53.885 +So we are not only learning from examples, +but we might also learn from. + +0:38:54.534 --> 0:39:03.937 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:03.937 --> 0:39:07.071 +by learning from other tasks. + +0:39:07.427 --> 0:39:15.581 +So the main idea to start with is to have +a personal initial task, and typically this + +0:39:15.581 --> 0:39:24.425 +initial task is for: And if you're working +with, that means you're training pure taxator + +0:39:24.425 --> 0:39:30.547 +because you have the largest amount of data +from the Internet. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your quick training. 
+ +0:39:36.176 --> 0:39:42.056 +And: There's a typical task you can train +on. + +0:39:42.056 --> 0:39:52.709 +That is like the language modeling text, so +to predict the next word, all we have related. + +0:39:52.932 --> 0:40:04.654 +But to predict something which you have not +in the input is a task which is easy to generate. + +0:40:04.654 --> 0:40:06.150 +That's why. + +0:40:06.366 --> 0:40:14.005 +By yourself, on the other hand, you need a +lot of knowledge, and that is the other thing + +0:40:14.005 --> 0:40:15.120 +you need to. + +0:40:15.735 --> 0:40:23.690 +Because there is this idea that the meaning +of the word heavily depends on the context + +0:40:23.690 --> 0:40:24.695 +it's used. + +0:40:25.145 --> 0:40:36.087 +So can give you a sentence with some gibberish +word and there's some name, and although you've + +0:40:36.087 --> 0:40:41.616 +never read the name, you will just assume that. + +0:40:42.062 --> 0:40:48.290 +Exactly the same thing, the models can also +learn something about the words in there by + +0:40:48.290 --> 0:40:49.139 +just using. + +0:40:49.649 --> 0:40:53.246 +So that is typically the new. + +0:40:53.246 --> 0:40:59.839 +Then we can use this model, use our data to +train the. + +0:41:00.800 --> 0:41:04.703 +Of course, it might need to adapt the system. + +0:41:04.703 --> 0:41:07.672 +To do that we might use only some. + +0:41:07.627 --> 0:41:16.326 +Part of the pre-train model in there is that +we have seen that a bit already in the RNA + +0:41:16.326 --> 0:41:17.215 +case is. + +0:41:17.437 --> 0:41:22.670 +So you can view the RN as one of these approaches. + +0:41:22.670 --> 0:41:28.518 +You train the RN language while on large pre-train +data. + +0:41:28.518 --> 0:41:32.314 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. + +0:41:37.877 --> 0:41:49.027 +So that you can build a system which uses +knowledge, which is just trained on large amounts + +0:41:49.027 --> 0:41:52.299 +of data and extracting it. + +0:41:52.299 --> 0:41:53.874 +So it knows. + +0:41:56.376 --> 0:42:01.561 +So the question is that yeah, what type of +information so what type of models can you? + +0:42:01.821 --> 0:42:05.278 +And we want to today look at briefly at three. + +0:42:05.725 --> 0:42:08.474 +Was initially done. + +0:42:08.474 --> 0:42:21.118 +It wasn't as famous as in machine translation +as in other things, but it's also used there. + +0:42:21.221 --> 0:42:28.974 +So where you have this mapping from the one +hot to a small continuous word representation? + +0:42:29.229 --> 0:42:37.891 +Using this one in your anthrax you can, for +example, replace the embedding layer by the + +0:42:37.891 --> 0:42:38.776 +trained. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.520 +You're always in this pre training phase and +have the thing the advantage is. + +0:42:48.468 --> 0:42:55.515 +More data, that's the trade off so you can +get better. + +0:42:55.515 --> 0:43:00.128 +Disadvantage is, does anybody have? + +0:43:04.624 --> 0:43:12.173 +Was one of the mentioned today, even like +big advantages of the system compared to previous. + +0:43:20.660 --> 0:43:26.781 +Where one advantage was the end to end training +so that all parameters and all components are + +0:43:26.781 --> 0:43:27.952 +optimal together. + +0:43:28.208 --> 0:43:33.386 +If you know pre-train something on one pass, +it's maybe no longer optimal fitting to everything. 
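[Editor's illustration] In code, re-using pre-trained word embeddings really is just filling, and optionally freezing, the embedding matrix, which is exactly the "replace the embedding layer" option mentioned above. All sizes are hypothetical; the random matrix stands in for vectors trained on large monolingual data.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50_000, 256
pretrained_vectors = torch.randn(vocab_size, emb_dim)   # stand-in for pre-trained vectors

# option 1: initialise with the pre-trained vectors and keep fine-tuning them
embedding = nn.Embedding(vocab_size, emb_dim)
embedding.weight.data.copy_(pretrained_vectors)

# option 2: freeze them and train only the rest of the model
frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

tokens = torch.randint(0, vocab_size, (4, 10))
print(embedding(tokens).shape, frozen(tokens).shape)     # (4, 10, 256) each

# the embedding layer is where most parameters sit:
print("embedding parameters:", vocab_size * emb_dim)     # 12,800,000
print("one 256x256 layer:   ", emb_dim * emb_dim)        # 65,536
```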
+ +0:43:33.893 --> 0:43:40.338 +So that is similar to what should do pretaining +or not. + +0:43:40.338 --> 0:43:48.163 +It depends on how important everything is +optimal together and how. + +0:43:48.388 --> 0:44:00.552 +If the state is a high quality of large amount, +the pre trained one is just so much better. + +0:44:00.600 --> 0:44:11.215 +Standing everything optimal together, we would +use random actions for amazing vices. + +0:44:11.691 --> 0:44:18.791 +Mean, we assume some structures that are trained +basically. + +0:44:18.791 --> 0:44:26.364 +Yes, if you're fine tuning everything, it +might be the problem. + +0:44:26.766 --> 0:44:31.139 +But often yeah, in some way right, so often +it's not about. + +0:44:31.139 --> 0:44:37.624 +You're really worse with some pre-trained +molecules because you're going already in some + +0:44:37.624 --> 0:44:43.236 +direction, and if this is not really optimal +for you, it might be difficult. + +0:44:43.603 --> 0:44:51.774 +But the bigger is, if you're not getting better +because you have a decent amount of data, it's + +0:44:51.774 --> 0:44:52.978 +so different. + +0:44:53.153 --> 0:45:04.884 +But mean initially it wasn't a machine translation +done so much because there was more data in + +0:45:04.884 --> 0:45:09.452 +the task, but now it's really large. + +0:45:12.632 --> 0:45:14.188 +The other one is then OK. + +0:45:14.188 --> 0:45:18.258 +Now it's always like how much of the model +do your pre-track a bit? + +0:45:18.658 --> 0:45:25.057 +The other one you can do is tack contextual +words and then something like bird or a robota + +0:45:25.057 --> 0:45:31.667 +where you train more already as sequence models +and the embeddings you're using are no longer + +0:45:31.667 --> 0:45:35.605 +specific for words but they're also taking +the context. + +0:45:35.875 --> 0:45:54.425 +Embedding you're using is no longer only depending +on the word itself but on the whole sentence. + +0:45:55.415 --> 0:46:03.714 +And of course you can use similar things also +in the decoder just by having layers which + +0:46:03.714 --> 0:46:09.122 +don't have access to the source but there it's +still not. + +0:46:11.451 --> 0:46:19.044 +And finally, and then we'll look at the end, +you can also have models which are already. + +0:46:19.419 --> 0:46:28.605 +So you may be training a sequence model, but +not a monolingual data. + +0:46:28.605 --> 0:46:35.128 +Of course you have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you're fine tuning. + +0:46:47.227 --> 0:46:59.487 +But let's first do a bit of step back and +look into what are the differences. + +0:46:59.487 --> 0:47:02.159 +The first thing. + +0:47:02.382 --> 0:47:06.870 +The word embeddings are just this first layer. + +0:47:06.870 --> 0:47:12.027 +You can train them with feed-forward neural +networks. + +0:47:12.212 --> 0:47:25.683 +But you can also train them in language model, +and by now you hopefully have also seen that + +0:47:25.683 --> 0:47:27.733 +you can also. + +0:47:30.130 --> 0:47:41.558 +So this is how you can train them, and you +are training them to predict the next word, + +0:47:41.558 --> 0:47:45.236 +the typical language model. + +0:47:45.525 --> 0:47:52.494 +And that is what is now referred to as a South +Supervised Learning, and for example all the + +0:47:52.494 --> 0:47:56.357 +big large language models like Chat, gp and +so on. 
+ +0:47:56.357 --> 0:48:03.098 +They are trained at an end or feet, but exactly +with this objective to predict the next. + +0:48:03.823 --> 0:48:12.847 +So that is where you can hopefully learn what +a word is used because you always try to predict + +0:48:12.847 --> 0:48:17.692 +the next word and then you have a ready intuition. + +0:48:19.619 --> 0:48:25.374 +In the word embedding, why do people first +look at the word embeddings and the use of + +0:48:25.374 --> 0:48:27.582 +word embeddings for other tasks? + +0:48:27.582 --> 0:48:32.600 +The main advantage is it might be only the +first layer you would think of. + +0:48:32.600 --> 0:48:34.474 +What does it really matter? + +0:48:34.474 --> 0:48:39.426 +However, it is the layer where you typically +have most of the parameters. + +0:48:39.879 --> 0:48:52.201 +Of course, if you have trained on most of +your parameters already on the large data, + +0:48:52.201 --> 0:48:59.304 +then on your target data you have to train +less. + +0:48:59.259 --> 0:49:05.841 +This big difference that your input size is +so much bigger than the size of the normal + +0:49:05.841 --> 0:49:06.522 +in size. + +0:49:06.626 --> 0:49:16.551 +So it's a normal size, maybe two hundred and +fifty, but your input embedding besides vocabulary + +0:49:16.551 --> 0:49:20.583 +size is something like fifty thousand. + +0:49:23.123 --> 0:49:30.163 +And bending while here you see, it's only +like times as much in the layer. + +0:49:30.750 --> 0:49:36.747 +So here's where most of your parameters are, +which means if you already replace the word + +0:49:36.747 --> 0:49:41.329 +embeddings, it might look a bit small in your +overall architecture. + +0:49:41.329 --> 0:49:47.056 +It's where most of the things are, and if +you're doing that, you already have really + +0:49:47.056 --> 0:49:48.876 +big games and can do that. + +0:49:57.637 --> 0:50:04.301 +The thing is we have seen these wooden beddings +can be very good used for other taps. + +0:50:04.784 --> 0:50:08.921 +Now you learn some relation between words. + +0:50:08.921 --> 0:50:14.790 +If you're doing this type of language modeling, +you predict. + +0:50:15.215 --> 0:50:21.532 +The one thing is, of course, you have a lot +of data, so the one question is we want to + +0:50:21.532 --> 0:50:25.961 +have a lot of data to good training models, +the other thing. + +0:50:25.961 --> 0:50:28.721 +The tasks need to be somewhat useful. + +0:50:29.169 --> 0:50:41.905 +If you would predict the first letter of the +word, it has to be a task where you need some + +0:50:41.905 --> 0:50:45.124 +syntactic information. + +0:50:45.545 --> 0:50:53.066 +The interesting thing is people have looked +at these world embeddings here in a language + +0:50:53.066 --> 0:50:53.658 +model. + +0:50:53.954 --> 0:51:04.224 +And you're looking at the word embeddings, +which are these vectors here. + +0:51:04.224 --> 0:51:09.289 +You can ask yourself, do they look? + +0:51:09.489 --> 0:51:15.122 +Don't know if your view is listening to artificial +advance artificial intelligence. + +0:51:15.515 --> 0:51:23.994 +We had on yesterday how to do this type of +representation, but you can do this kind of + +0:51:23.994 --> 0:51:29.646 +representation, and now you're seeing interesting +things. + +0:51:30.810 --> 0:51:41.248 +Now you can represent it here in a three dimensional +space with a dimension reduction. + +0:51:41.248 --> 0:51:46.886 +Then you can look into it and the interesting. 
+ +0:51:47.447 --> 0:51:57.539 +So this vector between the male and the female +version of something is not the same, but it's + +0:51:57.539 --> 0:51:58.505 +related. + +0:51:58.718 --> 0:52:11.256 +So you can do a bit of nuts, you subtract +this vector, add this vector, and then you + +0:52:11.256 --> 0:52:14.501 +look around this one. + +0:52:14.894 --> 0:52:19.691 +So that means okay, there is really something +stored, some information stored in that book. + +0:52:20.040 --> 0:52:25.003 +Similar you can do it with Buck and since +you see here swimming slam walk and walk. + +0:52:25.265 --> 0:52:42.534 +So again these vectors are not the same, but +they're related for going from here to here. + +0:52:43.623 --> 0:52:47.508 +Are semantically the relations between city +and capital? + +0:52:47.508 --> 0:52:49.757 +You have exactly the same thing. + +0:52:51.191 --> 0:52:57.857 +People having done question answering about +that if they show these embeddings and. + +0:52:58.218 --> 0:53:05.198 +Or you can also, if you don't trust the the +dimensional reduction because you say maybe + +0:53:05.198 --> 0:53:06.705 +there's something. + +0:53:06.967 --> 0:53:16.473 +Done you can also look into what happens really +in the indimensional space. + +0:53:16.473 --> 0:53:22.227 +You can look at what is the nearest neighbor. + +0:53:22.482 --> 0:53:29.605 +So you can take the relationship between France +and Paris and add it to Italy and nicely see. + +0:53:30.010 --> 0:53:33.082 +You can do big and bigger and you have small +and small lines. + +0:53:33.593 --> 0:53:38.202 +It doesn't work everywhere. + +0:53:38.202 --> 0:53:49.393 +There are also some which sometimes work, +so if you have a typical. + +0:53:51.491 --> 0:53:56.832 +You can do what the person is doing for famous +ones. + +0:53:56.832 --> 0:54:05.800 +Of course, only like Einstein, scientist, +that Messier finds Midfield are not completely + +0:54:05.800 --> 0:54:06.707 +correct. + +0:54:06.846 --> 0:54:09.781 +You'll see the examples are a bit old. + +0:54:09.781 --> 0:54:15.050 +The politicians are no longer there, but the +first one doesn't learn. + +0:54:16.957 --> 0:54:29.003 +What people have done there of courses, especially +at the beginning. + +0:54:29.309 --> 0:54:36.272 +So one famous model was, but we're not really +interested in the language model performance. + +0:54:36.272 --> 0:54:38.013 +We're only interested. + +0:54:38.338 --> 0:54:40.634 +Think something good to keep in mind. + +0:54:40.634 --> 0:54:42.688 +What are we really interested in? + +0:54:42.688 --> 0:54:44.681 +Do we really want to have an RN? + +0:54:44.681 --> 0:54:44.923 +No. + +0:54:44.923 --> 0:54:48.608 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.536 +And so very successful was this word to beg. + +0:54:55.535 --> 0:55:02.597 +We are not training real language when making +it even simpler and doing this for example + +0:55:02.597 --> 0:55:04.660 +continuous back of words. + +0:55:04.660 --> 0:55:11.801 +We are just having four input tokens and we +are predicting what is the word in the middle + +0:55:11.801 --> 0:55:15.054 +and this is just like two linear layers. + +0:55:15.615 --> 0:55:22.019 +It's even simplifying things and making the +calculation faster because that is what we're + +0:55:22.019 --> 0:55:22.873 +interested. + +0:55:23.263 --> 0:55:34.059 +All this continues skip ground models of these +other two models. 
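[Editor's illustration] The continuous bag-of-words model just described really is this small: the embedding lookup acts as the first linear map, the context vectors are averaged, and a second linear map predicts the word in the middle; skip-gram, described next, simply flips inputs and outputs. Sizes are toy values and PyTorch is an assumed framework.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words: the surrounding words are averaged in the
    embedding space and a linear layer predicts the word in the middle.
    The embedding weights are the word vectors one is actually after."""
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context):                          # context: (batch, 4) word ids
        return self.out(self.emb(context).mean(dim=1))   # logits over the vocabulary

vocab = 5000
model = CBOW(vocab)
context = torch.randint(0, vocab, (8, 4))     # two words to the left, two to the right
center = torch.randint(0, vocab, (8,))        # the word in the middle
loss = nn.CrossEntropyLoss()(model(context), center)
loss.backward()                               # training updates the embedding matrix
print(loss.item())
```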
+ +0:55:34.234 --> 0:55:38.273 +You have one equal word and it's the other +way around. + +0:55:38.273 --> 0:55:41.651 +You're predicting the four words around them. + +0:55:41.651 --> 0:55:43.047 +It's very similar. + +0:55:43.047 --> 0:55:48.702 +The task is in the end very similar, but in +all of them it's about learning. + +0:55:51.131 --> 0:56:01.416 +Before we go into the next part, let's talk +about the normal white vector or white line. + +0:56:04.564 --> 0:56:07.562 +The next thing is contextual word embeddings. + +0:56:07.562 --> 0:56:08.670 +The idea is yes. + +0:56:08.670 --> 0:56:09.778 +This is helpful. + +0:56:09.778 --> 0:56:14.080 +However, we might be able to get more from +just only lingo later. + +0:56:14.080 --> 0:56:19.164 +For example, if you think about the word can, +it can have different meanings. + +0:56:19.419 --> 0:56:32.619 +And now in the word embeddings how you have +an overlap of these two meanings, so it represents + +0:56:32.619 --> 0:56:33.592 +those. + +0:56:34.834 --> 0:56:40.318 +But we might be able to in the pre-train model +already disambiguate these because they use + +0:56:40.318 --> 0:56:41.041 +completely. + +0:56:41.701 --> 0:56:50.998 +So if we can have a model which can not only +represent the word, but it can also represent + +0:56:50.998 --> 0:56:58.660 +the meaning of the word within the context, +it might be even more helpful. + +0:56:59.139 --> 0:57:03.342 +So then we're going to contextual word embeddings. + +0:57:03.342 --> 0:57:07.709 +We're really having a representation of the +context. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:20.551 +It's like our base language model where you +have to do the hidden state. + +0:57:20.551 --> 0:57:29.290 +The hidden state represents what is apparently +said, but it's focusing. + +0:57:29.509 --> 0:57:43.814 +The first one doing that is in something like +the Elmo paper where they instead of like this + +0:57:43.814 --> 0:57:48.121 +is a normal language model. + +0:57:48.008 --> 0:57:52.735 +Put in the third predicting the fourth and +so on, so you're always predicting the next + +0:57:52.735 --> 0:57:53.007 +one. + +0:57:53.193 --> 0:57:57.919 +The architecture of the heaven works embedding +layer, and then two are an layer here. + +0:57:57.919 --> 0:58:04.255 +For example: And now instead of using this +one in the end you're using here this one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:22.909 +We can train it in a language model or predicting +the next word, but we have more information, + +0:58:22.909 --> 0:58:26.162 +train there, and therefore. + +0:58:27.167 --> 0:58:31.168 +And there is one even done currently in. + +0:58:31.168 --> 0:58:40.536 +The only difference is that we have more layers, +bigger size, and we're using transform on here + +0:58:40.536 --> 0:58:44.634 +or self-attention instead of the R&F. + +0:58:44.634 --> 0:58:45.122 +But. + +0:58:46.746 --> 0:58:52.737 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:52.737 --> 0:58:58.584 +So what do you think of this one as contextual +representation of the third word? + +0:58:58.584 --> 0:59:02.914 +Do you see anything which is not really considered +in this? + +0:59:07.587 --> 0:59:11.492 +Only one way yes, so that is not a big issue +here. 
+ +0:59:11.492 --> 0:59:18.154 +It's representing a string in the context +of a sentence, however, only in the context. + +0:59:18.558 --> 0:59:28.394 +However, we have an architecture which can +also take both sides and we have used it in + +0:59:28.394 --> 0:59:30.203 +the ink holder. + +0:59:30.630 --> 0:59:34.269 +So we could do the and easily only us in the +backboard direction. + +0:59:34.874 --> 0:59:46.889 +By just having the other way around, and then +we couldn't combine the forward and into a + +0:59:46.889 --> 0:59:49.184 +joint one where. + +0:59:49.329 --> 0:59:50.861 +So You Have a Word embedding. + +0:59:51.011 --> 1:00:03.910 +Then you have two states, one with a forward, +and then one with a backward. + +1:00:03.910 --> 1:00:10.359 +For example, take the representation. + +1:00:10.490 --> 1:00:21.903 +Now this same here represents mainly this +word because this is where what both focuses + +1:00:21.903 --> 1:00:30.561 +on is what is happening last but is also looking +at the previous. + +1:00:31.731 --> 1:00:41.063 +However, there is a bit different when training +that as a language model you already have. + +1:00:43.203 --> 1:00:44.956 +Maybe there's again this masking. + +1:00:46.546 --> 1:00:47.814 +That is one solution. + +1:00:47.814 --> 1:00:53.407 +First of all, why we can't do it is the information +you leave it, so you cannot just predict the + +1:00:53.407 --> 1:00:54.041 +next word. + +1:00:54.041 --> 1:00:58.135 +If we just predict the next word in this type +of model, that's a very. + +1:00:58.738 --> 1:01:04.590 +You know the next word because it's influencing +this hidden stage and then it's very easy so + +1:01:04.590 --> 1:01:07.736 +predicting something you know is not a good +task. + +1:01:07.736 --> 1:01:09.812 +This is what I mentioned before. + +1:01:09.812 --> 1:01:13.336 +You have to define somehow a task which is +challenging. + +1:01:13.753 --> 1:01:19.007 +Because in this case one would, I mean, the +system would just ignore the states and what + +1:01:19.007 --> 1:01:22.961 +it would learn is that you copy this information +directly in here. + +1:01:23.343 --> 1:01:31.462 +So it would mainly be representing this word +and you would have a perfect model because + +1:01:31.462 --> 1:01:38.290 +you only need to find an encoding where you +can encode all words somehow. + +1:01:38.458 --> 1:01:44.046 +The only thing that will learn is that tenor +and coat all my words in this upper hidden. + +1:01:44.985 --> 1:01:49.584 +And then, of course, it's not really useful. + +1:01:49.584 --> 1:01:53.775 +We need to find a bit of different ways. + +1:01:55.295 --> 1:01:59.440 +There is a masking one. + +1:01:59.440 --> 1:02:06.003 +I'll come to that shortly just a bit. + +1:02:06.003 --> 1:02:14.466 +The other thing is not to directly combine +them. + +1:02:14.594 --> 1:02:22.276 +So you never merge the states only at the +end. + +1:02:22.276 --> 1:02:33.717 +The representation of the words is now from +the forward and the next. + +1:02:33.873 --> 1:02:35.964 +So it's always a hidden state before that. + +1:02:36.696 --> 1:02:41.273 +And these two you're joined now to your to +the representation. + +1:02:42.022 --> 1:02:50.730 +And then you have now a representation also +about the whole sentence for the word, but + +1:02:50.730 --> 1:02:53.933 +there's no information leakage. + +1:02:53.933 --> 1:02:59.839 +One way of doing this is instead of doing +a bidirectional. + +1:03:00.380 --> 1:03:08.079 +You can do that, of course, in all layers. 
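[Editor's illustration] A sketch of the forward/backward combination without information leakage described above: two separate language-model directions are run, and for each position the forward state from just before the word is concatenated with the backward state from just after it, so neither side has already seen the word it helps to predict. This is ELMo-style in spirit; the GRUs and all sizes are assumptions.

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
emb = nn.Embedding(vocab, d)
fwd = nn.GRU(d, d, batch_first=True)
bwd = nn.GRU(d, d, batch_first=True)

tokens = torch.randint(0, vocab, (2, 7))            # (batch, length)
x = emb(tokens)

h_fwd, _ = fwd(x)                                   # h_fwd[:, t] has seen tokens 0..t
h_bwd_rev, _ = bwd(torch.flip(x, dims=[1]))         # run the second model right-to-left
h_bwd = torch.flip(h_bwd_rev, dims=[1])             # h_bwd[:, t] has seen tokens t..end

# shift each side by one position so the word itself is excluded
pad = torch.zeros(2, 1, d)
left = torch.cat([pad, h_fwd[:, :-1]], dim=1)       # context strictly left of position t
right = torch.cat([h_bwd[:, 1:], pad], dim=1)       # context strictly right of position t
context_repr = torch.cat([left, right], dim=-1)     # (batch, length, 2*d)
print(context_repr.shape)
```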
+
+1:03:08.079 --> 1:03:16.315
+In the end you have separate forward and backward
+hidden states.
+
+1:03:16.596 --> 1:03:20.246
+However, it's a bit complicated.
+
+1:03:20.246 --> 1:03:25.241
+You have to keep things separate and then merge
+them.
+
+1:03:27.968 --> 1:03:33.007
+And that is the moment where the key idea comes in.
+
+1:03:34.894 --> 1:03:42.018
+The idea behind the big success of the BERT model
+was that maybe, in the bidirectional case,
+
+1:03:42.018 --> 1:03:48.319
+it's not good to do next-word prediction,
+but we can do masking.
+
+1:03:48.308 --> 1:03:59.618
+And masking means we do a prediction
+of something in the middle, of some of the words.
+
+1:03:59.618 --> 1:04:08.000
+We take the input and just put noise
+into the input.
+
+1:04:08.048 --> 1:04:14.040
+Now there can be no information leakage, because
+this word wasn't in the input.
+
+1:04:14.040 --> 1:04:15.336
+Now predicting it is a real task.
+
+1:04:16.776 --> 1:04:20.524
+So thereby we don't make any assumption
+about our model.
+
+1:04:20.524 --> 1:04:24.815
+It doesn't need to be a forward model or a
+backward model or anything.
+
+1:04:24.815 --> 1:04:29.469
+You can have any type of architecture, and
+you can always predict the masked word, here "street".
+
+1:04:30.530 --> 1:04:39.112
+There is maybe one disadvantage: do you see
+what could be a bit of a problem with this type
+of training
+
+1:04:39.112 --> 1:04:40.098
+compared to the language model?
+
+1:05:00.000 --> 1:05:05.920
+Yes, so, I mean, you can of course mask more,
+but to see it more globally, just assume
+
+1:05:05.920 --> 1:05:07.142
+you only mask one word.
+
+1:05:07.142 --> 1:05:12.676
+For the whole sentence we get one feedback
+signal, like: what is the word "street"? So we
+
+1:05:12.676 --> 1:05:16.280
+have one training signal for the whole sentence.
+
+1:05:17.397 --> 1:05:19.461
+In the language modeling case,
+
+1:05:19.461 --> 1:05:21.240
+we predicted here word three,
+
+1:05:21.240 --> 1:05:22.947
+we predicted here word four,
+
+1:05:22.947 --> 1:05:24.655
+we predicted here word five.
+
+1:05:25.005 --> 1:05:26.973
+So we have a number of tokens,
+
+1:05:26.973 --> 1:05:30.974
+and for each token we have a feedback signal
+saying what the next word is.
+
+1:05:31.211 --> 1:05:39.369
+So in this case, of course, this is a lot less
+efficient, because we are getting fewer feedback
+
+1:05:39.369 --> 1:05:45.754
+signals on what we should predict, compared
+to models where we're predicting at every position.
+
+1:05:48.348 --> 1:05:54.847
+So in BERT the main idea for this bidirectional
+model was masking.
+
+1:05:54.847 --> 1:05:59.721
+It was one of the first large models using the
+Transformer.
+
+1:06:00.320 --> 1:06:06.326
+There are two more minor changes.
+
+1:06:06.326 --> 1:06:16.573
+We'll see that next-sentence prediction is
+added as another task.
+
+1:06:16.957 --> 1:06:25.395
+Again, you want the model to learn more about
+what language is, to really understand it:
+
+1:06:25.395 --> 1:06:35.089
+are these two sentences following each other,
+like in a story, or are they independent of each other?
+
+1:06:38.158 --> 1:06:43.026
+The input is using subword units, as we are
+using them in machine translation.
+
+1:06:43.026 --> 1:06:48.992
+It has some special tokens: at the beginning
+the CLS token, which is trained for the next-sentence
+
+1:06:48.992 --> 1:06:50.158
+prediction.
+
+1:06:50.470 --> 1:06:57.296
+That is less relevant for machine translation;
+
+1:06:57.296 --> 1:07:07.242
+it's more for classification tasks, because
+there you get one representation for the whole input.
+
+1:07:07.607 --> 1:07:24.323
+You can have two sentences in the input, and then
+you have positional encodings as we know them in general.
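
To make the input format concrete, here is a small sketch of how such a sentence pair can be packed for a BERT-style model (the helper name and the exact token strings are illustrative assumptions; the subword tokens are assumed to come from the usual pre-processing):

```python
def pack_sentence_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Illustrative packing of two subword-tokenized sentences for a BERT-style model.

    The CLS token at the beginning is the position whose final hidden state is
    used for classification tasks such as next-sentence prediction; SEP marks
    the sentence boundaries.  Segment ids say which sentence a token belongs
    to, and position ids are the indices used for the positional encoding.
    """
    tokens = [cls] + list(tokens_a) + [sep] + list(tokens_b) + [sep]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

# Usage:
# pack_sentence_pair(["the", "man", "went", "to", "the", "store"],
#                    ["he", "bought", "milk"])
```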
+
+1:07:24.684 --> 1:07:28.812
+Now, what is a bit more involved is the masking itself.
+
+1:07:28.812 --> 1:07:30.927
+So what do you mask?
+
+1:07:30.927 --> 1:07:35.055
+There is already the question of how much
+you should mask.
+
+1:07:35.275 --> 1:07:44.453
+There has afterwards been some work, like for
+example RoBERTa, which tries to improve this.
+
+1:07:44.453 --> 1:07:52.306
+It's not super sensitive, but of course if
+you do it completely wrong then your performance drops.
+
+1:07:52.572 --> 1:07:54.590
+That's then another question there.
+
+1:07:56.756 --> 1:08:03.285
+Also: should you always mask the full word?
+
+1:08:03.285 --> 1:08:14.562
+If you have subwords, is it fine to mask only
+a single subword and predict it based on the rest?
+
+1:08:14.894 --> 1:08:20.755
+If you know, say, three parts of a word, it
+might be easier to guess the last one. Here they
+
+1:08:20.755 --> 1:08:27.142
+took the easiest solution: not considering
+words anymore at all, because the splitting is done
+
+1:08:27.142 --> 1:08:32.278
+in the pre-processing, and they just always
+mask subwords.
+
+1:08:32.672 --> 1:08:36.286
+Later variants do it a bit differently:
+
+1:08:36.286 --> 1:08:40.404
+they always mask the full word, but I guess
+it's not a major difference.
+
+1:08:41.001 --> 1:08:46.969
+And then, what to do with the masked word:
+in eighty percent of the cases the word is
+
+1:08:46.969 --> 1:08:47.391
+masked;
+
+1:08:47.391 --> 1:08:50.481
+they replace it with a special token,
+
+1:08:50.481 --> 1:08:52.166
+the mask token.
+
+1:08:52.166 --> 1:08:58.486
+In ten percent they put some random other
+token in there, and in ten percent they keep
+
+1:08:58.486 --> 1:08:59.469
+it unchanged.
+
+1:09:02.202 --> 1:09:11.519
+And then what you can also do is this
+next-sentence prediction.
+
+1:09:11.519 --> 1:09:17.786
+So you might have "the man went to [MASK] store".
+
+1:09:18.418 --> 1:09:24.090
+So maybe you see that you're doing both jointly:
+the masking and the next-sentence prediction.
+
+1:09:24.564 --> 1:09:34.402
+And if the second sentence is "penguin [MASK]
+are flightless birds", then these two sentences have
+
+1:09:34.402 --> 1:09:42.995
+nothing to do with each other, and so in this
+case the label is "not the next sentence".
+
+1:09:47.127 --> 1:09:56.184
+And that is the whole BERT model: here is
+the input, here are the Transformer layers,
+
+1:09:56.184 --> 1:09:58.162
+and you can train it on these two objectives.
+
+1:09:58.598 --> 1:10:08.580
+And this model was quite successful in general
+applications.
+
+1:10:08.580 --> 1:10:17.581
+It was not yet as powerful as what people
+are nowadays using.
+
+1:10:17.937 --> 1:10:27.644
+However, there is a huge family of different
+types of models coming from that.
+
+1:10:27.827 --> 1:10:39.109
+So based on BERT and other self-supervised
+models, a whole ecosystem came out of there,
+
+1:10:39.109 --> 1:10:42.091
+and there are different variants.
+
+1:10:42.082 --> 1:10:46.637
+With the availability of large language models,
+the success became even bigger.
+
+1:10:47.007 --> 1:10:48.436
+We have now even larger ones.
+
+1:10:48.828 --> 1:10:50.961
+Interestingly, it changed a bit again,
+
+1:10:50.910 --> 1:10:59.321
+from these bidirectional models back to
+unidirectional models; at the moment it's maybe leaning
+
+1:10:59.321 --> 1:11:03.843
+a bit more that way, and we're coming to them now.
+
+1:11:03.843 --> 1:11:09.179
+Now, do you see one advantage of the unidirectional
+models? One we already had is the efficiency.
+
+1:11:09.509 --> 1:11:16.670
+There's one other reason why you are sometimes
+more interested in unidirectional models
+
+1:11:16.670 --> 1:11:17.158
+than bidirectional ones.
+
+1:11:22.882 --> 1:11:30.882
+I mean, it depends on the task, but for example
+for a language generation task it doesn't fit.
+
+1:11:32.192 --> 1:11:34.574
+Yes, it's not only less interesting, it just doesn't work.
+
+1:11:34.574 --> 1:11:39.283
+So if you want to do generation, like in the
+decoder, so you want to generate a sentence,
+
+1:11:39.283 --> 1:11:42.856
+you don't know the future, so you cannot apply
+this type of model.
+
+1:11:43.223 --> 1:11:49.498
+This type of model can be used for the encoder
+in an encoder-decoder model, but it cannot be used for
+
+1:11:49.498 --> 1:11:55.497
+the decoder, because it is trained in a way
+that only works if it has information from both sides,
+
+1:11:55.497 --> 1:11:56.945
+and when you're generating you don't have that.
+
+1:12:00.000 --> 1:12:05.559
+Yeah, and that's a good transition to the next
+overall view of these models.
+
+1:12:05.559 --> 1:12:08.839
+So if you view it from that perspective:
+
+1:12:09.009 --> 1:12:13.137
+we have the encoder-based models.
+
+1:12:13.137 --> 1:12:16.372
+That's what we just looked at.
+
+1:12:16.372 --> 1:12:20.612
+They are bidirectional and typically trained
+with masking.
+
+1:12:20.981 --> 1:12:22.347
+That is the one we just looked at.
+
+1:12:22.742 --> 1:12:35.217
+What we looked at at the beginning are the
+decoder-based models, so the autoregressive models, which are
+
+1:12:35.217 --> 1:12:42.619
+unidirectional models, and there we can do
+next-word prediction.
+
+1:12:43.403 --> 1:12:52.421
+And there you can also have a special thing
+called prefix
+
+1:12:52.421 --> 1:12:53.434
+language models.
+
+1:12:54.354 --> 1:13:04.079
+Because, as we said, it might be helpful that
+for some of your input you can use bidirectional context.
+
+1:13:04.079 --> 1:13:17.334
+That is what is called a prefix, where you say
+that on the first tokens you have bidirectional
+
+1:13:17.334 --> 1:13:19.094
+connections.
+
+1:13:19.219 --> 1:13:28.768
+You somehow merge both; that mainly works in
+Transformer-based models, because there the directionality
+is only a matter of masking.
+
+1:13:29.629 --> 1:13:34.894
+There is no difference in the number of parameters.
+
+1:13:34.975 --> 1:13:38.533
+In the Transformer, the only difference is how
+you mask your attention.
+
+1:13:38.878 --> 1:13:47.691
+We have seen that between the encoder and the decoder
+the number of parameters is different, because
+
+1:13:47.691 --> 1:13:50.261
+in the decoder you do the cross-attention.
+
+1:13:50.650 --> 1:13:58.389
+Otherwise it's only that you mask your attention
+to either look only at the past or also look into
+
+1:13:58.389 --> 1:13:59.469
+the future.
+
+1:14:00.680 --> 1:14:03.323
+And now you can, of course, also do a mix.
+
+1:14:03.563 --> 1:14:08.307
+So this is a bidirectional attention matrix,
+where you can attend to everything.
+
+1:14:08.588 --> 1:14:23.477
+This is the unidirectional, or causal, one, where
+you can only look at the past, and you can do the mix
+
+1:14:23.477 --> 1:14:25.652
+with the prefix.
+
+1:14:29.149 --> 1:14:42.829
+So, if that is all clear: based on that, then,
+of course, you can also do the third thing.
+
+1:14:43.163 --> 1:14:54.497
+So the idea is: we have our encoder-decoder
+architecture; can we also train it completely
+
+1:14:54.497 --> 1:14:57.700
+in a self-supervised way?
+
+1:14:58.238 --> 1:15:06.206
+In this case we have the same sentence on both
+sides, so we would also have the sentence
+
+1:15:06.206 --> 1:15:08.470
+as input on the decoder side.
+
+1:15:08.470 --> 1:15:12.182
+Then we need to do some type of masking.
+
+1:15:12.912 --> 1:15:16.245
+Here we don't need to do the masking, but here
+we need to do it,
+
+1:15:16.245 --> 1:15:17.911
+so that the model never just sees the word it has to predict.
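
A minimal sketch of how such a self-supervised training pair for an encoder-decoder model could be built (the corruption recipe, the probabilities, and the names here are only assumptions for illustration; the specific noise functions used by real systems are discussed next): the encoder receives a corrupted version of the sentence, and the decoder has to regenerate the original, so the number of input and output tokens may differ.

```python
import random

def make_denoising_pair(tokens, mask_token="[MASK]", p=0.3, rng=random):
    """Illustrative self-supervised example for an encoder-decoder model.

    Some tokens are replaced by a mask token and some are deleted entirely;
    the decoder target is always the original, uncorrupted sentence.
    """
    corrupted = []
    for tok in tokens:
        r = rng.random()
        if r < p / 2:
            corrupted.append(mask_token)   # replace the token by a mask
        elif r < p:
            continue                       # delete the token completely
        else:
            corrupted.append(tok)
    return (corrupted or [mask_token]), list(tokens)

# Usage: encoder_input, decoder_target = make_denoising_pair("the cat sat on the mat".split())
```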
+
+1:15:20.440 --> 1:15:30.269
+And this type of model got quite successful as well,
+especially for pre-training for machine translation.
+
+1:15:30.330 --> 1:15:45.934
+One example is the BART model,
+which is one successful way to pre-train your
+
+1:15:45.934 --> 1:15:47.162
+model.
+
+1:15:47.427 --> 1:15:52.858
+In translation you would put in a source sentence;
+we can't do that here.
+
+1:15:52.858 --> 1:15:55.430
+We only have one language.
+
+1:15:55.715 --> 1:16:00.932
+But we can just put the same sentence in twice,
+and to make it a non-trivial task,
+
+1:16:00.932 --> 1:16:08.517
+we can corrupt the input. They use quite a few
+different corruption techniques.
+
+1:16:08.517 --> 1:16:12.751
+You can do token masking, and you can also do
+token deletion.
+
+1:16:13.233 --> 1:16:20.785
+That you couldn't do in an encoder-only system,
+because then the position wouldn't be there and you cannot
+
+1:16:20.785 --> 1:16:22.345
+predict at a position that doesn't exist.
+
+1:16:22.345 --> 1:16:26.368
+There, the number of input and output tokens is always the same.
+
+1:16:26.906 --> 1:16:29.820
+You cannot do a prediction for something which
+isn't in the input.
+
+1:16:30.110 --> 1:16:39.714
+Here, on the decoder side, it's unidirectional
+generation, so we can also delete tokens and then generate the
+
+1:16:39.714 --> 1:16:40.369
+full sentence.
+
+1:16:41.061 --> 1:16:48.628
+We can do sentence permutation, where you
+change the order of the sentences.
+
+1:16:48.628 --> 1:16:54.274
+We can do document rotation and text infilling.
+
+1:16:55.615 --> 1:17:05.870
+So you see, there are quite a lot of types of
+models that you can use in order to pre-train
+
+1:17:05.870 --> 1:17:06.561
+your model.
+
+1:17:07.507 --> 1:17:12.512
+So these are the models you can use.
+
+1:17:12.512 --> 1:17:21.072
+Of course, the other question is: how do you
+integrate them into your MT system?
+
+1:17:21.761 --> 1:17:26.638
+And there are also quite a few different techniques
+for that.
+
+1:17:27.007 --> 1:17:28.684
+It's a bit similar to before.
+
+1:17:28.928 --> 1:17:39.307
+So the easiest thing is: you take your word
+embeddings or your pre-trained model,
+
+1:17:39.307 --> 1:17:47.979
+in the contextual case several layers, and you
+freeze them in your system.
+
+1:17:48.748 --> 1:17:53.978
+That can also be done if you have a BART model.
+
+1:17:53.978 --> 1:18:03.344
+You freeze your word embeddings, for example,
+and only train the top layers.
+
+1:18:05.865 --> 1:18:14.965
+The other option is initialization: you initialize
+your model with the pre-trained one, but then you train
+
+1:18:14.965 --> 1:18:19.102
+everything, so you're not only training the new parts.
+
+1:18:22.562 --> 1:18:32.600
+Then there is one issue if you think about
+BART: there, you have
+
+1:18:32.600 --> 1:18:35.752
+the same language on the input and the output side.
+
+1:18:36.516 --> 1:18:46.013
+Typically, I mean, the one you get is for English,
+so you can then try to do some fine-tuning
+
+1:18:46.366 --> 1:18:55.165
+on top of the BART, in order to learn some language-specific
+stuff; or there's a multilingual BART,
+
+1:18:55.165 --> 1:19:03.415
+which is trained on many languages, but still
+with more or less the same language on both sides.
+
+1:19:03.923 --> 1:19:09.745
+So then you would still need to fine-tune,
+and the model needs to learn how to
+
+1:19:09.745 --> 1:19:12.074
+do the attention cross-lingually.
+
+1:19:12.074 --> 1:19:18.102
+It was trained only within the same language, but
+then it mainly has to learn this mapping and not all
+
+1:19:18.102 --> 1:19:18.787
+the rest.
+
+1:19:21.982 --> 1:19:27.492
+A third thing which is very commonly used
+is what is referred to as adapters.
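
A minimal sketch of such an adapter layer, as it is explained next (illustrative only; the names, sizes, and placement inside the network are assumptions, not a specific library's implementation): a small bottleneck network with a residual connection that is inserted after an existing, usually frozen, Transformer sub-layer, so that during adaptation only these few parameters need to be trained.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative adapter block: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_dim: int = 512, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # With small weights this stays close to the identity, so the frozen
        # pre-trained model's behaviour is only gently modified.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During fine-tuning one would freeze the pre-trained model and train only the adapters:
# for p in pretrained_model.parameters():
#     p.requires_grad = False
# for p in adapter.parameters():
#     p.requires_grad = True
```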
+
+1:19:27.607 --> 1:19:29.749
+So, for example, you take an mBART.
+
+1:19:29.709 --> 1:19:35.502
+And you put some adapters inside the network,
+so small new layers which
+
+1:19:35.502 --> 1:19:41.676
+are put in between, and then you only train these
+adapters, or at least also train these adapters.
+
+1:19:41.676 --> 1:19:47.724
+So, for example, with mBART you could see that
+this learns to map the source-language representation
+
+1:19:47.724 --> 1:19:50.333
+to the target-language representation.
+
+1:19:50.470 --> 1:19:52.395
+And then you don't have to change the rest that much.
+
+1:19:52.792 --> 1:20:04.197
+The idea is that you give it some extra capacity
+to really perform well on that task, and then it's
+
+1:20:04.197 --> 1:20:05.234
+easier to adapt.
+
+1:20:05.905 --> 1:20:15.117
+This is also very commonly used, for example, in
+multilingual systems, where the idea is that you
+
+1:20:15.117 --> 1:20:16.282
+have some language-specific adapters.
+
+1:20:16.916 --> 1:20:23.505
+So they are trained only for one language
+pair: on the one hand, the model
+
+1:20:23.505 --> 1:20:27.973
+has the ability to work multilingually and to
+share knowledge,
+
+1:20:27.973 --> 1:20:33.729
+but then there is some knowledge which is
+very language-specific, and that goes into the adapters.
+
+1:20:34.914 --> 1:20:39.291
+But there's one catch with multilingual systems
+in general.
+
+1:20:39.291 --> 1:20:40.798
+It works quite well,
+
+1:20:40.798 --> 1:20:47.542
+but there's one specific use case for multilingual
+models where this normally doesn't really work well.
+
+1:20:47.542 --> 1:20:49.981
+Do you have an idea of what that is?
+
+1:20:55.996 --> 1:20:57.534
+It's the zero-shot case.
+
+1:20:57.998 --> 1:21:06.051
+Exactly, because then you're again adding
+something which is very language-specific;
+
+1:21:06.051 --> 1:21:15.046
+in zero-shot, the idea is always to learn representations
+which are more language-independent, and with
+
+1:21:15.046 --> 1:21:17.102
+the adapters you of course go against that.
+
+1:21:20.260 --> 1:21:37.655
+And there's also the idea of doing more of
+a knowledge distillation setup.
+
+1:21:39.179 --> 1:21:41.177
+And the idea there is:
+
+1:21:41.177 --> 1:21:48.095
+we are training the system the same way, but what
+we additionally want to achieve is that the hidden states of the
+
+1:21:48.095 --> 1:21:54.090
+encoder are as similar as possible to the ones
+of the pre-trained model.
+
+1:21:54.414 --> 1:22:07.569
+So you should learn faster by telling the
+model to make these states as similar as possible.
+
+1:22:07.569 --> 1:22:11.813
+You compare the first hidden state to the first
+hidden state, and so on,
+
+1:22:12.192 --> 1:22:18.549
+for example by using the L2 norm, so by just
+pushing these two representations to be the same.
+
+1:22:20.020 --> 1:22:22.880
+Now, this requires the same vocabulary.
+
+1:22:22.880 --> 1:22:25.468
+Why does it need the same vocabulary?
+
+1:22:25.468 --> 1:22:26.354
+Can someone give me the reason?
+
+1:22:34.754 --> 1:22:39.132
+You have a different vocabulary,
+
+1:22:39.132 --> 1:22:50.711
+and you also have different sequence lengths,
+because if you use different segmentations the number of tokens differs.
+
+1:22:51.231 --> 1:22:55.680
+Exactly, then what happens is that we have
+different numbers of states here.
+
+1:22:55.680 --> 1:23:01.097
+It's no longer straightforward which states
+to compare.
+
+1:23:02.322 --> 1:23:05.892
+And then it's just easier to have the same number.
+
+1:23:05.892 --> 1:23:08.952
+You can always compare the first to the first
+and the second to the second.
+
+1:23:09.709 --> 1:23:16.836
+So therefore, at least this very easy way of
+knowledge distillation only works if you have the same vocabulary.
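
A sketch of this auxiliary loss (assuming both encoders produce states of shape (batch, seq_len, hidden_dim) over the same subword segmentation, which is exactly the same-vocabulary requirement discussed above; the weighting is a hyper-parameter, not a value from the lecture):

```python
import torch
import torch.nn.functional as F

def hidden_state_matching_loss(student_states: torch.Tensor,
                               teacher_states: torch.Tensor) -> torch.Tensor:
    """Position-wise L2 (mean squared error) between the MT encoder's hidden
    states and those of the frozen pre-trained encoder."""
    return F.mse_loss(student_states, teacher_states)

# Combined objective (sketch): translation cross-entropy plus the matching term.
# loss = ce_loss + 0.5 * hidden_state_matching_loss(enc_states, teacher_states.detach())
```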
+
+1:23:17.177 --> 1:23:30.871
+Of course, you could do things like requiring
+the averages to be the same, but that's a less
+
+1:23:30.871 --> 1:23:33.080
+strong signal.
+
+1:23:34.314 --> 1:23:47.087
+But the advantage here is that you have a
+direct training signal on the encoder,
+
+1:23:47.087 --> 1:23:52.457
+so you can directly guide it.
+
+1:23:56.936 --> 1:24:11.208
+Yes, I think this is mostly it for today,
+so what you should keep in mind today are two
+
+1:24:11.208 --> 1:24:18.147
+techniques: the one is the back-translation idea.
+
+1:24:18.147 --> 1:24:26.598
+If you have monolingual data, you back-translate
+it and use it as additional training data.
+
+1:24:26.886 --> 1:24:33.608
+And yeah, it is often helpful to combine them,
+so you can even use both of them.
+
+1:24:33.853 --> 1:24:39.669
+You can use pre-trained models, but then you
+can still do back-translation
+
+1:24:39.669 --> 1:24:40.066
+on top.
+
+1:24:40.160 --> 1:24:47.058
+Back-translation has the advantage that everything
+is trained to work together on the task,
+
+1:24:47.058 --> 1:24:54.422
+so it might be helpful to back-translate some
+data and then use it for the real translation task,
+
+1:24:54.422 --> 1:24:57.755
+because in pre-training the big challenge is
+exactly this integration.
+
+1:24:58.058 --> 1:25:07.392
+You saw there are different ways of integrating
+this knowledge, but even if you use a full
+
+1:25:07.392 --> 1:25:08.087
+pre-trained model:
+
+1:25:08.748 --> 1:25:11.713
+this is the most similar you can get;
+
+1:25:11.713 --> 1:25:15.224
+you're doing no changes to the architecture,
+
+1:25:15.224 --> 1:25:20.608
+you're really taking the model and just fine-tuning
+it on the new task.
+
+1:25:20.608 --> 1:25:24.041
+But it still has to learn the translation task
+completely newly.
+
+1:25:24.464 --> 1:25:29.978
+So it might, for example, be helpful to also have
+some back-translated data to learn that.
+
+1:25:32.192 --> 1:25:45.096
+Good. One important thing: next Tuesday
+there is a conference or a workshop in this
+
+1:25:45.096 --> 1:25:45.947
+room.
+
+1:25:47.127 --> 1:25:54.405
+You should get an email via the mailing list
+that there is a room change for Tuesday, only
+
+1:25:54.405 --> 1:25:57.398
+for Tuesday, and afterwards it's back to normal.
+
+1:25:57.637 --> 1:26:03.714
+One more question, from a more general
+perspective: in computer vision
+
+1:26:03.714 --> 1:26:07.246
+you can enlarge your data set with data augmentation.
+
+1:26:07.246 --> 1:26:08.293
+Is there something
+
+1:26:08.388 --> 1:26:15.306
+similar to enlarge speech or text data,
+so data augmentation?
+
+1:26:15.755 --> 1:26:27.013
+You can use this back-translation and also
+the masking, but I would say
+
+1:26:27.013 --> 1:26:31.201
+back-translation is the most similar thing.
+
+1:26:31.371 --> 1:26:35.632
+It has also been used, for example, not only
+for monolingual data.
+
+1:26:36.216 --> 1:26:40.958
+If you have a good MT system, it can also be
+used on parallel data, augmenting
+
+1:26:40.958 --> 1:26:46.061
+your data with more data, because then you have
+the human translation and the automatic translation,
+
+1:26:46.061 --> 1:26:46.783
+and both are good.
+
+1:26:46.783 --> 1:26:51.680
+You're just having more data and a better feedback
+signal in different ways, because there's not
+
+1:26:51.680 --> 1:26:53.845
+only one correct translation but several.
+
+1:26:54.834 --> 1:26:58.327
+So I would say this is the most similar one
+
+1:26:58.327 --> 1:27:00.947
+to just rotating images and so on.
+
+1:27:00.947 --> 1:27:03.130
+There are other things you could do.
+
+1:27:05.025 --> 1:27:07.646
+But, for example, replacing words with synonyms
+is rarely used.
+
+1:27:07.646 --> 1:27:13.907
+It's very hard to do this by rules, like deciding
+which words to replace, because there's not
+
+1:27:13.907 --> 1:27:14.490
+a clear rule.
+
+1:27:14.490 --> 1:27:18.931
+You cannot just say that this word can always
+be replaced by that one.
+
+1:27:19.139 --> 1:27:28.824
+I mean, even if they are near-perfect synonyms,
+they fit in some cases, but not in all
+
+1:27:28.824 --> 1:27:29.585
+cases.
+
+1:27:29.585 --> 1:27:36.985
+And if you don't do it rule-based, you have
+to train yet another model.
+
+1:27:38.058 --> 1:27:57.050
+For comparing the hidden states, do we need the
+same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:27:59.817
+They should be of the same dimension, so it's easiest to have the same
+
+1:28:00.000 --> 1:28:03.780
+architecture. We will later see, in the lecture on efficiency,
+
+1:28:03.780 --> 1:28:08.949
+that you can also do knowledge distillation with,
+for example, smaller models.
+
+1:28:08.949 --> 1:28:15.816
+So the teacher can have twelve layers and the student
+only five, and then you try to learn the same within five
+
+1:28:15.816 --> 1:28:16.433
+layers,
+
+1:28:17.477 --> 1:28:22.945
+or eight layers; so that is possible, but I agree,
+it should be of the same hidden size.
+
+1:28:23.623 --> 1:28:35.963
+The question then, of course, is whether you do it
+as an initialization or during
+
+1:28:35.963 --> 1:28:37.305
+training,
+
+1:28:37.305 --> 1:28:41.195
+alongside your main training.
+
+1:28:45.865 --> 1:28:53.964
+Good, then thanks a lot, and we'll see
+each other again on Tuesday.