diff --git "a/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" "b/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-11-15.06.2023/English.vtt" @@ -0,0 +1,12362 @@ +WEBVTT + +0:00:00.981 --> 0:00:20.036 +Today about is how to use some type of additional +resources to improve the translation. + +0:00:20.300 --> 0:00:28.188 +We have in the first part of the semester +two thirds of the semester how to build some + +0:00:28.188 --> 0:00:31.361 +of your basic machine translation. + +0:00:31.571 --> 0:00:42.317 +Now the basic components are both for statistical +and for neural, with the encoded decoding. + +0:00:43.123 --> 0:00:46.000 +Now, of course, that's not where it stops. + +0:00:46.000 --> 0:00:51.286 +It's still what nearly every machine translation +system is currently in there. + +0:00:51.286 --> 0:00:57.308 +However, there's a lot of challenges which +you need to address in addition and which need + +0:00:57.308 --> 0:00:58.245 +to be solved. + +0:00:58.918 --> 0:01:09.858 +And there we want to start to tell you what +else can you do around this, and partly. + +0:01:10.030 --> 0:01:14.396 +And one important question there is on what +do you train your models? + +0:01:14.394 --> 0:01:32.003 +Because like this type of parallel data, it's +easier in machine translation than in other + +0:01:32.003 --> 0:01:33.569 +trusts. + +0:01:33.853 --> 0:01:41.178 +And therefore an important question is, can +we also learn from like other sources and through? + +0:01:41.701 --> 0:01:47.830 +Because if you remember strongly right at +the beginning of the election,. + +0:01:51.171 --> 0:01:53.801 +This Is How We Train All Our. + +0:01:54.194 --> 0:01:59.887 +Machine learning models from statistical to +neural. + +0:01:59.887 --> 0:02:09.412 +This doesn't have changed so we need this +type of parallel data where we have a source + +0:02:09.412 --> 0:02:13.462 +sentence aligned with a target data. + +0:02:13.493 --> 0:02:19.135 +We have now a strong model here, a very good +model to do that. + +0:02:19.135 --> 0:02:22.091 +However, we always rely on this. + +0:02:22.522 --> 0:02:28.395 +For languages, high risk language pairs say +from German to English or other European languages, + +0:02:28.395 --> 0:02:31.332 +there is decent amount, at least for similarly. + +0:02:31.471 --> 0:02:37.630 +But even there if we are going to very specific +domains it might get difficult and then your + +0:02:37.630 --> 0:02:43.525 +system performance might drop because if you +want to translate now some medical text for + +0:02:43.525 --> 0:02:50.015 +example of course you need to also have peril +data in the medical domain to know how to translate + +0:02:50.015 --> 0:02:50.876 +these types. + +0:02:51.231 --> 0:02:55.264 +Phrases how to use the vocabulary and so on +in the style. + +0:02:55.915 --> 0:03:04.887 +And if you are going to other languages, there +is a lot bigger challenge and the question + +0:03:04.887 --> 0:03:05.585 +there. + +0:03:05.825 --> 0:03:09.649 +So is really this the only resource we can +use. + +0:03:09.889 --> 0:03:19.462 +Can be adapted or training phase in order +to also make use of other types of models that + +0:03:19.462 --> 0:03:27.314 +might enable us to build strong systems with +other types of information. + +0:03:27.707 --> 0:03:35.276 +And that we will look into now in the next +starting from from just saying the next election. 
+ +0:03:35.515 --> 0:03:40.697 +So this idea we already have covered on Tuesday. + +0:03:40.697 --> 0:03:45.350 +One very successful idea for this is to do. + +0:03:45.645 --> 0:03:51.990 +So that we're no longer doing translation +between languages, but we can do translation + +0:03:51.990 --> 0:03:55.928 +between languages and share common knowledge +between. + +0:03:56.296 --> 0:04:04.703 +And you also learned about things like zero +shots machine translation so you can translate + +0:04:04.703 --> 0:04:06.458 +between languages. + +0:04:06.786 --> 0:04:09.790 +Which is the case for many, many language +pairs. + +0:04:10.030 --> 0:04:19.209 +Like even with German, you have not translation +parallel data to all languages around the world, + +0:04:19.209 --> 0:04:26.400 +or most of them you have it to the Europeans +once, maybe even for Japanese. + +0:04:26.746 --> 0:04:35.332 +There is quite a lot of data, for example +English to Japanese, but German to Japanese + +0:04:35.332 --> 0:04:37.827 +or German to Vietnamese. + +0:04:37.827 --> 0:04:41.621 +There is some data from Multilingual. + +0:04:42.042 --> 0:04:54.584 +So there is a very promising direction if +you want to build translation systems between + +0:04:54.584 --> 0:05:00.142 +language peers, typically not English. + +0:05:01.221 --> 0:05:05.887 +And the other ideas, of course, we don't have +to either just search for it. + +0:05:06.206 --> 0:05:12.505 +Some work on a data crawling so if I don't +have a corpus directly or I don't have an high + +0:05:12.505 --> 0:05:19.014 +quality corpus like from the European Parliament +for a TED corpus so maybe it makes sense to + +0:05:19.014 --> 0:05:23.913 +crawl more data and get additional sources +so you can build stronger. + +0:05:24.344 --> 0:05:35.485 +There has been quite a big effort in Europe +to collect really large data sets for parallel + +0:05:35.485 --> 0:05:36.220 +data. + +0:05:36.220 --> 0:05:40.382 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:50.037 +But how can we explicitly crawl data? + +0:05:50.037 --> 0:05:52.070 +Which is some of a peril? + +0:05:52.132 --> 0:05:58.461 +So there is in the Internet quite a lot of +data which has been company websites which + +0:05:58.461 --> 0:06:01.626 +have been translated and things like that. + +0:06:01.626 --> 0:06:05.158 +So how can you extract them parallel fragments? + +0:06:06.566 --> 0:06:13.404 +That is typically more noisy than where you +do more at hands where mean if you have Parliament. + +0:06:13.693 --> 0:06:17.680 +You can do some rules how to extract parallel +things. + +0:06:17.680 --> 0:06:24.176 +Here there is more to it, so the quality is +later maybe not as good, but normally scale + +0:06:24.176 --> 0:06:26.908 +is then a possibility to address it. + +0:06:26.908 --> 0:06:30.304 +So you just have so much more data that even. + +0:06:33.313 --> 0:06:40.295 +The other thing can be used monolingual data +and monolingual data has a big advantage that + +0:06:40.295 --> 0:06:46.664 +we can have a huge amount of that so that you +can be autocrawed from the Internet. + +0:06:46.664 --> 0:06:51.728 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:59.558 +There is just so much more magnitude of monolingual +data so that it might be very helpful. 
+ +0:06:59.559 --> 0:07:03.054 +We can do that in statistical machine translation. + +0:07:03.054 --> 0:07:06.755 +It was quite easy to integrate using language +models. + +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. + +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. 
+ +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. + +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. + +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. 
+ +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. + +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. 
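[Editor's note] As a reference for the parallel combination discussed above, here is a minimal PyTorch sketch of interpolating the NMT decoder's distribution with an external language model's distribution over the same target vocabulary. The function name and the interpolation weight are illustrative; with lm_weight = 0.5 this is exactly the plain averaging of the two probability distributions mentioned in the lecture, and interpolating log-probabilities instead is the closely related "shallow fusion" variant.

```python
import torch

def combine_parallel(nmt_logits, lm_logits, lm_weight=0.5):
    """Parallel combination of an NMT decoder and a separately trained
    language model for one decoding step.

    Both tensors have shape [batch, vocab] over the same target vocabulary;
    lm_weight = 0.5 reproduces the plain average of the two distributions.
    """
    nmt_prob = torch.softmax(nmt_logits, dim=-1)
    lm_prob = torch.softmax(lm_logits, dim=-1)
    mixed = (1.0 - lm_weight) * nmt_prob + lm_weight * lm_prob
    return mixed.argmax(dim=-1)  # greedy choice of the next target token
```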
+ +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. + +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:53.199 +Case so the actual word that we transferred +into a numerical lecture, and this is an input + +0:21:53.199 --> 0:21:54.838 +into the. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. 
+ +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? + +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. 
+ +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. + +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. 
+ +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. + +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. 
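[Editor's note] A minimal sketch of the back-translation recipe described above, for building a German-to-English system: English monolingual data is translated into German with a reverse (English-to-German) system, so the noisy machine output only ever appears on the source side, while the human-written English stays on the target side. The `en_de_model.translate()` method and the mixing helper are hypothetical names, not from the lecture; the 1:1 mixing ratio is the rule of thumb mentioned above.

```python
import random

def back_translate(en_monolingual, en_de_model):
    """Create synthetic German->English pairs from English monolingual text."""
    synthetic = []
    for en_sentence in en_monolingual:
        de_synthetic = en_de_model.translate(en_sentence)  # may be imperfect
        synthetic.append((de_synthetic, en_sentence))       # (source, target)
    return synthetic

def build_training_data(real_parallel, synthetic, ratio=1.0):
    """Mix real and synthetic pairs; equal amounts is a common starting point."""
    n = int(len(real_parallel) * ratio)
    mixed = real_parallel + random.sample(synthetic, min(n, len(synthetic)))
    random.shuffle(mixed)
    return mixed
```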
+ +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. + +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. 
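[Editor's note] A sketch of the iterated back-translation loop just discussed, under the assumption of a hypothetical `train(pairs)` helper that returns a model with a `.translate()` method. Each round retrains one direction on the real parallel data plus the freshest synthetic data produced by the other direction; as noted above, the gains usually flatten after one or two rounds.

```python
def iterative_back_translation(parallel, mono_de, mono_en, train, rounds=2):
    """Alternate back-translation between the two directions.

    `parallel` holds (de, en) pairs; mono_de / mono_en are monolingual
    sentence lists; `train` is any routine that fits a translation model.
    """
    de_en = train(parallel)                                  # German->English
    en_de = train([(en, de) for de, en in parallel])         # English->German
    for _ in range(rounds):
        # Synthetic data for German->English: back-translate English mono data.
        synth_de_en = [(en_de.translate(en), en) for en in mono_en]
        de_en = train(parallel + synth_de_en)
        # Synthetic data for English->German: back-translate German mono data.
        synth_en_de = [(de_en.translate(de), de) for de in mono_de]
        en_de = train([(en, de) for de, en in parallel] + synth_en_de)
    return de_en, en_de
```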
+ +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. + +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. 
+ +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. + +0:41:37.877 --> 0:41:53.924 +So you can build a system which is knowledge, +which is just trained on large amounts of data. + +0:41:56.376 --> 0:42:01.564 +So the question is maybe what type of information +so what type of models can you? + +0:42:01.821 --> 0:42:05.277 +And we want today to look at briefly at swings. + +0:42:05.725 --> 0:42:08.704 +First, that was what was initially done. + +0:42:08.704 --> 0:42:15.314 +It wasn't as famous as in machine translation +as in other things, but it's also used there + +0:42:15.314 --> 0:42:21.053 +and that is to use static word embedding, so +just the first step we know here. + +0:42:21.221 --> 0:42:28.981 +So we have this mapping from the one hot to +a small continuous word representation. + +0:42:29.229 --> 0:42:38.276 +Using this one in your NG system, so you can, +for example, replace the embedding layer by + +0:42:38.276 --> 0:42:38.779 +the. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.517 +And we're always in this pre-training phase +and have the thing the advantage is. + +0:42:48.468 --> 0:42:52.411 +More data than the trade off, so you can get +better. + +0:42:52.411 --> 0:42:59.107 +The disadvantage is, does anybody have an +idea of what might be the disadvantage of using + +0:42:59.107 --> 0:43:00.074 +things like. + +0:43:04.624 --> 0:43:12.175 +What was one mentioned today giving like big +advantage of the system compared to previous. + +0:43:20.660 --> 0:43:25.134 +Where one advantage was the enter end training, +so you have the enter end training so that + +0:43:25.134 --> 0:43:27.937 +all parameters and all components play optimal +together. + +0:43:28.208 --> 0:43:33.076 +If you know pre-train something on one fast, +it may be no longer optimal fitting to everything + +0:43:33.076 --> 0:43:33.384 +else. + +0:43:33.893 --> 0:43:37.862 +So what do pretending or not? + +0:43:37.862 --> 0:43:48.180 +It depends on how important everything is +optimal together and how important. + +0:43:48.388 --> 0:43:50.454 +Of large amount. + +0:43:50.454 --> 0:44:00.541 +The pre-change one is so much better that +it's helpful, and the advantage of that. + +0:44:00.600 --> 0:44:11.211 +Getting everything optimal together, yes, +we would use random instructions for raising. + +0:44:11.691 --> 0:44:26.437 +The problem is you might be already in some +area where it's not easy to get. + +0:44:26.766 --> 0:44:35.329 +But often in some way right, so often it's +not about your really worse pre trained monolepsy. + +0:44:35.329 --> 0:44:43.254 +If you're going already in some direction, +and if this is not really optimal for you,. + +0:44:43.603 --> 0:44:52.450 +But if you're not really getting better because +you have a decent amount of data, it's so different + +0:44:52.450 --> 0:44:52.981 +that. + +0:44:53.153 --> 0:44:59.505 +Initially it wasn't a machine translation +done so much because there are more data in + +0:44:59.505 --> 0:45:06.153 +MPs than in other tasks, but now with really +large amounts of monolingual data we do some + +0:45:06.153 --> 0:45:09.403 +type of pretraining in currently all state. + +0:45:12.632 --> 0:45:14.302 +The other one is okay now. + +0:45:14.302 --> 0:45:18.260 +It's always like how much of the model do +you plea track a bit? + +0:45:18.658 --> 0:45:22.386 +To the other one you can do contextural word +embedded. 
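[Editor's note] Before the lecture moves on to contextual embeddings, a minimal PyTorch sketch of the static-embedding idea just described: initialize the NMT embedding layer with vectors pre-trained on large monolingual data, optionally freezing them. The `pretrained` mapping is an assumption of the sketch (word index to vector), not an interface from the lecture.

```python
import torch
import torch.nn as nn

def load_pretrained_embeddings(vocab_size, emb_dim, pretrained, freeze=False):
    """Build an embedding layer and overwrite rows with pre-trained vectors.

    `pretrained` maps word indices to vectors of length emb_dim (e.g. from
    word2vec); words not covered keep their random initialization.
    """
    emb = nn.Embedding(vocab_size, emb_dim)
    with torch.no_grad():
        for idx, vector in pretrained.items():
            emb.weight[idx] = torch.tensor(vector)
    emb.weight.requires_grad = not freeze   # freeze to keep the pre-trained knowledge
    return emb
```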
+ +0:45:22.386 --> 0:45:28.351 +That is something like bird or Roberta where +you train already a sequence model and the + +0:45:28.351 --> 0:45:34.654 +embeddings you're using are no longer specific +for word but they are also taking the context + +0:45:34.654 --> 0:45:35.603 +into account. + +0:45:35.875 --> 0:45:50.088 +The embedding you're using is no longer depending +on the word itself but on the whole sentence, + +0:45:50.088 --> 0:45:54.382 +so you can use this context. + +0:45:55.415 --> 0:46:02.691 +You can use similar things also in the decoder +just by having layers which don't have access + +0:46:02.691 --> 0:46:12.430 +to the source, but there it still might have +and these are typically models like: And finally + +0:46:12.430 --> 0:46:14.634 +they will look at the end. + +0:46:14.634 --> 0:46:19.040 +You can also have models which are already +sequenced. + +0:46:19.419 --> 0:46:28.561 +So you may be training a sequence to sequence +models. + +0:46:28.561 --> 0:46:35.164 +You have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you'll find tuning. + +0:46:47.227 --> 0:46:59.614 +But let's first do a bit of step back and +look into what are the different things. + +0:46:59.614 --> 0:47:02.151 +The first thing. + +0:47:02.382 --> 0:47:11.063 +The wooden bettings are just this first layer +and you can train them with feedback annual + +0:47:11.063 --> 0:47:12.028 +networks. + +0:47:12.212 --> 0:47:22.761 +But you can also train them with an N language +model, and by now you hopefully have also seen + +0:47:22.761 --> 0:47:27.699 +that you cannot transform a language model. + +0:47:30.130 --> 0:47:37.875 +So this is how you can train them and you're +training them. + +0:47:37.875 --> 0:47:45.234 +For example, to speak the next word that is +the easiest. + +0:47:45.525 --> 0:47:55.234 +And that is what is now referred to as South +Supervised Learning and, for example, all the + +0:47:55.234 --> 0:48:00.675 +big large language models like Chad GPT and +so on. + +0:48:00.675 --> 0:48:03.129 +They are trained with. + +0:48:03.823 --> 0:48:15.812 +So that is where you can hopefully learn how +a word is used because you always try to previct + +0:48:15.812 --> 0:48:17.725 +the next word. + +0:48:19.619 --> 0:48:27.281 +Word embedding: Why do you keep the first +look at the word embeddings and the use of + +0:48:27.281 --> 0:48:29.985 +word embeddings for our task? + +0:48:29.985 --> 0:48:38.007 +The main advantage was it might be only the +first layer where you typically have most of + +0:48:38.007 --> 0:48:39.449 +the parameters. + +0:48:39.879 --> 0:48:57.017 +Most of your parameters already on the large +data, then on your target data you have to + +0:48:57.017 --> 0:48:59.353 +train less. + +0:48:59.259 --> 0:49:06.527 +Big difference that your input size is so +much bigger than the size of the novel in size. + +0:49:06.626 --> 0:49:17.709 +So it's a normally sign, maybe like, but your +input and banning size is something like. + +0:49:17.709 --> 0:49:20.606 +Then here you have to. + +0:49:23.123 --> 0:49:30.160 +While here you see it's only like zero point +five times as much in the layer. + +0:49:30.750 --> 0:49:36.534 +So here is where most of your parameters are, +which means if you already replace the word + +0:49:36.534 --> 0:49:41.739 +embeddings, they might look a bit small in +your overall and in key architecture. 
+ +0:49:41.739 --> 0:49:47.395 +It's where most of the things are, and if +you're doing that you already have really big + +0:49:47.395 --> 0:49:48.873 +games and can do that. + +0:49:57.637 --> 0:50:01.249 +The thing is we have seen these were the bettings. + +0:50:01.249 --> 0:50:04.295 +They can be very good use for other types. + +0:50:04.784 --> 0:50:08.994 +You learn some general relations between words. + +0:50:08.994 --> 0:50:17.454 +If you're doing this type of language modeling +cast, you predict: The one thing is you have + +0:50:17.454 --> 0:50:24.084 +a lot of data, so the one question is we want +to have data to trade a model. + +0:50:24.084 --> 0:50:28.734 +The other thing, the tasks need to be somehow +useful. + +0:50:29.169 --> 0:50:43.547 +If you would predict the first letter of the +word, then you wouldn't learn anything about + +0:50:43.547 --> 0:50:45.144 +the word. + +0:50:45.545 --> 0:50:53.683 +And the interesting thing is people have looked +at these wood embeddings. + +0:50:53.954 --> 0:50:58.550 +And looking at the word embeddings. + +0:50:58.550 --> 0:51:09.276 +You can ask yourself how they look and visualize +them by doing dimension reduction. + +0:51:09.489 --> 0:51:13.236 +Don't know if you and you are listening to +artificial intelligence. + +0:51:13.236 --> 0:51:15.110 +Advanced artificial intelligence. + +0:51:15.515 --> 0:51:23.217 +We had on yesterday there how to do this type +of representation, but you can do this time + +0:51:23.217 --> 0:51:29.635 +of representation, and now you're seeing interesting +things that normally. + +0:51:30.810 --> 0:51:41.027 +Now you can represent a here in a three dimensional +space with some dimension reduction. + +0:51:41.027 --> 0:51:46.881 +For example, the relation between male and +female. + +0:51:47.447 --> 0:51:56.625 +So this vector between the male and female +version of something is always not the same, + +0:51:56.625 --> 0:51:58.502 +but it's related. + +0:51:58.718 --> 0:52:14.522 +So you can do a bit of maths, so you do take +king, you subtract this vector, add this vector. + +0:52:14.894 --> 0:52:17.591 +So that means okay, there is really something +stored. + +0:52:17.591 --> 0:52:19.689 +Some information are stored in that book. + +0:52:20.040 --> 0:52:22.621 +Similar, you can do it with Bob Hansen. + +0:52:22.621 --> 0:52:25.009 +See here swimming slam walking walk. + +0:52:25.265 --> 0:52:34.620 +So again these vectors are not the same, but +they are related. + +0:52:34.620 --> 0:52:42.490 +So you learn something from going from here +to here. + +0:52:43.623 --> 0:52:49.761 +Or semantically, the relations between city +and capital have exactly the same sense. + +0:52:51.191 --> 0:52:56.854 +And people had even done that question answering +about that if they showed the diembeddings + +0:52:56.854 --> 0:52:57.839 +and the end of. + +0:52:58.218 --> 0:53:06.711 +All you can also do is don't trust the dimensions +of the reaction because maybe there is something. + +0:53:06.967 --> 0:53:16.863 +You can also look into what happens really +in the individual space. + +0:53:16.863 --> 0:53:22.247 +What is the nearest neighbor of the. + +0:53:22.482 --> 0:53:29.608 +So you can take the relationship between France +and Paris and add it to Italy and you'll. + +0:53:30.010 --> 0:53:33.078 +You can do big and bigger and you have small +and smaller and stuff. + +0:53:33.593 --> 0:53:49.417 +Because it doesn't work everywhere, there +is also some typical dish here in German. 
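[Editor's note] More analogy examples follow; as a small aside, the vector arithmetic just described can be written down directly. A numpy sketch, assuming `vectors` maps words to arrays; nearest neighbours are ranked by cosine similarity and the query words themselves are excluded.

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Return words x such that a is to b as c is to x,
    e.g. analogy('man', 'king', 'woman', vectors) -> ['queen'] (ideally)."""
    query = vectors[b] - vectors[a] + vectors[c]
    query = query / np.linalg.norm(query)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue                          # exclude the query words themselves
        scores[word] = np.dot(query, vec) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)[:topn]
```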
+ +0:53:51.491 --> 0:54:01.677 +You can do what the person is doing for famous +ones, of course only like Einstein scientists + +0:54:01.677 --> 0:54:06.716 +that find midfielders not completely correct. + +0:54:06.846 --> 0:54:10.134 +You see the examples are a bit old. + +0:54:10.134 --> 0:54:15.066 +The politicians are no longer they am, but +of course. + +0:54:16.957 --> 0:54:26.759 +What people have done there, especially at +the beginning training our end language model, + +0:54:26.759 --> 0:54:28.937 +was very expensive. + +0:54:29.309 --> 0:54:38.031 +So one famous model was, but we are not really +interested in the language model performance. + +0:54:38.338 --> 0:54:40.581 +Think something good to keep in mind. + +0:54:40.581 --> 0:54:42.587 +What are we really interested in? + +0:54:42.587 --> 0:54:45.007 +Do we really want to have an R&N no? + +0:54:45.007 --> 0:54:48.607 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.500 +And so successful and very successful was +this word to vet. + +0:54:55.535 --> 0:54:56.865 +The idea is okay. + +0:54:56.865 --> 0:55:03.592 +We are not training real language one, making +it even simpler and doing this, for example, + +0:55:03.592 --> 0:55:05.513 +continuous peck of words. + +0:55:05.513 --> 0:55:12.313 +We're just having four input tokens and we're +predicting what is the word in the middle and + +0:55:12.313 --> 0:55:15.048 +this is just like two linear layers. + +0:55:15.615 --> 0:55:21.627 +So it's even simplifying things and making +the calculation faster because that is what + +0:55:21.627 --> 0:55:22.871 +we're interested. + +0:55:23.263 --> 0:55:32.897 +All this continuous skip ground models with +these other models which refer to as where + +0:55:32.897 --> 0:55:34.004 +to where. + +0:55:34.234 --> 0:55:42.394 +Where you have one equal word and the other +way around, you're predicting the four words + +0:55:42.394 --> 0:55:43.585 +around them. + +0:55:43.585 --> 0:55:45.327 +It's very similar. + +0:55:45.327 --> 0:55:48.720 +The task is in the end very similar. + +0:55:51.131 --> 0:56:01.407 +Before we are going to the next point, anything +about normal weight vectors or weight embedding. + +0:56:04.564 --> 0:56:07.794 +The next thing is contexture. + +0:56:07.794 --> 0:56:12.208 +Word embeddings and the idea is helpful. + +0:56:12.208 --> 0:56:19.206 +However, we might even be able to get more +from one lingo layer. + +0:56:19.419 --> 0:56:31.732 +And now in the word that is overlap of these +two meanings, so it represents both the meaning + +0:56:31.732 --> 0:56:33.585 +of can do it. + +0:56:34.834 --> 0:56:40.410 +But we might be able to in the pre-trained +model already disambiguate this because they + +0:56:40.410 --> 0:56:41.044 +are used. + +0:56:41.701 --> 0:56:53.331 +So if we can have a model which can not only +represent a word but can also represent the + +0:56:53.331 --> 0:56:58.689 +meaning of the word within the context,. + +0:56:59.139 --> 0:57:03.769 +So then we are going to context your word +embeddings. + +0:57:03.769 --> 0:57:07.713 +We are really having a representation in the. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:23.791 +The hidden state represents what is currently +said, but it's focusing on what is the last + +0:57:23.791 --> 0:57:29.303 +one, so it's some of the representation. 
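[Editor's note] A minimal PyTorch sketch of the continuous bag-of-words setup just described: essentially two linear layers, averaging the context word vectors and predicting the centre word. Window size and dimensions are illustrative; training minimises cross-entropy against the actual centre word, and afterwards the embedding matrix holds the word vectors one is actually interested in.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """word2vec-style CBOW: predict the centre word from its context words."""

    def __init__(self, vocab_size, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # input projection
        self.out = nn.Linear(emb_dim, vocab_size)        # output projection

    def forward(self, context_ids):
        # context_ids: [batch, window] indices of the surrounding words
        ctx = self.embed(context_ids).mean(dim=1)        # average context vectors
        return self.out(ctx)                             # scores for the centre word
```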
+ +0:57:29.509 --> 0:57:43.758 +The first one doing that is something like +the Elmo paper where they instead of this is + +0:57:43.758 --> 0:57:48.129 +the normal language model. + +0:57:48.008 --> 0:57:50.714 +Within the third, predicting the fourth, and +so on. + +0:57:50.714 --> 0:57:53.004 +So you are always predicting the next work. + +0:57:53.193 --> 0:57:57.335 +The architecture is the heaven words embedding +layer and then layers. + +0:57:57.335 --> 0:58:03.901 +See you, for example: And now instead of using +this one in the end, you're using here this + +0:58:03.901 --> 0:58:04.254 +one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:18.610 +We can train it in a language model style +always predicting the next word, but we have + +0:58:18.610 --> 0:58:21.088 +more information trained there. + +0:58:21.088 --> 0:58:26.123 +Therefore, in the system it has to learn less +additional things. + +0:58:27.167 --> 0:58:31.261 +And there is one Edendang which is done currently +in GPS. + +0:58:31.261 --> 0:58:38.319 +The only difference is that we have more layers, +bigger size, and we're using transformer neurocell + +0:58:38.319 --> 0:58:40.437 +potential instead of the RNA. + +0:58:40.437 --> 0:58:45.095 +But that is how you train like some large +language models at the. + +0:58:46.746 --> 0:58:55.044 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:55.044 --> 0:59:02.942 +So if you think of this one as a contextual +representation of the third word,. + +0:59:07.587 --> 0:59:16.686 +Is representing a three in the context of +a sentence, however only in the context of + +0:59:16.686 --> 0:59:18.185 +the previous. + +0:59:18.558 --> 0:59:27.413 +However, we have an architecture which can +also take both sides and we have used that + +0:59:27.413 --> 0:59:30.193 +already in the ink holder. + +0:59:30.630 --> 0:59:34.264 +So we could do the iron easily on your, also +in the backward direction. + +0:59:34.874 --> 0:59:42.826 +By just having the states the other way around +and then we couldn't combine the forward and + +0:59:42.826 --> 0:59:49.135 +the forward into a joint one where we are doing +this type of prediction. + +0:59:49.329 --> 0:59:50.858 +So you have the word embedding. + +0:59:51.011 --> 1:00:02.095 +Then you have two in the states, one on the +forward arm and one on the backward arm, and + +1:00:02.095 --> 1:00:10.314 +then you can, for example, take the cocagenation +of both of them. + +1:00:10.490 --> 1:00:23.257 +Now this same here represents mainly this +word because this is what both puts in it last + +1:00:23.257 --> 1:00:30.573 +and we know is focusing on what is happening +last. + +1:00:31.731 --> 1:00:40.469 +However, there is a bit of difference when +training that as a language model you already + +1:00:40.469 --> 1:00:41.059 +have. + +1:00:43.203 --> 1:00:44.956 +Maybe There's Again This Masking. + +1:00:46.546 --> 1:00:47.748 +That is one solution. + +1:00:47.748 --> 1:00:52.995 +First of all, why we can't do it is the information +you leak it, so you cannot just predict the + +1:00:52.995 --> 1:00:53.596 +next word. + +1:00:53.596 --> 1:00:58.132 +If we just predict the next word in this type +of model, that's a very simple task. + +1:00:58.738 --> 1:01:09.581 +You know the next word because it's influencing +this hidden state predicting something is not + +1:01:09.581 --> 1:01:11.081 +a good task. 
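[Editor's note] The next part of the lecture introduces the ELMo-style version of this idea: run a second RNN right to left and concatenate the forward and backward hidden states at every position. As a preview, a minimal PyTorch sketch of that combination; layer sizes are illustrative and this shows only the representation side, not the training objective discussed below.

```python
import torch
import torch.nn as nn

class ContextualEmbedder(nn.Module):
    """Concatenate forward and backward RNN states per position."""

    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):
        # token_ids: [batch, seq_len]
        x = self.embed(token_ids)
        h_fwd, _ = self.fwd(x)                        # left-to-right states
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))  # right-to-left states
        h_bwd = torch.flip(h_bwd, dims=[1])           # realign to token positions
        return torch.cat([h_fwd, h_bwd], dim=-1)      # [batch, seq, 2*hidden]
```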
+ +1:01:11.081 --> 1:01:18.455 +You have to define: Because in this case what +will end with the system will just ignore these + +1:01:18.455 --> 1:01:22.966 +estates and what will learn is copy this information +directly in here. + +1:01:23.343 --> 1:01:31.218 +So it would be representing this word and +you would have nearly a perfect model because + +1:01:31.218 --> 1:01:38.287 +you only need to find encoding where you can +encode all words somehow in this. + +1:01:38.458 --> 1:01:44.050 +The only thing can learn is that turn and +encode all my words in this upper hidden. + +1:01:44.985 --> 1:01:53.779 +Therefore, it's not really useful, so we need +to find a bit of different ways out. + +1:01:55.295 --> 1:01:57.090 +There is a masking one. + +1:01:57.090 --> 1:02:03.747 +I'll come to that shortly just a bit that +other things also have been done, so the other + +1:02:03.747 --> 1:02:06.664 +thing is not to directly combine them. + +1:02:06.664 --> 1:02:13.546 +That was in the animal paper, so you have +them forward R&M and you keep them completely + +1:02:13.546 --> 1:02:14.369 +separated. + +1:02:14.594 --> 1:02:20.458 +So you never merged to state. + +1:02:20.458 --> 1:02:33.749 +At the end, the representation of the word +is now from the forward. + +1:02:33.873 --> 1:02:35.953 +So it's always the hidden state before the +good thing. + +1:02:36.696 --> 1:02:41.286 +These two you join now to your to the representation. + +1:02:42.022 --> 1:02:48.685 +And then you have now a representation also +about like the whole sentence for the word, + +1:02:48.685 --> 1:02:51.486 +but there is no information leakage. + +1:02:51.486 --> 1:02:58.149 +One way of doing this is instead of doing +a bidirection along you do a forward pass and + +1:02:58.149 --> 1:02:59.815 +then join the hidden. + +1:03:00.380 --> 1:03:05.960 +So you can do that in all layers. + +1:03:05.960 --> 1:03:16.300 +In the end you do the forwarded layers and +you get the hidden. + +1:03:16.596 --> 1:03:19.845 +However, it's a bit of a complicated. + +1:03:19.845 --> 1:03:25.230 +You have to keep both separate and merge things +so can you do. + +1:03:27.968 --> 1:03:33.030 +And that is the moment where like the big. + +1:03:34.894 --> 1:03:39.970 +The big success of the burnt model was used +where it okay. + +1:03:39.970 --> 1:03:47.281 +Maybe in bite and rich case it's not good +to do the next word prediction, but we can + +1:03:47.281 --> 1:03:48.314 +do masking. + +1:03:48.308 --> 1:03:56.019 +Masking mainly means we do a prediction of +something in the middle or some words. + +1:03:56.019 --> 1:04:04.388 +So the idea is if we have the input, we are +putting noise into the input, removing them, + +1:04:04.388 --> 1:04:07.961 +and then the model we are interested. + +1:04:08.048 --> 1:04:15.327 +Now there can be no information leakage because +this wasn't predicting that one is a big challenge. + +1:04:16.776 --> 1:04:19.957 +Do any assumption about our model? + +1:04:19.957 --> 1:04:26.410 +It doesn't need to be a forward model or a +backward model or anything. + +1:04:26.410 --> 1:04:29.500 +You can always predict the three. + +1:04:30.530 --> 1:04:34.844 +There's maybe one bit of a disadvantage. + +1:04:34.844 --> 1:04:40.105 +Do you see what could be a bit of a problem +this? + +1:05:00.000 --> 1:05:06.429 +Yes, so yeah, you can of course mask more, +but to see it more globally, just first assume + +1:05:06.429 --> 1:05:08.143 +you're only masked one. 
+ +1:05:08.143 --> 1:05:13.930 +For the whole sentence, we get one feedback +signal, like what is the word three. + +1:05:13.930 --> 1:05:22.882 +So we have one training example: If you do +the language modeling taste, we predicted here, + +1:05:22.882 --> 1:05:24.679 +we predicted here. + +1:05:25.005 --> 1:05:26.735 +So we have number of tokens. + +1:05:26.735 --> 1:05:30.970 +For each token we have a feet pad and say +what is the best correction. + +1:05:31.211 --> 1:05:43.300 +So in this case this is less efficient because +we are getting less feedback signals on what + +1:05:43.300 --> 1:05:45.797 +we should predict. + +1:05:48.348 --> 1:05:56.373 +So and bird, the main ideas are that you're +doing this bidirectional model with masking. + +1:05:56.373 --> 1:05:59.709 +It's using transformer architecture. + +1:06:00.320 --> 1:06:06.326 +There are two more minor changes. + +1:06:06.326 --> 1:06:16.573 +We'll see that this next word prediction is +another task. + +1:06:16.957 --> 1:06:30.394 +You want to learn more about what language +is to really understand following a story or + +1:06:30.394 --> 1:06:35.127 +their independent tokens into. + +1:06:38.158 --> 1:06:42.723 +The input is using word units as we use it. + +1:06:42.723 --> 1:06:50.193 +It has some special token that is framing +for the next word prediction. + +1:06:50.470 --> 1:07:04.075 +It's more for classification task because +you may be learning a general representation + +1:07:04.075 --> 1:07:07.203 +as a full sentence. + +1:07:07.607 --> 1:07:19.290 +You're doing segment embedding, so you have +an embedding for it. + +1:07:19.290 --> 1:07:24.323 +This is the first sentence. + +1:07:24.684 --> 1:07:29.099 +Now what is more challenging is this masking. + +1:07:29.099 --> 1:07:30.827 +What do you mask? + +1:07:30.827 --> 1:07:35.050 +We already have the crush enough or should. + +1:07:35.275 --> 1:07:42.836 +So there has been afterwards eating some work +like, for example, a bearer. + +1:07:42.836 --> 1:07:52.313 +It's not super sensitive, but if you do it +completely wrong then you're not letting anything. + +1:07:52.572 --> 1:07:54.590 +That's Then Another Question There. + +1:07:56.756 --> 1:08:04.594 +Should I mask all types of should I always +mask the footwork or if I have a subword to + +1:08:04.594 --> 1:08:10.630 +mask only like a subword and predict them based +on the other ones? + +1:08:10.630 --> 1:08:14.504 +Of course, it's a bit of a different task. + +1:08:14.894 --> 1:08:21.210 +If you know three parts of the words, it might +be easier to guess the last because they here + +1:08:21.210 --> 1:08:27.594 +took the easiest selection, so not considering +words anymore at all because you're doing that + +1:08:27.594 --> 1:08:32.280 +in the preprocessing and just taking always +words and like subwords. + +1:08:32.672 --> 1:08:36.089 +Think in group there is done differently. + +1:08:36.089 --> 1:08:40.401 +They mark always the full words, but guess +it's not. + +1:08:41.001 --> 1:08:46.044 +And then what to do with the mask word in +eighty percent of the cases. + +1:08:46.044 --> 1:08:50.803 +If the word is masked, they replace it with +a special token thing. + +1:08:50.803 --> 1:08:57.197 +This is a mask token in ten percent they put +in some random other token in there, and ten + +1:08:57.197 --> 1:08:59.470 +percent they keep it on change. + +1:09:02.202 --> 1:09:10.846 +And then what you can do is also this next +word prediction. + +1:09:10.846 --> 1:09:14.880 +The man went to Mass Store. 
+ +1:09:14.880 --> 1:09:17.761 +He bought a gallon. + +1:09:18.418 --> 1:09:24.088 +So may you see you're joining them, you're +doing both masks and prediction that you're. + +1:09:24.564 --> 1:09:29.449 +Is a penguin mask or flyless birds. + +1:09:29.449 --> 1:09:41.390 +These two sentences have nothing to do with +each other, so you can do also this type of + +1:09:41.390 --> 1:09:43.018 +prediction. + +1:09:47.127 --> 1:09:57.043 +And then the whole bird model, so here you +have the input here to transform the layers, + +1:09:57.043 --> 1:09:58.170 +and then. + +1:09:58.598 --> 1:10:17.731 +And this model was quite successful in general +applications. + +1:10:17.937 --> 1:10:27.644 +However, there is like a huge thing of different +types of models coming from them. + +1:10:27.827 --> 1:10:38.709 +So based on others these supervised molds +like a whole setup came out of there and now + +1:10:38.709 --> 1:10:42.086 +this is getting even more. + +1:10:42.082 --> 1:10:46.640 +With availability of a large language model +than the success. + +1:10:47.007 --> 1:10:48.436 +We have now even larger ones. + +1:10:48.828 --> 1:10:50.961 +Interestingly, it goes a bit. + +1:10:50.910 --> 1:10:57.847 +Change the bit again from like more the spider +action model to uni directional models. + +1:10:57.847 --> 1:11:02.710 +Are at the moment maybe a bit more we're coming +to them now? + +1:11:02.710 --> 1:11:09.168 +Do you see one advantage while what is another +event and we have the efficiency? + +1:11:09.509 --> 1:11:15.901 +Is one other reason why you are sometimes +more interested in uni-direction models than + +1:11:15.901 --> 1:11:17.150 +in bi-direction. + +1:11:22.882 --> 1:11:30.220 +It depends on the pass, but for example for +a language generation pass, the eccard is not + +1:11:30.220 --> 1:11:30.872 +really. + +1:11:32.192 --> 1:11:40.924 +It doesn't work so if you want to do a generation +like the decoder you don't know the future + +1:11:40.924 --> 1:11:42.896 +so you cannot apply. + +1:11:43.223 --> 1:11:53.870 +So this time of model can be used for the +encoder in an encoder model, but it cannot + +1:11:53.870 --> 1:11:57.002 +be used for the decoder. + +1:12:00.000 --> 1:12:05.012 +That's a good view to the next overall cast +of models. + +1:12:05.012 --> 1:12:08.839 +Perhaps if you view it from the sequence. + +1:12:09.009 --> 1:12:12.761 +We have the encoder base model. + +1:12:12.761 --> 1:12:16.161 +That's what we just look at. + +1:12:16.161 --> 1:12:20.617 +They are bidirectional and typically. + +1:12:20.981 --> 1:12:22.347 +That Is the One We Looked At. + +1:12:22.742 --> 1:12:34.634 +At the beginning is the decoder based model, +so see out in regressive models which are unidirective + +1:12:34.634 --> 1:12:42.601 +like an based model, and there we can do the +next word prediction. + +1:12:43.403 --> 1:12:52.439 +And what you can also do first, and there +you can also have a special things called prefix + +1:12:52.439 --> 1:12:53.432 +language. + +1:12:54.354 --> 1:13:05.039 +Because we are saying it might be helpful +that some of your input can also use bi-direction. + +1:13:05.285 --> 1:13:12.240 +And that is somehow doing what it is called +prefix length. + +1:13:12.240 --> 1:13:19.076 +On the first tokens you directly give your +bidirectional. + +1:13:19.219 --> 1:13:28.774 +So you somehow merge that and that mainly +works only in transformer based models because. + +1:13:29.629 --> 1:13:33.039 +There is no different number of parameters +in our end. 
+ +1:13:33.039 --> 1:13:34.836 +We need a back foot our end. + +1:13:34.975 --> 1:13:38.533 +Transformer: The only difference is how you +mask your attention. + +1:13:38.878 --> 1:13:44.918 +We have seen that in the anchoder and decoder +the number of parameters is different because + +1:13:44.918 --> 1:13:50.235 +you do cross attention, but if you do forward +and backward or union directions,. + +1:13:50.650 --> 1:13:58.736 +It's only like you mask your attention to +only look at the bad past or to look into the + +1:13:58.736 --> 1:13:59.471 +future. + +1:14:00.680 --> 1:14:03.326 +And now you can of course also do mixing. + +1:14:03.563 --> 1:14:08.306 +So this is a bi-directional attention matrix +where you can attend to everything. + +1:14:08.588 --> 1:14:23.516 +There is a uni-direction or causal where you +can look at the past and you can do the first + +1:14:23.516 --> 1:14:25.649 +three words. + +1:14:29.149 --> 1:14:42.831 +That somehow clear based on that, then of +course you cannot do the other things. + +1:14:43.163 --> 1:14:50.623 +So the idea is we have our anchor to decoder +architecture. + +1:14:50.623 --> 1:14:57.704 +Can we also train them completely in a side +supervisor? + +1:14:58.238 --> 1:15:09.980 +And in this case we have the same input to +both, so in this case we need to do some type + +1:15:09.980 --> 1:15:12.224 +of masking here. + +1:15:12.912 --> 1:15:17.696 +Here we don't need to do the masking, but +here we need to masking that doesn't know ever + +1:15:17.696 --> 1:15:17.911 +so. + +1:15:20.440 --> 1:15:30.269 +And this type of model got quite successful +also, especially for pre-training machine translation. + +1:15:30.330 --> 1:15:39.059 +The first model doing that is a Bart model, +which exactly does that, and yes, it's one + +1:15:39.059 --> 1:15:42.872 +successful way to pre train your one. + +1:15:42.872 --> 1:15:47.087 +It's pretraining your full encoder model. + +1:15:47.427 --> 1:15:54.365 +Where you put in contrast to machine translation, +where you put in source sentence, we can't + +1:15:54.365 --> 1:15:55.409 +do that here. + +1:15:55.715 --> 1:16:01.382 +But we can just put the second twice in there, +and then it's not a trivial task. + +1:16:01.382 --> 1:16:02.432 +We can change. + +1:16:03.003 --> 1:16:12.777 +And there is like they do different corruption +techniques so you can also do. + +1:16:13.233 --> 1:16:19.692 +That you couldn't do in an agricultural system +because then it wouldn't be there and you cannot + +1:16:19.692 --> 1:16:20.970 +predict somewhere. + +1:16:20.970 --> 1:16:26.353 +So the anchor, the number of input and output +tokens always has to be the same. + +1:16:26.906 --> 1:16:29.818 +You cannot do a prediction for something which +isn't in it. + +1:16:30.110 --> 1:16:38.268 +Here in the decoder side it's unidirection +so we can also delete the top and then try + +1:16:38.268 --> 1:16:40.355 +to generate the full. + +1:16:41.061 --> 1:16:45.250 +We can do sentence permutation. + +1:16:45.250 --> 1:16:54.285 +We can document rotation and text infilling +so there is quite a bit. + +1:16:55.615 --> 1:17:06.568 +So you see there's quite a lot of types of +models that you can use in order to pre-train. + +1:17:07.507 --> 1:17:14.985 +Then, of course, there is again for the language +one. + +1:17:14.985 --> 1:17:21.079 +The other question is how do you integrate? + +1:17:21.761 --> 1:17:26.636 +And there's also, like yeah, quite some different +ways of techniques. 
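To make the three attention patterns just discussed concrete, here is a small sketch with an assumed toy sequence length and prefix length; in a Transformer the parameters are identical in all three cases, and only the mask decides which positions may attend to which.

```python
import torch

T, prefix_len = 6, 3                              # toy sizes for illustration
ones = torch.ones(T, T)
bidirectional = ones.bool()                       # encoder / BERT style: attend everywhere
causal = torch.tril(ones).bool()                  # decoder / GPT style: only the past
prefix = causal.clone()
prefix[:, :prefix_len] = True                     # prefix LM: the first tokens are fully
                                                  # visible, the rest stays left-to-right
# True means "may attend"; the mask is applied inside the attention softmax.
```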
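And as a sketch of the corruption operations mentioned for this encoder-decoder pre-training (simplified: the real BART works on subwords and samples whole spans, so treat this purely as an illustration of the idea):

```python
import random

def token_mask(tokens, p=0.15, mask="<mask>"):
    return [mask if random.random() < p else t for t in tokens]

def token_delete(tokens, p=0.15):
    # only possible with a decoder: the output may be longer than the corrupted input
    return [t for t in tokens if random.random() > p]

def sentence_permutation(sentences):
    shuffled = list(sentences)
    random.shuffle(shuffled)      # the model must restore the original order
    return shuffled

# The encoder reads the corrupted text; the autoregressive decoder is trained
# to regenerate the original, uncorrupted text.
```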
+ +1:17:27.007 --> 1:17:28.684 +It's a Bit Similar to Before. + +1:17:28.928 --> 1:17:39.068 +So the easiest thing is you take your word +embeddings or your free trained model. + +1:17:39.068 --> 1:17:47.971 +You freeze them and stack your decoder layers +and keep these ones free. + +1:17:48.748 --> 1:17:54.495 +Can also be done if you have this type of +bark model. + +1:17:54.495 --> 1:18:03.329 +What you can do is you freeze your word embeddings, +for example some products and. + +1:18:05.865 --> 1:18:17.296 +The other thing is you initialize them so +you initialize your models but you train everything + +1:18:17.296 --> 1:18:19.120 +so you're not. + +1:18:22.562 --> 1:18:29.986 +Then one thing, if you think about Bart, you +want to have the Chinese language, the Italian + +1:18:29.986 --> 1:18:32.165 +language, and the deconer. + +1:18:32.165 --> 1:18:35.716 +However, in Bart we have the same language. + +1:18:36.516 --> 1:18:46.010 +The one you get is from English, so what you +can do there is so you cannot try to do some. + +1:18:46.366 --> 1:18:52.562 +Below the barge, in order to learn some language +specific stuff, or there's a masculine barge, + +1:18:52.562 --> 1:18:58.823 +which is trained on many languages, but it's +trained only on like the Old Coast Modern Language + +1:18:58.823 --> 1:19:03.388 +House, which may be trained in German and English, +but not on German. + +1:19:03.923 --> 1:19:08.779 +So then you would still need to find June +and the model needs to learn how to better + +1:19:08.779 --> 1:19:10.721 +do the attention cross lingually. + +1:19:10.721 --> 1:19:15.748 +It's only on the same language but it mainly +only has to learn this mapping and not all + +1:19:15.748 --> 1:19:18.775 +the rest and that's why it's still quite successful. + +1:19:21.982 --> 1:19:27.492 +Now certain thing which is very commonly used +is what is required to it as adapters. + +1:19:27.607 --> 1:19:29.754 +So for example you take and buy. + +1:19:29.709 --> 1:19:35.218 +And you put some adapters on the inside of +the networks so that it's small new layers + +1:19:35.218 --> 1:19:40.790 +which are in between put in there and then +you only train these adapters or also train + +1:19:40.790 --> 1:19:41.815 +these adapters. + +1:19:41.815 --> 1:19:47.900 +For example, an embryo you could see that +this learns to map the Sears language representation + +1:19:47.900 --> 1:19:50.334 +to the Tiger language representation. + +1:19:50.470 --> 1:19:52.395 +And then you don't have to change that luck. + +1:19:52.792 --> 1:19:59.793 +You give it extra ability to really perform +well on that. + +1:19:59.793 --> 1:20:05.225 +These are quite small and so very efficient. + +1:20:05.905 --> 1:20:12.632 +That is also very commonly used, for example +in modular systems where you have some adaptors + +1:20:12.632 --> 1:20:16.248 +in between here which might be language specific. + +1:20:16.916 --> 1:20:22.247 +So they are trained only for one language. + +1:20:22.247 --> 1:20:33.777 +The model has some or both and once has the +ability to do multilingually to share knowledge. + +1:20:34.914 --> 1:20:39.058 +But there's one chance in general in the multilingual +systems. + +1:20:39.058 --> 1:20:40.439 +It works quite well. + +1:20:40.439 --> 1:20:46.161 +There's one case or one specific use case +for multilingual where this normally doesn't + +1:20:46.161 --> 1:20:47.344 +really work well. + +1:20:47.344 --> 1:20:49.975 +Do you have an idea what that could be? 
+ +1:20:55.996 --> 1:20:57.536 +It's for Zero Shot Cases. + +1:20:57.998 --> 1:21:03.660 +Because having here some situation with this +might be very language specific and zero shot, + +1:21:03.660 --> 1:21:09.015 +the idea is always to learn representations +view which are more language dependent and + +1:21:09.015 --> 1:21:10.184 +with the adaptors. + +1:21:10.184 --> 1:21:15.601 +Of course you get in representations again +which are more language specific and then it + +1:21:15.601 --> 1:21:17.078 +doesn't work that well. + +1:21:20.260 --> 1:21:37.730 +And there is also the idea of doing more knowledge +pistolation. + +1:21:39.179 --> 1:21:42.923 +And now the idea is okay. + +1:21:42.923 --> 1:21:54.157 +We are training it the same, but what we want +to achieve is that the encoder. + +1:21:54.414 --> 1:22:03.095 +So you should learn faster by trying to make +these states as similar as possible. + +1:22:03.095 --> 1:22:11.777 +So you compare the first-hit state of the +pre-trained model and try to make them. + +1:22:12.192 --> 1:22:18.144 +For example, by using the out two norms, so +by just making these two representations the + +1:22:18.144 --> 1:22:26.373 +same: The same vocabulary: Why does it need +the same vocabulary with any idea? + +1:22:34.754 --> 1:22:46.137 +If you have different vocabulary, it's typical +you also have different sequenced lengths here. + +1:22:46.137 --> 1:22:50.690 +The number of sequences is different. + +1:22:51.231 --> 1:22:58.888 +If you now have pipe stains and four states +here, it's no longer straightforward which + +1:22:58.888 --> 1:23:01.089 +states compare to which. + +1:23:02.322 --> 1:23:05.246 +And that's just easier if you have like the +same number. + +1:23:05.246 --> 1:23:08.940 +You can always compare the first to the first +and second to the second. + +1:23:09.709 --> 1:23:16.836 +So therefore at least the very easy way of +knowledge destination only works if you have. + +1:23:17.177 --> 1:23:30.030 +Course: You could do things like yeah, the +average should be the same, but of course there's + +1:23:30.030 --> 1:23:33.071 +a less strong signal. + +1:23:34.314 --> 1:23:42.979 +But the advantage here is that you have a +diameter training signal here on the handquarter + +1:23:42.979 --> 1:23:51.455 +so you can directly make some of the encoder +already giving a good signal while normally + +1:23:51.455 --> 1:23:52.407 +an empty. + +1:23:56.936 --> 1:24:13.197 +Yes, think this is most things for today, +so what you should keep in mind is remind me. + +1:24:13.393 --> 1:24:18.400 +The one is a back translation idea. + +1:24:18.400 --> 1:24:29.561 +If you have monolingual and use that, the +other one is to: And mentally it is often helpful + +1:24:29.561 --> 1:24:33.614 +to combine them so you can even use both of +that. + +1:24:33.853 --> 1:24:38.908 +So you can use pre-trained walls, but then +you can even still do back translation where + +1:24:38.908 --> 1:24:40.057 +it's still helpful. + +1:24:40.160 --> 1:24:45.502 +We have the advantage we are training like +everything working together on the task so + +1:24:45.502 --> 1:24:51.093 +it might be helpful even to backtranslate some +data and then use it in a real translation + +1:24:51.093 --> 1:24:56.683 +setup because in pretraining of course the +beach challenge is always that you're training + +1:24:56.683 --> 1:24:57.739 +it on different. + +1:24:58.058 --> 1:25:03.327 +Different ways of how you integrate this knowledge. 
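For the adapter layers mentioned a moment ago, a minimal sketch (assumed bottleneck design and sizes): a small down- and up-projection with a residual connection is inserted into the frozen pre-trained network, and only these few extra parameters are trained.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # residual connection: the adapter starts out close to the identity
        return x + self.up(self.act(self.down(x)))

# Typical use: freeze the large pre-trained model and train only the adapters, e.g.
# for p in pretrained_model.parameters():
#     p.requires_grad = False
```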
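And for the hidden-state knowledge distillation described above, a minimal sketch of the auxiliary loss (the weight and the names are assumptions): the encoder states of the MT model are pulled towards the states of a frozen pre-trained encoder with an L2 term, which only lines up position by position if both models use the same vocabulary and therefore produce the same number of states.

```python
import torch

def hidden_state_distillation(nmt_states, pretrained_states):
    # both tensors: (batch, seq_len, hidden); the pre-trained states are fixed targets
    return torch.mean((nmt_states - pretrained_states.detach()) ** 2)

# total_loss = translation_loss + distill_weight * hidden_state_distillation(enc_out, teacher_out)
```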
+
+1:25:03.327 --> 1:25:08.089
+Even if you just use a full model, so in this.
+
+1:25:08.748 --> 1:25:11.128
+This is the most similar you can get.
+
+1:25:11.128 --> 1:25:13.945
+You're doing no changes to the architecture.
+
+1:25:13.945 --> 1:25:19.643
+You're really taking the model and just fine tuning it on the new task, but it still has
+
+1:25:19.643 --> 1:25:24.026
+to completely newly learn how to do the attention and how to do that.
+
+1:25:24.464 --> 1:25:29.971
+And there it might be, for example, helpful to have more back-translated data to learn that.
+
+1:25:32.192 --> 1:25:34.251
+That's it for today.
+
+1:25:34.251 --> 1:25:44.661
+There's one important thing: next Tuesday there is a conference or a workshop or so in
+
+1:25:44.661 --> 1:25:45.920
+this room.
+
+1:25:47.127 --> 1:25:56.769
+You should get an e-mail if you're in ILIAS that there's a room change for Tuesdays and
+
+1:25:56.769 --> 1:25:57.426
+it's.
+
+1:25:57.637 --> 1:26:03.890
+Are there more questions? Yeah, I have a more general question, especially: in computer vision
+you can enlarge your data set with data augmentation.
+
+1:26:07.347 --> 1:26:08.295
+Is there anything
+
+1:26:08.388 --> 1:26:15.301
+similar to enlarge speech or text data, for data augmentation?
+
+1:26:15.755 --> 1:26:29.176
+And you can use this back translation and also masking, but back translation is some
+
+1:26:29.176 --> 1:26:31.228
+way of data augmentation.
+
+1:26:31.371 --> 1:26:35.629
+So it has also been, for example, even used not only for monolingual data.
+
+1:26:36.216 --> 1:26:54.060
+If you have a good MT system, it can also be used for parallel data.
+
+1:26:54.834 --> 1:26:59.139
+So I would say this is the most similar one.
+
+1:26:59.139 --> 1:27:03.143
+There are ways you can do paraphrasing.
+
+1:27:05.025 --> 1:27:12.057
+But for example it is very hard to do this by rules, like which words to replace, because
+
+1:27:12.057 --> 1:27:18.936
+there is not a good rule; you cannot always say this word can always be replaced by that.
+
+1:27:19.139 --> 1:27:27.225
+I mean, although there are many perfect synonyms, normally they are good in some cases, but not
+
+1:27:27.225 --> 1:27:29.399
+in all cases, and so on.
+
+1:27:29.399 --> 1:27:36.963
+And if you don't do it rule based, you have to train your model, and then the freshness.
+
+1:27:38.058 --> 1:27:57.236
+The same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:27:59.810
+Should be of the same dimension, so it's easiest to have the same dimension.
+
+1:28:00.000 --> 1:28:01.590
+Architecture.
+
+1:28:01.590 --> 1:28:05.452
+We will later learn in the efficiency lecture.
+
+1:28:05.452 --> 1:28:12.948
+You can also do knowledge distillation with, for example, smaller models.
+
+1:28:12.948 --> 1:28:16.469
+You can learn the same with, say,
+
+1:28:17.477 --> 1:28:22.949
+eight layers for it, so that is possible, but yeah, I agree it should be of the same.
+
+1:28:23.623 --> 1:28:32.486
+Yeah, yeah, to that question: of course you can do it like as an initialization, or
+
+1:28:32.486 --> 1:28:41.157
+you can do it during training, but normally it makes most sense during the normal training.
+
+1:28:45.865 --> 1:28:53.963
+Okay, then thanks a lot, and then we'll see each other again on Tuesday.
+
+0:06:59.559 --> 0:07:03.054
+We can do that in statistical machine translation.
+
+0:07:03.054 --> 0:07:06.755
+It was quite easy to integrate using language models.
+ +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. + +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. + +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. 
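Written out, the statistical combination referred to here is usually presented as the noisy-channel / log-linear model, with the translation model estimated on parallel data and the language model on monolingual data; this is the textbook form rather than a formula from the slides, and the interpolation weights are tuned.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
        \;\approx\; \arg\max_{e} \big(\lambda_{TM}\log P(f \mid e) + \lambda_{LM}\log P(e)\big)
```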
+ +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. + +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. + +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. 
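As a minimal sketch of the recurrent language model just described (toy sizes, PyTorch-style, illustrative names): the previous target words are embedded, a hidden state is updated step by step, and at every position a distribution over the next word is predicted.

```python
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, prev_tokens, state=None):        # prev_tokens: (batch, T)
        h, state = self.rnn(self.emb(prev_tokens), state)
        return self.proj(h), state                      # next-word logits at each step
```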
+ +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. + +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. 
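A minimal sketch of this parallel combination at decoding time, often called shallow fusion: both models score the next word independently and the scores are merged. The example above averages the two probability distributions; a weighted log-linear combination, shown here with an assumed interpolation weight, is a common variant.

```python
import torch

def fused_next_word_scores(nmt_logits, lm_logits, lm_weight=0.3):
    nmt_logprobs = torch.log_softmax(nmt_logits, dim=-1)
    lm_logprobs = torch.log_softmax(lm_logits, dim=-1)
    # the translation model stays dominant; the language model adds a tuned bonus
    return nmt_logprobs + lm_weight * lm_logprobs
```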
+ +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:54.854 +Case so the actual word that we transferred +into a numerical lecture, and this is an input. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. + +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? 
+ +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. + +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. 
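A minimal sketch of the freezing option just mentioned (the parameter-name prefix is purely illustrative): the pre-trained part is excluded from the gradient updates during fine-tuning, so the knowledge learned on the large data cannot be forgotten.

```python
def freeze_pretrained(model, prefix="pretrained_lm."):
    # keep the pre-trained block fixed, train only the remaining MT parameters
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False
```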
+ +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. + +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. 
+ +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. + +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. 
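Putting the back-translation recipe just described into one place, a sketch with hypothetical helper names (train_system and translate stand in for whatever toolkit is actually used); the important detail is that the human-written English always stays on the target side.

```python
def back_translation(parallel_de_en, mono_en):
    # 1) train a reverse system (English -> German) on the real parallel data
    reverse = train_system(src=[en for de, en in parallel_de_en],
                           tgt=[de for de, en in parallel_de_en])
    # 2) translate target-side monolingual English into (possibly noisy) German
    synthetic = [(translate(reverse, en), en) for en in mono_en]
    # 3) train the final German -> English system on real + synthetic pairs,
    #    typically with roughly equal amounts of both
    data = parallel_de_en + synthetic
    return train_system(src=[de for de, en in data], tgt=[en for de, en in data])
```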
+ +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. + +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. 
+ +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.854 +And then you're defining some type of task +in order to you do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. 
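The self-supervised setup described here can be made concrete with a tiny sketch: the training pairs are created from raw text alone, every prefix of a sentence is an input and the word that follows it is the label, so no manual annotation is needed.

```python
def next_word_examples(tokens):
    # each prefix of the sentence predicts the word that follows it
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# next_word_examples(["the", "cat", "sat"])
# -> [(["the"], "cat"), (["the", "cat"], "sat")]
```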
+
+0:41:56.376 --> 0:42:05.277
+So the question is what type of information, what type of models, can you use, and today we want to look briefly at three things.
+
+0:42:05.725 --> 0:42:21.072
+The first is what was initially done. It wasn't as prominent in machine translation as in other areas, but it is also used there, and that is static word embeddings.
+
+0:42:21.221 --> 0:42:28.981
+So we have this mapping from the one-hot vector to a small continuous word representation.
+
+0:42:29.229 --> 0:42:41.832
+You use this in your NMT system; you can, for example, replace the embedding layer by the pre-trained embeddings, and that is helpful if you only have a really small amount of parallel data.
+
+0:42:42.922 --> 0:43:00.074
+We are always in this pre-training setting, and the advantage is that you can train on more data and thus get better representations. What is the trade-off? Does anybody have an idea what might be the disadvantage of using something like this?
+
+0:43:04.624 --> 0:43:12.175
+What was mentioned today as a big advantage of this system compared to previous approaches?
+
+0:43:20.660 --> 0:43:27.937
+One advantage was the end-to-end training: all parameters and all components are trained to play together optimally.
+
+0:43:28.208 --> 0:43:33.384
+If you now pre-train something on one task, it may no longer fit optimally to everything else.
+
+0:43:33.893 --> 0:44:11.211
+So whether to pre-train or not depends on how important it is that everything is optimal together, and how much better the representations from the large pre-training data are. If we do not pre-train, we would just use random initialization instead.
+
+0:44:11.691 --> 0:44:43.254
+The problem with pre-training is that you might already be in some area of the parameter space from which it is not easy to get out. Often it is not that the pre-trained model is really bad, but you are already going in some direction, and if that direction is not optimal for your task, it can hurt.
+
+0:44:43.603 --> 0:44:52.981
+Or you might simply not get better, because you already have a decent amount of parallel data, or because the pre-training data is too different from your task.
+
+0:44:53.153 --> 0:45:09.403
+Initially this wasn't done so much in machine translation, because there is more data in MT than in many other tasks, but now, with really large amounts of monolingual data, currently all state-of-the-art systems do some type of pre-training.
+
+0:45:12.632 --> 0:45:18.260
+The other question is then always how much of the model you pre-train.
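+
+A minimal PyTorch sketch of this first option, plugging frozen pre-trained static embeddings into the NMT model; the random tensor simply stands in for real pre-trained vectors:
+
+```python
+import torch
+import torch.nn as nn
+
+vocab_size, emb_dim = 10000, 256
+pretrained_vectors = torch.randn(vocab_size, emb_dim)  # stand-in for e.g. word2vec vectors
+
+# replace the randomly initialised embedding layer by the pre-trained one;
+# freeze=True keeps it fixed, freeze=False would let fine-tuning update it
+embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
+
+token_ids = torch.tensor([[3, 17, 256]])
+word_vectors = embedding(token_ids)  # shape (1, 3, 256), fed into the rest of the model
+```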
+
+0:45:18.658 --> 0:45:35.603
+The second option is contextual word embeddings. That is something like BERT or RoBERTa, where you already train a sequence model, and the embeddings you use are no longer specific to a word but also take the context into account.
+
+0:45:35.875 --> 0:45:54.382
+The embedding you use no longer depends on the word alone but on the whole sentence, so you can make use of this context.
+
+0:45:55.415 --> 0:46:14.634
+You can use similar things also on the decoder side, by having layers which don't have access to the source; these are typically models like GPT. And finally, as the third option, you can have models which are already sequence-to-sequence.
+
+0:46:19.419 --> 0:46:43.445
+So you pre-train a sequence-to-sequence model; you have to make the task a bit challenging, but the idea is really that you pre-train your whole model and then fine-tune it.
+
+0:46:47.227 --> 0:47:02.151
+But let's first take a step back and look at the different options. The first one is the word embeddings.
+
+0:47:02.382 --> 0:47:12.028
+The word embeddings are just this first layer, and you can train them with feed-forward neural networks.
+
+0:47:12.212 --> 0:47:27.699
+But you can also train them with an RNN language model, and by now you have hopefully also seen that you can train a transformer language model.
+
+0:47:30.130 --> 0:47:45.234
+So this is how you can train them, for example by predicting the next word; that is the easiest task.
+
+0:47:45.525 --> 0:48:03.129
+And that is what is now referred to as self-supervised learning; for example, all the big large language models like ChatGPT and so on are trained with this objective.
+
+0:48:03.823 --> 0:48:17.725
+That is where the model can hopefully learn how a word is used, because you always try to predict the next word.
+
+0:48:19.619 --> 0:48:39.449
+Why do we first look at the word embeddings and their use for our task? The main advantage is that, although it is only the first layer, it is typically where you have most of the parameters.
+
+0:48:39.879 --> 0:48:59.353
+If most of your parameters are already trained on the large data, then on your target data you have to train much less.
+
+0:48:59.259 --> 0:49:30.160
+The big difference is that your input size, the vocabulary, is so much bigger than the hidden layer size: the embedding matrix is vocabulary size times embedding dimension, while a hidden layer is only on the order of the hidden dimension squared, which is far smaller.
+
+0:49:30.750 --> 0:49:48.915
+So the embedding layer is where most of your parameters are, which means that if you can already pre-train the word embeddings, what remains to be trained is comparatively small in your overall NMT architecture.
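+
+A quick back-of-the-envelope comparison of where the parameters sit; the vocabulary size and dimensions below are assumed for illustration and ignore biases and layer norms:
+
+```python
+vocab_size = 50_000          # assumed subword vocabulary
+d_model    = 512             # assumed embedding / hidden dimension
+d_ff       = 4 * d_model     # assumed feed-forward width
+
+embedding_params   = vocab_size * d_model            # 25,600,000
+attention_params   = 4 * d_model * d_model           # Q, K, V and output projections
+feedforward_params = 2 * d_model * d_ff              # the two linear layers
+one_layer_params   = attention_params + feedforward_params   # 3,145,728
+
+print(embedding_params, one_layer_params)
+```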
+
+0:49:57.637 --> 0:50:04.295
+The thing is, as we have seen, these word embeddings can be put to very good use for other tasks.
+
+0:50:04.784 --> 0:50:28.734
+You learn some general relations between words if you do this kind of language modeling task and predict the next word. One requirement is that you have a lot of data to train the model; the other is that the task needs to be somehow useful.
+
+0:50:29.169 --> 0:50:45.144
+If you would only predict the first letter of the next word, then you wouldn't learn much about the word itself.
+
+0:50:45.545 --> 0:51:09.276
+And the interesting thing is that people have looked at these word embeddings: you can ask yourself how they look and visualize them by doing dimensionality reduction.
+
+0:51:09.489 --> 0:51:29.635
+I don't know if you are attending Artificial Intelligence or Advanced Artificial Intelligence; we covered there yesterday how to do this type of dimensionality reduction. If you do it, you see interesting regularities.
+
+0:51:30.810 --> 0:51:46.881
+You can project the embeddings into a three-dimensional space with some dimensionality reduction and look, for example, at the relation between the male and female version of a word.
+
+0:51:47.447 --> 0:51:58.502
+The vector between the male and the female version of something is not always exactly the same, but it is clearly related.
+
+0:51:58.718 --> 0:52:19.689
+So you can do a bit of math: you take king, subtract this vector and add that vector, and you land near queen. That means there is really something stored; some information is encoded in these vectors.
+
+0:52:20.040 --> 0:52:42.490
+Similarly, you can do it with verb forms, for example swimming and swam, walking and walked. Again these vectors are not identical, but they are related, so you learn something from going from one to the other.
+
+0:52:43.623 --> 0:52:49.761
+Or, semantically, the relation between a country and its capital behaves in exactly the same way.
+
+0:52:51.191 --> 0:52:57.839
+People have even used these embeddings for analogy-style question answering.
+
+0:52:58.218 --> 0:53:06.711
+Of course, you shouldn't blindly trust the dimensionality-reduced picture, because the projection can distort things.
+
+0:53:06.967 --> 0:53:22.247
+You can also look at what really happens in the original high-dimensional space and check what the nearest neighbor of the resulting vector is.
+
+0:53:22.482 --> 0:53:33.078
+So you can take the relationship between France and Paris, add it to Italy, and you get Rome; you can do big and bigger, small and smaller, and so on.
+
+0:53:33.593 --> 0:53:49.417
+It doesn't work everywhere; there are also examples like the typical dish of a country, here for Germany.
+
+0:53:51.491 --> 0:54:06.716
+You can ask what a famous person is known for: Einstein maps to scientist, a famous footballer to midfielder, which is not always completely correct.
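+
+A tiny numpy sketch of this vector arithmetic; the 3-dimensional embeddings are hand-made toy values chosen so that the analogy comes out, purely to show the mechanics (a real system would use trained word2vec or GloVe vectors):
+
+```python
+import numpy as np
+
+# toy embeddings, hand-picked so that the gender direction is roughly consistent
+emb = {
+    "king":  np.array([0.9, 0.8, 0.1]),
+    "queen": np.array([0.9, 0.1, 0.8]),
+    "man":   np.array([0.5, 0.9, 0.0]),
+    "woman": np.array([0.5, 0.1, 0.9]),
+    "paris": np.array([0.1, 0.5, 0.5]),
+}
+
+def analogy(a, b, c):
+    """Return the word whose vector is closest (cosine) to b - a + c."""
+    target = emb[b] - emb[a] + emb[c]
+    def cos(x, y):
+        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
+    candidates = [w for w in emb if w not in (a, b, c)]
+    return max(candidates, key=lambda w: cos(emb[w], target))
+
+print(analogy("man", "king", "woman"))   # expected: queen
+```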
+
+0:54:06.846 --> 0:54:15.066
+You see the examples on the slide are a bit old; the politicians shown are no longer in office, but of course the idea stays the same.
+
+0:54:16.957 --> 0:54:28.937
+What people noticed there, especially at the beginning, is that training an RNN language model was very expensive.
+
+0:54:29.309 --> 0:54:48.607
+And one famous observation was that we are not really interested in the language model performance itself. That is something good to keep in mind: what are we really interested in? Do we really want an RNN? No, in this case we are only interested in this mapping from words to vectors.
+
+0:54:49.169 --> 0:54:55.500
+And very successful at exactly this was word2vec.
+
+0:54:55.535 --> 0:55:22.871
+The idea is: we are not training a real language model; we make it even simpler, for example with continuous bag of words. We just take, say, four input tokens and predict the word in the middle, with essentially just two linear layers. So it simplifies things and makes the computation faster, because the embeddings are all we are interested in.
+
+0:55:23.263 --> 0:55:48.720
+Alongside this there is the continuous skip-gram model; together these are the models referred to as word2vec. There you have one input word and, the other way around, you predict the words around it. In the end the task is very similar.
+
+0:55:51.131 --> 0:56:01.407
+Before we go to the next point, are there any questions about static word vectors or word embeddings?
+
+0:56:04.564 --> 0:56:19.206
+The next thing is contextual word embeddings. The static idea is helpful; however, we might be able to get even more out of the monolingual data.
+
+0:56:19.419 --> 0:56:41.044
+Take a word like "can": its static embedding is an overlap of its different meanings, so it represents the modal verb in "I can do it" just as much as the container. But we might be able to disambiguate this already in the pre-trained model, because the meanings are used in different contexts.
+
+0:56:41.701 --> 0:57:07.713
+So if we can have a model which not only represents a word but also represents the meaning of the word within its context, then we get contextual word embeddings: we really have a representation of the word in the sentence.
+
+0:57:07.787 --> 0:57:29.303
+And we already have a very good architecture for that: in an RNN, the hidden state represents what has been said so far, with a focus on the most recent word, so it is a kind of contextual representation.
+
+0:57:29.509 --> 0:57:50.714
+The first work doing that was something like the ELMo paper. You start from a normal language model: given the words up to the third, you predict the fourth, and so on.
+
+0:57:50.714 --> 0:57:53.004
+So you are always predicting the next word.
+
+0:57:53.193 --> 0:58:04.254
+The architecture is that you have the word embedding layer and then the recurrent layers, as you see here, for example. And now, instead of using only the output at the end, you use this hidden state as the representation.
+
+0:58:04.364 --> 0:58:11.245
+It represents the meaning of this word, mainly in the context of what we have seen before it.
+
+0:58:11.871 --> 0:58:26.123
+We can train it in a language-model style, always predicting the next word, but much more information is already trained into it; therefore the downstream system has to learn fewer additional things.
+
+0:58:27.167 --> 0:58:45.095
+And this is essentially what is currently done in GPT: the only differences are that there are more layers, bigger sizes, and transformer self-attention instead of the RNN. But that is how these large language models are trained at the moment.
+
+0:58:46.746 --> 0:59:02.942
+However, if you look at this contextual representation, it might not be perfect. Think of this state as the contextual representation of the third word:
+
+0:59:07.587 --> 0:59:18.185
+it represents word three in the context of the sentence, but only in the context of the previous words.
+
+0:59:18.558 --> 0:59:30.193
+However, we have an architecture which can also take both sides into account, and we have used that already in the encoder.
+
+0:59:30.630 --> 0:59:49.135
+So we could easily run the RNN also in the backward direction, just by processing the states the other way around, and then combine the forward and the backward states into a joint one with which we do the prediction.
+
+0:59:49.329 --> 1:00:10.314
+So you have the word embedding, then you have two hidden states, one from the forward RNN and one from the backward RNN, and then you can, for example, take the concatenation of both of them.
+
+1:00:10.490 --> 1:00:30.573
+Now this state mainly represents this word, because this word is what both directions saw last, and we know an RNN focuses on what happened last.
+
+1:00:31.731 --> 1:00:41.059
+However, there is a problem when training this as a language model: you already have the answer in the input.
+
+1:00:43.203 --> 1:00:44.956
+Maybe there is again this masking?
+
+1:00:46.546 --> 1:00:53.596
+That is one solution. But first, why can't we do it naively? The information leaks, so you cannot just predict the next word; in this type of model that becomes a trivial task.
+
+1:00:58.738 --> 1:01:22.966
+You already know the next word because it influences the backward hidden state, and predicting something you can see is not a useful task; you have to define a different one. Otherwise, what will happen is that the system will just ignore the other states and simply learn to copy this information directly over.
+
+1:01:23.343 --> 1:01:38.287
+It would then represent that word, and you would get a nearly perfect model, because the network only needs to find an encoding with which it can store each word in this state.
+
+1:01:38.458 --> 1:01:44.050
+The only thing it learns is to encode the word itself in this upper hidden state.
+
+1:01:44.985 --> 1:01:53.779
+Therefore it's not really useful, so we need a somewhat different way out.
+
+1:01:55.295 --> 1:02:14.369
+There is the masking solution; I'll come to that shortly. But other things have also been done. One is not to combine the directions directly: that was done in the ELMo paper, where you have the forward and the backward RNN and keep them completely separated.
+
+1:02:14.594 --> 1:02:41.286
+So you never merge the states during training. In the end, the representation of a word is taken from the forward direction and the backward direction, always the hidden state just before the word in each direction, and these two are then joined into the representation.
+
+1:02:42.022 --> 1:02:59.815
+Then you have a representation of the word that covers the whole sentence, but there is no information leakage. So one way of doing it is, instead of a bidirectional RNN, to run a forward pass and a backward pass and join the hidden states afterwards.
+
+1:03:00.380 --> 1:03:16.300
+You can do that in all layers: you run the forward and backward layers separately and then take the hidden states.
+
+1:03:16.596 --> 1:03:25.230
+However, it's a bit complicated; you have to keep both directions separate and merge things at the end. So what else can you do?
+
+1:03:27.968 --> 1:03:48.314
+And that is where the big success of the BERT model came in: the observation that in the bidirectional case it's not a good idea to do next-word prediction, but we can do masking instead.
+
+1:03:48.308 --> 1:04:07.961
+Masking mainly means we predict something in the middle, some of the words. So the idea is: we take the input, put noise into it by removing words, and then the model has to reconstruct what we are interested in.
+
+1:04:08.048 --> 1:04:15.327
+Now there can be no information leakage, because the word you predict is no longer visible in the input.
+
+1:04:16.776 --> 1:04:29.500
+And we make no assumption about our model: it doesn't need to be a forward model or a backward model or anything; you can always predict the masked position.
+
+1:04:30.530 --> 1:04:40.105
+There is maybe one small disadvantage. Do you see what could be a bit of a problem with this?
+
+1:05:00.000 --> 1:05:24.679
+Yes, you can of course mask more, but to see it more globally, first assume you only mask one word. Then for the whole sentence we get one feedback signal, namely what word three is, so we have one training example. If you do the language modeling task instead, we predict at every position.
+
+1:05:25.005 --> 1:05:45.797
+So there we have as many training signals as tokens: for each token we get feedback about what the correct prediction is. In this sense masking is less efficient, because we get fewer feedback signals per sentence.
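+
+Going back to the ELMo-style construction described above (two separately run unidirectional LSTMs whose states are joined), here is a minimal PyTorch sketch of a leakage-free representation for predicting position t; the dimensions, the toy batch and the zero padding at the sequence borders are illustrative choices:
+
+```python
+import torch
+import torch.nn as nn
+
+vocab_size, d = 1000, 64
+emb = nn.Embedding(vocab_size, d)
+fwd_lstm = nn.LSTM(d, d, batch_first=True)
+bwd_lstm = nn.LSTM(d, d, batch_first=True)
+
+tokens = torch.randint(0, vocab_size, (2, 7))       # toy batch: 2 sentences, 7 tokens
+x = emb(tokens)                                     # (B, T, d)
+
+h_fwd, _ = fwd_lstm(x)                              # h_fwd[:, t] has seen tokens 0..t
+h_bwd, _ = bwd_lstm(torch.flip(x, dims=[1]))
+h_bwd = torch.flip(h_bwd, dims=[1])                 # h_bwd[:, t] has seen tokens t..T-1
+
+# leakage-free state for predicting token t: forward state up to t-1 and
+# backward state from t+1, i.e. token t itself is excluded from both sides
+B, T, _ = h_fwd.shape
+pad = torch.zeros(B, 1, d)
+left  = torch.cat([pad, h_fwd[:, :-1]], dim=1)
+right = torch.cat([h_bwd[:, 1:], pad], dim=1)
+context = torch.cat([left, right], dim=-1)          # (B, T, 2d)
+```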
+
+1:05:48.348 --> 1:05:59.709
+So in BERT the main ideas are that you do this bidirectional model with masking, and that it uses the transformer architecture.
+
+1:06:00.320 --> 1:06:16.573
+There are two more minor additions; we'll see that next-sentence prediction is used as a second task.
+
+1:06:16.957 --> 1:06:35.127
+The motivation is to learn more about what language is, to really understand whether sentences follow each other as in a story or are independent of each other.
+
+1:06:38.158 --> 1:06:50.193
+The input uses subword units as we use them, and it has a special classification token that is used for the next-sentence prediction.
+
+1:06:50.470 --> 1:07:07.203
+That token is mainly there for classification tasks, because at its position you learn a general representation of the full sentence.
+
+1:07:07.607 --> 1:07:24.323
+And you add segment embeddings, so there is an embedding marking whether a token belongs to the first or the second sentence.
+
+1:07:24.684 --> 1:07:35.050
+Now, what is more challenging is the masking itself: what do you mask? We already had the question of how much you should mask.
+
+1:07:35.275 --> 1:07:52.313
+There has been follow-up work on this afterwards, for example RoBERTa. It's not super sensitive, but if you do it completely wrong then you're not learning anything.
+
+1:07:52.572 --> 1:07:54.590
+And then there is another question.
+
+1:07:56.756 --> 1:08:14.504
+Should I always mask the full word, or, if a word is split into subwords, mask only one subword and predict it based on the other ones? Of course, that is a somewhat different task.
+
+1:08:14.894 --> 1:08:32.280
+If you already know three parts of a word, it might be easier to guess the last one. Here they took the simplest option: they do not consider full words at all, because the splitting happens in preprocessing, and they just always mask subwords.
+
+1:08:32.672 --> 1:08:40.401
+I think in some later versions this is done differently and full words are always masked, but I guess it is not crucial.
+
+1:08:41.001 --> 1:08:59.470
+And then, what do you do with a masked word? In eighty percent of the cases they replace it with a special mask token, in ten percent they put in some random other token, and in ten percent they keep it unchanged.
+
+1:09:02.202 --> 1:09:17.761
+And then you can also do this next-sentence prediction: "The man went to the store." "He bought a gallon of milk." These follow each other.
+
+1:09:18.418 --> 1:09:24.088
+And you see you can join them: you do both masking and next-sentence prediction at the same time.
+
+1:09:24.564 --> 1:09:43.018
+Whereas for "Penguins are flightless birds", these two sentences have nothing to do with each other, so you can also learn this type of prediction.
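+
+A minimal Python sketch of this corruption scheme; the 15 percent masking rate is BERT's usual default rather than a number given in the lecture, and the tiny vocabulary is made up:
+
+```python
+import random
+
+VOCAB = ["the", "man", "went", "to", "store", "he", "bought", "a", "gallon", "of", "milk"]
+MASK = "[MASK]"
+
+def corrupt(tokens, mask_prob=0.15):
+    """Pick ~mask_prob of the positions as prediction targets; of those,
+    80% become [MASK], 10% a random token, 10% stay unchanged."""
+    corrupted, targets = list(tokens), [None] * len(tokens)
+    for i, tok in enumerate(tokens):
+        if random.random() >= mask_prob:
+            continue
+        targets[i] = tok                     # the model must recover the original token
+        r = random.random()
+        if r < 0.8:
+            corrupted[i] = MASK
+        elif r < 0.9:
+            corrupted[i] = random.choice(VOCAB)
+        # else: leave the token as it is
+    return corrupted, targets
+
+print(corrupt("the man went to the store".split()))
+```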
+
+1:09:47.127 --> 1:09:58.164
+And then the whole BERT model: here you have the input, then the transformer layers, and you can train it this way.
+
+1:09:58.598 --> 1:10:17.731
+This model was quite successful in many general NLP applications.
+
+1:10:17.937 --> 1:10:46.640
+And a huge family of different model types came out of it: based on this, a whole ecosystem of self-supervised models developed, and this has become even more important with the availability and success of large language models.
+
+1:10:47.007 --> 1:10:48.436
+We now have even larger ones.
+
+1:10:48.828 --> 1:11:09.168
+Interestingly, the field shifted a bit again, away from bidirectional models back towards unidirectional models, which are at the moment maybe a bit more prominent; we're coming to them now. One advantage we already mentioned is efficiency.
+
+1:11:09.509 --> 1:11:17.150
+Is there another reason why you are sometimes more interested in unidirectional models than in bidirectional ones?
+
+1:11:22.882 --> 1:11:30.872
+It depends on the task, but for example for a language generation task, the bidirectional model is not really usable.
+
+1:11:32.192 --> 1:11:42.896
+It doesn't work: if you want to do generation, like in the decoder, you don't know the future, so you cannot apply it.
+
+1:11:43.223 --> 1:11:57.002
+So this type of model can be used for the encoder in an encoder-decoder model, but it cannot be used for the decoder.
+
+1:12:00.000 --> 1:12:08.839
+That's a good transition to the overall classes of models, perhaps viewed from the sequence-to-sequence perspective.
+
+1:12:09.009 --> 1:12:22.347
+We have the encoder-based models; that's what we just looked at. They are bidirectional and typically trained with masking.
+
+1:12:22.742 --> 1:12:42.601
+Then there are the decoder-based models, so autoregressive models which are unidirectional, like a GPT-style model, and there we can do next-word prediction.
+
+1:12:43.403 --> 1:13:05.039
+And in addition there is a special variant called the prefix language model, because it might be helpful that some of your input can also use bidirectional context.
+
+1:13:05.285 --> 1:13:28.774
+That is what the prefix language model does: on the first tokens you directly allow bidirectional attention. So you merge the two ideas, and that mainly works in transformer-based models, because
+
+1:13:29.629 --> 1:13:38.533
+there the number of parameters does not change; in an RNN you would need an additional backward RNN. In a transformer, the only difference is how you mask your attention.
+
+1:13:38.878 --> 1:13:59.466
+We have seen that the encoder and the decoder differ in their number of parameters because of cross-attention, but whether you go forward, backward or in both directions only changes the attention mask, that is, whether you may only look at the past or also into the future.
+
+1:14:00.680 --> 1:14:25.649
+And you can of course also mix: there is the bidirectional attention matrix where you can attend to everything, the unidirectional or causal one where you can only look at the past, and the prefix variant where, say, the first three words can attend to each other bidirectionally while the rest stays causal.
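+
+A small PyTorch sketch of these three attention patterns (sequence length 5 and prefix length 3 are arbitrary choices for illustration); entry (i, j) is True if query position i may attend to key position j:
+
+```python
+import torch
+
+T, prefix_len = 5, 3   # toy sequence length and prefix length
+
+# encoder / BERT-style: every position may attend to every other position
+bidirectional = torch.ones(T, T).bool()
+
+# decoder / GPT-style (causal): row i may only attend to columns j <= i
+causal = torch.tril(torch.ones(T, T)).bool()
+
+# prefix LM: bidirectional inside the first `prefix_len` tokens, causal afterwards
+prefix = causal.clone()
+prefix[:prefix_len, :prefix_len] = True
+
+print(causal.int())
+print(prefix.int())
+```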
+
+1:14:29.149 --> 1:14:42.831
+If that is somehow clear, then based on this we can also look at the third type of model.
+
+1:14:43.163 --> 1:14:57.704
+The idea is: we have our encoder-decoder architecture; can we also train it completely in a self-supervised way?
+
+1:14:58.238 --> 1:15:17.910
+In this case we give the same input to both sides. On the encoder side we don't need any masking of the attention, but on the decoder side we need the causal masking so that it doesn't see the future.
+
+1:15:20.440 --> 1:15:30.269
+This type of model also got quite successful, especially for pre-training machine translation.
+
+1:15:30.330 --> 1:15:47.087
+The first model doing that is the BART model, which does exactly this, and it is one successful way of pre-training your full encoder-decoder model.
+
+1:15:47.427 --> 1:16:02.432
+In contrast to machine translation, where you put in a source sentence, we can't do that here. But we can just put the same sentence in twice, and to make it a non-trivial task we corrupt the input.
+
+1:16:03.003 --> 1:16:12.777
+They use different corruption techniques, so you can, for example, also delete tokens.
+
+1:16:13.233 --> 1:16:29.818
+That is something you couldn't do in an encoder-only model, because then the position wouldn't be there and you couldn't predict anything for it: in the encoder, the number of input and output tokens always has to be the same, and you cannot make a prediction for something that isn't in the input.
+
+1:16:30.110 --> 1:16:40.355
+Here, on the decoder side, it is unidirectional generation, so we can also delete tokens and then try to generate the full sentence.
+
+1:16:41.061 --> 1:16:54.285
+We can do sentence permutation, document rotation and text infilling, so there is quite a range of options.
+
+1:16:55.615 --> 1:17:06.568
+So you see there are quite a lot of types of models that you can use in order to pre-train.
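+
+As a small illustration of two of the corruption techniques just listed, token deletion and sentence permutation; the naive splitting on "." and the toy document are simplifying assumptions:
+
+```python
+import random
+
+def delete_tokens(tokens, p=0.15):
+    """Token deletion: randomly drop tokens; the decoder must still generate the full text."""
+    kept = [t for t in tokens if random.random() >= p]
+    return kept if kept else tokens[:1]
+
+def permute_sentences(text):
+    """Sentence permutation: shuffle the order of the sentences in a document."""
+    sentences = [s.strip() for s in text.split(".") if s.strip()]
+    random.shuffle(sentences)
+    return ". ".join(sentences) + "."
+
+doc = "The man went to the store. He bought a gallon of milk."
+encoder_input = permute_sentences(" ".join(delete_tokens(doc.split())))
+decoder_target = doc   # the model is trained to reconstruct the original document
+```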
+
+1:17:07.507 --> 1:17:26.636
+Then, just as for the language models, the next question is how you integrate the pre-trained model, and there are again quite a few different techniques.
+
+1:17:27.007 --> 1:17:47.971
+It's a bit similar to before. The easiest thing is: you take your word embeddings or your pre-trained model, freeze them, stack your decoder layers on top, and keep the pre-trained parts fixed.
+
+1:17:48.748 --> 1:18:03.329
+The same can be done if you have this type of BART model: you can freeze your word embeddings or, for example, some of the layers, and train the rest.
+
+1:18:05.865 --> 1:18:19.120
+The other option is that you only initialize with the pre-trained weights but then train everything, so nothing stays frozen.
+
+1:18:22.562 --> 1:18:35.716
+Then there is one issue if you think about BART: for translation you want the source language in the encoder and the target language in the decoder; in BART, however, we have the same language on both sides.
+
+1:18:36.516 --> 1:19:03.388
+The model you can download may be trained only on English, so you need to do some adaptation to learn language-specific things, or you use mBART, which is trained on many languages, but only on the monolingual data of each language: it may have been trained on German and on English, but not on translating German into English.
+
+1:19:03.923 --> 1:19:18.775
+So you still need to fine-tune, and the model needs to learn how to do the attention cross-lingually, since in pre-training it only ever attended within the same language. But it mainly has to learn this mapping and not everything else, and that's why it's still quite successful.
+
+1:19:21.982 --> 1:19:27.492
+Now, one thing which is very commonly used is what is referred to as adapters.
+
+1:19:27.607 --> 1:19:41.815
+So, for example, you take mBART and you put some adapters inside the network: small new layers which are inserted in between, and then you train only these adapters, or at least also these adapters.
+
+1:19:41.815 --> 1:19:50.334
+For example, in mBART you can see it this way: the adapter learns to map the source language representation to the target language representation.
+
+1:19:50.470 --> 1:20:05.225
+Then you don't have to change the rest of the model; you give it a bit of extra capacity to really perform well on the task, and since the adapters are quite small this is very efficient.
+
+1:20:05.905 --> 1:20:16.248
+This is also very commonly used, for example in modular systems, where you have adapters in between which might be language-specific.
+
+1:20:16.916 --> 1:20:33.777
+So they are trained only for one language, while the rest of the model is shared, and this way the model still has the ability to work multilingually and share knowledge.
+
+1:20:34.914 --> 1:20:49.975
+But there is one caveat: in general this works quite well in multilingual systems, but there is one specific use case of multilinguality where it normally doesn't work well. Do you have an idea what that could be?
+
+1:20:55.996 --> 1:20:57.536
+It's the zero-shot case.
+
+1:20:57.998 --> 1:21:17.078
+Because these adapters might be very language-specific, and in zero-shot translation the idea is always to learn representations which are more language-independent. With the adapters you again get representations which are more language-specific, and then it doesn't work that well.
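+
+A minimal PyTorch sketch of such a bottleneck adapter; the dimensions and the residual design follow a common adapter recipe and are illustrative assumptions, not the exact layers of any particular system:
+
+```python
+import torch
+import torch.nn as nn
+
+class Adapter(nn.Module):
+    """Small bottleneck layer with a residual connection, inserted between frozen blocks."""
+    def __init__(self, d_model=512, d_bottleneck=64):
+        super().__init__()
+        self.down = nn.Linear(d_model, d_bottleneck)
+        self.up = nn.Linear(d_bottleneck, d_model)
+
+    def forward(self, x):
+        return x + self.up(torch.relu(self.down(x)))
+
+# typical training pattern: freeze the pre-trained block, train only the adapter
+pretrained_block = nn.Linear(512, 512)      # stand-in for a pre-trained transformer layer
+for p in pretrained_block.parameters():
+    p.requires_grad = False
+
+adapter = Adapter()
+h = torch.randn(2, 7, 512)                  # toy hidden states (batch, length, d_model)
+out = adapter(pretrained_block(h))
+```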
+
+1:21:20.260 --> 1:21:37.730
+And there is also the idea of doing knowledge distillation.
+
+1:21:39.179 --> 1:21:54.157
+The idea is: we train the MT system as before, but what we additionally want to achieve is that its encoder behaves like the pre-trained model.
+
+1:21:54.414 --> 1:22:11.777
+It should learn faster by trying to make these hidden states as similar as possible, so you compare, say, the first hidden state of the pre-trained model with the corresponding state of your encoder and try to make them match.
+
+1:22:12.192 --> 1:22:26.373
+For example, by using the L2 norm, so by simply pushing these two representations to be the same. This requires the same vocabulary. Why does it need the same vocabulary, any idea?
+
+1:22:34.754 --> 1:22:50.690
+If you have different vocabularies, you typically also get different sequence lengths; the number of states is different.
+
+1:22:51.231 --> 1:23:01.089
+If you now have five states on one side and four states on the other, it's no longer straightforward which state to compare to which.
+
+1:23:02.322 --> 1:23:08.940
+It's just easier if you have the same number: you can always compare the first to the first and the second to the second.
+
+1:23:09.709 --> 1:23:16.836
+So, at least in this very simple form, knowledge distillation only works if you have the same vocabulary.
+
+1:23:17.177 --> 1:23:33.071
+Of course, you could do things like requiring the averages to be the same, but that is a much weaker training signal.
+
+1:23:34.314 --> 1:23:52.407
+The advantage here is that you get a direct training signal on the encoder, so you can directly push the encoder to produce good representations, while normally an NMT system only gets its training signal at the output.
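+
+A minimal PyTorch sketch of this auxiliary distillation loss on the encoder states; the random tensors stand in for the frozen pre-trained encoder and the trainable MT encoder, and the dimensions are assumed:
+
+```python
+import torch
+import torch.nn as nn
+
+batch, seq_len, d_model = 2, 6, 512
+mse = nn.MSELoss()
+
+# same tokenisation on both sides, so the states match position by position
+h_pretrained = torch.randn(batch, seq_len, d_model)
+h_mt_encoder = torch.randn(batch, seq_len, d_model, requires_grad=True)
+
+# auxiliary loss pushing the MT encoder states towards the pre-trained ones;
+# during training it would be added to the usual translation loss
+distill_loss = mse(h_mt_encoder, h_pretrained)
+distill_loss.backward()
+```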
+
+1:23:56.936 --> 1:24:13.197
+Yes, I think that is most of it for today, so let me remind you what you should keep in mind.
+
+1:24:13.393 --> 1:24:33.614
+The one idea is back-translation, if you have monolingual data and want to use it; the other one is to use pre-trained models. And generally it is often helpful to combine them, so you can even use both.
+
+1:24:33.853 --> 1:24:40.057
+So you can use pre-trained models, and then it can still be helpful to additionally do back-translation.
+
+1:24:40.160 --> 1:24:57.739
+Back-translation has the advantage that everything is trained to work together on the translation task, so it might be helpful to back-translate some data and use it in a real translation setup, because in pre-training the big challenge is always that you're training on a different task.
+
+1:24:58.058 --> 1:25:08.089
+And there are different ways of integrating this knowledge, even if you just take the full pre-trained model.
+
+1:25:08.748 --> 1:25:24.026
+That is the most similar you can get: you make no changes to the architecture, you really take the model and just fine-tune it on the new task, but it still has to learn completely from scratch how to do the cross-attention and so on.
+
+1:25:24.464 --> 1:25:29.971
+And for that it might, for example, be helpful to have additional back-translated data to learn from.
+
+1:25:32.192 --> 1:25:45.920
+That's it for today. There is one important thing: next Tuesday there is a conference or a workshop or something in this room.
+
+1:25:47.127 --> 1:25:57.426
+You should get an e-mail, if you're registered in ILIAS, that there's a room change for Tuesday.
+
+1:25:57.637 --> 1:26:15.301
+Are there more questions? Yes, a more general one: in computer vision you can enlarge your data set with data augmentation; is there anything similar for text?
+
+1:26:15.755 --> 1:26:31.228
+You can use back-translation and also masking; back-translation in particular is a form of data augmentation.
+
+1:26:31.371 --> 1:26:54.060
+It has, for example, even been used not only for monolingual data: if you have a good MT system, it can also be used to augment parallel data.
+
+1:26:54.834 --> 1:27:03.143
+So I would say that is the most similar thing. There are also ways to do paraphrasing.
+
+1:27:05.025 --> 1:27:18.936
+But it is very hard to do that with rules, deciding which words to replace, because you cannot simply say that this word can always be replaced by that one.
+
+1:27:19.139 --> 1:27:36.963
+I mean, although there are many near-perfect synonyms, normally they fit in some contexts but not in all of them, and so on. And if you don't do it rule-based, you have to train a paraphrasing model, and then you again need data for that.
+
+1:27:38.058 --> 1:27:57.236
+Does it need the same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:28:16.469
+It should have the same dimension, so it's easiest to have the same dimension and architecture. We will later see, in the lecture on efficiency, that you can also do knowledge distillation with smaller models; you can learn the same thing with, for example,
+
+1:28:17.477 --> 1:28:22.949
+eight layers instead, so that is possible, but yes, I agree, it should have the same dimensionality.
+
+1:28:23.623 --> 1:28:41.157
+And to the other question: you can of course do it as an initialization, or you can do it during training, but normally it makes the most sense during the normal training.
+
+1:28:45.865 --> 1:28:53.963
+If that is all, then thanks a lot, and we'll see each other again on Tuesday.
+ +0:04:06.786 --> 0:04:09.790 +Which is the case for many, many language +pairs. + +0:04:10.030 --> 0:04:19.209 +Like even with German, you have not translation +parallel data to all languages around the world, + +0:04:19.209 --> 0:04:26.400 +or most of them you have it to the Europeans +once, maybe even for Japanese. + +0:04:26.746 --> 0:04:35.332 +There is quite a lot of data, for example +English to Japanese, but German to Japanese + +0:04:35.332 --> 0:04:37.827 +or German to Vietnamese. + +0:04:37.827 --> 0:04:41.621 +There is some data from Multilingual. + +0:04:42.042 --> 0:04:54.584 +So there is a very promising direction if +you want to build translation systems between + +0:04:54.584 --> 0:05:00.142 +language peers, typically not English. + +0:05:01.221 --> 0:05:05.887 +And the other ideas, of course, we don't have +to either just search for it. + +0:05:06.206 --> 0:05:12.505 +Some work on a data crawling so if I don't +have a corpus directly or I don't have an high + +0:05:12.505 --> 0:05:19.014 +quality corpus like from the European Parliament +for a TED corpus so maybe it makes sense to + +0:05:19.014 --> 0:05:23.913 +crawl more data and get additional sources +so you can build stronger. + +0:05:24.344 --> 0:05:35.485 +There has been quite a big effort in Europe +to collect really large data sets for parallel + +0:05:35.485 --> 0:05:36.220 +data. + +0:05:36.220 --> 0:05:40.382 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:50.037 +But how can we explicitly crawl data? + +0:05:50.037 --> 0:05:52.070 +Which is some of a peril? + +0:05:52.132 --> 0:05:58.461 +So there is in the Internet quite a lot of +data which has been company websites which + +0:05:58.461 --> 0:06:01.626 +have been translated and things like that. + +0:06:01.626 --> 0:06:05.158 +So how can you extract them parallel fragments? + +0:06:06.566 --> 0:06:13.404 +That is typically more noisy than where you +do more at hands where mean if you have Parliament. + +0:06:13.693 --> 0:06:17.680 +You can do some rules how to extract parallel +things. + +0:06:17.680 --> 0:06:24.176 +Here there is more to it, so the quality is +later maybe not as good, but normally scale + +0:06:24.176 --> 0:06:26.908 +is then a possibility to address it. + +0:06:26.908 --> 0:06:30.304 +So you just have so much more data that even. + +0:06:33.313 --> 0:06:40.295 +The other thing can be used monolingual data +and monolingual data has a big advantage that + +0:06:40.295 --> 0:06:46.664 +we can have a huge amount of that so that you +can be autocrawed from the Internet. + +0:06:46.664 --> 0:06:51.728 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:59.558 +There is just so much more magnitude of monolingual +data so that it might be very helpful. + +0:06:59.559 --> 0:07:03.054 +We can do that in statistical machine translation. + +0:07:03.054 --> 0:07:06.755 +It was quite easy to integrate using language +models. + +0:07:08.508 --> 0:07:16.912 +In neural machine translation we have the +advantage that we have this overall architecture + +0:07:16.912 --> 0:07:22.915 +that does everything together, but it has also +the disadvantage. + +0:07:23.283 --> 0:07:25.675 +We'll look today at two things. 
+ +0:07:25.675 --> 0:07:32.925 +On the one end you can still try to do a bit +of language modeling in there and add an additional + +0:07:32.925 --> 0:07:35.168 +language model into in there. + +0:07:35.168 --> 0:07:38.232 +There is some work, one very successful. + +0:07:38.178 --> 0:07:43.764 +A way in which I think is used in most systems +at the moment is to do some scientific data. + +0:07:43.763 --> 0:07:53.087 +Is a very easy thing, but you can just translate +there and use it as training gator, and normally. + +0:07:53.213 --> 0:07:59.185 +And thereby you are able to use like some +type of monolingual a day. + +0:08:00.380 --> 0:08:05.271 +Another way to do it is unsupervised and the +extreme case. + +0:08:05.271 --> 0:08:11.158 +If you have a scenario then you only have +data, only monolingual data. + +0:08:11.158 --> 0:08:13.976 +Can you still build translations? + +0:08:14.754 --> 0:08:27.675 +If you have large amounts of data and languages +are not too dissimilar, you can build translation + +0:08:27.675 --> 0:08:31.102 +systems without parallel. + +0:08:32.512 --> 0:08:36.267 +That we will see you then next Thursday. + +0:08:37.857 --> 0:08:50.512 +And then there is now a third type of pre-trained +model that recently became very successful + +0:08:50.512 --> 0:08:55.411 +and now with large language models. + +0:08:55.715 --> 0:09:03.525 +So the idea is we are no longer sharing the +real data, but it can also help to train a + +0:09:03.525 --> 0:09:04.153 +model. + +0:09:04.364 --> 0:09:11.594 +And that is now a big advantage of deep learning +based approaches. + +0:09:11.594 --> 0:09:22.169 +There you have this ability that you can train +a model in some task and then apply it to another. + +0:09:22.722 --> 0:09:33.405 +And then, of course, the question is, can +I have an initial task where there's huge amounts + +0:09:33.405 --> 0:09:34.450 +of data? + +0:09:34.714 --> 0:09:40.251 +And the test that typically you pre train +on is more like similar to a language moral + +0:09:40.251 --> 0:09:45.852 +task either direct to a language moral task +or like a masking task which is related so + +0:09:45.852 --> 0:09:51.582 +the idea is oh I can train on this data and +the knowledge about words how they relate to + +0:09:51.582 --> 0:09:53.577 +each other I can use in there. + +0:09:53.753 --> 0:10:00.276 +So it's a different way of using language +models. + +0:10:00.276 --> 0:10:06.276 +There's more transfer learning at the end +of. + +0:10:09.029 --> 0:10:17.496 +So first we will start with how can we use +monolingual data to do a Yeah to do a machine + +0:10:17.496 --> 0:10:18.733 +translation? + +0:10:20.040 --> 0:10:27.499 +That: Big difference is you should remember +from what I mentioned before is. + +0:10:27.499 --> 0:10:32.783 +In statistical machine translation we directly +have the opportunity. + +0:10:32.783 --> 0:10:39.676 +There's peril data for the translation model +and monolingual data for the language model. + +0:10:39.679 --> 0:10:45.343 +And you combine your translation model and +language model, and then you can make use of + +0:10:45.343 --> 0:10:45.730 +both. + +0:10:46.726 --> 0:10:53.183 +That you can make use of these large large +amounts of monolingual data, but of course + +0:10:53.183 --> 0:10:55.510 +it has also some disadvantage. 
+ +0:10:55.495 --> 0:11:01.156 +Because we say the problem is we are optimizing +both parts a bit independently to each other + +0:11:01.156 --> 0:11:06.757 +and we say oh yeah the big disadvantage of +newer machine translations now we are optimizing + +0:11:06.757 --> 0:11:10.531 +the overall architecture everything together +to perform best. + +0:11:10.890 --> 0:11:16.994 +And then, of course, we can't do there, so +Leo we can can only do a mural like use power + +0:11:16.994 --> 0:11:17.405 +data. + +0:11:17.897 --> 0:11:28.714 +So the question is, but this advantage is +not so important that we can train everything, + +0:11:28.714 --> 0:11:35.276 +but we have a moral legal data or even small +amounts. + +0:11:35.675 --> 0:11:43.102 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:43.102 --> 0:11:50.529 +similar it is to your test data so it can be +that this modeling data is quite small but + +0:11:50.529 --> 0:11:55.339 +it's very well fitting and then it's still +very helpful. + +0:11:55.675 --> 0:12:02.691 +At the first year of surprisingness, if we +are here successful with integrating a language + +0:12:02.691 --> 0:12:09.631 +model into a translation system, maybe we can +also integrate some type of language models + +0:12:09.631 --> 0:12:14.411 +into our empty system in order to make it better +and perform. + +0:12:16.536 --> 0:12:23.298 +The first thing we can do is we know there +is language models, so let's try to integrate. + +0:12:23.623 --> 0:12:31.096 +There was our language model because these +works were mainly done before transformer-based + +0:12:31.096 --> 0:12:31.753 +models. + +0:12:32.152 --> 0:12:38.764 +In general, of course, you can do the same +thing with transformer baseball. + +0:12:38.764 --> 0:12:50.929 +There is nothing about whether: It's just +that it has mainly been done before people + +0:12:50.929 --> 0:13:01.875 +started using R&S and they tried to do +this more in cases. + +0:13:07.087 --> 0:13:22.938 +So what we're happening here is in some of +this type of idea, and in key system you remember + +0:13:22.938 --> 0:13:25.495 +the attention. + +0:13:25.605 --> 0:13:29.465 +Gets it was your last in this day that you +calculate easy attention. + +0:13:29.729 --> 0:13:36.610 +We get the context back, then combine both +and then base the next in state and then predict. + +0:13:37.057 --> 0:13:42.424 +So this is our system, and the question is, +can we send our integrated language model? + +0:13:42.782 --> 0:13:49.890 +And somehow it makes sense to take out a neural +language model because we are anyway in the + +0:13:49.890 --> 0:13:50.971 +neural space. + +0:13:50.971 --> 0:13:58.465 +It's not surprising that it contrasts to statistical +work used and grants it might make sense to + +0:13:58.465 --> 0:14:01.478 +take a bit of a normal language model. + +0:14:01.621 --> 0:14:06.437 +And there would be something like on Tubbles +Air, a neural language model, and our man based + +0:14:06.437 --> 0:14:11.149 +is you have a target word, you put it in, you +get a new benchmark, and then you always put + +0:14:11.149 --> 0:14:15.757 +in the words and get new hidden states, and +you can do some predictions at the output to + +0:14:15.757 --> 0:14:16.948 +predict the next word. 
+ +0:14:17.597 --> 0:14:26.977 +So if we're having this type of in language +model, there's like two main questions we have + +0:14:26.977 --> 0:14:34.769 +to answer: So how do we combine now on the +one hand our system and on the other hand our + +0:14:34.769 --> 0:14:35.358 +model? + +0:14:35.358 --> 0:14:42.004 +You see that was mentioned before when we +started talking about ENCODA models. + +0:14:42.004 --> 0:14:45.369 +They can be viewed as a language model. + +0:14:45.805 --> 0:14:47.710 +The wine is lengthened, unconditioned. + +0:14:47.710 --> 0:14:49.518 +It's just modeling the target sides. + +0:14:49.970 --> 0:14:56.963 +And the other one is a conditional language +one, which is a language one conditioned on + +0:14:56.963 --> 0:14:57.837 +the Sewer. + +0:14:58.238 --> 0:15:03.694 +So how can you combine to language models? + +0:15:03.694 --> 0:15:14.860 +Of course, it's like the translation model +will be more important because it has access + +0:15:14.860 --> 0:15:16.763 +to the source. + +0:15:18.778 --> 0:15:22.571 +If we have that, the other question is okay. + +0:15:22.571 --> 0:15:24.257 +Now we have models. + +0:15:24.257 --> 0:15:25.689 +How do we train? + +0:15:26.026 --> 0:15:30.005 +Pickers integrated them. + +0:15:30.005 --> 0:15:34.781 +We have now two sets of data. + +0:15:34.781 --> 0:15:42.741 +We have parallel data where you can do the +lower. + +0:15:44.644 --> 0:15:53.293 +So the first idea is we can do something more +like a parallel combination. + +0:15:53.293 --> 0:15:55.831 +We just keep running. + +0:15:56.036 --> 0:15:59.864 +So here you see your system that is running. + +0:16:00.200 --> 0:16:09.649 +It's normally completely independent of your +language model, which is up there, so down + +0:16:09.649 --> 0:16:13.300 +here we have just our NMT system. + +0:16:13.313 --> 0:16:26.470 +The only thing which is used is we have the +words, and of course they are put into both + +0:16:26.470 --> 0:16:30.059 +systems, and out there. + +0:16:30.050 --> 0:16:42.221 +So we use them somehow for both, and then +we are doing our decision just by merging these + +0:16:42.221 --> 0:16:42.897 +two. + +0:16:43.343 --> 0:16:53.956 +So there can be, for example, we are doing +a probability distribution here, and then we + +0:16:53.956 --> 0:17:03.363 +are taking the average of post-perability distribution +to do our predictions. + +0:17:11.871 --> 0:17:18.923 +You could also take the output with Steve's +to be more in chore about the mixture. + +0:17:20.000 --> 0:17:32.896 +Yes, you could also do that, so it's more +like engaging mechanisms that you're not doing. + +0:17:32.993 --> 0:17:41.110 +Another one would be cochtrinate the hidden +states, and then you would have another layer + +0:17:41.110 --> 0:17:41.831 +on top. + +0:17:43.303 --> 0:17:56.889 +You think about if you do the conqueredination +instead of taking the instead and then merging + +0:17:56.889 --> 0:18:01.225 +the probability distribution. + +0:18:03.143 --> 0:18:16.610 +Introduce many new parameters, and these parameters +have somehow something special compared to + +0:18:16.610 --> 0:18:17.318 +the. + +0:18:23.603 --> 0:18:37.651 +So before all the error other parameters can +be trained independent, the language model + +0:18:37.651 --> 0:18:42.121 +can be trained independent. + +0:18:43.043 --> 0:18:51.749 +If you have a joint layer, of course you need +to train them because you have now inputs. 
+ +0:18:54.794 --> 0:19:02.594 +Not surprisingly, if you have a parallel combination +of whether you could, the other way is to do + +0:19:02.594 --> 0:19:04.664 +more serial combinations. + +0:19:04.924 --> 0:19:10.101 +How can you do a similar combination? + +0:19:10.101 --> 0:19:18.274 +Your final decision makes sense to do a face +on the system. + +0:19:18.438 --> 0:19:20.996 +So you have on top of your normal and system. + +0:19:21.121 --> 0:19:30.678 +The only thing is now you're inputting into +your system. + +0:19:30.678 --> 0:19:38.726 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:45.588 +So you're training your mainly what you have +your lower layers here which are trained more + +0:19:45.588 --> 0:19:52.183 +on the purely language model style and then +on top your putting into the NMT system where + +0:19:52.183 --> 0:19:55.408 +it now has already here the language model. + +0:19:55.815 --> 0:19:58.482 +So here you can also view it. + +0:19:58.482 --> 0:20:06.481 +Here you have more contextual embeddings which +no longer depend only on the word but they + +0:20:06.481 --> 0:20:10.659 +also depend on the context of the target site. + +0:20:11.051 --> 0:20:19.941 +But you have more understanding of the source +word, so you have a language in the current + +0:20:19.941 --> 0:20:21.620 +target sentence. + +0:20:21.881 --> 0:20:27.657 +So if it's like the word can, for example, +will be put in here always the same independent + +0:20:27.657 --> 0:20:31.147 +of its user can of beans, or if it's like I +can do it. + +0:20:31.147 --> 0:20:37.049 +However, because you are having your language +model style, you have maybe disintegrated this + +0:20:37.049 --> 0:20:40.984 +already a bit, and you give this information +directly to the. + +0:20:41.701 --> 0:20:43.095 +An empty cyst. + +0:20:44.364 --> 0:20:49.850 +You, if you're remembering more the transformer +based approach, you have some layers. + +0:20:49.850 --> 0:20:55.783 +The lower layers are purely languaged while +the other ones are with attention to the source. + +0:20:55.783 --> 0:21:01.525 +So you can view it also that you just have +lower layers which don't attend to the source. + +0:21:02.202 --> 0:21:07.227 +This is purely a language model, and then +at some point you're starting to attend to + +0:21:07.227 --> 0:21:08.587 +the source and use it. + +0:21:13.493 --> 0:21:20.781 +Yes, so this is how you combine them in peril +or first do the language model and then do. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.102 +Model, and in this case in the sequence. + +0:21:38.278 --> 0:21:53.199 +Case so the actual word that we transferred +into a numerical lecture, and this is an input + +0:21:53.199 --> 0:21:54.838 +into the. + +0:21:56.176 --> 0:22:03.568 +That depends on if you view the word embedding +as part of the language model. + +0:22:03.568 --> 0:22:10.865 +So if you first put the word target word then +you do the one hot end coding. + +0:22:11.691 --> 0:22:13.805 +And then the word embedding there is the r& + +0:22:13.805 --> 0:22:13.937 +n. + +0:22:14.314 --> 0:22:21.035 +So you can use this together as your language +model when you first do the word embedding. + +0:22:21.401 --> 0:22:24.346 +All you can say is like before. + +0:22:24.346 --> 0:22:28.212 +It's more a definition, but you're right. + +0:22:28.212 --> 0:22:30.513 +So what's the steps out? 
+ +0:22:30.513 --> 0:22:36.128 +You take the word, the one hut encoding, the +word embedding. + +0:22:36.516 --> 0:22:46.214 +What one of these parrots, you know, called +a language model is definition wise and not + +0:22:46.214 --> 0:22:47.978 +that important. + +0:22:53.933 --> 0:23:02.264 +So the question is how can you then train +them and make this this one work? + +0:23:02.264 --> 0:23:02.812 +The. + +0:23:03.363 --> 0:23:15.201 +So in the case where you combine the language +one of the abilities you can train them independently + +0:23:15.201 --> 0:23:18.516 +and just put them together. + +0:23:18.918 --> 0:23:27.368 +Might not be the best because we have no longer +the stability that we had before that optimally + +0:23:27.368 --> 0:23:29.128 +performed together. + +0:23:29.128 --> 0:23:33.881 +It's not clear if they really work the best +together. + +0:23:34.514 --> 0:23:41.585 +At least you need to somehow find how much +do you trust the one model and how much. + +0:23:43.323 --> 0:23:45.058 +Still in some cases useful. + +0:23:45.058 --> 0:23:48.530 +It might be helpful if you have only data +and software. + +0:23:48.928 --> 0:23:59.064 +However, in MT we have one specific situation +that at least for the MT part parallel is also + +0:23:59.064 --> 0:24:07.456 +always monolingual data, so what we definitely +can do is train the language. + +0:24:08.588 --> 0:24:18.886 +So what we also can do is more like the pre-training +approach. + +0:24:18.886 --> 0:24:24.607 +We first train the language model. + +0:24:24.704 --> 0:24:27.334 +The pre-training approach. + +0:24:27.334 --> 0:24:33.470 +You first train on the monolingual data and +then you join the. + +0:24:33.933 --> 0:24:41.143 +Of course, the model size is this way, but +the data size is too bigly the other way around. + +0:24:41.143 --> 0:24:47.883 +You often have a lot more monolingual data +than you have here parallel data, in which + +0:24:47.883 --> 0:24:52.350 +scenario can you imagine where this type of +pretraining? + +0:24:56.536 --> 0:24:57.901 +Any Ideas. + +0:25:04.064 --> 0:25:12.772 +One example where this might also be helpful +if you want to adapt to domains. + +0:25:12.772 --> 0:25:22.373 +So let's say you do medical sentences and +if you want to translate medical sentences. + +0:25:23.083 --> 0:25:26.706 +In this case it could be or its most probable +happen. + +0:25:26.706 --> 0:25:32.679 +You're learning here up there what medical +means, but in your fine tuning step the model + +0:25:32.679 --> 0:25:38.785 +is forgotten everything about Medicare, so +you may be losing all the information you gain. + +0:25:39.099 --> 0:25:42.366 +So this type of priest training step is good. + +0:25:42.366 --> 0:25:47.978 +If your pretraining data is more general, +very large and then you're adapting. + +0:25:48.428 --> 0:25:56.012 +But in the task with moral lingual data, which +should be used to adapt the system to some + +0:25:56.012 --> 0:25:57.781 +general topic style. + +0:25:57.817 --> 0:26:06.795 +Then, of course, this is not a good strategy +because you might forgot about everything up + +0:26:06.795 --> 0:26:09.389 +there and you don't have. + +0:26:09.649 --> 0:26:14.678 +So then you have to check what you can do +for them. + +0:26:14.678 --> 0:26:23.284 +You can freeze this part and change it any +more so you don't lose the ability or you can + +0:26:23.284 --> 0:26:25.702 +do a direct combination. 
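A minimal sketch of the freezing option just mentioned (the joint-training alternative is described right after); `decoder_lm` is a hypothetical name for the pre-trained language-model part of the network.

```python
import torch

def freeze_pretrained_lm(model):
    """Keep the pre-trained language-model layers fixed during fine-tuning,
    so the knowledge learned on monolingual data is not forgotten."""
    for p in model.decoder_lm.parameters():
        p.requires_grad = False

def trainable_parameters(model):
    # only the still-trainable parameters go into the optimizer
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.Adam(trainable_parameters(model), lr=1e-4)
```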
+ +0:26:25.945 --> 0:26:31.028 +Where you jointly train both of them, so you +train the NMT system on the, and then you train + +0:26:31.028 --> 0:26:34.909 +the language model always in parallels so that +you don't forget about. + +0:26:35.395 --> 0:26:37.684 +And what you learn of the length. + +0:26:37.937 --> 0:26:46.711 +Depends on what you want to combine because +it's large data and you have a good general + +0:26:46.711 --> 0:26:48.107 +knowledge in. + +0:26:48.548 --> 0:26:55.733 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:55.733 --> 0:26:57.295 +something specific. + +0:26:57.295 --> 0:26:58.075 +Then you. + +0:27:01.001 --> 0:27:06.676 +Then this is a way of how we can make use +of monolingual data. + +0:27:07.968 --> 0:27:12.116 +It seems to be the easiest one somehow. + +0:27:12.116 --> 0:27:20.103 +It's more similar to what we are doing with +statistical machine translation. + +0:27:21.181 --> 0:27:31.158 +Normally always beats this type of model, +which in some view can be like from the conceptual + +0:27:31.158 --> 0:27:31.909 +thing. + +0:27:31.909 --> 0:27:36.844 +It's even easier from the computational side. + +0:27:40.560 --> 0:27:42.078 +And the idea is OK. + +0:27:42.078 --> 0:27:49.136 +We have monolingual data that we just translate +and then generate some type of parallel data + +0:27:49.136 --> 0:27:50.806 +and use that then to. + +0:27:51.111 --> 0:28:00.017 +So if you want to build a German-to-English +system first, take the large amount of data + +0:28:00.017 --> 0:28:02.143 +you have translated. + +0:28:02.402 --> 0:28:10.446 +Then you have more peril data and the interesting +thing is if you then train on the joint thing + +0:28:10.446 --> 0:28:18.742 +or on the original peril data and on what is +artificial where you have generated the translations. + +0:28:18.918 --> 0:28:26.487 +So you can because you are not doing the same +era all the times and you have some knowledge. + +0:28:28.028 --> 0:28:43.199 +With this first approach, however, there is +one issue why it might not work the best. + +0:28:49.409 --> 0:28:51.177 +Very a bit shown in the image to you. + +0:28:53.113 --> 0:28:58.153 +You trade on that quality data. + +0:28:58.153 --> 0:29:02.563 +Here is a bit of a problem. + +0:29:02.563 --> 0:29:08.706 +Your English style is not really good. + +0:29:08.828 --> 0:29:12.213 +And as you're saying, the system always mistranslates. + +0:29:13.493 --> 0:29:19.798 +Something then you will learn that this is +correct because now it's a training game and + +0:29:19.798 --> 0:29:23.022 +you will encourage it to make it more often. + +0:29:23.022 --> 0:29:29.614 +So the problem with training on your own areas +yeah you might prevent some areas you rarely + +0:29:29.614 --> 0:29:29.901 +do. + +0:29:30.150 --> 0:29:31.749 +But errors use systematically. + +0:29:31.749 --> 0:29:34.225 +Do you even enforce more and will even do +more? + +0:29:34.654 --> 0:29:40.145 +So that might not be the best solution to +have any idea how you could do it better. + +0:29:44.404 --> 0:29:57.754 +Is one way there is even a bit of more simple +idea. + +0:30:04.624 --> 0:30:10.975 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.975 --> 0:30:12.188 +something wrong. + +0:30:12.188 --> 0:30:17.969 +Normally it's less bad if your inputs are +not bad, but your outputs are perfect. 
+ +0:30:18.538 --> 0:30:24.284 +So if your inputs are wrong you may learn +that if you're doing this wrong input you're + +0:30:24.284 --> 0:30:30.162 +generating something correct, but you're not +learning to generate something which is not + +0:30:30.162 --> 0:30:30.756 +correct. + +0:30:31.511 --> 0:30:47.124 +So often the case it is that it is more important +than your target is correct. + +0:30:47.347 --> 0:30:52.182 +But you can assume in your application scenario +you hope that you may only get correct inputs. + +0:30:52.572 --> 0:31:02.535 +So that is not harming you, and in machine +translation we have one very nice advantage: + +0:31:02.762 --> 0:31:04.648 +And also the other way around. + +0:31:04.648 --> 0:31:10.062 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:10.062 --> 0:31:13.894 +task to translate from English to German is +very similar, and. + +0:31:14.094 --> 0:31:19.309 +So what we can do is we can just switch it +initially and generate the data the other way + +0:31:19.309 --> 0:31:19.778 +around. + +0:31:20.120 --> 0:31:25.959 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.959 --> 0:31:32.906 +Then we are translating the English data into +German where the German is maybe not very nice. + +0:31:33.293 --> 0:31:51.785 +And then we are training on our original data +and on the back translated data. + +0:31:52.632 --> 0:32:02.332 +So here we have the advantage that our target +side is human quality and only the input. + +0:32:03.583 --> 0:32:08.113 +Then this helps us to get really good. + +0:32:08.113 --> 0:32:15.431 +There is one difference if you think about +the data resources. + +0:32:21.341 --> 0:32:27.336 +Too obvious here we need a target site monolingual +layer. + +0:32:27.336 --> 0:32:31.574 +In the first example we had source site. + +0:32:31.931 --> 0:32:45.111 +So back translation is normally working if +you have target size peril later and not search + +0:32:45.111 --> 0:32:48.152 +side modeling later. + +0:32:48.448 --> 0:32:56.125 +Might be also, like if you think about it, +understand a little better to understand the + +0:32:56.125 --> 0:32:56.823 +target. + +0:32:57.117 --> 0:33:01.469 +On the source side you have to understand +the content. + +0:33:01.469 --> 0:33:08.749 +On the target side you have to generate really +sentences and somehow it's more difficult to + +0:33:08.749 --> 0:33:12.231 +generate something than to only understand. + +0:33:17.617 --> 0:33:30.734 +This works well if you have to select how +many back translated data do you use. + +0:33:31.051 --> 0:33:32.983 +Because only there's like a lot more. + +0:33:33.253 --> 0:33:42.136 +Question: Should take all of my data there +is two problems with it? + +0:33:42.136 --> 0:33:51.281 +Of course it's expensive because you have +to translate all this data. + +0:33:51.651 --> 0:34:00.946 +So if you don't know the normal good starting +point is to take equal amount of data as many + +0:34:00.946 --> 0:34:02.663 +back translated. + +0:34:02.963 --> 0:34:04.673 +It depends on the used case. + +0:34:04.673 --> 0:34:08.507 +If we have very few data here, it makes more +sense to have more. + +0:34:08.688 --> 0:34:15.224 +Depends on how good your quality is here, +so the better the more data you might use because + +0:34:15.224 --> 0:34:16.574 +quality is better. 
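A minimal sketch of the back-translation setup described above for a German-to-English system; `reverse_model.translate` is a hypothetical wrapper around the English-to-German system.

```python
def back_translate(english_monolingual, reverse_model):
    """Create synthetic parallel data: the source side is machine-generated
    (and possibly noisy), the target side is the human-written sentence."""
    synthetic_pairs = []
    for en_sentence in english_monolingual:
        de_synthetic = reverse_model.translate(en_sentence)
        synthetic_pairs.append((de_synthetic, en_sentence))
    return synthetic_pairs

# training data = original parallel data + synthetic pairs
# train_data = parallel_data + back_translate(mono_en, en_de_model)
```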
+ +0:34:16.574 --> 0:34:22.755 +So it depends on a lot of things, but your +rule of sum is like which general way often + +0:34:22.755 --> 0:34:24.815 +is to have equal amounts of. + +0:34:26.646 --> 0:34:29.854 +And you can, of course, do that now. + +0:34:29.854 --> 0:34:34.449 +I said already that it's better to have the +quality. + +0:34:34.449 --> 0:34:38.523 +At the end, of course, depends on this system. + +0:34:38.523 --> 0:34:46.152 +Also, because the better this system is, the +better your synthetic data is, the better. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.917 +So you play them on English to German, and +you translate the data on. + +0:34:56.957 --> 0:35:03.198 +Then you train a model on German to English +with the additional data. + +0:35:03.198 --> 0:35:09.796 +Then you translate German data and then you +train to gain your first one. + +0:35:09.796 --> 0:35:14.343 +So in the second iteration this quality is +better. + +0:35:14.334 --> 0:35:19.900 +System is better because it's not only trained +on the small data but additionally on back + +0:35:19.900 --> 0:35:22.003 +translated data with this system. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:28.053 +However, typically you can stop quite early. + +0:35:28.053 --> 0:35:35.068 +Maybe one iteration is good, but then you +have diminishing gains after two or three iterations. + +0:35:35.935 --> 0:35:46.140 +There is very slight difference because you +need a quite big difference in the quality + +0:35:46.140 --> 0:35:46.843 +here. + +0:35:47.207 --> 0:36:02.262 +Language is also good because it means you +can already train it with relatively bad profiles. + +0:36:03.723 --> 0:36:10.339 +It's a design decision would advise so guess +because it's easy to get it. + +0:36:10.550 --> 0:36:20.802 +Replace that because you have a higher quality +real data, but then I think normally it's okay + +0:36:20.802 --> 0:36:22.438 +to replace it. + +0:36:22.438 --> 0:36:28.437 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:42.014 +That's about like using monolingual data before +we go into the pre-train models to have any + +0:36:42.014 --> 0:36:43.005 +more crash. + +0:36:49.029 --> 0:36:55.740 +Yes, so the other thing which we can do and +which is recently more and more successful + +0:36:55.740 --> 0:37:02.451 +and even more successful since we have this +really large language models where you can + +0:37:02.451 --> 0:37:08.545 +even do the translation task with this is the +way of using pre-trained models. + +0:37:08.688 --> 0:37:16.135 +So you learn a representation of one task, +and then you use this representation from another. + +0:37:16.576 --> 0:37:26.862 +It was made maybe like one of the first words +where it really used largely is doing something + +0:37:26.862 --> 0:37:35.945 +like a bird which you pre trained on purely +text era and you take it in fine tune. + +0:37:36.496 --> 0:37:42.953 +And one big advantage, of course, is that +people can only share data but also pre-trained. + +0:37:43.423 --> 0:37:59.743 +The recent models and the large language ones +which are available. + +0:37:59.919 --> 0:38:09.145 +Where I think it costs several millions to +train them all, just if you would buy the GPUs + +0:38:09.145 --> 0:38:15.397 +from some cloud company and train that the +cost of training. 
+ +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.598 +So another idea is what you can do is okay. + +0:38:24.598 --> 0:38:27.330 +Maybe if these months are once available,. + +0:38:27.467 --> 0:38:36.598 +Can take them and use them as an also resource +similar to pure text, and you can now build + +0:38:36.598 --> 0:38:44.524 +models which somehow learn not only from from +data but also from other models. + +0:38:44.844 --> 0:38:49.127 +So it's a quite new way of thinking of how +to train. + +0:38:49.127 --> 0:38:53.894 +We are not only learning from examples, but +we might also. + +0:38:54.534 --> 0:39:05.397 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:05.397 --> 0:39:07.087 +but learning. + +0:39:07.427 --> 0:39:17.647 +So the main idea this go is you have a person +initial task. + +0:39:17.817 --> 0:39:26.369 +And if you're working with anLP, that means +you're training pure taxator because that's + +0:39:26.369 --> 0:39:30.547 +where you have the largest amount of data. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your creek training. + +0:39:36.176 --> 0:39:43.092 +And: The typical task you can train on on +that is like the language waddling task. + +0:39:43.092 --> 0:39:50.049 +So to predict the next word or we have a related +task to predict something in between, we'll + +0:39:50.049 --> 0:39:52.667 +see depending on the architecture. + +0:39:52.932 --> 0:39:58.278 +But somehow to predict something which you +have not in the input is a task which is easy + +0:39:58.278 --> 0:40:00.740 +to generate, so you just need your data. + +0:40:00.740 --> 0:40:06.086 +That's why it's called self supervised, so +you're creating your supervised pending data. + +0:40:06.366 --> 0:40:07.646 +By yourself. + +0:40:07.646 --> 0:40:15.133 +On the other hand, you need a lot of knowledge +and that is the other thing. + +0:40:15.735 --> 0:40:24.703 +Because there is this idea that the meaning +of a word heavily depends on the context that. + +0:40:25.145 --> 0:40:36.846 +So can give you a sentence with some giverish +word and there's some name and although you've + +0:40:36.846 --> 0:40:41.627 +never heard the name you will assume. + +0:40:42.062 --> 0:40:44.149 +And exactly the same thing. + +0:40:44.149 --> 0:40:49.143 +The models can also learn something about +the world by just using. + +0:40:49.649 --> 0:40:53.651 +So that is typically the mule. + +0:40:53.651 --> 0:40:59.848 +Then we can use this model to train the system. + +0:41:00.800 --> 0:41:03.368 +Course we might need to adapt the system. + +0:41:03.368 --> 0:41:07.648 +To do that we have to change the architecture +we might use only some. + +0:41:07.627 --> 0:41:09.443 +Part of the pre-trained model. + +0:41:09.443 --> 0:41:14.773 +In there we have seen that a bit already in +the R&N case you can also see that we have + +0:41:14.773 --> 0:41:17.175 +also mentioned the pre-training already. + +0:41:17.437 --> 0:41:22.783 +So you can use the R&N as one of these +approaches. + +0:41:22.783 --> 0:41:28.712 +You train the R&M language more on large +pre-train data. + +0:41:28.712 --> 0:41:32.309 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. 
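As a small illustration of "creating your supervised training data yourself", here is a sketch that turns raw text into next-word prediction examples; whitespace tokenisation and the fixed context size are simplifying assumptions.

```python
def next_word_examples(corpus_sentences, context_size=3):
    """Self-supervised examples from raw text: the previous words are the
    input, the following word is the label. No human annotation is needed,
    which is why this scales to huge monolingual corpora."""
    examples = []
    for sentence in corpus_sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            examples.append((words[i - context_size:i], words[i]))
    return examples

# next_word_examples(["the cat sat on the mat"])
# -> [(['the', 'cat', 'sat'], 'on'), (['cat', 'sat', 'on'], 'the'), ...]
```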
+ +0:41:37.877 --> 0:41:53.924 +So you can build a system which is knowledge, +which is just trained on large amounts of data. + +0:41:56.376 --> 0:42:01.564 +So the question is maybe what type of information +so what type of models can you? + +0:42:01.821 --> 0:42:05.277 +And we want today to look at briefly at swings. + +0:42:05.725 --> 0:42:08.704 +First, that was what was initially done. + +0:42:08.704 --> 0:42:15.314 +It wasn't as famous as in machine translation +as in other things, but it's also used there + +0:42:15.314 --> 0:42:21.053 +and that is to use static word embedding, so +just the first step we know here. + +0:42:21.221 --> 0:42:28.981 +So we have this mapping from the one hot to +a small continuous word representation. + +0:42:29.229 --> 0:42:38.276 +Using this one in your NG system, so you can, +for example, replace the embedding layer by + +0:42:38.276 --> 0:42:38.779 +the. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.517 +And we're always in this pre-training phase +and have the thing the advantage is. + +0:42:48.468 --> 0:42:52.411 +More data than the trade off, so you can get +better. + +0:42:52.411 --> 0:42:59.107 +The disadvantage is, does anybody have an +idea of what might be the disadvantage of using + +0:42:59.107 --> 0:43:00.074 +things like. + +0:43:04.624 --> 0:43:12.175 +What was one mentioned today giving like big +advantage of the system compared to previous. + +0:43:20.660 --> 0:43:25.134 +Where one advantage was the enter end training, +so you have the enter end training so that + +0:43:25.134 --> 0:43:27.937 +all parameters and all components play optimal +together. + +0:43:28.208 --> 0:43:33.076 +If you know pre-train something on one fast, +it may be no longer optimal fitting to everything + +0:43:33.076 --> 0:43:33.384 +else. + +0:43:33.893 --> 0:43:37.862 +So what do pretending or not? + +0:43:37.862 --> 0:43:48.180 +It depends on how important everything is +optimal together and how important. + +0:43:48.388 --> 0:43:50.454 +Of large amount. + +0:43:50.454 --> 0:44:00.541 +The pre-change one is so much better that +it's helpful, and the advantage of that. + +0:44:00.600 --> 0:44:11.211 +Getting everything optimal together, yes, +we would use random instructions for raising. + +0:44:11.691 --> 0:44:26.437 +The problem is you might be already in some +area where it's not easy to get. + +0:44:26.766 --> 0:44:35.329 +But often in some way right, so often it's +not about your really worse pre trained monolepsy. + +0:44:35.329 --> 0:44:43.254 +If you're going already in some direction, +and if this is not really optimal for you,. + +0:44:43.603 --> 0:44:52.450 +But if you're not really getting better because +you have a decent amount of data, it's so different + +0:44:52.450 --> 0:44:52.981 +that. + +0:44:53.153 --> 0:44:59.505 +Initially it wasn't a machine translation +done so much because there are more data in + +0:44:59.505 --> 0:45:06.153 +MPs than in other tasks, but now with really +large amounts of monolingual data we do some + +0:45:06.153 --> 0:45:09.403 +type of pretraining in currently all state. + +0:45:12.632 --> 0:45:14.302 +The other one is okay now. + +0:45:14.302 --> 0:45:18.260 +It's always like how much of the model do +you plea track a bit? + +0:45:18.658 --> 0:45:22.386 +To the other one you can do contextural word +embedded. 
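A minimal sketch of this first option, plugging pre-trained static word embeddings into the NMT embedding layer (the contextual variant is discussed next); how the vectors are mapped onto the system's vocabulary is assumed and not shown here.

```python
import torch
import torch.nn as nn

def embedding_from_pretrained(vectors, freeze=True):
    """Build the NMT embedding layer from pre-trained static word vectors.

    `vectors` is a (vocab_size, emb_dim) float tensor, e.g. loaded from a
    word2vec-style file already aligned with the system vocabulary.
    With freeze=True the layer is kept fixed; with freeze=False it is
    only initialized and then fine-tuned with the rest of the model."""
    return nn.Embedding.from_pretrained(vectors, freeze=freeze)
```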
+ +0:45:22.386 --> 0:45:28.351 +That is something like bird or Roberta where +you train already a sequence model and the + +0:45:28.351 --> 0:45:34.654 +embeddings you're using are no longer specific +for word but they are also taking the context + +0:45:34.654 --> 0:45:35.603 +into account. + +0:45:35.875 --> 0:45:50.088 +The embedding you're using is no longer depending +on the word itself but on the whole sentence, + +0:45:50.088 --> 0:45:54.382 +so you can use this context. + +0:45:55.415 --> 0:46:02.691 +You can use similar things also in the decoder +just by having layers which don't have access + +0:46:02.691 --> 0:46:12.430 +to the source, but there it still might have +and these are typically models like: And finally + +0:46:12.430 --> 0:46:14.634 +they will look at the end. + +0:46:14.634 --> 0:46:19.040 +You can also have models which are already +sequenced. + +0:46:19.419 --> 0:46:28.561 +So you may be training a sequence to sequence +models. + +0:46:28.561 --> 0:46:35.164 +You have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you'll find tuning. + +0:46:47.227 --> 0:46:59.614 +But let's first do a bit of step back and +look into what are the different things. + +0:46:59.614 --> 0:47:02.151 +The first thing. + +0:47:02.382 --> 0:47:11.063 +The wooden bettings are just this first layer +and you can train them with feedback annual + +0:47:11.063 --> 0:47:12.028 +networks. + +0:47:12.212 --> 0:47:22.761 +But you can also train them with an N language +model, and by now you hopefully have also seen + +0:47:22.761 --> 0:47:27.699 +that you cannot transform a language model. + +0:47:30.130 --> 0:47:37.875 +So this is how you can train them and you're +training them. + +0:47:37.875 --> 0:47:45.234 +For example, to speak the next word that is +the easiest. + +0:47:45.525 --> 0:47:55.234 +And that is what is now referred to as South +Supervised Learning and, for example, all the + +0:47:55.234 --> 0:48:00.675 +big large language models like Chad GPT and +so on. + +0:48:00.675 --> 0:48:03.129 +They are trained with. + +0:48:03.823 --> 0:48:15.812 +So that is where you can hopefully learn how +a word is used because you always try to previct + +0:48:15.812 --> 0:48:17.725 +the next word. + +0:48:19.619 --> 0:48:27.281 +Word embedding: Why do you keep the first +look at the word embeddings and the use of + +0:48:27.281 --> 0:48:29.985 +word embeddings for our task? + +0:48:29.985 --> 0:48:38.007 +The main advantage was it might be only the +first layer where you typically have most of + +0:48:38.007 --> 0:48:39.449 +the parameters. + +0:48:39.879 --> 0:48:57.017 +Most of your parameters already on the large +data, then on your target data you have to + +0:48:57.017 --> 0:48:59.353 +train less. + +0:48:59.259 --> 0:49:06.527 +Big difference that your input size is so +much bigger than the size of the novel in size. + +0:49:06.626 --> 0:49:17.709 +So it's a normally sign, maybe like, but your +input and banning size is something like. + +0:49:17.709 --> 0:49:20.606 +Then here you have to. + +0:49:23.123 --> 0:49:30.160 +While here you see it's only like zero point +five times as much in the layer. + +0:49:30.750 --> 0:49:36.534 +So here is where most of your parameters are, +which means if you already replace the word + +0:49:36.534 --> 0:49:41.739 +embeddings, they might look a bit small in +your overall and in key architecture. 
+ +0:49:41.739 --> 0:49:47.395 +It's where most of the things are, and if +you're doing that you already have really big + +0:49:47.395 --> 0:49:48.873 +games and can do that. + +0:49:57.637 --> 0:50:01.249 +The thing is we have seen these were the bettings. + +0:50:01.249 --> 0:50:04.295 +They can be very good use for other types. + +0:50:04.784 --> 0:50:08.994 +You learn some general relations between words. + +0:50:08.994 --> 0:50:17.454 +If you're doing this type of language modeling +cast, you predict: The one thing is you have + +0:50:17.454 --> 0:50:24.084 +a lot of data, so the one question is we want +to have data to trade a model. + +0:50:24.084 --> 0:50:28.734 +The other thing, the tasks need to be somehow +useful. + +0:50:29.169 --> 0:50:43.547 +If you would predict the first letter of the +word, then you wouldn't learn anything about + +0:50:43.547 --> 0:50:45.144 +the word. + +0:50:45.545 --> 0:50:53.683 +And the interesting thing is people have looked +at these wood embeddings. + +0:50:53.954 --> 0:50:58.550 +And looking at the word embeddings. + +0:50:58.550 --> 0:51:09.276 +You can ask yourself how they look and visualize +them by doing dimension reduction. + +0:51:09.489 --> 0:51:13.236 +Don't know if you and you are listening to +artificial intelligence. + +0:51:13.236 --> 0:51:15.110 +Advanced artificial intelligence. + +0:51:15.515 --> 0:51:23.217 +We had on yesterday there how to do this type +of representation, but you can do this time + +0:51:23.217 --> 0:51:29.635 +of representation, and now you're seeing interesting +things that normally. + +0:51:30.810 --> 0:51:41.027 +Now you can represent a here in a three dimensional +space with some dimension reduction. + +0:51:41.027 --> 0:51:46.881 +For example, the relation between male and +female. + +0:51:47.447 --> 0:51:56.625 +So this vector between the male and female +version of something is always not the same, + +0:51:56.625 --> 0:51:58.502 +but it's related. + +0:51:58.718 --> 0:52:14.522 +So you can do a bit of maths, so you do take +king, you subtract this vector, add this vector. + +0:52:14.894 --> 0:52:17.591 +So that means okay, there is really something +stored. + +0:52:17.591 --> 0:52:19.689 +Some information are stored in that book. + +0:52:20.040 --> 0:52:22.621 +Similar, you can do it with Bob Hansen. + +0:52:22.621 --> 0:52:25.009 +See here swimming slam walking walk. + +0:52:25.265 --> 0:52:34.620 +So again these vectors are not the same, but +they are related. + +0:52:34.620 --> 0:52:42.490 +So you learn something from going from here +to here. + +0:52:43.623 --> 0:52:49.761 +Or semantically, the relations between city +and capital have exactly the same sense. + +0:52:51.191 --> 0:52:56.854 +And people had even done that question answering +about that if they showed the diembeddings + +0:52:56.854 --> 0:52:57.839 +and the end of. + +0:52:58.218 --> 0:53:06.711 +All you can also do is don't trust the dimensions +of the reaction because maybe there is something. + +0:53:06.967 --> 0:53:16.863 +You can also look into what happens really +in the individual space. + +0:53:16.863 --> 0:53:22.247 +What is the nearest neighbor of the. + +0:53:22.482 --> 0:53:29.608 +So you can take the relationship between France +and Paris and add it to Italy and you'll. + +0:53:30.010 --> 0:53:33.078 +You can do big and bigger and you have small +and smaller and stuff. + +0:53:33.593 --> 0:53:49.417 +Because it doesn't work everywhere, there +is also some typical dish here in German. 
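The vector arithmetic just illustrated can be sketched as follows; `emb` is a hypothetical dictionary mapping words to unit-length numpy vectors.

```python
import numpy as np

def analogy(emb, a, b, c, topk=1):
    """Return the word(s) closest to emb[b] - emb[a] + emb[c],
    e.g. analogy(emb, "man", "king", "woman") is expected to give "queen"."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    scores = {w: float(v @ query) for w, v in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topk]
```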
+ +0:53:51.491 --> 0:54:01.677 +You can do what the person is doing for famous +ones, of course only like Einstein scientists + +0:54:01.677 --> 0:54:06.716 +that find midfielders not completely correct. + +0:54:06.846 --> 0:54:10.134 +You see the examples are a bit old. + +0:54:10.134 --> 0:54:15.066 +The politicians are no longer they am, but +of course. + +0:54:16.957 --> 0:54:26.759 +What people have done there, especially at +the beginning training our end language model, + +0:54:26.759 --> 0:54:28.937 +was very expensive. + +0:54:29.309 --> 0:54:38.031 +So one famous model was, but we are not really +interested in the language model performance. + +0:54:38.338 --> 0:54:40.581 +Think something good to keep in mind. + +0:54:40.581 --> 0:54:42.587 +What are we really interested in? + +0:54:42.587 --> 0:54:45.007 +Do we really want to have an R&N no? + +0:54:45.007 --> 0:54:48.607 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.500 +And so successful and very successful was +this word to vet. + +0:54:55.535 --> 0:54:56.865 +The idea is okay. + +0:54:56.865 --> 0:55:03.592 +We are not training real language one, making +it even simpler and doing this, for example, + +0:55:03.592 --> 0:55:05.513 +continuous peck of words. + +0:55:05.513 --> 0:55:12.313 +We're just having four input tokens and we're +predicting what is the word in the middle and + +0:55:12.313 --> 0:55:15.048 +this is just like two linear layers. + +0:55:15.615 --> 0:55:21.627 +So it's even simplifying things and making +the calculation faster because that is what + +0:55:21.627 --> 0:55:22.871 +we're interested. + +0:55:23.263 --> 0:55:32.897 +All this continuous skip ground models with +these other models which refer to as where + +0:55:32.897 --> 0:55:34.004 +to where. + +0:55:34.234 --> 0:55:42.394 +Where you have one equal word and the other +way around, you're predicting the four words + +0:55:42.394 --> 0:55:43.585 +around them. + +0:55:43.585 --> 0:55:45.327 +It's very similar. + +0:55:45.327 --> 0:55:48.720 +The task is in the end very similar. + +0:55:51.131 --> 0:56:01.407 +Before we are going to the next point, anything +about normal weight vectors or weight embedding. + +0:56:04.564 --> 0:56:07.794 +The next thing is contexture. + +0:56:07.794 --> 0:56:12.208 +Word embeddings and the idea is helpful. + +0:56:12.208 --> 0:56:19.206 +However, we might even be able to get more +from one lingo layer. + +0:56:19.419 --> 0:56:31.732 +And now in the word that is overlap of these +two meanings, so it represents both the meaning + +0:56:31.732 --> 0:56:33.585 +of can do it. + +0:56:34.834 --> 0:56:40.410 +But we might be able to in the pre-trained +model already disambiguate this because they + +0:56:40.410 --> 0:56:41.044 +are used. + +0:56:41.701 --> 0:56:53.331 +So if we can have a model which can not only +represent a word but can also represent the + +0:56:53.331 --> 0:56:58.689 +meaning of the word within the context,. + +0:56:59.139 --> 0:57:03.769 +So then we are going to context your word +embeddings. + +0:57:03.769 --> 0:57:07.713 +We are really having a representation in the. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:23.791 +The hidden state represents what is currently +said, but it's focusing on what is the last + +0:57:23.791 --> 0:57:29.303 +one, so it's some of the representation. 
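A minimal sketch of the continuous bag-of-words model described above: essentially two linear layers, averaging the context embeddings and predicting the word in the middle. The sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words: average the context word embeddings and
    predict the word in the middle with a single output layer."""
    def __init__(self, vocab_size, emb_dim=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # first "linear" layer
        self.out = nn.Linear(emb_dim, vocab_size)     # second linear layer

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) surrounding word ids
        h = self.emb(context_ids).mean(dim=1)         # average the context
        return self.out(h)                            # scores for middle word
```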
+ +0:57:29.509 --> 0:57:43.758 +The first one doing that is something like +the Elmo paper where they instead of this is + +0:57:43.758 --> 0:57:48.129 +the normal language model. + +0:57:48.008 --> 0:57:50.714 +Within the third, predicting the fourth, and +so on. + +0:57:50.714 --> 0:57:53.004 +So you are always predicting the next work. + +0:57:53.193 --> 0:57:57.335 +The architecture is the heaven words embedding +layer and then layers. + +0:57:57.335 --> 0:58:03.901 +See you, for example: And now instead of using +this one in the end, you're using here this + +0:58:03.901 --> 0:58:04.254 +one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:18.610 +We can train it in a language model style +always predicting the next word, but we have + +0:58:18.610 --> 0:58:21.088 +more information trained there. + +0:58:21.088 --> 0:58:26.123 +Therefore, in the system it has to learn less +additional things. + +0:58:27.167 --> 0:58:31.261 +And there is one Edendang which is done currently +in GPS. + +0:58:31.261 --> 0:58:38.319 +The only difference is that we have more layers, +bigger size, and we're using transformer neurocell + +0:58:38.319 --> 0:58:40.437 +potential instead of the RNA. + +0:58:40.437 --> 0:58:45.095 +But that is how you train like some large +language models at the. + +0:58:46.746 --> 0:58:55.044 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:55.044 --> 0:59:02.942 +So if you think of this one as a contextual +representation of the third word,. + +0:59:07.587 --> 0:59:16.686 +Is representing a three in the context of +a sentence, however only in the context of + +0:59:16.686 --> 0:59:18.185 +the previous. + +0:59:18.558 --> 0:59:27.413 +However, we have an architecture which can +also take both sides and we have used that + +0:59:27.413 --> 0:59:30.193 +already in the ink holder. + +0:59:30.630 --> 0:59:34.264 +So we could do the iron easily on your, also +in the backward direction. + +0:59:34.874 --> 0:59:42.826 +By just having the states the other way around +and then we couldn't combine the forward and + +0:59:42.826 --> 0:59:49.135 +the forward into a joint one where we are doing +this type of prediction. + +0:59:49.329 --> 0:59:50.858 +So you have the word embedding. + +0:59:51.011 --> 1:00:02.095 +Then you have two in the states, one on the +forward arm and one on the backward arm, and + +1:00:02.095 --> 1:00:10.314 +then you can, for example, take the cocagenation +of both of them. + +1:00:10.490 --> 1:00:23.257 +Now this same here represents mainly this +word because this is what both puts in it last + +1:00:23.257 --> 1:00:30.573 +and we know is focusing on what is happening +last. + +1:00:31.731 --> 1:00:40.469 +However, there is a bit of difference when +training that as a language model you already + +1:00:40.469 --> 1:00:41.059 +have. + +1:00:43.203 --> 1:00:44.956 +Maybe There's Again This Masking. + +1:00:46.546 --> 1:00:47.748 +That is one solution. + +1:00:47.748 --> 1:00:52.995 +First of all, why we can't do it is the information +you leak it, so you cannot just predict the + +1:00:52.995 --> 1:00:53.596 +next word. + +1:00:53.596 --> 1:00:58.132 +If we just predict the next word in this type +of model, that's a very simple task. + +1:00:58.738 --> 1:01:09.581 +You know the next word because it's influencing +this hidden state predicting something is not + +1:01:09.581 --> 1:01:11.081 +a good task. 
+ +1:01:11.081 --> 1:01:18.455 +You have to define: Because in this case what +will end with the system will just ignore these + +1:01:18.455 --> 1:01:22.966 +estates and what will learn is copy this information +directly in here. + +1:01:23.343 --> 1:01:31.218 +So it would be representing this word and +you would have nearly a perfect model because + +1:01:31.218 --> 1:01:38.287 +you only need to find encoding where you can +encode all words somehow in this. + +1:01:38.458 --> 1:01:44.050 +The only thing can learn is that turn and +encode all my words in this upper hidden. + +1:01:44.985 --> 1:01:53.779 +Therefore, it's not really useful, so we need +to find a bit of different ways out. + +1:01:55.295 --> 1:01:57.090 +There is a masking one. + +1:01:57.090 --> 1:02:03.747 +I'll come to that shortly just a bit that +other things also have been done, so the other + +1:02:03.747 --> 1:02:06.664 +thing is not to directly combine them. + +1:02:06.664 --> 1:02:13.546 +That was in the animal paper, so you have +them forward R&M and you keep them completely + +1:02:13.546 --> 1:02:14.369 +separated. + +1:02:14.594 --> 1:02:20.458 +So you never merged to state. + +1:02:20.458 --> 1:02:33.749 +At the end, the representation of the word +is now from the forward. + +1:02:33.873 --> 1:02:35.953 +So it's always the hidden state before the +good thing. + +1:02:36.696 --> 1:02:41.286 +These two you join now to your to the representation. + +1:02:42.022 --> 1:02:48.685 +And then you have now a representation also +about like the whole sentence for the word, + +1:02:48.685 --> 1:02:51.486 +but there is no information leakage. + +1:02:51.486 --> 1:02:58.149 +One way of doing this is instead of doing +a bidirection along you do a forward pass and + +1:02:58.149 --> 1:02:59.815 +then join the hidden. + +1:03:00.380 --> 1:03:05.960 +So you can do that in all layers. + +1:03:05.960 --> 1:03:16.300 +In the end you do the forwarded layers and +you get the hidden. + +1:03:16.596 --> 1:03:19.845 +However, it's a bit of a complicated. + +1:03:19.845 --> 1:03:25.230 +You have to keep both separate and merge things +so can you do. + +1:03:27.968 --> 1:03:33.030 +And that is the moment where like the big. + +1:03:34.894 --> 1:03:39.970 +The big success of the burnt model was used +where it okay. + +1:03:39.970 --> 1:03:47.281 +Maybe in bite and rich case it's not good +to do the next word prediction, but we can + +1:03:47.281 --> 1:03:48.314 +do masking. + +1:03:48.308 --> 1:03:56.019 +Masking mainly means we do a prediction of +something in the middle or some words. + +1:03:56.019 --> 1:04:04.388 +So the idea is if we have the input, we are +putting noise into the input, removing them, + +1:04:04.388 --> 1:04:07.961 +and then the model we are interested. + +1:04:08.048 --> 1:04:15.327 +Now there can be no information leakage because +this wasn't predicting that one is a big challenge. + +1:04:16.776 --> 1:04:19.957 +Do any assumption about our model? + +1:04:19.957 --> 1:04:26.410 +It doesn't need to be a forward model or a +backward model or anything. + +1:04:26.410 --> 1:04:29.500 +You can always predict the three. + +1:04:30.530 --> 1:04:34.844 +There's maybe one bit of a disadvantage. + +1:04:34.844 --> 1:04:40.105 +Do you see what could be a bit of a problem +this? + +1:05:00.000 --> 1:05:06.429 +Yes, so yeah, you can of course mask more, +but to see it more globally, just first assume + +1:05:06.429 --> 1:05:08.143 +you're only masked one. 
+ +1:05:08.143 --> 1:05:13.930 +For the whole sentence, we get one feedback +signal, like what is the word three. + +1:05:13.930 --> 1:05:22.882 +So we have one training example: If you do +the language modeling taste, we predicted here, + +1:05:22.882 --> 1:05:24.679 +we predicted here. + +1:05:25.005 --> 1:05:26.735 +So we have number of tokens. + +1:05:26.735 --> 1:05:30.970 +For each token we have a feet pad and say +what is the best correction. + +1:05:31.211 --> 1:05:43.300 +So in this case this is less efficient because +we are getting less feedback signals on what + +1:05:43.300 --> 1:05:45.797 +we should predict. + +1:05:48.348 --> 1:05:56.373 +So and bird, the main ideas are that you're +doing this bidirectional model with masking. + +1:05:56.373 --> 1:05:59.709 +It's using transformer architecture. + +1:06:00.320 --> 1:06:06.326 +There are two more minor changes. + +1:06:06.326 --> 1:06:16.573 +We'll see that this next word prediction is +another task. + +1:06:16.957 --> 1:06:30.394 +You want to learn more about what language +is to really understand following a story or + +1:06:30.394 --> 1:06:35.127 +their independent tokens into. + +1:06:38.158 --> 1:06:42.723 +The input is using word units as we use it. + +1:06:42.723 --> 1:06:50.193 +It has some special token that is framing +for the next word prediction. + +1:06:50.470 --> 1:07:04.075 +It's more for classification task because +you may be learning a general representation + +1:07:04.075 --> 1:07:07.203 +as a full sentence. + +1:07:07.607 --> 1:07:19.290 +You're doing segment embedding, so you have +an embedding for it. + +1:07:19.290 --> 1:07:24.323 +This is the first sentence. + +1:07:24.684 --> 1:07:29.099 +Now what is more challenging is this masking. + +1:07:29.099 --> 1:07:30.827 +What do you mask? + +1:07:30.827 --> 1:07:35.050 +We already have the crush enough or should. + +1:07:35.275 --> 1:07:42.836 +So there has been afterwards eating some work +like, for example, a bearer. + +1:07:42.836 --> 1:07:52.313 +It's not super sensitive, but if you do it +completely wrong then you're not letting anything. + +1:07:52.572 --> 1:07:54.590 +That's Then Another Question There. + +1:07:56.756 --> 1:08:04.594 +Should I mask all types of should I always +mask the footwork or if I have a subword to + +1:08:04.594 --> 1:08:10.630 +mask only like a subword and predict them based +on the other ones? + +1:08:10.630 --> 1:08:14.504 +Of course, it's a bit of a different task. + +1:08:14.894 --> 1:08:21.210 +If you know three parts of the words, it might +be easier to guess the last because they here + +1:08:21.210 --> 1:08:27.594 +took the easiest selection, so not considering +words anymore at all because you're doing that + +1:08:27.594 --> 1:08:32.280 +in the preprocessing and just taking always +words and like subwords. + +1:08:32.672 --> 1:08:36.089 +Think in group there is done differently. + +1:08:36.089 --> 1:08:40.401 +They mark always the full words, but guess +it's not. + +1:08:41.001 --> 1:08:46.044 +And then what to do with the mask word in +eighty percent of the cases. + +1:08:46.044 --> 1:08:50.803 +If the word is masked, they replace it with +a special token thing. + +1:08:50.803 --> 1:08:57.197 +This is a mask token in ten percent they put +in some random other token in there, and ten + +1:08:57.197 --> 1:08:59.470 +percent they keep it on change. + +1:09:02.202 --> 1:09:10.846 +And then what you can do is also this next +word prediction. + +1:09:10.846 --> 1:09:14.880 +The man went to Mass Store. 
+ +1:09:14.880 --> 1:09:17.761 +He bought a gallon. + +1:09:18.418 --> 1:09:24.088 +So may you see you're joining them, you're +doing both masks and prediction that you're. + +1:09:24.564 --> 1:09:29.449 +Is a penguin mask or flyless birds. + +1:09:29.449 --> 1:09:41.390 +These two sentences have nothing to do with +each other, so you can do also this type of + +1:09:41.390 --> 1:09:43.018 +prediction. + +1:09:47.127 --> 1:09:57.043 +And then the whole bird model, so here you +have the input here to transform the layers, + +1:09:57.043 --> 1:09:58.170 +and then. + +1:09:58.598 --> 1:10:17.731 +And this model was quite successful in general +applications. + +1:10:17.937 --> 1:10:27.644 +However, there is like a huge thing of different +types of models coming from them. + +1:10:27.827 --> 1:10:38.709 +So based on others these supervised molds +like a whole setup came out of there and now + +1:10:38.709 --> 1:10:42.086 +this is getting even more. + +1:10:42.082 --> 1:10:46.640 +With availability of a large language model +than the success. + +1:10:47.007 --> 1:10:48.436 +We have now even larger ones. + +1:10:48.828 --> 1:10:50.961 +Interestingly, it goes a bit. + +1:10:50.910 --> 1:10:57.847 +Change the bit again from like more the spider +action model to uni directional models. + +1:10:57.847 --> 1:11:02.710 +Are at the moment maybe a bit more we're coming +to them now? + +1:11:02.710 --> 1:11:09.168 +Do you see one advantage while what is another +event and we have the efficiency? + +1:11:09.509 --> 1:11:15.901 +Is one other reason why you are sometimes +more interested in uni-direction models than + +1:11:15.901 --> 1:11:17.150 +in bi-direction. + +1:11:22.882 --> 1:11:30.220 +It depends on the pass, but for example for +a language generation pass, the eccard is not + +1:11:30.220 --> 1:11:30.872 +really. + +1:11:32.192 --> 1:11:40.924 +It doesn't work so if you want to do a generation +like the decoder you don't know the future + +1:11:40.924 --> 1:11:42.896 +so you cannot apply. + +1:11:43.223 --> 1:11:53.870 +So this time of model can be used for the +encoder in an encoder model, but it cannot + +1:11:53.870 --> 1:11:57.002 +be used for the decoder. + +1:12:00.000 --> 1:12:05.012 +That's a good view to the next overall cast +of models. + +1:12:05.012 --> 1:12:08.839 +Perhaps if you view it from the sequence. + +1:12:09.009 --> 1:12:12.761 +We have the encoder base model. + +1:12:12.761 --> 1:12:16.161 +That's what we just look at. + +1:12:16.161 --> 1:12:20.617 +They are bidirectional and typically. + +1:12:20.981 --> 1:12:22.347 +That Is the One We Looked At. + +1:12:22.742 --> 1:12:34.634 +At the beginning is the decoder based model, +so see out in regressive models which are unidirective + +1:12:34.634 --> 1:12:42.601 +like an based model, and there we can do the +next word prediction. + +1:12:43.403 --> 1:12:52.439 +And what you can also do first, and there +you can also have a special things called prefix + +1:12:52.439 --> 1:12:53.432 +language. + +1:12:54.354 --> 1:13:05.039 +Because we are saying it might be helpful +that some of your input can also use bi-direction. + +1:13:05.285 --> 1:13:12.240 +And that is somehow doing what it is called +prefix length. + +1:13:12.240 --> 1:13:19.076 +On the first tokens you directly give your +bidirectional. + +1:13:19.219 --> 1:13:28.774 +So you somehow merge that and that mainly +works only in transformer based models because. + +1:13:29.629 --> 1:13:33.039 +There is no different number of parameters +in our end. 
+ +1:13:33.039 --> 1:13:34.836 +We need a back foot our end. + +1:13:34.975 --> 1:13:38.533 +Transformer: The only difference is how you +mask your attention. + +1:13:38.878 --> 1:13:44.918 +We have seen that in the anchoder and decoder +the number of parameters is different because + +1:13:44.918 --> 1:13:50.235 +you do cross attention, but if you do forward +and backward or union directions,. + +1:13:50.650 --> 1:13:58.736 +It's only like you mask your attention to +only look at the bad past or to look into the + +1:13:58.736 --> 1:13:59.471 +future. + +1:14:00.680 --> 1:14:03.326 +And now you can of course also do mixing. + +1:14:03.563 --> 1:14:08.306 +So this is a bi-directional attention matrix +where you can attend to everything. + +1:14:08.588 --> 1:14:23.516 +There is a uni-direction or causal where you +can look at the past and you can do the first + +1:14:23.516 --> 1:14:25.649 +three words. + +1:14:29.149 --> 1:14:42.831 +That somehow clear based on that, then of +course you cannot do the other things. + +1:14:43.163 --> 1:14:50.623 +So the idea is we have our anchor to decoder +architecture. + +1:14:50.623 --> 1:14:57.704 +Can we also train them completely in a side +supervisor? + +1:14:58.238 --> 1:15:09.980 +And in this case we have the same input to +both, so in this case we need to do some type + +1:15:09.980 --> 1:15:12.224 +of masking here. + +1:15:12.912 --> 1:15:17.696 +Here we don't need to do the masking, but +here we need to masking that doesn't know ever + +1:15:17.696 --> 1:15:17.911 +so. + +1:15:20.440 --> 1:15:30.269 +And this type of model got quite successful +also, especially for pre-training machine translation. + +1:15:30.330 --> 1:15:39.059 +The first model doing that is a Bart model, +which exactly does that, and yes, it's one + +1:15:39.059 --> 1:15:42.872 +successful way to pre train your one. + +1:15:42.872 --> 1:15:47.087 +It's pretraining your full encoder model. + +1:15:47.427 --> 1:15:54.365 +Where you put in contrast to machine translation, +where you put in source sentence, we can't + +1:15:54.365 --> 1:15:55.409 +do that here. + +1:15:55.715 --> 1:16:01.382 +But we can just put the second twice in there, +and then it's not a trivial task. + +1:16:01.382 --> 1:16:02.432 +We can change. + +1:16:03.003 --> 1:16:12.777 +And there is like they do different corruption +techniques so you can also do. + +1:16:13.233 --> 1:16:19.692 +That you couldn't do in an agricultural system +because then it wouldn't be there and you cannot + +1:16:19.692 --> 1:16:20.970 +predict somewhere. + +1:16:20.970 --> 1:16:26.353 +So the anchor, the number of input and output +tokens always has to be the same. + +1:16:26.906 --> 1:16:29.818 +You cannot do a prediction for something which +isn't in it. + +1:16:30.110 --> 1:16:38.268 +Here in the decoder side it's unidirection +so we can also delete the top and then try + +1:16:38.268 --> 1:16:40.355 +to generate the full. + +1:16:41.061 --> 1:16:45.250 +We can do sentence permutation. + +1:16:45.250 --> 1:16:54.285 +We can document rotation and text infilling +so there is quite a bit. + +1:16:55.615 --> 1:17:06.568 +So you see there's quite a lot of types of +models that you can use in order to pre-train. + +1:17:07.507 --> 1:17:14.985 +Then, of course, there is again for the language +one. + +1:17:14.985 --> 1:17:21.079 +The other question is how do you integrate? + +1:17:21.761 --> 1:17:26.636 +And there's also, like yeah, quite some different +ways of techniques. 
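A small sketch of the attention masks discussed above; `True` marks positions that may be attended, and the prefix variant corresponds to the prefix language model where the first tokens see each other bidirectionally while the rest remains causal.

```python
import torch

def attention_mask(seq_len, kind="causal", prefix_len=0):
    """Return a (seq_len, seq_len) boolean mask: entry (i, j) is True if
    query position i may attend to key position j."""
    full = torch.ones(seq_len, seq_len).bool()          # bidirectional
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
    if kind == "bidirectional":
        return full
    if kind == "causal":
        return causal
    if kind == "prefix":
        mask = causal.clone()
        mask[:, :prefix_len] = True   # everyone may look at the prefix
        return mask
    raise ValueError(kind)
```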
+ +1:17:27.007 --> 1:17:28.684 +It's a Bit Similar to Before. + +1:17:28.928 --> 1:17:39.068 +So the easiest thing is you take your word +embeddings or your free trained model. + +1:17:39.068 --> 1:17:47.971 +You freeze them and stack your decoder layers +and keep these ones free. + +1:17:48.748 --> 1:17:54.495 +Can also be done if you have this type of +bark model. + +1:17:54.495 --> 1:18:03.329 +What you can do is you freeze your word embeddings, +for example some products and. + +1:18:05.865 --> 1:18:17.296 +The other thing is you initialize them so +you initialize your models but you train everything + +1:18:17.296 --> 1:18:19.120 +so you're not. + +1:18:22.562 --> 1:18:29.986 +Then one thing, if you think about Bart, you +want to have the Chinese language, the Italian + +1:18:29.986 --> 1:18:32.165 +language, and the deconer. + +1:18:32.165 --> 1:18:35.716 +However, in Bart we have the same language. + +1:18:36.516 --> 1:18:46.010 +The one you get is from English, so what you +can do there is so you cannot try to do some. + +1:18:46.366 --> 1:18:52.562 +Below the barge, in order to learn some language +specific stuff, or there's a masculine barge, + +1:18:52.562 --> 1:18:58.823 +which is trained on many languages, but it's +trained only on like the Old Coast Modern Language + +1:18:58.823 --> 1:19:03.388 +House, which may be trained in German and English, +but not on German. + +1:19:03.923 --> 1:19:08.779 +So then you would still need to find June +and the model needs to learn how to better + +1:19:08.779 --> 1:19:10.721 +do the attention cross lingually. + +1:19:10.721 --> 1:19:15.748 +It's only on the same language but it mainly +only has to learn this mapping and not all + +1:19:15.748 --> 1:19:18.775 +the rest and that's why it's still quite successful. + +1:19:21.982 --> 1:19:27.492 +Now certain thing which is very commonly used +is what is required to it as adapters. + +1:19:27.607 --> 1:19:29.754 +So for example you take and buy. + +1:19:29.709 --> 1:19:35.218 +And you put some adapters on the inside of +the networks so that it's small new layers + +1:19:35.218 --> 1:19:40.790 +which are in between put in there and then +you only train these adapters or also train + +1:19:40.790 --> 1:19:41.815 +these adapters. + +1:19:41.815 --> 1:19:47.900 +For example, an embryo you could see that +this learns to map the Sears language representation + +1:19:47.900 --> 1:19:50.334 +to the Tiger language representation. + +1:19:50.470 --> 1:19:52.395 +And then you don't have to change that luck. + +1:19:52.792 --> 1:19:59.793 +You give it extra ability to really perform +well on that. + +1:19:59.793 --> 1:20:05.225 +These are quite small and so very efficient. + +1:20:05.905 --> 1:20:12.632 +That is also very commonly used, for example +in modular systems where you have some adaptors + +1:20:12.632 --> 1:20:16.248 +in between here which might be language specific. + +1:20:16.916 --> 1:20:22.247 +So they are trained only for one language. + +1:20:22.247 --> 1:20:33.777 +The model has some or both and once has the +ability to do multilingually to share knowledge. + +1:20:34.914 --> 1:20:39.058 +But there's one chance in general in the multilingual +systems. + +1:20:39.058 --> 1:20:40.439 +It works quite well. + +1:20:40.439 --> 1:20:46.161 +There's one case or one specific use case +for multilingual where this normally doesn't + +1:20:46.161 --> 1:20:47.344 +really work well. + +1:20:47.344 --> 1:20:49.975 +Do you have an idea what that could be? 
+ +1:20:55.996 --> 1:20:57.536 +It's for Zero Shot Cases. + +1:20:57.998 --> 1:21:03.660 +Because having here some situation with this +might be very language specific and zero shot, + +1:21:03.660 --> 1:21:09.015 +the idea is always to learn representations +view which are more language dependent and + +1:21:09.015 --> 1:21:10.184 +with the adaptors. + +1:21:10.184 --> 1:21:15.601 +Of course you get in representations again +which are more language specific and then it + +1:21:15.601 --> 1:21:17.078 +doesn't work that well. + +1:21:20.260 --> 1:21:37.730 +And there is also the idea of doing more knowledge +pistolation. + +1:21:39.179 --> 1:21:42.923 +And now the idea is okay. + +1:21:42.923 --> 1:21:54.157 +We are training it the same, but what we want +to achieve is that the encoder. + +1:21:54.414 --> 1:22:03.095 +So you should learn faster by trying to make +these states as similar as possible. + +1:22:03.095 --> 1:22:11.777 +So you compare the first-hit state of the +pre-trained model and try to make them. + +1:22:12.192 --> 1:22:18.144 +For example, by using the out two norms, so +by just making these two representations the + +1:22:18.144 --> 1:22:26.373 +same: The same vocabulary: Why does it need +the same vocabulary with any idea? + +1:22:34.754 --> 1:22:46.137 +If you have different vocabulary, it's typical +you also have different sequenced lengths here. + +1:22:46.137 --> 1:22:50.690 +The number of sequences is different. + +1:22:51.231 --> 1:22:58.888 +If you now have pipe stains and four states +here, it's no longer straightforward which + +1:22:58.888 --> 1:23:01.089 +states compare to which. + +1:23:02.322 --> 1:23:05.246 +And that's just easier if you have like the +same number. + +1:23:05.246 --> 1:23:08.940 +You can always compare the first to the first +and second to the second. + +1:23:09.709 --> 1:23:16.836 +So therefore at least the very easy way of +knowledge destination only works if you have. + +1:23:17.177 --> 1:23:30.030 +Course: You could do things like yeah, the +average should be the same, but of course there's + +1:23:30.030 --> 1:23:33.071 +a less strong signal. + +1:23:34.314 --> 1:23:42.979 +But the advantage here is that you have a +diameter training signal here on the handquarter + +1:23:42.979 --> 1:23:51.455 +so you can directly make some of the encoder +already giving a good signal while normally + +1:23:51.455 --> 1:23:52.407 +an empty. + +1:23:56.936 --> 1:24:13.197 +Yes, think this is most things for today, +so what you should keep in mind is remind me. + +1:24:13.393 --> 1:24:18.400 +The one is a back translation idea. + +1:24:18.400 --> 1:24:29.561 +If you have monolingual and use that, the +other one is to: And mentally it is often helpful + +1:24:29.561 --> 1:24:33.614 +to combine them so you can even use both of +that. + +1:24:33.853 --> 1:24:38.908 +So you can use pre-trained walls, but then +you can even still do back translation where + +1:24:38.908 --> 1:24:40.057 +it's still helpful. + +1:24:40.160 --> 1:24:45.502 +We have the advantage we are training like +everything working together on the task so + +1:24:45.502 --> 1:24:51.093 +it might be helpful even to backtranslate some +data and then use it in a real translation + +1:24:51.093 --> 1:24:56.683 +setup because in pretraining of course the +beach challenge is always that you're training + +1:24:56.683 --> 1:24:57.739 +it on different. + +1:24:58.058 --> 1:25:03.327 +Different ways of how you integrate this knowledge. 
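Going back to the knowledge-distillation idea above, a minimal sketch of the hidden-state loss; the tensor shapes and the plain L2 formulation are assumptions of this sketch, and it only works when both models use the same vocabulary, so the sequence lengths match.

```python
import torch

def hidden_state_distillation_loss(student_states, teacher_states):
    """Pull the MT encoder states towards the states of a frozen pre-trained
    encoder. Both tensors are assumed to have shape (batch, seq_len, hidden)."""
    return torch.mean((student_states - teacher_states.detach()) ** 2)
```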
+ +1:25:03.327 --> 1:25:08.089 +Even if you just use a full model, so in this. + +1:25:08.748 --> 1:25:11.128 +This is the most similar you can get. + +1:25:11.128 --> 1:25:13.945 +You're doing no changes to the architecture. + +1:25:13.945 --> 1:25:19.643 +You're really taking the model and just fine +tuning them on the new task, but it still has + +1:25:19.643 --> 1:25:24.026 +to completely newly learn how to do the attention +and how to do that. + +1:25:24.464 --> 1:25:29.971 +And that might be, for example, helpful to +have more back-translated data to learn them. + +1:25:32.192 --> 1:25:34.251 +That's for today. + +1:25:34.251 --> 1:25:44.661 +There's one important thing that next Tuesday +there is a conference or a workshop or so in + +1:25:44.661 --> 1:25:45.920 +this room. + +1:25:47.127 --> 1:25:56.769 +You should get an e-mail if you're in Elias +that there's a room change for Tuesdays and + +1:25:56.769 --> 1:25:57.426 +it's. + +1:25:57.637 --> 1:26:03.890 +There are more questions, yeah, have a more +general position, especially: In computer vision + +1:26:03.890 --> 1:26:07.347 +you can enlarge your data center data orientation. + +1:26:07.347 --> 1:26:08.295 +Is there any? + +1:26:08.388 --> 1:26:15.301 +It's similar to a large speech for text for +the data of an edge. + +1:26:15.755 --> 1:26:29.176 +And you can use this back translation and +also masking, but back translation is some + +1:26:29.176 --> 1:26:31.228 +way of data. + +1:26:31.371 --> 1:26:35.629 +So it has also been, for example, even its +used not only for monolingual data. + +1:26:36.216 --> 1:26:54.060 +If you have good MP system, it can also be +used for parallel data. + +1:26:54.834 --> 1:26:59.139 +So would say this is the most similar one. + +1:26:59.139 --> 1:27:03.143 +There's ways you can do power phrasing. + +1:27:05.025 --> 1:27:12.057 +But for example there is very hard to do this +by rules like which words to replace because + +1:27:12.057 --> 1:27:18.936 +there is not a coup like you cannot always +say this word can always be replaced by that. + +1:27:19.139 --> 1:27:27.225 +Mean, although they are many perfect synonyms, +normally they are good in some cases, but not + +1:27:27.225 --> 1:27:29.399 +in all cases, and so on. + +1:27:29.399 --> 1:27:36.963 +And if you don't do a rule based, you have +to train your model and then the freshness. + +1:27:38.058 --> 1:27:57.236 +The same architecture as the pre-trained mount. + +1:27:57.457 --> 1:27:59.810 +Should be of the same dimension, so it's easiest +to have the same dimension. + +1:28:00.000 --> 1:28:01.590 +Architecture. + +1:28:01.590 --> 1:28:05.452 +We later will learn inefficiency. + +1:28:05.452 --> 1:28:12.948 +You can also do knowledge cessulation with, +for example, smaller. + +1:28:12.948 --> 1:28:16.469 +You can learn the same within. + +1:28:17.477 --> 1:28:22.949 +Eight layers for it so that is possible, but +yeah agree it should be of the same. + +1:28:23.623 --> 1:28:32.486 +Yeah yeah you need the question then of course +you can do it like it's an initialization or + +1:28:32.486 --> 1:28:41.157 +you can do it doing training but normally it +most makes sense during the normal training. + +1:28:45.865 --> 1:28:53.963 +Do it, then thanks a lot, and then we'll see +each other again on Tuesday. + +0:00:00.981 --> 0:00:17.559 +What we want today about is how to use some +type of additional resources to improve the + +0:00:17.559 --> 0:00:20.008 +translation. 
+ +0:00:20.300 --> 0:00:31.387 +We have in the first part of the semester +how to build some of your basic machine translation. + +0:00:31.571 --> 0:00:40.743 +You know now the basic components both for +statistical and for neural, with the encoder + +0:00:40.743 --> 0:00:42.306 +decoder model. + +0:00:43.123 --> 0:00:45.950 +Now, of course, that's not where it stops. + +0:00:45.950 --> 0:00:51.340 +It's still what in nearly every machine translation +system is currently in India. + +0:00:51.340 --> 0:00:57.323 +However, there is a lot of challenges which +you need to address in addition and which need + +0:00:57.323 --> 0:00:58.243 +to be solved. + +0:00:58.918 --> 0:01:03.031 +We want to start with these parts. + +0:01:03.031 --> 0:01:07.614 +What else can you do around this part? + +0:01:07.614 --> 0:01:09.847 +You can be honest. + +0:01:10.030 --> 0:01:14.396 +And one important question there is on what +do you train your models? + +0:01:14.394 --> 0:01:27.237 +Because this type of parallel data is easier +in machine translation than many other tasks + +0:01:27.237 --> 0:01:33.516 +where you have a decent amount of training. + +0:01:33.853 --> 0:01:40.789 +And therefore an important question is: Can +we also learn from other sources and improve + +0:01:40.789 --> 0:01:41.178 +our. + +0:01:41.701 --> 0:01:47.840 +Because if you remember from quite the beginning +of the lecture,. + +0:01:51.171 --> 0:01:53.801 +This is how we train all our. + +0:01:54.194 --> 0:02:01.318 +Machine learning models, all the corpus bases +from statistical to neural. + +0:02:01.318 --> 0:02:09.694 +This doesn't have change, so we need this +type of parallel data where we have a source + +0:02:09.694 --> 0:02:13.449 +sentence aligned with the target data. + +0:02:13.493 --> 0:02:19.654 +We have now a strong model here, a very good +model to do that. + +0:02:19.654 --> 0:02:22.099 +However, we always rely. + +0:02:22.522 --> 0:02:27.376 +More languages, higher resource languages, +prayers that say from German to English or + +0:02:27.376 --> 0:02:31.327 +other European languages, there is a decent +amount at least for some. + +0:02:31.471 --> 0:02:46.131 +But even there, if we're going to very specific +domains, it might get difficult and then your + +0:02:46.131 --> 0:02:50.966 +system performance might drop. + +0:02:51.231 --> 0:02:55.261 +Phrases how to use the vocabulary, and so +on, and the style. + +0:02:55.915 --> 0:03:04.104 +And if you're going to other languages, there +is of course a lot bigger challenge. + +0:03:04.104 --> 0:03:05.584 +Why can't you? + +0:03:05.825 --> 0:03:09.647 +So is really this the only resource you can +use. + +0:03:09.889 --> 0:03:20.667 +Or can we adapt our models in order to also +make use of other types of models that might + +0:03:20.667 --> 0:03:27.328 +enable us to build strong systems with other +types of. + +0:03:27.707 --> 0:03:35.283 +And that's what we will look into now in the +next, starting from Tuesday in the next. + +0:03:35.515 --> 0:03:43.437 +So this idea we already have covered on Tuesday, +so one very successful idea for this is to + +0:03:43.437 --> 0:03:45.331 +do more multilingual. + +0:03:45.645 --> 0:03:52.010 +So that we're no longer only doing translation +between two languages, but we can do translation + +0:03:52.010 --> 0:03:55.922 +between many languages and share common knowledge +between. + +0:03:56.296 --> 0:04:06.477 +And you also learned about that you can even +do things like zero shot machine translations. 
+ +0:04:06.786 --> 0:04:09.792 +Which is the case for many many language pairs. + +0:04:10.030 --> 0:04:17.406 +Even with German, you have not translation +parallel data to all languages around the world, + +0:04:17.406 --> 0:04:22.698 +or most of them you have it to the Europeans, +maybe for Japanese. + +0:04:22.698 --> 0:04:26.386 +But even for Japanese, it will get difficult. + +0:04:26.746 --> 0:04:32.862 +There is quite a lot of data, for example +English to Japanese, but German to Vietnamese. + +0:04:32.862 --> 0:04:39.253 +There is some data from Multilingual Corpora +where you can extract the name, but your amount + +0:04:39.253 --> 0:04:41.590 +really is dropping significantly. + +0:04:42.042 --> 0:04:54.907 +So that is a very promising direction if you +want to build translation systems between language + +0:04:54.907 --> 0:05:00.134 +pairs, typically not English, because. + +0:05:01.221 --> 0:05:05.888 +And the other ideas, of course, we don't have +data, just search for data. + +0:05:06.206 --> 0:05:15.755 +There is some work on data crawling so if +don't have a corpus directly or don't have + +0:05:15.755 --> 0:05:23.956 +a high quality corpus from the European Parliament +for TED corpus maybe. + +0:05:24.344 --> 0:05:35.528 +There has been a big effort in Europe to collect +data sets for parallel data. + +0:05:35.528 --> 0:05:40.403 +How can we do this data crawling? + +0:05:40.600 --> 0:05:46.103 +There the interesting thing from the machine +translation point is not just general data + +0:05:46.103 --> 0:05:46.729 +crawling. + +0:05:47.067 --> 0:05:52.067 +But how can we explicitly crawl data, which +is somewhat parallel? + +0:05:52.132 --> 0:05:58.538 +So there is in the Internet quite a lot of +data which has been like company websites which + +0:05:58.538 --> 0:06:01.565 +have been translated and things like that. + +0:06:01.565 --> 0:06:05.155 +So how can you extract them and then extract +them? + +0:06:06.566 --> 0:06:13.406 +There is typically more noisy than where you +do more, hence mean if you have your Parliament. + +0:06:13.693 --> 0:06:21.305 +You can do some rules how to extract the parallel +things. + +0:06:21.305 --> 0:06:30.361 +Here there is more to it, so the quality is +later maybe not as good. + +0:06:33.313 --> 0:06:39.927 +The other thing is can we use monolingual +data and monolingual data has a big advantage + +0:06:39.927 --> 0:06:46.766 +that we can have a huge amount of that so that +you can be able to crawl from the internet. + +0:06:46.766 --> 0:06:51.726 +The nice thing is you can also get it typically +for many domains. + +0:06:52.352 --> 0:06:58.879 +There is just so much more magnitude more +of monolingual data so that it might be very + +0:06:58.879 --> 0:06:59.554 +helpful. + +0:06:59.559 --> 0:07:06.187 +We can do that in statistical machine translation +was quite easy to integrate using language + +0:07:06.187 --> 0:07:06.757 +models. + +0:07:08.508 --> 0:07:14.499 +In neural machine translation we have the +advantage that we have this overall and architecture + +0:07:14.499 --> 0:07:18.850 +that does everything together, but it has also +the disadvantage now. + +0:07:18.850 --> 0:07:22.885 +It's more difficult to put in this type of +information or make. + +0:07:23.283 --> 0:07:26.427 +We'll look to two things. + +0:07:26.427 --> 0:07:37.432 +You can still try to do a bit of language +modeling in there and add an additional language + +0:07:37.432 --> 0:07:38.279 +model. 
+ +0:07:38.178 --> 0:07:43.771 +A way which I think is used in most systems +at the moment is to do synthetic data. + +0:07:43.763 --> 0:07:53.095 +It's a very easy thing, but you can just translate +there and then use it as training data. + +0:07:53.213 --> 0:07:59.192 +And thereby you are able to use like some +type of moonlighting. + +0:08:00.380 --> 0:08:09.521 +Another way to do it is to ensure that some +are in the extreme case. + +0:08:09.521 --> 0:08:14.026 +If you have a scenario that only. + +0:08:14.754 --> 0:08:24.081 +The impressive thing is if you have large +amounts of data and the languages are not too + +0:08:24.081 --> 0:08:31.076 +dissimilar, you can even in this case build +a translation system. + +0:08:32.512 --> 0:08:36.277 +That we will see then next Thursday. + +0:08:37.857 --> 0:08:55.462 +And then there is now a fourth type of restorer +that recently became very successful and now. + +0:08:55.715 --> 0:09:02.409 +So the idea is we are no longer sharing the +real data such as text data, but it can also + +0:09:02.409 --> 0:09:04.139 +help to train a model. + +0:09:04.364 --> 0:09:08.599 +And that is now a big advantage of deep learning +based approaches. + +0:09:08.599 --> 0:09:14.414 +There you have this ability that you can train +a model on some task and then you can modify + +0:09:14.414 --> 0:09:19.913 +it maybe and then apply it to another task +and you can somewhat transfer the knowledge + +0:09:19.913 --> 0:09:22.125 +from the first task to the second. + +0:09:22.722 --> 0:09:31.906 +And then, of course, the question is, can +it have an initial task where it's very easy + +0:09:31.906 --> 0:09:34.439 +to train on the second? + +0:09:34.714 --> 0:09:53.821 +The task that you pre-train on is more similar +to a language. + +0:09:53.753 --> 0:10:06.293 +A bit of a different way of using language +malls in this more transfer learning set. + +0:10:09.029 --> 0:10:18.747 +So first we will start with how can we use +monolingual data to do a machine translation? + +0:10:20.040 --> 0:10:22.542 +The. + +0:10:22.062 --> 0:10:28.924 +This big difference is you should remember +from what I mentioned before is in statistical + +0:10:28.924 --> 0:10:30.525 +machine translation. + +0:10:30.525 --> 0:10:33.118 +We directly have the opportunity. + +0:10:33.118 --> 0:10:39.675 +There's peril data for a translation model +and monolingual data for a language model. + +0:10:39.679 --> 0:10:45.735 +And you combine your translation model and +your language model, and then you can make. + +0:10:46.726 --> 0:10:54.263 +That has big advantages that you can make +use of these large amounts of monolingual data, + +0:10:54.263 --> 0:10:55.519 +but of course. + +0:10:55.495 --> 0:11:02.198 +Because we said the problem is, we are optimizing +both parts independently to each other, and + +0:11:02.198 --> 0:11:09.329 +we say the big advantage of newer machine translation +is we are optimizing the overall architecture + +0:11:09.329 --> 0:11:10.541 +to perform best. + +0:11:10.890 --> 0:11:17.423 +And then, of course, we can't do that, so +here we can only use power there. + +0:11:17.897 --> 0:11:25.567 +So the question is, but if this advantage +is not so important, we can train everything, + +0:11:25.567 --> 0:11:33.499 +but we have large amounts of monolingual data +or small amounts, but they fit perfectly, so + +0:11:33.499 --> 0:11:35.242 +they are very good. 
+ +0:11:35.675 --> 0:11:41.438 +So in data we know it's not only important +the amount of data we have but also like how + +0:11:41.438 --> 0:11:43.599 +similar it is to your test data. + +0:11:43.599 --> 0:11:49.230 +So it can be that this volume is even only +quite small but it's very well fitting and + +0:11:49.230 --> 0:11:51.195 +then it's still very helpful. + +0:11:51.195 --> 0:11:55.320 +So the question is if this is the case how +can we make use of? + +0:11:55.675 --> 0:12:03.171 +And the first year of surprisingness, if we +are here successful with integrating a language + +0:12:03.171 --> 0:12:10.586 +model into a translation system, maybe we can +also integrate some types of language models + +0:12:10.586 --> 0:12:14.415 +into our MT system in order to make it better. + +0:12:16.536 --> 0:12:19.000 +The first thing we can do is okay. + +0:12:19.000 --> 0:12:23.293 +We know there is language models, so let's +try to integrate. + +0:12:23.623 --> 0:12:30.693 +There was mainly used language models because +these works were mainly done before transformer + +0:12:30.693 --> 0:12:31.746 +based models. + +0:12:32.152 --> 0:12:41.567 +And generally, of course, you can do the same +thing with all the Transformers baseballs. + +0:12:41.721 --> 0:12:58.900 +It has mainly been done before people started +using R&S, and they tried to do this more + +0:12:58.900 --> 0:13:01.888 +in cases where. + +0:13:07.087 --> 0:13:17.508 +So what we're having here is some of this +type of idea. + +0:13:17.508 --> 0:13:25.511 +This is a key system here as you remember. + +0:13:25.605 --> 0:13:29.470 +Gets in with your last instinct and calculates +your attention. + +0:13:29.729 --> 0:13:36.614 +We get the context and combine both and then +based on that and then predict the target. + +0:13:37.057 --> 0:13:42.423 +So this is our anti-system, and the question +is, can we somehow integrate the language? + +0:13:42.782 --> 0:13:55.788 +And of course, if someone makes sense to take +out a neural language model because we're anyway + +0:13:55.788 --> 0:14:01.538 +in the neural space, it's not surprising. + +0:14:01.621 --> 0:14:15.522 +And there would be something like on top of +there and you're a language model and you have + +0:14:15.522 --> 0:14:17.049 +a target. + +0:14:17.597 --> 0:14:27.007 +So if we're having this type of language model, +there's two main questions we have to answer. + +0:14:27.007 --> 0:14:28.108 +How do we? + +0:14:28.208 --> 0:14:37.935 +So how do we combine now on the one hand our +NMT system and on the other hand our RNA you + +0:14:37.935 --> 0:14:45.393 +see that was mentioned before when we started +talking about encoder. + +0:14:45.805 --> 0:14:49.523 +The wild is like unconditioned, it's just +modeling the targets side. + +0:14:49.970 --> 0:14:57.183 +And the other one is a conditional language, +which is a language condition on the sewer + +0:14:57.183 --> 0:14:57.839 +center. + +0:14:58.238 --> 0:15:03.144 +So the question is how can you not combine +two language models? + +0:15:03.144 --> 0:15:09.813 +Of course, it's like the translation model +will some will be more important because it + +0:15:09.813 --> 0:15:11.806 +has access to the source. + +0:15:11.806 --> 0:15:16.713 +We want to generate something which corresponds +to your source. + +0:15:18.778 --> 0:15:20.918 +If we had that, the other question is OK. + +0:15:20.918 --> 0:15:22.141 +Now we have two models. + +0:15:22.141 --> 0:15:25.656 +If we even have integrated them, the answer +is how do we train them? 
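[Editor's illustration] The simplest answer to the combination question just raised, and the parallel combination the lecture turns to next, is to interpolate the output distribution of the conditional translation model with that of the unconditional target-side language model at each decoding step (shallow-fusion style). All numbers below are made up; a weight of 0.5 recovers the plain average mentioned in the lecture.

```python
import torch

def combined_next_word_distribution(p_nmt, p_lm, weight=0.5):
    """Parallel combination at one decoding step: interpolate the conditional
    translation probability with the unconditional language-model probability.
    `weight` controls how much the translation model is trusted."""
    return weight * p_nmt + (1.0 - weight) * p_lm

# toy example with a 5-word target vocabulary (made-up probabilities)
p_nmt = torch.tensor([0.10, 0.60, 0.10, 0.10, 0.10])  # P(y_t | y_<t, source)
p_lm  = torch.tensor([0.05, 0.30, 0.50, 0.10, 0.05])  # P(y_t | y_<t)

p = combined_next_word_distribution(p_nmt, p_lm, weight=0.7)
print(p, p.sum())       # still a valid distribution (sums to 1)
print(int(p.argmax()))  # word chosen at this decoding step
```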
+ +0:15:26.026 --> 0:15:39.212 +Because we have integrated them, we have no +two sets of data with parallel data where you + +0:15:39.212 --> 0:15:42.729 +can do the lower thing. + +0:15:44.644 --> 0:15:47.575 +So the first idea is okay. + +0:15:47.575 --> 0:15:53.436 +We can do something more like a parallel combination. + +0:15:53.436 --> 0:15:55.824 +We just keep running. + +0:15:56.036 --> 0:15:59.854 +So a year you see your NMT system that is +running. + +0:16:00.200 --> 0:16:08.182 +First of all, it's normally completely independent +of your language model, which is up there. + +0:16:08.182 --> 0:16:13.278 +So down here we have just our NMT system, +which is running. + +0:16:13.313 --> 0:16:26.439 +The only thing which is used is we have the +words inputted, and of course they are put + +0:16:26.439 --> 0:16:28.099 +into both. + +0:16:28.099 --> 0:16:41.334 +We also put: So we use them in parallel, and +then we are doing our decision just by merging + +0:16:41.334 --> 0:16:42.905 +these two. + +0:16:43.343 --> 0:16:52.288 +So there can be, for example, we are doing +a probability distribution here, we are doing + +0:16:52.288 --> 0:17:01.032 +a purability distribution here, and then we +are taking the average of both per ability + +0:17:01.032 --> 0:17:03.343 +to do our predictions. + +0:17:11.871 --> 0:17:18.929 +You could also take the output which seems +to be more short about the answer. + +0:17:20.000 --> 0:17:23.272 +Yes, you could also do that. + +0:17:23.272 --> 0:17:27.222 +It's more like a gating mechanism. + +0:17:27.222 --> 0:17:32.865 +You're not doing everything, but you're focusing. + +0:17:32.993 --> 0:17:38.927 +Another one would be you could also just concatenate +the hidden states and then you have another + +0:17:38.927 --> 0:17:41.802 +layer on top which based on the concatenation. + +0:17:43.303 --> 0:17:58.634 +If you think about it, you do the coordination +instead of taking the instead and then merging + +0:17:58.634 --> 0:18:01.244 +the perability. + +0:18:03.143 --> 0:18:15.027 +Yes, in the end you introduce many new parameters +and these parameters have somehow something + +0:18:15.027 --> 0:18:17.303 +special compared. + +0:18:23.603 --> 0:18:33.657 +So before all the other parameters can be +trained independently of each other, the language + +0:18:33.657 --> 0:18:42.071 +one can be trained independent and an antisystem +can be trained independent. + +0:18:43.043 --> 0:18:51.198 +If you have a joint layer of course you need +to train them because you have inputs so you + +0:18:51.198 --> 0:19:01.560 +need: Not surprisingly, if you have a parallel +combination or whether you could, the other + +0:19:01.560 --> 0:19:04.664 +way is to do more serial combinations. + +0:19:04.924 --> 0:19:10.382 +How can you do a similar combination? + +0:19:10.382 --> 0:19:18.281 +Your final decision makes sense to do it based +on the. + +0:19:18.438 --> 0:19:20.997 +So you have on top of your normal an system. + +0:19:21.121 --> 0:19:30.826 +The only thing is now your inputting into +your NIT system. + +0:19:30.826 --> 0:19:38.723 +You're no longer inputting the word embeddings. + +0:19:38.918 --> 0:19:47.819 +You're training the lower layers here which +are trained more on the purely language model + +0:19:47.819 --> 0:19:55.434 +and on top you're putting into the NMT system +where it now has the language. + +0:19:55.815 --> 0:19:59.003 +So here you can also view it here. 
+ +0:19:59.003 --> 0:20:06.836 +You have more contextual embeddings which +no longer depend on the word, but they also + +0:20:06.836 --> 0:20:10.661 +depend on the context of the target site. + +0:20:11.051 --> 0:20:21.797 +More understanding of the source word. + +0:20:21.881 --> 0:20:34.761 +So if it's like the word can, for example, +will be put in here always the same, independent + +0:20:34.761 --> 0:20:41.060 +of its use of can of beans, or if can do it. + +0:20:41.701 --> 0:20:43.165 +Empties. + +0:20:44.364 --> 0:20:54.959 +So another view, if you're remembering more +the transformer based approach, is you have + +0:20:54.959 --> 0:21:01.581 +some layers, and the lower layers are purely +language. + +0:21:02.202 --> 0:21:08.052 +This is purely language model and then at +some point you're starting to attend to the + +0:21:08.052 --> 0:21:08.596 +source. + +0:21:13.493 --> 0:21:20.774 +Yes, so these are two ways of how you combine +it, so run them in peril, or first do the language. + +0:21:23.623 --> 0:21:26.147 +Questions for the integration. + +0:21:31.831 --> 0:21:35.034 +Not really sure about the input of the. + +0:21:35.475 --> 0:21:38.123 +And this case with a sequence. + +0:21:38.278 --> 0:21:50.721 +Is the input and bedding, the target word +embedding, or the actual word, and then we + +0:21:50.721 --> 0:21:54.821 +transfer it to a numerical. + +0:21:56.176 --> 0:22:08.824 +That depends on if you view the word embedding +as part of the language model, so of course + +0:22:08.824 --> 0:22:10.909 +you first put. + +0:22:11.691 --> 0:22:13.938 +And then the word embedding there is the r&n. + +0:22:14.314 --> 0:22:20.296 +So of course you can view this together as +your language model when you first do the word + +0:22:20.296 --> 0:22:21.027 +embedding. + +0:22:21.401 --> 0:22:28.098 +All you can say are the RNAs and this is like +before. + +0:22:28.098 --> 0:22:36.160 +It's more a definition, but you're right, +so what are the steps? + +0:22:36.516 --> 0:22:46.655 +One of these parts, you know, called a language +model is definitionally not that important, + +0:22:46.655 --> 0:22:47.978 +but that's. + +0:22:53.933 --> 0:23:02.812 +So the question is how can you then train +them and make make this this one work? + +0:23:03.363 --> 0:23:15.492 +So in the case where you combine the language +of our abilities you can train them independently + +0:23:15.492 --> 0:23:18.524 +and then just put them. + +0:23:18.918 --> 0:23:29.623 +It might not be the best because we have no +longer this ability before that. + +0:23:29.623 --> 0:23:33.932 +They optimal perform together. + +0:23:34.514 --> 0:23:41.050 +At least you need to summarize how much do +you trust the one model and how much do you + +0:23:41.050 --> 0:23:41.576 +trust. + +0:23:43.323 --> 0:23:48.529 +But still in some cases usually it might be +helpful if you have only data and so on. + +0:23:48.928 --> 0:24:06.397 +However, we have one specific situation that +leads to the pearl leader is always mono legal + +0:24:06.397 --> 0:24:07.537 +data. + +0:24:08.588 --> 0:24:17.693 +So what we can also do is more the pre-training +approach. + +0:24:17.693 --> 0:24:24.601 +We first train the language model and then. + +0:24:24.704 --> 0:24:33.468 +So the pre-training approach you first train +on the monolingual data and then you join the. + +0:24:33.933 --> 0:24:45.077 +Of course, the model size is this way, but +the data size is of course too big. + +0:24:45.077 --> 0:24:52.413 +You often have more monolingual data than +parallel. 
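[Editor's illustration] A minimal sketch of the pre-train-then-fine-tune recipe just described for the target side: the same decoder parameters are first trained as a plain language model on monolingual text, and training then simply continues on the (smaller) parallel data. The toy GRU decoder and all sizes are assumptions; in a real system the second stage would of course train jointly with the encoder and attention.

```python
import torch
import torch.nn as nn

vocab, d_model = 1000, 128
embed = nn.Embedding(vocab, d_model)
rnn = nn.GRU(d_model, d_model, batch_first=True)
out = nn.Linear(d_model, vocab)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def lm_step(target_ids):
    """Next-word prediction on target-side text; the same objective is used for
    monolingual pre-training and for the target side of the parallel data."""
    h, _ = rnn(embed(target_ids[:, :-1]))     # predict token t from tokens < t
    logits = out(h)
    loss = loss_fn(logits.reshape(-1, vocab), target_ids[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

mono = torch.randint(0, vocab, (32, 20))          # large monolingual target data
parallel_tgt = torch.randint(0, vocab, (8, 20))   # small parallel target side

for _ in range(3):                 # stage 1: pre-train as a language model
    lm_step(mono)
for _ in range(3):                 # stage 2: continue training on the parallel data
    lm_step(parallel_tgt)          # (jointly with the encoder in a real system)
```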
+ +0:24:56.536 --> 0:24:57.901 +Any ideas. + +0:25:04.064 --> 0:25:10.108 +Had one example where this might also be helpful +if you want to adapt to a domain so let's say + +0:25:10.108 --> 0:25:16.281 +you do medical sentences and if you want to +translate medical sentences and you have monolingual + +0:25:16.281 --> 0:25:22.007 +data on the target side for medical sentences +but you only have parallel data for general + +0:25:22.007 --> 0:25:22.325 +use. + +0:25:23.083 --> 0:25:30.601 +In this case it could be, or it's the most +probable happen if you're learning out there + +0:25:30.601 --> 0:25:38.804 +what medical means, but then in your fine tuning +step the model is forgetting everything about. + +0:25:39.099 --> 0:25:42.340 +So this type of priest training step is good. + +0:25:42.340 --> 0:25:47.978 +If your pretraining data is more general, +very large, and then you're adapting. + +0:25:48.428 --> 0:25:55.545 +But in the task we have monolingual data, +which should be used to adapt the system to + +0:25:55.545 --> 0:25:57.780 +some genre of topic style. + +0:25:57.817 --> 0:26:08.572 +Then, of course, this is not a good strategy +because you might forget about everything up + +0:26:08.572 --> 0:26:09.408 +there. + +0:26:09.649 --> 0:26:17.494 +So then you have to check what you can do +for them to see. + +0:26:17.494 --> 0:26:25.738 +You can freeze this part and you can do a +direct combination. + +0:26:25.945 --> 0:26:33.796 +Where you train both of them, and then you +train the language more and parallel on their + +0:26:33.796 --> 0:26:34.942 +one so that. + +0:26:35.395 --> 0:26:37.687 +Eh What You Learn in the Length. + +0:26:37.937 --> 0:26:48.116 +So the bit depends on what you want to combine +is that you use a language model because it's. + +0:26:48.548 --> 0:26:56.380 +Then you normally don't really forget it because +it's also in the or you use it to adapt to + +0:26:56.380 --> 0:26:58.083 +something specific. + +0:27:01.001 --> 0:27:06.662 +Then there is so this is a way of how we can +make use of monolingual data. + +0:27:07.968 --> 0:27:11.787 +It seems to be the easiest one somehow. + +0:27:11.787 --> 0:27:19.140 +It's more similar to what we are doing with +statistical machine translation. + +0:27:19.140 --> 0:27:20.095 +However,. + +0:27:21.181 --> 0:27:27.211 +Normally always beats this type of model, +which in some view can be from the conceptual + +0:27:27.211 --> 0:27:27.691 +thing. + +0:27:27.691 --> 0:27:31.460 +At least it's even easier from the computational +side. + +0:27:31.460 --> 0:27:36.805 +Sometimes it has a disadvantage that it's +more problematic or more difficult. + +0:27:40.560 --> 0:27:42.576 +And the idea is okay. + +0:27:42.576 --> 0:27:45.141 +We have a monolingual data. + +0:27:45.141 --> 0:27:50.822 +We just translate it and then generate some +type of parallel. + +0:27:51.111 --> 0:28:00.465 +So if you want to build a German to English +system, your first trained German to English + +0:28:00.465 --> 0:28:02.147 +system on your. + +0:28:02.402 --> 0:28:05.217 +Then you have more pearl data. + +0:28:05.217 --> 0:28:13.482 +The interesting thing is if you then train +on the joint thing, on the original pearl data, + +0:28:13.482 --> 0:28:18.749 +and on that one is artificial, it even normally +improves. + +0:28:18.918 --> 0:28:26.490 +You can because you're not doing the same +error all the time and you have some knowledge. 
+ +0:28:28.028 --> 0:28:40.080 +With this first approach, however, there's +one issue: why it might not work the best, + +0:28:40.080 --> 0:28:43.163 +so could you imagine? + +0:28:49.409 --> 0:28:51.186 +Ready a bit shown in image two. + +0:28:53.113 --> 0:29:00.637 +Have a few trains on bad quality data. + +0:29:00.637 --> 0:29:08.741 +The system will learn also in the states. + +0:29:08.828 --> 0:29:12.210 +And as you're saying, it's a system always +mistranslates. + +0:29:13.493 --> 0:29:14.497 +Something. + +0:29:14.497 --> 0:29:23.623 +Then you will learn that this is correct because +now it's training data and you will even encourage + +0:29:23.623 --> 0:29:25.996 +it to make it more often. + +0:29:25.996 --> 0:29:29.921 +So the problem on training on your own is. + +0:29:30.150 --> 0:29:34.222 +But however, as you systematically do, you +even enforce more and will even do more. + +0:29:34.654 --> 0:29:37.401 +So that might not be the best solution. + +0:29:37.401 --> 0:29:40.148 +Do any idea how you could do it better? + +0:29:44.404 --> 0:29:57.653 +If you had something else to prevent some +systematic problems, yes, that is one way. + +0:30:04.624 --> 0:30:10.809 +The problem is yeah, the translations are +not perfect, so the output and you're learning + +0:30:10.809 --> 0:30:11.990 +something wrong. + +0:30:11.990 --> 0:30:17.967 +Normally it's less bad if your inputs are +somewhat bad, but your outputs are perfect. + +0:30:18.538 --> 0:30:26.670 +So if your inputs are wrong you maybe learn +that if you're doing this wrong input you're + +0:30:26.670 --> 0:30:30.782 +generating something correct but you're not. + +0:30:31.511 --> 0:30:40.911 +So often the case is that it's more important +that your target is correct. + +0:30:40.911 --> 0:30:47.052 +If on the source there is something crazy, +then. + +0:30:47.347 --> 0:30:52.184 +But you can assume in your application scenario +you hope that you mainly get correct input. + +0:30:52.572 --> 0:31:02.126 +So that is not harming you as much, and in +machine translation we have some of these symmetries, + +0:31:02.126 --> 0:31:02.520 +so. + +0:31:02.762 --> 0:31:04.578 +And also the other way around. + +0:31:04.578 --> 0:31:09.792 +It's a very similar task, so there's a task +to translate from German to English, but the + +0:31:09.792 --> 0:31:13.892 +task to translate from English to German is +very similar and helpful. + +0:31:14.094 --> 0:31:19.313 +So what we can do is, we can just switch it +initially and generate the data the other way + +0:31:19.313 --> 0:31:19.777 +around. + +0:31:20.120 --> 0:31:25.699 +So what we are doing here is we are starting +with an English to German system. + +0:31:25.699 --> 0:31:32.126 +Then we are translating the English data into +German, where the German is maybe not really + +0:31:32.126 --> 0:31:32.903 +very nice. + +0:31:33.293 --> 0:31:46.045 +And then we're training on our original data +and on the back translated data where only + +0:31:46.045 --> 0:31:51.696 +the input is good and it's like human. + +0:31:52.632 --> 0:32:01.622 +So here we have now the advantage that always +our target site is of human quality and the + +0:32:01.622 --> 0:32:02.322 +input. + +0:32:03.583 --> 0:32:08.998 +And then this helps us to get really good +form. + +0:32:08.998 --> 0:32:15.428 +There's one important difference if you think +about the. + +0:32:21.341 --> 0:32:31.604 +It's too obvious here we need a target side +monolingual layer and the first. 
+ +0:32:31.931 --> 0:32:47.143 +So back translation is normally working if +you have target size parallel and not search + +0:32:47.143 --> 0:32:48.180 +side. + +0:32:48.448 --> 0:32:55.493 +Might be also a bit if you think about it +understandable that it's more important to + +0:32:55.493 --> 0:32:56.819 +be like better. + +0:32:57.117 --> 0:33:04.472 +On the suicide you have to understand the +content, on the target side you have to generate + +0:33:04.472 --> 0:33:12.232 +really sentences and somehow it's more difficult +to generate something than to only understand. + +0:33:17.617 --> 0:33:29.916 +One other thing, so typically it's shown here +differently, but typically it's like this works + +0:33:29.916 --> 0:33:30.701 +well. + +0:33:31.051 --> 0:33:32.978 +Because normally there's like a lot more. + +0:33:33.253 --> 0:33:36.683 +So the question is, should really take all +of my data? + +0:33:36.683 --> 0:33:38.554 +There's two problems with it. + +0:33:38.554 --> 0:33:42.981 +Of course, it's expensive because you have +to translate all this data. + +0:33:42.981 --> 0:33:48.407 +And secondly, if you had, although now your +packet site is wrong, it might be that you + +0:33:48.407 --> 0:33:51.213 +still have your wrong correlations in there. + +0:33:51.651 --> 0:34:01.061 +So if you don't know the normally good starting +point is to take equal amount of data as many + +0:34:01.061 --> 0:34:02.662 +backtranslated. + +0:34:02.963 --> 0:34:05.366 +Of course, it depends on the use case. + +0:34:05.366 --> 0:34:07.215 +There are very few data here. + +0:34:07.215 --> 0:34:08.510 +It makes more sense. + +0:34:08.688 --> 0:34:14.273 +It depends on how good your quality is here, +so the better the model is observable, the + +0:34:14.273 --> 0:34:17.510 +more data you might use because quality is +better. + +0:34:17.510 --> 0:34:23.158 +So it depends on a lot of things, but yeah, +a rule of sample like good general way often + +0:34:23.158 --> 0:34:24.808 +is to have equal amounts. + +0:34:26.646 --> 0:34:31.233 +And you can of course do that now iteratively. + +0:34:31.233 --> 0:34:39.039 +It said already that the quality at the end, +of course, depends on this system. + +0:34:39.039 --> 0:34:46.163 +Also, because the better this system is, the +better your synthetic data. + +0:34:47.207 --> 0:34:50.949 +That leads to what is referred to as iterated +back translation. + +0:34:51.291 --> 0:34:56.911 +So you're playing a model on English to German +and you translate the data. + +0:34:56.957 --> 0:35:03.397 +Then you train a model on German to English +with the additional data. + +0:35:03.397 --> 0:35:11.954 +Then you translate German when you translate +German data and then you train again your first + +0:35:11.954 --> 0:35:12.414 +one. + +0:35:12.414 --> 0:35:14.346 +So you iterate that. + +0:35:14.334 --> 0:35:19.653 +Because now your system is better because +it's not only trained on the small data but + +0:35:19.653 --> 0:35:22.003 +additionally on back translated data. + +0:35:22.442 --> 0:35:24.458 +And so you can get better. + +0:35:24.764 --> 0:35:31.739 +However, typically you can stop quite early, +so maybe one iteration is good, but then you + +0:35:31.739 --> 0:35:35.072 +have diminishing gains after two or three. + +0:35:35.935 --> 0:35:44.094 +There's very slight difference and then yeah +because you need of course quite big difference + +0:35:44.094 --> 0:35:45.937 +in the quality here. + +0:35:45.937 --> 0:35:46.814 +In order. 
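[Editor's illustration] A small sketch of the back-translation recipe discussed above, including the equal-amounts rule of thumb: a reverse (target-to-source) system turns target-side monolingual text into synthetic source sentences, while the human-written target side stays clean. Here `reverse_translate` stands in for any trained target-to-source model, and the lambda used in the example is only a placeholder.

```python
import random

def back_translate(target_monolingual, reverse_translate, real_parallel):
    """Build training data as (source, target) pairs: synthetic source from the
    reverse system, human-quality target, mixed with the real parallel data."""
    synthetic = [(reverse_translate(t), t) for t in target_monolingual]
    # rule of thumb from the lecture: start with roughly as much synthetic
    # data as real parallel data
    synthetic = random.sample(synthetic, min(len(synthetic), len(real_parallel)))
    return real_parallel + synthetic

# toy usage with a placeholder "model" that just tags the sentence
real = [("ein haus", "a house"), ("ein auto", "a car")]
mono_target = ["a tree", "a river", "a mountain"]
mixed = back_translate(mono_target, lambda s: f"<synthetic German for: {s}>", real)
for src, tgt in mixed:
    print(src, "->", tgt)
```

Iterated back-translation, as described above, would repeat this with the roles of the two directions swapped, retraining each system on the other one's synthetic data.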
+ +0:35:47.207 --> 0:35:59.810 +Which is not too good because it means you +can already have to train it with relatively + +0:35:59.810 --> 0:36:02.245 +bad performance. + +0:36:03.723 --> 0:36:10.323 +And they don't yeah, a design decision would +advise so guess because it's easy to get it. + +0:36:10.550 --> 0:36:16.617 +Better to replace that because you have a +higher quality, but you of course keep your + +0:36:16.617 --> 0:36:18.310 +high quality real data. + +0:36:18.310 --> 0:36:21.626 +Then I think normally it's okay to replace +it. + +0:36:21.626 --> 0:36:24.518 +Of course you can also try to append it. + +0:36:24.518 --> 0:36:28.398 +I would assume it's not too much of a difference, +but. + +0:36:34.414 --> 0:36:40.567 +That's about like using monolingual data before +we go into the pre-train models. + +0:36:40.567 --> 0:36:42.998 +Do you have any more questions? + +0:36:49.029 --> 0:36:57.521 +Yes, so the other thing we can do and which +is recently more and more successful and even + +0:36:57.521 --> 0:37:05.731 +more successful since we have these really +large language models where you can even do + +0:37:05.731 --> 0:37:08.562 +a translation task with this. + +0:37:08.688 --> 0:37:16.132 +So here the idea is you learn a representation +of one task and then you use this representation. + +0:37:16.576 --> 0:37:27.276 +It was made maybe like one of the first where +it's really used largely is doing something + +0:37:27.276 --> 0:37:35.954 +like a bird which you pre-train on purely text +editor and then you take. + +0:37:36.496 --> 0:37:42.952 +And the one big advantage, of course, is that +people can only share data but also pre-train. + +0:37:43.423 --> 0:37:53.247 +So if you think of the recent models and the +large language models which are available, + +0:37:53.247 --> 0:37:59.611 +it is not possible for universities often to +train them. + +0:37:59.919 --> 0:38:09.413 +Think it costs several millions to train the +model just if you rent the GPS from some cloud + +0:38:09.413 --> 0:38:15.398 +company and train that the cost of training +these models. + +0:38:15.475 --> 0:38:21.735 +And guess as a student project you won't have +the budget to like build these models. + +0:38:21.801 --> 0:38:24.630 +So another idea is what you can do is okay. + +0:38:24.630 --> 0:38:27.331 +Maybe if these months are once available. + +0:38:27.467 --> 0:38:34.723 +You can take them and use them as a resource +similar to pure text, and you can now build + +0:38:34.723 --> 0:38:41.734 +models which some will learn not only from +from data but also from other models which + +0:38:41.734 --> 0:38:44.506 +are maybe trained on other tasks. + +0:38:44.844 --> 0:38:48.647 +So it's a quite new way of thinking of how +to train. + +0:38:48.647 --> 0:38:53.885 +So we are not only learning from examples, +but we might also learn from. + +0:38:54.534 --> 0:39:03.937 +The nice thing is that this type of training +where we are not learning directly from data + +0:39:03.937 --> 0:39:07.071 +by learning from other tasks. + +0:39:07.427 --> 0:39:15.581 +So the main idea to start with is to have +a personal initial task, and typically this + +0:39:15.581 --> 0:39:24.425 +initial task is for: And if you're working +with, that means you're training pure taxator + +0:39:24.425 --> 0:39:30.547 +because you have the largest amount of data +from the Internet. + +0:39:30.951 --> 0:39:35.857 +And then you're defining some type of task +in order to do your quick training. 
+ +0:39:36.176 --> 0:39:42.056 +And: There's a typical task you can train +on. + +0:39:42.056 --> 0:39:52.709 +That is like the language modeling text, so +to predict the next word, all we have related. + +0:39:52.932 --> 0:40:04.654 +But to predict something which you have not +in the input is a task which is easy to generate. + +0:40:04.654 --> 0:40:06.150 +That's why. + +0:40:06.366 --> 0:40:14.005 +By yourself, on the other hand, you need a +lot of knowledge, and that is the other thing + +0:40:14.005 --> 0:40:15.120 +you need to. + +0:40:15.735 --> 0:40:23.690 +Because there is this idea that the meaning +of the word heavily depends on the context + +0:40:23.690 --> 0:40:24.695 +it's used. + +0:40:25.145 --> 0:40:36.087 +So can give you a sentence with some gibberish +word and there's some name, and although you've + +0:40:36.087 --> 0:40:41.616 +never read the name, you will just assume that. + +0:40:42.062 --> 0:40:48.290 +Exactly the same thing, the models can also +learn something about the words in there by + +0:40:48.290 --> 0:40:49.139 +just using. + +0:40:49.649 --> 0:40:53.246 +So that is typically the new. + +0:40:53.246 --> 0:40:59.839 +Then we can use this model, use our data to +train the. + +0:41:00.800 --> 0:41:04.703 +Of course, it might need to adapt the system. + +0:41:04.703 --> 0:41:07.672 +To do that we might use only some. + +0:41:07.627 --> 0:41:16.326 +Part of the pre-train model in there is that +we have seen that a bit already in the RNA + +0:41:16.326 --> 0:41:17.215 +case is. + +0:41:17.437 --> 0:41:22.670 +So you can view the RN as one of these approaches. + +0:41:22.670 --> 0:41:28.518 +You train the RN language while on large pre-train +data. + +0:41:28.518 --> 0:41:32.314 +Then you put it somewhere into your. + +0:41:33.653 --> 0:41:37.415 +So this gives you the ability to really do +these types of tests. + +0:41:37.877 --> 0:41:49.027 +So that you can build a system which uses +knowledge, which is just trained on large amounts + +0:41:49.027 --> 0:41:52.299 +of data and extracting it. + +0:41:52.299 --> 0:41:53.874 +So it knows. + +0:41:56.376 --> 0:42:01.561 +So the question is that yeah, what type of +information so what type of models can you? + +0:42:01.821 --> 0:42:05.278 +And we want to today look at briefly at three. + +0:42:05.725 --> 0:42:08.474 +Was initially done. + +0:42:08.474 --> 0:42:21.118 +It wasn't as famous as in machine translation +as in other things, but it's also used there. + +0:42:21.221 --> 0:42:28.974 +So where you have this mapping from the one +hot to a small continuous word representation? + +0:42:29.229 --> 0:42:37.891 +Using this one in your anthrax you can, for +example, replace the embedding layer by the + +0:42:37.891 --> 0:42:38.776 +trained. + +0:42:39.139 --> 0:42:41.832 +That is helpful to be a really small amount +of data. + +0:42:42.922 --> 0:42:48.520 +You're always in this pre training phase and +have the thing the advantage is. + +0:42:48.468 --> 0:42:55.515 +More data, that's the trade off so you can +get better. + +0:42:55.515 --> 0:43:00.128 +Disadvantage is, does anybody have? + +0:43:04.624 --> 0:43:12.173 +Was one of the mentioned today, even like +big advantages of the system compared to previous. + +0:43:20.660 --> 0:43:26.781 +Where one advantage was the end to end training +so that all parameters and all components are + +0:43:26.781 --> 0:43:27.952 +optimal together. + +0:43:28.208 --> 0:43:33.386 +If you know pre-train something on one pass, +it's maybe no longer optimal fitting to everything. 
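[Editor's illustration] In code, re-using pre-trained word embeddings really is just filling, and optionally freezing, the embedding matrix, which is exactly the "replace the embedding layer" option mentioned above. All sizes are hypothetical; the random matrix stands in for vectors trained on large monolingual data.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50_000, 256
pretrained_vectors = torch.randn(vocab_size, emb_dim)   # stand-in for pre-trained vectors

# option 1: initialise with the pre-trained vectors and keep fine-tuning them
embedding = nn.Embedding(vocab_size, emb_dim)
embedding.weight.data.copy_(pretrained_vectors)

# option 2: freeze them and train only the rest of the model
frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

tokens = torch.randint(0, vocab_size, (4, 10))
print(embedding(tokens).shape, frozen(tokens).shape)     # (4, 10, 256) each

# the embedding layer is where most parameters sit:
print("embedding parameters:", vocab_size * emb_dim)     # 12,800,000
print("one 256x256 layer:   ", emb_dim * emb_dim)        # 65,536
```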
+ +0:43:33.893 --> 0:43:40.338 +So that is similar to what should do pretaining +or not. + +0:43:40.338 --> 0:43:48.163 +It depends on how important everything is +optimal together and how. + +0:43:48.388 --> 0:44:00.552 +If the state is a high quality of large amount, +the pre trained one is just so much better. + +0:44:00.600 --> 0:44:11.215 +Standing everything optimal together, we would +use random actions for amazing vices. + +0:44:11.691 --> 0:44:18.791 +Mean, we assume some structures that are trained +basically. + +0:44:18.791 --> 0:44:26.364 +Yes, if you're fine tuning everything, it +might be the problem. + +0:44:26.766 --> 0:44:31.139 +But often yeah, in some way right, so often +it's not about. + +0:44:31.139 --> 0:44:37.624 +You're really worse with some pre-trained +molecules because you're going already in some + +0:44:37.624 --> 0:44:43.236 +direction, and if this is not really optimal +for you, it might be difficult. + +0:44:43.603 --> 0:44:51.774 +But the bigger is, if you're not getting better +because you have a decent amount of data, it's + +0:44:51.774 --> 0:44:52.978 +so different. + +0:44:53.153 --> 0:45:04.884 +But mean initially it wasn't a machine translation +done so much because there was more data in + +0:45:04.884 --> 0:45:09.452 +the task, but now it's really large. + +0:45:12.632 --> 0:45:14.188 +The other one is then OK. + +0:45:14.188 --> 0:45:18.258 +Now it's always like how much of the model +do your pre-track a bit? + +0:45:18.658 --> 0:45:25.057 +The other one you can do is tack contextual +words and then something like bird or a robota + +0:45:25.057 --> 0:45:31.667 +where you train more already as sequence models +and the embeddings you're using are no longer + +0:45:31.667 --> 0:45:35.605 +specific for words but they're also taking +the context. + +0:45:35.875 --> 0:45:54.425 +Embedding you're using is no longer only depending +on the word itself but on the whole sentence. + +0:45:55.415 --> 0:46:03.714 +And of course you can use similar things also +in the decoder just by having layers which + +0:46:03.714 --> 0:46:09.122 +don't have access to the source but there it's +still not. + +0:46:11.451 --> 0:46:19.044 +And finally, and then we'll look at the end, +you can also have models which are already. + +0:46:19.419 --> 0:46:28.605 +So you may be training a sequence model, but +not a monolingual data. + +0:46:28.605 --> 0:46:35.128 +Of course you have to make it a bit challenging. + +0:46:36.156 --> 0:46:43.445 +But the idea is really you're pre-training +your whole model and then you're fine tuning. + +0:46:47.227 --> 0:46:59.487 +But let's first do a bit of step back and +look into what are the differences. + +0:46:59.487 --> 0:47:02.159 +The first thing. + +0:47:02.382 --> 0:47:06.870 +The word embeddings are just this first layer. + +0:47:06.870 --> 0:47:12.027 +You can train them with feed-forward neural +networks. + +0:47:12.212 --> 0:47:25.683 +But you can also train them in language model, +and by now you hopefully have also seen that + +0:47:25.683 --> 0:47:27.733 +you can also. + +0:47:30.130 --> 0:47:41.558 +So this is how you can train them, and you +are training them to predict the next word, + +0:47:41.558 --> 0:47:45.236 +the typical language model. + +0:47:45.525 --> 0:47:52.494 +And that is what is now referred to as a South +Supervised Learning, and for example all the + +0:47:52.494 --> 0:47:56.357 +big large language models like Chat, gp and +so on. 
+ +0:47:56.357 --> 0:48:03.098 +They are trained at an end or feet, but exactly +with this objective to predict the next. + +0:48:03.823 --> 0:48:12.847 +So that is where you can hopefully learn what +a word is used because you always try to predict + +0:48:12.847 --> 0:48:17.692 +the next word and then you have a ready intuition. + +0:48:19.619 --> 0:48:25.374 +In the word embedding, why do people first +look at the word embeddings and the use of + +0:48:25.374 --> 0:48:27.582 +word embeddings for other tasks? + +0:48:27.582 --> 0:48:32.600 +The main advantage is it might be only the +first layer you would think of. + +0:48:32.600 --> 0:48:34.474 +What does it really matter? + +0:48:34.474 --> 0:48:39.426 +However, it is the layer where you typically +have most of the parameters. + +0:48:39.879 --> 0:48:52.201 +Of course, if you have trained on most of +your parameters already on the large data, + +0:48:52.201 --> 0:48:59.304 +then on your target data you have to train +less. + +0:48:59.259 --> 0:49:05.841 +This big difference that your input size is +so much bigger than the size of the normal + +0:49:05.841 --> 0:49:06.522 +in size. + +0:49:06.626 --> 0:49:16.551 +So it's a normal size, maybe two hundred and +fifty, but your input embedding besides vocabulary + +0:49:16.551 --> 0:49:20.583 +size is something like fifty thousand. + +0:49:23.123 --> 0:49:30.163 +And bending while here you see, it's only +like times as much in the layer. + +0:49:30.750 --> 0:49:36.747 +So here's where most of your parameters are, +which means if you already replace the word + +0:49:36.747 --> 0:49:41.329 +embeddings, it might look a bit small in your +overall architecture. + +0:49:41.329 --> 0:49:47.056 +It's where most of the things are, and if +you're doing that, you already have really + +0:49:47.056 --> 0:49:48.876 +big games and can do that. + +0:49:57.637 --> 0:50:04.301 +The thing is we have seen these wooden beddings +can be very good used for other taps. + +0:50:04.784 --> 0:50:08.921 +Now you learn some relation between words. + +0:50:08.921 --> 0:50:14.790 +If you're doing this type of language modeling, +you predict. + +0:50:15.215 --> 0:50:21.532 +The one thing is, of course, you have a lot +of data, so the one question is we want to + +0:50:21.532 --> 0:50:25.961 +have a lot of data to good training models, +the other thing. + +0:50:25.961 --> 0:50:28.721 +The tasks need to be somewhat useful. + +0:50:29.169 --> 0:50:41.905 +If you would predict the first letter of the +word, it has to be a task where you need some + +0:50:41.905 --> 0:50:45.124 +syntactic information. + +0:50:45.545 --> 0:50:53.066 +The interesting thing is people have looked +at these world embeddings here in a language + +0:50:53.066 --> 0:50:53.658 +model. + +0:50:53.954 --> 0:51:04.224 +And you're looking at the word embeddings, +which are these vectors here. + +0:51:04.224 --> 0:51:09.289 +You can ask yourself, do they look? + +0:51:09.489 --> 0:51:15.122 +Don't know if your view is listening to artificial +advance artificial intelligence. + +0:51:15.515 --> 0:51:23.994 +We had on yesterday how to do this type of +representation, but you can do this kind of + +0:51:23.994 --> 0:51:29.646 +representation, and now you're seeing interesting +things. + +0:51:30.810 --> 0:51:41.248 +Now you can represent it here in a three dimensional +space with a dimension reduction. + +0:51:41.248 --> 0:51:46.886 +Then you can look into it and the interesting. 
+ +0:51:47.447 --> 0:51:57.539 +So this vector between the male and the female +version of something is not the same, but it's + +0:51:57.539 --> 0:51:58.505 +related. + +0:51:58.718 --> 0:52:11.256 +So you can do a bit of nuts, you subtract +this vector, add this vector, and then you + +0:52:11.256 --> 0:52:14.501 +look around this one. + +0:52:14.894 --> 0:52:19.691 +So that means okay, there is really something +stored, some information stored in that book. + +0:52:20.040 --> 0:52:25.003 +Similar you can do it with Buck and since +you see here swimming slam walk and walk. + +0:52:25.265 --> 0:52:42.534 +So again these vectors are not the same, but +they're related for going from here to here. + +0:52:43.623 --> 0:52:47.508 +Are semantically the relations between city +and capital? + +0:52:47.508 --> 0:52:49.757 +You have exactly the same thing. + +0:52:51.191 --> 0:52:57.857 +People having done question answering about +that if they show these embeddings and. + +0:52:58.218 --> 0:53:05.198 +Or you can also, if you don't trust the the +dimensional reduction because you say maybe + +0:53:05.198 --> 0:53:06.705 +there's something. + +0:53:06.967 --> 0:53:16.473 +Done you can also look into what happens really +in the indimensional space. + +0:53:16.473 --> 0:53:22.227 +You can look at what is the nearest neighbor. + +0:53:22.482 --> 0:53:29.605 +So you can take the relationship between France +and Paris and add it to Italy and nicely see. + +0:53:30.010 --> 0:53:33.082 +You can do big and bigger and you have small +and small lines. + +0:53:33.593 --> 0:53:38.202 +It doesn't work everywhere. + +0:53:38.202 --> 0:53:49.393 +There are also some which sometimes work, +so if you have a typical. + +0:53:51.491 --> 0:53:56.832 +You can do what the person is doing for famous +ones. + +0:53:56.832 --> 0:54:05.800 +Of course, only like Einstein, scientist, +that Messier finds Midfield are not completely + +0:54:05.800 --> 0:54:06.707 +correct. + +0:54:06.846 --> 0:54:09.781 +You'll see the examples are a bit old. + +0:54:09.781 --> 0:54:15.050 +The politicians are no longer there, but the +first one doesn't learn. + +0:54:16.957 --> 0:54:29.003 +What people have done there of courses, especially +at the beginning. + +0:54:29.309 --> 0:54:36.272 +So one famous model was, but we're not really +interested in the language model performance. + +0:54:36.272 --> 0:54:38.013 +We're only interested. + +0:54:38.338 --> 0:54:40.634 +Think something good to keep in mind. + +0:54:40.634 --> 0:54:42.688 +What are we really interested in? + +0:54:42.688 --> 0:54:44.681 +Do we really want to have an RN? + +0:54:44.681 --> 0:54:44.923 +No. + +0:54:44.923 --> 0:54:48.608 +In this case we are only interested in this +type of mapping. + +0:54:49.169 --> 0:54:55.536 +And so very successful was this word to beg. + +0:54:55.535 --> 0:55:02.597 +We are not training real language when making +it even simpler and doing this for example + +0:55:02.597 --> 0:55:04.660 +continuous back of words. + +0:55:04.660 --> 0:55:11.801 +We are just having four input tokens and we +are predicting what is the word in the middle + +0:55:11.801 --> 0:55:15.054 +and this is just like two linear layers. + +0:55:15.615 --> 0:55:22.019 +It's even simplifying things and making the +calculation faster because that is what we're + +0:55:22.019 --> 0:55:22.873 +interested. + +0:55:23.263 --> 0:55:34.059 +All this continues skip ground models of these +other two models. 
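[Editor's illustration] The continuous bag-of-words model just described really is this small: the embedding lookup acts as the first linear map, the context vectors are averaged, and a second linear map predicts the word in the middle; skip-gram, described next, simply flips inputs and outputs. Sizes are toy values and PyTorch is an assumed framework.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words: the surrounding words are averaged in the
    embedding space and a linear layer predicts the word in the middle.
    The embedding weights are the word vectors one is actually after."""
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context):                          # context: (batch, 4) word ids
        return self.out(self.emb(context).mean(dim=1))   # logits over the vocabulary

vocab = 5000
model = CBOW(vocab)
context = torch.randint(0, vocab, (8, 4))     # two words to the left, two to the right
center = torch.randint(0, vocab, (8,))        # the word in the middle
loss = nn.CrossEntropyLoss()(model(context), center)
loss.backward()                               # training updates the embedding matrix
print(loss.item())
```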
+ +0:55:34.234 --> 0:55:38.273 +You have one equal word and it's the other +way around. + +0:55:38.273 --> 0:55:41.651 +You're predicting the four words around them. + +0:55:41.651 --> 0:55:43.047 +It's very similar. + +0:55:43.047 --> 0:55:48.702 +The task is in the end very similar, but in +all of them it's about learning. + +0:55:51.131 --> 0:56:01.416 +Before we go into the next part, let's talk +about the normal white vector or white line. + +0:56:04.564 --> 0:56:07.562 +The next thing is contextual word embeddings. + +0:56:07.562 --> 0:56:08.670 +The idea is yes. + +0:56:08.670 --> 0:56:09.778 +This is helpful. + +0:56:09.778 --> 0:56:14.080 +However, we might be able to get more from +just only lingo later. + +0:56:14.080 --> 0:56:19.164 +For example, if you think about the word can, +it can have different meanings. + +0:56:19.419 --> 0:56:32.619 +And now in the word embeddings how you have +an overlap of these two meanings, so it represents + +0:56:32.619 --> 0:56:33.592 +those. + +0:56:34.834 --> 0:56:40.318 +But we might be able to in the pre-train model +already disambiguate these because they use + +0:56:40.318 --> 0:56:41.041 +completely. + +0:56:41.701 --> 0:56:50.998 +So if we can have a model which can not only +represent the word, but it can also represent + +0:56:50.998 --> 0:56:58.660 +the meaning of the word within the context, +it might be even more helpful. + +0:56:59.139 --> 0:57:03.342 +So then we're going to contextual word embeddings. + +0:57:03.342 --> 0:57:07.709 +We're really having a representation of the +context. + +0:57:07.787 --> 0:57:11.519 +And we have a very good architecture for that +already. + +0:57:11.691 --> 0:57:20.551 +It's like our base language model where you +have to do the hidden state. + +0:57:20.551 --> 0:57:29.290 +The hidden state represents what is apparently +said, but it's focusing. + +0:57:29.509 --> 0:57:43.814 +The first one doing that is in something like +the Elmo paper where they instead of like this + +0:57:43.814 --> 0:57:48.121 +is a normal language model. + +0:57:48.008 --> 0:57:52.735 +Put in the third predicting the fourth and +so on, so you're always predicting the next + +0:57:52.735 --> 0:57:53.007 +one. + +0:57:53.193 --> 0:57:57.919 +The architecture of the heaven works embedding +layer, and then two are an layer here. + +0:57:57.919 --> 0:58:04.255 +For example: And now instead of using this +one in the end you're using here this one. + +0:58:04.364 --> 0:58:11.245 +This represents the meaning of this word mainly +in the context of what we have seen before. + +0:58:11.871 --> 0:58:22.909 +We can train it in a language model or predicting +the next word, but we have more information, + +0:58:22.909 --> 0:58:26.162 +train there, and therefore. + +0:58:27.167 --> 0:58:31.168 +And there is one even done currently in. + +0:58:31.168 --> 0:58:40.536 +The only difference is that we have more layers, +bigger size, and we're using transform on here + +0:58:40.536 --> 0:58:44.634 +or self-attention instead of the R&F. + +0:58:44.634 --> 0:58:45.122 +But. + +0:58:46.746 --> 0:58:52.737 +However, if you look at this contextual representation, +they might not be perfect. + +0:58:52.737 --> 0:58:58.584 +So what do you think of this one as contextual +representation of the third word? + +0:58:58.584 --> 0:59:02.914 +Do you see anything which is not really considered +in this? + +0:59:07.587 --> 0:59:11.492 +Only one way yes, so that is not a big issue +here. 
+ +0:59:11.492 --> 0:59:18.154 +It's representing a string in the context +of a sentence, however, only in the context. + +0:59:18.558 --> 0:59:28.394 +However, we have an architecture which can +also take both sides and we have used it in + +0:59:28.394 --> 0:59:30.203 +the ink holder. + +0:59:30.630 --> 0:59:34.269 +So we could do the and easily only us in the +backboard direction. + +0:59:34.874 --> 0:59:46.889 +By just having the other way around, and then +we couldn't combine the forward and into a + +0:59:46.889 --> 0:59:49.184 +joint one where. + +0:59:49.329 --> 0:59:50.861 +So You Have a Word embedding. + +0:59:51.011 --> 1:00:03.910 +Then you have two states, one with a forward, +and then one with a backward. + +1:00:03.910 --> 1:00:10.359 +For example, take the representation. + +1:00:10.490 --> 1:00:21.903 +Now this same here represents mainly this +word because this is where what both focuses + +1:00:21.903 --> 1:00:30.561 +on is what is happening last but is also looking +at the previous. + +1:00:31.731 --> 1:00:41.063 +However, there is a bit different when training +that as a language model you already have. + +1:00:43.203 --> 1:00:44.956 +Maybe there's again this masking. + +1:00:46.546 --> 1:00:47.814 +That is one solution. + +1:00:47.814 --> 1:00:53.407 +First of all, why we can't do it is the information +you leave it, so you cannot just predict the + +1:00:53.407 --> 1:00:54.041 +next word. + +1:00:54.041 --> 1:00:58.135 +If we just predict the next word in this type +of model, that's a very. + +1:00:58.738 --> 1:01:04.590 +You know the next word because it's influencing +this hidden stage and then it's very easy so + +1:01:04.590 --> 1:01:07.736 +predicting something you know is not a good +task. + +1:01:07.736 --> 1:01:09.812 +This is what I mentioned before. + +1:01:09.812 --> 1:01:13.336 +You have to define somehow a task which is +challenging. + +1:01:13.753 --> 1:01:19.007 +Because in this case one would, I mean, the +system would just ignore the states and what + +1:01:19.007 --> 1:01:22.961 +it would learn is that you copy this information +directly in here. + +1:01:23.343 --> 1:01:31.462 +So it would mainly be representing this word +and you would have a perfect model because + +1:01:31.462 --> 1:01:38.290 +you only need to find an encoding where you +can encode all words somehow. + +1:01:38.458 --> 1:01:44.046 +The only thing that will learn is that tenor +and coat all my words in this upper hidden. + +1:01:44.985 --> 1:01:49.584 +And then, of course, it's not really useful. + +1:01:49.584 --> 1:01:53.775 +We need to find a bit of different ways. + +1:01:55.295 --> 1:01:59.440 +There is a masking one. + +1:01:59.440 --> 1:02:06.003 +I'll come to that shortly just a bit. + +1:02:06.003 --> 1:02:14.466 +The other thing is not to directly combine +them. + +1:02:14.594 --> 1:02:22.276 +So you never merge the states only at the +end. + +1:02:22.276 --> 1:02:33.717 +The representation of the words is now from +the forward and the next. + +1:02:33.873 --> 1:02:35.964 +So it's always a hidden state before that. + +1:02:36.696 --> 1:02:41.273 +And these two you're joined now to your to +the representation. + +1:02:42.022 --> 1:02:50.730 +And then you have now a representation also +about the whole sentence for the word, but + +1:02:50.730 --> 1:02:53.933 +there's no information leakage. + +1:02:53.933 --> 1:02:59.839 +One way of doing this is instead of doing +a bidirectional. + +1:03:00.380 --> 1:03:08.079 +You can do that, of course, in all layers. 
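[Editor's illustration] A sketch of the forward/backward combination without information leakage described above: two separate language-model directions are run, and for each position the forward state from just before the word is concatenated with the backward state from just after it, so neither side has already seen the word it helps to predict. This is ELMo-style in spirit; the GRUs and all sizes are assumptions.

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
emb = nn.Embedding(vocab, d)
fwd = nn.GRU(d, d, batch_first=True)
bwd = nn.GRU(d, d, batch_first=True)

tokens = torch.randint(0, vocab, (2, 7))            # (batch, length)
x = emb(tokens)

h_fwd, _ = fwd(x)                                   # h_fwd[:, t] has seen tokens 0..t
h_bwd_rev, _ = bwd(torch.flip(x, dims=[1]))         # run the second model right-to-left
h_bwd = torch.flip(h_bwd_rev, dims=[1])             # h_bwd[:, t] has seen tokens t..end

# shift each side by one position so the word itself is excluded
pad = torch.zeros(2, 1, d)
left = torch.cat([pad, h_fwd[:, :-1]], dim=1)       # context strictly left of position t
right = torch.cat([h_bwd[:, 1:], pad], dim=1)       # context strictly right of position t
context_repr = torch.cat([left, right], dim=-1)     # (batch, length, 2*d)
print(context_repr.shape)
```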
+
+1:03:08.079 --> 1:03:16.315
+In the end you have separate forward and backward
+hidden states.
+
+1:03:16.596 --> 1:03:20.246
+However, it's a bit complicated.
+
+1:03:20.246 --> 1:03:25.241
+You have to keep things separate and then merge
+them.
+
+1:03:27.968 --> 1:03:33.007
+And that is the moment where the key idea comes in.
+
+1:03:34.894 --> 1:03:42.018
+The idea behind the big success of the BERT model
+was that maybe, in the bidirectional case,
+
+1:03:42.018 --> 1:03:48.319
+it's not good to do next-word prediction,
+but we can do masking.
+
+1:03:48.308 --> 1:03:59.618
+And masking means we do a prediction
+of something in the middle, of some of the words.
+
+1:03:59.618 --> 1:04:08.000
+We take the input and just put noise
+into the input.
+
+1:04:08.048 --> 1:04:14.040
+Now there can be no information leakage, because
+this word wasn't in the input.
+
+1:04:14.040 --> 1:04:15.336
+Now predicting it is a real task.
+
+1:04:16.776 --> 1:04:20.524
+So thereby we don't make any assumption
+about our model.
+
+1:04:20.524 --> 1:04:24.815
+It doesn't need to be a forward model or a
+backward model or anything.
+
+1:04:24.815 --> 1:04:29.469
+You can have any type of architecture, and
+you can always predict the masked word, here "street".
+
+1:04:30.530 --> 1:04:39.112
+There is maybe one disadvantage: do you see
+what could be a bit of a problem with this type
+of training
+
+1:04:39.112 --> 1:04:40.098
+compared to the language model?
+
+1:05:00.000 --> 1:05:05.920
+Yes, so, I mean, you can of course mask more,
+but to see it more globally, just assume
+
+1:05:05.920 --> 1:05:07.142
+you only mask one word.
+
+1:05:07.142 --> 1:05:12.676
+For the whole sentence we get one feedback
+signal, like: what is the word "street"? So we
+
+1:05:12.676 --> 1:05:16.280
+have one training signal for the whole sentence.
+
+1:05:17.397 --> 1:05:19.461
+In the language modeling case,
+
+1:05:19.461 --> 1:05:21.240
+we predicted here word three,
+
+1:05:21.240 --> 1:05:22.947
+we predicted here word four,
+
+1:05:22.947 --> 1:05:24.655
+we predicted here word five.
+
+1:05:25.005 --> 1:05:26.973
+So we have a number of tokens,
+
+1:05:26.973 --> 1:05:30.974
+and for each token we have a feedback signal
+saying what the next word is.
+
+1:05:31.211 --> 1:05:39.369
+So in this case, of course, this is a lot less
+efficient, because we are getting fewer feedback
+
+1:05:39.369 --> 1:05:45.754
+signals on what we should predict, compared
+to models where we're predicting at every position.
+
+1:05:48.348 --> 1:05:54.847
+So in BERT the main idea for this bidirectional
+model was masking.
+
+1:05:54.847 --> 1:05:59.721
+It was one of the first large models using the
+Transformer.
+
+1:06:00.320 --> 1:06:06.326
+There are two more minor changes.
+
+1:06:06.326 --> 1:06:16.573
+We'll see that next-sentence prediction is
+added as another task.
+
+1:06:16.957 --> 1:06:25.395
+Again, you want the model to learn more about
+what language is, to really understand it:
+
+1:06:25.395 --> 1:06:35.089
+are these two sentences following each other,
+like in a story, or are they independent of each other?
+
+1:06:38.158 --> 1:06:43.026
+The input is using subword units, as we are
+using them in machine translation.
+
+1:06:43.026 --> 1:06:48.992
+It has some special tokens: at the beginning
+the CLS token, which is trained for the next-sentence
+
+1:06:48.992 --> 1:06:50.158
+prediction.
+
+1:06:50.470 --> 1:06:57.296
+That is less relevant for machine translation;
+
+1:06:57.296 --> 1:07:07.242
+it's more for classification tasks, because
+there you get one representation for the whole input.
+
+1:07:07.607 --> 1:07:24.323
+You can have two sentences in the input, and then
+you have positional encodings as we know them in general.
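
To make the input format concrete, here is a small sketch of how such a sentence pair can be packed for a BERT-style model (the helper name and the exact token strings are illustrative assumptions; the subword tokens are assumed to come from the usual pre-processing):

```python
def pack_sentence_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Illustrative packing of two subword-tokenized sentences for a BERT-style model.

    The CLS token at the beginning is the position whose final hidden state is
    used for classification tasks such as next-sentence prediction; SEP marks
    the sentence boundaries.  Segment ids say which sentence a token belongs
    to, and position ids are the indices used for the positional encoding.
    """
    tokens = [cls] + list(tokens_a) + [sep] + list(tokens_b) + [sep]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

# Usage:
# pack_sentence_pair(["the", "man", "went", "to", "the", "store"],
#                    ["he", "bought", "milk"])
```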
+
+1:07:24.684 --> 1:07:28.812
+Now, what is a bit more involved is the masking itself.
+
+1:07:28.812 --> 1:07:30.927
+So what do you mask?
+
+1:07:30.927 --> 1:07:35.055
+There is already the question of how much
+you should mask.
+
+1:07:35.275 --> 1:07:44.453
+There has afterwards been some work, like for
+example RoBERTa, which tries to improve this.
+
+1:07:44.453 --> 1:07:52.306
+It's not super sensitive, but of course if
+you do it completely wrong then your performance drops.
+
+1:07:52.572 --> 1:07:54.590
+That's then another question there.
+
+1:07:56.756 --> 1:08:03.285
+Also: should you always mask the full word?
+
+1:08:03.285 --> 1:08:14.562
+If you have subwords, is it fine to mask only
+a single subword and predict it based on the rest?
+
+1:08:14.894 --> 1:08:20.755
+If you know, say, three parts of a word, it
+might be easier to guess the last one. Here they
+
+1:08:20.755 --> 1:08:27.142
+took the easiest solution: not considering
+words anymore at all, because the splitting is done
+
+1:08:27.142 --> 1:08:32.278
+in the pre-processing, and they just always
+mask subwords.
+
+1:08:32.672 --> 1:08:36.286
+Later variants do it a bit differently:
+
+1:08:36.286 --> 1:08:40.404
+they always mask the full word, but I guess
+it's not a major difference.
+
+1:08:41.001 --> 1:08:46.969
+And then, what to do with the masked word:
+in eighty percent of the cases the word is
+
+1:08:46.969 --> 1:08:47.391
+masked;
+
+1:08:47.391 --> 1:08:50.481
+they replace it with a special token,
+
+1:08:50.481 --> 1:08:52.166
+the mask token.
+
+1:08:52.166 --> 1:08:58.486
+In ten percent they put some random other
+token in there, and in ten percent they keep
+
+1:08:58.486 --> 1:08:59.469
+it unchanged.
+
+1:09:02.202 --> 1:09:11.519
+And then what you can also do is this
+next-sentence prediction.
+
+1:09:11.519 --> 1:09:17.786
+So you might have "the man went to [MASK] store".
+
+1:09:18.418 --> 1:09:24.090
+So maybe you see that you're doing both jointly:
+the masking and the next-sentence prediction.
+
+1:09:24.564 --> 1:09:34.402
+And if the second sentence is "penguin [MASK]
+are flightless birds", then these two sentences have
+
+1:09:34.402 --> 1:09:42.995
+nothing to do with each other, and so in this
+case the label is "not the next sentence".
+
+1:09:47.127 --> 1:09:56.184
+And that is the whole BERT model: here is
+the input, here are the Transformer layers,
+
+1:09:56.184 --> 1:09:58.162
+and you can train it on these two objectives.
+
+1:09:58.598 --> 1:10:08.580
+And this model was quite successful in general
+applications.
+
+1:10:08.580 --> 1:10:17.581
+It was not yet as powerful as what people
+are nowadays using.
+
+1:10:17.937 --> 1:10:27.644
+However, there is a huge family of different
+types of models coming from that.
+
+1:10:27.827 --> 1:10:39.109
+So based on BERT and other self-supervised
+models, a whole ecosystem came out of there,
+
+1:10:39.109 --> 1:10:42.091
+and there are different variants.
+
+1:10:42.082 --> 1:10:46.637
+With the availability of large language models,
+the success became even bigger.
+
+1:10:47.007 --> 1:10:48.436
+We have now even larger ones.
+
+1:10:48.828 --> 1:10:50.961
+Interestingly, it changed a bit again,
+
+1:10:50.910 --> 1:10:59.321
+from these bidirectional models back to
+unidirectional models; at the moment it's maybe leaning
+
+1:10:59.321 --> 1:11:03.843
+a bit more that way, and we're coming to them now.
+
+1:11:03.843 --> 1:11:09.179
+Now, do you see one advantage of the unidirectional
+models? One we already had is the efficiency.
+
+1:11:09.509 --> 1:11:16.670
+There's one other reason why you are sometimes
+more interested in unidirectional models
+
+1:11:16.670 --> 1:11:17.158
+than bidirectional ones.
+
+1:11:22.882 --> 1:11:30.882
+I mean, it depends on the task, but for example
+for a language generation task it doesn't fit.
+
+1:11:32.192 --> 1:11:34.574
+Yes, it's not only less interesting, it just doesn't work.
+
+1:11:34.574 --> 1:11:39.283
+So if you want to do generation, like in the
+decoder, so you want to generate a sentence,
+
+1:11:39.283 --> 1:11:42.856
+you don't know the future, so you cannot apply
+this type of model.
+
+1:11:43.223 --> 1:11:49.498
+This type of model can be used for the encoder
+in an encoder-decoder model, but it cannot be used for
+
+1:11:49.498 --> 1:11:55.497
+the decoder, because it is trained in a way
+that only works if it has information from both sides,
+
+1:11:55.497 --> 1:11:56.945
+and when you're generating you don't have that.
+
+1:12:00.000 --> 1:12:05.559
+Yeah, and that's a good transition to the next
+overall view of these models.
+
+1:12:05.559 --> 1:12:08.839
+So if you view it from that perspective:
+
+1:12:09.009 --> 1:12:13.137
+we have the encoder-based models.
+
+1:12:13.137 --> 1:12:16.372
+That's what we just looked at.
+
+1:12:16.372 --> 1:12:20.612
+They are bidirectional and typically trained
+with masking.
+
+1:12:20.981 --> 1:12:22.347
+That is the one we just looked at.
+
+1:12:22.742 --> 1:12:35.217
+What we looked at at the beginning are the
+decoder-based models, so the autoregressive models, which are
+
+1:12:35.217 --> 1:12:42.619
+unidirectional models, and there we can do
+next-word prediction.
+
+1:12:43.403 --> 1:12:52.421
+And there you can also have a special thing
+called prefix
+
+1:12:52.421 --> 1:12:53.434
+language models.
+
+1:12:54.354 --> 1:13:04.079
+Because, as we said, it might be helpful that
+for some of your input you can use bidirectional context.
+
+1:13:04.079 --> 1:13:17.334
+That is what is called a prefix, where you say
+that on the first tokens you have bidirectional
+
+1:13:17.334 --> 1:13:19.094
+connections.
+
+1:13:19.219 --> 1:13:28.768
+You somehow merge both; that mainly works in
+Transformer-based models, because there the directionality
+is only a matter of masking.
+
+1:13:29.629 --> 1:13:34.894
+There is no difference in the number of parameters.
+
+1:13:34.975 --> 1:13:38.533
+In the Transformer, the only difference is how
+you mask your attention.
+
+1:13:38.878 --> 1:13:47.691
+We have seen that between the encoder and the decoder
+the number of parameters is different, because
+
+1:13:47.691 --> 1:13:50.261
+in the decoder you do the cross-attention.
+
+1:13:50.650 --> 1:13:58.389
+Otherwise it's only that you mask your attention
+to either look only at the past or also look into
+
+1:13:58.389 --> 1:13:59.469
+the future.
+
+1:14:00.680 --> 1:14:03.323
+And now you can, of course, also do a mix.
+
+1:14:03.563 --> 1:14:08.307
+So this is a bidirectional attention matrix,
+where you can attend to everything.
+
+1:14:08.588 --> 1:14:23.477
+This is the unidirectional, or causal, one, where
+you can only look at the past, and you can do the mix
+
+1:14:23.477 --> 1:14:25.652
+with the prefix.
+
+1:14:29.149 --> 1:14:42.829
+So, if that is all clear: based on that, then,
+of course, you can also do the third thing.
+
+1:14:43.163 --> 1:14:54.497
+So the idea is: we have our encoder-decoder
+architecture; can we also train it completely
+
+1:14:54.497 --> 1:14:57.700
+in a self-supervised way?
+
+1:14:58.238 --> 1:15:06.206
+In this case we have the same sentence on both
+sides, so we would also have the sentence
+
+1:15:06.206 --> 1:15:08.470
+as input on the decoder side.
+
+1:15:08.470 --> 1:15:12.182
+Then we need to do some type of masking.
+
+1:15:12.912 --> 1:15:16.245
+Here we don't need to do the masking, but here
+we need to do it,
+
+1:15:16.245 --> 1:15:17.911
+so that the model never just sees the word it has to predict.
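
A minimal sketch of how such a self-supervised training pair for an encoder-decoder model could be built (the corruption recipe, the probabilities, and the names here are only assumptions for illustration; the specific noise functions used by real systems are discussed next): the encoder receives a corrupted version of the sentence, and the decoder has to regenerate the original, so the number of input and output tokens may differ.

```python
import random

def make_denoising_pair(tokens, mask_token="[MASK]", p=0.3, rng=random):
    """Illustrative self-supervised example for an encoder-decoder model.

    Some tokens are replaced by a mask token and some are deleted entirely;
    the decoder target is always the original, uncorrupted sentence.
    """
    corrupted = []
    for tok in tokens:
        r = rng.random()
        if r < p / 2:
            corrupted.append(mask_token)   # replace the token by a mask
        elif r < p:
            continue                       # delete the token completely
        else:
            corrupted.append(tok)
    return (corrupted or [mask_token]), list(tokens)

# Usage: encoder_input, decoder_target = make_denoising_pair("the cat sat on the mat".split())
```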
+
+1:15:20.440 --> 1:15:30.269
+And this type of model got quite successful as well,
+especially for pre-training for machine translation.
+
+1:15:30.330 --> 1:15:45.934
+One example is the BART model,
+which is one successful way to pre-train your
+
+1:15:45.934 --> 1:15:47.162
+model.
+
+1:15:47.427 --> 1:15:52.858
+In translation you would put in a source sentence;
+we can't do that here.
+
+1:15:52.858 --> 1:15:55.430
+We only have one language.
+
+1:15:55.715 --> 1:16:00.932
+But we can just put the same sentence in twice,
+and to make it a non-trivial task,
+
+1:16:00.932 --> 1:16:08.517
+we can corrupt the input. They use quite a few
+different corruption techniques.
+
+1:16:08.517 --> 1:16:12.751
+You can do token masking, and you can also do
+token deletion.
+
+1:16:13.233 --> 1:16:20.785
+That you couldn't do in an encoder-only system,
+because then the position wouldn't be there and you cannot
+
+1:16:20.785 --> 1:16:22.345
+predict at a position that doesn't exist.
+
+1:16:22.345 --> 1:16:26.368
+There, the number of input and output tokens is always the same.
+
+1:16:26.906 --> 1:16:29.820
+You cannot do a prediction for something which
+isn't in the input.
+
+1:16:30.110 --> 1:16:39.714
+Here, on the decoder side, it's unidirectional
+generation, so we can also delete tokens and then generate the
+
+1:16:39.714 --> 1:16:40.369
+full sentence.
+
+1:16:41.061 --> 1:16:48.628
+We can do sentence permutation, where you
+change the order of the sentences.
+
+1:16:48.628 --> 1:16:54.274
+We can do document rotation and text infilling.
+
+1:16:55.615 --> 1:17:05.870
+So you see, there are quite a lot of types of
+models that you can use in order to pre-train
+
+1:17:05.870 --> 1:17:06.561
+your model.
+
+1:17:07.507 --> 1:17:12.512
+So these are the models you can use.
+
+1:17:12.512 --> 1:17:21.072
+Of course, the other question is: how do you
+integrate them into your MT system?
+
+1:17:21.761 --> 1:17:26.638
+And there are also quite a few different techniques
+for that.
+
+1:17:27.007 --> 1:17:28.684
+It's a bit similar to before.
+
+1:17:28.928 --> 1:17:39.307
+So the easiest thing is: you take your word
+embeddings or your pre-trained model,
+
+1:17:39.307 --> 1:17:47.979
+in the contextual case several layers, and you
+freeze them in your system.
+
+1:17:48.748 --> 1:17:53.978
+That can also be done if you have a BART model.
+
+1:17:53.978 --> 1:18:03.344
+You freeze your word embeddings, for example,
+and only train the top layers.
+
+1:18:05.865 --> 1:18:14.965
+The other option is initialization: you initialize
+your model with the pre-trained one, but then you train
+
+1:18:14.965 --> 1:18:19.102
+everything, so you're not only training the new parts.
+
+1:18:22.562 --> 1:18:32.600
+Then there is one issue if you think about
+BART: there, you have
+
+1:18:32.600 --> 1:18:35.752
+the same language on the input and the output side.
+
+1:18:36.516 --> 1:18:46.013
+Typically, I mean, the one you get is for English,
+so you can then try to do some fine-tuning
+
+1:18:46.366 --> 1:18:55.165
+on top of the BART, in order to learn some language-specific
+stuff; or there's a multilingual BART,
+
+1:18:55.165 --> 1:19:03.415
+which is trained on many languages, but still
+with more or less the same language on both sides.
+
+1:19:03.923 --> 1:19:09.745
+So then you would still need to fine-tune,
+and the model needs to learn how to
+
+1:19:09.745 --> 1:19:12.074
+do the attention cross-lingually.
+
+1:19:12.074 --> 1:19:18.102
+It was trained only within the same language, but
+then it mainly has to learn this mapping and not all
+
+1:19:18.102 --> 1:19:18.787
+the rest.
+
+1:19:21.982 --> 1:19:27.492
+A third thing which is very commonly used
+is what is referred to as adapters.
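
A minimal sketch of such an adapter layer, as it is explained next (illustrative only; the names, sizes, and placement inside the network are assumptions, not a specific library's implementation): a small bottleneck network with a residual connection that is inserted after an existing, usually frozen, Transformer sub-layer, so that during adaptation only these few parameters need to be trained.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative adapter block: down-project, non-linearity, up-project, residual."""

    def __init__(self, hidden_dim: int = 512, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # With small weights this stays close to the identity, so the frozen
        # pre-trained model's behaviour is only gently modified.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# During fine-tuning one would freeze the pre-trained model and train only the adapters:
# for p in pretrained_model.parameters():
#     p.requires_grad = False
# for p in adapter.parameters():
#     p.requires_grad = True
```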
+
+1:19:27.607 --> 1:19:29.749
+So, for example, you take an mBART.
+
+1:19:29.709 --> 1:19:35.502
+And you put some adapters inside the network,
+so small new layers which
+
+1:19:35.502 --> 1:19:41.676
+are put in between, and then you only train these
+adapters, or at least also train these adapters.
+
+1:19:41.676 --> 1:19:47.724
+So, for example, with mBART you could see that
+this learns to map the source-language representation
+
+1:19:47.724 --> 1:19:50.333
+to the target-language representation.
+
+1:19:50.470 --> 1:19:52.395
+And then you don't have to change the rest that much.
+
+1:19:52.792 --> 1:20:04.197
+The idea is that you give it some extra capacity
+to really perform well on that task, and then it's
+
+1:20:04.197 --> 1:20:05.234
+easier to adapt.
+
+1:20:05.905 --> 1:20:15.117
+This is also very commonly used, for example, in
+multilingual systems, where the idea is that you
+
+1:20:15.117 --> 1:20:16.282
+have some language-specific adapters.
+
+1:20:16.916 --> 1:20:23.505
+So they are trained only for one language
+pair: on the one hand, the model
+
+1:20:23.505 --> 1:20:27.973
+has the ability to work multilingually and to
+share knowledge,
+
+1:20:27.973 --> 1:20:33.729
+but then there is some knowledge which is
+very language-specific, and that goes into the adapters.
+
+1:20:34.914 --> 1:20:39.291
+But there's one catch with multilingual systems
+in general.
+
+1:20:39.291 --> 1:20:40.798
+It works quite well,
+
+1:20:40.798 --> 1:20:47.542
+but there's one specific use case for multilingual
+models where this normally doesn't really work well.
+
+1:20:47.542 --> 1:20:49.981
+Do you have an idea of what that is?
+
+1:20:55.996 --> 1:20:57.534
+It's the zero-shot case.
+
+1:20:57.998 --> 1:21:06.051
+Exactly, because then you're again adding
+something which is very language-specific;
+
+1:21:06.051 --> 1:21:15.046
+in zero-shot, the idea is always to learn representations
+which are more language-independent, and with
+
+1:21:15.046 --> 1:21:17.102
+the adapters you of course go against that.
+
+1:21:20.260 --> 1:21:37.655
+And there's also the idea of doing more of
+a knowledge distillation setup.
+
+1:21:39.179 --> 1:21:41.177
+And the idea there is:
+
+1:21:41.177 --> 1:21:48.095
+we are training the system the same way, but what
+we additionally want to achieve is that the hidden states of the
+
+1:21:48.095 --> 1:21:54.090
+encoder are as similar as possible to the ones
+of the pre-trained model.
+
+1:21:54.414 --> 1:22:07.569
+So you should learn faster by telling the
+model to make these states as similar as possible.
+
+1:22:07.569 --> 1:22:11.813
+You compare the first hidden state to the first
+hidden state, and so on,
+
+1:22:12.192 --> 1:22:18.549
+for example by using the L2 norm, so by just
+pushing these two representations to be the same.
+
+1:22:20.020 --> 1:22:22.880
+Now, this requires the same vocabulary.
+
+1:22:22.880 --> 1:22:25.468
+Why does it need the same vocabulary?
+
+1:22:25.468 --> 1:22:26.354
+Can someone give me the reason?
+
+1:22:34.754 --> 1:22:39.132
+You have a different vocabulary,
+
+1:22:39.132 --> 1:22:50.711
+and you also have different sequence lengths,
+because if you use different segmentations the number of tokens differs.
+
+1:22:51.231 --> 1:22:55.680
+Exactly, then what happens is that we have
+different numbers of states here.
+
+1:22:55.680 --> 1:23:01.097
+It's no longer straightforward which states
+to compare.
+
+1:23:02.322 --> 1:23:05.892
+And then it's just easier to have the same number.
+
+1:23:05.892 --> 1:23:08.952
+You can always compare the first to the first
+and the second to the second.
+
+1:23:09.709 --> 1:23:16.836
+So therefore, at least this very easy way of
+knowledge distillation only works if you have the same vocabulary.
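
A sketch of this auxiliary loss (assuming both encoders produce states of shape (batch, seq_len, hidden_dim) over the same subword segmentation, which is exactly the same-vocabulary requirement discussed above; the weighting is a hyper-parameter, not a value from the lecture):

```python
import torch
import torch.nn.functional as F

def hidden_state_matching_loss(student_states: torch.Tensor,
                               teacher_states: torch.Tensor) -> torch.Tensor:
    """Position-wise L2 (mean squared error) between the MT encoder's hidden
    states and those of the frozen pre-trained encoder."""
    return F.mse_loss(student_states, teacher_states)

# Combined objective (sketch): translation cross-entropy plus the matching term.
# loss = ce_loss + 0.5 * hidden_state_matching_loss(enc_states, teacher_states.detach())
```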
+
+1:23:17.177 --> 1:23:30.871
+Of course, you could do things like requiring
+the averages to be the same, but that's a less
+
+1:23:30.871 --> 1:23:33.080
+strong signal.
+
+1:23:34.314 --> 1:23:47.087
+But the advantage here is that you have a
+direct training signal on the encoder,
+
+1:23:47.087 --> 1:23:52.457
+so you can directly guide it.
+
+1:23:56.936 --> 1:24:11.208
+Yes, I think this is mostly it for today,
+so what you should keep in mind today are two
+
+1:24:11.208 --> 1:24:18.147
+techniques: the one is the back-translation idea.
+
+1:24:18.147 --> 1:24:26.598
+If you have monolingual data, you back-translate
+it and use it as additional training data.
+
+1:24:26.886 --> 1:24:33.608
+And yeah, it is often helpful to combine them,
+so you can even use both of them.
+
+1:24:33.853 --> 1:24:39.669
+You can use pre-trained models, but then you
+can still do back-translation
+
+1:24:39.669 --> 1:24:40.066
+on top.
+
+1:24:40.160 --> 1:24:47.058
+Back-translation has the advantage that everything
+is trained to work together on the task,
+
+1:24:47.058 --> 1:24:54.422
+so it might be helpful to back-translate some
+data and then use it for the real translation task,
+
+1:24:54.422 --> 1:24:57.755
+because in pre-training the big challenge is
+exactly this integration.
+
+1:24:58.058 --> 1:25:07.392
+You saw there are different ways of integrating
+this knowledge, but even if you use a full
+
+1:25:07.392 --> 1:25:08.087
+pre-trained model:
+
+1:25:08.748 --> 1:25:11.713
+this is the most similar you can get;
+
+1:25:11.713 --> 1:25:15.224
+you're doing no changes to the architecture,
+
+1:25:15.224 --> 1:25:20.608
+you're really taking the model and just fine-tuning
+it on the new task.
+
+1:25:20.608 --> 1:25:24.041
+But it still has to learn the translation task
+completely newly.
+
+1:25:24.464 --> 1:25:29.978
+So it might, for example, be helpful to also have
+some back-translated data to learn that.
+
+1:25:32.192 --> 1:25:45.096
+Good. One important thing: next Tuesday
+there is a conference or a workshop in this
+
+1:25:45.096 --> 1:25:45.947
+room.
+
+1:25:47.127 --> 1:25:54.405
+You should get an email via the mailing list
+that there is a room change for Tuesday, only
+
+1:25:54.405 --> 1:25:57.398
+for Tuesday, and afterwards it's back to normal.
+
+1:25:57.637 --> 1:26:03.714
+One more question, from a more general
+perspective: in computer vision
+
+1:26:03.714 --> 1:26:07.246
+you can enlarge your data set with data augmentation.
+
+1:26:07.246 --> 1:26:08.293
+Is there something
+
+1:26:08.388 --> 1:26:15.306
+similar to enlarge speech or text data,
+so data augmentation?
+
+1:26:15.755 --> 1:26:27.013
+You can use this back-translation and also
+the masking, but I would say
+
+1:26:27.013 --> 1:26:31.201
+back-translation is the most similar thing.
+
+1:26:31.371 --> 1:26:35.632
+It has also been used, for example, not only
+for monolingual data.
+
+1:26:36.216 --> 1:26:40.958
+If you have a good MT system, it can also be
+used on parallel data, augmenting
+
+1:26:40.958 --> 1:26:46.061
+your data with more data, because then you have
+the human translation and the automatic translation,
+
+1:26:46.061 --> 1:26:46.783
+and both are good.
+
+1:26:46.783 --> 1:26:51.680
+You're just having more data and a better feedback
+signal in different ways, because there's not
+
+1:26:51.680 --> 1:26:53.845
+only one correct translation but several.
+
+1:26:54.834 --> 1:26:58.327
+So I would say this is the most similar one
+
+1:26:58.327 --> 1:27:00.947
+to just rotating images and so on.
+
+1:27:00.947 --> 1:27:03.130
+There are other things you could do.
+
+1:27:05.025 --> 1:27:07.646
+But, for example, replacing words with synonyms
+is rarely used.
+
+1:27:07.646 --> 1:27:13.907
+It's very hard to do this by rules, like deciding
+which words to replace, because there's not
+
+1:27:13.907 --> 1:27:14.490
+a clear rule.
+
+1:27:14.490 --> 1:27:18.931
+You cannot just say that this word can always
+be replaced by that one.
+
+1:27:19.139 --> 1:27:28.824
+I mean, even if they are near-perfect synonyms,
+they fit in some cases, but not in all
+
+1:27:28.824 --> 1:27:29.585
+cases.
+
+1:27:29.585 --> 1:27:36.985
+And if you don't do it rule-based, you have
+to train yet another model.
+
+1:27:38.058 --> 1:27:57.050
+For comparing the hidden states, do we need the
+same architecture as the pre-trained model?
+
+1:27:57.457 --> 1:27:59.817
+They should be of the same dimension, so it's easiest to have the same
+
+1:28:00.000 --> 1:28:03.780
+architecture. We will later see, in the lecture on efficiency,
+
+1:28:03.780 --> 1:28:08.949
+that you can also do knowledge distillation with,
+for example, smaller models.
+
+1:28:08.949 --> 1:28:15.816
+So the teacher can have twelve layers and the student
+only five, and then you try to learn the same within five
+
+1:28:15.816 --> 1:28:16.433
+layers,
+
+1:28:17.477 --> 1:28:22.945
+or eight layers; so that is possible, but I agree,
+it should be of the same hidden size.
+
+1:28:23.623 --> 1:28:35.963
+The question then, of course, is whether you do it
+as an initialization or during
+
+1:28:35.963 --> 1:28:37.305
+training,
+
+1:28:37.305 --> 1:28:41.195
+alongside your main training.
+
+1:28:45.865 --> 1:28:53.964
+Good, then thanks a lot, and we'll see
+each other again on Tuesday.