What today is about is how to use additional resources to improve translation. In the first part of the semester, roughly two thirds of it, we looked at how to build a basic machine translation system. By now you know the basic components, both for statistical and for neural machine translation with the encoder-decoder architecture.

Of course, that is not where it stops. It is still what nearly every machine translation system has at its core, but there are a lot of additional challenges that need to be addressed and solved. So now we want to look at what else you can do around this basic system.

One important question is: what do you train your models on? Parallel data of the kind we need is harder to obtain than the data used in many other tasks, so an important question is whether we can also learn from other sources. Because, if you remember from right at the beginning of the lecture, this is how we train all our machine learning models, from statistical to neural, and that has not changed: we need parallel data, where a source sentence is aligned with a target sentence. We now have a strong model, a very good model, to learn from that, but we always rely on this kind of data.

For high-resource language pairs, say German to English or other European languages, there is a decent amount of parallel data, at least for similar domains. But even there, if you go to very specific domains it can get difficult, and your system's performance might drop. If you want to translate medical text, for example, you also need parallel data in the medical domain to know how to translate these types of phrases, how to use the vocabulary, the style and so on. And if you go to other languages, the challenge is a lot bigger, and the question is the same: is this really the only resource we can use? Can the training be adapted so that we also make use of other types of data and models, which might enable us to build strong systems with other types of information?

That is what we will look into now, starting with this lecture. One idea we already covered on Tuesday: a very successful approach is multilingual machine translation, so that we are no longer doing translation between two languages only, but between many languages, and can share common knowledge between them.
You also learned about things like zero-shot machine translation, so you can translate between language pairs for which you have no parallel data at all, which is the case for many, many language pairs. Even with German, you do not have parallel data to all languages around the world; for most of them you have it to the European languages, maybe also to Japanese. There is quite a lot of data for English to Japanese, for example, but for German to Japanese or German to Vietnamese there is only a little, and multilingual systems help there. So this is a very promising direction if you want to build translation systems between language pairs that typically do not involve English.

The other idea, of course, is that we do not only have to search for existing corpora. There has been some work on data crawling: if I do not have a corpus directly, or I do not have a high-quality corpus like the European Parliament or the TED corpus, maybe it makes sense to crawl more data and get additional sources so you can build stronger systems. There has been quite a big effort in Europe to collect really large parallel data sets this way. The interesting thing from the machine translation point of view is not general data crawling, but how we can explicitly crawl data that is parallel. There is quite a lot of data on the Internet, such as company websites that have been translated, so how can you extract these parallel fragments? This data is typically noisier than data prepared more by hand: for the Parliament data you can write rules for how to extract parallel sentences, while here there is more to it, so the quality may not be as good, but the scale is a way to address that, because you simply have so much more data.

The other thing we can use is monolingual data. Monolingual data has the big advantage that huge amounts of it are available, it can be crawled automatically from the Internet, and you can typically also get it for many domains. There are just orders of magnitude more monolingual data, so it can be very helpful. In statistical machine translation it was easy to integrate, using language models. In neural machine translation we have the advantage of one overall architecture that trains everything together, but here that is also a disadvantage, because there is no obvious separate place for monolingual data. We will look today at two things.
On the one hand, you can still try to do a bit of language modelling and add an additional language model into the system; there is some work on that. The other, very successful way, which I think is used in most systems at the moment, is to create synthetic data. It is a very simple idea: you just translate monolingual data and use it as training data, and normally that helps. Thereby you are able to use some type of monolingual data.

Another direction is unsupervised machine translation, the extreme case: you have only monolingual data and no parallel data at all. Can you still build a translation system? If you have large amounts of data and the languages are not too dissimilar, you can build translation systems without parallel data. That we will see next Thursday.

And then there is a third type, pre-trained models, which recently became very successful, and even more so now with large language models. The idea is that we are no longer only sharing the raw data; a trained model itself can also help to train another model. That is a big advantage of deep-learning-based approaches: you can train a model on one task and then apply it, or parts of it, to another task. The question is then: can we find an initial task where there are huge amounts of data? The task you typically pre-train on is similar to a language modelling task, either directly next-word prediction or a related masking task. The idea is: I can train on this data, and the knowledge about words and how they relate to each other I can then reuse. So it is a different way of using language models; in the end it is more transfer learning.

So first we will start with how we can use monolingual data in machine translation. The big difference, as you should remember from what I mentioned before, is that in statistical machine translation we directly have the opportunity: there is parallel data for the translation model and monolingual data for the language model, and you combine the translation model and the language model, so you can make use of both. That way you can use these large amounts of monolingual data, but of course it also has a disadvantage: the two parts are optimized somewhat independently of each other, whereas we said the big advantage of neural machine translation is that we optimize the overall architecture, everything together, to perform best.
But then, of course, we cannot do that here: with a purely end-to-end neural system we can only use parallel data. So the question is whether this advantage of training everything jointly is really so important when we only have small amounts of parallel data but large amounts of monolingual data. And with data we know it is not only the amount that matters but also how similar it is to your test data; the monolingual data may be quite small but fit very well, and then it is still very helpful.

So it is not really surprising that, since we were successful in integrating a language model into a statistical translation system, maybe we can also integrate some type of language model into our NMT system in order to make it perform better.

The first thing we can do is: we know there are language models, so let us try to integrate them. Here it is an RNN language model, because this work was mainly done before transformer-based models. In general, of course, you can do the same thing with transformer-based models; there is nothing that prevents it. It is just that it was mainly done while people were still using RNNs.

So what we have here is the RNN-based NMT system; you remember the attention mechanism: you take the last decoder state, calculate the attention, get the context vector back, combine both, compute the next hidden state, and then predict the next word. This is our system, and the question is: can we integrate a language model into it?

It makes sense to take a neural language model, because we are in the neural space anyway; in contrast to statistical MT, where n-gram models were used, here a neural language model fits naturally. That would be something like an RNN-based neural language model: you have a target word, you put it in, you get a new hidden state, then you keep putting in words and getting new hidden states, and at the output you predict the next word.

If we have this type of language model, there are two main questions to answer. First, how do we combine our NMT system on the one hand and our language model on the other? As was mentioned before, when we started talking about encoder-decoder models, the decoder can itself be viewed as a language model. The one is an unconditioned language model; it is just modelling the target side.
The other one is a conditional language model, a language model conditioned on the source. So how can you combine two language models? Of course, the translation model will be more important, because it has access to the source.

If we have that, the other question is: now that we have the two models, how do we train them? We have two sets of data: parallel data, on which we can train the translation model, and monolingual data for the language model.

The first idea is a parallel combination: we just keep both running. Here you see your NMT system, which runs completely independently of your language model sitting next to it. The only thing that is shared is the target words, which of course are put into both systems. So both see the same input, and then we make our decision by merging the two outputs. For example, each model produces a probability distribution over the next word, and we take the average of the two probability distributions to do our prediction.

The question was whether you could also weight the outputs, to have more control over the mixture. Yes, you can do that too; that is more of a gating mechanism, where you learn how to weight the two. Another option would be to concatenate the hidden states and then have another layer on top.

Think about what happens if you do the concatenation of the hidden states instead of merging the probability distributions: you introduce many new parameters, and these parameters are somewhat special compared to the rest. Before, all the other parameters could be trained independently: the language model can be trained on its own and the translation model on its own. If you have a joint layer, you need to train them jointly, because it has inputs from both models.
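To make the output-level combination concrete, here is a minimal sketch, assuming both models expose a probability distribution over the same target vocabulary at each decoding step. The interpolation weight is my own illustrative knob, not a value from the lecture, and the log-linear variant mentioned in the comment is a commonly used alternative formulation, not something the lecture prescribes.

```python
# Minimal sketch of the parallel (output-level) combination: average the
# per-step distributions of the NMT model and the language model.
import numpy as np

def combine_step(p_nmt, p_lm, nmt_weight=0.5):
    """p_nmt, p_lm: probability distributions over the same target vocabulary.
    Returns their weighted average (simple linear interpolation).
    A log-linear alternative: exp(log p_nmt + w * log p_lm), renormalized.
    """
    p = nmt_weight * np.asarray(p_nmt) + (1.0 - nmt_weight) * np.asarray(p_lm)
    return p / p.sum()

# toy vocabulary of four symbols
print(combine_step([0.7, 0.1, 0.1, 0.1], [0.3, 0.5, 0.1, 0.1]))
```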
Not surprisingly, if there is a parallel combination, the other option is a serial combination. How can you do that? Since the final decision should be made by the NMT system, it makes sense to put it on top. So on top you have your normal NMT system; the only difference is what you feed into it: you are no longer inputting the word embeddings, but the hidden states of the language model.

So your lower layers are trained more in a pure language-model style, and on top sits the NMT system, which now already receives the language-model information. You can also view it like this: these are contextual embeddings, which no longer depend only on the word but also on the context on the target side. Take the word "can", for example: with normal embeddings it is always mapped to the same vector, whether it is used as in "a can of beans" or as in "I can do it". With the language-model layers underneath, this may already be disambiguated a bit, and you give that information directly to the NMT system.

If you think of the transformer-based approach instead, you have a stack of layers: the lower layers are purely language modelling, while the upper ones also attend to the source. So you can view it as lower layers that do not attend to the source, which is purely a language model, and at some point you start attending to the source and using it.

So this is how you can combine them: in parallel, or serially by first running the language model and then the NMT system. Questions about the integration?

There was a question about what exactly the input to the model is. The actual word is converted into a numerical vector, the one-hot encoding, and that is the input; whether you view the word embedding as part of the language model is a matter of definition. The steps are: you take the word, then the one-hot encoding, then the word embedding, then the RNN. You can see the embedding plus the RNN together as your language model, or draw the boundary before the embedding; which of these parts you call the language model is not that important.

So the next question is how you can train them and make this work. In the case where you combine the output probabilities, you can train the two models independently and just put them together. That might not be the best option, because we no longer have the guarantee we had before that they are optimized to perform well together.
It is not clear that they really work best together; at the very least you need to learn how much to trust the one model and how much the other. Still, it can be useful in some cases, for example if you have only very little data.

However, in MT we have one specific situation: the target side of the parallel data is itself monolingual data, so what we definitely can do is train the language model on that as well. And what we also can do is more like the pre-training approach: we first train the language model on the monolingual data, and then we train the joint system on the parallel data. The model sizes go one way, but the data sizes typically go the other way around: you often have a lot more monolingual data than parallel data. In which scenario can you imagine that this type of pre-training is a problem? Any ideas?

One example where this matters is domain adaptation. Let us say you want to translate medical sentences and your monolingual data is medical. In that case what will most probably happen is that you learn during pre-training what medical text looks like, but in your fine-tuning step the model forgets everything about the medical domain, so you may lose all the information you gained. This type of pre-training step is good if your pre-training data is general and very large and you then adapt to your task. But if the monolingual data is what you want to use to adapt the system to some topic or style, then this is not a good strategy, because the model may forget everything from the pre-training and you do not get the benefit.

So then you have to think about what you can do instead. You can freeze the pre-trained part so it is not changed any more and you do not lose the ability, or you can do a joint training, where you train the NMT system on the parallel data and the language model on the monolingual data in parallel, so that the model does not forget. How much you forget depends on the setup: if you pre-train on large, general data that gives you good general knowledge, you normally do not really lose it; if you use the monolingual data to adapt to something specific, you do.
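As a concrete illustration of the "freeze the pre-trained part" option, here is a minimal PyTorch sketch under my own assumptions: the module names (pretrained_lm, task_head) are placeholders, not components from the lecture.

```python
# Sketch: keep a pre-trained sub-module fixed while fine-tuning the rest.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pretrained_lm = nn.GRU(dim, dim, batch_first=True)  # imagine this was pre-trained
        self.task_head = nn.Linear(dim, vocab)                   # trained on the parallel data

    def forward(self, x):
        h, _ = self.pretrained_lm(self.embed(x))
        return self.task_head(h)

model = TinyModel()

# Freeze the pre-trained part so fine-tuning cannot overwrite it.
for p in model.pretrained_lm.parameters():
    p.requires_grad = False

# Only the remaining parameters are handed to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```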
So this is one way to make use of monolingual data. It seems to be the easiest one in some sense; it is the most similar to what we did in statistical machine translation.

However, there is another approach that normally beats this type of model, and which conceptually is even simpler, and also easier from the computational side. The idea is: we have monolingual data, we just translate it and thereby generate synthetic parallel data, and we use that for training. So if you want to build a German-to-English system, you first take a large amount of monolingual data and translate it. Then you have more parallel data, and you train on the joint set: on the original parallel data and on the artificial data where you have generated the translations yourself. This can work because the system does not make the same error every time, and it already has some knowledge.

With this first approach, however, there is one issue, a reason why it might not work best. Do you see it? It was hinted at in the picture: you are training on the output of your own system, and here that is the problem, because this English side is not really good. And if the system always mistranslates something, then you will learn that this mistranslation is correct, because now it is in the training data, and you will encourage the model to produce it even more often. So the problem with training on your own errors is: random errors you rarely make may average out, but errors you make systematically you reinforce.

So that might not be the best solution. Any idea how to do it better? There is a way, and it is actually even simpler. The problem is that the translations are not perfect, so the outputs are noisy and you learn something wrong. Normally it is less bad if your inputs are noisy but your outputs are correct: from a wrong input you may still learn to generate something correct, but you are not learning to generate incorrect output. So it is often more important that your target side is correct, and you can assume that in your application scenario you will hopefully only get correct inputs, so noisy training inputs harm you less.

In machine translation we have one very nice advantage: the reverse direction is a very similar task. There is a task to translate from German to English, and the task to translate from English to German is very similar. So we can simply switch the direction and generate the data the other way around. What we are doing is: we start with an English-to-German system.
Then we translate the English monolingual data into German, where the German side is maybe not very nice. And then we train our German-to-English system on the original parallel data and on the back-translated data. Here we have the advantage that our target side is human quality, and only the input is synthetic. This helps us to get really good systems.

There is one difference if you think about the data resources, and it is fairly obvious: here we need target-side monolingual data, while in the first approach we needed source-side monolingual data. So back translation normally works when you have target-side monolingual data, not source-side monolingual data. Intuitively this also makes sense: on the source side you only have to understand the content, while on the target side you have to generate real sentences, and it is somehow more difficult to generate something than to only understand it, so human-quality targets matter more.

This works well, but you have to decide how much back-translated data to use, because there is usually a lot more monolingual data available. Should you take all of it? There are two problems with that: it is expensive, because you have to translate all this data, and the balance with the real data matters. If you do not know better, a good starting point is to take an equal amount of back-translated data and real parallel data. It depends on the use case: if you have very little parallel data, it makes sense to use relatively more; it also depends on the quality of the reverse system, because the better it is, the better your synthetic data, and the more of it you can use. So it depends on a lot of things, but the rule of thumb that often works is roughly equal amounts.
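To make the data side of back translation concrete, here is a schematic sketch. Everything in it is a placeholder of mine: `reverse_model.translate` stands for whatever target-to-source system you have, the corpus variables are hypothetical, and the 1:1 default simply encodes the rule of thumb mentioned above.

```python
# Schematic assembly of a training set with back-translated data.
import random

def build_training_data(parallel, mono_target, reverse_model, ratio=1.0):
    """parallel: list of (src, tgt) pairs with human-quality targets.
    mono_target: target-language monolingual sentences.
    reverse_model: a target->source system used only to create synthetic sources.
    """
    n_synth = int(len(parallel) * ratio)
    sampled = random.sample(mono_target, min(n_synth, len(mono_target)))
    synthetic = [(reverse_model.translate(t), t) for t in sampled]  # noisy src, clean tgt
    data = parallel + synthetic
    random.shuffle(data)
    return data
```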
And you can, of course, take this further. I already said that quality matters: the result depends on the reverse system, because the better that system is, the better your synthetic data, and the better your final model. That leads to what is referred to as iterative back translation. You train an English-to-German system and translate monolingual data with it; then you train a German-to-English model with the additional data; then you translate German data with that model and retrain the first direction. In the second iteration the quality is better, because the system is no longer trained only on the small parallel data but additionally on back-translated data. And so you can keep improving. However, typically you can stop quite early: one iteration is usually good, and after two or three iterations the gains diminish, because you need a quite big improvement in quality in each round to keep gaining. One nice property of back translation is that it already works with a relatively weak reverse system.

There was a question whether you keep the old synthetic data or replace it in each iteration. That is a design decision; I would replace it, because the new translations have higher quality, but I assume it does not make too much of a difference.

That is about using monolingual data. Before we go into the pre-trained models, are there any more questions?

So the other thing we can do, which has recently become more and more successful, even more so since we have these really large language models that can even do the translation task themselves, is to use pre-trained models. You learn a representation on one task, and then you use this representation for another task. Maybe one of the first models where this was used at really large scale is something like BERT, which you pre-train purely on text and then fine-tune. One big advantage, of course, is that people cannot only share data but also pre-trained models. The recent models, the large language models, are available, and I think it costs several million to train one of them, just for the GPUs if you bought the compute from some cloud company. As a student project you will not have the budget to build these models. So the idea is: if these models are available, we can take them and use them as an additional resource, similar to pure text, and we can build models that learn not only from data but also from other models. It is a quite new way of thinking about training: we are not only learning from examples, we might also learn from other models. The nice thing is that this type of training, where we are not learning directly from data, works well.

So the main idea is: you have an initial pre-training task.
If you are working in NLP, that means you train on pure text, because that is where you have the largest amount of data. Then you define some type of task on this data in order to do your pre-training. The typical task is the language modelling task, predicting the next word, or a related task of predicting something in the middle, a masking task; we will see both, depending on the architecture. The point is that predicting something you have removed from the input is a task whose labels are easy to generate: you just need your text. That is why it is called self-supervised: you are creating your supervised training data yourself.

On the other hand, solving this task requires a lot of knowledge, and that is the other important property. There is this idea that the meaning of a word heavily depends on the context it appears in. I can give you a sentence with some gibberish word or an unknown name, and although you have never heard it, you will be able to assume quite a lot about it from the context. In exactly the same way, the models can learn something about words, and even about the world, just from text.

So that is typically the goal, and then we can use this model when training our MT system. Of course, we might need to adapt it: we may have to change the architecture, or use only some part of the pre-trained model. We have seen a bit of this already in the RNN case, where pre-training was also mentioned: you train the RNN language model on large amounts of data and then you put it somewhere into your system. This gives you the ability to build a system that contains knowledge learned purely from large amounts of text.

So the question is what type of information, what type of models, we can use, and today we want to look briefly at three things. The first, which was done initially, was not as prominent in machine translation as in other tasks, but it is also used there: static word embeddings, so just the first step we know, the mapping from the one-hot vector to a small continuous word representation. You can use this in your NMT system, for example by replacing or initializing the embedding layer with the pre-trained one. That is especially helpful if you have only a small amount of parallel data, because this part is then already trained on much more data and can be better.
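A small sketch of what "replacing the embedding layer" can look like in PyTorch. The pre-trained matrix here is a random stand-in of mine; in practice it would come from whichever toolkit produced the static embeddings, with rows ordered to match the NMT vocabulary, and the sizes are only illustrative.

```python
# Plug pre-trained static word embeddings into an NMT embedding layer.
import numpy as np
import torch
import torch.nn as nn

vocab_size, emb_dim = 32000, 300
pretrained = np.random.rand(vocab_size, emb_dim).astype("float32")  # stand-in for real vectors

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained),
    freeze=False,   # set True to keep the pre-trained vectors fixed during MT training
)
# `embedding` can now replace the randomly initialized embedding layer of the encoder.
```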
So should you pre-train or not? Does anybody see what might be the disadvantage of using something like this? What was mentioned today as a big advantage of these systems compared to previous ones? Right: one advantage was the end-to-end training, so that all parameters and all components are optimized to play together. If you pre-train something on one task, it may no longer fit optimally with everything else. So whether to pre-train or not depends on how important it is that everything is optimal together, versus how much you gain from the large data: if the pre-trained component is so much better because of the additional data, then it is helpful and outweighs losing a bit of the joint optimization.

There was a question whether random initialization would be better, since we optimize everything jointly anyway. The problem with a pre-trained starting point could be that you are already in some area of the parameter space from which it is not easy to get out. But usually the pre-trained model is not really worse; it only hurts if it pushes you in a direction that is not optimal for your task, and if you are not really gaining because you already have a decent amount of task data and the task is very different. Initially this was not done so much in machine translation, because there is more training data in MT than in many other tasks, but now, with really large amounts of monolingual data, we do some type of pre-training in essentially all state-of-the-art systems.

The other question is how much of the model you pre-train. The next option is contextual word embeddings: that is something like BERT or RoBERTa, where you already train a sequence model, and the embeddings you use are no longer specific to a word but also take the context into account. The embedding you use no longer depends only on the word itself but on the whole sentence. You can use similar things also for the decoder, with layers that do not have access to the source; these are typically GPT-style models. And finally, as we will see at the end, you can also have models that are already sequence-to-sequence, so you pre-train the whole encoder-decoder model; you have to make the task a bit challenging, but the idea is really that you pre-train your whole model and then fine-tune it.

But let us first take a step back and look at the different levels. The first one is the word embeddings: they are just this first layer, and you can train them with feed-forward neural networks, but you can also train them with an RNN language model, and by now you have hopefully also seen that you can train them with a transformer language model. You train them, for example, to predict the next word; that is the easiest task, and it is what is now referred to as self-supervised learning. All the big large language models, like ChatGPT and so on, are trained this way, and that is where you hopefully learn how a word is used, because you always try to predict the next word.
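Since the next-word objective keeps coming up, here is a minimal sketch of what "self-supervised" means in code: the targets are nothing but the input shifted by one position, so no annotation is needed. The model, the sizes and the random batch are purely illustrative assumptions.

```python
# Minimal next-word-prediction (language modelling) training step.
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)
rnn = nn.GRU(dim, dim, batch_first=True)
out = nn.Linear(dim, vocab)

tokens = torch.randint(0, vocab, (4, 20))        # a batch of token-id sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict each next word

h, _ = rnn(embed(inputs))
logits = out(h)                                  # (batch, seq-1, vocab)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```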
Why do we first look at the word embeddings and their use for our task? The main advantage is that it may be only the first layer, but it is typically where most of the parameters are. If most of your parameters are already trained on the large data, then on your target data you have to train much less. The reason is the big difference in size: your vocabulary is orders of magnitude larger than your hidden dimension, so the embedding matrix, vocabulary size times embedding size, is much larger than the weight matrices inside the other layers. So most of your parameters sit in the embeddings, which means that even though replacing the word embeddings may look like a small part of your overall NMT architecture, it covers most of the parameters, and you can already get quite big gains by doing only that.

We have also seen that these word embeddings can be very useful for other purposes: you learn general relations between words by doing this type of language modelling task. Two things are needed: on the one hand you need a lot of data, and on the other hand the task needs to be useful. If you only had to predict the first letter of the next word, you would not learn anything about the meaning of words.

The interesting thing is that people have looked at these word embeddings, asked how they look, and visualized them by dimension reduction; I do not know whether any of you attend Advanced Artificial Intelligence, where this type of representation was covered yesterday. If you project the embeddings into a low-dimensional space with some dimension reduction, you see interesting things. For example, the relation between the male and female version of a word: the vector between the two is not always exactly the same, but it is clearly related. So you can do a bit of maths: you take "king", subtract the one vector, add the other, and you end up near the female counterpart. That means there really is information stored in these vectors. You can do something similar with verb forms: swimming and swam, walking and walked; again the vectors are not identical, but they are related, so something about going from one form to the other is learned. Or semantic relations: the relation between a country and its capital behaves in exactly the same way. People have even done analogy-style question answering with these embeddings. And you should not only trust the dimension-reduced picture, because the reduction may hide things; you can also look at what happens in the original space, for example at nearest neighbours. You can take the relation between France and Paris, add it to Italy, and you get Rome; you can do big and bigger, small and smaller, and so on. It does not work everywhere: there is, for example, the country-to-typical-dish relation, with the German example shown here, which does not always come out right. You can also ask what a famous person does: for Einstein you get scientist, while for famous midfielders it is not completely correct. You see the examples are a bit old; the politicians that come out are no longer in office, but of course the data is from back then.
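The embedding arithmetic can be checked with a few lines of vector maths. This sketch assumes a dictionary `vectors` mapping words to pre-trained embedding arrays, which is not provided here; cosine-similarity nearest-neighbour search is the usual way such analogies are evaluated.

```python
# Answer "a is to b as c is to ?" with word-vector arithmetic.
import numpy as np

def nearest(query_vec, vectors, exclude=()):
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy(a, b, c, vectors):
    """e.g. analogy('france', 'paris', 'italy', vectors) should land near 'rome'."""
    query = vectors[b] - vectors[a] + vectors[c]
    return nearest(query, vectors, exclude={a, b, c})
```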
What people did, especially at the beginning, was driven by the fact that training an RNN language model was very expensive. One famous observation was: we are not really interested in the language model performance itself. That is something good to keep in mind: what are we really interested in? We do not really need the RNN; in this case we are only interested in the mapping from words to vectors. Very successful in this direction was word2vec. The idea is: we are not training a real language model, we make it even simpler, for example with the continuous bag of words (CBOW): you have, say, four input tokens, the surrounding words, and you predict the word in the middle, and the model is essentially just two linear layers. So it simplifies things further and makes the computation faster, because the embedding is what we are interested in. Along with the continuous bag of words there is the continuous skip-gram model, the other model referred to as word2vec: there you have one input word and, the other way around, you predict the four words around it. In the end the task is very similar.
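A minimal sketch of the continuous-bag-of-words idea as just described: average the embeddings of the surrounding words and predict the middle word with a single linear output layer. All sizes and the random batch are illustrative.

```python
# Minimal CBOW: context words in, centre word out; the embedding matrix is what we keep.
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # the product we actually want
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context):                # context: (batch, 2*window) token ids
        ctx = self.embed(context).mean(dim=1)  # average the context embeddings
        return self.out(ctx)                   # scores for the middle word

model = CBOW()
context = torch.randint(0, 10000, (8, 4))      # e.g. two words left, two words right
loss = nn.CrossEntropyLoss()(model(context), torch.randint(0, 10000, (8,)))
loss.backward()
```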
Before we go to the next point, are there any questions about word vectors or word embeddings?

The next thing is contextual word embeddings. The idea is: static embeddings are helpful, but we might be able to get even more out of the monolingual data. Take the word "can": in the static embedding there is an overlap of its two meanings, so the vector represents both the container and the modal verb in "can do it". But we might be able to disambiguate this already in the pre-trained model, because the two meanings are used in different contexts. So if we can have a model that does not only represent a word but represents the meaning of the word within its context, then we get contextual word embeddings: we really have a representation of the word in the sentence.

And we already have a very good architecture for that: the RNN. The hidden state represents what has been said so far, with a focus on the last word, so it is a kind of representation of the word in its context. The first widely used model doing that is the ELMo paper. They take the normal language model setup: given words one to three, predict word four, and so on, so you are always predicting the next word. The architecture is the word embedding layer and then the RNN layers. And now, instead of using only the final output, you use the hidden state at each position: it represents the meaning of that word, mainly in the context of what has been seen before. You train it in language-model style, always predicting the next word, but more information is now stored in the representation, and therefore the downstream system has to learn less on top of it.

And this is essentially what is done currently in GPT: the only differences are that there are more layers, a bigger size, and transformer self-attention instead of the RNN. But that is how the large language models are trained.

However, if you look at this contextual representation, it might not be perfect. If you think of this hidden state as the contextual representation of the third word, what is the issue?
It represents word three in the context of the sentence, but only in the context of the previous words. However, we have an architecture that can also take both sides into account, and we have used it already in the encoder: we can easily run the RNN also in the backward direction, just by processing the states the other way around, and then combine the forward and the backward state into a joint one and make the prediction from that. So for each position you have the word embedding, then two hidden states, one from the forward RNN and one from the backward RNN, and you can, for example, take the concatenation of the two. This state now mainly represents this word, because both RNNs saw it last, and we know the RNN focuses on what happened last.

However, there is a problem when you try to train this as a language model. Maybe with the masking again? That is one solution; but first, why can we not do it directly? The reason is information leakage: you cannot just predict the next word. If we predict the next word with this type of model, it is a trivial task, because the backward state has already seen the next word and directly influences this hidden state. So predicting it is not a good training signal: what will happen is that the system simply ignores the forward states and learns to copy the information about the next word from the backward state into the output. It would get a nearly perfect language model, because it only needs to find an encoding with which it can pass every word through this upper hidden state, but the representations would be useless; the only thing it learns is how to encode the next word in that hidden state. Therefore it is not really useful, and we need a somewhat different way out.

One is masking, and I will come to that shortly, but other things have also been done. The other approach, used in the ELMo paper, is not to combine them directly: you have the forward RNN and the backward RNN and you keep them completely separate, so during language-model training you never merge the states. At the end, the representation of a word is built from the forward hidden state and the backward hidden state next to that word, and these two you join into the representation. Then you have a representation of the word that takes the whole sentence into account, but there is no information leakage, because each direction was trained on its own. You can do that in all layers: you run the forward layers and the backward layers separately and only join the hidden states for the downstream task. However, it is a bit complicated, because you have to keep both separate and merge things at the right point. So what else can you do?
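A sketch of the "keep the two directions separate" idea: two independent GRUs read the sentence in opposite directions, each could be trained as its own next-word predictor, and only the downstream representation concatenates their states. This is a simplified illustration of the principle, not the exact ELMo architecture, and all names and sizes are my own.

```python
# Two separate directional LMs; states are only merged for the representation.
import torch
import torch.nn as nn

class TwoWayLM(nn.Module):
    def __init__(self, vocab=10000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # trained to predict the next word
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # trained to predict the previous word

    def representations(self, tokens):                  # tokens: (batch, seq_len)
        e = self.embed(tokens)
        h_fwd, _ = self.fwd(e)                           # left context of each position
        h_bwd_rev, _ = self.bwd(torch.flip(e, dims=[1])) # read the sentence right-to-left
        h_bwd = torch.flip(h_bwd_rev, dims=[1])          # re-align with the original order
        return torch.cat([h_fwd, h_bwd], dim=-1)         # contextual embedding per token
```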
And that is the moment of the big success of the BERT model. The observation was: in the bidirectional case it is not good to do next-word prediction, but we can do masking. Masking mainly means we predict something in the middle, some of the words: we take the input, put noise into it by removing some words, and then the model has to predict exactly those. Now there can be no information leakage, because the word to be predicted is no longer in the input, so it is a real challenge. And it makes no assumption about the model: it does not need to be a forward model or a backward model or anything; you can always predict the masked word.

There is one bit of a disadvantage, though. Do you see what could be a problem? You can of course mask more, but to see it clearly, first assume you only mask one word. Then for the whole sentence you get one feedback signal: what is word three? So you have one training example per sentence. If you do the language modelling task, you predict at every position, so for each token you get a feedback signal. In this sense masking is less efficient, because you get fewer feedback signals per sentence.

So in BERT the main ideas are that you train this bidirectional model with masking and that it uses the transformer architecture. There are two more, somewhat smaller, additions. One is that next-sentence prediction is used as an additional task: the idea is that you want the model to learn more about what language is, to really understand how sentences follow each other in a story, not just treat them as independent token sequences.

The input uses subword units, as we do in MT, and there are special tokens: one token is there for the next-sentence prediction, and more generally for classification tasks, so that the model learns a general representation of the full sentence at that position. And there are segment embeddings, so you have an embedding indicating this is the first sentence and this is the second.

Now, what is more challenging is the masking itself: what do you mask? There has been follow-up work on this, for example RoBERTa. It is not super sensitive, but if you do it completely wrong, the model does not learn anything. One question is: should you always mask full words, or, if a word is split into subwords, mask only a single subword and predict it based on the others? That is of course a somewhat different, easier task, because if you know three parts of a word it may be easy to guess the last. BERT took the easiest selection: it does not consider words at all, because the subword splitting is done in preprocessing, and it simply masks subword tokens; I think in other groups it is done differently and they always mask the full word, but it is not a huge difference. And then, what do you do with a position selected for masking? In eighty percent of the cases they replace the word with the special mask token, in ten percent they put in some random other token, and in ten percent they keep it unchanged.
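The masking recipe just described is easy to write down. Here is a small sketch over lists of token ids; the MASK_ID value, the ignore label and the 15% selection rate are the usual choices and my assumptions, while the 80/10/10 split is the one from the lecture.

```python
# BERT-style masking of an input sequence for masked-word prediction.
import random

MASK_ID = 0
IGNORE = -100   # label value meaning "no prediction at this position"

def mask_tokens(token_ids, vocab_size, mask_rate=0.15):
    inputs, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_rate:
            continue
        labels[i] = tok                               # the model must recover this token
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: random other token
        # remaining 10%: keep the token unchanged
    return inputs, labels
```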
1:07:24.684 --> 1:07:29.099 Now, what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already had the question of how much you should mask. 1:07:35.275 --> 1:07:42.836 So there has been quite some work on that afterwards, for example RoBERTa. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not learning anything. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:04.594 Should I always mask the full word, or, if I have a subword, 1:08:04.594 --> 1:08:10.630 mask only that subword and predict it based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the word, it might be easier to guess the last one; here they 1:08:21.210 --> 1:08:27.594 took the easiest solution, so not considering full words at all anymore, because you're doing that 1:08:27.594 --> 1:08:32.280 in the preprocessing and just always treating subwords like words. 1:08:32.672 --> 1:08:36.089 I think in some later work this is done differently. 1:08:36.089 --> 1:08:40.401 They always mask the full words, but I guess it's not that critical. 1:08:41.001 --> 1:08:46.044 And then, what to do with the masked word: in eighty percent of the cases, 1:08:46.044 --> 1:08:50.803 if the word is masked, they replace it with a special token, 1:08:50.803 --> 1:08:57.197 this is the mask token; in ten percent they put in some random other token, and in ten 1:08:57.197 --> 1:08:59.470 percent they keep it unchanged.
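This eighty/ten/ten replacement rule can be written down in a few lines; the helper below is only a sketch with a toy vocabulary, but it mirrors the three cases just described.

```python
import random

def corrupt(token, vocab):
    """For a position selected for prediction: 80% -> [MASK],
    10% -> some random other token, 10% -> keep the original token."""
    r = random.random()
    if r < 0.8:
        return "[MASK]"
    if r < 0.9:
        return random.choice(vocab)
    return token

print(corrupt("store", vocab=["the", "man", "went", "to", "milk"]))
```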
1:09:02.202 --> 1:09:10.846 And then what you can also do is this next sentence prediction. 1:09:10.846 --> 1:09:14.880 The man went to the store. 1:09:14.880 --> 1:09:17.761 He bought a gallon of milk. 1:09:18.418 --> 1:09:24.088 So maybe you see you're joining them: you're doing both the masking and this sentence prediction. 1:09:24.564 --> 1:09:29.449 Versus something like: penguins are flightless birds. 1:09:29.449 --> 1:09:41.390 These two sentences have nothing to do with each other, so you can also do this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:57.043 And then the whole BERT model: so here you have the input, here the transformer layers, 1:09:57.043 --> 1:09:58.170 and then the output. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general NLP applications. 1:10:17.937 --> 1:10:27.644 However, there is like a huge family of different types of models coming from it. 1:10:27.827 --> 1:10:38.709 So based on this, a whole set of self-supervised models came out of there, and now 1:10:38.709 --> 1:10:42.086 this is getting even more important 1:10:42.082 --> 1:10:46.640 with the availability of large language models and their success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it changed a bit 1:10:50.910 --> 1:10:57.847 again, from more the bidirectional models to unidirectional models, which 1:10:57.847 --> 1:11:02.710 are at the moment maybe a bit more common; we're coming to them now. 1:11:02.710 --> 1:11:09.168 Do you see one advantage? What is, besides the efficiency, 1:11:09.509 --> 1:11:15.901 one other reason why you are sometimes more interested in unidirectional models than 1:11:15.901 --> 1:11:17.150 in bidirectional ones? 1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional one is not really 1:11:30.220 --> 1:11:30.872 usable. 1:11:32.192 --> 1:11:40.924 It doesn't work: if you want to do generation, like in the decoder, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 perhaps if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models. 1:12:12.761 --> 1:12:16.161 That's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 The second type is the decoder-based models, so the autoregressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next word prediction. 1:12:43.403 --> 1:12:52.439 And there you can also have a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model, 1:12:54.354 --> 1:13:05.039 because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 And that is somehow what the prefix language model does: 1:13:12.240 --> 1:13:19.076 on the first tokens you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge the two, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 in an RNN that would not work: 1:13:33.039 --> 1:13:34.836 there we would need an extra backward RNN. 1:13:34.975 --> 1:13:38.533 In the transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that in the encoder and decoder the number of parameters is different, because 1:13:44.918 --> 1:13:50.235 you do cross attention, but if you do forward and backward, or unidirectional and bidirectional, 1:13:50.650 --> 1:13:58.736 it's only that you mask your attention to only look at the past or to also look into the 1:13:58.736 --> 1:13:59.471 future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is a unidirectional or causal one where you can only look at the past, and there is the prefix one where you can, for example, attend 1:14:23.516 --> 1:14:25.649 bidirectionally to the first three words.
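These variants differ only in the attention mask; here is a minimal sketch using PyTorch (the function name and the prefix_len parameter are my own illustrative choices). With prefix_len=3, the first three positions can attend to each other in both directions, and everything after them stays causal, matching the example above.

```python
import torch

def attention_mask(seq_len, kind="causal", prefix_len=0):
    """mask[i, j] == True means position i may attend to position j.
    'full'   : bidirectional (encoder-style), attend to everything
    'causal' : unidirectional language model, attend to the past only
    'prefix' : bidirectional over the first prefix_len tokens, causal after."""
    if kind == "full":
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal part
    if kind == "prefix":
        mask[:, :prefix_len] = True  # every position may look at the prefix
    return mask

print(attention_mask(5, kind="prefix", prefix_len=3).int())
```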
1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, then, of course, you can also do the other combination. 1:14:43.163 --> 1:14:50.623 So the idea is: we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same input to both, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking here. 1:15:12.912 --> 1:15:17.696 Here we don't need to do that masking, but here we need the masking so that it doesn't see the future. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that is the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train your model. 1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in the source sentence, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the sentence in twice, and so that it's not a trivial task, 1:16:01.382 --> 1:16:02.432 we can change it. 1:16:03.003 --> 1:16:12.777 And they do different corruption techniques there, so you can, for example, also delete tokens. 1:16:13.233 --> 1:16:19.692 That you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict it anywhere. 1:16:20.970 --> 1:16:26.353 So in the encoder-only model, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in it. 1:16:30.110 --> 1:16:38.268 Here, since the decoder side is unidirectional generation, we can also delete a token and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation. 1:16:45.250 --> 1:16:54.285 We can do document rotation and text infilling, so there is quite a bit you can do. 1:16:55.615 --> 1:17:06.568 So you see, there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, there is again, like for the language model, 1:17:14.985 --> 1:17:21.079 the other question: how do you integrate it? 1:17:21.761 --> 1:17:26.636 And there are also, like, yeah, quite a few different techniques. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is: you take your word embeddings or your pre-trained model, 1:17:39.068 --> 1:17:47.971 you freeze them, stack your decoder layers on top, and keep only these trainable. 1:17:48.748 --> 1:17:54.495 That can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is you freeze, for example, your word embeddings or some other parts, and train the rest. 1:18:05.865 --> 1:18:17.296 The other thing is you just initialize with them, so you initialize your model with the pre-trained weights, but you train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART for translation, you want to have the source language in the encoder and the target 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 The one you get is trained on English, so what you can do there is try to do some adaptation 1:18:46.366 --> 1:18:52.562 of the BART in order to learn some language-specific stuff, or there's a multilingual BART, 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but only on the monolingual data of each language, 1:18:58.823 --> 1:19:03.388 so it may be trained on German and on English, but not on German-English. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to better 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It was only trained on the same language, but it mainly only has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Now, a certain thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART 1:19:29.709 --> 1:19:35.218 and you put some adapters inside the network, so there are small new layers 1:19:35.218 --> 1:19:40.790 which are put in between, and then you only train these adapters, or you also train 1:19:40.790 --> 1:19:41.815 these adapters. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could see that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that. 1:19:59.793 --> 1:20:05.225 These are quite small and so very efficient.
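A typical adapter is just a small bottleneck layer with a residual connection, inserted after a frozen pre-trained sub-layer. The sketch below is only illustrative; the sizes d_model=512 and bottleneck=64 are assumptions, not values from the lecture.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck with a residual connection; only these few parameters
    are trained while the surrounding pre-trained layers stay frozen."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual: with small adapter weights, the pre-trained behaviour
        # is almost unchanged at the start of training.
        return x + self.up(self.act(self.down(x)))
```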
1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language. 1:20:22.247 --> 1:20:33.777 The model then has both: language-specific parts and, at the same time, the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one catch in general in the multilingual systems. 1:20:39.058 --> 1:20:40.439 It works quite well, 1:20:40.439 --> 1:20:46.161 but there's one case or one specific use case for multilingual systems where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Because having here some components which might be very language specific hurts: in zero-shot, 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language independent, and 1:21:09.015 --> 1:21:10.184 with the adapters 1:21:10.184 --> 1:21:15.601 you of course get representations again which are more language specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And then there is also the idea of doing more knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is okay: 1:21:42.923 --> 1:21:54.157 we are training it the same way, but what we want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare, for example, the first hidden state with that of the pre-trained model and try to make them similar. 1:22:12.192 --> 1:22:18.144 For example, by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This needs the same vocabulary. Why does it need the same vocabulary, any idea? 1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, typically you also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states here and four states there, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare to which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So therefore, at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like saying, yeah, the average should be the same, but of course that's 1:23:30.030 --> 1:23:33.071 a less strong signal. 1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a direct training signal on the encoder, 1:23:42.979 --> 1:23:51.455 so you can directly push the encoder towards already giving a good representation, while normally 1:23:51.455 --> 1:23:52.407 in an NMT system it only gets an indirect signal.
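This encoder distillation boils down to an auxiliary loss between the hidden states of the NMT encoder being trained and those of a frozen pre-trained model; below is a minimal sketch in PyTorch, assuming both models use the same vocabulary so the two sequences have the same length and positions align one to one. Such a term would simply be added to the normal translation loss.

```python
import torch.nn.functional as F

def encoder_distillation_loss(student_states, teacher_states):
    """L2/MSE loss pulling the trainable encoder's hidden states towards the
    frozen pre-trained encoder's states, position by position.
    Both tensors have shape (batch, seq_len, hidden); identical lengths are
    only guaranteed if the same vocabulary/tokenization is used."""
    assert student_states.shape == teacher_states.shape
    return F.mse_loss(student_states, teacher_states.detach())
```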
1:23:56.936 --> 1:24:13.197 Yes, I think this is mostly it for today, so what you should keep in mind: 1:24:13.393 --> 1:24:18.400 the one thing is the back-translation idea, 1:24:18.400 --> 1:24:29.561 if you have monolingual data and want to use that; the other one is to use pre-trained models. And generally it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained models, but then you can even still do back translation, where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 There we have the advantage that we are training everything working together on the task, so 1:24:45.502 --> 1:24:51.093 it might be helpful even to back-translate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup, because in pre-training, of course, the big challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on a different task. 1:24:58.058 --> 1:25:03.327 There are different ways of how you integrate this knowledge, 1:25:03.327 --> 1:25:08.089 even if you just use the full model; so in this case, 1:25:08.748 --> 1:25:11.128 this is the most similar you can get: 1:25:11.128 --> 1:25:13.945 you're doing no changes to the architecture. 1:25:13.945 --> 1:25:19.643 You're really taking the model and just fine-tuning it on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to learn completely from scratch how to do the attention and so on. 1:25:24.464 --> 1:25:29.971 And there it might, for example, be helpful to have more back-translated data to learn that. 1:25:32.192 --> 1:25:34.251 That's it for today. 1:25:34.251 --> 1:25:44.661 There's one important thing: next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail, if you're in ILIAS, that there's a room change for Tuesday 1:25:56.769 --> 1:25:57.426 and where it is. 1:25:57.637 --> 1:26:03.890 Are there more questions? Yeah, I have a more general question: in computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your data set with data augmentation. 1:26:07.347 --> 1:26:08.295 Is there anything 1:26:08.388 --> 1:26:15.301 similar to enlarge the data for text? 1:26:15.755 --> 1:26:29.176 Yes, you can use this back translation and also masking, but back translation is some 1:26:29.176 --> 1:26:31.228 way of data augmentation. 1:26:31.371 --> 1:26:35.629 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have a good MT system, it can also be used for parallel data. 1:26:54.834 --> 1:26:59.139 So I would say this is the most similar one. 1:26:59.139 --> 1:27:03.143 There are also ways you can do paraphrasing. 1:27:05.025 --> 1:27:12.057 But, for example, it is very hard to do this by rules, like which words to replace, because 1:27:12.057 --> 1:27:18.936 there is no rule like: this word can always be replaced by that one. 1:27:19.139 --> 1:27:27.225 I mean, there aren't many perfect synonyms; normally they are good in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do it rule-based, you have to train your own model for that. 1:27:38.058 --> 1:27:57.236 Does it have to be the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension 1:28:00.000 --> 1:28:01.590 and architecture. 1:28:01.590 --> 1:28:05.452 We will later see, in the efficiency lecture, that 1:28:05.452 --> 1:28:12.948 you can also do knowledge distillation with, for example, a smaller model. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible, but yeah, I agree it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yeah, to your question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 Good, then thanks a lot, and we'll see each other again on Tuesday.
0:08:00.380 --> 0:08:05.271 Another way to do it is unsupervised and the extreme case. 0:08:05.271 --> 0:08:11.158 If you have a scenario then you only have data, only monolingual data. 0:08:11.158 --> 0:08:13.976 Can you still build translations? 0:08:14.754 --> 0:08:27.675 If you have large amounts of data and languages are not too dissimilar, you can build translation 0:08:27.675 --> 0:08:31.102 systems without parallel. 0:08:32.512 --> 0:08:36.267 That we will see you then next Thursday. 0:08:37.857 --> 0:08:50.512 And then there is now a third type of pre-trained model that recently became very successful 0:08:50.512 --> 0:08:55.411 and now with large language models. 0:08:55.715 --> 0:09:03.525 So the idea is we are no longer sharing the real data, but it can also help to train a 0:09:03.525 --> 0:09:04.153 model. 0:09:04.364 --> 0:09:11.594 And that is now a big advantage of deep learning based approaches. 0:09:11.594 --> 0:09:22.169 There you have this ability that you can train a model in some task and then apply it to another. 0:09:22.722 --> 0:09:33.405 And then, of course, the question is, can I have an initial task where there's huge amounts 0:09:33.405 --> 0:09:34.450 of data? 0:09:34.714 --> 0:09:40.251 And the test that typically you pre train on is more like similar to a language moral 0:09:40.251 --> 0:09:45.852 task either direct to a language moral task or like a masking task which is related so 0:09:45.852 --> 0:09:51.582 the idea is oh I can train on this data and the knowledge about words how they relate to 0:09:51.582 --> 0:09:53.577 each other I can use in there. 0:09:53.753 --> 0:10:00.276 So it's a different way of using language models. 0:10:00.276 --> 0:10:06.276 There's more transfer learning at the end of. 0:10:09.029 --> 0:10:17.496 So first we will start with how can we use monolingual data to do a Yeah to do a machine 0:10:17.496 --> 0:10:18.733 translation? 0:10:20.040 --> 0:10:27.499 That: Big difference is you should remember from what I mentioned before is. 0:10:27.499 --> 0:10:32.783 In statistical machine translation we directly have the opportunity. 0:10:32.783 --> 0:10:39.676 There's peril data for the translation model and monolingual data for the language model. 0:10:39.679 --> 0:10:45.343 And you combine your translation model and language model, and then you can make use of 0:10:45.343 --> 0:10:45.730 both. 0:10:46.726 --> 0:10:53.183 That you can make use of these large large amounts of monolingual data, but of course 0:10:53.183 --> 0:10:55.510 it has also some disadvantage. 0:10:55.495 --> 0:11:01.156 Because we say the problem is we are optimizing both parts a bit independently to each other 0:11:01.156 --> 0:11:06.757 and we say oh yeah the big disadvantage of newer machine translations now we are optimizing 0:11:06.757 --> 0:11:10.531 the overall architecture everything together to perform best. 0:11:10.890 --> 0:11:16.994 And then, of course, we can't do there, so Leo we can can only do a mural like use power 0:11:16.994 --> 0:11:17.405 data. 0:11:17.897 --> 0:11:28.714 So the question is, but this advantage is not so important that we can train everything, 0:11:28.714 --> 0:11:35.276 but we have a moral legal data or even small amounts. 
0:11:35.675 --> 0:11:43.102 So in data we know it's not only important the amount of data we have but also like how 0:11:43.102 --> 0:11:50.529 similar it is to your test data so it can be that this modeling data is quite small but 0:11:50.529 --> 0:11:55.339 it's very well fitting and then it's still very helpful. 0:11:55.675 --> 0:12:02.691 At the first year of surprisingness, if we are here successful with integrating a language 0:12:02.691 --> 0:12:09.631 model into a translation system, maybe we can also integrate some type of language models 0:12:09.631 --> 0:12:14.411 into our empty system in order to make it better and perform. 0:12:16.536 --> 0:12:23.298 The first thing we can do is we know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:31.096 There was our language model because these works were mainly done before transformer-based 0:12:31.096 --> 0:12:31.753 models. 0:12:32.152 --> 0:12:38.764 In general, of course, you can do the same thing with transformer baseball. 0:12:38.764 --> 0:12:50.929 There is nothing about whether: It's just that it has mainly been done before people 0:12:50.929 --> 0:13:01.875 started using R&S and they tried to do this more in cases. 0:13:07.087 --> 0:13:22.938 So what we're happening here is in some of this type of idea, and in key system you remember 0:13:22.938 --> 0:13:25.495 the attention. 0:13:25.605 --> 0:13:29.465 Gets it was your last in this day that you calculate easy attention. 0:13:29.729 --> 0:13:36.610 We get the context back, then combine both and then base the next in state and then predict. 0:13:37.057 --> 0:13:42.424 So this is our system, and the question is, can we send our integrated language model? 0:13:42.782 --> 0:13:49.890 And somehow it makes sense to take out a neural language model because we are anyway in the 0:13:49.890 --> 0:13:50.971 neural space. 0:13:50.971 --> 0:13:58.465 It's not surprising that it contrasts to statistical work used and grants it might make sense to 0:13:58.465 --> 0:14:01.478 take a bit of a normal language model. 0:14:01.621 --> 0:14:06.437 And there would be something like on Tubbles Air, a neural language model, and our man based 0:14:06.437 --> 0:14:11.149 is you have a target word, you put it in, you get a new benchmark, and then you always put 0:14:11.149 --> 0:14:15.757 in the words and get new hidden states, and you can do some predictions at the output to 0:14:15.757 --> 0:14:16.948 predict the next word. 0:14:17.597 --> 0:14:26.977 So if we're having this type of in language model, there's like two main questions we have 0:14:26.977 --> 0:14:34.769 to answer: So how do we combine now on the one hand our system and on the other hand our 0:14:34.769 --> 0:14:35.358 model? 0:14:35.358 --> 0:14:42.004 You see that was mentioned before when we started talking about ENCODA models. 0:14:42.004 --> 0:14:45.369 They can be viewed as a language model. 0:14:45.805 --> 0:14:47.710 The wine is lengthened, unconditioned. 0:14:47.710 --> 0:14:49.518 It's just modeling the target sides. 0:14:49.970 --> 0:14:56.963 And the other one is a conditional language one, which is a language one conditioned on 0:14:56.963 --> 0:14:57.837 the Sewer. 0:14:58.238 --> 0:15:03.694 So how can you combine to language models? 0:15:03.694 --> 0:15:14.860 Of course, it's like the translation model will be more important because it has access 0:15:14.860 --> 0:15:16.763 to the source. 0:15:18.778 --> 0:15:22.571 If we have that, the other question is okay. 
0:15:22.571 --> 0:15:24.257 Now we have models. 0:15:24.257 --> 0:15:25.689 How do we train? 0:15:26.026 --> 0:15:30.005 Pickers integrated them. 0:15:30.005 --> 0:15:34.781 We have now two sets of data. 0:15:34.781 --> 0:15:42.741 We have parallel data where you can do the lower. 0:15:44.644 --> 0:15:53.293 So the first idea is we can do something more like a parallel combination. 0:15:53.293 --> 0:15:55.831 We just keep running. 0:15:56.036 --> 0:15:59.864 So here you see your system that is running. 0:16:00.200 --> 0:16:09.649 It's normally completely independent of your language model, which is up there, so down 0:16:09.649 --> 0:16:13.300 here we have just our NMT system. 0:16:13.313 --> 0:16:26.470 The only thing which is used is we have the words, and of course they are put into both 0:16:26.470 --> 0:16:30.059 systems, and out there. 0:16:30.050 --> 0:16:42.221 So we use them somehow for both, and then we are doing our decision just by merging these 0:16:42.221 --> 0:16:42.897 two. 0:16:43.343 --> 0:16:53.956 So there can be, for example, we are doing a probability distribution here, and then we 0:16:53.956 --> 0:17:03.363 are taking the average of post-perability distribution to do our predictions. 0:17:11.871 --> 0:17:18.923 You could also take the output with Steve's to be more in chore about the mixture. 0:17:20.000 --> 0:17:32.896 Yes, you could also do that, so it's more like engaging mechanisms that you're not doing. 0:17:32.993 --> 0:17:41.110 Another one would be cochtrinate the hidden states, and then you would have another layer 0:17:41.110 --> 0:17:41.831 on top. 0:17:43.303 --> 0:17:56.889 You think about if you do the conqueredination instead of taking the instead and then merging 0:17:56.889 --> 0:18:01.225 the probability distribution. 0:18:03.143 --> 0:18:16.610 Introduce many new parameters, and these parameters have somehow something special compared to 0:18:16.610 --> 0:18:17.318 the. 0:18:23.603 --> 0:18:37.651 So before all the error other parameters can be trained independent, the language model 0:18:37.651 --> 0:18:42.121 can be trained independent. 0:18:43.043 --> 0:18:51.749 If you have a joint layer, of course you need to train them because you have now inputs. 0:18:54.794 --> 0:19:02.594 Not surprisingly, if you have a parallel combination of whether you could, the other way is to do 0:19:02.594 --> 0:19:04.664 more serial combinations. 0:19:04.924 --> 0:19:10.101 How can you do a similar combination? 0:19:10.101 --> 0:19:18.274 Your final decision makes sense to do a face on the system. 0:19:18.438 --> 0:19:20.996 So you have on top of your normal and system. 0:19:21.121 --> 0:19:30.678 The only thing is now you're inputting into your system. 0:19:30.678 --> 0:19:38.726 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:45.588 So you're training your mainly what you have your lower layers here which are trained more 0:19:45.588 --> 0:19:52.183 on the purely language model style and then on top your putting into the NMT system where 0:19:52.183 --> 0:19:55.408 it now has already here the language model. 0:19:55.815 --> 0:19:58.482 So here you can also view it. 0:19:58.482 --> 0:20:06.481 Here you have more contextual embeddings which no longer depend only on the word but they 0:20:06.481 --> 0:20:10.659 also depend on the context of the target site. 0:20:11.051 --> 0:20:19.941 But you have more understanding of the source word, so you have a language in the current 0:20:19.941 --> 0:20:21.620 target sentence. 
0:20:21.881 --> 0:20:27.657 So if it's like the word can, for example, will be put in here always the same independent 0:20:27.657 --> 0:20:31.147 of its user can of beans, or if it's like I can do it. 0:20:31.147 --> 0:20:37.049 However, because you are having your language model style, you have maybe disintegrated this 0:20:37.049 --> 0:20:40.984 already a bit, and you give this information directly to the. 0:20:41.701 --> 0:20:43.095 An empty cyst. 0:20:44.364 --> 0:20:49.850 You, if you're remembering more the transformer based approach, you have some layers. 0:20:49.850 --> 0:20:55.783 The lower layers are purely languaged while the other ones are with attention to the source. 0:20:55.783 --> 0:21:01.525 So you can view it also that you just have lower layers which don't attend to the source. 0:21:02.202 --> 0:21:07.227 This is purely a language model, and then at some point you're starting to attend to 0:21:07.227 --> 0:21:08.587 the source and use it. 0:21:13.493 --> 0:21:20.781 Yes, so this is how you combine them in peril or first do the language model and then do. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.102 Model, and in this case in the sequence. 0:21:38.278 --> 0:21:54.854 Case so the actual word that we transferred into a numerical lecture, and this is an input. 0:21:56.176 --> 0:22:03.568 That depends on if you view the word embedding as part of the language model. 0:22:03.568 --> 0:22:10.865 So if you first put the word target word then you do the one hot end coding. 0:22:11.691 --> 0:22:13.805 And then the word embedding there is the r& 0:22:13.805 --> 0:22:13.937 n. 0:22:14.314 --> 0:22:21.035 So you can use this together as your language model when you first do the word embedding. 0:22:21.401 --> 0:22:24.346 All you can say is like before. 0:22:24.346 --> 0:22:28.212 It's more a definition, but you're right. 0:22:28.212 --> 0:22:30.513 So what's the steps out? 0:22:30.513 --> 0:22:36.128 You take the word, the one hut encoding, the word embedding. 0:22:36.516 --> 0:22:46.214 What one of these parrots, you know, called a language model is definition wise and not 0:22:46.214 --> 0:22:47.978 that important. 0:22:53.933 --> 0:23:02.264 So the question is how can you then train them and make this this one work? 0:23:02.264 --> 0:23:02.812 The. 0:23:03.363 --> 0:23:15.201 So in the case where you combine the language one of the abilities you can train them independently 0:23:15.201 --> 0:23:18.516 and just put them together. 0:23:18.918 --> 0:23:27.368 Might not be the best because we have no longer the stability that we had before that optimally 0:23:27.368 --> 0:23:29.128 performed together. 0:23:29.128 --> 0:23:33.881 It's not clear if they really work the best together. 0:23:34.514 --> 0:23:41.585 At least you need to somehow find how much do you trust the one model and how much. 0:23:43.323 --> 0:23:45.058 Still in some cases useful. 0:23:45.058 --> 0:23:48.530 It might be helpful if you have only data and software. 0:23:48.928 --> 0:23:59.064 However, in MT we have one specific situation that at least for the MT part parallel is also 0:23:59.064 --> 0:24:07.456 always monolingual data, so what we definitely can do is train the language. 0:24:08.588 --> 0:24:18.886 So what we also can do is more like the pre-training approach. 0:24:18.886 --> 0:24:24.607 We first train the language model. 0:24:24.704 --> 0:24:27.334 The pre-training approach. 
0:24:27.334 --> 0:24:33.470 You first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:41.143 Of course, the model size is this way, but the data size is too bigly the other way around. 0:24:41.143 --> 0:24:47.883 You often have a lot more monolingual data than you have here parallel data, in which 0:24:47.883 --> 0:24:52.350 scenario can you imagine where this type of pretraining? 0:24:56.536 --> 0:24:57.901 Any Ideas. 0:25:04.064 --> 0:25:12.772 One example where this might also be helpful if you want to adapt to domains. 0:25:12.772 --> 0:25:22.373 So let's say you do medical sentences and if you want to translate medical sentences. 0:25:23.083 --> 0:25:26.706 In this case it could be or its most probable happen. 0:25:26.706 --> 0:25:32.679 You're learning here up there what medical means, but in your fine tuning step the model 0:25:32.679 --> 0:25:38.785 is forgotten everything about Medicare, so you may be losing all the information you gain. 0:25:39.099 --> 0:25:42.366 So this type of priest training step is good. 0:25:42.366 --> 0:25:47.978 If your pretraining data is more general, very large and then you're adapting. 0:25:48.428 --> 0:25:56.012 But in the task with moral lingual data, which should be used to adapt the system to some 0:25:56.012 --> 0:25:57.781 general topic style. 0:25:57.817 --> 0:26:06.795 Then, of course, this is not a good strategy because you might forgot about everything up 0:26:06.795 --> 0:26:09.389 there and you don't have. 0:26:09.649 --> 0:26:14.678 So then you have to check what you can do for them. 0:26:14.678 --> 0:26:23.284 You can freeze this part and change it any more so you don't lose the ability or you can 0:26:23.284 --> 0:26:25.702 do a direct combination. 0:26:25.945 --> 0:26:31.028 Where you jointly train both of them, so you train the NMT system on the, and then you train 0:26:31.028 --> 0:26:34.909 the language model always in parallels so that you don't forget about. 0:26:35.395 --> 0:26:37.684 And what you learn of the length. 0:26:37.937 --> 0:26:46.711 Depends on what you want to combine because it's large data and you have a good general 0:26:46.711 --> 0:26:48.107 knowledge in. 0:26:48.548 --> 0:26:55.733 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:55.733 --> 0:26:57.295 something specific. 0:26:57.295 --> 0:26:58.075 Then you. 0:27:01.001 --> 0:27:06.676 Then this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:12.116 It seems to be the easiest one somehow. 0:27:12.116 --> 0:27:20.103 It's more similar to what we are doing with statistical machine translation. 0:27:21.181 --> 0:27:31.158 Normally always beats this type of model, which in some view can be like from the conceptual 0:27:31.158 --> 0:27:31.909 thing. 0:27:31.909 --> 0:27:36.844 It's even easier from the computational side. 0:27:40.560 --> 0:27:42.078 And the idea is OK. 0:27:42.078 --> 0:27:49.136 We have monolingual data that we just translate and then generate some type of parallel data 0:27:49.136 --> 0:27:50.806 and use that then to. 0:27:51.111 --> 0:28:00.017 So if you want to build a German-to-English system first, take the large amount of data 0:28:00.017 --> 0:28:02.143 you have translated. 0:28:02.402 --> 0:28:10.446 Then you have more peril data and the interesting thing is if you then train on the joint thing 0:28:10.446 --> 0:28:18.742 or on the original peril data and on what is artificial where you have generated the translations. 
0:28:18.918 --> 0:28:26.487 So you can because you are not doing the same era all the times and you have some knowledge. 0:28:28.028 --> 0:28:43.199 With this first approach, however, there is one issue why it might not work the best. 0:28:49.409 --> 0:28:51.177 Very a bit shown in the image to you. 0:28:53.113 --> 0:28:58.153 You trade on that quality data. 0:28:58.153 --> 0:29:02.563 Here is a bit of a problem. 0:29:02.563 --> 0:29:08.706 Your English style is not really good. 0:29:08.828 --> 0:29:12.213 And as you're saying, the system always mistranslates. 0:29:13.493 --> 0:29:19.798 Something then you will learn that this is correct because now it's a training game and 0:29:19.798 --> 0:29:23.022 you will encourage it to make it more often. 0:29:23.022 --> 0:29:29.614 So the problem with training on your own areas yeah you might prevent some areas you rarely 0:29:29.614 --> 0:29:29.901 do. 0:29:30.150 --> 0:29:31.749 But errors use systematically. 0:29:31.749 --> 0:29:34.225 Do you even enforce more and will even do more? 0:29:34.654 --> 0:29:40.145 So that might not be the best solution to have any idea how you could do it better. 0:29:44.404 --> 0:29:57.754 Is one way there is even a bit of more simple idea. 0:30:04.624 --> 0:30:10.975 The problem is yeah, the translations are not perfect, so the output and you're learning 0:30:10.975 --> 0:30:12.188 something wrong. 0:30:12.188 --> 0:30:17.969 Normally it's less bad if your inputs are not bad, but your outputs are perfect. 0:30:18.538 --> 0:30:24.284 So if your inputs are wrong you may learn that if you're doing this wrong input you're 0:30:24.284 --> 0:30:30.162 generating something correct, but you're not learning to generate something which is not 0:30:30.162 --> 0:30:30.756 correct. 0:30:31.511 --> 0:30:47.124 So often the case it is that it is more important than your target is correct. 0:30:47.347 --> 0:30:52.182 But you can assume in your application scenario you hope that you may only get correct inputs. 0:30:52.572 --> 0:31:02.535 So that is not harming you, and in machine translation we have one very nice advantage: 0:31:02.762 --> 0:31:04.648 And also the other way around. 0:31:04.648 --> 0:31:10.062 It's a very similar task, so there's a task to translate from German to English, but the 0:31:10.062 --> 0:31:13.894 task to translate from English to German is very similar, and. 0:31:14.094 --> 0:31:19.309 So what we can do is we can just switch it initially and generate the data the other way 0:31:19.309 --> 0:31:19.778 around. 0:31:20.120 --> 0:31:25.959 So what we are doing here is we are starting with an English to German system. 0:31:25.959 --> 0:31:32.906 Then we are translating the English data into German where the German is maybe not very nice. 0:31:33.293 --> 0:31:51.785 And then we are training on our original data and on the back translated data. 0:31:52.632 --> 0:32:02.332 So here we have the advantage that our target side is human quality and only the input. 0:32:03.583 --> 0:32:08.113 Then this helps us to get really good. 0:32:08.113 --> 0:32:15.431 There is one difference if you think about the data resources. 0:32:21.341 --> 0:32:27.336 Too obvious here we need a target site monolingual layer. 0:32:27.336 --> 0:32:31.574 In the first example we had source site. 0:32:31.931 --> 0:32:45.111 So back translation is normally working if you have target size peril later and not search 0:32:45.111 --> 0:32:48.152 side modeling later. 
0:32:48.448 --> 0:32:56.125 Might be also, like if you think about it, understand a little better to understand the 0:32:56.125 --> 0:32:56.823 target. 0:32:57.117 --> 0:33:01.469 On the source side you have to understand the content. 0:33:01.469 --> 0:33:08.749 On the target side you have to generate really sentences and somehow it's more difficult to 0:33:08.749 --> 0:33:12.231 generate something than to only understand. 0:33:17.617 --> 0:33:30.734 This works well if you have to select how many back translated data do you use. 0:33:31.051 --> 0:33:32.983 Because only there's like a lot more. 0:33:33.253 --> 0:33:42.136 Question: Should take all of my data there is two problems with it? 0:33:42.136 --> 0:33:51.281 Of course it's expensive because you have to translate all this data. 0:33:51.651 --> 0:34:00.946 So if you don't know the normal good starting point is to take equal amount of data as many 0:34:00.946 --> 0:34:02.663 back translated. 0:34:02.963 --> 0:34:04.673 It depends on the used case. 0:34:04.673 --> 0:34:08.507 If we have very few data here, it makes more sense to have more. 0:34:08.688 --> 0:34:15.224 Depends on how good your quality is here, so the better the more data you might use because 0:34:15.224 --> 0:34:16.574 quality is better. 0:34:16.574 --> 0:34:22.755 So it depends on a lot of things, but your rule of sum is like which general way often 0:34:22.755 --> 0:34:24.815 is to have equal amounts of. 0:34:26.646 --> 0:34:29.854 And you can, of course, do that now. 0:34:29.854 --> 0:34:34.449 I said already that it's better to have the quality. 0:34:34.449 --> 0:34:38.523 At the end, of course, depends on this system. 0:34:38.523 --> 0:34:46.152 Also, because the better this system is, the better your synthetic data is, the better. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.917 So you play them on English to German, and you translate the data on. 0:34:56.957 --> 0:35:03.198 Then you train a model on German to English with the additional data. 0:35:03.198 --> 0:35:09.796 Then you translate German data and then you train to gain your first one. 0:35:09.796 --> 0:35:14.343 So in the second iteration this quality is better. 0:35:14.334 --> 0:35:19.900 System is better because it's not only trained on the small data but additionally on back 0:35:19.900 --> 0:35:22.003 translated data with this system. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:28.053 However, typically you can stop quite early. 0:35:28.053 --> 0:35:35.068 Maybe one iteration is good, but then you have diminishing gains after two or three iterations. 0:35:35.935 --> 0:35:46.140 There is very slight difference because you need a quite big difference in the quality 0:35:46.140 --> 0:35:46.843 here. 0:35:47.207 --> 0:36:02.262 Language is also good because it means you can already train it with relatively bad profiles. 0:36:03.723 --> 0:36:10.339 It's a design decision would advise so guess because it's easy to get it. 0:36:10.550 --> 0:36:20.802 Replace that because you have a higher quality real data, but then I think normally it's okay 0:36:20.802 --> 0:36:22.438 to replace it. 0:36:22.438 --> 0:36:28.437 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:42.014 That's about like using monolingual data before we go into the pre-train models to have any 0:36:42.014 --> 0:36:43.005 more crash. 
0:36:49.029 --> 0:36:55.740 Yes, so the other thing which we can do and which is recently more and more successful 0:36:55.740 --> 0:37:02.451 and even more successful since we have this really large language models where you can 0:37:02.451 --> 0:37:08.545 even do the translation task with this is the way of using pre-trained models. 0:37:08.688 --> 0:37:16.135 So you learn a representation of one task, and then you use this representation from another. 0:37:16.576 --> 0:37:26.862 It was made maybe like one of the first words where it really used largely is doing something 0:37:26.862 --> 0:37:35.945 like a bird which you pre trained on purely text era and you take it in fine tune. 0:37:36.496 --> 0:37:42.953 And one big advantage, of course, is that people can only share data but also pre-trained. 0:37:43.423 --> 0:37:59.743 The recent models and the large language ones which are available. 0:37:59.919 --> 0:38:09.145 Where I think it costs several millions to train them all, just if you would buy the GPUs 0:38:09.145 --> 0:38:15.397 from some cloud company and train that the cost of training. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.598 So another idea is what you can do is okay. 0:38:24.598 --> 0:38:27.330 Maybe if these months are once available,. 0:38:27.467 --> 0:38:36.598 Can take them and use them as an also resource similar to pure text, and you can now build 0:38:36.598 --> 0:38:44.524 models which somehow learn not only from from data but also from other models. 0:38:44.844 --> 0:38:49.127 So it's a quite new way of thinking of how to train. 0:38:49.127 --> 0:38:53.894 We are not only learning from examples, but we might also. 0:38:54.534 --> 0:39:05.397 The nice thing is that this type of training where we are not learning directly from data 0:39:05.397 --> 0:39:07.087 but learning. 0:39:07.427 --> 0:39:17.647 So the main idea this go is you have a person initial task. 0:39:17.817 --> 0:39:26.369 And if you're working with anLP, that means you're training pure taxator because that's 0:39:26.369 --> 0:39:30.547 where you have the largest amount of data. 0:39:30.951 --> 0:39:35.854 And then you're defining some type of task in order to you do your creek training. 0:39:36.176 --> 0:39:43.092 And: The typical task you can train on on that is like the language waddling task. 0:39:43.092 --> 0:39:50.049 So to predict the next word or we have a related task to predict something in between, we'll 0:39:50.049 --> 0:39:52.667 see depending on the architecture. 0:39:52.932 --> 0:39:58.278 But somehow to predict something which you have not in the input is a task which is easy 0:39:58.278 --> 0:40:00.740 to generate, so you just need your data. 0:40:00.740 --> 0:40:06.086 That's why it's called self supervised, so you're creating your supervised pending data. 0:40:06.366 --> 0:40:07.646 By yourself. 0:40:07.646 --> 0:40:15.133 On the other hand, you need a lot of knowledge and that is the other thing. 0:40:15.735 --> 0:40:24.703 Because there is this idea that the meaning of a word heavily depends on the context that. 0:40:25.145 --> 0:40:36.846 So can give you a sentence with some giverish word and there's some name and although you've 0:40:36.846 --> 0:40:41.627 never heard the name you will assume. 0:40:42.062 --> 0:40:44.149 And exactly the same thing. 0:40:44.149 --> 0:40:49.143 The models can also learn something about the world by just using. 
0:40:49.649 --> 0:40:53.651 So that is typically the mule. 0:40:53.651 --> 0:40:59.848 Then we can use this model to train the system. 0:41:00.800 --> 0:41:03.368 Course we might need to adapt the system. 0:41:03.368 --> 0:41:07.648 To do that we have to change the architecture we might use only some. 0:41:07.627 --> 0:41:09.443 Part of the pre-trained model. 0:41:09.443 --> 0:41:14.773 In there we have seen that a bit already in the R&N case you can also see that we have 0:41:14.773 --> 0:41:17.175 also mentioned the pre-training already. 0:41:17.437 --> 0:41:22.783 So you can use the R&N as one of these approaches. 0:41:22.783 --> 0:41:28.712 You train the R&M language more on large pre-train data. 0:41:28.712 --> 0:41:32.309 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:53.924 So you can build a system which is knowledge, which is just trained on large amounts of data. 0:41:56.376 --> 0:42:01.564 So the question is maybe what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.277 And we want today to look at briefly at swings. 0:42:05.725 --> 0:42:08.850 That was what was initially done. 0:42:08.850 --> 0:42:17.213 It wasn't as famous as in machine translation as in other things, but it's also used there 0:42:17.213 --> 0:42:21.072 and that is to use static word embedding. 0:42:21.221 --> 0:42:28.981 So we have this mapping from the one hot to a small continuous word representation. 0:42:29.229 --> 0:42:38.276 Using this one in your NG system, so you can, for example, replace the embedding layer by 0:42:38.276 --> 0:42:38.779 the. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.517 And we're always in this pre-training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:52.411 More data than the trade off, so you can get better. 0:42:52.411 --> 0:42:59.107 The disadvantage is, does anybody have an idea of what might be the disadvantage of using 0:42:59.107 --> 0:43:00.074 things like. 0:43:04.624 --> 0:43:12.175 What was one mentioned today giving like big advantage of the system compared to previous. 0:43:20.660 --> 0:43:25.134 Where one advantage was the enter end training, so you have the enter end training so that 0:43:25.134 --> 0:43:27.937 all parameters and all components play optimal together. 0:43:28.208 --> 0:43:33.076 If you know pre-train something on one fast, it may be no longer optimal fitting to everything 0:43:33.076 --> 0:43:33.384 else. 0:43:33.893 --> 0:43:37.862 So what do pretending or not? 0:43:37.862 --> 0:43:48.180 It depends on how important everything is optimal together and how important. 0:43:48.388 --> 0:43:51.874 Is a iquality of large amount. 0:43:51.874 --> 0:44:00.532 The pre-change one is so much better that it's helpful and the advantage of. 0:44:00.600 --> 0:44:11.211 Getting everything optimal together, yes, we would use random instructions for raising. 0:44:11.691 --> 0:44:26.437 The problem is you might be already in some area where it's not easy to get. 0:44:26.766 --> 0:44:35.329 But often in some way right, so often it's not about your really worse pre trained monolepsy. 0:44:35.329 --> 0:44:43.254 If you're going already in some direction, and if this is not really optimal for you,. 0:44:43.603 --> 0:44:52.450 But if you're not really getting better because you have a decent amount of data, it's so different 0:44:52.450 --> 0:44:52.981 that. 
0:44:53.153 --> 0:44:59.505 Initially it wasn't a machine translation done so much because there are more data in 0:44:59.505 --> 0:45:06.153 MPs than in other tasks, but now with really large amounts of monolingual data we do some 0:45:06.153 --> 0:45:09.403 type of pretraining in currently all state. 0:45:12.632 --> 0:45:14.302 The other one is okay now. 0:45:14.302 --> 0:45:18.260 It's always like how much of the model do you plea track a bit? 0:45:18.658 --> 0:45:22.386 To the other one you can do contextural word embedded. 0:45:22.386 --> 0:45:28.351 That is something like bird or Roberta where you train already a sequence model and the 0:45:28.351 --> 0:45:34.654 embeddings you're using are no longer specific for word but they are also taking the context 0:45:34.654 --> 0:45:35.603 into account. 0:45:35.875 --> 0:45:50.088 The embedding you're using is no longer depending on the word itself but on the whole sentence, 0:45:50.088 --> 0:45:54.382 so you can use this context. 0:45:55.415 --> 0:46:02.691 You can use similar things also in the decoder just by having layers which don't have access 0:46:02.691 --> 0:46:12.430 to the source, but there it still might have and these are typically models like: And finally 0:46:12.430 --> 0:46:14.634 they will look at the end. 0:46:14.634 --> 0:46:19.040 You can also have models which are already sequenced. 0:46:19.419 --> 0:46:28.561 So you may be training a sequence to sequence models. 0:46:28.561 --> 0:46:35.164 You have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you'll find tuning. 0:46:47.227 --> 0:46:59.614 But let's first do a bit of step back and look into what are the different things. 0:46:59.614 --> 0:47:02.151 The first thing. 0:47:02.382 --> 0:47:11.063 The wooden bettings are just this first layer and you can train them with feedback annual 0:47:11.063 --> 0:47:12.028 networks. 0:47:12.212 --> 0:47:22.761 But you can also train them with an N language model, and by now you hopefully have also seen 0:47:22.761 --> 0:47:27.699 that you cannot transform a language model. 0:47:30.130 --> 0:47:37.875 So this is how you can train them and you're training them. 0:47:37.875 --> 0:47:45.234 For example, to speak the next word that is the easiest. 0:47:45.525 --> 0:47:55.234 And that is what is now referred to as South Supervised Learning and, for example, all the 0:47:55.234 --> 0:48:00.675 big large language models like Chad GPT and so on. 0:48:00.675 --> 0:48:03.129 They are trained with. 0:48:03.823 --> 0:48:15.812 So that is where you can hopefully learn how a word is used because you always try to previct 0:48:15.812 --> 0:48:17.725 the next word. 0:48:19.619 --> 0:48:27.281 Word embedding: Why do you keep the first look at the word embeddings and the use of 0:48:27.281 --> 0:48:29.985 word embeddings for our task? 0:48:29.985 --> 0:48:38.007 The main advantage was it might be only the first layer where you typically have most of 0:48:38.007 --> 0:48:39.449 the parameters. 0:48:39.879 --> 0:48:57.017 Most of your parameters already on the large data, then on your target data you have to 0:48:57.017 --> 0:48:59.353 train less. 0:48:59.259 --> 0:49:06.527 Big difference that your input size is so much bigger than the size of the novel in size. 0:49:06.626 --> 0:49:17.709 So it's a normally sign, maybe like, but your input and banning size is something like. 0:49:17.709 --> 0:49:20.606 Then here you have to. 
0:49:23.123 --> 0:49:30.160 While here, in a hidden layer, you only have a small fraction of that number of parameters. 0:49:30.750 --> 0:49:40.367 So here is where most of your parameters are, which means if you already pre-train the word 0:49:40.367 --> 0:49:48.915 embeddings you cover a large part of the parameters, even though the embedding layer might look like a small part of your overall NMT architecture. 0:49:57.637 --> 0:50:01.249 The thing is, we have seen these word embeddings 0:50:01.249 --> 0:50:04.295 can also be put to very good use for other purposes. 0:50:04.784 --> 0:50:08.994 You learn some general relations between words 0:50:08.994 --> 0:50:17.454 if you're doing this type of language modeling task, where you predict the next word. The one thing is you have 0:50:17.454 --> 0:50:24.084 a lot of data, so one requirement is fulfilled: we want enough data to train a model. 0:50:24.084 --> 0:50:28.734 The other thing is that the task needs to be somehow useful. 0:50:29.169 --> 0:50:43.547 If you would only predict the first letter of the word, then you wouldn't learn anything about 0:50:43.547 --> 0:50:45.144 the meaning of the word. 0:50:45.545 --> 0:50:53.683 And the interesting thing is, people have looked closely at these word embeddings. 0:50:53.954 --> 0:50:58.550 And looking at the word embeddings, 0:50:58.550 --> 0:51:09.276 you can ask yourself how they look and visualize them by doing dimension reduction. 0:51:09.489 --> 0:51:13.236 I don't know if you are attending Artificial Intelligence 0:51:13.236 --> 0:51:15.110 or Advanced Artificial Intelligence; 0:51:15.515 --> 0:51:23.217 we covered there yesterday how to do this type of dimensionality reduction. If you do this type 0:51:23.217 --> 0:51:29.635 of visualization, you see some interesting things. 0:51:30.810 --> 0:51:41.027 Now you can represent the embeddings here in a two- or three-dimensional space with some dimension reduction 0:51:41.027 --> 0:51:46.881 and see, for example, the relation between male and female words. 0:51:47.447 --> 0:51:56.625 So this vector between the male and female version of something is not always exactly the same, 0:51:56.625 --> 0:51:58.502 but it's clearly related. 0:51:58.718 --> 0:52:14.522 So you can do a bit of maths: you take king, you subtract the male vector, add the female vector, and you land near queen. 0:52:14.894 --> 0:52:17.591 So that means, okay, there is really something stored. 0:52:17.591 --> 0:52:19.689 Some information is stored in these vectors. 0:52:20.040 --> 0:52:22.492 Similarly, you can do it with verb forms. 0:52:22.492 --> 0:52:25.004 You see here swimming, swam, walking, walked. 0:52:25.265 --> 0:52:34.620 So again these vectors are not exactly the same, but they are related. 0:52:34.620 --> 0:52:42.490 So you learn something from going from here to here. 0:52:43.623 --> 0:52:49.761 Or semantically, the relation between country and capital behaves in exactly the same way. 0:52:51.191 --> 0:52:56.854 And people have even done a kind of question answering with that, based on these embeddings 0:52:56.854 --> 0:52:57.839 and this vector arithmetic. 0:52:58.218 --> 0:53:06.711 Of course, you shouldn't fully trust the dimension reduction, because maybe the projection distorts something. 0:53:06.967 --> 0:53:16.863 So you can also look at what really happens in the original space: 0:53:16.863 --> 0:53:22.247 what is the nearest neighbor of the resulting vector? 0:53:22.482 --> 0:53:29.608 So you can take the relationship between France and Paris and add it to Italy, and you get Rome. 0:53:30.010 --> 0:53:33.078 You can do big and bigger, you have small and smaller, and so on. 0:53:33.593 --> 0:53:49.417 It doesn't work everywhere, though; there is also an example here with typical German dishes.
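Here is a toy sketch of that vector arithmetic (king minus man plus woman lands near queen) on made-up three-dimensional vectors, just to show the mechanics; real experiments of course use embeddings such as word2vec trained on large corpora and search the nearest neighbor over the whole vocabulary.

import numpy as np

emb = {                                        # hypothetical tiny embeddings, for illustration only
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.6, 0.1, 0.1]),
    "woman": np.array([0.6, 0.1, 0.9]),
    "queen": np.array([0.8, 0.7, 0.9]),
}

def nearest(vec, exclude):
    cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cosine(emb[w], vec))

query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))    # prints: queen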
0:53:51.491 --> 0:54:01.677 You can do what profession a person has, of course only for famous ones: Einstein maps to scientist, 0:54:01.677 --> 0:54:06.716 and for a footballer it finds midfielder, which is not completely correct. 0:54:06.846 --> 0:54:10.134 You see the examples are a bit old; 0:54:10.134 --> 0:54:15.066 the politicians shown are no longer in office, but of course the idea stays the same. 0:54:16.957 --> 0:54:26.759 What people then did there: especially at the beginning, training an RNN language model 0:54:26.759 --> 0:54:28.937 was very expensive. 0:54:29.309 --> 0:54:38.031 So one famous observation was: we are not really interested in the language model performance itself. 0:54:38.338 --> 0:54:40.581 I think that's something good to keep in mind. 0:54:40.581 --> 0:54:42.587 What are we really interested in? 0:54:42.587 --> 0:54:45.007 Do we really want to have an RNN? No. 0:54:45.007 --> 0:54:48.607 In this case we are only interested in this type of mapping, the word embeddings. 0:54:49.169 --> 0:54:55.500 And very successful with this idea was word2vec. 0:54:55.535 --> 0:54:56.865 The idea is: okay, 0:54:56.865 --> 0:55:03.592 we are not training a real language model, but making it even simpler and doing, for example, 0:55:03.592 --> 0:55:05.513 continuous bag of words (CBOW). 0:55:05.513 --> 0:55:12.313 We just have four input tokens and we're predicting what the word in the middle is, and 0:55:12.313 --> 0:55:15.048 this is just two linear layers. 0:55:15.615 --> 0:55:21.627 So it simplifies things even further and makes the computation faster, because that is all 0:55:21.627 --> 0:55:22.871 we're interested in. 0:55:23.263 --> 0:55:32.897 Or the continuous skip-gram model; these are the two models which together are referred to as 0:55:32.897 --> 0:55:34.004 word2vec. 0:55:34.234 --> 0:55:42.394 There you have one input word and it's the other way around: you're predicting the words 0:55:42.394 --> 0:55:43.585 around it. 0:55:43.585 --> 0:55:45.327 It's very similar. 0:55:45.327 --> 0:55:48.720 The task is in the end very similar. 0:55:51.131 --> 0:56:01.407 Before we go to the next point, any questions about these static word vectors or word embeddings? 0:56:04.564 --> 0:56:07.794 The next thing is contextual 0:56:07.794 --> 0:56:12.208 word embeddings, and the idea is: the above is helpful, 0:56:12.208 --> 0:56:19.206 however, we might be able to get even more out of monolingual data. 0:56:19.419 --> 0:56:31.732 Take a word like "can": in a static embedding there is an overlap of its two meanings, so it represents both the container and 0:56:31.732 --> 0:56:33.585 being able to do something. 0:56:34.834 --> 0:56:40.410 But we might be able to already disambiguate this in the pre-trained model, because the two senses 0:56:40.410 --> 0:56:41.044 are used in different contexts. 0:56:41.701 --> 0:56:53.331 So we would like a model which can not only represent a word but can also represent the 0:56:53.331 --> 0:56:58.689 meaning of the word within its context. 0:56:59.139 --> 0:57:03.769 So then we are going to contextual word embeddings. 0:57:03.769 --> 0:57:07.713 We are really having a representation of the word in its sentence. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:23.791 The RNN hidden state represents what has been said so far, but it's focusing on the last 0:57:23.791 --> 0:57:29.303 word, so it is some representation of that word in context. 0:57:29.509 --> 0:57:43.758 One of the first works doing that is the ELMo paper. This here is 0:57:43.758 --> 0:57:48.129 the normal language model: 0:57:48.008 --> 0:57:50.714 with the third word, predicting the fourth, and so on.
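Before continuing with the language-model picture below, here is a minimal sketch of the CBOW variant of word2vec mentioned above, in PyTorch with made-up sizes: the four context words are embedded (the first linear mapping), averaged, and a second linear layer predicts the middle word; after training, only the embedding table is kept.

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # first mapping: one-hot index to embedding
        self.out = nn.Linear(emb_dim, vocab_size)      # second mapping: embedding to vocabulary scores

    def forward(self, context_ids):                    # context_ids: (batch, 4) surrounding words
        ctx = self.emb(context_ids).mean(dim=1)        # average the four context embeddings
        return self.out(ctx)                           # logits for the word in the middle

model = CBOW(vocab_size=10_000, emb_dim=128)
context = torch.randint(0, 10_000, (32, 4))            # random stand-in data
logits = model(context)                                # (32, 10000); train with cross-entropy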
0:57:50.714 --> 0:57:53.004 So you are always predicting the next word. 0:57:53.193 --> 0:57:57.335 The architecture is: you have a word embedding layer and then RNN layers, 0:57:57.335 --> 0:58:03.901 for example. And now, instead of using only the output at the end, you're using here this 0:58:03.901 --> 0:58:04.254 hidden state. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word, mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:18.610 We can train it in a language model style, always predicting the next word, but we have 0:58:18.610 --> 0:58:21.088 trained more information into the representation. 0:58:21.088 --> 0:58:26.123 Therefore, in the MT system it has to learn fewer additional things. 0:58:27.167 --> 0:58:31.261 And this is essentially what is done currently in GPT models. 0:58:31.261 --> 0:58:38.319 The only difference is that we have more layers, bigger sizes, and we're using transformer self- 0:58:38.319 --> 0:58:40.437 attention instead of the RNN. 0:58:40.437 --> 0:58:45.095 But that is how you train these large language models at the moment. 0:58:46.746 --> 0:58:55.044 However, if you look at these contextual representations, they might not be perfect. 0:58:55.044 --> 0:59:02.942 So if you think of this one as a contextual representation of the third word, 0:59:07.587 --> 0:59:16.686 it is representing word three in the context of the sentence, however only in the context of 0:59:16.686 --> 0:59:18.185 the previous words. 0:59:18.558 --> 0:59:27.413 However, we have an architecture which can also take both sides, and we have used that 0:59:27.413 --> 0:59:30.193 already in the encoder. 0:59:30.630 --> 0:59:34.264 So we could easily run the RNN also in the backward direction, 0:59:34.874 --> 0:59:42.826 by just computing the states the other way around, and then we could combine the forward and 0:59:42.826 --> 0:59:49.135 the backward states into a joint one where we are doing this type of prediction. 0:59:49.329 --> 0:59:50.858 So you have the word embedding. 0:59:51.011 --> 1:00:02.095 Then you have two hidden states, one from the forward RNN and one from the backward RNN, and 1:00:02.095 --> 1:00:10.314 then you can, for example, take the concatenation of both of them. 1:00:10.490 --> 1:00:23.257 Now this state here mainly represents this word, because it is the last input to both directions, 1:00:23.257 --> 1:00:30.573 and we know the RNN is focusing on what happened last. 1:00:31.731 --> 1:00:40.469 However, there is a bit of a problem when training that as a language model: you already 1:00:40.469 --> 1:00:41.059 have the answer. 1:00:43.203 --> 1:00:44.956 Maybe there's again this masking. 1:00:46.546 --> 1:00:47.748 That is one solution. 1:00:47.748 --> 1:00:52.995 But first of all, why can't we do it as before? The information leaks, so you cannot just predict the 1:00:52.995 --> 1:00:53.596 next word. 1:00:53.596 --> 1:00:58.132 If we just predict the next word in this type of model, that's a very simple task. 1:00:58.738 --> 1:01:09.581 You already know the next word because it's influencing this backward hidden state, and predicting something you already know is not 1:01:09.581 --> 1:01:11.081 a good task. 1:01:11.081 --> 1:01:18.455 You have to define the task carefully, because in this case what will happen is that the system will just ignore these 1:01:18.455 --> 1:01:22.966 other states, and what it will learn is to copy this information directly in here. 1:01:23.343 --> 1:01:31.218 So the state would just be representing this word, and you would have a nearly perfect model, because 1:01:31.218 --> 1:01:38.287 you only need to find an encoding where you can encode every word somehow in this hidden state.
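For reference, a minimal sketch (PyTorch, made-up hyper-parameters) of such a forward RNN language model in the ELMo spirit: the hidden state at position t is a contextual representation of word t given only the preceding words, and these states can later be reused as contextual embeddings.

import torch
import torch.nn as nn

class ForwardLM(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        states, _ = self.rnn(self.emb(token_ids))    # one contextual state per position
        return self.out(states), states              # next-word logits plus reusable representations

lm = ForwardLM()
logits, contextual = lm(torch.randint(0, 10_000, (8, 20)))
# Training shifts by one position: the prediction at position t is scored against the token at t+1.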
1:01:38.458 --> 1:01:44.050 The only thing it can learn is to encode each word in this upper hidden state and copy it over. 1:01:44.985 --> 1:01:53.779 Therefore, it's not really useful, so we need to find a somewhat different way out. 1:01:55.295 --> 1:01:57.090 There is the masking one; 1:01:57.090 --> 1:02:03.747 I'll come to that shortly. But other things have also been done, so the other 1:02:03.747 --> 1:02:06.664 option is not to directly combine the two directions. 1:02:06.664 --> 1:02:13.546 That was done in the ELMo paper: you have the forward RNN and the backward RNN and you keep them completely 1:02:13.546 --> 1:02:14.369 separated. 1:02:14.594 --> 1:02:20.458 So you never merge the states inside the model. 1:02:20.458 --> 1:02:33.749 At the end, the representation of the word is built from the forward state and the backward state. 1:02:33.873 --> 1:02:35.953 So in each direction it is the hidden state right at the word. 1:02:36.696 --> 1:02:41.286 These two you then join into the representation. 1:02:42.022 --> 1:02:48.685 And then you have a representation of the word that also covers the whole sentence, 1:02:48.685 --> 1:02:51.486 but there is no information leakage. 1:02:51.486 --> 1:02:58.149 So one way of doing this is: instead of training a bidirectional RNN jointly, you do a forward pass and a backward pass and 1:02:58.149 --> 1:02:59.815 only then join the hidden states. 1:03:00.380 --> 1:03:05.960 You can do that in all layers: 1:03:05.960 --> 1:03:16.300 you run the forward layers and the backward layers and only combine the hidden states afterwards. 1:03:16.596 --> 1:03:19.845 However, it's a bit complicated. 1:03:19.845 --> 1:03:25.230 You have to keep both directions separate and then merge things, so what else can you do? 1:03:27.968 --> 1:03:33.030 And that is the moment where 1:03:34.894 --> 1:03:39.970 the big success of the BERT model came in, with the idea: okay, 1:03:39.970 --> 1:03:47.281 maybe in the bidirectional case it's not good to do next word prediction, but we can 1:03:47.281 --> 1:03:48.314 do masking. 1:03:48.308 --> 1:03:56.019 Masking mainly means we predict something in the middle, some of the words. 1:03:56.019 --> 1:04:04.388 So the idea is: if we have the input, we put noise into the input by removing some words, 1:04:04.388 --> 1:04:07.961 and then the model has to reconstruct exactly those. 1:04:08.048 --> 1:04:15.327 Now there can be no information leakage, because the word we are predicting is no longer in the input. 1:04:16.776 --> 1:04:19.957 And we don't make any assumption about our model: 1:04:19.957 --> 1:04:26.410 it doesn't need to be a forward model or a backward model or anything. 1:04:26.410 --> 1:04:29.500 You can always predict word three from both sides. 1:04:30.530 --> 1:04:34.844 There's maybe one bit of a disadvantage. 1:04:34.844 --> 1:04:40.105 Do you see what could be a bit of a problem with this? 1:05:00.000 --> 1:05:06.429 Yes, so yeah, you can of course mask more, but to see it more globally, just first assume 1:05:06.429 --> 1:05:08.143 you only mask one word. 1:05:08.143 --> 1:05:13.930 For the whole sentence, we then get one feedback signal, like: what is word three? 1:05:13.930 --> 1:05:22.882 So we have one training example. If you do the language modeling task, we predict here, 1:05:22.882 --> 1:05:24.679 we predict here, at every position. 1:05:25.005 --> 1:05:26.735 So we have as many training signals as tokens. 1:05:26.735 --> 1:05:30.970 For each token we get feedback on what the correct prediction would have been. 1:05:31.211 --> 1:05:43.300 So in this case masking is less efficient, because we are getting fewer feedback signals on what 1:05:43.300 --> 1:05:45.797 we should predict.
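A rough sketch of the "keep the two directions separate" idea just described (PyTorch, made-up sizes): each direction is trained as its own language model, and only afterwards are the hidden states concatenated into one contextual representation, so no future information leaks into the prediction during training.

import torch
import torch.nn as nn

emb = nn.Embedding(10_000, 256)
fwd = nn.LSTM(256, 512, batch_first=True)        # reads the sentence left to right
bwd = nn.LSTM(256, 512, batch_first=True)        # reads the reversed sentence

ids = torch.randint(0, 10_000, (4, 12))
x = emb(ids)

h_fwd, _ = fwd(x)                                # state at position t has seen words 1..t
h_bwd_rev, _ = bwd(torch.flip(x, dims=[1]))      # run on the reversed sequence
h_bwd = torch.flip(h_bwd_rev, dims=[1])          # flip back: state at position t has seen words t..N

contextual = torch.cat([h_fwd, h_bwd], dim=-1)   # (4, 12, 1024) joint representation per word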
1:05:48.348 --> 1:05:56.373 So in BERT, the main ideas are that you're doing this bidirectional model with masking, 1:05:56.373 --> 1:05:59.709 and it's using the transformer architecture. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that next sentence prediction is added as another task. 1:06:16.957 --> 1:06:30.394 The idea is you want to learn more about what language is, to really understand whether two sentences follow each other or 1:06:30.394 --> 1:06:35.127 are independent of each other. 1:06:38.158 --> 1:06:42.723 The input is using subword units as we use them. 1:06:42.723 --> 1:06:50.193 It has a special token that is needed for the next sentence prediction 1:06:50.470 --> 1:07:04.075 and more generally for classification tasks, because there you learn a general representation 1:07:04.075 --> 1:07:07.203 of the full sentence. 1:07:07.607 --> 1:07:19.290 You're adding segment embeddings, so you have an embedding saying: 1:07:19.290 --> 1:07:24.323 this is the first sentence, this is the second. 1:07:24.684 --> 1:07:29.099 Now what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already had the question of how much you should mask. 1:07:35.275 --> 1:07:42.836 There has been quite some work on that afterwards, for example RoBERTa. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not learning anything. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:04.594 Should I always mask the full word, or, if a word is split into subwords, should I 1:08:04.594 --> 1:08:10.630 mask only a single subword and predict it based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the word, it might be easier to guess the last one. Here they 1:08:21.210 --> 1:08:27.594 took the easiest selection, so not considering full words at all, because the subword splitting is done 1:08:27.594 --> 1:08:32.280 in the preprocessing and masking always happens on subwords. 1:08:32.672 --> 1:08:36.089 I think in some follow-up work it is done differently: 1:08:36.089 --> 1:08:40.401 they always mask the full word, but I guess it doesn't make a huge difference. 1:08:41.001 --> 1:08:46.044 And then, what to do with the masked word: in eighty percent of the cases, 1:08:46.044 --> 1:08:50.803 if the word is masked, they replace it with a special token, 1:08:50.803 --> 1:08:57.197 the mask token; in ten percent they put in some random other token, and in ten 1:08:57.197 --> 1:08:59.470 percent they keep it unchanged. 1:09:02.202 --> 1:09:10.846 And then what you can also do is this next sentence prediction: 1:09:10.846 --> 1:09:14.880 "The man went to the store." 1:09:14.880 --> 1:09:17.761 "He bought a gallon of milk." 1:09:18.418 --> 1:09:24.088 So you see you're joining them: you're doing both masking and next sentence prediction at the same time. 1:09:24.564 --> 1:09:29.449 Or the second sentence is "Penguins are flightless birds": 1:09:29.449 --> 1:09:41.390 these two sentences have nothing to do with each other, so you can also do this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:56.572 And then the whole BERT model: here you have the input, the transformer layers, and 1:09:56.572 --> 1:09:58.164 you can train it. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general NLP applications. 1:10:17.937 --> 1:10:27.644 And there is a huge family of different types of models coming from it.
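Here is a small sketch of the masking recipe just described (plain Python; the token ids, the mask id and the 15 percent rate are assumed for illustration): a fraction of positions is selected, and of those, 80 percent become the mask token, 10 percent a random token, and 10 percent stay unchanged, while the model is asked to recover the original token at all selected positions.

import random

MASK_ID, VOCAB_SIZE = 4, 30_000                             # assumed ids, not BERT's real vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [None] * len(token_ids)   # None marks positions that are not scored
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:                     # select roughly 15% of the positions
            labels[i] = tok                                 # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                         # 80%: replace with the mask token
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)    # 10%: replace with a random token
            # remaining 10%: keep the token unchanged
    return inputs, labels

corrupted, targets = mask_tokens([101, 2158, 2253, 2000, 3573, 102])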
1:10:27.827 --> 1:10:38.709 So based on BERT, a whole family of these self-supervised models came out of there, and now 1:10:38.709 --> 1:10:42.086 this is getting even bigger 1:10:42.082 --> 1:10:46.640 with the availability and success of large language models. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it changed a bit 1:10:50.910 --> 1:10:57.847 again, from the more bidirectional models back to unidirectional models, 1:10:57.847 --> 1:11:02.710 which are at the moment maybe a bit more popular; we're coming to them now. 1:11:02.710 --> 1:11:09.168 Do you see one advantage? We have the efficiency; 1:11:09.509 --> 1:11:15.901 what is one other reason why you are sometimes more interested in unidirectional models than 1:11:15.901 --> 1:11:17.150 in bidirectional ones? 1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional model is not 1:11:30.220 --> 1:11:30.872 really usable. 1:11:32.192 --> 1:11:40.924 Exactly, it doesn't work: if you want to do generation, like in the decoder, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models; 1:12:12.761 --> 1:12:16.161 that's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 The second type is the decoder-based models, so autoregressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next word prediction. 1:12:43.403 --> 1:12:52.439 And there you can also have a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model, 1:12:54.354 --> 1:13:05.039 because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 That is what is called a prefix language model: 1:13:12.240 --> 1:13:19.076 on the first tokens, the given prefix, you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge the two, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 there the number of parameters does not differ between the directions; in an RNN 1:13:33.039 --> 1:13:34.836 we would need a separate backward RNN. 1:13:34.975 --> 1:13:38.533 In a transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that in the encoder and decoder the number of parameters is different because 1:13:44.918 --> 1:13:50.235 you do cross-attention, but if you only compare forward, backward or bidirectional self-attention, 1:13:50.650 --> 1:13:58.419 it's only that you mask your attention to look only at the past or also into 1:13:58.419 --> 1:13:59.466 the future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is the unidirectional or causal one where you can only look at the past, and the prefix one where, say, the first 1:14:23.516 --> 1:14:25.649 three words attend to each other bidirectionally.
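A small sketch of the three attention masks just discussed (PyTorch booleans, where True means the position may be attended to; the sequence and prefix lengths are arbitrary examples). The point is that the parameters are identical in all three cases and only the mask changes.

import torch

seq_len, prefix_len = 6, 3

bidirectional = torch.ones(seq_len, seq_len).bool()          # encoder / BERT style: attend to everything

causal = torch.tril(torch.ones(seq_len, seq_len)).bool()     # decoder / LM style: attend only to the past

prefix = causal.clone()
prefix[:, :prefix_len] = True    # prefix LM: the given prefix is visible everywhere, so it is bidirectional within itself

print(prefix.int())              # rows are query positions, columns are the positions they may look at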
1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, there is of course still the third class of models. 1:14:43.163 --> 1:14:50.623 So the idea is we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same sentence as input to both sides, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking or corruption here. 1:15:12.912 --> 1:15:17.591 Here on the decoder side we don't need extra masking, but here on the encoder input we need the masking so that the model doesn't already know 1:15:17.591 --> 1:15:17.910 everything. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful as well, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that was the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train. 1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in a source sentence in another language, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the sentence in twice, and to make it not a trivial task, 1:16:01.382 --> 1:16:02.432 we corrupt the input. 1:16:03.003 --> 1:16:12.777 And they use different corruption techniques, so you can, for example, also do token deletion. 1:16:13.233 --> 1:16:19.692 That you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict it anywhere. 1:16:20.970 --> 1:16:26.353 In the encoder-only case, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in the input. 1:16:30.110 --> 1:16:38.268 Here, since the decoder side is autoregressive, we can also delete tokens and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation, 1:16:45.250 --> 1:16:54.285 document rotation and text infilling, so there is quite a variety. 1:16:55.615 --> 1:17:06.568 So you see there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, as for the language model before, 1:17:14.985 --> 1:17:21.079 the other question is how you integrate the pre-trained model. 1:17:21.761 --> 1:17:26.636 And there are also quite a few different techniques. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is you take your word embeddings or your pre-trained model. 1:17:39.068 --> 1:17:47.971 You freeze them, stack your decoder layers on top, and keep only these new ones trainable. 1:17:48.748 --> 1:17:54.495 The same can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is freeze parts of it, for example the word embeddings. 1:18:05.865 --> 1:18:17.296 The other option is that you only initialize: you initialize your model with the pre-trained weights, but you train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART for translation, you want to have the source language in the encoder and the target 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 The one you typically get is trained on English, so what you can do there is adapt it.
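Before continuing with the multilingual variant below, here is a rough sketch of such corruption functions (plain Python on token lists; the masking symbol, probabilities and helper names are illustrative, not BART's exact recipe): the encoder reads the corrupted text and the autoregressive decoder is trained to regenerate the original.

import random

def token_mask(tokens, p=0.15):
    return [("[MASK]" if random.random() < p else t) for t in tokens]

def token_delete(tokens, p=0.15):
    return [t for t in tokens if random.random() >= p]      # the corrupted input may be shorter than the target

def sentence_permute(sentences):
    shuffled = sentences[:]
    random.shuffle(shuffled)                                 # reorder the sentences of a document
    return shuffled

original = "the cat sat on the mat".split()
corrupted = token_delete(token_mask(original))
# Encoder input: corrupted tokens; decoder target: the original, uncorrupted sentence.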
1:18:46.366 --> 1:18:52.562 You can fine-tune BART in order to learn some language-specific stuff, or there's multilingual BART (mBART), 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but it's trained only on the monolingual data of each language, 1:18:58.823 --> 1:19:03.388 so it may have seen German and English, but never German-English parallel data. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It has only seen attention within the same language, but it mainly has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Another thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART 1:19:29.709 --> 1:19:35.218 and you put some adapters inside the network, so small new layers 1:19:35.218 --> 1:19:40.790 which are inserted in between, and then you only train these adapters, or you at least also train 1:19:40.790 --> 1:19:41.815 these adapters. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could see that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change the rest that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that, 1:19:59.793 --> 1:20:05.225 and these adapters are quite small and so very efficient. 1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language-specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language, 1:20:22.247 --> 1:20:33.777 while the rest of the model is shared and has the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one challenge: in general, in multilingual systems, 1:20:39.058 --> 1:20:40.439 this works quite well. 1:20:40.439 --> 1:20:46.161 There's one specific use case for multilingual models where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Exactly, because these adapters might be very language-specific, and for zero-shot 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language-independent; with the 1:21:09.015 --> 1:21:10.184 adapters, however, 1:21:10.184 --> 1:21:15.601 you again get representations which are more language-specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And there is also the idea of doing knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is: okay, 1:21:42.923 --> 1:21:54.157 we are training the MT system the same as before, but what we additionally want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these hidden states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare the hidden states of your encoder with those of the pre-trained model and try to make them similar, 1:22:12.192 --> 1:22:18.144 for example by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This requires the same vocabulary. Why does it need the same vocabulary, any idea?
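Coming back briefly to the adapters described above (the vocabulary question is taken up right below), here is a minimal adapter sketch in PyTorch with illustrative dimensions: a small bottleneck with a residual connection that is inserted after a frozen pre-trained sub-layer, so only these few parameters need to be trained, for example one adapter per language in a modular multilingual setup.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)    # project down to a small dimension
        self.up = nn.Linear(bottleneck, d_model)      # and back up
        self.act = nn.ReLU()

    def forward(self, hidden):                        # hidden: (batch, seq, d_model)
        return hidden + self.up(self.act(self.down(hidden)))   # residual keeps the pre-trained signal intact

adapter = Adapter()
out = adapter(torch.randn(2, 10, 512))                # would be plugged in after a frozen transformer layer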
1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, you typically also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states on one side and four states here, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare with which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like requiring the average over the sequence to be the same, but that is 1:23:30.030 --> 1:23:33.071 a less strong signal. 1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a direct training signal on the encoder, 1:23:42.979 --> 1:23:51.455 so you can directly push the encoder towards good representations, while normally 1:23:51.455 --> 1:23:52.407 in MT the encoder only gets an indirect signal through the decoder (a small sketch of this idea follows a bit further below). 1:23:56.936 --> 1:24:13.197 Yes, I think this is most of it for today, so what you should keep in mind are two main ideas. 1:24:13.393 --> 1:24:18.400 The one is the back-translation idea, 1:24:18.400 --> 1:24:29.561 if you have monolingual data and want to use it; the other one is to use pre-trained models. And it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained models, but then you can still do back-translation, where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 There we have the advantage that we are training everything to work together on the task, so 1:24:45.502 --> 1:24:51.093 it might be helpful even to back-translate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup, because in pre-training, of course, the big challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on a different task. 1:24:58.058 --> 1:25:03.327 There are different ways of how you integrate this knowledge, 1:25:03.327 --> 1:25:08.089 even if you just use the full model. 1:25:08.748 --> 1:25:11.128 This is the most similar you can get: 1:25:11.128 --> 1:25:13.945 you're doing no changes to the architecture, 1:25:13.945 --> 1:25:19.643 you're really taking the model and just fine-tuning it on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to learn completely from scratch how to do the cross-lingual attention and so on. 1:25:24.464 --> 1:25:29.971 And for that it might, for example, be helpful to have more back-translated data to learn it. 1:25:32.192 --> 1:25:34.251 That's it for today. 1:25:34.251 --> 1:25:44.661 There's one important thing: next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail via ILIAS that there's a room change for Tuesday and 1:25:56.769 --> 1:25:57.426 which room it is. 1:25:57.637 --> 1:26:03.890 Are there more questions? Yes, I have a more general question: in computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your dataset with data augmentation. 1:26:07.347 --> 1:26:08.295 Is there anything 1:26:08.388 --> 1:26:15.301 similar to enlarge speech or text data? 1:26:15.755 --> 1:26:29.176 You can use this back-translation and also masking, but back-translation is in some way data augmentation.
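Returning to the hidden-state knowledge distillation described above, here is a minimal sketch (PyTorch; the sizes are illustrative and random tensors stand in for the real encoder outputs) of the auxiliary L2/MSE loss that pulls the NMT encoder states towards those of a frozen pre-trained encoder over the same input; the position-wise comparison is what requires equal sequence lengths and hence the same vocabulary.

import torch
import torch.nn.functional as F

batch, seq_len, d_model = 4, 12, 512                                   # illustrative sizes
nmt_states = torch.randn(batch, seq_len, d_model, requires_grad=True)  # stand-in for the NMT encoder output
with torch.no_grad():
    teacher_states = torch.randn(batch, seq_len, d_model)              # stand-in for the frozen pre-trained encoder

distill_loss = F.mse_loss(nmt_states, teacher_states)                  # compares state t with state t, so equal lengths are needed
# total_loss = translation_cross_entropy + weight * distill_loss       # the weighting factor is a tuning knob, not from the lecture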
1:26:31.371 --> 1:26:35.629 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have a good MT system, it can also be applied to parallel data. 1:26:54.834 --> 1:26:59.139 So I would say this is the most similar thing. 1:26:59.139 --> 1:27:03.143 There are also ways to do paraphrasing. 1:27:05.025 --> 1:27:12.057 But, for example, it is very hard to do this by rules, like deciding which words to replace, because 1:27:12.057 --> 1:27:18.936 you cannot always say that this word can always be replaced by that one. 1:27:19.139 --> 1:27:27.225 I mean, although there are many near-synonyms, normally they fit in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do it rule-based, you have to train a model for the paraphrasing. 1:27:38.058 --> 1:27:57.236 Does the model need the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension. 1:28:00.000 --> 1:28:01.590 As for the architecture, 1:28:01.590 --> 1:28:05.452 we will later learn, in the lecture on efficiency, 1:28:05.452 --> 1:28:12.948 that you can also do knowledge distillation with, for example, smaller models. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible, but yes, I agree it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yeah, to the other question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 Good, then thanks a lot, and we'll see each other again on Tuesday.
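Finally, as a rough illustration of the back-translation idea referred to in the summary and in the last answer above, a sketch of the data-augmentation loop. The helper names (reverse_model.translate and the commented training call) are hypothetical placeholders, not a real API.

def back_translate(target_monolingual, reverse_model):
    # reverse_model is an assumed target-to-source MT system with a translate() method
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        src_guess = reverse_model.translate(tgt_sentence)   # machine-generated, possibly noisy source side
        synthetic_pairs.append((src_guess, tgt_sentence))   # the target side stays human-quality
    return synthetic_pairs

# training_data = real_parallel + back_translate(target_monolingual, reverse_model)
# A common starting point is to mix roughly equal amounts of real and synthetic pairs.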
0:05:52.132 --> 0:05:58.461 So there is in the Internet quite a lot of data which has been company websites which 0:05:58.461 --> 0:06:01.626 have been translated and things like that. 0:06:01.626 --> 0:06:05.158 So how can you extract them parallel fragments? 0:06:06.566 --> 0:06:13.404 That is typically more noisy than where you do more at hands where mean if you have Parliament. 0:06:13.693 --> 0:06:17.680 You can do some rules how to extract parallel things. 0:06:17.680 --> 0:06:24.176 Here there is more to it, so the quality is later maybe not as good, but normally scale 0:06:24.176 --> 0:06:26.908 is then a possibility to address it. 0:06:26.908 --> 0:06:30.304 So you just have so much more data that even. 0:06:33.313 --> 0:06:40.295 The other thing can be used monolingual data and monolingual data has a big advantage that 0:06:40.295 --> 0:06:46.664 we can have a huge amount of that so that you can be autocrawed from the Internet. 0:06:46.664 --> 0:06:51.728 The nice thing is you can also get it typically for many domains. 0:06:52.352 --> 0:06:59.558 There is just so much more magnitude of monolingual data so that it might be very helpful. 0:06:59.559 --> 0:07:03.054 We can do that in statistical machine translation. 0:07:03.054 --> 0:07:06.755 It was quite easy to integrate using language models. 0:07:08.508 --> 0:07:16.912 In neural machine translation we have the advantage that we have this overall architecture 0:07:16.912 --> 0:07:22.915 that does everything together, but it has also the disadvantage. 0:07:23.283 --> 0:07:25.675 We'll look today at two things. 0:07:25.675 --> 0:07:32.925 On the one end you can still try to do a bit of language modeling in there and add an additional 0:07:32.925 --> 0:07:35.168 language model into in there. 0:07:35.168 --> 0:07:38.232 There is some work, one very successful. 0:07:38.178 --> 0:07:43.764 A way in which I think is used in most systems at the moment is to do some scientific data. 0:07:43.763 --> 0:07:53.087 Is a very easy thing, but you can just translate there and use it as training gator, and normally. 0:07:53.213 --> 0:07:59.185 And thereby you are able to use like some type of monolingual a day. 0:08:00.380 --> 0:08:05.271 Another way to do it is unsupervised and the extreme case. 0:08:05.271 --> 0:08:11.158 If you have a scenario then you only have data, only monolingual data. 0:08:11.158 --> 0:08:13.976 Can you still build translations? 0:08:14.754 --> 0:08:27.675 If you have large amounts of data and languages are not too dissimilar, you can build translation 0:08:27.675 --> 0:08:31.102 systems without parallel. 0:08:32.512 --> 0:08:36.267 That we will see you then next Thursday. 0:08:37.857 --> 0:08:50.512 And then there is now a third type of pre-trained model that recently became very successful 0:08:50.512 --> 0:08:55.411 and now with large language models. 0:08:55.715 --> 0:09:03.525 So the idea is we are no longer sharing the real data, but it can also help to train a 0:09:03.525 --> 0:09:04.153 model. 0:09:04.364 --> 0:09:11.594 And that is now a big advantage of deep learning based approaches. 0:09:11.594 --> 0:09:22.169 There you have this ability that you can train a model in some task and then apply it to another. 0:09:22.722 --> 0:09:33.405 And then, of course, the question is, can I have an initial task where there's huge amounts 0:09:33.405 --> 0:09:34.450 of data? 
0:09:34.714 --> 0:09:40.251 And the test that typically you pre train on is more like similar to a language moral 0:09:40.251 --> 0:09:45.852 task either direct to a language moral task or like a masking task which is related so 0:09:45.852 --> 0:09:51.582 the idea is oh I can train on this data and the knowledge about words how they relate to 0:09:51.582 --> 0:09:53.577 each other I can use in there. 0:09:53.753 --> 0:10:00.276 So it's a different way of using language models. 0:10:00.276 --> 0:10:06.276 There's more transfer learning at the end of. 0:10:09.029 --> 0:10:17.496 So first we will start with how can we use monolingual data to do a Yeah to do a machine 0:10:17.496 --> 0:10:18.733 translation? 0:10:20.040 --> 0:10:27.499 That: Big difference is you should remember from what I mentioned before is. 0:10:27.499 --> 0:10:32.783 In statistical machine translation we directly have the opportunity. 0:10:32.783 --> 0:10:39.676 There's peril data for the translation model and monolingual data for the language model. 0:10:39.679 --> 0:10:45.343 And you combine your translation model and language model, and then you can make use of 0:10:45.343 --> 0:10:45.730 both. 0:10:46.726 --> 0:10:53.183 That you can make use of these large large amounts of monolingual data, but of course 0:10:53.183 --> 0:10:55.510 it has also some disadvantage. 0:10:55.495 --> 0:11:01.156 Because we say the problem is we are optimizing both parts a bit independently to each other 0:11:01.156 --> 0:11:06.757 and we say oh yeah the big disadvantage of newer machine translations now we are optimizing 0:11:06.757 --> 0:11:10.531 the overall architecture everything together to perform best. 0:11:10.890 --> 0:11:16.994 And then, of course, we can't do there, so Leo we can can only do a mural like use power 0:11:16.994 --> 0:11:17.405 data. 0:11:17.897 --> 0:11:28.714 So the question is, but this advantage is not so important that we can train everything, 0:11:28.714 --> 0:11:35.276 but we have a moral legal data or even small amounts. 0:11:35.675 --> 0:11:43.102 So in data we know it's not only important the amount of data we have but also like how 0:11:43.102 --> 0:11:50.529 similar it is to your test data so it can be that this modeling data is quite small but 0:11:50.529 --> 0:11:55.339 it's very well fitting and then it's still very helpful. 0:11:55.675 --> 0:12:02.691 At the first year of surprisingness, if we are here successful with integrating a language 0:12:02.691 --> 0:12:09.631 model into a translation system, maybe we can also integrate some type of language models 0:12:09.631 --> 0:12:14.411 into our empty system in order to make it better and perform. 0:12:16.536 --> 0:12:23.298 The first thing we can do is we know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:31.096 There was our language model because these works were mainly done before transformer-based 0:12:31.096 --> 0:12:31.753 models. 0:12:32.152 --> 0:12:38.764 In general, of course, you can do the same thing with transformer baseball. 0:12:38.764 --> 0:12:50.929 There is nothing about whether: It's just that it has mainly been done before people 0:12:50.929 --> 0:13:01.875 started using R&S and they tried to do this more in cases. 0:13:07.087 --> 0:13:22.938 So what we're happening here is in some of this type of idea, and in key system you remember 0:13:22.938 --> 0:13:25.495 the attention. 0:13:25.605 --> 0:13:29.465 Gets it was your last in this day that you calculate easy attention. 
0:13:29.729 --> 0:13:36.610 We get the context back, then combine both and then base the next in state and then predict. 0:13:37.057 --> 0:13:42.424 So this is our system, and the question is, can we send our integrated language model? 0:13:42.782 --> 0:13:49.890 And somehow it makes sense to take out a neural language model because we are anyway in the 0:13:49.890 --> 0:13:50.971 neural space. 0:13:50.971 --> 0:13:58.465 It's not surprising that it contrasts to statistical work used and grants it might make sense to 0:13:58.465 --> 0:14:01.478 take a bit of a normal language model. 0:14:01.621 --> 0:14:06.437 And there would be something like on Tubbles Air, a neural language model, and our man based 0:14:06.437 --> 0:14:11.149 is you have a target word, you put it in, you get a new benchmark, and then you always put 0:14:11.149 --> 0:14:15.757 in the words and get new hidden states, and you can do some predictions at the output to 0:14:15.757 --> 0:14:16.948 predict the next word. 0:14:17.597 --> 0:14:26.977 So if we're having this type of in language model, there's like two main questions we have 0:14:26.977 --> 0:14:34.769 to answer: So how do we combine now on the one hand our system and on the other hand our 0:14:34.769 --> 0:14:35.358 model? 0:14:35.358 --> 0:14:42.004 You see that was mentioned before when we started talking about ENCODA models. 0:14:42.004 --> 0:14:45.369 They can be viewed as a language model. 0:14:45.805 --> 0:14:47.710 The wine is lengthened, unconditioned. 0:14:47.710 --> 0:14:49.518 It's just modeling the target sides. 0:14:49.970 --> 0:14:56.963 And the other one is a conditional language one, which is a language one conditioned on 0:14:56.963 --> 0:14:57.837 the Sewer. 0:14:58.238 --> 0:15:03.694 So how can you combine to language models? 0:15:03.694 --> 0:15:14.860 Of course, it's like the translation model will be more important because it has access 0:15:14.860 --> 0:15:16.763 to the source. 0:15:18.778 --> 0:15:22.571 If we have that, the other question is okay. 0:15:22.571 --> 0:15:24.257 Now we have models. 0:15:24.257 --> 0:15:25.689 How do we train? 0:15:26.026 --> 0:15:30.005 Pickers integrated them. 0:15:30.005 --> 0:15:34.781 We have now two sets of data. 0:15:34.781 --> 0:15:42.741 We have parallel data where you can do the lower. 0:15:44.644 --> 0:15:53.293 So the first idea is we can do something more like a parallel combination. 0:15:53.293 --> 0:15:55.831 We just keep running. 0:15:56.036 --> 0:15:59.864 So here you see your system that is running. 0:16:00.200 --> 0:16:09.649 It's normally completely independent of your language model, which is up there, so down 0:16:09.649 --> 0:16:13.300 here we have just our NMT system. 0:16:13.313 --> 0:16:26.470 The only thing which is used is we have the words, and of course they are put into both 0:16:26.470 --> 0:16:30.059 systems, and out there. 0:16:30.050 --> 0:16:42.221 So we use them somehow for both, and then we are doing our decision just by merging these 0:16:42.221 --> 0:16:42.897 two. 0:16:43.343 --> 0:16:53.956 So there can be, for example, we are doing a probability distribution here, and then we 0:16:53.956 --> 0:17:03.363 are taking the average of post-perability distribution to do our predictions. 0:17:11.871 --> 0:17:18.923 You could also take the output with Steve's to be more in chore about the mixture. 0:17:20.000 --> 0:17:32.896 Yes, you could also do that, so it's more like engaging mechanisms that you're not doing. 
0:17:32.993 --> 0:17:41.110 Another one would be cochtrinate the hidden states, and then you would have another layer 0:17:41.110 --> 0:17:41.831 on top. 0:17:43.303 --> 0:17:56.889 You think about if you do the conqueredination instead of taking the instead and then merging 0:17:56.889 --> 0:18:01.225 the probability distribution. 0:18:03.143 --> 0:18:16.610 Introduce many new parameters, and these parameters have somehow something special compared to 0:18:16.610 --> 0:18:17.318 the. 0:18:23.603 --> 0:18:37.651 So before all the error other parameters can be trained independent, the language model 0:18:37.651 --> 0:18:42.121 can be trained independent. 0:18:43.043 --> 0:18:51.749 If you have a joint layer, of course you need to train them because you have now inputs. 0:18:54.794 --> 0:19:02.594 Not surprisingly, if you have a parallel combination of whether you could, the other way is to do 0:19:02.594 --> 0:19:04.664 more serial combinations. 0:19:04.924 --> 0:19:10.101 How can you do a similar combination? 0:19:10.101 --> 0:19:18.274 Your final decision makes sense to do a face on the system. 0:19:18.438 --> 0:19:20.996 So you have on top of your normal and system. 0:19:21.121 --> 0:19:30.678 The only thing is now you're inputting into your system. 0:19:30.678 --> 0:19:38.726 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:45.588 So you're training your mainly what you have your lower layers here which are trained more 0:19:45.588 --> 0:19:52.183 on the purely language model style and then on top your putting into the NMT system where 0:19:52.183 --> 0:19:55.408 it now has already here the language model. 0:19:55.815 --> 0:19:58.482 So here you can also view it. 0:19:58.482 --> 0:20:06.481 Here you have more contextual embeddings which no longer depend only on the word but they 0:20:06.481 --> 0:20:10.659 also depend on the context of the target site. 0:20:11.051 --> 0:20:19.941 But you have more understanding of the source word, so you have a language in the current 0:20:19.941 --> 0:20:21.620 target sentence. 0:20:21.881 --> 0:20:27.657 So if it's like the word can, for example, will be put in here always the same independent 0:20:27.657 --> 0:20:31.147 of its user can of beans, or if it's like I can do it. 0:20:31.147 --> 0:20:37.049 However, because you are having your language model style, you have maybe disintegrated this 0:20:37.049 --> 0:20:40.984 already a bit, and you give this information directly to the. 0:20:41.701 --> 0:20:43.095 An empty cyst. 0:20:44.364 --> 0:20:49.850 You, if you're remembering more the transformer based approach, you have some layers. 0:20:49.850 --> 0:20:55.783 The lower layers are purely languaged while the other ones are with attention to the source. 0:20:55.783 --> 0:21:01.525 So you can view it also that you just have lower layers which don't attend to the source. 0:21:02.202 --> 0:21:07.227 This is purely a language model, and then at some point you're starting to attend to 0:21:07.227 --> 0:21:08.587 the source and use it. 0:21:13.493 --> 0:21:20.781 Yes, so this is how you combine them in peril or first do the language model and then do. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.102 Model, and in this case in the sequence. 0:21:38.278 --> 0:21:53.199 Case so the actual word that we transferred into a numerical lecture, and this is an input 0:21:53.199 --> 0:21:54.838 into the. 
0:21:56.176 --> 0:22:03.568 That depends on if you view the word embedding as part of the language model. 0:22:03.568 --> 0:22:10.865 So if you first put the word target word then you do the one hot end coding. 0:22:11.691 --> 0:22:13.805 And then the word embedding there is the r& 0:22:13.805 --> 0:22:13.937 n. 0:22:14.314 --> 0:22:21.035 So you can use this together as your language model when you first do the word embedding. 0:22:21.401 --> 0:22:24.346 All you can say is like before. 0:22:24.346 --> 0:22:28.212 It's more a definition, but you're right. 0:22:28.212 --> 0:22:30.513 So what's the steps out? 0:22:30.513 --> 0:22:36.128 You take the word, the one hut encoding, the word embedding. 0:22:36.516 --> 0:22:46.214 What one of these parrots, you know, called a language model is definition wise and not 0:22:46.214 --> 0:22:47.978 that important. 0:22:53.933 --> 0:23:02.264 So the question is how can you then train them and make this this one work? 0:23:02.264 --> 0:23:02.812 The. 0:23:03.363 --> 0:23:15.201 So in the case where you combine the language one of the abilities you can train them independently 0:23:15.201 --> 0:23:18.516 and just put them together. 0:23:18.918 --> 0:23:27.368 Might not be the best because we have no longer the stability that we had before that optimally 0:23:27.368 --> 0:23:29.128 performed together. 0:23:29.128 --> 0:23:33.881 It's not clear if they really work the best together. 0:23:34.514 --> 0:23:41.585 At least you need to somehow find how much do you trust the one model and how much. 0:23:43.323 --> 0:23:45.058 Still in some cases useful. 0:23:45.058 --> 0:23:48.530 It might be helpful if you have only data and software. 0:23:48.928 --> 0:23:59.064 However, in MT we have one specific situation that at least for the MT part parallel is also 0:23:59.064 --> 0:24:07.456 always monolingual data, so what we definitely can do is train the language. 0:24:08.588 --> 0:24:18.886 So what we also can do is more like the pre-training approach. 0:24:18.886 --> 0:24:24.607 We first train the language model. 0:24:24.704 --> 0:24:27.334 The pre-training approach. 0:24:27.334 --> 0:24:33.470 You first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:41.143 Of course, the model size is this way, but the data size is too bigly the other way around. 0:24:41.143 --> 0:24:47.883 You often have a lot more monolingual data than you have here parallel data, in which 0:24:47.883 --> 0:24:52.350 scenario can you imagine where this type of pretraining? 0:24:56.536 --> 0:24:57.901 Any Ideas. 0:25:04.064 --> 0:25:12.772 One example where this might also be helpful if you want to adapt to domains. 0:25:12.772 --> 0:25:22.373 So let's say you do medical sentences and if you want to translate medical sentences. 0:25:23.083 --> 0:25:26.706 In this case it could be or its most probable happen. 0:25:26.706 --> 0:25:32.679 You're learning here up there what medical means, but in your fine tuning step the model 0:25:32.679 --> 0:25:38.785 is forgotten everything about Medicare, so you may be losing all the information you gain. 0:25:39.099 --> 0:25:42.366 So this type of priest training step is good. 0:25:42.366 --> 0:25:47.978 If your pretraining data is more general, very large and then you're adapting. 0:25:48.428 --> 0:25:56.012 But in the task with moral lingual data, which should be used to adapt the system to some 0:25:56.012 --> 0:25:57.781 general topic style. 
0:25:57.817 --> 0:26:06.795 Then, of course, this is not a good strategy because you might forgot about everything up 0:26:06.795 --> 0:26:09.389 there and you don't have. 0:26:09.649 --> 0:26:14.678 So then you have to check what you can do for them. 0:26:14.678 --> 0:26:23.284 You can freeze this part and change it any more so you don't lose the ability or you can 0:26:23.284 --> 0:26:25.702 do a direct combination. 0:26:25.945 --> 0:26:31.028 Where you jointly train both of them, so you train the NMT system on the, and then you train 0:26:31.028 --> 0:26:34.909 the language model always in parallels so that you don't forget about. 0:26:35.395 --> 0:26:37.684 And what you learn of the length. 0:26:37.937 --> 0:26:46.711 Depends on what you want to combine because it's large data and you have a good general 0:26:46.711 --> 0:26:48.107 knowledge in. 0:26:48.548 --> 0:26:55.733 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:55.733 --> 0:26:57.295 something specific. 0:26:57.295 --> 0:26:58.075 Then you. 0:27:01.001 --> 0:27:06.676 Then this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:12.116 It seems to be the easiest one somehow. 0:27:12.116 --> 0:27:20.103 It's more similar to what we are doing with statistical machine translation. 0:27:21.181 --> 0:27:31.158 Normally always beats this type of model, which in some view can be like from the conceptual 0:27:31.158 --> 0:27:31.909 thing. 0:27:31.909 --> 0:27:36.844 It's even easier from the computational side. 0:27:40.560 --> 0:27:42.078 And the idea is OK. 0:27:42.078 --> 0:27:49.136 We have monolingual data that we just translate and then generate some type of parallel data 0:27:49.136 --> 0:27:50.806 and use that then to. 0:27:51.111 --> 0:28:00.017 So if you want to build a German-to-English system first, take the large amount of data 0:28:00.017 --> 0:28:02.143 you have translated. 0:28:02.402 --> 0:28:10.446 Then you have more peril data and the interesting thing is if you then train on the joint thing 0:28:10.446 --> 0:28:18.742 or on the original peril data and on what is artificial where you have generated the translations. 0:28:18.918 --> 0:28:26.487 So you can because you are not doing the same era all the times and you have some knowledge. 0:28:28.028 --> 0:28:43.199 With this first approach, however, there is one issue why it might not work the best. 0:28:49.409 --> 0:28:51.177 Very a bit shown in the image to you. 0:28:53.113 --> 0:28:58.153 You trade on that quality data. 0:28:58.153 --> 0:29:02.563 Here is a bit of a problem. 0:29:02.563 --> 0:29:08.706 Your English style is not really good. 0:29:08.828 --> 0:29:12.213 And as you're saying, the system always mistranslates. 0:29:13.493 --> 0:29:19.798 Something then you will learn that this is correct because now it's a training game and 0:29:19.798 --> 0:29:23.022 you will encourage it to make it more often. 0:29:23.022 --> 0:29:29.614 So the problem with training on your own areas yeah you might prevent some areas you rarely 0:29:29.614 --> 0:29:29.901 do. 0:29:30.150 --> 0:29:31.749 But errors use systematically. 0:29:31.749 --> 0:29:34.225 Do you even enforce more and will even do more? 0:29:34.654 --> 0:29:40.145 So that might not be the best solution to have any idea how you could do it better. 0:29:44.404 --> 0:29:57.754 Is one way there is even a bit of more simple idea. 
0:30:04.624 --> 0:30:10.975 The problem is, yeah, the translations are not perfect, so the outputs are noisy and you're learning 0:30:10.975 --> 0:30:12.188 something wrong. 0:30:12.188 --> 0:30:17.969 Normally it's less bad if your inputs are not perfect but your outputs are perfect. 0:30:18.538 --> 0:30:24.284 So if your inputs are wrong, you may learn that from this wrong input you're 0:30:24.284 --> 0:30:30.162 generating something correct, but you're not learning to generate something which is not 0:30:30.162 --> 0:30:30.756 correct. 0:30:31.511 --> 0:30:47.124 So it is often the case that it is more important that your target side is correct. 0:30:47.347 --> 0:30:52.182 And you can assume in your application scenario that you hopefully only get correct inputs. 0:30:52.572 --> 0:31:02.535 So that is not harming you, and in machine translation we have one very nice advantage: 0:31:02.762 --> 0:31:04.648 the task also exists the other way around. 0:31:04.648 --> 0:31:10.062 It's a very similar task: there is the task to translate from German to English, and the 0:31:10.062 --> 0:31:13.894 task to translate from English to German is very similar. 0:31:14.094 --> 0:31:19.309 So what we can do is just switch it initially and generate the data the other way 0:31:19.309 --> 0:31:19.778 around. 0:31:20.120 --> 0:31:25.959 So what we are doing here is we are starting with an English-to-German system. 0:31:25.959 --> 0:31:32.906 Then we are translating the English monolingual data into German, where the German is maybe not very nice. 0:31:33.293 --> 0:31:51.785 And then we are training on our original data and on the back-translated data. 0:31:52.632 --> 0:32:02.332 So here we have the advantage that our target side is human quality and only the input is synthetic. 0:32:03.583 --> 0:32:08.113 Then this helps us to get a really good system. 0:32:08.113 --> 0:32:15.431 There is one difference if you think about the data resources. 0:32:21.341 --> 0:32:27.336 The obvious one: here we need target-side monolingual data. 0:32:27.336 --> 0:32:31.574 In the first example we had source-side monolingual data. 0:32:31.931 --> 0:32:45.111 So back-translation normally works if you have target-side monolingual data, and not source- 0:32:45.111 --> 0:32:48.152 side monolingual data. 0:32:48.448 --> 0:32:56.125 It might also, if you think about it, be intuitive: it is a bit more important to model the 0:32:56.125 --> 0:32:56.823 target well. 0:32:57.117 --> 0:33:01.469 On the source side you have to understand the content. 0:33:01.469 --> 0:33:08.749 On the target side you have to generate real sentences, and somehow it's more difficult to 0:33:08.749 --> 0:33:12.231 generate something than to only understand it. 0:33:17.617 --> 0:33:30.734 This works quite well; you then have to select how much back-translated data you use. 0:33:31.051 --> 0:33:32.983 Because normally there is a lot more monolingual data. 0:33:33.253 --> 0:33:42.136 So the question is: should I take all of my monolingual data? There are two problems with it. 0:33:42.136 --> 0:33:51.281 Of course it's expensive, because you have to translate all this data. 0:33:51.651 --> 0:34:00.946 So if you don't know, a normally good starting point is to take an equal amount of 0:34:00.946 --> 0:34:02.663 back-translated data as parallel data. 0:34:02.963 --> 0:34:04.673 It depends on the use case. 0:34:04.673 --> 0:34:08.507 If we have very little parallel data, it makes more sense to have more back-translated data. 0:34:08.688 --> 0:34:15.224 It also depends on how good your reverse system is: the better it is, the more data you might use, because 0:34:15.224 --> 0:34:16.574 the quality of the synthetic data is better.
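As a concrete sketch of the back-translation setup just described (the `.translate()` method and the helper names are assumptions for illustration): you translate target-side monolingual data with the reverse system and keep the human text on the target side.

```python
def back_translate(reverse_mt, target_mono_sents):
    """Create synthetic parallel data for a German->English system.

    reverse_mt       : an already trained English->German system
                       (any object with a .translate(sentence) method;
                       the interface is made up for this sketch)
    target_mono_sents: monolingual sentences in the *target* language
                       (English), which stay human-quality.
    Returns (source, target) pairs whose source side is synthetic."""
    synthetic_pairs = []
    for en_sentence in target_mono_sents:
        de_synthetic = reverse_mt.translate(en_sentence)  # may be noisy
        synthetic_pairs.append((de_synthetic, en_sentence))
    return synthetic_pairs

# Mix real and synthetic data, e.g. roughly 1:1 as suggested in the lecture:
# train_data = real_parallel + back_translate(en2de_model,
#                                             english_mono[:len(real_parallel)])
```

Iterating the procedure, i.e. retraining the reverse system on the enlarged data and translating again, gives the iterated back-translation described below; how much synthetic data to mix in remains the ratio question.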
0:34:16.574 --> 0:34:22.755 So it depends on a lot of things, but your rule of sum is like which general way often 0:34:22.755 --> 0:34:24.815 is to have equal amounts of. 0:34:26.646 --> 0:34:29.854 And you can, of course, do that now. 0:34:29.854 --> 0:34:34.449 I said already that it's better to have the quality. 0:34:34.449 --> 0:34:38.523 At the end, of course, depends on this system. 0:34:38.523 --> 0:34:46.152 Also, because the better this system is, the better your synthetic data is, the better. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.917 So you play them on English to German, and you translate the data on. 0:34:56.957 --> 0:35:03.198 Then you train a model on German to English with the additional data. 0:35:03.198 --> 0:35:09.796 Then you translate German data and then you train to gain your first one. 0:35:09.796 --> 0:35:14.343 So in the second iteration this quality is better. 0:35:14.334 --> 0:35:19.900 System is better because it's not only trained on the small data but additionally on back 0:35:19.900 --> 0:35:22.003 translated data with this system. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:28.053 However, typically you can stop quite early. 0:35:28.053 --> 0:35:35.068 Maybe one iteration is good, but then you have diminishing gains after two or three iterations. 0:35:35.935 --> 0:35:46.140 There is very slight difference because you need a quite big difference in the quality 0:35:46.140 --> 0:35:46.843 here. 0:35:47.207 --> 0:36:02.262 Language is also good because it means you can already train it with relatively bad profiles. 0:36:03.723 --> 0:36:10.339 It's a design decision would advise so guess because it's easy to get it. 0:36:10.550 --> 0:36:20.802 Replace that because you have a higher quality real data, but then I think normally it's okay 0:36:20.802 --> 0:36:22.438 to replace it. 0:36:22.438 --> 0:36:28.437 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:42.014 That's about like using monolingual data before we go into the pre-train models to have any 0:36:42.014 --> 0:36:43.005 more crash. 0:36:49.029 --> 0:36:55.740 Yes, so the other thing which we can do and which is recently more and more successful 0:36:55.740 --> 0:37:02.451 and even more successful since we have this really large language models where you can 0:37:02.451 --> 0:37:08.545 even do the translation task with this is the way of using pre-trained models. 0:37:08.688 --> 0:37:16.135 So you learn a representation of one task, and then you use this representation from another. 0:37:16.576 --> 0:37:26.862 It was made maybe like one of the first words where it really used largely is doing something 0:37:26.862 --> 0:37:35.945 like a bird which you pre trained on purely text era and you take it in fine tune. 0:37:36.496 --> 0:37:42.953 And one big advantage, of course, is that people can only share data but also pre-trained. 0:37:43.423 --> 0:37:59.743 The recent models and the large language ones which are available. 0:37:59.919 --> 0:38:09.145 Where I think it costs several millions to train them all, just if you would buy the GPUs 0:38:09.145 --> 0:38:15.397 from some cloud company and train that the cost of training. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.598 So another idea is what you can do is okay. 
0:38:24.598 --> 0:38:27.330 Maybe if these months are once available,. 0:38:27.467 --> 0:38:36.598 Can take them and use them as an also resource similar to pure text, and you can now build 0:38:36.598 --> 0:38:44.524 models which somehow learn not only from from data but also from other models. 0:38:44.844 --> 0:38:49.127 So it's a quite new way of thinking of how to train. 0:38:49.127 --> 0:38:53.894 We are not only learning from examples, but we might also. 0:38:54.534 --> 0:39:05.397 The nice thing is that this type of training where we are not learning directly from data 0:39:05.397 --> 0:39:07.087 but learning. 0:39:07.427 --> 0:39:17.647 So the main idea this go is you have a person initial task. 0:39:17.817 --> 0:39:26.369 And if you're working with anLP, that means you're training pure taxator because that's 0:39:26.369 --> 0:39:30.547 where you have the largest amount of data. 0:39:30.951 --> 0:39:35.857 And then you're defining some type of task in order to do your creek training. 0:39:36.176 --> 0:39:43.092 And: The typical task you can train on on that is like the language waddling task. 0:39:43.092 --> 0:39:50.049 So to predict the next word or we have a related task to predict something in between, we'll 0:39:50.049 --> 0:39:52.667 see depending on the architecture. 0:39:52.932 --> 0:39:58.278 But somehow to predict something which you have not in the input is a task which is easy 0:39:58.278 --> 0:40:00.740 to generate, so you just need your data. 0:40:00.740 --> 0:40:06.086 That's why it's called self supervised, so you're creating your supervised pending data. 0:40:06.366 --> 0:40:07.646 By yourself. 0:40:07.646 --> 0:40:15.133 On the other hand, you need a lot of knowledge and that is the other thing. 0:40:15.735 --> 0:40:24.703 Because there is this idea that the meaning of a word heavily depends on the context that. 0:40:25.145 --> 0:40:36.846 So can give you a sentence with some giverish word and there's some name and although you've 0:40:36.846 --> 0:40:41.627 never heard the name you will assume. 0:40:42.062 --> 0:40:44.149 And exactly the same thing. 0:40:44.149 --> 0:40:49.143 The models can also learn something about the world by just using. 0:40:49.649 --> 0:40:53.651 So that is typically the mule. 0:40:53.651 --> 0:40:59.848 Then we can use this model to train the system. 0:41:00.800 --> 0:41:03.368 Course we might need to adapt the system. 0:41:03.368 --> 0:41:07.648 To do that we have to change the architecture we might use only some. 0:41:07.627 --> 0:41:09.443 Part of the pre-trained model. 0:41:09.443 --> 0:41:14.773 In there we have seen that a bit already in the R&N case you can also see that we have 0:41:14.773 --> 0:41:17.175 also mentioned the pre-training already. 0:41:17.437 --> 0:41:22.783 So you can use the R&N as one of these approaches. 0:41:22.783 --> 0:41:28.712 You train the R&M language more on large pre-train data. 0:41:28.712 --> 0:41:32.309 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:53.924 So you can build a system which is knowledge, which is just trained on large amounts of data. 0:41:56.376 --> 0:42:01.564 So the question is maybe what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.277 And we want today to look at briefly at swings. 0:42:05.725 --> 0:42:08.704 First, that was what was initially done. 
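The self-supervised pre-training task described here, predicting the next word from plain text, can be written down as a small recurrent language model. This is an illustrative PyTorch sketch with assumed sizes, not the exact setup from the lecture.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal next-word-prediction model: the training signal is
    created from plain text alone, which is why it is self-supervised."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len) ids
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)             # (batch, seq_len, vocab)

def lm_loss(model, tokens):
    """Predict token t+1 from the tokens up to t; no labels needed."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

The first and simplest thing to reuse from such a pre-trained model is just its first layer, the word embeddings.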
0:42:08.704 --> 0:42:15.314 It wasn't as famous as in machine translation as in other things, but it's also used there 0:42:15.314 --> 0:42:21.053 and that is to use static word embedding, so just the first step we know here. 0:42:21.221 --> 0:42:28.981 So we have this mapping from the one hot to a small continuous word representation. 0:42:29.229 --> 0:42:38.276 Using this one in your NG system, so you can, for example, replace the embedding layer by 0:42:38.276 --> 0:42:38.779 the. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.517 And we're always in this pre-training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:52.411 More data than the trade off, so you can get better. 0:42:52.411 --> 0:42:59.107 The disadvantage is, does anybody have an idea of what might be the disadvantage of using 0:42:59.107 --> 0:43:00.074 things like. 0:43:04.624 --> 0:43:12.175 What was one mentioned today giving like big advantage of the system compared to previous. 0:43:20.660 --> 0:43:25.134 Where one advantage was the enter end training, so you have the enter end training so that 0:43:25.134 --> 0:43:27.937 all parameters and all components play optimal together. 0:43:28.208 --> 0:43:33.076 If you know pre-train something on one fast, it may be no longer optimal fitting to everything 0:43:33.076 --> 0:43:33.384 else. 0:43:33.893 --> 0:43:37.862 So what do pretending or not? 0:43:37.862 --> 0:43:48.180 It depends on how important everything is optimal together and how important. 0:43:48.388 --> 0:43:50.454 Of large amount. 0:43:50.454 --> 0:44:00.541 The pre-change one is so much better that it's helpful, and the advantage of that. 0:44:00.600 --> 0:44:11.211 Getting everything optimal together, yes, we would use random instructions for raising. 0:44:11.691 --> 0:44:26.437 The problem is you might be already in some area where it's not easy to get. 0:44:26.766 --> 0:44:35.329 But often in some way right, so often it's not about your really worse pre trained monolepsy. 0:44:35.329 --> 0:44:43.254 If you're going already in some direction, and if this is not really optimal for you,. 0:44:43.603 --> 0:44:52.450 But if you're not really getting better because you have a decent amount of data, it's so different 0:44:52.450 --> 0:44:52.981 that. 0:44:53.153 --> 0:44:59.505 Initially it wasn't a machine translation done so much because there are more data in 0:44:59.505 --> 0:45:06.153 MPs than in other tasks, but now with really large amounts of monolingual data we do some 0:45:06.153 --> 0:45:09.403 type of pretraining in currently all state. 0:45:12.632 --> 0:45:14.302 The other one is okay now. 0:45:14.302 --> 0:45:18.260 It's always like how much of the model do you plea track a bit? 0:45:18.658 --> 0:45:22.386 To the other one you can do contextural word embedded. 0:45:22.386 --> 0:45:28.351 That is something like bird or Roberta where you train already a sequence model and the 0:45:28.351 --> 0:45:34.654 embeddings you're using are no longer specific for word but they are also taking the context 0:45:34.654 --> 0:45:35.603 into account. 0:45:35.875 --> 0:45:50.088 The embedding you're using is no longer depending on the word itself but on the whole sentence, 0:45:50.088 --> 0:45:54.382 so you can use this context. 
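Staying with the static case for a moment: plugging pre-trained word embeddings into the translation model typically just means initialising, and optionally freezing, the first layer. A minimal sketch, assuming the pre-trained matrix uses the same vocabulary or subword inventory as the MT system:

```python
import torch
import torch.nn as nn

def embedding_from_pretrained(weights: torch.Tensor,
                              freeze: bool = False) -> nn.Embedding:
    """Initialise the first layer of the NMT model with static word
    embeddings trained on large monolingual data.

    weights: (vocab_size, emb_dim) matrix, e.g. exported from a word2vec
             run on the monolingual corpus (assumed to match the MT vocabulary).
    freeze : True keeps the embeddings fixed, False lets fine-tuning
             adjust them together with the rest of the network."""
    return nn.Embedding.from_pretrained(weights, freeze=freeze)

# Hypothetical usage:
# model.encoder.embed = embedding_from_pretrained(w2v_matrix, freeze=False)
```

For the contextual case just introduced, the pre-trained representation depends on the whole sentence rather than on the word alone.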
0:45:55.415 --> 0:46:02.691 You can use similar things also in the decoder just by having layers which don't have access 0:46:02.691 --> 0:46:12.430 to the source, but there it still might have and these are typically models like: And finally 0:46:12.430 --> 0:46:14.634 they will look at the end. 0:46:14.634 --> 0:46:19.040 You can also have models which are already sequenced. 0:46:19.419 --> 0:46:28.561 So you may be training a sequence to sequence models. 0:46:28.561 --> 0:46:35.164 You have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you'll find tuning. 0:46:47.227 --> 0:46:59.614 But let's first do a bit of step back and look into what are the different things. 0:46:59.614 --> 0:47:02.151 The first thing. 0:47:02.382 --> 0:47:11.063 The wooden bettings are just this first layer and you can train them with feedback annual 0:47:11.063 --> 0:47:12.028 networks. 0:47:12.212 --> 0:47:22.761 But you can also train them with an N language model, and by now you hopefully have also seen 0:47:22.761 --> 0:47:27.699 that you cannot transform a language model. 0:47:30.130 --> 0:47:37.875 So this is how you can train them and you're training them. 0:47:37.875 --> 0:47:45.234 For example, to speak the next word that is the easiest. 0:47:45.525 --> 0:47:55.234 And that is what is now referred to as South Supervised Learning and, for example, all the 0:47:55.234 --> 0:48:00.675 big large language models like Chad GPT and so on. 0:48:00.675 --> 0:48:03.129 They are trained with. 0:48:03.823 --> 0:48:15.812 So that is where you can hopefully learn how a word is used because you always try to previct 0:48:15.812 --> 0:48:17.725 the next word. 0:48:19.619 --> 0:48:27.281 Word embedding: Why do you keep the first look at the word embeddings and the use of 0:48:27.281 --> 0:48:29.985 word embeddings for our task? 0:48:29.985 --> 0:48:38.007 The main advantage was it might be only the first layer where you typically have most of 0:48:38.007 --> 0:48:39.449 the parameters. 0:48:39.879 --> 0:48:57.017 Most of your parameters already on the large data, then on your target data you have to 0:48:57.017 --> 0:48:59.353 train less. 0:48:59.259 --> 0:49:06.527 Big difference that your input size is so much bigger than the size of the novel in size. 0:49:06.626 --> 0:49:17.709 So it's a normally sign, maybe like, but your input and banning size is something like. 0:49:17.709 --> 0:49:20.606 Then here you have to. 0:49:23.123 --> 0:49:30.160 While here you see it's only like zero point five times as much in the layer. 0:49:30.750 --> 0:49:36.534 So here is where most of your parameters are, which means if you already replace the word 0:49:36.534 --> 0:49:41.739 embeddings, they might look a bit small in your overall and in key architecture. 0:49:41.739 --> 0:49:47.395 It's where most of the things are, and if you're doing that you already have really big 0:49:47.395 --> 0:49:48.873 games and can do that. 0:49:57.637 --> 0:50:01.249 The thing is we have seen these were the bettings. 0:50:01.249 --> 0:50:04.295 They can be very good use for other types. 0:50:04.784 --> 0:50:08.994 You learn some general relations between words. 0:50:08.994 --> 0:50:17.454 If you're doing this type of language modeling cast, you predict: The one thing is you have 0:50:17.454 --> 0:50:24.084 a lot of data, so the one question is we want to have data to trade a model. 0:50:24.084 --> 0:50:28.734 The other thing, the tasks need to be somehow useful. 
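To make the earlier point about parameter counts concrete, here is a small back-of-the-envelope calculation with assumed sizes (the exact figures mentioned in the lecture differ): the embedding matrix scales with the vocabulary size and therefore dominates a single hidden layer by a large factor.

```python
# Illustrative parameter count with assumed sizes.
vocab_size, emb_dim, hidden_dim = 50_000, 512, 512

embedding_params = vocab_size * emb_dim          # 25,600,000
hidden_layer_params = hidden_dim * hidden_dim    # 262,144 (weights only)

print(embedding_params / hidden_layer_params)    # roughly two orders of magnitude more
```

Besides the parameter argument, the pre-training task itself has to be informative, as the next example shows.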
0:50:29.169 --> 0:50:43.547 If you would predict the first letter of the word, then you wouldn't learn anything about 0:50:43.547 --> 0:50:45.144 the word. 0:50:45.545 --> 0:50:53.683 And the interesting thing is people have looked at these wood embeddings. 0:50:53.954 --> 0:50:58.550 And looking at the word embeddings. 0:50:58.550 --> 0:51:09.276 You can ask yourself how they look and visualize them by doing dimension reduction. 0:51:09.489 --> 0:51:13.236 Don't know if you and you are listening to artificial intelligence. 0:51:13.236 --> 0:51:15.110 Advanced artificial intelligence. 0:51:15.515 --> 0:51:23.217 We had on yesterday there how to do this type of representation, but you can do this time 0:51:23.217 --> 0:51:29.635 of representation, and now you're seeing interesting things that normally. 0:51:30.810 --> 0:51:41.027 Now you can represent a here in a three dimensional space with some dimension reduction. 0:51:41.027 --> 0:51:46.881 For example, the relation between male and female. 0:51:47.447 --> 0:51:56.625 So this vector between the male and female version of something is always not the same, 0:51:56.625 --> 0:51:58.502 but it's related. 0:51:58.718 --> 0:52:14.522 So you can do a bit of maths, so you do take king, you subtract this vector, add this vector. 0:52:14.894 --> 0:52:17.591 So that means okay, there is really something stored. 0:52:17.591 --> 0:52:19.689 Some information are stored in that book. 0:52:20.040 --> 0:52:22.621 Similar, you can do it with Bob Hansen. 0:52:22.621 --> 0:52:25.009 See here swimming slam walking walk. 0:52:25.265 --> 0:52:34.620 So again these vectors are not the same, but they are related. 0:52:34.620 --> 0:52:42.490 So you learn something from going from here to here. 0:52:43.623 --> 0:52:49.761 Or semantically, the relations between city and capital have exactly the same sense. 0:52:51.191 --> 0:52:56.854 And people had even done that question answering about that if they showed the diembeddings 0:52:56.854 --> 0:52:57.839 and the end of. 0:52:58.218 --> 0:53:06.711 All you can also do is don't trust the dimensions of the reaction because maybe there is something. 0:53:06.967 --> 0:53:16.863 You can also look into what happens really in the individual space. 0:53:16.863 --> 0:53:22.247 What is the nearest neighbor of the. 0:53:22.482 --> 0:53:29.608 So you can take the relationship between France and Paris and add it to Italy and you'll. 0:53:30.010 --> 0:53:33.078 You can do big and bigger and you have small and smaller and stuff. 0:53:33.593 --> 0:53:49.417 Because it doesn't work everywhere, there is also some typical dish here in German. 0:53:51.491 --> 0:54:01.677 You can do what the person is doing for famous ones, of course only like Einstein scientists 0:54:01.677 --> 0:54:06.716 that find midfielders not completely correct. 0:54:06.846 --> 0:54:10.134 You see the examples are a bit old. 0:54:10.134 --> 0:54:15.066 The politicians are no longer they am, but of course. 0:54:16.957 --> 0:54:26.759 What people have done there, especially at the beginning training our end language model, 0:54:26.759 --> 0:54:28.937 was very expensive. 0:54:29.309 --> 0:54:38.031 So one famous model was, but we are not really interested in the language model performance. 0:54:38.338 --> 0:54:40.581 Think something good to keep in mind. 0:54:40.581 --> 0:54:42.587 What are we really interested in? 0:54:42.587 --> 0:54:45.007 Do we really want to have an R&N no? 0:54:45.007 --> 0:54:48.607 In this case we are only interested in this type of mapping. 
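The vector-arithmetic analogies described earlier in this passage can be reproduced with a few lines over any set of word vectors. This is a toy sketch (the `embeddings` dictionary of numpy vectors is an assumption); real toolkits do the same thing with an efficient nearest-neighbour search over the whole vocabulary.

```python
import numpy as np

def analogy(embeddings: dict, a: str, b: str, c: str) -> str:
    """Solve 'a is to b as c is to ?' by vector arithmetic, e.g.
    analogy(E, 'man', 'king', 'woman') should return 'queen'."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                       # exclude the query words themselves
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

Cheaper training objectives that still produce such embeddings are exactly what word2vec introduced.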
0:54:49.169 --> 0:54:55.500 And so very successful for this was word2vec. 0:54:55.535 --> 0:54:56.865 The idea is okay. 0:54:56.865 --> 0:55:03.592 We are not training a real language model, but making it even simpler and doing, for example, 0:55:03.592 --> 0:55:05.513 continuous bag of words. 0:55:05.513 --> 0:55:12.313 We're just having four input tokens and we're predicting what is the word in the middle, and 0:55:12.313 --> 0:55:15.048 this is just like two linear layers. 0:55:15.615 --> 0:55:21.627 So it's even simplifying things and making the calculation faster, because that is what 0:55:21.627 --> 0:55:22.871 we're interested in. 0:55:23.263 --> 0:55:32.897 Or there is the continuous skip-gram model; these are the models which are referred to as 0:55:32.897 --> 0:55:34.004 word2vec. 0:55:34.234 --> 0:55:42.394 There you have one input word and, the other way around, you're predicting the four words 0:55:42.394 --> 0:55:43.585 around it. 0:55:43.585 --> 0:55:45.327 It's very similar. 0:55:45.327 --> 0:55:48.720 The task is in the end very similar. 0:55:51.131 --> 0:56:01.407 Before we are going to the next point, any questions about normal word vectors or word embeddings? 0:56:04.564 --> 0:56:07.794 The next thing is contextual 0:56:07.794 --> 0:56:12.208 word embeddings, and the idea is: word embeddings are helpful. 0:56:12.208 --> 0:56:19.206 However, we might even be able to get more out of monolingual data. 0:56:19.419 --> 0:56:31.732 Because in a static word embedding there is an overlap of the different meanings of an ambiguous word, so it represents, for example, both meanings 0:56:31.732 --> 0:56:33.585 of "can" in it. 0:56:34.834 --> 0:56:40.410 But we might be able to disambiguate this already in the pre-trained model, because the meanings 0:56:40.410 --> 0:56:41.044 are used differently. 0:56:41.701 --> 0:56:53.331 So if we can have a model which can not only represent a word but can also represent the 0:56:53.331 --> 0:56:58.689 meaning of the word within the context, 0:56:59.139 --> 0:57:03.769 then we are going to contextual word embeddings. 0:57:03.769 --> 0:57:07.713 We are really having a representation in the context. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:23.791 The hidden state of an RNN represents what was said so far, but it's focusing on what is the last 0:57:23.791 --> 0:57:29.303 word, so it is some kind of representation of that word in context. 0:57:29.509 --> 0:57:43.758 The first ones doing that were something like the ELMo paper, where instead of the final output they use the hidden states; this here is 0:57:43.758 --> 0:57:48.129 the normal language model setup. 0:57:48.008 --> 0:57:50.714 With the third word you predict the fourth, and so on. 0:57:50.714 --> 0:57:53.004 So you are always predicting the next word. 0:57:53.193 --> 0:57:57.335 The architecture is: you have the word embedding layer and then RNN layers, 0:57:57.335 --> 0:58:03.901 as you see here, for example. And now instead of using only the output at the end, you're using here this 0:58:03.901 --> 0:58:04.254 hidden state. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word, mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:18.610 We can train it in a language-model style, always predicting the next word, but we have 0:58:18.610 --> 0:58:21.088 more information trained in there. 0:58:21.088 --> 0:58:26.123 Therefore, in the MT system it has to learn fewer additional things. 0:58:27.167 --> 0:58:31.261 And this is essentially also what is done currently in GPT.
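Going back to the word2vec objectives at the start of this passage, here is a minimal sketch of the continuous bag-of-words model, showing why it is so cheap: apart from the embedding lookup there is essentially one linear layer (the sizes are assumed).

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag-of-words sketch: average the embeddings of the
    surrounding tokens and predict the centre word."""
    def __init__(self, vocab_size: int, emb_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # first linear map
        self.out = nn.Linear(emb_dim, vocab_size)        # second linear map

    def forward(self, context):      # context: (batch, 2*window) token ids
        return self.out(self.embed(context).mean(dim=1))

# loss = nn.functional.cross_entropy(model(context_ids), centre_word_ids)
```

The skip-gram variant simply swaps input and output. Current GPT-style models keep the plain next-word objective instead, just at a much larger scale.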
0:58:31.261 --> 0:58:38.319 The only difference is that we have more layers, bigger size, and we're using transformer neurocell 0:58:38.319 --> 0:58:40.437 potential instead of the RNA. 0:58:40.437 --> 0:58:45.095 But that is how you train like some large language models at the. 0:58:46.746 --> 0:58:55.044 However, if you look at this contextual representation, they might not be perfect. 0:58:55.044 --> 0:59:02.942 So if you think of this one as a contextual representation of the third word,. 0:59:07.587 --> 0:59:16.686 Is representing a three in the context of a sentence, however only in the context of 0:59:16.686 --> 0:59:18.185 the previous. 0:59:18.558 --> 0:59:27.413 However, we have an architecture which can also take both sides and we have used that 0:59:27.413 --> 0:59:30.193 already in the ink holder. 0:59:30.630 --> 0:59:34.264 So we could do the iron easily on your, also in the backward direction. 0:59:34.874 --> 0:59:42.826 By just having the states the other way around and then we couldn't combine the forward and 0:59:42.826 --> 0:59:49.135 the forward into a joint one where we are doing this type of prediction. 0:59:49.329 --> 0:59:50.858 So you have the word embedding. 0:59:51.011 --> 1:00:02.095 Then you have two in the states, one on the forward arm and one on the backward arm, and 1:00:02.095 --> 1:00:10.314 then you can, for example, take the cocagenation of both of them. 1:00:10.490 --> 1:00:23.257 Now this same here represents mainly this word because this is what both puts in it last 1:00:23.257 --> 1:00:30.573 and we know is focusing on what is happening last. 1:00:31.731 --> 1:00:40.469 However, there is a bit of difference when training that as a language model you already 1:00:40.469 --> 1:00:41.059 have. 1:00:43.203 --> 1:00:44.956 Maybe There's Again This Masking. 1:00:46.546 --> 1:00:47.748 That is one solution. 1:00:47.748 --> 1:00:52.995 First of all, why we can't do it is the information you leak it, so you cannot just predict the 1:00:52.995 --> 1:00:53.596 next word. 1:00:53.596 --> 1:00:58.132 If we just predict the next word in this type of model, that's a very simple task. 1:00:58.738 --> 1:01:09.581 You know the next word because it's influencing this hidden state predicting something is not 1:01:09.581 --> 1:01:11.081 a good task. 1:01:11.081 --> 1:01:18.455 You have to define: Because in this case what will end with the system will just ignore these 1:01:18.455 --> 1:01:22.966 estates and what will learn is copy this information directly in here. 1:01:23.343 --> 1:01:31.218 So it would be representing this word and you would have nearly a perfect model because 1:01:31.218 --> 1:01:38.287 you only need to find encoding where you can encode all words somehow in this. 1:01:38.458 --> 1:01:44.050 The only thing can learn is that turn and encode all my words in this upper hidden. 1:01:44.985 --> 1:01:53.779 Therefore, it's not really useful, so we need to find a bit of different ways out. 1:01:55.295 --> 1:01:57.090 There is a masking one. 1:01:57.090 --> 1:02:03.747 I'll come to that shortly just a bit that other things also have been done, so the other 1:02:03.747 --> 1:02:06.664 thing is not to directly combine them. 1:02:06.664 --> 1:02:13.546 That was in the animal paper, so you have them forward R&M and you keep them completely 1:02:13.546 --> 1:02:14.369 separated. 1:02:14.594 --> 1:02:20.458 So you never merged to state. 1:02:20.458 --> 1:02:33.749 At the end, the representation of the word is now from the forward. 
1:02:33.873 --> 1:02:35.953 So it's always the hidden state before the good thing. 1:02:36.696 --> 1:02:41.286 These two you join now to your to the representation. 1:02:42.022 --> 1:02:48.685 And then you have now a representation also about like the whole sentence for the word, 1:02:48.685 --> 1:02:51.486 but there is no information leakage. 1:02:51.486 --> 1:02:58.149 One way of doing this is instead of doing a bidirection along you do a forward pass and 1:02:58.149 --> 1:02:59.815 then join the hidden. 1:03:00.380 --> 1:03:05.960 So you can do that in all layers. 1:03:05.960 --> 1:03:16.300 In the end you do the forwarded layers and you get the hidden. 1:03:16.596 --> 1:03:19.845 However, it's a bit of a complicated. 1:03:19.845 --> 1:03:25.230 You have to keep both separate and merge things so can you do. 1:03:27.968 --> 1:03:33.030 And that is the moment where like the big. 1:03:34.894 --> 1:03:39.970 The big success of the burnt model was used where it okay. 1:03:39.970 --> 1:03:47.281 Maybe in bite and rich case it's not good to do the next word prediction, but we can 1:03:47.281 --> 1:03:48.314 do masking. 1:03:48.308 --> 1:03:56.019 Masking mainly means we do a prediction of something in the middle or some words. 1:03:56.019 --> 1:04:04.388 So the idea is if we have the input, we are putting noise into the input, removing them, 1:04:04.388 --> 1:04:07.961 and then the model we are interested. 1:04:08.048 --> 1:04:15.327 Now there can be no information leakage because this wasn't predicting that one is a big challenge. 1:04:16.776 --> 1:04:19.957 Do any assumption about our model? 1:04:19.957 --> 1:04:26.410 It doesn't need to be a forward model or a backward model or anything. 1:04:26.410 --> 1:04:29.500 You can always predict the three. 1:04:30.530 --> 1:04:34.844 There's maybe one bit of a disadvantage. 1:04:34.844 --> 1:04:40.105 Do you see what could be a bit of a problem this? 1:05:00.000 --> 1:05:06.429 Yes, so yeah, you can of course mask more, but to see it more globally, just first assume 1:05:06.429 --> 1:05:08.143 you're only masked one. 1:05:08.143 --> 1:05:13.930 For the whole sentence, we get one feedback signal, like what is the word three. 1:05:13.930 --> 1:05:22.882 So we have one training example: If you do the language modeling taste, we predicted here, 1:05:22.882 --> 1:05:24.679 we predicted here. 1:05:25.005 --> 1:05:26.735 So we have number of tokens. 1:05:26.735 --> 1:05:30.970 For each token we have a feet pad and say what is the best correction. 1:05:31.211 --> 1:05:43.300 So in this case this is less efficient because we are getting less feedback signals on what 1:05:43.300 --> 1:05:45.797 we should predict. 1:05:48.348 --> 1:05:56.373 So and bird, the main ideas are that you're doing this bidirectional model with masking. 1:05:56.373 --> 1:05:59.709 It's using transformer architecture. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that this next word prediction is another task. 1:06:16.957 --> 1:06:30.394 You want to learn more about what language is to really understand following a story or 1:06:30.394 --> 1:06:35.127 their independent tokens into. 1:06:38.158 --> 1:06:42.723 The input is using word units as we use it. 1:06:42.723 --> 1:06:50.193 It has some special token that is framing for the next word prediction. 1:06:50.470 --> 1:07:04.075 It's more for classification task because you may be learning a general representation 1:07:04.075 --> 1:07:07.203 as a full sentence. 
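A sketch of the masked-prediction objective introduced above: corrupt the input, remember the original tokens at the corrupted positions, and let only those positions contribute to the loss. The 15% rate is an assumed, commonly used value, and the 80/10/10 replacement scheme described just below can be layered on top of this.

```python
import random

def create_masked_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Self-supervised masked prediction: hide some positions and keep
    their original tokens as the training targets."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted[i] = mask_token
            targets[i] = tok       # only these positions give a loss signal
    return corrupted, targets

# Note the efficiency point from above: a causal LM gets a prediction
# target at every position, this objective only at the masked ones.
```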
1:07:07.607 --> 1:07:19.290 You're doing segment embedding, so you have an embedding for it. 1:07:19.290 --> 1:07:24.323 This is the first sentence. 1:07:24.684 --> 1:07:29.099 Now what is more challenging is this masking. 1:07:29.099 --> 1:07:30.827 What do you mask? 1:07:30.827 --> 1:07:35.050 We already have the crush enough or should. 1:07:35.275 --> 1:07:42.836 So there has been afterwards eating some work like, for example, a bearer. 1:07:42.836 --> 1:07:52.313 It's not super sensitive, but if you do it completely wrong then you're not letting anything. 1:07:52.572 --> 1:07:54.590 That's Then Another Question There. 1:07:56.756 --> 1:08:04.594 Should I mask all types of should I always mask the footwork or if I have a subword to 1:08:04.594 --> 1:08:10.630 mask only like a subword and predict them based on the other ones? 1:08:10.630 --> 1:08:14.504 Of course, it's a bit of a different task. 1:08:14.894 --> 1:08:21.210 If you know three parts of the words, it might be easier to guess the last because they here 1:08:21.210 --> 1:08:27.594 took the easiest selection, so not considering words anymore at all because you're doing that 1:08:27.594 --> 1:08:32.280 in the preprocessing and just taking always words and like subwords. 1:08:32.672 --> 1:08:36.089 Think in group there is done differently. 1:08:36.089 --> 1:08:40.401 They mark always the full words, but guess it's not. 1:08:41.001 --> 1:08:46.044 And then what to do with the mask word in eighty percent of the cases. 1:08:46.044 --> 1:08:50.803 If the word is masked, they replace it with a special token thing. 1:08:50.803 --> 1:08:57.197 This is a mask token in ten percent they put in some random other token in there, and ten 1:08:57.197 --> 1:08:59.470 percent they keep it on change. 1:09:02.202 --> 1:09:10.846 And then what you can do is also this next word prediction. 1:09:10.846 --> 1:09:14.880 The man went to Mass Store. 1:09:14.880 --> 1:09:17.761 He bought a gallon. 1:09:18.418 --> 1:09:24.088 So may you see you're joining them, you're doing both masks and prediction that you're. 1:09:24.564 --> 1:09:29.449 Is a penguin mask or flyless birds. 1:09:29.449 --> 1:09:41.390 These two sentences have nothing to do with each other, so you can do also this type of 1:09:41.390 --> 1:09:43.018 prediction. 1:09:47.127 --> 1:09:57.043 And then the whole bird model, so here you have the input here to transform the layers, 1:09:57.043 --> 1:09:58.170 and then. 1:09:58.598 --> 1:10:17.731 And this model was quite successful in general applications. 1:10:17.937 --> 1:10:27.644 However, there is like a huge thing of different types of models coming from them. 1:10:27.827 --> 1:10:38.709 So based on others these supervised molds like a whole setup came out of there and now 1:10:38.709 --> 1:10:42.086 this is getting even more. 1:10:42.082 --> 1:10:46.640 With availability of a large language model than the success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it goes a bit. 1:10:50.910 --> 1:10:57.847 Change the bit again from like more the spider action model to uni directional models. 1:10:57.847 --> 1:11:02.710 Are at the moment maybe a bit more we're coming to them now? 1:11:02.710 --> 1:11:09.168 Do you see one advantage while what is another event and we have the efficiency? 1:11:09.509 --> 1:11:15.901 Is one other reason why you are sometimes more interested in uni-direction models than 1:11:15.901 --> 1:11:17.150 in bi-direction. 
1:11:22.882 --> 1:11:30.220 It depends on the task, but for example for a language generation task, the bidirectional encoder is not 1:11:30.220 --> 1:11:30.872 really usable. 1:11:32.192 --> 1:11:40.924 It doesn't work: if you want to do generation like the decoder does, you don't know the future, 1:11:40.924 --> 1:11:42.896 so you cannot apply it. 1:11:43.223 --> 1:11:53.870 So this type of model can be used for the encoder in an encoder-decoder model, but it cannot 1:11:53.870 --> 1:11:57.002 be used for the decoder. 1:12:00.000 --> 1:12:05.012 That's a good transition to the overall classes of models, 1:12:05.012 --> 1:12:08.839 perhaps if you view it from the sequence-to-sequence perspective. 1:12:09.009 --> 1:12:12.761 We have the encoder-based models. 1:12:12.761 --> 1:12:16.161 That's what we just looked at. 1:12:16.161 --> 1:12:20.617 They are bidirectional and typically trained with masking. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:34.634 What we saw at the beginning are the decoder-based models, so the auto-regressive models, which are unidirectional 1:12:34.634 --> 1:12:42.601 like an RNN-based language model, and there we can do the next-word prediction. 1:12:43.403 --> 1:12:52.439 And then there is also a special thing called a prefix 1:12:52.439 --> 1:12:53.432 language model. 1:12:54.354 --> 1:13:05.039 Because we are saying it might be helpful that some of your input can also use bidirectional attention. 1:13:05.285 --> 1:13:12.240 And that is what the prefix language model is doing: 1:13:12.240 --> 1:13:19.076 on the first tokens you directly allow bidirectional attention. 1:13:19.219 --> 1:13:28.774 So you somehow merge both, and that mainly works only in transformer-based models, because 1:13:29.629 --> 1:13:33.039 there is then no different number of parameters; in an RNN 1:13:33.039 --> 1:13:34.836 we would need an extra backward RNN. 1:13:34.975 --> 1:13:38.533 In a transformer, the only difference is how you mask your attention. 1:13:38.878 --> 1:13:44.918 We have seen that for the encoder and decoder the number of parameters is different because 1:13:44.918 --> 1:13:50.235 you do cross-attention, but if you do forward and backward, or only one direction, 1:13:50.650 --> 1:13:58.736 it's only that you mask your attention to either look only at the past or to also look into the 1:13:58.736 --> 1:13:59.471 future. 1:14:00.680 --> 1:14:03.326 And now you can of course also do mixing. 1:14:03.563 --> 1:14:08.306 So this is a bidirectional attention matrix where you can attend to everything. 1:14:08.588 --> 1:14:23.516 There is a unidirectional or causal one where you can only look at the past, and there is the prefix one where, for example, the first 1:14:23.516 --> 1:14:25.649 three words attend to each other bidirectionally. 1:14:29.149 --> 1:14:42.831 Is that somehow clear? Based on that, we can then of course also look at the remaining type. 1:14:43.163 --> 1:14:50.623 So the idea is: we have our encoder-decoder architecture. 1:14:50.623 --> 1:14:57.704 Can we also train it completely in a self-supervised way? 1:14:58.238 --> 1:15:09.980 And in this case we have the same input to both sides, so in this case we need to do some type 1:15:09.980 --> 1:15:12.224 of masking here on the input. 1:15:12.912 --> 1:15:17.696 Here in the encoder we don't need attention masking, but here in the decoder we need the masking so that it doesn't already know 1:15:17.696 --> 1:15:17.911 the future. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful, especially for pre-training machine translation. 1:15:30.330 --> 1:15:39.059 The first model doing that is the BART model, which does exactly that, and yes, it's one 1:15:39.059 --> 1:15:42.872 successful way to pre-train your model.
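Since the difference between the encoder-style, decoder-style and prefix variants is really only the attention mask, here is a small sketch that builds the three mask types as a boolean matrix saying which positions may attend to which.

```python
import torch

def attention_mask(seq_len: int, style: str, prefix_len: int = 0) -> torch.Tensor:
    """Entry (i, j) is True if position i may attend to position j.
    The model classes above differ only in this mask, not in parameters."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    if style == "bidirectional":      # encoder-style, BERT-like
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    if style == "causal":             # decoder-style, auto-regressive
        return j <= i
    if style == "prefix":             # prefix LM: full attention within and onto the prefix
        return (j <= i) | (j < prefix_len)
    raise ValueError(f"unknown style: {style}")

# attention_mask(5, "prefix", prefix_len=3) lets the first three positions
# see each other in both directions; the rest only look backwards.
```

BART, mentioned just above, combines a bidirectional encoder with a causal decoder.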
1:15:42.872 --> 1:15:47.087 It's pre-training your full encoder-decoder model. 1:15:47.427 --> 1:15:54.365 In contrast to machine translation, where you put in the source sentence on the one side, we can't 1:15:54.365 --> 1:15:55.409 do that here. 1:15:55.715 --> 1:16:01.382 But we can just put the same sentence in twice, and then, so that it is not a trivial task, 1:16:01.382 --> 1:16:02.432 we can corrupt it. 1:16:03.003 --> 1:16:12.777 And they use different corruption techniques, so you can, for example, also do token deletion. 1:16:13.233 --> 1:16:19.692 That is something you couldn't do in an encoder-only system, because then the position wouldn't be there and you cannot 1:16:19.692 --> 1:16:20.970 predict anything for it. 1:16:20.970 --> 1:16:26.353 In the encoder, the number of input and output tokens always has to be the same. 1:16:26.906 --> 1:16:29.818 You cannot do a prediction for something which isn't in it. 1:16:30.110 --> 1:16:38.268 Here on the decoder side it's unidirectional generation, so we can also delete a token and then try 1:16:38.268 --> 1:16:40.355 to generate the full sentence. 1:16:41.061 --> 1:16:45.250 We can do sentence permutation. 1:16:45.250 --> 1:16:54.285 We can do document rotation and text infilling, so there is quite a bit of variety. 1:16:55.615 --> 1:17:06.568 So you see there are quite a lot of types of models that you can use in order to pre-train. 1:17:07.507 --> 1:17:14.985 Then, of course, as for the language model, 1:17:14.985 --> 1:17:21.079 the other question is: how do you integrate it? 1:17:21.761 --> 1:17:26.636 And there are also quite some different techniques for that. 1:17:27.007 --> 1:17:28.684 It's a bit similar to before. 1:17:28.928 --> 1:17:39.068 So the easiest thing is you take your word embeddings or your pre-trained model, 1:17:39.068 --> 1:17:47.971 you freeze them, stack your decoder layers on top and keep only these trainable. 1:17:48.748 --> 1:17:54.495 That can also be done if you have this type of BART model. 1:17:54.495 --> 1:18:03.329 What you can do is you freeze, for example, your word embeddings or some other parts. 1:18:05.865 --> 1:18:17.296 The other thing is you only initialize with them, so you initialize your model with the pre-trained weights but then train everything, 1:18:17.296 --> 1:18:19.120 so you're not freezing anything. 1:18:22.562 --> 1:18:29.986 Then one thing: if you think about BART, you want to have one language in the encoder and another 1:18:29.986 --> 1:18:32.165 language in the decoder. 1:18:32.165 --> 1:18:35.716 However, in BART we have the same language on both sides. 1:18:36.516 --> 1:18:46.010 And the one you get is trained on English, so what you can do there is try to do some adaptation 1:18:46.366 --> 1:18:52.562 around the BART in order to learn some language-specific stuff, or there is mBART, 1:18:52.562 --> 1:18:58.823 which is trained on many languages, but it's trained only on the monolingual data of each 1:18:58.823 --> 1:19:03.388 language, so it may be trained on German and on English, but not on German-English parallel data. 1:19:03.923 --> 1:19:08.779 So then you would still need to fine-tune, and the model needs to learn how to 1:19:08.779 --> 1:19:10.721 do the attention cross-lingually. 1:19:10.721 --> 1:19:15.748 It was pre-trained only on the same language, but now it mainly has to learn this mapping and not all 1:19:15.748 --> 1:19:18.775 the rest, and that's why it's still quite successful. 1:19:21.982 --> 1:19:27.492 Now, a certain thing which is very commonly used is what is referred to as adapters. 1:19:27.607 --> 1:19:29.754 So, for example, you take mBART.
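Before going into adapters, here is a rough sketch of two of the BART-style corruption operations mentioned above, token deletion and sentence permutation; the denoising training pair is then (corrupted text in, original text out). The function names and rates are illustrative assumptions.

```python
import random

def delete_tokens(tokens, p=0.15):
    """Token deletion (rate assumed): the decoder must recover both what
    is missing and where, which a pure encoder cannot be asked to do."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens[:1]          # never return an empty input

def permute_sentences(sentences):
    """Sentence permutation: shuffle the sentence order of a document;
    the decoder has to reproduce the original order."""
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

# A denoising training example is then, for instance:
# src = delete_tokens(tokens); tgt = tokens
```

Back to the adapter idea, starting from a pre-trained model such as mBART.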
1:19:29.709 --> 1:19:35.218 And you put some adapters inside the network, so there are small new layers 1:19:35.218 --> 1:19:40.790 which are put in between, and then you only train these adapters, or you train these adapters 1:19:40.790 --> 1:19:41.815 together with the rest. 1:19:41.815 --> 1:19:47.900 For example, in mBART you could say that this learns to map the source language representation 1:19:47.900 --> 1:19:50.334 to the target language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change the rest that much. 1:19:52.792 --> 1:19:59.793 You give it extra capacity to really perform well on that. 1:19:59.793 --> 1:20:05.225 These adapters are quite small and so very efficient. 1:20:05.905 --> 1:20:12.632 That is also very commonly used, for example in modular systems where you have some adapters 1:20:12.632 --> 1:20:16.248 in between here which might be language-specific. 1:20:16.916 --> 1:20:22.247 So they are trained only for one language. 1:20:22.247 --> 1:20:33.777 The model then has both: language-specific parts, and at the same time the ability to work multilingually and share knowledge. 1:20:34.914 --> 1:20:39.058 But there's one catch: in general, in multilingual systems 1:20:39.058 --> 1:20:40.439 this works quite well. 1:20:40.439 --> 1:20:46.161 There's one case, or one specific use case for multilingual models, where this normally doesn't 1:20:46.161 --> 1:20:47.344 really work well. 1:20:47.344 --> 1:20:49.975 Do you have an idea what that could be? 1:20:55.996 --> 1:20:57.536 It's for zero-shot cases. 1:20:57.998 --> 1:21:03.660 Because these adapters might be very language-specific, and for zero-shot 1:21:03.660 --> 1:21:09.015 the idea is always to learn representations which are more language-independent; with the adapters, 1:21:09.015 --> 1:21:10.184 however, 1:21:10.184 --> 1:21:15.601 you of course again get representations which are more language-specific, and then it 1:21:15.601 --> 1:21:17.078 doesn't work that well. 1:21:20.260 --> 1:21:37.730 And there is also the idea of doing knowledge distillation. 1:21:39.179 --> 1:21:42.923 And now the idea is okay: 1:21:42.923 --> 1:21:54.157 we are training the MT model as before, but what we want to achieve is that the encoder behaves like the pre-trained model. 1:21:54.414 --> 1:22:03.095 So it should learn faster by trying to make these hidden states as similar as possible. 1:22:03.095 --> 1:22:11.777 So you compare the hidden states of the pre-trained model and of your encoder and try to make them similar. 1:22:12.192 --> 1:22:18.144 For example, by using the L2 norm, so by just pushing these two representations to be the 1:22:18.144 --> 1:22:26.373 same. This needs the same vocabulary. Why does it need the same vocabulary, any idea? 1:22:34.754 --> 1:22:46.137 If you have a different vocabulary, typically you also have different sequence lengths here. 1:22:46.137 --> 1:22:50.690 The number of tokens is different. 1:22:51.231 --> 1:22:58.888 If you now have five states there and four states here, it's no longer straightforward which 1:22:58.888 --> 1:23:01.089 states to compare to which. 1:23:02.322 --> 1:23:05.246 And it's just easier if you have the same number. 1:23:05.246 --> 1:23:08.940 You can always compare the first to the first and the second to the second. 1:23:09.709 --> 1:23:16.836 So therefore, at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.030 Of course, you could do things like saying the average should be the same, but that is of course 1:23:30.030 --> 1:23:33.071 a less strong signal.
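The adapter idea described at the start of this passage is usually a small bottleneck layer with a residual connection, inserted between the frozen pre-trained blocks; only these few parameters are trained, for example one adapter per language. A minimal sketch with assumed dimensions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: a small trainable layer placed between
    frozen pre-trained blocks (dimensions are assumed values)."""
    def __init__(self, model_dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(model_dim, bottleneck)
        self.up = nn.Linear(bottleneck, model_dim)
        self.act = nn.ReLU()

    def forward(self, x):
        # The residual connection keeps the original representation intact;
        # the adapter only learns a small correction on top of it.
        return x + self.up(self.act(self.down(x)))
```

Returning to the comparison of encoder states just described: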
1:23:34.314 --> 1:23:42.979 But the advantage here is that you have a diameter training signal here on the handquarter 1:23:42.979 --> 1:23:51.455 so you can directly make some of the encoder already giving a good signal while normally 1:23:51.455 --> 1:23:52.407 an empty. 1:23:56.936 --> 1:24:13.197 Yes, think this is most things for today, so what you should keep in mind is remind me. 1:24:13.393 --> 1:24:18.400 The one is a back translation idea. 1:24:18.400 --> 1:24:29.561 If you have monolingual and use that, the other one is to: And mentally it is often helpful 1:24:29.561 --> 1:24:33.614 to combine them so you can even use both of that. 1:24:33.853 --> 1:24:38.908 So you can use pre-trained walls, but then you can even still do back translation where 1:24:38.908 --> 1:24:40.057 it's still helpful. 1:24:40.160 --> 1:24:45.502 We have the advantage we are training like everything working together on the task so 1:24:45.502 --> 1:24:51.093 it might be helpful even to backtranslate some data and then use it in a real translation 1:24:51.093 --> 1:24:56.683 setup because in pretraining of course the beach challenge is always that you're training 1:24:56.683 --> 1:24:57.739 it on different. 1:24:58.058 --> 1:25:03.327 Different ways of how you integrate this knowledge. 1:25:03.327 --> 1:25:08.089 Even if you just use a full model, so in this. 1:25:08.748 --> 1:25:11.128 This is the most similar you can get. 1:25:11.128 --> 1:25:13.945 You're doing no changes to the architecture. 1:25:13.945 --> 1:25:19.643 You're really taking the model and just fine tuning them on the new task, but it still has 1:25:19.643 --> 1:25:24.026 to completely newly learn how to do the attention and how to do that. 1:25:24.464 --> 1:25:29.971 And that might be, for example, helpful to have more back-translated data to learn them. 1:25:32.192 --> 1:25:34.251 That's for today. 1:25:34.251 --> 1:25:44.661 There's one important thing that next Tuesday there is a conference or a workshop or so in 1:25:44.661 --> 1:25:45.920 this room. 1:25:47.127 --> 1:25:56.769 You should get an e-mail if you're in Elias that there's a room change for Tuesdays and 1:25:56.769 --> 1:25:57.426 it's. 1:25:57.637 --> 1:26:03.890 There are more questions, yeah, have a more general position, especially: In computer vision 1:26:03.890 --> 1:26:07.347 you can enlarge your data center data orientation. 1:26:07.347 --> 1:26:08.295 Is there any? 1:26:08.388 --> 1:26:15.301 It's similar to a large speech for text for the data of an edge. 1:26:15.755 --> 1:26:29.176 And you can use this back translation and also masking, but back translation is some 1:26:29.176 --> 1:26:31.228 way of data. 1:26:31.371 --> 1:26:35.629 So it has also been, for example, even its used not only for monolingual data. 1:26:36.216 --> 1:26:54.060 If you have good MP system, it can also be used for parallel data. 1:26:54.834 --> 1:26:59.139 So would say this is the most similar one. 1:26:59.139 --> 1:27:03.143 There's ways you can do power phrasing. 1:27:05.025 --> 1:27:12.057 But for example there is very hard to do this by rules like which words to replace because 1:27:12.057 --> 1:27:18.936 there is not a coup like you cannot always say this word can always be replaced by that. 1:27:19.139 --> 1:27:27.225 Mean, although they are many perfect synonyms, normally they are good in some cases, but not 1:27:27.225 --> 1:27:29.399 in all cases, and so on. 1:27:29.399 --> 1:27:36.963 And if you don't do a rule based, you have to train your model and then the freshness. 
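As a concrete sketch of the encoder-state knowledge distillation discussed above: the easy variant is just an L2/MSE term between the MT encoder's hidden states and those of the pre-trained model, which is why matching sequence lengths, and hence a shared vocabulary, are assumed. The weight `alpha` in the comment is a hypothetical tuning parameter.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def encoder_distillation_loss(student_states: torch.Tensor,
                              teacher_states: torch.Tensor) -> torch.Tensor:
    """L2-style distillation between the MT encoder (student) and a
    pre-trained encoder (teacher).  States are compared position by
    position (shape: batch x length x dim), so both models are assumed
    to use the same vocabulary/segmentation."""
    return mse(student_states, teacher_states.detach())

# total_loss = translation_loss + alpha * encoder_distillation_loss(h_student, h_teacher)
```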
1:27:38.058 --> 1:27:57.236 Does it need the same architecture as the pre-trained model? 1:27:57.457 --> 1:27:59.810 It should be of the same dimension, so it's easiest to have the same dimension 1:28:00.000 --> 1:28:01.590 and architecture. 1:28:01.590 --> 1:28:05.452 We will later, when we talk about efficiency, see that 1:28:05.452 --> 1:28:12.948 you can also do knowledge distillation with, for example, smaller models. 1:28:12.948 --> 1:28:16.469 You can learn the same within, 1:28:17.477 --> 1:28:22.949 say, eight layers, so that is possible; but yes, agreed, it should be of the same dimension. 1:28:23.623 --> 1:28:32.486 Yes, good question: of course you can do it as an initialization, or 1:28:32.486 --> 1:28:41.157 you can do it during training, but normally it makes most sense during the normal training. 1:28:45.865 --> 1:28:53.963 If that's it, then thanks a lot, and we'll see each other again on Tuesday.
0:03:05.825 --> 0:03:09.647 So is really this the only resource you can use. 0:03:09.889 --> 0:03:20.667 Or can we adapt our models in order to also make use of other types of models that might 0:03:20.667 --> 0:03:27.328 enable us to build strong systems with other types of. 0:03:27.707 --> 0:03:35.283 And that's what we will look into now in the next, starting from Tuesday in the next. 0:03:35.515 --> 0:03:43.437 So this idea we already have covered on Tuesday, so one very successful idea for this is to 0:03:43.437 --> 0:03:45.331 do more multilingual. 0:03:45.645 --> 0:03:52.010 So that we're no longer only doing translation between two languages, but we can do translation 0:03:52.010 --> 0:03:55.922 between many languages and share common knowledge between. 0:03:56.296 --> 0:04:06.477 And you also learned about that you can even do things like zero shot machine translations. 0:04:06.786 --> 0:04:09.792 Which is the case for many many language pairs. 0:04:10.030 --> 0:04:17.406 Even with German, you have not translation parallel data to all languages around the world, 0:04:17.406 --> 0:04:22.698 or most of them you have it to the Europeans, maybe for Japanese. 0:04:22.698 --> 0:04:26.386 But even for Japanese, it will get difficult. 0:04:26.746 --> 0:04:32.862 There is quite a lot of data, for example English to Japanese, but German to Vietnamese. 0:04:32.862 --> 0:04:39.253 There is some data from Multilingual Corpora where you can extract the name, but your amount 0:04:39.253 --> 0:04:41.590 really is dropping significantly. 0:04:42.042 --> 0:04:54.907 So that is a very promising direction if you want to build translation systems between language 0:04:54.907 --> 0:05:00.134 pairs, typically not English, because. 0:05:01.221 --> 0:05:05.888 And the other ideas, of course, we don't have data, just search for data. 0:05:06.206 --> 0:05:15.755 There is some work on data crawling so if don't have a corpus directly or don't have 0:05:15.755 --> 0:05:23.956 a high quality corpus from the European Parliament for TED corpus maybe. 0:05:24.344 --> 0:05:35.528 There has been a big effort in Europe to collect data sets for parallel data. 0:05:35.528 --> 0:05:40.403 How can we do this data crawling? 0:05:40.600 --> 0:05:46.103 There the interesting thing from the machine translation point is not just general data 0:05:46.103 --> 0:05:46.729 crawling. 0:05:47.067 --> 0:05:52.067 But how can we explicitly crawl data, which is somewhat parallel? 0:05:52.132 --> 0:05:58.538 So there is in the Internet quite a lot of data which has been like company websites which 0:05:58.538 --> 0:06:01.565 have been translated and things like that. 0:06:01.565 --> 0:06:05.155 So how can you extract them and then extract them? 0:06:06.566 --> 0:06:13.406 There is typically more noisy than where you do more, hence mean if you have your Parliament. 0:06:13.693 --> 0:06:21.305 You can do some rules how to extract the parallel things. 0:06:21.305 --> 0:06:30.361 Here there is more to it, so the quality is later maybe not as good. 0:06:33.313 --> 0:06:39.927 The other thing is can we use monolingual data and monolingual data has a big advantage 0:06:39.927 --> 0:06:46.766 that we can have a huge amount of that so that you can be able to crawl from the internet. 0:06:46.766 --> 0:06:51.726 The nice thing is you can also get it typically for many domains. 0:06:52.352 --> 0:06:58.879 There is just so much more magnitude more of monolingual data so that it might be very 0:06:58.879 --> 0:06:59.554 helpful. 
0:06:59.559 --> 0:07:06.187 We can do that in statistical machine translation was quite easy to integrate using language 0:07:06.187 --> 0:07:06.757 models. 0:07:08.508 --> 0:07:14.499 In neural machine translation we have the advantage that we have this overall and architecture 0:07:14.499 --> 0:07:18.850 that does everything together, but it has also the disadvantage now. 0:07:18.850 --> 0:07:22.885 It's more difficult to put in this type of information or make. 0:07:23.283 --> 0:07:26.427 We'll look to two things. 0:07:26.427 --> 0:07:37.432 You can still try to do a bit of language modeling in there and add an additional language 0:07:37.432 --> 0:07:38.279 model. 0:07:38.178 --> 0:07:43.771 A way which I think is used in most systems at the moment is to do synthetic data. 0:07:43.763 --> 0:07:53.095 It's a very easy thing, but you can just translate there and then use it as training data. 0:07:53.213 --> 0:07:59.192 And thereby you are able to use like some type of moonlighting. 0:08:00.380 --> 0:08:09.521 Another way to do it is to ensure that some are in the extreme case. 0:08:09.521 --> 0:08:14.026 If you have a scenario that only. 0:08:14.754 --> 0:08:24.081 The impressive thing is if you have large amounts of data and the languages are not too 0:08:24.081 --> 0:08:31.076 dissimilar, you can even in this case build a translation system. 0:08:32.512 --> 0:08:36.277 That we will see then next Thursday. 0:08:37.857 --> 0:08:55.462 And then there is now a fourth type of restorer that recently became very successful and now. 0:08:55.715 --> 0:09:02.409 So the idea is we are no longer sharing the real data such as text data, but it can also 0:09:02.409 --> 0:09:04.139 help to train a model. 0:09:04.364 --> 0:09:08.599 And that is now a big advantage of deep learning based approaches. 0:09:08.599 --> 0:09:14.414 There you have this ability that you can train a model on some task and then you can modify 0:09:14.414 --> 0:09:19.913 it maybe and then apply it to another task and you can somewhat transfer the knowledge 0:09:19.913 --> 0:09:22.125 from the first task to the second. 0:09:22.722 --> 0:09:31.906 And then, of course, the question is, can it have an initial task where it's very easy 0:09:31.906 --> 0:09:34.439 to train on the second? 0:09:34.714 --> 0:09:53.821 The task that you pre-train on is more similar to a language. 0:09:53.753 --> 0:10:06.293 A bit of a different way of using language malls in this more transfer learning set. 0:10:09.029 --> 0:10:18.747 So first we will start with how can we use monolingual data to do a machine translation? 0:10:20.040 --> 0:10:22.542 The. 0:10:22.062 --> 0:10:28.924 This big difference is you should remember from what I mentioned before is in statistical 0:10:28.924 --> 0:10:30.525 machine translation. 0:10:30.525 --> 0:10:33.118 We directly have the opportunity. 0:10:33.118 --> 0:10:39.675 There's peril data for a translation model and monolingual data for a language model. 0:10:39.679 --> 0:10:45.735 And you combine your translation model and your language model, and then you can make. 0:10:46.726 --> 0:10:54.263 That has big advantages that you can make use of these large amounts of monolingual data, 0:10:54.263 --> 0:10:55.519 but of course. 0:10:55.495 --> 0:11:02.198 Because we said the problem is, we are optimizing both parts independently to each other, and 0:11:02.198 --> 0:11:09.329 we say the big advantage of newer machine translation is we are optimizing the overall architecture 0:11:09.329 --> 0:11:10.541 to perform best. 
0:11:10.890 --> 0:11:17.423 And then, of course, we can't do that, so here we can only use power there. 0:11:17.897 --> 0:11:25.567 So the question is, but if this advantage is not so important, we can train everything, 0:11:25.567 --> 0:11:33.499 but we have large amounts of monolingual data or small amounts, but they fit perfectly, so 0:11:33.499 --> 0:11:35.242 they are very good. 0:11:35.675 --> 0:11:41.438 So in data we know it's not only important the amount of data we have but also like how 0:11:41.438 --> 0:11:43.599 similar it is to your test data. 0:11:43.599 --> 0:11:49.230 So it can be that this volume is even only quite small but it's very well fitting and 0:11:49.230 --> 0:11:51.195 then it's still very helpful. 0:11:51.195 --> 0:11:55.320 So the question is if this is the case how can we make use of? 0:11:55.675 --> 0:12:03.171 And the first year of surprisingness, if we are here successful with integrating a language 0:12:03.171 --> 0:12:10.586 model into a translation system, maybe we can also integrate some types of language models 0:12:10.586 --> 0:12:14.415 into our MT system in order to make it better. 0:12:16.536 --> 0:12:19.000 The first thing we can do is okay. 0:12:19.000 --> 0:12:23.293 We know there is language models, so let's try to integrate. 0:12:23.623 --> 0:12:30.693 There was mainly used language models because these works were mainly done before transformer 0:12:30.693 --> 0:12:31.746 based models. 0:12:32.152 --> 0:12:41.567 And generally, of course, you can do the same thing with all the Transformers baseballs. 0:12:41.721 --> 0:12:58.900 It has mainly been done before people started using R&S, and they tried to do this more 0:12:58.900 --> 0:13:01.888 in cases where. 0:13:07.087 --> 0:13:17.508 So what we're having here is some of this type of idea. 0:13:17.508 --> 0:13:25.511 This is a key system here as you remember. 0:13:25.605 --> 0:13:29.470 Gets in with your last instinct and calculates your attention. 0:13:29.729 --> 0:13:36.614 We get the context and combine both and then based on that and then predict the target. 0:13:37.057 --> 0:13:42.423 So this is our anti-system, and the question is, can we somehow integrate the language? 0:13:42.782 --> 0:13:55.788 And of course, if someone makes sense to take out a neural language model because we're anyway 0:13:55.788 --> 0:14:01.538 in the neural space, it's not surprising. 0:14:01.621 --> 0:14:15.522 And there would be something like on top of there and you're a language model and you have 0:14:15.522 --> 0:14:17.049 a target. 0:14:17.597 --> 0:14:27.007 So if we're having this type of language model, there's two main questions we have to answer. 0:14:27.007 --> 0:14:28.108 How do we? 0:14:28.208 --> 0:14:37.935 So how do we combine now on the one hand our NMT system and on the other hand our RNA you 0:14:37.935 --> 0:14:45.393 see that was mentioned before when we started talking about encoder. 0:14:45.805 --> 0:14:49.523 The wild is like unconditioned, it's just modeling the targets side. 0:14:49.970 --> 0:14:57.183 And the other one is a conditional language, which is a language condition on the sewer 0:14:57.183 --> 0:14:57.839 center. 0:14:58.238 --> 0:15:03.144 So the question is how can you not combine two language models? 0:15:03.144 --> 0:15:09.813 Of course, it's like the translation model will some will be more important because it 0:15:09.813 --> 0:15:11.806 has access to the source. 0:15:11.806 --> 0:15:16.713 We want to generate something which corresponds to your source. 
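As a reminder of the decoder step being referred to here (the last decoder hidden state is used to compute attention over the encoder states, the resulting context vector is combined with the state, and the next target word is predicted), here is a minimal sketch. The bilinear scoring function, all tensor names and all sizes are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(dec_state, enc_states, W_att, W_out, E_vocab):
    """One attention-based decoder step (illustrative shapes only).

    dec_state:  (d,)     current decoder hidden state
    enc_states: (T, d)   encoder hidden states of the source sentence
    W_att:      (d, d)   attention parameters (bilinear scoring, one possible choice)
    W_out:      (d, 2d)  combines state and context
    E_vocab:    (V, d)   projection to vocabulary logits
    """
    # 1) attention scores between the decoder state and every encoder state
    scores = enc_states @ (W_att @ dec_state)          # (T,)
    alphas = softmax(scores)                           # attention weights
    # 2) context vector = weighted sum of encoder states
    context = alphas @ enc_states                      # (d,)
    # 3) combine decoder state and context, then predict the target word
    combined = np.tanh(W_out @ np.concatenate([dec_state, context]))
    logits = E_vocab @ combined                        # (V,)
    return softmax(logits)                             # p(y_t | y_<t, x)

# toy usage with random parameters
d, T, V = 8, 5, 20
rng = np.random.default_rng(0)
p = decoder_step(rng.normal(size=d), rng.normal(size=(T, d)),
                 rng.normal(size=(d, d)), rng.normal(size=(d, 2 * d)),
                 rng.normal(size=(V, d)))
print(p.shape, p.sum())   # (20,) and approximately 1.0
```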
0:15:18.778 --> 0:15:20.918 If we had that, the other question is OK. 0:15:20.918 --> 0:15:22.141 Now we have two models. 0:15:22.141 --> 0:15:25.656 If we even have integrated them, the answer is how do we train them? 0:15:26.026 --> 0:15:39.212 Because we have integrated them, we have no two sets of data with parallel data where you 0:15:39.212 --> 0:15:42.729 can do the lower thing. 0:15:44.644 --> 0:15:47.575 So the first idea is okay. 0:15:47.575 --> 0:15:53.436 We can do something more like a parallel combination. 0:15:53.436 --> 0:15:55.824 We just keep running. 0:15:56.036 --> 0:15:59.854 So a year you see your NMT system that is running. 0:16:00.200 --> 0:16:08.182 First of all, it's normally completely independent of your language model, which is up there. 0:16:08.182 --> 0:16:13.278 So down here we have just our NMT system, which is running. 0:16:13.313 --> 0:16:26.439 The only thing which is used is we have the words inputted, and of course they are put 0:16:26.439 --> 0:16:28.099 into both. 0:16:28.099 --> 0:16:41.334 We also put: So we use them in parallel, and then we are doing our decision just by merging 0:16:41.334 --> 0:16:42.905 these two. 0:16:43.343 --> 0:16:52.288 So there can be, for example, we are doing a probability distribution here, we are doing 0:16:52.288 --> 0:17:01.032 a purability distribution here, and then we are taking the average of both per ability 0:17:01.032 --> 0:17:03.343 to do our predictions. 0:17:11.871 --> 0:17:18.929 You could also take the output which seems to be more short about the answer. 0:17:20.000 --> 0:17:23.272 Yes, you could also do that. 0:17:23.272 --> 0:17:27.222 It's more like a gating mechanism. 0:17:27.222 --> 0:17:32.865 You're not doing everything, but you're focusing. 0:17:32.993 --> 0:17:38.927 Another one would be you could also just concatenate the hidden states and then you have another 0:17:38.927 --> 0:17:41.802 layer on top which based on the concatenation. 0:17:43.303 --> 0:17:58.634 If you think about it, you do the coordination instead of taking the instead and then merging 0:17:58.634 --> 0:18:01.244 the perability. 0:18:03.143 --> 0:18:15.027 Yes, in the end you introduce many new parameters and these parameters have somehow something 0:18:15.027 --> 0:18:17.303 special compared. 0:18:23.603 --> 0:18:33.657 So before all the other parameters can be trained independently of each other, the language 0:18:33.657 --> 0:18:42.071 one can be trained independent and an antisystem can be trained independent. 0:18:43.043 --> 0:18:51.198 If you have a joint layer of course you need to train them because you have inputs so you 0:18:51.198 --> 0:19:01.560 need: Not surprisingly, if you have a parallel combination or whether you could, the other 0:19:01.560 --> 0:19:04.664 way is to do more serial combinations. 0:19:04.924 --> 0:19:10.382 How can you do a similar combination? 0:19:10.382 --> 0:19:18.281 Your final decision makes sense to do it based on the. 0:19:18.438 --> 0:19:20.997 So you have on top of your normal an system. 0:19:21.121 --> 0:19:30.826 The only thing is now your inputting into your NIT system. 0:19:30.826 --> 0:19:38.723 You're no longer inputting the word embeddings. 0:19:38.918 --> 0:19:47.819 You're training the lower layers here which are trained more on the purely language model 0:19:47.819 --> 0:19:55.434 and on top you're putting into the NMT system where it now has the language. 0:19:55.815 --> 0:19:59.003 So here you can also view it here. 
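To make the first, parallel combination concrete before continuing with the serial one: each model produces a distribution over the next target word, and the two distributions are merged, for example by simple averaging or a weighted (log-linear) interpolation. This is only a minimal sketch; the interpolation weight `lam` is an assumed hyperparameter, and the gating or concatenation variants mentioned above would put trainable parameters in place of it.

```python
import numpy as np

def combine_parallel(p_nmt, p_lm, lam=0.5, log_linear=False):
    """Merge the next-word distributions of an NMT model and a language model.

    p_nmt, p_lm: arrays of shape (V,), each summing to 1.
    lam:         how much we trust the NMT model (illustrative value).
    """
    if log_linear:
        # weighted combination in log space, then renormalize
        log_p = lam * np.log(p_nmt + 1e-12) + (1 - lam) * np.log(p_lm + 1e-12)
        p = np.exp(log_p - log_p.max())
        return p / p.sum()
    # simple weighted average of the two probability distributions
    return lam * p_nmt + (1 - lam) * p_lm

# toy example with a vocabulary of four words
p_nmt = np.array([0.70, 0.10, 0.10, 0.10])
p_lm  = np.array([0.25, 0.25, 0.40, 0.10])
print(combine_parallel(p_nmt, p_lm))                      # plain average
print(combine_parallel(p_nmt, p_lm, lam=0.8, log_linear=True))
```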
0:19:59.003 --> 0:20:06.836 You have more contextual embeddings which no longer depend on the word, but they also 0:20:06.836 --> 0:20:10.661 depend on the context of the target site. 0:20:11.051 --> 0:20:21.797 More understanding of the source word. 0:20:21.881 --> 0:20:34.761 So if it's like the word can, for example, will be put in here always the same, independent 0:20:34.761 --> 0:20:41.060 of its use of can of beans, or if can do it. 0:20:41.701 --> 0:20:43.165 Empties. 0:20:44.364 --> 0:20:54.959 So another view, if you're remembering more the transformer based approach, is you have 0:20:54.959 --> 0:21:01.581 some layers, and the lower layers are purely language. 0:21:02.202 --> 0:21:08.052 This is purely language model and then at some point you're starting to attend to the 0:21:08.052 --> 0:21:08.596 source. 0:21:13.493 --> 0:21:20.774 Yes, so these are two ways of how you combine it, so run them in peril, or first do the language. 0:21:23.623 --> 0:21:26.147 Questions for the integration. 0:21:31.831 --> 0:21:35.034 Not really sure about the input of the. 0:21:35.475 --> 0:21:38.123 And this case with a sequence. 0:21:38.278 --> 0:21:50.721 Is the input and bedding, the target word embedding, or the actual word, and then we 0:21:50.721 --> 0:21:54.821 transfer it to a numerical. 0:21:56.176 --> 0:22:08.824 That depends on if you view the word embedding as part of the language model, so of course 0:22:08.824 --> 0:22:10.909 you first put. 0:22:11.691 --> 0:22:13.938 And then the word embedding there is the r&n. 0:22:14.314 --> 0:22:20.296 So of course you can view this together as your language model when you first do the word 0:22:20.296 --> 0:22:21.027 embedding. 0:22:21.401 --> 0:22:28.098 All you can say are the RNAs and this is like before. 0:22:28.098 --> 0:22:36.160 It's more a definition, but you're right, so what are the steps? 0:22:36.516 --> 0:22:46.655 One of these parts, you know, called a language model is definitionally not that important, 0:22:46.655 --> 0:22:47.978 but that's. 0:22:53.933 --> 0:23:02.812 So the question is how can you then train them and make make this this one work? 0:23:03.363 --> 0:23:15.492 So in the case where you combine the language of our abilities you can train them independently 0:23:15.492 --> 0:23:18.524 and then just put them. 0:23:18.918 --> 0:23:29.623 It might not be the best because we have no longer this ability before that. 0:23:29.623 --> 0:23:33.932 They optimal perform together. 0:23:34.514 --> 0:23:41.050 At least you need to summarize how much do you trust the one model and how much do you 0:23:41.050 --> 0:23:41.576 trust. 0:23:43.323 --> 0:23:48.529 But still in some cases usually it might be helpful if you have only data and so on. 0:23:48.928 --> 0:24:06.397 However, we have one specific situation that leads to the pearl leader is always mono legal 0:24:06.397 --> 0:24:07.537 data. 0:24:08.588 --> 0:24:17.693 So what we can also do is more the pre-training approach. 0:24:17.693 --> 0:24:24.601 We first train the language model and then. 0:24:24.704 --> 0:24:33.468 So the pre-training approach you first train on the monolingual data and then you join the. 0:24:33.933 --> 0:24:45.077 Of course, the model size is this way, but the data size is of course too big. 0:24:45.077 --> 0:24:52.413 You often have more monolingual data than parallel. 0:24:56.536 --> 0:24:57.901 Any ideas. 
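A minimal sketch of that pre-training approach: first train a target-side language model on the monolingual data, then reuse its weights when training the translation model on the parallel data. The module names, the matching decoder structure and the freezing step are illustrative assumptions, not code from the lecture.

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    """Target-side language model; its weights can later seed the NMT decoder."""
    def __init__(self, vocab_size=10000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)   # next-token logits at every position

# Stage 1: pre-train on monolingual target data (next-word prediction).
lm = DecoderLM()
# ... train lm on monolingual batches with cross-entropy ...

# Stage 2: copy the pre-trained weights into the decoder of the NMT model
# (`nmt.decoder` is a hypothetical module with matching submodules):
# nmt.decoder.embed.load_state_dict(lm.embed.state_dict())
# nmt.decoder.rnn.load_state_dict(lm.rnn.state_dict())

# Optionally freeze pre-trained parts so that fine-tuning on the small
# parallel data cannot overwrite ("forget") what was learned before:
for p in lm.embed.parameters():
    p.requires_grad = False
```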
0:25:04.064 --> 0:25:10.108 Had one example where this might also be helpful if you want to adapt to a domain so let's say 0:25:10.108 --> 0:25:16.281 you do medical sentences and if you want to translate medical sentences and you have monolingual 0:25:16.281 --> 0:25:22.007 data on the target side for medical sentences but you only have parallel data for general 0:25:22.007 --> 0:25:22.325 use. 0:25:23.083 --> 0:25:30.601 In this case it could be, or it's the most probable happen if you're learning out there 0:25:30.601 --> 0:25:38.804 what medical means, but then in your fine tuning step the model is forgetting everything about. 0:25:39.099 --> 0:25:42.340 So this type of priest training step is good. 0:25:42.340 --> 0:25:47.978 If your pretraining data is more general, very large, and then you're adapting. 0:25:48.428 --> 0:25:55.545 But in the task we have monolingual data, which should be used to adapt the system to 0:25:55.545 --> 0:25:57.780 some genre of topic style. 0:25:57.817 --> 0:26:08.572 Then, of course, this is not a good strategy because you might forget about everything up 0:26:08.572 --> 0:26:09.408 there. 0:26:09.649 --> 0:26:17.494 So then you have to check what you can do for them to see. 0:26:17.494 --> 0:26:25.738 You can freeze this part and you can do a direct combination. 0:26:25.945 --> 0:26:33.796 Where you train both of them, and then you train the language more and parallel on their 0:26:33.796 --> 0:26:34.942 one so that. 0:26:35.395 --> 0:26:37.687 Eh What You Learn in the Length. 0:26:37.937 --> 0:26:48.116 So the bit depends on what you want to combine is that you use a language model because it's. 0:26:48.548 --> 0:26:56.380 Then you normally don't really forget it because it's also in the or you use it to adapt to 0:26:56.380 --> 0:26:58.083 something specific. 0:27:01.001 --> 0:27:06.662 Then there is so this is a way of how we can make use of monolingual data. 0:27:07.968 --> 0:27:11.787 It seems to be the easiest one somehow. 0:27:11.787 --> 0:27:19.140 It's more similar to what we are doing with statistical machine translation. 0:27:19.140 --> 0:27:20.095 However,. 0:27:21.181 --> 0:27:27.211 Normally always beats this type of model, which in some view can be from the conceptual 0:27:27.211 --> 0:27:27.691 thing. 0:27:27.691 --> 0:27:31.460 At least it's even easier from the computational side. 0:27:31.460 --> 0:27:36.805 Sometimes it has a disadvantage that it's more problematic or more difficult. 0:27:40.560 --> 0:27:42.576 And the idea is okay. 0:27:42.576 --> 0:27:45.141 We have a monolingual data. 0:27:45.141 --> 0:27:50.822 We just translate it and then generate some type of parallel. 0:27:51.111 --> 0:28:00.465 So if you want to build a German to English system, your first trained German to English 0:28:00.465 --> 0:28:02.147 system on your. 0:28:02.402 --> 0:28:05.217 Then you have more pearl data. 0:28:05.217 --> 0:28:13.482 The interesting thing is if you then train on the joint thing, on the original pearl data, 0:28:13.482 --> 0:28:18.749 and on that one is artificial, it even normally improves. 0:28:18.918 --> 0:28:26.490 You can because you're not doing the same error all the time and you have some knowledge. 0:28:28.028 --> 0:28:40.080 With this first approach, however, there's one issue: why it might not work the best, 0:28:40.080 --> 0:28:43.163 so could you imagine? 0:28:49.409 --> 0:28:51.186 Ready a bit shown in image two. 0:28:53.113 --> 0:29:00.637 Have a few trains on bad quality data. 
0:29:00.637 --> 0:29:08.741 The system will learn also in the states. 0:29:08.828 --> 0:29:12.210 And as you're saying, it's a system always mistranslates. 0:29:13.493 --> 0:29:14.497 Something. 0:29:14.497 --> 0:29:23.623 Then you will learn that this is correct because now it's training data and you will even encourage 0:29:23.623 --> 0:29:25.996 it to make it more often. 0:29:25.996 --> 0:29:29.921 So the problem on training on your own is. 0:29:30.150 --> 0:29:34.222 But however, as you systematically do, you even enforce more and will even do more. 0:29:34.654 --> 0:29:37.401 So that might not be the best solution. 0:29:37.401 --> 0:29:40.148 Do any idea how you could do it better? 0:29:44.404 --> 0:29:57.653 If you had something else to prevent some systematic problems, yes, that is one way. 0:30:04.624 --> 0:30:10.809 The problem is yeah, the translations are not perfect, so the output and you're learning 0:30:10.809 --> 0:30:11.990 something wrong. 0:30:11.990 --> 0:30:17.967 Normally it's less bad if your inputs are somewhat bad, but your outputs are perfect. 0:30:18.538 --> 0:30:26.670 So if your inputs are wrong you maybe learn that if you're doing this wrong input you're 0:30:26.670 --> 0:30:30.782 generating something correct but you're not. 0:30:31.511 --> 0:30:40.911 So often the case is that it's more important that your target is correct. 0:30:40.911 --> 0:30:47.052 If on the source there is something crazy, then. 0:30:47.347 --> 0:30:52.184 But you can assume in your application scenario you hope that you mainly get correct input. 0:30:52.572 --> 0:31:02.126 So that is not harming you as much, and in machine translation we have some of these symmetries, 0:31:02.126 --> 0:31:02.520 so. 0:31:02.762 --> 0:31:04.578 And also the other way around. 0:31:04.578 --> 0:31:09.792 It's a very similar task, so there's a task to translate from German to English, but the 0:31:09.792 --> 0:31:13.892 task to translate from English to German is very similar and helpful. 0:31:14.094 --> 0:31:19.313 So what we can do is, we can just switch it initially and generate the data the other way 0:31:19.313 --> 0:31:19.777 around. 0:31:20.120 --> 0:31:25.699 So what we are doing here is we are starting with an English to German system. 0:31:25.699 --> 0:31:32.126 Then we are translating the English data into German, where the German is maybe not really 0:31:32.126 --> 0:31:32.903 very nice. 0:31:33.293 --> 0:31:46.045 And then we're training on our original data and on the back translated data where only 0:31:46.045 --> 0:31:51.696 the input is good and it's like human. 0:31:52.632 --> 0:32:01.622 So here we have now the advantage that always our target site is of human quality and the 0:32:01.622 --> 0:32:02.322 input. 0:32:03.583 --> 0:32:08.998 And then this helps us to get really good form. 0:32:08.998 --> 0:32:15.428 There's one important difference if you think about the. 0:32:21.341 --> 0:32:31.604 It's too obvious here we need a target side monolingual layer and the first. 0:32:31.931 --> 0:32:47.143 So back translation is normally working if you have target size parallel and not search 0:32:47.143 --> 0:32:48.180 side. 0:32:48.448 --> 0:32:55.493 Might be also a bit if you think about it understandable that it's more important to 0:32:55.493 --> 0:32:56.819 be like better. 
0:32:57.117 --> 0:33:04.472 On the source side you have to understand the content, on the target side you have to generate 0:33:04.472 --> 0:33:12.232 real sentences, and somehow it's more difficult to generate something than to only understand it. 0:33:17.617 --> 0:33:29.916 One other thing: it's drawn here as if both parts were of similar size, but normally 0:33:29.916 --> 0:33:30.701 in practice 0:33:31.051 --> 0:33:32.978 there's a lot more monolingual data than parallel data. 0:33:33.253 --> 0:33:36.683 So the question is, should I really take all of my data? 0:33:36.683 --> 0:33:38.554 There's two problems with it. 0:33:38.554 --> 0:33:42.981 Of course, it's expensive because you have to translate all this data. 0:33:42.981 --> 0:33:48.407 And secondly, even though now only your synthetic source side is wrong, it might be that you 0:33:48.407 --> 0:33:51.213 still bring wrong correlations in there. 0:33:51.651 --> 0:34:01.061 So if you don't know better, a normally good starting point is to take an equal amount of back-translated data as you have 0:34:01.061 --> 0:34:02.662 parallel data. 0:34:02.963 --> 0:34:05.366 Of course, it depends on the use case. 0:34:05.366 --> 0:34:07.215 If there is very little parallel data, 0:34:07.215 --> 0:34:08.510 it makes more sense to use more. 0:34:08.688 --> 0:34:14.273 It also depends on how good your quality is here, so the better the reverse model is, the 0:34:14.273 --> 0:34:17.510 more data you might use, because the quality is better. 0:34:17.510 --> 0:34:23.158 So it depends on a lot of things, but yeah, a rule of thumb and a good general way often 0:34:23.158 --> 0:34:24.808 is to have equal amounts. 0:34:26.646 --> 0:34:31.233 And you can of course do that iteratively. 0:34:31.233 --> 0:34:39.039 I said already that the quality at the end, of course, depends on this reverse system, 0:34:39.039 --> 0:34:46.163 because the better this system is, the better your synthetic data. 0:34:47.207 --> 0:34:50.949 That leads to what is referred to as iterated back translation. 0:34:51.291 --> 0:34:56.911 So you're training a model on English to German and you translate the data. 0:34:56.957 --> 0:35:03.397 Then you train a model on German to English with the additional data. 0:35:03.397 --> 0:35:11.954 Then you translate German data again, and then you train again your first 0:35:11.954 --> 0:35:12.414 one. 0:35:12.414 --> 0:35:14.346 So you iterate that. 0:35:14.334 --> 0:35:19.653 Because now your system is better, because it's not only trained on the small data but 0:35:19.653 --> 0:35:22.003 additionally on back-translated data. 0:35:22.442 --> 0:35:24.458 And so you can get better. 0:35:24.764 --> 0:35:31.739 However, typically you can stop quite early, so maybe one iteration is good, but then you 0:35:31.739 --> 0:35:35.072 have diminishing gains after two or three. 0:35:35.935 --> 0:35:44.094 There's only a very slight difference then, because you of course need quite a big difference 0:35:44.094 --> 0:35:45.937 in the quality here 0:35:45.937 --> 0:35:46.814 in order to see gains. 0:35:47.207 --> 0:35:59.810 Which is not too bad, because it means you can already do it with a system that has relatively 0:35:59.810 --> 0:36:02.245 bad performance. 0:36:03.723 --> 0:36:10.323 And whether you replace or append the synthetic data in the next iteration, that's a design decision, I would say. 0:36:10.550 --> 0:36:16.617 It's better to replace it because you then have higher quality synthetic data, but you of course keep your 0:36:16.617 --> 0:36:18.310 high-quality real data. 0:36:18.310 --> 0:36:21.626 Then I think normally it's okay to replace it.
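To make the recipe concrete, here is a minimal sketch of one back-translation round as just described: target-side monolingual data is translated with a reverse (target-to-source) system, the synthetic source is paired with the human-written target, and roughly equal amounts of real and synthetic data are used for training. The callables `translate` and `train_nmt` are hypothetical placeholders, not real library functions.

```python
import random

def back_translation_round(parallel, mono_tgt, reverse_model, translate, train_nmt):
    """One round of back-translation (all callables are hypothetical placeholders).

    parallel:      list of (src, tgt) human-translated pairs
    mono_tgt:      list of target-language monolingual sentences
    reverse_model: a tgt->src system used only to create synthetic sources
    """
    # 1) rule of thumb from the lecture: use about as much monolingual data
    #    as you have real parallel data
    sample = random.sample(mono_tgt, min(len(mono_tgt), len(parallel)))

    # 2) synthetic pairs: machine-translated source, human-written target
    synthetic = [(translate(reverse_model, tgt), tgt) for tgt in sample]

    # 3) train the forward system on real plus synthetic data together
    return train_nmt(parallel + synthetic)

# toy demo with dummy stand-ins, just to show the data flow;
# iterated back-translation would alternate the two directions, each round
# re-translating the monolingual data with the improved reverse system.
par = [("ein Haus", "a house"), ("ein Baum", "a tree")]
mono = ["a dog", "a cat", "a car"]
result = back_translation_round(
    par, mono,
    reverse_model=None,
    translate=lambda model, sent: "<synthetic " + sent + ">",  # stand-in MT output
    train_nmt=lambda data: data)                               # stand-in training
print(result)
```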
0:36:21.626 --> 0:36:24.518 Of course you can also try to append it. 0:36:24.518 --> 0:36:28.398 I would assume it's not too much of a difference, but. 0:36:34.414 --> 0:36:40.567 That's about like using monolingual data before we go into the pre-train models. 0:36:40.567 --> 0:36:42.998 Do you have any more questions? 0:36:49.029 --> 0:36:57.521 Yes, so the other thing we can do and which is recently more and more successful and even 0:36:57.521 --> 0:37:05.731 more successful since we have these really large language models where you can even do 0:37:05.731 --> 0:37:08.562 a translation task with this. 0:37:08.688 --> 0:37:16.132 So here the idea is you learn a representation of one task and then you use this representation. 0:37:16.576 --> 0:37:27.276 It was made maybe like one of the first where it's really used largely is doing something 0:37:27.276 --> 0:37:35.954 like a bird which you pre-train on purely text editor and then you take. 0:37:36.496 --> 0:37:42.952 And the one big advantage, of course, is that people can only share data but also pre-train. 0:37:43.423 --> 0:37:53.247 So if you think of the recent models and the large language models which are available, 0:37:53.247 --> 0:37:59.611 it is not possible for universities often to train them. 0:37:59.919 --> 0:38:09.413 Think it costs several millions to train the model just if you rent the GPS from some cloud 0:38:09.413 --> 0:38:15.398 company and train that the cost of training these models. 0:38:15.475 --> 0:38:21.735 And guess as a student project you won't have the budget to like build these models. 0:38:21.801 --> 0:38:24.630 So another idea is what you can do is okay. 0:38:24.630 --> 0:38:27.331 Maybe if these months are once available. 0:38:27.467 --> 0:38:34.723 You can take them and use them as a resource similar to pure text, and you can now build 0:38:34.723 --> 0:38:41.734 models which some will learn not only from from data but also from other models which 0:38:41.734 --> 0:38:44.506 are maybe trained on other tasks. 0:38:44.844 --> 0:38:48.647 So it's a quite new way of thinking of how to train. 0:38:48.647 --> 0:38:53.885 So we are not only learning from examples, but we might also learn from. 0:38:54.534 --> 0:39:03.937 The nice thing is that this type of training where we are not learning directly from data 0:39:03.937 --> 0:39:07.071 by learning from other tasks. 0:39:07.427 --> 0:39:15.581 So the main idea to start with is to have a personal initial task, and typically this 0:39:15.581 --> 0:39:24.425 initial task is for: And if you're working with, that means you're training pure taxator 0:39:24.425 --> 0:39:30.547 because you have the largest amount of data from the Internet. 0:39:30.951 --> 0:39:35.857 And then you're defining some type of task in order to do your quick training. 0:39:36.176 --> 0:39:42.056 And: There's a typical task you can train on. 0:39:42.056 --> 0:39:52.709 That is like the language modeling text, so to predict the next word, all we have related. 0:39:52.932 --> 0:40:04.654 But to predict something which you have not in the input is a task which is easy to generate. 0:40:04.654 --> 0:40:06.150 That's why. 0:40:06.366 --> 0:40:14.005 By yourself, on the other hand, you need a lot of knowledge, and that is the other thing 0:40:14.005 --> 0:40:15.120 you need to. 0:40:15.735 --> 0:40:23.690 Because there is this idea that the meaning of the word heavily depends on the context 0:40:23.690 --> 0:40:24.695 it's used. 
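The "predict the next word" objective described here can be written down very compactly: the labels are just the input tokens shifted by one position, so no annotation beyond the raw text is needed. A minimal sketch with illustrative sizes and a small recurrent model; the real large language models use the same objective with much bigger Transformer architectures.

```python
import torch
import torch.nn as nn

# Self-supervised next-word prediction: inputs are the tokens,
# labels are the same tokens shifted by one position.
vocab, d = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d), nn.LSTM(d, d, batch_first=True))
proj = nn.Linear(d, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 20))        # a batch of 8 toy "sentences"
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from the prefix

hidden, _ = model(inputs)                        # (8, 19, d)
logits = proj(hidden)                            # (8, 19, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()                                  # no labels needed beyond the text itself
print(float(loss))
```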
0:40:25.145 --> 0:40:36.087 So can give you a sentence with some gibberish word and there's some name, and although you've 0:40:36.087 --> 0:40:41.616 never read the name, you will just assume that. 0:40:42.062 --> 0:40:48.290 Exactly the same thing, the models can also learn something about the words in there by 0:40:48.290 --> 0:40:49.139 just using. 0:40:49.649 --> 0:40:53.246 So that is typically the new. 0:40:53.246 --> 0:40:59.839 Then we can use this model, use our data to train the. 0:41:00.800 --> 0:41:04.703 Of course, it might need to adapt the system. 0:41:04.703 --> 0:41:07.672 To do that we might use only some. 0:41:07.627 --> 0:41:16.326 Part of the pre-train model in there is that we have seen that a bit already in the RNA 0:41:16.326 --> 0:41:17.215 case is. 0:41:17.437 --> 0:41:22.670 So you can view the RN as one of these approaches. 0:41:22.670 --> 0:41:28.518 You train the RN language while on large pre-train data. 0:41:28.518 --> 0:41:32.314 Then you put it somewhere into your. 0:41:33.653 --> 0:41:37.415 So this gives you the ability to really do these types of tests. 0:41:37.877 --> 0:41:49.027 So that you can build a system which uses knowledge, which is just trained on large amounts 0:41:49.027 --> 0:41:52.299 of data and extracting it. 0:41:52.299 --> 0:41:53.874 So it knows. 0:41:56.376 --> 0:42:01.561 So the question is that yeah, what type of information so what type of models can you? 0:42:01.821 --> 0:42:05.278 And we want to today look at briefly at three. 0:42:05.725 --> 0:42:08.474 Was initially done. 0:42:08.474 --> 0:42:21.118 It wasn't as famous as in machine translation as in other things, but it's also used there. 0:42:21.221 --> 0:42:28.974 So where you have this mapping from the one hot to a small continuous word representation? 0:42:29.229 --> 0:42:37.891 Using this one in your anthrax you can, for example, replace the embedding layer by the 0:42:37.891 --> 0:42:38.776 trained. 0:42:39.139 --> 0:42:41.832 That is helpful to be a really small amount of data. 0:42:42.922 --> 0:42:48.520 You're always in this pre training phase and have the thing the advantage is. 0:42:48.468 --> 0:42:55.515 More data, that's the trade off so you can get better. 0:42:55.515 --> 0:43:00.128 Disadvantage is, does anybody have? 0:43:04.624 --> 0:43:12.173 Was one of the mentioned today, even like big advantages of the system compared to previous. 0:43:20.660 --> 0:43:26.781 Where one advantage was the end to end training so that all parameters and all components are 0:43:26.781 --> 0:43:27.952 optimal together. 0:43:28.208 --> 0:43:33.386 If you know pre-train something on one pass, it's maybe no longer optimal fitting to everything. 0:43:33.893 --> 0:43:40.338 So that is similar to what should do pretaining or not. 0:43:40.338 --> 0:43:48.163 It depends on how important everything is optimal together and how. 0:43:48.388 --> 0:44:00.552 If the state is a high quality of large amount, the pre trained one is just so much better. 0:44:00.600 --> 0:44:11.215 Standing everything optimal together, we would use random actions for amazing vices. 0:44:11.691 --> 0:44:18.791 Mean, we assume some structures that are trained basically. 0:44:18.791 --> 0:44:26.364 Yes, if you're fine tuning everything, it might be the problem. 0:44:26.766 --> 0:44:31.139 But often yeah, in some way right, so often it's not about. 
0:44:31.139 --> 0:44:37.624 You're really worse with some pre-trained molecules because you're going already in some 0:44:37.624 --> 0:44:43.236 direction, and if this is not really optimal for you, it might be difficult. 0:44:43.603 --> 0:44:51.774 But the bigger is, if you're not getting better because you have a decent amount of data, it's 0:44:51.774 --> 0:44:52.978 so different. 0:44:53.153 --> 0:45:04.884 But mean initially it wasn't a machine translation done so much because there was more data in 0:45:04.884 --> 0:45:09.452 the task, but now it's really large. 0:45:12.632 --> 0:45:14.188 The other one is then OK. 0:45:14.188 --> 0:45:18.258 Now it's always like how much of the model do your pre-track a bit? 0:45:18.658 --> 0:45:25.057 The other one you can do is tack contextual words and then something like bird or a robota 0:45:25.057 --> 0:45:31.667 where you train more already as sequence models and the embeddings you're using are no longer 0:45:31.667 --> 0:45:35.605 specific for words but they're also taking the context. 0:45:35.875 --> 0:45:54.425 Embedding you're using is no longer only depending on the word itself but on the whole sentence. 0:45:55.415 --> 0:46:03.714 And of course you can use similar things also in the decoder just by having layers which 0:46:03.714 --> 0:46:09.122 don't have access to the source but there it's still not. 0:46:11.451 --> 0:46:19.044 And finally, and then we'll look at the end, you can also have models which are already. 0:46:19.419 --> 0:46:28.605 So you may be training a sequence model, but not a monolingual data. 0:46:28.605 --> 0:46:35.128 Of course you have to make it a bit challenging. 0:46:36.156 --> 0:46:43.445 But the idea is really you're pre-training your whole model and then you're fine tuning. 0:46:47.227 --> 0:46:59.487 But let's first do a bit of step back and look into what are the differences. 0:46:59.487 --> 0:47:02.159 The first thing. 0:47:02.382 --> 0:47:06.870 The word embeddings are just this first layer. 0:47:06.870 --> 0:47:12.027 You can train them with feed-forward neural networks. 0:47:12.212 --> 0:47:25.683 But you can also train them in language model, and by now you hopefully have also seen that 0:47:25.683 --> 0:47:27.733 you can also. 0:47:30.130 --> 0:47:41.558 So this is how you can train them, and you are training them to predict the next word, 0:47:41.558 --> 0:47:45.236 the typical language model. 0:47:45.525 --> 0:47:52.494 And that is what is now referred to as a South Supervised Learning, and for example all the 0:47:52.494 --> 0:47:56.357 big large language models like Chat, gp and so on. 0:47:56.357 --> 0:48:03.098 They are trained at an end or feet, but exactly with this objective to predict the next. 0:48:03.823 --> 0:48:12.847 So that is where you can hopefully learn what a word is used because you always try to predict 0:48:12.847 --> 0:48:17.692 the next word and then you have a ready intuition. 0:48:19.619 --> 0:48:25.374 In the word embedding, why do people first look at the word embeddings and the use of 0:48:25.374 --> 0:48:27.582 word embeddings for other tasks? 0:48:27.582 --> 0:48:32.600 The main advantage is it might be only the first layer you would think of. 0:48:32.600 --> 0:48:34.474 What does it really matter? 0:48:34.474 --> 0:48:39.426 However, it is the layer where you typically have most of the parameters. 
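A small back-of-the-envelope calculation makes that point about the parameters concrete. The numbers are round, illustrative values in line with the figures mentioned next (a vocabulary of about fifty thousand and a hidden size of a few hundred); the LSTM count is only approximate.

```python
# Rough parameter counts for a small recurrent translation/language model;
# the numbers are round, illustrative values.
vocab_size = 50_000
hidden     = 256

embedding_params = vocab_size * hidden                    # input embedding table
rnn_params       = 4 * (hidden * hidden * 2 + hidden)     # one LSTM layer (approx.)
output_params    = hidden * vocab_size                    # projection back to the vocabulary

total = embedding_params + rnn_params + output_params
print(f"embedding: {embedding_params/1e6:.1f}M parameters "
      f"({100*embedding_params/total:.0f}% of {total/1e6:.1f}M total)")
# -> the embedding and output tables dominate; the recurrent layer is tiny by
#    comparison, so reusing pre-trained embeddings already covers a large
#    share of the model's parameters.
```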
0:48:39.879 --> 0:48:52.201 Of course, if you have trained on most of your parameters already on the large data, 0:48:52.201 --> 0:48:59.304 then on your target data you have to train less. 0:48:59.259 --> 0:49:05.841 This big difference that your input size is so much bigger than the size of the normal 0:49:05.841 --> 0:49:06.522 in size. 0:49:06.626 --> 0:49:16.551 So it's a normal size, maybe two hundred and fifty, but your input embedding besides vocabulary 0:49:16.551 --> 0:49:20.583 size is something like fifty thousand. 0:49:23.123 --> 0:49:30.163 And bending while here you see, it's only like times as much in the layer. 0:49:30.750 --> 0:49:36.747 So here's where most of your parameters are, which means if you already replace the word 0:49:36.747 --> 0:49:41.329 embeddings, it might look a bit small in your overall architecture. 0:49:41.329 --> 0:49:47.056 It's where most of the things are, and if you're doing that, you already have really 0:49:47.056 --> 0:49:48.876 big games and can do that. 0:49:57.637 --> 0:50:04.301 The thing is we have seen these wooden beddings can be very good used for other taps. 0:50:04.784 --> 0:50:08.921 Now you learn some relation between words. 0:50:08.921 --> 0:50:14.790 If you're doing this type of language modeling, you predict. 0:50:15.215 --> 0:50:21.532 The one thing is, of course, you have a lot of data, so the one question is we want to 0:50:21.532 --> 0:50:25.961 have a lot of data to good training models, the other thing. 0:50:25.961 --> 0:50:28.721 The tasks need to be somewhat useful. 0:50:29.169 --> 0:50:41.905 If you would predict the first letter of the word, it has to be a task where you need some 0:50:41.905 --> 0:50:45.124 syntactic information. 0:50:45.545 --> 0:50:53.066 The interesting thing is people have looked at these world embeddings here in a language 0:50:53.066 --> 0:50:53.658 model. 0:50:53.954 --> 0:51:04.224 And you're looking at the word embeddings, which are these vectors here. 0:51:04.224 --> 0:51:09.289 You can ask yourself, do they look? 0:51:09.489 --> 0:51:15.122 Don't know if your view is listening to artificial advance artificial intelligence. 0:51:15.515 --> 0:51:23.994 We had on yesterday how to do this type of representation, but you can do this kind of 0:51:23.994 --> 0:51:29.646 representation, and now you're seeing interesting things. 0:51:30.810 --> 0:51:41.248 Now you can represent it here in a three dimensional space with a dimension reduction. 0:51:41.248 --> 0:51:46.886 Then you can look into it and the interesting. 0:51:47.447 --> 0:51:57.539 So this vector between the male and the female version of something is not the same, but it's 0:51:57.539 --> 0:51:58.505 related. 0:51:58.718 --> 0:52:11.256 So you can do a bit of nuts, you subtract this vector, add this vector, and then you 0:52:11.256 --> 0:52:14.501 look around this one. 0:52:14.894 --> 0:52:19.691 So that means okay, there is really something stored, some information stored in that book. 0:52:20.040 --> 0:52:25.003 Similar you can do it with Buck and since you see here swimming slam walk and walk. 0:52:25.265 --> 0:52:42.534 So again these vectors are not the same, but they're related for going from here to here. 0:52:43.623 --> 0:52:47.508 Are semantically the relations between city and capital? 0:52:47.508 --> 0:52:49.757 You have exactly the same thing. 0:52:51.191 --> 0:52:57.857 People having done question answering about that if they show these embeddings and. 
0:52:58.218 --> 0:53:05.198 Or you can also, if you don't trust the dimensionality reduction, because you say maybe 0:53:05.198 --> 0:53:06.705 something gets lost there, 0:53:06.967 --> 0:53:16.473 you can also look into what happens really in the high-dimensional space. 0:53:16.473 --> 0:53:22.227 You can look at what is the nearest neighbor. 0:53:22.482 --> 0:53:29.605 So you can take the relationship between France and Paris and add it to Italy, and you nicely see that you end up near Rome. 0:53:30.010 --> 0:53:33.082 You can do big and bigger, and you have small and smaller. 0:53:33.593 --> 0:53:38.202 It doesn't work everywhere. 0:53:38.202 --> 0:53:49.393 There are also some which only sometimes work. 0:53:51.491 --> 0:53:56.832 You can do what a person is doing, the profession, for famous ones. 0:53:56.832 --> 0:54:05.800 Of course, only ones like Einstein and scientist; that Messi comes out as midfielder is not completely 0:54:05.800 --> 0:54:06.707 correct. 0:54:06.846 --> 0:54:09.781 You see the examples are a bit old. 0:54:09.781 --> 0:54:15.050 The politicians are no longer in office, but the model of course doesn't know that. 0:54:16.957 --> 0:54:29.003 What people have done there, of course, especially at the beginning, is to say: 0:54:29.309 --> 0:54:36.272 we're not really interested in the language model performance, 0:54:36.272 --> 0:54:38.013 we're only interested in the embeddings. 0:54:38.338 --> 0:54:40.634 I think that's something good to keep in mind. 0:54:40.634 --> 0:54:42.688 What are we really interested in? 0:54:42.688 --> 0:54:44.681 Do we really want to have an RNN? 0:54:44.681 --> 0:54:44.923 No. 0:54:44.923 --> 0:54:48.608 In this case we are only interested in this type of mapping. 0:54:49.169 --> 0:54:55.536 And so very successful was this word2vec. 0:54:55.535 --> 0:55:02.597 We are not training a real language model but making it even simpler, doing for example 0:55:02.597 --> 0:55:04.660 continuous bag of words. 0:55:04.660 --> 0:55:11.801 We are just having four input tokens and we are predicting what is the word in the middle, 0:55:11.801 --> 0:55:15.054 and this is just like two linear layers. 0:55:15.615 --> 0:55:22.019 It's simplifying things and making the calculation faster, because that is all we're 0:55:22.019 --> 0:55:22.873 interested in. 0:55:23.263 --> 0:55:34.059 Or this continuous skip-gram model, the other of these two models. 0:55:34.234 --> 0:55:38.273 You have one input word and it's the other way around: 0:55:38.273 --> 0:55:41.651 you're predicting the four words around it. 0:55:41.651 --> 0:55:43.047 It's very similar. 0:55:43.047 --> 0:55:48.702 The task is in the end very similar, but in all of them it's about learning word embeddings. 0:55:51.131 --> 0:56:01.416 Before we go into the next part, are there any questions about these normal word vectors? 0:56:04.564 --> 0:56:07.562 The next thing is contextual word embeddings. 0:56:07.562 --> 0:56:08.670 The idea is yes, 0:56:08.670 --> 0:56:09.778 this is helpful. 0:56:09.778 --> 0:56:14.080 However, we might be able to get more out of the monolingual data. 0:56:14.080 --> 0:56:19.164 For example, if you think about the word can, it can have different meanings. 0:56:19.419 --> 0:56:32.619 And now in the word embedding you have an overlap of these two meanings, so it represents 0:56:32.619 --> 0:56:33.592 both of them. 0:56:34.834 --> 0:56:40.318 But we might be able to disambiguate these already in the pre-trained model, because they are used in completely 0:56:40.318 --> 0:56:41.041 different contexts.
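Before moving on to contextual embeddings, here is a minimal sketch of the continuous bag-of-words model just described: four context words go in, the middle word is predicted, and the model is essentially two linear maps. The vocabulary size, dimensions and the training snippet are illustrative assumptions, not the original word2vec code.

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Continuous bag of words: predict the middle word from its context.
    Essentially two linear maps; the first one is the embedding table we keep."""
    def __init__(self, vocab_size=5000, dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # the embeddings we care about
        self.out = nn.Linear(dim, vocab_size)        # only needed during training

    def forward(self, context):                      # context: (batch, 4) word ids
        avg = self.embed(context).mean(dim=1)        # average the context vectors
        return self.out(avg)                         # logits for the middle word

model = CBOW()
context = torch.randint(0, 5000, (32, 4))            # e.g. two words left, two right
middle = torch.randint(0, 5000, (32,))
loss = nn.functional.cross_entropy(model(context), middle)
loss.backward()
# After training, model.embed.weight is the word-embedding matrix; the
# skip-gram variant just flips the task: given the middle word, predict
# the surrounding context words.
```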
0:56:41.701 --> 0:56:50.998 So if we can have a model which can not only represent the word, but it can also represent 0:56:50.998 --> 0:56:58.660 the meaning of the word within the context, it might be even more helpful. 0:56:59.139 --> 0:57:03.342 So then we're going to contextual word embeddings. 0:57:03.342 --> 0:57:07.709 We're really having a representation of the context. 0:57:07.787 --> 0:57:11.519 And we have a very good architecture for that already. 0:57:11.691 --> 0:57:20.551 It's like our base language model where you have to do the hidden state. 0:57:20.551 --> 0:57:29.290 The hidden state represents what is apparently said, but it's focusing. 0:57:29.509 --> 0:57:43.814 The first one doing that is in something like the Elmo paper where they instead of like this 0:57:43.814 --> 0:57:48.121 is a normal language model. 0:57:48.008 --> 0:57:52.735 Put in the third predicting the fourth and so on, so you're always predicting the next 0:57:52.735 --> 0:57:53.007 one. 0:57:53.193 --> 0:57:57.919 The architecture of the heaven works embedding layer, and then two are an layer here. 0:57:57.919 --> 0:58:04.255 For example: And now instead of using this one in the end you're using here this one. 0:58:04.364 --> 0:58:11.245 This represents the meaning of this word mainly in the context of what we have seen before. 0:58:11.871 --> 0:58:22.909 We can train it in a language model or predicting the next word, but we have more information, 0:58:22.909 --> 0:58:26.162 train there, and therefore. 0:58:27.167 --> 0:58:31.168 And there is one even done currently in. 0:58:31.168 --> 0:58:40.536 The only difference is that we have more layers, bigger size, and we're using transform on here 0:58:40.536 --> 0:58:44.634 or self-attention instead of the R&F. 0:58:44.634 --> 0:58:45.122 But. 0:58:46.746 --> 0:58:52.737 However, if you look at this contextual representation, they might not be perfect. 0:58:52.737 --> 0:58:58.584 So what do you think of this one as contextual representation of the third word? 0:58:58.584 --> 0:59:02.914 Do you see anything which is not really considered in this? 0:59:07.587 --> 0:59:11.492 Only one way yes, so that is not a big issue here. 0:59:11.492 --> 0:59:18.154 It's representing a string in the context of a sentence, however, only in the context. 0:59:18.558 --> 0:59:28.394 However, we have an architecture which can also take both sides and we have used it in 0:59:28.394 --> 0:59:30.203 the ink holder. 0:59:30.630 --> 0:59:34.269 So we could do the and easily only us in the backboard direction. 0:59:34.874 --> 0:59:46.889 By just having the other way around, and then we couldn't combine the forward and into a 0:59:46.889 --> 0:59:49.184 joint one where. 0:59:49.329 --> 0:59:50.861 So You Have a Word embedding. 0:59:51.011 --> 1:00:03.910 Then you have two states, one with a forward, and then one with a backward. 1:00:03.910 --> 1:00:10.359 For example, take the representation. 1:00:10.490 --> 1:00:21.903 Now this same here represents mainly this word because this is where what both focuses 1:00:21.903 --> 1:00:30.561 on is what is happening last but is also looking at the previous. 1:00:31.731 --> 1:00:41.063 However, there is a bit different when training that as a language model you already have. 1:00:43.203 --> 1:00:44.956 Maybe there's again this masking. 1:00:46.546 --> 1:00:47.814 That is one solution. 
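The contextual-embedding idea just described can be sketched very directly: instead of taking the static embedding of word t, you take the recurrent hidden state at position t of a language model. All names and sizes below are illustrative; note in the code why this still only covers the left context, which is exactly the limitation discussed next.

```python
import torch
import torch.nn as nn

class ForwardLMEncoder(nn.Module):
    """Contextual word representations from a left-to-right RNN language model."""
    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def contextual_embeddings(self, tokens):          # tokens: (batch, T)
        states, _ = self.rnn(self.embed(tokens))      # (batch, T, dim)
        # states[:, t] represents token t *in context* -- but only the context
        # to its left, since the recurrence runs left to right.
        return states

enc = ForwardLMEncoder()
sentence = torch.randint(0, 5000, (1, 6))
reps = enc.contextual_embeddings(sentence)
print(reps.shape)   # torch.Size([1, 6, 128])
```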
1:00:47.814 --> 1:00:53.407 First of all, why we can't do it is the information you leave it, so you cannot just predict the 1:00:53.407 --> 1:00:54.041 next word. 1:00:54.041 --> 1:00:58.135 If we just predict the next word in this type of model, that's a very. 1:00:58.738 --> 1:01:04.590 You know the next word because it's influencing this hidden stage and then it's very easy so 1:01:04.590 --> 1:01:07.736 predicting something you know is not a good task. 1:01:07.736 --> 1:01:09.812 This is what I mentioned before. 1:01:09.812 --> 1:01:13.336 You have to define somehow a task which is challenging. 1:01:13.753 --> 1:01:19.007 Because in this case one would, I mean, the system would just ignore the states and what 1:01:19.007 --> 1:01:22.961 it would learn is that you copy this information directly in here. 1:01:23.343 --> 1:01:31.462 So it would mainly be representing this word and you would have a perfect model because 1:01:31.462 --> 1:01:38.290 you only need to find an encoding where you can encode all words somehow. 1:01:38.458 --> 1:01:44.046 The only thing that will learn is that tenor and coat all my words in this upper hidden. 1:01:44.985 --> 1:01:49.584 And then, of course, it's not really useful. 1:01:49.584 --> 1:01:53.775 We need to find a bit of different ways. 1:01:55.295 --> 1:01:59.440 There is a masking one. 1:01:59.440 --> 1:02:06.003 I'll come to that shortly just a bit. 1:02:06.003 --> 1:02:14.466 The other thing is not to directly combine them. 1:02:14.594 --> 1:02:22.276 So you never merge the states only at the end. 1:02:22.276 --> 1:02:33.717 The representation of the words is now from the forward and the next. 1:02:33.873 --> 1:02:35.964 So it's always a hidden state before that. 1:02:36.696 --> 1:02:41.273 And these two you're joined now to your to the representation. 1:02:42.022 --> 1:02:50.730 And then you have now a representation also about the whole sentence for the word, but 1:02:50.730 --> 1:02:53.933 there's no information leakage. 1:02:53.933 --> 1:02:59.839 One way of doing this is instead of doing a bidirectional. 1:03:00.380 --> 1:03:08.079 You can do that, of course, in all layers. 1:03:08.079 --> 1:03:16.315 In the end you have different bedding states. 1:03:16.596 --> 1:03:20.246 However, it's a bit of a complicated. 1:03:20.246 --> 1:03:25.241 You have to keep up separate and then merge things. 1:03:27.968 --> 1:03:33.007 And that is is the moment where, like the, the peak. 1:03:34.894 --> 1:03:42.018 Idea of the big success of the bird model was used, maybe in bidirector case. 1:03:42.018 --> 1:03:48.319 It's not good to do the next word prediction, but we can do masking. 1:03:48.308 --> 1:03:59.618 And masking maybe means we do a prediction of something in the middle or some words. 1:03:59.618 --> 1:04:08.000 If we have the input, we're just putting noise into the input. 1:04:08.048 --> 1:04:14.040 Now there can be no information leakage because this wasn't in the input. 1:04:14.040 --> 1:04:15.336 Now predicting. 1:04:16.776 --> 1:04:20.524 So thereby we don't do any assumption again about our models. 1:04:20.524 --> 1:04:24.815 It doesn't need to be a forward model or a backward model or anything. 1:04:24.815 --> 1:04:29.469 You can have any type of architecture and you can always predict the street. 1:04:30.530 --> 1:04:39.112 There is maybe one disadvantage: do you see what could be a bit of a problem this type 1:04:39.112 --> 1:04:40.098 compared? 
1:05:00.000 --> 1:05:05.920 Yes, so yeah mean you cannot cross mass more, but to see it more globally just twist assume 1:05:05.920 --> 1:05:07.142 you only mask one. 1:05:07.142 --> 1:05:12.676 For the whole sentence we get one feedback signal like what is the word street, so we 1:05:12.676 --> 1:05:16.280 have one training sample, a model for the whole center. 1:05:17.397 --> 1:05:19.461 The language modeling paste. 1:05:19.461 --> 1:05:21.240 We predicted here three. 1:05:21.240 --> 1:05:22.947 We predicted here four. 1:05:22.947 --> 1:05:24.655 We predicted here five. 1:05:25.005 --> 1:05:26.973 So we have a number of tokens. 1:05:26.973 --> 1:05:30.974 For each token we have a feet bed and saying what is the best. 1:05:31.211 --> 1:05:39.369 So in this case of course this is a lot less efficient because we are getting less feedback 1:05:39.369 --> 1:05:45.754 signals on what we should predict compared to models where we're doing. 1:05:48.348 --> 1:05:54.847 So in birth the main idea this bidirectional model was masking. 1:05:54.847 --> 1:05:59.721 It was the first large model using transformer. 1:06:00.320 --> 1:06:06.326 There are two more minor changes. 1:06:06.326 --> 1:06:16.573 We'll see that this next word prediction is another task. 1:06:16.957 --> 1:06:25.395 Again you want to learn more about what language is to really understand. 1:06:25.395 --> 1:06:35.089 Are these two sentences like following a story or they're independent of each other? 1:06:38.158 --> 1:06:43.026 The input is using subword units as we're using it and we're using it. 1:06:43.026 --> 1:06:48.992 It has some special token, the beginning, the CLS token that is straining for the next 1:06:48.992 --> 1:06:50.158 word prediction. 1:06:50.470 --> 1:06:57.296 It's more for machine translation. 1:06:57.296 --> 1:07:07.242 It's more for classification tasks because you're. 1:07:07.607 --> 1:07:24.323 You have two sentences, and then you have a position of encoding as we know them in general. 1:07:24.684 --> 1:07:28.812 Now what is more challenging is masking. 1:07:28.812 --> 1:07:30.927 So what do you mask? 1:07:30.927 --> 1:07:35.055 We already have to question like should. 1:07:35.275 --> 1:07:44.453 So there has been afterwards eating some work like, for example, Urbana, which tries to improve. 1:07:44.453 --> 1:07:52.306 It's not super sensitive, but of course if you do it completely wrong then you're. 1:07:52.572 --> 1:07:54.590 That's then another question there. 1:07:56.756 --> 1:08:03.285 All types should always mask the poor word. 1:08:03.285 --> 1:08:14.562 If have a subword, it's good to mask only like a subword and predict based. 1:08:14.894 --> 1:08:20.755 You know, like three parts of the words, it might be easier to get the last because they 1:08:20.755 --> 1:08:27.142 here took the easiest selections, not considering words anymore at all because you're doing that 1:08:27.142 --> 1:08:32.278 in the pre-processing and just taking always words like subwords and masking. 1:08:32.672 --> 1:08:36.286 Their thinking will bear them differently. 1:08:36.286 --> 1:08:40.404 They mark always the full words, but guess it's. 1:08:41.001 --> 1:08:46.969 And then what to do with the mask work in eighty percent of the cases is the word is 1:08:46.969 --> 1:08:47.391 mask. 1:08:47.391 --> 1:08:50.481 They replace it with a special token thing. 1:08:50.481 --> 1:08:52.166 This is the mask token. 
1:08:52.166 --> 1:08:58.486 In ten percent they put in some random other token in there, and in ten percent they keep 1:08:58.486 --> 1:08:59.469 it unchanged. 1:09:02.202 --> 1:09:11.519 And then what you can do is also this next prediction. 1:09:11.519 --> 1:09:17.786 So if you have the man went to mass. 1:09:18.418 --> 1:09:24.090 So may you see you're joining that you're doing both masks and next prediction that. 1:09:24.564 --> 1:09:34.402 And if the sentence is pinguine masks are flyless birds, then these two sentences have 1:09:34.402 --> 1:09:42.995 nothing to do with each other, and so in this case it's not the next token. 1:09:47.127 --> 1:09:56.184 And that is the whole bird model, so here is the input, here the transformable layers, 1:09:56.184 --> 1:09:58.162 and you can train. 1:09:58.598 --> 1:10:08.580 And this model was quite successful in general applications. 1:10:08.580 --> 1:10:17.581 It was not as successful as people are nowadays using. 1:10:17.937 --> 1:10:27.644 However, there is like a huge thing of different types of models coming from that. 1:10:27.827 --> 1:10:39.109 So based on bird and other semi-supervised models like a whole setup came out of there 1:10:39.109 --> 1:10:42.091 and there's different. 1:10:42.082 --> 1:10:46.637 With the availability of large languages more than the success. 1:10:47.007 --> 1:10:48.436 We have now even larger ones. 1:10:48.828 --> 1:10:50.961 Interestingly, it goes a bit. 1:10:50.910 --> 1:10:59.321 Change the bit again from like more this spider action model to unidirectional models, or at 1:10:59.321 --> 1:11:03.843 the moment maybe a bit more we're coming to them. 1:11:03.843 --> 1:11:09.179 Now do you see one advantage,, and we have the efficiency. 1:11:09.509 --> 1:11:16.670 There's one other reason why you sometimes are more interested in unidirectional models 1:11:16.670 --> 1:11:17.158 than. 1:11:22.882 --> 1:11:30.882 Mean it depends on the task, but for example for a language generation task, the task. 1:11:32.192 --> 1:11:34.574 It's not only interesting, it doesn't work. 1:11:34.574 --> 1:11:39.283 So if you want to do a generation like the decoder so you want to generate a sentence, 1:11:39.283 --> 1:11:42.856 you don't know the future so you cannot apply this type of model. 1:11:43.223 --> 1:11:49.498 This time off model can be used for the encoder in an encoder model but cannot be used for 1:11:49.498 --> 1:11:55.497 the decoder because it is trained that only works and it has information on both sides 1:11:55.497 --> 1:11:56.945 and if you're doing. 1:12:00.000 --> 1:12:05.559 Yeah, that's a good view to the next overall task of models. 1:12:05.559 --> 1:12:08.839 We have so if you view it from the. 1:12:09.009 --> 1:12:13.137 Of you we have the encoder baseball. 1:12:13.137 --> 1:12:16.372 That's what we just look at. 1:12:16.372 --> 1:12:20.612 They are bidirectional and typically. 1:12:20.981 --> 1:12:22.347 That is the one we looked at. 1:12:22.742 --> 1:12:35.217 At the beginning is the decoder-based model, so the outer-regressive mounts which are unit 1:12:35.217 --> 1:12:42.619 based model, and there we can do the next prediction. 1:12:43.403 --> 1:12:52.421 And what you can also do first, and there you can also have special things called prefix 1:12:52.421 --> 1:12:53.434 language. 
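A minimal sketch of the masking rule described above: about 15 percent of the tokens are selected, and of those 80 percent are replaced by the special mask token, 10 percent by a random other token, and 10 percent are kept unchanged; the model is then trained to recover the original token at the selected positions. The token ids and vocabulary size are illustrative assumptions.

```python
import random

MASK_ID = 0          # illustrative id for the special mask token
VOCAB_SIZE = 30_000  # illustrative vocabulary size

def mask_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_input, labels); labels are None where no loss is taken."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                                 # not selected: no prediction here
        labels[i] = tok                              # model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                      # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)    # 10%: random other token
        # else: 10% keep the token unchanged
    return inputs, labels

corrupted, labels = mask_for_mlm([12, 523, 77, 4031, 9, 288], seed=3)
print(corrupted, labels)
# Note how few positions give a training signal per sentence -- the efficiency
# point raised earlier when comparing masking with next-word prediction.
```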
1:12:54.354 --> 1:13:04.079 Because we are saying it might be helpful that some of your inputs you can use by direction 1:13:04.079 --> 1:13:17.334 because: That is what is called a prefix where you say on the first tokens you have bidirectional 1:13:17.334 --> 1:13:19.094 connections. 1:13:19.219 --> 1:13:28.768 You somehow merge that mainly works only in transformer based models because the uni direction. 1:13:29.629 --> 1:13:34.894 There is no different number of parameters. 1:13:34.975 --> 1:13:38.533 Transformer: The only difference is how you mask your attention. 1:13:38.878 --> 1:13:47.691 We have seen that in the encoder, in the decoder, the number of parameters is different because 1:13:47.691 --> 1:13:50.261 you do the cross-attention. 1:13:50.650 --> 1:13:58.389 It's only like you mask your attention to only look at the bad past or also look into 1:13:58.389 --> 1:13:59.469 the future. 1:14:00.680 --> 1:14:03.323 And now you can, of course, also do mixing. 1:14:03.563 --> 1:14:08.307 So this is a bidirectional attention metric where you can attend to everything. 1:14:08.588 --> 1:14:23.477 That is a unidirection or causal where you can only look at the past and you can do this 1:14:23.477 --> 1:14:25.652 with prefix. 1:14:29.149 --> 1:14:42.829 Some are all clear based on that, then of course you can also do the other thing. 1:14:43.163 --> 1:14:54.497 So the idea is we have our encoder, decoder architecture, can we also train them completely 1:14:54.497 --> 1:14:57.700 in a side supervised way? 1:14:58.238 --> 1:15:06.206 In this case we have the same input to both, so in this case we would have the sentence 1:15:06.206 --> 1:15:08.470 as input in the decoder. 1:15:08.470 --> 1:15:12.182 Then we need to do some type of masking. 1:15:12.912 --> 1:15:16.245 Here we don't need to do the masking, but here we need to do. 1:15:16.245 --> 1:15:17.911 The masking doesn't know ever. 1:15:20.440 --> 1:15:30.269 And this type of model got quite successful also, especially for pre-training machine translation. 1:15:30.330 --> 1:15:45.934 This is the first model of the BART model, which is one successful way to pre-train your 1:15:45.934 --> 1:15:47.162 model. 1:15:47.427 --> 1:15:52.858 Where you put in source sentence, we can't do that here. 1:15:52.858 --> 1:15:55.430 We only have one language. 1:15:55.715 --> 1:16:00.932 But we can just put this twice in there, and that is not a trivial task. 1:16:00.932 --> 1:16:08.517 We can change it in: They do quite a bit of different corruption techniques. 1:16:08.517 --> 1:16:12.751 You can do token masking and you can also. 1:16:13.233 --> 1:16:20.785 That you couldn't do and go the only system because then it wouldn't be there if you cannot 1:16:20.785 --> 1:16:22.345 predict somewhere. 1:16:22.345 --> 1:16:26.368 So the number of input and output tokens always. 1:16:26.906 --> 1:16:29.820 You cannot do a prediction for something which isn't it? 1:16:30.110 --> 1:16:39.714 Here in the decoder side it's uni-direction so we can also delete and then generate the 1:16:39.714 --> 1:16:40.369 full. 1:16:41.061 --> 1:16:48.628 We can do sentence per rotation where you change the sentence. 1:16:48.628 --> 1:16:54.274 We can document rotation and text and filling. 1:16:55.615 --> 1:17:05.870 So you see there's quite a lot of types of models that you can use in order to pre-train 1:17:05.870 --> 1:17:06.561 your. 1:17:07.507 --> 1:17:12.512 And these are the models you can use. 
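The three attention patterns just contrasted can be written out explicitly as masks: the parameters stay the same, only which positions may attend to which changes. A minimal sketch (1 means position i may attend to position j); the prefix length is an arbitrary example value.

```python
import numpy as np

def attention_mask(T, kind="bidirectional", prefix_len=0):
    """1 means position i (row) may attend to position j (column)."""
    if kind == "bidirectional":          # encoder / BERT-style: see everything
        return np.ones((T, T), dtype=int)
    if kind == "causal":                 # decoder / autoregressive: only the past
        return np.tril(np.ones((T, T), dtype=int))
    if kind == "prefix":                 # prefix LM: full attention within the
        m = np.tril(np.ones((T, T), dtype=int))      # first `prefix_len` tokens,
        m[:, :prefix_len] = 1                        # causal attention afterwards
        return m
    raise ValueError(kind)

for k in ["bidirectional", "causal", "prefix"]:
    print(k)
    print(attention_mask(5, k, prefix_len=2))
```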
1:17:12.512 --> 1:17:21.072 Of course, the other question is how do you integrate them into? 1:17:21.761 --> 1:17:26.638 And there's also like yeah quite some different ways of techniques. 1:17:27.007 --> 1:17:28.684 It's a Bit Similar to Before. 1:17:28.928 --> 1:17:39.307 So the easiest thing is you take your word embeddings or your pre-train model. 1:17:39.307 --> 1:17:47.979 If you're contextual embedding several layers you freeze them in. 1:17:48.748 --> 1:17:53.978 Can also be done if you have a bark model. 1:17:53.978 --> 1:18:03.344 You freeze your wooden beddings, for example, and only train the top layers. 1:18:05.865 --> 1:18:14.965 The other thing is you initialize them so you initialize your models but then you train 1:18:14.965 --> 1:18:19.102 everything so you're not only training. 1:18:22.562 --> 1:18:32.600 When you have then one thing, if you think about Bart, there's one thing, so you want 1:18:32.600 --> 1:18:35.752 to have the same language. 1:18:36.516 --> 1:18:46.013 Typically mean the one you get is from English, so you can not try to do some language. 1:18:46.366 --> 1:18:55.165 Below the barge, in order to learn some language specific stuff or there's a multilingual barge 1:18:55.165 --> 1:19:03.415 which is trained on many languages, it's trained only on like it's more or less language. 1:19:03.923 --> 1:19:09.745 So then you would still need to find June and the model needs to learn how to better 1:19:09.745 --> 1:19:12.074 do the attention cross lingually. 1:19:12.074 --> 1:19:18.102 It's only on the same language but it mainly only has to learn this mapping and not all 1:19:18.102 --> 1:19:18.787 the rest. 1:19:21.982 --> 1:19:27.492 A third thing which is is very commonly used is what is frequent to it as adapters. 1:19:27.607 --> 1:19:29.749 So, for example, you take and bark. 1:19:29.709 --> 1:19:35.502 And you put some adapters on the inside of the network so that it's small new layers which 1:19:35.502 --> 1:19:41.676 are in between put in there and then you only train these adapters or also train these adapters. 1:19:41.676 --> 1:19:47.724 So for example in Embry you could see that this learns to map the Seus language representation 1:19:47.724 --> 1:19:50.333 to the targeted language representation. 1:19:50.470 --> 1:19:52.395 And then you don't have to change that luck. 1:19:52.792 --> 1:20:04.197 Ideas that you give it some extra ability to really perform well on that, and then it's 1:20:04.197 --> 1:20:05.234 easier. 1:20:05.905 --> 1:20:15.117 Is also very commonly used, for example, in multilingual systems where the idea is you 1:20:15.117 --> 1:20:16.282 have some. 1:20:16.916 --> 1:20:23.505 So they are trained only for one language pair, so the model has some of those it once 1:20:23.505 --> 1:20:27.973 has the abilities to do multilingually to share knowledge. 1:20:27.973 --> 1:20:33.729 But then there is some knowledge which is very language specific, and then. 1:20:34.914 --> 1:20:39.291 But there's one chance in general, the multilingual systems. 1:20:39.291 --> 1:20:40.798 It works quite well. 1:20:40.798 --> 1:20:47.542 There's one specific use case for multilingual, where this normally doesn't really work well. 1:20:47.542 --> 1:20:49.981 Do you have an idea of what that? 1:20:55.996 --> 1:20:57.534 It's for Zero Short Cases. 
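As a side note on the adapter idea described above, here is a minimal sketch of one common adapter design: a small bottleneck layer with a residual connection is inserted into an otherwise frozen pre-trained model, and only the adapter parameters are trained. The bottleneck shape and the stand-in `pretrained_layer` are illustrative assumptions, not the specific setup from the lecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck inserted inside a frozen pre-trained layer."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        # residual connection: at initialization the adapter changes very little
        return x + self.up(torch.relu(self.down(x)))

# Usage sketch: freeze the big pre-trained model and train only the adapters
# (`pretrained_layer` is a hypothetical stand-in for e.g. one Transformer block
# of a multilingual pre-trained model).
pretrained_layer = nn.Linear(512, 512)
for p in pretrained_layer.parameters():
    p.requires_grad = False

adapter = Adapter()
x = torch.randn(2, 10, 512)
y = adapter(pretrained_layer(x))      # only the adapter parameters receive gradients
print(sum(p.numel() for p in adapter.parameters()), "trainable adapter parameters")
```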
1:20:57.998 --> 1:21:06.051 In the zero-shot case this doesn't work as well, because then you're again adding something which might be very language-specific; 1:21:06.051 --> 1:21:15.046 in zero-shot, the idea is always to learn representations which are more language-independent, and with 1:21:15.046 --> 1:21:17.102 the adapters you of course go against that. 1:21:20.260 --> 1:21:37.655 And there's also the idea of doing more like a knowledge distillation setup, so in this case the pre-trained model serves as a teacher. 1:21:39.179 --> 1:21:41.177 And now the idea is: 1:21:41.177 --> 1:21:48.095 we are training the model the same way as before, but what we want to achieve, as an additional objective, is that the hidden states of the 1:21:48.095 --> 1:21:54.090 encoder are as similar as possible to the ones of the pre-trained model. 1:21:54.414 --> 1:22:07.569 So you should learn faster by telling the model to make these states as similar as possible. 1:22:07.569 --> 1:22:11.813 You compare the first hidden state to the first, the second to the second, and so on, 1:22:12.192 --> 1:22:18.549 for example by using the L2 norm, so by just pushing these two representations to be the same. 1:22:20.020 --> 1:22:22.880 Now, this requires the same vocabulary. 1:22:22.880 --> 1:22:25.468 Why does it need the same vocabulary? 1:22:25.468 --> 1:22:26.354 Can you give me the reason? 1:22:34.754 --> 1:22:39.132 If you have a different vocabulary, 1:22:39.132 --> 1:22:50.711 you also have different sequence lengths, because you get a different segmentation. 1:22:51.231 --> 1:22:55.680 Then what happens is that we have a different number of states, 1:22:55.680 --> 1:23:01.097 and it's no longer straightforward which states to compare. 1:23:02.322 --> 1:23:05.892 And then it's just easier to have the same number, 1:23:05.892 --> 1:23:08.952 so you can always compare the first to the first, the second to the second, and so on. 1:23:09.709 --> 1:23:16.836 So at least this very easy way of knowledge distillation only works if you have the same vocabulary. 1:23:17.177 --> 1:23:30.871 Of course you could do things like requiring the averages to be the same, but of course that's a less 1:23:30.871 --> 1:23:33.080 strong signal. 1:23:34.314 --> 1:23:47.087 But the advantage here is that you have a direct training signal on the encoder, 1:23:47.087 --> 1:23:52.457 so you can directly guide it with this signal. 1:23:56.936 --> 1:24:11.208 Yes, I think this is mostly it for today, so what you should keep in mind today is two 1:24:11.208 --> 1:24:18.147 techniques: The one is the back-translation idea. 1:24:18.147 --> 1:24:26.598 If you have monolingual data, you back-translate it and use it as additional training data. 1:24:26.886 --> 1:24:33.608 And yeah, it is often helpful to combine them, so you can even use both of them. 1:24:33.853 --> 1:24:39.669 You can use pre-trained models, but then you can still also do back-translation where 1:24:39.669 --> 1:24:40.066 it helps. 1:24:40.160 --> 1:24:47.058 With back-translation we have the advantage that we are training everything working together on the task, 1:24:47.058 --> 1:24:54.422 so it might be helpful to back-translate some data and then use it in the real translation task, 1:24:54.422 --> 1:24:57.755 because in pre-training the big challenge is that the model never sees the translation task itself. 1:24:58.058 --> 1:25:07.392 You can see there are different ways of integrating this knowledge, but even if you use the full 1:25:07.392 --> 1:25:08.087 model: 1:25:08.748 --> 1:25:11.713 this is the most similar you can get. 1:25:11.713 --> 1:25:15.224 You're doing no changes to the architecture. 1:25:15.224 --> 1:25:20.608 You're really taking the model and just fine-tuning it on the new task. 1:25:20.608 --> 1:25:24.041 But it still has to completely newly learn how to actually translate.
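To make the hidden-state matching described above a bit more concrete, here is a minimal PyTorch sketch (not from the lecture). The squared L2 / mean-squared-error term pulls the student encoder's states towards those of the frozen pre-trained model; names such as pretrained_encoder, mt_encoder, src_tokens and translation_loss are placeholders.

```python
import torch
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hidden: torch.Tensor,
                                   teacher_hidden: torch.Tensor,
                                   weight: float = 1.0) -> torch.Tensor:
    # Both tensors have shape (batch, seq_len, d_model). Comparing position by
    # position only makes sense if both models use the same vocabulary and
    # segmentation, so that state i of the student lines up with state i of
    # the teacher.
    return weight * F.mse_loss(student_hidden, teacher_hidden)

# Schematic use inside the training loop:
#   with torch.no_grad():
#       teacher_hidden = pretrained_encoder(src_tokens)   # frozen teacher
#   student_hidden = mt_encoder(src_tokens)
#   loss = translation_loss + hidden_state_distillation_loss(student_hidden,
#                                                            teacher_hidden)
```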
1:25:24.464 --> 1:25:29.978 So it might be helpful, for example, to also have some back-translated data so the model can learn that. 1:25:32.192 --> 1:25:45.096 Good. One important thing: next Tuesday there is a conference or a workshop in this 1:25:45.096 --> 1:25:45.947 room. 1:25:47.127 --> 1:25:54.405 You should get an email, if you're on the mailing list, that there is a room change for Tuesday, only 1:25:54.405 --> 1:25:57.398 for Tuesday, and afterwards it's again the normal room. 1:25:57.637 --> 1:26:03.714 Some more questions, again from a more general perspective: in computer vision 1:26:03.714 --> 1:26:07.246 you can enlarge your data set with data augmentation. 1:26:07.246 --> 1:26:08.293 Is there something 1:26:08.388 --> 1:26:15.306 similar to enlarge speech or text data, so data augmentation? 1:26:15.755 --> 1:26:27.013 You can use this back-translation and also the masking, but I would say 1:26:27.013 --> 1:26:31.201 that back-translation is the most similar thing. 1:26:31.371 --> 1:26:35.632 It has also been used, for example, not only for monolingual data. 1:26:36.216 --> 1:26:40.958 If you have a good MT system, it can also be used for parallel data by augmenting 1:26:40.958 --> 1:26:46.061 your data with more data, because then you have the human translation and the automatic translation, 1:26:46.061 --> 1:26:46.783 and both are good. 1:26:46.783 --> 1:26:51.680 You're just having more data and a better feedback signal in different ways, because there's not 1:26:51.680 --> 1:26:53.845 only one correct translation but several (see the short sketch at the end). 1:26:54.834 --> 1:26:58.327 I would say this is the most similar one. 1:26:58.327 --> 1:27:00.947 For images you can just rotate things and so on. 1:27:00.947 --> 1:27:03.130 There are some ways you can do that here as well. 1:27:05.025 --> 1:27:07.646 But, for example, rule-based replacement is rarely used. 1:27:07.646 --> 1:27:13.907 It's very hard to do this by rules, like deciding which words to replace, because there's not 1:27:13.907 --> 1:27:14.490 a clear rule. 1:27:14.490 --> 1:27:18.931 You cannot always say this word can always be replaced by that one. 1:27:19.139 --> 1:27:28.824 I mean, even if they look like perfect synonyms, they are good in some cases, but not in all 1:27:28.824 --> 1:27:29.585 cases. 1:27:29.585 --> 1:27:36.985 And if you don't do it rule-based, you have to train a model again. 1:27:38.058 --> 1:27:57.050 Here we compare the hidden states, so it is easiest if it is the same architecture as the pre-trained model, normally. 1:27:57.457 --> 1:27:59.817 They should be of the same dimension, so it's easiest to have the same 1:28:00.000 --> 1:28:03.780 architecture. We will later see, in the lecture on efficiency, that 1:28:03.780 --> 1:28:08.949 you can also do knowledge distillation with, for example, smaller models. 1:28:08.949 --> 1:28:15.816 So you can have twelve layers and then only five, and then you try to learn the same within five 1:28:15.816 --> 1:28:16.433 layers. 1:28:17.477 --> 1:28:22.945 Or eight layers; so that is possible, but yes, I agree, it should be of the same hidden size. 1:28:23.623 --> 1:28:35.963 The question then, of course, is whether you do it as an initialization or you do it during 1:28:35.963 --> 1:28:37.305 training, 1:28:37.305 --> 1:28:41.195 in parallel to your main training objective. 1:28:45.865 --> 1:28:53.964 Good, then thanks a lot, and then we'll see each other again on Tuesday.
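To connect the back-translation takeaway and the augmentation answer above, a schematic sketch follows; translate_target_to_source is a hypothetical callable standing in for whatever trained MT system is available, so this illustrates the idea rather than a particular toolkit's API.

```python
def back_translate(monolingual_target_sentences, translate_target_to_source):
    """Turn target-side monolingual text into synthetic parallel data.

    translate_target_to_source: hypothetical callable wrapping a trained
    target-to-source MT system (no specific system is assumed here).
    """
    pairs = []
    for tgt in monolingual_target_sentences:
        src = translate_target_to_source(tgt)  # machine-translated, possibly noisy
        pairs.append((src, tgt))               # clean human text stays on the target side
    return pairs

# The same machinery can also forward-translate the source side of existing
# parallel data to add alternative automatic translations next to the human
# reference. The synthetic pairs are then mixed with the real parallel data
# when training the source-to-target system.
```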