WEBVTT
0:00:00.060 --> 0:00:07.762
OK good, so today's lecture is on unsupervised machine translation. So what you have seen
0:00:07.762 --> 0:00:13.518
so far is different techniques for supervised MT, so you have parallel
0:00:13.593 --> 0:00:18.552
data, right. So let's say in an English corpus you have one file and then in German you have
0:00:18.552 --> 0:00:23.454
another file which is sentence-aligned, and then you try to build systems around
0:00:23.454 --> 0:00:23.679
it.
0:00:24.324 --> 0:00:30.130
But what's different about this lecture is that you assume that you have no parallel data
0:00:30.130 --> 0:00:30.663
at all.
0:00:30.663 --> 0:00:37.137
You only have monolingual data and the question
is how can we build systems to translate between
0:00:37.137 --> 0:00:39.405
these two languages right and so.
0:00:39.359 --> 0:00:44.658
This is a more realistic scenario because
you have so many languages in the world.
0:00:44.658 --> 0:00:50.323
You cannot expect to have parallel data between every pair of languages, but in typical
0:00:50.323 --> 0:00:55.623
cases you have newspapers and so on, which are monolingual files, and the question
0:00:55.623 --> 0:00:57.998
is can we build something around them?
0:00:59.980 --> 0:01:01.651
So, like I said, for today:
0:01:01.651 --> 0:01:05.893
First we'll start with the introduction,
so why do we need it?
0:01:05.893 --> 0:01:11.614
and also some intuition on how these models
work before going into the technical details.
0:01:11.614 --> 0:01:17.335
I want to also go through an example, which kind of gives you more understanding of how
0:01:17.335 --> 0:01:19.263
people came up with these models.
0:01:20.820 --> 0:01:23.905
Then the rest of the lecture is going to be in two parts.
0:01:23.905 --> 0:01:26.092
One is we're going to translate words.
0:01:26.092 --> 0:01:30.018
We're not going to care about how can we translate
the full sentence.
0:01:30.018 --> 0:01:35.177
But given two monolingual files, how can we get a dictionary, basically, which is much easier
0:01:35.177 --> 0:01:37.813
than generating something at the sentence level?
0:01:38.698 --> 0:01:43.533
Then we're going to go into the harder case, which is unsupervised sentence-level translation.
0:01:44.204 --> 0:01:50.201
And here what you'll see is the training objectives, which are quite different from the
0:01:50.201 --> 0:01:55.699
word translation ones, and also where it doesn't work, because this is also quite important and
0:01:55.699 --> 0:02:01.384
it's one of the reasons why unsupervised MT is not used anymore: its limitations kind
0:02:01.384 --> 0:02:03.946
of take it away from the realistic use cases.
0:02:04.504 --> 0:02:06.922
And then that leads to the multilingual model.
0:02:06.922 --> 0:02:07.115
So.
0:02:07.807 --> 0:02:12.915
What people are doing to build systems for languages that do not have any parallel data
0:02:12.915 --> 0:02:17.693
is use multilingual models and combine them with these training objectives to get better at
0:02:17.693 --> 0:02:17.913
it.
0:02:17.913 --> 0:02:18.132
So.
0:02:18.658 --> 0:02:24.396
People are not currently trying to build bilingual systems for unsupervised machine translation,
0:02:24.396 --> 0:02:30.011
but I think it's good to know how they came to this point and what they're doing now.
0:02:30.090 --> 0:02:34.687
You'll also see some overlapping patterns in what people are using.
0:02:36.916 --> 0:02:41.642
So as I said before, and you've probably heard it multiple times by now, we have seven
0:02:41.642 --> 0:02:43.076
thousand languages around.
0:02:43.903 --> 0:02:49.460
There can be different dialects and so on, so it's quite hard to say exactly what counts as a language,
0:02:49.460 --> 0:02:54.957
but you can typically approximate it at seven thousand, and that leads to about twenty-five million
0:02:54.957 --> 0:02:59.318
pairs, which is the obvious reason why we do not have parallel data for all of them.
0:03:00.560 --> 0:03:06.386
So we want to build an MT system for all possible language pairs, and the question is
0:03:06.386 --> 0:03:07.172
how can we?
0:03:08.648 --> 0:03:13.325
That's the typical use case, but there are actually a few more interesting use cases than what you
0:03:13.325 --> 0:03:14.045
would expect.
0:03:14.614 --> 0:03:20.508
One is animal languages, which is a real thing that's happening right now, not with
0:03:20.780 --> 0:03:26.250
dogs but with dolphins and so on. I couldn't find a picture that could show this,
0:03:26.250 --> 0:03:31.659
but if you are interested in stuff like this
you can check out the website where people
0:03:31.659 --> 0:03:34.916
are actually trying to understand how animals
speak.
0:03:35.135 --> 0:03:37.356
It's also a bit more about
0:03:37.297 --> 0:03:44.124
knowing what the animals want to say; it may not quite work out, but still people are trying to
0:03:44.124 --> 0:03:44.661
do it.
0:03:45.825 --> 0:03:50.689
A more realistic thing that's happening is the
translation of programming languages.
0:03:51.371 --> 0:03:56.963
And this is quite a good scenario for unsupervised MT: you have
0:03:56.963 --> 0:04:02.556
a lot of code available online, right, in C++ and in Python, and the question is how can
0:04:02.556 --> 0:04:08.402
we translate by just looking at the code alone, with no parallel functions and so on, and this
0:04:08.402 --> 0:04:10.754
actually works quite well right now.
0:04:12.032 --> 0:04:16.111
We'll see how these techniques were applied to do programming language translation.
0:04:18.258 --> 0:04:23.882
And then you can also think of language a bit more broadly, so you can, for
0:04:23.882 --> 0:04:24.194
example:
0:04:24.194 --> 0:04:29.631
Think of formal sentences in English as one
language and informal sentences in English
0:04:29.631 --> 0:04:35.442
as another language, and then learn to translate between them, and then it kind of becomes
0:04:35.442 --> 0:04:37.379
a style transfer problem.
0:04:38.358 --> 0:04:43.042
Although it's translation, you can consider
different characteristics of a language and
0:04:43.042 --> 0:04:46.875
then separate them as two different languages
and then try to map them.
0:04:46.875 --> 0:04:52.038
So it's not only about languages: you can also do quite cool things by using unsupervised
0:04:52.038 --> 0:04:54.327
techniques, which is quite possible also.
0:04:56.256 --> 0:04:56.990
OK, so.
0:04:56.990 --> 0:05:04.335
This is kind of the motivation for many of the use cases that we have for unsupervised MT.
0:05:04.335 --> 0:05:11.842
But before we go into the modeling of these
systems, what I want you to do is look at these
0:05:11.842 --> 0:05:12.413
dummy examples.
0:05:13.813 --> 0:05:19.720
We have text in language one and text in language two, right, and nobody knows what these languages
0:05:19.720 --> 0:05:20.082
mean.
0:05:20.082 --> 0:05:23.758
They are completely made up, right, and the other point is:
0:05:23.758 --> 0:05:29.364
they're not parallel lines, so the first line here and the first line there are not aligned; they're
0:05:29.364 --> 0:05:30.810
just monolingual files.
0:05:32.052 --> 0:05:38.281
And now think about how you can translate the word M1 from language one to language two,
0:05:38.281 --> 0:05:41.851
and from this you'll kind of see how we try to model this.
0:05:42.983 --> 0:05:47.966
So take your time and then think about how you can translate M1 into language two.
0:06:41.321 --> 0:06:45.589
About the model, if you ask somebody who doesn't
know anything about machine translation right,
0:06:45.589 --> 0:06:47.411
and then you ask them to translate more.
0:07:01.201 --> 0:07:10.027
But it's also not quite easy. Mind you, the way that I made this example is relatively
0:07:10.027 --> 0:07:10.986
easy, so.
0:07:11.431 --> 0:07:17.963
Basically, the first two sentences are these
two: A, B, C is E, and G cured up the U, V
0:07:17.963 --> 0:07:21.841
is L, A, A, C, S, and S, on and this is used
towards the German.
0:07:22.662 --> 0:07:25.241
And then when you join these two words, it's.
0:07:25.205 --> 0:07:32.445
English German the third line and the last
line, and then the fourth line is the first
0:07:32.445 --> 0:07:38.521
line, so German language, English, and then
speak English, speak German.
0:07:38.578 --> 0:07:44.393
So this is how I made up the example, and the intuition here is that you assume
0:07:44.393 --> 0:07:50.535
that the languages have a fundamental structure
right and it's the same across all languages.
0:07:51.211 --> 0:07:57.727
It doesn't matter what language you are thinking of: the words you have behave in the same way, they join
0:07:57.727 --> 0:07:59.829
together in the same way, and
0:07:59.779 --> 0:08:06.065
sentences are formed in the same way. This is not a realistic assumption for sure, but
0:08:06.065 --> 0:08:12.636
it's actually a decent one to make, and if you can assume this,
0:08:12.636 --> 0:08:16.207
then we can model systems in an unsupervised
way.
0:08:16.396 --> 0:08:22.743
So this is the intuition that I want to give,
and you can see that whenever assumptions fail,
0:08:22.743 --> 0:08:23.958
the systems fail.
0:08:23.958 --> 0:08:29.832
So in practice whenever we go far away from
these assumptions, the systems tend to fail
0:08:29.832 --> 0:08:30.778
more often.
0:08:33.753 --> 0:08:39.711
So the example that I gave was actually a perfect mapping, right, which never really happens in practice.
0:08:39.711 --> 0:08:45.353
They have the same number of words, same sentence
structure, perfect mapping, and so on.
0:08:45.353 --> 0:08:50.994
This doesn't happen, but let's assume that it does and try to see how we can model it.
0:08:53.493 --> 0:09:01.061
Okay, now let's get a bit more formal, so what we want to do is unsupervised word translation.
0:09:01.901 --> 0:09:08.773
Here the task is that we have input data as
monolingual data, so a bunch of sentences in
0:09:08.773 --> 0:09:15.876
one file and a bunch of sentences in another file
in two different languages, and the question
0:09:15.876 --> 0:09:18.655
is how can we get a bilingual word dictionary?
0:09:19.559 --> 0:09:25.134
So if you look at the picture you see that
it's just kind of projected down onto a two-dimensional
0:09:25.134 --> 0:09:30.358
plane, but basically when you map them
into a plot you see that the words that are
0:09:30.358 --> 0:09:35.874
translation pairs are closer together, and the question
is how can we do it just looking at two files?
0:09:36.816 --> 0:09:42.502
And you can say that what we want to basically
do is create a dictionary in the end given
0:09:42.502 --> 0:09:43.260
two files.
0:09:43.260 --> 0:09:45.408
So this is the task that we want.
0:09:46.606 --> 0:09:52.262
And the first step in how we do this is to learn word vectors, and this can be done with whatever
0:09:52.262 --> 0:09:56.257
technique you have seen before: word2vec, GloVe, or so on.
0:09:56.856 --> 0:10:00.699
So you take a monolingual data and try to
learn word embeddings.
0:10:02.002 --> 0:10:07.675
Then you plot them into a graph, and then
typically what you would see is that they're
0:10:07.675 --> 0:10:08.979
not aligned at all.
0:10:08.979 --> 0:10:14.717
One word space is somewhere, and one word
space is somewhere else, and this is what you
0:10:14.717 --> 0:10:18.043
would typically expect to see in the
image.
0:10:19.659 --> 0:10:23.525
Now our assumption was that both languages kind of have the same
0:10:23.563 --> 0:10:28.520
structure, and so we can use this information
to learn the mapping between these two spaces.
0:10:30.130 --> 0:10:37.085
So before we get to how we do it: I think this is quite famous already, and everybody
0:10:37.085 --> 0:10:41.824
knows it by now, that word embeddings capture semantic relations, right.
0:10:41.824 --> 0:10:48.244
So the distance between man and woman is approximately
the same as between king and queen.
0:10:48.888 --> 0:10:54.620
It's the same for verb tenses, country and capital, and so on, so there are some relationships
0:10:54.620 --> 0:11:00.286
happening in the word embedding space, which is quite clear at least within one language.
0:11:03.143 --> 0:11:08.082
Now if you think of this, let's say, of the English word embeddings
0:11:08.082 --> 0:11:14.769
and the German word embeddings: the way king, queen, man, woman are organized is the same
0:11:14.769 --> 0:11:17.733
as for the German translations of these words.
0:11:17.998 --> 0:11:23.336
The main idea is that although they are somewhere else in the space, the relationship is the
0:11:23.336 --> 0:11:28.008
same in both languages, and we can use this to learn the mapping.
0:11:31.811 --> 0:11:35.716
It's not only for these four words; it happens for all the words in the language,
0:11:35.716 --> 0:11:37.783
and so we can use this to learn the mapping.
0:11:39.179 --> 0:11:43.828
The main idea is that both embeddings have a similar shape.
0:11:43.828 --> 0:11:48.477
It's only that they're just not aligned, and so, as you can see here,
0:11:48.477 --> 0:11:50.906
They kind of have a similar shape.
0:11:50.906 --> 0:11:57.221
They're just in some different spaces and
what you need to do is to map them into a common
0:11:57.221 --> 0:11:57.707
space.
0:12:06.086 --> 0:12:12.393
So we want to learn a matrix W such that, if we multiply W with X, both embedding spaces become aligned.
0:12:35.335 --> 0:12:41.097
That's true, but there are also many words that do have this relationship, right, and we hope
0:12:41.097 --> 0:12:43.817
that this is enough to learn the mapping.
0:12:43.817 --> 0:12:49.838
So there's always going to be a bit of noise, in the sense that when we align them they're not going
0:12:49.838 --> 0:12:51.716
to be exactly the same, but.
0:12:51.671 --> 0:12:57.293
What you can expect is that there are these main words that allow us to learn the mapping,
0:12:57.293 --> 0:13:02.791
so it's not going to be perfect, but it's an
approximation that we make to see how it
0:13:02.791 --> 0:13:04.521
works in practice.
0:13:04.521 --> 0:13:10.081
Also, the fact that some words do not have any such relationship does not matter that
0:13:10.081 --> 0:13:10.452
much.
0:13:10.550 --> 0:13:15.429
A lot of words usually do, so it kind of works out in practice.
0:13:22.242 --> 0:13:34.248
I have not heard about it, but if you want
to say something about it, I would be interested,
0:13:34.248 --> 0:13:37.346
but we can do it later.
0:13:41.281 --> 0:13:44.133
In the usual case, this is supervised.
0:13:45.205 --> 0:13:49.484
The first way is to do supervised word translation, where we have a dictionary, right, and we
0:13:49.484 --> 0:13:53.764
can use that to learn the mapping; but in our case we assume that we have nothing, so
0:13:53.764 --> 0:13:55.222
we only have monolingual data.
0:13:56.136 --> 0:14:03.126
Then we need unsupervised learning to figure out W, and we're going to use GANs to find
0:14:03.126 --> 0:14:06.122
W, and it's quite a nice way to do it.
0:14:08.248 --> 0:14:15.393
So just before I go into how we use it for our use case, I'm going to go briefly over GANs,
0:14:15.393 --> 0:14:19.940
so we have two components: generator and discriminator.
0:14:21.441 --> 0:14:27.052
The generator tries to generate something, obviously,
and the discriminator tries to see if it's
0:14:27.052 --> 0:14:30.752
real data or something that is generated by the generator.
0:14:31.371 --> 0:14:37.038
And there's like this two-player game, where the generator tries to fool the discriminator and the discriminator tries
0:14:37.038 --> 0:14:41.862
not to get fooled, and by training these two components against each other we try to learn W.
0:14:43.483 --> 0:14:53.163
Okay, so let's say we have two languages,
X and Y, right, so the X language has n words
0:14:53.163 --> 0:14:56.167
with embeddings of some dimension.
0:14:56.496 --> 0:14:59.498
So what we get is a big embedding matrix or something like that.
0:14:59.498 --> 0:15:02.211
Then we have the target language Y with m words,
0:15:02.211 --> 0:15:06.944
with the same embedding dimension as I mentioned, and then we have another embedding matrix.
0:15:07.927 --> 0:15:13.784
Basically what we're going to do is use word2vec and learn our word embeddings.
0:15:14.995 --> 0:15:23.134
Now we have these X embeddings and Y embeddings, and what we want to learn is W, such that WX and
0:15:23.134 --> 0:15:24.336
Y are aligned.
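As a minimal sketch (my own illustration, not from the lecture slides): in the supervised variant mentioned earlier, where a small seed dictionary is available, the best orthogonal W has a closed form via the SVD (orthogonal Procrustes). The unsupervised GAN approach described next replaces the dictionary with an adversarial signal. Shapes are an assumption here, with one embedding per row.

```python
import numpy as np

# paired embeddings for 1000 hypothetical dictionary entries, dimension 300
X = np.random.randn(1000, 300)   # source-side embeddings
Y = np.random.randn(1000, 300)   # corresponding target-side embeddings

U, _, Vt = np.linalg.svd(X.T @ Y)   # SVD of the cross-covariance
W = U @ Vt                          # orthogonal mapping: X @ W is close to Y
aligned_src = X @ W                 # source embeddings mapped into the target space
```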
0:15:29.209 --> 0:15:35.489
With GANs you have two steps: one is the discriminator step and one is the mapping step, and the
0:15:35.489 --> 0:15:41.135
discriminator step is to see if the embeddings are mapped source embeddings or target embeddings.
0:15:41.135 --> 0:15:44.688
It's going to be much clearer when I go to the figure.
0:15:46.306 --> 0:15:50.041
So we have monolingual documents in two different languages.
0:15:50.041 --> 0:15:54.522
From here we get our source language embeddings and target language embeddings, right.
0:15:54.522 --> 0:15:57.855
Then we randomly initialize the transformation matrix W.
0:16:00.040 --> 0:16:06.377
Then we have the discriminator which tries
to see if it's WX or Y, so it needs to know
0:16:06.377 --> 0:16:13.735
that this is a mapped one and this is the original language; and so if you look at the loss function
0:16:13.735 --> 0:16:20.072
here, it basically says that "source" is 1 given WX, so this comes from the mapped source language.
0:16:23.543 --> 0:16:27.339
And "source" is 0 given Y, which means it's a target language embedding, yeah.
0:16:27.339 --> 0:16:34.436
My figure is not that great, but you can assume that these are the target embeddings.
0:16:40.260 --> 0:16:43.027
So this is kind of the loss function.
0:16:43.027 --> 0:16:46.386
We have N source words, M target words, and
so on.
0:16:46.386 --> 0:16:52.381
That's why you have the one-over-n and one-over-m factors, and the discriminator just has to see if they're
0:16:52.381 --> 0:16:55.741
mapped embeddings or embeddings from the original target language.
0:16:57.317 --> 0:17:04.024
And then we have the mapping step, where we train W to fool the discriminator.
0:17:04.564 --> 0:17:10.243
So here it's the same setup, but what you're going to do is just invert the loss function.
0:17:10.243 --> 0:17:15.859
So now we freeze the discriminator; it's important to note that in the previous step
0:17:15.859 --> 0:17:20.843
we froze the transformation matrix, and here we freeze the discriminator.
0:17:22.482 --> 0:17:28.912
And now the goal is to fool the discriminator, so it should predict that source is 0
0:17:28.912 --> 0:17:35.271
given the mapped embedding, and that source is 1 given the target embedding, which is wrong,
0:17:35.271 --> 0:17:37.787
and that is how we're training W.
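A minimal PyTorch sketch of these two alternating steps, under assumed shapes and hyperparameters (it is an illustration of the idea, not the exact implementation from the lecture or any specific paper):

```python
import torch
import torch.nn as nn

d = 300                                          # embedding dimension (assumption)
W = nn.Linear(d, d, bias=False)                  # mapping, randomly initialised
D = nn.Sequential(nn.Linear(d, 512), nn.LeakyReLU(0.1), nn.Linear(512, 1))
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def train_step(x_batch, y_batch):
    """x_batch: source embeddings, y_batch: target embeddings, both of shape [B, d]."""
    # 1) Discriminator step (W frozen): label mapped source as 1, target as 0.
    with torch.no_grad():
        wx = W(x_batch)
    d_loss = bce(D(wx), torch.ones(len(wx), 1)) + \
             bce(D(y_batch), torch.zeros(len(y_batch), 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Mapping step (D not updated): train W so that D predicts 0 ("target")
    #    for the mapped source embeddings, i.e. W fools the discriminator.
    m_loss = bce(D(W(x_batch)), torch.zeros(len(x_batch), 1))
    opt_W.zero_grad(); m_loss.backward(); opt_W.step()
    return d_loss.item(), m_loss.item()
```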
0:17:39.439 --> 0:17:46.261
Any questions on this? Okay, so then how do we know when to stop?
0:17:46.261 --> 0:17:55.854
We just train until we reach convergence right
and then we have our W, hopefully trained, to
0:17:55.854 --> 0:17:59.265
map the embeddings into an aligned space.
0:18:02.222 --> 0:18:07.097
The question is how can we evaluate this mapping?
0:18:07.097 --> 0:18:13.923
Does anybody know what we can use to evaluate the mapping,
0:18:13.923 --> 0:18:15.873
how good a word translation is?
0:18:28.969 --> 0:18:33.538
As I said, we use a dictionary, at least in the end.
0:18:33.538 --> 0:18:40.199
We need a dictionary to evaluate, so this is only for the final evaluation; we aren't using it at
0:18:40.199 --> 0:18:42.600
all in the training data.
0:18:43.223 --> 0:18:49.681
One way is to check the precision against our dictionary, just that.
0:18:50.650 --> 0:18:52.813
You take the first nearest neighbor and see if the correct translation is there.
0:18:53.573 --> 0:18:56.855
But this is quite strict because there's a lot of noise in the embedding space, right.
0:18:57.657 --> 0:19:03.114
Your first neighbor is not always going to be the translation, so what people also report
0:19:03.114 --> 0:19:05.055
is precision at five and so on.
0:19:05.055 --> 0:19:10.209
So you take the five nearest neighbors and see if the translation is in there, and so on.
0:19:10.209 --> 0:19:15.545
The more you increase it, the more likely it is that the translation is in there, because the embeddings
0:19:15.545 --> 0:19:16.697
are quite noisy.
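A minimal sketch of precision-at-k with nearest-neighbour retrieval; the names (emb_src, emb_tgt, test_dict) are hypothetical, and test_dict maps a source word to its set of acceptable gold translations:

```python
import numpy as np

def precision_at_k(emb_src, emb_tgt, W, test_dict, k=5):
    tgt_words = list(emb_tgt.keys())
    T = np.stack([emb_tgt[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)         # normalise for cosine similarity
    hits = 0
    for src_word, gold in test_dict.items():
        q = emb_src[src_word] @ W                             # map into the target space
        q = q / np.linalg.norm(q)
        topk = np.argsort(-(T @ q))[:k]                       # k nearest neighbours
        hits += any(tgt_words[i] in gold for i in topk)
    return hits / len(test_dict)
```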
0:19:19.239 --> 0:19:25.924
What's interesting is that people have used dictionaries to learn word translation in a supervised way, but
0:19:25.924 --> 0:19:32.985
this unsupervised way of doing it can be even better than using a dictionary, so somehow our assumption helps
0:19:32.985 --> 0:19:36.591
us to build something better than a supervised system.
0:19:39.099 --> 0:19:42.985
So as you see at the top, you have precision at one, five, and ten.
0:19:42.985 --> 0:19:47.309
These are the typical numbers that you report for word translation.
0:19:48.868 --> 0:19:55.996
But GANs are usually quite tricky to train, and this does not converge for all language pairs,
0:19:55.996 --> 0:20:02.820
and this kind of goes back to our assumption that the languages kind of have the same structure,
0:20:02.820 --> 0:20:03.351
right.
0:20:03.351 --> 0:20:07.142
But if you take a language like English and
some.
0:20:07.087 --> 0:20:12.203
other language that is very low-resource and quite different from English and so on,
0:20:12.203 --> 0:20:13.673
then it doesn't work as well.
0:20:13.673 --> 0:20:18.789
So whenever our assumption fails, these unsupervised techniques either do not
0:20:18.789 --> 0:20:21.199
converge or just give really bad scores.
0:20:22.162 --> 0:20:27.083
And so the fact is that the monolingual embeddings for distant languages are too far apart.
0:20:27.083 --> 0:20:30.949
They do not share the same structure, and so training does not converge.
0:20:32.452 --> 0:20:39.380
And so I just want to mention that there is
a better retrieval technique than the nearest
0:20:39.380 --> 0:20:41.458
neighbor, which is called.
0:20:42.882 --> 0:20:46.975
But it's mathematically a bit more advanced, so I didn't want to go into it now.
0:20:46.975 --> 0:20:51.822
But if you're interested in some quite good retrieval techniques, you can just look at this
0:20:51.822 --> 0:20:53.006
if you're interested.
0:20:55.615 --> 0:20:59.241
Okay, so this was about word translation.
0:20:59.241 --> 0:21:02.276
Does anybody have any questions so far?
0:21:06.246 --> 0:21:07.501
So that was word translation.
0:21:07.501 --> 0:21:12.580
It was a bit easier than a sentence right,
so you just assume that there's a mapping and
0:21:12.580 --> 0:21:14.577
then you try to learn the mapping.
0:21:14.577 --> 0:21:19.656
But now it's a bit more difficult because you need to generate stuff as well, which is
0:21:19.656 --> 0:21:20.797
much trickier.
0:21:22.622 --> 0:21:28.512
The task here is that we have monolingual data for both languages as input, as before, but
0:21:28.512 --> 0:21:34.017
now, instead of translating word by word, we want to do sentence-level translation.
0:21:37.377 --> 0:21:44.002
We have word2vec and so on to learn word embeddings, but sentence embeddings were
0:21:44.002 --> 0:21:50.627
actually not that powerful, at least back when people tried to work on unsupervised
0:21:50.627 --> 0:21:51.445
MT.
0:21:52.632 --> 0:21:54.008
Now they're a bit okay.
0:21:54.008 --> 0:21:59.054
I mean, as you've seen in the practical session where we used them, they were quite decent.
0:21:59.054 --> 0:22:03.011
But then it also depends on which data they are trained on and so on.
0:22:03.011 --> 0:22:03.240
So.
0:22:04.164 --> 0:22:09.666
Sentence embeddings are definitely much harder to get than word embeddings, so this
0:22:09.666 --> 0:22:13.776
is a bit more complicated than the task that
you've seen before.
0:22:16.476 --> 0:22:18.701
Before we go into how UNMT works, this is your typical supervised
system right.
0:22:24.485 --> 0:22:29.558
So we have parallel data: source sentences and target sentences.
0:22:29.558 --> 0:22:31.160
We have a source encoder.
0:22:31.471 --> 0:22:36.709
We have a target decoder, and then we try to minimize the cross entropy loss on this parallel
0:22:36.709 --> 0:22:37.054
data.
0:22:37.157 --> 0:22:39.818
And this is how we train our typical system.
0:22:43.583 --> 0:22:49.506
But now we do not have any parallel data,
and so the intuition here is that if we can
0:22:49.506 --> 0:22:55.429
learn language independent representations at the encoder outputs, then we can pass
0:22:55.429 --> 0:22:58.046
them along to whichever decoder we want.
0:22:58.718 --> 0:23:03.809
It's going to get clearer in a moment,
but I'm trying to give a bit more intuition
0:23:03.809 --> 0:23:07.164
before I show you all the training objectives.
0:23:08.688 --> 0:23:15.252
So I assume that we have these different encoders
right, so it's not only two, you have a bunch
0:23:15.252 --> 0:23:21.405
of different source language encoders, a bunch
of different target language decoders, and
0:23:21.405 --> 0:23:26.054
also I assume that the encoder outputs are in the same representation space.
0:23:26.706 --> 0:23:31.932
If you give a sentence in English and the
same sentence in German, the embeddings are
0:23:31.932 --> 0:23:38.313
quite the same, like the multilingual word embeddings, right; and so then what we can do is, depending
0:23:38.313 --> 0:23:42.202
on the language we want, pass it to the appropriate decoder.
0:23:42.682 --> 0:23:50.141
And so the kind of goal here is to find out
a way to create language independent representations
0:23:50.141 --> 0:23:52.909
and then pass them to the decoder we want.
0:23:54.975 --> 0:23:59.714
Just keep in mind that we're trying to get language independent representations for a reason; it's
0:23:59.714 --> 0:24:02.294
going to be more clear once we see how it works.
0:24:05.585 --> 0:24:12.845
So in total we have three objectives that
we're going to try to train in our systems,
0:24:12.845 --> 0:24:16.981
and all of them use monolingual
data.
0:24:17.697 --> 0:24:19.559
So there's no parallel data at all.
0:24:19.559 --> 0:24:24.469
The first one is denoising auto-encoding, so it's more like you add noise to
0:24:24.469 --> 0:24:27.403
the sentence and then reconstruct the original.
0:24:28.388 --> 0:24:34.276
Then we have the on-the-fly back translation,
so this is where you take a sentence, generate
0:24:34.276 --> 0:24:39.902
a translation, and then learn the reverse mapping, which I'm going to show in pictures
0:24:39.902 --> 0:24:45.725
later, and then we have adversarial training to learn the language independent
0:24:45.725 --> 0:24:46.772
representation.
0:24:47.427 --> 0:24:52.148
So somehow, by training on these three tasks,
0:24:52.148 --> 0:24:54.728
we get an unsupervised
0:24:54.728 --> 0:24:54.917
MT system.
0:24:56.856 --> 0:25:02.964
OK, so the first thing we're going to do is denoising auto-encoding, right; so as I said, we add
0:25:02.964 --> 0:25:06.295
noise to the sentence, so we take our sentence.
0:25:06.826 --> 0:25:09.709
And then there are different ways to add noise.
0:25:09.709 --> 0:25:11.511
You can shuffle words around.
0:25:11.511 --> 0:25:12.712
You can drop words.
0:25:12.712 --> 0:25:18.298
Do whatever you want to do as long as there's
enough information to reconstruct the original
0:25:18.298 --> 0:25:18.898
sentence.
0:25:19.719 --> 0:25:25.051
And then we treat the noisy one and the original one as parallel data and train
0:25:25.051 --> 0:25:26.687
similarly to the supervised case.
0:25:28.168 --> 0:25:30.354
So we have a source sentence.
0:25:30.354 --> 0:25:32.540
We have a noisy source right.
0:25:32.540 --> 0:25:37.130
So here what basically happened is that the
word got shuffled.
0:25:37.130 --> 0:25:39.097
One word is dropped right.
0:25:39.097 --> 0:25:41.356
So this is the noisy source.
0:25:41.356 --> 0:25:47.039
And then we treat the noisy source and the source as a sentence pair, basically.
0:25:49.009 --> 0:25:53.874
We train by optimizing the cross entropy loss, similar to before.
0:25:57.978 --> 0:26:03.211
Basically, a picture to show what's happening: we have the noisy source,
0:26:03.163 --> 0:26:09.210
the noisy target, and then we have the reconstructed original source and original target, and since
0:26:09.210 --> 0:26:14.817
the languages are different we have our source encoder, target encoder, source decoder, and target decoder.
0:26:17.317 --> 0:26:20.202
And for this task we only need monolingual
data.
0:26:20.202 --> 0:26:25.267
We don't need any parallel data, because it's
just taking a sentence and shuffling it and
0:26:25.267 --> 0:26:27.446
reconstructing the original one.
0:26:28.848 --> 0:26:31.058
And we have four different blocks here.
0:26:31.058 --> 0:26:36.841
This is kind of very important to keep in
mind on how we change these connections later.
0:26:41.121 --> 0:26:49.093
Then this is more like the mathematical formulation, where you predict the source given the noisy source.
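A minimal sketch of the noising step (my own simplification, not the exact noise model used in the lecture): shuffle words within a small window and drop some of them, then train the model to reconstruct the original sentence from the corrupted one.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    # randomly drop words (keep at least one token)
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # shuffle words locally: perturb each position by a bounded random offset,
    # then sort by the perturbed positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

src = "the cat sat on the mat".split()
noisy = add_noise(src)   # training pair: (noisy, src), trained like a supervised MT example
```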
0:26:52.492 --> 0:26:55.090
So that was the denoising auto-encoding.
0:26:55.090 --> 0:26:58.403
The second step is on-the-fly back translation.
0:26:59.479 --> 0:27:06.386
So what we do is, we put our model in inference mode, right, we take a source sentence,
0:27:06.386 --> 0:27:09.447
and we generate a translation.
0:27:09.829 --> 0:27:18.534
It might be completely wrong or maybe partially
correct or so on, but we assume that the model
0:27:18.534 --> 0:27:20.091
kind of knows it, and
0:27:20.680 --> 0:27:25.779
it generates t-hat, right, and then what we do is take t-hat
0:27:25.779 --> 0:27:27.572
and s as a sentence pair, right.
0:27:27.572 --> 0:27:29.925
That's how we can handle the translation.
0:27:30.530 --> 0:27:38.824
So we train a supervised system on this sentence pair: we do inference and then train a reverse
0:27:38.824 --> 0:27:39.924
translation model.
0:27:42.442 --> 0:27:49.495
To be a bit more concrete: we have a source sentence, right, then we generate the translation,
0:27:49.495 --> 0:27:55.091
then we give the generated translation as an input and try to predict the original sentence.
0:27:58.378 --> 0:28:03.500
This is how we would do it in practice, right: before, the source encoder was connected
0:28:03.500 --> 0:28:08.907
to the source decoder, but now we interchanged
connections, so the source encoder is connected
0:28:08.907 --> 0:28:10.216
to the target decoder.
0:28:10.216 --> 0:28:13.290
The target encoder is connected to the source decoder.
0:28:13.974 --> 0:28:20.747
And given s we get t-hat, and given t we get s-hat, so this is the first time step.
0:28:21.661 --> 0:28:24.022
On the second time step, what you're going
to do is reverse.
0:28:24.664 --> 0:28:32.625
So s-hat is here, t-hat is here, and given
s hat we are trying to predict t, and given
0:28:32.625 --> 0:28:34.503
t-hat we are trying to predict s.
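A minimal sketch of one on-the-fly back-translation round; the interface (model_st, model_ts, translate, train_supervised) is hypothetical and stands for any pair of seq2seq translation models:

```python
def backtranslation_round(model_st, model_ts, mono_src, mono_tgt):
    # 1) inference: generate synthetic translations with the current models
    synth_tgt = [model_st.translate(s) for s in mono_src]   # s -> t-hat
    synth_src = [model_ts.translate(t) for t in mono_tgt]   # t -> s-hat

    # 2) training: treat (t-hat, s) and (s-hat, t) as parallel data and
    #    train the reverse direction with the usual supervised loss
    model_ts.train_supervised(src=synth_tgt, tgt=mono_src)  # t-hat -> s
    model_st.train_supervised(src=synth_src, tgt=mono_tgt)  # s-hat -> t
```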
0:28:36.636 --> 0:28:39.386
Is this clear? Do you have any questions on it?
0:28:45.405 --> 0:28:50.823
A bit more mathematically, we try to predict t given s-hat and s given t-hat, so it's always the
0:28:50.823 --> 0:28:53.963
supervised NMT technique that we are trying
to do.
0:28:53.963 --> 0:28:59.689
But we're trying to create these synthetic pairs that kind of help us to build an unsupervised
0:28:59.689 --> 0:29:00.181
system.
0:29:02.362 --> 0:29:08.611
Now, what you can maybe also see here is that if the source encoder and target encoder outputs
0:29:08.611 --> 0:29:14.718
are language independent, we can always swap the connections and get the translations.
0:29:14.718 --> 0:29:21.252
That's why it was important to find a way
to generate language independent representations.
0:29:21.441 --> 0:29:26.476
And the way we try to force this language independence is the GAN step.
0:29:27.627 --> 0:29:34.851
So the third step, which kind of combines all of them, is where we try to use a GAN to make the
0:29:34.851 --> 0:29:37.959
encoder output language independent.
0:29:37.959 --> 0:29:42.831
So here it's the same picture but from a different
paper.
0:29:42.831 --> 0:29:43.167
So.
0:29:43.343 --> 0:29:48.888
We have X source and X target, which are monolingual data.
0:29:48.888 --> 0:29:50.182
We add noise.
0:29:50.690 --> 0:29:54.736
Then we encode it using the source and the
target encoders right.
0:29:54.736 --> 0:29:58.292
Then we get the latent space Z source and
Z target right.
0:29:58.292 --> 0:30:03.503
Then we decode and try to reconstruct the
original one and this is the auto encoding
0:30:03.503 --> 0:30:08.469
loss, which compares the X source, which is the original one, with the
0:30:08.468 --> 0:30:09.834
predicted output.
0:30:09.834 --> 0:30:16.740
So this is always the auto-encoding step, but the GAN part concerns the encoder
0:30:16.740 --> 0:30:24.102
outputs: here we have a discriminator which tries to predict which language the latent
0:30:24.102 --> 0:30:25.241
space is from.
0:30:26.466 --> 0:30:33.782
So given Z source it has to predict that the
representation is from a language source and
0:30:33.782 --> 0:30:39.961
given Z target it has to predict that the representation is from the target language.
0:30:40.520 --> 0:30:45.135
And our encoders are kind of the generators here, and then we have a separate
0:30:45.135 --> 0:30:49.803
discriminator network which tries to predict which language the latent spaces are from.
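A minimal sketch of this latent-space discriminator and the adversarial term that pushes the encoders toward language independence (assumed shapes and hyperparameters; z_src and z_tgt stand for encoder states flattened to [N, d_model]):

```python
import torch
import torch.nn as nn

d_model = 512
disc = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, 2))
ce = nn.CrossEntropyLoss()

def disc_loss(z_src, z_tgt):
    # discriminator: predict language id 0 for source latents, 1 for target latents
    logits = torch.cat([disc(z_src), disc(z_tgt)])
    labels = torch.cat([torch.zeros(len(z_src)), torch.ones(len(z_tgt))]).long()
    return ce(logits, labels)

def adversarial_loss(z_src, z_tgt):
    # encoder objective: fool the (not updated) discriminator with flipped labels
    logits = torch.cat([disc(z_src), disc(z_tgt)])
    flipped = torch.cat([torch.ones(len(z_src)), torch.zeros(len(z_tgt))]).long()
    return ce(logits, flipped)
```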
0:30:53.393 --> 0:30:57.611
And then this one is when we combine the GAN with the auto-encoding step.
0:30:57.611 --> 0:31:02.767
Then we had an on the fly back translation
step right, and so here what we're trying to
0:31:02.767 --> 0:31:03.001
do.
0:31:03.863 --> 0:31:07.260
Is the same, basically just exactly the same.
0:31:07.260 --> 0:31:12.946
But when we are doing the training, we add the adversarial loss here, so:
0:31:13.893 --> 0:31:20.762
We take our X source, generate an intermediate translation, so Y target and Y source, right?
0:31:20.762 --> 0:31:27.342
This is the previous time step, and then we
have to encode the new sentences and basically
0:31:27.342 --> 0:31:32.764
make them language independent or train to
make them language independent.
0:31:33.974 --> 0:31:43.502
And then the hope is that now if we do this
using monolingual data alone we can just switch
0:31:43.502 --> 0:31:47.852
connections and then get our translation.
0:31:47.852 --> 0:31:49.613
So that's the idea.
0:31:54.574 --> 0:32:03.749
And so as I said before, GANs are quite good for vision, right, so this is kind of like the
0:32:03.749 --> 0:32:11.312
cycle GAN approach that you might have seen
in any computer vision course.
0:32:11.911 --> 0:32:19.055
Somehow that didn't work here, at least not as promisingly as for images, and so people
0:32:19.055 --> 0:32:23.706
did something else to enforce this language independence.
0:32:25.045 --> 0:32:31.226
They try to use a shared encoder instead of
having these different encoders right, and
0:32:31.226 --> 0:32:37.835
so this is basically the same training objectives as before, but what you're going to do now
0:32:37.835 --> 0:32:43.874
is learn cross-lingually and then use a single encoder for both languages.
0:32:44.104 --> 0:32:49.795
And this kind of also forces them to be in the
same space, and then you can choose whichever
0:32:49.795 --> 0:32:50.934
decoder you want.
0:32:52.552 --> 0:32:58.047
You can use GANs, or you can just use a shared encoder, and try to build your unsupervised
0:32:58.047 --> 0:32:58.779
MT system.
0:33:08.488 --> 0:33:09.808
These are now the
0:33:09.808 --> 0:33:15.991
enhancements that you can do on top of your unsupervised system: one, you can create
0:33:15.991 --> 0:33:16.686
a shared encoder.
0:33:18.098 --> 0:33:22.358
On top of the shared encoder you can add your GAN loss or whatever, so there's a lot
0:33:22.358 --> 0:33:22.550
of things you can combine.
0:33:24.164 --> 0:33:29.726
The other thing that is more relevant right
now is that you can create parallel data by
0:33:29.726 --> 0:33:35.478
word-to-word translation, right, because you know how to do unsupervised word translation.
0:33:36.376 --> 0:33:40.548
The first step is to create parallel data, assuming that the word translations are quite good.
0:33:41.361 --> 0:33:47.162
And then you train a supervised NMT model on this most likely wrong parallel data,
0:33:47.162 --> 0:33:50.163
but it somehow gives you a good starting point.
0:33:50.163 --> 0:33:56.098
So you build your supervised NMT system on the word-translated data, and then you
0:33:56.098 --> 0:33:59.966
use it as the initialization before you do unsupervised NMT.
0:34:00.260 --> 0:34:05.810
And the hope is that when you're doing the back translation, it's a good starting
0:34:05.810 --> 0:34:11.234
point; it's one technique that you can use to improve your unsupervised MT.
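A minimal sketch of this word-by-word initialisation (the dictionary and the whitespace tokenisation are assumptions): translate monolingual sentences token by token with the induced bilingual dictionary to get noisy synthetic parallel data, then train a normal supervised system on it as a warm start.

```python
def word_by_word_corpus(mono_sentences, bilingual_dict):
    pairs = []
    for sent in mono_sentences:
        tokens = sent.split()
        # unknown words are copied through; the result is noisy but usable for warm-up
        translated = [bilingual_dict.get(t, t) for t in tokens]
        pairs.append((" ".join(translated), sent))
    return pairs   # list of (synthetic source, original sentence) pairs
```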
0:34:17.097 --> 0:34:25.879
In the previous case, the way we knew when to stop was to see convergence of the GAN
0:34:25.879 --> 0:34:26.485
training.
0:34:26.485 --> 0:34:28.849
Actually, all we want to know is when W
0:34:28.849 --> 0:34:32.062
converges, which is quite easy to know when
to stop.
0:34:32.062 --> 0:34:37.517
But in a realistic case, we don't have any
parallel data right, so there's no validation.
0:34:37.517 --> 0:34:42.002
Or I mean, we might have test data in the
end, but there's no validation.
0:34:43.703 --> 0:34:48.826
How will we tune our hyper parameters in this
case, because there's really nothing
0:34:48.826 --> 0:34:49.445
for us to validate on?
0:34:50.130 --> 0:34:53.326
There's no gold data, in a sense.
0:34:53.326 --> 0:35:01.187
How do you think we can evaluate such systems
or how can we tune hyper parameters in this?
0:35:11.711 --> 0:35:17.089
So what you're going to do is use the back
translation technique.
0:35:17.089 --> 0:35:24.340
It's like a common technique when you have nothing, okay: use back translation
0:35:24.340 --> 0:35:26.947
somehow, and what you can do is:
0:35:26.947 --> 0:35:31.673
the main idea is to validate on how good the reconstruction is.
0:35:32.152 --> 0:35:37.534
So the idea is that if you have a good system
then the intermediate translation is quite
0:35:37.534 --> 0:35:39.287
good and going back is easy.
0:35:39.287 --> 0:35:44.669
But if it's just noise that you generate in
the forward step then it's really hard to go
0:35:44.669 --> 0:35:46.967
back, which is kind of the main idea.
0:35:48.148 --> 0:35:53.706
So the way it works is that we take a source
sentence, we generate a translation in target
0:35:53.706 --> 0:35:59.082
language, right, and then we translate the generated sentence back and compare it with the
0:35:59.082 --> 0:36:01.342
original one, and if they're close,
0:36:01.841 --> 0:36:09.745
it means that we have a good system, and if they are far apart, not. So this is kind of like an unsupervised
0:36:09.745 --> 0:36:10.334
metric.
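A minimal sketch of this round-trip validation idea; the model interface is hypothetical, and a crude unigram overlap stands in for BLEU here:

```python
def round_trip_score(model_st, model_ts, dev_sentences):
    total = 0.0
    for s in dev_sentences:
        t_hat = model_st.translate(s)        # forward translation
        s_hat = model_ts.translate(t_hat)    # translate back
        ref, hyp = set(s.split()), set(s_hat.split())
        total += len(ref & hyp) / max(len(ref), 1)   # reconstruction overlap
    return total / len(dev_sentences)        # higher = better round-trip reconstruction
```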
0:36:17.397 --> 0:36:21.863
Now, as for the amount of data that you need:
0:36:23.083 --> 0:36:27.995
These were like the first initial results on these systems.
0:36:27.995 --> 0:36:32.108
They wanted to do English and French and they
had fifteen million.
0:36:32.108 --> 0:36:38.003
There were fifteen million monolingual sentences, so it's quite a lot, and they were able to get
0:36:38.003 --> 0:36:40.581
thirty-two BLEU on these kinds of setups.
0:36:41.721 --> 0:36:47.580
But if you have just zero point one million parallel sentences you get the same
0:36:47.580 --> 0:36:48.455
performance.
0:36:48.748 --> 0:36:50.357
So it's a lot of training.
0:36:50.357 --> 0:36:55.960
It's a lot of monolingual data, but monolingual data is relatively easy to obtain; the other point is
0:36:55.960 --> 0:37:01.264
that the training also takes quite a bit longer than for the supervised system, but it's unsupervised,
0:37:01.264 --> 0:37:04.303
so it's kind of the trade off that you are
making.
0:37:07.367 --> 0:37:13.101
The other thing to note is that it's English and French, which fits our assumptions very well.
0:37:13.101 --> 0:37:18.237
Also, the monolingual data that they took
are kind of from similar domains and so on.
0:37:18.638 --> 0:37:27.564
So that's why they're able to build such a
good system, but you'll see later that it fails.
0:37:36.256 --> 0:37:46.888
And so, I mean, what people usually do is first build a system, right, using whatever
0:37:46.888 --> 0:37:48.110
parallel data they have.
0:37:48.608 --> 0:37:55.864
Then they use monolingual data and do back
translation, so this has always been the standard
way to improve, and what people have seen is that you don't even need zero point one
0:38:04.478 --> 0:38:05.360
million right.
0:38:05.360 --> 0:38:10.706
You just need like ten thousand or so on and
then you do the monolingual back translation
0:38:10.706 --> 0:38:12.175
and you're still better.
0:38:12.175 --> 0:38:13.291
than unsupervised MT.
0:38:13.833 --> 0:38:19.534
The question is whether it's really worth trying to do this, or maybe it's always better to find
0:38:19.534 --> 0:38:20.787
some parallel data.
0:38:20.787 --> 0:38:26.113
I'd rather spend a bit of money on getting a few parallel sentences and then use that to start
0:38:26.113 --> 0:38:27.804
and fine-tune to build your system.
0:38:27.804 --> 0:38:33.756
So it was kind of the understanding that bilingual unsupervised systems are not that useful, really.
0:38:50.710 --> 0:38:54.347
The thing is that with unlabeled data there is
0:38:57.297 --> 0:39:05.488
no training signal, so when we are starting, basically what we want to do is first
0:39:05.488 --> 0:39:13.224
get a good translation system and then use the unlabeled monolingual data to improve it.
0:39:13.613 --> 0:39:15.015
But if you start from UNMT, our model might be really bad, like it would be translating things completely wrong.
0:39:20.760 --> 0:39:26.721
And then when you fine-tune on your unlabeled data, it basically might be harmful, or maybe the
0:39:26.721 --> 0:39:28.685
same as the supervised baseline.
0:39:28.685 --> 0:39:35.322
So the idea is, the hope, that by fine-tuning on labeled data first we get a good initialization.
0:39:35.835 --> 0:39:38.404
And then use the unsupervised techniques to
get better.
0:39:38.818 --> 0:39:42.385
But if your starting point is really bad then
it's not going to help.
0:39:45.185 --> 0:39:47.324
Yeah, so as we said before,
0:39:47.324 --> 0:39:52.475
this is kind of how the self-supervised training usually works.
0:39:52.475 --> 0:39:54.773
First we have parallel data.
0:39:56.456 --> 0:39:58.062
Source language is X.
0:39:58.062 --> 0:39:59.668
Target language is Y.
0:39:59.668 --> 0:40:06.018
In the end we want a system that does X to
Y, not Y to X, but first we want to train a
0:40:06.018 --> 0:40:10.543
backward model, that is Y to X, so target language
to source.
0:40:11.691 --> 0:40:17.353
Then we take our monolingual target
sentences, use our backward model to generate
0:40:17.353 --> 0:40:21.471
synthetic source, and then we join them with
our original data.
0:40:21.471 --> 0:40:27.583
So now we have this noisy input, but always
the gold output, which is kind of really important
0:40:27.583 --> 0:40:29.513
when you're doing back translation.
0:40:30.410 --> 0:40:36.992
And then you can concatenate these two datasets and then you can train your X-to-Y translation
0:40:36.992 --> 0:40:44.159
system, and then you can always do this in multiple steps, usually three or four steps, which
0:40:44.159 --> 0:40:48.401
kind of always improves things, and then finally you get your best system.
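A minimal sketch of this standard pipeline run for a few rounds; train and translate are a hypothetical interface standing for any NMT training and inference code:

```python
def backtranslation_pipeline(parallel, mono_tgt, train, rounds=3):
    # parallel: list of (src, tgt) pairs; mono_tgt: list of target-language sentences.
    # train(pairs) is assumed to return a model with a .translate(sentence) method.
    backward = train([(t, s) for s, t in parallel])                 # Y -> X model
    forward = train(parallel)                                       # X -> Y baseline
    for _ in range(rounds):
        # noisy synthetic source, but always the gold target output
        synthetic = [(backward.translate(t), t) for t in mono_tgt]
        forward = train(parallel + synthetic)                       # retrain on joined data
    return forward
```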
0:40:49.029 --> 0:40:54.844
The point that I'm trying to make is that
although the unsupervised NMT scores that I've
0:40:54.844 --> 0:41:00.659
shown before were quite good, you probably
can get the same performance with fifty
0:41:00.659 --> 0:41:06.474
thousand sentences, and also the languages
that they've shown are quite similar and the
0:41:06.474 --> 0:41:08.654
texts were from the same domain.
0:41:14.354 --> 0:41:21.494
So, any questions on UNMT? Okay, yeah.
0:41:22.322 --> 0:41:28.982
So given this fact that back translation was already better than unsupervised NMT, what people have tried
0:41:28.982 --> 0:41:34.660
is to use this idea of multilinguality as you
have seen in the previous lecture.
0:41:34.660 --> 0:41:41.040
The question is how can we do this knowledge
transfer from high-resource languages to low-
0:41:41.040 --> 0:41:42.232
resource languages?
0:41:44.484 --> 0:41:51.074
One way to promote this language independent
representations is to share the encoder and
0:41:51.074 --> 0:41:57.960
decoder for all languages, all their available
languages, and that kind of hopefully enables
0:41:57.960 --> 0:42:00.034
the knowledge transfer.
0:42:03.323 --> 0:42:08.605
When we're doing multilinguality, the two
questions we need to think of are: how does
0:42:08.605 --> 0:42:09.698
the encoder know?
0:42:09.698 --> 0:42:14.495
How does the encoder or decoder know which language we're dealing with?
0:42:15.635 --> 0:42:20.715
You already might have known the answer also,
and the second question is how can we promote
0:42:20.715 --> 0:42:24.139
the encoder to generate language independent
representations?
0:42:25.045 --> 0:42:32.580
By solving these two problems we can take
help of high resource languages to do unsupervised
0:42:32.580 --> 0:42:33.714
translations.
0:42:34.134 --> 0:42:40.997
A typical example would be: you want to do unsupervised MT between English and Dutch, right, but you have
0:42:40.997 --> 0:42:47.369
parallel data between English and German, so
the question is can we use this parallel data
0:42:47.369 --> 0:42:51.501
to help build an unsupervised system between English and Dutch?
0:42:56.296 --> 0:43:01.240
For the first one we try to take help of language
embeddings for tokens, and this kind of is
0:43:01.240 --> 0:43:05.758
a straightforward way to tell the model which language it's dealing with.
0:43:06.466 --> 0:43:11.993
And for the second one we're going to look
at some pre training objectives which are also
0:43:11.993 --> 0:43:17.703
kind of unsupervised so we need monolingual
data mostly and this kind of helps us to promote
0:43:17.703 --> 0:43:20.221
the language independent representation.
0:43:23.463 --> 0:43:29.954
So the first pre-training method that we'll look at is XLM, which is quite famous, if
0:43:29.954 --> 0:43:32.168
you haven't heard of it yet.
0:43:32.552 --> 0:43:40.577
The way it works is that it's basically a transformer encoder, right, so it's like
0:43:40.577 --> 0:43:42.391
just the encoder module.
0:43:42.391 --> 0:43:44.496
No, there's no decoder here.
0:43:44.884 --> 0:43:51.481
And what we're trying to do is mask some tokens in a sequence and try to predict these masked
0:43:51.481 --> 0:43:52.061
tokens.
0:43:52.061 --> 0:43:55.467
This is called masked language modeling.
0:43:55.996 --> 0:44:05.419
The typical language modeling that you see is
causal language modeling, where you predict
0:44:05.419 --> 0:44:08.278
the next token.
0:44:08.278 --> 0:44:11.136
Then we have the position embeddings.
0:44:11.871 --> 0:44:18.774
Then we have the token embeddings, and then
here we have the masked tokens, and then we have
0:44:18.774 --> 0:44:22.378
the transformer encoder blocks to predict the masked tokens.
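A minimal sketch of the masking step behind masked language modeling; the 15% masking rate and the mask token ID are assumptions following the common recipe, not necessarily the exact setup on the slide.

```python
import random

MASK_ID = 4  # assumed ID of the [MASK] token in the vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    # Replace a random subset of tokens with [MASK]; the encoder is trained
    # to predict the original token at exactly those positions.
    corrupted, targets = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            corrupted.append(MASK_ID)
            targets.append(tok)    # predict this token
        else:
            corrupted.append(tok)
            targets.append(-100)   # conventional "ignore" label for the loss
    return corrupted, targets
```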
0:44:24.344 --> 0:44:30.552
We do this for all languages using the same
transformer encoder, and this helps
0:44:30.552 --> 0:44:36.760
us to push the sentence embeddings, or
the output of the encoder, into a common space
0:44:36.760 --> 0:44:37.726
for multiple languages.
0:44:42.782 --> 0:44:49.294
So first we train an MLM on both the
source and target language sides, and then
0:44:49.294 --> 0:44:54.928
we use it as a starting point for the encoder
and decoder of a UNMT system.
0:44:55.475 --> 0:45:03.175
So we take the monolingual data, build a masked
language model on both source and target languages,
0:45:03.175 --> 0:45:07.346
and then use it to initialize the UNMT system.
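Conceptually, this initialization just copies the pretrained MLM weights into the encoder and decoder wherever the parameter names and shapes match, and leaves everything else randomly initialized. A rough sketch with PyTorch state dicts; the naming scheme is an assumption, not any specific codebase.

```python
def init_from_pretrained(nmt_model, mlm_state_dict):
    # Copy pretrained MLM parameters into the encoder and decoder where the
    # (assumed) parameter names line up; leave the rest, e.g. cross-attention,
    # at its random initialization.
    own = nmt_model.state_dict()
    for name, tensor in mlm_state_dict.items():
        for prefix in ("encoder.", "decoder."):
            key = prefix + name
            if key in own and own[key].shape == tensor.shape:
                own[key] = tensor.clone()
    nmt_model.load_state_dict(own)
```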
0:45:09.009 --> 0:45:14.629
Here we look at two languages, but you can
also do it with one hundred languages at once.
0:45:14.629 --> 0:45:20.185
So there are pretrained checkpoints that you can
use, which have seen quite
0:45:20.185 --> 0:45:21.671
a lot of data, and you can use
0:45:21.671 --> 0:45:24.449
them as a starting point for your
UNMT system, which in practice works well.
0:45:31.491 --> 0:45:36.759
One detail is that since this is an encoder
block only, and your UNMT
0:45:37.446 --> 0:45:40.347
system is encoder-decoder,
0:45:40.347 --> 0:45:47.524
there's this cross-attention that is missing,
but you can always initialize that randomly.
0:45:47.524 --> 0:45:48.364
It's fine.
0:45:48.508 --> 0:45:53.077
Not everything is initialized, but it's still
decent.
0:45:56.056 --> 0:46:02.141
Then the other one is mBART,
and here you see that this builds on
0:46:02.141 --> 0:46:07.597
the unsupervised training objective, which
is denoising autoencoding.
0:46:08.128 --> 0:46:14.337
So what they do is they say that we don't
even need to do the iterative back-translation
0:46:14.337 --> 0:46:17.406
during pre-training; you can do it later.
0:46:17.406 --> 0:46:24.258
For pre-training we just do denoising auto-encoding
on all the different languages, and that also gives
0:46:24.258 --> 0:46:32.660
you good performance out of the box, so what
we basically have here is the transformer encoder.
0:46:34.334 --> 0:46:37.726
You are trying to generate a reconstructed
sequence,
0:46:37.726 --> 0:46:38.942
so you need a decoder as well.
0:46:39.899 --> 0:46:42.022
So we give an input sentence.
0:46:42.022 --> 0:46:48.180
We try to predict the masked tokens, or
rather we try to reconstruct the original
0:46:48.180 --> 0:46:52.496
sentence from the input sequence, which was
corrupted.
0:46:52.496 --> 0:46:57.167
So this is the same denoising objective that
you have seen before.
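As an illustration of the corruption used by this denoising objective, here is a small sketch that masks one random span of the input; the exact noise functions (span masking, sentence permutation, and so on) differ between papers, so treat the details as assumptions.

```python
import random

def corrupt(tokens, span_frac=0.3, mask_token="<mask>"):
    # Replace one contiguous span (roughly span_frac of the tokens) with a
    # single mask token; the model must reconstruct the original sequence.
    n = len(tokens)
    span_len = max(1, int(n * span_frac))
    start = random.randint(0, max(0, n - span_len))
    return tokens[:start] + [mask_token] + tokens[start + span_len:]
```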
0:46:58.418 --> 0:46:59.737
This is for English.
0:46:59.737 --> 0:47:04.195
I think this is for Japanese, and then we
do it for all languages.
0:47:04.195 --> 0:47:09.596
I mean, they have different versions with twenty-five,
fifty languages and so on, and then you can fine-tune
0:47:09.596 --> 0:47:11.794
on your sentence- or document-level task.
0:47:13.073 --> 0:47:20.454
And they did this for the supervised
setups, but you can also use this as an initialization
0:47:20.454 --> 0:47:25.058
for unsupervised MT and build on that, which also
works in practice.
0:47:30.790 --> 0:47:36.136
Then we have these; so until now we still
haven't really seen this benefit from the
0:47:36.136 --> 0:47:38.840
high-resource language, right, as I said.
0:47:38.878 --> 0:47:44.994
For example, you can use English-German to help English
to Dutch, and if you want English to Catalan, you
0:47:44.994 --> 0:47:46.751
can use English to French.
0:47:48.408 --> 0:47:55.866
One typical way to do this is to use pivot
translation, where you go through a pivot language.
0:47:55.795 --> 0:48:01.114
So here it's Finnish to Greek: you take
your translation, say, from Finnish to English and then
0:48:01.114 --> 0:48:03.743
English to Greek, and then you get the translation.
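Pivot translation is essentially just chaining two supervised systems, as in this minimal sketch; the two translation functions are placeholders for whatever Finnish-English and English-Greek systems you have.

```python
def pivot_translate(sentence, src_to_pivot, pivot_to_tgt):
    # e.g. Finnish -> English -> Greek: route through a high-resource
    # pivot language such as English.
    pivot_sentence = src_to_pivot(sentence)   # Finnish -> English
    return pivot_to_tgt(pivot_sentence)       # English -> Greek
```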
0:48:04.344 --> 0:48:10.094
What's important is that you have these different
techniques and you can always think of which
0:48:10.094 --> 0:48:12.333
one to use given the data situation.
0:48:12.333 --> 0:48:18.023
So if it is, say, Finnish to Greek, maybe pivoting is
better, because you might get good Finnish
0:48:18.023 --> 0:48:20.020
to English and English to Greek systems.
0:48:20.860 --> 0:48:23.255
Sometimes it also depends on the language
pair.
0:48:23.255 --> 0:48:27.595
There might be some information loss and so
on, so there are quite a few variables you
0:48:27.595 --> 0:48:30.039
need to think of and decide which system to
use.
0:48:32.752 --> 0:48:39.654
Then there's zero-shot translation, which you've probably
also seen in the multilingual lecture, and
0:48:39.654 --> 0:48:45.505
if you can improve the language independence,
then your zero-shot performance gets better.
0:48:45.505 --> 0:48:52.107
So maybe if you use the multilingual models
and do zero shot directly, it's quite good.
0:48:53.093 --> 0:48:58.524
So we have zero-shot and pivot translation, and then
we have the unsupervised translation, where
0:48:58.524 --> 0:49:00.059
we can translate between two languages
0:49:00.600 --> 0:49:02.762
even when there is no parallel data.
0:49:06.686 --> 0:49:07.565
So, to sum up what we have seen so far: basically,
0:49:15.255 --> 0:49:16.754
just from looking at
0:49:16.836 --> 0:49:19.307
these two monolingual files alone you can create a dictionary.
0:49:19.699 --> 0:49:26.773
You can build an unsupervised MT system, not
always, but if the domains are similar and the
0:49:26.773 --> 0:49:28.895
languages are similar.
0:49:28.895 --> 0:49:36.283
But if they are distant languages, then
unsupervised MT usually doesn't work really
0:49:36.283 --> 0:49:36.755
well.
0:49:37.617 --> 0:49:40.297
What you could do
0:49:40.720 --> 0:49:46.338
would be: if you can get some parallel
data from somewhere, or do bitext mining as
0:49:46.338 --> 0:49:51.892
we have seen in the LASER practical,
then you can use that to initialize your
0:49:51.892 --> 0:49:57.829
system and then train a semi-supervised
NMT system, and that would be better than
0:49:57.829 --> 0:50:00.063
just building an unsupervised one.
0:50:00.820 --> 0:50:06.546
With that, we're at the end.
0:50:07.207 --> 0:50:08.797
Any quick questions?
0:50:16.236 --> 0:50:25.070
(Audience question; not clearly intelligible in the recording.)
0:50:56.916 --> 0:51:03.798
They are trained on next-token prediction, and this somehow gives them
many abilities, not only translation but other
0:51:03.798 --> 0:51:08.062
than that there are quite a few things that
they can do.
0:51:10.590 --> 0:51:17.706
But the translation in itself usually doesn't
work really well compared to a system
0:51:17.706 --> 0:51:20.878
you build specifically for your use case.
0:51:22.162 --> 0:51:27.924
I would guess that it's usually better than
the LLM, but you can always adapt the LLM to
0:51:27.924 --> 0:51:31.355
the task that you want, and then it could be
better.
0:51:32.152 --> 0:51:37.849
An LLM out of the box might not be the
best choice for your task.
0:51:37.849 --> 0:51:44.138
For me, I'm working on UI translation,
so it's more about translating software.
0:51:45.065 --> 0:51:50.451
And it's quite a niche domain as well,
and if you use the LLM out of the box, they're
0:51:50.451 --> 0:51:53.937
actually quite bad compared to the systems
that we built.
0:51:54.414 --> 0:51:56.736
But you can do these different techniques
like prompting.
0:51:57.437 --> 0:52:03.442
What people usually do is hard prompting,
where they give similar translation pairs in
0:52:03.442 --> 0:52:08.941
the prompt and then ask it to translate, and
that kind of improves the performance a lot.
0:52:09.383 --> 0:52:15.135
So there are different techniques that you
can use to adapt your LLMs, and then it might
0:52:15.135 --> 0:52:16.399
be better than the task-specific system.
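A sketch of what such a prompt could look like; the retrieval function, the language pair, and the exact wording are assumptions for illustration, not a prescribed recipe.

```python
def build_prompt(source_sentence, retrieve_similar_pairs, k=3):
    # retrieve_similar_pairs is a placeholder for fuzzy matching against a
    # translation memory; it returns k (source, target) example pairs.
    examples = retrieve_similar_pairs(source_sentence, k)
    lines = ["Translate from English to German."]
    for src, tgt in examples:
        lines.append(f"English: {src}\nGerman: {tgt}")
    lines.append(f"English: {source_sentence}\nGerman:")
    return "\n\n".join(lines)
```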
0:52:18.418 --> 0:52:22.857
But if you're looking for niche things, I
don't think LLMs are that good.
0:52:22.857 --> 0:52:26.309
But if you want to do, let's say, unsupervised
translation.
0:52:26.309 --> 0:52:30.036
In this case you can never be sure that they
haven't seen the data.
0:52:30.036 --> 0:52:35.077
First of all, whether they have seen data in
that language or not; and if the data is publicly available,
0:52:35.077 --> 0:52:36.831
they probably did see it.
0:52:40.360 --> 0:53:00.276
I feel like they have a pretty good understanding
of many languages.
0:53:04.784 --> 0:53:09.059
It depends on the language, but I would be pretty surprised
if it works on a low-resource language.
0:53:09.059 --> 0:53:11.121
I would expect it to work on German and so on.
0:53:11.972 --> 0:53:13.633
But if you take a low-resource language,
0:53:14.474 --> 0:53:20.973
I don't think it works, and also there are quite
a few papers which have already shown that
0:53:20.973 --> 0:53:27.610
if you build a system yourself in the typical
way, it's quite a
0:53:27.610 --> 0:53:29.338
bit better than the LLM.
0:53:29.549 --> 0:53:34.883
But you can always do things with LLMs to
get better results.
0:53:37.557 --> 0:53:39.539
Any more questions?
0:53:41.421 --> 0:53:47.461
So if not, then we're going to end the lecture
here, and then on Thursday we're going to have
0:53:47.461 --> 0:53:51.597
document-level MT, which is also given by me, so
thanks for coming.
|