WEBVTT
0:00:00.060 --> 0:00:07.762
OK good, so today's lecture is on unsupervised
machine translation. So what you have seen
0:00:07.762 --> 0:00:13.518
so far are different techniques for supervised
MT, so you have parallel
0:00:13.593 --> 0:00:18.552
data, right? So let's say in an English corpus
you have one file and then in German you have
0:00:18.552 --> 0:00:23.454
another file which is sentence-to-sentence
aligned, and then you try to build systems around
0:00:23.454 --> 0:00:23.679
it.
0:00:24.324 --> 0:00:30.130
But what's different about this lecture is
that you assume that you have no parallel data
0:00:30.130 --> 0:00:30.663
at all.
0:00:30.663 --> 0:00:37.137
You only have monolingual data and the question
is how can we build systems to translate between
0:00:37.137 --> 0:00:39.405
these two languages right and so.
0:00:39.359 --> 0:00:44.658
This is a bit more realistic scenario because
you have so many languages in the world.
0:00:44.658 --> 0:00:50.323
You cannot expect to have parallel data between
every pair of languages, but in typical
0:00:50.323 --> 0:00:55.623
cases you have newspapers and so on, which
is like monolingual files, and the question
0:00:55.623 --> 0:00:57.998
is can we build something around them?
0:00:59.980 --> 0:01:01.651
So, the agenda for today.
0:01:01.651 --> 0:01:05.893
First we'll start off with the introduction,
so why do we need it?
0:01:05.893 --> 0:01:11.614
and also some intuition on how these models
work before going into the technical details.
0:01:11.614 --> 0:01:17.335
I want to also go through an example, which
kind of gives you more understanding on how
0:01:17.335 --> 0:01:19.263
people came up with these models.
0:01:20.820 --> 0:01:23.905
Then the rest of the lecture is going to be
two parts.
0:01:23.905 --> 0:01:26.092
One is we're going to translate words.
0:01:26.092 --> 0:01:30.018
We're not going to care about how can we translate
the full sentence.
0:01:30.018 --> 0:01:35.177
But given two monolingual files, how can we
get a dictionary basically, which is much easier
0:01:35.177 --> 0:01:37.813
than generating something in a sentence level?
0:01:38.698 --> 0:01:43.533
Then we're going to go into the harder case,
which is unsupervised sentence-level translation.
0:01:44.204 --> 0:01:50.201
And here what you'll see is what are the training
objectives which are quite different than the
0:01:50.201 --> 0:01:55.699
word translation, and also where it doesn't
work, because this is also quite important and
0:01:55.699 --> 0:02:01.384
it's one of the reasons why unsupervised MT is
not used anymore, because the limitations kind
0:02:01.384 --> 0:02:03.946
of take it away from the realistic use cases.
0:02:04.504 --> 0:02:06.922
And then that leads to the multilingual
models.
0:02:06.922 --> 0:02:07.115
So.
0:02:07.807 --> 0:02:12.915
What people are trying to do to build systems for
languages that do not have any parallel data
0:02:12.915 --> 0:02:17.693
is to use multilingual models and combine them with
these training objectives to get better at
0:02:17.693 --> 0:02:17.913
it.
0:02:17.913 --> 0:02:18.132
So.
0:02:18.658 --> 0:02:24.396
People are not trying to build bilingual systems
currently for unsupervised machine translation,
0:02:24.396 --> 0:02:30.011
but I think it's good to know how they came
to hear this point and what they're doing now.
0:02:30.090 --> 0:02:34.687
You also see some patterns overlapping which
people are using.
0:02:36.916 --> 0:02:41.642
So as you said before, and you probably hear
it multiple times now is that we have seven
0:02:41.642 --> 0:02:43.076
thousand languages around.
0:02:43.903 --> 0:02:49.460
Can be different dialects in someone, so it's
quite hard to distinguish what's the language,
0:02:49.460 --> 0:02:54.957
but you can typically approximate it at seven
thousand and that leads to twenty five million
0:02:54.957 --> 0:02:59.318
pairs, which is the obvious reason why we do
not have parallel data between all of them.
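As a quick sanity check on that number, here is a minimal sketch of the pair count in Python (simply counting unordered language pairs):

n_languages = 7000
n_pairs = n_languages * (n_languages - 1) // 2  # unordered language pairs
print(n_pairs)  # 24,496,500, i.e. roughly twenty-five million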
0:03:00.560 --> 0:03:06.386
So if you want to build an MT system for all
possible language pairs, the question is
0:03:06.386 --> 0:03:07.172
how can we?
0:03:08.648 --> 0:03:13.325
The typical use case, but there are actually
quite few interesting use cases than what you
0:03:13.325 --> 0:03:14.045
would expect.
0:03:14.614 --> 0:03:20.508
One is animal languages, which is a
real thing that's happening right now, not with
0:03:20.780 --> 0:03:26.250
dogs, but with dolphins and so on, but I
couldn't find a picture that could show this,
0:03:26.250 --> 0:03:31.659
but if you are interested in stuff like this
you can check out the website where people
0:03:31.659 --> 0:03:34.916
are actually trying to understand how animals
speak.
0:03:35.135 --> 0:03:37.356
It's Also a Bit More About.
0:03:37.297 --> 0:03:44.124
knowing what the animals want to say; we may
not be there yet, but still people are trying to
0:03:44.124 --> 0:03:44.661
do it.
0:03:45.825 --> 0:03:50.689
More realistic thing that's happening is the
translation of programming languages.
0:03:51.371 --> 0:03:56.963
And so this is quite a good scenario
for unsupervised MT: you have
0:03:56.963 --> 0:04:02.556
a lot of code available online, right, in C++
and in Python, and the question is how can
0:04:02.556 --> 0:04:08.402
we translate by just looking at the code alone,
with no parallel functions and so on, and this
0:04:08.402 --> 0:04:10.754
is actually quite good right now so.
0:04:12.032 --> 0:04:16.111
You'll see how these techniques were applied to do
programming language translation.
0:04:18.258 --> 0:04:23.882
And then you can also think of language as
something that is quite broad, so you can, let's
0:04:23.882 --> 0:04:24.194
say,
0:04:24.194 --> 0:04:29.631
think of formal sentences in English as one
language and informal sentences in English
0:04:29.631 --> 0:04:35.442
as another language and then learn to translate
between them, and then it kind of becomes
0:04:35.442 --> 0:04:37.379
a style transfer problem.
0:04:38.358 --> 0:04:43.042
Although it's translation, you can consider
different characteristics of a language and
0:04:43.042 --> 0:04:46.875
then separate them as two different languages
and then try to map them.
0:04:46.875 --> 0:04:52.038
So it's not only about languages, but you
can also do quite cool things by using unsupervised
0:04:52.038 --> 0:04:54.327
techniques, which are quite possible also.
0:04:56.256 --> 0:04:56.990
I am so.
0:04:56.990 --> 0:05:04.335
This is kind of the modeling for many of the
use cases that we have for unsupervised MT.
0:05:04.335 --> 0:05:11.842
But before we go into the modeling of these
systems, what I want you to do is look at these
0:05:11.842 --> 0:05:12.413
dummy languages.
0:05:13.813 --> 0:05:19.720
We have text and language one, text and language
two right, and nobody knows what these languages
0:05:19.720 --> 0:05:20.082
mean.
0:05:20.082 --> 0:05:23.758
They are completely made up, right, and the
thing is also:
0:05:23.758 --> 0:05:29.364
they're not parallel lines, so the first line
here and the first line there are not aligned; they're
0:05:29.364 --> 0:05:30.810
just monolingual files.
0:05:32.052 --> 0:05:38.281
And now think about how can you translate
the word M1 from language one to language two,
0:05:38.281 --> 0:05:41.851
and from this you kind of see how we try to model
this.
0:05:42.983 --> 0:05:47.966
So take your time and then think of how
you can translate M1 into language two.
0:06:41.321 --> 0:06:45.589
About the model, if you ask somebody who doesn't
know anything about machine translation right,
0:06:45.589 --> 0:06:47.411
and then you ask them to translate more.
0:07:01.201 --> 0:07:10.027
But it's also not quite easy if you think
of the way that I made this example is relatively
0:07:10.027 --> 0:07:10.986
easy, so.
0:07:11.431 --> 0:07:17.963
Basically, the first two sentences are these
two [reading the made-up words from the slide],
0:07:17.963 --> 0:07:21.841
and this is supposed to be the "German".
0:07:22.662 --> 0:07:25.241
And then when you join these two words, it's.
0:07:25.205 --> 0:07:32.445
English German the third line and the last
line, and then the fourth line is the first
0:07:32.445 --> 0:07:38.521
line, so German language, English, and then
speak English, speak German.
0:07:38.578 --> 0:07:44.393
So this is how I made up the example,
and the intuition here is that you assume
0:07:44.393 --> 0:07:50.535
that the languages have a fundamental structure
right and it's the same across all languages.
0:07:51.211 --> 0:07:57.727
It doesn't matter what language you are thinking
of: the words are kind of formed in the same way, they join
0:07:57.727 --> 0:07:59.829
together in the same way, and
0:07:59.779 --> 0:08:06.065
sentences are structured in the same way. This
is not a realistic assumption for sure, but
0:08:06.065 --> 0:08:12.636
it's actually a decent one to make and if you
can think of this like if you can assume this
0:08:12.636 --> 0:08:16.207
then we can model systems in an unsupervised
way.
0:08:16.396 --> 0:08:22.743
So this is the intuition that I want to give,
and you can see that whenever assumptions fail,
0:08:22.743 --> 0:08:23.958
the systems fail.
0:08:23.958 --> 0:08:29.832
So in practice whenever we go far away from
these assumptions, the systems tend more
0:08:29.832 --> 0:08:30.778
often to fail.
0:08:33.753 --> 0:08:39.711
So the example that I gave was actually perfect
mapping, right, which never really exists in practice.
0:08:39.711 --> 0:08:45.353
They have the same number of words, same sentence
structure, perfect mapping, and so on.
0:08:45.353 --> 0:08:50.994
This doesn't happen, but let's assume that
this happens and try to see how we can model it.
0:08:53.493 --> 0:09:01.061
Okay, now let's go a bit more formal, so what
we want to do is unsupervised word translation.
0:09:01.901 --> 0:09:08.773
Here the task is that we have input data as
monolingual data, so a bunch of sentences in
0:09:08.773 --> 0:09:15.876
one file and a bunch of sentences in another file
in two different languages, and the question
0:09:15.876 --> 0:09:18.655
is how can we get a bilingual dictionary?
0:09:19.559 --> 0:09:25.134
So if you look at the picture you see that
it's just kind of projected down onto a two-dimensional
0:09:25.134 --> 0:09:30.358
plane, but basically when you map them
into a plot you see that the words that are
0:09:30.358 --> 0:09:35.874
parallel are closer together, and the question
is how can we do it just looking at two files?
0:09:36.816 --> 0:09:42.502
And you can say that what we want to basically
do is create a dictionary in the end given
0:09:42.502 --> 0:09:43.260
two files.
0:09:43.260 --> 0:09:45.408
So this is the task that we want.
0:09:46.606 --> 0:09:52.262
And the first step on how we do this is to
learn word vectors, and this can be with whatever
0:09:52.262 --> 0:09:56.257
technique you have seen before, word2vec,
GloVe, or so on.
0:09:56.856 --> 0:10:00.699
So you take a monolingual data and try to
learn word embeddings.
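A minimal sketch of this step, assuming gensim's Word2Vec and two hypothetical monolingual files lang1.txt and lang2.txt (any embedding technique seen before would do):

from gensim.models import Word2Vec

# Whitespace-tokenize each monolingual file into lists of words.
corpus_l1 = [line.split() for line in open("lang1.txt", encoding="utf-8")]
corpus_l2 = [line.split() for line in open("lang2.txt", encoding="utf-8")]

# Train two independent embedding spaces; at this point they are not aligned at all.
emb_l1 = Word2Vec(corpus_l1, vector_size=300, window=5, min_count=5).wv
emb_l2 = Word2Vec(corpus_l2, vector_size=300, window=5, min_count=5).wv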
0:10:02.002 --> 0:10:07.675
Then you plot them into a graph, and then
typically what you would see is that they're
0:10:07.675 --> 0:10:08.979
not aligned at all.
0:10:08.979 --> 0:10:14.717
One word space is somewhere, and one word
space is somewhere else, and this is what you
0:10:14.717 --> 0:10:18.043
would typically expect to see in the in the
image.
0:10:19.659 --> 0:10:23.525
Now our assumption was that both languages
have the same
0:10:23.563 --> 0:10:28.520
structure, and so we can use this information
to learn the mapping between these two spaces.
0:10:30.130 --> 0:10:37.085
So before we get to how we do it: I think this is quite
famous already, and everybody knows it a bit;
0:10:37.085 --> 0:10:41.824
word embeddings capture semantic
relations, right.
0:10:41.824 --> 0:10:48.244
So the distance between man and woman is approximately
the same as between king and queen.
0:10:48.888 --> 0:10:54.620
It also holds for verb tenses, country-capital
and so on, so there are some relationships
0:10:54.620 --> 0:11:00.286
happening in the word embedding space, which
is quite clear for at least one language.
0:11:03.143 --> 0:11:08.082
Now if you think of this, let's say, the
English word embeddings.
0:11:08.082 --> 0:11:14.769
Let's say the German word embeddings: the way
king, queen, man, woman are organized is the same
0:11:14.769 --> 0:11:17.733
as for the German translations of these words.
0:11:17.998 --> 0:11:23.336
This is the main idea: although they
are somewhere else, the relationship is the
0:11:23.336 --> 0:11:28.008
same between both languages, and we can
use this to learn the mapping.
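As a small illustration of what "the relationship is the same" means inside one embedding space, a sketch with numpy (emb stands for a hypothetical word-to-vector lookup such as one trained above):

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The offset man -> woman roughly matches the offset king -> queen,
# so king - man + woman should land near queen in a well-trained space.
guess = emb["king"] - emb["man"] + emb["woman"]
print(cosine(guess, emb["queen"]))  # typically among the highest similarities in the vocabulary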
0:11:31.811 --> 0:11:35.716
It's not only for these four words; it
happens for all the words in the language,
0:11:35.716 --> 0:11:37.783
and so we can use this to learn the mapping.
0:11:39.179 --> 0:11:43.828
This is the main idea: both embeddings
have a similar shape.
0:11:43.828 --> 0:11:48.477
It's only that they're just not aligned and
so if you go to the figure here:
0:11:48.477 --> 0:11:50.906
They kind of have a similar shape.
0:11:50.906 --> 0:11:57.221
They're just in some different spaces and
what you need to do is to map them into a common
0:11:57.221 --> 0:11:57.707
space.
0:12:06.086 --> 0:12:12.393
So what we want to learn is the W such that, if we multiply
W with X, WX and Y end up in the same space.
0:12:35.335 --> 0:12:41.097
That's true, but there are also many words
that do have the relationship, right, and we hope
0:12:41.097 --> 0:12:43.817
that this is enough to learn the mapping.
0:12:43.817 --> 0:12:49.838
So there's always going to be a bit of noise,
as in, when we align them they're not going
0:12:49.838 --> 0:12:51.716
to be exactly the same, but.
0:12:51.671 --> 0:12:57.293
What you can expect is that there are these
main words that allow us to learn the mapping,
0:12:57.293 --> 0:13:02.791
so it's not going to be perfect, but it's an
approximation that we make to to see how it
0:13:02.791 --> 0:13:04.521
works, and in practice it works.
0:13:04.521 --> 0:13:10.081
Also, the fact that some words do
not have any relationship does not affect it that
0:13:10.081 --> 0:13:10.452
much.
0:13:10.550 --> 0:13:15.429
A lot of words usually have, so it kind of
works out in practice.
0:13:22.242 --> 0:13:34.248
I have not heard about it, but if you want
to say something about it, I would be interested,
0:13:34.248 --> 0:13:37.346
but we can do it later.
0:13:41.281 --> 0:13:44.133
Usual case: This is supervised.
0:13:45.205 --> 0:13:49.484
The first way is to do supervised word translation,
where we have a dictionary, right, and we
0:13:49.484 --> 0:13:53.764
can use that to learn the mapping, but in our
case we assume that we have nothing right so
0:13:53.764 --> 0:13:55.222
we only have monolingual data.
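For that supervised setting, one common closed-form choice (a sketch, not necessarily the exact method meant here) is the orthogonal Procrustes solution: stack the seed-dictionary pairs into matrices and take W from an SVD. Assuming X_pairs and Y_pairs are hypothetical numpy arrays whose i-th rows are the embeddings of one dictionary pair:

import numpy as np

def procrustes(X_pairs, Y_pairs):
    # Orthogonal W minimizing ||X_pairs @ W.T - Y_pairs|| in the Frobenius norm.
    U, _, Vt = np.linalg.svd(Y_pairs.T @ X_pairs)
    return U @ Vt

W = procrustes(X_pairs, Y_pairs)
mapped_src = X_pairs @ W.T  # source embeddings mapped into the target space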
0:13:56.136 --> 0:14:03.126
Then we need unsupervised learning to figure
out W, and we're going to use GANs to find
0:14:03.126 --> 0:14:06.122
W, and it's quite a nice way to do it.
0:14:08.248 --> 0:14:15.393
So just before I go into how we use it for our use
case, I'm going to go briefly over GANs, right,
0:14:15.393 --> 0:14:19.940
so we have two components: generator and discriminator.
0:14:21.441 --> 0:14:27.052
The generator tries to generate something, obviously,
and the discriminator tries to see if it's
0:14:27.052 --> 0:14:30.752
real data or something that is generated by
the generator.
0:14:31.371 --> 0:14:37.038
And there's like this two player game where
the generator tries to fool and the discriminator tries
0:14:37.038 --> 0:14:41.862
not to get fooled, and they try to build these
two components and try to learn W.
0:14:43.483 --> 0:14:53.163
Okay, so let's say we have two languages,
X and Y, right, so the X language has N words
0:14:53.163 --> 0:14:56.167
with embeddings of some dimension.
0:14:56.496 --> 0:14:59.498
So what we have here is an embedding matrix of that size.
0:14:59.498 --> 0:15:02.211
Then we have the target language Y with M words,
0:15:02.211 --> 0:15:06.944
also with the same embedding dimension as I mentioned,
and then we have a matrix for that as well.
0:15:07.927 --> 0:15:13.784
Basically what you're going to do is use word2vec
and learn our word embeddings.
0:15:14.995 --> 0:15:23.134
Now we have these X embeddings, Y embeddings, and
what you want to know is W, such that W X and
0:15:23.134 --> 0:15:24.336
Y are aligned.
0:15:29.209 --> 0:15:35.489
With GANs you have two steps: one is a discriminator
step and one is the mapping step, and the
0:15:35.489 --> 0:15:41.135
discriminator step is to see whether an embedding
is an original one or a mapped one.
0:15:41.135 --> 0:15:44.688
It's going to be much clearer when I go to
the figure.
0:15:46.306 --> 0:15:50.041
So we have monolingual documents in two
different languages.
0:15:50.041 --> 0:15:54.522
From here we get our source language embeddings
and target language embeddings, right.
0:15:54.522 --> 0:15:57.855
Then we randomly initialize the transformation
matrix W.
0:16:00.040 --> 0:16:06.377
Then we have the discriminator which tries
to see if it's WX or Y, so it needs to know
0:16:06.377 --> 0:16:13.735
that this is a mapped one and this is the original
language, and so if you look at the loss function
0:16:13.735 --> 0:16:20.072
here, it's basically that source is one given
WX, so this is from the source language.
0:16:23.543 --> 0:16:27.339
Which means it's the target language embedding, yeah.
0:16:27.339 --> 0:16:34.436
It's just like my figure is not that great,
but you can assume that they are separate.
0:16:40.260 --> 0:16:43.027
So this is kind of the loss function.
0:16:43.027 --> 0:16:46.386
We have N source words, M target words, and
so on.
0:16:46.386 --> 0:16:52.381
So that's why you have one over N, one over M,
and the discriminator is to just see if they're
0:16:52.381 --> 0:16:55.741
mapped or they're from the original target
embeddings.
0:16:57.317 --> 0:17:04.024
And then we have the mapping step where we
train W to fool the discriminator.
0:17:04.564 --> 0:17:10.243
So here it's the same way, but what you're
going to do is invert the loss function.
0:17:10.243 --> 0:17:15.859
So now we freeze the discriminator, and it's
important to note that in the previous step
0:17:15.859 --> 0:17:20.843
we froze the transformation matrix, and here
we freeze the discriminator.
0:17:22.482 --> 0:17:28.912
And now the goal is to fool the discriminator, right,
so it should predict that the source is zero
0:17:28.912 --> 0:17:35.271
given the mapped embedding, and the source is
one given the target embedding, which is wrong,
0:17:35.271 --> 0:17:37.787
which is how we're training the W.
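Putting the two alternating steps together, a condensed PyTorch-style sketch (X and Y are hypothetical tensors of source and target word embeddings; real setups add tricks such as orthogonality constraints and label smoothing that are left out here):

import torch
import torch.nn as nn

dim, batch, num_steps = 300, 32, 10000
W = nn.Linear(dim, dim, bias=False)                                        # the mapping
D = nn.Sequential(nn.Linear(dim, 512), nn.LeakyReLU(), nn.Linear(512, 1))  # the discriminator
bce = nn.BCEWithLogitsLoss()
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)

for _ in range(num_steps):
    x = X[torch.randint(len(X), (batch,))]   # random source embeddings
    y = Y[torch.randint(len(Y), (batch,))]   # random target embeddings

    # Discriminator step: W frozen, D learns "1 = mapped source (WX)", "0 = real target (Y)".
    with torch.no_grad():
        wx = W(x)
    d_loss = bce(D(wx), torch.ones(batch, 1)) + bce(D(y), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Mapping step: D frozen, W is trained with inverted labels so that it fools D.
    for p in D.parameters():
        p.requires_grad_(False)
    w_loss = bce(D(W(x)), torch.zeros(batch, 1)) + bce(D(y), torch.ones(batch, 1))
    opt_W.zero_grad(); w_loss.backward(); opt_W.step()
    for p in D.parameters():
        p.requires_grad_(True)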
0:17:39.439 --> 0:17:46.261
Any questions on this okay so then how do
we know when to stop?
0:17:46.261 --> 0:17:55.854
We just train until we reach convergence, right,
and then we have our W, hopefully trained to
0:17:55.854 --> 0:17:59.265
map them into an aligned space.
0:18:02.222 --> 0:18:07.097
The question is how can we evaluate this mapping?
0:18:07.097 --> 0:18:13.923
Does anybody know what we can use to
evaluate the mapping,
0:18:13.923 --> 0:18:15.873
how good a word translation is?
0:18:28.969 --> 0:18:33.538
As I said, we use a dictionary, at least
in the end.
0:18:33.538 --> 0:18:40.199
We need a dictionary to evaluate, but this
is only for the final evaluation; we aren't using it at
0:18:40.199 --> 0:18:42.600
all in the training data.
0:18:43.223 --> 0:18:49.681
One way is to check what's the precision against
our dictionary, just that:
0:18:50.650 --> 0:18:52.813
you take the first nearest neighbor and see if the
translation is there.
0:18:53.573 --> 0:18:56.855
But this is quite strict because there's a
lot of noise in the embedding space, right.
0:18:57.657 --> 0:19:03.114
Not always is your first neighbor going to
be the translation, so what people also report
0:19:03.114 --> 0:19:05.055
is precision at five and so on.
0:19:05.055 --> 0:19:10.209
So you take the five nearest neighbors and see
if the translation is in there and so on.
0:19:10.209 --> 0:19:15.545
So the more you increase it, the more likely
it is that the translation is in there, because word embeddings
0:19:15.545 --> 0:19:16.697
are quite noisy.
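A minimal sketch of that evaluation (src_emb and tgt_emb are hypothetical numpy arrays of mapped source and original target embeddings; test_dict maps a source word index to the set of acceptable target indices):

import numpy as np

def precision_at_k(src_emb, tgt_emb, test_dict, k=5):
    # Cosine similarity between every mapped source word and every target word.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = s @ t.T
    hits = 0
    for src_idx, gold in test_dict.items():
        topk = np.argsort(-sims[src_idx])[:k]         # the k nearest target neighbours
        hits += bool(set(topk.tolist()) & set(gold))  # count it if any gold translation shows up
    return hits / len(test_dict)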
0:19:19.239 --> 0:19:25.924
What's interesting is that people have used
dictionary to to learn word translation, but
0:19:25.924 --> 0:19:32.985
the way of doing this is much better than using
a dictionary, so somehow our assumption helps
0:19:32.985 --> 0:19:36.591
us to to build better than a supervised system.
0:19:39.099 --> 0:19:42.985
So as you see on the top you have precision
at one, five, ten.
0:19:42.985 --> 0:19:47.309
These are the typical numbers that you report
for word translation.
0:19:48.868 --> 0:19:55.996
But GANs are usually quite tricky to train,
and it does not converge on all language pairs,
0:19:55.996 --> 0:20:02.820
and this kind of goes back to our assumption
that the languages kind of have the same structure,
0:20:02.820 --> 0:20:03.351
right.
0:20:03.351 --> 0:20:07.142
But if you take a language like English and
some.
0:20:07.087 --> 0:20:12.203
other language that is very low-resource, so
it's quite different from English and so on.
0:20:12.203 --> 0:20:13.673
then it becomes difficult.
0:20:13.673 --> 0:20:18.789
So whenever our assumption fails,
these unsupervised techniques always do not
0:20:18.789 --> 0:20:21.199
converge or just give really bad scores.
0:20:22.162 --> 0:20:27.083
And so the fact is that the monolingual embeddings
for distant languages are too far apart.
0:20:27.083 --> 0:20:30.949
They do not share the same structure, and
so they do not converge.
0:20:32.452 --> 0:20:39.380
And so I just want to mention that there is
a better retrieval technique than the nearest
0:20:39.380 --> 0:20:41.458
neighbor, which is called CSLS.
0:20:42.882 --> 0:20:46.975
But it's a bit more advanced mathematically,
so I didn't want to go into it now.
0:20:46.975 --> 0:20:51.822
But if your interest is in some quite good
retrieval techniques, you can just look at this
0:20:51.822 --> 0:20:53.006
if you're interested.
0:20:55.615 --> 0:20:59.241
Okay, so this is about the the word translation.
0:20:59.241 --> 0:21:02.276
Does anybody have any questions so far?
0:21:06.246 --> 0:21:07.501
That was the word translation.
0:21:07.501 --> 0:21:12.580
It was a bit easier than a sentence right,
so you just assume that there's a mapping and
0:21:12.580 --> 0:21:14.577
then you try to learn the mapping.
0:21:14.577 --> 0:21:19.656
But now it's a bit more difficult because
you need to generate stuff also, which is quite
0:21:19.656 --> 0:21:20.797
a lot trickier.
0:21:22.622 --> 0:21:28.512
The task here is that we have our input as monolingual
data for both languages as before, but
0:21:28.512 --> 0:21:34.017
now what we want to do is instead of translating
word by word we want to do sentence translation.
0:21:37.377 --> 0:21:44.002
We have word2vec and so on to learn
word embeddings, but sentence embeddings were
0:21:44.002 --> 0:21:50.627
actually not that powerful, at least
when people tried to work on unsupervised
0:21:50.627 --> 0:21:51.445
MT before.
0:21:52.632 --> 0:21:54.008
Now they're a bit okay.
0:21:54.008 --> 0:21:59.054
I mean, as you've seen in the practical
where we used LASER, they were quite decent.
0:21:59.054 --> 0:22:03.011
But then it's also the case on which data
it's trained on and so on.
0:22:03.011 --> 0:22:03.240
So.
0:22:04.164 --> 0:22:09.666
Sentence embeddings are definitely much
harder to get than word embeddings, so this
0:22:09.666 --> 0:22:13.776
is a bit more complicated than the task that
you've seen before.
0:22:16.476 --> 0:22:18.701
Before we go into how U.
0:22:18.701 --> 0:22:18.968
N.
0:22:18.968 --> 0:22:19.235
M.
0:22:19.235 --> 0:22:19.502
T.
0:22:19.502 --> 0:22:24.485
Works, so this is your typical supervised
system right.
0:22:24.485 --> 0:22:29.558
So we have parallel data: source sentences, target
sentences.
0:22:29.558 --> 0:22:31.160
We have a source encoder.
0:22:31.471 --> 0:22:36.709
We have a target decoder and then we try to
minimize the cross-entropy loss on this parallel
0:22:36.709 --> 0:22:37.054
data.
0:22:37.157 --> 0:22:39.818
And this is how we train our typical system.
0:22:43.583 --> 0:22:49.506
But now we do not have any parallel data,
and so the intuition here is that if we can
0:22:49.506 --> 0:22:55.429
learn language independent representations
at the encoder outputs, then we can pass
0:22:55.429 --> 0:22:58.046
it along to the decoder that we want.
0:22:58.718 --> 0:23:03.809
It's going to get more clear in the future,
but I'm trying to give a bit more intuition
0:23:03.809 --> 0:23:07.164
before I'm going to show you all the training
objectives.
0:23:08.688 --> 0:23:15.252
So I assume that we have these different encoders
right, so it's not only two, you have a bunch
0:23:15.252 --> 0:23:21.405
of different source language encoders, a bunch
of different target language decoders, and
0:23:21.405 --> 0:23:26.054
also I assume that the encoder outputs are in the same
representation space.
0:23:26.706 --> 0:23:31.932
If you give a sentence in English and the
same sentence in German, the embeddings are
0:23:31.932 --> 0:23:38.313
quite the same, like multilingual sentence embeddings,
right, and so then what we can do is, depending
0:23:38.313 --> 0:23:42.202
on the language we want, pass it to the
appropriate decoder.
0:23:42.682 --> 0:23:50.141
And so the kind of goal here is to find out
a way to create language independent representations
0:23:50.141 --> 0:23:52.909
and then pass it to the decoder we want.
0:23:54.975 --> 0:23:59.714
Just keep in mind that you're trying to do
language independent for some reason, but it's
0:23:59.714 --> 0:24:02.294
going to be more clear once we see how it works.
0:24:05.585 --> 0:24:12.845
So in total we have three objectives that
we're going to try to train in our systems,
0:24:12.845 --> 0:24:16.981
so this is and all of them use monolingual
data.
0:24:17.697 --> 0:24:19.559
So there's no parallel data at all.
0:24:19.559 --> 0:24:24.469
The first one is denoising auto-encoding,
so it's more like you add noise to
0:24:24.469 --> 0:24:27.403
the sentence, and then reconstruct the original.
0:24:28.388 --> 0:24:34.276
Then we have the on-the-fly back translation,
so this is where you take a sentence, generate
0:24:34.276 --> 0:24:39.902
a translation, and then learn the reverse
direction, which I'm going to show in pictures
0:24:39.902 --> 0:24:45.725
later, and then we have an adversarial
training step to learn the language independent
0:24:45.725 --> 0:24:46.772
representation.
0:24:47.427 --> 0:24:52.148
So somehow, by filling in these three tasks,
or training on these three tasks,
0:24:52.148 --> 0:24:54.728
we somehow get an unsupervised M
0:24:54.728 --> 0:24:54.917
T.
0:24:56.856 --> 0:25:02.964
OK, so the first thing we're going to do is denoising
auto-encoding, right, so as I said we add
0:25:02.964 --> 0:25:06.295
noise to the sentence, so we take our sentence.
0:25:06.826 --> 0:25:09.709
And then there are different ways to add noise.
0:25:09.709 --> 0:25:11.511
You can shuffle words around.
0:25:11.511 --> 0:25:12.712
You can drop words.
0:25:12.712 --> 0:25:18.298
Do whatever you want to do as long as there's
enough information to reconstruct the original
0:25:18.298 --> 0:25:18.898
sentence.
0:25:19.719 --> 0:25:25.051
And then we assume that the noisy one and
the original one are parallel data and train
0:25:25.051 --> 0:25:26.687
similar to the supervised.
0:25:28.168 --> 0:25:30.354
So we have a source sentence.
0:25:30.354 --> 0:25:32.540
We have a noisy source right.
0:25:32.540 --> 0:25:37.130
So here what basically happened is that the
word got shuffled.
0:25:37.130 --> 0:25:39.097
One word is dropped right.
0:25:39.097 --> 0:25:41.356
So this is the noisy source.
0:25:41.356 --> 0:25:47.039
And then we treat the noisy source and the original
source as a sentence pair, basically.
0:25:49.009 --> 0:25:53.874
We train it by optimizing the cross-entropy
loss, similar to the supervised case.
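A minimal sketch of one possible noise function (local word shuffling plus word dropping; the exact noise model differs between papers):

import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    # Drop some words at random, but keep at least one token.
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # Shuffle words only within a limited window around their original position.
    keyed = [(i + random.uniform(0, shuffle_window), t) for i, t in enumerate(kept)]
    return [t for _, t in sorted(keyed)]

source = "the cat sat on the mat".split()
training_pair = (add_noise(source), source)  # (noisy source, original source)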
0:25:57.978 --> 0:26:03.211
Basically a picture to show what's happening:
we have the noisy source,
0:26:03.163 --> 0:26:09.210
the noisy target, and then we have the reconstructed
original source and original target, and since
0:26:09.210 --> 0:26:14.817
the languages are different we have our source
encoder, target encoder, source decoder and target decoder.
0:26:17.317 --> 0:26:20.202
And for this task we only need monolingual
data.
0:26:20.202 --> 0:26:25.267
We don't need any parallel data because it's
just taking a sentence and shuffling it and
0:26:25.267 --> 0:26:27.446
reconstructing the original one.
0:26:28.848 --> 0:26:31.058
And we have four different blocks.
0:26:31.058 --> 0:26:36.841
This is kind of very important to keep in
mind on how we change these connections later.
0:26:41.121 --> 0:26:49.093
Then this is more like the mathematical formulation
where you predict the source given the noisy source.
0:26:52.492 --> 0:26:55.090
So that was the denoising auto-encoding.
0:26:55.090 --> 0:26:58.403
The second step is on-the-fly back translation.
0:26:59.479 --> 0:27:06.386
So what we do is, we put our model in inference
mode, right, we take a source sentence,
0:27:06.386 --> 0:27:09.447
and we generate a translation.
0:27:09.829 --> 0:27:18.534
It might be completely wrong or maybe partially
correct or so on, but we assume that the model
0:27:18.534 --> 0:27:20.091
knows what it is doing and
0:27:20.680 --> 0:27:25.779
generates t-hat, right, and then what we do
is assume that t-hat, or not assume, but t-hat
0:27:25.779 --> 0:27:27.572
and S are a sentence pair, right.
0:27:27.572 --> 0:27:29.925
That's how we can learn the translation.
0:27:30.530 --> 0:27:38.824
So we train a supervised system on this sentence
pair, so we do inference and then train the reverse
0:27:38.824 --> 0:27:39.924
translation.
0:27:42.442 --> 0:27:49.495
To be a bit more concrete: we have a source
sentence, right, then we generate the translation,
0:27:49.495 --> 0:27:55.091
then we give the generated translation as an
input and try to predict the original source.
0:27:58.378 --> 0:28:03.500
This is how we would do in practice right,
so note, before, the source encoder was connected
0:28:03.500 --> 0:28:08.907
to the source decoder, but now we interchanged
connections, so the source encoder is connected
0:28:08.907 --> 0:28:10.216
to the target decoder.
0:28:10.216 --> 0:28:13.290
The target encoder is connected to the source
decoder.
0:28:13.974 --> 0:28:20.747
And given s we get t-hat and given t we get
s-hat, so this is the first time.
0:28:21.661 --> 0:28:24.022
On the second time step, what you're going
to do is reverse.
0:28:24.664 --> 0:28:32.625
So s-hat is here, t-hat is here, and given
s hat we are trying to predict t, and given
0:28:32.625 --> 0:28:34.503
t-hat we are trying to predict s.
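One round of this on-the-fly back translation could look like the sketch below (model_s2t, model_t2s, their generate methods, and train_step are hypothetical stand-ins for whatever seq2seq implementation is used):

import torch

def backtranslation_round(model_s2t, model_t2s, mono_src_batch, mono_tgt_batch, train_step):
    # Inference mode: generate synthetic translations with the current models.
    with torch.no_grad():
        t_hat = model_s2t.generate(mono_src_batch)   # source -> synthetic target
        s_hat = model_t2s.generate(mono_tgt_batch)   # target -> synthetic source

    # Treat (synthetic input, original monolingual sentence) as a parallel pair and
    # take a normal supervised cross-entropy step in the reverse direction.
    train_step(model_t2s, src=t_hat, tgt=mono_src_batch)   # predict s given t-hat
    train_step(model_s2t, src=s_hat, tgt=mono_tgt_batch)   # predict t given s-hat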
0:28:36.636 --> 0:28:39.386
Is this clear you have any questions on?
0:28:45.405 --> 0:28:50.823
A bit more mathematically, we try to minimize the
cross-entropy given t-hat and s-hat, so it's always the
0:28:50.823 --> 0:28:53.963
supervised NMT technique that we are trying
to do.
0:28:53.963 --> 0:28:59.689
But we're trying to create this synthetic
data that kind of helps to build an unsupervised
0:28:59.689 --> 0:29:00.181
system.
0:29:02.362 --> 0:29:08.611
Now also what maybe you can see here is that
if the source encoder and target encoder outputs
0:29:08.611 --> 0:29:14.718
are language independent, we can always switch
the connections and get the translations.
0:29:14.718 --> 0:29:21.252
That's why it was important to find a way
to generate language independent representations.
0:29:21.441 --> 0:29:26.476
And the way we try to force this language
independence is the gan step.
0:29:27.627 --> 0:29:34.851
So the third step kind of combines all of
them, and is where we try to use a GAN to make the
0:29:34.851 --> 0:29:37.959
encoder output language independent.
0:29:37.959 --> 0:29:42.831
So here it's the same picture but from a different
paper.
0:29:42.831 --> 0:29:43.167
So.
0:29:43.343 --> 0:29:48.888
We have X-source and X-target, which is monolingual
data.
0:29:48.888 --> 0:29:50.182
We add noise.
0:29:50.690 --> 0:29:54.736
Then we encode it using the source and the
target encoders right.
0:29:54.736 --> 0:29:58.292
Then we get the latent space Z source and
Z target right.
0:29:58.292 --> 0:30:03.503
Then we decode and try to reconstruct the
original one and this is the auto encoding
0:30:03.503 --> 0:30:08.469
loss which takes the X source which is the
original one and then the translated.
0:30:08.468 --> 0:30:09.834
Predicted output.
0:30:09.834 --> 0:30:16.740
So all of this is the auto-encoding step;
where the GAN comes in is in between the encoder
0:30:16.740 --> 0:30:24.102
outputs, and here we have a discriminator
which tries to predict which language the latent
0:30:24.102 --> 0:30:25.241
space is from.
0:30:26.466 --> 0:30:33.782
So given Z source it has to predict that the
representation is from a language source and
0:30:33.782 --> 0:30:39.961
given Z target it has to predict the representation
from a language target.
0:30:40.520 --> 0:30:45.135
And our encoders are kind of the generators
here, and then we have a separate
0:30:45.135 --> 0:30:49.803
network, the discriminator, which tries to predict
which language the latent spaces are from.
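A condensed sketch of that adversarial part (hypothetical sizes and names; z_src and z_tgt stand for pooled encoder outputs of a source and a target batch, and the two losses are used with separate optimizers so each one only updates its own side):

import torch
import torch.nn as nn

hidden = 512  # assumed encoder output size
lang_discriminator = nn.Sequential(nn.Linear(hidden, 256), nn.LeakyReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(z_src, z_tgt):
    # Discriminator loss: learn to tell which language a latent vector came from.
    d_loss = bce(lang_discriminator(z_src.detach()), torch.ones(len(z_src), 1)) + \
             bce(lang_discriminator(z_tgt.detach()), torch.zeros(len(z_tgt), 1))
    # Encoder loss: fool the discriminator, pushing the two latent spaces together.
    g_loss = bce(lang_discriminator(z_src), torch.zeros(len(z_src), 1)) + \
             bce(lang_discriminator(z_tgt), torch.ones(len(z_tgt), 1))
    return d_loss, g_loss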
0:30:53.393 --> 0:30:57.611
And then this one is when we combine the GAN
with the auto-encoding step.
0:30:57.611 --> 0:31:02.767
Then we had an on the fly back translation
step right, and so here what we're trying to
0:31:02.767 --> 0:31:03.001
do.
0:31:03.863 --> 0:31:07.260
Is the same, basically just exactly the same.
0:31:07.260 --> 0:31:12.946
But when we are doing the training, we are
adding the adversarial loss here, so:
0:31:13.893 --> 0:31:20.762
We take our X source, generate an intermediate
translation, so Y target and Y source, right?
0:31:20.762 --> 0:31:27.342
This is the previous time step, and then we
have to encode the new sentences and basically
0:31:27.342 --> 0:31:32.764
make them language independent or train to
make them language independent.
0:31:33.974 --> 0:31:43.502
And then the hope is that now if we do this
using monolingual data alone we can just switch
0:31:43.502 --> 0:31:47.852
connections and then get our translation.
0:31:47.852 --> 0:31:49.613
So that's the idea.
0:31:54.574 --> 0:32:03.749
And so as I said before, GANs are quite good
for vision right, so this is kind of like the
0:32:03.749 --> 0:32:11.312
CycleGAN approach that you might have seen
in any computer vision course.
0:32:11.911 --> 0:32:19.055
Somehow for text that didn't work out, at least not as
promising as for images, and so people,
0:32:19.055 --> 0:32:23.706
What they did is to enforce this language
independence.
0:32:25.045 --> 0:32:31.226
They try to use a shared encoder instead of
having these different encoders right, and
0:32:31.226 --> 0:32:37.835
so this is basically the same training objectives
as before, but what you're going to do now
0:32:37.835 --> 0:32:43.874
is learn cross-lingual embeddings and then use
the single encoder for both languages.
0:32:44.104 --> 0:32:49.795
And this kind also forces them to be in the
same space, and then you can choose whichever
0:32:49.795 --> 0:32:50.934
decoder you want.
0:32:52.552 --> 0:32:58.047
You can use GANs or you can just use a shared
encoder and try to build your unsupervised
0:32:58.047 --> 0:32:58.779
MT system.
0:33:08.488 --> 0:33:09.808
These are now the.
0:33:09.808 --> 0:33:15.991
The enhancements that you can do on top of
your unsupervised system: one, you can create
0:33:15.991 --> 0:33:16.686
a shared encoder.
0:33:18.098 --> 0:33:22.358
On top of the shared encoder you can add
your GAN loss or whatever, so there's a lot
0:33:22.358 --> 0:33:22.550
of.
0:33:24.164 --> 0:33:29.726
The other thing that is more relevant right
now is that you can create parallel data by
0:33:29.726 --> 0:33:35.478
word to word translation right because you
know how to do unsupervised word translation.
0:33:36.376 --> 0:33:40.548
First step is to create parallel data, assuming
that word translations are quite good.
0:33:41.361 --> 0:33:47.162
And then you train a supervised NMT
model on this most likely wrong parallel data,
0:33:47.162 --> 0:33:50.163
but somehow gives you a good starting point.
0:33:50.163 --> 0:33:56.098
So you build your supervised NMT system
on the word translation data, and then you
0:33:56.098 --> 0:33:59.966
initialize it before you're doing unsupervised
NMT.
0:34:00.260 --> 0:34:05.810
And the hope is that when you're doing the
back translation, it's a good starting
0:34:05.810 --> 0:34:11.234
point, but it's one technique that you can
use to improve your unsupervised MT system.
0:34:17.097 --> 0:34:25.879
In the previous case we had: The way we know
when to stop was to see convergence on the GAN
0:34:25.879 --> 0:34:26.485
training.
0:34:26.485 --> 0:34:28.849
Actually, all we want to do is wait until W
0:34:28.849 --> 0:34:32.062
converges, which is quite easy to know when
to stop.
0:34:32.062 --> 0:34:37.517
But in a realistic case, we don't have any
parallel data right, so there's no validation.
0:34:37.517 --> 0:34:42.002
Or I mean, we might have test data in the
end, but there's no validation.
0:34:43.703 --> 0:34:48.826
How will we tune our hyper parameters in this
case because it's not really there's nothing
0:34:48.826 --> 0:34:49.445
for us to?
0:34:50.130 --> 0:34:53.326
Or the gold data in a sense like so.
0:34:53.326 --> 0:35:01.187
How do you think we can evaluate such systems
or how can we tune hyper parameters in this?
0:35:11.711 --> 0:35:17.089
So what you're going to do is use the back
translation technique.
0:35:17.089 --> 0:35:24.340
It's like a common technique where you have
nothing okay that is to use back translation
0:35:24.340 --> 0:35:26.947
somehow and what you can do is.
0:35:26.947 --> 0:35:31.673
The main idea is validate on how good the
reconstruction is.
0:35:32.152 --> 0:35:37.534
So the idea is that if you have a good system
then the intermediate translation is quite
0:35:37.534 --> 0:35:39.287
good and going back is easy.
0:35:39.287 --> 0:35:44.669
But if it's just noise that you generate in
the forward step then it's really hard to go
0:35:44.669 --> 0:35:46.967
back, which is kind of the main idea.
0:35:48.148 --> 0:35:53.706
So the way it works is that we take a source
sentence, we generate a translation in target
0:35:53.706 --> 0:35:59.082
language, right, and then translate back the
generated sentence and compare it with the
0:35:59.082 --> 0:36:01.342
original one, and if they're closer.
0:36:01.841 --> 0:36:09.745
It means that we have a good system, and if
they are far apart, we don't; so this is kind of like an unsupervised
0:36:09.745 --> 0:36:10.334
validation criterion.
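A sketch of that unsupervised model-selection criterion (assuming sacrebleu is available; translate_s2t and translate_t2s are the current models' hypothetical inference functions):

import sacrebleu

def round_trip_score(mono_src_sentences, translate_s2t, translate_t2s):
    # Translate into the other language and back, using no parallel data at all.
    forward = [translate_s2t(s) for s in mono_src_sentences]
    back = [translate_t2s(t) for t in forward]
    # Compare the reconstruction with the originals; a higher score suggests a better system.
    return sacrebleu.corpus_bleu(back, [mono_src_sentences]).score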
0:36:17.397 --> 0:36:21.863
As far as the amount of data that you need.
0:36:23.083 --> 0:36:27.995
These were like the first initial results
on these systems:
0:36:27.995 --> 0:36:32.108
They wanted to do English and French and they
had fifteen million.
0:36:32.108 --> 0:36:38.003
There were fifteen million monolingual sentences,
so it's quite a lot and they were able to get
0:36:38.003 --> 0:36:40.581
thirty-two BLEU on these kinds of setups.
0:36:41.721 --> 0:36:47.580
But unsurprisingly if you have zero point
one million parallel sentences you get the same
0:36:47.580 --> 0:36:48.455
performance.
0:36:48.748 --> 0:36:50.357
So it's a lot of training.
0:36:50.357 --> 0:36:55.960
It's a lot of monolingual data, but monolingual
data is relatively easy to obtain; the thing is
0:36:55.960 --> 0:37:01.264
that the training is also quite longer than
the supervised system, but it's unsupervised
0:37:01.264 --> 0:37:04.303
so it's kind of the trade off that you are
making.
0:37:07.367 --> 0:37:13.101
The other thing to note is that it's English
and French, which is very close to our assumptions.
0:37:13.101 --> 0:37:18.237
Also, the monolingual data that they took
are kind of from similar domains and so on.
0:37:18.638 --> 0:37:27.564
So that's why they're able to build such a
good system, but you'll see later that it fails.
0:37:36.256 --> 0:37:46.888
Yes, and so I mean what people usually do
is first build a system right using whatever
0:37:46.888 --> 0:37:48.110
parallel data they have.
0:37:48.608 --> 0:37:55.864
Then they use monolingual data and do back
translation, so this has always been the standard
0:37:55.864 --> 0:38:04.478
way to to improve, and what people have seen
is that: You don't even need zero point one
0:38:04.478 --> 0:38:05.360
million right.
0:38:05.360 --> 0:38:10.706
You just need like ten thousand or so on and
then you do the monolingual back translation
0:38:10.706 --> 0:38:12.175
and you're still better.
0:38:12.175 --> 0:38:13.291
than unsupervised MT.
0:38:13.833 --> 0:38:19.534
The question is it's really worth trying to
to do this or maybe it's always better to find
0:38:19.534 --> 0:38:20.787
some parallel data.
0:38:20.787 --> 0:38:26.113
Or spend a bit of money on getting a little
parallel data and then use it to start and
0:38:26.113 --> 0:38:27.804
fine-tune to build your system.
0:38:27.804 --> 0:38:33.756
So it was kind of the understanding that bilingual
unsupervised systems are not really that useful.
0:38:50.710 --> 0:38:54.347
The thing is that with unlabeled data.
0:38:57.297 --> 0:39:05.488
there's not really a training signal, so when we are
starting basically what we want to do is first
0:39:05.488 --> 0:39:13.224
get a good translation system and then use
an unlabeled monolingual data to improve.
0:39:13.613 --> 0:39:15.015
But if you start from U.
0:39:15.015 --> 0:39:15.183
N.
0:39:15.183 --> 0:39:20.396
MT, our model might be really bad, like it
would be somewhere translating completely wrong.
0:39:20.760 --> 0:39:26.721
And then when you fine-tune on your unlabeled data,
it basically might be harming, or maybe the
0:39:26.721 --> 0:39:28.685
same as the supervised baseline.
0:39:28.685 --> 0:39:35.322
So the hope is, by fine-tuning on
labeled data first, to get a good initialization.
0:39:35.835 --> 0:39:38.404
And then use the unsupervised techniques to
get better.
0:39:38.818 --> 0:39:42.385
But if your starting point is really bad then
it's not going to help.
0:39:45.185 --> 0:39:47.324
Year so as we said before.
0:39:47.324 --> 0:39:52.475
This is kind of like the self supervised training
usually works.
0:39:52.475 --> 0:39:54.773
First we have parallel data.
0:39:56.456 --> 0:39:58.062
Source language is X.
0:39:58.062 --> 0:39:59.668
Target language is Y.
0:39:59.668 --> 0:40:06.018
In the end we want a system that does X to
Y, not Y to X, but first we want to train a
0:40:06.018 --> 0:40:10.543
backward model as it is Y to X, so target language
to source.
0:40:11.691 --> 0:40:17.353
Then we take our monolingual target
sentences, use our backward model to generate
0:40:17.353 --> 0:40:21.471
synthetic source, and then we join them with
our original data.
0:40:21.471 --> 0:40:27.583
So now we have this noisy input, but always
the gold output, which is kind of really important
0:40:27.583 --> 0:40:29.513
when you're doing back translation.
0:40:30.410 --> 0:40:36.992
And then you can concatenate these two datasets
and then you can train your X-to-Y translation
0:40:36.992 --> 0:40:44.159
system and then you can always do this in multiple
steps and usually three, four steps which kind
0:40:44.159 --> 0:40:48.401
of improves always and then finally get your
best system.
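As a plain-data sketch of that pipeline (train_model and translate are stand-ins for whatever NMT toolkit is used; parallel_src, parallel_tgt and mono_tgt are hypothetical lists of sentences):

# 1. Train the backward model on the available parallel data (target -> source).
backward = train_model(src=parallel_tgt, tgt=parallel_src)

# 2. Back-translate monolingual target sentences into synthetic source sentences.
synthetic_src = [translate(backward, t) for t in mono_tgt]

# 3. Join the synthetic pairs (noisy input, gold output) with the original parallel data.
train_src = parallel_src + synthetic_src
train_tgt = parallel_tgt + mono_tgt

# 4. Train the forward model (source -> target); repeat steps 1-4 for a few rounds if desired.
forward = train_model(src=train_src, tgt=train_tgt)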
0:40:49.029 --> 0:40:54.844
The point that I'm trying to make is that
although for unsupervised NMT the scores that I've
0:40:54.844 --> 0:41:00.659
shown before were quite good, you probably
can get the same performance with with fifty
0:41:00.659 --> 0:41:06.474
thousand sentences, and also the languages
that they've shown are quite similar and the
0:41:06.474 --> 0:41:08.654
texts were from the same domain.
0:41:14.354 --> 0:41:21.494
So, any questions on UNMT? Okay, yeah.
0:41:22.322 --> 0:41:28.982
So after this finding that back translation was already
better than unsupervised NMT, what people have tried
0:41:28.982 --> 0:41:34.660
is to use this idea of multilinguality as you
have seen in the previous lecture.
0:41:34.660 --> 0:41:41.040
The question is how can we do this knowledge
transfer from high resource language to lower
0:41:41.040 --> 0:41:42.232
source language?
0:41:44.484 --> 0:41:51.074
One way to promote this language independent
representations is to share the encoder and
0:41:51.074 --> 0:41:57.960
decoder for all languages, all their available
languages, and that kind of hopefully enables
0:41:57.960 --> 0:42:00.034
the the knowledge transfer.
0:42:03.323 --> 0:42:08.605
When we're doing multilinguality, the two
questions we need to to think of is how does
0:42:08.605 --> 0:42:09.698
the encoder know?
0:42:09.698 --> 0:42:14.495
How does the encoder or decoder know which language
we're dealing with?
0:42:15.635 --> 0:42:20.715
You already might have known the answer also,
and the second question is how can we promote
0:42:20.715 --> 0:42:24.139
the encoder to generate language independent
representations?
0:42:25.045 --> 0:42:32.580
By solving these two problems we can take
help of high resource languages to do unsupervised
0:42:32.580 --> 0:42:33.714
translations.
0:42:34.134 --> 0:42:40.997
A typical example would be: you want to do unsupervised MT
between English and Dutch, right, but you have
0:42:40.997 --> 0:42:47.369
parallel data between English and German, so
the question is can we use this parallel data
0:42:47.369 --> 0:42:51.501
to help build an unsupervised system between English
and Dutch?
0:42:56.296 --> 0:43:01.240
For the first one we try to take help of language
embeddings for tokens, and this kind of is
0:43:01.240 --> 0:43:05.758
a straightforward way to tell the
model which language it's dealing with.
0:43:06.466 --> 0:43:11.993
And for the second one we're going to look
at some pre training objectives which are also
0:43:11.993 --> 0:43:17.703
kind of unsupervised so we need monolingual
data mostly and this kind of helps us to promote
0:43:17.703 --> 0:43:20.221
the language independent representation.
0:43:23.463 --> 0:43:29.954
So the first pre-training objective that we'll
look at is XLM, which is quite famous, if
0:43:29.954 --> 0:43:32.168
you haven't heard of it yet.
0:43:32.552 --> 0:43:40.577
And: The way it works is that it's basically
a transformer encoder right, so it's like the
0:43:40.577 --> 0:43:42.391
just the encoder module.
0:43:42.391 --> 0:43:44.496
No, there's no decoder here.
0:43:44.884 --> 0:43:51.481
And what we're trying to do is mask two tokens
in a sequence and try to predict these mask
0:43:51.481 --> 0:43:52.061
tokens.
0:43:52.061 --> 0:43:55.467
So it is called masked language modeling.
0:43:55.996 --> 0:44:05.419
Typical language modeling that you see is
the standard language modeling, where you predict
0:44:05.419 --> 0:44:08.278
the next token in English.
0:44:08.278 --> 0:44:11.136
Then we have the position embeddings.
0:44:11.871 --> 0:44:18.774
Then we have the token embeddings, and then
here we have the mask token, and then we have
0:44:18.774 --> 0:44:22.378
the transformer encoder blocks to predict the masked token.
0:44:24.344 --> 0:44:30.552
We do this for all languages using the same
transformer encoder, and this kind of helps
0:44:30.552 --> 0:44:36.760
us to push the sentence embeddings, or
the output of the encoder, into a common space
0:44:36.760 --> 0:44:37.726
for multiple languages.
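A minimal sketch of the masking itself (the usual recipe masks around 15% of the tokens; refinements such as the 80/10/10 replacement split are left out):

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)   # the encoder sees the mask token here...
            labels.append(tok)    # ...and is trained to predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss on unmasked positions
    return inputs, labels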
0:44:42.782 --> 0:44:49.294
So first we train an MLM on both source, both
source and target language sides, and then
0:44:49.294 --> 0:44:54.928
we use it as a starting point for the encoder
and decoder of a UNMT system.
0:44:55.475 --> 0:45:03.175
So we take the monolingual data, build a masked
language model on both source and target languages,
0:45:03.175 --> 0:45:07.346
and then reuse it, or initialize that, in
the U.
0:45:07.346 --> 0:45:07.586
N.
0:45:07.586 --> 0:45:07.827
P.
0:45:07.827 --> 0:45:08.068
C.
0:45:09.009 --> 0:45:14.629
Here we look at two languages, but you can
also do it with one hundred languages once.
0:45:14.629 --> 0:45:20.185
So there are pretrained checkpoints that you can
use, which have seen quite
0:45:20.185 --> 0:45:21.671
a lot of data and use.
0:45:21.671 --> 0:45:24.449
It always has a starting point for your U.
0:45:24.449 --> 0:45:24.643
N.
0:45:24.643 --> 0:45:27.291
MT system, which in practice works well.
0:45:31.491 --> 0:45:36.759
This detail is that since this is an encoder
block only, and your U.
0:45:36.759 --> 0:45:36.988
N.
0:45:36.988 --> 0:45:37.217
M.
0:45:37.217 --> 0:45:37.446
T.
0:45:37.446 --> 0:45:40.347
system is encoder-decoder, right.
0:45:40.347 --> 0:45:47.524
So there's this cross attention that's missing,
but you can always initialize that part randomly.
0:45:47.524 --> 0:45:48.364
It's fine.
0:45:48.508 --> 0:45:53.077
Not everything is initialized, but it's still
decent.
0:45:56.056 --> 0:46:02.141
Then the other one we have is mBART,
and here you see that this kind of builds on
0:46:02.141 --> 0:46:07.597
the unsupervised training objective, which
is the denoising auto-encoding.
0:46:08.128 --> 0:46:14.337
So what they do is they say that we don't
even need to do the GAN or back translation;
0:46:14.337 --> 0:46:17.406
you can do it later, but for pre-training
0:46:17.406 --> 0:46:24.258
we just do denoising auto-encoding
on all different languages, and that also gives
0:46:24.258 --> 0:46:32.660
you good performance out of the box. So what
we basically have here is the transformer encoder-decoder.
0:46:34.334 --> 0:46:37.726
You are trying to generate a reconstructed
sequence.
0:46:37.726 --> 0:46:38.942
You need a decoder.
0:46:39.899 --> 0:46:42.022
So we gave an input sentence.
0:46:42.022 --> 0:46:48.180
We try to predict the masked tokens,
or we try to reconstruct the original
0:46:48.180 --> 0:46:52.496
sentence from the input sequence, which was
corrupted right.
0:46:52.496 --> 0:46:57.167
So this is the same denoising objective that
you have seen before.
0:46:58.418 --> 0:46:59.737
This is for English.
0:46:59.737 --> 0:47:04.195
I think this is for Japanese and then once
we do it for all languages.
0:47:04.195 --> 0:47:09.596
I mean they have these versions with twenty-five,
fifty languages or so on, and then you can
0:47:09.596 --> 0:47:11.794
fine-tune on your sentence- and document-level tasks.
0:47:13.073 --> 0:47:20.454
And so they did this for the supervised
techniques, but you can also use this as initializations
0:47:20.454 --> 0:47:25.058
for unsupervised and build up on that, which also
in practice works.
0:47:30.790 --> 0:47:36.136
Then we have these, so still now we kind of
didn't see the direct benefit from the
0:47:36.136 --> 0:47:38.840
high resource language right, so as I said.
0:47:38.878 --> 0:47:44.994
You can use English-German to help for English
to Dutch, and if you want English-Catalan, you
0:47:44.994 --> 0:47:46.751
can use English to French.
0:47:48.408 --> 0:47:55.866
One typical way to do this is to use pivot
translation, right, where you take the following:
0:47:55.795 --> 0:48:01.114
so here it's Finnish to Greek, so you translate,
say, from Finnish to English, then English
0:48:01.114 --> 0:48:03.743
to Greek, and then you get the translation.
0:48:04.344 --> 0:48:10.094
What's important is that you have these different
techniques and you can always think of which
0:48:10.094 --> 0:48:12.333
one to use given the data situation.
0:48:12.333 --> 0:48:18.023
So if it was like Finnish to Greek, maybe pivoting
is better because you might get good Finnish
0:48:18.023 --> 0:48:20.020
to English and English to Greek.
0:48:20.860 --> 0:48:23.255
Sometimes it also depends on the language
pair.
0:48:23.255 --> 0:48:27.595
There might be some information loss and so
on, so there are quite a few variables you
0:48:27.595 --> 0:48:30.039
need to think of and decide which system to
use.
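Pivoting is then just composing two supervised systems, roughly like this sketch (translate_fi_en and translate_en_el are hypothetical Finnish-English and English-Greek models):

def pivot_translate(sentence_fi, translate_fi_en, translate_en_el):
    # Finnish -> English -> Greek, with English as the pivot language.
    english = translate_fi_en(sentence_fi)
    return translate_en_el(english)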
0:48:32.752 --> 0:48:39.654
Then there's zero-shot, which you probably also
have seen in the multilingual lecture, and how
0:48:39.654 --> 0:48:45.505
if you can improve the language independence
then your zero shot gets better.
0:48:45.505 --> 0:48:52.107
So maybe if you use the multilingual models
and do zero shot directly, it's quite good.
0:48:53.093 --> 0:48:58.524
So we have zero-shot, pivoting, and then
we have the unsupervised translation where
0:48:58.524 --> 0:49:00.059
we can translate between them
0:49:00.600 --> 0:49:02.762
just when there is no parallel data.
0:49:06.686 --> 0:49:07.565
So, to sum up.
0:49:07.565 --> 0:49:11.959
Summing up what we have seen so far: we
basically have two monolingual files.
0:49:15.255 --> 0:49:16.754
What we are able to do, from looking at
0:49:16.836 --> 0:49:19.307
these two files alone, is create a dictionary.
0:49:19.699 --> 0:49:26.773
We can build an unsupervised MT system, not
always, but if the domains are similar and the
0:49:26.773 --> 0:49:28.895
languages are similar.
0:49:28.895 --> 0:49:36.283
But if they are distant languages, then the
unsupervised techniques usually don't work really
0:49:36.283 --> 0:49:36.755
well.
0:49:37.617 --> 0:49:40.297
What I would recommend
0:49:40.720 --> 0:49:46.338
would be is that if you can get some parallel
data from somewhere, or do bitext mining as
0:49:46.338 --> 0:49:51.892
we have seen in the LASER practical,
then you can use that to initialize your
0:49:51.892 --> 0:49:57.829
system and then train, let's say, a semi-supervised
MT system, and that would be better than
0:49:57.829 --> 0:50:00.063
just building an unsupervised one.
0:50:00.820 --> 0:50:06.546
With that, we are at the end.
0:50:07.207 --> 0:50:08.797
Any quick questions?
0:50:16.236 --> 0:50:25.070
[Inaudible audience question, presumably about large
language models and translation.]
0:50:25.070 --> 0:50:34.874
[Inaudible.]
0:50:34.874 --> 0:50:40.111
[Inaudible.]
0:50:56.916 --> 0:51:03.798
They do next-token prediction, and this somehow gives them
many abilities, not only translation but other
0:51:03.798 --> 0:51:08.062
than that there are quite a few things that
they can do.
0:51:10.590 --> 0:51:17.706
But for translation in itself, it usually doesn't
work really well compared to when you build a
0:51:17.706 --> 0:51:20.878
specific system for your case.
0:51:22.162 --> 0:51:27.924
I would guess that it's usually better than
the LLM, but you can always adapt the LLM to
0:51:27.924 --> 0:51:31.355
the task that you want, and then it could be
better.
0:51:32.152 --> 0:51:37.849
An LLM out of the box might not be the
best choice for your task.
0:51:37.849 --> 0:51:44.138
For me, I'm working on UI translation,
so it's more about translating software.
0:51:45.065 --> 0:51:50.451
And it's quite a niche domain as well,
and if you use the LLM out of the box, they're
0:51:50.451 --> 0:51:53.937
actually quite bad compared to the systems
that we built.
0:51:54.414 --> 0:51:56.736
But you can do these different techniques
like prompting.
0:51:57.437 --> 0:52:03.442
What people usually do is hard prompting,
where they give similar translation pairs in
0:52:03.442 --> 0:52:08.941
the prompt and then ask it to translate and
then that kind of improves the performance
0:52:08.941 --> 0:52:09.383
a lot.
0:52:09.383 --> 0:52:15.135
So there are different techniques that you
can use to adapt your LLMs and then it might
0:52:15.135 --> 0:52:16.399
be better than the.
0:52:16.376 --> 0:52:17.742
task-specific system.
0:52:18.418 --> 0:52:22.857
But if you're looking for niche things, I
don't think LLMs are that good.
0:52:22.857 --> 0:52:26.309
But if you want to do, let's say, unsupervised
translation:
0:52:26.309 --> 0:52:30.036
In this case you can never be sure that they
haven't seen the data.
0:52:30.036 --> 0:52:35.077
First of all, whether they have seen data in
that language or not, and in practice
0:52:35.077 --> 0:52:36.831
they probably did see the data.
0:52:40.360 --> 0:53:00.276
I feel like they have a pretty good understanding
of these languages, though.
0:53:04.784 --> 0:53:09.059
Depends on the language, but I'm pretty surprised
that it works on a low-resource language.
0:53:09.059 --> 0:53:11.121
I would expect it to work on German and.
0:53:11.972 --> 0:53:13.633
But if you take a low-resource language,
0:53:14.474 --> 0:53:20.973
I don't think it works, and also there are quite
a few papers where they've already showed that
0:53:20.973 --> 0:53:27.610
if you build a system yourself in the typical
way, it's quite a bit better than
0:53:27.610 --> 0:53:29.338
the LLM.
0:53:29.549 --> 0:53:34.883
But you can always do things with LLMs to
get better, but then I'm probably.
0:53:37.557 --> 0:53:39.539
Any more questions?
0:53:41.421 --> 0:53:47.461
So if not then we're going to end the lecture
here and then on Thursday we're going to have
0:53:47.461 --> 0:53:51.597
document-level MT, which is also given by me, so
thanks for coming.