WEBVTT | |
0:00:00.060 --> 0:00:07.762 | |
OK good, so today's lecture is on unsupervised machine translation. So what you have seen | |
0:00:07.762 --> 0:00:13.518 | |
so far is different techniques on supervised NMT, so you have parallel | |
0:00:13.593 --> 0:00:18.552 | |
data, right. So let's say in an English corpus you have one file, and then in German you have | |
0:00:18.552 --> 0:00:23.454 | |
another file, sentence by sentence, and then you try to build systems around | |
0:00:23.454 --> 0:00:23.679 | |
it. | |
0:00:24.324 --> 0:00:30.130 | |
But what's different about this lecture is | |
that you assume that you have no parallel data | |
0:00:30.130 --> 0:00:30.663 | |
at all. | |
0:00:30.663 --> 0:00:37.137 | |
You only have monolingual data and the question | |
is how can we build systems to translate between | |
0:00:37.137 --> 0:00:39.405 | |
these two languages right and so. | |
0:00:39.359 --> 0:00:44.658 | |
This is a bit more realistic scenario because | |
you have so many languages in the world. | |
0:00:44.658 --> 0:00:50.323 | |
You cannot expect to have parallel data between every pair of languages, but in typical | |
0:00:50.323 --> 0:00:55.623 | |
cases you have newspapers and so on, which | |
is like monolingual files, and the question | |
0:00:55.623 --> 0:00:57.998 | |
is can we build something around them? | |
0:00:59.980 --> 0:01:01.651 | |
They like said for today. | |
0:01:01.651 --> 0:01:05.893 | |
First we'll start with the introduction, so why do we need it? | |
0:01:05.893 --> 0:01:11.614 | |
and also some intuition on how these models work before going into the technical details. | |
0:01:11.614 --> 0:01:17.335 | |
I want to also go through an example, which kind of gives you more understanding of how | |
0:01:17.335 --> 0:01:19.263 | |
people came to model this. | |
0:01:20.820 --> 0:01:23.905 | |
Then the rest of the lecture is going to be | |
two parts. | |
0:01:23.905 --> 0:01:26.092 | |
One is we're going to translate words. | |
0:01:26.092 --> 0:01:30.018 | |
We're not going to care about how can we translate | |
the full sentence. | |
0:01:30.018 --> 0:01:35.177 | |
But given two monolingual files, how can we | |
get a dictionary basically, which is much easier | |
0:01:35.177 --> 0:01:37.813 | |
than generating something in a sentence level? | |
0:01:38.698 --> 0:01:43.533 | |
Then we're going to go into the harder case, which is unsupervised sentence-level translation. | |
0:01:44.204 --> 0:01:50.201 | |
And here what you'll see is what are the training | |
objectives which are quite different than the | |
0:01:50.201 --> 0:01:55.699 | |
word translation, and also where it doesn't work, because this is also quite important and | |
0:01:55.699 --> 0:02:01.384 | |
it's one of the reasons why unsupervised NMT is not used anymore, because the limitations kind | |
0:02:01.384 --> 0:02:03.946 | |
of take it away from realistic use cases. | |
0:02:04.504 --> 0:02:06.922 | |
And then that leads to the multilingual model. | |
0:02:06.922 --> 0:02:07.115 | |
So. | |
0:02:07.807 --> 0:02:12.915 | |
What people are trying to do to build systems for languages that do not have any parallel data | |
0:02:12.915 --> 0:02:17.693 | |
is use multilingual models and combine them with these training objectives to get better at | |
0:02:17.693 --> 0:02:17.913 | |
it. | |
0:02:17.913 --> 0:02:18.132 | |
So. | |
0:02:18.658 --> 0:02:24.396 | |
People are not trying to build bilingual systems | |
currently for unsupervised machine translation, | |
0:02:24.396 --> 0:02:30.011 | |
but I think it's good to know how they came to this point and what they're doing now. | |
0:02:30.090 --> 0:02:34.687 | |
You also see some patterns overlapping which | |
people are using. | |
0:02:36.916 --> 0:02:41.642 | |
So as I said before, and you've probably heard it multiple times now, we have seven | |
0:02:41.642 --> 0:02:43.076 | |
thousand languages around. | |
0:02:43.903 --> 0:02:49.460 | |
There can be different dialects and so on, so it's quite hard to distinguish what counts as a language, | |
0:02:49.460 --> 0:02:54.957 | |
but you can typically approximate it at seven thousand, and that leads to about twenty-five million | |
0:02:54.957 --> 0:02:59.318 | |
pairs, which is the obvious reason why we do | |
not have any parallel data. | |
0:03:00.560 --> 0:03:06.386 | |
So we want to build an MT system for all possible language pairs, and the question is | |
0:03:06.386 --> 0:03:07.172 | |
how can we? | |
0:03:08.648 --> 0:03:13.325 | |
The typical use case, but there are actually | |
quite few interesting use cases than what you | |
0:03:13.325 --> 0:03:14.045 | |
would expect. | |
0:03:14.614 --> 0:03:20.508 | |
One is the animal languages, which is the | |
real thing that's happening right now with. | |
0:03:20.780 --> 0:03:26.250 | |
The dog but with dolphins and so on, but I | |
couldn't find a picture that could show this, | |
0:03:26.250 --> 0:03:31.659 | |
but if you are interested in stuff like this | |
you can check out the website where people | |
0:03:31.659 --> 0:03:34.916 | |
are actually trying to understand how animals | |
speak. | |
0:03:35.135 --> 0:03:37.356 | |
It's Also a Bit More About. | |
0:03:37.297 --> 0:03:44.124 | |
Knowing what the animals want to say but may | |
not die dead but still people are trying to | |
0:03:44.124 --> 0:03:44.661 | |
do it. | |
0:03:45.825 --> 0:03:50.689 | |
More realistic thing that's happening is the | |
translation of programming languages. | |
0:03:51.371 --> 0:03:56.963 | |
And this is quite a good scenario for unsupervised MT: you have | |
0:03:56.963 --> 0:04:02.556 | |
a lot of code available online, right, in C++ and in Python, and the question is how can | |
0:04:02.556 --> 0:04:08.402 | |
we translate by just looking at the code alone, with no parallel functions and so on, and this | |
0:04:08.402 --> 0:04:10.754 | |
is actually quite good right now so. | |
0:04:12.032 --> 0:04:16.111 | |
See how these techniques were applied to do | |
the programming translation. | |
0:04:18.258 --> 0:04:23.882 | |
And then you can also think of language as something broader, so you can, for | |
0:04:23.882 --> 0:04:24.194 | |
example, | |
0:04:24.194 --> 0:04:29.631 | |
think of formal sentences in English as one language and informal sentences in English | |
0:04:29.631 --> 0:04:35.442 | |
as another language, and then learn the translation between them, and then it kind of becomes | |
0:04:35.442 --> 0:04:37.379 | |
a style transfer problem. | |
0:04:38.358 --> 0:04:43.042 | |
Although it's translation, you can consider | |
different characteristics of a language and | |
0:04:43.042 --> 0:04:46.875 | |
then separate them as two different languages | |
and then try to map them. | |
0:04:46.875 --> 0:04:52.038 | |
So it's not only about languages; you can also do quite cool things by using unsupervised | |
0:04:52.038 --> 0:04:54.327 | |
techniques, which are quite possible also. | |
0:04:56.256 --> 0:04:56.990 | |
I am so. | |
0:04:56.990 --> 0:05:04.335 | |
This is kind of TV modeling for many of the | |
use cases that we have for ours, ours and MD. | |
0:05:04.335 --> 0:05:11.842 | |
But before we go into the modeling of these | |
systems, what I want you to do is look at this | |
0:05:11.842 --> 0:05:12.413 | |
dummy example. | |
0:05:13.813 --> 0:05:19.720 | |
We have text in language one and text in language two, right, and nobody knows what these languages | |
0:05:19.720 --> 0:05:20.082 | |
mean. | |
0:05:20.082 --> 0:05:23.758 | |
They completely are made up right, and the | |
question is also. | |
0:05:23.758 --> 0:05:29.364 | |
They're not parallel, so the first line here and the first line there are not aligned; they're | |
0:05:29.364 --> 0:05:30.810 | |
just monolingual files. | |
0:05:32.052 --> 0:05:38.281 | |
And now think about how can you translate | |
the word M1 from language one to language two, | |
0:05:38.281 --> 0:05:41.851 | |
and this kind of you see how we try to model | |
this. | |
0:05:42.983 --> 0:05:47.966 | |
Would take your time and then think of how | |
can you translate more into language two? | |
0:06:41.321 --> 0:06:45.589 | |
About the model, if you ask somebody who doesn't | |
know anything about machine translation right, | |
0:06:45.589 --> 0:06:47.411 | |
and then you ask them to translate more. | |
0:07:01.201 --> 0:07:10.027 | |
But it's also not quite easy if you think | |
of the way that I made this example is relatively | |
0:07:10.027 --> 0:07:10.986 | |
easy, so. | |
0:07:11.431 --> 0:07:17.963 | |
Basically, the first two sentences are these | |
two: A, B, C is E, and G cured up the U, V | |
0:07:17.963 --> 0:07:21.841 | |
is L, A, A, C, S, and S, on and this is used | |
towards the German. | |
0:07:22.662 --> 0:07:25.241 | |
And then when you join these two words, it's. | |
0:07:25.205 --> 0:07:32.445 | |
English German the third line and the last | |
line, and then the fourth line is the first | |
0:07:32.445 --> 0:07:38.521 | |
line, so German language, English, and then | |
speak English, speak German. | |
0:07:38.578 --> 0:07:44.393 | |
So this is how I made up the example, and the intuition here is that you assume | |
0:07:44.393 --> 0:07:50.535 | |
that the languages have a fundamental structure | |
right and it's the same across all languages. | |
0:07:51.211 --> 0:07:57.727 | |
It doesn't matter what language you are thinking of: the words behave in the same way, they join | |
0:07:57.727 --> 0:07:59.829 | |
together in the same way, and | |
0:07:59.779 --> 0:08:06.065 | |
things are structured the same way. This is not a realistic assumption for sure, but | |
0:08:06.065 --> 0:08:12.636 | |
it's actually a decent one to make and if you | |
can think of this like if you can assume this | |
0:08:12.636 --> 0:08:16.207 | |
then we can model systems in an unsupervised | |
way. | |
0:08:16.396 --> 0:08:22.743 | |
So this is the intuition that I want to give, | |
and you can see that whenever assumptions fail, | |
0:08:22.743 --> 0:08:23.958 | |
the systems fail. | |
0:08:23.958 --> 0:08:29.832 | |
So in practice, whenever we go far away from these assumptions, the systems tend | |
0:08:29.832 --> 0:08:30.778 | |
to fail. | |
0:08:33.753 --> 0:08:39.711 | |
So the example that I gave was actually a perfect mapping, right, which never really happens. | |
0:08:39.711 --> 0:08:45.353 | |
They have the same number of words, same sentence | |
structure, perfect mapping, and so on. | |
0:08:45.353 --> 0:08:50.994 | |
This doesn't happen in practice, but let's assume that it does and try to see how we can model it. | |
0:08:53.493 --> 0:09:01.061 | |
Okay, now let's get a bit more formal: what we want to do is unsupervised word translation. | |
0:09:01.901 --> 0:09:08.773 | |
Here the task is that we have input data as | |
monolingual data, so a bunch of sentences in | |
0:09:08.773 --> 0:09:15.876 | |
one file and a bunch of sentences in another file in two different languages, and the question | |
0:09:15.876 --> 0:09:18.655 | |
is how can we get a bilingual dictionary? | |
0:09:19.559 --> 0:09:25.134 | |
So if you look at the picture you see that | |
it's just kind of projected down into two dimension | |
0:09:25.134 --> 0:09:30.358 | |
planes, but it's basically when you map them | |
into a plot you see that the words that are | |
0:09:30.358 --> 0:09:35.874 | |
parallel are closer together, and the question | |
is how can we do it just looking at two files? | |
0:09:36.816 --> 0:09:42.502 | |
And you can say that what we want to basically | |
do is create a dictionary in the end given | |
0:09:42.502 --> 0:09:43.260 | |
two files. | |
0:09:43.260 --> 0:09:45.408 | |
So this is the task that we want. | |
0:09:46.606 --> 0:09:52.262 | |
And the first step of how we do this is to learn word vectors, and the technique is whatever | |
0:09:52.262 --> 0:09:56.257 | |
technique you have seen before: word2vec, GloVe, or so on. | |
0:09:56.856 --> 0:10:00.699 | |
So you take a monolingual data and try to | |
learn word embeddings. | |
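A minimal sketch of this first step, assuming the two monolingual files from the task setup; gensim's Word2Vec is one of the tools just mentioned, and the file names are placeholders:

from gensim.models import Word2Vec

def train_embeddings(path, dim=300):
    # one tokenized sentence per line in the monolingual file
    sentences = [line.split() for line in open(path, encoding="utf-8")]
    model = Word2Vec(sentences, vector_size=dim, window=5, min_count=5, sg=1)
    return model.wv  # maps each word to a dim-dimensional vector

src_emb = train_embeddings("mono.lang1.txt")  # language one, e.g. English
tgt_emb = train_embeddings("mono.lang2.txt")  # language two, e.g. German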
0:10:02.002 --> 0:10:07.675 | |
Then you plot them into a graph, and then | |
typically what you would see is that they're | |
0:10:07.675 --> 0:10:08.979 | |
not aligned at all. | |
0:10:08.979 --> 0:10:14.717 | |
One word space is somewhere, and one word | |
space is somewhere else, and this is what you | |
0:10:14.717 --> 0:10:18.043 | |
would typically expect to see in the image. | |
0:10:19.659 --> 0:10:23.525 | |
Now our assumption was that both languages just have the same | |
0:10:23.563 --> 0:10:28.520 | |
structure, and so we can use this information | |
to learn the mapping between these two spaces. | |
0:10:30.130 --> 0:10:37.085 | |
So before we get to how we do it, I think this is quite famous already, and everybody knows it a bit: | |
0:10:37.085 --> 0:10:41.824 | |
word embeddings capture semantic relations, right. | |
0:10:41.824 --> 0:10:48.244 | |
So the distance between man and woman is approximately the same as between king and queen. | |
0:10:48.888 --> 0:10:54.620 | |
It's the same for verb tenses, country and capital, and so on, so there are some relationships | |
0:10:54.620 --> 0:11:00.286 | |
happening in the word embedding space, which is quite clear for at least one language. | |
0:11:03.143 --> 0:11:08.082 | |
Now if you think of this, let's say, of the English word embeddings | |
0:11:08.082 --> 0:11:14.769 | |
and of the German word embeddings: the way king, queen, man, woman are organized is the same | |
0:11:14.769 --> 0:11:17.733 | |
as for the German translations of these words. | |
0:11:17.998 --> 0:11:23.336 | |
This is the main idea is that although they | |
are somewhere else, the relationship is the | |
0:11:23.336 --> 0:11:28.008 | |
same between both languages, and we can use this to learn the mapping. | |
0:11:31.811 --> 0:11:35.716 | |
It's not only for these four words; it happens for all the words in the language, | |
0:11:35.716 --> 0:11:37.783 | |
and so we can use this to learn the mapping. | |
0:11:39.179 --> 0:11:43.828 | |
The main idea is that both embedding spaces have a similar shape. | |
0:11:43.828 --> 0:11:48.477 | |
It's only that they're just not aligned and | |
so you go to the here. | |
0:11:48.477 --> 0:11:50.906 | |
They kind of have a similar shape. | |
0:11:50.906 --> 0:11:57.221 | |
They're just in some different spaces and | |
what you need to do is to map them into a common | |
0:11:57.221 --> 0:11:57.707 | |
space. | |
0:12:06.086 --> 0:12:12.393 | |
We want a W such that if we multiply W with X, the two spaces become aligned. | |
0:12:35.335 --> 0:12:41.097 | |
That's true, but there are also many works | |
that have the relationship right, and we hope | |
0:12:41.097 --> 0:12:43.817 | |
that this is enough to learn the mapping. | |
0:12:43.817 --> 0:12:49.838 | |
So there's always going to be a bit of noise, | |
as in how when we align them they're not going | |
0:12:49.838 --> 0:12:51.716 | |
to be exactly the same, but. | |
0:12:51.671 --> 0:12:57.293 | |
What you can expect is that there are these | |
main works that allow us to learn the mapping, | |
0:12:57.293 --> 0:13:02.791 | |
so it's not going to be perfect, but it's an | |
approximation that we make to see how it | |
0:13:02.791 --> 0:13:04.521 | |
works, and it holds up in practice. | |
0:13:04.521 --> 0:13:10.081 | |
Also, the fact that some words do not have any such relationship does not affect it that | |
0:13:10.081 --> 0:13:10.452 | |
much. | |
0:13:10.550 --> 0:13:15.429 | |
A lot of words usually have, so it kind of | |
works out in practice. | |
0:13:22.242 --> 0:13:34.248 | |
I have not heard about it, but if you want | |
to say something about it, I would be interested, | |
0:13:34.248 --> 0:13:37.346 | |
but we can do it later. | |
0:13:41.281 --> 0:13:44.133 | |
Usual case: This is supervised. | |
0:13:45.205 --> 0:13:49.484 | |
The first way is to do supervised word translation, where we have a dictionary, right, and that we | |
0:13:49.484 --> 0:13:53.764 | |
can use that to learn the mapping, but in our | |
case we assume that we have nothing right so | |
0:13:53.764 --> 0:13:55.222 | |
we only have monolingual data. | |
0:13:56.136 --> 0:14:03.126 | |
Then we need unsupervised learning to figure out W, and we're going to use GANs to find | |
0:14:03.126 --> 0:14:06.122 | |
W, and it's quite a nice way to do it. | |
0:14:08.248 --> 0:14:15.393 | |
So just before I go into how we use it for our use case, I'm going to go briefly over GANs, right, | |
0:14:15.393 --> 0:14:19.940 | |
so we have two components: generator and discriminator. | |
0:14:21.441 --> 0:14:27.052 | |
The generator tries to generate something, obviously, and the discriminator tries to see if it's | |
0:14:27.052 --> 0:14:30.752 | |
real data or something that is generated by the generator. | |
0:14:31.371 --> 0:14:37.038 | |
And there's this two-player game where one tries to fool and the other tries not to get | |
0:14:37.038 --> 0:14:41.862 | |
fooled, and they try to train these two components and try to learn W. | |
0:14:43.483 --> 0:14:53.163 | |
Okay, so let's say we have two languages, | |
X and Y right, so the X language has N words | |
0:14:53.163 --> 0:14:56.167 | |
with numbering dimensions. | |
0:14:56.496 --> 0:14:59.498 | |
So what I'm reading is matrix is peak or something. | |
0:14:59.498 --> 0:15:02.211 | |
Then we have the target language Y with M words, | |
0:15:02.211 --> 0:15:06.944 | |
I'm also the same amount of things I mentioned | |
and then we have a matrix peak or. | |
0:15:07.927 --> 0:15:13.784 | |
Basically what you're going to do is use word2vec and learn the word embeddings. | |
0:15:14.995 --> 0:15:23.134 | |
Now we have these X embeddings and Y embeddings, and what we want to learn is W, such that WX and | |
0:15:23.134 --> 0:15:24.336 | |
Y are aligned. | |
0:15:29.209 --> 0:15:35.489 | |
With GANs you have two steps, one is a discriminator step and one is the mapping step, and the | |
0:15:35.489 --> 0:15:41.135 | |
discriminator step is to see if the embeddings are mapped source embeddings or original target embeddings. | |
0:15:41.135 --> 0:15:44.688 | |
It's going to be much clearer when I go to the figure. | |
0:15:46.306 --> 0:15:50.041 | |
So we have monolingual documents in two different languages. | |
0:15:50.041 --> 0:15:54.522 | |
From here we get our source language embeddings and target language embeddings, right. | |
0:15:54.522 --> 0:15:57.855 | |
Then we randomly initialize the transformation matrix W. | |
0:16:00.040 --> 0:16:06.377 | |
Then we have the discriminator which tries | |
to see if it's WX or Y, so it needs to know | |
0:16:06.377 --> 0:16:13.735 | |
that this is a mapped one and this is the original | |
language, and so if you look at the loss function | |
0:16:13.735 --> 0:16:20.072 | |
here, it's basically that source is one given | |
WX, so this is from the source language. | |
0:16:23.543 --> 0:16:27.339 | |
Which means it's the target language em yeah. | |
0:16:27.339 --> 0:16:34.436 | |
It's just like my figure is not that great, | |
but you can assume that they are totally. | |
0:16:40.260 --> 0:16:43.027 | |
So this is kind of the loss function. | |
0:16:43.027 --> 0:16:46.386 | |
We have N source words, M target words, and | |
so on. | |
0:16:46.386 --> 0:16:52.381 | |
So that's why you have one over N and one over M, and the discriminator just has to see if they're | |
0:16:52.381 --> 0:16:55.741 | |
mapped or they're from the original target embeddings. | |
0:16:57.317 --> 0:17:04.024 | |
And then we have the mapping step where we train W to fool the discriminator. | |
0:17:04.564 --> 0:17:10.243 | |
So here it's the same way, but what you're | |
going to just do is inverse the loss function. | |
0:17:10.243 --> 0:17:15.859 | |
So now we freeze the discriminator; it's important to note that in the previous step | |
0:17:15.859 --> 0:17:20.843 | |
we froze the transformation matrix, and here we freeze the discriminator. | |
0:17:22.482 --> 0:17:28.912 | |
And now the goal is to fool the discriminator, right, so it should predict that the source is zero | |
0:17:28.912 --> 0:17:35.271 | |
given the mapped embedding, and the source is one given the target embedding, which is wrong, | |
0:17:35.271 --> 0:17:37.787 | |
which is why we're training the W. | |
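A simplified PyTorch sketch of these two alternating steps (discriminator step with W fixed, mapping step with the discriminator fixed). Dimensions, optimizers and the label convention (1 = mapped source WX, 0 = real target Y) are illustrative assumptions:

import torch, torch.nn as nn

d = 300
W = nn.Linear(d, d, bias=False)  # randomly initialized mapping
disc = nn.Sequential(nn.Linear(d, 512), nn.LeakyReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)

def train_step(x_batch, y_batch):
    # x_batch: (B, d) source embeddings, y_batch: (B, d) target embeddings
    mapped = W(x_batch)
    # 1) discriminator step: learn to tell mapped source (label 1) from real target (label 0)
    d_loss = bce(disc(mapped.detach()), torch.ones(len(x_batch), 1)) + \
             bce(disc(y_batch), torch.zeros(len(y_batch), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) mapping step: discriminator frozen, train W with flipped labels to fool it
    w_loss = bce(disc(mapped), torch.zeros(len(x_batch), 1))
    opt_w.zero_grad(); w_loss.backward(); opt_w.step()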
0:17:39.439 --> 0:17:46.261 | |
Any questions on this okay so then how do | |
we know when to stop? | |
0:17:46.261 --> 0:17:55.854 | |
We just train until we reach convergence, right, and then we have our W, hopefully trained, and | |
0:17:55.854 --> 0:17:59.265 | |
can map them into an aligned space. | |
0:18:02.222 --> 0:18:07.097 | |
The question is how can we evaluate this mapping? | |
0:18:07.097 --> 0:18:13.923 | |
Does anybody know what we can use to mapping | |
or evaluate the mapping? | |
0:18:13.923 --> 0:18:15.873 | |
How good is the word translation? | |
0:18:28.969 --> 0:18:33.538 | |
We use as I said we use a dictionary, at least | |
in the end. | |
0:18:33.538 --> 0:18:40.199 | |
We need a dictionary to evaluate, so this is our only gold data, and we aren't using it at | |
0:18:40.199 --> 0:18:42.600 | |
all in the training data. | |
0:18:43.223 --> 0:18:49.681 | |
One way is to check the precision against our dictionary, just that: | |
0:18:50.650 --> 0:18:52.813 | |
take the first nearest neighbor and see if the translation is there. | |
0:18:53.573 --> 0:18:56.855 | |
But this is quite strict because there's a lot of noise in the embedding space, right. | |
0:18:57.657 --> 0:19:03.114 | |
Not always your first neighbor is going to | |
be the translation, so what people also report | |
0:19:03.114 --> 0:19:05.055 | |
is precision at five and so on. | |
0:19:05.055 --> 0:19:10.209 | |
So you take the five nearest neighbors and see if the translation is in there and so on. | |
0:19:10.209 --> 0:19:15.545 | |
So the more you increase it, the more likely it is that the translation is in there, because word embeddings | |
0:19:15.545 --> 0:19:16.697 | |
are quite noisy. | |
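A small sketch of this evaluation, assuming we hold out a gold dictionary of (source word, target word) pairs; precision@k counts how often the correct translation appears among the k nearest neighbours of the mapped source word:

import numpy as np

def precision_at_k(W, src_emb, tgt_emb, test_pairs, k=5):
    tgt_words = list(tgt_emb.keys())
    T = np.stack([tgt_emb[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    hits = 0
    for s, t in test_pairs:
        x = W @ src_emb[s]
        x = x / np.linalg.norm(x)
        scores = T @ x                                 # cosine similarity to all target words
        topk = [tgt_words[i] for i in np.argsort(-scores)[:k]]
        hits += t in topk
    return hits / len(test_pairs)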
0:19:19.239 --> 0:19:25.924 | |
What's interesting is that people have used a dictionary to learn word translation, but | |
0:19:25.924 --> 0:19:32.985 | |
this way of doing it is much better than using a dictionary, so somehow our assumption helps | |
0:19:32.985 --> 0:19:36.591 | |
us to build something better than a supervised system. | |
0:19:39.099 --> 0:19:42.985 | |
So as you see on the top you have precision at one, five, ten. | |
0:19:42.985 --> 0:19:47.309 | |
These are the typical numbers that you report for word translation. | |
0:19:48.868 --> 0:19:55.996 | |
But GANs are usually quite tricky to train, and it does not converge on all language pairs, | |
0:19:55.996 --> 0:20:02.820 | |
and this kind of goes back to the assumption that the languages kind of have the same structure, | |
0:20:02.820 --> 0:20:03.351 | |
right. | |
0:20:03.351 --> 0:20:07.142 | |
But if you take a language like English and | |
some | |
0:20:07.087 --> 0:20:12.203 | |
other language that is very low-resource, so it's quite different from English and so on. | |
0:20:12.203 --> 0:20:13.673 | |
Then I've one language,. | |
0:20:13.673 --> 0:20:18.789 | |
So whenever our assumption fails, these unsupervised techniques always do not | |
0:20:18.789 --> 0:20:21.199 | |
converge or just give really bad scores. | |
0:20:22.162 --> 0:20:27.083 | |
And so the fact is that the monolingual embeddings for distant languages are too far apart. | |
0:20:27.083 --> 0:20:30.949 | |
They do not share the same structure, and so the training does not converge. | |
0:20:32.452 --> 0:20:39.380 | |
And so I just want to mention that there is | |
a better retrieval technique than the nearest | |
0:20:39.380 --> 0:20:41.458 | |
neighbor, which is called. | |
0:20:42.882 --> 0:20:46.975 | |
But it's a bit more mathematically involved, so I didn't want to go into it now. | |
0:20:46.975 --> 0:20:51.822 | |
But if you are interested in quite good retrieval techniques, you can just look at these | |
0:20:51.822 --> 0:20:53.006 | |
if you're interested. | |
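The retrieval technique left unnamed here is presumably CSLS (cross-domain similarity local scaling), which penalizes "hub" words that are near everything; a sketch under that assumption, with L2-normalized embedding matrices:

import numpy as np

def csls(mapped_src, tgt, k=10):
    # mapped_src: (n, d) W-mapped source embeddings, tgt: (m, d) target embeddings
    sims = mapped_src @ tgt.T                              # cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)     # avg sim of each source to its k NN targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)     # avg sim of each target to its k NN sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]      # translation of word i: argmax over row i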
0:20:55.615 --> 0:20:59.241 | |
Okay, so this is about the word translation. | |
0:20:59.241 --> 0:21:02.276 | |
Does anybody have any questions so far? | |
0:21:06.246 --> 0:21:07.501 | |
Was the worst answer? | |
0:21:07.501 --> 0:21:12.580 | |
It was a bit easier than a sentence right, | |
so you just assume that there's a mapping and | |
0:21:12.580 --> 0:21:14.577 | |
then you try to learn the mapping. | |
0:21:14.577 --> 0:21:19.656 | |
But now it's a bit more difficult because you need to generate text also, which is | |
0:21:19.656 --> 0:21:20.797 | |
much more tricky. | |
0:21:22.622 --> 0:21:28.512 | |
The task here is that we have our input as monolingual data for both languages as before, but | |
0:21:28.512 --> 0:21:34.017 | |
now what we want to do is, instead of translating word by word, we want to do sentences. | |
0:21:37.377 --> 0:21:44.002 | |
We have word2vec and so on to learn word embeddings, but sentence embeddings were | |
0:21:44.002 --> 0:21:50.627 | |
actually not that good, at least back when people tried to work on unsupervised | |
0:21:50.627 --> 0:21:51.445 | |
MT before. | |
0:21:52.632 --> 0:21:54.008 | |
Now they're a bit okay. | |
0:21:54.008 --> 0:21:59.054 | |
I mean, as you've seen in the practice on | |
where we used places, they were quite decent. | |
0:21:59.054 --> 0:22:03.011 | |
But then it's also the case on which data | |
it's trained on and so on. | |
0:22:03.011 --> 0:22:03.240 | |
So. | |
0:22:04.164 --> 0:22:09.666 | |
Sentence embeddings are definitely much harder to get than word embeddings, so this | |
0:22:09.666 --> 0:22:13.776 | |
is a bit more complicated than the task that | |
you've seen before. | |
0:22:16.476 --> 0:22:18.701 | |
Before we go into how U. | |
0:22:18.701 --> 0:22:18.968 | |
N. | |
0:22:18.968 --> 0:22:19.235 | |
M. | |
0:22:19.235 --> 0:22:19.502 | |
T. | |
0:22:19.502 --> 0:22:24.485 | |
Works, so this is your typical supervised | |
system right. | |
0:22:24.485 --> 0:22:29.558 | |
So we have parallel data: source sentences and target sentences. | |
0:22:29.558 --> 0:22:31.160 | |
We have a source encoder. | |
0:22:31.471 --> 0:22:36.709 | |
We have a target decoder and then we try to minimize the cross entropy loss on this parallel | |
0:22:36.709 --> 0:22:37.054 | |
data. | |
0:22:37.157 --> 0:22:39.818 | |
And this is how we train our typical system. | |
0:22:43.583 --> 0:22:49.506 | |
But now we do not have any parallel data, | |
and so the intuition here is that if we can | |
0:22:49.506 --> 0:22:55.429 | |
learn language independent representations | |
at the encoder outputs, then we can pass | |
0:22:55.429 --> 0:22:58.046 | |
it along to the decoder that we want. | |
0:22:58.718 --> 0:23:03.809 | |
It's going to get more clear in the future, | |
but I'm trying to give a bit more intuition | |
0:23:03.809 --> 0:23:07.164 | |
before I'm going to show you all the training | |
objectives. | |
0:23:08.688 --> 0:23:15.252 | |
So I assume that we have these different encoders | |
right, so it's not only two, you have a bunch | |
0:23:15.252 --> 0:23:21.405 | |
of different source language encoders, a bunch | |
of different target language decoders, and | |
0:23:21.405 --> 0:23:26.054 | |
also I assume that the encoder outputs are in the same | |
representation space. | |
0:23:26.706 --> 0:23:31.932 | |
If you give a sentence in English and the | |
same sentence in German, the embeddings are | |
0:23:31.932 --> 0:23:38.313 | |
quite the same, like the multilingual embeddings, right, and so then what we can do is, depending | |
0:23:38.313 --> 0:23:42.202 | |
on the language we want, pass it to the appropriate decoder. | |
0:23:42.682 --> 0:23:50.141 | |
And so the kind of goal here is to find out | |
a way to create language independent representations | |
0:23:50.141 --> 0:23:52.909 | |
and then pass it to the decoder we want. | |
0:23:54.975 --> 0:23:59.714 | |
Just keep in mind that you're trying to do | |
language independent for some reason, but it's | |
0:23:59.714 --> 0:24:02.294 | |
going to be more clear once we see how it works. | |
0:24:05.585 --> 0:24:12.845 | |
So in total we have three objectives that | |
we're going to try to train in our systems, | |
0:24:12.845 --> 0:24:16.981 | |
so these are the objectives, and all of them use monolingual | |
data. | |
0:24:17.697 --> 0:24:19.559 | |
So there's no parallel data at all. | |
0:24:19.559 --> 0:24:24.469 | |
The first one is denoising auto-encoding, so it's more like you add noise to | |
0:24:24.469 --> 0:24:27.403 | |
the sentence, and then reconstruct the original. | |
0:24:28.388 --> 0:24:34.276 | |
Then we have the on-the-fly back-translation, so this is where you take a sentence, generate | |
0:24:34.276 --> 0:24:39.902 | |
a translation, and then learn from that, which I'm going to show in pictures | |
0:24:39.902 --> 0:24:45.725 | |
later, and then we have an adversarial training objective to learn the language independent | |
0:24:45.725 --> 0:24:46.772 | |
representation. | |
0:24:47.427 --> 0:24:52.148 | |
So somehow, if we train on these three tasks, | |
0:24:52.148 --> 0:24:54.728 | |
we somehow get an unsupervised | |
0:24:54.728 --> 0:24:54.917 | |
MT system. | |
0:24:56.856 --> 0:25:02.964 | |
OK, so the first thing we're going to do is denoising auto-encoding, right, so as I said we add | |
0:25:02.964 --> 0:25:06.295 | |
noise to the sentence, so we take our sentence. | |
0:25:06.826 --> 0:25:09.709 | |
And then there are different ways to add noise. | |
0:25:09.709 --> 0:25:11.511 | |
You can shuffle words around. | |
0:25:11.511 --> 0:25:12.712 | |
You can drop words. | |
0:25:12.712 --> 0:25:18.298 | |
Do whatever you want to do as long as there's | |
enough information to reconstruct the original | |
0:25:18.298 --> 0:25:18.898 | |
sentence. | |
0:25:19.719 --> 0:25:25.051 | |
And then we assume that the noisy one and the original one are parallel data and train | |
0:25:25.051 --> 0:25:26.687 | |
similar to the supervised. | |
0:25:28.168 --> 0:25:30.354 | |
So we have a source sentence. | |
0:25:30.354 --> 0:25:32.540 | |
We have a noisy source right. | |
0:25:32.540 --> 0:25:37.130 | |
So here what basically happened is that the words got shuffled. | |
0:25:37.130 --> 0:25:39.097 | |
One word is dropped right. | |
0:25:39.097 --> 0:25:41.356 | |
So this is a noisy source. | |
0:25:41.356 --> 0:25:47.039 | |
And then we treat the noisy source and the source as a sentence pair, basically. | |
0:25:49.009 --> 0:25:53.874 | |
We train by optimizing the cross entropy loss, similar to before. | |
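A minimal sketch of such a noise function (local word shuffling plus word dropping, both with assumed hyperparameters), which turns one monolingual sentence into a (noisy input, original output) training pair:

import random

def add_noise(tokens, k=3, p_drop=0.1):
    # drop words with probability p_drop, but keep at least one token
    kept = [t for t in tokens if random.random() > p_drop] or tokens[:1]
    # local shuffle: each word moves at most about k positions
    keys = [i + random.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

src = "the cat sat on the mat".split()
pair = (add_noise(src), src)   # treated like a parallel sentence pair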
0:25:57.978 --> 0:26:03.211 | |
Basically a picture to show what's happening: we have the noisy source and the | |
0:26:03.163 --> 0:26:09.210 | |
noisy target, and then we have the reconstructed original source and original target, and since | |
0:26:09.210 --> 0:26:14.817 | |
the languages are different we have our source encoder, target encoder, source decoder and target decoder. | |
0:26:17.317 --> 0:26:20.202 | |
And for this task we only need monolingual | |
data. | |
0:26:20.202 --> 0:26:25.267 | |
We don't need any parallel data because it's just taking a sentence and shuffling it and | |
0:26:25.267 --> 0:26:27.446 | |
reconstructing the original one. | |
0:26:28.848 --> 0:26:31.058 | |
And we have four different blocks. | |
0:26:31.058 --> 0:26:36.841 | |
This is kind of very important to keep in | |
mind on how we change these connections later. | |
0:26:41.121 --> 0:26:49.093 | |
Then this is more like the mathematical formulation | |
where you predict source given the noisy. | |
0:26:52.492 --> 0:26:55.090 | |
So that was the denoising auto-encoding. | |
0:26:55.090 --> 0:26:58.403 | |
The second step is on-the-fly back-translation. | |
0:26:59.479 --> 0:27:06.386 | |
So what we do is, we put our model in inference mode, right, we take a source sentence, | |
0:27:06.386 --> 0:27:09.447 | |
and we generate a translation. | |
0:27:09.829 --> 0:27:18.534 | |
It might be completely wrong or maybe partially | |
correct or so on, but we assume that the model | |
0:27:18.534 --> 0:27:20.091 | |
knows of it and. | |
0:27:20.680 --> 0:27:25.779 | |
We generate T-hat, right, and then what we do is assume, or not assume but treat it as if, T-hat | |
0:27:25.779 --> 0:27:27.572 | |
and S are a sentence pair, right. | |
0:27:27.572 --> 0:27:29.925 | |
That's how we can handle the translation. | |
0:27:30.530 --> 0:27:38.824 | |
So we train a supervised system on this sentence pair, so we do inference and then build a reverse | |
0:27:38.824 --> 0:27:39.924 | |
translation. | |
0:27:42.442 --> 0:27:49.495 | |
To be a bit more concrete: we have a source sentence, right, then we generate the translation, | |
0:27:49.495 --> 0:27:55.091 | |
then we give the generated translation as an input and try to predict the original. | |
0:27:58.378 --> 0:28:03.500 | |
This is how we would do it in practice, right. So before, the source encoder was connected | |
0:28:03.500 --> 0:28:08.907 | |
to the source decoder, but now we interchanged | |
connections, so the source encoder is connected | |
0:28:08.907 --> 0:28:10.216 | |
to the target decoder. | |
0:28:10.216 --> 0:28:13.290 | |
The target encoder is connected to the source decoder. | |
0:28:13.974 --> 0:28:20.747 | |
And given s we get t-hat, and given t we get s-hat, so this is the first time step. | |
0:28:21.661 --> 0:28:24.022 | |
On the second time step, what you're going | |
to do is reverse. | |
0:28:24.664 --> 0:28:32.625 | |
So s-hat is here, t-hat is here, and given s-hat we are trying to predict t, and given | |
0:28:32.625 --> 0:28:34.503 | |
t-hat we are trying to predict s. | |
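An interface-level sketch of one such on-the-fly back-translation round; the model objects, their translate method and the train_step helper are assumptions for illustration, not a specific library API:

def backtranslation_round(model_s2t, model_t2s, src_batch, tgt_batch, train_step):
    # inference mode: produce synthetic translations (possibly wrong, that's fine)
    t_hat = model_s2t.translate(src_batch)   # source -> synthetic target
    s_hat = model_t2s.translate(tgt_batch)   # target -> synthetic source
    # supervised updates on (synthetic input, original monolingual output) pairs
    train_step(model_t2s, inputs=t_hat, targets=src_batch)
    train_step(model_s2t, inputs=s_hat, targets=tgt_batch)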
0:28:36.636 --> 0:28:39.386 | |
Is this clear? Do you have any questions on this? | |
0:28:45.405 --> 0:28:50.823 | |
Bit more mathematically, we try to play the | |
class, give and take and so it's always the | |
0:28:50.823 --> 0:28:53.963 | |
supervised NMT technique that we are trying | |
to do. | |
0:28:53.963 --> 0:28:59.689 | |
But you're trying to create these synthetic pairs that kind of help to build an unsupervised | |
0:28:59.689 --> 0:29:00.181 | |
system. | |
0:29:02.362 --> 0:29:08.611 | |
What you can also maybe see here is that if the source encoder and target encoder outputs | |
0:29:08.611 --> 0:29:14.718 | |
are language independent, we can always switch the connections and get the translations. | |
0:29:14.718 --> 0:29:21.252 | |
That's why it was important to find a way | |
to generate language independent representations. | |
0:29:21.441 --> 0:29:26.476 | |
And the way we try to force this language independence is the GAN step. | |
0:29:27.627 --> 0:29:34.851 | |
So the third step, which kind of combines all of them, is where we try to use a GAN to make the | |
0:29:34.851 --> 0:29:37.959 | |
encoder output language independent. | |
0:29:37.959 --> 0:29:42.831 | |
So here it's the same picture but from a different | |
paper. | |
0:29:42.831 --> 0:29:43.167 | |
So. | |
0:29:43.343 --> 0:29:48.888 | |
We have X source and X target, which is monolingual data. | |
0:29:48.888 --> 0:29:50.182 | |
We add noise. | |
0:29:50.690 --> 0:29:54.736 | |
Then we encode it using the source and the | |
target encoders right. | |
0:29:54.736 --> 0:29:58.292 | |
Then we get the latent space Z source and | |
Z target right. | |
0:29:58.292 --> 0:30:03.503 | |
Then we decode and try to reconstruct the | |
original one and this is the auto encoding | |
0:30:03.503 --> 0:30:08.469 | |
loss which takes the X source which is the | |
original one and then the translated. | |
0:30:08.468 --> 0:30:09.834 | |
Predicted output. | |
0:30:09.834 --> 0:30:16.740 | |
So this is the auto-encoding step; where the GAN comes in is in between, on the encoder | |
0:30:16.740 --> 0:30:24.102 | |
outputs, and here we have a discriminator which tries to predict which language the latent | |
0:30:24.102 --> 0:30:25.241 | |
space is from. | |
0:30:26.466 --> 0:30:33.782 | |
So given Z source it has to predict that the | |
representation is from a language source and | |
0:30:33.782 --> 0:30:39.961 | |
given Z target it has to predict the representation | |
from a language target. | |
0:30:40.520 --> 0:30:45.135 | |
And our encoders are kind of the generators right now, and then we have a separate | |
0:30:45.135 --> 0:30:49.803 | |
discriminator network which tries to predict which language the latent spaces are from. | |
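A small PyTorch sketch of this adversarial part: a separate discriminator classifies which language a (pooled) latent vector came from, and the encoders are trained with flipped labels so that the two latent spaces become indistinguishable. Sizes and pooling are assumptions:

import torch, torch.nn as nn

d_model = 512
lang_disc = nn.Sequential(nn.Linear(d_model, 1024), nn.LeakyReLU(), nn.Linear(1024, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(z_src, z_tgt):
    # z_src, z_tgt: (batch, d_model) pooled encoder outputs for the two languages
    ones, zeros = torch.ones(len(z_src), 1), torch.zeros(len(z_tgt), 1)
    disc_loss = bce(lang_disc(z_src.detach()), ones) + bce(lang_disc(z_tgt.detach()), zeros)
    enc_loss = bce(lang_disc(z_src), zeros) + bce(lang_disc(z_tgt), ones)  # fool the discriminator
    return disc_loss, enc_loss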
0:30:53.393 --> 0:30:57.611 | |
And then this one is when we combine the GAN with the auto-encoding step. | |
0:30:57.611 --> 0:31:02.767 | |
Then we had an on the fly back translation | |
step right, and so here what we're trying to | |
0:31:02.767 --> 0:31:03.001 | |
do. | |
0:31:03.863 --> 0:31:07.260 | |
Is the same, basically just exactly the same. | |
0:31:07.260 --> 0:31:12.946 | |
But when we are doing the training, we add the adversarial loss here, so: | |
0:31:13.893 --> 0:31:20.762 | |
We take our X source, generate an intermediate translation, so Y target and Y source, right? | |
0:31:20.762 --> 0:31:27.342 | |
This is the previous time step, and then we | |
have to encode the new sentences and basically | |
0:31:27.342 --> 0:31:32.764 | |
make them language independent or train to | |
make them language independent. | |
0:31:33.974 --> 0:31:43.502 | |
And then the hope is that now if we do this | |
using monolingual data alone we can just switch | |
0:31:43.502 --> 0:31:47.852 | |
connections and then get our translation. | |
0:31:47.852 --> 0:31:49.613 | |
So the scale of. | |
0:31:54.574 --> 0:32:03.749 | |
And so as I said before, GANs are quite good for vision, right, so this is kind of like the | |
0:32:03.749 --> 0:32:11.312 | |
CycleGAN approach that you might have seen in any computer vision course. | |
0:32:11.911 --> 0:32:19.055 | |
Somehow protect that place at least not as | |
promising as for merchants, and so people. | |
0:32:19.055 --> 0:32:23.706 | |
What they did is to enforce this language | |
independence. | |
0:32:25.045 --> 0:32:31.226 | |
They try to use a shared encoder instead of | |
having these different encoders right, and | |
0:32:31.226 --> 0:32:37.835 | |
so these are basically the same training objectives | |
as before, but what you're going to do now | |
0:32:37.835 --> 0:32:43.874 | |
is learn cross-lingual embeddings and then use | |
the single encoder for both languages. | |
0:32:44.104 --> 0:32:49.795 | |
And this kind also forces them to be in the | |
same space, and then you can choose whichever | |
0:32:49.795 --> 0:32:50.934 | |
decoder you want. | |
0:32:52.552 --> 0:32:58.047 | |
You can use GANs or you can just use a shared encoder and try to build your unsupervised | |
0:32:58.047 --> 0:32:58.779 | |
MT system. | |
0:33:08.488 --> 0:33:09.808 | |
These are now the. | |
0:33:09.808 --> 0:33:15.991 | |
The enhancements that you can do on top of your unsupervised system are: one, you can create | |
0:33:15.991 --> 0:33:16.686 | |
a shared encoder. | |
0:33:18.098 --> 0:33:22.358 | |
On top of the shared encoder you can add your GAN loss or whatever, so there are a lot | |
0:33:22.358 --> 0:33:22.550 | |
of. | |
0:33:24.164 --> 0:33:29.726 | |
The other thing that is more relevant right | |
now is that you can create parallel data by | |
0:33:29.726 --> 0:33:35.478 | |
word-to-word translation, right, because you know how to do unsupervised word translation. | |
0:33:36.376 --> 0:33:40.548 | |
First step is to create parallel data, assuming | |
that word translations are quite good. | |
0:33:41.361 --> 0:33:47.162 | |
And then you train a supervised NMT model on this most likely wrong parallel data, | |
0:33:47.162 --> 0:33:50.163 | |
but it somehow gives you a good starting point. | |
0:33:50.163 --> 0:33:56.098 | |
So you build your supervised NMT system on the word translation data, and then you use it | |
0:33:56.098 --> 0:33:59.966 | |
as initialization before you're doing unsupervised NMT. | |
0:34:00.260 --> 0:34:05.810 | |
And the hope is that when you're doing the back-translation, it's a good starting | |
0:34:05.810 --> 0:34:11.234 | |
point, so it's one technique that you can do to improve your unsupervised NMT. | |
0:34:17.097 --> 0:34:25.879 | |
In the previous case, the way we knew when to stop was to check convergence of the GAN | |
0:34:25.879 --> 0:34:26.485 | |
training. | |
0:34:26.485 --> 0:34:28.849 | |
Actually, all we want is to see when W | |
0:34:28.849 --> 0:34:32.062 | |
converges, which makes it quite easy to know when to stop. | |
0:34:32.062 --> 0:34:37.517 | |
But in a realistic case, we don't have any | |
parallel data right, so there's no validation. | |
0:34:37.517 --> 0:34:42.002 | |
Or I mean, we might have test data in the | |
end, but there's no validation. | |
0:34:43.703 --> 0:34:48.826 | |
How will we tune our hyperparameters in this case, because there's really nothing | |
0:34:48.826 --> 0:34:49.445 | |
for us to validate on? | |
0:34:50.130 --> 0:34:53.326 | |
Or the gold data in a sense like so. | |
0:34:53.326 --> 0:35:01.187 | |
How do you think we can evaluate such systems | |
or how can we tune hyper parameters in this? | |
0:35:11.711 --> 0:35:17.089 | |
So what you're going to do is use the back | |
translation technique. | |
0:35:17.089 --> 0:35:24.340 | |
It's like a common technique where you have | |
nothing okay that is to use back translation | |
0:35:24.340 --> 0:35:26.947 | |
somehow and what you can do is. | |
0:35:26.947 --> 0:35:31.673 | |
The main idea is to validate on how good the reconstruction is. | |
0:35:32.152 --> 0:35:37.534 | |
So the idea is that if you have a good system | |
then the intermediate translation is quite | |
0:35:37.534 --> 0:35:39.287 | |
good and going back is easy. | |
0:35:39.287 --> 0:35:44.669 | |
But if it's just noise that you generate in | |
the forward step then it's really hard to go | |
0:35:44.669 --> 0:35:46.967 | |
back, which is kind of the main idea. | |
0:35:48.148 --> 0:35:53.706 | |
So the way it works is that we take a source | |
sentence, we generate a translation in target | |
0:35:53.706 --> 0:35:59.082 | |
language, right, and then back-translate the generated sentence and compare it with the | |
0:35:59.082 --> 0:36:01.342 | |
original one, and if they're closer. | |
0:36:01.841 --> 0:36:09.745 | |
It means that we have a good system, and if they are far apart it's not; this is kind of like an unsupervised | |
0:36:09.745 --> 0:36:10.334 | |
validation criterion. | |
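A sketch of this round-trip criterion for unsupervised model selection; the two model objects and their translate method are assumed, and sacrebleu is used only as one possible scoring tool:

import sacrebleu

def round_trip_score(model_s2t, model_t2s, mono_src):
    forward = model_s2t.translate(mono_src)      # source -> synthetic target
    back = model_t2s.translate(forward)          # synthetic target -> back to source
    return sacrebleu.corpus_bleu(back, [mono_src]).score   # higher = better reconstruction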
0:36:17.397 --> 0:36:21.863 | |
As far as the amount of data that you need. | |
0:36:23.083 --> 0:36:27.995 | |
This was like the first initial result on these systems. | |
0:36:27.995 --> 0:36:32.108 | |
They wanted to do English and French and they | |
had fifteen million. | |
0:36:32.108 --> 0:36:38.003 | |
There were fifteen million monolingual sentences, so it's quite a lot, and they were able to get | |
0:36:38.003 --> 0:36:40.581 | |
thirty-two BLEU on these kinds of setups. | |
0:36:41.721 --> 0:36:47.580 | |
But unsurprisingly if you have zero point | |
one million parallel sentences you get the same | |
0:36:47.580 --> 0:36:48.455 | |
performance. | |
0:36:48.748 --> 0:36:50.357 | |
So it's a lot of training. | |
0:36:50.357 --> 0:36:55.960 | |
It's a lot of monolingual data, but monolingual data is relatively easy to obtain; the catch | |
0:36:55.960 --> 0:37:01.264 | |
is that the training also takes quite a bit longer than the supervised system, but it's unsupervised, | |
0:37:01.264 --> 0:37:04.303 | |
so it's kind of the trade off that you are | |
making. | |
0:37:07.367 --> 0:37:13.101 | |
The other thing to note is that it's English and French, which is very close to our assumptions. | |
0:37:13.101 --> 0:37:18.237 | |
Also, the monolingual data that they took | |
are kind of from similar domains and so on. | |
0:37:18.638 --> 0:37:27.564 | |
So that's why they're able to build such a | |
good system, but you'll see later that it fails. | |
0:37:36.256 --> 0:37:46.888 | |
And so what people usually do is first build a system, right, using whatever | |
0:37:46.888 --> 0:37:48.110 | |
parallel data is available. | |
0:37:48.608 --> 0:37:55.864 | |
Then they use monolingual data and do back | |
translation, so this has always been the standard | |
0:37:55.864 --> 0:38:04.478 | |
way to improve, and what people have seen | |
is that: You don't even need zero point one | |
0:38:04.478 --> 0:38:05.360 | |
million right. | |
0:38:05.360 --> 0:38:10.706 | |
You just need like ten thousand or so on and | |
then you do the monolingual back time station | |
0:38:10.706 --> 0:38:12.175 | |
and you're still better. | |
0:38:12.175 --> 0:38:13.291 | |
The answer is why. | |
0:38:13.833 --> 0:38:19.534 | |
The question is whether it's really worth trying to do this, or maybe it's always better to find | |
0:38:19.534 --> 0:38:20.787 | |
some parallel data. | |
0:38:20.787 --> 0:38:26.113 | |
I'd rather spend a bit of money on getting a little parallel data and then use it to start and | |
0:38:26.113 --> 0:38:27.804 | |
fine-tune to build your system. | |
0:38:27.804 --> 0:38:33.756 | |
So it was kind of the understanding that bilingual unsupervised systems are not really that useful. | |
0:38:50.710 --> 0:38:54.347 | |
The thing is that with unlabeled data. | |
0:38:57.297 --> 0:39:05.488 | |
Not in an obtaining signal, so when we are | |
starting basically what we want to do is first | |
0:39:05.488 --> 0:39:13.224 | |
get a good translation system and then use the unlabeled monolingual data to improve it. | |
0:39:13.613 --> 0:39:15.015 | |
But if you start from U. | |
0:39:15.015 --> 0:39:15.183 | |
N. | |
0:39:15.183 --> 0:39:20.396 | |
MT, our model might be really bad, like it would be translating completely wrong. | |
0:39:20.760 --> 0:39:26.721 | |
And then when you fine-tune on your unlabeled data, it basically might be harmful, or maybe the | |
0:39:26.721 --> 0:39:28.685 | |
same as supervised applause. | |
0:39:28.685 --> 0:39:35.322 | |
So the hope is, by fine-tuning on labeled data first, to get a good initialization, | |
0:39:35.835 --> 0:39:38.404 | |
And then use the unsupervised techniques to | |
get better. | |
0:39:38.818 --> 0:39:42.385 | |
But if your starting point is really bad then | |
it's not. | |
0:39:45.185 --> 0:39:47.324 | |
Year so as we said before. | |
0:39:47.324 --> 0:39:52.475 | |
This is kind of like the self supervised training | |
usually works. | |
0:39:52.475 --> 0:39:54.773 | |
First we have parallel data. | |
0:39:56.456 --> 0:39:58.062 | |
Source language is X. | |
0:39:58.062 --> 0:39:59.668 | |
Target language is Y. | |
0:39:59.668 --> 0:40:06.018 | |
In the end we want a system that does X to | |
Y, not Y to X, but first we want to train a | |
0:40:06.018 --> 0:40:10.543 | |
backward model as it is Y to X, so target language | |
to source. | |
0:40:11.691 --> 0:40:17.353 | |
Then we take our monolingual target sentences, use our backward model to generate | |
0:40:17.353 --> 0:40:21.471 | |
synthetic source, and then we join them with | |
our original data. | |
0:40:21.471 --> 0:40:27.583 | |
So now we have this noisy input, but always | |
the gold output, which is kind of really important | |
0:40:27.583 --> 0:40:29.513 | |
when you're doing back-translation. | |
0:40:30.410 --> 0:40:36.992 | |
And then you can concatenate this data and then you can train your X to Y translation | |
0:40:36.992 --> 0:40:44.159 | |
system, and then you can always do this in multiple steps, usually three or four steps, which kind | |
0:40:44.159 --> 0:40:48.401 | |
of always improves, and then finally you get your best system. | |
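A sketch of one such round, assuming generic train(pairs) and translate(model, sentences) helpers; in practice this loop is repeated a few times, alternating directions and regenerating the synthetic data with the latest models:

def backtranslation(parallel, mono_tgt, train, translate):
    backward = train([(tgt, src) for src, tgt in parallel])       # Y -> X model on gold data
    synth_src = translate(backward, mono_tgt)                     # synthetic (noisy) source side
    augmented = list(parallel) + list(zip(synth_src, mono_tgt))   # noisy input, gold output
    return train(augmented)                                       # improved X -> Y model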
0:40:49.029 --> 0:40:54.844 | |
The point that I'm trying to make is that although for unsupervised NMT the scores that I've | |
0:40:54.844 --> 0:41:00.659 | |
shown before were quite good, you probably can get the same performance with fifty | |
0:41:00.659 --> 0:41:06.474 | |
thousand sentences, and also the languages | |
that they've shown are quite similar and the | |
0:41:06.474 --> 0:41:08.654 | |
texts were from the same domain. | |
0:41:14.354 --> 0:41:21.494 | |
So any questions on u n m t ok yeah. | |
0:41:22.322 --> 0:41:28.982 | |
So after this finding, that back-translation was already better than unsupervised NMT, what people have tried | |
0:41:28.982 --> 0:41:34.660 | |
is to use this idea of multilinguality as you | |
have seen in the previous lecture. | |
0:41:34.660 --> 0:41:41.040 | |
The question is how can we do this knowledge | |
transfer from high resource language to lower | |
0:41:41.040 --> 0:41:42.232 | |
source language? | |
0:41:44.484 --> 0:41:51.074 | |
One way to promote these language independent representations is to share the encoder and | |
0:41:51.074 --> 0:41:57.960 | |
decoder for all languages, all their available | |
languages, and that kind of hopefully enables | |
0:41:57.960 --> 0:42:00.034 | |
the knowledge transfer. | |
0:42:03.323 --> 0:42:08.605 | |
When we're doing multilinguality, the two | |
questions we need to think of are: how does | |
0:42:08.605 --> 0:42:09.698 | |
the encoder know? | |
0:42:09.698 --> 0:42:14.495 | |
How does the encoder and decoder know which language we're dealing with? | |
0:42:15.635 --> 0:42:20.715 | |
You already might have known the answer also, | |
and the second question is how can we promote | |
0:42:20.715 --> 0:42:24.139 | |
the encoder to generate language independent | |
representations? | |
0:42:25.045 --> 0:42:32.580 | |
By solving these two problems we can take | |
help of high resource languages to do unsupervised | |
0:42:32.580 --> 0:42:33.714 | |
translations. | |
0:42:34.134 --> 0:42:40.997 | |
A typical example would be: you want to do unsupervised MT between English and Dutch, right, but you have | |
0:42:40.997 --> 0:42:47.369 | |
parallel data between English and German, so | |
the question is can we use this parallel data | |
0:42:47.369 --> 0:42:51.501 | |
to help build an unsupervised system between English and Dutch? | |
0:42:56.296 --> 0:43:01.240 | |
For the first one we try to take help of language | |
embeddings for tokens, and this kind of is | |
0:43:01.240 --> 0:43:05.758 | |
a straightforward way to tell the model which language it's dealing with. | |
0:43:06.466 --> 0:43:11.993 | |
And for the second one we're going to look | |
at some pre training objectives which are also | |
0:43:11.993 --> 0:43:17.703 | |
kind of unsupervised so we need monolingual | |
data mostly and this kind of helps us to promote | |
0:43:17.703 --> 0:43:20.221 | |
the language independent representation. | |
0:43:23.463 --> 0:43:29.954 | |
So the first pre-trained model that we'll look at is XLM, which is quite famous, in case | |
0:43:29.954 --> 0:43:32.168 | |
you haven't heard of it yet. | |
0:43:32.552 --> 0:43:40.577 | |
The way it works is that it's basically a transformer encoder, right, so it's like | |
0:43:40.577 --> 0:43:42.391 | |
just the encoder module. | |
0:43:42.391 --> 0:43:44.496 | |
No, there's no decoder here. | |
0:43:44.884 --> 0:43:51.481 | |
And what we're trying to do is mask some tokens in a sequence and try to predict these masked | |
0:43:51.481 --> 0:43:52.061 | |
tokens. | |
0:43:52.061 --> 0:43:55.467 | |
This is called masked language modeling. | |
0:43:55.996 --> 0:44:05.419 | |
Typical language modeling that you see is | |
the Danish language modeling where you predict | |
0:44:05.419 --> 0:44:08.278 | |
the next token in English. | |
0:44:08.278 --> 0:44:11.136 | |
Then we have the position embeddings. | |
0:44:11.871 --> 0:44:18.774 | |
Then we have the token embeddings, and then here we have the masked tokens, and then we have | |
0:44:18.774 --> 0:44:22.378 | |
the transformer encoder blocks to predict the masked tokens. | |
0:44:24.344 --> 0:44:30.552 | |
We do this for all languages using the same transformer encoder, and this kind of helps | |
0:44:30.552 --> 0:44:36.760 | |
us to push the sentence embeddings, or the output of the encoder, into a common space | |
0:44:36.760 --> 0:44:37.726 | |
for multiple languages. | |
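A minimal sketch of how the masked-language-modeling input is built (the 15% masking rate and the mask symbol are the usual assumptions); position and language embeddings are added on top of this in the actual model:

import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15):
    inp, targets = [], []
    for tok in tokens:
        if random.random() < p:
            inp.append(mask_token)
            targets.append(tok)      # only masked positions contribute to the loss
        else:
            inp.append(tok)
            targets.append(None)     # ignored position
    return inp, targets

inp, targets = mask_tokens("ich spreche ein bisschen deutsch".split())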
0:44:42.782 --> 0:44:49.294 | |
So first we train an MLM on both source and target language sides, and then | |
0:44:49.294 --> 0:44:54.928 | |
we use it as a starting point for the encoder and decoder of a UNMT system. | |
0:44:55.475 --> 0:45:03.175 | |
So we take the monolingual data, build a masked | |
language model on both source and target languages, | |
0:45:03.175 --> 0:45:07.346 | |
and then read it to be or initialize that in | |
the U. | |
0:45:07.346 --> 0:45:07.586 | |
N. | |
0:45:07.586 --> 0:45:07.827 | |
P. | |
0:45:07.827 --> 0:45:08.068 | |
C. | |
0:45:09.009 --> 0:45:14.629 | |
Here we look at two languages, but you can | |
also do it with one hundred languages once. | |
0:45:14.629 --> 0:45:20.185 | |
So there are pretrained checkpoints that you can use, which have seen quite | |
0:45:20.185 --> 0:45:21.671 | |
a lot of data, and you can use | |
0:45:21.671 --> 0:45:24.449 | |
them as a starting point for your U | |
0:45:24.449 --> 0:45:24.643 | |
N. | |
0:45:24.643 --> 0:45:27.291 | |
MT system, which in practice works well. | |
0:45:31.491 --> 0:45:36.759 | |
One detail is that since this is an encoder block only, and your U | |
0:45:36.759 --> 0:45:36.988 | |
N. | |
0:45:36.988 --> 0:45:37.217 | |
M. | |
0:45:37.217 --> 0:45:37.446 | |
T. | |
0:45:37.446 --> 0:45:40.347 | |
system is encoder-decoder, right. | |
0:45:40.347 --> 0:45:47.524 | |
So there's this cross-attention that's missing, but you can always initialize that randomly. | |
0:45:47.524 --> 0:45:48.364 | |
It's fine. | |
0:45:48.508 --> 0:45:53.077 | |
Not everything is initialized, but it's still | |
decent. | |
0:45:56.056 --> 0:46:02.141 | |
Then the other one is mBART, and here you see that this kind of builds on | |
0:46:02.141 --> 0:46:07.597 | |
the unsupervised training objective, which is the denoising auto-encoding. | |
0:46:08.128 --> 0:46:14.337 | |
So what they do is they say that we don't even need to do the GAN or back-translation, | |
0:46:14.337 --> 0:46:17.406 | |
but you can do it later; for pre-training | |
0:46:17.406 --> 0:46:24.258 | |
we just do denoising auto-encoding on all different languages, and that also gives | |
0:46:24.258 --> 0:46:32.660 | |
you out-of-the-box good performance. So what we basically have here is a transformer encoder-decoder. | |
0:46:34.334 --> 0:46:37.726 | |
You are trying to generate a reconstructed | |
sequence. | |
0:46:37.726 --> 0:46:38.942 | |
You need a decoder. | |
0:46:39.899 --> 0:46:42.022 | |
So we gave an input sentence. | |
0:46:42.022 --> 0:46:48.180 | |
We tried to predict the masked tokens from | |
the or we tried to reconstruct the original | |
0:46:48.180 --> 0:46:52.496 | |
sentence from the input sequence, which was corrupted, right. | |
0:46:52.496 --> 0:46:57.167 | |
So this is the same denoising objective that you have seen before. | |
0:46:58.418 --> 0:46:59.737 | |
This is for English. | |
0:46:59.737 --> 0:47:04.195 | |
I think this is for Japanese and then once | |
we do it for all languages. | |
0:47:04.195 --> 0:47:09.596 | |
I mean they have different versions with twenty-five, fifty languages or so on, and then you can fine- | |
0:47:09.596 --> 0:47:11.794 | |
tune on your sentence and document level tasks. | |
0:47:13.073 --> 0:47:20.454 | |
And so they did this for the supervised techniques, but you can also use this as an initialization | |
0:47:20.454 --> 0:47:25.058 | |
for unsupervised MT and build up on that, which also works in practice. | |
0:47:30.790 --> 0:47:36.136 | |
Then we have these: so until now we kind of didn't see the benefit from the | |
0:47:36.136 --> 0:47:38.840 | |
high resource language right, so as I said. | |
0:47:38.878 --> 0:47:44.994 | |
Why you can use English as something for English | |
to Dutch, and if you want a new Catalan, you | |
0:47:44.994 --> 0:47:46.751 | |
can use English to French. | |
0:47:48.408 --> 0:47:55.866 | |
One typical way to do this is to use pivot translation, right, where you go through the pivot language. | |
0:47:55.795 --> 0:48:01.114 | |
So here it's Finnish to Greek, so you translate, say, from Finnish to English and English | |
0:48:01.114 --> 0:48:03.743 | |
to Greek, and then you get the translation. | |
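A tiny sketch of pivot translation through a high-resource language; the two model objects and their translate method are illustrative assumptions:

def pivot_translate(sentences, model_src_to_pivot, model_pivot_to_tgt):
    pivot = model_src_to_pivot.translate(sentences)   # e.g. Finnish -> English
    return model_pivot_to_tgt.translate(pivot)        # e.g. English -> Greek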
0:48:04.344 --> 0:48:10.094 | |
What's important is that you have these different | |
techniques and you can always think of which | |
0:48:10.094 --> 0:48:12.333 | |
one to use given the data situation. | |
0:48:12.333 --> 0:48:18.023 | |
So if it is like Finnish to Greek, maybe pivoting is better because you might get good Finnish | |
0:48:18.023 --> 0:48:20.020 | |
to English and English to Greek. | |
0:48:20.860 --> 0:48:23.255 | |
Sometimes it also depends on the language | |
pair. | |
0:48:23.255 --> 0:48:27.595 | |
There might be some information loss and so | |
on, so there are quite a few variables you | |
0:48:27.595 --> 0:48:30.039 | |
need to think of and decide which system to | |
use. | |
0:48:32.752 --> 0:48:39.654 | |
Then there's zero-shot, which you've probably also seen in the multilingual lecture, and how, | |
0:48:39.654 --> 0:48:45.505 | |
if you can improve the language independence | |
then your zero shot gets better. | |
0:48:45.505 --> 0:48:52.107 | |
So maybe if you use the multilingual models | |
and do zero shot directly, it's quite good. | |
0:48:53.093 --> 0:48:58.524 | |
So we have zero-shot and pivot translation, and then we have the unsupervised translation where | |
0:48:58.524 --> 0:49:00.059 | |
we can calculate between. | |
0:49:00.600 --> 0:49:02.762 | |
just when there is no parallel data. | |
0:49:06.686 --> 0:49:07.565 | |
Is to solve. | |
0:49:07.565 --> 0:49:11.959 | |
So sometimes what we have seen so far is that | |
we basically have. | |
0:49:15.255 --> 0:49:16.754 | |
To do from looking at it. | |
0:49:16.836 --> 0:49:19.307 | |
These two files alone you can create a dictionary. | |
0:49:19.699 --> 0:49:26.773 | |
You can build an unsupervised NMT system, not always, but if the domains are similar and the | |
0:49:26.773 --> 0:49:28.895 | |
languages are similar. | |
0:49:28.895 --> 0:49:36.283 | |
But if they are distant languages, then the unsupervised techniques don't usually work really | |
0:49:36.283 --> 0:49:36.755 | |
well. | |
0:49:37.617 --> 0:49:40.297 | |
What um. | |
0:49:40.720 --> 0:49:46.338 | |
What would be better is: if you can get some parallel data from somewhere, or do bitext mining as | |
0:49:46.338 --> 0:49:51.892 | |
we have seen in the LASER practical, then you can use that to initialize your | |
0:49:51.892 --> 0:49:57.829 | |
system, and then train a semi-supervised NMT system, and that would be better than | |
0:49:57.829 --> 0:50:00.063 | |
just building an unsupervised one. | |
0:50:00.820 --> 0:50:06.546 | |
With that as the end. | |
0:50:07.207 --> 0:50:08.797 | |
Quickly could be. | |
0:50:16.236 --> 0:50:25.070 | |
In common, they can catch the worst because | |
the thing about finding a language is: And | |
0:50:25.070 --> 0:50:34.874 | |
there's another joy in playing these games, | |
almost in the middle of a game, and she's a | |
0:50:34.874 --> 0:50:40.111 | |
characteristic too, and she is a global waver. | |
0:50:56.916 --> 0:51:03.798 | |
Next talk inside and this somehow gives them | |
many abilities, not only translation but other | |
0:51:03.798 --> 0:51:08.062 | |
than that there are quite a few things that | |
they can do. | |
0:51:10.590 --> 0:51:17.706 | |
But the translation in itself usually doesn't | |
really work really well if you build a system | |
0:51:17.706 --> 0:51:20.878 | |
from your specific system for your case. | |
0:51:22.162 --> 0:51:27.924 | |
I would guess that it's usually better than | |
the LLM, but you can always adapt the LLM to | |
0:51:27.924 --> 0:51:31.355 | |
the task that you want, and then it could be | |
better. | |
0:51:32.152 --> 0:51:37.849 | |
An LLM out of the box might not be the best choice for your task. | |
0:51:37.849 --> 0:51:44.138 | |
For me, I'm working on new air translation, | |
so it's more about translating software. | |
0:51:45.065 --> 0:51:50.451 | |
And it's quite a niche domain as well, and if you use the LLM out of the box, they're | |
0:51:50.451 --> 0:51:53.937 | |
actually quite bad compared to the systems that we built. | |
0:51:54.414 --> 0:51:56.736 | |
But you can do these different techniques | |
like prompting. | |
0:51:57.437 --> 0:52:03.442 | |
This is what people usually do: prompting, where they give similar translation pairs in | |
0:52:03.442 --> 0:52:08.941 | |
the prompt and then ask it to translate and | |
then that kind of improves the performance | |
0:52:08.941 --> 0:52:09.383 | |
a lot. | |
0:52:09.383 --> 0:52:15.135 | |
So there are different techniques that you can use to adapt your LLMs, and then it might | |
0:52:15.135 --> 0:52:16.399 | |
be better than the | |
0:52:16.376 --> 0:52:17.742 | |
task-specific system. | |
0:52:18.418 --> 0:52:22.857 | |
But if you're looking for niche things, I don't think LLMs are that good. | |
0:52:22.857 --> 0:52:26.309 | |
But if you want to do, let's say, unsupervised translation: | |
0:52:26.309 --> 0:52:30.036 | |
In this case you can never be sure that they | |
haven't seen the data. | |
0:52:30.036 --> 0:52:35.077 | |
First of all is that if you see the data in | |
that language or not, and if they're panthetic, | |
0:52:35.077 --> 0:52:36.831 | |
they probably did see the data. | |
0:52:40.360 --> 0:53:00.276 | |
I feel like they have pretty good understanding | |
of each million people. | |
0:53:04.784 --> 0:53:09.059 | |
It depends on the language, but I'm pretty surprised that it works on a low-resource language. | |
0:53:09.059 --> 0:53:11.121 | |
I would expect it to work on German and. | |
0:53:11.972 --> 0:53:13.633 | |
But if you take a low-resource language, | |
0:53:14.474 --> 0:53:20.973 | |
I don't think it works, and also there are quite a few papers where they've already shown that | |
0:53:20.973 --> 0:53:27.610 | |
if you build a system yourself in the typical way, it's quite a bit better than | |
0:53:27.610 --> 0:53:29.338 | |
the bit better than the. | |
0:53:29.549 --> 0:53:34.883 | |
But you can always do things with LLMs to | |
get better, but then I'm probably. | |
0:53:37.557 --> 0:53:39.539 | |
Any more questions? | |
0:53:41.421 --> 0:53:47.461 | |
So if not then we're going to end the lecture | |
here and then on Thursday we're going to have | |
0:53:47.461 --> 0:53:51.597 | |
document-level MT, which is also given by me, so | |
thanks for coming. | |