WEBVTT
0:00:00.060 --> 0:00:07.762
OK good, so today's lecture is on unsupervised
machine translation. So what you have seen
0:00:07.762 --> 0:00:13.518
so far are different techniques for supervised
MT, so you have parallel
0:00:13.593 --> 0:00:18.552
data, right? So let's say in an English corpus
you have one file and then in German you have
0:00:18.552 --> 0:00:23.454
another file which is sentence-to-sentence
aligned, and then you try to build systems around
0:00:23.454 --> 0:00:23.679
it.
0:00:24.324 --> 0:00:30.130
But what's different about this lecture is
that you assume that you have no parallel data
0:00:30.130 --> 0:00:30.663
at all.
0:00:30.663 --> 0:00:37.137
You only have monolingual data and the question
is how can we build systems to translate between
0:00:37.137 --> 0:00:39.405
these two languages right and so.
0:00:39.359 --> 0:00:44.658
This is a bit more realistic scenario because
you have so many languages in the world.
0:00:44.658 --> 0:00:50.323
You cannot expect to have parallel data between
every pair of languages, but in typical
0:00:50.323 --> 0:00:55.623
cases you have newspapers and so on, which
is like monolingual files, and the question
0:00:55.623 --> 0:00:57.998
is can we build something around them?
0:00:59.980 --> 0:01:01.651
So, the agenda for today.
0:01:01.651 --> 0:01:05.893
First we'll start off with the introduction,
so why do we need it?
0:01:05.893 --> 0:01:11.614
and also some intuition on how these models
work before going into the technical details.
0:01:11.614 --> 0:01:17.335
I want to also go through an example, which
kind of gives you more understanding on how
0:01:17.335 --> 0:01:19.263
people came up with these models.
0:01:20.820 --> 0:01:23.905
Then the rest of the lecture is going to be
two parts.
0:01:23.905 --> 0:01:26.092
One is we're going to translate words.
0:01:26.092 --> 0:01:30.018
We're not going to care about how can we translate
the full sentence.
0:01:30.018 --> 0:01:35.177
But given two monolingual files, how can we
get a dictionary basically, which is much easier
0:01:35.177 --> 0:01:37.813
than generating something in a sentence level?
0:01:38.698 --> 0:01:43.533
Then we're going to go into the harder case,
which is unsupervised sentence-level translation.
0:01:44.204 --> 0:01:50.201
And here what you'll see is what are the training
objectives which are quite different than the
0:01:50.201 --> 0:01:55.699
word translation, and also where it doesn't
work, because this is also quite important and
0:01:55.699 --> 0:02:01.384
it's one of the reasons why unsupervised MT is
not used anymore, because the limitations kind
0:02:01.384 --> 0:02:03.946
of take it away from the realistic use cases.
0:02:04.504 --> 0:02:06.922
And then that leads to the multilingual
models.
0:02:06.922 --> 0:02:07.115
So.
0:02:07.807 --> 0:02:12.915
What people are trying to do to build systems for
languages that do not have any parallel data
0:02:12.915 --> 0:02:17.693
is to use multilingual models and combine them with
these training objectives to get better at
0:02:17.693 --> 0:02:17.913
it.
0:02:17.913 --> 0:02:18.132
So.
0:02:18.658 --> 0:02:24.396
People are not trying to build bilingual systems
currently for unsupervised machine translation,
0:02:24.396 --> 0:02:30.011
but I think it's good to know how they came
to hear this point and what they're doing now.
0:02:30.090 --> 0:02:34.687
You also see some patterns overlapping which
people are using.
0:02:36.916 --> 0:02:41.642
So as you said before, and you probably hear
it multiple times now is that we have seven
0:02:41.642 --> 0:02:43.076
thousand languages around.
0:02:43.903 --> 0:02:49.460
Can be different dialects in someone, so it's
quite hard to distinguish what's the language,
0:02:49.460 --> 0:02:54.957
but you can typically approximate it at seven
thousand and that leads to twenty five million
0:02:54.957 --> 0:02:59.318
pairs, which is the obvious reason why we do
not have parallel data between all of them.
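As a quick sanity check on that number, here is a minimal sketch of the pair count in Python (simply counting unordered language pairs):

n_languages = 7000
n_pairs = n_languages * (n_languages - 1) // 2  # unordered language pairs
print(n_pairs)  # 24,496,500, i.e. roughly twenty-five million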
0:03:00.560 --> 0:03:06.386
So if you want to build an MT system for all
possible language pairs, the question is
0:03:06.386 --> 0:03:07.172
how can we?
0:03:08.648 --> 0:03:13.325
The typical use case, but there are actually
quite few interesting use cases than what you
0:03:13.325 --> 0:03:14.045
would expect.
0:03:14.614 --> 0:03:20.508
One is animal languages, which is a
real thing that's happening right now, not with
0:03:20.780 --> 0:03:26.250
dogs, but with dolphins and so on, but I
couldn't find a picture that could show this,
0:03:26.250 --> 0:03:31.659
but if you are interested in stuff like this
you can check out the website where people
0:03:31.659 --> 0:03:34.916
are actually trying to understand how animals
speak.
0:03:35.135 --> 0:03:37.356
It's Also a Bit More About.
0:03:37.297 --> 0:03:44.124
knowing what the animals want to say; we may
not be there yet, but still people are trying to
0:03:44.124 --> 0:03:44.661
do it.
0:03:45.825 --> 0:03:50.689
More realistic thing that's happening is the
translation of programming languages.
0:03:51.371 --> 0:03:56.963
And so this is quite a good scenario
for unsupervised MT: you have
0:03:56.963 --> 0:04:02.556
a lot of code available online, right, in C++
and in Python, and the question is how can
0:04:02.556 --> 0:04:08.402
we translate by just looking at the code alone,
with no parallel functions and so on, and this
0:04:08.402 --> 0:04:10.754
is actually quite good right now so.
0:04:12.032 --> 0:04:16.111
You'll see how these techniques were applied to do
programming language translation.
0:04:18.258 --> 0:04:23.882
And then you can also think of language as
something that is quite broad, so you can, let's
0:04:23.882 --> 0:04:24.194
say,
0:04:24.194 --> 0:04:29.631
think of formal sentences in English as one
language and informal sentences in English
0:04:29.631 --> 0:04:35.442
as another language and then learn to translate
between them, and then it kind of becomes
0:04:35.442 --> 0:04:37.379
a style transfer problem.
0:04:38.358 --> 0:04:43.042
Although it's translation, you can consider
different characteristics of a language and
0:04:43.042 --> 0:04:46.875
then separate them as two different languages
and then try to map them.
0:04:46.875 --> 0:04:52.038
So it's not only about languages, but you
can also do quite cool things by using unsupervised
0:04:52.038 --> 0:04:54.327
techniques, which are quite possible also.
0:04:56.256 --> 0:04:56.990
I am so.
0:04:56.990 --> 0:05:04.335
This is kind of the modeling for many of the
use cases that we have for unsupervised MT.
0:05:04.335 --> 0:05:11.842
But before we go into the modeling of these
systems, what I want you to do is look at these
0:05:11.842 --> 0:05:12.413
dummy languages.
0:05:13.813 --> 0:05:19.720
We have text and language one, text and language
two right, and nobody knows what these languages
0:05:19.720 --> 0:05:20.082
mean.
0:05:20.082 --> 0:05:23.758
They are completely made up, right, and the
thing is also:
0:05:23.758 --> 0:05:29.364
they're not parallel lines, so the first line
here and the first line there are not aligned; they're
0:05:29.364 --> 0:05:30.810
just monolingual files.
0:05:32.052 --> 0:05:38.281
And now think about how can you translate
the word M1 from language one to language two,
0:05:38.281 --> 0:05:41.851
and from this you kind of see how we try to model
this.
0:05:42.983 --> 0:05:47.966
So take your time and then think of how
you can translate M1 into language two.
0:06:41.321 --> 0:06:45.589
About the model, if you ask somebody who doesn't
know anything about machine translation right,
0:06:45.589 --> 0:06:47.411
and then you ask them to translate more.
0:07:01.201 --> 0:07:10.027
But it's also not quite easy if you think
of the way that I made this example is relatively
0:07:10.027 --> 0:07:10.986
easy, so.
0:07:11.431 --> 0:07:17.963
Basically, the first two sentences are these
two [reading the made-up words from the slide],
0:07:17.963 --> 0:07:21.841
and this is supposed to be the "German".
0:07:22.662 --> 0:07:25.241
And then when you join these two words, it's.
0:07:25.205 --> 0:07:32.445
English German the third line and the last
line, and then the fourth line is the first
0:07:32.445 --> 0:07:38.521
line, so German language, English, and then
speak English, speak German.
0:07:38.578 --> 0:07:44.393
So this is how I made up the example,
and the intuition here is that you assume
0:07:44.393 --> 0:07:50.535
that the languages have a fundamental structure
right and it's the same across all languages.
0:07:51.211 --> 0:07:57.727
It doesn't matter what language you are thinking
of: the words are kind of formed in the same way, they join
0:07:57.727 --> 0:07:59.829
together in the same way, and
0:07:59.779 --> 0:08:06.065
sentences are structured in the same way. This
is not a realistic assumption for sure, but
0:08:06.065 --> 0:08:12.636
it's actually a decent one to make and if you
can think of this like if you can assume this
0:08:12.636 --> 0:08:16.207
then we can model systems in an unsupervised
way.
0:08:16.396 --> 0:08:22.743
So this is the intuition that I want to give,
and you can see that whenever assumptions fail,
0:08:22.743 --> 0:08:23.958
the systems fail.
0:08:23.958 --> 0:08:29.832
So in practice whenever we go far away from
these assumptions, the systems tend more
0:08:29.832 --> 0:08:30.778
often to fail.
0:08:33.753 --> 0:08:39.711
So the example that I gave was actually perfect
mapping, right, which never really exists in practice.
0:08:39.711 --> 0:08:45.353
They have the same number of words, same sentence
structure, perfect mapping, and so on.
0:08:45.353 --> 0:08:50.994
This doesn't happen, but let's assume that
this happens and try to see how we can model it.
0:08:53.493 --> 0:09:01.061
Okay, now let's go a bit more formal, so what
we want to do is unsupervised word translation.
0:09:01.901 --> 0:09:08.773
Here the task is that we have input data as
monolingual data, so a bunch of sentences in
0:09:08.773 --> 0:09:15.876
one file and a bunch of sentences in another file
in two different languages, and the question
0:09:15.876 --> 0:09:18.655
is how can we get a bilingual dictionary?
0:09:19.559 --> 0:09:25.134
So if you look at the picture you see that
it's just kind of projected down onto a two-dimensional
0:09:25.134 --> 0:09:30.358
plane, but basically when you map them
into a plot you see that the words that are
0:09:30.358 --> 0:09:35.874
parallel are closer together, and the question
is how can we do it just looking at two files?
0:09:36.816 --> 0:09:42.502
And you can say that what we want to basically
do is create a dictionary in the end given
0:09:42.502 --> 0:09:43.260
two files.
0:09:43.260 --> 0:09:45.408
So this is the task that we want.
0:09:46.606 --> 0:09:52.262
And the first step on how we do this is to
learn word vectors, and this can be with whatever
0:09:52.262 --> 0:09:56.257
technique you have seen before, word2vec,
GloVe, or so on.
0:09:56.856 --> 0:10:00.699
So you take a monolingual data and try to
learn word embeddings.
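A minimal sketch of this step, assuming gensim's Word2Vec and two hypothetical monolingual files lang1.txt and lang2.txt (any embedding technique seen before would do):

from gensim.models import Word2Vec

# Whitespace-tokenize each monolingual file into lists of words.
corpus_l1 = [line.split() for line in open("lang1.txt", encoding="utf-8")]
corpus_l2 = [line.split() for line in open("lang2.txt", encoding="utf-8")]

# Train two independent embedding spaces; at this point they are not aligned at all.
emb_l1 = Word2Vec(corpus_l1, vector_size=300, window=5, min_count=5).wv
emb_l2 = Word2Vec(corpus_l2, vector_size=300, window=5, min_count=5).wv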
0:10:02.002 --> 0:10:07.675
Then you plot them into a graph, and then
typically what you would see is that they're
0:10:07.675 --> 0:10:08.979
not aligned at all.
0:10:08.979 --> 0:10:14.717
One word space is somewhere, and one word
space is somewhere else, and this is what you
0:10:14.717 --> 0:10:18.043
would typically expect to see in the in the
image.
0:10:19.659 --> 0:10:23.525
Now our assumption was that both languages
have the same
0:10:23.563 --> 0:10:28.520
structure, and so we can use this information
to learn the mapping between these two spaces.
0:10:30.130 --> 0:10:37.085
So before we get to how we do it: I think this is quite
famous already, and everybody knows it a bit;
0:10:37.085 --> 0:10:41.824
word embeddings capture semantic
relations, right.
0:10:41.824 --> 0:10:48.244
So the distance between man and woman is approximately
the same as between king and queen.
0:10:48.888 --> 0:10:54.620
It also holds for verb tenses, country-capital
and so on, so there are some relationships
0:10:54.620 --> 0:11:00.286
happening in the word embedding space, which
is quite clear for at least one language.
0:11:03.143 --> 0:11:08.082
Now if you think of this, let's say, the
English word embeddings.
0:11:08.082 --> 0:11:14.769
Let's say the German word embeddings: the way
king, queen, man, woman are organized is the same
0:11:14.769 --> 0:11:17.733
as for the German translations of these words.
0:11:17.998 --> 0:11:23.336
This is the main idea: although they
are somewhere else, the relationship is the
0:11:23.336 --> 0:11:28.008
same between both languages, and we can
use this to learn the mapping.
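As a small illustration of what "the relationship is the same" means inside one embedding space, a sketch with numpy (emb stands for a hypothetical word-to-vector lookup such as one trained above):

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The offset man -> woman roughly matches the offset king -> queen,
# so king - man + woman should land near queen in a well-trained space.
guess = emb["king"] - emb["man"] + emb["woman"]
print(cosine(guess, emb["queen"]))  # typically among the highest similarities in the vocabulary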
0:11:31.811 --> 0:11:35.716
It's not only for these four words; it
happens for all the words in the language,
0:11:35.716 --> 0:11:37.783
and so we can use this to learn the mapping.
0:11:39.179 --> 0:11:43.828
This is the main idea: both embeddings
have a similar shape.
0:11:43.828 --> 0:11:48.477
It's only that they're just not aligned and
so if you go to the figure here:
0:11:48.477 --> 0:11:50.906
They kind of have a similar shape.
0:11:50.906 --> 0:11:57.221
They're just in some different spaces and
what you need to do is to map them into a common
0:11:57.221 --> 0:11:57.707
space.
0:12:06.086 --> 0:12:12.393
So what we want to learn is the W such that, if we multiply
W with X, WX and Y end up in the same space.
0:12:35.335 --> 0:12:41.097
That's true, but there are also many words
that do have the relationship, right, and we hope
0:12:41.097 --> 0:12:43.817
that this is enough to learn the mapping.
0:12:43.817 --> 0:12:49.838
So there's always going to be a bit of noise,
as in, when we align them they're not going
0:12:49.838 --> 0:12:51.716
to be exactly the same, but.
0:12:51.671 --> 0:12:57.293
What you can expect is that there are these
main words that allow us to learn the mapping,
0:12:57.293 --> 0:13:02.791
so it's not going to be perfect, but it's an
approximation that we make to to see how it
0:13:02.791 --> 0:13:04.521
works, and in practice it works.
0:13:04.521 --> 0:13:10.081
Also, the fact that some words do
not have any relationship does not affect it that
0:13:10.081 --> 0:13:10.452
much.
0:13:10.550 --> 0:13:15.429
A lot of words usually have, so it kind of
works out in practice.
0:13:22.242 --> 0:13:34.248
I have not heard about it, but if you want
to say something about it, I would be interested,
0:13:34.248 --> 0:13:37.346
but we can do it later.
0:13:41.281 --> 0:13:44.133
Usual case: This is supervised.
0:13:45.205 --> 0:13:49.484
The first way is to do supervised word translation,
where we have a dictionary, right, and we
0:13:49.484 --> 0:13:53.764
can use that to learn the mapping, but in our
case we assume that we have nothing right so
0:13:53.764 --> 0:13:55.222
we only have monolingual data.
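For that supervised setting, one common closed-form choice (a sketch, not necessarily the exact method meant here) is the orthogonal Procrustes solution: stack the seed-dictionary pairs into matrices and take W from an SVD. Assuming X_pairs and Y_pairs are hypothetical numpy arrays whose i-th rows are the embeddings of one dictionary pair:

import numpy as np

def procrustes(X_pairs, Y_pairs):
    # Orthogonal W minimizing ||X_pairs @ W.T - Y_pairs|| in the Frobenius norm.
    U, _, Vt = np.linalg.svd(Y_pairs.T @ X_pairs)
    return U @ Vt

W = procrustes(X_pairs, Y_pairs)
mapped_src = X_pairs @ W.T  # source embeddings mapped into the target space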
0:13:56.136 --> 0:14:03.126
Then we need unsupervised learning to figure
out W, and we're going to use GANs to find
0:14:03.126 --> 0:14:06.122
W, and it's quite a nice way to do it.
0:14:08.248 --> 0:14:15.393
So just before I go into how we use it for our use
case, I'm going to go briefly over GANs, right,
0:14:15.393 --> 0:14:19.940
so we have two components: generator and discriminator.
0:14:21.441 --> 0:14:27.052
The generator tries to generate something, obviously,
and the discriminator tries to see if it's
0:14:27.052 --> 0:14:30.752
real data or something that is generated by
the generator.
0:14:31.371 --> 0:14:37.038
And there's like this two player game where
the generator tries to fool and the discriminator tries
0:14:37.038 --> 0:14:41.862
not to get fooled, and they try to build these
two components and try to learn W.
0:14:43.483 --> 0:14:53.163
Okay, so let's say we have two languages,
X and Y, right, so the X language has N words
0:14:53.163 --> 0:14:56.167
with embeddings of some dimension.
0:14:56.496 --> 0:14:59.498
So what we have here is an embedding matrix of that size.
0:14:59.498 --> 0:15:02.211
Then we have the target language Y with M words,
0:15:02.211 --> 0:15:06.944
also with the same embedding dimension as I mentioned,
and then we have a matrix for that as well.
0:15:07.927 --> 0:15:13.784
Basically what you're going to do is use word2vec
and learn our word embeddings.
0:15:14.995 --> 0:15:23.134
Now we have these X embeddings, Y embeddings, and
what you want to know is W, such that W X and
0:15:23.134 --> 0:15:24.336
Y are aligned.
0:15:29.209 --> 0:15:35.489
With GANs you have two steps: one is a discriminator
step and one is the mapping step, and the
0:15:35.489 --> 0:15:41.135
discriminator step is to see whether an embedding
is an original one or a mapped one.
0:15:41.135 --> 0:15:44.688
It's going to be much clearer when I go to
the figure.
0:15:46.306 --> 0:15:50.041
So we have monolingual documents in two
different languages.
0:15:50.041 --> 0:15:54.522
From here we get our source language embeddings
and target language embeddings, right.
0:15:54.522 --> 0:15:57.855
Then we randomly initialize the transformation
matrix W.
0:16:00.040 --> 0:16:06.377
Then we have the discriminator which tries
to see if it's WX or Y, so it needs to know
0:16:06.377 --> 0:16:13.735
that this is a mapped one and this is the original
language, and so if you look at the loss function
0:16:13.735 --> 0:16:20.072
here, it's basically that source is one given
WX, so this is from the source language.
0:16:23.543 --> 0:16:27.339
Which means it's the target language embedding, yeah.
0:16:27.339 --> 0:16:34.436
It's just like my figure is not that great,
but you can assume that they are separate.
0:16:40.260 --> 0:16:43.027
So this is kind of the loss function.
0:16:43.027 --> 0:16:46.386
We have N source words, M target words, and
so on.
0:16:46.386 --> 0:16:52.381
So that's why you have one over N, one over M,
and the discriminator is to just see if they're
0:16:52.381 --> 0:16:55.741
mapped or they're from the original target
embeddings.
0:16:57.317 --> 0:17:04.024
And then we have the mapping step where we
train W to fool the discriminator.
0:17:04.564 --> 0:17:10.243
So here it's the same way, but what you're
going to do is invert the loss function.
0:17:10.243 --> 0:17:15.859
So now we freeze the discriminator, and it's
important to note that in the previous step
0:17:15.859 --> 0:17:20.843
we froze the transformation matrix, and here
we freeze the discriminator.
0:17:22.482 --> 0:17:28.912
And now the goal is to fool the discriminator, right,
so it should predict that the source is zero
0:17:28.912 --> 0:17:35.271
given the mapped embedding, and the source is
one given the target embedding, which is wrong,
0:17:35.271 --> 0:17:37.787
which is how we're training the W.
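Putting the two alternating steps together, a condensed PyTorch-style sketch (X and Y are hypothetical tensors of source and target word embeddings; real setups add tricks such as orthogonality constraints and label smoothing that are left out here):

import torch
import torch.nn as nn

dim, batch, num_steps = 300, 32, 10000
W = nn.Linear(dim, dim, bias=False)                                        # the mapping
D = nn.Sequential(nn.Linear(dim, 512), nn.LeakyReLU(), nn.Linear(512, 1))  # the discriminator
bce = nn.BCEWithLogitsLoss()
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)

for _ in range(num_steps):
    x = X[torch.randint(len(X), (batch,))]   # random source embeddings
    y = Y[torch.randint(len(Y), (batch,))]   # random target embeddings

    # Discriminator step: W frozen, D learns "1 = mapped source (WX)", "0 = real target (Y)".
    with torch.no_grad():
        wx = W(x)
    d_loss = bce(D(wx), torch.ones(batch, 1)) + bce(D(y), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Mapping step: D frozen, W is trained with inverted labels so that it fools D.
    for p in D.parameters():
        p.requires_grad_(False)
    w_loss = bce(D(W(x)), torch.zeros(batch, 1)) + bce(D(y), torch.ones(batch, 1))
    opt_W.zero_grad(); w_loss.backward(); opt_W.step()
    for p in D.parameters():
        p.requires_grad_(True)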
0:17:39.439 --> 0:17:46.261
Any questions on this okay so then how do
we know when to stop?
0:17:46.261 --> 0:17:55.854
We just train until we reach convergence, right,
and then we have our W, hopefully trained to
0:17:55.854 --> 0:17:59.265
map them into an aligned space.
0:18:02.222 --> 0:18:07.097
The question is how can we evaluate this mapping?
0:18:07.097 --> 0:18:13.923
Does anybody know what we can use to
evaluate the mapping,
0:18:13.923 --> 0:18:15.873
how good a word translation is?
0:18:28.969 --> 0:18:33.538
As I said, we use a dictionary, at least
in the end.
0:18:33.538 --> 0:18:40.199
We need a dictionary to evaluate, but this
is only for the final evaluation; we aren't using it at
0:18:40.199 --> 0:18:42.600
all in the training data.
0:18:43.223 --> 0:18:49.681
One way is to check what's the precision against
our dictionary, just that:
0:18:50.650 --> 0:18:52.813
you take the first nearest neighbor and see if the
translation is there.
0:18:53.573 --> 0:18:56.855
But this is quite strict because there's a
lot of noise in the embedding space, right.
0:18:57.657 --> 0:19:03.114
Not always is your first neighbor going to
be the translation, so what people also report
0:19:03.114 --> 0:19:05.055
is precision at five and so on.
0:19:05.055 --> 0:19:10.209
So you take the five nearest neighbors and see
if the translation is in there and so on.
0:19:10.209 --> 0:19:15.545
So the more you increase it, the more likely
it is that the translation is in there, because word embeddings
0:19:15.545 --> 0:19:16.697
are quite noisy.
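A minimal sketch of that evaluation (src_emb and tgt_emb are hypothetical numpy arrays of mapped source and original target embeddings; test_dict maps a source word index to the set of acceptable target indices):

import numpy as np

def precision_at_k(src_emb, tgt_emb, test_dict, k=5):
    # Cosine similarity between every mapped source word and every target word.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = s @ t.T
    hits = 0
    for src_idx, gold in test_dict.items():
        topk = np.argsort(-sims[src_idx])[:k]         # the k nearest target neighbours
        hits += bool(set(topk.tolist()) & set(gold))  # count it if any gold translation shows up
    return hits / len(test_dict)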
0:19:19.239 --> 0:19:25.924
What's interesting is that people have used
dictionary to to learn word translation, but
0:19:25.924 --> 0:19:32.985
the way of doing this is much better than using
a dictionary, so somehow our assumption helps
0:19:32.985 --> 0:19:36.591
us to to build better than a supervised system.
0:19:39.099 --> 0:19:42.985
So as you see on the top you have precision
at one, five, ten.
0:19:42.985 --> 0:19:47.309
These are the typical numbers that you report
for word translation.
0:19:48.868 --> 0:19:55.996
But GANs are usually quite tricky to train,
and it does not converge on all language pairs,
0:19:55.996 --> 0:20:02.820
and this kind of goes back to our assumption
that the languages kind of have the same structure,
0:20:02.820 --> 0:20:03.351
right.
0:20:03.351 --> 0:20:07.142
But if you take a language like English and
some.
0:20:07.087 --> 0:20:12.203
other language that is very low-resource, so
it's quite different from English and so on.
0:20:12.203 --> 0:20:13.673
then it becomes difficult.
0:20:13.673 --> 0:20:18.789
So whenever our assumption fails,
these unsupervised techniques always do not
0:20:18.789 --> 0:20:21.199
converge or just give really bad scores.
0:20:22.162 --> 0:20:27.083
And so the fact is that the monolingual embeddings
for distant languages are too far apart.
0:20:27.083 --> 0:20:30.949
They do not share the same structure, and
so they do not converge.
0:20:32.452 --> 0:20:39.380
And so I just want to mention that there is
a better retrieval technique than the nearest
0:20:39.380 --> 0:20:41.458
neighbor, which is called CSLS.
0:20:42.882 --> 0:20:46.975
But it's a bit more advanced mathematically,
so I didn't want to go into it now.
0:20:46.975 --> 0:20:51.822
But if your interest is in some quite good
retrieval techniques, you can just look at this
0:20:51.822 --> 0:20:53.006
if you're interested.
0:20:55.615 --> 0:20:59.241
Okay, so this is about the the word translation.
0:20:59.241 --> 0:21:02.276
Does anybody have any questions so far?
0:21:06.246 --> 0:21:07.501
That was the word translation.
0:21:07.501 --> 0:21:12.580
It was a bit easier than a sentence right,
so you just assume that there's a mapping and
0:21:12.580 --> 0:21:14.577
then you try to learn the mapping.
0:21:14.577 --> 0:21:19.656
But now it's a bit more difficult because
you need to generate stuff also, which is quite
0:21:19.656 --> 0:21:20.797
a lot trickier.
0:21:22.622 --> 0:21:28.512
The task here is that we have our input as monolingual
data for both languages as before, but
0:21:28.512 --> 0:21:34.017
now what we want to do is instead of translating
word by word we want to do sentence translation.
0:21:37.377 --> 0:21:44.002
We have word2vec and so on to learn
word embeddings, but sentence embeddings were
0:21:44.002 --> 0:21:50.627
actually not that powerful, at least
when people tried to work on unsupervised
0:21:50.627 --> 0:21:51.445
MT before.
0:21:52.632 --> 0:21:54.008
Now they're a bit okay.
0:21:54.008 --> 0:21:59.054
I mean, as you've seen in the practical
where we used LASER, they were quite decent.
0:21:59.054 --> 0:22:03.011
But then it's also the case on which data
it's trained on and so on.
0:22:03.011 --> 0:22:03.240
So.
0:22:04.164 --> 0:22:09.666
Sentence embeddings are definitely much
harder to get than word embeddings, so this
0:22:09.666 --> 0:22:13.776
is a bit more complicated than the task that
you've seen before.
0:22:16.476 --> 0:22:18.701
Before we go into how U.
0:22:18.701 --> 0:22:18.968
N.
0:22:18.968 --> 0:22:19.235
M.
0:22:19.235 --> 0:22:19.502
T.
0:22:19.502 --> 0:22:24.485
Works, so this is your typical supervised
system right.
0:22:24.485 --> 0:22:29.558
So we have parallel data: source sentences, target
sentences.
0:22:29.558 --> 0:22:31.160
We have a source encoder.
0:22:31.471 --> 0:22:36.709
We have a target decoder and then we try to
minimize the cross-entropy loss on this parallel
0:22:36.709 --> 0:22:37.054
data.
0:22:37.157 --> 0:22:39.818
And this is how we train our typical system.
0:22:43.583 --> 0:22:49.506
But now we do not have any parallel data,
and so the intuition here is that if we can
0:22:49.506 --> 0:22:55.429
learn language independent representations
at the encoder outputs, then we can pass
0:22:55.429 --> 0:22:58.046
it along to the decoder that we want.
0:22:58.718 --> 0:23:03.809
It's going to get more clear in the future,
but I'm trying to give a bit more intuition
0:23:03.809 --> 0:23:07.164
before I'm going to show you all the training
objectives.
0:23:08.688 --> 0:23:15.252
So I assume that we have these different encoders
right, so it's not only two, you have a bunch
0:23:15.252 --> 0:23:21.405
of different source language encoders, a bunch
of different target language decoders, and
0:23:21.405 --> 0:23:26.054
also I assume that the encoder outputs are in the same
representation space.
0:23:26.706 --> 0:23:31.932
If you give a sentence in English and the
same sentence in German, the embeddings are
0:23:31.932 --> 0:23:38.313
quite the same, like multilingual sentence embeddings,
right, and so then what we can do is, depending
0:23:38.313 --> 0:23:42.202
on the language we want, pass it to the
appropriate decoder.
0:23:42.682 --> 0:23:50.141
And so the kind of goal here is to find out
a way to create language independent representations
0:23:50.141 --> 0:23:52.909
and then pass it to the decoder we want.
0:23:54.975 --> 0:23:59.714
Just keep in mind that you're trying to do
language independent for some reason, but it's
0:23:59.714 --> 0:24:02.294
going to be more clear once we see how it works.
0:24:05.585 --> 0:24:12.845
So in total we have three objectives that
we're going to try to train in our systems,
0:24:12.845 --> 0:24:16.981
so this is and all of them use monolingual
data.
0:24:17.697 --> 0:24:19.559
So there's no parallel data at all.
0:24:19.559 --> 0:24:24.469
The first one is denoising auto-encoding,
so it's more like you add noise to
0:24:24.469 --> 0:24:27.403
the sentence, and then reconstruct the original.
0:24:28.388 --> 0:24:34.276
Then we have the on-the-fly back translation,
so this is where you take a sentence, generate
0:24:34.276 --> 0:24:39.902
a translation, and then learn the reverse
direction, which I'm going to show in pictures
0:24:39.902 --> 0:24:45.725
later, and then we have an adversarial
training step to learn the language independent
0:24:45.725 --> 0:24:46.772
representation.
0:24:47.427 --> 0:24:52.148
So somehow, by filling in these three tasks,
or training on these three tasks,
0:24:52.148 --> 0:24:54.728
we somehow get an unsupervised M
0:24:54.728 --> 0:24:54.917
T.
0:24:56.856 --> 0:25:02.964
OK, so the first thing we're going to do is denoising
auto-encoding, right, so as I said we add
0:25:02.964 --> 0:25:06.295
noise to the sentence, so we take our sentence.
0:25:06.826 --> 0:25:09.709
And then there are different ways to add noise.
0:25:09.709 --> 0:25:11.511
You can shuffle words around.
0:25:11.511 --> 0:25:12.712
You can drop words.
0:25:12.712 --> 0:25:18.298
Do whatever you want to do as long as there's
enough information to reconstruct the original
0:25:18.298 --> 0:25:18.898
sentence.
0:25:19.719 --> 0:25:25.051
And then we assume that the noisy one and
the original one are parallel data and train
0:25:25.051 --> 0:25:26.687
similar to the supervised.
0:25:28.168 --> 0:25:30.354
So we have a source sentence.
0:25:30.354 --> 0:25:32.540
We have a noisy source right.
0:25:32.540 --> 0:25:37.130
So here what basically happened is that the
word got shuffled.
0:25:37.130 --> 0:25:39.097
One word is dropped right.
0:25:39.097 --> 0:25:41.356
So this is the noisy source.
0:25:41.356 --> 0:25:47.039
And then we treat the noisy source and the original
source as a sentence pair, basically.
0:25:49.009 --> 0:25:53.874
We train it by optimizing the cross-entropy
loss, similar to the supervised case.
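A minimal sketch of one possible noise function (local word shuffling plus word dropping; the exact noise model differs between papers):

import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    # Drop some words at random, but keep at least one token.
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # Shuffle words only within a limited window around their original position.
    keyed = [(i + random.uniform(0, shuffle_window), t) for i, t in enumerate(kept)]
    return [t for _, t in sorted(keyed)]

source = "the cat sat on the mat".split()
training_pair = (add_noise(source), source)  # (noisy source, original source)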
0:25:57.978 --> 0:26:03.211
Basically a picture to show what's happening:
we have the noisy source,
0:26:03.163 --> 0:26:09.210
the noisy target, and then we have the reconstructed
original source and original target, and since
0:26:09.210 --> 0:26:14.817
the languages are different we have our source
encoder, target encoder, source decoder and target decoder.
0:26:17.317 --> 0:26:20.202
And for this task we only need monolingual
data.
0:26:20.202 --> 0:26:25.267
We don't need any parallel data because it's
just taking a sentence and shuffling it and
0:26:25.267 --> 0:26:27.446
reconstructing the original one.
0:26:28.848 --> 0:26:31.058
And we have four different blocks.
0:26:31.058 --> 0:26:36.841
This is kind of very important to keep in
mind on how we change these connections later.
0:26:41.121 --> 0:26:49.093
Then this is more like the mathematical formulation
where you predict the source given the noisy source.
0:26:52.492 --> 0:26:55.090
So that was the denoising auto-encoding.
0:26:55.090 --> 0:26:58.403
The second step is on-the-fly back translation.
0:26:59.479 --> 0:27:06.386
So what we do is, we put our model in inference
mode, right, we take a source sentence,
0:27:06.386 --> 0:27:09.447
and we generate a translation.
0:27:09.829 --> 0:27:18.534
It might be completely wrong or maybe partially
correct or so on, but we assume that the model
0:27:18.534 --> 0:27:20.091
knows what it is doing and
0:27:20.680 --> 0:27:25.779
generates t-hat, right, and then what we do
is assume that t-hat, or not assume, but t-hat
0:27:25.779 --> 0:27:27.572
and S are a sentence pair, right.
0:27:27.572 --> 0:27:29.925
That's how we can learn the translation.
0:27:30.530 --> 0:27:38.824
So we train a supervised system on this sentence
pair, so we do inference and then train the reverse
0:27:38.824 --> 0:27:39.924
translation.
0:27:42.442 --> 0:27:49.495
To be a bit more concrete: we have a source
sentence, right, then we generate the translation,
0:27:49.495 --> 0:27:55.091
then we give the generated translation as an
input and try to predict the original source.
0:27:58.378 --> 0:28:03.500
This is how we would do in practice right,
so note, before, the source encoder was connected
0:28:03.500 --> 0:28:08.907
to the source decoder, but now we interchanged
connections, so the source encoder is connected
0:28:08.907 --> 0:28:10.216
to the target decoder.
0:28:10.216 --> 0:28:13.290
The target encoder is connected to the source
decoder.
0:28:13.974 --> 0:28:20.747
And given s we get t-hat and given t we get
s-hat, so this is the first time.
0:28:21.661 --> 0:28:24.022
On the second time step, what you're going
to do is reverse.
0:28:24.664 --> 0:28:32.625
So s-hat is here, t-hat is here, and given
s hat we are trying to predict t, and given
0:28:32.625 --> 0:28:34.503
t-hat we are trying to predict s.
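One round of this on-the-fly back translation could look like the sketch below (model_s2t, model_t2s, their generate methods, and train_step are hypothetical stand-ins for whatever seq2seq implementation is used):

import torch

def backtranslation_round(model_s2t, model_t2s, mono_src_batch, mono_tgt_batch, train_step):
    # Inference mode: generate synthetic translations with the current models.
    with torch.no_grad():
        t_hat = model_s2t.generate(mono_src_batch)   # source -> synthetic target
        s_hat = model_t2s.generate(mono_tgt_batch)   # target -> synthetic source

    # Treat (synthetic input, original monolingual sentence) as a parallel pair and
    # take a normal supervised cross-entropy step in the reverse direction.
    train_step(model_t2s, src=t_hat, tgt=mono_src_batch)   # predict s given t-hat
    train_step(model_s2t, src=s_hat, tgt=mono_tgt_batch)   # predict t given s-hat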
0:28:36.636 --> 0:28:39.386
Is this clear you have any questions on?
0:28:45.405 --> 0:28:50.823
A bit more mathematically, we try to minimize the
cross-entropy given t-hat and s-hat, so it's always the
0:28:50.823 --> 0:28:53.963
supervised NMT technique that we are trying
to do.
0:28:53.963 --> 0:28:59.689
But we're trying to create this synthetic
data that kind of helps to build an unsupervised
0:28:59.689 --> 0:29:00.181
system.
0:29:02.362 --> 0:29:08.611
Now also what maybe you can see here is that
if the source encoder and target encoder outputs
0:29:08.611 --> 0:29:14.718
are language independent, we can always switch
the connections and get the translations.
0:29:14.718 --> 0:29:21.252
That's why it was important to find a way
to generate language independent representations.
0:29:21.441 --> 0:29:26.476
And the way we try to force this language
independence is the gan step.
0:29:27.627 --> 0:29:34.851
So the third step kind of combines all of
them, and is where we try to use a GAN to make the
0:29:34.851 --> 0:29:37.959
encoder output language independent.
0:29:37.959 --> 0:29:42.831
So here it's the same picture but from a different
paper.
0:29:42.831 --> 0:29:43.167
So.
0:29:43.343 --> 0:29:48.888
We have X-source and X-target, which is monolingual
data.
0:29:48.888 --> 0:29:50.182
We add noise.
0:29:50.690 --> 0:29:54.736
Then we encode it using the source and the
target encoders right.
0:29:54.736 --> 0:29:58.292
Then we get the latent space Z source and
Z target right.
0:29:58.292 --> 0:30:03.503
Then we decode and try to reconstruct the
original one and this is the auto encoding
0:30:03.503 --> 0:30:08.469
loss which takes the X source which is the
original one and then the translated.
0:30:08.468 --> 0:30:09.834
Predicted output.
0:30:09.834 --> 0:30:16.740
So all of this is the auto-encoding step;
where the GAN comes in is in between the encoder
0:30:16.740 --> 0:30:24.102
outputs, and here we have a discriminator
which tries to predict which language the latent
0:30:24.102 --> 0:30:25.241
space is from.
0:30:26.466 --> 0:30:33.782
So given Z source it has to predict that the
representation is from a language source and
0:30:33.782 --> 0:30:39.961
given Z target it has to predict the representation
from a language target.
0:30:40.520 --> 0:30:45.135
And our encoders are kind of the generators
here, and then we have a separate
0:30:45.135 --> 0:30:49.803
network, the discriminator, which tries to predict
which language the latent spaces are from.
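A condensed sketch of that adversarial part (hypothetical sizes and names; z_src and z_tgt stand for pooled encoder outputs of a source and a target batch, and the two losses are used with separate optimizers so each one only updates its own side):

import torch
import torch.nn as nn

hidden = 512  # assumed encoder output size
lang_discriminator = nn.Sequential(nn.Linear(hidden, 256), nn.LeakyReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(z_src, z_tgt):
    # Discriminator loss: learn to tell which language a latent vector came from.
    d_loss = bce(lang_discriminator(z_src.detach()), torch.ones(len(z_src), 1)) + \
             bce(lang_discriminator(z_tgt.detach()), torch.zeros(len(z_tgt), 1))
    # Encoder loss: fool the discriminator, pushing the two latent spaces together.
    g_loss = bce(lang_discriminator(z_src), torch.zeros(len(z_src), 1)) + \
             bce(lang_discriminator(z_tgt), torch.ones(len(z_tgt), 1))
    return d_loss, g_loss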
0:30:53.393 --> 0:30:57.611
And then this one is when we combine the GAN
with the auto-encoding step.
0:30:57.611 --> 0:31:02.767
Then we had an on the fly back translation
step right, and so here what we're trying to
0:31:02.767 --> 0:31:03.001
do.
0:31:03.863 --> 0:31:07.260
Is the same, basically just exactly the same.
0:31:07.260 --> 0:31:12.946
But when we are doing the training, we are
adding the adversarial loss here, so:
0:31:13.893 --> 0:31:20.762
We take our X source, generate an intermediate
translation, so Y target and Y source, right?
0:31:20.762 --> 0:31:27.342
This is the previous time step, and then we
have to encode the new sentences and basically
0:31:27.342 --> 0:31:32.764
make them language independent or train to
make them language independent.
0:31:33.974 --> 0:31:43.502
And then the hope is that now if we do this
using monolingual data alone we can just switch
0:31:43.502 --> 0:31:47.852
connections and then get our translation.
0:31:47.852 --> 0:31:49.613
So that's the idea.
0:31:54.574 --> 0:32:03.749
And so as I said before, GANs are quite good
for vision right, so this is kind of like the
0:32:03.749 --> 0:32:11.312
CycleGAN approach that you might have seen
in any computer vision course.
0:32:11.911 --> 0:32:19.055
Somehow for text that didn't work out, at least not as
promising as for images, and so people,
0:32:19.055 --> 0:32:23.706
What they did is to enforce this language
independence.
0:32:25.045 --> 0:32:31.226
They try to use a shared encoder instead of
having these different encoders right, and
0:32:31.226 --> 0:32:37.835
so this is basically the same training objectives
as before, but what you're going to do now
0:32:37.835 --> 0:32:43.874
is learn cross-lingual embeddings and then use
the single encoder for both languages.
0:32:44.104 --> 0:32:49.795
And this kind also forces them to be in the
same space, and then you can choose whichever
0:32:49.795 --> 0:32:50.934
decoder you want.
0:32:52.552 --> 0:32:58.047
You can use GANs or you can just use a shared
encoder and try to build your unsupervised
0:32:58.047 --> 0:32:58.779
MT system.
0:33:08.488 --> 0:33:09.808
These are now the.
0:33:09.808 --> 0:33:15.991
The enhancements that you can do on top of
your unsupervised system: one, you can create
0:33:15.991 --> 0:33:16.686
a shared encoder.
0:33:18.098 --> 0:33:22.358
On top of the shared encoder you can add
your GAN loss or whatever, so there's a lot
0:33:22.358 --> 0:33:22.550
of.
0:33:24.164 --> 0:33:29.726
The other thing that is more relevant right
now is that you can create parallel data by
0:33:29.726 --> 0:33:35.478
word to word translation right because you
know how to do unsupervised word translation.
0:33:36.376 --> 0:33:40.548
First step is to create parallel data, assuming
that word translations are quite good.
0:33:41.361 --> 0:33:47.162
And then you train a supervised NMT
model on this most likely wrong parallel data,
0:33:47.162 --> 0:33:50.163
but somehow gives you a good starting point.
0:33:50.163 --> 0:33:56.098
So you build your supervised NMT system
on the word translation data, and then you
0:33:56.098 --> 0:33:59.966
initialize it before you're doing unsupervised
NMT.
0:34:00.260 --> 0:34:05.810
And the hope is that when you're doing the
back translation, it's a good starting
0:34:05.810 --> 0:34:11.234
point, but it's one technique that you can
use to improve your unsupervised MT system.
0:34:17.097 --> 0:34:25.879
In the previous case we had: The way we know
when to stop was to see convergence on the GAN
0:34:25.879 --> 0:34:26.485
training.
0:34:26.485 --> 0:34:28.849
Actually, all we want to do is wait until W
0:34:28.849 --> 0:34:32.062
converges, which is quite easy to know when
to stop.
0:34:32.062 --> 0:34:37.517
But in a realistic case, we don't have any
parallel data right, so there's no validation.
0:34:37.517 --> 0:34:42.002
Or I mean, we might have test data in the
end, but there's no validation.
0:34:43.703 --> 0:34:48.826
How will we tune our hyper parameters in this
case because it's not really there's nothing
0:34:48.826 --> 0:34:49.445
for us to?
0:34:50.130 --> 0:34:53.326
Or the gold data in a sense like so.
0:34:53.326 --> 0:35:01.187
How do you think we can evaluate such systems
or how can we tune hyper parameters in this?
0:35:11.711 --> 0:35:17.089
So what you're going to do is use the back
translation technique.
0:35:17.089 --> 0:35:24.340
It's like a common technique where you have
nothing okay that is to use back translation
0:35:24.340 --> 0:35:26.947
somehow and what you can do is.
0:35:26.947 --> 0:35:31.673
The main idea is validate on how good the
reconstruction is.
0:35:32.152 --> 0:35:37.534
So the idea is that if you have a good system
then the intermediate translation is quite
0:35:37.534 --> 0:35:39.287
good and going back is easy.
0:35:39.287 --> 0:35:44.669
But if it's just noise that you generate in
the forward step then it's really hard to go
0:35:44.669 --> 0:35:46.967
back, which is kind of the main idea.
0:35:48.148 --> 0:35:53.706
So the way it works is that we take a source
sentence, we generate a translation in target
0:35:53.706 --> 0:35:59.082
language, right, and then translate back the
generated sentence and compare it with the
0:35:59.082 --> 0:36:01.342
original one, and if they're closer.
0:36:01.841 --> 0:36:09.745
It means that we have a good system, and if
they are far apart, we don't; so this is kind of like an unsupervised
0:36:09.745 --> 0:36:10.334
validation criterion.
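A sketch of that unsupervised model-selection criterion (assuming sacrebleu is available; translate_s2t and translate_t2s are the current models' hypothetical inference functions):

import sacrebleu

def round_trip_score(mono_src_sentences, translate_s2t, translate_t2s):
    # Translate into the other language and back, using no parallel data at all.
    forward = [translate_s2t(s) for s in mono_src_sentences]
    back = [translate_t2s(t) for t in forward]
    # Compare the reconstruction with the originals; a higher score suggests a better system.
    return sacrebleu.corpus_bleu(back, [mono_src_sentences]).score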
0:36:17.397 --> 0:36:21.863
As far as the amount of data that you need.
0:36:23.083 --> 0:36:27.995
These were like the first initial results
on these systems:
0:36:27.995 --> 0:36:32.108
They wanted to do English and French and they
had fifteen million.
0:36:32.108 --> 0:36:38.003
There were fifteen million monolingual sentences,
so it's quite a lot and they were able to get
0:36:38.003 --> 0:36:40.581
thirty-two BLEU on these kinds of setups.
0:36:41.721 --> 0:36:47.580
But unsurprisingly if you have zero point
one million parallel sentences you get the same
0:36:47.580 --> 0:36:48.455
performance.
0:36:48.748 --> 0:36:50.357
So it's a lot of training.
0:36:50.357 --> 0:36:55.960
It's a lot of monolingual data, but monolingual
data is relatively easy to obtain; the thing is
0:36:55.960 --> 0:37:01.264
that the training is also quite longer than
the supervised system, but it's unsupervised
0:37:01.264 --> 0:37:04.303
so it's kind of the trade off that you are
making.
0:37:07.367 --> 0:37:13.101
The other thing to note is that it's English
and French, which is very close to our assumptions.
0:37:13.101 --> 0:37:18.237
Also, the monolingual data that they took
are kind of from similar domains and so on.
0:37:18.638 --> 0:37:27.564
So that's why they're able to build such a
good system, but you'll see later that it fails.
0:37:36.256 --> 0:37:46.888
Yes, and so I mean what people usually do
is first build a system right using whatever
0:37:46.888 --> 0:37:48.110
parallel data they have.
0:37:48.608 --> 0:37:55.864
Then they use monolingual data and do back
translation, so this has always been the standard
0:37:55.864 --> 0:38:04.478
way to to improve, and what people have seen
is that: You don't even need zero point one
0:38:04.478 --> 0:38:05.360
million right.
0:38:05.360 --> 0:38:10.706
You just need like ten thousand or so on and
then you do the monolingual back translation
0:38:10.706 --> 0:38:12.175
and you're still better.
0:38:12.175 --> 0:38:13.291
than unsupervised MT.
0:38:13.833 --> 0:38:19.534
The question is it's really worth trying to
to do this or maybe it's always better to find
0:38:19.534 --> 0:38:20.787
some parallel data.
0:38:20.787 --> 0:38:26.113
Or spend a bit of money on getting a little
parallel data and then use it to start and
0:38:26.113 --> 0:38:27.804
fine-tune to build your system.
0:38:27.804 --> 0:38:33.756
So it was kind of the understanding that bilingual
unsupervised systems are not really that useful.
0:38:50.710 --> 0:38:54.347
The thing is that with unlabeled data.
0:38:57.297 --> 0:39:05.488
there's not really a training signal, so when we are
starting basically what we want to do is first
0:39:05.488 --> 0:39:13.224
get a good translation system and then use
an unlabeled monolingual data to improve.
0:39:13.613 --> 0:39:15.015
But if you start from U.
0:39:15.015 --> 0:39:15.183
N.
0:39:15.183 --> 0:39:20.396
MT, our model might be really bad, like it
would be somewhere translating completely wrong.
0:39:20.760 --> 0:39:26.721
And then when you fine-tune on your unlabeled data,
it basically might be harming, or maybe the
0:39:26.721 --> 0:39:28.685
same as the supervised baseline.
0:39:28.685 --> 0:39:35.322
So the hope is, by fine-tuning on
labeled data first, to get a good initialization.
0:39:35.835 --> 0:39:38.404
And then use the unsupervised techniques to
get better.
0:39:38.818 --> 0:39:42.385
But if your starting point is really bad then
it's not going to help.
0:39:45.185 --> 0:39:47.324
Year so as we said before.
0:39:47.324 --> 0:39:52.475
This is kind of like the self supervised training
usually works.
0:39:52.475 --> 0:39:54.773
First we have parallel data.
0:39:56.456 --> 0:39:58.062
Source language is X.
0:39:58.062 --> 0:39:59.668
Target language is Y.
0:39:59.668 --> 0:40:06.018
In the end we want a system that does X to
Y, not Y to X, but first we want to train a
0:40:06.018 --> 0:40:10.543
backward model as it is Y to X, so target language
to source.
0:40:11.691 --> 0:40:17.353
Then we take our monolingual target
sentences, use our backward model to generate
0:40:17.353 --> 0:40:21.471
synthetic source, and then we join them with
our original data.
0:40:21.471 --> 0:40:27.583
So now we have this noisy input, but always
the gold output, which is kind of really important
0:40:27.583 --> 0:40:29.513
when you're doing back translation.
0:40:30.410 --> 0:40:36.992
And then you can concatenate these two datasets
and then you can train your X-to-Y translation
0:40:36.992 --> 0:40:44.159
system and then you can always do this in multiple
steps and usually three, four steps which kind
0:40:44.159 --> 0:40:48.401
of improves always and then finally get your
best system.
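As a plain-data sketch of that pipeline (train_model and translate are stand-ins for whatever NMT toolkit is used; parallel_src, parallel_tgt and mono_tgt are hypothetical lists of sentences):

# 1. Train the backward model on the available parallel data (target -> source).
backward = train_model(src=parallel_tgt, tgt=parallel_src)

# 2. Back-translate monolingual target sentences into synthetic source sentences.
synthetic_src = [translate(backward, t) for t in mono_tgt]

# 3. Join the synthetic pairs (noisy input, gold output) with the original parallel data.
train_src = parallel_src + synthetic_src
train_tgt = parallel_tgt + mono_tgt

# 4. Train the forward model (source -> target); repeat steps 1-4 for a few rounds if desired.
forward = train_model(src=train_src, tgt=train_tgt)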
0:40:49.029 --> 0:40:54.844
The point that I'm trying to make is that
although for unsupervised NMT the scores that I've
0:40:54.844 --> 0:41:00.659
shown before were quite good, you probably
can get the same performance with with fifty
0:41:00.659 --> 0:41:06.474
thousand sentences, and also the languages
that they've shown are quite similar and the
0:41:06.474 --> 0:41:08.654
texts were from the same domain.
0:41:14.354 --> 0:41:21.494
So, any questions on UNMT? Okay, yeah.
0:41:22.322 --> 0:41:28.982
So after this finding that back translation was already
better than unsupervised NMT, what people have tried
0:41:28.982 --> 0:41:34.660
is to use this idea of multilinguality as you
have seen in the previous lecture.
0:41:34.660 --> 0:41:41.040
The question is how can we do this knowledge
transfer from high resource language to lower
0:41:41.040 --> 0:41:42.232
source language?
0:41:44.484 --> 0:41:51.074
One way to promote this language independent
representations is to share the encoder and
0:41:51.074 --> 0:41:57.960
decoder for all languages, all their available
languages, and that kind of hopefully enables
0:41:57.960 --> 0:42:00.034
the the knowledge transfer.
0:42:03.323 --> 0:42:08.605
When we're doing multilinguality, the two
questions we need to to think of is how does
0:42:08.605 --> 0:42:09.698
the encoder know?
0:42:09.698 --> 0:42:14.495
How does the encoder or decoder know which language
we're dealing with?
0:42:15.635 --> 0:42:20.715
You already might have known the answer also,
and the second question is how can we promote
0:42:20.715 --> 0:42:24.139
the encoder to generate language independent
representations?
0:42:25.045 --> 0:42:32.580
By solving these two problems we can take
help of high resource languages to do unsupervised
0:42:32.580 --> 0:42:33.714
translations.
0:42:34.134 --> 0:42:40.997
A typical example would be: you want to do unsupervised MT
between English and Dutch, right, but you have
0:42:40.997 --> 0:42:47.369
parallel data between English and German, so
the question is can we use this parallel data
0:42:47.369 --> 0:42:51.501
to help build an unsupervised system between English
and Dutch?
0:42:56.296 --> 0:43:01.240
For the first one we try to take help of language
embeddings for tokens, and this kind of is
0:43:01.240 --> 0:43:05.758
a straightforward way to tell the
model which language it's dealing with.
0:43:06.466 --> 0:43:11.993
And for the second one we're going to look
at some pre training objectives which are also
0:43:11.993 --> 0:43:17.703
kind of unsupervised so we need monolingual
data mostly and this kind of helps us to promote
0:43:17.703 --> 0:43:20.221
the language independent representation.
0:43:23.463 --> 0:43:29.954
So the first pre-training objective that we'll
look at is XLM, which is quite famous, if
0:43:29.954 --> 0:43:32.168
you haven't heard of it yet.
0:43:32.552 --> 0:43:40.577
And: The way it works is that it's basically
a transformer encoder right, so it's like the
0:43:40.577 --> 0:43:42.391
just the encoder module.
0:43:42.391 --> 0:43:44.496
No, there's no decoder here.
0:43:44.884 --> 0:43:51.481
And what we're trying to do is mask two tokens
in a sequence and try to predict these mask
0:43:51.481 --> 0:43:52.061
tokens.
0:43:52.061 --> 0:43:55.467
So it is called masked language modeling.
0:43:55.996 --> 0:44:05.419
Typical language modeling that you see is
the standard language modeling, where you predict
0:44:05.419 --> 0:44:08.278
the next token in English.
0:44:08.278 --> 0:44:11.136
Then we have the position embeddings.
0:44:11.871 --> 0:44:18.774
Then we have the token embeddings, and then
here we have the mask token, and then we have
0:44:18.774 --> 0:44:22.378
the transformer encoder blocks to predict the masked token.
0:44:24.344 --> 0:44:30.552
We do this for all languages using the same
transformer encoder, and this kind of helps
0:44:30.552 --> 0:44:36.760
us to push the sentence embeddings, or
the output of the encoder, into a common space
0:44:36.760 --> 0:44:37.726
for multiple languages.
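A minimal sketch of the masking itself (the usual recipe masks around 15% of the tokens; refinements such as the 80/10/10 replacement split are left out):

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)   # the encoder sees the mask token here...
            labels.append(tok)    # ...and is trained to predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # no loss on unmasked positions
    return inputs, labels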
0:44:42.782 --> 0:44:49.294
So first we train an MLM on both source, both
source and target language sides, and then
0:44:49.294 --> 0:44:54.928
we use it as a starting point for the encoder
and decoder of a UNMT system.
0:44:55.475 --> 0:45:03.175
So we take the monolingual data, build a masked
language model on both source and target languages,
0:45:03.175 --> 0:45:07.346
and then reuse it, or initialize that, in
the U.
0:45:07.346 --> 0:45:07.586
N.
0:45:07.586 --> 0:45:07.827
P.
0:45:07.827 --> 0:45:08.068
C.
0:45:09.009 --> 0:45:14.629
Here we look at two languages, but you can
also do it with one hundred languages once.
0:45:14.629 --> 0:45:20.185
So there are pretrained checkpoints that you can
use, which have seen quite
0:45:20.185 --> 0:45:21.671
a lot of data and use.
0:45:21.671 --> 0:45:24.449
It always has a starting point for your U.
0:45:24.449 --> 0:45:24.643
N.
0:45:24.643 --> 0:45:27.291
MT system, which in practice works well.
0:45:31.491 --> 0:45:36.759
This detail is that since this is an encoder
block only, and your U.
0:45:36.759 --> 0:45:36.988
N.
0:45:36.988 --> 0:45:37.217
M.
0:45:37.217 --> 0:45:37.446
T.
0:45:37.446 --> 0:45:40.347
system is encoder-decoder, right.
0:45:40.347 --> 0:45:47.524
So there's this cross attention that's missing,
but you can always initialize that part randomly.
0:45:47.524 --> 0:45:48.364
It's fine.
0:45:48.508 --> 0:45:53.077
Not everything is initialized, but it's still
decent.
0:45:56.056 --> 0:46:02.141
Then the other one we have is mBART,
and here you see that this kind of builds on
0:46:02.141 --> 0:46:07.597
the unsupervised training objective, which
is the denoising auto-encoding.
0:46:08.128 --> 0:46:14.337
So what they do is they say that we don't
even need to do the GAN or back translation;
0:46:14.337 --> 0:46:17.406
you can do it later, but for pre-training
0:46:17.406 --> 0:46:24.258
we just do denoising auto-encoding
on all different languages, and that also gives
0:46:24.258 --> 0:46:32.660
you good performance out of the box. So what
we basically have here is the transformer encoder-decoder.
0:46:34.334 --> 0:46:37.726
You are trying to generate a reconstructed
sequence.
0:46:37.726 --> 0:46:38.942
You need a decoder.
0:46:39.899 --> 0:46:42.022
So we gave an input sentence.
0:46:42.022 --> 0:46:48.180
We try to predict the masked tokens,
or we try to reconstruct the original
0:46:48.180 --> 0:46:52.496
sentence from the input sequence, which was
corrupted right.
0:46:52.496 --> 0:46:57.167
So this is the same denoising objective that
you have seen before.
0:46:58.418 --> 0:46:59.737
This is for English.
0:46:59.737 --> 0:47:04.195
I think this is for Japanese and then once
we do it for all languages.
0:47:04.195 --> 0:47:09.596
I mean they have these versions with twenty-five,
fifty languages or so on, and then you can
0:47:09.596 --> 0:47:11.794
fine-tune on your sentence- and document-level tasks.
0:47:13.073 --> 0:47:20.454
And so they did this for the supervised
techniques, but you can also use this as initializations
0:47:20.454 --> 0:47:25.058
for unsupervised and build up on that, which also
in practice works.
0:47:30.790 --> 0:47:36.136
Then we have these, so still now we kind of
didn't see the direct benefit from the
0:47:36.136 --> 0:47:38.840
high resource language right, so as I said.
0:47:38.878 --> 0:47:44.994
You can use English-German to help for English
to Dutch, and if you want English-Catalan, you
0:47:44.994 --> 0:47:46.751
can use English to French.
0:47:48.408 --> 0:47:55.866
One typical way to do this is to use pivot
translation, right, where you take the following:
0:47:55.795 --> 0:48:01.114
so here it's Finnish to Greek, so you translate,
say, from Finnish to English, then English
0:48:01.114 --> 0:48:03.743
to Greek, and then you get the translation.
0:48:04.344 --> 0:48:10.094
What's important is that you have these different
techniques and you can always think of which
0:48:10.094 --> 0:48:12.333
one to use given the data situation.
0:48:12.333 --> 0:48:18.023
So if it was like Finnish to Greek, maybe pivoting
is better because you might get good Finnish
0:48:18.023 --> 0:48:20.020
to English and English to Greek.
0:48:20.860 --> 0:48:23.255
Sometimes it also depends on the language
pair.
0:48:23.255 --> 0:48:27.595
There might be some information loss and so
on, so there are quite a few variables you
0:48:27.595 --> 0:48:30.039
need to think of and decide which system to
use.
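Pivoting is then just composing two supervised systems, roughly like this sketch (translate_fi_en and translate_en_el are hypothetical Finnish-English and English-Greek models):

def pivot_translate(sentence_fi, translate_fi_en, translate_en_el):
    # Finnish -> English -> Greek, with English as the pivot language.
    english = translate_fi_en(sentence_fi)
    return translate_en_el(english)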
0:48:32.752 --> 0:48:39.654
Then there's zero-shot, which you probably also
have seen in the multilingual lecture, and how
0:48:39.654 --> 0:48:45.505
if you can improve the language independence
then your zero shot gets better.
0:48:45.505 --> 0:48:52.107
So maybe if you use the multilingual models
and do zero shot directly, it's quite good.
0:48:53.093 --> 0:48:58.524
So we have zero-shot, pivoting, and then
we have the unsupervised translation where
0:48:58.524 --> 0:49:00.059
we can translate between them
0:49:00.600 --> 0:49:02.762
just when there is no parallel data.
0:49:06.686 --> 0:49:07.565
So, to sum up.
0:49:07.565 --> 0:49:11.959
Summing up what we have seen so far: we
basically have two monolingual files.
0:49:15.255 --> 0:49:16.754
What we are able to do, from looking at
0:49:16.836 --> 0:49:19.307
these two files alone, is create a dictionary.
0:49:19.699 --> 0:49:26.773
We can build an unsupervised MT system, not
always, but if the domains are similar and the
0:49:26.773 --> 0:49:28.895
languages are similar.
0:49:28.895 --> 0:49:36.283
But if they are distant languages, then the
unsupervised techniques usually don't work really
0:49:36.283 --> 0:49:36.755
well.
0:49:37.617 --> 0:49:40.297
What I would recommend
0:49:40.720 --> 0:49:46.338
would be is that if you can get some parallel
data from somewhere, or do bitext mining as
0:49:46.338 --> 0:49:51.892
we have seen in the LASER practical,
then you can use that to initialize your
0:49:51.892 --> 0:49:57.829
system and then train, let's say, a semi-supervised
MT system, and that would be better than
0:49:57.829 --> 0:50:00.063
just building an unsupervised one.
0:50:00.820 --> 0:50:06.546
With that, we are at the end.
0:50:07.207 --> 0:50:08.797
Any quick questions?
0:50:16.236 --> 0:50:25.070
[Inaudible audience question, presumably about large
language models and translation.]
0:50:25.070 --> 0:50:34.874
[Inaudible.]
0:50:34.874 --> 0:50:40.111
[Inaudible.]
0:50:56.916 --> 0:51:03.798
They do next-token prediction, and this somehow gives them
many abilities, not only translation but other
0:51:03.798 --> 0:51:08.062
than that there are quite a few things that
they can do.
0:51:10.590 --> 0:51:17.706
But for translation in itself, it usually doesn't
work really well compared to when you build a
0:51:17.706 --> 0:51:20.878
specific system for your case.
0:51:22.162 --> 0:51:27.924
I would guess that it's usually better than
the LLM, but you can always adapt the LLM to
0:51:27.924 --> 0:51:31.355
the task that you want, and then it could be
better.
0:51:32.152 --> 0:51:37.849
An LLM out of the box might not be the
best choice for your task.
0:51:37.849 --> 0:51:44.138
For me, I'm working on UI translation,
so it's more about translating software.
0:51:45.065 --> 0:51:50.451
And it's quite a niche domain as well,
and if you use the LLM out of the box, they're
0:51:50.451 --> 0:51:53.937
actually quite bad compared to the systems
that we built.
0:51:54.414 --> 0:51:56.736
But you can do these different techniques
like prompting.
0:51:57.437 --> 0:52:03.442
What people usually do is hard prompting,
where they give similar translation pairs in
0:52:03.442 --> 0:52:08.941
the prompt and then ask it to translate and
then that kind of improves the performance
0:52:08.941 --> 0:52:09.383
a lot.
0:52:09.383 --> 0:52:15.135
So there are different techniques that you
can use to adapt your LLMs and then it might
0:52:15.135 --> 0:52:16.399
be better than the.
0:52:16.376 --> 0:52:17.742
task-specific system.
0:52:18.418 --> 0:52:22.857
But if you're looking for niche things, I
don't think LLMs are that good.
0:52:22.857 --> 0:52:26.309
But if you want to do, let's say, unsupervised
translation:
0:52:26.309 --> 0:52:30.036
In this case you can never be sure that they
haven't seen the data.
0:52:30.036 --> 0:52:35.077
First of all, whether they have seen data in
that language or not, and in practice
0:52:35.077 --> 0:52:36.831
they probably did see the data.
0:52:40.360 --> 0:53:00.276
I feel like they have a pretty good understanding
of these languages, though.
0:53:04.784 --> 0:53:09.059
Depends on the language, but I'm pretty surprised
that it works on a low-resource language.
0:53:09.059 --> 0:53:11.121
I would expect it to work on German and.
0:53:11.972 --> 0:53:13.633
But if you take a low-resource language,
0:53:14.474 --> 0:53:20.973
I don't think it works, and also there are quite
a few papers where they've already showed that
0:53:20.973 --> 0:53:27.610
if you build a system yourself in the typical
way, it's quite a bit better than
0:53:27.610 --> 0:53:29.338
the LLM.
0:53:29.549 --> 0:53:34.883
But you can always do things with LLMs to
get better, but then I'm probably.
0:53:37.557 --> 0:53:39.539
Any more questions?
0:53:41.421 --> 0:53:47.461
So if not then we're going to end the lecture
here and then on Thursday we're going to have
0:53:47.461 --> 0:53:51.597
document-level MT, which is also given by me, so
thanks for coming.