WEBVTT 0:00:03.663 --> 0:00:07.970 Okay, then I should switch back to English, sorry,. 0:00:08.528 --> 0:00:18.970 So welcome to today's lecture in the cross machine translation and today we're planning 0:00:18.970 --> 0:00:20.038 to talk. 0:00:20.880 --> 0:00:31.845 Which will be without our summary of power translation was done from around till. 0:00:32.872 --> 0:00:38.471 Fourteen, so this was an approach which was quite long. 0:00:38.471 --> 0:00:47.070 It was the first approach where at the end the quality was really so good that it was 0:00:47.070 --> 0:00:49.969 used as a commercial system. 0:00:49.990 --> 0:00:56.482 Or something like that, so the first systems there was using the statistical machine translation. 0:00:57.937 --> 0:01:02.706 So when I came into the field this was the main part of the lecture, so there would be 0:01:02.706 --> 0:01:07.912 not be one lecture, but in more detail than half of the full course would be about statistical 0:01:07.912 --> 0:01:09.063 machine translation. 0:01:09.369 --> 0:01:23.381 So what we try to do today is like get the most important things, which think our part 0:01:23.381 --> 0:01:27.408 is still very important. 0:01:27.267 --> 0:01:31.196 Four State of the Art Box. 0:01:31.952 --> 0:01:45.240 Then we'll have the presentation about how to evaluate the other part of the machine translation. 0:01:45.505 --> 0:01:58.396 The other important thing is the language modeling part will explain later how they combine. 0:01:59.539 --> 0:02:04.563 Shortly mentioned this one already. 0:02:04.824 --> 0:02:06.025 On Tuesday. 0:02:06.246 --> 0:02:21.849 So in a lot of these explanations, how we model translation process, it might be surprising: 0:02:22.082 --> 0:02:27.905 Later some people say it's for four eight words traditionally came because the first models 0:02:27.905 --> 0:02:32.715 which you'll discuss here also when they are referred to as the IVM models. 0:02:32.832 --> 0:02:40.043 They were trained on French to English translation directions and that's why they started using 0:02:40.043 --> 0:02:44.399 F and E and then this was done for the next twenty years. 0:02:44.664 --> 0:02:52.316 So while we are trying to wait, the source words is: We have a big eye, typically the 0:02:52.316 --> 0:03:02.701 lengths of the sewer sentence in small eye, the position, and similarly in the target and 0:03:02.701 --> 0:03:05.240 the lengths of small. 0:03:05.485 --> 0:03:13.248 Things will get a bit complicated in this way because it is not always clear what is 0:03:13.248 --> 0:03:13.704 the. 0:03:14.014 --> 0:03:21.962 See that there is this noisy channel model which switches the direction in your model, 0:03:21.962 --> 0:03:25.616 but in the application it's the target. 0:03:26.006 --> 0:03:37.077 So that is why if you especially read these papers, it might sometimes be a bit disturbing. 0:03:37.437 --> 0:03:40.209 Try to keep it here always. 0:03:40.209 --> 0:03:48.427 The source is, and even if we use a model where it's inverse, we'll keep this way. 0:03:48.468 --> 0:03:55.138 Don't get disturbed by that, and I think it's possible to understand all that without this 0:03:55.138 --> 0:03:55.944 confusion. 0:03:55.944 --> 0:04:01.734 But in some of the papers you might get confused because they switched to the. 0:04:04.944 --> 0:04:17.138 In general, in statistics and machine translation, the goal is how we do translation. 0:04:17.377 --> 0:04:25.562 But first we are seeing all our possible target sentences as possible translations. 
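For reference, the notation that keeps coming up can be summarized as follows. This is a sketch of the assumed convention (I/i for the source side as stated above; the letters J/j for the target side are an assumption here and may differ on the slides):

```latex
f = f_1, \dots, f_I \quad \text{(source sentence, length } I,\ \text{positions } i\text{)}
\qquad
e = e_1, \dots, e_J \quad \text{(target sentence, length } J,\ \text{positions } j\text{)}
```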
0:04:26.726 --> 0:04:37.495 And we are assigning some probability to the combination, so we are modeling. 0:04:39.359 --> 0:04:49.746 And then we are doing a search over all possible things or at least theoretically, and we are 0:04:49.746 --> 0:04:56.486 trying to find the translation with the highest probability. 0:04:56.936 --> 0:05:05.116 And this general idea is also true for neuromachine translation. 0:05:05.116 --> 0:05:07.633 They differ in how. 0:05:08.088 --> 0:05:10.801 So these were then of course the two big challenges. 0:05:11.171 --> 0:05:17.414 On the one hand, how can we estimate this probability? 0:05:17.414 --> 0:05:21.615 How is the translation of the other? 0:05:22.262 --> 0:05:32.412 The other challenge is the search, so we cannot, of course, say we want to find the most probable 0:05:32.412 --> 0:05:33.759 translation. 0:05:33.759 --> 0:05:42.045 We cannot go over all possible English sentences and calculate the probability. 0:05:43.103 --> 0:05:45.004 So,. 0:05:45.165 --> 0:05:53.423 What we have to do there is some are doing intelligent search and look for the ones and 0:05:53.423 --> 0:05:54.268 compare. 0:05:54.734 --> 0:05:57.384 That will be done. 0:05:57.384 --> 0:06:07.006 This process of finding them is called the decoding process because. 0:06:07.247 --> 0:06:09.015 They will be covered well later. 0:06:09.015 --> 0:06:11.104 Today we will concentrate on the mile. 0:06:11.451 --> 0:06:23.566 The model is trained using data, so in the first step we're having data, we're somehow 0:06:23.566 --> 0:06:30.529 having a definition of what the model looks like. 0:06:34.034 --> 0:06:42.913 And in statistical machine translation the common model is behind. 0:06:42.913 --> 0:06:46.358 That is what is referred. 0:06:46.786 --> 0:06:55.475 And this is motivated by the initial idea from Shannon. 0:06:55.475 --> 0:07:02.457 We have this that you can think of decoding. 0:07:02.722 --> 0:07:10.472 So think of it as we have this text in maybe German. 0:07:10.472 --> 0:07:21.147 Originally it was an English text, but somebody used some nice decoding. 0:07:21.021 --> 0:07:28.579 Task is to decipher it again, this crazy cyborg expressing things in German, and to decipher 0:07:28.579 --> 0:07:31.993 the meaning again and doing that between. 0:07:32.452 --> 0:07:35.735 And that is the idea about this noisy channel when it. 0:07:36.236 --> 0:07:47.209 It goes through some type of channel which adds noise to the source and then you receive 0:07:47.209 --> 0:07:48.811 the message. 0:07:49.429 --> 0:08:00.190 And then the idea is, can we now construct the original message out of these messages 0:08:00.190 --> 0:08:05.070 by modeling some of the channels here? 0:08:06.726 --> 0:08:15.797 There you know to see a bit the surface of the source message with English. 0:08:15.797 --> 0:08:22.361 It went through some channel and received the message. 0:08:22.682 --> 0:08:31.381 If you're not looking at machine translation, your source language is English. 0:08:31.671 --> 0:08:44.388 Here you see now a bit of this where the confusion starts while English as a target language is 0:08:44.388 --> 0:08:47.700 also the source message. 0:08:47.927 --> 0:08:48.674 You can see. 0:08:48.674 --> 0:08:51.488 There is also a mathematics of how we model the. 0:08:52.592 --> 0:08:56.888 It's a noisy channel model from a mathematic point of view. 0:08:56.997 --> 0:09:00.245 So this is again our general formula. 
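The general formula referred to here, written out explicitly (a sketch in the notation above): we search for the target sentence with the highest conditional probability given the source sentence.

```latex
\hat{e} = \operatorname*{argmax}_{e} \; P(e \mid f)
```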
0:09:00.245 --> 0:12:35.094 We are looking for the most probable translation, that is, the translation that has the highest probability. We are not interested in the probability itself, but in the target sentence e for which this probability is highest. Therefore, we can use the definition of conditional probability and apply the Bayes rule, so this probability equals the probability of f given e, times the probability of e, divided by the probability of f. Now you see this confusion mathematically: originally we are interested in the probability of the target sentence given the source sentence, but if we are modeling things now, we are looking at the inverse direction, so the probability of f given e, that is the probability of the source sentence given the target sentence, times the probability of the target sentence, divided by the probability of the source sentence. Why are we doing this? On the one hand it is motivated by our model, by this noisy channel view of how we are modeling the process. The other interesting thing is that we can simplify: the probability in the denominator, the probability of f, we can remove. If we are searching for the best translation, this term is fixed; it doesn't change. We have an input, the source sentence, and we cannot change it, so its probability is always the same, and we can ignore it in the argmax because the denominator is exactly the same for every candidate. And then we have P of f given e times P of e. That means we are modeling the translation process on the one hand with the translation model, which models how probable the sentence f is given e, and on the other hand with the language model, which models only how probable this English sentence is, how likely it is that somebody wrote this sentence in that language. From the translation point of view, this part is about fluency. In German, for example, you should have agreement; if the agreement is not right, that is probably not said by anybody in German. Nobody would say something like "das schönstes Haus", with the wrong ending, because it is not according to the German rules. So this can be modeled by the language model, and you have the translation model, which models how things get translated between the languages. And here you see our confusion again: the translation model is now P of f given e, which is a bit counterintuitive, because it is the probability of the source sentence given the target sentence. We have to do that for the Bayes formula, but in the following slides I'll again talk about the intuitive direction.
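To summarize the derivation just walked through (a sketch; the denominator P(f) is constant for a fixed input and therefore drops out of the argmax):

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        = \operatorname*{argmax}_{e} \; \underbrace{P(f \mid e)}_{\text{translation model}} \cdot \underbrace{P(e)}_{\text{language model}}
```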
0:12:35.535 --> 0:12:45.414 Because yeah, that's more intuitive that you model the translation of the target sentence 0:12:45.414 --> 0:12:48.377 given the source sentence. 0:12:50.930 --> 0:12:55.668 And this is what we want to talk about today. 0:12:55.668 --> 0:13:01.023 We later talk about language models how to do that. 0:13:00.940 --> 0:13:04.493 And maybe also how to combine them. 0:13:04.493 --> 0:13:13.080 But the focus on today would be how can we model this probability to how to generate a 0:13:13.080 --> 0:13:16.535 translation from source to target? 0:13:19.960 --> 0:13:24.263 How can we do that and the easiest thing? 0:13:24.263 --> 0:13:33.588 Maybe if you think about statistics, you count how many examples you have, how many target 0:13:33.588 --> 0:13:39.121 sentences go occur, and that gives you an estimation. 0:13:40.160 --> 0:13:51.632 However, like in another model that is not possible because most sentences you will never 0:13:51.632 --> 0:13:52.780 see, so. 0:13:53.333 --> 0:14:06.924 So what we have to do is break up the translation process into smaller models and model each 0:14:06.924 --> 0:14:09.555 of the decisions. 0:14:09.970 --> 0:14:26.300 So this simple solution with how you throw a dice is like you have a and that gives you 0:14:26.300 --> 0:14:29.454 the probability. 0:14:29.449 --> 0:14:40.439 But here's the principle because each event is so rare that most of them never have helped. 0:14:43.063 --> 0:14:48.164 Although it might be that in all your training data you have never seen this title of set. 0:14:49.589 --> 0:14:52.388 How can we do that? 0:14:52.388 --> 0:15:04.845 We look in statistical machine translation into two different models, a generative model 0:15:04.845 --> 0:15:05.825 where. 0:15:06.166 --> 0:15:11.736 So the idea was to really model model like each individual translation between words. 0:15:12.052 --> 0:15:22.598 So you break down the translation of a full sentence into the translation of each individual's 0:15:22.598 --> 0:15:23.264 word. 0:15:23.264 --> 0:15:31.922 So you say if you have the black cat, if you translate it, the full sentence. 0:15:32.932 --> 0:15:38.797 Of course, this has some challenges, any ideas where this type of model could be very challenging. 0:15:40.240 --> 0:15:47.396 Vocabularies and videos: Yes, we're going to be able to play in the very color. 0:15:47.867 --> 0:15:51.592 Yes, but you could at least use a bit of the context around it. 0:15:51.592 --> 0:15:55.491 It will not only depend on the word, but it's already challenging. 0:15:55.491 --> 0:15:59.157 You make things very hard, so that's definitely one challenge. 0:16:00.500 --> 0:16:07.085 One other, what did you talk about that we just don't want to say? 0:16:08.348 --> 0:16:11.483 Yes, they are challenging. 0:16:11.483 --> 0:16:21.817 You have to do something like words, but the problem is that you might introduce errors. 0:16:21.841 --> 0:16:23.298 Later and makes things very comfortable. 0:16:25.265 --> 0:16:28.153 Wrong splitting is the worst things that are very complicated. 0:16:32.032 --> 0:16:35.580 Saints, for example, and also maybe Japanese medicine. 0:16:35.735 --> 0:16:41.203 In German, yes, especially like these are all right. 0:16:41.203 --> 0:16:46.981 The first thing is maybe the one which is most obvious. 0:16:46.981 --> 0:16:49.972 It is raining cats and dogs. 0:16:51.631 --> 0:17:01.837 To German, the cat doesn't translate this whole chunk into something because there is 0:17:01.837 --> 0:17:03.261 not really. 
0:17:03.403 --> 0:17:08.610 Mean, of course, in generally there is this type of alignment, so there is a correspondence 0:17:08.610 --> 0:17:11.439 between words in English and the words in German. 0:17:11.439 --> 0:17:16.363 However, that's not true for all sentences, so in some sentences you cannot really say 0:17:16.363 --> 0:17:18.174 this word translates into that. 0:17:18.498 --> 0:17:21.583 But you can only let more locate this whole phrase. 0:17:21.583 --> 0:17:23.482 This model into something else. 0:17:23.563 --> 0:17:30.970 If you think about the don't in English, the do is not really clearly where should that 0:17:30.970 --> 0:17:31.895 be allied. 0:17:32.712 --> 0:17:39.079 Then for a long time the most successful approach was this phrase based translation model where 0:17:39.079 --> 0:17:45.511 the idea is your block is not a single word but a longer phrase if you try to build translations 0:17:45.511 --> 0:17:46.572 based on these. 0:17:48.768 --> 0:17:54.105 But let's start with a word based and what you need. 0:17:54.105 --> 0:18:03.470 There is two main knowledge sources, so on the one hand we have a lexicon where we translate 0:18:03.470 --> 0:18:05.786 possible translations. 0:18:06.166 --> 0:18:16.084 The main difference between the lexicon and statistical machine translation and lexicon 0:18:16.084 --> 0:18:17.550 as you know. 0:18:17.837 --> 0:18:23.590 Traditional lexicon: You know how word is translated and mainly it's giving you two or 0:18:23.590 --> 0:18:26.367 three examples with any example sentence. 0:18:26.367 --> 0:18:30.136 So in this context it gets translated like that henceon. 0:18:30.570 --> 0:18:38.822 In order to model that and work with probabilities what we need in a machine translation is these: 0:18:39.099 --> 0:18:47.962 So if we have the German word bargain, it sends me out with a probability of zero point five. 0:18:47.962 --> 0:18:51.545 Maybe it's translated into a vehicle. 0:18:52.792 --> 0:18:58.876 And of course this is not easy to be created by a shoveman. 0:18:58.876 --> 0:19:07.960 If ask you and give probabilities for how probable this vehicle is, there might: So how 0:19:07.960 --> 0:19:12.848 we are doing is again that the lexicon is automatically will be created from a corpus. 0:19:13.333 --> 0:19:18.754 And we're just counting here, so we count how often does it work, how often does it co 0:19:18.754 --> 0:19:24.425 occur with vehicle, and then we're taking the ratio and saying in the house of time on the 0:19:24.425 --> 0:19:26.481 English side there was vehicles. 0:19:26.481 --> 0:19:31.840 There was a probability of vehicles given back, and there's something like zero point 0:19:31.840 --> 0:19:32.214 five. 0:19:33.793 --> 0:19:46.669 That we need another concept, and that is this concept of alignment, and now you can 0:19:46.669 --> 0:19:47.578 have. 0:19:47.667 --> 0:19:53.113 Since this is quite complicated, the alignment in general can be complex. 0:19:53.113 --> 0:19:55.689 It can be that it's not only like. 0:19:55.895 --> 0:20:04.283 It can be that two words of a surrender target sign and it's also imbiguous. 0:20:04.283 --> 0:20:13.761 It can be that you say all these two words only are aligned together and our words are 0:20:13.761 --> 0:20:15.504 aligned or not. 0:20:15.875 --> 0:20:21.581 Is should the do be aligned to the knot in German? 0:20:21.581 --> 0:20:29.301 It's only there because in German it's not, so it should be aligned. 
0:20:30.510 --> 0:20:39.736 However, typically it's formalized and it's formalized by a function from the target language. 0:20:40.180 --> 0:20:44.051 And that is to make these models get easier and clearer. 0:20:44.304 --> 0:20:49.860 That means what means does it mean that you have a fence that means that each. 0:20:49.809 --> 0:20:58.700 A sewer's word gives target word and the alliance to only one source word because the function 0:20:58.700 --> 0:21:00.384 is also directly. 0:21:00.384 --> 0:21:05.999 However, a source word can be hit or like by signal target. 0:21:06.286 --> 0:21:11.332 So you are allowing for one to many alignments, but not for many to one alignment. 0:21:11.831 --> 0:21:17.848 That is a bit of a challenge because you assume a lightning should be symmetrical. 0:21:17.848 --> 0:21:24.372 So if you look at a parallel sentence, it should not matter if you look at it from German 0:21:24.372 --> 0:21:26.764 to English or English to German. 0:21:26.764 --> 0:21:34.352 So however, it makes these models: Yea possible and we'll like to see yea for the phrase bass 0:21:34.352 --> 0:21:36.545 until we need these alignments. 0:21:36.836 --> 0:21:41.423 So this alignment was the most important of the world based models. 0:21:41.423 --> 0:21:47.763 For the next twenty years you need the world based models to generate this type of alignment, 0:21:47.763 --> 0:21:50.798 which is then the first step for the phrase. 0:21:51.931 --> 0:21:59.642 Approach, and there you can then combine them again like both directions into one we'll see. 0:22:00.280 --> 0:22:06.850 This alignment is very important and allows us to do this type of separation. 0:22:08.308 --> 0:22:15.786 And yet the most commonly used word based models are these models referred to as IBM 0:22:15.786 --> 0:22:25.422 models, and there is a sequence of them with great names: And they were like yeah very commonly 0:22:25.422 --> 0:22:26.050 used. 0:22:26.246 --> 0:22:31.719 We'll mainly focus on the simple one here and look how this works and then not do all 0:22:31.719 --> 0:22:34.138 the details about the further models. 0:22:34.138 --> 0:22:38.084 The interesting thing is also that all of them are important. 0:22:38.084 --> 0:22:43.366 So if you want to train this alignment what you normally do is train an IVM model. 0:22:43.743 --> 0:22:50.940 Then you take that as your initialization to then train the IBM model too and so on. 0:22:50.940 --> 0:22:53.734 The motivation for that is yeah. 0:22:53.734 --> 0:23:00.462 The first model gives you: Is so simple that you can even find a global optimum, so it gives 0:23:00.462 --> 0:23:06.403 you a good starting point for the next one where the optimization in finding the right 0:23:06.403 --> 0:23:12.344 model is more difficult and therefore like the defore technique was to make your model 0:23:12.344 --> 0:23:13.641 step by step more. 0:23:15.195 --> 0:23:27.333 In these models we are breaking down the probability into smaller steps and then we can define: 0:23:27.367 --> 0:23:38.981 You see it's not a bit different, so it's not the curability and one specific alignment given. 0:23:39.299 --> 0:23:42.729 We'll let us learn how we can then go from one alignment to the full set. 0:23:43.203 --> 0:23:52.889 The probability of target sentences and one alignment between the source and target sentences 0:23:52.889 --> 0:23:56.599 alignment is this type of function. 0:23:57.057 --> 0:24:14.347 That every word is aligned in order to ensure that every word is aligned. 
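Written out, the model being built up here is the IBM Model 1 form of P(e, a | f), in the notation assumed above, with a the alignment function just introduced and f_0 standing for the empty (NULL) source word; epsilon is the normalization constant mentioned next:

```latex
a : \{1, \dots, J\} \rightarrow \{0, 1, \dots, I\},
\qquad
P(e, a \mid f) = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} t\big(e_j \mid f_{a(j)}\big)
```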
0:24:15.835 --> 0:28:31.371 So first of all there is this epsilon; the epsilon is just a normalization factor so that everything sums up to a proper probability. Then you divide by the length of the source sentence plus one, to the power of the length of the target sentence. And this part is the probability of the alignment. So, is this alignment probable or not? Of course you can have some intuition: if there is a lot of crossing, it may not be a good alignment; if all of the words align to the same source word, it might not be a good alignment. But generally it is difficult to really describe what a good alignment is. So what do we say for the first model, the most simple thing? What can be the most simple thing if you think about giving a probability to some event? Yes, exactly: just take the uniform distribution. If we don't really know, the easiest way of modeling is to say all alignments are equally probable. Of course that is not true, but it gives you a good starting point. And so this term is just one over the number of all possible alignments for this sentence pair. So how many alignments are possible? The first target word can be aligned to any of the source words or to the empty word, the second one can also be aligned to all source words, the third one as well, and so on. That gives you this number of possible alignments. The second part is to model the probability of the translation itself. There we take the product over all target positions, and we are making a very strong independence assumption, because in these models we assume the translation probability of one word is independent of all the others. So how you translate 'visit' is independent of all the other parts of the sentence. That is a very strong and not very good assumption; we know that it is wrong, because how you translate a word depends on its context. However, it is a first easy solution and again a good starting point. So what you do is take the product over all target words of the translation probability of this target word given the source word it is aligned to, and because of the alignment function we know that there is always exactly one source word aligned to it. So, for example, the probability of 'visit' given the source word it is aligned to, and so on for the other words. Formally, the probability is then epsilon divided by the source length plus one to the power of the target length, times this product of word translation probabilities. And then there is a small error in the last line on the slide: the two are switched, so it should be the other way around. Then you have your translation model. Let's assume your model is already trained, so applying it is only assigning probabilities: then you can compute the probability of generating 'I visit a friend' given that you have the source sentence 'ich besuche einen Freund'.
0:28:32.012 --> 0:28:34.498 Time stand to the power of minus five. 0:28:35.155 --> 0:28:36.098 So this is your model. 0:28:36.098 --> 0:28:37.738 This is how you're applying your model. 0:28:39.479 --> 0:28:44.220 As you said, it's the most simple bottle you assume that all word translations are. 0:28:44.204 --> 0:28:46.540 Independent of each other. 0:28:46.540 --> 0:28:54.069 You assume that all alignments are equally important, and then the only thing you need 0:28:54.069 --> 0:29:00.126 for this type of model is to have this lexicon in order to calculate. 0:29:00.940 --> 0:29:04.560 And that is, of course, now the training process. 0:29:04.560 --> 0:29:08.180 The question is how do we get this type of lexic? 0:29:09.609 --> 0:29:15.461 But before we look into the training, do you have any questions about the model itself? 0:29:21.101 --> 0:29:26.816 The problem in training is that we have incomplete data. 0:29:26.816 --> 0:29:32.432 So if you want to count, I mean said you want to count. 0:29:33.073 --> 0:29:39.348 However, if you don't have the alignment, on the other hand, if you would have a lexicon 0:29:39.348 --> 0:29:44.495 you could maybe generate the alignment, which is the most probable word. 0:29:45.225 --> 0:29:55.667 And this is the very common problem that you have this type of incomplete data where you 0:29:55.667 --> 0:29:59.656 have not one type of information. 0:30:00.120 --> 0:30:08.767 And you can model this by considering the alignment as your hidden variable and then 0:30:08.767 --> 0:30:17.619 you can use the expectation maximization algorithm in order to generate the alignment. 0:30:17.577 --> 0:30:26.801 So the nice thing is that you only need your parallel data, which is aligned on sentence 0:30:26.801 --> 0:30:29.392 level, but you normally. 0:30:29.389 --> 0:30:33.720 Is just a lot of work we saw last time. 0:30:33.720 --> 0:30:39.567 Typically what you have is this type of corpus where. 0:30:41.561 --> 0:30:50.364 And yeah, the ERM algorithm sounds very fancy. 0:30:50.364 --> 0:30:58.605 However, again look at a little high level. 0:30:58.838 --> 0:31:05.841 So you're initializing a model by uniform distribution. 0:31:05.841 --> 0:31:14.719 You're just saying if have lexicon, if all words are equally possible. 0:31:15.215 --> 0:31:23.872 And then you apply your model to the data, and that is your expectation step. 0:31:23.872 --> 0:31:30.421 So given this initial lexicon, we are now calculating the. 0:31:30.951 --> 0:31:36.043 So we can now take all our parallel sentences, and of course ought to check what is the most 0:31:36.043 --> 0:31:36.591 probable. 0:31:38.338 --> 0:31:49.851 And then, of course, at the beginning maybe houses most often in line. 0:31:50.350 --> 0:31:58.105 Once we have done this expectation step, we can next do the maximization step and based 0:31:58.105 --> 0:32:06.036 on this guest alignment, which we have, we can now learn better translation probabilities 0:32:06.036 --> 0:32:09.297 by just counting how often do words. 0:32:09.829 --> 0:32:22.289 And then it's rated these steps: We can make this whole process even more stable, only taking 0:32:22.289 --> 0:32:26.366 the most probable alignment. 0:32:26.346 --> 0:32:36.839 Second step, but in contrast we calculate for all possible alignments the alignment probability 0:32:36.839 --> 0:32:40.009 and weigh the correcurrence. 0:32:40.000 --> 0:32:41.593 Then Things Are Most. 
0:32:42.942 --> 0:37:42.577 Why could that be very challenging if we do it naively and really calculate the probabilities for all alignments? How many alignments are there for a sentence? Yes, we just saw that in the formula, if you remember: it is exponential in the length of the target sentence, so calculating all of them explicitly would be very inefficient and not really possible. The nice thing is that we can again use some type of dynamic programming, so we can do this without really enumerating all alignments. We have the next five slides or so with the most equations in the whole lecture, so don't worry, it stays manageable. So we said we first have the expectation step, where it is about calculating the alignment probability. And we can do this with our initial definition, because this formula can be rewritten: we can define the probability of an alignment given the sentence pair as the probability of the target sentence and the alignment, divided by the probability of the target sentence. This is just the normal definition of a conditional probability. And what we then need to be able to calculate is P of e given f, and P of e given f is still quite simple: the probability of the target sentence given the source sentence is quite intuitive. So let's look at how to calculate this probability. In here we can put in our original formula: we sum over all possible alignments of the first word, and so on, up to the sum over all possible alignments of the last word. And inside we have the alignment probability and this product of translation probabilities. Now the first factor is independent of the alignment, so we can pull it to the front. And now this is where dynamic programming comes in: we can change the order and thereby make things a lot easier. We can reformulate it just as a product over all target positions, and inside it is then a sum over all source positions. Maybe the intuition why this is equal is a lot easier if you look at it graphically. So what we have here is a table: we have the target positions and the source positions, and we have to sum up over all possible paths through that table. The nice thing is that within each of these paths the probabilities are independent of each other. So in order to get the sum over all paths through this table you can use dynamic programming and say: this sum is exactly the same as the sum of this column, times the sum of this column, and times the sum of this column. That is the same as if you go through all possible paths and always multiply the elements along the path.
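In formulas, the rearrangement just described is the following (a sketch; swapping the sum over alignments and the product over target positions is exactly the dynamic-programming trick):

```latex
P(e \mid f) = \sum_{a} P(e, a \mid f)
            = \frac{\epsilon}{(I+1)^{J}} \sum_{a(1)=0}^{I} \cdots \sum_{a(J)=0}^{I} \; \prod_{j=1}^{J} t\big(e_j \mid f_{a(j)}\big)
            = \frac{\epsilon}{(I+1)^{J}} \prod_{j=1}^{J} \sum_{i=0}^{I} t\big(e_j \mid f_i\big)
```

Dividing P(e, a | f) by this marginal then gives the alignment posterior that the next step uses:

```latex
P(a \mid e, f) = \frac{P(e, a \mid f)}{P(e \mid f)}
               = \prod_{j=1}^{J} \frac{t\big(e_j \mid f_{a(j)}\big)}{\sum_{i=0}^{I} t\big(e_j \mid f_i\big)}
```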
0:37:43.923 --> 0:42:17.983 And that is a simplification, because now we only have a quadratic number of terms and we don't have to go over all alignments explicitly. Similar, I guess, to something you may have seen before: the same type of algorithm is used elsewhere; where? Yes, exactly, that is the same idea. But I think graphically it is easy to see why this works, even if you don't go through the exact math. Now we put both together: if you really want to, you can take these two formulas, put them into each other, some terms cancel, and then you get your final formula. And that formula now really makes sense intuitively again: the probability of an alignment is the product over all target positions of the probability of translating the source word it is aligned to into this target word, divided by the sum of the translation probabilities over all the other source words in the sentence. If you look at this again, it makes real sense: you are looking at how probable this translation is compared to all the other words you could have aligned to, and that gives you the alignment probability. So it is not only mathematically correct; it is also intuitive. If you ask how good it is to align, say, 'Zoo' to 'visit', it should depend on how good this translation probability is compared to the translation probabilities of the other words in the sentence, and on how probable it is to align to those instead. So that is the expectation step, and the next thing is the maximization step; we now have the probability of an alignment. Intuitively, that means: how often are words aligned to each other given these alignment probabilities, or in a more formal definition, what is the expected count that they are aligned to each other? So if there are a lot of alignments with high probability in which they are aligned to each other, this count gets large. So the count of e given f, given our parallel data, is a sum over all possible alignments, and you don't just count with absolute numbers, but you always count weighted by the alignment probability. And to turn that into a translation probability you of course have to normalize it by the total counts. And that is then the whole model. It may look a bit mathematically complex now, but the whole training process is described here: you really just have to collect these counts and later normalize them. So, repeating that until convergence: as we said, the EM iteration is done again and again. You initialize everything uniformly, then you go over all sentence pairs and all words and calculate the translation posteriors. And then you go once again over.
0:42:17.983 --> 0:42:22.522 It counted this count, count given, and totally e-given. 0:42:22.702 --> 0:42:35.316 Initially how probable is the E translated to something else, and you normalize your translation 0:42:35.316 --> 0:42:37.267 probabilities. 0:42:38.538 --> 0:42:45.761 So this is an old training process for this type of. 0:42:46.166 --> 0:43:00.575 How that then works is shown here a bit, so we have a very simple corpus. 0:43:01.221 --> 0:43:12.522 And as we said, you initialize your translation with yes or possible translations, so dusk 0:43:12.522 --> 0:43:16.620 can be aligned to the bookhouse. 0:43:16.997 --> 0:43:25.867 And the other ones are missing because only a curse with and book, and then the others 0:43:25.867 --> 0:43:26.988 will soon. 0:43:27.127 --> 0:43:34.316 In the initial way your vocabulary is for works, so the initial probabilities are all: 0:43:34.794 --> 0:43:50.947 And then if you iterate you see that the things which occur often and then get alignments get 0:43:50.947 --> 0:43:53.525 more and more. 0:43:55.615 --> 0:44:01.506 In reality, of course, you won't get like zero alignments, but you would normally get 0:44:01.506 --> 0:44:02.671 there sometimes. 0:44:03.203 --> 0:44:05.534 But as the probability increases. 0:44:05.785 --> 0:44:17.181 The training process is also guaranteed that the probability of your training data is always 0:44:17.181 --> 0:44:20.122 increased in iteration. 0:44:21.421 --> 0:44:27.958 You see that the model tries to model your training data and give you at least good models. 0:44:30.130 --> 0:44:37.765 Okay, are there any more questions to the training of these type of word-based models? 0:44:38.838 --> 0:44:54.790 Initially there is like forwards in the source site, so it's just one force to do equal distribution. 0:44:55.215 --> 0:45:01.888 So each target word, the probability of the target word, is at four target words, so the 0:45:01.888 --> 0:45:03.538 uniform distribution. 0:45:07.807 --> 0:45:14.430 However, there is problems with this initial order and we have this already mentioned at 0:45:14.430 --> 0:45:15.547 the beginning. 0:45:15.547 --> 0:45:21.872 There is for example things that yeah you want to allow for reordering but there are 0:45:21.872 --> 0:45:27.081 definitely some alignments which should be more probable than others. 0:45:27.347 --> 0:45:42.333 So a friend visit should have a lower probability than visit a friend. 0:45:42.302 --> 0:45:50.233 It's not always monitoring, there is some reordering happening, but if you just mix it 0:45:50.233 --> 0:45:51.782 crazy, it's not. 0:45:52.252 --> 0:46:11.014 You have slings like one too many alignments and they are not really models. 0:46:11.491 --> 0:46:17.066 But it shouldn't be that you align one word to all the others, and that is, you don't want 0:46:17.066 --> 0:46:18.659 this type of probability. 0:46:19.199 --> 0:46:27.879 You don't want to align to null, so there's nothing about that and how to deal with other 0:46:27.879 --> 0:46:30.386 words on the source side. 0:46:32.272 --> 0:46:45.074 And therefore this was only like the initial model in there. 0:46:45.325 --> 0:46:47.639 Models, which we saw. 0:46:47.639 --> 0:46:57.001 They only model the translation probability, so how probable is it to translate one word 0:46:57.001 --> 0:46:58.263 to another? 0:46:58.678 --> 0:47:05.915 What you could then add is the absolute position. 0:47:05.915 --> 0:47:16.481 Yeah, the second word should more probable align to the second position. 
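Going back one step, the whole Model 1 training loop just described (uniform initialization, an expectation step that collects fractional counts weighted by the word-level alignment posterior, and a maximization step that renormalizes them) can be sketched roughly as follows. This is a minimal illustration under the assumptions above, not the exact implementation or corpus from the lecture:

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=5):
    """Minimal IBM Model 1 EM sketch.

    corpus: list of (source_words, target_words) sentence pairs,
            e.g. [(["das", "Haus"], ["the", "house"]), ...].
    Returns t[(e_word, f_word)] = t(e | f); None stands for the NULL word.
    """
    NULL = None
    src_vocab = {w for f, _ in corpus for w in f} | {NULL}
    tgt_vocab = {w for _, e in corpus for w in e}

    # Initialization: uniform translation probabilities t(e | f).
    t = {(e, f): 1.0 / len(tgt_vocab) for f in src_vocab for e in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)   # expected count(e, f)
        total = defaultdict(float)   # expected total(f)

        # Expectation step: fractional counts weighted by the word-level
        # alignment posterior t(e | f_i) / sum_i' t(e | f_i').
        for f_sent, e_sent in corpus:
            f_words = [NULL] + f_sent
            for e in e_sent:
                denom = sum(t[(e, f)] for f in f_words)
                for f in f_words:
                    frac = t[(e, f)] / denom
                    count[(e, f)] += frac
                    total[f] += frac

        # Maximization step: re-estimate t(e | f) by normalizing the counts.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}

    return t
```

On a tiny toy corpus such as [(["das", "Haus"], ["the", "house"]), (["das", "Buch"], ["the", "book"])] (a made-up example), a few iterations are enough to see t("house" | "Haus") and t("book" | "Buch") grow at the expense of the alternatives, which is exactly the behaviour described above. Back to the shortcomings of this simple model and what the higher IBM models add: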
0:47:17.557 --> 0:47:22.767 We add a fertility model that means one word is mostly translated into one word. 0:47:23.523 --> 0:47:29.257 For example, we saw it there that should be translated into two words, but most words should 0:47:29.257 --> 0:47:32.463 be one to one, and it's even modeled for each word. 0:47:32.463 --> 0:47:37.889 So for each source word, how probable is it that it is translated to one, two, three or 0:47:37.889 --> 0:47:38.259 more? 0:47:40.620 --> 0:47:50.291 Then either one of four acts relative positions, so it's asks: Maybe instead of modeling, how 0:47:50.291 --> 0:47:55.433 probable is it that you translate from position five to position twenty five? 0:47:55.433 --> 0:48:01.367 It's not a very good way, but in a relative position instead of what you try to model it. 0:48:01.321 --> 0:48:06.472 How probable is that you are jumping Swiss steps forward or Swiss steps back? 0:48:07.287 --> 0:48:15.285 However, this makes sense more complex because what is a jump forward and a jump backward 0:48:15.285 --> 0:48:16.885 is not that easy. 0:48:18.318 --> 0:48:30.423 You want to have a model that describes reality, so every sentence that is not possible should 0:48:30.423 --> 0:48:37.304 have the probability zero because that cannot happen. 0:48:37.837 --> 0:48:48.037 However, with this type of IBM model four this has a positive probability, so it makes 0:48:48.037 --> 0:48:54.251 a sentence more complex and you can easily check it. 0:48:57.457 --> 0:49:09.547 So these models were the first models which tried to directly model and where they are 0:49:09.547 --> 0:49:14.132 the first to do the translation. 0:49:14.414 --> 0:49:19.605 So in all of these models, the probability of a word translating into another word is 0:49:19.605 --> 0:49:25.339 always independent of all the other translations, and that is a challenge because we know that 0:49:25.339 --> 0:49:26.486 this is not right. 0:49:26.967 --> 0:49:32.342 And therefore we will come now to then the phrase-based translation models. 0:49:35.215 --> 0:49:42.057 However, this word alignment is the very important concept which was used in phrase based. 0:49:42.162 --> 0:49:50.559 Even when people use phrase based, they first would always train a word based model not to 0:49:50.559 --> 0:49:56.188 get the really model but only to get this type of alignment. 0:49:57.497 --> 0:50:01.343 What was the main idea of a phrase based machine translation? 0:50:03.223 --> 0:50:08.898 It's not only that things got mathematically a lot more simple here because you don't try 0:50:08.898 --> 0:50:13.628 to express the whole translation process, but it's a discriminative model. 0:50:13.628 --> 0:50:19.871 So what you only try to model is this translation probability or is this translation more probable 0:50:19.871 --> 0:50:20.943 than some other. 0:50:24.664 --> 0:50:28.542 The main idea is that the basic units are are the phrases. 0:50:28.542 --> 0:50:31.500 That's why it's called phrase phrase phrase. 0:50:31.500 --> 0:50:35.444 You have to be aware that these are not linguistic phrases. 0:50:35.444 --> 0:50:39.124 I guess you have some intuition about what is a phrase. 0:50:39.399 --> 0:50:45.547 You would express as a phrase. 0:50:45.547 --> 0:50:58.836 However, you wouldn't say that is a very good phrase because it's. 0:50:59.339 --> 0:51:06.529 However, in this machine learning-based motivated thing, phrases are just indicative. 0:51:07.127 --> 0:51:08.832 So it can be any split. 
0:51:08.832 --> 0:51:12.455 We don't consider linguistically motivated or not. 0:51:12.455 --> 0:51:15.226 It can be any sequence of consecutive. 0:51:15.335 --> 0:51:16.842 That's the Only Important Thing. 0:51:16.977 --> 0:51:25.955 The phrase is always a thing of consecutive words, and the motivation behind that is getting 0:51:25.955 --> 0:51:27.403 computational. 0:51:27.387 --> 0:51:35.912 People have looked into how you can also discontinuous phrases, which might be very helpful if you 0:51:35.912 --> 0:51:38.237 think about German harbor. 0:51:38.237 --> 0:51:40.046 Has this one phrase? 0:51:40.000 --> 0:51:47.068 There's two phrases, although there's many things in between, but in order to make things 0:51:47.068 --> 0:51:52.330 still possible and runner will, it's always like consecutive work. 0:51:53.313 --> 0:52:05.450 The nice thing is that on the one hand you don't need this word to word correspondence 0:52:05.450 --> 0:52:06.706 anymore. 0:52:06.906 --> 0:52:17.088 You now need to invent some type of alignment that in this case doesn't really make sense. 0:52:17.417 --> 0:52:21.710 So you can just learn okay, you have this phrase and this phrase and their translation. 0:52:22.862 --> 0:52:25.989 Secondly, we can add a bit of context into that. 0:52:26.946 --> 0:52:43.782 You're saying, for example, of Ultimate Customs and of My Shift. 0:52:44.404 --> 0:52:51.443 And this was difficult to model and work based models because they always model the translation. 0:52:52.232 --> 0:52:57.877 Here you can have phrases where you have more context and just jointly translate the phrases, 0:52:57.877 --> 0:53:03.703 and if you then have seen all by the question as a phrase you can directly use that to generate. 0:53:08.468 --> 0:53:19.781 Okay, before we go into how to do that, then we start, so the start is when we start with 0:53:19.781 --> 0:53:21.667 the alignment. 0:53:22.022 --> 0:53:35.846 So that is what we get from the work based model and we are assuming to get the. 0:53:36.356 --> 0:53:40.786 So that is your starting point. 0:53:40.786 --> 0:53:47.846 You have a certain sentence and one most probable. 0:53:48.989 --> 0:54:11.419 The challenge you now have is that these alignments are: On the one hand, a source word like hit 0:54:11.419 --> 0:54:19.977 several times with one source word can be aligned to several: So in this case you see that for 0:54:19.977 --> 0:54:29.594 example Bisher is aligned to three words, so this can be the alignment from English to German, 0:54:29.594 --> 0:54:32.833 but it cannot be the alignment. 0:54:33.273 --> 0:54:41.024 In order to address for this inconsistency and being able to do that, what you typically 0:54:41.024 --> 0:54:49.221 then do is: If you have this inconsistency and you get different things in both directions,. 0:54:54.774 --> 0:55:01.418 In machine translation to do that you just do it in both directions and somehow combine 0:55:01.418 --> 0:55:08.363 them because both will do arrows and the hope is yeah if you know both things you minimize. 0:55:08.648 --> 0:55:20.060 So you would also do it in the other direction and get a different type of lineup, for example 0:55:20.060 --> 0:55:22.822 that you now have saw. 0:55:23.323 --> 0:55:37.135 So in this way you are having two alignments and the question is now how do get one alignment 0:55:37.135 --> 0:55:38.605 and what? 0:55:38.638 --> 0:55:45.828 There were a lot of different types of heuristics. 
0:55:45.828 --> 0:55:55.556 They normally start with intersection because you should trust them. 0:55:55.996 --> 0:55:59.661 And your maximum will could take this, the union thought,. 0:55:59.980 --> 0:56:04.679 If one of the systems says they are not aligned then maybe you should not align them. 0:56:05.986 --> 0:56:12.240 The only question they are different is what should I do about things where they don't agree? 0:56:12.240 --> 0:56:18.096 So where only one of them enlines and then you have heuristics depending on other words 0:56:18.096 --> 0:56:22.288 around it, you can decide should I align them or should I not. 0:56:24.804 --> 0:56:34.728 So that is your first step and then the second step in your model. 0:56:34.728 --> 0:56:41.689 So now you have one alignment for the process. 0:56:42.042 --> 0:56:47.918 And the idea is that we will now extract all phrase pairs to combinations of source and 0:56:47.918 --> 0:56:51.858 target phrases where they are consistent within alignment. 0:56:52.152 --> 0:56:57.980 The idea is a consistence with an alignment that should be a good example and that we can 0:56:57.980 --> 0:56:58.563 extract. 0:56:59.459 --> 0:57:14.533 And there are three conditions where we say an alignment has to be consistent. 0:57:14.533 --> 0:57:17.968 The first one is. 0:57:18.318 --> 0:57:24.774 So if you add bisher, then it's in your phrase. 0:57:24.774 --> 0:57:32.306 All the three words up till and now should be in there. 0:57:32.492 --> 0:57:42.328 So Bisheret Till would not be a valid phrase pair in this case, but for example Bisheret 0:57:42.328 --> 0:57:43.433 Till now. 0:57:45.525 --> 0:58:04.090 Does anybody now have already an idea about the second rule that should be there? 0:58:05.325 --> 0:58:10.529 Yes, that is exactly the other thing. 0:58:10.529 --> 0:58:22.642 If a target verse is in the phrase pair, there are also: Then there is one very obvious one. 0:58:22.642 --> 0:58:28.401 If you strike a phrase pair, at least one word in the phrase. 0:58:29.069 --> 0:58:32.686 And this is a knife with working. 0:58:32.686 --> 0:58:40.026 However, in reality a captain will select some part of the sentence. 0:58:40.380 --> 0:58:47.416 You can take any possible combination of sewers and target words for this part, and that of 0:58:47.416 --> 0:58:54.222 course is not very helpful because you just have no idea, and therefore it says at least 0:58:54.222 --> 0:58:58.735 one sewer should be aligned to one target word to prevent. 0:58:59.399 --> 0:59:09.615 But still, it means that if you have normally analyzed words, the more analyzed words you 0:59:09.615 --> 0:59:10.183 can. 0:59:10.630 --> 0:59:13.088 That's not true for the very extreme case. 0:59:13.088 --> 0:59:17.603 If no word is a line you can extract nothing because you can never fulfill it. 0:59:17.603 --> 0:59:23.376 However, if only for example one word is aligned then you can align a lot of different possibilities 0:59:23.376 --> 0:59:28.977 because you can start with this word and then add source words or target words or any combination 0:59:28.977 --> 0:59:29.606 of source. 0:59:30.410 --> 0:59:37.585 So there was typically a problem that if you have too few works in light you can really 0:59:37.585 --> 0:59:38.319 extract. 0:59:38.558 --> 0:59:45.787 If you think about this already here you can extract very, very many phrase pairs from: 0:59:45.845 --> 0:59:55.476 So what you can extract is, for example, what we saw up and so on. 0:59:55.476 --> 1:00:00.363 So all of them will be extracted. 
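The extraction rule just described (every source word that is aligned to a target word inside the phrase must itself be inside, and vice versa, and at least one word inside must be aligned) can be sketched as follows; a rough, brute-force illustration, not the actual extraction tool used in practice:

```python
def extract_phrases(f_words, e_words, alignment):
    """Extract all phrase pairs consistent with a symmetrized word alignment.

    alignment: set of (i, j) links, with i indexing f_words (source)
    and j indexing e_words (target).
    """
    phrases = []
    for f_start in range(len(f_words)):
        for f_end in range(f_start, len(f_words)):
            for e_start in range(len(e_words)):
                for e_end in range(e_start, len(e_words)):
                    inside = [(i, j) for (i, j) in alignment
                              if f_start <= i <= f_end and e_start <= j <= e_end]
                    # Rule 3: at least one word pair inside the box is aligned.
                    if not inside:
                        continue
                    # Rules 1 and 2: no alignment link may leave the box, i.e.
                    # connect a word inside the spans to a word outside them.
                    crossing = any((f_start <= i <= f_end) != (e_start <= j <= e_end)
                                   for (i, j) in alignment)
                    if crossing:
                        continue
                    phrases.append((tuple(f_words[f_start:f_end + 1]),
                                    tuple(e_words[e_start:e_end + 1])))
    return phrases
```

Even on a single short sentence pair this enumerates a long list of overlapping phrase pairs.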
1:00:00.400 --> 1:00:08.379 In order to limit this you typically have a length limit so you can only extract phrases 1:00:08.379 --> 1:00:08.738 up. 1:00:09.049 --> 1:00:18.328 But still there these phrases where you have all these phrases extracted. 1:00:18.328 --> 1:00:22.968 You have to think about how to deal. 1:00:26.366 --> 1:00:34.966 Now we have the phrases, so the other question is what is a good phrase pair and not so good. 1:00:35.255 --> 1:00:39.933 You might be that you sometimes extract one which is explaining this sentence but is not 1:00:39.933 --> 1:00:44.769 really a good one because there is something ever in there or something special so it might 1:00:44.769 --> 1:00:47.239 not be a good phase pair in another situation. 1:00:49.629 --> 1:00:59.752 And therefore the easiest thing is again just count, and if a phrase pair occurs very often 1:00:59.752 --> 1:01:03.273 seems to be a good phrase pair. 1:01:03.743 --> 1:01:05.185 So if we have this one. 1:01:05.665 --> 1:01:09.179 And if you have the exam up till now,. 1:01:09.469 --> 1:01:20.759 Then you look how often does up till now to this hair occur? 1:01:20.759 --> 1:01:28.533 How often does up until now to this hair? 1:01:30.090 --> 1:01:36.426 So this is one way of yeah describing the quality of the phrase book. 1:01:37.257 --> 1:01:47.456 So one difference is now, and that is the advantage of these primitive models. 1:01:47.867 --> 1:01:55.442 But instead we are trying to have a lot of features describing how good a phrase parent 1:01:55.442 --> 1:01:55.786 is. 1:01:55.786 --> 1:02:04.211 One of these features is this one describing: But in this model we'll later see how to combine 1:02:04.211 --> 1:02:04.515 it. 1:02:04.515 --> 1:02:10.987 The nice thing is we can invent any other type of features and add that and normally 1:02:10.987 --> 1:02:14.870 if you have two or three metrics to describe then. 1:02:15.435 --> 1:02:18.393 And therefore the spray spray sprays. 1:02:18.393 --> 1:02:23.220 They were not only like evaluated by one type but by several. 1:02:23.763 --> 1:02:36.580 So this could, for example, have a problem because your target phrase here occurs only 1:02:36.580 --> 1:02:37.464 once. 1:02:38.398 --> 1:02:46.026 It will of course only occur with one other source trait, and that probability will be 1:02:46.026 --> 1:02:53.040 one which might not be a very good estimation because you've only seen it once. 1:02:53.533 --> 1:02:58.856 Therefore, we use additional ones to better deal with that, and the first thing is we're 1:02:58.856 --> 1:02:59.634 doing again. 1:02:59.634 --> 1:03:01.129 Yeah, we know it by now. 1:03:01.129 --> 1:03:06.692 If you look at it in the one direction, it's helpful to us to look into the other direction. 1:03:06.692 --> 1:03:11.297 So you take also the inverse probability, so you not only take in peer of E. 1:03:11.297 --> 1:03:11.477 G. 1:03:11.477 --> 1:03:11.656 M. 1:03:11.656 --> 1:03:12.972 F., but also peer of. 1:03:13.693 --> 1:03:19.933 And then in addition you say maybe for the especially prolonged phrases they occur rarely, 1:03:19.933 --> 1:03:25.898 and then you have very high probabilities, and that might not be always the right one. 1:03:25.898 --> 1:03:32.138 So maybe it's good to also look at the word based probabilities to represent how good they 1:03:32.138 --> 1:03:32.480 are. 1:03:32.692 --> 1:03:44.202 So in addition you take the work based probabilities of this phrase pair as an additional model. 
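Written out, the relative-frequency scores just described for a phrase pair are the following (a sketch; the two word-based lexical scores lex(e-phrase | f-phrase) and lex(f-phrase | e-phrase) are computed analogously from the word translation probabilities inside the phrase pair):

```latex
\phi(\bar{e} \mid \bar{f}) = \frac{\operatorname{count}(\bar{f}, \bar{e})}{\sum_{\bar{e}'} \operatorname{count}(\bar{f}, \bar{e}')},
\qquad
\phi(\bar{f} \mid \bar{e}) = \frac{\operatorname{count}(\bar{f}, \bar{e})}{\sum_{\bar{f}'} \operatorname{count}(\bar{f}', \bar{e})}
```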
1:03:44.704 --> 1:07:29.302 So then you would have in total four different values describing how good the phrase pair is: the relative frequencies in both directions and the lexical probabilities in both directions. So four values describing how probable a phrase translation is. Then the next challenge is how we can combine these different types of probabilities into one global score saying how good the translation is. That is the next part of the model, but before we do that: are there any questions about this phrase extraction and phrase scoring? And the motivation for the combination is our initial model: if you remember, at the beginning of the lecture we had the probability written as P of f given e times P of e. Now the problem is that this is of course correct, however we have made a lot of simplifications, for example that the translation probability of a word is independent of the other translations. Therefore our estimates of P of f given e and of P of e might not be right, and then the combination might not be right either. So it can be, for example, that in the end you get a fluent but not accurate translation. And then there could be an easy way around it: if our output is fluent but not accurate, it might be that we put too much weight on the language model and too little weight on the translation model. So we can weight them, saying this one should count a bit more strongly, this one is more important than the other. And based on that we can extend this idea to the log-linear model. The log-linear model now says the translation score is just a combination of features describing how good this translation is; these are the feature functions h, which depend on e and f. In the noisy channel model one of them depends only on e, but in general they depend on both e and f. Each of these features has a weight saying how much you trust it; it is like asking a lot of people for their opinion, where you might weight some opinions more because they are a good indication, and others you would not trust that much. And you can do exactly that here too: you can add any number of features, depending on how many you want to have, and each of the features gives you a value. The nice thing is that we can normally ignore the normalization constant, because we are not interested in the probability itself. And again, if the score is not normalized, that is fine: if this value is the highest, that is the translation we take. So how can we do that in practice?
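Before the worked example, the log-linear model just described, written compactly (a sketch; Z(f) is the normalization term that can be ignored during search because it is the same for all candidate translations):

```latex
P(e \mid f) = \frac{1}{Z(f)} \exp\Big( \sum_{m=1}^{M} \lambda_m \, h_m(e, f) \Big),
\qquad
\hat{e} = \operatorname*{argmax}_{e} \; \sum_{m=1}^{M} \lambda_m \, h_m(e, f)
```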
1:07:29.302 --> 1:07:34.510 Let's start with two simple things. 1:07:34.510 --> 1:07:39.864 Then you have one translation model. 1:07:40.000 --> 1:07:43.102 Which gives you the peer of eagerness. 1:07:43.383 --> 1:07:49.203 It can be typically as a feature it would take the liberalism of this ability, so mine 1:07:49.203 --> 1:07:51.478 is nine hundred and fourty seven. 1:07:51.451 --> 1:07:57.846 And the language model which says you how clue in the English side is how you can calculate 1:07:57.846 --> 1:07:59.028 the probability. 1:07:58.979 --> 1:08:03.129 In some future lectures we'll give you all superbology. 1:08:03.129 --> 1:08:10.465 You can feature again the luck of the purbology, then you have minus seven and then give different 1:08:10.465 --> 1:08:11.725 weights to them. 1:08:12.292 --> 1:08:19.243 And that means that your probability is one divided by said to the power of this. 1:08:20.840 --> 1:08:38.853 You're not really interested in the probability, so you just calculate on the score to the exponendum. 1:08:40.000 --> 1:08:41.668 Maximal Maximal I Think. 1:08:42.122 --> 1:08:57.445 You can, for example, try different translations, calculate all their scores and take in the 1:08:57.445 --> 1:09:00.905 end the translation. 1:09:03.423 --> 1:09:04.661 Why to do that. 1:09:05.986 --> 1:09:10.698 We've done that now for two, but of course you cannot only do it with two. 1:09:10.698 --> 1:09:16.352 You can do it now with any fixed number, so of course you have to decide in the beginning 1:09:16.352 --> 1:09:21.944 I want to have ten features or something like that, but you can take all these features. 1:09:22.002 --> 1:09:29.378 And yeah, based on them, they calculate your model probability or the model score. 1:09:31.031 --> 1:09:40.849 A big advantage over the initial. 1:09:40.580 --> 1:09:45.506 A model because now we can add a lot of features and there was diamond machine translation, 1:09:45.506 --> 1:09:47.380 a statistical machine translation. 1:09:47.647 --> 1:09:57.063 So how can develop new features, new ways of evaluating them so that can hopefully better 1:09:57.063 --> 1:10:00.725 describe what is good translation? 1:10:01.001 --> 1:10:16.916 If you have a new great feature you can calculate these features and then how much better do 1:10:16.916 --> 1:10:18.969 they model? 1:10:21.741 --> 1:10:27.903 There is one challenge which haven't touched upon yet. 1:10:27.903 --> 1:10:33.505 So could you easily build your model if you have. 1:10:38.999 --> 1:10:43.016 Assumed here something which just gazed, but which might not be that easy. 1:10:49.990 --> 1:10:56.333 The weight for the translation model is and the weight for the language model is. 1:10:56.716 --> 1:11:08.030 That's a bit arbitrary, so why should you use this one and guess normally you won't be 1:11:08.030 --> 1:11:11.801 able to select that by hand? 1:11:11.992 --> 1:11:19.123 So typically we didn't have like or features in there, but features is very common. 1:11:19.779 --> 1:11:21.711 So how do you select them? 1:11:21.711 --> 1:11:24.645 There was a second part of the training. 1:11:24.645 --> 1:11:27.507 These models were trained in two steps. 1:11:27.507 --> 1:11:32.302 On the one hand, we had the training of the individual components. 1:11:32.302 --> 1:11:38.169 We saw that now how to build the phrase based system, how to extract the phrases. 1:11:38.738 --> 1:11:46.223 But then if you have these different components you need a second training to learn the optimal. 
1:11:46.926 --> 1:11:51.158 And typically this is referred to as the tuning of the system. 1:11:51.431 --> 1:12:07.030 So now, if you have different types of models describing what a good translation is, you need 1:12:07.030 --> 1:12:10.760 to find good weights for them. 1:12:12.312 --> 1:12:14.315 So how can you do that? 1:12:14.315 --> 1:12:20.871 The easiest thing is, of course, that you just try different weight settings out. 1:12:21.121 --> 1:12:27.496 For each setting you can then always select the best hypothesis. 1:12:27.496 --> 1:12:38.089 You can evaluate it with some metric: you score all your outputs, always select 1:12:38.089 --> 1:12:42.543 the best one, and then check how good this translation is. 1:12:42.983 --> 1:12:45.930 And you can do that for a lot of different possible weight combinations. 1:12:47.067 --> 1:12:59.179 However, the challenge is the complexity: even if you have only a handful of parameters, and for each of 1:12:59.179 --> 1:13:04.166 them only a handful of values you try out, the number of combinations explodes. 1:13:04.804 --> 1:13:16.895 We won't be able to try all of these possible combinations, so what we have to do is some 1:13:16.895 --> 1:13:19.313 more intelligent search. 1:13:20.540 --> 1:13:34.027 And what has been done there in machine translation is referred to as minimum error rate training. 1:13:34.534 --> 1:13:41.743 The underlying Powell-style search is a very intuitive one: you have all these different parameters, so how do you optimize them? 1:13:42.522 --> 1:13:44.358 And the idea is: okay, 1:13:44.358 --> 1:13:52.121 I start with an initial guess and then I optimize one single parameter; that is always easier. 1:13:52.121 --> 1:13:54.041 That is essentially a line search. 1:13:54.041 --> 1:13:58.882 So you are searching for the best value of that one parameter. 1:13:59.759 --> 1:14:04.130 This is often visualized with a map of San Francisco. 1:14:04.130 --> 1:14:13.786 Just imagine you want to get to the highest spot in San Francisco and you are standing somewhere 1:14:13.786 --> 1:14:14.395 here. 1:14:14.574 --> 1:14:21.220 You walk along one street until you find its highest point; then you switch the dimension, go in the other direction, and again find the highest point. 1:14:21.661 --> 1:14:33.804 Now you are on a different street, and its highest point is at a different place, so you go there, 1:14:33.804 --> 1:14:36.736 and so you can iterate. 1:14:36.977 --> 1:14:56.368 The one problem, of course, is that you may only find a local optimum; if you start from two different positions you may end up in different places. 1:14:56.536 --> 1:15:10.030 So yes, there is a heuristic in there; typically it is therefore repeated with different 1:15:10.030 --> 1:15:16.059 starting points to check whether you land in different positions. 1:15:16.516 --> 1:15:29.585 What is different, or what is the addition of minimum error rate training compared to this standard search? 1:15:29.729 --> 1:15:37.806 So, as we said, you can now evaluate different values for one parameter. 1:15:38.918 --> 1:15:42.857 And the question is: which values should you try out for that one parameter? 1:15:42.857 --> 1:15:47.281 Should you just do zero point one, zero point two, zero point three, or anything? 1:15:49.029 --> 1:16:03.880 If you change only one parameter, then you can write the score of a translation as a linear 1:16:03.880 --> 1:16:05.530 function of that parameter. 1:16:05.945 --> 1:16:17.258 So for each hypothesis there is one line, and if you change the parameter, the score of this hypothesis moves along that line. 1:16:17.397 --> 1:16:26.506 The offset comes from the features whose weights you do not change, 1:16:26.826 --> 1:16:30.100 and the feature value of the changed weight gives the steepness of the line. 1:16:30.750 --> 1:16:38.887 And now we look at different possible translations.
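In my notation (not from the slides): if only the weight lambda_k is varied and everything else is kept fixed, the score of one fixed hypothesis e is a straight line in lambda_k; the feature value h_k(e,f) is its slope, and the remaining weighted features form a constant offset. The lines compared in the next step differ exactly in this slope.

\[
\mathrm{score}_e(\lambda_k) \;=\; h_k(e,f)\,\lambda_k \;+\; \sum_{i \neq k} \lambda_i\, h_i(e,f)
\]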
1:16:38.887 --> 1:16:46.692 Therefore, how steeply they go up here differs. 1:16:47.247 --> 1:16:59.289 So in this case, if you look at which hypothesis has the best score, that only changes at the points where the lines intersect. 1:17:00.300 --> 1:17:10.642 So it is enough to check once here and once here, because between those points nothing changes. 1:17:11.111 --> 1:17:24.941 And that is the idea in minimum error rate training: you only evaluate at the points where a different hypothesis gets selected. 1:17:29.309 --> 1:17:34.378 So, yes, minimum error rate training is a Powell-style search. 1:17:34.378 --> 1:17:37.453 Then we use an intelligent step size. 1:17:37.453 --> 1:17:39.364 We do random restarts. 1:17:39.364 --> 1:17:46.428 Then things are still too slow, because we would have to decode a lot of 1:17:46.428 --> 1:17:47.009 times. 1:17:46.987 --> 1:17:54.460 So what we can do to make things even faster is that we decode once with the current parameters, 1:17:54.460 --> 1:18:01.248 but then we are not generating only the most probable translation; we are generating 1:18:01.248 --> 1:18:05.061 the hundred or so most probable translations. 1:18:06.006 --> 1:18:18.338 And then we are optimizing our weights by only looking at these hundred translations 1:18:18.338 --> 1:18:23.725 and finding the optimal values there. 1:18:24.564 --> 1:18:39.284 Of course, it might be a problem that at some point your new weights would prefer translations 1:18:39.284 --> 1:18:42.928 that are not inside your n-best list. 1:18:43.143 --> 1:18:52.357 So you have to iterate that a few times, but the important thing is that you don't have to decode 1:18:52.357 --> 1:18:56.382 every time you try new weights; you only re-rank the n-best list. 1:18:57.397 --> 1:19:11.325 This is mainly a speed-up in order to make things even faster. 1:19:15.515 --> 1:19:20.160 Good, then we'll finish with 1:19:20.440 --> 1:19:25.289 looking at how you really calculate the scores and everything, 1:19:25.289 --> 1:19:32.121 because the translation of a full sentence doesn't really consist of 1:19:32.121 --> 1:19:37.190 only one single phrase; of course you have to combine different phrase pairs. 1:19:37.637 --> 1:19:40.855 So how does that now really look, and what do we have to do? 1:19:41.361 --> 1:19:48.252 Just think again of the translation we have done before. 1:19:48.252 --> 1:19:59.708 So: what is the probability of translating this sentence into "what we saw up till 1:19:59.708 --> 1:20:00.301 now"? 1:20:00.301 --> 1:20:03.501 We are doing this by using phrase pairs. 1:20:03.883 --> 1:20:07.157 So we have the phrase pairs: 1:20:07.157 --> 1:20:12.911 "was wir" is the phrase pair for "what we", then "up till now", and "gesehen haben" goes into "saw". 1:20:13.233 --> 1:20:18.970 In addition, the order is important, because translation is not monotone. 1:20:18.970 --> 1:20:26.311 We are not putting the phrase pairs in the same order on the source and 1:20:26.311 --> 1:20:31.796 on the target side; in order to generate the correct translation 1:20:31.771 --> 1:20:34.030 we have to shuffle the phrase pairs around. 1:20:34.294 --> 1:20:39.747 The blue one is at the front on the source side but not at the front on the target side. 1:20:40.200 --> 1:20:49.709 This reordering makes statistical machine translation really complicated, because if you 1:20:49.709 --> 1:20:53.313 could just do it monotonically, things would be much simpler. 1:20:53.593 --> 1:21:05.288 The problem is that if you would consider all possible combinations of reshuffling them, then again the search space explodes.
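Coming back to the tuning step described a moment ago, here is a simplified sketch of optimizing weights on a fixed n-best list; all names and numbers are made up, and instead of the exact line search of minimum error rate training it just tries a small grid of values per weight.

```python
# Simplified tuning sketch on a fixed n-best list (illustrative only).
# nbest: one list per source sentence, each entry = (feature_vector, error),
# where error says how bad that hypothesis is under some evaluation metric.

def corpus_error(weights, nbest):
    """Re-rank every n-best list with the given weights and sum up the errors
    of the hypotheses that would be selected."""
    total = 0.0
    for hyps in nbest:
        features, error = max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += error
    return total

def coordinate_search(weights, nbest, grid=(0.1, 0.3, 0.5, 1.0, 2.0), rounds=5):
    """Optimize one weight at a time while keeping the others fixed."""
    weights = list(weights)
    for _ in range(rounds):
        for i in range(len(weights)):
            weights[i] = min(grid, key=lambda v: corpus_error(weights[:i] + [v] + weights[i + 1:], nbest))
    return weights

# Tiny made-up example: two source sentences with three hypotheses each.
nbest = [
    [([-9.5, -7.0], 0.4), ([-10.1, -6.2], 0.2), ([-8.9, -9.0], 0.7)],
    [([-4.2, -3.1], 0.1), ([-5.0, -2.5], 0.3), ([-3.9, -4.4], 0.5)],
]
print(coordinate_search([1.0, 1.0], nbest))
```

Real minimum error rate training additionally exploits that, along one weight, the selected hypothesis only changes at the intersection points of the score lines, so it does not need a fixed value grid.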
1:21:05.565 --> 1:21:11.508 So you again have to use some type of heuristic about which reorderings you allow and which you don't 1:21:11.508 --> 1:21:11.955 allow. 1:21:12.472 --> 1:21:27.889 That was relatively challenging since, for example, if you think of German you would 1:21:27.889 --> 1:21:32.371 have to allow very long-range reorderings. 1:21:33.033 --> 1:21:52.218 But if we now have this, how do we calculate the translation score? 1:21:52.432 --> 1:21:55.792 For that we sum up the scores at the end. 1:21:56.036 --> 1:22:08.524 So we said our first feature is the translation probability of the full sentence. 1:22:08.588 --> 1:22:13.932 We say the translation of each phrase pair is independent of the others, and then 1:22:13.932 --> 1:22:19.959 the probability of the full sentence is the product of the phrase translation probabilities: the probability of "what we" given "was wir", times the probability of "saw" given "gesehen haben", times 1:22:19.959 --> 1:22:24.246 the probability of "up till now" given its source phrase. 1:22:24.664 --> 1:22:29.379 Now we can again use the logarithm for the calculation. 1:22:29.609 --> 1:22:36.563 We take the logarithm of each probability and sum them up, 1:22:36.563 --> 1:22:48.153 and we get our first score, the translation model score, which will be some negative value. 1:22:49.970 --> 1:22:56.586 And we are not doing that only once, but exactly the same with all our translation model features. 1:22:56.957 --> 1:23:03.705 So we said we also have the relative frequency in the inverse direction and the lexical probabilities. 1:23:03.843 --> 1:23:06.226 So in the end you will have four scores. 1:23:06.226 --> 1:23:09.097 How you combine them is exactly the same; 1:23:09.097 --> 1:23:12.824 the only difference is how you look them up for each phrase pair. 1:23:12.824 --> 1:23:18.139 We said in the beginning that we are storing four scores describing how good a phrase pair is. 1:23:19.119 --> 1:23:25.415 And these give us four scores describing how probable this sentence translation is. 1:23:27.427 --> 1:23:31.579 Then we can have more scores. 1:23:31.579 --> 1:23:37.806 For example, we can have a distortion model: 1:23:37.806 --> 1:23:41.820 how much reordering is done? 1:23:41.841 --> 1:23:47.322 There were different types of distortion models; we won't go into detail, but just imagine you now have a score 1:23:47.322 --> 1:23:47.748 for that. 1:23:48.548 --> 1:23:56.651 Then you have a language model score, which is the probability of the target sequence "what we saw up till now". 1:23:56.651 --> 1:24:06.580 How we calculate this language model probability we will cover later. And there were even more scores. 1:24:06.580 --> 1:24:11.841 One, for example, was a phrase count score, which just counts how many phrase pairs are used, 1:24:12.072 --> 1:24:19.555 in order to learn whether it is better to use more short phrases or to bias towards fewer 1:24:19.555 --> 1:24:20.564 and longer ones. 1:24:20.940 --> 1:24:28.885 You can easily add this by just counting the phrase pairs, and the weight then tells you 1:24:28.885 --> 1:24:32.217 how good it typically is to use more or fewer of them. 1:24:32.932 --> 1:24:44.887 Similarly, for the language model the probability normally gets smaller the longer the sequence is, so you want something 1:24:44.887 --> 1:24:46.836 to counteract that. 1:24:47.827 --> 1:24:59.717 And then you get your final score by multiplying each of the scores we had before with the weight from the 1:24:59.619 --> 1:25:07.339 optimization and summing them up; that gives you a final score of maybe twenty-three point seven eight five, 1:25:07.339 --> 1:25:13.278 and then you can do that with several possible translations and compare.
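A minimal sketch of how such a final score could be put together for one hypothesis; all phrase-table values, the language model and distortion scores, and the weights are invented for illustration.

```python
# Illustrative score combination for one phrase-based hypothesis (made-up numbers).
phrase_pairs = [
    # four log-scale phrase-table scores per used phrase pair
    (-1.2, -0.8, -1.5, -0.9),
    (-0.4, -0.6, -0.3, -0.7),
    (-2.1, -1.9, -2.4, -1.6),
]
log_lm = -7.0          # log probability of the target sentence under the language model
distortion = -2.0      # score for how much reordering was done
phrase_count = len(phrase_pairs)

# One weight per feature, as found in the tuning step (hypothetical values).
w_tm = (1.0, 0.5, 1.0, 0.5)
w_lm, w_dist, w_count = 0.8, 0.3, -0.2

# Sum the logs of each phrase-table column over the used phrase pairs,
# then combine everything into one weighted sum.
tm_scores = [sum(pair[i] for pair in phrase_pairs) for i in range(4)]
final_score = (sum(w * s for w, s in zip(w_tm, tm_scores))
               + w_lm * log_lm + w_dist * distortion + w_count * phrase_count)
print(final_score)
```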
1:25:14.114 --> 1:25:23.949 One maybe important point here is that the score not only depends on the target side, but 1:25:23.949 --> 1:25:32.444 it also depends on which phrase pairs you have used, so the same output could have been generated in different ways. 1:25:32.772 --> 1:25:38.076 So you would have the same translation, but with a different split into phrases. 1:25:38.979 --> 1:25:45.636 And this was normally ignored, so you would just look at all of them and then select the 1:25:45.636 --> 1:25:52.672 one which has the highest probability, and ignore that this translation could be generated by 1:25:52.672 --> 1:25:54.790 several splits into phrases. 1:25:57.497 --> 1:26:06.097 So, to summarize what we looked into today and what you should hopefully remember: statistical 1:26:06.097 --> 1:26:11.440 models for how to generate machine translation output. There were the word-based statistical 1:26:11.440 --> 1:26:11.915 models, 1:26:11.915 --> 1:26:16.962 the IBM models, at the beginning, and then we have the phrase-based approach, where 1:26:16.962 --> 1:26:22.601 it is about building the translation by putting together these blocks of phrases and combining them. 1:26:23.283 --> 1:26:34.771 Then, if you have a model which has several features, not millions of them but a manageable number of features, 1:26:34.834 --> 1:26:42.007 you can combine them with the log-linear model, which allows you to have a variable 1:26:42.007 --> 1:26:45.186 number of features and to combine them easily, 1:26:45.365 --> 1:26:47.920 with weights saying how much you can trust each of these models. 1:26:51.091 --> 1:26:54.584 Do you have any further questions on this topic? 1:26:58.378 --> 1:27:08.715 On Tuesday there will be a lecture by Tuan about evaluation, and then next Thursday 1:27:08.715 --> 1:27:12.710 there will be the practical part. 1:27:12.993 --> 1:27:21.461 So please come to the practical part here, but you can also do something yourself if you are not 1:27:21.461 --> 1:27:22.317 able to attend. 1:27:23.503 --> 1:27:26.848 In that case please tell us, and we will have to see how we handle that.