Hey, welcome to today's lecture in the machine translation course.

Last time we started addressing the problem of building machine translation systems with limited resources, and we looked into different ways of using other types of resources. In particular, we looked into language models and especially pre-trained models, which are a different paradigm for learning from data.

However, there is one other way of getting data, and that is just searching for more data. The nice thing is that we have the World Wide Web, a very big data resource with various types of data which we can all use. If you want to build machine translation for a specific language or a specific domain, it might be worth looking there.

In general, we have seen different types of additional resources; today we look into data crawling. It always depends a bit on what type of task you have. With crawling, you have a whole range of possibilities. We saw some weeks ago that multilingual models are another way to share knowledge between languages. Last time we looked into monolingual data, and later we will also see unsupervised MT, which is purely based on monolingual data.

What we will focus on today is really web crawling of parallel data. We will not focus on the crawling part itself; a networking lecture would tell you about the best techniques for web crawling, and we will just rely on existing tools. The challenge for us starts once you have web data: how do you get from that to clean, parallel text? These are the different steps we will look at today.

So why would we be interested in this? There are quite different ways of collecting parallel data. The big difference to dedicated collection is that there you focus on one specific website, so you can manually check how to extract the data. That you can do for dedicated resources where you get high-quality data.

Another thing which has been done for several tasks is crowdsourcing. I don't know if you know sites like Amazon Mechanical Turk; there you can get a lot of relatively cheap labor to write translations for you. Of course you cannot collect millions of sentences this way, but if it's thousands of sentences, crowdsourcing is often interesting, for example when you have a very specific need. However, this is a field of its own, so crowdsourcing is not that easy; it's not like you just upload a task and get good data back.
If you just do that, you will get very poor quality. In the field of machine translation, crowdsourcing is quite commonly used, but the problem is: since the workers are paid quite badly, a lot of people also try to put as little effort into it as possible. So if you're using it without any control mechanisms, the quality will be bad. What you can do is add additional checking mechanisms. And I recently read a paper suggesting that these things can now get worse, because people don't do the task themselves anymore but paste it into a machine translation system. So it's a very interesting topic of its own, and a lot of resources have been created this way.

If it's really about large-scale data, then some type of web crawling is the best option. However, the biggest issue in this case is quality. How can we somehow ensure the quality of the data? We all know that on the Internet there is also a lot of low-quality content, and the even bigger question for us is: how can we ensure that the texts we extract are really translations of each other?

Why is this interesting? We had this number before: there are estimates that a human reads around three hundred million words. If you look at the web, you have orders of magnitude more words there, so you can really get a large amount of data, and not only monolingual data. At least for some language pairs there is also a large amount of parallel data. For languages that are official languages in one country, crawling is typically quite successful, because a lot of government websites need to be translated; India, which we have worked with, is a good example where you find parallel government data.

[Student:] Two questions. First of all, if ChatGPT and machine translation tools are becoming ubiquitous and everybody uses them, don't we get a problem, because we want to crawl the web and use the data, and much of it will be machine-generated?

Yes, that is a severe problem: are we then only training on data which was automatically generated? We talked about synthetic data when we did back translation, and that can help, but it also gives you an upper bound: you cannot get much better than the system that generated the data. So we will get more and more issues there, and maybe at some point we won't crawl the current Internet anymore, but focus on older snapshots of the Internet, like the ones created by the Internet Archive.
There are also classification algorithms for detecting machine-generated data, and there was a very interesting paper on how to watermark machine translation output. There are two scenarios here. The first is that you might want to recognize your own translations: if you are a big company running an MT system that is widely used, the problem might be that most of the translation output out there was created by you, and you want to be able to detect that. And there is a relatively easy way of doing that. An MT system can usually produce several good translations for an input; they are different, but there is not a single correct translation. So what you can do is not output the best hypothesis to the user, but, among the near-best ones, the one with the highest value under some hash function. The candidates are all good translations, so quality barely suffers, but if the translations you later find systematically score high under your hash, the data was most likely generated by your model. Of course, this you can only do for data generated by your own model; what we are now also seeing is work on detecting machine-generated text in general. It's definitely an additional research question that might become more and more important, and it might become an additional filtering step.

There are other issues in data quality, for example in which direction a text was translated. For a long time nobody was interested in that, but if you are now reaching better and better quality, it makes a difference whether the original data went from German to English or from English to German. Translated text is sometimes called "translationese": if you generate German from English, it has a more similar structure to the English than German that was written directly. These are all issues which you might then address, for example with additional training to remove them, or you first train on this data and later train on higher-quality data. That's the general picture; it's an important issue, but until now it hasn't been addressed that much, maybe because the quality of MT output was low enough to spot.

[Student:] I think we should be careful when we use the Internet. The problem is, there's a lot of English text, but what about less-used languages, say some language spoken in Africa? I mean, that's why most data is English.

Yes. For other languages you take what you can get, and if there is no data on the Internet, then crawling doesn't help. So there is still a lot of dedicated data collection, where you go out into the wild and try to collect data directly. English is the most common language on the web, but you find surprisingly much data also for other languages.
Of course, only if they are written, remember: most languages are not written at all. For those you might find some video or audio, but it's difficult to find text. So web crawling is mainly done for languages which are commonly spoken and written.

And that is exactly the next point: "there is much data" is only true for English and some other languages. Therefore, a lot of research on how to make things data-efficient and learn faster from less data is still essential.

So what kind of data are we interested in? Parallel data. Until now we always assumed we have parallel data: each source sentence has a translation on the target side. We have to be careful when we start crawling from the web, because we might only get related types of data.

One common thing is what people refer to as noisy parallel data: documents which are translations of each other, but where some sentences have no translation on the other side because they were left out. If you have these types of documents, your algorithm to extract parallel data has to be a bit more robust.

I don't know if you still remember, at the beginning of the lecture we talked about different data resources and about sentence alignment. The classic approach aligns source and target sentences based mainly on sentence length, and you have some probabilities for one-to-one, one-to-two and similar alignments. It's a very simple algorithm, but it works fine for really high-quality parallel data. When we're talking about noisy data, we might have to do additional steps and use more advanced models to extract what is parallel and keep the quality high. So with noisy parallel data, the documents might not be as easy to process.

And then there is the even more extreme case, which has also been used, to be honest, although using this data is nowadays not that common; there was more interest in it maybe ten or fifteen years ago. That is what people refer to as comparable data. The idea is that you don't even have sentences which are translations of each other, but you have, say, news documents or articles about the same topic. There you can find phrases which mean the same, so parallel fragments, even if no full sentence is parallel.

If you think about Wikipedia, for example: by Wikipedia's general idea, the articles in different languages are written independently of each other. They contain different information; maybe the German article has more detail than the English one. However, it might be that some parts still express the same content.
The same is true for newspaper articles written about the same event at the same time. So this comparable data gives you an opportunity to learn, for example, new phrases and vocabulary even when you don't have parallel data but can monitor both languages over time. Not everything will be the same, but there will be overlap about the events.

So, if we're talking about web crawling: as I said in the beginning, it started with very specific websites. People selected good sources by hand, really focused on them, and wrote a very specific extraction for each. The European Parliament corpus was built like that, and TED; maybe you have even looked at a particular session. These corpora are still important, but they are of course very specific and cover only certain topics.

Then there was a focus on language-centric crawling; there were big projects where, for example, you could point the crawler at websites in a given language. But what people really like is a more general approach, where you just specify "I'm interested in data from German to Lithuanian" and then, as automatically as possible, the system collects as much data as it can and extracts parallel data from it.

So that is our interest, and the question is how we can build these types of systems. The first components are general web-crawling-based systems: there is nothing MT-specific about them. Based on the websites you have crawled, you do text extraction. We are typically not that interested in the images and markup, so we try to extract the raw text. This is also not specific to machine translation; it's the traditional way of doing web crawling. At the end you have something like a large set of documents: the text, often with document boundaries and some metadata. That is the starting point for the machine-translation-specific part.

One way of doing that is hierarchical, very similar to the traditional setup. The first step is document alignment: based on some initial signal, you establish that this is a German website and that is its English translation. Based on this document alignment, you can then do your sentence alignment. This is similar to what we had before with the Gale and Church algorithm, but this is typically noisier parallel data, so you should not assume that everything is present on both sides or that the order is exactly the same; you need more flexible algorithms.
Then it depends: if the documents you crawled are really some type of parallel data, sentence alignment is enough. If they are only comparable, you should instead do what is referred to as fragment extraction.

One problem with these hierarchical models is error propagation in the document alignment. If you say these two documents are aligned, then you can only find sentence pairs within them, and if you miss a document pair, all of its sentences are lost. And the comparable scenario is very different: only small parts of the documents are parallel, and most parts are independent of each other.

Therefore, more recently, there is also the idea of directly doing sentence alignment globally, so that you directly compare sentences across the whole collection. Error propagation was the challenge of the first approach; in the second approach, the big challenge is that you have to do a lot of comparisons. You have to compare every source sentence with every target sentence, which is quadratic; if you think of millions of sentences, that's trillions of pairs.

And this also motivates a last step in both cases. In both pipelines, remember, you are dealing with a very large data set, so all of these steps, also the document alignment, have to be done very efficiently. And if you do things very efficiently, that typically means your quality goes down, because you have to assess each pair fast and can spend less computation on it. Therefore, in a lot of scenarios it makes sense to add a filtering step at the end: a second filtering pass where we can put a lot more effort into each pair, because we no longer have n-squared possible combinations; we have already selected maybe two or three candidates per sentence, or even fewer. Then we can put a lot more effort into each individual example and build a high-quality classifier to really select the good pairs.

One example for this: one of the biggest projects doing this is the so-called ParaCrawl corpus. It follows the pipeline picture from before, and there are a lot of challenges in each step. The steps start with the seed URLs, so what you provide at the beginning is a list of promising starting points. Then they do the crawling, the text extraction, the document alignment, the sentence alignment and the sentence filtering, and it goes all the way down to how you store the text efficiently. As we'll see later, these corpora exist for a lot of language pairs, so it's often easier to download them and then improve on top. For the crawling, one thing they often do is not even crawl the websites directly.
Instead, there are already large crawls of the Internet available that you can work with today.

In a bit more detail, this is shown here: for all the steps you can see different possible tools. For some you need bilingual knowledge, for example a dictionary, or you can use a machine translation system. There are two different ways of doing the document alignment, and for the sentence alignment you can again use different tools; the comparison can be done, for example, lexicon-based or with embeddings. We'll go through the next steps in a bit more detail, but before we do: are there more questions about the general overview of this pipeline?

Yeah, so two or three things about the web crawling itself. You normally start with seed URLs where it's most promising. If you're interested in German to English, you would maybe collect seeds from companies where you know they have a German and an English website, or from government agencies which publish in both languages. And then we can use one of the standard tools to crawl from there using standard web crawling techniques.

There are several challenges when doing that. If you request a website too often, you can get blocked, so you need rate limiting. You have to keep a history of the sites you visited, because you click on all the links, and then on all the links again, and you don't want to loop. You have to be very careful about legal issues, starting with robots.txt, which tells you what you are allowed to crawl. That is one major thing about web crawling in general: the problem of how you deal with copyright. That is why it is sometimes easier to start with existing crawled data, so that you don't have to deal with all of this yourself. And of course there are network issues, retries and so on; more technical things, but there are good tools for them.

Another thing which is very helpful and often done is, instead of doing the web crawling yourself, relying on existing crawls. The best-known one is Common Crawl; I think a lot of the large language models are trained on Common Crawl data. It's an American organization which really works on crawling the web and making the data available. The nice thing is, if you start with this, you don't have to worry about the networking side. I don't think you can download all of it, because it's too big, but you can build a pipeline to process it. That is a general challenge in all this web crawling and parallel data mining: the data is so big that you cannot just look at it and check the processing manually; everything has to run as an automatic pipeline. Here it might make sense to directly filter for the domains or languages you are interested in, so the amount of data stays manageable.

Then you can do the text extraction, which means converting the HTML and then extracting the plain text from it.
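A minimal sketch of this extraction step in Python, using the BeautifulSoup library; production pipelines use more robust boilerplate-removal tools, so treat this only as an illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_text(html: str) -> str:
    """Strip markup, scripts and styles from an HTML page, keep visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove non-content elements entirely
    # get_text collapses the remaining nodes; keep one line per text block
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)
```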
Often very important is the language identification. It's not always clear, even from the links, which language a page is in, but there are quite good tools that can identify the language even from relatively short text. (I'll show a small sketch of this step below.)

And then you are in the situation that you have all your data and can start the MT-specific part. After the text extraction you have a large collection of data: text, maybe with document boundaries and some meta information. Now the question is: based on this monolingual, or rather multilingual, text, so text in many languages but not aligned, how can you generate parallel data?

If we look at it as a machine learning task, what we have is a set of sentences in the source language and a set of sentences in the target language. That is the data we have, and we don't directly assume any ordering; the documents are not really aligned, or there is at most a graph of links. What we are interested in is finding the alignments: which sentences are aligned to each other, and which sentences we have to discard because there is no translation for them. Exactly this mapping is what we need to find.

And if we model it more from the machine learning point of view, we can model it as a classification task: given a pair of sentences, decide whether they are translations of each other. So the main challenge is to build this type of classifier.

However, the biggest challenge, as I already pointed out in the beginning, is the size: if we have millions of source and target sentences, the number of comparisons is n squared, so the naive approach is very inefficient, and we need to find something better.

Traditionally, there is the first approach I mentioned before, the local or hierarchical mining. The idea is: first we align documents, and once you have the document alignment, you only need to align sentences within the aligned pairs. That of course makes everything more efficient, because we don't have to do all the comparisons. That is, for example, what the before-mentioned ParaCrawl does. But it has the issue that if the document alignment is bad, you have error propagation and you cannot recover from it, because for documents that are never aligned, there may be parallel sentences which you will never find.

Therefore, more recently, there is also what is referred to as global mining. And there we really do all the comparisons, although it is in principle quadratic.
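Here is the promised sketch of the language-identification step, using the fastText library and its pretrained lid.176 model (both real; the file path is wherever you downloaded the model to):

```python
import fasttext  # pip install fasttext

# Pretrained language-ID model from the fastText website (176 languages).
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str):
    """Return (ISO language code, confidence) for a piece of text."""
    labels, probs = model.predict(text.replace("\n", " "))  # predict dislikes newlines
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Das ist ein kurzer deutscher Satz."))  # e.g. ('de', 0.99...)
```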
The idea there is that you represent all the sentences in a vector space, and then it's about nearest-neighbor search, for which there are very efficient algorithms. If you only compare each sentence to its nearest neighbors, you don't have to do the full quadratic comparison.

So, in the first step, what we want to look at is the hierarchical approach: first the document alignment, and then the sentence alignment.

If we're talking about document alignment, there are typically two steps in it. We first do a candidate selection; often there are several stages of this, again to make things more efficient. The candidate selection asks: which documents do we even want to compare? Then, for the initial candidates which might be parallel, we can do a classification, and there are different ways: we can use lexical similarity, or we can use structural information.

The first and easiest candidate selection is to take all possible pairs. That's one possibility; the other one is based on structure. Based on how the websites look, you might find that only certain pairs can be translations. This is typically the one place where we use meta information, which can be very useful because we know, for example, how websites are linked.

We can try to use URL patterns: if we have a URL that ends with the German language marker and another that is identical except for the English marker, that can easily be used to find candidates (I'll show a small sketch of this heuristic below). Then we only compare websites whose URLs suggest the right language pair and that the pages are translations of each other. Typically you use several heuristics like this, and then you don't have to compare all websites, but only the candidate pairs.

This works especially well with today's content management systems; sometimes it's nice and easy to read off. There are typically links from the parent page to the different language versions. You can look at the KIT website: it's the same thing, you can switch the language there. You can either start from the parent website or you can click on the English link. So you can either pair up all linked websites, or be even more focused and check whether the link is a flag icon or the language name. It really depends on how much you want to filter out; there is always a trade-off between being efficient and having high recall.

Based on that, we then have our candidate list. So now we have two sets of documents, say German and English ones, with candidate pairs between them.
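Here is the small sketch of the URL-pattern heuristic promised above; it is deliberately naive (a single /de/ vs /en/ pattern, and example.org is a made-up domain), while real pipelines combine many such patterns:

```python
import re

def url_candidates(german_urls, english_urls):
    """Pair URLs that differ only in a language marker like /de/ vs /en/."""
    english = set(english_urls)
    pairs = []
    for url in german_urls:
        # replace the first /de path segment with /en; naive on purpose
        guess = re.sub(r"/de(/|$)", r"/en\1", url, count=1)
        if guess != url and guess in english:
            pairs.append((url, guess))
    return pairs

pairs = url_candidates(
    ["https://example.org/de/produkte", "https://example.org/de/kontakt"],
    ["https://example.org/en/produkte", "https://example.org/en/contact"],
)
print(pairs)  # [('https://example.org/de/produkte', 'https://example.org/en/produkte')]
```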
And now the task is: among the candidates, we want to extract those which are really translations of each other. So the question is how we can measure document similarity, because what we then do is measure the similarity and keep the pairs where it is high. And here you already see why this is problematic when documents are only partially parallel or merely similar.

How you can do that is again two-fold: more content-based or more structure-based. Typically you calculate a number of features and then maybe train a classifier on a small labeled set which decides, based on these features, whether the document pair is parallel.

One group of features are surface features. The idea is to use, for example, the text length of the two documents. Of course, the text lengths will not be exactly the same, but if one document has fifty words and the other five thousand words, it's quite unrealistic that they are translations. So you can use the text length ratio as one proxy for whether this might be a translation pair.

Another one is the alignment between the HTML structures: from a website you can derive some type of structural skeleton. You can compare that, say the German structure against the French version, and calculate a similarity, because translated pages often keep the structure. Of course, this gets more problematic if the translation uses a different page structure; then these features are less helpful. However, if you combine the features in a trained way, you can automatically learn how helpful they are.

Then there are different content-based ways. One easy option, especially for languages that use the same script, is to look for overlapping words. We can use bag-of-words representations, which we'll look into, so some type of word overlap; and neural embeddings are another option.

And then, since we have machine translation: one idea is to really use the machine translation system itself. This is the approach which takes the most effort, so you would not run it on every possible candidate pair from the whole crawl, but only where you have already narrowed things down. Maybe you're first thinking: I can't do that, because I'm collecting data precisely in order to build an MT system. But you can use an initial, weaker system to translate, collect more data with it, and iterate. One way of doing this is: you translate all documents, for example, into English. Then you only need to compare English with English, and you can do that, for example, with trigrams.
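A minimal sketch of this translate-then-compare idea: represent each English(-translated) document as a bag of word trigrams and score pairs by cosine similarity. Plain Python, just to make the computation concrete; the two example documents are invented:

```python
from collections import Counter
from math import sqrt

def trigram_counts(text: str) -> Counter:
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))  # word trigrams

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# doc_de_translated: a German document machine-translated into English
doc_de_translated = "the city council approved the new budget on monday"
doc_en = "on monday the city council approved the new budget plan"
print(cosine(trigram_counts(doc_de_translated), trigram_counts(doc_en)))
```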
The trick to make this efficient is an inverted index: for every trigram you store which documents it occurs in. For example, this trigram occurs in document 1, which is German, in document 1 of the Spanish collection, in document 2 of the French collection, and so on. You create this index, and based on it you can calculate how similar the documents are, because you only need to look at documents that share trigrams. Then you can use the cosine similarity to calculate which is the most similar document, or how similar a pair of documents is, and then decide whether this is a possible translation.

Of course, the documents will not be exactly the same; even if you have a truly parallel document pair, say French and German, and you machine-translate one side, you will not get a perfect match. Therefore you look at the n-gram overlap and accept that only part of it matches.

Okay, before we take the next step and go to the sentence alignment: are there more questions about the document alignment?

So, sentence alignment. There are different ways of doing it; one tool I'll describe here is Hunalign. Here we have the advantage that we start from aligned documents, so we might have, say, a hundred sentences on each side. Although it still might be costly to compare everything with everything, these aligners typically assume that we are only interested in alignments that lie near the diagonal of the sentence-by-sentence matrix: not exactly on the diagonal, but in some band around it, in order to make things more efficient. You can restrict yourself to this band because, if the documents were wildly reordered, they probably wouldn't have passed the document alignment in the first place.

In Hunalign, we then calculate the similarity for these sentence pairs based on a bilingual dictionary, so essentially based on how much word overlap you have. And then we find a path through this matrix: a monotone path from the beginning to the end which maximizes the summed similarity, and the pairs on this path are your parallel sentences.

The advantage is that, on the one hand, this limits your search space for the sentence alignment, and secondly, it brings in global information. What does that mean? Even if an individual pair has a very high similarity, you might not take it, because overall the path that uses it scores worse. So sometimes it makes sense to also use this global information and not only compare individual sentences, because otherwise you sometimes pick pairs that only look like a good translation locally.
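A sketch of this kind of monotone path search with dynamic programming. This is the textbook recurrence (1:1 matches plus skips on either side), not Hunalign's actual implementation; the `sim` matrix would come from dictionary overlap, and the skip penalty is an illustrative choice:

```python
def align_path(sim, skip_penalty=-0.2):
    """sim[i][j]: similarity of source sentence i and target sentence j.
    Returns the list of (i, j) pairs on the best monotone path."""
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            cands = []
            if i and j:  # align source i-1 with target j-1
                cands.append((score[i-1][j-1] + sim[i-1][j-1], (i-1, j-1)))
            if i:  # leave source sentence i-1 unaligned
                cands.append((score[i-1][j] + skip_penalty, (i-1, j)))
            if j:  # leave target sentence j-1 unaligned
                cands.append((score[i][j-1] + skip_penalty, (i, j-1)))
            score[i][j], back[i][j] = max(cands)
    # trace back, keeping only the diagonal steps (the aligned pairs)
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```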
So with this monotone path you prevent the system from aligning pairs at the borders of the documents where it makes no sense globally. That may get you a bit better quality.

The path always goes monotonically from the top-left to the bottom-right corner, which also means there are restrictions: you cannot have reorderings, where a sentence from the end of the source document is translated at the beginning of the target. So Hunalign typically only really works well if you have relatively high-quality, truly parallel documents. If you have this more general data where only some parts are translated, it doesn't really work. It's okay with maybe some sentences missing at the end or a few in between, but in general it is not robust against significant noise.

The second tool is what is referred to as Bleualign. This one does not use as much global information: you translate each source sentence to English, and then you calculate the BLEU score between the translation and the target sentences. The pairs with high BLEU scores give you anchor points, the ones shown in purple here, and then you can add further alignments around these anchor points, which might score a bit lower. But in this case you are able to deal with reorderings and with parts that are missing. The price is that we need a full-scale MT system to do this calculation, while Hunalign only needs a bilingual dictionary. And in general: the better your similarity metric is, so the better you are able to compare two sentences directly, the less you have to rely on structural information such as the position in the document.

Any more questions? Then: there are tools like Vecalign which try to do the same with embeddings. The idea is that you embed each sentence in a crosslingual vector space. A crosslingual vector space means a space where sentences from different languages are near to each other if they have a similar meaning; the German sentence and its French translation should end up next to each other. Then you can measure the similarity by some distance metric in this vector space, and you say two sentences are aligned if their distance in the vector space is small. We'll discuss soon, in a bit more detail, how these embeddings are trained. The nice thing is: with this you get quite good quality when deciding whether two sentences are translations of each other.
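Coming back to Bleualign for a moment: a toy sketch of the anchor search, using the sacrebleu library for sentence-level BLEU. `mt_into_english` stands in for an MT system you would have to provide, and real Bleualign does a proper alignment around the anchors rather than this greedy nearest match:

```python
import sacrebleu  # pip install sacrebleu

def find_anchors(source_sents, target_sents, mt_into_english, min_bleu=30.0):
    """Anchor pairs (i, j): source sentence i whose English translation
    scores high sentence-BLEU against English target sentence j."""
    anchors = []
    for i, src in enumerate(source_sents):
        hyp = mt_into_english(src)  # hypothetical MT call, to be provided
        scores = [sacrebleu.sentence_bleu(hyp, [tgt]).score
                  for tgt in target_sents]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= min_bleu:  # threshold is an illustrative choice
            anchors.append((i, j))
    return anchors
```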
Coming back to Vecalign: it often even works in a more global way and really compares everything to everything. What Vecalign also does is make this more efficient when finding the alignment path. If you don't want to compare everything to everything, you first build blocks of consecutive sentences and find an approximate path fast; then you go back to full sentence resolution, but only compare in the area around that path.

So if you compare blocks on the source and the target side, you have far fewer possibilities: the number of block comparisons is much smaller than the number of sentence comparisons. And with neural embeddings you can embed not only single sentences but whole blocks. So the way you make this fast is coarse-to-fine: you start from a coarse-grained resolution, where you compare blocks, you get a rough path where the alignment could be, and near this path you refine to finer and finer resolution.

And this is what the slide shows: "While it was sleeping in the forest..." and so on; these are the source and the target sentences, and if you build blocks, you align blocks of several sentences at a time.

So this was the hierarchical pipeline approach. Now we want to look at the global mining, but before that: questions?

In the global mining setting we also have to do some filtering first. Typically, the pipelines start from something like Common Crawl, and then they do some preprocessing. First you try to deduplicate paragraphs. Of course, if you compare everything with everything and you have the same input twice, you will also extract things twice, so the first step is to deduplicate: you keep each paragraph only once. There is a lot of text which occurs many times; think of all the pages with the cookie notices and "accept" messages you see everywhere. So you deduplicate, also because your crawler may have visited the same website twice.

Then you remove low-quality data, like those cookie warnings and other boilerplate. Then you do language identification: you want to know for each sentence or paragraph which language it is, so that you can later mine the right language pairs. Finally, there is some perplexity-based filtering: you remove, for example, data with very high perplexity under a language model; that is typically garbled text, for example long lists of strange names.
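A minimal sketch of such a perplexity filter, using the KenLM Python bindings (the library and its `perplexity` method are real; the model path and the threshold are placeholders you would set per corpus):

```python
import kenlm  # pip install kenlm

# A KenLM model trained on clean text in the target language;
# "en.arpa" is a placeholder path for whatever model you trained.
model = kenlm.Model("en.arpa")

def keep_paragraph(text: str, max_ppl: float = 1000.0) -> bool:
    """Reject paragraphs that a clean-data LM finds too surprising."""
    return model.perplexity(text) <= max_ppl
```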
Sometimes you also remove very low-perplexity data, because extremely predictable text is often itself machine-generated or boilerplate.

And then, the model which is mostly used for the embeddings is the so-called LASER model. It's based on machine translation; I hope you all recognize the machine translation architecture here. However, there is a difference between a general machine translation system and this one. Do you see the one big difference?

There is one difference: there is no attention. Instead, we take the encoder states, pool them into a single fixed-size sentence embedding, and then use that vector in the decoder at each time step. So it's maybe a bit similar to the original sequence-to-sequence systems without attention. However, this has the disadvantage that we have to put everything into one vector, and maybe not all information fits in there.

But in this framework we are not really interested in the machine translation quality itself. We train the model to do machine translation, and what that means is that, in the end, the sentence embedding should contain as much information as possible: only if all the important information is in this one vector can the decoder produce a good translation.

So that is the first step: we build this MT system, not with the goal of making the best MT system, but in order to learn sentence embeddings that hopefully capture everything important, because otherwise the model couldn't generate the translation. It's a bottleneck idea: force as much information as possible through the fixed-size vector. And if you think about what we want to do later, finding nearest neighbors, then fixed-dimensional vectors are exactly what we need: finding similarities is easy in an n-dimensional space with nearest-neighbor search, while with variable-length representations it would be very difficult.

There's one more thing we need: we don't just want to find nearest neighbors within one language, but across languages. Do you have an idea how we can train the model so that the embeddings of different languages can be compared? Think about what we did two or three lectures ago.

[Student:] We can train them in a multilingual setting.

Exactly, and that's how it's done in LASER: we're not training only from German to English, but multilingually. If the same encoder and embedding space has to be useful for German, French and so on, then the model automatically learns embeddings that are comparable across languages.
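A schematic PyTorch sketch of this bottleneck architecture, loosely following the LASER recipe (BiLSTM encoder, max-pooling into one fixed vector, a decoder that is fed that vector at every step). The dimensions and details are simplified illustrations, not the actual LASER code:

```python
import torch
import torch.nn as nn

class BottleneckSeq2Seq(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.tgt_emb = nn.Embedding(vocab_size, dim)
        # decoder input: previous target token + the sentence embedding
        self.decoder = nn.LSTM(dim + 2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def embed_sentence(self, src):                # src: (batch, src_len)
        states, _ = self.encoder(self.src_emb(src))
        return states.max(dim=1).values           # max-pool: (batch, 2*dim)

    def forward(self, src, tgt_in):               # tgt_in: (batch, tgt_len)
        sent = self.embed_sentence(src)           # the fixed-size bottleneck
        # no attention: the same sentence vector is appended at every step
        sent_rep = sent.unsqueeze(1).expand(-1, tgt_in.size(1), -1)
        dec_in = torch.cat([self.tgt_emb(tgt_in), sent_rep], dim=-1)
        hidden, _ = self.decoder(dec_in)
        return self.out(hidden)                   # logits over target vocab
```

At mining time only `embed_sentence` is used; the decoder exists purely to force information through the bottleneck during training.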
Then we can see why the embeddings become similar: if you put in a German and a French sentence that have the same English translation, and you always train to generate that same translation, the encoder has to do exactly the same thing for both, so the easiest solution for the model is to give them similar embeddings. If two sentences are very different, the English decoder output will differ, and therefore the embeddings will too.

So that is the first step. Now we have this encoder, which has to be trained on parallel data. Then we can apply these embeddings to our new crawled data and use them to make efficient comparisons.

So how can you do the comparison? Maybe the first thing you think of is a threshold: you take all the German sentences and all the French sentences, you compute the cosine similarity between them, and then you take all pairs where the similarity is very high. You have your French list, you have your German list, and you just take all sentence pairs that are similar enough.

[Student:] It's a lot of additional computation, but we have a lot of data, so we will find good pairs.

It's a good point, but it's also not that easy. One problem is that the density varies: there are sentences with very many close neighbors, and other points where there are very few points in the neighborhood. In the dense regions, a fixed threshold will extract lots of pairs that are not translations.

So what is typically done instead is the margin-based score: how good is a pair compared to the alternatives? You take the similarity between x and y, and then you look at the eight nearest neighbors of x and the eight nearest neighbors of y, and you divide the similarity of x and y by the average similarity to these neighbors. So what you are really asking is: are these two sentences much more similar to each other than to all the other candidates? And if the pair is exceptionally similar compared to other sentences, then they should be translations.

Of course, that also has some issues. The good side: think about short sentences, which tend to be somewhat similar to many things in general; their neighborhoods are dense, and the margin normalizes for that, so if all the other candidates are also close, plain similarity is not enough, and if all the others are far away, the pair really stands out. The downside is that we assume here there is basically only one translation pair per sentence.
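A sketch of this margin scoring, with the FAISS library doing the k-nearest-neighbor search (FAISS and its calls here are real; the lecture comes back to it just below). This is the "ratio" variant of the margin from the mining literature, and the embeddings are assumed to come from an encoder like the one sketched above:

```python
import numpy as np
import faiss  # pip install faiss-cpu

def margin_scores(src_emb, tgt_emb, k=8):
    """src_emb, tgt_emb: float32 arrays (n, d) with L2-normalized rows.
    Returns, for each source sentence, its best target and margin score."""
    dim = src_emb.shape[1]
    index = faiss.IndexFlatIP(dim)         # inner product = cosine (normalized)
    index.add(tgt_emb)
    sims, nbrs = index.search(src_emb, k)  # k nearest targets per source

    # average similarity of each target to its k nearest sources
    rindex = faiss.IndexFlatIP(dim)
    rindex.add(src_emb)
    tgt_sims, _ = rindex.search(tgt_emb, k)
    tgt_avg = tgt_sims.mean(axis=1)
    src_avg = sims.mean(axis=1)

    best, scores = [], []
    for i in range(src_emb.shape[0]):
        # margin: cos(x, y) divided by the mean of the two neighborhood averages
        m = sims[i] / ((src_avg[i] + tgt_avg[nbrs[i]]) / 2.0)
        j = int(np.argmax(m))
        best.append(int(nbrs[i][j]))
        scores.append(float(m[j]))
    return best, scores
```

At real mining scale, the index building and the searches are sharded and run in parallel; that is what makes the quadratic problem tractable.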
So it has some problems if there are two or three valid translations of a sentence; then this pair might not stand out and you might not find it. But in general this works well: for example, the large corpora mined from Common Crawl, such as CCMatrix, are large parallel data sets built this way.

One point maybe also here: of course, now it's important that we have done the deduplication before, because if we hadn't, we would have points sitting at exactly the same coordinates, and the margin computation breaks down.

Maybe one more small thing on this: a major issue in this setting is still making it computationally feasible. You still have to do all of this comparison, and that cannot be done naively. So what is typically done is to exploit that things can be done in parallel: calculating the embeddings and all that doesn't need to be sequential, each sentence is independent. Then you typically build an index over the vectors and do some kind of quantization. There is the FAISS library, which does k-nearest-neighbor search very efficiently in very high-dimensional spaces, as in the sketch above. And based on that you can do the comparisons, even in parallel, because you can look at different areas of your space and compare the different pieces independently. With this you are able to do very fast similarity calculations over these sentence collections.

So this is currently one of the main approaches; many of the really big mined corpora are collected this way, very big corpora even for language pairs where you could otherwise hardly find data. Do you have any more questions on this?

And then some more words on this last step, the filtering. So we have now done our parallel data mining, and we could assume that everything is fine now. However, the problem is that this mined data is typically still quite noisy. In order to make things efficient and to have a high recall, the final data is often not of the best quality. So it is essential to do another filtering step and to remove sentence pairs which only seem to be translations.

And here, of course, the final evaluation metric would be: how much does my MT system improve when trained on the filtered data? There are even shared tasks on exactly this, where participants get noisy mined data and have to filter it. However, training a full system for every filtering decision is very time-consuming, so you might not always want to run the full pipeline and training. So how can we model the filtering directly?
What we want is the best overall translation quality, but this is normally not achieved by keeping all the data. And that's why you do this two-step approach: first the sentence alignment, and then, once you do the sentence filtering, you can put a lot more effort into each comparison. For example, you can just translate the source and compare that translation with the original target and calculate how well it matches. This, of course, you can do on the final candidate set, but you can't do it on your initial set of millions of sentences on each side.

So what this is, again, is a classification task: you input a sentence pair, and you want a binary decision: these are sentence pairs with high quality, and these are sentence pairs with low quality.

Does anybody see what might be a challenge if you want to train this type of classifier?

[Student:] How do you measure quality exactly? That is probably a problem.

Yes, that is one, that is true. There is an even simpler one: getting the training data. High-quality data here is not so difficult.

[Student:] Probably we have a class imbalance; we don't see many bad-quality combinations. It's hard to say where you find bad quality, and what type of bad quality.

Exactly. Because if the negatives are too easy, say you just pair a random German and a random English sentence, that is very easy to classify. But what you're interested in is bad-quality data which still passes your first initial mining step. As the classifier you can use any type of network or model; in the beginning, things like random forests over features were used.

So the positive examples are quite easy to get: you just take existing parallel, high-quality data. You normally don't even need a lot of data to train and validate such a classifier. The challenge is the negative samples, because how would you generate negative samples? The real negative examples are the ones which pass the first step but shouldn't pass the second.

So how do you typically do it? You create synthetic negative examples. You can do random pairings, you can do frequency-based word replacements, you can truncate sentences. But this is one major issue when you generate the data this way: it may not match well with the real errors that you're actually interested in. So some of the most challenging work here is to find negative samples which are hard enough to detect.
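A sketch of such synthetic-negative generation with the three corruption types just mentioned (random mismatch, word replacement, truncation); the exact rates are illustrative choices:

```python
import random

def make_negatives(pairs, vocab, seed=0):
    """pairs: list of (src, tgt) positives. Returns one synthetic
    negative per positive, using one of three corruption types."""
    rng = random.Random(seed)
    negatives = []
    for src, tgt in pairs:
        kind = rng.choice(["mismatch", "replace", "truncate"])
        if kind == "mismatch":              # random wrong target
            _, wrong = rng.choice(pairs)    # may rarely pick the right one
            negatives.append((src, wrong))
        elif kind == "replace":             # swap ~30% of target words
            words = tgt.split()
            for i in range(len(words)):
                if rng.random() < 0.3:
                    words[i] = rng.choice(vocab)
            negatives.append((src, " ".join(words)))
        else:                               # drop the second half
            words = tgt.split()
            negatives.append((src, " ".join(words[: max(1, len(words) // 2)])))
    return negatives
```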
1:19:17.537 --> 1:19:21.863 And the other thing which is difficult is, of course, the data ratio. 1:19:22.262 --> 1:19:29.827 Why is that important? Why is the ratio between positive and negative examples important here? 1:19:30.510 --> 1:19:43.644 Because in the case of class imbalance we could effectively learn to always predict positive, high quality, and we would mostly be right. 1:19:44.844 --> 1:19:52.414 Yes, so for training this is important, because otherwise it might be too easy to always predict one class. 1:19:52.732 --> 1:20:06.245 And on the other hand it also matters at test time: if we train with equal amounts, we implicitly assume that about half of the mined data is bad, and if the real quality is higher or lower, we might accept too few or too many pairs. 1:20:06.626 --> 1:20:10.486 So this ratio is not easy to determine. 1:20:13.133 --> 1:20:23.175 What type of features can we use? Traditionally, people looked at things like word translation probabilities. 1:20:23.723 --> 1:20:38.696 And nowadays, of course, we can model this with sentence similarity, so this is again a case for pretrained multilingual language models. 1:20:40.200 --> 1:20:56.497 We can, for example, put the source and the target sentence in together, and then based on a classification label we can decide: is this a parallel sentence pair or not? 1:20:56.476 --> 1:21:00.054 So it's more like a normal classification task. 1:21:00.160 --> 1:21:16.886 And by having a model which can take multilingual input, we can just put the two sentences in together, or we can encode the two independently of each other and compare the hidden representations. 1:21:17.657 --> 1:21:35.440 As with any other type of classifier, you can then train it on top of these representations, 1:21:35.895 --> 1:21:42.801 for example on a single vector which tries to represent the full sentence. 1:21:43.103 --> 1:21:52.837 The other thing you can do, of course, is take the summation of all the hidden states instead.
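As a minimal sketch of such a filtering classifier: the features below (embedding cosine, length ratio, lexical overlap) and the tiny toy data are stand-ins, and the cosine values would in practice come from a multilingual sentence encoder as discussed above.

```python
# Sketch: a sentence-pair quality classifier on top of simple features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(src: str, tgt: str, cos_sim: float) -> list:
    ls, lt = len(src.split()), len(tgt.split())
    length_ratio = min(ls, lt) / max(ls, lt)       # penalize very unequal lengths
    overlap = len(set(src.split()) & set(tgt.split())) / max(ls, lt)
    return [cos_sim, length_ratio, overlap]

# toy training data: (source, target, embedding cosine, label 1 = parallel)
data = [
    ("das haus ist klein", "the house is small", 0.81, 1),
    ("ich gehe nach hause", "i am going home", 0.77, 1),
    ("wir lesen ein buch", "the weather is nice today", 0.22, 0),
    ("das haus ist klein", "the house the the", 0.55, 0),
]
X = np.array([features(s, t, c) for s, t, c, _ in data])
y = np.array([label for *_, label in data])

clf = LogisticRegression().fit(X, y)
prob = clf.predict_proba([features("er trinkt kaffee", "he drinks coffee", 0.74)])[0, 1]
print(f"P(parallel) = {prob:.2f}")
```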
1:21:58.698 --> 1:22:10.618 Okay, and then one thing which we skipped until now, and only briefly: fragment extraction. 1:22:10.630 --> 1:22:20.096 So if we have sentence pairs which are not really parallel, can we still extract information from them? 1:22:22.002 --> 1:22:38.679 The task here is: we have a sentence pair, and we want to find parallel fragments within it. 1:22:39.799 --> 1:22:46.577 And how that has been done, for example, is using lexical positive and negative associations. 1:22:47.187 --> 1:22:57.182 You can transform your target sentence into a signal and look for regions where many words are aligned. 1:22:57.757 --> 1:23:18.572 So you have here the English sentence and the sentence in the other language, and you have an alignment between them; a word that is aligned gives a positive signal, a word that is not aligned gives a negative signal. 1:23:22.322 --> 1:23:40.023 And then you do some smoothing on this signal, because you want to find an area where most of the words are aligned. 1:23:40.100 --> 1:23:51.728 Through the smoothing, single alignment errors don't prevent you from extracting a segment. 1:23:51.972 --> 1:23:58.813 So you try to find long segments where at least most of the words are somehow aligned. 1:24:00.040 --> 1:24:10.645 And then you take this segment on each side and extract it as your parallel fragment. 1:24:10.630 --> 1:24:27.439 So in the end you not only have full sentences, but also partial sentences, which can be especially helpful if you are in a quite low-resource setting. 1:24:32.332 --> 1:24:44.023 That's everything for today. What you hopefully remember is the general pipeline: 1:24:44.184 --> 1:24:57.625 we talked about how we can do the document alignment, then the sentence alignment, and after that the filtering. 1:24:59.339 --> 1:25:15.444 Any more questions? I think on Thursday we had to do a switch, so on Thursday there will be a practical session. 0:00:01.921 --> 0:00:16.424 Hey, welcome to today's lecture. What we want to look at today is how we can make neural machine translation more efficient. 0:00:16.796 --> 0:00:29.714 So until now we had this global view of the system, the encoder and the decoder, and we haven't really thought about how long things take. 0:00:30.170 --> 0:00:47.084 And what we know, for example, is that you can make the systems bigger in different ways: we can make them deeper, or wider. 0:00:47.407 --> 0:00:56.331 And if we have enough data, that typically helps to make performance better. 0:00:56.576 --> 0:01:00.620 But of course it leads to the problem that we need more resources. 0:01:00.620 --> 0:01:06.587 That is a problem at universities, where we typically have limited computation capacities. 0:01:06.587 --> 0:01:11.757 So at some point you have such big models that you cannot train them anymore. 0:01:13.033 --> 0:01:26.984 And for companies it is of course also important what it costs to generate a translation, for example just in power consumption. 0:01:27.667 --> 0:01:35.386 So there are different reasons why you want to do efficient machine translation. 0:01:36.436 --> 0:01:50.527 In general, there are different ways of improving your machine translation system. 0:01:50.670 --> 0:01:55.694 One is data: we looked into data crawling and into monolingual data. 0:01:55.875 --> 0:02:17.554 Of course, we are not purely interested in having more data; the idea is that more data also means better quality, because mostly we are interested in increasing the quality of the machine translation. 0:02:18.838 --> 0:02:24.892 But there are also other ways of improving the quality of a machine translation system. 0:02:25.325 --> 0:02:44.467 And that is, of course, where most research is focusing: building better algorithms. 0:02:44.684 --> 0:03:00.315 Of course, one is not necessarily better than the other; often it's easier to just collect more data than to invent some great new algorithm. But both of them are important.
0:03:00.920 --> 0:03:11.590 But there is this third way, especially with neural machine translation, and that means we make the model bigger. 0:03:11.751 --> 0:03:16.510 That can be, as said, that we have more layers or wider layers. 0:03:16.510 --> 0:03:24.532 The other thing we talked a bit about is ensembles: that means we are not building one machine translation system, 0:03:24.965 --> 0:03:33.177 but we can easily build, say, four. What is the typical strategy to build different systems? Do you remember? 0:03:35.795 --> 0:03:48.979 They should of course be a bit different: if they all predict the same, then combining them doesn't help. So what is the easiest way if you have to build four systems? 0:03:51.711 --> 0:04:01.747 One suggestion was to select the best output of a single system, 0:04:02.362 --> 0:04:16.682 but I mean really building four different systems, so that you can later combine them, maybe by averaging; ensembles typically average all the probabilities. 0:04:19.439 --> 0:04:36.525 Think about how we train neural networks: there is one parameter you can easily adjust, the random seed, and that is exactly the easiest way to get three or four different systems. 0:04:37.017 --> 0:04:46.556 They have the same architecture, so all the hyperparameters are the same, but the weights end up different, so they will make different predictions. 0:04:48.228 --> 0:05:08.268 So bigger models and ensembles are in some ways the easiest path to improving your quality, because you don't really have to develop anything new. 0:05:08.588 --> 0:05:24.877 There are limits on that: bigger models only get better if you have enough training data; a hundred-layer model will not work on very small data, but with a reasonable amount of data this is the easiest thing. 0:05:25.305 --> 0:05:34.970 However, there is a challenge with making models bigger, and that is the computation. 0:05:35.175 --> 0:05:49.518 If you have a bigger model, that can mean longer running times, and if you have two models, you need twice the computation. 0:05:51.171 --> 0:06:02.442 Normally you cannot parallelize across the different layers, because the input to one layer is always the output of the previous layer, so the depth also increases your runtime. 0:06:02.822 --> 0:06:20.927 Then you have to store all your models in memory; if you double the weights, you double the memory. 0:06:20.927 --> 0:06:31.865 It is also more costly to do backpropagation, where you have to store the intermediate activations, so you increase not only the model in your memory but also all these other variables. 0:06:34.414 --> 0:06:36.734 So in general it is more expensive. 0:06:37.137 --> 0:06:54.208 And therefore there are good reasons to look into whether we can make these models more efficient.
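To make the ensembling idea concrete, here is a sketch of a single decoding step; the three "models" are stand-in functions (in a real system they would be networks trained with different random seeds), and the tiny vocabulary is invented for illustration.

```python
# Sketch: one greedy decoding step with an ensemble of models.
# Each model returns a probability distribution over the (tiny) vocabulary;
# the ensemble averages these distributions and picks the argmax.
import numpy as np

VOCAB = ["the", "a", "house", "home", "</s>"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
# stand-ins for three networks trained with different random seeds
models = [lambda prefix, W=rng.standard_normal(len(VOCAB)): softmax(W)
          for _ in range(3)]

def ensemble_step(prefix):
    probs = np.mean([m(prefix) for m in models], axis=0)  # average the probabilities
    return VOCAB[int(np.argmax(probs))], probs

word, probs = ensemble_step(["<s>"])
print(word, np.round(probs, 3))
```

The key design choice, as said above, is that averaging happens over the output distributions at each step, not over the final translations.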
0:06:54.134 --> 0:07:01.274 So you can view it like this: you have, say, one GPU and one day of training time, 0:07:01.221 --> 0:07:08.437 or forty thousand euros, and then the question is what the best machine translation system is that you can get within this budget. 0:07:08.969 --> 0:07:24.251 Then you can make the models bigger, but you have to train them for a shorter time; or we can build more efficient algorithms. 0:07:25.925 --> 0:07:31.699 If you think about efficiency, there are a couple of different scenarios. 0:07:32.312 --> 0:07:47.913 If you are coming more from the research community, what you'll be doing is building a lot of models in your research. 0:07:48.088 --> 0:07:58.645 So you have your test set, you calculate the BLEU score, then you train another model, and so on. 0:07:58.818 --> 0:08:14.944 What that means is that you're typically training on millions of sentences, so your training time is long, maybe a day, in other cases a week. 0:08:15.135 --> 0:08:22.860 The testing is not really the costly part, but the training is very costly. 0:08:23.443 --> 0:08:37.830 If you are more thinking of building models for an application, the scenario is quite different: you train a model once, 0:08:38.038 --> 0:08:47.720 and then you keep it running, and maybe thousands of customers are using it for translating. So in that case the testing dominates. 0:08:48.168 --> 0:09:07.096 And we will see that these are not always the same types of challenges: you can parallelize some things in training which you cannot parallelize in testing. 0:09:07.347 --> 0:09:14.124 For example, in training you have to do backpropagation, so you have to store the activations. 0:09:14.394 --> 0:09:24.994 On the other hand, in testing, and we briefly discussed this and will do it in more detail today, 0:09:25.265 --> 0:09:36.100 in training you know the target and can process everything in parallel, while in testing you don't. 0:09:36.356 --> 0:09:50.530 There you can only generate one word at a time, so you can parallelize much less; therefore this distinction is important. 0:09:52.712 --> 0:10:03.157 There is even a specific shared task on this, the efficiency task, where it's about making things as efficient as possible, 0:10:03.123 --> 0:10:14.207 and they look at different resources: how much GPU runtime do you need, 0:10:14.454 --> 0:10:20.294 how much memory do you need, or you can have a fixed memory budget and then have to build the best system within it. 0:10:20.500 --> 0:10:30.989 And here is a bit of an example of that, with submissions from a few teams, for example from Edinburgh. 0:10:31.131 --> 0:10:36.515 If you want to find the most efficient system, you have to make a trade-off: 0:10:36.776 --> 0:10:46.720 do you want better quality or less runtime? There is not the one solution; you can improve one at the cost of the other. 0:10:46.946 --> 0:10:49.662 And you see that there are different systems. 0:10:49.909 --> 0:11:07.824 On one axis is how many words per second you can translate, on the other the translation quality, and you want to be as far toward the top right as possible.
0:11:08.889 --> 0:11:09.984 The trade-off looks a little different for each system. 0:11:11.051 --> 0:11:34.161 You want to be in the top right corner; for example, one system translates around two hundred and fifty thousand words per second and still reaches a quality score of about zero point three. 0:11:34.834 --> 0:11:53.922 There is, of course, a decision to make, but the question is how far you can get: all the points on this frontier would be winners, because they are the most efficient in the sense that no system achieves the same quality with less computation. 0:11:57.657 --> 0:12:11.668 So there is the one question of which resources you are interested in: are you running on CPU or GPU? There are different ways of parallelizing things. 0:12:14.654 --> 0:12:27.154 Another dimension is how you process your data: there is batch processing and streaming. 0:12:27.647 --> 0:12:39.981 In batch processing you have the whole document available, so you can translate all sentences in parallel, and then you're interested in throughput. 0:12:40.000 --> 0:12:57.964 You can then, especially on GPUs, translate not one sentence at a time but one hundred sentences or so in parallel, so you have one more dimension along which you can parallelize and be more efficient. 0:12:58.558 --> 0:13:16.544 You can, for example, also sort the sentences of the document: we learned that if you do batch processing you have padding, 0:13:16.636 --> 0:13:25.535 and then it makes sense to sort the sentences by length in order to have the minimum amount of padding attached. 0:13:27.427 --> 0:13:32.150 The other scenario is the streaming scenario, where you do live translation. 0:13:32.512 --> 0:13:40.212 In that case you can't wait for the whole document, but you have to translate as the input comes in. 0:13:40.520 --> 0:14:00.361 That happens especially in situations like speech translation, and then you're interested in things like latency: how long do you have to wait to get the output for a sentence? 0:14:06.566 --> 0:14:29.227 Finally, there is the implementation: today we're mainly looking at different algorithms and models for your machine translation system, but of course for the same algorithm there are also different implementations. 0:14:29.489 --> 0:14:38.643 For example, there are dedicated machine translation toolkits which are very fast, 0:14:38.638 --> 0:14:49.973 because a lot of the operations are coded at a very low level, directly in the CUDA kernels. 0:14:50.110 --> 0:15:02.474 So the same attention network is typically more efficient in that type of implementation than in any other. 0:15:03.323 --> 0:15:13.105 Of course, there may be other disadvantages.
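Here is a minimal sketch of the sort-by-length batching described above: sort by token count, form buckets, translate bucket by bucket, and restore the original document order at the end. The `translate_batch` stub is a placeholder for a real MT system.

```python
# Sketch: sorting sentences by length before batching to minimize padding.
def batch_by_length(sentences, batch_size=4):
    # remember the original positions, then sort indices by token count
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def translate_batch(batch):              # placeholder for the real MT system
    return [s.upper() for s in batch]

def translate_document(sentences):
    out = [None] * len(sentences)
    for idx_batch in batch_by_length(sentences):
        hyps = translate_batch([sentences[i] for i in idx_batch])
        for i, hyp in zip(idx_batch, hyps):
            out[i] = hyp                 # restore the document order
    return out

doc = ["a b c", "a", "a b c d e f", "a b", "a b c d"]
print(translate_document(doc))
```

Because each bucket now contains sentences of similar length, the padded tensors waste far fewer computations than batches drawn in document order.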
0:15:15.255 --> 0:15:23.323 If you have worked in the practical course, you know that higher-level frameworks are normally easier to understand, easier to change, and so on; so there is again a trade-off. 0:15:23.483 --> 0:15:39.145 You have to think about whether to include this in your study or comparison: should I compare different implementations and also find the most efficient implementation, or is it only about the pure algorithm? 0:15:42.742 --> 0:15:50.355 When building these systems, there are different trade-offs to make. 0:15:50.850 --> 0:15:57.299 One trade-off is between memory and throughput, that is, how many words you can generate per second. 0:15:57.557 --> 0:16:03.351 Typically you can easily increase your throughput by increasing the batch size. 0:16:03.643 --> 0:16:06.899 That means you are translating more sentences in parallel. 0:16:07.107 --> 0:16:09.241 And GPUs are very good at that. 0:16:09.349 --> 0:16:31.995 Whether you translate one sentence or one hundred sentences takes not the same, but a very similar time, because GPUs are efficient at matrix multiplication and can do the same operation on all sentences in parallel. So if you increase your batch size, you do more in parallel and translate more words per second. 0:16:33.653 --> 0:16:44.755 On the other hand, with this advantage, you of course need more memory for the bigger batches. 0:16:44.965 --> 0:16:59.141 The other problem is that you may have such big models that you can only translate with small batch sizes. 0:16:59.119 --> 0:17:08.466 If you are running out of memory while translating, one way out is to decrease your batch size. 0:17:13.453 --> 0:17:31.902 Then there is the trade-off between quality and throughput: as said before, larger models generally give higher quality but lower speed. 0:17:32.092 --> 0:17:38.709 Of course, a larger model does not always help, since you get overfitting at some point, but in general it does. 0:17:43.883 --> 0:17:52.901 And with this, back to the training and testing difference we had before. 0:17:53.113 --> 0:17:58.455 So where is the difference between training and testing, and between the encoder and the decoder? 0:17:58.798 --> 0:18:17.183 If we look at training time, we have a source sentence here, 0:18:17.183 --> 0:18:31.626 and it is processed with self-attention; that's a typical transformer encoder. 0:18:31.626 --> 0:18:49.184 The first thing to note is that the full source sentence is available at once. That is, of course, not true in all cases; we'll later talk about speech translation, where we might want to translate before the sentence is finished. 0:18:49.389 --> 0:18:56.172 But in the general case, you have the full sentence that you want to translate. 0:18:56.416 --> 0:19:02.053 So the important thing is: on the source side, everything is available. 0:19:03.323 --> 0:19:15.752 And this was one of the big advantages of the transformer, if you remember; there are several.
0:19:16.156 --> 0:19:25.229 But the key one here is that we can calculate a full layer at once. 0:19:25.645 --> 0:19:29.318 There is no dependency between this state and this state within a layer. 0:19:29.749 --> 0:19:37.536 So we calculate here the keys, values, and queries, and based on those the attention output. 0:19:37.937 --> 0:19:46.616 Which means we can do all these calculations here in parallel. 0:19:48.028 --> 0:20:00.887 And that is where the efficiency comes from, because for GPUs it is much faster to do these things in parallel than one after another. 0:20:01.421 --> 0:20:10.311 Then we go layer by layer, one at a time, and calculate the whole encoder. 0:20:10.790 --> 0:20:28.365 In training, an important point is that for the decoder we also have the full sentence available, because we know the target we should generate. 0:20:29.649 --> 0:20:38.297 We have modeled it in a different way, though: each hidden state depends only on the previous positions. 0:20:38.598 --> 0:20:56.665 The first state here depends only on this information; you see it if you remember the masked self-attention. 0:20:56.896 --> 0:21:08.925 That means, of course, we can only calculate the decoder once the encoder is done: first we calculate the encoder, then the decoder. 0:21:09.569 --> 0:21:27.929 But again, in training both x and y are available, so we can calculate everything within the decoder in parallel as well. 0:21:28.368 --> 0:21:46.408 So the nice property of the transformer in training is that we can parallelize over the sequence also for the decoder. 0:21:46.866 --> 0:22:03.270 You can only calculate one layer at a time, so the depth stays sequential, but the sequence length, which is typically quite long, doesn't really matter that much. 0:22:05.665 --> 0:22:13.276 However, in testing the situation is different: in testing we only have the source. 0:22:13.713 --> 0:22:29.063 We don't know the full target sentence yet, because we autoregressively generate it; so for the encoder things are the same, but not for the decoder. 0:22:29.409 --> 0:22:40.756 In this case we first have only the first state, then the second, and so on; we never get all states in parallel. 0:22:41.101 --> 0:22:58.643 We can only do the next step for y once we have fed in the most probable word so far; we do greedy search or beam search, but we cannot parallelize over time. 0:23:03.663 --> 0:23:22.363 So if we are interested in making things more efficient for testing, which matters, for example, in the application scenario, 0:23:22.642 --> 0:23:35.933 it makes sense to think about our architecture, given that we are currently working with attention-based models. 0:23:36.096 --> 0:23:47.142 The decoder is where most of the time is spent during testing. 0:23:47.167 --> 0:23:59.833 And that is not even talking about beam search, which may be even more costly, because in beam search you have to try several different continuations.
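A small sketch of this training-versus-testing asymmetry, assuming the decoder states are given: in training one masked self-attention call covers all positions at once, while in testing a loop must advance position by position. Shapes and the single-head, unprojected attention are simplifications for illustration.

```python
# Sketch: why decoder self-attention parallelizes in training but not in decoding.
import torch

T, d = 5, 8
h = torch.randn(T, d)                        # decoder states for a known target

# Training: one masked self-attention over all T positions at once.
scores = h @ h.T / d ** 0.5                  # (T, T) attention scores
causal = torch.tril(torch.ones(T, T)).bool() # lower-triangular "masked" pattern
scores = scores.masked_fill(~causal, float("-inf"))
ctx_train = torch.softmax(scores, dim=-1) @ h   # all positions in parallel

# Testing: the target is unknown, so we must loop step by step.
ctx_test = []
for t in range(T):                           # position t sees only h[:t+1]
    s = h[t] @ h[: t + 1].T / d ** 0.5
    ctx_test.append(torch.softmax(s, dim=-1) @ h[: t + 1])
ctx_test = torch.stack(ctx_test)

print(torch.allclose(ctx_train, ctx_test, atol=1e-6))  # True: same result
```

Both paths compute identical values; the difference is purely that the second one cannot be batched over time, which is exactly the decoding bottleneck discussed above.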
0:24:02.762 --> 0:24:21.905 So the question is: what can you do in order to make your model more efficient and faster at translation in these types of cases? 0:24:24.604 --> 0:24:30.178 One thing is to look into the encoder-decoder trade-off. 0:24:30.690 --> 0:24:48.154 Until now we typically assumed that the depth of the encoder and the depth of the decoder are roughly the same. 0:24:48.268 --> 0:24:57.678 If you haven't thought about it, you just take what is known to run well. 0:24:58.018 --> 0:25:04.914 However, we saw now that at test time the decoder runtime is a lot longer than the encoder runtime. 0:25:05.425 --> 0:25:25.415 The question is whether the same holds for quality: do we only get good quality if both sides are deep? We know that making the model deeper increases quality, 0:25:25.425 --> 0:25:32.285 but what we haven't asked is whether it is really important to increase the depth on both sides in the same way. 0:25:32.552 --> 0:25:42.923 So what we can do instead is something like this, where you have a deep encoder and a shallow decoder. 0:25:43.163 --> 0:25:59.757 That would mean, for example, that instead of six layers on each side you have twelve layers on the encoder and only one on the decoder. 0:26:00.080 --> 0:26:10.469 In this case the overall depth from start to end is similar, and so hopefully also the quality. 0:26:11.471 --> 0:26:29.330 But a lot more can be parallelized, because what is costly at the end, during decoding, is the decoder, since it runs in an autoregressive way. 0:26:31.411 --> 0:26:38.734 And that can be analyzed; here are some example results where people have done this. 0:26:39.019 --> 0:26:57.607 We are mainly interested in the autoregressive rows here, in orange, and in the speed-up. 0:26:57.717 --> 0:27:31.644 You have the baseline system, and while the setups are not exactly identical, they are similar; the baseline speed is set to one. 0:27:31.771 --> 0:27:42.621 The deep-encoder, shallow-decoder systems are then several times as fast. 0:27:42.782 --> 0:28:00.283 You see that although you have slightly more parameters and the amount of calculation is roughly the same, you get a speed-up, because during testing you can now parallelize most of the work. 0:28:02.182 --> 0:28:13.500 The other thing is that while you are speeding up, the performance stays similar: sometimes you improve a bit, sometimes you lose a bit. 0:28:13.500 --> 0:28:20.421 There is a bit of a loss for English to Romanian, but in general the quality is very similar. 0:28:20.680 --> 0:28:30.343 So you can keep a similar performance while improving your speed, just by distributing the depth differently. 0:28:30.470 --> 0:28:38.690 You also see that the encoder layers don't matter that much for speed; most of the time is spent in the decoder.
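As an aside, here is a minimal PyTorch sketch of such an asymmetric model. The 12/1 split mirrors the configuration discussed above; the model dimension, head count, and the use of the stock `nn.Transformer` layers are illustrative assumptions rather than the setup of the cited experiments.

```python
# Sketch: a deep-encoder / shallow-decoder transformer in PyTorch.
import torch.nn as nn

d_model, nhead = 512, 8

enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

deep_encoder = nn.TransformerEncoder(enc_layer, num_layers=12)    # fully parallel at test time
shallow_decoder = nn.TransformerDecoder(dec_layer, num_layers=1)  # runs once per output token

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(f"encoder params: {n_params(deep_encoder)/1e6:.1f}M,"
      f" decoder params: {n_params(shallow_decoder)/1e6:.1f}M")
```

The point of the design: total depth (and thus capacity) stays comparable to a 6-6 model, but nearly all of it sits in the encoder, which is executed only once per sentence.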
0:28:38.979 --> 0:28:57.309 Because if you compare the twelve-encoder-layer system to the six-six system, you have a lower performance with six encoder layers, but the speed is similar. 0:28:57.897 --> 0:29:02.233 Is the huge decrease maybe due to a lack of data? 0:29:03.743 --> 0:29:23.191 Good idea, but I would say that's not the case: Romanian-English should have the same amount of data in both directions. 0:29:24.224 --> 0:29:40.702 Maybe it's just something about that language: if you generate Romanian, maybe you need more target-side dependencies. 0:29:42.882 --> 0:29:46.263 Why exactly, I honestly don't know. 0:29:47.887 --> 0:29:49.034 There could be several reasons. 0:29:49.889 --> 0:30:12.492 Maybe it's the vocabulary: it might be easier to cover the vocabulary on the English side than for Romanian. 0:30:13.333 --> 0:30:22.391 I would have to check, but I would assume that in this case the system is not pre-trained but trained from scratch, 0:30:22.902 --> 0:30:35.595 and that's why I was assuming both directions have the same setup; but you may be right; for example, if the decoder had been pre-trained on English data, that would matter. 0:30:36.096 --> 0:30:43.733 I don't remember exactly whether they do something like that, but that could be a good explanation. 0:30:45.325 --> 0:31:01.443 So this is one of the easiest ways to speed up: you just change two hyperparameters and don't have to implement anything new. 0:31:02.722 --> 0:31:16.521 Of course, there are other ways of doing it; we'll look into two ideas concerning the architecture. 0:31:16.796 --> 0:31:28.154 Until now we used the standard transformer everywhere as our baseline. 0:31:28.488 --> 0:31:41.845 However, on the decoder side that might not be the best solution; there is no rule that encoder and decoder must look the same. 0:31:42.222 --> 0:31:47.130 So we can use different types of architectures in the encoder and the decoder. 0:31:47.747 --> 0:31:58.842 There are several things you could do differently; we will look into two today. The first is average attention, which is a very simple solution. 0:31:59.419 --> 0:32:08.757 As the name says, it's not really attending anymore; it's just equal attention to everything. 0:32:09.249 --> 0:32:24.913 The other idea, which is currently used in most systems optimized for efficiency, is to keep self-attention in the encoder, 0:32:25.065 --> 0:32:39.700 but on the decoder side not to use transformer self-attention; instead we use a recurrent neural network, because here the usual disadvantage of recurrent networks does not hurt. 0:32:39.799 --> 0:32:49.684 The recurrent step is normally cheaper to calculate, because it depends only on the current input and the previous state. 0:32:51.931 --> 0:33:03.841 So what is the difference in decoding, and why is self-attention maybe not ideal for decoding? 0:33:04.204 --> 0:33:15.649 In an RNN, if we want to compute the new state, we only have to look at the input and the previous state. 0:33:16.136 --> 0:33:19.029 In convolutional networks, we have a dependency on a fixed number of previous states, but those are rarely used for decoding.
0:33:19.029 --> 0:33:39.774 In contrast, in the transformer we have this unbounded dependency: each new state can look at all previous states. 0:33:40.000 --> 0:33:56.053 So y_t depends on everything from y_1 to y_{t-1}; that is very powerful for modeling, but not very efficient. 0:33:56.276 --> 0:34:10.895 The disadvantage is that we have to do all these calculations at every step, so purely from the point of view of efficient computation, this might not be the best choice. 0:34:11.471 --> 0:34:21.994 So the question is: can we change our architecture to keep some of the advantages but make things more efficient? 0:34:24.284 --> 0:34:32.610 One idea is what is called average attention, and the interesting thing is that this works surprisingly well. 0:34:33.013 --> 0:34:46.790 The only change is in the decoder: you're not computing attention weights anymore; the attention weights are all the same. 0:34:47.027 --> 0:35:03.058 So you don't calculate different weights from query and key; you just take equal weights over all previous positions. 0:35:03.283 --> 0:35:07.585 Here it would be one third from this state, one third from this, and one third from this. 0:35:09.009 --> 0:35:14.719 And because of that, you can precompute things, and decoding gets more efficient. 0:35:15.195 --> 0:35:18.803 First the formula, which is maybe not directly obvious. 0:35:18.979 --> 0:35:45.022 The new state is the average of all states so far: g_t = (y_1 + ... + y_t) / t. So here it would be one third of this plus one third of this plus one third of this. 0:35:46.566 --> 0:36:01.844 But if you calculate it this way, it's not yet more efficient, because you still have to sum over all the hidden states at every step. 0:36:04.524 --> 0:36:24.568 You can, however, easily speed this up by keeping an intermediate value, a running sum, which you update at every step: 0:36:25.585 --> 0:36:36.739 you take the previous running sum and add the current state, because the previous sum already contains everything before it. 0:36:37.377 --> 0:36:50.111 This running sum is not yet the final value; to get the final state, you divide by t to take the average. 0:36:50.430 --> 0:37:12.535 If you do the calculation this way, each step needs a fixed number of operations, instead of a sum whose length depends on the position. 0:37:12.732 --> 0:37:32.695 Question: in the example, shouldn't the value at this position then be different? 0:37:32.993 --> 0:37:44.531 That's a very good point, and it's because the figure is not very clear: these are the intermediate values, the ones with the tilde. 0:37:44.884 --> 0:37:57.895 So this one is just the sum of these two, because this one is the running sum up to here. 0:37:58.238 --> 0:38:15.131 The sum of these is exactly the sum of those, and so on; so you only do the summation here, and the division by t at the end.
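A sketch of this cumulative-average trick: the naive version re-sums all previous states at every step, while the incremental version keeps a running sum and does constant work per step. The dimensions are arbitrary, and the states stand in for the decoder inputs of the average-attention layer.

```python
# Sketch: average "attention" with a running sum (constant work per step).
import numpy as np

T, d = 6, 4
y = np.random.default_rng(0).standard_normal((T, d))  # stand-in decoder states

# naive: g_t = (y_1 + ... + y_t) / t  ->  O(t) work at step t
g_naive = np.stack([y[: t + 1].mean(axis=0) for t in range(T)])

# incremental: keep the running sum s_t = s_{t-1} + y_t, divide only at the end
g_fast = np.empty_like(y)
s = np.zeros(d)
for t in range(T):
    s += y[t]                  # one addition per step
    g_fast[t] = s / (t + 1)    # the average, computed from the running sum

print(np.allclose(g_naive, g_fast))  # True: identical results
```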
0:38:15.255 --> 0:38:31.531 Put more mathematically: you can pull the one-over-t out of the sum, and then you can compute the sum incrementally. 0:38:36.256 --> 0:38:53.321 That may look a bit weird and too simple; we were all talking about this great attention mechanism that can focus on different parts, and the surprising result of this work is that, in the end, it can also work well with just equal weights. 0:38:53.954 --> 0:39:00.451 I mean, it's not that easy: sometimes this works well, and there are also reports where it doesn't work that well. 0:39:01.481 --> 0:39:20.026 But I think it's an interesting result, and it maybe shows that a lot of things in the transformer paper that look like minor design choices, certain hyperparameter settings, the layer norm in between, the feed-forward block, and so on, 0:39:20.026 --> 0:39:25.567 are also all important; getting the setup around the attention right matters a lot. 0:39:28.969 --> 0:39:42.521 The other thing you can do is, in the end, not completely different from this one. 0:39:42.942 --> 0:40:01.330 It is a recurrent network which also has this type of highway connection, so it can ignore the recurrent unit and pass the input through directly. 0:40:01.561 --> 0:40:15.480 It's not exactly a residual addition, but the idea is that the input can go more or less directly to the output. 0:40:17.077 --> 0:40:33.418 These are the four components of the simple recurrent unit (SRU), and the unit is motivated by GRUs and by LSTMs, which we have seen before. 0:40:33.513 --> 0:40:43.633 Gating has proven to be very good for RNNs; it allows you to control the information flow. 0:40:44.164 --> 0:40:48.186 In this unit we have two gates, the reset gate and the forget gate. 0:40:48.768 --> 0:41:01.277 First we have the general structure, which has a cell state; here is the cell state. 0:41:01.361 --> 0:41:11.448 This is passed along, and we get the different cell states over the time steps. 0:41:11.771 --> 0:41:16.518 How do we calculate it? Just assume we have an initial cell state here. 0:41:17.017 --> 0:41:19.670 The first thing we compute is the forget gate. 0:41:20.060 --> 0:41:41.356 The forget gate models whether the new cell state should mainly depend on the previous cell state or on the current input. 0:41:41.621 --> 0:41:42.877 How can we model that? 0:41:44.024 --> 0:41:56.480 Let's look at the formula: the forget gate depends on the input and on the previous cell state. 0:41:57.057 --> 0:42:04.890 We multiply both the previous cell state and our input with weights, 0:42:05.105 --> 0:42:08.472 add a bias vector, and then apply a sigmoid.
0:42:08.868 --> 0:42:13.452 So in the end we have numbers between zero and one for each dimension, 0:42:13.853 --> 0:42:31.890 saying: if the gate is near zero, we will mainly use the new input; if it's near one, we will keep the cell state and ignore the input at this dimension. 0:42:33.313 --> 0:42:41.141 With this motivation we can create the new cell state, and here you see the formula: 0:42:41.601 --> 0:43:00.427 you take your forget gate and multiply it with the previous cell state, so if the gate was around one, you keep the old state, 0:43:00.800 --> 0:43:10.946 and in the other case you add a transformation of the input, weighted by one minus the gate: c_t = f_t * c_{t-1} + (1 - f_t) * (W x_t). 0:43:11.351 --> 0:43:24.284 So if the gate value was near zero, most of the information comes from the input. 0:43:25.065 --> 0:43:32.067 That is your cell state; the only question now is, based on the cell state, what is the output? 0:43:33.253 --> 0:43:50.957 And there you have another choice: you can either output the cell state, or instead prefer the input. 0:43:52.612 --> 0:43:59.417 Are the weights the same for the reset gate and the forget gate? 0:44:00.900 --> 0:44:16.323 No, exactly: the matrices are different, and that is important, because sometimes you want to keep information in the state without outputting it yet. 0:44:16.636 --> 0:44:25.205 So here again we have a vector with values between zero and one controlling how the information flows. 0:44:25.505 --> 0:44:36.459 The output is then calculated similarly to the cell state, but again mixing in the input: 0:44:36.536 --> 0:44:45.714 the reset gate decides whether to output what is currently stored in the cell state, or to pass through the input: h_t = r_t * g(c_t) + (1 - r_t) * x_t. 0:44:46.346 --> 0:45:01.293 So it's not exactly the residual connection we had before, where we simply added things up; here it is a gated combination. 0:45:04.224 --> 0:45:13.125 This is the general idea of the simple recurrent unit. We will now look at how to make it even more efficient; but first, do you have more questions on how it works? 0:45:23.063 --> 0:45:43.177 Now, these calculations are where things can become more efficient, because so far every dimension depends on all the others: 0:45:43.423 --> 0:45:52.353 if you do a matrix multiplication with a vector, each dimension of the output vector depends on all dimensions of the input. 0:45:52.973 --> 0:46:11.340 We can change the recurrent part so that the first dimension of the new cell state only depends on the first dimension of the previous cell state, and so on. 0:46:11.931 --> 0:46:18.481 Dependencies between dimensions make things less parallelizable, 0:46:19.359 --> 0:46:35.122 and we can remove them by replacing the matrix product on the recurrent connection with an element-wise product: 0:46:35.295 --> 0:46:51.459 you multiply the first dimension with the first dimension, the second with the second, and so on. 0:46:52.032 --> 0:46:59.294 Of course, these are different weight vectors for the reset gate and for the forget gate. 0:46:59.899 --> 0:47:16.148 Now the first dimension only depends on the first dimension, so you no longer have dependencies between dimensions. 0:47:18.078 --> 0:47:25.692 Maybe it gets clearer if you look at it this way, at what we actually have to compute now. 0:47:25.966 --> 0:47:38.713 First we do the matrix multiplications on the inputs, for all time steps together, to get the transformed inputs and the gate pre-activations. 0:47:39.179 --> 0:47:52.748 Then we only have element-wise operations combining c_{t-1} and the transformed input; these can be optimally parallelized. 0:47:53.273 --> 0:48:07.603 So within each step we can additionally parallelize across the dimensions. 0:48:09.929 --> 0:48:24.255 The matrix multiplications we can again do in parallel for all inputs x_t, 0:48:24.544 --> 0:48:34.650 while the recurrence itself has to go step by step; but per step it is only cheap element-wise work, and across dimensions you can parallelize.
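A sketch of this SRU-style computation pattern: the heavy matrix multiplications are batched over all time steps at once, and the loop over time does only element-wise work (`*` playing the role of the element-wise product). The equations follow the description above; exact details (initializations, the tanh on the cell state) may differ from the published SRU.

```python
# Sketch: simple-recurrent-unit-style computation.
# Matrix multiplications are batched over all time steps; the loop over
# time does only element-wise work, so each dimension evolves independently.
import numpy as np

T, d = 7, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
W, Wf, Wr = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
vf = rng.standard_normal(d)                 # element-wise recurrent weight
bf, br = np.zeros(d), np.zeros(d)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1) all matrix products at once, in parallel over the T steps
xW, xWf, xWr = x @ W, x @ Wf, x @ Wr

# 2) the recurrence: element-wise only, no matrix product inside the loop
c = np.zeros(d)
h = np.empty((T, d))
for t in range(T):
    f = sigmoid(xWf[t] + vf * c + bf)         # forget gate (element-wise on c)
    c = f * c + (1.0 - f) * xW[t]             # new cell state
    r = sigmoid(xWr[t] + br)                  # reset/output gate
    h[t] = r * np.tanh(c) + (1.0 - r) * x[t]  # gated highway output

print(h.shape)  # (7, 16)
```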
0:48:35.495 --> 0:48:55.383 Whether the dimension independence also matters for quality, I don't know if they have analyzed; I assume the motivation is mainly efficiency. 0:49:01.001 --> 0:49:20.699 People have even made the output part simpler: they replaced it with the same kind of highway or residual connection that we have in the transformer. 0:49:20.780 --> 0:49:24.789 That is how things are put together in the transformer anyway. 0:49:25.125 --> 0:49:47.503 And that gives a simpler variant of the simple recurrent unit, where for the output you use exactly such a residual connection, so you don't need the extra reset gate and that type of thing. 0:49:49.149 --> 0:50:02.580 And with this we are at the end of how to make efficient architectures, before we go to the next topic. 0:50:13.013 --> 0:50:28.988 Besides the encoder-decoder trade-off and the architectures, there is a next technique, which is used very successfully in nearly all of deep learning: knowledge distillation. 0:50:29.449 --> 0:50:45.983 The idea is: can we extract the knowledge from a large network into a smaller one that performs similarly well? 0:50:47.907 --> 0:50:53.217 And the nice thing is that this really works, which is maybe very surprising. 0:50:53.673 --> 0:51:07.870 So we have a large, strong model which we train for a long time, and the question is: can that help us to train a smaller model? 0:51:08.148 --> 0:51:17.005 Can what we refer to as the teacher model help us build a better small student model than before? 0:51:17.257 --> 0:51:28.755 Before, the student model learned only from the data; that is how we normally train our systems. 0:51:29.249 --> 0:51:47.222 The question is: can we train this small model better if we are not only learning from the data, but also from a large model which has been trained, maybe, on the same data, 0:51:47.667 --> 0:51:55.564 so that in the end you have a smaller model that performs better than one trained on the data alone? 0:51:55.895 --> 0:51:59.828 And maybe that is, at first view, very surprising.
0:51:59.739 --> 0:52:11.682 It seems surprising because the student has seen the same data, so it should be able to learn the same thing: the baseline model trained only on the data and the student in the teacher-student framework have seen exactly the same data, 0:52:11.682 --> 0:52:19.161 because the teacher model was typically also trained only on this data. 0:52:20.580 --> 0:52:32.293 Nevertheless, it has by now been shown in many settings that the model trained in the teacher-student framework performs better. 0:52:33.473 --> 0:52:47.171 We'll get a bit of an explanation when we see how it works; there are different ways of doing it. 0:52:47.567 --> 0:53:06.113 So how does it work? This is our student network, the normal one, some type of neural network. 0:53:06.586 --> 0:53:17.050 We train the model to predict the reference words, and we do that by calculating the cross-entropy loss. 0:53:17.437 --> 0:53:25.332 The cross-entropy loss is defined such that the probability of the correct word should be as high as possible. 0:53:25.745 --> 0:53:43.368 You calculate your output probabilities at each time step, over the whole vocabulary, and your training signal is to put as much probability mass as possible on the word that is there in the reference. 0:53:43.903 --> 0:54:03.947 This is achieved by the cross-entropy loss, which sums over all positions and over the full vocabulary an indicator that the current reference word is the j-th vocabulary entry, times the log probability of that entry: L = - sum_t sum_j [y_t = v_j] * log P(v_j | y_<t, x). 0:54:04.204 --> 0:54:27.313 So we have this matrix: one row per position, one column per vocabulary entry. 0:54:27.507 --> 0:54:40.785 In the end you sum up one log probability per position, and you want these to be as high as possible. 0:54:41.041 --> 0:54:54.614 Although this is formally a sum over the whole matrix, for each position only one entry is non-zero. 0:54:54.794 --> 0:55:07.016 That is the normal cross-entropy loss that we discussed at the very beginning. 0:55:08.068 --> 0:55:23.374 So what can we do differently with the teacher? We also have a teacher network, which is a large model trained on the data. 0:55:24.224 --> 0:55:35.957 And of course its output distribution might be better than the one from the small model, because it is a stronger model. 0:55:36.456 --> 0:55:40.941 So in this case we take the training signal from the teacher network. 0:55:41.441 --> 0:55:59.159 It works the same way as before; the only difference is that we are not training towards the ground-truth probability distribution, which is sharp: 0:55:59.299 --> 0:56:11.303 the teacher output is also a probability distribution, where one word has a high probability, but other words also get some probability. 0:56:12.612 --> 0:56:30.341 And that is the main difference. Typically you then use an interpolation of these two losses, the hard one and the soft one.
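A sketch of this word-level distillation loss in PyTorch: cross-entropy towards the hard reference, a KL term towards the teacher's soft distribution, interpolated with a weight alpha. The shapes, the random stand-in logits, and the mixing weight 0.5 are illustrative assumptions.

```python
# Sketch: word-level knowledge distillation loss.
import torch
import torch.nn.functional as F

B_T, V = 12, 100              # (batch * positions, vocabulary size)
student_logits = torch.randn(B_T, V, requires_grad=True)
teacher_logits = torch.randn(B_T, V)           # from the frozen teacher
reference = torch.randint(0, V, (B_T,))        # ground-truth word ids

alpha = 0.5                                    # interpolation weight (assumed)

# hard loss: push probability mass onto the reference word
hard_loss = F.cross_entropy(student_logits, reference)

# soft loss: KL divergence between teacher and student distributions
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)

loss = alpha * hard_loss + (1 - alpha) * kd_loss
loss.backward()
print(float(loss))
```

Keeping both terms is exactly the interpolation mentioned above: the hard term anchors the student to the reference, the soft term transfers the teacher's distribution over alternatives.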
0:56:33.213 --> 0:56:47.907 Why does this help? Because there is more information contained in the distribution than in the ground truth: it encodes more about the language, because language often has several valid options for the same sentence. Yes, exactly. 0:56:47.907 --> 0:56:53.114 So there is ambiguity, which is hopefully encoded well in the teacher's distribution. 0:56:53.513 --> 0:56:57.257 The trained teacher network can represent this better than the student could learn it from the hard labels alone. 0:56:57.537 --> 0:57:10.505 So often there is only one correct word, but sometimes there are two or three, and then all of them carry probability mass. 0:57:10.590 --> 0:57:21.242 That is one main explanation of why it is better to also train from the teacher's distribution. 0:57:21.361 --> 0:57:33.493 Of course, it's good to also keep the ground-truth signal in there, to prevent the student from just copying the teacher's mistakes. 0:57:37.017 --> 0:57:49.466 Any more questions on this first type of knowledge distillation? What about distribution change, for example a domain shift? 0:57:50.550 --> 0:58:04.244 I would put it a bit differently: this is not in itself a solution to domain or distribution shift. 0:58:04.744 --> 0:58:12.680 But I don't think it performs worse than only training on the ground truth. 0:58:13.113 --> 0:58:21.254 So it's more that it may not improve things there; it should still help similarly. 0:58:21.481 --> 0:58:28.524 And of course, if you have a teacher but no ground-truth data in your target domain, 0:58:28.888 --> 0:58:42.147 then you can use the teacher's output, which is not ground truth but still helpful for learning the distribution. 0:58:46.326 --> 0:59:02.757 The second idea is sequence-level knowledge distillation. So far we have looked at each position independently. 0:59:03.423 --> 0:59:16.760 We do that often, but it has a problem: the propagation of errors, where we start with one error and then it continues. 0:59:17.237 --> 0:59:27.419 If we do word-level knowledge distillation, we treat each word in the sentence independently. 0:59:28.008 --> 0:59:32.091 So we are not modeling the dependencies between the words. 0:59:32.932 --> 0:59:47.480 We can try to do that with sequence-level knowledge distillation, but there is of course a problem: 0:59:47.847 --> 0:59:53.478 for each position we can get a distribution over all the words at this position, 0:59:53.793 --> 1:00:06.431 but if we want a distribution over all possible target sentences, that is not feasible, because there are exponentially many. 1:00:08.508 --> 1:00:23.238 So we do a bit of a hack: if we can't have a distribution over all sentences, we approximate it. 1:00:23.843 --> 1:00:30.764 What we can do is take the teacher network and sample or search for several different translations. 1:00:31.931 --> 1:00:49.343 Then there are different ways to train on them: we could weight them by their probability, or, the easiest option, just take the best one.
1:00:50.050 --> 1:01:01.135 What that boils down to is: we take our teacher network, generate some translations, and use these as additional training data. 1:01:01.781 --> 1:01:17.513 Then we have essentially done this at the sequence level, because the teacher tells us: these are probable translations of the whole sentence. 1:01:26.286 --> 1:01:36.206 And then you can also build a bit of an interpolated version of that. 1:01:36.716 --> 1:02:00.658 What people have also done is sequence-level interpolation: you generate several translations, but you don't use all of them; you use some metric to pick, for example, the one closest to the reference. 1:02:01.021 --> 1:02:16.520 The point is that the ground truth itself might be improbable or even unreachable for the model, since the teacher cannot generate everything, 1:02:16.676 --> 1:02:23.378 and instead we give the student an easier target which is still of good quality, and train on that. 1:02:23.703 --> 1:02:33.570 So you are not training it on a very difficult solution, but on an easier one. 1:02:36.356 --> 1:02:38.494 Any more questions on this? 1:02:43.843 --> 1:03:06.784 Good. The next idea is to look at the vocabulary: the problem is, as we have seen, that the vocabulary computations, that is, the big output softmax, are often very time-consuming. 1:03:09.789 --> 1:03:19.805 The thing is that most of the vocabulary is not needed for any individual sentence. 1:03:20.280 --> 1:03:30.967 The question is: can we cheaply precompute which words are likely to occur in the translation of a sentence, and then only compute those? 1:03:31.691 --> 1:03:43.932 And this can be done: for a given source sentence, most target words will essentially never occur in its translation. 1:03:44.164 --> 1:03:51.093 So what you can try is to limit the vocabulary that you consider for each sentence: 1:03:51.151 --> 1:04:04.693 you no longer take the full vocabulary as possible output, but a restricted candidate set. 1:04:06.426 --> 1:04:23.613 What typically works is this: we always keep the most frequent target words, because those are not so easy to handle through alignment, 1:04:23.964 --> 1:04:32.985 and in addition we take, for each source word, the target words that frequently align with it. 1:04:33.473 --> 1:04:51.700 So for each source word you compute the word alignment on your training data, and from that which target words co-occur with it. 1:04:52.352 --> 1:04:57.680 For decoding you then build the union of the candidate lists of all source words plus the frequent words. 1:04:59.960 --> 1:05:13.003 So for each source word you take, say, its most frequent translations, and on top the overall most frequent target words. 1:05:13.193 --> 1:05:26.232 In total, especially for short sentences, you have far fewer words; in most cases this is much smaller than the full vocabulary. 1:05:26.546 --> 1:05:33.957 And so you have dramatically reduced your vocabulary, and can thereby also speed up decoding.
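A small sketch of this per-sentence vocabulary selection; the frequent-word set and the alignment-derived lexicon are toy stand-ins for statistics you would extract from the training data.

```python
# Sketch: restricting the output vocabulary per source sentence.
# Candidates = globally frequent words + aligned translations of each source word.
FREQUENT = {"the", "a", "is", "and", "of"}          # always kept
ALIGNED = {                                         # from word-alignment counts
    "haus": {"house", "home", "building"},
    "klein": {"small", "little"},
    "buch": {"book"},
}

def candidate_vocab(source_sentence: str) -> set:
    vocab = set(FREQUENT)
    for word in source_sentence.split():
        vocab |= ALIGNED.get(word, set())           # union over the source words
    return vocab

vocab = candidate_vocab("das haus ist klein")
print(len(vocab), sorted(vocab))
# The output softmax is then computed only over these candidates
# instead of over the full vocabulary.
```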
1:05:35.495 --> 1:05:43.757 That sounds easy; does anybody see what is challenging here, and why it might not always help? 1:05:47.687 --> 1:06:01.838 Not the translation quality; why might this, if you implement it, not give a strong speed-up? 1:06:01.941 --> 1:06:14.135 You have to store these lists, you have to build the union, and of course that eats into the time you save. 1:06:14.554 --> 1:06:23.868 The second thing: the vocabulary is used in the last step, where we have the hidden state and then calculate the output probabilities. 1:06:24.284 --> 1:06:29.610 Now we are no longer calculating them for all output words, but only for a subset. 1:06:30.430 --> 1:06:35.613 However, this matrix multiplication is typically parallelized almost perfectly on the hardware, 1:06:35.956 --> 1:06:52.794 so if you compute only some of the outputs but don't implement it carefully, it will take as long as before, because of the nature of the parallel hardware. 1:06:56.776 --> 1:07:10.833 For beam search there are some ideas too: of course you can go back to greedy search, because that is more efficient, 1:07:11.651 --> 1:07:22.216 though beam search gives better quality; and you can buffer some states in between, and how much you buffer is again the trade-off between computation and memory. 1:07:25.125 --> 1:07:42.932 Then, at the end of today, what we want to look into is one last type of neural machine translation approach. 1:07:43.403 --> 1:07:57.246 The idea: we have already seen in our first steps that the autoregressive part is what makes decoding slow. 1:07:57.557 --> 1:08:04.461 The encoder can process everything in parallel, but in decoding we always take the most probable word and feed it back in. 1:08:05.905 --> 1:08:19.616 The question is: do we really need to do that? There is a bunch of work on doing it differently: can we generate the full target sentence at once? 1:08:20.160 --> 1:08:31.832 We'll see it's not that easy, and there is still an open debate on whether this is really faster at the same quality, but I think it's worth knowing. 1:08:32.712 --> 1:08:50.527 So, as said, what we have is our encoder-decoder, where we can process the encoder in parallel, but the output always depends on the previous outputs. 1:08:50.410 --> 1:08:56.565 We generate an output word and then have to feed it back in as the next y, because everything after it depends on it. 1:08:56.916 --> 1:09:16.739 This is what is referred to as an autoregressive model, and nearly all speech generation and language generation works in this autoregressive way. 1:09:18.318 --> 1:09:21.132 So the motivation is: can we do this more efficiently? 1:09:21.361 --> 1:09:41.302 Can we somehow process all target words in parallel, so that instead of doing it one by one, we input everything at once? 1:09:45.105 --> 1:09:50.587 So how does it work? Let's look at the basic non-autoregressive model. 1:09:50.810 --> 1:09:58.310 The encoder looks exactly as before; that's maybe not surprising, because there we already know we can parallelize. 1:09:58.618 --> 1:10:05.295 So we put in our encoder input and generate the encoder hidden states; that is exactly the same.
1:10:05.845 --> 1:10:26.799 Back to the non-autoregressive model: now we need one more thing. One challenge remains that we had before, and it's a general challenge of natural language generation like machine translation: how long should the output be? 1:10:32.672 --> 1:10:45.632 Normally we generate until we produce the end-of-sentence token; but if we now generate everything at once, that's no longer possible, because we only do a single generation step. 1:10:46.206 --> 1:10:58.321 So the question is: how can we determine in advance how long the output sequence is? 1:11:00.000 --> 1:11:06.384 Yes, that would be one idea, and there is other work which tries to do that. 1:11:06.806 --> 1:11:20.900 However, here some earlier work comes in handy: maybe you remember the IBM models, and there was this concept of fertility. 1:11:21.241 --> 1:11:27.104 Fertility means: into how many target words does one source word translate? 1:11:27.847 --> 1:11:36.134 And exactly that is what we try to do here: at the top we are calculating the fertility of each source word. 1:11:36.396 --> 1:11:54.171 So it says: this word is translated into one word, that word might be translated into two words, and so on; we predict into how many target words each source word turns. 1:11:55.935 --> 1:12:15.523 So this is in effect a length estimation done at the end of the encoder; you can also do it in other ways. 1:12:16.236 --> 1:12:35.224 Then you initialize your decoder input. We know that word embeddings work well as inputs, so people do the same thing here: they initialize the decoder input with the source word embeddings, each one copied as often as its fertility says. 1:12:35.315 --> 1:12:36.460 So we have these copies. 1:12:36.896 --> 1:12:47.816 For example, one word has fertility two, so it appears twice, and another one appears once; that is then our initialization. 1:12:48.208 --> 1:12:57.912 Alternatively, if you don't predict fertilities but predict the length directly, you can initialize the decoder input accordingly. 1:12:58.438 --> 1:13:16.432 That often works a bit better, but that's the alternative. Now you have everything available at once, in training and in testing. 1:13:16.656 --> 1:13:18.621 This is all available at once. 1:13:20.280 --> 1:13:33.139 Then we can generate everything in parallel: we have the decoder stack, and that is the same as before. 1:13:35.395 --> 1:13:41.555 And then we make the translation predictions on top of it. 1:13:43.083 --> 1:14:00.924 We predict all target words at once, and that is the basic idea. 1:14:01.241 --> 1:14:08.171 Non-autoregressive machine translation: the idea is that we don't have to generate one word at a time. 1:14:10.210 --> 1:14:20.358 So this looks really great at first view. But there is one challenge, and that is the baseline quality. 1:14:20.358 --> 1:14:27.571 Of course there have been improvements, but in general the quality is often significantly worse. 1:14:28.068 --> 1:14:38.466 Here you see the baseline models: you have a loss of ten BLEU points or something like that. 1:14:38.878 --> 1:14:41.640 So why does that happen?
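Before we look at the errors, a side sketch of the fertility-based decoder initialization described above; the fertility predictor is assumed to exist, and all names are hypothetical:

```python
import torch

def decoder_inputs_from_fertility(src_embeds, fertilities):
    """src_embeds: (src_len, dim) source word embeddings.
    fertilities: (src_len,) predicted copies per source word.
    Returns the decoder input: each source embedding repeated
    according to its fertility; the total gives the target length."""
    return torch.repeat_interleave(src_embeds, fertilities, dim=0)

# Example: fertilities [1, 2, 0, 1] -> target length 4, where the
# second source word is copied twice and the third is dropped.
embeds = torch.randn(4, 8)
inputs = decoder_inputs_from_fertility(embeds, torch.tensor([1, 2, 0, 1]))
print(inputs.shape)  # torch.Size([4, 8])
```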
1:14:43.903 --> 1:14:56.250 If you look at the errors, there are repetitive tokens: the same word generated twice in a row, and things like that. 1:14:56.536 --> 1:15:04.851 Broken or disfluent sentences; so exactly where autoregressive models are very good. With those we even said it's a bit of a problem that they generate very fluent output: 1:15:07.387 --> 1:15:10.898 sometimes the translation doesn't have anything to do with the input. 1:15:11.411 --> 1:15:14.047 But generally it always reads very fluently. 1:15:14.995 --> 1:15:20.865 Here it's exactly the opposite: the problem is that we don't get really fluent translations. 1:15:21.421 --> 1:15:26.123 And that is mainly due to the independence assumption. 1:15:26.646 --> 1:15:43.740 In this model, the probability of the word at the second position is computed independently of what was generated at the first position; we don't know what was generated there, we're just generating in parallel. 1:15:43.964 --> 1:16:03.636 You can see that in examples; for instance, the loss can over-penalize outputs that are merely shifted by one position. 1:16:04.024 --> 1:16:10.566 There are improvements for this, but the core issue stays similar, and it shows up like this: 1:16:11.071 --> 1:16:34.594 a sentence can often be translated in two ways, say "I am feeling down" or "I feel down"; but if the first position goes for one variant and the second position for the other, you end up with an inconsistent mixture of both. 1:16:35.075 --> 1:16:42.908 So each position, and that is one of the main issues here, doesn't know what the other positions generate. 1:16:43.243 --> 1:16:58.471 And for example, if you are translating into German, you can often translate things in two ways, with different grammatical agreement. 1:16:58.999 --> 1:17:02.047 And then you have to decide which of the two forms to use. 1:17:02.162 --> 1:17:05.460 The decoder doesn't know which word it has to select. 1:17:06.086 --> 1:17:14.789 I mean, of course it knows the hidden state, but in the end you have a probability distribution. 1:17:16.256 --> 1:17:32.832 And that is the important difference to the autoregressive model: there you know what was generated, because you have put it in; here you don't. If two words are equally probable, a position doesn't know which of them was selected elsewhere, and of course the right word here depends on that selection. 1:17:33.333 --> 1:17:39.986 Yep, that's this shift problem, and we'll come back to it in a second. Yes? 1:17:40.840 --> 1:17:44.935 Student: Doesn't this also appear in training, when we do this parallel training? 1:17:46.586 --> 1:17:50.183 The thing is, in the autoregressive model you give it the correct previous word during training. 1:17:50.450 --> 1:17:59.573 So if the reference here is "feeling", then you tell the model: the last word was "feeling", and then it knows what has to come next. 1:17:59.573 --> 1:18:04.044 But here it doesn't know that, because it doesn't get the previous word as input. 1:18:04.204 --> 1:18:24.286 Yes, that depends a bit on the setup. 1:18:24.204 --> 1:18:27.973 But in training, of course, you just try to make the correct word the most probable one. 1:18:31.751 --> 1:18:38.181 So what you can do is use things like the CTC loss, which can adjust for this; a small usage sketch follows below.
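As an aside, this is how a CTC loss is typically invoked, here with PyTorch's `nn.CTCLoss`; the shapes follow that API, and all sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# CTC scores an output sequence against a shorter reference, allowing
# blanks and repeats, so an output shifted by one is not fully penalized.
ctc = nn.CTCLoss(blank=0)

T, N, C = 12, 2, 50                      # output length, batch, vocab (+blank)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(-1)       # (T, N, C) log-probabilities
targets = torch.randint(1, C, (N, 8))    # reference ids, no blank symbol
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```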
1:18:38.181 --> 1:18:42.866 With that you can also have this shifted correction. 1:18:42.866 --> 1:18:50.582 If you do this type of correction with the CTC loss, you don't get the full penalty. 1:18:50.930 --> 1:18:58.486 If the output is just shifted by one, it is scored differently; it's a loss mainly used in speech recognition, 1:19:00.040 --> 1:19:03.412 but it can be used to address this problem. 1:19:04.504 --> 1:19:20.515 The other problem is this multimodality issue we saw in the example before: if you translate "thank you" into German, it can be "Danke schön" or "Vielen Dank". 1:19:20.460 --> 1:19:31.925 And then the model might end up mixing them, because the first position learns one variant and the second position the other. 1:19:32.492 --> 1:19:47.002 In order to prevent that, it would be helpful if for one input there were only one output; that makes it easier for the system to learn. 1:19:47.227 --> 1:19:53.867 It's fine if slightly different inputs have different outputs, but for the same input there should be one output. 1:19:54.714 --> 1:19:57.467 That we can luckily solve very easily. 1:19:59.119 --> 1:20:04.116 And it's done with the technique we just learned about, which is called knowledge distillation. 1:20:04.985 --> 1:20:22.958 So the easiest way to improve your non-autoregressive model is: train an autoregressive model, decode your whole training corpus with it, and then train the non-autoregressive model on these outputs. 1:20:23.603 --> 1:20:27.078 The main advantage of that is that this data is more consistent. 1:20:27.407 --> 1:20:41.901 For the same input you now always have the same output; you make your training data more consistent, and it becomes easier to learn. 1:20:42.482 --> 1:20:59.156 So that is another advantage of knowledge distillation: you get more consistent training signals. 1:21:04.884 --> 1:21:16.462 There's another trick to make things easier at the beginning: there's this glancing model, where you feed in parts of the correct output. 1:21:16.756 --> 1:21:26.080 So during training, especially at the beginning, you give some correct target words as input. 1:21:28.468 --> 1:21:38.407 And there is this k-tokens-at-a-time idea, which starts from autoregressive-style training. 1:21:40.000 --> 1:21:59.174 Some target words are given and some are open: with k equal to one you always have one known input and predict one output, like autoregressive training, and then you move to predicting more and more tokens in parallel. 1:21:59.699 --> 1:22:05.825 So in that way the model can slowly learn what is a good and what is a bad output. 1:22:08.528 --> 1:22:15.323 It doesn't sound very efficient, admittedly: you go over your training data several times. 1:22:15.875 --> 1:22:29.318 You can even switch between the modes. There is a homework on this topic where you can try it out. 1:22:31.271 --> 1:22:46.598 There's a whole line of work on that; this is often done, and although training takes longer, it still helps. 1:22:49.389 --> 1:23:04.958 For later reference, here are some examples of how much these things help; and maybe one point here is really important. 1:23:05.365 --> 1:23:13.787 Here you see the translation performance and the speed.
1:23:13.787 --> 1:23:24.407 One point worth noting is what you compare against. 1:23:24.784 --> 1:23:40.522 A very weak baseline, a plain transformer even with beam search, is itself around ten times slower than a very strong autoregressive system. 1:23:40.961 --> 1:23:53.454 If you compare against a strong baseline, the reported speed-up shrinks accordingly; that is why you see so many different speed-up numbers. 1:23:53.454 --> 1:24:03.261 In general: use a strong baseline, not a very simple transformer. 1:24:07.407 --> 1:24:25.950 With this, one last thing you can do to speed things up and also reduce your memory is what is called half precision: computing with 16-bit instead of 32-bit floats. 1:24:26.326 --> 1:24:31.148 That helps especially for decoding; for training it sometimes gets less stable. 1:24:32.592 --> 1:24:46.963 And with this we are nearly done. What you should remember is how efficient machine translation can be achieved. 1:24:47.007 --> 1:24:57.665 We have, for example, looked at knowledge distillation; we have looked at non-autoregressive models; we have seen different architectures and hyperparameters. 1:24:58.898 --> 1:25:08.430 That's it for today; then only one request: if you haven't done so, please fill out the evaluation. 1:25:08.388 --> 1:25:20.127 If you have done so already, thank you; and the online people can hopefully do it as well. 1:25:20.320 --> 1:25:30.937 It is the best way to tell us what is good and what is not; not the only one, but the most efficient. 1:25:31.851 --> 1:25:35.871 So thanks to all the students doing it, and with that, thank you. 0:00:01.921 --> 0:00:16.424 Hey, welcome to today's lecture. What we want to look at today is how we can make neural machine translation more efficient. 0:00:16.796 --> 0:00:29.714 So until now we had this global view of the system, the encoder and the decoder mostly, and we haven't really thought about how long things take. 0:00:30.170 --> 0:00:47.084 And what we know, for example, is that you can make the systems bigger in different ways: we can make them deeper or wider. 0:00:47.407 --> 0:00:56.331 And if we have enough data, that typically makes performance better. 0:00:56.576 --> 0:01:06.587 But of course it leads to the problem that we need more resources. That is an issue at universities, where we typically have limited computation capacity. 0:01:06.587 --> 0:01:11.757 So at some point you have such big models that you cannot train them anymore. 0:01:13.033 --> 0:01:26.984 And for companies it is of course also important what it costs to generate a translation, just in terms of power consumption. 0:01:27.667 --> 0:01:35.386 So yeah, there are different reasons why you want to do efficient machine translation. 0:01:36.436 --> 0:01:50.527 To put it in context: there are different ways of improving your machine translation system. 0:01:50.670 --> 0:01:55.694 There can be different types of data; we looked into data crawling and monolingual data. 0:01:55.875 --> 0:01:59.024 All this data, and the aim is always the same:
0:01:59.099 --> 0:02:17.550 of course, we are not just interested in having more data for its own sake; the idea is that more data means better quality, because mostly we are interested in increasing the quality of the machine translation. 0:02:18.838 --> 0:02:24.892 But there are also other ways to improve the quality of a machine translation system. 0:02:25.325 --> 0:02:44.467 And that is where most research is focusing: building better algorithms. 0:02:44.684 --> 0:03:00.315 The other routes are often just as good; sometimes it's easier to just collect more data than to invent some great new algorithm. But yeah, both are important. 0:03:00.920 --> 0:03:11.590 But there is this third thing, especially with neural machine translation: we simply make a bigger model. 0:03:11.751 --> 0:03:24.532 That can mean, as said, more layers or wider layers. The other option we talked a bit about is ensembles: we are not building one machine translation system, 0:03:24.965 --> 0:03:33.177 we can easily build four. What is the typical strategy to build different systems? Remember? 0:03:35.795 --> 0:03:48.979 They should of course be a bit different: if they all predict the same, then combining them doesn't help. So what is the easiest way to build four different systems? 0:03:51.711 --> 0:04:01.747 Student: You could take the best output of each single system. 0:04:02.362 --> 0:04:16.682 I mean, the point is really to have four different systems, so that you can later combine them, maybe by averaging; ensembles typically average the output probabilities. 0:04:19.439 --> 0:04:36.525 The idea is: if you think about neural networks, there's one thing you can trivially adjust, and that's the easiest way to get different systems: the random initialization. 0:04:37.017 --> 0:04:46.556 They have the same architecture, all the hyperparameters are the same, but they are initialized differently, so they will make different predictions. 0:04:48.228 --> 0:05:08.268 So bigger models and ensembles are in some ways the easiest route to better quality, because you don't really have to develop anything new. 0:05:08.588 --> 0:05:24.877 There are limits: bigger models only get better if you have enough training data; adding ever more layers will not work on very small data. But with a reasonable amount of data it is the easiest thing. 0:05:25.305 --> 0:05:34.970 However, there is a challenge with making models bigger, and that is the computation. 0:05:35.175 --> 0:05:49.518 So, of course, a bigger model can mean longer running times: roughly, with twice the layers you need twice the time.
0:05:51.171 --> 0:06:02.442 Normally you cannot parallelize across the different layers, because the input to one layer is always the output of the previous layer; that propagates through, so it increases your runtime. 0:06:02.822 --> 0:06:20.927 Then you have to store the whole model in memory; if you have double the weights, you will need double the memory. 0:06:20.927 --> 0:06:31.865 It is also more costly to do backpropagation: you have to store the intermediate activations, so you not only grow the model in memory, but also all these other variables that come with training. 0:06:34.414 --> 0:06:36.734 And so in general it is more expensive. 0:06:37.137 --> 0:06:54.208 And therefore there are good reasons to look into whether we can make these models more efficient. 0:06:54.134 --> 0:07:08.437 One way to frame it: you have a fixed budget, say one GPU and one day of training time, or forty thousand euros, and then ask: what is the best machine translation system I can get within this budget? 0:07:08.969 --> 0:07:24.251 And then, of course, you can make the models bigger, but you have to train them for a shorter time; or you can make the algorithms more efficient. 0:07:25.925 --> 0:07:31.699 If you think about efficiency, there are a few different scenarios. 0:07:32.312 --> 0:07:47.913 If you're coming from the research community, what you'll be doing is building a lot of models in your research. 0:07:48.088 --> 0:07:58.645 So you have your test set, calculate the BLEU score, then train another model, and so on. 0:07:58.818 --> 0:08:14.944 What that means is: you typically train on millions of sentences, so your training time is long, maybe a day, in other cases a week. 0:08:15.135 --> 0:08:22.860 Testing is not really the costly part; the training is very costly. 0:08:23.443 --> 0:08:37.830 If you are building models for an application, the scenario is quite different: you train the model once. 0:08:38.038 --> 0:08:47.720 And then you keep it running, and maybe thousands of customers are using it for translation. So in that case decoding dominates the cost. 0:08:48.168 --> 0:09:07.096 And we will see that these are not the same types of challenge: some things can be parallelized in training which cannot be parallelized in testing. 0:09:07.347 --> 0:09:14.124 For example, in training you have to do backpropagation, so you have to store the activations. 0:09:14.394 --> 0:09:24.994 On the other hand, in testing there are constraints we briefly discussed before and will look at in more detail today. 0:09:25.265 --> 0:09:36.100 In training you know the target, and you can process everything in parallel, while in testing you don't. 0:09:36.356 --> 0:09:50.530 There you can only generate one word at a time, so you can parallelize less; therefore it's important to keep the two apart. 0:09:52.712 --> 0:10:03.157 There is a dedicated shared task on this, the efficiency task, where it's about making things as efficient as possible. 0:10:03.123 --> 0:10:14.207 It can consider different resources: how much GPU runtime do you need?
0:10:14.454 --> 0:10:20.294 Or how much memory do you need; or you have a fixed memory budget and then have to build the best system within it. 0:10:20.500 --> 0:10:30.989 And here is an example of that: there were three teams, from Edinburgh among others, and they submitted several systems. 0:10:31.131 --> 0:10:36.515 So if you want to identify the most efficient system, you have to do a bit of a trade-off analysis. 0:10:36.776 --> 0:10:46.720 Do you want better quality or lower runtime? There's not the one solution; you can trade one for the other. 0:10:46.946 --> 0:10:49.662 And you see that there are quite different systems. 0:10:49.909 --> 0:11:07.824 Here is how many words per second a system translates, and you want to be as high up as possible. 0:11:08.068 --> 0:11:09.984 And you see here, this varies a lot between systems. 0:11:11.051 --> 0:11:29.014 You want to be in the top right corner: a high score and many words per second. 0:11:30.250 --> 0:11:34.161 At two hundred and fifty thousand words per second, for example, the score is around zero point three. 0:11:34.834 --> 0:11:53.922 There is a decision to make, but the question is how far out you can get: all the points on this frontier line are winners in a sense, because for each of them there is no system that achieves the same quality with less computation. 0:11:57.657 --> 0:12:11.668 So there's the question of which resources you are interested in: are you running on CPU or GPU? There are different ways of parallelizing things on each. 0:12:14.654 --> 0:12:27.154 Another dimension is how you process your data: there's batch processing and streaming. 0:12:27.647 --> 0:12:39.981 In batch processing you have the whole document available, so you can translate all sentences in parallel, and you're mainly interested in throughput. 0:12:40.000 --> 0:12:57.964 Especially on GPUs that's interesting: you're not translating one sentence at a time, but a hundred sentences or so in parallel; you have one more dimension along which to parallelize, and so you are more efficient. 0:12:58.558 --> 0:13:16.544 You can, for example, also sort the sentences of the document, because we learned that for batch processing all sentences in a batch are padded to the same length. 0:13:16.636 --> 0:13:25.535 Then it makes sense to sort the sentences by length, in order to have the minimum amount of padding attached. 0:13:27.427 --> 0:13:32.150 The other scenario is the streaming scenario, where you do live translation. 0:13:32.512 --> 0:13:40.212 In that case you can't wait for the whole document; you have to translate as the input comes in. 0:13:40.520 --> 0:14:00.361 And then, especially in situations like speech translation, you're interested in things like latency: how long do you have to wait until you get the output for a sentence?
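A minimal sketch of this length-sorting idea for batch processing (hypothetical, not the lecture's code); sentences are token lists, and sorting keeps similar lengths together so each batch carries little padding:

```python
def make_batches(sentences, batch_size=64, pad="<pad>"):
    """Sort by length, cut into batches, pad within each batch.
    Keeping the original indices lets us restore document order."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(sentences[i]) for i in idx)
        batch = [sentences[i] + [pad] * (width - len(sentences[i]))
                 for i in idx]
        batches.append((idx, batch))
    return batches
```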
0:14:06.566 --> 0:14:29.227 Finally, there is the matter of implementation. Today we're mainly looking at different algorithms and models for your machine translation system, but for the same algorithm there are also different implementations. 0:14:29.489 --> 0:14:38.643 So, for example, some machine translation toolkits are very fast, 0:14:38.638 --> 0:14:49.973 because they have coded a lot of the operations at a very low level, directly in the CUDA kernels. 0:14:50.110 --> 0:15:02.474 So the same attention network is typically more efficient in that type of implementation than in any other. 0:15:03.323 --> 0:15:15.106 Of course, there can be disadvantages; you may have seen this if you have worked on the practical exercises. 0:15:15.255 --> 0:15:23.323 A higher-level implementation is normally easier to understand, easier to change, and so on; so there is again a trade-off. 0:15:23.483 --> 0:15:39.145 You have to think about whether to include this in a study or comparison: should I compare different implementations and also look for the most efficient implementation, or is it only about the pure algorithm? 0:15:42.742 --> 0:15:50.355 Yeah, when building these systems there are different trade-offs to make. 0:15:50.850 --> 0:15:57.299 One is the trade-off between memory and throughput, that is, how many words you can generate per second. 0:15:57.557 --> 0:16:03.351 Typically you can easily increase your throughput by increasing the batch size. 0:16:03.643 --> 0:16:06.899 So that means you are translating more sentences in parallel. 0:16:07.107 --> 0:16:09.241 And GPUs are very good at that stuff. 0:16:09.349 --> 0:16:15.161 Translating one sentence or a hundred sentences does not take the same time, but almost. 0:16:15.115 --> 0:16:31.995 The runtimes are very similar, because of the efficient matrix multiplications: you do the same operation on all sentences in parallel. So typically, if you increase your batch size, you do more things in parallel and you translate more words per second. 0:16:33.653 --> 0:16:44.755 On the other hand, with this advantage, you of course need bigger batches and more memory. 0:16:44.965 --> 0:16:59.141 And conversely, your model can be so big that you can only translate with small batch sizes. 0:16:59.119 --> 0:17:08.466 If you are running out of memory while translating, one way out is to decrease your batch size. 0:17:13.453 --> 0:17:31.902 Then there is the trade-off between quality and throughput: as discussed before, larger models generally give higher quality but are slower. 0:17:32.092 --> 0:17:38.709 Of course, a larger model does not always help; at some point you get overfitting; but in general it does. 0:17:43.883 --> 0:17:58.455 And with this, back to the training-versus-testing point from before: there is a difference between training and testing, and between the encoder and the decoder.
0:17:58.798 --> 0:18:17.183 If we look at training time, as mentioned before, we have the full source sentence; and what you see here is the attention in a typical transformer. 0:18:22.162 --> 0:18:40.422 How can we process this efficiently? The first thing to note is that the whole source sentence is available at once. 0:18:40.422 --> 0:18:49.184 That is, of course, not the case in all settings; we'll later talk about speech translation, where we might want to translate before the input is complete. 0:18:49.389 --> 0:18:56.172 But in the general case, you have the full sentence you want to translate. 0:18:56.416 --> 0:19:02.053 So the important thing is: on the source side, everything is available. 0:19:03.323 --> 0:19:15.752 And then, this was one of the big advantages of the transformer, if you remember; there are several. 0:19:16.156 --> 0:19:25.229 But the relevant one here is that we can calculate a full layer in parallel. 0:19:25.645 --> 0:19:29.318 There is no dependency between this state and that state within a layer. 0:19:29.749 --> 0:19:37.536 For each position we calculate the key, value and query, and based on those we calculate the attention. 0:19:37.937 --> 0:19:46.616 Which means we can do all of these calculations in parallel across positions. 0:19:48.028 --> 0:20:00.887 And that is where the efficiency comes from, because for GPUs it makes a big difference whether things run in parallel or one after another. 0:20:01.421 --> 0:20:10.311 And then we go layer by layer, one after the other, and compute the encoder. 0:20:10.790 --> 0:20:28.365 In training, an important point is that for the decoder we also have the full target sentence available, because we know what the model should generate. 0:20:29.649 --> 0:20:38.297 We have modelled it in a particular way: each hidden state depends only on the previous positions. 0:20:38.598 --> 0:20:56.665 The first state here depends only on the first input, and so on; if you remember, we had this masked self-attention. 0:20:56.896 --> 0:21:04.117 That means, of course, we can only calculate the decoder once the encoder is done, but that's fine. 0:21:04.444 --> 0:21:08.925 First we calculate the encoder, then we calculate the decoder. 0:21:09.569 --> 0:21:27.929 But again, in training both x and y are available, so we can calculate everything in parallel. 0:21:28.368 --> 0:21:46.408 So the nice property of the transformer in training is that we can parallelize over positions even for the decoder. 0:21:46.866 --> 0:22:03.270 You still have some sequential computation, because you can only calculate one layer at a time; but the sentence length, which is typically what is long, doesn't matter that much. 0:22:05.665 --> 0:22:13.276 However, in testing the situation is different: in testing we only have the source. 0:22:13.713 --> 0:22:29.063 So we start with the source sentence; we don't know the full target sentence yet, because we generate it autoregressively. For the encoder everything stays the same, but for the decoder it changes.
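A quick illustration of the masked self-attention just mentioned; this is one common way to build such a causal mask (a sketch, not the lecture's code). Position t may attend only to positions up to t, which is exactly what allows all target positions to be trained in parallel:

```python
import torch

def causal_mask(length):
    """Upper-triangular mask: True marks future positions to hide.
    Row t then allows attention to columns 0..t only."""
    return torch.triu(torch.ones(length, length), diagonal=1).bool()

scores = torch.randn(5, 5)                            # raw attention scores
scores = scores.masked_fill(causal_mask(5), float("-inf"))
weights = scores.softmax(-1)                          # future weights are zero
```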
0:22:29.409 --> 0:22:40.756 During decoding we first have only the first hidden state, then the second, and so on; we cannot compute all states in parallel. 0:22:41.101 --> 0:22:58.643 And we can only do the next step for y after feeding in the most probable word from the current step; we do greedy search or beam search, but we cannot parallelize over time. 0:23:03.663 --> 0:23:22.363 So if we are interested in making things more efficient for testing, which matters for example in a real application scenario, 0:23:22.642 --> 0:23:35.933 it makes sense to reconsider our architecture; currently we are working with attention-based models everywhere. 0:23:36.096 --> 0:23:47.142 The decoder is where most of the time is spent in testing; in training it looks similar, but during training we can parallelize it. 0:23:47.167 --> 0:23:59.833 And we haven't even talked about beam search, which can be even more costly, because there you have to follow several different hypotheses. 0:24:02.762 --> 0:24:21.905 So the question is: what can you do to make your model more efficient while staying good at translation in these settings? 0:24:24.604 --> 0:24:30.178 One thing is to look at the encoder-decoder trade-off. 0:24:30.690 --> 0:24:48.154 Until now we typically assumed that the depth of the encoder and the depth of the decoder are roughly the same. 0:24:48.268 --> 0:24:57.678 If you haven't thought about it, you just take the configuration that is known to run well. 0:24:58.018 --> 0:25:04.914 However, we just saw that there is a big asymmetry: at test time the decoder's runtime is a lot longer than the encoder's. 0:25:05.425 --> 0:25:25.415 So the question is: do we really need the same depth on both sides for quality? We know that making these models deeper increases quality. 0:25:25.425 --> 0:25:32.285 But we haven't asked whether we must increase the depth the same way on both sides. 0:25:32.552 --> 0:25:42.923 So what we can do instead is something like this, where you have a deep encoder and a shallow decoder. 0:25:43.163 --> 0:25:59.757 That would mean, for example, that instead of the same number of layers on both sides, you put more layers into the encoder and only a few into the decoder. 0:26:00.080 --> 0:26:10.469 In this case the overall depth from start to end is similar, and so hopefully the quality is too. 0:26:11.471 --> 0:26:29.330 But a lot more can now be parallelized; what is costly at the end, during decoding, is the decoder, because it runs autoregressively, and that part we made shallow. 0:26:31.411 --> 0:26:38.734 And this can be analyzed; here are some examples where people have done exactly that. 0:26:39.019 --> 0:26:57.607 Here we're mainly interested in the orange rows, which are the autoregressive ones, and in the speed-up. 0:26:57.717 --> 0:27:15.031 The systems are not exactly the same, but they are comparable. 0:27:15.055 --> 0:27:31.644 Speed is always relative: they set the baseline speed to one. 0:27:31.771 --> 0:27:35.348 And then you get several times that speed
0:27:35.348 --> 0:27:42.621 if you switch from a balanced system to one where most layers sit in the encoder. 0:27:42.782 --> 0:28:00.283 You see that although you have slightly more parameters, the amount of computation is roughly the same; but you gain speed, because during testing you can parallelize much more of it. 0:28:02.182 --> 0:28:13.500 So you speed up; and if you look at the quality, it's similar: sometimes you improve, sometimes you lose a bit. 0:28:13.500 --> 0:28:20.421 There's a bit of a loss on English to Romanian, but in general the quality is very close. 0:28:20.680 --> 0:28:30.343 So you can keep a similar performance while improving your speed, just by distributing the layers differently. 0:28:30.470 --> 0:28:38.690 And you also see that the number of encoder layers doesn't matter much for speed. 0:28:38.979 --> 0:28:57.309 Because if you compare the deeper-encoder system to the shallower one, you get lower quality with fewer encoder layers, but the speed is similar. 0:28:57.897 --> 0:29:02.233 Student: And the big drop there, is it maybe due to a lack of data? 0:29:03.743 --> 0:29:23.191 Good idea, but I would say that's not the case: Romanian-English should have the same amount of data in both directions. 0:29:24.224 --> 0:29:40.702 Maybe it's just something about that language: when generating Romanian, you may need more target-side dependencies. 0:29:42.882 --> 0:29:46.263 Why exactly it happens there, I also don't know. Any ideas? 0:29:47.887 --> 0:29:49.034 Student: There could be, yeah... 0:29:49.889 --> 0:30:12.492 Student: Maybe it's the vocabulary; it might be much easier to cover the vocabulary on the English side. 0:30:13.333 --> 0:30:22.391 I'd have to check, but I would assume that in this case the systems are not pre-trained, but trained from scratch. 0:30:22.902 --> 0:30:35.595 That's why I was assuming the data is the same; but you're right, if for example the English decoder had been pre-trained, that could explain it. 0:30:36.096 --> 0:30:43.733 I don't remember exactly whether they did something like that, but it could be a good explanation. 0:30:45.325 --> 0:31:01.443 So this is one of the easiest ways to speed up: you just change hyperparameters; you don't have to implement anything. 0:31:02.722 --> 0:31:16.521 Of course, there are other options; we'll look into two of them today. The next lever is the architecture itself. 0:31:16.796 --> 0:31:28.154 Up to now we used the standard transformer everywhere as our baseline. 0:31:28.488 --> 0:31:41.845 However, for translation, self-attention on the decoder side might not be the best solution; there is no law that we have to use it. 0:31:42.222 --> 0:31:47.130 So we can use different types of architectures in the encoder and the decoder. 0:31:47.747 --> 0:31:52.475 There are at least two ways of doing it differently; we will look into two today. 0:31:52.912 --> 0:31:58.842 The first is average attention, which is a very simple solution. 0:31:59.419 --> 0:32:08.757 It does what the name says: it's not really attending anymore; it just attends equally to everything.
0:32:09.249 --> 0:32:24.913 And the other idea, which is currently used in most systems that are optimized for efficiency, is that we keep the transformer encoder, 0:32:25.065 --> 0:32:39.700 but on the decoder side we don't use self-attention; we use a recurrent neural network instead, despite the disadvantages recurrent networks otherwise have. 0:32:39.799 --> 0:32:49.684 The recurrence is normally cheaper to compute, because it depends only on the input and the previous state. 0:32:51.931 --> 0:33:03.841 So what is the difference during decoding, and why might self-attention not be ideal for decoding? 0:33:04.204 --> 0:33:15.649 In an RNN, to compute the new state, we only have to look at the input and the previous state. 0:33:16.136 --> 0:33:31.291 In convolutional networks we have a dependency on a fixed number of previous states; but those are rarely used for decoding. 0:33:31.291 --> 0:33:39.774 In contrast, in the transformer we have this unbounded dependency: 0:33:40.000 --> 0:33:56.053 y_t depends on y_1 up to y_{t-1}. That is very good for modelling, because each step can look at everything, 0:33:56.276 --> 0:34:10.895 but the disadvantage is that we have to do all these calculations; purely from the point of view of efficient computation, it might not be the best choice. 0:34:11.471 --> 0:34:21.994 So the question is: can we change the architecture to keep some of the advantages but make things more efficient? 0:34:24.284 --> 0:34:32.610 The first idea is what is called average attention, and the interesting thing is that this works surprisingly well. 0:34:33.013 --> 0:34:46.790 The only change is in the decoder: you're not computing attention weights anymore; the attention weights are all equal. 0:34:47.027 --> 0:35:03.058 So you don't calculate individual weights from query and key; you just give every previous position the same weight. 0:35:03.283 --> 0:35:07.585 So here it would be one third from this state, one third from this one, and one third from that one. 0:35:09.009 --> 0:35:14.719 And because the weights are fixed, you can precompute things, and decoding gets more efficient. 0:35:15.195 --> 0:35:18.803 Let's first look at the formula; it's maybe not obvious at first. 0:35:18.979 --> 0:35:38.712 The difference here is that your new state is simply the average of all hidden states up to the current position. 0:35:38.678 --> 0:35:45.022 So here it would be one third of this plus one third of this plus one third of that. 0:35:46.566 --> 0:36:01.844 But if you calculate it in this naive way, it's not yet more efficient, because at every step you still sum over all previous hidden states. 0:36:04.524 --> 0:36:24.568 However, you can easily speed this up by keeping an intermediate value, a running sum, which you update at every step. 0:36:25.585 --> 0:36:30.057 The running value at step t is just the running value at step t minus one plus the new hidden state. 0:36:30.350 --> 0:36:36.739 Because the previous running value already contains everything that came before. 0:36:37.377 --> 0:36:50.111 This running value is not yet the final one; to get the final state you divide by t, which does the averaging.
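A minimal sketch of this running-average trick (hypothetical code, PyTorch-style): each decoding step then costs a fixed amount of work instead of a sum over the whole history.

```python
import torch

class RunningAverage:
    """Average-attention style state: constant work per decoding step."""
    def __init__(self, dim):
        self.total = torch.zeros(dim)    # running sum of hidden states
        self.t = 0

    def step(self, hidden):
        self.t += 1
        self.total = self.total + hidden  # one add per step
        return self.total / self.t        # average over all states so far

avg = RunningAverage(4)
for h in torch.randn(3, 4):   # three decoding steps
    g = avg.step(h)           # same result as averaging the full prefix
```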
0:36:50.430 --> 0:37:00.264 If you do the calculation this way, each step needs only a fixed number of operations, 0:37:00.180 --> 0:37:12.535 instead of a sum whose cost grows with the length of the history. 0:37:12.732 --> 0:37:32.687 Student: But what about the lengths? For example, this value here already contains several states. 0:37:32.993 --> 0:37:44.531 That's a very good point, and that's why it is marked differently in the figure: this is the version with the tilde, and the tilde version is not yet averaged. 0:37:44.884 --> 0:37:57.895 So this one is just the sum of these two, because it is the previous running sum plus the new state. 0:37:58.238 --> 0:38:15.131 The sum here equals the sum over all of these; so you only accumulate the sum, and the multiplication by one over t happens at the end. 0:38:15.255 --> 0:38:31.531 Put more mathematically: you can pull the one-over-t factor out of the sum, and then compute the sum incrementally. 0:38:36.256 --> 0:38:47.882 That may look a bit weird and simple: we were all talking about this great attention mechanism that can focus on different parts, and the surprising finding of this work is 0:38:47.882 --> 0:38:53.321 that in the end it can also work well without that, just with equal weights. 0:38:53.954 --> 0:38:56.164 I mean, it's not that easy. 0:38:56.376 --> 0:39:00.451 Sometimes this works well; there are also reports where it doesn't work that well. 0:39:01.481 --> 0:39:05.848 But I think it's an interesting result, and it maybe shows that a lot of 0:39:05.805 --> 0:39:20.026 things in the transformer paper that are presented as given are really hyperparameter choices: that you do the layer norm in between, that you have a feed-forward block, and things like that. These are all important, and the right setup around an idea matters a lot. 0:39:28.969 --> 0:39:42.521 The other thing you can do is, in the end, not completely different from this; it just starts from a very different direction. 0:39:42.942 --> 0:40:01.330 And that is a recurrent network which also has this type of highway connection, so it can bypass the recurrent unit and pass the input through directly. 0:40:01.561 --> 0:40:15.480 So the hidden computation takes your input, but there is also a direct path from input to output. 0:40:17.077 --> 0:40:33.418 These are the four components of the simple recurrent unit; it is motivated by GRUs and LSTMs, which we have seen before. 0:40:33.513 --> 0:40:43.633 Gating has proven to be very good for RNNs: it lets you control what flows into your states. 0:40:44.164 --> 0:40:48.186 In this unit we have two gates: the reset gate and the forget gate. 0:40:48.768 --> 0:41:01.277 First the general structure: we have a cell state; here you see the cell state. 0:41:01.361 --> 0:41:09.661 And this continues through time, so we get a cell state for every step. 0:41:10.030 --> 0:41:11.448 This is the cell state.
0:41:11.771 --> 0:41:16.518 How do we calculate it? Just assume we have an initial cell state here. 0:41:17.017 --> 0:41:19.670 The first thing we compute is the forget gate. 0:41:20.060 --> 0:41:41.356 The forget gate models whether the new cell state should mainly keep the previous cell state, or take the new input instead, or add the two together. 0:41:41.621 --> 0:41:42.877 How can we model that? 0:41:44.024 --> 0:41:45.599 First we calculate the gate value. 0:41:45.945 --> 0:41:56.480 The forget gate depends on the previous cell state and on the input; you also see the formula here. 0:41:57.057 --> 0:42:04.890 We multiply both the cell state and our input with weight matrices, 0:42:05.105 --> 0:42:08.472 add a bias vector, and then apply a sigmoid to that. 0:42:08.868 --> 0:42:13.452 So in the end we have, for each dimension, a number between zero and one. 0:42:13.853 --> 0:42:31.890 If it's near zero, we will mainly use the new input; if it's near one, we will keep the old cell state and ignore the input at this dimension. 0:42:33.313 --> 0:42:41.141 With this we can then create the new cell state; here you see the formula. 0:42:41.601 --> 0:43:00.427 You take the forget gate and multiply it element-wise with the previous cell state; so where the gate is close to one, the old state is kept. 0:43:00.800 --> 0:43:10.946 Where the value is close to zero, you instead add the other part: a transformation of the input, weighted by one minus the gate. 0:43:11.351 --> 0:43:24.284 So if the gate value is near zero at some dimension, most of the information there comes from the input. 0:43:25.065 --> 0:43:32.067 Now you have your cell state; the only remaining question is: based on the cell state, what is the output? 0:43:33.253 --> 0:43:50.957 And there you have another choice: you can either take the cell state as output, or instead prefer the raw input. 0:43:52.612 --> 0:43:59.417 Student: Are the values the same for the reset gate and the forget gate? 0:44:00.900 --> 0:44:16.323 No, the matrices are different, and therefore the values can differ; and that's intentional, because sometimes you want to route the information differently in the two places. 0:44:16.636 --> 0:44:25.205 So here again we have a vector of values between zero and one which controls how the information flows. 0:44:25.505 --> 0:44:36.459 And the output is then calculated similarly to the cell state, but again mixing in the input. 0:44:36.536 --> 0:44:45.714 So the reset gate decides whether to output what is currently stored in the cell, or the input itself. 0:44:46.346 --> 0:45:01.293 So it's not exactly like the residual connections we had before, where we simply add things up; here we take a gated, weighted combination. 0:45:04.224 --> 0:45:17.104 This is the general idea of the simple recurrent unit. Next we will see how to make it even more efficient; but first, any more questions on how it works? 0:45:23.063 --> 0:45:43.177 Now, these calculations are where we can still gain efficiency, because in a matrix-vector product every output dimension depends on all input dimensions; and the same holds for the second gate.
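Collecting the steps just described into formulas; the notation here is assumed, following the usual write-ups of the simple recurrent unit, with $\odot$ for element-wise multiplication and $\sigma$ for the sigmoid:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + V_f c_{t-1} + b_f) && \text{forget gate}\\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot W x_t && \text{new cell state}\\
r_t &= \sigma(W_r x_t + V_r c_{t-1} + b_r) && \text{reset gate}\\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t && \text{output, with a direct path to } x_t
\end{aligned}
$$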
0:45:43.423 --> 0:45:52.353 Because if you do a matrix multiplication with a vector, each dimension of the output vector depends on all dimensions of the input vector. 0:45:52.973 --> 0:46:11.340 So the new cell state depends on every dimension of the previous cell state, while we would like the first dimension of the cell state to depend only on the first dimension before it. 0:46:11.931 --> 0:46:18.481 This dependency between dimensions again makes things less parallelizable. 0:46:19.359 --> 0:46:35.122 We can change that easily by replacing the matrix product in the recurrence with an element-wise product with a vector. 0:46:35.295 --> 0:46:51.459 So you multiply element by element: the first dimension with the first dimension, the second with the second, and so on. 0:46:52.032 --> 0:46:59.294 This should, of course, be a separate weight vector for the reset gate and for the forget gate. 0:46:59.899 --> 0:47:16.148 Now the first dimension depends only on the first dimension, so you no longer have dependencies across dimensions. 0:47:18.078 --> 0:47:25.692 Maybe it gets a bit clearer if you look at what we have to compute now. 0:47:25.966 --> 0:47:31.911 First, we do the matrix multiplications on the inputs alone, to get the gate pre-activations; that can be done for all time steps together. 0:47:32.292 --> 0:47:38.713 And then we only have the element-wise operations, where we combine this output, 0:47:39.179 --> 0:47:52.748 the previous cell state, and the transformed input; these are all element-wise operations, which can be parallelized optimally. 0:47:53.273 --> 0:48:07.603 So we gain parallelism across the dimensions, and the sequential part no longer contains full matrix products. 0:48:09.929 --> 0:48:24.255 And the matrix part you can again do in parallel for all time steps. 0:48:24.544 --> 0:48:34.650 The recurrent part you can't parallelize over time, but per step it is only element-wise work, and that you can parallelize over dimensions. 0:48:35.495 --> 0:48:55.383 Student: Doesn't the element-wise recurrence restrict the model? Maybe; I don't know if they tried the full version. I assume it's not only about reducing computation, but a full matrix there would be hard exactly because you would lose this parallelism. 0:49:01.001 --> 0:49:20.699 People have even simplified the second part further. The output is then handled like the highway connections that we have in the transformer anyway: 0:49:20.780 --> 0:49:24.789 that is simply how things are already put together in a transformer block. 0:49:25.125 --> 0:49:44.512 So in that simplified recurrent unit you use the standard residual connection for the output, and you don't need a separate reset gate. 0:49:46.326 --> 0:49:47.503 That type of thing. 0:49:49.149 --> 0:50:02.580 And with this, we are at the end of how to make architectures efficient, and we come to the next topic. 0:50:13.013 --> 0:50:28.988 Besides the encoder-decoder trade-off and the architectures, there is another technique, which is used very successfully in nearly all of deep learning: knowledge distillation. 0:50:29.449 --> 0:50:45.983 The idea is: can we extract the knowledge of a large network into a smaller one, so that the small one performs similarly well but is much cheaper? 0:50:47.907 --> 0:50:53.217 And the nice thing is that this really works, which may be very surprising.
0:50:53.673 --> 0:51:07.871 So the setup is that we have a large, strong model, which we train for a long time, and the question is: can that help us to train a smaller model? 0:51:08.148 --> 0:51:17.005 So can what we refer to as the teacher model help us build a better small student model than we could before? 0:51:17.257 --> 0:51:28.755 Before, the student model learned only from the data; that is how we normally train our systems. 0:51:29.249 --> 0:51:47.222 The question is: can we train this small model better if we are not only learning from the data, but also from a large model, which itself was maybe trained on the same data? 0:51:47.667 --> 0:51:55.564 So that in the end you have a smaller model that performs better than one trained directly. 0:51:55.895 --> 0:51:59.828 And maybe at first view that is 0:51:59.739 --> 0:52:19.161 very surprising, because everything has seen the same data: the baseline model trained only on the data, and the student trained in the teacher-student setup; the teacher model was also typically trained only on this data. However, 0:52:20.580 --> 0:52:32.293 it has been shown many times that the model trained in the teacher-student framework performs better. 0:52:33.473 --> 0:52:47.171 We'll get a bit of an explanation when we see how it works. There are different ways of doing it; maybe the simplest first. 0:52:47.567 --> 0:53:06.113 So how does it work? This is our student network, the normal one, some type of neural network; we train it as usual. 0:53:06.586 --> 0:53:17.050 So we are training the model to predict the reference words, and we do that by calculating the cross-entropy loss. 0:53:17.437 --> 0:53:25.332 The cross-entropy loss was defined so that the probability of the correct word should be as high as possible. 0:53:25.745 --> 0:53:43.368 So you're calculating output probabilities: at each time step you have a distribution over what the next word is, and your training signal is to put as much probability mass as possible on the word that is there in the training data. 0:53:43.903 --> 0:54:03.947 And this is achieved by the cross-entropy loss, which sums over all training examples, over all positions, and over the full vocabulary, with an indicator that is one exactly when the current reference word is the k-th word of the vocabulary. 0:54:04.204 --> 0:54:27.313 And then we take the log of the probability of that word. So what we have is this matrix: one row per position, one column per vocabulary entry. 0:54:27.507 --> 0:54:40.785 In the end you sum up these log probabilities, and you want that sum to be as high as possible. 0:54:41.041 --> 0:54:54.614 So although this looks like a sum over the whole matrix, for each position only a single entry counts: the one of the correct word.
0:54:54.794 --> 0:55:07.016 So that is the normal cross-entropy loss that we discussed at the very beginning, for how to train these models. 0:55:08.068 --> 0:55:23.374 What can we do differently with the teacher? We also have a teacher network, which is trained on large data. 0:55:24.224 --> 0:55:35.957 And of course its output distribution might be better than the one from the small model. 0:55:36.456 --> 0:55:40.941 So in this case we take the training signal from the teacher network. 0:55:41.441 --> 0:55:59.159 And it's the same formula as before; the only difference is that we train not towards the ground-truth distribution, which is sharp: one for the correct word and zero everywhere else, 0:55:59.299 --> 0:56:11.303 but towards the teacher's distribution, which is also a probability distribution: the correct word has a high probability, but other words have some probability too. 0:56:12.612 --> 0:56:30.341 And that is the main difference. Typically you use an interpolation of these two losses. 0:56:33.213 --> 0:56:47.907 Student: Is it because there's more information contained in the distribution than in the ground truth, since it encodes more about the language? Language always has more than one way to say the same thing. Yes, exactly. 0:56:47.907 --> 0:56:53.114 So there's ambiguity, and that is hopefully encoded well in the teacher's distribution. 0:56:53.513 --> 0:56:57.257 A well-trained teacher network captures that better than the student can learn it from single references. 0:56:57.537 --> 0:57:10.505 So often there is only one correct word, but it might be two or three, and then all of them get probability mass in the distribution. 0:57:10.590 --> 0:57:21.242 And that is the main advantage, or one explanation, of why it helps to train from the teacher's distribution. 0:57:21.361 --> 0:57:33.493 Of course, it's good to also keep the ground-truth signal in there, to prevent the student from learning the teacher's mistakes when the teacher is completely off. 0:57:37.017 --> 0:57:49.466 Any more questions on this first type of knowledge distillation? Student: What about distribution shift? 0:57:50.550 --> 0:58:04.244 I would put it a bit differently: this is not by itself a solution to domain or distribution shift. 0:58:04.744 --> 0:58:12.680 But I don't think it performs worse than training only on the ground truth, because that has the same problem. 0:58:13.113 --> 0:58:21.254 So it's not solving that; I would assume it helps to a similar degree in both cases. 0:58:21.481 --> 0:58:28.524 Of course, if you have a teacher, maybe you have no data for your target domain, 0:58:28.888 --> 0:58:42.147 and then you can use the teacher output, which is not ground truth, but still helpful for learning a better distribution. 0:58:46.326 --> 0:59:02.757 The second idea is sequence-level knowledge distillation. So far we have looked at each position independently. 0:59:03.423 --> 0:59:16.760 We do that often, but it has this problem of error propagation: we start with one error and then build on it. 0:59:17.237 --> 0:59:27.419 So with word-level knowledge distillation, we are treating each word in the sentence independently.
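A sketch of the word-level variant just discussed (hypothetical code): the loss mixes the usual cross-entropy against the reference with a cross-entropy against the teacher's soft distribution, applied per position.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """student_logits, teacher_logits: (batch, positions, vocab)
    targets: (batch, positions) reference word ids.
    Word-level KD: interpolate hard and soft cross-entropy."""
    hard = F.cross_entropy(student_logits.transpose(1, 2), targets)
    soft = -(teacher_logits.softmax(-1)
             * student_logits.log_softmax(-1)).sum(-1).mean()
    return alpha * hard + (1 - alpha) * soft
```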
0:59:28.008 --> 0:59:32.091 So, with word-level distillation, we are not trying to model the dependencies between the words at all. 0:59:32.932 --> 0:59:47.480 We can try to fix that with sequence-level knowledge distillation, but there is, of course, a problem. 0:59:47.847 --> 0:59:53.478 For each position we can get a distribution over all the words at this position. 0:59:53.793 --> 1:00:05.305 But if we want a distribution over all possible target sentences, that's not possible, 1:00:05.305 --> 1:00:06.431 because their number grows exponentially. 1:00:08.508 --> 1:00:15.940 So we can again do a bit of a hack on that. 1:00:15.940 --> 1:00:23.238 If we can't have a distribution over all sentences, we approximate it. 1:00:23.843 --> 1:00:30.764 What we can do is use the teacher network and sample different translations. 1:00:31.931 --> 1:00:39.327 And then we can train on them in different ways. 1:00:39.327 --> 1:00:49.343 We can weight them by their probability; the easiest is to treat them as normal training examples. 1:00:50.050 --> 1:00:56.373 So what that boils down to is that we're taking our teacher network, we're generating some 1:00:56.373 --> 1:01:01.135 translations, and these we're using as additional training data. 1:01:01.781 --> 1:01:11.382 Then we have done distillation mainly at the sequence level, because the teacher network tells us: 1:01:11.382 --> 1:01:17.513 these are all probable translations of the sentence. 1:01:26.286 --> 1:01:34.673 And then you can also try to make a bit of an interpolated 1:01:34.673 --> 1:01:36.206 version of that. 1:01:36.716 --> 1:01:42.802 So what people have also done is sequence-level interpolation. 1:01:42.802 --> 1:01:52.819 You generate several translations here, but then you don't use all of them. 1:01:52.819 --> 1:02:00.658 You use some metric to decide which of these ones to keep, as in the sketch below. 1:02:01.021 --> 1:02:12.056 So instead of only training on the ground truth, which might be improbable or even unreachable 1:02:12.056 --> 1:02:16.520 for the model, because the model cannot generate everything, 1:02:16.676 --> 1:02:23.378 we are giving it an easier target which is still of good quality, and training on that. 1:02:23.703 --> 1:02:32.602 So you're not training it on a very difficult solution, but on an easier 1:02:32.602 --> 1:02:33.570 solution. 1:02:36.356 --> 1:02:38.494 Any more questions on this? 1:02:40.260 --> 1:02:41.557 Yeah. 1:02:41.461 --> 1:02:44.296 Good. 1:02:43.843 --> 1:03:01.642 The next idea is to look at the vocabulary. The problem is, we have seen that the output vocabulary computations 1:03:01.642 --> 1:03:06.784 are often very time-consuming. 1:03:09.789 --> 1:03:19.805 The thing is that most of the vocabulary is not needed for each sentence. 1:03:20.280 --> 1:03:28.219 The question is: can we somehow cheaply precompute which words are likely to occur in the translation of a sentence, 1:03:28.219 --> 1:03:30.967 and then only compute the scores for these? 1:03:31.691 --> 1:03:34.912 And this can be done. 1:03:34.912 --> 1:03:43.932 For example, for a given source sentence, most target words will quite surely not occur in its translation. 1:03:44.164 --> 1:03:48.701 So what you can try to do is limit the vocabulary 1:03:48.701 --> 1:03:51.093 you're considering for each sentence. 1:03:51.151 --> 1:04:04.693 So you're no longer taking the full vocabulary as possible output, but you're restricting it to a subset. 1:04:06.426 --> 1:04:18.275 What typically works is that we always include the most frequent words, because 1:04:18.275 --> 1:04:23.613 these are not so easy to predict from word alignments.
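To make the sequence-level distillation and the interpolated variant just described concrete, here is a minimal sketch (the `teacher` object and its `translate`/`translate_nbest` methods are hypothetical stand-ins for whatever decoding API your toolkit offers; `sacrebleu` is used here as one possible selection metric):

```python
import sacrebleu

def sequence_kd_data(teacher, sources):
    """Plain sequence-level distillation: re-label the training set
    with the teacher's own best translations."""
    return [(src, teacher.translate(src, beam=5)) for src in sources]

def sequence_kd_interpolated(teacher, src, ref, n_best=5):
    """Sequence-level interpolation: among the teacher's n-best outputs,
    keep the one closest to the reference (here by sentence BLEU)."""
    hyps = teacher.translate_nbest(src, n=n_best)
    best = max(hyps, key=lambda h: sacrebleu.sentence_bleu(h, [ref]).score)
    return src, best
```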
1:04:23.964 --> 1:04:32.241 Back to the vocabulary restriction: so we take the most frequent target words, plus the words that often align to one of the 1:04:32.241 --> 1:04:32.985 source words. 1:04:33.473 --> 1:04:46.770 So for each source word you compute the word alignment on your training data, and then 1:04:46.770 --> 1:04:51.700 you record which target words it typically aligns to. 1:04:52.352 --> 1:04:57.680 And then for decoding you build the union of the per-source-word lists and the frequent word list. 1:04:59.960 --> 1:05:02.145 That is, for each source word 1:05:02.145 --> 1:05:08.773 you take the most frequent translations of that source word, for example 1:05:08.773 --> 1:05:13.003 the top few per word, and then the globally most frequent words. 1:05:13.193 --> 1:05:24.333 In total, if you have short sentences, you have a lot fewer words, so in most cases it's 1:05:24.333 --> 1:05:26.232 only a small fraction of the vocabulary. 1:05:26.546 --> 1:05:33.957 And so you have dramatically reduced your vocabulary, and can thereby also decode faster; a small sketch follows below. 1:05:35.495 --> 1:05:43.757 That sounds easy. Does anybody see what is challenging here, and why it might not always help? 1:05:47.687 --> 1:05:54.448 It's not the translation performance. Why might this not be faster? 1:05:54.448 --> 1:06:01.838 If you implement it, it might not give a strong speedup. 1:06:01.941 --> 1:06:06.053 You have to store these lists. 1:06:06.053 --> 1:06:14.135 You have to build the union, and of course that eats into the time you save. 1:06:14.554 --> 1:06:21.920 The second thing: the vocabulary is used in our last step, so we have the hidden state, 1:06:21.920 --> 1:06:23.868 and then we calculate the output probabilities. 1:06:24.284 --> 1:06:29.610 Now we are no longer calculating them for all output words, but for a subset of them. 1:06:30.430 --> 1:06:35.613 However, this matrix multiplication is typically parallelized on the GPU perfectly well. 1:06:35.956 --> 1:06:46.937 So if you only calculate some of the outputs but don't implement it right, it will take 1:06:46.937 --> 1:06:52.794 as long as before, because of the parallel nature of the hardware. 1:06:56.776 --> 1:07:07.997 For beam search there are some ideas too; of course you can go back to greedy search, because 1:07:07.997 --> 1:07:10.833 that's more efficient. 1:07:11.651 --> 1:07:18.347 Beam search gives better quality, and you can buffer some states in between; how much you buffer is 1:07:18.347 --> 1:07:22.216 again this tradeoff between computation and memory. 1:07:25.125 --> 1:07:41.236 Then, at the end of today, what we want to look into is one last type of neural machine translation 1:07:41.236 --> 1:07:42.932 approach. 1:07:43.403 --> 1:07:53.621 And the idea is, as we've already seen in our first steps, that this autoregressive 1:07:53.621 --> 1:07:57.246 part is what takes the time in decoding. 1:07:57.557 --> 1:08:04.461 The encoder can process everything in parallel, but in decoding we are always taking the most probable word and then feeding it back in. 1:08:05.905 --> 1:08:10.476 The question is: do we really need to do that? 1:08:10.476 --> 1:08:14.074 Therefore, there is a bunch of work on this. 1:08:14.074 --> 1:08:16.602 Can we do it differently? 1:08:16.602 --> 1:08:19.616 Can we generate the full target sentence at once? 1:08:20.160 --> 1:08:29.417 We'll see it's not that easy, and there's still an open debate whether this is really faster 1:08:29.417 --> 1:08:31.832 at the same quality, but I think it's worth looking at. 1:08:32.712 --> 1:08:45.594 So, as said, what we have is our encoder-decoder, where we can process our encoder input in parallel, 1:08:45.594 --> 1:08:50.527 and then each output always depends on the previous ones.
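Here is the vocabulary shortlist sketch referred to above (plain Python; the list sizes and the `(src_word, tgt_word)` alignment input format are illustrative assumptions):

```python
from collections import Counter, defaultdict

def build_shortlists(aligned_pairs, top_global=2000, top_per_word=10):
    """aligned_pairs: iterable of (src_word, tgt_word) pairs taken
    from the word alignment of the training data."""
    global_count = Counter()
    per_src = defaultdict(Counter)
    for s, t in aligned_pairs:
        global_count[t] += 1
        per_src[s][t] += 1
    frequent = {w for w, _ in global_count.most_common(top_global)}
    translations = {s: {w for w, _ in c.most_common(top_per_word)}
                    for s, c in per_src.items()}
    return frequent, translations

def candidate_vocab(source_sentence, frequent, translations):
    """Union of the globally frequent words and the likely
    translations of each source word in this sentence."""
    cand = set(frequent)
    for s in source_sentence.split():
        cand |= translations.get(s, set())
    return cand
```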
1:08:50.410 --> 1:08:54.709 So, to continue with this autoregressive setup: we generate the output and then we have to feed it in here, because everything 1:08:54.709 --> 1:08:56.565 depends on the previous outputs. 1:08:56.916 --> 1:09:10.464 This is what is referred to as an autoregressive model, and nearly all speech generation and 1:09:10.464 --> 1:09:16.739 language generation works in this autoregressive way. 1:09:18.318 --> 1:09:21.132 So the motivation is: can we do that more efficiently? 1:09:21.361 --> 1:09:31.694 Can we somehow process all target words in parallel? 1:09:31.694 --> 1:09:41.302 So instead of doing it one by one, we are inputting everything at once. 1:09:45.105 --> 1:09:46.726 So how does it work? 1:09:46.726 --> 1:09:50.587 Let's first build a basic non-autoregressive model. 1:09:50.810 --> 1:09:53.551 The encoder looks exactly as before. 1:09:53.551 --> 1:09:58.310 That's maybe not surprising, because there we know we can parallelize. 1:09:58.618 --> 1:10:04.592 So we put in our input and generate the encoder states; that's exactly 1:10:04.592 --> 1:10:05.295 the same. 1:10:05.845 --> 1:10:16.229 However, now we need to do one more thing: one challenge is what we had before, and that's 1:10:16.229 --> 1:10:26.799 a general challenge of natural language generation like machine translation: the output length. 1:10:32.672 --> 1:10:38.447 Normally we generate until we produce the end-of-sentence token, but if we now generate 1:10:38.447 --> 1:10:44.625 everything at once, that's no longer possible, because we only 1:10:44.625 --> 1:10:45.632 generate once. 1:10:46.206 --> 1:10:58.321 So the question is: how can we determine beforehand how long the sequence is? 1:11:00.000 --> 1:11:06.384 Yes, that would be one idea, and there is other work which tries to do that. 1:11:06.806 --> 1:11:15.702 However, here some relevant work was already done long before; maybe you remember we had the 1:11:15.702 --> 1:11:20.900 IBM models, and there was this concept of fertility. 1:11:21.241 --> 1:11:26.299 The concept of fertility means: for one source word, into how many target words does 1:11:26.299 --> 1:11:27.104 it translate? 1:11:27.847 --> 1:11:34.805 And exactly that we try to do here; that means at the top we 1:11:34.805 --> 1:11:36.134 are calculating fertilities. 1:11:36.396 --> 1:11:42.045 So it says this word is translated into one word, 1:11:42.045 --> 1:11:54.171 that word might be translated into two words, and so on; we're trying to predict into how many target words each source word translates. 1:11:55.935 --> 1:12:10.314 And this sits at the end of the encoder, so it is like a length estimation. 1:12:10.314 --> 1:12:15.523 You can also do it otherwise. 1:12:16.236 --> 1:12:24.526 Then you initialize your decoder input; we know word embeddings work well, so we try 1:12:24.526 --> 1:12:28.627 to do the same thing here, and what people then do is 1:12:28.627 --> 1:12:35.224 initialize it again with the source word embeddings, but each repeated according to its fertility. 1:12:35.315 --> 1:12:36.460 So we copy the source embeddings. 1:12:36.896 --> 1:12:47.816 If one word has fertility two, it appears twice, and the next has fertility one, so it appears once; that is then our initialization, as in the sketch below. 1:12:48.208 --> 1:12:57.151 Alternatively, if you don't predict fertilities but predict the length directly, you can initialize the decoder input in a 1:12:57.151 --> 1:12:57.912 simpler way. 1:12:58.438 --> 1:13:07.788 This often works a bit better, but that's the alternative. 1:13:07.788 --> 1:13:16.432 Now you have everything, in training and testing: 1:13:16.656 --> 1:13:18.621 this is all available at once.
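A minimal sketch of the fertility-based initialization (assuming PyTorch; shapes are illustrative):

```python
import torch

def fertility_decoder_input(src_embeddings, fertilities):
    """src_embeddings: (src_len, dim); fertilities: (src_len,) ints.
    Repeat each source embedding according to its predicted fertility,
    e.g. fertilities [1, 2, 1] turn embeddings [a, b, c] into
    [a, b, b, c], which also fixes the output length."""
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)
```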
1:13:20.280 --> 1:13:31.752 Then we can generate everything in parallel: we have the decoder stack, and that is now 1:13:31.752 --> 1:13:33.139 as before. 1:13:35.395 --> 1:13:41.555 And then we're doing the translation predictions here on top of it. 1:13:43.083 --> 1:13:59.821 So we are predicting here all the target words at once, and that is the basic 1:13:59.821 --> 1:14:00.924 idea 1:14:01.241 --> 1:14:08.171 of non-autoregressive machine translation: we don't have to generate one word at a time. 1:14:10.210 --> 1:14:13.900 So this looks really, really great. 1:14:13.900 --> 1:14:20.358 On first view there's one challenge with this, and this is the quality compared to the baseline. 1:14:20.358 --> 1:14:27.571 Of course there have been improvements, but in general the quality drop is often significant. 1:14:28.068 --> 1:14:32.075 So here you see, compared to the baseline models, 1:14:32.075 --> 1:14:38.466 you have a loss of ten BLEU points or something like that. 1:14:38.878 --> 1:14:40.230 So why does this change things? 1:14:40.230 --> 1:14:41.640 Why is it happening? 1:14:43.903 --> 1:14:56.250 If you look at the errors, there are repetitive tokens, so you get the same word twice, or things like that. 1:14:56.536 --> 1:15:01.995 Broken sentences or disfluent sentences; that is exactly where autoregressive models are 1:15:01.995 --> 1:15:04.851 very good. We said that's even a bit of a problem: 1:15:04.851 --> 1:15:07.390 they generate very fluent 1:15:07.387 --> 1:15:10.898 translations; sometimes a translation doesn't have anything to do with the input. 1:15:11.411 --> 1:15:14.047 But generally the output always looks very fluent. 1:15:14.995 --> 1:15:20.865 Here it's exactly the opposite, so the problem is that we don't get really fluent translations. 1:15:21.421 --> 1:15:26.123 And that is mainly due to the challenge that we have this independence assumption. 1:15:26.646 --> 1:15:35.873 So in this case, the output probability at the second position is independent of what 1:15:35.873 --> 1:15:40.632 was generated at the first position, so we don't know what was generated there. 1:15:40.632 --> 1:15:43.740 We're just generating everything in parallel. 1:15:43.964 --> 1:15:55.439 You can see this in a few examples, 1:15:55.439 --> 1:16:03.636 and evaluation can also over-penalize shifted outputs. 1:16:04.024 --> 1:16:10.566 And the problem is that, even with improvements, this core issue remains. 1:16:11.071 --> 1:16:19.900 So a phrase can, for example, be translated in one way, or maybe you could also translate it 1:16:19.900 --> 1:16:31.105 in another way: but if the first position picks a word from one variant 1:16:31.105 --> 1:16:34.594 and the second position picks a word from the other, you get a mix of both. 1:16:35.075 --> 1:16:42.908 So each position, and that is one of the main issues here, doesn't know what the others generate. 1:16:43.243 --> 1:16:53.846 And, for example, you can often translate things in two 1:16:53.846 --> 1:16:58.471 ways into German, with a different agreement. 1:16:58.999 --> 1:17:02.058 And then here, where you have to decide which form to use, 1:17:02.162 --> 1:17:05.460 the model doesn't know which word it has to select. 1:17:06.086 --> 1:17:14.789 I mean, of course it knows the hidden state, but in the end you have a probability distribution. 1:17:16.256 --> 1:17:20.026 And that is the important thing in the autoregressive model: 1:17:20.026 --> 1:17:24.335 there you know what was selected, because you have put it in; here, you don't know that.
1:17:24.335 --> 1:17:29.660 If two options are equally probable here, you don't know which is selected, and of course 1:17:29.660 --> 1:17:32.832 what should be generated later depends on that selection. 1:17:33.333 --> 1:17:39.554 Yep, that's this shift problem, and we'll come back to it in a second. 1:17:39.554 --> 1:17:39.986 Yes. 1:17:40.840 --> 1:17:44.934 Doesn't this also appear in training, now that we're talking about parallel training? 1:17:46.586 --> 1:17:48.412 The thing is, in the autoregressive model, 1:17:48.412 --> 1:17:50.183 in training you give it the correct previous word. 1:17:50.450 --> 1:17:55.827 So even if the model predicts something wrong at one position, you still tell the model what 1:17:55.827 --> 1:17:59.573 the reference word was, and then it knows what has to follow. 1:17:59.573 --> 1:18:04.044 But here it doesn't know that, because it doesn't get the previous output as input. 1:18:04.204 --> 1:18:24.286 Yes, that's a bit depending on the setup. 1:18:24.204 --> 1:18:27.973 But in training, of course, you just try to make the correct word the highest scoring one. 1:18:31.751 --> 1:18:38.181 So what you can do is use things like the CTC loss, which can adjust for this. 1:18:38.181 --> 1:18:42.866 Then you can also accept this kind of shifted output. 1:18:42.866 --> 1:18:50.582 If you're doing this type of correction with the CTC loss, you don't get the full penalty 1:18:50.930 --> 1:18:58.486 when the output is just shifted by one; so it's a bit of a different loss, which is mainly used in speech recognition, but 1:19:00.040 --> 1:19:03.412 it can be used here in order to address this problem; a small usage sketch follows below. 1:19:04.504 --> 1:19:13.844 The other problem is that, non-autoregressively, we have this ambiguity the model has to disambiguate. 1:19:13.844 --> 1:19:20.515 That's the example from before: if you translate thank you into German, it can be danke schön or vielen Dank. 1:19:20.460 --> 1:19:31.925 And then it might end up, because the first position learns one variant and the second position the other, with a mix of both. 1:19:32.492 --> 1:19:43.201 In order to prevent that, it would be helpful to have, for one input, only one output; that makes 1:19:43.201 --> 1:19:47.002 the system learn better. 1:19:47.227 --> 1:19:53.867 It might be that for slightly different inputs you have different outputs, but for the same input always the same output. 1:19:54.714 --> 1:19:57.467 That we can luckily solve very easily. 1:19:59.119 --> 1:19:59.908 And it's done 1:19:59.908 --> 1:20:04.116 with the technique we just learned about, which is called knowledge distillation. 1:20:04.985 --> 1:20:13.398 So the easiest way to improve your non-autoregressive model is to 1:20:13.398 --> 1:20:16.457 train an autoregressive model. 1:20:16.457 --> 1:20:22.958 Then you decode your whole training data with this model, and then train on its output. 1:20:23.603 --> 1:20:27.078 The main advantage of that is that this data is more consistent. 1:20:27.407 --> 1:20:33.995 So for the same input you always have the same output. 1:20:33.995 --> 1:20:41.901 You make your training data more consistent, and the model can learn it more easily. 1:20:42.482 --> 1:20:54.471 So that is another advantage of knowledge distillation: you get 1:20:54.471 --> 1:20:59.156 more consistent training signals. 1:21:04.884 --> 1:21:10.630 There's another way to make things easier at the beginning. 1:21:10.630 --> 1:21:16.467 There's this glancing model, where you work with masks. 1:21:16.756 --> 1:21:26.080 So during training, especially at the beginning, you give some correct target words as input. 1:21:28.468 --> 1:21:38.407 And there is this k tokens at a time, so the idea is to start with autoregressive training.
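Here is the CTC usage sketch referred to above (assuming PyTorch's `torch.nn.CTCLoss`; all shapes and sizes are illustrative). The point is that the model may emit a longer sequence with blanks and repeats that collapses to the target, so an output that is merely shifted is not fully penalized:

```python
import torch

T, N, C = 16, 2, 100                 # output length, batch, vocab (+blank)
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs from the non-autoregressive decoder: (T, N, C)
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, 8))        # padded target ids, no blanks
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 8)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```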
1:21:40.000 --> 1:21:50.049 Back to the k-tokens-at-a-time idea: some targets are given and some are open. At first it is autoregressive, with k 1:21:50.049 --> 1:21:59.174 equal to one, so you always have one input and one output; then you do partially parallel decoding with larger k. 1:21:59.699 --> 1:22:05.825 So in that way the model can slowly learn what is a good and what is a bad answer. 1:22:08.528 --> 1:22:10.862 It doesn't sound very efficient, 1:22:10.862 --> 1:22:12.578 I admit, 1:22:12.578 --> 1:22:15.323 because you go over your training data several times. 1:22:15.875 --> 1:22:20.655 You can even switch in between. 1:22:20.655 --> 1:22:29.318 There is whole work on such schedules, where you try different ways to start. 1:22:31.271 --> 1:22:41.563 The model still has to learn, so this is often done, and it doesn't 1:22:41.563 --> 1:22:46.598 make inference less efficient, and it helps. 1:22:49.389 --> 1:22:57.979 For later reference, here are some examples of how much these things help. 1:22:57.979 --> 1:23:04.958 Maybe one point here is really important. 1:23:05.365 --> 1:23:13.787 Here is the translation performance and the speed. 1:23:13.787 --> 1:23:24.407 One important point is which baseline you compare against. 1:23:24.784 --> 1:23:33.880 So if you compare to one very weak baseline, a transformer even with beam search, 1:23:33.880 --> 1:23:40.522 then you look maybe ten times faster than the autoregressive model. 1:23:40.961 --> 1:23:48.620 If you take a strong autoregressive baseline, then the speedup goes down, depending on the setup. 1:23:48.620 --> 1:23:53.454 You see a lot of different speedups reported. 1:23:53.454 --> 1:24:03.261 In general, one should compare against a strong baseline and not a very simple transformer. 1:24:07.407 --> 1:24:20.010 Yeah, and with this, one last thing that you can do to speed things up and also reduce your 1:24:20.010 --> 1:24:25.950 memory is what is called half precision. 1:24:26.326 --> 1:24:29.139 This helps especially for decoding; for training, 1:24:29.139 --> 1:24:31.148 it sometimes gets less stable. 1:24:32.592 --> 1:24:45.184 With this we are nearly done; wait a bit: what you should remember is how efficient machine 1:24:45.184 --> 1:24:46.963 translation can be achieved. 1:24:47.007 --> 1:24:51.939 We have, for example, looked at knowledge distillation. 1:24:51.939 --> 1:24:55.991 We have looked at non-autoregressive models. 1:24:55.991 --> 1:24:57.665 We have seen different other techniques. 1:24:58.898 --> 1:25:02.383 That's it for today, and then only one request: 1:25:02.383 --> 1:25:08.430 if you haven't done so, please fill out the evaluation. 1:25:08.388 --> 1:25:20.127 If you have done so, thanks; and the online people can hopefully do it as well. 1:25:20.320 --> 1:25:29.758 It's the best possibility to tell us what is good and what is not; not the only one, but the most 1:25:29.758 --> 1:25:30.937 efficient. 1:25:31.851 --> 1:25:35.875 So please, all students, do it in the next days. Okay, then thank you.
0:09:07.527 --> 0:09:18.611 There are lots of classification algorithms on how to classify automatically generated data, and there 0:09:18.611 --> 0:09:26.957 was a very interesting paper on how to watermark machine translation output. 0:09:27.107 --> 0:09:32.915 So there are two scenarios, of course, in this problem: the one thing is you might want 0:09:32.915 --> 0:09:42.244 to find your own translations, if you're a big company, say, running an MT system that is widely 0:09:42.244 --> 0:09:42.866 used. 0:09:43.083 --> 0:09:49.832 The situation might be that most of the translation out there was created by you.
0:09:49.832 --> 0:10:02.007 You might then be able to detect that. And there is a relatively easy way of doing it: for many inputs there are several 0:10:02.007 --> 0:10:09.951 translations which are more or less equally good. 0:10:09.929 --> 0:10:12.878 They are different, but there is not the one correct translation. 0:10:13.153 --> 0:10:23.763 So what you then can do is not always output the single best one to the user, but, among the good candidates, the one with the highest value 0:10:23.763 --> 0:10:30.241 under some function; that makes it easy to decide which translation to take. 0:10:30.870 --> 0:10:40.713 And if you always give out the translation that, among the candidates which are all good, this rule 0:10:40.713 --> 0:10:42.614 prefers, then you can later recognize your own output. 0:10:42.942 --> 0:10:55.503 But of course this you can only do for the data generated by your own model. 0:10:55.503 --> 0:11:02.855 What we are now seeing is output not only from one system, but from many. 0:11:03.163 --> 0:11:13.295 But it's definitely an additional research question that might get more and more important, 0:11:13.295 --> 0:11:18.307 and it might become an additional filtering step. 0:11:18.838 --> 0:11:29.396 There are other issues in data quality, for example in which direction a text was translated; so far 0:11:29.396 --> 0:11:31.650 we haven't been interested in that. 0:11:31.891 --> 0:11:35.672 But if you're now reaching better and better quality, it makes a difference whether 0:11:35.672 --> 0:11:39.208 the original data was translated from German to English or from English to German. 0:11:39.499 --> 0:11:44.797 Because translations have special properties; people call it translationese. 0:11:44.797 --> 0:11:53.595 So if you generate German from English, it has a structure more similar to the English than text 0:11:53.595 --> 0:11:55.195 that was directly written in German. 0:11:55.575 --> 0:11:57.187 So, 0:11:57.457 --> 0:12:03.014 these are all issues which you might then address: you do additional training to remove them, 0:12:03.014 --> 0:12:07.182 or you first train on them and later train on other, higher quality data. 0:12:07.182 --> 0:12:11.034 That's a general view on it, so it's an important issue. 0:12:11.034 --> 0:12:17.160 But until now I think it hasn't been addressed that much, maybe because the quality was decent enough. 0:12:18.858 --> 0:12:23.691 Question: sure, we can use the Internet if we have the time. 0:12:23.691 --> 0:12:29.075 The problem is, there is a lot of English text, but what about less used languages? 0:12:29.075 --> 0:12:34.460 Say some language spoken in Africa: what do we do about that one? 0:12:34.460 --> 0:12:37.566 I mean, that's why most data is English, too. 0:12:38.418 --> 0:12:42.259 For other languages you take the best you can get. 0:12:42.259 --> 0:12:46.013 If there is no data on the Internet, then you cannot crawl it. 0:12:46.226 --> 0:12:48.255 So there is still a lot of manual data collection. 0:12:48.255 --> 0:12:50.976 Also in that way people try to improve there and collect data. 0:12:51.431 --> 0:12:57.406 English is the most common on the web, but you find surprisingly much data also for other 0:12:57.406 --> 0:12:58.145 languages. 0:12:58.678 --> 0:13:04.227 Of course, only if they're written, remember. 0:13:04.227 --> 0:13:15.077 Many languages are not written at all; for them you might find some video, but it's 0:13:15.077 --> 0:13:17.420 difficult to find text. 0:13:17.697 --> 0:13:22.661 So this holds mainly for the web crawling. 0:13:22.661 --> 0:13:29.059 It's mainly done for languages which are commonly written. 0:13:30.050 --> 0:13:37.907 That is exactly the next point: this much data is only true for English and some 0:13:37.907 --> 0:13:41.972 other languages, but of course there are many more.
0:13:41.982 --> 0:13:50.285 And therefore a lot of research on how to make things efficient and learn 0:13:50.285 --> 0:13:54.248 faster from less data is still essential. 0:13:59.939 --> 0:14:06.326 So what we are interested in now, as data, is parallel data. 0:14:06.326 --> 0:14:10.656 We always assume we have parallel data. 0:14:10.656 --> 0:14:12.820 That means we have sentence pairs that are translations of each other. 0:14:13.253 --> 0:14:20.988 But we have to be careful: when we start crawling from the web, we might get only related types of data. 0:14:21.421 --> 0:14:30.457 One common thing is what people refer to as noisy parallel data, where there are documents 0:14:30.457 --> 0:14:34.315 which are roughly translations of each other. 0:14:34.434 --> 0:14:44.300 So you have sentences where there is no translation on the other side, because content was added or dropped. 0:14:44.484 --> 0:14:50.445 If you have these types of documents, your algorithm to extract parallel data might have to be 0:14:50.445 --> 0:14:51.918 a bit more robust. 0:14:52.352 --> 0:15:04.351 I don't know if you can still remember: in the beginning of the lecture we talked about different 0:15:04.351 --> 0:15:06.393 data resources. 0:15:06.286 --> 0:15:11.637 There the first step was an approach to align source and target sentences; it was based 0:15:11.637 --> 0:15:16.869 on the sentence lengths, and you have some probabilities for one-to-one and 0:15:16.869 --> 0:15:17.590 one-to-two alignments. 0:15:17.590 --> 0:15:23.002 It's a very simple algorithm, but it works fine for really high quality parallel 0:15:23.002 --> 0:15:23.363 data. 0:15:23.623 --> 0:15:30.590 But when we're talking about noisy data, we might have to do additional steps and use more 0:15:30.590 --> 0:15:35.872 advanced models to extract what is parallel and to get high quality. 0:15:36.136 --> 0:15:44.682 So for noisy parallel documents, the parallel sentences might not be as easy to extract. 0:15:49.249 --> 0:15:54.877 And then there is the even more extreme case, which has also been used, to be honest. 0:15:54.877 --> 0:15:58.214 The use of this data is nowadays not that common; 0:15:58.214 --> 0:16:04.300 it was of more interest maybe ten or fifteen years ago, and that is what people refer 0:16:04.300 --> 0:16:05.871 to as comparable data. 0:16:06.266 --> 0:16:17.167 The idea there is that you don't even have translations, sentences which are translations of each 0:16:17.167 --> 0:16:25.234 other, but you have news documents or articles about the same topic. 0:16:25.205 --> 0:16:32.410 It's more that you find phrases which are equivalent, so you extract only parallel fragments. 0:16:32.852 --> 0:16:44.975 If you think about Wikipedia, for example, these articles are, by the 0:16:44.975 --> 0:16:51.563 general Wikipedia idea, written independently of each other. 0:16:51.791 --> 0:17:01.701 They have different information in there; the German article might have more detail 0:17:01.701 --> 0:17:04.179 than the English one. 0:17:04.179 --> 0:17:07.219 However, it might be that some parts overlap. 0:17:07.807 --> 0:17:20.904 And the same thing holds for newspaper articles if they are written at the same time about the same event. 0:17:21.141 --> 0:17:25.603 And so this is a possibility to learn, 0:17:25.603 --> 0:17:36.760 for example, new phrases and vocabulary for domains where you don't have parallel data. 0:17:37.717 --> 0:17:49.020 Not everything will be the same, but there might be an overlap, for example about events.
0:17:54.174 --> 0:18:00.348 So, talking about web crawling: as said in the beginning, it was originally really about specific websites. 0:18:00.660 --> 0:18:18.878 People treated these very carefully by hand, really focused on them, and built a very specific way of 0:18:18.878 --> 0:18:20.327 extracting them. 0:18:20.540 --> 0:18:23.464 The European Parliament proceedings are one example; TED is another. 0:18:23.464 --> 0:18:26.686 Maybe you have even looked into a particular session. 0:18:27.427 --> 0:18:40.076 And these are still important, but they are of course very specific, covering only particular 0:18:40.076 --> 0:18:41.341 topics. 0:18:42.002 --> 0:18:55.921 Then there was a focus on language-centered crawling, so there were big crawls where, for example, 0:18:55.921 --> 0:18:59.592 you check websites for one language pair. 0:19:00.320 --> 0:19:07.918 But what people really like is a more general approach where you just have to specify: 0:19:07.918 --> 0:19:15.355 I'm interested in data from German to Lithuanian, and then, as automatically as possible, 0:19:15.355 --> 0:19:19.640 data is collected and parallel data extracted for this pair. 0:19:21.661 --> 0:19:25.633 So this is our interest. 0:19:25.633 --> 0:19:36.435 Of course, the question is how we can build these types of systems. 0:19:36.616 --> 0:19:52.913 The first steps are general web-crawling-based systems, so there is nothing translation-specific about them. 0:19:53.173 --> 0:19:57.337 Based on the websites you have, you have to do text extraction. 0:19:57.597 --> 0:20:06.503 We are typically not that much interested in the markup and images in there, so we try to extract the 0:20:06.503 --> 0:20:07.083 text. 0:20:07.227 --> 0:20:16.919 This is also not specific to machine translation; it's the traditional way of doing web 0:20:16.919 --> 0:20:17.939 crawling. 0:20:18.478 --> 0:20:22.252 And at the end you have something like a large set of collected documents. 0:20:22.842 --> 0:20:37.025 That is the idea: you have the text, and often this is organized as documents, and so in the end 0:20:37.077 --> 0:20:51.523 that is your starting point for the more machine translation specific steps. 0:20:52.672 --> 0:21:05.929 One way of doing that is very similar to what you might think of as the traditional 0:21:05.929 --> 0:21:06.641 one. 0:21:06.641 --> 0:21:10.633 The first thing is to do a document alignment. 0:21:11.071 --> 0:21:22.579 So you do this based on initial facts, like knowing this is a German website and this is its 0:21:22.579 --> 0:21:25.294 English translation. 0:21:25.745 --> 0:21:31.037 And based on this document alignment, you can then do your sentence alignment. 0:21:31.291 --> 0:21:39.072 This is similar to what we had before with the Church and Gale algorithm, 0:21:39.072 --> 0:21:43.696 but here we typically have more noisy parallel data. 0:21:43.623 --> 0:21:52.662 So you are not assuming that everything is present on both sides, or that the order is the same; 0:21:52.662 --> 0:21:56.635 you need more flexible methods. 0:21:58.678 --> 0:22:14.894 Then it depends on whether the documents you crawled were really some type of parallel data. 0:22:15.115 --> 0:22:35.023 If they were only comparable, say, then you should do what is referred to as fragment extraction. 0:22:36.136 --> 0:22:47.972 One problem with these types of models is that errors in your document alignment propagate. 0:22:48.128 --> 0:22:55.860 It means that if you say these two documents are aligned, then you can only find 0:22:55.860 --> 0:22:58.589 sentence pairs within them, and whatever the document alignment misses is lost.
0:22:59.259 --> 0:23:15.284 This matters when the data is very different, when only small parts of the documents are parallel and most parts are independent 0:23:15.284 --> 0:23:17.762 of each other. 0:23:19.459 --> 0:23:31.318 Therefore, more recently, there is also the idea of directly doing the sentence alignment globally, so 0:23:31.318 --> 0:23:35.271 that you directly compare sentences. 0:23:36.036 --> 0:23:41.003 That was already one challenge of this second approach. 0:23:42.922 --> 0:23:50.300 Yes, one big challenge here is that you have to do a lot of comparisons. 0:23:50.470 --> 0:23:59.270 You have to compare every source sentence with every target sentence, which is quadratic. 0:23:59.270 --> 0:24:06.283 If you think of millions of sentences, that's trillions of pairs. 0:24:07.947 --> 0:24:12.176 And this also gives you a reason for a last step in both cases. 0:24:12.176 --> 0:24:18.320 In both of them, you have to remember, you're typically operating on a very large data 0:24:18.320 --> 0:24:18.650 set. 0:24:18.650 --> 0:24:24.530 So all of these steps, and also the document alignment here, should be done very efficiently. 0:24:24.965 --> 0:24:42.090 And if you want to do something very efficiently, that usually means your quality will go down, 0:24:41.982 --> 0:24:47.348 because you have to assess each pair fast, and then you can put less computation 0:24:47.348 --> 0:24:47.910 on each. 0:24:48.688 --> 0:25:06.255 Therefore, in a lot of scenarios it makes sense to add a filtering step 0:25:06.255 --> 0:25:08.735 at the end. 0:25:08.828 --> 0:25:13.370 So we do a second filtering step where we now can put in a lot more effort, 0:25:13.433 --> 0:25:20.972 because now we don't have n-squared possible combinations anymore; we have already 0:25:20.972 --> 0:25:26.054 selected, and have maybe in the order of two or three candidates 0:25:26.054 --> 0:25:29.273 for each sentence, or even fewer. 0:25:29.429 --> 0:25:39.234 And then we can put a lot more effort into each individual example and build a high quality 0:25:39.234 --> 0:25:42.611 classifier to really select the parallel pairs. 0:25:45.125 --> 0:26:00.506 One example for that: one of the biggest projects doing this is the so-called 0:26:00.506 --> 0:26:03.478 ParaCrawl corpus. 0:26:03.343 --> 0:26:11.846 It works like the pipeline shown before, and there are a lot of engineering challenges in how you scale it. 0:26:12.272 --> 0:26:25.808 The steps start with the seed URLs, so what you give at the beginning is a list of promising websites. 0:26:26.146 --> 0:26:36.908 Then they do the crawling, the text extraction, the document alignment, the sentence alignment, 0:26:36.908 --> 0:26:45.518 and the sentence filtering, and it goes down to how the text is stored. 0:26:46.366 --> 0:26:51.936 As we'll see later, these corpora exist for a lot of language pairs, so it's easier to download them and then 0:26:51.936 --> 0:26:52.793 improve on them. 0:26:53.073 --> 0:27:08.270 For the crawling, one thing they often do is not even crawl the websites directly, because there are also 0:27:08.270 --> 0:27:10.510 existing crawls of 0:27:10.770 --> 0:27:14.540 large parts of the Internet that they can work on directly. 0:27:14.854 --> 0:27:22.238 In more detail, this is shown a bit here. 0:27:22.238 --> 0:27:31.907 For all the steps you can see there are different possibilities. 0:27:32.072 --> 0:27:39.018 For some you need bilingual knowledge, like a dictionary, or you can use a machine translation system. 0:27:39.239 --> 0:27:47.810 There are different tools both for document alignment 0:27:47.810 --> 0:27:52.622 and for sentence alignment.
0:27:53.333 --> 0:28:02.102 And there are different ways to do the final filtering, for example lexical ones or learned 0:28:02.422 --> 0:28:05.826 classifiers. Let's go through the next steps in a bit more detail. 0:28:05.826 --> 0:28:13.680 But before we do, are there more questions about the general overview of how these pipelines work? 0:28:22.042 --> 0:28:37.058 Yeah, so two or three things about the web crawling: you normally start with the URLs 0:28:37.058 --> 0:28:40.903 that are most promising. 0:28:41.021 --> 0:28:48.652 For example, if you're interested in German to English, you would start with: companies 0:28:48.652 --> 0:29:01.074 where you know they have a German and an English website, or government agencies. And 0:29:01.074 --> 0:29:10.328 then we can use one of the standard tools to start from there, using standard web crawling techniques. 0:29:11.071 --> 0:29:23.942 There are several challenges when doing that: if you request a website too often, you can get blocked. 0:29:25.305 --> 0:29:37.819 You have to keep a history of the visited sites: you follow all the links, and then click on 0:29:37.819 --> 0:29:40.739 all the links on those pages again. 0:29:41.721 --> 0:29:49.432 You have to be very careful about legal issues, starting with robots.txt, so whether you are allowed to crawl at all. 0:29:49.549 --> 0:29:58.941 I mean, that's the one major thing about web crawling in general: 0:29:58.941 --> 0:30:05.251 the problem of how you deal with intellectual property. 0:30:05.685 --> 0:30:13.114 That is why it is sometimes easier to start with pre-crawled data, so that you don't have these issues yourself. 0:30:13.893 --> 0:30:22.526 Of course, there are network issues where you retry; so there are more technical things, but there is 0:30:22.526 --> 0:30:23.122 good tooling. 0:30:24.724 --> 0:30:35.806 Another thing which is very helpful and often done is, instead of doing the web crawling 0:30:35.806 --> 0:30:38.119 yourself, relying on existing crawls. 0:30:38.258 --> 0:30:44.125 One thing is Common Crawl, a crawl of large parts of the web. 0:30:44.125 --> 0:30:51.190 I think a lot of the large language models are trained on this Common Crawl data. 0:30:51.351 --> 0:30:59.763 I think it is an American organization which really works on crawling as much of the web as 0:31:00.000 --> 0:31:01.111 possible. 0:31:01.111 --> 0:31:10.341 So the nice thing is, if you start with this, you don't have to worry about the networking part. 0:31:10.250 --> 0:31:16.086 I don't think you can download all of it, because it's too big, but you can build a pipeline on how to 0:31:16.086 --> 0:31:16.683 process it. 0:31:17.537 --> 0:31:28.874 That is, of course, a general challenge in all this web crawling and parallel data mining: 0:31:28.989 --> 0:31:38.266 you cannot just download all the data, look at it, and then process it; you have to process it as a stream. 0:31:39.639 --> 0:31:45.593 Here it might make sense to directly filter out domains from both sides that only match 0:31:45.593 --> 0:31:46.414 marginally. 0:31:49.549 --> 0:31:59.381 Then you can do the text extraction, which means converting to HTML and then stripping 0:31:59.381 --> 0:32:01.707 the markup from the HTML. 0:32:01.841 --> 0:32:04.802 Often very important is to do the language ID. 0:32:05.045 --> 0:32:16.728 Even from the links it's not that clear which language a text is in, but there are quite good tools 0:32:16.728 --> 0:32:22.891 that can identify the language from relatively short text, as in the sketch below. 0:32:23.623 --> 0:32:36.678 And then you are in the situation that you have all your data, and you can start the mining.
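One widely used option for this language ID step is fastText's public `lid.176.bin` model; a minimal sketch (the confidence threshold is an illustrative assumption):

```python
import fasttext

# lid.176.bin is fastText's published language identification model,
# covering 176 languages; download it once and load it here
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Das ist ein kurzer deutscher Satz.", k=1)
# labels look like ('__label__de',), probs like array([0.99])
if labels[0] == "__label__de" and probs[0] > 0.9:
    pass  # keep this sentence for the German side
```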
0:32:37.157 --> 0:32:43.651 After the text extraction you now have a large collection of data, where it's 0:32:43.651 --> 0:32:49.469 text, maybe with the document it came from and some meta information. Now the question 0:32:49.469 --> 0:32:55.963 is: based on this monolingual or multilingual text, so text in many languages but not aligned, 0:32:56.036 --> 0:32:59.863 how can you generate parallel data? 0:33:01.461 --> 0:33:06.289 And, 0:33:05.705 --> 0:33:12.965 seeing it as a task, or if we want to do it in a machine 0:33:12.965 --> 0:33:20.388 learning way: what we have is a set of sentences in the source language, and we have 0:33:20.388 --> 0:33:23.324 a set of sentences from the target language. 0:33:23.823 --> 0:33:27.814 This is the target language side. 0:33:27.814 --> 0:33:31.392 This is the data we have. 0:33:31.392 --> 0:33:37.034 We don't directly assume any ordering. 0:33:38.018 --> 0:33:44.502 The documents are not really aligned, or there is maybe a link graph; and what we are interested 0:33:44.502 --> 0:33:50.518 in is finding these alignments: which sentences are aligned to each other, and which sentences 0:33:50.518 --> 0:33:53.860 we should remove because we don't have translations for them. 0:33:53.974 --> 0:34:00.339 Exactly this mapping is what we are interested in and what we need to find. 0:34:01.901 --> 0:34:17.910 And if we model it more from the machine learning point of view, we can model it 0:34:17.910 --> 0:34:21.449 as a classification task. 0:34:21.681 --> 0:34:34.850 So the main challenge is to build this type of classifier: given a sentence pair, you want to decide, is 0:34:34.850 --> 0:34:36.646 it parallel? 0:34:42.402 --> 0:34:50.912 However, the biggest challenge, as already pointed out in the beginning, is the size: we 0:34:50.912 --> 0:34:53.329 have millions of source and target sentences. 0:34:53.713 --> 0:35:05.194 The number of comparisons is n squared, so the naive approach is very inefficient, and we need 0:35:05.194 --> 0:35:06.355 to find something better. 0:35:07.087 --> 0:35:16.914 Traditionally there is the first approach mentioned before, the local or hierarchical 0:35:16.914 --> 0:35:20.292 mining, and there the idea is: OK, 0:35:20.292 --> 0:35:23.465 first we align documents. 0:35:23.964 --> 0:35:32.887 And once you have the document alignment, you only need to align sentences within aligned documents. 0:35:33.273 --> 0:35:51.709 That of course makes everything more efficient, because we don't have to do all the comparisons. 0:35:53.253 --> 0:35:56.411 This is done, for example, in the before-mentioned ParaCrawl. 0:35:57.217 --> 0:36:11.221 But it has the issue that if the document alignment is bad, you have error propagation, and you cannot 0:36:11.221 --> 0:36:14.211 recover from that, 0:36:14.494 --> 0:36:20.715 because in document pairs that were never aligned, there may still be some sentences which are parallel. Therefore, 0:36:20.715 --> 0:36:24.973 more recently, there is also what is referred to as global mining. 0:36:26.366 --> 0:36:31.693 And there we really do all of this. 0:36:31.693 --> 0:36:43.266 Although it's n squared, we are doing all the comparisons, just very efficiently. 0:36:43.523 --> 0:36:52.588 So the idea is that you represent all the sentences in a vector space. 0:36:52.892 --> 0:37:06.654 And then it's about nearest neighbor search, and there are a lot of very efficient algorithms for that. 0:37:07.067 --> 0:37:20.591 If you then only compare each sentence to its nearest neighbors, you don't have to do the full n-squared comparison, 0:37:20.591 --> 0:37:22.584 but you still find the pairs.
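A minimal sketch of such a nearest neighbor search with FAISS (the embedding dimension, the random vectors, the neighbor count, and the threshold are illustrative assumptions; the margin shown here is a simplified, forward-only version of the bidirectional margin scoring used in real global mining pipelines):

```python
import numpy as np
import faiss

dim = 1024                                           # e.g. embedding size
src = np.random.rand(10000, dim).astype("float32")   # source embeddings
tgt = np.random.rand(12000, dim).astype("float32")   # target embeddings
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)

index = faiss.IndexFlatIP(dim)       # inner product = cosine after L2 norm
index.add(tgt)
k = 4
scores, ids = index.search(src, k)   # k nearest target sentences per source

# margin scoring: a candidate only counts if it is clearly closer
# than the average of its k nearest neighbours
margin = scores[:, 0] / scores.mean(axis=1)
keep = margin > 1.05                 # threshold is an assumption
```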
0:37:26.186 --> 0:37:40.662 So in the first step, what we want to look at is this: the document classification, which refers 0:37:40.662 --> 0:37:49.584 to the document alignment; and then we do the sentence alignment. 0:37:51.111 --> 0:37:58.518 If we're talking about document alignment, there are typically two steps in that: we 0:37:58.518 --> 0:38:01.935 first do a candidate selection. 0:38:01.935 --> 0:38:10.904 Often we have several stages, and that is again to make things more efficient. 0:38:10.904 --> 0:38:13.360 We have the candidate selection first. 0:38:13.893 --> 0:38:18.402 The candidate selection means: OK, which documents do we want to compare at all? 0:38:19.579 --> 0:38:35.364 Then, if we have initial candidates which might be parallel, we can do a classification step. 0:38:35.575 --> 0:38:37.240 And there are different ways: 0:38:37.240 --> 0:38:40.397 we can use lexical similarity, or we can use embedding-based similarity. 0:38:41.321 --> 0:38:48.272 The first and easiest option is to take all possible candidates. 0:38:48.272 --> 0:38:55.223 That's one possibility; the other one is based on structural information. 0:38:55.235 --> 0:39:05.398 So based on how your website looks, you might find which pages can be translations at all. 0:39:05.825 --> 0:39:14.789 This is typically the only case where we try to use some kind of meta information, which 0:39:14.789 --> 0:39:22.342 can be very useful, because we know that websites, for example, are linked. 0:39:22.722 --> 0:39:35.586 We can try to use URL patterns: if we have some website whose address ends in a marker for German, there may be a parallel one ending in a marker for English. 0:39:35.755 --> 0:39:43.932 So that can easily be used in order to find candidates. 0:39:43.932 --> 0:39:49.335 Then we only compare websites whose URLs match such a pattern, 0:39:49.669 --> 0:40:05.633 where the URL indicates the language and that they are translations of each other; typically you use several heuristics to 0:40:05.633 --> 0:40:07.178 do that; a small sketch follows below. 0:40:07.267 --> 0:40:16.606 Then you don't have to compare all websites, but only the matching ones. 0:40:17.277 --> 0:40:27.607 Of course there are problems, especially with nowadays' content management systems; 0:40:27.607 --> 0:40:32.912 sometimes the URLs are nice and easy to read, sometimes not. 0:40:33.193 --> 0:40:44.452 On the other hand, there are typically links from the parent site to the different language versions. 0:40:44.764 --> 0:40:46.632 You can look at the KIT websites: 0:40:46.632 --> 0:40:49.381 it's the same thing, you can check the different 0:40:49.609 --> 0:41:06.833 languages: you can either do that from the parent website, or you can also click on English. 0:41:06.926 --> 0:41:10.674 So you can either compare all the linked websites, 0:41:10.971 --> 0:41:18.205 or be even more focused and check whether the link text is somehow a language flag or the language 0:41:18.205 --> 0:41:18.677 name. 0:41:19.019 --> 0:41:24.413 So it really depends on how much you want to filter out. 0:41:24.413 --> 0:41:29.178 There is always a trade-off between being efficient and having good coverage. 0:41:33.913 --> 0:41:49.963 Based on that we then have our candidate list; so instead of two independent sets of German and English 0:41:49.963 --> 0:41:52.725 documents, we now have candidate pairs. 0:41:53.233 --> 0:42:03.515 And now the task is: we want to extract those which are really translations of each other. 0:42:03.823 --> 0:42:10.201 So the question is how we can measure the document similarity, 0:42:10.201 --> 0:42:14.655 because what we then do is measure the similarity and decide. 0:42:14.955 --> 0:42:27.096 And here you already see why this is problematic in cases where documents are only partially parallel or merely 0:42:27.096 --> 0:42:28.649 similar.
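Returning to the URL-pattern heuristic from above, here is a minimal candidate-selection sketch (plain Python; the language markers and the example URLs are illustrative assumptions):

```python
import re
from collections import defaultdict

def url_candidates(urls, lang_a="de", lang_b="en"):
    """Group URLs that only differ in a language marker like /de/
    vs /en/ and propose them as document pair candidates."""
    buckets = defaultdict(dict)
    pattern = re.compile(r"/(%s|%s)(/|$)" % (lang_a, lang_b))
    for url in urls:
        m = pattern.search(url)
        if m:
            key = pattern.sub(r"/LANG\2", url)   # language-neutral key
            buckets[key][m.group(1)] = url
    return [(b[lang_a], b[lang_b]) for b in buckets.values()
            if lang_a in b and lang_b in b]

# url_candidates(["https://x.org/de/kontakt", "https://x.org/en/kontakt"])
# -> [("https://x.org/de/kontakt", "https://x.org/en/kontakt")]
```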
0:42:30.330 --> 0:42:37.594 As for measuring the similarity itself, how you can do that is again twofold. 0:42:37.594 --> 0:42:48.309 You can do it more content-based or more structure-based: 0:42:48.188 --> 0:42:53.740 calculating a lot of features and then maybe training a classifier on a small annotated set, which 0:42:53.740 --> 0:42:57.084 then decides, based on these features, whether 0:42:57.084 --> 0:42:58.661 the document pair is parallel. 0:43:00.000 --> 0:43:10.955 One type of features is structural, and one idea there is the text length of 0:43:10.955 --> 0:43:12.718 the documents. 0:43:13.213 --> 0:43:20.511 Of course, the text lengths will not be exactly the same, but if one document has fifty words and 0:43:20.511 --> 0:43:24.907 the other five thousand words, it's quite surely not a translation. 0:43:25.305 --> 0:43:29.274 So you can use the text length as one proxy for 0:43:29.274 --> 0:43:32.334 whether this might be a translation pair. 0:43:32.712 --> 0:43:41.316 Another thing is the alignment between the structures: 0:43:41.316 --> 0:43:52.151 from a website you can extract some type of structure, like the HTML tree. 0:43:52.332 --> 0:44:04.958 You can compare that to the French version and then calculate some similarity, because 0:44:04.958 --> 0:44:07.971 translated pages often keep the structure. 0:44:08.969 --> 0:44:12.172 Of course, this gets more and more problematic: 0:44:12.172 --> 0:44:16.318 if a translation has a different structure, these features are less helpful. 0:44:16.318 --> 0:44:22.097 However, if you do it in a trained way, you can automatically learn how helpful 0:44:22.097 --> 0:44:22.725 they are. 0:44:24.704 --> 0:44:37.516 Then there are content-based features. One easy thing, especially if 0:44:37.516 --> 0:44:48.882 you have languages that use the same script, is to look for words that occur on both sides. 0:44:48.888 --> 0:44:49.611 The lexical overlap: 0:44:49.611 --> 0:44:53.149 we call this a bag of words, and we'll look into it. 0:44:53.149 --> 0:44:55.027 You can use some type of word overlap. 0:44:55.635 --> 0:44:58.418 And neural embeddings can also be used to compare documents. 0:45:02.742 --> 0:45:06.547 And since we have machine translation, 0:45:06.906 --> 0:45:14.640 one idea that you can also pursue is to really use the machine translation system. 0:45:14.874 --> 0:45:22.986 This is the one which takes more effort, so what you then have to do is put in more compute. 0:45:23.203 --> 0:45:37.526 You wouldn't do this type of machine translation based approach in the very first, high-throughput stage. 0:45:38.018 --> 0:45:53.712 But maybe you're first thinking: why can I do that at all, since I'm collecting data precisely to build 0:45:53.712 --> 0:45:55.673 an MT system? 0:45:55.875 --> 0:46:01.628 Well, you can use an initial system to translate, and then you can collect more data with it. 0:46:01.901 --> 0:46:06.879 And one way of doing that is: you translate, for example, all documents into English. 0:46:07.187 --> 0:46:25.789 Then you only need to compare English data, and you do it, in the example, with trigrams. 0:46:25.825 --> 0:46:33.253 For example, this trigram occurs in document one, which was originally German, 0:46:33.253 --> 0:46:37.641 in document two, which was Spanish, and in document three, which was French. 0:46:37.637 --> 0:46:52.225 You create this index, and based on it you can calculate how similar the documents are. 0:46:52.092 --> 0:46:58.190 Then you can use the cosine similarity to calculate which is the most similar 0:46:58.190 --> 0:47:00.968 document, or how similar a document pair is, 0:47:00.920 --> 0:47:04.615 and then measure whether this is a possible translation; a small sketch follows below.
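A minimal sketch of this translate-then-compare document matching (assuming scikit-learn; the tiny example documents are made up, and character trigrams are used here as one possible choice of trigram features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# both sides assumed already machine-translated into English
docs_a = ["the induction machine is described here", "contact our office"]
docs_b = ["here we describe the induction machine", "press releases 2023"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))  # trigrams
tfidf = vec.fit_transform(docs_a + docs_b)
sim = cosine_similarity(tfidf[:len(docs_a)], tfidf[len(docs_a):])

# sim[i, j] is high when docs_a[i] and docs_b[j] share many trigrams
best_match = sim.argmax(axis=1)
```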
0:47:05.285 --> 0:47:14.921 I mean, of course, the documents will not be exactly the same, and even if you have a parallel 0:47:14.921 --> 0:47:18.483 document pair, French and German, 0:47:18.898 --> 0:47:29.086 the MT output will not be a perfect translation; therefore you look at the n-gram overlap, since 0:47:29.086 --> 0:47:31.522 at least some overlap should be there. 0:47:34.074 --> 0:47:42.666 Okay, before we take the next step and go into the sentence alignment, are there more 0:47:42.666 --> 0:47:44.764 questions about the document alignment? 0:47:51.131 --> 0:47:55.924 Good. 0:47:56.997 --> 0:47:59.384 Well, 0:48:00.200 --> 0:48:05.751 there are different ways of doing sentence alignment. 0:48:05.751 --> 0:48:12.036 One I'll describe here is called hunalign. 0:48:12.172 --> 0:48:17.590 We have the advantage that we now have aligned documents, so we might have like a hundred 0:48:17.590 --> 0:48:20.299 sentences on the source side and a hundred on the target side. 0:48:20.740 --> 0:48:31.909 It still might be too expensive to compare everything with everything, 0:48:31.791 --> 0:48:37.541 and therefore these tools typically even assume that we are only interested in alignments 0:48:37.541 --> 0:48:40.800 that lie close to the diagonal of the similarity matrix. 0:48:40.800 --> 0:48:46.422 Not exactly the diagonal, but some band around it, in order to make 0:48:46.422 --> 0:48:47.891 things more efficient. 0:48:48.108 --> 0:48:55.713 You can restrict to this band because, if this really is a parallel document pair, we 0:48:55.713 --> 0:48:56.800 assume the sentence order is roughly the same; 0:48:56.836 --> 0:49:05.002 otherwise it probably wouldn't have passed the document alignment. 0:49:05.505 --> 0:49:06.774 In hunalign, 0:49:06.774 --> 0:49:10.300 we then calculate the similarity for these pairs. 0:49:10.270 --> 0:49:17.428 This is based on a bilingual dictionary, so it may be based on how much dictionary overlap you 0:49:17.428 --> 0:49:17.895 have. 0:49:18.178 --> 0:49:24.148 And then we find a path through this matrix: 0:49:24.148 --> 0:49:31.089 a path which maximizes the similarity along the way. 0:49:31.271 --> 0:49:41.255 So you're trying to find a path through your documents such that you get the parallel sentence pairs. 0:49:41.201 --> 0:49:49.418 The matches on this path are then what you extract as parallel. 0:49:51.011 --> 0:50:05.579 The advantage is that this, on the one hand, limits your search space. 0:50:05.579 --> 0:50:07.521 That is one thing. 0:50:07.787 --> 0:50:10.013 And what does it also mean? 0:50:10.013 --> 0:50:19.120 Even if you have a very high-scoring pair, you're not taking it if it doesn't fit the overall path. 0:50:19.399 --> 0:50:27.063 So sometimes it makes sense to also use this global information and not only compare 0:50:27.063 --> 0:50:34.815 individual sentences, because what you see in practice is that a pair sometimes only looks like a 0:50:34.815 --> 0:50:36.383 good translation in isolation. 0:50:38.118 --> 0:50:51.602 So by this path constraint you prevent the system from picking pairs at the borders where there's 0:50:51.602 --> 0:50:52.201 no real correspondence. 0:50:53.093 --> 0:50:55.689 So that might give you a bit better quality. 0:50:56.636 --> 0:51:12.044 The path always runs from the start to the end of both documents, but it also means you can't 0:51:12.044 --> 0:51:15.126 handle everything: 0:51:15.375 --> 0:51:24.958 you have some restrictions, that is right; first of all, sentences can't be translated out of order.
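As a toy version of the dynamic program behind such band-limited aligners (plain Python; the band width, the 1-1 alignment restriction, and the assumption of non-negative similarities are simplifications, not how hunalign is actually implemented):

```python
def align_band(sim, band=5):
    """Find a monotone alignment path through a sentence similarity
    matrix sim[i][j], only exploring a band around the diagonal."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if abs(i - j) > band or best[i][j] == NEG:
                continue
            # skip a source sentence, skip a target sentence, or align
            moves = [((i + 1, j), 0.0), ((i, j + 1), 0.0)]
            if i < n and j < m:
                moves.append(((i + 1, j + 1), sim[i][j]))
            for (ni, nj), gain in moves:
                if ni <= n and nj <= m and best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[(ni, nj)] = (i, j)
    # trace the path back; diagonal steps are the aligned pairs
    pairs, cell = [], (n, m)
    while cell in back:
        prev = back[cell]
        if cell[0] == prev[0] + 1 and cell[1] == prev[1] + 1:
            pairs.append(prev)
        cell = prev
    return list(reversed(pairs))
```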
0:51:25.285 --> 0:51:32.572 So hunalign typically only really works well if you have relatively high-quality parallel documents. 0:51:32.752 --> 0:51:39.038 So if you have this more general web data, where some parts are translated and 0:51:39.038 --> 0:51:39.471 some are not, 0:51:39.719 --> 0:51:43.604 it doesn't really work that well. 0:51:43.604 --> 0:51:53.157 It's okay with having maybe at the end some sentences which are missing, but in general 0:51:53.453 --> 0:51:59.942 it's not robust against significant noise in the data. 0:52:05.765 --> 0:52:12.584 The second approach is what is referred to as Bleualign. 0:52:13.233 --> 0:52:16.982 And this one does not use as much 0:52:16.977 --> 0:52:30.220 global information: you can translate each sentence into English, and then you calculate 0:52:30.220 --> 0:52:34.885 the BLEU score of the translation against the target sentences (a small sketch follows below). 0:52:35.095 --> 0:52:41.888 That way you would get a few anchor points, which are the ones in purple here. 0:52:42.062 --> 0:52:56.459 And then you have the ability to add some points around them, whose scores might be a bit lower. 0:52:56.756 --> 0:53:06.962 But in this case you are able to deal with reorderings, and also with parts that are missing. 0:53:07.247 --> 0:53:16.925 However, in this case we need a full-scale MT system to do this calculation, while 0:53:16.925 --> 0:53:17.686 before we only needed a dictionary. 0:53:18.318 --> 0:53:26.637 Then, of course, the better your similarity metric is, so the better you are able to do 0:53:26.637 --> 0:53:35.429 this comparison, the less you have to rely on structural information about the documents. 0:53:39.319 --> 0:53:53.411 Any more questions? Then there are things like vecalign, which try to do the same with embeddings. 0:53:53.793 --> 0:53:59.913 The idea is that you embed each sentence 0:53:59.819 --> 0:54:02.246 in a cross-lingual vector space. 0:54:02.246 --> 0:54:08.128 A cross-lingual vector space always means that you have a common vector space: 0:54:08.128 --> 0:54:14.598 in this case, a vector space where sentences in different languages are near to 0:54:14.598 --> 0:54:16.069 each other if they have a similar meaning. 0:54:16.316 --> 0:54:23.750 So a sentence and its translation should lie next to each other in this space. 0:54:24.104 --> 0:54:32.009 And then you can of course measure the similarity by some distance metric in this 0:54:32.009 --> 0:54:32.744 vector space. 0:54:33.033 --> 0:54:36.290 And you're saying two sentences are translations 0:54:36.290 --> 0:54:39.547 if the distance in the vector space is somehow small. 0:54:40.240 --> 0:54:50.702 We'll discuss that in a bit more detail soon, because these vector-space embeddings come 0:54:50.702 --> 0:54:52.010 up again later. 0:54:52.392 --> 0:54:55.861 So the nice thing with this is: 0:54:55.861 --> 0:55:05.508 it gets quite good quality and can decide whether two sentences 0:55:05.508 --> 0:55:08.977 are translations of each other. 0:55:08.888 --> 0:55:14.023 That's done in the vecalign approach, and often these methods even work in a global search way, really 0:55:14.023 --> 0:55:15.575 comparing everything to everything. 0:55:16.236 --> 0:55:29.415 What vecalign also does is try to make finding the alignment more efficient. 0:55:29.309 --> 0:55:40.563 If you don't want to compare everything to everything, you first compare sentence blocks 0:55:41.141 --> 0:55:42.363 and then refine. 0:55:42.562 --> 0:55:55.053 You keep full sentence resolution at the end, but you only compare in the area around the coarse path.
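Going back to the Bleualign idea for a moment, here is a hedged sketch. The `mt_translate` stub, the example sentences, and the threshold of 15 are all made up; the real tool is more involved, but the core of it is scoring MT output against target sentences with sentence-level BLEU to find anchors.

```python
# Anchor finding in the Bleualign style: translate each source sentence,
# then pick the target sentence with the highest sentence-level BLEU.
import sacrebleu

def mt_translate(sentence):
    # stub: a real system would translate the source language into English
    return sentence

# pretend these are the MT outputs of the foreign source sentences
src_mt = [mt_translate(s) for s in ["the cat sat on the mat",
                                    "it was raining heavily that day"]]
tgt = ["the cat was sitting on the mat",
       "heavy rain fell on that day",
       "a completely different sentence"]

anchors = []
for i, hyp in enumerate(src_mt):
    scores = [sacrebleu.sentence_bleu(hyp, [t]).score for t in tgt]
    j = max(range(len(tgt)), key=scores.__getitem__)
    if scores[j] > 15:                    # made-up confidence threshold
        anchors.append((i, j, round(scores[j], 1)))
print(anchors)  # high-confidence anchor points (source idx, target idx, BLEU)
```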
0:55:55.475 --> 0:56:11.501 So if you compare blocks on the source and the target side, then you have far fewer possibilities. 0:56:11.611 --> 0:56:17.262 So here the n-times-m comparison on blocks is a lot smaller than the comparison you would have on single sentences. 0:56:17.777 --> 0:56:23.750 And with neural embeddings you can embed not only single sentences but whole blocks. 0:56:24.224 --> 0:56:28.073 So how do you make this fast? 0:56:28.073 --> 0:56:35.643 You're starting from a coarse-grained resolution here. 0:56:36.176 --> 0:56:47.922 Then you're getting a rough path where the alignment could be, and near this path you're doing 0:56:47.922 --> 0:56:49.858 more and more fine-grained comparisons. 0:56:52.993 --> 0:56:59.352 And this is the vecalign idea: these are the source and the target sentences. 0:57:00.100 --> 0:57:16.163 For example: "While it was sleeping in the forest" and so on, "I thought it was very strange to see this man." 0:57:16.536 --> 0:57:25.197 So you have the sentences, but if you do blocks, you have blocks of consecutive sentences. 0:57:30.810 --> 0:57:38.514 This was the pipeline approach. 0:57:38.514 --> 0:57:46.710 Now we want to look at the global mining approach, but before that, any questions? 0:57:53.633 --> 0:58:07.389 In the global mining setting we also have to do some filtering, and so typically they start 0:58:10.290 --> 0:58:14.256 from the raw crawled data and then do some preprocessing. 0:58:14.254 --> 0:58:17.706 So you first try to deduplicate paragraphs. 0:58:17.797 --> 0:58:30.622 Of course, if you compare everything with everything and have the same input twice, 0:58:30.622 --> 0:58:35.748 you will also get duplicated results. So the first step is that you deduplicate: 0:58:35.748 --> 0:58:37.385 you keep each paragraph only once. 0:58:37.958 --> 0:58:42.079 There's a lot of text which occurs many times; 0:58:42.079 --> 0:58:44.585 it will appear all over the crawl. 0:58:44.884 --> 0:58:57.830 Think of the cookie banners you see everywhere, and pages about accepting things. 0:58:58.038 --> 0:59:04.963 So you can deduplicate here, and maybe your crawler has also crawled the same website twice. 0:59:06.066 --> 0:59:11.291 Then you can remove low-quality data like cookie warnings and boilerplate. 0:59:13.173 --> 0:59:19.830 Then you're doing language identification. 0:59:19.830 --> 0:59:29.936 That means you want to know, for each sentence or paragraph, 0:59:29.936 --> 0:59:38.695 which language it is in, so that you can then look for translations between the right languages. 0:59:39.259 --> 0:59:44.987 Finally, there is some perplexity-based filtering, where you remove, for example, text with very 0:59:44.987 --> 0:59:46.069 high perplexity (a small sketch of these filters follows below). 0:59:46.326 --> 0:59:59.718 That means, for example, data where there are a lot of strange names or garbled tokens. 1:00:00.520 --> 1:00:09.164 Sometimes it also helps to remove data at the other end of the scale, because very low perplexity can indicate machine-generated 1:00:09.164 --> 1:00:09.722 data. 1:00:11.511 --> 1:00:17.632 And then the model which is mostly used for the embeddings is what is called the LASER model. 1:00:18.178 --> 1:00:21.920 It's based on machine translation. 1:00:21.920 --> 1:00:28.442 You hopefully all recognize the machine translation architecture here. 1:00:28.442 --> 1:00:37.103 However, there is a difference between a general machine translation system and this 1:01:00.000 --> 1:01:13.322 machine translation system.
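Returning to the preprocessing steps above (deduplication, boilerplate removal, language identification), here is a rough sketch of how such a filter could look. The fastText model name refers to the publicly available language-ID model; the boilerplate patterns and the confidence threshold are illustrative assumptions, not a fixed recipe.

```python
# Sketch of a paragraph-level cleaning pass: exact deduplication via
# hashing, crude boilerplate removal, and fastText language ID.
import hashlib
import re
import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

lid = fasttext.load_model("lid.176.bin")   # off-the-shelf LID model
BOILERPLATE = re.compile(r"accept (all )?cookies|privacy policy", re.I)

def preprocess(paragraphs, want_lang="de", min_conf=0.8):
    seen, kept = set(), []
    for p in paragraphs:
        h = hashlib.sha1(p.strip().lower().encode()).hexdigest()
        if h in seen:                       # exact-duplicate paragraph
            continue
        seen.add(h)
        if BOILERPLATE.search(p):           # cookie banners etc.
            continue
        labels, probs = lid.predict(p.replace("\n", " "))
        if labels[0] == f"__label__{want_lang}" and probs[0] >= min_conf:
            kept.append(p)
    return kept
```

A perplexity filter would be one more pass of the same shape, scoring each surviving paragraph with a language model and dropping the extremes.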
1:01:14.314 --> 1:01:24.767 Do you see the one bigger difference, apart from details of the architecture? 1:01:25.405 --> 1:01:39.768 There is one difference to the standard model with attention: here we are having only a single sentence embedding, 1:01:40.160 --> 1:01:43.642 and then we are using that vector at each decoder time step. 1:01:44.004 --> 1:01:54.295 Therefore, it's maybe a bit similar to the original NMT systems without attention; 1:01:54.295 --> 1:01:56.717 it's quite similar to those. 1:01:57.597 --> 1:02:10.011 However, it has the disadvantage that we have to put everything into one vector, and 1:02:10.011 --> 1:02:14.329 maybe not all information fits in there. 1:02:15.055 --> 1:02:25.567 However, in this type of framework we are not really interested in the machine translation itself, 1:02:25.567 --> 1:02:27.281 so for this model that's fine. 1:02:27.527 --> 1:02:34.264 So we are training it to do machine translation, 1:02:34.264 --> 1:02:42.239 and what that means is that in the end, as much information as possible should be in the sentence embedding: 1:02:43.883 --> 1:03:01.977 only if all the important information is in here is the decoder able to do the machine translation really well. 1:03:02.642 --> 1:03:07.801 So that is the first step we are doing here: 1:03:07.801 --> 1:03:17.067 we are building the MT system, not with the goal of making the best MT system, but with 1:03:17.067 --> 1:03:22.647 the goal of learning sentence embeddings that hopefully capture all the important information. 1:03:22.882 --> 1:03:26.116 Because otherwise we won't be able to generate the translation. 1:03:26.906 --> 1:03:31.287 So it's a bit like a bottleneck: we try to put in as much information as possible. 1:03:32.012 --> 1:03:36.426 And think about what we want to do later: finding the nearest neighbors or something like that. 1:03:37.257 --> 1:03:48.680 Finding similarities is typically only feasible with fixed-dimensional representations, so we can do 1:03:48.680 --> 1:03:56.803 that in an n-dimensional space and find the nearest neighbor there. 1:03:57.857 --> 1:03:59.837 With variable-length representations it would be very difficult. 1:04:00.300 --> 1:04:03.865 There's one more thing that we want: 1:04:03.865 --> 1:04:09.671 we want to find the nearest neighbor in the other language. 1:04:10.570 --> 1:04:13.424 Do you have an idea how we can train this 1:04:13.424 --> 1:04:16.542 so that the embeddings of different languages can be compared? 1:04:23.984 --> 1:04:36.829 Any idea? Think about what we did two or three lectures ago. 1:04:41.301 --> 1:04:50.562 We can train them in a multilingual setting, and that's how it's done in LASER: so we're 1:04:50.562 --> 1:04:56.982 not doing it only from German to English, but we're training on many language pairs. 1:04:57.017 --> 1:05:04.898 If the English decoder has to be useful for German, French and so on, and the German 1:05:04.898 --> 1:05:13.233 encoder's output also has to be useful for several decoders, then somehow we'll automatically 1:05:13.233 --> 1:05:16.947 learn that these embeddings are comparable across languages. 1:05:17.437 --> 1:05:28.562 And then a sentence and its translation will end up with a similar sentence embedding. 1:05:28.908 --> 1:05:39.734 If you put in here a German and a French sentence which both have the same 1:05:39.734 --> 1:05:48.826 English translation, the decoder has to generate exactly the same output for both, and that's 1:05:48.826 --> 1:05:50.649 of course easiest if the embeddings are the same. 1:05:51.151 --> 1:05:59.817 If the embeddings were very different, then the English decoder would most likely also produce different outputs. 1:06:02.422 --> 1:06:04.784 So that is the first thing. 1:06:04.784 --> 1:06:06.640 Now we have this model.
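A simplified sketch of such an encoder, assuming a PyTorch setup. This is not the actual LASER code, but it shows the shape of the idea: a BiLSTM over token embeddings, max-pooled over time into one fixed-size vector that a decoder would then see at every step, since there is no attention back into the encoder states.

```python
# LASER-style fixed-size sentence encoder (sketch, hypothetical sizes).
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=320, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out_dim = 2 * hidden

    def forward(self, token_ids):                    # (batch, seq_len)
        states, _ = self.lstm(self.emb(token_ids))   # (batch, seq, 2*hidden)
        # max-pool over time: one fixed-size vector per sentence,
        # the bottleneck the decoder has to translate from
        return states.max(dim=1).values

enc = SentenceEncoder(vocab_size=32000)
emb = enc(torch.randint(0, 32000, (2, 12)))  # -> shape (2, 1024)
```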
1:06:06.640 --> 1:06:10.014 It has to be trained on parallel data. 1:06:10.390 --> 1:06:22.705 Then we can use these embeddings on our new data and try to use them to make efficient 1:06:22.705 --> 1:06:24.545 comparisons. 1:06:26.286 --> 1:06:30.669 So how can you do the comparison? 1:06:30.669 --> 1:06:37.243 Maybe the first thing you think of is a simple threshold. 1:06:37.277 --> 1:06:44.365 So you take all the German sentences and all the French sentences, 1:06:44.365 --> 1:06:49.460 and you compute the cosine similarity between them. 1:06:49.469 --> 1:06:58.989 And then you take all pairs where the similarity is very high. 1:07:00.180 --> 1:07:17.242 So you have your German list, you have your French list, and then you just take all sentence pairs above the threshold. 1:07:19.839 --> 1:07:29.800 With the amount of data we have, that is of course computationally demanding; 1:07:29.800 --> 1:07:32.317 that's a good point, and we'll come back to it. 1:07:35.595 --> 1:07:45.738 Thresholding itself is also not that easy. One problem is that typically there are some regions where there are very many points, 1:07:46.066 --> 1:07:48.991 and other regions where there are very few points in the neighborhood. 1:07:49.629 --> 1:08:06.241 And then in the dense regions you might extract far too many pairs, and in the sparse regions too few. 1:08:08.868 --> 1:08:18.341 So what typically is done is the margin criterion: 1:08:18.341 --> 1:08:25.085 how good is a pair compared to the other candidates around it? 1:08:25.305 --> 1:08:33.859 So you take the similarity between x and y, and then you look at the eight nearest 1:08:33.859 --> 1:08:35.190 neighbors 1:08:35.115 --> 1:08:48.461 of x and the eight nearest neighbors of y, and you divide the similarity of the pair by the average similarity to 1:08:48.461 --> 1:08:51.411 these eight neighbors. 1:08:51.671 --> 1:09:00.333 So what you are really asking is: are these two sentences a lot more similar to each other than to all the others around them? 1:09:00.840 --> 1:09:13.455 And if they are exceptionally similar compared to the other sentences, then they should be translations. 1:09:16.536 --> 1:09:24.148 Of course, that also has some corner cases. If there are a lot of similar sentences in the neighborhood, 1:09:24.584 --> 1:09:32.824 then the pair's score gets normalized down; if all the others are far away, 1:09:32.824 --> 1:09:36.626 then the pair stands out and is accepted as a translation. 1:09:37.057 --> 1:09:40.895 Think about short sentences, for example: 1:09:40.895 --> 1:09:47.658 it might be that they are similar to many things just in general, without being translations. 1:09:49.129 --> 1:09:59.220 One problem is that we now assume there is only one translation per sentence. 1:09:59.759 --> 1:10:09.844 So it has some problems if there are two or three valid translations of a sentence; 1:10:09.844 --> 1:10:18.853 then this method might not find them all, but in general it works quite well. 1:10:19.139 --> 1:10:27.397 For example, people have mined all of Common Crawl like this, 1:10:27.397 --> 1:10:32.802 and they have built large parallel data sets from it. 1:10:36.376 --> 1:10:38.557 One point maybe also here: 1:10:38.557 --> 1:10:45.586 of course, now it's important that we have done the deduplication before, because if we 1:10:45.586 --> 1:10:52.453 hadn't, we would have points at exactly the same coordinates, and a duplicate would be its own most similar neighbor, which breaks the margin. 1:10:57.677 --> 1:11:03.109 Maybe one small thing to add: 1:11:03.109 --> 1:11:09.058 a major issue in this case is still efficiency.
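As a sketch, the margin criterion described above could look like this in plain NumPy. This is the ratio variant: the pair similarity is divided by the average similarity to the k nearest neighbors on both sides (k = 8, as in the lecture); the random data is only for demonstration.

```python
# Margin-based scoring of candidate pairs: a pair scores highly only if
# it stands out from its neighborhood on both sides.
import numpy as np

def margin_scores(x, y, k=8):
    # x: (n, d) source embeddings, y: (m, d) target embeddings, L2-normalized
    sim = x @ y.T                                         # cosine similarities
    k_x = min(k, sim.shape[1])
    k_y = min(k, sim.shape[0])
    # average similarity of each sentence to its k nearest neighbors
    avg_x = np.sort(sim, axis=1)[:, -k_x:].mean(axis=1)   # per source, (n,)
    avg_y = np.sort(sim, axis=0)[-k_y:, :].mean(axis=0)   # per target, (m,)
    return sim / (0.5 * (avg_x[:, None] + avg_y[None, :]))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 64)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(120, 64)); y /= np.linalg.norm(y, axis=1, keepdims=True)
scores = margin_scores(x, y)
best = np.unravel_index(scores.argmax(), scores.shape)
print(best, scores[best])   # most margin-worthy candidate pair
```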
1:11:09.409 --> 1:11:18.056 So you still have to do all of these comparisons, and that cannot be done by simple brute force. 1:11:19.199 --> 1:11:27.322 So what is typically done: first, a lot of this can be done in parallel. 1:11:28.368 --> 1:11:36.024 Calculating the embeddings and all that stuff doesn't need to be sequential; it's 1:11:36.024 --> 1:11:37.143 independent for each sentence. 1:11:37.357 --> 1:11:48.680 What you typically do is create an index, and then you do some kind of quantization. 1:11:48.708 --> 1:11:57.047 So there is the FAISS library, which does k-nearest-neighbor search very efficiently 1:11:57.047 --> 1:11:59.597 in very high-dimensional spaces (see the sketch below). 1:12:00.080 --> 1:12:03.410 And then based on that you can do the comparison. 1:12:03.410 --> 1:12:06.873 You can even do the comparison in parallel, because 1:12:06.906 --> 1:12:13.973 you can look at different areas of your space and then compare within the different pieces to find 1:12:13.973 --> 1:12:14.374 the nearest neighbors. 1:12:15.875 --> 1:12:30.790 With this you are then able to do very fast calculations on this type of sentence embedding. 1:12:31.451 --> 1:12:34.761 So yeah, this is currently one of the main approaches. 1:12:35.155 --> 1:12:48.781 Most of the big mined corpora, like ParaCrawl, 1:12:48.668 --> 1:12:55.543 are collected this way, and these are very big corpora, also for languages for which 1:12:55.543 --> 1:12:57.453 you can otherwise hardly find data. 1:12:58.778 --> 1:13:01.016 Do you have any more questions on this? 1:13:05.625 --> 1:13:17.306 And then some more words on this last step here. So we have now done our parallel mining, 1:13:17.306 --> 1:13:25.165 and we could assume that everything is fine now. 1:13:25.465 --> 1:13:35.238 However, the problem with this data is that typically it is still quite noisy. 1:13:36.176 --> 1:13:44.533 In order to make things efficient and to have a high recall, the extracted data is often not 1:13:44.533 --> 1:13:49.547 of the best quality. 1:13:49.789 --> 1:13:58.870 So it is essential to do another filtering step and to remove sentence pairs which only 1:13:58.870 --> 1:14:01.007 seem to be translations. 1:14:01.341 --> 1:14:08.873 And here, of course, the final evaluation metric would be: how much does my system improve? 1:14:09.089 --> 1:14:23.476 There are even shared tasks on doing exactly that, where people get noisy data 1:14:23.476 --> 1:14:25.596 and have to filter it. 1:14:27.707 --> 1:14:34.247 However, this evaluation is of course very time-consuming, so you might not always want 1:14:34.247 --> 1:14:37.071 to run the full pipeline and training for every decision. 1:14:37.757 --> 1:14:51.614 So how can you model what we actually want to get? 1:14:51.871 --> 1:15:02.781 You want to have the best overall translation quality, but this is normally not something 1:15:02.781 --> 1:15:03.917 you can check for every candidate pair. 1:15:04.444 --> 1:15:12.389 And that's why you're doing this two-step approach: first the alignment, then the filtering. 1:15:12.612 --> 1:15:27.171 And once you do the sentence filtering on the already-reduced set, you can put a lot more effort into each comparison. 1:15:27.627 --> 1:15:37.472 For example, you can just translate the source and compare that translation with the 1:15:37.472 --> 1:15:40.404 target side and calculate how good the match is. 1:15:40.860 --> 1:15:49.467 And this, of course, you can do with the final set, but you can't do with your initial set 1:15:49.467 --> 1:15:50.684 of millions of candidates.
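Going back to the nearest-neighbor search step, minimal FAISS usage could look as follows. The exact flat index shown here is the simplest variant; real pipelines use quantized approximate indexes to fit very large collections, but the calling pattern is the same.

```python
# k-nearest-neighbor search over sentence embeddings with FAISS.
import faiss
import numpy as np

d = 64
xb = np.random.rand(10000, d).astype("float32")   # target-side embeddings
xq = np.random.rand(5, d).astype("float32")       # source-side queries
faiss.normalize_L2(xb)                            # unit length, so inner
faiss.normalize_L2(xq)                            # product = cosine

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(xb)
sims, ids = index.search(xq, 8)     # 8 nearest neighbors per query
print(ids[0], sims[0])              # candidates for the margin criterion
```

The neighbor lists and similarities returned here are exactly what the margin computation from before needs as input.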
1:15:54.114 --> 1:16:01.700 So what this is, again, is a classification task where you input a sentence pair, and then 1:16:01.700 --> 1:16:09.532 you have a binary criterion: these are sentence pairs with a high quality, and these are sentence 1:16:09.532 --> 1:16:11.653 pairs with a low quality. 1:16:12.692 --> 1:16:17.552 Does anybody see what might be a challenge if you want to train this type of classifier? 1:16:22.822 --> 1:16:26.477 How do you measure the quality exactly? That is probably a problem. 1:16:27.887 --> 1:16:39.195 Yes, that is one, that is true. There is an even simpler one: getting high-quality data 1:16:39.195 --> 1:16:42.426 here is not so difficult, but what about the rest? 1:16:43.303 --> 1:16:46.844 Probably we have a class imbalance; 1:16:46.844 --> 1:16:49.785 we don't see many bad-quality combinations. 1:16:49.785 --> 1:16:54.395 Exactly, it's hard to get them at the beginning. So how can you argue: 1:16:54.395 --> 1:16:58.405 where do you find bad quality, and what type of bad quality? 1:16:58.798 --> 1:17:05.122 Because if it's too easy, you could just take a random German and a random English sentence, but that is 1:17:05.122 --> 1:17:05.558 very easy to detect. 1:17:05.765 --> 1:17:15.747 What you're really interested in is bad-quality data which still passes your first initial 1:17:15.747 --> 1:17:16.405 step. 1:17:17.257 --> 1:17:28.824 For the classifier itself you can use any type of network or model; in the beginning, 1:17:28.824 --> 1:17:33.177 things like random forests were used. 1:17:33.613 --> 1:17:38.912 So the positive examples are quite easy to get: 1:17:38.912 --> 1:17:44.543 you just take existing high-quality parallel data. 1:17:45.425 --> 1:17:47.565 That is quite easy, 1:17:47.565 --> 1:17:55.482 and you normally don't need a lot of data to train such a classifier and validate it. 1:17:57.397 --> 1:18:12.799 The challenge is the negative samples, because how would you generate negative samples? 1:18:13.133 --> 1:18:17.909 Because the interesting negative examples are the ones which pass the first step but shouldn't pass the 1:18:17.909 --> 1:18:18.353 second. 1:18:18.838 --> 1:18:23.682 So how do you typically do it? 1:18:23.682 --> 1:18:28.994 You try to create synthetic negative examples. 1:18:28.994 --> 1:18:33.369 You can pair random sentences, 1:18:33.493 --> 1:18:45.228 or you can do frequency-based word replacements; but is this the typical error that you want to detect? 1:18:45.228 --> 1:18:52.074 This is the one major issue when you generate the data synthetically: 1:18:52.132 --> 1:19:02.145 it might not match well with the real errors that you're actually interested in. 1:19:02.702 --> 1:19:13.177 So one of the most challenging things here is to find negative samples which are hard enough 1:19:13.177 --> 1:19:14.472 to be useful. 1:19:17.537 --> 1:19:21.863 And the other thing which is difficult is, of course, the data ratio. 1:19:22.262 --> 1:19:24.212 Why is it important? 1:19:24.212 --> 1:19:29.827 Why is the ratio between positive and negative examples important here? 1:19:30.510 --> 1:19:40.007 Because in a case of class imbalance, the model could effectively learn to just always say it's positive and 1:19:40.007 --> 1:19:43.644 high quality, and it would mostly be right. 1:19:44.844 --> 1:19:46.654 Yes, so in training 1:19:46.654 --> 1:19:51.180 this is important, because otherwise it might be too easy 1:19:51.180 --> 1:19:52.414 to always predict one class.
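To make the synthetic-negative idea concrete, here is a small illustrative generator. All three corruption strategies and their mix are assumptions for demonstration, not a fixed recipe; the point is to produce negatives that resemble the errors which slip through the first mining step, rather than trivially unrelated pairs.

```python
# Synthetic negatives for the filtering classifier: misaligned pairs,
# truncated targets, and word replacements (sketch, made-up strategies).
import random

def make_negatives(pairs, vocab, n_per_pair=1, seed=0):
    rng = random.Random(seed)
    negatives = []
    for src, tgt in pairs:
        for _ in range(n_per_pair):
            kind = rng.choice(["misalign", "truncate", "replace"])
            if kind == "misalign":                 # wrong target sentence
                _, wrong = rng.choice(pairs)
                neg = (src, wrong) if wrong != tgt else None
            elif kind == "truncate":               # partially missing target
                words = tgt.split()
                neg = (src, " ".join(words[: max(1, len(words) // 2)]))
            else:                                  # corrupt a content word
                words = tgt.split()
                words[rng.randrange(len(words))] = rng.choice(vocab)
                neg = (src, " ".join(words))
            if neg:
                negatives.append(neg)
    return negatives
```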
1:19:52.732 --> 1:19:58.043 And on the other hand, at application time it's also important, because if we have 1:19:58.043 --> 1:20:03.176 an equal split, we're assuming that about half of the pairs are bad; if the real quality 1:20:03.176 --> 1:20:06.245 is worse or better, we might accept too many or too few. 1:20:06.626 --> 1:20:10.486 So this ratio is not easy to determine. 1:20:13.133 --> 1:20:16.969 What type of features can we use? 1:20:16.969 --> 1:20:23.175 Traditionally, one looked at things like word translation probabilities. 1:20:23.723 --> 1:20:37.592 And nowadays, of course, we can model this also with something like a pre-trained 1:20:40.200 --> 1:20:42.306 language model. 1:20:42.462 --> 1:20:49.763 So we can, for example, put the source and the target sentence in there, and then, 1:20:49.763 --> 1:20:56.497 based on a classification label, we can classify whether this is a parallel sentence pair or not. 1:20:56.476 --> 1:21:00.054 So it's more like a normal classification task. 1:21:00.160 --> 1:21:09.233 And by having a system which can take multilingual input, we can just put the two sentences in together; 1:21:09.233 --> 1:21:16.886 we can also encode the two independently of each other and compare the hidden representations. 1:21:17.657 --> 1:21:35.440 You can then, as with any other type of classifier, train it on top of these representations. 1:21:35.895 --> 1:21:42.801 For example, on the first special token, which tries to represent the full sentence; that's what you can classify on. 1:21:45.265 --> 1:21:46.881 The other thing you can do, of course, 1:21:46.881 --> 1:21:52.837 is take the summation of all the hidden states, as we discussed before. 1:21:58.698 --> 1:22:10.618 Okay, and then one thing which we skipped until now, and which I'll cover only briefly: fragment extraction. 1:22:10.630 --> 1:22:19.517 So if we have sentence pairs which are not really parallel, can we also extract information from 1:22:19.517 --> 1:22:20.096 them? 1:22:22.002 --> 1:22:25.627 So the task here is: 1:22:25.627 --> 1:22:33.603 we have a sentence pair, and we want to find, within 1:22:33.603 --> 1:22:38.679 this sentence pair, fragments that are parallel. 1:22:39.799 --> 1:22:46.577 And how that has been done, for example, is using lexical positive and negative associations. 1:22:47.187 --> 1:22:57.182 Then you can transform your target sentence into a signal and find regions where the association is positive. 1:22:57.757 --> 1:23:00.317 Let me make that a bit clearer. 1:23:00.480 --> 1:23:15.788 So you have here the English sentence and the other language, and you have a word alignment between 1:23:15.788 --> 1:23:18.572 them, and then: 1:23:18.818 --> 1:23:21.925 a word that is not aligned gives a negative signal. 1:23:22.322 --> 1:23:40.023 And then you do some smoothing on this signal, because you want to find an area where most of the words are aligned. 1:23:40.100 --> 1:23:51.742 With the smoothing, it doesn't matter if you have single alignment errors in between. 1:23:51.972 --> 1:23:58.813 So you try to find long segments where at least most of the words are somehow aligned. 1:24:00.040 --> 1:24:10.069 And then you take this span on each side and extract it as your parallel fragment. 1:24:10.630 --> 1:24:21.276 So in the end you not only have full sentences, but you also have partial sentences, which might 1:24:21.276 --> 1:24:27.439 be helpful, especially if you have a quite low-resource setup.
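A toy version of this smoothed alignment signal, with made-up example words, window size, and span rule. It only shows the mechanics of turning word-alignment information into a per-word signal and keeping the longest positive span as the extracted fragment.

```python
# Fragment extraction sketch: +1 for aligned target words, -1 for
# unaligned ones, smoothed with a moving average; keep the longest span
# where the smoothed signal stays positive.
import numpy as np

def extract_fragment(tgt_words, aligned_positions, window=3):
    signal = np.array([1.0 if i in aligned_positions else -1.0
                       for i in range(len(tgt_words))])
    smooth = np.convolve(signal, np.ones(window) / window, mode="same")
    best, cur = (0, 0), None
    for i, v in enumerate(smooth):
        if v > 0:
            cur = (cur[0], i + 1) if cur else (i, i + 1)
            if cur[1] - cur[0] > best[1] - best[0]:
                best = cur
        else:
            cur = None                 # a run of unaligned words ends the span
    return " ".join(tgt_words[best[0]:best[1]])

words = "der Rat hat heute die neue Verordnung beschlossen xyz spam".split()
print(extract_fragment(words, aligned_positions={0, 1, 2, 3, 4, 5, 6, 7}))
# -> "der Rat hat heute die neue Verordnung beschlossen"
```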
1:24:32.332 --> 1:24:36.388 That's everything for today. 1:24:36.388 --> 1:24:44.023 What you hopefully remember is the general pipeline. 1:24:44.184 --> 1:24:54.506 We talked about how we can do the document alignment, and then we can do the sentence alignment, 1:24:54.506 --> 1:24:57.625 which is done after the document alignment. 1:24:59.339 --> 1:25:12.611 Any more questions? I think on Thursday we had to do a switch, so on Thursday there will be 1:25:12.611 --> 1:25:15.444 a practical session.