diff --git "a/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" "b/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" @@ -0,0 +1,10713 @@ +WEBVTT + +0:00:03.243 --> 0:00:18.400 +Hey welcome to our video, small room today +and to the lecture machine translation. + +0:00:19.579 --> 0:00:32.295 +So the idea is we have like last time we started +addressing problems and building machine translation. + +0:00:32.772 --> 0:00:39.140 +And we looked into different ways of how we +can use other types of resources. + +0:00:39.379 --> 0:00:54.656 +Last time we looked into language models and +especially pre-trained models which are different + +0:00:54.656 --> 0:00:59.319 +paradigms and learning data. + +0:01:00.480 --> 0:01:07.606 +However, there is one other way of getting +data and that is just searching for more data. + +0:01:07.968 --> 0:01:14.637 +And the nice thing is it was a worldwide web. + +0:01:14.637 --> 0:01:27.832 +We have a very big data resource where there's +various types of data which we can all use. + +0:01:28.128 --> 0:01:38.902 +If you want to build a machine translation +for a specific language or specific to Maine, + +0:01:38.902 --> 0:01:41.202 +it might be worse. + +0:01:46.586 --> 0:01:55.399 +In general, the other year we had different +types of additional resources we can have. + +0:01:55.399 --> 0:01:59.654 +Today we look into the state of crawling. + +0:01:59.654 --> 0:02:05.226 +It always depends a bit on what type of task +you have. + +0:02:05.525 --> 0:02:08.571 +We're crawling, you point off no possibilities. + +0:02:08.828 --> 0:02:14.384 +We have seen some weeks ago that Maje Lingo +models another thing where you can try to share + +0:02:14.384 --> 0:02:16.136 +knowledge between languages. + +0:02:16.896 --> 0:02:26.774 +Last we looked into monolingual data and next +we also unsupervised them too which is purely + +0:02:26.774 --> 0:02:29.136 +based on monolingual. + +0:02:29.689 --> 0:02:35.918 +What we today will focus on is really web +crawling of parallel data. + +0:02:35.918 --> 0:02:40.070 +We will focus not on the crawling pad itself. + +0:02:41.541 --> 0:02:49.132 +Networking lecture is something about one +of the best techniques to do web trolleying + +0:02:49.132 --> 0:02:53.016 +and then we'll just rely on existing tools. + +0:02:53.016 --> 0:02:59.107 +But the challenge is normally if you have +web data that's pure text. + +0:03:00.920 --> 0:03:08.030 +And these are all different ways of how we +can do that, and today is focused on that. + +0:03:08.508 --> 0:03:21.333 +So why would we be interested in that there +is quite different ways of collecting data? + +0:03:21.333 --> 0:03:28.473 +If you're currently when we talk about parallel. + +0:03:28.548 --> 0:03:36.780 +The big difference is that you focus on one +specific website so you can manually check + +0:03:36.780 --> 0:03:37.632 +how you. + +0:03:38.278 --> 0:03:49.480 +This you can do for dedicated resources where +you have high quality data. + +0:03:50.510 --> 0:03:56.493 +Another thing which has been developed or +has been done for several tasks is also is + +0:03:56.493 --> 0:03:59.732 +like you can do something like crowdsourcing. + +0:03:59.732 --> 0:04:05.856 +I don't know if you know about sites like +Amazon Mechanical Turing or things like that + +0:04:05.856 --> 0:04:08.038 +so you can there get a lot of. 
+ +0:04:07.988 --> 0:04:11.544 +Writing between cheap labors would like easy +translations for you. + +0:04:12.532 --> 0:04:22.829 +Of course you can't collect millions of sentences, +but if it's like thousands of sentences that's + +0:04:22.829 --> 0:04:29.134 +also sourced, it's often interesting when you +have somehow. + +0:04:29.509 --> 0:04:36.446 +However, this is a field of itself, so crowdsourcing +is not that easy. + +0:04:36.446 --> 0:04:38.596 +It's not like upload. + +0:04:38.738 --> 0:04:50.806 +If you're doing that you will have very poor +quality, for example in the field of machine + +0:04:50.806 --> 0:04:52.549 +translation. + +0:04:52.549 --> 0:04:57.511 +Crowdsourcing is very commonly used. + +0:04:57.397 --> 0:05:00.123 +The problem there is. + +0:05:00.480 --> 0:05:08.181 +Since they are paid quite bad, of course, +a lot of people also try to make it put into + +0:05:08.181 --> 0:05:09.598 +it as possible. + +0:05:09.869 --> 0:05:21.076 +So if you're just using it without any control +mechanisms, the quality will be bad. + +0:05:21.076 --> 0:05:27.881 +What you can do is like doing additional checking. + +0:05:28.188 --> 0:05:39.084 +And think recently read a paper that now these +things can be worse because people don't do + +0:05:39.084 --> 0:05:40.880 +it themselves. + +0:05:41.281 --> 0:05:46.896 +So it's a very interesting topic. + +0:05:46.896 --> 0:05:55.320 +There has been a lot of resources created +by this. + +0:05:57.657 --> 0:06:09.796 +It's really about large scale data, then of +course doing some type of web crawling is the + +0:06:09.796 --> 0:06:10.605 +best. + +0:06:10.930 --> 0:06:17.296 +However, the biggest issue in this case is +in the quality. + +0:06:17.296 --> 0:06:22.690 +So how can we ensure that somehow the quality +of. + +0:06:23.003 --> 0:06:28.656 +Because if you just, we all know that in the +Internet there's also a lot of tools. + +0:06:29.149 --> 0:06:37.952 +Low quality staff, and especially now the +bigger question is how can we ensure that translations + +0:06:37.952 --> 0:06:41.492 +are really translations of each other? + +0:06:45.065 --> 0:06:58.673 +Why is this interesting so we had this number +before so there is some estimates that roughly + +0:06:58.673 --> 0:07:05.111 +a human reads around three hundred million. + +0:07:05.525 --> 0:07:16.006 +If you look into the web you will have millions +of words there so you can really get a large + +0:07:16.006 --> 0:07:21.754 +amount of data and if you think about monolingual. + +0:07:22.042 --> 0:07:32.702 +So at least for some language pairs there +is a large amount of data you can have. + +0:07:32.852 --> 0:07:37.783 +Languages are official languages in one country. + +0:07:37.783 --> 0:07:46.537 +There's always a very great success because +a lot of websites from the government need + +0:07:46.537 --> 0:07:48.348 +to be translated. + +0:07:48.568 --> 0:07:58.777 +For example, a large purpose like in India, +which we have worked with in India, so you + +0:07:58.777 --> 0:08:00.537 +have parallel. + +0:08:01.201 --> 0:08:02.161 +Two questions. + +0:08:02.161 --> 0:08:08.438 +First of all, if jet GPS and machine translation +tools are more becoming ubiquitous and everybody + +0:08:08.438 --> 0:08:14.138 +uses them, don't we get a problem because we +want to crawl the web and use the data and. + +0:08:15.155 --> 0:08:18.553 +Yes, that is a severe problem. + +0:08:18.553 --> 0:08:26.556 +Of course, are we only training on training +data which is automatically? 
+ +0:08:26.766 --> 0:08:41.182 +And if we are doing that, of course, we talked +about the synthetic data where we do back translation. + +0:08:41.341 --> 0:08:46.446 +But of course it gives you some aren't up +about norm, you cannot be much better than + +0:08:46.446 --> 0:08:46.806 +this. + +0:08:48.308 --> 0:08:57.194 +That is, we'll get more and more on issues, +so maybe at some point we won't look at the + +0:08:57.194 --> 0:09:06.687 +current Internet, but focus on oats like image +of the Internet, which are created by Archive. + +0:09:07.527 --> 0:09:18.611 +There's lots of classification algorithms +on how to classify automatic data they had + +0:09:18.611 --> 0:09:26.957 +a very interesting paper on how to watermark +their translation. + +0:09:27.107 --> 0:09:32.915 +So there's like two scenarios of course in +this program: The one thing you might want + +0:09:32.915 --> 0:09:42.244 +to find your own translation if you're a big +company and say do an antisystem that may be + +0:09:42.244 --> 0:09:42.866 +used. + +0:09:43.083 --> 0:09:49.832 +This problem might be that most of the translation +out there is created by you. + +0:09:49.832 --> 0:10:01.770 +You might be able: And there is a relatively +easy way of doing that so that there are other + +0:10:01.770 --> 0:10:09.948 +peoples' mainly that can do it like the search +or teacher. + +0:10:09.929 --> 0:10:12.878 +They are different, but there is not the one +correction station. + +0:10:13.153 --> 0:10:23.763 +So what you then can't do is you can't output +the best one to the user, but the highest value. + +0:10:23.763 --> 0:10:30.241 +For example, it's easy, but you can take the +translation. + +0:10:30.870 --> 0:10:40.713 +And if you always give the translation of +your investments, which are all good with the + +0:10:40.713 --> 0:10:42.614 +most ease, then. + +0:10:42.942 --> 0:10:55.503 +But of course this you can only do with most +of the data generated by your model. + +0:10:55.503 --> 0:11:02.855 +What we are now seeing is not only checks, +but. + +0:11:03.163 --> 0:11:13.295 +But it's definitely an additional research +question that might get more and more importance, + +0:11:13.295 --> 0:11:18.307 +and it might be an additional filtering step. + +0:11:18.838 --> 0:11:29.396 +There are other issues in data quality, so +in which direction wasn't translated, so that + +0:11:29.396 --> 0:11:31.650 +is not interested. + +0:11:31.891 --> 0:11:35.672 +But if you're now reaching better and better +quality, it makes a difference. + +0:11:35.672 --> 0:11:39.208 +The original data was from German to English +or from English to German. + +0:11:39.499 --> 0:11:44.797 +Because translation, they call it translate +Chinese. + +0:11:44.797 --> 0:11:53.595 +So if you generate German from English, it +has a more similar structure as if you would + +0:11:53.595 --> 0:11:55.195 +directly speak. + +0:11:55.575 --> 0:11:57.187 +So um. + +0:11:57.457 --> 0:12:03.014 +These are all issues which you then might +do like do additional training to remove them + +0:12:03.014 --> 0:12:07.182 +or you first train on them and later train +on other quality data. + +0:12:07.182 --> 0:12:11.034 +But yet that's a general view on so it's an +important issue. + +0:12:11.034 --> 0:12:17.160 +But until now I think it hasn't been addressed +that much maybe because the quality was decently. + +0:12:18.858 --> 0:12:23.691 +Actually, I think we're sure if we have the +time we use the Internet. 
+ +0:12:23.691 --> 0:12:29.075 +The problem is, it's a lot of English speaking +text, but most used languages. + +0:12:29.075 --> 0:12:34.460 +I don't know some language in Africa that's +spoken, but we do about that one. + +0:12:34.460 --> 0:12:37.566 +I mean, that's why most data is English too. + +0:12:38.418 --> 0:12:42.259 +Other languages, and then you get the best. + +0:12:42.259 --> 0:12:46.013 +If there is no data on the Internet, then. + +0:12:46.226 --> 0:12:48.255 +So there is still a lot of data collection. + +0:12:48.255 --> 0:12:50.976 +Also in the wild way you try to improve there +and collect. + +0:12:51.431 --> 0:12:57.406 +But English is the most in the world, but +you find surprisingly much data also for other + +0:12:57.406 --> 0:12:58.145 +languages. + +0:12:58.678 --> 0:13:04.227 +Of course, only if they're written remember. + +0:13:04.227 --> 0:13:15.077 +Most languages are not written at all, but +for them you might find some video, but it's + +0:13:15.077 --> 0:13:17.420 +difficult to find. + +0:13:17.697 --> 0:13:22.661 +So this is mainly done for the web trawling. + +0:13:22.661 --> 0:13:29.059 +It's mainly done for languages which are commonly +spoken. + +0:13:30.050 --> 0:13:38.773 +Is exactly the next point, so this is that +much data is only true for English and some + +0:13:38.773 --> 0:13:41.982 +other languages, but of course. + +0:13:41.982 --> 0:13:50.285 +And therefore a lot of research on how to +make things efficient and efficient and learn + +0:13:50.285 --> 0:13:54.248 +faster from pure data is still essential. + +0:13:59.939 --> 0:14:06.326 +So what we are interested in now on data is +parallel data. + +0:14:06.326 --> 0:14:10.656 +We assume always we have parallel data. + +0:14:10.656 --> 0:14:12.820 +That means we have. + +0:14:13.253 --> 0:14:20.988 +To be careful when you start crawling from +the web, we might get only related types of. + +0:14:21.421 --> 0:14:30.457 +So one comedy thing is what people refer as +noisy parallel data where there is documents + +0:14:30.457 --> 0:14:34.315 +which are translations of each other. + +0:14:34.434 --> 0:14:44.300 +So you have senses where there is no translation +on the other side because you have. + +0:14:44.484 --> 0:14:50.445 +So if you have these types of documents your +algorithm to extract parallel data might be + +0:14:50.445 --> 0:14:51.918 +a bit more difficult. + +0:14:52.352 --> 0:15:04.351 +Know if you can still remember in the beginning +of the lecture when we talked about different + +0:15:04.351 --> 0:15:06.393 +data resources. + +0:15:06.286 --> 0:15:11.637 +But the first step is then approached to a +light source and target sentences, and it was + +0:15:11.637 --> 0:15:16.869 +about like a steep vocabulary, and then you +have some probabilities for one to one and + +0:15:16.869 --> 0:15:17.590 +one to one. + +0:15:17.590 --> 0:15:23.002 +It's very like simple algorithm, but yet it +works fine for really a high quality parallel + +0:15:23.002 --> 0:15:23.363 +data. + +0:15:23.623 --> 0:15:30.590 +But when we're talking about noisy data, we +might have to do additional steps and use more + +0:15:30.590 --> 0:15:35.872 +advanced models to extract what is parallel +and to get high quality. + +0:15:36.136 --> 0:15:44.682 +So if we just had no easy parallel data, the +document might not be as easy to extract. + +0:15:49.249 --> 0:15:54.877 +And then there is even the more extreme pains, +which has also been used to be honest. 
+ +0:15:54.877 --> 0:15:58.214 +The use of this data is reasoning not that +common. + +0:15:58.214 --> 0:16:04.300 +It was more interested maybe like ten or fifteen +years ago, and that is what people referred + +0:16:04.300 --> 0:16:05.871 +to as comparative data. + +0:16:06.266 --> 0:16:17.167 +And then the idea is you even don't have translations +like sentences which are translations of each + +0:16:17.167 --> 0:16:25.234 +other, but you have more news documents or +articles about the same topic. + +0:16:25.205 --> 0:16:32.410 +But it's more that you find phrases which +are too big in the user, so even black fragments. + +0:16:32.852 --> 0:16:44.975 +So if you think about the pedia, for example, +these articles have to be written in like the + +0:16:44.975 --> 0:16:51.563 +Wikipedia general idea independent of each +other. + +0:16:51.791 --> 0:17:01.701 +They have different information in there, +and I mean, the German movie gets more detail + +0:17:01.701 --> 0:17:04.179 +than the English one. + +0:17:04.179 --> 0:17:07.219 +However, it might be that. + +0:17:07.807 --> 0:17:20.904 +And the same thing is that you think about +newspaper articles if they're at the same time. + +0:17:21.141 --> 0:17:24.740 +And so this is an ability to learn. + +0:17:24.740 --> 0:17:29.738 +For example, new phrases, vocabulary and stature. + +0:17:29.738 --> 0:17:36.736 +If you don't have parallel data, but you could +monitor all time long. + +0:17:37.717 --> 0:17:49.020 +And then not everything will be the same, +but there might be an overlap about events. + +0:17:54.174 --> 0:18:00.348 +So if we're talking about web trolling said +in the beginning it was really about specific. + +0:18:00.660 --> 0:18:18.878 +They do very good things by hand and really +focus on them and do a very specific way of + +0:18:18.878 --> 0:18:20.327 +doing. + +0:18:20.540 --> 0:18:23.464 +The European Parliament was very focused in +Ted. + +0:18:23.464 --> 0:18:26.686 +Maybe you even have looked in the particular +session. + +0:18:27.427 --> 0:18:40.076 +And these are still important, but they are +of course very specific in covering different + +0:18:40.076 --> 0:18:41.341 +pockets. + +0:18:42.002 --> 0:18:55.921 +Then there was a focus on language centering, +so there was a big drawer, for example, that + +0:18:55.921 --> 0:18:59.592 +you can check websites. + +0:19:00.320 --> 0:19:06.849 +Apparently what really people like is a more +general approach where you just have to specify. + +0:19:06.849 --> 0:19:13.239 +I'm interested in data from German to Lithuanian +and then you can as automatic as possible. + +0:19:13.239 --> 0:19:15.392 +We see what's normally needed. + +0:19:15.392 --> 0:19:19.628 +You can collect as much data and extract codelaia +from this. + +0:19:21.661 --> 0:19:25.633 +So is this our interest? + +0:19:25.633 --> 0:19:36.435 +Of course, the question is how can we build +these types of systems? + +0:19:36.616 --> 0:19:52.913 +The first are more general web crawling base +systems, so there is nothing about. + +0:19:53.173 --> 0:19:57.337 +Based on the websites you have, you have to +do like text extraction. + +0:19:57.597 --> 0:20:06.503 +We are typically not that much interested +in text and images in there, so we try to extract + +0:20:06.503 --> 0:20:07.083 +text. + +0:20:07.227 --> 0:20:16.919 +This is also not specific to machine translation, +but it's a more traditional way of doing web + +0:20:16.919 --> 0:20:17.939 +trolling. 
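+
+NOTE
+A minimal Python sketch of the text-extraction step described above: strip the markup and keep only the visible text.
+It assumes the BeautifulSoup library is available; the function name is illustrative only.
+from bs4 import BeautifulSoup
+def extract_text(html: str) -> str:
+    # Parse the page, drop non-content tags, keep only the visible text.
+    soup = BeautifulSoup(html, "html.parser")
+    for tag in soup(["script", "style", "noscript"]):
+        tag.decompose()
+    # Collapse whitespace so later sentence splitting is easier.
+    return " ".join(soup.get_text(separator=" ").split())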
+ +0:20:18.478 --> 0:20:22.252 +And at the end you have mirror like some other +set of document collectors. + +0:20:22.842 --> 0:20:37.025 +Is the idea, so you have the text, and often +this is a document, and so in the end. + +0:20:37.077 --> 0:20:51.523 +And that is some of your starting point now +for doing the more machine translation. + +0:20:52.672 --> 0:21:05.929 +One way of doing that now is very similar +to what you might have think about the traditional + +0:21:05.929 --> 0:21:06.641 +one. + +0:21:06.641 --> 0:21:10.633 +The first thing is to do a. + +0:21:11.071 --> 0:21:22.579 +So you have this based on the initial fact +that you know this is a German website in the + +0:21:22.579 --> 0:21:25.294 +English translation. + +0:21:25.745 --> 0:21:31.037 +And based on this document alignment, then +you can do your sentence alignment. + +0:21:31.291 --> 0:21:39.072 +And this is similar to what we had before +with the church accordion. + +0:21:39.072 --> 0:21:43.696 +This is typically more noisy peril data. + +0:21:43.623 --> 0:21:52.662 +So that you are not assuming that everything +is on both sides, that the order is the same, + +0:21:52.662 --> 0:21:56.635 +so you should do more flexible systems. + +0:21:58.678 --> 0:22:14.894 +Then it depends if the documents you were +drawing were really some type of parallel data. + +0:22:15.115 --> 0:22:35.023 +Say then you should do what is referred to +as fragmented extraction. + +0:22:36.136 --> 0:22:47.972 +One problem with these types of models is +if you are doing errors in your document alignment,. + +0:22:48.128 --> 0:22:55.860 +It means that if you are saying these two +documents are align then you can only find + +0:22:55.860 --> 0:22:58.589 +sense and if you are missing. + +0:22:59.259 --> 0:23:15.284 +Is very different, only small parts of the +document are parallel, and most parts are independent + +0:23:15.284 --> 0:23:17.762 +of each other. + +0:23:19.459 --> 0:23:31.318 +Therefore, more recently, there is also the +idea of directly doing sentence aligned so + +0:23:31.318 --> 0:23:35.271 +that you're directly taking. + +0:23:36.036 --> 0:23:41.003 +Was already one challenge of this one, the +second approach. + +0:23:42.922 --> 0:23:50.300 +Yes, so one big challenge on here, beef, then +you have to do a lot of comparison. + +0:23:50.470 --> 0:23:59.270 +You have to cook out every source, every target +set and square. + +0:23:59.270 --> 0:24:06.283 +If you think of a million or trillion pairs, +then. + +0:24:07.947 --> 0:24:12.176 +And this also gives you a reason for a last +step in both cases. + +0:24:12.176 --> 0:24:18.320 +So in both of them you have to remember you're +typically eating here in this very large data + +0:24:18.320 --> 0:24:18.650 +set. + +0:24:18.650 --> 0:24:24.530 +So all of these and also the document alignment +here they should be done very efficient. + +0:24:24.965 --> 0:24:42.090 +And if you want to do it very efficiently, +that means your quality will go lower. + +0:24:41.982 --> 0:24:47.348 +Because you just have to ever see it fast, +and then yeah you can put less computation + +0:24:47.348 --> 0:24:47.910 +on each. + +0:24:48.688 --> 0:25:06.255 +Therefore, in a lot of scenarios it makes +sense to make an additional filtering step + +0:25:06.255 --> 0:25:08.735 +at the end. + +0:25:08.828 --> 0:25:13.370 +And then we do a second filtering step where +we now can put a lot more effort. 
+ +0:25:13.433 --> 0:25:20.972 +Because now we don't have like any square +possible combinations anymore, we have already + +0:25:20.972 --> 0:25:26.054 +selected and maybe in dimension of maybe like +two or three. + +0:25:26.054 --> 0:25:29.273 +For each sentence we even don't have. + +0:25:29.429 --> 0:25:39.234 +And then we can put a lot more effort in each +individual example and build a high quality + +0:25:39.234 --> 0:25:42.611 +classic fire to really select. + +0:25:45.125 --> 0:26:00.506 +Two or one example for that, so one of the +biggest projects doing this is the so-called + +0:26:00.506 --> 0:26:03.478 +Paratrol Corpus. + +0:26:03.343 --> 0:26:11.846 +Typically it's like before the picturing so +there are a lot of challenges on how you can. + +0:26:12.272 --> 0:26:25.808 +And the steps they start to be with the seatbelt, +so what you should give at the beginning is: + +0:26:26.146 --> 0:26:36.908 +Then they do the problem, the text extraction, +the document alignment, the sentence alignment, + +0:26:36.908 --> 0:26:45.518 +and the sentence filter, and it swings down +to implementing the text store. + +0:26:46.366 --> 0:26:51.936 +We'll see later for a lot of language pairs +exist so it's easier to download them and then + +0:26:51.936 --> 0:26:52.793 +like improve. + +0:26:53.073 --> 0:27:08.270 +For example, the crawling one thing they often +do is even not throw the direct website because + +0:27:08.270 --> 0:27:10.510 +there's also. + +0:27:10.770 --> 0:27:14.540 +Black parts of the Internet that they can +work on today. + +0:27:14.854 --> 0:27:22.238 +In more detail, this is a bit shown here. + +0:27:22.238 --> 0:27:31.907 +All the steps you can see are different possibilities. + +0:27:32.072 --> 0:27:39.018 +You need a bit of knowledge to do that, or +you can build a machine translation system. + +0:27:39.239 --> 0:27:47.810 +There are two different ways of deduction +and alignment. + +0:27:47.810 --> 0:27:52.622 +You can use sentence alignment. + +0:27:53.333 --> 0:28:02.102 +And how you can do the flexigrade exam, for +example, the lexic graph, or you can chin. + +0:28:02.422 --> 0:28:05.826 +To the next step in a bit more detail. + +0:28:05.826 --> 0:28:13.680 +But before we're doing it, I need more questions +about the general overview of how these. + +0:28:22.042 --> 0:28:37.058 +Yeah, so two or three things to web-drawing, +so you normally start with the URLs. + +0:28:37.058 --> 0:28:40.903 +It's most promising. + +0:28:41.021 --> 0:28:46.674 +Found that if you're interested in German +to English, you would maybe move some data + +0:28:46.674 --> 0:28:47.073 +from. + +0:28:47.407 --> 0:28:58.739 +Companies where you know they have a German +and an English website are from agencies which + +0:28:58.739 --> 0:29:08.359 +might be: And then we can use one of these +tools to start from there using standard web + +0:29:08.359 --> 0:29:10.328 +calling techniques. + +0:29:11.071 --> 0:29:23.942 +There are several challenges when doing that, +so if you request a website too often you can: + +0:29:25.305 --> 0:29:37.819 +You have to keep in history of the sites and +you click on all the links and then click on + +0:29:37.819 --> 0:29:40.739 +all the links again. + +0:29:41.721 --> 0:29:49.432 +To be very careful about legal issues starting +from this robotics day so get allowed to use. + +0:29:49.549 --> 0:29:58.941 +Mean, that's the one major thing about what +trolley general is. + +0:29:58.941 --> 0:30:05.251 +The problem is how you deal with property. 
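+
+NOTE
+A minimal sketch of the robots.txt check mentioned above, using only Python's standard library; the user-agent string is a placeholder.
+from urllib.parse import urlparse
+from urllib.robotparser import RobotFileParser
+def allowed_to_fetch(url: str, user_agent: str = "example-crawler") -> bool:
+    # Ask the site's robots.txt whether this URL may be crawled at all.
+    parts = urlparse(url)
+    robots = RobotFileParser()
+    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
+    robots.read()
+    return robots.can_fetch(user_agent, url)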
+ +0:30:05.685 --> 0:30:13.114 +That is why it is easier sometimes to start +with some quick fold data that you don't have. + +0:30:13.893 --> 0:30:22.526 +Of course, the network issues you retry, so +there's more technical things, but there's + +0:30:22.526 --> 0:30:23.122 +good. + +0:30:24.724 --> 0:30:35.806 +Another thing which is very helpful and is +often done is instead of doing the web trolling + +0:30:35.806 --> 0:30:38.119 +yourself, relying. + +0:30:38.258 --> 0:30:44.125 +And one thing is it's common crawl from the +web. + +0:30:44.125 --> 0:30:51.190 +Think on this common crawl a lot of these +language models. + +0:30:51.351 --> 0:30:59.763 +So think in American Company or organization +which really works on like writing. + +0:31:00.000 --> 0:31:01.111 +Possible. + +0:31:01.111 --> 0:31:10.341 +So the nice thing is if you start with this +you don't have to worry about network. + +0:31:10.250 --> 0:31:16.086 +I don't think you can do that because it's +too big, but you can do a pipeline on how to + +0:31:16.086 --> 0:31:16.683 +process. + +0:31:17.537 --> 0:31:28.874 +That is, of course, a general challenge in +all this web crawling and parallel web mining. + +0:31:28.989 --> 0:31:38.266 +That means you cannot just don't know the +data and study the processes. + +0:31:39.639 --> 0:31:45.593 +Here it might make sense to directly fields +of both domains that in some way bark just + +0:31:45.593 --> 0:31:46.414 +marginally. + +0:31:49.549 --> 0:31:59.381 +Then you can do the text extraction, which +means like converging two HTML and then splitting + +0:31:59.381 --> 0:32:01.707 +things from the HTML. + +0:32:01.841 --> 0:32:04.802 +Often very important is to do the language +I need. + +0:32:05.045 --> 0:32:16.728 +It's not that clear even if it's links which +language it is, but they are quite good tools + +0:32:16.728 --> 0:32:22.891 +like that can't identify from relatively short. + +0:32:23.623 --> 0:32:36.678 +And then you are now in the situation that +you have all your danger and that you can start. + +0:32:37.157 --> 0:32:43.651 +After the text extraction you have now a collection +or a large collection of of data where it's + +0:32:43.651 --> 0:32:49.469 +like text and maybe the document at use of +some meta information and now the question + +0:32:49.469 --> 0:32:55.963 +is based on this monolingual text or multilingual +text so text in many languages but not align. + +0:32:56.036 --> 0:32:59.863 +How can you now do a generate power? + +0:33:01.461 --> 0:33:06.289 +And UM. + +0:33:05.705 --> 0:33:13.322 +So if we're not seeing it as a task or if +we want to do it in a machine learning way, + +0:33:13.322 --> 0:33:20.940 +what we have is we have a set of sentences +and a suits language, and we have a set Of + +0:33:20.940 --> 0:33:23.331 +sentences from the target. + +0:33:23.823 --> 0:33:27.814 +This is the target language. + +0:33:27.814 --> 0:33:31.392 +This is the data we have. + +0:33:31.392 --> 0:33:37.034 +We kind of directly assume any ordering. + +0:33:38.018 --> 0:33:44.502 +More documents there are not really in line +or there is maybe a graph and what we are interested + +0:33:44.502 --> 0:33:50.518 +in is finding these alignments so which senses +are aligned to each other and which senses + +0:33:50.518 --> 0:33:53.860 +we can remove but we don't have translations +for. + +0:33:53.974 --> 0:34:00.339 +But exactly this mapping is what we are interested +in and what we need to find. 
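+
+NOTE
+To make the search problem concrete, a naive sketch that scores every source/target combination with some similarity function (a placeholder here) and keeps pairs above a threshold; the quadratic number of comparisons is exactly the issue discussed next.
+def mine_pairs_bruteforce(src_sents, tgt_sents, score, threshold=0.8):
+    # |src| * |tgt| comparisons: fine for toy data, far too slow at web scale.
+    pairs = []
+    for s in src_sents:
+        for t in tgt_sents:
+            if score(s, t) >= threshold:
+                pairs.append((s, t))
+    return pairs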
+ +0:34:01.901 --> 0:34:17.910 +And if we are modeling it more from the machine +translation point of view, what can model that + +0:34:17.910 --> 0:34:21.449 +as a classification? + +0:34:21.681 --> 0:34:36.655 +And so the main challenge of this is to build +this type of classifier and you want to decide. + +0:34:42.402 --> 0:34:50.912 +However, the biggest challenge has already +pointed out in the beginning is the sites if + +0:34:50.912 --> 0:34:53.329 +we have millions target. + +0:34:53.713 --> 0:35:05.194 +The number of comparison is n square, so this +very path is very inefficient, and we need + +0:35:05.194 --> 0:35:06.355 +to find. + +0:35:07.087 --> 0:35:16.914 +And traditionally there is the first one mentioned +before the local or the hierarchical meaning + +0:35:16.914 --> 0:35:20.292 +mining and there the idea is OK. + +0:35:20.292 --> 0:35:23.465 +First we are lining documents. + +0:35:23.964 --> 0:35:32.887 +Move back the things and align them, and once +you have the alignment you only need to remind. + +0:35:33.273 --> 0:35:51.709 +That of course makes anything more efficient +because we don't have to do all the comparison. + +0:35:53.253 --> 0:35:56.411 +Then it's, for example, in the before mentioned +apparel. + +0:35:57.217 --> 0:36:11.221 +But it has the issue that if this document +is bad you have error propagation and you can + +0:36:11.221 --> 0:36:14.211 +recover from that. + +0:36:14.494 --> 0:36:20.715 +Because then document that cannot say ever, +there are some sentences which are: Therefore, + +0:36:20.715 --> 0:36:24.973 +more recently there is also was referred to +as global mining. + +0:36:26.366 --> 0:36:31.693 +And there we really do this. + +0:36:31.693 --> 0:36:43.266 +Although it's in the square, we are doing +all the comparisons. + +0:36:43.523 --> 0:36:52.588 +So the idea is that you can do represent all +the sentences in a vector space. + +0:36:52.892 --> 0:37:06.654 +And then it's about nearest neighbor search +and there is a lot of very efficient algorithms. + +0:37:07.067 --> 0:37:20.591 +Then if you only compare them to your nearest +neighbors you don't have to do like a comparison + +0:37:20.591 --> 0:37:22.584 +but you have. + +0:37:26.186 --> 0:37:40.662 +So in the first step what we want to look +at is this: This document classification refers + +0:37:40.662 --> 0:37:49.584 +to the document alignment, and then we do the +sentence alignment. + +0:37:51.111 --> 0:37:58.518 +And if we're talking about document alignment, +there's like typically two steps in that: We + +0:37:58.518 --> 0:38:01.935 +first do a candidate selection. + +0:38:01.935 --> 0:38:10.904 +Often we have several steps and that is again +to make more things more efficiently. + +0:38:10.904 --> 0:38:13.360 +We have the candidate. + +0:38:13.893 --> 0:38:18.402 +The candidate select means OK, which documents +do we want to compare? + +0:38:19.579 --> 0:38:35.364 +Then if we have initial candidates which might +be parallel, we can do a classification test. + +0:38:35.575 --> 0:38:37.240 +And there is different ways. + +0:38:37.240 --> 0:38:40.397 +We can use lexical similarity or we can use +ten basic. + +0:38:41.321 --> 0:38:48.272 +The first and easiest thing is to take off +possible candidates. + +0:38:48.272 --> 0:38:55.223 +There's one possibility, the other one, is +based on structural. + +0:38:55.235 --> 0:39:05.398 +So based on how your website looks like, you +might find that there are only translations. 
+ +0:39:05.825 --> 0:39:14.789 +This is typically the only case where we try +to do some kind of major information, which + +0:39:14.789 --> 0:39:22.342 +can be very useful because we know that websites, +for example, are linked. + +0:39:22.722 --> 0:39:35.586 +We can try to use some URL patterns, so if +we have some website which ends with the. + +0:39:35.755 --> 0:39:43.932 +So that can be easily used in order to find +candidates. + +0:39:43.932 --> 0:39:49.335 +Then we only compare websites where. + +0:39:49.669 --> 0:40:05.633 +The language and the translation of each other, +but typically you hear several heuristics to + +0:40:05.633 --> 0:40:07.178 +do that. + +0:40:07.267 --> 0:40:16.606 +Then you don't have to compare all websites, +but you only have to compare web sites. + +0:40:17.277 --> 0:40:27.607 +Cruiser problems especially with an hour day's +content management system. + +0:40:27.607 --> 0:40:32.912 +Sometimes it's nice and easy to read. + +0:40:33.193 --> 0:40:44.452 +So on the one hand there typically leads from +the parent's side to different languages. + +0:40:44.764 --> 0:40:46.632 +Now I can look at the kit websites. + +0:40:46.632 --> 0:40:49.381 +It's the same thing you can check on the difference. + +0:40:49.609 --> 0:41:06.835 +Languages: You can either do that from the +parent website or you can click on the English. + +0:41:06.926 --> 0:41:10.674 +You can therefore either like prepare to all +the websites. + +0:41:10.971 --> 0:41:18.205 +Can be even more focused and checked if the +link is somehow either flexible or the language + +0:41:18.205 --> 0:41:18.677 +name. + +0:41:19.019 --> 0:41:24.413 +So there really depends on how much you want +to filter out. + +0:41:24.413 --> 0:41:29.178 +There is always a trade-off between being +efficient. + +0:41:33.913 --> 0:41:49.963 +Based on that we then have our candidate list, +so we now have two independent sets of German + +0:41:49.963 --> 0:41:52.725 +documents, but. + +0:41:53.233 --> 0:42:03.515 +And now the task is, we want to extract these, +which are really translations of each other. + +0:42:03.823 --> 0:42:10.201 +So the question of how can we measure the +document similarity? + +0:42:10.201 --> 0:42:14.655 +Because what we then do is, we measure the. + +0:42:14.955 --> 0:42:27.096 +And here you already see why this is also +that problematic from where it's partial or + +0:42:27.096 --> 0:42:28.649 +similarly. + +0:42:30.330 --> 0:42:37.594 +All you can do that is again two folds. + +0:42:37.594 --> 0:42:48.309 +You can do it more content based or more structural +based. + +0:42:48.188 --> 0:42:53.740 +Calculating a lot of features and then maybe +training a classic pyramid small set which + +0:42:53.740 --> 0:42:57.084 +stands like based on the spesse feature is +the data. + +0:42:57.084 --> 0:42:58.661 +It is a corpus parallel. + +0:43:00.000 --> 0:43:10.955 +One way of doing that is to have traction +features, so the idea is the text length, so + +0:43:10.955 --> 0:43:12.718 +the document. + +0:43:13.213 --> 0:43:20.511 +Of course, text links will not be the same, +but if the one document has fifty words and + +0:43:20.511 --> 0:43:24.907 +the other five thousand words, it's quite realistic. + +0:43:25.305 --> 0:43:29.274 +So you can use the text length as one proxy +of. + +0:43:29.274 --> 0:43:32.334 +Is this might be a good translation? + +0:43:32.712 --> 0:43:41.316 +Now the thing is the alignment between the +structure. 
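+
+NOTE
+A sketch of the cheap candidate-selection heuristics described above: pair pages whose URLs differ only in the language code, and use the text-length ratio as a quick plausibility check.
+The "/de/" vs. "/en/" pattern is just one illustrative rule; real sites need several such rules.
+def candidates_by_url(de_urls, en_urls):
+    # Pair pages whose URLs differ only in the language code,
+    # e.g. example.org/de/kontakt and example.org/en/kontakt.
+    en_set = set(en_urls)
+    return [(u, u.replace("/de/", "/en/"))
+            for u in de_urls if u.replace("/de/", "/en/") in en_set]
+def plausible_length(src_len, tgt_len, max_ratio=2.0):
+    # Cheap structural feature: documents of very different length
+    # are unlikely to be translations of each other.
+    longer = max(src_len, tgt_len)
+    shorter = max(1, min(src_len, tgt_len))
+    return longer / shorter <= max_ratio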
+ +0:43:41.316 --> 0:43:52.151 +If you have here the website you can create +some type of structure. + +0:43:52.332 --> 0:44:04.958 +You can compare that to the French version +and then calculate some similarities because + +0:44:04.958 --> 0:44:07.971 +you see translation. + +0:44:08.969 --> 0:44:12.172 +Of course, it's getting more and more problematic. + +0:44:12.172 --> 0:44:16.318 +It does be a different structure than these +features are helpful. + +0:44:16.318 --> 0:44:22.097 +However, if you are doing it more in a trained +way, you can automatically learn how helpful + +0:44:22.097 --> 0:44:22.725 +they are. + +0:44:24.704 --> 0:44:37.516 +Then there are different ways of yeah: Content +based things: One easy thing, especially if + +0:44:37.516 --> 0:44:48.882 +you have systems that are using the same script +that you are looking for. + +0:44:48.888 --> 0:44:49.611 +The legs. + +0:44:49.611 --> 0:44:53.149 +We call them a beggar words and we'll look +into. + +0:44:53.149 --> 0:44:55.027 +You can use some type of. + +0:44:55.635 --> 0:44:58.418 +And neural embedding is also to abate him +at. + +0:45:02.742 --> 0:45:06.547 +And as then mean we have machine translation,. + +0:45:06.906 --> 0:45:14.640 +And one idea that you can also do is really +use the machine translation. + +0:45:14.874 --> 0:45:22.986 +Because this one is one which takes more effort, +so what you then have to do is put more effort. + +0:45:23.203 --> 0:45:37.526 +You wouldn't do this type of machine translation +based approach for a system which has product. + +0:45:38.018 --> 0:45:53.712 +But maybe your first of thinking why can't +do that because I'm collecting data to build + +0:45:53.712 --> 0:45:55.673 +an system. + +0:45:55.875 --> 0:46:01.628 +So you can use an initial system to translate +it, and then you can collect more data. + +0:46:01.901 --> 0:46:06.879 +And one way of doing that is, you're translating, +for example, all documents even to English. + +0:46:07.187 --> 0:46:25.789 +Then you only need two English data and you +do it in the example with three grams. + +0:46:25.825 --> 0:46:33.253 +For example, the current induction in 1 in +the Spanish, which is German induction in 1, + +0:46:33.253 --> 0:46:37.641 +which was Spanish induction in 2, which was +French. + +0:46:37.637 --> 0:46:52.225 +You're creating this index and then based +on that you can calculate how similar the documents. + +0:46:52.092 --> 0:46:58.190 +And then you can use the Cossack similarity +to really calculate which of the most similar + +0:46:58.190 --> 0:47:00.968 +document or how similar is the document. + +0:47:00.920 --> 0:47:04.615 +And then measure if this is a possible translation. + +0:47:05.285 --> 0:47:14.921 +Mean, of course, the document will not be +exactly the same, and even if you have a parallel + +0:47:14.921 --> 0:47:18.483 +document, French and German, and. + +0:47:18.898 --> 0:47:29.086 +You'll have not a perfect translation, therefore +it's looking into five front overlap since + +0:47:29.086 --> 0:47:31.522 +there should be last. + +0:47:34.074 --> 0:47:42.666 +Okay, before we take the next step and go +into the sentence alignment, there are more + +0:47:42.666 --> 0:47:44.764 +questions about the. + +0:47:51.131 --> 0:47:55.924 +Too Hot and. + +0:47:56.997 --> 0:47:59.384 +Well um. + +0:48:00.200 --> 0:48:05.751 +There is different ways of doing sentence +alignment. + +0:48:05.751 --> 0:48:12.036 +Here's one way to describe is to call the +other line again. 
+ +0:48:12.172 --> 0:48:17.590 +Of course, we have the advantage that we have +only documents, so we might have like hundred + +0:48:17.590 --> 0:48:20.299 +sentences and hundred sentences in the tower. + +0:48:20.740 --> 0:48:31.909 +Although it still might be difficult to compare +all the things in parallel, and. + +0:48:31.791 --> 0:48:37.541 +And therefore typically these even assume +that we are only interested in a line character + +0:48:37.541 --> 0:48:40.800 +that can be identified on the sum of the diagonal. + +0:48:40.800 --> 0:48:46.422 +Of course, not exactly the diagonal will sum +some parts around it, but in order to make + +0:48:46.422 --> 0:48:47.891 +things more efficient. + +0:48:48.108 --> 0:48:55.713 +You can still do it around the diagonal because +if you say this is a parallel document, we + +0:48:55.713 --> 0:48:56.800 +assume that. + +0:48:56.836 --> 0:49:05.002 +We wouldn't have passed the document alignment, +therefore we wouldn't have seen it. + +0:49:05.505 --> 0:49:06.774 +In the underline. + +0:49:06.774 --> 0:49:10.300 +Then we are calculating the similarity for +these. + +0:49:10.270 --> 0:49:17.428 +Set this here based on the bilingual dictionary, +so it may be based on how much overlap you + +0:49:17.428 --> 0:49:17.895 +have. + +0:49:18.178 --> 0:49:24.148 +And then we are finding a path through it. + +0:49:24.148 --> 0:49:31.089 +You are finding a path which the lights ever +see. + +0:49:31.271 --> 0:49:41.255 +But you're trying to find a pass through your +document so that you get these parallel. + +0:49:41.201 --> 0:49:49.418 +And then the perfect ones here would be your +pass, where you just take this other parallel. + +0:49:51.011 --> 0:50:05.206 +The advantage is that on the one end limits +your search space, then centers alignment, + +0:50:05.206 --> 0:50:07.490 +and secondly. + +0:50:07.787 --> 0:50:10.013 +So what does it mean? + +0:50:10.013 --> 0:50:19.120 +So even if you have a very high probable pair, +you're not taking them on because overall. + +0:50:19.399 --> 0:50:27.063 +So sometimes it makes sense to also use this +global information and not only compare on + +0:50:27.063 --> 0:50:34.815 +individual sentences because what you're with +your parents is that sometimes it's only a + +0:50:34.815 --> 0:50:36.383 +good translation. + +0:50:38.118 --> 0:50:51.602 +So by this minion paste you're preventing +the system to do it at the border where there's + +0:50:51.602 --> 0:50:52.201 +no. + +0:50:53.093 --> 0:50:55.689 +So that might achieve you a bit better quality. + +0:50:56.636 --> 0:51:12.044 +The pack always ends if we write the button +for everybody, but it also means you couldn't + +0:51:12.044 --> 0:51:15.126 +necessarily have. + +0:51:15.375 --> 0:51:24.958 +Have some restrictions that is right, so first +of all they can't be translated out. + +0:51:25.285 --> 0:51:32.572 +So the handle line typically only really works +well if you have a relatively high quality. + +0:51:32.752 --> 0:51:39.038 +So if you have this more general data where +there's like some parts are translated and + +0:51:39.038 --> 0:51:39.471 +some. + +0:51:39.719 --> 0:51:43.604 +It doesn't really work, so it might. + +0:51:43.604 --> 0:51:53.157 +It's okay with having maybe at the end some +sentences which are missing, but in generally. + +0:51:53.453 --> 0:51:59.942 +So it's not robust against significant noise +on the. + +0:52:05.765 --> 0:52:12.584 +The second thing is is to what is referred +to as blue alibi. 
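+
+NOTE
+A minimal sketch of the first, dictionary-based approach just described: find the best monotone path through the sentence-by-sentence similarity matrix.
+Here sim is assumed to hold e.g. dictionary-overlap scores; real tools additionally restrict the search to a band around the diagonal.
+def align_monotone(sim):
+    # sim[i][j]: similarity of source sentence i and target sentence j.
+    n, m = len(sim), len(sim[0])
+    best = [[0.0] * (m + 1) for _ in range(n + 1)]
+    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            # Either link sentences i-1 and j-1, or skip one side.
+            options = [(best[i - 1][j - 1] + sim[i - 1][j - 1], (i - 1, j - 1)),
+                       (best[i - 1][j], (i - 1, j)),
+                       (best[i][j - 1], (i, j - 1))]
+            best[i][j], back[i][j] = max(options)
+    links, i, j = [], n, m
+    while i > 0 and j > 0:
+        pi, pj = back[i][j]
+        if (pi, pj) == (i - 1, j - 1):
+            links.append((i - 1, j - 1))
+        i, j = pi, pj
+    return list(reversed(links))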
+ +0:52:13.233 --> 0:52:16.982 +And this doesn't does, does not do us much. + +0:52:16.977 --> 0:52:30.220 +A global information you can translate each +sentence to English, and then you calculate + +0:52:30.220 --> 0:52:34.885 +the voice for the translation. + +0:52:35.095 --> 0:52:41.888 +And that you would get six answer points, +which are the ones in a purple ear. + +0:52:42.062 --> 0:52:56.459 +And then you have the ability to add some +points around it, which might be a bit lower. + +0:52:56.756 --> 0:53:06.962 +But here in this case you are able to deal +with reorderings, angles to deal with parts. + +0:53:07.247 --> 0:53:16.925 +Therefore, in this case we need a full scale +and key system to do this calculation while + +0:53:16.925 --> 0:53:17.686 +we're. + +0:53:18.318 --> 0:53:26.637 +Then, of course, the better your similarity +metric is, so the better you are able to do + +0:53:26.637 --> 0:53:35.429 +this comparison, the less you have to rely +on structural information that, in one sentence,. + +0:53:39.319 --> 0:53:53.411 +Anymore questions, and then there are things +like back in line which try to do the same. + +0:53:53.793 --> 0:53:59.913 +That means the idea is that you expect each +sentence. + +0:53:59.819 --> 0:54:02.246 +In a crossing will vector space. + +0:54:02.246 --> 0:54:08.128 +Crossing will vector space always means that +you have a vector or knight means. + +0:54:08.128 --> 0:54:14.598 +In this case you have a vector space where +sentences in different languages are near to + +0:54:14.598 --> 0:54:16.069 +each other if they. + +0:54:16.316 --> 0:54:23.750 +So you can have it again and so on, but just +next to each other and want to call you. + +0:54:24.104 --> 0:54:32.009 +And then you can of course measure now the +similarity by some distance matrix in this + +0:54:32.009 --> 0:54:32.744 +vector. + +0:54:33.033 --> 0:54:36.290 +And you're saying towards two senses are lying. + +0:54:36.290 --> 0:54:39.547 +If the distance in the vector space is somehow. + +0:54:40.240 --> 0:54:50.702 +We'll discuss that in a bit more heat soon +because these vector spades and bathings are + +0:54:50.702 --> 0:54:52.010 +even then. + +0:54:52.392 --> 0:54:55.861 +So the nice thing is with this. + +0:54:55.861 --> 0:55:05.508 +It's really good and good to get quite good +quality and can decide whether two sentences + +0:55:05.508 --> 0:55:08.977 +are translations of each other. + +0:55:08.888 --> 0:55:14.023 +In the fact-lined approach, but often they +even work on a global search way to really + +0:55:14.023 --> 0:55:15.575 +compare on everything to. + +0:55:16.236 --> 0:55:29.415 +What weak alignment also does is trying to +do to make this more efficient in finding the. + +0:55:29.309 --> 0:55:40.563 +If you don't want to compare everything to +everything, you first need sentence blocks, + +0:55:40.563 --> 0:55:41.210 +and. + +0:55:41.141 --> 0:55:42.363 +Then find him fast. + +0:55:42.562 --> 0:55:55.053 +You always have full sentence resolution, +but then you always compare on the area around. + +0:55:55.475 --> 0:56:11.501 +So if you do compare blocks on the source +of the target, then you have of your possibilities. + +0:56:11.611 --> 0:56:17.262 +So here the end times and comparison is a +lot less than the comparison you have here. + +0:56:17.777 --> 0:56:23.750 +And with neural embeddings you can also embed +not only single sentences and whole blocks. + +0:56:24.224 --> 0:56:28.073 +So how you make this in fast? 
+ +0:56:28.073 --> 0:56:35.643 +You're starting from a coarse grain resolution +here where. + +0:56:36.176 --> 0:56:47.922 +Then you're getting a double pass where they +could be good and near this pass you're doing + +0:56:47.922 --> 0:56:49.858 +more and more. + +0:56:52.993 --> 0:56:54.601 +And yeah, what's the? + +0:56:54.601 --> 0:56:56.647 +This is the white egg lift. + +0:56:56.647 --> 0:56:59.352 +These are the sewers and the target. + +0:57:00.100 --> 0:57:16.163 +While it was sleeping in the forests and things, +I thought it was very strange to see this man. + +0:57:16.536 --> 0:57:25.197 +So you have the sentences, but if you do blocks +you have blocks that are in. + +0:57:30.810 --> 0:57:38.514 +This is the thing about the pipeline approach. + +0:57:38.514 --> 0:57:46.710 +We want to look at the global mining, but +before. + +0:57:53.633 --> 0:58:07.389 +In the global mining thing we have to also +do some filtering and so typically in the things + +0:58:07.389 --> 0:58:10.379 +they do they start. + +0:58:10.290 --> 0:58:14.256 +And then they are doing some pretty processing. + +0:58:14.254 --> 0:58:17.706 +So you try to at first to de-defecate paragraphs. + +0:58:17.797 --> 0:58:30.622 +So, of course, if you compare everything with +everything in two times the same input example, + +0:58:30.622 --> 0:58:35.748 +you will also: The hard thing is that you first +keep duplicating. + +0:58:35.748 --> 0:58:37.385 +You have each paragraph only one. + +0:58:37.958 --> 0:58:42.079 +There's a lot of text which occurs a lot of +times. + +0:58:42.079 --> 0:58:44.585 +They will happen all the time. + +0:58:44.884 --> 0:58:57.830 +There are pages about the cookie thing you +see and about accepting things. + +0:58:58.038 --> 0:59:04.963 +So you can already be duplicated here, or +your problem has crossed the website twice, + +0:59:04.963 --> 0:59:05.365 +and. + +0:59:06.066 --> 0:59:11.291 +Then you can remove low quality data like +cooking warnings that have biolabites start. + +0:59:12.012 --> 0:59:13.388 +Hey! + +0:59:13.173 --> 0:59:19.830 +So let you have maybe some other sentence, +and then you're doing a language idea. + +0:59:19.830 --> 0:59:29.936 +That means you want to have a text, which +is: You want to know for each sentence a paragraph + +0:59:29.936 --> 0:59:38.695 +which language it has so that you then, of +course, if you want. + +0:59:39.259 --> 0:59:44.987 +Finally, there is some complexity based film +screenings to believe, for example, for very + +0:59:44.987 --> 0:59:46.069 +high complexity. + +0:59:46.326 --> 0:59:59.718 +That means, for example, data where there's +a lot of crazy names which are growing. + +1:00:00.520 --> 1:00:09.164 +Sometimes it also improves very high perplexity +data because that is then unmanned generated + +1:00:09.164 --> 1:00:09.722 +data. + +1:00:11.511 --> 1:00:17.632 +And then the model which is mostly used for +that is what is called a laser model. + +1:00:18.178 --> 1:00:21.920 +It's based on machine translation. + +1:00:21.920 --> 1:00:28.442 +Hope it all recognizes the machine translation +architecture. + +1:00:28.442 --> 1:00:37.103 +However, there is a difference between a general +machine translation system and. + +1:01:00.000 --> 1:01:13.322 +Machine translation system, so it's messy. + +1:01:14.314 --> 1:01:24.767 +See one bigger difference, which is great +if I'm excluding that object or the other. + +1:01:25.405 --> 1:01:39.768 +There is one difference to the other, one +with attention, so we are having. 
+ +1:01:40.160 --> 1:01:43.642 +And then we are using that here in there each +time set up. + +1:01:44.004 --> 1:01:54.295 +Mean, therefore, it's maybe a bit similar +to original anti-system without attention. + +1:01:54.295 --> 1:01:56.717 +It's quite similar. + +1:01:57.597 --> 1:02:10.011 +However, it has this disadvantage saying that +we have to put everything in one sentence and + +1:02:10.011 --> 1:02:14.329 +that maybe not all information. + +1:02:15.055 --> 1:02:25.567 +However, now in this type of framework we +are not really interested in machine translation, + +1:02:25.567 --> 1:02:27.281 +so this model. + +1:02:27.527 --> 1:02:34.264 +So we are training it to do machine translation. + +1:02:34.264 --> 1:02:42.239 +What that means in the end should be as much +information. + +1:02:43.883 --> 1:03:01.977 +Only all the information in here is able to +really well do the machine translation. + +1:03:02.642 --> 1:03:07.801 +So that is the first step, so we are doing +here. + +1:03:07.801 --> 1:03:17.067 +We are building the MT system, not with the +goal of making the best MT system, but with + +1:03:17.067 --> 1:03:22.647 +learning and sentences, and hopefully all important. + +1:03:22.882 --> 1:03:26.116 +Because otherwise we won't be able to generate +the translation. + +1:03:26.906 --> 1:03:31.287 +So it's a bit more on the bottom neck like +to try to put as much information. + +1:03:32.012 --> 1:03:36.426 +And if you think if you want to do later finding +the bear's neighbor or something like. + +1:03:37.257 --> 1:03:48.680 +So finding similarities is typically possible +with fixed dimensional things, so we can do + +1:03:48.680 --> 1:03:56.803 +that in an end dimensional space and find the +nearest neighbor. + +1:03:57.857 --> 1:03:59.837 +Yeah, it would be very difficult. + +1:04:00.300 --> 1:04:03.865 +There's one thing that we also do. + +1:04:03.865 --> 1:04:09.671 +We don't want to find the nearest neighbor +in the other. + +1:04:10.570 --> 1:04:13.424 +Do you have an idea how we can train them? + +1:04:13.424 --> 1:04:16.542 +This is a set that embeddings can be compared. + +1:04:23.984 --> 1:04:36.829 +Any idea do you think about two lectures, +a three lecture stack, one that did gave. + +1:04:41.301 --> 1:04:50.562 +We can train them on a multilingual setting +and that's how it's done in lasers so we're + +1:04:50.562 --> 1:04:56.982 +not doing it only from German to English but +we're training. + +1:04:57.017 --> 1:05:04.898 +Mean, if the English one has to be useful +for German, French and so on, and for German + +1:05:04.898 --> 1:05:13.233 +also, the German and the English and so have +to be useful, then somehow we'll automatically + +1:05:13.233 --> 1:05:16.947 +learn that these embattes are popularly. + +1:05:17.437 --> 1:05:28.562 +And then we can use an exact as we will plan +to have a similar sentence embedding. + +1:05:28.908 --> 1:05:39.734 +If you put in here a German and a French one +and always generate as they both have the same + +1:05:39.734 --> 1:05:48.826 +translations, you give these sentences: And +you should do exactly the same thing, so that's + +1:05:48.826 --> 1:05:50.649 +of course the easiest. + +1:05:51.151 --> 1:05:59.817 +If the sentence is very different then most +people will also hear the English decoder and + +1:05:59.817 --> 1:06:00.877 +therefore. + +1:06:02.422 --> 1:06:04.784 +So that is the first thing. + +1:06:04.784 --> 1:06:06.640 +Now we have this one. + +1:06:06.640 --> 1:06:10.014 +We have to be trained on parallel data. 
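+
+NOTE
+A rough PyTorch sketch of the idea behind such an encoder: the whole sentence is squeezed into one fixed-size vector (here max-pooling over a bidirectional LSTM) that the decoder must translate from.
+Layer sizes are illustrative, not the actual LASER configuration.
+import torch.nn as nn
+class PoolingEncoder(nn.Module):
+    # The whole sentence ends up in one fixed-size vector, so sentences
+    # from different languages can later be compared in the same space.
+    def __init__(self, vocab_size, emb_dim=320, hidden_dim=512):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, emb_dim)
+        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
+                            bidirectional=True)
+    def forward(self, token_ids):                  # (batch, seq_len)
+        states, _ = self.lstm(self.embed(token_ids))
+        sentence_embedding, _ = states.max(dim=1)  # (batch, 2 * hidden_dim)
+        return sentence_embedding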
+ +1:06:10.390 --> 1:06:22.705 +Then we can use these embeddings on our new +data and try to use them to make efficient + +1:06:22.705 --> 1:06:24.545 +comparisons. + +1:06:26.286 --> 1:06:30.669 +So how can you do comparison? + +1:06:30.669 --> 1:06:37.243 +Maybe the first thing you think of is to do. + +1:06:37.277 --> 1:06:44.365 +So you take all the German sentences, all +the French sentences. + +1:06:44.365 --> 1:06:49.460 +We compute the Cousin's simple limit between. + +1:06:49.469 --> 1:06:58.989 +And then you take all pairs where the similarity +is very high. + +1:07:00.180 --> 1:07:17.242 +So you have your French list, you have them, +and then you just take all sentences. + +1:07:19.839 --> 1:07:29.800 +It's an additional power method that we have, +but we have a lot of data who will find a point. + +1:07:29.800 --> 1:07:32.317 +It's a good point, but. + +1:07:35.595 --> 1:07:45.738 +It's also not that easy, so one problem is +that typically there are some sentences where. + +1:07:46.066 --> 1:07:48.991 +And other points where there is very few points +in the neighborhood. + +1:07:49.629 --> 1:08:06.241 +And then for things where a lot of things +are enabled you might extract not for one percent + +1:08:06.241 --> 1:08:08.408 +to do that. + +1:08:08.868 --> 1:08:18.341 +So what typically is happening is you do the +max merchant? + +1:08:18.341 --> 1:08:25.085 +How good is a pair compared to the other? + +1:08:25.305 --> 1:08:33.859 +So you take the similarity between X and Y, +and then you look at one of the eight nearest + +1:08:33.859 --> 1:08:35.190 +neighbors of. + +1:08:35.115 --> 1:08:48.461 +Of x and what are the eight nearest neighbors +of y, and the dividing of the similarity through + +1:08:48.461 --> 1:08:51.411 +the eight neighbors. + +1:08:51.671 --> 1:09:00.333 +So what you may be looking at are these two +sentences a lot more similar than all the other. + +1:09:00.840 --> 1:09:13.455 +And if these are exceptional and similar compared +to other sentences then they should be translations. + +1:09:16.536 --> 1:09:19.158 +Of course, that has also some. + +1:09:19.158 --> 1:09:24.148 +Then the good thing is there's a lot of similar +sentences. + +1:09:24.584 --> 1:09:30.641 +If there is a lot of similar sensations in +white then these are also very similar and + +1:09:30.641 --> 1:09:32.824 +you are doing more comparison. + +1:09:32.824 --> 1:09:36.626 +If all the arrows are far away then the translations. + +1:09:37.057 --> 1:09:40.895 +So think about this like short sentences. + +1:09:40.895 --> 1:09:47.658 +They might be that most things are similar, +but they are just in general. + +1:09:49.129 --> 1:09:59.220 +There are some problems that now we assume +there is only one pair of translations. + +1:09:59.759 --> 1:10:09.844 +So it has some problems in their two or three +ballad translations of that. + +1:10:09.844 --> 1:10:18.853 +Then, of course, this pair might not find +it, but in general this. + +1:10:19.139 --> 1:10:27.397 +For example, they have like all of these common +trawl. + +1:10:27.397 --> 1:10:32.802 +They have large parallel data sets. + +1:10:36.376 --> 1:10:38.557 +One point maybe also year. + +1:10:38.557 --> 1:10:45.586 +Of course, now it's important that we have +done the deduplication before because if we + +1:10:45.586 --> 1:10:52.453 +wouldn't have the deduplication, we would have +points which are the same coordinate. + +1:10:57.677 --> 1:11:03.109 +Maybe only one small things to that mean. 
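+
+NOTE
+A small NumPy sketch of the margin scoring just described: the cosine similarity of a candidate pair is divided by the average similarity of each side to its k nearest neighbours, so only pairs that are exceptionally similar relative to their neighbourhood get a high score.
+import numpy as np
+def cosine(a, b):
+    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
+def margin_score(x, y, neighbours_x, neighbours_y):
+    # neighbours_x / neighbours_y: embeddings of the k nearest neighbours
+    # of x in the target space and of y in the source space.
+    k = len(neighbours_x)
+    background = (sum(cosine(x, z) for z in neighbours_x)
+                  + sum(cosine(y, z) for z in neighbours_y)) / (2 * k)
+    # A candidate pair is then kept only if this score exceeds a tuned threshold.
+    return cosine(x, y) / background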
+ +1:11:03.109 --> 1:11:09.058 +A major issue in this case is still making +a. + +1:11:09.409 --> 1:11:18.056 +So you have to still do all of this comparison, +and that cannot be done just by simple. + +1:11:19.199 --> 1:11:27.322 +So what is done typically express the word, +you know things can be done in parallel. + +1:11:28.368 --> 1:11:36.024 +So calculating the embeddings and all that +stuff doesn't need to be sequential, but it's + +1:11:36.024 --> 1:11:37.143 +independent. + +1:11:37.357 --> 1:11:48.680 +What you typically do is create an event and +then you do some kind of projectization. + +1:11:48.708 --> 1:11:57.047 +So there is this space library which does +key nearest neighbor search very efficient + +1:11:57.047 --> 1:11:59.597 +in very high-dimensional. + +1:12:00.080 --> 1:12:03.410 +And then based on that you can now do comparison. + +1:12:03.410 --> 1:12:06.873 +You can even do the comparison in parallel +because. + +1:12:06.906 --> 1:12:13.973 +Can look at different areas of your space +and then compare the different pieces to find + +1:12:13.973 --> 1:12:14.374 +the. + +1:12:15.875 --> 1:12:30.790 +With this you are then able to do very fast +calculations on this type of sentence. + +1:12:31.451 --> 1:12:34.761 +So yeah this is currently one. + +1:12:35.155 --> 1:12:48.781 +Mean, those of them are covered with this, +so there's a parade. + +1:12:48.668 --> 1:12:55.543 +We are collected by that and most of them +are in a very big corporate for languages which + +1:12:55.543 --> 1:12:57.453 +you can hardly stand on. + +1:12:58.778 --> 1:13:01.016 +Do you have any more questions on this? + +1:13:05.625 --> 1:13:17.306 +And then some more words to this last set +here: So we have now done our pearl marker + +1:13:17.306 --> 1:13:25.165 +and we could assume that everything is fine +now. + +1:13:25.465 --> 1:13:35.238 +However, the problem with this noisy data +is that typically this is quite noisy still, + +1:13:35.238 --> 1:13:35.687 +so. + +1:13:36.176 --> 1:13:44.533 +In order to make things efficient to have +a high recall, the final data is often not + +1:13:44.533 --> 1:13:49.547 +of the best quality, not the same type of quality. + +1:13:49.789 --> 1:13:58.870 +So it is essential to do another figuring +step and to remove senses which might seem + +1:13:58.870 --> 1:14:01.007 +to be translations. + +1:14:01.341 --> 1:14:08.873 +And here, of course, the final evaluation +matrix would be how much do my system improve? + +1:14:09.089 --> 1:14:23.476 +And there are even challenges on doing that +so: people getting this noisy data like symmetrics + +1:14:23.476 --> 1:14:25.596 +or something. + +1:14:27.707 --> 1:14:34.247 +However, all these steps is of course very +time consuming, so you might not always want + +1:14:34.247 --> 1:14:37.071 +to do the full pipeline and training. + +1:14:37.757 --> 1:14:51.614 +So how can you model that we want to get this +best and normally what we always want? + +1:14:51.871 --> 1:15:02.781 +You also want to have the best over translation +quality, but this is also normally not achieved + +1:15:02.781 --> 1:15:03.917 +with all. + +1:15:04.444 --> 1:15:12.389 +And that's why you're doing this two-step +approach first of the second alignment. + +1:15:12.612 --> 1:15:27.171 +And after once you do the sentence filtering, +we can put a lot more alphabet in all the comparisons. 
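+
+NOTE
+A sketch of the nearest-neighbour search step, assuming the library referred to above is FAISS and that the sentence embeddings are already computed: with L2-normalised vectors, inner-product search returns cosine similarities, which feed directly into the margin scoring.
+import numpy as np
+import faiss
+def k_nearest_targets(src_emb, tgt_emb, k=8):
+    # Exact inner-product search over float32 vectors; returns, for every
+    # source sentence, the similarities and indices of the k best targets.
+    src = np.ascontiguousarray(src_emb, dtype="float32")
+    tgt = np.ascontiguousarray(tgt_emb, dtype="float32")
+    faiss.normalize_L2(src)
+    faiss.normalize_L2(tgt)
+    index = faiss.IndexFlatIP(tgt.shape[1])
+    index.add(tgt)
+    sims, ids = index.search(src, k)
+    return sims, ids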
+ +1:15:27.627 --> 1:15:37.472 +For example, you can just translate the source +and compare that translation with the original + +1:15:37.472 --> 1:15:40.404 +one and calculate how good. + +1:15:40.860 --> 1:15:49.467 +And this, of course, you can do with the filing +set, but you can't do with your initial set + +1:15:49.467 --> 1:15:50.684 +of millions. + +1:15:54.114 --> 1:16:01.700 +So what it is again is the ancient test where +you input as a sentence pair as here, and then + +1:16:01.700 --> 1:16:09.532 +once you have a biometria, these are sentence +pairs with a high quality, and these are sentence + +1:16:09.532 --> 1:16:11.653 +pairs avec a low quality. + +1:16:12.692 --> 1:16:17.552 +Does anybody see what might be a challenge +if you want to train this type of classifier? + +1:16:22.822 --> 1:16:24.264 +How do you measure exactly? + +1:16:24.264 --> 1:16:26.477 +The quality is probably about the problem. + +1:16:27.887 --> 1:16:39.195 +Yes, that is one, that is true, there is even +more, more simple one, and high quality data + +1:16:39.195 --> 1:16:42.426 +here is not so difficult. + +1:16:43.303 --> 1:16:46.844 +Globally, yeah, probably we have a class in +balance. + +1:16:46.844 --> 1:16:49.785 +We don't see many bad quality combinations. + +1:16:49.785 --> 1:16:54.395 +It's hard to get there at the beginning, so +maybe how can you argue? + +1:16:54.395 --> 1:16:58.405 +Where do you find bad quality and what type +of bad quality? + +1:16:58.798 --> 1:17:05.122 +Because if it's too easy, you just take a +random germ and the random innocence that is + +1:17:05.122 --> 1:17:05.558 +very. + +1:17:05.765 --> 1:17:15.747 +But what you're interested is like bad quality +data, which still passes your first initial + +1:17:15.747 --> 1:17:16.405 +step. + +1:17:17.257 --> 1:17:28.824 +What you can use for that is you can use any +type of network or model that in the beginning, + +1:17:28.824 --> 1:17:33.177 +like in random forests, would see. + +1:17:33.613 --> 1:17:38.912 +So the positive examples are quite easy to +get. + +1:17:38.912 --> 1:17:44.543 +You just take parallel data and high quality +data. + +1:17:44.543 --> 1:17:45.095 +You. + +1:17:45.425 --> 1:17:47.565 +That is quite easy. + +1:17:47.565 --> 1:17:55.482 +You normally don't need a lot of data, then +to train in a few validation. + +1:17:57.397 --> 1:18:12.799 +The challenge is like the negative samples +because how would you generate negative samples? + +1:18:13.133 --> 1:18:17.909 +Because the negative examples are the ones +which ask the first step but don't ask the + +1:18:17.909 --> 1:18:18.353 +second. + +1:18:18.838 --> 1:18:23.682 +So how do you typically do it? + +1:18:23.682 --> 1:18:28.994 +You try to do synthetic examples. + +1:18:28.994 --> 1:18:33.369 +You can do random examples. + +1:18:33.493 --> 1:18:45.228 +But this is the typical error that you want +to detect when you do frequency based replacements. + +1:18:45.228 --> 1:18:52.074 +But this is one major issue when you generate +the data. + +1:18:52.132 --> 1:19:02.145 +That doesn't match well with what are the +real arrows that you're interested in. + +1:19:02.702 --> 1:19:13.177 +Is some of the most challenging here to find +the negative samples, which are hard enough + +1:19:13.177 --> 1:19:14.472 +to detect. + +1:19:17.537 --> 1:19:21.863 +And the other thing, which is difficult, is +of course the data ratio. + +1:19:22.262 --> 1:19:24.212 +Why is it important any? 
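A minimal, illustrative sketch of generating synthetic negatives for the filter classifier as just described: random mispairings plus light corruptions of real pairs that are harder to spot. The corruption operations and the neg_ratio parameter are assumptions, not the lecture's recipe; the ratio question raised here is taken up right below.

import random

def make_negatives(pairs, neg_ratio=1.0, seed=0):
    rng = random.Random(seed)
    negatives = []
    for _ in range(int(len(pairs) * neg_ratio)):
        src, tgt = rng.choice(pairs)
        kind = rng.random()
        if kind < 0.34:                          # random mispairing: an easy negative
            _, tgt = rng.choice(pairs)
        elif kind < 0.67:                        # drop a span: looks like a partial translation
            words = tgt.split()
            if len(words) > 3:
                cut = rng.randrange(1, len(words) // 2 + 1)
                start = rng.randrange(0, len(words) - cut)
                tgt = " ".join(words[:start] + words[start + cut:])
        else:                                    # copy the source: frequent crawl noise
            tgt = src
        negatives.append((src, tgt))
    return negatives

positives = [("Das ist ein Test.", "This is a test."),
             ("Guten Morgen!", "Good morning!")]
train_set = [(s, t, 1) for s, t in positives] + \
            [(s, t, 0) for s, t in make_negatives(positives, neg_ratio=1.0)]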
+ +1:19:24.212 --> 1:19:29.827 +Why is the ratio between positive and negative +examples here important? + +1:19:30.510 --> 1:19:40.007 +Because in a case of plus imbalance we effectively +could learn to just that it's positive and + +1:19:40.007 --> 1:19:43.644 +high quality and we would be right. + +1:19:44.844 --> 1:19:46.654 +Yes, so I'm training. + +1:19:46.654 --> 1:19:51.180 +This is important, but otherwise it might +be too easy. + +1:19:51.180 --> 1:19:52.414 +You always do. + +1:19:52.732 --> 1:19:58.043 +And on the other head, of course, navy and +deputy, it's also important because if we have + +1:19:58.043 --> 1:20:03.176 +equal things, we're also assuming that this +might be the other one, and if the quality + +1:20:03.176 --> 1:20:06.245 +is worse or higher, we might also accept too +fewer. + +1:20:06.626 --> 1:20:10.486 +So this ratio is not easy to determine. + +1:20:13.133 --> 1:20:16.969 +What type of features can we use? + +1:20:16.969 --> 1:20:23.175 +Traditionally, we're also looking at word +translation. + +1:20:23.723 --> 1:20:37.592 +And nowadays, of course, we can model this +also with something like similar, so this is + +1:20:37.592 --> 1:20:38.696 +again. + +1:20:40.200 --> 1:20:42.306 +Language follow. + +1:20:42.462 --> 1:20:49.763 +So we can, for example, put the sentence in +there for the source and the target, and then + +1:20:49.763 --> 1:20:56.497 +based on this classification label we can classify +as this a parallel sentence or. + +1:20:56.476 --> 1:21:00.054 +So it's more like a normal classification +task. + +1:21:00.160 --> 1:21:09.233 +And by having a system which can have much +enable input, we can just put in two R. + +1:21:09.233 --> 1:21:16.886 +We can also put in two independent of each +other based on the hidden. + +1:21:17.657 --> 1:21:35.440 +You can, as you do any other type of classifier, +you can train them on top of. + +1:21:35.895 --> 1:21:42.801 +This so it tries to represent the full sentence +and that's what you also want to do on. + +1:21:43.103 --> 1:21:45.043 +The Other Thing What They Can't Do Is, of +Course. + +1:21:45.265 --> 1:21:46.881 +You can make here. + +1:21:46.881 --> 1:21:52.837 +You can do your summation of all the hidden +statements that you said. + +1:21:58.698 --> 1:22:10.618 +Okay, and then one thing which we skipped +until now, and that is only briefly this fragment. + +1:22:10.630 --> 1:22:19.517 +So if we have sentences which are not really +parallel, can we also extract information from + +1:22:19.517 --> 1:22:20.096 +them? + +1:22:22.002 --> 1:22:25.627 +And so what here the test is? + +1:22:25.627 --> 1:22:33.603 +We have a sentence and we want to find within +or a sentence pair. + +1:22:33.603 --> 1:22:38.679 +We want to find within the sentence pair. + +1:22:39.799 --> 1:22:46.577 +And how that, for example, has been done is +using a lexical positive and negative association. + +1:22:47.187 --> 1:22:57.182 +And then you can transform your target sentence +into a signal and find a thing where you have. + +1:22:57.757 --> 1:23:00.317 +So I'm Going to Get a Clear Eye. + +1:23:00.480 --> 1:23:15.788 +So you hear the English sentence, the other +language, and you have an alignment between + +1:23:15.788 --> 1:23:18.572 +them, and then. + +1:23:18.818 --> 1:23:21.925 +This is not a light cell from a negative signal. + +1:23:22.322 --> 1:23:40.023 +And then you drink some sauce on there because +you want to have an area where there's. 
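A sketch of the classifier idea mentioned above: feed source and target jointly into a pretrained multilingual encoder and predict "parallel / not parallel". Using XLM-R via the Hugging Face transformers library is my assumption; the lecture does not name a specific model. (The fragment extraction started above continues right below.)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# source and target are encoded as one sequence pair; after fine-tuning on the positive
# pairs and the synthetic negatives from the previous sketch, the logits give the filter decision
batch = tokenizer(["Das ist ein Test."], ["This is a test."],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs[:, 1])   # probability of being a parallel pair (meaningful only after fine-tuning)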
+ +1:23:40.100 --> 1:23:51.728 +It doesn't matter if you have simple arrows +here by smooth saying you can't extract. + +1:23:51.972 --> 1:23:58.813 +So you try to find long segments here where +at least most of the words are somehow aligned. + +1:24:00.040 --> 1:24:10.069 +And then you take this one in the side and +extract that one as your parallel fragment, + +1:24:10.069 --> 1:24:10.645 +and. + +1:24:10.630 --> 1:24:21.276 +So in the end you not only have full sentences +but you also have partial sentences which might + +1:24:21.276 --> 1:24:27.439 +be helpful for especially if you have quite +low upset. + +1:24:32.332 --> 1:24:36.388 +That's everything work for today. + +1:24:36.388 --> 1:24:44.023 +What you hopefully remember is the thing about +how the general. + +1:24:44.184 --> 1:24:54.506 +We talked about how we can do the document +alignment and then we can do the sentence alignment, + +1:24:54.506 --> 1:24:57.625 +which can be done after the. + +1:24:59.339 --> 1:25:12.611 +Any more questions think on Thursday we had +to do a switch, so on Thursday there will be + +1:25:12.611 --> 1:25:15.444 +a practical thing. + +0:00:01.921 --> 0:00:16.424 +Hey welcome to today's lecture, what we today +want to look at is how we can make new. + +0:00:16.796 --> 0:00:26.458 +So until now we have this global system, the +encoder and the decoder mostly, and we haven't + +0:00:26.458 --> 0:00:29.714 +really thought about how long. + +0:00:30.170 --> 0:00:42.684 +And what we, for example, know is yeah, you +can make the systems bigger in different ways. + +0:00:42.684 --> 0:00:47.084 +We can make them deeper so the. + +0:00:47.407 --> 0:00:56.331 +And if we have at least enough data that typically +helps you make things performance better,. + +0:00:56.576 --> 0:01:00.620 +But of course leads to problems that we need +more resources. + +0:01:00.620 --> 0:01:06.587 +That is a problem at universities where we +have typically limited computation capacities. + +0:01:06.587 --> 0:01:11.757 +So at some point you have such big models +that you cannot train them anymore. + +0:01:13.033 --> 0:01:23.792 +And also for companies is of course important +if it costs you like to generate translation + +0:01:23.792 --> 0:01:26.984 +just by power consumption. + +0:01:27.667 --> 0:01:35.386 +So yeah, there's different reasons why you +want to do efficient machine translation. + +0:01:36.436 --> 0:01:48.338 +One reason is there are different ways of +how you can improve your machine translation + +0:01:48.338 --> 0:01:50.527 +system once we. + +0:01:50.670 --> 0:01:55.694 +There can be different types of data we looked +into data crawling, monolingual data. + +0:01:55.875 --> 0:01:59.024 +All this data and the aim is always. + +0:01:59.099 --> 0:02:06.067 +Of course, we are not just purely interested +in having more data, but the idea why we want + +0:02:06.067 --> 0:02:12.959 +to have more data is that more data also means +that we have better quality because mostly + +0:02:12.959 --> 0:02:17.554 +we are interested in increasing the quality +of the machine. + +0:02:18.838 --> 0:02:24.892 +But there's also other ways of how you can +improve the quality of a machine translation. + +0:02:25.325 --> 0:02:36.450 +And what is, of course, that is where most +research is focusing on. + +0:02:36.450 --> 0:02:44.467 +It means all we want to build better algorithms. + +0:02:44.684 --> 0:02:48.199 +Course: The other things are normally as good. 
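Going back to the parallel-fragment extraction summarised just before the wrap-up above: a minimal sketch. Aligned target words get +1, unaligned ones -1, the signal is smoothed, and the longest positive stretch is kept as a fragment. The window size, threshold and example data are illustrative assumptions.

import numpy as np

def extract_fragment(tgt_words, aligned_positions, window=3):
    signal = np.array([1.0 if i in aligned_positions else -1.0
                       for i in range(len(tgt_words))])
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal, kernel, mode="same")    # single unaligned words no longer break a span
    best, cur_start, best_span = 0, None, None
    for i, v in enumerate(np.append(smoothed, -1.0)):      # sentinel value closes the last span
        if v > 0 and cur_start is None:
            cur_start = i
        elif v <= 0 and cur_start is not None:
            if i - cur_start > best:
                best, best_span = i - cur_start, (cur_start, i)
            cur_start = None
    return tgt_words[best_span[0]:best_span[1]] if best_span else []

words = "der Vertrag wurde gestern unterzeichnet und vieles mehr".split()
print(extract_fragment(words, aligned_positions={0, 1, 2, 4}))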
+ +0:02:48.199 --> 0:02:54.631 +Sometimes it's easier to improve, so often +it's easier to just collect more data than + +0:02:54.631 --> 0:02:57.473 +to invent some great view algorithms. + +0:02:57.473 --> 0:03:00.315 +But yeah, both of them are important. + +0:03:00.920 --> 0:03:09.812 +But there is this third thing, especially +with neural machine translation, and that means + +0:03:09.812 --> 0:03:11.590 +we make a bigger. + +0:03:11.751 --> 0:03:16.510 +Can be, as said, that we have more layers, +that we have wider layers. + +0:03:16.510 --> 0:03:19.977 +The other thing we talked a bit about is ensemble. + +0:03:19.977 --> 0:03:24.532 +That means we are not building one new machine +translation system. + +0:03:24.965 --> 0:03:27.505 +And we can easily build four. + +0:03:27.505 --> 0:03:32.331 +What is the typical strategy to build different +systems? + +0:03:32.331 --> 0:03:33.177 +Remember. + +0:03:35.795 --> 0:03:40.119 +It should be of course a bit different if +you have the same. + +0:03:40.119 --> 0:03:44.585 +If they all predict the same then combining +them doesn't help. + +0:03:44.585 --> 0:03:48.979 +So what is the easiest way if you have to +build four systems? + +0:03:51.711 --> 0:04:01.747 +And the Charleston's will take, but this is +the best output of a single system. + +0:04:02.362 --> 0:04:10.165 +Mean now, it's really three different systems +so that you later can combine them and maybe + +0:04:10.165 --> 0:04:11.280 +the average. + +0:04:11.280 --> 0:04:16.682 +Ensembles are typically that the average is +all probabilities. + +0:04:19.439 --> 0:04:24.227 +The idea is to think about neural networks. + +0:04:24.227 --> 0:04:29.342 +There's one parameter which can easily adjust. + +0:04:29.342 --> 0:04:36.525 +That's exactly the easiest way to randomize +with three different. + +0:04:37.017 --> 0:04:43.119 +They have the same architecture, so all the +hydroparameters are the same, but they are + +0:04:43.119 --> 0:04:43.891 +different. + +0:04:43.891 --> 0:04:46.556 +They will have different predictions. + +0:04:48.228 --> 0:04:52.572 +So, of course, bigger amounts. + +0:04:52.572 --> 0:05:05.325 +Some of these are a bit the easiest way of +improving your quality because you don't really + +0:05:05.325 --> 0:05:08.268 +have to do anything. + +0:05:08.588 --> 0:05:12.588 +There is limits on that bigger models only +get better. + +0:05:12.588 --> 0:05:19.132 +If you have enough training data you can't +do like a handheld layer and you will not work + +0:05:19.132 --> 0:05:24.877 +on very small data but with a recent amount +of data that is the easiest thing. + +0:05:25.305 --> 0:05:33.726 +However, they are challenging with making +better models, bigger motors, and that is the + +0:05:33.726 --> 0:05:34.970 +computation. + +0:05:35.175 --> 0:05:44.482 +So, of course, if you have a bigger model +that can mean that you have longer running + +0:05:44.482 --> 0:05:49.518 +times, if you have models, you have to times. + +0:05:51.171 --> 0:05:56.685 +Normally you cannot paralyze the different +layers because the input to one layer is always + +0:05:56.685 --> 0:06:02.442 +the output of the previous layer, so you propagate +that so it will also increase your runtime. + +0:06:02.822 --> 0:06:10.720 +Then you have to store all your models in +memory. + +0:06:10.720 --> 0:06:20.927 +If you have double weights you will have: +Is more difficult to then do back propagation. 
+ +0:06:20.927 --> 0:06:27.680 +You have to store in between the activations, +so there's not only do you increase the model + +0:06:27.680 --> 0:06:31.865 +in your memory, but also all these other variables +that. + +0:06:34.414 --> 0:06:36.734 +And so in general it is more expensive. + +0:06:37.137 --> 0:06:54.208 +And therefore there's good reasons in looking +into can we make these models sound more efficient. + +0:06:54.134 --> 0:07:00.982 +So it's been through the viewer, you can have +it okay, have one and one day of training time, + +0:07:00.982 --> 0:07:01.274 +or. + +0:07:01.221 --> 0:07:07.535 +Forty thousand euros and then what is the +best machine translation system I can get within + +0:07:07.535 --> 0:07:08.437 +this budget. + +0:07:08.969 --> 0:07:19.085 +And then, of course, you can make the models +bigger, but then you have to train them shorter, + +0:07:19.085 --> 0:07:24.251 +and then we can make more efficient algorithms. + +0:07:25.925 --> 0:07:31.699 +If you think about efficiency, there's a bit +different scenarios. + +0:07:32.312 --> 0:07:43.635 +So if you're more of coming from the research +community, what you'll be doing is building + +0:07:43.635 --> 0:07:47.913 +a lot of models in your research. + +0:07:48.088 --> 0:07:58.645 +So you're having your test set of maybe sentences, +calculating the blue score, then another model. + +0:07:58.818 --> 0:08:08.911 +So what that means is typically you're training +on millions of cents, so your training time + +0:08:08.911 --> 0:08:14.944 +is long, maybe a day, but maybe in other cases +a week. + +0:08:15.135 --> 0:08:22.860 +The testing is not really the cost efficient, +but the training is very costly. + +0:08:23.443 --> 0:08:37.830 +If you are more thinking of building models +for application, the scenario is quite different. + +0:08:38.038 --> 0:08:46.603 +And then you keep it running, and maybe thousands +of customers are using it in translating. + +0:08:46.603 --> 0:08:47.720 +So in that. + +0:08:48.168 --> 0:08:59.577 +And we will see that it is not always the +same type of challenges you can paralyze some + +0:08:59.577 --> 0:09:07.096 +things in training, which you cannot paralyze +in testing. + +0:09:07.347 --> 0:09:14.124 +For example, in training you have to do back +propagation, so you have to store the activations. + +0:09:14.394 --> 0:09:23.901 +Therefore, in testing we briefly discussed +that we would do it in more detail today in + +0:09:23.901 --> 0:09:24.994 +training. + +0:09:25.265 --> 0:09:36.100 +You know they're a target and you can process +everything in parallel while in testing. + +0:09:36.356 --> 0:09:46.741 +So you can only do one word at a time, and +so you can less paralyze this. + +0:09:46.741 --> 0:09:50.530 +Therefore, it's important. + +0:09:52.712 --> 0:09:55.347 +Is a specific task on this. + +0:09:55.347 --> 0:10:03.157 +For example, it's the efficiency task where +it's about making things as efficient. + +0:10:03.123 --> 0:10:09.230 +Is possible and they can look at different +resources. + +0:10:09.230 --> 0:10:14.207 +So how much deep fuel run time do you need? + +0:10:14.454 --> 0:10:19.366 +See how much memory you need or you can have +a fixed memory budget and then have to build + +0:10:19.366 --> 0:10:20.294 +the best system. + +0:10:20.500 --> 0:10:29.010 +And here is a bit like an example of that, +so there's three teams from Edinburgh from + +0:10:29.010 --> 0:10:30.989 +and they submitted. 
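Going back briefly to the ensembling mentioned earlier in this lecture (same architecture, different random seeds, output probabilities averaged): a minimal sketch of one decoding step. The model objects and their predict_step interface are purely illustrative assumptions.

import numpy as np

def ensemble_step(models, decoder_state, vocab_size):
    # each model is assumed to return a softmax distribution over the vocabulary for the next word
    probs = np.zeros(vocab_size)
    for m in models:
        probs += m.predict_step(decoder_state)
    probs /= len(models)                 # ensembling = averaging the distributions
    return int(np.argmax(probs))         # greedy choice here; beam search scores hypotheses the same way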
+ +0:10:31.131 --> 0:10:36.278 +So then, of course, if you want to know the +most efficient system you have to do a bit + +0:10:36.278 --> 0:10:36.515 +of. + +0:10:36.776 --> 0:10:44.656 +You want to have a better quality or more +runtime and there's not the one solution. + +0:10:44.656 --> 0:10:46.720 +You can improve your. + +0:10:46.946 --> 0:10:49.662 +And that you see that there are different +systems. + +0:10:49.909 --> 0:11:06.051 +Here is how many words you can do for a second +on the clock, and you want to be as talk as + +0:11:06.051 --> 0:11:07.824 +possible. + +0:11:08.068 --> 0:11:08.889 +And you see here a bit. + +0:11:08.889 --> 0:11:09.984 +This is a little bit different. + +0:11:11.051 --> 0:11:27.717 +You want to be there on the top right corner +and you can get a score of something between + +0:11:27.717 --> 0:11:29.014 +words. + +0:11:30.250 --> 0:11:34.161 +Two hundred and fifty thousand, then you'll +ever come and score zero point three. + +0:11:34.834 --> 0:11:41.243 +There is, of course, any bit of a decision, +but the question is, like how far can you again? + +0:11:41.243 --> 0:11:47.789 +Some of all these points on this line would +be winners because they are somehow most efficient + +0:11:47.789 --> 0:11:53.922 +in a way that there's no system which achieves +the same quality with less computational. + +0:11:57.657 --> 0:12:04.131 +So there's the one question of which resources +are you interested. + +0:12:04.131 --> 0:12:07.416 +Are you running it on CPU or GPU? + +0:12:07.416 --> 0:12:11.668 +There's different ways of paralyzing stuff. + +0:12:14.654 --> 0:12:20.777 +Another dimension is how you process your +data. + +0:12:20.777 --> 0:12:27.154 +There's really the best processing and streaming. + +0:12:27.647 --> 0:12:34.672 +So in batch processing you have the whole +document available so you can translate all + +0:12:34.672 --> 0:12:39.981 +sentences in perimeter and then you're interested +in throughput. + +0:12:40.000 --> 0:12:43.844 +But you can then process, for example, especially +in GPS. + +0:12:43.844 --> 0:12:49.810 +That's interesting, you're not translating +one sentence at a time, but you're translating + +0:12:49.810 --> 0:12:56.108 +one hundred sentences or so in parallel, so +you have one more dimension where you can paralyze + +0:12:56.108 --> 0:12:57.964 +and then be more efficient. + +0:12:58.558 --> 0:13:14.863 +On the other hand, for example sorts of documents, +so we learned that if you do badge processing + +0:13:14.863 --> 0:13:16.544 +you have. + +0:13:16.636 --> 0:13:24.636 +Then, of course, it makes sense to sort the +sentences in order to have the minimum thing + +0:13:24.636 --> 0:13:25.535 +attached. + +0:13:27.427 --> 0:13:32.150 +The other scenario is more the streaming scenario +where you do life translation. + +0:13:32.512 --> 0:13:40.212 +So in that case you can't wait for the whole +document to pass, but you have to do. + +0:13:40.520 --> 0:13:49.529 +And then, for example, that's especially in +situations like speech translation, and then + +0:13:49.529 --> 0:13:53.781 +you're interested in things like latency. + +0:13:53.781 --> 0:14:00.361 +So how much do you have to wait to get the +output of a sentence? 
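A sketch of the batch-processing point above: when the whole document is available, sorting sentences by length before batching minimises padding, and the original order is restored afterwards. The batch size and the translate_batch interface are illustrative assumptions.

def length_sorted_batches(sentences, batch_size=64):
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]           # similar lengths -> very little padding

def translate_document(sentences, translate_batch, batch_size=64):
    out = [None] * len(sentences)
    for idx, batch in length_sorted_batches(sentences, batch_size):
        for i, hyp in zip(idx, translate_batch(batch)):  # translate_batch: the actual MT system
            out[i] = hyp
    return out                                           # translations in the original document order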
+ +0:14:06.566 --> 0:14:16.956 +Finally, there is the thing about the implementation: +Today we're mainly looking at different algorithms, + +0:14:16.956 --> 0:14:23.678 +different models of how you can model them +in your machine translation system, but of + +0:14:23.678 --> 0:14:29.227 +course for the same algorithms there's also +different implementations. + +0:14:29.489 --> 0:14:38.643 +So, for example, for a machine translation +this tool could be very fast. + +0:14:38.638 --> 0:14:46.615 +So they have like coded a lot of the operations +very low resource, not low resource, low level + +0:14:46.615 --> 0:14:49.973 +on the directly on the QDAC kernels in. + +0:14:50.110 --> 0:15:00.948 +So the same attention network is typically +more efficient in that type of algorithm. + +0:15:00.880 --> 0:15:02.474 +Than in in any other. + +0:15:03.323 --> 0:15:13.105 +Of course, it might be other disadvantages, +so if you're a little worker or have worked + +0:15:13.105 --> 0:15:15.106 +in the practical. + +0:15:15.255 --> 0:15:22.604 +Because it's normally easier to understand, +easier to change, and so on, but there is again + +0:15:22.604 --> 0:15:23.323 +a train. + +0:15:23.483 --> 0:15:29.440 +You have to think about, do you want to include +this into my study or comparison or not? + +0:15:29.440 --> 0:15:36.468 +Should it be like I compare different implementations +and I also find the most efficient implementation? + +0:15:36.468 --> 0:15:39.145 +Or is it only about the pure algorithm? + +0:15:42.742 --> 0:15:50.355 +Yeah, when building these systems there is +a different trade-off to do. + +0:15:50.850 --> 0:15:56.555 +So there's one of the traders between memory +and throughput, so how many words can generate + +0:15:56.555 --> 0:15:57.299 +per second. + +0:15:57.557 --> 0:16:03.351 +So typically you can easily like increase +your scruple by increasing the batch size. + +0:16:03.643 --> 0:16:06.899 +So that means you are translating more sentences +in parallel. + +0:16:07.107 --> 0:16:09.241 +And gypsies are very good at that stuff. + +0:16:09.349 --> 0:16:15.161 +It should translate one sentence or one hundred +sentences, not the same time, but its. + +0:16:15.115 --> 0:16:20.784 +Rough are very similar because they are at +this efficient metrics multiplication so that + +0:16:20.784 --> 0:16:24.415 +you can do the same operation on all sentences +parallel. + +0:16:24.415 --> 0:16:30.148 +So typically that means if you increase your +benchmark you can do more things in parallel + +0:16:30.148 --> 0:16:31.995 +and you will translate more. + +0:16:31.952 --> 0:16:33.370 +Second. + +0:16:33.653 --> 0:16:43.312 +On the other hand, with this advantage, of +course you will need higher badge sizes and + +0:16:43.312 --> 0:16:44.755 +more memory. + +0:16:44.965 --> 0:16:56.452 +To begin with, the other problem is that you +have such big models that you can only translate + +0:16:56.452 --> 0:16:59.141 +with lower bed sizes. + +0:16:59.119 --> 0:17:08.466 +If you are running out of memory with translating, +one idea to go on that is to decrease your. + +0:17:13.453 --> 0:17:24.456 +Then there is the thing about quality in Screwport, +of course, and before it's like larger models, + +0:17:24.456 --> 0:17:28.124 +but in generally higher quality. + +0:17:28.124 --> 0:17:31.902 +The first one is always this way. + +0:17:32.092 --> 0:17:38.709 +Course: Not always larger model helps you +have over fitting at some point, but in generally. 
+ +0:17:43.883 --> 0:17:52.901 +And with this a bit on this training and testing +thing we had before. + +0:17:53.113 --> 0:17:58.455 +So it wears all the difference between training +and testing, and for the encoder and decoder. + +0:17:58.798 --> 0:18:06.992 +So if we are looking at what mentioned before +at training time, we have a source sentence + +0:18:06.992 --> 0:18:17.183 +here: And how this is processed on a is not +the attention here. + +0:18:17.183 --> 0:18:21.836 +That's a tubical transformer. + +0:18:22.162 --> 0:18:31.626 +And how we can do that on a is that we can +paralyze the ear ever since. + +0:18:31.626 --> 0:18:40.422 +The first thing to know is: So that is, of +course, not in all cases. + +0:18:40.422 --> 0:18:49.184 +We'll later talk about speech translation +where we might want to translate. + +0:18:49.389 --> 0:18:56.172 +Without the general case in, it's like you +have the full sentence you want to translate. + +0:18:56.416 --> 0:19:02.053 +So the important thing is we are here everything +available on the source side. + +0:19:03.323 --> 0:19:13.524 +And then this was one of the big advantages +that you can remember back of transformer. + +0:19:13.524 --> 0:19:15.752 +There are several. + +0:19:16.156 --> 0:19:25.229 +But the other one is now that we can calculate +the full layer. + +0:19:25.645 --> 0:19:29.318 +There is no dependency between this and this +state or this and this state. + +0:19:29.749 --> 0:19:36.662 +So we always did like here to calculate the +key value and query, and based on that you + +0:19:36.662 --> 0:19:37.536 +calculate. + +0:19:37.937 --> 0:19:46.616 +Which means we can do all these calculations +here in parallel and in parallel. + +0:19:48.028 --> 0:19:55.967 +And there, of course, is this very efficiency +because again for GPS it's too bigly possible + +0:19:55.967 --> 0:20:00.887 +to do these things in parallel and one after +each other. + +0:20:01.421 --> 0:20:10.311 +And then we can also for each layer one by +one, and then we calculate here the encoder. + +0:20:10.790 --> 0:20:21.921 +In training now an important thing is that +for the decoder we have the full sentence available + +0:20:21.921 --> 0:20:28.365 +because we know this is the target we should +generate. + +0:20:29.649 --> 0:20:33.526 +We have models now in a different way. + +0:20:33.526 --> 0:20:38.297 +This hidden state is only on the previous +ones. + +0:20:38.598 --> 0:20:51.887 +And the first thing here depends only on this +information, so you see if you remember we + +0:20:51.887 --> 0:20:56.665 +had this masked self-attention. + +0:20:56.896 --> 0:21:04.117 +So that means, of course, we can only calculate +the decoder once the encoder is done, but that's. + +0:21:04.444 --> 0:21:06.656 +Percent can calculate the end quarter. + +0:21:06.656 --> 0:21:08.925 +Then we can calculate here the decoder. + +0:21:09.569 --> 0:21:25.566 +But again in training we have x, y and that +is available so we can calculate everything + +0:21:25.566 --> 0:21:27.929 +in parallel. + +0:21:28.368 --> 0:21:40.941 +So the interesting thing or advantage of transformer +is in training. + +0:21:40.941 --> 0:21:46.408 +We can do it for the decoder. + +0:21:46.866 --> 0:21:54.457 +That means you will have more calculations +because you can only calculate one layer at + +0:21:54.457 --> 0:22:02.310 +a time, but for example the length which is +too bigly quite long or doesn't really matter + +0:22:02.310 --> 0:22:03.270 +that much. 
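A minimal numpy sketch of the masked self-attention that makes decoder training parallel, as described above: all target positions are processed in one matrix operation, and the causal mask ensures position t only attends to positions up to t. Dimensions and the single attention head are illustrative; the contrast with test time is discussed right below.

import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    T, _ = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])               # (T, T): every position against every position
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -1e9                                   # forbid attending to future positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V                                    # all T outputs computed in parallel

T, d = 6, 8
X = np.random.randn(T, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)                # at test time the same computation has to be
                                                          # redone step by step as words are generated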
+ +0:22:05.665 --> 0:22:10.704 +However, in testing this situation is different. + +0:22:10.704 --> 0:22:13.276 +In testing we only have. + +0:22:13.713 --> 0:22:20.622 +So this means we start with a sense: We don't +know the full sentence yet because we ought + +0:22:20.622 --> 0:22:29.063 +to regularly generate that so for the encoder +we have the same here but for the decoder. + +0:22:29.409 --> 0:22:39.598 +In this case we only have the first and the +second instinct, but only for all states in + +0:22:39.598 --> 0:22:40.756 +parallel. + +0:22:41.101 --> 0:22:51.752 +And then we can do the next step for y because +we are putting our most probable one. + +0:22:51.752 --> 0:22:58.643 +We do greedy search or beam search, but you +cannot do. + +0:23:03.663 --> 0:23:16.838 +Yes, so if we are interesting in making things +more efficient for testing, which we see, for + +0:23:16.838 --> 0:23:22.363 +example in the scenario of really our. + +0:23:22.642 --> 0:23:34.286 +It makes sense that we think about our architecture +and that we are currently working on attention + +0:23:34.286 --> 0:23:35.933 +based models. + +0:23:36.096 --> 0:23:44.150 +The decoder there is some of the most time +spent testing and testing. + +0:23:44.150 --> 0:23:47.142 +It's similar, but during. + +0:23:47.167 --> 0:23:50.248 +Nothing about beam search. + +0:23:50.248 --> 0:23:59.833 +It might be even more complicated because +in beam search you have to try different. + +0:24:02.762 --> 0:24:15.140 +So the question is what can you now do in +order to make your model more efficient and + +0:24:15.140 --> 0:24:21.905 +better in translation in these types of cases? + +0:24:24.604 --> 0:24:30.178 +And the one thing is to look into the encoded +decoder trailer. + +0:24:30.690 --> 0:24:43.898 +And then until now we typically assume that +the depth of the encoder and the depth of the + +0:24:43.898 --> 0:24:48.154 +decoder is roughly the same. + +0:24:48.268 --> 0:24:55.553 +So if you haven't thought about it, you just +take what is running well. + +0:24:55.553 --> 0:24:57.678 +You would try to do. + +0:24:58.018 --> 0:25:04.148 +However, we saw now that there is a quite +big challenge and the runtime is a lot longer + +0:25:04.148 --> 0:25:04.914 +than here. + +0:25:05.425 --> 0:25:14.018 +The question is also the case for the calculations, +or do we have there the same issue that we + +0:25:14.018 --> 0:25:21.887 +only get the good quality if we are having +high and high, so we know that making these + +0:25:21.887 --> 0:25:25.415 +more depths is increasing our quality. + +0:25:25.425 --> 0:25:31.920 +But what we haven't talked about is really +important that we increase the depth the same + +0:25:31.920 --> 0:25:32.285 +way. + +0:25:32.552 --> 0:25:41.815 +So what we can put instead also do is something +like this where you have a deep encoder and + +0:25:41.815 --> 0:25:42.923 +a shallow. + +0:25:43.163 --> 0:25:57.386 +So that would be that you, for example, have +instead of having layers on the encoder, and + +0:25:57.386 --> 0:25:59.757 +layers on the. + +0:26:00.080 --> 0:26:10.469 +So in this case the overall depth from start +to end would be similar and so hopefully. + +0:26:11.471 --> 0:26:21.662 +But we could a lot more things hear parallelized, +and hear what is costly at the end during decoding + +0:26:21.662 --> 0:26:22.973 +the decoder. + +0:26:22.973 --> 0:26:29.330 +Because that does change in an outer regressive +way, there we. + +0:26:31.411 --> 0:26:33.727 +And that that can be analyzed. 
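The deep-encoder / shallow-decoder idea above as a configuration sketch, using PyTorch's generic nn.Transformer purely to illustrate the hyperparameter change (a real MT system adds embeddings, positional encodings and an output layer around it). Results for such setups are discussed next.

import torch.nn as nn

# baseline: 6 encoder and 6 decoder layers
baseline = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)

# efficiency-oriented: 12 encoder layers (parallelisable, run once per sentence)
# and a single decoder layer (run once per generated word)
deep_enc_shallow_dec = nn.Transformer(d_model=512, nhead=8,
                                      num_encoder_layers=12, num_decoder_layers=1)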
+ +0:26:33.727 --> 0:26:38.734 +So here is some examples: Where people have +done all this. + +0:26:39.019 --> 0:26:55.710 +So here it's mainly interested on the orange +things, which is auto-regressive about the + +0:26:55.710 --> 0:26:57.607 +speed up. + +0:26:57.717 --> 0:27:15.031 +You have the system, so agree is not exactly +the same, but it's similar. + +0:27:15.055 --> 0:27:23.004 +It's always the case if you look at speed +up. + +0:27:23.004 --> 0:27:31.644 +Think they put a speed of so that's the baseline. + +0:27:31.771 --> 0:27:35.348 +So between and times as fast. + +0:27:35.348 --> 0:27:42.621 +If you switch from a system to where you have +layers in the. + +0:27:42.782 --> 0:27:52.309 +You see that although you have slightly more +parameters, more calculations are also roughly + +0:27:52.309 --> 0:28:00.283 +the same, but you can speed out because now +during testing you can paralyze. + +0:28:02.182 --> 0:28:09.754 +The other thing is that you're speeding up, +but if you look at the performance it's similar, + +0:28:09.754 --> 0:28:13.500 +so sometimes you improve, sometimes you lose. + +0:28:13.500 --> 0:28:20.421 +There's a bit of losing English to Romania, +but in general the quality is very slow. + +0:28:20.680 --> 0:28:30.343 +So you see that you can keep a similar performance +while improving your speed by just having different. + +0:28:30.470 --> 0:28:34.903 +And you also see the encoder layers from speed. + +0:28:34.903 --> 0:28:38.136 +They don't really metal that much. + +0:28:38.136 --> 0:28:38.690 +Most. + +0:28:38.979 --> 0:28:50.319 +Because if you compare the 12th system to +the 6th system you have a lower performance + +0:28:50.319 --> 0:28:57.309 +with 6th and colder layers but the speed is +similar. + +0:28:57.897 --> 0:29:02.233 +And see the huge decrease is it maybe due +to a lack of data. + +0:29:03.743 --> 0:29:11.899 +Good idea would say it's not the case. + +0:29:11.899 --> 0:29:23.191 +Romanian English should have the same number +of data. + +0:29:24.224 --> 0:29:31.184 +Maybe it's just that something in that language. + +0:29:31.184 --> 0:29:40.702 +If you generate Romanian maybe they need more +target dependencies. + +0:29:42.882 --> 0:29:46.263 +The Wine's the Eye Also Don't Know Any Sex +People Want To. + +0:29:47.887 --> 0:29:49.034 +There could be yeah the. + +0:29:49.889 --> 0:29:58.962 +As the maybe if you go from like a movie sphere +to a hybrid sphere, you can: It's very much + +0:29:58.962 --> 0:30:12.492 +easier to expand the vocabulary to English, +but it must be the vocabulary. + +0:30:13.333 --> 0:30:21.147 +Have to check, but would assume that in this +case the system is not retrained, but it's + +0:30:21.147 --> 0:30:22.391 +trained with. + +0:30:22.902 --> 0:30:30.213 +And that's why I was assuming that they have +the same, but maybe you'll write that in this + +0:30:30.213 --> 0:30:35.595 +piece, for example, if they were pre-trained, +the decoder English. + +0:30:36.096 --> 0:30:43.733 +But don't remember exactly if they do something +like that, but that could be a good. + +0:30:45.325 --> 0:30:52.457 +So this is some of the most easy way to speed +up. + +0:30:52.457 --> 0:31:01.443 +You just switch to hyperparameters, not to +implement anything. + +0:31:02.722 --> 0:31:08.367 +Of course, there's other ways of doing that. + +0:31:08.367 --> 0:31:11.880 +We'll look into two things. + +0:31:11.880 --> 0:31:16.521 +The other thing is the architecture. 
+ +0:31:16.796 --> 0:31:28.154 +We are now at some of the baselines that we +are doing. + +0:31:28.488 --> 0:31:39.978 +However, in translation in the decoder side, +it might not be the best solution. + +0:31:39.978 --> 0:31:41.845 +There is no. + +0:31:42.222 --> 0:31:47.130 +So we can use different types of architectures, +also in the encoder and the. + +0:31:47.747 --> 0:31:52.475 +And there's two ways of what you could do +different, or there's more ways. + +0:31:52.912 --> 0:31:54.825 +We will look into two todays. + +0:31:54.825 --> 0:31:58.842 +The one is average attention, which is a very +simple solution. + +0:31:59.419 --> 0:32:01.464 +You can do as it says. + +0:32:01.464 --> 0:32:04.577 +It's not really attending anymore. + +0:32:04.577 --> 0:32:08.757 +It's just like equal attendance to everything. + +0:32:09.249 --> 0:32:23.422 +And the other idea, which is currently done +in most systems which are optimized to efficiency, + +0:32:23.422 --> 0:32:24.913 +is we're. + +0:32:25.065 --> 0:32:32.623 +But on the decoder side we are then not using +transformer or self attention, but we are using + +0:32:32.623 --> 0:32:39.700 +recurrent neural network because they are the +disadvantage of recurrent neural network. + +0:32:39.799 --> 0:32:48.353 +And then the recurrent is normally easier +to calculate because it only depends on inputs, + +0:32:48.353 --> 0:32:49.684 +the input on. + +0:32:51.931 --> 0:33:02.190 +So what is the difference between decoding +and why is the tension maybe not sufficient + +0:33:02.190 --> 0:33:03.841 +for decoding? + +0:33:04.204 --> 0:33:14.390 +If we want to populate the new state, we only +have to look at the input and the previous + +0:33:14.390 --> 0:33:15.649 +state, so. + +0:33:16.136 --> 0:33:19.029 +We are more conditional here networks. + +0:33:19.029 --> 0:33:19.994 +We have the. + +0:33:19.980 --> 0:33:31.291 +Dependency to a fixed number of previous ones, +but that's rarely used for decoding. + +0:33:31.291 --> 0:33:39.774 +In contrast, in transformer we have this large +dependency, so. + +0:33:40.000 --> 0:33:52.760 +So from t minus one to y t so that is somehow +and mainly not very efficient in this way mean + +0:33:52.760 --> 0:33:56.053 +it's very good because. + +0:33:56.276 --> 0:34:03.543 +However, the disadvantage is that we also +have to do all these calculations, so if we + +0:34:03.543 --> 0:34:10.895 +more view from the point of view of efficient +calculation, this might not be the best. + +0:34:11.471 --> 0:34:20.517 +So the question is, can we change our architecture +to keep some of the advantages but make things + +0:34:20.517 --> 0:34:21.994 +more efficient? + +0:34:24.284 --> 0:34:31.131 +The one idea is what is called the average +attention, and the interesting thing is this + +0:34:31.131 --> 0:34:32.610 +work surprisingly. + +0:34:33.013 --> 0:34:38.917 +So the only idea what you're doing is doing +the decoder. + +0:34:38.917 --> 0:34:42.646 +You're not doing attention anymore. + +0:34:42.646 --> 0:34:46.790 +The attention weights are all the same. + +0:34:47.027 --> 0:35:00.723 +So you don't calculate with query and key +the different weights, and then you just take + +0:35:00.723 --> 0:35:03.058 +equal weights. + +0:35:03.283 --> 0:35:07.585 +So here would be one third from this, one +third from this, and one third. + +0:35:09.009 --> 0:35:14.719 +And while it is sufficient you can now do +precalculation and things get more efficient. 
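Before the formula is walked through below, a minimal sketch of the average-attention idea: instead of learned attention weights, position t takes the uniform average of decoder states 1..t, and a running sum makes every step O(1) rather than a sum over the whole history. Shapes and data are illustrative.

import numpy as np

def average_attention(decoder_states):
    T, d = decoder_states.shape
    running_sum = np.zeros(d)
    outputs = np.zeros((T, d))
    for t in range(T):
        running_sum += decoder_states[t]     # cumulative sum: old states never have to be revisited
        outputs[t] = running_sum / (t + 1)   # equal weight 1/(t+1) for every position seen so far
    return outputs

states = np.random.randn(5, 8)
avg = average_attention(states)
assert np.allclose(avg[2], states[:3].mean(axis=0))   # position 3 = average of states 1..3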
+ +0:35:15.195 --> 0:35:18.803 +So first go the formula that's maybe not directed +here. + +0:35:18.979 --> 0:35:38.712 +So the difference here is that your new hint +stage is the sum of all the hint states, then. + +0:35:38.678 --> 0:35:40.844 +So here would be with this. + +0:35:40.844 --> 0:35:45.022 +It would be one third of this plus one third +of this. + +0:35:46.566 --> 0:35:57.162 +But if you calculate it this way, it's not +yet being more efficient because you still + +0:35:57.162 --> 0:36:01.844 +have to sum over here all the hidden. + +0:36:04.524 --> 0:36:22.932 +But you can not easily speed up these things +by having an in between value, which is just + +0:36:22.932 --> 0:36:24.568 +always. + +0:36:25.585 --> 0:36:30.057 +If you take this as ten to one, you take this +one class this one. + +0:36:30.350 --> 0:36:36.739 +Because this one then was before this, and +this one was this, so in the end. + +0:36:37.377 --> 0:36:49.545 +So now this one is not the final one in order +to get the final one to do the average. + +0:36:49.545 --> 0:36:50.111 +So. + +0:36:50.430 --> 0:37:00.264 +But then if you do this calculation with speed +up you can do it with a fixed number of steps. + +0:37:00.180 --> 0:37:11.300 +Instead of the sun which depends on age, so +you only have to do calculations to calculate + +0:37:11.300 --> 0:37:12.535 +this one. + +0:37:12.732 --> 0:37:21.253 +Can you do a lakes on a wet spoon? + +0:37:21.253 --> 0:37:32.695 +For example, a light spoon here now takes +and. + +0:37:32.993 --> 0:37:38.762 +That's a very good point and that's why this +is now in the image. + +0:37:38.762 --> 0:37:44.531 +It's not very good so this is the one with +tilder and the tilder. + +0:37:44.884 --> 0:37:57.895 +So this one is just the sum of these two, +because this is just this one. + +0:37:58.238 --> 0:38:08.956 +So the sum of this is exactly as the sum of +these, and the sum of these is the sum of here. + +0:38:08.956 --> 0:38:15.131 +So you only do the sum in here, and the multiplying. + +0:38:15.255 --> 0:38:22.145 +So what you can mainly do here is you can +do it more mathematically. + +0:38:22.145 --> 0:38:31.531 +You can know this by tea taking out of the +sum, and then you can calculate the sum different. + +0:38:36.256 --> 0:38:42.443 +That maybe looks a bit weird and simple, so +we were all talking about this great attention + +0:38:42.443 --> 0:38:47.882 +that we can focus on different parts, and a +bit surprising on this work is now. + +0:38:47.882 --> 0:38:53.321 +In the end it might also work well without +really putting and just doing equal. + +0:38:53.954 --> 0:38:56.164 +Mean it's not that easy. + +0:38:56.376 --> 0:38:58.261 +It's like sometimes this is working. + +0:38:58.261 --> 0:39:00.451 +There's also report weight work that well. + +0:39:01.481 --> 0:39:05.848 +But I think it's an interesting way and it +maybe shows that a lot of. + +0:39:05.805 --> 0:39:10.669 +Things in the self or in the transformer paper +which are more put as like yet. + +0:39:10.669 --> 0:39:14.301 +These are some hyperparameters that are rounded +like that. + +0:39:14.301 --> 0:39:19.657 +You do the lay on all in between and that +you do a feat forward before and things like + +0:39:19.657 --> 0:39:20.026 +that. + +0:39:20.026 --> 0:39:25.567 +But these are also all important and the right +set up around that is also very important. + +0:39:28.969 --> 0:39:38.598 +The other thing you can do in the end is not +completely different from this one. 
+ +0:39:38.598 --> 0:39:42.521 +It's just like a very different. + +0:39:42.942 --> 0:39:54.338 +And that is a recurrent network which also +has this type of highway connection that can + +0:39:54.338 --> 0:40:01.330 +ignore the recurrent unit and directly put +the input. + +0:40:01.561 --> 0:40:10.770 +It's not really adding out, but if you see +the hitting step is your input, but what you + +0:40:10.770 --> 0:40:15.480 +can do is somehow directly go to the output. + +0:40:17.077 --> 0:40:28.390 +These are the four components of the simple +return unit, and the unit is motivated by GIS + +0:40:28.390 --> 0:40:33.418 +and by LCMs, which we have seen before. + +0:40:33.513 --> 0:40:43.633 +And that has proven to be very good for iron +ends, which allows you to have a gate on your. + +0:40:44.164 --> 0:40:48.186 +In this thing we have two gates, the reset +gate and the forget gate. + +0:40:48.768 --> 0:40:57.334 +So first we have the general structure which +has a cell state. + +0:40:57.334 --> 0:41:01.277 +Here we have the cell state. + +0:41:01.361 --> 0:41:09.661 +And then this goes next, and we always get +the different cell states over the times that. + +0:41:10.030 --> 0:41:11.448 +This Is the South Stand. + +0:41:11.771 --> 0:41:16.518 +How do we now calculate that just assume we +have an initial cell safe here? + +0:41:17.017 --> 0:41:19.670 +But the first thing is we're doing the forget +game. + +0:41:20.060 --> 0:41:34.774 +The forgetting models should the new cell +state mainly depend on the previous cell state + +0:41:34.774 --> 0:41:40.065 +or should it depend on our age. + +0:41:40.000 --> 0:41:41.356 +Like Add to Them. + +0:41:41.621 --> 0:41:42.877 +How can we model that? + +0:41:44.024 --> 0:41:45.599 +First we were at a cocktail. + +0:41:45.945 --> 0:41:52.151 +The forget gait is depending on minus one. + +0:41:52.151 --> 0:41:56.480 +You also see here the former. + +0:41:57.057 --> 0:42:01.963 +So we are multiplying both the cell state +and our input. + +0:42:01.963 --> 0:42:04.890 +With some weights we are getting. + +0:42:05.105 --> 0:42:08.472 +We are putting some Bay Inspector and then +we are doing Sigma Weed on that. + +0:42:08.868 --> 0:42:13.452 +So in the end we have numbers between zero +and one saying for each dimension. + +0:42:13.853 --> 0:42:22.041 +Like how much if it's near to zero we will +mainly use the new input. + +0:42:22.041 --> 0:42:31.890 +If it's near to one we will keep the input +and ignore the input at this dimension. + +0:42:33.313 --> 0:42:40.173 +And by this motivation we can then create +here the new sound state, and here you see + +0:42:40.173 --> 0:42:41.141 +the formal. + +0:42:41.601 --> 0:42:55.048 +So you take your foot back gate and multiply +it with your class. + +0:42:55.048 --> 0:43:00.427 +So if my was around then. + +0:43:00.800 --> 0:43:07.405 +In the other case, when the value was others, +that's what you added. + +0:43:07.405 --> 0:43:10.946 +Then you're adding a transformation. + +0:43:11.351 --> 0:43:24.284 +So if this value was maybe zero then you're +putting most of the information from inputting. + +0:43:25.065 --> 0:43:26.947 +Is already your element? + +0:43:26.947 --> 0:43:30.561 +The only question is now based on your element. + +0:43:30.561 --> 0:43:32.067 +What is the output? + +0:43:33.253 --> 0:43:47.951 +And there you have another opportunity so +you can either take the output or instead you + +0:43:47.951 --> 0:43:50.957 +prefer the input. 
+ +0:43:52.612 --> 0:43:58.166 +So is the value also the same for the recept +game and the forget game. + +0:43:58.166 --> 0:43:59.417 +Yes, the movie. + +0:44:00.900 --> 0:44:10.004 +Yes exactly so the matrices are different +and therefore it can be and that should be + +0:44:10.004 --> 0:44:16.323 +and maybe there is sometimes you want to have +information. + +0:44:16.636 --> 0:44:23.843 +So here again we have this vector with values +between zero and which says controlling how + +0:44:23.843 --> 0:44:25.205 +the information. + +0:44:25.505 --> 0:44:36.459 +And then the output is calculated here similar +to a cell stage, but again input is from. + +0:44:36.536 --> 0:44:45.714 +So either the reset gate decides should give +what is currently stored in there, or. + +0:44:46.346 --> 0:44:58.647 +So it's not exactly as the thing we had before, +with the residual connections where we added + +0:44:58.647 --> 0:45:01.293 +up, but here we do. + +0:45:04.224 --> 0:45:08.472 +This is the general idea of a simple recurrent +neural network. + +0:45:08.472 --> 0:45:13.125 +Then we will now look at how we can make things +even more efficient. + +0:45:13.125 --> 0:45:17.104 +But first do you have more questions on how +it is working? + +0:45:23.063 --> 0:45:38.799 +Now these calculations are a bit where things +get more efficient because this somehow. + +0:45:38.718 --> 0:45:43.177 +It depends on all the other damage for the +second one also. + +0:45:43.423 --> 0:45:48.904 +Because if you do a matrix multiplication +with a vector like for the output vector, each + +0:45:48.904 --> 0:45:52.353 +diameter of the output vector depends on all +the other. + +0:45:52.973 --> 0:46:06.561 +The cell state here depends because this one +is used here, and somehow the first dimension + +0:46:06.561 --> 0:46:11.340 +of the cell state only depends. + +0:46:11.931 --> 0:46:17.973 +In order to make that, of course, is sometimes +again making things less paralyzeable if things + +0:46:17.973 --> 0:46:18.481 +depend. + +0:46:19.359 --> 0:46:35.122 +Can easily make that different by changing +from the metric product to not a vector. + +0:46:35.295 --> 0:46:51.459 +So you do first, just like inside here, you +take like the first dimension, my second dimension. + +0:46:52.032 --> 0:46:53.772 +Is, of course, narrow. + +0:46:53.772 --> 0:46:59.294 +This should be reset or this should be because +it should be a different. + +0:46:59.899 --> 0:47:12.053 +Now the first dimension only depends on the +first dimension, so you don't have dependencies + +0:47:12.053 --> 0:47:16.148 +any longer between dimensions. + +0:47:18.078 --> 0:47:25.692 +Maybe it gets a bit clearer if you see about +it in this way, so what we have to do now. + +0:47:25.966 --> 0:47:31.911 +First, we have to do a metrics multiplication +on to gather and to get the. + +0:47:32.292 --> 0:47:38.041 +And then we only have the element wise operations +where we take this output. + +0:47:38.041 --> 0:47:38.713 +We take. + +0:47:39.179 --> 0:47:42.978 +Minus one and our original. + +0:47:42.978 --> 0:47:52.748 +Here we only have elemental abrasions which +can be optimally paralyzed. + +0:47:53.273 --> 0:48:07.603 +So here we have additional paralyzed things +across the dimension and don't have to do that. + +0:48:09.929 --> 0:48:24.255 +Yeah, but this you can do like in parallel +again for all xts. + +0:48:24.544 --> 0:48:33.014 +Here you can't do it in parallel, but you +only have to do it on each seat, and then you + +0:48:33.014 --> 0:48:34.650 +can parallelize. 
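A numpy sketch of the simple recurrent unit just discussed: all matrix products involve only the input x_t, so they can be precomputed for the whole sequence in parallel, and the recurrence itself uses only element-wise operations on the cell state, so the dimensions stay independent. The exact parametrisation of the gates varies between papers; this follows the description above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W, Wf, vf, bf, Wr, vr, br):
    T, d = X.shape
    Xt, Fx, Rx = X @ W, X @ Wf, X @ Wr      # matrix products depend only on X: parallel over t
    c = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):                      # the recurrence itself is purely element-wise
        f = sigmoid(Fx[t] + vf * c + bf)    # forget gate; vf * c is element-wise, not a matrix product
        r = sigmoid(Rx[t] + vr * c + br)    # reset gate
        c = f * c + (1.0 - f) * Xt[t]       # keep the old cell state or take the transformed input
        H[t] = r * c + (1.0 - r) * X[t]     # highway-style output: cell state or the raw input
    return H

T, d = 10, 16
X = np.random.randn(T, d)
W, Wf, Wr = (np.random.randn(d, d) for _ in range(3))
vf, vr = np.random.randn(d), np.random.randn(d)
H = sru_layer(X, W, Wf, vf, np.zeros(d), Wr, vr, np.zeros(d))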
+ +0:48:35.495 --> 0:48:39.190 +But this maybe for the dimension. + +0:48:39.190 --> 0:48:42.124 +Maybe it's also important. + +0:48:42.124 --> 0:48:46.037 +I don't know if they have tried it. + +0:48:46.037 --> 0:48:55.383 +I assume it's not only for dimension reduction, +but it's hard because you can easily. + +0:49:01.001 --> 0:49:08.164 +People have even like made the second thing +even more easy. + +0:49:08.164 --> 0:49:10.313 +So there is this. + +0:49:10.313 --> 0:49:17.954 +This is how we have the highway connections +in the transformer. + +0:49:17.954 --> 0:49:20.699 +Then it's like you do. + +0:49:20.780 --> 0:49:24.789 +So that is like how things are put together +as a transformer. + +0:49:25.125 --> 0:49:39.960 +And that is a similar and simple recurring +neural network where you do exactly the same + +0:49:39.960 --> 0:49:44.512 +for the so you don't have. + +0:49:46.326 --> 0:49:47.503 +This type of things. + +0:49:49.149 --> 0:50:01.196 +And with this we are at the end of how to +make efficient architectures before we go to + +0:50:01.196 --> 0:50:02.580 +the next. + +0:50:13.013 --> 0:50:24.424 +Between the ink or the trader and the architectures +there is a next technique which is used in + +0:50:24.424 --> 0:50:28.988 +nearly all deburning very successful. + +0:50:29.449 --> 0:50:43.463 +So the idea is can we extract the knowledge +from a large network into a smaller one, but + +0:50:43.463 --> 0:50:45.983 +it's similarly. + +0:50:47.907 --> 0:50:53.217 +And the nice thing is that this really works, +and it may be very, very surprising. + +0:50:53.673 --> 0:51:03.035 +So the idea is that we have a large strong +model which we train for long, and the question + +0:51:03.035 --> 0:51:07.870 +is: Can that help us to train a smaller model? + +0:51:08.148 --> 0:51:16.296 +So can what we refer to as teacher model tell +us better to build a small student model than + +0:51:16.296 --> 0:51:17.005 +before. + +0:51:17.257 --> 0:51:27.371 +So what we're before in it as a student model, +we learn from the data and that is how we train + +0:51:27.371 --> 0:51:28.755 +our systems. + +0:51:29.249 --> 0:51:37.949 +The question is: Can we train this small model +better if we are not only learning from the + +0:51:37.949 --> 0:51:46.649 +data, but we are also learning from a large +model which has been trained maybe in the same + +0:51:46.649 --> 0:51:47.222 +data? + +0:51:47.667 --> 0:51:55.564 +So that you have then in the end a smaller +model that is somehow better performing than. + +0:51:55.895 --> 0:51:59.828 +And maybe that's on the first view. + +0:51:59.739 --> 0:52:05.396 +Very very surprising because it has seen the +same data so it should have learned the same + +0:52:05.396 --> 0:52:11.053 +so the baseline model trained only on the data +and the student teacher knowledge to still + +0:52:11.053 --> 0:52:11.682 +model it. + +0:52:11.682 --> 0:52:17.401 +They all have seen only this data because +your teacher modeling was also trained typically + +0:52:17.401 --> 0:52:19.161 +only on this model however. + +0:52:20.580 --> 0:52:30.071 +It has by now shown that by many ways the +model trained in the teacher and analysis framework + +0:52:30.071 --> 0:52:32.293 +is performing better. + +0:52:33.473 --> 0:52:40.971 +A bit of an explanation when we see how that +works. + +0:52:40.971 --> 0:52:46.161 +There's different ways of doing it. + +0:52:46.161 --> 0:52:47.171 +Maybe. + +0:52:47.567 --> 0:52:51.501 +So how does it work? 
+ +0:52:51.501 --> 0:53:04.802 +This is our student network, the normal one, +some type of new network. + +0:53:04.802 --> 0:53:06.113 +We're. + +0:53:06.586 --> 0:53:17.050 +So we are training the model to predict the +same thing as we are doing that by calculating. + +0:53:17.437 --> 0:53:23.173 +The cross angry loss was defined in a way +where saying all the probabilities for the + +0:53:23.173 --> 0:53:25.332 +correct word should be as high. + +0:53:25.745 --> 0:53:32.207 +So you are calculating your alphabet probabilities +always, and each time step you have an alphabet + +0:53:32.207 --> 0:53:33.055 +probability. + +0:53:33.055 --> 0:53:38.669 +What is the most probable in the next word +and your training signal is put as much of + +0:53:38.669 --> 0:53:43.368 +your probability mass to the correct word to +the word that is there in. + +0:53:43.903 --> 0:53:51.367 +And this is the chief by this cross entry +loss, which says with some of the all training + +0:53:51.367 --> 0:53:58.664 +examples of all positions, with some of the +full vocabulary, and then this one is this + +0:53:58.664 --> 0:54:03.947 +one that this current word is the case word +in the vocabulary. + +0:54:04.204 --> 0:54:11.339 +And then we take here the lock for the ability +of that, so what we made me do is: We have + +0:54:11.339 --> 0:54:27.313 +this metric here, so each position of your +vocabulary size. + +0:54:27.507 --> 0:54:38.656 +In the end what you just do is some of these +three lock probabilities, and then you want + +0:54:38.656 --> 0:54:40.785 +to have as much. + +0:54:41.041 --> 0:54:54.614 +So although this is a thumb over this metric +here, in the end of each dimension you. + +0:54:54.794 --> 0:55:06.366 +So that is a normal cross end to be lost that +we have discussed at the very beginning of + +0:55:06.366 --> 0:55:07.016 +how. + +0:55:08.068 --> 0:55:15.132 +So what can we do differently in the teacher +network? + +0:55:15.132 --> 0:55:23.374 +We also have a teacher network which is trained +on large data. + +0:55:24.224 --> 0:55:35.957 +And of course this distribution might be better +than the one from the small model because it's. + +0:55:36.456 --> 0:55:40.941 +So in this case we have now the training signal +from the teacher network. + +0:55:41.441 --> 0:55:46.262 +And it's the same way as we had before. + +0:55:46.262 --> 0:55:56.507 +The only difference is we're training not +the ground truths per ability distribution + +0:55:56.507 --> 0:55:59.159 +year, which is sharp. + +0:55:59.299 --> 0:56:11.303 +That's also a probability, so this word has +a high probability, but have some probability. + +0:56:12.612 --> 0:56:19.577 +And that is the main difference. + +0:56:19.577 --> 0:56:30.341 +Typically you do like the interpretation of +these. + +0:56:33.213 --> 0:56:38.669 +Because there's more information contained +in the distribution than in the front booth, + +0:56:38.669 --> 0:56:44.187 +because it encodes more information about the +language, because language always has more + +0:56:44.187 --> 0:56:47.907 +options to put alone, that's the same sentence +yes exactly. + +0:56:47.907 --> 0:56:53.114 +So there's ambiguity in there that is encoded +hopefully very well in the complaint. + +0:56:53.513 --> 0:56:57.257 +Trade you two networks so better than a student +network you have in there from your learner. 
+ +0:56:57.537 --> 0:57:05.961 +So maybe often there's only one correct word, +but it might be two or three, and then all + +0:57:05.961 --> 0:57:10.505 +of these three have a probability distribution. + +0:57:10.590 --> 0:57:21.242 +And then is the main advantage or one explanation +of why it's better to train from the. + +0:57:21.361 --> 0:57:32.652 +Of course, it's good to also keep the signal +in there because then you can prevent it because + +0:57:32.652 --> 0:57:33.493 +crazy. + +0:57:37.017 --> 0:57:49.466 +Any more questions on the first type of knowledge +distillation, also distribution changes. + +0:57:50.550 --> 0:58:02.202 +Coming around again, this would put it a bit +different, so this is not a solution to maintenance + +0:58:02.202 --> 0:58:04.244 +or distribution. + +0:58:04.744 --> 0:58:12.680 +But don't think it's performing worse than +only doing the ground tours because they also. + +0:58:13.113 --> 0:58:21.254 +So it's more like it's not improving you would +assume it's similarly helping you, but. + +0:58:21.481 --> 0:58:28.145 +Of course, if you now have a teacher, maybe +you have no danger on your target to Maine, + +0:58:28.145 --> 0:58:28.524 +but. + +0:58:28.888 --> 0:58:39.895 +Then you can use this one which is not the +ground truth but helpful to learn better for + +0:58:39.895 --> 0:58:42.147 +the distribution. + +0:58:46.326 --> 0:58:57.012 +The second idea is to do sequence level knowledge +distillation, so what we have in this case + +0:58:57.012 --> 0:59:02.757 +is we have looked at each position independently. + +0:59:03.423 --> 0:59:05.436 +Mean, we do that often. + +0:59:05.436 --> 0:59:10.972 +We are not generating a lot of sequences, +but that has a problem. + +0:59:10.972 --> 0:59:13.992 +We have this propagation of errors. + +0:59:13.992 --> 0:59:16.760 +We start with one area and then. + +0:59:17.237 --> 0:59:27.419 +So if we are doing word-level knowledge dissolution, +we are treating each word in the sentence independently. + +0:59:28.008 --> 0:59:32.091 +So we are not trying to like somewhat model +the dependency between. + +0:59:32.932 --> 0:59:47.480 +We can try to do that by sequence level knowledge +dissolution, but the problem is, of course,. + +0:59:47.847 --> 0:59:53.478 +So we can that for each position we can get +a distribution over all the words at this. + +0:59:53.793 --> 1:00:05.305 +But if we want to have a distribution of all +possible target sentences, that's not possible + +1:00:05.305 --> 1:00:06.431 +because. + +1:00:08.508 --> 1:00:15.940 +Area, so we can then again do a bit of a heck +on that. + +1:00:15.940 --> 1:00:23.238 +If we can't have a distribution of all sentences, +it. + +1:00:23.843 --> 1:00:30.764 +So what we can't do is you can not use the +teacher network and sample different translations. + +1:00:31.931 --> 1:00:39.327 +And now we can do different ways to train +them. + +1:00:39.327 --> 1:00:49.343 +We can use them as their probability, the +easiest one to assume. + +1:00:50.050 --> 1:00:56.373 +So what that ends to is that we're taking +our teacher network, we're generating some + +1:00:56.373 --> 1:01:01.135 +translations, and these ones we're using as +additional trading. + +1:01:01.781 --> 1:01:11.382 +Then we have mainly done this sequence level +because the teacher network takes us. + +1:01:11.382 --> 1:01:17.513 +These are all probable translations of the +sentence. 
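To make the word-level distillation above concrete, a minimal numpy sketch of the loss at one target position: an interpolation of the usual cross-entropy against the gold word and the cross-entropy against the teacher's full distribution. The weight lam and the toy distributions are illustrative choices; the sequence-level variant is refined further below.

import numpy as np

def word_level_kd_loss(student_probs, teacher_probs, gold_index, lam=0.5):
    # standard training signal: push probability mass onto the reference word
    ce_gold = -np.log(student_probs[gold_index])
    # distillation signal: match the teacher's distribution over the whole vocabulary
    ce_teacher = -np.sum(teacher_probs * np.log(student_probs))
    return (1.0 - lam) * ce_gold + lam * ce_teacher

vocab = 8
student = np.full(vocab, 1.0 / vocab)                                  # untrained student: uniform
teacher = np.array([0.05, 0.6, 0.25, 0.02, 0.02, 0.02, 0.02, 0.02])    # soft teacher distribution
print(word_level_kd_loss(student, teacher, gold_index=1))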
+ +1:01:26.286 --> 1:01:34.673 +And then you can do a bit of a yeah, and you +can try to better make a bit of an interpolated + +1:01:34.673 --> 1:01:36.206 +version of that. + +1:01:36.716 --> 1:01:42.802 +So what people have also done is like subsequent +level interpolations. + +1:01:42.802 --> 1:01:52.819 +You generate here several translations: But +then you don't use all of them. + +1:01:52.819 --> 1:02:00.658 +You do some metrics on which of these ones. + +1:02:01.021 --> 1:02:12.056 +So it's a bit more training on this brown +chose which might be improbable or unreachable + +1:02:12.056 --> 1:02:16.520 +because we can generate everything. + +1:02:16.676 --> 1:02:23.378 +And we are giving it an easier solution which +is also good quality and training of that. + +1:02:23.703 --> 1:02:32.602 +So you're not training it on a very difficult +solution, but you're training it on an easier + +1:02:32.602 --> 1:02:33.570 +solution. + +1:02:36.356 --> 1:02:38.494 +Any More Questions to This. + +1:02:40.260 --> 1:02:41.557 +Yeah. + +1:02:41.461 --> 1:02:44.296 +Good. + +1:02:43.843 --> 1:03:01.642 +Is to look at the vocabulary, so the problem +is we have seen that vocabulary calculations + +1:03:01.642 --> 1:03:06.784 +are often very presuming. + +1:03:09.789 --> 1:03:19.805 +The thing is that most of the vocabulary is +not needed for each sentence, so in each sentence. + +1:03:20.280 --> 1:03:28.219 +The question is: Can we somehow easily precalculate, +which words are probable to occur in the sentence, + +1:03:28.219 --> 1:03:30.967 +and then only calculate these ones? + +1:03:31.691 --> 1:03:34.912 +And this can be done so. + +1:03:34.912 --> 1:03:43.932 +For example, if you have sentenced card, it's +probably not happening. + +1:03:44.164 --> 1:03:48.701 +So what you can try to do is to limit your +vocabulary. + +1:03:48.701 --> 1:03:51.093 +You're considering for each. + +1:03:51.151 --> 1:04:04.693 +So you're no longer taking the full vocabulary +as possible output, but you're restricting. + +1:04:06.426 --> 1:04:18.275 +That typically works is that we limit it by +the most frequent words we always take because + +1:04:18.275 --> 1:04:23.613 +these are not so easy to align to words. + +1:04:23.964 --> 1:04:32.241 +To take the most treatment taggin' words and +then work that often aligns with one of the + +1:04:32.241 --> 1:04:32.985 +source. + +1:04:33.473 --> 1:04:46.770 +So for each source word you calculate the +word alignment on your training data, and then + +1:04:46.770 --> 1:04:51.700 +you calculate which words occur. + +1:04:52.352 --> 1:04:57.680 +And then for decoding you build this union +of maybe the source word list that other. + +1:04:59.960 --> 1:05:02.145 +Are like for each source work. + +1:05:02.145 --> 1:05:08.773 +One of the most frequent translations of these +source words, for example for each source work + +1:05:08.773 --> 1:05:13.003 +like in the most frequent ones, and then the +most frequent. + +1:05:13.193 --> 1:05:24.333 +In total, if you have short sentences, you +have a lot less words, so in most cases it's + +1:05:24.333 --> 1:05:26.232 +not more than. + +1:05:26.546 --> 1:05:33.957 +And so you have dramatically reduced your +vocabulary, and thereby can also fax a depot. + +1:05:35.495 --> 1:05:43.757 +That easy does anybody see what is challenging +here and why that might not always need. + +1:05:47.687 --> 1:05:54.448 +The performance is not why this might not. + +1:05:54.448 --> 1:06:01.838 +If you implement it, it might not be a strong. 
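+
+[Before that question is answered below, here is a rough sketch of how such a
+per-sentence candidate vocabulary can be built; freq_words and trans_table are
+assumed to be precomputed from corpus counts and word alignments.]
+
+def candidate_vocab(src_tokens, freq_words, trans_table, topk=20):
+    # freq_words: globally most frequent target words, always kept
+    # trans_table: source word -> its most frequent aligned target words,
+    #              extracted from word-aligned training data
+    cand = set(freq_words)
+    for w in src_tokens:
+        cand.update(trans_table.get(w, [])[:topk])
+    return cand  # the output softmax is then computed only over this set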
+ +1:06:01.941 --> 1:06:06.053 +You have to store this list. + +1:06:06.053 --> 1:06:14.135 +You have to burn the union and of course your +safe time. + +1:06:14.554 --> 1:06:21.920 +The second thing the vocabulary is used in +our last step, so we have the hidden state, + +1:06:21.920 --> 1:06:23.868 +and then we calculate. + +1:06:24.284 --> 1:06:29.610 +Now we are not longer calculating them for +all output words, but for a subset of them. + +1:06:30.430 --> 1:06:35.613 +However, this metric multiplication is typically +parallelized with the perfect but good. + +1:06:35.956 --> 1:06:46.937 +But if you not only calculate some of them, +if you're not modeling it right, it will take + +1:06:46.937 --> 1:06:52.794 +as long as before because of the nature of +the. + +1:06:56.776 --> 1:07:07.997 +Here for beam search there's some ideas of +course you can go back to greedy search because + +1:07:07.997 --> 1:07:10.833 +that's more efficient. + +1:07:11.651 --> 1:07:18.347 +And better quality, and you can buffer some +states in between, so how much buffering it's + +1:07:18.347 --> 1:07:22.216 +again this tradeoff between calculation and +memory. + +1:07:25.125 --> 1:07:41.236 +Then at the end of today what we want to look +into is one last type of new machine translation + +1:07:41.236 --> 1:07:42.932 +approach. + +1:07:43.403 --> 1:07:53.621 +And the idea is what we've already seen in +our first two steps is that this ultra aggressive + +1:07:53.621 --> 1:07:57.246 +park is taking community coding. + +1:07:57.557 --> 1:08:04.461 +Can process everything in parallel, but we +are always taking the most probable and then. + +1:08:05.905 --> 1:08:10.476 +The question is: Do we really need to do that? + +1:08:10.476 --> 1:08:14.074 +Therefore, there is a bunch of work. + +1:08:14.074 --> 1:08:16.602 +Can we do it differently? + +1:08:16.602 --> 1:08:19.616 +Can we generate a full target? + +1:08:20.160 --> 1:08:29.417 +We'll see it's not that easy and there's still +an open debate whether this is really faster + +1:08:29.417 --> 1:08:31.832 +and quality, but think. + +1:08:32.712 --> 1:08:45.594 +So, as said, what we have done is our encoder +decoder where we can process our encoder color, + +1:08:45.594 --> 1:08:50.527 +and then the output always depends. + +1:08:50.410 --> 1:08:54.709 +We generate the output and then we have to +put it here the wide because then everything + +1:08:54.709 --> 1:08:56.565 +depends on the purpose of the output. + +1:08:56.916 --> 1:09:10.464 +This is what is referred to as an outer-regressive +model and nearly outs speech generation and + +1:09:10.464 --> 1:09:16.739 +language generation or works in this outer. + +1:09:18.318 --> 1:09:21.132 +So the motivation is, can we do that more +efficiently? + +1:09:21.361 --> 1:09:31.694 +And can we somehow process all target words +in parallel? + +1:09:31.694 --> 1:09:41.302 +So instead of doing it one by one, we are +inputting. + +1:09:45.105 --> 1:09:46.726 +So how does it work? + +1:09:46.726 --> 1:09:50.587 +So let's first have a basic auto regressive +mode. + +1:09:50.810 --> 1:09:53.551 +So the encoder looks as it is before. + +1:09:53.551 --> 1:09:58.310 +That's maybe not surprising because here we +know we can paralyze. + +1:09:58.618 --> 1:10:04.592 +So we have put in here our ink holder and +generated the ink stash, so that's exactly + +1:10:04.592 --> 1:10:05.295 +the same. 
+ +1:10:05.845 --> 1:10:16.229 +However, now we need to do one more thing: +One challenge is what we had before and that's + +1:10:16.229 --> 1:10:26.799 +a challenge of natural language generation +like machine translation. + +1:10:32.672 --> 1:10:38.447 +We generate until we generate this out of +end of center stock, but if we now generate + +1:10:38.447 --> 1:10:44.625 +everything at once that's no longer possible, +so we cannot generate as long because we only + +1:10:44.625 --> 1:10:45.632 +generated one. + +1:10:46.206 --> 1:10:58.321 +So the question is how can we now determine +how long the sequence is, and we can also accelerate. + +1:11:00.000 --> 1:11:06.384 +Yes, but there would be one idea, and there +is other work which tries to do that. + +1:11:06.806 --> 1:11:15.702 +However, in here there's some work already +done before and maybe you remember we had the + +1:11:15.702 --> 1:11:20.900 +IBM models and there was this concept of fertility. + +1:11:21.241 --> 1:11:26.299 +The concept of fertility is means like for +one saucepan, and how many target pores does + +1:11:26.299 --> 1:11:27.104 +it translate? + +1:11:27.847 --> 1:11:34.805 +And exactly that we try to do here, and that +means we are calculating like at the top we + +1:11:34.805 --> 1:11:36.134 +are calculating. + +1:11:36.396 --> 1:11:42.045 +So it says word is translated into word. + +1:11:42.045 --> 1:11:54.171 +Word might be translated into words into, +so we're trying to predict in how many words. + +1:11:55.935 --> 1:12:10.314 +And then the end of the anchor, so this is +like a length estimation. + +1:12:10.314 --> 1:12:15.523 +You can do it otherwise. + +1:12:16.236 --> 1:12:24.526 +You initialize your decoder input and we know +it's good with word embeddings so we're trying + +1:12:24.526 --> 1:12:28.627 +to do the same thing and what people then do. + +1:12:28.627 --> 1:12:35.224 +They initialize it again with word embedding +but in the frequency of the. + +1:12:35.315 --> 1:12:36.460 +So we have the cartilage. + +1:12:36.896 --> 1:12:47.816 +So one has two, so twice the is and then one +is, so that is then our initialization. + +1:12:48.208 --> 1:12:57.151 +In other words, if you don't predict fertilities +but predict lengths, you can just initialize + +1:12:57.151 --> 1:12:57.912 +second. + +1:12:58.438 --> 1:13:07.788 +This often works a bit better, but that's +the other. + +1:13:07.788 --> 1:13:16.432 +Now you have everything in training and testing. + +1:13:16.656 --> 1:13:18.621 +This is all available at once. + +1:13:20.280 --> 1:13:31.752 +Then we can generate everything in parallel, +so we have the decoder stack, and that is now + +1:13:31.752 --> 1:13:33.139 +as before. + +1:13:35.395 --> 1:13:41.555 +And then we're doing the translation predictions +here on top of it in order to do. + +1:13:43.083 --> 1:13:59.821 +And then we are predicting here the target +words and once predicted, and that is the basic + +1:13:59.821 --> 1:14:00.924 +idea. + +1:14:01.241 --> 1:14:08.171 +Machine translation: Where the idea is, we +don't have to do one by one what we're. + +1:14:10.210 --> 1:14:13.900 +So this looks really, really, really great. + +1:14:13.900 --> 1:14:20.358 +On the first view there's one challenge with +this, and this is the baseline. + +1:14:20.358 --> 1:14:27.571 +Of course there's some improvements, but in +general the quality is often significant. + +1:14:28.068 --> 1:14:32.075 +So here you see the baseline models. 
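+
+[A minimal sketch of the fertility-based initialization described above: every
+source embedding is copied as often as its predicted fertility says, which both
+fixes the target length and gives the parallel decoder something to start from.
+Shapes and values are made up for illustration; the results discussion continues below.]
+
+import torch
+
+src_emb = torch.randn(4, 512)            # one sentence with 4 source embeddings
+fertility = torch.tensor([1, 2, 0, 1])   # predicted number of copies per source word
+dec_input = torch.repeat_interleave(src_emb, fertility, dim=0)
+# dec_input has 1 + 2 + 0 + 1 = 4 rows and thereby also defines the output length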
+ +1:14:32.075 --> 1:14:38.466 +You have a loss of ten blue points or something +like that. + +1:14:38.878 --> 1:14:40.230 +So why does it change? + +1:14:40.230 --> 1:14:41.640 +So why is it happening? + +1:14:43.903 --> 1:14:56.250 +If you look at the errors there is repetitive +tokens, so you have like or things like that. + +1:14:56.536 --> 1:15:01.995 +Broken senses or influent senses, so that +exactly where algebra aggressive models are + +1:15:01.995 --> 1:15:04.851 +very good, we say that's a bit of a problem. + +1:15:04.851 --> 1:15:07.390 +They generate very fluid transcription. + +1:15:07.387 --> 1:15:10.898 +Translation: Sometimes there doesn't have +to do anything with the input. + +1:15:11.411 --> 1:15:14.047 +But generally it really looks always very +fluid. + +1:15:14.995 --> 1:15:20.865 +Here exactly the opposite, so the problem +is that we don't have really fluid translation. + +1:15:21.421 --> 1:15:26.123 +And that is mainly due to the challenge that +we have this independent assumption. + +1:15:26.646 --> 1:15:35.873 +So in this case, the probability of Y of the +second position is independent of the probability + +1:15:35.873 --> 1:15:40.632 +of X, so we don't know what was there generated. + +1:15:40.632 --> 1:15:43.740 +We're just generating it there. + +1:15:43.964 --> 1:15:55.439 +You can see it also in a bit of examples. + +1:15:55.439 --> 1:16:03.636 +You can over-panelize shifts. + +1:16:04.024 --> 1:16:10.566 +And the problem is this is already an improvement +again, but this is also similar to. + +1:16:11.071 --> 1:16:19.900 +So you can, for example, translate heeded +back, or maybe you could also translate it + +1:16:19.900 --> 1:16:31.105 +with: But on their feeling down in feeling +down, if the first position thinks of their + +1:16:31.105 --> 1:16:34.594 +feeling done and the second. + +1:16:35.075 --> 1:16:42.908 +So each position here and that is one of the +main issues here doesn't know what the other. + +1:16:43.243 --> 1:16:53.846 +And for example, if you are translating something +with, you can often translate things in two + +1:16:53.846 --> 1:16:58.471 +ways: German with a different agreement. + +1:16:58.999 --> 1:17:02.047 +And then here where you have to decide do +you have to use jewelry. + +1:17:02.162 --> 1:17:05.460 +Interpretator: It doesn't know which word +it has to select. + +1:17:06.086 --> 1:17:14.789 +Mean, of course, it knows a hidden state, +but in the end you have a liability distribution. + +1:17:16.256 --> 1:17:20.026 +And that is the important thing in the outer +regressive month. + +1:17:20.026 --> 1:17:24.335 +You know that because you have put it in you +here, you don't know that. + +1:17:24.335 --> 1:17:29.660 +If it's equal probable here to two, you don't +Know Which Is Selected, and of course that + +1:17:29.660 --> 1:17:32.832 +depends on what should be the latest traction +under. + +1:17:33.333 --> 1:17:39.554 +Yep, that's the undershift, and we're going +to last last the next time. + +1:17:39.554 --> 1:17:39.986 +Yes. + +1:17:40.840 --> 1:17:44.935 +Doesn't this also appear in and like now we're +talking about physical training? + +1:17:46.586 --> 1:17:48.412 +The thing is in the auto regress. + +1:17:48.412 --> 1:17:50.183 +If you give it the correct one,. + +1:17:50.450 --> 1:17:55.827 +So if you predict here comma what the reference +is feeling then you tell the model here. + +1:17:55.827 --> 1:17:59.573 +The last one was feeling and then it knows +it has to be done. 
+ +1:17:59.573 --> 1:18:04.044 +But here it doesn't know that because it doesn't +get as input as a right. + +1:18:04.204 --> 1:18:24.286 +Yes, that's a bit depending on what. + +1:18:24.204 --> 1:18:27.973 +But in training, of course, you just try to +make the highest one the current one. + +1:18:31.751 --> 1:18:38.181 +So what you can do is things like CDC loss +which can adjust for this. + +1:18:38.181 --> 1:18:42.866 +So then you can also have this shifted correction. + +1:18:42.866 --> 1:18:50.582 +If you're doing this type of correction in +the CDC loss you don't get full penalty. + +1:18:50.930 --> 1:18:58.486 +Just shifted by one, so it's a bit of a different +loss, which is mainly used in, but. + +1:19:00.040 --> 1:19:03.412 +It can be used in order to address this problem. + +1:19:04.504 --> 1:19:13.844 +The other problem is that outer regressively +we have the label buyers that tries to disimmigrate. + +1:19:13.844 --> 1:19:20.515 +That's the example did before was if you translate +thank you to Dung. + +1:19:20.460 --> 1:19:31.925 +And then it might end up because it learns +in the first position and the second also. + +1:19:32.492 --> 1:19:43.201 +In order to prevent that, it would be helpful +for one output, only one output, so that makes + +1:19:43.201 --> 1:19:47.002 +the system already better learn. + +1:19:47.227 --> 1:19:53.867 +Might be that for slightly different inputs +you have different outputs, but for the same. + +1:19:54.714 --> 1:19:57.467 +That we can luckily very easily solve. + +1:19:59.119 --> 1:19:59.908 +And it's done. + +1:19:59.908 --> 1:20:04.116 +We just learned the technique about it, which +is called knowledge distillation. + +1:20:04.985 --> 1:20:13.398 +So what we can do and the easiest solution +to prove your non-autoregressive model is to + +1:20:13.398 --> 1:20:16.457 +train an auto regressive model. + +1:20:16.457 --> 1:20:22.958 +Then you decode your whole training gamer +with this model and then. + +1:20:23.603 --> 1:20:27.078 +While the main advantage of that is that this +is more consistent,. + +1:20:27.407 --> 1:20:33.995 +So for the same input you always have the +same output. + +1:20:33.995 --> 1:20:41.901 +So you have to make your training data more +consistent and learn. + +1:20:42.482 --> 1:20:54.471 +So there is another advantage of knowledge +distillation and that advantage is you have + +1:20:54.471 --> 1:20:59.156 +more consistent training signals. + +1:21:04.884 --> 1:21:10.287 +There's another to make the things more easy +at the beginning. + +1:21:10.287 --> 1:21:16.462 +There's this plants model, black model where +you put in parts of input. + +1:21:16.756 --> 1:21:26.080 +So during training, especially at the beginning, +you give some correct solutions at the beginning. + +1:21:28.468 --> 1:21:38.407 +And there is this tokens at a time, so the +idea is to establish other regressive training. + +1:21:40.000 --> 1:21:50.049 +And some targets are open, so you always predict +only like first auto regression is K. + +1:21:50.049 --> 1:21:59.174 +It puts one, so you always have one input +and one output, then you do partial. + +1:21:59.699 --> 1:22:05.825 +So in that way you can slowly learn what is +a good and what is a bad answer. + +1:22:08.528 --> 1:22:10.862 +It doesn't sound very impressive. + +1:22:10.862 --> 1:22:12.578 +Don't contact me anyway. + +1:22:12.578 --> 1:22:15.323 +Go all over your training data several. + +1:22:15.875 --> 1:22:20.655 +You can even switch in between. 
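+
+[The CTC-style loss mentioned above can be sketched with PyTorch's built-in
+nn.CTCLoss: the decoder emits T positions (at least as many as the reference has
+tokens) and outputs that are merely shifted are no longer fully penalized.
+The sizes below are made up for illustration.]
+
+import torch
+import torch.nn as nn
+
+T, N, V = 12, 2, 100                              # output positions, batch, vocab (id 0 = blank)
+log_probs = torch.randn(T, N, V).log_softmax(-1)  # non-autoregressive decoder outputs
+targets = torch.randint(1, V, (N, 6))             # reference token ids (no blanks)
+input_lengths = torch.full((N,), T, dtype=torch.long)
+target_lengths = torch.full((N,), 6, dtype=torch.long)
+loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)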
+ +1:22:20.655 --> 1:22:29.318 +There is a homework on this thing where you +try to start. + +1:22:31.271 --> 1:22:41.563 +You have to learn so there's a whole work +on that so this is often happening and it doesn't + +1:22:41.563 --> 1:22:46.598 +mean it's less efficient but still it helps. + +1:22:49.389 --> 1:22:57.979 +For later maybe here are some examples of +how much things help. + +1:22:57.979 --> 1:23:04.958 +Maybe one point here is that it's really important. + +1:23:05.365 --> 1:23:13.787 +Here's the translation performance and speed. + +1:23:13.787 --> 1:23:24.407 +One point which is a point is if you compare +researchers. + +1:23:24.784 --> 1:23:33.880 +So yeah, if you're compared to one very weak +baseline transformer even with beam search, + +1:23:33.880 --> 1:23:40.522 +then you're ten times slower than a very strong +auto regressive. + +1:23:40.961 --> 1:23:48.620 +If you make a strong baseline then it's going +down to depending on times and here like: You + +1:23:48.620 --> 1:23:53.454 +have a lot of different speed ups. + +1:23:53.454 --> 1:24:03.261 +Generally, it makes a strong baseline and +not very simple transformer. + +1:24:07.407 --> 1:24:20.010 +Yeah, with this one last thing that you can +do to speed up things and also reduce your + +1:24:20.010 --> 1:24:25.950 +memory is what is called half precision. + +1:24:26.326 --> 1:24:29.139 +And especially for decoding issues for training. + +1:24:29.139 --> 1:24:31.148 +Sometimes it also gets less stale. + +1:24:32.592 --> 1:24:45.184 +With this we close nearly wait a bit, so what +you should remember is that efficient machine + +1:24:45.184 --> 1:24:46.963 +translation. + +1:24:47.007 --> 1:24:51.939 +We have, for example, looked at knowledge +distillation. + +1:24:51.939 --> 1:24:55.991 +We have looked at non auto regressive models. + +1:24:55.991 --> 1:24:57.665 +We have different. + +1:24:58.898 --> 1:25:02.383 +For today and then only requests. + +1:25:02.383 --> 1:25:08.430 +So if you haven't done so, please fill out +the evaluation. + +1:25:08.388 --> 1:25:20.127 +So now if you have done so think then you +should have and with the online people hopefully. + +1:25:20.320 --> 1:25:29.758 +Only possibility to tell us what things are +good and what not the only one but the most + +1:25:29.758 --> 1:25:30.937 +efficient. + +1:25:31.851 --> 1:25:35.871 +So think of all the students doing it in this +case okay and then thank. + +0:00:01.921 --> 0:00:16.424 +Hey welcome to today's lecture, what we today +want to look at is how we can make new. + +0:00:16.796 --> 0:00:26.458 +So until now we have this global system, the +encoder and the decoder mostly, and we haven't + +0:00:26.458 --> 0:00:29.714 +really thought about how long. + +0:00:30.170 --> 0:00:42.684 +And what we, for example, know is yeah, you +can make the systems bigger in different ways. + +0:00:42.684 --> 0:00:47.084 +We can make them deeper so the. + +0:00:47.407 --> 0:00:56.331 +And if we have at least enough data that typically +helps you make things performance better,. + +0:00:56.576 --> 0:01:00.620 +But of course leads to problems that we need +more resources. + +0:01:00.620 --> 0:01:06.587 +That is a problem at universities where we +have typically limited computation capacities. + +0:01:06.587 --> 0:01:11.757 +So at some point you have such big models +that you cannot train them anymore. 
+ +0:01:13.033 --> 0:01:23.792 +And also for companies is of course important +if it costs you like to generate translation + +0:01:23.792 --> 0:01:26.984 +just by power consumption. + +0:01:27.667 --> 0:01:35.386 +So yeah, there's different reasons why you +want to do efficient machine translation. + +0:01:36.436 --> 0:01:48.338 +One reason is there are different ways of +how you can improve your machine translation + +0:01:48.338 --> 0:01:50.527 +system once we. + +0:01:50.670 --> 0:01:55.694 +There can be different types of data we looked +into data crawling, monolingual data. + +0:01:55.875 --> 0:01:59.024 +All this data and the aim is always. + +0:01:59.099 --> 0:02:05.735 +Of course, we are not just purely interested +in having more data, but the idea why we want + +0:02:05.735 --> 0:02:12.299 +to have more data is that more data also means +that we have better quality because mostly + +0:02:12.299 --> 0:02:17.550 +we are interested in increasing the quality +of the machine translation. + +0:02:18.838 --> 0:02:24.892 +But there's also other ways of how you can +improve the quality of a machine translation. + +0:02:25.325 --> 0:02:36.450 +And what is, of course, that is where most +research is focusing on. + +0:02:36.450 --> 0:02:44.467 +It means all we want to build better algorithms. + +0:02:44.684 --> 0:02:48.199 +Course: The other things are normally as good. + +0:02:48.199 --> 0:02:54.631 +Sometimes it's easier to improve, so often +it's easier to just collect more data than + +0:02:54.631 --> 0:02:57.473 +to invent some great view algorithms. + +0:02:57.473 --> 0:03:00.315 +But yeah, both of them are important. + +0:03:00.920 --> 0:03:09.812 +But there is this third thing, especially +with neural machine translation, and that means + +0:03:09.812 --> 0:03:11.590 +we make a bigger. + +0:03:11.751 --> 0:03:16.510 +Can be, as said, that we have more layers, +that we have wider layers. + +0:03:16.510 --> 0:03:19.977 +The other thing we talked a bit about is ensemble. + +0:03:19.977 --> 0:03:24.532 +That means we are not building one new machine +translation system. + +0:03:24.965 --> 0:03:27.505 +And we can easily build four. + +0:03:27.505 --> 0:03:32.331 +What is the typical strategy to build different +systems? + +0:03:32.331 --> 0:03:33.177 +Remember. + +0:03:35.795 --> 0:03:40.119 +It should be of course a bit different if +you have the same. + +0:03:40.119 --> 0:03:44.585 +If they all predict the same then combining +them doesn't help. + +0:03:44.585 --> 0:03:48.979 +So what is the easiest way if you have to +build four systems? + +0:03:51.711 --> 0:04:01.747 +And the Charleston's will take, but this is +the best output of a single system. + +0:04:02.362 --> 0:04:10.165 +Mean now, it's really three different systems +so that you later can combine them and maybe + +0:04:10.165 --> 0:04:11.280 +the average. + +0:04:11.280 --> 0:04:16.682 +Ensembles are typically that the average is +all probabilities. + +0:04:19.439 --> 0:04:24.227 +The idea is to think about neural networks. + +0:04:24.227 --> 0:04:29.342 +There's one parameter which can easily adjust. + +0:04:29.342 --> 0:04:36.525 +That's exactly the easiest way to randomize +with three different. + +0:04:37.017 --> 0:04:43.119 +They have the same architecture, so all the +hydroparameters are the same, but they are + +0:04:43.119 --> 0:04:43.891 +different. + +0:04:43.891 --> 0:04:46.556 +They will have different predictions. + +0:04:48.228 --> 0:04:52.572 +So, of course, bigger amounts. 
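+
+[A minimal sketch of ensembling models that only differ in their random seed:
+at every decoding step the output distributions are averaged. The models are
+assumed to be callables that return next-word logits for the current state.]
+
+import torch
+import torch.nn.functional as F
+
+def ensemble_next_word_probs(models, decoder_state):
+    # average the next-word distributions of models trained with different seeds
+    probs = [F.softmax(m(decoder_state), dim=-1) for m in models]
+    return torch.stack(probs).mean(dim=0)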
+ +0:04:52.572 --> 0:05:05.325 +Some of these are a bit the easiest way of +improving your quality because you don't really + +0:05:05.325 --> 0:05:08.268 +have to do anything. + +0:05:08.588 --> 0:05:12.588 +There is limits on that bigger models only +get better. + +0:05:12.588 --> 0:05:19.132 +If you have enough training data you can't +do like a handheld layer and you will not work + +0:05:19.132 --> 0:05:24.877 +on very small data but with a recent amount +of data that is the easiest thing. + +0:05:25.305 --> 0:05:33.726 +However, they are challenging with making +better models, bigger motors, and that is the + +0:05:33.726 --> 0:05:34.970 +computation. + +0:05:35.175 --> 0:05:44.482 +So, of course, if you have a bigger model +that can mean that you have longer running + +0:05:44.482 --> 0:05:49.518 +times, if you have models, you have to times. + +0:05:51.171 --> 0:05:56.685 +Normally you cannot paralyze the different +layers because the input to one layer is always + +0:05:56.685 --> 0:06:02.442 +the output of the previous layer, so you propagate +that so it will also increase your runtime. + +0:06:02.822 --> 0:06:10.720 +Then you have to store all your models in +memory. + +0:06:10.720 --> 0:06:20.927 +If you have double weights you will have: +Is more difficult to then do back propagation. + +0:06:20.927 --> 0:06:27.680 +You have to store in between the activations, +so there's not only do you increase the model + +0:06:27.680 --> 0:06:31.865 +in your memory, but also all these other variables +that. + +0:06:34.414 --> 0:06:36.734 +And so in general it is more expensive. + +0:06:37.137 --> 0:06:54.208 +And therefore there's good reasons in looking +into can we make these models sound more efficient. + +0:06:54.134 --> 0:07:00.982 +So it's been through the viewer, you can have +it okay, have one and one day of training time, + +0:07:00.982 --> 0:07:01.274 +or. + +0:07:01.221 --> 0:07:07.535 +Forty thousand euros and then what is the +best machine translation system I can get within + +0:07:07.535 --> 0:07:08.437 +this budget. + +0:07:08.969 --> 0:07:19.085 +And then, of course, you can make the models +bigger, but then you have to train them shorter, + +0:07:19.085 --> 0:07:24.251 +and then we can make more efficient algorithms. + +0:07:25.925 --> 0:07:31.699 +If you think about efficiency, there's a bit +different scenarios. + +0:07:32.312 --> 0:07:43.635 +So if you're more of coming from the research +community, what you'll be doing is building + +0:07:43.635 --> 0:07:47.913 +a lot of models in your research. + +0:07:48.088 --> 0:07:58.645 +So you're having your test set of maybe sentences, +calculating the blue score, then another model. + +0:07:58.818 --> 0:08:08.911 +So what that means is typically you're training +on millions of cents, so your training time + +0:08:08.911 --> 0:08:14.944 +is long, maybe a day, but maybe in other cases +a week. + +0:08:15.135 --> 0:08:22.860 +The testing is not really the cost efficient, +but the training is very costly. + +0:08:23.443 --> 0:08:37.830 +If you are more thinking of building models +for application, the scenario is quite different. + +0:08:38.038 --> 0:08:46.603 +And then you keep it running, and maybe thousands +of customers are using it in translating. + +0:08:46.603 --> 0:08:47.720 +So in that. + +0:08:48.168 --> 0:08:59.577 +And we will see that it is not always the +same type of challenges you can paralyze some + +0:08:59.577 --> 0:09:07.096 +things in training, which you cannot paralyze +in testing. 
+ +0:09:07.347 --> 0:09:14.124 +For example, in training you have to do back +propagation, so you have to store the activations. + +0:09:14.394 --> 0:09:23.901 +Therefore, in testing we briefly discussed +that we would do it in more detail today in + +0:09:23.901 --> 0:09:24.994 +training. + +0:09:25.265 --> 0:09:36.100 +You know they're a target and you can process +everything in parallel while in testing. + +0:09:36.356 --> 0:09:46.741 +So you can only do one word at a time, and +so you can less paralyze this. + +0:09:46.741 --> 0:09:50.530 +Therefore, it's important. + +0:09:52.712 --> 0:09:55.347 +Is a specific task on this. + +0:09:55.347 --> 0:10:03.157 +For example, it's the efficiency task where +it's about making things as efficient. + +0:10:03.123 --> 0:10:09.230 +Is possible and they can look at different +resources. + +0:10:09.230 --> 0:10:14.207 +So how much deep fuel run time do you need? + +0:10:14.454 --> 0:10:19.366 +See how much memory you need or you can have +a fixed memory budget and then have to build + +0:10:19.366 --> 0:10:20.294 +the best system. + +0:10:20.500 --> 0:10:29.010 +And here is a bit like an example of that, +so there's three teams from Edinburgh from + +0:10:29.010 --> 0:10:30.989 +and they submitted. + +0:10:31.131 --> 0:10:36.278 +So then, of course, if you want to know the +most efficient system you have to do a bit + +0:10:36.278 --> 0:10:36.515 +of. + +0:10:36.776 --> 0:10:44.656 +You want to have a better quality or more +runtime and there's not the one solution. + +0:10:44.656 --> 0:10:46.720 +You can improve your. + +0:10:46.946 --> 0:10:49.662 +And that you see that there are different +systems. + +0:10:49.909 --> 0:11:06.051 +Here is how many words you can do for a second +on the clock, and you want to be as talk as + +0:11:06.051 --> 0:11:07.824 +possible. + +0:11:08.068 --> 0:11:08.889 +And you see here a bit. + +0:11:08.889 --> 0:11:09.984 +This is a little bit different. + +0:11:11.051 --> 0:11:27.717 +You want to be there on the top right corner +and you can get a score of something between + +0:11:27.717 --> 0:11:29.014 +words. + +0:11:30.250 --> 0:11:34.161 +Two hundred and fifty thousand, then you'll +ever come and score zero point three. + +0:11:34.834 --> 0:11:41.243 +There is, of course, any bit of a decision, +but the question is, like how far can you again? + +0:11:41.243 --> 0:11:47.789 +Some of all these points on this line would +be winners because they are somehow most efficient + +0:11:47.789 --> 0:11:53.922 +in a way that there's no system which achieves +the same quality with less computational. + +0:11:57.657 --> 0:12:04.131 +So there's the one question of which resources +are you interested. + +0:12:04.131 --> 0:12:07.416 +Are you running it on CPU or GPU? + +0:12:07.416 --> 0:12:11.668 +There's different ways of paralyzing stuff. + +0:12:14.654 --> 0:12:20.777 +Another dimension is how you process your +data. + +0:12:20.777 --> 0:12:27.154 +There's really the best processing and streaming. + +0:12:27.647 --> 0:12:34.672 +So in batch processing you have the whole +document available so you can translate all + +0:12:34.672 --> 0:12:39.981 +sentences in perimeter and then you're interested +in throughput. + +0:12:40.000 --> 0:12:43.844 +But you can then process, for example, especially +in GPS. 
+ +0:12:43.844 --> 0:12:49.810 +That's interesting, you're not translating +one sentence at a time, but you're translating + +0:12:49.810 --> 0:12:56.108 +one hundred sentences or so in parallel, so +you have one more dimension where you can paralyze + +0:12:56.108 --> 0:12:57.964 +and then be more efficient. + +0:12:58.558 --> 0:13:14.863 +On the other hand, for example sorts of documents, +so we learned that if you do badge processing + +0:13:14.863 --> 0:13:16.544 +you have. + +0:13:16.636 --> 0:13:24.636 +Then, of course, it makes sense to sort the +sentences in order to have the minimum thing + +0:13:24.636 --> 0:13:25.535 +attached. + +0:13:27.427 --> 0:13:32.150 +The other scenario is more the streaming scenario +where you do life translation. + +0:13:32.512 --> 0:13:40.212 +So in that case you can't wait for the whole +document to pass, but you have to do. + +0:13:40.520 --> 0:13:49.529 +And then, for example, that's especially in +situations like speech translation, and then + +0:13:49.529 --> 0:13:53.781 +you're interested in things like latency. + +0:13:53.781 --> 0:14:00.361 +So how much do you have to wait to get the +output of a sentence? + +0:14:06.566 --> 0:14:16.956 +Finally, there is the thing about the implementation: +Today we're mainly looking at different algorithms, + +0:14:16.956 --> 0:14:23.678 +different models of how you can model them +in your machine translation system, but of + +0:14:23.678 --> 0:14:29.227 +course for the same algorithms there's also +different implementations. + +0:14:29.489 --> 0:14:38.643 +So, for example, for a machine translation +this tool could be very fast. + +0:14:38.638 --> 0:14:46.615 +So they have like coded a lot of the operations +very low resource, not low resource, low level + +0:14:46.615 --> 0:14:49.973 +on the directly on the QDAC kernels in. + +0:14:50.110 --> 0:15:00.948 +So the same attention network is typically +more efficient in that type of algorithm. + +0:15:00.880 --> 0:15:02.474 +Than in in any other. + +0:15:03.323 --> 0:15:13.105 +Of course, it might be other disadvantages, +so if you're a little worker or have worked + +0:15:13.105 --> 0:15:15.106 +in the practical. + +0:15:15.255 --> 0:15:22.604 +Because it's normally easier to understand, +easier to change, and so on, but there is again + +0:15:22.604 --> 0:15:23.323 +a train. + +0:15:23.483 --> 0:15:29.440 +You have to think about, do you want to include +this into my study or comparison or not? + +0:15:29.440 --> 0:15:36.468 +Should it be like I compare different implementations +and I also find the most efficient implementation? + +0:15:36.468 --> 0:15:39.145 +Or is it only about the pure algorithm? + +0:15:42.742 --> 0:15:50.355 +Yeah, when building these systems there is +a different trade-off to do. + +0:15:50.850 --> 0:15:56.555 +So there's one of the traders between memory +and throughput, so how many words can generate + +0:15:56.555 --> 0:15:57.299 +per second. + +0:15:57.557 --> 0:16:03.351 +So typically you can easily like increase +your scruple by increasing the batch size. + +0:16:03.643 --> 0:16:06.899 +So that means you are translating more sentences +in parallel. + +0:16:07.107 --> 0:16:09.241 +And gypsies are very good at that stuff. + +0:16:09.349 --> 0:16:15.161 +It should translate one sentence or one hundred +sentences, not the same time, but its. 
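+
+[A small sketch of the batch-processing idea mentioned above: sorting sentences
+by length before batching, so that sentences inside a batch have similar lengths
+and little computation is wasted on padding. In a streaming, latency-critical
+setting this is not possible, because one cannot wait for the whole document.]
+
+def length_sorted_batches(sentences, batch_size):
+    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
+    for start in range(0, len(order), batch_size):
+        batch = [sentences[i] for i in order[start:start + batch_size]]
+        yield batch  # keep `order` around to restore the original sentence order later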
+ +0:16:15.115 --> 0:16:20.997 +Rough are very similar because they have these +efficient metrics multiplication so that you + +0:16:20.997 --> 0:16:24.386 +can do the same operation on all sentences +parallel. + +0:16:24.386 --> 0:16:30.141 +So typically that means if you increase your +benchmark you can do more things in parallel + +0:16:30.141 --> 0:16:31.995 +and you will translate more. + +0:16:31.952 --> 0:16:33.370 +Second. + +0:16:33.653 --> 0:16:43.312 +On the other hand, with this advantage, of +course you will need higher badge sizes and + +0:16:43.312 --> 0:16:44.755 +more memory. + +0:16:44.965 --> 0:16:56.452 +To begin with, the other problem is that you +have such big models that you can only translate + +0:16:56.452 --> 0:16:59.141 +with lower bed sizes. + +0:16:59.119 --> 0:17:08.466 +If you are running out of memory with translating, +one idea to go on that is to decrease your. + +0:17:13.453 --> 0:17:24.456 +Then there is the thing about quality in Screwport, +of course, and before it's like larger models, + +0:17:24.456 --> 0:17:28.124 +but in generally higher quality. + +0:17:28.124 --> 0:17:31.902 +The first one is always this way. + +0:17:32.092 --> 0:17:38.709 +Course: Not always larger model helps you +have over fitting at some point, but in generally. + +0:17:43.883 --> 0:17:52.901 +And with this a bit on this training and testing +thing we had before. + +0:17:53.113 --> 0:17:58.455 +So it wears all the difference between training +and testing, and for the encoder and decoder. + +0:17:58.798 --> 0:18:06.992 +So if we are looking at what mentioned before +at training time, we have a source sentence + +0:18:06.992 --> 0:18:17.183 +here: And how this is processed on a is not +the attention here. + +0:18:17.183 --> 0:18:21.836 +That's a tubical transformer. + +0:18:22.162 --> 0:18:31.626 +And how we can do that on a is that we can +paralyze the ear ever since. + +0:18:31.626 --> 0:18:40.422 +The first thing to know is: So that is, of +course, not in all cases. + +0:18:40.422 --> 0:18:49.184 +We'll later talk about speech translation +where we might want to translate. + +0:18:49.389 --> 0:18:56.172 +Without the general case in, it's like you +have the full sentence you want to translate. + +0:18:56.416 --> 0:19:02.053 +So the important thing is we are here everything +available on the source side. + +0:19:03.323 --> 0:19:13.524 +And then this was one of the big advantages +that you can remember back of transformer. + +0:19:13.524 --> 0:19:15.752 +There are several. + +0:19:16.156 --> 0:19:25.229 +But the other one is now that we can calculate +the full layer. + +0:19:25.645 --> 0:19:29.318 +There is no dependency between this and this +state or this and this state. + +0:19:29.749 --> 0:19:36.662 +So we always did like here to calculate the +key value and query, and based on that you + +0:19:36.662 --> 0:19:37.536 +calculate. + +0:19:37.937 --> 0:19:46.616 +Which means we can do all these calculations +here in parallel and in parallel. + +0:19:48.028 --> 0:19:55.967 +And there, of course, is this very efficiency +because again for GPS it's too bigly possible + +0:19:55.967 --> 0:20:00.887 +to do these things in parallel and one after +each other. + +0:20:01.421 --> 0:20:10.311 +And then we can also for each layer one by +one, and then we calculate here the encoder. 
+ +0:20:10.790 --> 0:20:21.921 +In training now an important thing is that +for the decoder we have the full sentence available + +0:20:21.921 --> 0:20:28.365 +because we know this is the target we should +generate. + +0:20:29.649 --> 0:20:33.526 +We have models now in a different way. + +0:20:33.526 --> 0:20:38.297 +This hidden state is only on the previous +ones. + +0:20:38.598 --> 0:20:51.887 +And the first thing here depends only on this +information, so you see if you remember we + +0:20:51.887 --> 0:20:56.665 +had this masked self-attention. + +0:20:56.896 --> 0:21:04.117 +So that means, of course, we can only calculate +the decoder once the encoder is done, but that's. + +0:21:04.444 --> 0:21:06.656 +Percent can calculate the end quarter. + +0:21:06.656 --> 0:21:08.925 +Then we can calculate here the decoder. + +0:21:09.569 --> 0:21:25.566 +But again in training we have x, y and that +is available so we can calculate everything + +0:21:25.566 --> 0:21:27.929 +in parallel. + +0:21:28.368 --> 0:21:40.941 +So the interesting thing or advantage of transformer +is in training. + +0:21:40.941 --> 0:21:46.408 +We can do it for the decoder. + +0:21:46.866 --> 0:21:54.457 +That means you will have more calculations +because you can only calculate one layer at + +0:21:54.457 --> 0:22:02.310 +a time, but for example the length which is +too bigly quite long or doesn't really matter + +0:22:02.310 --> 0:22:03.270 +that much. + +0:22:05.665 --> 0:22:10.704 +However, in testing this situation is different. + +0:22:10.704 --> 0:22:13.276 +In testing we only have. + +0:22:13.713 --> 0:22:20.622 +So this means we start with a sense: We don't +know the full sentence yet because we ought + +0:22:20.622 --> 0:22:29.063 +to regularly generate that so for the encoder +we have the same here but for the decoder. + +0:22:29.409 --> 0:22:39.598 +In this case we only have the first and the +second instinct, but only for all states in + +0:22:39.598 --> 0:22:40.756 +parallel. + +0:22:41.101 --> 0:22:51.752 +And then we can do the next step for y because +we are putting our most probable one. + +0:22:51.752 --> 0:22:58.643 +We do greedy search or beam search, but you +cannot do. + +0:23:03.663 --> 0:23:16.838 +Yes, so if we are interesting in making things +more efficient for testing, which we see, for + +0:23:16.838 --> 0:23:22.363 +example in the scenario of really our. + +0:23:22.642 --> 0:23:34.286 +It makes sense that we think about our architecture +and that we are currently working on attention + +0:23:34.286 --> 0:23:35.933 +based models. + +0:23:36.096 --> 0:23:44.150 +The decoder there is some of the most time +spent testing and testing. + +0:23:44.150 --> 0:23:47.142 +It's similar, but during. + +0:23:47.167 --> 0:23:50.248 +Nothing about beam search. + +0:23:50.248 --> 0:23:59.833 +It might be even more complicated because +in beam search you have to try different. + +0:24:02.762 --> 0:24:15.140 +So the question is what can you now do in +order to make your model more efficient and + +0:24:15.140 --> 0:24:21.905 +better in translation in these types of cases? + +0:24:24.604 --> 0:24:30.178 +And the one thing is to look into the encoded +decoder trailer. + +0:24:30.690 --> 0:24:43.898 +And then until now we typically assume that +the depth of the encoder and the depth of the + +0:24:43.898 --> 0:24:48.154 +decoder is roughly the same. + +0:24:48.268 --> 0:24:55.553 +So if you haven't thought about it, you just +take what is running well. + +0:24:55.553 --> 0:24:57.678 +You would try to do. 
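+
+[To make the training-versus-decoding difference discussed above concrete:
+in training, one forward pass with the full reference target (and a causal mask
+inside the decoder) scores all positions in parallel, while at test time words
+must be produced one at a time. model(src, tgt_prefix) is a hypothetical
+encoder-decoder interface returning logits for every prefix position.]
+
+import torch
+
+def greedy_decode(model, src, bos_id, eos_id, max_len=100):
+    tgt = [bos_id]
+    for _ in range(max_len):
+        logits = model(src, torch.tensor([tgt]))   # re-run the decoder on the prefix
+        next_id = int(logits[0, -1].argmax())      # most probable next word
+        tgt.append(next_id)
+        if next_id == eos_id:
+            break
+    return tgt
+# in training, by contrast, a single call model(src, reference) is enough,
+# because the reference target is already known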
+ +0:24:58.018 --> 0:25:04.148 +However, we saw now that there is a quite +big challenge and the runtime is a lot longer + +0:25:04.148 --> 0:25:04.914 +than here. + +0:25:05.425 --> 0:25:14.018 +The question is also the case for the calculations, +or do we have there the same issue that we + +0:25:14.018 --> 0:25:21.887 +only get the good quality if we are having +high and high, so we know that making these + +0:25:21.887 --> 0:25:25.415 +more depths is increasing our quality. + +0:25:25.425 --> 0:25:31.920 +But what we haven't talked about is really +important that we increase the depth the same + +0:25:31.920 --> 0:25:32.285 +way. + +0:25:32.552 --> 0:25:41.815 +So what we can put instead also do is something +like this where you have a deep encoder and + +0:25:41.815 --> 0:25:42.923 +a shallow. + +0:25:43.163 --> 0:25:57.386 +So that would be that you, for example, have +instead of having layers on the encoder, and + +0:25:57.386 --> 0:25:59.757 +layers on the. + +0:26:00.080 --> 0:26:10.469 +So in this case the overall depth from start +to end would be similar and so hopefully. + +0:26:11.471 --> 0:26:21.662 +But we could a lot more things hear parallelized, +and hear what is costly at the end during decoding + +0:26:21.662 --> 0:26:22.973 +the decoder. + +0:26:22.973 --> 0:26:29.330 +Because that does change in an outer regressive +way, there we. + +0:26:31.411 --> 0:26:33.727 +And that that can be analyzed. + +0:26:33.727 --> 0:26:38.734 +So here is some examples: Where people have +done all this. + +0:26:39.019 --> 0:26:55.710 +So here it's mainly interested on the orange +things, which is auto-regressive about the + +0:26:55.710 --> 0:26:57.607 +speed up. + +0:26:57.717 --> 0:27:15.031 +You have the system, so agree is not exactly +the same, but it's similar. + +0:27:15.055 --> 0:27:23.004 +It's always the case if you look at speed +up. + +0:27:23.004 --> 0:27:31.644 +Think they put a speed of so that's the baseline. + +0:27:31.771 --> 0:27:35.348 +So between and times as fast. + +0:27:35.348 --> 0:27:42.621 +If you switch from a system to where you have +layers in the. + +0:27:42.782 --> 0:27:52.309 +You see that although you have slightly more +parameters, more calculations are also roughly + +0:27:52.309 --> 0:28:00.283 +the same, but you can speed out because now +during testing you can paralyze. + +0:28:02.182 --> 0:28:09.754 +The other thing is that you're speeding up, +but if you look at the performance it's similar, + +0:28:09.754 --> 0:28:13.500 +so sometimes you improve, sometimes you lose. + +0:28:13.500 --> 0:28:20.421 +There's a bit of losing English to Romania, +but in general the quality is very slow. + +0:28:20.680 --> 0:28:30.343 +So you see that you can keep a similar performance +while improving your speed by just having different. + +0:28:30.470 --> 0:28:34.903 +And you also see the encoder layers from speed. + +0:28:34.903 --> 0:28:38.136 +They don't really metal that much. + +0:28:38.136 --> 0:28:38.690 +Most. + +0:28:38.979 --> 0:28:50.319 +Because if you compare the 12th system to +the 6th system you have a lower performance + +0:28:50.319 --> 0:28:57.309 +with 6th and colder layers but the speed is +similar. + +0:28:57.897 --> 0:29:02.233 +And see the huge decrease is it maybe due +to a lack of data. + +0:29:03.743 --> 0:29:11.899 +Good idea would say it's not the case. + +0:29:11.899 --> 0:29:23.191 +Romanian English should have the same number +of data. + +0:29:24.224 --> 0:29:31.184 +Maybe it's just that something in that language. 
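+
+[The deep-encoder / shallow-decoder setup discussed above can be expressed with
+any transformer implementation; as a purely illustrative example with PyTorch's
+generic nn.Transformer (the layer counts are examples, not the exact numbers
+from the experiments referred to here).]
+
+import torch.nn as nn
+
+model = nn.Transformer(d_model=512, nhead=8,
+                       num_encoder_layers=12,   # deep, fully parallel at test time
+                       num_decoder_layers=1)    # shallow, run once per output word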
+ +0:29:31.184 --> 0:29:40.702 +If you generate Romanian maybe they need more +target dependencies. + +0:29:42.882 --> 0:29:46.263 +The Wine's the Eye Also Don't Know Any Sex +People Want To. + +0:29:47.887 --> 0:29:49.034 +There could be yeah the. + +0:29:49.889 --> 0:29:58.962 +As the maybe if you go from like a movie sphere +to a hybrid sphere, you can: It's very much + +0:29:58.962 --> 0:30:12.492 +easier to expand the vocabulary to English, +but it must be the vocabulary. + +0:30:13.333 --> 0:30:21.147 +Have to check, but would assume that in this +case the system is not retrained, but it's + +0:30:21.147 --> 0:30:22.391 +trained with. + +0:30:22.902 --> 0:30:30.213 +And that's why I was assuming that they have +the same, but maybe you'll write that in this + +0:30:30.213 --> 0:30:35.595 +piece, for example, if they were pre-trained, +the decoder English. + +0:30:36.096 --> 0:30:43.733 +But don't remember exactly if they do something +like that, but that could be a good. + +0:30:45.325 --> 0:30:52.457 +So this is some of the most easy way to speed +up. + +0:30:52.457 --> 0:31:01.443 +You just switch to hyperparameters, not to +implement anything. + +0:31:02.722 --> 0:31:08.367 +Of course, there's other ways of doing that. + +0:31:08.367 --> 0:31:11.880 +We'll look into two things. + +0:31:11.880 --> 0:31:16.521 +The other thing is the architecture. + +0:31:16.796 --> 0:31:28.154 +We are now at some of the baselines that we +are doing. + +0:31:28.488 --> 0:31:39.978 +However, in translation in the decoder side, +it might not be the best solution. + +0:31:39.978 --> 0:31:41.845 +There is no. + +0:31:42.222 --> 0:31:47.130 +So we can use different types of architectures, +also in the encoder and the. + +0:31:47.747 --> 0:31:52.475 +And there's two ways of what you could do +different, or there's more ways. + +0:31:52.912 --> 0:31:54.825 +We will look into two todays. + +0:31:54.825 --> 0:31:58.842 +The one is average attention, which is a very +simple solution. + +0:31:59.419 --> 0:32:01.464 +You can do as it says. + +0:32:01.464 --> 0:32:04.577 +It's not really attending anymore. + +0:32:04.577 --> 0:32:08.757 +It's just like equal attendance to everything. + +0:32:09.249 --> 0:32:23.422 +And the other idea, which is currently done +in most systems which are optimized to efficiency, + +0:32:23.422 --> 0:32:24.913 +is we're. + +0:32:25.065 --> 0:32:32.623 +But on the decoder side we are then not using +transformer or self attention, but we are using + +0:32:32.623 --> 0:32:39.700 +recurrent neural network because they are the +disadvantage of recurrent neural network. + +0:32:39.799 --> 0:32:48.353 +And then the recurrent is normally easier +to calculate because it only depends on inputs, + +0:32:48.353 --> 0:32:49.684 +the input on. + +0:32:51.931 --> 0:33:02.190 +So what is the difference between decoding +and why is the tension maybe not sufficient + +0:33:02.190 --> 0:33:03.841 +for decoding? + +0:33:04.204 --> 0:33:14.390 +If we want to populate the new state, we only +have to look at the input and the previous + +0:33:14.390 --> 0:33:15.649 +state, so. + +0:33:16.136 --> 0:33:19.029 +We are more conditional here networks. + +0:33:19.029 --> 0:33:19.994 +We have the. + +0:33:19.980 --> 0:33:31.291 +Dependency to a fixed number of previous ones, +but that's rarely used for decoding. + +0:33:31.291 --> 0:33:39.774 +In contrast, in transformer we have this large +dependency, so. 
+ +0:33:40.000 --> 0:33:52.760 +So from t minus one to y t so that is somehow +and mainly not very efficient in this way mean + +0:33:52.760 --> 0:33:56.053 +it's very good because. + +0:33:56.276 --> 0:34:03.543 +However, the disadvantage is that we also +have to do all these calculations, so if we + +0:34:03.543 --> 0:34:10.895 +more view from the point of view of efficient +calculation, this might not be the best. + +0:34:11.471 --> 0:34:20.517 +So the question is, can we change our architecture +to keep some of the advantages but make things + +0:34:20.517 --> 0:34:21.994 +more efficient? + +0:34:24.284 --> 0:34:31.131 +The one idea is what is called the average +attention, and the interesting thing is this + +0:34:31.131 --> 0:34:32.610 +work surprisingly. + +0:34:33.013 --> 0:34:38.917 +So the only idea what you're doing is doing +the decoder. + +0:34:38.917 --> 0:34:42.646 +You're not doing attention anymore. + +0:34:42.646 --> 0:34:46.790 +The attention weights are all the same. + +0:34:47.027 --> 0:35:00.723 +So you don't calculate with query and key +the different weights, and then you just take + +0:35:00.723 --> 0:35:03.058 +equal weights. + +0:35:03.283 --> 0:35:07.585 +So here would be one third from this, one +third from this, and one third. + +0:35:09.009 --> 0:35:14.719 +And while it is sufficient you can now do +precalculation and things get more efficient. + +0:35:15.195 --> 0:35:18.803 +So first go the formula that's maybe not directed +here. + +0:35:18.979 --> 0:35:38.712 +So the difference here is that your new hint +stage is the sum of all the hint states, then. + +0:35:38.678 --> 0:35:40.844 +So here would be with this. + +0:35:40.844 --> 0:35:45.022 +It would be one third of this plus one third +of this. + +0:35:46.566 --> 0:35:57.162 +But if you calculate it this way, it's not +yet being more efficient because you still + +0:35:57.162 --> 0:36:01.844 +have to sum over here all the hidden. + +0:36:04.524 --> 0:36:22.932 +But you can not easily speed up these things +by having an in between value, which is just + +0:36:22.932 --> 0:36:24.568 +always. + +0:36:25.585 --> 0:36:30.057 +If you take this as ten to one, you take this +one class this one. + +0:36:30.350 --> 0:36:36.739 +Because this one then was before this, and +this one was this, so in the end. + +0:36:37.377 --> 0:36:49.545 +So now this one is not the final one in order +to get the final one to do the average. + +0:36:49.545 --> 0:36:50.111 +So. + +0:36:50.430 --> 0:37:00.264 +But then if you do this calculation with speed +up you can do it with a fixed number of steps. + +0:37:00.180 --> 0:37:11.300 +Instead of the sun which depends on age, so +you only have to do calculations to calculate + +0:37:11.300 --> 0:37:12.535 +this one. + +0:37:12.732 --> 0:37:21.183 +Can you do the lakes and the lakes? + +0:37:21.183 --> 0:37:32.687 +For example, light bulb here now takes and +then. + +0:37:32.993 --> 0:37:38.762 +That's a very good point and that's why this +is now in the image. + +0:37:38.762 --> 0:37:44.531 +It's not very good so this is the one with +tilder and the tilder. + +0:37:44.884 --> 0:37:57.895 +So this one is just the sum of these two, +because this is just this one. + +0:37:58.238 --> 0:38:08.956 +So the sum of this is exactly as the sum of +these, and the sum of these is the sum of here. + +0:38:08.956 --> 0:38:15.131 +So you only do the sum in here, and the multiplying. + +0:38:15.255 --> 0:38:22.145 +So what you can mainly do here is you can +do it more mathematically. 
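+
+[A minimal sketch of the averaging trick: during training the equal-weight
+"attention" is just a cumulative sum over the previous states, and during
+decoding one running sum gives the same value with constant work per step,
+as the following remark in the lecture also explains. Shapes are illustrative.]
+
+import torch
+
+T, d = 5, 8
+x = torch.randn(T, d)                                  # decoder states of one sentence
+
+# all positions at once (training): cumulative sum divided by the position index
+avg_all = torch.cumsum(x, dim=0) / torch.arange(1, T + 1).unsqueeze(1)
+
+# one step at a time (decoding): keep a running sum instead of re-summing
+s = torch.zeros(d)
+for t in range(T):
+    s = s + x[t]
+    assert torch.allclose(s / (t + 1), avg_all[t])     # same result, O(1) per step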
+ +0:38:22.145 --> 0:38:31.531 +You can know this by tea taking out of the +sum, and then you can calculate the sum different. + +0:38:36.256 --> 0:38:42.443 +That maybe looks a bit weird and simple, so +we were all talking about this great attention + +0:38:42.443 --> 0:38:47.882 +that we can focus on different parts, and a +bit surprising on this work is now. + +0:38:47.882 --> 0:38:53.321 +In the end it might also work well without +really putting and just doing equal. + +0:38:53.954 --> 0:38:56.164 +Mean it's not that easy. + +0:38:56.376 --> 0:38:58.261 +It's like sometimes this is working. + +0:38:58.261 --> 0:39:00.451 +There's also report weight work that well. + +0:39:01.481 --> 0:39:05.848 +But I think it's an interesting way and it +maybe shows that a lot of. + +0:39:05.805 --> 0:39:10.669 +Things in the self or in the transformer paper +which are more put as like yet. + +0:39:10.669 --> 0:39:14.301 +These are some hyperparameters that are rounded +like that. + +0:39:14.301 --> 0:39:19.657 +You do the lay on all in between and that +you do a feat forward before and things like + +0:39:19.657 --> 0:39:20.026 +that. + +0:39:20.026 --> 0:39:25.567 +But these are also all important and the right +set up around that is also very important. + +0:39:28.969 --> 0:39:38.598 +The other thing you can do in the end is not +completely different from this one. + +0:39:38.598 --> 0:39:42.521 +It's just like a very different. + +0:39:42.942 --> 0:39:54.338 +And that is a recurrent network which also +has this type of highway connection that can + +0:39:54.338 --> 0:40:01.330 +ignore the recurrent unit and directly put +the input. + +0:40:01.561 --> 0:40:10.770 +It's not really adding out, but if you see +the hitting step is your input, but what you + +0:40:10.770 --> 0:40:15.480 +can do is somehow directly go to the output. + +0:40:17.077 --> 0:40:28.390 +These are the four components of the simple +return unit, and the unit is motivated by GIS + +0:40:28.390 --> 0:40:33.418 +and by LCMs, which we have seen before. + +0:40:33.513 --> 0:40:43.633 +And that has proven to be very good for iron +ends, which allows you to have a gate on your. + +0:40:44.164 --> 0:40:48.186 +In this thing we have two gates, the reset +gate and the forget gate. + +0:40:48.768 --> 0:40:57.334 +So first we have the general structure which +has a cell state. + +0:40:57.334 --> 0:41:01.277 +Here we have the cell state. + +0:41:01.361 --> 0:41:09.661 +And then this goes next, and we always get +the different cell states over the times that. + +0:41:10.030 --> 0:41:11.448 +This Is the South Stand. + +0:41:11.771 --> 0:41:16.518 +How do we now calculate that just assume we +have an initial cell safe here? + +0:41:17.017 --> 0:41:19.670 +But the first thing is we're doing the forget +game. + +0:41:20.060 --> 0:41:34.774 +The forgetting models should the new cell +state mainly depend on the previous cell state + +0:41:34.774 --> 0:41:40.065 +or should it depend on our age. + +0:41:40.000 --> 0:41:41.356 +Like Add to Them. + +0:41:41.621 --> 0:41:42.877 +How can we model that? + +0:41:44.024 --> 0:41:45.599 +First we were at a cocktail. + +0:41:45.945 --> 0:41:52.151 +The forget gait is depending on minus one. + +0:41:52.151 --> 0:41:56.480 +You also see here the former. + +0:41:57.057 --> 0:42:01.963 +So we are multiplying both the cell state +and our input. + +0:42:01.963 --> 0:42:04.890 +With some weights we are getting. 
+ +0:42:05.105 --> 0:42:08.472 +We are putting some Bay Inspector and then +we are doing Sigma Weed on that. + +0:42:08.868 --> 0:42:13.452 +So in the end we have numbers between zero +and one saying for each dimension. + +0:42:13.853 --> 0:42:22.041 +Like how much if it's near to zero we will +mainly use the new input. + +0:42:22.041 --> 0:42:31.890 +If it's near to one we will keep the input +and ignore the input at this dimension. + +0:42:33.313 --> 0:42:40.173 +And by this motivation we can then create +here the new sound state, and here you see + +0:42:40.173 --> 0:42:41.141 +the formal. + +0:42:41.601 --> 0:42:55.048 +So you take your foot back gate and multiply +it with your class. + +0:42:55.048 --> 0:43:00.427 +So if my was around then. + +0:43:00.800 --> 0:43:07.405 +In the other case, when the value was others, +that's what you added. + +0:43:07.405 --> 0:43:10.946 +Then you're adding a transformation. + +0:43:11.351 --> 0:43:24.284 +So if this value was maybe zero then you're +putting most of the information from inputting. + +0:43:25.065 --> 0:43:26.947 +Is already your element? + +0:43:26.947 --> 0:43:30.561 +The only question is now based on your element. + +0:43:30.561 --> 0:43:32.067 +What is the output? + +0:43:33.253 --> 0:43:47.951 +And there you have another opportunity so +you can either take the output or instead you + +0:43:47.951 --> 0:43:50.957 +prefer the input. + +0:43:52.612 --> 0:43:58.166 +So is the value also the same for the recept +game and the forget game. + +0:43:58.166 --> 0:43:59.417 +Yes, the movie. + +0:44:00.900 --> 0:44:10.004 +Yes exactly so the matrices are different +and therefore it can be and that should be + +0:44:10.004 --> 0:44:16.323 +and maybe there is sometimes you want to have +information. + +0:44:16.636 --> 0:44:23.843 +So here again we have this vector with values +between zero and which says controlling how + +0:44:23.843 --> 0:44:25.205 +the information. + +0:44:25.505 --> 0:44:36.459 +And then the output is calculated here similar +to a cell stage, but again input is from. + +0:44:36.536 --> 0:44:45.714 +So either the reset gate decides should give +what is currently stored in there, or. + +0:44:46.346 --> 0:44:58.647 +So it's not exactly as the thing we had before, +with the residual connections where we added + +0:44:58.647 --> 0:45:01.293 +up, but here we do. + +0:45:04.224 --> 0:45:08.472 +This is the general idea of a simple recurrent +neural network. + +0:45:08.472 --> 0:45:13.125 +Then we will now look at how we can make things +even more efficient. + +0:45:13.125 --> 0:45:17.104 +But first do you have more questions on how +it is working? + +0:45:23.063 --> 0:45:38.799 +Now these calculations are a bit where things +get more efficient because this somehow. + +0:45:38.718 --> 0:45:43.177 +It depends on all the other damage for the +second one also. + +0:45:43.423 --> 0:45:48.904 +Because if you do a matrix multiplication +with a vector like for the output vector, each + +0:45:48.904 --> 0:45:52.353 +diameter of the output vector depends on all +the other. + +0:45:52.973 --> 0:46:06.561 +The cell state here depends because this one +is used here, and somehow the first dimension + +0:46:06.561 --> 0:46:11.340 +of the cell state only depends. + +0:46:11.931 --> 0:46:17.973 +In order to make that, of course, is sometimes +again making things less paralyzeable if things + +0:46:17.973 --> 0:46:18.481 +depend. + +0:46:19.359 --> 0:46:35.122 +Can easily make that different by changing +from the metric product to not a vector. 
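+
+[A small sketch of such a simple recurrent unit; this follows the common
+formulation in which the gates are computed from the input only (the variant on
+the slides additionally lets the gates look at the previous cell state). The
+point illustrated next in the lecture is that all matrix products can be
+precomputed in parallel over time, and only cheap element-wise operations
+remain inside the loop.]
+
+import torch
+
+def sru_layer(x, W, Wf, Wr, bf, br):
+    # x: [T, d] inputs of one sentence; W, Wf, Wr: [d, d]; bf, br: [d]
+    xt, f_in, r_in = x @ W, x @ Wf + bf, x @ Wr + br   # parallel over all time steps
+    c, outputs = torch.zeros(x.size(1)), []
+    for t in range(x.size(0)):                         # only element-wise work left
+        f = torch.sigmoid(f_in[t])                     # forget gate
+        c = f * c + (1 - f) * xt[t]                    # new cell state
+        r = torch.sigmoid(r_in[t])                     # reset gate
+        outputs.append(r * c + (1 - r) * x[t])         # output with highway to the input
+    return torch.stack(outputs)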
+ +0:46:35.295 --> 0:46:51.459 +So you do first, just like inside here, you +take like the first dimension, my second dimension. + +0:46:52.032 --> 0:46:53.772 +Is, of course, narrow. + +0:46:53.772 --> 0:46:59.294 +This should be reset or this should be because +it should be a different. + +0:46:59.899 --> 0:47:12.053 +Now the first dimension only depends on the +first dimension, so you don't have dependencies + +0:47:12.053 --> 0:47:16.148 +any longer between dimensions. + +0:47:18.078 --> 0:47:25.692 +Maybe it gets a bit clearer if you see about +it in this way, so what we have to do now. + +0:47:25.966 --> 0:47:31.911 +First, we have to do a metrics multiplication +on to gather and to get the. + +0:47:32.292 --> 0:47:38.041 +And then we only have the element wise operations +where we take this output. + +0:47:38.041 --> 0:47:38.713 +We take. + +0:47:39.179 --> 0:47:42.978 +Minus one and our original. + +0:47:42.978 --> 0:47:52.748 +Here we only have elemental abrasions which +can be optimally paralyzed. + +0:47:53.273 --> 0:48:07.603 +So here we have additional paralyzed things +across the dimension and don't have to do that. + +0:48:09.929 --> 0:48:24.255 +Yeah, but this you can do like in parallel +again for all xts. + +0:48:24.544 --> 0:48:33.014 +Here you can't do it in parallel, but you +only have to do it on each seat, and then you + +0:48:33.014 --> 0:48:34.650 +can parallelize. + +0:48:35.495 --> 0:48:39.190 +But this maybe for the dimension. + +0:48:39.190 --> 0:48:42.124 +Maybe it's also important. + +0:48:42.124 --> 0:48:46.037 +I don't know if they have tried it. + +0:48:46.037 --> 0:48:55.383 +I assume it's not only for dimension reduction, +but it's hard because you can easily. + +0:49:01.001 --> 0:49:08.164 +People have even like made the second thing +even more easy. + +0:49:08.164 --> 0:49:10.313 +So there is this. + +0:49:10.313 --> 0:49:17.954 +This is how we have the highway connections +in the transformer. + +0:49:17.954 --> 0:49:20.699 +Then it's like you do. + +0:49:20.780 --> 0:49:24.789 +So that is like how things are put together +as a transformer. + +0:49:25.125 --> 0:49:39.960 +And that is a similar and simple recurring +neural network where you do exactly the same + +0:49:39.960 --> 0:49:44.512 +for the so you don't have. + +0:49:46.326 --> 0:49:47.503 +This type of things. + +0:49:49.149 --> 0:50:01.196 +And with this we are at the end of how to +make efficient architectures before we go to + +0:50:01.196 --> 0:50:02.580 +the next. + +0:50:13.013 --> 0:50:24.424 +Between the ink or the trader and the architectures +there is a next technique which is used in + +0:50:24.424 --> 0:50:28.988 +nearly all deburning very successful. + +0:50:29.449 --> 0:50:43.463 +So the idea is can we extract the knowledge +from a large network into a smaller one, but + +0:50:43.463 --> 0:50:45.983 +it's similarly. + +0:50:47.907 --> 0:50:53.217 +And the nice thing is that this really works, +and it may be very, very surprising. + +0:50:53.673 --> 0:51:03.000 +So the idea is that we have a large straw +model which we train for long, and the question + +0:51:03.000 --> 0:51:07.871 +is: Can that help us to train a smaller model? + +0:51:08.148 --> 0:51:16.296 +So can what we refer to as teacher model tell +us better to build a small student model than + +0:51:16.296 --> 0:51:17.005 +before. + +0:51:17.257 --> 0:51:27.371 +So what we're before in it as a student model, +we learn from the data and that is how we train + +0:51:27.371 --> 0:51:28.755 +our systems. 
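[Editor's note, picking up the efficiency point from just before the teacher-student idea: the reason the element-wise recurrence is fast is that all heavy matrix products involve only the inputs x_1..x_T, so they can be computed for the whole sequence in one batched multiplication, and the remaining time loop contains only cheap element-wise operations. The sketch below is a minimal NumPy illustration; the shapes, the tanh, and the zero initial state are assumptions.]

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fast_recurrence(X, W, W_f, W_r, v_f, b_f, b_r):
    """X has shape (T, d): the whole input sequence at once."""
    # All large matrix products depend only on the inputs, so they can be
    # done for every time step in parallel (one big matmul each).
    U   = X @ W.T                    # transformed inputs, (T, d)
    F_x = X @ W_f.T + b_f            # forget-gate pre-activations, (T, d)
    R   = sigmoid(X @ W_r.T + b_r)   # reset gates, (T, d)

    # The sequential part uses only element-wise operations, so each
    # dimension depends only on itself and the loop body is very cheap.
    T, d = X.shape
    c = np.zeros(d)
    H = np.empty((T, d))
    for t in range(T):
        f = sigmoid(F_x[t] + v_f * c)        # old state enters element-wise
        c = f * c + (1.0 - f) * U[t]
        H[t] = R[t] * np.tanh(c) + (1.0 - R[t]) * X[t]
    return H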
+ +0:51:29.249 --> 0:51:37.949 +The question is: Can we train this small model +better if we are not only learning from the + +0:51:37.949 --> 0:51:46.649 +data, but we are also learning from a large +model which has been trained maybe in the same + +0:51:46.649 --> 0:51:47.222 +data? + +0:51:47.667 --> 0:51:55.564 +So that you have then in the end a smaller +model that is somehow better performing than. + +0:51:55.895 --> 0:51:59.828 +And maybe that's on the first view. + +0:51:59.739 --> 0:52:05.396 +Very very surprising because it has seen the +same data so it should have learned the same + +0:52:05.396 --> 0:52:11.053 +so the baseline model trained only on the data +and the student teacher knowledge to still + +0:52:11.053 --> 0:52:11.682 +model it. + +0:52:11.682 --> 0:52:17.401 +They all have seen only this data because +your teacher modeling was also trained typically + +0:52:17.401 --> 0:52:19.161 +only on this model however. + +0:52:20.580 --> 0:52:30.071 +It has by now shown that by many ways the +model trained in the teacher and analysis framework + +0:52:30.071 --> 0:52:32.293 +is performing better. + +0:52:33.473 --> 0:52:40.971 +A bit of an explanation when we see how that +works. + +0:52:40.971 --> 0:52:46.161 +There's different ways of doing it. + +0:52:46.161 --> 0:52:47.171 +Maybe. + +0:52:47.567 --> 0:52:51.501 +So how does it work? + +0:52:51.501 --> 0:53:04.802 +This is our student network, the normal one, +some type of new network. + +0:53:04.802 --> 0:53:06.113 +We're. + +0:53:06.586 --> 0:53:17.050 +So we are training the model to predict the +same thing as we are doing that by calculating. + +0:53:17.437 --> 0:53:23.173 +The cross angry loss was defined in a way +where saying all the probabilities for the + +0:53:23.173 --> 0:53:25.332 +correct word should be as high. + +0:53:25.745 --> 0:53:31.576 +So your calculating gear out of probability +is always and each time step you have an out + +0:53:31.576 --> 0:53:32.624 +of probability. + +0:53:32.624 --> 0:53:38.258 +What is the most probable in the next word +and your training signal is put as much of + +0:53:38.258 --> 0:53:43.368 +your probability mass to the correct word to +the word that is there in train. + +0:53:43.903 --> 0:53:51.367 +And this is the chief by this cross entry +loss, which says with some of the all training + +0:53:51.367 --> 0:53:58.664 +examples of all positions, with some of the +full vocabulary, and then this one is this + +0:53:58.664 --> 0:54:03.947 +one that this current word is the case word +in the vocabulary. + +0:54:04.204 --> 0:54:11.339 +And then we take here the lock for the ability +of that, so what we made me do is: We have + +0:54:11.339 --> 0:54:27.313 +this metric here, so each position of your +vocabulary size. + +0:54:27.507 --> 0:54:38.656 +In the end what you just do is some of these +three lock probabilities, and then you want + +0:54:38.656 --> 0:54:40.785 +to have as much. + +0:54:41.041 --> 0:54:54.614 +So although this is a thumb over this metric +here, in the end of each dimension you. + +0:54:54.794 --> 0:55:06.366 +So that is a normal cross end to be lost that +we have discussed at the very beginning of + +0:55:06.366 --> 0:55:07.016 +how. + +0:55:08.068 --> 0:55:15.132 +So what can we do differently in the teacher +network? + +0:55:15.132 --> 0:55:23.374 +We also have a teacher network which is trained +on large data. + +0:55:24.224 --> 0:55:35.957 +And of course this distribution might be better +than the one from the small model because it's. 
+ +0:55:36.456 --> 0:55:40.941 +So in this case we have now the training signal +from the teacher network. + +0:55:41.441 --> 0:55:46.262 +And it's the same way as we had before. + +0:55:46.262 --> 0:55:56.507 +The only difference is we're training not +the ground truths per ability distribution + +0:55:56.507 --> 0:55:59.159 +year, which is sharp. + +0:55:59.299 --> 0:56:11.303 +That's also a probability, so this word has +a high probability, but have some probability. + +0:56:12.612 --> 0:56:19.577 +And that is the main difference. + +0:56:19.577 --> 0:56:30.341 +Typically you do like the interpretation of +these. + +0:56:33.213 --> 0:56:38.669 +Because there's more information contained +in the distribution than in the front booth, + +0:56:38.669 --> 0:56:44.187 +because it encodes more information about the +language, because language always has more + +0:56:44.187 --> 0:56:47.907 +options to put alone, that's the same sentence +yes exactly. + +0:56:47.907 --> 0:56:53.114 +So there's ambiguity in there that is encoded +hopefully very well in the complaint. + +0:56:53.513 --> 0:56:57.257 +Trade you two networks so better than a student +network you have in there from your learner. + +0:56:57.537 --> 0:57:05.961 +So maybe often there's only one correct word, +but it might be two or three, and then all + +0:57:05.961 --> 0:57:10.505 +of these three have a probability distribution. + +0:57:10.590 --> 0:57:21.242 +And then is the main advantage or one explanation +of why it's better to train from the. + +0:57:21.361 --> 0:57:32.652 +Of course, it's good to also keep the signal +in there because then you can prevent it because + +0:57:32.652 --> 0:57:33.493 +crazy. + +0:57:37.017 --> 0:57:49.466 +Any more questions on the first type of knowledge +distillation, also distribution changes. + +0:57:50.550 --> 0:58:02.202 +Coming around again, this would put it a bit +different, so this is not a solution to maintenance + +0:58:02.202 --> 0:58:04.244 +or distribution. + +0:58:04.744 --> 0:58:12.680 +But don't think it's performing worse than +only doing the ground tours because they also. + +0:58:13.113 --> 0:58:21.254 +So it's more like it's not improving you would +assume it's similarly helping you, but. + +0:58:21.481 --> 0:58:28.145 +Of course, if you now have a teacher, maybe +you have no danger on your target to Maine, + +0:58:28.145 --> 0:58:28.524 +but. + +0:58:28.888 --> 0:58:39.895 +Then you can use this one which is not the +ground truth but helpful to learn better for + +0:58:39.895 --> 0:58:42.147 +the distribution. + +0:58:46.326 --> 0:58:57.012 +The second idea is to do sequence level knowledge +distillation, so what we have in this case + +0:58:57.012 --> 0:59:02.757 +is we have looked at each position independently. + +0:59:03.423 --> 0:59:05.436 +Mean, we do that often. + +0:59:05.436 --> 0:59:10.972 +We are not generating a lot of sequences, +but that has a problem. + +0:59:10.972 --> 0:59:13.992 +We have this propagation of errors. + +0:59:13.992 --> 0:59:16.760 +We start with one area and then. + +0:59:17.237 --> 0:59:27.419 +So if we are doing word-level knowledge dissolution, +we are treating each word in the sentence independently. + +0:59:28.008 --> 0:59:32.091 +So we are not trying to like somewhat model +the dependency between. + +0:59:32.932 --> 0:59:47.480 +We can try to do that by sequence level knowledge +dissolution, but the problem is, of course,. + +0:59:47.847 --> 0:59:53.478 +So we can that for each position we can get +a distribution over all the words at this. 
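[Editor's note: before the sequence-level variant is developed further, here is a minimal PyTorch-style sketch of the word-level distillation loss described above: the usual cross entropy against the one-hot reference, interpolated with a soft loss against the teacher distribution. The interpolation weight alpha and its value are assumptions; in practice it is tuned, and a temperature on the softmax is also common.]

import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids, alpha=0.5):
    """student_logits, teacher_logits: (batch * length, vocab)
       gold_ids: (batch * length,) indices of the reference words."""
    # Cross entropy against the sharp, one-hot reference words.
    ce = F.cross_entropy(student_logits, gold_ids)
    # Soft loss against the teacher's distribution, which also gives
    # probability mass to other plausible words at each position.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kd = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kd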
+ +0:59:53.793 --> 1:00:05.305 +But if we want to have a distribution of all +possible target sentences, that's not possible + +1:00:05.305 --> 1:00:06.431 +because. + +1:00:08.508 --> 1:00:15.940 +Area, so we can then again do a bit of a heck +on that. + +1:00:15.940 --> 1:00:23.238 +If we can't have a distribution of all sentences, +it. + +1:00:23.843 --> 1:00:30.764 +So what we can't do is you can not use the +teacher network and sample different translations. + +1:00:31.931 --> 1:00:39.327 +And now we can do different ways to train +them. + +1:00:39.327 --> 1:00:49.343 +We can use them as their probability, the +easiest one to assume. + +1:00:50.050 --> 1:00:56.373 +So what that ends to is that we're taking +our teacher network, we're generating some + +1:00:56.373 --> 1:01:01.135 +translations, and these ones we're using as +additional trading. + +1:01:01.781 --> 1:01:11.382 +Then we have mainly done this sequence level +because the teacher network takes us. + +1:01:11.382 --> 1:01:17.513 +These are all probable translations of the +sentence. + +1:01:26.286 --> 1:01:34.673 +And then you can do a bit of a yeah, and you +can try to better make a bit of an interpolated + +1:01:34.673 --> 1:01:36.206 +version of that. + +1:01:36.716 --> 1:01:42.802 +So what people have also done is like subsequent +level interpolations. + +1:01:42.802 --> 1:01:52.819 +You generate here several translations: But +then you don't use all of them. + +1:01:52.819 --> 1:02:00.658 +You do some metrics on which of these ones. + +1:02:01.021 --> 1:02:12.056 +So it's a bit more training on this brown +chose which might be improbable or unreachable + +1:02:12.056 --> 1:02:16.520 +because we can generate everything. + +1:02:16.676 --> 1:02:23.378 +And we are giving it an easier solution which +is also good quality and training of that. + +1:02:23.703 --> 1:02:32.602 +So you're not training it on a very difficult +solution, but you're training it on an easier + +1:02:32.602 --> 1:02:33.570 +solution. + +1:02:36.356 --> 1:02:38.494 +Any More Questions to This. + +1:02:40.260 --> 1:02:41.557 +Yeah. + +1:02:41.461 --> 1:02:44.296 +Good. + +1:02:43.843 --> 1:03:01.642 +Is to look at the vocabulary, so the problem +is we have seen that vocabulary calculations + +1:03:01.642 --> 1:03:06.784 +are often very presuming. + +1:03:09.789 --> 1:03:19.805 +The thing is that most of the vocabulary is +not needed for each sentence, so in each sentence. + +1:03:20.280 --> 1:03:28.219 +The question is: Can we somehow easily precalculate, +which words are probable to occur in the sentence, + +1:03:28.219 --> 1:03:30.967 +and then only calculate these ones? + +1:03:31.691 --> 1:03:34.912 +And this can be done so. + +1:03:34.912 --> 1:03:43.932 +For example, if you have sentenced card, it's +probably not happening. + +1:03:44.164 --> 1:03:48.701 +So what you can try to do is to limit your +vocabulary. + +1:03:48.701 --> 1:03:51.093 +You're considering for each. + +1:03:51.151 --> 1:04:04.693 +So you're no longer taking the full vocabulary +as possible output, but you're restricting. + +1:04:06.426 --> 1:04:18.275 +That typically works is that we limit it by +the most frequent words we always take because + +1:04:18.275 --> 1:04:23.613 +these are not so easy to align to words. + +1:04:23.964 --> 1:04:32.241 +To take the most treatment taggin' words and +then work that often aligns with one of the + +1:04:32.241 --> 1:04:32.985 +source. 
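[Editor's note, stepping back briefly to sequence-level distillation before the vocabulary idea is spelled out in detail below: in practice it is mostly a data-generation step. The sketch assumes a generic teacher_translate() decoding function and a beam size of five; both are placeholders, not part of the lecture.]

def build_sequence_kd_data(teacher_translate, source_sentences, beam_size=5):
    """Decode the training sources with the large teacher model and use its
    outputs as the new training targets for the small student model."""
    kd_pairs = []
    for src in source_sentences:
        # One could also keep several hypotheses, or pick the one closest
        # to the human reference (the interpolated variant mentioned above).
        best_hypothesis = teacher_translate(src, beam_size=beam_size)
        kd_pairs.append((src, best_hypothesis))
    return kd_pairs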
+ +1:04:33.473 --> 1:04:46.770 +So for each source word you calculate the +word alignment on your training data, and then + +1:04:46.770 --> 1:04:51.700 +you calculate which words occur. + +1:04:52.352 --> 1:04:57.680 +And then for decoding you build this union +of maybe the source word list that other. + +1:04:59.960 --> 1:05:02.145 +Are like for each source work. + +1:05:02.145 --> 1:05:08.773 +One of the most frequent translations of these +source words, for example for each source work + +1:05:08.773 --> 1:05:13.003 +like in the most frequent ones, and then the +most frequent. + +1:05:13.193 --> 1:05:24.333 +In total, if you have short sentences, you +have a lot less words, so in most cases it's + +1:05:24.333 --> 1:05:26.232 +not more than. + +1:05:26.546 --> 1:05:33.957 +And so you have dramatically reduced your +vocabulary, and thereby can also fax a depot. + +1:05:35.495 --> 1:05:43.757 +That easy does anybody see what is challenging +here and why that might not always need. + +1:05:47.687 --> 1:05:54.448 +The performance is not why this might not. + +1:05:54.448 --> 1:06:01.838 +If you implement it, it might not be a strong. + +1:06:01.941 --> 1:06:06.053 +You have to store this list. + +1:06:06.053 --> 1:06:14.135 +You have to burn the union and of course your +safe time. + +1:06:14.554 --> 1:06:21.920 +The second thing the vocabulary is used in +our last step, so we have the hidden state, + +1:06:21.920 --> 1:06:23.868 +and then we calculate. + +1:06:24.284 --> 1:06:29.610 +Now we are not longer calculating them for +all output words, but for a subset of them. + +1:06:30.430 --> 1:06:35.613 +However, this metric multiplication is typically +parallelized with the perfect but good. + +1:06:35.956 --> 1:06:46.937 +But if you not only calculate some of them, +if you're not modeling it right, it will take + +1:06:46.937 --> 1:06:52.794 +as long as before because of the nature of +the. + +1:06:56.776 --> 1:07:07.997 +Here for beam search there's some ideas of +course you can go back to greedy search because + +1:07:07.997 --> 1:07:10.833 +that's more efficient. + +1:07:11.651 --> 1:07:18.347 +And better quality, and you can buffer some +states in between, so how much buffering it's + +1:07:18.347 --> 1:07:22.216 +again this tradeoff between calculation and +memory. + +1:07:25.125 --> 1:07:41.236 +Then at the end of today what we want to look +into is one last type of new machine translation + +1:07:41.236 --> 1:07:42.932 +approach. + +1:07:43.403 --> 1:07:53.621 +And the idea is what we've already seen in +our first two steps is that this ultra aggressive + +1:07:53.621 --> 1:07:57.246 +park is taking community coding. + +1:07:57.557 --> 1:08:04.461 +Can process everything in parallel, but we +are always taking the most probable and then. + +1:08:05.905 --> 1:08:10.476 +The question is: Do we really need to do that? + +1:08:10.476 --> 1:08:14.074 +Therefore, there is a bunch of work. + +1:08:14.074 --> 1:08:16.602 +Can we do it differently? + +1:08:16.602 --> 1:08:19.616 +Can we generate a full target? + +1:08:20.160 --> 1:08:29.417 +We'll see it's not that easy and there's still +an open debate whether this is really faster + +1:08:29.417 --> 1:08:31.832 +and quality, but think. + +1:08:32.712 --> 1:08:45.594 +So, as said, what we have done is our encoder +decoder where we can process our encoder color, + +1:08:45.594 --> 1:08:50.527 +and then the output always depends. 
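[Editor's note: before the non-autoregressive idea is developed further, a rough sketch of the vocabulary pre-selection discussed a moment ago. For every source word we keep a short list of likely target words taken from word-alignment counts on the training data; at test time the output softmax is restricted to the union of these lists plus the globally most frequent target words. The data structures, the top_k value and the whitespace tokenization are assumptions for illustration.]

from collections import Counter, defaultdict

def build_translation_lists(aligned_pairs, top_k=20):
    """aligned_pairs: iterable of (source_word, target_word) pairs taken
    from word alignments on the training data."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        counts[src_word][tgt_word] += 1
    return {s: [t for t, _ in c.most_common(top_k)] for s, c in counts.items()}

def candidate_vocab(source_sentence, translation_lists, frequent_targets):
    """Union of the most frequent target words and, for every source word,
    its most likely aligned target words."""
    vocab = set(frequent_targets)
    for word in source_sentence.split():
        vocab.update(translation_lists.get(word, []))
    return vocab  # the output layer is then computed only for these words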
+ +1:08:50.410 --> 1:08:54.709 +We generate the output and then we have to +put it here the wide because then everything + +1:08:54.709 --> 1:08:56.565 +depends on the purpose of the output. + +1:08:56.916 --> 1:09:10.464 +This is what is referred to as an outer-regressive +model and nearly outs speech generation and + +1:09:10.464 --> 1:09:16.739 +language generation or works in this outer. + +1:09:18.318 --> 1:09:21.132 +So the motivation is, can we do that more +efficiently? + +1:09:21.361 --> 1:09:31.694 +And can we somehow process all target words +in parallel? + +1:09:31.694 --> 1:09:41.302 +So instead of doing it one by one, we are +inputting. + +1:09:45.105 --> 1:09:46.726 +So how does it work? + +1:09:46.726 --> 1:09:50.587 +So let's first have a basic auto regressive +mode. + +1:09:50.810 --> 1:09:53.551 +So the encoder looks as it is before. + +1:09:53.551 --> 1:09:58.310 +That's maybe not surprising because here we +know we can paralyze. + +1:09:58.618 --> 1:10:04.592 +So we have put in here our ink holder and +generated the ink stash, so that's exactly + +1:10:04.592 --> 1:10:05.295 +the same. + +1:10:05.845 --> 1:10:16.229 +However, now we need to do one more thing: +One challenge is what we had before and that's + +1:10:16.229 --> 1:10:26.799 +a challenge of natural language generation +like machine translation. + +1:10:32.672 --> 1:10:38.447 +We generate until we generate this out of +end of center stock, but if we now generate + +1:10:38.447 --> 1:10:44.625 +everything at once that's no longer possible, +so we cannot generate as long because we only + +1:10:44.625 --> 1:10:45.632 +generated one. + +1:10:46.206 --> 1:10:58.321 +So the question is how can we now determine +how long the sequence is, and we can also accelerate. + +1:11:00.000 --> 1:11:06.384 +Yes, but there would be one idea, and there +is other work which tries to do that. + +1:11:06.806 --> 1:11:15.702 +However, in here there's some work already +done before and maybe you remember we had the + +1:11:15.702 --> 1:11:20.900 +IBM models and there was this concept of fertility. + +1:11:21.241 --> 1:11:26.299 +The concept of fertility is means like for +one saucepan, and how many target pores does + +1:11:26.299 --> 1:11:27.104 +it translate? + +1:11:27.847 --> 1:11:34.805 +And exactly that we try to do here, and that +means we are calculating like at the top we + +1:11:34.805 --> 1:11:36.134 +are calculating. + +1:11:36.396 --> 1:11:42.045 +So it says word is translated into word. + +1:11:42.045 --> 1:11:54.171 +Word might be translated into words into, +so we're trying to predict in how many words. + +1:11:55.935 --> 1:12:10.314 +And then the end of the anchor, so this is +like a length estimation. + +1:12:10.314 --> 1:12:15.523 +You can do it otherwise. + +1:12:16.236 --> 1:12:24.526 +You initialize your decoder input and we know +it's good with word embeddings so we're trying + +1:12:24.526 --> 1:12:28.627 +to do the same thing and what people then do. + +1:12:28.627 --> 1:12:35.224 +They initialize it again with word embedding +but in the frequency of the. + +1:12:35.315 --> 1:12:36.460 +So we have the cartilage. + +1:12:36.896 --> 1:12:47.816 +So one has two, so twice the is and then one +is, so that is then our initialization. + +1:12:48.208 --> 1:12:57.151 +In other words, if you don't predict fertilities +but predict lengths, you can just initialize + +1:12:57.151 --> 1:12:57.912 +second. + +1:12:58.438 --> 1:13:07.788 +This often works a bit better, but that's +the other. 
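[Editor's note: a small sketch of the fertility-based initialization just described. Each source embedding is copied as many times as its predicted fertility says, and all target positions are then predicted in one parallel step. The tensor shapes, the generic decoder callable and the plain argmax decoding are illustrative assumptions.]

import torch

def build_decoder_inputs(src_embeddings, fertilities):
    """src_embeddings: (src_len, d) embeddings of the source words
       fertilities:    (src_len,) integer tensor, predicted number of
                       target words per source word
       Returns (sum(fertilities), d) decoder inputs."""
    # Row i of the embeddings is repeated fertilities[i] times.
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)

def parallel_decode(decoder, encoder_states, decoder_inputs):
    # One single forward pass produces logits for every target position...
    logits = decoder(decoder_inputs, encoder_states)  # (tgt_len, vocab)
    # ...and each position is filled independently with its most probable
    # word, which is exactly where the independence problems below come from.
    return logits.argmax(dim=-1)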
+ +1:13:07.788 --> 1:13:16.432 +Now you have everything in training and testing. + +1:13:16.656 --> 1:13:18.621 +This is all available at once. + +1:13:20.280 --> 1:13:31.752 +Then we can generate everything in parallel, +so we have the decoder stack, and that is now + +1:13:31.752 --> 1:13:33.139 +as before. + +1:13:35.395 --> 1:13:41.555 +And then we're doing the translation predictions +here on top of it in order to do. + +1:13:43.083 --> 1:13:59.821 +And then we are predicting here the target +words and once predicted, and that is the basic + +1:13:59.821 --> 1:14:00.924 +idea. + +1:14:01.241 --> 1:14:08.171 +Machine translation: Where the idea is, we +don't have to do one by one what we're. + +1:14:10.210 --> 1:14:13.900 +So this looks really, really, really great. + +1:14:13.900 --> 1:14:20.358 +On the first view there's one challenge with +this, and this is the baseline. + +1:14:20.358 --> 1:14:27.571 +Of course there's some improvements, but in +general the quality is often significant. + +1:14:28.068 --> 1:14:32.075 +So here you see the baseline models. + +1:14:32.075 --> 1:14:38.466 +You have a loss of ten blue points or something +like that. + +1:14:38.878 --> 1:14:40.230 +So why does it change? + +1:14:40.230 --> 1:14:41.640 +So why is it happening? + +1:14:43.903 --> 1:14:56.250 +If you look at the errors there is repetitive +tokens, so you have like or things like that. + +1:14:56.536 --> 1:15:01.995 +Broken senses or influent senses, so that +exactly where algebra aggressive models are + +1:15:01.995 --> 1:15:04.851 +very good, we say that's a bit of a problem. + +1:15:04.851 --> 1:15:07.390 +They generate very fluid transcription. + +1:15:07.387 --> 1:15:10.898 +Translation: Sometimes there doesn't have +to do anything with the input. + +1:15:11.411 --> 1:15:14.047 +But generally it really looks always very +fluid. + +1:15:14.995 --> 1:15:20.865 +Here exactly the opposite, so the problem +is that we don't have really fluid translation. + +1:15:21.421 --> 1:15:26.123 +And that is mainly due to the challenge that +we have this independent assumption. + +1:15:26.646 --> 1:15:35.873 +So in this case, the probability of Y of the +second position is independent of the probability + +1:15:35.873 --> 1:15:40.632 +of X, so we don't know what was there generated. + +1:15:40.632 --> 1:15:43.740 +We're just generating it there. + +1:15:43.964 --> 1:15:55.439 +You can see it also in a bit of examples. + +1:15:55.439 --> 1:16:03.636 +You can over-panelize shifts. + +1:16:04.024 --> 1:16:10.566 +And the problem is this is already an improvement +again, but this is also similar to. + +1:16:11.071 --> 1:16:19.900 +So you can, for example, translate heeded +back, or maybe you could also translate it + +1:16:19.900 --> 1:16:31.105 +with: But on their feeling down in feeling +down, if the first position thinks of their + +1:16:31.105 --> 1:16:34.594 +feeling done and the second. + +1:16:35.075 --> 1:16:42.908 +So each position here and that is one of the +main issues here doesn't know what the other. + +1:16:43.243 --> 1:16:53.846 +And for example, if you are translating something +with, you can often translate things in two + +1:16:53.846 --> 1:16:58.471 +ways: German with a different agreement. + +1:16:58.999 --> 1:17:02.058 +And then here where you have to decide do +a used jet. + +1:17:02.162 --> 1:17:05.460 +Interpretator: It doesn't know which word +it has to select. 
+ +1:17:06.086 --> 1:17:14.789 +Mean, of course, it knows a hidden state, +but in the end you have a liability distribution. + +1:17:16.256 --> 1:17:20.026 +And that is the important thing in the outer +regressive month. + +1:17:20.026 --> 1:17:24.335 +You know that because you have put it in you +here, you don't know that. + +1:17:24.335 --> 1:17:29.660 +If it's equal probable here to two, you don't +Know Which Is Selected, and of course that + +1:17:29.660 --> 1:17:32.832 +depends on what should be the latest traction +under. + +1:17:33.333 --> 1:17:39.554 +Yep, that's the undershift, and we're going +to last last the next time. + +1:17:39.554 --> 1:17:39.986 +Yes. + +1:17:40.840 --> 1:17:44.934 +Doesn't this also appear in and like now we're +talking about physical training or. + +1:17:46.586 --> 1:17:48.412 +The thing is in the auto regress. + +1:17:48.412 --> 1:17:50.183 +If you give it the correct one,. + +1:17:50.450 --> 1:17:55.827 +So if you predict here comma what the reference +is feeling then you tell the model here. + +1:17:55.827 --> 1:17:59.573 +The last one was feeling and then it knows +it has to be done. + +1:17:59.573 --> 1:18:04.044 +But here it doesn't know that because it doesn't +get as input as a right. + +1:18:04.204 --> 1:18:24.286 +Yes, that's a bit depending on what. + +1:18:24.204 --> 1:18:27.973 +But in training, of course, you just try to +make the highest one the current one. + +1:18:31.751 --> 1:18:38.181 +So what you can do is things like CDC loss +which can adjust for this. + +1:18:38.181 --> 1:18:42.866 +So then you can also have this shifted correction. + +1:18:42.866 --> 1:18:50.582 +If you're doing this type of correction in +the CDC loss you don't get full penalty. + +1:18:50.930 --> 1:18:58.486 +Just shifted by one, so it's a bit of a different +loss, which is mainly used in, but. + +1:19:00.040 --> 1:19:03.412 +It can be used in order to address this problem. + +1:19:04.504 --> 1:19:13.844 +The other problem is that outer regressively +we have the label buyers that tries to disimmigrate. + +1:19:13.844 --> 1:19:20.515 +That's the example did before was if you translate +thank you to Dung. + +1:19:20.460 --> 1:19:31.925 +And then it might end up because it learns +in the first position and the second also. + +1:19:32.492 --> 1:19:43.201 +In order to prevent that, it would be helpful +for one output, only one output, so that makes + +1:19:43.201 --> 1:19:47.002 +the system already better learn. + +1:19:47.227 --> 1:19:53.867 +Might be that for slightly different inputs +you have different outputs, but for the same. + +1:19:54.714 --> 1:19:57.467 +That we can luckily very easily solve. + +1:19:59.119 --> 1:19:59.908 +And it's done. + +1:19:59.908 --> 1:20:04.116 +We just learned the technique about it, which +is called knowledge distillation. + +1:20:04.985 --> 1:20:13.398 +So what we can do and the easiest solution +to prove your non-autoregressive model is to + +1:20:13.398 --> 1:20:16.457 +train an auto regressive model. + +1:20:16.457 --> 1:20:22.958 +Then you decode your whole training gamer +with this model and then. + +1:20:23.603 --> 1:20:27.078 +While the main advantage of that is that this +is more consistent,. + +1:20:27.407 --> 1:20:33.995 +So for the same input you always have the +same output. + +1:20:33.995 --> 1:20:41.901 +So you have to make your training data more +consistent and learn. 
+ +1:20:42.482 --> 1:20:54.471 +So there is another advantage of knowledge +distillation and that advantage is you have + +1:20:54.471 --> 1:20:59.156 +more consistent training signals. + +1:21:04.884 --> 1:21:10.630 +There's another to make the things more easy +at the beginning. + +1:21:10.630 --> 1:21:16.467 +There's this plants model, black model where +you do more masks. + +1:21:16.756 --> 1:21:26.080 +So during training, especially at the beginning, +you give some correct solutions at the beginning. + +1:21:28.468 --> 1:21:38.407 +And there is this tokens at a time, so the +idea is to establish other regressive training. + +1:21:40.000 --> 1:21:50.049 +And some targets are open, so you always predict +only like first auto regression is K. + +1:21:50.049 --> 1:21:59.174 +It puts one, so you always have one input +and one output, then you do partial. + +1:21:59.699 --> 1:22:05.825 +So in that way you can slowly learn what is +a good and what is a bad answer. + +1:22:08.528 --> 1:22:10.862 +It doesn't sound very impressive. + +1:22:10.862 --> 1:22:12.578 +Don't contact me anyway. + +1:22:12.578 --> 1:22:15.323 +Go all over your training data several. + +1:22:15.875 --> 1:22:20.655 +You can even switch in between. + +1:22:20.655 --> 1:22:29.318 +There is a homework on this thing where you +try to start. + +1:22:31.271 --> 1:22:41.563 +You have to learn so there's a whole work +on that so this is often happening and it doesn't + +1:22:41.563 --> 1:22:46.598 +mean it's less efficient but still it helps. + +1:22:49.389 --> 1:22:57.979 +For later maybe here are some examples of +how much things help. + +1:22:57.979 --> 1:23:04.958 +Maybe one point here is that it's really important. + +1:23:05.365 --> 1:23:13.787 +Here's the translation performance and speed. + +1:23:13.787 --> 1:23:24.407 +One point which is a point is if you compare +researchers. + +1:23:24.784 --> 1:23:33.880 +So yeah, if you're compared to one very weak +baseline transformer even with beam search, + +1:23:33.880 --> 1:23:40.522 +then you're ten times slower than a very strong +auto regressive. + +1:23:40.961 --> 1:23:48.620 +If you make a strong baseline then it's going +down to depending on times and here like: You + +1:23:48.620 --> 1:23:53.454 +have a lot of different speed ups. + +1:23:53.454 --> 1:24:03.261 +Generally, it makes a strong baseline and +not very simple transformer. + +1:24:07.407 --> 1:24:20.010 +Yeah, with this one last thing that you can +do to speed up things and also reduce your + +1:24:20.010 --> 1:24:25.950 +memory is what is called half precision. + +1:24:26.326 --> 1:24:29.139 +And especially for decoding issues for training. + +1:24:29.139 --> 1:24:31.148 +Sometimes it also gets less stale. + +1:24:32.592 --> 1:24:45.184 +With this we close nearly wait a bit, so what +you should remember is that efficient machine + +1:24:45.184 --> 1:24:46.963 +translation. + +1:24:47.007 --> 1:24:51.939 +We have, for example, looked at knowledge +distillation. + +1:24:51.939 --> 1:24:55.991 +We have looked at non auto regressive models. + +1:24:55.991 --> 1:24:57.665 +We have different. + +1:24:58.898 --> 1:25:02.383 +For today and then only requests. + +1:25:02.383 --> 1:25:08.430 +So if you haven't done so, please fill out +the evaluation. + +1:25:08.388 --> 1:25:20.127 +So now if you have done so think then you +should have and with the online people hopefully. 
+
1:25:20.320 --> 1:25:29.758
+It is the only possibility to tell us which
+things are good and which are not. Well, not the
+only one, but the most efficient one.

1:25:31.851 --> 1:25:35.875
+So thanks to all the students for doing it in
+the next days. Okay, then thank you.
+ +0:08:26.766 --> 0:08:41.182 +And if we are doing that, of course, we talked +about the synthetic data where we do back translation. + +0:08:41.341 --> 0:08:46.446 +But of course it gives you some aren't up +about norm, you cannot be much better than + +0:08:46.446 --> 0:08:46.806 +this. + +0:08:48.308 --> 0:08:57.194 +That is, we'll get more and more on issues, +so maybe at some point we won't look at the + +0:08:57.194 --> 0:09:06.687 +current Internet, but focus on oats like image +of the Internet, which are created by Archive. + +0:09:07.527 --> 0:09:18.611 +There's lots of classification algorithms +on how to classify automatic data they had + +0:09:18.611 --> 0:09:26.957 +a very interesting paper on how to watermark +their translation. + +0:09:27.107 --> 0:09:32.915 +So there's like two scenarios of course in +this program: The one thing you might want + +0:09:32.915 --> 0:09:42.244 +to find your own translation if you're a big +company and say do an antisystem that may be + +0:09:42.244 --> 0:09:42.866 +used. + +0:09:43.083 --> 0:09:49.832 +This problem might be that most of the translation +out there is created by you. + +0:09:49.832 --> 0:10:02.007 +You might be able: And there is a relatively +easy way of doing that so that there are other + +0:10:02.007 --> 0:10:09.951 +peoples' mainly that can do like the search +or teacher. + +0:10:09.929 --> 0:10:12.878 +They are different, but there is not the one +correction station. + +0:10:13.153 --> 0:10:23.763 +So what you then can't do is you can't output +the best one to the user, but the highest value. + +0:10:23.763 --> 0:10:30.241 +For example, it's easy, but you can take the +translation. + +0:10:30.870 --> 0:10:40.713 +And if you always give the translation of +your investments, which are all good with the + +0:10:40.713 --> 0:10:42.614 +most ease, then. + +0:10:42.942 --> 0:10:55.503 +But of course this you can only do with most +of the data generated by your model. + +0:10:55.503 --> 0:11:02.855 +What we are now seeing is not only checks, +but. + +0:11:03.163 --> 0:11:13.295 +But it's definitely an additional research +question that might get more and more importance, + +0:11:13.295 --> 0:11:18.307 +and it might be an additional filtering step. + +0:11:18.838 --> 0:11:29.396 +There are other issues in data quality, so +in which direction wasn't translated, so that + +0:11:29.396 --> 0:11:31.650 +is not interested. + +0:11:31.891 --> 0:11:35.672 +But if you're now reaching better and better +quality, it makes a difference. + +0:11:35.672 --> 0:11:39.208 +The original data was from German to English +or from English to German. + +0:11:39.499 --> 0:11:44.797 +Because translation, they call it translate +Chinese. + +0:11:44.797 --> 0:11:53.595 +So if you generate German from English, it +has a more similar structure as if you would + +0:11:53.595 --> 0:11:55.195 +directly speak. + +0:11:55.575 --> 0:11:57.187 +So um. + +0:11:57.457 --> 0:12:03.014 +These are all issues which you then might +do like do additional training to remove them + +0:12:03.014 --> 0:12:07.182 +or you first train on them and later train +on other quality data. + +0:12:07.182 --> 0:12:11.034 +But yet that's a general view on so it's an +important issue. + +0:12:11.034 --> 0:12:17.160 +But until now I think it hasn't been addressed +that much maybe because the quality was decently. + +0:12:18.858 --> 0:12:23.691 +Actually, I think we're sure if we have the +time we use the Internet. 
+ +0:12:23.691 --> 0:12:29.075 +The problem is, it's a lot of English speaking +text, but most used languages. + +0:12:29.075 --> 0:12:34.460 +I don't know some language in Africa that's +spoken, but we do about that one. + +0:12:34.460 --> 0:12:37.566 +I mean, that's why most data is English too. + +0:12:38.418 --> 0:12:42.259 +Other languages, and then you get the best. + +0:12:42.259 --> 0:12:46.013 +If there is no data on the Internet, then. + +0:12:46.226 --> 0:12:48.255 +So there is still a lot of data collection. + +0:12:48.255 --> 0:12:50.976 +Also in the wild way you try to improve there +and collect. + +0:12:51.431 --> 0:12:57.406 +But English is the most in the world, but +you find surprisingly much data also for other + +0:12:57.406 --> 0:12:58.145 +languages. + +0:12:58.678 --> 0:13:04.227 +Of course, only if they're written remember. + +0:13:04.227 --> 0:13:15.077 +Most languages are not written at all, but +for them you might find some video, but it's + +0:13:15.077 --> 0:13:17.420 +difficult to find. + +0:13:17.697 --> 0:13:22.661 +So this is mainly done for the web trawling. + +0:13:22.661 --> 0:13:29.059 +It's mainly done for languages which are commonly +spoken. + +0:13:30.050 --> 0:13:37.907 +Is exactly the next point, so this is that +much data is only true for English and some + +0:13:37.907 --> 0:13:41.972 +other languages, but of course there's many. + +0:13:41.982 --> 0:13:50.285 +And therefore a lot of research on how to +make things efficient and efficient and learn + +0:13:50.285 --> 0:13:54.248 +faster from pure data is still essential. + +0:13:59.939 --> 0:14:06.326 +So what we are interested in now on data is +parallel data. + +0:14:06.326 --> 0:14:10.656 +We assume always we have parallel data. + +0:14:10.656 --> 0:14:12.820 +That means we have. + +0:14:13.253 --> 0:14:20.988 +To be careful when you start crawling from +the web, we might get only related types of. + +0:14:21.421 --> 0:14:30.457 +So one comedy thing is what people refer as +noisy parallel data where there is documents + +0:14:30.457 --> 0:14:34.315 +which are translations of each other. + +0:14:34.434 --> 0:14:44.300 +So you have senses where there is no translation +on the other side because you have. + +0:14:44.484 --> 0:14:50.445 +So if you have these types of documents your +algorithm to extract parallel data might be + +0:14:50.445 --> 0:14:51.918 +a bit more difficult. + +0:14:52.352 --> 0:15:04.351 +Know if you can still remember in the beginning +of the lecture when we talked about different + +0:15:04.351 --> 0:15:06.393 +data resources. + +0:15:06.286 --> 0:15:11.637 +But the first step is then approached to a +light source and target sentences, and it was + +0:15:11.637 --> 0:15:16.869 +about like a steep vocabulary, and then you +have some probabilities for one to one and + +0:15:16.869 --> 0:15:17.590 +one to one. + +0:15:17.590 --> 0:15:23.002 +It's very like simple algorithm, but yet it +works fine for really a high quality parallel + +0:15:23.002 --> 0:15:23.363 +data. + +0:15:23.623 --> 0:15:30.590 +But when we're talking about noisy data, we +might have to do additional steps and use more + +0:15:30.590 --> 0:15:35.872 +advanced models to extract what is parallel +and to get high quality. + +0:15:36.136 --> 0:15:44.682 +So if we just had no easy parallel data, the +document might not be as easy to extract. + +0:15:49.249 --> 0:15:54.877 +And then there is even the more extreme pains, +which has also been used to be honest. 
+ +0:15:54.877 --> 0:15:58.214 +The use of this data is reasoning not that +common. + +0:15:58.214 --> 0:16:04.300 +It was more interested maybe like ten or fifteen +years ago, and that is what people referred + +0:16:04.300 --> 0:16:05.871 +to as comparative data. + +0:16:06.266 --> 0:16:17.167 +And then the idea is you even don't have translations +like sentences which are translations of each + +0:16:17.167 --> 0:16:25.234 +other, but you have more news documents or +articles about the same topic. + +0:16:25.205 --> 0:16:32.410 +But it's more that you find phrases which +are too big in the user, so even black fragments. + +0:16:32.852 --> 0:16:44.975 +So if you think about the pedia, for example, +these articles have to be written in like the + +0:16:44.975 --> 0:16:51.563 +Wikipedia general idea independent of each +other. + +0:16:51.791 --> 0:17:01.701 +They have different information in there, +and I mean, the German movie gets more detail + +0:17:01.701 --> 0:17:04.179 +than the English one. + +0:17:04.179 --> 0:17:07.219 +However, it might be that. + +0:17:07.807 --> 0:17:20.904 +And the same thing is that you think about +newspaper articles if they're at the same time. + +0:17:21.141 --> 0:17:25.603 +And so this is an ability to learn. + +0:17:25.603 --> 0:17:36.760 +For example, new phrases, vocabulary and stature +if you don't have monitor all time long. + +0:17:37.717 --> 0:17:49.020 +And then not everything will be the same, +but there might be an overlap about events. + +0:17:54.174 --> 0:18:00.348 +So if we're talking about web trolling said +in the beginning it was really about specific. + +0:18:00.660 --> 0:18:18.878 +They do very good things by hand and really +focus on them and do a very specific way of + +0:18:18.878 --> 0:18:20.327 +doing. + +0:18:20.540 --> 0:18:23.464 +The European Parliament was very focused in +Ted. + +0:18:23.464 --> 0:18:26.686 +Maybe you even have looked in the particular +session. + +0:18:27.427 --> 0:18:40.076 +And these are still important, but they are +of course very specific in covering different + +0:18:40.076 --> 0:18:41.341 +pockets. + +0:18:42.002 --> 0:18:55.921 +Then there was a focus on language centering, +so there was a big drawer, for example, that + +0:18:55.921 --> 0:18:59.592 +you can check websites. + +0:19:00.320 --> 0:19:07.918 +Apparently what really people like is a more +general approach where you just have to specify. + +0:19:07.918 --> 0:19:15.355 +I'm interested in data from German to Lithuanian +and then you can as automatic as possible. + +0:19:15.355 --> 0:19:19.640 +You can collect data and extract codelator +for this. + +0:19:21.661 --> 0:19:25.633 +So is this our interest? + +0:19:25.633 --> 0:19:36.435 +Of course, the question is how can we build +these types of systems? + +0:19:36.616 --> 0:19:52.913 +The first are more general web crawling base +systems, so there is nothing about. + +0:19:53.173 --> 0:19:57.337 +Based on the websites you have, you have to +do like text extraction. + +0:19:57.597 --> 0:20:06.503 +We are typically not that much interested +in text and images in there, so we try to extract + +0:20:06.503 --> 0:20:07.083 +text. + +0:20:07.227 --> 0:20:16.919 +This is also not specific to machine translation, +but it's a more traditional way of doing web + +0:20:16.919 --> 0:20:17.939 +trolling. + +0:20:18.478 --> 0:20:22.252 +And at the end you have mirror like some other +set of document collectors. 
+ +0:20:22.842 --> 0:20:37.025 +Is the idea, so you have the text, and often +this is a document, and so in the end. + +0:20:37.077 --> 0:20:51.523 +And that is some of your starting point now +for doing the more machine translation. + +0:20:52.672 --> 0:21:05.929 +One way of doing that now is very similar +to what you might have think about the traditional + +0:21:05.929 --> 0:21:06.641 +one. + +0:21:06.641 --> 0:21:10.633 +The first thing is to do a. + +0:21:11.071 --> 0:21:22.579 +So you have this based on the initial fact +that you know this is a German website in the + +0:21:22.579 --> 0:21:25.294 +English translation. + +0:21:25.745 --> 0:21:31.037 +And based on this document alignment, then +you can do your sentence alignment. + +0:21:31.291 --> 0:21:39.072 +And this is similar to what we had before +with the church accordion. + +0:21:39.072 --> 0:21:43.696 +This is typically more noisy peril data. + +0:21:43.623 --> 0:21:52.662 +So that you are not assuming that everything +is on both sides, that the order is the same, + +0:21:52.662 --> 0:21:56.635 +so you should do more flexible systems. + +0:21:58.678 --> 0:22:14.894 +Then it depends if the documents you were +drawing were really some type of parallel data. + +0:22:15.115 --> 0:22:35.023 +Say then you should do what is referred to +as fragmented extraction. + +0:22:36.136 --> 0:22:47.972 +One problem with these types of models is +if you are doing errors in your document alignment,. + +0:22:48.128 --> 0:22:55.860 +It means that if you are saying these two +documents are align then you can only find + +0:22:55.860 --> 0:22:58.589 +sense and if you are missing. + +0:22:59.259 --> 0:23:15.284 +Is very different, only small parts of the +document are parallel, and most parts are independent + +0:23:15.284 --> 0:23:17.762 +of each other. + +0:23:19.459 --> 0:23:31.318 +Therefore, more recently, there is also the +idea of directly doing sentence aligned so + +0:23:31.318 --> 0:23:35.271 +that you're directly taking. + +0:23:36.036 --> 0:23:41.003 +Was already one challenge of this one, the +second approach. + +0:23:42.922 --> 0:23:50.300 +Yes, so one big challenge on here, beef, then +you have to do a lot of comparison. + +0:23:50.470 --> 0:23:59.270 +You have to cook out every source, every target +set and square. + +0:23:59.270 --> 0:24:06.283 +If you think of a million or trillion pairs, +then. + +0:24:07.947 --> 0:24:12.176 +And this also gives you a reason for a last +step in both cases. + +0:24:12.176 --> 0:24:18.320 +So in both of them you have to remember you're +typically eating here in this very large data + +0:24:18.320 --> 0:24:18.650 +set. + +0:24:18.650 --> 0:24:24.530 +So all of these and also the document alignment +here they should be done very efficient. + +0:24:24.965 --> 0:24:42.090 +And if you want to do it very efficiently, +that means your quality will go lower. + +0:24:41.982 --> 0:24:47.348 +Because you just have to ever see it fast, +and then yeah you can put less computation + +0:24:47.348 --> 0:24:47.910 +on each. + +0:24:48.688 --> 0:25:06.255 +Therefore, in a lot of scenarios it makes +sense to make an additional filtering step + +0:25:06.255 --> 0:25:08.735 +at the end. + +0:25:08.828 --> 0:25:13.370 +And then we do a second filtering step where +we now can put a lot more effort. + +0:25:13.433 --> 0:25:20.972 +Because now we don't have like any square +possible combinations anymore, we have already + +0:25:20.972 --> 0:25:26.054 +selected and maybe in dimension of maybe like +two or three. 
+ +0:25:26.054 --> 0:25:29.273 +For each sentence we even don't have. + +0:25:29.429 --> 0:25:39.234 +And then we can put a lot more effort in each +individual example and build a high quality + +0:25:39.234 --> 0:25:42.611 +classic fire to really select. + +0:25:45.125 --> 0:26:00.506 +Two or one example for that, so one of the +biggest projects doing this is the so-called + +0:26:00.506 --> 0:26:03.478 +Paratrol Corpus. + +0:26:03.343 --> 0:26:11.846 +Typically it's like before the picturing so +there are a lot of challenges on how you can. + +0:26:12.272 --> 0:26:25.808 +And the steps they start to be with the seatbelt, +so what you should give at the beginning is: + +0:26:26.146 --> 0:26:36.908 +Then they do the problem, the text extraction, +the document alignment, the sentence alignment, + +0:26:36.908 --> 0:26:45.518 +and the sentence filter, and it swings down +to implementing the text store. + +0:26:46.366 --> 0:26:51.936 +We'll see later for a lot of language pairs +exist so it's easier to download them and then + +0:26:51.936 --> 0:26:52.793 +like improve. + +0:26:53.073 --> 0:27:08.270 +For example, the crawling one thing they often +do is even not throw the direct website because + +0:27:08.270 --> 0:27:10.510 +there's also. + +0:27:10.770 --> 0:27:14.540 +Black parts of the Internet that they can +work on today. + +0:27:14.854 --> 0:27:22.238 +In more detail, this is a bit shown here. + +0:27:22.238 --> 0:27:31.907 +All the steps you can see are different possibilities. + +0:27:32.072 --> 0:27:39.018 +You need a bit of knowledge to do that, or +you can build a machine translation system. + +0:27:39.239 --> 0:27:47.810 +There are two different ways of deduction +and alignment. + +0:27:47.810 --> 0:27:52.622 +You can use sentence alignment. + +0:27:53.333 --> 0:28:02.102 +And how you can do the flexigrade exam, for +example, the lexic graph, or you can chin. + +0:28:02.422 --> 0:28:05.826 +To the next step in a bit more detail. + +0:28:05.826 --> 0:28:13.680 +But before we're doing it, I need more questions +about the general overview of how these. + +0:28:22.042 --> 0:28:37.058 +Yeah, so two or three things to web-drawing, +so you normally start with the URLs. + +0:28:37.058 --> 0:28:40.903 +It's most promising. + +0:28:41.021 --> 0:28:48.652 +What you found is that if you're interested +in German to English, you would: Companies + +0:28:48.652 --> 0:29:01.074 +where you know they have a German and an English +website are from agencies which might be: And + +0:29:01.074 --> 0:29:10.328 +then we can use one of these tools to start +from there using standard web calling techniques. + +0:29:11.071 --> 0:29:23.942 +There are several challenges when doing that, +so if you request a website too often you can: + +0:29:25.305 --> 0:29:37.819 +You have to keep in history of the sites and +you click on all the links and then click on + +0:29:37.819 --> 0:29:40.739 +all the links again. + +0:29:41.721 --> 0:29:49.432 +To be very careful about legal issues starting +from this robotics day so get allowed to use. + +0:29:49.549 --> 0:29:58.941 +Mean, that's the one major thing about what +trolley general is. + +0:29:58.941 --> 0:30:05.251 +The problem is how you deal with property. + +0:30:05.685 --> 0:30:13.114 +That is why it is easier sometimes to start +with some quick fold data that you don't have. + +0:30:13.893 --> 0:30:22.526 +Of course, the network issues you retry, so +there's more technical things, but there's + +0:30:22.526 --> 0:30:23.122 +good. 
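[Editor's note: as a small illustration of the technical points mentioned above (respecting robots.txt and not hitting a site too often), here is a sketch using only the Python standard library. The user agent string and the delay are placeholders, and error handling and retries are omitted.]

import time
import urllib.robotparser
from urllib.parse import urlsplit
from urllib.request import urlopen

def polite_fetch(urls, user_agent="ExampleParallelCrawler", delay_seconds=5.0):
    """Fetch URLs from a single host while honouring robots.txt and
    pausing between requests."""
    parts = urlsplit(urls[0])
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()

    pages = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue                      # page is disallowed for crawlers
        with urlopen(url) as response:
            pages[url] = response.read()  # raw HTML for later text extraction
        time.sleep(delay_seconds)         # politeness delay between requests
    return pages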
+ +0:30:24.724 --> 0:30:35.806 +Another thing which is very helpful and is +often done is instead of doing the web trolling + +0:30:35.806 --> 0:30:38.119 +yourself, relying. + +0:30:38.258 --> 0:30:44.125 +And one thing is it's common crawl from the +web. + +0:30:44.125 --> 0:30:51.190 +Think on this common crawl a lot of these +language models. + +0:30:51.351 --> 0:30:59.763 +So think in American Company or organization +which really works on like writing. + +0:31:00.000 --> 0:31:01.111 +Possible. + +0:31:01.111 --> 0:31:10.341 +So the nice thing is if you start with this +you don't have to worry about network. + +0:31:10.250 --> 0:31:16.086 +I don't think you can do that because it's +too big, but you can do a pipeline on how to + +0:31:16.086 --> 0:31:16.683 +process. + +0:31:17.537 --> 0:31:28.874 +That is, of course, a general challenge in +all this web crawling and parallel web mining. + +0:31:28.989 --> 0:31:38.266 +That means you cannot just don't know the +data and study the processes. + +0:31:39.639 --> 0:31:45.593 +Here it might make sense to directly fields +of both domains that in some way bark just + +0:31:45.593 --> 0:31:46.414 +marginally. + +0:31:49.549 --> 0:31:59.381 +Then you can do the text extraction, which +means like converging two HTML and then splitting + +0:31:59.381 --> 0:32:01.707 +things from the HTML. + +0:32:01.841 --> 0:32:04.802 +Often very important is to do the language +I need. + +0:32:05.045 --> 0:32:16.728 +It's not that clear even if it's links which +language it is, but they are quite good tools + +0:32:16.728 --> 0:32:22.891 +like that can't identify from relatively short. + +0:32:23.623 --> 0:32:36.678 +And then you are now in the situation that +you have all your danger and that you can start. + +0:32:37.157 --> 0:32:43.651 +After the text extraction you have now a collection +or a large collection of of data where it's + +0:32:43.651 --> 0:32:49.469 +like text and maybe the document at use of +some meta information and now the question + +0:32:49.469 --> 0:32:55.963 +is based on this monolingual text or multilingual +text so text in many languages but not align. + +0:32:56.036 --> 0:32:59.863 +How can you now do a generate power? + +0:33:01.461 --> 0:33:06.289 +And UM. + +0:33:05.705 --> 0:33:12.965 +So the main thing, if we're not seeing it +as a task, or if we want to do it in a machine + +0:33:12.965 --> 0:33:20.388 +learning way, what we have is we have a set +of sentences and a suits language, and we have + +0:33:20.388 --> 0:33:23.324 +a set Of sentences from the target. + +0:33:23.823 --> 0:33:27.814 +This is the target language. + +0:33:27.814 --> 0:33:31.392 +This is the data we have. + +0:33:31.392 --> 0:33:37.034 +We kind of directly assume any ordering. + +0:33:38.018 --> 0:33:44.502 +More documents there are not really in line +or there is maybe a graph and what we are interested + +0:33:44.502 --> 0:33:50.518 +in is finding these alignments so which senses +are aligned to each other and which senses + +0:33:50.518 --> 0:33:53.860 +we can remove but we don't have translations +for. + +0:33:53.974 --> 0:34:00.339 +But exactly this mapping is what we are interested +in and what we need to find. + +0:34:01.901 --> 0:34:17.910 +And if we are modeling it more from the machine +translation point of view, what can model that + +0:34:17.910 --> 0:34:21.449 +as a classification? + +0:34:21.681 --> 0:34:34.850 +And so the main challenge is to build this +type of classifier and you want to decide is + +0:34:34.850 --> 0:34:36.646 +a parallel. 
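[Editor's note: one common way to realise such a classifier, sketched here as an assumption rather than as the exact method used in the projects discussed, is to embed source and target sentences into a shared multilingual vector space and decide "parallel or not" with a similarity threshold. The embed() function is a placeholder for any multilingual sentence encoder, and the threshold value is illustrative and would be tuned on held-out data.]

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def is_parallel(src_sentence, tgt_sentence, embed, threshold=0.8):
    """embed: callable that maps a sentence to a vector in a space shared
    across languages; pairs above the threshold are kept as parallel."""
    return cosine(embed(src_sentence), embed(tgt_sentence)) >= threshold

[In practice a margin criterion, comparing a pair's similarity to the similarities of its nearest neighbours, is often preferred over a raw cosine threshold, since absolute similarity values vary a lot between sentences.]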
+ +0:34:42.402 --> 0:34:50.912 +However, the biggest challenge has already +pointed out in the beginning is the sites if + +0:34:50.912 --> 0:34:53.329 +we have millions target. + +0:34:53.713 --> 0:35:05.194 +The number of comparison is n square, so this +very path is very inefficient, and we need + +0:35:05.194 --> 0:35:06.355 +to find. + +0:35:07.087 --> 0:35:16.914 +And traditionally there is the first one mentioned +before the local or the hierarchical meaning + +0:35:16.914 --> 0:35:20.292 +mining and there the idea is OK. + +0:35:20.292 --> 0:35:23.465 +First we are lining documents. + +0:35:23.964 --> 0:35:32.887 +Move back the things and align them, and once +you have the alignment you only need to remind. + +0:35:33.273 --> 0:35:51.709 +That of course makes anything more efficient +because we don't have to do all the comparison. + +0:35:53.253 --> 0:35:56.411 +Then it's, for example, in the before mentioned +apparel. + +0:35:57.217 --> 0:36:11.221 +But it has the issue that if this document +is bad you have error propagation and you can + +0:36:11.221 --> 0:36:14.211 +recover from that. + +0:36:14.494 --> 0:36:20.715 +Because then document that cannot say ever, +there are some sentences which are: Therefore, + +0:36:20.715 --> 0:36:24.973 +more recently there is also was referred to +as global mining. + +0:36:26.366 --> 0:36:31.693 +And there we really do this. + +0:36:31.693 --> 0:36:43.266 +Although it's in the square, we are doing +all the comparisons. + +0:36:43.523 --> 0:36:52.588 +So the idea is that you can do represent all +the sentences in a vector space. + +0:36:52.892 --> 0:37:06.654 +And then it's about nearest neighbor search +and there is a lot of very efficient algorithms. + +0:37:07.067 --> 0:37:20.591 +Then if you only compare them to your nearest +neighbors you don't have to do like a comparison + +0:37:20.591 --> 0:37:22.584 +but you have. + +0:37:26.186 --> 0:37:40.662 +So in the first step what we want to look +at is this: This document classification refers + +0:37:40.662 --> 0:37:49.584 +to the document alignment, and then we do the +sentence alignment. + +0:37:51.111 --> 0:37:58.518 +And if we're talking about document alignment, +there's like typically two steps in that: We + +0:37:58.518 --> 0:38:01.935 +first do a candidate selection. + +0:38:01.935 --> 0:38:10.904 +Often we have several steps and that is again +to make more things more efficiently. + +0:38:10.904 --> 0:38:13.360 +We have the candidate. + +0:38:13.893 --> 0:38:18.402 +The candidate select means OK, which documents +do we want to compare? + +0:38:19.579 --> 0:38:35.364 +Then if we have initial candidates which might +be parallel, we can do a classification test. + +0:38:35.575 --> 0:38:37.240 +And there is different ways. + +0:38:37.240 --> 0:38:40.397 +We can use lexical similarity or we can use +ten basic. + +0:38:41.321 --> 0:38:48.272 +The first and easiest thing is to take off +possible candidates. + +0:38:48.272 --> 0:38:55.223 +There's one possibility, the other one, is +based on structural. + +0:38:55.235 --> 0:39:05.398 +So based on how your website looks like, you +might find that there are only translations. + +0:39:05.825 --> 0:39:14.789 +This is typically the only case where we try +to do some kind of major information, which + +0:39:14.789 --> 0:39:22.342 +can be very useful because we know that websites, +for example, are linked. + +0:39:22.722 --> 0:39:35.586 +We can try to use some URL patterns, so if +we have some website which ends with the. 
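+
+NOTE
+A skeleton of the hierarchical ("local") mining pipeline sketched
+above, with placeholder functions: documents are aligned first, and
+sentences are then only compared inside aligned document pairs, which
+avoids the full n-squared sentence comparison at the price of error
+propagation from the document step. Both helpers are assumptions
+standing in for the steps discussed in the following segments.
+
+def align_documents(src_docs, tgt_docs):
+    # Placeholder: return index pairs of documents believed to be
+    # translations, e.g. from URL patterns or content similarity.
+    return [(i, i) for i in range(min(len(src_docs), len(tgt_docs)))]
+
+def align_sentences(src_sents, tgt_sents):
+    # Placeholder: e.g. a diagonal-band dynamic programme (see below).
+    return list(zip(src_sents, tgt_sents))
+
+def hierarchical_mining(src_docs, tgt_docs):
+    # src_docs / tgt_docs: lists of documents, each a list of sentences.
+    pairs = []
+    for i, j in align_documents(src_docs, tgt_docs):
+        # Only the sentences of one aligned document pair are compared.
+        pairs.extend(align_sentences(src_docs[i], tgt_docs[j]))
+    return pairs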
+ +0:39:35.755 --> 0:39:43.932 +So that can be easily used in order to find +candidates. + +0:39:43.932 --> 0:39:49.335 +Then we only compare websites where. + +0:39:49.669 --> 0:40:05.633 +The language and the translation of each other, +but typically you hear several heuristics to + +0:40:05.633 --> 0:40:07.178 +do that. + +0:40:07.267 --> 0:40:16.606 +Then you don't have to compare all websites, +but you only have to compare web sites. + +0:40:17.277 --> 0:40:27.607 +Cruiser problems especially with an hour day's +content management system. + +0:40:27.607 --> 0:40:32.912 +Sometimes it's nice and easy to read. + +0:40:33.193 --> 0:40:44.452 +So on the one hand there typically leads from +the parent's side to different languages. + +0:40:44.764 --> 0:40:46.632 +Now I can look at the kit websites. + +0:40:46.632 --> 0:40:49.381 +It's the same thing you can check on the difference. + +0:40:49.609 --> 0:41:06.833 +Languages: You can either do that from the +parent website or you can also click on English. + +0:41:06.926 --> 0:41:10.674 +You can therefore either like prepare to all +the websites. + +0:41:10.971 --> 0:41:18.205 +Can be even more focused and checked if the +link is somehow either flexible or the language + +0:41:18.205 --> 0:41:18.677 +name. + +0:41:19.019 --> 0:41:24.413 +So there really depends on how much you want +to filter out. + +0:41:24.413 --> 0:41:29.178 +There is always a trade-off between being +efficient. + +0:41:33.913 --> 0:41:49.963 +Based on that we then have our candidate list, +so we now have two independent sets of German + +0:41:49.963 --> 0:41:52.725 +documents, but. + +0:41:53.233 --> 0:42:03.515 +And now the task is, we want to extract these, +which are really translations of each other. + +0:42:03.823 --> 0:42:10.201 +So the question of how can we measure the +document similarity? + +0:42:10.201 --> 0:42:14.655 +Because what we then do is, we measure the. + +0:42:14.955 --> 0:42:27.096 +And here you already see why this is also +that problematic from where it's partial or + +0:42:27.096 --> 0:42:28.649 +similarly. + +0:42:30.330 --> 0:42:37.594 +All you can do that is again two folds. + +0:42:37.594 --> 0:42:48.309 +You can do it more content based or more structural +based. + +0:42:48.188 --> 0:42:53.740 +Calculating a lot of features and then maybe +training a classic pyramid small set which + +0:42:53.740 --> 0:42:57.084 +stands like based on the spesse feature is +the data. + +0:42:57.084 --> 0:42:58.661 +It is a corpus parallel. + +0:43:00.000 --> 0:43:10.955 +One way of doing that is to have traction +features, so the idea is the text length, so + +0:43:10.955 --> 0:43:12.718 +the document. + +0:43:13.213 --> 0:43:20.511 +Of course, text links will not be the same, +but if the one document has fifty words and + +0:43:20.511 --> 0:43:24.907 +the other five thousand words, it's quite realistic. + +0:43:25.305 --> 0:43:29.274 +So you can use the text length as one proxy +of. + +0:43:29.274 --> 0:43:32.334 +Is this might be a good translation? + +0:43:32.712 --> 0:43:41.316 +Now the thing is the alignment between the +structure. + +0:43:41.316 --> 0:43:52.151 +If you have here the website you can create +some type of structure. + +0:43:52.332 --> 0:44:04.958 +You can compare that to the French version +and then calculate some similarities because + +0:44:04.958 --> 0:44:07.971 +you see translation. + +0:44:08.969 --> 0:44:12.172 +Of course, it's getting more and more problematic. 
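+
+NOTE
+A small sketch of the URL-pattern heuristic for candidate selection:
+strip language markers from URLs and treat pages whose normalised URLs
+match as candidate translation pairs. The regular expressions, language
+codes and example URLs are illustrative assumptions, not an exhaustive
+set of patterns.
+
+import re
+from collections import defaultdict
+
+LANG_PATTERNS = [
+    r"/(en|de|fr|es)(/|$)",       # e.g. example.org/en/news vs. /de/news
+    r"[?&]lang=(en|de|fr|es)\b",  # e.g. ?lang=de
+    r"\.(en|de|fr|es)\.html$",    # e.g. page.de.html
+]
+
+def normalize(url: str) -> str:
+    # Replace the language code inside each matched pattern by "xx".
+    for pat in LANG_PATTERNS:
+        url = re.sub(pat,
+                     lambda m: m.group(0).replace(m.group(1), "xx"),
+                     url)
+    return url
+
+def candidate_pairs(urls_by_lang):
+    # urls_by_lang: {"de": [urls...], "en": [urls...], ...}
+    buckets = defaultdict(dict)
+    for lang, urls in urls_by_lang.items():
+        for u in urls:
+            buckets[normalize(u)][lang] = u
+    # Keep only buckets that contain more than one language.
+    return [b for b in buckets.values() if len(b) > 1]
+
+print(candidate_pairs({
+    "de": ["https://www.example.org/de/studium.html"],
+    "en": ["https://www.example.org/en/studium.html"],
+}))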
+ +0:44:12.172 --> 0:44:16.318 +It does be a different structure than these +features are helpful. + +0:44:16.318 --> 0:44:22.097 +However, if you are doing it more in a trained +way, you can automatically learn how helpful + +0:44:22.097 --> 0:44:22.725 +they are. + +0:44:24.704 --> 0:44:37.516 +Then there are different ways of yeah: Content +based things: One easy thing, especially if + +0:44:37.516 --> 0:44:48.882 +you have systems that are using the same script +that you are looking for. + +0:44:48.888 --> 0:44:49.611 +The legs. + +0:44:49.611 --> 0:44:53.149 +We call them a beggar words and we'll look +into. + +0:44:53.149 --> 0:44:55.027 +You can use some type of. + +0:44:55.635 --> 0:44:58.418 +And neural embedding is also to abate him +at. + +0:45:02.742 --> 0:45:06.547 +And as then mean we have machine translation,. + +0:45:06.906 --> 0:45:14.640 +And one idea that you can also do is really +use the machine translation. + +0:45:14.874 --> 0:45:22.986 +Because this one is one which takes more effort, +so what you then have to do is put more effort. + +0:45:23.203 --> 0:45:37.526 +You wouldn't do this type of machine translation +based approach for a system which has product. + +0:45:38.018 --> 0:45:53.712 +But maybe your first of thinking why can't +do that because I'm collecting data to build + +0:45:53.712 --> 0:45:55.673 +an system. + +0:45:55.875 --> 0:46:01.628 +So you can use an initial system to translate +it, and then you can collect more data. + +0:46:01.901 --> 0:46:06.879 +And one way of doing that is, you're translating, +for example, all documents even to English. + +0:46:07.187 --> 0:46:25.789 +Then you only need two English data and you +do it in the example with three grams. + +0:46:25.825 --> 0:46:33.253 +For example, the current induction in 1 in +the Spanish, which is German induction in 1, + +0:46:33.253 --> 0:46:37.641 +which was Spanish induction in 2, which was +French. + +0:46:37.637 --> 0:46:52.225 +You're creating this index and then based +on that you can calculate how similar the documents. + +0:46:52.092 --> 0:46:58.190 +And then you can use the Cossack similarity +to really calculate which of the most similar + +0:46:58.190 --> 0:47:00.968 +document or how similar is the document. + +0:47:00.920 --> 0:47:04.615 +And then measure if this is a possible translation. + +0:47:05.285 --> 0:47:14.921 +Mean, of course, the document will not be +exactly the same, and even if you have a parallel + +0:47:14.921 --> 0:47:18.483 +document, French and German, and. + +0:47:18.898 --> 0:47:29.086 +You'll have not a perfect translation, therefore +it's looking into five front overlap since + +0:47:29.086 --> 0:47:31.522 +there should be last. + +0:47:34.074 --> 0:47:42.666 +Okay, before we take the next step and go +into the sentence alignment, there are more + +0:47:42.666 --> 0:47:44.764 +questions about the. + +0:47:51.131 --> 0:47:55.924 +Too Hot and. + +0:47:56.997 --> 0:47:59.384 +Well um. + +0:48:00.200 --> 0:48:05.751 +There is different ways of doing sentence +alignment. + +0:48:05.751 --> 0:48:12.036 +Here's one way to describe is to call the +other line again. + +0:48:12.172 --> 0:48:17.590 +Of course, we have the advantage that we have +only documents, so we might have like hundred + +0:48:17.590 --> 0:48:20.299 +sentences and hundred sentences in the tower. + +0:48:20.740 --> 0:48:31.909 +Although it still might be difficult to compare +all the things in parallel, and. 
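+
+NOTE
+A rough sketch of the translation-based document matching described
+above: assume all documents have already been machine-translated into
+English, represent each one by its word trigrams, and compare documents
+with cosine similarity over the trigram counts. The tokenisation and
+the trigram choice are simplifications of what a production system
+would do.
+
+from collections import Counter
+from math import sqrt
+
+def trigrams(text: str) -> Counter:
+    toks = text.lower().split()
+    return Counter(zip(toks, toks[1:], toks[2:]))
+
+def cosine(a: Counter, b: Counter) -> float:
+    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
+    norm = sqrt(sum(v * v for v in a.values())) * \
+           sqrt(sum(v * v for v in b.values()))
+    return dot / norm if norm else 0.0
+
+def best_match(src_doc_en, tgt_docs_en):
+    # src_doc_en: English MT output of one source-language document.
+    # tgt_docs_en: English MT output of all candidate target documents.
+    sims = [cosine(trigrams(src_doc_en), trigrams(t)) for t in tgt_docs_en]
+    best = max(range(len(sims)), key=sims.__getitem__)
+    return best, sims[best]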
+ +0:48:31.791 --> 0:48:37.541 +And therefore typically these even assume +that we are only interested in a line character + +0:48:37.541 --> 0:48:40.800 +that can be identified on the sum of the diagonal. + +0:48:40.800 --> 0:48:46.422 +Of course, not exactly the diagonal will sum +some parts around it, but in order to make + +0:48:46.422 --> 0:48:47.891 +things more efficient. + +0:48:48.108 --> 0:48:55.713 +You can still do it around the diagonal because +if you say this is a parallel document, we + +0:48:55.713 --> 0:48:56.800 +assume that. + +0:48:56.836 --> 0:49:05.002 +We wouldn't have passed the document alignment, +therefore we wouldn't have seen it. + +0:49:05.505 --> 0:49:06.774 +In the underline. + +0:49:06.774 --> 0:49:10.300 +Then we are calculating the similarity for +these. + +0:49:10.270 --> 0:49:17.428 +Set this here based on the bilingual dictionary, +so it may be based on how much overlap you + +0:49:17.428 --> 0:49:17.895 +have. + +0:49:18.178 --> 0:49:24.148 +And then we are finding a path through it. + +0:49:24.148 --> 0:49:31.089 +You are finding a path which the lights ever +see. + +0:49:31.271 --> 0:49:41.255 +But you're trying to find a pass through your +document so that you get these parallel. + +0:49:41.201 --> 0:49:49.418 +And then the perfect ones here would be your +pass, where you just take this other parallel. + +0:49:51.011 --> 0:50:05.579 +The advantage is that, of course, on the one +end limits your search space. + +0:50:05.579 --> 0:50:07.521 +That is,. + +0:50:07.787 --> 0:50:10.013 +So what does it mean? + +0:50:10.013 --> 0:50:19.120 +So even if you have a very high probable pair, +you're not taking them on because overall. + +0:50:19.399 --> 0:50:27.063 +So sometimes it makes sense to also use this +global information and not only compare on + +0:50:27.063 --> 0:50:34.815 +individual sentences because what you're with +your parents is that sometimes it's only a + +0:50:34.815 --> 0:50:36.383 +good translation. + +0:50:38.118 --> 0:50:51.602 +So by this minion paste you're preventing +the system to do it at the border where there's + +0:50:51.602 --> 0:50:52.201 +no. + +0:50:53.093 --> 0:50:55.689 +So that might achieve you a bit better quality. + +0:50:56.636 --> 0:51:12.044 +The pack always ends if we write the button +for everybody, but it also means you couldn't + +0:51:12.044 --> 0:51:15.126 +necessarily have. + +0:51:15.375 --> 0:51:24.958 +Have some restrictions that is right, so first +of all they can't be translated out. + +0:51:25.285 --> 0:51:32.572 +So the handle line typically only really works +well if you have a relatively high quality. + +0:51:32.752 --> 0:51:39.038 +So if you have this more general data where +there's like some parts are translated and + +0:51:39.038 --> 0:51:39.471 +some. + +0:51:39.719 --> 0:51:43.604 +It doesn't really work, so it might. + +0:51:43.604 --> 0:51:53.157 +It's okay with having maybe at the end some +sentences which are missing, but in generally. + +0:51:53.453 --> 0:51:59.942 +So it's not robust against significant noise +on the. + +0:52:05.765 --> 0:52:12.584 +The second thing is is to what is referred +to as blue alibi. + +0:52:13.233 --> 0:52:16.982 +And this doesn't does, does not do us much. + +0:52:16.977 --> 0:52:30.220 +A global information you can translate each +sentence to English, and then you calculate + +0:52:30.220 --> 0:52:34.885 +the voice for the translation. + +0:52:35.095 --> 0:52:41.888 +And that you would get six answer points, +which are the ones in a purple ear. 
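+
+NOTE
+A simplified sketch of the alignment search just described: a monotone
+dynamic programme over the sentence grid, restricted to a band around
+the diagonal, scoring candidate links by bilingual-dictionary overlap.
+Only 1-1 links and skips are modelled, and the band width, skip cost
+and similarity function are assumptions; real aligners also handle
+1-2 and 2-1 merges.
+
+def overlap_sim(src, tgt, dictionary):
+    # Fraction of source words whose dictionary translation occurs
+    # in the target sentence.
+    src_words = set(src.lower().split())
+    tgt_words = set(tgt.lower().split())
+    hits = sum(1 for w in src_words if dictionary.get(w) in tgt_words)
+    return hits / max(len(src_words), 1)
+
+def align(src_sents, tgt_sents, dictionary, band=5, skip_cost=0.2):
+    n, m = len(src_sents), len(tgt_sents)
+    NEG = float("-inf")
+    score = [[NEG] * (m + 1) for _ in range(n + 1)]
+    back = [[None] * (m + 1) for _ in range(n + 1)]
+    score[0][0] = 0.0
+    for i in range(n + 1):
+        for j in range(m + 1):
+            if abs(i - j) > band or score[i][j] == NEG:
+                continue                     # stay close to the diagonal
+            if i < n and j < m:              # 1-1 link
+                s = score[i][j] + overlap_sim(src_sents[i],
+                                              tgt_sents[j], dictionary)
+                if s > score[i + 1][j + 1]:
+                    score[i + 1][j + 1] = s
+                    back[i + 1][j + 1] = (i, j, "link")
+            for di, dj in ((1, 0), (0, 1)):  # skip one sentence
+                if i + di > n or j + dj > m:
+                    continue
+                s = score[i][j] - skip_cost
+                if s > score[i + di][j + dj]:
+                    score[i + di][j + dj] = s
+                    back[i + di][j + dj] = (i, j, "skip")
+    links, i, j = [], n, m                   # follow the back-pointers
+    while back[i][j] is not None:
+        pi, pj, op = back[i][j]
+        if op == "link":
+            links.append((pi, pj))
+        i, j = pi, pj
+    return list(reversed(links))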
+ +0:52:42.062 --> 0:52:56.459 +And then you have the ability to add some +points around it, which might be a bit lower. + +0:52:56.756 --> 0:53:06.962 +But here in this case you are able to deal +with reorderings, angles to deal with parts. + +0:53:07.247 --> 0:53:16.925 +Therefore, in this case we need a full scale +and key system to do this calculation while + +0:53:16.925 --> 0:53:17.686 +we're. + +0:53:18.318 --> 0:53:26.637 +Then, of course, the better your similarity +metric is, so the better you are able to do + +0:53:26.637 --> 0:53:35.429 +this comparison, the less you have to rely +on structural information that, in one sentence,. + +0:53:39.319 --> 0:53:53.411 +Anymore questions, and then there are things +like back in line which try to do the same. + +0:53:53.793 --> 0:53:59.913 +That means the idea is that you expect each +sentence. + +0:53:59.819 --> 0:54:02.246 +In a crossing will vector space. + +0:54:02.246 --> 0:54:08.128 +Crossing will vector space always means that +you have a vector or knight means. + +0:54:08.128 --> 0:54:14.598 +In this case you have a vector space where +sentences in different languages are near to + +0:54:14.598 --> 0:54:16.069 +each other if they. + +0:54:16.316 --> 0:54:23.750 +So you can have it again and so on, but just +next to each other and want to call you. + +0:54:24.104 --> 0:54:32.009 +And then you can of course measure now the +similarity by some distance matrix in this + +0:54:32.009 --> 0:54:32.744 +vector. + +0:54:33.033 --> 0:54:36.290 +And you're saying towards two senses are lying. + +0:54:36.290 --> 0:54:39.547 +If the distance in the vector space is somehow. + +0:54:40.240 --> 0:54:50.702 +We'll discuss that in a bit more heat soon +because these vector spades and bathings are + +0:54:50.702 --> 0:54:52.010 +even then. + +0:54:52.392 --> 0:54:55.861 +So the nice thing is with this. + +0:54:55.861 --> 0:55:05.508 +It's really good and good to get quite good +quality and can decide whether two sentences + +0:55:05.508 --> 0:55:08.977 +are translations of each other. + +0:55:08.888 --> 0:55:14.023 +In the fact-lined approach, but often they +even work on a global search way to really + +0:55:14.023 --> 0:55:15.575 +compare on everything to. + +0:55:16.236 --> 0:55:29.415 +What weak alignment also does is trying to +do to make this more efficient in finding the. + +0:55:29.309 --> 0:55:40.563 +If you don't want to compare everything to +everything, you first need sentence blocks, + +0:55:40.563 --> 0:55:41.210 +and. + +0:55:41.141 --> 0:55:42.363 +Then find him fast. + +0:55:42.562 --> 0:55:55.053 +You always have full sentence resolution, +but then you always compare on the area around. + +0:55:55.475 --> 0:56:11.501 +So if you do compare blocks on the source +of the target, then you have of your possibilities. + +0:56:11.611 --> 0:56:17.262 +So here the end times and comparison is a +lot less than the comparison you have here. + +0:56:17.777 --> 0:56:23.750 +And with neural embeddings you can also embed +not only single sentences and whole blocks. + +0:56:24.224 --> 0:56:28.073 +So how you make this in fast? + +0:56:28.073 --> 0:56:35.643 +You're starting from a coarse grain resolution +here where. + +0:56:36.176 --> 0:56:47.922 +Then you're getting a double pass where they +could be good and near this pass you're doing + +0:56:47.922 --> 0:56:49.858 +more and more. + +0:56:52.993 --> 0:56:54.601 +And yeah, what's the? + +0:56:54.601 --> 0:56:56.647 +This is the white egg lift. 
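+
+NOTE
+A toy sketch of the BLEU-based anchoring mentioned above (the Bleualign
+idea): every source sentence is assumed to have been machine-translated
+into the target language already, candidate pairs are scored with a
+simple n-gram precision, and only high-scoring pairs are kept as anchor
+points. The precision here is a stand-in rather than the full BLEU the
+actual tool computes, and the threshold is an assumption.
+
+from collections import Counter
+
+def ngram_precision(hyp: str, ref: str, max_n: int = 2) -> float:
+    hyp_toks, ref_toks = hyp.lower().split(), ref.lower().split()
+    precisions = []
+    for n in range(1, max_n + 1):
+        hyp_ngrams = Counter(zip(*[hyp_toks[i:] for i in range(n)]))
+        ref_ngrams = Counter(zip(*[ref_toks[i:] for i in range(n)]))
+        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
+        precisions.append(matches / max(sum(hyp_ngrams.values()), 1))
+    return sum(precisions) / len(precisions)
+
+def anchor_points(src_mt, tgt_sents, threshold=0.4):
+    # src_mt[i] is the MT output (in the target language) of source i.
+    anchors = []
+    for i, hyp in enumerate(src_mt):
+        j, score = max(enumerate(ngram_precision(hyp, t)
+                                 for t in tgt_sents),
+                       key=lambda x: x[1])
+        if score >= threshold:
+            anchors.append((i, j, score))
+    return anchors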
+ +0:56:56.647 --> 0:56:59.352 +These are the sewers and the target. + +0:57:00.100 --> 0:57:16.163 +While it was sleeping in the forests and things, +I thought it was very strange to see this man. + +0:57:16.536 --> 0:57:25.197 +So you have the sentences, but if you do blocks +you have blocks that are in. + +0:57:30.810 --> 0:57:38.514 +This is the thing about the pipeline approach. + +0:57:38.514 --> 0:57:46.710 +We want to look at the global mining, but +before. + +0:57:53.633 --> 0:58:07.389 +In the global mining thing we have to also +do some filtering and so typically in the things + +0:58:07.389 --> 0:58:10.379 +they do they start. + +0:58:10.290 --> 0:58:14.256 +And then they are doing some pretty processing. + +0:58:14.254 --> 0:58:17.706 +So you try to at first to de-defecate paragraphs. + +0:58:17.797 --> 0:58:30.622 +So, of course, if you compare everything with +everything in two times the same input example, + +0:58:30.622 --> 0:58:35.748 +you will also: The hard thing is that you first +keep duplicating. + +0:58:35.748 --> 0:58:37.385 +You have each paragraph only one. + +0:58:37.958 --> 0:58:42.079 +There's a lot of text which occurs a lot of +times. + +0:58:42.079 --> 0:58:44.585 +They will happen all the time. + +0:58:44.884 --> 0:58:57.830 +There are pages about the cookie thing you +see and about accepting things. + +0:58:58.038 --> 0:59:04.963 +So you can already be duplicated here, or +your problem has crossed the website twice, + +0:59:04.963 --> 0:59:05.365 +and. + +0:59:06.066 --> 0:59:11.291 +Then you can remove low quality data like +cooking warnings that have biolabites start. + +0:59:12.012 --> 0:59:13.388 +Hey! + +0:59:13.173 --> 0:59:19.830 +So let you have maybe some other sentence, +and then you're doing a language idea. + +0:59:19.830 --> 0:59:29.936 +That means you want to have a text, which +is: You want to know for each sentence a paragraph + +0:59:29.936 --> 0:59:38.695 +which language it has so that you then, of +course, if you want. + +0:59:39.259 --> 0:59:44.987 +Finally, there is some complexity based film +screenings to believe, for example, for very + +0:59:44.987 --> 0:59:46.069 +high complexity. + +0:59:46.326 --> 0:59:59.718 +That means, for example, data where there's +a lot of crazy names which are growing. + +1:00:00.520 --> 1:00:09.164 +Sometimes it also improves very high perplexity +data because that is then unmanned generated + +1:00:09.164 --> 1:00:09.722 +data. + +1:00:11.511 --> 1:00:17.632 +And then the model which is mostly used for +that is what is called a laser model. + +1:00:18.178 --> 1:00:21.920 +It's based on machine translation. + +1:00:21.920 --> 1:00:28.442 +Hope it all recognizes the machine translation +architecture. + +1:00:28.442 --> 1:00:37.103 +However, there is a difference between a general +machine translation system and. + +1:01:00.000 --> 1:01:13.322 +Machine translation system, so it's messy. + +1:01:14.314 --> 1:01:24.767 +See one bigger difference, which is great +if I'm excluding that object or the other. + +1:01:25.405 --> 1:01:39.768 +There is one difference to the other, one +with attention, so we are having. + +1:01:40.160 --> 1:01:43.642 +And then we are using that here in there each +time set up. + +1:01:44.004 --> 1:01:54.295 +Mean, therefore, it's maybe a bit similar +to original anti-system without attention. + +1:01:54.295 --> 1:01:56.717 +It's quite similar. 
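+
+NOTE
+A short sketch of the preprocessing mentioned above: paragraphs are
+deduplicated with a hash set and obvious boilerplate such as cookie
+banners is dropped. The boilerplate patterns and the minimum length are
+assumptions for illustration; a real pipeline adds language
+identification and language-model perplexity filtering on top, as
+described in this segment.
+
+import hashlib
+import re
+
+BOILERPLATE = re.compile(r"cookie|javascript required|all rights reserved",
+                         re.IGNORECASE)
+
+def clean_paragraphs(paragraphs):
+    seen = set()
+    for p in paragraphs:
+        p = " ".join(p.split())           # normalise whitespace
+        digest = hashlib.sha1(p.lower().encode("utf-8")).hexdigest()
+        if digest in seen:                # exact duplicate paragraph
+            continue
+        seen.add(digest)
+        if BOILERPLATE.search(p) or len(p.split()) < 3:
+            continue                      # boilerplate or too short
+        yield p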
+ +1:01:57.597 --> 1:02:10.011 +However, it has this disadvantage saying that +we have to put everything in one sentence and + +1:02:10.011 --> 1:02:14.329 +that maybe not all information. + +1:02:15.055 --> 1:02:25.567 +However, now in this type of framework we +are not really interested in machine translation, + +1:02:25.567 --> 1:02:27.281 +so this model. + +1:02:27.527 --> 1:02:34.264 +So we are training it to do machine translation. + +1:02:34.264 --> 1:02:42.239 +What that means in the end should be as much +information. + +1:02:43.883 --> 1:03:01.977 +Only all the information in here is able to +really well do the machine translation. + +1:03:02.642 --> 1:03:07.801 +So that is the first step, so we are doing +here. + +1:03:07.801 --> 1:03:17.067 +We are building the MT system, not with the +goal of making the best MT system, but with + +1:03:17.067 --> 1:03:22.647 +learning and sentences, and hopefully all important. + +1:03:22.882 --> 1:03:26.116 +Because otherwise we won't be able to generate +the translation. + +1:03:26.906 --> 1:03:31.287 +So it's a bit more on the bottom neck like +to try to put as much information. + +1:03:32.012 --> 1:03:36.426 +And if you think if you want to do later finding +the bear's neighbor or something like. + +1:03:37.257 --> 1:03:48.680 +So finding similarities is typically possible +with fixed dimensional things, so we can do + +1:03:48.680 --> 1:03:56.803 +that in an end dimensional space and find the +nearest neighbor. + +1:03:57.857 --> 1:03:59.837 +Yeah, it would be very difficult. + +1:04:00.300 --> 1:04:03.865 +There's one thing that we also do. + +1:04:03.865 --> 1:04:09.671 +We don't want to find the nearest neighbor +in the other. + +1:04:10.570 --> 1:04:13.424 +Do you have an idea how we can train them? + +1:04:13.424 --> 1:04:16.542 +This is a set that embeddings can be compared. + +1:04:23.984 --> 1:04:36.829 +Any idea do you think about two lectures, +a three lecture stack, one that did gave. + +1:04:41.301 --> 1:04:50.562 +We can train them on a multilingual setting +and that's how it's done in lasers so we're + +1:04:50.562 --> 1:04:56.982 +not doing it only from German to English but +we're training. + +1:04:57.017 --> 1:05:04.898 +Mean, if the English one has to be useful +for German, French and so on, and for German + +1:05:04.898 --> 1:05:13.233 +also, the German and the English and so have +to be useful, then somehow we'll automatically + +1:05:13.233 --> 1:05:16.947 +learn that these embattes are popularly. + +1:05:17.437 --> 1:05:28.562 +And then we can use an exact as we will plan +to have a similar sentence embedding. + +1:05:28.908 --> 1:05:39.734 +If you put in here a German and a French one +and always generate as they both have the same + +1:05:39.734 --> 1:05:48.826 +translations, you give these sentences: And +you should do exactly the same thing, so that's + +1:05:48.826 --> 1:05:50.649 +of course the easiest. + +1:05:51.151 --> 1:05:59.817 +If the sentence is very different then most +people will also hear the English decoder and + +1:05:59.817 --> 1:06:00.877 +therefore. + +1:06:02.422 --> 1:06:04.784 +So that is the first thing. + +1:06:04.784 --> 1:06:06.640 +Now we have this one. + +1:06:06.640 --> 1:06:10.014 +We have to be trained on parallel data. + +1:06:10.390 --> 1:06:22.705 +Then we can use these embeddings on our new +data and try to use them to make efficient + +1:06:22.705 --> 1:06:24.545 +comparisons. + +1:06:26.286 --> 1:06:30.669 +So how can you do comparison? 
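+
+NOTE
+A minimal PyTorch sketch of a LASER-style sentence encoder as described
+above: a bidirectional LSTM whose hidden states are max-pooled over
+time into one fixed-size vector, on which an attention-free decoder
+would then be conditioned during multilingual training. The layer
+sizes, vocabulary size and single LSTM layer are assumptions; the real
+LASER model differs in several details.
+
+import torch
+import torch.nn as nn
+
+class SentenceEncoder(nn.Module):
+    def __init__(self, vocab_size=50000, emb_dim=320, hidden_dim=512):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
+        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
+                            bidirectional=True)
+
+    def forward(self, token_ids):                     # (batch, seq_len)
+        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq, 2*hidden)
+        # Max-pool over time -> one fixed-size embedding per sentence.
+        return states.max(dim=1).values               # (batch, 2*hidden)
+
+enc = SentenceEncoder()
+emb = enc(torch.randint(1, 50000, (4, 12)))  # 4 toy sentences, length 12
+print(emb.shape)                             # torch.Size([4, 1024])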
+ +1:06:30.669 --> 1:06:37.243 +Maybe the first thing you think of is to do. + +1:06:37.277 --> 1:06:44.365 +So you take all the German sentences, all +the French sentences. + +1:06:44.365 --> 1:06:49.460 +We compute the Cousin's simple limit between. + +1:06:49.469 --> 1:06:58.989 +And then you take all pairs where the similarity +is very high. + +1:07:00.180 --> 1:07:17.242 +So you have your French list, you have them, +and then you just take all sentences. + +1:07:19.839 --> 1:07:29.800 +It's an additional power method that we have, +but we have a lot of data who will find a point. + +1:07:29.800 --> 1:07:32.317 +It's a good point, but. + +1:07:35.595 --> 1:07:45.738 +It's also not that easy, so one problem is +that typically there are some sentences where. + +1:07:46.066 --> 1:07:48.991 +And other points where there is very few points +in the neighborhood. + +1:07:49.629 --> 1:08:06.241 +And then for things where a lot of things +are enabled you might extract not for one percent + +1:08:06.241 --> 1:08:08.408 +to do that. + +1:08:08.868 --> 1:08:18.341 +So what typically is happening is you do the +max merchant? + +1:08:18.341 --> 1:08:25.085 +How good is a pair compared to the other? + +1:08:25.305 --> 1:08:33.859 +So you take the similarity between X and Y, +and then you look at one of the eight nearest + +1:08:33.859 --> 1:08:35.190 +neighbors of. + +1:08:35.115 --> 1:08:48.461 +Of x and what are the eight nearest neighbors +of y, and the dividing of the similarity through + +1:08:48.461 --> 1:08:51.411 +the eight neighbors. + +1:08:51.671 --> 1:09:00.333 +So what you may be looking at are these two +sentences a lot more similar than all the other. + +1:09:00.840 --> 1:09:13.455 +And if these are exceptional and similar compared +to other sentences then they should be translations. + +1:09:16.536 --> 1:09:19.158 +Of course, that has also some. + +1:09:19.158 --> 1:09:24.148 +Then the good thing is there's a lot of similar +sentences. + +1:09:24.584 --> 1:09:30.641 +If there is a lot of similar sensations in +white then these are also very similar and + +1:09:30.641 --> 1:09:32.824 +you are doing more comparison. + +1:09:32.824 --> 1:09:36.626 +If all the arrows are far away then the translations. + +1:09:37.057 --> 1:09:40.895 +So think about this like short sentences. + +1:09:40.895 --> 1:09:47.658 +They might be that most things are similar, +but they are just in general. + +1:09:49.129 --> 1:09:59.220 +There are some problems that now we assume +there is only one pair of translations. + +1:09:59.759 --> 1:10:09.844 +So it has some problems in their two or three +ballad translations of that. + +1:10:09.844 --> 1:10:18.853 +Then, of course, this pair might not find +it, but in general this. + +1:10:19.139 --> 1:10:27.397 +For example, they have like all of these common +trawl. + +1:10:27.397 --> 1:10:32.802 +They have large parallel data sets. + +1:10:36.376 --> 1:10:38.557 +One point maybe also year. + +1:10:38.557 --> 1:10:45.586 +Of course, now it's important that we have +done the deduplication before because if we + +1:10:45.586 --> 1:10:52.453 +wouldn't have the deduplication, we would have +points which are the same coordinate. + +1:10:57.677 --> 1:11:03.109 +Maybe only one small things to that mean. + +1:11:03.109 --> 1:11:09.058 +A major issue in this case is still making +a. + +1:11:09.409 --> 1:11:18.056 +So you have to still do all of this comparison, +and that cannot be done just by simple. 
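+
+NOTE
+A NumPy sketch of the margin (ratio) scoring just described: the cosine
+similarity of a candidate pair is divided by the average similarity to
+the k nearest neighbours of each side, so a pair only scores high if it
+is unusually similar compared to its neighbourhood. k=8 follows the
+lecture; the brute-force neighbour computation and the random vectors
+are only for illustration (real systems use an approximate index, see
+below).
+
+import numpy as np
+
+def margin_scores(src_emb, tgt_emb, k=8):
+    # L2-normalise so that the dot product equals cosine similarity.
+    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
+    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
+    sim = src @ tgt.T                       # (n_src, n_tgt) cosine matrix
+    k = min(k, sim.shape[0], sim.shape[1])
+    # Average similarity to the k nearest neighbours on each side.
+    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source
+    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target
+    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
+    return sim / denom
+
+scores = margin_scores(np.random.randn(100, 64), np.random.randn(120, 64))
+best_tgt = scores.argmax(axis=1)   # best candidate target per source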
+ +1:11:19.199 --> 1:11:27.322 +So what is done typically express the word, +you know things can be done in parallel. + +1:11:28.368 --> 1:11:36.024 +So calculating the embeddings and all that +stuff doesn't need to be sequential, but it's + +1:11:36.024 --> 1:11:37.143 +independent. + +1:11:37.357 --> 1:11:48.680 +What you typically do is create an event and +then you do some kind of projectization. + +1:11:48.708 --> 1:11:57.047 +So there is this space library which does +key nearest neighbor search very efficient + +1:11:57.047 --> 1:11:59.597 +in very high-dimensional. + +1:12:00.080 --> 1:12:03.410 +And then based on that you can now do comparison. + +1:12:03.410 --> 1:12:06.873 +You can even do the comparison in parallel +because. + +1:12:06.906 --> 1:12:13.973 +Can look at different areas of your space +and then compare the different pieces to find + +1:12:13.973 --> 1:12:14.374 +the. + +1:12:15.875 --> 1:12:30.790 +With this you are then able to do very fast +calculations on this type of sentence. + +1:12:31.451 --> 1:12:34.761 +So yeah this is currently one. + +1:12:35.155 --> 1:12:48.781 +Mean, those of them are covered with this, +so there's a parade. + +1:12:48.668 --> 1:12:55.543 +We are collected by that and most of them +are in a very big corporate for languages which + +1:12:55.543 --> 1:12:57.453 +you can hardly stand on. + +1:12:58.778 --> 1:13:01.016 +Do you have any more questions on this? + +1:13:05.625 --> 1:13:17.306 +And then some more words to this last set +here: So we have now done our pearl marker + +1:13:17.306 --> 1:13:25.165 +and we could assume that everything is fine +now. + +1:13:25.465 --> 1:13:35.238 +However, the problem with this noisy data +is that typically this is quite noisy still, + +1:13:35.238 --> 1:13:35.687 +so. + +1:13:36.176 --> 1:13:44.533 +In order to make things efficient to have +a high recall, the final data is often not + +1:13:44.533 --> 1:13:49.547 +of the best quality, not the same type of quality. + +1:13:49.789 --> 1:13:58.870 +So it is essential to do another figuring +step and to remove senses which might seem + +1:13:58.870 --> 1:14:01.007 +to be translations. + +1:14:01.341 --> 1:14:08.873 +And here, of course, the final evaluation +matrix would be how much do my system improve? + +1:14:09.089 --> 1:14:23.476 +And there are even challenges on doing that +so: people getting this noisy data like symmetrics + +1:14:23.476 --> 1:14:25.596 +or something. + +1:14:27.707 --> 1:14:34.247 +However, all these steps is of course very +time consuming, so you might not always want + +1:14:34.247 --> 1:14:37.071 +to do the full pipeline and training. + +1:14:37.757 --> 1:14:51.614 +So how can you model that we want to get this +best and normally what we always want? + +1:14:51.871 --> 1:15:02.781 +You also want to have the best over translation +quality, but this is also normally not achieved + +1:15:02.781 --> 1:15:03.917 +with all. + +1:15:04.444 --> 1:15:12.389 +And that's why you're doing this two-step +approach first of the second alignment. + +1:15:12.612 --> 1:15:27.171 +And after once you do the sentence filtering, +we can put a lot more alphabet in all the comparisons. + +1:15:27.627 --> 1:15:37.472 +For example, you can just translate the source +and compare that translation with the original + +1:15:37.472 --> 1:15:40.404 +one and calculate how good. + +1:15:40.860 --> 1:15:49.467 +And this, of course, you can do with the filing +set, but you can't do with your initial set + +1:15:49.467 --> 1:15:50.684 +of millions. 
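+
+NOTE
+A small sketch of the nearest-neighbour step with the FAISS library
+mentioned above. IndexFlatIP performs exact inner-product search, which
+on length-normalised vectors is cosine similarity; for web-scale data
+one would switch to an approximate, quantised index and shard the
+search. The dimensions, sizes and random vectors are illustrative.
+
+import numpy as np
+import faiss
+
+dim, k = 1024, 8
+src = np.random.rand(1000, dim).astype("float32")  # source embeddings
+tgt = np.random.rand(1200, dim).astype("float32")  # target embeddings
+
+faiss.normalize_L2(src)       # unit vectors: inner product == cosine
+faiss.normalize_L2(tgt)
+
+index = faiss.IndexFlatIP(dim)
+index.add(tgt)                     # index all target sentences
+sims, ids = index.search(src, k)   # k nearest targets for every source
+print(ids[0], sims[0])             # candidates for source sentence 0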
+ +1:15:54.114 --> 1:16:01.700 +So what it is again is the ancient test where +you input as a sentence pair as here, and then + +1:16:01.700 --> 1:16:09.532 +once you have a biometria, these are sentence +pairs with a high quality, and these are sentence + +1:16:09.532 --> 1:16:11.653 +pairs avec a low quality. + +1:16:12.692 --> 1:16:17.552 +Does anybody see what might be a challenge +if you want to train this type of classifier? + +1:16:22.822 --> 1:16:24.264 +How do you measure exactly? + +1:16:24.264 --> 1:16:26.477 +The quality is probably about the problem. + +1:16:27.887 --> 1:16:39.195 +Yes, that is one, that is true, there is even +more, more simple one, and high quality data + +1:16:39.195 --> 1:16:42.426 +here is not so difficult. + +1:16:43.303 --> 1:16:46.844 +Globally, yeah, probably we have a class in +balance. + +1:16:46.844 --> 1:16:49.785 +We don't see many bad quality combinations. + +1:16:49.785 --> 1:16:54.395 +It's hard to get there at the beginning, so +maybe how can you argue? + +1:16:54.395 --> 1:16:58.405 +Where do you find bad quality and what type +of bad quality? + +1:16:58.798 --> 1:17:05.122 +Because if it's too easy, you just take a +random germ and the random innocence that is + +1:17:05.122 --> 1:17:05.558 +very. + +1:17:05.765 --> 1:17:15.747 +But what you're interested is like bad quality +data, which still passes your first initial + +1:17:15.747 --> 1:17:16.405 +step. + +1:17:17.257 --> 1:17:28.824 +What you can use for that is you can use any +type of network or model that in the beginning, + +1:17:28.824 --> 1:17:33.177 +like in random forests, would see. + +1:17:33.613 --> 1:17:38.912 +So the positive examples are quite easy to +get. + +1:17:38.912 --> 1:17:44.543 +You just take parallel data and high quality +data. + +1:17:44.543 --> 1:17:45.095 +You. + +1:17:45.425 --> 1:17:47.565 +That is quite easy. + +1:17:47.565 --> 1:17:55.482 +You normally don't need a lot of data, then +to train in a few validation. + +1:17:57.397 --> 1:18:12.799 +The challenge is like the negative samples +because how would you generate negative samples? + +1:18:13.133 --> 1:18:17.909 +Because the negative examples are the ones +which ask the first step but don't ask the + +1:18:17.909 --> 1:18:18.353 +second. + +1:18:18.838 --> 1:18:23.682 +So how do you typically do it? + +1:18:23.682 --> 1:18:28.994 +You try to do synthetic examples. + +1:18:28.994 --> 1:18:33.369 +You can do random examples. + +1:18:33.493 --> 1:18:45.228 +But this is the typical error that you want +to detect when you do frequency based replacements. + +1:18:45.228 --> 1:18:52.074 +But this is one major issue when you generate +the data. + +1:18:52.132 --> 1:19:02.145 +That doesn't match well with what are the +real arrows that you're interested in. + +1:19:02.702 --> 1:19:13.177 +Is some of the most challenging here to find +the negative samples, which are hard enough + +1:19:13.177 --> 1:19:14.472 +to detect. + +1:19:17.537 --> 1:19:21.863 +And the other thing, which is difficult, is +of course the data ratio. + +1:19:22.262 --> 1:19:24.212 +Why is it important any? + +1:19:24.212 --> 1:19:29.827 +Why is the ratio between positive and negative +examples here important? + +1:19:30.510 --> 1:19:40.007 +Because in a case of plus imbalance we effectively +could learn to just that it's positive and + +1:19:40.007 --> 1:19:43.644 +high quality and we would be right. + +1:19:44.844 --> 1:19:46.654 +Yes, so I'm training. + +1:19:46.654 --> 1:19:51.180 +This is important, but otherwise it might +be too easy. 
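+
+NOTE
+A sketch of how synthetic negative examples for the filtering
+classifier can be generated from clean parallel data: random
+mispairing, truncation and word shuffling. Which corruptions actually
+match the errors the mining pipeline makes is exactly the open problem
+raised above, so these three operations and the ratio parameter are
+only common, assumed choices.
+
+import random
+
+def make_training_data(parallel_pairs, neg_ratio=1.0, seed=0):
+    rng = random.Random(seed)
+    negatives = []
+    for _ in range(int(len(parallel_pairs) * neg_ratio)):
+        src, tgt = rng.choice(parallel_pairs)
+        corruption = rng.choice(["mispair", "truncate", "shuffle"])
+        if corruption == "mispair":      # pair with some other target
+            _, tgt = rng.choice(parallel_pairs)
+        elif corruption == "truncate":   # drop the second half
+            words = tgt.split()
+            tgt = " ".join(words[: max(1, len(words) // 2)])
+        else:                            # destroy the word order
+            words = tgt.split()
+            rng.shuffle(words)
+            tgt = " ".join(words)
+        negatives.append((src, tgt, 0))  # label 0 = not parallel
+    positives = [(s, t, 1) for s, t in parallel_pairs]
+    return positives + negatives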
+ +1:19:51.180 --> 1:19:52.414 +You always do. + +1:19:52.732 --> 1:19:58.043 +And on the other head, of course, navy and +deputy, it's also important because if we have + +1:19:58.043 --> 1:20:03.176 +equal things, we're also assuming that this +might be the other one, and if the quality + +1:20:03.176 --> 1:20:06.245 +is worse or higher, we might also accept too +fewer. + +1:20:06.626 --> 1:20:10.486 +So this ratio is not easy to determine. + +1:20:13.133 --> 1:20:16.969 +What type of features can we use? + +1:20:16.969 --> 1:20:23.175 +Traditionally, we're also looking at word +translation. + +1:20:23.723 --> 1:20:37.592 +And nowadays, of course, we can model this +also with something like similar, so this is + +1:20:37.592 --> 1:20:38.696 +again. + +1:20:40.200 --> 1:20:42.306 +Language follow. + +1:20:42.462 --> 1:20:49.763 +So we can, for example, put the sentence in +there for the source and the target, and then + +1:20:49.763 --> 1:20:56.497 +based on this classification label we can classify +as this a parallel sentence or. + +1:20:56.476 --> 1:21:00.054 +So it's more like a normal classification +task. + +1:21:00.160 --> 1:21:09.233 +And by having a system which can have much +enable input, we can just put in two R. + +1:21:09.233 --> 1:21:16.886 +We can also put in two independent of each +other based on the hidden. + +1:21:17.657 --> 1:21:35.440 +You can, as you do any other type of classifier, +you can train them on top of. + +1:21:35.895 --> 1:21:42.801 +This so it tries to represent the full sentence +and that's what you also want to do on. + +1:21:43.103 --> 1:21:45.043 +The Other Thing What They Can't Do Is, of +Course. + +1:21:45.265 --> 1:21:46.881 +You can make here. + +1:21:46.881 --> 1:21:52.837 +You can do your summation of all the hidden +statements that you said. + +1:21:58.698 --> 1:22:10.618 +Okay, and then one thing which we skipped +until now, and that is only briefly this fragment. + +1:22:10.630 --> 1:22:19.517 +So if we have sentences which are not really +parallel, can we also extract information from + +1:22:19.517 --> 1:22:20.096 +them? + +1:22:22.002 --> 1:22:25.627 +And so what here the test is? + +1:22:25.627 --> 1:22:33.603 +We have a sentence and we want to find within +or a sentence pair. + +1:22:33.603 --> 1:22:38.679 +We want to find within the sentence pair. + +1:22:39.799 --> 1:22:46.577 +And how that, for example, has been done is +using a lexical positive and negative association. + +1:22:47.187 --> 1:22:57.182 +And then you can transform your target sentence +into a signal and find a thing where you have. + +1:22:57.757 --> 1:23:00.317 +So I'm Going to Get a Clear Eye. + +1:23:00.480 --> 1:23:15.788 +So you hear the English sentence, the other +language, and you have an alignment between + +1:23:15.788 --> 1:23:18.572 +them, and then. + +1:23:18.818 --> 1:23:21.925 +This is not a light cell from a negative signal. + +1:23:22.322 --> 1:23:40.023 +And then you drink some sauce on there because +you want to have an area where there's. + +1:23:40.100 --> 1:23:51.742 +It doesn't matter if you have simple arrows +here by smooth saying you can't. + +1:23:51.972 --> 1:23:58.813 +So you try to find long segments here where +at least most of the words are somehow aligned. + +1:24:00.040 --> 1:24:10.069 +And then you take this one in the side and +extract that one as your parallel fragment, + +1:24:10.069 --> 1:24:10.645 +and. 
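+
+NOTE
+A rough sketch of the fragment extraction idea at the end of this
+segment: every target word becomes a +1/-1 signal depending on whether
+it is aligned to some source word, the signal is smoothed with a small
+moving average, and the longest span with a positive smoothed value is
+extracted as a parallel fragment. The window size, the signal values
+and the alignment input format are assumptions.
+
+def extract_fragment(num_tgt_words, alignment_links, window=3):
+    # alignment_links: set of (src_index, tgt_index) word alignments.
+    aligned = {j for _, j in alignment_links}
+    signal = [1.0 if j in aligned else -1.0 for j in range(num_tgt_words)]
+    half = window // 2                      # simple moving average
+    smoothed = []
+    for j in range(num_tgt_words):
+        win = signal[max(0, j - half): j + half + 1]
+        smoothed.append(sum(win) / len(win))
+    best_len, best_span, start = 0, None, None
+    for j, v in enumerate(smoothed + [-1.0]):   # sentinel closes spans
+        if v > 0 and start is None:
+            start = j
+        elif v <= 0 and start is not None:
+            if j - start > best_len:
+                best_len, best_span = j - start, (start, j)
+            start = None
+    return best_span    # (start, end) target indices of the fragment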
+
+1:24:10.630 --> 1:24:21.276
+So in the end you not only have full sentences,
+but you also have partial sentences, which can
+be helpful, especially if you have quite a
+low-resource setting.
+
+1:24:32.332 --> 1:24:36.388
+That's everything for today.
+
+1:24:36.388 --> 1:24:44.023
+What you hopefully remember is how the general
+pipeline for mining parallel data works.
+
+1:24:44.184 --> 1:24:54.506
+We talked about how we can do the document
+alignment and then the sentence alignment,
+which can be done after the document alignment.
+
+1:24:59.339 --> 1:25:12.611
+Any more questions? I think on Thursday we had
+to do a switch, so on Thursday there will be
+a practical session.