Hey, welcome to today's lecture in the machine translation course.

Last time we started addressing the problem of building machine translation systems with limited resources, and we looked into different ways of using other types of resources. In particular, we looked into language models and especially pre-trained models, which are a different paradigm for learning from data.

However, there is one other way of getting data, and that is just searching for more data. The nice thing is that we have the World Wide Web, a very big data resource with various types of data which we can all use. If you want to build machine translation for a specific language or a specific domain, it might be worth looking there.

In general, we have seen different types of additional resources; today we look into data crawling. It always depends a bit on what type of task you have. With crawling, you have a whole range of possibilities. We saw some weeks ago that multilingual models are another way to share knowledge between languages. Last time we looked into monolingual data, and later we will also see unsupervised MT, which is purely based on monolingual data.

What we will focus on today is really web crawling of parallel data. We will not focus on the crawling part itself; a networking lecture would tell you about the best techniques for web crawling, and we will just rely on existing tools. The challenge for us starts once you have web data: how do you get from that to clean, parallel text? These are the different steps we will look at today.

So why would we be interested in this? There are quite different ways of collecting parallel data. The big difference to dedicated collection is that there you focus on one specific website, so you can manually check how to extract the data. That you can do for dedicated resources where you get high-quality data.

Another thing which has been done for several tasks is crowdsourcing. I don't know if you know sites like Amazon Mechanical Turk; there you can get a lot of relatively cheap labor to write translations for you. Of course you cannot collect millions of sentences this way, but if it's thousands of sentences, crowdsourcing is often interesting, for example when you have a very specific need. However, this is a field of its own, so crowdsourcing is not that easy; it's not like you just upload a task and get good data back.
If you just do that, you will get very poor quality. In the field of machine translation, crowdsourcing is quite commonly used, but the problem is: since the workers are paid quite badly, a lot of people also try to put as little effort into it as possible. So if you're using it without any control mechanisms, the quality will be bad. What you can do is add additional checking mechanisms. And I recently read a paper suggesting that these things can now get worse, because people don't do the task themselves anymore but paste it into a machine translation system. So it's a very interesting topic of its own, and a lot of resources have been created this way.

If it's really about large-scale data, then some type of web crawling is the best option. However, the biggest issue in this case is quality. How can we somehow ensure the quality of the data? We all know that on the Internet there is also a lot of low-quality content, and the even bigger question for us is: how can we ensure that the texts we extract are really translations of each other?

Why is this interesting? We had this number before: there are estimates that a human reads around three hundred million words. If you look at the web, you have orders of magnitude more words there, so you can really get a large amount of data, and not only monolingual data. At least for some language pairs there is also a large amount of parallel data. For languages that are official languages in one country, crawling is typically quite successful, because a lot of government websites need to be translated; India, which we have worked with, is a good example where you find parallel government data.

[Student:] Two questions. First of all, if ChatGPT and machine translation tools are becoming ubiquitous and everybody uses them, don't we get a problem, because we want to crawl the web and use the data, and much of it will be machine-generated?

Yes, that is a severe problem: are we then only training on data which was automatically generated? We talked about synthetic data when we did back translation, and that can help, but it also gives you an upper bound: you cannot get much better than the system that generated the data. So we will get more and more issues there, and maybe at some point we won't crawl the current Internet anymore, but focus on older snapshots of the Internet, like the ones created by the Internet Archive.
There are also classification algorithms for detecting machine-generated data, and there was a very interesting paper on how to watermark machine translation output. There are two scenarios here. The first is that you might want to recognize your own translations: if you are a big company running an MT system that is widely used, the problem might be that most of the translation output out there was created by you, and you want to be able to detect that. And there is a relatively easy way of doing that. An MT system can usually produce several good translations for an input; they are different, but there is not a single correct translation. So what you can do is not output the best hypothesis to the user, but, among the near-best ones, the one with the highest value under some hash function. The candidates are all good translations, so quality barely suffers, but if the translations you later find systematically score high under your hash, the data was most likely generated by your model. Of course, this you can only do for data generated by your own model; what we are now also seeing is work on detecting machine-generated text in general. It's definitely an additional research question that might become more and more important, and it might become an additional filtering step.

There are other issues in data quality, for example in which direction a text was translated. For a long time nobody was interested in that, but if you are now reaching better and better quality, it makes a difference whether the original data went from German to English or from English to German. Translated text is sometimes called "translationese": if you generate German from English, it has a more similar structure to the English than German that was written directly. These are all issues which you might then address, for example with additional training to remove them, or you first train on this data and later train on higher-quality data. That's the general picture; it's an important issue, but until now it hasn't been addressed that much, maybe because the quality of MT output was low enough to spot.

[Student:] I think we should be careful when we use the Internet. The problem is, there's a lot of English text, but what about less-used languages, say some language spoken in Africa? I mean, that's why most data is English.

Yes. For other languages you take what you can get, and if there is no data on the Internet, then crawling doesn't help. So there is still a lot of dedicated data collection, where you go out into the wild and try to collect data directly. English is the most common language on the web, but you find surprisingly much data also for other languages.
Of course, only if they are written, remember: most languages are not written at all. For those you might find some video or audio, but it's difficult to find text. So web crawling is mainly done for languages which are commonly spoken and written.

And that is exactly the next point: "there is much data" is only true for English and some other languages. Therefore, a lot of research on how to make things data-efficient and learn faster from less data is still essential.

So what kind of data are we interested in? Parallel data. Until now we always assumed we have parallel data: each source sentence has a translation on the target side. We have to be careful when we start crawling from the web, because we might only get related types of data.

One common thing is what people refer to as noisy parallel data: documents which are translations of each other, but where some sentences have no translation on the other side because they were left out. If you have these types of documents, your algorithm to extract parallel data has to be a bit more robust.

I don't know if you still remember, at the beginning of the lecture we talked about different data resources and about sentence alignment. The classic approach aligns source and target sentences based mainly on sentence length, and you have some probabilities for one-to-one, one-to-two and similar alignments. It's a very simple algorithm, but it works fine for really high-quality parallel data. When we're talking about noisy data, we might have to do additional steps and use more advanced models to extract what is parallel and keep the quality high. So with noisy parallel data, the documents might not be as easy to process.

And then there is the even more extreme case, which has also been used, to be honest, although using this data is nowadays not that common; there was more interest in it maybe ten or fifteen years ago. That is what people refer to as comparable data. The idea is that you don't even have sentences which are translations of each other, but you have, say, news documents or articles about the same topic. There you can find phrases which mean the same, so parallel fragments, even if no full sentence is parallel.

If you think about Wikipedia, for example: by Wikipedia's general idea, the articles in different languages are written independently of each other. They contain different information; maybe the German article has more detail than the English one. However, it might be that some parts still express the same content.
The same is true for newspaper articles written about the same event at the same time. So this comparable data gives you an opportunity to learn, for example, new phrases and vocabulary even when you don't have parallel data but can monitor both languages over time. Not everything will be the same, but there will be overlap about the events.

So, if we're talking about web crawling: as I said in the beginning, it started with very specific websites. People selected good sources by hand, really focused on them, and wrote a very specific extraction for each. The European Parliament corpus was built like that, and TED; maybe you have even looked at a particular session. These corpora are still important, but they are of course very specific and cover only certain topics.

Then there was a focus on language-centric crawling; there were big projects where, for example, you could point the crawler at websites in a given language. But what people really like is a more general approach, where you just specify "I'm interested in data from German to Lithuanian" and then, as automatically as possible, the system collects as much data as it can and extracts parallel data from it.

So that is our interest, and the question is how we can build these types of systems. The first components are general web-crawling-based systems: there is nothing MT-specific about them. Based on the websites you have crawled, you do text extraction. We are typically not that interested in the images and markup, so we try to extract the raw text. This is also not specific to machine translation; it's the traditional way of doing web crawling. At the end you have something like a large set of documents: the text, often with document boundaries and some metadata. That is the starting point for the machine-translation-specific part.

One way of doing that is hierarchical, very similar to the traditional setup. The first step is document alignment: based on some initial signal, you establish that this is a German website and that is its English translation. Based on this document alignment, you can then do your sentence alignment. This is similar to what we had before with the Gale and Church algorithm, but this is typically noisier parallel data, so you should not assume that everything is present on both sides or that the order is exactly the same; you need more flexible algorithms.
Then it depends: if the documents you crawled are really some type of parallel data, sentence alignment is enough. If they are only comparable, you should instead do what is referred to as fragment extraction.

One problem with these hierarchical models is error propagation in the document alignment. If you say these two documents are aligned, then you can only find sentence pairs within them, and if you miss a document pair, all of its sentences are lost. And the comparable scenario is very different: only small parts of the documents are parallel, and most parts are independent of each other.

Therefore, more recently, there is also the idea of directly doing sentence alignment globally, so that you directly compare sentences across the whole collection. Error propagation was the challenge of the first approach; in the second approach, the big challenge is that you have to do a lot of comparisons. You have to compare every source sentence with every target sentence, which is quadratic; if you think of millions of sentences, that's trillions of pairs.

And this also motivates a last step in both cases. In both pipelines, remember, you are dealing with a very large data set, so all of these steps, also the document alignment, have to be done very efficiently. And if you do things very efficiently, that typically means your quality goes down, because you have to assess each pair fast and can spend less computation on it. Therefore, in a lot of scenarios it makes sense to add a filtering step at the end: a second filtering pass where we can put a lot more effort into each pair, because we no longer have n-squared possible combinations; we have already selected maybe two or three candidates per sentence, or even fewer. Then we can put a lot more effort into each individual example and build a high-quality classifier to really select the good pairs.

One example for this: one of the biggest projects doing this is the so-called ParaCrawl corpus. It follows the pipeline picture from before, and there are a lot of challenges in each step. The steps start with the seed URLs, so what you provide at the beginning is a list of promising starting points. Then they do the crawling, the text extraction, the document alignment, the sentence alignment and the sentence filtering, and it goes all the way down to how you store the text efficiently. As we'll see later, these corpora exist for a lot of language pairs, so it's often easier to download them and then improve on top. For the crawling, one thing they often do is not even crawl the websites directly.
Instead, there are already large crawls of the Internet available that you can work with today.

In a bit more detail, this is shown here: for all the steps you can see different possible tools. For some you need bilingual knowledge, for example a dictionary, or you can use a machine translation system. There are two different ways of doing the document alignment, and for the sentence alignment you can again use different tools; the comparison can be done, for example, lexicon-based or with embeddings. We'll go through the next steps in a bit more detail, but before we do: are there more questions about the general overview of this pipeline?

Yeah, so two or three things about the web crawling itself. You normally start with seed URLs where it's most promising. If you're interested in German to English, you would maybe collect seeds from companies where you know they have a German and an English website, or from government agencies which publish in both languages. And then we can use one of the standard tools to crawl from there using standard web crawling techniques.

There are several challenges when doing that. If you request a website too often, you can get blocked, so you need rate limiting. You have to keep a history of the sites you visited, because you click on all the links, and then on all the links again, and you don't want to loop. You have to be very careful about legal issues, starting with robots.txt, which tells you what you are allowed to crawl. That is one major thing about web crawling in general: the problem of how you deal with copyright. That is why it is sometimes easier to start with existing crawled data, so that you don't have to deal with all of this yourself. And of course there are network issues, retries and so on; more technical things, but there are good tools for them.

Another thing which is very helpful and often done is, instead of doing the web crawling yourself, relying on existing crawls. The best-known one is Common Crawl; I think a lot of the large language models are trained on Common Crawl data. It's an American organization which really works on crawling the web and making the data available. The nice thing is, if you start with this, you don't have to worry about the networking side. I don't think you can download all of it, because it's too big, but you can build a pipeline to process it. That is a general challenge in all this web crawling and parallel data mining: the data is so big that you cannot just look at it and check the processing manually; everything has to run as an automatic pipeline. Here it might make sense to directly filter for the domains or languages you are interested in, so the amount of data stays manageable.

Then you can do the text extraction, which means converting the HTML and then extracting the plain text from it.
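A minimal sketch of this extraction step in Python, using the BeautifulSoup library; production pipelines use more robust boilerplate-removal tools, so treat this only as an illustration:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_text(html: str) -> str:
    """Strip markup, scripts and styles from an HTML page, keep visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove non-content elements entirely
    # get_text collapses the remaining nodes; keep one line per text block
    lines = (line.strip() for line in soup.get_text("\n").splitlines())
    return "\n".join(line for line in lines if line)
```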
Often very important is the language identification. It's not always clear, even from the links, which language a page is in, but there are quite good tools that can identify the language even from relatively short text. (I'll show a small sketch of this step below.)

And then you are in the situation that you have all your data and can start the MT-specific part. After the text extraction you have a large collection of data: text, maybe with document boundaries and some meta information. Now the question is: based on this monolingual, or rather multilingual, text, so text in many languages but not aligned, how can you generate parallel data?

If we look at it as a machine learning task, what we have is a set of sentences in the source language and a set of sentences in the target language. That is the data we have, and we don't directly assume any ordering; the documents are not really aligned, or there is at most a graph of links. What we are interested in is finding the alignments: which sentences are aligned to each other, and which sentences we have to discard because there is no translation for them. Exactly this mapping is what we need to find.

And if we model it more from the machine learning point of view, we can model it as a classification task: given a pair of sentences, decide whether they are translations of each other. So the main challenge is to build this type of classifier.

However, the biggest challenge, as I already pointed out in the beginning, is the size: if we have millions of source and target sentences, the number of comparisons is n squared, so the naive approach is very inefficient, and we need to find something better.

Traditionally, there is the first approach I mentioned before, the local or hierarchical mining. The idea is: first we align documents, and once you have the document alignment, you only need to align sentences within the aligned pairs. That of course makes everything more efficient, because we don't have to do all the comparisons. That is, for example, what the before-mentioned ParaCrawl does. But it has the issue that if the document alignment is bad, you have error propagation and you cannot recover from it, because for documents that are never aligned, there may be parallel sentences which you will never find.

Therefore, more recently, there is also what is referred to as global mining. And there we really do all the comparisons, although it is in principle quadratic.
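Here is the promised sketch of the language-identification step, using the fastText library and its pretrained lid.176 model (both real; the file path is wherever you downloaded the model to):

```python
import fasttext  # pip install fasttext

# Pretrained language-ID model from the fastText website (176 languages).
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str):
    """Return (ISO language code, confidence) for a piece of text."""
    labels, probs = model.predict(text.replace("\n", " "))  # predict dislikes newlines
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Das ist ein kurzer deutscher Satz."))  # e.g. ('de', 0.99...)
```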
The idea there is that you represent all the sentences in a vector space, and then it's about nearest-neighbor search, for which there are very efficient algorithms. If you only compare each sentence to its nearest neighbors, you don't have to do the full quadratic comparison.

So, in the first step, what we want to look at is the hierarchical approach: first the document alignment, and then the sentence alignment.

If we're talking about document alignment, there are typically two steps in it. We first do a candidate selection; often there are several stages of this, again to make things more efficient. The candidate selection asks: which documents do we even want to compare? Then, for the initial candidates which might be parallel, we can do a classification, and there are different ways: we can use lexical similarity, or we can use structural information.

The first and easiest candidate selection is to take all possible pairs. That's one possibility; the other one is based on structure. Based on how the websites look, you might find that only certain pairs can be translations. This is typically the one place where we use meta information, which can be very useful because we know, for example, how websites are linked.

We can try to use URL patterns: if we have a URL that ends with the German language marker and another that is identical except for the English marker, that can easily be used to find candidates (I'll show a small sketch of this heuristic below). Then we only compare websites whose URLs suggest the right language pair and that the pages are translations of each other. Typically you use several heuristics like this, and then you don't have to compare all websites, but only the candidate pairs.

This works especially well with today's content management systems; sometimes it's nice and easy to read off. There are typically links from the parent page to the different language versions. You can look at the KIT website: it's the same thing, you can switch the language there. You can either start from the parent website or you can click on the English link. So you can either pair up all linked websites, or be even more focused and check whether the link is a flag icon or the language name. It really depends on how much you want to filter out; there is always a trade-off between being efficient and having high recall.

Based on that, we then have our candidate list. So now we have two sets of documents, say German and English ones, with candidate pairs between them.
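Here is the small sketch of the URL-pattern heuristic promised above; it is deliberately naive (a single /de/ vs /en/ pattern, and example.org is a made-up domain), while real pipelines combine many such patterns:

```python
import re

def url_candidates(german_urls, english_urls):
    """Pair URLs that differ only in a language marker like /de/ vs /en/."""
    english = set(english_urls)
    pairs = []
    for url in german_urls:
        # replace the first /de path segment with /en; naive on purpose
        guess = re.sub(r"/de(/|$)", r"/en\1", url, count=1)
        if guess != url and guess in english:
            pairs.append((url, guess))
    return pairs

pairs = url_candidates(
    ["https://example.org/de/produkte", "https://example.org/de/kontakt"],
    ["https://example.org/en/produkte", "https://example.org/en/contact"],
)
print(pairs)  # [('https://example.org/de/produkte', 'https://example.org/en/produkte')]
```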
And now the task is: among the candidates, we want to extract those which are really translations of each other. So the question is how we can measure document similarity, because what we then do is measure the similarity and keep the pairs where it is high. And here you already see why this is problematic when documents are only partially parallel or merely similar.

How you can do that is again two-fold: more content-based or more structure-based. Typically you calculate a number of features and then maybe train a classifier on a small labeled set which decides, based on these features, whether the document pair is parallel.

One group of features are surface features. The idea is to use, for example, the text length of the two documents. Of course, the text lengths will not be exactly the same, but if one document has fifty words and the other five thousand words, it's quite unrealistic that they are translations. So you can use the text length ratio as one proxy for whether this might be a translation pair.

Another one is the alignment between the HTML structures: from a website you can derive some type of structural skeleton. You can compare that, say the German structure against the French version, and calculate a similarity, because translated pages often keep the structure. Of course, this gets more problematic if the translation uses a different page structure; then these features are less helpful. However, if you combine the features in a trained way, you can automatically learn how helpful they are.

Then there are different content-based ways. One easy option, especially for languages that use the same script, is to look for overlapping words. We can use bag-of-words representations, which we'll look into, so some type of word overlap; and neural embeddings are another option.

And then, since we have machine translation: one idea is to really use the machine translation system itself. This is the approach which takes the most effort, so you would not run it on every possible candidate pair from the whole crawl, but only where you have already narrowed things down. Maybe you're first thinking: I can't do that, because I'm collecting data precisely in order to build an MT system. But you can use an initial, weaker system to translate, collect more data with it, and iterate. One way of doing this is: you translate all documents, for example, into English. Then you only need to compare English with English, and you can do that, for example, with trigrams.
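A minimal sketch of this translate-then-compare idea: represent each English(-translated) document as a bag of word trigrams and score pairs by cosine similarity. Plain Python, just to make the computation concrete; the two example documents are invented:

```python
from collections import Counter
from math import sqrt

def trigram_counts(text: str) -> Counter:
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))  # word trigrams

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# doc_de_translated: a German document machine-translated into English
doc_de_translated = "the city council approved the new budget on monday"
doc_en = "on monday the city council approved the new budget plan"
print(cosine(trigram_counts(doc_de_translated), trigram_counts(doc_en)))
```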
The trick to make this efficient is an inverted index: for every trigram you store which documents it occurs in. For example, this trigram occurs in document 1, which is German, in document 1 of the Spanish collection, in document 2 of the French collection, and so on. You create this index, and based on it you can calculate how similar the documents are, because you only need to look at documents that share trigrams. Then you can use the cosine similarity to calculate which is the most similar document, or how similar a pair of documents is, and then decide whether this is a possible translation.

Of course, the documents will not be exactly the same; even if you have a truly parallel document pair, say French and German, and you machine-translate one side, you will not get a perfect match. Therefore you look at the n-gram overlap and accept that only part of it matches.

Okay, before we take the next step and go to the sentence alignment: are there more questions about the document alignment?

So, sentence alignment. There are different ways of doing it; one tool I'll describe here is Hunalign. Here we have the advantage that we start from aligned documents, so we might have, say, a hundred sentences on each side. Although it still might be costly to compare everything with everything, these aligners typically assume that we are only interested in alignments that lie near the diagonal of the sentence-by-sentence matrix: not exactly on the diagonal, but in some band around it, in order to make things more efficient. You can restrict yourself to this band because, if the documents were wildly reordered, they probably wouldn't have passed the document alignment in the first place.

In Hunalign, we then calculate the similarity for these sentence pairs based on a bilingual dictionary, so essentially based on how much word overlap you have. And then we find a path through this matrix: a monotone path from the beginning to the end which maximizes the summed similarity, and the pairs on this path are your parallel sentences.

The advantage is that, on the one hand, this limits your search space for the sentence alignment, and secondly, it brings in global information. What does that mean? Even if an individual pair has a very high similarity, you might not take it, because overall the path that uses it scores worse. So sometimes it makes sense to also use this global information and not only compare individual sentences, because otherwise you sometimes pick pairs that only look like a good translation locally.
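A sketch of this kind of monotone path search with dynamic programming. This is the textbook recurrence (1:1 matches plus skips on either side), not Hunalign's actual implementation; the `sim` matrix would come from dictionary overlap, and the skip penalty is an illustrative choice:

```python
def align_path(sim, skip_penalty=-0.2):
    """sim[i][j]: similarity of source sentence i and target sentence j.
    Returns the list of (i, j) pairs on the best monotone path."""
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == j == 0:
                continue
            cands = []
            if i and j:  # align source i-1 with target j-1
                cands.append((score[i-1][j-1] + sim[i-1][j-1], (i-1, j-1)))
            if i:  # leave source sentence i-1 unaligned
                cands.append((score[i-1][j] + skip_penalty, (i-1, j)))
            if j:  # leave target sentence j-1 unaligned
                cands.append((score[i][j-1] + skip_penalty, (i, j-1)))
            score[i][j], back[i][j] = max(cands)
    # trace back, keeping only the diagonal steps (the aligned pairs)
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```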
So with this monotone path you prevent the system from aligning pairs at the borders of the documents where it makes no sense globally. That may get you a bit better quality.

The path always goes monotonically from the top-left to the bottom-right corner, which also means there are restrictions: you cannot have reorderings, where a sentence from the end of the source document is translated at the beginning of the target. So Hunalign typically only really works well if you have relatively high-quality, truly parallel documents. If you have this more general data where only some parts are translated, it doesn't really work. It's okay with maybe some sentences missing at the end or a few in between, but in general it is not robust against significant noise.

The second tool is what is referred to as Bleualign. This one does not use as much global information: you translate each source sentence to English, and then you calculate the BLEU score between the translation and the target sentences. The pairs with high BLEU scores give you anchor points, the ones shown in purple here, and then you can add further alignments around these anchor points, which might score a bit lower. But in this case you are able to deal with reorderings and with parts that are missing. The price is that we need a full-scale MT system to do this calculation, while Hunalign only needs a bilingual dictionary. And in general: the better your similarity metric is, so the better you are able to compare two sentences directly, the less you have to rely on structural information such as the position in the document.

Any more questions? Then: there are tools like Vecalign which try to do the same with embeddings. The idea is that you embed each sentence in a crosslingual vector space. A crosslingual vector space means a space where sentences from different languages are near to each other if they have a similar meaning; the German sentence and its French translation should end up next to each other. Then you can measure the similarity by some distance metric in this vector space, and you say two sentences are aligned if their distance in the vector space is small. We'll discuss soon, in a bit more detail, how these embeddings are trained. The nice thing is: with this you get quite good quality when deciding whether two sentences are translations of each other.
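Coming back to Bleualign for a moment: a toy sketch of the anchor search, using the sacrebleu library for sentence-level BLEU. `mt_into_english` stands in for an MT system you would have to provide, and real Bleualign does a proper alignment around the anchors rather than this greedy nearest match:

```python
import sacrebleu  # pip install sacrebleu

def find_anchors(source_sents, target_sents, mt_into_english, min_bleu=30.0):
    """Anchor pairs (i, j): source sentence i whose English translation
    scores high sentence-BLEU against English target sentence j."""
    anchors = []
    for i, src in enumerate(source_sents):
        hyp = mt_into_english(src)  # hypothetical MT call, to be provided
        scores = [sacrebleu.sentence_bleu(hyp, [tgt]).score
                  for tgt in target_sents]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= min_bleu:  # threshold is an illustrative choice
            anchors.append((i, j))
    return anchors
```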
Coming back to Vecalign: it often even works in a more global way and really compares everything to everything. What Vecalign also does is make this more efficient when finding the alignment path. If you don't want to compare everything to everything, you first build blocks of consecutive sentences and find an approximate path fast; then you go back to full sentence resolution, but only compare in the area around that path.

So if you compare blocks on the source and the target side, you have far fewer possibilities: the number of block comparisons is much smaller than the number of sentence comparisons. And with neural embeddings you can embed not only single sentences but whole blocks. So the way you make this fast is coarse-to-fine: you start from a coarse-grained resolution, where you compare blocks, you get a rough path where the alignment could be, and near this path you refine to finer and finer resolution.

And this is what the slide shows: "While it was sleeping in the forest..." and so on; these are the source and the target sentences, and if you build blocks, you align blocks of several sentences at a time.

So this was the hierarchical pipeline approach. Now we want to look at the global mining, but before that: questions?

In the global mining setting we also have to do some filtering first. Typically, the pipelines start from something like Common Crawl, and then they do some preprocessing. First you try to deduplicate paragraphs. Of course, if you compare everything with everything and you have the same input twice, you will also extract things twice, so the first step is to deduplicate: you keep each paragraph only once. There is a lot of text which occurs many times; think of all the pages with the cookie notices and "accept" messages you see everywhere. So you deduplicate, also because your crawler may have visited the same website twice.

Then you remove low-quality data, like those cookie warnings and other boilerplate. Then you do language identification: you want to know for each sentence or paragraph which language it is, so that you can later mine the right language pairs. Finally, there is some perplexity-based filtering: you remove, for example, data with very high perplexity under a language model; that is typically garbled text, for example long lists of strange names.
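A minimal sketch of such a perplexity filter, using the KenLM Python bindings (the library and its `perplexity` method are real; the model path and the threshold are placeholders you would set per corpus):

```python
import kenlm  # pip install kenlm

# A KenLM model trained on clean text in the target language;
# "en.arpa" is a placeholder path for whatever model you trained.
model = kenlm.Model("en.arpa")

def keep_paragraph(text: str, max_ppl: float = 1000.0) -> bool:
    """Reject paragraphs that a clean-data LM finds too surprising."""
    return model.perplexity(text) <= max_ppl
```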
Sometimes you also remove very low-perplexity data, because extremely predictable text is often itself machine-generated or boilerplate.

And then, the model which is mostly used for the embeddings is the so-called LASER model. It's based on machine translation; I hope you all recognize the machine translation architecture here. However, there is a difference between a general machine translation system and this one. Do you see the one big difference?

There is one difference: there is no attention. Instead, we take the encoder states, pool them into a single fixed-size sentence embedding, and then use that vector in the decoder at each time step. So it's maybe a bit similar to the original sequence-to-sequence systems without attention. However, this has the disadvantage that we have to put everything into one vector, and maybe not all information fits in there.

But in this framework we are not really interested in the machine translation quality itself. We train the model to do machine translation, and what that means is that, in the end, the sentence embedding should contain as much information as possible: only if all the important information is in this one vector can the decoder produce a good translation.

So that is the first step: we build this MT system, not with the goal of making the best MT system, but in order to learn sentence embeddings that hopefully capture everything important, because otherwise the model couldn't generate the translation. It's a bottleneck idea: force as much information as possible through the fixed-size vector. And if you think about what we want to do later, finding nearest neighbors, then fixed-dimensional vectors are exactly what we need: finding similarities is easy in an n-dimensional space with nearest-neighbor search, while with variable-length representations it would be very difficult.

There's one more thing we need: we don't just want to find nearest neighbors within one language, but across languages. Do you have an idea how we can train the model so that the embeddings of different languages can be compared? Think about what we did two or three lectures ago.

[Student:] We can train them in a multilingual setting.

Exactly, and that's how it's done in LASER: we're not training only from German to English, but multilingually. If the same encoder and embedding space has to be useful for German, French and so on, then the model automatically learns embeddings that are comparable across languages.
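A schematic PyTorch sketch of this bottleneck architecture, loosely following the LASER recipe (BiLSTM encoder, max-pooling into one fixed vector, a decoder that is fed that vector at every step). The dimensions and details are simplified illustrations, not the actual LASER code:

```python
import torch
import torch.nn as nn

class BottleneckSeq2Seq(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.tgt_emb = nn.Embedding(vocab_size, dim)
        # decoder input: previous target token + the sentence embedding
        self.decoder = nn.LSTM(dim + 2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def embed_sentence(self, src):                # src: (batch, src_len)
        states, _ = self.encoder(self.src_emb(src))
        return states.max(dim=1).values           # max-pool: (batch, 2*dim)

    def forward(self, src, tgt_in):               # tgt_in: (batch, tgt_len)
        sent = self.embed_sentence(src)           # the fixed-size bottleneck
        # no attention: the same sentence vector is appended at every step
        sent_rep = sent.unsqueeze(1).expand(-1, tgt_in.size(1), -1)
        dec_in = torch.cat([self.tgt_emb(tgt_in), sent_rep], dim=-1)
        hidden, _ = self.decoder(dec_in)
        return self.out(hidden)                   # logits over target vocab
```

At mining time only `embed_sentence` is used; the decoder exists purely to force information through the bottleneck during training.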
Then we can see why the embeddings become similar: if you put in a German and a French sentence that have the same English translation, and you always train to generate that same translation, the encoder has to do exactly the same thing for both, so the easiest solution for the model is to give them similar embeddings. If two sentences are very different, the English decoder output will differ, and therefore the embeddings will too.

So that is the first step. Now we have this encoder, which has to be trained on parallel data. Then we can apply these embeddings to our new crawled data and use them to make efficient comparisons.

So how can you do the comparison? Maybe the first thing you think of is a threshold: you take all the German sentences and all the French sentences, you compute the cosine similarity between them, and then you take all pairs where the similarity is very high. You have your French list, you have your German list, and you just take all sentence pairs that are similar enough.

[Student:] It's a lot of additional computation, but we have a lot of data, so we will find good pairs.

It's a good point, but it's also not that easy. One problem is that the density varies: there are sentences with very many close neighbors, and other points where there are very few points in the neighborhood. In the dense regions, a fixed threshold will extract lots of pairs that are not translations.

So what is typically done instead is the margin-based score: how good is a pair compared to the alternatives? You take the similarity between x and y, and then you look at the eight nearest neighbors of x and the eight nearest neighbors of y, and you divide the similarity of x and y by the average similarity to these neighbors. So what you are really asking is: are these two sentences much more similar to each other than to all the other candidates? And if the pair is exceptionally similar compared to other sentences, then they should be translations.

Of course, that also has some issues. The good side: think about short sentences, which tend to be somewhat similar to many things in general; their neighborhoods are dense, and the margin normalizes for that, so if all the other candidates are also close, plain similarity is not enough, and if all the others are far away, the pair really stands out. The downside is that we assume here there is basically only one translation pair per sentence.
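A sketch of this margin scoring, with the FAISS library doing the k-nearest-neighbor search (FAISS and its calls here are real; the lecture comes back to it just below). This is the "ratio" variant of the margin from the mining literature, and the embeddings are assumed to come from an encoder like the one sketched above:

```python
import numpy as np
import faiss  # pip install faiss-cpu

def margin_scores(src_emb, tgt_emb, k=8):
    """src_emb, tgt_emb: float32 arrays (n, d) with L2-normalized rows.
    Returns, for each source sentence, its best target and margin score."""
    dim = src_emb.shape[1]
    index = faiss.IndexFlatIP(dim)         # inner product = cosine (normalized)
    index.add(tgt_emb)
    sims, nbrs = index.search(src_emb, k)  # k nearest targets per source

    # average similarity of each target to its k nearest sources
    rindex = faiss.IndexFlatIP(dim)
    rindex.add(src_emb)
    tgt_sims, _ = rindex.search(tgt_emb, k)
    tgt_avg = tgt_sims.mean(axis=1)
    src_avg = sims.mean(axis=1)

    best, scores = [], []
    for i in range(src_emb.shape[0]):
        # margin: cos(x, y) divided by the mean of the two neighborhood averages
        m = sims[i] / ((src_avg[i] + tgt_avg[nbrs[i]]) / 2.0)
        j = int(np.argmax(m))
        best.append(int(nbrs[i][j]))
        scores.append(float(m[j]))
    return best, scores
```

At real mining scale, the index building and the searches are sharded and run in parallel; that is what makes the quadratic problem tractable.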
So it has some problems if there are two or three valid translations of a sentence; then this pair might not stand out and you might not find it. But in general this works well: for example, the large corpora mined from Common Crawl, such as CCMatrix, are large parallel data sets built this way.

One point maybe also here: of course, now it's important that we have done the deduplication before, because if we hadn't, we would have points sitting at exactly the same coordinates, and the margin computation breaks down.

Maybe one more small thing on this: a major issue in this setting is still making it computationally feasible. You still have to do all of this comparison, and that cannot be done naively. So what is typically done is to exploit that things can be done in parallel: calculating the embeddings and all that doesn't need to be sequential, each sentence is independent. Then you typically build an index over the vectors and do some kind of quantization. There is the FAISS library, which does k-nearest-neighbor search very efficiently in very high-dimensional spaces, as in the sketch above. And based on that you can do the comparisons, even in parallel, because you can look at different areas of your space and compare the different pieces independently. With this you are able to do very fast similarity calculations over these sentence collections.

So this is currently one of the main approaches; many of the really big mined corpora are collected this way, very big corpora even for language pairs where you could otherwise hardly find data. Do you have any more questions on this?

And then some more words on this last step, the filtering. So we have now done our parallel data mining, and we could assume that everything is fine now. However, the problem is that this mined data is typically still quite noisy. In order to make things efficient and to have a high recall, the final data is often not of the best quality. So it is essential to do another filtering step and to remove sentence pairs which only seem to be translations.

And here, of course, the final evaluation metric would be: how much does my MT system improve when trained on the filtered data? There are even shared tasks on exactly this, where participants get noisy mined data and have to filter it. However, training a full system for every filtering decision is very time-consuming, so you might not always want to run the full pipeline and training. So how can we model the filtering directly?
What we want is the best overall translation quality, but this is normally not achieved by keeping all the data. And that's why you do this two-step approach: first the sentence alignment, and then, once you do the sentence filtering, you can put a lot more effort into each comparison. For example, you can just translate the source and compare that translation with the original target and calculate how well it matches. This, of course, you can do on the final candidate set, but you can't do it on your initial set of millions of sentences on each side.

So what this is, again, is a classification task: you input a sentence pair, and you want a binary decision: these are sentence pairs with high quality, and these are sentence pairs with low quality.

Does anybody see what might be a challenge if you want to train this type of classifier?

[Student:] How do you measure quality exactly? That is probably a problem.

Yes, that is one, that is true. There is an even simpler one: getting the training data. High-quality data here is not so difficult.

[Student:] Probably we have a class imbalance; we don't see many bad-quality combinations. It's hard to say where you find bad quality, and what type of bad quality.

Exactly. Because if the negatives are too easy, say you just pair a random German and a random English sentence, that is very easy to classify. But what you're interested in is bad-quality data which still passes your first initial mining step. As the classifier you can use any type of network or model; in the beginning, things like random forests over features were used.

So the positive examples are quite easy to get: you just take existing parallel, high-quality data. You normally don't even need a lot of data to train and validate such a classifier. The challenge is the negative samples, because how would you generate negative samples? The real negative examples are the ones which pass the first step but shouldn't pass the second.

So how do you typically do it? You create synthetic negative examples. You can do random pairings, you can do frequency-based word replacements, you can truncate sentences. But this is one major issue when you generate the data this way: it may not match well with the real errors that you're actually interested in. So some of the most challenging work here is to find negative samples which are hard enough to detect.
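A sketch of such synthetic-negative generation with the three corruption types just mentioned (random mismatch, word replacement, truncation); the exact rates are illustrative choices:

```python
import random

def make_negatives(pairs, vocab, seed=0):
    """pairs: list of (src, tgt) positives. Returns one synthetic
    negative per positive, using one of three corruption types."""
    rng = random.Random(seed)
    negatives = []
    for src, tgt in pairs:
        kind = rng.choice(["mismatch", "replace", "truncate"])
        if kind == "mismatch":              # random wrong target
            _, wrong = rng.choice(pairs)    # may rarely pick the right one
            negatives.append((src, wrong))
        elif kind == "replace":             # swap ~30% of target words
            words = tgt.split()
            for i in range(len(words)):
                if rng.random() < 0.3:
                    words[i] = rng.choice(vocab)
            negatives.append((src, " ".join(words)))
        else:                               # drop the second half
            words = tgt.split()
            negatives.append((src, " ".join(words[: max(1, len(words) // 2)])))
    return negatives
```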
1:19:17.537 --> 1:19:21.863 And the other thing which is difficult is, of course, the data ratio. 1:19:22.262 --> 1:19:29.827 Why is that important? Why is the ratio between positive and negative examples important here? 1:19:30.510 --> 1:19:43.644 Because in the case of class imbalance we could effectively learn to always predict positive, high quality, and we would mostly be right. 1:19:44.844 --> 1:19:52.414 Yes, so for training this is important, because otherwise it might be too easy to always predict one class. 1:19:52.732 --> 1:20:06.245 And on the other hand it also matters at test time: if we train with equal amounts, we implicitly assume that about half of the mined data is bad, and if the real quality is higher or lower, we might accept too few or too many pairs. 1:20:06.626 --> 1:20:10.486 So this ratio is not easy to determine. 1:20:13.133 --> 1:20:23.175 What type of features can we use? Traditionally, people looked at things like word translation probabilities. 1:20:23.723 --> 1:20:38.696 And nowadays, of course, we can model this with sentence similarity, so this is again a case for pretrained multilingual language models. 1:20:40.200 --> 1:20:56.497 We can, for example, put the source and the target sentence in together, and then based on a classification label we can decide: is this a parallel sentence pair or not? 1:20:56.476 --> 1:21:00.054 So it's more like a normal classification task. 1:21:00.160 --> 1:21:16.886 And by having a model which can take multilingual input, we can just put the two sentences in together, or we can encode the two independently of each other and compare the hidden representations. 1:21:17.657 --> 1:21:35.440 As with any other type of classifier, you can then train it on top of these representations, 1:21:35.895 --> 1:21:42.801 for example on a single vector which tries to represent the full sentence. 1:21:43.103 --> 1:21:52.837 The other thing you can do, of course, is take the summation of all the hidden states instead.
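As a minimal sketch of such a filtering classifier: the features below (embedding cosine, length ratio, lexical overlap) and the tiny toy data are stand-ins, and the cosine values would in practice come from a multilingual sentence encoder as discussed above.

```python
# Sketch: a sentence-pair quality classifier on top of simple features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(src: str, tgt: str, cos_sim: float) -> list:
    ls, lt = len(src.split()), len(tgt.split())
    length_ratio = min(ls, lt) / max(ls, lt)       # penalize very unequal lengths
    overlap = len(set(src.split()) & set(tgt.split())) / max(ls, lt)
    return [cos_sim, length_ratio, overlap]

# toy training data: (source, target, embedding cosine, label 1 = parallel)
data = [
    ("das haus ist klein", "the house is small", 0.81, 1),
    ("ich gehe nach hause", "i am going home", 0.77, 1),
    ("wir lesen ein buch", "the weather is nice today", 0.22, 0),
    ("das haus ist klein", "the house the the", 0.55, 0),
]
X = np.array([features(s, t, c) for s, t, c, _ in data])
y = np.array([label for *_, label in data])

clf = LogisticRegression().fit(X, y)
prob = clf.predict_proba([features("er trinkt kaffee", "he drinks coffee", 0.74)])[0, 1]
print(f"P(parallel) = {prob:.2f}")
```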
1:21:58.698 --> 1:22:10.618 Okay, and then one thing which we skipped until now, and only briefly: fragment extraction. 1:22:10.630 --> 1:22:20.096 So if we have sentence pairs which are not really parallel, can we still extract information from them? 1:22:22.002 --> 1:22:38.679 The task here is: we have a sentence pair, and we want to find parallel fragments within it. 1:22:39.799 --> 1:22:46.577 And how that has been done, for example, is using lexical positive and negative associations. 1:22:47.187 --> 1:22:57.182 You can transform your target sentence into a signal and look for regions where many words are aligned. 1:22:57.757 --> 1:23:18.572 So you have here the English sentence and the sentence in the other language, and you have an alignment between them; a word that is aligned gives a positive signal, a word that is not aligned gives a negative signal. 1:23:22.322 --> 1:23:40.023 And then you do some smoothing on this signal, because you want to find an area where most of the words are aligned. 1:23:40.100 --> 1:23:51.728 Through the smoothing, single alignment errors don't prevent you from extracting a segment. 1:23:51.972 --> 1:23:58.813 So you try to find long segments where at least most of the words are somehow aligned. 1:24:00.040 --> 1:24:10.645 And then you take this segment on each side and extract it as your parallel fragment. 1:24:10.630 --> 1:24:27.439 So in the end you not only have full sentences, but also partial sentences, which can be especially helpful if you are in a quite low-resource setting. 1:24:32.332 --> 1:24:44.023 That's everything for today. What you hopefully remember is the general pipeline: 1:24:44.184 --> 1:24:57.625 we talked about how we can do the document alignment, then the sentence alignment, and after that the filtering. 1:24:59.339 --> 1:25:15.444 Any more questions? I think on Thursday we had to do a switch, so on Thursday there will be a practical session. 0:00:01.921 --> 0:00:16.424 Hey, welcome to today's lecture. What we want to look at today is how we can make neural machine translation more efficient. 0:00:16.796 --> 0:00:29.714 So until now we had this global view of the system, the encoder and the decoder, and we haven't really thought about how long things take. 0:00:30.170 --> 0:00:47.084 And what we know, for example, is that you can make the systems bigger in different ways: we can make them deeper, or wider. 0:00:47.407 --> 0:00:56.331 And if we have enough data, that typically helps to make performance better. 0:00:56.576 --> 0:01:00.620 But of course it leads to the problem that we need more resources. 0:01:00.620 --> 0:01:06.587 That is a problem at universities, where we typically have limited computation capacities. 0:01:06.587 --> 0:01:11.757 So at some point you have such big models that you cannot train them anymore. 0:01:13.033 --> 0:01:26.984 And for companies it is of course also important what it costs to generate a translation, for example just in power consumption. 0:01:27.667 --> 0:01:35.386 So there are different reasons why you want to do efficient machine translation. 0:01:36.436 --> 0:01:50.527 In general, there are different ways of improving your machine translation system. 0:01:50.670 --> 0:01:55.694 One is data: we looked into data crawling and into monolingual data. 0:01:55.875 --> 0:02:17.554 Of course, we are not purely interested in having more data; the idea is that more data also means better quality, because mostly we are interested in increasing the quality of the machine translation. 0:02:18.838 --> 0:02:24.892 But there are also other ways of improving the quality of a machine translation system. 0:02:25.325 --> 0:02:44.467 And that is, of course, where most research is focusing: building better algorithms. 0:02:44.684 --> 0:03:00.315 Of course, one is not necessarily better than the other; often it's easier to just collect more data than to invent some great new algorithm. But both of them are important.
0:03:00.920 --> 0:03:11.590 But there is this third way, especially with neural machine translation, and that means we make the model bigger. 0:03:11.751 --> 0:03:16.510 That can be, as said, that we have more layers or wider layers. 0:03:16.510 --> 0:03:24.532 The other thing we talked a bit about is ensembles: that means we are not building one machine translation system, 0:03:24.965 --> 0:03:33.177 but we can easily build, say, four. What is the typical strategy to build different systems? Do you remember? 0:03:35.795 --> 0:03:48.979 They should of course be a bit different: if they all predict the same, then combining them doesn't help. So what is the easiest way if you have to build four systems? 0:03:51.711 --> 0:04:01.747 One suggestion was to select the best output of a single system, 0:04:02.362 --> 0:04:16.682 but I mean really building four different systems, so that you can later combine them, maybe by averaging; ensembles typically average all the probabilities. 0:04:19.439 --> 0:04:36.525 Think about how we train neural networks: there is one parameter you can easily adjust, the random seed, and that is exactly the easiest way to get three or four different systems. 0:04:37.017 --> 0:04:46.556 They have the same architecture, so all the hyperparameters are the same, but the weights end up different, so they will make different predictions. 0:04:48.228 --> 0:05:08.268 So bigger models and ensembles are in some ways the easiest path to improving your quality, because you don't really have to develop anything new. 0:05:08.588 --> 0:05:24.877 There are limits on that: bigger models only get better if you have enough training data; a hundred-layer model will not work on very small data, but with a reasonable amount of data this is the easiest thing. 0:05:25.305 --> 0:05:34.970 However, there is a challenge with making models bigger, and that is the computation. 0:05:35.175 --> 0:05:49.518 If you have a bigger model, that can mean longer running times, and if you have two models, you need twice the computation. 0:05:51.171 --> 0:06:02.442 Normally you cannot parallelize across the different layers, because the input to one layer is always the output of the previous layer, so the depth also increases your runtime. 0:06:02.822 --> 0:06:20.927 Then you have to store all your models in memory; if you double the weights, you double the memory. 0:06:20.927 --> 0:06:31.865 It is also more costly to do backpropagation, where you have to store the intermediate activations, so you increase not only the model in your memory but also all these other variables. 0:06:34.414 --> 0:06:36.734 So in general it is more expensive. 0:06:37.137 --> 0:06:54.208 And therefore there are good reasons to look into whether we can make these models more efficient.
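To make the ensembling idea concrete, here is a sketch of a single decoding step; the three "models" are stand-in functions (in a real system they would be networks trained with different random seeds), and the tiny vocabulary is invented for illustration.

```python
# Sketch: one greedy decoding step with an ensemble of models.
# Each model returns a probability distribution over the (tiny) vocabulary;
# the ensemble averages these distributions and picks the argmax.
import numpy as np

VOCAB = ["the", "a", "house", "home", "</s>"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
# stand-ins for three networks trained with different random seeds
models = [lambda prefix, W=rng.standard_normal(len(VOCAB)): softmax(W)
          for _ in range(3)]

def ensemble_step(prefix):
    probs = np.mean([m(prefix) for m in models], axis=0)  # average the probabilities
    return VOCAB[int(np.argmax(probs))], probs

word, probs = ensemble_step(["<s>"])
print(word, np.round(probs, 3))
```

The key design choice, as said above, is that averaging happens over the output distributions at each step, not over the final translations.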
0:06:54.134 --> 0:07:01.274 So you can view it like this: you have, say, one GPU and one day of training time, 0:07:01.221 --> 0:07:08.437 or forty thousand euros, and then the question is what the best machine translation system is that you can get within this budget. 0:07:08.969 --> 0:07:24.251 Then you can make the models bigger, but you have to train them for a shorter time; or we can build more efficient algorithms. 0:07:25.925 --> 0:07:31.699 If you think about efficiency, there are a couple of different scenarios. 0:07:32.312 --> 0:07:47.913 If you are coming more from the research community, what you'll be doing is building a lot of models in your research. 0:07:48.088 --> 0:07:58.645 So you have your test set, you calculate the BLEU score, then you train another model, and so on. 0:07:58.818 --> 0:08:14.944 What that means is that you're typically training on millions of sentences, so your training time is long, maybe a day, in other cases a week. 0:08:15.135 --> 0:08:22.860 The testing is not really the costly part, but the training is very costly. 0:08:23.443 --> 0:08:37.830 If you are more thinking of building models for an application, the scenario is quite different: you train a model once, 0:08:38.038 --> 0:08:47.720 and then you keep it running, and maybe thousands of customers are using it for translating. So in that case the testing dominates. 0:08:48.168 --> 0:09:07.096 And we will see that these are not always the same types of challenges: you can parallelize some things in training which you cannot parallelize in testing. 0:09:07.347 --> 0:09:14.124 For example, in training you have to do backpropagation, so you have to store the activations. 0:09:14.394 --> 0:09:24.994 On the other hand, in testing, and we briefly discussed this and will do it in more detail today, 0:09:25.265 --> 0:09:36.100 in training you know the target and can process everything in parallel, while in testing you don't. 0:09:36.356 --> 0:09:50.530 There you can only generate one word at a time, so you can parallelize much less; therefore this distinction is important. 0:09:52.712 --> 0:10:03.157 There is even a specific shared task on this, the efficiency task, where it's about making things as efficient as possible, 0:10:03.123 --> 0:10:14.207 and they look at different resources: how much GPU runtime do you need, 0:10:14.454 --> 0:10:20.294 how much memory do you need, or you can have a fixed memory budget and then have to build the best system within it. 0:10:20.500 --> 0:10:30.989 And here is a bit of an example of that, with submissions from a few teams, for example from Edinburgh. 0:10:31.131 --> 0:10:36.515 If you want to find the most efficient system, you have to make a trade-off: 0:10:36.776 --> 0:10:46.720 do you want better quality or less runtime? There is not the one solution; you can improve one at the cost of the other. 0:10:46.946 --> 0:10:49.662 And you see that there are different systems. 0:10:49.909 --> 0:11:07.824 On one axis is how many words per second you can translate, on the other the translation quality, and you want to be as far toward the top right as possible.
0:11:08.889 --> 0:11:09.984 The trade-off looks a little different for each system. 0:11:11.051 --> 0:11:34.161 You want to be in the top right corner; for example, one system translates around two hundred and fifty thousand words per second and still reaches a quality score of about zero point three. 0:11:34.834 --> 0:11:53.922 There is, of course, a decision to make, but the question is how far you can get: all the points on this frontier would be winners, because they are the most efficient in the sense that no system achieves the same quality with less computation. 0:11:57.657 --> 0:12:11.668 So there is the one question of which resources you are interested in: are you running on CPU or GPU? There are different ways of parallelizing things. 0:12:14.654 --> 0:12:27.154 Another dimension is how you process your data: there is batch processing and streaming. 0:12:27.647 --> 0:12:39.981 In batch processing you have the whole document available, so you can translate all sentences in parallel, and then you're interested in throughput. 0:12:40.000 --> 0:12:57.964 You can then, especially on GPUs, translate not one sentence at a time but one hundred sentences or so in parallel, so you have one more dimension along which you can parallelize and be more efficient. 0:12:58.558 --> 0:13:16.544 You can, for example, also sort the sentences of the document: we learned that if you do batch processing you have padding, 0:13:16.636 --> 0:13:25.535 and then it makes sense to sort the sentences by length in order to have the minimum amount of padding attached. 0:13:27.427 --> 0:13:32.150 The other scenario is the streaming scenario, where you do live translation. 0:13:32.512 --> 0:13:40.212 In that case you can't wait for the whole document, but you have to translate as the input comes in. 0:13:40.520 --> 0:14:00.361 That happens especially in situations like speech translation, and then you're interested in things like latency: how long do you have to wait to get the output for a sentence? 0:14:06.566 --> 0:14:29.227 Finally, there is the implementation: today we're mainly looking at different algorithms and models for your machine translation system, but of course for the same algorithm there are also different implementations. 0:14:29.489 --> 0:14:38.643 For example, there are dedicated machine translation toolkits which are very fast, 0:14:38.638 --> 0:14:49.973 because a lot of the operations are coded at a very low level, directly in the CUDA kernels. 0:14:50.110 --> 0:15:02.474 So the same attention network is typically more efficient in that type of implementation than in any other. 0:15:03.323 --> 0:15:13.105 Of course, there may be other disadvantages.
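Here is a minimal sketch of the sort-by-length batching described above: sort by token count, form buckets, translate bucket by bucket, and restore the original document order at the end. The `translate_batch` stub is a placeholder for a real MT system.

```python
# Sketch: sorting sentences by length before batching to minimize padding.
def batch_by_length(sentences, batch_size=4):
    # remember the original positions, then sort indices by token count
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def translate_batch(batch):              # placeholder for the real MT system
    return [s.upper() for s in batch]

def translate_document(sentences):
    out = [None] * len(sentences)
    for idx_batch in batch_by_length(sentences):
        hyps = translate_batch([sentences[i] for i in idx_batch])
        for i, hyp in zip(idx_batch, hyps):
            out[i] = hyp                 # restore the document order
    return out

doc = ["a b c", "a", "a b c d e f", "a b", "a b c d"]
print(translate_document(doc))
```

Because each bucket now contains sentences of similar length, the padded tensors waste far fewer computations than batches drawn in document order.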
0:15:15.255 --> 0:15:23.323 If you have worked in the practical course, you know that higher-level frameworks are normally easier to understand, easier to change, and so on; so there is again a trade-off. 0:15:23.483 --> 0:15:39.145 You have to think about whether to include this in your study or comparison: should I compare different implementations and also find the most efficient implementation, or is it only about the pure algorithm? 0:15:42.742 --> 0:15:50.355 When building these systems, there are different trade-offs to make. 0:15:50.850 --> 0:15:57.299 One trade-off is between memory and throughput, that is, how many words you can generate per second. 0:15:57.557 --> 0:16:03.351 Typically you can easily increase your throughput by increasing the batch size. 0:16:03.643 --> 0:16:06.899 That means you are translating more sentences in parallel. 0:16:07.107 --> 0:16:09.241 And GPUs are very good at that. 0:16:09.349 --> 0:16:31.995 Whether you translate one sentence or one hundred sentences takes not the same, but a very similar time, because GPUs are efficient at matrix multiplication and can do the same operation on all sentences in parallel. So if you increase your batch size, you do more in parallel and translate more words per second. 0:16:33.653 --> 0:16:44.755 On the other hand, with this advantage, you of course need more memory for the bigger batches. 0:16:44.965 --> 0:16:59.141 The other problem is that you may have such big models that you can only translate with small batch sizes. 0:16:59.119 --> 0:17:08.466 If you are running out of memory while translating, one way out is to decrease your batch size. 0:17:13.453 --> 0:17:31.902 Then there is the trade-off between quality and throughput: as said before, larger models generally give higher quality but lower speed. 0:17:32.092 --> 0:17:38.709 Of course, a larger model does not always help, since you get overfitting at some point, but in general it does. 0:17:43.883 --> 0:17:52.901 And with this, back to the training and testing difference we had before. 0:17:53.113 --> 0:17:58.455 So where is the difference between training and testing, and between the encoder and the decoder? 0:17:58.798 --> 0:18:17.183 If we look at training time, we have a source sentence here, 0:18:17.183 --> 0:18:31.626 and it is processed with self-attention; that's a typical transformer encoder. 0:18:31.626 --> 0:18:49.184 The first thing to note is that the full source sentence is available at once. That is, of course, not true in all cases; we'll later talk about speech translation, where we might want to translate before the sentence is finished. 0:18:49.389 --> 0:18:56.172 But in the general case, you have the full sentence that you want to translate. 0:18:56.416 --> 0:19:02.053 So the important thing is: on the source side, everything is available. 0:19:03.323 --> 0:19:15.752 And this was one of the big advantages of the transformer, if you remember; there are several.
0:19:16.156 --> 0:19:25.229 But the key one here is that we can calculate a full layer at once. 0:19:25.645 --> 0:19:29.318 There is no dependency between this state and this state within a layer. 0:19:29.749 --> 0:19:37.536 So we calculate here the keys, values, and queries, and based on those the attention output. 0:19:37.937 --> 0:19:46.616 Which means we can do all these calculations here in parallel. 0:19:48.028 --> 0:20:00.887 And that is where the efficiency comes from, because for GPUs it is much faster to do these things in parallel than one after another. 0:20:01.421 --> 0:20:10.311 Then we go layer by layer, one at a time, and calculate the whole encoder. 0:20:10.790 --> 0:20:28.365 In training, an important point is that for the decoder we also have the full sentence available, because we know the target we should generate. 0:20:29.649 --> 0:20:38.297 We have modeled it in a different way, though: each hidden state depends only on the previous positions. 0:20:38.598 --> 0:20:56.665 The first state here depends only on this information; you see it if you remember the masked self-attention. 0:20:56.896 --> 0:21:08.925 That means, of course, we can only calculate the decoder once the encoder is done: first we calculate the encoder, then the decoder. 0:21:09.569 --> 0:21:27.929 But again, in training both x and y are available, so we can calculate everything within the decoder in parallel as well. 0:21:28.368 --> 0:21:46.408 So the nice property of the transformer in training is that we can parallelize over the sequence also for the decoder. 0:21:46.866 --> 0:22:03.270 You can only calculate one layer at a time, so the depth stays sequential, but the sequence length, which is typically quite long, doesn't really matter that much. 0:22:05.665 --> 0:22:13.276 However, in testing the situation is different: in testing we only have the source. 0:22:13.713 --> 0:22:29.063 We don't know the full target sentence yet, because we autoregressively generate it; so for the encoder things are the same, but not for the decoder. 0:22:29.409 --> 0:22:40.756 In this case we first have only the first state, then the second, and so on; we never get all states in parallel. 0:22:41.101 --> 0:22:58.643 We can only do the next step for y once we have fed in the most probable word so far; we do greedy search or beam search, but we cannot parallelize over time. 0:23:03.663 --> 0:23:22.363 So if we are interested in making things more efficient for testing, which matters, for example, in the application scenario, 0:23:22.642 --> 0:23:35.933 it makes sense to think about our architecture, given that we are currently working with attention-based models. 0:23:36.096 --> 0:23:47.142 The decoder is where most of the time is spent during testing. 0:23:47.167 --> 0:23:59.833 And that is not even talking about beam search, which may be even more costly, because in beam search you have to try several different continuations.
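A small sketch of this training-versus-testing asymmetry, assuming the decoder states are given: in training one masked self-attention call covers all positions at once, while in testing a loop must advance position by position. Shapes and the single-head, unprojected attention are simplifications for illustration.

```python
# Sketch: why decoder self-attention parallelizes in training but not in decoding.
import torch

T, d = 5, 8
h = torch.randn(T, d)                        # decoder states for a known target

# Training: one masked self-attention over all T positions at once.
scores = h @ h.T / d ** 0.5                  # (T, T) attention scores
causal = torch.tril(torch.ones(T, T)).bool() # lower-triangular "masked" pattern
scores = scores.masked_fill(~causal, float("-inf"))
ctx_train = torch.softmax(scores, dim=-1) @ h   # all positions in parallel

# Testing: the target is unknown, so we must loop step by step.
ctx_test = []
for t in range(T):                           # position t sees only h[:t+1]
    s = h[t] @ h[: t + 1].T / d ** 0.5
    ctx_test.append(torch.softmax(s, dim=-1) @ h[: t + 1])
ctx_test = torch.stack(ctx_test)

print(torch.allclose(ctx_train, ctx_test, atol=1e-6))  # True: same result
```

Both paths compute identical values; the difference is purely that the second one cannot be batched over time, which is exactly the decoding bottleneck discussed above.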
0:24:02.762 --> 0:24:21.905 So the question is: what can you do in order to make your model more efficient and faster at translation in these types of cases? 0:24:24.604 --> 0:24:30.178 One thing is to look into the encoder-decoder trade-off. 0:24:30.690 --> 0:24:48.154 Until now we typically assumed that the depth of the encoder and the depth of the decoder are roughly the same. 0:24:48.268 --> 0:24:57.678 If you haven't thought about it, you just take what is known to run well. 0:24:58.018 --> 0:25:04.914 However, we saw now that at test time the decoder runtime is a lot longer than the encoder runtime. 0:25:05.425 --> 0:25:25.415 The question is whether the same holds for quality: do we only get good quality if both sides are deep? We know that making the model deeper increases quality, 0:25:25.425 --> 0:25:32.285 but what we haven't asked is whether it is really important to increase the depth on both sides in the same way. 0:25:32.552 --> 0:25:42.923 So what we can do instead is something like this, where you have a deep encoder and a shallow decoder. 0:25:43.163 --> 0:25:59.757 That would mean, for example, that instead of six layers on each side you have twelve layers on the encoder and only one on the decoder. 0:26:00.080 --> 0:26:10.469 In this case the overall depth from start to end is similar, and so hopefully also the quality. 0:26:11.471 --> 0:26:29.330 But a lot more can be parallelized, because what is costly at the end, during decoding, is the decoder, since it runs in an autoregressive way. 0:26:31.411 --> 0:26:38.734 And that can be analyzed; here are some example results where people have done this. 0:26:39.019 --> 0:26:57.607 We are mainly interested in the autoregressive rows here, in orange, and in the speed-up. 0:26:57.717 --> 0:27:31.644 You have the baseline system, and while the setups are not exactly identical, they are similar; the baseline speed is set to one. 0:27:31.771 --> 0:27:42.621 The deep-encoder, shallow-decoder systems are then several times as fast. 0:27:42.782 --> 0:28:00.283 You see that although you have slightly more parameters and the amount of calculation is roughly the same, you get a speed-up, because during testing you can now parallelize most of the work. 0:28:02.182 --> 0:28:13.500 The other thing is that while you are speeding up, the performance stays similar: sometimes you improve a bit, sometimes you lose a bit. 0:28:13.500 --> 0:28:20.421 There is a bit of a loss for English to Romanian, but in general the quality is very similar. 0:28:20.680 --> 0:28:30.343 So you can keep a similar performance while improving your speed, just by distributing the depth differently. 0:28:30.470 --> 0:28:38.690 You also see that the encoder layers don't matter that much for speed; most of the time is spent in the decoder.
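As an aside, here is a minimal PyTorch sketch of such an asymmetric model. The 12/1 split mirrors the configuration discussed above; the model dimension, head count, and the use of the stock `nn.Transformer` layers are illustrative assumptions rather than the setup of the cited experiments.

```python
# Sketch: a deep-encoder / shallow-decoder transformer in PyTorch.
import torch.nn as nn

d_model, nhead = 512, 8

enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

deep_encoder = nn.TransformerEncoder(enc_layer, num_layers=12)    # fully parallel at test time
shallow_decoder = nn.TransformerDecoder(dec_layer, num_layers=1)  # runs once per output token

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(f"encoder params: {n_params(deep_encoder)/1e6:.1f}M,"
      f" decoder params: {n_params(shallow_decoder)/1e6:.1f}M")
```

The point of the design: total depth (and thus capacity) stays comparable to a 6-6 model, but nearly all of it sits in the encoder, which is executed only once per sentence.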
0:28:38.979 --> 0:28:57.309 Because if you compare the twelve-encoder-layer system to the six-six system, you have a lower performance with six encoder layers, but the speed is similar. 0:28:57.897 --> 0:29:02.233 Is the huge decrease maybe due to a lack of data? 0:29:03.743 --> 0:29:23.191 Good idea, but I would say that's not the case: Romanian-English should have the same amount of data in both directions. 0:29:24.224 --> 0:29:40.702 Maybe it's just something about that language: if you generate Romanian, maybe you need more target-side dependencies. 0:29:42.882 --> 0:29:46.263 Why exactly, I honestly don't know. 0:29:47.887 --> 0:29:49.034 There could be several reasons. 0:29:49.889 --> 0:30:12.492 Maybe it's the vocabulary: it might be easier to cover the vocabulary on the English side than for Romanian. 0:30:13.333 --> 0:30:22.391 I would have to check, but I would assume that in this case the system is not pre-trained but trained from scratch, 0:30:22.902 --> 0:30:35.595 and that's why I was assuming both directions have the same setup; but you may be right; for example, if the decoder had been pre-trained on English data, that would matter. 0:30:36.096 --> 0:30:43.733 I don't remember exactly whether they do something like that, but that could be a good explanation. 0:30:45.325 --> 0:31:01.443 So this is one of the easiest ways to speed up: you just change two hyperparameters and don't have to implement anything new. 0:31:02.722 --> 0:31:16.521 Of course, there are other ways of doing it; we'll look into two ideas concerning the architecture. 0:31:16.796 --> 0:31:28.154 Until now we used the standard transformer everywhere as our baseline. 0:31:28.488 --> 0:31:41.845 However, on the decoder side that might not be the best solution; there is no rule that encoder and decoder must look the same. 0:31:42.222 --> 0:31:47.130 So we can use different types of architectures in the encoder and the decoder. 0:31:47.747 --> 0:31:58.842 There are several things you could do differently; we will look into two today. The first is average attention, which is a very simple solution. 0:31:59.419 --> 0:32:08.757 As the name says, it's not really attending anymore; it's just equal attention to everything. 0:32:09.249 --> 0:32:24.913 The other idea, which is currently used in most systems optimized for efficiency, is to keep self-attention in the encoder, 0:32:25.065 --> 0:32:39.700 but on the decoder side not to use transformer self-attention; instead we use a recurrent neural network, because here the usual disadvantage of recurrent networks does not hurt. 0:32:39.799 --> 0:32:49.684 The recurrent step is normally cheaper to calculate, because it depends only on the current input and the previous state. 0:32:51.931 --> 0:33:03.841 So what is the difference in decoding, and why is self-attention maybe not ideal for decoding? 0:33:04.204 --> 0:33:15.649 In an RNN, if we want to compute the new state, we only have to look at the input and the previous state. 0:33:16.136 --> 0:33:19.029 In convolutional networks, we have a dependency on a fixed number of previous states, but those are rarely used for decoding.
0:33:19.029 --> 0:33:39.774 In contrast, in the transformer we have this unbounded dependency: each new state can look at all previous states. 0:33:40.000 --> 0:33:56.053 So y_t depends on everything from y_1 to y_{t-1}; that is very powerful for modeling, but not very efficient. 0:33:56.276 --> 0:34:10.895 The disadvantage is that we have to do all these calculations at every step, so purely from the point of view of efficient computation, this might not be the best choice. 0:34:11.471 --> 0:34:21.994 So the question is: can we change our architecture to keep some of the advantages but make things more efficient? 0:34:24.284 --> 0:34:32.610 One idea is what is called average attention, and the interesting thing is that this works surprisingly well. 0:34:33.013 --> 0:34:46.790 The only change is in the decoder: you're not computing attention weights anymore; the attention weights are all the same. 0:34:47.027 --> 0:35:03.058 So you don't calculate different weights from query and key; you just take equal weights over all previous positions. 0:35:03.283 --> 0:35:07.585 Here it would be one third from this state, one third from this, and one third from this. 0:35:09.009 --> 0:35:14.719 And because of that, you can precompute things, and decoding gets more efficient. 0:35:15.195 --> 0:35:18.803 First the formula, which is maybe not directly obvious. 0:35:18.979 --> 0:35:45.022 The new state is the average of all states so far: g_t = (y_1 + ... + y_t) / t. So here it would be one third of this plus one third of this plus one third of this. 0:35:46.566 --> 0:36:01.844 But if you calculate it this way, it's not yet more efficient, because you still have to sum over all the hidden states at every step. 0:36:04.524 --> 0:36:24.568 You can, however, easily speed this up by keeping an intermediate value, a running sum, which you update at every step: 0:36:25.585 --> 0:36:36.739 you take the previous running sum and add the current state, because the previous sum already contains everything before it. 0:36:37.377 --> 0:36:50.111 This running sum is not yet the final value; to get the final state, you divide by t to take the average. 0:36:50.430 --> 0:37:12.535 If you do the calculation this way, each step needs a fixed number of operations, instead of a sum whose length depends on the position. 0:37:12.732 --> 0:37:32.695 Question: in the example, shouldn't the value at this position then be different? 0:37:32.993 --> 0:37:44.531 That's a very good point, and it's because the figure is not very clear: these are the intermediate values, the ones with the tilde. 0:37:44.884 --> 0:37:57.895 So this one is just the sum of these two, because this one is the running sum up to here. 0:37:58.238 --> 0:38:15.131 The sum of these is exactly the sum of those, and so on; so you only do the summation here, and the division by t at the end.
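A sketch of this cumulative-average trick: the naive version re-sums all previous states at every step, while the incremental version keeps a running sum and does constant work per step. The dimensions are arbitrary, and the states stand in for the decoder inputs of the average-attention layer.

```python
# Sketch: average "attention" with a running sum (constant work per step).
import numpy as np

T, d = 6, 4
y = np.random.default_rng(0).standard_normal((T, d))  # stand-in decoder states

# naive: g_t = (y_1 + ... + y_t) / t  ->  O(t) work at step t
g_naive = np.stack([y[: t + 1].mean(axis=0) for t in range(T)])

# incremental: keep the running sum s_t = s_{t-1} + y_t, divide only at the end
g_fast = np.empty_like(y)
s = np.zeros(d)
for t in range(T):
    s += y[t]                  # one addition per step
    g_fast[t] = s / (t + 1)    # the average, computed from the running sum

print(np.allclose(g_naive, g_fast))  # True: identical results
```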
0:38:15.255 --> 0:38:31.531 Put more mathematically: you can pull the one-over-t out of the sum, and then you can compute the sum incrementally. 0:38:36.256 --> 0:38:53.321 That may look a bit weird and too simple; we were all talking about this great attention mechanism that can focus on different parts, and the surprising result of this work is that, in the end, it can also work well with just equal weights. 0:38:53.954 --> 0:39:00.451 I mean, it's not that easy: sometimes this works well, and there are also reports where it doesn't work that well. 0:39:01.481 --> 0:39:20.026 But I think it's an interesting result, and it maybe shows that a lot of things in the transformer paper that look like minor design choices, certain hyperparameter settings, the layer norm in between, the feed-forward block, and so on, 0:39:20.026 --> 0:39:25.567 are also all important; getting the setup around the attention right matters a lot. 0:39:28.969 --> 0:39:42.521 The other thing you can do is, in the end, not completely different from this one. 0:39:42.942 --> 0:40:01.330 It is a recurrent network which also has this type of highway connection, so it can ignore the recurrent unit and pass the input through directly. 0:40:01.561 --> 0:40:15.480 It's not exactly a residual addition, but the idea is that the input can go more or less directly to the output. 0:40:17.077 --> 0:40:33.418 These are the four components of the simple recurrent unit (SRU), and the unit is motivated by GRUs and by LSTMs, which we have seen before. 0:40:33.513 --> 0:40:43.633 Gating has proven to be very good for RNNs; it allows you to control the information flow. 0:40:44.164 --> 0:40:48.186 In this unit we have two gates, the reset gate and the forget gate. 0:40:48.768 --> 0:41:01.277 First we have the general structure, which has a cell state; here is the cell state. 0:41:01.361 --> 0:41:11.448 This is passed along, and we get the different cell states over the time steps. 0:41:11.771 --> 0:41:16.518 How do we calculate it? Just assume we have an initial cell state here. 0:41:17.017 --> 0:41:19.670 The first thing we compute is the forget gate. 0:41:20.060 --> 0:41:41.356 The forget gate models whether the new cell state should mainly depend on the previous cell state or on the current input. 0:41:41.621 --> 0:41:42.877 How can we model that? 0:41:44.024 --> 0:41:56.480 Let's look at the formula: the forget gate depends on the input and on the previous cell state. 0:41:57.057 --> 0:42:04.890 We multiply both the previous cell state and our input with weights, 0:42:05.105 --> 0:42:08.472 add a bias vector, and then apply a sigmoid.
0:42:08.868 --> 0:42:13.452 So in the end we have numbers between zero and one for each dimension, 0:42:13.853 --> 0:42:31.890 saying: if the gate is near zero, we will mainly use the new input; if it's near one, we will keep the cell state and ignore the input at this dimension. 0:42:33.313 --> 0:42:41.141 With this motivation we can create the new cell state, and here you see the formula: 0:42:41.601 --> 0:43:00.427 you take your forget gate and multiply it with the previous cell state, so if the gate was around one, you keep the old state, 0:43:00.800 --> 0:43:10.946 and in the other case you add a transformation of the input, weighted by one minus the gate: c_t = f_t * c_{t-1} + (1 - f_t) * (W x_t). 0:43:11.351 --> 0:43:24.284 So if the gate value was near zero, most of the information comes from the input. 0:43:25.065 --> 0:43:32.067 That is your cell state; the only question now is, based on the cell state, what is the output? 0:43:33.253 --> 0:43:50.957 And there you have another choice: you can either output the cell state, or instead prefer the input. 0:43:52.612 --> 0:43:59.417 Are the weights the same for the reset gate and the forget gate? 0:44:00.900 --> 0:44:16.323 No, exactly: the matrices are different, and that is important, because sometimes you want to keep information in the state without outputting it yet. 0:44:16.636 --> 0:44:25.205 So here again we have a vector with values between zero and one controlling how the information flows. 0:44:25.505 --> 0:44:36.459 The output is then calculated similarly to the cell state, but again mixing in the input: 0:44:36.536 --> 0:44:45.714 the reset gate decides whether to output what is currently stored in the cell state, or to pass through the input: h_t = r_t * g(c_t) + (1 - r_t) * x_t. 0:44:46.346 --> 0:45:01.293 So it's not exactly the residual connection we had before, where we simply added things up; here it is a gated combination. 0:45:04.224 --> 0:45:13.125 This is the general idea of the simple recurrent unit. We will now look at how to make it even more efficient; but first, do you have more questions on how it works? 0:45:23.063 --> 0:45:43.177 Now, these calculations are where things can become more efficient, because so far every dimension depends on all the others: 0:45:43.423 --> 0:45:52.353 if you do a matrix multiplication with a vector, each dimension of the output vector depends on all dimensions of the input. 0:45:52.973 --> 0:46:11.340 We can change the recurrent part so that the first dimension of the new cell state only depends on the first dimension of the previous cell state, and so on. 0:46:11.931 --> 0:46:18.481 Dependencies between dimensions make things less parallelizable, 0:46:19.359 --> 0:46:35.122 and we can remove them by replacing the matrix product on the recurrent connection with an element-wise product: 0:46:35.295 --> 0:46:51.459 you multiply the first dimension with the first dimension, the second with the second, and so on. 0:46:52.032 --> 0:46:59.294 Of course, these are different weight vectors for the reset gate and for the forget gate. 0:46:59.899 --> 0:47:16.148 Now the first dimension only depends on the first dimension, so you no longer have dependencies between dimensions. 0:47:18.078 --> 0:47:25.692 Maybe it gets clearer if you look at it this way, at what we actually have to compute now. 0:47:25.966 --> 0:47:38.713 First we do the matrix multiplications on the inputs, for all time steps together, to get the transformed inputs and the gate pre-activations. 0:47:39.179 --> 0:47:52.748 Then we only have element-wise operations combining c_{t-1} and the transformed input; these can be optimally parallelized. 0:47:53.273 --> 0:48:07.603 So within each step we can additionally parallelize across the dimensions. 0:48:09.929 --> 0:48:24.255 The matrix multiplications we can again do in parallel for all inputs x_t, 0:48:24.544 --> 0:48:34.650 while the recurrence itself has to go step by step; but per step it is only cheap element-wise work, and across dimensions you can parallelize.
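A sketch of this SRU-style computation pattern: the heavy matrix multiplications are batched over all time steps at once, and the loop over time does only element-wise work (`*` playing the role of the element-wise product). The equations follow the description above; exact details (initializations, the tanh on the cell state) may differ from the published SRU.

```python
# Sketch: simple-recurrent-unit-style computation.
# Matrix multiplications are batched over all time steps; the loop over
# time does only element-wise work, so each dimension evolves independently.
import numpy as np

T, d = 7, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d))
W, Wf, Wr = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
vf = rng.standard_normal(d)                 # element-wise recurrent weight
bf, br = np.zeros(d), np.zeros(d)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1) all matrix products at once, in parallel over the T steps
xW, xWf, xWr = x @ W, x @ Wf, x @ Wr

# 2) the recurrence: element-wise only, no matrix product inside the loop
c = np.zeros(d)
h = np.empty((T, d))
for t in range(T):
    f = sigmoid(xWf[t] + vf * c + bf)         # forget gate (element-wise on c)
    c = f * c + (1.0 - f) * xW[t]             # new cell state
    r = sigmoid(xWr[t] + br)                  # reset/output gate
    h[t] = r * np.tanh(c) + (1.0 - r) * x[t]  # gated highway output

print(h.shape)  # (7, 16)
```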
0:48:35.495 --> 0:48:55.383 Whether the dimension independence also matters for quality, I don't know if they have analyzed; I assume the motivation is mainly efficiency. 0:49:01.001 --> 0:49:20.699 People have even made the output part simpler: they replaced it with the same kind of highway or residual connection that we have in the transformer. 0:49:20.780 --> 0:49:24.789 That is how things are put together in the transformer anyway. 0:49:25.125 --> 0:49:47.503 And that gives a simpler variant of the simple recurrent unit, where for the output you use exactly such a residual connection, so you don't need the extra reset gate and that type of thing. 0:49:49.149 --> 0:50:02.580 And with this we are at the end of how to make efficient architectures, before we go to the next topic. 0:50:13.013 --> 0:50:28.988 Besides the encoder-decoder trade-off and the architectures, there is a next technique, which is used very successfully in nearly all of deep learning: knowledge distillation. 0:50:29.449 --> 0:50:45.983 The idea is: can we extract the knowledge from a large network into a smaller one that performs similarly well? 0:50:47.907 --> 0:50:53.217 And the nice thing is that this really works, which is maybe very surprising. 0:50:53.673 --> 0:51:07.870 So we have a large, strong model which we train for a long time, and the question is: can that help us to train a smaller model? 0:51:08.148 --> 0:51:17.005 Can what we refer to as the teacher model help us build a better small student model than before? 0:51:17.257 --> 0:51:28.755 Before, the student model learned only from the data; that is how we normally train our systems. 0:51:29.249 --> 0:51:47.222 The question is: can we train this small model better if we are not only learning from the data, but also from a large model which has been trained, maybe, on the same data, 0:51:47.667 --> 0:51:55.564 so that in the end you have a smaller model that performs better than one trained on the data alone? 0:51:55.895 --> 0:51:59.828 And maybe that is, at first view, very surprising.
0:51:59.739 --> 0:52:11.682 It seems surprising because the student has seen the same data, so it should be able to learn the same thing: the baseline model trained only on the data and the student in the teacher-student framework have seen exactly the same data, 0:52:11.682 --> 0:52:19.161 because the teacher model was typically also trained only on this data. 0:52:20.580 --> 0:52:32.293 Nevertheless, it has by now been shown in many settings that the model trained in the teacher-student framework performs better. 0:52:33.473 --> 0:52:47.171 We'll get a bit of an explanation when we see how it works; there are different ways of doing it. 0:52:47.567 --> 0:53:06.113 So how does it work? This is our student network, the normal one, some type of neural network. 0:53:06.586 --> 0:53:17.050 We train the model to predict the reference words, and we do that by calculating the cross-entropy loss. 0:53:17.437 --> 0:53:25.332 The cross-entropy loss is defined such that the probability of the correct word should be as high as possible. 0:53:25.745 --> 0:53:43.368 You calculate your output probabilities at each time step, over the whole vocabulary, and your training signal is to put as much probability mass as possible on the word that is there in the reference. 0:53:43.903 --> 0:54:03.947 This is achieved by the cross-entropy loss, which sums over all positions and over the full vocabulary an indicator that the current reference word is the j-th vocabulary entry, times the log probability of that entry: L = - sum_t sum_j [y_t = v_j] * log P(v_j | y_<t, x). 0:54:04.204 --> 0:54:27.313 So we have this matrix: one row per position, one column per vocabulary entry. 0:54:27.507 --> 0:54:40.785 In the end you sum up one log probability per position, and you want these to be as high as possible. 0:54:41.041 --> 0:54:54.614 Although this is formally a sum over the whole matrix, for each position only one entry is non-zero. 0:54:54.794 --> 0:55:07.016 That is the normal cross-entropy loss that we discussed at the very beginning. 0:55:08.068 --> 0:55:23.374 So what can we do differently with the teacher? We also have a teacher network, which is a large model trained on the data. 0:55:24.224 --> 0:55:35.957 And of course its output distribution might be better than the one from the small model, because it is a stronger model. 0:55:36.456 --> 0:55:40.941 So in this case we take the training signal from the teacher network. 0:55:41.441 --> 0:55:59.159 It works the same way as before; the only difference is that we are not training towards the ground-truth probability distribution, which is sharp: 0:55:59.299 --> 0:56:11.303 the teacher output is also a probability distribution, where one word has a high probability, but other words also get some probability. 0:56:12.612 --> 0:56:30.341 And that is the main difference. Typically you then use an interpolation of these two losses, the hard one and the soft one.
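A sketch of this word-level distillation loss in PyTorch: cross-entropy towards the hard reference, a KL term towards the teacher's soft distribution, interpolated with a weight alpha. The shapes, the random stand-in logits, and the mixing weight 0.5 are illustrative assumptions.

```python
# Sketch: word-level knowledge distillation loss.
import torch
import torch.nn.functional as F

B_T, V = 12, 100              # (batch * positions, vocabulary size)
student_logits = torch.randn(B_T, V, requires_grad=True)
teacher_logits = torch.randn(B_T, V)           # from the frozen teacher
reference = torch.randint(0, V, (B_T,))        # ground-truth word ids

alpha = 0.5                                    # interpolation weight (assumed)

# hard loss: push probability mass onto the reference word
hard_loss = F.cross_entropy(student_logits, reference)

# soft loss: KL divergence between teacher and student distributions
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)

loss = alpha * hard_loss + (1 - alpha) * kd_loss
loss.backward()
print(float(loss))
```

Keeping both terms is exactly the interpolation mentioned above: the hard term anchors the student to the reference, the soft term transfers the teacher's distribution over alternatives.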
0:56:33.213 --> 0:56:47.907 Why does this help? Because there is more information contained in the distribution than in the ground truth: it encodes more about the language, because language often has several valid options for the same sentence. Yes, exactly. 0:56:47.907 --> 0:56:53.114 So there is ambiguity, which is hopefully encoded well in the teacher's distribution. 0:56:53.513 --> 0:56:57.257 The trained teacher network can represent this better than the student could learn it from the hard labels alone. 0:56:57.537 --> 0:57:10.505 So often there is only one correct word, but sometimes there are two or three, and then all of them carry probability mass. 0:57:10.590 --> 0:57:21.242 That is one main explanation of why it is better to also train from the teacher's distribution. 0:57:21.361 --> 0:57:33.493 Of course, it's good to also keep the ground-truth signal in there, to prevent the student from just copying the teacher's mistakes. 0:57:37.017 --> 0:57:49.466 Any more questions on this first type of knowledge distillation? What about distribution change, for example a domain shift? 0:57:50.550 --> 0:58:04.244 I would put it a bit differently: this is not in itself a solution to domain or distribution shift. 0:58:04.744 --> 0:58:12.680 But I don't think it performs worse than only training on the ground truth. 0:58:13.113 --> 0:58:21.254 So it's more that it may not improve things there; it should still help similarly. 0:58:21.481 --> 0:58:28.524 And of course, if you have a teacher but no ground-truth data in your target domain, 0:58:28.888 --> 0:58:42.147 then you can use the teacher's output, which is not ground truth but still helpful for learning the distribution. 0:58:46.326 --> 0:59:02.757 The second idea is sequence-level knowledge distillation. So far we have looked at each position independently. 0:59:03.423 --> 0:59:16.760 We do that often, but it has a problem: the propagation of errors, where we start with one error and then it continues. 0:59:17.237 --> 0:59:27.419 If we do word-level knowledge distillation, we treat each word in the sentence independently. 0:59:28.008 --> 0:59:32.091 So we are not modeling the dependencies between the words. 0:59:32.932 --> 0:59:47.480 We can try to do that with sequence-level knowledge distillation, but there is of course a problem: 0:59:47.847 --> 0:59:53.478 for each position we can get a distribution over all the words at this position, 0:59:53.793 --> 1:00:06.431 but if we want a distribution over all possible target sentences, that is not feasible, because there are exponentially many. 1:00:08.508 --> 1:00:23.238 So we do a bit of a hack: if we can't have a distribution over all sentences, we approximate it. 1:00:23.843 --> 1:00:30.764 What we can do is take the teacher network and sample or search for several different translations. 1:00:31.931 --> 1:00:49.343 Then there are different ways to train on them: we could weight them by their probability, or, the easiest option, just take the best one.
1:00:50.050 --> 1:01:01.135 What that boils down to is: we take our teacher network, generate some translations, and use these as additional training data. 1:01:01.781 --> 1:01:17.513 Then we have essentially done this at the sequence level, because the teacher tells us: these are probable translations of the whole sentence. 1:01:26.286 --> 1:01:36.206 And then you can also build a bit of an interpolated version of that. 1:01:36.716 --> 1:02:00.658 What people have also done is sequence-level interpolation: you generate several translations, but you don't use all of them; you use some metric to pick, for example, the one closest to the reference. 1:02:01.021 --> 1:02:16.520 The point is that the ground truth itself might be improbable or even unreachable for the model, since the teacher cannot generate everything, 1:02:16.676 --> 1:02:23.378 and instead we give the student an easier target which is still of good quality, and train on that. 1:02:23.703 --> 1:02:33.570 So you are not training it on a very difficult solution, but on an easier one. 1:02:36.356 --> 1:02:38.494 Any more questions on this? 1:02:43.843 --> 1:03:06.784 Good. The next idea is to look at the vocabulary: the problem is, as we have seen, that the vocabulary computations, that is, the big output softmax, are often very time-consuming. 1:03:09.789 --> 1:03:19.805 The thing is that most of the vocabulary is not needed for any individual sentence. 1:03:20.280 --> 1:03:30.967 The question is: can we cheaply precompute which words are likely to occur in the translation of a sentence, and then only compute those? 1:03:31.691 --> 1:03:43.932 And this can be done: for a given source sentence, most target words will essentially never occur in its translation. 1:03:44.164 --> 1:03:51.093 So what you can try is to limit the vocabulary that you consider for each sentence: 1:03:51.151 --> 1:04:04.693 you no longer take the full vocabulary as possible output, but a restricted candidate set. 1:04:06.426 --> 1:04:23.613 What typically works is this: we always keep the most frequent target words, because those are not so easy to handle through alignment, 1:04:23.964 --> 1:04:32.985 and in addition we take, for each source word, the target words that frequently align with it. 1:04:33.473 --> 1:04:51.700 So for each source word you compute the word alignment on your training data, and from that which target words co-occur with it. 1:04:52.352 --> 1:04:57.680 For decoding you then build the union of the candidate lists of all source words plus the frequent words. 1:04:59.960 --> 1:05:13.003 So for each source word you take, say, its most frequent translations, and on top the overall most frequent target words. 1:05:13.193 --> 1:05:26.232 In total, especially for short sentences, you have far fewer words; in most cases this is much smaller than the full vocabulary. 1:05:26.546 --> 1:05:33.957 And so you have dramatically reduced your vocabulary, and can thereby also speed up decoding.
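A small sketch of this per-sentence vocabulary selection; the frequent-word set and the alignment-derived lexicon are toy stand-ins for statistics you would extract from the training data.

```python
# Sketch: restricting the output vocabulary per source sentence.
# Candidates = globally frequent words + aligned translations of each source word.
FREQUENT = {"the", "a", "is", "and", "of"}          # always kept
ALIGNED = {                                         # from word-alignment counts
    "haus": {"house", "home", "building"},
    "klein": {"small", "little"},
    "buch": {"book"},
}

def candidate_vocab(source_sentence: str) -> set:
    vocab = set(FREQUENT)
    for word in source_sentence.split():
        vocab |= ALIGNED.get(word, set())           # union over the source words
    return vocab

vocab = candidate_vocab("das haus ist klein")
print(len(vocab), sorted(vocab))
# The output softmax is then computed only over these candidates
# instead of over the full vocabulary.
```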
1:05:35.495 --> 1:05:43.757 That sounds easy; does anybody see what is challenging here, and why it might not always help? 1:05:47.687 --> 1:06:01.838 Not the translation quality; why might this, if you implement it, not give a strong speed-up? 1:06:01.941 --> 1:06:14.135 You have to store these lists, you have to build the union, and of course that eats into the time you save. 1:06:14.554 --> 1:06:23.868 The second thing: the vocabulary is used in the last step, where we have the hidden state and then calculate the output probabilities. 1:06:24.284 --> 1:06:29.610 Now we are no longer calculating them for all output words, but only for a subset. 1:06:30.430 --> 1:06:35.613 However, this matrix multiplication is typically parallelized almost perfectly on the hardware, 1:06:35.956 --> 1:06:52.794 so if you compute only some of the outputs but don't implement it carefully, it will take as long as before, because of the nature of the parallel hardware. 1:06:56.776 --> 1:07:10.833 For beam search there are some ideas too: of course you can go back to greedy search, because that is more efficient, 1:07:11.651 --> 1:07:22.216 though beam search gives better quality; and you can buffer some states in between, and how much you buffer is again the trade-off between computation and memory. 1:07:25.125 --> 1:07:42.932 Then, at the end of today, what we want to look into is one last type of neural machine translation approach. 1:07:43.403 --> 1:07:57.246 The idea: we have already seen in our first steps that the autoregressive part is what makes decoding slow. 1:07:57.557 --> 1:08:04.461 The encoder can process everything in parallel, but in decoding we always take the most probable word and feed it back in. 1:08:05.905 --> 1:08:19.616 The question is: do we really need to do that? There is a bunch of work on doing it differently: can we generate the full target sentence at once? 1:08:20.160 --> 1:08:31.832 We'll see it's not that easy, and there is still an open debate on whether this is really faster at the same quality, but I think it's worth knowing. 1:08:32.712 --> 1:08:50.527 So, as said, what we have is our encoder-decoder, where we can process the encoder in parallel, but the output always depends on the previous outputs. 1:08:50.410 --> 1:08:56.565 We generate an output word and then have to feed it back in as the next y, because everything after it depends on it. 1:08:56.916 --> 1:09:16.739 This is what is referred to as an autoregressive model, and nearly all speech generation and language generation works in this autoregressive way. 1:09:18.318 --> 1:09:21.132 So the motivation is: can we do this more efficiently? 1:09:21.361 --> 1:09:41.302 Can we somehow process all target words in parallel, so that instead of doing it one by one, we input everything at once? 1:09:45.105 --> 1:09:50.587 So how does it work? Let's look at the basic non-autoregressive model. 1:09:50.810 --> 1:09:58.310 The encoder looks exactly as before; that's maybe not surprising, because there we already know we can parallelize. 1:09:58.618 --> 1:10:05.295 So we put in our encoder input and generate the encoder hidden states; that is exactly the same.
1:10:05.845 --> 1:10:26.799 Back to the non-autoregressive model: now we need one more thing. One challenge remains that we had before, and it's a general challenge of natural language generation like machine translation: how long should the output be? 1:10:32.672 --> 1:10:45.632 Normally we generate until we produce the end-of-sentence token; but if we now generate everything at once, that's no longer possible, because we only do a single generation step. 1:10:46.206 --> 1:10:58.321 So the question is: how can we determine in advance how long the output sequence is? 1:11:00.000 --> 1:11:06.384 Yes, that would be one idea, and there is other work which tries to do that. 1:11:06.806 --> 1:11:20.900 However, here some earlier work comes in handy: maybe you remember the IBM models, and there was this concept of fertility. 1:11:21.241 --> 1:11:27.104 Fertility means: into how many target words does one source word translate? 1:11:27.847 --> 1:11:36.134 And exactly that is what we try to do here: at the top we are calculating the fertility of each source word. 1:11:36.396 --> 1:11:54.171 So it says: this word is translated into one word, that word might be translated into two words, and so on; we predict into how many target words each source word turns. 1:11:55.935 --> 1:12:15.523 So this is in effect a length estimation done at the end of the encoder; you can also do it in other ways. 1:12:16.236 --> 1:12:35.224 Then you initialize your decoder input. We know that word embeddings work well as inputs, so people do the same thing here: they initialize the decoder input with the source word embeddings, each one copied as often as its fertility says. 1:12:35.315 --> 1:12:36.460 So we have these copies. 1:12:36.896 --> 1:12:47.816 For example, one word has fertility two, so it appears twice, and another one appears once; that is then our initialization. 1:12:48.208 --> 1:12:57.912 Alternatively, if you don't predict fertilities but predict the length directly, you can initialize the decoder input accordingly. 1:12:58.438 --> 1:13:16.432 That often works a bit better, but that's the alternative. Now you have everything available at once, in training and in testing. 1:13:16.656 --> 1:13:18.621 This is all available at once. 1:13:20.280 --> 1:13:33.139 Then we can generate everything in parallel: we have the decoder stack, and that is the same as before. 1:13:35.395 --> 1:13:41.555 And then we make the translation predictions on top of it. 1:13:43.083 --> 1:14:00.924 We predict all target words at once, and that is the basic idea. 1:14:01.241 --> 1:14:08.171 Non-autoregressive machine translation: the idea is that we don't have to generate one word at a time. 1:14:10.210 --> 1:14:20.358 So this looks really great at first view. But there is one challenge, and that is the baseline quality. 1:14:20.358 --> 1:14:27.571 Of course there have been improvements, but in general the quality is often significantly worse. 1:14:28.068 --> 1:14:38.466 Here you see the baseline models: you have a loss of ten BLEU points or something like that. 1:14:38.878 --> 1:14:41.640 So why does that happen?
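Before we look at the errors, a side sketch of the fertility-based decoder initialization described above; the fertility predictor is assumed to exist, and all names are hypothetical:

```python
import torch

def decoder_inputs_from_fertility(src_embeds, fertilities):
    """src_embeds: (src_len, dim) source word embeddings.
    fertilities: (src_len,) predicted copies per source word.
    Returns the decoder input: each source embedding repeated
    according to its fertility; the total gives the target length."""
    return torch.repeat_interleave(src_embeds, fertilities, dim=0)

# Example: fertilities [1, 2, 0, 1] -> target length 4, where the
# second source word is copied twice and the third is dropped.
embeds = torch.randn(4, 8)
inputs = decoder_inputs_from_fertility(embeds, torch.tensor([1, 2, 0, 1]))
print(inputs.shape)  # torch.Size([4, 8])
```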
1:14:43.903 --> 1:14:56.250 If you look at the errors, there are repetitive tokens: the same word generated twice in a row, and things like that. 1:14:56.536 --> 1:15:04.851 Broken or disfluent sentences; so exactly where autoregressive models are very good. With those we even said it's a bit of a problem that they generate very fluent output: 1:15:07.387 --> 1:15:10.898 sometimes the translation doesn't have anything to do with the input. 1:15:11.411 --> 1:15:14.047 But generally it always reads very fluently. 1:15:14.995 --> 1:15:20.865 Here it's exactly the opposite: the problem is that we don't get really fluent translations. 1:15:21.421 --> 1:15:26.123 And that is mainly due to the independence assumption. 1:15:26.646 --> 1:15:43.740 In this model, the probability of the word at the second position is computed independently of what was generated at the first position; we don't know what was generated there, we're just generating in parallel. 1:15:43.964 --> 1:16:03.636 You can see that in examples; for instance, the loss can over-penalize outputs that are merely shifted by one position. 1:16:04.024 --> 1:16:10.566 There are improvements for this, but the core issue stays similar, and it shows up like this: 1:16:11.071 --> 1:16:34.594 a sentence can often be translated in two ways, say "I am feeling down" or "I feel down"; but if the first position goes for one variant and the second position for the other, you end up with an inconsistent mixture of both. 1:16:35.075 --> 1:16:42.908 So each position, and that is one of the main issues here, doesn't know what the other positions generate. 1:16:43.243 --> 1:16:58.471 And for example, if you are translating into German, you can often translate things in two ways, with different grammatical agreement. 1:16:58.999 --> 1:17:02.047 And then you have to decide which of the two forms to use. 1:17:02.162 --> 1:17:05.460 The decoder doesn't know which word it has to select. 1:17:06.086 --> 1:17:14.789 I mean, of course it knows the hidden state, but in the end you have a probability distribution. 1:17:16.256 --> 1:17:32.832 And that is the important difference to the autoregressive model: there you know what was generated, because you have put it in; here you don't. If two words are equally probable, a position doesn't know which of them was selected elsewhere, and of course the right word here depends on that selection. 1:17:33.333 --> 1:17:39.986 Yep, that's this shift problem, and we'll come back to it in a second. Yes? 1:17:40.840 --> 1:17:44.935 Student: Doesn't this also appear in training, when we do this parallel training? 1:17:46.586 --> 1:17:50.183 The thing is, in the autoregressive model you give it the correct previous word during training. 1:17:50.450 --> 1:17:59.573 So if the reference here is "feeling", then you tell the model: the last word was "feeling", and then it knows what has to come next. 1:17:59.573 --> 1:18:04.044 But here it doesn't know that, because it doesn't get the previous word as input. 1:18:04.204 --> 1:18:24.286 Yes, that depends a bit on the setup. 1:18:24.204 --> 1:18:27.973 But in training, of course, you just try to make the correct word the most probable one. 1:18:31.751 --> 1:18:38.181 So what you can do is use things like the CTC loss, which can adjust for this; a small usage sketch follows below.
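As an aside, this is how a CTC loss is typically invoked, here with PyTorch's `nn.CTCLoss`; the shapes follow that API, and all sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# CTC scores an output sequence against a shorter reference, allowing
# blanks and repeats, so an output shifted by one is not fully penalized.
ctc = nn.CTCLoss(blank=0)

T, N, C = 12, 2, 50                      # output length, batch, vocab (+blank)
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(-1)       # (T, N, C) log-probabilities
targets = torch.randint(1, C, (N, 8))    # reference ids, no blank symbol
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```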
1:18:38.181 --> 1:18:42.866 With that you can also have this shifted correction. 1:18:42.866 --> 1:18:50.582 If you do this type of correction with the CTC loss, you don't get the full penalty. 1:18:50.930 --> 1:18:58.486 If the output is just shifted by one, it is scored differently; it's a loss mainly used in speech recognition, 1:19:00.040 --> 1:19:03.412 but it can be used to address this problem. 1:19:04.504 --> 1:19:20.515 The other problem is this multimodality issue we saw in the example before: if you translate "thank you" into German, it can be "Danke schön" or "Vielen Dank". 1:19:20.460 --> 1:19:31.925 And then the model might end up mixing them, because the first position learns one variant and the second position the other. 1:19:32.492 --> 1:19:47.002 In order to prevent that, it would be helpful if for one input there were only one output; that makes it easier for the system to learn. 1:19:47.227 --> 1:19:53.867 It's fine if slightly different inputs have different outputs, but for the same input there should be one output. 1:19:54.714 --> 1:19:57.467 That we can luckily solve very easily. 1:19:59.119 --> 1:20:04.116 And it's done with the technique we just learned about, which is called knowledge distillation. 1:20:04.985 --> 1:20:22.958 So the easiest way to improve your non-autoregressive model is: train an autoregressive model, decode your whole training corpus with it, and then train the non-autoregressive model on these outputs. 1:20:23.603 --> 1:20:27.078 The main advantage of that is that this data is more consistent. 1:20:27.407 --> 1:20:41.901 For the same input you now always have the same output; you make your training data more consistent, and it becomes easier to learn. 1:20:42.482 --> 1:20:59.156 So that is another advantage of knowledge distillation: you get more consistent training signals. 1:21:04.884 --> 1:21:16.462 There's another trick to make things easier at the beginning: there's this glancing model, where you feed in parts of the correct output. 1:21:16.756 --> 1:21:26.080 So during training, especially at the beginning, you give some correct target words as input. 1:21:28.468 --> 1:21:38.407 And there is this k-tokens-at-a-time idea, which starts from autoregressive-style training. 1:21:40.000 --> 1:21:59.174 Some target words are given and some are open: with k equal to one you always have one known input and predict one output, like autoregressive training, and then you move to predicting more and more tokens in parallel. 1:21:59.699 --> 1:22:05.825 So in that way the model can slowly learn what is a good and what is a bad output. 1:22:08.528 --> 1:22:15.323 It doesn't sound very efficient, admittedly: you go over your training data several times. 1:22:15.875 --> 1:22:29.318 You can even switch between the modes. There is a homework on this topic where you can try it out. 1:22:31.271 --> 1:22:46.598 There's a whole line of work on that; this is often done, and although training takes longer, it still helps. 1:22:49.389 --> 1:23:04.958 For later reference, here are some examples of how much these things help; and maybe one point here is really important. 1:23:05.365 --> 1:23:13.787 Here you see the translation performance and the speed.
1:23:13.787 --> 1:23:24.407 One point worth noting is what you compare against. 1:23:24.784 --> 1:23:40.522 A very weak baseline, a plain transformer even with beam search, is itself around ten times slower than a very strong autoregressive system. 1:23:40.961 --> 1:23:53.454 If you compare against a strong baseline, the reported speed-up shrinks accordingly; that is why you see so many different speed-up numbers. 1:23:53.454 --> 1:24:03.261 In general: use a strong baseline, not a very simple transformer. 1:24:07.407 --> 1:24:25.950 With this, one last thing you can do to speed things up and also reduce your memory is what is called half precision: computing with 16-bit instead of 32-bit floats. 1:24:26.326 --> 1:24:31.148 That helps especially for decoding; for training it sometimes gets less stable. 1:24:32.592 --> 1:24:46.963 And with this we are nearly done. What you should remember is how efficient machine translation can be achieved. 1:24:47.007 --> 1:24:57.665 We have, for example, looked at knowledge distillation; we have looked at non-autoregressive models; we have seen different architectures and hyperparameters. 1:24:58.898 --> 1:25:08.430 That's it for today; then only one request: if you haven't done so, please fill out the evaluation. 1:25:08.388 --> 1:25:20.127 If you have done so already, thank you; and the online people can hopefully do it as well. 1:25:20.320 --> 1:25:30.937 It is the best way to tell us what is good and what is not; not the only one, but the most efficient. 1:25:31.851 --> 1:25:35.871 So thanks to all the students doing it, and with that, thank you. 0:00:01.921 --> 0:00:16.424 Hey, welcome to today's lecture. What we want to look at today is how we can make neural machine translation more efficient. 0:00:16.796 --> 0:00:29.714 So until now we had this global view of the system, the encoder and the decoder mostly, and we haven't really thought about how long things take. 0:00:30.170 --> 0:00:47.084 And what we know, for example, is that you can make the systems bigger in different ways: we can make them deeper or wider. 0:00:47.407 --> 0:00:56.331 And if we have enough data, that typically makes performance better. 0:00:56.576 --> 0:01:06.587 But of course it leads to the problem that we need more resources. That is an issue at universities, where we typically have limited computation capacity. 0:01:06.587 --> 0:01:11.757 So at some point you have such big models that you cannot train them anymore. 0:01:13.033 --> 0:01:26.984 And for companies it is of course also important what it costs to generate a translation, just in terms of power consumption. 0:01:27.667 --> 0:01:35.386 So yeah, there are different reasons why you want to do efficient machine translation. 0:01:36.436 --> 0:01:50.527 To put it in context: there are different ways of improving your machine translation system. 0:01:50.670 --> 0:01:55.694 There can be different types of data; we looked into data crawling and monolingual data. 0:01:55.875 --> 0:01:59.024 All this data, and the aim is always the same:
0:01:59.099 --> 0:02:17.550 of course, we are not just interested in having more data for its own sake; the idea is that more data means better quality, because mostly we are interested in increasing the quality of the machine translation. 0:02:18.838 --> 0:02:24.892 But there are also other ways to improve the quality of a machine translation system. 0:02:25.325 --> 0:02:44.467 And that is where most research is focusing: building better algorithms. 0:02:44.684 --> 0:03:00.315 The other routes are often just as good; sometimes it's easier to just collect more data than to invent some great new algorithm. But yeah, both are important. 0:03:00.920 --> 0:03:11.590 But there is this third thing, especially with neural machine translation: we simply make a bigger model. 0:03:11.751 --> 0:03:24.532 That can mean, as said, more layers or wider layers. The other option we talked a bit about is ensembles: we are not building one machine translation system, 0:03:24.965 --> 0:03:33.177 we can easily build four. What is the typical strategy to build different systems? Remember? 0:03:35.795 --> 0:03:48.979 They should of course be a bit different: if they all predict the same, then combining them doesn't help. So what is the easiest way to build four different systems? 0:03:51.711 --> 0:04:01.747 Student: You could take the best output of each single system. 0:04:02.362 --> 0:04:16.682 I mean, the point is really to have four different systems, so that you can later combine them, maybe by averaging; ensembles typically average the output probabilities. 0:04:19.439 --> 0:04:36.525 The idea is: if you think about neural networks, there's one thing you can trivially adjust, and that's the easiest way to get different systems: the random initialization. 0:04:37.017 --> 0:04:46.556 They have the same architecture, all the hyperparameters are the same, but they are initialized differently, so they will make different predictions. 0:04:48.228 --> 0:05:08.268 So bigger models and ensembles are in some ways the easiest route to better quality, because you don't really have to develop anything new. 0:05:08.588 --> 0:05:24.877 There are limits: bigger models only get better if you have enough training data; adding ever more layers will not work on very small data. But with a reasonable amount of data it is the easiest thing. 0:05:25.305 --> 0:05:34.970 However, there is a challenge with making models bigger, and that is the computation. 0:05:35.175 --> 0:05:49.518 So, of course, a bigger model can mean longer running times: roughly, with twice the layers you need twice the time.
0:05:51.171 --> 0:06:02.442 Normally you cannot parallelize across the different layers, because the input to one layer is always the output of the previous layer; that propagates through, so it increases your runtime. 0:06:02.822 --> 0:06:20.927 Then you have to store the whole model in memory; if you have double the weights, you will need double the memory. 0:06:20.927 --> 0:06:31.865 It is also more costly to do backpropagation: you have to store the intermediate activations, so you not only grow the model in memory, but also all these other variables that come with training. 0:06:34.414 --> 0:06:36.734 And so in general it is more expensive. 0:06:37.137 --> 0:06:54.208 And therefore there are good reasons to look into whether we can make these models more efficient. 0:06:54.134 --> 0:07:08.437 One way to frame it: you have a fixed budget, say one GPU and one day of training time, or forty thousand euros, and then ask: what is the best machine translation system I can get within this budget? 0:07:08.969 --> 0:07:24.251 And then, of course, you can make the models bigger, but you have to train them for a shorter time; or you can make the algorithms more efficient. 0:07:25.925 --> 0:07:31.699 If you think about efficiency, there are a few different scenarios. 0:07:32.312 --> 0:07:47.913 If you're coming from the research community, what you'll be doing is building a lot of models in your research. 0:07:48.088 --> 0:07:58.645 So you have your test set, calculate the BLEU score, then train another model, and so on. 0:07:58.818 --> 0:08:14.944 What that means is: you typically train on millions of sentences, so your training time is long, maybe a day, in other cases a week. 0:08:15.135 --> 0:08:22.860 Testing is not really the costly part; the training is very costly. 0:08:23.443 --> 0:08:37.830 If you are building models for an application, the scenario is quite different: you train the model once. 0:08:38.038 --> 0:08:47.720 And then you keep it running, and maybe thousands of customers are using it for translation. So in that case decoding dominates the cost. 0:08:48.168 --> 0:09:07.096 And we will see that these are not the same types of challenge: some things can be parallelized in training which cannot be parallelized in testing. 0:09:07.347 --> 0:09:14.124 For example, in training you have to do backpropagation, so you have to store the activations. 0:09:14.394 --> 0:09:24.994 On the other hand, in testing there are constraints we briefly discussed before and will look at in more detail today. 0:09:25.265 --> 0:09:36.100 In training you know the target, and you can process everything in parallel, while in testing you don't. 0:09:36.356 --> 0:09:50.530 There you can only generate one word at a time, so you can parallelize less; therefore it's important to keep the two apart. 0:09:52.712 --> 0:10:03.157 There is a dedicated shared task on this, the efficiency task, where it's about making things as efficient as possible. 0:10:03.123 --> 0:10:14.207 It can consider different resources: how much GPU runtime do you need?
0:10:14.454 --> 0:10:20.294 Or how much memory do you need; or you have a fixed memory budget and then have to build the best system within it. 0:10:20.500 --> 0:10:30.989 And here is an example of that: there were three teams, from Edinburgh among others, and they submitted several systems. 0:10:31.131 --> 0:10:36.515 So if you want to identify the most efficient system, you have to do a bit of a trade-off analysis. 0:10:36.776 --> 0:10:46.720 Do you want better quality or lower runtime? There's not the one solution; you can trade one for the other. 0:10:46.946 --> 0:10:49.662 And you see that there are quite different systems. 0:10:49.909 --> 0:11:07.824 Here is how many words per second a system translates, and you want to be as high up as possible. 0:11:08.068 --> 0:11:09.984 And you see here, this varies a lot between systems. 0:11:11.051 --> 0:11:29.014 You want to be in the top right corner: a high score and many words per second. 0:11:30.250 --> 0:11:34.161 At two hundred and fifty thousand words per second, for example, the score is around zero point three. 0:11:34.834 --> 0:11:53.922 There is a decision to make, but the question is how far out you can get: all the points on this frontier line are winners in a sense, because for each of them there is no system that achieves the same quality with less computation. 0:11:57.657 --> 0:12:11.668 So there's the question of which resources you are interested in: are you running on CPU or GPU? There are different ways of parallelizing things on each. 0:12:14.654 --> 0:12:27.154 Another dimension is how you process your data: there's batch processing and streaming. 0:12:27.647 --> 0:12:39.981 In batch processing you have the whole document available, so you can translate all sentences in parallel, and you're mainly interested in throughput. 0:12:40.000 --> 0:12:57.964 Especially on GPUs that's interesting: you're not translating one sentence at a time, but a hundred sentences or so in parallel; you have one more dimension along which to parallelize, and so you are more efficient. 0:12:58.558 --> 0:13:16.544 You can, for example, also sort the sentences of the document, because we learned that for batch processing all sentences in a batch are padded to the same length. 0:13:16.636 --> 0:13:25.535 Then it makes sense to sort the sentences by length, in order to have the minimum amount of padding attached. 0:13:27.427 --> 0:13:32.150 The other scenario is the streaming scenario, where you do live translation. 0:13:32.512 --> 0:13:40.212 In that case you can't wait for the whole document; you have to translate as the input comes in. 0:13:40.520 --> 0:14:00.361 And then, especially in situations like speech translation, you're interested in things like latency: how long do you have to wait until you get the output for a sentence?
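A minimal sketch of this length-sorting idea for batch processing (hypothetical, not the lecture's code); sentences are token lists, and sorting keeps similar lengths together so each batch carries little padding:

```python
def make_batches(sentences, batch_size=64, pad="<pad>"):
    """Sort by length, cut into batches, pad within each batch.
    Keeping the original indices lets us restore document order."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(sentences[i]) for i in idx)
        batch = [sentences[i] + [pad] * (width - len(sentences[i]))
                 for i in idx]
        batches.append((idx, batch))
    return batches
```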
0:14:06.566 --> 0:14:29.227 Finally, there is the matter of implementation. Today we're mainly looking at different algorithms and models for your machine translation system, but for the same algorithm there are also different implementations. 0:14:29.489 --> 0:14:38.643 So, for example, some machine translation toolkits are very fast, 0:14:38.638 --> 0:14:49.973 because they have coded a lot of the operations at a very low level, directly in the CUDA kernels. 0:14:50.110 --> 0:15:02.474 So the same attention network is typically more efficient in that type of implementation than in any other. 0:15:03.323 --> 0:15:15.106 Of course, there can be disadvantages; you may have seen this if you have worked on the practical exercises. 0:15:15.255 --> 0:15:23.323 A higher-level implementation is normally easier to understand, easier to change, and so on; so there is again a trade-off. 0:15:23.483 --> 0:15:39.145 You have to think about whether to include this in a study or comparison: should I compare different implementations and also look for the most efficient implementation, or is it only about the pure algorithm? 0:15:42.742 --> 0:15:50.355 Yeah, when building these systems there are different trade-offs to make. 0:15:50.850 --> 0:15:57.299 One is the trade-off between memory and throughput, that is, how many words you can generate per second. 0:15:57.557 --> 0:16:03.351 Typically you can easily increase your throughput by increasing the batch size. 0:16:03.643 --> 0:16:06.899 So that means you are translating more sentences in parallel. 0:16:07.107 --> 0:16:09.241 And GPUs are very good at that stuff. 0:16:09.349 --> 0:16:15.161 Translating one sentence or a hundred sentences does not take the same time, but almost. 0:16:15.115 --> 0:16:31.995 The runtimes are very similar, because of the efficient matrix multiplications: you do the same operation on all sentences in parallel. So typically, if you increase your batch size, you do more things in parallel and you translate more words per second. 0:16:33.653 --> 0:16:44.755 On the other hand, with this advantage, you of course need bigger batches and more memory. 0:16:44.965 --> 0:16:59.141 And conversely, your model can be so big that you can only translate with small batch sizes. 0:16:59.119 --> 0:17:08.466 If you are running out of memory while translating, one way out is to decrease your batch size. 0:17:13.453 --> 0:17:31.902 Then there is the trade-off between quality and throughput: as discussed before, larger models generally give higher quality but are slower. 0:17:32.092 --> 0:17:38.709 Of course, a larger model does not always help; at some point you get overfitting; but in general it does. 0:17:43.883 --> 0:17:58.455 And with this, back to the training-versus-testing point from before: there is a difference between training and testing, and between the encoder and the decoder.
0:17:58.798 --> 0:18:17.183 If we look at training time, as mentioned before, we have the full source sentence; and what you see here is the attention in a typical transformer. 0:18:22.162 --> 0:18:40.422 How can we process this efficiently? The first thing to note is that the whole source sentence is available at once. 0:18:40.422 --> 0:18:49.184 That is, of course, not the case in all settings; we'll later talk about speech translation, where we might want to translate before the input is complete. 0:18:49.389 --> 0:18:56.172 But in the general case, you have the full sentence you want to translate. 0:18:56.416 --> 0:19:02.053 So the important thing is: on the source side, everything is available. 0:19:03.323 --> 0:19:15.752 And then, this was one of the big advantages of the transformer, if you remember; there are several. 0:19:16.156 --> 0:19:25.229 But the relevant one here is that we can calculate a full layer in parallel. 0:19:25.645 --> 0:19:29.318 There is no dependency between this state and that state within a layer. 0:19:29.749 --> 0:19:37.536 For each position we calculate the key, value and query, and based on those we calculate the attention. 0:19:37.937 --> 0:19:46.616 Which means we can do all of these calculations in parallel across positions. 0:19:48.028 --> 0:20:00.887 And that is where the efficiency comes from, because for GPUs it makes a big difference whether things run in parallel or one after another. 0:20:01.421 --> 0:20:10.311 And then we go layer by layer, one after the other, and compute the encoder. 0:20:10.790 --> 0:20:28.365 In training, an important point is that for the decoder we also have the full target sentence available, because we know what the model should generate. 0:20:29.649 --> 0:20:38.297 We have modelled it in a particular way: each hidden state depends only on the previous positions. 0:20:38.598 --> 0:20:56.665 The first state here depends only on the first input, and so on; if you remember, we had this masked self-attention. 0:20:56.896 --> 0:21:04.117 That means, of course, we can only calculate the decoder once the encoder is done, but that's fine. 0:21:04.444 --> 0:21:08.925 First we calculate the encoder, then we calculate the decoder. 0:21:09.569 --> 0:21:27.929 But again, in training both x and y are available, so we can calculate everything in parallel. 0:21:28.368 --> 0:21:46.408 So the nice property of the transformer in training is that we can parallelize over positions even for the decoder. 0:21:46.866 --> 0:22:03.270 You still have some sequential computation, because you can only calculate one layer at a time; but the sentence length, which is typically what is long, doesn't matter that much. 0:22:05.665 --> 0:22:13.276 However, in testing the situation is different: in testing we only have the source. 0:22:13.713 --> 0:22:29.063 So we start with the source sentence; we don't know the full target sentence yet, because we generate it autoregressively. For the encoder everything stays the same, but for the decoder it changes.
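A quick illustration of the masked self-attention just mentioned; this is one common way to build such a causal mask (a sketch, not the lecture's code). Position t may attend only to positions up to t, which is exactly what allows all target positions to be trained in parallel:

```python
import torch

def causal_mask(length):
    """Upper-triangular mask: True marks future positions to hide.
    Row t then allows attention to columns 0..t only."""
    return torch.triu(torch.ones(length, length), diagonal=1).bool()

scores = torch.randn(5, 5)                            # raw attention scores
scores = scores.masked_fill(causal_mask(5), float("-inf"))
weights = scores.softmax(-1)                          # future weights are zero
```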
0:22:29.409 --> 0:22:40.756 During decoding we first have only the first hidden state, then the second, and so on; we cannot compute all states in parallel. 0:22:41.101 --> 0:22:58.643 And we can only do the next step for y after feeding in the most probable word from the current step; we do greedy search or beam search, but we cannot parallelize over time. 0:23:03.663 --> 0:23:22.363 So if we are interested in making things more efficient for testing, which matters for example in a real application scenario, 0:23:22.642 --> 0:23:35.933 it makes sense to reconsider our architecture; currently we are working with attention-based models everywhere. 0:23:36.096 --> 0:23:47.142 The decoder is where most of the time is spent in testing; in training it looks similar, but during training we can parallelize it. 0:23:47.167 --> 0:23:59.833 And we haven't even talked about beam search, which can be even more costly, because there you have to follow several different hypotheses. 0:24:02.762 --> 0:24:21.905 So the question is: what can you do to make your model more efficient while staying good at translation in these settings? 0:24:24.604 --> 0:24:30.178 One thing is to look at the encoder-decoder trade-off. 0:24:30.690 --> 0:24:48.154 Until now we typically assumed that the depth of the encoder and the depth of the decoder are roughly the same. 0:24:48.268 --> 0:24:57.678 If you haven't thought about it, you just take the configuration that is known to run well. 0:24:58.018 --> 0:25:04.914 However, we just saw that there is a big asymmetry: at test time the decoder's runtime is a lot longer than the encoder's. 0:25:05.425 --> 0:25:25.415 So the question is: do we really need the same depth on both sides for quality? We know that making these models deeper increases quality. 0:25:25.425 --> 0:25:32.285 But we haven't asked whether we must increase the depth the same way on both sides. 0:25:32.552 --> 0:25:42.923 So what we can do instead is something like this, where you have a deep encoder and a shallow decoder. 0:25:43.163 --> 0:25:59.757 That would mean, for example, that instead of the same number of layers on both sides, you put more layers into the encoder and only a few into the decoder. 0:26:00.080 --> 0:26:10.469 In this case the overall depth from start to end is similar, and so hopefully the quality is too. 0:26:11.471 --> 0:26:29.330 But a lot more can now be parallelized; what is costly at the end, during decoding, is the decoder, because it runs autoregressively, and that part we made shallow. 0:26:31.411 --> 0:26:38.734 And this can be analyzed; here are some examples where people have done exactly that. 0:26:39.019 --> 0:26:57.607 Here we're mainly interested in the orange rows, which are the autoregressive ones, and in the speed-up. 0:26:57.717 --> 0:27:15.031 The systems are not exactly the same, but they are comparable. 0:27:15.055 --> 0:27:31.644 Speed is always relative: they set the baseline speed to one. 0:27:31.771 --> 0:27:35.348 And then you get several times that speed
0:27:35.348 --> 0:27:42.621 if you switch from a balanced system to one where most layers sit in the encoder. 0:27:42.782 --> 0:28:00.283 You see that although you have slightly more parameters, the amount of computation is roughly the same; but you gain speed, because during testing you can parallelize much more of it. 0:28:02.182 --> 0:28:13.500 So you speed up; and if you look at the quality, it's similar: sometimes you improve, sometimes you lose a bit. 0:28:13.500 --> 0:28:20.421 There's a bit of a loss on English to Romanian, but in general the quality is very close. 0:28:20.680 --> 0:28:30.343 So you can keep a similar performance while improving your speed, just by distributing the layers differently. 0:28:30.470 --> 0:28:38.690 And you also see that the number of encoder layers doesn't matter much for speed. 0:28:38.979 --> 0:28:57.309 Because if you compare the deeper-encoder system to the shallower one, you get lower quality with fewer encoder layers, but the speed is similar. 0:28:57.897 --> 0:29:02.233 Student: And the big drop there, is it maybe due to a lack of data? 0:29:03.743 --> 0:29:23.191 Good idea, but I would say that's not the case: Romanian-English should have the same amount of data in both directions. 0:29:24.224 --> 0:29:40.702 Maybe it's just something about that language: when generating Romanian, you may need more target-side dependencies. 0:29:42.882 --> 0:29:46.263 Why exactly it happens there, I also don't know. Any ideas? 0:29:47.887 --> 0:29:49.034 Student: There could be, yeah... 0:29:49.889 --> 0:30:12.492 Student: Maybe it's the vocabulary; it might be much easier to cover the vocabulary on the English side. 0:30:13.333 --> 0:30:22.391 I'd have to check, but I would assume that in this case the systems are not pre-trained, but trained from scratch. 0:30:22.902 --> 0:30:35.595 That's why I was assuming the data is the same; but you're right, if for example the English decoder had been pre-trained, that could explain it. 0:30:36.096 --> 0:30:43.733 I don't remember exactly whether they did something like that, but it could be a good explanation. 0:30:45.325 --> 0:31:01.443 So this is one of the easiest ways to speed up: you just change hyperparameters; you don't have to implement anything. 0:31:02.722 --> 0:31:16.521 Of course, there are other options; we'll look into two of them today. The next lever is the architecture itself. 0:31:16.796 --> 0:31:28.154 Up to now we used the standard transformer everywhere as our baseline. 0:31:28.488 --> 0:31:41.845 However, for translation, self-attention on the decoder side might not be the best solution; there is no law that we have to use it. 0:31:42.222 --> 0:31:47.130 So we can use different types of architectures in the encoder and the decoder. 0:31:47.747 --> 0:31:52.475 There are at least two ways of doing it differently; we will look into two today. 0:31:52.912 --> 0:31:58.842 The first is average attention, which is a very simple solution. 0:31:59.419 --> 0:32:08.757 It does what the name says: it's not really attending anymore; it just attends equally to everything.
0:32:09.249 --> 0:32:24.913 And the other idea, which is currently used in most systems that are optimized for efficiency, is that we keep the transformer encoder, 0:32:25.065 --> 0:32:39.700 but on the decoder side we don't use self-attention; we use a recurrent neural network instead, despite the disadvantages recurrent networks otherwise have. 0:32:39.799 --> 0:32:49.684 The recurrence is normally cheaper to compute, because it depends only on the input and the previous state. 0:32:51.931 --> 0:33:03.841 So what is the difference during decoding, and why might self-attention not be ideal for decoding? 0:33:04.204 --> 0:33:15.649 In an RNN, to compute the new state, we only have to look at the input and the previous state. 0:33:16.136 --> 0:33:31.291 In convolutional networks we have a dependency on a fixed number of previous states; but those are rarely used for decoding. 0:33:31.291 --> 0:33:39.774 In contrast, in the transformer we have this unbounded dependency: 0:33:40.000 --> 0:33:56.053 y_t depends on y_1 up to y_{t-1}. That is very good for modelling, because each step can look at everything, 0:33:56.276 --> 0:34:10.895 but the disadvantage is that we have to do all these calculations; purely from the point of view of efficient computation, it might not be the best choice. 0:34:11.471 --> 0:34:21.994 So the question is: can we change the architecture to keep some of the advantages but make things more efficient? 0:34:24.284 --> 0:34:32.610 The first idea is what is called average attention, and the interesting thing is that this works surprisingly well. 0:34:33.013 --> 0:34:46.790 The only change is in the decoder: you're not computing attention weights anymore; the attention weights are all equal. 0:34:47.027 --> 0:35:03.058 So you don't calculate individual weights from query and key; you just give every previous position the same weight. 0:35:03.283 --> 0:35:07.585 So here it would be one third from this state, one third from this one, and one third from that one. 0:35:09.009 --> 0:35:14.719 And because the weights are fixed, you can precompute things, and decoding gets more efficient. 0:35:15.195 --> 0:35:18.803 Let's first look at the formula; it's maybe not obvious at first. 0:35:18.979 --> 0:35:38.712 The difference here is that your new state is simply the average of all hidden states up to the current position. 0:35:38.678 --> 0:35:45.022 So here it would be one third of this plus one third of this plus one third of that. 0:35:46.566 --> 0:36:01.844 But if you calculate it in this naive way, it's not yet more efficient, because at every step you still sum over all previous hidden states. 0:36:04.524 --> 0:36:24.568 However, you can easily speed this up by keeping an intermediate value, a running sum, which you update at every step. 0:36:25.585 --> 0:36:30.057 The running value at step t is just the running value at step t minus one plus the new hidden state. 0:36:30.350 --> 0:36:36.739 Because the previous running value already contains everything that came before. 0:36:37.377 --> 0:36:50.111 This running value is not yet the final one; to get the final state you divide by t, which does the averaging.
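A minimal sketch of this running-average trick (hypothetical code, PyTorch-style): each decoding step then costs a fixed amount of work instead of a sum over the whole history.

```python
import torch

class RunningAverage:
    """Average-attention style state: constant work per decoding step."""
    def __init__(self, dim):
        self.total = torch.zeros(dim)    # running sum of hidden states
        self.t = 0

    def step(self, hidden):
        self.t += 1
        self.total = self.total + hidden  # one add per step
        return self.total / self.t        # average over all states so far

avg = RunningAverage(4)
for h in torch.randn(3, 4):   # three decoding steps
    g = avg.step(h)           # same result as averaging the full prefix
```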
0:36:50.430 --> 0:37:00.264 If you do the calculation this way, each step needs only a fixed number of operations, 0:37:00.180 --> 0:37:12.535 instead of a sum whose cost grows with the length of the history. 0:37:12.732 --> 0:37:32.687 Student: But what about the lengths? For example, this value here already contains several states. 0:37:32.993 --> 0:37:44.531 That's a very good point, and that's why it is marked differently in the figure: this is the version with the tilde, and the tilde version is not yet averaged. 0:37:44.884 --> 0:37:57.895 So this one is just the sum of these two, because it is the previous running sum plus the new state. 0:37:58.238 --> 0:38:15.131 The sum here equals the sum over all of these; so you only accumulate the sum, and the multiplication by one over t happens at the end. 0:38:15.255 --> 0:38:31.531 Put more mathematically: you can pull the one-over-t factor out of the sum, and then compute the sum incrementally. 0:38:36.256 --> 0:38:47.882 That may look a bit weird and simple: we were all talking about this great attention mechanism that can focus on different parts, and the surprising finding of this work is 0:38:47.882 --> 0:38:53.321 that in the end it can also work well without that, just with equal weights. 0:38:53.954 --> 0:38:56.164 I mean, it's not that easy. 0:38:56.376 --> 0:39:00.451 Sometimes this works well; there are also reports where it doesn't work that well. 0:39:01.481 --> 0:39:05.848 But I think it's an interesting result, and it maybe shows that a lot of 0:39:05.805 --> 0:39:20.026 things in the transformer paper that are presented as given are really hyperparameter choices: that you do the layer norm in between, that you have a feed-forward block, and things like that. These are all important, and the right setup around an idea matters a lot. 0:39:28.969 --> 0:39:42.521 The other thing you can do is, in the end, not completely different from this; it just starts from a very different direction. 0:39:42.942 --> 0:40:01.330 And that is a recurrent network which also has this type of highway connection, so it can bypass the recurrent unit and pass the input through directly. 0:40:01.561 --> 0:40:15.480 So the hidden computation takes your input, but there is also a direct path from input to output. 0:40:17.077 --> 0:40:33.418 These are the four components of the simple recurrent unit; it is motivated by GRUs and LSTMs, which we have seen before. 0:40:33.513 --> 0:40:43.633 Gating has proven to be very good for RNNs: it lets you control what flows into your states. 0:40:44.164 --> 0:40:48.186 In this unit we have two gates: the reset gate and the forget gate. 0:40:48.768 --> 0:41:01.277 First the general structure: we have a cell state; here you see the cell state. 0:41:01.361 --> 0:41:09.661 And this continues through time, so we get a cell state for every step. 0:41:10.030 --> 0:41:11.448 This is the cell state.
0:41:11.771 --> 0:41:16.518 How do we calculate it? Just assume we have an initial cell state here. 0:41:17.017 --> 0:41:19.670 The first thing we compute is the forget gate. 0:41:20.060 --> 0:41:41.356 The forget gate models whether the new cell state should mainly keep the previous cell state, or take the new input instead, or add the two together. 0:41:41.621 --> 0:41:42.877 How can we model that? 0:41:44.024 --> 0:41:45.599 First we calculate the gate value. 0:41:45.945 --> 0:41:56.480 The forget gate depends on the previous cell state and on the input; you also see the formula here. 0:41:57.057 --> 0:42:04.890 We multiply both the cell state and our input with weight matrices, 0:42:05.105 --> 0:42:08.472 add a bias vector, and then apply a sigmoid to that. 0:42:08.868 --> 0:42:13.452 So in the end we have, for each dimension, a number between zero and one. 0:42:13.853 --> 0:42:31.890 If it's near zero, we will mainly use the new input; if it's near one, we will keep the old cell state and ignore the input at this dimension. 0:42:33.313 --> 0:42:41.141 With this we can then create the new cell state; here you see the formula. 0:42:41.601 --> 0:43:00.427 You take the forget gate and multiply it element-wise with the previous cell state; so where the gate is close to one, the old state is kept. 0:43:00.800 --> 0:43:10.946 Where the value is close to zero, you instead add the other part: a transformation of the input, weighted by one minus the gate. 0:43:11.351 --> 0:43:24.284 So if the gate value is near zero at some dimension, most of the information there comes from the input. 0:43:25.065 --> 0:43:32.067 Now you have your cell state; the only remaining question is: based on the cell state, what is the output? 0:43:33.253 --> 0:43:50.957 And there you have another choice: you can either take the cell state as output, or instead prefer the raw input. 0:43:52.612 --> 0:43:59.417 Student: Are the values the same for the reset gate and the forget gate? 0:44:00.900 --> 0:44:16.323 No, the matrices are different, and therefore the values can differ; and that's intentional, because sometimes you want to route the information differently in the two places. 0:44:16.636 --> 0:44:25.205 So here again we have a vector of values between zero and one which controls how the information flows. 0:44:25.505 --> 0:44:36.459 And the output is then calculated similarly to the cell state, but again mixing in the input. 0:44:36.536 --> 0:44:45.714 So the reset gate decides whether to output what is currently stored in the cell, or the input itself. 0:44:46.346 --> 0:45:01.293 So it's not exactly like the residual connections we had before, where we simply add things up; here we take a gated, weighted combination. 0:45:04.224 --> 0:45:17.104 This is the general idea of the simple recurrent unit. Next we will see how to make it even more efficient; but first, any more questions on how it works? 0:45:23.063 --> 0:45:43.177 Now, these calculations are where we can still gain efficiency, because in a matrix-vector product every output dimension depends on all input dimensions; and the same holds for the second gate.
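Collecting the steps just described into formulas; the notation here is assumed, following the usual write-ups of the simple recurrent unit, with $\odot$ for element-wise multiplication and $\sigma$ for the sigmoid:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + V_f c_{t-1} + b_f) && \text{forget gate}\\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot W x_t && \text{new cell state}\\
r_t &= \sigma(W_r x_t + V_r c_{t-1} + b_r) && \text{reset gate}\\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t && \text{output, with a direct path to } x_t
\end{aligned}
$$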
0:45:43.423 --> 0:45:52.353 Because if you do a matrix multiplication with a vector, each dimension of the output vector depends on all dimensions of the input vector. 0:45:52.973 --> 0:46:11.340 So the new cell state depends on every dimension of the previous cell state, while we would like the first dimension of the cell state to depend only on the first dimension before it. 0:46:11.931 --> 0:46:18.481 This dependency between dimensions again makes things less parallelizable. 0:46:19.359 --> 0:46:35.122 We can change that easily by replacing the matrix product in the recurrence with an element-wise product with a vector. 0:46:35.295 --> 0:46:51.459 So you multiply element by element: the first dimension with the first dimension, the second with the second, and so on. 0:46:52.032 --> 0:46:59.294 This should, of course, be a separate weight vector for the reset gate and for the forget gate. 0:46:59.899 --> 0:47:16.148 Now the first dimension depends only on the first dimension, so you no longer have dependencies across dimensions. 0:47:18.078 --> 0:47:25.692 Maybe it gets a bit clearer if you look at what we have to compute now. 0:47:25.966 --> 0:47:31.911 First, we do the matrix multiplications on the inputs alone, to get the gate pre-activations; that can be done for all time steps together. 0:47:32.292 --> 0:47:38.713 And then we only have the element-wise operations, where we combine this output, 0:47:39.179 --> 0:47:52.748 the previous cell state, and the transformed input; these are all element-wise operations, which can be parallelized optimally. 0:47:53.273 --> 0:48:07.603 So we gain parallelism across the dimensions, and the sequential part no longer contains full matrix products. 0:48:09.929 --> 0:48:24.255 And the matrix part you can again do in parallel for all time steps. 0:48:24.544 --> 0:48:34.650 The recurrent part you can't parallelize over time, but per step it is only element-wise work, and that you can parallelize over dimensions. 0:48:35.495 --> 0:48:55.383 Student: Doesn't the element-wise recurrence restrict the model? Maybe; I don't know if they tried the full version. I assume it's not only about reducing computation, but a full matrix there would be hard exactly because you would lose this parallelism. 0:49:01.001 --> 0:49:20.699 People have even simplified the second part further. The output is then handled like the highway connections that we have in the transformer anyway: 0:49:20.780 --> 0:49:24.789 that is simply how things are already put together in a transformer block. 0:49:25.125 --> 0:49:44.512 So in that simplified recurrent unit you use the standard residual connection for the output, and you don't need a separate reset gate. 0:49:46.326 --> 0:49:47.503 That type of thing. 0:49:49.149 --> 0:50:02.580 And with this, we are at the end of how to make architectures efficient, and we come to the next topic. 0:50:13.013 --> 0:50:28.988 Besides the encoder-decoder trade-off and the architectures, there is another technique, which is used very successfully in nearly all of deep learning: knowledge distillation. 0:50:29.449 --> 0:50:45.983 The idea is: can we extract the knowledge of a large network into a smaller one, so that the small one performs similarly well but is much cheaper? 0:50:47.907 --> 0:50:53.217 And the nice thing is that this really works, which may be very surprising.
0:50:53.673 --> 0:51:07.871 So the setup is that we have a large, strong model, which we train for a long time, and the question is: can that help us to train a smaller model? 0:51:08.148 --> 0:51:17.005 So can what we refer to as the teacher model help us build a better small student model than we could before? 0:51:17.257 --> 0:51:28.755 Before, the student model learned only from the data; that is how we normally train our systems. 0:51:29.249 --> 0:51:47.222 The question is: can we train this small model better if we are not only learning from the data, but also from a large model, which itself was maybe trained on the same data? 0:51:47.667 --> 0:51:55.564 So that in the end you have a smaller model that performs better than one trained directly. 0:51:55.895 --> 0:51:59.828 And maybe at first view that is 0:51:59.739 --> 0:52:19.161 very surprising, because everything has seen the same data: the baseline model trained only on the data, and the student trained in the teacher-student setup; the teacher model was also typically trained only on this data. However, 0:52:20.580 --> 0:52:32.293 it has been shown many times that the model trained in the teacher-student framework performs better. 0:52:33.473 --> 0:52:47.171 We'll get a bit of an explanation when we see how it works. There are different ways of doing it; maybe the simplest first. 0:52:47.567 --> 0:53:06.113 So how does it work? This is our student network, the normal one, some type of neural network; we train it as usual. 0:53:06.586 --> 0:53:17.050 So we are training the model to predict the reference words, and we do that by calculating the cross-entropy loss. 0:53:17.437 --> 0:53:25.332 The cross-entropy loss was defined so that the probability of the correct word should be as high as possible. 0:53:25.745 --> 0:53:43.368 So you're calculating output probabilities: at each time step you have a distribution over what the next word is, and your training signal is to put as much probability mass as possible on the word that is there in the training data. 0:53:43.903 --> 0:54:03.947 And this is achieved by the cross-entropy loss, which sums over all training examples, over all positions, and over the full vocabulary, with an indicator that is one exactly when the current reference word is the k-th word of the vocabulary. 0:54:04.204 --> 0:54:27.313 And then we take the log of the probability of that word. So what we have is this matrix: one row per position, one column per vocabulary entry. 0:54:27.507 --> 0:54:40.785 In the end you sum up these log probabilities, and you want that sum to be as high as possible. 0:54:41.041 --> 0:54:54.614 So although this looks like a sum over the whole matrix, for each position only a single entry counts: the one of the correct word.
0:54:54.794 --> 0:55:07.016 So that is the normal cross-entropy loss that we discussed at the very beginning, for how to train these models. 0:55:08.068 --> 0:55:23.374 What can we do differently with the teacher? We also have a teacher network, which is trained on large data. 0:55:24.224 --> 0:55:35.957 And of course its output distribution might be better than the one from the small model. 0:55:36.456 --> 0:55:40.941 So in this case we take the training signal from the teacher network. 0:55:41.441 --> 0:55:59.159 And it's the same formula as before; the only difference is that we train not towards the ground-truth distribution, which is sharp: one for the correct word and zero everywhere else, 0:55:59.299 --> 0:56:11.303 but towards the teacher's distribution, which is also a probability distribution: the correct word has a high probability, but other words have some probability too. 0:56:12.612 --> 0:56:30.341 And that is the main difference. Typically you use an interpolation of these two losses. 0:56:33.213 --> 0:56:47.907 Student: Is it because there's more information contained in the distribution than in the ground truth, since it encodes more about the language? Language always has more than one way to say the same thing. Yes, exactly. 0:56:47.907 --> 0:56:53.114 So there's ambiguity, and that is hopefully encoded well in the teacher's distribution. 0:56:53.513 --> 0:56:57.257 A well-trained teacher network captures that better than the student can learn it from single references. 0:56:57.537 --> 0:57:10.505 So often there is only one correct word, but it might be two or three, and then all of them get probability mass in the distribution. 0:57:10.590 --> 0:57:21.242 And that is the main advantage, or one explanation, of why it helps to train from the teacher's distribution. 0:57:21.361 --> 0:57:33.493 Of course, it's good to also keep the ground-truth signal in there, to prevent the student from learning the teacher's mistakes when the teacher is completely off. 0:57:37.017 --> 0:57:49.466 Any more questions on this first type of knowledge distillation? Student: What about distribution shift? 0:57:50.550 --> 0:58:04.244 I would put it a bit differently: this is not by itself a solution to domain or distribution shift. 0:58:04.744 --> 0:58:12.680 But I don't think it performs worse than training only on the ground truth, because that has the same problem. 0:58:13.113 --> 0:58:21.254 So it's not solving that; I would assume it helps to a similar degree in both cases. 0:58:21.481 --> 0:58:28.524 Of course, if you have a teacher, maybe you have no data for your target domain, 0:58:28.888 --> 0:58:42.147 and then you can use the teacher output, which is not ground truth, but still helpful for learning a better distribution. 0:58:46.326 --> 0:59:02.757 The second idea is sequence-level knowledge distillation. So far we have looked at each position independently. 0:59:03.423 --> 0:59:16.760 We do that often, but it has this problem of error propagation: we start with one error and then build on it. 0:59:17.237 --> 0:59:27.419 So with word-level knowledge distillation, we are treating each word in the sentence independently.
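A sketch of the word-level variant just discussed (hypothetical code): the loss mixes the usual cross-entropy against the reference with a cross-entropy against the teacher's soft distribution, applied per position.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """student_logits, teacher_logits: (batch, positions, vocab)
    targets: (batch, positions) reference word ids.
    Word-level KD: interpolate hard and soft cross-entropy."""
    hard = F.cross_entropy(student_logits.transpose(1, 2), targets)
    soft = -(teacher_logits.softmax(-1)
             * student_logits.log_softmax(-1)).sum(-1).mean()
    return alpha * hard + (1 - alpha) * soft
```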
0:59:28.008 --> 0:59:32.091 So, with word-level distillation, we are not trying to model the dependencies between the words at all. 0:59:32.932 --> 0:59:47.480 We can try to fix that with sequence-level knowledge distillation, but there is, of course, a problem. 0:59:47.847 --> 0:59:53.478 For each position we can get a distribution over all the words at this position. 0:59:53.793 --> 1:00:05.305 But if we want a distribution over all possible target sentences, that's not possible, 1:00:05.305 --> 1:00:06.431 because their number grows exponentially. 1:00:08.508 --> 1:00:15.940 So we can again do a bit of a hack on that. 1:00:15.940 --> 1:00:23.238 If we can't have a distribution over all sentences, we approximate it. 1:00:23.843 --> 1:00:30.764 What we can do is use the teacher network and sample different translations. 1:00:31.931 --> 1:00:39.327 And then we can train on them in different ways. 1:00:39.327 --> 1:00:49.343 We can weight them by their probability; the easiest is to treat them as normal training examples. 1:00:50.050 --> 1:00:56.373 So what that boils down to is that we're taking our teacher network, we're generating some 1:00:56.373 --> 1:01:01.135 translations, and these we're using as additional training data. 1:01:01.781 --> 1:01:11.382 Then we have done distillation mainly at the sequence level, because the teacher network tells us: 1:01:11.382 --> 1:01:17.513 these are all probable translations of the sentence. 1:01:26.286 --> 1:01:34.673 And then you can also try to make a bit of an interpolated 1:01:34.673 --> 1:01:36.206 version of that. 1:01:36.716 --> 1:01:42.802 So what people have also done is sequence-level interpolation. 1:01:42.802 --> 1:01:52.819 You generate several translations here, but then you don't use all of them. 1:01:52.819 --> 1:02:00.658 You use some metric to decide which of these ones to keep, as in the sketch below. 1:02:01.021 --> 1:02:12.056 So instead of only training on the ground truth, which might be improbable or even unreachable 1:02:12.056 --> 1:02:16.520 for the model, because the model cannot generate everything, 1:02:16.676 --> 1:02:23.378 we are giving it an easier target which is still of good quality, and training on that. 1:02:23.703 --> 1:02:32.602 So you're not training it on a very difficult solution, but on an easier 1:02:32.602 --> 1:02:33.570 solution. 1:02:36.356 --> 1:02:38.494 Any more questions on this? 1:02:40.260 --> 1:02:41.557 Yeah. 1:02:41.461 --> 1:02:44.296 Good. 1:02:43.843 --> 1:03:01.642 The next idea is to look at the vocabulary. The problem is, we have seen that the output vocabulary computations 1:03:01.642 --> 1:03:06.784 are often very time-consuming. 1:03:09.789 --> 1:03:19.805 The thing is that most of the vocabulary is not needed for each sentence. 1:03:20.280 --> 1:03:28.219 The question is: can we somehow cheaply precompute which words are likely to occur in the translation of a sentence, 1:03:28.219 --> 1:03:30.967 and then only compute the scores for these? 1:03:31.691 --> 1:03:34.912 And this can be done. 1:03:34.912 --> 1:03:43.932 For example, for a given source sentence, most target words will quite surely not occur in its translation. 1:03:44.164 --> 1:03:48.701 So what you can try to do is limit the vocabulary 1:03:48.701 --> 1:03:51.093 you're considering for each sentence. 1:03:51.151 --> 1:04:04.693 So you're no longer taking the full vocabulary as possible output, but you're restricting it to a subset. 1:04:06.426 --> 1:04:18.275 What typically works is that we always include the most frequent words, because 1:04:18.275 --> 1:04:23.613 these are not so easy to predict from word alignments.
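To make the sequence-level distillation and the interpolated variant just described concrete, here is a minimal sketch (the `teacher` object and its `translate`/`translate_nbest` methods are hypothetical stand-ins for whatever decoding API your toolkit offers; `sacrebleu` is used here as one possible selection metric):

```python
import sacrebleu

def sequence_kd_data(teacher, sources):
    """Plain sequence-level distillation: re-label the training set
    with the teacher's own best translations."""
    return [(src, teacher.translate(src, beam=5)) for src in sources]

def sequence_kd_interpolated(teacher, src, ref, n_best=5):
    """Sequence-level interpolation: among the teacher's n-best outputs,
    keep the one closest to the reference (here by sentence BLEU)."""
    hyps = teacher.translate_nbest(src, n=n_best)
    best = max(hyps, key=lambda h: sacrebleu.sentence_bleu(h, [ref]).score)
    return src, best
```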
1:04:23.964 --> 1:04:32.241 Back to the vocabulary restriction: so we take the most frequent target words, plus the words that often align to one of the 1:04:32.241 --> 1:04:32.985 source words. 1:04:33.473 --> 1:04:46.770 So for each source word you compute the word alignment on your training data, and then 1:04:46.770 --> 1:04:51.700 you record which target words it typically aligns to. 1:04:52.352 --> 1:04:57.680 And then for decoding you build the union of the per-source-word lists and the frequent word list. 1:04:59.960 --> 1:05:02.145 That is, for each source word 1:05:02.145 --> 1:05:08.773 you take the most frequent translations of that source word, for example 1:05:08.773 --> 1:05:13.003 the top few per word, and then the globally most frequent words. 1:05:13.193 --> 1:05:24.333 In total, if you have short sentences, you have a lot fewer words, so in most cases it's 1:05:24.333 --> 1:05:26.232 only a small fraction of the vocabulary. 1:05:26.546 --> 1:05:33.957 And so you have dramatically reduced your vocabulary, and can thereby also decode faster; a small sketch follows below. 1:05:35.495 --> 1:05:43.757 That sounds easy. Does anybody see what is challenging here, and why it might not always help? 1:05:47.687 --> 1:05:54.448 It's not the translation performance. Why might this not be faster? 1:05:54.448 --> 1:06:01.838 If you implement it, it might not give a strong speedup. 1:06:01.941 --> 1:06:06.053 You have to store these lists. 1:06:06.053 --> 1:06:14.135 You have to build the union, and of course that eats into the time you save. 1:06:14.554 --> 1:06:21.920 The second thing: the vocabulary is used in our last step, so we have the hidden state, 1:06:21.920 --> 1:06:23.868 and then we calculate the output probabilities. 1:06:24.284 --> 1:06:29.610 Now we are no longer calculating them for all output words, but for a subset of them. 1:06:30.430 --> 1:06:35.613 However, this matrix multiplication is typically parallelized on the GPU perfectly well. 1:06:35.956 --> 1:06:46.937 So if you only calculate some of the outputs but don't implement it right, it will take 1:06:46.937 --> 1:06:52.794 as long as before, because of the parallel nature of the hardware. 1:06:56.776 --> 1:07:07.997 For beam search there are some ideas too; of course you can go back to greedy search, because 1:07:07.997 --> 1:07:10.833 that's more efficient. 1:07:11.651 --> 1:07:18.347 Beam search gives better quality, and you can buffer some states in between; how much you buffer is 1:07:18.347 --> 1:07:22.216 again this tradeoff between computation and memory. 1:07:25.125 --> 1:07:41.236 Then, at the end of today, what we want to look into is one last type of neural machine translation 1:07:41.236 --> 1:07:42.932 approach. 1:07:43.403 --> 1:07:53.621 And the idea is, as we've already seen in our first steps, that this autoregressive 1:07:53.621 --> 1:07:57.246 part is what takes the time in decoding. 1:07:57.557 --> 1:08:04.461 The encoder can process everything in parallel, but in decoding we are always taking the most probable word and then feeding it back in. 1:08:05.905 --> 1:08:10.476 The question is: do we really need to do that? 1:08:10.476 --> 1:08:14.074 Therefore, there is a bunch of work on this. 1:08:14.074 --> 1:08:16.602 Can we do it differently? 1:08:16.602 --> 1:08:19.616 Can we generate the full target sentence at once? 1:08:20.160 --> 1:08:29.417 We'll see it's not that easy, and there's still an open debate whether this is really faster 1:08:29.417 --> 1:08:31.832 at the same quality, but I think it's worth looking at. 1:08:32.712 --> 1:08:45.594 So, as said, what we have is our encoder-decoder, where we can process our encoder input in parallel, 1:08:45.594 --> 1:08:50.527 and then each output always depends on the previous ones.
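Here is the vocabulary shortlist sketch referred to above (plain Python; the list sizes and the `(src_word, tgt_word)` alignment input format are illustrative assumptions):

```python
from collections import Counter, defaultdict

def build_shortlists(aligned_pairs, top_global=2000, top_per_word=10):
    """aligned_pairs: iterable of (src_word, tgt_word) pairs taken
    from the word alignment of the training data."""
    global_count = Counter()
    per_src = defaultdict(Counter)
    for s, t in aligned_pairs:
        global_count[t] += 1
        per_src[s][t] += 1
    frequent = {w for w, _ in global_count.most_common(top_global)}
    translations = {s: {w for w, _ in c.most_common(top_per_word)}
                    for s, c in per_src.items()}
    return frequent, translations

def candidate_vocab(source_sentence, frequent, translations):
    """Union of the globally frequent words and the likely
    translations of each source word in this sentence."""
    cand = set(frequent)
    for s in source_sentence.split():
        cand |= translations.get(s, set())
    return cand
```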
1:08:50.410 --> 1:08:54.709 So, to continue with this autoregressive setup: we generate the output and then we have to feed it in here, because everything 1:08:54.709 --> 1:08:56.565 depends on the previous outputs. 1:08:56.916 --> 1:09:10.464 This is what is referred to as an autoregressive model, and nearly all speech generation and 1:09:10.464 --> 1:09:16.739 language generation works in this autoregressive way. 1:09:18.318 --> 1:09:21.132 So the motivation is: can we do that more efficiently? 1:09:21.361 --> 1:09:31.694 Can we somehow process all target words in parallel? 1:09:31.694 --> 1:09:41.302 So instead of doing it one by one, we are inputting everything at once. 1:09:45.105 --> 1:09:46.726 So how does it work? 1:09:46.726 --> 1:09:50.587 Let's first build a basic non-autoregressive model. 1:09:50.810 --> 1:09:53.551 The encoder looks exactly as before. 1:09:53.551 --> 1:09:58.310 That's maybe not surprising, because there we know we can parallelize. 1:09:58.618 --> 1:10:04.592 So we put in our input and generate the encoder states; that's exactly 1:10:04.592 --> 1:10:05.295 the same. 1:10:05.845 --> 1:10:16.229 However, now we need to do one more thing: one challenge is what we had before, and that's 1:10:16.229 --> 1:10:26.799 a general challenge of natural language generation like machine translation: the output length. 1:10:32.672 --> 1:10:38.447 Normally we generate until we produce the end-of-sentence token, but if we now generate 1:10:38.447 --> 1:10:44.625 everything at once, that's no longer possible, because we only 1:10:44.625 --> 1:10:45.632 generate once. 1:10:46.206 --> 1:10:58.321 So the question is: how can we determine beforehand how long the sequence is? 1:11:00.000 --> 1:11:06.384 Yes, that would be one idea, and there is other work which tries to do that. 1:11:06.806 --> 1:11:15.702 However, here some relevant work was already done long before; maybe you remember we had the 1:11:15.702 --> 1:11:20.900 IBM models, and there was this concept of fertility. 1:11:21.241 --> 1:11:26.299 The concept of fertility means: for one source word, into how many target words does 1:11:26.299 --> 1:11:27.104 it translate? 1:11:27.847 --> 1:11:34.805 And exactly that we try to do here; that means at the top we 1:11:34.805 --> 1:11:36.134 are calculating fertilities. 1:11:36.396 --> 1:11:42.045 So it says this word is translated into one word, 1:11:42.045 --> 1:11:54.171 that word might be translated into two words, and so on; we're trying to predict into how many target words each source word translates. 1:11:55.935 --> 1:12:10.314 And this sits at the end of the encoder, so it is like a length estimation. 1:12:10.314 --> 1:12:15.523 You can also do it otherwise. 1:12:16.236 --> 1:12:24.526 Then you initialize your decoder input; we know word embeddings work well, so we try 1:12:24.526 --> 1:12:28.627 to do the same thing here, and what people then do is 1:12:28.627 --> 1:12:35.224 initialize it again with the source word embeddings, but each repeated according to its fertility. 1:12:35.315 --> 1:12:36.460 So we copy the source embeddings. 1:12:36.896 --> 1:12:47.816 If one word has fertility two, it appears twice, and the next has fertility one, so it appears once; that is then our initialization, as in the sketch below. 1:12:48.208 --> 1:12:57.151 Alternatively, if you don't predict fertilities but predict the length directly, you can initialize the decoder input in a 1:12:57.151 --> 1:12:57.912 simpler way. 1:12:58.438 --> 1:13:07.788 This often works a bit better, but that's the alternative. 1:13:07.788 --> 1:13:16.432 Now you have everything, in training and testing: 1:13:16.656 --> 1:13:18.621 this is all available at once.
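A minimal sketch of the fertility-based initialization (assuming PyTorch; shapes are illustrative):

```python
import torch

def fertility_decoder_input(src_embeddings, fertilities):
    """src_embeddings: (src_len, dim); fertilities: (src_len,) ints.
    Repeat each source embedding according to its predicted fertility,
    e.g. fertilities [1, 2, 1] turn embeddings [a, b, c] into
    [a, b, b, c], which also fixes the output length."""
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)
```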
1:13:20.280 --> 1:13:31.752 Then we can generate everything in parallel: we have the decoder stack, and that is now 1:13:31.752 --> 1:13:33.139 as before. 1:13:35.395 --> 1:13:41.555 And then we're doing the translation predictions here on top of it. 1:13:43.083 --> 1:13:59.821 So we are predicting here all the target words at once, and that is the basic 1:13:59.821 --> 1:14:00.924 idea 1:14:01.241 --> 1:14:08.171 of non-autoregressive machine translation: we don't have to generate one word at a time. 1:14:10.210 --> 1:14:13.900 So this looks really, really great. 1:14:13.900 --> 1:14:20.358 On first view there's one challenge with this, and this is the quality compared to the baseline. 1:14:20.358 --> 1:14:27.571 Of course there have been improvements, but in general the quality drop is often significant. 1:14:28.068 --> 1:14:32.075 So here you see, compared to the baseline models, 1:14:32.075 --> 1:14:38.466 you have a loss of ten BLEU points or something like that. 1:14:38.878 --> 1:14:40.230 So why does this change things? 1:14:40.230 --> 1:14:41.640 Why is it happening? 1:14:43.903 --> 1:14:56.250 If you look at the errors, there are repetitive tokens, so you get the same word twice, or things like that. 1:14:56.536 --> 1:15:01.995 Broken sentences or disfluent sentences; that is exactly where autoregressive models are 1:15:01.995 --> 1:15:04.851 very good. We said that's even a bit of a problem: 1:15:04.851 --> 1:15:07.390 they generate very fluent 1:15:07.387 --> 1:15:10.898 translations; sometimes a translation doesn't have anything to do with the input. 1:15:11.411 --> 1:15:14.047 But generally the output always looks very fluent. 1:15:14.995 --> 1:15:20.865 Here it's exactly the opposite, so the problem is that we don't get really fluent translations. 1:15:21.421 --> 1:15:26.123 And that is mainly due to the challenge that we have this independence assumption. 1:15:26.646 --> 1:15:35.873 So in this case, the output probability at the second position is independent of what 1:15:35.873 --> 1:15:40.632 was generated at the first position, so we don't know what was generated there. 1:15:40.632 --> 1:15:43.740 We're just generating everything in parallel. 1:15:43.964 --> 1:15:55.439 You can see this in a few examples, 1:15:55.439 --> 1:16:03.636 and evaluation can also over-penalize shifted outputs. 1:16:04.024 --> 1:16:10.566 And the problem is that, even with improvements, this core issue remains. 1:16:11.071 --> 1:16:19.900 So a phrase can, for example, be translated in one way, or maybe you could also translate it 1:16:19.900 --> 1:16:31.105 in another way: but if the first position picks a word from one variant 1:16:31.105 --> 1:16:34.594 and the second position picks a word from the other, you get a mix of both. 1:16:35.075 --> 1:16:42.908 So each position, and that is one of the main issues here, doesn't know what the others generate. 1:16:43.243 --> 1:16:53.846 And, for example, you can often translate things in two 1:16:53.846 --> 1:16:58.471 ways into German, with a different agreement. 1:16:58.999 --> 1:17:02.058 And then here, where you have to decide which form to use, 1:17:02.162 --> 1:17:05.460 the model doesn't know which word it has to select. 1:17:06.086 --> 1:17:14.789 I mean, of course it knows the hidden state, but in the end you have a probability distribution. 1:17:16.256 --> 1:17:20.026 And that is the important thing in the autoregressive model: 1:17:20.026 --> 1:17:24.335 there you know what was selected, because you have put it in; here, you don't know that.
1:17:24.335 --> 1:17:29.660 If two options are equally probable here, you don't know which is selected, and of course 1:17:29.660 --> 1:17:32.832 what should be generated later depends on that selection. 1:17:33.333 --> 1:17:39.554 Yep, that's this shift problem, and we'll come back to it in a second. 1:17:39.554 --> 1:17:39.986 Yes. 1:17:40.840 --> 1:17:44.934 Doesn't this also appear in training, now that we're talking about parallel training? 1:17:46.586 --> 1:17:48.412 The thing is, in the autoregressive model, 1:17:48.412 --> 1:17:50.183 in training you give it the correct previous word. 1:17:50.450 --> 1:17:55.827 So even if the model predicts something wrong at one position, you still tell the model what 1:17:55.827 --> 1:17:59.573 the reference word was, and then it knows what has to follow. 1:17:59.573 --> 1:18:04.044 But here it doesn't know that, because it doesn't get the previous output as input. 1:18:04.204 --> 1:18:24.286 Yes, that's a bit depending on the setup. 1:18:24.204 --> 1:18:27.973 But in training, of course, you just try to make the correct word the highest scoring one. 1:18:31.751 --> 1:18:38.181 So what you can do is use things like the CTC loss, which can adjust for this. 1:18:38.181 --> 1:18:42.866 Then you can also accept this kind of shifted output. 1:18:42.866 --> 1:18:50.582 If you're doing this type of correction with the CTC loss, you don't get the full penalty 1:18:50.930 --> 1:18:58.486 when the output is just shifted by one; so it's a bit of a different loss, which is mainly used in speech recognition, but 1:19:00.040 --> 1:19:03.412 it can be used here in order to address this problem; a small usage sketch follows below. 1:19:04.504 --> 1:19:13.844 The other problem is that, non-autoregressively, we have this ambiguity the model has to disambiguate. 1:19:13.844 --> 1:19:20.515 That's the example from before: if you translate thank you into German, it can be danke schön or vielen Dank. 1:19:20.460 --> 1:19:31.925 And then it might end up, because the first position learns one variant and the second position the other, with a mix of both. 1:19:32.492 --> 1:19:43.201 In order to prevent that, it would be helpful to have, for one input, only one output; that makes 1:19:43.201 --> 1:19:47.002 the system learn better. 1:19:47.227 --> 1:19:53.867 It might be that for slightly different inputs you have different outputs, but for the same input always the same output. 1:19:54.714 --> 1:19:57.467 That we can luckily solve very easily. 1:19:59.119 --> 1:19:59.908 And it's done 1:19:59.908 --> 1:20:04.116 with the technique we just learned about, which is called knowledge distillation. 1:20:04.985 --> 1:20:13.398 So the easiest way to improve your non-autoregressive model is to 1:20:13.398 --> 1:20:16.457 train an autoregressive model. 1:20:16.457 --> 1:20:22.958 Then you decode your whole training data with this model, and then train on its output. 1:20:23.603 --> 1:20:27.078 The main advantage of that is that this data is more consistent. 1:20:27.407 --> 1:20:33.995 So for the same input you always have the same output. 1:20:33.995 --> 1:20:41.901 You make your training data more consistent, and the model can learn it more easily. 1:20:42.482 --> 1:20:54.471 So that is another advantage of knowledge distillation: you get 1:20:54.471 --> 1:20:59.156 more consistent training signals. 1:21:04.884 --> 1:21:10.630 There's another way to make things easier at the beginning. 1:21:10.630 --> 1:21:16.467 There's this glancing model, where you work with masks. 1:21:16.756 --> 1:21:26.080 So during training, especially at the beginning, you give some correct target words as input. 1:21:28.468 --> 1:21:38.407 And there is this k tokens at a time, so the idea is to start with autoregressive training.
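Here is the CTC usage sketch referred to above (assuming PyTorch's `torch.nn.CTCLoss`; all shapes and sizes are illustrative). The point is that the model may emit a longer sequence with blanks and repeats that collapses to the target, so an output that is merely shifted is not fully penalized:

```python
import torch

T, N, C = 16, 2, 100                 # output length, batch, vocab (+blank)
ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs from the non-autoregressive decoder: (T, N, C)
log_probs = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, 8))        # padded target ids, no blanks
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), 8)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```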
1:21:40.000 --> 1:21:50.049 Back to the k-tokens-at-a-time idea: some targets are given and some are open. At first it is autoregressive, with k 1:21:50.049 --> 1:21:59.174 equal to one, so you always have one input and one output; then you do partially parallel decoding with larger k. 1:21:59.699 --> 1:22:05.825 So in that way the model can slowly learn what is a good and what is a bad answer. 1:22:08.528 --> 1:22:10.862 It doesn't sound very efficient, 1:22:10.862 --> 1:22:12.578 I admit, 1:22:12.578 --> 1:22:15.323 because you go over your training data several times. 1:22:15.875 --> 1:22:20.655 You can even switch in between. 1:22:20.655 --> 1:22:29.318 There is whole work on such schedules, where you try different ways to start. 1:22:31.271 --> 1:22:41.563 The model still has to learn, so this is often done, and it doesn't 1:22:41.563 --> 1:22:46.598 make inference less efficient, and it helps. 1:22:49.389 --> 1:22:57.979 For later reference, here are some examples of how much these things help. 1:22:57.979 --> 1:23:04.958 Maybe one point here is really important. 1:23:05.365 --> 1:23:13.787 Here is the translation performance and the speed. 1:23:13.787 --> 1:23:24.407 One important point is which baseline you compare against. 1:23:24.784 --> 1:23:33.880 So if you compare to one very weak baseline, a transformer even with beam search, 1:23:33.880 --> 1:23:40.522 then you look maybe ten times faster than the autoregressive model. 1:23:40.961 --> 1:23:48.620 If you take a strong autoregressive baseline, then the speedup goes down, depending on the setup. 1:23:48.620 --> 1:23:53.454 You see a lot of different speedups reported. 1:23:53.454 --> 1:24:03.261 In general, one should compare against a strong baseline and not a very simple transformer. 1:24:07.407 --> 1:24:20.010 Yeah, and with this, one last thing that you can do to speed things up and also reduce your 1:24:20.010 --> 1:24:25.950 memory is what is called half precision. 1:24:26.326 --> 1:24:29.139 This helps especially for decoding; for training, 1:24:29.139 --> 1:24:31.148 it sometimes gets less stable. 1:24:32.592 --> 1:24:45.184 With this we are nearly done; wait a bit: what you should remember is how efficient machine 1:24:45.184 --> 1:24:46.963 translation can be achieved. 1:24:47.007 --> 1:24:51.939 We have, for example, looked at knowledge distillation. 1:24:51.939 --> 1:24:55.991 We have looked at non-autoregressive models. 1:24:55.991 --> 1:24:57.665 We have seen different other techniques. 1:24:58.898 --> 1:25:02.383 That's it for today, and then only one request: 1:25:02.383 --> 1:25:08.430 if you haven't done so, please fill out the evaluation. 1:25:08.388 --> 1:25:20.127 If you have done so, thanks; and the online people can hopefully do it as well. 1:25:20.320 --> 1:25:29.758 It's the best possibility to tell us what is good and what is not; not the only one, but the most 1:25:29.758 --> 1:25:30.937 efficient. 1:25:31.851 --> 1:25:35.875 So please, all students, do it in the next days. Okay, then thank you.
0:09:07.527 --> 0:09:18.611 There are lots of classification algorithms on how to classify automatically generated data, and there 0:09:18.611 --> 0:09:26.957 was a very interesting paper on how to watermark machine translation output. 0:09:27.107 --> 0:09:32.915 So there are two scenarios, of course, in this problem: the one thing is you might want 0:09:32.915 --> 0:09:42.244 to find your own translations, if you're a big company, say, running an MT system that is widely 0:09:42.244 --> 0:09:42.866 used. 0:09:43.083 --> 0:09:49.832 The situation might be that most of the translation out there was created by you.
0:09:49.832 --> 0:10:02.007 You might then be able to detect that. And there is a relatively easy way of doing it: for many inputs there are several 0:10:02.007 --> 0:10:09.951 translations which are more or less equally good. 0:10:09.929 --> 0:10:12.878 They are different, but there is not the one correct translation. 0:10:13.153 --> 0:10:23.763 So what you then can do is not always output the single best one to the user, but, among the good candidates, the one with the highest value 0:10:23.763 --> 0:10:30.241 under some function; that makes it easy to decide which translation to take. 0:10:30.870 --> 0:10:40.713 And if you always give out the translation that, among the candidates which are all good, this rule 0:10:40.713 --> 0:10:42.614 prefers, then you can later recognize your own output. 0:10:42.942 --> 0:10:55.503 But of course this you can only do for the data generated by your own model. 0:10:55.503 --> 0:11:02.855 What we are now seeing is output not only from one system, but from many. 0:11:03.163 --> 0:11:13.295 But it's definitely an additional research question that might get more and more important, 0:11:13.295 --> 0:11:18.307 and it might become an additional filtering step. 0:11:18.838 --> 0:11:29.396 There are other issues in data quality, for example in which direction a text was translated; so far 0:11:29.396 --> 0:11:31.650 we haven't been interested in that. 0:11:31.891 --> 0:11:35.672 But if you're now reaching better and better quality, it makes a difference whether 0:11:35.672 --> 0:11:39.208 the original data was translated from German to English or from English to German. 0:11:39.499 --> 0:11:44.797 Because translations have special properties; people call it translationese. 0:11:44.797 --> 0:11:53.595 So if you generate German from English, it has a structure more similar to the English than text 0:11:53.595 --> 0:11:55.195 that was directly written in German. 0:11:55.575 --> 0:11:57.187 So, 0:11:57.457 --> 0:12:03.014 these are all issues which you might then address: you do additional training to remove them, 0:12:03.014 --> 0:12:07.182 or you first train on them and later train on other, higher quality data. 0:12:07.182 --> 0:12:11.034 That's a general view on it, so it's an important issue. 0:12:11.034 --> 0:12:17.160 But until now I think it hasn't been addressed that much, maybe because the quality was decent enough. 0:12:18.858 --> 0:12:23.691 Question: sure, we can use the Internet if we have the time. 0:12:23.691 --> 0:12:29.075 The problem is, there is a lot of English text, but what about less used languages? 0:12:29.075 --> 0:12:34.460 Say some language spoken in Africa: what do we do about that one? 0:12:34.460 --> 0:12:37.566 I mean, that's why most data is English, too. 0:12:38.418 --> 0:12:42.259 For other languages you take the best you can get. 0:12:42.259 --> 0:12:46.013 If there is no data on the Internet, then you cannot crawl it. 0:12:46.226 --> 0:12:48.255 So there is still a lot of manual data collection. 0:12:48.255 --> 0:12:50.976 Also in that way people try to improve there and collect data. 0:12:51.431 --> 0:12:57.406 English is the most common on the web, but you find surprisingly much data also for other 0:12:57.406 --> 0:12:58.145 languages. 0:12:58.678 --> 0:13:04.227 Of course, only if they're written, remember. 0:13:04.227 --> 0:13:15.077 Many languages are not written at all; for them you might find some video, but it's 0:13:15.077 --> 0:13:17.420 difficult to find text. 0:13:17.697 --> 0:13:22.661 So this holds mainly for the web crawling. 0:13:22.661 --> 0:13:29.059 It's mainly done for languages which are commonly written. 0:13:30.050 --> 0:13:37.907 That is exactly the next point: this much data is only true for English and some 0:13:37.907 --> 0:13:41.972 other languages, but of course there are many more.
0:13:41.982 --> 0:13:50.285 And therefore a lot of research on how to make things efficient and learn 0:13:50.285 --> 0:13:54.248 faster from less data is still essential. 0:13:59.939 --> 0:14:06.326 So what we are interested in now, as data, is parallel data. 0:14:06.326 --> 0:14:10.656 We always assume we have parallel data. 0:14:10.656 --> 0:14:12.820 That means we have sentence pairs that are translations of each other. 0:14:13.253 --> 0:14:20.988 But we have to be careful: when we start crawling from the web, we might get only related types of data. 0:14:21.421 --> 0:14:30.457 One common thing is what people refer to as noisy parallel data, where there are documents 0:14:30.457 --> 0:14:34.315 which are roughly translations of each other. 0:14:34.434 --> 0:14:44.300 So you have sentences where there is no translation on the other side, because content was added or dropped. 0:14:44.484 --> 0:14:50.445 If you have these types of documents, your algorithm to extract parallel data might have to be 0:14:50.445 --> 0:14:51.918 a bit more robust. 0:14:52.352 --> 0:15:04.351 I don't know if you can still remember: in the beginning of the lecture we talked about different 0:15:04.351 --> 0:15:06.393 data resources. 0:15:06.286 --> 0:15:11.637 There the first step was an approach to align source and target sentences; it was based 0:15:11.637 --> 0:15:16.869 on the sentence lengths, and you have some probabilities for one-to-one and 0:15:16.869 --> 0:15:17.590 one-to-two alignments. 0:15:17.590 --> 0:15:23.002 It's a very simple algorithm, but it works fine for really high quality parallel 0:15:23.002 --> 0:15:23.363 data. 0:15:23.623 --> 0:15:30.590 But when we're talking about noisy data, we might have to do additional steps and use more 0:15:30.590 --> 0:15:35.872 advanced models to extract what is parallel and to get high quality. 0:15:36.136 --> 0:15:44.682 So for noisy parallel documents, the parallel sentences might not be as easy to extract. 0:15:49.249 --> 0:15:54.877 And then there is the even more extreme case, which has also been used, to be honest. 0:15:54.877 --> 0:15:58.214 The use of this data is nowadays not that common; 0:15:58.214 --> 0:16:04.300 it was of more interest maybe ten or fifteen years ago, and that is what people refer 0:16:04.300 --> 0:16:05.871 to as comparable data. 0:16:06.266 --> 0:16:17.167 The idea there is that you don't even have translations, sentences which are translations of each 0:16:17.167 --> 0:16:25.234 other, but you have news documents or articles about the same topic. 0:16:25.205 --> 0:16:32.410 It's more that you find phrases which are equivalent, so you extract only parallel fragments. 0:16:32.852 --> 0:16:44.975 If you think about Wikipedia, for example, these articles are, by the 0:16:44.975 --> 0:16:51.563 general Wikipedia idea, written independently of each other. 0:16:51.791 --> 0:17:01.701 They have different information in there; the German article might have more detail 0:17:01.701 --> 0:17:04.179 than the English one. 0:17:04.179 --> 0:17:07.219 However, it might be that some parts overlap. 0:17:07.807 --> 0:17:20.904 And the same thing holds for newspaper articles if they are written at the same time about the same event. 0:17:21.141 --> 0:17:25.603 And so this is a possibility to learn, 0:17:25.603 --> 0:17:36.760 for example, new phrases and vocabulary for domains where you don't have parallel data. 0:17:37.717 --> 0:17:49.020 Not everything will be the same, but there might be an overlap, for example about events.
0:17:54.174 --> 0:18:00.348 So, talking about web crawling: as said in the beginning, it was originally really about specific websites. 0:18:00.660 --> 0:18:18.878 People treated these very carefully by hand, really focused on them, and built a very specific way of 0:18:18.878 --> 0:18:20.327 extracting them. 0:18:20.540 --> 0:18:23.464 The European Parliament proceedings are one example; TED is another. 0:18:23.464 --> 0:18:26.686 Maybe you have even looked into a particular session. 0:18:27.427 --> 0:18:40.076 And these are still important, but they are of course very specific, covering only particular 0:18:40.076 --> 0:18:41.341 topics. 0:18:42.002 --> 0:18:55.921 Then there was a focus on language-centered crawling, so there were big crawls where, for example, 0:18:55.921 --> 0:18:59.592 you check websites for one language pair. 0:19:00.320 --> 0:19:07.918 But what people really like is a more general approach where you just have to specify: 0:19:07.918 --> 0:19:15.355 I'm interested in data from German to Lithuanian, and then, as automatically as possible, 0:19:15.355 --> 0:19:19.640 data is collected and parallel data extracted for this pair. 0:19:21.661 --> 0:19:25.633 So this is our interest. 0:19:25.633 --> 0:19:36.435 Of course, the question is how we can build these types of systems. 0:19:36.616 --> 0:19:52.913 The first steps are general web-crawling-based systems, so there is nothing translation-specific about them. 0:19:53.173 --> 0:19:57.337 Based on the websites you have, you have to do text extraction. 0:19:57.597 --> 0:20:06.503 We are typically not that much interested in the markup and images in there, so we try to extract the 0:20:06.503 --> 0:20:07.083 text. 0:20:07.227 --> 0:20:16.919 This is also not specific to machine translation; it's the traditional way of doing web 0:20:16.919 --> 0:20:17.939 crawling. 0:20:18.478 --> 0:20:22.252 And at the end you have something like a large set of collected documents. 0:20:22.842 --> 0:20:37.025 That is the idea: you have the text, and often this is organized as documents, and so in the end 0:20:37.077 --> 0:20:51.523 that is your starting point for the more machine translation specific steps. 0:20:52.672 --> 0:21:05.929 One way of doing that is very similar to what you might think of as the traditional 0:21:05.929 --> 0:21:06.641 one. 0:21:06.641 --> 0:21:10.633 The first thing is to do a document alignment. 0:21:11.071 --> 0:21:22.579 So you do this based on initial facts, like knowing this is a German website and this is its 0:21:22.579 --> 0:21:25.294 English translation. 0:21:25.745 --> 0:21:31.037 And based on this document alignment, you can then do your sentence alignment. 0:21:31.291 --> 0:21:39.072 This is similar to what we had before with the Church and Gale algorithm, 0:21:39.072 --> 0:21:43.696 but here we typically have more noisy parallel data. 0:21:43.623 --> 0:21:52.662 So you are not assuming that everything is present on both sides, or that the order is the same; 0:21:52.662 --> 0:21:56.635 you need more flexible methods. 0:21:58.678 --> 0:22:14.894 Then it depends on whether the documents you crawled were really some type of parallel data. 0:22:15.115 --> 0:22:35.023 If they were only comparable, say, then you should do what is referred to as fragment extraction. 0:22:36.136 --> 0:22:47.972 One problem with these types of models is that errors in your document alignment propagate. 0:22:48.128 --> 0:22:55.860 It means that if you say these two documents are aligned, then you can only find 0:22:55.860 --> 0:22:58.589 sentence pairs within them, and whatever the document alignment misses is lost.
0:22:59.259 --> 0:23:15.284 This matters when the data is very different, when only small parts of the documents are parallel and most parts are independent 0:23:15.284 --> 0:23:17.762 of each other. 0:23:19.459 --> 0:23:31.318 Therefore, more recently, there is also the idea of directly doing the sentence alignment globally, so 0:23:31.318 --> 0:23:35.271 that you directly compare sentences. 0:23:36.036 --> 0:23:41.003 That was already one challenge of this second approach. 0:23:42.922 --> 0:23:50.300 Yes, one big challenge here is that you have to do a lot of comparisons. 0:23:50.470 --> 0:23:59.270 You have to compare every source sentence with every target sentence, which is quadratic. 0:23:59.270 --> 0:24:06.283 If you think of millions of sentences, that's trillions of pairs. 0:24:07.947 --> 0:24:12.176 And this also gives you a reason for a last step in both cases. 0:24:12.176 --> 0:24:18.320 In both of them, you have to remember, you're typically operating on a very large data 0:24:18.320 --> 0:24:18.650 set. 0:24:18.650 --> 0:24:24.530 So all of these steps, and also the document alignment here, should be done very efficiently. 0:24:24.965 --> 0:24:42.090 And if you want to do something very efficiently, that usually means your quality will go down, 0:24:41.982 --> 0:24:47.348 because you have to assess each pair fast, and then you can put less computation 0:24:47.348 --> 0:24:47.910 on each. 0:24:48.688 --> 0:25:06.255 Therefore, in a lot of scenarios it makes sense to add a filtering step 0:25:06.255 --> 0:25:08.735 at the end. 0:25:08.828 --> 0:25:13.370 So we do a second filtering step where we now can put in a lot more effort, 0:25:13.433 --> 0:25:20.972 because now we don't have n-squared possible combinations anymore; we have already 0:25:20.972 --> 0:25:26.054 selected, and have maybe in the order of two or three candidates 0:25:26.054 --> 0:25:29.273 for each sentence, or even fewer. 0:25:29.429 --> 0:25:39.234 And then we can put a lot more effort into each individual example and build a high quality 0:25:39.234 --> 0:25:42.611 classifier to really select the parallel pairs. 0:25:45.125 --> 0:26:00.506 One example for that: one of the biggest projects doing this is the so-called 0:26:00.506 --> 0:26:03.478 ParaCrawl corpus. 0:26:03.343 --> 0:26:11.846 It works like the pipeline shown before, and there are a lot of engineering challenges in how you scale it. 0:26:12.272 --> 0:26:25.808 The steps start with the seed URLs, so what you give at the beginning is a list of promising websites. 0:26:26.146 --> 0:26:36.908 Then they do the crawling, the text extraction, the document alignment, the sentence alignment, 0:26:36.908 --> 0:26:45.518 and the sentence filtering, and it goes down to how the text is stored. 0:26:46.366 --> 0:26:51.936 As we'll see later, these corpora exist for a lot of language pairs, so it's easier to download them and then 0:26:51.936 --> 0:26:52.793 improve on them. 0:26:53.073 --> 0:27:08.270 For the crawling, one thing they often do is not even crawl the websites directly, because there are also 0:27:08.270 --> 0:27:10.510 existing crawls of 0:27:10.770 --> 0:27:14.540 large parts of the Internet that they can work on directly. 0:27:14.854 --> 0:27:22.238 In more detail, this is shown a bit here. 0:27:22.238 --> 0:27:31.907 For all the steps you can see there are different possibilities. 0:27:32.072 --> 0:27:39.018 For some you need bilingual knowledge, like a dictionary, or you can use a machine translation system. 0:27:39.239 --> 0:27:47.810 There are different tools both for document alignment 0:27:47.810 --> 0:27:52.622 and for sentence alignment.
0:27:53.333 --> 0:28:02.102 And there are different ways to do the final filtering, for example lexical ones or learned 0:28:02.422 --> 0:28:05.826 classifiers. Let's go through the next steps in a bit more detail. 0:28:05.826 --> 0:28:13.680 But before we do, are there more questions about the general overview of how these pipelines work? 0:28:22.042 --> 0:28:37.058 Yeah, so two or three things about the web crawling: you normally start with the URLs 0:28:37.058 --> 0:28:40.903 that are most promising. 0:28:41.021 --> 0:28:48.652 For example, if you're interested in German to English, you would start with: companies 0:28:48.652 --> 0:29:01.074 where you know they have a German and an English website, or government agencies. And 0:29:01.074 --> 0:29:10.328 then we can use one of the standard tools to start from there, using standard web crawling techniques. 0:29:11.071 --> 0:29:23.942 There are several challenges when doing that: if you request a website too often, you can get blocked. 0:29:25.305 --> 0:29:37.819 You have to keep a history of the visited sites: you follow all the links, and then click on 0:29:37.819 --> 0:29:40.739 all the links on those pages again. 0:29:41.721 --> 0:29:49.432 You have to be very careful about legal issues, starting with robots.txt, so whether you are allowed to crawl at all. 0:29:49.549 --> 0:29:58.941 I mean, that's the one major thing about web crawling in general: 0:29:58.941 --> 0:30:05.251 the problem of how you deal with intellectual property. 0:30:05.685 --> 0:30:13.114 That is why it is sometimes easier to start with pre-crawled data, so that you don't have these issues yourself. 0:30:13.893 --> 0:30:22.526 Of course, there are network issues where you retry; so there are more technical things, but there is 0:30:22.526 --> 0:30:23.122 good tooling. 0:30:24.724 --> 0:30:35.806 Another thing which is very helpful and often done is, instead of doing the web crawling 0:30:35.806 --> 0:30:38.119 yourself, relying on existing crawls. 0:30:38.258 --> 0:30:44.125 One thing is Common Crawl, a crawl of large parts of the web. 0:30:44.125 --> 0:30:51.190 I think a lot of the large language models are trained on this Common Crawl data. 0:30:51.351 --> 0:30:59.763 I think it is an American organization which really works on crawling as much of the web as 0:31:00.000 --> 0:31:01.111 possible. 0:31:01.111 --> 0:31:10.341 So the nice thing is, if you start with this, you don't have to worry about the networking part. 0:31:10.250 --> 0:31:16.086 I don't think you can download all of it, because it's too big, but you can build a pipeline on how to 0:31:16.086 --> 0:31:16.683 process it. 0:31:17.537 --> 0:31:28.874 That is, of course, a general challenge in all this web crawling and parallel data mining: 0:31:28.989 --> 0:31:38.266 you cannot just download all the data, look at it, and then process it; you have to process it as a stream. 0:31:39.639 --> 0:31:45.593 Here it might make sense to directly filter out domains from both sides that only match 0:31:45.593 --> 0:31:46.414 marginally. 0:31:49.549 --> 0:31:59.381 Then you can do the text extraction, which means converting to HTML and then stripping 0:31:59.381 --> 0:32:01.707 the markup from the HTML. 0:32:01.841 --> 0:32:04.802 Often very important is to do the language ID. 0:32:05.045 --> 0:32:16.728 Even from the links it's not that clear which language a text is in, but there are quite good tools 0:32:16.728 --> 0:32:22.891 that can identify the language from relatively short text, as in the sketch below. 0:32:23.623 --> 0:32:36.678 And then you are in the situation that you have all your data, and you can start the mining.
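One widely used option for this language ID step is fastText's public `lid.176.bin` model; a minimal sketch (the confidence threshold is an illustrative assumption):

```python
import fasttext

# lid.176.bin is fastText's published language identification model,
# covering 176 languages; download it once and load it here
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Das ist ein kurzer deutscher Satz.", k=1)
# labels look like ('__label__de',), probs like array([0.99])
if labels[0] == "__label__de" and probs[0] > 0.9:
    pass  # keep this sentence for the German side
```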
0:32:37.157 --> 0:32:43.651 After the text extraction you now have a large collection of data, where it's 0:32:43.651 --> 0:32:49.469 text, maybe with the document it came from and some meta information. Now the question 0:32:49.469 --> 0:32:55.963 is: based on this monolingual or multilingual text, so text in many languages but not aligned, 0:32:56.036 --> 0:32:59.863 how can you generate parallel data? 0:33:01.461 --> 0:33:06.289 And, 0:33:05.705 --> 0:33:12.965 seeing it as a task, or if we want to do it in a machine 0:33:12.965 --> 0:33:20.388 learning way: what we have is a set of sentences in the source language, and we have 0:33:20.388 --> 0:33:23.324 a set of sentences from the target language. 0:33:23.823 --> 0:33:27.814 This is the target language side. 0:33:27.814 --> 0:33:31.392 This is the data we have. 0:33:31.392 --> 0:33:37.034 We don't directly assume any ordering. 0:33:38.018 --> 0:33:44.502 The documents are not really aligned, or there is maybe a link graph; and what we are interested 0:33:44.502 --> 0:33:50.518 in is finding these alignments: which sentences are aligned to each other, and which sentences 0:33:50.518 --> 0:33:53.860 we should remove because we don't have translations for them. 0:33:53.974 --> 0:34:00.339 Exactly this mapping is what we are interested in and what we need to find. 0:34:01.901 --> 0:34:17.910 And if we model it more from the machine learning point of view, we can model it 0:34:17.910 --> 0:34:21.449 as a classification task. 0:34:21.681 --> 0:34:34.850 So the main challenge is to build this type of classifier: given a sentence pair, you want to decide, is 0:34:34.850 --> 0:34:36.646 it parallel? 0:34:42.402 --> 0:34:50.912 However, the biggest challenge, as already pointed out in the beginning, is the size: we 0:34:50.912 --> 0:34:53.329 have millions of source and target sentences. 0:34:53.713 --> 0:35:05.194 The number of comparisons is n squared, so the naive approach is very inefficient, and we need 0:35:05.194 --> 0:35:06.355 to find something better. 0:35:07.087 --> 0:35:16.914 Traditionally there is the first approach mentioned before, the local or hierarchical 0:35:16.914 --> 0:35:20.292 mining, and there the idea is: OK, 0:35:20.292 --> 0:35:23.465 first we align documents. 0:35:23.964 --> 0:35:32.887 And once you have the document alignment, you only need to align sentences within aligned documents. 0:35:33.273 --> 0:35:51.709 That of course makes everything more efficient, because we don't have to do all the comparisons. 0:35:53.253 --> 0:35:56.411 This is done, for example, in the before-mentioned ParaCrawl. 0:35:57.217 --> 0:36:11.221 But it has the issue that if the document alignment is bad, you have error propagation, and you cannot 0:36:11.221 --> 0:36:14.211 recover from that, 0:36:14.494 --> 0:36:20.715 because in document pairs that were never aligned, there may still be some sentences which are parallel. Therefore, 0:36:20.715 --> 0:36:24.973 more recently, there is also what is referred to as global mining. 0:36:26.366 --> 0:36:31.693 And there we really do all of this. 0:36:31.693 --> 0:36:43.266 Although it's n squared, we are doing all the comparisons, just very efficiently. 0:36:43.523 --> 0:36:52.588 So the idea is that you represent all the sentences in a vector space. 0:36:52.892 --> 0:37:06.654 And then it's about nearest neighbor search, and there are a lot of very efficient algorithms for that. 0:37:07.067 --> 0:37:20.591 If you then only compare each sentence to its nearest neighbors, you don't have to do the full n-squared comparison, 0:37:20.591 --> 0:37:22.584 but you still find the pairs.
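A minimal sketch of such a nearest neighbor search with FAISS (the embedding dimension, the random vectors, the neighbor count, and the threshold are illustrative assumptions; the margin shown here is a simplified, forward-only version of the bidirectional margin scoring used in real global mining pipelines):

```python
import numpy as np
import faiss

dim = 1024                                           # e.g. embedding size
src = np.random.rand(10000, dim).astype("float32")   # source embeddings
tgt = np.random.rand(12000, dim).astype("float32")   # target embeddings
faiss.normalize_L2(src)
faiss.normalize_L2(tgt)

index = faiss.IndexFlatIP(dim)       # inner product = cosine after L2 norm
index.add(tgt)
k = 4
scores, ids = index.search(src, k)   # k nearest target sentences per source

# margin scoring: a candidate only counts if it is clearly closer
# than the average of its k nearest neighbours
margin = scores[:, 0] / scores.mean(axis=1)
keep = margin > 1.05                 # threshold is an assumption
```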
0:37:26.186 --> 0:37:40.662 So in the first step, what we want to look at is this: the document classification, which refers 0:37:40.662 --> 0:37:49.584 to the document alignment; and then we do the sentence alignment. 0:37:51.111 --> 0:37:58.518 If we're talking about document alignment, there are typically two steps in that: we 0:37:58.518 --> 0:38:01.935 first do a candidate selection. 0:38:01.935 --> 0:38:10.904 Often we have several stages, and that is again to make things more efficient. 0:38:10.904 --> 0:38:13.360 We have the candidate selection first. 0:38:13.893 --> 0:38:18.402 The candidate selection means: OK, which documents do we want to compare at all? 0:38:19.579 --> 0:38:35.364 Then, if we have initial candidates which might be parallel, we can do a classification step. 0:38:35.575 --> 0:38:37.240 And there are different ways: 0:38:37.240 --> 0:38:40.397 we can use lexical similarity, or we can use embedding-based similarity. 0:38:41.321 --> 0:38:48.272 The first and easiest option is to take all possible candidates. 0:38:48.272 --> 0:38:55.223 That's one possibility; the other one is based on structural information. 0:38:55.235 --> 0:39:05.398 So based on how your website looks, you might find which pages can be translations at all. 0:39:05.825 --> 0:39:14.789 This is typically the only case where we try to use some kind of meta information, which 0:39:14.789 --> 0:39:22.342 can be very useful, because we know that websites, for example, are linked. 0:39:22.722 --> 0:39:35.586 We can try to use URL patterns: if we have some website whose address ends in a marker for German, there may be a parallel one ending in a marker for English. 0:39:35.755 --> 0:39:43.932 So that can easily be used in order to find candidates. 0:39:43.932 --> 0:39:49.335 Then we only compare websites whose URLs match such a pattern, 0:39:49.669 --> 0:40:05.633 where the URL indicates the language and that they are translations of each other; typically you use several heuristics to 0:40:05.633 --> 0:40:07.178 do that; a small sketch follows below. 0:40:07.267 --> 0:40:16.606 Then you don't have to compare all websites, but only the matching ones. 0:40:17.277 --> 0:40:27.607 Of course there are problems, especially with nowadays' content management systems; 0:40:27.607 --> 0:40:32.912 sometimes the URLs are nice and easy to read, sometimes not. 0:40:33.193 --> 0:40:44.452 On the other hand, there are typically links from the parent site to the different language versions. 0:40:44.764 --> 0:40:46.632 You can look at the KIT websites: 0:40:46.632 --> 0:40:49.381 it's the same thing, you can check the different 0:40:49.609 --> 0:41:06.833 languages: you can either do that from the parent website, or you can also click on English. 0:41:06.926 --> 0:41:10.674 So you can either compare all the linked websites, 0:41:10.971 --> 0:41:18.205 or be even more focused and check whether the link text is somehow a language flag or the language 0:41:18.205 --> 0:41:18.677 name. 0:41:19.019 --> 0:41:24.413 So it really depends on how much you want to filter out. 0:41:24.413 --> 0:41:29.178 There is always a trade-off between being efficient and having good coverage. 0:41:33.913 --> 0:41:49.963 Based on that we then have our candidate list; so instead of two independent sets of German and English 0:41:49.963 --> 0:41:52.725 documents, we now have candidate pairs. 0:41:53.233 --> 0:42:03.515 And now the task is: we want to extract those which are really translations of each other. 0:42:03.823 --> 0:42:10.201 So the question is how we can measure the document similarity, 0:42:10.201 --> 0:42:14.655 because what we then do is measure the similarity and decide. 0:42:14.955 --> 0:42:27.096 And here you already see why this is problematic in cases where documents are only partially parallel or merely 0:42:27.096 --> 0:42:28.649 similar.
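Returning to the URL-pattern heuristic from above, here is a minimal candidate-selection sketch (plain Python; the language markers and the example URLs are illustrative assumptions):

```python
import re
from collections import defaultdict

def url_candidates(urls, lang_a="de", lang_b="en"):
    """Group URLs that only differ in a language marker like /de/
    vs /en/ and propose them as document pair candidates."""
    buckets = defaultdict(dict)
    pattern = re.compile(r"/(%s|%s)(/|$)" % (lang_a, lang_b))
    for url in urls:
        m = pattern.search(url)
        if m:
            key = pattern.sub(r"/LANG\2", url)   # language-neutral key
            buckets[key][m.group(1)] = url
    return [(b[lang_a], b[lang_b]) for b in buckets.values()
            if lang_a in b and lang_b in b]

# url_candidates(["https://x.org/de/kontakt", "https://x.org/en/kontakt"])
# -> [("https://x.org/de/kontakt", "https://x.org/en/kontakt")]
```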
0:42:30.330 --> 0:42:37.594 As for measuring the similarity itself, how you can do that is again twofold. 0:42:37.594 --> 0:42:48.309 You can do it more content-based or more structure-based: 0:42:48.188 --> 0:42:53.740 calculating a lot of features and then maybe training a classifier on a small annotated set, which 0:42:53.740 --> 0:42:57.084 then decides, based on these features, whether 0:42:57.084 --> 0:42:58.661 the document pair is parallel. 0:43:00.000 --> 0:43:10.955 One type of features is structural, and one idea there is the text length of 0:43:10.955 --> 0:43:12.718 the documents. 0:43:13.213 --> 0:43:20.511 Of course, the text lengths will not be exactly the same, but if one document has fifty words and 0:43:20.511 --> 0:43:24.907 the other five thousand words, it's quite surely not a translation. 0:43:25.305 --> 0:43:29.274 So you can use the text length as one proxy for 0:43:29.274 --> 0:43:32.334 whether this might be a translation pair. 0:43:32.712 --> 0:43:41.316 Another thing is the alignment between the structures: 0:43:41.316 --> 0:43:52.151 from a website you can extract some type of structure, like the HTML tree. 0:43:52.332 --> 0:44:04.958 You can compare that to the French version and then calculate some similarity, because 0:44:04.958 --> 0:44:07.971 translated pages often keep the structure. 0:44:08.969 --> 0:44:12.172 Of course, this gets more and more problematic: 0:44:12.172 --> 0:44:16.318 if a translation has a different structure, these features are less helpful. 0:44:16.318 --> 0:44:22.097 However, if you do it in a trained way, you can automatically learn how helpful 0:44:22.097 --> 0:44:22.725 they are. 0:44:24.704 --> 0:44:37.516 Then there are content-based features. One easy thing, especially if 0:44:37.516 --> 0:44:48.882 you have languages that use the same script, is to look for words that occur on both sides. 0:44:48.888 --> 0:44:49.611 The lexical overlap: 0:44:49.611 --> 0:44:53.149 we call this a bag of words, and we'll look into it. 0:44:53.149 --> 0:44:55.027 You can use some type of word overlap. 0:44:55.635 --> 0:44:58.418 And neural embeddings can also be used to compare documents. 0:45:02.742 --> 0:45:06.547 And since we have machine translation, 0:45:06.906 --> 0:45:14.640 one idea that you can also pursue is to really use the machine translation system. 0:45:14.874 --> 0:45:22.986 This is the one which takes more effort, so what you then have to do is put in more compute. 0:45:23.203 --> 0:45:37.526 You wouldn't do this type of machine translation based approach in the very first, high-throughput stage. 0:45:38.018 --> 0:45:53.712 But maybe you're first thinking: why can I do that at all, since I'm collecting data precisely to build 0:45:53.712 --> 0:45:55.673 an MT system? 0:45:55.875 --> 0:46:01.628 Well, you can use an initial system to translate, and then you can collect more data with it. 0:46:01.901 --> 0:46:06.879 And one way of doing that is: you translate, for example, all documents into English. 0:46:07.187 --> 0:46:25.789 Then you only need to compare English data, and you do it, in the example, with trigrams. 0:46:25.825 --> 0:46:33.253 For example, this trigram occurs in document one, which was originally German, 0:46:33.253 --> 0:46:37.641 in document two, which was Spanish, and in document three, which was French. 0:46:37.637 --> 0:46:52.225 You create this index, and based on it you can calculate how similar the documents are. 0:46:52.092 --> 0:46:58.190 Then you can use the cosine similarity to calculate which is the most similar 0:46:58.190 --> 0:47:00.968 document, or how similar a document pair is, 0:47:00.920 --> 0:47:04.615 and then measure whether this is a possible translation; a small sketch follows below.
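A minimal sketch of this translate-then-compare document matching (assuming scikit-learn; the tiny example documents are made up, and character trigrams are used here as one possible choice of trigram features):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# both sides assumed already machine-translated into English
docs_a = ["the induction machine is described here", "contact our office"]
docs_b = ["here we describe the induction machine", "press releases 2023"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))  # trigrams
tfidf = vec.fit_transform(docs_a + docs_b)
sim = cosine_similarity(tfidf[:len(docs_a)], tfidf[len(docs_a):])

# sim[i, j] is high when docs_a[i] and docs_b[j] share many trigrams
best_match = sim.argmax(axis=1)
```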
0:47:05.285 --> 0:47:14.921 I mean, of course, the documents will not be exactly the same, and even if you have a parallel 0:47:14.921 --> 0:47:18.483 document pair, French and German, 0:47:18.898 --> 0:47:29.086 the MT output will not be a perfect translation; therefore you look at the n-gram overlap, since 0:47:29.086 --> 0:47:31.522 at least some overlap should be there. 0:47:34.074 --> 0:47:42.666 Okay, before we take the next step and go into the sentence alignment, are there more 0:47:42.666 --> 0:47:44.764 questions about the document alignment? 0:47:51.131 --> 0:47:55.924 Good. 0:47:56.997 --> 0:47:59.384 Well, 0:48:00.200 --> 0:48:05.751 there are different ways of doing sentence alignment. 0:48:05.751 --> 0:48:12.036 One I'll describe here is called hunalign. 0:48:12.172 --> 0:48:17.590 We have the advantage that we now have aligned documents, so we might have like a hundred 0:48:17.590 --> 0:48:20.299 sentences on the source side and a hundred on the target side. 0:48:20.740 --> 0:48:31.909 It still might be too expensive to compare everything with everything, 0:48:31.791 --> 0:48:37.541 and therefore these tools typically even assume that we are only interested in alignments 0:48:37.541 --> 0:48:40.800 that lie close to the diagonal of the similarity matrix. 0:48:40.800 --> 0:48:46.422 Not exactly the diagonal, but some band around it, in order to make 0:48:46.422 --> 0:48:47.891 things more efficient. 0:48:48.108 --> 0:48:55.713 You can restrict to this band because, if this really is a parallel document pair, we 0:48:55.713 --> 0:48:56.800 assume the sentence order is roughly the same; 0:48:56.836 --> 0:49:05.002 otherwise it probably wouldn't have passed the document alignment. 0:49:05.505 --> 0:49:06.774 In hunalign, 0:49:06.774 --> 0:49:10.300 we then calculate the similarity for these pairs. 0:49:10.270 --> 0:49:17.428 This is based on a bilingual dictionary, so it may be based on how much dictionary overlap you 0:49:17.428 --> 0:49:17.895 have. 0:49:18.178 --> 0:49:24.148 And then we find a path through this matrix: 0:49:24.148 --> 0:49:31.089 a path which maximizes the similarity along the way. 0:49:31.271 --> 0:49:41.255 So you're trying to find a path through your documents such that you get the parallel sentence pairs. 0:49:41.201 --> 0:49:49.418 The matches on this path are then what you extract as parallel. 0:49:51.011 --> 0:50:05.579 The advantage is that this, on the one hand, limits your search space. 0:50:05.579 --> 0:50:07.521 That is one thing. 0:50:07.787 --> 0:50:10.013 And what does it also mean? 0:50:10.013 --> 0:50:19.120 Even if you have a very high-scoring pair, you're not taking it if it doesn't fit the overall path. 0:50:19.399 --> 0:50:27.063 So sometimes it makes sense to also use this global information and not only compare 0:50:27.063 --> 0:50:34.815 individual sentences, because what you see in practice is that a pair sometimes only looks like a 0:50:34.815 --> 0:50:36.383 good translation in isolation. 0:50:38.118 --> 0:50:51.602 So by this path constraint you prevent the system from picking pairs at the borders where there's 0:50:51.602 --> 0:50:52.201 no real correspondence. 0:50:53.093 --> 0:50:55.689 So that might give you a bit better quality. 0:50:56.636 --> 0:51:12.044 The path always runs from the start to the end of both documents, but it also means you can't 0:51:12.044 --> 0:51:15.126 handle everything: 0:51:15.375 --> 0:51:24.958 you have some restrictions, that is right; first of all, sentences can't be translated out of order.
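As a toy version of the dynamic program behind such band-limited aligners (plain Python; the band width, the 1-1 alignment restriction, and the assumption of non-negative similarities are simplifications, not how hunalign is actually implemented):

```python
def align_band(sim, band=5):
    """Find a monotone alignment path through a sentence similarity
    matrix sim[i][j], only exploring a band around the diagonal."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if abs(i - j) > band or best[i][j] == NEG:
                continue
            # skip a source sentence, skip a target sentence, or align
            moves = [((i + 1, j), 0.0), ((i, j + 1), 0.0)]
            if i < n and j < m:
                moves.append(((i + 1, j + 1), sim[i][j]))
            for (ni, nj), gain in moves:
                if ni <= n and nj <= m and best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[(ni, nj)] = (i, j)
    # trace the path back; diagonal steps are the aligned pairs
    pairs, cell = [], (n, m)
    while cell in back:
        prev = back[cell]
        if cell[0] == prev[0] + 1 and cell[1] == prev[1] + 1:
            pairs.append(prev)
        cell = prev
    return list(reversed(pairs))
```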
0:51:25.285 --> 0:51:32.572 So hunalign typically only really works well if you have relatively high-quality parallel documents. 0:51:32.752 --> 0:51:39.038 So if you have this more general web data, where some parts are translated and 0:51:39.038 --> 0:51:39.471 some are not, 0:51:39.719 --> 0:51:43.604 it doesn't really work that well. 0:51:43.604 --> 0:51:53.157 It's okay with having maybe at the end some sentences which are missing, but in general 0:51:53.453 --> 0:51:59.942 it's not robust against significant noise in the data. 0:52:05.765 --> 0:52:12.584 The second approach is what is referred to as Bleualign. 0:52:13.233 --> 0:52:16.982 And this one does not use as much 0:52:16.977 --> 0:52:30.220 global information: you can translate each sentence into English, and then you calculate 0:52:30.220 --> 0:52:34.885 the BLEU score of the translation against the target sentences (a small sketch follows below). 0:52:35.095 --> 0:52:41.888 That way you would get a few anchor points, which are the ones in purple here. 0:52:42.062 --> 0:52:56.459 And then you have the ability to add some points around them, whose scores might be a bit lower. 0:52:56.756 --> 0:53:06.962 But in this case you are able to deal with reorderings, and also with parts that are missing. 0:53:07.247 --> 0:53:16.925 However, in this case we need a full-scale MT system to do this calculation, while 0:53:16.925 --> 0:53:17.686 before we only needed a dictionary. 0:53:18.318 --> 0:53:26.637 Then, of course, the better your similarity metric is, so the better you are able to do 0:53:26.637 --> 0:53:35.429 this comparison, the less you have to rely on structural information about the documents. 0:53:39.319 --> 0:53:53.411 Any more questions? Then there are things like vecalign, which try to do the same with embeddings. 0:53:53.793 --> 0:53:59.913 The idea is that you embed each sentence 0:53:59.819 --> 0:54:02.246 in a cross-lingual vector space. 0:54:02.246 --> 0:54:08.128 A cross-lingual vector space always means that you have a common vector space: 0:54:08.128 --> 0:54:14.598 in this case, a vector space where sentences in different languages are near to 0:54:14.598 --> 0:54:16.069 each other if they have a similar meaning. 0:54:16.316 --> 0:54:23.750 So a sentence and its translation should lie next to each other in this space. 0:54:24.104 --> 0:54:32.009 And then you can of course measure the similarity by some distance metric in this 0:54:32.009 --> 0:54:32.744 vector space. 0:54:33.033 --> 0:54:36.290 And you're saying two sentences are translations 0:54:36.290 --> 0:54:39.547 if the distance in the vector space is somehow small. 0:54:40.240 --> 0:54:50.702 We'll discuss that in a bit more detail soon, because these vector-space embeddings come 0:54:50.702 --> 0:54:52.010 up again later. 0:54:52.392 --> 0:54:55.861 So the nice thing with this is: 0:54:55.861 --> 0:55:05.508 it gets quite good quality and can decide whether two sentences 0:55:05.508 --> 0:55:08.977 are translations of each other. 0:55:08.888 --> 0:55:14.023 That's done in the vecalign approach, and often these methods even work in a global search way, really 0:55:14.023 --> 0:55:15.575 comparing everything to everything. 0:55:16.236 --> 0:55:29.415 What vecalign also does is try to make finding the alignment more efficient. 0:55:29.309 --> 0:55:40.563 If you don't want to compare everything to everything, you first compare sentence blocks 0:55:41.141 --> 0:55:42.363 and then refine. 0:55:42.562 --> 0:55:55.053 You keep full sentence resolution at the end, but you only compare in the area around the coarse path.
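Going back to the Bleualign idea for a moment, here is a hedged sketch. The `mt_translate` stub, the example sentences, and the threshold of 15 are all made up; the real tool is more involved, but the core of it is scoring MT output against target sentences with sentence-level BLEU to find anchors.

```python
# Anchor finding in the Bleualign style: translate each source sentence,
# then pick the target sentence with the highest sentence-level BLEU.
import sacrebleu

def mt_translate(sentence):
    # stub: a real system would translate the source language into English
    return sentence

# pretend these are the MT outputs of the foreign source sentences
src_mt = [mt_translate(s) for s in ["the cat sat on the mat",
                                    "it was raining heavily that day"]]
tgt = ["the cat was sitting on the mat",
       "heavy rain fell on that day",
       "a completely different sentence"]

anchors = []
for i, hyp in enumerate(src_mt):
    scores = [sacrebleu.sentence_bleu(hyp, [t]).score for t in tgt]
    j = max(range(len(tgt)), key=scores.__getitem__)
    if scores[j] > 15:                    # made-up confidence threshold
        anchors.append((i, j, round(scores[j], 1)))
print(anchors)  # high-confidence anchor points (source idx, target idx, BLEU)
```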
0:55:55.475 --> 0:56:11.501 So if you compare blocks on the source and the target side, then you have far fewer possibilities. 0:56:11.611 --> 0:56:17.262 So here the n-times-m comparison on blocks is a lot smaller than the comparison you would have on single sentences. 0:56:17.777 --> 0:56:23.750 And with neural embeddings you can embed not only single sentences but whole blocks. 0:56:24.224 --> 0:56:28.073 So how do you make this fast? 0:56:28.073 --> 0:56:35.643 You're starting from a coarse-grained resolution here. 0:56:36.176 --> 0:56:47.922 Then you're getting a rough path where the alignment could be, and near this path you're doing 0:56:47.922 --> 0:56:49.858 more and more fine-grained comparisons. 0:56:52.993 --> 0:56:59.352 And this is the vecalign idea: these are the source and the target sentences. 0:57:00.100 --> 0:57:16.163 For example: "While it was sleeping in the forest" and so on, "I thought it was very strange to see this man." 0:57:16.536 --> 0:57:25.197 So you have the sentences, but if you do blocks, you have blocks of consecutive sentences. 0:57:30.810 --> 0:57:38.514 This was the pipeline approach. 0:57:38.514 --> 0:57:46.710 Now we want to look at the global mining approach, but before that, any questions? 0:57:53.633 --> 0:58:07.389 In the global mining setting we also have to do some filtering, and so typically they start 0:58:10.290 --> 0:58:14.256 from the raw crawled data and then do some preprocessing. 0:58:14.254 --> 0:58:17.706 So you first try to deduplicate paragraphs. 0:58:17.797 --> 0:58:30.622 Of course, if you compare everything with everything and have the same input twice, 0:58:30.622 --> 0:58:35.748 you will also get duplicated results. So the first step is that you deduplicate: 0:58:35.748 --> 0:58:37.385 you keep each paragraph only once. 0:58:37.958 --> 0:58:42.079 There's a lot of text which occurs many times; 0:58:42.079 --> 0:58:44.585 it will appear all over the crawl. 0:58:44.884 --> 0:58:57.830 Think of the cookie banners you see everywhere, and pages about accepting things. 0:58:58.038 --> 0:59:04.963 So you can deduplicate here, and maybe your crawler has also crawled the same website twice. 0:59:06.066 --> 0:59:11.291 Then you can remove low-quality data like cookie warnings and boilerplate. 0:59:13.173 --> 0:59:19.830 Then you're doing language identification. 0:59:19.830 --> 0:59:29.936 That means you want to know, for each sentence or paragraph, 0:59:29.936 --> 0:59:38.695 which language it is in, so that you can then look for translations between the right languages. 0:59:39.259 --> 0:59:44.987 Finally, there is some perplexity-based filtering, where you remove, for example, text with very 0:59:44.987 --> 0:59:46.069 high perplexity (a small sketch of these filters follows below). 0:59:46.326 --> 0:59:59.718 That means, for example, data where there are a lot of strange names or garbled tokens. 1:00:00.520 --> 1:00:09.164 Sometimes it also helps to remove data at the other end of the scale, because very low perplexity can indicate machine-generated 1:00:09.164 --> 1:00:09.722 data. 1:00:11.511 --> 1:00:17.632 And then the model which is mostly used for the embeddings is what is called the LASER model. 1:00:18.178 --> 1:00:21.920 It's based on machine translation. 1:00:21.920 --> 1:00:28.442 You hopefully all recognize the machine translation architecture here. 1:00:28.442 --> 1:00:37.103 However, there is a difference between a general machine translation system and this 1:01:00.000 --> 1:01:13.322 machine translation system.
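Returning to the preprocessing steps above (deduplication, boilerplate removal, language identification), here is a rough sketch of how such a filter could look. The fastText model name refers to the publicly available language-ID model; the boilerplate patterns and the confidence threshold are illustrative assumptions, not a fixed recipe.

```python
# Sketch of a paragraph-level cleaning pass: exact deduplication via
# hashing, crude boilerplate removal, and fastText language ID.
import hashlib
import re
import fasttext  # pip install fasttext; lid.176.bin from fasttext.cc

lid = fasttext.load_model("lid.176.bin")   # off-the-shelf LID model
BOILERPLATE = re.compile(r"accept (all )?cookies|privacy policy", re.I)

def preprocess(paragraphs, want_lang="de", min_conf=0.8):
    seen, kept = set(), []
    for p in paragraphs:
        h = hashlib.sha1(p.strip().lower().encode()).hexdigest()
        if h in seen:                       # exact-duplicate paragraph
            continue
        seen.add(h)
        if BOILERPLATE.search(p):           # cookie banners etc.
            continue
        labels, probs = lid.predict(p.replace("\n", " "))
        if labels[0] == f"__label__{want_lang}" and probs[0] >= min_conf:
            kept.append(p)
    return kept
```

A perplexity filter would be one more pass of the same shape, scoring each surviving paragraph with a language model and dropping the extremes.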
1:01:14.314 --> 1:01:24.767 Do you see the one bigger difference, apart from details of the architecture? 1:01:25.405 --> 1:01:39.768 There is one difference to the standard model with attention: here we are having only a single sentence embedding, 1:01:40.160 --> 1:01:43.642 and then we are using that vector at each decoder time step. 1:01:44.004 --> 1:01:54.295 Therefore, it's maybe a bit similar to the original NMT systems without attention; 1:01:54.295 --> 1:01:56.717 it's quite similar to those. 1:01:57.597 --> 1:02:10.011 However, it has the disadvantage that we have to put everything into one vector, and 1:02:10.011 --> 1:02:14.329 maybe not all information fits in there. 1:02:15.055 --> 1:02:25.567 However, in this type of framework we are not really interested in the machine translation itself, 1:02:25.567 --> 1:02:27.281 so for this model that's fine. 1:02:27.527 --> 1:02:34.264 So we are training it to do machine translation, 1:02:34.264 --> 1:02:42.239 and what that means is that in the end, as much information as possible should be in the sentence embedding: 1:02:43.883 --> 1:03:01.977 only if all the important information is in here is the decoder able to do the machine translation really well. 1:03:02.642 --> 1:03:07.801 So that is the first step we are doing here: 1:03:07.801 --> 1:03:17.067 we are building the MT system, not with the goal of making the best MT system, but with 1:03:17.067 --> 1:03:22.647 the goal of learning sentence embeddings that hopefully capture all the important information. 1:03:22.882 --> 1:03:26.116 Because otherwise we won't be able to generate the translation. 1:03:26.906 --> 1:03:31.287 So it's a bit like a bottleneck: we try to put in as much information as possible. 1:03:32.012 --> 1:03:36.426 And think about what we want to do later: finding the nearest neighbors or something like that. 1:03:37.257 --> 1:03:48.680 Finding similarities is typically only feasible with fixed-dimensional representations, so we can do 1:03:48.680 --> 1:03:56.803 that in an n-dimensional space and find the nearest neighbor there. 1:03:57.857 --> 1:03:59.837 With variable-length representations it would be very difficult. 1:04:00.300 --> 1:04:03.865 There's one more thing that we want: 1:04:03.865 --> 1:04:09.671 we want to find the nearest neighbor in the other language. 1:04:10.570 --> 1:04:13.424 Do you have an idea how we can train this 1:04:13.424 --> 1:04:16.542 so that the embeddings of different languages can be compared? 1:04:23.984 --> 1:04:36.829 Any idea? Think about what we did two or three lectures ago. 1:04:41.301 --> 1:04:50.562 We can train them in a multilingual setting, and that's how it's done in LASER: so we're 1:04:50.562 --> 1:04:56.982 not doing it only from German to English, but we're training on many language pairs. 1:04:57.017 --> 1:05:04.898 If the English decoder has to be useful for German, French and so on, and the German 1:05:04.898 --> 1:05:13.233 encoder's output also has to be useful for several decoders, then somehow we'll automatically 1:05:13.233 --> 1:05:16.947 learn that these embeddings are comparable across languages. 1:05:17.437 --> 1:05:28.562 And then a sentence and its translation will end up with a similar sentence embedding. 1:05:28.908 --> 1:05:39.734 If you put in here a German and a French sentence which both have the same 1:05:39.734 --> 1:05:48.826 English translation, the decoder has to generate exactly the same output for both, and that's 1:05:48.826 --> 1:05:50.649 of course easiest if the embeddings are the same. 1:05:51.151 --> 1:05:59.817 If the embeddings were very different, then the English decoder would most likely also produce different outputs. 1:06:02.422 --> 1:06:04.784 So that is the first thing. 1:06:04.784 --> 1:06:06.640 Now we have this model.
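A simplified sketch of such an encoder, assuming a PyTorch setup. This is not the actual LASER code, but it shows the shape of the idea: a BiLSTM over token embeddings, max-pooled over time into one fixed-size vector that a decoder would then see at every step, since there is no attention back into the encoder states.

```python
# LASER-style fixed-size sentence encoder (sketch, hypothetical sizes).
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=320, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out_dim = 2 * hidden

    def forward(self, token_ids):                    # (batch, seq_len)
        states, _ = self.lstm(self.emb(token_ids))   # (batch, seq, 2*hidden)
        # max-pool over time: one fixed-size vector per sentence,
        # the bottleneck the decoder has to translate from
        return states.max(dim=1).values

enc = SentenceEncoder(vocab_size=32000)
emb = enc(torch.randint(0, 32000, (2, 12)))  # -> shape (2, 1024)
```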
1:06:06.640 --> 1:06:10.014 It has to be trained on parallel data. 1:06:10.390 --> 1:06:22.705 Then we can use these embeddings on our new data and try to use them to make efficient 1:06:22.705 --> 1:06:24.545 comparisons. 1:06:26.286 --> 1:06:30.669 So how can you do the comparison? 1:06:30.669 --> 1:06:37.243 Maybe the first thing you think of is a simple threshold. 1:06:37.277 --> 1:06:44.365 So you take all the German sentences and all the French sentences, 1:06:44.365 --> 1:06:49.460 and you compute the cosine similarity between them. 1:06:49.469 --> 1:06:58.989 And then you take all pairs where the similarity is very high. 1:07:00.180 --> 1:07:17.242 So you have your German list, you have your French list, and then you just take all sentence pairs above the threshold. 1:07:19.839 --> 1:07:29.800 With the amount of data we have, that is of course computationally demanding; 1:07:29.800 --> 1:07:32.317 that's a good point, and we'll come back to it. 1:07:35.595 --> 1:07:45.738 Thresholding itself is also not that easy. One problem is that typically there are some regions where there are very many points, 1:07:46.066 --> 1:07:48.991 and other regions where there are very few points in the neighborhood. 1:07:49.629 --> 1:08:06.241 And then in the dense regions you might extract far too many pairs, and in the sparse regions too few. 1:08:08.868 --> 1:08:18.341 So what typically is done is the margin criterion: 1:08:18.341 --> 1:08:25.085 how good is a pair compared to the other candidates around it? 1:08:25.305 --> 1:08:33.859 So you take the similarity between x and y, and then you look at the eight nearest 1:08:33.859 --> 1:08:35.190 neighbors 1:08:35.115 --> 1:08:48.461 of x and the eight nearest neighbors of y, and you divide the similarity of the pair by the average similarity to 1:08:48.461 --> 1:08:51.411 these eight neighbors. 1:08:51.671 --> 1:09:00.333 So what you are really asking is: are these two sentences a lot more similar to each other than to all the others around them? 1:09:00.840 --> 1:09:13.455 And if they are exceptionally similar compared to the other sentences, then they should be translations. 1:09:16.536 --> 1:09:24.148 Of course, that also has some corner cases. If there are a lot of similar sentences in the neighborhood, 1:09:24.584 --> 1:09:32.824 then the pair's score gets normalized down; if all the others are far away, 1:09:32.824 --> 1:09:36.626 then the pair stands out and is accepted as a translation. 1:09:37.057 --> 1:09:40.895 Think about short sentences, for example: 1:09:40.895 --> 1:09:47.658 it might be that they are similar to many things just in general, without being translations. 1:09:49.129 --> 1:09:59.220 One problem is that we now assume there is only one translation per sentence. 1:09:59.759 --> 1:10:09.844 So it has some problems if there are two or three valid translations of a sentence; 1:10:09.844 --> 1:10:18.853 then this method might not find them all, but in general it works quite well. 1:10:19.139 --> 1:10:27.397 For example, people have mined all of Common Crawl like this, 1:10:27.397 --> 1:10:32.802 and they have built large parallel data sets from it. 1:10:36.376 --> 1:10:38.557 One point maybe also here: 1:10:38.557 --> 1:10:45.586 of course, now it's important that we have done the deduplication before, because if we 1:10:45.586 --> 1:10:52.453 hadn't, we would have points at exactly the same coordinates, and a duplicate would be its own most similar neighbor, which breaks the margin. 1:10:57.677 --> 1:11:03.109 Maybe one small thing to add: 1:11:03.109 --> 1:11:09.058 a major issue in this case is still efficiency.
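As a sketch, the margin criterion described above could look like this in plain NumPy. This is the ratio variant: the pair similarity is divided by the average similarity to the k nearest neighbors on both sides (k = 8, as in the lecture); the random data is only for demonstration.

```python
# Margin-based scoring of candidate pairs: a pair scores highly only if
# it stands out from its neighborhood on both sides.
import numpy as np

def margin_scores(x, y, k=8):
    # x: (n, d) source embeddings, y: (m, d) target embeddings, L2-normalized
    sim = x @ y.T                                         # cosine similarities
    k_x = min(k, sim.shape[1])
    k_y = min(k, sim.shape[0])
    # average similarity of each sentence to its k nearest neighbors
    avg_x = np.sort(sim, axis=1)[:, -k_x:].mean(axis=1)   # per source, (n,)
    avg_y = np.sort(sim, axis=0)[-k_y:, :].mean(axis=0)   # per target, (m,)
    return sim / (0.5 * (avg_x[:, None] + avg_y[None, :]))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 64)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(120, 64)); y /= np.linalg.norm(y, axis=1, keepdims=True)
scores = margin_scores(x, y)
best = np.unravel_index(scores.argmax(), scores.shape)
print(best, scores[best])   # most margin-worthy candidate pair
```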
1:11:09.409 --> 1:11:18.056 So you still have to do all of these comparisons, and that cannot be done by simple brute force. 1:11:19.199 --> 1:11:27.322 So what is typically done: first, a lot of this can be done in parallel. 1:11:28.368 --> 1:11:36.024 Calculating the embeddings and all that stuff doesn't need to be sequential; it's 1:11:36.024 --> 1:11:37.143 independent for each sentence. 1:11:37.357 --> 1:11:48.680 What you typically do is create an index, and then you do some kind of quantization. 1:11:48.708 --> 1:11:57.047 So there is the FAISS library, which does k-nearest-neighbor search very efficiently 1:11:57.047 --> 1:11:59.597 in very high-dimensional spaces (see the sketch below). 1:12:00.080 --> 1:12:03.410 And then based on that you can do the comparison. 1:12:03.410 --> 1:12:06.873 You can even do the comparison in parallel, because 1:12:06.906 --> 1:12:13.973 you can look at different areas of your space and then compare within the different pieces to find 1:12:13.973 --> 1:12:14.374 the nearest neighbors. 1:12:15.875 --> 1:12:30.790 With this you are then able to do very fast calculations on this type of sentence embedding. 1:12:31.451 --> 1:12:34.761 So yeah, this is currently one of the main approaches. 1:12:35.155 --> 1:12:48.781 Most of the big mined corpora, like ParaCrawl, 1:12:48.668 --> 1:12:55.543 are collected this way, and these are very big corpora, also for languages for which 1:12:55.543 --> 1:12:57.453 you can otherwise hardly find data. 1:12:58.778 --> 1:13:01.016 Do you have any more questions on this? 1:13:05.625 --> 1:13:17.306 And then some more words on this last step here. So we have now done our parallel mining, 1:13:17.306 --> 1:13:25.165 and we could assume that everything is fine now. 1:13:25.465 --> 1:13:35.238 However, the problem with this data is that typically it is still quite noisy. 1:13:36.176 --> 1:13:44.533 In order to make things efficient and to have a high recall, the extracted data is often not 1:13:44.533 --> 1:13:49.547 of the best quality. 1:13:49.789 --> 1:13:58.870 So it is essential to do another filtering step and to remove sentence pairs which only 1:13:58.870 --> 1:14:01.007 seem to be translations. 1:14:01.341 --> 1:14:08.873 And here, of course, the final evaluation metric would be: how much does my system improve? 1:14:09.089 --> 1:14:23.476 There are even shared tasks on doing exactly that, where people get noisy data 1:14:23.476 --> 1:14:25.596 and have to filter it. 1:14:27.707 --> 1:14:34.247 However, this evaluation is of course very time-consuming, so you might not always want 1:14:34.247 --> 1:14:37.071 to run the full pipeline and training for every decision. 1:14:37.757 --> 1:14:51.614 So how can you model what we actually want to get? 1:14:51.871 --> 1:15:02.781 You want to have the best overall translation quality, but this is normally not something 1:15:02.781 --> 1:15:03.917 you can check for every candidate pair. 1:15:04.444 --> 1:15:12.389 And that's why you're doing this two-step approach: first the alignment, then the filtering. 1:15:12.612 --> 1:15:27.171 And once you do the sentence filtering on the already-reduced set, you can put a lot more effort into each comparison. 1:15:27.627 --> 1:15:37.472 For example, you can just translate the source and compare that translation with the 1:15:37.472 --> 1:15:40.404 target side and calculate how good the match is. 1:15:40.860 --> 1:15:49.467 And this, of course, you can do with the final set, but you can't do with your initial set 1:15:49.467 --> 1:15:50.684 of millions of candidates.
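Going back to the nearest-neighbor search step, minimal FAISS usage could look as follows. The exact flat index shown here is the simplest variant; real pipelines use quantized approximate indexes to fit very large collections, but the calling pattern is the same.

```python
# k-nearest-neighbor search over sentence embeddings with FAISS.
import faiss
import numpy as np

d = 64
xb = np.random.rand(10000, d).astype("float32")   # target-side embeddings
xq = np.random.rand(5, d).astype("float32")       # source-side queries
faiss.normalize_L2(xb)                            # unit length, so inner
faiss.normalize_L2(xq)                            # product = cosine

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(xb)
sims, ids = index.search(xq, 8)     # 8 nearest neighbors per query
print(ids[0], sims[0])              # candidates for the margin criterion
```

The neighbor lists and similarities returned here are exactly what the margin computation from before needs as input.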
1:15:54.114 --> 1:16:01.700 So what this is, again, is a classification task where you input a sentence pair, and then 1:16:01.700 --> 1:16:09.532 you have a binary criterion: these are sentence pairs with a high quality, and these are sentence 1:16:09.532 --> 1:16:11.653 pairs with a low quality. 1:16:12.692 --> 1:16:17.552 Does anybody see what might be a challenge if you want to train this type of classifier? 1:16:22.822 --> 1:16:26.477 How do you measure the quality exactly? That is probably a problem. 1:16:27.887 --> 1:16:39.195 Yes, that is one, that is true. There is an even simpler one: getting high-quality data 1:16:39.195 --> 1:16:42.426 here is not so difficult, but what about the rest? 1:16:43.303 --> 1:16:46.844 Probably we have a class imbalance; 1:16:46.844 --> 1:16:49.785 we don't see many bad-quality combinations. 1:16:49.785 --> 1:16:54.395 Exactly, it's hard to get them at the beginning. So how can you argue: 1:16:54.395 --> 1:16:58.405 where do you find bad quality, and what type of bad quality? 1:16:58.798 --> 1:17:05.122 Because if it's too easy, you could just take a random German and a random English sentence, but that is 1:17:05.122 --> 1:17:05.558 very easy to detect. 1:17:05.765 --> 1:17:15.747 What you're really interested in is bad-quality data which still passes your first initial 1:17:15.747 --> 1:17:16.405 step. 1:17:17.257 --> 1:17:28.824 For the classifier itself you can use any type of network or model; in the beginning, 1:17:28.824 --> 1:17:33.177 things like random forests were used. 1:17:33.613 --> 1:17:38.912 So the positive examples are quite easy to get: 1:17:38.912 --> 1:17:44.543 you just take existing high-quality parallel data. 1:17:45.425 --> 1:17:47.565 That is quite easy, 1:17:47.565 --> 1:17:55.482 and you normally don't need a lot of data to train such a classifier and validate it. 1:17:57.397 --> 1:18:12.799 The challenge is the negative samples, because how would you generate negative samples? 1:18:13.133 --> 1:18:17.909 Because the interesting negative examples are the ones which pass the first step but shouldn't pass the 1:18:17.909 --> 1:18:18.353 second. 1:18:18.838 --> 1:18:23.682 So how do you typically do it? 1:18:23.682 --> 1:18:28.994 You try to create synthetic negative examples. 1:18:28.994 --> 1:18:33.369 You can pair random sentences, 1:18:33.493 --> 1:18:45.228 or you can do frequency-based word replacements; but is this the typical error that you want to detect? 1:18:45.228 --> 1:18:52.074 This is the one major issue when you generate the data synthetically: 1:18:52.132 --> 1:19:02.145 it might not match well with the real errors that you're actually interested in. 1:19:02.702 --> 1:19:13.177 So one of the most challenging things here is to find negative samples which are hard enough 1:19:13.177 --> 1:19:14.472 to be useful. 1:19:17.537 --> 1:19:21.863 And the other thing which is difficult is, of course, the data ratio. 1:19:22.262 --> 1:19:24.212 Why is it important? 1:19:24.212 --> 1:19:29.827 Why is the ratio between positive and negative examples important here? 1:19:30.510 --> 1:19:40.007 Because in a case of class imbalance, the model could effectively learn to just always say it's positive and 1:19:40.007 --> 1:19:43.644 high quality, and it would mostly be right. 1:19:44.844 --> 1:19:46.654 Yes, so in training 1:19:46.654 --> 1:19:51.180 this is important, because otherwise it might be too easy 1:19:51.180 --> 1:19:52.414 to always predict one class.
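To make the synthetic-negative idea concrete, here is a small illustrative generator. All three corruption strategies and their mix are assumptions for demonstration, not a fixed recipe; the point is to produce negatives that resemble the errors which slip through the first mining step, rather than trivially unrelated pairs.

```python
# Synthetic negatives for the filtering classifier: misaligned pairs,
# truncated targets, and word replacements (sketch, made-up strategies).
import random

def make_negatives(pairs, vocab, n_per_pair=1, seed=0):
    rng = random.Random(seed)
    negatives = []
    for src, tgt in pairs:
        for _ in range(n_per_pair):
            kind = rng.choice(["misalign", "truncate", "replace"])
            if kind == "misalign":                 # wrong target sentence
                _, wrong = rng.choice(pairs)
                neg = (src, wrong) if wrong != tgt else None
            elif kind == "truncate":               # partially missing target
                words = tgt.split()
                neg = (src, " ".join(words[: max(1, len(words) // 2)]))
            else:                                  # corrupt a content word
                words = tgt.split()
                words[rng.randrange(len(words))] = rng.choice(vocab)
                neg = (src, " ".join(words))
            if neg:
                negatives.append(neg)
    return negatives
```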
1:19:52.732 --> 1:19:58.043 And on the other hand, at application time it's also important, because if we have 1:19:58.043 --> 1:20:03.176 an equal split, we're assuming that about half of the pairs are bad; if the real quality 1:20:03.176 --> 1:20:06.245 is worse or better, we might accept too many or too few. 1:20:06.626 --> 1:20:10.486 So this ratio is not easy to determine. 1:20:13.133 --> 1:20:16.969 What type of features can we use? 1:20:16.969 --> 1:20:23.175 Traditionally, one looked at things like word translation probabilities. 1:20:23.723 --> 1:20:37.592 And nowadays, of course, we can model this also with something like a pre-trained 1:20:40.200 --> 1:20:42.306 language model. 1:20:42.462 --> 1:20:49.763 So we can, for example, put the source and the target sentence in there, and then, 1:20:49.763 --> 1:20:56.497 based on a classification label, we can classify whether this is a parallel sentence pair or not. 1:20:56.476 --> 1:21:00.054 So it's more like a normal classification task. 1:21:00.160 --> 1:21:09.233 And by having a system which can take multilingual input, we can just put the two sentences in together; 1:21:09.233 --> 1:21:16.886 we can also encode the two independently of each other and compare the hidden representations. 1:21:17.657 --> 1:21:35.440 You can then, as with any other type of classifier, train it on top of these representations. 1:21:35.895 --> 1:21:42.801 For example, on the first special token, which tries to represent the full sentence; that's what you can classify on. 1:21:45.265 --> 1:21:46.881 The other thing you can do, of course, 1:21:46.881 --> 1:21:52.837 is take the summation of all the hidden states, as we discussed before. 1:21:58.698 --> 1:22:10.618 Okay, and then one thing which we skipped until now, and which I'll cover only briefly: fragment extraction. 1:22:10.630 --> 1:22:19.517 So if we have sentence pairs which are not really parallel, can we also extract information from 1:22:19.517 --> 1:22:20.096 them? 1:22:22.002 --> 1:22:25.627 So the task here is: 1:22:25.627 --> 1:22:33.603 we have a sentence pair, and we want to find, within 1:22:33.603 --> 1:22:38.679 this sentence pair, fragments that are parallel. 1:22:39.799 --> 1:22:46.577 And how that has been done, for example, is using lexical positive and negative associations. 1:22:47.187 --> 1:22:57.182 Then you can transform your target sentence into a signal and find regions where the association is positive. 1:22:57.757 --> 1:23:00.317 Let me make that a bit clearer. 1:23:00.480 --> 1:23:15.788 So you have here the English sentence and the other language, and you have a word alignment between 1:23:15.788 --> 1:23:18.572 them, and then: 1:23:18.818 --> 1:23:21.925 a word that is not aligned gives a negative signal. 1:23:22.322 --> 1:23:40.023 And then you do some smoothing on this signal, because you want to find an area where most of the words are aligned. 1:23:40.100 --> 1:23:51.742 With the smoothing, it doesn't matter if you have single alignment errors in between. 1:23:51.972 --> 1:23:58.813 So you try to find long segments where at least most of the words are somehow aligned. 1:24:00.040 --> 1:24:10.069 And then you take this span on each side and extract it as your parallel fragment. 1:24:10.630 --> 1:24:21.276 So in the end you not only have full sentences, but you also have partial sentences, which might 1:24:21.276 --> 1:24:27.439 be helpful, especially if you have a quite low-resource setup.
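A toy version of this smoothed alignment signal, with made-up example words, window size, and span rule. It only shows the mechanics of turning word-alignment information into a per-word signal and keeping the longest positive span as the extracted fragment.

```python
# Fragment extraction sketch: +1 for aligned target words, -1 for
# unaligned ones, smoothed with a moving average; keep the longest span
# where the smoothed signal stays positive.
import numpy as np

def extract_fragment(tgt_words, aligned_positions, window=3):
    signal = np.array([1.0 if i in aligned_positions else -1.0
                       for i in range(len(tgt_words))])
    smooth = np.convolve(signal, np.ones(window) / window, mode="same")
    best, cur = (0, 0), None
    for i, v in enumerate(smooth):
        if v > 0:
            cur = (cur[0], i + 1) if cur else (i, i + 1)
            if cur[1] - cur[0] > best[1] - best[0]:
                best = cur
        else:
            cur = None                 # a run of unaligned words ends the span
    return " ".join(tgt_words[best[0]:best[1]])

words = "der Rat hat heute die neue Verordnung beschlossen xyz spam".split()
print(extract_fragment(words, aligned_positions={0, 1, 2, 3, 4, 5, 6, 7}))
# -> "der Rat hat heute die neue Verordnung beschlossen"
```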
1:24:32.332 --> 1:24:36.388 That's everything for today. 1:24:36.388 --> 1:24:44.023 What you hopefully remember is the general pipeline. 1:24:44.184 --> 1:24:54.506 We talked about how we can do the document alignment, and then we can do the sentence alignment, 1:24:54.506 --> 1:24:57.625 which is done after the document alignment. 1:24:59.339 --> 1:25:12.611 Any more questions? I think on Thursday we had to do a switch, so on Thursday there will be 1:25:12.611 --> 1:25:15.444 a practical session.