diff --git "a/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" "b/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" new file mode 100644--- /dev/null +++ "b/demo_data/lectures/Lecture-12-20.06.2023/English.vtt" @@ -0,0 +1,10713 @@ +WEBVTT + +0:00:03.243 --> 0:00:18.400 +Hey welcome to our video, small room today +and to the lecture machine translation. + +0:00:19.579 --> 0:00:32.295 +So the idea is we have like last time we started +addressing problems and building machine translation. + +0:00:32.772 --> 0:00:39.140 +And we looked into different ways of how we +can use other types of resources. + +0:00:39.379 --> 0:00:54.656 +Last time we looked into language models and +especially pre-trained models which are different + +0:00:54.656 --> 0:00:59.319 +paradigms and learning data. + +0:01:00.480 --> 0:01:07.606 +However, there is one other way of getting +data and that is just searching for more data. + +0:01:07.968 --> 0:01:14.637 +And the nice thing is it was a worldwide web. + +0:01:14.637 --> 0:01:27.832 +We have a very big data resource where there's +various types of data which we can all use. + +0:01:28.128 --> 0:01:38.902 +If you want to build a machine translation +for a specific language or specific to Maine, + +0:01:38.902 --> 0:01:41.202 +it might be worse. + +0:01:46.586 --> 0:01:55.399 +In general, the other year we had different +types of additional resources we can have. + +0:01:55.399 --> 0:01:59.654 +Today we look into the state of crawling. + +0:01:59.654 --> 0:02:05.226 +It always depends a bit on what type of task +you have. + +0:02:05.525 --> 0:02:08.571 +We're crawling, you point off no possibilities. + +0:02:08.828 --> 0:02:14.384 +We have seen some weeks ago that Maje Lingo +models another thing where you can try to share + +0:02:14.384 --> 0:02:16.136 +knowledge between languages. + +0:02:16.896 --> 0:02:26.774 +Last we looked into monolingual data and next +we also unsupervised them too which is purely + +0:02:26.774 --> 0:02:29.136 +based on monolingual. + +0:02:29.689 --> 0:02:35.918 +What we today will focus on is really web +crawling of parallel data. + +0:02:35.918 --> 0:02:40.070 +We will focus not on the crawling pad itself. + +0:02:41.541 --> 0:02:49.132 +Networking lecture is something about one +of the best techniques to do web trolleying + +0:02:49.132 --> 0:02:53.016 +and then we'll just rely on existing tools. + +0:02:53.016 --> 0:02:59.107 +But the challenge is normally if you have +web data that's pure text. + +0:03:00.920 --> 0:03:08.030 +And these are all different ways of how we +can do that, and today is focused on that. + +0:03:08.508 --> 0:03:21.333 +So why would we be interested in that there +is quite different ways of collecting data? + +0:03:21.333 --> 0:03:28.473 +If you're currently when we talk about parallel. + +0:03:28.548 --> 0:03:36.780 +The big difference is that you focus on one +specific website so you can manually check + +0:03:36.780 --> 0:03:37.632 +how you. + +0:03:38.278 --> 0:03:49.480 +This you can do for dedicated resources where +you have high quality data. + +0:03:50.510 --> 0:03:56.493 +Another thing which has been developed or +has been done for several tasks is also is + +0:03:56.493 --> 0:03:59.732 +like you can do something like crowdsourcing. + +0:03:59.732 --> 0:04:05.856 +I don't know if you know about sites like +Amazon Mechanical Turing or things like that + +0:04:05.856 --> 0:04:08.038 +so you can there get a lot of. 
+ +0:04:07.988 --> 0:04:11.544 +Writing between cheap labors would like easy +translations for you. + +0:04:12.532 --> 0:04:22.829 +Of course you can't collect millions of sentences, +but if it's like thousands of sentences that's + +0:04:22.829 --> 0:04:29.134 +also sourced, it's often interesting when you +have somehow. + +0:04:29.509 --> 0:04:36.446 +However, this is a field of itself, so crowdsourcing +is not that easy. + +0:04:36.446 --> 0:04:38.596 +It's not like upload. + +0:04:38.738 --> 0:04:50.806 +If you're doing that you will have very poor +quality, for example in the field of machine + +0:04:50.806 --> 0:04:52.549 +translation. + +0:04:52.549 --> 0:04:57.511 +Crowdsourcing is very commonly used. + +0:04:57.397 --> 0:05:00.123 +The problem there is. + +0:05:00.480 --> 0:05:08.181 +Since they are paid quite bad, of course, +a lot of people also try to make it put into + +0:05:08.181 --> 0:05:09.598 +it as possible. + +0:05:09.869 --> 0:05:21.076 +So if you're just using it without any control +mechanisms, the quality will be bad. + +0:05:21.076 --> 0:05:27.881 +What you can do is like doing additional checking. + +0:05:28.188 --> 0:05:39.084 +And think recently read a paper that now these +things can be worse because people don't do + +0:05:39.084 --> 0:05:40.880 +it themselves. + +0:05:41.281 --> 0:05:46.896 +So it's a very interesting topic. + +0:05:46.896 --> 0:05:55.320 +There has been a lot of resources created +by this. + +0:05:57.657 --> 0:06:09.796 +It's really about large scale data, then of +course doing some type of web crawling is the + +0:06:09.796 --> 0:06:10.605 +best. + +0:06:10.930 --> 0:06:17.296 +However, the biggest issue in this case is +in the quality. + +0:06:17.296 --> 0:06:22.690 +So how can we ensure that somehow the quality +of. + +0:06:23.003 --> 0:06:28.656 +Because if you just, we all know that in the +Internet there's also a lot of tools. + +0:06:29.149 --> 0:06:37.952 +Low quality staff, and especially now the +bigger question is how can we ensure that translations + +0:06:37.952 --> 0:06:41.492 +are really translations of each other? + +0:06:45.065 --> 0:06:58.673 +Why is this interesting so we had this number +before so there is some estimates that roughly + +0:06:58.673 --> 0:07:05.111 +a human reads around three hundred million. + +0:07:05.525 --> 0:07:16.006 +If you look into the web you will have millions +of words there so you can really get a large + +0:07:16.006 --> 0:07:21.754 +amount of data and if you think about monolingual. + +0:07:22.042 --> 0:07:32.702 +So at least for some language pairs there +is a large amount of data you can have. + +0:07:32.852 --> 0:07:37.783 +Languages are official languages in one country. + +0:07:37.783 --> 0:07:46.537 +There's always a very great success because +a lot of websites from the government need + +0:07:46.537 --> 0:07:48.348 +to be translated. + +0:07:48.568 --> 0:07:58.777 +For example, a large purpose like in India, +which we have worked with in India, so you + +0:07:58.777 --> 0:08:00.537 +have parallel. + +0:08:01.201 --> 0:08:02.161 +Two questions. + +0:08:02.161 --> 0:08:08.438 +First of all, if jet GPS and machine translation +tools are more becoming ubiquitous and everybody + +0:08:08.438 --> 0:08:14.138 +uses them, don't we get a problem because we +want to crawl the web and use the data and. + +0:08:15.155 --> 0:08:18.553 +Yes, that is a severe problem. + +0:08:18.553 --> 0:08:26.556 +Of course, are we only training on training +data which is automatically? 
+ +0:08:26.766 --> 0:08:41.182 +And if we are doing that, of course, we talked +about the synthetic data where we do back translation. + +0:08:41.341 --> 0:08:46.446 +But of course it gives you some aren't up +about norm, you cannot be much better than + +0:08:46.446 --> 0:08:46.806 +this. + +0:08:48.308 --> 0:08:57.194 +That is, we'll get more and more on issues, +so maybe at some point we won't look at the + +0:08:57.194 --> 0:09:06.687 +current Internet, but focus on oats like image +of the Internet, which are created by Archive. + +0:09:07.527 --> 0:09:18.611 +There's lots of classification algorithms +on how to classify automatic data they had + +0:09:18.611 --> 0:09:26.957 +a very interesting paper on how to watermark +their translation. + +0:09:27.107 --> 0:09:32.915 +So there's like two scenarios of course in +this program: The one thing you might want + +0:09:32.915 --> 0:09:42.244 +to find your own translation if you're a big +company and say do an antisystem that may be + +0:09:42.244 --> 0:09:42.866 +used. + +0:09:43.083 --> 0:09:49.832 +This problem might be that most of the translation +out there is created by you. + +0:09:49.832 --> 0:10:01.770 +You might be able: And there is a relatively +easy way of doing that so that there are other + +0:10:01.770 --> 0:10:09.948 +peoples' mainly that can do it like the search +or teacher. + +0:10:09.929 --> 0:10:12.878 +They are different, but there is not the one +correction station. + +0:10:13.153 --> 0:10:23.763 +So what you then can't do is you can't output +the best one to the user, but the highest value. + +0:10:23.763 --> 0:10:30.241 +For example, it's easy, but you can take the +translation. + +0:10:30.870 --> 0:10:40.713 +And if you always give the translation of +your investments, which are all good with the + +0:10:40.713 --> 0:10:42.614 +most ease, then. + +0:10:42.942 --> 0:10:55.503 +But of course this you can only do with most +of the data generated by your model. + +0:10:55.503 --> 0:11:02.855 +What we are now seeing is not only checks, +but. + +0:11:03.163 --> 0:11:13.295 +But it's definitely an additional research +question that might get more and more importance, + +0:11:13.295 --> 0:11:18.307 +and it might be an additional filtering step. + +0:11:18.838 --> 0:11:29.396 +There are other issues in data quality, so +in which direction wasn't translated, so that + +0:11:29.396 --> 0:11:31.650 +is not interested. + +0:11:31.891 --> 0:11:35.672 +But if you're now reaching better and better +quality, it makes a difference. + +0:11:35.672 --> 0:11:39.208 +The original data was from German to English +or from English to German. + +0:11:39.499 --> 0:11:44.797 +Because translation, they call it translate +Chinese. + +0:11:44.797 --> 0:11:53.595 +So if you generate German from English, it +has a more similar structure as if you would + +0:11:53.595 --> 0:11:55.195 +directly speak. + +0:11:55.575 --> 0:11:57.187 +So um. + +0:11:57.457 --> 0:12:03.014 +These are all issues which you then might +do like do additional training to remove them + +0:12:03.014 --> 0:12:07.182 +or you first train on them and later train +on other quality data. + +0:12:07.182 --> 0:12:11.034 +But yet that's a general view on so it's an +important issue. + +0:12:11.034 --> 0:12:17.160 +But until now I think it hasn't been addressed +that much maybe because the quality was decently. + +0:12:18.858 --> 0:12:23.691 +Actually, I think we're sure if we have the +time we use the Internet. 
+ +0:12:23.691 --> 0:12:29.075 +The problem is, it's a lot of English speaking +text, but most used languages. + +0:12:29.075 --> 0:12:34.460 +I don't know some language in Africa that's +spoken, but we do about that one. + +0:12:34.460 --> 0:12:37.566 +I mean, that's why most data is English too. + +0:12:38.418 --> 0:12:42.259 +Other languages, and then you get the best. + +0:12:42.259 --> 0:12:46.013 +If there is no data on the Internet, then. + +0:12:46.226 --> 0:12:48.255 +So there is still a lot of data collection. + +0:12:48.255 --> 0:12:50.976 +Also in the wild way you try to improve there +and collect. + +0:12:51.431 --> 0:12:57.406 +But English is the most in the world, but +you find surprisingly much data also for other + +0:12:57.406 --> 0:12:58.145 +languages. + +0:12:58.678 --> 0:13:04.227 +Of course, only if they're written remember. + +0:13:04.227 --> 0:13:15.077 +Most languages are not written at all, but +for them you might find some video, but it's + +0:13:15.077 --> 0:13:17.420 +difficult to find. + +0:13:17.697 --> 0:13:22.661 +So this is mainly done for the web trawling. + +0:13:22.661 --> 0:13:29.059 +It's mainly done for languages which are commonly +spoken. + +0:13:30.050 --> 0:13:38.773 +Is exactly the next point, so this is that +much data is only true for English and some + +0:13:38.773 --> 0:13:41.982 +other languages, but of course. + +0:13:41.982 --> 0:13:50.285 +And therefore a lot of research on how to +make things efficient and efficient and learn + +0:13:50.285 --> 0:13:54.248 +faster from pure data is still essential. + +0:13:59.939 --> 0:14:06.326 +So what we are interested in now on data is +parallel data. + +0:14:06.326 --> 0:14:10.656 +We assume always we have parallel data. + +0:14:10.656 --> 0:14:12.820 +That means we have. + +0:14:13.253 --> 0:14:20.988 +To be careful when you start crawling from +the web, we might get only related types of. + +0:14:21.421 --> 0:14:30.457 +So one comedy thing is what people refer as +noisy parallel data where there is documents + +0:14:30.457 --> 0:14:34.315 +which are translations of each other. + +0:14:34.434 --> 0:14:44.300 +So you have senses where there is no translation +on the other side because you have. + +0:14:44.484 --> 0:14:50.445 +So if you have these types of documents your +algorithm to extract parallel data might be + +0:14:50.445 --> 0:14:51.918 +a bit more difficult. + +0:14:52.352 --> 0:15:04.351 +Know if you can still remember in the beginning +of the lecture when we talked about different + +0:15:04.351 --> 0:15:06.393 +data resources. + +0:15:06.286 --> 0:15:11.637 +But the first step is then approached to a +light source and target sentences, and it was + +0:15:11.637 --> 0:15:16.869 +about like a steep vocabulary, and then you +have some probabilities for one to one and + +0:15:16.869 --> 0:15:17.590 +one to one. + +0:15:17.590 --> 0:15:23.002 +It's very like simple algorithm, but yet it +works fine for really a high quality parallel + +0:15:23.002 --> 0:15:23.363 +data. + +0:15:23.623 --> 0:15:30.590 +But when we're talking about noisy data, we +might have to do additional steps and use more + +0:15:30.590 --> 0:15:35.872 +advanced models to extract what is parallel +and to get high quality. + +0:15:36.136 --> 0:15:44.682 +So if we just had no easy parallel data, the +document might not be as easy to extract. + +0:15:49.249 --> 0:15:54.877 +And then there is even the more extreme pains, +which has also been used to be honest. 
+ +0:15:54.877 --> 0:15:58.214 +The use of this data is reasoning not that +common. + +0:15:58.214 --> 0:16:04.300 +It was more interested maybe like ten or fifteen +years ago, and that is what people referred + +0:16:04.300 --> 0:16:05.871 +to as comparative data. + +0:16:06.266 --> 0:16:17.167 +And then the idea is you even don't have translations +like sentences which are translations of each + +0:16:17.167 --> 0:16:25.234 +other, but you have more news documents or +articles about the same topic. + +0:16:25.205 --> 0:16:32.410 +But it's more that you find phrases which +are too big in the user, so even black fragments. + +0:16:32.852 --> 0:16:44.975 +So if you think about the pedia, for example, +these articles have to be written in like the + +0:16:44.975 --> 0:16:51.563 +Wikipedia general idea independent of each +other. + +0:16:51.791 --> 0:17:01.701 +They have different information in there, +and I mean, the German movie gets more detail + +0:17:01.701 --> 0:17:04.179 +than the English one. + +0:17:04.179 --> 0:17:07.219 +However, it might be that. + +0:17:07.807 --> 0:17:20.904 +And the same thing is that you think about +newspaper articles if they're at the same time. + +0:17:21.141 --> 0:17:24.740 +And so this is an ability to learn. + +0:17:24.740 --> 0:17:29.738 +For example, new phrases, vocabulary and stature. + +0:17:29.738 --> 0:17:36.736 +If you don't have parallel data, but you could +monitor all time long. + +0:17:37.717 --> 0:17:49.020 +And then not everything will be the same, +but there might be an overlap about events. + +0:17:54.174 --> 0:18:00.348 +So if we're talking about web trolling said +in the beginning it was really about specific. + +0:18:00.660 --> 0:18:18.878 +They do very good things by hand and really +focus on them and do a very specific way of + +0:18:18.878 --> 0:18:20.327 +doing. + +0:18:20.540 --> 0:18:23.464 +The European Parliament was very focused in +Ted. + +0:18:23.464 --> 0:18:26.686 +Maybe you even have looked in the particular +session. + +0:18:27.427 --> 0:18:40.076 +And these are still important, but they are +of course very specific in covering different + +0:18:40.076 --> 0:18:41.341 +pockets. + +0:18:42.002 --> 0:18:55.921 +Then there was a focus on language centering, +so there was a big drawer, for example, that + +0:18:55.921 --> 0:18:59.592 +you can check websites. + +0:19:00.320 --> 0:19:06.849 +Apparently what really people like is a more +general approach where you just have to specify. + +0:19:06.849 --> 0:19:13.239 +I'm interested in data from German to Lithuanian +and then you can as automatic as possible. + +0:19:13.239 --> 0:19:15.392 +We see what's normally needed. + +0:19:15.392 --> 0:19:19.628 +You can collect as much data and extract codelaia +from this. + +0:19:21.661 --> 0:19:25.633 +So is this our interest? + +0:19:25.633 --> 0:19:36.435 +Of course, the question is how can we build +these types of systems? + +0:19:36.616 --> 0:19:52.913 +The first are more general web crawling base +systems, so there is nothing about. + +0:19:53.173 --> 0:19:57.337 +Based on the websites you have, you have to +do like text extraction. + +0:19:57.597 --> 0:20:06.503 +We are typically not that much interested +in text and images in there, so we try to extract + +0:20:06.503 --> 0:20:07.083 +text. + +0:20:07.227 --> 0:20:16.919 +This is also not specific to machine translation, +but it's a more traditional way of doing web + +0:20:16.919 --> 0:20:17.939 +trolling. 
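+
+NOTE
+A minimal Python sketch of the text-extraction step described above: strip the markup and keep only the visible text.
+It assumes the BeautifulSoup library is available; the function name is illustrative only.
+from bs4 import BeautifulSoup
+def extract_text(html: str) -> str:
+    # Parse the page, drop non-content tags, keep only the visible text.
+    soup = BeautifulSoup(html, "html.parser")
+    for tag in soup(["script", "style", "noscript"]):
+        tag.decompose()
+    # Collapse whitespace so later sentence splitting is easier.
+    return " ".join(soup.get_text(separator=" ").split())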
+ +0:20:18.478 --> 0:20:22.252 +And at the end you have mirror like some other +set of document collectors. + +0:20:22.842 --> 0:20:37.025 +Is the idea, so you have the text, and often +this is a document, and so in the end. + +0:20:37.077 --> 0:20:51.523 +And that is some of your starting point now +for doing the more machine translation. + +0:20:52.672 --> 0:21:05.929 +One way of doing that now is very similar +to what you might have think about the traditional + +0:21:05.929 --> 0:21:06.641 +one. + +0:21:06.641 --> 0:21:10.633 +The first thing is to do a. + +0:21:11.071 --> 0:21:22.579 +So you have this based on the initial fact +that you know this is a German website in the + +0:21:22.579 --> 0:21:25.294 +English translation. + +0:21:25.745 --> 0:21:31.037 +And based on this document alignment, then +you can do your sentence alignment. + +0:21:31.291 --> 0:21:39.072 +And this is similar to what we had before +with the church accordion. + +0:21:39.072 --> 0:21:43.696 +This is typically more noisy peril data. + +0:21:43.623 --> 0:21:52.662 +So that you are not assuming that everything +is on both sides, that the order is the same, + +0:21:52.662 --> 0:21:56.635 +so you should do more flexible systems. + +0:21:58.678 --> 0:22:14.894 +Then it depends if the documents you were +drawing were really some type of parallel data. + +0:22:15.115 --> 0:22:35.023 +Say then you should do what is referred to +as fragmented extraction. + +0:22:36.136 --> 0:22:47.972 +One problem with these types of models is +if you are doing errors in your document alignment,. + +0:22:48.128 --> 0:22:55.860 +It means that if you are saying these two +documents are align then you can only find + +0:22:55.860 --> 0:22:58.589 +sense and if you are missing. + +0:22:59.259 --> 0:23:15.284 +Is very different, only small parts of the +document are parallel, and most parts are independent + +0:23:15.284 --> 0:23:17.762 +of each other. + +0:23:19.459 --> 0:23:31.318 +Therefore, more recently, there is also the +idea of directly doing sentence aligned so + +0:23:31.318 --> 0:23:35.271 +that you're directly taking. + +0:23:36.036 --> 0:23:41.003 +Was already one challenge of this one, the +second approach. + +0:23:42.922 --> 0:23:50.300 +Yes, so one big challenge on here, beef, then +you have to do a lot of comparison. + +0:23:50.470 --> 0:23:59.270 +You have to cook out every source, every target +set and square. + +0:23:59.270 --> 0:24:06.283 +If you think of a million or trillion pairs, +then. + +0:24:07.947 --> 0:24:12.176 +And this also gives you a reason for a last +step in both cases. + +0:24:12.176 --> 0:24:18.320 +So in both of them you have to remember you're +typically eating here in this very large data + +0:24:18.320 --> 0:24:18.650 +set. + +0:24:18.650 --> 0:24:24.530 +So all of these and also the document alignment +here they should be done very efficient. + +0:24:24.965 --> 0:24:42.090 +And if you want to do it very efficiently, +that means your quality will go lower. + +0:24:41.982 --> 0:24:47.348 +Because you just have to ever see it fast, +and then yeah you can put less computation + +0:24:47.348 --> 0:24:47.910 +on each. + +0:24:48.688 --> 0:25:06.255 +Therefore, in a lot of scenarios it makes +sense to make an additional filtering step + +0:25:06.255 --> 0:25:08.735 +at the end. + +0:25:08.828 --> 0:25:13.370 +And then we do a second filtering step where +we now can put a lot more effort. 
+ +0:25:13.433 --> 0:25:20.972 +Because now we don't have like any square +possible combinations anymore, we have already + +0:25:20.972 --> 0:25:26.054 +selected and maybe in dimension of maybe like +two or three. + +0:25:26.054 --> 0:25:29.273 +For each sentence we even don't have. + +0:25:29.429 --> 0:25:39.234 +And then we can put a lot more effort in each +individual example and build a high quality + +0:25:39.234 --> 0:25:42.611 +classic fire to really select. + +0:25:45.125 --> 0:26:00.506 +Two or one example for that, so one of the +biggest projects doing this is the so-called + +0:26:00.506 --> 0:26:03.478 +Paratrol Corpus. + +0:26:03.343 --> 0:26:11.846 +Typically it's like before the picturing so +there are a lot of challenges on how you can. + +0:26:12.272 --> 0:26:25.808 +And the steps they start to be with the seatbelt, +so what you should give at the beginning is: + +0:26:26.146 --> 0:26:36.908 +Then they do the problem, the text extraction, +the document alignment, the sentence alignment, + +0:26:36.908 --> 0:26:45.518 +and the sentence filter, and it swings down +to implementing the text store. + +0:26:46.366 --> 0:26:51.936 +We'll see later for a lot of language pairs +exist so it's easier to download them and then + +0:26:51.936 --> 0:26:52.793 +like improve. + +0:26:53.073 --> 0:27:08.270 +For example, the crawling one thing they often +do is even not throw the direct website because + +0:27:08.270 --> 0:27:10.510 +there's also. + +0:27:10.770 --> 0:27:14.540 +Black parts of the Internet that they can +work on today. + +0:27:14.854 --> 0:27:22.238 +In more detail, this is a bit shown here. + +0:27:22.238 --> 0:27:31.907 +All the steps you can see are different possibilities. + +0:27:32.072 --> 0:27:39.018 +You need a bit of knowledge to do that, or +you can build a machine translation system. + +0:27:39.239 --> 0:27:47.810 +There are two different ways of deduction +and alignment. + +0:27:47.810 --> 0:27:52.622 +You can use sentence alignment. + +0:27:53.333 --> 0:28:02.102 +And how you can do the flexigrade exam, for +example, the lexic graph, or you can chin. + +0:28:02.422 --> 0:28:05.826 +To the next step in a bit more detail. + +0:28:05.826 --> 0:28:13.680 +But before we're doing it, I need more questions +about the general overview of how these. + +0:28:22.042 --> 0:28:37.058 +Yeah, so two or three things to web-drawing, +so you normally start with the URLs. + +0:28:37.058 --> 0:28:40.903 +It's most promising. + +0:28:41.021 --> 0:28:46.674 +Found that if you're interested in German +to English, you would maybe move some data + +0:28:46.674 --> 0:28:47.073 +from. + +0:28:47.407 --> 0:28:58.739 +Companies where you know they have a German +and an English website are from agencies which + +0:28:58.739 --> 0:29:08.359 +might be: And then we can use one of these +tools to start from there using standard web + +0:29:08.359 --> 0:29:10.328 +calling techniques. + +0:29:11.071 --> 0:29:23.942 +There are several challenges when doing that, +so if you request a website too often you can: + +0:29:25.305 --> 0:29:37.819 +You have to keep in history of the sites and +you click on all the links and then click on + +0:29:37.819 --> 0:29:40.739 +all the links again. + +0:29:41.721 --> 0:29:49.432 +To be very careful about legal issues starting +from this robotics day so get allowed to use. + +0:29:49.549 --> 0:29:58.941 +Mean, that's the one major thing about what +trolley general is. + +0:29:58.941 --> 0:30:05.251 +The problem is how you deal with property. 
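+
+NOTE
+A minimal sketch of the robots.txt check mentioned above, using only Python's standard library; the user-agent string is a placeholder.
+from urllib.parse import urlparse
+from urllib.robotparser import RobotFileParser
+def allowed_to_fetch(url: str, user_agent: str = "example-crawler") -> bool:
+    # Ask the site's robots.txt whether this URL may be crawled at all.
+    parts = urlparse(url)
+    robots = RobotFileParser()
+    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
+    robots.read()
+    return robots.can_fetch(user_agent, url)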
+ +0:30:05.685 --> 0:30:13.114 +That is why it is easier sometimes to start +with some quick fold data that you don't have. + +0:30:13.893 --> 0:30:22.526 +Of course, the network issues you retry, so +there's more technical things, but there's + +0:30:22.526 --> 0:30:23.122 +good. + +0:30:24.724 --> 0:30:35.806 +Another thing which is very helpful and is +often done is instead of doing the web trolling + +0:30:35.806 --> 0:30:38.119 +yourself, relying. + +0:30:38.258 --> 0:30:44.125 +And one thing is it's common crawl from the +web. + +0:30:44.125 --> 0:30:51.190 +Think on this common crawl a lot of these +language models. + +0:30:51.351 --> 0:30:59.763 +So think in American Company or organization +which really works on like writing. + +0:31:00.000 --> 0:31:01.111 +Possible. + +0:31:01.111 --> 0:31:10.341 +So the nice thing is if you start with this +you don't have to worry about network. + +0:31:10.250 --> 0:31:16.086 +I don't think you can do that because it's +too big, but you can do a pipeline on how to + +0:31:16.086 --> 0:31:16.683 +process. + +0:31:17.537 --> 0:31:28.874 +That is, of course, a general challenge in +all this web crawling and parallel web mining. + +0:31:28.989 --> 0:31:38.266 +That means you cannot just don't know the +data and study the processes. + +0:31:39.639 --> 0:31:45.593 +Here it might make sense to directly fields +of both domains that in some way bark just + +0:31:45.593 --> 0:31:46.414 +marginally. + +0:31:49.549 --> 0:31:59.381 +Then you can do the text extraction, which +means like converging two HTML and then splitting + +0:31:59.381 --> 0:32:01.707 +things from the HTML. + +0:32:01.841 --> 0:32:04.802 +Often very important is to do the language +I need. + +0:32:05.045 --> 0:32:16.728 +It's not that clear even if it's links which +language it is, but they are quite good tools + +0:32:16.728 --> 0:32:22.891 +like that can't identify from relatively short. + +0:32:23.623 --> 0:32:36.678 +And then you are now in the situation that +you have all your danger and that you can start. + +0:32:37.157 --> 0:32:43.651 +After the text extraction you have now a collection +or a large collection of of data where it's + +0:32:43.651 --> 0:32:49.469 +like text and maybe the document at use of +some meta information and now the question + +0:32:49.469 --> 0:32:55.963 +is based on this monolingual text or multilingual +text so text in many languages but not align. + +0:32:56.036 --> 0:32:59.863 +How can you now do a generate power? + +0:33:01.461 --> 0:33:06.289 +And UM. + +0:33:05.705 --> 0:33:13.322 +So if we're not seeing it as a task or if +we want to do it in a machine learning way, + +0:33:13.322 --> 0:33:20.940 +what we have is we have a set of sentences +and a suits language, and we have a set Of + +0:33:20.940 --> 0:33:23.331 +sentences from the target. + +0:33:23.823 --> 0:33:27.814 +This is the target language. + +0:33:27.814 --> 0:33:31.392 +This is the data we have. + +0:33:31.392 --> 0:33:37.034 +We kind of directly assume any ordering. + +0:33:38.018 --> 0:33:44.502 +More documents there are not really in line +or there is maybe a graph and what we are interested + +0:33:44.502 --> 0:33:50.518 +in is finding these alignments so which senses +are aligned to each other and which senses + +0:33:50.518 --> 0:33:53.860 +we can remove but we don't have translations +for. + +0:33:53.974 --> 0:34:00.339 +But exactly this mapping is what we are interested +in and what we need to find. 
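+
+NOTE
+To make the search problem concrete, a naive sketch that scores every source/target combination with some similarity function (a placeholder here) and keeps pairs above a threshold; the quadratic number of comparisons is exactly the issue discussed next.
+def mine_pairs_bruteforce(src_sents, tgt_sents, score, threshold=0.8):
+    # |src| * |tgt| comparisons: fine for toy data, far too slow at web scale.
+    pairs = []
+    for s in src_sents:
+        for t in tgt_sents:
+            if score(s, t) >= threshold:
+                pairs.append((s, t))
+    return pairs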
+ +0:34:01.901 --> 0:34:17.910 +And if we are modeling it more from the machine +translation point of view, what can model that + +0:34:17.910 --> 0:34:21.449 +as a classification? + +0:34:21.681 --> 0:34:36.655 +And so the main challenge of this is to build +this type of classifier and you want to decide. + +0:34:42.402 --> 0:34:50.912 +However, the biggest challenge has already +pointed out in the beginning is the sites if + +0:34:50.912 --> 0:34:53.329 +we have millions target. + +0:34:53.713 --> 0:35:05.194 +The number of comparison is n square, so this +very path is very inefficient, and we need + +0:35:05.194 --> 0:35:06.355 +to find. + +0:35:07.087 --> 0:35:16.914 +And traditionally there is the first one mentioned +before the local or the hierarchical meaning + +0:35:16.914 --> 0:35:20.292 +mining and there the idea is OK. + +0:35:20.292 --> 0:35:23.465 +First we are lining documents. + +0:35:23.964 --> 0:35:32.887 +Move back the things and align them, and once +you have the alignment you only need to remind. + +0:35:33.273 --> 0:35:51.709 +That of course makes anything more efficient +because we don't have to do all the comparison. + +0:35:53.253 --> 0:35:56.411 +Then it's, for example, in the before mentioned +apparel. + +0:35:57.217 --> 0:36:11.221 +But it has the issue that if this document +is bad you have error propagation and you can + +0:36:11.221 --> 0:36:14.211 +recover from that. + +0:36:14.494 --> 0:36:20.715 +Because then document that cannot say ever, +there are some sentences which are: Therefore, + +0:36:20.715 --> 0:36:24.973 +more recently there is also was referred to +as global mining. + +0:36:26.366 --> 0:36:31.693 +And there we really do this. + +0:36:31.693 --> 0:36:43.266 +Although it's in the square, we are doing +all the comparisons. + +0:36:43.523 --> 0:36:52.588 +So the idea is that you can do represent all +the sentences in a vector space. + +0:36:52.892 --> 0:37:06.654 +And then it's about nearest neighbor search +and there is a lot of very efficient algorithms. + +0:37:07.067 --> 0:37:20.591 +Then if you only compare them to your nearest +neighbors you don't have to do like a comparison + +0:37:20.591 --> 0:37:22.584 +but you have. + +0:37:26.186 --> 0:37:40.662 +So in the first step what we want to look +at is this: This document classification refers + +0:37:40.662 --> 0:37:49.584 +to the document alignment, and then we do the +sentence alignment. + +0:37:51.111 --> 0:37:58.518 +And if we're talking about document alignment, +there's like typically two steps in that: We + +0:37:58.518 --> 0:38:01.935 +first do a candidate selection. + +0:38:01.935 --> 0:38:10.904 +Often we have several steps and that is again +to make more things more efficiently. + +0:38:10.904 --> 0:38:13.360 +We have the candidate. + +0:38:13.893 --> 0:38:18.402 +The candidate select means OK, which documents +do we want to compare? + +0:38:19.579 --> 0:38:35.364 +Then if we have initial candidates which might +be parallel, we can do a classification test. + +0:38:35.575 --> 0:38:37.240 +And there is different ways. + +0:38:37.240 --> 0:38:40.397 +We can use lexical similarity or we can use +ten basic. + +0:38:41.321 --> 0:38:48.272 +The first and easiest thing is to take off +possible candidates. + +0:38:48.272 --> 0:38:55.223 +There's one possibility, the other one, is +based on structural. + +0:38:55.235 --> 0:39:05.398 +So based on how your website looks like, you +might find that there are only translations. 
+ +0:39:05.825 --> 0:39:14.789 +This is typically the only case where we try +to do some kind of major information, which + +0:39:14.789 --> 0:39:22.342 +can be very useful because we know that websites, +for example, are linked. + +0:39:22.722 --> 0:39:35.586 +We can try to use some URL patterns, so if +we have some website which ends with the. + +0:39:35.755 --> 0:39:43.932 +So that can be easily used in order to find +candidates. + +0:39:43.932 --> 0:39:49.335 +Then we only compare websites where. + +0:39:49.669 --> 0:40:05.633 +The language and the translation of each other, +but typically you hear several heuristics to + +0:40:05.633 --> 0:40:07.178 +do that. + +0:40:07.267 --> 0:40:16.606 +Then you don't have to compare all websites, +but you only have to compare web sites. + +0:40:17.277 --> 0:40:27.607 +Cruiser problems especially with an hour day's +content management system. + +0:40:27.607 --> 0:40:32.912 +Sometimes it's nice and easy to read. + +0:40:33.193 --> 0:40:44.452 +So on the one hand there typically leads from +the parent's side to different languages. + +0:40:44.764 --> 0:40:46.632 +Now I can look at the kit websites. + +0:40:46.632 --> 0:40:49.381 +It's the same thing you can check on the difference. + +0:40:49.609 --> 0:41:06.835 +Languages: You can either do that from the +parent website or you can click on the English. + +0:41:06.926 --> 0:41:10.674 +You can therefore either like prepare to all +the websites. + +0:41:10.971 --> 0:41:18.205 +Can be even more focused and checked if the +link is somehow either flexible or the language + +0:41:18.205 --> 0:41:18.677 +name. + +0:41:19.019 --> 0:41:24.413 +So there really depends on how much you want +to filter out. + +0:41:24.413 --> 0:41:29.178 +There is always a trade-off between being +efficient. + +0:41:33.913 --> 0:41:49.963 +Based on that we then have our candidate list, +so we now have two independent sets of German + +0:41:49.963 --> 0:41:52.725 +documents, but. + +0:41:53.233 --> 0:42:03.515 +And now the task is, we want to extract these, +which are really translations of each other. + +0:42:03.823 --> 0:42:10.201 +So the question of how can we measure the +document similarity? + +0:42:10.201 --> 0:42:14.655 +Because what we then do is, we measure the. + +0:42:14.955 --> 0:42:27.096 +And here you already see why this is also +that problematic from where it's partial or + +0:42:27.096 --> 0:42:28.649 +similarly. + +0:42:30.330 --> 0:42:37.594 +All you can do that is again two folds. + +0:42:37.594 --> 0:42:48.309 +You can do it more content based or more structural +based. + +0:42:48.188 --> 0:42:53.740 +Calculating a lot of features and then maybe +training a classic pyramid small set which + +0:42:53.740 --> 0:42:57.084 +stands like based on the spesse feature is +the data. + +0:42:57.084 --> 0:42:58.661 +It is a corpus parallel. + +0:43:00.000 --> 0:43:10.955 +One way of doing that is to have traction +features, so the idea is the text length, so + +0:43:10.955 --> 0:43:12.718 +the document. + +0:43:13.213 --> 0:43:20.511 +Of course, text links will not be the same, +but if the one document has fifty words and + +0:43:20.511 --> 0:43:24.907 +the other five thousand words, it's quite realistic. + +0:43:25.305 --> 0:43:29.274 +So you can use the text length as one proxy +of. + +0:43:29.274 --> 0:43:32.334 +Is this might be a good translation? + +0:43:32.712 --> 0:43:41.316 +Now the thing is the alignment between the +structure. 
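+
+NOTE
+A sketch of the cheap candidate-selection heuristics described above: pair pages whose URLs differ only in the language code, and use the text-length ratio as a quick plausibility check.
+The "/de/" vs. "/en/" pattern is just one illustrative rule; real sites need several such rules.
+def candidates_by_url(de_urls, en_urls):
+    # Pair pages whose URLs differ only in the language code,
+    # e.g. example.org/de/kontakt and example.org/en/kontakt.
+    en_set = set(en_urls)
+    return [(u, u.replace("/de/", "/en/"))
+            for u in de_urls if u.replace("/de/", "/en/") in en_set]
+def plausible_length(src_len, tgt_len, max_ratio=2.0):
+    # Cheap structural feature: documents of very different length
+    # are unlikely to be translations of each other.
+    longer = max(src_len, tgt_len)
+    shorter = max(1, min(src_len, tgt_len))
+    return longer / shorter <= max_ratio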
+ +0:43:41.316 --> 0:43:52.151 +If you have here the website you can create +some type of structure. + +0:43:52.332 --> 0:44:04.958 +You can compare that to the French version +and then calculate some similarities because + +0:44:04.958 --> 0:44:07.971 +you see translation. + +0:44:08.969 --> 0:44:12.172 +Of course, it's getting more and more problematic. + +0:44:12.172 --> 0:44:16.318 +It does be a different structure than these +features are helpful. + +0:44:16.318 --> 0:44:22.097 +However, if you are doing it more in a trained +way, you can automatically learn how helpful + +0:44:22.097 --> 0:44:22.725 +they are. + +0:44:24.704 --> 0:44:37.516 +Then there are different ways of yeah: Content +based things: One easy thing, especially if + +0:44:37.516 --> 0:44:48.882 +you have systems that are using the same script +that you are looking for. + +0:44:48.888 --> 0:44:49.611 +The legs. + +0:44:49.611 --> 0:44:53.149 +We call them a beggar words and we'll look +into. + +0:44:53.149 --> 0:44:55.027 +You can use some type of. + +0:44:55.635 --> 0:44:58.418 +And neural embedding is also to abate him +at. + +0:45:02.742 --> 0:45:06.547 +And as then mean we have machine translation,. + +0:45:06.906 --> 0:45:14.640 +And one idea that you can also do is really +use the machine translation. + +0:45:14.874 --> 0:45:22.986 +Because this one is one which takes more effort, +so what you then have to do is put more effort. + +0:45:23.203 --> 0:45:37.526 +You wouldn't do this type of machine translation +based approach for a system which has product. + +0:45:38.018 --> 0:45:53.712 +But maybe your first of thinking why can't +do that because I'm collecting data to build + +0:45:53.712 --> 0:45:55.673 +an system. + +0:45:55.875 --> 0:46:01.628 +So you can use an initial system to translate +it, and then you can collect more data. + +0:46:01.901 --> 0:46:06.879 +And one way of doing that is, you're translating, +for example, all documents even to English. + +0:46:07.187 --> 0:46:25.789 +Then you only need two English data and you +do it in the example with three grams. + +0:46:25.825 --> 0:46:33.253 +For example, the current induction in 1 in +the Spanish, which is German induction in 1, + +0:46:33.253 --> 0:46:37.641 +which was Spanish induction in 2, which was +French. + +0:46:37.637 --> 0:46:52.225 +You're creating this index and then based +on that you can calculate how similar the documents. + +0:46:52.092 --> 0:46:58.190 +And then you can use the Cossack similarity +to really calculate which of the most similar + +0:46:58.190 --> 0:47:00.968 +document or how similar is the document. + +0:47:00.920 --> 0:47:04.615 +And then measure if this is a possible translation. + +0:47:05.285 --> 0:47:14.921 +Mean, of course, the document will not be +exactly the same, and even if you have a parallel + +0:47:14.921 --> 0:47:18.483 +document, French and German, and. + +0:47:18.898 --> 0:47:29.086 +You'll have not a perfect translation, therefore +it's looking into five front overlap since + +0:47:29.086 --> 0:47:31.522 +there should be last. + +0:47:34.074 --> 0:47:42.666 +Okay, before we take the next step and go +into the sentence alignment, there are more + +0:47:42.666 --> 0:47:44.764 +questions about the. + +0:47:51.131 --> 0:47:55.924 +Too Hot and. + +0:47:56.997 --> 0:47:59.384 +Well um. + +0:48:00.200 --> 0:48:05.751 +There is different ways of doing sentence +alignment. + +0:48:05.751 --> 0:48:12.036 +Here's one way to describe is to call the +other line again. 
+ +0:48:12.172 --> 0:48:17.590 +Of course, we have the advantage that we have +only documents, so we might have like hundred + +0:48:17.590 --> 0:48:20.299 +sentences and hundred sentences in the tower. + +0:48:20.740 --> 0:48:31.909 +Although it still might be difficult to compare +all the things in parallel, and. + +0:48:31.791 --> 0:48:37.541 +And therefore typically these even assume +that we are only interested in a line character + +0:48:37.541 --> 0:48:40.800 +that can be identified on the sum of the diagonal. + +0:48:40.800 --> 0:48:46.422 +Of course, not exactly the diagonal will sum +some parts around it, but in order to make + +0:48:46.422 --> 0:48:47.891 +things more efficient. + +0:48:48.108 --> 0:48:55.713 +You can still do it around the diagonal because +if you say this is a parallel document, we + +0:48:55.713 --> 0:48:56.800 +assume that. + +0:48:56.836 --> 0:49:05.002 +We wouldn't have passed the document alignment, +therefore we wouldn't have seen it. + +0:49:05.505 --> 0:49:06.774 +In the underline. + +0:49:06.774 --> 0:49:10.300 +Then we are calculating the similarity for +these. + +0:49:10.270 --> 0:49:17.428 +Set this here based on the bilingual dictionary, +so it may be based on how much overlap you + +0:49:17.428 --> 0:49:17.895 +have. + +0:49:18.178 --> 0:49:24.148 +And then we are finding a path through it. + +0:49:24.148 --> 0:49:31.089 +You are finding a path which the lights ever +see. + +0:49:31.271 --> 0:49:41.255 +But you're trying to find a pass through your +document so that you get these parallel. + +0:49:41.201 --> 0:49:49.418 +And then the perfect ones here would be your +pass, where you just take this other parallel. + +0:49:51.011 --> 0:50:05.206 +The advantage is that on the one end limits +your search space, then centers alignment, + +0:50:05.206 --> 0:50:07.490 +and secondly. + +0:50:07.787 --> 0:50:10.013 +So what does it mean? + +0:50:10.013 --> 0:50:19.120 +So even if you have a very high probable pair, +you're not taking them on because overall. + +0:50:19.399 --> 0:50:27.063 +So sometimes it makes sense to also use this +global information and not only compare on + +0:50:27.063 --> 0:50:34.815 +individual sentences because what you're with +your parents is that sometimes it's only a + +0:50:34.815 --> 0:50:36.383 +good translation. + +0:50:38.118 --> 0:50:51.602 +So by this minion paste you're preventing +the system to do it at the border where there's + +0:50:51.602 --> 0:50:52.201 +no. + +0:50:53.093 --> 0:50:55.689 +So that might achieve you a bit better quality. + +0:50:56.636 --> 0:51:12.044 +The pack always ends if we write the button +for everybody, but it also means you couldn't + +0:51:12.044 --> 0:51:15.126 +necessarily have. + +0:51:15.375 --> 0:51:24.958 +Have some restrictions that is right, so first +of all they can't be translated out. + +0:51:25.285 --> 0:51:32.572 +So the handle line typically only really works +well if you have a relatively high quality. + +0:51:32.752 --> 0:51:39.038 +So if you have this more general data where +there's like some parts are translated and + +0:51:39.038 --> 0:51:39.471 +some. + +0:51:39.719 --> 0:51:43.604 +It doesn't really work, so it might. + +0:51:43.604 --> 0:51:53.157 +It's okay with having maybe at the end some +sentences which are missing, but in generally. + +0:51:53.453 --> 0:51:59.942 +So it's not robust against significant noise +on the. + +0:52:05.765 --> 0:52:12.584 +The second thing is is to what is referred +to as blue alibi. 
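+
+NOTE
+A minimal sketch of the first, dictionary-based approach just described: find the best monotone path through the sentence-by-sentence similarity matrix.
+Here sim is assumed to hold e.g. dictionary-overlap scores; real tools additionally restrict the search to a band around the diagonal.
+def align_monotone(sim):
+    # sim[i][j]: similarity of source sentence i and target sentence j.
+    n, m = len(sim), len(sim[0])
+    best = [[0.0] * (m + 1) for _ in range(n + 1)]
+    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            # Either link sentences i-1 and j-1, or skip one side.
+            options = [(best[i - 1][j - 1] + sim[i - 1][j - 1], (i - 1, j - 1)),
+                       (best[i - 1][j], (i - 1, j)),
+                       (best[i][j - 1], (i, j - 1))]
+            best[i][j], back[i][j] = max(options)
+    links, i, j = [], n, m
+    while i > 0 and j > 0:
+        pi, pj = back[i][j]
+        if (pi, pj) == (i - 1, j - 1):
+            links.append((i - 1, j - 1))
+        i, j = pi, pj
+    return list(reversed(links))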
+ +0:52:13.233 --> 0:52:16.982 +And this doesn't does, does not do us much. + +0:52:16.977 --> 0:52:30.220 +A global information you can translate each +sentence to English, and then you calculate + +0:52:30.220 --> 0:52:34.885 +the voice for the translation. + +0:52:35.095 --> 0:52:41.888 +And that you would get six answer points, +which are the ones in a purple ear. + +0:52:42.062 --> 0:52:56.459 +And then you have the ability to add some +points around it, which might be a bit lower. + +0:52:56.756 --> 0:53:06.962 +But here in this case you are able to deal +with reorderings, angles to deal with parts. + +0:53:07.247 --> 0:53:16.925 +Therefore, in this case we need a full scale +and key system to do this calculation while + +0:53:16.925 --> 0:53:17.686 +we're. + +0:53:18.318 --> 0:53:26.637 +Then, of course, the better your similarity +metric is, so the better you are able to do + +0:53:26.637 --> 0:53:35.429 +this comparison, the less you have to rely +on structural information that, in one sentence,. + +0:53:39.319 --> 0:53:53.411 +Anymore questions, and then there are things +like back in line which try to do the same. + +0:53:53.793 --> 0:53:59.913 +That means the idea is that you expect each +sentence. + +0:53:59.819 --> 0:54:02.246 +In a crossing will vector space. + +0:54:02.246 --> 0:54:08.128 +Crossing will vector space always means that +you have a vector or knight means. + +0:54:08.128 --> 0:54:14.598 +In this case you have a vector space where +sentences in different languages are near to + +0:54:14.598 --> 0:54:16.069 +each other if they. + +0:54:16.316 --> 0:54:23.750 +So you can have it again and so on, but just +next to each other and want to call you. + +0:54:24.104 --> 0:54:32.009 +And then you can of course measure now the +similarity by some distance matrix in this + +0:54:32.009 --> 0:54:32.744 +vector. + +0:54:33.033 --> 0:54:36.290 +And you're saying towards two senses are lying. + +0:54:36.290 --> 0:54:39.547 +If the distance in the vector space is somehow. + +0:54:40.240 --> 0:54:50.702 +We'll discuss that in a bit more heat soon +because these vector spades and bathings are + +0:54:50.702 --> 0:54:52.010 +even then. + +0:54:52.392 --> 0:54:55.861 +So the nice thing is with this. + +0:54:55.861 --> 0:55:05.508 +It's really good and good to get quite good +quality and can decide whether two sentences + +0:55:05.508 --> 0:55:08.977 +are translations of each other. + +0:55:08.888 --> 0:55:14.023 +In the fact-lined approach, but often they +even work on a global search way to really + +0:55:14.023 --> 0:55:15.575 +compare on everything to. + +0:55:16.236 --> 0:55:29.415 +What weak alignment also does is trying to +do to make this more efficient in finding the. + +0:55:29.309 --> 0:55:40.563 +If you don't want to compare everything to +everything, you first need sentence blocks, + +0:55:40.563 --> 0:55:41.210 +and. + +0:55:41.141 --> 0:55:42.363 +Then find him fast. + +0:55:42.562 --> 0:55:55.053 +You always have full sentence resolution, +but then you always compare on the area around. + +0:55:55.475 --> 0:56:11.501 +So if you do compare blocks on the source +of the target, then you have of your possibilities. + +0:56:11.611 --> 0:56:17.262 +So here the end times and comparison is a +lot less than the comparison you have here. + +0:56:17.777 --> 0:56:23.750 +And with neural embeddings you can also embed +not only single sentences and whole blocks. + +0:56:24.224 --> 0:56:28.073 +So how you make this in fast? 
+ +0:56:28.073 --> 0:56:35.643 +You're starting from a coarse grain resolution +here where. + +0:56:36.176 --> 0:56:47.922 +Then you're getting a double pass where they +could be good and near this pass you're doing + +0:56:47.922 --> 0:56:49.858 +more and more. + +0:56:52.993 --> 0:56:54.601 +And yeah, what's the? + +0:56:54.601 --> 0:56:56.647 +This is the white egg lift. + +0:56:56.647 --> 0:56:59.352 +These are the sewers and the target. + +0:57:00.100 --> 0:57:16.163 +While it was sleeping in the forests and things, +I thought it was very strange to see this man. + +0:57:16.536 --> 0:57:25.197 +So you have the sentences, but if you do blocks +you have blocks that are in. + +0:57:30.810 --> 0:57:38.514 +This is the thing about the pipeline approach. + +0:57:38.514 --> 0:57:46.710 +We want to look at the global mining, but +before. + +0:57:53.633 --> 0:58:07.389 +In the global mining thing we have to also +do some filtering and so typically in the things + +0:58:07.389 --> 0:58:10.379 +they do they start. + +0:58:10.290 --> 0:58:14.256 +And then they are doing some pretty processing. + +0:58:14.254 --> 0:58:17.706 +So you try to at first to de-defecate paragraphs. + +0:58:17.797 --> 0:58:30.622 +So, of course, if you compare everything with +everything in two times the same input example, + +0:58:30.622 --> 0:58:35.748 +you will also: The hard thing is that you first +keep duplicating. + +0:58:35.748 --> 0:58:37.385 +You have each paragraph only one. + +0:58:37.958 --> 0:58:42.079 +There's a lot of text which occurs a lot of +times. + +0:58:42.079 --> 0:58:44.585 +They will happen all the time. + +0:58:44.884 --> 0:58:57.830 +There are pages about the cookie thing you +see and about accepting things. + +0:58:58.038 --> 0:59:04.963 +So you can already be duplicated here, or +your problem has crossed the website twice, + +0:59:04.963 --> 0:59:05.365 +and. + +0:59:06.066 --> 0:59:11.291 +Then you can remove low quality data like +cooking warnings that have biolabites start. + +0:59:12.012 --> 0:59:13.388 +Hey! + +0:59:13.173 --> 0:59:19.830 +So let you have maybe some other sentence, +and then you're doing a language idea. + +0:59:19.830 --> 0:59:29.936 +That means you want to have a text, which +is: You want to know for each sentence a paragraph + +0:59:29.936 --> 0:59:38.695 +which language it has so that you then, of +course, if you want. + +0:59:39.259 --> 0:59:44.987 +Finally, there is some complexity based film +screenings to believe, for example, for very + +0:59:44.987 --> 0:59:46.069 +high complexity. + +0:59:46.326 --> 0:59:59.718 +That means, for example, data where there's +a lot of crazy names which are growing. + +1:00:00.520 --> 1:00:09.164 +Sometimes it also improves very high perplexity +data because that is then unmanned generated + +1:00:09.164 --> 1:00:09.722 +data. + +1:00:11.511 --> 1:00:17.632 +And then the model which is mostly used for +that is what is called a laser model. + +1:00:18.178 --> 1:00:21.920 +It's based on machine translation. + +1:00:21.920 --> 1:00:28.442 +Hope it all recognizes the machine translation +architecture. + +1:00:28.442 --> 1:00:37.103 +However, there is a difference between a general +machine translation system and. + +1:01:00.000 --> 1:01:13.322 +Machine translation system, so it's messy. + +1:01:14.314 --> 1:01:24.767 +See one bigger difference, which is great +if I'm excluding that object or the other. + +1:01:25.405 --> 1:01:39.768 +There is one difference to the other, one +with attention, so we are having. 
+ +1:01:40.160 --> 1:01:43.642 +And then we are using that here in there each +time set up. + +1:01:44.004 --> 1:01:54.295 +Mean, therefore, it's maybe a bit similar +to original anti-system without attention. + +1:01:54.295 --> 1:01:56.717 +It's quite similar. + +1:01:57.597 --> 1:02:10.011 +However, it has this disadvantage saying that +we have to put everything in one sentence and + +1:02:10.011 --> 1:02:14.329 +that maybe not all information. + +1:02:15.055 --> 1:02:25.567 +However, now in this type of framework we +are not really interested in machine translation, + +1:02:25.567 --> 1:02:27.281 +so this model. + +1:02:27.527 --> 1:02:34.264 +So we are training it to do machine translation. + +1:02:34.264 --> 1:02:42.239 +What that means in the end should be as much +information. + +1:02:43.883 --> 1:03:01.977 +Only all the information in here is able to +really well do the machine translation. + +1:03:02.642 --> 1:03:07.801 +So that is the first step, so we are doing +here. + +1:03:07.801 --> 1:03:17.067 +We are building the MT system, not with the +goal of making the best MT system, but with + +1:03:17.067 --> 1:03:22.647 +learning and sentences, and hopefully all important. + +1:03:22.882 --> 1:03:26.116 +Because otherwise we won't be able to generate +the translation. + +1:03:26.906 --> 1:03:31.287 +So it's a bit more on the bottom neck like +to try to put as much information. + +1:03:32.012 --> 1:03:36.426 +And if you think if you want to do later finding +the bear's neighbor or something like. + +1:03:37.257 --> 1:03:48.680 +So finding similarities is typically possible +with fixed dimensional things, so we can do + +1:03:48.680 --> 1:03:56.803 +that in an end dimensional space and find the +nearest neighbor. + +1:03:57.857 --> 1:03:59.837 +Yeah, it would be very difficult. + +1:04:00.300 --> 1:04:03.865 +There's one thing that we also do. + +1:04:03.865 --> 1:04:09.671 +We don't want to find the nearest neighbor +in the other. + +1:04:10.570 --> 1:04:13.424 +Do you have an idea how we can train them? + +1:04:13.424 --> 1:04:16.542 +This is a set that embeddings can be compared. + +1:04:23.984 --> 1:04:36.829 +Any idea do you think about two lectures, +a three lecture stack, one that did gave. + +1:04:41.301 --> 1:04:50.562 +We can train them on a multilingual setting +and that's how it's done in lasers so we're + +1:04:50.562 --> 1:04:56.982 +not doing it only from German to English but +we're training. + +1:04:57.017 --> 1:05:04.898 +Mean, if the English one has to be useful +for German, French and so on, and for German + +1:05:04.898 --> 1:05:13.233 +also, the German and the English and so have +to be useful, then somehow we'll automatically + +1:05:13.233 --> 1:05:16.947 +learn that these embattes are popularly. + +1:05:17.437 --> 1:05:28.562 +And then we can use an exact as we will plan +to have a similar sentence embedding. + +1:05:28.908 --> 1:05:39.734 +If you put in here a German and a French one +and always generate as they both have the same + +1:05:39.734 --> 1:05:48.826 +translations, you give these sentences: And +you should do exactly the same thing, so that's + +1:05:48.826 --> 1:05:50.649 +of course the easiest. + +1:05:51.151 --> 1:05:59.817 +If the sentence is very different then most +people will also hear the English decoder and + +1:05:59.817 --> 1:06:00.877 +therefore. + +1:06:02.422 --> 1:06:04.784 +So that is the first thing. + +1:06:04.784 --> 1:06:06.640 +Now we have this one. + +1:06:06.640 --> 1:06:10.014 +We have to be trained on parallel data. 
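+
+NOTE
+A rough PyTorch sketch of the idea behind such an encoder: the whole sentence is squeezed into one fixed-size vector (here max-pooling over a bidirectional LSTM) that the decoder must translate from.
+Layer sizes are illustrative, not the actual LASER configuration.
+import torch.nn as nn
+class PoolingEncoder(nn.Module):
+    # The whole sentence ends up in one fixed-size vector, so sentences
+    # from different languages can later be compared in the same space.
+    def __init__(self, vocab_size, emb_dim=320, hidden_dim=512):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, emb_dim)
+        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
+                            bidirectional=True)
+    def forward(self, token_ids):                  # (batch, seq_len)
+        states, _ = self.lstm(self.embed(token_ids))
+        sentence_embedding, _ = states.max(dim=1)  # (batch, 2 * hidden_dim)
+        return sentence_embedding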
+ +1:06:10.390 --> 1:06:22.705 +Then we can use these embeddings on our new +data and try to use them to make efficient + +1:06:22.705 --> 1:06:24.545 +comparisons. + +1:06:26.286 --> 1:06:30.669 +So how can you do comparison? + +1:06:30.669 --> 1:06:37.243 +Maybe the first thing you think of is to do. + +1:06:37.277 --> 1:06:44.365 +So you take all the German sentences, all +the French sentences. + +1:06:44.365 --> 1:06:49.460 +We compute the Cousin's simple limit between. + +1:06:49.469 --> 1:06:58.989 +And then you take all pairs where the similarity +is very high. + +1:07:00.180 --> 1:07:17.242 +So you have your French list, you have them, +and then you just take all sentences. + +1:07:19.839 --> 1:07:29.800 +It's an additional power method that we have, +but we have a lot of data who will find a point. + +1:07:29.800 --> 1:07:32.317 +It's a good point, but. + +1:07:35.595 --> 1:07:45.738 +It's also not that easy, so one problem is +that typically there are some sentences where. + +1:07:46.066 --> 1:07:48.991 +And other points where there is very few points +in the neighborhood. + +1:07:49.629 --> 1:08:06.241 +And then for things where a lot of things +are enabled you might extract not for one percent + +1:08:06.241 --> 1:08:08.408 +to do that. + +1:08:08.868 --> 1:08:18.341 +So what typically is happening is you do the +max merchant? + +1:08:18.341 --> 1:08:25.085 +How good is a pair compared to the other? + +1:08:25.305 --> 1:08:33.859 +So you take the similarity between X and Y, +and then you look at one of the eight nearest + +1:08:33.859 --> 1:08:35.190 +neighbors of. + +1:08:35.115 --> 1:08:48.461 +Of x and what are the eight nearest neighbors +of y, and the dividing of the similarity through + +1:08:48.461 --> 1:08:51.411 +the eight neighbors. + +1:08:51.671 --> 1:09:00.333 +So what you may be looking at are these two +sentences a lot more similar than all the other. + +1:09:00.840 --> 1:09:13.455 +And if these are exceptional and similar compared +to other sentences then they should be translations. + +1:09:16.536 --> 1:09:19.158 +Of course, that has also some. + +1:09:19.158 --> 1:09:24.148 +Then the good thing is there's a lot of similar +sentences. + +1:09:24.584 --> 1:09:30.641 +If there is a lot of similar sensations in +white then these are also very similar and + +1:09:30.641 --> 1:09:32.824 +you are doing more comparison. + +1:09:32.824 --> 1:09:36.626 +If all the arrows are far away then the translations. + +1:09:37.057 --> 1:09:40.895 +So think about this like short sentences. + +1:09:40.895 --> 1:09:47.658 +They might be that most things are similar, +but they are just in general. + +1:09:49.129 --> 1:09:59.220 +There are some problems that now we assume +there is only one pair of translations. + +1:09:59.759 --> 1:10:09.844 +So it has some problems in their two or three +ballad translations of that. + +1:10:09.844 --> 1:10:18.853 +Then, of course, this pair might not find +it, but in general this. + +1:10:19.139 --> 1:10:27.397 +For example, they have like all of these common +trawl. + +1:10:27.397 --> 1:10:32.802 +They have large parallel data sets. + +1:10:36.376 --> 1:10:38.557 +One point maybe also year. + +1:10:38.557 --> 1:10:45.586 +Of course, now it's important that we have +done the deduplication before because if we + +1:10:45.586 --> 1:10:52.453 +wouldn't have the deduplication, we would have +points which are the same coordinate. + +1:10:57.677 --> 1:11:03.109 +Maybe only one small things to that mean. 
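+
+NOTE
+A small NumPy sketch of the margin scoring just described: the cosine similarity of a candidate pair is divided by the average similarity of each side to its k nearest neighbours, so only pairs that are exceptionally similar relative to their neighbourhood get a high score.
+import numpy as np
+def cosine(a, b):
+    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
+def margin_score(x, y, neighbours_x, neighbours_y):
+    # neighbours_x / neighbours_y: embeddings of the k nearest neighbours
+    # of x in the target space and of y in the source space.
+    k = len(neighbours_x)
+    background = (sum(cosine(x, z) for z in neighbours_x)
+                  + sum(cosine(y, z) for z in neighbours_y)) / (2 * k)
+    # A candidate pair is then kept only if this score exceeds a tuned threshold.
+    return cosine(x, y) / background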
+ +1:11:03.109 --> 1:11:09.058 +A major issue in this case is still making +a. + +1:11:09.409 --> 1:11:18.056 +So you have to still do all of this comparison, +and that cannot be done just by simple. + +1:11:19.199 --> 1:11:27.322 +So what is done typically express the word, +you know things can be done in parallel. + +1:11:28.368 --> 1:11:36.024 +So calculating the embeddings and all that +stuff doesn't need to be sequential, but it's + +1:11:36.024 --> 1:11:37.143 +independent. + +1:11:37.357 --> 1:11:48.680 +What you typically do is create an event and +then you do some kind of projectization. + +1:11:48.708 --> 1:11:57.047 +So there is this space library which does +key nearest neighbor search very efficient + +1:11:57.047 --> 1:11:59.597 +in very high-dimensional. + +1:12:00.080 --> 1:12:03.410 +And then based on that you can now do comparison. + +1:12:03.410 --> 1:12:06.873 +You can even do the comparison in parallel +because. + +1:12:06.906 --> 1:12:13.973 +Can look at different areas of your space +and then compare the different pieces to find + +1:12:13.973 --> 1:12:14.374 +the. + +1:12:15.875 --> 1:12:30.790 +With this you are then able to do very fast +calculations on this type of sentence. + +1:12:31.451 --> 1:12:34.761 +So yeah this is currently one. + +1:12:35.155 --> 1:12:48.781 +Mean, those of them are covered with this, +so there's a parade. + +1:12:48.668 --> 1:12:55.543 +We are collected by that and most of them +are in a very big corporate for languages which + +1:12:55.543 --> 1:12:57.453 +you can hardly stand on. + +1:12:58.778 --> 1:13:01.016 +Do you have any more questions on this? + +1:13:05.625 --> 1:13:17.306 +And then some more words to this last set +here: So we have now done our pearl marker + +1:13:17.306 --> 1:13:25.165 +and we could assume that everything is fine +now. + +1:13:25.465 --> 1:13:35.238 +However, the problem with this noisy data +is that typically this is quite noisy still, + +1:13:35.238 --> 1:13:35.687 +so. + +1:13:36.176 --> 1:13:44.533 +In order to make things efficient to have +a high recall, the final data is often not + +1:13:44.533 --> 1:13:49.547 +of the best quality, not the same type of quality. + +1:13:49.789 --> 1:13:58.870 +So it is essential to do another figuring +step and to remove senses which might seem + +1:13:58.870 --> 1:14:01.007 +to be translations. + +1:14:01.341 --> 1:14:08.873 +And here, of course, the final evaluation +matrix would be how much do my system improve? + +1:14:09.089 --> 1:14:23.476 +And there are even challenges on doing that +so: people getting this noisy data like symmetrics + +1:14:23.476 --> 1:14:25.596 +or something. + +1:14:27.707 --> 1:14:34.247 +However, all these steps is of course very +time consuming, so you might not always want + +1:14:34.247 --> 1:14:37.071 +to do the full pipeline and training. + +1:14:37.757 --> 1:14:51.614 +So how can you model that we want to get this +best and normally what we always want? + +1:14:51.871 --> 1:15:02.781 +You also want to have the best over translation +quality, but this is also normally not achieved + +1:15:02.781 --> 1:15:03.917 +with all. + +1:15:04.444 --> 1:15:12.389 +And that's why you're doing this two-step +approach first of the second alignment. + +1:15:12.612 --> 1:15:27.171 +And after once you do the sentence filtering, +we can put a lot more alphabet in all the comparisons. 
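+
+NOTE
+A sketch of the nearest-neighbour search step, assuming the library referred to above is FAISS and that the sentence embeddings are already computed: with L2-normalised vectors, inner-product search returns cosine similarities, which feed directly into the margin scoring.
+import numpy as np
+import faiss
+def k_nearest_targets(src_emb, tgt_emb, k=8):
+    # Exact inner-product search over float32 vectors; returns, for every
+    # source sentence, the similarities and indices of the k best targets.
+    src = np.ascontiguousarray(src_emb, dtype="float32")
+    tgt = np.ascontiguousarray(tgt_emb, dtype="float32")
+    faiss.normalize_L2(src)
+    faiss.normalize_L2(tgt)
+    index = faiss.IndexFlatIP(tgt.shape[1])
+    index.add(tgt)
+    sims, ids = index.search(src, k)
+    return sims, ids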
+ +1:15:27.627 --> 1:15:37.472 +For example, you can just translate the source +and compare that translation with the original + +1:15:37.472 --> 1:15:40.404 +one and calculate how good. + +1:15:40.860 --> 1:15:49.467 +And this, of course, you can do with the filing +set, but you can't do with your initial set + +1:15:49.467 --> 1:15:50.684 +of millions. + +1:15:54.114 --> 1:16:01.700 +So what it is again is the ancient test where +you input as a sentence pair as here, and then + +1:16:01.700 --> 1:16:09.532 +once you have a biometria, these are sentence +pairs with a high quality, and these are sentence + +1:16:09.532 --> 1:16:11.653 +pairs avec a low quality. + +1:16:12.692 --> 1:16:17.552 +Does anybody see what might be a challenge +if you want to train this type of classifier? + +1:16:22.822 --> 1:16:24.264 +How do you measure exactly? + +1:16:24.264 --> 1:16:26.477 +The quality is probably about the problem. + +1:16:27.887 --> 1:16:39.195 +Yes, that is one, that is true, there is even +more, more simple one, and high quality data + +1:16:39.195 --> 1:16:42.426 +here is not so difficult. + +1:16:43.303 --> 1:16:46.844 +Globally, yeah, probably we have a class in +balance. + +1:16:46.844 --> 1:16:49.785 +We don't see many bad quality combinations. + +1:16:49.785 --> 1:16:54.395 +It's hard to get there at the beginning, so +maybe how can you argue? + +1:16:54.395 --> 1:16:58.405 +Where do you find bad quality and what type +of bad quality? + +1:16:58.798 --> 1:17:05.122 +Because if it's too easy, you just take a +random germ and the random innocence that is + +1:17:05.122 --> 1:17:05.558 +very. + +1:17:05.765 --> 1:17:15.747 +But what you're interested is like bad quality +data, which still passes your first initial + +1:17:15.747 --> 1:17:16.405 +step. + +1:17:17.257 --> 1:17:28.824 +What you can use for that is you can use any +type of network or model that in the beginning, + +1:17:28.824 --> 1:17:33.177 +like in random forests, would see. + +1:17:33.613 --> 1:17:38.912 +So the positive examples are quite easy to +get. + +1:17:38.912 --> 1:17:44.543 +You just take parallel data and high quality +data. + +1:17:44.543 --> 1:17:45.095 +You. + +1:17:45.425 --> 1:17:47.565 +That is quite easy. + +1:17:47.565 --> 1:17:55.482 +You normally don't need a lot of data, then +to train in a few validation. + +1:17:57.397 --> 1:18:12.799 +The challenge is like the negative samples +because how would you generate negative samples? + +1:18:13.133 --> 1:18:17.909 +Because the negative examples are the ones +which ask the first step but don't ask the + +1:18:17.909 --> 1:18:18.353 +second. + +1:18:18.838 --> 1:18:23.682 +So how do you typically do it? + +1:18:23.682 --> 1:18:28.994 +You try to do synthetic examples. + +1:18:28.994 --> 1:18:33.369 +You can do random examples. + +1:18:33.493 --> 1:18:45.228 +But this is the typical error that you want +to detect when you do frequency based replacements. + +1:18:45.228 --> 1:18:52.074 +But this is one major issue when you generate +the data. + +1:18:52.132 --> 1:19:02.145 +That doesn't match well with what are the +real arrows that you're interested in. + +1:19:02.702 --> 1:19:13.177 +Is some of the most challenging here to find +the negative samples, which are hard enough + +1:19:13.177 --> 1:19:14.472 +to detect. + +1:19:17.537 --> 1:19:21.863 +And the other thing, which is difficult, is +of course the data ratio. + +1:19:22.262 --> 1:19:24.212 +Why is it important any? 
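A minimal, illustrative sketch of generating synthetic negatives for the filter classifier as just described: random mispairings plus light corruptions of real pairs that are harder to spot. The corruption operations and the neg_ratio parameter are assumptions, not the lecture's recipe; the ratio question raised here is taken up right below.

import random

def make_negatives(pairs, neg_ratio=1.0, seed=0):
    rng = random.Random(seed)
    negatives = []
    for _ in range(int(len(pairs) * neg_ratio)):
        src, tgt = rng.choice(pairs)
        kind = rng.random()
        if kind < 0.34:                          # random mispairing: an easy negative
            _, tgt = rng.choice(pairs)
        elif kind < 0.67:                        # drop a span: looks like a partial translation
            words = tgt.split()
            if len(words) > 3:
                cut = rng.randrange(1, len(words) // 2 + 1)
                start = rng.randrange(0, len(words) - cut)
                tgt = " ".join(words[:start] + words[start + cut:])
        else:                                    # copy the source: frequent crawl noise
            tgt = src
        negatives.append((src, tgt))
    return negatives

positives = [("Das ist ein Test.", "This is a test."),
             ("Guten Morgen!", "Good morning!")]
train_set = [(s, t, 1) for s, t in positives] + \
            [(s, t, 0) for s, t in make_negatives(positives, neg_ratio=1.0)]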
+ +1:19:24.212 --> 1:19:29.827 +Why is the ratio between positive and negative +examples here important? + +1:19:30.510 --> 1:19:40.007 +Because in a case of plus imbalance we effectively +could learn to just that it's positive and + +1:19:40.007 --> 1:19:43.644 +high quality and we would be right. + +1:19:44.844 --> 1:19:46.654 +Yes, so I'm training. + +1:19:46.654 --> 1:19:51.180 +This is important, but otherwise it might +be too easy. + +1:19:51.180 --> 1:19:52.414 +You always do. + +1:19:52.732 --> 1:19:58.043 +And on the other head, of course, navy and +deputy, it's also important because if we have + +1:19:58.043 --> 1:20:03.176 +equal things, we're also assuming that this +might be the other one, and if the quality + +1:20:03.176 --> 1:20:06.245 +is worse or higher, we might also accept too +fewer. + +1:20:06.626 --> 1:20:10.486 +So this ratio is not easy to determine. + +1:20:13.133 --> 1:20:16.969 +What type of features can we use? + +1:20:16.969 --> 1:20:23.175 +Traditionally, we're also looking at word +translation. + +1:20:23.723 --> 1:20:37.592 +And nowadays, of course, we can model this +also with something like similar, so this is + +1:20:37.592 --> 1:20:38.696 +again. + +1:20:40.200 --> 1:20:42.306 +Language follow. + +1:20:42.462 --> 1:20:49.763 +So we can, for example, put the sentence in +there for the source and the target, and then + +1:20:49.763 --> 1:20:56.497 +based on this classification label we can classify +as this a parallel sentence or. + +1:20:56.476 --> 1:21:00.054 +So it's more like a normal classification +task. + +1:21:00.160 --> 1:21:09.233 +And by having a system which can have much +enable input, we can just put in two R. + +1:21:09.233 --> 1:21:16.886 +We can also put in two independent of each +other based on the hidden. + +1:21:17.657 --> 1:21:35.440 +You can, as you do any other type of classifier, +you can train them on top of. + +1:21:35.895 --> 1:21:42.801 +This so it tries to represent the full sentence +and that's what you also want to do on. + +1:21:43.103 --> 1:21:45.043 +The Other Thing What They Can't Do Is, of +Course. + +1:21:45.265 --> 1:21:46.881 +You can make here. + +1:21:46.881 --> 1:21:52.837 +You can do your summation of all the hidden +statements that you said. + +1:21:58.698 --> 1:22:10.618 +Okay, and then one thing which we skipped +until now, and that is only briefly this fragment. + +1:22:10.630 --> 1:22:19.517 +So if we have sentences which are not really +parallel, can we also extract information from + +1:22:19.517 --> 1:22:20.096 +them? + +1:22:22.002 --> 1:22:25.627 +And so what here the test is? + +1:22:25.627 --> 1:22:33.603 +We have a sentence and we want to find within +or a sentence pair. + +1:22:33.603 --> 1:22:38.679 +We want to find within the sentence pair. + +1:22:39.799 --> 1:22:46.577 +And how that, for example, has been done is +using a lexical positive and negative association. + +1:22:47.187 --> 1:22:57.182 +And then you can transform your target sentence +into a signal and find a thing where you have. + +1:22:57.757 --> 1:23:00.317 +So I'm Going to Get a Clear Eye. + +1:23:00.480 --> 1:23:15.788 +So you hear the English sentence, the other +language, and you have an alignment between + +1:23:15.788 --> 1:23:18.572 +them, and then. + +1:23:18.818 --> 1:23:21.925 +This is not a light cell from a negative signal. + +1:23:22.322 --> 1:23:40.023 +And then you drink some sauce on there because +you want to have an area where there's. 
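A sketch of the classifier idea mentioned above: feed source and target jointly into a pretrained multilingual encoder and predict "parallel / not parallel". Using XLM-R via the Hugging Face transformers library is my assumption; the lecture does not name a specific model. (The fragment extraction started above continues right below.)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# source and target are encoded as one sequence pair; after fine-tuning on the positive
# pairs and the synthetic negatives from the previous sketch, the logits give the filter decision
batch = tokenizer(["Das ist ein Test."], ["This is a test."],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs[:, 1])   # probability of being a parallel pair (meaningful only after fine-tuning)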
+ +1:23:40.100 --> 1:23:51.728 +It doesn't matter if you have simple arrows +here by smooth saying you can't extract. + +1:23:51.972 --> 1:23:58.813 +So you try to find long segments here where +at least most of the words are somehow aligned. + +1:24:00.040 --> 1:24:10.069 +And then you take this one in the side and +extract that one as your parallel fragment, + +1:24:10.069 --> 1:24:10.645 +and. + +1:24:10.630 --> 1:24:21.276 +So in the end you not only have full sentences +but you also have partial sentences which might + +1:24:21.276 --> 1:24:27.439 +be helpful for especially if you have quite +low upset. + +1:24:32.332 --> 1:24:36.388 +That's everything work for today. + +1:24:36.388 --> 1:24:44.023 +What you hopefully remember is the thing about +how the general. + +1:24:44.184 --> 1:24:54.506 +We talked about how we can do the document +alignment and then we can do the sentence alignment, + +1:24:54.506 --> 1:24:57.625 +which can be done after the. + +1:24:59.339 --> 1:25:12.611 +Any more questions think on Thursday we had +to do a switch, so on Thursday there will be + +1:25:12.611 --> 1:25:15.444 +a practical thing. + +0:00:01.921 --> 0:00:16.424 +Hey welcome to today's lecture, what we today +want to look at is how we can make new. + +0:00:16.796 --> 0:00:26.458 +So until now we have this global system, the +encoder and the decoder mostly, and we haven't + +0:00:26.458 --> 0:00:29.714 +really thought about how long. + +0:00:30.170 --> 0:00:42.684 +And what we, for example, know is yeah, you +can make the systems bigger in different ways. + +0:00:42.684 --> 0:00:47.084 +We can make them deeper so the. + +0:00:47.407 --> 0:00:56.331 +And if we have at least enough data that typically +helps you make things performance better,. + +0:00:56.576 --> 0:01:00.620 +But of course leads to problems that we need +more resources. + +0:01:00.620 --> 0:01:06.587 +That is a problem at universities where we +have typically limited computation capacities. + +0:01:06.587 --> 0:01:11.757 +So at some point you have such big models +that you cannot train them anymore. + +0:01:13.033 --> 0:01:23.792 +And also for companies is of course important +if it costs you like to generate translation + +0:01:23.792 --> 0:01:26.984 +just by power consumption. + +0:01:27.667 --> 0:01:35.386 +So yeah, there's different reasons why you +want to do efficient machine translation. + +0:01:36.436 --> 0:01:48.338 +One reason is there are different ways of +how you can improve your machine translation + +0:01:48.338 --> 0:01:50.527 +system once we. + +0:01:50.670 --> 0:01:55.694 +There can be different types of data we looked +into data crawling, monolingual data. + +0:01:55.875 --> 0:01:59.024 +All this data and the aim is always. + +0:01:59.099 --> 0:02:06.067 +Of course, we are not just purely interested +in having more data, but the idea why we want + +0:02:06.067 --> 0:02:12.959 +to have more data is that more data also means +that we have better quality because mostly + +0:02:12.959 --> 0:02:17.554 +we are interested in increasing the quality +of the machine. + +0:02:18.838 --> 0:02:24.892 +But there's also other ways of how you can +improve the quality of a machine translation. + +0:02:25.325 --> 0:02:36.450 +And what is, of course, that is where most +research is focusing on. + +0:02:36.450 --> 0:02:44.467 +It means all we want to build better algorithms. + +0:02:44.684 --> 0:02:48.199 +Course: The other things are normally as good. 
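Going back to the parallel-fragment extraction summarised just before the wrap-up above: a minimal sketch. Aligned target words get +1, unaligned ones -1, the signal is smoothed, and the longest positive stretch is kept as a fragment. The window size, threshold and example data are illustrative assumptions.

import numpy as np

def extract_fragment(tgt_words, aligned_positions, window=3):
    signal = np.array([1.0 if i in aligned_positions else -1.0
                       for i in range(len(tgt_words))])
    kernel = np.ones(window) / window
    smoothed = np.convolve(signal, kernel, mode="same")    # single unaligned words no longer break a span
    best, cur_start, best_span = 0, None, None
    for i, v in enumerate(np.append(smoothed, -1.0)):      # sentinel value closes the last span
        if v > 0 and cur_start is None:
            cur_start = i
        elif v <= 0 and cur_start is not None:
            if i - cur_start > best:
                best, best_span = i - cur_start, (cur_start, i)
            cur_start = None
    return tgt_words[best_span[0]:best_span[1]] if best_span else []

words = "der Vertrag wurde gestern unterzeichnet und vieles mehr".split()
print(extract_fragment(words, aligned_positions={0, 1, 2, 4}))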
+ +0:02:48.199 --> 0:02:54.631 +Sometimes it's easier to improve, so often +it's easier to just collect more data than + +0:02:54.631 --> 0:02:57.473 +to invent some great view algorithms. + +0:02:57.473 --> 0:03:00.315 +But yeah, both of them are important. + +0:03:00.920 --> 0:03:09.812 +But there is this third thing, especially +with neural machine translation, and that means + +0:03:09.812 --> 0:03:11.590 +we make a bigger. + +0:03:11.751 --> 0:03:16.510 +Can be, as said, that we have more layers, +that we have wider layers. + +0:03:16.510 --> 0:03:19.977 +The other thing we talked a bit about is ensemble. + +0:03:19.977 --> 0:03:24.532 +That means we are not building one new machine +translation system. + +0:03:24.965 --> 0:03:27.505 +And we can easily build four. + +0:03:27.505 --> 0:03:32.331 +What is the typical strategy to build different +systems? + +0:03:32.331 --> 0:03:33.177 +Remember. + +0:03:35.795 --> 0:03:40.119 +It should be of course a bit different if +you have the same. + +0:03:40.119 --> 0:03:44.585 +If they all predict the same then combining +them doesn't help. + +0:03:44.585 --> 0:03:48.979 +So what is the easiest way if you have to +build four systems? + +0:03:51.711 --> 0:04:01.747 +And the Charleston's will take, but this is +the best output of a single system. + +0:04:02.362 --> 0:04:10.165 +Mean now, it's really three different systems +so that you later can combine them and maybe + +0:04:10.165 --> 0:04:11.280 +the average. + +0:04:11.280 --> 0:04:16.682 +Ensembles are typically that the average is +all probabilities. + +0:04:19.439 --> 0:04:24.227 +The idea is to think about neural networks. + +0:04:24.227 --> 0:04:29.342 +There's one parameter which can easily adjust. + +0:04:29.342 --> 0:04:36.525 +That's exactly the easiest way to randomize +with three different. + +0:04:37.017 --> 0:04:43.119 +They have the same architecture, so all the +hydroparameters are the same, but they are + +0:04:43.119 --> 0:04:43.891 +different. + +0:04:43.891 --> 0:04:46.556 +They will have different predictions. + +0:04:48.228 --> 0:04:52.572 +So, of course, bigger amounts. + +0:04:52.572 --> 0:05:05.325 +Some of these are a bit the easiest way of +improving your quality because you don't really + +0:05:05.325 --> 0:05:08.268 +have to do anything. + +0:05:08.588 --> 0:05:12.588 +There is limits on that bigger models only +get better. + +0:05:12.588 --> 0:05:19.132 +If you have enough training data you can't +do like a handheld layer and you will not work + +0:05:19.132 --> 0:05:24.877 +on very small data but with a recent amount +of data that is the easiest thing. + +0:05:25.305 --> 0:05:33.726 +However, they are challenging with making +better models, bigger motors, and that is the + +0:05:33.726 --> 0:05:34.970 +computation. + +0:05:35.175 --> 0:05:44.482 +So, of course, if you have a bigger model +that can mean that you have longer running + +0:05:44.482 --> 0:05:49.518 +times, if you have models, you have to times. + +0:05:51.171 --> 0:05:56.685 +Normally you cannot paralyze the different +layers because the input to one layer is always + +0:05:56.685 --> 0:06:02.442 +the output of the previous layer, so you propagate +that so it will also increase your runtime. + +0:06:02.822 --> 0:06:10.720 +Then you have to store all your models in +memory. + +0:06:10.720 --> 0:06:20.927 +If you have double weights you will have: +Is more difficult to then do back propagation. 
+ +0:06:20.927 --> 0:06:27.680 +You have to store in between the activations, +so there's not only do you increase the model + +0:06:27.680 --> 0:06:31.865 +in your memory, but also all these other variables +that. + +0:06:34.414 --> 0:06:36.734 +And so in general it is more expensive. + +0:06:37.137 --> 0:06:54.208 +And therefore there's good reasons in looking +into can we make these models sound more efficient. + +0:06:54.134 --> 0:07:00.982 +So it's been through the viewer, you can have +it okay, have one and one day of training time, + +0:07:00.982 --> 0:07:01.274 +or. + +0:07:01.221 --> 0:07:07.535 +Forty thousand euros and then what is the +best machine translation system I can get within + +0:07:07.535 --> 0:07:08.437 +this budget. + +0:07:08.969 --> 0:07:19.085 +And then, of course, you can make the models +bigger, but then you have to train them shorter, + +0:07:19.085 --> 0:07:24.251 +and then we can make more efficient algorithms. + +0:07:25.925 --> 0:07:31.699 +If you think about efficiency, there's a bit +different scenarios. + +0:07:32.312 --> 0:07:43.635 +So if you're more of coming from the research +community, what you'll be doing is building + +0:07:43.635 --> 0:07:47.913 +a lot of models in your research. + +0:07:48.088 --> 0:07:58.645 +So you're having your test set of maybe sentences, +calculating the blue score, then another model. + +0:07:58.818 --> 0:08:08.911 +So what that means is typically you're training +on millions of cents, so your training time + +0:08:08.911 --> 0:08:14.944 +is long, maybe a day, but maybe in other cases +a week. + +0:08:15.135 --> 0:08:22.860 +The testing is not really the cost efficient, +but the training is very costly. + +0:08:23.443 --> 0:08:37.830 +If you are more thinking of building models +for application, the scenario is quite different. + +0:08:38.038 --> 0:08:46.603 +And then you keep it running, and maybe thousands +of customers are using it in translating. + +0:08:46.603 --> 0:08:47.720 +So in that. + +0:08:48.168 --> 0:08:59.577 +And we will see that it is not always the +same type of challenges you can paralyze some + +0:08:59.577 --> 0:09:07.096 +things in training, which you cannot paralyze +in testing. + +0:09:07.347 --> 0:09:14.124 +For example, in training you have to do back +propagation, so you have to store the activations. + +0:09:14.394 --> 0:09:23.901 +Therefore, in testing we briefly discussed +that we would do it in more detail today in + +0:09:23.901 --> 0:09:24.994 +training. + +0:09:25.265 --> 0:09:36.100 +You know they're a target and you can process +everything in parallel while in testing. + +0:09:36.356 --> 0:09:46.741 +So you can only do one word at a time, and +so you can less paralyze this. + +0:09:46.741 --> 0:09:50.530 +Therefore, it's important. + +0:09:52.712 --> 0:09:55.347 +Is a specific task on this. + +0:09:55.347 --> 0:10:03.157 +For example, it's the efficiency task where +it's about making things as efficient. + +0:10:03.123 --> 0:10:09.230 +Is possible and they can look at different +resources. + +0:10:09.230 --> 0:10:14.207 +So how much deep fuel run time do you need? + +0:10:14.454 --> 0:10:19.366 +See how much memory you need or you can have +a fixed memory budget and then have to build + +0:10:19.366 --> 0:10:20.294 +the best system. + +0:10:20.500 --> 0:10:29.010 +And here is a bit like an example of that, +so there's three teams from Edinburgh from + +0:10:29.010 --> 0:10:30.989 +and they submitted. 
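Going back briefly to the ensembling mentioned earlier in this lecture (same architecture, different random seeds, output probabilities averaged): a minimal sketch of one decoding step. The model objects and their predict_step interface are purely illustrative assumptions.

import numpy as np

def ensemble_step(models, decoder_state, vocab_size):
    # each model is assumed to return a softmax distribution over the vocabulary for the next word
    probs = np.zeros(vocab_size)
    for m in models:
        probs += m.predict_step(decoder_state)
    probs /= len(models)                 # ensembling = averaging the distributions
    return int(np.argmax(probs))         # greedy choice here; beam search scores hypotheses the same way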
+ +0:10:31.131 --> 0:10:36.278 +So then, of course, if you want to know the +most efficient system you have to do a bit + +0:10:36.278 --> 0:10:36.515 +of. + +0:10:36.776 --> 0:10:44.656 +You want to have a better quality or more +runtime and there's not the one solution. + +0:10:44.656 --> 0:10:46.720 +You can improve your. + +0:10:46.946 --> 0:10:49.662 +And that you see that there are different +systems. + +0:10:49.909 --> 0:11:06.051 +Here is how many words you can do for a second +on the clock, and you want to be as talk as + +0:11:06.051 --> 0:11:07.824 +possible. + +0:11:08.068 --> 0:11:08.889 +And you see here a bit. + +0:11:08.889 --> 0:11:09.984 +This is a little bit different. + +0:11:11.051 --> 0:11:27.717 +You want to be there on the top right corner +and you can get a score of something between + +0:11:27.717 --> 0:11:29.014 +words. + +0:11:30.250 --> 0:11:34.161 +Two hundred and fifty thousand, then you'll +ever come and score zero point three. + +0:11:34.834 --> 0:11:41.243 +There is, of course, any bit of a decision, +but the question is, like how far can you again? + +0:11:41.243 --> 0:11:47.789 +Some of all these points on this line would +be winners because they are somehow most efficient + +0:11:47.789 --> 0:11:53.922 +in a way that there's no system which achieves +the same quality with less computational. + +0:11:57.657 --> 0:12:04.131 +So there's the one question of which resources +are you interested. + +0:12:04.131 --> 0:12:07.416 +Are you running it on CPU or GPU? + +0:12:07.416 --> 0:12:11.668 +There's different ways of paralyzing stuff. + +0:12:14.654 --> 0:12:20.777 +Another dimension is how you process your +data. + +0:12:20.777 --> 0:12:27.154 +There's really the best processing and streaming. + +0:12:27.647 --> 0:12:34.672 +So in batch processing you have the whole +document available so you can translate all + +0:12:34.672 --> 0:12:39.981 +sentences in perimeter and then you're interested +in throughput. + +0:12:40.000 --> 0:12:43.844 +But you can then process, for example, especially +in GPS. + +0:12:43.844 --> 0:12:49.810 +That's interesting, you're not translating +one sentence at a time, but you're translating + +0:12:49.810 --> 0:12:56.108 +one hundred sentences or so in parallel, so +you have one more dimension where you can paralyze + +0:12:56.108 --> 0:12:57.964 +and then be more efficient. + +0:12:58.558 --> 0:13:14.863 +On the other hand, for example sorts of documents, +so we learned that if you do badge processing + +0:13:14.863 --> 0:13:16.544 +you have. + +0:13:16.636 --> 0:13:24.636 +Then, of course, it makes sense to sort the +sentences in order to have the minimum thing + +0:13:24.636 --> 0:13:25.535 +attached. + +0:13:27.427 --> 0:13:32.150 +The other scenario is more the streaming scenario +where you do life translation. + +0:13:32.512 --> 0:13:40.212 +So in that case you can't wait for the whole +document to pass, but you have to do. + +0:13:40.520 --> 0:13:49.529 +And then, for example, that's especially in +situations like speech translation, and then + +0:13:49.529 --> 0:13:53.781 +you're interested in things like latency. + +0:13:53.781 --> 0:14:00.361 +So how much do you have to wait to get the +output of a sentence? 
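A sketch of the batch-processing point above: when the whole document is available, sorting sentences by length before batching minimises padding, and the original order is restored afterwards. The batch size and the translate_batch interface are illustrative assumptions.

def length_sorted_batches(sentences, batch_size=64):
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]           # similar lengths -> very little padding

def translate_document(sentences, translate_batch, batch_size=64):
    out = [None] * len(sentences)
    for idx, batch in length_sorted_batches(sentences, batch_size):
        for i, hyp in zip(idx, translate_batch(batch)):  # translate_batch: the actual MT system
            out[i] = hyp
    return out                                           # translations in the original document order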
+ +0:14:06.566 --> 0:14:16.956 +Finally, there is the thing about the implementation: +Today we're mainly looking at different algorithms, + +0:14:16.956 --> 0:14:23.678 +different models of how you can model them +in your machine translation system, but of + +0:14:23.678 --> 0:14:29.227 +course for the same algorithms there's also +different implementations. + +0:14:29.489 --> 0:14:38.643 +So, for example, for a machine translation +this tool could be very fast. + +0:14:38.638 --> 0:14:46.615 +So they have like coded a lot of the operations +very low resource, not low resource, low level + +0:14:46.615 --> 0:14:49.973 +on the directly on the QDAC kernels in. + +0:14:50.110 --> 0:15:00.948 +So the same attention network is typically +more efficient in that type of algorithm. + +0:15:00.880 --> 0:15:02.474 +Than in in any other. + +0:15:03.323 --> 0:15:13.105 +Of course, it might be other disadvantages, +so if you're a little worker or have worked + +0:15:13.105 --> 0:15:15.106 +in the practical. + +0:15:15.255 --> 0:15:22.604 +Because it's normally easier to understand, +easier to change, and so on, but there is again + +0:15:22.604 --> 0:15:23.323 +a train. + +0:15:23.483 --> 0:15:29.440 +You have to think about, do you want to include +this into my study or comparison or not? + +0:15:29.440 --> 0:15:36.468 +Should it be like I compare different implementations +and I also find the most efficient implementation? + +0:15:36.468 --> 0:15:39.145 +Or is it only about the pure algorithm? + +0:15:42.742 --> 0:15:50.355 +Yeah, when building these systems there is +a different trade-off to do. + +0:15:50.850 --> 0:15:56.555 +So there's one of the traders between memory +and throughput, so how many words can generate + +0:15:56.555 --> 0:15:57.299 +per second. + +0:15:57.557 --> 0:16:03.351 +So typically you can easily like increase +your scruple by increasing the batch size. + +0:16:03.643 --> 0:16:06.899 +So that means you are translating more sentences +in parallel. + +0:16:07.107 --> 0:16:09.241 +And gypsies are very good at that stuff. + +0:16:09.349 --> 0:16:15.161 +It should translate one sentence or one hundred +sentences, not the same time, but its. + +0:16:15.115 --> 0:16:20.784 +Rough are very similar because they are at +this efficient metrics multiplication so that + +0:16:20.784 --> 0:16:24.415 +you can do the same operation on all sentences +parallel. + +0:16:24.415 --> 0:16:30.148 +So typically that means if you increase your +benchmark you can do more things in parallel + +0:16:30.148 --> 0:16:31.995 +and you will translate more. + +0:16:31.952 --> 0:16:33.370 +Second. + +0:16:33.653 --> 0:16:43.312 +On the other hand, with this advantage, of +course you will need higher badge sizes and + +0:16:43.312 --> 0:16:44.755 +more memory. + +0:16:44.965 --> 0:16:56.452 +To begin with, the other problem is that you +have such big models that you can only translate + +0:16:56.452 --> 0:16:59.141 +with lower bed sizes. + +0:16:59.119 --> 0:17:08.466 +If you are running out of memory with translating, +one idea to go on that is to decrease your. + +0:17:13.453 --> 0:17:24.456 +Then there is the thing about quality in Screwport, +of course, and before it's like larger models, + +0:17:24.456 --> 0:17:28.124 +but in generally higher quality. + +0:17:28.124 --> 0:17:31.902 +The first one is always this way. + +0:17:32.092 --> 0:17:38.709 +Course: Not always larger model helps you +have over fitting at some point, but in generally. 
+ +0:17:43.883 --> 0:17:52.901 +And with this a bit on this training and testing +thing we had before. + +0:17:53.113 --> 0:17:58.455 +So it wears all the difference between training +and testing, and for the encoder and decoder. + +0:17:58.798 --> 0:18:06.992 +So if we are looking at what mentioned before +at training time, we have a source sentence + +0:18:06.992 --> 0:18:17.183 +here: And how this is processed on a is not +the attention here. + +0:18:17.183 --> 0:18:21.836 +That's a tubical transformer. + +0:18:22.162 --> 0:18:31.626 +And how we can do that on a is that we can +paralyze the ear ever since. + +0:18:31.626 --> 0:18:40.422 +The first thing to know is: So that is, of +course, not in all cases. + +0:18:40.422 --> 0:18:49.184 +We'll later talk about speech translation +where we might want to translate. + +0:18:49.389 --> 0:18:56.172 +Without the general case in, it's like you +have the full sentence you want to translate. + +0:18:56.416 --> 0:19:02.053 +So the important thing is we are here everything +available on the source side. + +0:19:03.323 --> 0:19:13.524 +And then this was one of the big advantages +that you can remember back of transformer. + +0:19:13.524 --> 0:19:15.752 +There are several. + +0:19:16.156 --> 0:19:25.229 +But the other one is now that we can calculate +the full layer. + +0:19:25.645 --> 0:19:29.318 +There is no dependency between this and this +state or this and this state. + +0:19:29.749 --> 0:19:36.662 +So we always did like here to calculate the +key value and query, and based on that you + +0:19:36.662 --> 0:19:37.536 +calculate. + +0:19:37.937 --> 0:19:46.616 +Which means we can do all these calculations +here in parallel and in parallel. + +0:19:48.028 --> 0:19:55.967 +And there, of course, is this very efficiency +because again for GPS it's too bigly possible + +0:19:55.967 --> 0:20:00.887 +to do these things in parallel and one after +each other. + +0:20:01.421 --> 0:20:10.311 +And then we can also for each layer one by +one, and then we calculate here the encoder. + +0:20:10.790 --> 0:20:21.921 +In training now an important thing is that +for the decoder we have the full sentence available + +0:20:21.921 --> 0:20:28.365 +because we know this is the target we should +generate. + +0:20:29.649 --> 0:20:33.526 +We have models now in a different way. + +0:20:33.526 --> 0:20:38.297 +This hidden state is only on the previous +ones. + +0:20:38.598 --> 0:20:51.887 +And the first thing here depends only on this +information, so you see if you remember we + +0:20:51.887 --> 0:20:56.665 +had this masked self-attention. + +0:20:56.896 --> 0:21:04.117 +So that means, of course, we can only calculate +the decoder once the encoder is done, but that's. + +0:21:04.444 --> 0:21:06.656 +Percent can calculate the end quarter. + +0:21:06.656 --> 0:21:08.925 +Then we can calculate here the decoder. + +0:21:09.569 --> 0:21:25.566 +But again in training we have x, y and that +is available so we can calculate everything + +0:21:25.566 --> 0:21:27.929 +in parallel. + +0:21:28.368 --> 0:21:40.941 +So the interesting thing or advantage of transformer +is in training. + +0:21:40.941 --> 0:21:46.408 +We can do it for the decoder. + +0:21:46.866 --> 0:21:54.457 +That means you will have more calculations +because you can only calculate one layer at + +0:21:54.457 --> 0:22:02.310 +a time, but for example the length which is +too bigly quite long or doesn't really matter + +0:22:02.310 --> 0:22:03.270 +that much. 
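A minimal numpy sketch of the masked self-attention that makes decoder training parallel, as described above: all target positions are processed in one matrix operation, and the causal mask ensures position t only attends to positions up to t. Dimensions and the single attention head are illustrative; the contrast with test time is discussed right below.

import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    T, _ = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])               # (T, T): every position against every position
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -1e9                                   # forbid attending to future positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V                                    # all T outputs computed in parallel

T, d = 6, 8
X = np.random.randn(T, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)                # at test time the same computation has to be
                                                          # redone step by step as words are generated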
+ +0:22:05.665 --> 0:22:10.704 +However, in testing this situation is different. + +0:22:10.704 --> 0:22:13.276 +In testing we only have. + +0:22:13.713 --> 0:22:20.622 +So this means we start with a sense: We don't +know the full sentence yet because we ought + +0:22:20.622 --> 0:22:29.063 +to regularly generate that so for the encoder +we have the same here but for the decoder. + +0:22:29.409 --> 0:22:39.598 +In this case we only have the first and the +second instinct, but only for all states in + +0:22:39.598 --> 0:22:40.756 +parallel. + +0:22:41.101 --> 0:22:51.752 +And then we can do the next step for y because +we are putting our most probable one. + +0:22:51.752 --> 0:22:58.643 +We do greedy search or beam search, but you +cannot do. + +0:23:03.663 --> 0:23:16.838 +Yes, so if we are interesting in making things +more efficient for testing, which we see, for + +0:23:16.838 --> 0:23:22.363 +example in the scenario of really our. + +0:23:22.642 --> 0:23:34.286 +It makes sense that we think about our architecture +and that we are currently working on attention + +0:23:34.286 --> 0:23:35.933 +based models. + +0:23:36.096 --> 0:23:44.150 +The decoder there is some of the most time +spent testing and testing. + +0:23:44.150 --> 0:23:47.142 +It's similar, but during. + +0:23:47.167 --> 0:23:50.248 +Nothing about beam search. + +0:23:50.248 --> 0:23:59.833 +It might be even more complicated because +in beam search you have to try different. + +0:24:02.762 --> 0:24:15.140 +So the question is what can you now do in +order to make your model more efficient and + +0:24:15.140 --> 0:24:21.905 +better in translation in these types of cases? + +0:24:24.604 --> 0:24:30.178 +And the one thing is to look into the encoded +decoder trailer. + +0:24:30.690 --> 0:24:43.898 +And then until now we typically assume that +the depth of the encoder and the depth of the + +0:24:43.898 --> 0:24:48.154 +decoder is roughly the same. + +0:24:48.268 --> 0:24:55.553 +So if you haven't thought about it, you just +take what is running well. + +0:24:55.553 --> 0:24:57.678 +You would try to do. + +0:24:58.018 --> 0:25:04.148 +However, we saw now that there is a quite +big challenge and the runtime is a lot longer + +0:25:04.148 --> 0:25:04.914 +than here. + +0:25:05.425 --> 0:25:14.018 +The question is also the case for the calculations, +or do we have there the same issue that we + +0:25:14.018 --> 0:25:21.887 +only get the good quality if we are having +high and high, so we know that making these + +0:25:21.887 --> 0:25:25.415 +more depths is increasing our quality. + +0:25:25.425 --> 0:25:31.920 +But what we haven't talked about is really +important that we increase the depth the same + +0:25:31.920 --> 0:25:32.285 +way. + +0:25:32.552 --> 0:25:41.815 +So what we can put instead also do is something +like this where you have a deep encoder and + +0:25:41.815 --> 0:25:42.923 +a shallow. + +0:25:43.163 --> 0:25:57.386 +So that would be that you, for example, have +instead of having layers on the encoder, and + +0:25:57.386 --> 0:25:59.757 +layers on the. + +0:26:00.080 --> 0:26:10.469 +So in this case the overall depth from start +to end would be similar and so hopefully. + +0:26:11.471 --> 0:26:21.662 +But we could a lot more things hear parallelized, +and hear what is costly at the end during decoding + +0:26:21.662 --> 0:26:22.973 +the decoder. + +0:26:22.973 --> 0:26:29.330 +Because that does change in an outer regressive +way, there we. + +0:26:31.411 --> 0:26:33.727 +And that that can be analyzed. 
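The deep-encoder / shallow-decoder idea above as a configuration sketch, using PyTorch's generic nn.Transformer purely to illustrate the hyperparameter change (a real MT system adds embeddings, positional encodings and an output layer around it). Results for such setups are discussed next.

import torch.nn as nn

# baseline: 6 encoder and 6 decoder layers
baseline = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)

# efficiency-oriented: 12 encoder layers (parallelisable, run once per sentence)
# and a single decoder layer (run once per generated word)
deep_enc_shallow_dec = nn.Transformer(d_model=512, nhead=8,
                                      num_encoder_layers=12, num_decoder_layers=1)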
+ +0:26:33.727 --> 0:26:38.734 +So here is some examples: Where people have +done all this. + +0:26:39.019 --> 0:26:55.710 +So here it's mainly interested on the orange +things, which is auto-regressive about the + +0:26:55.710 --> 0:26:57.607 +speed up. + +0:26:57.717 --> 0:27:15.031 +You have the system, so agree is not exactly +the same, but it's similar. + +0:27:15.055 --> 0:27:23.004 +It's always the case if you look at speed +up. + +0:27:23.004 --> 0:27:31.644 +Think they put a speed of so that's the baseline. + +0:27:31.771 --> 0:27:35.348 +So between and times as fast. + +0:27:35.348 --> 0:27:42.621 +If you switch from a system to where you have +layers in the. + +0:27:42.782 --> 0:27:52.309 +You see that although you have slightly more +parameters, more calculations are also roughly + +0:27:52.309 --> 0:28:00.283 +the same, but you can speed out because now +during testing you can paralyze. + +0:28:02.182 --> 0:28:09.754 +The other thing is that you're speeding up, +but if you look at the performance it's similar, + +0:28:09.754 --> 0:28:13.500 +so sometimes you improve, sometimes you lose. + +0:28:13.500 --> 0:28:20.421 +There's a bit of losing English to Romania, +but in general the quality is very slow. + +0:28:20.680 --> 0:28:30.343 +So you see that you can keep a similar performance +while improving your speed by just having different. + +0:28:30.470 --> 0:28:34.903 +And you also see the encoder layers from speed. + +0:28:34.903 --> 0:28:38.136 +They don't really metal that much. + +0:28:38.136 --> 0:28:38.690 +Most. + +0:28:38.979 --> 0:28:50.319 +Because if you compare the 12th system to +the 6th system you have a lower performance + +0:28:50.319 --> 0:28:57.309 +with 6th and colder layers but the speed is +similar. + +0:28:57.897 --> 0:29:02.233 +And see the huge decrease is it maybe due +to a lack of data. + +0:29:03.743 --> 0:29:11.899 +Good idea would say it's not the case. + +0:29:11.899 --> 0:29:23.191 +Romanian English should have the same number +of data. + +0:29:24.224 --> 0:29:31.184 +Maybe it's just that something in that language. + +0:29:31.184 --> 0:29:40.702 +If you generate Romanian maybe they need more +target dependencies. + +0:29:42.882 --> 0:29:46.263 +The Wine's the Eye Also Don't Know Any Sex +People Want To. + +0:29:47.887 --> 0:29:49.034 +There could be yeah the. + +0:29:49.889 --> 0:29:58.962 +As the maybe if you go from like a movie sphere +to a hybrid sphere, you can: It's very much + +0:29:58.962 --> 0:30:12.492 +easier to expand the vocabulary to English, +but it must be the vocabulary. + +0:30:13.333 --> 0:30:21.147 +Have to check, but would assume that in this +case the system is not retrained, but it's + +0:30:21.147 --> 0:30:22.391 +trained with. + +0:30:22.902 --> 0:30:30.213 +And that's why I was assuming that they have +the same, but maybe you'll write that in this + +0:30:30.213 --> 0:30:35.595 +piece, for example, if they were pre-trained, +the decoder English. + +0:30:36.096 --> 0:30:43.733 +But don't remember exactly if they do something +like that, but that could be a good. + +0:30:45.325 --> 0:30:52.457 +So this is some of the most easy way to speed +up. + +0:30:52.457 --> 0:31:01.443 +You just switch to hyperparameters, not to +implement anything. + +0:31:02.722 --> 0:31:08.367 +Of course, there's other ways of doing that. + +0:31:08.367 --> 0:31:11.880 +We'll look into two things. + +0:31:11.880 --> 0:31:16.521 +The other thing is the architecture. 
+ +0:31:16.796 --> 0:31:28.154 +We are now at some of the baselines that we +are doing. + +0:31:28.488 --> 0:31:39.978 +However, in translation in the decoder side, +it might not be the best solution. + +0:31:39.978 --> 0:31:41.845 +There is no. + +0:31:42.222 --> 0:31:47.130 +So we can use different types of architectures, +also in the encoder and the. + +0:31:47.747 --> 0:31:52.475 +And there's two ways of what you could do +different, or there's more ways. + +0:31:52.912 --> 0:31:54.825 +We will look into two todays. + +0:31:54.825 --> 0:31:58.842 +The one is average attention, which is a very +simple solution. + +0:31:59.419 --> 0:32:01.464 +You can do as it says. + +0:32:01.464 --> 0:32:04.577 +It's not really attending anymore. + +0:32:04.577 --> 0:32:08.757 +It's just like equal attendance to everything. + +0:32:09.249 --> 0:32:23.422 +And the other idea, which is currently done +in most systems which are optimized to efficiency, + +0:32:23.422 --> 0:32:24.913 +is we're. + +0:32:25.065 --> 0:32:32.623 +But on the decoder side we are then not using +transformer or self attention, but we are using + +0:32:32.623 --> 0:32:39.700 +recurrent neural network because they are the +disadvantage of recurrent neural network. + +0:32:39.799 --> 0:32:48.353 +And then the recurrent is normally easier +to calculate because it only depends on inputs, + +0:32:48.353 --> 0:32:49.684 +the input on. + +0:32:51.931 --> 0:33:02.190 +So what is the difference between decoding +and why is the tension maybe not sufficient + +0:33:02.190 --> 0:33:03.841 +for decoding? + +0:33:04.204 --> 0:33:14.390 +If we want to populate the new state, we only +have to look at the input and the previous + +0:33:14.390 --> 0:33:15.649 +state, so. + +0:33:16.136 --> 0:33:19.029 +We are more conditional here networks. + +0:33:19.029 --> 0:33:19.994 +We have the. + +0:33:19.980 --> 0:33:31.291 +Dependency to a fixed number of previous ones, +but that's rarely used for decoding. + +0:33:31.291 --> 0:33:39.774 +In contrast, in transformer we have this large +dependency, so. + +0:33:40.000 --> 0:33:52.760 +So from t minus one to y t so that is somehow +and mainly not very efficient in this way mean + +0:33:52.760 --> 0:33:56.053 +it's very good because. + +0:33:56.276 --> 0:34:03.543 +However, the disadvantage is that we also +have to do all these calculations, so if we + +0:34:03.543 --> 0:34:10.895 +more view from the point of view of efficient +calculation, this might not be the best. + +0:34:11.471 --> 0:34:20.517 +So the question is, can we change our architecture +to keep some of the advantages but make things + +0:34:20.517 --> 0:34:21.994 +more efficient? + +0:34:24.284 --> 0:34:31.131 +The one idea is what is called the average +attention, and the interesting thing is this + +0:34:31.131 --> 0:34:32.610 +work surprisingly. + +0:34:33.013 --> 0:34:38.917 +So the only idea what you're doing is doing +the decoder. + +0:34:38.917 --> 0:34:42.646 +You're not doing attention anymore. + +0:34:42.646 --> 0:34:46.790 +The attention weights are all the same. + +0:34:47.027 --> 0:35:00.723 +So you don't calculate with query and key +the different weights, and then you just take + +0:35:00.723 --> 0:35:03.058 +equal weights. + +0:35:03.283 --> 0:35:07.585 +So here would be one third from this, one +third from this, and one third. + +0:35:09.009 --> 0:35:14.719 +And while it is sufficient you can now do +precalculation and things get more efficient. 
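Before the formula is walked through below, a minimal sketch of the average-attention idea: instead of learned attention weights, position t takes the uniform average of decoder states 1..t, and a running sum makes every step O(1) rather than a sum over the whole history. Shapes and data are illustrative.

import numpy as np

def average_attention(decoder_states):
    T, d = decoder_states.shape
    running_sum = np.zeros(d)
    outputs = np.zeros((T, d))
    for t in range(T):
        running_sum += decoder_states[t]     # cumulative sum: old states never have to be revisited
        outputs[t] = running_sum / (t + 1)   # equal weight 1/(t+1) for every position seen so far
    return outputs

states = np.random.randn(5, 8)
avg = average_attention(states)
assert np.allclose(avg[2], states[:3].mean(axis=0))   # position 3 = average of states 1..3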
+ +0:35:15.195 --> 0:35:18.803 +So first go the formula that's maybe not directed +here. + +0:35:18.979 --> 0:35:38.712 +So the difference here is that your new hint +stage is the sum of all the hint states, then. + +0:35:38.678 --> 0:35:40.844 +So here would be with this. + +0:35:40.844 --> 0:35:45.022 +It would be one third of this plus one third +of this. + +0:35:46.566 --> 0:35:57.162 +But if you calculate it this way, it's not +yet being more efficient because you still + +0:35:57.162 --> 0:36:01.844 +have to sum over here all the hidden. + +0:36:04.524 --> 0:36:22.932 +But you can not easily speed up these things +by having an in between value, which is just + +0:36:22.932 --> 0:36:24.568 +always. + +0:36:25.585 --> 0:36:30.057 +If you take this as ten to one, you take this +one class this one. + +0:36:30.350 --> 0:36:36.739 +Because this one then was before this, and +this one was this, so in the end. + +0:36:37.377 --> 0:36:49.545 +So now this one is not the final one in order +to get the final one to do the average. + +0:36:49.545 --> 0:36:50.111 +So. + +0:36:50.430 --> 0:37:00.264 +But then if you do this calculation with speed +up you can do it with a fixed number of steps. + +0:37:00.180 --> 0:37:11.300 +Instead of the sun which depends on age, so +you only have to do calculations to calculate + +0:37:11.300 --> 0:37:12.535 +this one. + +0:37:12.732 --> 0:37:21.253 +Can you do a lakes on a wet spoon? + +0:37:21.253 --> 0:37:32.695 +For example, a light spoon here now takes +and. + +0:37:32.993 --> 0:37:38.762 +That's a very good point and that's why this +is now in the image. + +0:37:38.762 --> 0:37:44.531 +It's not very good so this is the one with +tilder and the tilder. + +0:37:44.884 --> 0:37:57.895 +So this one is just the sum of these two, +because this is just this one. + +0:37:58.238 --> 0:38:08.956 +So the sum of this is exactly as the sum of +these, and the sum of these is the sum of here. + +0:38:08.956 --> 0:38:15.131 +So you only do the sum in here, and the multiplying. + +0:38:15.255 --> 0:38:22.145 +So what you can mainly do here is you can +do it more mathematically. + +0:38:22.145 --> 0:38:31.531 +You can know this by tea taking out of the +sum, and then you can calculate the sum different. + +0:38:36.256 --> 0:38:42.443 +That maybe looks a bit weird and simple, so +we were all talking about this great attention + +0:38:42.443 --> 0:38:47.882 +that we can focus on different parts, and a +bit surprising on this work is now. + +0:38:47.882 --> 0:38:53.321 +In the end it might also work well without +really putting and just doing equal. + +0:38:53.954 --> 0:38:56.164 +Mean it's not that easy. + +0:38:56.376 --> 0:38:58.261 +It's like sometimes this is working. + +0:38:58.261 --> 0:39:00.451 +There's also report weight work that well. + +0:39:01.481 --> 0:39:05.848 +But I think it's an interesting way and it +maybe shows that a lot of. + +0:39:05.805 --> 0:39:10.669 +Things in the self or in the transformer paper +which are more put as like yet. + +0:39:10.669 --> 0:39:14.301 +These are some hyperparameters that are rounded +like that. + +0:39:14.301 --> 0:39:19.657 +You do the lay on all in between and that +you do a feat forward before and things like + +0:39:19.657 --> 0:39:20.026 +that. + +0:39:20.026 --> 0:39:25.567 +But these are also all important and the right +set up around that is also very important. + +0:39:28.969 --> 0:39:38.598 +The other thing you can do in the end is not +completely different from this one. 
+ +0:39:38.598 --> 0:39:42.521 +It's just like a very different. + +0:39:42.942 --> 0:39:54.338 +And that is a recurrent network which also +has this type of highway connection that can + +0:39:54.338 --> 0:40:01.330 +ignore the recurrent unit and directly put +the input. + +0:40:01.561 --> 0:40:10.770 +It's not really adding out, but if you see +the hitting step is your input, but what you + +0:40:10.770 --> 0:40:15.480 +can do is somehow directly go to the output. + +0:40:17.077 --> 0:40:28.390 +These are the four components of the simple +return unit, and the unit is motivated by GIS + +0:40:28.390 --> 0:40:33.418 +and by LCMs, which we have seen before. + +0:40:33.513 --> 0:40:43.633 +And that has proven to be very good for iron +ends, which allows you to have a gate on your. + +0:40:44.164 --> 0:40:48.186 +In this thing we have two gates, the reset +gate and the forget gate. + +0:40:48.768 --> 0:40:57.334 +So first we have the general structure which +has a cell state. + +0:40:57.334 --> 0:41:01.277 +Here we have the cell state. + +0:41:01.361 --> 0:41:09.661 +And then this goes next, and we always get +the different cell states over the times that. + +0:41:10.030 --> 0:41:11.448 +This Is the South Stand. + +0:41:11.771 --> 0:41:16.518 +How do we now calculate that just assume we +have an initial cell safe here? + +0:41:17.017 --> 0:41:19.670 +But the first thing is we're doing the forget +game. + +0:41:20.060 --> 0:41:34.774 +The forgetting models should the new cell +state mainly depend on the previous cell state + +0:41:34.774 --> 0:41:40.065 +or should it depend on our age. + +0:41:40.000 --> 0:41:41.356 +Like Add to Them. + +0:41:41.621 --> 0:41:42.877 +How can we model that? + +0:41:44.024 --> 0:41:45.599 +First we were at a cocktail. + +0:41:45.945 --> 0:41:52.151 +The forget gait is depending on minus one. + +0:41:52.151 --> 0:41:56.480 +You also see here the former. + +0:41:57.057 --> 0:42:01.963 +So we are multiplying both the cell state +and our input. + +0:42:01.963 --> 0:42:04.890 +With some weights we are getting. + +0:42:05.105 --> 0:42:08.472 +We are putting some Bay Inspector and then +we are doing Sigma Weed on that. + +0:42:08.868 --> 0:42:13.452 +So in the end we have numbers between zero +and one saying for each dimension. + +0:42:13.853 --> 0:42:22.041 +Like how much if it's near to zero we will +mainly use the new input. + +0:42:22.041 --> 0:42:31.890 +If it's near to one we will keep the input +and ignore the input at this dimension. + +0:42:33.313 --> 0:42:40.173 +And by this motivation we can then create +here the new sound state, and here you see + +0:42:40.173 --> 0:42:41.141 +the formal. + +0:42:41.601 --> 0:42:55.048 +So you take your foot back gate and multiply +it with your class. + +0:42:55.048 --> 0:43:00.427 +So if my was around then. + +0:43:00.800 --> 0:43:07.405 +In the other case, when the value was others, +that's what you added. + +0:43:07.405 --> 0:43:10.946 +Then you're adding a transformation. + +0:43:11.351 --> 0:43:24.284 +So if this value was maybe zero then you're +putting most of the information from inputting. + +0:43:25.065 --> 0:43:26.947 +Is already your element? + +0:43:26.947 --> 0:43:30.561 +The only question is now based on your element. + +0:43:30.561 --> 0:43:32.067 +What is the output? + +0:43:33.253 --> 0:43:47.951 +And there you have another opportunity so +you can either take the output or instead you + +0:43:47.951 --> 0:43:50.957 +prefer the input. 
+ +0:43:52.612 --> 0:43:58.166 +So is the value also the same for the recept +game and the forget game. + +0:43:58.166 --> 0:43:59.417 +Yes, the movie. + +0:44:00.900 --> 0:44:10.004 +Yes exactly so the matrices are different +and therefore it can be and that should be + +0:44:10.004 --> 0:44:16.323 +and maybe there is sometimes you want to have +information. + +0:44:16.636 --> 0:44:23.843 +So here again we have this vector with values +between zero and which says controlling how + +0:44:23.843 --> 0:44:25.205 +the information. + +0:44:25.505 --> 0:44:36.459 +And then the output is calculated here similar +to a cell stage, but again input is from. + +0:44:36.536 --> 0:44:45.714 +So either the reset gate decides should give +what is currently stored in there, or. + +0:44:46.346 --> 0:44:58.647 +So it's not exactly as the thing we had before, +with the residual connections where we added + +0:44:58.647 --> 0:45:01.293 +up, but here we do. + +0:45:04.224 --> 0:45:08.472 +This is the general idea of a simple recurrent +neural network. + +0:45:08.472 --> 0:45:13.125 +Then we will now look at how we can make things +even more efficient. + +0:45:13.125 --> 0:45:17.104 +But first do you have more questions on how +it is working? + +0:45:23.063 --> 0:45:38.799 +Now these calculations are a bit where things +get more efficient because this somehow. + +0:45:38.718 --> 0:45:43.177 +It depends on all the other damage for the +second one also. + +0:45:43.423 --> 0:45:48.904 +Because if you do a matrix multiplication +with a vector like for the output vector, each + +0:45:48.904 --> 0:45:52.353 +diameter of the output vector depends on all +the other. + +0:45:52.973 --> 0:46:06.561 +The cell state here depends because this one +is used here, and somehow the first dimension + +0:46:06.561 --> 0:46:11.340 +of the cell state only depends. + +0:46:11.931 --> 0:46:17.973 +In order to make that, of course, is sometimes +again making things less paralyzeable if things + +0:46:17.973 --> 0:46:18.481 +depend. + +0:46:19.359 --> 0:46:35.122 +Can easily make that different by changing +from the metric product to not a vector. + +0:46:35.295 --> 0:46:51.459 +So you do first, just like inside here, you +take like the first dimension, my second dimension. + +0:46:52.032 --> 0:46:53.772 +Is, of course, narrow. + +0:46:53.772 --> 0:46:59.294 +This should be reset or this should be because +it should be a different. + +0:46:59.899 --> 0:47:12.053 +Now the first dimension only depends on the +first dimension, so you don't have dependencies + +0:47:12.053 --> 0:47:16.148 +any longer between dimensions. + +0:47:18.078 --> 0:47:25.692 +Maybe it gets a bit clearer if you see about +it in this way, so what we have to do now. + +0:47:25.966 --> 0:47:31.911 +First, we have to do a metrics multiplication +on to gather and to get the. + +0:47:32.292 --> 0:47:38.041 +And then we only have the element wise operations +where we take this output. + +0:47:38.041 --> 0:47:38.713 +We take. + +0:47:39.179 --> 0:47:42.978 +Minus one and our original. + +0:47:42.978 --> 0:47:52.748 +Here we only have elemental abrasions which +can be optimally paralyzed. + +0:47:53.273 --> 0:48:07.603 +So here we have additional paralyzed things +across the dimension and don't have to do that. + +0:48:09.929 --> 0:48:24.255 +Yeah, but this you can do like in parallel +again for all xts. + +0:48:24.544 --> 0:48:33.014 +Here you can't do it in parallel, but you +only have to do it on each seat, and then you + +0:48:33.014 --> 0:48:34.650 +can parallelize. 
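A numpy sketch of the simple recurrent unit just discussed: all matrix products involve only the input x_t, so they can be precomputed for the whole sequence in parallel, and the recurrence itself uses only element-wise operations on the cell state, so the dimensions stay independent. The exact parametrisation of the gates varies between papers; this follows the description above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W, Wf, vf, bf, Wr, vr, br):
    T, d = X.shape
    Xt, Fx, Rx = X @ W, X @ Wf, X @ Wr      # matrix products depend only on X: parallel over t
    c = np.zeros(d)
    H = np.zeros((T, d))
    for t in range(T):                      # the recurrence itself is purely element-wise
        f = sigmoid(Fx[t] + vf * c + bf)    # forget gate; vf * c is element-wise, not a matrix product
        r = sigmoid(Rx[t] + vr * c + br)    # reset gate
        c = f * c + (1.0 - f) * Xt[t]       # keep the old cell state or take the transformed input
        H[t] = r * c + (1.0 - r) * X[t]     # highway-style output: cell state or the raw input
    return H

T, d = 10, 16
X = np.random.randn(T, d)
W, Wf, Wr = (np.random.randn(d, d) for _ in range(3))
vf, vr = np.random.randn(d), np.random.randn(d)
H = sru_layer(X, W, Wf, vf, np.zeros(d), Wr, vr, np.zeros(d))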
+ +0:48:35.495 --> 0:48:39.190 +But this maybe for the dimension. + +0:48:39.190 --> 0:48:42.124 +Maybe it's also important. + +0:48:42.124 --> 0:48:46.037 +I don't know if they have tried it. + +0:48:46.037 --> 0:48:55.383 +I assume it's not only for dimension reduction, +but it's hard because you can easily. + +0:49:01.001 --> 0:49:08.164 +People have even like made the second thing +even more easy. + +0:49:08.164 --> 0:49:10.313 +So there is this. + +0:49:10.313 --> 0:49:17.954 +This is how we have the highway connections +in the transformer. + +0:49:17.954 --> 0:49:20.699 +Then it's like you do. + +0:49:20.780 --> 0:49:24.789 +So that is like how things are put together +as a transformer. + +0:49:25.125 --> 0:49:39.960 +And that is a similar and simple recurring +neural network where you do exactly the same + +0:49:39.960 --> 0:49:44.512 +for the so you don't have. + +0:49:46.326 --> 0:49:47.503 +This type of things. + +0:49:49.149 --> 0:50:01.196 +And with this we are at the end of how to +make efficient architectures before we go to + +0:50:01.196 --> 0:50:02.580 +the next. + +0:50:13.013 --> 0:50:24.424 +Between the ink or the trader and the architectures +there is a next technique which is used in + +0:50:24.424 --> 0:50:28.988 +nearly all deburning very successful. + +0:50:29.449 --> 0:50:43.463 +So the idea is can we extract the knowledge +from a large network into a smaller one, but + +0:50:43.463 --> 0:50:45.983 +it's similarly. + +0:50:47.907 --> 0:50:53.217 +And the nice thing is that this really works, +and it may be very, very surprising. + +0:50:53.673 --> 0:51:03.035 +So the idea is that we have a large strong +model which we train for long, and the question + +0:51:03.035 --> 0:51:07.870 +is: Can that help us to train a smaller model? + +0:51:08.148 --> 0:51:16.296 +So can what we refer to as teacher model tell +us better to build a small student model than + +0:51:16.296 --> 0:51:17.005 +before. + +0:51:17.257 --> 0:51:27.371 +So what we're before in it as a student model, +we learn from the data and that is how we train + +0:51:27.371 --> 0:51:28.755 +our systems. + +0:51:29.249 --> 0:51:37.949 +The question is: Can we train this small model +better if we are not only learning from the + +0:51:37.949 --> 0:51:46.649 +data, but we are also learning from a large +model which has been trained maybe in the same + +0:51:46.649 --> 0:51:47.222 +data? + +0:51:47.667 --> 0:51:55.564 +So that you have then in the end a smaller +model that is somehow better performing than. + +0:51:55.895 --> 0:51:59.828 +And maybe that's on the first view. + +0:51:59.739 --> 0:52:05.396 +Very very surprising because it has seen the +same data so it should have learned the same + +0:52:05.396 --> 0:52:11.053 +so the baseline model trained only on the data +and the student teacher knowledge to still + +0:52:11.053 --> 0:52:11.682 +model it. + +0:52:11.682 --> 0:52:17.401 +They all have seen only this data because +your teacher modeling was also trained typically + +0:52:17.401 --> 0:52:19.161 +only on this model however. + +0:52:20.580 --> 0:52:30.071 +It has by now shown that by many ways the +model trained in the teacher and analysis framework + +0:52:30.071 --> 0:52:32.293 +is performing better. + +0:52:33.473 --> 0:52:40.971 +A bit of an explanation when we see how that +works. + +0:52:40.971 --> 0:52:46.161 +There's different ways of doing it. + +0:52:46.161 --> 0:52:47.171 +Maybe. + +0:52:47.567 --> 0:52:51.501 +So how does it work? 
+ +0:52:51.501 --> 0:53:04.802 +This is our student network, the normal one, +some type of new network. + +0:53:04.802 --> 0:53:06.113 +We're. + +0:53:06.586 --> 0:53:17.050 +So we are training the model to predict the +same thing as we are doing that by calculating. + +0:53:17.437 --> 0:53:23.173 +The cross angry loss was defined in a way +where saying all the probabilities for the + +0:53:23.173 --> 0:53:25.332 +correct word should be as high. + +0:53:25.745 --> 0:53:32.207 +So you are calculating your alphabet probabilities +always, and each time step you have an alphabet + +0:53:32.207 --> 0:53:33.055 +probability. + +0:53:33.055 --> 0:53:38.669 +What is the most probable in the next word +and your training signal is put as much of + +0:53:38.669 --> 0:53:43.368 +your probability mass to the correct word to +the word that is there in. + +0:53:43.903 --> 0:53:51.367 +And this is the chief by this cross entry +loss, which says with some of the all training + +0:53:51.367 --> 0:53:58.664 +examples of all positions, with some of the +full vocabulary, and then this one is this + +0:53:58.664 --> 0:54:03.947 +one that this current word is the case word +in the vocabulary. + +0:54:04.204 --> 0:54:11.339 +And then we take here the lock for the ability +of that, so what we made me do is: We have + +0:54:11.339 --> 0:54:27.313 +this metric here, so each position of your +vocabulary size. + +0:54:27.507 --> 0:54:38.656 +In the end what you just do is some of these +three lock probabilities, and then you want + +0:54:38.656 --> 0:54:40.785 +to have as much. + +0:54:41.041 --> 0:54:54.614 +So although this is a thumb over this metric +here, in the end of each dimension you. + +0:54:54.794 --> 0:55:06.366 +So that is a normal cross end to be lost that +we have discussed at the very beginning of + +0:55:06.366 --> 0:55:07.016 +how. + +0:55:08.068 --> 0:55:15.132 +So what can we do differently in the teacher +network? + +0:55:15.132 --> 0:55:23.374 +We also have a teacher network which is trained +on large data. + +0:55:24.224 --> 0:55:35.957 +And of course this distribution might be better +than the one from the small model because it's. + +0:55:36.456 --> 0:55:40.941 +So in this case we have now the training signal +from the teacher network. + +0:55:41.441 --> 0:55:46.262 +And it's the same way as we had before. + +0:55:46.262 --> 0:55:56.507 +The only difference is we're training not +the ground truths per ability distribution + +0:55:56.507 --> 0:55:59.159 +year, which is sharp. + +0:55:59.299 --> 0:56:11.303 +That's also a probability, so this word has +a high probability, but have some probability. + +0:56:12.612 --> 0:56:19.577 +And that is the main difference. + +0:56:19.577 --> 0:56:30.341 +Typically you do like the interpretation of +these. + +0:56:33.213 --> 0:56:38.669 +Because there's more information contained +in the distribution than in the front booth, + +0:56:38.669 --> 0:56:44.187 +because it encodes more information about the +language, because language always has more + +0:56:44.187 --> 0:56:47.907 +options to put alone, that's the same sentence +yes exactly. + +0:56:47.907 --> 0:56:53.114 +So there's ambiguity in there that is encoded +hopefully very well in the complaint. + +0:56:53.513 --> 0:56:57.257 +Trade you two networks so better than a student +network you have in there from your learner. 
+ +0:56:57.537 --> 0:57:05.961 +So maybe often there's only one correct word, +but it might be two or three, and then all + +0:57:05.961 --> 0:57:10.505 +of these three have a probability distribution. + +0:57:10.590 --> 0:57:21.242 +And then is the main advantage or one explanation +of why it's better to train from the. + +0:57:21.361 --> 0:57:32.652 +Of course, it's good to also keep the signal +in there because then you can prevent it because + +0:57:32.652 --> 0:57:33.493 +crazy. + +0:57:37.017 --> 0:57:49.466 +Any more questions on the first type of knowledge +distillation, also distribution changes. + +0:57:50.550 --> 0:58:02.202 +Coming around again, this would put it a bit +different, so this is not a solution to maintenance + +0:58:02.202 --> 0:58:04.244 +or distribution. + +0:58:04.744 --> 0:58:12.680 +But don't think it's performing worse than +only doing the ground tours because they also. + +0:58:13.113 --> 0:58:21.254 +So it's more like it's not improving you would +assume it's similarly helping you, but. + +0:58:21.481 --> 0:58:28.145 +Of course, if you now have a teacher, maybe +you have no danger on your target to Maine, + +0:58:28.145 --> 0:58:28.524 +but. + +0:58:28.888 --> 0:58:39.895 +Then you can use this one which is not the +ground truth but helpful to learn better for + +0:58:39.895 --> 0:58:42.147 +the distribution. + +0:58:46.326 --> 0:58:57.012 +The second idea is to do sequence level knowledge +distillation, so what we have in this case + +0:58:57.012 --> 0:59:02.757 +is we have looked at each position independently. + +0:59:03.423 --> 0:59:05.436 +Mean, we do that often. + +0:59:05.436 --> 0:59:10.972 +We are not generating a lot of sequences, +but that has a problem. + +0:59:10.972 --> 0:59:13.992 +We have this propagation of errors. + +0:59:13.992 --> 0:59:16.760 +We start with one area and then. + +0:59:17.237 --> 0:59:27.419 +So if we are doing word-level knowledge dissolution, +we are treating each word in the sentence independently. + +0:59:28.008 --> 0:59:32.091 +So we are not trying to like somewhat model +the dependency between. + +0:59:32.932 --> 0:59:47.480 +We can try to do that by sequence level knowledge +dissolution, but the problem is, of course,. + +0:59:47.847 --> 0:59:53.478 +So we can that for each position we can get +a distribution over all the words at this. + +0:59:53.793 --> 1:00:05.305 +But if we want to have a distribution of all +possible target sentences, that's not possible + +1:00:05.305 --> 1:00:06.431 +because. + +1:00:08.508 --> 1:00:15.940 +Area, so we can then again do a bit of a heck +on that. + +1:00:15.940 --> 1:00:23.238 +If we can't have a distribution of all sentences, +it. + +1:00:23.843 --> 1:00:30.764 +So what we can't do is you can not use the +teacher network and sample different translations. + +1:00:31.931 --> 1:00:39.327 +And now we can do different ways to train +them. + +1:00:39.327 --> 1:00:49.343 +We can use them as their probability, the +easiest one to assume. + +1:00:50.050 --> 1:00:56.373 +So what that ends to is that we're taking +our teacher network, we're generating some + +1:00:56.373 --> 1:01:01.135 +translations, and these ones we're using as +additional trading. + +1:01:01.781 --> 1:01:11.382 +Then we have mainly done this sequence level +because the teacher network takes us. + +1:01:11.382 --> 1:01:17.513 +These are all probable translations of the +sentence. 
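To make the word-level distillation above concrete, a minimal numpy sketch of the loss at one target position: an interpolation of the usual cross-entropy against the gold word and the cross-entropy against the teacher's full distribution. The weight lam and the toy distributions are illustrative choices; the sequence-level variant is refined further below.

import numpy as np

def word_level_kd_loss(student_probs, teacher_probs, gold_index, lam=0.5):
    # standard training signal: push probability mass onto the reference word
    ce_gold = -np.log(student_probs[gold_index])
    # distillation signal: match the teacher's distribution over the whole vocabulary
    ce_teacher = -np.sum(teacher_probs * np.log(student_probs))
    return (1.0 - lam) * ce_gold + lam * ce_teacher

vocab = 8
student = np.full(vocab, 1.0 / vocab)                                  # untrained student: uniform
teacher = np.array([0.05, 0.6, 0.25, 0.02, 0.02, 0.02, 0.02, 0.02])    # soft teacher distribution
print(word_level_kd_loss(student, teacher, gold_index=1))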
+ +1:01:26.286 --> 1:01:34.673 +And then you can do a bit of a yeah, and you +can try to better make a bit of an interpolated + +1:01:34.673 --> 1:01:36.206 +version of that. + +1:01:36.716 --> 1:01:42.802 +So what people have also done is like subsequent +level interpolations. + +1:01:42.802 --> 1:01:52.819 +You generate here several translations: But +then you don't use all of them. + +1:01:52.819 --> 1:02:00.658 +You do some metrics on which of these ones. + +1:02:01.021 --> 1:02:12.056 +So it's a bit more training on this brown +chose which might be improbable or unreachable + +1:02:12.056 --> 1:02:16.520 +because we can generate everything. + +1:02:16.676 --> 1:02:23.378 +And we are giving it an easier solution which +is also good quality and training of that. + +1:02:23.703 --> 1:02:32.602 +So you're not training it on a very difficult +solution, but you're training it on an easier + +1:02:32.602 --> 1:02:33.570 +solution. + +1:02:36.356 --> 1:02:38.494 +Any More Questions to This. + +1:02:40.260 --> 1:02:41.557 +Yeah. + +1:02:41.461 --> 1:02:44.296 +Good. + +1:02:43.843 --> 1:03:01.642 +Is to look at the vocabulary, so the problem +is we have seen that vocabulary calculations + +1:03:01.642 --> 1:03:06.784 +are often very presuming. + +1:03:09.789 --> 1:03:19.805 +The thing is that most of the vocabulary is +not needed for each sentence, so in each sentence. + +1:03:20.280 --> 1:03:28.219 +The question is: Can we somehow easily precalculate, +which words are probable to occur in the sentence, + +1:03:28.219 --> 1:03:30.967 +and then only calculate these ones? + +1:03:31.691 --> 1:03:34.912 +And this can be done so. + +1:03:34.912 --> 1:03:43.932 +For example, if you have sentenced card, it's +probably not happening. + +1:03:44.164 --> 1:03:48.701 +So what you can try to do is to limit your +vocabulary. + +1:03:48.701 --> 1:03:51.093 +You're considering for each. + +1:03:51.151 --> 1:04:04.693 +So you're no longer taking the full vocabulary +as possible output, but you're restricting. + +1:04:06.426 --> 1:04:18.275 +That typically works is that we limit it by +the most frequent words we always take because + +1:04:18.275 --> 1:04:23.613 +these are not so easy to align to words. + +1:04:23.964 --> 1:04:32.241 +To take the most treatment taggin' words and +then work that often aligns with one of the + +1:04:32.241 --> 1:04:32.985 +source. + +1:04:33.473 --> 1:04:46.770 +So for each source word you calculate the +word alignment on your training data, and then + +1:04:46.770 --> 1:04:51.700 +you calculate which words occur. + +1:04:52.352 --> 1:04:57.680 +And then for decoding you build this union +of maybe the source word list that other. + +1:04:59.960 --> 1:05:02.145 +Are like for each source work. + +1:05:02.145 --> 1:05:08.773 +One of the most frequent translations of these +source words, for example for each source work + +1:05:08.773 --> 1:05:13.003 +like in the most frequent ones, and then the +most frequent. + +1:05:13.193 --> 1:05:24.333 +In total, if you have short sentences, you +have a lot less words, so in most cases it's + +1:05:24.333 --> 1:05:26.232 +not more than. + +1:05:26.546 --> 1:05:33.957 +And so you have dramatically reduced your +vocabulary, and thereby can also fax a depot. + +1:05:35.495 --> 1:05:43.757 +That easy does anybody see what is challenging +here and why that might not always need. + +1:05:47.687 --> 1:05:54.448 +The performance is not why this might not. + +1:05:54.448 --> 1:06:01.838 +If you implement it, it might not be a strong. 
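+
+[Before that question is answered below, here is a rough sketch of how such a
+per-sentence candidate vocabulary can be built; freq_words and trans_table are
+assumed to be precomputed from corpus counts and word alignments.]
+
+def candidate_vocab(src_tokens, freq_words, trans_table, topk=20):
+    # freq_words: globally most frequent target words, always kept
+    # trans_table: source word -> its most frequent aligned target words,
+    #              extracted from word-aligned training data
+    cand = set(freq_words)
+    for w in src_tokens:
+        cand.update(trans_table.get(w, [])[:topk])
+    return cand  # the output softmax is then computed only over this set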
+ +1:06:01.941 --> 1:06:06.053 +You have to store this list. + +1:06:06.053 --> 1:06:14.135 +You have to burn the union and of course your +safe time. + +1:06:14.554 --> 1:06:21.920 +The second thing the vocabulary is used in +our last step, so we have the hidden state, + +1:06:21.920 --> 1:06:23.868 +and then we calculate. + +1:06:24.284 --> 1:06:29.610 +Now we are not longer calculating them for +all output words, but for a subset of them. + +1:06:30.430 --> 1:06:35.613 +However, this metric multiplication is typically +parallelized with the perfect but good. + +1:06:35.956 --> 1:06:46.937 +But if you not only calculate some of them, +if you're not modeling it right, it will take + +1:06:46.937 --> 1:06:52.794 +as long as before because of the nature of +the. + +1:06:56.776 --> 1:07:07.997 +Here for beam search there's some ideas of +course you can go back to greedy search because + +1:07:07.997 --> 1:07:10.833 +that's more efficient. + +1:07:11.651 --> 1:07:18.347 +And better quality, and you can buffer some +states in between, so how much buffering it's + +1:07:18.347 --> 1:07:22.216 +again this tradeoff between calculation and +memory. + +1:07:25.125 --> 1:07:41.236 +Then at the end of today what we want to look +into is one last type of new machine translation + +1:07:41.236 --> 1:07:42.932 +approach. + +1:07:43.403 --> 1:07:53.621 +And the idea is what we've already seen in +our first two steps is that this ultra aggressive + +1:07:53.621 --> 1:07:57.246 +park is taking community coding. + +1:07:57.557 --> 1:08:04.461 +Can process everything in parallel, but we +are always taking the most probable and then. + +1:08:05.905 --> 1:08:10.476 +The question is: Do we really need to do that? + +1:08:10.476 --> 1:08:14.074 +Therefore, there is a bunch of work. + +1:08:14.074 --> 1:08:16.602 +Can we do it differently? + +1:08:16.602 --> 1:08:19.616 +Can we generate a full target? + +1:08:20.160 --> 1:08:29.417 +We'll see it's not that easy and there's still +an open debate whether this is really faster + +1:08:29.417 --> 1:08:31.832 +and quality, but think. + +1:08:32.712 --> 1:08:45.594 +So, as said, what we have done is our encoder +decoder where we can process our encoder color, + +1:08:45.594 --> 1:08:50.527 +and then the output always depends. + +1:08:50.410 --> 1:08:54.709 +We generate the output and then we have to +put it here the wide because then everything + +1:08:54.709 --> 1:08:56.565 +depends on the purpose of the output. + +1:08:56.916 --> 1:09:10.464 +This is what is referred to as an outer-regressive +model and nearly outs speech generation and + +1:09:10.464 --> 1:09:16.739 +language generation or works in this outer. + +1:09:18.318 --> 1:09:21.132 +So the motivation is, can we do that more +efficiently? + +1:09:21.361 --> 1:09:31.694 +And can we somehow process all target words +in parallel? + +1:09:31.694 --> 1:09:41.302 +So instead of doing it one by one, we are +inputting. + +1:09:45.105 --> 1:09:46.726 +So how does it work? + +1:09:46.726 --> 1:09:50.587 +So let's first have a basic auto regressive +mode. + +1:09:50.810 --> 1:09:53.551 +So the encoder looks as it is before. + +1:09:53.551 --> 1:09:58.310 +That's maybe not surprising because here we +know we can paralyze. + +1:09:58.618 --> 1:10:04.592 +So we have put in here our ink holder and +generated the ink stash, so that's exactly + +1:10:04.592 --> 1:10:05.295 +the same. 
+ +1:10:05.845 --> 1:10:16.229 +However, now we need to do one more thing: +One challenge is what we had before and that's + +1:10:16.229 --> 1:10:26.799 +a challenge of natural language generation +like machine translation. + +1:10:32.672 --> 1:10:38.447 +We generate until we generate this out of +end of center stock, but if we now generate + +1:10:38.447 --> 1:10:44.625 +everything at once that's no longer possible, +so we cannot generate as long because we only + +1:10:44.625 --> 1:10:45.632 +generated one. + +1:10:46.206 --> 1:10:58.321 +So the question is how can we now determine +how long the sequence is, and we can also accelerate. + +1:11:00.000 --> 1:11:06.384 +Yes, but there would be one idea, and there +is other work which tries to do that. + +1:11:06.806 --> 1:11:15.702 +However, in here there's some work already +done before and maybe you remember we had the + +1:11:15.702 --> 1:11:20.900 +IBM models and there was this concept of fertility. + +1:11:21.241 --> 1:11:26.299 +The concept of fertility is means like for +one saucepan, and how many target pores does + +1:11:26.299 --> 1:11:27.104 +it translate? + +1:11:27.847 --> 1:11:34.805 +And exactly that we try to do here, and that +means we are calculating like at the top we + +1:11:34.805 --> 1:11:36.134 +are calculating. + +1:11:36.396 --> 1:11:42.045 +So it says word is translated into word. + +1:11:42.045 --> 1:11:54.171 +Word might be translated into words into, +so we're trying to predict in how many words. + +1:11:55.935 --> 1:12:10.314 +And then the end of the anchor, so this is +like a length estimation. + +1:12:10.314 --> 1:12:15.523 +You can do it otherwise. + +1:12:16.236 --> 1:12:24.526 +You initialize your decoder input and we know +it's good with word embeddings so we're trying + +1:12:24.526 --> 1:12:28.627 +to do the same thing and what people then do. + +1:12:28.627 --> 1:12:35.224 +They initialize it again with word embedding +but in the frequency of the. + +1:12:35.315 --> 1:12:36.460 +So we have the cartilage. + +1:12:36.896 --> 1:12:47.816 +So one has two, so twice the is and then one +is, so that is then our initialization. + +1:12:48.208 --> 1:12:57.151 +In other words, if you don't predict fertilities +but predict lengths, you can just initialize + +1:12:57.151 --> 1:12:57.912 +second. + +1:12:58.438 --> 1:13:07.788 +This often works a bit better, but that's +the other. + +1:13:07.788 --> 1:13:16.432 +Now you have everything in training and testing. + +1:13:16.656 --> 1:13:18.621 +This is all available at once. + +1:13:20.280 --> 1:13:31.752 +Then we can generate everything in parallel, +so we have the decoder stack, and that is now + +1:13:31.752 --> 1:13:33.139 +as before. + +1:13:35.395 --> 1:13:41.555 +And then we're doing the translation predictions +here on top of it in order to do. + +1:13:43.083 --> 1:13:59.821 +And then we are predicting here the target +words and once predicted, and that is the basic + +1:13:59.821 --> 1:14:00.924 +idea. + +1:14:01.241 --> 1:14:08.171 +Machine translation: Where the idea is, we +don't have to do one by one what we're. + +1:14:10.210 --> 1:14:13.900 +So this looks really, really, really great. + +1:14:13.900 --> 1:14:20.358 +On the first view there's one challenge with +this, and this is the baseline. + +1:14:20.358 --> 1:14:27.571 +Of course there's some improvements, but in +general the quality is often significant. + +1:14:28.068 --> 1:14:32.075 +So here you see the baseline models. 
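+
+[A minimal sketch of the fertility-based initialization described above: every
+source embedding is copied as often as its predicted fertility says, which both
+fixes the target length and gives the parallel decoder something to start from.
+Shapes and values are made up for illustration; the results discussion continues below.]
+
+import torch
+
+src_emb = torch.randn(4, 512)            # one sentence with 4 source embeddings
+fertility = torch.tensor([1, 2, 0, 1])   # predicted number of copies per source word
+dec_input = torch.repeat_interleave(src_emb, fertility, dim=0)
+# dec_input has 1 + 2 + 0 + 1 = 4 rows and thereby also defines the output length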
+ +1:14:32.075 --> 1:14:38.466 +You have a loss of ten blue points or something +like that. + +1:14:38.878 --> 1:14:40.230 +So why does it change? + +1:14:40.230 --> 1:14:41.640 +So why is it happening? + +1:14:43.903 --> 1:14:56.250 +If you look at the errors there is repetitive +tokens, so you have like or things like that. + +1:14:56.536 --> 1:15:01.995 +Broken senses or influent senses, so that +exactly where algebra aggressive models are + +1:15:01.995 --> 1:15:04.851 +very good, we say that's a bit of a problem. + +1:15:04.851 --> 1:15:07.390 +They generate very fluid transcription. + +1:15:07.387 --> 1:15:10.898 +Translation: Sometimes there doesn't have +to do anything with the input. + +1:15:11.411 --> 1:15:14.047 +But generally it really looks always very +fluid. + +1:15:14.995 --> 1:15:20.865 +Here exactly the opposite, so the problem +is that we don't have really fluid translation. + +1:15:21.421 --> 1:15:26.123 +And that is mainly due to the challenge that +we have this independent assumption. + +1:15:26.646 --> 1:15:35.873 +So in this case, the probability of Y of the +second position is independent of the probability + +1:15:35.873 --> 1:15:40.632 +of X, so we don't know what was there generated. + +1:15:40.632 --> 1:15:43.740 +We're just generating it there. + +1:15:43.964 --> 1:15:55.439 +You can see it also in a bit of examples. + +1:15:55.439 --> 1:16:03.636 +You can over-panelize shifts. + +1:16:04.024 --> 1:16:10.566 +And the problem is this is already an improvement +again, but this is also similar to. + +1:16:11.071 --> 1:16:19.900 +So you can, for example, translate heeded +back, or maybe you could also translate it + +1:16:19.900 --> 1:16:31.105 +with: But on their feeling down in feeling +down, if the first position thinks of their + +1:16:31.105 --> 1:16:34.594 +feeling done and the second. + +1:16:35.075 --> 1:16:42.908 +So each position here and that is one of the +main issues here doesn't know what the other. + +1:16:43.243 --> 1:16:53.846 +And for example, if you are translating something +with, you can often translate things in two + +1:16:53.846 --> 1:16:58.471 +ways: German with a different agreement. + +1:16:58.999 --> 1:17:02.047 +And then here where you have to decide do +you have to use jewelry. + +1:17:02.162 --> 1:17:05.460 +Interpretator: It doesn't know which word +it has to select. + +1:17:06.086 --> 1:17:14.789 +Mean, of course, it knows a hidden state, +but in the end you have a liability distribution. + +1:17:16.256 --> 1:17:20.026 +And that is the important thing in the outer +regressive month. + +1:17:20.026 --> 1:17:24.335 +You know that because you have put it in you +here, you don't know that. + +1:17:24.335 --> 1:17:29.660 +If it's equal probable here to two, you don't +Know Which Is Selected, and of course that + +1:17:29.660 --> 1:17:32.832 +depends on what should be the latest traction +under. + +1:17:33.333 --> 1:17:39.554 +Yep, that's the undershift, and we're going +to last last the next time. + +1:17:39.554 --> 1:17:39.986 +Yes. + +1:17:40.840 --> 1:17:44.935 +Doesn't this also appear in and like now we're +talking about physical training? + +1:17:46.586 --> 1:17:48.412 +The thing is in the auto regress. + +1:17:48.412 --> 1:17:50.183 +If you give it the correct one,. + +1:17:50.450 --> 1:17:55.827 +So if you predict here comma what the reference +is feeling then you tell the model here. + +1:17:55.827 --> 1:17:59.573 +The last one was feeling and then it knows +it has to be done. 
+ +1:17:59.573 --> 1:18:04.044 +But here it doesn't know that because it doesn't +get as input as a right. + +1:18:04.204 --> 1:18:24.286 +Yes, that's a bit depending on what. + +1:18:24.204 --> 1:18:27.973 +But in training, of course, you just try to +make the highest one the current one. + +1:18:31.751 --> 1:18:38.181 +So what you can do is things like CDC loss +which can adjust for this. + +1:18:38.181 --> 1:18:42.866 +So then you can also have this shifted correction. + +1:18:42.866 --> 1:18:50.582 +If you're doing this type of correction in +the CDC loss you don't get full penalty. + +1:18:50.930 --> 1:18:58.486 +Just shifted by one, so it's a bit of a different +loss, which is mainly used in, but. + +1:19:00.040 --> 1:19:03.412 +It can be used in order to address this problem. + +1:19:04.504 --> 1:19:13.844 +The other problem is that outer regressively +we have the label buyers that tries to disimmigrate. + +1:19:13.844 --> 1:19:20.515 +That's the example did before was if you translate +thank you to Dung. + +1:19:20.460 --> 1:19:31.925 +And then it might end up because it learns +in the first position and the second also. + +1:19:32.492 --> 1:19:43.201 +In order to prevent that, it would be helpful +for one output, only one output, so that makes + +1:19:43.201 --> 1:19:47.002 +the system already better learn. + +1:19:47.227 --> 1:19:53.867 +Might be that for slightly different inputs +you have different outputs, but for the same. + +1:19:54.714 --> 1:19:57.467 +That we can luckily very easily solve. + +1:19:59.119 --> 1:19:59.908 +And it's done. + +1:19:59.908 --> 1:20:04.116 +We just learned the technique about it, which +is called knowledge distillation. + +1:20:04.985 --> 1:20:13.398 +So what we can do and the easiest solution +to prove your non-autoregressive model is to + +1:20:13.398 --> 1:20:16.457 +train an auto regressive model. + +1:20:16.457 --> 1:20:22.958 +Then you decode your whole training gamer +with this model and then. + +1:20:23.603 --> 1:20:27.078 +While the main advantage of that is that this +is more consistent,. + +1:20:27.407 --> 1:20:33.995 +So for the same input you always have the +same output. + +1:20:33.995 --> 1:20:41.901 +So you have to make your training data more +consistent and learn. + +1:20:42.482 --> 1:20:54.471 +So there is another advantage of knowledge +distillation and that advantage is you have + +1:20:54.471 --> 1:20:59.156 +more consistent training signals. + +1:21:04.884 --> 1:21:10.287 +There's another to make the things more easy +at the beginning. + +1:21:10.287 --> 1:21:16.462 +There's this plants model, black model where +you put in parts of input. + +1:21:16.756 --> 1:21:26.080 +So during training, especially at the beginning, +you give some correct solutions at the beginning. + +1:21:28.468 --> 1:21:38.407 +And there is this tokens at a time, so the +idea is to establish other regressive training. + +1:21:40.000 --> 1:21:50.049 +And some targets are open, so you always predict +only like first auto regression is K. + +1:21:50.049 --> 1:21:59.174 +It puts one, so you always have one input +and one output, then you do partial. + +1:21:59.699 --> 1:22:05.825 +So in that way you can slowly learn what is +a good and what is a bad answer. + +1:22:08.528 --> 1:22:10.862 +It doesn't sound very impressive. + +1:22:10.862 --> 1:22:12.578 +Don't contact me anyway. + +1:22:12.578 --> 1:22:15.323 +Go all over your training data several. + +1:22:15.875 --> 1:22:20.655 +You can even switch in between. 
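+
+[The CTC-style loss mentioned above can be sketched with PyTorch's built-in
+nn.CTCLoss: the decoder emits T positions (at least as many as the reference has
+tokens) and outputs that are merely shifted are no longer fully penalized.
+The sizes below are made up for illustration.]
+
+import torch
+import torch.nn as nn
+
+T, N, V = 12, 2, 100                              # output positions, batch, vocab (id 0 = blank)
+log_probs = torch.randn(T, N, V).log_softmax(-1)  # non-autoregressive decoder outputs
+targets = torch.randint(1, V, (N, 6))             # reference token ids (no blanks)
+input_lengths = torch.full((N,), T, dtype=torch.long)
+target_lengths = torch.full((N,), 6, dtype=torch.long)
+loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)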
+ +1:22:20.655 --> 1:22:29.318 +There is a homework on this thing where you +try to start. + +1:22:31.271 --> 1:22:41.563 +You have to learn so there's a whole work +on that so this is often happening and it doesn't + +1:22:41.563 --> 1:22:46.598 +mean it's less efficient but still it helps. + +1:22:49.389 --> 1:22:57.979 +For later maybe here are some examples of +how much things help. + +1:22:57.979 --> 1:23:04.958 +Maybe one point here is that it's really important. + +1:23:05.365 --> 1:23:13.787 +Here's the translation performance and speed. + +1:23:13.787 --> 1:23:24.407 +One point which is a point is if you compare +researchers. + +1:23:24.784 --> 1:23:33.880 +So yeah, if you're compared to one very weak +baseline transformer even with beam search, + +1:23:33.880 --> 1:23:40.522 +then you're ten times slower than a very strong +auto regressive. + +1:23:40.961 --> 1:23:48.620 +If you make a strong baseline then it's going +down to depending on times and here like: You + +1:23:48.620 --> 1:23:53.454 +have a lot of different speed ups. + +1:23:53.454 --> 1:24:03.261 +Generally, it makes a strong baseline and +not very simple transformer. + +1:24:07.407 --> 1:24:20.010 +Yeah, with this one last thing that you can +do to speed up things and also reduce your + +1:24:20.010 --> 1:24:25.950 +memory is what is called half precision. + +1:24:26.326 --> 1:24:29.139 +And especially for decoding issues for training. + +1:24:29.139 --> 1:24:31.148 +Sometimes it also gets less stale. + +1:24:32.592 --> 1:24:45.184 +With this we close nearly wait a bit, so what +you should remember is that efficient machine + +1:24:45.184 --> 1:24:46.963 +translation. + +1:24:47.007 --> 1:24:51.939 +We have, for example, looked at knowledge +distillation. + +1:24:51.939 --> 1:24:55.991 +We have looked at non auto regressive models. + +1:24:55.991 --> 1:24:57.665 +We have different. + +1:24:58.898 --> 1:25:02.383 +For today and then only requests. + +1:25:02.383 --> 1:25:08.430 +So if you haven't done so, please fill out +the evaluation. + +1:25:08.388 --> 1:25:20.127 +So now if you have done so think then you +should have and with the online people hopefully. + +1:25:20.320 --> 1:25:29.758 +Only possibility to tell us what things are +good and what not the only one but the most + +1:25:29.758 --> 1:25:30.937 +efficient. + +1:25:31.851 --> 1:25:35.871 +So think of all the students doing it in this +case okay and then thank. + +0:00:01.921 --> 0:00:16.424 +Hey welcome to today's lecture, what we today +want to look at is how we can make new. + +0:00:16.796 --> 0:00:26.458 +So until now we have this global system, the +encoder and the decoder mostly, and we haven't + +0:00:26.458 --> 0:00:29.714 +really thought about how long. + +0:00:30.170 --> 0:00:42.684 +And what we, for example, know is yeah, you +can make the systems bigger in different ways. + +0:00:42.684 --> 0:00:47.084 +We can make them deeper so the. + +0:00:47.407 --> 0:00:56.331 +And if we have at least enough data that typically +helps you make things performance better,. + +0:00:56.576 --> 0:01:00.620 +But of course leads to problems that we need +more resources. + +0:01:00.620 --> 0:01:06.587 +That is a problem at universities where we +have typically limited computation capacities. + +0:01:06.587 --> 0:01:11.757 +So at some point you have such big models +that you cannot train them anymore. 
+ +0:01:13.033 --> 0:01:23.792 +And also for companies is of course important +if it costs you like to generate translation + +0:01:23.792 --> 0:01:26.984 +just by power consumption. + +0:01:27.667 --> 0:01:35.386 +So yeah, there's different reasons why you +want to do efficient machine translation. + +0:01:36.436 --> 0:01:48.338 +One reason is there are different ways of +how you can improve your machine translation + +0:01:48.338 --> 0:01:50.527 +system once we. + +0:01:50.670 --> 0:01:55.694 +There can be different types of data we looked +into data crawling, monolingual data. + +0:01:55.875 --> 0:01:59.024 +All this data and the aim is always. + +0:01:59.099 --> 0:02:05.735 +Of course, we are not just purely interested +in having more data, but the idea why we want + +0:02:05.735 --> 0:02:12.299 +to have more data is that more data also means +that we have better quality because mostly + +0:02:12.299 --> 0:02:17.550 +we are interested in increasing the quality +of the machine translation. + +0:02:18.838 --> 0:02:24.892 +But there's also other ways of how you can +improve the quality of a machine translation. + +0:02:25.325 --> 0:02:36.450 +And what is, of course, that is where most +research is focusing on. + +0:02:36.450 --> 0:02:44.467 +It means all we want to build better algorithms. + +0:02:44.684 --> 0:02:48.199 +Course: The other things are normally as good. + +0:02:48.199 --> 0:02:54.631 +Sometimes it's easier to improve, so often +it's easier to just collect more data than + +0:02:54.631 --> 0:02:57.473 +to invent some great view algorithms. + +0:02:57.473 --> 0:03:00.315 +But yeah, both of them are important. + +0:03:00.920 --> 0:03:09.812 +But there is this third thing, especially +with neural machine translation, and that means + +0:03:09.812 --> 0:03:11.590 +we make a bigger. + +0:03:11.751 --> 0:03:16.510 +Can be, as said, that we have more layers, +that we have wider layers. + +0:03:16.510 --> 0:03:19.977 +The other thing we talked a bit about is ensemble. + +0:03:19.977 --> 0:03:24.532 +That means we are not building one new machine +translation system. + +0:03:24.965 --> 0:03:27.505 +And we can easily build four. + +0:03:27.505 --> 0:03:32.331 +What is the typical strategy to build different +systems? + +0:03:32.331 --> 0:03:33.177 +Remember. + +0:03:35.795 --> 0:03:40.119 +It should be of course a bit different if +you have the same. + +0:03:40.119 --> 0:03:44.585 +If they all predict the same then combining +them doesn't help. + +0:03:44.585 --> 0:03:48.979 +So what is the easiest way if you have to +build four systems? + +0:03:51.711 --> 0:04:01.747 +And the Charleston's will take, but this is +the best output of a single system. + +0:04:02.362 --> 0:04:10.165 +Mean now, it's really three different systems +so that you later can combine them and maybe + +0:04:10.165 --> 0:04:11.280 +the average. + +0:04:11.280 --> 0:04:16.682 +Ensembles are typically that the average is +all probabilities. + +0:04:19.439 --> 0:04:24.227 +The idea is to think about neural networks. + +0:04:24.227 --> 0:04:29.342 +There's one parameter which can easily adjust. + +0:04:29.342 --> 0:04:36.525 +That's exactly the easiest way to randomize +with three different. + +0:04:37.017 --> 0:04:43.119 +They have the same architecture, so all the +hydroparameters are the same, but they are + +0:04:43.119 --> 0:04:43.891 +different. + +0:04:43.891 --> 0:04:46.556 +They will have different predictions. + +0:04:48.228 --> 0:04:52.572 +So, of course, bigger amounts. 
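+
+[A minimal sketch of ensembling models that only differ in their random seed:
+at every decoding step the output distributions are averaged. The models are
+assumed to be callables that return next-word logits for the current state.]
+
+import torch
+import torch.nn.functional as F
+
+def ensemble_next_word_probs(models, decoder_state):
+    # average the next-word distributions of models trained with different seeds
+    probs = [F.softmax(m(decoder_state), dim=-1) for m in models]
+    return torch.stack(probs).mean(dim=0)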
+ +0:04:52.572 --> 0:05:05.325 +Some of these are a bit the easiest way of +improving your quality because you don't really + +0:05:05.325 --> 0:05:08.268 +have to do anything. + +0:05:08.588 --> 0:05:12.588 +There is limits on that bigger models only +get better. + +0:05:12.588 --> 0:05:19.132 +If you have enough training data you can't +do like a handheld layer and you will not work + +0:05:19.132 --> 0:05:24.877 +on very small data but with a recent amount +of data that is the easiest thing. + +0:05:25.305 --> 0:05:33.726 +However, they are challenging with making +better models, bigger motors, and that is the + +0:05:33.726 --> 0:05:34.970 +computation. + +0:05:35.175 --> 0:05:44.482 +So, of course, if you have a bigger model +that can mean that you have longer running + +0:05:44.482 --> 0:05:49.518 +times, if you have models, you have to times. + +0:05:51.171 --> 0:05:56.685 +Normally you cannot paralyze the different +layers because the input to one layer is always + +0:05:56.685 --> 0:06:02.442 +the output of the previous layer, so you propagate +that so it will also increase your runtime. + +0:06:02.822 --> 0:06:10.720 +Then you have to store all your models in +memory. + +0:06:10.720 --> 0:06:20.927 +If you have double weights you will have: +Is more difficult to then do back propagation. + +0:06:20.927 --> 0:06:27.680 +You have to store in between the activations, +so there's not only do you increase the model + +0:06:27.680 --> 0:06:31.865 +in your memory, but also all these other variables +that. + +0:06:34.414 --> 0:06:36.734 +And so in general it is more expensive. + +0:06:37.137 --> 0:06:54.208 +And therefore there's good reasons in looking +into can we make these models sound more efficient. + +0:06:54.134 --> 0:07:00.982 +So it's been through the viewer, you can have +it okay, have one and one day of training time, + +0:07:00.982 --> 0:07:01.274 +or. + +0:07:01.221 --> 0:07:07.535 +Forty thousand euros and then what is the +best machine translation system I can get within + +0:07:07.535 --> 0:07:08.437 +this budget. + +0:07:08.969 --> 0:07:19.085 +And then, of course, you can make the models +bigger, but then you have to train them shorter, + +0:07:19.085 --> 0:07:24.251 +and then we can make more efficient algorithms. + +0:07:25.925 --> 0:07:31.699 +If you think about efficiency, there's a bit +different scenarios. + +0:07:32.312 --> 0:07:43.635 +So if you're more of coming from the research +community, what you'll be doing is building + +0:07:43.635 --> 0:07:47.913 +a lot of models in your research. + +0:07:48.088 --> 0:07:58.645 +So you're having your test set of maybe sentences, +calculating the blue score, then another model. + +0:07:58.818 --> 0:08:08.911 +So what that means is typically you're training +on millions of cents, so your training time + +0:08:08.911 --> 0:08:14.944 +is long, maybe a day, but maybe in other cases +a week. + +0:08:15.135 --> 0:08:22.860 +The testing is not really the cost efficient, +but the training is very costly. + +0:08:23.443 --> 0:08:37.830 +If you are more thinking of building models +for application, the scenario is quite different. + +0:08:38.038 --> 0:08:46.603 +And then you keep it running, and maybe thousands +of customers are using it in translating. + +0:08:46.603 --> 0:08:47.720 +So in that. + +0:08:48.168 --> 0:08:59.577 +And we will see that it is not always the +same type of challenges you can paralyze some + +0:08:59.577 --> 0:09:07.096 +things in training, which you cannot paralyze +in testing. 
+ +0:09:07.347 --> 0:09:14.124 +For example, in training you have to do back +propagation, so you have to store the activations. + +0:09:14.394 --> 0:09:23.901 +Therefore, in testing we briefly discussed +that we would do it in more detail today in + +0:09:23.901 --> 0:09:24.994 +training. + +0:09:25.265 --> 0:09:36.100 +You know they're a target and you can process +everything in parallel while in testing. + +0:09:36.356 --> 0:09:46.741 +So you can only do one word at a time, and +so you can less paralyze this. + +0:09:46.741 --> 0:09:50.530 +Therefore, it's important. + +0:09:52.712 --> 0:09:55.347 +Is a specific task on this. + +0:09:55.347 --> 0:10:03.157 +For example, it's the efficiency task where +it's about making things as efficient. + +0:10:03.123 --> 0:10:09.230 +Is possible and they can look at different +resources. + +0:10:09.230 --> 0:10:14.207 +So how much deep fuel run time do you need? + +0:10:14.454 --> 0:10:19.366 +See how much memory you need or you can have +a fixed memory budget and then have to build + +0:10:19.366 --> 0:10:20.294 +the best system. + +0:10:20.500 --> 0:10:29.010 +And here is a bit like an example of that, +so there's three teams from Edinburgh from + +0:10:29.010 --> 0:10:30.989 +and they submitted. + +0:10:31.131 --> 0:10:36.278 +So then, of course, if you want to know the +most efficient system you have to do a bit + +0:10:36.278 --> 0:10:36.515 +of. + +0:10:36.776 --> 0:10:44.656 +You want to have a better quality or more +runtime and there's not the one solution. + +0:10:44.656 --> 0:10:46.720 +You can improve your. + +0:10:46.946 --> 0:10:49.662 +And that you see that there are different +systems. + +0:10:49.909 --> 0:11:06.051 +Here is how many words you can do for a second +on the clock, and you want to be as talk as + +0:11:06.051 --> 0:11:07.824 +possible. + +0:11:08.068 --> 0:11:08.889 +And you see here a bit. + +0:11:08.889 --> 0:11:09.984 +This is a little bit different. + +0:11:11.051 --> 0:11:27.717 +You want to be there on the top right corner +and you can get a score of something between + +0:11:27.717 --> 0:11:29.014 +words. + +0:11:30.250 --> 0:11:34.161 +Two hundred and fifty thousand, then you'll +ever come and score zero point three. + +0:11:34.834 --> 0:11:41.243 +There is, of course, any bit of a decision, +but the question is, like how far can you again? + +0:11:41.243 --> 0:11:47.789 +Some of all these points on this line would +be winners because they are somehow most efficient + +0:11:47.789 --> 0:11:53.922 +in a way that there's no system which achieves +the same quality with less computational. + +0:11:57.657 --> 0:12:04.131 +So there's the one question of which resources +are you interested. + +0:12:04.131 --> 0:12:07.416 +Are you running it on CPU or GPU? + +0:12:07.416 --> 0:12:11.668 +There's different ways of paralyzing stuff. + +0:12:14.654 --> 0:12:20.777 +Another dimension is how you process your +data. + +0:12:20.777 --> 0:12:27.154 +There's really the best processing and streaming. + +0:12:27.647 --> 0:12:34.672 +So in batch processing you have the whole +document available so you can translate all + +0:12:34.672 --> 0:12:39.981 +sentences in perimeter and then you're interested +in throughput. + +0:12:40.000 --> 0:12:43.844 +But you can then process, for example, especially +in GPS. 
+ +0:12:43.844 --> 0:12:49.810 +That's interesting, you're not translating +one sentence at a time, but you're translating + +0:12:49.810 --> 0:12:56.108 +one hundred sentences or so in parallel, so +you have one more dimension where you can paralyze + +0:12:56.108 --> 0:12:57.964 +and then be more efficient. + +0:12:58.558 --> 0:13:14.863 +On the other hand, for example sorts of documents, +so we learned that if you do badge processing + +0:13:14.863 --> 0:13:16.544 +you have. + +0:13:16.636 --> 0:13:24.636 +Then, of course, it makes sense to sort the +sentences in order to have the minimum thing + +0:13:24.636 --> 0:13:25.535 +attached. + +0:13:27.427 --> 0:13:32.150 +The other scenario is more the streaming scenario +where you do life translation. + +0:13:32.512 --> 0:13:40.212 +So in that case you can't wait for the whole +document to pass, but you have to do. + +0:13:40.520 --> 0:13:49.529 +And then, for example, that's especially in +situations like speech translation, and then + +0:13:49.529 --> 0:13:53.781 +you're interested in things like latency. + +0:13:53.781 --> 0:14:00.361 +So how much do you have to wait to get the +output of a sentence? + +0:14:06.566 --> 0:14:16.956 +Finally, there is the thing about the implementation: +Today we're mainly looking at different algorithms, + +0:14:16.956 --> 0:14:23.678 +different models of how you can model them +in your machine translation system, but of + +0:14:23.678 --> 0:14:29.227 +course for the same algorithms there's also +different implementations. + +0:14:29.489 --> 0:14:38.643 +So, for example, for a machine translation +this tool could be very fast. + +0:14:38.638 --> 0:14:46.615 +So they have like coded a lot of the operations +very low resource, not low resource, low level + +0:14:46.615 --> 0:14:49.973 +on the directly on the QDAC kernels in. + +0:14:50.110 --> 0:15:00.948 +So the same attention network is typically +more efficient in that type of algorithm. + +0:15:00.880 --> 0:15:02.474 +Than in in any other. + +0:15:03.323 --> 0:15:13.105 +Of course, it might be other disadvantages, +so if you're a little worker or have worked + +0:15:13.105 --> 0:15:15.106 +in the practical. + +0:15:15.255 --> 0:15:22.604 +Because it's normally easier to understand, +easier to change, and so on, but there is again + +0:15:22.604 --> 0:15:23.323 +a train. + +0:15:23.483 --> 0:15:29.440 +You have to think about, do you want to include +this into my study or comparison or not? + +0:15:29.440 --> 0:15:36.468 +Should it be like I compare different implementations +and I also find the most efficient implementation? + +0:15:36.468 --> 0:15:39.145 +Or is it only about the pure algorithm? + +0:15:42.742 --> 0:15:50.355 +Yeah, when building these systems there is +a different trade-off to do. + +0:15:50.850 --> 0:15:56.555 +So there's one of the traders between memory +and throughput, so how many words can generate + +0:15:56.555 --> 0:15:57.299 +per second. + +0:15:57.557 --> 0:16:03.351 +So typically you can easily like increase +your scruple by increasing the batch size. + +0:16:03.643 --> 0:16:06.899 +So that means you are translating more sentences +in parallel. + +0:16:07.107 --> 0:16:09.241 +And gypsies are very good at that stuff. + +0:16:09.349 --> 0:16:15.161 +It should translate one sentence or one hundred +sentences, not the same time, but its. 
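+
+[A small sketch of the batch-processing idea mentioned above: sorting sentences
+by length before batching, so that sentences inside a batch have similar lengths
+and little computation is wasted on padding. In a streaming, latency-critical
+setting this is not possible, because one cannot wait for the whole document.]
+
+def length_sorted_batches(sentences, batch_size):
+    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
+    for start in range(0, len(order), batch_size):
+        batch = [sentences[i] for i in order[start:start + batch_size]]
+        yield batch  # keep `order` around to restore the original sentence order later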
+ +0:16:15.115 --> 0:16:20.997 +Rough are very similar because they have these +efficient metrics multiplication so that you + +0:16:20.997 --> 0:16:24.386 +can do the same operation on all sentences +parallel. + +0:16:24.386 --> 0:16:30.141 +So typically that means if you increase your +benchmark you can do more things in parallel + +0:16:30.141 --> 0:16:31.995 +and you will translate more. + +0:16:31.952 --> 0:16:33.370 +Second. + +0:16:33.653 --> 0:16:43.312 +On the other hand, with this advantage, of +course you will need higher badge sizes and + +0:16:43.312 --> 0:16:44.755 +more memory. + +0:16:44.965 --> 0:16:56.452 +To begin with, the other problem is that you +have such big models that you can only translate + +0:16:56.452 --> 0:16:59.141 +with lower bed sizes. + +0:16:59.119 --> 0:17:08.466 +If you are running out of memory with translating, +one idea to go on that is to decrease your. + +0:17:13.453 --> 0:17:24.456 +Then there is the thing about quality in Screwport, +of course, and before it's like larger models, + +0:17:24.456 --> 0:17:28.124 +but in generally higher quality. + +0:17:28.124 --> 0:17:31.902 +The first one is always this way. + +0:17:32.092 --> 0:17:38.709 +Course: Not always larger model helps you +have over fitting at some point, but in generally. + +0:17:43.883 --> 0:17:52.901 +And with this a bit on this training and testing +thing we had before. + +0:17:53.113 --> 0:17:58.455 +So it wears all the difference between training +and testing, and for the encoder and decoder. + +0:17:58.798 --> 0:18:06.992 +So if we are looking at what mentioned before +at training time, we have a source sentence + +0:18:06.992 --> 0:18:17.183 +here: And how this is processed on a is not +the attention here. + +0:18:17.183 --> 0:18:21.836 +That's a tubical transformer. + +0:18:22.162 --> 0:18:31.626 +And how we can do that on a is that we can +paralyze the ear ever since. + +0:18:31.626 --> 0:18:40.422 +The first thing to know is: So that is, of +course, not in all cases. + +0:18:40.422 --> 0:18:49.184 +We'll later talk about speech translation +where we might want to translate. + +0:18:49.389 --> 0:18:56.172 +Without the general case in, it's like you +have the full sentence you want to translate. + +0:18:56.416 --> 0:19:02.053 +So the important thing is we are here everything +available on the source side. + +0:19:03.323 --> 0:19:13.524 +And then this was one of the big advantages +that you can remember back of transformer. + +0:19:13.524 --> 0:19:15.752 +There are several. + +0:19:16.156 --> 0:19:25.229 +But the other one is now that we can calculate +the full layer. + +0:19:25.645 --> 0:19:29.318 +There is no dependency between this and this +state or this and this state. + +0:19:29.749 --> 0:19:36.662 +So we always did like here to calculate the +key value and query, and based on that you + +0:19:36.662 --> 0:19:37.536 +calculate. + +0:19:37.937 --> 0:19:46.616 +Which means we can do all these calculations +here in parallel and in parallel. + +0:19:48.028 --> 0:19:55.967 +And there, of course, is this very efficiency +because again for GPS it's too bigly possible + +0:19:55.967 --> 0:20:00.887 +to do these things in parallel and one after +each other. + +0:20:01.421 --> 0:20:10.311 +And then we can also for each layer one by +one, and then we calculate here the encoder. 
+ +0:20:10.790 --> 0:20:21.921 +In training now an important thing is that +for the decoder we have the full sentence available + +0:20:21.921 --> 0:20:28.365 +because we know this is the target we should +generate. + +0:20:29.649 --> 0:20:33.526 +We have models now in a different way. + +0:20:33.526 --> 0:20:38.297 +This hidden state is only on the previous +ones. + +0:20:38.598 --> 0:20:51.887 +And the first thing here depends only on this +information, so you see if you remember we + +0:20:51.887 --> 0:20:56.665 +had this masked self-attention. + +0:20:56.896 --> 0:21:04.117 +So that means, of course, we can only calculate +the decoder once the encoder is done, but that's. + +0:21:04.444 --> 0:21:06.656 +Percent can calculate the end quarter. + +0:21:06.656 --> 0:21:08.925 +Then we can calculate here the decoder. + +0:21:09.569 --> 0:21:25.566 +But again in training we have x, y and that +is available so we can calculate everything + +0:21:25.566 --> 0:21:27.929 +in parallel. + +0:21:28.368 --> 0:21:40.941 +So the interesting thing or advantage of transformer +is in training. + +0:21:40.941 --> 0:21:46.408 +We can do it for the decoder. + +0:21:46.866 --> 0:21:54.457 +That means you will have more calculations +because you can only calculate one layer at + +0:21:54.457 --> 0:22:02.310 +a time, but for example the length which is +too bigly quite long or doesn't really matter + +0:22:02.310 --> 0:22:03.270 +that much. + +0:22:05.665 --> 0:22:10.704 +However, in testing this situation is different. + +0:22:10.704 --> 0:22:13.276 +In testing we only have. + +0:22:13.713 --> 0:22:20.622 +So this means we start with a sense: We don't +know the full sentence yet because we ought + +0:22:20.622 --> 0:22:29.063 +to regularly generate that so for the encoder +we have the same here but for the decoder. + +0:22:29.409 --> 0:22:39.598 +In this case we only have the first and the +second instinct, but only for all states in + +0:22:39.598 --> 0:22:40.756 +parallel. + +0:22:41.101 --> 0:22:51.752 +And then we can do the next step for y because +we are putting our most probable one. + +0:22:51.752 --> 0:22:58.643 +We do greedy search or beam search, but you +cannot do. + +0:23:03.663 --> 0:23:16.838 +Yes, so if we are interesting in making things +more efficient for testing, which we see, for + +0:23:16.838 --> 0:23:22.363 +example in the scenario of really our. + +0:23:22.642 --> 0:23:34.286 +It makes sense that we think about our architecture +and that we are currently working on attention + +0:23:34.286 --> 0:23:35.933 +based models. + +0:23:36.096 --> 0:23:44.150 +The decoder there is some of the most time +spent testing and testing. + +0:23:44.150 --> 0:23:47.142 +It's similar, but during. + +0:23:47.167 --> 0:23:50.248 +Nothing about beam search. + +0:23:50.248 --> 0:23:59.833 +It might be even more complicated because +in beam search you have to try different. + +0:24:02.762 --> 0:24:15.140 +So the question is what can you now do in +order to make your model more efficient and + +0:24:15.140 --> 0:24:21.905 +better in translation in these types of cases? + +0:24:24.604 --> 0:24:30.178 +And the one thing is to look into the encoded +decoder trailer. + +0:24:30.690 --> 0:24:43.898 +And then until now we typically assume that +the depth of the encoder and the depth of the + +0:24:43.898 --> 0:24:48.154 +decoder is roughly the same. + +0:24:48.268 --> 0:24:55.553 +So if you haven't thought about it, you just +take what is running well. + +0:24:55.553 --> 0:24:57.678 +You would try to do. 
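+
+[To make the training-versus-decoding difference discussed above concrete:
+in training, one forward pass with the full reference target (and a causal mask
+inside the decoder) scores all positions in parallel, while at test time words
+must be produced one at a time. model(src, tgt_prefix) is a hypothetical
+encoder-decoder interface returning logits for every prefix position.]
+
+import torch
+
+def greedy_decode(model, src, bos_id, eos_id, max_len=100):
+    tgt = [bos_id]
+    for _ in range(max_len):
+        logits = model(src, torch.tensor([tgt]))   # re-run the decoder on the prefix
+        next_id = int(logits[0, -1].argmax())      # most probable next word
+        tgt.append(next_id)
+        if next_id == eos_id:
+            break
+    return tgt
+# in training, by contrast, a single call model(src, reference) is enough,
+# because the reference target is already known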
+ +0:24:58.018 --> 0:25:04.148 +However, we saw now that there is a quite +big challenge and the runtime is a lot longer + +0:25:04.148 --> 0:25:04.914 +than here. + +0:25:05.425 --> 0:25:14.018 +The question is also the case for the calculations, +or do we have there the same issue that we + +0:25:14.018 --> 0:25:21.887 +only get the good quality if we are having +high and high, so we know that making these + +0:25:21.887 --> 0:25:25.415 +more depths is increasing our quality. + +0:25:25.425 --> 0:25:31.920 +But what we haven't talked about is really +important that we increase the depth the same + +0:25:31.920 --> 0:25:32.285 +way. + +0:25:32.552 --> 0:25:41.815 +So what we can put instead also do is something +like this where you have a deep encoder and + +0:25:41.815 --> 0:25:42.923 +a shallow. + +0:25:43.163 --> 0:25:57.386 +So that would be that you, for example, have +instead of having layers on the encoder, and + +0:25:57.386 --> 0:25:59.757 +layers on the. + +0:26:00.080 --> 0:26:10.469 +So in this case the overall depth from start +to end would be similar and so hopefully. + +0:26:11.471 --> 0:26:21.662 +But we could a lot more things hear parallelized, +and hear what is costly at the end during decoding + +0:26:21.662 --> 0:26:22.973 +the decoder. + +0:26:22.973 --> 0:26:29.330 +Because that does change in an outer regressive +way, there we. + +0:26:31.411 --> 0:26:33.727 +And that that can be analyzed. + +0:26:33.727 --> 0:26:38.734 +So here is some examples: Where people have +done all this. + +0:26:39.019 --> 0:26:55.710 +So here it's mainly interested on the orange +things, which is auto-regressive about the + +0:26:55.710 --> 0:26:57.607 +speed up. + +0:26:57.717 --> 0:27:15.031 +You have the system, so agree is not exactly +the same, but it's similar. + +0:27:15.055 --> 0:27:23.004 +It's always the case if you look at speed +up. + +0:27:23.004 --> 0:27:31.644 +Think they put a speed of so that's the baseline. + +0:27:31.771 --> 0:27:35.348 +So between and times as fast. + +0:27:35.348 --> 0:27:42.621 +If you switch from a system to where you have +layers in the. + +0:27:42.782 --> 0:27:52.309 +You see that although you have slightly more +parameters, more calculations are also roughly + +0:27:52.309 --> 0:28:00.283 +the same, but you can speed out because now +during testing you can paralyze. + +0:28:02.182 --> 0:28:09.754 +The other thing is that you're speeding up, +but if you look at the performance it's similar, + +0:28:09.754 --> 0:28:13.500 +so sometimes you improve, sometimes you lose. + +0:28:13.500 --> 0:28:20.421 +There's a bit of losing English to Romania, +but in general the quality is very slow. + +0:28:20.680 --> 0:28:30.343 +So you see that you can keep a similar performance +while improving your speed by just having different. + +0:28:30.470 --> 0:28:34.903 +And you also see the encoder layers from speed. + +0:28:34.903 --> 0:28:38.136 +They don't really metal that much. + +0:28:38.136 --> 0:28:38.690 +Most. + +0:28:38.979 --> 0:28:50.319 +Because if you compare the 12th system to +the 6th system you have a lower performance + +0:28:50.319 --> 0:28:57.309 +with 6th and colder layers but the speed is +similar. + +0:28:57.897 --> 0:29:02.233 +And see the huge decrease is it maybe due +to a lack of data. + +0:29:03.743 --> 0:29:11.899 +Good idea would say it's not the case. + +0:29:11.899 --> 0:29:23.191 +Romanian English should have the same number +of data. + +0:29:24.224 --> 0:29:31.184 +Maybe it's just that something in that language. 
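+
+[The deep-encoder / shallow-decoder setup discussed above can be expressed with
+any transformer implementation; as a purely illustrative example with PyTorch's
+generic nn.Transformer (the layer counts are examples, not the exact numbers
+from the experiments referred to here).]
+
+import torch.nn as nn
+
+model = nn.Transformer(d_model=512, nhead=8,
+                       num_encoder_layers=12,   # deep, fully parallel at test time
+                       num_decoder_layers=1)    # shallow, run once per output word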
+ +0:29:31.184 --> 0:29:40.702 +If you generate Romanian maybe they need more +target dependencies. + +0:29:42.882 --> 0:29:46.263 +The Wine's the Eye Also Don't Know Any Sex +People Want To. + +0:29:47.887 --> 0:29:49.034 +There could be yeah the. + +0:29:49.889 --> 0:29:58.962 +As the maybe if you go from like a movie sphere +to a hybrid sphere, you can: It's very much + +0:29:58.962 --> 0:30:12.492 +easier to expand the vocabulary to English, +but it must be the vocabulary. + +0:30:13.333 --> 0:30:21.147 +Have to check, but would assume that in this +case the system is not retrained, but it's + +0:30:21.147 --> 0:30:22.391 +trained with. + +0:30:22.902 --> 0:30:30.213 +And that's why I was assuming that they have +the same, but maybe you'll write that in this + +0:30:30.213 --> 0:30:35.595 +piece, for example, if they were pre-trained, +the decoder English. + +0:30:36.096 --> 0:30:43.733 +But don't remember exactly if they do something +like that, but that could be a good. + +0:30:45.325 --> 0:30:52.457 +So this is some of the most easy way to speed +up. + +0:30:52.457 --> 0:31:01.443 +You just switch to hyperparameters, not to +implement anything. + +0:31:02.722 --> 0:31:08.367 +Of course, there's other ways of doing that. + +0:31:08.367 --> 0:31:11.880 +We'll look into two things. + +0:31:11.880 --> 0:31:16.521 +The other thing is the architecture. + +0:31:16.796 --> 0:31:28.154 +We are now at some of the baselines that we +are doing. + +0:31:28.488 --> 0:31:39.978 +However, in translation in the decoder side, +it might not be the best solution. + +0:31:39.978 --> 0:31:41.845 +There is no. + +0:31:42.222 --> 0:31:47.130 +So we can use different types of architectures, +also in the encoder and the. + +0:31:47.747 --> 0:31:52.475 +And there's two ways of what you could do +different, or there's more ways. + +0:31:52.912 --> 0:31:54.825 +We will look into two todays. + +0:31:54.825 --> 0:31:58.842 +The one is average attention, which is a very +simple solution. + +0:31:59.419 --> 0:32:01.464 +You can do as it says. + +0:32:01.464 --> 0:32:04.577 +It's not really attending anymore. + +0:32:04.577 --> 0:32:08.757 +It's just like equal attendance to everything. + +0:32:09.249 --> 0:32:23.422 +And the other idea, which is currently done +in most systems which are optimized to efficiency, + +0:32:23.422 --> 0:32:24.913 +is we're. + +0:32:25.065 --> 0:32:32.623 +But on the decoder side we are then not using +transformer or self attention, but we are using + +0:32:32.623 --> 0:32:39.700 +recurrent neural network because they are the +disadvantage of recurrent neural network. + +0:32:39.799 --> 0:32:48.353 +And then the recurrent is normally easier +to calculate because it only depends on inputs, + +0:32:48.353 --> 0:32:49.684 +the input on. + +0:32:51.931 --> 0:33:02.190 +So what is the difference between decoding +and why is the tension maybe not sufficient + +0:33:02.190 --> 0:33:03.841 +for decoding? + +0:33:04.204 --> 0:33:14.390 +If we want to populate the new state, we only +have to look at the input and the previous + +0:33:14.390 --> 0:33:15.649 +state, so. + +0:33:16.136 --> 0:33:19.029 +We are more conditional here networks. + +0:33:19.029 --> 0:33:19.994 +We have the. + +0:33:19.980 --> 0:33:31.291 +Dependency to a fixed number of previous ones, +but that's rarely used for decoding. + +0:33:31.291 --> 0:33:39.774 +In contrast, in transformer we have this large +dependency, so. 
+ +0:33:40.000 --> 0:33:52.760 +So from t minus one to y t so that is somehow +and mainly not very efficient in this way mean + +0:33:52.760 --> 0:33:56.053 +it's very good because. + +0:33:56.276 --> 0:34:03.543 +However, the disadvantage is that we also +have to do all these calculations, so if we + +0:34:03.543 --> 0:34:10.895 +more view from the point of view of efficient +calculation, this might not be the best. + +0:34:11.471 --> 0:34:20.517 +So the question is, can we change our architecture +to keep some of the advantages but make things + +0:34:20.517 --> 0:34:21.994 +more efficient? + +0:34:24.284 --> 0:34:31.131 +The one idea is what is called the average +attention, and the interesting thing is this + +0:34:31.131 --> 0:34:32.610 +work surprisingly. + +0:34:33.013 --> 0:34:38.917 +So the only idea what you're doing is doing +the decoder. + +0:34:38.917 --> 0:34:42.646 +You're not doing attention anymore. + +0:34:42.646 --> 0:34:46.790 +The attention weights are all the same. + +0:34:47.027 --> 0:35:00.723 +So you don't calculate with query and key +the different weights, and then you just take + +0:35:00.723 --> 0:35:03.058 +equal weights. + +0:35:03.283 --> 0:35:07.585 +So here would be one third from this, one +third from this, and one third. + +0:35:09.009 --> 0:35:14.719 +And while it is sufficient you can now do +precalculation and things get more efficient. + +0:35:15.195 --> 0:35:18.803 +So first go the formula that's maybe not directed +here. + +0:35:18.979 --> 0:35:38.712 +So the difference here is that your new hint +stage is the sum of all the hint states, then. + +0:35:38.678 --> 0:35:40.844 +So here would be with this. + +0:35:40.844 --> 0:35:45.022 +It would be one third of this plus one third +of this. + +0:35:46.566 --> 0:35:57.162 +But if you calculate it this way, it's not +yet being more efficient because you still + +0:35:57.162 --> 0:36:01.844 +have to sum over here all the hidden. + +0:36:04.524 --> 0:36:22.932 +But you can not easily speed up these things +by having an in between value, which is just + +0:36:22.932 --> 0:36:24.568 +always. + +0:36:25.585 --> 0:36:30.057 +If you take this as ten to one, you take this +one class this one. + +0:36:30.350 --> 0:36:36.739 +Because this one then was before this, and +this one was this, so in the end. + +0:36:37.377 --> 0:36:49.545 +So now this one is not the final one in order +to get the final one to do the average. + +0:36:49.545 --> 0:36:50.111 +So. + +0:36:50.430 --> 0:37:00.264 +But then if you do this calculation with speed +up you can do it with a fixed number of steps. + +0:37:00.180 --> 0:37:11.300 +Instead of the sun which depends on age, so +you only have to do calculations to calculate + +0:37:11.300 --> 0:37:12.535 +this one. + +0:37:12.732 --> 0:37:21.183 +Can you do the lakes and the lakes? + +0:37:21.183 --> 0:37:32.687 +For example, light bulb here now takes and +then. + +0:37:32.993 --> 0:37:38.762 +That's a very good point and that's why this +is now in the image. + +0:37:38.762 --> 0:37:44.531 +It's not very good so this is the one with +tilder and the tilder. + +0:37:44.884 --> 0:37:57.895 +So this one is just the sum of these two, +because this is just this one. + +0:37:58.238 --> 0:38:08.956 +So the sum of this is exactly as the sum of +these, and the sum of these is the sum of here. + +0:38:08.956 --> 0:38:15.131 +So you only do the sum in here, and the multiplying. + +0:38:15.255 --> 0:38:22.145 +So what you can mainly do here is you can +do it more mathematically. 
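+
+[A minimal sketch of the averaging trick: during training the equal-weight
+"attention" is just a cumulative sum over the previous states, and during
+decoding one running sum gives the same value with constant work per step,
+as the following remark in the lecture also explains. Shapes are illustrative.]
+
+import torch
+
+T, d = 5, 8
+x = torch.randn(T, d)                                  # decoder states of one sentence
+
+# all positions at once (training): cumulative sum divided by the position index
+avg_all = torch.cumsum(x, dim=0) / torch.arange(1, T + 1).unsqueeze(1)
+
+# one step at a time (decoding): keep a running sum instead of re-summing
+s = torch.zeros(d)
+for t in range(T):
+    s = s + x[t]
+    assert torch.allclose(s / (t + 1), avg_all[t])     # same result, O(1) per step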
+ +0:38:22.145 --> 0:38:31.531 +You can know this by tea taking out of the +sum, and then you can calculate the sum different. + +0:38:36.256 --> 0:38:42.443 +That maybe looks a bit weird and simple, so +we were all talking about this great attention + +0:38:42.443 --> 0:38:47.882 +that we can focus on different parts, and a +bit surprising on this work is now. + +0:38:47.882 --> 0:38:53.321 +In the end it might also work well without +really putting and just doing equal. + +0:38:53.954 --> 0:38:56.164 +Mean it's not that easy. + +0:38:56.376 --> 0:38:58.261 +It's like sometimes this is working. + +0:38:58.261 --> 0:39:00.451 +There's also report weight work that well. + +0:39:01.481 --> 0:39:05.848 +But I think it's an interesting way and it +maybe shows that a lot of. + +0:39:05.805 --> 0:39:10.669 +Things in the self or in the transformer paper +which are more put as like yet. + +0:39:10.669 --> 0:39:14.301 +These are some hyperparameters that are rounded +like that. + +0:39:14.301 --> 0:39:19.657 +You do the lay on all in between and that +you do a feat forward before and things like + +0:39:19.657 --> 0:39:20.026 +that. + +0:39:20.026 --> 0:39:25.567 +But these are also all important and the right +set up around that is also very important. + +0:39:28.969 --> 0:39:38.598 +The other thing you can do in the end is not +completely different from this one. + +0:39:38.598 --> 0:39:42.521 +It's just like a very different. + +0:39:42.942 --> 0:39:54.338 +And that is a recurrent network which also +has this type of highway connection that can + +0:39:54.338 --> 0:40:01.330 +ignore the recurrent unit and directly put +the input. + +0:40:01.561 --> 0:40:10.770 +It's not really adding out, but if you see +the hitting step is your input, but what you + +0:40:10.770 --> 0:40:15.480 +can do is somehow directly go to the output. + +0:40:17.077 --> 0:40:28.390 +These are the four components of the simple +return unit, and the unit is motivated by GIS + +0:40:28.390 --> 0:40:33.418 +and by LCMs, which we have seen before. + +0:40:33.513 --> 0:40:43.633 +And that has proven to be very good for iron +ends, which allows you to have a gate on your. + +0:40:44.164 --> 0:40:48.186 +In this thing we have two gates, the reset +gate and the forget gate. + +0:40:48.768 --> 0:40:57.334 +So first we have the general structure which +has a cell state. + +0:40:57.334 --> 0:41:01.277 +Here we have the cell state. + +0:41:01.361 --> 0:41:09.661 +And then this goes next, and we always get +the different cell states over the times that. + +0:41:10.030 --> 0:41:11.448 +This Is the South Stand. + +0:41:11.771 --> 0:41:16.518 +How do we now calculate that just assume we +have an initial cell safe here? + +0:41:17.017 --> 0:41:19.670 +But the first thing is we're doing the forget +game. + +0:41:20.060 --> 0:41:34.774 +The forgetting models should the new cell +state mainly depend on the previous cell state + +0:41:34.774 --> 0:41:40.065 +or should it depend on our age. + +0:41:40.000 --> 0:41:41.356 +Like Add to Them. + +0:41:41.621 --> 0:41:42.877 +How can we model that? + +0:41:44.024 --> 0:41:45.599 +First we were at a cocktail. + +0:41:45.945 --> 0:41:52.151 +The forget gait is depending on minus one. + +0:41:52.151 --> 0:41:56.480 +You also see here the former. + +0:41:57.057 --> 0:42:01.963 +So we are multiplying both the cell state +and our input. + +0:42:01.963 --> 0:42:04.890 +With some weights we are getting. 
+ +0:42:05.105 --> 0:42:08.472 +We are putting some Bay Inspector and then +we are doing Sigma Weed on that. + +0:42:08.868 --> 0:42:13.452 +So in the end we have numbers between zero +and one saying for each dimension. + +0:42:13.853 --> 0:42:22.041 +Like how much if it's near to zero we will +mainly use the new input. + +0:42:22.041 --> 0:42:31.890 +If it's near to one we will keep the input +and ignore the input at this dimension. + +0:42:33.313 --> 0:42:40.173 +And by this motivation we can then create +here the new sound state, and here you see + +0:42:40.173 --> 0:42:41.141 +the formal. + +0:42:41.601 --> 0:42:55.048 +So you take your foot back gate and multiply +it with your class. + +0:42:55.048 --> 0:43:00.427 +So if my was around then. + +0:43:00.800 --> 0:43:07.405 +In the other case, when the value was others, +that's what you added. + +0:43:07.405 --> 0:43:10.946 +Then you're adding a transformation. + +0:43:11.351 --> 0:43:24.284 +So if this value was maybe zero then you're +putting most of the information from inputting. + +0:43:25.065 --> 0:43:26.947 +Is already your element? + +0:43:26.947 --> 0:43:30.561 +The only question is now based on your element. + +0:43:30.561 --> 0:43:32.067 +What is the output? + +0:43:33.253 --> 0:43:47.951 +And there you have another opportunity so +you can either take the output or instead you + +0:43:47.951 --> 0:43:50.957 +prefer the input. + +0:43:52.612 --> 0:43:58.166 +So is the value also the same for the recept +game and the forget game. + +0:43:58.166 --> 0:43:59.417 +Yes, the movie. + +0:44:00.900 --> 0:44:10.004 +Yes exactly so the matrices are different +and therefore it can be and that should be + +0:44:10.004 --> 0:44:16.323 +and maybe there is sometimes you want to have +information. + +0:44:16.636 --> 0:44:23.843 +So here again we have this vector with values +between zero and which says controlling how + +0:44:23.843 --> 0:44:25.205 +the information. + +0:44:25.505 --> 0:44:36.459 +And then the output is calculated here similar +to a cell stage, but again input is from. + +0:44:36.536 --> 0:44:45.714 +So either the reset gate decides should give +what is currently stored in there, or. + +0:44:46.346 --> 0:44:58.647 +So it's not exactly as the thing we had before, +with the residual connections where we added + +0:44:58.647 --> 0:45:01.293 +up, but here we do. + +0:45:04.224 --> 0:45:08.472 +This is the general idea of a simple recurrent +neural network. + +0:45:08.472 --> 0:45:13.125 +Then we will now look at how we can make things +even more efficient. + +0:45:13.125 --> 0:45:17.104 +But first do you have more questions on how +it is working? + +0:45:23.063 --> 0:45:38.799 +Now these calculations are a bit where things +get more efficient because this somehow. + +0:45:38.718 --> 0:45:43.177 +It depends on all the other damage for the +second one also. + +0:45:43.423 --> 0:45:48.904 +Because if you do a matrix multiplication +with a vector like for the output vector, each + +0:45:48.904 --> 0:45:52.353 +diameter of the output vector depends on all +the other. + +0:45:52.973 --> 0:46:06.561 +The cell state here depends because this one +is used here, and somehow the first dimension + +0:46:06.561 --> 0:46:11.340 +of the cell state only depends. + +0:46:11.931 --> 0:46:17.973 +In order to make that, of course, is sometimes +again making things less paralyzeable if things + +0:46:17.973 --> 0:46:18.481 +depend. + +0:46:19.359 --> 0:46:35.122 +Can easily make that different by changing +from the metric product to not a vector. 
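+
+[A small sketch of such a simple recurrent unit; this follows the common
+formulation in which the gates are computed from the input only (the variant on
+the slides additionally lets the gates look at the previous cell state). The
+point illustrated next in the lecture is that all matrix products can be
+precomputed in parallel over time, and only cheap element-wise operations
+remain inside the loop.]
+
+import torch
+
+def sru_layer(x, W, Wf, Wr, bf, br):
+    # x: [T, d] inputs of one sentence; W, Wf, Wr: [d, d]; bf, br: [d]
+    xt, f_in, r_in = x @ W, x @ Wf + bf, x @ Wr + br   # parallel over all time steps
+    c, outputs = torch.zeros(x.size(1)), []
+    for t in range(x.size(0)):                         # only element-wise work left
+        f = torch.sigmoid(f_in[t])                     # forget gate
+        c = f * c + (1 - f) * xt[t]                    # new cell state
+        r = torch.sigmoid(r_in[t])                     # reset gate
+        outputs.append(r * c + (1 - r) * x[t])         # output with highway to the input
+    return torch.stack(outputs)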
+ +0:46:35.295 --> 0:46:51.459 +So you do first, just like inside here, you +take like the first dimension, my second dimension. + +0:46:52.032 --> 0:46:53.772 +Is, of course, narrow. + +0:46:53.772 --> 0:46:59.294 +This should be reset or this should be because +it should be a different. + +0:46:59.899 --> 0:47:12.053 +Now the first dimension only depends on the +first dimension, so you don't have dependencies + +0:47:12.053 --> 0:47:16.148 +any longer between dimensions. + +0:47:18.078 --> 0:47:25.692 +Maybe it gets a bit clearer if you see about +it in this way, so what we have to do now. + +0:47:25.966 --> 0:47:31.911 +First, we have to do a metrics multiplication +on to gather and to get the. + +0:47:32.292 --> 0:47:38.041 +And then we only have the element wise operations +where we take this output. + +0:47:38.041 --> 0:47:38.713 +We take. + +0:47:39.179 --> 0:47:42.978 +Minus one and our original. + +0:47:42.978 --> 0:47:52.748 +Here we only have elemental abrasions which +can be optimally paralyzed. + +0:47:53.273 --> 0:48:07.603 +So here we have additional paralyzed things +across the dimension and don't have to do that. + +0:48:09.929 --> 0:48:24.255 +Yeah, but this you can do like in parallel +again for all xts. + +0:48:24.544 --> 0:48:33.014 +Here you can't do it in parallel, but you +only have to do it on each seat, and then you + +0:48:33.014 --> 0:48:34.650 +can parallelize. + +0:48:35.495 --> 0:48:39.190 +But this maybe for the dimension. + +0:48:39.190 --> 0:48:42.124 +Maybe it's also important. + +0:48:42.124 --> 0:48:46.037 +I don't know if they have tried it. + +0:48:46.037 --> 0:48:55.383 +I assume it's not only for dimension reduction, +but it's hard because you can easily. + +0:49:01.001 --> 0:49:08.164 +People have even like made the second thing +even more easy. + +0:49:08.164 --> 0:49:10.313 +So there is this. + +0:49:10.313 --> 0:49:17.954 +This is how we have the highway connections +in the transformer. + +0:49:17.954 --> 0:49:20.699 +Then it's like you do. + +0:49:20.780 --> 0:49:24.789 +So that is like how things are put together +as a transformer. + +0:49:25.125 --> 0:49:39.960 +And that is a similar and simple recurring +neural network where you do exactly the same + +0:49:39.960 --> 0:49:44.512 +for the so you don't have. + +0:49:46.326 --> 0:49:47.503 +This type of things. + +0:49:49.149 --> 0:50:01.196 +And with this we are at the end of how to +make efficient architectures before we go to + +0:50:01.196 --> 0:50:02.580 +the next. + +0:50:13.013 --> 0:50:24.424 +Between the ink or the trader and the architectures +there is a next technique which is used in + +0:50:24.424 --> 0:50:28.988 +nearly all deburning very successful. + +0:50:29.449 --> 0:50:43.463 +So the idea is can we extract the knowledge +from a large network into a smaller one, but + +0:50:43.463 --> 0:50:45.983 +it's similarly. + +0:50:47.907 --> 0:50:53.217 +And the nice thing is that this really works, +and it may be very, very surprising. + +0:50:53.673 --> 0:51:03.000 +So the idea is that we have a large straw +model which we train for long, and the question + +0:51:03.000 --> 0:51:07.871 +is: Can that help us to train a smaller model? + +0:51:08.148 --> 0:51:16.296 +So can what we refer to as teacher model tell +us better to build a small student model than + +0:51:16.296 --> 0:51:17.005 +before. + +0:51:17.257 --> 0:51:27.371 +So what we're before in it as a student model, +we learn from the data and that is how we train + +0:51:27.371 --> 0:51:28.755 +our systems. 
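[Editor's note, picking up the efficiency point from just before the teacher-student idea: the reason the element-wise recurrence is fast is that all heavy matrix products involve only the inputs x_1..x_T, so they can be computed for the whole sequence in one batched multiplication, and the remaining time loop contains only cheap element-wise operations. The sketch below is a minimal NumPy illustration; the shapes, the tanh, and the zero initial state are assumptions.]

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fast_recurrence(X, W, W_f, W_r, v_f, b_f, b_r):
    """X has shape (T, d): the whole input sequence at once."""
    # All large matrix products depend only on the inputs, so they can be
    # done for every time step in parallel (one big matmul each).
    U   = X @ W.T                    # transformed inputs, (T, d)
    F_x = X @ W_f.T + b_f            # forget-gate pre-activations, (T, d)
    R   = sigmoid(X @ W_r.T + b_r)   # reset gates, (T, d)

    # The sequential part uses only element-wise operations, so each
    # dimension depends only on itself and the loop body is very cheap.
    T, d = X.shape
    c = np.zeros(d)
    H = np.empty((T, d))
    for t in range(T):
        f = sigmoid(F_x[t] + v_f * c)        # old state enters element-wise
        c = f * c + (1.0 - f) * U[t]
        H[t] = R[t] * np.tanh(c) + (1.0 - R[t]) * X[t]
    return H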
+ +0:51:29.249 --> 0:51:37.949 +The question is: Can we train this small model +better if we are not only learning from the + +0:51:37.949 --> 0:51:46.649 +data, but we are also learning from a large +model which has been trained maybe in the same + +0:51:46.649 --> 0:51:47.222 +data? + +0:51:47.667 --> 0:51:55.564 +So that you have then in the end a smaller +model that is somehow better performing than. + +0:51:55.895 --> 0:51:59.828 +And maybe that's on the first view. + +0:51:59.739 --> 0:52:05.396 +Very very surprising because it has seen the +same data so it should have learned the same + +0:52:05.396 --> 0:52:11.053 +so the baseline model trained only on the data +and the student teacher knowledge to still + +0:52:11.053 --> 0:52:11.682 +model it. + +0:52:11.682 --> 0:52:17.401 +They all have seen only this data because +your teacher modeling was also trained typically + +0:52:17.401 --> 0:52:19.161 +only on this model however. + +0:52:20.580 --> 0:52:30.071 +It has by now shown that by many ways the +model trained in the teacher and analysis framework + +0:52:30.071 --> 0:52:32.293 +is performing better. + +0:52:33.473 --> 0:52:40.971 +A bit of an explanation when we see how that +works. + +0:52:40.971 --> 0:52:46.161 +There's different ways of doing it. + +0:52:46.161 --> 0:52:47.171 +Maybe. + +0:52:47.567 --> 0:52:51.501 +So how does it work? + +0:52:51.501 --> 0:53:04.802 +This is our student network, the normal one, +some type of new network. + +0:53:04.802 --> 0:53:06.113 +We're. + +0:53:06.586 --> 0:53:17.050 +So we are training the model to predict the +same thing as we are doing that by calculating. + +0:53:17.437 --> 0:53:23.173 +The cross angry loss was defined in a way +where saying all the probabilities for the + +0:53:23.173 --> 0:53:25.332 +correct word should be as high. + +0:53:25.745 --> 0:53:31.576 +So your calculating gear out of probability +is always and each time step you have an out + +0:53:31.576 --> 0:53:32.624 +of probability. + +0:53:32.624 --> 0:53:38.258 +What is the most probable in the next word +and your training signal is put as much of + +0:53:38.258 --> 0:53:43.368 +your probability mass to the correct word to +the word that is there in train. + +0:53:43.903 --> 0:53:51.367 +And this is the chief by this cross entry +loss, which says with some of the all training + +0:53:51.367 --> 0:53:58.664 +examples of all positions, with some of the +full vocabulary, and then this one is this + +0:53:58.664 --> 0:54:03.947 +one that this current word is the case word +in the vocabulary. + +0:54:04.204 --> 0:54:11.339 +And then we take here the lock for the ability +of that, so what we made me do is: We have + +0:54:11.339 --> 0:54:27.313 +this metric here, so each position of your +vocabulary size. + +0:54:27.507 --> 0:54:38.656 +In the end what you just do is some of these +three lock probabilities, and then you want + +0:54:38.656 --> 0:54:40.785 +to have as much. + +0:54:41.041 --> 0:54:54.614 +So although this is a thumb over this metric +here, in the end of each dimension you. + +0:54:54.794 --> 0:55:06.366 +So that is a normal cross end to be lost that +we have discussed at the very beginning of + +0:55:06.366 --> 0:55:07.016 +how. + +0:55:08.068 --> 0:55:15.132 +So what can we do differently in the teacher +network? + +0:55:15.132 --> 0:55:23.374 +We also have a teacher network which is trained +on large data. + +0:55:24.224 --> 0:55:35.957 +And of course this distribution might be better +than the one from the small model because it's. 
+ +0:55:36.456 --> 0:55:40.941 +So in this case we have now the training signal +from the teacher network. + +0:55:41.441 --> 0:55:46.262 +And it's the same way as we had before. + +0:55:46.262 --> 0:55:56.507 +The only difference is we're training not +the ground truths per ability distribution + +0:55:56.507 --> 0:55:59.159 +year, which is sharp. + +0:55:59.299 --> 0:56:11.303 +That's also a probability, so this word has +a high probability, but have some probability. + +0:56:12.612 --> 0:56:19.577 +And that is the main difference. + +0:56:19.577 --> 0:56:30.341 +Typically you do like the interpretation of +these. + +0:56:33.213 --> 0:56:38.669 +Because there's more information contained +in the distribution than in the front booth, + +0:56:38.669 --> 0:56:44.187 +because it encodes more information about the +language, because language always has more + +0:56:44.187 --> 0:56:47.907 +options to put alone, that's the same sentence +yes exactly. + +0:56:47.907 --> 0:56:53.114 +So there's ambiguity in there that is encoded +hopefully very well in the complaint. + +0:56:53.513 --> 0:56:57.257 +Trade you two networks so better than a student +network you have in there from your learner. + +0:56:57.537 --> 0:57:05.961 +So maybe often there's only one correct word, +but it might be two or three, and then all + +0:57:05.961 --> 0:57:10.505 +of these three have a probability distribution. + +0:57:10.590 --> 0:57:21.242 +And then is the main advantage or one explanation +of why it's better to train from the. + +0:57:21.361 --> 0:57:32.652 +Of course, it's good to also keep the signal +in there because then you can prevent it because + +0:57:32.652 --> 0:57:33.493 +crazy. + +0:57:37.017 --> 0:57:49.466 +Any more questions on the first type of knowledge +distillation, also distribution changes. + +0:57:50.550 --> 0:58:02.202 +Coming around again, this would put it a bit +different, so this is not a solution to maintenance + +0:58:02.202 --> 0:58:04.244 +or distribution. + +0:58:04.744 --> 0:58:12.680 +But don't think it's performing worse than +only doing the ground tours because they also. + +0:58:13.113 --> 0:58:21.254 +So it's more like it's not improving you would +assume it's similarly helping you, but. + +0:58:21.481 --> 0:58:28.145 +Of course, if you now have a teacher, maybe +you have no danger on your target to Maine, + +0:58:28.145 --> 0:58:28.524 +but. + +0:58:28.888 --> 0:58:39.895 +Then you can use this one which is not the +ground truth but helpful to learn better for + +0:58:39.895 --> 0:58:42.147 +the distribution. + +0:58:46.326 --> 0:58:57.012 +The second idea is to do sequence level knowledge +distillation, so what we have in this case + +0:58:57.012 --> 0:59:02.757 +is we have looked at each position independently. + +0:59:03.423 --> 0:59:05.436 +Mean, we do that often. + +0:59:05.436 --> 0:59:10.972 +We are not generating a lot of sequences, +but that has a problem. + +0:59:10.972 --> 0:59:13.992 +We have this propagation of errors. + +0:59:13.992 --> 0:59:16.760 +We start with one area and then. + +0:59:17.237 --> 0:59:27.419 +So if we are doing word-level knowledge dissolution, +we are treating each word in the sentence independently. + +0:59:28.008 --> 0:59:32.091 +So we are not trying to like somewhat model +the dependency between. + +0:59:32.932 --> 0:59:47.480 +We can try to do that by sequence level knowledge +dissolution, but the problem is, of course,. + +0:59:47.847 --> 0:59:53.478 +So we can that for each position we can get +a distribution over all the words at this. 
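[Editor's note: before the sequence-level variant is developed further, here is a minimal PyTorch-style sketch of the word-level distillation loss described above: the usual cross entropy against the one-hot reference, interpolated with a soft loss against the teacher distribution. The interpolation weight alpha and its value are assumptions; in practice it is tuned, and a temperature on the softmax is also common.]

import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids, alpha=0.5):
    """student_logits, teacher_logits: (batch * length, vocab)
       gold_ids: (batch * length,) indices of the reference words."""
    # Cross entropy against the sharp, one-hot reference words.
    ce = F.cross_entropy(student_logits, gold_ids)
    # Soft loss against the teacher's distribution, which also gives
    # probability mass to other plausible words at each position.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kd = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kd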
+ +0:59:53.793 --> 1:00:05.305 +But if we want to have a distribution of all +possible target sentences, that's not possible + +1:00:05.305 --> 1:00:06.431 +because. + +1:00:08.508 --> 1:00:15.940 +Area, so we can then again do a bit of a heck +on that. + +1:00:15.940 --> 1:00:23.238 +If we can't have a distribution of all sentences, +it. + +1:00:23.843 --> 1:00:30.764 +So what we can't do is you can not use the +teacher network and sample different translations. + +1:00:31.931 --> 1:00:39.327 +And now we can do different ways to train +them. + +1:00:39.327 --> 1:00:49.343 +We can use them as their probability, the +easiest one to assume. + +1:00:50.050 --> 1:00:56.373 +So what that ends to is that we're taking +our teacher network, we're generating some + +1:00:56.373 --> 1:01:01.135 +translations, and these ones we're using as +additional trading. + +1:01:01.781 --> 1:01:11.382 +Then we have mainly done this sequence level +because the teacher network takes us. + +1:01:11.382 --> 1:01:17.513 +These are all probable translations of the +sentence. + +1:01:26.286 --> 1:01:34.673 +And then you can do a bit of a yeah, and you +can try to better make a bit of an interpolated + +1:01:34.673 --> 1:01:36.206 +version of that. + +1:01:36.716 --> 1:01:42.802 +So what people have also done is like subsequent +level interpolations. + +1:01:42.802 --> 1:01:52.819 +You generate here several translations: But +then you don't use all of them. + +1:01:52.819 --> 1:02:00.658 +You do some metrics on which of these ones. + +1:02:01.021 --> 1:02:12.056 +So it's a bit more training on this brown +chose which might be improbable or unreachable + +1:02:12.056 --> 1:02:16.520 +because we can generate everything. + +1:02:16.676 --> 1:02:23.378 +And we are giving it an easier solution which +is also good quality and training of that. + +1:02:23.703 --> 1:02:32.602 +So you're not training it on a very difficult +solution, but you're training it on an easier + +1:02:32.602 --> 1:02:33.570 +solution. + +1:02:36.356 --> 1:02:38.494 +Any More Questions to This. + +1:02:40.260 --> 1:02:41.557 +Yeah. + +1:02:41.461 --> 1:02:44.296 +Good. + +1:02:43.843 --> 1:03:01.642 +Is to look at the vocabulary, so the problem +is we have seen that vocabulary calculations + +1:03:01.642 --> 1:03:06.784 +are often very presuming. + +1:03:09.789 --> 1:03:19.805 +The thing is that most of the vocabulary is +not needed for each sentence, so in each sentence. + +1:03:20.280 --> 1:03:28.219 +The question is: Can we somehow easily precalculate, +which words are probable to occur in the sentence, + +1:03:28.219 --> 1:03:30.967 +and then only calculate these ones? + +1:03:31.691 --> 1:03:34.912 +And this can be done so. + +1:03:34.912 --> 1:03:43.932 +For example, if you have sentenced card, it's +probably not happening. + +1:03:44.164 --> 1:03:48.701 +So what you can try to do is to limit your +vocabulary. + +1:03:48.701 --> 1:03:51.093 +You're considering for each. + +1:03:51.151 --> 1:04:04.693 +So you're no longer taking the full vocabulary +as possible output, but you're restricting. + +1:04:06.426 --> 1:04:18.275 +That typically works is that we limit it by +the most frequent words we always take because + +1:04:18.275 --> 1:04:23.613 +these are not so easy to align to words. + +1:04:23.964 --> 1:04:32.241 +To take the most treatment taggin' words and +then work that often aligns with one of the + +1:04:32.241 --> 1:04:32.985 +source. 
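[Editor's note, stepping back briefly to sequence-level distillation before the vocabulary idea is spelled out in detail below: in practice it is mostly a data-generation step. The sketch assumes a generic teacher_translate() decoding function and a beam size of five; both are placeholders, not part of the lecture.]

def build_sequence_kd_data(teacher_translate, source_sentences, beam_size=5):
    """Decode the training sources with the large teacher model and use its
    outputs as the new training targets for the small student model."""
    kd_pairs = []
    for src in source_sentences:
        # One could also keep several hypotheses, or pick the one closest
        # to the human reference (the interpolated variant mentioned above).
        best_hypothesis = teacher_translate(src, beam_size=beam_size)
        kd_pairs.append((src, best_hypothesis))
    return kd_pairs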
+ +1:04:33.473 --> 1:04:46.770 +So for each source word you calculate the +word alignment on your training data, and then + +1:04:46.770 --> 1:04:51.700 +you calculate which words occur. + +1:04:52.352 --> 1:04:57.680 +And then for decoding you build this union +of maybe the source word list that other. + +1:04:59.960 --> 1:05:02.145 +Are like for each source work. + +1:05:02.145 --> 1:05:08.773 +One of the most frequent translations of these +source words, for example for each source work + +1:05:08.773 --> 1:05:13.003 +like in the most frequent ones, and then the +most frequent. + +1:05:13.193 --> 1:05:24.333 +In total, if you have short sentences, you +have a lot less words, so in most cases it's + +1:05:24.333 --> 1:05:26.232 +not more than. + +1:05:26.546 --> 1:05:33.957 +And so you have dramatically reduced your +vocabulary, and thereby can also fax a depot. + +1:05:35.495 --> 1:05:43.757 +That easy does anybody see what is challenging +here and why that might not always need. + +1:05:47.687 --> 1:05:54.448 +The performance is not why this might not. + +1:05:54.448 --> 1:06:01.838 +If you implement it, it might not be a strong. + +1:06:01.941 --> 1:06:06.053 +You have to store this list. + +1:06:06.053 --> 1:06:14.135 +You have to burn the union and of course your +safe time. + +1:06:14.554 --> 1:06:21.920 +The second thing the vocabulary is used in +our last step, so we have the hidden state, + +1:06:21.920 --> 1:06:23.868 +and then we calculate. + +1:06:24.284 --> 1:06:29.610 +Now we are not longer calculating them for +all output words, but for a subset of them. + +1:06:30.430 --> 1:06:35.613 +However, this metric multiplication is typically +parallelized with the perfect but good. + +1:06:35.956 --> 1:06:46.937 +But if you not only calculate some of them, +if you're not modeling it right, it will take + +1:06:46.937 --> 1:06:52.794 +as long as before because of the nature of +the. + +1:06:56.776 --> 1:07:07.997 +Here for beam search there's some ideas of +course you can go back to greedy search because + +1:07:07.997 --> 1:07:10.833 +that's more efficient. + +1:07:11.651 --> 1:07:18.347 +And better quality, and you can buffer some +states in between, so how much buffering it's + +1:07:18.347 --> 1:07:22.216 +again this tradeoff between calculation and +memory. + +1:07:25.125 --> 1:07:41.236 +Then at the end of today what we want to look +into is one last type of new machine translation + +1:07:41.236 --> 1:07:42.932 +approach. + +1:07:43.403 --> 1:07:53.621 +And the idea is what we've already seen in +our first two steps is that this ultra aggressive + +1:07:53.621 --> 1:07:57.246 +park is taking community coding. + +1:07:57.557 --> 1:08:04.461 +Can process everything in parallel, but we +are always taking the most probable and then. + +1:08:05.905 --> 1:08:10.476 +The question is: Do we really need to do that? + +1:08:10.476 --> 1:08:14.074 +Therefore, there is a bunch of work. + +1:08:14.074 --> 1:08:16.602 +Can we do it differently? + +1:08:16.602 --> 1:08:19.616 +Can we generate a full target? + +1:08:20.160 --> 1:08:29.417 +We'll see it's not that easy and there's still +an open debate whether this is really faster + +1:08:29.417 --> 1:08:31.832 +and quality, but think. + +1:08:32.712 --> 1:08:45.594 +So, as said, what we have done is our encoder +decoder where we can process our encoder color, + +1:08:45.594 --> 1:08:50.527 +and then the output always depends. 
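[Editor's note: before the non-autoregressive idea is developed further, a rough sketch of the vocabulary pre-selection discussed a moment ago. For every source word we keep a short list of likely target words taken from word-alignment counts on the training data; at test time the output softmax is restricted to the union of these lists plus the globally most frequent target words. The data structures, the top_k value and the whitespace tokenization are assumptions for illustration.]

from collections import Counter, defaultdict

def build_translation_lists(aligned_pairs, top_k=20):
    """aligned_pairs: iterable of (source_word, target_word) pairs taken
    from word alignments on the training data."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        counts[src_word][tgt_word] += 1
    return {s: [t for t, _ in c.most_common(top_k)] for s, c in counts.items()}

def candidate_vocab(source_sentence, translation_lists, frequent_targets):
    """Union of the most frequent target words and, for every source word,
    its most likely aligned target words."""
    vocab = set(frequent_targets)
    for word in source_sentence.split():
        vocab.update(translation_lists.get(word, []))
    return vocab  # the output layer is then computed only for these words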
+ +1:08:50.410 --> 1:08:54.709 +We generate the output and then we have to +put it here the wide because then everything + +1:08:54.709 --> 1:08:56.565 +depends on the purpose of the output. + +1:08:56.916 --> 1:09:10.464 +This is what is referred to as an outer-regressive +model and nearly outs speech generation and + +1:09:10.464 --> 1:09:16.739 +language generation or works in this outer. + +1:09:18.318 --> 1:09:21.132 +So the motivation is, can we do that more +efficiently? + +1:09:21.361 --> 1:09:31.694 +And can we somehow process all target words +in parallel? + +1:09:31.694 --> 1:09:41.302 +So instead of doing it one by one, we are +inputting. + +1:09:45.105 --> 1:09:46.726 +So how does it work? + +1:09:46.726 --> 1:09:50.587 +So let's first have a basic auto regressive +mode. + +1:09:50.810 --> 1:09:53.551 +So the encoder looks as it is before. + +1:09:53.551 --> 1:09:58.310 +That's maybe not surprising because here we +know we can paralyze. + +1:09:58.618 --> 1:10:04.592 +So we have put in here our ink holder and +generated the ink stash, so that's exactly + +1:10:04.592 --> 1:10:05.295 +the same. + +1:10:05.845 --> 1:10:16.229 +However, now we need to do one more thing: +One challenge is what we had before and that's + +1:10:16.229 --> 1:10:26.799 +a challenge of natural language generation +like machine translation. + +1:10:32.672 --> 1:10:38.447 +We generate until we generate this out of +end of center stock, but if we now generate + +1:10:38.447 --> 1:10:44.625 +everything at once that's no longer possible, +so we cannot generate as long because we only + +1:10:44.625 --> 1:10:45.632 +generated one. + +1:10:46.206 --> 1:10:58.321 +So the question is how can we now determine +how long the sequence is, and we can also accelerate. + +1:11:00.000 --> 1:11:06.384 +Yes, but there would be one idea, and there +is other work which tries to do that. + +1:11:06.806 --> 1:11:15.702 +However, in here there's some work already +done before and maybe you remember we had the + +1:11:15.702 --> 1:11:20.900 +IBM models and there was this concept of fertility. + +1:11:21.241 --> 1:11:26.299 +The concept of fertility is means like for +one saucepan, and how many target pores does + +1:11:26.299 --> 1:11:27.104 +it translate? + +1:11:27.847 --> 1:11:34.805 +And exactly that we try to do here, and that +means we are calculating like at the top we + +1:11:34.805 --> 1:11:36.134 +are calculating. + +1:11:36.396 --> 1:11:42.045 +So it says word is translated into word. + +1:11:42.045 --> 1:11:54.171 +Word might be translated into words into, +so we're trying to predict in how many words. + +1:11:55.935 --> 1:12:10.314 +And then the end of the anchor, so this is +like a length estimation. + +1:12:10.314 --> 1:12:15.523 +You can do it otherwise. + +1:12:16.236 --> 1:12:24.526 +You initialize your decoder input and we know +it's good with word embeddings so we're trying + +1:12:24.526 --> 1:12:28.627 +to do the same thing and what people then do. + +1:12:28.627 --> 1:12:35.224 +They initialize it again with word embedding +but in the frequency of the. + +1:12:35.315 --> 1:12:36.460 +So we have the cartilage. + +1:12:36.896 --> 1:12:47.816 +So one has two, so twice the is and then one +is, so that is then our initialization. + +1:12:48.208 --> 1:12:57.151 +In other words, if you don't predict fertilities +but predict lengths, you can just initialize + +1:12:57.151 --> 1:12:57.912 +second. + +1:12:58.438 --> 1:13:07.788 +This often works a bit better, but that's +the other. 
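[Editor's note: a small sketch of the fertility-based initialization just described. Each source embedding is copied as many times as its predicted fertility says, and all target positions are then predicted in one parallel step. The tensor shapes, the generic decoder callable and the plain argmax decoding are illustrative assumptions.]

import torch

def build_decoder_inputs(src_embeddings, fertilities):
    """src_embeddings: (src_len, d) embeddings of the source words
       fertilities:    (src_len,) integer tensor, predicted number of
                       target words per source word
       Returns (sum(fertilities), d) decoder inputs."""
    # Row i of the embeddings is repeated fertilities[i] times.
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)

def parallel_decode(decoder, encoder_states, decoder_inputs):
    # One single forward pass produces logits for every target position...
    logits = decoder(decoder_inputs, encoder_states)  # (tgt_len, vocab)
    # ...and each position is filled independently with its most probable
    # word, which is exactly where the independence problems below come from.
    return logits.argmax(dim=-1)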
+ +1:13:07.788 --> 1:13:16.432 +Now you have everything in training and testing. + +1:13:16.656 --> 1:13:18.621 +This is all available at once. + +1:13:20.280 --> 1:13:31.752 +Then we can generate everything in parallel, +so we have the decoder stack, and that is now + +1:13:31.752 --> 1:13:33.139 +as before. + +1:13:35.395 --> 1:13:41.555 +And then we're doing the translation predictions +here on top of it in order to do. + +1:13:43.083 --> 1:13:59.821 +And then we are predicting here the target +words and once predicted, and that is the basic + +1:13:59.821 --> 1:14:00.924 +idea. + +1:14:01.241 --> 1:14:08.171 +Machine translation: Where the idea is, we +don't have to do one by one what we're. + +1:14:10.210 --> 1:14:13.900 +So this looks really, really, really great. + +1:14:13.900 --> 1:14:20.358 +On the first view there's one challenge with +this, and this is the baseline. + +1:14:20.358 --> 1:14:27.571 +Of course there's some improvements, but in +general the quality is often significant. + +1:14:28.068 --> 1:14:32.075 +So here you see the baseline models. + +1:14:32.075 --> 1:14:38.466 +You have a loss of ten blue points or something +like that. + +1:14:38.878 --> 1:14:40.230 +So why does it change? + +1:14:40.230 --> 1:14:41.640 +So why is it happening? + +1:14:43.903 --> 1:14:56.250 +If you look at the errors there is repetitive +tokens, so you have like or things like that. + +1:14:56.536 --> 1:15:01.995 +Broken senses or influent senses, so that +exactly where algebra aggressive models are + +1:15:01.995 --> 1:15:04.851 +very good, we say that's a bit of a problem. + +1:15:04.851 --> 1:15:07.390 +They generate very fluid transcription. + +1:15:07.387 --> 1:15:10.898 +Translation: Sometimes there doesn't have +to do anything with the input. + +1:15:11.411 --> 1:15:14.047 +But generally it really looks always very +fluid. + +1:15:14.995 --> 1:15:20.865 +Here exactly the opposite, so the problem +is that we don't have really fluid translation. + +1:15:21.421 --> 1:15:26.123 +And that is mainly due to the challenge that +we have this independent assumption. + +1:15:26.646 --> 1:15:35.873 +So in this case, the probability of Y of the +second position is independent of the probability + +1:15:35.873 --> 1:15:40.632 +of X, so we don't know what was there generated. + +1:15:40.632 --> 1:15:43.740 +We're just generating it there. + +1:15:43.964 --> 1:15:55.439 +You can see it also in a bit of examples. + +1:15:55.439 --> 1:16:03.636 +You can over-panelize shifts. + +1:16:04.024 --> 1:16:10.566 +And the problem is this is already an improvement +again, but this is also similar to. + +1:16:11.071 --> 1:16:19.900 +So you can, for example, translate heeded +back, or maybe you could also translate it + +1:16:19.900 --> 1:16:31.105 +with: But on their feeling down in feeling +down, if the first position thinks of their + +1:16:31.105 --> 1:16:34.594 +feeling done and the second. + +1:16:35.075 --> 1:16:42.908 +So each position here and that is one of the +main issues here doesn't know what the other. + +1:16:43.243 --> 1:16:53.846 +And for example, if you are translating something +with, you can often translate things in two + +1:16:53.846 --> 1:16:58.471 +ways: German with a different agreement. + +1:16:58.999 --> 1:17:02.058 +And then here where you have to decide do +a used jet. + +1:17:02.162 --> 1:17:05.460 +Interpretator: It doesn't know which word +it has to select. 
+ +1:17:06.086 --> 1:17:14.789 +Mean, of course, it knows a hidden state, +but in the end you have a liability distribution. + +1:17:16.256 --> 1:17:20.026 +And that is the important thing in the outer +regressive month. + +1:17:20.026 --> 1:17:24.335 +You know that because you have put it in you +here, you don't know that. + +1:17:24.335 --> 1:17:29.660 +If it's equal probable here to two, you don't +Know Which Is Selected, and of course that + +1:17:29.660 --> 1:17:32.832 +depends on what should be the latest traction +under. + +1:17:33.333 --> 1:17:39.554 +Yep, that's the undershift, and we're going +to last last the next time. + +1:17:39.554 --> 1:17:39.986 +Yes. + +1:17:40.840 --> 1:17:44.934 +Doesn't this also appear in and like now we're +talking about physical training or. + +1:17:46.586 --> 1:17:48.412 +The thing is in the auto regress. + +1:17:48.412 --> 1:17:50.183 +If you give it the correct one,. + +1:17:50.450 --> 1:17:55.827 +So if you predict here comma what the reference +is feeling then you tell the model here. + +1:17:55.827 --> 1:17:59.573 +The last one was feeling and then it knows +it has to be done. + +1:17:59.573 --> 1:18:04.044 +But here it doesn't know that because it doesn't +get as input as a right. + +1:18:04.204 --> 1:18:24.286 +Yes, that's a bit depending on what. + +1:18:24.204 --> 1:18:27.973 +But in training, of course, you just try to +make the highest one the current one. + +1:18:31.751 --> 1:18:38.181 +So what you can do is things like CDC loss +which can adjust for this. + +1:18:38.181 --> 1:18:42.866 +So then you can also have this shifted correction. + +1:18:42.866 --> 1:18:50.582 +If you're doing this type of correction in +the CDC loss you don't get full penalty. + +1:18:50.930 --> 1:18:58.486 +Just shifted by one, so it's a bit of a different +loss, which is mainly used in, but. + +1:19:00.040 --> 1:19:03.412 +It can be used in order to address this problem. + +1:19:04.504 --> 1:19:13.844 +The other problem is that outer regressively +we have the label buyers that tries to disimmigrate. + +1:19:13.844 --> 1:19:20.515 +That's the example did before was if you translate +thank you to Dung. + +1:19:20.460 --> 1:19:31.925 +And then it might end up because it learns +in the first position and the second also. + +1:19:32.492 --> 1:19:43.201 +In order to prevent that, it would be helpful +for one output, only one output, so that makes + +1:19:43.201 --> 1:19:47.002 +the system already better learn. + +1:19:47.227 --> 1:19:53.867 +Might be that for slightly different inputs +you have different outputs, but for the same. + +1:19:54.714 --> 1:19:57.467 +That we can luckily very easily solve. + +1:19:59.119 --> 1:19:59.908 +And it's done. + +1:19:59.908 --> 1:20:04.116 +We just learned the technique about it, which +is called knowledge distillation. + +1:20:04.985 --> 1:20:13.398 +So what we can do and the easiest solution +to prove your non-autoregressive model is to + +1:20:13.398 --> 1:20:16.457 +train an auto regressive model. + +1:20:16.457 --> 1:20:22.958 +Then you decode your whole training gamer +with this model and then. + +1:20:23.603 --> 1:20:27.078 +While the main advantage of that is that this +is more consistent,. + +1:20:27.407 --> 1:20:33.995 +So for the same input you always have the +same output. + +1:20:33.995 --> 1:20:41.901 +So you have to make your training data more +consistent and learn. 
+ +1:20:42.482 --> 1:20:54.471 +So there is another advantage of knowledge +distillation and that advantage is you have + +1:20:54.471 --> 1:20:59.156 +more consistent training signals. + +1:21:04.884 --> 1:21:10.630 +There's another to make the things more easy +at the beginning. + +1:21:10.630 --> 1:21:16.467 +There's this plants model, black model where +you do more masks. + +1:21:16.756 --> 1:21:26.080 +So during training, especially at the beginning, +you give some correct solutions at the beginning. + +1:21:28.468 --> 1:21:38.407 +And there is this tokens at a time, so the +idea is to establish other regressive training. + +1:21:40.000 --> 1:21:50.049 +And some targets are open, so you always predict +only like first auto regression is K. + +1:21:50.049 --> 1:21:59.174 +It puts one, so you always have one input +and one output, then you do partial. + +1:21:59.699 --> 1:22:05.825 +So in that way you can slowly learn what is +a good and what is a bad answer. + +1:22:08.528 --> 1:22:10.862 +It doesn't sound very impressive. + +1:22:10.862 --> 1:22:12.578 +Don't contact me anyway. + +1:22:12.578 --> 1:22:15.323 +Go all over your training data several. + +1:22:15.875 --> 1:22:20.655 +You can even switch in between. + +1:22:20.655 --> 1:22:29.318 +There is a homework on this thing where you +try to start. + +1:22:31.271 --> 1:22:41.563 +You have to learn so there's a whole work +on that so this is often happening and it doesn't + +1:22:41.563 --> 1:22:46.598 +mean it's less efficient but still it helps. + +1:22:49.389 --> 1:22:57.979 +For later maybe here are some examples of +how much things help. + +1:22:57.979 --> 1:23:04.958 +Maybe one point here is that it's really important. + +1:23:05.365 --> 1:23:13.787 +Here's the translation performance and speed. + +1:23:13.787 --> 1:23:24.407 +One point which is a point is if you compare +researchers. + +1:23:24.784 --> 1:23:33.880 +So yeah, if you're compared to one very weak +baseline transformer even with beam search, + +1:23:33.880 --> 1:23:40.522 +then you're ten times slower than a very strong +auto regressive. + +1:23:40.961 --> 1:23:48.620 +If you make a strong baseline then it's going +down to depending on times and here like: You + +1:23:48.620 --> 1:23:53.454 +have a lot of different speed ups. + +1:23:53.454 --> 1:24:03.261 +Generally, it makes a strong baseline and +not very simple transformer. + +1:24:07.407 --> 1:24:20.010 +Yeah, with this one last thing that you can +do to speed up things and also reduce your + +1:24:20.010 --> 1:24:25.950 +memory is what is called half precision. + +1:24:26.326 --> 1:24:29.139 +And especially for decoding issues for training. + +1:24:29.139 --> 1:24:31.148 +Sometimes it also gets less stale. + +1:24:32.592 --> 1:24:45.184 +With this we close nearly wait a bit, so what +you should remember is that efficient machine + +1:24:45.184 --> 1:24:46.963 +translation. + +1:24:47.007 --> 1:24:51.939 +We have, for example, looked at knowledge +distillation. + +1:24:51.939 --> 1:24:55.991 +We have looked at non auto regressive models. + +1:24:55.991 --> 1:24:57.665 +We have different. + +1:24:58.898 --> 1:25:02.383 +For today and then only requests. + +1:25:02.383 --> 1:25:08.430 +So if you haven't done so, please fill out +the evaluation. + +1:25:08.388 --> 1:25:20.127 +So now if you have done so think then you +should have and with the online people hopefully. 
+
1:25:20.320 --> 1:25:29.758
+It is the only possibility to tell us which
+things are good and which are not. Well, not the
+only one, but the most efficient one.

1:25:31.851 --> 1:25:35.875
+So thanks to all the students for doing it in
+the next days. Okay, then thank you.
+ +0:08:26.766 --> 0:08:41.182 +And if we are doing that, of course, we talked +about the synthetic data where we do back translation. + +0:08:41.341 --> 0:08:46.446 +But of course it gives you some aren't up +about norm, you cannot be much better than + +0:08:46.446 --> 0:08:46.806 +this. + +0:08:48.308 --> 0:08:57.194 +That is, we'll get more and more on issues, +so maybe at some point we won't look at the + +0:08:57.194 --> 0:09:06.687 +current Internet, but focus on oats like image +of the Internet, which are created by Archive. + +0:09:07.527 --> 0:09:18.611 +There's lots of classification algorithms +on how to classify automatic data they had + +0:09:18.611 --> 0:09:26.957 +a very interesting paper on how to watermark +their translation. + +0:09:27.107 --> 0:09:32.915 +So there's like two scenarios of course in +this program: The one thing you might want + +0:09:32.915 --> 0:09:42.244 +to find your own translation if you're a big +company and say do an antisystem that may be + +0:09:42.244 --> 0:09:42.866 +used. + +0:09:43.083 --> 0:09:49.832 +This problem might be that most of the translation +out there is created by you. + +0:09:49.832 --> 0:10:02.007 +You might be able: And there is a relatively +easy way of doing that so that there are other + +0:10:02.007 --> 0:10:09.951 +peoples' mainly that can do like the search +or teacher. + +0:10:09.929 --> 0:10:12.878 +They are different, but there is not the one +correction station. + +0:10:13.153 --> 0:10:23.763 +So what you then can't do is you can't output +the best one to the user, but the highest value. + +0:10:23.763 --> 0:10:30.241 +For example, it's easy, but you can take the +translation. + +0:10:30.870 --> 0:10:40.713 +And if you always give the translation of +your investments, which are all good with the + +0:10:40.713 --> 0:10:42.614 +most ease, then. + +0:10:42.942 --> 0:10:55.503 +But of course this you can only do with most +of the data generated by your model. + +0:10:55.503 --> 0:11:02.855 +What we are now seeing is not only checks, +but. + +0:11:03.163 --> 0:11:13.295 +But it's definitely an additional research +question that might get more and more importance, + +0:11:13.295 --> 0:11:18.307 +and it might be an additional filtering step. + +0:11:18.838 --> 0:11:29.396 +There are other issues in data quality, so +in which direction wasn't translated, so that + +0:11:29.396 --> 0:11:31.650 +is not interested. + +0:11:31.891 --> 0:11:35.672 +But if you're now reaching better and better +quality, it makes a difference. + +0:11:35.672 --> 0:11:39.208 +The original data was from German to English +or from English to German. + +0:11:39.499 --> 0:11:44.797 +Because translation, they call it translate +Chinese. + +0:11:44.797 --> 0:11:53.595 +So if you generate German from English, it +has a more similar structure as if you would + +0:11:53.595 --> 0:11:55.195 +directly speak. + +0:11:55.575 --> 0:11:57.187 +So um. + +0:11:57.457 --> 0:12:03.014 +These are all issues which you then might +do like do additional training to remove them + +0:12:03.014 --> 0:12:07.182 +or you first train on them and later train +on other quality data. + +0:12:07.182 --> 0:12:11.034 +But yet that's a general view on so it's an +important issue. + +0:12:11.034 --> 0:12:17.160 +But until now I think it hasn't been addressed +that much maybe because the quality was decently. + +0:12:18.858 --> 0:12:23.691 +Actually, I think we're sure if we have the +time we use the Internet. 
+ +0:12:23.691 --> 0:12:29.075 +The problem is, it's a lot of English speaking +text, but most used languages. + +0:12:29.075 --> 0:12:34.460 +I don't know some language in Africa that's +spoken, but we do about that one. + +0:12:34.460 --> 0:12:37.566 +I mean, that's why most data is English too. + +0:12:38.418 --> 0:12:42.259 +Other languages, and then you get the best. + +0:12:42.259 --> 0:12:46.013 +If there is no data on the Internet, then. + +0:12:46.226 --> 0:12:48.255 +So there is still a lot of data collection. + +0:12:48.255 --> 0:12:50.976 +Also in the wild way you try to improve there +and collect. + +0:12:51.431 --> 0:12:57.406 +But English is the most in the world, but +you find surprisingly much data also for other + +0:12:57.406 --> 0:12:58.145 +languages. + +0:12:58.678 --> 0:13:04.227 +Of course, only if they're written remember. + +0:13:04.227 --> 0:13:15.077 +Most languages are not written at all, but +for them you might find some video, but it's + +0:13:15.077 --> 0:13:17.420 +difficult to find. + +0:13:17.697 --> 0:13:22.661 +So this is mainly done for the web trawling. + +0:13:22.661 --> 0:13:29.059 +It's mainly done for languages which are commonly +spoken. + +0:13:30.050 --> 0:13:37.907 +Is exactly the next point, so this is that +much data is only true for English and some + +0:13:37.907 --> 0:13:41.972 +other languages, but of course there's many. + +0:13:41.982 --> 0:13:50.285 +And therefore a lot of research on how to +make things efficient and efficient and learn + +0:13:50.285 --> 0:13:54.248 +faster from pure data is still essential. + +0:13:59.939 --> 0:14:06.326 +So what we are interested in now on data is +parallel data. + +0:14:06.326 --> 0:14:10.656 +We assume always we have parallel data. + +0:14:10.656 --> 0:14:12.820 +That means we have. + +0:14:13.253 --> 0:14:20.988 +To be careful when you start crawling from +the web, we might get only related types of. + +0:14:21.421 --> 0:14:30.457 +So one comedy thing is what people refer as +noisy parallel data where there is documents + +0:14:30.457 --> 0:14:34.315 +which are translations of each other. + +0:14:34.434 --> 0:14:44.300 +So you have senses where there is no translation +on the other side because you have. + +0:14:44.484 --> 0:14:50.445 +So if you have these types of documents your +algorithm to extract parallel data might be + +0:14:50.445 --> 0:14:51.918 +a bit more difficult. + +0:14:52.352 --> 0:15:04.351 +Know if you can still remember in the beginning +of the lecture when we talked about different + +0:15:04.351 --> 0:15:06.393 +data resources. + +0:15:06.286 --> 0:15:11.637 +But the first step is then approached to a +light source and target sentences, and it was + +0:15:11.637 --> 0:15:16.869 +about like a steep vocabulary, and then you +have some probabilities for one to one and + +0:15:16.869 --> 0:15:17.590 +one to one. + +0:15:17.590 --> 0:15:23.002 +It's very like simple algorithm, but yet it +works fine for really a high quality parallel + +0:15:23.002 --> 0:15:23.363 +data. + +0:15:23.623 --> 0:15:30.590 +But when we're talking about noisy data, we +might have to do additional steps and use more + +0:15:30.590 --> 0:15:35.872 +advanced models to extract what is parallel +and to get high quality. + +0:15:36.136 --> 0:15:44.682 +So if we just had no easy parallel data, the +document might not be as easy to extract. + +0:15:49.249 --> 0:15:54.877 +And then there is even the more extreme pains, +which has also been used to be honest. 
+ +0:15:54.877 --> 0:15:58.214 +The use of this data is reasoning not that +common. + +0:15:58.214 --> 0:16:04.300 +It was more interested maybe like ten or fifteen +years ago, and that is what people referred + +0:16:04.300 --> 0:16:05.871 +to as comparative data. + +0:16:06.266 --> 0:16:17.167 +And then the idea is you even don't have translations +like sentences which are translations of each + +0:16:17.167 --> 0:16:25.234 +other, but you have more news documents or +articles about the same topic. + +0:16:25.205 --> 0:16:32.410 +But it's more that you find phrases which +are too big in the user, so even black fragments. + +0:16:32.852 --> 0:16:44.975 +So if you think about the pedia, for example, +these articles have to be written in like the + +0:16:44.975 --> 0:16:51.563 +Wikipedia general idea independent of each +other. + +0:16:51.791 --> 0:17:01.701 +They have different information in there, +and I mean, the German movie gets more detail + +0:17:01.701 --> 0:17:04.179 +than the English one. + +0:17:04.179 --> 0:17:07.219 +However, it might be that. + +0:17:07.807 --> 0:17:20.904 +And the same thing is that you think about +newspaper articles if they're at the same time. + +0:17:21.141 --> 0:17:25.603 +And so this is an ability to learn. + +0:17:25.603 --> 0:17:36.760 +For example, new phrases, vocabulary and stature +if you don't have monitor all time long. + +0:17:37.717 --> 0:17:49.020 +And then not everything will be the same, +but there might be an overlap about events. + +0:17:54.174 --> 0:18:00.348 +So if we're talking about web trolling said +in the beginning it was really about specific. + +0:18:00.660 --> 0:18:18.878 +They do very good things by hand and really +focus on them and do a very specific way of + +0:18:18.878 --> 0:18:20.327 +doing. + +0:18:20.540 --> 0:18:23.464 +The European Parliament was very focused in +Ted. + +0:18:23.464 --> 0:18:26.686 +Maybe you even have looked in the particular +session. + +0:18:27.427 --> 0:18:40.076 +And these are still important, but they are +of course very specific in covering different + +0:18:40.076 --> 0:18:41.341 +pockets. + +0:18:42.002 --> 0:18:55.921 +Then there was a focus on language centering, +so there was a big drawer, for example, that + +0:18:55.921 --> 0:18:59.592 +you can check websites. + +0:19:00.320 --> 0:19:07.918 +Apparently what really people like is a more +general approach where you just have to specify. + +0:19:07.918 --> 0:19:15.355 +I'm interested in data from German to Lithuanian +and then you can as automatic as possible. + +0:19:15.355 --> 0:19:19.640 +You can collect data and extract codelator +for this. + +0:19:21.661 --> 0:19:25.633 +So is this our interest? + +0:19:25.633 --> 0:19:36.435 +Of course, the question is how can we build +these types of systems? + +0:19:36.616 --> 0:19:52.913 +The first are more general web crawling base +systems, so there is nothing about. + +0:19:53.173 --> 0:19:57.337 +Based on the websites you have, you have to +do like text extraction. + +0:19:57.597 --> 0:20:06.503 +We are typically not that much interested +in text and images in there, so we try to extract + +0:20:06.503 --> 0:20:07.083 +text. + +0:20:07.227 --> 0:20:16.919 +This is also not specific to machine translation, +but it's a more traditional way of doing web + +0:20:16.919 --> 0:20:17.939 +trolling. + +0:20:18.478 --> 0:20:22.252 +And at the end you have mirror like some other +set of document collectors. 
+ +0:20:22.842 --> 0:20:37.025 +Is the idea, so you have the text, and often +this is a document, and so in the end. + +0:20:37.077 --> 0:20:51.523 +And that is some of your starting point now +for doing the more machine translation. + +0:20:52.672 --> 0:21:05.929 +One way of doing that now is very similar +to what you might have think about the traditional + +0:21:05.929 --> 0:21:06.641 +one. + +0:21:06.641 --> 0:21:10.633 +The first thing is to do a. + +0:21:11.071 --> 0:21:22.579 +So you have this based on the initial fact +that you know this is a German website in the + +0:21:22.579 --> 0:21:25.294 +English translation. + +0:21:25.745 --> 0:21:31.037 +And based on this document alignment, then +you can do your sentence alignment. + +0:21:31.291 --> 0:21:39.072 +And this is similar to what we had before +with the church accordion. + +0:21:39.072 --> 0:21:43.696 +This is typically more noisy peril data. + +0:21:43.623 --> 0:21:52.662 +So that you are not assuming that everything +is on both sides, that the order is the same, + +0:21:52.662 --> 0:21:56.635 +so you should do more flexible systems. + +0:21:58.678 --> 0:22:14.894 +Then it depends if the documents you were +drawing were really some type of parallel data. + +0:22:15.115 --> 0:22:35.023 +Say then you should do what is referred to +as fragmented extraction. + +0:22:36.136 --> 0:22:47.972 +One problem with these types of models is +if you are doing errors in your document alignment,. + +0:22:48.128 --> 0:22:55.860 +It means that if you are saying these two +documents are align then you can only find + +0:22:55.860 --> 0:22:58.589 +sense and if you are missing. + +0:22:59.259 --> 0:23:15.284 +Is very different, only small parts of the +document are parallel, and most parts are independent + +0:23:15.284 --> 0:23:17.762 +of each other. + +0:23:19.459 --> 0:23:31.318 +Therefore, more recently, there is also the +idea of directly doing sentence aligned so + +0:23:31.318 --> 0:23:35.271 +that you're directly taking. + +0:23:36.036 --> 0:23:41.003 +Was already one challenge of this one, the +second approach. + +0:23:42.922 --> 0:23:50.300 +Yes, so one big challenge on here, beef, then +you have to do a lot of comparison. + +0:23:50.470 --> 0:23:59.270 +You have to cook out every source, every target +set and square. + +0:23:59.270 --> 0:24:06.283 +If you think of a million or trillion pairs, +then. + +0:24:07.947 --> 0:24:12.176 +And this also gives you a reason for a last +step in both cases. + +0:24:12.176 --> 0:24:18.320 +So in both of them you have to remember you're +typically eating here in this very large data + +0:24:18.320 --> 0:24:18.650 +set. + +0:24:18.650 --> 0:24:24.530 +So all of these and also the document alignment +here they should be done very efficient. + +0:24:24.965 --> 0:24:42.090 +And if you want to do it very efficiently, +that means your quality will go lower. + +0:24:41.982 --> 0:24:47.348 +Because you just have to ever see it fast, +and then yeah you can put less computation + +0:24:47.348 --> 0:24:47.910 +on each. + +0:24:48.688 --> 0:25:06.255 +Therefore, in a lot of scenarios it makes +sense to make an additional filtering step + +0:25:06.255 --> 0:25:08.735 +at the end. + +0:25:08.828 --> 0:25:13.370 +And then we do a second filtering step where +we now can put a lot more effort. + +0:25:13.433 --> 0:25:20.972 +Because now we don't have like any square +possible combinations anymore, we have already + +0:25:20.972 --> 0:25:26.054 +selected and maybe in dimension of maybe like +two or three. 
+ +0:25:26.054 --> 0:25:29.273 +For each sentence we even don't have. + +0:25:29.429 --> 0:25:39.234 +And then we can put a lot more effort in each +individual example and build a high quality + +0:25:39.234 --> 0:25:42.611 +classic fire to really select. + +0:25:45.125 --> 0:26:00.506 +Two or one example for that, so one of the +biggest projects doing this is the so-called + +0:26:00.506 --> 0:26:03.478 +Paratrol Corpus. + +0:26:03.343 --> 0:26:11.846 +Typically it's like before the picturing so +there are a lot of challenges on how you can. + +0:26:12.272 --> 0:26:25.808 +And the steps they start to be with the seatbelt, +so what you should give at the beginning is: + +0:26:26.146 --> 0:26:36.908 +Then they do the problem, the text extraction, +the document alignment, the sentence alignment, + +0:26:36.908 --> 0:26:45.518 +and the sentence filter, and it swings down +to implementing the text store. + +0:26:46.366 --> 0:26:51.936 +We'll see later for a lot of language pairs +exist so it's easier to download them and then + +0:26:51.936 --> 0:26:52.793 +like improve. + +0:26:53.073 --> 0:27:08.270 +For example, the crawling one thing they often +do is even not throw the direct website because + +0:27:08.270 --> 0:27:10.510 +there's also. + +0:27:10.770 --> 0:27:14.540 +Black parts of the Internet that they can +work on today. + +0:27:14.854 --> 0:27:22.238 +In more detail, this is a bit shown here. + +0:27:22.238 --> 0:27:31.907 +All the steps you can see are different possibilities. + +0:27:32.072 --> 0:27:39.018 +You need a bit of knowledge to do that, or +you can build a machine translation system. + +0:27:39.239 --> 0:27:47.810 +There are two different ways of deduction +and alignment. + +0:27:47.810 --> 0:27:52.622 +You can use sentence alignment. + +0:27:53.333 --> 0:28:02.102 +And how you can do the flexigrade exam, for +example, the lexic graph, or you can chin. + +0:28:02.422 --> 0:28:05.826 +To the next step in a bit more detail. + +0:28:05.826 --> 0:28:13.680 +But before we're doing it, I need more questions +about the general overview of how these. + +0:28:22.042 --> 0:28:37.058 +Yeah, so two or three things to web-drawing, +so you normally start with the URLs. + +0:28:37.058 --> 0:28:40.903 +It's most promising. + +0:28:41.021 --> 0:28:48.652 +What you found is that if you're interested +in German to English, you would: Companies + +0:28:48.652 --> 0:29:01.074 +where you know they have a German and an English +website are from agencies which might be: And + +0:29:01.074 --> 0:29:10.328 +then we can use one of these tools to start +from there using standard web calling techniques. + +0:29:11.071 --> 0:29:23.942 +There are several challenges when doing that, +so if you request a website too often you can: + +0:29:25.305 --> 0:29:37.819 +You have to keep in history of the sites and +you click on all the links and then click on + +0:29:37.819 --> 0:29:40.739 +all the links again. + +0:29:41.721 --> 0:29:49.432 +To be very careful about legal issues starting +from this robotics day so get allowed to use. + +0:29:49.549 --> 0:29:58.941 +Mean, that's the one major thing about what +trolley general is. + +0:29:58.941 --> 0:30:05.251 +The problem is how you deal with property. + +0:30:05.685 --> 0:30:13.114 +That is why it is easier sometimes to start +with some quick fold data that you don't have. + +0:30:13.893 --> 0:30:22.526 +Of course, the network issues you retry, so +there's more technical things, but there's + +0:30:22.526 --> 0:30:23.122 +good. 
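[Editor's note: as a small illustration of the technical points mentioned above (respecting robots.txt and not hitting a site too often), here is a sketch using only the Python standard library. The user agent string and the delay are placeholders, and error handling and retries are omitted.]

import time
import urllib.robotparser
from urllib.parse import urlsplit
from urllib.request import urlopen

def polite_fetch(urls, user_agent="ExampleParallelCrawler", delay_seconds=5.0):
    """Fetch URLs from a single host while honouring robots.txt and
    pausing between requests."""
    parts = urlsplit(urls[0])
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()

    pages = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue                      # page is disallowed for crawlers
        with urlopen(url) as response:
            pages[url] = response.read()  # raw HTML for later text extraction
        time.sleep(delay_seconds)         # politeness delay between requests
    return pages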
+ +0:30:24.724 --> 0:30:35.806 +Another thing which is very helpful and is +often done is instead of doing the web trolling + +0:30:35.806 --> 0:30:38.119 +yourself, relying. + +0:30:38.258 --> 0:30:44.125 +And one thing is it's common crawl from the +web. + +0:30:44.125 --> 0:30:51.190 +Think on this common crawl a lot of these +language models. + +0:30:51.351 --> 0:30:59.763 +So think in American Company or organization +which really works on like writing. + +0:31:00.000 --> 0:31:01.111 +Possible. + +0:31:01.111 --> 0:31:10.341 +So the nice thing is if you start with this +you don't have to worry about network. + +0:31:10.250 --> 0:31:16.086 +I don't think you can do that because it's +too big, but you can do a pipeline on how to + +0:31:16.086 --> 0:31:16.683 +process. + +0:31:17.537 --> 0:31:28.874 +That is, of course, a general challenge in +all this web crawling and parallel web mining. + +0:31:28.989 --> 0:31:38.266 +That means you cannot just don't know the +data and study the processes. + +0:31:39.639 --> 0:31:45.593 +Here it might make sense to directly fields +of both domains that in some way bark just + +0:31:45.593 --> 0:31:46.414 +marginally. + +0:31:49.549 --> 0:31:59.381 +Then you can do the text extraction, which +means like converging two HTML and then splitting + +0:31:59.381 --> 0:32:01.707 +things from the HTML. + +0:32:01.841 --> 0:32:04.802 +Often very important is to do the language +I need. + +0:32:05.045 --> 0:32:16.728 +It's not that clear even if it's links which +language it is, but they are quite good tools + +0:32:16.728 --> 0:32:22.891 +like that can't identify from relatively short. + +0:32:23.623 --> 0:32:36.678 +And then you are now in the situation that +you have all your danger and that you can start. + +0:32:37.157 --> 0:32:43.651 +After the text extraction you have now a collection +or a large collection of of data where it's + +0:32:43.651 --> 0:32:49.469 +like text and maybe the document at use of +some meta information and now the question + +0:32:49.469 --> 0:32:55.963 +is based on this monolingual text or multilingual +text so text in many languages but not align. + +0:32:56.036 --> 0:32:59.863 +How can you now do a generate power? + +0:33:01.461 --> 0:33:06.289 +And UM. + +0:33:05.705 --> 0:33:12.965 +So the main thing, if we're not seeing it +as a task, or if we want to do it in a machine + +0:33:12.965 --> 0:33:20.388 +learning way, what we have is we have a set +of sentences and a suits language, and we have + +0:33:20.388 --> 0:33:23.324 +a set Of sentences from the target. + +0:33:23.823 --> 0:33:27.814 +This is the target language. + +0:33:27.814 --> 0:33:31.392 +This is the data we have. + +0:33:31.392 --> 0:33:37.034 +We kind of directly assume any ordering. + +0:33:38.018 --> 0:33:44.502 +More documents there are not really in line +or there is maybe a graph and what we are interested + +0:33:44.502 --> 0:33:50.518 +in is finding these alignments so which senses +are aligned to each other and which senses + +0:33:50.518 --> 0:33:53.860 +we can remove but we don't have translations +for. + +0:33:53.974 --> 0:34:00.339 +But exactly this mapping is what we are interested +in and what we need to find. + +0:34:01.901 --> 0:34:17.910 +And if we are modeling it more from the machine +translation point of view, what can model that + +0:34:17.910 --> 0:34:21.449 +as a classification? + +0:34:21.681 --> 0:34:34.850 +And so the main challenge is to build this +type of classifier and you want to decide is + +0:34:34.850 --> 0:34:36.646 +a parallel. 
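[Editor's note: one common way to realise such a classifier, sketched here as an assumption rather than as the exact method used in the projects discussed, is to embed source and target sentences into a shared multilingual vector space and decide "parallel or not" with a similarity threshold. The embed() function is a placeholder for any multilingual sentence encoder, and the threshold value is illustrative and would be tuned on held-out data.]

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def is_parallel(src_sentence, tgt_sentence, embed, threshold=0.8):
    """embed: callable that maps a sentence to a vector in a space shared
    across languages; pairs above the threshold are kept as parallel."""
    return cosine(embed(src_sentence), embed(tgt_sentence)) >= threshold

[In practice a margin criterion, comparing a pair's similarity to the similarities of its nearest neighbours, is often preferred over a raw cosine threshold, since absolute similarity values vary a lot between sentences.]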
+ +0:34:42.402 --> 0:34:50.912 +However, the biggest challenge has already +pointed out in the beginning is the sites if + +0:34:50.912 --> 0:34:53.329 +we have millions target. + +0:34:53.713 --> 0:35:05.194 +The number of comparison is n square, so this +very path is very inefficient, and we need + +0:35:05.194 --> 0:35:06.355 +to find. + +0:35:07.087 --> 0:35:16.914 +And traditionally there is the first one mentioned +before the local or the hierarchical meaning + +0:35:16.914 --> 0:35:20.292 +mining and there the idea is OK. + +0:35:20.292 --> 0:35:23.465 +First we are lining documents. + +0:35:23.964 --> 0:35:32.887 +Move back the things and align them, and once +you have the alignment you only need to remind. + +0:35:33.273 --> 0:35:51.709 +That of course makes anything more efficient +because we don't have to do all the comparison. + +0:35:53.253 --> 0:35:56.411 +Then it's, for example, in the before mentioned +apparel. + +0:35:57.217 --> 0:36:11.221 +But it has the issue that if this document +is bad you have error propagation and you can + +0:36:11.221 --> 0:36:14.211 +recover from that. + +0:36:14.494 --> 0:36:20.715 +Because then document that cannot say ever, +there are some sentences which are: Therefore, + +0:36:20.715 --> 0:36:24.973 +more recently there is also was referred to +as global mining. + +0:36:26.366 --> 0:36:31.693 +And there we really do this. + +0:36:31.693 --> 0:36:43.266 +Although it's in the square, we are doing +all the comparisons. + +0:36:43.523 --> 0:36:52.588 +So the idea is that you can do represent all +the sentences in a vector space. + +0:36:52.892 --> 0:37:06.654 +And then it's about nearest neighbor search +and there is a lot of very efficient algorithms. + +0:37:07.067 --> 0:37:20.591 +Then if you only compare them to your nearest +neighbors you don't have to do like a comparison + +0:37:20.591 --> 0:37:22.584 +but you have. + +0:37:26.186 --> 0:37:40.662 +So in the first step what we want to look +at is this: This document classification refers + +0:37:40.662 --> 0:37:49.584 +to the document alignment, and then we do the +sentence alignment. + +0:37:51.111 --> 0:37:58.518 +And if we're talking about document alignment, +there's like typically two steps in that: We + +0:37:58.518 --> 0:38:01.935 +first do a candidate selection. + +0:38:01.935 --> 0:38:10.904 +Often we have several steps and that is again +to make more things more efficiently. + +0:38:10.904 --> 0:38:13.360 +We have the candidate. + +0:38:13.893 --> 0:38:18.402 +The candidate select means OK, which documents +do we want to compare? + +0:38:19.579 --> 0:38:35.364 +Then if we have initial candidates which might +be parallel, we can do a classification test. + +0:38:35.575 --> 0:38:37.240 +And there is different ways. + +0:38:37.240 --> 0:38:40.397 +We can use lexical similarity or we can use +ten basic. + +0:38:41.321 --> 0:38:48.272 +The first and easiest thing is to take off +possible candidates. + +0:38:48.272 --> 0:38:55.223 +There's one possibility, the other one, is +based on structural. + +0:38:55.235 --> 0:39:05.398 +So based on how your website looks like, you +might find that there are only translations. + +0:39:05.825 --> 0:39:14.789 +This is typically the only case where we try +to do some kind of major information, which + +0:39:14.789 --> 0:39:22.342 +can be very useful because we know that websites, +for example, are linked. + +0:39:22.722 --> 0:39:35.586 +We can try to use some URL patterns, so if +we have some website which ends with the. 
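+
+NOTE
+A skeleton of the hierarchical ("local") mining pipeline sketched
+above, with placeholder functions: documents are aligned first, and
+sentences are then only compared inside aligned document pairs, which
+avoids the full n-squared sentence comparison at the price of error
+propagation from the document step. Both helpers are assumptions
+standing in for the steps discussed in the following segments.
+
+def align_documents(src_docs, tgt_docs):
+    # Placeholder: return index pairs of documents believed to be
+    # translations, e.g. from URL patterns or content similarity.
+    return [(i, i) for i in range(min(len(src_docs), len(tgt_docs)))]
+
+def align_sentences(src_sents, tgt_sents):
+    # Placeholder: e.g. a diagonal-band dynamic programme (see below).
+    return list(zip(src_sents, tgt_sents))
+
+def hierarchical_mining(src_docs, tgt_docs):
+    # src_docs / tgt_docs: lists of documents, each a list of sentences.
+    pairs = []
+    for i, j in align_documents(src_docs, tgt_docs):
+        # Only the sentences of one aligned document pair are compared.
+        pairs.extend(align_sentences(src_docs[i], tgt_docs[j]))
+    return pairs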
+ +0:39:35.755 --> 0:39:43.932 +So that can be easily used in order to find +candidates. + +0:39:43.932 --> 0:39:49.335 +Then we only compare websites where. + +0:39:49.669 --> 0:40:05.633 +The language and the translation of each other, +but typically you hear several heuristics to + +0:40:05.633 --> 0:40:07.178 +do that. + +0:40:07.267 --> 0:40:16.606 +Then you don't have to compare all websites, +but you only have to compare web sites. + +0:40:17.277 --> 0:40:27.607 +Cruiser problems especially with an hour day's +content management system. + +0:40:27.607 --> 0:40:32.912 +Sometimes it's nice and easy to read. + +0:40:33.193 --> 0:40:44.452 +So on the one hand there typically leads from +the parent's side to different languages. + +0:40:44.764 --> 0:40:46.632 +Now I can look at the kit websites. + +0:40:46.632 --> 0:40:49.381 +It's the same thing you can check on the difference. + +0:40:49.609 --> 0:41:06.833 +Languages: You can either do that from the +parent website or you can also click on English. + +0:41:06.926 --> 0:41:10.674 +You can therefore either like prepare to all +the websites. + +0:41:10.971 --> 0:41:18.205 +Can be even more focused and checked if the +link is somehow either flexible or the language + +0:41:18.205 --> 0:41:18.677 +name. + +0:41:19.019 --> 0:41:24.413 +So there really depends on how much you want +to filter out. + +0:41:24.413 --> 0:41:29.178 +There is always a trade-off between being +efficient. + +0:41:33.913 --> 0:41:49.963 +Based on that we then have our candidate list, +so we now have two independent sets of German + +0:41:49.963 --> 0:41:52.725 +documents, but. + +0:41:53.233 --> 0:42:03.515 +And now the task is, we want to extract these, +which are really translations of each other. + +0:42:03.823 --> 0:42:10.201 +So the question of how can we measure the +document similarity? + +0:42:10.201 --> 0:42:14.655 +Because what we then do is, we measure the. + +0:42:14.955 --> 0:42:27.096 +And here you already see why this is also +that problematic from where it's partial or + +0:42:27.096 --> 0:42:28.649 +similarly. + +0:42:30.330 --> 0:42:37.594 +All you can do that is again two folds. + +0:42:37.594 --> 0:42:48.309 +You can do it more content based or more structural +based. + +0:42:48.188 --> 0:42:53.740 +Calculating a lot of features and then maybe +training a classic pyramid small set which + +0:42:53.740 --> 0:42:57.084 +stands like based on the spesse feature is +the data. + +0:42:57.084 --> 0:42:58.661 +It is a corpus parallel. + +0:43:00.000 --> 0:43:10.955 +One way of doing that is to have traction +features, so the idea is the text length, so + +0:43:10.955 --> 0:43:12.718 +the document. + +0:43:13.213 --> 0:43:20.511 +Of course, text links will not be the same, +but if the one document has fifty words and + +0:43:20.511 --> 0:43:24.907 +the other five thousand words, it's quite realistic. + +0:43:25.305 --> 0:43:29.274 +So you can use the text length as one proxy +of. + +0:43:29.274 --> 0:43:32.334 +Is this might be a good translation? + +0:43:32.712 --> 0:43:41.316 +Now the thing is the alignment between the +structure. + +0:43:41.316 --> 0:43:52.151 +If you have here the website you can create +some type of structure. + +0:43:52.332 --> 0:44:04.958 +You can compare that to the French version +and then calculate some similarities because + +0:44:04.958 --> 0:44:07.971 +you see translation. + +0:44:08.969 --> 0:44:12.172 +Of course, it's getting more and more problematic. 
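+
+NOTE
+A small sketch of the URL-pattern heuristic for candidate selection:
+strip language markers from URLs and treat pages whose normalised URLs
+match as candidate translation pairs. The regular expressions, language
+codes and example URLs are illustrative assumptions, not an exhaustive
+set of patterns.
+
+import re
+from collections import defaultdict
+
+LANG_PATTERNS = [
+    r"/(en|de|fr|es)(/|$)",       # e.g. example.org/en/news vs. /de/news
+    r"[?&]lang=(en|de|fr|es)\b",  # e.g. ?lang=de
+    r"\.(en|de|fr|es)\.html$",    # e.g. page.de.html
+]
+
+def normalize(url: str) -> str:
+    # Replace the language code inside each matched pattern by "xx".
+    for pat in LANG_PATTERNS:
+        url = re.sub(pat,
+                     lambda m: m.group(0).replace(m.group(1), "xx"),
+                     url)
+    return url
+
+def candidate_pairs(urls_by_lang):
+    # urls_by_lang: {"de": [urls...], "en": [urls...], ...}
+    buckets = defaultdict(dict)
+    for lang, urls in urls_by_lang.items():
+        for u in urls:
+            buckets[normalize(u)][lang] = u
+    # Keep only buckets that contain more than one language.
+    return [b for b in buckets.values() if len(b) > 1]
+
+print(candidate_pairs({
+    "de": ["https://www.example.org/de/studium.html"],
+    "en": ["https://www.example.org/en/studium.html"],
+}))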
+ +0:44:12.172 --> 0:44:16.318 +It does be a different structure than these +features are helpful. + +0:44:16.318 --> 0:44:22.097 +However, if you are doing it more in a trained +way, you can automatically learn how helpful + +0:44:22.097 --> 0:44:22.725 +they are. + +0:44:24.704 --> 0:44:37.516 +Then there are different ways of yeah: Content +based things: One easy thing, especially if + +0:44:37.516 --> 0:44:48.882 +you have systems that are using the same script +that you are looking for. + +0:44:48.888 --> 0:44:49.611 +The legs. + +0:44:49.611 --> 0:44:53.149 +We call them a beggar words and we'll look +into. + +0:44:53.149 --> 0:44:55.027 +You can use some type of. + +0:44:55.635 --> 0:44:58.418 +And neural embedding is also to abate him +at. + +0:45:02.742 --> 0:45:06.547 +And as then mean we have machine translation,. + +0:45:06.906 --> 0:45:14.640 +And one idea that you can also do is really +use the machine translation. + +0:45:14.874 --> 0:45:22.986 +Because this one is one which takes more effort, +so what you then have to do is put more effort. + +0:45:23.203 --> 0:45:37.526 +You wouldn't do this type of machine translation +based approach for a system which has product. + +0:45:38.018 --> 0:45:53.712 +But maybe your first of thinking why can't +do that because I'm collecting data to build + +0:45:53.712 --> 0:45:55.673 +an system. + +0:45:55.875 --> 0:46:01.628 +So you can use an initial system to translate +it, and then you can collect more data. + +0:46:01.901 --> 0:46:06.879 +And one way of doing that is, you're translating, +for example, all documents even to English. + +0:46:07.187 --> 0:46:25.789 +Then you only need two English data and you +do it in the example with three grams. + +0:46:25.825 --> 0:46:33.253 +For example, the current induction in 1 in +the Spanish, which is German induction in 1, + +0:46:33.253 --> 0:46:37.641 +which was Spanish induction in 2, which was +French. + +0:46:37.637 --> 0:46:52.225 +You're creating this index and then based +on that you can calculate how similar the documents. + +0:46:52.092 --> 0:46:58.190 +And then you can use the Cossack similarity +to really calculate which of the most similar + +0:46:58.190 --> 0:47:00.968 +document or how similar is the document. + +0:47:00.920 --> 0:47:04.615 +And then measure if this is a possible translation. + +0:47:05.285 --> 0:47:14.921 +Mean, of course, the document will not be +exactly the same, and even if you have a parallel + +0:47:14.921 --> 0:47:18.483 +document, French and German, and. + +0:47:18.898 --> 0:47:29.086 +You'll have not a perfect translation, therefore +it's looking into five front overlap since + +0:47:29.086 --> 0:47:31.522 +there should be last. + +0:47:34.074 --> 0:47:42.666 +Okay, before we take the next step and go +into the sentence alignment, there are more + +0:47:42.666 --> 0:47:44.764 +questions about the. + +0:47:51.131 --> 0:47:55.924 +Too Hot and. + +0:47:56.997 --> 0:47:59.384 +Well um. + +0:48:00.200 --> 0:48:05.751 +There is different ways of doing sentence +alignment. + +0:48:05.751 --> 0:48:12.036 +Here's one way to describe is to call the +other line again. + +0:48:12.172 --> 0:48:17.590 +Of course, we have the advantage that we have +only documents, so we might have like hundred + +0:48:17.590 --> 0:48:20.299 +sentences and hundred sentences in the tower. + +0:48:20.740 --> 0:48:31.909 +Although it still might be difficult to compare +all the things in parallel, and. 
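+
+NOTE
+A rough sketch of the translation-based document matching described
+above: assume all documents have already been machine-translated into
+English, represent each one by its word trigrams, and compare documents
+with cosine similarity over the trigram counts. The tokenisation and
+the trigram choice are simplifications of what a production system
+would do.
+
+from collections import Counter
+from math import sqrt
+
+def trigrams(text: str) -> Counter:
+    toks = text.lower().split()
+    return Counter(zip(toks, toks[1:], toks[2:]))
+
+def cosine(a: Counter, b: Counter) -> float:
+    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
+    norm = sqrt(sum(v * v for v in a.values())) * \
+           sqrt(sum(v * v for v in b.values()))
+    return dot / norm if norm else 0.0
+
+def best_match(src_doc_en, tgt_docs_en):
+    # src_doc_en: English MT output of one source-language document.
+    # tgt_docs_en: English MT output of all candidate target documents.
+    sims = [cosine(trigrams(src_doc_en), trigrams(t)) for t in tgt_docs_en]
+    best = max(range(len(sims)), key=sims.__getitem__)
+    return best, sims[best]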
+ +0:48:31.791 --> 0:48:37.541 +And therefore typically these even assume +that we are only interested in a line character + +0:48:37.541 --> 0:48:40.800 +that can be identified on the sum of the diagonal. + +0:48:40.800 --> 0:48:46.422 +Of course, not exactly the diagonal will sum +some parts around it, but in order to make + +0:48:46.422 --> 0:48:47.891 +things more efficient. + +0:48:48.108 --> 0:48:55.713 +You can still do it around the diagonal because +if you say this is a parallel document, we + +0:48:55.713 --> 0:48:56.800 +assume that. + +0:48:56.836 --> 0:49:05.002 +We wouldn't have passed the document alignment, +therefore we wouldn't have seen it. + +0:49:05.505 --> 0:49:06.774 +In the underline. + +0:49:06.774 --> 0:49:10.300 +Then we are calculating the similarity for +these. + +0:49:10.270 --> 0:49:17.428 +Set this here based on the bilingual dictionary, +so it may be based on how much overlap you + +0:49:17.428 --> 0:49:17.895 +have. + +0:49:18.178 --> 0:49:24.148 +And then we are finding a path through it. + +0:49:24.148 --> 0:49:31.089 +You are finding a path which the lights ever +see. + +0:49:31.271 --> 0:49:41.255 +But you're trying to find a pass through your +document so that you get these parallel. + +0:49:41.201 --> 0:49:49.418 +And then the perfect ones here would be your +pass, where you just take this other parallel. + +0:49:51.011 --> 0:50:05.579 +The advantage is that, of course, on the one +end limits your search space. + +0:50:05.579 --> 0:50:07.521 +That is,. + +0:50:07.787 --> 0:50:10.013 +So what does it mean? + +0:50:10.013 --> 0:50:19.120 +So even if you have a very high probable pair, +you're not taking them on because overall. + +0:50:19.399 --> 0:50:27.063 +So sometimes it makes sense to also use this +global information and not only compare on + +0:50:27.063 --> 0:50:34.815 +individual sentences because what you're with +your parents is that sometimes it's only a + +0:50:34.815 --> 0:50:36.383 +good translation. + +0:50:38.118 --> 0:50:51.602 +So by this minion paste you're preventing +the system to do it at the border where there's + +0:50:51.602 --> 0:50:52.201 +no. + +0:50:53.093 --> 0:50:55.689 +So that might achieve you a bit better quality. + +0:50:56.636 --> 0:51:12.044 +The pack always ends if we write the button +for everybody, but it also means you couldn't + +0:51:12.044 --> 0:51:15.126 +necessarily have. + +0:51:15.375 --> 0:51:24.958 +Have some restrictions that is right, so first +of all they can't be translated out. + +0:51:25.285 --> 0:51:32.572 +So the handle line typically only really works +well if you have a relatively high quality. + +0:51:32.752 --> 0:51:39.038 +So if you have this more general data where +there's like some parts are translated and + +0:51:39.038 --> 0:51:39.471 +some. + +0:51:39.719 --> 0:51:43.604 +It doesn't really work, so it might. + +0:51:43.604 --> 0:51:53.157 +It's okay with having maybe at the end some +sentences which are missing, but in generally. + +0:51:53.453 --> 0:51:59.942 +So it's not robust against significant noise +on the. + +0:52:05.765 --> 0:52:12.584 +The second thing is is to what is referred +to as blue alibi. + +0:52:13.233 --> 0:52:16.982 +And this doesn't does, does not do us much. + +0:52:16.977 --> 0:52:30.220 +A global information you can translate each +sentence to English, and then you calculate + +0:52:30.220 --> 0:52:34.885 +the voice for the translation. + +0:52:35.095 --> 0:52:41.888 +And that you would get six answer points, +which are the ones in a purple ear. 
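+
+NOTE
+A simplified sketch of the alignment search just described: a monotone
+dynamic programme over the sentence grid, restricted to a band around
+the diagonal, scoring candidate links by bilingual-dictionary overlap.
+Only 1-1 links and skips are modelled, and the band width, skip cost
+and similarity function are assumptions; real aligners also handle
+1-2 and 2-1 merges.
+
+def overlap_sim(src, tgt, dictionary):
+    # Fraction of source words whose dictionary translation occurs
+    # in the target sentence.
+    src_words = set(src.lower().split())
+    tgt_words = set(tgt.lower().split())
+    hits = sum(1 for w in src_words if dictionary.get(w) in tgt_words)
+    return hits / max(len(src_words), 1)
+
+def align(src_sents, tgt_sents, dictionary, band=5, skip_cost=0.2):
+    n, m = len(src_sents), len(tgt_sents)
+    NEG = float("-inf")
+    score = [[NEG] * (m + 1) for _ in range(n + 1)]
+    back = [[None] * (m + 1) for _ in range(n + 1)]
+    score[0][0] = 0.0
+    for i in range(n + 1):
+        for j in range(m + 1):
+            if abs(i - j) > band or score[i][j] == NEG:
+                continue                     # stay close to the diagonal
+            if i < n and j < m:              # 1-1 link
+                s = score[i][j] + overlap_sim(src_sents[i],
+                                              tgt_sents[j], dictionary)
+                if s > score[i + 1][j + 1]:
+                    score[i + 1][j + 1] = s
+                    back[i + 1][j + 1] = (i, j, "link")
+            for di, dj in ((1, 0), (0, 1)):  # skip one sentence
+                if i + di > n or j + dj > m:
+                    continue
+                s = score[i][j] - skip_cost
+                if s > score[i + di][j + dj]:
+                    score[i + di][j + dj] = s
+                    back[i + di][j + dj] = (i, j, "skip")
+    links, i, j = [], n, m                   # follow the back-pointers
+    while back[i][j] is not None:
+        pi, pj, op = back[i][j]
+        if op == "link":
+            links.append((pi, pj))
+        i, j = pi, pj
+    return list(reversed(links))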
+ +0:52:42.062 --> 0:52:56.459 +And then you have the ability to add some +points around it, which might be a bit lower. + +0:52:56.756 --> 0:53:06.962 +But here in this case you are able to deal +with reorderings, angles to deal with parts. + +0:53:07.247 --> 0:53:16.925 +Therefore, in this case we need a full scale +and key system to do this calculation while + +0:53:16.925 --> 0:53:17.686 +we're. + +0:53:18.318 --> 0:53:26.637 +Then, of course, the better your similarity +metric is, so the better you are able to do + +0:53:26.637 --> 0:53:35.429 +this comparison, the less you have to rely +on structural information that, in one sentence,. + +0:53:39.319 --> 0:53:53.411 +Anymore questions, and then there are things +like back in line which try to do the same. + +0:53:53.793 --> 0:53:59.913 +That means the idea is that you expect each +sentence. + +0:53:59.819 --> 0:54:02.246 +In a crossing will vector space. + +0:54:02.246 --> 0:54:08.128 +Crossing will vector space always means that +you have a vector or knight means. + +0:54:08.128 --> 0:54:14.598 +In this case you have a vector space where +sentences in different languages are near to + +0:54:14.598 --> 0:54:16.069 +each other if they. + +0:54:16.316 --> 0:54:23.750 +So you can have it again and so on, but just +next to each other and want to call you. + +0:54:24.104 --> 0:54:32.009 +And then you can of course measure now the +similarity by some distance matrix in this + +0:54:32.009 --> 0:54:32.744 +vector. + +0:54:33.033 --> 0:54:36.290 +And you're saying towards two senses are lying. + +0:54:36.290 --> 0:54:39.547 +If the distance in the vector space is somehow. + +0:54:40.240 --> 0:54:50.702 +We'll discuss that in a bit more heat soon +because these vector spades and bathings are + +0:54:50.702 --> 0:54:52.010 +even then. + +0:54:52.392 --> 0:54:55.861 +So the nice thing is with this. + +0:54:55.861 --> 0:55:05.508 +It's really good and good to get quite good +quality and can decide whether two sentences + +0:55:05.508 --> 0:55:08.977 +are translations of each other. + +0:55:08.888 --> 0:55:14.023 +In the fact-lined approach, but often they +even work on a global search way to really + +0:55:14.023 --> 0:55:15.575 +compare on everything to. + +0:55:16.236 --> 0:55:29.415 +What weak alignment also does is trying to +do to make this more efficient in finding the. + +0:55:29.309 --> 0:55:40.563 +If you don't want to compare everything to +everything, you first need sentence blocks, + +0:55:40.563 --> 0:55:41.210 +and. + +0:55:41.141 --> 0:55:42.363 +Then find him fast. + +0:55:42.562 --> 0:55:55.053 +You always have full sentence resolution, +but then you always compare on the area around. + +0:55:55.475 --> 0:56:11.501 +So if you do compare blocks on the source +of the target, then you have of your possibilities. + +0:56:11.611 --> 0:56:17.262 +So here the end times and comparison is a +lot less than the comparison you have here. + +0:56:17.777 --> 0:56:23.750 +And with neural embeddings you can also embed +not only single sentences and whole blocks. + +0:56:24.224 --> 0:56:28.073 +So how you make this in fast? + +0:56:28.073 --> 0:56:35.643 +You're starting from a coarse grain resolution +here where. + +0:56:36.176 --> 0:56:47.922 +Then you're getting a double pass where they +could be good and near this pass you're doing + +0:56:47.922 --> 0:56:49.858 +more and more. + +0:56:52.993 --> 0:56:54.601 +And yeah, what's the? + +0:56:54.601 --> 0:56:56.647 +This is the white egg lift. 
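+
+NOTE
+A toy sketch of the BLEU-based anchoring mentioned above (the Bleualign
+idea): every source sentence is assumed to have been machine-translated
+into the target language already, candidate pairs are scored with a
+simple n-gram precision, and only high-scoring pairs are kept as anchor
+points. The precision here is a stand-in rather than the full BLEU the
+actual tool computes, and the threshold is an assumption.
+
+from collections import Counter
+
+def ngram_precision(hyp: str, ref: str, max_n: int = 2) -> float:
+    hyp_toks, ref_toks = hyp.lower().split(), ref.lower().split()
+    precisions = []
+    for n in range(1, max_n + 1):
+        hyp_ngrams = Counter(zip(*[hyp_toks[i:] for i in range(n)]))
+        ref_ngrams = Counter(zip(*[ref_toks[i:] for i in range(n)]))
+        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
+        precisions.append(matches / max(sum(hyp_ngrams.values()), 1))
+    return sum(precisions) / len(precisions)
+
+def anchor_points(src_mt, tgt_sents, threshold=0.4):
+    # src_mt[i] is the MT output (in the target language) of source i.
+    anchors = []
+    for i, hyp in enumerate(src_mt):
+        j, score = max(enumerate(ngram_precision(hyp, t)
+                                 for t in tgt_sents),
+                       key=lambda x: x[1])
+        if score >= threshold:
+            anchors.append((i, j, score))
+    return anchors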
+ +0:56:56.647 --> 0:56:59.352 +These are the sewers and the target. + +0:57:00.100 --> 0:57:16.163 +While it was sleeping in the forests and things, +I thought it was very strange to see this man. + +0:57:16.536 --> 0:57:25.197 +So you have the sentences, but if you do blocks +you have blocks that are in. + +0:57:30.810 --> 0:57:38.514 +This is the thing about the pipeline approach. + +0:57:38.514 --> 0:57:46.710 +We want to look at the global mining, but +before. + +0:57:53.633 --> 0:58:07.389 +In the global mining thing we have to also +do some filtering and so typically in the things + +0:58:07.389 --> 0:58:10.379 +they do they start. + +0:58:10.290 --> 0:58:14.256 +And then they are doing some pretty processing. + +0:58:14.254 --> 0:58:17.706 +So you try to at first to de-defecate paragraphs. + +0:58:17.797 --> 0:58:30.622 +So, of course, if you compare everything with +everything in two times the same input example, + +0:58:30.622 --> 0:58:35.748 +you will also: The hard thing is that you first +keep duplicating. + +0:58:35.748 --> 0:58:37.385 +You have each paragraph only one. + +0:58:37.958 --> 0:58:42.079 +There's a lot of text which occurs a lot of +times. + +0:58:42.079 --> 0:58:44.585 +They will happen all the time. + +0:58:44.884 --> 0:58:57.830 +There are pages about the cookie thing you +see and about accepting things. + +0:58:58.038 --> 0:59:04.963 +So you can already be duplicated here, or +your problem has crossed the website twice, + +0:59:04.963 --> 0:59:05.365 +and. + +0:59:06.066 --> 0:59:11.291 +Then you can remove low quality data like +cooking warnings that have biolabites start. + +0:59:12.012 --> 0:59:13.388 +Hey! + +0:59:13.173 --> 0:59:19.830 +So let you have maybe some other sentence, +and then you're doing a language idea. + +0:59:19.830 --> 0:59:29.936 +That means you want to have a text, which +is: You want to know for each sentence a paragraph + +0:59:29.936 --> 0:59:38.695 +which language it has so that you then, of +course, if you want. + +0:59:39.259 --> 0:59:44.987 +Finally, there is some complexity based film +screenings to believe, for example, for very + +0:59:44.987 --> 0:59:46.069 +high complexity. + +0:59:46.326 --> 0:59:59.718 +That means, for example, data where there's +a lot of crazy names which are growing. + +1:00:00.520 --> 1:00:09.164 +Sometimes it also improves very high perplexity +data because that is then unmanned generated + +1:00:09.164 --> 1:00:09.722 +data. + +1:00:11.511 --> 1:00:17.632 +And then the model which is mostly used for +that is what is called a laser model. + +1:00:18.178 --> 1:00:21.920 +It's based on machine translation. + +1:00:21.920 --> 1:00:28.442 +Hope it all recognizes the machine translation +architecture. + +1:00:28.442 --> 1:00:37.103 +However, there is a difference between a general +machine translation system and. + +1:01:00.000 --> 1:01:13.322 +Machine translation system, so it's messy. + +1:01:14.314 --> 1:01:24.767 +See one bigger difference, which is great +if I'm excluding that object or the other. + +1:01:25.405 --> 1:01:39.768 +There is one difference to the other, one +with attention, so we are having. + +1:01:40.160 --> 1:01:43.642 +And then we are using that here in there each +time set up. + +1:01:44.004 --> 1:01:54.295 +Mean, therefore, it's maybe a bit similar +to original anti-system without attention. + +1:01:54.295 --> 1:01:56.717 +It's quite similar. 
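+
+NOTE
+A short sketch of the preprocessing mentioned above: paragraphs are
+deduplicated with a hash set and obvious boilerplate such as cookie
+banners is dropped. The boilerplate patterns and the minimum length are
+assumptions for illustration; a real pipeline adds language
+identification and language-model perplexity filtering on top, as
+described in this segment.
+
+import hashlib
+import re
+
+BOILERPLATE = re.compile(r"cookie|javascript required|all rights reserved",
+                         re.IGNORECASE)
+
+def clean_paragraphs(paragraphs):
+    seen = set()
+    for p in paragraphs:
+        p = " ".join(p.split())           # normalise whitespace
+        digest = hashlib.sha1(p.lower().encode("utf-8")).hexdigest()
+        if digest in seen:                # exact duplicate paragraph
+            continue
+        seen.add(digest)
+        if BOILERPLATE.search(p) or len(p.split()) < 3:
+            continue                      # boilerplate or too short
+        yield p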
+ +1:01:57.597 --> 1:02:10.011 +However, it has this disadvantage saying that +we have to put everything in one sentence and + +1:02:10.011 --> 1:02:14.329 +that maybe not all information. + +1:02:15.055 --> 1:02:25.567 +However, now in this type of framework we +are not really interested in machine translation, + +1:02:25.567 --> 1:02:27.281 +so this model. + +1:02:27.527 --> 1:02:34.264 +So we are training it to do machine translation. + +1:02:34.264 --> 1:02:42.239 +What that means in the end should be as much +information. + +1:02:43.883 --> 1:03:01.977 +Only all the information in here is able to +really well do the machine translation. + +1:03:02.642 --> 1:03:07.801 +So that is the first step, so we are doing +here. + +1:03:07.801 --> 1:03:17.067 +We are building the MT system, not with the +goal of making the best MT system, but with + +1:03:17.067 --> 1:03:22.647 +learning and sentences, and hopefully all important. + +1:03:22.882 --> 1:03:26.116 +Because otherwise we won't be able to generate +the translation. + +1:03:26.906 --> 1:03:31.287 +So it's a bit more on the bottom neck like +to try to put as much information. + +1:03:32.012 --> 1:03:36.426 +And if you think if you want to do later finding +the bear's neighbor or something like. + +1:03:37.257 --> 1:03:48.680 +So finding similarities is typically possible +with fixed dimensional things, so we can do + +1:03:48.680 --> 1:03:56.803 +that in an end dimensional space and find the +nearest neighbor. + +1:03:57.857 --> 1:03:59.837 +Yeah, it would be very difficult. + +1:04:00.300 --> 1:04:03.865 +There's one thing that we also do. + +1:04:03.865 --> 1:04:09.671 +We don't want to find the nearest neighbor +in the other. + +1:04:10.570 --> 1:04:13.424 +Do you have an idea how we can train them? + +1:04:13.424 --> 1:04:16.542 +This is a set that embeddings can be compared. + +1:04:23.984 --> 1:04:36.829 +Any idea do you think about two lectures, +a three lecture stack, one that did gave. + +1:04:41.301 --> 1:04:50.562 +We can train them on a multilingual setting +and that's how it's done in lasers so we're + +1:04:50.562 --> 1:04:56.982 +not doing it only from German to English but +we're training. + +1:04:57.017 --> 1:05:04.898 +Mean, if the English one has to be useful +for German, French and so on, and for German + +1:05:04.898 --> 1:05:13.233 +also, the German and the English and so have +to be useful, then somehow we'll automatically + +1:05:13.233 --> 1:05:16.947 +learn that these embattes are popularly. + +1:05:17.437 --> 1:05:28.562 +And then we can use an exact as we will plan +to have a similar sentence embedding. + +1:05:28.908 --> 1:05:39.734 +If you put in here a German and a French one +and always generate as they both have the same + +1:05:39.734 --> 1:05:48.826 +translations, you give these sentences: And +you should do exactly the same thing, so that's + +1:05:48.826 --> 1:05:50.649 +of course the easiest. + +1:05:51.151 --> 1:05:59.817 +If the sentence is very different then most +people will also hear the English decoder and + +1:05:59.817 --> 1:06:00.877 +therefore. + +1:06:02.422 --> 1:06:04.784 +So that is the first thing. + +1:06:04.784 --> 1:06:06.640 +Now we have this one. + +1:06:06.640 --> 1:06:10.014 +We have to be trained on parallel data. + +1:06:10.390 --> 1:06:22.705 +Then we can use these embeddings on our new +data and try to use them to make efficient + +1:06:22.705 --> 1:06:24.545 +comparisons. + +1:06:26.286 --> 1:06:30.669 +So how can you do comparison? 
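+
+NOTE
+A minimal PyTorch sketch of a LASER-style sentence encoder as described
+above: a bidirectional LSTM whose hidden states are max-pooled over
+time into one fixed-size vector, on which an attention-free decoder
+would then be conditioned during multilingual training. The layer
+sizes, vocabulary size and single LSTM layer are assumptions; the real
+LASER model differs in several details.
+
+import torch
+import torch.nn as nn
+
+class SentenceEncoder(nn.Module):
+    def __init__(self, vocab_size=50000, emb_dim=320, hidden_dim=512):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
+        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
+                            bidirectional=True)
+
+    def forward(self, token_ids):                     # (batch, seq_len)
+        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq, 2*hidden)
+        # Max-pool over time -> one fixed-size embedding per sentence.
+        return states.max(dim=1).values               # (batch, 2*hidden)
+
+enc = SentenceEncoder()
+emb = enc(torch.randint(1, 50000, (4, 12)))  # 4 toy sentences, length 12
+print(emb.shape)                             # torch.Size([4, 1024])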
+ +1:06:30.669 --> 1:06:37.243 +Maybe the first thing you think of is to do. + +1:06:37.277 --> 1:06:44.365 +So you take all the German sentences, all +the French sentences. + +1:06:44.365 --> 1:06:49.460 +We compute the Cousin's simple limit between. + +1:06:49.469 --> 1:06:58.989 +And then you take all pairs where the similarity +is very high. + +1:07:00.180 --> 1:07:17.242 +So you have your French list, you have them, +and then you just take all sentences. + +1:07:19.839 --> 1:07:29.800 +It's an additional power method that we have, +but we have a lot of data who will find a point. + +1:07:29.800 --> 1:07:32.317 +It's a good point, but. + +1:07:35.595 --> 1:07:45.738 +It's also not that easy, so one problem is +that typically there are some sentences where. + +1:07:46.066 --> 1:07:48.991 +And other points where there is very few points +in the neighborhood. + +1:07:49.629 --> 1:08:06.241 +And then for things where a lot of things +are enabled you might extract not for one percent + +1:08:06.241 --> 1:08:08.408 +to do that. + +1:08:08.868 --> 1:08:18.341 +So what typically is happening is you do the +max merchant? + +1:08:18.341 --> 1:08:25.085 +How good is a pair compared to the other? + +1:08:25.305 --> 1:08:33.859 +So you take the similarity between X and Y, +and then you look at one of the eight nearest + +1:08:33.859 --> 1:08:35.190 +neighbors of. + +1:08:35.115 --> 1:08:48.461 +Of x and what are the eight nearest neighbors +of y, and the dividing of the similarity through + +1:08:48.461 --> 1:08:51.411 +the eight neighbors. + +1:08:51.671 --> 1:09:00.333 +So what you may be looking at are these two +sentences a lot more similar than all the other. + +1:09:00.840 --> 1:09:13.455 +And if these are exceptional and similar compared +to other sentences then they should be translations. + +1:09:16.536 --> 1:09:19.158 +Of course, that has also some. + +1:09:19.158 --> 1:09:24.148 +Then the good thing is there's a lot of similar +sentences. + +1:09:24.584 --> 1:09:30.641 +If there is a lot of similar sensations in +white then these are also very similar and + +1:09:30.641 --> 1:09:32.824 +you are doing more comparison. + +1:09:32.824 --> 1:09:36.626 +If all the arrows are far away then the translations. + +1:09:37.057 --> 1:09:40.895 +So think about this like short sentences. + +1:09:40.895 --> 1:09:47.658 +They might be that most things are similar, +but they are just in general. + +1:09:49.129 --> 1:09:59.220 +There are some problems that now we assume +there is only one pair of translations. + +1:09:59.759 --> 1:10:09.844 +So it has some problems in their two or three +ballad translations of that. + +1:10:09.844 --> 1:10:18.853 +Then, of course, this pair might not find +it, but in general this. + +1:10:19.139 --> 1:10:27.397 +For example, they have like all of these common +trawl. + +1:10:27.397 --> 1:10:32.802 +They have large parallel data sets. + +1:10:36.376 --> 1:10:38.557 +One point maybe also year. + +1:10:38.557 --> 1:10:45.586 +Of course, now it's important that we have +done the deduplication before because if we + +1:10:45.586 --> 1:10:52.453 +wouldn't have the deduplication, we would have +points which are the same coordinate. + +1:10:57.677 --> 1:11:03.109 +Maybe only one small things to that mean. + +1:11:03.109 --> 1:11:09.058 +A major issue in this case is still making +a. + +1:11:09.409 --> 1:11:18.056 +So you have to still do all of this comparison, +and that cannot be done just by simple. 
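+
+NOTE
+A NumPy sketch of the margin (ratio) scoring just described: the cosine
+similarity of a candidate pair is divided by the average similarity to
+the k nearest neighbours of each side, so a pair only scores high if it
+is unusually similar compared to its neighbourhood. k=8 follows the
+lecture; the brute-force neighbour computation and the random vectors
+are only for illustration (real systems use an approximate index, see
+below).
+
+import numpy as np
+
+def margin_scores(src_emb, tgt_emb, k=8):
+    # L2-normalise so that the dot product equals cosine similarity.
+    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
+    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
+    sim = src @ tgt.T                       # (n_src, n_tgt) cosine matrix
+    k = min(k, sim.shape[0], sim.shape[1])
+    # Average similarity to the k nearest neighbours on each side.
+    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source
+    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target
+    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
+    return sim / denom
+
+scores = margin_scores(np.random.randn(100, 64), np.random.randn(120, 64))
+best_tgt = scores.argmax(axis=1)   # best candidate target per source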
+ +1:11:19.199 --> 1:11:27.322 +So what is done typically express the word, +you know things can be done in parallel. + +1:11:28.368 --> 1:11:36.024 +So calculating the embeddings and all that +stuff doesn't need to be sequential, but it's + +1:11:36.024 --> 1:11:37.143 +independent. + +1:11:37.357 --> 1:11:48.680 +What you typically do is create an event and +then you do some kind of projectization. + +1:11:48.708 --> 1:11:57.047 +So there is this space library which does +key nearest neighbor search very efficient + +1:11:57.047 --> 1:11:59.597 +in very high-dimensional. + +1:12:00.080 --> 1:12:03.410 +And then based on that you can now do comparison. + +1:12:03.410 --> 1:12:06.873 +You can even do the comparison in parallel +because. + +1:12:06.906 --> 1:12:13.973 +Can look at different areas of your space +and then compare the different pieces to find + +1:12:13.973 --> 1:12:14.374 +the. + +1:12:15.875 --> 1:12:30.790 +With this you are then able to do very fast +calculations on this type of sentence. + +1:12:31.451 --> 1:12:34.761 +So yeah this is currently one. + +1:12:35.155 --> 1:12:48.781 +Mean, those of them are covered with this, +so there's a parade. + +1:12:48.668 --> 1:12:55.543 +We are collected by that and most of them +are in a very big corporate for languages which + +1:12:55.543 --> 1:12:57.453 +you can hardly stand on. + +1:12:58.778 --> 1:13:01.016 +Do you have any more questions on this? + +1:13:05.625 --> 1:13:17.306 +And then some more words to this last set +here: So we have now done our pearl marker + +1:13:17.306 --> 1:13:25.165 +and we could assume that everything is fine +now. + +1:13:25.465 --> 1:13:35.238 +However, the problem with this noisy data +is that typically this is quite noisy still, + +1:13:35.238 --> 1:13:35.687 +so. + +1:13:36.176 --> 1:13:44.533 +In order to make things efficient to have +a high recall, the final data is often not + +1:13:44.533 --> 1:13:49.547 +of the best quality, not the same type of quality. + +1:13:49.789 --> 1:13:58.870 +So it is essential to do another figuring +step and to remove senses which might seem + +1:13:58.870 --> 1:14:01.007 +to be translations. + +1:14:01.341 --> 1:14:08.873 +And here, of course, the final evaluation +matrix would be how much do my system improve? + +1:14:09.089 --> 1:14:23.476 +And there are even challenges on doing that +so: people getting this noisy data like symmetrics + +1:14:23.476 --> 1:14:25.596 +or something. + +1:14:27.707 --> 1:14:34.247 +However, all these steps is of course very +time consuming, so you might not always want + +1:14:34.247 --> 1:14:37.071 +to do the full pipeline and training. + +1:14:37.757 --> 1:14:51.614 +So how can you model that we want to get this +best and normally what we always want? + +1:14:51.871 --> 1:15:02.781 +You also want to have the best over translation +quality, but this is also normally not achieved + +1:15:02.781 --> 1:15:03.917 +with all. + +1:15:04.444 --> 1:15:12.389 +And that's why you're doing this two-step +approach first of the second alignment. + +1:15:12.612 --> 1:15:27.171 +And after once you do the sentence filtering, +we can put a lot more alphabet in all the comparisons. + +1:15:27.627 --> 1:15:37.472 +For example, you can just translate the source +and compare that translation with the original + +1:15:37.472 --> 1:15:40.404 +one and calculate how good. + +1:15:40.860 --> 1:15:49.467 +And this, of course, you can do with the filing +set, but you can't do with your initial set + +1:15:49.467 --> 1:15:50.684 +of millions. 
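+
+NOTE
+A small sketch of the nearest-neighbour step with the FAISS library
+mentioned above. IndexFlatIP performs exact inner-product search, which
+on length-normalised vectors is cosine similarity; for web-scale data
+one would switch to an approximate, quantised index and shard the
+search. The dimensions, sizes and random vectors are illustrative.
+
+import numpy as np
+import faiss
+
+dim, k = 1024, 8
+src = np.random.rand(1000, dim).astype("float32")  # source embeddings
+tgt = np.random.rand(1200, dim).astype("float32")  # target embeddings
+
+faiss.normalize_L2(src)       # unit vectors: inner product == cosine
+faiss.normalize_L2(tgt)
+
+index = faiss.IndexFlatIP(dim)
+index.add(tgt)                     # index all target sentences
+sims, ids = index.search(src, k)   # k nearest targets for every source
+print(ids[0], sims[0])             # candidates for source sentence 0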
+ +1:15:54.114 --> 1:16:01.700 +So what it is again is the ancient test where +you input as a sentence pair as here, and then + +1:16:01.700 --> 1:16:09.532 +once you have a biometria, these are sentence +pairs with a high quality, and these are sentence + +1:16:09.532 --> 1:16:11.653 +pairs avec a low quality. + +1:16:12.692 --> 1:16:17.552 +Does anybody see what might be a challenge +if you want to train this type of classifier? + +1:16:22.822 --> 1:16:24.264 +How do you measure exactly? + +1:16:24.264 --> 1:16:26.477 +The quality is probably about the problem. + +1:16:27.887 --> 1:16:39.195 +Yes, that is one, that is true, there is even +more, more simple one, and high quality data + +1:16:39.195 --> 1:16:42.426 +here is not so difficult. + +1:16:43.303 --> 1:16:46.844 +Globally, yeah, probably we have a class in +balance. + +1:16:46.844 --> 1:16:49.785 +We don't see many bad quality combinations. + +1:16:49.785 --> 1:16:54.395 +It's hard to get there at the beginning, so +maybe how can you argue? + +1:16:54.395 --> 1:16:58.405 +Where do you find bad quality and what type +of bad quality? + +1:16:58.798 --> 1:17:05.122 +Because if it's too easy, you just take a +random germ and the random innocence that is + +1:17:05.122 --> 1:17:05.558 +very. + +1:17:05.765 --> 1:17:15.747 +But what you're interested is like bad quality +data, which still passes your first initial + +1:17:15.747 --> 1:17:16.405 +step. + +1:17:17.257 --> 1:17:28.824 +What you can use for that is you can use any +type of network or model that in the beginning, + +1:17:28.824 --> 1:17:33.177 +like in random forests, would see. + +1:17:33.613 --> 1:17:38.912 +So the positive examples are quite easy to +get. + +1:17:38.912 --> 1:17:44.543 +You just take parallel data and high quality +data. + +1:17:44.543 --> 1:17:45.095 +You. + +1:17:45.425 --> 1:17:47.565 +That is quite easy. + +1:17:47.565 --> 1:17:55.482 +You normally don't need a lot of data, then +to train in a few validation. + +1:17:57.397 --> 1:18:12.799 +The challenge is like the negative samples +because how would you generate negative samples? + +1:18:13.133 --> 1:18:17.909 +Because the negative examples are the ones +which ask the first step but don't ask the + +1:18:17.909 --> 1:18:18.353 +second. + +1:18:18.838 --> 1:18:23.682 +So how do you typically do it? + +1:18:23.682 --> 1:18:28.994 +You try to do synthetic examples. + +1:18:28.994 --> 1:18:33.369 +You can do random examples. + +1:18:33.493 --> 1:18:45.228 +But this is the typical error that you want +to detect when you do frequency based replacements. + +1:18:45.228 --> 1:18:52.074 +But this is one major issue when you generate +the data. + +1:18:52.132 --> 1:19:02.145 +That doesn't match well with what are the +real arrows that you're interested in. + +1:19:02.702 --> 1:19:13.177 +Is some of the most challenging here to find +the negative samples, which are hard enough + +1:19:13.177 --> 1:19:14.472 +to detect. + +1:19:17.537 --> 1:19:21.863 +And the other thing, which is difficult, is +of course the data ratio. + +1:19:22.262 --> 1:19:24.212 +Why is it important any? + +1:19:24.212 --> 1:19:29.827 +Why is the ratio between positive and negative +examples here important? + +1:19:30.510 --> 1:19:40.007 +Because in a case of plus imbalance we effectively +could learn to just that it's positive and + +1:19:40.007 --> 1:19:43.644 +high quality and we would be right. + +1:19:44.844 --> 1:19:46.654 +Yes, so I'm training. + +1:19:46.654 --> 1:19:51.180 +This is important, but otherwise it might +be too easy. 
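+
+NOTE
+A sketch of how synthetic negative examples for the filtering
+classifier can be generated from clean parallel data: random
+mispairing, truncation and word shuffling. Which corruptions actually
+match the errors the mining pipeline makes is exactly the open problem
+raised above, so these three operations and the ratio parameter are
+only common, assumed choices.
+
+import random
+
+def make_training_data(parallel_pairs, neg_ratio=1.0, seed=0):
+    rng = random.Random(seed)
+    negatives = []
+    for _ in range(int(len(parallel_pairs) * neg_ratio)):
+        src, tgt = rng.choice(parallel_pairs)
+        corruption = rng.choice(["mispair", "truncate", "shuffle"])
+        if corruption == "mispair":      # pair with some other target
+            _, tgt = rng.choice(parallel_pairs)
+        elif corruption == "truncate":   # drop the second half
+            words = tgt.split()
+            tgt = " ".join(words[: max(1, len(words) // 2)])
+        else:                            # destroy the word order
+            words = tgt.split()
+            rng.shuffle(words)
+            tgt = " ".join(words)
+        negatives.append((src, tgt, 0))  # label 0 = not parallel
+    positives = [(s, t, 1) for s, t in parallel_pairs]
+    return positives + negatives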
+ +1:19:51.180 --> 1:19:52.414 +You always do. + +1:19:52.732 --> 1:19:58.043 +And on the other head, of course, navy and +deputy, it's also important because if we have + +1:19:58.043 --> 1:20:03.176 +equal things, we're also assuming that this +might be the other one, and if the quality + +1:20:03.176 --> 1:20:06.245 +is worse or higher, we might also accept too +fewer. + +1:20:06.626 --> 1:20:10.486 +So this ratio is not easy to determine. + +1:20:13.133 --> 1:20:16.969 +What type of features can we use? + +1:20:16.969 --> 1:20:23.175 +Traditionally, we're also looking at word +translation. + +1:20:23.723 --> 1:20:37.592 +And nowadays, of course, we can model this +also with something like similar, so this is + +1:20:37.592 --> 1:20:38.696 +again. + +1:20:40.200 --> 1:20:42.306 +Language follow. + +1:20:42.462 --> 1:20:49.763 +So we can, for example, put the sentence in +there for the source and the target, and then + +1:20:49.763 --> 1:20:56.497 +based on this classification label we can classify +as this a parallel sentence or. + +1:20:56.476 --> 1:21:00.054 +So it's more like a normal classification +task. + +1:21:00.160 --> 1:21:09.233 +And by having a system which can have much +enable input, we can just put in two R. + +1:21:09.233 --> 1:21:16.886 +We can also put in two independent of each +other based on the hidden. + +1:21:17.657 --> 1:21:35.440 +You can, as you do any other type of classifier, +you can train them on top of. + +1:21:35.895 --> 1:21:42.801 +This so it tries to represent the full sentence +and that's what you also want to do on. + +1:21:43.103 --> 1:21:45.043 +The Other Thing What They Can't Do Is, of +Course. + +1:21:45.265 --> 1:21:46.881 +You can make here. + +1:21:46.881 --> 1:21:52.837 +You can do your summation of all the hidden +statements that you said. + +1:21:58.698 --> 1:22:10.618 +Okay, and then one thing which we skipped +until now, and that is only briefly this fragment. + +1:22:10.630 --> 1:22:19.517 +So if we have sentences which are not really +parallel, can we also extract information from + +1:22:19.517 --> 1:22:20.096 +them? + +1:22:22.002 --> 1:22:25.627 +And so what here the test is? + +1:22:25.627 --> 1:22:33.603 +We have a sentence and we want to find within +or a sentence pair. + +1:22:33.603 --> 1:22:38.679 +We want to find within the sentence pair. + +1:22:39.799 --> 1:22:46.577 +And how that, for example, has been done is +using a lexical positive and negative association. + +1:22:47.187 --> 1:22:57.182 +And then you can transform your target sentence +into a signal and find a thing where you have. + +1:22:57.757 --> 1:23:00.317 +So I'm Going to Get a Clear Eye. + +1:23:00.480 --> 1:23:15.788 +So you hear the English sentence, the other +language, and you have an alignment between + +1:23:15.788 --> 1:23:18.572 +them, and then. + +1:23:18.818 --> 1:23:21.925 +This is not a light cell from a negative signal. + +1:23:22.322 --> 1:23:40.023 +And then you drink some sauce on there because +you want to have an area where there's. + +1:23:40.100 --> 1:23:51.742 +It doesn't matter if you have simple arrows +here by smooth saying you can't. + +1:23:51.972 --> 1:23:58.813 +So you try to find long segments here where +at least most of the words are somehow aligned. + +1:24:00.040 --> 1:24:10.069 +And then you take this one in the side and +extract that one as your parallel fragment, + +1:24:10.069 --> 1:24:10.645 +and. 
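+
+NOTE
+A rough sketch of the fragment extraction idea at the end of this
+segment: every target word becomes a +1/-1 signal depending on whether
+it is aligned to some source word, the signal is smoothed with a small
+moving average, and the longest span with a positive smoothed value is
+extracted as a parallel fragment. The window size, the signal values
+and the alignment input format are assumptions.
+
+def extract_fragment(num_tgt_words, alignment_links, window=3):
+    # alignment_links: set of (src_index, tgt_index) word alignments.
+    aligned = {j for _, j in alignment_links}
+    signal = [1.0 if j in aligned else -1.0 for j in range(num_tgt_words)]
+    half = window // 2                      # simple moving average
+    smoothed = []
+    for j in range(num_tgt_words):
+        win = signal[max(0, j - half): j + half + 1]
+        smoothed.append(sum(win) / len(win))
+    best_len, best_span, start = 0, None, None
+    for j, v in enumerate(smoothed + [-1.0]):   # sentinel closes spans
+        if v > 0 and start is None:
+            start = j
+        elif v <= 0 and start is not None:
+            if j - start > best_len:
+                best_len, best_span = j - start, (start, j)
+            start = None
+    return best_span    # (start, end) target indices of the fragment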
+
+1:24:10.630 --> 1:24:21.276
+So in the end you not only have full sentences,
+but you also have partial sentences, which can
+be helpful, especially if you have quite a
+low-resource setting.
+
+1:24:32.332 --> 1:24:36.388
+That's everything for today.
+
+1:24:36.388 --> 1:24:44.023
+What you hopefully remember is how the general
+pipeline for mining parallel data works.
+
+1:24:44.184 --> 1:24:54.506
+We talked about how we can do the document
+alignment and then the sentence alignment,
+which can be done after the document alignment.
+
+1:24:59.339 --> 1:25:12.611
+Any more questions? I think on Thursday we had
+to do a switch, so on Thursday there will be
+a practical session.