WEBVTT 0:00:01.641 --> 0:00:06.302 Hey, so welcome again to today's lecture on machine translation. 0:00:07.968 --> 0:00:15.152 This week we'll have a bit of a different focus; for the last two weeks or so we have been looking into 0:00:15.655 --> 0:00:28.073 how we can improve our system by having more data, other data sources, or using them 0:00:28.073 --> 0:00:30.331 more efficiently. 0:00:30.590 --> 0:00:38.046 And we'll have a bit more of that next week with document-level and context-based translation. 0:00:38.338 --> 0:00:47.415 So there we are shifting away from this idea that we treat each sentence independently, and instead treat 0:00:47.415 --> 0:00:49.129 the translation in a larger context. 0:00:49.129 --> 0:00:58.788 Because, as you may remember from the beginning, there are phenomena in machine translation 0:00:58.788 --> 0:01:02.143 that you cannot correctly translate sentence by sentence. 0:01:03.443 --> 0:01:14.616 However, today we want to look more into what challenges arise, specifically when we're practically 0:01:14.616 --> 0:01:16.628 applying machine translation. 0:01:17.017 --> 0:01:23.674 And this block will be a total of four different lectures. 0:01:23.674 --> 0:01:29.542 Today we look at what types of biases there are in machine translation. 0:01:29.729 --> 0:01:37.646 And then: can we try to improve on this? But of course the first focus will be on assessing it. 0:01:37.717 --> 0:01:41.375 And this, of course, gets more and more important 0:01:41.375 --> 0:01:48.333 the more often you apply this type of technology. When it was mainly a basic research tool which 0:01:48.333 --> 0:01:53.785 you were using in a research environment, it was not directly that important. 0:01:54.054 --> 0:02:00.370 But once you apply it, the question is: does it perform the same for everybody, or is the 0:02:00.370 --> 0:02:04.436 performance for some people less good than for other people? 0:02:04.436 --> 0:02:10.462 Does it have specific challenges? And we are seeing that especially in translation. 0:02:10.710 --> 0:02:13.420 We have one major challenge: 0:02:13.420 --> 0:02:20.333 we have grammatical gender, and this is not the same in all languages. 0:02:20.520 --> 0:02:35.431 In English, it is often not clear, if you talk about some person, whether they are male or female, and so 0:02:35.431 --> 0:02:39.787 this information somehow has to be inferred. 0:02:41.301 --> 0:02:50.034 Just as a brief overview: besides this one aspect of application we will then have two other 0:02:50.034 --> 0:02:57.796 aspects. On Thursday we'll look into adaptation, so how can we adapt to specific situations? 0:02:58.718 --> 0:03:09.127 Because we have seen that your systems perform well when the test case is similar to the training 0:03:09.127 --> 0:03:15.181 case, so ideally you should get matching training data. 0:03:16.036 --> 0:03:27.577 However, in practical applications, it's not always possible to collect really the best 0:03:27.577 --> 0:03:31.642 fitting data, so in that case we need to adapt. 0:03:32.092 --> 0:03:39.269 And then the third larger group of applications will be speech translation. 0:03:39.269 --> 0:03:42.991 What do we have to change in our machine translation system 0:03:43.323 --> 0:03:53.569 if we are now not translating text, but want to translate speech? That will be covered in later 0:03:53.569 --> 0:03:54.708 lectures. 0:04:00.180 --> 0:04:12.173 So what are we talking about when we are talking about bias, from a definition point of view? 0:04:12.092 --> 0:04:21.799 It means we are introducing systematic errors when testing, and then we encourage the selection 0:04:21.799 --> 0:04:24.408 of specific answers.
0:04:24.804 --> 0:04:36.862 The most prominent case, which is analyzed most in the research community, is a bias based 0:04:36.862 --> 0:04:38.320 on gender. 0:04:38.320 --> 0:04:43.355 One example: she works in a hospital. 0:04:43.523 --> 0:04:50.787 It is not directly able to assess whether this is now a point or a friend. 0:04:51.251 --> 0:05:07.095 And although in this one even there is, it's possible to ambiguate this based on the context. 0:05:07.127 --> 0:05:14.391 However, there is yeah, this relation to learn is of course not that easy. 0:05:14.614 --> 0:05:27.249 So the system might also learn more like shortcut connections, which might be that in your training 0:05:27.249 --> 0:05:31.798 data most of the doctors are males. 0:05:32.232 --> 0:05:41.725 That is like that was too bigly analyzed and biased, and we'll focus on that also in this. 0:05:41.641 --> 0:05:47.664 In this lecture, however, of course, the system might be a lot of other biases too, which have 0:05:47.664 --> 0:05:50.326 been partly investigated in other fields. 0:05:50.326 --> 0:05:53.496 But I think machine translation is not that much. 0:05:53.813 --> 0:05:57.637 For example, it can be based on your originals. 0:05:57.737 --> 0:06:09.405 So there is an example for a sentiment analysis that's a bit prominent. 0:06:09.405 --> 0:06:15.076 A sentiment analysis means you're. 0:06:15.035 --> 0:06:16.788 Like you're seeing it in reviews. 0:06:17.077 --> 0:06:24.045 And then you can show that with baseline models, if the name is Mohammed then the sentiment 0:06:24.045 --> 0:06:30.786 in a lot of systems will be more negative than if it's like a traditional European name. 0:06:31.271 --> 0:06:33.924 Are with foods that is simple. 0:06:33.924 --> 0:06:36.493 It's this type of restaurant. 0:06:36.493 --> 0:06:38.804 It's positive and another. 0:06:39.319 --> 0:06:49.510 You have other aspects, so we have seen this. 0:06:49.510 --> 0:06:59.480 We have done some experiments in Vietnamese. 0:06:59.559 --> 0:07:11.040 And then, for example, you can analyze that if it's like he's Germany will address it more 0:07:11.040 --> 0:07:18.484 formal, while if he is North Korean he'll use an informal. 0:07:18.838 --> 0:07:24.923 So these are also possible types of gender. 0:07:24.923 --> 0:07:31.009 However, this is difficult types of biases. 0:07:31.251 --> 0:07:38.903 However, especially in translation, the bias for gender is the most challenging because 0:07:38.903 --> 0:07:42.989 we are treating gender in different languages. 0:07:45.405 --> 0:07:46.930 Hi this is challenging. 0:07:48.148 --> 0:07:54.616 The reason for that is that there is a translation mismatch and we have, I mean, one reason for 0:07:54.616 --> 0:08:00.140 that is there's a translation mismatch and that's the most challenging situation. 0:08:00.140 --> 0:08:05.732 So there is there is different information in the Sears language or in the target. 0:08:06.046 --> 0:08:08.832 So if we have the English word dot player,. 0:08:09.029 --> 0:08:12.911 It's there is no information about the gender in there. 0:08:12.911 --> 0:08:19.082 However, if you want to translate in German, you cannot easily generate a word without a 0:08:19.082 --> 0:08:20.469 gender information. 0:08:20.469 --> 0:08:27.056 Or man, you can't do something like Shubila in, but that sounds a bit weird if you're talking. 0:08:27.027 --> 0:08:29.006 About a specific person. 0:08:29.006 --> 0:08:32.331 Then you should use the appropriate font. 
0:08:32.692 --> 0:08:44.128 And so it's most challenging translation as always in this situation where you have less 0:08:44.128 --> 0:08:50.939 information on the source side but more information. 0:08:51.911 --> 0:08:57.103 Similar things like if you think about Japanese, for example where there's different formality 0:08:57.103 --> 0:08:57.540 levels. 0:08:57.540 --> 0:09:02.294 If in German there is no formality or like two only or in English there's no formality 0:09:02.294 --> 0:09:02.677 level. 0:09:02.862 --> 0:09:08.139 And now you have to estimate the formality level. 0:09:08.139 --> 0:09:10.884 Of course, it takes some. 0:09:10.884 --> 0:09:13.839 It's not directly possible. 0:09:14.094 --> 0:09:20.475 What nowadays systems are doing is at least assess. 0:09:20.475 --> 0:09:27.470 This is a situation where don't have enough information. 0:09:27.567 --> 0:09:28.656 Translation. 0:09:28.656 --> 0:09:34.938 So here you have that suggesting it can be doctor or doctorate in Spanish. 0:09:35.115 --> 0:09:37.051 So that is a possibility. 0:09:37.051 --> 0:09:41.595 However, it is of course very, very challenging to find out. 0:09:42.062 --> 0:09:46.130 Is there two really different meanings, or is it not the case? 0:09:46.326 --> 0:09:47.933 You can do the big rule base here. 0:09:47.933 --> 0:09:49.495 Maybe don't know how they did it. 0:09:49.990 --> 0:09:57.469 You can, of course, if you are focusing on gender, the source and the target is different, 0:09:57.469 --> 0:09:57.879 and. 0:09:58.118 --> 0:10:05.799 But if you want to do it more general, it's not that easy because there's always. 0:10:06.166 --> 0:10:18.255 But it's not clear if these are really different or if there's only slight differences. 0:10:22.142 --> 0:10:36.451 Between that another reason why there is a bias in there is typically the system tries 0:10:36.451 --> 0:10:41.385 to always do the most simple. 0:10:42.262 --> 0:10:54.483 And also in your training data there are unintended shortcuts or clues only in the training data 0:10:54.483 --> 0:10:59.145 because you sample them in some way. 0:10:59.379 --> 0:11:06.257 This example, if she works in a hospital and my friend is a nurse, then it might be that 0:11:06.257 --> 0:11:07.184 one friend. 0:11:08.168 --> 0:11:18.979 Male and female because it has learned that in your trained doctor is a male and a nurse 0:11:18.979 --> 0:11:20.802 is doing this. 0:11:20.880 --> 0:11:29.587 And of course, if we are doing maximum likelihood approximation as we are doing it in general, 0:11:29.587 --> 0:11:30.962 we are always. 0:11:30.951 --> 0:11:43.562 So that means if in your training data this correlation is maybe in the case then your 0:11:43.562 --> 0:11:48.345 predictions are always the same. 0:11:48.345 --> 0:11:50.375 It typically. 0:11:55.035 --> 0:12:06.007 What does it mean, of course, if we are having this type of fires and if we are applying? 0:12:05.925 --> 0:12:14.821 It might be that the benefit of machine translation rice so more and more people can benefit from 0:12:14.821 --> 0:12:20.631 the ability to talk to people in different languages and so on. 0:12:20.780 --> 0:12:27.261 But if you more often use it, problems of the system also get more and more important. 0:12:27.727 --> 0:12:36.984 And so if we are seeing that these problems and people nowadays only start to analyze these 0:12:36.984 --> 0:12:46.341 problems partly, also because if it hasn't been used, it's not that important if the quality 0:12:46.341 --> 0:12:47.447 is so bad. 
0:12:47.627 --> 0:12:51.907 Version or is mixing it all the time like we have seen in old systems. 0:12:51.907 --> 0:12:52.993 Then, of course,. 0:12:53.053 --> 0:12:57.303 The issue is not that you have biased issues that you at first need to create a right view. 0:12:57.637 --> 0:13:10.604 So only with the wide application of the good quality this becomes important, and then of 0:13:10.604 --> 0:13:15.359 course you should look into how. 0:13:15.355 --> 0:13:23.100 In order to first get aware of what are the challenges, and that is a general idea not 0:13:23.100 --> 0:13:24.613 only about bias. 0:13:24.764 --> 0:13:31.868 Of course, we have learned about blue scores, so how can you evaluate the over quality and 0:13:31.868 --> 0:13:36.006 they are very important, either blue or any of that. 0:13:36.006 --> 0:13:40.378 However, they are somehow giving us a general overview. 0:13:40.560 --> 0:13:58.410 And if we want to improve our systems, of course it's important that we also do more 0:13:58.410 --> 0:14:00.510 detailed. 0:14:00.340 --> 0:14:05.828 Test sets which are very challenging in order to attend to see how good these systems. 0:14:06.446 --> 0:14:18.674 Of course, one last reminder to that if you do a challenge that says it's typically good 0:14:18.674 --> 0:14:24.581 to keep track of your general performance. 0:14:24.784 --> 0:14:28.648 You don't want to improve normally then on the general quality. 0:14:28.688 --> 0:14:41.555 So if you build a system which will mitigate some biases then the aim is that if you evaluate 0:14:41.555 --> 0:14:45.662 it on the challenging biases. 0:14:45.745 --> 0:14:53.646 You don't need to get better because the aggregated versions don't really measure that aspect well, 0:14:53.646 --> 0:14:57.676 but if you significantly drop in performance then. 0:15:00.000 --> 0:15:19.164 What are, in generally calms, people report about that or why should you care about? 0:15:19.259 --> 0:15:23.598 And you're even then amplifying this type of stereotypes. 0:15:23.883 --> 0:15:33.879 And that is not what you want to achieve with using this technology. 0:15:33.879 --> 0:15:39.384 It's not working through some groups. 0:15:39.819 --> 0:15:47.991 And secondly what is referred to as allocational parts. 0:15:47.991 --> 0:15:54.119 The system might not perform as well for. 0:15:54.314 --> 0:16:00.193 So another example of which we would like to see is that sometimes the translation depends 0:16:00.193 --> 0:16:01.485 on who is speaking. 0:16:01.601 --> 0:16:03.463 So Here You Have It in French. 0:16:03.723 --> 0:16:16.359 Not say it, but the word happy or French has to be expressed differently, whether it's a 0:16:16.359 --> 0:16:20.902 male person or a female person. 0:16:21.121 --> 0:16:28.917 It's nearly impossible to guess that or it's impossible, so then you always select one. 0:16:29.189 --> 0:16:37.109 And of course, since we do greedy search, it will always generate the same, so you will 0:16:37.109 --> 0:16:39.449 have a worse performance. 0:16:39.779 --> 0:16:46.826 And of course not what we want to achieve in average. 0:16:46.826 --> 0:16:54.004 You might be then good, but you also have the ability. 0:16:54.234 --> 0:17:08.749 This is a biased problem or an interface problem because mean you can say well. 0:17:09.069 --> 0:17:17.358 And if you do it, we still have a system that generates unusable output. 0:17:17.358 --> 0:17:24.057 If you don't tell it what you want to do, so in this case. 
0:17:24.244 --> 0:17:27.173 So in this case it's like if we don't have enough information. 0:17:27.467 --> 0:17:34.629 So you have to adapt your system in some way that can either access the information or output. 0:17:34.894 --> 0:17:46.144 But yeah, how you mean there's different ways of how to improve over that first thing is 0:17:46.144 --> 0:17:47.914 you find out. 0:17:48.688 --> 0:17:53.826 Then there is different ways of addressing them, and they of course differ. 0:17:53.826 --> 0:17:57.545 Isn't the situation where the information's available? 0:17:58.038 --> 0:18:12.057 That's the first case we have, or is it a situation where we don't have the information 0:18:12.057 --> 0:18:13.332 either? 0:18:14.154 --> 0:18:28.787 Or should give the system maybe the opportunity to output those or say don't know this is still 0:18:28.787 --> 0:18:29.701 open. 0:18:29.769 --> 0:18:35.470 And even if they have enough information, need this additional information, but they 0:18:35.470 --> 0:18:36.543 are just doing. 0:18:36.776 --> 0:18:51.132 Which is a bit based on how we find that there is research on that, but it's not that easy 0:18:51.132 --> 0:18:52.710 to solve. 0:18:52.993 --> 0:19:05.291 But in general, detecting do have enough information to do a good translation or are information 0:19:05.291 --> 0:19:06.433 missing? 0:19:09.669 --> 0:19:18.951 But before we come on how we will address it or try to change it, and before we look 0:19:18.951 --> 0:19:22.992 at how we can assess it, of course,. 0:19:23.683 --> 0:19:42.820 And therefore wanted to do a bit of a review on how gender is represented in languages. 0:19:43.743 --> 0:19:48.920 Course: You can have more fine grained. 0:19:48.920 --> 0:20:00.569 It's not that everything in the group is the same, but in general you have a large group. 0:20:01.381 --> 0:20:08.347 For example, you even don't say ishi or but it's just one word for it written. 0:20:08.347 --> 0:20:16.107 Oh, don't know how it's pronounced, so you cannot say from a sentence whether it's ishi 0:20:16.107 --> 0:20:16.724 or it. 0:20:17.937 --> 0:20:29.615 Of course, there are some exceptions for whether it's a difference between male and female. 0:20:29.615 --> 0:20:35.962 They have different names for brother and sister. 0:20:36.036 --> 0:20:41.772 So normally you cannot infer whether this is a male speaker or speaking about a male 0:20:41.772 --> 0:20:42.649 or a female. 0:20:44.304 --> 0:20:50.153 Examples for these languages are, for example, Finnish and Turkish. 0:20:50.153 --> 0:21:00.370 There are more languages, but these are: Then we have no nutritional gender languages where 0:21:00.370 --> 0:21:05.932 there's some gender information in there, but it's. 0:21:05.905 --> 0:21:08.169 And this is an example. 0:21:08.169 --> 0:21:15.149 This is English, which is in that way a nice example because most people. 0:21:15.415 --> 0:21:20.164 So you have there some lexicogender and phenomenal gender. 0:21:20.164 --> 0:21:23.303 I mean mamadeta there she-hee and him. 0:21:23.643 --> 0:21:31.171 And very few words are marked like actor and actress, but in general most words are not 0:21:31.171 --> 0:21:39.468 marked, so it's teacher and lecturer and friend, so in all these words the gender is not marked, 0:21:39.468 --> 0:21:41.607 and so you cannot infer. 0:21:42.622 --> 0:21:48.216 So the initial Turkish sentence here would be translated to either he is a good friend 0:21:48.216 --> 0:21:49.373 or she is a good. 
0:21:51.571 --> 0:22:05.222 In this case you would have them gender information in there, but of course there's a good friend. 0:22:07.667 --> 0:22:21.077 And then finally there is the grammatical German languages where each noun has a gender. 0:22:21.077 --> 0:22:25.295 That's the case in Spanish. 0:22:26.186 --> 0:22:34.025 This is mostly formal, but at least if you're talking about a human that also agrees. 0:22:34.214 --> 0:22:38.209 Of course, it's like the sun. 0:22:38.209 --> 0:22:50.463 There is no clear thing why the sun should be female, and in other language it's different. 0:22:50.390 --> 0:22:56.100 The matching, and then you also have more agreements with this that makes things more 0:22:56.100 --> 0:22:56.963 complicated. 0:22:57.958 --> 0:23:08.571 Here he is a good friend and the good is also depending whether it's male or went up so it's 0:23:08.571 --> 0:23:17.131 changing also based on the gender so you have a lot of gender information. 0:23:17.777 --> 0:23:21.364 Get them, but do you always get them correctly? 0:23:21.364 --> 0:23:25.099 It might be that they're in English, for example. 0:23:28.748 --> 0:23:36.154 And since this is the case, and you need to like often express the gender even though you 0:23:36.154 --> 0:23:37.059 might not. 0:23:37.377 --> 0:23:53.030 Aware of it or it's not possible, there's some ways in German how to mark mutual forms. 0:23:54.194 --> 0:24:03.025 But then it's again from the machine learning side of view, of course quite challenging because 0:24:03.025 --> 0:24:05.417 you only want to use the. 0:24:05.625 --> 0:24:11.108 If it's known to the reader you want to use the correct, the not mutual form but either 0:24:11.108 --> 0:24:12.354 the male or female. 0:24:13.013 --> 0:24:21.771 So they are assessing what is known to the reader as a challenge which needs to in some 0:24:21.771 --> 0:24:23.562 way be addressed. 0:24:26.506 --> 0:24:30.887 Here why does that happen? 0:24:30.887 --> 0:24:42.084 Three reasons we have that in a bit so one is, of course, that your. 0:24:42.162 --> 0:24:49.003 Example: If you look at the Europe High Corpus, which is an important resource for doing machine 0:24:49.003 --> 0:24:49.920 translation. 0:24:50.010 --> 0:24:59.208 Then there's only thirty percent of the speakers are female, and so if you train a model on 0:24:59.208 --> 0:25:06.606 that data, if you're translating to French, there will be a male version. 0:25:06.746 --> 0:25:10.762 And so you'll just have a lot more like seventy percent of your mail for it. 0:25:10.971 --> 0:25:18.748 And that will be Yep will make the model therefore from this data sub. 0:25:18.898 --> 0:25:25.882 And of course this will be in the data for a very long time. 0:25:25.882 --> 0:25:33.668 So if there's more female speakers in the European Parliament, but. 0:25:33.933 --> 0:25:42.338 But we are training on historical data, so even if there is for a long time, it will not 0:25:42.338 --> 0:25:43.377 be in the. 0:25:46.346 --> 0:25:57.457 Then besides these preexisting data there is of course technical biases which will amplify 0:25:57.457 --> 0:25:58.800 this type. 0:25:59.039 --> 0:26:04.027 So one we already address, that's for example sampling or beam search. 0:26:04.027 --> 0:26:06.416 You get the most probable output. 0:26:06.646 --> 0:26:16.306 So if there's a bias in your model, it will amplify that not only in the case we had before, 0:26:16.306 --> 0:26:19.423 and produce the male version. 
0:26:20.040 --> 0:26:32.873 So if you have the same source sentence like am happy and in your training data it will 0:26:32.873 --> 0:26:38.123 be male and female if you're doing. 0:26:38.418 --> 0:26:44.510 So in that way by doing this type of algorithmic design you will have. 0:26:44.604 --> 0:26:59.970 Another use case is if you think about a multilingual machine translation, for example if you are 0:26:59.970 --> 0:27:04.360 now doing a pivot language. 0:27:04.524 --> 0:27:13.654 But if you're first trying to English this information might get lost and then you translate 0:27:13.654 --> 0:27:14.832 to Spanish. 0:27:15.075 --> 0:27:21.509 So while in general in this class there is not this type of bias there,. 0:27:22.922 --> 0:27:28.996 You might introduce it because you might have good reasons for doing a modular system because 0:27:28.996 --> 0:27:31.968 you don't have enough training data or so on. 0:27:31.968 --> 0:27:37.589 It's performing better in average, but of course by doing this choice you'll introduce 0:27:37.589 --> 0:27:40.044 an additional type of bias into your. 0:27:45.805 --> 0:27:52.212 And then there is what people refer to as emergent bias, and that is, if you use a system 0:27:52.212 --> 0:27:58.903 for a different use case as we see in, generally it is the case that is performing worse, but 0:27:58.903 --> 0:28:02.533 then of course you can have even more challenging. 0:28:02.942 --> 0:28:16.196 So the extreme case would be if you train a system only on male speakers, then of course 0:28:16.196 --> 0:28:22.451 it will perform worse on female speakers. 0:28:22.902 --> 0:28:36.287 So, of course, if you're doing this type of problem, if you use a system for a different 0:28:36.287 --> 0:28:42.152 situation where it was original, then. 0:28:44.004 --> 0:28:54.337 And with this we would then go for type of evaluation, but before we are looking at how 0:28:54.337 --> 0:28:56.333 we can evaluate. 0:29:00.740 --> 0:29:12.176 Before we want to look into how we can improve the system, think yeah, maybe at the moment 0:29:12.176 --> 0:29:13.559 most work. 0:29:13.954 --> 0:29:21.659 And the one thing is the system trying to look into stereotypes. 0:29:21.659 --> 0:29:26.164 So how does a system use stereotypes? 0:29:26.466 --> 0:29:29.443 So if you have the Hungarian sentence,. 0:29:29.729 --> 0:29:33.805 Which should be he is an engineer or she is an engineer. 0:29:35.375 --> 0:29:43.173 And you cannot guess that because we saw that he and she is not different in Hungary. 0:29:43.423 --> 0:29:57.085 Then you can have a test set where you have these type of ailanomal occupations. 0:29:56.977 --> 0:30:03.862 You have statistics from how is the distribution by gender so you can automatically generate 0:30:03.862 --> 0:30:04.898 the sentence. 0:30:04.985 --> 0:30:21.333 Then you could put in jobs which are mostly done by a man and then you can check how is 0:30:21.333 --> 0:30:22.448 your. 0:30:22.542 --> 0:30:31.315 That is one type of evaluating stereotypes that one of the most famous benchmarks called 0:30:31.315 --> 0:30:42.306 vino is exactly: The second type of evaluation is about gender preserving. 0:30:42.342 --> 0:30:51.201 So that is exactly what we have seen beforehand. 0:30:51.201 --> 0:31:00.240 If these information are not in the text itself,. 0:31:00.320 --> 0:31:01.875 Gender as a speaker. 0:31:02.062 --> 0:31:04.450 And how good does a system do that? 0:31:04.784 --> 0:31:09.675 And we'll see there's, for example, one benchmark on this. 
0:31:09.675 --> 0:31:16.062 For example: For Arabic there is one benchmark on this foot: Audio because if you're now think 0:31:16.062 --> 0:31:16.781 already of the. 0:31:17.157 --> 0:31:25.257 From when we're talking about speech translation, it might be interesting because in the speech 0:31:25.257 --> 0:31:32.176 signal you should have a better guess on whether it's a male or a female speaker. 0:31:32.432 --> 0:31:38.928 So but mean current systems, mostly you can always add, and they will just first transcribe. 0:31:42.562 --> 0:31:45.370 Yes, so how do these benchmarks? 0:31:45.305 --> 0:31:51.356 Look like that, the first one is here. 0:31:51.356 --> 0:32:02.837 There's an occupation test where it looks like a simple test set because. 0:32:03.023 --> 0:32:10.111 So I've known either hurry him or pronounce the name for a long time. 0:32:10.111 --> 0:32:13.554 My friend works as an occupation. 0:32:13.833 --> 0:32:16.771 So that is like all sentences in that look like that. 0:32:17.257 --> 0:32:28.576 So in this case you haven't had the biggest work in here, which is friends. 0:32:28.576 --> 0:32:33.342 So your only checking later is. 0:32:34.934 --> 0:32:46.981 This can be inferred from whether it's her or her or her, or if it's a proper name, so 0:32:46.981 --> 0:32:55.013 can you infer it from the name, and then you can compare. 0:32:55.115 --> 0:33:01.744 So is this because the job description is nearer to friend. 0:33:01.744 --> 0:33:06.937 Does the system get disturbed by this type of. 0:33:08.828 --> 0:33:14.753 And there you can then automatically assess yeah this type. 0:33:14.774 --> 0:33:18.242 Of course, that's what said at the beginning. 0:33:18.242 --> 0:33:24.876 You shouldn't only rely on that because if you only rely on it you can easily trick the 0:33:24.876 --> 0:33:25.479 system. 0:33:25.479 --> 0:33:31.887 So one type of sentence is translated, but of course it can give you very important. 0:33:33.813 --> 0:33:35.309 Any questions yeah. 0:33:36.736 --> 0:33:44.553 Much like the evaluation of stereotype, we want the system to agree with stereotypes because 0:33:44.553 --> 0:33:46.570 it increases precision. 0:33:46.786 --> 0:33:47.979 No, no, no. 0:33:47.979 --> 0:33:53.149 In this case, if we say oh yeah, he is an engineer. 0:33:53.149 --> 0:34:01.600 From the example, it's probably the most likely translation, probably in more cases. 0:34:02.702 --> 0:34:08.611 Now there is two things, so yeah yeah, so there is two ways of evaluating. 0:34:08.611 --> 0:34:15.623 The one thing is in this case he's using that he's an engineer, but there is conflicting 0:34:15.623 --> 0:34:19.878 information that in this case the engineer is female. 0:34:20.380 --> 0:34:21.890 So anything was. 0:34:22.342 --> 0:34:29.281 Information yes, so that is the one in the other case. 0:34:29.281 --> 0:34:38.744 Typically it's not evaluated in that, but in that time you really want it. 0:34:38.898 --> 0:34:52.732 That's why most of those cases you have evaluated in scenarios where you have context information. 0:34:53.453 --> 0:34:58.878 How to deal with the other thing is even more challenging to one case where it is the case 0:34:58.878 --> 0:35:04.243 is what I said before is when it's about the speaker so that the speech translation test. 0:35:04.584 --> 0:35:17.305 And there they try to look in a way that can you use, so use the audio also as input. 0:35:18.678 --> 0:35:20.432 Yeah. 
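To make the occupation test concrete, here is a minimal sketch of how such test sentences could be generated automatically. The template wording, the name list and the occupation list are illustrative assumptions for this sketch, not the actual benchmark data.

```python
# Illustrative generation of occupation-test sentences of the kind described
# above: the gender is cued either by a pronoun or by a first name, and the
# occupation appears in a second sentence about "my friend".
from itertools import product

PRONOUN_CUES = {"female": "her", "male": "him"}
NAME_CUES = {"female": ["Mary"], "male": ["John"]}          # hypothetical names
OCCUPATIONS = ["doctor", "nurse", "engineer", "mechanic"]   # hypothetical list

def article(word):
    return "an" if word[0] in "aeiou" else "a"

def build_sentences():
    sentences = []
    for gender, occupation in product(PRONOUN_CUES, OCCUPATIONS):
        second = f"My friend works as {article(occupation)} {occupation}."
        # gender cue given via a pronoun ("her"/"him") ...
        sentences.append({"gender": gender, "cue": "pronoun",
                          "text": f"I have known {PRONOUN_CUES[gender]} for a long time. {second}"})
        # ... or via a (typically gendered) first name
        for name in NAME_CUES[gender]:
            sentences.append({"gender": gender, "cue": "name",
                              "text": f"I have known {name} for a long time. {second}"})
    return sentences

if __name__ == "__main__":
    for s in build_sentences()[:4]:
        print(s["gender"], s["cue"], "->", s["text"])
```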
0:35:20.640 --> 0:35:30.660 So if we have a reference where she is an engineer okay, are there efforts to adjust 0:35:30.660 --> 0:35:37.497 the metric so that our transmissions go into the correct? 0:35:37.497 --> 0:35:38.676 We don't. 0:35:38.618 --> 0:35:40.389 Only done for mean this is evaluation. 0:35:40.389 --> 0:35:42.387 You are not pushing the model for anything. 0:35:43.023 --> 0:35:53.458 But if you want to do it in training, that you're not doing it this way. 0:35:53.458 --> 0:35:58.461 I'm not aware of any direct model. 0:35:58.638 --> 0:36:04.146 Because you have to find out, is it known in this scenario or not? 0:36:05.725 --> 0:36:12.622 So at least I'm not aware of there's like the directive doing training try to assess 0:36:12.622 --> 0:36:13.514 more than. 0:36:13.813 --> 0:36:18.518 Mean there is data augmentation in the way that is done. 0:36:18.518 --> 0:36:23.966 Think we'll have that later, so what you can do is generate more. 0:36:24.144 --> 0:36:35.355 You can do that automatically or there's ways of biasing so that you can try to make your 0:36:35.355 --> 0:36:36.600 training. 0:36:36.957 --> 0:36:46.228 That's typically not done with focusing on scenarios where you check before or do have 0:36:46.228 --> 0:36:47.614 information. 0:36:49.990 --> 0:36:58.692 Mean, but for everyone it's not clear and agree with you in this scenario, the normal 0:36:58.692 --> 0:37:01.222 evaluation system where. 0:37:01.341 --> 0:37:07.006 Maybe you could say it shouldn't do always the same but have a distribution like a training 0:37:07.006 --> 0:37:12.733 data or something like that because otherwise we're amplifying but that current system can't 0:37:12.733 --> 0:37:15.135 do current systems can't predict both. 0:37:15.135 --> 0:37:17.413 That's why we see all the beginning. 0:37:17.413 --> 0:37:20.862 They have this extra interface where they then propose. 0:37:24.784 --> 0:37:33.896 Another thing is the vino empty system and it started from a challenge set for co-reference 0:37:33.896 --> 0:37:35.084 resolution. 0:37:35.084 --> 0:37:43.502 Co-reference resolution means we have pear on him and we need to find out what it's. 0:37:43.823 --> 0:37:53.620 So you have the doctor off the nurse to help her in the procedure, and now her does not 0:37:53.620 --> 0:37:55.847 refer to the nurse. 0:37:56.556 --> 0:38:10.689 And there you of course have the same type of stewardesses and the same type of buyers 0:38:10.689 --> 0:38:15.237 as the machine translation. 0:38:16.316 --> 0:38:25.165 And no think that normally yeah mean maybe that's also biased. 0:38:27.687 --> 0:38:37.514 No, but if you ask somebody, I guess if you ask somebody, then I mean syntectically it's 0:38:37.514 --> 0:38:38.728 ambiguous. 0:38:38.918 --> 0:38:50.248 If you ask somebody to help, then the horror has to refer to that. 0:38:50.248 --> 0:38:54.983 So it should also help the. 0:38:56.396 --> 0:38:57.469 Of the time. 0:38:57.469 --> 0:39:03.906 The doctor is female and says please have me in the procedure, but the other. 0:39:04.904 --> 0:39:09.789 Oh, you mean that it's helping the third person. 0:39:12.192 --> 0:39:16.140 Yeah, agree that it could also be yes. 0:39:16.140 --> 0:39:19.077 Don't know how easy that is. 0:39:19.077 --> 0:39:21.102 Only know the test. 0:39:21.321 --> 0:39:31.820 Then guess yeah, then you need a situation context where you know the situation, the other 0:39:31.820 --> 0:39:34.589 person having problems. 
0:39:36.936 --> 0:39:42.251 Yeah no yeah that is like here when there is additional ambiguity in there. 0:39:45.465 --> 0:39:48.395 See that pure text models is not always okay. 0:39:48.395 --> 0:39:51.134 How full mean there is a lot of work also. 0:39:52.472 --> 0:40:00.119 Will not cover that in the lecture, but there are things like multimodal machine translation 0:40:00.119 --> 0:40:07.109 where you try to add pictures or something like that to have more context, and then. 0:40:10.370 --> 0:40:23.498 Yeah, it starts with this, so in order to evaluate that what it does is that you translate 0:40:23.498 --> 0:40:25.229 the system. 0:40:25.305 --> 0:40:32.310 It's doing stereotyping so the doctor is male and the nurse is female. 0:40:32.492 --> 0:40:42.362 And then you're using word alignment, and then you check whether this gender maps with 0:40:42.362 --> 0:40:52.345 the annotated gender of there, and that is how you evaluate in this type of vino empty. 0:40:52.832 --> 0:40:59.475 Mean, as you see, you're only focusing on the situation where you can or where the gender 0:40:59.475 --> 0:41:00.214 is known. 0:41:00.214 --> 0:41:06.930 Why for this one you don't do any evaluation, but because nurses can in that case be those 0:41:06.930 --> 0:41:08.702 and you cannot, as has. 0:41:08.728 --> 0:41:19.112 The benchmarks are at the moment designed in a way that you only evaluate things that 0:41:19.112 --> 0:41:20.440 are known. 0:41:23.243 --> 0:41:25.081 Then yeah, you can have a look. 0:41:25.081 --> 0:41:28.931 For example, here what people are looking is you can do the first. 0:41:28.931 --> 0:41:32.149 Oh well, the currency, how often does it do it correct? 0:41:32.552 --> 0:41:41.551 And there you see these numbers are a bit older. 0:41:41.551 --> 0:41:51.835 There's more work on that, but this is the first color. 0:41:51.731 --> 0:42:01.311 Because they do it like in this test, they do it twice, one with him and one with her. 0:42:01.311 --> 0:42:04.834 So the chance is fifty percent. 0:42:05.065 --> 0:42:12.097 Except somehow here, the one system seems to be quite good there that everything. 0:42:13.433 --> 0:42:30.863 What you can also do is look at the difference, where you need to predict female and the difference. 0:42:30.850 --> 0:42:40.338 It's more often correct on the male forms than on the female forms, and you see that 0:42:40.338 --> 0:42:43.575 it's except for this system. 0:42:43.603 --> 0:42:53.507 So would assume that they maybe in this one language did some type of method in there. 0:42:55.515 --> 0:42:57.586 If you are more often mean there is like. 0:42:58.178 --> 0:43:01.764 It's not a lot lower, there's one. 0:43:01.764 --> 0:43:08.938 I don't know why, but if you're always to the same then it should be. 0:43:08.938 --> 0:43:14.677 You seem to be counter intuitive, so maybe it's better. 0:43:15.175 --> 0:43:18.629 Don't know exactly how yes, but it's, it's true. 0:43:19.019 --> 0:43:20.849 Mean, there's very few cases. 0:43:20.849 --> 0:43:22.740 I also don't know for Russian. 0:43:22.740 --> 0:43:27.559 I mean, there is, I think, mainly for Russian where you have very low numbers. 0:43:27.559 --> 0:43:30.183 I mean, I would say like forty five or so. 0:43:30.183 --> 0:43:32.989 There can be more about renting and sampling. 0:43:32.989 --> 0:43:37.321 I don't know if they have even more gender or if they have a new tool. 0:43:37.321 --> 0:43:38.419 I don't think so. 
0:43:40.040 --> 0:43:46.901 Then you have typically even a stronger bias here where you not do the differentiation between 0:43:46.901 --> 0:43:53.185 how often is it correct for me and the female, but you are distinguishing between the. 0:43:53.553 --> 0:44:00.503 So you're here, for you can check for each occupation, which is the most important. 0:44:00.440 --> 0:44:06.182 A comment one based on statistics, and then you take that on the one side and the anti 0:44:06.182 --> 0:44:12.188 stereotypically on the other side, and you see that not in all cases but in a lot of cases 0:44:12.188 --> 0:44:16.081 that null probabilities are even higher than on the other. 0:44:21.061 --> 0:44:24.595 Ah, I'm telling you there's something. 0:44:28.668 --> 0:44:32.850 But it has to be for a doctor. 0:44:32.850 --> 0:44:39.594 For example, for a doctor there three don't know. 0:44:40.780 --> 0:44:44.275 Yeah, but guess here it's mainly imminent job description. 0:44:44.275 --> 0:44:45.104 So yeah, but. 0:44:50.050 --> 0:45:01.145 And then there is the Arabic capital gender corpus where it is about more assessing how 0:45:01.145 --> 0:45:03.289 strong a singer. 0:45:03.483 --> 0:45:09.445 How that is done is the open subtitles. 0:45:09.445 --> 0:45:18.687 Corpus is like a corpus of subtitles generated by volunteers. 0:45:18.558 --> 0:45:23.426 For the Words Like I Mean Myself. 0:45:23.303 --> 0:45:30.670 And mine, and then they annotated the Arabic sentences, whether here I refer to as a female 0:45:30.670 --> 0:45:38.198 and masculine, or whether it's ambiguous, and then from the male and female one they generate 0:45:38.198 --> 0:45:40.040 types of translations. 0:45:43.703 --> 0:45:51.921 And then a bit more different test sets as the last one that is referred to as the machine. 0:45:52.172 --> 0:45:57.926 Corpus, which is based on these lectures. 0:45:57.926 --> 0:46:05.462 In general, this lecture is very important because it. 0:46:05.765 --> 0:46:22.293 And here is also interesting because you also have the obvious signal and it's done in the 0:46:22.293 --> 0:46:23.564 worst. 0:46:23.763 --> 0:46:27.740 In the first case is where it can only be determined based on the speaker. 0:46:27.968 --> 0:46:30.293 So something like am a good speaker. 0:46:30.430 --> 0:46:32.377 You cannot do that correctly. 0:46:32.652 --> 0:46:36.970 However, if you would have the audio signal you should have a lot better guests. 0:46:37.257 --> 0:46:47.812 So it wasn't evaluated, especially machine translation and speech translation system, 0:46:47.812 --> 0:46:53.335 which take this into account or, of course,. 0:46:57.697 --> 0:47:04.265 The second thing is where you can do it based on the context. 0:47:04.265 --> 0:47:08.714 In this case we are not using artificial. 0:47:11.011 --> 0:47:15.550 Cope from the from the real data, so it's not like artificial creative data, but. 0:47:15.815 --> 0:47:20.939 Of course, in a lot more work you have to somehow find these in the corpus and use them 0:47:20.939 --> 0:47:21.579 as a test. 0:47:21.601 --> 0:47:27.594 Is something she got together with two of her dearest friends, this older woman, and 0:47:27.594 --> 0:47:34.152 then, of course, here friends can we get from the context, but it might be that some systems 0:47:34.152 --> 0:47:36.126 ignore that that should be. 0:47:36.256 --> 0:47:43.434 So you have two test sets in there, two types of benchmarks, and you want to determine which 0:47:43.434 --> 0:47:43.820 one. 
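A minimal sketch of how the scores discussed above could be computed once each test sentence has been translated, word-aligned and annotated with the gender realised in the output. The record fields here are hypothetical; the two gap scores correspond to what is commonly called ΔG and ΔS in the WinoMT-style evaluation literature.

```python
# WinoMT-style scoring, assuming translation, word alignment and a target-side
# gender lookup have already produced one record per test sentence.
from dataclasses import dataclass

@dataclass
class Example:
    gold_gender: str        # "male" or "female" (annotated in the source)
    pred_gender: str        # gender realised in the system translation
    stereotypical: bool     # does the gold gender match the occupation stereotype?

def accuracy(examples):
    return sum(e.pred_gender == e.gold_gender for e in examples) / max(len(examples), 1)

def report(examples):
    male = [e for e in examples if e.gold_gender == "male"]
    female = [e for e in examples if e.gold_gender == "female"]
    pro = [e for e in examples if e.stereotypical]
    anti = [e for e in examples if not e.stereotypical]
    return {
        "accuracy": accuracy(examples),
        # positive delta_g: the system is more often correct for male forms
        "delta_g": accuracy(male) - accuracy(female),
        # positive delta_s: the system leans on the occupation stereotype
        "delta_s": accuracy(pro) - accuracy(anti),
    }
```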
0:47:47.787 --> 0:47:55.801 Yes, this is how we can evaluate it, so the next question is how can we improve our systems 0:47:55.801 --> 0:48:03.728 because that's normally how we do evaluation and why we do evaluation so before we go into 0:48:03.728 --> 0:48:04.251 that? 0:48:08.508 --> 0:48:22.685 One idea is to do what is referred to as modeling, so the idea is somehow change the model in 0:48:22.685 --> 0:48:24.495 a way that. 0:48:24.965 --> 0:48:38.271 And yes, one idea is, of course, if we are giving him more information, the system doesn't 0:48:38.271 --> 0:48:44.850 need to do a guess without this information. 0:48:44.724 --> 0:48:47.253 In order to just ambiguate the bias,. 0:48:47.707 --> 0:48:59.746 The first thing is you can do that on the sentence level, for example, especially if 0:48:59.746 --> 0:49:03.004 you have the speakers. 0:49:03.063 --> 0:49:12.518 You can annotate the sentence with whether a speaker is made or a female, and then you 0:49:12.518 --> 0:49:25.998 can: Here we're seeing one thing which is very successful in neuromachine translation and 0:49:25.998 --> 0:49:30.759 other kinds of neural networks. 0:49:31.711 --> 0:49:39.546 However, in neuromachine translation, since we have no longer the strong correlation between 0:49:39.546 --> 0:49:47.043 input and output, the nice thing is you can normally put everything into your input, and 0:49:47.043 --> 0:49:50.834 if you have enough data, it's well balanced. 0:49:51.151 --> 0:50:00.608 So how you can do it here is you can add the token here saying female or male if the speaker 0:50:00.608 --> 0:50:01.523 is male. 0:50:01.881 --> 0:50:07.195 So, of course, this is no longer for human correct translation. 0:50:07.195 --> 0:50:09.852 It's like female Madam because. 0:50:10.090 --> 0:50:22.951 If you are doing the same thing then the translation would not be to translate female but can use 0:50:22.951 --> 0:50:25.576 it to disintegrate. 0:50:25.865 --> 0:50:43.573 And so this type of tagging is a very commonly used method in order to add more information. 0:50:47.107 --> 0:50:54.047 So this is first of all a very good thing, a very easy one. 0:50:54.047 --> 0:50:57.633 You don't have to change your. 0:50:58.018 --> 0:51:04.581 For example, has also been done if you think about formality in German. 0:51:04.581 --> 0:51:11.393 Whether you have to produce or, you can: We'll see it on Thursday. 0:51:11.393 --> 0:51:19.628 It's a very common approach for domains, so you put in the domain beforehand. 0:51:19.628 --> 0:51:24.589 This is from a Twitter or something like that. 0:51:24.904 --> 0:51:36.239 Of course, it only learns it if it has seen it and it dees them out, but in this case you 0:51:36.239 --> 0:51:38.884 don't need an equal. 0:51:39.159 --> 0:51:42.593 But however, it's still like challenging to get this availability. 0:51:42.983 --> 0:51:55.300 If you would do that on the first of all, of course, it only works if you really have 0:51:55.300 --> 0:52:02.605 data from speaking because otherwise it's unclear. 0:52:02.642 --> 0:52:09.816 You would only have the text and you would not easily see whether it is the mayor or the 0:52:09.816 --> 0:52:14.895 female speaker because this information has been removed from. 0:52:16.456 --> 0:52:18.745 Does anybody of you have an idea of how it fits? 0:52:20.000 --> 0:52:25.480 Manage that and still get the data of whether it's made or not speaking. 0:52:32.152 --> 0:52:34.270 Can do a small trick. 0:52:34.270 --> 0:52:37.834 We can just look on the target side. 
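To make the tagging idea concrete before we look at how to get these labels for the training data, here is a minimal sketch; the token strings are arbitrary placeholders and just have to be used consistently at training and test time.

```python
# Sentence-level gender tagging: a pseudo-token carrying the speaker's gender
# is prepended to the source sentence so the model can condition on it.
SPEAKER_TAGS = {"female": "<spk:female>", "male": "<spk:male>"}

def tag_source(sentence, speaker_gender=None):
    tag = SPEAKER_TAGS.get(speaker_gender, "<spk:unk>")
    return f"{tag} {sentence}"

print(tag_source("I am happy.", "female"))
# -> "<spk:female> I am happy."  The tag is not meant to be translated; it only
#    lets the model pick e.g. "Je suis heureuse." over "Je suis heureux."
```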
0:52:37.937 --> 0:52:43.573 Mean this is, of course, only important if in the target side this is the case. 0:52:44.004 --> 0:52:50.882 So for your training data you can irritate it based on your target site in German you 0:52:50.882 --> 0:52:51.362 know. 0:52:51.362 --> 0:52:58.400 In German you don't know but in Spanish for example you know because different and then 0:52:58.400 --> 0:53:00.400 you can use grammatical. 0:53:00.700 --> 0:53:10.964 Of course, the test day would still need to do that more interface decision. 0:53:13.954 --> 0:53:18.829 And: You can, of course, do it even more advanced. 0:53:18.898 --> 0:53:30.659 You can even try to add these information to each word, so you're not doing it for the 0:53:30.659 --> 0:53:32.687 full sentence. 0:53:32.572 --> 0:53:42.129 If it's unknown, if it's female or if it's male, you know word alignment so you can't 0:53:42.129 --> 0:53:42.573 do. 0:53:42.502 --> 0:53:55.919 Here then you can do a word alignment, which is of course not always perfect, but roughly 0:53:55.919 --> 0:53:59.348 then you can annotate. 0:54:01.401 --> 0:54:14.165 Now you have these type of inputs where you have one information per word, but on the one 0:54:14.165 --> 0:54:16.718 end you have the. 0:54:17.517 --> 0:54:26.019 This has been used before in other scenarios, so you might not put in the gender, but in 0:54:26.019 --> 0:54:29.745 general this can be other information. 0:54:30.090 --> 0:54:39.981 And people refer to that or have used that as a factored translation model, so what you 0:54:39.981 --> 0:54:42.454 may do is you factor. 0:54:42.742 --> 0:54:45.612 You have the word itself. 0:54:45.612 --> 0:54:48.591 You might have the gender. 0:54:48.591 --> 0:54:55.986 You could have more information like don't know the paddle speech. 0:54:56.316 --> 0:54:58.564 And then you have an embedding for each of them. 0:54:59.199 --> 0:55:03.599 And you congratulate them, and then you have years of congratulated a bedding. 0:55:03.563 --> 0:55:09.947 Which says okay, this is a female plumber or a male plumber or so on. 0:55:09.947 --> 0:55:18.064 This has additional information and then you can train this factory model where you have 0:55:18.064 --> 0:55:22.533 the ability to give the model extra information. 0:55:23.263 --> 0:55:35.702 And of course now if you are training this way directly you always need to have this information. 0:55:36.576 --> 0:55:45.396 So that might not be the best way if you want to use a translation system and sometimes don't 0:55:45.396 --> 0:55:45.959 have. 0:55:46.866 --> 0:55:57.987 So any idea of how you can train it or what machine learning technique you can use to deal 0:55:57.987 --> 0:55:58.720 with. 0:56:03.263 --> 0:56:07.475 Mainly despite it already, many of your things. 0:56:14.154 --> 0:56:21.521 Drop out so you sometimes put information in there and then you can use dropouts to inputs. 0:56:21.861 --> 0:56:27.599 Is sometimes put in this information in there, sometimes not, and the system is then able 0:56:27.599 --> 0:56:28.874 to deal with those. 0:56:28.874 --> 0:56:34.803 If it doesn't have the information, it's doing some of the best it can do, but if it has the 0:56:34.803 --> 0:56:39.202 information, it can use the information and maybe do a more rounded. 0:56:46.766 --> 0:56:52.831 So then there is, of course, more ways to try to do a moderately biased one. 
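Here is a minimal sketch of such a factored input layer with factor dropout, written with PyTorch; the embedding dimensions, the factor inventory and the dropout rate are illustrative assumptions.

```python
# Factored input layer: each token has a word id and a gender-factor id; their
# embeddings are concatenated. During training the factor is sometimes replaced
# by "unknown" (dropout on the factor), so the model also learns to cope when
# no gender annotation is available at test time.
import torch
import torch.nn as nn

FACTORS = {"unk": 0, "female": 1, "male": 2, "neutral": 3}   # illustrative

class FactoredEmbedding(nn.Module):
    def __init__(self, vocab_size, word_dim=480, factor_dim=32, factor_drop=0.3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.factor_emb = nn.Embedding(len(FACTORS), factor_dim)
        self.factor_drop = factor_drop   # probability of hiding the factor

    def forward(self, word_ids, factor_ids):
        if self.training and self.factor_drop > 0:
            mask = torch.rand_like(factor_ids, dtype=torch.float) < self.factor_drop
            factor_ids = factor_ids.masked_fill(mask, FACTORS["unk"])
        # (batch, seq, word_dim + factor_dim)
        return torch.cat([self.word_emb(word_ids), self.factor_emb(factor_ids)], dim=-1)
```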
0:56:52.993 --> 0:57:01.690 We will only want to mention here because you'll have a full lecture on that next week 0:57:01.690 --> 0:57:08.188 and that is referred to where context based machine translation. 0:57:08.728 --> 0:57:10.397 Good, and in this other ones, but. 0:57:10.750 --> 0:57:16.830 If you translate several sentences well, of course, there are more situations where you 0:57:16.830 --> 0:57:17.866 can dissemble. 0:57:18.118 --> 0:57:23.996 Because it might be that the information is not in the current sentence, but it's in the 0:57:23.996 --> 0:57:25.911 previous sentence or before. 0:57:26.967 --> 0:57:33.124 If you have the mean with the speaker maybe not, but if it's referring to, you can core 0:57:33.124 --> 0:57:33.963 references. 0:57:34.394 --> 0:57:40.185 They are often referring to things in the previous sentence so you can use them in order 0:57:40.185 --> 0:57:44.068 to: And that can be done basically and very easy. 0:57:44.068 --> 0:57:47.437 You'll see more advanced options, but the main. 0:57:48.108 --> 0:57:58.516 Mean, no machine translation is a sequence to sequence model, which can use any input 0:57:58.516 --> 0:58:02.993 sequence to output sequence mapping. 0:58:02.993 --> 0:58:04.325 So now at. 0:58:04.484 --> 0:58:11.281 So then you can do, for example, five to five translations, or also five to one, or so there's. 0:58:11.811 --> 0:58:19.211 This is not a method like only dedicated to buying, of course, but the hope is. 0:58:19.139 --> 0:58:25.534 If you're using this because I mean bias often, we have seen that it rises in situations where 0:58:25.534 --> 0:58:27.756 we're not having enough context. 0:58:27.756 --> 0:58:32.940 So the idea is if we generally increase our context, it will also help this. 0:58:32.932 --> 0:58:42.378 Of course, it will help other situations where you need context to disintegrate. 0:58:43.603 --> 0:58:45.768 Get There If You're Saying I'm Going to the Bank. 0:58:46.286 --> 0:58:54.761 It's not directly from this sentence clear whether it's the finance institute or the bank 0:58:54.761 --> 0:58:59.093 for sitting, but maybe if you say afterward,. 0:59:02.322 --> 0:59:11.258 And then there is in generally a very large amount of work on debiasing the word embelling. 0:59:11.258 --> 0:59:20.097 So the one I hear like, I mean, I think that partly comes from the fact that like a first. 0:59:21.041 --> 0:59:26.925 Or that first research was done often on inspecting the word embeddings and seeing whether they 0:59:26.925 --> 0:59:32.503 are biased or not, and people found out how there is some bias in there, and then the idea 0:59:32.503 --> 0:59:38.326 is oh, if you remove them from the word embedded in already, then maybe your system later will 0:59:38.326 --> 0:59:39.981 not have that strong of a. 0:59:40.520 --> 0:59:44.825 So how can that work? 0:59:44.825 --> 0:59:56.369 Or like maybe first, how do words encounter bias in there? 0:59:56.369 --> 0:59:57.152 So. 0:59:57.137 --> 1:00:05.555 So you can look at the word embedding, and then you can compare the distance of the word 1:00:05.555 --> 1:00:11.053 compared: And there's like interesting findings. 1:00:11.053 --> 1:00:18.284 For example, you have the difference in occupation and how similar. 1:00:18.678 --> 1:00:33.068 And of course it's not a perfect correlation, but you see some type of correlation: jobs 1:00:33.068 --> 1:00:37.919 which have a high occupation. 1:00:37.797 --> 1:00:41.387 They also are more similar to the word what we're going to be talking about. 
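A minimal sketch of this kind of analysis: estimate a gender direction from a few pronoun pairs and project occupation words onto it. Here `emb` is assumed to be any pretrained embedding table (word to numpy vector), and the chosen pairs are illustrative.

```python
# Project occupation words onto a gender direction estimated from word pairs;
# the resulting scores can then be compared with occupation statistics.
import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("him", "her"), ("man", "woman"))):
    diffs = [emb[m] - emb[f] for m, f in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def gender_score(emb, word, direction):
    v = emb[word]
    # positive -> closer to the "male" end, negative -> closer to the "female" end
    return float(np.dot(v / np.linalg.norm(v), direction))

# e.g. sorted((gender_score(emb, w, d), w) for w in occupations) can then be
# correlated with labour statistics for those occupations.
```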
1:00:43.023 --> 1:00:50.682 Maybe "secretary" is also a bit difficult, because yeah, maybe it's more often 1:00:50.610 --> 1:00:52.438 done in general by women. 1:00:52.438 --> 1:00:58.237 However, there is also a secretary like the Secretary of State or so, like a German minister, of which 1:00:58.237 --> 1:01:03.406 there are of course not that many, so in the occupation statistics they are not counted that often. 1:01:03.543 --> 1:01:11.576 But in text data they of course occur quite often, so there are different meanings mixed in there. 1:01:14.154 --> 1:01:23.307 So how can you now try to remove this type of bias? 1:01:23.307 --> 1:01:32.988 One way is the idea of hard debiasing of embeddings. 1:01:33.113 --> 1:01:39.354 So if you remember, for word embeddings I think we had this image that you can take the difference 1:01:39.354 --> 1:01:44.931 between man and woman, add this difference to king, and then end up close to queen. 1:01:45.865 --> 1:01:57.886 So here the idea is we want to remove this gender information from words which should 1:01:57.886 --> 1:02:00.132 not have gender. 1:02:00.120 --> 1:02:01.386 Take the word engineer: 1:02:01.386 --> 1:02:06.853 there is no information about the gender in that, so you should remove this type of information. 1:02:07.347 --> 1:02:16.772 Of course, you first need to find out where this information is, and you can. 1:02:17.037 --> 1:02:23.603 However, if you estimate this difference, like the gender subspace, from only one example, it's 1:02:23.603 --> 1:02:24.659 not the best. 1:02:24.924 --> 1:02:31.446 So you can do the same thing for pairs like brother and sister or mum and dad, and then you 1:02:31.446 --> 1:02:38.398 can somehow take the average of these differences, saying this is a vector which maps a male form 1:02:38.398 --> 1:02:39.831 to the female form. 1:02:40.660 --> 1:02:50.455 And then you can try to neutralize the gender information along this dimension. 1:02:50.490 --> 1:02:57.951 You can find this subspace; in two dimensions 1:02:57.951 --> 1:03:08.882 it would be a line, but here it is a direction in a high-dimensional space, and then you get a 1:03:08.728 --> 1:03:13.104 representation where you remove this type of gender information. 1:03:15.595 --> 1:03:18.178 This is, of course, a quite strong intervention, and one of the questions is: 1:03:18.178 --> 1:03:19.090 how well does it work? 1:03:19.090 --> 1:03:20.711 We will come back to that. 1:03:20.880 --> 1:03:28.256 But the idea is: after learning the word embeddings and before we are using them for 1:03:28.256 --> 1:03:29.940 machine translation, 1:03:29.940 --> 1:03:37.315 we are trying to remove the gender information from the job words and then have a representation 1:03:37.315 --> 1:03:38.678 which is hopefully less biased. 1:03:40.240 --> 1:03:45.047 A similar idea is the one of gender-neutral GloVe. 1:03:45.047 --> 1:03:50.248 GloVe is another technique to learn word embeddings. 1:03:50.750 --> 1:03:52.870 I think we discussed one briefly: 1:03:52.870 --> 1:03:56.182 it was word2vec, which was one of the first ones. 1:03:56.456 --> 1:04:04.383 But there are of course other methods for how you can train word embeddings, and GloVe is 1:04:04.383 --> 1:04:04.849 one of them. 1:04:04.849 --> 1:04:07.460 The idea is that during training 1:04:07.747 --> 1:04:19.007 the gender information is somehow kept a bit separated, so that part of the vector is gender 1:04:19.007 --> 1:04:20.146 neutral. 1:04:20.300 --> 1:04:29.247 What you need therefore is three sets of words: you have male words, female words, and neutral words. 1:04:29.769 --> 1:04:39.071 And then you're trying to learn some type of vector where some dimensions do not carry gender information.
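Before looking at the GloVe variant in more detail, here is a minimal sketch of the hard-debiasing "neutralize" step just described, reusing the gender direction from the previous sketch; the full method additionally equalizes definitional word pairs, which is omitted here.

```python
# "Neutralize" step of hard debiasing: for words that should be gender-neutral
# (e.g. occupation terms), subtract their component along the gender direction.
import numpy as np

def neutralize(emb, neutral_words, direction):
    debiased = dict(emb)
    for w in neutral_words:
        v = emb[w]
        v = v - np.dot(v, direction) * direction   # remove the gender component
        debiased[w] = v / np.linalg.norm(v)
    return debiased

# Afterwards np.dot(debiased["engineer"], direction) is (close to) zero, so a
# downstream translation model no longer sees a gender signal for these words.
```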
1:04:39.179 --> 1:04:51.997 So the idea is can learn a representation where at least know that this part is gender 1:04:51.997 --> 1:04:56.123 neutral and the other part. 1:05:00.760 --> 1:05:03.793 How can we do that? 1:05:03.793 --> 1:05:12.435 How can we change the system to learn anything specific? 1:05:12.435 --> 1:05:20.472 Nearly in all cases this works by the loss function. 1:05:20.520 --> 1:05:26.206 And that is more a general approach in machine translation. 1:05:26.206 --> 1:05:30.565 The general loss function is we are learning. 1:05:31.111 --> 1:05:33.842 Here is the same idea. 1:05:33.842 --> 1:05:44.412 You have the general loss function in order to learn good embeddings and then you try to 1:05:44.412 --> 1:05:48.687 introduce additional loss function. 1:05:48.969 --> 1:05:58.213 Yes, I think yes, yes, that's the solution, and how you make sure that if I have training 1:05:58.213 --> 1:06:07.149 for all nurses of email, how do you make sure that the algorithm puts it into neutral? 1:06:07.747 --> 1:06:12.448 And you need, so this is like for only the first learning of word embeddings. 1:06:12.448 --> 1:06:18.053 Then the idea is if you have word embeddings where the gender is separate and then you train 1:06:18.053 --> 1:06:23.718 on top of that machine translation where you don't change the embeddings, it should hopefully 1:06:23.718 --> 1:06:25.225 be less and less biased. 1:06:25.865 --> 1:06:33.465 And in order to train that yes you need additional information so these information need to be 1:06:33.465 --> 1:06:40.904 hence defined and they can't be general so you need to have a list of these are male persons 1:06:40.904 --> 1:06:44.744 or males these are nouns for females and these. 1:06:49.429 --> 1:06:52.575 So the first step, of course, we still want to have good word inventings. 1:06:54.314 --> 1:07:04.100 So you have the normal objective function of the word embedding. 1:07:04.100 --> 1:07:09.519 It's something like the similarity. 1:07:09.849 --> 1:07:19.751 How it's exactly derived is not that important because we're not interested in love itself, 1:07:19.751 --> 1:07:23.195 but you have any loss function. 1:07:23.195 --> 1:07:26.854 Of course, you have to keep that. 1:07:27.167 --> 1:07:37.481 And then there's three more lost functions that you can add: So the one is you take the 1:07:37.481 --> 1:07:51.341 average value of all the male words and the average word embedding of all the female words. 1:07:51.731 --> 1:08:00.066 So the good thing about this is we don't always need to have for one word the male and the 1:08:00.066 --> 1:08:05.837 female worship, so it's only like we have a set of male words. 1:08:06.946 --> 1:08:21.719 So this is just saying yeah, we want these two should be somehow similar to each other. 1:08:21.719 --> 1:08:25.413 It shouldn't be that. 1:08:30.330 --> 1:08:40.081 Should be the other one, or think this should be it. 1:08:40.081 --> 1:08:45.969 This is agenda, the average of. 1:08:45.945 --> 1:09:01.206 The average should be the same, but if you're looking at the female should be at the other. 1:09:01.681 --> 1:09:06.959 This is like on these dimensions, the male should be on the one and the female on the 1:09:06.959 --> 1:09:07.388 other. 1:09:07.627 --> 1:09:16.123 The same yeah, this gender information should be there, so you're pushing all the males to 1:09:16.123 --> 1:09:17.150 the other. 1:09:21.541 --> 1:09:23.680 Then their words should be. 
1:09:23.680 --> 1:09:30.403 If you have that you see the neutral words, they should be in the middle of between the 1:09:30.403 --> 1:09:32.008 male and the female. 1:09:32.012 --> 1:09:48.261 So you say is the middle point between all male and female words and just somehow putting 1:09:48.261 --> 1:09:51.691 the neutral words. 1:09:52.912 --> 1:09:56.563 And then you're learning them, and then you can apply them in different ways. 1:09:57.057 --> 1:10:03.458 So you have this a bit in the pre-training thing. 1:10:03.458 --> 1:10:10.372 You can use the pre-trained inbeddings on the output. 1:10:10.372 --> 1:10:23.117 All you can use are: And then you can analyze what happens instead of training them directly. 1:10:23.117 --> 1:10:30.504 If have this additional loss, which tries to optimize. 1:10:32.432 --> 1:10:42.453 And then it was evaluated exactly on the sentences we had at the beginning where it is about know 1:10:42.453 --> 1:10:44.600 her for a long time. 1:10:44.600 --> 1:10:48.690 My friend works as an accounting cling. 1:10:48.788 --> 1:10:58.049 So all these examples are not very difficult to translation, but the question is how often 1:10:58.049 --> 1:10:58.660 does? 1:11:01.621 --> 1:11:06.028 That it's not that complicated as you see here, so even the baseline. 1:11:06.366 --> 1:11:10.772 If you're doing nothing is working quite well, it's most challenging. 1:11:10.772 --> 1:11:16.436 It seems overall in the situation where it's a name, so for he and him he has learned the 1:11:16.436 --> 1:11:22.290 correlation because that's maybe not surprisingly because this correlation occurs more often 1:11:22.290 --> 1:11:23.926 than with any name there. 1:11:24.044 --> 1:11:31.749 If you have a name that you can extract, that is talking about Mary, that's female is a lot 1:11:31.749 --> 1:11:34.177 harder to extract than this. 1:11:34.594 --> 1:11:40.495 So you'll see already in the bass line this is yeah, not working, not working. 1:11:43.403 --> 1:11:47.159 And for all the other cases it's working very well. 1:11:47.787 --> 1:11:53.921 Where all the best one is achieved here with an arc debiasing both on the encoder, on the. 1:11:57.077 --> 1:12:09.044 It makes sense that a hard debasing on the decoder doesn't really work because there you 1:12:09.044 --> 1:12:12.406 have gender information. 1:12:14.034 --> 1:12:17.406 For glove it seems to already work here. 1:12:17.406 --> 1:12:20.202 That's maybe surprising and yeah. 1:12:20.260 --> 1:12:28.263 So there is no clear else we don't have numbers for that doesn't really work well on the other. 1:12:28.263 --> 1:12:30.513 So how much do I use then? 1:12:33.693 --> 1:12:44.720 Then as a last way of improving that is a bit what we had mentioned before. 1:12:44.720 --> 1:12:48.493 That is what is referred. 1:12:48.488 --> 1:12:59.133 One problem is the bias in the data so you can adapt your data so you can just try to 1:12:59.133 --> 1:13:01.485 find equal amount. 1:13:01.561 --> 1:13:11.368 In your data like you adapt your data and then you find your data on the smaller but 1:13:11.368 --> 1:13:12.868 you can try. 1:13:18.298 --> 1:13:19.345 This is line okay. 1:13:19.345 --> 1:13:21.605 We have access to the data to the model. 1:13:21.605 --> 1:13:23.038 We can improve our model. 
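As a concrete illustration of the data adaptation just mentioned, here is a minimal sketch of counterfactual augmentation on the source side; the swap list is a tiny illustrative subset, and producing a correspondingly adapted target side with correct agreement is the actually hard part.

```python
# Counterfactual data augmentation: every training sentence gets a copy with
# gendered words swapped, so the model sees e.g. "he is a doctor" and
# "she is a doctor" equally often.
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her",
         "her": "him",   # naive: "her" can also be possessive ("his")
         "mother": "father", "father": "mother"}

def counterfactual(sentence):
    return " ".join(SWAPS.get(tok.lower(), tok) for tok in sentence.split())

def augment(parallel_corpus):
    for src, tgt in parallel_corpus:
        yield src, tgt
        # the swapped source would need a correspondingly adapted target;
        # fixing that target-side agreement automatically is the hard part
        yield counterfactual(src), tgt
```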
1:13:18.298 --> 1:13:23.038 So far this was all for the setting where we have access to the data and to the model, and we can improve the model ourselves.
1:13:24.564 --> 1:13:42.476 Another situation, which we haven't talked about much but which is getting more and more important, is that you want to work with a model that you don't own and improve it without having access to it.
1:13:42.862 --> 1:13:52.983 Nowadays there are a lot of companies that are not developing their own system but are using an external machine translation service or something like that.
1:13:53.313 --> 1:13:59.853 So there is interest in this setting, where you might not be able to fine-tune the model at all.
1:14:00.080 --> 1:14:19.920 So the question is: can you do some type of black-box adaptation, where you take the black-box system and try to improve it in some way from the outside? There are some ways of doing that.
1:14:21.340 --> 1:14:30.328 One is called black-box injection, and it is essentially what is nowadays referred to as prompting.
1:14:30.730 --> 1:14:43.127 So the problem is that for a single sentence you don't have information about the speaker; how can you put that information in?
1:14:43.984 --> 1:14:53.299 And what we know from large language models is that we can just prompt them, and you can do the same thing here.
1:14:53.233 --> 1:15:01.210 Instead of translating "I love you" directly, you translate "She said to him: I love you", and then of course you have to strip the added part away again.
1:15:01.181 --> 1:15:15.563 You cannot prevent the model from translating the added context as well, but you can see which part is the translation of the original sentence and strip the rest away; and now the system hopefully had the information that the speaker is female.
1:15:18.198 --> 1:15:24.261 Because you are no longer translating just "I love you", but the sentence "She said to him: I love you".
1:15:24.744 --> 1:15:38.567 So you insert this information as context around the sentence and don't have to change the model at all (a small sketch of this trick follows below, after the rescoring idea).
1:15:41.861 --> 1:16:01.156 The last idea is to do what is referred to as lattice rescoring. The idea there is that you first generate a translation.
1:16:01.481 --> 1:16:21.133 And then you have an additional component which adds alternatives at the positions where gender information might have been lost.
1:16:21.261 --> 1:16:31.507 It is just a graph, a simplified lattice where there is always one word between two nodes.
1:16:31.851 --> 1:16:35.212 So you have something like "sie ist Arzt" or "sie ist Ärztin" as alternative paths.
1:16:35.535 --> 1:16:49.317 Then you can generate all possible variants; of course, we are not done yet, because we still have to pick the final output.
1:16:50.530 --> 1:17:10.354 Then you can rescore these variants with a gender-debiased model. The obvious question is: why don't we directly translate with that model? The point is that this model is only focused on gender debiasing.
1:17:10.530 --> 1:17:16.862 For example, if it is only trained on some synthetic data, it will not be that good as a full translation system.
1:17:16.957 --> 1:17:21.456 But what we can do is use it to rescore the possible translations in the lattice.
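Referring back to the black-box injection idea above, here is a minimal sketch of how such prompt-style context could be wrapped around a sentence and stripped from the output again. The wrapper phrases, the hypothetical translate() client, and the simple colon-based stripping are all assumptions for illustration, not part of any real API.

```python
def translate_with_speaker_gender(src, speaker_gender, translate):
    """Black-box injection: add speaker context around the sentence,
    send it to an unmodified MT service, and strip the context again.

    translate: any black-box function str -> str (e.g. an API client);
               its behaviour is assumed here, not given by the lecture.
    """
    prefix = "She said to him: " if speaker_gender == "female" else "He said to her: "
    output = translate(prefix + src)

    # naive stripping: assume the translated context ends at the first colon,
    # which only works for language pairs that preserve this punctuation
    if ":" in output:
        return output.split(":", 1)[1].strip()
    return output  # fall back to the full output if stripping fails

# usage sketch (hypothetical client):
# translate = lambda s: some_mt_api(s, src_lang="en", tgt_lang="de")
# print(translate_with_speaker_gender("I love you.", "female", translate))
```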
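And for the rescoring idea just described, a very reduced sketch: instead of a real word lattice it simply enumerates sentence variants obtained by swapping gendered word pairs and keeps the one the gender-debiased scoring model likes best. The toy alternatives table and the score_fn interface are assumptions made for the example.

```python
from itertools import product

# toy table of gender alternatives (German example from the lecture)
ALTERNATIVES = {"Arzt": ["Arzt", "Ärztin"], "er": ["er", "sie"]}

def gender_variants(tokens):
    """Expand a translation into all variants along the gendered positions
    (a flattened stand-in for the word lattice described above)."""
    options = [ALTERNATIVES.get(tok, [tok]) for tok in tokens]
    return [" ".join(combo) for combo in product(*options)]

def rescore(translation, score_fn):
    """Rescore all gender variants with a gender-debiased model and keep the best.

    score_fn: assumed callable str -> float, e.g. the length-normalised
              log-probability of the sentence under the debiased model.
    """
    variants = gender_variants(translation.split())
    return max(variants, key=score_fn)

# usage sketch (score_fn would come from the small, gender-focused model):
# best = rescore("sie ist Arzt", score_fn=debiased_model_logprob)
```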
1:17:21.721 --> 1:17:31.090 And here, of course, the general structure and the way the words are translated are already fixed by the first translation.
1:17:31.051 --> 1:17:45.490 You are only using the second component in order to rescore the gender variants and then pick the best translation.
1:17:45.925 --> 1:17:58.479 And as the last point there is post-processing.
1:17:58.538 --> 1:18:02.830 I mean, one way of post-processing was already this generating of the lattice and rescoring it.
1:18:03.123 --> 1:18:12.236 But you can also have processing only on the target side, where you have an additional component that checks the gender and maybe only knows about gender.
1:18:12.236 --> 1:18:19.192 So it is not a machine translation component but more like a grammar checker, which can be used as a post-processing step to do that.
1:18:19.579 --> 1:18:35.931 Think about it a bit like when you use GPT: there is also a lot of pre- and post-processing there. If you used the model directly it might tell you how to build a bomb, but there are checks before and after to prevent such things.
1:18:36.356 --> 1:18:44.714 So in an application there is often not just the system itself; there might be extra pre- and post-processing around it.
1:18:48.608 --> 1:19:11.418 And yeah, with this we are at the end of this lecture, where we focused on bias; but I think a lot of the techniques we have seen here are more generally useful.
1:19:11.331 --> 1:19:17.664 So, on the one hand, we saw that evaluating just with pure BLEU scores might not always be enough.
1:19:17.677 --> 1:19:35.696 I mean, it is very important and you should always do that, but if you want to check things and some specific properties are important, then you might have to do dedicated evaluations.
1:19:36.036 --> 1:19:45.476 If the system is now translating for the President and the German output is, I guess, in the wrong register, that is not very appropriate.
1:19:45.785 --> 1:19:54.620 So if certain characteristics of your system are essential, it might be important to have a dedicated evaluation for them.
1:19:55.135 --> 1:20:02.478 And if you have that, of course, it might also be important to develop dedicated techniques.
1:20:02.862 --> 1:20:13.476 We have seen today some ways to mitigate biases, but I hope you see that a lot of these techniques you can also use to mitigate other problems.
1:20:13.573 --> 1:20:31.702 At least for related things: adjusting the training data, for example, is something you can do for other goals as well.
1:20:33.253 --> 1:20:36.022 Before we finish, do we have any more questions?
1:20:41.761 --> 1:20:47.218 Then thanks a lot, and we will see each other again in the next lecture.