WEBVTT
0:00:00.060 --> 0:00:07.762
OK good, so today's lecture is on unsupervised machine translation. So what you have seen
0:00:07.762 --> 0:00:13.518
so far is different techniques for supervised MT, so you have parallel
0:00:13.593 --> 0:00:18.552
data, right. So let's say in an English corpus you have one file and then in German you have
0:00:18.552 --> 0:00:23.454
another file which is sentence-aligned, and then you try to build systems around
0:00:23.454 --> 0:00:23.679
it.
0:00:24.324 --> 0:00:30.130
But what's different about this lecture is that you assume that you have no parallel data
0:00:30.130 --> 0:00:30.663
at all.
0:00:30.663 --> 0:00:37.137
You only have monolingual data and the question
is how can we build systems to translate between
0:00:37.137 --> 0:00:39.405
these two languages right and so.
0:00:39.359 --> 0:00:44.658
This is a more realistic scenario because
you have so many languages in the world.
0:00:44.658 --> 0:00:50.323
You cannot expect to have parallel data between every pair of languages, but in typical
0:00:50.323 --> 0:00:55.623
cases you have newspapers and so on, which are monolingual files, and the question
0:00:55.623 --> 0:00:57.998
is can we build something around them?
0:00:59.980 --> 0:01:01.651
So, like I said, for today:
0:01:01.651 --> 0:01:05.893
First we'll start with the introduction,
so why do we need it?
0:01:05.893 --> 0:01:11.614
and also some intuition on how these models
work before going into the technical details.
0:01:11.614 --> 0:01:17.335
I want to also go through an example, which kind of gives you more understanding of how
0:01:17.335 --> 0:01:19.263
people came up with these models.
0:01:20.820 --> 0:01:23.905
Then the rest of the lecture is going to be in two parts.
0:01:23.905 --> 0:01:26.092
One is we're going to translate words.
0:01:26.092 --> 0:01:30.018
We're not going to care about how can we translate
the full sentence.
0:01:30.018 --> 0:01:35.177
But given two monolingual files, how can we get a dictionary, basically, which is much easier
0:01:35.177 --> 0:01:37.813
than generating something at the sentence level?
0:01:38.698 --> 0:01:43.533
Then we're going to go into the harder case, which is unsupervised sentence-level translation.
0:01:44.204 --> 0:01:50.201
And here what you'll see is the training objectives, which are quite different from the
0:01:50.201 --> 0:01:55.699
word translation ones, and also where it doesn't work, because this is also quite important and
0:01:55.699 --> 0:02:01.384
it's one of the reasons why unsupervised MT is not used anymore: its limitations kind
0:02:01.384 --> 0:02:03.946
of take it away from the realistic use cases.
0:02:04.504 --> 0:02:06.922
And then that leads to the multilingual model.
0:02:06.922 --> 0:02:07.115
So.
0:02:07.807 --> 0:02:12.915
What people are doing to build systems for languages that do not have any parallel data
0:02:12.915 --> 0:02:17.693
is use multilingual models and combine them with these training objectives to get better at
0:02:17.693 --> 0:02:17.913
it.
0:02:17.913 --> 0:02:18.132
So.
0:02:18.658 --> 0:02:24.396
People are not currently trying to build bilingual systems for unsupervised machine translation,
0:02:24.396 --> 0:02:30.011
but I think it's good to know how they came to this point and what they're doing now.
0:02:30.090 --> 0:02:34.687
You'll also see some overlapping patterns in what people are using.
0:02:36.916 --> 0:02:41.642
So as I said before, and you've probably heard it multiple times by now, we have seven
0:02:41.642 --> 0:02:43.076
thousand languages around.
0:02:43.903 --> 0:02:49.460
There can be different dialects and so on, so it's quite hard to say exactly what counts as a language,
0:02:49.460 --> 0:02:54.957
but you can typically approximate it at seven thousand, and that leads to about twenty-five million
0:02:54.957 --> 0:02:59.318
pairs, which is the obvious reason why we do not have parallel data for all of them.
0:03:00.560 --> 0:03:06.386
So we want to build an MT system for all possible language pairs, and the question is
0:03:06.386 --> 0:03:07.172
how can we?
0:03:08.648 --> 0:03:13.325
That's the typical use case, but there are actually a few more interesting use cases than what you
0:03:13.325 --> 0:03:14.045
would expect.
0:03:14.614 --> 0:03:20.508
One is animal languages, which is a real thing that's happening right now, not with
0:03:20.780 --> 0:03:26.250
dogs but with dolphins and so on. I couldn't find a picture that could show this,
0:03:26.250 --> 0:03:31.659
but if you are interested in stuff like this
you can check out the website where people
0:03:31.659 --> 0:03:34.916
are actually trying to understand how animals
speak.
0:03:35.135 --> 0:03:37.356
It's also a bit more about
0:03:37.297 --> 0:03:44.124
knowing what the animals want to say; it may not quite work out, but still people are trying to
0:03:44.124 --> 0:03:44.661
do it.
0:03:45.825 --> 0:03:50.689
A more realistic thing that's happening is the
translation of programming languages.
0:03:51.371 --> 0:03:56.963
And this is quite a good scenario for unsupervised MT: you have
0:03:56.963 --> 0:04:02.556
a lot of code available online, right, in C++ and in Python, and the question is how can
0:04:02.556 --> 0:04:08.402
we translate by just looking at the code alone, with no parallel functions and so on, and this
0:04:08.402 --> 0:04:10.754
actually works quite well right now.
0:04:12.032 --> 0:04:16.111
We'll see how these techniques were applied to do programming language translation.
0:04:18.258 --> 0:04:23.882
And then you can also think of language a bit more broadly, so you can, for
0:04:23.882 --> 0:04:24.194
example:
0:04:24.194 --> 0:04:29.631
Think of formal sentences in English as one
language and informal sentences in English
0:04:29.631 --> 0:04:35.442
as another language, and then learn to translate between them, and then it kind of becomes
0:04:35.442 --> 0:04:37.379
a style transfer problem.
0:04:38.358 --> 0:04:43.042
Although it's translation, you can consider
different characteristics of a language and
0:04:43.042 --> 0:04:46.875
then separate them as two different languages
and then try to map them.
0:04:46.875 --> 0:04:52.038
So it's not only about languages: you can also do quite cool things by using unsupervised
0:04:52.038 --> 0:04:54.327
techniques, which is quite possible also.
0:04:56.256 --> 0:04:56.990
OK, so.
0:04:56.990 --> 0:05:04.335
This is kind of the motivation for many of the use cases that we have for unsupervised MT.
0:05:04.335 --> 0:05:11.842
But before we go into the modeling of these
systems, what I want you to do is look at these
0:05:11.842 --> 0:05:12.413
dummy examples.
0:05:13.813 --> 0:05:19.720
We have text in language one and text in language two, right, and nobody knows what these languages
0:05:19.720 --> 0:05:20.082
mean.
0:05:20.082 --> 0:05:23.758
They are completely made up, right, and the other point is:
0:05:23.758 --> 0:05:29.364
they're not parallel lines, so the first line here and the first line there are not aligned; they're
0:05:29.364 --> 0:05:30.810
just monolingual files.
0:05:32.052 --> 0:05:38.281
And now think about how you can translate the word M1 from language one to language two,
0:05:38.281 --> 0:05:41.851
and from this you'll kind of see how we try to model this.
0:05:42.983 --> 0:05:47.966
So take your time and then think about how you can translate M1 into language two.
0:06:41.321 --> 0:06:45.589
About the model, if you ask somebody who doesn't
know anything about machine translation right,
0:06:45.589 --> 0:06:47.411
and then you ask them to translate more.
0:07:01.201 --> 0:07:10.027
But it's also not quite easy. Mind you, the way that I made this example is relatively
0:07:10.027 --> 0:07:10.986
easy, so.
0:07:11.431 --> 0:07:17.963
Basically, the first two sentences are these
two: A, B, C is E, and G cured up the U, V
0:07:17.963 --> 0:07:21.841
is L, A, A, C, S, and S, on and this is used
towards the German.
0:07:22.662 --> 0:07:25.241
And then when you join these two words, it's.
0:07:25.205 --> 0:07:32.445
English German the third line and the last
line, and then the fourth line is the first
0:07:32.445 --> 0:07:38.521
line, so German language, English, and then
speak English, speak German.
0:07:38.578 --> 0:07:44.393
So this is how I made up the example, and the intuition here is that you assume
0:07:44.393 --> 0:07:50.535
that the languages have a fundamental structure
right and it's the same across all languages.
0:07:51.211 --> 0:07:57.727
It doesn't matter what language you are thinking of: the words you have behave in the same way, they join
0:07:57.727 --> 0:07:59.829
together in the same way, and
0:07:59.779 --> 0:08:06.065
sentences are formed in the same way. This is not a realistic assumption for sure, but
0:08:06.065 --> 0:08:12.636
it's actually a decent one to make, and if you can assume this,
0:08:12.636 --> 0:08:16.207
then we can model systems in an unsupervised
way.
0:08:16.396 --> 0:08:22.743
So this is the intuition that I want to give,
and you can see that whenever assumptions fail,
0:08:22.743 --> 0:08:23.958
the systems fail.
0:08:23.958 --> 0:08:29.832
So in practice whenever we go far away from
these assumptions, the systems tend to fail
0:08:29.832 --> 0:08:30.778
more often.
0:08:33.753 --> 0:08:39.711
So the example that I gave was actually a perfect mapping, right, which never really happens in practice.
0:08:39.711 --> 0:08:45.353
They have the same number of words, same sentence
structure, perfect mapping, and so on.
0:08:45.353 --> 0:08:50.994
This doesn't happen, but let's assume that it does and try to see how we can model it.
0:08:53.493 --> 0:09:01.061
Okay, now let's get a bit more formal, so what we want to do is unsupervised word translation.
0:09:01.901 --> 0:09:08.773
Here the task is that we have input data as
monolingual data, so a bunch of sentences in
0:09:08.773 --> 0:09:15.876
one file and a bunch of sentences in another file
in two different languages, and the question
0:09:15.876 --> 0:09:18.655
is how can we get a bilingual word dictionary?
0:09:19.559 --> 0:09:25.134
So if you look at the picture you see that
it's just kind of projected down onto a two-dimensional
0:09:25.134 --> 0:09:30.358
plane, but basically when you map them
into a plot you see that the words that are
0:09:30.358 --> 0:09:35.874
translation pairs are closer together, and the question
is how can we do it just looking at two files?
0:09:36.816 --> 0:09:42.502
And you can say that what we want to basically
do is create a dictionary in the end given
0:09:42.502 --> 0:09:43.260
two files.
0:09:43.260 --> 0:09:45.408
So this is the task that we want.
0:09:46.606 --> 0:09:52.262
And the first step in how we do this is to learn word vectors, and this can be done with whatever
0:09:52.262 --> 0:09:56.257
technique you have seen before: word2vec, GloVe, or so on.
0:09:56.856 --> 0:10:00.699
So you take a monolingual data and try to
learn word embeddings.
0:10:02.002 --> 0:10:07.675
Then you plot them into a graph, and then
typically what you would see is that they're
0:10:07.675 --> 0:10:08.979
not aligned at all.
0:10:08.979 --> 0:10:14.717
One word space is somewhere, and one word
space is somewhere else, and this is what you
0:10:14.717 --> 0:10:18.043
would typically expect to see in the
image.
0:10:19.659 --> 0:10:23.525
Now our assumption was that both languages kind of have the same
0:10:23.563 --> 0:10:28.520
structure, and so we can use this information
to learn the mapping between these two spaces.
0:10:30.130 --> 0:10:37.085
So before we get to how we do it: I think this is quite famous already, and everybody
0:10:37.085 --> 0:10:41.824
knows it by now, that word embeddings capture semantic relations, right.
0:10:41.824 --> 0:10:48.244
So the distance between man and woman is approximately
the same as between king and queen.
0:10:48.888 --> 0:10:54.620
It's the same for verb tenses, country and capital, and so on, so there are some relationships
0:10:54.620 --> 0:11:00.286
happening in the word embedding space, which is quite clear at least within one language.
0:11:03.143 --> 0:11:08.082
Now if you think of this, let's say, of the English word embeddings
0:11:08.082 --> 0:11:14.769
and the German word embeddings: the way king, queen, man, woman are organized is the same
0:11:14.769 --> 0:11:17.733
as for the German translations of these words.
0:11:17.998 --> 0:11:23.336
The main idea is that although they are somewhere else in the space, the relationship is the
0:11:23.336 --> 0:11:28.008
same in both languages, and we can use this to learn the mapping.
0:11:31.811 --> 0:11:35.716
It's not only for these four words; it happens for all the words in the language,
0:11:35.716 --> 0:11:37.783
and so we can use this to learn the mapping.
0:11:39.179 --> 0:11:43.828
The main idea is that both embeddings have a similar shape.
0:11:43.828 --> 0:11:48.477
It's only that they're just not aligned, and so, as you can see here,
0:11:48.477 --> 0:11:50.906
They kind of have a similar shape.
0:11:50.906 --> 0:11:57.221
They're just in some different spaces and
what you need to do is to map them into a common
0:11:57.221 --> 0:11:57.707
space.
0:12:06.086 --> 0:12:12.393
So we want to learn a matrix W such that, if we multiply W with X, both embedding spaces become aligned.
0:12:35.335 --> 0:12:41.097
That's true, but there are also many words that do have this relationship, right, and we hope
0:12:41.097 --> 0:12:43.817
that this is enough to learn the mapping.
0:12:43.817 --> 0:12:49.838
So there's always going to be a bit of noise, in the sense that when we align them they're not going
0:12:49.838 --> 0:12:51.716
to be exactly the same, but.
0:12:51.671 --> 0:12:57.293
What you can expect is that there are these main words that allow us to learn the mapping,
0:12:57.293 --> 0:13:02.791
so it's not going to be perfect, but it's an
approximation that we make to see how it
0:13:02.791 --> 0:13:04.521
works in practice.
0:13:04.521 --> 0:13:10.081
Also, the fact that some words do not have any such relationship does not matter that
0:13:10.081 --> 0:13:10.452
much.
0:13:10.550 --> 0:13:15.429
A lot of words usually do, so it kind of works out in practice.
0:13:22.242 --> 0:13:34.248
I have not heard about it, but if you want
to say something about it, I would be interested,
0:13:34.248 --> 0:13:37.346
but we can do it later.
0:13:41.281 --> 0:13:44.133
In the usual case, this is supervised.
0:13:45.205 --> 0:13:49.484
The first way is to do supervised word translation, where we have a dictionary, right, and we
0:13:49.484 --> 0:13:53.764
can use that to learn the mapping; but in our case we assume that we have nothing, so
0:13:53.764 --> 0:13:55.222
we only have monolingual data.
0:13:56.136 --> 0:14:03.126
Then we need unsupervised learning to figure out W, and we're going to use GANs to find
0:14:03.126 --> 0:14:06.122
W, and it's quite a nice way to do it.
0:14:08.248 --> 0:14:15.393
So just before I go into how we use it for our use case, I'm going to go briefly over GANs,
0:14:15.393 --> 0:14:19.940
so we have two components: generator and discriminator.
0:14:21.441 --> 0:14:27.052
The generator tries to generate something, obviously,
and the discriminator tries to see if it's
0:14:27.052 --> 0:14:30.752
real data or something that is generated by the generator.
0:14:31.371 --> 0:14:37.038
And there's like this two-player game, where the generator tries to fool the discriminator and the discriminator tries
0:14:37.038 --> 0:14:41.862
not to get fooled, and by training these two components against each other we try to learn W.
0:14:43.483 --> 0:14:53.163
Okay, so let's say we have two languages,
X and Y, right, so the X language has n words
0:14:53.163 --> 0:14:56.167
with embeddings of some dimension.
0:14:56.496 --> 0:14:59.498
So what we get is a big embedding matrix or something like that.
0:14:59.498 --> 0:15:02.211
Then we have the target language Y with m words,
0:15:02.211 --> 0:15:06.944
with the same embedding dimension as I mentioned, and then we have another embedding matrix.
0:15:07.927 --> 0:15:13.784
Basically what we're going to do is use word2vec and learn our word embeddings.
0:15:14.995 --> 0:15:23.134
Now we have these X embeddings and Y embeddings, and what we want to learn is W, such that WX and
0:15:23.134 --> 0:15:24.336
Y are aligned.
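As a minimal sketch (my own illustration, not from the lecture slides): in the supervised variant mentioned earlier, where a small seed dictionary is available, the best orthogonal W has a closed form via the SVD (orthogonal Procrustes). The unsupervised GAN approach described next replaces the dictionary with an adversarial signal. Shapes are an assumption here, with one embedding per row.

```python
import numpy as np

# paired embeddings for 1000 hypothetical dictionary entries, dimension 300
X = np.random.randn(1000, 300)   # source-side embeddings
Y = np.random.randn(1000, 300)   # corresponding target-side embeddings

U, _, Vt = np.linalg.svd(X.T @ Y)   # SVD of the cross-covariance
W = U @ Vt                          # orthogonal mapping: X @ W is close to Y
aligned_src = X @ W                 # source embeddings mapped into the target space
```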
0:15:29.209 --> 0:15:35.489
With GANs you have two steps: one is the discriminator step and one is the mapping step, and the
0:15:35.489 --> 0:15:41.135
discriminator step is to see if the embeddings are mapped source embeddings or target embeddings.
0:15:41.135 --> 0:15:44.688
It's going to be much clearer when I go to the figure.
0:15:46.306 --> 0:15:50.041
So we have monolingual documents in two different languages.
0:15:50.041 --> 0:15:54.522
From here we get our source language embeddings and target language embeddings, right.
0:15:54.522 --> 0:15:57.855
Then we randomly initialize the transformation matrix W.
0:16:00.040 --> 0:16:06.377
Then we have the discriminator which tries
to see if it's WX or Y, so it needs to know
0:16:06.377 --> 0:16:13.735
that this is a mapped one and this is the original language; and so if you look at the loss function
0:16:13.735 --> 0:16:20.072
here, it basically says that "source" is 1 given WX, so this comes from the mapped source language.
0:16:23.543 --> 0:16:27.339
And "source" is 0 given Y, which means it's a target language embedding, yeah.
0:16:27.339 --> 0:16:34.436
My figure is not that great, but you can assume that these are the target embeddings.
0:16:40.260 --> 0:16:43.027
So this is kind of the loss function.
0:16:43.027 --> 0:16:46.386
We have N source words, M target words, and
so on.
0:16:46.386 --> 0:16:52.381
That's why you have the one-over-n and one-over-m factors, and the discriminator just has to see if they're
0:16:52.381 --> 0:16:55.741
mapped embeddings or embeddings from the original target language.
0:16:57.317 --> 0:17:04.024
And then we have the mapping step, where we train W to fool the discriminator.
0:17:04.564 --> 0:17:10.243
So here it's the same setup, but what you're going to do is just invert the loss function.
0:17:10.243 --> 0:17:15.859
So now we freeze the discriminator; it's important to note that in the previous step
0:17:15.859 --> 0:17:20.843
we froze the transformation matrix, and here we freeze the discriminator.
0:17:22.482 --> 0:17:28.912
And now the goal is to fool the discriminator, so it should predict that source is 0
0:17:28.912 --> 0:17:35.271
given the mapped embedding, and that source is 1 given the target embedding, which is wrong,
0:17:35.271 --> 0:17:37.787
and that is how we're training W.
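A minimal PyTorch sketch of these two alternating steps, under assumed shapes and hyperparameters (it is an illustration of the idea, not the exact implementation from the lecture or any specific paper):

```python
import torch
import torch.nn as nn

d = 300                                          # embedding dimension (assumption)
W = nn.Linear(d, d, bias=False)                  # mapping, randomly initialised
D = nn.Sequential(nn.Linear(d, 512), nn.LeakyReLU(0.1), nn.Linear(512, 1))
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def train_step(x_batch, y_batch):
    """x_batch: source embeddings, y_batch: target embeddings, both of shape [B, d]."""
    # 1) Discriminator step (W frozen): label mapped source as 1, target as 0.
    with torch.no_grad():
        wx = W(x_batch)
    d_loss = bce(D(wx), torch.ones(len(wx), 1)) + \
             bce(D(y_batch), torch.zeros(len(y_batch), 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Mapping step (D not updated): train W so that D predicts 0 ("target")
    #    for the mapped source embeddings, i.e. W fools the discriminator.
    m_loss = bce(D(W(x_batch)), torch.zeros(len(x_batch), 1))
    opt_W.zero_grad(); m_loss.backward(); opt_W.step()
    return d_loss.item(), m_loss.item()
```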
0:17:39.439 --> 0:17:46.261
Any questions on this? Okay, so then how do we know when to stop?
0:17:46.261 --> 0:17:55.854
We just train until we reach convergence right
and then we have our W, hopefully trained, to
0:17:55.854 --> 0:17:59.265
map the embeddings into an aligned space.
0:18:02.222 --> 0:18:07.097
The question is how can we evaluate this mapping?
0:18:07.097 --> 0:18:13.923
Does anybody know what we can use to evaluate the mapping,
0:18:13.923 --> 0:18:15.873
how good a word translation is?
0:18:28.969 --> 0:18:33.538
As I said, we use a dictionary, at least in the end.
0:18:33.538 --> 0:18:40.199
We need a dictionary to evaluate, so this is only for the final evaluation; we aren't using it at
0:18:40.199 --> 0:18:42.600
all in the training data.
0:18:43.223 --> 0:18:49.681
One way is to check the precision against our dictionary, just that.
0:18:50.650 --> 0:18:52.813
You take the first nearest neighbor and see if the correct translation is there.
0:18:53.573 --> 0:18:56.855
But this is quite strict because there's a lot of noise in the embedding space, right.
0:18:57.657 --> 0:19:03.114
Your first neighbor is not always going to be the translation, so what people also report
0:19:03.114 --> 0:19:05.055
is precision at five and so on.
0:19:05.055 --> 0:19:10.209
So you take the five nearest neighbors and see if the translation is in there, and so on.
0:19:10.209 --> 0:19:15.545
The more you increase it, the more likely it is that the translation is in there, because the embeddings
0:19:15.545 --> 0:19:16.697
are quite noisy.
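A minimal sketch of precision-at-k with nearest-neighbour retrieval; the names (emb_src, emb_tgt, test_dict) are hypothetical, and test_dict maps a source word to its set of acceptable gold translations:

```python
import numpy as np

def precision_at_k(emb_src, emb_tgt, W, test_dict, k=5):
    tgt_words = list(emb_tgt.keys())
    T = np.stack([emb_tgt[w] for w in tgt_words])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)         # normalise for cosine similarity
    hits = 0
    for src_word, gold in test_dict.items():
        q = emb_src[src_word] @ W                             # map into the target space
        q = q / np.linalg.norm(q)
        topk = np.argsort(-(T @ q))[:k]                       # k nearest neighbours
        hits += any(tgt_words[i] in gold for i in topk)
    return hits / len(test_dict)
```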
0:19:19.239 --> 0:19:25.924
What's interesting is that people have used dictionaries to learn word translation in a supervised way, but
0:19:25.924 --> 0:19:32.985
this unsupervised way of doing it can be even better than using a dictionary, so somehow our assumption helps
0:19:32.985 --> 0:19:36.591
us to build something better than a supervised system.
0:19:39.099 --> 0:19:42.985
So as you see at the top, you have precision at one, five, and ten.
0:19:42.985 --> 0:19:47.309
These are the typical numbers that you report for word translation.
0:19:48.868 --> 0:19:55.996
But GANs are usually quite tricky to train, and this does not converge for all language pairs,
0:19:55.996 --> 0:20:02.820
and this kind of goes back to our assumption that the languages kind of have the same structure,
0:20:02.820 --> 0:20:03.351
right.
0:20:03.351 --> 0:20:07.142
But if you take a language like English and
some.
0:20:07.087 --> 0:20:12.203
other language that is very low-resource and quite different from English and so on,
0:20:12.203 --> 0:20:13.673
then it doesn't work as well.
0:20:13.673 --> 0:20:18.789
So whenever our assumption fails, these unsupervised techniques either do not
0:20:18.789 --> 0:20:21.199
converge or just give really bad scores.
0:20:22.162 --> 0:20:27.083
And so the fact is that the monolingual embeddings for distant languages are too far apart.
0:20:27.083 --> 0:20:30.949
They do not share the same structure, and so training does not converge.
0:20:32.452 --> 0:20:39.380
And so I just want to mention that there is
a better retrieval technique than the nearest
0:20:39.380 --> 0:20:41.458
neighbor, which is called.
0:20:42.882 --> 0:20:46.975
But it's mathematically a bit more advanced, so I didn't want to go into it now.
0:20:46.975 --> 0:20:51.822
But if you're interested in some quite good retrieval techniques, you can just look at this
0:20:51.822 --> 0:20:53.006
if you're interested.
0:20:55.615 --> 0:20:59.241
Okay, so this was about word translation.
0:20:59.241 --> 0:21:02.276
Does anybody have any questions so far?
0:21:06.246 --> 0:21:07.501
So that was word translation.
0:21:07.501 --> 0:21:12.580
It was a bit easier than a sentence right,
so you just assume that there's a mapping and
0:21:12.580 --> 0:21:14.577
then you try to learn the mapping.
0:21:14.577 --> 0:21:19.656
But now it's a bit more difficult because you need to generate stuff as well, which is
0:21:19.656 --> 0:21:20.797
much trickier.
0:21:22.622 --> 0:21:28.512
The task here is that we have monolingual data for both languages as input, as before, but
0:21:28.512 --> 0:21:34.017
now, instead of translating word by word, we want to do sentence-level translation.
0:21:37.377 --> 0:21:44.002
We have word2vec and so on to learn word embeddings, but sentence embeddings were
0:21:44.002 --> 0:21:50.627
actually not that powerful, at least back when people tried to work on unsupervised
0:21:50.627 --> 0:21:51.445
MT.
0:21:52.632 --> 0:21:54.008
Now they're a bit okay.
0:21:54.008 --> 0:21:59.054
I mean, as you've seen in the practical session where we used them, they were quite decent.
0:21:59.054 --> 0:22:03.011
But then it also depends on which data they are trained on and so on.
0:22:03.011 --> 0:22:03.240
So.
0:22:04.164 --> 0:22:09.666
Sentence embeddings are definitely much harder to get than word embeddings, so this
0:22:09.666 --> 0:22:13.776
is a bit more complicated than the task that
you've seen before.
0:22:16.476 --> 0:22:18.701
Before we go into how UNMT works, this is your typical supervised
system right.
0:22:24.485 --> 0:22:29.558
So we have parallel data: source sentences and target sentences.
0:22:29.558 --> 0:22:31.160
We have a source encoder.
0:22:31.471 --> 0:22:36.709
We have a target decoder, and then we try to minimize the cross entropy loss on this parallel
0:22:36.709 --> 0:22:37.054
data.
0:22:37.157 --> 0:22:39.818
And this is how we train our typical system.
0:22:43.583 --> 0:22:49.506
But now we do not have any parallel data,
and so the intuition here is that if we can
0:22:49.506 --> 0:22:55.429
learn language independent representations at the encoder outputs, then we can pass
0:22:55.429 --> 0:22:58.046
them along to whichever decoder we want.
0:22:58.718 --> 0:23:03.809
It's going to get clearer in a moment,
but I'm trying to give a bit more intuition
0:23:03.809 --> 0:23:07.164
before I show you all the training objectives.
0:23:08.688 --> 0:23:15.252
So I assume that we have these different encoders
right, so it's not only two, you have a bunch
0:23:15.252 --> 0:23:21.405
of different source language encoders, a bunch
of different target language decoders, and
0:23:21.405 --> 0:23:26.054
also I assume that the encoder outputs are in the same representation space.
0:23:26.706 --> 0:23:31.932
If you give a sentence in English and the
same sentence in German, the embeddings are
0:23:31.932 --> 0:23:38.313
quite the same, like the multilingual word embeddings, right; and so then what we can do is, depending
0:23:38.313 --> 0:23:42.202
on the language we want, pass it to the appropriate decoder.
0:23:42.682 --> 0:23:50.141
And so the kind of goal here is to find out
a way to create language independent representations
0:23:50.141 --> 0:23:52.909
and then pass them to the decoder we want.
0:23:54.975 --> 0:23:59.714
Just keep in mind that we're trying to get language independent representations for a reason; it's
0:23:59.714 --> 0:24:02.294
going to be more clear once we see how it works.
0:24:05.585 --> 0:24:12.845
So in total we have three objectives that
we're going to try to train in our systems,
0:24:12.845 --> 0:24:16.981
and all of them use monolingual
data.
0:24:17.697 --> 0:24:19.559
So there's no parallel data at all.
0:24:19.559 --> 0:24:24.469
The first one is denoising auto-encoding, so it's more like you add noise to
0:24:24.469 --> 0:24:27.403
the sentence and then reconstruct the original.
0:24:28.388 --> 0:24:34.276
Then we have the on-the-fly back translation,
so this is where you take a sentence, generate
0:24:34.276 --> 0:24:39.902
a translation, and then learn the reverse mapping, which I'm going to show in pictures
0:24:39.902 --> 0:24:45.725
later, and then we have adversarial training to learn the language independent
0:24:45.725 --> 0:24:46.772
representation.
0:24:47.427 --> 0:24:52.148
So somehow, by training on these three tasks,
0:24:52.148 --> 0:24:54.728
we get an unsupervised
0:24:54.728 --> 0:24:54.917
MT system.
0:24:56.856 --> 0:25:02.964
OK, so the first thing we're going to do is denoising auto-encoding, right; so as I said, we add
0:25:02.964 --> 0:25:06.295
noise to the sentence, so we take our sentence.
0:25:06.826 --> 0:25:09.709
And then there are different ways to add noise.
0:25:09.709 --> 0:25:11.511
You can shuffle words around.
0:25:11.511 --> 0:25:12.712
You can drop words.
0:25:12.712 --> 0:25:18.298
Do whatever you want to do as long as there's
enough information to reconstruct the original
0:25:18.298 --> 0:25:18.898
sentence.
0:25:19.719 --> 0:25:25.051
And then we treat the noisy one and the original one as parallel data and train
0:25:25.051 --> 0:25:26.687
similarly to the supervised case.
0:25:28.168 --> 0:25:30.354
So we have a source sentence.
0:25:30.354 --> 0:25:32.540
We have a noisy source right.
0:25:32.540 --> 0:25:37.130
So here what basically happened is that the
word got shuffled.
0:25:37.130 --> 0:25:39.097
One word is dropped right.
0:25:39.097 --> 0:25:41.356
So this is the noisy source.
0:25:41.356 --> 0:25:47.039
And then we treat the noisy source and the source as a sentence pair, basically.
0:25:49.009 --> 0:25:53.874
We train by optimizing the cross entropy loss, similar to before.
0:25:57.978 --> 0:26:03.211
Basically, a picture to show what's happening: we have the noisy source,
0:26:03.163 --> 0:26:09.210
the noisy target, and then we have the reconstructed original source and original target, and since
0:26:09.210 --> 0:26:14.817
the languages are different we have our source encoder, target encoder, source decoder, and target decoder.
0:26:17.317 --> 0:26:20.202
And for this task we only need monolingual
data.
0:26:20.202 --> 0:26:25.267
We don't need any parallel data, because it's
just taking a sentence and shuffling it and
0:26:25.267 --> 0:26:27.446
reconstructing the original one.
0:26:28.848 --> 0:26:31.058
And we have four different blocks here.
0:26:31.058 --> 0:26:36.841
This is kind of very important to keep in
mind on how we change these connections later.
0:26:41.121 --> 0:26:49.093
Then this is more like the mathematical formulation, where you predict the source given the noisy source.
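A minimal sketch of the noising step (my own simplification, not the exact noise model used in the lecture): shuffle words within a small window and drop some of them, then train the model to reconstruct the original sentence from the corrupted one.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    # randomly drop words (keep at least one token)
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # shuffle words locally: perturb each position by a bounded random offset,
    # then sort by the perturbed positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

src = "the cat sat on the mat".split()
noisy = add_noise(src)   # training pair: (noisy, src), trained like a supervised MT example
```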
0:26:52.492 --> 0:26:55.090
So that was the denoising auto-encoding.
0:26:55.090 --> 0:26:58.403
The second step is on-the-fly back translation.
0:26:59.479 --> 0:27:06.386
So what we do is, we put our model in inference mode, right, we take a source sentence,
0:27:06.386 --> 0:27:09.447
and we generate a translation.
0:27:09.829 --> 0:27:18.534
It might be completely wrong or maybe partially
correct or so on, but we assume that the model
0:27:18.534 --> 0:27:20.091
kind of knows it, and
0:27:20.680 --> 0:27:25.779
it generates t-hat, right, and then what we do is take t-hat
0:27:25.779 --> 0:27:27.572
and s as a sentence pair, right.
0:27:27.572 --> 0:27:29.925
That's how we can handle the translation.
0:27:30.530 --> 0:27:38.824
So we train a supervised system on this sentence pair: we do inference and then train a reverse
0:27:38.824 --> 0:27:39.924
translation model.
0:27:42.442 --> 0:27:49.495
To be a bit more concrete: we have a source sentence, right, then we generate the translation,
0:27:49.495 --> 0:27:55.091
then we give the generated translation as an input and try to predict the original sentence.
0:27:58.378 --> 0:28:03.500
This is how we would do it in practice, right: before, the source encoder was connected
0:28:03.500 --> 0:28:08.907
to the source decoder, but now we interchanged
connections, so the source encoder is connected
0:28:08.907 --> 0:28:10.216
to the target decoder.
0:28:10.216 --> 0:28:13.290
The target encoder is connected to the source decoder.
0:28:13.974 --> 0:28:20.747
And given s we get t-hat, and given t we get s-hat, so this is the first time step.
0:28:21.661 --> 0:28:24.022
On the second time step, what you're going
to do is reverse.
0:28:24.664 --> 0:28:32.625
So s-hat is here, t-hat is here, and given
s hat we are trying to predict t, and given
0:28:32.625 --> 0:28:34.503
t-hat we are trying to predict s.
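A minimal sketch of one on-the-fly back-translation round; the interface (model_st, model_ts, translate, train_supervised) is hypothetical and stands for any pair of seq2seq translation models:

```python
def backtranslation_round(model_st, model_ts, mono_src, mono_tgt):
    # 1) inference: generate synthetic translations with the current models
    synth_tgt = [model_st.translate(s) for s in mono_src]   # s -> t-hat
    synth_src = [model_ts.translate(t) for t in mono_tgt]   # t -> s-hat

    # 2) training: treat (t-hat, s) and (s-hat, t) as parallel data and
    #    train the reverse direction with the usual supervised loss
    model_ts.train_supervised(src=synth_tgt, tgt=mono_src)  # t-hat -> s
    model_st.train_supervised(src=synth_src, tgt=mono_tgt)  # s-hat -> t
```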
0:28:36.636 --> 0:28:39.386
Is this clear? Do you have any questions on it?
0:28:45.405 --> 0:28:50.823
A bit more mathematically, we try to predict t given s-hat and s given t-hat, so it's always the
0:28:50.823 --> 0:28:53.963
supervised NMT technique that we are trying
to do.
0:28:53.963 --> 0:28:59.689
But we're trying to create these synthetic pairs that kind of help us to build an unsupervised
0:28:59.689 --> 0:29:00.181
system.
0:29:02.362 --> 0:29:08.611
Now, what you can maybe also see here is that if the source encoder and target encoder outputs
0:29:08.611 --> 0:29:14.718
are language independent, we can always swap the connections and get the translations.
0:29:14.718 --> 0:29:21.252
That's why it was important to find a way
to generate language independent representations.
0:29:21.441 --> 0:29:26.476
And the way we try to force this language independence is the GAN step.
0:29:27.627 --> 0:29:34.851
So the third step, which kind of combines all of them, is where we try to use a GAN to make the
0:29:34.851 --> 0:29:37.959
encoder output language independent.
0:29:37.959 --> 0:29:42.831
So here it's the same picture but from a different
paper.
0:29:42.831 --> 0:29:43.167
So.
0:29:43.343 --> 0:29:48.888
We have X source and X target, which are monolingual data.
0:29:48.888 --> 0:29:50.182
We add noise.
0:29:50.690 --> 0:29:54.736
Then we encode it using the source and the
target encoders right.
0:29:54.736 --> 0:29:58.292
Then we get the latent space Z source and
Z target right.
0:29:58.292 --> 0:30:03.503
Then we decode and try to reconstruct the
original one and this is the auto encoding
0:30:03.503 --> 0:30:08.469
loss, which compares the X source, which is the original one, with the
0:30:08.468 --> 0:30:09.834
predicted output.
0:30:09.834 --> 0:30:16.740
So this is always the auto-encoding step, but the GAN part concerns the encoder
0:30:16.740 --> 0:30:24.102
outputs: here we have a discriminator which tries to predict which language the latent
0:30:24.102 --> 0:30:25.241
space is from.
0:30:26.466 --> 0:30:33.782
So given Z source it has to predict that the
representation is from a language source and
0:30:33.782 --> 0:30:39.961
given Z target it has to predict that the representation is from the target language.
0:30:40.520 --> 0:30:45.135
And our encoders are kind of the generators here, and then we have a separate
0:30:45.135 --> 0:30:49.803
discriminator network which tries to predict which language the latent spaces are from.
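A minimal sketch of this latent-space discriminator and the adversarial term that pushes the encoders toward language independence (assumed shapes and hyperparameters; z_src and z_tgt stand for encoder states flattened to [N, d_model]):

```python
import torch
import torch.nn as nn

d_model = 512
disc = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, 2))
ce = nn.CrossEntropyLoss()

def disc_loss(z_src, z_tgt):
    # discriminator: predict language id 0 for source latents, 1 for target latents
    logits = torch.cat([disc(z_src), disc(z_tgt)])
    labels = torch.cat([torch.zeros(len(z_src)), torch.ones(len(z_tgt))]).long()
    return ce(logits, labels)

def adversarial_loss(z_src, z_tgt):
    # encoder objective: fool the (not updated) discriminator with flipped labels
    logits = torch.cat([disc(z_src), disc(z_tgt)])
    flipped = torch.cat([torch.ones(len(z_src)), torch.zeros(len(z_tgt))]).long()
    return ce(logits, flipped)
```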
0:30:53.393 --> 0:30:57.611
And then this one is when we combine the GAN with the auto-encoding step.
0:30:57.611 --> 0:31:02.767
Then we had an on the fly back translation
step right, and so here what we're trying to
0:31:02.767 --> 0:31:03.001
do.
0:31:03.863 --> 0:31:07.260
Is the same, basically just exactly the same.
0:31:07.260 --> 0:31:12.946
But when we are doing the training, we add the adversarial loss here, so:
0:31:13.893 --> 0:31:20.762
We take our X source, generate an intermediate translation, so Y target and Y source, right?
0:31:20.762 --> 0:31:27.342
This is the previous time step, and then we
have to encode the new sentences and basically
0:31:27.342 --> 0:31:32.764
make them language independent or train to
make them language independent.
0:31:33.974 --> 0:31:43.502
And then the hope is that now if we do this
using monolingual data alone we can just switch
0:31:43.502 --> 0:31:47.852
connections and then get our translation.
0:31:47.852 --> 0:31:49.613
So that's the idea.
0:31:54.574 --> 0:32:03.749
And so as I said before, GANs are quite good for vision, right, so this is kind of like the
0:32:03.749 --> 0:32:11.312
cycle GAN approach that you might have seen
in any computer vision course.
0:32:11.911 --> 0:32:19.055
Somehow that didn't work here, at least not as promisingly as for images, and so people
0:32:19.055 --> 0:32:23.706
did something else to enforce this language independence.
0:32:25.045 --> 0:32:31.226
They try to use a shared encoder instead of
having these different encoders right, and
0:32:31.226 --> 0:32:37.835
so this is basically the same training objectives as before, but what you're going to do now
0:32:37.835 --> 0:32:43.874
is learn cross-lingually and then use a single encoder for both languages.
0:32:44.104 --> 0:32:49.795
And this kind of also forces them to be in the
same space, and then you can choose whichever
0:32:49.795 --> 0:32:50.934
decoder you want.
0:32:52.552 --> 0:32:58.047
You can use GANs, or you can just use a shared encoder, and try to build your unsupervised
0:32:58.047 --> 0:32:58.779
MT system.
0:33:08.488 --> 0:33:09.808
These are now the
0:33:09.808 --> 0:33:15.991
enhancements that you can do on top of your unsupervised system: one, you can create
0:33:15.991 --> 0:33:16.686
a shared encoder.
0:33:18.098 --> 0:33:22.358
On top of the shared encoder you can add your GAN loss or whatever, so there's a lot
0:33:22.358 --> 0:33:22.550
of things you can combine.
0:33:24.164 --> 0:33:29.726
The other thing that is more relevant right
now is that you can create parallel data by
0:33:29.726 --> 0:33:35.478
word-to-word translation, right, because you know how to do unsupervised word translation.
0:33:36.376 --> 0:33:40.548
The first step is to create parallel data, assuming that the word translations are quite good.
0:33:41.361 --> 0:33:47.162
And then you train a supervised NMT model on this most likely wrong parallel data,
0:33:47.162 --> 0:33:50.163
but it somehow gives you a good starting point.
0:33:50.163 --> 0:33:56.098
So you build your supervised NMT system on the word-translated data, and then you
0:33:56.098 --> 0:33:59.966
use it as the initialization before you do unsupervised NMT.
0:34:00.260 --> 0:34:05.810
And the hope is that when you're doing the back translation, it's a good starting
0:34:05.810 --> 0:34:11.234
point; it's one technique that you can use to improve your unsupervised MT.
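A minimal sketch of this word-by-word initialisation (the dictionary and the whitespace tokenisation are assumptions): translate monolingual sentences token by token with the induced bilingual dictionary to get noisy synthetic parallel data, then train a normal supervised system on it as a warm start.

```python
def word_by_word_corpus(mono_sentences, bilingual_dict):
    pairs = []
    for sent in mono_sentences:
        tokens = sent.split()
        # unknown words are copied through; the result is noisy but usable for warm-up
        translated = [bilingual_dict.get(t, t) for t in tokens]
        pairs.append((" ".join(translated), sent))
    return pairs   # list of (synthetic source, original sentence) pairs
```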
0:34:17.097 --> 0:34:25.879
In the previous case, the way we knew when to stop was to see convergence of the GAN
0:34:25.879 --> 0:34:26.485
training.
0:34:26.485 --> 0:34:28.849
Actually, all we want to know is when W
0:34:28.849 --> 0:34:32.062
converges, which is quite easy to know when
to stop.
0:34:32.062 --> 0:34:37.517
But in a realistic case, we don't have any
parallel data right, so there's no validation.
0:34:37.517 --> 0:34:42.002
Or I mean, we might have test data in the
end, but there's no validation.
0:34:43.703 --> 0:34:48.826
How will we tune our hyper parameters in this
case, because there's really nothing
0:34:48.826 --> 0:34:49.445
for us to validate on?
0:34:50.130 --> 0:34:53.326
There's no gold data, in a sense.
0:34:53.326 --> 0:35:01.187
How do you think we can evaluate such systems
or how can we tune hyper parameters in this?
0:35:11.711 --> 0:35:17.089
So what you're going to do is use the back
translation technique.
0:35:17.089 --> 0:35:24.340
It's like a common technique when you have nothing, okay: use back translation
0:35:24.340 --> 0:35:26.947
somehow, and what you can do is:
0:35:26.947 --> 0:35:31.673
the main idea is to validate on how good the reconstruction is.
0:35:32.152 --> 0:35:37.534
So the idea is that if you have a good system
then the intermediate translation is quite
0:35:37.534 --> 0:35:39.287
good and going back is easy.
0:35:39.287 --> 0:35:44.669
But if it's just noise that you generate in
the forward step then it's really hard to go
0:35:44.669 --> 0:35:46.967
back, which is kind of the main idea.
0:35:48.148 --> 0:35:53.706
So the way it works is that we take a source
sentence, we generate a translation in target
0:35:53.706 --> 0:35:59.082
language, right, and then we translate the generated sentence back and compare it with the
0:35:59.082 --> 0:36:01.342
original one, and if they're close,
0:36:01.841 --> 0:36:09.745
it means that we have a good system, and if they are far apart, not. So this is kind of like an unsupervised
0:36:09.745 --> 0:36:10.334
metric.
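A minimal sketch of this round-trip validation idea; the model interface is hypothetical, and a crude unigram overlap stands in for BLEU here:

```python
def round_trip_score(model_st, model_ts, dev_sentences):
    total = 0.0
    for s in dev_sentences:
        t_hat = model_st.translate(s)        # forward translation
        s_hat = model_ts.translate(t_hat)    # translate back
        ref, hyp = set(s.split()), set(s_hat.split())
        total += len(ref & hyp) / max(len(ref), 1)   # reconstruction overlap
    return total / len(dev_sentences)        # higher = better round-trip reconstruction
```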
0:36:17.397 --> 0:36:21.863
Now, as for the amount of data that you need:
0:36:23.083 --> 0:36:27.995
These were like the first initial results on these systems.
0:36:27.995 --> 0:36:32.108
They wanted to do English and French and they
had fifteen million.
0:36:32.108 --> 0:36:38.003
There were fifteen million monolingual sentences, so it's quite a lot, and they were able to get
0:36:38.003 --> 0:36:40.581
thirty-two BLEU on these kinds of setups.
0:36:41.721 --> 0:36:47.580
But if you have just zero point one million parallel sentences you get the same
0:36:47.580 --> 0:36:48.455
performance.
0:36:48.748 --> 0:36:50.357
So it's a lot of training.
0:36:50.357 --> 0:36:55.960
It's a lot of monolingual data, but monolingual data is relatively easy to obtain; the other point is
0:36:55.960 --> 0:37:01.264
that the training also takes quite a bit longer than for the supervised system, but it's unsupervised,
0:37:01.264 --> 0:37:04.303
so it's kind of the trade off that you are
making.
0:37:07.367 --> 0:37:13.101
The other thing to note is that it's English and French, which fits our assumptions very well.
0:37:13.101 --> 0:37:18.237
Also, the monolingual data that they took
are kind of from similar domains and so on.
0:37:18.638 --> 0:37:27.564
So that's why they're able to build such a
good system, but you'll see later that it fails.
0:37:36.256 --> 0:37:46.888
And so, I mean, what people usually do is first build a system, right, using whatever
0:37:46.888 --> 0:37:48.110
parallel data they have.
0:37:48.608 --> 0:37:55.864
Then they use monolingual data and do back
translation, so this has always been the standard
way to improve, and what people have seen is that you don't even need zero point one
0:38:04.478 --> 0:38:05.360
million right.
0:38:05.360 --> 0:38:10.706
You just need like ten thousand or so on and
then you do the monolingual back translation
0:38:10.706 --> 0:38:12.175
and you're still better.
0:38:12.175 --> 0:38:13.291
than unsupervised MT.
0:38:13.833 --> 0:38:19.534
The question is whether it's really worth trying to do this, or maybe it's always better to find
0:38:19.534 --> 0:38:20.787
some parallel data.
0:38:20.787 --> 0:38:26.113
I'd rather spend a bit of money on getting a few parallel sentences and then use that to start
0:38:26.113 --> 0:38:27.804
and fine-tune to build your system.
0:38:27.804 --> 0:38:33.756
So it was kind of the understanding that bilingual unsupervised systems are not that useful, really.
0:38:50.710 --> 0:38:54.347
The thing is that with unlabeled data there is
0:38:57.297 --> 0:39:05.488
no training signal, so when we are starting, basically what we want to do is first
0:39:05.488 --> 0:39:13.224
get a good translation system and then use the unlabeled monolingual data to improve it.
0:39:13.613 --> 0:39:15.015
But if you start from UNMT, our model might be really bad, like it would be translating things completely wrong.
0:39:20.760 --> 0:39:26.721
And then when you fine-tune on your unlabeled data, it basically might be harmful, or maybe the
0:39:26.721 --> 0:39:28.685
same as the supervised baseline.
0:39:28.685 --> 0:39:35.322
So the idea is, the hope, that by fine-tuning on labeled data first we get a good initialization.
0:39:35.835 --> 0:39:38.404
And then use the unsupervised techniques to
get better.
0:39:38.818 --> 0:39:42.385
But if your starting point is really bad then
it's not going to help.
0:39:45.185 --> 0:39:47.324
Yeah, so as we said before,
0:39:47.324 --> 0:39:52.475
this is kind of how the self-supervised training usually works.
0:39:52.475 --> 0:39:54.773
First we have parallel data.
0:39:56.456 --> 0:39:58.062
Source language is X.
0:39:58.062 --> 0:39:59.668
Target language is Y.
0:39:59.668 --> 0:40:06.018
In the end we want a system that does X to
Y, not Y to X, but first we want to train a
0:40:06.018 --> 0:40:10.543
backward model, that is Y to X, so target language
to source.
0:40:11.691 --> 0:40:17.353
Then we take our monolingual target
sentences, use our backward model to generate
0:40:17.353 --> 0:40:21.471
synthetic source, and then we join them with
our original data.
0:40:21.471 --> 0:40:27.583
So now we have this noisy input, but always
the gold output, which is kind of really important
0:40:27.583 --> 0:40:29.513
when you're doing back translation.
0:40:30.410 --> 0:40:36.992
And then you can concatenate these two datasets and then you can train your X-to-Y translation
0:40:36.992 --> 0:40:44.159
system, and then you can always do this in multiple steps, usually three or four steps, which
0:40:44.159 --> 0:40:48.401
kind of always improves things, and then finally you get your best system.
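A minimal sketch of this standard pipeline run for a few rounds; train and translate are a hypothetical interface standing for any NMT training and inference code:

```python
def backtranslation_pipeline(parallel, mono_tgt, train, rounds=3):
    # parallel: list of (src, tgt) pairs; mono_tgt: list of target-language sentences.
    # train(pairs) is assumed to return a model with a .translate(sentence) method.
    backward = train([(t, s) for s, t in parallel])                 # Y -> X model
    forward = train(parallel)                                       # X -> Y baseline
    for _ in range(rounds):
        # noisy synthetic source, but always the gold target output
        synthetic = [(backward.translate(t), t) for t in mono_tgt]
        forward = train(parallel + synthetic)                       # retrain on joined data
    return forward
```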
0:40:49.029 --> 0:40:54.844
The point that I'm trying to make is that
although the unsupervised NMT scores that I've
0:40:54.844 --> 0:41:00.659
shown before were quite good, you probably
can get the same performance with fifty
0:41:00.659 --> 0:41:06.474
thousand sentences, and also the languages
that they've shown are quite similar and the
0:41:06.474 --> 0:41:08.654
texts were from the same domain.
0:41:14.354 --> 0:41:21.494
So, any questions on UNMT? Okay, yeah.
0:41:22.322 --> 0:41:28.982
So given this fact that back translation was already better than unsupervised NMT, what people have tried
0:41:28.982 --> 0:41:34.660
is to use this idea of multilinguality as you
have seen in the previous lecture.
0:41:34.660 --> 0:41:41.040
The question is how can we do this knowledge
transfer from high-resource languages to low-
0:41:41.040 --> 0:41:42.232
resource languages?
0:41:44.484 --> 0:41:51.074
One way to promote this language independent
representations is to share the encoder and
0:41:51.074 --> 0:41:57.960
decoder for all languages, all their available
languages, and that kind of hopefully enables
0:41:57.960 --> 0:42:00.034
the knowledge transfer.
0:42:03.323 --> 0:42:08.605
When we're doing multilinguality, the two
questions we need to think of are: how does
0:42:08.605 --> 0:42:09.698
the encoder know?
0:42:09.698 --> 0:42:14.495
How does the encoder or decoder know which language we're dealing with?
0:42:15.635 --> 0:42:20.715
You already might have known the answer also,
and the second question is how can we promote
0:42:20.715 --> 0:42:24.139
the encoder to generate language independent
representations?
0:42:25.045 --> 0:42:32.580
By solving these two problems we can take
help of high resource languages to do unsupervised
0:42:32.580 --> 0:42:33.714
translations.
0:42:34.134 --> 0:42:40.997
A typical example would be: you want to do unsupervised MT between English and Dutch, right, but you have
0:42:40.997 --> 0:42:47.369
parallel data between English and German, so
the question is can we use this parallel data
0:42:47.369 --> 0:42:51.501
to help build an unsupervised system between English and Dutch?
0:42:56.296 --> 0:43:01.240
For the first one we try to take help of language
embeddings for tokens, and this kind of is
0:43:01.240 --> 0:43:05.758
a straightforward way to tell the model which language it's dealing with.
0:43:06.466 --> 0:43:11.993
And for the second one we're going to look
at some pre training objectives which are also
0:43:11.993 --> 0:43:17.703
kind of unsupervised so we need monolingual
data mostly and this kind of helps us to promote
0:43:17.703 --> 0:43:20.221
the language independent representation.
0:43:23.463 --> 0:43:29.954
So the first pre-training method that we'll look at is XLM, which is quite famous, if
0:43:29.954 --> 0:43:32.168
you haven't heard of it yet.
0:43:32.552 --> 0:43:40.577
The way it works is that it's basically a transformer encoder, right, so it's like
0:43:40.577 --> 0:43:42.391
just the encoder module.
0:43:42.391 --> 0:43:44.496
No, there's no decoder here.
0:43:44.884 --> 0:43:51.481
And what we're trying to do is mask some tokens in a sequence and try to predict these masked
0:43:51.481 --> 0:43:52.061
tokens.
0:43:52.061 --> 0:43:55.467
This is called masked language modeling.
0:43:55.996 --> 0:44:05.419
The typical language modeling that you see is
causal language modeling, where you predict
0:44:05.419 --> 0:44:08.278
the next token.
0:44:08.278 --> 0:44:11.136
Then we have the position embeddings.
0:44:11.871 --> 0:44:18.774
Then we have the token embeddings, and then
here we have the masked tokens, and then we have
0:44:18.774 --> 0:44:22.378
the transformer encoder blocks to predict the masked tokens.
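A minimal sketch of the masking step behind masked language modeling; the 15% masking rate and the mask token ID are assumptions following the common recipe, not necessarily the exact setup on the slide.

```python
import random

MASK_ID = 4  # assumed ID of the [MASK] token in the vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    # Replace a random subset of tokens with [MASK]; the encoder is trained
    # to predict the original token at exactly those positions.
    corrupted, targets = [], []
    for tok in token_ids:
        if random.random() < mask_prob:
            corrupted.append(MASK_ID)
            targets.append(tok)    # predict this token
        else:
            corrupted.append(tok)
            targets.append(-100)   # conventional "ignore" label for the loss
    return corrupted, targets
```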
0:44:24.344 --> 0:44:30.552
We do this for all languages using the same
transformer encoder, and this helps
0:44:30.552 --> 0:44:36.760
us to push the sentence embeddings, or
the output of the encoder, into a common space
0:44:36.760 --> 0:44:37.726
for multiple languages.
0:44:42.782 --> 0:44:49.294
So first we train an MLM on both the
source and target language sides, and then
0:44:49.294 --> 0:44:54.928
we use it as a starting point for the encoder
and decoder of a UNMT system.
0:44:55.475 --> 0:45:03.175
So we take the monolingual data, build a masked
language model on both source and target languages,
0:45:03.175 --> 0:45:07.346
and then use it to initialize the UNMT system.
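Conceptually, this initialization just copies the pretrained MLM weights into the encoder and decoder wherever the parameter names and shapes match, and leaves everything else randomly initialized. A rough sketch with PyTorch state dicts; the naming scheme is an assumption, not any specific codebase.

```python
def init_from_pretrained(nmt_model, mlm_state_dict):
    # Copy pretrained MLM parameters into the encoder and decoder where the
    # (assumed) parameter names line up; leave the rest, e.g. cross-attention,
    # at its random initialization.
    own = nmt_model.state_dict()
    for name, tensor in mlm_state_dict.items():
        for prefix in ("encoder.", "decoder."):
            key = prefix + name
            if key in own and own[key].shape == tensor.shape:
                own[key] = tensor.clone()
    nmt_model.load_state_dict(own)
```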
0:45:09.009 --> 0:45:14.629
Here we look at two languages, but you can
also do it with one hundred languages at once.
0:45:14.629 --> 0:45:20.185
So there are pretrained checkpoints that you can
use, which have seen quite
0:45:20.185 --> 0:45:21.671
a lot of data, and you can use
0:45:21.671 --> 0:45:24.449
them as a starting point for your
UNMT system, which in practice works well.
0:45:31.491 --> 0:45:36.759
One detail is that since this is an encoder
block only, and your UNMT
0:45:37.446 --> 0:45:40.347
system is encoder-decoder,
0:45:40.347 --> 0:45:47.524
there's this cross-attention that is missing,
but you can always initialize that randomly.
0:45:47.524 --> 0:45:48.364
It's fine.
0:45:48.508 --> 0:45:53.077
Not everything is initialized, but it's still
decent.
0:45:56.056 --> 0:46:02.141
Then the other one is mBART,
and here you see that this builds on
0:46:02.141 --> 0:46:07.597
the unsupervised training objective, which
is denoising autoencoding.
0:46:08.128 --> 0:46:14.337
So what they do is they say that we don't
even need to do the iterative back-translation
0:46:14.337 --> 0:46:17.406
during pre-training; you can do it later.
0:46:17.406 --> 0:46:24.258
For pre-training we just do denoising auto-encoding
on all the different languages, and that also gives
0:46:24.258 --> 0:46:32.660
you good performance out of the box, so what
we basically have here is the transformer encoder.
0:46:34.334 --> 0:46:37.726
You are trying to generate a reconstructed
sequence,
0:46:37.726 --> 0:46:38.942
so you need a decoder as well.
0:46:39.899 --> 0:46:42.022
So we give an input sentence.
0:46:42.022 --> 0:46:48.180
We try to predict the masked tokens, or
rather we try to reconstruct the original
0:46:48.180 --> 0:46:52.496
sentence from the input sequence, which was
corrupted.
0:46:52.496 --> 0:46:57.167
So this is the same denoising objective that
you have seen before.
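As an illustration of the corruption used by this denoising objective, here is a small sketch that masks one random span of the input; the exact noise functions (span masking, sentence permutation, and so on) differ between papers, so treat the details as assumptions.

```python
import random

def corrupt(tokens, span_frac=0.3, mask_token="<mask>"):
    # Replace one contiguous span (roughly span_frac of the tokens) with a
    # single mask token; the model must reconstruct the original sequence.
    n = len(tokens)
    span_len = max(1, int(n * span_frac))
    start = random.randint(0, max(0, n - span_len))
    return tokens[:start] + [mask_token] + tokens[start + span_len:]
```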
0:46:58.418 --> 0:46:59.737
This is for English.
0:46:59.737 --> 0:47:04.195
I think this is for Japanese, and then we
do it for all languages.
0:47:04.195 --> 0:47:09.596
I mean, they have different versions with twenty-five,
fifty languages and so on, and then you can fine-tune
0:47:09.596 --> 0:47:11.794
on your sentence- or document-level task.
0:47:13.073 --> 0:47:20.454
And they did this for the supervised
setups, but you can also use this as an initialization
0:47:20.454 --> 0:47:25.058
for unsupervised MT and build on that, which also
works in practice.
0:47:30.790 --> 0:47:36.136
Then we have these; so until now we still
haven't really seen this benefit from the
0:47:36.136 --> 0:47:38.840
high-resource language, right, as I said.
0:47:38.878 --> 0:47:44.994
For example, you can use English-German to help English
to Dutch, and if you want English to Catalan, you
0:47:44.994 --> 0:47:46.751
can use English to French.
0:47:48.408 --> 0:47:55.866
One typical way to do this is to use pivot
translation, where you go through a pivot language.
0:47:55.795 --> 0:48:01.114
So here it's Finnish to Greek: you take
your translation, say, from Finnish to English and then
0:48:01.114 --> 0:48:03.743
English to Greek, and then you get the translation.
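Pivot translation is essentially just chaining two supervised systems, as in this minimal sketch; the two translation functions are placeholders for whatever Finnish-English and English-Greek systems you have.

```python
def pivot_translate(sentence, src_to_pivot, pivot_to_tgt):
    # e.g. Finnish -> English -> Greek: route through a high-resource
    # pivot language such as English.
    pivot_sentence = src_to_pivot(sentence)   # Finnish -> English
    return pivot_to_tgt(pivot_sentence)       # English -> Greek
```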
0:48:04.344 --> 0:48:10.094
What's important is that you have these different
techniques and you can always think of which
0:48:10.094 --> 0:48:12.333
one to use given the data situation.
0:48:12.333 --> 0:48:18.023
So if it is, say, Finnish to Greek, maybe pivoting is
better, because you might get good Finnish
0:48:18.023 --> 0:48:20.020
to English and English to Greek systems.
0:48:20.860 --> 0:48:23.255
Sometimes it also depends on the language
pair.
0:48:23.255 --> 0:48:27.595
There might be some information loss and so
on, so there are quite a few variables you
0:48:27.595 --> 0:48:30.039
need to think of and decide which system to
use.
0:48:32.752 --> 0:48:39.654
Then there's zero-shot translation, which you've probably
also seen in the multilingual lecture, and
0:48:39.654 --> 0:48:45.505
if you can improve the language independence,
then your zero-shot performance gets better.
0:48:45.505 --> 0:48:52.107
So maybe if you use the multilingual models
and do zero shot directly, it's quite good.
0:48:53.093 --> 0:48:58.524
So we have zero-shot and pivot translation, and then
we have the unsupervised translation, where
0:48:58.524 --> 0:49:00.059
we can translate between two languages
0:49:00.600 --> 0:49:02.762
even when there is no parallel data.
0:49:06.686 --> 0:49:07.565
So, to sum up what we have seen so far: basically,
0:49:15.255 --> 0:49:16.754
just from looking at
0:49:16.836 --> 0:49:19.307
these two monolingual files alone you can create a dictionary.
0:49:19.699 --> 0:49:26.773
You can build an unsupervised MT system, not
always, but if the domains are similar and the
0:49:26.773 --> 0:49:28.895
languages are similar.
0:49:28.895 --> 0:49:36.283
But if they are distant languages, then
unsupervised MT usually doesn't work really
0:49:36.283 --> 0:49:36.755
well.
0:49:37.617 --> 0:49:40.297
What you could do
0:49:40.720 --> 0:49:46.338
would be: if you can get some parallel
data from somewhere, or do bitext mining as
0:49:46.338 --> 0:49:51.892
we have seen in the LASER practical,
then you can use that to initialize your
0:49:51.892 --> 0:49:57.829
system and then train a semi-supervised
NMT system, and that would be better than
0:49:57.829 --> 0:50:00.063
just building an unsupervised one.
0:50:00.820 --> 0:50:06.546
With that, we're at the end.
0:50:07.207 --> 0:50:08.797
Any quick questions?
0:50:16.236 --> 0:50:25.070
(Audience question; not clearly intelligible in the recording.)
0:50:56.916 --> 0:51:03.798
They are trained on next-token prediction, and this somehow gives them
many abilities, not only translation but other
0:51:03.798 --> 0:51:08.062
than that there are quite a few things that
they can do.
0:51:10.590 --> 0:51:17.706
But the translation in itself usually doesn't
work really well compared to a system
0:51:17.706 --> 0:51:20.878
you build specifically for your use case.
0:51:22.162 --> 0:51:27.924
I would guess that it's usually better than
the LLM, but you can always adapt the LLM to
0:51:27.924 --> 0:51:31.355
the task that you want, and then it could be
better.
0:51:32.152 --> 0:51:37.849
An LLM out of the box might not be the
best choice for your task.
0:51:37.849 --> 0:51:44.138
For me, I'm working on UI translation,
so it's more about translating software.
0:51:45.065 --> 0:51:50.451
And it's quite a niche domain as well,
and if you use the LLM out of the box, they're
0:51:50.451 --> 0:51:53.937
actually quite bad compared to the systems
that we built.
0:51:54.414 --> 0:51:56.736
But you can do these different techniques
like prompting.
0:51:57.437 --> 0:52:03.442
What people usually do is hard prompting,
where they give similar translation pairs in
0:52:03.442 --> 0:52:08.941
the prompt and then ask it to translate, and
that kind of improves the performance a lot.
0:52:09.383 --> 0:52:15.135
So there are different techniques that you
can use to adapt your LLMs, and then it might
0:52:15.135 --> 0:52:16.399
be better than the task-specific system.
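A sketch of what such a prompt could look like; the retrieval function, the language pair, and the exact wording are assumptions for illustration, not a prescribed recipe.

```python
def build_prompt(source_sentence, retrieve_similar_pairs, k=3):
    # retrieve_similar_pairs is a placeholder for fuzzy matching against a
    # translation memory; it returns k (source, target) example pairs.
    examples = retrieve_similar_pairs(source_sentence, k)
    lines = ["Translate from English to German."]
    for src, tgt in examples:
        lines.append(f"English: {src}\nGerman: {tgt}")
    lines.append(f"English: {source_sentence}\nGerman:")
    return "\n\n".join(lines)
```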
0:52:18.418 --> 0:52:22.857
But if you're looking for niche things, I
don't think LLMs are that good.
0:52:22.857 --> 0:52:26.309
But if you want to do, let's say, unsupervised
translation.
0:52:26.309 --> 0:52:30.036
In this case you can never be sure that they
haven't seen the data.
0:52:30.036 --> 0:52:35.077
First of all, whether they have seen data in
that language or not; and if the data is publicly available,
0:52:35.077 --> 0:52:36.831
they probably did see it.
0:52:40.360 --> 0:53:00.276
I feel like they have a pretty good understanding
of many languages.
0:53:04.784 --> 0:53:09.059
It depends on the language, but I would be pretty surprised
if it works on a low-resource language.
0:53:09.059 --> 0:53:11.121
I would expect it to work on German and so on.
0:53:11.972 --> 0:53:13.633
But if you take a low-resource language,
0:53:14.474 --> 0:53:20.973
I don't think it works, and also there are quite
a few papers which have already shown that
0:53:20.973 --> 0:53:27.610
if you build a system yourself in the typical
way, it's quite a
0:53:27.610 --> 0:53:29.338
bit better than the LLM.
0:53:29.549 --> 0:53:34.883
But you can always do things with LLMs to
get better results.
0:53:37.557 --> 0:53:39.539
Any more questions?
0:53:41.421 --> 0:53:47.461
So if not, then we're going to end the lecture
here, and then on Thursday we're going to have
0:53:47.461 --> 0:53:51.597
document-level MT, which is also given by me, so
thanks for coming.
|