WEBVTT
0:00:01.121 --> 0:00:14.214
Okay, so welcome to today's lecture, on Tuesday
we started to talk about speech translation.
0:00:14.634 --> 0:00:27.037
And hopefully you got an idea of the basic
ideas we have in speech translation, the two
0:00:27.037 --> 0:00:29.464
major approaches.
0:00:29.829 --> 0:00:41.459
And the other one is the end-to-end system where
we have one large model which does everything
0:00:41.459 --> 0:00:42.796
together.
0:00:43.643 --> 0:00:58.459
Until now we mainly focused on text output, as
we'll see today, but you can extend these ideas
0:00:58.459 --> 0:01:01.138
to speech output as well.
0:01:01.441 --> 0:01:08.592
But since it's also a machine translation
lecture, we of course mainly focus on
0:01:08.592 --> 0:01:10.768
the translation challenges.
0:01:12.172 --> 0:01:25.045
And the main focus of today's lecture
is to look into what is challenging about speech
0:01:25.045 --> 0:01:26.845
translation.
0:01:27.627 --> 0:01:33.901
So a bit more focus on what is really
the difference to text translation and how we can address it.
0:01:34.254 --> 0:01:39.703
We'll start with the segmentation
problem.
0:01:39.703 --> 0:01:45.990
We touched on that already a bit, but it's especially
important for end-to-end systems.
0:01:46.386 --> 0:01:57.253
So the problem is that until now it was easy
to segment the input into sentences and then
0:01:57.253 --> 0:02:01.842
translate each sentence individually.
0:02:02.442 --> 0:02:17.561
When you're now translating audio, the challenge
is that you have just a sequence of audio input
0:02:17.561 --> 0:02:20.055
and there's no explicit sentence boundary.
0:02:21.401 --> 0:02:27.834
So you have this difference that your audio
is a continuous stream, but the text is typically
0:02:27.834 --> 0:02:28.930
sentence based.
0:02:28.930 --> 0:02:31.667
So how can you bridge this gap?
0:02:31.667 --> 0:02:37.690
We'll see that this is really essential, and if
you're not using a good segmentation system there,
0:02:37.690 --> 0:02:41.249
then you can lose a lot of quality and performance.
0:02:41.641 --> 0:02:44.267
That is what I also meant before.
0:02:44.267 --> 0:02:51.734
So if you have a more complex system built out of
several components, it's really essential that they
0:02:51.734 --> 0:02:56.658
all work together, and it's very easy to lose
significant quality.
0:02:57.497 --> 0:03:13.029
The second challenge we'll talk about is disfluencies,
so the style of speaking is very different
0:03:13.029 --> 0:03:14.773
from text.
0:03:15.135 --> 0:03:24.727
So if you translate TED talks, those are normally
very good speakers.
0:03:24.727 --> 0:03:30.149
They will give you a very fluent text.
0:03:30.670 --> 0:03:36.692
When you want to translate a lecture, it might
be more difficult, or less rehearsed.
0:03:37.097 --> 0:03:39.242
I mean, people are not prepared that well.
0:03:39.242 --> 0:03:42.281
Of course they prepare for giving the
lecture.
0:03:42.362 --> 0:03:48.241
But it's not that, I mean, a lecturer typically
rehearses like five times before
0:03:48.241 --> 0:03:52.682
giving this lecture so that it will
be completely fluent.
0:03:52.682 --> 0:03:56.122
He might at some point notice that something is
not perfect.
0:03:56.122 --> 0:04:00.062
He wants to rephrase, and he'll have to think
during the lecture.
0:04:00.300 --> 0:04:04.049
It might also be good that he's thinking, so
he's not going too fast, and things like that.
0:04:05.305 --> 0:04:07.933
If you then go to the other extreme, it's
more meetings.
0:04:08.208 --> 0:04:15.430
If you have a lively discussion, of course,
people will interrupt, they will restart, they
0:04:15.430 --> 0:04:22.971
will think while they speak, and you know that
sometimes you tell people: first think, then speak,
0:04:22.971 --> 0:04:26.225
because they are changing their opinion.
0:04:26.606 --> 0:04:31.346
So the question is how you can deal with this.
0:04:31.346 --> 0:04:37.498
And there again there might be solutions for
that, or at least ways to mitigate it.
0:04:39.759 --> 0:04:46.557
Then for the output we will look into simultaneous
translation, something that is not very important
0:04:46.557 --> 0:04:47.175
in text.
0:04:47.175 --> 0:04:53.699
There might be some cases but normally you
have all text available and then you're translating
0:04:53.699 --> 0:04:54.042
it.
0:04:54.394 --> 0:05:09.220
While for speech translation, since it's often
a live interaction, it is of course important.
0:05:09.149 --> 0:05:12.378
Otherwise it's hard to follow.
0:05:12.378 --> 0:05:19.463
If you only see what was said five minutes ago, when the
slide has already changed, it's not as helpful.
0:05:19.739 --> 0:05:35.627
You have to wait very long before you can
answer because you have to first wait for what
0:05:35.627 --> 0:05:39.197
is happening there.
0:05:40.660 --> 0:05:46.177
And finally, we can talk a bit about presentation.
0:05:46.177 --> 0:05:54.722
For example, I mentioned that if you're generating
subtitles, you can't just show arbitrarily long text.
0:05:54.854 --> 0:06:01.110
So in professional subtitles there are clear
rules.
0:06:01.110 --> 0:06:05.681
A subtitle has to be shown for a certain number of seconds.
0:06:05.681 --> 0:06:08.929
It's a maximum of two lines.
0:06:09.549 --> 0:06:13.156
Because otherwise it gets too long, you're
not able to read it anymore, and so on.
0:06:13.613 --> 0:06:19.826
So if you want to achieve that, of course,
you might have to adjust and select what you
0:06:19.826 --> 0:06:20.390
really want to show.
0:06:23.203 --> 0:06:28.393
Let's start with the segmentation.
0:06:28.393 --> 0:06:36.351
On the one hand it's an issue during training,
on the other hand during inference.
0:06:38.678 --> 0:06:47.781
What is the problem? When we train, it's
relatively easy to separate our data into sentence
0:06:47.781 --> 0:06:48.466
level.
0:06:48.808 --> 0:07:02.241
So if you have your example, you have the
audio and the text, then you typically know
0:07:02.241 --> 0:07:07.083
which sentence is aligned to which audio span.
0:07:07.627 --> 0:07:16.702
You can use this time information to cut
your audio, and then you can train on that.
0:07:18.018 --> 0:07:31.775
Because what we need for an end-to-end model
is an input-output pair, in this case an audio
0:07:31.775 --> 0:07:32.822
chunk and its translation.
0:07:33.133 --> 0:07:38.551
And even if this is a long speech, it's easy
then since we have this time information to
0:07:38.551 --> 0:07:39.159
separate.
0:07:39.579 --> 0:07:43.866
But we are using therefore, of course, the
target side information.
0:07:45.865 --> 0:07:47.949
The problem is now at runtime.
0:07:47.949 --> 0:07:49.427
This is not possible.
0:07:49.427 --> 0:07:55.341
In training we can do that based on the punctuation
marks and the sentence segmentation on the
0:07:55.341 --> 0:07:57.962
target side, because that gives us the splits.
0:07:57.962 --> 0:08:02.129
But during transcription and translation
this is not possible.
0:08:02.442 --> 0:08:10.288
Because there is just a long audio signal,
and of course if you have your test data to
0:08:10.288 --> 0:08:15.193
split it into sentences manually: that has been done
for some experiments.
0:08:15.193 --> 0:08:22.840
It's fine, but it's not a realistic scenario
because if you really apply it in the real world,
0:08:22.840 --> 0:08:25.949
we won't have a manual segmentation.
0:08:26.266 --> 0:08:31.838
If a human has to do that, then he could just as well do the
translation, so you want to have a fully automatic
0:08:31.838 --> 0:08:32.431
pipeline.
0:08:32.993 --> 0:08:38.343
So the question is how we can deal with this
type of situation.
0:09:09.309 --> 0:09:20.232
So the question is how we can deal with this
type of situation, and how we can segment the
0:09:20.232 --> 0:09:23.024
audio into some units?
0:09:23.863 --> 0:09:32.495
And here is one further really big advantage
of a cascaded sauce: Because how is this done
0:09:32.495 --> 0:09:34.259
in a cascade of systems?
0:09:34.259 --> 0:09:38.494
We are splitting the audio with some features
we are doing.
0:09:38.494 --> 0:09:42.094
We can use similar ones which we'll discuss
later.
0:09:42.094 --> 0:09:43.929
Then we run our speech recognition.
0:09:43.929 --> 0:09:48.799
We have the transcript, and then we can do
what we talked about last time.
0:09:49.069 --> 0:10:02.260
So this is an audio signal, and we can re-segment
the transcript as it was done in the training data.
0:10:02.822 --> 0:10:07.951
So here we have a big advantage.
0:10:07.951 --> 0:10:16.809
We can use a different segmentation for the
speech recognition and for the machine translation.
0:10:16.809 --> 0:10:21.316
Why is that a big advantage?
0:10:23.303 --> 0:10:34.067
One would say for the MT task it is more important,
because there we can then do the sentence segmentation.
0:10:34.955 --> 0:10:37.603
Yeah, we can do the same thing.
0:10:37.717 --> 0:10:40.226
So why is it not as important for
the ASR?
0:10:40.226 --> 0:10:40.814
Any ideas, maybe?
0:10:43.363 --> 0:10:48.589
We don't need that much context.
0:10:48.589 --> 0:11:01.099
We only try to predict the words, and the
context to consider is mainly small.
0:11:03.283 --> 0:11:11.419
I would agree that more context can help, but there
is one more important point:
0:11:11.651 --> 0:11:16.764
The ASR is monotone, so there's no reordering.
0:11:16.764 --> 0:11:22.472
The second part of the signal is never output first.
0:11:22.472 --> 0:11:23.542
We have this monotonicity.
0:11:23.683 --> 0:11:29.147
And of course, if we are segmenting, we cannot
reorder across boundaries between segments.
0:11:29.549 --> 0:11:37.491
It might be challenging for the ASR if we split
within words, so the segmentation isn't perfect for it either.
0:11:37.637 --> 0:11:40.846
But for MT we may need to do quite long-range reordering.
0:11:40.846 --> 0:11:47.058
If you think about German, where the verb
has moved: the English verb is in one
0:11:47.058 --> 0:11:50.198
part, but the end of the German sentence is in another.
0:11:50.670 --> 0:11:59.427
And of course we have this advantage here:
the MT segmentation doesn't have to be the ASR segmentation.
0:12:01.441 --> 0:12:08.817
That this segmentation is important:
0:12:08.817 --> 0:12:15.294
here is some motivation for that.
0:12:15.675 --> 0:12:25.325
What you are doing is taking the reference
text and segmenting the audio based on it.
0:12:26.326 --> 0:12:30.991
And then, of course, your segments exactly
match the reference.
0:12:31.471 --> 0:12:42.980
If you're now using different segmentation
strategies, you're losing significantly, several BLEU
0:12:42.980 --> 0:12:44.004
points.
0:12:44.004 --> 0:12:50.398
If the segmentation is bad, your results are a lot
worse.
0:12:52.312 --> 0:13:10.323
And interestingly, here you can see on one side the
human segmentation, and what people achieved in a competition.
0:13:10.450 --> 0:13:22.996
You can see that by working on the segmentation
and using better segmentation you can improve
0:13:22.996 --> 0:13:25.398
your performance.
0:13:26.006 --> 0:13:29.932
So it's really essential.
0:13:29.932 --> 0:13:41.712
One other interesting thing is if you're looking
into the difference between cascaded and end-to-end systems.
0:13:42.082 --> 0:13:49.145
So it really seems to be more important to
have a good segmentation for an end-to-end system,
0:13:49.109 --> 0:13:56.248
because there you
can't re-segment, while it is less important
0:13:56.248 --> 0:13:58.157
for a cascaded system.
0:13:58.157 --> 0:14:05.048
Of course, it's still important, but the difference
between the two segmentations is smaller.
0:14:06.466 --> 0:14:18.391
This was a shared task some years ago, with
systems from different participants.
0:14:22.122 --> 0:14:31.934
So the question is how can we deal with this
in speech translation and what people look
0:14:31.934 --> 0:14:32.604
into?
0:14:32.752 --> 0:14:48.360
Now we want to use different techniques to
split the audio signal into segments.
0:14:48.848 --> 0:14:54.413
For end-to-end you have the disadvantage that you can't change
it later.
0:14:54.413 --> 0:15:00.407
Therefore, the segmentation quality might be even more
important.
0:15:00.660 --> 0:15:15.678
But in both cases, of course, the results are
better if you have a good segmentation.
0:15:17.197 --> 0:15:23.149
So, any idea: how would you approach this task
of splitting the audio?
0:15:23.149 --> 0:15:26.219
What type of tool would you use?
0:15:28.648 --> 0:15:41.513
You could use a neural network to segment,
for instance supervised.
0:15:41.962 --> 0:15:44.693
Yes, that's exactly the better system already.
0:15:44.693 --> 0:15:50.390
So for a long time people have done more simple
things because, we'll come to that, it's a bit challenging
0:15:50.390 --> 0:15:52.250
to create or have the data.
0:15:53.193 --> 0:16:00.438
The first thing is you use some tool out of
the box like voice activity detection which
0:16:00.438 --> 0:16:07.189
has been a whole research field of its own, where
people detect when somebody's speaking.
0:16:07.647 --> 0:16:14.952
And then you use that with some threshold:
you always have the probability that somebody's
0:16:14.952 --> 0:16:16.273
speaking or not.
0:16:17.217 --> 0:16:19.889
Then you split your signal.
0:16:19.889 --> 0:16:26.762
It will not be perfect, but you transcribe
or translate each component.
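As a rough illustration of this tool-based splitting, here is a minimal, hypothetical energy-threshold VAD sketch in Python; the frame size and threshold are assumptions, not values from the lecture, and a real detector would use a trained model.

```python
import numpy as np

def energy_vad(samples, rate=16000, frame_ms=30, threshold=1e-4):
    """Label each frame as speech/non-speech by mean energy.

    A toy stand-in for a real voice activity detector: samples is
    assumed to be a mono float array at `rate` Hz.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return [
        float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)) > threshold
        for i in range(n_frames)
    ]
```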
0:16:28.508 --> 0:16:39.337
But as you see, a supervised classification
task is even better, and that is now the most
0:16:39.337 --> 0:16:40.781
common use.
0:16:41.441 --> 0:16:49.909
The idea is to treat it as a supervised
classification, and then you try to use this at test
0:16:49.909 --> 0:16:50.462
time.
0:16:50.810 --> 0:16:53.217
We're going into a bit more detail on how
to do that.
0:16:53.633 --> 0:17:01.354
So what you need to do first is, of course,
you have to have some labels whether this is
0:17:01.354 --> 0:17:03.089
an end of sentence.
0:17:03.363 --> 0:17:10.588
You do that by using the alignment between
the segments and the audio.
0:17:10.588 --> 0:17:12.013
You have the time stamps.
0:17:12.212 --> 0:17:15.365
Typically you do not have it for each word, so
not time steps like:
0:17:15.365 --> 0:17:16.889
This word is said this time.
0:17:17.157 --> 0:17:27.935
What you typically have is: from this time
to this time is the first segment,
0:17:27.935 --> 0:17:34.654
from this time to this time the second segment.
0:17:35.195 --> 0:17:39.051
This is also what is used to train, for example,
your ASR system.
0:17:41.661 --> 0:17:53.715
Based on that you can label each frame in
there: if it is green or blue, that is
0:17:53.715 --> 0:17:57.455
inside a speech segment.
0:17:58.618 --> 0:18:05.690
And these labels will then later help you
to extract exactly these types of segments.
0:18:07.067 --> 0:18:08.917
There's one big challenge.
0:18:08.917 --> 0:18:15.152
If you have two sentences which are directly
connected to each other, then if you're doing
0:18:15.152 --> 0:18:18.715
this labeling, you would not have a break in
there later.
0:18:18.715 --> 0:18:23.512
If you then try to extract segments, there should
be something to break on.
0:18:23.943 --> 0:18:31.955
So what you typically do is take the last frame:
0:18:31.955 --> 0:18:41.331
You mark as outside, although it's not really
outside.
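To make this concrete, here is a small sketch (a hypothetical helper, not from the lecture) that converts aligned (start, end) segment times into per-frame labels, marking each segment's final frame as outside so adjacent sentences stay separable:

```python
def frames_to_labels(segments, n_frames, frame_ms=10):
    """Turn aligned (start_sec, end_sec) segments into per-frame 0/1 labels."""
    labels = [0] * n_frames
    for start, end in segments:
        first = int(start * 1000 / frame_ms)
        last = min(int(end * 1000 / frame_ms), n_frames - 1)
        for i in range(first, last + 1):
            labels[i] = 1
        labels[last] = 0  # final frame marked "outside" to force a break
    return labels
```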
0:18:43.463 --> 0:18:46.882
Yes, I guess you could also do that with more
of a BIO scheme.
0:18:46.882 --> 0:18:48.702
I mean, this is the most simple.
0:18:48.702 --> 0:18:51.514
It's like inside outside, so it's related
to that.
0:18:51.514 --> 0:18:54.988
Of course, you could have an extra start-of-segment
label, and so on.
0:18:54.988 --> 0:18:57.469
I guess this is just to make it more simple.
0:18:57.469 --> 0:19:00.226
You only have two labels, not a three-class problem.
0:19:00.226 --> 0:19:02.377
But yeah, you could do similar things.
0:19:12.432 --> 0:19:20.460
Could that cause problems down the road? Because
it could be an important part of a segment
0:19:20.460 --> 0:19:24.429
which has some meaning, and we drop something.
0:19:24.429 --> 0:19:28.398
The good thing is frames are normally very short.
0:19:28.688 --> 0:19:37.586
Like some milliseconds, so normally if you
remove some milliseconds you can still understand
0:19:37.586 --> 0:19:38.734
everything.
0:19:38.918 --> 0:19:46.999
I mean, the speech signal is very redundant,
and so you have information a lot of times.
0:19:47.387 --> 0:19:50.730
That's why, as we talked about last time, you
can try to shrink the sequence.
0:19:51.031 --> 0:20:00.995
But if you now have a short sound which would
be removed, isn't that a problem,
0:20:00.995 --> 0:20:01.871
really?
0:20:02.162 --> 0:20:06.585
Yeah, but it's not that a full letter is missing.
0:20:06.585 --> 0:20:11.009
It's only the last part of the vowel.
0:20:11.751 --> 0:20:15.369
I think it doesn't really hurt.
0:20:15.369 --> 0:20:23.056
We have our audio signal and we have these
gaps that are not labeled as speech.
0:20:23.883 --> 0:20:29.288
The blue rectangles are inside speech
segments, and the gaps are outside, yes.
0:20:29.669 --> 0:20:35.736
So then you have the full signal, and you're
now framing the labeling task as a blue or
0:20:35.736 --> 0:20:36.977
white prediction.
0:20:36.977 --> 0:20:39.252
So that is your prediction task.
0:20:39.252 --> 0:20:44.973
You have the audio signal only and your prediction
task is like label one or zero.
0:20:45.305 --> 0:20:55.585
Once you do that then based on this labeling
you can extract each segment again like each
0:20:55.585 --> 0:20:58.212
consecutive blue area.
0:20:58.798 --> 0:21:05.198
You then maybe already remove the non-speech parts
and do speech translation only on
0:21:05.198 --> 0:21:05.998
the parts.
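A tiny sketch of that extraction step (again a hypothetical helper): collect every maximal run of speech-labeled frames into a (start, end) segment:

```python
def labels_to_segments(labels, frame_ms=10):
    """Collect maximal runs of 1-labels into (start_sec, end_sec) segments."""
    segments, start = [], None
    for i, is_speech in enumerate(list(labels) + [0]):  # sentinel closes a final run
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    return segments
```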
0:21:06.786 --> 0:21:19.768
Which is good, because in training you
did the same.
0:21:20.120 --> 0:21:26.842
So the noise in between you never saw in
the training, so it's good to throw it away.
0:21:29.649 --> 0:21:34.930
One challenge, of course, is now if you're
doing that, what is your input?
0:21:34.930 --> 0:21:40.704
You cannot do the sequence labeling normally
on the whole talk, so it's too long.
0:21:40.704 --> 0:21:46.759
So if you're doing this prediction of the
label, you also have a window for which you
0:21:46.759 --> 0:21:48.238
do the segmentation.
0:21:48.788 --> 0:21:54.515
And that's the same as the baseline we had in the punctuation
prediction.
0:21:54.515 --> 0:22:00.426
If we don't have good borders, random splits
are normally used.
0:22:00.426 --> 0:22:03.936
So what we do now is split the audio randomly into windows.
0:22:04.344 --> 0:22:09.134
So that would be our input, and these parts
would be our labels.
0:22:09.269 --> 0:22:15.606
This green would be the input and here we
want, for example, blue labels and then white.
0:22:16.036 --> 0:22:20.360
Here only do labors and here at the beginning
why maybe at the end why.
0:22:21.401 --> 0:22:28.924
So thereby you always have a fixed window
for which you're doing this prediction task.
0:22:33.954 --> 0:22:43.914
How do you build your classifier? That is
based again on a pretrained audio model.
0:22:43.914 --> 0:22:52.507
We had this wav2vec model mentioned last week.
0:22:52.752 --> 0:23:00.599
So in training you use labels to say whether
it's in speech or outside speech.
0:23:01.681 --> 0:23:17.740
In inference, you give it always a chunk
and then predict whether each part, like each
0:23:17.740 --> 0:23:20.843
frame, is speech.
0:23:23.143 --> 0:23:29.511
It's a bit more complicated: one challenge is,
if you randomly split, you're losing
0:23:29.511 --> 0:23:32.028
your context for the first frame.
0:23:32.028 --> 0:23:38.692
It might be very hard to predict whether this
is now inside or outside speech, and the same for the last frame.
0:23:39.980 --> 0:23:48.449
You often need a bit of context to tell whether this
is speech or not, and at the beginning that context is missing.
0:23:49.249 --> 0:23:59.563
So what you do is you put the audio in twice.
0:23:59.563 --> 0:24:08.532
You do it with two different splits.
0:24:08.788 --> 0:24:15.996
As shown, you have two shifted offsets,
so each position is also predicted with the other offset.
0:24:16.416 --> 0:24:23.647
And then averaging the probabilities so that
at each time you have, at least for one of
0:24:23.647 --> 0:24:25.127
the predictions, enough context.
0:24:25.265 --> 0:24:36.326
Because at the end of a segment it might
be very hard to predict whether this is now
0:24:36.326 --> 0:24:39.027
speech or nonspeech.
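A compact sketch of this two-offset inference; the `model` callable and window size are placeholders, not a real API:

```python
import numpy as np

def speech_probs_two_offsets(frames, model, window=2000):
    """Average per-frame speech probabilities over two shifted windowings.

    Shifting the second pass by half a window means every frame is far
    from a window border in at least one of the two passes.
    """
    n = len(frames)
    sums, counts = np.zeros(n), np.zeros(n)
    for offset in (0, window // 2):
        for start in range(offset, n, window):
            probs = model(frames[start:start + window])  # one prob per frame
            sums[start:start + len(probs)] += probs
            counts[start:start + len(probs)] += 1
    return sums / np.maximum(counts, 1)
```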
0:24:39.939 --> 0:24:47.956
I think it is a hyperparameter, but you are
not optimizing it, so you just take two shifts.
0:24:48.328 --> 0:24:54.636
You could of course try a lot of different shifts and
so on.
0:24:54.636 --> 0:24:59.707
The thing is, it's mainly a problem at the borders.
0:24:59.707 --> 0:25:04.407
If you don't do two offsets, you have no context there.
0:25:05.105 --> 0:25:14.761
You could get better by doing more, but I would
be skeptical whether it really matters, and I also
0:25:14.761 --> 0:25:18.946
have not seen any experiments doing that.
0:25:19.159 --> 0:25:27.629
I guess with two shifts you're already good; you maybe
have some errors in there, but you're getting most of the benefit.
0:25:31.191 --> 0:25:37.824
So with this you have your segmentation.
0:25:37.824 --> 0:25:44.296
However, there is a problem in between.
0:25:44.296 --> 0:25:49.150
Once the model is wrong, the segments can be bad.
0:25:49.789 --> 0:26:01.755
The normal, first thing would be
that you take some threshold and that you
0:26:01.755 --> 0:26:05.436
label everything above it as speech.
0:26:06.006 --> 0:26:19.368
The problem when you are just using this
one threshold is that you might get very long or very short segments.
0:26:19.339 --> 0:26:23.954
Those are the challenges.
0:26:23.954 --> 0:26:31.232
Short segments mean you have no context.
0:26:31.232 --> 0:26:35.492
The quality will be bad.
0:26:37.077 --> 0:26:48.954
Therefore, people use this probabilistic divide-and-
conquer algorithm, so the main idea is to start
0:26:48.954 --> 0:26:56.744
with the whole audio, and you split it where
the speech probability is lowest.
0:26:57.397 --> 0:27:09.842
Then you split there and then you continue
until each segment is smaller than the maximum
0:27:09.842 --> 0:27:10.949
length.
0:27:11.431 --> 0:27:23.161
But you can ignore some splits, and if you
split one segment into two parts you first
0:27:23.161 --> 0:27:23.980
trim it.
0:27:24.064 --> 0:27:40.197
So normally it's not only one signal position,
it's a longer area of non-voice, so you try
0:27:40.197 --> 0:27:43.921
to find this longer pause.
0:27:43.943 --> 0:27:51.403
Now your large segment is split into two smaller
segments.
0:27:51.403 --> 0:27:56.082
Now you are checking these segments.
0:27:56.296 --> 0:28:04.683
So if one of them is very, very short, it might
be better not to split at this point, because you're
0:28:04.683 --> 0:28:05.697
ending up with unusable segments.
0:28:06.006 --> 0:28:09.631
And this way you continue all the time, and
then hopefully you'll have good segments.
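A minimal recursive sketch of this idea, under my own simplifying assumptions: the trimming of longer non-speech regions is omitted, and `max_len`/`min_len` are frame counts:

```python
def split_segment(probs, start, end, max_len, min_len):
    """Recursively split [start, end) at the frame least likely to be speech."""
    if end - start <= max_len or end - start <= 2 * min_len:
        return [(start, end)]
    # Split where the model is most confident there is no speech,
    # but never closer than min_len to either edge.
    cut = min(range(start + min_len, end - min_len), key=lambda i: probs[i])
    return (split_segment(probs, start, cut, max_len, min_len)
            + split_segment(probs, cut, end, max_len, min_len))
```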
0:28:10.090 --> 0:28:19.225
So, of course, there's one challenge with
this approach if you think about the later topic,
0:28:19.225 --> 0:28:20.606
low latency.
0:28:25.405 --> 0:28:31.555
So in this case you have to have the full
audio available.
0:28:32.132 --> 0:28:38.112
So you cannot do that continuously, I mean, if
you would do it online, then always
0:28:38.112 --> 0:28:45.588
if the probability is high enough you split; but
in this case you try to find a global optimum.
0:28:46.706 --> 0:28:49.134
It's a heuristic, of course.
0:28:49.134 --> 0:28:58.170
But you find a global solution for your whole
talk and not a local one:
0:28:58.170 --> 0:29:02.216
where is the system most sure?
0:29:02.802 --> 0:29:12.467
So that's a bit of a challenge here, but the
advantage of course is that in the end you
0:29:12.467 --> 0:29:14.444
have nice segments.
0:29:17.817 --> 0:29:23.716
Any more questions on this?
0:29:23.716 --> 0:29:36.693
Then the next thing is we also need to evaluate
in this scenario.
0:29:37.097 --> 0:29:44.349
So you know machine translation evaluation;
it's quite a while ago.
0:29:44.349 --> 0:29:55.303
It was at the beginning of the semester,
but I hope you can remember.
0:29:55.675 --> 0:30:09.214
It might be with BLEU score, might be with COMET
or similar, but you need a reference.
0:30:10.310 --> 0:30:22.335
But this assumes that you have this one-to-one
match, so for each reference segment you have a machine
0:30:22.335 --> 0:30:26.132
translation output, which is nicely aligned.
0:30:26.506 --> 0:30:34.845
So then it might be that our output has four
segments, while our reference output has only
0:30:34.845 --> 0:30:35.487
three.
0:30:36.756 --> 0:30:40.649
And now it is, of course, questionable what
we should compare in our metric.
0:30:44.704 --> 0:30:53.087
So it's no longer possible to directly
do that, because what should you compare?
0:30:53.413 --> 0:31:00.214
You just have four segments here and three segments
there, and there is no obvious mapping.
0:31:00.920 --> 0:31:06.373
The first output roughly corresponds to the first reference;
I can't speak Spanish, but you can see
0:31:06.373 --> 0:31:09.099
that the content is already shifted.
0:31:09.099 --> 0:31:14.491
So a naive one-to-one BLEU comparison
wouldn't work; you need to do something
0:31:14.491 --> 0:31:17.157
about that to make this type of evaluation work.
0:31:19.019 --> 0:31:21.727
Still, any suggestions what you could do?
0:31:25.925 --> 0:31:44.702
How can you calculate a BLEU score when
you don't have a one-to-one mapping?
0:31:45.925 --> 0:31:49.365
You could put another layer which tries
to align the segments.
0:31:51.491 --> 0:31:56.979
It's not even aligning only, but that's one
solution: you need to align and re-segment.
0:31:57.177 --> 0:32:06.886
Because even if you align one to one, so this
to this and this to that, you see that it's
0:32:06.886 --> 0:32:12.341
not good, because content would be compared to
the wrong segment.
0:32:13.453 --> 0:32:16.967
Before we discuss that, there is even one simpler solution.
0:32:16.967 --> 0:32:19.119
Yes, it's a simpler solution.
0:32:19.119 --> 0:32:23.135
It's called document-based BLEU or something
like that.
0:32:23.135 --> 0:32:25.717
So you just take the full document.
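For instance, with the sacrebleu library (the example segments here are made up), document-level scoring just concatenates everything before comparing:

```python
import sacrebleu  # pip install sacrebleu

hyp_segments = ["he said that", "we should go", "to the station", "today"]
ref_segments = ["he said that we should go", "to the station", "today"]

# No usable one-to-one segment mapping, so compare the concatenated documents.
score = sacrebleu.corpus_bleu([" ".join(hyp_segments)],
                              [[" ".join(ref_segments)]])
print(score.score)
```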
0:32:26.566 --> 0:32:32.630
For some metrics that works well; for others it's
not clear how good it is, but there might
0:32:32.630 --> 0:32:32.900
be differences.
0:32:33.393 --> 0:32:36.454
Think of more simple metrics like BLEU.
0:32:36.454 --> 0:32:40.356
Do you have any idea what could be a disadvantage?
0:32:49.249 --> 0:32:56.616
BLEU is matching n-grams, so you start with
the hypothesis.
0:32:56.616 --> 0:33:01.270
You check how many of its n-grams are in the reference.
0:33:01.901 --> 0:33:11.233
If you're now doing that on the full document,
you can also match n-grams from here to there.
0:33:11.751 --> 0:33:15.680
So you can match things very far away.
0:33:15.680 --> 0:33:21.321
A system could reorder content quite
randomly and still get credit.
0:33:22.142 --> 0:33:27.938
And that, of course, could be a bit of a disadvantage
or like is a problem, and therefore people
0:33:27.938 --> 0:33:29.910
also look into the segmentation.
0:33:29.910 --> 0:33:34.690
But I've recently seen some results: document-
level scores are also normally fine.
0:33:34.690 --> 0:33:39.949
If you have a relatively high quality system
or state of the art, then they also have a
0:33:39.949 --> 0:33:41.801
good correlation with human judgment.
0:33:46.546 --> 0:33:59.241
So how are we doing that? We are putting
end-of-sentence boundaries in there, and then:
0:33:59.179 --> 0:34:07.486
Alignment based on the Levenshtein distance,
so the edit distance between our output and the
0:34:07.486 --> 0:34:09.077
reference output.
0:34:09.449 --> 0:34:13.061
And here is our boundary.
0:34:13.061 --> 0:34:23.482
We map the boundary based on the alignment,
so in the Levenshtein alignment you see where it lands.
0:34:23.803 --> 0:34:36.036
And then all the words that come before it
go to that segment; the mapping is not random, but
0:34:36.336 --> 0:34:44.890
I mean, it should be right, but things like that
can happen, and it's not always clear where a word belongs.
0:34:44.965 --> 0:34:49.727
At the break, however, the errors are typically
not that bad, because they are words which are
0:34:49.727 --> 0:34:52.270
not matching between reference and hypothesis.
0:34:52.270 --> 0:34:56.870
So normally it doesn't really matter that
much because they are anyway not matching.
0:34:57.657 --> 0:35:05.888
And then you take the re-segmented MT output and
use that to calculate your metric.
0:35:05.888 --> 0:35:12.575
Then you again have a perfect alignment for which
you can calculate scores.
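A condensed sketch of this re-segmentation (the idea behind tools such as mwerSegmenter; this toy version is my own, aligning words by edit distance and cutting the hypothesis at mapped reference boundaries):

```python
def resegment(hyp_words, ref_segments):
    """Re-split hyp_words so the pieces line up with ref_segments (word lists)."""
    ref_words = [w for seg in ref_segments for w in seg]
    n, m = len(hyp_words), len(ref_words)
    # dp[i][j] = edit distance between hyp_words[:i] and ref_words[:j]
    dp = [[i + j if not i * j else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = hyp_words[i - 1] != ref_words[j - 1]
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1, dp[i-1][j-1] + cost)
    # Backtrace, remembering which hypothesis prefix aligns to each reference prefix.
    align, i, j = {m: n}, n, m
    while i > 0 or j > 0:
        if i and j and dp[i][j] == dp[i-1][j-1] + (hyp_words[i-1] != ref_words[j-1]):
            i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i-1][j] + 1:
            i -= 1
        else:
            j -= 1
        align[j] = i
    # Cut the hypothesis wherever a reference segment ends.
    out, prev, pos = [], 0, 0
    for seg in ref_segments[:-1]:
        pos += len(seg)
        out.append(hyp_words[prev:align[pos]])
        prev = align[pos]
    out.append(hyp_words[prev:])
    return out
```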
0:35:14.714 --> 0:35:19.229
By the way, you could do it the other way around.
0:35:19.229 --> 0:35:23.359
You could re-segment your reference to match the hypothesis.
0:35:29.309 --> 0:35:30.368
Which one would you select?
0:35:34.214 --> 0:35:43.979
I think segmenting the system output is much
more natural, because the reference
0:35:43.979 --> 0:35:46.474
is the fixed solution.
0:35:47.007 --> 0:35:52.947
Yes, that's the right motivation if you
think about BLEU or similar.
0:35:52.947 --> 0:35:57.646
It's additionally important that you don't change your
reference.
0:35:57.857 --> 0:36:07.175
You might have a different number of bigrams
or trigrams, because the sentences have different
0:36:07.175 --> 0:36:08.067
lengths.
0:36:08.068 --> 0:36:15.347
If you compare, say, five systems, you're always comparing
against the same reference, and you don't compare
0:36:15.347 --> 0:36:16.455
to five different ones.
0:36:16.736 --> 0:36:22.317
They would only differ in segmentation, but
still it could make some difference.
0:36:25.645 --> 0:36:38.974
Good, that's all about sentence segmentation,
now a bit about disfluencies and what the
0:36:38.974 --> 0:36:40.146
challenges there really are.
0:36:42.182 --> 0:36:51.138
So as said, in daily life you're not speaking
in very nice full sentences all the time.
0:36:51.471 --> 0:36:53.420
Nobody is speaking perfect full sentences.
0:36:53.420 --> 0:36:54.448
We do repetitions.
0:36:54.834 --> 0:37:00.915
It's especially if it's more interactive,
so in meetings, phone calls and so on.
0:37:00.915 --> 0:37:04.519
If you have multiple speakers, they also interrupt
0:37:04.724 --> 0:37:16.651
each other, and then, if you keep the disfluencies, they
are harder to translate, because most of your
0:37:16.651 --> 0:37:17.991
training data is clean text.
0:37:18.278 --> 0:37:30.449
It's also very difficult to read; we'll
see some examples where everything is transcribed
0:37:30.449 --> 0:37:32.543
as it was said.
0:37:33.473 --> 0:37:36.555
What type of things are there?
0:37:37.717 --> 0:37:42.942
So you have all these filler words, like "uh" and "uhm".
0:37:42.942 --> 0:37:47.442
These are very easy to remove.
0:37:47.442 --> 0:37:52.957
You can just use regular expressions.
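A minimal regex-based cleaner for these trivial fillers (the filler list is illustrative, not the lecture's):

```python
import re

FILLERS = re.compile(r"\b(?:uh|uhm|um|er|ah)\b[,.]?\s*", re.IGNORECASE)

def remove_fillers(text):
    """Delete simple hesitation fillers that never carry content."""
    return re.sub(r"\s{2,}", " ", FILLERS.sub("", text)).strip()

print(remove_fillers("I want, uh, a ticket, um, to Houston"))
# -> "I want, a ticket, to Houston"
```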
0:37:53.433 --> 0:38:00.139
It's getting more difficult with some other
types of filler words.
0:38:00.139 --> 0:38:03.387
In German you have "ja", for example.
0:38:04.024 --> 0:38:08.473
And these ones you cannot just remove by regular
expression.
0:38:08.473 --> 0:38:15.039
You shouldn't remove every "ja" from a text,
because it might be very important information
0:38:15.039 --> 0:38:15.768
as well.
0:38:15.715 --> 0:38:19.995
It may often be just a filler, but
sometimes it is important content.
0:38:20.300 --> 0:38:24.215
So just removing them is already more
difficult.
0:38:26.586 --> 0:38:29.162
Then you have these repetitions.
0:38:29.162 --> 0:38:32.596
You have something like: "I mean, I saw him there."
0:38:32.596 --> 0:38:33.611
There was a.
0:38:34.334 --> 0:38:41.001
And while for the first one that might be
very easy to remove because you just look for
0:38:41.001 --> 0:38:47.821
doubled words; the thing is that the repetition might
not be exactly the same, so "there is, there
0:38:47.821 --> 0:38:48.199
was".
0:38:48.199 --> 0:38:54.109
So it is already getting a bit more complicated,
of course still possible.
0:38:54.614 --> 0:39:01.929
You can remove "to Denver", so the real sentence would
be like: "I'd like to have a ticket to Houston."
0:39:02.882 --> 0:39:13.327
But there the detection, of course, is getting
more challenging, since you want to get rid of whole phrases.
0:39:13.893 --> 0:39:21.699
You don't have the data, of course, which
makes the task harder, but do you actually
0:39:21.699 --> 0:39:22.507
want to remove it?
0:39:22.507 --> 0:39:24.840
It might be really meaningful.
0:39:24.840 --> 0:39:26.185
Current isn't.
0:39:26.185 --> 0:39:31.120
That is a really good point, and it really
depends.
0:39:31.051 --> 0:39:34.785
The question is: what is your final task?
0:39:35.155 --> 0:39:45.526
If you want to have a transcript for reading,
let me see if we have another example.
0:39:45.845 --> 0:39:54.171
So there it's nicer if you have a clean transcript,
and if you see subtitles on TV, they're also not
0:39:54.171 --> 0:39:56.625
having all the repetitions.
0:39:56.625 --> 0:40:03.811
It's a nice way to shorten while also keeping
the structure.
0:40:04.064 --> 0:40:11.407
Of course, in some situations the disfluencies might
give you information:
0:40:11.407 --> 0:40:14.745
that there is a lot of stuttering, for example.
0:40:15.015 --> 0:40:22.835
So in such cases, I agree, it might be helpful
in some way, but reading all the disfluencies
0:40:22.835 --> 0:40:25.198
is getting really difficult.
0:40:25.198 --> 0:40:28.049
Let's look at the next example.
0:40:28.308 --> 0:40:31.630
That's a very long text.
0:40:31.630 --> 0:40:35.883
You need a bit of time to parse it.
0:40:35.883 --> 0:40:39.472
This part is not important.
0:40:40.480 --> 0:40:48.461
It might be nice if you can start reading
from here.
0:40:48.461 --> 0:40:52.074
Let's have a look here.
0:40:52.074 --> 0:40:54.785
Try to read this.
0:40:57.297 --> 0:41:02.725
You can understand it, but I think you need
a bit of time to really understand what was.
0:41:11.711 --> 0:41:21.480
And now we have the same text, but with parts
highlighted in bold; now only read the
0:41:21.480 --> 0:41:22.154
bold.
0:41:23.984 --> 0:41:25.995
And ignore everything which is not bold.
0:41:30.250 --> 0:41:49.121
I would assume it's easier to read just the
bold part, and faster.
0:41:50.750 --> 0:41:57.626
Yeah, it might be; I think we even had
a master's thesis on that.
0:41:57.626 --> 0:41:59.619
If you've seen my videos:
0:42:00.000 --> 0:42:09.875
In the recordings I also try to make it more
like fluent speech, and I'm not
0:42:09.875 --> 0:42:12.318
doing the hesitations.
0:42:12.652 --> 0:42:23.764
I don't know if somebody else has looked into
the Coursera videos, but you'll notice that.
0:42:25.005 --> 0:42:31.879
For these videos I spoke every minute three
times or something, and then people were
0:42:31.879 --> 0:42:35.011
cutting things and hopefully making it fluent.
0:42:35.635 --> 0:42:42.445
And therefore, if you want to achieve that, it's
of course no longer exactly what was
0:42:42.445 --> 0:42:50.206
said, but if it should look more like a professional
video, then you would have to do that and cut
0:42:50.206 --> 0:42:50.998
that out.
0:42:50.998 --> 0:42:53.532
But yeah, there are definitely use cases.
0:42:55.996 --> 0:42:59.008
We're also going to do this thing again.
0:42:59.008 --> 0:43:02.315
First turn is like I'm going to have a very.
0:43:02.422 --> 0:43:07.449
Which in the end they start to slow down just
without feeling as though they're.
0:43:07.407 --> 0:43:10.212
It's a good point for the next slide.
0:43:10.212 --> 0:43:13.631
There is not the one perfect solution.
0:43:13.631 --> 0:43:20.732
There's some work on disfluency removal,
but of course there's also ambiguity.
0:43:20.732 --> 0:43:27.394
Removal is not that easy: do you just remove
fillers everywhere?
0:43:27.607 --> 0:43:29.708
But how much like cleaning do you do?
0:43:29.708 --> 0:43:31.366
It's more a continuous thing.
0:43:31.811 --> 0:43:38.211
Is it really that you only remove stuff, or
are you also into rephrasing? Here it is only
0:43:38.211 --> 0:43:38.930
removing?
0:43:39.279 --> 0:43:41.664
But maybe you want to rephrase it.
0:43:41.664 --> 0:43:43.231
So that it sounds better.
0:43:43.503 --> 0:43:49.185
So then it's going into what people are doing
in style transfer.
0:43:49.185 --> 0:43:52.419
We are going from a speech style to a written style.
0:43:52.872 --> 0:44:07.632
So there is more of a continuum, and of course
there is not one perfect solution;
0:44:07.632 --> 0:44:10.722
it depends on exactly what you want.
0:44:15.615 --> 0:44:19.005
Yeah, it's challenging.
0:44:19.005 --> 0:44:30.258
You have examples where the repetition is
not a direct copy, not exactly the same.
0:44:30.258 --> 0:44:35.410
That is, of course, more challenging.
0:44:41.861 --> 0:44:49.889
That's, I mean, why it's so challenging:
if it's really spontaneous, even for the speaker,
0:44:49.889 --> 0:44:55.634
you maybe even need the video to really get
it, and at least the audio.
0:45:01.841 --> 0:45:06.025
Yeah, it also depends on:
0:45:06.626 --> 0:45:15.253
the purpose, of course. And a very important
point: the easiest task is just removing.
0:45:15.675 --> 0:45:25.841
Of course you have to be careful; if you
remove a bit too little, it's normally
0:45:25.841 --> 0:45:26.958
not that bad.
0:45:27.227 --> 0:45:33.176
But if you remove too much, of course, that's
very, very bad, because you're losing important content.
0:45:33.653 --> 0:45:46.176
And this might be even more challenging if
you think about rare and unseen words.
0:45:46.226 --> 0:45:56.532
So when doing this removal, it's important
to be careful and normally more conservative.
0:46:03.083 --> 0:46:15.096
Of course, also you have to again see if you're
doing that now in a two step approach, not
0:46:15.096 --> 0:46:17.076
an end-to-end one.
0:46:17.076 --> 0:46:20.772
So first you need the removal step.
0:46:21.501 --> 0:46:30.230
But you have to somehow see it in the whole
pipeline.
0:46:30.230 --> 0:46:36.932
If you learn on clean text to remove disfluencies,
0:46:36.796 --> 0:46:44.070
it might be that the ASR system is outputting
something else or that it's more of an ASR
0:46:44.070 --> 0:46:44.623
error.
0:46:44.864 --> 0:46:46.756
So um.
0:46:46.506 --> 0:46:52.248
Just for example, if you do it based on language
modeling scores, it might be that you get a bad
0:46:52.248 --> 0:46:57.568
score just because the ASR has
made some errors, so you really have to see
0:46:57.568 --> 0:46:59.079
the combination of that.
0:46:59.419 --> 0:47:04.285
And for example, we had like partial words.
0:47:04.285 --> 0:47:06.496
Like someone stopping mid-word.
0:47:06.496 --> 0:47:08.819
We didn't have that.
0:47:08.908 --> 0:47:18.248
So it can be that you start a word, stop
in the middle of the word, and then switch,
0:47:18.248 --> 0:47:19.182
because you corrected yourself.
0:47:19.499 --> 0:47:23.214
And of course, in a perfect text transcript,
that's very easy to recognize.
0:47:23.214 --> 0:47:24.372
That's not a real word.
0:47:24.904 --> 0:47:37.198
However, when you really run an ASR system,
it will normally output some full word, because
0:47:37.198 --> 0:47:40.747
it can only output known words.
0:47:50.050 --> 0:48:03.450
For example: if you have this in the
transcript, it's easy to detect as a
0:48:03.450 --> 0:48:05.277
disfluency.
0:48:05.986 --> 0:48:11.619
And then, of course, it's more challenging
in a real-world example where you have ASR output.
0:48:12.492 --> 0:48:29.840
Now to the approaches: one thing is to really
put it in between, so after your ASR system.
0:48:31.391 --> 0:48:45.139
So your task is: you have the disfluent
text as input, and the clean text as output.
0:48:45.565 --> 0:48:49.605
There are different formulations of that.
0:48:49.605 --> 0:48:54.533
You might not be able to do everything like
that.
0:48:55.195 --> 0:49:10.852
Or do you also allow, for example, rephrasing
or reordering, so that in the clean text you have the
0:49:10.852 --> 0:49:13.605
word order corrected.
0:49:13.513 --> 0:49:24.201
But the easiest thing is that you only do
removing, so some words can be removed.
0:49:29.049 --> 0:49:34.508
Any ideas how to do that? This is the output.
0:49:34.508 --> 0:49:41.034
You have training data so we have training
data.
0:49:47.507 --> 0:49:55.869
You could put it in before the MT, or you can even
do it after, once the machine has translated.
0:50:00.000 --> 0:50:05.511
Roughly, yes: so you have the disfluent
text, with the words to be removed, as input,
0:50:05.511 --> 0:50:07.578
and as the output:
0:50:07.578 --> 0:50:09.207
it should be the fluent text.
0:50:09.207 --> 0:50:15.219
It can be before or after the ASR as you
said, but you have this type of task; so technically,
0:50:15.219 --> 0:50:20.042
how would you address this type of task when
you have to solve it?
0:50:24.364 --> 0:50:26.181
That's exactly it.
0:50:26.181 --> 0:50:28.859
That's one way of doing it.
0:50:28.859 --> 0:50:33.068
It's a translation task, and you train your model on it.
0:50:33.913 --> 0:50:34.683
You can do that.
0:50:34.683 --> 0:50:42.865
Then, of course, a bit of the challenge
is that you automatically allow rephrasing
0:50:42.865 --> 0:50:43.539
stuff.
0:50:43.943 --> 0:50:52.240
Which on the one hand is good, so you have more
freedom, but it might also be a bad thing,
0:50:52.240 --> 0:50:58.307
because with more freedom you also have
more opportunities to make errors.
0:51:01.041 --> 0:51:08.300
If you want to prevent that, you can also do
more simple labeling, so for each word your
0:51:08.300 --> 0:51:10.693
label is keep or remove.
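In practice this would be a trained sequence labeler; the toy stand-in below (my own illustration) only flags exact doubled words, just to show the keep/remove interface:

```python
def tag_disfluencies(tokens):
    """Emit one keep/remove label per token; here: drop exact repetitions."""
    labels = []
    for i, tok in enumerate(tokens):
        repeated = i + 1 < len(tokens) and tokens[i + 1].lower() == tok.lower()
        labels.append("remove" if repeated else "keep")
    return labels

tokens = "there is is a ticket to to Houston".split()
keep = [t for t, l in zip(tokens, tag_disfluencies(tokens)) if l == "keep"]
print(" ".join(keep))  # -> "there is a ticket to Houston"
```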
0:51:12.132 --> 0:51:17.658
People have also looked into parsing.
0:51:17.658 --> 0:51:29.097
You remember maybe the parse trees from the beginning,
the sentence structure; the idea is that disfluencies break it.
0:51:29.649 --> 0:51:45.779
There's also more unsupervised approaches
where you then phrase it as a style transfer
0:51:45.779 --> 0:51:46.892
task.
0:51:50.310 --> 0:51:58.601
As the last point, since we have that: yes,
it has also been done in an end-to-end fashion,
0:51:58.601 --> 0:52:06.519
so that you really have the audio signal
as input, and as output you have
0:52:06.446 --> 0:52:10.750
the text without disfluencies, the
clean text.
0:52:11.131 --> 0:52:19.069
You model everything jointly, which of course
has a big advantage.
0:52:19.069 --> 0:52:25.704
You can use the paralinguistic features,
pauses, and so on.
0:52:25.705 --> 0:52:34.091
For example when you switch: you start something, then,
oh, it doesn't work, and you continue differently.
0:52:34.374 --> 0:52:42.689
So you can easily use these cues in an end-to-end
fashion, while in a cascaded approach,
0:52:42.689 --> 0:52:47.497
as we saw, you only have the text as input.
0:52:49.990 --> 0:53:02.389
But on the other hand we again have the data
problem, even more extreme than before.
0:53:02.389 --> 0:53:06.957
Of course there is even less data.
0:53:11.611 --> 0:53:12.837
Good.
0:53:12.837 --> 0:53:30.814
This was all about the input side; one more thought,
maybe if you think about YouTube
0:53:32.752 --> 0:53:34.989
talks, this could be very interesting.
0:53:36.296 --> 0:53:42.016
It is more viewed as style transfer.
0:53:42.016 --> 0:53:53.147
You can use ideas from machine translation,
but where you stay within one language.
0:53:53.713 --> 0:53:57.193
So there are ways of trying to do this type
of style transfer.
0:53:57.637 --> 0:54:02.478
I think it's definitely also very promising to
make the output more and more fluent.
0:54:03.223 --> 0:54:17.974
Because one major issue with all the previous
ones is that you need training data, and
0:54:17.974 --> 0:54:21.021
such data is rare.
0:54:21.381 --> 0:54:32.966
I mean, I think the only data we really
have is for English.
0:54:32.966 --> 0:54:39.453
Maybe there is very little data for German.
0:54:42.382 --> 0:54:49.722
Okay, then let's talk about low-latency speech translation.
0:54:50.270 --> 0:55:05.158
So the idea is, if we are doing live translation
of a talk, we want to start outputting early.
0:55:05.325 --> 0:55:23.010
This is possible because there is typically
some kind of monotony in many languages.
0:55:24.504 --> 0:55:29.765
And this is also what, for example, human
interpreters are doing to have a really low
0:55:29.765 --> 0:55:30.071
lag.
0:55:30.750 --> 0:55:34.393
They are even going further.
0:55:34.393 --> 0:55:40.926
They guess what will be the ending of the
sentence.
0:55:41.421 --> 0:55:51.120
Then they can already continue, although it has
not been said yet; it might need correction, but that is even
0:55:51.120 --> 0:55:53.039
more challenging.
0:55:54.714 --> 0:55:58.014
Why is it so difficult?
0:55:58.014 --> 0:56:09.837
There is this trade-off: on the one hand,
you want to have more context, because
0:56:09.837 --> 0:56:14.511
we learned that quality improves if we have more context.
0:56:15.015 --> 0:56:24.033
And therefore, to have more context, you have
to wait as long as possible.
0:56:24.033 --> 0:56:27.689
The best is to have the full sentence.
0:56:28.168 --> 0:56:35.244
On the other hand, you want to have a low
latency, so the user doesn't wait, and generate as
0:56:35.244 --> 0:56:35.737
soon as possible.
0:56:36.356 --> 0:56:47.149
So in a simultaneous setting, you have to
find the best point to start in order to have
0:56:47.149 --> 0:56:48.130
a good trade-off.
0:56:48.728 --> 0:56:52.296
There's no longer one single perfect solution.
0:56:52.296 --> 0:56:56.845
People therefore also evaluate the latency together with the translation quality.
0:56:57.657 --> 0:57:09.942
Why is it challenging? From German to English,
German has this very nice thing where the prefix
0:57:09.942 --> 0:57:16.607
of the verb can be put at the end of the sentence.
0:57:17.137 --> 0:57:24.201
And you only know if the person registers
or cancels his registration at the end of the sentence.
0:57:24.985 --> 0:57:33.690
So if you want to start the translation in
English, you need to know at this point which one it is.
0:57:35.275 --> 0:57:39.993
So you would have to wait until the end of
the sentence.
0:57:39.993 --> 0:57:42.931
That's not really what you want.
0:57:43.843 --> 0:57:45.795
So what can be done?
0:57:47.207 --> 0:58:12.550
There are other solutions for doing that; this has been
studied for language pairs with subject-verb-
0:58:12.550 --> 0:58:15.957
object versus subject-object-verb word order.
0:58:16.496 --> 0:58:24.582
In German it's not always like that, but there
are relative clauses where the verb comes at the end,
0:58:24.582 --> 0:58:25.777
so the same problem occurs.
0:58:28.808 --> 0:58:41.858
We'll look today into
three ways of doing that.
0:58:41.858 --> 0:58:46.269
The first is to optimize the segmentation.
0:58:46.766 --> 0:58:54.824
And then the other idea is to do retranslation,
and there you can then update the text output.
0:58:54.934 --> 0:59:02.302
So the idea is you translate, and if you later
notice it was wrong then you can retranslate
0:59:02.302 --> 0:59:03.343
and correct.
0:59:03.803 --> 0:59:14.383
Or you can do what is called streaming decoding,
where you extend the output incrementally.
0:59:17.237 --> 0:59:30.382
Let's start with the optimization: so if you
have the incoming speech, you choose how to segment it,
0:59:30.382 --> 0:59:33.040
and you tune this segmentation
0:59:32.993 --> 0:59:39.592
so that you have a good translation quality while
still having low latency.
0:59:39.699 --> 0:59:50.513
You have an extra model which does your segmentation
beforehand; the aim is not the segmentation itself.
0:59:50.470 --> 0:59:53.624
But you can measure it on training data:
0:59:53.624 --> 0:59:59.863
if I use these types of segment lengths, that's
my latency and that's my translation quality,
0:59:59.863 --> 1:00:02.811
and then you can try to search for a good trade-off.
1:00:03.443 --> 1:00:20.188
The nice thing about this one: it's an extra component,
so you can use your translation system as it is.
1:00:22.002 --> 1:00:28.373
The other idea is to always directly output the
first hypothesis, so always when you have new
1:00:28.373 --> 1:00:34.201
text or audio we translate, and if we then
have more context available we can update.
1:00:35.015 --> 1:00:50.195
So imagine the example from before: we first output "I register",
and when the sentence continues, we can revise it.
1:00:50.670 --> 1:00:54.298
So you change the output.
1:00:54.298 --> 1:01:07.414
Of course, that might also lead to a bad
user experience if you always flicker and change
1:01:07.414 --> 1:01:09.228
your output.
1:01:09.669 --> 1:01:15.329
It's a bit like human interpreters, who are also able
to correct themselves when interpreting a longer text.
1:01:15.329 --> 1:01:20.867
If they guess how the speaker will continue
and then he says something different, they
1:01:20.867 --> 1:01:22.510
also have to correct themselves.
1:01:22.510 --> 1:01:26.831
So here, since it's text and not audio, we can even
change what we have said.
1:01:26.831 --> 1:01:29.630
Yes, that's exactly what we have implemented.
1:01:31.431 --> 1:01:49.217
So how that works is: we get the first words, and then
we translate them, and if we get more input, then
1:01:49.217 --> 1:01:51.344
we retranslate.
1:01:51.711 --> 1:02:00.223
And so we can always continue to do that and
improve the transcript that we have.
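(A rough sketch of this retranslation loop; `translate` is a hypothetical stand-in for the real ASR+MT pipeline, and the German chunks are just illustrative.)

```python
# Sketch of the retranslation strategy: re-translate everything received so
# far at each update and replace the displayed output. `translate` is a
# hypothetical stand-in for the real pipeline.

def retranslation_loop(chunks, translate):
    received = []
    for chunk in chunks:
        received.append(chunk)
        hypothesis = translate(" ".join(received))
        print(hypothesis)   # replaces (and may rewrite) the earlier output

# Toy demo: the "translation" just uppercases the input seen so far.
retranslation_loop(["ich", "melde", "mich", "ab"], translate=str.upper)
```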
1:02:00.480 --> 1:02:07.729
So in the end we have the lowest possible
latency because we always output what is possible.
1:02:07.729 --> 1:02:14.784
On the other hand, it introduces a bit of a
new problem. There's another challenge which we saw when
1:02:14.784 --> 1:02:20.061
we first used this: it was first used
with the older statistical models, and there it worked fine.
1:02:20.061 --> 1:02:21.380
Then you switch to NMT,
1:02:21.380 --> 1:02:25.615
and you see one problem: it generates even
more flickering.
1:02:25.615 --> 1:02:28.878
The problem is that a normal machine translation model
1:02:29.669 --> 1:02:35.414
implicitly learns that the output always
ends with a dot and is always a full sentence.
1:02:36.696 --> 1:02:42.466
And somewhere in the model this prior was even
stronger than what is really in the input.
1:02:42.983 --> 1:02:55.910
So if you give it a partial sentence, it
will still generate a full sentence.
1:02:55.910 --> 1:02:58.201
It is encouraged to do that.
1:02:58.298 --> 1:03:05.821
It's like it tries to just continue it somehow
to a full sentence, and if it's doing a lot of
1:03:05.821 --> 1:03:10.555
guessing, then you later have even more
changes.
1:03:10.890 --> 1:03:23.944
So here we have a train-test mismatch, and that's
maybe a more generally important thing: the
1:03:23.944 --> 1:03:28.910
model might learn something a bit different than intended.
1:03:29.289 --> 1:03:32.636
It learned on its own that output always ends with
a dot; you never explicitly told it that.
1:03:33.053 --> 1:03:35.415
So we have this train-test mismatch.
1:03:38.918 --> 1:03:41.248
And when we have a train-test mismatch:
1:03:41.248 --> 1:03:43.708
What is the best way to address that?
1:03:46.526 --> 1:03:51.934
That's exactly right, so we have to
train on that kind of data as well.
1:03:52.692 --> 1:03:55.503
The problem is: for partial sentences
1:03:55.503 --> 1:03:59.611
there's no training data, so it's hard to
find parallel data for them.
1:04:00.580 --> 1:04:06.531
However, it's quite easy to generate artificial
partial sentences, at least for the source side.
1:04:06.926 --> 1:04:15.367
So you just take all the prefixes
of the source data.
1:04:17.017 --> 1:04:22.794
The problem of course is: what is the
corresponding target?
1:04:22.794 --> 1:04:30.845
If you have a partial sentence like "I encourage all of",
what should be the right target for that?
1:04:31.491 --> 1:04:45.381
And there are constraints: on the one hand, it should
be as long as possible, so that you don't always have
1:04:45.381 --> 1:04:47.541
a long delay.
1:04:47.687 --> 1:04:55.556
On the other hand, it should also be a prefix
of the final translation, and it should not be
1:04:55.556 --> 1:04:57.304
inventing too much.
1:04:58.758 --> 1:05:02.170
A very easy solution works fine.
1:05:02.170 --> 1:05:05.478
You can just do it length-based:
1:05:05.478 --> 1:05:09.612
if you take two thirds of the source, you also take two thirds of the target.
1:05:10.070 --> 1:05:19.626
It then implicitly learns to guess a bit,
if you think about the example from the beginning.
1:05:20.000 --> 1:05:30.287
For this one, if you take half of the source, in
this case the target would be "I register".
1:05:30.510 --> 1:05:39.289
So you're doing a bit of implicit guessing,
and if it gets it wrong you have to rewrite,
1:05:39.289 --> 1:05:43.581
but you're doing a good amount of guessing.
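(A minimal sketch of this artificial prefix generation with the proportional-length heuristic; the example sentence pair and the rounding are illustrative assumptions.)

```python
# Sketch: create artificial partial-sentence training pairs by pairing each
# source prefix with a proportional target prefix (the length-ratio idea).

def prefix_pairs(src_tokens, tgt_tokens):
    """Yield (source_prefix, target_prefix) pairs for prefix training."""
    for i in range(1, len(src_tokens) + 1):
        j = round(i / len(src_tokens) * len(tgt_tokens))
        yield src_tokens[:i], tgt_tokens[:j]

# Illustrative pair (separable German verb, as in the lecture's example).
src = "Ich melde mich für den Kurs an".split()
tgt = "I register for the course".split()
for s, t in prefix_pairs(src, tgt):
    print(" ".join(s), "->", " ".join(t))
```

Note how the source prefix "Ich melde mich" already maps to "I register" here, which is exactly the implicit guessing described above.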
1:05:49.849 --> 1:05:53.950
In addition, here is how it would look
without guessing:
1:05:53.950 --> 1:05:58.300
if there was no guessing game, then the target
could be something like this shorter prefix.
1:05:58.979 --> 1:06:02.513
One problem arises if you just do it this
way:
1:06:02.513 --> 1:06:04.619
prefixes make up most of your training data.
1:06:05.245 --> 1:06:11.983
And in the end you're interested in the overall
translation quality, so for full sentences.
1:06:11.983 --> 1:06:19.017
So if you train on that, it will mainly learn
how to translate prefixes because ninety percent
1:06:19.017 --> 1:06:21.535
or more of your data are prefixes.
1:06:22.202 --> 1:06:31.636
That's why, as we'll see, it's better to use
a fixed ratio,
1:06:31.636 --> 1:06:39.281
so that half of your training data are full sentences.
1:06:39.759 --> 1:06:47.693
Because if you do it the naive way, you get
one prefix for every word but only one full sentence.
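(A sketch of that 50/50 mixing; drawing one random prefix per full sentence is an assumed implementation choice, not necessarily the exact recipe from the lecture.)

```python
import random

# Sketch: mix full sentences and prefix pairs roughly 50/50, instead of
# adding every prefix (which would drown out the full sentences).

def mixed_examples(src_tokens, tgt_tokens):
    i = random.randint(1, len(src_tokens))            # one random source prefix
    j = round(i / len(src_tokens) * len(tgt_tokens))  # proportional target prefix
    return [
        (src_tokens, tgt_tokens),                     # the full sentence...
        (src_tokens[:i], tgt_tokens[:j]),             # ...plus one prefix pair
    ]
```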
1:06:48.048 --> 1:06:52.252
You can also see that nicely here, where both are shown.
1:06:52.252 --> 1:06:56.549
These are the BLEU scores, and you see the baseline.
1:06:58.518 --> 1:06:59.618
The baseline is this one.
1:06:59.618 --> 1:07:03.343
It has good quality because it's trained on full sentences.
1:07:03.343 --> 1:07:11.385
If you now train with all the partial sentences,
it focuses more on how to translate partial
1:07:11.385 --> 1:07:12.316
sentences.
1:07:12.752 --> 1:07:17.840
Because all the partial sentences will at
some point be removed, because at the end you
1:07:17.840 --> 1:07:18.996
translate the full sentence.
1:07:20.520 --> 1:07:24.079
With the multi-task mix, you have the
same performance.
1:07:24.504 --> 1:07:26.938
On the other hand, you see here the other
problem.
1:07:26.938 --> 1:07:28.656
This is how many words got updated.
1:07:29.009 --> 1:07:31.579
You want to have as few updates as possible.
1:07:31.579 --> 1:07:34.891
Updates mean you remove things which have once
been shown.
1:07:35.255 --> 1:07:40.538
This is quite high for the baseline.
1:07:40.538 --> 1:07:50.533
If you train on the partial sentences it goes down,
since fewer outputs need to be removed.
1:07:51.151 --> 1:07:58.648
And then for multi-task you have a bit like
the best of both.
1:08:02.722 --> 1:08:05.296
Any more questions on this type of approach?
1:08:09.309 --> 1:08:20.760
The last thing is this streaming decoding.
1:08:21.541 --> 1:08:23.345
Again, it depends a bit on the application:
1:08:23.345 --> 1:08:25.323
the scenario determines what you really want.
1:08:25.323 --> 1:08:30.211
As you said, we sometimes use this updating,
and for text output it'd be very nice.
1:08:30.211 --> 1:08:35.273
But imagine you want audio output: of
course you can't change it anymore, because
1:08:35.273 --> 1:08:37.891
on one side you cannot change what was said.
1:08:37.891 --> 1:08:40.858
So in this case you need more like a fixed
output.
1:08:41.121 --> 1:08:47.440
And then this style of streaming decoding is interesting,
1:08:47.440 --> 1:08:55.631
where you, for example, get the source tokens
as they are streamed in.
1:08:55.631 --> 1:09:00.897
Then you decide oh, now it's better to wait.
1:09:01.041 --> 1:09:14.643
So you somehow need to have this type of additional
information.
1:09:15.295 --> 1:09:23.074
Here you have to decide: should I now output
a token, or should I wait for more input?
1:09:26.546 --> 1:09:32.649
So you have these additional labels like
wait, wait, output, output, wait and so
1:09:32.649 --> 1:09:32.920
on.
1:09:33.453 --> 1:09:38.481
There are different ways of doing that.
1:09:38.481 --> 1:09:45.771
You can have an additional model that does
this decision.
1:09:46.166 --> 1:09:53.669
It decides whether it's better to wait and have higher
quality, or to continue and have a lower latency, in each
1:09:53.669 --> 1:09:54.576
situation.
1:09:55.215 --> 1:09:59.241
Surprisingly, a very simple strategy also works,
sometimes quite well.
1:10:03.043 --> 1:10:10.981
And that is the so-called wait-k policy,
and the idea, at least for text-to-
1:10:10.981 --> 1:10:14.623
text translation, is working well.
1:10:14.623 --> 1:10:22.375
It's like you wait for k words, and then you
always output one word for each new input word.
1:10:22.682 --> 1:10:28.908
So you wait for k words at the beginning
of the sentence, and every time a new word
1:10:28.908 --> 1:10:29.981
comes in, you output one.
1:10:31.091 --> 1:10:39.459
So you output at the same speed as the input,
so you're not lagging more and more, but you still
1:10:39.459 --> 1:10:41.456
have enough context.
1:10:43.103 --> 1:10:49.283
Of course, for example for the verb-final case,
this will not solve it perfectly, but if you have
1:10:49.283 --> 1:10:55.395
only a bit of local reordering inside your k-token
window, that you can manage very well, and then it's
1:10:55.395 --> 1:10:57.687
a very simple solution that works.
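(A minimal sketch of the wait-k read/write schedule; the toy `translate_step`, which just uppercases the next source token, stands in for a real incremental decoder.)

```python
# Sketch of the wait-k policy: read k source tokens first, then alternate
# write-one/read-one. `translate_step` stands in for one real decoder step.

def wait_k(source_stream, k, translate_step):
    read, written = [], []
    for token in source_stream:
        read.append(token)
        if len(read) >= k:                  # head start of k tokens reached
            written.append(translate_step(read, written))
            yield written[-1]
    while len(written) < len(read):         # flush the last k-1 outputs
        written.append(translate_step(read, written))
        yield written[-1]

# Toy demo: the "translation" just uppercases the next source token.
out = wait_k("wir sehen uns morgen".split(), k=2,
             translate_step=lambda src, tgt: src[len(tgt)].upper())
print(list(out))  # ['WIR', 'SEHEN', 'UNS', 'MORGEN']
```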
1:10:57.877 --> 1:11:00.481
The other option is a dynamic policy.
1:11:00.481 --> 1:11:06.943
Depending on the context you can decide how
long you want to wait.
1:11:07.687 --> 1:11:21.506
It also only works if you have a similar number
of tokens on both sides, so it breaks down if your target is very short,
1:11:21.506 --> 1:11:22.113
for example.
1:11:22.722 --> 1:11:28.791
That's why it's also more challenging for
audio input because the speaking rate is changing
1:11:28.791 --> 1:11:29.517
and so on.
1:11:29.517 --> 1:11:35.586
You would have to do something like: I output
a word every second, or something
1:11:35.586 --> 1:11:35.981
like that.
1:11:36.636 --> 1:11:45.459
The problem is that the speaking speed
is not fixed but varies quite a bit, and therefore this fails.
1:11:50.170 --> 1:11:58.278
Therefore, what you can also do is
use a similar solution to what we had before with
1:11:58.278 --> 1:11:59.809
the retranslation.
1:12:00.080 --> 1:12:02.904
You remember we re-decoded all the time.
1:12:03.423 --> 1:12:12.253
And you can do something similar in this case,
except that you add a constraint, in that you're
1:12:12.253 --> 1:12:16.813
saying: oh, when I re-decode, I cannot always
1:12:16.736 --> 1:12:22.065
decode as I want; instead you can do this
target-prefix decoding. So what you say is,
1:12:22.065 --> 1:12:23.883
in your beam search,
1:12:23.883 --> 1:12:26.829
you can easily say: generate a translation,
but
1:12:27.007 --> 1:12:29.810
the translation has to start with this prefix.
1:12:31.251 --> 1:12:35.350
How can you do that?
1:12:39.839 --> 1:12:49.105
In the decoder, exactly: when you
do beam search, you always select the most probable token.
1:12:49.349 --> 1:12:57.867
And now you say: oh, I'm not selecting the
most probable, but the forced one; so in
1:12:57.867 --> 1:13:04.603
the first steps I have to take the prefix tokens,
and only then start free decoding.
1:13:04.884 --> 1:13:09.387
And then you're making sure that your output
always starts with this prefix.
1:13:10.350 --> 1:13:18.627
And then you can use your immediate retranslation,
but you're no longer changing the output.
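(A minimal sketch of such prefix-forced decoding with a toy greedy decoder; `next_token_scores` is a hypothetical stand-in for the real model's next-token distribution. In real beam search you would instead force the first len(prefix) expansion steps and only then branch.)

```python
# Sketch of target-prefix forced decoding (greedy variant): the committed
# prefix is copied verbatim, then decoding continues freely from there.
# next_token_scores(source, partial_target) -> {token: score} is assumed.

def decode_with_prefix(source, prefix, next_token_scores, max_len=50, eos="</s>"):
    output = list(prefix)                   # forced steps: reuse committed tokens
    while len(output) < max_len:
        scores = next_token_scores(source, output)
        best = max(scores, key=scores.get)  # free greedy choice after the prefix
        if best == eos:
            break
        output.append(best)
    return output

# Toy demo: a fake distribution that prefers "models" after "all", then EOS.
def toy_scores(source, partial):
    return {"models": 1.0, "</s>": 0.5} if partial[-1:] == ["all"] else {"</s>": 1.0}

print(decode_with_prefix(["audio"], ["all"], toy_scores))  # ['all', 'models']
```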
1:13:19.099 --> 1:13:31.595
How it works: we get a speech signal
as input, and at first it does not output anything.
1:13:32.212 --> 1:13:45.980
Then for the input so far you get a translation,
and you decide: yes, output it.
1:13:46.766 --> 1:13:54.250
And then you're translating segments one, two,
three, four, but now you say: generate
1:13:54.250 --> 1:13:55.483
only outputs that continue what was shown.
1:13:55.935 --> 1:14:07.163
And then you're translating again, and maybe you
decide: now this is a good translation.
1:14:07.163 --> 1:14:08.880
Then you output the next part.
1:14:09.749 --> 1:14:29.984
Yes; but let's be clear about what the
effect is:
1:14:30.050 --> 1:14:31.842
we're still generating the full target text.
1:14:32.892 --> 1:14:36.930
But we're not always outputting the full target
text now.
1:14:36.930 --> 1:14:43.729
What we have here is some strategy
to decide: oh, is the system already sure enough
1:14:43.729 --> 1:14:44.437
about it?
1:14:44.437 --> 1:14:49.395
If it's sure enough and it has all the information,
we can output it.
1:14:49.395 --> 1:14:50.741
And then we decide about the next part.
1:14:51.291 --> 1:14:55.931
If we say here it's sometimes better not to
output yet, we won't output it yet.
1:14:57.777 --> 1:15:06.369
And thereby the hope is that in our example the model
should not yet output "register", because it
1:15:06.369 --> 1:15:10.568
doesn't know yet whether that is the case or not.
1:15:13.193 --> 1:15:18.056
So what we have to discuss is what is a good
output strategy.
1:15:18.658 --> 1:15:20.070
So what could you do?
1:15:20.070 --> 1:15:23.806
The output strategy could be something like this.
1:15:23.743 --> 1:15:39.871
If you think of wait-k, that is an output
strategy where you always output one word per input.
1:15:40.220 --> 1:15:44.990
Good, and you can view wait-k in a similar
way, as one such strategy.
1:15:45.265 --> 1:15:55.194
But now, of course, we can also look at other
output strategies that are more generic, and
1:15:55.194 --> 1:15:59.727
decide dynamically depending on the situation.
1:16:01.121 --> 1:16:12.739
And one thing that works quite well is referred
to as local agreement, and that means you're
1:16:12.739 --> 1:16:13.738
always comparing consecutive hypotheses.
1:16:14.234 --> 1:16:26.978
Then you look at what is the same
between my current translation and the one I
1:16:26.978 --> 1:16:28.756
did before.
1:16:29.349 --> 1:16:31.201
So let's go through that again with an example.
1:16:31.891 --> 1:16:45.900
So your input is a first audio segment, and
your target text, what the model generates, is "all model trains".
1:16:46.346 --> 1:16:53.231
Then you're getting audio segments one and
two, and this time the output is "all models".
1:16:54.694 --> 1:17:08.407
You see the continuations differ, but both of
them agree that it starts with "all".
1:17:09.209 --> 1:17:13.806
So we can hopefully be quite sure that the translation
really starts with "all".
1:17:15.155 --> 1:17:22.604
So now we say: we output "all". So at this
point we output "all", although before we output nothing.
1:17:23.543 --> 1:17:27.422
Then we are getting segments one, two, three as input.
1:17:27.422 --> 1:17:35.747
This time we have a prefix, so now we are
only allowing translations that start with "all".
1:17:35.747 --> 1:17:42.937
We cannot change that anymore, so we now need
to generate some translation.
1:17:43.363 --> 1:17:46.323
And then it can be that it's now "all models
are run".
1:17:47.927 --> 1:18:01.908
Then we compare here and see both agree on
"all models", so we can output "all models".
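(A minimal sketch of the local-agreement strategy, reusing the hypothesis strings from this example; the word-level tokenization is an assumption.)

```python
# Sketch of local agreement: commit only the longest common prefix of two
# consecutive hypotheses, minus what has already been shown.

def newly_agreed(prev_hyp, curr_hyp, committed):
    common = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        common.append(a)
    return common[len(committed):]          # only the newly committable part

h1 = "all model trains".split()
h2 = "all models".split()
h3 = "all models are run".split()

shown = []
shown += newly_agreed(h1, h2, shown)   # commits ['all']
shown += newly_agreed(h2, h3, shown)   # commits ['models']
print(shown)                           # ['all', 'models']
```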
1:18:02.882 --> 1:18:07.356
So thereby we can dynamically decide: if the
model is very unsure,
1:18:07.356 --> 1:18:10.178
it always outputs something different.
1:18:11.231 --> 1:18:24.872
Then we'll wait longer; if it's more often
the same thing, we hopefully don't need to wait.
1:18:30.430 --> 1:18:40.238
But is it clear that the model would
be able to detect the ambiguous cases?
1:18:43.203 --> 1:18:50.553
The hope is that it would, because if it's not sure,
of course, it would switch
1:18:50.553 --> 1:18:51.671
all the time.
1:18:56.176 --> 1:19:01.375
So if in the first step it outputs "register",
the second time "cancel", and then maybe
1:19:01.375 --> 1:19:03.561
"register" again, it wouldn't commit it.
1:19:03.561 --> 1:19:08.347
Of course, if it happens to output "register"
for a long time, then this approach can't deal with it.
1:19:08.568 --> 1:19:23.410
That's why there are two parameters that you
can tune and which might be important: the first is how often you update.
1:19:23.763 --> 1:19:27.920
So you do it like every one second, every
five seconds or something like that.
1:19:28.648 --> 1:19:37.695
The more often you update, the lower your latency
will be, because your wait is less long, but also
1:19:37.695 --> 1:19:39.185
you might do more unnecessary updates.
1:19:40.400 --> 1:19:50.004
So that is the one thing; for text you might
do it after every word, but if
1:19:50.004 --> 1:19:52.779
you think about audio it's less clear when.
1:19:53.493 --> 1:20:04.287
And the other parameter is the agreement: how many
hypotheses have to agree before the model counts as sure.
1:20:04.287 --> 1:20:10.252
If you say two have to agree, then hopefully it is.
1:20:10.650 --> 1:20:21.369
What we saw is, I think, that two normally gives
a really good performance, and otherwise your
1:20:21.369 --> 1:20:22.441
latency suffers.
1:20:22.963 --> 1:20:42.085
Okay, but couldn't we just use the model's
confidence for this decision?
1:20:44.884 --> 1:20:47.596
I have to completely agree with that.
1:20:47.596 --> 1:20:53.018
So when this was done, that was our first
idea of using the confidence.
1:20:53.018 --> 1:21:00.248
The problem, and that's my current assumption,
is that modeling the model confidence is
1:21:00.248 --> 1:21:03.939
not that easy, and models are often overconfident.
1:21:04.324 --> 1:21:17.121
In the paper there is also a variant where
you try to use the confidence in some way to
1:21:17.121 --> 1:21:20.465
decide when to output.
1:21:21.701 --> 1:21:26.825
But that gave worse results, and that's why
we looked into the agreement instead.
1:21:27.087 --> 1:21:38.067
So it's a very good idea, I think, but it seems
not to work, at least how it was implemented.
1:21:38.959 --> 1:21:55.670
There is one approach that maybe goes more in this
direction, which is very new:
1:21:55.455 --> 1:22:02.743
you check if the last word is attending mainly
to the end of the audio.
1:22:02.942 --> 1:22:04.934
If so, you should not output it yet.
1:22:05.485 --> 1:22:15.539
Because there might be something
more missing that you still need, so they
1:22:15.539 --> 1:22:24.678
look at the attention and only output parts
which do not attend to the end of the audio signal.
1:22:25.045 --> 1:22:40.175
So there are, of course, a lot of ways
you can do this better or more easily.
1:22:41.901 --> 1:22:53.388
Another approach instead tries to predict the next words with
a large language model, and then for text translation
1:22:53.388 --> 1:22:54.911
you predict possible continuations of the source.
1:22:55.215 --> 1:23:01.177
Then you translate all of them and decide
if there is a change, so you can make
1:23:01.177 --> 1:23:02.410
your decision even earlier.
1:23:02.362 --> 1:23:08.714
The idea is that if a continuation would lead
to a change in the translation, then
1:23:08.714 --> 1:23:10.320
we should hold the output back.
1:23:10.890 --> 1:23:18.302
So it's more making an estimate about possible
continuations of the source instead of looking
1:23:18.302 --> 1:23:19.317
at previous hypotheses.
1:23:23.783 --> 1:23:31.388
How well that works, here is one example.
1:23:31.388 --> 1:23:39.641
It compares the approach against the
baselines.
1:23:40.040 --> 1:23:47.041
And you see in this case you have worse BLEU
scores here.
1:23:47.041 --> 1:23:51.670
For equal quality you have better latency.
1:23:52.032 --> 1:24:01.123
Now, does anybody have an idea
of what could be challenging there?
1:24:05.825 --> 1:24:20.132
One problem of these models is hallucination,
and hallucinated output is often very long, which has a negative impact.
1:24:24.884 --> 1:24:30.869
If you only remove the last four words, but
your model now starts to hallucinate and invents
1:24:30.869 --> 1:24:37.438
just a lot of new stuff, then yeah, you're removing
the last four words of that; but if it has invented
1:24:37.438 --> 1:24:41.406
ten words, then you're still outputting six of
these invented words.
1:24:41.982 --> 1:24:48.672
Typically, once it starts hallucinating and generating
some output, it's quite long, so then it's
1:24:48.672 --> 1:24:50.902
no longer enough to just hold back a few words.
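(For contrast, a tiny sketch of that simpler "hold back the last n words" strategy from the question; n=4 mirrors the four words mentioned and is otherwise arbitrary. A ten-word hallucination defeats it, as just discussed.)

```python
# Sketch of a hold-n output strategy: always hold back the last n tokens of
# the current hypothesis; a long hallucination easily defeats this.

def hold_n(curr_hyp, committed, n=4):
    stable = curr_hyp[:max(len(curr_hyp) - n, 0)]
    return stable[len(committed):]      # newly committable tokens, if any

print(hold_n("a b c d e f".split(), []))  # ['a', 'b']
```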
1:24:51.511 --> 1:24:57.695
And then, of course, it is a bit better if you compare
to the previous hypotheses,
1:24:57.695 --> 1:25:01.528
because the hallucinations are typically different each time.
1:25:07.567 --> 1:25:25.939
Yes, so we won't talk about the details, but
for the output, for presentation as subtitles, there are different
1:25:25.939 --> 1:25:27.100
requirements.
1:25:27.347 --> 1:25:36.047
So you want to have maximum two lines, maximum
forty-two characters per line, and the reading
1:25:36.047 --> 1:25:40.212
speed is a maximum of twenty-one characters per second.
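(As a small illustration, a checker for exactly these constraints; reading the speed limit as characters per second is an assumption about the intended unit.)

```python
# Sketch: check a subtitle block against the constraints above
# (max 2 lines, max 42 characters per line, max 21 characters per second).

def subtitle_ok(lines, duration_s, max_lines=2, max_chars=42, max_cps=21):
    total_chars = sum(len(line) for line in lines)
    return (len(lines) <= max_lines
            and all(len(line) <= max_chars for line in lines)
            and total_chars / duration_s <= max_cps)

print(subtitle_ok(["We looked into three challenges",
                   "for speech translation."], duration_s=3.0))  # True
```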
1:25:40.981 --> 1:25:43.513
How to do that we can skip.
1:25:43.463 --> 1:25:46.804
Then you can generate something like that.
1:25:46.886 --> 1:25:53.250
Another challenge is, of course, that you
not only need to generate the translation,
1:25:53.250 --> 1:25:59.614
but for subtitling you also want to generate
when to put breaks and what to display.
1:25:59.619 --> 1:26:06.234
Because it cannot always be full sentences: as said
here, if you have a maximum of forty-two
1:26:06.234 --> 1:26:10.443
characters per line, that's not always a full
sentence.
1:26:10.443 --> 1:26:12.247
So how can you segment it?
1:26:13.093 --> 1:26:16.253
And then for speech there's not even punctuation
as a hint.
1:26:18.398 --> 1:26:27.711
So what we have done today: we looked
into three challenges. We have the segmentation,
1:26:27.711 --> 1:26:33.013
which is a challenge both in evaluation and
in the decoder.
1:26:33.013 --> 1:26:40.613
We talked about disfluencies and we talked
about simultaneous translation and how to
1:26:40.613 --> 1:26:42.911
address these challenges.
1:26:43.463 --> 1:26:45.507
Any more questions?
1:26:48.408 --> 1:26:52.578
Good, then with new content
1:26:52.578 --> 1:26:58.198
we are done for this semester.
1:26:58.198 --> 1:27:04.905
Next time there will be a
1:27:04.744 --> 1:27:09.405
repetition where we can try to repeat a bit
of what we've done over the semester.
1:27:10.010 --> 1:27:13.776
I will prepare a bit of repetition of what I think
is important.
1:27:14.634 --> 1:27:21.441
But of course it is also the chance for you to
ask specific questions.
1:27:21.441 --> 1:27:25.445
For example, if it's not clear to you how things relate.
1:27:25.745 --> 1:27:34.906
So if you have any specific questions, please
come to me or send me an email or so, then
1:27:34.906 --> 1:27:36.038
I'm happy to cover them.
1:27:36.396 --> 1:27:46.665
If I should cover something really in depth, it
might be good to not just send me an email
1:27:46.665 --> 1:27:49.204
on Wednesday evening, but earlier.
|