|
1 |
|
00:00:00,000 --> 00:00:06,600 |
|
Hello, dear listeners!
|
|
|
2 |
|
00:00:06,600 --> 00:00:09,980 |
|
Welcome to the weekend special of the Hugging Face Daily AI Papers digest
|
|
|
3 |
|
00:00:09,980 --> 00:00:14,280 |
|
Every Sunday we bring you a roundup of the most popular papers on Hugging Face from the past week
|
|
|
4 |
|
00:00:14,280 --> 00:00:18,379 |
|
This episode covers the period from June 2 to June 8, 2025
|
|
|
5 |
|
00:00:18,379 --> 00:00:25,199 |
|
In this episode we have selected five widely discussed papers, covering how reinforcement learning (RL)
|
|
|
6 |
|
00:00:25,199 --> 00:00:28,400 |
|
enables self-improvement in large language models (LLMs),
|
|
|
7 |
|
00:00:28,399 --> 00:00:33,079 |
|
the role of high-entropy tokens in reasoning, prolonged reinforcement learning for expanding LLM reasoning,
|
|
|
8 |
|
00:00:33,079 --> 00:00:37,859 |
|
a test-time framework for fast and slow thinking in large models, and a cost-effective vision-
|
|
|
9 |
|
00:00:37,859 --> 00:00:39,500 |
|
language-action model.
|
|
|
10 |
|
00:00:39,500 --> 00:00:44,159 |
|
Let's dive into these cutting-edge studies and explore the latest advances in AI.
|
|
|
11 |
|
00:00:44,159 --> 00:00:45,340 |
|
The show starts now.
|
|
|
12 |
|
00:00:45,340 --> 00:00:53,500 |
|
The first paper of this episode is "Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning".
|
|
|
13 |
|
00:00:53,500 --> 00:00:57,039 |
|
This paper has received 169 upvotes in the Hugging Face community,
|
|
|
14 |
|
00:00:57,039 --> 00:00:59,759 |
|
reflecting both its research value and the community's interest.
|
|
|
15 |
|
00:00:59,759 --> 00:01:04,879 |
|
The paper's core goal is to improve the performance of large language models (LLMs)
|
|
|
16 |
|
00:01:04,879 --> 00:01:06,700 |
|
through a new framework called Reflect,
|
|
|
17 |
|
00:01:06,700 --> 00:01:07,359 |
|
Retry,
|
|
|
18 |
|
00:01:07,359 --> 00:01:09,239 |
|
Reward.
|
|
|
19 |
|
00:01:09,239 --> 00:01:13,219 |
|
The key to this framework is having the model reflect on itself after it fails a task,
|
|
|
20 |
|
00:01:13,219 --> 00:01:14,400 |
|
analyze why it failed,
|
|
|
21 |
|
00:01:14,400 --> 00:01:17,799 |
|
and use those reflections to do better on the next attempt.
|
|
|
22 |
|
00:01:17,799 --> 00:01:18,759 |
|
Specifically,
|
|
|
23 |
|
00:01:18,759 --> 00:01:22,099 |
|
after a failure the model generates a self-reflective commentary
|
|
|
24 |
|
00:01:22,099 --> 00:01:23,579 |
|
explaining what went wrong
|
|
|
25 |
|
00:01:23,579 --> 00:01:25,019 |
|
and proposing improvements.
|
|
|
26 |
|
00:01:25,019 --> 00:01:28,179 |
|
The model then retries the task guided by that reflection.
|
|
|
27 |
|
00:01:28,179 --> 00:01:29,879 |
|
If the second attempt succeeds,
|
|
|
28 |
|
00:01:29,879 --> 00:01:32,140 |
|
the content the model generated during the reflection phase
|
|
|
29 |
|
00:01:32,140 --> 00:01:34,920 |
|
is rewarded through an algorithm called Group Relative Policy Optimization
|
|
|
30 |
|
00:01:34,920 --> 00:01:36,699 |
|
(GRPO),
|
|
|
31 |
|
00:01:36,699 --> 00:01:39,239 |
|
which further optimizes its ability to self-reflect.
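
To make the loop concrete, here is a minimal sketch of the reflect-retry-reward idea described above; model.generate(), task.verify(), and grpo_update() are hypothetical placeholders rather than the paper's released code:

# Minimal sketch of the reflect-retry-reward loop, assuming hypothetical
# model.generate(), task.verify(), and grpo_update() interfaces.
def reflect_retry_reward(model, task, grpo_update):
    first = model.generate(task.prompt)            # first attempt
    if task.verify(first):
        return first                               # already correct, no update needed

    # Failure: ask the model to explain what went wrong and how to improve.
    reflection = model.generate(
        task.prompt
        + "\nYour previous answer was incorrect:\n" + first
        + "\nReflect on the mistake and suggest how to fix it."
    )

    # Retry with the self-reflection kept in context.
    second = model.generate(task.prompt + "\n" + reflection)

    # Only the reflection tokens are rewarded (GRPO-style) when the retry
    # succeeds, so the model learns to write reflections that actually help.
    reward = 1.0 if task.verify(second) else 0.0
    grpo_update(model, tokens=reflection, reward=reward)
    return second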
|
|
|
32 |
|
00:01:39,239 --> 00:01:42,319 |
|
The paper runs experiments with several models,
|
|
|
33 |
|
00:01:42,319 --> 00:01:43,379 |
|
including Qwen2,
|
|
|
34 |
|
00:01:43,379 --> 00:01:44,519 |
|
Llama 3.1,
|
|
|
35 |
|
00:01:44,519 --> 00:01:45,599 |
|
Phi-3.5-
|
|
|
36 |
|
00:01:45,599 --> 00:01:46,799 |
|
Mini-Instruct, and others,
|
|
|
37 |
|
00:01:46,799 --> 00:01:48,579 |
|
and builds on two main datasets:
|
|
|
38 |
|
00:01:48,579 --> 00:01:49,780 |
|
APIGen and Countdown.
|
|
|
39 |
|
00:01:49,780 --> 00:01:52,780 |
|
The APIGen dataset contains 60,000 high-quality function-calling examples,
|
|
|
40 |
|
00:01:52,780 --> 00:01:55,140 |
|
requiring the model to generate correct tool calls.
|
|
|
41 |
|
00:01:55,140 --> 00:01:56,299 |
|
The Countdown dataset
|
|
|
42 |
|
00:01:56,299 --> 00:01:59,280 |
|
contains 450,000 lists of numbers with target values,
|
|
|
43 |
|
00:01:59,280 --> 00:02:03,000 |
|
requiring the model to form a correct equation from those numbers to reach the target.
|
|
|
44 |
|
00:02:03,000 --> 00:02:04,299 |
|
The results show
|
|
|
45 |
|
00:02:04,299 --> 00:02:05,200 |
|
that this Reflect,
|
|
|
46 |
|
00:02:05,200 --> 00:02:05,820 |
|
Retry,
|
|
|
47 |
|
00:02:05,820 --> 00:02:09,219 |
|
Reward approach is highly effective at boosting model performance.
|
|
|
48 |
|
00:02:09,219 --> 00:02:11,159 |
|
On the APIGen dataset in particular,
|
|
|
49 |
|
00:02:11,159 --> 00:02:13,639 |
|
the GRPO-trained Qwen2 7B model
|
|
|
50 |
|
00:02:13,639 --> 00:02:17,020 |
|
even outperforms the untrained Qwen2 72B model.
|
|
|
51 |
|
00:02:17,020 --> 00:02:17,639 |
|
In addition,
|
|
|
52 |
|
00:02:17,639 --> 00:02:21,620 |
|
self-reflection markedly improves performance on the Countdown dataset,
|
|
|
53 |
|
00:02:21,620 --> 00:02:24,379 |
|
especially for models that start out weaker.
|
|
|
54 |
|
00:02:24,379 --> 00:02:26,000 |
|
The paper also points out
|
|
|
55 |
|
00:02:26,000 --> 00:02:30,139 |
|
that this self-reflection approach not only strengthens the model's ability to solve complex tasks,
|
|
|
56 |
|
00:02:30,139 --> 00:02:33,599 |
|
but also lets smaller models surpass larger, untrained ones,
|
|
|
57 |
|
00:02:33,599 --> 00:02:36,359 |
|
showing its advantages in efficiency and generality.
|
|
|
58 |
|
00:02:36,359 --> 00:02:36,800 |
|
Moreover,
|
|
|
59 |
|
00:02:36,800 --> 00:02:39,780 |
|
the study observed almost no catastrophic forgetting,
|
|
|
60 |
|
00:02:39,780 --> 00:02:43,380 |
|
indicating that the method also clearly improves model robustness.
|
|
|
61 |
|
00:02:43,380 --> 00:02:44,219 |
|
In summary,
|
|
|
62 |
|
00:02:44,219 --> 00:02:46,840 |
|
this paper proposes an innovative approach
|
|
|
63 |
|
00:02:46,840 --> 00:02:48,660 |
|
that uses reinforcement learning
|
|
|
64 |
|
00:02:48,660 --> 00:02:51,260 |
|
to let LLMs self-reflect and improve,
|
|
|
65 |
|
00:02:51,260 --> 00:02:53,800 |
|
achieving better results on complex tasks.
|
|
|
66 |
|
00:02:54,500 --> 00:02:57,300 |
|
Here is the second paper of this episode,
|
|
|
67 |
|
00:02:57,300 --> 00:02:59,300 |
|
titled "Beyond the 80/20 Rule:
|
|
|
68 |
|
00:02:59,300 --> 00:03:03,220 |
|
High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning".
|
|
|
69 |
|
00:03:03,219 --> 00:03:07,319 |
|
This paper currently has 130 upvotes in the Hugging Face community,
|
|
|
70 |
|
00:03:07,319 --> 00:03:10,120 |
|
showing it has drawn wide attention from the research community.
|
|
|
71 |
|
00:03:10,120 --> 00:03:12,300 |
|
The paper's central research question is:
|
|
|
72 |
|
00:03:12,300 --> 00:03:16,400 |
|
in reinforcement learning with verifiable rewards
|
|
|
73 |
|
00:03:16,400 --> 00:03:17,379 |
|
(RLVR) for large language models (LLMs),
|
|
|
74 |
|
00:03:17,379 --> 00:03:20,120 |
|
how do different kinds of tokens affect reasoning performance,
|
|
|
75 |
|
00:03:20,199 --> 00:03:24,680 |
|
and can RLVR be improved by focusing on specific kinds of tokens?
|
|
|
76 |
|
00:03:24,680 --> 00:03:26,719 |
|
The research team puts forward a hypothesis:
|
|
|
77 |
|
00:03:26,719 --> 00:03:30,699 |
|
the minority of high-entropy tokens, acting as key branching points on the reasoning path,
|
|
|
78 |
|
00:03:30,699 --> 00:03:34,780 |
|
drive RLVR more effectively than the low-entropy majority. They further hypothesize
|
|
|
79 |
|
00:03:34,780 --> 00:03:37,839 |
|
that restricting policy-gradient updates to these high-entropy tokens
|
|
|
80 |
|
00:03:37,839 --> 00:03:41,699 |
|
can offer computational advantages while maintaining or even improving performance.
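
As a rough illustration of what restricting the update to high-entropy tokens can look like, here is a sketch with made-up tensor names (not the authors' implementation); it keeps the per-token policy-gradient loss only on the top 20% highest-entropy positions:

import torch

# Sketch: keep the policy-gradient loss only on the top-k% highest-entropy tokens.
# logits: [batch, seq, vocab]; pg_loss_per_token: [batch, seq]; names are illustrative.
def high_entropy_masked_loss(logits, pg_loss_per_token, top_fraction=0.2):
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # [batch, seq]

    # Threshold at the (1 - top_fraction) entropy quantile within each sequence.
    threshold = torch.quantile(entropy, 1.0 - top_fraction, dim=-1, keepdim=True)
    mask = (entropy >= threshold).float()

    # Low-entropy majority tokens contribute nothing to the update.
    return (pg_loss_per_token * mask).sum() / mask.sum().clamp_min(1.0)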
|
|
|
81 |
|
00:03:41,699 --> 00:03:43,599 |
|
To test this hypothesis,
|
|
|
82 |
|
00:03:43,599 --> 00:03:46,079 |
|
the team designed a detailed set of experiments.
|
|
|
83 |
|
00:03:46,199 --> 00:03:51,839 |
|
They chose the 8B, 14B, and 32B base models of the Qwen3 LLM family as study subjects,
|
|
|
84 |
|
00:03:51,839 --> 00:03:55,219 |
|
analyzing token-entropy patterns in chain-of-thought (CoT) reasoning,
|
|
|
85 |
|
00:03:55,219 --> 00:03:57,459 |
|
running controlled experiments that modulate token entropy,
|
|
|
86 |
|
00:03:57,460 --> 00:04:00,620 |
|
and selectively updating policy gradients during RLVR training.
|
|
|
87 |
|
00:04:00,620 --> 00:04:01,860 |
|
For data collection,
|
|
|
88 |
|
00:04:01,860 --> 00:04:04,939 |
|
they used datasets such as AIME24 and AIME25,
|
|
|
89 |
|
00:04:04,939 --> 00:04:07,580 |
|
and validated on multiple evaluation datasets.
|
|
|
90 |
|
00:04:07,580 --> 00:04:08,900 |
|
The experimental results show
|
|
|
91 |
|
00:04:08,900 --> 00:04:11,980 |
|
that high-entropy tokens play a critical role in the reasoning process.
|
|
|
92 |
|
00:04:11,980 --> 00:04:14,760 |
|
They not only link the different steps of logical reasoning,
|
|
|
93 |
|
00:04:14,760 --> 00:04:18,319 |
|
but adjusting their decoding temperature also significantly affects model performance.
|
|
|
94 |
|
00:04:18,319 --> 00:04:19,240 |
|
Specifically,
|
|
|
95 |
|
00:04:19,240 --> 00:04:21,819 |
|
lowering the temperature of high-entropy tokens degrades performance,
|
|
|
96 |
|
00:04:21,819 --> 00:04:24,060 |
|
while raising their temperature improves it.
|
|
|
97 |
|
00:04:24,060 --> 00:04:24,620 |
|
Furthermore,
|
|
|
98 |
|
00:04:24,620 --> 00:04:27,980 |
|
RLVR largely preserves the base model's entropy patterns during training
|
|
|
99 |
|
00:04:27,980 --> 00:04:30,420 |
|
and mainly changes the entropy values of the high-entropy tokens.
|
|
|
100 |
|
00:04:30,420 --> 00:04:32,259 |
|
Most excitingly,
|
|
|
101 |
|
00:04:32,259 --> 00:04:33,620 |
|
the team found
|
|
|
102 |
|
00:04:33,620 --> 00:04:36,000 |
|
that policy-gradient updates focused only on high-entropy tokens
|
|
|
103 |
|
00:04:36,000 --> 00:04:37,459 |
|
not only do not hurt performance,
|
|
|
104 |
|
00:04:37,459 --> 00:04:40,639 |
|
but actually improve reasoning significantly on the Qwen3 models.
|
|
|
105 |
|
00:04:40,639 --> 00:04:44,120 |
|
This finding matters for optimizing LLM reasoning,
|
|
|
106 |
|
00:04:44,120 --> 00:04:46,480 |
|
especially on complex reasoning tasks,
|
|
|
107 |
|
00:04:46,480 --> 00:04:50,399 |
|
where focusing on high-entropy tokens balances exploration with training stability
|
|
|
108 |
|
00:04:50,399 --> 00:04:52,560 |
|
and brings larger performance gains.
|
|
|
109 |
|
00:04:52,560 --> 00:04:57,100 |
|
In short, by analyzing in depth how token entropy affects reasoning performance,
|
|
|
110 |
|
00:04:57,100 --> 00:05:01,019 |
|
this paper reveals the key role of the high-entropy minority of tokens in driving LLM reasoning
|
|
|
111 |
|
00:05:01,019 --> 00:05:04,720 |
|
and offers new ideas and methods for future LLM optimization.
|
|
|
112 |
|
00:05:04,720 --> 00:05:08,220 |
|
Here is the third paper of this episode,
|
|
|
113 |
|
00:05:08,220 --> 00:05:09,180 |
|
titled "ProRL:
|
|
|
114 |
|
00:05:09,180 --> 00:05:12,760 |
|
Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models".
|
|
|
115 |
|
00:05:12,760 --> 00:05:16,600 |
|
This paper currently has 115 upvotes in the Hugging Face community,
|
|
|
116 |
|
00:05:16,600 --> 00:05:19,680 |
|
showing it has drawn wide attention in the research community.
|
|
|
117 |
|
00:05:19,680 --> 00:05:21,920 |
|
The paper's central research question is:
|
|
|
118 |
|
00:05:21,920 --> 00:05:26,820 |
|
can prolonged reinforcement-learning training reveal new reasoning strategies in large language models,
|
|
|
119 |
|
00:05:26,819 --> 00:05:30,779 |
|
strategies that the base model cannot reach even with extensive sampling?
|
|
|
120 |
|
00:05:30,779 --> 00:05:32,639 |
|
The team hypothesizes
|
|
|
121 |
|
00:05:32,639 --> 00:05:34,779 |
|
that through prolonged reinforcement-learning training,
|
|
|
122 |
|
00:05:34,779 --> 00:05:38,279 |
|
a model can extend its reasoning ability beyond that of its base model,
|
|
|
123 |
|
00:05:38,279 --> 00:05:40,079 |
|
discover new solution paths,
|
|
|
124 |
|
00:05:40,079 --> 00:05:42,079 |
|
and perform better across a variety of tasks.
|
|
|
125 |
|
00:05:42,079 --> 00:05:43,519 |
|
To test this hypothesis,
|
|
|
126 |
|
00:05:43,519 --> 00:05:46,719 |
|
the team designed a new training method named ProRL.
|
|
|
127 |
|
00:05:46,719 --> 00:05:49,360 |
|
The approach combines KL-divergence control,
|
|
|
128 |
|
00:05:49,360 --> 00:05:52,259 |
|
reference-policy resets, and a diverse suite of tasks.
|
|
|
129 |
|
00:05:52,259 --> 00:05:54,579 |
|
They ran experiments with three models:
|
|
|
130 |
|
00:05:54,579 --> 00:05:55,939 |
|
DeepSeek-R1-1.
|
|
|
131 |
|
00:05:55,939 --> 00:05:57,560 |
|
5B as the base model,
|
|
|
132 |
|
00:05:57,560 --> 00:05:59,779 |
|
Nemotron-Research-Reasoning-Qwen-1.5B
|
|
|
133 |
|
00:05:59,779 --> 00:06:01,660 |
|
as the ProRL-trained model,
|
|
|
134 |
|
00:06:01,660 --> 00:06:04,519 |
|
and DeepSeek-R1-7B for comparison.
|
|
|
135 |
|
00:06:04,519 --> 00:06:05,600 |
|
During the experiments,
|
|
|
136 |
|
00:06:05,600 --> 00:06:09,100 |
|
ProRL training involved more than 2,000 reinforcement-learning steps,
|
|
|
137 |
|
00:06:09,100 --> 00:06:11,819 |
|
with a KL-divergence penalty introduced to maintain entropy
|
|
|
138 |
|
00:06:11,819 --> 00:06:13,220 |
|
and prevent policy drift.
|
|
|
139 |
|
00:06:13,220 --> 00:06:14,980 |
|
The reference policy is reset periodically
|
|
|
140 |
|
00:06:14,980 --> 00:06:16,279 |
|
to allow continued improvement.
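
A schematic of how these two stabilizers might fit into a training loop, written as a hedged sketch; every name (rollout, verify, rl_step, kl_divergence, copy_weights) and both constants are placeholders, not the released ProRL code:

# Sketch of a ProRL-style loop: an RL objective with a KL penalty against a
# frozen reference policy, plus periodic resets of that reference.
KL_COEF = 0.01          # assumed penalty strength
RESET_INTERVAL = 500    # assumed reset period, in RL steps

def prorl_train(policy, reference, tasks, kl_divergence, copy_weights, total_steps=2000):
    for step in range(total_steps):
        batch = tasks.sample()                  # math / code / STEM / puzzles / instruction following
        rollout = policy.rollout(batch)

        reward = batch.verify(rollout)          # verifiable reward signal
        kl = kl_divergence(policy, reference, rollout)
        objective = reward - KL_COEF * kl       # penalize drifting from the reference

        policy.rl_step(objective, rollout)

        # Reset the reference periodically so the KL term does not pin the
        # policy to its starting point forever.
        if (step + 1) % RESET_INTERVAL == 0:
            copy_weights(src=policy, dst=reference)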
|
|
|
141 |
|
00:06:16,279 --> 00:06:18,060 |
|
The training data covers math,
|
|
|
142 |
|
00:06:18,060 --> 00:06:18,759 |
|
code,
|
|
|
143 |
|
00:06:18,759 --> 00:06:19,120 |
|
STEM |
|
|
|
144 |
|
00:06:19,120 --> 00:06:21,560 |
|
logic puzzles, instruction following, and other tasks,
|
|
|
145 |
|
00:06:21,560 --> 00:06:24,480 |
|
adding up to 136,000 examples
|
|
|
146 |
|
00:06:24,480 --> 00:06:25,800 |
|
in a diverse training dataset.
|
|
|
147 |
|
00:06:25,800 --> 00:06:27,160 |
|
The results show
|
|
|
148 |
|
00:06:27,160 --> 00:06:29,259 |
|
that the RL-trained model's performance
|
|
|
149 |
|
00:06:29,259 --> 00:06:30,620 |
|
across a variety of tasks
|
|
|
150 |
|
00:06:30,620 --> 00:06:32,100 |
|
is significantly better than the base model's.
|
|
|
151 |
|
00:06:32,100 --> 00:06:32,700 |
|
For example,
|
|
|
152 |
|
00:06:32,700 --> 00:06:33,900 |
|
on math tasks,
|
|
|
153 |
|
00:06:33,900 --> 00:06:36,900 |
|
pass@1 improves by 14.7%;
|
|
|
154 |
|
00:06:36,900 --> 00:06:39,700 |
|
on coding tasks it improves by 13.9%;
|
|
|
155 |
|
00:06:39,700 --> 00:06:42,640 |
|
on logic puzzles by 54.8%;
|
|
|
156 |
|
00:06:42,640 --> 00:06:45,860 |
|
on STEM reasoning tasks by 25.1%;
|
|
|
157 |
|
00:06:45,860 --> 00:06:49,080 |
|
and on instruction-following tasks by 18.1%.
|
|
|
158 |
|
00:06:49,080 --> 00:06:49,439 |
|
In addition,
|
|
|
159 |
|
00:06:49,439 --> 00:06:50,540 |
|
the study also found
|
|
|
160 |
|
00:06:50,540 --> 00:06:52,540 |
|
that ProRL training, even beyond 2,000 steps,
|
|
|
161 |
|
00:06:52,540 --> 00:06:54,860 |
|
keeps improving model performance.
|
|
|
162 |
|
00:06:54,860 --> 00:06:57,220 |
|
The paper also introduces a creativity index
|
|
|
163 |
|
00:06:57,220 --> 00:06:59,160 |
|
to quantify the novelty of reasoning paths.
|
|
|
164 |
|
00:06:59,160 --> 00:07:00,180 |
|
The results indicate
|
|
|
165 |
|
00:07:00,180 --> 00:07:01,879 |
|
that prolonged reinforcement-learning training
|
|
|
166 |
|
00:07:01,879 --> 00:07:04,560 |
|
does indeed produce more innovative solutions.
|
|
|
167 |
|
00:07:04,560 --> 00:07:05,360 |
|
This finding
|
|
|
168 |
|
00:07:05,360 --> 00:07:06,379 |
|
challenges earlier conclusions
|
|
|
169 |
|
00:07:06,379 --> 00:07:07,500 |
|
that reinforcement-learning models
|
|
|
170 |
|
00:07:07,500 --> 00:07:09,620 |
|
do not acquire new reasoning abilities.
|
|
|
171 |
|
00:07:09,620 --> 00:07:10,420 |
|
In summary,
|
|
|
172 |
|
00:07:10,420 --> 00:07:12,520 |
|
this paper offers new insight,
|
|
|
173 |
|
00:07:12,520 --> 00:07:14,259 |
|
showing under what conditions
|
|
|
174 |
|
00:07:14,259 --> 00:07:17,560 |
|
reinforcement learning can effectively expand the reasoning boundaries of language models.
|
|
|
175 |
|
00:07:17,560 --> 00:07:18,920 |
|
The results suggest
|
|
|
176 |
|
00:07:18,920 --> 00:07:21,500 |
|
that through stable and prolonged reinforcement-learning training, models can
|
|
|
177 |
|
00:07:22,540 --> 00:07:24,080 |
|
develop new reasoning patterns that go beyond
|
|
|
178 |
|
00:07:24,080 --> 00:07:25,800 |
|
the initial capabilities of the base model.
|
|
|
179 |
|
00:07:25,800 --> 00:07:29,080 |
|
For the fourth paper of this episode,
|
|
|
180 |
|
00:07:29,080 --> 00:07:30,220 |
|
we turn to a study
|
|
|
181 |
|
00:07:30,220 --> 00:07:31,480 |
|
called AlphaOne (α1),
|
|
|
182 |
|
00:07:31,480 --> 00:07:33,120 |
|
a test-time framework that drives large models
|
|
|
183 |
|
00:07:33,120 --> 00:07:35,340 |
|
to reason with fast and slow thinking.
|
|
|
184 |
|
00:07:35,340 --> 00:07:37,740 |
|
In the Hugging Face community, this paper
|
|
|
185 |
|
00:07:37,740 --> 00:07:39,180 |
|
currently has 89 upvotes,
|
|
|
186 |
|
00:07:39,180 --> 00:07:42,660 |
|
showing wide attention from both academia and the developer community.
|
|
|
187 |
|
00:07:42,660 --> 00:07:46,200 |
|
The paper's core goal is to address the challenge of how large reasoning models
|
|
|
188 |
|
00:07:46,200 --> 00:07:47,860 |
|
(LRMs), at test time,
|
|
|
189 |
|
00:07:47,860 --> 00:07:50,140 |
|
can dynamically modulate their reasoning process.
|
|
|
190 |
|
00:07:50,139 --> 00:07:52,539 |
|
The researchers propose a framework named AlphaOne
|
|
|
191 |
|
00:07:52,539 --> 00:07:53,919 |
|
(α1),
|
|
|
192 |
|
00:07:53,919 --> 00:07:56,879 |
|
aimed at improving both the reasoning ability and the efficiency of LRMs.
|
|
|
193 |
|
00:07:56,879 --> 00:07:57,839 |
|
Put simply,
|
|
|
194 |
|
00:07:57,839 --> 00:07:59,560 |
|
α1 works at test time by
|
|
|
195 |
|
00:07:59,560 --> 00:08:02,099 |
|
dynamically scheduling the transition between slow thinking and fast thinking,
|
|
|
196 |
|
00:08:02,099 --> 00:08:06,680 |
|
helping the model strike a balance between deep analysis and computational efficiency.
|
|
|
197 |
|
00:08:06,680 --> 00:08:07,379 |
|
In detail,
|
|
|
198 |
|
00:08:07,379 --> 00:08:11,180 |
|
the team uses three open-source LRMs as base models:
|
|
|
199 |
|
00:08:11,180 --> 00:08:12,719 |
|
DeepSeek-R1-
|
|
|
200 |
|
00:08:12,719 --> 00:08:14,180 |
|
Distill-Qwen-1.5B,
|
|
|
201 |
|
00:08:14,180 --> 00:08:15,079 |
|
DeepSeek-R1-
|
|
|
202 |
|
00:08:15,079 --> 00:08:17,379 |
|
Distill-Qwen-7B, and Qwen QwQ-32B.
|
|
|
203 |
|
00:08:17,379 --> 00:08:18,899 |
|
They ran experiments on six benchmarks covering math,
|
|
|
204 |
|
00:08:18,899 --> 00:08:22,279 |
|
programming, and science,
|
|
|
205 |
|
00:08:22,279 --> 00:08:23,699 |
|
including AIME2024,
|
|
|
206 |
|
00:08:23,699 --> 00:08:24,779 |
|
AMC23,
|
|
|
207 |
|
00:08:24,779 --> 00:08:25,759 |
|
Minerva Math, and others.
|
|
|
208 |
|
00:08:25,759 --> 00:08:29,339 |
|
The experiments ran on NVIDIA L40S and A100 GPUs,
|
|
|
209 |
|
00:08:29,339 --> 00:08:32,480 |
|
ensuring sufficient compute and reliable results.
|
|
|
210 |
|
00:08:32,480 --> 00:08:37,120 |
|
The paper's main innovation is the concept of the alpha moment:
|
|
|
211 |
|
00:08:37,120 --> 00:08:39,659 |
|
by modulating reasoning before and after the alpha moment,
|
|
|
212 |
|
00:08:39,659 --> 00:08:43,340 |
|
α1 can effectively scale LRMs at test time.
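
A simplified, hedged reading of the pre/post-alpha-moment modulation, with an assumed linear schedule, assumed token markers ("wait", "</think>"), and placeholder model methods; this is our own sketch, not the authors' implementation:

import random

# Before the alpha moment, slow thinking is encouraged by occasionally
# inserting a "wait"-style continuation token with a linearly decaying
# probability; after it, the reasoning is pushed to wrap up.
def alpha_one_generate(model, prompt, think_budget=1000, alpha=1.4):
    alpha_moment = int(alpha * think_budget)   # boundary between slow and fast phases
    tokens = model.start(prompt)

    while not model.finished(tokens):
        pos = len(tokens)
        if pos < alpha_moment:
            # Pre-alpha phase: a decaying chance of stretching the reasoning.
            p_slow = max(0.0, 1.0 - pos / alpha_moment)
            if model.at_sentence_break(tokens) and random.random() < p_slow:
                tokens = model.append(tokens, "wait")
                continue
        else:
            # Post-alpha phase: force the switch to fast thinking / the final answer.
            if model.wants_to_continue_thinking(tokens):
                tokens = model.append(tokens, "</think>")
                continue
        tokens = model.step(tokens)            # ordinary next-token decoding
    return model.decode(tokens)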
|
|
|
213 |
|
00:08:43,340 --> 00:08:45,320 |
|
Through comparison experiments, the researchers also
|
|
|
214 |
|
00:08:45,320 --> 00:08:47,899 |
|
verified that α1 brings significant gains in problem-solving accuracy
|
|
|
215 |
|
00:08:47,899 --> 00:08:49,680 |
|
and in reasoning efficiency
|
|
|
216 |
|
00:08:49,680 --> 00:08:51,700 |
|
across the reported metrics.
|
|
|
217 |
|
00:08:51,700 --> 00:08:53,759 |
|
For example, the 1.5B model,
|
|
|
218 |
|
00:08:53,759 --> 00:08:54,920 |
|
after applying α1,
|
|
|
219 |
|
00:08:54,920 --> 00:08:58,039 |
|
improves problem-solving accuracy by 6.15%
|
|
|
220 |
|
00:08:58,039 --> 00:09:00,480 |
|
while reducing token length by 14%.
|
|
|
221 |
|
00:09:00,480 --> 00:09:02,220 |
|
The results show
|
|
|
222 |
|
00:09:02,220 --> 00:09:06,379 |
|
that α1 not only surpasses conventional test-time scaling methods in accuracy,
|
|
|
223 |
|
00:09:06,379 --> 00:09:07,899 |
|
such as s1 and Chain of Draft,
|
|
|
224 |
|
00:09:07,899 --> 00:09:10,220 |
|
but also performs strongly in reasoning efficiency.
|
|
|
225 |
|
00:09:10,220 --> 00:09:11,060 |
|
In particular,
|
|
|
226 |
|
00:09:11,060 --> 00:09:14,300 |
|
the paper finds that a linear slow-to-fast thinking schedule
|
|
|
227 |
|
00:09:14,300 --> 00:09:16,440 |
|
yields the highest reasoning accuracy,
|
|
|
228 |
|
00:09:16,440 --> 00:09:20,279 |
|
suggesting that scheduling slow thinking first plays a key role in effective and efficient reasoning.
|
|
|
229 |
|
00:09:20,279 --> 00:09:21,180 |
|
Overall,
|
|
|
230 |
|
00:09:21,180 --> 00:09:25,860 |
|
α1 gives large reasoning models a general framework for modulating the reasoning process,
|
|
|
231 |
|
00:09:25,860 --> 00:09:28,620 |
|
showing how dynamic switching between slow and fast thinking
|
|
|
232 |
|
00:09:28,620 --> 00:09:30,800 |
|
can effectively improve a model's reasoning ability.
|
|
|
233 |
|
00:09:30,799 --> 00:09:34,839 |
|
This work not only offers new ideas for practical applications of LRMs,
|
|
|
234 |
|
00:09:34,839 --> 00:09:38,719 |
|
but also provides valuable experience for optimizing model reasoning at test time.
|
|
|
235 |
|
00:09:38,719 --> 00:09:44,899 |
|
That concludes this episode's introduction to AlphaOne (α1), the test-time framework that drives large models to think slow and fast.
|
|
|
236 |
|
00:09:44,899 --> 00:09:48,439 |
|
Here is the fifth paper of this episode,
|
|
|
237 |
|
00:09:48,439 --> 00:09:48,939 |
|
titled "SmolVLA:
|
|
|
238 |
|
00:09:48,939 --> 00:09:52,439 |
|
a Vision-
|
|
|
239 |
|
00:09:52,439 --> 00:09:53,079 |
|
Language-
|
|
|
240 |
|
00:09:53,079 --> 00:09:54,059 |
|
Action Model for Affordable and Efficient Robotics".
|
|
|
241 |
|
00:09:54,059 --> 00:09:58,000 |
|
This paper currently has 75 upvotes in the Hugging Face community.
|
|
|
242 |
|
00:09:58,000 --> 00:10:00,980 |
|
The paper's core goal is to address the problems that existing large-scale vision-
|
|
|
243 |
|
00:10:00,980 --> 00:10:01,600 |
|
language-
|
|
|
244 |
|
00:10:01,600 --> 00:10:02,299 |
|
action
|
|
|
245 |
|
00:10:02,299 --> 00:10:02,779 |
|
(VLA)
|
|
|
246 |
|
00:10:02,779 --> 00:10:07,379 |
|
models face in robotics: high training costs and difficult real-world deployment.
|
|
|
247 |
|
00:10:07,379 --> 00:10:09,879 |
|
The research team poses a key question:
|
|
|
248 |
|
00:10:09,879 --> 00:10:11,679 |
|
can we develop a small,
|
|
|
249 |
|
00:10:11,679 --> 00:10:13,980 |
|
efficient, community-driven VLA model
|
|
|
250 |
|
00:10:13,980 --> 00:10:16,360 |
|
that dramatically cuts training and inference costs
|
|
|
251 |
|
00:10:16,360 --> 00:10:19,319 |
|
while staying competitive on robotic tasks?
|
|
|
252 |
|
00:10:19,319 --> 00:10:20,720 |
|
The paper's answer is SmolVLA,
|
|
|
253 |
|
00:10:20,720 --> 00:10:22,579 |
|
a compact VLA model
|
|
|
254 |
|
00:10:22,579 --> 00:10:26,179 |
|
designed specifically for single-GPU training and deployment on consumer-grade devices.
|
|
|
255 |
|
00:10:26,179 --> 00:10:29,740 |
|
By leveraging community-collected data and an asynchronous inference technique,
|
|
|
256 |
|
00:10:29,740 --> 00:10:33,539 |
|
SmolVLA achieves performance comparable to much larger models.
|
|
|
257 |
|
00:10:33,539 --> 00:10:34,419 |
|
Methodologically,
|
|
|
258 |
|
00:10:34,419 --> 00:10:37,019 |
|
SmolVLA consists of a compact pretrained vision-
|
|
|
259 |
|
00:10:37,019 --> 00:10:40,259 |
|
language model (VLM) and an action expert.
|
|
|
260 |
|
00:10:40,259 --> 00:10:42,240 |
|
The VLM processes language instructions,
|
|
|
261 |
|
00:10:42,240 --> 00:10:44,620 |
|
RGB images, and the robot's sensor state,
|
|
|
262 |
|
00:10:44,620 --> 00:10:48,919 |
|
while the action expert is trained with alternating cross-attention and self-attention blocks
|
|
|
263 |
|
00:10:48,919 --> 00:10:50,299 |
|
and outputs low-level actions.
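
A hedged sketch of what an action expert with alternating cross-attention and self-attention blocks could look like in PyTorch; the dimensions, layer count, and action size are illustrative, not SmolVLA's actual configuration:

import torch.nn as nn

# Alternates cross-attention (to VLM features) and self-attention (over the
# action token sequence), then projects to low-level actions.
class ActionExpert(nn.Module):
    def __init__(self, dim=512, heads=8, layers=4, action_dim=7):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(layers)
        )
        self.proj = nn.Linear(dim, action_dim)

    def forward(self, action_tokens, vlm_features):
        x = action_tokens                        # [batch, n_actions, dim]
        for i, attn in enumerate(self.blocks):
            if i % 2 == 0:
                # Cross-attention: queries are action tokens, keys/values are VLM features.
                y, _ = attn(x, vlm_features, vlm_features)
            else:
                # Self-attention over the action tokens themselves.
                y, _ = attn(x, x, x)
            x = x + y                            # residual connection
        return self.proj(x)                      # low-level actions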
|
|
|
264 |
|
00:10:50,299 --> 00:10:51,259 |
|
On the dataset side,
|
|
|
265 |
|
00:10:51,259 --> 00:10:55,980 |
|
the team used a subset of 481 community datasets from Hugging Face,
|
|
|
266 |
|
00:10:55,980 --> 00:10:57,879 |
|
along with a new MetaWorld dataset
|
|
|
267 |
|
00:10:57,879 --> 00:11:00,679 |
|
and several real-world robot manipulation task datasets.
|
|
|
268 |
|
00:11:00,679 --> 00:11:01,820 |
|
During training,
|
|
|
269 |
|
00:11:01,820 --> 00:11:03,639 |
|
SmolVLA is pretrained via imitation learning
|
|
|
270 |
|
00:11:03,639 --> 00:11:05,639 |
|
on the community datasets,
|
|
|
271 |
|
00:11:05,639 --> 00:11:07,299 |
|
and an off-the-shelf VLM,
|
|
|
272 |
|
00:11:07,299 --> 00:11:08,419 |
|
such as Qwen2.5-
|
|
|
273 |
|
00:11:08,419 --> 00:11:09,860 |
|
VL-3B-Instruct,
|
|
|
274 |
|
00:11:09,860 --> 00:11:11,220 |
|
is used to automatically generate task descriptions
|
|
|
275 |
|
00:11:11,220 --> 00:11:12,639 |
|
to improve task annotations.
|
|
|
276 |
|
00:11:12,639 --> 00:11:13,559 |
|
At inference time,
|
|
|
277 |
|
00:11:13,559 --> 00:11:14,700 |
|
the asynchronous inference technique
|
|
|
278 |
|
00:11:14,700 --> 00:11:17,340 |
|
decouples action execution from observation processing and action prediction,
|
|
|
279 |
|
00:11:17,340 --> 00:11:19,320 |
|
which raises the control frequency
|
|
|
280 |
|
00:11:19,320 --> 00:11:21,080 |
|
and reduces task completion time.
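
As a rough sketch of decoupling execution from prediction, here is hypothetical threading code (not the SmolVLA implementation): a control loop executes actions from a queue while a background thread keeps refilling it with newly predicted action chunks.

import queue
import threading

# robot, policy, and the chunk/refill sizes are illustrative placeholders.
action_queue = queue.Queue()

def prediction_loop(policy, robot, refill_below=5):
    while True:
        if action_queue.qsize() < refill_below:       # queue running low
            obs = robot.get_observation()              # latest images + state
            for action in policy.predict_chunk(obs):   # e.g. a chunk of actions
                action_queue.put(action)

def control_loop(robot):
    while True:
        action = action_queue.get()                    # blocks only when empty
        robot.execute(action)                          # runs at the control frequency

def run(policy, robot):
    threading.Thread(target=prediction_loop, args=(policy, robot), daemon=True).start()
    control_loop(robot)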
|
|
|
281 |
|
00:11:21,080 --> 00:11:22,059 |
|
In evaluations,
|
|
|
282 |
|
00:11:22,059 --> 00:11:26,279 |
|
SmolVLA performs strongly on both simulated and real-world robotic benchmarks,
|
|
|
283 |
|
00:11:26,279 --> 00:11:29,740 |
|
especially on pick, place, stack, and sort tasks,
|
|
|
284 |
|
00:11:29,740 --> 00:11:31,299 |
|
outperforming other VLA models.
|
|
|
285 |
|
00:11:31,299 --> 00:11:32,259 |
|
Asynchronous inference
|
|
|
286 |
|
00:11:32,259 --> 00:11:35,839 |
|
also cuts task completion time by roughly 30%.
|
|
|
287 |
|
00:11:35,839 --> 00:11:36,959 |
|
The paper concludes
|
|
|
288 |
|
00:11:36,959 --> 00:11:39,000 |
|
that by leveraging community-driven datasets,
|
|
|
289 |
|
00:11:39,000 --> 00:11:41,600 |
|
an optimized model architecture, and asynchronous inference,
|
|
|
290 |
|
00:11:41,600 --> 00:11:43,240 |
|
a compact, efficient VLA model
|
|
|
291 |
|
00:11:43,240 --> 00:11:45,720 |
|
can achieve competitive performance on robotic tasks.
|
|
|
292 |
|
00:11:45,720 --> 00:11:47,299 |
|
SmolVLA successfully demonstrates
|
|
|
293 |
|
00:11:47,299 --> 00:11:49,720 |
|
the feasibility of building cost-effective VLA models,
|
|
|
294 |
|
00:11:49,720 --> 00:11:52,240 |
|
opening new possibilities for robotics research
|
|
|
295 |
|
00:11:52,240 --> 00:11:55,419 |
|
and making more resource-constrained real-world applications possible.
|
|
|
296 |
|
00:11:55,419 --> 00:11:59,139 |
|
That's all for this episode.
|
|
|
297 |
|
00:11:59,139 --> 00:12:00,459 |
|
Thank you for listening.
|
|
|
298 |
|
00:12:00,459 --> 00:12:02,059 |
|
If you enjoyed this episode,
|
|
|
299 |
|
00:12:02,059 --> 00:12:03,539 |
|
feel free to leave a comment,
|
|
|
300 |
|
00:12:03,539 --> 00:12:04,159 |
|
like,
|
|
|
301 |
|
00:12:04,159 --> 00:12:04,740 |
|
share,
|
|
|
302 |
|
00:12:04,740 --> 00:12:05,979 |
|
and subscribe to our show.
|
|
|
303 |
|
00:12:05,979 --> 00:12:06,559 |
|
Also,
|
|
|
304 |
|
00:12:06,559 --> 00:12:08,659 |
|
don't forget to follow our account on Xiaohongshu,
|
|
|
305 |
|
00:12:08,659 --> 00:12:09,199 |
|
ISOD |
|
|
|
306 |
|
00:12:09,199 --> 00:12:10,539 |
|
See you in the next episode!
|
|
|
307 |
|
00:12:10,539 --> 00:12:12,179 |
|
Bye-bye!
|
|
|