Update README.md
README.md
CHANGED
@@ -5,10 +5,20 @@ datasets:
base_model:
- google/gemma-2-27b
---
# Model Card for MA-RLHF
<a href="https://iclr.cc/Conferences/2024" target="_blank">
<img alt="ICLR 2025" src="https://img.shields.io/badge/Proceedings-ICLR2025-red" />
</a>

This repository contains the official checkpoint for [Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF)](https://arxiv.org/pdf/2410.02743).

@@ -16,6 +26,18 @@ This repository contains the official checkpoint for [Reinforcement Learning Fro

MA-RLHF is a novel framework that integrates macro actions into conventional RLHF. The macro actions are sequences of tokens or higher-level language constructs, which can be computed through different defined termination conditions, like n-gram-based, perplexity-based, or parsing-based termination conditions. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, alleviating the credit assignment problem caused by long temporal distances.

## Model Usage

```python
@@ -5,10 +5,20 @@ datasets:
base_model:
- google/gemma-2-27b
---
+---
+license: mit
+datasets:
+- openai/summarize_from_feedback
+base_model:
+- google/gemma-2-27b
+---
# Model Card for MA-RLHF
<a href="https://iclr.cc/Conferences/2024" target="_blank">
<img alt="ICLR 2025" src="https://img.shields.io/badge/Proceedings-ICLR2025-red" />
</a>
+<a href="https://github.com/ernie-research/MA-RLHF" target="_blank">
+<img alt="Github" src="https://img.shields.io/badge/Github-MA_RLHF-green" />
+</a>

This repository contains the official checkpoint for [Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF)](https://arxiv.org/pdf/2410.02743).

@@ -16,6 +26,18 @@ This repository contains the official checkpoint for [Reinforcement Learning Fro

MA-RLHF is a novel framework that integrates macro actions into conventional RLHF. The macro actions are sequences of tokens or higher-level language constructs, which can be computed through different defined termination conditions, like n-gram-based, perplexity-based, or parsing-based termination conditions. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, alleviating the credit assignment problem caused by long temporal distances.

+
+|Model|Checkpoint|Base Model|Dataset|
+|-----|----------|-|-|
+|TLDR-Gemma-2B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2B-MA-PPO-Fixed5)|[google/gemma-2b](https://huggingface.co/google/gemma-2b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)|
+|TLDR-Gemma-7B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-7B-MA-PPO-Fixed5)|[google/gemma-7b](https://huggingface.co/google/gemma-7b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)|
+|TLDR-Gemma-2-27B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2-27B-MA-PPO-Fixed5)|[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)|
+|HH-RLHF-Gemma-2B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-2B-MA-PPO-Fixed5)|[google/gemma-2b](https://huggingface.co/google/gemma-2b)|[Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)|
+|HH-RLHF-Gemma-7B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-7B-MA-PPO-Fixed5)|[google/gemma-7b](https://huggingface.co/google/gemma-7b)|[Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)|
+|APPS-Gemma-2B-MA-PPO-Fixed10|🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-2B-MA-PPO-Fixed10)|[google/codegemma-2b](https://huggingface.co/google/codegemma-2b)|[codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)|
+|APPS-Gemma-7B-MA-PPO-Fixed10|🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-7B-MA-PPO-Fixed10)|[google/codegemma-7b-it](https://huggingface.co/google/codegemma-7b-it)|[codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)|
+
+
## Model Usage

```python