Moyu-hrsun committed · verified
Commit 08570ca · 1 Parent(s): c5da9c9

Update README.md

Files changed (1)
  1. README.md +22 -0
README.md CHANGED
@@ -5,10 +5,20 @@ datasets:
  base_model:
  - google/gemma-2-27b
  ---
+ ---
+ license: mit
+ datasets:
+ - openai/summarize_from_feedback
+ base_model:
+ - google/gemma-2-27b
+ ---
  # Model Card for MA-RLHF
  <a href="https://iclr.cc/Conferences/2024" target="_blank">
  <img alt="ICLR 2025" src="https://img.shields.io/badge/Proceedings-ICLR2025-red" />
  </a>
+ <a href="https://github.com/ernie-research/MA-RLHF" target="_blank">
+ <img alt="Github" src="https://img.shields.io/badge/Github-MA_RLHF-green" />
+ </a>

  This repository contains the official checkpoint for [Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF)](https://arxiv.org/pdf/2410.02743).

@@ -16,6 +26,18 @@ This repository contains the official checkpoint for [Reinforcement Learning Fro

  MA-RLHF is a novel framework that integrates macro actions into conventional RLHF. The macro actions are sequences of tokens or higher-level language constructs, which can be computed through different termination conditions, such as n-gram-based, perplexity-based, or parsing-based conditions. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, alleviating the credit assignment problem caused by long temporal distances.

+
+ |Model|Checkpoint|Base Model|Dataset|
+ |-----|----------|-|-|
+ |TLDR-Gemma-2B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2B-MA-PPO-Fixed5)|[google/gemma-2b](https://huggingface.co/google/gemma-2b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
+ |TLDR-Gemma-7B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-7B-MA-PPO-Fixed5)|[google/gemma-7b](https://huggingface.co/google/gemma-7b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
+ |TLDR-Gemma-2-27B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/TLDR-Gemma-2-27B-MA-PPO-Fixed5)|[google/gemma-2-27b](https://huggingface.co/google/gemma-2-27b)|[openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
+ |HH-RLHF-Gemma-2B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-2B-MA-PPO-Fixed5) |[google/gemma-2b](https://huggingface.co/google/gemma-2b)|[Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
+ |HH-RLHF-Gemma-7B-MA-PPO-Fixed5|🤗 [HF Link](https://huggingface.co/baidu/HH-RLHF-Gemma-7B-MA-PPO-Fixed5) |[google/gemma-7b](https://huggingface.co/google/gemma-7b)|[Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
+ |APPS-Gemma-2B-MA-PPO-Fixed10|🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-2B-MA-PPO-Fixed10) |[google/codegemma-2b](https://huggingface.co/google/codegemma-2b)|[codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)
+ |APPS-Gemma-7B-MA-PPO-Fixed10|🤗 [HF Link](https://huggingface.co/baidu/APPS-Gemma-7B-MA-PPO-Fixed10) |[google/codegemma-7b-it](https://huggingface.co/google/codegemma-7b-it)|[codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)
+
+
  ## Model Usage

  ```python
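# NOTE: illustrative sketch, not the official usage example from the MA-RLHF
# README. It assumes the checkpoint loads with the standard Hugging Face
# `transformers` AutoModelForCausalLM API; the repository ID below is taken
# from the checkpoint table above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baidu/TLDR-Gemma-2-27B-MA-PPO-Fixed5"  # any checkpoint from the table loads the same way

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A simple TL;DR-style summarization prompt (illustrative; the training prompt format may differ).
prompt = "POST: I finally finished the side project I had been putting off for months.\nTL;DR:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The `Fixed5` / `Fixed10` suffixes in the checkpoint names appear to correspond to the n-gram-based termination condition described above with a fixed macro-action length of 5 or 10 tokens; that reading is an assumption, not stated in this commit. A minimal sketch of the idea, illustrative only and not code from the MA-RLHF repository:

```python
# Fixed n-gram termination: group every n consecutive tokens into one macro
# action, which shortens the decision trajectory seen by PPO.
def fixed_ngram_macro_actions(token_ids, n=5):
    """Split a token sequence into macro actions of at most n tokens each."""
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]

print(fixed_ngram_macro_actions(list(range(12)), n=5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```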