---
language:
- en
tags:
- role-playing
- character simulation
- llama
- llama-3.1
- persona
license: mit
datasets:
- Neph0s/CoSER
---

# CoSER Models

CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). They are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.

CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.

### Model Variants

- **CoSER-8B**: Fine-tuned from LLaMA-3.1-8B
- **CoSER-70B**: Fine-tuned from LLaMA-3.1-70B

## Training Data

The models are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which differs from existing RPLA datasets in two fundamental ways:

1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.

2. It incorporates comprehensive types of data:
   - Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds
   - Conversations that capture characters' internal thoughts and physical actions beyond surface-level speech

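As a rough illustration of how these data types fit together, a single record might look like the following. This is a hypothetical sketch with made-up field names and content; consult the dataset card for the actual schema.

```python
# Hypothetical sketch of one CoSER-style record; field names and values are
# illustrative only -- see the dataset card for the real schema.
record = {
    "book": "Example Novel",
    "plot_summary": "Two old friends reunite after a long estrangement.",
    "conversation_background": "A rainy evening at the harbor inn.",
    "characters": {
        "Anne": {"profile": "Reserved, loyal, haunted by the past."},
        "Marta": {"profile": "Outspoken, quick to forgive."},
    },
    "dialogue": [
        {
            "speaker": "Marta",
            "thought": "[She looks thinner than I remember.]",  # internal thought
            "action": "(crosses the room and embraces Anne)",   # physical action
            "speech": "I never stopped hoping you'd come back.",
        },
    ],
}
```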
## Training Methodology

Our training approach is based on given-circumstance acting (GCA):

Given a conversation with messages M, characters C, and setting S, the actor LLM sequentially portrays each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language-modeling loss on that character's messages.

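The per-character supervision can be sketched as a label-masking step: when the model portrays a given character, only that character's tokens contribute to the loss. This is a minimal illustration with a hypothetical `build_labels` helper, not the authors' actual training code.

```python
# Sketch of GCA-style loss masking (illustrative; not the actual training code).
# When portraying `target`, only tokens of that character's messages are
# supervised; all other tokens get the ignore label.

IGNORE_INDEX = -100  # label value skipped by cross-entropy in common trainers

def build_labels(messages, token_ids_per_message, target):
    """messages: list of (speaker, text) pairs; token_ids_per_message: parallel
    list of token-id lists. Returns flat labels with non-target tokens masked."""
    labels = []
    for (speaker, _), token_ids in zip(messages, token_ids_per_message):
        if speaker == target:
            labels.extend(token_ids)                        # supervise target's tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # mask other characters
    return labels
```

Repeating this for each character in C yields one training example per portrayed character from the same conversation.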
## Performance and Evaluation

We evaluate our models via GCA evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:

1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model that manages turn-taking.

2. We assess the generated conversations using penalty-based LLM judges, which are provided with detailed rubrics and the original conversations for reference.

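The simulation step above can be sketched as a simple turn-taking loop. Here `next_actor` and `generate_reply` are hypothetical stand-ins for the next-actor-prediction model and the actor LLM, not a real API.

```python
# Minimal sketch of the GCA multi-agent simulation loop (illustrative only;
# `next_actor` and `generate_reply` are hypothetical model wrappers).

def simulate(characters, setting, next_actor, generate_reply, max_turns=20):
    """Alternate turns until the next-actor predictor ends the scene."""
    conversation = []
    for _ in range(max_turns):
        speaker = next_actor(conversation, characters, setting)
        if speaker is None:  # predictor signals the conversation is over
            break
        message = generate_reply(speaker, conversation, setting)
        conversation.append({"speaker": speaker, "message": message})
    return conversation
```

The resulting conversation is then scored by the LLM judges against the rubrics and the original excerpt.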
### Performance on Given-Circumstance Acting

CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models such as GPT-4o.

| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|-------|----------------------|------------------|-------------------|------------------|--------------|------|---------|
| **Closed-source Models** | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| **Open-source Models** | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |

*Note: Bold values indicate the best performance across all models.*

### Performance on Existing RPLA Benchmarks

| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|-------|----------------|------------------|-------------|----------|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| *CoSER-8B trained w/o I.T.* | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| *CoSER-70B trained w/o I.T.* | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |

*Note: Bold values indicate the best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy for individual (Dim) and full (Full) dimensions on the BFI.*

## Ethical Considerations

We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:

- The models may still generate content that reflects biases present in the literary works they were trained on.
- Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
- Users should implement appropriate safeguards when deploying these models in applications.

## Citation

If you use CoSER models in your research, please cite our paper:

```bibtex
@misc{wang2025cosercoordinatingllmbasedpersona,
      title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles},
      author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
      year={2025},
      eprint={2502.09082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09082},
}
```