---
language:
- en
tags:
- role-playing
- character simulation
- llama
- llama-3.1
- persona
license: mit
datasets:
- Neph0s/CoSER
---

# CoSER Models

CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). They are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.

CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments demonstrate that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.

### Model Variants

- **CoSER-8B**: Fine-tuned from LLaMA-3.1-8B
- **CoSER-70B**: Fine-tuned from LLaMA-3.1-70B

## Training Data

The models are trained on the [CoSER dataset](https://huggingface.co/datasets/Neph0s/CoSER), which differs from existing RPLA datasets in two fundamental ways:

1. It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.

2. It incorporates comprehensive types of data:
   - Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds
   - Conversations that capture characters' internal thoughts and physical actions beyond surface-level speech

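As a rough illustration of how these data types fit together, a single record might look like the following. This is a hypothetical sketch with made-up field names and content; consult the dataset card for the actual schema.

```python
# Hypothetical sketch of one CoSER-style record; field names and values are
# illustrative only -- see the dataset card for the real schema.
record = {
    "book": "Example Novel",
    "plot_summary": "Two old friends reunite after a long estrangement.",
    "conversation_background": "A rainy evening at the harbor inn.",
    "characters": {
        "Anne": {"profile": "Reserved, loyal, haunted by the past."},
        "Marta": {"profile": "Outspoken, quick to forgive."},
    },
    "dialogue": [
        {
            "speaker": "Marta",
            "thought": "[She looks thinner than I remember.]",  # internal thought
            "action": "(crosses the room and embraces Anne)",   # physical action
            "speech": "I never stopped hoping you'd come back.",
        },
    ],
}
```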
## Training Methodology

Our training approach is based on given-circumstance acting (GCA):

Given a conversation with messages M, characters C, and setting S, the actor LLM sequentially portrays each character c ∈ C to recreate the conversation. During training, for each character c, we optimize the language-modeling loss on that character's messages.

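The per-character supervision can be sketched as a label-masking step: when the model portrays a given character, only that character's tokens contribute to the loss. This is a minimal illustration with a hypothetical `build_labels` helper, not the authors' actual training code.

```python
# Sketch of GCA-style loss masking (illustrative; not the actual training code).
# When portraying `target`, only tokens of that character's messages are
# supervised; all other tokens get the ignore label.

IGNORE_INDEX = -100  # label value skipped by cross-entropy in common trainers

def build_labels(messages, token_ids_per_message, target):
    """messages: list of (speaker, text) pairs; token_ids_per_message: parallel
    list of token-id lists. Returns flat labels with non-target tokens masked."""
    labels = []
    for (speaker, _), token_ids in zip(messages, token_ids_per_message):
        if speaker == target:
            labels.extend(token_ids)                        # supervise target's tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # mask other characters
    return labels
```

Repeating this for each character in C yields one training example per portrayed character from the same conversation.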
## Performance and Evaluation

We evaluate our models via GCA evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:

1. We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model that manages turn-taking.

2. We assess the generated conversations using penalty-based LLM judges, which are provided with detailed rubrics and the original conversations for reference.

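The simulation step above can be sketched as a simple turn-taking loop. Here `next_actor` and `generate_reply` are hypothetical stand-ins for the next-actor-prediction model and the actor LLM, not a real API.

```python
# Minimal sketch of the GCA multi-agent simulation loop (illustrative only;
# `next_actor` and `generate_reply` are hypothetical model wrappers).

def simulate(characters, setting, next_actor, generate_reply, max_turns=20):
    """Alternate turns until the next-actor predictor ends the scene."""
    conversation = []
    for _ in range(max_turns):
        speaker = next_actor(conversation, characters, setting)
        if speaker is None:  # predictor signals the conversation is over
            break
        message = generate_reply(speaker, conversation, setting)
        conversation.append({"speaker": speaker, "message": message})
    return conversation
```

The resulting conversation is then scored by the LLM judges against the rubrics and the original excerpt.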
### Performance on Given-Circumstance Acting

CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models such as GPT-4o.

| Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
|-------|----------------------|------------------|-------------------|------------------|--------------|------|---------|
| **Closed-source Models** | | | | | | | |
| Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
| Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
| Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
| Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
| GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
| GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
| GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
| Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
| Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
| Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
| **Open-source Models** | | | | | | | |
| Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
| Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
| LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
| CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
| Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
| Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
| Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
| LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
| Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
| CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
| DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |

*Note: Bold values indicate the best performance across all models.*

### Performance on Existing RPLA Benchmarks

| Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
|-------|----------------|------------------|-------------|----------|
| LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
| CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
| *CoSER-8B trained w/o I.T.* | 70.70 | 15.62 | 59.92 | 43.14 |
| LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
| Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
| CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
| *CoSER-70B trained w/o I.T.* | 73.12 | 32.14 | 93.18 | 63.14 |
| Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
| GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
| GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
| Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |

*Note: Bold values indicate the best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy for individual (Dim) and full (Full) dimensions on the BFI.*

## Ethical Considerations

We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:

- The models may still generate content that reflects biases present in the literary works they were trained on.
- Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
- Users should implement appropriate safeguards when deploying these models in applications.

## Citation

If you use CoSER models in your research, please cite our paper:

```bibtex
@misc{wang2025cosercoordinatingllmbasedpersona,
      title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles},
      author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
      year={2025},
      eprint={2502.09082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.09082},
}
```