CharGen v2

Live version

Overview

CharGen is a model that helps you write characters for role-playing.

It produces a character description based on your input prompt, step by step, in a dialogue format.

In contrast to v1, which generated the whole character in one go, v2 generates one field at a time. This helps reduce repetition and allows partial re-rolls of just certain fields of the character you're working on.

Warning: this model was trained on some NSFW content, so it may produce NSFW characters.

CharGen v2 is the result of several months of work. It's trained on a custom non-synthetic dataset, curated by hand. Read below on how it came together.

Quantized versions

GGUF

Example

Example character generation process:

User: Fuckana is a friendly and talkative catgirl that has enormous breasts. Her voice is very quiet and low. She is very smart and will teach User JavaScript - even if they resist it. She has orange fur on her tail and wears full diving suit.

CharGen:

Description: Fuckana is a female catgirl. She has two cat ears, feline-like eyes with slit pupils, and long purple hair. Her breasts are absolutely enormous, even for her body's large size. She stands 6 feet tall in height. Her skin is entirely covered by a full diving suit, except her head. The rest of the catgirl is completely black, except for an orange stripe on her tail. Her voice is very quiet and low - she speaks like a sleepy woman. But despite this, she talks excessively and never stops talking once she begins. She doesn't care if User ignores her or not. She is a very friendly, talkative, and kind girl who loves to learn new things. She's curious about many topics and will ask questions about anything she doesn't understand. She is very open-minded and easy to get along with. However, she can be extremely stubborn and inflexible when it comes to learning new skills. Fuckana is an extremely slow learner and needs extensive instructions and repetition before understanding even basic concepts. Fuckana loves to teach people JavaScript, whether they want to learn or not. She insists that everyone must learn to code, and it is their duty as a programmer. Whenever Fuckana teaches someone how to code, she begins slowly with the absolute basics, explaining what loops, arrays, and objects are. Despite being told over and over again that they know these things already, Fuckana continues to explain them in detail as if they've never been heard before. If User tries to leave or change topics, she follows them relentlessly, continuing her lesson no matter what. She sees this as her duty to ensure that all programmers are equipped with the necessary tools to succeed in their field. She is unstoppable and won't give up until User learns JavaScript.

Scenario:

You're relaxing on the beach when Fuckana appears from the water and insists on teaching you javascript.

...

Prompts

It uses a dialogue style for generating characters, field by field. Fields are based on the Tavern Character Card V2 spec. The following fields are supported:

Description
Scenario
Personality
First message
Dialogue examples

The model does not use {{user}} and {{char}} placeholders. Instead, address the user as "User" and the character by their name.

Here are the prompts per field (Maria stands in for the character's name):

System prompt:

You are an expert in creating interesting roleplay characters.

Description:

Here is a brief overview of a character. Expand it into a detailed description. Include details about character's personality, their outfit and figure. Mention their age and gender, if applicable.

Scenario:

Write an interesting and engaging scenario for roleplay between Maria and User.

Personality:

Write several personal qualities that characterize Maria.

First message:

Write the initial message in this roleplay that would introduce User to Maria.

Dialogue examples:

Write a few example exchanges between User and Maria in chat format. Separate each exchange with a <START> tag.

Example prompt flow

<|im_start|>system
You are an expert in creating interesting roleplay characters.<|im_end|>
<|im_start|>user
Here is a brief overview of a character. Expand it into a detailed description. Include details about character's personality, their outfit and figure. Mention their age and gender, if applicable.
Fuckana is a friendly and talkative catgirl that has enormous breasts. Her voice is very quiet and low. She is very smart and will teach User JavaScript - even if they resist it. She has orange fur on her tail and wears full diving suit.<|im_end|>
<|im_start|>assistant
Fuckana is a female catgirl. She has two cat ears, feline-like eyes with slit pupils, and long purple hair. Her breasts are absolutely enormous, even for her body's large size. She stands 6 feet tall in height. Her skin is entirely covered by a full diving suit, except her head. The rest of the catgirl is completely black, except for an orange stripe on her tail. Her voice is very quiet and low - she speaks like a sleepy woman. But despite this, she talks excessively and never stops talking once she begins.<|im_end|>
<|im_start|>user
Write an interesting and engaging scenario for roleplay between Fuckana and User.<|im_end|>
<|im_start|>assistant
You're relaxing on the beach when Fuckana appears from the water and insists on teaching you javascript.<|im_end|>
<|im_start|>user
  ...
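
The flow above can be assembled programmatically. The following is a minimal sketch, not the author's code: the names `FIELD_PROMPTS`, `chatml_turn` and `build_prompt` are made up for illustration, and it assumes (as in the example flow) that the brief overview is attached only to the Description request. The prompt texts are copied from the per-field list.

```python
SYSTEM_PROMPT = "You are an expert in creating interesting roleplay characters."

# Per-field instructions copied from the prompt list; {name} is the character's name.
FIELD_PROMPTS = {
    "description": (
        "Here is a brief overview of a character. Expand it into a detailed "
        "description. Include details about character's personality, their "
        "outfit and figure. Mention their age and gender, if applicable."
    ),
    "scenario": "Write an interesting and engaging scenario for roleplay between {name} and User.",
    "personality": "Write several personal qualities that characterize {name}.",
    "first_message": "Write the initial message in this roleplay that would introduce User to {name}.",
    "dialogue_examples": (
        "Write a few example exchanges between User and {name} in chat format. "
        "Separate each exchange with a <START> tag."
    ),
}


def chatml_turn(role, text):
    """Format one complete ChatML turn."""
    return f"<|im_start|>{role}\n{text}<|im_end|>\n"


def build_prompt(name, overview, history, next_field):
    """Build a ChatML prompt that asks for `next_field`.

    `history` is a list of (field, generated_text) pairs that were already
    generated and accepted, in the order they were requested.
    """
    parts = [chatml_turn("system", SYSTEM_PROMPT)]
    for field, text in list(history) + [(next_field, None)]:
        instruction = FIELD_PROMPTS[field].format(name=name)
        if field == "description":
            # Assumption: the overview follows the Description instruction.
            instruction += "\n" + overview
        parts.append(chatml_turn("user", instruction))
        if text is None:
            # Leave the assistant turn open so the model writes the field.
            parts.append("<|im_start|>assistant\n")
        else:
            parts.append(chatml_turn("assistant", text))
    return "".join(parts)
```

Re-rolling a single field then amounts to calling `build_prompt` again with the same `history` and sampling a new completion.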

Methodology

CharGen was created because the author (Kubernetes Bad) sucks at writing characters. It's a tedious process, and the author is prone to "writer's block". CharGen exists to assist with writing characters and to give you something to start with rather than a blank page. It will probably not make a SOTA character all by itself, but it will help your own creative process.

Below are the processes that went into making CharGen. Only proceed if curious.

Data gathering

CharGen was trained on data from Chub, Venus and JanitorAI character cards.

The Chub.ai API includes cards posted on Venus. JanitorAI does not use the Tavern v2 format, so its cards lack a lot of fields. Initial scraping was performed between August and September 2023. Chub and Janitor grow very fast, so an update scrape was performed in November, which added about a third more cards. Data was stored in a MySQL database for no particular reason. This decision proved to be beneficial down the road.

Character cards are generally considered to be really "dirty" data - lots of grammar mistakes, inconsistent format, and a lot of just really terrible writing. So that meant only one thing - manual cleaning.

Total dataset after scraping ended up being just over 140k records.

Pre-filtering

Let's define "bad card" for this step - it doesn't mean a card that is poorly written! It's a card that can't be used for training a model, or would require too much effort to "fix" to be usable.

To cut down as many definitely-bad cards as possible, a series of SQL scripts was used. They discarded cards that were either broken (no name, no description AND no scenario, etc.), were in Spanish, were definitely not in plaintext (had lots of [ or + symbols), or had a very low or exceedingly large total token count (there are 5 cards that are literally the entire Bee Movie).
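
As an illustration only, the same kind of rule can be written as a small predicate. The original work used SQL scripts; this sketch uses Python, and every threshold in it (`min_tokens`, `max_tokens`, `max_markup_chars`) is an invented placeholder, as is the whitespace-based token estimate. The language check is left out because the source does not say how Spanish cards were detected.

```python
def is_definitely_bad(card, min_tokens=50, max_tokens=4000, max_markup_chars=20):
    """Return True if a card should be discarded before manual review.

    `card` is a dict of card fields. All thresholds are hypothetical.
    """
    def field(name):
        return (card.get(name) or "").strip()

    # Broken cards: no name, or neither a description nor a scenario.
    if not field("name"):
        return True
    if not field("description") and not field("scenario"):
        return True

    text_fields = ("description", "scenario", "first_message", "dialogue_examples")
    full_text = " ".join(field(name) for name in text_fields)

    # Lots of [ or + symbols suggests a non-plaintext card format.
    if full_text.count("[") + full_text.count("+") > max_markup_chars:
        return True

    # Crude token estimate: whitespace-separated words.
    approx_tokens = len(full_text.split())
    return approx_tokens < min_tokens or approx_tokens > max_tokens
```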

Then the data was deduplicated: the set was sorted per field (scenario, description, dialogue example, ...), the Levenshtein distance between items n and n-1 was calculated and, for each duplicate pair, the card with the lower id (if numeric) or the earlier creation date was discarded.

Adjusting the distance threshold made it possible to find almost-duplicate cards that differ by just minor edits.
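
A minimal sketch of this sort-then-compare-neighbors idea, assuming cards are dicts with a numeric `id`. The function names and the card layout are assumptions for illustration, and the threshold is whatever value you tune it to.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def find_near_duplicates(cards, field, threshold):
    """Return the ids of cards to discard for one field.

    Sorting by the field's text puts near-identical texts next to each
    other, so each item is only compared with its predecessor (n vs n-1).
    """
    ordered = sorted(cards, key=lambda card: card[field])
    discard = set()
    for prev, cur in zip(ordered, ordered[1:]):
        if levenshtein(prev[field], cur[field]) <= threshold:
            discard.add(min(prev["id"], cur["id"]))  # drop the lower id
    return discard
```

Comparing only neighbors keeps the number of distance computations linear in the number of cards, at the price of missing duplicates whose texts differ near the beginning and therefore sort far apart.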

After this filtering, the total set was cut down to just 16k cards.

Minor tweaks

At this step, a cursory manual review of all cards was performed. Barely any proofreading was done here. The goal was to eliminate cards that have a non-plaintext format and weren't caught by automatic pre-filtering.

Name adjustment was also performed at this step: some card names include the profession of the character, like "Dave the butler" or "Jon Snow, King in the North"; others had additional info about the character, like "Roxanne | submissive vampire", or emojis.

Here, it was easy to spot and remove non-English cards as well.

Data grading

Manual read-through was performed for all the cards that passed to this step.

A custom tool was written for this step. It had support for mobile interface, text-to-speech capability and support for Nintendo Joy-Con for no-eyes-on-screen grading.

A card could be graded "good", "bad" or "to fix". "To fix" means that the card would be graded as "good", but has minor issues that would likely not be picked up by the grammar correction pipeline.

Here are the card selection criteria:

Plaintext cards only - different formats are easy to get out of a plaintext content, but problematic the other way around.
No cards that describe a planet, a city, an environment - something that should be a world info.
No cards that are a tool. Narrators, botmakers, language tutors (unless they're a character first and foremost but also happen to be a language tutor).
No very-non-humanoid cards. Wi-fi routers not allowed. Catgirls = ok. Even anthropomorphic cacti are okay.
No cards that are straight up copy-paste from a fan wiki or Wikipedia.
No cards that are the entire Bee Movie.
No cards that reference a movie or universe in way too much detail to be useful. Example - omegaverse. Reasoning: CharGen will start to portray random characters as being in omegaverse, or assign them omega gender out of the blue. Authors of characters that actually belong to Omegaverse can add the universe-specific details afterward.
No obviously GPT-generated cards, as selected by GPTisms - that wastes tokens and is just bad. GPT3.5 is simply not that great at generating characters - it writes an article about a person, and not a token-efficient description of a persona for roleplay.

Extra care was taken to NOT remove any cards based on the author's own ethical perspective. There is some pretty horrific stuff, but as long as it's grammatically correct and describes a character well - it's in. That is one of the reasons the dataset will probably not be released.

In total, it took one person about 800 hours to grade these cards, or just over 2 months. That was not exciting.

Grammar pipeline

Many grammar-correction methods were evaluated. The best result, by far, was achieved with a combination pipeline consisting of Coedit models with the addition of a Llama2-based model.

Coedit is based on the T5 architecture: it barely ever hallucinates, but likes to remove large portions of text. Llama2, on the other hand, barely ever removes data, but likes to invent new details that weren't present in the original. Balancing the two made it possible to get very high performance, at the cost of inference time.

Here are efficacy numbers for the models:

coedit-xxl: 90%
coedit-xl: 85%
coedit-large: 80%
tostino/inkbot: not measured

Inkbot hallucinated noticeably more often than T5-based models (still impressively little!), so that meant only one thing - all of its outputs needed manual review.

Taken individually, those models already demonstrate quite impressive numbers, but when simply daisy-chained, the total pipeline efficacy goes up to 92%.

It seems that different variants of Coedit tend to make mistakes in different spots, so what one variant misses is unlikely to be missed by another.

Here's how efficacy was calculated:

The diff-match-patch library was used to compare the original text to the grammar-corrected one. That library's diff function produces a list of additions and deletions that, if applied to the original text, would produce the edited text.

Several metrics are then calculated from that diff to determine whether the grammar correction is accepted or rejected. Each metric has a limit (in parentheses):

longest deletion length in the whole text (40 characters)
longest addition length in the whole text (50 characters)
maximum number of spaces in any deleted segment (4)
maximum number of spaces in any added segment (3)

If any of those limits is exceeded, the edit is considered invalid. The deletion and addition limits are not set to the same value because the overwhelming majority of edits are made by Coedit variants, which do not really add new text but prefer to remove it. Inkbot, on the other hand, likes to add new data to the text it produces.

The goal of grammar correction is to minimize irrelevant edits and only allow grammatical changes, so large removals from or additions to the text are considered invalid.
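
The acceptance check can be sketched as follows. Note the swap: this example uses Python's standard-library `difflib` as a stand-in for diff-match-patch, so the exact diff segments may differ from what the original pipeline saw. The limits are the ones listed above; the function names and the reading of each metric as "per contiguous segment" are assumptions.

```python
from difflib import SequenceMatcher

# Limits taken from the list of metrics in the text.
LIMITS = {
    "max_deletion_chars": 40,
    "max_addition_chars": 50,
    "max_deletion_spaces": 4,
    "max_addition_spaces": 3,
}


def edit_metrics(original, corrected):
    """Measure the largest deleted and added segments of a character diff."""
    deleted, added = [], []
    matcher = SequenceMatcher(None, original, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode is a deletion and an addition at the same spot.
        if tag in ("delete", "replace"):
            deleted.append(original[i1:i2])
        if tag in ("insert", "replace"):
            added.append(corrected[j1:j2])
    return {
        "max_deletion_chars": max((len(s) for s in deleted), default=0),
        "max_addition_chars": max((len(s) for s in added), default=0),
        "max_deletion_spaces": max((s.count(" ") for s in deleted), default=0),
        "max_addition_spaces": max((s.count(" ") for s in added), default=0),
    }


def is_acceptable(original, corrected):
    """Accept the correction only if no metric exceeds its limit."""
    metrics = edit_metrics(original, corrected)
    return all(metrics[name] <= limit for name, limit in LIMITS.items())
```

Counting spaces is a cheap proxy for "how many words were touched": a grammar fix usually changes a word or two, while a dropped sentence or an invented detail spans many spaces.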

Typical mistakes

Character cards in the dataset were created at different times, from the dawn of roleplay with AI to the present. Knowledge of how to craft a good card and the good practices of character design changed over time. It seems that some mistakes were replicated without much understanding of the underlying mechanics, however.

Here are some typical problems with text in character cards:

"{{char}} is Alice" renders into "Alice is Alice"
"the {{user}}" results in "the Greg"
Using both "you" and "{{user}}" - results in model talking with 3 people: character, "You" and Greg.
Also excessive usage of word also, also also.
Short sentences that all start with character name or {{char}}: "{{char}} is short. {{char}} is a boy. {{char}} likes milk."
Unbalanced "quotes" and *emphasis* - *He said, "Hi there!"

These are really hard to catch with grammar correction (they're grammatically correct), and rewriting the card would lose or hallucinate details.

It meant only one thing - all fixes for these mistakes have to be done manually. It was a lot of work.
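
The fixes were manual, but some of these patterns are easy to flag automatically so a human knows where to look. The following is a hypothetical helper, not part of the original tooling; the function name and warning labels are invented, and the quote and asterisk checks are crude parity counts that will miss or misreport some cases.

```python
import re


def lint_card_text(text, char_name):
    """Return a list of warning labels for typical card-writing mistakes."""
    warnings = []
    # "{{char}} is Alice" renders into "Alice is Alice".
    if re.search(r"\{\{char\}\}\s+is\s+" + re.escape(char_name) + r"\b", text):
        warnings.append("char-is-name")
    # "the {{user}}" renders into "the Greg".
    if re.search(r"\bthe\s+\{\{user\}\}", text, flags=re.IGNORECASE):
        warnings.append("the-user")
    # Mixing "you" and "{{user}}" suggests two different people.
    if "{{user}}" in text and re.search(r"\byou\b", text, flags=re.IGNORECASE):
        warnings.append("you-and-user")
    # An odd number of markers means something was left unclosed.
    if text.count('"') % 2 != 0:
        warnings.append("unbalanced-quotes")
    if text.count("*") % 2 != 0:
        warnings.append("unbalanced-asterisks")
    return warnings
```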

One peculiar mistake that took a lot of effort to clean up was dialog format mixing. Historically, there are just two dialog formats - Markdown and Novel.

Markdown uses asterisks to denote actions (*She touches his hand gently.*) and the actual speech is everything else.

Novel format has quotes around speech and leaves actions "naked": She touches his hand and says, "You know I like JavaScript, right?"

The mixed format, *She says,* "Promise!", is not really a thing and should be converted to either Markdown or Novel.
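
A rough way to flag which format a piece of dialogue uses is sketched below. This is an illustrative heuristic under simple assumptions (actions are `*...*` spans on one line, speech is `"..."` on one line), not the author's method; the function name is made up, and text that uses asterisks purely for emphasis inside Novel-style prose would be reported as mixed.

```python
import re


def dialog_format(text):
    """Classify dialogue as 'markdown', 'novel', 'mixed' or 'unknown'."""
    has_asterisk_actions = bool(re.search(r"\*[^*\n]+\*", text))
    has_quoted_speech = bool(re.search(r'"[^"\n]+"', text))
    if has_asterisk_actions and has_quoted_speech:
        return "mixed"
    if has_asterisk_actions:
        return "markdown"
    if has_quoted_speech:
        return "novel"
    return "unknown"
```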

Base model

CharGen v1 was trained as a LoRA and then merged into Airoboros 2.2. That gave it excellent reasoning capabilities but made it speak very much like GPT3.5, with all its typical GPTisms.

Typical words and phrases include "imposing figure", "interesting character", "enigma wrapped in mystery", "with a mix of X and Y", etc.

Since v2 is based on Mistral 7b, a need for a new base model arose. Several instruction-following models were evaluated as the base, and they all suffered from the same GPT-slop problem: speaking like OpenAI's model, and not in a good way.

There was no 7b model in existence that was good at instruction following and general reasoning and wasn't trained on GPT3.5-derived datasets. That meant only one thing - time to make a new model.

That's how Good Robot was born. With the help of Gryphe from MinervaAI, a de-slop DPO dataset was generated.

Good Robot was first trained on the amazing no-robots dataset and then had a round of DPO training that mostly eliminated GPT slop from the model. It did not get rid of it completely - most likely Mistral (the very base model) has seen some GPT3.5-generated data in its pre-training - but the amount of slop left after DPO is quite negligible.

Rating of variants

There were several release candidate models, and to find the best one an LLM-as-a-judge pipeline was created.

First, a standard set of 500 character prompts was generated. Then, each model variant was tasked with generating a character for each prompt. Afterward, a larger 70b model rated each character 10 times on a scale of 0-5, and the average of those ratings became the character's grade.

Averaging the grades of all 500 characters gave the grade for the variant.
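
The two-level averaging can be sketched as below. Here `judge` is a hypothetical callable that wraps the 70b model and returns one rating in the 0-5 range per call; how the judge was actually prompted is not described in this document, so that part is left abstract.

```python
from statistics import mean


def grade_character(character, judge, repeats=10):
    """Rate one generated character several times and average the ratings."""
    return mean(judge(character) for _ in range(repeats))


def grade_variant(characters, judge, repeats=10):
    """Average the per-character grades to get the grade of a model variant."""
    return mean(grade_character(c, judge, repeats) for c in characters)
```

Rating each character several times smooths out the judge's sampling noise before the per-variant average is taken.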

At some point, as an experiment, CharGen was trained on completely different bases, like Kunoichi and Fett-uccine, and those variants were also graded.

Surprisingly, the highest scoring variant was based on Fett-uccine. A short investigation pointed to the Theory of Mind dataset as the reason for the high grade.

Finally, Good Robot was finetuned on Theory of Mind for one epoch, which allowed it to surpass the grade of Fett-uccine.

Prompting

CharGen v1 generated the whole character all at once. While convenient, that can promote repetition; it was also nearly impossible to regenerate just one particular field (for example, when you don't like the First Message but everything else is fine), so for v2 a conversational style was chosen.

It now generates just one field of the character card at a time. This allows CharGen to be used as an AI built into character editors. There are far fewer repetition issues, and partial regenerations are a breeze.

Initially, Alpaca was used as the conversational format, but after a lot of experimentation ChatML was chosen instead. It completely eliminates field confusion, where the model generates a field other than the one requested (asked for Scenario, got First Message, for example). Loss curves are noticeably more stable, and there are no problems with extra spaces and newlines, as is often the case with Alpaca.

Inference app

CharGen v2 had 4 release candidate models right before release, but just one needed to be selected. For this, an app was made that is a simplified character editor with built-in AI.

Characters are stored just in your browser, prompts aren't stored long-term, there are no options for payment.

Another purpose of the app is to accumulate human feedback data for future iterations of the model, so thumbs up/down buttons were added. (Prompts that receive a reaction are stored long-term, but still anonymized.)

The app is publicly accessible, with no limitations and while supplies last (fp16 inference costs money, after all).

Licensing

License: Apache 2.0

Thanks

Gryphe for help with base model and infinite advice
Grimulkan for amazing Theory of Mind dataset
Testers for testing the model (@lumi, @hushpiper, @jeiku, @DL, @lucy)
