HarleyCoops
committed on
Commit
·
986a16c
1
Parent(s):
dca96ea
added GRPO verifiers idea
app.py
CHANGED
@@ -476,7 +476,333 @@ From a tiny dictionary to an AI that:
|
|
476 |
- **CAS**: Cultural Authenticity Score
|
477 |
- **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
|
478 |
- **LoRA**: Low-Rank Adaptation
|
479 |
-
- **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
|
480 |
"""
|
481 |
|
482 |
# Store conversation history
|
@@ -547,7 +873,7 @@ def create_interface():
|
|
547 |
gr.Markdown(
|
548 |
"# You are Asking Google Deep Mind about "
|
549 |
"\"From Whispers to Voices\", "
|
550 |
-
"it will need a few seconds to review the
|
551 |
)
|
552 |
|
553 |
chatbot = gr.Chatbot(show_label=False)
|
|
|
476 |
- **CAS**: Cultural Authenticity Score
|
477 |
- **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
|
478 |
- **LoRA**: Low-Rank Adaptation
|
479 |
+
- **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
|
480 |
+
|
481 |
+
## March 2025 RL Update
|
482 |
+
|
483 |
+
# Adapting Verifiers for Low-Resource Language Translation
|
484 |
+
|
485 |
+
This document details how to adapt the `verifiers` framework, originally designed for verifiable environments like math and coding, to the task of low-resource language translation. This approach focuses on providing nuanced, multi-faceted rewards, going beyond simple correct/incorrect evaluations.
|
486 |
+
|
487 |
+
## Overview
|
488 |
+
|
489 |
+
The core idea is to treat translation as a multi-step process (even if it's just a single-turn translation) where the model receives rewards for various aspects of translation quality. This allows for partial credit and provides more informative training signals, particularly beneficial in low-resource settings where data scarcity is a major challenge.
|
490 |
+
|
491 |
+
We will be customizing the following components of the `verifiers` library:
|
492 |
+
|
493 |
+
* **Environment:** A custom `TranslationEnv` to handle the interaction with the translation model (LLM).
|
494 |
+
* **Parser:** A simplified `TranslationParser` to extract the translated text from the LLM's output. We won't require strict XML formatting for this task.
|
495 |
+
* **Rubric:** A `TranslationRubric` containing several reward functions that evaluate different quality dimensions (letter accuracy, word accuracy, semantic similarity, and edit distance).
|
496 |
+
* **Training:** Using the `GRPOEnvTrainer` with our custom components and a small, low-resource translation dataset.
|
497 |
+
|
498 |
+
## Key Concepts
|
499 |
+
|
500 |
+
* **Ground Truth:** A parallel corpus of source and target language sentences. Essential for calculating rewards. In low-resource scenarios, this might be a small, curated dataset.
|
501 |
+
* **Multi-faceted Reward:** Instead of a single reward, we provide separate rewards for:
|
502 |
+
* **Letter Accuracy:** Proportion of correctly translated letters.
|
503 |
+
* **Word Accuracy:** Proportion of correctly translated words (space-separated).
|
504 |
+
* **Semantic Similarity:** Uses pre-trained sentence embeddings (Sentence-BERT) to measure how close the *meaning* of the translation is to the ground truth, even if the exact words differ.
|
505 |
+
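* **Edit Distance Similarity:** One minus the Levenshtein distance between the translation and the ground truth, normalized by the length of the longer string.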
|
506 |
+
* **Iterative Refinement (Optional):** The environment can be designed to support multiple turns, allowing the LLM to refine its translation based on feedback (hints). This example shows a rudimentary character-by-character suggestion technique; a better version might provide hints more sparingly, based on confidence scores.
|
507 |
+
* **Low-Resource Focus:** The techniques are tailored for scenarios with limited training data. This involves using smaller, specialized translation models (rather than massive general-purpose LLMs) and careful hyperparameter tuning (particularly `beta` in GRPO).
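
To make the multi-faceted reward concrete, here is a minimal sketch of how per-aspect scores could be folded into one scalar. The weights and the `combine_rewards` helper are hypothetical; the source does not say how the individual rewards are weighted, and a trainer may simply sum them.

```python
from typing import Dict

# Hypothetical weights (an assumption, not taken from the source).
DEFAULT_WEIGHTS: Dict[str, float] = {
    "letter_accuracy": 0.2,
    "word_accuracy": 0.2,
    "semantic_similarity": 0.4,
    "edit_distance_similarity": 0.2,
}


def combine_rewards(scores: Dict[str, float],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-aspect scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


# A translation that gets the meaning right but the spelling only half right
# still earns partial credit instead of a flat zero.
scores = {
    "letter_accuracy": 0.5,
    "word_accuracy": 0.5,
    "semantic_similarity": 1.0,
    "edit_distance_similarity": 0.5,
}
print(combine_rewards(scores))
```

Keeping the aspects separate until this last step makes it easy to see which dimension a model is failing on.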
|
508 |
+
|
509 |
+
## Code Structure and Components
|
510 |
+
|
511 |
+
The code consists of the following main parts, each described in detail below:
|
512 |
+
|
513 |
+
1. **`TranslationParser`:** A class to extract the translation from the LLM's output string.
|
514 |
+
2. **`TranslationEnv`:** A class inheriting from `MultiStepEnv` (or a simplified version) that defines the interaction loop between the trainer and the LLM.
|
515 |
+
3. **`TranslationRubric`:** A class inheriting from `Rubric` that defines the reward functions.
|
516 |
+
4. **Dataset Creation (`create_dummy_dataset`):** A function to load or create your low-resource translation dataset. *You will replace this with your own dataset loading logic.*
|
517 |
+
5. **Model Loading (`get_model_and_tokenizer`):** Uses functions from `verifiers` to load a suitable pre-trained translation model.
|
518 |
+
6. **Training Setup (`GRPOEnvTrainer`):** Sets up and runs the training process.
|
519 |
+
|
520 |
+
### 1. `TranslationParser`
|
521 |
+
|
522 |
+
```python
from types import SimpleNamespace


class TranslationParser:
    """Extracts the translation from the raw LLM output string."""

    def parse(self, text: str, strip: bool = True) -> SimpleNamespace:
        # No XML or other structure is expected: the whole output is the translation.
        translation = text.strip() if strip else text
        return SimpleNamespace(translation=translation)
```
|
530 |
+
|
531 |
+
This simplified parser extracts the raw translated text from the LLM's output. We are not requiring or enforcing XML formatting, keeping the interaction straightforward.
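
A short usage sketch follows. The class is restated in minimal form so the snippet runs on its own, and the sample output string is made up for illustration.

```python
from types import SimpleNamespace


class TranslationParser:
    """Minimal restatement of the parser so this snippet is self-contained."""

    def parse(self, text: str, strip: bool = True) -> SimpleNamespace:
        translation = text.strip() if strip else text
        return SimpleNamespace(translation=translation)


parser = TranslationParser()
raw_output = "  Bonjour\n"          # Illustrative LLM output with stray whitespace
parsed = parser.parse(raw_output)
print(repr(parsed.translation))     # 'Bonjour'
```

Returning an object with a `.translation` attribute, rather than a bare string, lets the environment and rubric access the result the same way they would with a structured parser.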
|
532 |
+
|
533 |
+
### 2. `TranslationEnv`
|
534 |
+
|
535 |
+
```python
import verifiers as vf
from verifiers.envs import MultiStepEnv
from verifiers.rubrics import Rubric  # Used by TranslationRubric in the next section.
from datasets import Dataset
from typing import List, Dict, Any


def check_prefix(text: str, suggested: str) -> bool:
    """Return True if `text` starts with `suggested` (truncated to len(text))."""
    if len(suggested) < 1:
        return False
    return text.startswith(suggested[:len(text)])


class TranslationEnv(MultiStepEnv):
    def __init__(self, dataset, system_prompt, max_steps=3):
        # mask_env_response=False keeps the environment's hints visible.
        super().__init__(system_prompt=system_prompt, max_steps=max_steps, mask_env_response=False)
        self.dataset = dataset
        self.rubric = None  # Created lazily in get_rubric().

    def get_dataset(self, **kwargs):
        return self.dataset

    def get_eval_dataset(self, **kwargs: Any):
        return self.dataset  # You might want a separate eval set.

    def get_rubric(self):
        if self.rubric is None:
            self.rubric = TranslationRubric()  # Defined in the next section.
        return self.rubric

    def is_completed(self, messages, **kwargs):
        assistant_text = self.get_rubric().parser.parse(messages[-1]['content']).translation
        user_query = self.get_last_user_prompt(messages)
        # The user message is the last entry of each stored prompt.
        ground_truth = self.dataset.filter(lambda x: x["prompt"][-1]['content'] == user_query)
        target = None
        for element in ground_truth:
            target = element['answer']
        if target is None:
            return False  # No matching reference; let max_steps end the episode.
        return check_prefix(target, assistant_text)

    def get_last_user_prompt(self, messages):
        # Note: after a hint turn, the most recent user message is the hint itself.
        i = len(messages) - 1
        while i > -1:
            if messages[i]['role'] == 'user':
                return messages[i]['content']
            i -= 1
        return None

    # Suggest letters sequentially
    def env_response(self, messages, **kwargs):
        assistant_text = self.get_rubric().parser.parse(messages[-1]['content']).translation
        user_query = self.get_last_user_prompt(messages)
        ground_truth = self.dataset.filter(lambda x: x["prompt"][-1]['content'] == user_query)

        response = "Check your word beginnings:"
        for element in ground_truth:
            target = element['answer']
            # Add one hint for every position where the attempt differs from the target.
            for i in range(0, min(len(target), len(assistant_text))):
                if target[i] != assistant_text[i]:
                    response += f" Your next correct letter choice starts with {target[i]}"
        return {"role": "user", "content": response}
```
|
593 |
+
|
594 |
+
Key Functions:
|
595 |
+
|
596 |
+
* `__init__`: Initializes the environment with the dataset and system prompt. `mask_env_response` is set to `False` so that suggestions/hints appear.
* `get_dataset`: Returns the training dataset.
* `get_eval_dataset`: Returns the evaluation dataset (in this example, the same data as training).
* `get_rubric`: Returns an instance of the `TranslationRubric`.
* `is_completed`: Decides when to end an interaction by checking whether the model's output is a prefix of the target translation. This custom prefix check is meant to work together with the hints; the similarity comparisons are left to the rubric.
* `env_response`: Uses a basic sequential suggestion algorithm. If the LLM's attempt differs from the target, it guides the completion letter by letter.
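
As one possible way to give hints more sparingly, the sketch below reports only the first position where the attempt diverges from the target. The function name `first_mismatch_hint` and the hint wording are hypothetical and are not part of `verifiers`.

```python
from typing import Optional


def first_mismatch_hint(target: str, attempt: str) -> Optional[str]:
    """Return a hint for the first divergence from the target, or None if none is needed."""
    for i, (t, a) in enumerate(zip(target, attempt)):
        if t != a:
            return f"Position {i + 1}: expected '{t}', got '{a}'."
    if len(attempt) < len(target):
        # Every character so far is right, but the attempt stops early.
        return f"Your translation is too short; the next letter is '{target[len(attempt)]}'."
    if len(attempt) > len(target):
        return "Your translation is too long."
    return None  # Exact match


print(first_mismatch_hint("Bonjour", "Bonsoir"))
print(first_mismatch_hint("Bonjour", "Bon"))
```

Revealing a single character per turn leaks less of the reference than listing every mismatch, which matters if the hints are visible to the model during training.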
|
607 |
+
|
608 |
+
### 3. `TranslationRubric`
|
609 |
+
|
610 |
+
```python
from collections import Counter
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer
from verifiers.rubrics import Rubric


class TranslationRubric(Rubric):
    def __init__(self, embedding_model_name: str = 'all-MiniLM-L6-v2'):
        super().__init__()
        self.parser = TranslationParser()
        self.embedding_model = SentenceTransformer(embedding_model_name)
        self.reward_funcs = [
            self.letter_accuracy_reward_func,
            self.word_accuracy_reward_func,
            self.semantic_similarity_reward_func,
            self.levenshtein_distance_reward_func,
        ]

    def letter_accuracy_reward_func(self, completions, answer, **kwargs) -> List[float]:
        """Fraction of target characters matched at the same position."""
        rewards = []
        for completion, target in zip(completions, answer):
            completion_text = self.parser.parse(completion[0]["content"]).translation
            target_text = target.strip()

            correct_letters = sum(1 for c1, c2 in zip(completion_text, target_text) if c1 == c2)
            reward = correct_letters / max(len(target_text), 1)  # Avoid division by zero

            rewards.append(reward)
        return rewards

    def word_accuracy_reward_func(self, completions, answer, **kwargs) -> List[float]:
        """Fraction of target words (space-separated) present in the completion."""
        rewards = []
        for completion, target in zip(completions, answer):
            completion_text = self.parser.parse(completion[0]["content"]).translation
            target_words = target.strip().split()
            completion_words = completion_text.split()

            # Multiset intersection: repeating a correct word cannot push the reward above 1.
            correct_words = sum((Counter(completion_words) & Counter(target_words)).values())
            reward = correct_words / max(len(target_words), 1)
            rewards.append(reward)
        return rewards

    def semantic_similarity_reward_func(self, completions, answer, **kwargs) -> List[float]:
        """Cosine similarity between sentence embeddings, clipped to be >= 0."""
        rewards = []
        for completion, target in zip(completions, answer):
            completion_text = self.parser.parse(completion[0]["content"]).translation
            target_text = target.strip()

            try:
                completion_embedding = self.embedding_model.encode(completion_text, convert_to_numpy=True)
                target_embedding = self.embedding_model.encode(target_text, convert_to_numpy=True)
                # Cosine similarity
                similarity = np.dot(completion_embedding, target_embedding) / (
                    np.linalg.norm(completion_embedding) * np.linalg.norm(target_embedding)
                )
                rewards.append(max(0.0, float(similarity)))  # Clip to be >= 0
            except Exception as e:
                print("Error during semantic similarity", e)
                rewards.append(0.0)
        return rewards

    def levenshtein_distance_reward_func(self, completions, answer, **kwargs) -> List[float]:
        """One minus the Levenshtein distance, normalized by the longer string."""

        def levenshtein_distance(s1, s2):
            # Keep s1 as the shorter string so each row stays small.
            if len(s1) > len(s2):
                s1, s2 = s2, s1
            distances = range(len(s1) + 1)
            for i2, c2 in enumerate(s2):
                distances_ = [i2 + 1]
                for i1, c1 in enumerate(s1):
                    if c1 == c2:
                        distances_.append(distances[i1])
                    else:
                        distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
                distances = distances_
            return distances[-1]

        rewards = []
        for completion, target in zip(completions, answer):
            completion_text = self.parser.parse(completion[0]["content"]).translation
            target_text = target.strip()
            distance = levenshtein_distance(completion_text, target_text)
            normalized_distance = distance / max(len(completion_text), len(target_text), 1)  # Avoid division by zero
            rewards.append(1.0 - normalized_distance)
        return rewards
```
|
694 |
+
|
695 |
+
Key Components:
|
696 |
+
|
697 |
+
* `__init__`: Initializes the rubric with a `TranslationParser` and a Sentence-BERT model for semantic similarity calculations. You can change `embedding_model_name` to use different pre-trained embeddings.
* `letter_accuracy_reward_func`: Calculates the proportion of target letters that the translation matches at the same position.
* `word_accuracy_reward_func`: Calculates the proportion of target words that appear in the translation.
* `semantic_similarity_reward_func`: Calculates the cosine similarity between the sentence embeddings of the generated translation and the ground truth.
* `levenshtein_distance_reward_func`: Converts the edit (Levenshtein) distance into a similarity score between 0 and 1.
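
A small worked example shows why position-based letter accuracy and edit distance are both useful. The helper functions below are standalone re-implementations written for illustration; they do not depend on `verifiers`.

```python
def letter_accuracy(candidate: str, reference: str) -> float:
    """Fraction of reference characters matched at the same position."""
    matches = sum(1 for c, r in zip(candidate, reference) if c == r)
    return matches / max(len(reference), 1)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(candidate: str, reference: str) -> float:
    """One minus the edit distance normalized by the longer string."""
    longest = max(len(candidate), len(reference), 1)
    return 1.0 - levenshtein(candidate, reference) / longest


# Dropping the first letter shifts every character by one position.
print(letter_accuracy("onjour", "Bonjour"))   # 0.0: no character lines up
print(edit_similarity("onjour", "Bonjour"))   # about 0.857: only one edit away
```

A single missing character zeroes out the positional score but barely changes the edit-distance score, so together the two rewards give a fairer picture of a near-miss.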
|
706 |
+
|
707 |
+
### 4. Dataset Creation (`create_dummy_dataset`)
|
708 |
+
|
709 |
+
```python
from datasets import Dataset
import verifiers as vf


def create_dummy_dataset():
    # Two toy English-to-French pairs; replace with your own parallel corpus.
    data = {
        'prompt': [
            vf.format_prompt("Translate to French: 'The cat is on the mat.'", "You are a translation expert."),
            vf.format_prompt("Translate to French: good morning", "You are a translation expert.")
        ],
        'answer': ["Le chat est sur le tapis.", "Bonjour"]
    }
    return Dataset.from_dict(data)
```
|
723 |
+
|
724 |
+
Important: This is a placeholder. You'll need to replace this with code that loads your low-resource parallel text dataset and creates a Hugging Face Dataset object with 'prompt' and 'answer' columns. The 'prompt' should contain the source sentence and any system prompt, and the 'answer' should contain the target translation.
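
As a starting point for that replacement, here is a minimal sketch that turns (source, target) sentence pairs into records with `prompt` and `answer` fields. The `build_records` helper is hypothetical, and the chat-message layout of `prompt` (a system message followed by a user message) is an assumption about what the trainer expects.

```python
from typing import Dict, Iterable, List, Tuple


def build_records(pairs: Iterable[Tuple[str, str]],
                  system_prompt: str,
                  target_language: str) -> List[Dict]:
    """Convert parallel sentence pairs into prompt/answer records."""
    records = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # Skip incomplete pairs rather than train on empty references
        prompt = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Translate to {target_language}: {source}"},
        ]
        records.append({"prompt": prompt, "answer": target})
    return records


pairs = [("good morning", "Bonjour"), ("thank you", "Merci"), ("", "orphan line")]
records = build_records(pairs, "You are a translation expert.", "French")
print(len(records), records[0]["answer"])
```

With the `datasets` library installed, `Dataset.from_list(records)` would turn this list into a Hugging Face `Dataset` with the two required columns.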
|
725 |
+
|
726 |
+
### 5. Model Loading (`get_model_and_tokenizer`)
|
727 |
+
|
728 |
+
```python
|
729 |
+
import verifiers as vf
|
730 |
+
|
731 |
+
model_name = "Helsinki-NLP/opus-mt-en-fr" # Example: English to French
|
732 |
+
model, tokenizer = vf.get_model_and_tokenizer(model_name)
|
733 |
+
```
|
734 |
+
|
735 |
+
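This uses the `verifiers` utility function to load a pre-trained translation model and its corresponding tokenizer. Choose a model appropriate for your language pair, and start with smaller models for efficiency, especially in a low-resource setting. Check that the loader supports your checkpoint's architecture: Marian models such as this one are encoder-decoder, whereas GRPO trainers typically expect a causal language model.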
|
736 |
+
|
737 |
+
### 6. Training Setup (`GRPOEnvTrainer`)
|
738 |
+
|
739 |
+
```python
import verifiers as vf
from verifiers.trainers.grpo_env_trainer import GRPOEnvTrainer

# Create dataset instances. YOU WILL REPLACE create_dummy_dataset!
train_dataset = create_dummy_dataset()
eval_dataset = create_dummy_dataset()

# Set up environment and rubric.
vf_env = TranslationEnv(dataset=train_dataset, system_prompt="You are a translation expert.")
rubric = vf_env.get_rubric()  # Get the rubric *from* the environment

run_name = "translation_example"
# Keep the run short for illustration.
training_args = vf.get_default_grpo_config(run_name=run_name, num_gpus=8)  # Set num_gpus to match your hardware
training_args.num_generations = 2  # GRPO compares completions within a group, so use at least 2
training_args.max_steps = 3  # Short training for illustration

trainer = GRPOEnvTrainer(
    model=model,
    tokenizer=tokenizer,
    env=vf_env,
    reward_funcs=rubric.reward_funcs,
    args=training_args,
    train_dataset=train_dataset,
    # eval_dataset=eval_dataset
)

trainer.train()
```
|
768 |
+
|
769 |
+
This part sets up the GRPOEnvTrainer with the custom environment, rubric, dataset, model, and tokenizer. Key parameters to consider tuning, especially in low-resource settings, are in training_args.
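
One of those parameters, the group size (`num_generations`), deserves a closer look, because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below is a simplified illustration of that group-relative normalization, not the trainer's actual code; the `eps` value and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# With a single completion per prompt there is nothing to compare against,
# so the advantage is zero and the policy receives no learning signal.
print(group_advantages([0.9]))

# With several completions, better-than-average ones get positive advantages.
print(group_advantages([0.9, 0.5, 0.1]))
```

This is why the group needs at least two completions per prompt, and why larger groups give a more stable estimate of which translations are better than average.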
|
770 |
+
|
771 |
+
## Running the Example (Not Built Yet)
|
772 |
+
- The idea here is to move completely away from the OpenAI fine-tuning I use now to any open-source model. What I'm going to build is a tool that lets any community input their language as they understand it, train on any open-source model, likely with LoRA, and achieve better and better output.
|
773 |
+
|
774 |
+
Install Dependencies: Make sure the required packages are installed (see your original `pyproject.toml`), notably `sentence-transformers`, `torch`, and `transformers`. Use `uv` or another package manager.
|
775 |
+
|
776 |
+
Run the Code: Combine the code snippets above into a single Python file (e.g., translation_trainer.py). Execute the script:
|
777 |
+
|
778 |
+
```bash
|
779 |
+
python translation_trainer.py
|
780 |
+
```
|
781 |
+
|
782 |
+
This will run a very short training demonstration on the dummy dataset. You should see output from the trainer and (if you enable logging) see the prompts, completions, and the calculated rewards.
|
783 |
+
|
784 |
+
## Adapting to Your Specific Low-Resource Task
|
785 |
+
|
786 |
+
Dataset: Replace create_dummy_dataset() with your data loading.
|
787 |
+
|
788 |
+
Model: Choose a suitable pre-trained translation model for your languages.
|
789 |
+
|
790 |
+
`is_completed` and hints: Adjust the completion check and `env_response` to provide better hints.
|
791 |
+
|
792 |
+
Hyperparameters: Experiment with the GRPOConfig parameters. Start with a low learning rate and consider increasing beta (the KL divergence penalty) to prevent overfitting on a small dataset. A larger beta keeps the model's weights closer to the pre-trained values.
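
To see what `beta` does numerically, the sketch below uses a per-token KL estimate of the form `exp(d) - d - 1`, which GRPO implementations commonly use. The objective here is deliberately simplified (no importance ratio or clipping), and the function names and sample values are illustrative assumptions.

```python
import math


def kl_penalty(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL estimate exp(d) - d - 1 with d = logp_ref - logp_policy (never negative)."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def penalized_objective(advantage: float, logp_policy: float,
                        logp_ref: float, beta: float) -> float:
    """Simplified per-token objective: advantage minus beta times the KL estimate."""
    return advantage - beta * kl_penalty(logp_policy, logp_ref)


# The policy assigns a token much higher probability than the reference model does.
logp_policy, logp_ref = -0.1, -2.0
for beta in (0.0, 0.04, 0.5):
    print(beta, round(penalized_objective(1.0, logp_policy, logp_ref, beta), 4))
```

The same drift away from the reference model costs more as `beta` grows, which is the mechanism that holds a model trained on a tiny dataset close to its pre-trained behavior.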
|
793 |
+
|
794 |
+
## Generating Reward Functions from Language Patterns
|
795 |
+
|
796 |
+
One idea I have for generating a series of reward functions for a low-resource language is to simply pass the JSON dictionary of Stoney Nakoda to an LLM and ask for the rules or patterns it notices.
|
797 |
+
|
798 |
+
It will give you a full set of rules that you can then turn into a very large number of very small reward functions, which can be used to fine-tune a model very precisely around the contours of even a low-resource language.
|
799 |
+
|
800 |
+
Here is the actual LLM output using this simple idea: [RLHFrules.json](RLHFrules.json)
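
A minimal sketch of how such rules might become small reward functions is shown below. The rule schema (`name`, `pattern`, `weight`) and the two placeholder rules are hypothetical; they are not taken from `RLHFrules.json` and say nothing about Stoney Nakoda itself. For simplicity the functions take plain strings rather than the message lists used by the rubric.

```python
import re
from typing import Callable, Dict, List

RewardFunc = Callable[[List[str]], List[float]]


def make_rule_reward(rule: Dict) -> RewardFunc:
    """Build one small reward function from a rule with a regex pattern."""
    pattern = re.compile(rule["pattern"])
    weight = float(rule.get("weight", 1.0))

    def reward(completions: List[str]) -> List[float]:
        # Full weight when the completion satisfies the rule, zero otherwise.
        return [weight if pattern.search(text) else 0.0 for text in completions]

    reward.__name__ = f"rule_{rule['name']}"
    return reward


# Placeholder rules for illustration only.
rules = [
    {"name": "ends_with_period", "pattern": r"\.\s*$", "weight": 0.5},
    {"name": "no_double_spaces", "pattern": r"^(?!.*  ).*$", "weight": 0.5},
]
reward_funcs = [make_rule_reward(rule) for rule in rules]

for func in reward_funcs:
    print(func.__name__, func(["Bonjour.", "Bonjour  le monde"]))
```

Because each rule yields its own function, rules can be added, reweighted, or retired individually as community reviewers confirm or reject them.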
|
801 |
+
|
802 |
+
|
803 |
+
|
804 |
+
|
805 |
+
|
806 |
"""
|
807 |
|
808 |
# Store conversation history
|
|
|
873 |
gr.Markdown(
|
874 |
"# You are Asking Google Deep Mind about "
|
875 |
"\"From Whispers to Voices\", "
|
876 |
+
"it will need a few seconds to review the details"
|
877 |
)
|
878 |
|
879 |
chatbot = gr.Chatbot(show_label=False)
|