HarleyCoops committed on
Commit
986a16c
1 Parent(s): dca96ea

added GRPO verifiers idea

Files changed (1)
  1. app.py +328 -2
app.py CHANGED
@@ -476,7 +476,333 @@ From a tiny dictionary to an AI that:
  - **CAS**: Cultural Authenticity Score
  - **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
  - **LoRA**: Low-Rank Adaptation
- - **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
  """
 
  # Store conversation history
@@ -547,7 +873,7 @@ def create_interface():
  gr.Markdown(
  "# You are Asking Google Deep Mind about "
  "\"From Whispers to Voices\", "
- "it will need a few seconds to review the code"
  )
 
  chatbot = gr.Chatbot(show_label=False)

  - **CAS**: Cultural Authenticity Score
  - **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
  - **LoRA**: Low-Rank Adaptation
+ - **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
+
+ ## March 2025 RL Update
+
+ # Adapting Verifiers for Low-Resource Language Translation
+
+ This document details how to adapt the `verifiers` framework, originally designed for verifiable environments like math and coding, to the task of low-resource language translation. The approach focuses on providing nuanced, multi-faceted rewards that go beyond simple correct/incorrect evaluations.
+
+ ## Overview
+
+ The core idea is to treat translation as a multi-step process (even if it's just a single-turn translation) in which the model receives rewards for various aspects of translation quality. This allows partial credit and provides more informative training signals, which is particularly beneficial in low-resource settings where data scarcity is a major challenge.
+
+ We will customize the following components of the `verifiers` library:
+
+ * **Environment:** A custom `TranslationEnv` that handles the interaction with the translation model (LLM).
+ * **Parser:** A simplified `TranslationParser` that extracts the translated text from the LLM's output. We won't require strict XML formatting for this task.
+ * **Rubric:** A `TranslationRubric` containing several reward functions that evaluate different quality dimensions (letter accuracy, word accuracy, semantic similarity, and edit distance).
+ * **Training:** The `GRPOEnvTrainer`, run with our custom components and a small, low-resource translation dataset.
+
+ ## Key Concepts
+
+ * **Ground Truth:** A parallel corpus of source and target language sentences, essential for calculating rewards. In low-resource scenarios this might be a small, curated dataset.
+ * **Multi-faceted Reward:** Instead of a single reward, we provide separate rewards for:
+   * **Letter Accuracy:** Proportion of correctly translated letters.
+   * **Word Accuracy:** Proportion of correctly translated words (space-separated).
+   * **Semantic Similarity:** Uses pre-trained sentence embeddings (Sentence-BERT) to measure how close the *meaning* of the translation is to the ground truth, even if the exact words differ.
+   * **Edit Distance Similarity:** Normalized Levenshtein distance between the translation and the ground truth.
+ * **Iterative Refinement (Optional):** The environment can be designed to support multiple turns, allowing the LLM to refine its translation based on feedback (hints). This example shows a rudimentary character-by-character suggestion technique; a better version might provide hints more sparingly, based on confidence scores.
+ * **Low-Resource Focus:** The techniques are tailored for scenarios with limited training data. This involves using smaller, specialized translation models (rather than massive general-purpose LLMs) and careful hyperparameter tuning (particularly `beta` in GRPO).
+
+ ## Code Structure and Components
+
+ The code consists of the following main parts, each described in detail below:
+
+ 1. **`TranslationParser`:** A class to extract the translation from the LLM's output string.
+ 2. **`TranslationEnv`:** A class inheriting from `MultiStepEnv` (or a simplified version) that defines the interaction loop between the trainer and the LLM.
+ 3. **`TranslationRubric`:** A class inheriting from `Rubric` that defines the reward functions.
+ 4. **Dataset Creation (`create_dummy_dataset`):** A function to load or create your low-resource translation dataset. *You will replace this with your own dataset loading logic.*
+ 5. **Model Loading (`get_model_and_tokenizer`):** Uses functions from `verifiers` to load a suitable pre-trained translation model.
+ 6. **Training Setup (`GRPOEnvTrainer`):** Sets up and runs the training process.
+
+ ### 1. `TranslationParser`
+
+ ```python
+ from types import SimpleNamespace
+ from typing import Any
+
+ class TranslationParser:
+     def parse(self, text: str, strip: bool = True) -> Any:
+         # Treat the model's entire reply as the translation.
+         translation = text.strip()
+         return SimpleNamespace(translation=translation)
+ ```
+
+ This simplified parser extracts the raw translated text from the LLM's output. We are not requiring or enforcing XML formatting, keeping the interaction straightforward.
+
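+ A quick usage sketch (illustrative only, not part of the training loop):
+
+ ```python
+ parser = TranslationParser()
+ parsed = parser.parse("  Le chat est sur le tapis.  ")
+ print(parsed.translation)  # -> "Le chat est sur le tapis."
+ ```
+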
+ ### 2. `TranslationEnv`
+
+ ```python
+ import verifiers as vf
+ from verifiers.envs import MultiStepEnv
+ from verifiers.rubrics import Rubric  # Base class for TranslationRubric in section 3.
+ from datasets import Dataset
+ from typing import List, Dict, Any
+
+ def check_prefix(text: str, suggested: str):
+     # True when `suggested` matches `text` letter-for-letter up to the shorter length.
+     if len(suggested) < 1:
+         return False
+     return text.startswith(suggested[:len(text)])
+
+ class TranslationEnv(MultiStepEnv):
+     def __init__(self, dataset, system_prompt, max_steps=3):
+         super().__init__(system_prompt=system_prompt, max_steps=max_steps, mask_env_response=False)
+         self.dataset = dataset
+         self.rubric = None  # Instantiated lazily in get_rubric.
+
+     def get_dataset(self, **kwargs):
+         return self.dataset
+
+     def get_eval_dataset(self, **kwargs: Any):
+         return self.dataset  # You might want a separate eval set.
+
+     def get_rubric(self):
+         if self.rubric is None:
+             self.rubric = TranslationRubric()  # Defined in section 3.
+         return self.rubric
+
+     def is_completed(self, messages, **kwargs):
+         assistant_text = self.get_rubric().parser.parse(messages[-1]['content']).translation
+         user_query = self.get_last_user_prompt(messages)
+         ground_truth = self.dataset.filter(lambda x: x["prompt"][0]['content'] == user_query)
+         target = None
+         for element in ground_truth:
+             target = element['answer']
+         if target is None:
+             return True  # No matching ground truth; end the episode.
+         return check_prefix(target, assistant_text)
+
+     def get_last_user_prompt(self, messages):
+         # Walk backwards to the most recent user message.
+         i = len(messages) - 1
+         while i > -1:
+             if messages[i]['role'] == 'user':
+                 return messages[i]['content']
+             i -= 1
+         return None
+
+     # Suggest letters sequentially.
+     def env_response(self, messages, **kwargs):
+         assistant_text = self.get_rubric().parser.parse(messages[-1]['content']).translation
+         user_query = self.get_last_user_prompt(messages)
+         ground_truth = self.dataset.filter(lambda x: x["prompt"][0]['content'] == user_query)
+
+         response = "Check your word beginnings:"
+         for element in ground_truth:
+             target = element['answer']
+             for i in range(min(len(target), len(assistant_text))):
+                 if target[i] != assistant_text[i]:
+                     # Hint only the first mismatched letter, then stop.
+                     response += f" Your next correct letter choice starts with {target[i]}"
+                     break
+         return {"role": "user", "content": response}
+ ```
+
+ Key functions:
+
+ * `__init__`: Initializes the environment with the dataset and system prompt. `mask_env_response` is set to `False` so suggestions/hints appear in the conversation.
+ * `get_dataset`: Returns the training dataset.
+ * `get_eval_dataset`: Returns the evaluation dataset.
+ * `get_rubric`: Returns an instance of the `TranslationRubric`.
+ * `is_completed`: Checks whether the translation matches the target, terminating the interaction. The custom logic uses prefix matching, which is what enables the hints; the rubric then handles similarity comparisons.
+ * `env_response`: Uses a basic sequential suggestion algorithm that guides the completion letter by letter when the LLM goes astray.
+
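+ To make the hinting concrete, here is a minimal standalone sketch of the same first-mismatch logic (illustrative only; the environment runs this against the filtered dataset):
+
+ ```python
+ def first_mismatch_hint(target: str, attempt: str) -> str:
+     # Compare letter by letter and hint the first wrong position.
+     for i in range(min(len(target), len(attempt))):
+         if target[i] != attempt[i]:
+             return f"Your next correct letter choice starts with {target[i]}"
+     return "So far so good."
+
+ print(first_mismatch_hint("Bonjour", "Bonsoir"))  # hints 'j' (index 3)
+ ```
+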
+ ### 3. `TranslationRubric`
+
+ ```python
+ from verifiers.rubrics import Rubric
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+ from typing import List
+
+ class TranslationRubric(Rubric):
+     def __init__(self, embedding_model_name: str = 'all-MiniLM-L6-v2'):
+         super().__init__()
+         self.parser = TranslationParser()
+         self.embedding_model = SentenceTransformer(embedding_model_name)
+         self.reward_funcs = [
+             self.letter_accuracy_reward_func,
+             self.word_accuracy_reward_func,
+             self.semantic_similarity_reward_func,
+             self.levenshtein_distance_reward_func,
+         ]
+
+     def letter_accuracy_reward_func(self, completions, answer, **kwargs) -> List[float]:
+         rewards = []
+         for completion, target in zip(completions, answer):
+             completion_text = self.parser.parse(completion[0]["content"]).translation
+             target_text = target.strip()
+
+             # zip truncates to the shorter string; extra letters earn nothing.
+             correct_letters = sum(1 for c1, c2 in zip(completion_text, target_text) if c1 == c2)
+             reward = correct_letters / max(len(target_text), 1)  # Avoid division by zero.
+
+             rewards.append(reward)
+         return rewards
+
+     def word_accuracy_reward_func(self, completions, answer, **kwargs) -> List[float]:
+         rewards = []
+         for completion, target in zip(completions, answer):
+             completion_text = self.parser.parse(completion[0]["content"]).translation
+             target_words = target.strip().split()
+             completion_words = completion_text.split()
+
+             correct_words = sum(1 for cw in completion_words if cw in target_words)
+             reward = correct_words / max(len(target_words), 1)
+             rewards.append(reward)
+         return rewards
+
+     def semantic_similarity_reward_func(self, completions, answer, **kwargs) -> List[float]:
+         rewards = []
+         for completion, target in zip(completions, answer):
+             completion_text = self.parser.parse(completion[0]["content"]).translation
+             target_text = target.strip()
+
+             try:
+                 completion_embedding = self.embedding_model.encode(completion_text, convert_to_numpy=True)
+                 target_embedding = self.embedding_model.encode(target_text, convert_to_numpy=True)
+                 # Cosine similarity between the two sentence embeddings.
+                 similarity = np.dot(completion_embedding, target_embedding) / (np.linalg.norm(completion_embedding) * np.linalg.norm(target_embedding))
+                 rewards.append(max(0, similarity))  # Clip to be >= 0.
+             except Exception as e:
+                 print("Error during semantic similarity", e)
+                 rewards.append(0.0)
+         return rewards
+
+     def levenshtein_distance_reward_func(self, completions, answer, **kwargs) -> List[float]:
+         def levenshtein_distance(s1, s2):
+             # Classic dynamic-programming edit distance.
+             if len(s1) > len(s2):
+                 s1, s2 = s2, s1
+             distances = range(len(s1) + 1)
+             for i2, c2 in enumerate(s2):
+                 distances_ = [i2 + 1]
+                 for i1, c1 in enumerate(s1):
+                     if c1 == c2:
+                         distances_.append(distances[i1])
+                     else:
+                         distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
+                 distances = distances_
+             return distances[-1]
+
+         rewards = []
+         for completion, target in zip(completions, answer):
+             completion_text = self.parser.parse(completion[0]["content"]).translation
+             target_text = target.strip()
+             distance = levenshtein_distance(completion_text, target_text)
+             normalized_distance = distance / max(len(completion_text), len(target_text), 1)  # Avoid division by zero.
+             rewards.append(1.0 - normalized_distance)
+         return rewards
+ ```
+
+ Key components:
+
+ * `__init__`: Initializes the rubric with a `TranslationParser` and a Sentence-BERT model for semantic similarity calculations. You can change `embedding_model_name` to use different pre-trained embeddings.
+ * `letter_accuracy_reward_func`: Calculates the proportion of correct letters.
+ * `word_accuracy_reward_func`: Calculates the proportion of correct words.
+ * `semantic_similarity_reward_func`: Calculates the cosine similarity between the sentence embeddings of the generated translation and the ground truth.
+ * `levenshtein_distance_reward_func`: Scores similarity based on normalized edit distance.
+
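+ A quick spot-check of the rubric outside training (a sketch; completions follow the chat-message format the trainer passes in):
+
+ ```python
+ rubric = TranslationRubric()
+ completions = [[{"role": "assistant", "content": "Le chat est sur le tapis."}]]
+ answers = ["Le chat est sur le tapis."]
+
+ for func in rubric.reward_funcs:
+     # A perfect match should score 1.0 (or very close) on every dimension.
+     print(func.__name__, func(completions, answers))
+ ```
+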
+ ### 4. Dataset Creation (`create_dummy_dataset`)
+
+ ```python
+ from datasets import Dataset
+ import verifiers as vf
+
+ def create_dummy_dataset():
+     data = {
+         'prompt': [
+             vf.format_prompt("Translate to French: 'The cat is on the mat.'", "You are a translation expert."),
+             vf.format_prompt("Translate to French: good morning", "You are a translation expert.")
+         ],
+         'answer': ["Le chat est sur le tapis.", "Bonjour"]
+     }
+     return Dataset.from_dict(data)
+ ```
+
+ **Important:** This is a placeholder. You'll need to replace it with code that loads your low-resource parallel text dataset and creates a Hugging Face `Dataset` object with `prompt` and `answer` columns. The `prompt` should contain the source sentence and any system prompt, and the `answer` should contain the target translation. A sketch of what that might look like follows.
+
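+ For example, assuming a hypothetical tab-separated file `parallel_corpus.tsv` with one `source<TAB>target` pair per line (the filename and layout are illustrative, not prescribed):
+
+ ```python
+ from datasets import Dataset
+ import verifiers as vf
+
+ def load_parallel_corpus(path: str = "parallel_corpus.tsv"):
+     prompts, answers = [], []
+     with open(path, encoding="utf-8") as f:
+         for line in f:
+             source, target = line.rstrip("\n").split("\t")
+             prompts.append(vf.format_prompt(f"Translate: {source}", "You are a translation expert."))
+             answers.append(target)
+     return Dataset.from_dict({"prompt": prompts, "answer": answers})
+ ```
+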
+ ### 5. Model Loading (`get_model_and_tokenizer`)
+
+ ```python
+ import verifiers as vf
+
+ model_name = "Helsinki-NLP/opus-mt-en-fr"  # Example: English to French
+ model, tokenizer = vf.get_model_and_tokenizer(model_name)
+ ```
+
+ This uses the `verifiers` utility functions to load a pre-trained translation model and its corresponding tokenizer. Choose a model appropriate for your language pair, and start with smaller models for efficiency, especially in a low-resource setting.
+
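+ Before training, a quick smoke test that the loaded model translates at all can save time. This sketch uses the standard `transformers` generation API; note that the `verifiers` loader may assume a causal LM, so confirm it works with a seq2seq model like this MarianMT example:
+
+ ```python
+ inputs = tokenizer("The cat is on the mat.", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=40)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # Expect roughly "Le chat est sur le tapis."
+ ```
+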
+ ### 6. Training Setup (`GRPOEnvTrainer`)
+
+ ```python
+ from verifiers.trainers.grpo_env_trainer import GRPOEnvTrainer
+
+ # Create dataset instances. YOU WILL REPLACE create_dummy_dataset!
+ train_dataset = create_dummy_dataset()
+ eval_dataset = create_dummy_dataset()
+
+ # Set up environment and rubric.
+ vf_env = TranslationEnv(dataset=train_dataset, system_prompt="You are a translation expert.")
+ rubric = vf_env.get_rubric()  # Get the rubric *from* the environment.
+
+ run_name = "translation_example"
+ # Keep the demonstration run short.
+ training_args = vf.get_default_grpo_config(run_name=run_name, num_gpus=8)
+ training_args.num_generations = 1  # Reduce the amount of generated data.
+ training_args.max_steps = 3  # Short training for illustration.
+
+ trainer = GRPOEnvTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     env=vf_env,
+     reward_funcs=rubric.reward_funcs,
+     args=training_args,
+     train_dataset=train_dataset,
+     # eval_dataset=eval_dataset
+ )
+
+ trainer.train()
+ ```
+
+ This sets up the `GRPOEnvTrainer` with the custom environment, rubric, dataset, model, and tokenizer. The key parameters to tune, especially in low-resource settings, live in `training_args`.
+
+ ## Running the Example (Not Built Yet)
+
+ The idea here is to move completely away from the OpenAI fine-tuning I use now to any open-source model. The tool I'm going to build here will let any community input their language as they understand it, train on any open-source model, likely with LoRA, and achieve better and better output.
+
+ **Install dependencies:** Make sure you have the required packages installed (see your original `pyproject.toml`), notably `sentence-transformers`, `torch`, and `transformers`. Use `uv` or another packaging method.
+
+ **Run the code:** Combine the code snippets above into a single Python file (e.g., `translation_trainer.py`) and execute it:
+
+ ```bash
+ python translation_trainer.py
+ ```
+
+ This runs a very short training demonstration on the dummy dataset. You should see output from the trainer and (if you enable logging) the prompts, completions, and calculated rewards.
+
+ ## Adapting to Your Specific Low-Resource Task
+
+ * **Dataset:** Replace `create_dummy_dataset()` with your own data loading.
+ * **Model:** Choose a suitable pre-trained translation model for your languages.
+ * **`is_completed` and hints:** Adjust the completion check and the `env_response` hinting logic to give more targeted feedback.
+ * **Hyperparameters:** Experiment with the `GRPOConfig` parameters, as sketched below. Start with a low learning rate and consider increasing `beta` (the KL divergence penalty) to prevent overfitting on a small dataset. A larger `beta` keeps the model's weights closer to the pre-trained values.
+
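+ A hedged sketch of conservative starting values (field names follow `trl`'s `GRPOConfig`; verify them against the versions of `verifiers` and `trl` you have installed):
+
+ ```python
+ training_args = vf.get_default_grpo_config(run_name="stoney_translation", num_gpus=1)
+ training_args.learning_rate = 1e-6  # Low learning rate: small datasets overfit quickly.
+ training_args.beta = 0.1            # Stronger KL penalty keeps weights near the pre-trained model.
+ training_args.num_generations = 4   # A few samples per prompt for the group-relative baseline.
+ ```
+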
+ ## Generating Reward Functions from Language Patterns
+
+ One idea I have for generating a series of reward functions for a low-resource language is to simply pass the JSON dictionary of the Stoney Nakoda to an LLM and ask for the rules or patterns it notices.
+
+ It will give you a full set of rules that you can then use to define a very large number of very small reward functions, which can very precisely fine-tune even low-resource languages around their contours. A sketch of this pattern-to-reward idea follows the link below.
+
+ Here is the actual LLM output using this simple idea: [RLHFrules.json](RLHFrules.json)
+
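+ A minimal sketch of turning such rules into rewards, assuming a hypothetical format for the rules file in which each entry carries a surface `pattern` string (the format is invented for illustration):
+
+ ```python
+ import json
+ from typing import Callable, List
+
+ def make_pattern_reward(pattern: str) -> Callable:
+     # Each rule becomes a tiny reward: 1.0 if the completion respects the pattern, else 0.0.
+     def reward_func(completions, answer, **kwargs) -> List[float]:
+         return [1.0 if pattern in c[0]["content"] else 0.0 for c in completions]
+     reward_func.__name__ = f"pattern_reward_{pattern}"
+     return reward_func
+
+ with open("RLHFrules.json", encoding="utf-8") as f:
+     rules = json.load(f)
+
+ # One small reward function per observed rule; append these to the rubric's reward_funcs.
+ reward_funcs = [make_pattern_reward(rule["pattern"]) for rule in rules]
+ ```
+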
+
  """
 
  # Store conversation history

  gr.Markdown(
  "# You are Asking Google Deep Mind about "
  "\"From Whispers to Voices\", "
+ "it will need a few seconds to review the details"
  )
 
  chatbot = gr.Chatbot(show_label=False)