Space: MakiAi/Llama-finetune-sandbox

MakiAi committed on Nov 26, 2024

Commit 636f7db (2 parents: ca5ae08, 487159b)

Merge feature/fast-inference-docs

Files changed (4)

README.md +11 -0
docs/README.en.md +23 -17
sandbox/Unsloth_inference_llama3-2.md +158 -0
sandbox/Unsloth_inference_llm_jp.md +120 -0

README.md CHANGED

@@ -81,6 +81,17 @@ license: mit
    - → [Use this tool to convert from Markdown format to notebook format](https://huggingface.co/spaces/MakiAi/JupytextWebUI)
  - [📒Notebook here](https://colab.research.google.com/drive/1AjtWF2vOEwzIoCMmlQfSTYCVgy4Y78Wi?usp=sharing)
 ### Efficient Model Deployment using Ollama and LiteLLM
  - Setup and deployment guide on Google Colab
  - → See [`efficient-ollama-colab-setup-with-litellm-guide.md`](sandbox/efficient-ollama-colab-setup-with-litellm-guide.md) for details.

    - → [Use this tool to convert from Markdown format to notebook format](https://huggingface.co/spaces/MakiAi/JupytextWebUI)
  - [📒Notebook here](https://colab.research.google.com/drive/1AjtWF2vOEwzIoCMmlQfSTYCVgy4Y78Wi?usp=sharing)
+### Fast Inference with Unsloth
+ - Fast inference implementation for Llama-3.2 models
+   - → See [`Unsloth_inference_llama3-2.md`](sandbox/Unsloth_inference_llama3-2.md) for details.
+   - → Implements efficient inference for Llama-3.2 models using Unsloth.
+ - [📒Notebook here](https://colab.research.google.com/drive/1FkAYiX2fbGPTRUopYw39Qt5UE2tWJRpa?usp=sharing)
+ - Fast inference implementation for LLM-JP models
+   - → See [`Unsloth_inference_llm_jp.md`](sandbox/Unsloth_inference_llm_jp.md) for details.
+   - → Implements fast inference for Japanese LLMs, with performance optimization.
+ - [📒Notebook here](https://colab.research.google.com/drive/1lbMKv7NzXQ1ynCg7DGQ6PcCFPK-zlSEG?usp=sharing)
 ### Efficient Model Deployment using Ollama and LiteLLM
  - Setup and deployment guide on Google Colab
  - → See [`efficient-ollama-colab-setup-with-litellm-guide.md`](sandbox/efficient-ollama-colab-setup-with-litellm-guide.md) for details.

docs/README.en.md CHANGED

@@ -31,7 +31,7 @@ license: mit
 </p>
 <h2 align="center">
-  Llama Model Fine-tuning Experimental Environment
 </h2>
 <p align="center">
@@ -44,7 +44,7 @@ license: mit
 ## 🚀 Project Overview
-**Llama-finetune-sandbox** provides an experimental environment for learning and verifying Llama model fine-tuning.  You can try various fine-tuning methods, customize models, and evaluate performance.  It caters to a wide range of users, from beginners to researchers. Version 0.5.0 includes updated documentation and the addition of a context-aware reflexive QA generation system. This system generates high-quality Q&A datasets from Wikipedia data, iteratively improving the quality of questions and answers using LLMs to create a more accurate dataset.
 ## ✨ Key Features
@@ -58,16 +58,16 @@ license: mit
    - Various quantization options
    - Multiple attention mechanisms
-3. **Experimental Environment Setup:**
-   - Memory usage optimization
    - Visualization of experimental results
 4. **Context-Aware Reflexive QA Generation System:**
     - Generates high-quality Q&A datasets from Wikipedia data.
     - Uses LLMs to automatically generate context-aware questions and answers, evaluate quality, and iteratively improve them.
-    - Employs a reflexive approach that quantifies factuality, question quality, and answer completeness to enable iterative improvement.
     - Provides comprehensive code and explanations covering environment setup, model selection, data preprocessing, Q&A pair generation, quality evaluation, and the improvement process.
-    - Uses libraries such as `litellm`, `wikipedia`, and `transformers`.
     - Generated Q&A pairs are saved in JSON format and can be easily uploaded to the Hugging Face Hub.
@@ -82,25 +82,33 @@ This repository includes the following examples:
  - [📒Notebook here](https://colab.research.google.com/drive/1AjtWF2vOEwzIoCMmlQfSTYCVgy4Y78Wi?usp=sharing)
 ### Efficient Model Deployment using Ollama and LiteLLM
- - Setup and usage guide on Google Colab.
  - → See [`efficient-ollama-colab-setup-with-litellm-guide.md`](sandbox/efficient-ollama-colab-setup-with-litellm-guide.md) for details.
  - [📒Notebook here](https://colab.research.google.com/drive/1buTPds1Go1NbZOLlpG94VG22GyK-F4GW?usp=sharing)
 ### Q&A Dataset Generation from Wikipedia Data (Sentence Pool QA Method)
 - High-quality Q&A dataset generation using the sentence pool QA method.
-  - → A new dataset creation method that generates Q&A pairs while preserving context by pooling sentences delimited by periods.
-  - → Chunk size is flexibly adjustable (default 200 characters), allowing generation of Q&A pairs with optimal context ranges for various applications.
   - → See [`wikipedia-qa-dataset-generator.md`](sandbox/wikipedia-qa-dataset-generator.md) for details.
 - [📒Notebook here](https://colab.research.google.com/drive/1mmK5vxUzjk3lI6OnEPrQqyjSzqsEoXpk?usp=sharing)
 ### Context-Aware Reflexive QA Generation System
 - Q&A dataset generation with reflexive quality improvement.
-  - → A new method that automatically evaluates the quality of generated Q&A pairs and iteratively improves them.
   - → Quantifies factuality, question quality, and answer completeness for evaluation.
-  - → Uses contextual information for high-precision question generation and answer consistency checks.
   - → See [`context_aware_Reflexive_qa_generator_V2.md`](sandbox/context_aware_Reflexive_qa_generator_V2.md) for details.
 - [📒Notebook here](https://colab.research.google.com/drive/1OYdgAuXHbl-0LUJgkLl_VqknaAEmAm0S?usp=sharing)
 ## 🛠️ Setup
@@ -113,11 +121,10 @@ cd Llama-finetune-sandbox
 ## 📝 Adding Examples
 1. Add new implementations to the `sandbox/` directory.
-2. Add necessary configurations and utilities to `utils/` (Removed as this directory didn't exist in the original).
-3. Update documentation and tests (Removed as this section didn't exist in the original).
 4. Create a pull request.
 ## 🤝 Contributions
 - Implementation of new fine-tuning methods
@@ -127,15 +134,14 @@ cd Llama-finetune-sandbox
 ## 📚 References
-- [HuggingFace PEFT Documentation](https://huggingface.co/docs/peft)
 - [About Llama models](https://github.com/facebookresearch/llama)
-- [Fine-tuning best practices](https://github.com/Sunwood-ai-labs/Llama-finetune-sandbox/wiki) (Removed as this wiki page didn't exist in the original).
 ## 📄 License
 This project is licensed under the MIT License.
 ## v0.5.0 Updates
 **🆕 What's New:**

 </p>
 <h2 align="center">
+  Llama Model Fine-tuning Experimentation Environment
 </h2>
 <p align="center">
 ## 🚀 Project Overview
+**Llama-finetune-sandbox** provides an experimental environment for learning and verifying the fine-tuning of Llama models.  You can try various fine-tuning methods, customize models, and evaluate performance.  It caters to a wide range of users, from beginners to researchers. Version 0.5.0 includes updated documentation and the addition of a context-aware reflexive QA generation system. This system generates high-quality Q&A datasets from Wikipedia data, leveraging LLMs to iteratively improve the quality of questions and answers, resulting in a more accurate dataset.
 ## ✨ Key Features
    - Various quantization options
    - Multiple attention mechanisms
+3. **Well-equipped Experimentation Environment:**
+   - Optimized memory usage
    - Visualization of experimental results
 4. **Context-Aware Reflexive QA Generation System:**
     - Generates high-quality Q&A datasets from Wikipedia data.
     - Uses LLMs to automatically generate context-aware questions and answers, evaluate quality, and iteratively improve them.
+    - Employs a reflexive approach, quantifying factuality, question quality, and answer completeness for iterative improvement.
     - Provides comprehensive code and explanations covering environment setup, model selection, data preprocessing, Q&A pair generation, quality evaluation, and the improvement process.
+    - Utilizes libraries such as `litellm`, `wikipedia`, and `transformers`.
     - Generated Q&A pairs are saved in JSON format and can be easily uploaded to the Hugging Face Hub.
  - [📒Notebook here](https://colab.research.google.com/drive/1AjtWF2vOEwzIoCMmlQfSTYCVgy4Y78Wi?usp=sharing)
 ### Efficient Model Deployment using Ollama and LiteLLM
+ - Setup and deployment guide on Google Colab.
  - → See [`efficient-ollama-colab-setup-with-litellm-guide.md`](sandbox/efficient-ollama-colab-setup-with-litellm-guide.md) for details.
  - [📒Notebook here](https://colab.research.google.com/drive/1buTPds1Go1NbZOLlpG94VG22GyK-F4GW?usp=sharing)
 ### Q&A Dataset Generation from Wikipedia Data (Sentence Pool QA Method)
 - High-quality Q&A dataset generation using the sentence pool QA method.
+  - → A new dataset creation method that generates Q&A pairs while preserving context by pooling sentence segments delimited by punctuation.
+  - → Chunk size is flexibly adjustable (default 200 characters) to generate Q&A pairs with an optimal context range depending on the application.
   - → See [`wikipedia-qa-dataset-generator.md`](sandbox/wikipedia-qa-dataset-generator.md) for details.
 - [📒Notebook here](https://colab.research.google.com/drive/1mmK5vxUzjk3lI6OnEPrQqyjSzqsEoXpk?usp=sharing)
 ### Context-Aware Reflexive QA Generation System
 - Q&A dataset generation with reflexive quality improvement.
+  - → Automatically evaluates the quality of generated Q&A pairs and iteratively improves them.
   - → Quantifies factuality, question quality, and answer completeness for evaluation.
+  - → Generates high-precision questions and performs consistency checks on answers using contextual information.
   - → See [`context_aware_Reflexive_qa_generator_V2.md`](sandbox/context_aware_Reflexive_qa_generator_V2.md) for details.
 - [📒Notebook here](https://colab.research.google.com/drive/1OYdgAuXHbl-0LUJgkLl_VqknaAEmAm0S?usp=sharing)
+### LLM Evaluation System (LLMs as a Judge)
+- Advanced quality evaluation system using LLMs as evaluators.
+  - → Automatically evaluates questions, model answers, and LLM responses on a four-level scale.
+  - → Robust design with error handling and retry functions.
+  - → Generates detailed evaluation reports in CSV and HTML formats.
+  - → See [`LLMs_as_a_Judge_TOHO_V2.md`](sandbox/LLMs_as_a_Judge_TOHO_V2.md) for details.
+- [📒Notebook here](https://colab.research.google.com/drive/1Zjw3sOMa2v5RFD8dFfxMZ4NDGFoQOL7s?usp=sharing)
 ## 🛠️ Setup
 ## 📝 Adding Examples
 1. Add new implementations to the `sandbox/` directory.
+2. Add necessary settings and utilities to `utils/` (This section was removed as `utils/` directory appears not to exist).
+3. Update documentation and tests (This section was removed as there's no mention of existing tests).
 4. Create a pull request.
 ## 🤝 Contributions
 - Implementation of new fine-tuning methods
 ## 📚 References
+- [HuggingFace PEFT documentation](https://huggingface.co/docs/peft)
 - [About Llama models](https://github.com/facebookresearch/llama)
+- [Fine-tuning best practices](https://github.com/Sunwood-ai-labs/Llama-finetune-sandbox/wiki) (This section was removed as the wiki page appears not to exist).
 ## 📄 License
 This project is licensed under the MIT License.
 ## v0.5.0 Updates
 **🆕 What's New:**

sandbox/Unsloth_inference_llama3-2.md ADDED

@@ -0,0 +1,158 @@

+# 🦙 Fast Inference Guide for a LLaMA 3.2-Based Fine-Tuned Model Created with Unsloth
+## 📦 Installing the Required Libraries
+```python
+%%capture
+!pip install unsloth
+# Get the latest Unsloth nightly build
+!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+```
+**Explanation**:
+This cell installs the Unsloth library, which is designed to speed up fine-tuning and inference for LLaMA models. Installing the nightly build from GitHub gives access to the latest features and improvements.
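+## 🔧 Importing Libraries and Basic Settings
+```python
+# Unsloth recommends being imported before transformers so its optimizations apply
+from unsloth import FastLanguageModel
+from unsloth.chat_templates import get_chat_template
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+# Basic model settings
+max_seq_length = 512
+dtype = None  # None lets Unsloth auto-detect the dtype
+load_in_4bit = True
+model_id = "MakiAi/Llama-3-2-3B-Instruct-bnb-4bit-10epochs-adapter"  # Path of the fine-tuned model
+```
+**Explanation**:
+- Imports the required libraries
+- Loads the model with 4-bit quantization to reduce memory usage
+- `model_id` specifies the path of the model fine-tuned with Unsloth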
+## 🚀 Initializing the Model and Tokenizer
+```python
+# Load the model and tokenizer
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name=model_id,
+    max_seq_length=max_seq_length,  # Pass the limit defined in the settings cell
+    dtype=dtype,
+    load_in_4bit=load_in_4bit,
+    trust_remote_code=True,
+)
+# Use the LLaMA 3.1 chat template
+tokenizer = get_chat_template(
+    tokenizer,
+    chat_template="llama-3.1",  # The LLaMA 3.1 template also works for 3.2
+)
+# Enable fast inference mode
+FastLanguageModel.for_inference(model)  # About 2x faster than normal, per Unsloth
+```
+**Explanation**:
+1. Loads the fine-tuned model
+2. Applies the LLaMA 3.1 chat template (also compatible with 3.2)
+3. Enables Unsloth's fast inference mode
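+## 💬 Implementing Inference on a Dataset
+```python
+def generate_response(dataset_entry):
+    """
+    Generate a response for a single dataset entry.
+    """
+    # Build the message list from the first conversation turn
+    messages = [
+        {"role": "user", "content": dataset_entry["conversations"][0]['content']},
+    ]
+    # Apply the chat template
+    inputs = tokenizer.apply_chat_template(
+        messages,
+        tokenize=True,
+        add_generation_prompt=True,  # Append the assistant header so the model starts its reply
+        return_tensors="pt",
+    ).to(model.device)
+    # Generate the response
+    outputs = model.generate(
+        input_ids=inputs,
+        max_new_tokens=64,  # Maximum number of tokens to generate
+        use_cache=True,     # Reuse the KV cache for faster decoding
+        do_sample=True,     # Sample so that temperature and min_p take effect
+        temperature=1.5,    # Higher temperature for more varied output
+        min_p=0.1           # Drop tokens below 10% of the top token's probability
+    )
+    # The decoded text includes the prompt and special tokens
+    return tokenizer.batch_decode(outputs)
+```
+**Explanation**:
+This function:
+1. Extracts the user input from the dataset entry
+2. Applies the LLaMA 3.1-format chat template
+3. Generates a response with the following parameters:
+   - `max_new_tokens`: 64 (keeps responses short)
+   - `do_sample`: True (sampling, so the two parameters below apply)
+   - `temperature`: 1.5 (more varied, creative output)
+   - `min_p`: 0.1 (filters out unlikely tokens so high-temperature sampling stays coherent)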
+## ✅ Example Runs
+```python
+if __name__ == "__main__":
+    # Test dataset
+    dataset = [
+        {"conversations": [{"content": "火焔猫燐について教えてください。"}]},  # "Please tell me about Rin Kaenbyou."
+        {"conversations": [{"content": "水橋パルスィの本質は何ですか？"}]},  # "What is the true nature of Parsee Mizuhashi?"
+        {"conversations": [{"content": "プログラミング初心者へのアドバイスをお願いします。"}]}  # "Please give some advice for programming beginners."
+    ]
+    # Try the first dataset entry
+    response = generate_response(dataset[0])
+    print("Input:", dataset[0]["conversations"][0]['content'])
+    print("Response:", response)
+```
+```python
+if __name__ == "__main__":
+    # Test dataset
+    dataset = [
+        {"conversations": [{"content": "火焔猫燐について教えてください。"}]},  # "Please tell me about Rin Kaenbyou."
+        {"conversations": [{"content": "水橋パルスィの本質は何ですか？"}]},  # "What is the true nature of Parsee Mizuhashi?"
+        {"conversations": [{"content": "プログラミング初心者へのアドバイスをお願いします。"}]}  # "Please give some advice for programming beginners."
+    ]
+    # Try the second dataset entry
+    response = generate_response(dataset[1])
+    print("Input:", dataset[1]["conversations"][0]['content'])
+    print("Response:", response)
+```
+```python
+if __name__ == "__main__":
+    # Test dataset
+    dataset = [
+        {"conversations": [{"content": "火焔猫燐について教えてください。"}]},  # "Please tell me about Rin Kaenbyou."
+        {"conversations": [{"content": "水橋パルスィの本質は何ですか？"}]},  # "What is the true nature of Parsee Mizuhashi?"
+        {"conversations": [{"content": "プログラミング初心者へのアドバイスをお願いします。"}]}  # "Please give some advice for programming beginners."
+    ]
+    # Try the third dataset entry
+    response = generate_response(dataset[2])
+    print("Input:", dataset[2]["conversations"][0]['content'])
+    print("Response:", response)
+```
+**Explanation**:
+The cells above show how to run the sample:
+- Define a test dataset
+- Generate a response for the selected entry
+- Display the input and the generated response
+With this code, a LLaMA 3.2 model fine-tuned with Unsloth on a custom dataset can be run for fast inference. Applying the LLaMA 3.1 chat template keeps the output stable for the newer model as well. The generation parameters can be adjusted as needed to customize the model's response characteristics.
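
The three example cells in this file differ only in the dataset index, so the same run can be written as a single loop. The following is a minimal sketch, not part of the commit: `first_user_message`, `run_all`, and `echo_generate` are hypothetical names introduced here, and `echo_generate` is only a stand-in that echoes the prompt so the snippet runs without a GPU or a loaded model. In the notebook, the guide's `generate_response` would be passed to `run_all` instead.

```python
from typing import Callable, Dict, List, Tuple


def first_user_message(entry: Dict) -> str:
    """Return the text of the first conversation turn in a dataset entry."""
    return entry["conversations"][0]["content"]


def run_all(
    dataset: List[Dict], generate: Callable[[Dict], object]
) -> List[Tuple[str, object]]:
    """Call `generate` on every entry and pair each prompt with its response."""
    return [(first_user_message(entry), generate(entry)) for entry in dataset]


def echo_generate(entry: Dict) -> List[str]:
    """Hypothetical stand-in for generate_response; it only echoes the prompt."""
    return ["<echo> " + first_user_message(entry)]


# Same test prompts as in the guide (kept in Japanese, as the model expects)
dataset = [
    {"conversations": [{"content": "火焔猫燐について教えてください。"}]},
    {"conversations": [{"content": "水橋パルスィの本質は何ですか？"}]},
    {"conversations": [{"content": "プログラミング初心者へのアドバイスをお願いします。"}]},
]

if __name__ == "__main__":
    # Replace echo_generate with generate_response once the model is loaded
    for prompt, response in run_all(dataset, echo_generate):
        print("Input:", prompt)
        print("Response:", response)
```

Passing the generator as an argument keeps the loop independent of the model, which makes it easy to dry-run the dataset format before spending GPU time.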

sandbox/Unsloth_inference_llm_jp.md ADDED

@@ -0,0 +1,120 @@

+# 🤖 Fast Inference Implementation Guide for LLM-JP Models with Unsloth (with Google Colab 📒 Notebook)
+## 📦 Installing the Required Libraries
+```python
+%%capture
+!pip install unsloth
+# Get the latest Unsloth nightly build
+!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
+```
+**Explanation**:
+This cell installs the Unsloth library, which speeds up inference for large language models (LLMs). The `%%capture` magic hides the installation output.
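+## 🔧 Importing the Required Libraries and Basic Settings
+```python
+# Unsloth recommends being imported before transformers so its optimizations apply
+from unsloth import FastLanguageModel
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+# Basic model settings
+max_seq_length = 512  # Defined for reference; not passed to from_pretrained below
+dtype = None  # None lets Unsloth auto-detect the dtype
+load_in_4bit = True
+model_id = "llm-jp/llm-jp-3-13b"  # Or the ID of your own fine-tuned model
+```
+**Explanation**:
+- `transformers`: Hugging Face's Transformers library
+- `unsloth`: the acceleration library
+- `torch`: the PyTorch framework
+- Model settings:
+  - Maximum sequence length: 512 tokens
+  - 4-bit quantization enabled
+  - Uses the LLM-JP 13B model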
+## 🚀 Initializing the Model and Tokenizer
+```python
+# Load the model and tokenizer
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name=model_id,
+    dtype=dtype,
+    load_in_4bit=load_in_4bit,
+    trust_remote_code=True,
+)
+# Switch to inference mode
+FastLanguageModel.for_inference(model)
+```
+**Explanation**:
+This cell:
+1. Loads the model and tokenizer in a single call
+2. Applies 4-bit quantization to reduce memory usage
+3. Switches the model to Unsloth's optimized inference mode
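+## 💬 Implementing the Response Generation Function
+```python
+def generate_response(input_text):
+    """
+    Generate a response for the given input text.
+    """
+    # Build the prompt ("指示" = instruction, "回答" = answer)
+    prompt = f"""### 指示\n{input_text}\n### 回答\n"""
+    # Tokenize the input
+    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
+    # Generate the response
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=512,
+        use_cache=True,
+        do_sample=False,
+        repetition_penalty=1.2
+    )
+    # Decode and keep only the text after the answer marker
+    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### 回答')[-1]
+    return prediction
+```
+**Explanation**:
+This function performs the following steps:
+1. Converts the input text into an instruction-style prompt
+2. Tokenizes the prompt into tensors the model can consume
+3. Generates a response with the following parameters:
+   - `max_new_tokens`: generates up to 512 tokens
+   - `use_cache`: reuses the KV cache for faster decoding
+   - `do_sample`: `False`, so decoding is greedy and the output is deterministic
+   - `repetition_penalty`: 1.2, to suppress repetition
+4. Extracts only the answer portion from the generated output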
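+## ✅ Usage Example
+```python
+if __name__ == "__main__":
+    # Sample input: "Please tell me about today's weather."
+    sample_input = "今日の天気について教えてください。"
+    # Generate the response
+    response = generate_response(sample_input)
+    print("Input:", sample_input)
+    print("Response:", response)
+```
+**Explanation**:
+This cell shows an actual usage example:
+- Sets a sample input
+- Calls the `generate_response` function to generate a response
+- Displays the input and the response
+Running this code generates responses to Japanese questions with the LLM-JP model. According to the guide, Unsloth's optimizations make inference faster than a standard implementation.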
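
The prompt construction and answer extraction inside `generate_response` are plain string operations, so they can be checked without loading the model. The following is a minimal sketch under that assumption: `build_prompt` and `extract_answer` are hypothetical helper names that do not appear in the commit, and the final `.strip()` is an addition to drop the leading newline left by the split.

```python
# Section markers used by the guide's prompt ("指示" = instruction, "回答" = answer)
INSTRUCTION_HEADER = "### 指示"
ANSWER_HEADER = "### 回答"


def build_prompt(input_text: str) -> str:
    """Wrap the input in the instruction/answer layout used by the guide."""
    return f"{INSTRUCTION_HEADER}\n{input_text}\n{ANSWER_HEADER}\n"


def extract_answer(decoded: str) -> str:
    """Return the text after the last answer marker in the decoded output."""
    # Same split as in the guide; strip() is an added convenience
    return decoded.split(f"\n{ANSWER_HEADER}")[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt("今日の天気について教えてください。")
    # Simulated decoder output: the prompt followed by a made-up completion
    fake_decoded = prompt + "これはダミーの回答です。"
    print(repr(prompt))
    print(extract_answer(fake_decoded))
```

Because a causal language model returns the prompt together with the completion, splitting on the last answer marker is what separates the model's reply from the echoed instruction.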