Update README.md
README.md (CHANGED)
````diff
@@ -1,20 +1,31 @@
 ---
 language:
 - en
+- ja
 library_name: transformers
+license: llama3
 tags:
 - gpt
 - llm
 - large language model
 - h2o-llmstudio
 inference: false
-thumbnail:
+thumbnail: >-
+  https://h2o.ai/etc.clientlibs/h2o/clientlibs/clientlib-site/resources/images/favicon.ico
+datasets:
+- fujiki/japanese_hh-rlhf-49k
+pipeline_tag: text-generation
 ---
-# Model Card
-## Summary
 
-
-
+## Introduction
+
+This is a `meta-llama/Meta-Llama-3-8B-Instruct` model fine-tuned on a **Japanese** conversation dataset.
+
+Dataset: [japanese_hh-rlhf-49k](https://huggingface.co/datasets/fujiki/japanese_hh-rlhf-49k)
+
+Training framework: [h2o-llmstudio](https://github.com/h2oai/h2o-llmstudio)
+
+Training max context length: 8k
 
 
 ## Usage
````
````diff
@@ -57,9 +68,8 @@ generate_text = pipeline(
 # generate_text.model.generation_config.repetition_penalty = float(1.0)
 
 messages = [
-    {"role": "
-    {"role": "
-    {"role": "user", "content": "Why is drinking water so healthy?"},
+    {"role": "system", "content": "あなたは、常に海賊の言葉で返事する海賊チャットボットです!"},
+    {"role": "user", "content": "自己紹介してください"},
 ]
 
 res = generate_text(
````
````diff
@@ -88,9 +98,8 @@ model_name = "haqishen/h2o-Llama-3-8B-Japanese-Instruct" # either local folder
 # Important: The prompt needs to be in the same format the model was trained with.
 # You can find an example prompt in the experiment logs.
 messages = [
-    {"role": "
-    {"role": "
-    {"role": "user", "content": "Why is drinking water so healthy?"},
+    {"role": "system", "content": "あなたは、常に海賊の言葉で返事する海賊チャットボットです!"},
+    {"role": "user", "content": "自己紹介してください"},
 ]
 
 tokenizer = AutoTokenizer.from_pretrained(
````
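Both usage hunks above swap the English example conversation for a Japanese one: a system prompt telling the model to answer as a pirate, plus a user turn asking it to introduce itself. The surrounding `pipeline(...)` and `generate_text(...)` arguments are truncated in this diff view, so the following is only a minimal, self-contained sketch of running the new messages through the transformers text-generation pipeline; the loading and generation settings here are illustrative assumptions, not the card's exact snippet.

```python
from transformers import pipeline

# Assumed loading arguments; the card's own pipeline(...) call is truncated in the diff above.
# device_map="auto" requires the accelerate package.
generate_text = pipeline(
    "text-generation",
    model="haqishen/h2o-Llama-3-8B-Japanese-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "あなたは、常に海賊の言葉で返事する海賊チャットボットです!"},
    {"role": "user", "content": "自己紹介してください"},
]

# Recent transformers releases accept chat-format input and apply the model's chat template.
res = generate_text(messages, max_new_tokens=256)

# For chat inputs, "generated_text" holds the whole conversation; the last turn is the reply.
print(res[0]["generated_text"][-1]["content"])
```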
````diff
@@ -133,6 +142,41 @@ answer = tokenizer.decode(tokens, skip_special_tokens=True)
 print(answer)
 ```
 
+
+### Use with vllm
+
+[vllm-project/vllm](https://github.com/vllm-project/vllm)
+
+```python
+from vllm import LLM, SamplingParams
+model_id = "haqishen/h2o-Llama-3-8B-Japanese-Instruct"
+llm = LLM(
+    model=model_id,
+    trust_remote_code=True,
+    tensor_parallel_size=2,
+)
+tokenizer = llm.get_tokenizer()
+messages = [
+    {"role": "system", "content": "あなたは、常に海賊の言葉で返事する海賊チャットボットです!"},
+    {"role": "user", "content": "自己紹介してください"},
+]
+conversations = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+outputs = llm.generate(
+    [conversations],
+    SamplingParams(
+        temperature=0.6,
+        top_p=0.9,
+        max_tokens=1024,
+        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
+    )
+)
+print(outputs[0].outputs[0].text.strip())
+```
+
 ## Quantization and sharding
 
 You can load the models using quantization by specifying ```load_in_8bit=True``` or ```load_in_4bit=True```. Also, sharding on multiple GPUs is possible by setting ```device_map=auto```.
````
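The quantization and sharding note in the card is prose-only, so here is a minimal sketch of what it describes, assuming `bitsandbytes` (for the quantization flags) and `accelerate` (for `device_map="auto"`) are installed, and a transformers version that still accepts these flags directly (newer releases route them through `BitsAndBytesConfig`).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "haqishen/h2o-Llama-3-8B-Japanese-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,   # or load_in_8bit=True; both need the bitsandbytes package
    device_map="auto",   # shard the weights across all visible GPUs (needs accelerate)
)
```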