DeL-TaiseiOzaki
/

Tengentoppa-llm-jp-3-3.7B-base

Safetensors

Japanese

llama

DeL-TaiseiOzaki committed on Dec 11, 2024

Commit

309fba1

verified

1 Parent(s): ede4328

Update README.md

Files changed (1)

README.md +99 -3

@@ -1,3 +1,99 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- ja
+base_model:
+- llm-jp/llm-jp-3-3.7b
+---
+# Tengentoppa-llm-jp-base-3.7B
+This is a modified version of the [llm-jp-3-3.7b](https://huggingface.co/llm-jp/llm-jp-3-3.7b) model with additional special tokens for structured conversations. The base model was developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/).
+![image/jpg](tengentoppa2.jpg)
+## Model Details
+- **Base Model**: llm-jp-3-3.7b
+- **Model Type**: Transformer-based Language Model
+- **Parameters**: 3.7B
+- **Context Length**: 4096
+- **Language**: Japanese and English
+### Additional Special Tokens
+This model includes the following special tokens for structured conversations:
+```
+<|SYSTEM|>, </|SYSTEM|>       - System message delimiters
+<|USER|>, </|USER|>           - User input delimiters
+<|HINT|>, </|HINT|>           - Hint message delimiters
+<|REASONING|>, </|REASONING|> - Reasoning section delimiters
+<|ASSISTANT|>, </|ASSISTANT|> - Assistant response delimiters
+```
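The delimiters listed above can be assembled into a prompt with a small helper. The following is a minimal sketch, not the author's tooling: the names `ROLES`, `wrap`, and `build_prompt` are hypothetical, and joining turns with a newline is an assumption taken from the prompt string in the Usage section.

```python
# Role names copied from the special-token list above.
ROLES = ("SYSTEM", "USER", "HINT", "REASONING", "ASSISTANT")


def wrap(role: str, content: str) -> str:
    """Wrap content in the opening and closing delimiter for one role."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return f"<|{role}|>{content}</|{role}|>"


def build_prompt(turns: list[tuple[str, str]]) -> str:
    """Join (role, content) pairs with newlines (the separator is an assumption)."""
    return "\n".join(wrap(role, content) for role, content in turns)


prompt = build_prompt([
    ("SYSTEM", "You are a helpful assistant."),
    ("USER", "自然言語処理とは何か"),
])
print(prompt)
```

With these two turns the helper reproduces the prompt string used in the Usage section, so it can serve as a drop-in way to build `text` there.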
+## Required Libraries and Their Versions
+- torch>=2.3.0
+- transformers>=4.40.1
+- tokenizers>=0.19.1
+- accelerate>=0.29.3
+- flash-attn>=2.5.8
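The minimum versions above can be checked against the local environment before the model is loaded. This is a rough sketch using only the standard library; the helper names are hypothetical, and the version comparison is simplified (it compares the leading numeric release components and ignores pre-release ordering).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above.
MINIMUMS = {
    "torch": "2.3.0",
    "transformers": "4.40.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.29.3",
    "flash-attn": "2.5.8",
}


def release_tuple(text: str) -> tuple[int, ...]:
    """Parse the leading numeric part of a version, e.g. '2.3.0+cu121' -> (2, 3, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples after padding the shorter one with zeros."""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_environment() -> dict[str, str]:
    """Return a status per package: 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(found, minimum) else f"too old ({found})"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```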
+## Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
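+# Load the tokenizer and the model; device_map="auto" requires the accelerate package.
+# If this ID does not resolve, try the ID shown in the repository header:
+# DeL-TaiseiOzaki/Tengentoppa-llm-jp-3-3.7B-base
+tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
+model = AutoModelForCausalLM.from_pretrained(
+    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B",
+    device_map="auto",
+    torch_dtype=torch.bfloat16
+)
+# Build a prompt from the special tokens (the user turn asks "What is natural language processing?")
+text = "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n<|USER|>自然言語処理とは何か</|USER|>"
+# add_special_tokens=False stops the tokenizer from adding its default special tokens (such as BOS)
+tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
+# Sample up to 100 new tokens without tracking gradients
+with torch.no_grad():
+    output = model.generate(
+        tokenized_input,
+        max_new_tokens=100,
+        do_sample=True,
+        top_p=0.95,
+        temperature=0.7,
+        repetition_penalty=1.05,
+    )[0]
+# The decoded string contains the prompt followed by the generated continuation
+print(tokenizer.decode(output))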
+```
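Because the decoded output contains the prompt as well as the continuation, it can help to pull out individual delimited sections afterwards. The sketch below is illustrative only: `extract_sections` is a hypothetical helper, and `sample` is a hand-written string, not real model output. Whether the model actually emits well-formed `<|ASSISTANT|>` spans is not stated in this README.

```python
import re


def extract_sections(decoded: str, role: str) -> list[str]:
    """Return the contents of every complete <|ROLE|>...</|ROLE|> span in the text."""
    pattern = re.compile(
        re.escape(f"<|{role}|>") + r"(.*?)" + re.escape(f"</|{role}|>"),
        flags=re.DOTALL,
    )
    return [match.strip() for match in pattern.findall(decoded)]


# Hand-written stand-in for a decoded generation (an assumption, not model output).
sample = (
    "<|SYSTEM|>You are a helpful assistant.</|SYSTEM|>\n"
    "<|USER|>自然言語処理とは何か</|USER|>\n"
    "<|ASSISTANT|>(hypothetical reply text)</|ASSISTANT|>"
)
print(extract_sections(sample, "ASSISTANT"))
```

Unterminated spans are ignored by design, so a generation cut off by `max_new_tokens` before the closing delimiter yields an empty list for that role.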
+## Base Model Information
+### Model Architecture
+|Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters|
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+|3.7b|28|3072|24|4096|611,844,096|3,171,068,928|
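The table figures can be cross-checked with a little arithmetic. The sketch below assumes that the embedding count covers two separate matrices (input embeddings and output projection), each of shape vocabulary rows × hidden size; that layout is an assumption, not something the table states. The `added_params` helper is hypothetical and only estimates the cost of extra vocabulary rows under the same assumption.

```python
# Figures copied from the architecture table above.
HIDDEN_SIZE = 3072
EMBEDDING_PARAMS = 611_844_096
NON_EMBEDDING_PARAMS = 3_171_068_928

# Total parameter count implied by the two columns.
total = EMBEDDING_PARAMS + NON_EMBEDDING_PARAMS

# Assumption: two untied matrices (input and output), each rows x HIDDEN_SIZE.
rows, remainder = divmod(EMBEDDING_PARAMS, 2 * HIDDEN_SIZE)


def added_params(new_rows: int) -> int:
    """Parameters added if the vocabulary grows by new_rows in both matrices."""
    return 2 * HIDDEN_SIZE * new_rows


print(f"total parameters: {total:,}")
print(f"implied vocabulary rows: {rows:,} (remainder {remainder})")
print(f"parameters per ten extra rows: {added_params(10):,}")
```

The division comes out exact, which is consistent with the two-matrix assumption. Whether the resize described in this README actually grew the matrices by ten rows is not stated, so the last figure is only an upper-level estimate.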
+### Tokenizer
+The tokenizer is based on the original llm-jp-3-3.7b tokenizer, which uses the Unigram byte-fallback model from [huggingface/tokenizers](https://github.com/huggingface/tokenizers). Its vocabulary comes from [`llm-jp-tokenizer v3.0`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v3.0b2), extended with the additional special tokens listed above.
+## License
+This model inherits the license from the base model:
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+## Attribution
+This model is based on llm-jp-3-3.7b. Please cite the original model and its creators when using this modified version.
+## Modifications
+The only modifications made to the original model are:
+1. Addition of special tokens for structured conversations
+2. Resizing of token embeddings to accommodate the new special tokens
+All other aspects of the model, including its training data, architecture, and capabilities, remain the same as the original llm-jp-3-3.7b model.
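The two modifications listed above can be sketched with the standard Hugging Face calls `add_special_tokens` and `resize_token_embeddings`. This is an illustration under assumptions, not the author's actual script, which this README does not show; the names `SPECIAL_TOKENS`, `add_conversation_tokens`, and `build_modified_model` are hypothetical.

```python
# Delimiter tokens copied from the "Additional Special Tokens" section.
SPECIAL_TOKENS = [
    "<|SYSTEM|>", "</|SYSTEM|>",
    "<|USER|>", "</|USER|>",
    "<|HINT|>", "</|HINT|>",
    "<|REASONING|>", "</|REASONING|>",
    "<|ASSISTANT|>", "</|ASSISTANT|>",
]


def add_conversation_tokens(tokenizer, model) -> int:
    """Register the delimiter tokens, then resize the embeddings to match.

    Accepts any tokenizer/model pair that exposes the Hugging Face methods
    add_special_tokens and resize_token_embeddings.
    """
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": SPECIAL_TOKENS}
    )
    # Note: this sets the embedding row count to exactly len(tokenizer); if the
    # model's matrix was larger than the tokenizer vocabulary, it shrinks.
    model.resize_token_embeddings(len(tokenizer))
    return num_added


def build_modified_model(base_id: str = "llm-jp/llm-jp-3-3.7b"):
    """Download the base model and apply both modifications (needs transformers)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = AutoModelForCausalLM.from_pretrained(base_id)
    add_conversation_tokens(tokenizer, model)
    return tokenizer, model
```

Calling `build_modified_model()` downloads the full base checkpoint, so it is left uncalled here. Newly added embedding rows are freshly initialized and carry no learned meaning until the model is trained further.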