---
license: apache-2.0
language:
- ja
base_model:
- llm-jp/llm-jp-3-3.7b
---

# Tengentoppa-llm-jp-base-3.7B

This is a modified version of the [llm-jp-3-3.7b](https://huggingface.co/llm-jp/llm-jp-3-3.7b) model with additional special tokens for structured conversations. The base model was developed by the [Research and Development Center for Large Language Models](https://llmc.nii.ac.jp/) at the [National Institute of Informatics](https://www.nii.ac.jp/en/).

![image/jpg](tengentoppa2.jpg)

## Model Details

- **Base Model**: llm-jp-3-3.7b
- **Model Type**: Transformer-based Language Model
- **Parameters**: 3.7B
- **Context Length**: 4096
- **Language**: Japanese and English

### Additional Special Tokens

This model adds the following special tokens for structured conversations:

```
<|SYSTEM|>    - System message delimiter
<|USER|>      - User input delimiter
<|HINT|>      - Hint message delimiter
<|REASONING|> - Reasoning section delimiter
<|ASSISTANT|> - Assistant response delimiter
```

## Required Libraries and Their Versions

- torch>=2.3.0
- transformers>=4.40.1
- tokenizers>=0.19.1
- accelerate>=0.29.3
- flash-attn>=2.5.8

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B")
model = AutoModelForCausalLM.from_pretrained(
    "DeL-TaiseiOzaki/Tengentoppa-llm-jp-base-3.7B",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Example prompt using the special tokens
# (the user turn asks, in Japanese, "What is natural language processing?")
text = "<|SYSTEM|>You are a helpful assistant.\n<|USER|>自然言語処理とは何か"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]

print(tokenizer.decode(output))
```

## Base Model Information

### Model Architecture

|Params|Layers|Hidden size|Heads|Context length|Embedding parameters|Non-embedding parameters|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|3.7b|28|3072|24|4096|611,844,096|3,171,068,928|

### Tokenizer

The tokenizer is the original llm-jp-3-3.7b tokenizer, a Unigram byte-fallback model built with [huggingface/tokenizers](https://github.com/huggingface/tokenizers). The vocabulary is based on [`llm-jp-tokenizer v3.0`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v3.0b2), with our additional special tokens appended to it.

## License

This model inherits the license of the base model: [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Attribution

This model is based on llm-jp-3-3.7b. Please cite the original model and its creators when using this modified version.

## Modifications

The only modifications made to the original model are:

1. Addition of special tokens for structured conversations
2. Resizing of the token embeddings to accommodate the new special tokens (see the sketch below)

All other aspects of the model, including its training data, architecture, and capabilities, remain the same as the original llm-jp-3-3.7b model.
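For reference, the sketch below shows one way the two modifications above could be reproduced from the base checkpoint using the `transformers` API: the special tokens are registered on the tokenizer and the embedding matrix is resized to match the enlarged vocabulary. This is a minimal illustration under those assumptions, not the exact script used to produce this repository, and the output directory name is hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the original base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-3.7b")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-3-3.7b",
    torch_dtype=torch.bfloat16,
)

# Register the structured-conversation delimiters as additional special tokens
special_tokens = ["<|SYSTEM|>", "<|USER|>", "<|HINT|>", "<|REASONING|>", "<|ASSISTANT|>"]
num_added = tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# Grow the embedding matrix so every new vocabulary entry has a row
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Save the modified model and tokenizer (output path is illustrative)
tokenizer.save_pretrained("./Tengentoppa-llm-jp-base-3.7B")
model.save_pretrained("./Tengentoppa-llm-jp-base-3.7B")
```

The newly added embedding rows are freshly initialized, so the special tokens typically only become meaningful after further training on data that uses them.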