Stockmark-2-100B-Instruct-beta


Model description

Stockmark-2-100B-Instruct-beta is a 100-billion-parameter large language model built from scratch, with a particular focus on Japanese. It was pre-trained on approximately 1.5 trillion tokens of data, consisting of 60% English, 30% Japanese, and 10% code. Following pre-training, the model underwent post-training with synthetic Japanese data to enhance its ability to follow instructions. The synthetic data was generated using Qwen2.5-32B-Instruct.
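
As a purely hypothetical sketch of how such synthetic instruction data might be produced (the model name is from the card, but the seed instructions, prompts, and sampling settings below are illustrative assumptions, not the actual recipe), instruction-response pairs can be collected by prompting Qwen2.5-32B-Instruct:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical illustration only: not the actual data-generation recipe.
teacher_name = "Qwen/Qwen2.5-32B-Instruct"
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_name, device_map="auto", torch_dtype=torch.bfloat16
)

# Illustrative seed instructions (assumed, in Japanese).
seed_instructions = [
    "日本の四季について説明してください。",  # "Please explain Japan's four seasons."
    "機械学習とは何ですか?",  # "What is machine learning?"
]

synthetic_pairs = []
for instruction in seed_instructions:
    input_ids = teacher_tokenizer.apply_chat_template(
        [{"role": "user", "content": instruction}], add_generation_prompt=True, return_tensors="pt"
    ).to(teacher.device)
    with torch.inference_mode():
        out = teacher.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)
    # Keep only the newly generated reply as the synthetic response.
    response = teacher_tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)
    synthetic_pairs.append({"instruction": instruction, "response": response})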

As a beta release, Stockmark-2-100B-Instruct-beta is still undergoing improvement and evaluation. Feedback and insights from users will help refine future versions.

See our blog for details.

This project is supported by GENIAC.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model in bfloat16, sharded across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained("stockmark/Stockmark-2-100B-Instruct-beta")
model = AutoModelForCausalLM.from_pretrained(
    "stockmark/Stockmark-2-100B-Instruct-beta", device_map="auto", torch_dtype=torch.bfloat16
)

# Build a single-turn chat prompt. The instruction means "What is natural language processing?"
instruction = "自然言語処理とは?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}], add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a response with nucleus sampling.
with torch.inference_mode():
    tokens = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        repetition_penalty=1.05,
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
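
Note that the decoded output above includes the prompt as well as the reply, since the whole sequence is decoded. To print only the newly generated text, you can slice off the prompt tokens first (a small variation on the snippet above, not part of the original card):

# Decode only the tokens generated after the prompt.
reply = tokenizer.decode(tokens[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)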

License

MIT

Developed by

Stockmark Inc.

Author

Takahiro Omi

Model size

96B parameters (BF16 safetensors)