license: apache-2.0
language:
- en
pipeline_tag: text-generation
Introduction
SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models designed for local deployment, co-developed by IPADS and the School of AI at Shanghai Jiao Tong University, together with Zenergize AI. Built from the ground up for resource-constrained environments, SmallThinker brings powerful, private, low-latency AI directly to your personal devices without relying on the cloud.
Performance
Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
---|---|---|---|---|---|---|---|
SmallThinker-21BA3B-Instruct | 84.43 | 55.05 | 82.4 | 85.77 | 60.3 | 89.63 | 76.26 |
Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
Qwen3-14B | 84.82 | 50 | 84.6 | 85.21 | 59.5 | 88.41 | 75.42 |
Qwen3-30BA3B | 85.1 | 44.4 | 84.4 | 84.29 | 58.8 | 90.24 | 74.54 |
Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |
For the MMLU evaluation, we use a 0-shot CoT setting.
All models are evaluated in non-thinking mode.
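The Average column appears to be the unweighted mean of the six benchmark scores. A minimal sketch that recomputes it for the SmallThinker row, with the scores copied from the table (the variable names are illustrative):

```python
# Scores for SmallThinker-21BA3B-Instruct, copied from the table above.
scores = {
    "MMLU": 84.43,
    "GPQA-diamond": 55.05,
    "MATH-500": 82.4,
    "IFEVAL": 85.77,
    "LIVEBENCH": 60.3,
    "HUMANEVAL": 89.63,
}

# Unweighted mean over the six benchmarks (assumed to be how Average is derived).
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")
```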
Speed
Model | Memory(GiB) | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry PI 5 |
---|---|---|---|---|---|
SmallThinker 21B+sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
SmallThinker 21B+sparse+limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
Qwen3 30B A3B+limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |
Note: i9 14900 and 1+13 8ge4 use 4 threads; the other devices use the number of threads that achieves the maximum speed. All models here have been quantized to q4_0. You can deploy SmallThinker with offloading support using PowerInfer.
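One way to read the table is to compare each model's speed under the 8G limit with its unrestricted speed. A minimal sketch using the i9 14900 column, with figures copied from the table (the retention helper is illustrative and not part of any SmallThinker tooling):

```python
# Speeds on the i9 14900 column, copied from the table above:
# (unrestricted memory, memory limited to 8G).
i9_speeds = {
    "SmallThinker 21B+sparse": (30.19, 20.30),
    "Qwen3 30B A3B": (33.52, 10.11),
}

def retention(full: float, limited: float) -> float:
    """Fraction of the unrestricted speed kept under the memory limit."""
    return limited / full

for model, (full, limited) in i9_speeds.items():
    print(f"{model}: {retention(full, limited):.1%} of unrestricted speed")
```

On these figures, SmallThinker keeps roughly two thirds of its unrestricted speed under the 8G limit, while Qwen3 30B A3B keeps roughly 30%.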
Model Card
Architecture | Mixture-of-Experts (MoE) |
---|---|
Total Parameters | 21B |
Activated Parameters | 3B |
Number of Layers | 52 |
Attention Hidden Dimension | 2560 |
MoE Hidden Dimension (per Expert) | 768 |
Number of Attention Heads | 28 |
Number of KV Heads | 4 |
Number of Experts | 64 |
Selected Experts per Token | 6 |
Vocabulary Size | 151,936 |
Context Length | 16K |
Attention Mechanism | GQA |
Activation Function | ReGLU |
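A few ratios follow directly from the figures in the table. The sketch below derives them; the ModelSpec class and its field names are illustrative and are not the model's actual configuration keys.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    """Figures copied from the model card table (field names are hypothetical)."""
    total_params_b: float = 21.0      # Total Parameters, in billions
    activated_params_b: float = 3.0   # Activated Parameters, in billions
    num_attention_heads: int = 28
    num_kv_heads: int = 4
    num_experts: int = 64
    experts_per_token: int = 6

    @property
    def query_heads_per_kv_head(self) -> int:
        # GQA: each KV head is shared by a group of query heads.
        return self.num_attention_heads // self.num_kv_heads

    @property
    def expert_fraction_per_token(self) -> float:
        # Share of the experts that is selected for each token.
        return self.experts_per_token / self.num_experts

    @property
    def activated_param_fraction(self) -> float:
        # Share of all parameters that is active for each token.
        return self.activated_params_b / self.total_params_b

spec = ModelSpec()
print(spec.query_heads_per_kv_head)
print(f"{spec.expert_fraction_per_token:.2%}")
print(f"{spec.activated_param_fraction:.2%}")
```

Each KV head serves 7 query heads, and about one seventh of the parameters are active per token.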
How to Run
Transformers
The latest version of transformers is recommended; transformers>=4.53.3 is required.
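The version requirement can be checked before loading the model. A minimal sketch using only the standard library; the helper names version_tuple and meets_minimum are illustrative, and the comparison looks only at the leading numeric components, so pre-release suffixes such as .dev0 are ignored:

```python
import re

MIN_TRANSFORMERS = "4.53.3"  # minimum version stated above

def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '4.54.0.dev0' -> (4, 54, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))

def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if `installed` is at least `minimum` (numeric components only)."""
    return version_tuple(installed) >= version_tuple(minimum)

if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version
    try:
        installed = version("transformers")
        print(installed, "OK" if meets_minimum(installed) else "too old")
    except PackageNotFoundError:
        print("transformers is not installed")
```

For full PEP 440 handling, including pre-releases, packaging.version.Version is the more robust choice if the packaging library is available.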
The following code snippet illustrates how to use the model to generate content based on given inputs.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-21BA3B-Instruct"
device = "cuda"

# Load the tokenizer and the model weights in bfloat16 on the target device.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
# Apply the chat template and tokenize the conversation into input IDs.
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024
)

# Keep only the newly generated tokens by slicing off the prompt.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
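The list comprehension in the snippet above slices the prompt tokens off each generated sequence, because the output of generate contains the prompt followed by the new tokens. A minimal sketch of that step on plain Python lists, with made-up token IDs (the strip_prompt helper is illustrative):

```python
def strip_prompt(outputs, prompts):
    """Drop the leading prompt tokens from each generated sequence.

    Mirrors model_outputs[i][len(model_inputs[i]):] from the snippet above.
    """
    return [out[len(prompt):] for out, prompt in zip(outputs, prompts)]

# Made-up token IDs: each output starts with its prompt, followed by new tokens.
prompts = [[101, 7, 8]]
outputs = [[101, 7, 8, 42, 43, 2]]

completions = strip_prompt(outputs, prompts)
print(completions)
```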
ModelScope
ModelScope adopts a Python API similar to (though not entirely identical to) Transformers. For basic usage, simply modify the first line of the above code as follows:
from modelscope import AutoModelForCausalLM, AutoTokenizer
Statement
- Due to the constraints of its model size and the limitations of its training data, SmallThinker's responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.