---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
---
## Introduction

SmallThinker is a family of **on-device native** Mixture-of-Experts (MoE) language models designed for local deployment,
co-developed by **IPADS** and the **School of AI at Shanghai Jiao Tong University**, together with **Zenergize AI**.
Built from the ground up for resource-constrained environments,
SmallThinker brings powerful, private, and low-latency AI directly to your personal devices,
without relying on the cloud.

## Performance
| Model                        | MMLU  | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
|------------------------------|-------|--------------|----------|--------|-----------|-----------|---------|
| **SmallThinker-21BA3B-Instruct** | 84.43 | <u>55.05</u> | 82.4     | **85.77** | **60.3**      | <u>89.63</u>     | **76.26**   |
| Gemma3-12b-it                | 78.52 | 34.85        | 82.4     | 74.68  | 44.5      | 82.93     | 66.31   |
| Qwen3-14B                    | <u>84.82</u> | 50 | **84.6** | <u>85.21</u>| <u>59.5</u> | 88.41     | <u>75.42</u>   |
| Qwen3-30BA3B                 | **85.1**  | 44.4     | <u>84.4</u> | 84.29  | 58.8      | **90.24**     | 74.54   |
| Qwen3-8B                     | 81.79 | 38.89        | 81.6     | 83.92  | 49.5      | 85.9      | 70.26   |
| Phi-4-14B                    | 84.58 | **55.45**    | 80.2     | 63.22  | 42.4      | 87.2      | 68.84   |

For the MMLU evaluation, we use a 0-shot CoT setting.

All models are evaluated in non-thinking mode.

## Speed
| Model                               | Memory(GiB)         | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry PI 5 |
|--------------------------------------|---------------------|----------|-----------|--------------|----------------|
| SmallThinker 21B+sparse              | 11.47               | 30.19    | 23.03     | 10.84        | 6.61           |
| SmallThinker 21B+sparse+limited memory | limit 8G         | 20.30    | 15.50     | 8.56         | -              |
| Qwen3 30B A3B                        | 16.20               | 33.52    | 20.18     | 9.07         | -              |
| Qwen3 30B A3B+limited memory          | limit 8G            | 10.11    | 0.18      | 6.32         | -              |
| Gemma 3n E2B                         | 1G, theoretically   | 36.88    | 27.06     | 12.50        | 6.66           |
| Gemma 3n E4B                         | 2G, theoretically   | 21.93    | 16.58     | 7.37         | 4.01           |

Note: The i9 14900 and 1+13 8ge4 runs use 4 threads; the other devices use the number of threads that achieves the maximum speed. All models here are quantized to q4_0.
You can deploy SmallThinker with offloading support using [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer/tree/main/smallthinker).
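
As a rough sanity check on the memory column, the sketch below estimates the size of a q4_0-quantized model from its parameter count. It relies on the ggml q4_0 layout (blocks of 32 weights stored as 16 bytes of 4-bit values plus one 2-byte fp16 scale, i.e. 4.5 bits per weight). The helper name `q4_0_size_gib` is hypothetical, and the estimate assumes that every weight is stored in q4_0 and that the nominal 21B figure is exact; neither is stated in this card, so a gap between this estimate and the measured 11.47 GiB is expected.

```python
# q4_0 block layout in ggml: 32 weights per block,
# 16 bytes of packed 4-bit values + 2 bytes for one fp16 scale.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 16 + 2

BITS_PER_WEIGHT = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS  # 4.5


def q4_0_size_gib(num_params: float) -> float:
    """Estimate q4_0 storage in GiB, assuming all weights are q4_0 (hypothetical helper)."""
    num_blocks = num_params / Q4_0_BLOCK_WEIGHTS
    return num_blocks * Q4_0_BLOCK_BYTES / 2**30


# Nominal total parameter count from the Model Card (assumed exact here).
estimate = q4_0_size_gib(21e9)
print(f"{BITS_PER_WEIGHT} bits/weight, ~{estimate:.2f} GiB for 21B parameters")
```

The estimate ignores tensors that may be kept at higher precision as well as runtime buffers such as the KV cache, so it should be read as a lower-bound style approximation rather than a measurement.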

## Model Card

<div align="center">

| **Architecture** | Mixture-of-Experts (MoE) |
|:---:|:---:|
| **Total Parameters** | 21B |
| **Activated Parameters** | 3B |
| **Number of Layers** | 52 |
| **Attention Hidden Dimension** | 2560 |
| **MoE Hidden Dimension** (per Expert) | 768 |
| **Number of Attention Heads** | 28 |
| **Number of KV Heads** | 4 |
| **Number of Experts** | 64 |
| **Selected Experts per Token** | 6 |
| **Vocabulary Size** | 151,936 |
| **Context Length** | 16K |
| **Attention Mechanism** | GQA |
| **Activation Function** | ReGLU |
</div>
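
The figures in the table can be related to the "21B total / 3B activated" split with some back-of-the-envelope arithmetic. The sketch below counts only the expert feed-forward weights. It assumes, which this card does not state, that each ReGLU expert consists of three bias-free matrices (gate, up, down) of shape 2560 x 768, taking the attention hidden dimension as the expert input/output width. Attention, embedding, router, and normalization weights are deliberately left out, so the results are expected to fall short of the headline totals.

```python
# Figures copied from the Model Card table.
NUM_LAYERS = 52
HIDDEN_DIM = 2560        # "Attention Hidden Dimension"
EXPERT_DIM = 768         # "MoE Hidden Dimension (per Expert)"
NUM_EXPERTS = 64
EXPERTS_PER_TOKEN = 6
NUM_ATTENTION_HEADS = 28
NUM_KV_HEADS = 4

# Assumption (not stated in the card): a gated ReGLU expert holds three
# HIDDEN_DIM x EXPERT_DIM matrices (gate, up, down) and no biases.
MATRICES_PER_EXPERT = 3

params_per_expert = MATRICES_PER_EXPERT * HIDDEN_DIM * EXPERT_DIM
total_expert_params = params_per_expert * NUM_EXPERTS * NUM_LAYERS
active_expert_params = params_per_expert * EXPERTS_PER_TOKEN * NUM_LAYERS

# Fraction of experts used per token, and GQA group size
# (query heads sharing one key/value head).
active_fraction = EXPERTS_PER_TOKEN / NUM_EXPERTS
queries_per_kv_head = NUM_ATTENTION_HEADS // NUM_KV_HEADS

print(f"expert weights, all experts:  {total_expert_params / 1e9:.2f}B")
print(f"expert weights, per token:    {active_expert_params / 1e9:.2f}B")
print(f"experts active per token:     {active_fraction:.1%}")
print(f"query heads per KV head:      {queries_per_kv_head}")
```

Under these assumptions the expert weights account for most of the total parameter count, while only about a tenth of them are touched for any given token, which is what makes the sparse model cheap to run relative to its size.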

## How to Run

### Transformers

`transformers>=4.53.3` is required; the latest version of `transformers` is recommended.
The following code snippet illustrates how to use the model to generate content based on given inputs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path = "PowerInfer/SmallThinker-21BA3B-Instruct"
device = "cuda"

# Load the tokenizer and the model weights in bfloat16 on the target device.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Build the prompt with the model's chat template.
messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

# Sample up to 1024 new tokens.
model_outputs = model.generate(
    model_inputs,
    do_sample=True,
    max_new_tokens=1024
)

# Drop the prompt tokens so that only the newly generated tokens remain.
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

# Decode the single response in the batch.
response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(response)
```
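
If the snippet above fails while importing or loading the model, one quick thing to rule out is an outdated `transformers` install. The following is a minimal sketch of such a check using only the standard library; `release_tuple` and `meets_minimum` are hypothetical helper names, and the comparison looks only at the leading numeric release components, so pre-release suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.53.3"  # minimum version stated in this model card


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '4.54.0.dev0' -> (4, 54, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, required: str = MIN_TRANSFORMERS) -> bool:
    """Compare release components only (simplified; suffixes are ignored)."""
    return release_tuple(installed) >= release_tuple(required)


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
else:
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    print(f"transformers {installed}: {status}")
```

For stricter handling of pre-release and development versions, a dedicated version parser would be a better fit than this simplified tuple comparison.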

### ModelScope

`ModelScope` provides a Python API similar (though not identical) to that of `Transformers`. For basic usage, simply change the first line of the code above as follows:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```

## Statement
- Due to the constraints of its model size and the limitations of its training data, SmallThinker's responses may contain factual inaccuracies, biases, or outdated information.
- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.