	---
	license: apache-2.0
	language:
	- en
	pipeline_tag: text-generation
	---
	## Introduction

	SmallThinker is a family of on-device native Mixture-of-Experts (MoE) language models specially designed for local deployment,
	co-developed by the IPADS and School of AI at Shanghai Jiao Tong University and Zenergize AI.
	Designed from the ground up for resource-constrained environments,
	SmallThinker brings powerful, private, and low-latency AI directly to your personal devices,
	without relying on the cloud.

	## Performance
	| Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
	|------------------------------|-------|--------------|----------|--------|-----------|-----------|---------|
	| SmallThinker-21BA3B-Instruct | 84.43 | <u>55.05</u> | 82.4 | 85.77 | 60.3 | <u>89.63</u> | 76.26 |
	| Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
	| Qwen3-14B | <u>84.82</u> | 50 | 84.6 | <u>85.21</u> | <u>59.5</u> | 88.41 | <u>75.42</u> |
	| Qwen3-30BA3B | 85.1 | 44.4 | <u>84.4</u> | 84.29 | 58.8 | 90.24 | 74.54 |
	| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
	| Phi-4-14B | 84.58 | 55.45 | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |

	For the MMLU evaluation, we use a 0-shot CoT setting.

	All models are evaluated in non-thinking mode.
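
The Average column is consistent with an unweighted mean of the six benchmark scores in each row. The following sketch recomputes it from the table values; the dictionary layout is an illustrative choice, not part of the original evaluation code.

```python
# Scores copied from the Performance table, in column order:
# MMLU, GPQA-diamond, MATH-500, IFEVAL, LIVEBENCH, HUMANEVAL.
scores = {
    "SmallThinker-21BA3B-Instruct": [84.43, 55.05, 82.4, 85.77, 60.3, 89.63],
    "Gemma3-12b-it": [78.52, 34.85, 82.4, 74.68, 44.5, 82.93],
    "Qwen3-14B": [84.82, 50, 84.6, 85.21, 59.5, 88.41],
    "Qwen3-30BA3B": [85.1, 44.4, 84.4, 84.29, 58.8, 90.24],
    "Qwen3-8B": [81.79, 38.89, 81.6, 83.92, 49.5, 85.9],
    "Phi-4-14B": [84.58, 55.45, 80.2, 63.22, 42.4, 87.2],
}

# Average column as reported in the table.
reported_average = {
    "SmallThinker-21BA3B-Instruct": 76.26,
    "Gemma3-12b-it": 66.31,
    "Qwen3-14B": 75.42,
    "Qwen3-30BA3B": 74.54,
    "Qwen3-8B": 70.26,
    "Phi-4-14B": 68.84,
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


computed_average = {model: mean(vals) for model, vals in scores.items()}

for model, avg in computed_average.items():
    print(f"{model}: computed {avg:.3f}, reported {reported_average[model]}")
```

The recomputed means agree with the reported column to within rounding.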

	## Speed
	| Model | Memory (GiB) | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry Pi 5 |
	|--------------------------------------|---------------------|----------|-----------|--------------|----------------|
	| SmallThinker 21B+sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
	| SmallThinker 21B+sparse+limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
	| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
	| Qwen3 30B A3B+limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
	| Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
	| Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |

	Note: The i9 14900 and 1+13 8ge4 runs use 4 threads; the other devices use the thread count that achieves the maximum speed. All models here are quantized to q4_0.
	You can deploy SmallThinker with offloading support using [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer/tree/main/smallthinker).
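
One way to read the limited-memory rows is as the fraction of unconstrained speed each model keeps under the 8G limit. The sketch below computes that ratio from the Speed table; the variable names are illustrative, and the Raspberry Pi 5 column is left out because no limited-memory figures are reported for it.

```python
# Devices for which both unconstrained and limited-memory figures are reported.
devices = ["i9 14900", "1+13 8ge4", "rk3588 (16G)"]

# Speed figures copied from the Speed table (same order as `devices`).
full_memory = {
    "SmallThinker 21B+sparse": [30.19, 23.03, 10.84],
    "Qwen3 30B A3B": [33.52, 20.18, 9.07],
}
limited_memory = {
    "SmallThinker 21B+sparse": [20.30, 15.50, 8.56],
    "Qwen3 30B A3B": [10.11, 0.18, 6.32],
}

# Fraction of the unconstrained speed that survives the 8G memory limit.
retained = {
    model: [lim / base for lim, base in zip(limited_memory[model], full_memory[model])]
    for model in full_memory
}

for model, ratios in retained.items():
    for device, ratio in zip(devices, ratios):
        print(f"{model} on {device}: keeps {ratio:.0%} of its unconstrained speed")
```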

	## Model Card

	<div align="center">

	| Architecture | Mixture-of-Experts (MoE) |
	|:---:|:---:|
	| Total Parameters | 21B |
	| Activated Parameters | 3B |
	| Number of Layers | 52 |
	| Attention Hidden Dimension | 2560 |
	| MoE Hidden Dimension (per Expert) | 768 |
	| Number of Attention Heads | 28 |
	| Number of KV Heads | 4 |
	| Number of Experts | 64 |
	| Selected Experts per Token | 6 |
	| Vocabulary Size | 151,936 |
	| Context Length | 16K |
	| Attention Mechanism | GQA |
	| Activation Function | ReGLU |
	</div>
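
As a rough sanity check, the Model Card values can be combined to estimate how many parameters sit in the expert feed-forward layers. The sketch below rests on three assumptions not stated in the table: every one of the 52 layers is an MoE layer, the 2560-wide hidden dimension is also the input and output width of each expert, and each ReGLU expert consists of three bias-free weight matrices (gate, up, and down projections). Treat the results as approximations.

```python
# Values taken from the Model Card table.
num_layers = 52
model_dim = 2560            # "Attention Hidden Dimension" (assumed to feed the experts)
expert_dim = 768            # "MoE Hidden Dimension (per Expert)"
num_experts = 64
experts_per_token = 6
num_attention_heads = 28
num_kv_heads = 4

# Assumption: a ReGLU expert has gate, up, and down projections, no biases.
params_per_expert = 3 * model_dim * expert_dim

# Assumption: all layers are MoE layers with the full set of experts.
total_expert_params = params_per_expert * num_experts * num_layers
active_expert_params = params_per_expert * experts_per_token * num_layers

# GQA: number of query heads that share each key/value head.
queries_per_kv_head = num_attention_heads // num_kv_heads

print(f"parameters per expert:        {params_per_expert:,}")
print(f"expert parameters (total):    {total_expert_params / 1e9:.2f}B")
print(f"expert parameters (per token): {active_expert_params / 1e9:.2f}B")
print(f"query heads per KV head:      {queries_per_kv_head}")
```

Under these assumptions the experts account for roughly 19.6B of the 21B total parameters and about 1.8B of the 3B activated per token, which is in line with the totals in the table; the remainder would sit in components such as attention, embeddings, and routing.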

	## How to Run

	### Transformers

	The latest version of `transformers` is recommended; `transformers>=4.53.3` is required.
	The following code snippet illustrates how to use the model to generate content based on given inputs.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	path = "PowerInfer/SmallThinker-21BA3B-Instruct"
	device = "cuda"

	# Load the tokenizer and the model weights in bfloat16 on the target device.
	tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True
	)

	messages = [
	    {"role": "user", "content": "Give me a short introduction to large language models."},
	]
	# Render the chat template and tokenize; return_dict=True yields input_ids and attention_mask.
	model_inputs = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
	).to(device)

	model_outputs = model.generate(
	    **model_inputs,
	    do_sample=True,
	    max_new_tokens=1024,
	)

	# generate() returns the prompt followed by the completion; keep only the new tokens.
	input_ids = model_inputs["input_ids"]
	output_token_ids = [
	    model_outputs[i][len(input_ids[i]):] for i in range(len(input_ids))
	]

	response = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
	print(response)
	```
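
For decoder-only models, `generate` returns each prompt followed by the newly generated tokens, which is why the snippet above slices off the prompt's token IDs before decoding. The following self-contained illustration uses plain Python lists with made-up token IDs in place of tensors; `strip_prompt` is a hypothetical helper name, not part of `transformers`.

```python
def strip_prompt(prompt_ids, output_ids):
    """Return only the tokens that follow each prompt in the generated sequences."""
    return [out[len(prompt):] for prompt, out in zip(prompt_ids, output_ids)]


# Made-up token IDs: one prompt of three tokens and its generated sequence,
# which repeats the prompt and then appends three new tokens.
prompt_ids = [[101, 7592, 2088]]
output_ids = [[101, 7592, 2088, 2023, 2003, 102]]

new_tokens = strip_prompt(prompt_ids, output_ids)
print(new_tokens)
```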

	### ModelScope

	`ModelScope` provides a Python API similar to (though not entirely identical to) that of `Transformers`. For basic usage, simply replace the first line of the code above with the following:

	```python
	from modelscope import AutoModelForCausalLM, AutoTokenizer
	```

	## Statement
	- Due to the constraints of its model size and the limitations of its training data, SmallThinker's responses may contain factual inaccuracies, biases, or outdated information.
	- Users bear full responsibility for independently evaluating and verifying the accuracy and appropriateness of all generated content.
	- SmallThinker does not possess genuine comprehension or consciousness and cannot express personal opinions or value judgments.