wasm-32B-Instruct-V1
wasm-32B-Instruct-V1 is a state-of-the-art instruction-tuned large language model developed by wasmdashai. With 32 billion parameters, it is designed to deliver high-quality performance across a wide range of natural language processing and code-related tasks.
Introduction
wasm-32B-Instruct-V1 is built for instruction-following tasks and general-purpose reasoning. It uses a transformer architecture optimized for large-scale generation tasks, including:
- Code generation and debugging
- Long-context understanding
- Multi-turn dialogue and reasoning
- Privacy-conscious edge deployments (e.g., via WebAssembly)
This model is fine-tuned on diverse instruction datasets and optimized for both human alignment and computational efficiency.
Model Details
Type: Causal Language Model (Decoder-only)
Parameters: 32 Billion
Training: Pretraining + Instruction Fine-tuning
Architecture: Transformer with:
- Rotary Position Embeddings (RoPE)
- SwiGLU activation
- RMSNorm
- Attention with QKV bias
Context Length: Up to 32,768 tokens
Extended Context Option: via rope_scaling (supports up to 128K tokens with YaRN)
Format: Hugging Face Transformers-compatible
Requirements
To use this model, install the latest version of Hugging Face Transformers (>= 4.37.0 recommended):
pip install --upgrade transformers
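The quickstart below loads the model with device_map="auto", which relies on the accelerate package; if it is not already installed:
pip install accelerate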
Quickstart
Here is a minimal example to load the model and generate a response:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "wasmdashai/wasm-32B-Instruct-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
prompt = "Explain the concept of recursion with Python code."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
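Because this is an instruct-tuned checkpoint, prompts are typically wrapped in a chat format before generation. The sketch below assumes the tokenizer ships a chat template (not confirmed by this card) and decodes only the newly generated tokens:
# Build a chat-formatted prompt (assumes the tokenizer defines a chat template)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the concept of recursion with Python code."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Skip the prompt tokens so only the model's reply is printed
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)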
Processing Long Texts
This model supports context lengths up to 32,768 tokens. For even longer inputs, you can enable YaRN scaling by modifying the model's config.json as follows:
{
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
This is ideal for handling documents, logs, or multi-step reasoning tasks that exceed standard limits.
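If you prefer not to edit config.json by hand, the same override can usually be applied at load time by patching the loaded config. A minimal sketch, assuming the checkpoint's config exposes the rope_scaling field shown above:
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "wasmdashai/wasm-32B-Instruct-V1"

# Load the config and apply the same YaRN settings shown above
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# Reload the model with the patched config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)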
Deployment Notes
We recommend using vLLM for efficient deployment, especially with large input lengths or real-time serving needs. Please note:
- vLLM currently supports static YaRN only.
- Avoid applying rope scaling unless necessary for long-context tasks, as it may impact performance on short inputs.
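As a rough illustration, the model could be queried through vLLM's offline Python API; this is a sketch and assumes your vLLM version supports this model's architecture:
from vllm import LLM, SamplingParams

# Load the model with vLLM (assumes the architecture is supported by your vLLM version)
llm = LLM(model="wasmdashai/wasm-32B-Instruct-V1")

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain the concept of recursion with Python code."], sampling_params)
print(outputs[0].outputs[0].text)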
Contact
For support, feedback, or collaboration inquiries, please contact:
π§ [email protected]