wasm-32B-Instruct-V1
wasm-32B-Instruct-V1 is a state-of-the-art instruction-tuned large language model developed by wasmdashai. With 32 billion parameters, it is designed to deliver high-quality performance across a wide range of natural language processing and code-related tasks.
Introduction
wasm-32B-Instruct-V1 is built for instruction-following tasks and general-purpose reasoning. It uses a transformer architecture optimized for large-scale generation tasks, including:
- Code generation and debugging
- Long-context understanding
- Multi-turn dialogue and reasoning
- Privacy-conscious edge deployments (e.g., via WebAssembly)
This model is fine-tuned on diverse instruction datasets and optimized for both human alignment and computational efficiency.
Model Details
Type: Causal Language Model (Decoder-only)
Parameters: 32 Billion
Training: Pretraining + Instruction Fine-tuning
Architecture: Transformer with:
- Rotary Position Embeddings (RoPE)
- SwiGLU activation
- RMSNorm
- Attention with QKV bias
Context Length: Up to 32,768 tokens
Extended Context Option: via rope_scaling (supports up to 128K tokens with YaRN)
Format: Hugging Face Transformers-compatible
Requirements
To use this model, install the latest version of Hugging Face Transformers (>= 4.37.0 recommended):
pip install --upgrade transformers
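The quickstart below loads the model with device_map="auto", which relies on the accelerate package; if it is not already installed:
pip install accelerate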
Quickstart
Here is a minimal example to load the model and generate a response:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "wasmdashai/wasm-32B-Instruct-V1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
prompt = "Explain the concept of recursion with Python code."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
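Because this is an instruct-tuned checkpoint, prompts are typically wrapped in a chat format before generation. The sketch below assumes the tokenizer ships a chat template (not confirmed by this card) and decodes only the newly generated tokens:
# Build a chat-formatted prompt (assumes the tokenizer defines a chat template)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the concept of recursion with Python code."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Skip the prompt tokens so only the model's reply is printed
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)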
Processing Long Texts
This model supports context lengths up to 32,768 tokens. For even longer inputs, you can enable YaRN scaling by modifying the model's config.json as follows:
{
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
This is ideal for handling documents, logs, or multi-step reasoning tasks that exceed standard limits.
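If you prefer not to edit config.json by hand, the same override can usually be applied at load time by patching the loaded config. A minimal sketch, assuming the checkpoint's config exposes the rope_scaling field shown above:
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "wasmdashai/wasm-32B-Instruct-V1"

# Load the config and apply the same YaRN settings shown above
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# Reload the model with the patched config
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)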
Deployment Notes
We recommend using vLLM for efficient deployment, especially with large input lengths or real-time serving needs. Please note:
- vLLM currently supports static YaRN only.
- Avoid applying rope scaling unless necessary for long-context tasks, as it may impact performance on short inputs.
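As a rough illustration, the model could be queried through vLLM's offline Python API; this is a sketch and assumes your vLLM version supports this model's architecture:
from vllm import LLM, SamplingParams

# Load the model with vLLM (assumes the architecture is supported by your vLLM version)
llm = LLM(model="wasmdashai/wasm-32B-Instruct-V1")

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain the concept of recursion with Python code."], sampling_params)
print(outputs[0].outputs[0].text)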
Contact
For support, feedback, or collaboration inquiries, please contact:
π§ [email protected]