Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

Model Details

Model Description

SwarmFormer-Base consists of:

  • Token embedding layer with heavy dropout (0.4)

  • Multiple SwarmFormer layers

  • Mean pooling layer

  • Final classification layer

  • Comprehensive dropout throughout (0.3-0.4)

  • Developed by: Jordan Legg, Mikus Sturmanis, Takara.ai

  • Funded by: Takara.ai

  • Shared by: Takara.ai

  • Model type: Hierarchical transformer

  • Language(s): English

  • License: Not specified

  • Finetuned from model: None (trained from scratch)

Model Sources

  • Paper: https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf

Uses

Direct Use

  • Text classification
  • Sentiment analysis
  • Document processing
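
For these use cases, the checkpoint can be loaded directly from the Hub. The following is a minimal sketch that assumes the weights ship as a Safetensors file named model.safetensors and that a SwarmFormerModel class (sketched under Technical Specifications below) is defined locally; the filename and the model class are assumptions, not a documented loading API.

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the checkpoint; "model.safetensors" is an assumed filename.
weights_path = hf_hub_download(
    repo_id="takara-ai/SwarmFormer-Sentiment-Base",
    filename="model.safetensors",
)
state_dict = load_file(weights_path)

# The architecture is custom, so the model class must be defined locally
# (a sketch appears under Technical Specifications); vocab size is a placeholder.
model = SwarmFormerModel(vocab_size=30522)
model.load_state_dict(state_dict)
model.eval()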

Downstream Use

  • Feature extraction for NLP tasks
  • Transfer learning
  • Building block for larger systems

Out-of-Scope Use

  • Text generation
  • Machine translation
  • Tasks requiring sequences longer than 768 tokens
  • Real-time processing without adequate hardware

Bias, Risks, and Limitations

  • Fixed cluster size (4 tokens)
  • Maximum sequence length: 768 tokens
  • Potential information loss in clustering
  • Limited evaluation (English text classification only)

Training Details

Training Data

  • Dataset: IMDB Movie Review (50k samples)
  • Augmentation techniques:
    • Sentence-level shuffling
    • Controlled synonym replacement
    • Hierarchical sample creation
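
As an illustration of the simplest of these augmentations, the sketch below implements sentence-level shuffling; the function name and the naive period-based splitting are assumptions, and the synonym-replacement and hierarchical-sample steps are omitted.

import random

def shuffle_sentences(review, seed=None):
    # Naive split on '.'; a real pipeline would use a proper sentence tokenizer.
    sentences = [s.strip() for s in review.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."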

Training Procedure

Model Architecture Details

  1. Token Embedding Layer:

    - Embedding layer (vocab_size β†’ d_model)
    - Dropout rate: 0.4
    
  2. Local Swarm Aggregator (components 2-5 are sketched in code after this list):

    - Input processing dropout: 0.3
    - Local aggregation MLP:
      - Linear(d_model β†’ d_model)
      - GELU activation
      - Dropout(0.3)
      - Linear(d_model β†’ d_model)
    - Gate network:
      - Linear(2*d_model β†’ d_model)
      - GELU activation
      - Linear(d_model β†’ d_model)
      - Sigmoid activation
    - Output dropout: 0.3
    
  3. Clustering Mechanism:

    - Groups tokens into fixed-size clusters (size=4)
    - Computes mean representation per cluster

  4. Global Cluster Attention:

    - Query/Key/Value projections: Linear(d_model β†’ d_model)
    - Scaled dot-product attention
    - Attention dropout: 0.3
    - Output dropout: 0.3
    
  5. Broadcast Updater:

    - Linear projection: d_model β†’ d_model
    - Dropout: 0.1
    - Gate network:
      - Linear(2*d_model β†’ d_model)
      - GELU activation
      - Linear(d_model β†’ d_model)
      - Sigmoid activation
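
The following is a minimal PyTorch sketch of components 2-5 as described above. The class names, the gated residual combination in the local aggregator, and the assumption that the sequence length is a multiple of the cluster size are illustrative choices, not the reference implementation.

import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    # Component 2: gated local update applied to every token position.
    def __init__(self, d_model):
        super().__init__()
        self.input_dropout = nn.Dropout(0.3)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(0.3)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        h = self.mlp(self.input_dropout(x))
        g = self.gate(torch.cat([x, h], dim=-1))
        # Gated interpolation between current and updated states (combination rule is an assumption).
        return self.output_dropout(x + g * (h - x))

class GlobalClusterAttention(nn.Module):
    # Components 3-4: mean-pool fixed-size clusters, then scaled dot-product attention between clusters.
    def __init__(self, d_model, cluster_size=4):
        super().__init__()
        self.cluster_size = cluster_size
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(0.3)
        self.out_dropout = nn.Dropout(0.3)

    def forward(self, x):  # x: (batch, seq_len, d_model); seq_len assumed divisible by cluster_size
        b, n, d = x.shape
        clusters = x.view(b, n // self.cluster_size, self.cluster_size, d).mean(dim=2)
        q, k, v = self.q(clusters), self.k(clusters), self.v(clusters)
        attn = self.attn_dropout(torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1))
        return self.out_dropout(attn @ v)  # (batch, n_clusters, d_model)

class BroadcastUpdater(nn.Module):
    # Component 5: push each cluster summary back to its member tokens through a gate.
    def __init__(self, d_model, cluster_size=4):
        super().__init__()
        self.cluster_size = cluster_size
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(0.1)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, x, cluster_out):  # x: (b, seq, d); cluster_out: (b, seq // cluster_size, d)
        update = self.dropout(self.proj(cluster_out.repeat_interleave(self.cluster_size, dim=1)))
        g = self.gate(torch.cat([x, update], dim=-1))
        return x + g * update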
    

Training Hyperparameters

  • Embedding dimension: 192
  • Number of layers: 2
  • Local update steps (T_local): 3
  • Cluster size: 4
  • Batch size: 48
  • Learning rate: 4.74 Γ— 10⁻⁴
  • Weight decay: 0.0381
  • Dropout rates:
    • Embedding: 0.4
    • Local aggregation: 0.3
    • Attention: 0.3
    • Final: 0.4
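
As a minimal sketch, these values map onto a training setup like the one below; the optimizer choice (AdamW) is an assumption, since the card does not name one, and the model construction reuses the illustrative SwarmFormerModel class sketched under Technical Specifications.

import torch

# Vocab size is a placeholder; d_model=192 and num_layers=2 are the class defaults.
model = SwarmFormerModel(vocab_size=30522)
optimizer = torch.optim.AdamW(model.parameters(), lr=4.74e-4, weight_decay=0.0381)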

Evaluation

Testing Data, Factors & Metrics

  • IMDB test split (25k samples)
  • Full FP32 inference
  • Batch size: 256
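
A minimal sketch of this protocol is shown below, assuming a PyTorch DataLoader named test_loader that yields (input_ids, labels) batches of size 256 and using scikit-learn for the reported metrics; the helper name and data pipeline are assumptions.

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

@torch.no_grad()
def evaluate(model, test_loader, device="cuda"):
    model.eval().to(device)
    preds, labels = [], []
    for input_ids, y in test_loader:  # FP32 inference, batches of 256
        logits = model(input_ids.to(device))
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
        labels.extend(y.tolist())
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
    }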

Results

  • Accuracy: 89.03%
  • Precision: 87.22%
  • Recall: 91.46%
  • F1: 89.29%
  • Mean batch latency: 4.83ms
  • Peak memory: 9.13GB

Technical Specifications

Model Architecture and Objective

Complete architecture flow:

  1. Input β†’ Token Embedding (with dropout)
  2. For each layer:
    • Multiple iterations of Local Swarm Updates
    • Cluster Formation
    • Global Attention between clusters
    • Broadcast updates back to tokens
  3. Mean pooling across sequence
  4. Final dropout and classification
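
A minimal sketch of this flow, reusing the component modules sketched under Model Architecture Details; the class names and defaults mirror the reported hyperparameters but are illustrative, not the reference implementation.

import torch.nn as nn

class SwarmFormerLayer(nn.Module):
    # One layer: T_local local swarm updates, then cluster attention and a broadcast back to tokens.
    def __init__(self, d_model, cluster_size=4, t_local=3):
        super().__init__()
        self.t_local = t_local
        self.local = LocalSwarmAggregator(d_model)
        self.global_attn = GlobalClusterAttention(d_model, cluster_size)
        self.broadcast = BroadcastUpdater(d_model, cluster_size)

    def forward(self, x):
        for _ in range(self.t_local):          # repeated local swarm updates
            x = self.local(x)
        cluster_out = self.global_attn(x)      # cluster formation + global attention
        return self.broadcast(x, cluster_out)  # broadcast updates back to tokens

class SwarmFormerModel(nn.Module):
    def __init__(self, vocab_size, d_model=192, num_layers=2, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_dropout = nn.Dropout(0.4)
        self.layers = nn.ModuleList([SwarmFormerLayer(d_model) for _ in range(num_layers)])
        self.final_dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids):  # input_ids: (batch, seq_len), seq_len <= 768
        x = self.embed_dropout(self.embed(input_ids))
        for layer in self.layers:
            x = layer(x)
        pooled = x.mean(dim=1)     # mean pooling across the sequence
        return self.classifier(self.final_dropout(pooled))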

Compute Infrastructure

  • GPU: NVIDIA RTX 2080 Ti or equivalent
  • VRAM: 10GB+ recommended
  • Framework: PyTorch

Software Requirements

# Core framework dependency (see Compute Infrastructure above): PyTorch
import torch
import torch.nn as nn

Citation

@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}

Model Card Authors

Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

Model Card Contact

[email protected]

Model Size

  • Parameters: 6.75M
  • Tensor type: F32 (Safetensors)