Request for Guidance on Fine-Tuning Falcon3-Mamba-7B-Instruct + Technical Questions

by DeltaWhiplash

Hello Falcon3-Mamba team,

First, congratulations on this groundbreaking work with the Mamba architecture - truly inspiring to see Transformer alternatives pushing LLM boundaries! 🚀

As a student exploring SSM-based models, I'd appreciate your insights for my fine-tuning project (budget ~$200). Could you help address these technical questions?

Core Technical Queries

  1. Fine-Tuning Infrastructure

    • What's the minimum VRAM required for full-parameter vs. LoRA/QLoRA fine-tuning? (I've sketched my planned QLoRA setup just after this list.)
    • Do you recommend gradient checkpointing or activation recomputation for Mamba backprop?
    • Have you tested 8-bit/4-bit AdamW optimizers with this architecture?
  2. Architecture-Specific Training

    • How do optimal learning rates (LR) for Mamba compare to Transformer-based Falcon variants?
    • What's your recommended LR scheduler (linear vs cosine) and warmup ratio?
    • Any sequence length limitations during fine-tuning vs pretraining?
    • Were there any lessons from training this model that would carry over to fine-tuning?
  3. Knowledge Distillation Challenges

    • How critical is architectural alignment when distilling the Transformer-based Qwen2.5-72B into an SSM?
    • Would you recommend freezing specific layers (e.g., SSM blocks) during distillation?
    • Any successful prior attempts at cross-architecture distillation?
  4. DPO Implementation

    • Have you tested synthetic preference data from judge models (e.g., DeepSeek-R1-70B)? (I sketch how I'd build such a dataset after this list.)
    • Does Mamba's recurrent nature impact pairwise comparison during DPO?
    • Preferred reward normalization techniques for SSM models?
  5. Dataset Optimization

    • Maximum recommended batch size for 24GB VRAM (Colab Pro) with 2k token sequences?
    • Any tokenization mismatches observed with multilingual datasets like Aya?
    • Experience with curriculum learning for conversational fine-tuning?
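
To make question 1 concrete, here is a minimal sketch of the QLoRA setup I'm planning on a single 24 GB GPU, assuming the usual transformers + peft + bitsandbytes stack works with the FalconMamba architecture. The target_modules names are my guesses for the SSM projections, not confirmed module names from this model:

```python
# Hypothetical QLoRA setup for a single 24 GB GPU (Colab Pro).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/Falcon3-Mamba-7B-Instruct"

# 4-bit NF4 quantization so the 7B base model fits alongside optimizer state
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Gradient checkpointing (question 1) plus standard k-bit training preparation
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA on the SSM projections -- these target names are assumptions on my part,
# not confirmed module names from the Falcon3-Mamba implementation
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

If you could confirm whether these target modules (or different ones) make sense for LoRA on the Mamba blocks, that alone would answer most of question 1.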
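
For question 4, this is roughly how I'd assemble synthetic preference data from judge scores before handing it to a DPO trainer. judge_score is a placeholder for whatever judge model I end up using, and the column names just follow the common prompt/chosen/rejected convention:

```python
# Sketch: build a prompt/chosen/rejected preference dataset from judge scores.
from datasets import Dataset

def judge_score(prompt: str, response: str) -> float:
    """Placeholder for a call to an external judge model returning a scalar score."""
    raise NotImplementedError

def build_preference_rows(prompts, candidate_pairs):
    """candidate_pairs[i] is a (response_a, response_b) tuple for prompts[i]."""
    rows = []
    for prompt, (resp_a, resp_b) in zip(prompts, candidate_pairs):
        score_a = judge_score(prompt, resp_a)
        score_b = judge_score(prompt, resp_b)
        chosen, rejected = (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

# With real data: preference_dataset = Dataset.from_list(build_preference_rows(prompts, pairs))
```

My worry is whether judge scores alone give clean enough preference pairs, which is why I asked about reward normalization above.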

Project-Specific Questions

Approach A (Distillation + SFT + DPO):

  • Would layer-wise distillation (e.g., attention outputs → SSM states) be feasible?
  • How should dimensional mismatches between teacher and student hidden sizes be handled (e.g., with learned projection layers)? A sketch of what I have in mind follows below.
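
To show what I mean by a projection, here is the kind of learned mapping from teacher hidden states to the student's hidden size I have in mind, with an MSE matching loss. The dimensions are made up for illustration, not the real Qwen2.5-72B or Falcon3-Mamba sizes:

```python
# Sketch: project Transformer teacher hidden states into the SSM student's hidden size
# and match them with an MSE loss. Hidden sizes below are illustrative only.
import torch
import torch.nn as nn

teacher_hidden, student_hidden = 8192, 4096  # made-up dimensions

class HiddenStateProjector(nn.Module):
    def __init__(self, d_teacher: int, d_student: int):
        super().__init__()
        self.proj = nn.Linear(d_teacher, d_student)

    def forward(self, teacher_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_teacher) -> (batch, seq_len, d_student)
        return self.proj(teacher_states)

projector = HiddenStateProjector(teacher_hidden, student_hidden)
mse = nn.MSELoss()

# Random tensors standing in for real teacher/student hidden states
teacher_states = torch.randn(2, 128, teacher_hidden)
student_states = torch.randn(2, 128, student_hidden)
distill_loss = mse(projector(teacher_states), student_states)
```

My open question is whether matching per-token hidden states is even meaningful when the student carries context in a recurrent SSM state rather than in attention.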

Approach B (Conversational Focus):

  • Optimal context window configuration for multi-turn dialogues? (My current formatting approach is sketched below.)
  • Recommended techniques for maintaining Mamba's throughput advantage during long chats?
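
For context on the multi-turn questions, this is how I currently plan to format dialogues, simply relying on the tokenizer's built-in chat template (assuming the instruct checkpoint ships one):

```python
# Sketch: render a multi-turn conversation with the model's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-Mamba-7B-Instruct")

messages = [
    {"role": "user", "content": "Summarise the Mamba architecture in two sentences."},
    {"role": "assistant", "content": "Mamba is a selective state-space model ..."},
    {"role": "user", "content": "How does it differ from attention?"},
]

# Produce a single string for training or generation
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

Whether to truncate, summarise, or just keep appending turns as chats grow long is exactly where I'd value your guidance.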

Resource Requests

  • Could you share a Colab-compatible fine-tuning template (PEFT/LoRA preferred)?
  • Any known issues with Hugging Face Trainer vs custom training loops?
  • Recommended monitoring tools for SSM-specific metrics (hidden states evolution, memory throughput)?

Architecture Curiosity

  • How does Mamba handle positional information compared to RoPE in Transformers?
  • Any plans to release ablation studies on SSM parameter initialization?
  • Maximum effective context length observed in practice for instruction tasks?

Thank you for advancing open-source LLM innovation - your expertise would be invaluable for exploring Mamba's full potential!

Best regards,
