Request for Guidance on Fine-Tuning Falcon3-Mamba-7B-Instruct + Technical Questions
Hello Falcon3-Mamba team,
First, congratulations on this groundbreaking work with the Mamba architecture - it's truly inspiring to see Transformer alternatives pushing LLM boundaries!
As a student exploring SSM-based models, I'd appreciate your insights for my fine-tuning project (budget ~$200). Could you help address these technical questions?
Core Technical Queries
Fine-Tuning Infrastructure
- What's the minimum VRAM required for full-parameter vs LoRA/QLoRA fine-tuning?
- Do you recommend gradient checkpointing or activation recomputation for Mamba backprop?
- Have you tested 8-bit/4-bit AdamW optimizers with this architecture?
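To make these infrastructure questions concrete, this is the QLoRA starting point I had sketched out for a single 24GB card. The `target_modules` names are my guess at the SSM projection layers and may not match Falcon3-Mamba's actual module names, so please correct me if they differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/Falcon3-Mamba-7B-Instruct"

# 4-bit NF4 quantization so the 7B model fits on a 24GB card
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Enables gradient checkpointing among other k-bit preparations
# (one of the things I'm asking about for Mamba backprop)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed module names for the SSM projections -- please correct if wrong
    target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```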
Architecture-Specific Training
- How do optimal learning rates (LR) for Mamba compare to Transformer-based Falcon variants?
- What's your recommended LR scheduler (linear vs cosine) and warmup ratio?
- Any sequence length limitations during fine-tuning vs pretraining?
- Are there any lessons from training this model that would carry over to fine-tuning?
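For reference, these are the hyperparameters I would default to for a Transformer of similar size; I'm unsure how well they transfer to Mamba, which is what the questions above are getting at. All values are placeholders.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="falcon3-mamba-sft",    # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,                # typical LoRA LR for Transformers
    lr_scheduler_type="cosine",        # vs "linear" -- see question above
    warmup_ratio=0.03,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    optim="paged_adamw_8bit",          # the 8-bit optimizer I asked about
)
```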
Knowledge Distillation Challenges
- How critical is architectural alignment when distilling Transformer-based Qwen2.5-72B into SSM?
- Would you recommend freezing specific layers (e.g., SSM blocks) during distillation?
- Any successful prior attempts at cross-architecture distillation?
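The distillation objective I had in mind is a plain logit-level blend of soft-target KL and cross-entropy, since it doesn't depend on the teacher's internals. It assumes teacher and student share a vocabulary, which I realize likely doesn't hold for Qwen2.5 vs. Falcon tokenizers; that gap is part of what I'm asking about.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence with the usual cross-entropy on labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1.0 - alpha) * ce
```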
DPO Implementation
- Have you tested synthetic preference data from judge models (e.g., DeepSeek-R1-70B)?
- Does Mamba's recurrent nature impact pairwise comparison during DPO?
- Preferred reward normalization techniques for SSM models?
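Here is the rough trl setup I'm considering for the DPO stage, with preference pairs generated by a judge model as above. I know the `DPOTrainer` signature has changed across trl releases, so the keyword names are best-effort for a recent version; the dataset path is hypothetical, and `model`/`tokenizer` are reused from the SFT sketch.

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Hypothetical preference dataset with "prompt", "chosen", "rejected" columns.
train_dataset = load_dataset("json", data_files="preference_pairs.json", split="train")

dpo_args = DPOConfig(
    output_dir="falcon3-mamba-dpo",
    beta=0.1,                       # KL penalty strength
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,                    # the PEFT-wrapped model from the SFT stage
    ref_model=None,                 # with PEFT adapters, trl can derive the reference
    args=dpo_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # `tokenizer=` in older trl versions
)
trainer.train()
```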
Dataset Optimization
- Maximum recommended batch size for 24GB VRAM (Colab Pro) with 2k token sequences?
- Any tokenization mismatches observed with multilingual datasets like Aya?
- Experience with curriculum learning for conversational fine-tuning?
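This is the quick sanity check I'm running for tokenizer round-trip issues on Aya; the dataset id and column names are from memory and may need adjusting.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-Mamba-7B-Instruct")
sample = load_dataset("CohereForAI/aya_dataset", split="train[:200]")

lossy, lengths = 0, []
for row in sample:
    text = row["inputs"] + "\n" + row["targets"]   # column names from memory
    ids = tokenizer(text)["input_ids"]
    lengths.append(len(ids))
    # Flag rows where decoding the ids does not reproduce the original text
    if tokenizer.decode(ids, skip_special_tokens=True) != text:
        lossy += 1
        print(f"Lossy round-trip ({row['language']}): {len(ids)} tokens")

print(f"{lossy} lossy rows; max length {max(lengths)} tokens")
```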
Project-Specific Questions
Approach A (Distillation + SFT + DPO):
- Would layer-wise distillation (e.g., attention outputs → SSM states) be feasible?
- How to handle dimensional mismatches in projection layers during transfer?
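For the dimensional-mismatch question, the naive approach I'd sketch is a learned linear projection from the teacher's hidden states into the student's hidden size, with an MSE alignment term; the dimensions below are illustrative, not the real configs.

```python
import torch
import torch.nn as nn

teacher_dim, student_dim = 8192, 4096   # placeholder sizes, not the real configs

class HiddenStateProjector(nn.Module):
    """Project teacher layer outputs into the student's hidden space."""
    def __init__(self, teacher_dim: int, student_dim: int):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim, bias=False)

    def forward(self, teacher_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(teacher_hidden)

projector = HiddenStateProjector(teacher_dim, student_dim)

def layer_alignment_loss(student_hidden, teacher_hidden):
    # MSE between the student's SSM-block output and the projected teacher state
    return nn.functional.mse_loss(student_hidden, projector(teacher_hidden))
```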
Approach B (Conversational Focus):
- Optimal context window configuration for multi-turn dialogues?
- Recommended techniques for maintaining Mamba's throughput advantage during long chats?
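For multi-turn formatting I'm currently relying on the chat template shipped in the tokenizer config; if there's a better convention for this model, I'd love to know.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-Mamba-7B-Instruct")

conversation = [
    {"role": "user", "content": "What is a state space model?"},
    {"role": "assistant", "content": "A sequence model defined by a latent state recurrence."},
    {"role": "user", "content": "How does that differ from attention?"},
]

prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant turn header for generation
)
print(prompt)
```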
Resource Requests
- Could you share a Colab-compatible fine-tuning template (PEFT/LoRA preferred)?
- Any known issues with Hugging Face Trainer vs custom training loops?
- Recommended monitoring tools for SSM-specific metrics (hidden states evolution, memory throughput)?
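On SSM-specific monitoring, the lightest-weight thing I could come up with is forward hooks that log per-layer output norms (reusing the `model` and `tokenizer` objects from the first sketch). The "mixer" name filter is a guess at how the SSM blocks are named internally.

```python
import torch

hidden_norms = {}

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        hidden_norms[name] = out.detach().float().norm(dim=-1).mean().item()
    return hook

# "mixer" is my guess at the SSM block names; print(model) first and adjust.
handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
    if name.endswith("mixer")
]

with torch.no_grad():
    model(**tokenizer("Hello Mamba", return_tensors="pt").to(model.device))

print(hidden_norms)
for h in handles:
    h.remove()
```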
Architecture Curiosity
- How does Mamba handle positional information compared to RoPE in Transformers?
- Any plans to release ablation studies on SSM parameter initialization?
- Maximum effective context length observed in practice for instruction tasks?
Thank you for advancing open-source LLM innovation - your expertise would be invaluable for exploring Mamba's full potential!
Best regards,