Request for Guidance on Fine-Tuning Falcon3-Mamba-7B-Instruct + Technical Questions

by DeltaWhiplash

Hello Falcon3-Mamba team,

First, congratulations on this groundbreaking work with the Mamba architecture - truly inspiring to see Transformer alternatives pushing LLM boundaries! 🚀

As a student exploring SSM-based models, I'd appreciate your insights for my fine-tuning project (budget ~$200). Could you help address these technical questions?

Core Technical Queries

  1. Fine-Tuning Infrastructure

    • What's the minimum VRAM required for full-parameter vs. LoRA/QLoRA fine-tuning? (I've sketched my planned QLoRA setup just after this list.)
    • Do you recommend gradient checkpointing or activation recomputation for Mamba backprop?
    • Have you tested 8-bit/4-bit AdamW optimizers with this architecture?
  2. Architecture-Specific Training

    • How do optimal learning rates (LR) for Mamba compare to Transformer-based Falcon variants?
    • What's your recommended LR scheduler (linear vs cosine) and warmup ratio?
    • Any sequence length limitations during fine-tuning vs pretraining?
    • Were there any lessons from training this model that would carry over to fine-tuning?
  3. Knowledge Distillation Challenges

    • How critical is architectural alignment when distilling the Transformer-based Qwen2.5-72B into an SSM?
    • Would you recommend freezing specific layers (e.g., SSM blocks) during distillation?
    • Any successful prior attempts at cross-architecture distillation?
  4. DPO Implementation

    • Have you tested synthetic preference data from judge models (e.g., DeepSeek-R1-70B)? (I sketch how I'd build such a dataset after this list.)
    • Does Mamba's recurrent nature impact pairwise comparison during DPO?
    • Preferred reward normalization techniques for SSM models?
  5. Dataset Optimization

    • Maximum recommended batch size for 24GB VRAM (Colab Pro) with 2k token sequences?
    • Any tokenization mismatches observed with multilingual datasets like Aya?
    • Experience with curriculum learning for conversational fine-tuning?
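
To make question 1 concrete, here is a minimal sketch of the QLoRA setup I'm planning on a single 24 GB GPU, assuming the usual transformers + peft + bitsandbytes stack works with the FalconMamba architecture. The target_modules names are my guesses for the SSM projections, not confirmed module names from this model:

```python
# Hypothetical QLoRA setup for a single 24 GB GPU (Colab Pro).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/Falcon3-Mamba-7B-Instruct"

# 4-bit NF4 quantization so the 7B base model fits alongside optimizer state
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Gradient checkpointing (question 1) plus standard k-bit training preparation
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# LoRA on the SSM projections -- these target names are assumptions on my part,
# not confirmed module names from the Falcon3-Mamba implementation
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

If you could confirm whether these target modules (or different ones) make sense for LoRA on the Mamba blocks, that alone would answer most of question 1.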
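
For question 4, this is roughly how I'd assemble synthetic preference data from judge scores before handing it to a DPO trainer. judge_score is a placeholder for whatever judge model I end up using, and the column names just follow the common prompt/chosen/rejected convention:

```python
# Sketch: build a prompt/chosen/rejected preference dataset from judge scores.
from datasets import Dataset

def judge_score(prompt: str, response: str) -> float:
    """Placeholder for a call to an external judge model returning a scalar score."""
    raise NotImplementedError

def build_preference_rows(prompts, candidate_pairs):
    """candidate_pairs[i] is a (response_a, response_b) tuple for prompts[i]."""
    rows = []
    for prompt, (resp_a, resp_b) in zip(prompts, candidate_pairs):
        score_a = judge_score(prompt, resp_a)
        score_b = judge_score(prompt, resp_b)
        chosen, rejected = (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)
        rows.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return rows

# With real data: preference_dataset = Dataset.from_list(build_preference_rows(prompts, pairs))
```

My worry is whether judge scores alone give clean enough preference pairs, which is why I asked about reward normalization above.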

Project-Specific Questions

Approach A (Distillation + SFT + DPO):

  • Would layer-wise distillation (e.g., attention outputs → SSM states) be feasible?
  • How should dimensional mismatches between teacher and student hidden sizes be handled (e.g., with learned projection layers)? A sketch of what I have in mind follows below.
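
To show what I mean by a projection, here is the kind of learned mapping from teacher hidden states to the student's hidden size I have in mind, with an MSE matching loss. The dimensions are made up for illustration, not the real Qwen2.5-72B or Falcon3-Mamba sizes:

```python
# Sketch: project Transformer teacher hidden states into the SSM student's hidden size
# and match them with an MSE loss. Hidden sizes below are illustrative only.
import torch
import torch.nn as nn

teacher_hidden, student_hidden = 8192, 4096  # made-up dimensions

class HiddenStateProjector(nn.Module):
    def __init__(self, d_teacher: int, d_student: int):
        super().__init__()
        self.proj = nn.Linear(d_teacher, d_student)

    def forward(self, teacher_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, d_teacher) -> (batch, seq_len, d_student)
        return self.proj(teacher_states)

projector = HiddenStateProjector(teacher_hidden, student_hidden)
mse = nn.MSELoss()

# Random tensors standing in for real teacher/student hidden states
teacher_states = torch.randn(2, 128, teacher_hidden)
student_states = torch.randn(2, 128, student_hidden)
distill_loss = mse(projector(teacher_states), student_states)
```

My open question is whether matching per-token hidden states is even meaningful when the student carries context in a recurrent SSM state rather than in attention.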

Approach B (Conversational Focus):

  • Optimal context window configuration for multi-turn dialogues? (My current formatting approach is sketched below.)
  • Recommended techniques for maintaining Mamba's throughput advantage during long chats?
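
For context on the multi-turn questions, this is how I currently plan to format dialogues, simply relying on the tokenizer's built-in chat template (assuming the instruct checkpoint ships one):

```python
# Sketch: render a multi-turn conversation with the model's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/Falcon3-Mamba-7B-Instruct")

messages = [
    {"role": "user", "content": "Summarise the Mamba architecture in two sentences."},
    {"role": "assistant", "content": "Mamba is a selective state-space model ..."},
    {"role": "user", "content": "How does it differ from attention?"},
]

# Produce a single string for training or generation
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

Whether to truncate, summarise, or just keep appending turns as chats grow long is exactly where I'd value your guidance.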

Resource Requests

  • Could you share a Colab-compatible fine-tuning template (PEFT/LoRA preferred)?
  • Any known issues with Hugging Face Trainer vs custom training loops?
  • Recommended monitoring tools for SSM-specific metrics (hidden states evolution, memory throughput)?

Architecture Curiosity

  • How does Mamba handle positional information compared to RoPE in Transformers?
  • Any plans to release ablation studies on SSM parameter initialization?
  • Maximum effective context length observed in practice for instruction tasks?

Thank you for advancing open-source LLM innovation - your expertise would be invaluable for exploring Mamba's full potential!

Best regards,
