---
license: apache-2.0
---

# Model Card for Zamba

Zamba-7B-v1-phase1 is a hybrid model combining Mamba, a state-space model, with transformers. It uses a Mamba backbone with a shared transformer layer every 6 blocks. Zamba was trained using next-token prediction and uses the Mistral v0.1 tokenizer. We arrived at this architecture after a series of ablations at small scales.

Zamba-7B-v1-phase1 was pre-trained on 1T tokens of text and code data sourced from open web datasets. Unlike Zamba-v1, this model represents the checkpoint after pure pretraining on web datasets only. We envision its use primarily as a comparison tool to explore the effects of our annealing process.

Note: the current Hugging Face implementation of Zamba performs slower than our internal implementation. We are working with the Hugging Face team to fix this.

Our technical report describing the training of Zamba is available [here](https://arxiv.org/abs/2405.16712).

## Quick start

### Prerequisites

To download Zamba, clone Zyphra's fork of transformers:
1. `git clone https://github.com/Zyphra/transformers_zamba`
2. `cd transformers_zamba`
3. Install the repository: `pip install -e .`

In order to run optimized Mamba implementations on a CUDA device, you need to install `mamba-ssm` and `causal-conv1d`:
```bash
pip install mamba-ssm "causal-conv1d>=1.2.0"
```

You can run the model without the optimized Mamba kernels, but this is **not** recommended as it results in significantly higher latency. To run on CPU, specify `use_mamba_kernels=False` when loading the model with `AutoModelForCausalLM.from_pretrained`.

### Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba-7B-v1-phase1")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1-phase1", device_map="auto", torch_dtype=torch.bfloat16)

input_text = "What factors contributed to the fall of the Roman Empire?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```

To load a different checkpoint, e.g., iteration 2500:

```python
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba-7B-v1-phase1", device_map="auto", torch_dtype=torch.bfloat16, revision="iter2500")
```

The default revision is the fully trained phase 1 model, corresponding to iteration 462070. This is the number of iterations performed when training the model from random initialization. See [arXiv:2405.16712](https://arxiv.org/abs/2405.16712) for more details on training.

## Model Details

Zamba utilizes a unique hybrid SSM architecture. It consists of a backbone of Mamba layers interspersed with a single attention layer whose weights are shared across every occurrence, minimizing the parameter cost of attention. We find that concatenating the original token embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth.
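
The sketch below illustrates the block structure described above: a stack of Mamba layers with one shared-weight attention block applied every 6 layers, whose input is the current hidden state concatenated with the original embeddings. It is only a schematic reading of this model card, not the Zyphra implementation; the class names, dimensions, the `nn.MultiheadAttention` block, and the linear stand-in for the Mamba mixer are all illustrative assumptions (a real model would use Mamba blocks from `mamba-ssm`).

```python
# Minimal, schematic sketch of the hybrid architecture described in this card.
# NOT the actual Zamba implementation: names, sizes, and stub layers are illustrative.
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a real Mamba (state-space) block, e.g. from mamba-ssm."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stand-in for the SSM mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))


class SharedAttentionBlock(nn.Module):
    """One attention block whose weights are reused at every call site.

    Its input is the current hidden state concatenated with the original
    token embeddings, as described in the Model Details section.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.in_proj = nn.Linear(2 * d_model, d_model)  # fold the concat back to d_model
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(torch.cat([hidden, embeddings], dim=-1))
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return hidden + attn_out


class ZambaLikeBackbone(nn.Module):
    """Mamba backbone that invokes one shared attention block every `period` layers."""

    def __init__(self, n_layers: int = 24, d_model: int = 512, period: int = 6):
        super().__init__()
        self.period = period
        self.mamba_layers = nn.ModuleList([MambaBlockStub(d_model) for _ in range(n_layers)])
        self.shared_attn = SharedAttentionBlock(d_model)  # single set of weights, reused

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        hidden = embeddings
        for i, layer in enumerate(self.mamba_layers):
            if i % self.period == 0:
                # Same attention weights every time; the original embeddings are
                # concatenated to the current hidden state before attention.
                hidden = self.shared_attn(hidden, embeddings)
            hidden = layer(hidden)
        return hidden


if __name__ == "__main__":
    model = ZambaLikeBackbone()
    dummy_embeddings = torch.randn(2, 16, 512)  # (batch, seq_len, d_model)
    print(model(dummy_embeddings).shape)  # torch.Size([2, 16, 512])
```

Because the attention weights are shared, the attention layer adds roughly the parameters of a single transformer block regardless of how many times it is applied, which is the parameter saving the card refers to.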