---
license: cc-by-nc-4.0
---
# Model Details

MobileLLM was introduced in the paper "[MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases](https://arxiv.org/abs/2402.14905)", published at ICML 2024.

**Model Developer**: Meta

**Model Architecture**: MobileLLM is an auto-regressive language model leveraging an optimized transformer architecture, specifically engineered for on-device applications with constrained resources.
MobileLLM integrates several key techniques: (1) the SwiGLU activation function, (2) deep and thin architectures, (3) embedding sharing, and (4) grouped-query attention. MobileLLM-125M/350M attains a remarkable 2.7%/4.3% accuracy boost over the preceding 125M/350M SoTA models on zero-shot commonsense reasoning tasks. In our updated version, we further demonstrate that our design philosophy scales effectively to larger models, with SoTA results for MobileLLM-600M/1B/1.5B.
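
To make technique (4) concrete, here is a minimal PyTorch sketch of grouped-query attention, where a small number of KV heads is shared across groups of query heads. This is an illustrative toy module, not MobileLLM's actual implementation; the dimensions in the example are taken from the MobileLLM-1B row of the table below.

```python
# Minimal grouped-query attention (GQA) sketch: n_kv_heads < n_heads, and each
# KV head is shared by n_heads // n_kv_heads query heads. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Expand KV heads so each group of query heads attends to its shared KV head.
        # The practical benefit of GQA is a smaller KV cache and fewer KV parameters.
        groups = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# Example with MobileLLM-1B's settings: 20 query heads sharing 5 KV heads, dim 1280.
attn = GroupedQueryAttention(dim=1280, n_heads=20, n_kv_heads=5)
y = attn(torch.randn(2, 16, 1280))  # (batch, seq, dim)
```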

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/660f893bae89429c07a32cdb/ahtsJXC5vBVIdmsMQDNHv.jpeg)

| | # Layers | # Attention Heads | # KV Heads | Token Dimension | Params | 
| --- | --- | --- | --- | --- | --- | 
| MobileLLM-125M |  30 | 9  | 3 | 576  | 124.6M |
| MobileLLM-350M |  32 | 15 | 5 | 960  | 345.3M |
| MobileLLM-600M |  40 | 18 | 6 | 1152 | 603.1M |
| MobileLLM-1B   |  54 | 20 | 5 | 1280 | 1.01B  |
| MobileLLM-1.5B |  54 | 25 | 5 | 1600 | 1.51B  |

| | Training Data | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MobileLLM-125M | Publicly available online data. | Text | Text | 2k | Yes | Yes | 1T tokens |
| MobileLLM-350M | Publicly available online data. | Text | Text | 2k | Yes | Yes | 1T tokens |
| MobileLLM-600M | Publicly available online data. | Text | Text | 2k | Yes | Yes | 1T tokens |
| MobileLLM-1B   | Publicly available online data. | Text | Text | 2k | Yes | Yes | 1T tokens |
| MobileLLM-1.5B | Publicly available online data. | Text | Text | 2k | Yes | Yes | 1T tokens |
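
To sanity-check these figures against a released checkpoint, you can load the remote config and print the relevant fields. The snippet below is a minimal sketch; it assumes the remote-code config exposes Llama-style field names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`), which may differ in the actual implementation.

```python
from transformers import AutoConfig

# Hypothetical sanity check of the table above; the field names are assumptions,
# not guaranteed by the MobileLLM remote code.
config = AutoConfig.from_pretrained("facebook/MobileLLM-1B", trust_remote_code=True)
for field in ("num_hidden_layers", "num_attention_heads", "num_key_value_heads", "hidden_size"):
    print(field, getattr(config, field, "n/a"))
```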


# How to use
We provide two ways to run the model:

[HuggingFace](#huggingface)

[MobileLLM codebase](#mobilellm-codebase)

## HuggingFace
To load the pretrained model for further finetuning or evaluation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-1B", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-1B", trust_remote_code=True)
```
Note that the default tokenizer does not contain the special tokens. You can add them, for example:
```python
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
    }
)
```
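
With the model and tokenizer loaded as above, you can run a quick generation sanity check. This is a minimal sketch, not an official recipe: the prompt and decoding settings are illustrative, and it assumes the remote-code model supports the standard `generate()` API.

```python
import torch

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

model.eval()
with torch.no_grad():
    # Greedy decoding of a short continuation; adjust max_new_tokens as needed.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```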
## MobileLLM codebase
We provide the pretraining code at https://github.com/facebookresearch/MobileLLM:

```bash
> git clone https://github.com/facebookresearch/MobileLLM
> cd MobileLLM
> pip install -r requirement.txt

# data pre-process and specify the data path in pretrain.sh
# run pretraining
> bash pretrain.sh 
```
We also provide an evaluation script for calculating the perplexity on the wikitext-2 test split:
```bash
> bash eval.sh
```
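
If you prefer to stay in Python, a rough perplexity check can also be done directly with the Hugging Face checkpoint. The sketch below is an approximation, not the repo's reference evaluation: it assumes the `wikitext-2-raw-v1` split from the `datasets` hub (the repo's preprocessing may differ) and that the remote-code model accepts `labels` like standard HF causal LMs.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/MobileLLM-1B", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("facebook/MobileLLM-1B", trust_remote_code=True)

data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids

max_len = 2048  # MobileLLM context length is 2k
nlls, n_tokens = [], 0
model.eval()
with torch.no_grad():
    for i in range(0, ids.size(1) - 1, max_len):
        chunk = ids[:, i : i + max_len]
        out = model(chunk, labels=chunk)
        n = chunk.size(1) - 1          # number of predicted tokens in this chunk
        nlls.append(out.loss * n)      # out.loss is the mean NLL over those tokens
        n_tokens += n
print("wikitext-2 ppl:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```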

You can find more details in the GitHub repo.

# Training cost 
Training MobileLLM on 1T tokens with 32 NVIDIA A100 80GB GPUs takes approximately the following number of days.
| 125M | 350M | 600M | 1B | 1.5B | 
| --- | --- | --- | --- | --- |
| ~3 days| ~6 days| ~8 days | ~12 days | ~18 days |
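
As a rough back-of-the-envelope check, the 125M figure corresponds to about 1T tokens / (32 GPUs × 3 days × 86,400 s/day) ≈ 120K tokens per second per GPU; per-GPU throughput drops as the model size grows. These estimates ignore restarts, warmup, and data-pipeline overhead.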


# Evaluation
We evaluate the pretrained MobileLLM models on zero-shot common sense reasoning tasks.
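
The model card does not specify which evaluation harness was used. As an illustration only, one common way to run this kind of zero-shot evaluation is EleutherAI's `lm-evaluation-harness`; the task names and flags below are assumptions that depend on the harness version, and Social IQa may require a separate task config.

```bash
pip install lm_eval
lm_eval --model hf \
    --model_args pretrained=facebook/MobileLLM-1B,trust_remote_code=True \
    --tasks arc_easy,arc_challenge,boolq,piqa,hellaswag,winogrande,openbookqa \
    --batch_size 8
```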

## MobileLLM-125M

| model | arc_easy | arc_challenge | boolq | piqa | siqa | hellaswag | obqa | winogrande | avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-125M | 41.3 | 25.2 | 57.5 | 62.0 | 41.9 | 31.1 | 31.2 | 50.8 | 42.6 |
| GPT-neo-125M | 40.7 | 24.8 | 61.3 | 62.5 | 41.9 | 29.7 | 31.6 | 50.7 | 42.9 |
| Pythia-160M | 40.0 | 25.3 | 59.5 | 62.0 | 41.5 | 29.9 | 31.2 | 50.9 | 42.5 |
| **MobileLLM-125M** | 43.9 | 27.1 | 60.2 | 65.3 | 42.4 | 38.9 | 39.5 | 53.1 | **46.3** |
| **MobileLLM-LS-125M** | 45.8 | 28.7 | 60.4 | 65.7 | 42.9 | 39.5 | 41.1 | 52.1 | **47.0** |

## MobileLLM-350M

| model | arc_easy | arc_challenge | boolq | piqa | siqa | hellaswag | obqa | winogrande | avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-350M | 41.9 | 25.7 | 54.0 | 64.8 | 42.6 | 36.2 | 33.3 | 52.4 | 43.9 |
| Pythia-410M | 47.1 | 30.3 | 55.3 | 67.2 | 43.1 | 40.1 | 36.2 | 53.4 | 46.6 |
| **MobileLLM-350M** | 53.8 | 33.5 | 62.4 | 68.6 | 44.7 | 49.6 | 40.0 | 57.6 | **51.3** |
| **MobileLLM-LS-350M** | 54.4 | 32.5 | 62.8 | 69.8 | 44.1 | 50.6 | 45.8 | 57.2 | **52.1** | 

## MobileLLM-600M

| model | arc_easy | arc_challenge | boolq | piqa | siqa | hellaswag | obqa | winogrande | avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen1.5-500M | 54.7 | 32.1 | 46.9 | 68.9 | 46.0 |  48.8 | 37.7 | 55.0 | 48.8 | 
| BLOOM-560M | 43.7 | 27.5 | 53.7 | 65.1 | 42.5 | 36.5 | 32.6 | 52.2 | 44.2 | 
| MobiLlama-800M | 52.0 | 31.7 | 54.6 | 73.0 |  43.3 | 52.3 | 42.5 | 56.3 | 50.7 | 
| **MobileLLM-600M** | 58.1 |  35.8 |  61.0 |  72.3 | 44.9 | 55.9 |  47.9 |  58.6 | **54.3** |  

## MobileLLM-1B

| model | arc_easy | arc_challenge | boolq | piqa | siqa | hellaswag | obqa | winogrande | avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pythia-1B | 49.9 | 30.4 | 58.7 | 69.2 | 43.3 | 47.4 | 38.6 | 52.2 | 48.7 | 
| MobiLlama-1B | 59.7 | 38.4 | 59.2 | 74.5 | 44.9 | 62.0 | 43.7 | 59.0 | 55.2 | 
| Falcon-1B | 59.5 | 38.4 | 63.9 | 74.6 |  44.6 | 62.9 |  45.6 | 60.9 | 56.3 | 
| BLOOM-1.1B | 47.6 | 27.3 | 58.6 | 67.0 | 42.4 | 42.2 | 36.6 | 53.8 | 46.9 | 
| TinyLlama-1.1B | 59.2 | 37.1 | 58.1 | 72.9 | 43.9 | 59.1 | 44.7 | 58.8 | 54.2 | 
| **MobileLLM-1B** | 63.0 |  39.0 |  66.7 |  74.4 | 45.0 |  61.4 | 46.8 | 62.3 | **57.3** |  

## MobileLLM-1.5B

| model | arc_easy | arc_challenge | boolq | piqa | siqa | hellaswag | obqa | winogrande | avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-neo-1.3B | 51.3 | 33.0 | 61.8 | 70.9 | 43.7 | 48.6 | 41.2 | 54.5 | 50.6 | 
| OPT-1.3B | 54.4 | 31.7 | 58.4 | 71.5 | 44.7 | 53.7 | 44.6 | 59.1 | 52.3 | 
| BLOOM-1.7B | 50.9 | 31.2 | 61.7 | 70.0 | 43.2 | 47.2 | 36.2 | 56.1 | 49.6 | 
| Qwen1.5-1.8B | 61.1 | 36.5 | 68.3 | 74.1 | 47.2 |  60.4 | 42.9 | 61.2 | 56.5 | 
| GPT-neo-2.7B | 55.8 | 34.3 | 62.4 | 72.9 | 43.6 | 55.6 | 40.0 | 57.9 | 52.8 | 
| OPT-2.7B | 56.6 | 34.6 | 61.8 | 74.5 | 45.6 | 60.2 | 48.2 | 59.6 | 55.1 | 
| Pythia-2.8B | 59.4 | 38.9 | 66.1 |  73.8 | 44.5 | 59.6 | 45.0 | 59.4 | 55.8 | 
| BLOOM-3B | 55.1 | 33.6 | 62.1 | 70.5 | 43.2 | 53.9 | 41.6 | 58.2 | 52.3 | 
| **MobileLLM-1.5B** | 67.5 |  40.9 |  65.7 | 74.8 |  46.4 | 64.5 | 50.5 | 64.7 | **59.4** | 

# Citation

If you find our code useful for your research, please consider citing:
    
    @article{liu2024mobilellm,
        title={MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases},
        author={Liu, Zechun and Zhao, Changsheng and Iandola, Forrest and Lai, Chen and Tian, Yuandong and Fedorov, Igor and Xiong, Yunyang and Chang, Ernie and Shi, Yangyang and Krishnamoorthi, Raghuraman and others},
        journal={arXiv preprint arXiv:2402.14905},
        year={2024}
    }
    
# License

MobileLLM is currently licensed under CC-BY-NC 4.0.