---
license: mit
datasets:
- openslr/librispeech_asr
language:
- en
pipeline_tag: automatic-speech-recognition
---

# Splitformer

<div align="center" style="line-height: 1;">
  <a href="https://github.com/augustgw/early-exit-transformer" target="_blank" style="margin: 2px;">
    <img alt="GitHub" src="https://img.shields.io/badge/GitHub-Splitformer-181717?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://www.arxiv.org/abs/2506.18035" target="_blank" style="margin: 2px;">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2506.18035-B31B1B?logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>


## 1. Overview

**Splitformer** is a 36.7M-parameter Conformer-based ASR model trained from scratch on 1,000 hours of the **LibriSpeech dataset** with an **early-exit objective**.

The architecture introduces **parallel downsampling layers** before the first and last exits, which improve accuracy at those exits with minimal parameter overhead and little impact on inference speed.

The code for training and inference is available in our [GitHub](https://github.com/augustgw/early-exit-transformer) repository.
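
The early-exit idea can be illustrated with a small sketch: each exit produces a hypothesis and a confidence score, and decoding stops at the first exit whose confidence clears a threshold. This is an illustrative toy, not the actual Splitformer implementation (the names `early_exit_decode` and the threshold value are hypothetical; see the GitHub repository for the real code).

```python
def early_exit_decode(exits, threshold=0.9):
    """Return (layer, hypothesis) from the first sufficiently
    confident exit.

    exits: iterable of (hypothesis, confidence) pairs, ordered from
    the shallowest to the deepest exit layer.
    """
    last = None
    for layer, (hyp, conf) in enumerate(exits, start=1):
        last = (layer, hyp)
        if conf >= threshold:
            # Stop early: shallower exits mean cheaper inference.
            return layer, hyp
    # No exit was confident enough: fall back to the deepest one.
    return last

# Toy usage: confidence grows with depth, so decoding stops at layer 3.
hypotheses = [
    ("a cat", 0.55),
    ("the cat", 0.80),
    ("the cat sat", 0.93),
    ("the cat sat", 0.97),
]
layer, hyp = early_exit_decode(hypotheses, threshold=0.9)
print(layer, hyp)  # → 3 the cat sat
```

In practice the trade-off is controlled by the threshold: a lower threshold exits earlier (faster, higher WER), a higher one defers to deeper layers.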

## 2. Results on LibriSpeech

Word error rate (WER, %) at each exit layer on the LibriSpeech test sets:

<table>
  <thead>
    <tr>
      <th rowspan="2">Layer</th>
      <th colspan="2">EE-baseline (31.5M)</th>
      <th colspan="2">Splitformer (36.7M)</th>
      <th colspan="2">Wav2Vec2 (94.0M)</th>
      <th colspan="2">WavLM (94.7M)</th>
    </tr>
    <tr>
      <th>test-clean</th>
      <th>test-other</th>
      <th>test-clean</th>
      <th>test-other</th>
      <th>test-clean</th>
      <th>test-other</th>
      <th>test-clean</th>
      <th>test-other</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2</td>
      <td>31.0</td>
      <td>51.0</td>
      <td>28.1</td>
      <td>48.3</td>
      <td>33.7</td>
      <td>56.0</td>
      <td>28.0</td>
      <td>48.5</td>
    </tr>
    <tr>
      <td>4</td>
      <td>11.7</td>
      <td>27.8</td>
      <td>10.8</td>
      <td>26.4</td>
      <td>17.4</td>
      <td>36.7</td>
      <td>13.9</td>
      <td>27.3</td>
    </tr>
    <tr>
      <td>6</td>
      <td>7.1</td>
      <td>19.8</td>
      <td>6.7</td>
      <td>19.2</td>
      <td>9.6</td>
      <td>23.7</td>
      <td>8.7</td>
      <td>18.4</td>
    </tr>
    <tr>
      <td>8</td>
      <td>5.8</td>
      <td>16.6</td>
      <td>5.5</td>
      <td>16.3</td>
      <td>5.8</td>
      <td>15.9</td>
      <td>4.8</td>
      <td>12.4</td>
    </tr>
    <tr>
      <td>10</td>
      <td>5.3</td>
      <td>15.3</td>
      <td>5.1</td>
      <td>15.1</td>
      <td>4.5</td>
      <td>12.6</td>
      <td>4.0</td>
      <td>9.5</td>
    </tr>
    <tr>
      <td>12</td>
      <td>5.1</td>
      <td>14.8</td>
      <td>4.8</td>
      <td>14.7</td>
      <td>4.3</td>
      <td>12.2</td>
      <td>3.6</td>
      <td>8.8</td>
    </tr>
  </tbody>
</table>
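
From the table, the relative WER reduction of Splitformer over the EE-baseline on test-clean can be computed directly (a quick sketch; the dictionaries below just transcribe the table's test-clean columns):

```python
# test-clean WER (%) per exit layer, copied from the table above.
baseline = {2: 31.0, 4: 11.7, 6: 7.1, 8: 5.8, 10: 5.3, 12: 5.1}
splitformer = {2: 28.1, 4: 10.8, 6: 6.7, 8: 5.5, 10: 5.1, 12: 4.8}

for layer in baseline:
    rel = 100 * (baseline[layer] - splitformer[layer]) / baseline[layer]
    print(f"layer {layer:2d}: {rel:.1f}% relative WER reduction")
```

The gains are largest at the shallowest exit (layer 2), which is where the parallel downsampling layers are inserted.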

## 3. Citation

```bibtex
@misc{lasbordes2025splitformer,
      title={Splitformer: An improved early-exit architecture for automatic speech recognition on edge devices}, 
      author={Lasbordes, Maxence and Falavigna, Daniele and Brutti, Alessio},
      year={2025},
      note={Proc. of EUSIPCO 2025},
}
```