---
pipeline_tag: voice-activity-detection
license: bsd-2-clause
tags:
  - speech-processing
  - semantic-vad
  - multilingual
datasets:
  - pipecat-ai/chirp3_1
  - pipecat-ai/orpheus_midfiller_1
  - pipecat-ai/orpheus_grammar_1
  - pipecat-ai/orpheus_endfiller_1
  - pipecat-ai/human_convcollector_1
  - pipecat-ai/rime_2
  - pipecat-ai/human_5_all
language:
  - en
  - fr
  - de
  - es
  - pt
  - zh
  - ja
  - hi
  - it
  - ko
  - nl
  - pl
  - ru
  - tr
---

# Smart Turn v2

**Smart Turn v2** is an open‑source semantic Voice Activity Detection (VAD) model that tells you **_whether a speaker has finished their turn_** by analysing the raw waveform, not the transcript.  
Compared with v1 it is:

* **Multilingual** – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
* **6 × smaller** – ≈ 360 MB vs. 2.3 GB.
* **3 × faster** – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.

## Links

* [Blog post: Smart Turn v2](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/)
* [GitHub repo](https://github.com/pipecat-ai/smart-turn) with training and inference code


## Intended use & task

| Use‑case                                    | Why this model helps                                                    |
|---------------------------------------------|-------------------------------------------------------------------------|
| Voice agents / chatbots                     | Wait to reply until the user has **actually** finished speaking.        |
| Real‑time transcription + TTS               | Avoid “double‑talk” by triggering TTS only when the user turn ends.     |
| Call‑centre assist & analytics              | Accurate segmentation for diarisation and sentiment pipelines.          |
| Any project needing semantic VAD            | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD. |

The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
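
For example, a downstream voice agent can gate its reply on that probability. A minimal sketch of the decision logic (the helper name and example values are illustrative, not part of the model):

```python
END_OF_TURN_THRESHOLD = 0.5  # probability at or above which the turn is treated as finished

def user_finished_speaking(completion_prob: float,
                           threshold: float = END_OF_TURN_THRESHOLD) -> bool:
    """Return True when the model's output indicates the utterance is complete."""
    return completion_prob >= threshold

# e.g. only start the agent's TTS reply once this returns True
print(user_finished_speaking(0.87))  # True
print(user_finished_speaking(0.21))  # False
```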

## Model architecture

* Backbone: `wav2vec2` encoder
* Head: shallow linear classifier
* Params: 94.8 M (float32)
* Checkpoint: 360 MB Safetensors (compressed)

The `wav2vec2 + linear` configuration outperformed LSTM and deeper transformer variants during ablation studies.
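
A rough sketch of that configuration in PyTorch, assuming a standard Hugging Face `Wav2Vec2Model` backbone with mean pooling over time and a single-logit linear head (the released checkpoint's exact pooling and head layout may differ):

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnCompletionClassifier(nn.Module):
    """wav2vec2 encoder + shallow linear head producing one end-of-turn probability."""

    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_values).last_hidden_state  # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                            # average over time
        return torch.sigmoid(self.head(pooled)).squeeze(-1)    # completion probability
```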

## Training data

| Source                  | Type                          | Languages |
|-------------------------|-------------------------------|-----------|
| `human_5_all`           | Human‑recorded                | EN        |
| `human_convcollector_1` | Human‑recorded                | EN        |
| `rime_2`                | Synthetic (Rime)              | EN        |
| `orpheus_midfiller_1`   | Synthetic (Orpheus)           | EN        |
| `orpheus_grammar_1`     | Synthetic (Orpheus)           | EN        |
| `orpheus_endfiller_1`   | Synthetic (Orpheus)           | EN        |
| `chirp3_1`              | Synthetic (Google Chirp3 TTS) | 14 langs  |

* Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
* Filler‑word lists per language (e.g., “um”, “えーと”) were built with Claude & GPT‑o3 and injected near the ends of sentences to teach the model about interrupted speech (see the sketch below).

All audio/text pairs are released on the [pipecat‑ai/datasets](https://huggingface.co/pipecat-ai/datasets) hub.
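
To make the filler‑injection step concrete, here is a minimal sketch; the filler lists are small illustrative subsets and the insertion heuristic is an assumption, not the exact script used for the released datasets:

```python
import random

# Illustrative per-language filler subsets (the real lists are larger).
FILLERS = {"en": ["um", "uh", "you know"], "ja": ["えーと", "あの"]}

def inject_filler(sentence: str, lang: str = "en", seed: int = 0) -> str:
    """Insert a filler word shortly before the end of a sentence."""
    rng = random.Random(seed)
    words = sentence.rstrip(".").split()
    if len(words) < 3:
        return sentence
    position = len(words) - rng.randint(1, 2)  # near, but not at, the end
    words.insert(position, rng.choice(FILLERS[lang]))
    return " ".join(words) + "."

print(inject_filler("I think we should meet at four tomorrow"))
# e.g. "I think we should meet at um four tomorrow."
```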

## Evaluation & performance

### Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)  
| Lang | Acc % | Lang | Acc % |
|------|-------|------|-------|
| EN   | 94.3  | IT   | 94.4  |
| FR   | 95.5  | KO   | 95.5  |
| ES   | 92.1  | PT   | 95.5  |
| DE   | 95.8  | TR   | 96.8  |
| NL   | 96.7  | PL   | 94.6  |
| RU   | 93.0  | HI   | 91.2  |
| ZH   | 87.2  | –    |   –   |

*Human English benchmark (`human_5_all`): **99 %** accuracy.*

### Inference latency for 8 s audio

| Device                        | Time |
|-------------------------------|------|
| NVIDIA L40S                   | 12 ms |
| NVIDIA A100                   | 19 ms |
| NVIDIA T4 (AWS g4dn.xlarge)   | 75 ms |
| 16‑core x86\_64 CPU (Modal)   | 410 ms |

*Latency figures from the [Smart Turn v2 blog post](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/).*

## How to use – quick start

```python
from transformers import pipeline
import soundfile as sf

# Load the checkpoint as an audio-classification pipeline.
pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base"
)

# The model expects 16 kHz mono audio.
speech, sr = sf.read("user_utterance.wav", dtype="float32")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # mix stereo down to mono
if sr != 16_000:
    raise ValueError("Resample to 16 kHz")

# top_k=None returns scores for every label; [0] is the highest-scoring one.
result = pipe(speech, top_k=None)[0]
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → user has finished speaking
```
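
If your recording is not already at 16 kHz, you can resample it before calling the pipeline instead of raising an error; a minimal sketch using `librosa` (an extra dependency, not required by the model itself):

```python
import librosa
import soundfile as sf

speech, sr = sf.read("user_utterance.wav", dtype="float32")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # mono
if sr != 16_000:
    # Resample to the 16 kHz rate the wav2vec2 feature extractor expects.
    speech = librosa.resample(speech, orig_sr=sr, target_sr=16_000)
    sr = 16_000
```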