---
pipeline_tag: voice-activity-detection
license: bsd-2-clause
tags:
  - speech-processing
  - semantic-vad
  - multilingual
datasets:
  - pipecat-ai/chirp3_1
  - pipecat-ai/orpheus_midfiller_1
  - pipecat-ai/orpheus_grammar_1
  - pipecat-ai/orpheus_endfiller_1
  - pipecat-ai/human_convcollector_1
  - pipecat-ai/rime_2
  - pipecat-ai/human_5_all
language:
  - en
  - fr
  - de
  - es
  - pt
  - zh
  - ja
  - hi
  - it
  - ko
  - nl
  - pl
  - ru
  - tr
---

# Smart Turn v2

**Smart Turn v2** is an open‑source semantic Voice Activity Detection (VAD) model that tells you **_whether a speaker has finished their turn_** by analysing the raw waveform, not the transcript.  
Compared with v1 it is:

* **Multilingual** – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
* **6 × smaller** – ≈ 360 MB vs. 2.3 GB.
* **3 × faster** – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.

## Intended use & task

| Use‑case                                    | Why this model helps                                                    |
|---------------------------------------------|-------------------------------------------------------------------------|
| Voice agents / chatbots                     | Wait to reply until the user has **actually** finished speaking.        |
| Real‑time transcription + TTS               | Avoid “double‑talk” by triggering TTS only when the user turn ends.     |
| Call‑centre assist & analytics              | Accurate segmentation for diarisation and sentiment pipelines.          |
| Any project needing semantic VAD            | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD. |

The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
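For illustration, a common pattern is to end the user's turn only when a classic silence detector has fired *and* the semantic score clears the threshold. The helper below is a hypothetical sketch (names and threshold handling are application choices, not part of the model's API):

```python
def turn_is_over(silence_detected: bool, completion_prob: float, threshold: float = 0.5) -> bool:
    """Combine a classic energy/silence VAD signal with the semantic completion probability."""
    return silence_detected and completion_prob >= threshold

# The user paused, but the model thinks the thought is unfinished ("so, um ...").
print(turn_is_over(silence_detected=True, completion_prob=0.23))  # False -> keep listening
print(turn_is_over(silence_detected=True, completion_prob=0.91))  # True  -> agent may respond
```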

## Model architecture

* **Backbone**: `wav2vec2` encoder
* **Head**: shallow linear classifier
* **Parameters**: 94.8 M (float32)
* **Checkpoint**: 360 MB Safetensors (compressed)

The `wav2vec2 + linear` configuration outperformed LSTM and deeper transformer variants during ablation studies.
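For intuition, here is a minimal sketch of that configuration; the backbone checkpoint name, pooling strategy, and head shape are assumptions for illustration and may not match the released weights exactly:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnCompletionClassifier(nn.Module):
    """Sketch of a wav2vec2 encoder with a shallow linear head (illustrative only)."""

    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: (batch, samples) of 16 kHz mono audio
        frames = self.encoder(input_values).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                             # pool over time
        return torch.sigmoid(self.head(pooled)).squeeze(-1)     # P(turn is complete)
```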

## Training data

| Source | Type | Split | Languages |
|--------|------|-------|-----------|
| `human_5_all` | Human‑recorded | Train / Dev / Test | EN |
| `chirp3_1`    | Synthetic (Google Chirp3 TTS) | Train / Dev / Test | 14 langs |

* Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
* Filler‑word lists per language (e.g., “um”, “えーと”) were built with Claude & GPT‑o3 and injected near sentence ends so the model learns that a trailing filler signals an unfinished turn (a minimal augmentation sketch follows).
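A minimal sketch of that text-side augmentation, using hypothetical filler lists and injection logic (function and list names are illustrative, not the actual training code; the modified sentences are presumably synthesised with TTS afterwards):

```python
import random

# Hypothetical, abbreviated filler lists — the per-language lists used for
# training are larger and were generated with LLM assistance.
FILLERS = {
    "en": ["um", "uh", "you know"],
    "ja": ["えーと", "あの"],
}

def inject_trailing_filler(sentence: str, lang: str, rng: random.Random) -> str:
    """Place a filler at (or just before) the end of a sentence so the
    synthesised audio sounds like an unfinished turn."""
    words = sentence.split()
    filler = rng.choice(FILLERS[lang])
    pos = rng.choice([len(words) - 1, len(words)])  # near the sentence end
    return " ".join(words[:pos] + [filler] + words[pos:])

rng = random.Random(0)
print(inject_trailing_filler("I was thinking we could meet tomorrow afternoon", "en", rng))
```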

All audio/text pairs are released on the [pipecat‑ai/datasets](https://huggingface.co/pipecat-ai/datasets) hub.
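The released corpora should be loadable with the standard `datasets` library, for example (contents and split names may vary):

```python
from datasets import load_dataset

# Load one of the released corpora from the Hugging Face Hub.
ds = load_dataset("pipecat-ai/human_5_all")
print(ds)  # shows available splits and features
```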

## Evaluation & performance

### Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)  
| Lang | Acc % | Lang | Acc % |
|------|------|------|------|
| EN | 94.3 | IT | 94.4 |
| FR | 95.5 | KO | 95.5 |
| ES | 92.1 | PT | 95.5 |
| DE | 95.8 | TR | 96.8 |
| NL | 96.7 | PL | 94.6 |
| RU | 93.0 | HI | 91.2 |
| ZH | 87.2 | – | – |

*Human English benchmark (`human_5_all`): **99 %** accuracy.*

### Inference latency for 8 s audio

| Device                        | Time |
|-------------------------------|------|
| NVIDIA L40S                   | 12 ms |
| NVIDIA A100                   | 19 ms |
| NVIDIA T4 (AWS g4dn.xlarge)   | 75 ms |
| 16‑core x86 CPU (Modal)       | 410 ms |

Latency figures as reported in the [Smart Turn v2 announcement on the Daily blog](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/).

## How to use – quick start

```python
from transformers import pipeline
import soundfile as sf

# Load the turn-detection model as an audio-classification pipeline.
pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
)

# The model expects 16 kHz mono audio.
speech, sr = sf.read("user_utterance.wav")
if sr != 16_000:
    raise ValueError("Resample to 16 kHz")

# top_k=None returns scores for every label; take the highest-scoring one.
result = pipe(speech, top_k=None)[0]
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → user has finished speaking
```
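
If your recording is not already 16 kHz mono, one way to prepare it before calling the pipeline (assuming `librosa` is available; any resampler works) is:

```python
import librosa
import soundfile as sf

speech, sr = sf.read("user_utterance.wav")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # down-mix to mono
if sr != 16_000:
    # Resample to the 16 kHz rate the model expects.
    speech = librosa.resample(speech, orig_sr=sr, target_sr=16_000)
    sr = 16_000
# `speech` is now 16 kHz mono and can be passed to the pipeline above.
```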