marcus-daily and victor (HF Staff) committed
Commit 01d323e · verified · 1 Parent(s): 2f16664

This deserves a model card (#1)


- This deserves a model card (e3ff8b2ac4b39e656ab3fb2553f102794cf2f558)


Co-authored-by: Victor Mustar <[email protected]>

Files changed (1)
  1. README.md +113 -1
README.md CHANGED
@@ -1,5 +1,117 @@
---
+ pipeline_tag: voice-activity-detection
license: bsd-2-clause
+ tags:
+ - speech-processing
+ - semantic-vad
+ - multilingual
+ datasets:
+ - pipecat-ai/chirp3_1
+ - pipecat-ai/orpheus_midfiller_1
+ - pipecat-ai/orpheus_grammar_1
+ - pipecat-ai/orpheus_endfiller_1
+ - pipecat-ai/human_convcollector_1
+ - pipecat-ai/rime_2
+ - pipecat-ai/human_5_all
+ language:
+ - en
+ - fr
+ - de
+ - es
+ - pt
+ - zh
+ - ja
+ - hi
+ - it
+ - ko
+ - nl
+ - pl
+ - ru
+ - tr
---

- See: https://github.com/pipecat-ai/smart-turn
+ # Smart Turn v2
+
+ **Smart Turn v2** is an open‑source semantic Voice Activity Detection (VAD) model that tells you **_whether a speaker has finished their turn_** by analysing the raw waveform, not the transcript.
+ Compared with v1, it is:
+
+ * **Multilingual** – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
+ * **6× smaller** – ≈ 360 MB vs. 2.3 GB.
+ * **3× faster** – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.
+
+ ## Intended use & task
+
+ | Use‑case | Why this model helps |
+ |----------|----------------------|
+ | Voice agents / chatbots | Wait to reply until the user has **actually** finished speaking. |
+ | Real‑time transcription + TTS | Avoid “double‑talk” by triggering TTS only when the user turn ends. |
+ | Call‑centre assist & analytics | Accurate segmentation for diarisation and sentiment pipelines. |
+ | Any project needing semantic VAD | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD. |
+
+ The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
+
+ ## Model architecture
+
+ * Backbone: `wav2vec2` encoder
+ * Head: shallow linear classifier
+ * Params: 94.8 M (float32)
+ * Checkpoint: 360 MB Safetensors (compressed)
+
+ The `wav2vec2 + linear` configuration outperformed LSTM and deeper transformer variants during ablation studies.
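+
+ As a rough sketch of this configuration (assumed details: mean pooling over encoder states and a single‑logit sigmoid head; the released checkpoint may differ):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from transformers import Wav2Vec2Model
+
+ class TurnClassifier(nn.Module):
+     """wav2vec2 encoder + shallow linear head, as described above."""
+     def __init__(self, backbone: str = "facebook/wav2vec2-base"):
+         super().__init__()
+         self.encoder = Wav2Vec2Model.from_pretrained(backbone)
+         self.head = nn.Linear(self.encoder.config.hidden_size, 1)
+
+     def forward(self, waveform: torch.Tensor) -> torch.Tensor:
+         # waveform: (batch, samples) at 16 kHz
+         states = self.encoder(waveform).last_hidden_state    # (B, T, H)
+         pooled = states.mean(dim=1)                          # assumed pooling
+         return torch.sigmoid(self.head(pooled)).squeeze(-1)  # P(turn complete)
+ ```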
+
+ ## Training data
+
+ | Source | Type | Split | Languages |
+ |--------|------|-------|-----------|
+ | `human_5_all` | Human‑recorded | Train / Dev / Test | EN |
+ | `chirp3_1` | Synthetic (Google Chirp3 TTS) | Train / Dev / Test | 14 langs |
+
+ * Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial, or written‑only text.
+ * Filler‑word lists per language (e.g., “um”, “えーと”) were built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech (see the sketch below).
+
+ All audio/text pairs are released on the [pipecat‑ai/datasets](https://huggingface.co/pipecat-ai/datasets) hub.
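+
+ The filler‑injection step can be pictured with a toy helper (hypothetical names and filler lists; the actual data pipeline is more involved):
+
+ ```python
+ import random
+
+ # Illustrative per-language filler lists (not the ones used in training)
+ FILLERS = {"en": ["um", "uh"], "ja": ["えーと", "あの"]}
+
+ def inject_filler(sentence: str, lang: str, p: float = 0.5) -> str:
+     """With probability p, drop final punctuation and append a filler,
+     yielding an 'incomplete turn'-style training example."""
+     if random.random() < p:
+         return sentence.rstrip(".!?…") + " " + random.choice(FILLERS[lang])
+     return sentence
+
+ print(inject_filler("I was going to say that.", "en", p=1.0))
+ ```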
+
+ ## Evaluation & performance
+
+ ### Accuracy on unseen synthetic test set (50% complete / 50% incomplete)
+
+ | Lang | Acc % | Lang | Acc % |
+ |------|-------|------|-------|
+ | EN | 94.3 | IT | 94.4 |
+ | FR | 95.5 | KO | 95.5 |
+ | ES | 92.1 | PT | 95.5 |
+ | DE | 95.8 | TR | 96.8 |
+ | NL | 96.7 | PL | 94.6 |
+ | RU | 93.0 | HI | 91.2 |
+ | ZH | 87.2 | – | – |
+
+ *Human English benchmark (`human_5_all`): **99%** accuracy.*
+
+ ### Inference latency for 8 s audio
+
+ | Device | Time |
+ |--------|------|
+ | NVIDIA L40S | 12 ms |
+ | NVIDIA A100 | 19 ms |
+ | NVIDIA T4 (AWS g4dn.xlarge) | 75 ms |
+ | 16‑core x86 CPU (Modal) | 410 ms |
+
+ Source: [Smart Turn v2 announcement on the Daily blog](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/)
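+
+ These figures are easy to sanity‑check on your own hardware; a minimal timing sketch (assuming the `pipe` object from the quick start below):
+
+ ```python
+ import time
+ import numpy as np
+
+ # 8 s of 16 kHz audio; silence is enough for a latency measurement
+ audio = np.zeros(8 * 16_000, dtype=np.float32)
+
+ pipe(audio)  # warm-up run (weights load, CUDA kernel compilation)
+ t0 = time.perf_counter()
+ pipe(audio)
+ print(f"{(time.perf_counter() - t0) * 1e3:.1f} ms for 8 s of audio")
+ ```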
+
+ ## How to use – quick start
+
+ ```python
+ from transformers import pipeline
+ import soundfile as sf
+
+ # Load the model as a standard audio-classification pipeline
+ pipe = pipeline(
+     "audio-classification",
+     model="pipecat-ai/smart-turn-v2",
+     feature_extractor="facebook/wav2vec2-base",
+ )
+
+ # The model expects 16 kHz audio
+ speech, sr = sf.read("user_utterance.wav")
+ if sr != 16_000:
+     raise ValueError("Resample to 16 kHz")
+
+ result = pipe(speech, top_k=None)[0]  # top_k=None → all labels; [0] is the best
+ print(f"Completed turn? {result['label']} Prob: {result['score']:.3f}")
+ # label == 'complete' → user has finished speaking
+ ```
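+
+ If your recording is not already at 16 kHz, one way to resample it before calling the pipeline (assuming `librosa` is installed; any resampler works):
+
+ ```python
+ import librosa
+
+ # Resample to the 16 kHz rate the model expects (mono signal assumed)
+ speech = librosa.resample(speech, orig_sr=sr, target_sr=16_000)
+ ```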