pipecat-ai/smart-turn-v2

Voice Activity Detection

Safetensors

wav2vec2

speech-processing

semantic-vad

multilingual


marcus-daily committed 2 days ago

Commit 849c530

1 Parent(s): 01d323e

Model card fixes


Files changed (1)

README.md +26 -14

README.md CHANGED

@@ -39,6 +39,12 @@ Compared with v1 it is:
 * **6 × smaller** – ≈ 360 MB vs. 2.3 GB.
 * **3 × faster** – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.
 ## Intended use & task
 | Use‑case                                    | Why this model helps                                                    |
@@ -60,10 +66,15 @@ The `wav2vec2 + linear` configuration out‑performed LSTM and deeper transfor
 ## Training data
-| Source | Type | Split | Languages |
-|--------|------|-------|-----------|
-| `human_5_all` | Human‑recorded | Train / Dev / Test | EN |
-| `chirp3_1`    | Synthetic (Google Chirp3 TTS) | Train / Dev / Test | 14 langs |
 * Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
 * Filler‑word lists per language (e.g., “um”, “えーと”) built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech.
@@ -74,14 +85,14 @@ All audio/text pairs are released on the [pipecat‑ai/datasets](https://hugging
 ### Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)
 | Lang | Acc % | Lang | Acc % |
-|------|------|------|------|
-| EN | 94.3 | IT | 94.4 |
-| FR | 95.5 | KO | 95.5 |
-| ES | 92.1 | PT | 95.5 |
-| DE | 95.8 | TR | 96.8 |
-| NL | 96.7 | PL | 94.6 |
-| RU | 93.0 | HI | 91.2 |
-| ZH | 87.2 | – | – |
 *Human English benchmark (`human_5_all`) : **99 %** accuracy.*
@@ -92,7 +103,7 @@ All audio/text pairs are released on the [pipecat‑ai/datasets](https://hugging
 | NVIDIA L40S                   | 12 ms |
 | NVIDIA A100                   | 19 ms |
 | NVIDIA T4 (AWS g4dn.xlarge)   | 75 ms |
-| 16‑core x86 CPU (Modal)       | 410 ms |
  [Daily](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/)
@@ -114,4 +125,5 @@ if sr != 16_000:
 result = pipe(speech, top_k=None)[0]
 print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
-# label == 'complete' → user has finished speaking

 * **6 × smaller** – ≈ 360 MB vs. 2.3 GB.
 * **3 × faster** – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.
+## Links
+* [Blog post: Smart Turn v2](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/)
+* [GitHub repo](https://github.com/pipecat-ai/smart-turn) with training and inference code
 ## Intended use & task
 | Use‑case                                    | Why this model helps                                                    |
 ## Training data
+| Source                  | Type                          | Languages |
+|-------------------------|-------------------------------|-----------|
+| `human_5_all`           | Human‑recorded                | EN        |
+| `human_convcollector_1` | Human‑recorded                | EN        |
+| `rime_2`                | Synthetic (Rime)              | EN        |
+| `orpheus_midfiller_1`   | Synthetic (Orpheus)           | EN        |
+| `orpheus_grammar_1`     | Synthetic (Orpheus)           | EN        |
+| `orpheus_endfiller_1`   | Synthetic (Orpheus)           | EN        |
+| `chirp3_1`              | Synthetic (Google Chirp3 TTS) | 14 langs  |
 * Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
 * Filler‑word lists per language (e.g., “um”, “えーと”) built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech.
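The model card does not spell out how fillers were injected, so the following is only one plausible reading of the bullet above: cut a clean sentence short near its end and append a filler word, producing text that would be labelled as an incomplete turn. The helper name `make_incomplete_sample`, the `drop_words` parameter and the example sentence are hypothetical and are not taken from the pipecat-ai training code.

```python
def make_incomplete_sample(sentence: str, filler: str, drop_words: int = 2) -> str:
    """Truncate `sentence` near its end and append `filler`.

    Illustrative assumption only: the real data pipeline may place
    fillers differently (for example mid-sentence).
    """
    # Drop terminal punctuation so the result does not look finished.
    words = sentence.rstrip(".!?").split()
    if drop_words < 1 or drop_words >= len(words):
        raise ValueError("drop_words must leave at least one word in place")
    # Keep everything except the last `drop_words` words, then add the filler.
    return " ".join(words[:-drop_words] + [filler])


sample = make_incomplete_sample("I would like to book a flight to Paris.", "um")
print(sample)
```

Text built this way would then go through a TTS system to obtain audio; that step is outside this sketch.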
 ### Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)
 | Lang | Acc % | Lang | Acc % |
+|------|-------|------|-------|
+| EN   | 94.3  | IT   | 94.4  |
+| FR   | 95.5  | KO   | 95.5  |
+| ES   | 92.1  | PT   | 95.5  |
+| DE   | 95.8  | TR   | 96.8  |
+| NL   | 96.7  | PL   | 94.6  |
+| RU   | 93.0  | HI   | 91.2  |
+| ZH   | 87.2  | –    |   –   |
 *Human English benchmark (`human_5_all`) : **99 %** accuracy.*
 | NVIDIA L40S                   | 12 ms |
 | NVIDIA A100                   | 19 ms |
 | NVIDIA T4 (AWS g4dn.xlarge)   | 75 ms |
+| 16‑core x86\_64 CPU (Modal)   | 410 ms |
  [Daily](https://www.daily.co/blog/smart-turn-v2-faster-inference-and-13-new-languages-for-voice-ai/)
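To put these latencies in perspective, the sketch below converts them into real-time factors (audio duration divided by inference time). It assumes every row times a single 8 s clip, as the "≈ 12 ms to analyse 8 s of audio" figure for the L40S suggests; the card does not state this for the other rows. The names `AUDIO_MS` and `realtime_factor` are illustrative.

```python
AUDIO_MS = 8_000  # assumption: each measurement covers one 8 s clip

# Latencies in milliseconds, copied from the table rows above.
latency_ms = {
    "NVIDIA L40S": 12,
    "NVIDIA A100": 19,
    "NVIDIA T4 (AWS g4dn.xlarge)": 75,
    "16-core x86_64 CPU (Modal)": 410,
}


def realtime_factor(latency: float, audio_ms: float = AUDIO_MS) -> float:
    """Return how many times faster than real time the inference runs."""
    return audio_ms / latency


for device, ms in latency_ms.items():
    print(f"{device}: {realtime_factor(ms):.1f}x real time")
```

Under that assumption even the CPU row stays well above 1x, which is the minimum needed for a turn detector that runs on each pause in live audio.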
 result = pipe(speech, top_k=None)[0]
 print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
+# label == 'complete' → user has finished speaking
+```
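The snippet in this hunk reads only the top-ranked label. A minimal sketch of turning the full `top_k=None` output into an end-of-turn decision is shown below. It assumes the pipeline returns a list of `{'label': ..., 'score': ...}` dicts, the usual shape for Hugging Face classification pipelines. The helper `is_turn_complete`, the 0.5 threshold, the `'incomplete'` label name and the example scores are hypothetical and do not come from the model card.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def is_turn_complete(predictions: List[Prediction], threshold: float = 0.5) -> bool:
    """Return True when the 'complete' score reaches `threshold`.

    `predictions` is the list produced by `pipe(speech, top_k=None)`,
    i.e. before taking element [0].
    """
    scores = {str(p["label"]): float(p["score"]) for p in predictions}
    # A missing 'complete' label is treated as "not finished".
    return scores.get("complete", 0.0) >= threshold


# Hypothetical pipeline output, used here so the example runs without the model.
example = [
    {"label": "complete", "score": 0.91},
    {"label": "incomplete", "score": 0.09},
]
print(is_turn_complete(example))
```

Raising the threshold makes the agent slower to take its turn but less likely to interrupt; the right value depends on the application.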