---
pipeline_tag: voice-activity-detection
license: bsd-2-clause
tags:
  - speech-processing
  - semantic-vad
  - multilingual
datasets:
  - pipecat-ai/chirp3_1
  - pipecat-ai/orpheus_midfiller_1
  - pipecat-ai/orpheus_grammar_1
  - pipecat-ai/orpheus_endfiller_1
  - pipecat-ai/human_convcollector_1
  - pipecat-ai/rime_2
  - pipecat-ai/human_5_all
language:
  - en
  - fr
  - de
  - es
  - pt
  - zh
  - ja
  - hi
  - it
  - ko
  - nl
  - pl
  - ru
  - tr
---

# Smart Turn v2

Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model that tells you whether a speaker has finished their turn by analysing the raw waveform, not the transcript.
Compared with v1 it is:

- Multilingual – 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
- 6 × smaller – ≈ 360 MB vs. 2.3 GB.
- 3 × faster – ≈ 12 ms to analyse 8 s of audio on an NVIDIA L40S.

## Intended use & task

| Use-case | Why this model helps |
|---|---|
| Voice agents / chatbots | Wait to reply until the user has actually finished speaking. |
| Real-time transcription + TTS | Avoid "double-talk" by triggering TTS only when the user's turn ends. |
| Call-centre assist & analytics | Accurate segmentation for diarisation and sentiment pipelines. |
| Any project needing semantic VAD | Detects incomplete thoughts, filler words ("um …", "えーと …") and intonation cues ignored by classic energy-based VAD. |

The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.

## Model architecture

- **Backbone:** wav2vec2 encoder
- **Head:** shallow linear classifier
- **Params:** 94.8 M (float32)
- **Checkpoint:** 360 MB Safetensors (compressed)

The wav2vec2 + linear configuration outperformed LSTM and deeper transformer variants during ablation studies.
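
For orientation, here is a rough PyTorch sketch of such a configuration. It is illustrative only: the mean-pooling step, the sigmoid output and the layer sizes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class TurnEndClassifier(nn.Module):
    """Illustrative wav2vec2 backbone + shallow linear head (not the released code)."""

    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        frames = self.encoder(waveform).last_hidden_state     # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                           # (batch, hidden)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)   # turn-completion probability
```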

## Training data

| Source | Type | Languages |
|---|---|---|
| human_5_all | Human-recorded | EN |
| human_convcollector_1 | Human-recorded | EN |
| rime_2 | Synthetic (Rime) | EN |
| orpheus_midfiller_1 | Synthetic (Orpheus) | EN |
| orpheus_grammar_1 | Synthetic (Orpheus) | EN |
| orpheus_endfiller_1 | Synthetic (Orpheus) | EN |
| chirp3_1 | Synthetic (Google Chirp3 TTS) | 14 langs |
- Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written-only text.
- Filler-word lists per language (e.g., "um", "えーと") were built with Claude & GPT-o3 and injected near sentence ends to teach the model about interrupted speech (see the sketch below).

All audio/text pairs are released on the pipecat‑ai/datasets hub.
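
As a rough illustration of that last step, the snippet below appends a filler word near the end of a sentence to produce an "incomplete turn" example. The filler lists and truncation strategy here are placeholders, not the actual data-generation code.

```python
import random

# Toy per-language filler lists (placeholders, not the real lists used in training).
FILLERS = {"en": ["um", "uh", "so"], "ja": ["えーと", "あの"]}

def inject_end_filler(sentence: str, lang: str) -> str:
    """Trail off with a filler word near the sentence end, mimicking an interrupted turn."""
    words = sentence.rstrip(".!?").split()
    filler = random.choice(FILLERS[lang])
    return " ".join(words[:-1] + [filler])

print(inject_end_filler("I think we should book the earlier flight.", "en"))
# e.g. "I think we should book the earlier um"
```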

## Evaluation & performance

### Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)

| Lang | Acc % | Lang | Acc % |
|---|---|---|---|
| EN | 94.3 | IT | 94.4 |
| FR | 95.5 | KO | 95.5 |
| ES | 92.1 | PT | 95.5 |
| DE | 95.8 | TR | 96.8 |
| NL | 96.7 | PL | 94.6 |
| RU | 93.0 | HI | 91.2 |
| ZH | 87.2 | | |

Human English benchmark (human_5_all): 99 % accuracy.

### Inference latency for 8 s audio

| Device | Time |
|---|---|
| NVIDIA L40S | 12 ms |
| NVIDIA A100 | 19 ms |
| NVIDIA T4 (AWS g4dn.xlarge) | 75 ms |
| 16-core x86_64 CPU (Modal) | 410 ms |
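
To reproduce these numbers on your own hardware, a rough timing harness along the lines below can help. It relies on the quick-start pipeline shown in the next section, and the figures will vary with device, precision and warm-up.

```python
import time
import numpy as np
from transformers import pipeline

# Rough timing harness (assumes the quick-start pipeline below works on your setup).
pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base",
    device=0,  # set to -1 for CPU
)

audio = np.random.randn(8 * 16_000).astype(np.float32)  # 8 s of 16 kHz audio

for _ in range(5):            # warm-up runs (model load, kernel compilation)
    pipe(audio, top_k=None)

runs = 20
start = time.perf_counter()
for _ in range(runs):
    pipe(audio, top_k=None)
print(f"{(time.perf_counter() - start) / runs * 1000:.1f} ms per 8 s clip")
```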

## How to use – quick start

```python
from transformers import pipeline
import soundfile as sf

pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base"
)

speech, sr = sf.read("user_utterance.wav")
if sr != 16_000:
    raise ValueError("Resample to 16 kHz")

result = pipe(speech, top_k=None)[0]
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → user has finished speaking
```
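
In a live agent you would typically run this check on a rolling window of the most recent audio whenever your base VAD reports silence, and only hand the turn to the agent once the score crosses 0.5. A minimal sketch follows; the 8 s window, the helper name and the "complete" label are assumptions carried over from the example above, not part of a published API.

```python
import numpy as np

WINDOW_SECONDS = 8       # latency figures above are for 8 s clips
SAMPLE_RATE = 16_000
THRESHOLD = 0.5

def turn_is_complete(pipe, audio_buffer: np.ndarray) -> bool:
    """Run smart-turn-v2 on the most recent 8 s of a live 16 kHz audio buffer."""
    window = audio_buffer[-WINDOW_SECONDS * SAMPLE_RATE:]
    scores = {r["label"]: r["score"] for r in pipe(window, top_k=None)}
    return scores.get("complete", 0.0) >= THRESHOLD
```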