Initial commit
- README.md +126 -0
- requirements.txt +9 -0
- src/__init__.py +0 -0
- src/__pycache__/__init__.cpython-313.pyc +0 -0
- src/__pycache__/api.cpython-313.pyc +0 -0
- src/__pycache__/app.cpython-313.pyc +0 -0
- src/__pycache__/multimodal.cpython-313.pyc +0 -0
- src/__pycache__/sentiment.cpython-313.pyc +0 -0
- src/__pycache__/transcription.cpython-313.pyc +0 -0
- src/api.py +85 -0
- src/app.py +156 -0
- src/inference.py +57 -0
- src/multimodal.py +40 -0
- src/sentiment.py +74 -0
- src/transcription.py +55 -0
README.md
ADDED
@@ -0,0 +1,126 @@
# Sentiment Audio

This project provides a complete sentiment-analysis pipeline for French-language audio files, built around four main components (a minimal usage sketch follows the list):

1. **Audio transcription**
   - Wav2Vec2 model (`jonatasgrosman/wav2vec2-large-xlsr-53-french`)
   - Audio feature extraction followed by CTC decoding

2. **Text sentiment analysis**
   - Multilingual BERT model (`nlptown/bert-base-multilingual-uncased-sentiment`)
   - `analyze_sentiment(text)` function returning a label (`négatif`, `neutre`, `positif`) and its confidence

3. **Gradio user interface**
   - Input modes: **microphone recording** and **file upload**
   - Displays the transcription and sentiment score in real time

4. **FastAPI REST API**
   - `/predict` endpoint for submitting an audio file
   - JSON response `{ "transcription": ..., "sentiment": {label: confidence} }`
   - Interactive Swagger UI documentation (`/docs`)
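A minimal sketch of calling the first two components from Python, assuming the code is run from the project root; the file name `exemple.wav` and the sample outputs are illustrative only:

```python
# Minimal sketch (illustrative): transcription followed by text sentiment.
from src.inference import transcribe          # Wav2Vec2 CTC transcription
from src.sentiment import TextEncoder         # BERT-based sentiment analysis

audio_path = "exemple.wav"                    # hypothetical audio file
text = transcribe(audio_path)                 # lowercased French transcription
scores = TextEncoder.analyze_sentiment(text)  # e.g. {"positif": 0.87}
print(text, scores)
```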
---

## Project structure

```
sentiment_audio_tp/
├── hf_model/            # model exports saved via save_pretrained
├── models/              # local HuggingFace cache (ignored by Git)
├── src/
│   ├── __init__.py
│   ├── transcription.py # SpeechEncoder (Wav2Vec2Model)
│   ├── sentiment.py     # TextEncoder + analyze_sentiment()
│   ├── multimodal.py    # multimodal classifier (embedding fusion)
│   ├── inference.py     # CLI (audio → transcription + sentiment)
│   ├── app.py           # Gradio interface
│   └── api.py           # FastAPI server
├── requirements.txt     # project dependencies
├── render.yaml          # infrastructure as code for Render
└── README.md            # this document
```

---

## Installation

1. **Clone the repository**
   ```bash
   git clone <URL_DU_REPO>
   cd sentiment_audio_tp
   ```

2. **Set up the environment**
   ```bash
   python -m venv venv
   source venv/bin/activate        # macOS/Linux
   .\venv\Scripts\Activate.ps1     # Windows PowerShell
   ```

3. **Install the dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

---

## Usage

### Inference CLI

```bash
python src/inference.py path/to/audio.wav
# Prints the transcription and the sentiment result
```

### Gradio interface

```bash
python -m src.app
```

- Open `http://127.0.0.1:7861/`
- Choose **Record** or **Upload**
- Get the transcription and sentiment in real time

### REST API

```bash
uvicorn src.api:app --reload --host 0.0.0.0 --port 8000
```

- Swagger UI: `http://127.0.0.1:8000/docs`
- Test with `curl` or Postman:

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "file=@/path/to/audio.wav"
```
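A hypothetical Python client for the same endpoint (it assumes the `requests` package, which is not listed in `requirements.txt`); the file name and the printed values are illustrative:

```python
# Hypothetical client call, assuming the API is running locally on port 8000.
import requests

with open("exemple.wav", "rb") as f:  # hypothetical audio file
    resp = requests.post("http://127.0.0.1:8000/predict", files={"file": f})
print(resp.json())
# e.g. {"transcription": "...", "sentiment": {"négatif": 0.1, "neutre": 0.2, "positif": 0.7}}
```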

---

## Use cases

- **Rapid prototyping** of sentiment analysis on customer calls, podcasts, interviews
- **Validation tool** for qualitative analysis of audio content
- **Proof of concept** for multimodal architectures
- **Back-end service** in a voice chatbot or support platform

---

## Extensions

- **Multimodal fine-tuning**: train the fusion classifier on an annotated dataset
- **New format support**: MP3, FLAC…
- **Tests and CI**: add `pytest` tests and CI/CD pipelines
- **Deployment**: Docker, Kubernetes, monitoring

---

## License

**MIT** license: free to use, modify, and redistribute.
requirements.txt
ADDED
@@ -0,0 +1,9 @@
transformers>=4.30.0
torch>=2.0.0
torchaudio>=2.0.0
gradio>=3.30.0

fastapi>=0.95.2
uvicorn[standard]>=0.21.1

soundfile>=0.12.1
src/__init__.py
ADDED
File without changes
src/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (169 Bytes)
src/__pycache__/api.cpython-313.pyc
ADDED
Binary file (4.5 kB)
src/__pycache__/app.cpython-313.pyc
ADDED
Binary file (8.44 kB)
src/__pycache__/multimodal.cpython-313.pyc
ADDED
Binary file (2.5 kB)
src/__pycache__/sentiment.cpython-313.pyc
ADDED
Binary file (3.15 kB)
src/__pycache__/transcription.cpython-313.pyc
ADDED
Binary file (2.34 kB)
src/api.py
ADDED
@@ -0,0 +1,85 @@
import tempfile
import os
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
import torch.nn.functional as F
import torchaudio
import torch

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from src.transcription import SpeechEncoder
from src.sentiment import TextEncoder
from src.multimodal import MultimodalSentimentClassifier

app = FastAPI(
    title="API Multimodale de Transcription & Sentiment",
    version="1.0"
)

# Preload the models
processor_ctc = Wav2Vec2Processor.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-french",
    cache_dir="./models"
)
model_ctc = Wav2Vec2ForCTC.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-french",
    #"alec228/audio-sentiment/tree/main/wav2vec2",
    cache_dir="./models"
)
speech_enc = SpeechEncoder()
text_enc = TextEncoder()
model_mm = MultimodalSentimentClassifier()

def transcribe_ctc(wav_path: str) -> str:
    """Load an audio file, resample it to 16 kHz mono, and decode it with the CTC model."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    inputs = processor_ctc(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        logits = model_ctc(**inputs).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor_ctc.batch_decode(pred_ids)[0].lower()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # 1. Check the file type
    if not file.filename.lower().endswith((".wav", ".flac", ".mp3")):
        raise HTTPException(status_code=400,
                            detail="Seuls les fichiers audio WAV/FLAC/MP3 sont acceptés.")
    # 2. Save to a temporary file
    suffix = os.path.splitext(file.filename)[1]
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name

    try:
        # 3. Transcription
        transcription = transcribe_ctc(tmp_path)

        # 4. Multimodal features
        audio_feat = speech_enc.extract_features(tmp_path)
        text_feat = text_enc.extract_features([transcription])

        # 5. Classification
        logits = model_mm.classifier(torch.cat([audio_feat, text_feat], dim=1))
        probs = F.softmax(logits, dim=1).squeeze().tolist()
        labels = ["négatif", "neutre", "positif"]
        sentiment = {labels[i]: round(probs[i], 3) for i in range(len(labels))}

        return JSONResponse({
            "transcription": transcription,
            "sentiment": sentiment
        })

    finally:
        # 6. Clean up the temporary file
        os.remove(tmp_path)
src/app.py
ADDED
@@ -0,0 +1,156 @@
import os
import re
from datetime import datetime

import gradio as gr
import torch
import pandas as pd
import soundfile as sf
import torchaudio

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from src.transcription import SpeechEncoder
from src.sentiment import TextEncoder

# Preload the models
processor_ctc = Wav2Vec2Processor.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-french", cache_dir="./models"
    #"alec228/audio-sentiment/tree/main/wav2vec2", cache_dir="./models"
)
model_ctc = Wav2Vec2ForCTC.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-french", cache_dir="./models"
    #"alec228/audio-sentiment/tree/main/wav2vec2", cache_dir="./models"
)

speech_enc = SpeechEncoder()
text_enc = TextEncoder()

# Analysis pipeline

def analyze_audio(audio_path):
    # Load and preprocess the audio
    data, sr = sf.read(audio_path)
    arr = data.T if data.ndim > 1 else data
    wav = torch.from_numpy(arr).float()
    if wav.dim() == 1:
        wav = wav.unsqueeze(0)                  # -> [1, time]
    if sr != 16000:
        wav = torchaudio.transforms.Resample(sr, 16000)(wav)
        sr = 16000
    if wav.size(0) > 1:
        wav = wav.mean(dim=0, keepdim=True)     # stereo -> mono

    # Transcription
    inputs = processor_ctc(wav.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model_ctc(**inputs).logits
    pred_ids = torch.argmax(logits, dim=-1)
    transcription = processor_ctc.batch_decode(pred_ids)[0].lower()

    # Overall sentiment
    sent_dict = TextEncoder.analyze_sentiment(transcription)
    label, conf = max(sent_dict.items(), key=lambda x: x[1])
    emojis = {"positif": "😊", "neutre": "😐", "négatif": "☹️"}
    emoji = emojis.get(label, "")

    # Per-sentence segmentation
    segments = [s.strip() for s in re.split(r'[.?!]', transcription) if s.strip()]
    seg_results = []
    for seg in segments:
        sd = TextEncoder.analyze_sentiment(seg)
        l, c = max(sd.items(), key=lambda x: x[1])
        seg_results.append({"Segment": seg, "Sentiment": l.capitalize(), "Confiance (%)": round(c * 100, 1)})
    seg_df = pd.DataFrame(seg_results)

    # History entry
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    history_entry = {
        "Horodatage": timestamp,
        "Transcription": transcription,
        "Sentiment": label.capitalize(),
        "Confiance (%)": round(conf * 100, 1)
    }

    # Summary rendering
    summary_html = (
        f"<div style='display:flex;align-items:center;'>"
        f"<span style='font-size:3rem;margin-right:10px;'>{emoji}</span>"
        f"<h2 style='color:#6a0dad;'>{label.upper()}</h2>"
        f"</div>"
        f"<p><strong>Confiance :</strong> {conf*100:.1f}%</p>"
    )
    return transcription, summary_html, seg_df, history_entry

# CSV export

def export_history_csv(history):
    df = pd.DataFrame(history)
    path = "history.csv"
    df.to_csv(path, index=False)
    return path

# Chat interface + history

demo = gr.Blocks(theme=gr.themes.Monochrome(primary_hue="purple"))
with demo:
    gr.Markdown("# Chat & Analyse de Sentiment Audio")

    gr.HTML("""
    <div style="display: flex; flex-direction: column; gap: 10px; margin-bottom: 20px;">
      <div style="background-color: #f3e8ff; padding: 12px 20px; border-radius: 12px; border-left: 5px solid #8e44ad;">
        <strong>Étape 1 :</strong> Enregistrez votre voix ou téléversez un fichier audio (format WAV recommandé).
      </div>
      <div style="background-color: #e0f7fa; padding: 12px 20px; border-radius: 12px; border-left: 5px solid #0097a7;">
        <strong>Étape 2 :</strong> Cliquez sur le bouton <em><b>Analyser</b></em> pour lancer la transcription et l’analyse.
      </div>
      <div style="background-color: #fff3e0; padding: 12px 20px; border-radius: 12px; border-left: 5px solid #fb8c00;">
        <strong>Étape 3 :</strong> Visualisez les résultats : transcription, sentiment, et analyse détaillée.
      </div>
      <div style="background-color: #e8f5e9; padding: 12px 20px; border-radius: 12px; border-left: 5px solid #43a047;">
        <strong>Étape 4 :</strong> Exportez l’historique des analyses au format CSV si besoin.
      </div>
    </div>

    <!-- Target element for the Swagger link set by the script below -->
    <p id="swagger-link"></p>

    <script>
      const origin = window.location.origin;
      const swaggerUrl = origin + "/docs";
      document.getElementById("swagger-link").innerHTML = `<a href="${swaggerUrl}" target="_blank">Voir la documentation de l’API (Swagger)</a>`;
    </script>
    """)

    with gr.Row():
        with gr.Column(scale=2):
            audio_in = gr.Audio(sources=["microphone", "upload"], type="filepath", label="Audio Input")
            btn = gr.Button("Analyser")
            export_btn = gr.Button("Exporter CSV")
        with gr.Column(scale=3):
            chat = gr.Chatbot(label="Historique des échanges")
            transcription_out = gr.Textbox(label="Transcription", interactive=False)
            summary_out = gr.HTML(label="Sentiment")
            seg_out = gr.Dataframe(label="Détail par segment")
            hist_out = gr.Dataframe(label="Historique")

    state_chat = gr.State([])  # list of (user, bot) pairs
    state_hist = gr.State([])  # list of history dict entries

    def chat_callback(audio_path, chat_history, hist_state):
        transcription, summary, seg_df, hist_entry = analyze_audio(audio_path)
        user_msg = "[Audio reçu]"
        bot_msg = f"**Transcription :** {transcription}\n**Sentiment :** {summary}"
        chat_history = chat_history + [(user_msg, bot_msg)]
        hist_state = hist_state + [hist_entry]
        return chat_history, transcription, summary, seg_df, hist_state

    btn.click(
        fn=chat_callback,
        inputs=[audio_in, state_chat, state_hist],
        outputs=[chat, transcription_out, summary_out, seg_out, state_hist]
    )
    export_btn.click(
        fn=export_history_csv,
        inputs=[state_hist],
        outputs=[gr.File(label="Télécharger CSV")]
    )


if __name__ == "__main__":
    demo.launch()
src/inference.py
ADDED
@@ -0,0 +1,57 @@
import torch
import torch.nn.functional as F
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

from src.multimodal import MultimodalSentimentClassifier

# 1. CTC transcription
def transcribe(audio_path: str) -> str:
    processor = Wav2Vec2Processor.from_pretrained(
        "jonatasgrosman/wav2vec2-large-xlsr-53-french",
        #cache_dir="./models"
    )
    model_ctc = Wav2Vec2ForCTC.from_pretrained(
        "jonatasgrosman/wav2vec2-large-xlsr-53-french",
        #cache_dir="./models"
    )

    # Load the audio, resample to 16 kHz, and convert to mono
    waveform, sr = torchaudio.load(audio_path)
    if sr != 16000:
        waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    inputs = processor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        logits = model_ctc(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    return transcription.lower()

# 2. Multimodal inference
def infer(audio_path: str) -> dict:
    # a) transcribe the audio
    text = transcribe(audio_path)

    # b) load and run the multimodal model
    model = MultimodalSentimentClassifier()
    logits = model(audio_path, text)  # [1, n_classes]
    probs = F.softmax(logits, dim=1).squeeze().tolist()

    labels = ["négatif", "neutre", "positif"]
    return {labels[i]: round(probs[i], 3) for i in range(len(labels))}

# Quick command-line test
if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print("Usage: python src/inference.py <chemin_vers_audio.wav>")
        sys.exit(1)
    res = infer(sys.argv[1])
    print(f"Résultat multimodal : {res}")
src/multimodal.py
ADDED
@@ -0,0 +1,40 @@
from .transcription import SpeechEncoder
from .sentiment import TextEncoder
import torch
import torch.nn as nn

class MultimodalSentimentClassifier(nn.Module):
    def __init__(
        self,
        wav2vec_name: str = "jonatasgrosman/wav2vec2-large-xlsr-53-french",
        #wav2vec_name: str = "alec228/audio-sentiment/tree/main/wav2vec2",
        bert_name: str = "nlptown/bert-base-multilingual-uncased-sentiment",
        #bert_name: str = "alec228/audio-sentiment/tree/main/bert-sentiment",
        #cache_dir: str = "./models",
        hidden_dim: int = 256,
        n_classes: int = 3
    ):
        super().__init__()
        self.speech_encoder = SpeechEncoder(
            model_name=wav2vec_name,
            #cache_dir=cache_dir
        )
        self.text_encoder = TextEncoder(
            model_name=bert_name,
            #cache_dir=cache_dir
        )
        dim_a = self.speech_encoder.model.config.hidden_size
        dim_t = self.text_encoder.model.config.hidden_size

        # Fusion head: concatenated audio + text embeddings -> n_classes sentiment logits
        self.classifier = nn.Sequential(
            nn.Linear(dim_a + dim_t, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, n_classes)
        )

    def forward(self, audio_path: str, text: str) -> torch.Tensor:
        a_feat = self.speech_encoder.extract_features(audio_path)
        t_feat = self.text_encoder.extract_features([text])
        fused = torch.cat([a_feat, t_feat], dim=1)
        return self.classifier(fused)
src/sentiment.py
ADDED
@@ -0,0 +1,74 @@
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

class TextEncoder:
    def __init__(
        self,
        model_name: str = "nlptown/bert-base-multilingual-uncased-sentiment",
        #model_name: str = "alec228/audio-sentiment/tree/main/bert-sentiment",
        #cache_dir: str = "./models"
    ):
        # Tokenizer used to preprocess the text
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            #cache_dir=cache_dir
        )
        # Base BERT model (no classification head)
        self.model = AutoModel.from_pretrained(
            model_name,
            #cache_dir=cache_dir
        )

    def extract_features(self, texts: list[str]) -> torch.Tensor:
        """
        Take a list of strings and return the [CLS] token embedding
        for each text.
        """
        # 1. Tokenization
        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            truncation=True,
            padding=True
        )
        # 2. Forward pass without gradient computation
        with torch.no_grad():
            outputs = self.model(**inputs)
        # 3. Extract the [CLS] token embedding
        return outputs.last_hidden_state[:, 0, :]  # [batch, hidden_size]

    @staticmethod
    def analyze_sentiment(text: str) -> dict:
        """
        Analyze the sentiment of a text with an already fine-tuned model
        (nlptown/bert-base-multilingual-uncased-sentiment) and return
        a dict {label: confidence}.
        """
        # Load the tokenizer and the classification model
        tokenizer = AutoTokenizer.from_pretrained(
            "nlptown/bert-base-multilingual-uncased-sentiment",
            #cache_dir="./models"
        )
        model = AutoModelForSequenceClassification.from_pretrained(
            "nlptown/bert-base-multilingual-uncased-sentiment",
            #cache_dir="./models"
        )

        # Prepare the inputs
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        probs = F.softmax(logits, dim=1).squeeze().tolist()

        # Classes range from 1 to 5; pick the most probable one
        label_idx = int(torch.argmax(torch.tensor(probs))) + 1
        if label_idx <= 2:
            label = "négatif"
        elif label_idx == 3:
            label = "neutre"
        else:
            label = "positif"
        confidence = round(max(probs), 3)
        return {label: confidence}
src/transcription.py
ADDED
@@ -0,0 +1,55 @@
# src/transcription.py

from transformers import Wav2Vec2Processor, Wav2Vec2Model
import torch
import torchaudio

class SpeechEncoder:
    def __init__(
        self,
        model_name: str = "jonatasgrosman/wav2vec2-large-xlsr-53-french",
        #model_name: str = "alec228/audio-sentiment/tree/main/wav2vec2",
        cache_dir: str = "./models"
    ):
        # Processor used to preprocess the audio
        self.processor = Wav2Vec2Processor.from_pretrained(
            model_name, cache_dir=cache_dir
        )
        # Base model (no CTC head)
        self.model = Wav2Vec2Model.from_pretrained(
            model_name, cache_dir=cache_dir
        )

    def extract_features(self, audio_path: str) -> torch.Tensor:
        """
        Load an audio file, resample it to 16 kHz, convert it to mono,
        and return the embedding averaged over the sequence.
        """
        # 1. Load the audio
        waveform, sample_rate = torchaudio.load(audio_path)

        # 2. Resample if needed
        if sample_rate != 16000:
            waveform = torchaudio.transforms.Resample(
                orig_freq=sample_rate,
                new_freq=16000
            )(waveform)

        # 3. Convert to mono
        if waveform.size(0) > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        # 4. Preprocess for the model
        inputs = self.processor(
            waveform.squeeze().numpy(),
            sampling_rate=16000,
            return_tensors="pt",
            padding=True
        )

        # 5. Feature extraction without gradients
        with torch.no_grad():
            outputs = self.model(**inputs)

        # 6. Temporal mean of the embeddings
        return outputs.last_hidden_state.mean(dim=1)  # shape: [batch, hidden_size]