johaness14/wav2vec2-jv-large-openslr

@@ -37,8 +37,6 @@ This model is intended for transcribing spoken Javanese language from audio reco
 The model was trained on the OpenSLR41 dataset, split into two parts (training and testing), using a single A100 GPU. Training took 4-5 hours.
-## Training procedure
 ### Training hyperparameters
 The following hyperparameters were used during training:
@@ -83,6 +81,90 @@ The following hyperparameters were used during training:
 | 0.0564        | 70.8215 | 50000 | 0.2711          | 0.1551 |
 | 0.0562        | 73.6544 | 52000 | 0.2727          | 0.1523 |
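The `Wer` column above is presumably the word error rate on the evaluation split; the card does not say which tool computed it. As a minimal sketch of the metric itself (the function name `word_error_rate` and the sample strings are illustrative, not taken from the training code), WER is the word-level edit distance between reference and hypothesis divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical strings: one substitution and one deletion over four words.
print(word_error_rate("a b c d", "a x c"))
```

Under this definition, the final table row's value of 0.1523 would correspond to roughly 15 word errors per 100 reference words.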
+### How to run (Gradio Web)
+```python
+import torch
+import gradio as gr
+import numpy as np
+from transformers import pipeline, AutoProcessor, AutoModelForCTC
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load the model and processor
+MODEL_NAME = "<your model ID on the Hugging Face Hub>"
+processor = AutoProcessor.from_pretrained(MODEL_NAME)
+model = AutoModelForCTC.from_pretrained(MODEL_NAME)
+# Move the model to the selected device (GPU if available, otherwise CPU)
+model.to(device)
+# Create the pipeline with the model and processor
+transcriber = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor, device=device)
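+def transcribe(audio):
+    # Gradio passes the audio as a (sample_rate, samples) tuple
+    sr, y = audio
+    y = y.astype(np.float32)
+    # Downmix multi-channel audio, which arrives as (samples, channels)
+    if y.ndim > 1:
+        y = y.mean(axis=1)
+    # Peak-normalize, skipping silent clips to avoid dividing by zero
+    peak = np.max(np.abs(y))
+    if peak > 0:
+        y /= peak
+    return transcriber({"sampling_rate": sr, "raw": y})["text"]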
+demo = gr.Interface(
+    transcribe,
+    gr.Audio(sources=["upload"]),
+    "text",
+)
+demo.launch(share=True)
+```
+### How to run (Python script)
+```python
+import torch
+import torchaudio
+from transformers import AutoProcessor, AutoModelForCTC
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Load the model and processor
+MODEL_NAME = "<your model ID on the Hugging Face Hub>"
+processor = AutoProcessor.from_pretrained(MODEL_NAME)
+model = AutoModelForCTC.from_pretrained(MODEL_NAME)
+# Move the model to the selected device (GPU if available, otherwise CPU)
+model.to(device)
+# Load audio file
+AUDIO_PATH = "<path to your audio file>"
+audio_input, sample_rate = torchaudio.load(AUDIO_PATH)
+# Ensure the audio is mono (1 channel)
+if audio_input.shape[0] > 1:
+    audio_input = torch.mean(audio_input, dim=0, keepdim=True)
+# Resample audio if necessary
+if sample_rate != 16000:
+    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
+    audio_input = resampler(audio_input)
+# Process the audio input (the processor expects a 1-D array of samples)
+input_values = processor(audio_input.squeeze().numpy(), sampling_rate=16000, return_tensors="pt").input_values
+# Move the input values to the same device as the model
+input_values = input_values.to(device)
+# Perform inference
+with torch.no_grad():
+    logits = model(input_values).logits
+# Decode the logits to text
+predicted_ids = torch.argmax(logits, dim=-1)
+transcription = processor.batch_decode(predicted_ids)[0]
+print("Transcription:", transcription)
+```
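The last two steps of the script (`torch.argmax` followed by `processor.batch_decode`) amount to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapsing rule on a toy example. The vocabulary, the `blank_id`, and the function name `ctc_greedy_decode` are made up for illustration and are not this model's actual tokenizer; the `|` word delimiter follows the usual Wav2Vec2 CTC tokenizer convention.

```python
from itertools import groupby


def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token IDs into text using the CTC rule."""
    # 1. Merge runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # 2. Drop blank tokens, which only separate genuine repeats.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Join characters and turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical): ID 0 is the blank/padding token.
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "k", 4: "u"}
# Per-frame argmax output: repeats are merged, but the blank between
# the final two "a" frames keeps them as two separate letters.
frame_ids = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 2, 0, 2]
print(ctc_greedy_decode(frame_ids, vocab))  # prints "aku aa"
```

Greedy decoding is the simplest option. A beam search with a language model may lower the error rate, but nothing in this card indicates that one was used.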
 ### Framework versions