---
license: apache-2.0
language:
- ja
library_name: nemo
pipeline_tag: automatic-speech-recognition
---
# ASR Model Card: parakeet-ctc-1.1b-ja

## Model Details

- **Model Name**: parakeet-ctc-1.1b-ja
- **Type**: Automatic Speech Recognition (ASR)
- **Language**: Japanese
- **Framework**: NVIDIA NeMo

## Installation

To use this model, you need to install the NeMo toolkit:

```bash
pip install "nemo-toolkit[asr]==2.0.0rc0"
```
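
To confirm that the installation succeeded, you can query the installed package version from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of NeMo.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "nemo-toolkit" is the distribution name used in the pip command above.
nemo_version = installed_version("nemo-toolkit")
if nemo_version is None:
    print("nemo-toolkit is not installed")
else:
    print(f"nemo-toolkit {nemo_version} is installed")
```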

## Usage

Here's a basic example of how to use the model:

```python
import nemo.collections.asr as nemo_asr

# Load the model
nemo_model = nemo_asr.models.ASRModel.restore_from("/path/to/parakeet-ja.nemo")

# Transcribe audio files
audio_files = ["path/to/audio1.wav", "path/to/audio2.wav"]
transcriptions = nemo_model.transcribe(audio_files)

# Print transcriptions
for audio_file, transcription in zip(audio_files, transcriptions):
    print(f"Transcription for {audio_file}: {transcription}")
```
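
For more than a handful of files, it can help to gather the audio paths from a directory and save the results to disk. The sketch below is one possible approach, not part of the NeMo API: the helper names `collect_wav_files` and `save_transcriptions`, the output file name, and the JSON Lines layout are all assumptions made for illustration. It also assumes each returned item is either a plain string or an object with a `text` attribute, since the return type of `transcribe` can differ between NeMo versions.

```python
import json
from pathlib import Path


def collect_wav_files(audio_dir):
    """Return the sorted paths of all .wav files directly inside audio_dir."""
    return sorted(str(p) for p in Path(audio_dir).glob("*.wav"))


def save_transcriptions(audio_files, transcriptions, output_path):
    """Write one JSON object per line, pairing each audio path with its text."""
    with open(output_path, "w", encoding="utf-8") as f:
        for audio_file, item in zip(audio_files, transcriptions):
            # Assumption: item is a str, or an object exposing a .text attribute.
            text = getattr(item, "text", item)
            record = {"audio_filepath": audio_file, "text": str(text)}
            # ensure_ascii=False keeps Japanese characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example usage with the model loaded above (paths are placeholders):
# audio_files = collect_wav_files("path/to/audio_dir")
# transcriptions = nemo_model.transcribe(audio_files)
# save_transcriptions(audio_files, transcriptions, "transcriptions.jsonl")
```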

## Limitations

- This model is trained specifically for Japanese and may not perform well on other languages.
- Transcription accuracy may vary with audio quality, background noise, and speaker accent.
- The model may struggle with specialized vocabulary or technical terms not encountered during training.

## Performance

The following table compares parakeet-ctc-1.1b-ja with Whisper v2 large and Whisper v3 large on three Japanese ASR datasets, reporting word error rate (WER) and character error rate (CER). Lower is better for both metrics.

| Model                       | Dataset                               | WER    | CER    |
|-----------------------------|---------------------------------------|--------|--------|
| Whisper v2 large            | japanese-asr/ja_asr.reazonspeech_test | 1.1378 | 0.3472 |
| Whisper v2 large            | japanese-asr/ja_asr.jsut_basic5000    | 0.8988 | 0.1063 |
| Whisper v2 large            | japanese-asr/ja_asr.common_voice_8_0  | 1.0314 | 0.1594 |
| Whisper v3 large            | japanese-asr/ja_asr.reazonspeech_test | 0.9685 | 0.2107 |
| Whisper v3 large            | japanese-asr/ja_asr.jsut_basic5000    | 0.9936 | 0.1360 |
| Whisper v3 large            | japanese-asr/ja_asr.common_voice_8_0  | 1.0178 | 0.1548 |
| NeMo (parakeet-ctc-1.1b-ja) | japanese-asr/ja_asr.reazonspeech_test | 0.7785 | 0.1521 |
| NeMo (parakeet-ctc-1.1b-ja) | japanese-asr/ja_asr.jsut_basic5000    | 0.9462 | 0.1291 |
| NeMo (parakeet-ctc-1.1b-ja) | japanese-asr/ja_asr.common_voice_8_0  | 1.0002 | 0.1290 |
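
To make the two metrics concrete, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between reference and hypothesis tokens, divided by the number of reference tokens. This card does not state which text normalization or word segmentation was used for the numbers above, so this code is illustrative and is not the evaluation script behind the table. All function names are hypothetical. Because Japanese is usually written without spaces, WER depends heavily on how the text is segmented into words; the whitespace tokenizer below is only a placeholder.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_tokens(text):
    """Split text into characters, ignoring spaces (used for CER)."""
    return list(text.replace(" ", ""))


def word_tokens(text):
    """Split text on whitespace (placeholder segmentation for WER)."""
    return text.split()


def error_rate(references, hypotheses, tokenize):
    """Total edit distance divided by total reference length over a corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total


# One substituted character out of five reference characters.
cer = error_rate(["こんにちは"], ["こんにちわ"], char_tokens)
print(f"CER: {cer:.4f}")
```

Because insertions count as errors, either metric can exceed 1.0 when the hypothesis contains many more tokens than the reference.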

## Ethical Considerations

- Ensure that you have the necessary permissions and comply with local laws when recording and transcribing audio.
- Be aware of potential biases in the model, especially regarding different Japanese dialects or accents.
- Consider the privacy implications of transcribing personal or sensitive conversations.

## Additional Information

For more detailed information on using ASR models with the NeMo toolkit, please refer to the [NeMo ASR documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html).