File size: 4,214 Bytes
6ab2ab5
 
77a7228
 
 
 
67b03c6
d12caca
6ab2ab5
ff26e5c
aca0a0f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfc8509
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
aca0a0f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
05546cd
aca0a0f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ff26e5c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
library_name: keras-hub
license: mit
tags:
- speech-recognition
- keras
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
---
### Model Overview
⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.

A Whisper encoder-decoder network for speech.

This class implements a Transformer-based encoder-decoder model as
described in
["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356).
It includes the embedding lookups and transformer layers, but not the head
for predicting the next token.

The default constructor gives a fully customizable, randomly initialized Whisper
model with any number of layers, heads, and embedding dimensions. To load
preset architectures and weights, use the `from_preset()` constructor.

Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/openai/whisper).

## Links
* [Whisper Quickstart Notebook](coming soon)
* [Whisper API Documentation](https://keras.io/keras_hub/api/models/whisper/)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page.

__Arguments__


- __vocabulary_size__: int. The size of the token vocabulary.
- __num_layers__: int. The number of transformer encoder layers and
    transformer decoder layers.
- __num_heads__: int. The number of attention heads for each transformer.
    The hidden size must be divisible by the number of attention heads.
- __hidden_dim__: int. The size of the transformer encoding and pooler layers.
- __intermediate_dim__: int. The output dimension of the first Dense layer in
    a two-layer feedforward network for each transformer.
- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
- __dropout__: float. Dropout probability for the Transformer encoder.
- __max_encoder_sequence_length__: int. The maximum sequence length that the
    audio encoder can consume. Since the second convolutional layer in
    the encoder reduces the sequence length by half (stride of 2), we
    use `max_encoder_sequence_length // 2` as the sequence length for the
    positional embedding layer.
- __max_decoder_sequence_length__: int. The maximum sequence length that the
    text decoder can consume.

## Example Usage
```python
import keras_hub
import keras_core as keras
import numpy as np
```



```python
input_data = {
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="int32"),
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```

## Example Usage with Hugging Face URI

```python
import keras_hub
import keras_core as keras
import numpy as np
```



```python
input_data = {
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="int32"),
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```