Update README.md

---
license: apache-2.0
datasets:
- google/fleurs
metrics:
- wer
- accuracy
- cer
pipeline_tag: automatic-speech-recognition
tags:
- pitch
- f0
- echo
- whisper
- waveform
- spectrogram
- hilbert
- asr
- nlp
- new
---

NLP/ASR multimodal pitch-aware model. Research model.

<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />

**This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech, "clean" split). It also clearly identifies this as a heavily processed / "clean" dataset.**

<img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />

**This dataset has gone through fewer processing / "cleaning" steps, as can be seen in the spectrogram. The pitch isn't affected.**
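
A minimal sketch of how a pitch-track/spectrogram pair like the ones above can be extracted. torchaudio and the file path are assumptions for illustration, not necessarily the tooling behind these figures:

```python
import torchaudio

# Load one utterance (hypothetical path).
waveform, sr = torchaudio.load("sample.flac")

# The two signals compared in the plots: a mel spectrogram and a per-frame f0 track.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
f0 = torchaudio.functional.detect_pitch_frequency(waveform, sr)

print(mel.shape, f0.shape)  # e.g. (1, 80, n_frames) and (1, n_frames')
```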

To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements (a minimal sketch follows the list):

1. **Pitch-modulated theta:** Pitch (f0) is used to modify the theta parameter, dynamically adjusting the rotary frequency.
2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
3. **Variable radii in torch.polar:** The unit-circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0, time-aligned with the tokens. This creates acoustically-weighted positional encodings in which the "loudness" of each position in the embedding space reflects the acoustic prominence in the original speech, and it effectively adds phase information without significant computational overhead.
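
A sketch of enhancement 3, assuming per-token f0 values: the function name, the `tanh` radius mapping, and the 220 Hz reference below are illustrative choices, not the model's exact code.

```python
import torch

def apply_rotary_with_f0(x: torch.Tensor, f0: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x:  (seq_len, dim) token embeddings, dim even
    # f0: (seq_len,) per-token pitch in Hz (0.0 where unvoiced)
    seq_len, dim = x.shape
    # Standard RoPE inverse-frequency schedule.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    # Enhancement 3: variable radii in place of the unit circle; a radius of
    # 1.0 everywhere recovers plain RoPE (the tanh mapping is an assumption).
    radius = 1.0 + torch.tanh(f0 / 220.0).unsqueeze(-1)    # (seq_len, 1)
    rot = torch.polar(radius.expand_as(angles), angles)    # complex rotations with |z| != 1
    x_complex = torch.view_as_complex(x.float().reshape(seq_len, dim // 2, 2))
    return torch.view_as_real(x_complex * rot).reshape(seq_len, dim)
```

With `f0` all zeros this reduces toward standard RoPE; voiced, acoustically prominent tokens get a larger radius and therefore a larger footprint in the attention dot product.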
<img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />

Each figure shows 4 subplots (one for each of the first 4 dimensions of the embeddings in the test run). These visualizations show how pitch information modifies position-encoding patterns in the model.

4. **Position-specific variations**: In standard RoPE, frequency decreases with dimension index, but F0 adaptation modifies this pattern (see the sketch below).
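
An earlier revision of this card gave the corresponding frequency computation as a single expression; a runnable form is sketched below (the wrapper function and its signature are assumptions), where `theta` is the pitch-modulated base described above:

```python
import torch

def pitch_scaled_freqs(theta: float, dim: int, device=None, dtype=torch.float32) -> torch.Tensor:
    # Mel-scale spacing up to 8 kHz: mel(f) = 2595 * log10(1 + f / 700).
    mel_max = 2595 * torch.log10(torch.tensor(1 + 8000 / 700))
    mels = torch.linspace(0, mel_max.item(), dim // 2, device=device, dtype=dtype)
    # theta is the pitch-modulated base; 220.0 Hz acts as the reference pitch.
    return (theta / 220.0) * 700 * (torch.pow(10, mels / 2595) - 1) / 1000
```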
The patterns below show how positions "see" each other in relation to theta and f0.

- Bright diagonal line: each position matches itself perfectly.
- Wider bright bands: positions can "see" farther (good for long dependencies) but can be noisy.
- Narrow bands: more focus on nearby positions (good for local patterns).
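
A minimal sketch of one way to compute such a position-similarity map, assuming plain cos/sin rotary features (illustrative only; not necessarily how the figures below were generated):

```python
import torch

def position_similarity_map(freqs: torch.Tensor, seq_len: int = 64) -> torch.Tensor:
    # Encode each position with rotary cos/sin features at the given frequencies.
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, freqs)                       # (seq_len, dim/2)
    enc = torch.cat([angles.cos(), angles.sin()], dim=-1)  # (seq_len, dim)
    enc = enc / enc.norm(dim=-1, keepdim=True)
    # Cosine similarity between all pairs of positions: the bright diagonal is
    # each position matching itself; band width shows the effective context.
    return enc @ enc.T

# e.g. position_similarity_map(pitch_scaled_freqs(220.0, 64))
```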
<img width="680" alt="cc" src="https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8" />
<img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

----
#### Diagnostic test run where 1 epoch = 1000 steps = 1000 samples:
<img width="680" alt="1555" src="https://github.com/user-attachments/assets/5bed0421-e32f-4234-ab55-51d64eb927ef" />
<img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />