Update README.md

---
license: apache-2.0
datasets:
- google/fleurs
metrics:
- wer
- accuracy
- cer
pipeline_tag: automatic-speech-recognition
tags:
- pitch
- f0
- echo
- whisper
- waveform
- spectrogram
- hilbert
- asr
- nlp
- new
---

NLP/ASR multimodal pitch-aware model. Research model.

<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />

**This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech, "clean" split). It also clearly identifies this as a heavily processed / "clean" dataset.**

<img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />

**This dataset has gone through fewer processing / "cleaning" steps, as can be seen in the spectrogram. The pitch isn't affected.**
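
A minimal sketch of how a pitch-track/spectrogram pair like the ones above can be extracted. torchaudio and the file path are assumptions for illustration, not necessarily the tooling behind these figures:

```python
import torchaudio

# Load one utterance (hypothetical path).
waveform, sr = torchaudio.load("sample.flac")

# The two signals compared in the plots: a mel spectrogram and a per-frame f0 track.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
f0 = torchaudio.functional.detect_pitch_frequency(waveform, sr)

print(mel.shape, f0.shape)  # e.g. (1, 80, n_frames) and (1, n_frames')
```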

To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements (a minimal sketch follows the list):

1. **Pitch-modulated theta:** Pitch (f0) is used to modify the theta parameter, dynamically adjusting the rotary frequency.
2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
3. **Variable radii in torch.polar:** The unit-circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0, time-aligned with the tokens. This creates acoustically-weighted positional encodings in which the "loudness" of each position in the embedding space reflects the acoustic prominence in the original speech, and it effectively adds phase information without significant computational overhead.
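
A sketch of enhancement 3, assuming per-token f0 values: the function name, the `tanh` radius mapping, and the 220 Hz reference below are illustrative choices, not the model's exact code.

```python
import torch

def apply_rotary_with_f0(x: torch.Tensor, f0: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x:  (seq_len, dim) token embeddings, dim even
    # f0: (seq_len,) per-token pitch in Hz (0.0 where unvoiced)
    seq_len, dim = x.shape
    # Standard RoPE inverse-frequency schedule.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    # Enhancement 3: variable radii in place of the unit circle; a radius of
    # 1.0 everywhere recovers plain RoPE (the tanh mapping is an assumption).
    radius = 1.0 + torch.tanh(f0 / 220.0).unsqueeze(-1)    # (seq_len, 1)
    rot = torch.polar(radius.expand_as(angles), angles)    # complex rotations with |z| != 1
    x_complex = torch.view_as_complex(x.float().reshape(seq_len, dim // 2, 2))
    return torch.view_as_real(x_complex * rot).reshape(seq_len, dim)
```

With `f0` all zeros this reduces toward standard RoPE; voiced, acoustically prominent tokens get a larger radius and therefore a larger footprint in the attention dot product.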
<img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />

Each figure shows 4 subplots (one for each of the first 4 dimensions of the embeddings in the test run). These visualizations show how pitch information modifies position-encoding patterns in the model.

4. **Position-specific variations**: In standard RoPE, frequency decreases with dimension index, but F0 adaptation modifies this pattern (see the sketch below).
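
An earlier revision of this card gave the corresponding frequency computation as a single expression; a runnable form is sketched below (the wrapper function and its signature are assumptions), where `theta` is the pitch-modulated base described above:

```python
import torch

def pitch_scaled_freqs(theta: float, dim: int, device=None, dtype=torch.float32) -> torch.Tensor:
    # Mel-scale spacing up to 8 kHz: mel(f) = 2595 * log10(1 + f / 700).
    mel_max = 2595 * torch.log10(torch.tensor(1 + 8000 / 700))
    mels = torch.linspace(0, mel_max.item(), dim // 2, device=device, dtype=dtype)
    # theta is the pitch-modulated base; 220.0 Hz acts as the reference pitch.
    return (theta / 220.0) * 700 * (torch.pow(10, mels / 2595) - 1) / 1000
```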
The patterns below show how positions "see" each other in relation to theta and f0.

- Bright diagonal line: each position matches itself perfectly.
- Wider bright bands: positions can "see" farther (good for long dependencies) but can be noisy.
- Narrow bands: more focus on nearby positions (good for local patterns).
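
A minimal sketch of one way to compute such a position-similarity map, assuming plain cos/sin rotary features (illustrative only; not necessarily how the figures below were generated):

```python
import torch

def position_similarity_map(freqs: torch.Tensor, seq_len: int = 64) -> torch.Tensor:
    # Encode each position with rotary cos/sin features at the given frequencies.
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(pos, freqs)                       # (seq_len, dim/2)
    enc = torch.cat([angles.cos(), angles.sin()], dim=-1)  # (seq_len, dim)
    enc = enc / enc.norm(dim=-1, keepdim=True)
    # Cosine similarity between all pairs of positions: the bright diagonal is
    # each position matching itself; band width shows the effective context.
    return enc @ enc.T

# e.g. position_similarity_map(pitch_scaled_freqs(220.0, 64))
```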
<img width="680" alt="cc" src="https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8" />
<img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

----
#### Diagnostic test run where 1 epoch = 1000 steps = 1000 samples:
<img width="680" alt="1555" src="https://github.com/user-attachments/assets/5bed0421-e32f-4234-ab55-51d64eb927ef" />
<img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />