---
license: apache-2.0
datasets:
- google/fleurs
metrics:
- wer
- accuracy
- cer
pipeline_tag: automatic-speech-recognition
tags:
- pitch
- f0
- echo
- whisper
- waveform
- spectrogram
- hilbert
- asr
- nlp
- new
---

# NLP model with acoustic positional encoding

## Echo

### Zero-value processing ASR model with voice-modulated rotary position encoding (vRoPE)

Experimental research model: some of the modules and functions in the code are not yet part of the active model.

Pitch-aware processing: Echo integrates F0/pitch information throughout the processing pipeline, making the model sensitive to the prosodic features of speech.

To highlight the relationship between pitch and rotary embeddings, Echo implements two complementary pitch-based enhancements:

1. The first uses pitch to modify theta (the rotary base frequency).
2. The second adds a direct pitch-similarity bias to attention.

By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model to relate acoustic features to sequence position in a way that is proportional to the characteristics of the voice. This creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.
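As a rough illustration of the first enhancement, the sketch below scales the rotary base by the utterance's normalized mean F0. The scheme, the 200 Hz reference pitch, and the `alpha` strength are illustrative assumptions, not the model's exact formula:

```python
import numpy as np

def rotary_angles(positions, dim, theta=10000.0):
    # Standard RoPE: per-dimension inverse frequencies derived from the base theta.
    inv_freq = 1.0 / (theta ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    return np.outer(positions, inv_freq)                       # (seq, dim/2)

def pitch_modulated_angles(positions, f0, dim, theta=10000.0, alpha=0.5):
    # Hypothetical scheme: scale the base by mean F0 normalized around a
    # nominal 200 Hz speaking pitch. Higher pitch gives a larger effective
    # theta, hence slower rotation and wider attention bands.
    f0_factor = np.mean(f0) / 200.0
    theta_eff = theta * (1.0 + alpha * (f0_factor - 1.0))
    return rotary_angles(positions, dim, theta_eff)

positions = np.arange(8)
f0 = np.full(8, 300.0)                               # a higher-pitched voice
base = rotary_angles(positions, dim=64)              # static 10k theta
mod = pitch_modulated_angles(positions, f0, dim=64)  # pitch-conditioned theta
```

With a 300 Hz voice the effective theta rises above 10,000, so the rotation angles shrink relative to the static baseline.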

The patterns below show how positions "see" each other as a function of theta and F0:

- Bright diagonal line: each position matches itself perfectly.
- Wider bright bands: positions can "see" farther (good for long-range dependencies, but can be noisy).
- Narrow bands: more focus on nearby positions (good for local patterns).
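These band structures can be reproduced by computing cosine similarities between rotary-encoded positions. `rope_similarity` is a standalone illustrative helper, not a function from the repo:

```python
import numpy as np

def rope_similarity(seq_len, dim, theta):
    # Cosine similarity between rotary position codes: the kind of matrix
    # visualized above. A larger theta means slower rotation and wider bands.
    inv_freq = 1.0 / (theta ** (np.arange(0, dim, 2) / dim))
    ang = np.outer(np.arange(seq_len), inv_freq)               # (seq, dim/2)
    emb = np.concatenate([np.cos(ang), np.sin(ang)], axis=-1)  # (seq, dim)
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)
    return emb @ emb.T                                         # (seq, seq)

sim_narrow = rope_similarity(32, 64, theta=100.0)    # fast rotation: narrow bands
sim_wide = rope_similarity(32, 64, theta=10000.0)    # slow rotation: wide bands
```

The diagonal is exactly 1 in both cases; off-diagonal similarity decays much faster with the smaller theta.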

![2](https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8)
![1](https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1)

A static theta of 10,000 is perfectly fine for a text model, but probably not for a model that operates on speech.

Echo's rotary implementation maps the perceptual properties of audio onto the mathematical properties of the rotary embeddings, creating a more adaptive, context-aware representation. Pitch is optionally extracted from audio in the data-processing pipeline and can be used as an additional feature alongside spectrograms, and/or to inform the rotary frequencies and the pitch bias.
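For illustration, here is a crude autocorrelation-based F0 estimate for a single frame. This is not the extractor Echo uses; production pipelines typically rely on pYIN, CREPE, or similar estimators:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=500.0):
    # Crude autocorrelation pitch estimate: find the lag (within plausible
    # pitch-period bounds) where the frame best matches a shifted copy of itself.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for fmin..fmax Hz
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(int(0.05 * sr)) / sr
tone = np.sin(2 * np.pi * 220.0 * t)          # 220 Hz test tone
f0 = estimate_f0(tone, sr)                    # close to 220 Hz
```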

### Pitch bias

The pitch-bias implementation creates an attention bias matrix that makes tokens with similar pitch attend to each other more strongly, which helps:

- Track speaker consistency
- Maintain coherent pitch patterns
- Group harmonically related segments

The theoretical foundation:

- Both position and pitch can be represented as frequencies.
- Speech has inherent rhythmic and tonal patterns that correlate with semantic content.
- Varying the rotation frequency based on pitch creates a more speech-aware positional encoding.

---

### Diagnostic test run with google/fleurs (spectrogram + f0_rotary)

<img width="570" alt="score" src="https://github.com/user-attachments/assets/679d5032-6e84-4fe6-892c-6b01c6cb14ce" />

```
📊 COMPONENT STATISTICS:
GATE: avg=0.638041, min=0.010094, max=2.071990, samples=135
MLP: avg=0.028625, min=0.003352, max=0.074448, samples=135
Q: avg=0.029973, min=0.001905, max=0.141696, samples=150
K: avg=0.030055, min=0.001910, max=0.144063, samples=150
V: avg=0.111713, min=0.050426, max=0.240650, samples=150
O: avg=0.108549, min=0.049052, max=0.244606, samples=150
LN: avg=0.092093, min=0.005017, max=0.349827, samples=285
ENCODER: avg=0.004097, min=0.001447, max=0.011093, samples=45

🚨 GATE vs MLP ACTIVATION PATTERNS:
🟢 encoder.blocks.spectrogram.1.: gate/mlp activation ratio=1.4918, sparsity difference=-0.0040
🟡 encoder.blocks.spectrogram.2.: gate/mlp activation ratio=2.5671, sparsity difference=-0.0096
🟢 encoder.blocks.spectrogram.3.: gate/mlp activation ratio=1.9277, sparsity difference=-0.0069
🟡 encoder.blocks.spectrogram.4.: gate/mlp activation ratio=2.6485, sparsity difference=-0.0118
🟡 decoder._blocks.0.: gate/mlp activation ratio=2.0988, sparsity difference=-0.0071
🟡 decoder._blocks.1.: gate/mlp activation ratio=2.1584, sparsity difference=-0.0102
🟡 decoder._blocks.2.: gate/mlp activation ratio=2.1087, sparsity difference=-0.0096
🟡 decoder._blocks.3.: gate/mlp activation ratio=2.2582, sparsity difference=-0.0045
🟡 decoder.blocks.spectrogram.0.: gate/mlp activation ratio=2.0964, sparsity difference=-0.0124
🟢 decoder.blocks.spectrogram.1.: gate/mlp activation ratio=1.9247, sparsity difference=-0.0021
🟢 decoder.blocks.spectrogram.2.: gate/mlp activation ratio=1.8573, sparsity difference=-0.0079
🟢 decoder.blocks.spectrogram.3.: gate/mlp activation ratio=1.8911, sparsity difference=-0.0062
```

## The F0-Conditioned Rotation Mechanism

The high gate usage validates the fundamental-frequency conditioning approach:

- The pitch-adaptive rotary embeddings provide a meaningful signal that the gates actively utilize.
- The decoder is learning to selectively attend to pitch-relevant patterns.
- The gates function as a kind of "pitch-aware filter" that determines which information flows through the network.
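The two diagnostics in the log above can be computed from captured activations roughly as follows. This is a sketch with synthetic data: the real run hooks these tensors out of the model, and both `activation_stats` and the near-zero threshold are illustrative assumptions:

```python
import numpy as np

def activation_stats(gate_act, mlp_act, thresh=1e-3):
    # "activation ratio" is mean |gate| / mean |mlp|; "sparsity difference"
    # compares the fraction of near-zero entries in each tensor. The 1e-3
    # threshold is an assumption, not the diagnostic run's exact setting.
    ratio = np.abs(gate_act).mean() / np.abs(mlp_act).mean()
    gate_sparsity = (np.abs(gate_act) < thresh).mean()
    mlp_sparsity = (np.abs(mlp_act) < thresh).mean()
    return ratio, gate_sparsity - mlp_sparsity

rng = np.random.default_rng(0)
gate = rng.normal(0.0, 0.6, size=(135, 512))    # synthetic gate activations
mlp = rng.normal(0.0, 0.03, size=(135, 512))    # synthetic MLP activations
ratio, sp_diff = activation_stats(gate, mlp)
```

A ratio above 1 with a negative sparsity difference, as in the log, indicates the gates fire both more strongly and more densely than the MLP path.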