Update README.md
README.md
CHANGED
@@ -25,11 +25,7 @@ NLP/ASR multimodal pitch aware model. Research model.
|
|
25 |
|
26 |
<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
|
27 |
|
28 |
-
**This plot illustrates the pattern similarity between pitch and spectrogram (LibriSpeech, clean).**
|
29 |
-
|
30 |
-
<img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />
|
31 |
-
|
32 |
-
**This dataset has gone through fewer processing ("cleaning") steps, as the spectrogram shows. The pitch isn't affected.**
|
33 |
|
34 |
To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
|
35 |
|
@@ -37,6 +33,39 @@ To highlight the relationship between pitch and rotary embeddings, the model imp
|
|
37 |
2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
|
38 |
3. **Variable radii in `torch.polar`:** The unit circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0. This creates acoustically weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase information without significant computational overhead.
|
39 |
40 |
|
41 |
<img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />
|
42 |
|
|
|
25 |
|
26 |
<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
|
27 |
|
28 |
+
**This plot illustrates the pattern similarity between pitch and spectrogram (LibriSpeech, clean).**
|
29 |
|
30 |
To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
|
31 |
|
|
|
33 |
2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
|
34 |
3. **Variable radii in `torch.polar`:** The unit circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0. This creates acoustically weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase information without significant computational overhead.
|
35 |
|
36 |
+
The function `torch.polar` constructs a complex tensor from polar coordinates:
|
37 |
+
|
38 |
+
````python
# torch.polar(magnitude, angle) is equivalent to:
result = magnitude * (torch.cos(angle) + 1j * torch.sin(angle))
````
|
42 |
+
|
43 |
+
So, for each element:
- **magnitude** is the modulus (radius, r)
- **angle** is the phase (theta, in radians)
- The result is: `r * exp(i * theta) = r * (cos(theta) + i * sin(theta))`
|
47 |
+
|
48 |
+
Reference: [PyTorch Documentation - torch.polar](https://pytorch.org/docs/stable/generated/torch.polar.html)
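
The identity above can be checked on single values. The following is a minimal scalar sketch using only Python's standard library; the helper name `polar` is hypothetical and stands in for the element-wise behaviour of `torch.polar`, it is not the PyTorch function itself.

```python
import cmath
import math


def polar(magnitude: float, angle: float) -> complex:
    """Scalar analogue of the formula above: r * (cos(theta) + i * sin(theta))."""
    return magnitude * complex(math.cos(angle), math.sin(angle))


r, theta = 0.5, math.pi / 3
z = polar(r, theta)

# Matches the standard-library polar-to-rectangular conversion.
assert cmath.isclose(z, cmath.rect(r, theta))
# The modulus recovers the radius and the argument recovers the phase.
assert math.isclose(abs(z), r)
assert math.isclose(cmath.phase(z), theta)
```

With `r = 1` this reduces to a pure rotation on the unit circle; any other radius scales the result in addition to rotating it.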
|
49 |
+
|
50 |
+
Here are the abbreviated steps for replacing theta and the radius in the rotary `forward` pass:
|
51 |
+
|
52 |
+
```python
# Excerpt from the rotary forward; x, t, device, dtype, and self come from the surrounding method.
f0 = f0.to(device, dtype)  # feature extracted during processing
f0_mean = f0.mean()  # mean only used as theta in freqs calculation
theta = f0_mean + self.theta
freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2, device=device, dtype=dtype) / 2595) - 1) / 1000
freqs = t[:, None] * freqs[None, :]

radius = f0.to(device, dtype)  # we want to avoid using the mean of f0 (or any stat or interpolation)
if radius.shape[0] != x.shape[0]:  # encoder outputs will already be the correct length
    F = radius.shape[0] / x.shape[0]
    idx = torch.arange(x.shape[0], device=f0.device)
    idx = (idx * F).long().clamp(0, radius.shape[0] - 1)
    radius = radius[idx]  # it's the best method I know of that retains f0 character
radius = radius.unsqueeze(-1).expand(-1, freqs.shape[-1])
radius = torch.sigmoid(radius)
freqs = torch.polar(radius, freqs)
```
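
To make the individual steps easier to follow, here is a rough pure-Python walk-through of the same arithmetic on plain lists. It is a sketch under stated assumptions, not the model's code: all function names are hypothetical, `t` is assumed to be the positions `0..seq_len-1`, `base_theta` is a placeholder for `self.theta` (its real value is not given here), and `dim // 2` is assumed to be at least 2.

```python
import cmath
import math


def mel_spaced_freqs(theta: float, n: int, f_max: float = 8000.0) -> list[float]:
    """n values evenly spaced on the mel scale from 0 to f_max Hz, converted
    back to Hz, then scaled by theta / 220 and divided by 1000."""
    mel_max = 2595.0 * math.log10(1.0 + f_max / 700.0)
    mels = [mel_max * i / (n - 1) for i in range(n)]  # mirrors linspace(0, mel_max, n)
    return [(theta / 220.0) * 700.0 * (10.0 ** (m / 2595.0) - 1.0) / 1000.0 for m in mels]


def resample_nearest(values: list[float], length: int) -> list[float]:
    """Select `length` entries by index, with no averaging or interpolation."""
    if len(values) == length:
        return list(values)
    factor = len(values) / length
    return [values[min(int(i * factor), len(values) - 1)] for i in range(length)]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def pitch_rotary(f0: list[float], seq_len: int, dim: int,
                 base_theta: float = 10000.0) -> list[list[complex]]:
    """Complex rotary table of shape (seq_len, dim // 2) with f0-derived radii."""
    theta = sum(f0) / len(f0) + base_theta           # mean f0 only shifts theta
    freqs = mel_spaced_freqs(theta, dim // 2)
    radius = [sigmoid(v) for v in resample_nearest(f0, seq_len)]
    # Each position rotates by t * freq and is scaled by its own radius.
    return [[cmath.rect(radius[t], t * f) for f in freqs] for t in range(seq_len)]
```

Every row of the result shares one radius, so the per-frame pitch value sets the magnitude while the position still sets the angle. If f0 is passed in raw Hz, the sigmoid is close to 1 for voiced frames and exactly 0.5 where f0 is 0.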
|
69 |
|
70 |
<img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />
|
71 |
|