Sin2pi committed on
Commit
018f69e
·
verified ·
1 Parent(s): 2bc5a91

Update README.md

  1. README.md +34 -5
README.md CHANGED
@@ -25,11 +25,7 @@ NLP/ASR multimodal pitch aware model. Research model.
25
 
26
  <img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
27
 
28
- **This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech, clean). It also clearly identifies this as a heavily processed / "clean" dataset.
29
-
30
- <img width="680" alt="1555" src="https://github.com/user-attachments/assets/14276b99-cf96-4022-9a16-4ac8ed1f6404" />
31
-
32
- **This dataset has gone through fewer processing / "cleaning" steps, as can be seen in the spectrogram. The pitch isn't affected.
33
 
34
  To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
35
 
@@ -37,6 +33,39 @@ To highlight the relationship between pitch and rotary embeddings, the model imp
37
  2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
38
  3. **Variable radii in torch.polar:** The unit circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0. This creates acoustically-weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase information without significant computational overhead.
39
 
40
 
41
  <img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />
42
 
 
25
 
26
  <img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
27
 
28
+ **This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech, clean).**
29
 
30
  To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
31
 
 
33
  2. **Direct similarity bias:** A pitch-based similarity bias is added directly to the attention mechanism.
34
  3. **Variable radii in torch.polar:** The unit circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0. This creates acoustically-weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase information without significant computational overhead.
35
 
36
+ The function `torch.polar` constructs a complex tensor from polar coordinates:
37
+
38
+ ````python
39
+ # torch.polar(magnitude, angle) returns:
40
+ result = magnitude * (torch.cos(angle) + 1j * torch.sin(angle))
41
+ ````
42
+
43
+ So, for each element:
44
+ - **magnitude** is the modulus (radius, r)
45
+ - **angle** is the phase (theta, in radians)
46
+ - The result is: `r * exp(i * theta) = r * (cos(theta) + i * sin(theta))`
47
+
48
+ Reference: [PyTorch Documentation - torch.polar](https://pytorch.org/docs/stable/generated/torch.polar.html)
49
+
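As a quick check of the identity above, here is a minimal sketch that compares `torch.polar` with the explicit `r * (cos(theta) + i * sin(theta))` form. The tensor values are made up for illustration; the point is that changing the radius rescales the magnitude and leaves the angle untouched.

```python
import math
import torch

# Illustrative values only: three angles and three non-unit radii.
angle = torch.tensor([0.0, math.pi / 4, math.pi / 2])
radius = torch.tensor([0.5, 1.0, 2.0])

unit = torch.polar(torch.ones_like(angle), angle)   # standard rotary case, r = 1
scaled = torch.polar(radius, angle)                 # variable-radius case

# Explicit form of the same computation.
manual = radius * (torch.cos(angle) + 1j * torch.sin(angle))

print(scaled)
print(scaled.abs())    # magnitudes follow the radius
print(scaled.angle())  # angles are the same as in the unit-radius case
```

Note that `torch.polar` expects both arguments to be floating point tensors of the same dtype.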
50
+ Here are the abbreviated steps for replacing theta and radius in the rotary forward:
51
+
52
+ ```python
53
+ f0 = f0.to(device, dtype) # feature extracted during processing
54
+ f0_mean = f0.mean() # mean only used as theta in freqs calculation
55
+ theta = f0_mean + self.theta
56
+ freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2, device=device, dtype=dtype) / 2595) - 1) / 1000
57
+ freqs = t[:, None] * freqs[None, :]
58
+
59
+ radius = f0.to(device, dtype) # we want to avoid using the mean of f0 (or any stat or interpolation)
60
+ if radius.shape[0] != x.shape[0]: # encoder outputs will already be the correct length
61
+     F = radius.shape[0] / x.shape[0]
62
+     idx = torch.arange(x.shape[0], device=f0.device)
63
+     idx = (idx * F).long().clamp(0, radius.shape[0] - 1)
64
+     radius = radius[idx] # it's the best method I know of that retains f0 character
65
+ radius = radius.unsqueeze(-1).expand(-1, freqs.shape[-1])
66
+ radius = torch.sigmoid(radius)
67
+ freqs = torch.polar(radius, freqs)
68
+ ```
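The abbreviated steps above depend on names defined elsewhere in the model (`self.theta`, `self.dim`, `t`, `x`). The following is a standalone sketch of the same idea, not the model's actual forward: the function name `pitch_rotary`, the `base_theta` default of 10000.0, and the use of `seq_len` in place of `x.shape[0]` are all assumptions made so the snippet runs on its own.

```python
import torch

def pitch_rotary(f0, seq_len, dim, base_theta=10000.0):
    """Hypothetical standalone version of the steps above.

    f0:         1-D float tensor of per-frame pitch values
    seq_len:    number of positions to encode (stands in for x.shape[0])
    dim:        head dimension; dim // 2 frequencies are produced
    base_theta: assumed stand-in for self.theta
    """
    device, dtype = f0.device, f0.dtype

    # Angle part: the f0 mean only shifts theta, which scales mel-spaced frequencies.
    theta = f0.mean() + base_theta
    mel_max = 2595.0 * torch.log10(torch.tensor(1.0 + 8000.0 / 700.0)).item()
    mel = torch.linspace(0.0, mel_max, dim // 2, device=device, dtype=dtype)
    freqs = (theta / 220.0) * 700.0 * (torch.pow(10.0, mel / 2595.0) - 1.0) / 1000.0
    t = torch.arange(seq_len, device=device, dtype=dtype)
    angles = t[:, None] * freqs[None, :]              # (seq_len, dim // 2)

    # Radius part: select f0 frames by index, with no averaging or interpolation.
    radius = f0
    if radius.shape[0] != seq_len:
        ratio = radius.shape[0] / seq_len
        idx = (torch.arange(seq_len, device=device) * ratio).long()
        radius = radius[idx.clamp(0, radius.shape[0] - 1)]
    radius = torch.sigmoid(radius).unsqueeze(-1).expand(-1, angles.shape[-1])

    return torch.polar(radius.contiguous(), angles)   # complex, (seq_len, dim // 2)

# Made-up f0 track: six frames reduced to three positions.
f0 = torch.tensor([0.0, 120.0, 130.0, 0.0, 110.0, 115.0])
out = pitch_rotary(f0, seq_len=3, dim=8)
print(out.shape)
print(out.abs()[:, 0])
```

One thing this sketch makes visible: if f0 is passed in raw Hz, `torch.sigmoid` saturates at about 1.0 for voiced frames and gives exactly 0.5 where f0 is 0, so the radius mostly separates voiced from unvoiced positions. Whether f0 is normalized before this step is not stated here.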
69
 
70
  <img width="780" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />
71