Blackroot/SimpleDiffusion-MultiHeadAttentionNope


Blackroot committed on Jan 18

Commit 5c42973 · verified · 1 Parent(s): b3ebe96

Update README.md

Files changed (1)

README.md +2 -2


@@ -5,11 +5,11 @@ license: mit
 A semi custom network trained from scratch for 799 epochs based on the follow paper [Simpler Diffusion (SiD2)](https://arxiv.org/abs/2410.19324v1)
-[Modeling](https://huggingface.co/Blackroot/SimpleDiffusion-TensorProductAttentionRope/blob/main/models/uvit.py) || [Training](https://huggingface.co/Blackroot/SimpleDiffusion-TensorProductAttentionRope/blob/main/train.py)
+[Modeling](./blob/main/models/uvit.py) || [Training](./blob/main/train.py)
 This network uses the optimal transport flow matching objective outlined [Flow Matching for Generative Modeling](https://arxiv.org/abs/2210.02747)
-A modified tensor product attention with rope is used instead of regular MHA [Tensor Product Attention is All You Need](https://arxiv.org/abs/2501.06425)
+This is using multi head attention with no positional embeddings. [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)
 xATGLU Layers are used in some places [Expanded Gating Ranges Improve Activation Functions](https://arxiv.org/pdf/2405.20768)
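
For readers unfamiliar with what "multi head attention with no positional embeddings" means in practice, here is a minimal NumPy sketch. It is not taken from `models/uvit.py`: the function name `nope_mha`, the weight names, and the shapes are hypothetical, and the repository's actual layer may differ (for example in normalization, biases, or fused projections). The point it illustrates is only that queries and keys are used as projected, with no rotary or additive position term applied before the dot product.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def nope_mha(x, wq, wk, wv, wo, num_heads):
    """Multi-head self-attention without positional embeddings (illustrative).

    x:              (seq, dim) token features
    wq, wk, wv, wo: (dim, dim) projection matrices (hypothetical names)
    """
    seq, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (seq, dim) -> (heads, seq, head_dim)
        return t.reshape(seq, num_heads, head_dim).transpose(1, 0, 2)

    # No RoPE rotation and no added position vectors: q and k are plain projections.
    q = split_heads(x @ wq)
    k = split_heads(x @ wk)
    v = split_heads(x @ wv)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                          # (heads, seq, head_dim)

    # Merge heads back to (seq, dim) and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, dim)
    return out @ wo


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 8))
    weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(4)]
    y = nope_mha(x, *weights, num_heads=2)

    # Without positional information, shuffling the tokens just shuffles the outputs.
    perm = rng.permutation(6)
    y_perm = nope_mha(x[perm], *weights, num_heads=2)
    print(np.allclose(y_perm, y[perm]))
```

One consequence visible in the usage example is that an unmasked attention layer of this kind is permutation-equivariant on its own, so any spatial information has to come from elsewhere in the network.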