Spaces: harmdevries/transformer_inference
harmdevries committed on Nov 3, 2022

Commit

52a11ab

1 Parent(s): 82370ff

Update app.py

Files changed (1)

app.py CHANGED

@@ -134,7 +134,7 @@ c1.write("Multi-Head Attention:")
 c2.write(str(round(mha_total_time, 2)))
 c1.write("Multi-Query Attention:")
 c2.write(str(round(mqa_total_time, 2)))
-c1.write("Speed-up MQA over MHA: ")
+c1.write("Speed-up MQA over MHA:")
 c2.write(str(round(mha_total_time/mqa_total_time,2)))
 st.subheader("Memory consumption")
@@ -161,7 +161,7 @@ acts = round(2*bs*l*(d/h)*2*n/1e9, 2)
 c2.write(str(acts))
 st.subheader("Approximations")
-st.markdown("[We use the [following crude approximation](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf) to estimate the execution time for each matrix multiplication.")
+st.markdown("We use the [following crude approximation](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf) to estimate the execution time for each matrix multiplication.")
 st.latex("C = A \cdot B")
 st.latex("A \in \mathbb{R}^{MxK}, B \in R^{KxN}, C \in \mathbb{R}^{MxN}")
@@ -173,11 +173,6 @@ To execute this operation on the GPU, we need to
 3. Write C to memory
 ''')
-st.latex('''
-For float16 operations (2 bytes), we can estimate the memory access time of A as follows:
-T_mem(A) = 2*M*K / BW_mem
-where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for A100)
-''')
 st.latex('''
 For float16 operations (2 bytes), we can estimate the memory access time of A as follows:
@@ -185,6 +180,9 @@ T_mem(A) = 2*M*K / BW_mem
 where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for A100)
 ''')
+st.markdown("For float16 operations (2 bytes), we can estimate the memory access time of A as follows:")
+st.latex("T_{mem}(A) = 2*M*K / BW_{mem}")
+st.markdown("where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for A100)")
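
To make the memory-access estimate in this diff concrete, here is a minimal sketch of T_mem(A) = 2*M*K / BW_mem in plain Python. It is not part of app.py: the function name `mem_time_fp16`, the constant `A100_BW_MEM`, and the 4096 x 4096 matrix shape are hypothetical, and the sketch assumes 1 GB = 1e9 bytes when converting the 1935 GB/s figure quoted in the text.

```python
# Assumption: the 1935 GB/s A100 figure from the text, with 1 GB taken as 1e9 bytes.
A100_BW_MEM = 1935e9  # bytes per second


def mem_time_fp16(m: int, k: int, bw_mem: float = A100_BW_MEM) -> float:
    """Estimate the seconds needed to move an M x K float16 matrix.

    Implements T_mem(A) = 2*M*K / BW_mem, where the factor 2 is the
    size of one float16 element in bytes.
    """
    return 2 * m * k / bw_mem


# Hypothetical shape, chosen only for illustration.
t_a = mem_time_fp16(4096, 4096)
print(f"T_mem(A) ~ {t_a * 1e6:.1f} microseconds")
```

For this hypothetical shape the estimate comes out at roughly 17 microseconds. It covers only the memory traffic for A; the time for B, C, and the arithmetic itself would have to be estimated separately.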