more stuff (#59)
- assets/images/a1_kernels.png +3 -0
- assets/images/a1_ncu.png +3 -0
- assets/images/a1_profile_trace.png +3 -0
- dist/assets/images/a1_kernels.png +3 -0
- dist/assets/images/a1_ncu.png +3 -0
- dist/assets/images/a1_profile_trace.png +3 -0
- dist/index.html +165 -0
- src/index.html +165 -0
assets/images/a1_kernels.png ADDED (Git LFS)
assets/images/a1_ncu.png ADDED (Git LFS)
assets/images/a1_profile_trace.png ADDED (Git LFS)
dist/assets/images/a1_kernels.png ADDED (Git LFS)
dist/assets/images/a1_ncu.png ADDED (Git LFS)
dist/assets/images/a1_profile_trace.png ADDED (Git LFS)
dist/index.html
CHANGED
@@ -3230,6 +3230,171 @@
<h3>A1: Distributed Training Profiling</h3>

<h4>Kernels</h4>

<p>Let's begin by assuming for now that the kernels are already integrated into PyTorch. As a simple example, we can look at the Layer Normalization function implemented in PyTorch as <code>torch.nn.functional.layer_norm</code>. There are several methods to profile the kernel that underlies this function. The most straightforward approach might be to use the Python <code>time</code> module. However, since CUDA operations are asynchronous, measuring time this way only captures the overhead of launching the kernel from Python, rather than the actual execution time of the kernel itself.</p>
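<p>To see the pitfall concretely, here is a naive sketch of exactly what <em>not</em> to rely on: the host-side timer returns almost immediately, because the kernel has only been enqueued, not finished:</p>

<d-code block language="python">
import time
import torch
import torch.nn.functional as F

x = torch.randn(10000, 10000, device='cuda')
t0 = time.time()
F.layer_norm(x, x.size()[1:])
# Mostly measures Python/launch overhead, not the kernel itself
print(f"{(time.time() - t0) * 1000:.3f} ms (misleading)")
</d-code>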

<p>To address this, we can use <code>torch.cuda.Event</code> for accurate timing and call <code>torch.cuda.synchronize()</code> to ensure we wait for the kernel execution to complete. This approach is demonstrated in the following snippet:</p>

<d-code block language="python">
import torch

def profile_pytorch(func, input):
    # Create CUDA events to track time. CUDA operations are asynchronous,
    # so events (rather than host-side timers) are needed for accurate timing.
    start = torch.cuda.Event(enable_timing=True)  # Event to mark the start time
    end = torch.cuda.Event(enable_timing=True)    # Event to mark the end time
    # Warm up to eliminate any overhead from the first run, which might not
    # reflect the actual performance.
    for _ in range(10):
        func(input)
    # Record the start time before executing the function
    start.record()
    func(input)  # Call the function we want to profile
    # Record the end time after the function has completed
    end.record()
    # Synchronize the CUDA operations to ensure all operations are completed
    # before measuring the elapsed time.
    torch.cuda.synchronize()
    # Calculate and return the elapsed time in milliseconds
    return start.elapsed_time(end)
</d-code>
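<p>As a quick usage sketch (the same function we profile with the PyTorch Profiler below; exact numbers depend on your GPU), the helper can be applied to <code>layer_norm</code> like this:</p>

<d-code block language="python">
import torch
import torch.nn.functional as F

x = torch.randn(10000, 10000, device='cuda')
# Time one layer_norm call; the helper returns milliseconds
elapsed_ms = profile_pytorch(lambda t: F.layer_norm(t, t.size()[1:]), x)
print(f"layer_norm took {elapsed_ms:.3f} ms")
</d-code>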

<p>A more effective approach to profiling is to utilize the PyTorch Profiler, as explained previously. For example, consider the following code:</p>

<d-code block language="python">
import torch
import torch.nn.functional as F

def pytorch_layer_norm(input):
    return F.layer_norm(input, input.size()[1:])

a = torch.randn(10000, 10000).cuda()

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,   # Profile CPU activities
        torch.profiler.ProfilerActivity.CUDA,  # Profile CUDA activities
    ],
    # Define a schedule for the profiler
    schedule=torch.profiler.schedule(
        wait=1,    # Wait for 1 iteration before starting to profile
        warmup=3,  # Warm up for 3 iterations to stabilize performance
        active=2,  # Profile for 2 active iterations
        repeat=1,  # Repeat the profiling schedule once
    ),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('.'),
) as p:
    for _ in range(10):
        pytorch_layer_norm(a)
        p.step()

# Print a table of the profiling results, sorted by total CUDA time, limited to the top 8 entries
print(p.key_averages().table(sort_by="cuda_time_total", row_limit=8))
</d-code>

<p>This would print the aggregated profiling results, sorted by total CUDA time, and the output would be:</p>

<div class="large-image-background">
    <img alt="image.png" src="/assets/images/a1_kernels.png" style="width: 1200px; max-width: none;" />
</div>

<p>You can also inspect the trace, as we mentioned previously, at <code>chrome://tracing/</code>; the <code>tensorboard_trace_handler</code> above writes a <code>.pt.trace.json</code> file that this tool can load.</p>

<div class="note-box">
    <p class="note-box-title">💡 Tip</p>
    <div class="note-box-content">
        <p>If you're new to this tool, you can navigate the trace using the right and left arrow keys. You can also zoom in and out by holding the <strong>Alt</strong> key while scrolling left or right with your mouse.</p>
    </div>
</div>

<p>After zooming in, you can observe the flow of operations when calling <code>layer_norm</code> in this trace:</p>

<div class="large-image-background">
    <img alt="image.png" src="/assets/images/a1_profile_trace.png" style="width: 1200px; max-width: none;" />
</div>

<p>The sequence begins on the CPU (the upper section) with <code>aten::layer_norm</code>, progresses to <code>aten::native_layer_norm</code>, and then transitions to <code>cudaLaunchKernel</code>. From there, we move on to the GPU, where the <code>vectorized_layer_norm_kernel</code> kernel is called.</p>

<div class="note-box">
    <p class="note-box-title">📝 Note</p>
    <div class="note-box-content">
        <p>You can enable memory profiling by setting <code>profile_memory</code> to <code>True</code> in the profiler. However, this can lead to more complex traces.</p>
    </div>
</div>
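
<p>As a minimal sketch of that option, only the extra arguments change relative to the example above:</p>

<d-code block language="python">
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    profile_memory=True,  # Track tensor allocations and frees
    record_shapes=True,   # Record input shapes (useful alongside memory data)
) as p:
    pytorch_layer_norm(a)

print(p.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=8))
</d-code>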

<p>While the PyTorch Profiler offers a quick performance overview, <strong>NVIDIA Nsight Compute (ncu)</strong> provides deeper insights into GPU performance, including detailed execution times and memory usage for each kernel. Running the profiler is simple:</p>

<d-code block language="bash">
ncu --set full python layer_norm.py
</d-code>
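<p>Here, <code>layer_norm.py</code> is a straightforward file that executes the layer normalization function. A minimal version might look like this (a sketch; any script that launches the kernel a few times will do):</p>

<d-code block language="python">
import torch
import torch.nn.functional as F

x = torch.randn(10000, 10000, device='cuda')
# Run the kernel a few times so ncu has launches to profile
for _ in range(5):
    F.layer_norm(x, x.size()[1:])
torch.cuda.synchronize()
</d-code>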

<p>This command will generate log output in the terminal, but a more effective way to visualize the results is to set the output flag:</p>

<d-code block language="bash">
ncu --set full -o output python layer_norm.py
</d-code>

<p>If you then open the resulting <code>output.ncu-rep</code> file with Nsight Compute, you will get a view that looks like this:</p>

<div class="large-image-background">
    <img alt="image.png" src="/assets/images/a1_ncu.png" style="width: 1200px; max-width: none;" />
</div>

<p>It includes clear warnings about compute and memory utilization, along with guidance on how to better balance compute and memory in the kernel and achieve maximal occupancy.</p>

<h4>CPP extension</h4>

<p>If the kernel you want to profile isn't already integrated into PyTorch, you can use PyTorch's <code>cpp_extension</code> module to easily compile and run custom CUDA code. The process is straightforward: just write your CUDA kernel in a <code>.cu</code> file and use the <code>load</code> function from the <code>cpp_extension</code> module to load it in Python.</p>

<p>The <code>.cu</code> file would look like this for a simple <code>add</code> kernel:</p>

<d-code block language="clike">
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Element-wise addition: each thread handles one element.
__global__ void add_kernel(float* x, float* y, float* output, int size) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < size) {
        output[index] = x[index] + y[index];
    }
}

void add_cuda(torch::Tensor x, torch::Tensor y, torch::Tensor output) {
    int threads = 1024;
    // Round up so every element is covered by a thread
    int blocks = (x.size(0) + threads - 1) / threads;

    add_kernel<<<blocks, threads>>>(x.data_ptr<float>(), y.data_ptr<float>(), output.data_ptr<float>(), x.size(0));
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("add_cuda", &add_cuda, "Vector addition (CUDA)");
}
</d-code>

<p>And the Python file to load the kernel:</p>

<d-code block language="python">
import torch
from torch.utils.cpp_extension import load

# Load and compile the CUDA extension (JIT-compiled on first use)
vector_add = load(
    name="vector_add",
    sources=["add_kernel.cu"],
    verbose=True
)

# Define input tensors
size = 10000
x = torch.randn(size, device='cuda')
y = torch.randn(size, device='cuda')
output = torch.empty(size, device='cuda')

# Run the CUDA kernel
vector_add.add_cuda(x, y, output)
</d-code>

<p>Using this method, you can profile the custom CUDA kernel just as we demonstrated earlier with PyTorch's profiler or NVIDIA tools.</p>
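<p>For example, reusing the <code>profile_pytorch</code> helper from the previous section (and first checking the kernel against PyTorch's built-in addition), a usage sketch could be:</p>

<d-code block language="python">
# Sanity-check the custom kernel against PyTorch's built-in addition
torch.testing.assert_close(output, x + y)

# Time the custom kernel with the CUDA-event helper defined earlier
elapsed_ms = profile_pytorch(lambda t: vector_add.add_cuda(t, y, output), x)
print(f"add_cuda took {elapsed_ms:.3f} ms")
</d-code>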

<h3>A2: Math for Compute/Comms Overlap</h3>
src/index.html
CHANGED
@@ -3230,6 +3230,171 @@
(identical to the dist/index.html changes above)