---
# KernelLLM
Caption: On KernelBench-Triton Level 1, our 8B parameter model matches GPT-4o in single-shot performance. With multiple inferences, KernelLLM's performance matches DeepSeek R1. This is all from a model with two orders of magnitude fewer parameters than its competitors.
## Making Kernel Development more accessible with KernelLLM
KernelLLM aims to democratize GPU programming by making kernel development more accessible and efficient.
*KernelLLM Workflow for Triton Kernel Generation: Our approach uses KernelLLM to translate PyTorch code (green) into Triton kernel candidates. Input and output components are marked in bold. The generations are validated against unit tests, which run kernels with random inputs of known shapes. This workflow allows us to evaluate multiple generations (pass@k) by increasing the number of kernel candidate generations. The best kernel implementation is selected and returned (green output).*
The model was trained on approximately 25,000 paired examples of PyTorch modules and their equivalent Triton kernel implementations and additional synthetically generated samples. Our approach combines filtered code from TheStack [Kocetkov et al. 2022] and synthetic examples generated through torch.compile() and additional prompting techniques. The filtered and compiled dataset can be found [on Huggingface](https://huggingface.co/datasets/GPUMODE/Inductor_Created_Data_Permissive).
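Conceptually, each element of this dataset pairs a PyTorch module's source with the Triton code produced for it. A minimal sketch of how such a pair might be packed into a training record, assuming a hypothetical JSONL prompt/response schema (the actual schema of the published dataset is not specified here):

```python
import json


def make_training_record(pytorch_source: str, triton_source: str) -> str:
    """Pack one (PyTorch module, Triton kernel) pair into a JSONL line.

    The field names and prompt wording are illustrative stand-ins; the
    real dataset format may differ.
    """
    record = {
        "prompt": (
            "Rewrite the following PyTorch module as an equivalent "
            "Triton kernel:\n\n" + pytorch_source
        ),
        "response": triton_source,
    }
    return json.dumps(record)


# Toy sources, not real kernel code:
line = make_training_record(
    "class Add(nn.Module): ...",
    "@triton.jit\ndef add_kernel(...): ...",
)
```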
### Model Performance
KernelLLM significantly outperforms larger general-purpose models on specialized kernel generation tasks, demonstrating the value of domain-specific fine-tuning.
| Model | Parameters (B) | Score | Pass@k |
|-------|---------------|-------|--------|
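The Pass@k column is commonly computed with the unbiased estimator introduced for Codex [Chen et al. 2021]: given n sampled generations of which c pass, it estimates the probability that at least one of k drawn samples is correct. A sketch (the exact evaluation script used for KernelBench-Triton may differ):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generations sampled, c: generations that pass, k: sample budget.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 generations and 3 passing, pass@1 is simply 3/10:
print(pass_at_k(10, 3, 1))  # ≈ 0.3
```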
Our 8B parameter model achieves competitive or superior performance compared to much larger models on kernel generation tasks, demonstrating the effectiveness of our specialized training approach.
The resulting model is competitive with state-of-the-art LLMs despite its small size. We evaluate KernelLLM on KernelBench, an open-source benchmark for measuring the ability of LLMs to write efficient GPU kernels. It contains 250 selected PyTorch modules organized into difficulty levels, from single torch operators such as Conv2D or Swish (level 1) to full model architectures (level 3). The benchmark measures both correctness (by comparing against reference PyTorch outputs) and performance (by measuring speedup over baseline implementations). We implemented a new KernelBench-Triton variant that evaluates an LLM's ability to generate Triton kernels, making it an ideal benchmark for evaluating KernelLLM's capabilities. All measurements were done on NVIDIA H100 GPUs.
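The evaluation loop described here — generate several kernel candidates, run each against unit tests with random inputs, and keep one that matches the reference output — can be sketched in plain Python. The generator and the "kernels" below are toy stand-ins, not the real model or Triton code:

```python
import random


def reference_add(xs, ys):
    """Reference implementation standing in for the PyTorch module."""
    return [x + y for x, y in zip(xs, ys)]


def validate(candidate, reference, trials=10, size=16, tol=1e-6):
    """Run a candidate on random inputs of known shape; compare to reference."""
    for _ in range(trials):
        xs = [random.uniform(-1, 1) for _ in range(size)]
        ys = [random.uniform(-1, 1) for _ in range(size)]
        want = reference(xs, ys)
        got = candidate(xs, ys)
        if len(got) != len(want) or any(
            abs(g - w) > tol for g, w in zip(got, want)
        ):
            return False
    return True


def best_of_k(candidates, reference):
    """Return the first candidate passing validation, else None (pass@k style)."""
    for cand in candidates:
        if validate(cand, reference):
            return cand
    return None


# Two toy "generations": one buggy, one correct.
buggy = lambda xs, ys: [x - y for x, y in zip(xs, ys)]
correct = lambda xs, ys: [x + y for x, y in zip(xs, ys)]
winner = best_of_k([buggy, correct], reference_add)
```

Increasing the number of candidates handed to `best_of_k` corresponds to raising k in the pass@k evaluation.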