---
# KernelLLM
Caption: On KernelBench-Triton Level 1, our 8B parameter model matches GPT-4o in single-shot performance. With multiple inferences, KernelLLM's performance matches DeepSeek R1. This is all from a model with two orders of magnitude fewer parameters than its competitors.
## Making Kernel Development more accessible with KernelLLM
KernelLLM aims to democratize GPU programming by making kernel development more accessible and efficient.
*KernelLLM Workflow for Triton Kernel Generation: Our approach uses KernelLLM to translate PyTorch code (green) into Triton kernel candidates. Input and output components are marked in bold. The generations are validated against unit tests, which run kernels with random inputs of known shapes. This workflow allows us to evaluate multiple generations (pass@k) by increasing the number of kernel candidate generations. The best kernel implementation is selected and returned (green output).*
The model was trained on approximately 25,000 paired examples of PyTorch modules and their equivalent Triton kernel implementations and additional synthetically generated samples. Our approach combines filtered code from TheStack [Kocetkov et al. 2022] and synthetic examples generated through torch.compile() and additional prompting techniques. The filtered and compiled dataset can be found [on Huggingface](https://huggingface.co/datasets/GPUMODE/Inductor_Created_Data_Permissive).
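Conceptually, each element of this dataset pairs a PyTorch module's source with the Triton code produced for it. A minimal sketch of how such a pair might be packed into a training record, assuming a hypothetical JSONL prompt/response schema (the actual schema of the published dataset is not specified here):

```python
import json


def make_training_record(pytorch_source: str, triton_source: str) -> str:
    """Pack one (PyTorch module, Triton kernel) pair into a JSONL line.

    The field names and prompt wording are illustrative stand-ins; the
    real dataset format may differ.
    """
    record = {
        "prompt": (
            "Rewrite the following PyTorch module as an equivalent "
            "Triton kernel:\n\n" + pytorch_source
        ),
        "response": triton_source,
    }
    return json.dumps(record)


# Toy sources, not real kernel code:
line = make_training_record(
    "class Add(nn.Module): ...",
    "@triton.jit\ndef add_kernel(...): ...",
)
```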
### Model Performance
KernelLLM significantly outperforms larger general-purpose models on specialized kernel generation tasks, demonstrating the value of domain-specific fine-tuning.
| Model | Parameters (B) | Score | Pass@k |
|-------|---------------|-------|--------|
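The Pass@k column is commonly computed with the unbiased estimator introduced for Codex [Chen et al. 2021]: given n sampled generations of which c pass, it estimates the probability that at least one of k drawn samples is correct. A sketch (the exact evaluation script used for KernelBench-Triton may differ):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total generations sampled, c: generations that pass, k: sample budget.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 generations and 3 passing, pass@1 is simply 3/10:
print(pass_at_k(10, 3, 1))  # ≈ 0.3
```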
Our 8B parameter model achieves competitive or superior performance compared to much larger models on kernel generation tasks, demonstrating the effectiveness of our specialized training approach.
The resulting model is competitive with state-of-the-art LLMs despite its small size. We evaluate KernelLLM on KernelBench, an open-source benchmark for measuring the ability of LLMs to write efficient GPU kernels. It contains 250 selected PyTorch modules organized into difficulty levels, from single torch operators such as Conv2D or Swish (level 1) to full model architectures (level 3). The benchmark measures both correctness (by comparing against reference PyTorch outputs) and performance (by measuring speedup over baseline implementations). We implemented a new KernelBench-Triton variant that evaluates an LLM's ability to generate Triton kernels, making it an ideal benchmark for evaluating KernelLLM's capabilities. All measurements were done on NVIDIA H100 GPUs.
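The evaluation loop described here — generate several kernel candidates, run each against unit tests with random inputs, and keep one that matches the reference output — can be sketched in plain Python. The generator and the "kernels" below are toy stand-ins, not the real model or Triton code:

```python
import random


def reference_add(xs, ys):
    """Reference implementation standing in for the PyTorch module."""
    return [x + y for x, y in zip(xs, ys)]


def validate(candidate, reference, trials=10, size=16, tol=1e-6):
    """Run a candidate on random inputs of known shape; compare to reference."""
    for _ in range(trials):
        xs = [random.uniform(-1, 1) for _ in range(size)]
        ys = [random.uniform(-1, 1) for _ in range(size)]
        want = reference(xs, ys)
        got = candidate(xs, ys)
        if len(got) != len(want) or any(
            abs(g - w) > tol for g, w in zip(got, want)
        ):
            return False
    return True


def best_of_k(candidates, reference):
    """Return the first candidate passing validation, else None (pass@k style)."""
    for cand in candidates:
        if validate(cand, reference):
            return cand
    return None


# Two toy "generations": one buggy, one correct.
buggy = lambda xs, ys: [x - y for x, y in zip(xs, ys)]
correct = lambda xs, ys: [x + y for x, y in zip(xs, ys)]
winner = best_of_k([buggy, correct], reference_add)
```

Increasing the number of candidates handed to `best_of_k` corresponds to raising k in the pass@k evaluation.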