.. role:: raw-html-m2r(raw)
   :format: html
C++ Examples
=================
The power of Kompute comes through when the interface is used for complex computations. This section contains an outline of the advanced / end-to-end examples available.
Simple examples
^^^^^^^^^^^^^^^
* `Create your custom Kompute Operations <#your-custom-kompute-operation>`_
* `Run Asynchronous Operations <#asynchronous-operations>`_
* `Run Parallel Operations Across Multiple GPU Queues <#parallel-operations>`_
End-to-end examples
^^^^^^^^^^^^^^^^^^^
* `Machine Learning Logistic Regression Implementation <https://towardsdatascience.com/machine-learning-and-data-processing-in-the-gpu-with-vulkan-kompute-c9350e5e5d3a>`_
* `Parallelizing GPU-intensive Workloads via Multi-Queue Operations <https://towardsdatascience.com/parallelizing-heavy-gpu-workloads-via-multi-queue-operations-50a38b15a1dc>`_
* `Android NDK Mobile Kompute ML Application <https://towardsdatascience.com/gpu-accelerated-machine-learning-in-your-mobile-applications-using-the-android-ndk-vulkan-kompute-1e9da37b7617>`_
* `Game Development Kompute ML in Godot Engine <https://towardsdatascience.com/supercharging-game-development-with-gpu-accelerated-ml-using-vulkan-kompute-the-godot-game-engine-4e75a84ea9f0>`_
Add Extensions
^^^^^^^^^^^^^^^^^^^^
Kompute provides a simple way to add extensions through the kp::Manager initialisation. When debug is enabled, the logs will show which extensions were requested and which were actually added, based on the extensions available in the current driver.

The example below shows how you can enable the "VK_EXT_shader_atomic_float" extension so that we can use atomicAdd on floats in the shaders.
.. code-block:: cpp
   :linenos:

   int main() {

       std::string shader(R"(
           #version 450
           #extension GL_EXT_shader_atomic_float: enable

           layout(push_constant) uniform PushConstants {
               float x;
               float y;
               float z;
           } pcs;

           layout (local_size_x = 1) in;

           layout(set = 0, binding = 0) buffer a { float pa[]; };

           void main() {
               atomicAdd(pa[0], pcs.x);
               atomicAdd(pa[1], pcs.y);
               atomicAdd(pa[2], pcs.z);
           })");

       // See shader documentation section for compileSource
       std::vector<uint32_t> spirv = compileSource(shader);

       std::shared_ptr<kp::Sequence> sq = nullptr;

       {
           kp::Manager mgr(0, {}, { "VK_EXT_shader_atomic_float" });

           std::shared_ptr<kp::Tensor> tensor = mgr.tensor({ 0, 0, 0 });

           std::shared_ptr<kp::Algorithm> algo = mgr.algorithm(
             { tensor }, spirv, kp::Workgroup({ 1 }), {}, { 0.0, 0.0, 0.0 });

           sq = mgr.sequence()
                  ->record<kp::OpTensorSyncDevice>({ tensor })
                  ->record<kp::OpAlgoDispatch>(
                    algo, std::vector<float>{ 0.1, 0.2, 0.3 })
                  ->record<kp::OpAlgoDispatch>(
                    algo, std::vector<float>{ 0.3, 0.2, 0.1 })
                  ->record<kp::OpTensorSyncLocal>({ tensor })
                  ->eval();

           // EXPECT_EQ is the GoogleTest assertion macro
           EXPECT_EQ(tensor->data(), std::vector<float>({ 0.4, 0.4, 0.4 }));
       }
   }
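Note the examples on this page rely on a ``compileSource`` helper, which is covered in detail in the shader documentation section. As a rough sketch of what such a helper can look like (this version assumes ``glslangValidator`` from the Vulkan SDK is available on your PATH):

.. code-block:: cpp
   :linenos:

   #include <cstdlib>
   #include <fstream>
   #include <iterator>
   #include <stdexcept>
   #include <string>
   #include <vector>

   // Compiles GLSL to SPIR-V by shelling out to glslangValidator through
   // a temporary file, then reads the resulting binary back into memory
   static std::vector<uint32_t>
   compileSource(const std::string& source)
   {
       std::ofstream fileOut("tmp_kp_shader.comp");
       fileOut << source;
       fileOut.close();

       if (std::system("glslangValidator -V tmp_kp_shader.comp "
                       "-o tmp_kp_shader.comp.spv"))
           throw std::runtime_error("Error running glslangValidator command");

       std::ifstream fileStream("tmp_kp_shader.comp.spv", std::ios::binary);
       std::vector<char> buffer;
       buffer.insert(buffer.begin(),
                     std::istreambuf_iterator<char>(fileStream),
                     {});
       return { (uint32_t*)buffer.data(),
                (uint32_t*)(buffer.data() + buffer.size()) };
   }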
Your Custom Kompute Operation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Build your own pre-compiled operations for domain-specific workflows. Back to `examples list <#simple-examples>`_.
We also provide tools that allow you to `convert shaders into C++ headers <https://github.com/KomputeProject/kompute/blob/master/scripts/convert_shaders.py#L40>`_.
.. code-block:: cpp
   :linenos:

   class OpMyCustom : public kp::OpAlgoDispatch
   {
     public:
       OpMyCustom(std::vector<std::shared_ptr<kp::Tensor>> tensors,
                  std::shared_ptr<kp::Algorithm> algorithm)
         : kp::OpAlgoDispatch(algorithm)
       {
           if (tensors.size() != 3) {
               throw std::runtime_error(
                 "Kompute OpMyCustom expected 3 tensors but got " +
                 std::to_string(tensors.size()));
           }

           // See shader documentation section for compileSource
           std::vector<uint32_t> spirv = compileSource(R"(
               #version 450

               layout(set = 0, binding = 0) buffer tensorLhs {
                   float valuesLhs[ ];
               };

               layout(set = 0, binding = 1) buffer tensorRhs {
                   float valuesRhs[ ];
               };

               layout(set = 0, binding = 2) buffer tensorOutput {
                   float valuesOutput[ ];
               };

               layout (constant_id = 0) const uint LEN_LHS = 0;
               layout (constant_id = 1) const uint LEN_RHS = 0;
               layout (constant_id = 2) const uint LEN_OUT = 0;

               layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

               void main()
               {
                   uint index = gl_GlobalInvocationID.x;
                   valuesOutput[index] = valuesLhs[index] * valuesRhs[index];
               }
           )");

           algorithm->rebuild(tensors, spirv);
       }
   };

   int main() {
       kp::Manager mgr; // Automatically selects Device 0

       // Create 3 tensors of default type float
       auto tensorLhs = mgr.tensor({ 0., 1., 2. });
       auto tensorRhs = mgr.tensor({ 2., 4., 6. });
       auto tensorOut = mgr.tensor({ 0., 0., 0. });

       mgr.sequence()
           ->record<kp::OpTensorSyncDevice>({ tensorLhs, tensorRhs, tensorOut })
           ->record<OpMyCustom>({ tensorLhs, tensorRhs, tensorOut }, mgr.algorithm())
           ->record<kp::OpTensorSyncLocal>({ tensorLhs, tensorRhs, tensorOut })
           ->eval();

       // Prints the output which is { 0, 4, 12 }
       std::cout << fmt::format("Output: {}", tensorOut->data()) << std::endl;
   }
Async/Await Example
^^^^^^^^^^^^^^^^^^^^^
A simple example of asynchronous submission can be found below.

First we create the manager as we normally would.
.. code-block:: cpp
   :linenos:

   // You can allow Kompute to create the GPU resources, or pass your existing ones
   kp::Manager mgr; // Selects device 0 unless explicitly requested

   // Creates the tensor and initializes GPU memory (below we show more granularity)
   auto tensor = mgr.tensor(std::vector<float>(10, 0.0));
We can now run our first asynchronous command; in this case we can use the default sequence.

Sequences can be executed synchronously or asynchronously without having to change anything.
.. code-block:: cpp
   :linenos:

   // Create the tensor data explicitly in the GPU with an operation
   mgr.sequence()->eval<kp::OpTensorSyncDevice>({ tensor });
While this is running we can do other things, like creating the shader we'll be using.

Here we create a shader that should take a couple of milliseconds to run.
.. code-block:: cpp
   :linenos:

   // Define your shader as a string (using string literals for simplicity)
   // (You can also pass the raw compiled bytes, or even path to file)
   std::string shader(R"(
       #version 450

       layout (local_size_x = 1) in;

       layout(set = 0, binding = 0) buffer b { float pb[]; };

       shared uint sharedTotal[1];

       void main() {
           uint index = gl_GlobalInvocationID.x;

           sharedTotal[0] = 0;

           // Iterating to simulate longer process
           for (int i = 0; i < 100000000; i++)
           {
               atomicAdd(sharedTotal[0], 1);
           }

           pb[index] = sharedTotal[0];
       }
   )");

   // See shader documentation section for compileSource
   auto algo = mgr.algorithm({ tensor }, compileSource(shader));
Now we are able to run the await function on the sequence. We need to make sure that we await on the same sequence that was triggered asynchronously.

If the sequence is not running or has finished running, it will return immediately.

The parameter provided is the maximum amount of time to wait, in nanoseconds. When the timeout expires, the call returns (with a false value), but it does not stop the processing in the GPU - the workload continues as normal.
.. code-block:: cpp
   :linenos:

   auto sq = mgr.sequence();

   // Run Async Kompute operation on the parameters provided
   sq->evalAsync<kp::OpAlgoDispatch>(algo);

   // Here we can do other work

   // When we're ready we can wait
   // The default wait time is UINT64_MAX
   sq->evalAwait();
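If you prefer not to block indefinitely, you can pass the timeout described above explicitly. A minimal sketch, using an arbitrary illustrative budget of 10 milliseconds:

.. code-block:: cpp
   :linenos:

   // Wait for at most 10 milliseconds (the parameter is in nanoseconds);
   // if the timeout expires the GPU workload still continues running
   sq->evalAwait(10000000);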
Finally, below you can see that we can also run synchronous commands without having to change anything.
.. code-block:: cpp
   :linenos:

   // Sync the GPU memory back to the local tensor
   // We can still run synchronous jobs in our created sequence
   sq->eval<kp::OpTensorSyncLocal>({ tensor });

   // Prints the output: B: { 100000000, ... }
   // (formatting a container with fmt requires <fmt/ranges.h>)
   std::cout << fmt::format("B: {}", tensor->data()) << std::endl;
Parallel Operation Submission
-----------------------------
In order to work with parallel execution of tasks, it is important that you understand some of the core GPU processing limitations, as these are hardware dependent and will vary across NVIDIA, AMD, and other video cards.
Conceptual Overview
^^^^^^^^^^^^^^^^^^^^^
If you are familiar with the Vulkan SDK, you will know that one of the first things you do is fetch the physical queues from the device. Queues have three main capability flags - they can be GRAPHICS, TRANSFER and COMPUTE (among a few others we'll skip for simplicity).

Queues can combine multiple properties - namely a queue can be of type GRAPHICS+TRANSFER+COMPUTE, etc. Now here comes the key point: the underlying hardware may (or may not) support parallelized processing at multiple levels.
Let's take a tangible example. The `NVIDIA 1650 <http://vulkan.gpuinfo.org/displayreport.php?id=9700#queuefamilies>`_ has 16 ``GRAPHICS+TRANSFER+COMPUTE`` queues on ``familyIndex 0``, 2 ``TRANSFER`` queues on ``familyIndex 1``, and 8 ``COMPUTE+TRANSFER`` queues on ``familyIndex 2``.

With this in mind, the NVIDIA 1650 as of today does not support intra-family parallelization, which means that if you were to submit commands to multiple queues of the same family, these would still be executed synchronously.

However, the NVIDIA 1650 does support inter-family parallelization, which means that if we were to submit commands across multiple queues from different families, these would execute in parallel.

This means that we are able to execute parallel workloads as long as we run them across multiple queue families. This is one of the reasons why Kompute enables users to explicitly select the underlying queues and queue families to run particular workloads on.

It is important to understand the capabilities and limitations of your hardware, as parallelization capabilities can vary, so you will want to account for potential discrepancies in processing structure, mainly to avoid undesired/unexpected race conditions.
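To check what your own hardware exposes, you can query the queue families directly through the Vulkan API before deciding which family indices to pass to Kompute. Below is a minimal standalone sketch (not part of the Kompute API) using the Vulkan-Hpp headers:

.. code-block:: cpp
   :linenos:

   #include <vulkan/vulkan.hpp>
   #include <iostream>

   int main() {
       // Create a minimal instance purely for inspection purposes
       vk::Instance instance = vk::createInstance(vk::InstanceCreateInfo());

       for (const vk::PhysicalDevice& device :
            instance.enumeratePhysicalDevices()) {
           std::cout << device.getProperties().deviceName.data() << std::endl;

           // Each family reports its queue count and capability flags
           uint32_t familyIndex = 0;
           for (const vk::QueueFamilyProperties& family :
                device.getQueueFamilyProperties()) {
               std::cout << "  familyIndex " << familyIndex++
                         << ": count=" << family.queueCount
                         << " flags=" << vk::to_string(family.queueFlags)
                         << std::endl;
           }
       }

       instance.destroy();
   }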
Parallel Execution Example
^^^^^^^^^^^^^^^^^^^^^^^^^^
In this example we will demonstrate how you can set up parallel processing across two queue families to achieve a 2x speedup when running processing workloads.

To start, you will see that we have to create the manager with extra parameters: the GPU device index we want to use, together with the array of queue family indices that we want to enable.

In this case we are using only two queues. As per the section above, these would be familyIndex 0, which is of type ``GRAPHICS+COMPUTE+TRANSFER``, and familyIndex 2, which is of type ``COMPUTE+TRANSFER``.

Based on the specifications of the NVIDIA 1650, we could request up to 16 graphics queues (familyIndex 0), 2 transfer queues (familyIndex 1), and 8 compute queues (familyIndex 2), in no particular order. This means that we could have something like ``{ 0, 1, 1, 2, 2, 2, 0, ... }`` as our initialization value.

You will want to keep track of the order of the indices you initialize your manager with, as you will refer back to this ordering when creating sequences on particular queues.
.. code-block:: cpp
   :linenos:

   // In this case we select device 0, and for queues, one queue from familyIndex 0
   // and one queue from familyIndex 2
   uint32_t deviceIndex(0);
   std::vector<uint32_t> familyIndices = { 0, 2 };

   // We create a manager with device index, and queues by queue family index
   kp::Manager mgr(deviceIndex, familyIndices);
We are now able to create sequences on a particular queue.

By default the Kompute Manager is created with device 0 and a single queue of the first compatible familyIndex. Similarly, by default sequences are created with the first available queue.

In this case we can specify which queue we want to use. Below we initialize the sequence ``sqOne`` with the queue from the graphics family and ``sqTwo`` with the queue from the compute family.

It's worth mentioning you can have multiple sequences referencing the same queue.
.. code-block:: cpp
   :linenos:

   // We need to create explicit sequences with their respective queues
   // The parameter is the position within the familyIndices vector that we
   // created the manager with, not the raw familyIndex itself
   auto sqOne = mgr.sequence(0);
   auto sqTwo = mgr.sequence(1);
We create the tensors without modifications.
.. code-block:: cpp
   :linenos:

   // Creates the tensors and initializes GPU memory
   auto tensorA = mgr.tensor(std::vector<float>(10, 0.0));
   auto tensorB = mgr.tensor(std::vector<float>(10, 0.0));

   // Copies the data into GPU memory
   mgr.sequence()->eval<kp::OpTensorSyncDevice>({ tensorA, tensorB });
Similar to the asynchronous use case above, we can still run synchronous commands without modifications.
.. code-block:: cpp
   :linenos:

   // Define your shader as a string (using string literals for simplicity)
   // (You can also pass the raw compiled bytes, or even path to file)
   std::string shader(R"(
       #version 450

       layout (local_size_x = 1) in;

       layout(set = 0, binding = 0) buffer b { float pb[]; };

       shared uint sharedTotal[1];

       void main() {
           uint index = gl_GlobalInvocationID.x;

           sharedTotal[0] = 0;

           // Iterating to simulate longer process
           for (int i = 0; i < 100000000; i++)
           {
               atomicAdd(sharedTotal[0], 1);
           }

           pb[index] = sharedTotal[0];
       }
   )");

   // See shader documentation section for compileSource
   std::vector<uint32_t> spirv = compileSource(shader);

   // The shader binds a single buffer, so we create one algorithm per tensor
   std::shared_ptr<kp::Algorithm> algoOne = mgr.algorithm({ tensorA }, spirv);
   std::shared_ptr<kp::Algorithm> algoTwo = mgr.algorithm({ tensorB }, spirv);
Now we can actually trigger the parallel processing, running two OpAlgoDispatch operations - each in a different sequence / queue.
.. code-block:: cpp
   :linenos:

   // Run the first parallel operation in the sqOne sequence
   sqOne->evalAsync<kp::OpAlgoDispatch>(algoOne);

   // Run the second parallel operation in the sqTwo sequence
   sqTwo->evalAsync<kp::OpAlgoDispatch>(algoTwo);
Similar to the asynchronous example above, we are able to do other work whilst the tasks are executing.

We can wait for the tasks to complete by triggering ``evalAwait`` on the respective sequence.
.. code-block:: cpp
   :linenos:

   // Here we can do other work

   // We can now wait for the two parallel tasks to finish
   sqOne->evalAwait();
   sqTwo->evalAwait();

   // Sync the GPU memory back to the local tensor
   mgr.sequence()->eval<kp::OpTensorSyncLocal>({ tensorA, tensorB });

   // Prints the output: A: 100000000 B: 100000000
   std::cout << fmt::format("A: {}, B: {}",
       tensorA->data()[0], tensorB->data()[0]) << std::endl;