Note: This document discusses utilities commonly used with code that targets CUTLASS 2.x.
Although CUTLASS 3.0's primary entry point APIs do not transact in these `cutlass::*` tensor types anymore,
users can still find them convenient for managing allocations with trivial affine layouts.
For more advanced host-side tensor management, [`cute::Tensor`](/media/docs/cute/03_tensor.md)s
can be used on either host or device, for any memory space, with the full expressive power of
[`cute::Layout`](/media/docs/cute/01_layout.md)s.
# CUTLASS Utilities
CUTLASS utilities are additional template classes that facilitate recurring tasks. These are
flexible implementations of needed functionality, but they are not expected to be efficient.

Applications should configure their builds to list `/tools/util/include` in their include
paths.

Source code is in [`/tools/util/include/cutlass/util/`](/tools/util/include/cutlass/util).
## Tensor Allocation and I/O

To allocate a tensor with storage in both host and device memory, use `HostTensor` in
[`cutlass/util/host_tensor.h`](/tools/util/include/cutlass/util/host_tensor.h).

```c++
template <typename Element, typename Layout>
class HostTensor;
```
This class is compatible with all CUTLASS numeric data types and layouts.

**Example:** column-major matrix storage of single-precision elements.

```c++
#include <cutlass/layout/matrix.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 32;
  int columns = 16;

  // Allocate a 32-by-16 column-major matrix with storage in both host and device memory
  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({rows, columns});

  return 0;
}
```
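
Once constructed, the tensor's logical extent and allocation size may be queried. A brief
sketch continuing the example above, assuming the `extent()`, `size()`, and `capacity()`
accessors of `HostTensor`:

```c++
// Logical extent of the tensor: (rows, columns)
auto extent = tensor.extent();

// Number of logical elements and the number of elements physically stored
auto elements = tensor.size();
auto storage  = tensor.capacity();
```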
Internal host-side storage may be accessed via the following methods.

```c++
float *host_ptr = tensor.host_data();
cutlass::TensorRef<float, cutlass::layout::ColumnMajor> host_ref = tensor.host_ref();
cutlass::TensorView<float, cutlass::layout::ColumnMajor> host_view = tensor.host_view();
```

Device memory may be accessed similarly.

```c++
float *device_ptr = tensor.device_data();
cutlass::TensorRef<float, cutlass::layout::ColumnMajor> device_ref = tensor.device_ref();
cutlass::TensorView<float, cutlass::layout::ColumnMajor> device_view = tensor.device_view();
```
Printing to human-readable CSV output is accomplished with `std::ostream::operator<<()` defined in
[`cutlass/util/tensor_view_io.h`](/tools/util/include/cutlass/util/tensor_view_io.h).
Note that this assumes all views refer to host memory.

```c++
#include <iostream>

#include <cutlass/layout/matrix.h>
#include <cutlass/util/host_tensor.h>
#include <cutlass/util/tensor_view_io.h>

int main() {

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({32, 16});

  // Obtain a TensorView into host memory
  cutlass::TensorView<float, cutlass::layout::ColumnMajor> view = tensor.host_view();

  // Print to std::cout
  std::cout << view << std::endl;

  return 0;
}
```
Host and device memory must be explicitly synchronized by the application.

```c++
float idx = 0;

for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < columns; ++j) {

    // Write the element at location {i, j} in host memory
    tensor.host_ref().at({i, j}) = idx;

    idx += 0.5f;
  }
}

// Copy host memory to device memory
tensor.sync_device();

// Obtain a device pointer usable in CUDA kernels
float *device_ptr = tensor.device_data();
```
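
Copying in the opposite direction works the same way. A short sketch continuing the example
above: after a kernel writes through the device pointer, `sync_host()` copies device memory
back so the results can be inspected on the host.

```c++
// Launch a kernel (not shown) that writes through device_ptr, then
// copy device memory back to host memory
tensor.sync_host();

// Inspect the results on the host
std::cout << tensor.host_view() << std::endl;
```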
`HostTensor<>` is usable with all CUTLASS layouts, including interleaved layouts.

```c++
int rows = 4;
int columns = 3;

cutlass::HostTensor<float, cutlass::layout::ColumnMajorInterleaved<4>> tensor({rows, columns});

for (int i = 0; i < rows; ++i) {
  for (int j = 0; j < columns; ++j) {

    // Write the element at location {i, j} in host memory
    tensor.host_ref().at({i, j}) = float(i) * 1.5f - float(j) * 2.25f;
  }
}

std::cout << tensor.host_view() << std::endl;
```
## Device Allocations

To allocate memory strictly on the device, using the smart pointer pattern to manage
allocation and deallocation, use `cutlass::DeviceAllocation<>`.

**Example:** allocating an array in device memory.

```c++
#include <cutlass/util/device_memory.h>

__global__ void kernel(float *device_ptr) {

}

int main() {

  size_t N = 1024;

  cutlass::DeviceAllocation<float> device_alloc(N);

  // Call a CUDA kernel passing device memory as a pointer argument
  dim3 grid(1, 1);
  dim3 block(128, 1, 1);

  kernel<<< grid, block >>>(device_alloc.get());

  if (cudaGetLastError() != cudaSuccess) {
    return -1;
  }

  // Device memory is automatically freed when device_alloc goes out of scope
  return 0;
}
```
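
`DeviceAllocation<>` also provides members to move data to and from host memory. A minimal
sketch, assuming the `copy_from_host()` and `copy_to_host()` members declared in
[`cutlass/util/device_memory.h`](/tools/util/include/cutlass/util/device_memory.h):

```c++
#include <vector>

#include <cutlass/util/device_memory.h>

int main() {

  size_t N = 1024;

  std::vector<float> host_data(N, 1.0f);
  cutlass::DeviceAllocation<float> device_alloc(N);

  // Copy N elements from host memory into the device allocation
  device_alloc.copy_from_host(host_data.data(), N);

  // ... launch kernels operating on device_alloc.get() ...

  // Copy N elements from the device allocation back into host memory
  device_alloc.copy_to_host(host_data.data(), N);

  return 0;
}
```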
## Tensor Initialization

CUTLASS defines several utility functions to initialize tensors to uniform, procedural,
or randomly generated elements. These have implementations using strictly host code and
implementations using strictly CUDA device code.

`TensorFill()` initializes all elements of a tensor to a uniform value.

```c++
#include <cutlass/layout/matrix.h>
#include <cutlass/util/reference/host/tensor_fill.h>
#include <cutlass/util/reference/device/tensor_fill.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 128;
  int columns = 64;

  float x = 3.14159f;

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({rows, columns});

  // Initialize in host memory
  cutlass::reference::host::TensorFill(tensor.host_view(), x);

  // Initialize in device memory
  cutlass::reference::device::TensorFill(tensor.device_view(), x);

  return 0;
}
```
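
Procedural fills are also available. A brief sketch continuing the example above, assuming
the `TensorFillIdentity()` helpers declared in the same headers, which set diagonal elements
to one and all other elements to zero:

```c++
// Write an identity pattern into host memory
cutlass::reference::host::TensorFillIdentity(tensor.host_view());

// Write the same pattern directly into device memory
cutlass::reference::device::TensorFillIdentity(tensor.device_view());
```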
`TensorFillRandomUniform()` initializes elements to a random uniform distribution.
The device-side implementation uses CURAND to generate random numbers.

```c++
#include <cstdint>

#include <cutlass/layout/matrix.h>
#include <cutlass/util/reference/host/tensor_fill.h>
#include <cutlass/util/reference/device/tensor_fill.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 128;
  int columns = 64;

  double maximum = 4;
  double minimum = -4;

  uint64_t seed = 0x2019;

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({rows, columns});

  // Initialize in host memory
  cutlass::reference::host::TensorFillRandomUniform(
    tensor.host_view(),
    seed,
    maximum,
    minimum);

  // Initialize in device memory
  cutlass::reference::device::TensorFillRandomUniform(
    tensor.device_view(),
    seed,
    maximum,
    minimum);

  return 0;
}
```
`TensorFillRandomGaussian()` initializes elements to a random Gaussian distribution.
The device-side implementation uses CURAND to generate random numbers.

```c++
#include <cstdint>

#include <cutlass/layout/matrix.h>
#include <cutlass/util/reference/host/tensor_fill.h>
#include <cutlass/util/reference/device/tensor_fill.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 128;
  int columns = 64;

  double mean = 0.5;
  double stddev = 2.0;

  uint64_t seed = 0x2019;

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({rows, columns});

  // Initialize in host memory
  cutlass::reference::host::TensorFillRandomGaussian(
    tensor.host_view(),
    seed,
    mean,
    stddev);

  // Initialize in device memory
  cutlass::reference::device::TensorFillRandomGaussian(
    tensor.device_view(),
    seed,
    mean,
    stddev);

  return 0;
}
```
Each of these functions accepts an additional argument specifying how many bits to the
right of the binary point may be non-zero. This simplifies functional comparisons when
exact random distributions are not necessary, since elements can be restricted to
integers or to values with exact fixed-point representations. For example, with
`bits_right_of_binary_decimal = 2`, every generated element is a multiple of 0.25.

```c++
#include <cstdint>

#include <cutlass/layout/matrix.h>
#include <cutlass/util/reference/host/tensor_fill.h>
#include <cutlass/util/reference/device/tensor_fill.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 128;
  int columns = 64;

  double mean = 0.5;
  double stddev = 2.0;

  uint64_t seed = 0x2019;

  int bits_right_of_binary_decimal = 2;

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({rows, columns});

  // Initialize in host memory
  cutlass::reference::host::TensorFillRandomGaussian(
    tensor.host_view(),
    seed,
    mean,
    stddev,
    bits_right_of_binary_decimal);

  // Initialize in device memory
  cutlass::reference::device::TensorFillRandomGaussian(
    tensor.device_view(),
    seed,
    mean,
    stddev,
    bits_right_of_binary_decimal);

  return 0;
}
```
These utilities may be used for all data types.

**Example:** random half-precision tensor with Gaussian distribution.

```c++
#include <cstdint>

#include <cutlass/numeric_types.h>
#include <cutlass/layout/matrix.h>
#include <cutlass/util/reference/host/tensor_fill.h>
#include <cutlass/util/reference/device/tensor_fill.h>
#include <cutlass/util/host_tensor.h>

int main() {

  int rows = 128;
  int columns = 64;

  double mean = 0.5;
  double stddev = 2.0;

  uint64_t seed = 0x2019;

  // Allocate a column-major tensor with half-precision elements
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> tensor({rows, columns});

  // Initialize in host memory
  cutlass::reference::host::TensorFillRandomGaussian(
    tensor.host_view(),
    seed,
    mean,
    stddev);

  // Initialize in device memory
  cutlass::reference::device::TensorFillRandomGaussian(
    tensor.device_view(),
    seed,
    mean,
    stddev);

  return 0;
}
```
## Reference Implementations

CUTLASS defines reference implementations usable with all data types and layouts. These are
used throughout the unit tests.

**Example:** reference GEMM implementation with mixed-precision internal computation.

```c++
#include <cutlass/numeric_types.h>
#include <cutlass/layout/matrix.h>

#include <cutlass/util/host_tensor.h>
#include <cutlass/util/reference/host/gemm.h>

int main() {

  int M = 64;
  int N = 32;
  int K = 16;

  float alpha = 1.5f;
  float beta = -1.25f;

  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> A({M, K});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> B({K, N});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> C({M, N});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> D({M, N});

  cutlass::reference::host::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementA and LayoutA
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementB and LayoutB
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementC and LayoutC
    float,                                           // scalar type (alpha and beta)
    float> gemm_op;                                  // internal accumulation type

  gemm_op(
    {M, N, K},        // problem size
    alpha,            // alpha scalar
    A.host_view(),    // TensorView to host memory
    B.host_view(),    // TensorView to host memory
    beta,             // beta scalar
    C.host_view(),    // TensorView to host memory
    D.host_view());   // TensorView to host memory

  return 0;
}
```
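
In the unit tests, a reference result computed this way is typically compared elementwise
against the output of the implementation under test. A minimal sketch continuing the example
above, assuming `TensorEquals()` from
[`cutlass/util/reference/host/tensor_compare.h`](/tools/util/include/cutlass/util/reference/host/tensor_compare.h)
and a hypothetical tensor `D_device` holding the device result after `sync_host()`:

```c++
#include <cutlass/util/reference/host/tensor_compare.h>

// D_device is a hypothetical HostTensor filled by the implementation under
// test and synchronized back to host memory with sync_host()
bool passed = cutlass::reference::host::TensorEquals(
  D.host_view(),
  D_device.host_view());

if (!passed) {
  return -1;
}
```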
# Copyright

Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```