File size: 29,509 Bytes
e45d058
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
![ALT](../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Programming Guidelines")

[README](../../README.md#documentation) > **Programming Guidelines**

# Programming Guidelines

## Hierarchical Organization

The [CUTLASS 3.0 GEMM API](./gemm_api_3x.md) document
explains CUTLASS 3.0's hierarchical organization,
based conceptually on parallelization strategy.
This differs from CUTLASS 2.x's approach,
which more closely mirrors the GPU hardware hierarchy
of thread blocks, warps, and threads.

## Design Patterns

CUTLASS aims for the highest performance possible on NVIDIA GPUs.
It also offers flexible components that can be assembled and customized
to solve new problems related to deep learning and linear algebra.
Given a tradeoff between simplicity and performance,
CUTLASS chooses performance.
Consequently, several design patterns are necessary
to yield a composable structure
while also satisfying these performance objectives.

### Templates

CUDA C++ templates and modern generic programming techniques enable CUTLASS device code to span a large design space.

This design space includes:
* Mixed precision arithmetic and data storage
* Kernels specialized for layout and problem size
* Support for kernel fusion

Moreover, templates provided a structured approach to collecting compile-time constants such as tile dimensions. These
must be template arguments to target static array allocation and take advantage of loop unrolling, constant folding,
and function inlining.

### Constant Memory

Several CUTLASS template classes exhibit a pattern in which problem-specific internal state is known at kernel
launch time and remains invariant throughout the execution of a kernel. For example, tile iterators compute several
offsets based on the strides of the input tensor that is added to an internal pointer when loading the elements
of a tile. These are computed from the tensor stride and never updated; the per-thread internal state consists
only of the internal global memory pointer.

CUTLASS can take advantage of this CUDA grid-invariant property by constructing the object in host code and passing
a composed parameters structure to the kernel. This confers two benefits: (1.) invariant state is held in constant
memory, and (2.) there is no overhead to compute the initial state by each thread.

The design pattern in CUTLASS is for classes with nontrivial constructors to define `struct Params` as an inner class
which contains grid-invariant state. These should define a constructor and an `initialize()` method. The `Params`
structure should also include a data member corresponding to each data member in the parent class, so these too can
be properly constructed in host code. The parent class should define a constructor which accepts `Params const &` as
its first argument.

### Composable Shared Memory

Shared memory requires explicit effort by the programmer to allocate and de-allocate. CUTLASS follows the paradigm
introduced by [CUB](https://nvlabs.github.io/cub/) to define composed structures for storing data intended to be held
in shared memory. Any object requiring shared memory storage for itself or its data members should define a child
structure called `SharedStorage`. This holds data needed by the class and also instantiates `SharedStorage`
objects for each data member.

To be consistent, this pattern defines a convention in which classes define internal shared memory storage requirements.
Classes should consider all SharedStorage structures to be opaque other than their own child class. When the lifetimes
of child objects are known to be non-overlapping, `union`s may be used to alias multiple SharedStorage objects to the same
shared memory region and reduce overall shared memory capacity.  Developers should carefully note that C++ `union` rules
require that they only access the most recently written ("active") member of the `union`; this differs from C rules.

### Loop Unrolling

CUTLASS requires tiles of data to be stored in registers for high-bandwidth access. Simultaneously, high-throughput math instructions
must be issued concurrently with memory instructions to hide latency with relatively few concurrent threads. These objectives are
achieved by unrolling loops whose iteration counts are known at compile time.

Consequently, most loops within the CUTLASS GEMM implementation are specified by constant values and template arguments. The CUDA compiler
is able to unroll the loop bodies, map array elements to registers, and construct an efficient instruction schedule.

All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
to unroll them.

```c++

int const kN = 8;

Array<float, kN> x;                       // Array we would like to store in registers



CUTLASS_PRAGMA_UNROLL                     // Directs the CUDA compiler to unroll this loop.

for (int idx = 0; idx < kN; ++idx) {      // Loop has constant number of iterations.



  x[i] = float(idx);                      // Indirect access by induction variable results in

                                          // direct register access.

}

```

## Style

### No automatic code formatting

Do not use any kind of automatic code formatting,
like `clang-format`, on CUTLASS code.

### C++ style

#### CUTLASS is a C++ project

CUTLASS is a C++ project.  CUDA C++ is a C++ dialect.
Therefore, we write using standard C++ idioms as much as possible.
We aim for portability to as many compilers as possible,
by writing host code in Standard C++
and device code in CUDA C++
that resembles Standard C++ as much as possible.
This improves usability
for the general community of C++ developers,
and makes it easier for new staff to join the project.

#### Follow Standard C++ idioms where possible

Regarding "standard C++ idioms,"
CUTLASS source code follows the following guidelines,
with deviations only because of compiler limitations
or where performance absolutely requires it.
"Performance requires it" implies measurement.
Deviations should be limited in scope
and we should always strive to eliminate them.

* [C++ Core Guidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md)

* [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html)

#### Spacing and line length

* Use spaces, not tabs.

* Use 2 spaces to indent.

* Max 100 characters per line.

Lines longer than 100 characters typically wrap unfavorably
when viewed in Github's pretty printer.

#### Function indentation

When calling a function or function object with a long name,
break the line right after the invoking open parenthesis.
Here is an example.

```c++

detail::very_long_function_object_name<TemplateArgument>{}(

  params.long_parameter_name, some_operator.another_long_function_name());

```

When declaring functions, indent function parameters like this.

```c++

void possibly_an_unusually_long_function_name(

    std::uint32_t foo

    std::uint32_t const* bar,

    TypeA a,

    TypeB b,

    TypeC c) {

  // ... the function's body ...

}

```

A newline should not be inserted between the parenthesis
that closes the function's parameters and the curly bracket
that opens the function's body. Note the double indent for function parameters.

#### If-else brackets and spacing

* Always use braces with conditionals such as `if`.

* Use a space after control flow keywords
  such as `if`, `for`, and `while`.

* Use a space after the parenthesis closing a conditional
  such as `if`, and the curly bracket opening a scope.

* Use a new line between the closing brace
  of an `if` branch, and the `else` keyword.

```c++

if (condition) {

  // ... code ...

}

else {

  // ... other code ...

}



for (int k = 0; k < num_iters; ++k) {

  // ... still more code ...

}

```

#### East const

CUTLASS uses the
["East const"](http://slashslash.info/2018/02/a-foolish-consistency/)
convention.
That is, the `const` or `constexpr` keyword
goes after the type, not before.
The general rule is that `const` or `constexpr`
modifies the type to the left of it.
Here are some examples.

```c++

float constexpr compile_time_constant = 42.3f;



float const const_float = /* whatever */;

float const& reference_to_const_float = const_float;

float const* pointer_to_const_float = &const_float;

float const* const const_pointer_to_const_float = &const_float;



float nonconst_float;

float& reference_to_nonconst_float = nonconst_float;

float* pointer_to_nonconst_float = &nonconst_float;

float* const pointer_to_nonconst_float = &nonconst_float;

```

Contrast this with "West const" style, e.g.,

```c++

const float const_float = /* whatever */;

const float* pointer_to_const_float = &const_float;

```

#### Alignment of reference and pointer types

For reference and pointer types,
align the `&` resp. `*` flush against the type
that it modifies.  This is called "left alignment."

For example, do this:

```c++

int const& var;

int const* var;

```

and not this.

```c++

int const &var;

int const *var;

```

#### Avoid calling functions "fast" or "optimized"

Putting words like "fast" or "optimized"
in the name of a function
assumes that the "fast" path is actually faster.
That might be true now, but later changes
(in the code, compilers, or GPU hardware)
might make it false.  In that case,
your name could be unintentionally misleading.
Consider instead a name that briefly describes
the algorithm or feature that is relevant for optimization.
For example, `compute_on_host` is more meaningful
than `compute_slowly`, and computing on host
might be faster in some cases
(e.g., if the data are already on host
and the algorithm is not GPU-friendly).

CUTLASS code has not always followed this rule in the past.
Some functions and classes might have words like "fast" in their name.
New code should follow this rule, however.

#### Avoid creating unconstrained templated functions with common names

See [C++ Core Guidelines T.47](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#t47-avoid-highly-visible-unconstrained-templates-with-common-names):
"Avoid highly visible unconstrained templates
with common names."
Argument-dependent lookup (ADL) means that
if users call a function name without specifying the namespace,
the compiler can find overloads
of that function in any namespace.
This can lead to ambiguous overloads in users' code,
just because they happened to include one of your header files
that exposes an unconstrained function template.
The following illustrates this
with an unconstrained swap overload in the `cutlass` namespace.

```c++

#include <cassert>

#include <memory>

#include <utility>



// Uncomment the line below to observe unwarranted build errors.

//#define BAD_CUTLASS_SWAP 1



namespace cutlass {

struct Bar {

  float f;

};

} // namespace cutlass



#ifdef BAD_CUTLASS_SWAP

namespace cutlass {



// don't do this

template<class T>

void swap(T& a, T& b) {

  T tmp = a;

  a = b;

  b = tmp;

}



} // namespace cutlass

#endif // BAD_CUTLASS_SWAP



namespace other {



#ifdef BAD_CUTLASS_SWAP

using cutlass::swap;

#endif // BAD_CUTLASS_SWAP



// Imagine for the sake of this example

// that "foo" is a less common name,

// and that T is constrained via

// std::enable_if or a requires clause.

template<class T>

void foo(T& a, T& b) {

  // The usual idiom for using std::swap is the "swap two-step":

  //

  // 1. import std::swap into the current scope, then

  // 2. call swap without namespace qualification.

  //

  // That won't build if we have another swap

  // overload available in the scope already.



  using std::swap;

  swap(a, b); // OBSERVE UNWARRANTED BUILD ERROR HERE

}



} // namespace other



int main() {

  int x = 42;

  int y = 43;

  other::foo(x, y);

  assert(x == 43);

  assert(y == 42);



  cutlass::Bar a{42.0};

  cutlass::Bar b{43.0};

  other::foo(a, b);

  assert(a.f == 43.0);

  assert(b.f == 42.0);



  // GCC 7.5 std::unique_ptr::reset calls swap,

  // leading to the same issue as above.

  // GCC 12.2's implementation of std::unique_ptr

  // does not have this issue.  Nevertheless,

  // breaking the swap two-step will break users' code,

  // just by them happening to include your headers.

  auto ptr = std::make_unique<cutlass::Bar>(cutlass::Bar{666.0f});

  ptr.reset(new cutlass::Bar{777.0f}); // OBSERVE UNWARRANTED BUILD ERROR HERE



  return 0;

}

```

#### Function return values and in-out parameters

##### Prefer return values to output parameters

In general, avoid in-out mutable references to return a value.
If you need to return multiple values,
you can return them by `struct` or `tuple`,
rather than by output references.
This includes the special case of error reporting
by returning either a value or an error code.
Please see the next section for details.

```c++

// Instead of passing in-out mutable references ...

void not_preferred(float& input_and_output); // not preferred



// keep functions pure and return value types instead

float preferred(float input); // preferred

```

##### Return multiple values by struct or tuple

Sometimes a function needs to return multiple values.  In that case, consider the following, in decreasing order of preference.

1. Return a `struct`.  This lets you name the fields
   (for more self-documenting code),
   yet still permits use of structured binding.

2. Return a `tuple`.  If you need a tuple type
   that works on device, use `cute::tuple`.
   (Please note that `cute::tuple` does not work
   for all the types that work in `std::tuple`.
   CuTe's documentation explains.)

Here is an example of the struct approach for named values.
For a comparable example in the C++ Standard,
please see [`std::allocate_at_least`](https://en.cppreference.com/w/cpp/memory/allocate_at_least),
which returns `std::allocation_result`.

```c++

struct my_computation_result {

  float value = 0.0f;

  float relative_error = 0.0f;

  bool success = false;

};



my_computation_result my_computation(float tolerance);



void foo(float tolerance) {

  // Approach 1: Use structured binding.  The names

  // you choose on the left-hand side have nothing

  // to do with the struct, so it's up to you

  // to get the order right.  On the other hand,

  // this code works whether my_computation returns

  // a struct or a tuple.

  auto [val, rel_err, ok] = my_computation(tolerance);



  // Approach 2: Keep the struct and use its named fields.

  // This approach prevents errors like mixing the order of return types.

  // However, it only works for structs, not for tuples.



  auto result = my_computation(tolerance);

  if (not result.success) {

    // computation did not succeed

  }

  else if (result.relative_error > tolerance) {

    // successful but relative error too large

  }

  else {

    // successful and relative error is in bounds

  }

}

```

##### Reporting errors from a function that returns one or more values

We may want to return one or more values
from a function that could fail
or otherwise report errors.
That is, the function either

* returns one or more valid values, or

* does not return any values and reports an error,

but NOT BOTH.  We contrast this with cases
when it's meaningful to report both a result
and whether the result is satisfactory.
For example, when solving
a system of nonlinear equations iteratively,
users may want the approximate computed solution,
even if the iteration did not succeed
by converging to the desired tolerance
in the desired number of steps.
(Users may want to invest more steps,
or use the current approximation
to jump-start a different algorithm.)

We're talking here about the "either valid value(s),
or error, but not both" case.
For this case, C++ offers a few options.

1. Return the value(s), or throw an exception on error

2. `std::expected` (requiring C++23) or something like it

3. `std::optional` (for a Boolean error state)
   or something like it

4. `std::variant` (a C++17 fall-back for `std::expected`)
   or something like it

5. C-style interface: return an error code,
   and "return" the values as output parameters

We usually cannot or do not want to
throw exceptions on device.
Some code projects forbid exceptions entirely
(on host or device)
and tell the compiler to disable them.
If we exclude a C-style interface (the last option)
as not idiomatic C++, then for host-only code,
`std::expected`, `std::optional`, and `std::variant`
all work.
For code that needs to build and run on device,
we can fall back to libcu++ equivalents
in the `cuda::std::` namespace, when they exist.
Otherwise, we must resort to returning a struct or tuple
with the value and the error information,
and ask users not to use the value on error.
This is acceptable if the value can be constructed
cheaply with a reasonable default.

##### Performance of different value-or-error reporting methods

[P1886R0](https://wg21.link/P1886R0)
(Ben Craig, "Error speed benchmarking")
surveys different ways in Standard C++
to report errors from a function
that returns one or more values,
and compares their (host-only) performance
with different compilers.

##### Use aggregate initialization when returning a struct or tuple

Use aggregate initialization when returning a struct or tuple.
This avoids duplication of the return type name.

```c++

struct foo_result {

  float value = 0.0f;

  float error = 0.0f;

  bool success = false;

};



foo_result foo(std::span<const float> input) {

  // ... code  ...



  // Prefer this.  We know what type the function returns.

  return {val, err, ok}; // prefer this



  // Naming foo_result again here is unnecessary.

  // return foo_result{val, err, ok};

}

```

However, note that this won't work if the function returns `auto`.
The general rule is to avoid code duplication.

```c++

auto foo(std::span<const float> input) {

  // ... code  ...



  if constexpr (some_condition) {

    return foo_result{val, err, ok};

  }

  else {

    return bar_result{val, err, ok};

  }

}

```

##### Prefer using the actual return type to auto, if you know the type

C++ lets you use `auto` to deduce the type returned from a function.

* If you know the actual type, prefer using the type instead of `auto`.

* Use [Constructor Type Argument Deduction](https://en.cppreference.com/w/cpp/language/class_template_argument_deduction)
  (CTAD) if you know that a function returns some type
  (e.g., `Tensor`), but don't know the type's template arguments.

* Use `auto` in structured bindings (where you have to use it anyway).  This also makes your code agnostic of whether the return type is a `struct`, `tuple`, `pair`, or other tuple-like type.

* Be careful using `auto` with types that provide expression templates.

Contrast this with "Almost Always Auto" (AAA) style.
We deliberately choose not to follow AAA style,
for the following reasons.

* Using the actual type when we know it can help prevent common loss-of-precision errors in mixed-precision computations, an important use case for CUTLASS.

* CTAD gives us much of the brevity of AAA, with more clarity.

* Using the actual type instead of `auto` can prevent common dangling errors with expression templates.

#### Classes and structs

Type names use `CamelCase`.
That is, words start with capital letters.
The remaining letters in the word are lower case,
and words are joined with no intervening underscores.
The only exception is when implementations are
a drop-in replacement for C++ Standard Library components.

Follow the
[C++ Core Guidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-struct)
to decide whether to use `class` or `struct`.

* Use `class` when the object must maintain an invariant.
  Data members related to the invariant should be `private`.

* Use `struct` when the class has no invariant to maintain,
  and data members may vary arbitrarily with respect to each other.

Prefer nonmember functions and statelessness where possible.
Member functions imply invariants.
More invariants make code maintenance and testing harder.

#### Class members

Methods and members are written using `snake_case`.

Private data and function members have suffix `_`.

#### Class Member Order

Members within classes and structures should be organized as follows:

1. Type and constant definitions

2. Data members

3. Constructors

4. Other methods

This convention follows the
[CUB library](https://nvlabs.github.io/cub/)
and is also described by
[Howard Hinnant](https://howardhinnant.github.io/classdecl.html).
It also approximates the usual ordering of chapters
in a typical Systems and Controls textbook.
That is, it

1. identifies relevant constants,

2. defines a state-space representation
   of the dynamical system under study
   (the class's data members), and then

3. devotes the remaining "chapters" to defining
   the system's dynamical behavior
   (the class's methods).

Here is an example class.

```c++

class A {

public:

  // type definitions

protected:

  // protected type definitions

private:

  // private type definitions



public:

  // data members

protected:

  // protected data members

  // STRONGLY TO BE AVOIDED;

  // please see C++ Core Guidelines

private:

  // private data members



public:

  // methods

protected:

  // protected methods

private:

  // private methods

};

```

#### Use scoped enums

Use scoped enums (a C++11 feature) for enumerated types.
Use capital letters for the enumerated type name
and prefix `k` for enumerators like other constants.

```c++

enum class MatrixOperation {

  kNone,

  kTranspose,

  kConjugate,

  kHermitian

};

```

#### Namespaces

Namespaces are all lower case.
The top-level namespace is `cutlass::`.
The second nested namespace refers to
the general category of operation
performed by its members: e.g., `gemm::`.
The third nested namespace refers to
the operations' position in the conceptual hierarchy:
e.g., `device::`, `kernel::`, or `collective::`.

The bodies of namespace definitions should not be indented.
Comments on the closing brace to indicate
the namespace being closed are welcome.

```c++

namespace cutlass {

namespace gemm {

namespace kernel {



struct AnotherGemmKernel {

  // ... contents ...

};



} // namespace kernel

} // namespace gemm

} // namespace cutlass

```

#### File Names

New files should be named using `snake_case`
with extension `.hpp` for header files,
`.cu` for CUDA sources,
and `.cpp` for C++ host-only source files.

Header files with extension `.h`
are CUTLASS 2.x legacy headers.

#### Macros

Only use macros when the preprocessor
is the only way to accomplish the task.
Do not use macros for literal constants.
Instead, if inside the body of a function,
use `constexpr` values,
and if at namespace scope, use
[`inline constexpr` variables](https://en.cppreference.com/w/cpp/language/inline)
(a C++17 feature).

"Namespace" macros by starting them with the module name, e.g., `CUTLASS_`.
Macros and ONLY MACROS use all capital letters with underscores between words.
For example:

```c++

#define CUTLASS_MACROS_USE_ALL_CAPS inline __host__ __device__

```

Header files such as
[cutlass/cutlass.h](../../include/cutlass/cutlass.h)
and
[cute/config.hpp](../../include/cutlass/cutlass.h)
offer macros for expressing compiler-dependent behavior.
These include

* replacements for `__device__` and/or `__host__`
  annotations:

  * `CUTLASS_HOST_DEVICE` or `CUTE_HOST_DEVICE`
    for functions that run on the host and the device,


  * `CUTLASS_DEVICE` or `CUTE_DEVICE`
    for functions that run on the device only,


  * `CUTE_HOST`
    for functions that run on the host only, and


  * `CUTE_HOST_RTC`
    for functions that run on the host only,

    but occur as unevaluated operands (of e.g., `decltype` or `sizeof`;

    see C++ Standard, `[expr.context]` 1) in device code; and


* annotations to loop unrolling:

  * `CUTLASS_PRAGMA_UNROLL` or `CUTE_UNROLL`
    for full unrolling of loops with constant trip counts, and


  * `CUTLASS_PRAGMA_NO_UNROLL` or `CUTE_NO_UNROLL` to prevent unrolling.

#### Guard all headers with `#pragma once`

Use `#pragma once` to guard all headers.

### CuTe Layout Comments

* Right align CuTe layout comments at column 120. 
* If layout comment is too long do your best to align it.
* If layout comment is too long and there are many related tensors that reader should read together, try to align the layout comments of related tensors.

```c++

    Tensor my_tensor = make_tensor<Type>(Layout<Shape<_2,_2>{}, Stride<_1,_2>>{});                       // (2,2):(1,2)

    

    // Related tensors

    Tensor my_tensor1 = make_tensor<Type>(ThisIsAVeryComplicatedLayoutWithAVeryLongName);         // ((Mode0_0,Mode0_1,Mode0_2),Mode1,Mode2,Mode3)

    Tensor my_tensor2_related = make_tensor<Type>(ThisIsAVeryComplicatedLayoutWithAVeryLongName); // ((Mode0_0,Mode0_1,Mode0_2),Mode1,Mode2,Mode3)

```

### CUDA C++ style

#### CUDA Built-in Variables

Avoid direct access to CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within
CUTLASS components except in special circumstances.

Using built-in global variables directly within resuable components necessitates that all components
use them consistently which may not be possible if CUTLASS components are used in other contexts.

Instead, components should accept a linear ID identifying threads, warps, and threadblocks from calling
code. The top-level kernel may then decide how to map threads, warps, and blocks to the problem it is
solving.

#### Use CUTLASS's and CuTe's fundamental types and operations

Use the
[fundamental types and operations](fundamental_types.md)
defined in CUTLASS consistently.
This contributes to a framework of interoperable, consistent components.
It reduces code duplication, which reduces build and test times.
It also saves developer effort.

CUTLASS's fundamental types and operations include

* [Numeric types](fundamental_types.md#numeric-types) to represent numeric data in host and device code, and

* [functional.h](fundamental_types.md#functional) to perform numeric operations in generic code.

CUTLASS 3.0 uses CuTe components to represent data layouts and multidimensional arrays.
Please refer to the [CuTe Tutorial](./cute/00_quickstart.md) for details.
CuTe has replaced CUTLASS 2.x components such as
[Containers](fundamental_types.md#containers),
[Layouts](layout.md), and
[`TensorRef` and `TensorView`](layout.md#tensorref).

## CUTLASS idioms

### Detecting major mode

Developers sometimes need to detect whether a tensor is MN-major or K-major.
(For definitions, see the [CuTe GEMM tutorial](./cute/0x_gemm_tutorial.md).)

* _Correct_: `cutlass::detail::is_major<0, Stride>()` or
`cutlass::detail::is_k_major()` from `include/cutlass/gemm/gemm.h`

* _Incorrect_: `get<0>(stride) == 1`

The second point is incorrect because it assumes that the mode
is a single integer, not a multimode.
This means that the code will fail to compile for tensor contractions.
For example, suppose that a tensor A
has shape `((X, Y), K)` and stride `((1, X), X*Y)`.
`get<0>(stride)` is the tuple `(1, X)`, not a single integer.
However, A is certainly M major if interpreted as a matrix.

# Copyright

Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```

  Redistribution and use in source and binary forms, with or without

  modification, are permitted provided that the following conditions are met:



  1. Redistributions of source code must retain the above copyright notice, this

  list of conditions and the following disclaimer.



  2. Redistributions in binary form must reproduce the above copyright notice,

  this list of conditions and the following disclaimer in the documentation

  and/or other materials provided with the distribution.



  3. Neither the name of the copyright holder nor the names of its

  contributors may be used to endorse or promote products derived from

  this software without specific prior written permission.



  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"

  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE

  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE

  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE

  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL

  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR

  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER

  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,

  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE

  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

```