![ALT](../images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS GEMM API")

[README](../../README.md#documentation) > **CUTLASS GEMM API**

# CUTLASS GEMM API

CUTLASS presents a uniform programming model for matrix multiply-accumulate operations at each level of the execution hierarchy. This document
describes the device-level, threadblock-level, warp-level, thread-level, and instruction-level GEMM APIs.

# CUTLASS GEMM Model

CUTLASS implements the basic GEMM triple loop nest with a tiled structure mirroring the execution model hierarchy.

The following pseudocode describes the model for a GEMM kernel targeting a warp-synchronous matrix multiply instruction like
mma.sync. The entire operation is referred to as "Gemm," as it is assumed that an epilogue operation performs the general matrix
update similar to BLAS.

```c++
                                                                            // cutlass::gemm::device::Gemm
                                                                            //
for (int cta_n = 0; cta_n < GemmN; cta_n += CtaTileN) {                     // for each CTA       } CTA-level concurrency
  for (int cta_m = 0; cta_m < GemmM; cta_m += CtaTileM) {                   //    for each CTA    }
                                                                            //
                                                                            // cutlass::gemm::threadblock::Mma
                                                                            //
    for (int cta_k = 0; cta_k < GemmK; cta_k += CtaTileK) {                 //       "GEMM mainloop" - no unrolling - one iteration of this loop is one "stage"
                                                                            //
      for (int warp_n = 0; warp_n < CtaTileN; warp_n += WarpTileN) {        // for each warp      } warp-level concurrency
        for (int warp_m = 0; warp_m < CtaTileM; warp_m += WarpTileM) {      //    for each warp   }
                                                                            //
          for (int warp_k = 0; warp_k < CtaTileK; warp_k += WarpTileK) {    //       fully unroll across CtaTileK - one iteration of this loop is one "k Group"
                                                                            //
            for (int mma_k = 0; mma_k < WarpTileK; mma_k += MmaK) {         // cutlass::gemm::warp::Mma
              for (int mma_n = 0; mma_n < WarpTileN; mma_n += MmaN) {       //
                for (int mma_m = 0; mma_m < WarpTileM; mma_m += MmaM) {     //
                                                                            //
                  mma_instruction(d, a, b, c);                              // cutlass::arch::mma - warp-wide matrix multiply instruction

                }   // for mma_m
              }   // for mma_n
            }   // for mma_k
          }   // for warp_k
        }   // for warp_m
      }   // for warp_n
    }   // for cta_k
  }   // for cta_m
}   // for cta_n
```

The outer-most loops correspond to CTA-level hardware concurrency and are not explicitly written as loops in the code. These
are implied by CUDA grid launch semantics.

The comment `cutlass::gemm::threadblock::Mma` refers to the threadblock-scoped matrix multiply-accumulate concept. This is
the computation performed by one threadblock to accumulate a matrix product in registers; its loop over `cta_k` is the "GEMM mainloop" noted in the pseudocode.

The comment `cutlass::gemm::warp::Mma` refers to the computation performed by each warp. This is a nested loop executing a
sequence of accumulated outer products. 

The inner-most operation corresponds directly to hardware support. In this example, the nested structure terminates with
warp-synchronous matrix multiply instructions targeting Tensor Cores. 
Alternatively, GEMMs targeting single-thread instructions may have an additional series of nested loops corresponding to 
thread-level concurrency.

# CUTLASS GEMM Components

This loop nest is expressed in CUTLASS via the following components which are specialized for data type, layout, and
math instruction.

![ALT](/media/images/cutlass-gemm-components.png "CUTLASS GEMM Components")

These components are described in the following sections.

## Device-wide GEMM API

The device-level GEMM API is intended to streamline instantiation and execution of the standard
GEMM computation across the GPU. This operator is intended to be used in host-side .cu code and
has semantics similar to cuBLAS.

The device-wide GEMM API is embodied by the following operators:
- [cutlass::gemm::device::Gemm](/include/cutlass/gemm/device/gemm.h) - basic GEMM operation
- [cutlass::gemm::device::GemmArray](/include/cutlass/gemm/device/gemm_array.h) - batched GEMM operation in which input matrices are read from arrays of pointers
- [cutlass::gemm::device::GemmBatched](/include/cutlass/gemm/device/gemm_batched.h) - batched GEMM operation in which input matrices are separated by a constant stride
- [cutlass::gemm::device::GemmSplitKParallel](/include/cutlass/gemm/device/gemm_splitk_parallel.h) - GEMM operation that partitions the GEMM K dimension then launches a separate reduction kernel

**Example:** launch a mixed-precision GEMM targeting Volta Tensor Cores.
```c++
  using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t,                           // ElementA
    cutlass::layout::ColumnMajor,              // LayoutA
    cutlass::half_t,                           // ElementB
    cutlass::layout::ColumnMajor,              // LayoutB
    cutlass::half_t,                           // ElementOutput
    cutlass::layout::ColumnMajor,              // LayoutOutput
    float,                                     // ElementAccumulator
    cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
    cutlass::arch::Sm70                        // tag indicating target GPU compute architecture
  >;

  Gemm gemm_op;
  cutlass::Status status;

  //
  // Launch GEMM on the device
  //

  status = gemm_op({
    {m, n, k},
    {ptrA, lda},
    {ptrB, ldb},
    {ptrC, ldc},
    {ptrD, ldd},
    {alpha, beta}
  });

  if (status != cutlass::Status::kSuccess) {
    return -1;
  }
```
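
For the batched variants listed above, the arguments additionally carry per-operand batch strides and a batch count. The following is a hedged sketch of invoking `cutlass::gemm::device::GemmBatched`; the argument ordering follows CUTLASS's batched GEMM example but should be verified against `gemm_batched.h` for the version in use, and `batch_stride_*` / `batch_count` are illustrative names for values supplied by the caller.

```c++
  using GemmBatched = cutlass::gemm::device::GemmBatched<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
    cutlass::half_t, cutlass::layout::ColumnMajor,   // ElementC, LayoutC
    float                                            // ElementAccumulator
  >;

  GemmBatched batched_op;

  // Each operand advances by its batch stride between consecutive GEMMs.
  cutlass::Status batched_status = batched_op({
    {m, n, k},
    {ptrA, lda}, batch_stride_A,
    {ptrB, ldb}, batch_stride_B,
    {ptrC, ldc}, batch_stride_C,
    {ptrD, ldd}, batch_stride_D,
    {alpha, beta},
    batch_count
  });
```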


## Threadblock-level GEMM API

GEMMs at this scope are expected to efficiently load tiles of data from global memory into internal storage and then compute matrix
products with warp-level GEMM operators.

The threadblock-scoped matrix multiply operation is embodied by 
[cutlass::gemm::threadblock::MmaPipelined](/include/cutlass/gemm/threadblock/mma_pipelined.h).
This is a class inspired by [std::transform_reduce()](https://en.cppreference.com/w/cpp/algorithm/transform_reduce) 
which computes the accumulated matrix product of a range of tiles defined by tile iterators.

![ALT](/media/images/cutlass-threadblock-mma-pipelined.png "cutlass::gemm::threadblock::MmaPipelined")

In the case of GEMM, the tile iterators are instances of
[cutlass::transform::threadblock::PredicatedTileIterator](/include/cutlass/transform/threadblock/predicated_tile_iterator.h),
which traverse a sequence of tiles in global memory with appropriate predication to avoid out-of-bounds
memory accesses.

*Concept.* Threadblock-level matrix multiply accumulate operators are function objects satisfying the following concept.
```c++
struct Mma {

  /// Shape of the threadblock-level matrix operation (concept: GemmShape)
  struct Shape;

  /// Data type of multiplicand A (concept: numeric type)
  struct ElementA;

  /// Layout of multiplicand A (concept: Layout)
  struct LayoutA;

  /// Data type of multiplicand B (concept: numeric type)
  struct ElementB;

  /// Layout of multiplicand B (concept: Layout)
  struct LayoutB;

  /// Data type of accumulator matrix C (concept: numeric type)
  struct ElementC;

  /// Layout of accumulator matrix C (concept: Layout)
  struct LayoutC;

  /// Iterator of A operand in global memory - satisfies: ReadableRandomAccessTileIteratorConcept
  struct IteratorA;

  /// Fragment object loaded from IteratorA (concept: Array<ElementA, ..>)
  struct FragmentA;

  /// Iterator of B operand in global memory - satisfies: ReadableRandomAccessTileIteratorConcept
  struct IteratorB;

  /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
  struct FragmentB;

  /// Iterator of C operand in shared memory -
  ///    satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
  struct IteratorC;

  /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
  struct FragmentC;

  /// Warp-level matrix multiply operator (concept: satisfies gemm::warp::Mma)
  struct Operator;

  //
  // Methods
  //

  /// Computes a matrix product accumulated in D
  CUTLASS_DEVICE
  void operator()(
    FragmentC &D,
    IteratorA iter_A,
    IteratorB iter_B,
    FragmentC const &C);

};
```
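
As a minimal sketch of how a type satisfying this concept is used, the following follows the `operator()` signature listed above rather than any particular class; `Mma` stands for any conforming type (for example, an instantiation of `cutlass::gemm::threadblock::MmaPipelined`), and the construction of the tile iterators is elided.

```c++
// Hedged sketch: exercising the threadblock-level Mma concept shown above.
template <typename Mma>
CUTLASS_DEVICE
void threadblock_gemm_sketch(
  typename Mma::IteratorA iter_A,      // traverses tiles of operand A
  typename Mma::IteratorB iter_B,      // traverses tiles of operand B
  typename Mma::FragmentC &accum) {    // this thread's slice of the accumulator tile

  Mma mma;

  accum.clear();                       // zero the accumulators

  // Accumulate the matrix product over the range of tiles defined by the iterators.
  mma(accum, iter_A, iter_B, accum);
}
```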

## Warp-level Matrix Multiply API

Warp-level GEMM operators load tiles from shared memory into registers and then compute matrix multiplies using either 
Tensor Cores or CUDA Cores. The result is accumulated in a register tile. Iterators are defined for each
operand `A`, `B`, and `C`.

The warp-level GEMM API is a generalization of CUDA's WMMA API to achieve the following objectives:

- native matrix multiply sizes of Tensor Cores
- permuted shared memory layouts to ensure conflict-free accesses
- pointer initialization outside of the mainloop
- efficient traversal

Defining a warp-level matrix multiply in CUTLASS is similar to WMMA as shown below.

![ALT](/media/images/cutlass-warp-level-gemm-api-instantiation.png "CUTLASS vs WMMA API")

The usage model is also similar. The following example computes a warp-level GEMM operation,
accumulating a series of matrix products in a register-backed array. The inputs to a warp-level
GEMM operation in CUTLASS _must_ be either data in shared memory accessed through tile iterators
or data already held in register-backed fragments.

![ALT](/media/images/cutlass-warp-level-gemm-operation.png "CUTLASS warp-level GEMM API")

```c++
#include "cutlass/gemm/warp/default_mma_tensor_op.h"

// Operand data types (the warp-level operator below uses half-precision operands)
using Element  = cutlass::half_t;
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;

using LayoutA = cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous<
    cutlass::sizeof_bits<Element>::value, 64>;

using LayoutB = cutlass::layout::RowMajorTensorOpMultiplicandCongruous<
    cutlass::sizeof_bits<Element>::value, 64>;

using WarpMma = typename cutlass::gemm::warp::DefaultMmaTensorOp<
    cutlass::gemm::GemmShape<64, 64, 8>,                            // Overall warp-level GEMM operation
    cutlass::gemm::GemmShape<16, 8, 8>,                             // Target instruction
    cutlass::half_t, LayoutA,                                       // operand A type and layout
    cutlass::half_t, LayoutB,                                       // operand B type and layout
    float,                                                          // accumulator type
    cutlass::layout::RowMajor>::Type;                               // accumulator layout

//
// Define a GEMM operation loading data from shared memory
//
int const kGemmK = 32;

__shared__ ElementA smem_buffer_A[WarpMma::Shape::kM * kGemmK];
__shared__ ElementB smem_buffer_B[WarpMma::Shape::kN * kGemmK];

//
// Construct iterators into SMEM tiles
//

// leading dimensions inferred from matrix problem size
int lda = WarpMma::Shape::kM;
int ldb = WarpMma::Shape::kN;

// iterators into shared memory
WarpMma::IteratorA warp_iterator_A({smem_buffer_A, lda});
WarpMma::IteratorB warp_iterator_B({smem_buffer_B, ldb});

// Fragments in registers storing the operands
WarpMma::FragmentA frag_A;
WarpMma::FragmentB frag_B;
WarpMma::FragmentC accum;

WarpMma mma;

accum.clear();

//
// Accumulated outer product
//
#pragma unroll 1
for (int k = 0; k < kGemmK; k += WarpMma::Shape::kK) {

  warp_iterator_A.load(frag_A);   // Load fragments from A and B matrices
  warp_iterator_B.load(frag_B);

  ++warp_iterator_A;              // Advance along GEMM K to the next tile
  ++warp_iterator_B;              //   in the A and B matrices

  // Compute matrix product
  mma(accum, frag_A, frag_B, accum);
}
```

*Concept.* Warp-level Mma operations are function objects satisfying the following concept.

```c++
struct Mma {

  /// Shape of warp-level matrix operation (concept: GemmShape)
  struct Shape;

  /// Data type of multiplicand A (concept: numeric type)
  struct ElementA;

  /// Layout of multiplicand A (concept: Layout)
  struct LayoutA;

  /// Data type of multiplicand B (concept: numeric type)
  struct ElementB;

  /// Layout of multiplicand B (concept: Layout)
  struct LayoutB;

  /// Data type of accumulator matrix C (concept: numeric type)
  struct ElementC;

  /// Layout of accumulator matrix C (concept: Layout)
  struct LayoutC;

  /// Iterator of A operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
  struct IteratorA;

  /// Fragment object loaded from IteratorA (concept: Array<ElementA, ..>)
  struct FragmentA;

  /// Iterator of B operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
  struct IteratorB;

  /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
  struct FragmentB;

  /// Iterator of C operand in shared memory -
  ///     satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
  struct IteratorC;

  /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
  struct FragmentC;

  /// Indicates class of matrix operator (arch::OpClassSimt or arch::OpClassTensorOp)
  struct OperatorClass;

  //
  // Methods
  //

  /// Computes a matrix multiply-accumulate
  CUTLASS_DEVICE
  void operator()(
    FragmentC &D,
    IteratorA A,
    IteratorB B,
    FragmentC const &C);

};
```



*Tensor Core Operators.* Warp-level matrix multiply operators targeting Tensor Cores
may be defined with the following template arguments. The `Policy` type specifies implementation-level details which may 
be used to affect performance or internal implementation of the warp-level operator.

```c++
namespace cutlass {
namespace gemm {
namespace warp {

/// Structure to compute the matrix product targeting Tensor Core operations.
template <
  /// Size of the Gemm problem - concept: gemm::GemmShape<>
  typename Shape_,
  /// Data type of A elements
  typename ElementA_,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA_,
  /// Data type of B elements
  typename ElementB_,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB_,
  /// Element type of C matrix
  typename ElementC_,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC_,
  /// Policy describing implementation details of the warp-level operation
  typename Policy_,
  /// Used for partial specialization
  typename Enable = bool
>
class MmaTensorOp;

} // namespace warp
} // namespace gemm
} // namespace cutlass
```

*SIMT Math Instructions.*  Warp-level matrix multiply operators targeting CUDA Cores
may be defined with the following template arguments. The `Policy` type specifies implementation-level details which may 
be used to affect performance or internal implementation of the warp-level operator.

```c++
/// Structure to compute the matrix product targeting CUDA cores and SIMT math instructions.
template <
  /// Size of the Gemm problem - concept: gemm::GemmShape<>
  typename Shape_,
  /// Data type of A elements
  typename ElementA_,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA_,
  /// Data type of B elements
  typename ElementB_,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB_,
  /// Element type of C matrix
  typename ElementC_,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC_,
  /// Shape of the warp in units of thread (concept: MmaSimtPolicy)
  typename Policy_,
  /// Used for partial specialization
  typename Enable = bool
>
class MmaSimt;
```
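
For illustration, the following is a hedged sketch of instantiating a SIMT warp-level operator. The policy and layout choices are assumptions drawn from typical CUTLASS SIMT configurations and may need adjustment for a particular tiling.

```c++
#include "cutlass/gemm/warp/mma_simt.h"
#include "cutlass/gemm/warp/mma_simt_policy.h"

// Policy describing how the 32 threads of the warp cover the warp tile:
// thread arrangement, thread layout, and per-thread multiply-add shape.
using SimtPolicy = cutlass::gemm::warp::MmaSimtPolicy<
  cutlass::MatrixShape<4, 8>,                   // 4 x 8 arrangement of threads
  cutlass::layout::RowMajorInterleaved<2>,      // layout of threads within the warp
  cutlass::gemm::GemmShape<4, 4, 1>             // per-thread multiply-add shape
>;

using WarpMmaSimt = cutlass::gemm::warp::MmaSimt<
  cutlass::gemm::GemmShape<32, 64, 8>,          // overall warp-level GEMM shape
  float, cutlass::layout::ColumnMajor,          // operand A type and layout
  float, cutlass::layout::RowMajor,             // operand B type and layout
  float, cutlass::layout::RowMajor,             // accumulator type and layout
  SimtPolicy
>;
```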


## Thread-level GEMM API

Thread-level GEMM operations perform matrix multiply-accumulate on data held in registers. These target CUDA Cores exclusively.

*Concept.* Thread-level matrix multiply operations are function objects satisfying the following concept.
```c++
struct Mma {

  /// Shape of the thread-level matrix operation (concept: GemmShape)
  struct Shape;

  /// Data type of multiplicand A (concept: numeric type)
  struct ElementA;

  /// Layout of multiplicand A (concept: Layout)
  struct LayoutA;

  /// Fragment holding operand A (concept: Array<ElementA, ..>)
  struct FragmentA;

  /// Data type of multiplicand B (concept: numeric type)
  struct ElementB;

  /// Layout of multiplicand B (concept: Layout)
  struct LayoutB;

  /// Fragment holding operand B (concept: Array<ElementB, ..>)
  struct FragmentB;

  /// Data type of accumulator matrix C (concept: numeric type)
  struct ElementC;

  /// Layout of accumulator matrix C (concept: Layout)
  struct LayoutC;

  /// Fragment holding the accumulator matrix C (concept: Array<ElementC, ..>)
  struct FragmentC;

  //
  // Methods
  //

  /// Computes a matrix multiply-accumulate
  CUTLASS_DEVICE
  void operator()(
    FragmentC &D,
    FragmentA const &A,
    FragmentB const &B,
    FragmentC const &C);

};
```

The CUTLASS thread-level GEMM template accepts the following template arguments.
```c++
namespace cutlass {
namespace gemm {
namespace thread {

/// Structure to compute the matrix product
template <
  /// Size of the Gemm problem - concept: gemm::GemmShape<>
  typename Shape,
  /// Data type of A elements
  typename ElementA,
  /// Layout of A matrix (concept: MatrixLayout)
  typename LayoutA,
  /// Data type of B elements
  typename ElementB,
  /// Layout of B matrix (concept: MatrixLayout)
  typename LayoutB,
  /// Element type of C matrix
  typename ElementC,
  /// Layout of C matrix (concept: MatrixLayout)
  typename LayoutC,
  /// Concept: arch::OpMultiplyAdd or arch::Mma<>
  typename Operator = arch::OpMultiplyAdd,
  /// Used for partial specialization
  typename Enable = bool
>
struct Mma;

} // namespace thread
} // namespace gemm
} // namespace cutlass
```
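
As a hedged sketch, the following instantiates the thread-level `Mma` for a small single-precision tile and invokes it on register-backed fragments. The shape and layouts chosen here are illustrative, assuming the default `arch::OpMultiplyAdd` specialization supports this combination.

```c++
#include "cutlass/gemm/thread/mma.h"
#include "cutlass/layout/matrix.h"

// Per-thread matrix multiply-accumulate on a 4x4x1 tile of floats.
using ThreadMma = cutlass::gemm::thread::Mma<
  cutlass::gemm::GemmShape<4, 4, 1>,            // per-thread problem shape
  float, cutlass::layout::ColumnMajor,          // operand A
  float, cutlass::layout::ColumnMajor,          // operand B
  float, cutlass::layout::ColumnMajor           // accumulator C
>;

CUTLASS_DEVICE
void thread_gemm_sketch(
  ThreadMma::FragmentC &accum,
  ThreadMma::FragmentA const &frag_A,
  ThreadMma::FragmentB const &frag_B) {

  ThreadMma mma;

  // D = A * B + C, held entirely in registers
  mma(accum, frag_A, frag_B, accum);
}
```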

## Efficient Epilogue 

CUTLASS GEMM operators perform the matrix multiply-accumulate followed by an epilogue operation, 
similar to cuBLAS. CUTLASS implements an efficient epilogue only for row-major output; to produce 
a column-major GEMM, operands A and B are transposed and swapped.

To enable the efficient row-major epilogue for both row-major and column-major output layouts, 
CUTLASS's device-level GEMM operators `cutlass::gemm::device::Gemm` and `cutlass::gemm::device::GemmUniversal` 
provide two template definitions:
- (a) [General definition](/include/cutlass/gemm/device/gemm.h#L217)
- (b) [Specialized definition for column-major source/output](/include/cutlass/gemm/device/gemm.h#L545)

The efficient row-major epilogue is applied as follows:
- (i)  A GEMM with row-major source/output uses template (a): it runs a row-major GEMM together with 
the efficient row-major epilogue directly.
- (ii) A GEMM with column-major source/output uses template (b): it transposes and swaps operands 
A and B, relying on the identity `A x B = C => Transpose(B) x Transpose(A) = Transpose(C)`. 
For a column-major source matrix C, Transpose(C) is row-major, so the efficient row-major epilogue 
applies (a conceptual sketch of this remapping follows the list).
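
The remapping in (ii) can be pictured with the following hypothetical helper (not a CUTLASS API): because a column-major matrix viewed as row-major is its transpose, swapping the operands and the M/N extents turns a column-major problem into an equivalent row-major one.

```c++
#include <utility>

// Hypothetical illustration only - the struct and function names are not CUTLASS APIs.
struct ProblemArgs {
  int m, n, k;
  void const *ptr_A; int lda;
  void const *ptr_B; int ldb;
};

// Map a column-major GEMM onto the row-major path:
//   D = A x B (column-major)  <=>  Transpose(D) = Transpose(B) x Transpose(A) (row-major view)
ProblemArgs remap_to_row_major(ProblemArgs const &col_major) {
  ProblemArgs row_major = col_major;
  std::swap(row_major.m, row_major.n);            // output extents transpose
  std::swap(row_major.ptr_A, row_major.ptr_B);    // operands swap roles
  std::swap(row_major.lda, row_major.ldb);
  return row_major;
}
```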

Note that cuBLAS typically expects a column-major source (C) and output (D) matrix, so the CUTLASS
library only instantiates and generates GEMM operators with column-major output layout. However,
CUTLASS itself can run both row-major and column-major output layouts for all combinations
of input layouts. Thus, CUTLASS supports the following input/output layout combinations:

- `{N,T} x {N,T} => {N,T}` - NN, NT, TN, and TT GEMMs for both row-major and column-major output

## Instruction-level operations

CUTLASS defines a template-based interface to Tensor Core operations to avoid resorting
to inline PTX; a usage sketch follows the list below.

- [mma_sm70.h](/include/cutlass/arch/mma_sm70.h) - Volta Tensor Core operations
- [mma_sm75.h](/include/cutlass/arch/mma_sm75.h) - Turing Tensor Core operations
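
The following is a hedged sketch of issuing a single Turing Tensor Core operation through this interface. The template arguments are assumed to match the 16x8x8 half-precision specialization in `mma_sm75.h` and may differ across CUTLASS versions.

```c++
#include "cutlass/arch/mma_sm75.h"

// One warp-wide mma.sync instruction: 16x8x8, half-precision inputs, float accumulation.
using InstructionMma = cutlass::arch::Mma<
  cutlass::gemm::GemmShape<16, 8, 8>,             // instruction shape
  32,                                             // participating threads (one warp)
  cutlass::half_t, cutlass::layout::RowMajor,     // operand A
  cutlass::half_t, cutlass::layout::ColumnMajor,  // operand B
  float, cutlass::layout::RowMajor,               // accumulator C
  cutlass::arch::OpMultiplyAdd
>;

CUTLASS_DEVICE
void mma_instruction_sketch(
  InstructionMma::FragmentC &d,
  InstructionMma::FragmentA const &a,
  InstructionMma::FragmentB const &b,
  InstructionMma::FragmentC const &c) {

  InstructionMma mma;
  mma(d, a, b, c);    // issues a single warp-synchronous matrix multiply instruction
}
```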


# Copyright

Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```
  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```