INTEL AMX

For now, the best way to think of AMX is as a matrix math overlay for the AVX-512 vector math units: a “TensorCore”-type unit for the CPU. AMX got only a short snippet of the overall event, but even that gives us an idea of how much space Intel is devoting to training and inference specifically. AMX is expected to appear first with Sapphire Rapids. Intel also has a new customer in the Argonne National Lab, for which it will build another supercomputer.

Intel designed AVX-512 to be rolled out incrementally because the microcode is comparatively complex. AVX-512 started with Skylake-X and has slowly been extended since. The goal is to let the CPU handle more high-performance computing. AMD is expected to support AVX-512 with Zen 4 when it ships.

The Intel Advanced Matrix Extensions (Intel® AMX) is a new 64-bit programming paradigm consisting of two components:

  • A set of 2-dimensional registers (tiles) representing sub-arrays from a larger 2-dimensional memory image
  • An accelerator that is able to operate on tiles
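
To make the idea of a tile as a sub-array of a larger memory image concrete, here is a minimal software sketch in Python. It is not Intel's API: the names `load_tile` and `store_tile` and the flat-list "memory" are assumptions made for illustration. The only hardware detail borrowed is the published tile limit of 16 rows of 64 bytes each.

```python
# Software model of a tile: a small 2-D block copied out of a flat,
# row-major byte image. Function names are hypothetical, not intrinsics.
MAX_ROWS = 16        # published AMX limit: up to 16 rows per tile
MAX_ROW_BYTES = 64   # published AMX limit: up to 64 bytes per row


def load_tile(memory, base, stride, rows, row_bytes):
    """Copy a rows x row_bytes sub-array starting at byte offset `base`.

    `stride` is the distance in bytes between the starts of two
    consecutive rows in the larger memory image.
    """
    if not (0 < rows <= MAX_ROWS and 0 < row_bytes <= MAX_ROW_BYTES):
        raise ValueError("tile shape exceeds the modelled limits")
    tile = []
    for r in range(rows):
        start = base + r * stride
        tile.append(list(memory[start:start + row_bytes]))
    return tile


def store_tile(memory, base, stride, tile):
    """Write a tile back into the flat image, one row at a time."""
    for r, row in enumerate(tile):
        start = base + r * stride
        memory[start:start + len(row)] = row


# A 4 x 8 byte memory image, stored row-major (stride of 8 bytes).
image = list(range(32))

# Pull out the 2 x 4 sub-array whose top-left corner is row 1, column 2.
tile = load_tile(image, base=1 * 8 + 2, stride=8, rows=2, row_bytes=4)
print(tile)

# Double every element "in the tile" and write the block back in place.
store_tile(image, 1 * 8 + 2, 8, [[2 * v for v in row] for row in tile])
```

The stride parameter is what lets a small tile address a window of a much larger matrix without copying the rows in between.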

The following sections show intrinsics that are available for Intel(R) Advanced Matrix Extension Instructions.

bfloat16;

The Advanced Matrix Extension (AMX) is an x86 extension that introduces a new programming framework for working with matrices (rank-2 tensors). The extension introduces two new components: a 2-dimensional register file whose registers are called “tiles”, and a set of accelerators that are able to operate on those tiles. Each tile represents a sub-array of a larger 2-dimensional memory image. AMX instructions are synchronous in the instruction stream, and tile load/store operations are coherent with the host’s memory accesses. AMX instructions may be freely interleaved with traditional x86 code and execute in parallel with other extensions (e.g., AVX-512); the special tile loads, tile stores and accelerator commands are sent over to the accelerator for execution.
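
The load, operate, store pattern described above can be sketched as a blocked matrix multiply. This is a plain-Python model under stated assumptions, not AMX code: `TILE`, `load_tile`, `store_tile`, `tile_mul_acc` and `tiled_matmul` are hypothetical names, and the 4 x 4 tile is far smaller than a real tile so the example stays readable.

```python
TILE = 4  # illustrative tile edge, chosen small for readability


def load_tile(mat, r0, c0):
    """Copy the TILE x TILE sub-array whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + TILE] for row in mat[r0:r0 + TILE]]


def store_tile(mat, r0, c0, tile):
    """Write a tile back into the larger matrix."""
    for i, row in enumerate(tile):
        mat[r0 + i][c0:c0 + TILE] = row


def tile_mul_acc(acc, a, b):
    """acc += a @ b; stands in for one multiply-accumulate on tiles."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))


def tiled_matmul(A, B):
    """Multiply A (n x k) by B (k x m) one tile at a time."""
    n, k_dim, m = len(A), len(B), len(B[0])
    if n % TILE or k_dim % TILE or m % TILE:
        raise ValueError("dimensions must be multiples of TILE in this sketch")
    C = [[0] * m for _ in range(n)]
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            acc = [[0] * TILE for _ in range(TILE)]  # start from a zeroed tile
            for k in range(0, k_dim, TILE):
                tile_mul_acc(acc, load_tile(A, i, k), load_tile(B, k, j))
            store_tile(C, i, j, acc)
    return C


# Two 8 x 8 integer matrices with arbitrary small entries.
A = [[(3 * r + c) % 7 for c in range(8)] for r in range(8)]
B = [[(r + 2 * c) % 5 - 2 for c in range(8)] for r in range(8)]
C = tiled_matmul(A, B)
```

The accumulator tile stays resident for the whole inner loop and is stored only once per output block, which is the main reason a tile register file pays off for matrix multiplication.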

Heavy use of AVX-512 heats up the processor, so it usually downclocks to let the workload continue. AVX-512 widens the XMM and YMM registers to 512 bits and expands the register file to 32 registers, using the extended EVEX encoding. Since blending is an integral part of the EVEX encoding, the mask-driven blend instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.
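
The difference between the merging and zeroing blend modes is easiest to see lane by lane. The following is a small Python model, not real intrinsics: `masked_add` is a hypothetical helper, and lists stand in for vector registers.

```python
def masked_add(dst, a, b, mask, zeroing):
    """Model of a masked vector add.

    Bit i of `mask` decides lane i. Active lanes receive a[i] + b[i].
    Inactive lanes keep dst[i] (merging) or become 0 (zeroing).
    """
    out = []
    for lane, (d, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> lane) & 1:
            out.append(x + y)
        else:
            out.append(0 if zeroing else d)
    return out


dst = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Lanes 0 and 2 are active (mask bits 0 and 2 are set).
merged = masked_add(dst, a, b, 0b0101, zeroing=False)  # [11, 9, 33, 9]
zeroed = masked_add(dst, a, b, 0b0101, zeroing=True)   # [11, 0, 33, 0]
```

With zeroing, the result no longer depends on the old destination contents, which is why the same instructions can double as masking operations.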

AMX is a major step up from traditional AVX-based vector instructions. It supports BF16 and INT8 multiplication with accumulation into FP32 and INT32 respectively. AMX-based BF16 is supposed to be 5x faster than Cooper Lake’s BF16, while AMX-INT8 is supposed to be 8x faster than AVX-512 VNNI. As such, it’s possible that Sapphire Rapids, which will be the first CPU to support these instructions, will also be the first with 100 TFLOPS of theoretical FP throughput and 300 TOPS of simultaneous INT8 compute performance.
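
For readers unfamiliar with the two data types, here is a short Python sketch of what they look like numerically. BF16 keeps the top 16 bits of an FP32 value, and INT8 products are summed into a wider integer. The helper names are hypothetical, NaN and infinity handling is left out, and the INT8 helper does not model 32-bit wraparound.

```python
import struct


def f32_bits(x):
    """Bit pattern of x after rounding to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a float to bfloat16 (round to nearest, ties to even).

    Returns the 16-bit pattern. NaN and infinity are not handled here.
    """
    bits = f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped half
    return (bits >> 16) & 0xFFFF


def from_bf16(h):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]


def dot4_i8_acc(acc, a4, b4):
    """Add the dot product of two groups of four INT8 values to `acc`.

    Models the multiply in 8 bits, accumulate in a wider integer idea.
    """
    for v in list(a4) + list(b4):
        if not -128 <= v <= 127:
            raise ValueError("inputs must fit in a signed 8-bit integer")
    return acc + sum(x * y for x, y in zip(a4, b4))


print(hex(to_bf16(1.0)))              # 0x3f80
print(from_bf16(to_bf16(3.14159)))    # 3.140625
print(dot4_i8_acc(100, [1, -2, 3, -4], [5, 6, -7, 8]))  # 40
```

BF16 keeps the full FP32 exponent range and gives up mantissa bits, which is why 3.14159 comes back as 3.140625.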

More work on float16 is expected, as this is the format used by the GeForce RTX 20 series and above for tensor calculations. For this kind of workload, AMX is almost a replacement for the GPU logic.
