Project: AMD AI-Engine (AIE) Accelerator
AIE is a new kind of accelerator architecture developed by AMD that is distinct from CPUs, GPUs and FPGAs. It is designed to be a highly efficient and flexible architecture for accelerating a wide range of workloads. The AIE architecture is based on a dataflow model of computation, where data flows through a network of processing elements (PEs) that are connected by a network-on-chip (NoC).
An AI Engine has 24 tiles arranged as 6 rows and 4 columns as shown. Each column has a shim tile, a memory tile and four compute tiles. Each of the 16 compute tiles has a local (L1) memory, a DMA and a VLIW processor optimized for large vector operations.
There are three main components to programming an AI Engine:
Host code: This runs on the host CPU outside the AI Engine and is responsible for setting up the AI Engine, creating example data, transferring data & instructions to and from the AI Engine, and verifying the results.
Graph code: This Python script describes the dataflow within the AI engine. This is acheived by configuring high level primitives such as object fifos, to move data between tiles of the AI Engine. This script also points to the kernel function that will be executed in the VLIW processors of each tile.
Kernel functions: These are functions that are compiled to run in the VLIW processors that make the compute tiles of the AI Engine.
Quick Reference to architecture manual, programming guide, API & intrinsic documentation.
There are multiple ways to specify the host, graph and kernel codes.
C++ for host, graph and kernel code - legacy flow found in some AIE documentation
C++ for host and kernel code, Python for graph code - older MLIR-AIE/IRON flow. You may see this in some of the examples in this repository.
Python for host and graph code, C++/Python for kernel code - new IRON flow.
The new IRON flow aims to increase developer productivity by providing high level abstractions in Python to control the data movement between tiles of the AI engines. A single-source IRON program has the following parts:
0) Setup
Make sure you are connected to the campus network in one of the following ways:
On-campus ethernet
On-campus WiFi: UCSD_PROTECTED (not UCSD_GUEST)
UCSD VPN using Cisco Secure Client [More Info]
Check the UCSD_2025_Tokens.xlsx spreadsheets shared as a Canvas Announcement and on Discord #announcement channel, and get the unique token & URL assigned to you.
Open a browser and visit your assigned URL https://aupcloud.io/aipc-## and sign in with the unique token ucsd-iron25-user-## given to you.
Enter the duration (in minutes) you want the instance to run for (e.g., 180) and launch instance
Copy the files from AIE Project Directory to notebooks/mlir-aie/myproject directory in your AIE instance.
1) Mini Tutorial
Complete the mini tutorial to get familiar with the IRON programming flow. The tutorial can be summarized as follows:
Routing data through shim tiles & mem tiles:
Worker -> DDR
DDR -> Worker -> DDR
DDR -> Mem tile -> DDR (of.forward())
DDR -> Mem tile -> Worker -> DDR
DDR -> Mem tile -> Worker -> Mem tile -> DDR
Splitting and joining object fifos.
DDR -> Worker -> DDR
DDR -> OF -> 3x(OFs -> Workers -> OFs) -> OF -> DDR
Processing multiple inputs and outputs
DDR (A) -> Worker (C=A) -> DDR (A)
DDR (A,B) -> Worker (C=A+B) -> DDR (C)
Runtime Parameters & synchronization
TensorAccessPatterns (TAPs)
DDR -> Worker -> DDR
Write a TensorAccessPattern (TAP) on
.fill()such that the output matches the refreplace tap with
simple_tiler()Observe a new pattern
2) Understanding Tiled Matrix Multiplication
The following python code (which you can run) shows how tiled matrix multiplication works. It first generates two random matrices A and B, and computes the expected output matrix C using numpy’s built-in matrix multiplication. It then performs tiled matrix multiplication by breaking down the input matrices into smaller tiles, multiplying them, and accumulating the results into the tiled output matrix. Finally, it un-tiles the output matrix to bring it to the original shape and compares the computed output with the expected output to verify correctness.
import numpy as np
m,k,n = 64, 32, 16 # Matrix dimensions
r,s,t = 8, 4, 2 # Tile dimensions
'''
Matrix Multiplication: A[m x k] @ B[k x n] = C[m x n]
'''
matA = np.random.randint(-10,10,(m,k), np.int32) # Matrix A of size (m x k)
matB = np.random.randint(-10,10,(k,n), np.int32) # Matrix B of size (k x n)
matC_exp = matA @ matB # Output matrix C of size (m x n)
'''
Matrix Multiplication tiled by r,s,t
matrix A: (m x k) tiled into (m//r, k//s) number of small matrices of size (r x s) each
matrix B: (k x n) tiled into (k//s, n//t) number of small matrices of size (s x t) each
matrix C: (m x n) tiled into (m//r, n//t) number of small matrices of size (r x t) each
'''
matA_tiled = matA.reshape(m//r,r,k//s,s).transpose(0,2,1,3) # shape (m//r, K//k, r, s)
matB_tiled = matB.reshape(k//s,s,n//t,t).transpose(0,2,1,3) # shape (k//s, n//t, s, t)
matC_tiled = np.zeros((m//r,n//t,r,t), dtype=np.int32) # shape (m//r, n//t, r, t)
for im_tiles in range(m//r):
for in_tiles in range(n//t):
# each tile
out_tile = np.zeros((r,t), dtype=np.int32)
for ik_tiles in range (k//s):
A_tile = matA_tiled[im_tiles, ik_tiles]
B_tile = matB_tiled[ik_tiles, in_tiles]
out_tile += A_tile @ B_tile
matC_tiled[im_tiles,in_tiles] = out_tile
'''
Comparing outputs
'''
matC_out = matC_tiled.transpose(0,2,1,3).reshape(m,n) # un-tile output matrix C to shape (m,n)
diff = matC_exp-matC_out
error = np.sum(np.abs(diff))
print(error)
3) Single Core Tiled Matrix Multiplication
- To run the basic matrix multiplication:
cd /notebooks/mlir-aie/myprojectpython basic_mm.py
The following image shows the dataflow for the single core matrix multiplication:
How it works:
r=2, s=8, t=8are the intrinsic dimensions for theMMULAPI used in the kernel (C++) code.m, k, nare the dimensions of the matrices we want to multiply. We mulitply matrixAof sizem x kwith matrixBof sizek x nto get output matrixCof sizem x n.main()function generates random input matrices, and a reference output matrix. It then calls thematrix_multiplication_single_corefunction. This compiles just-in-time (JIT) during the first call and executes.The
matrix_multiplication_single_corefunction sets up the dataflow through the AI engines viaObjectFIFOs, defines the kernel function and compute tile, then performs the runtime operations to transfer data to/from the AI engines and execute the kernel.
Vector intrinsic size: (r,s,t)
Each compute core of the AI Engine is a VLIW vector processor. That is, it can perform 512 int8, or 64 int16 multiply-accumulate operations in parallel, within one clock cycle. It also can perform two loads and one store of vectors in one clock cycle. Therefore, to maximize the performance, the kernel code uses aie::load_v() and aie::store_v() primitive functions to load entire vectors: a row from matrix A of size r and a column from matrix B of size s. We also use the MMUL::mac() primitive to perform the multiply-accumulate on a pair of vectors. We perform four such vector MACs at once to maximize performance. The APIs and primitives are listed here.
Tiling on the fly using ObjectFifos
The fifo_A and fifo_B receive sub-matrices of size m x k and k x n, respectively. The FIFOs translate those matrices from a row-major format into the r x s-sized and s x t-sized blocks required by the hardware’s vector instrinsics before sending them into the compute cores memory.
For matrix A (fifo_A), this transformation is expressed using the following wraps and strides as a list of tuples (wrap, stride), given as arguments to the object_fifo() operation:
(Note that // denotes integer floor-division in Python.)
(m // r, r * k), # Pair 1
(k // s, s), # Pair 2
(r, k), # Pair 3
(s, 1), # Pair 4
Let us break down each component of this pattern. We do so back-to-front for ease of understanding:
- Pair 4: (s, 1)
This dimension represents the transfer of a single row of a r x s-sized tile (our target tile size after the transformation).
Wrap: s is the length of a row of a r x s-sized block in units of 4 bytes (i32 elements).
Stride: A stride of 1 retrieves contiguous elements.
- Pair 3: (r, k)
Together with the previous dimension, this dimenison represents the transfer of a single r x s-sized tile.
Wrap: r is the number of rows of a r x s-sized tile.
Stride: k is the stride between first element of each consecutive row along the m dimension, i.e. adding this stride to a memory address points to the element in the matrix directly below the original address.
- Pair 2: (k // s, s)
Together with the previous dimensions, this dimension represents the transfer of one row of r x s-sized tiles, i.e. the first k x s elements of the input array.
Wrap: k // s is the number of r x s-sized tiles along the k (columns) dimension.
Stride: s is the stride between starting elements of consecutive blocks along the k dimension, i.e. adding this stridde to a memory address points to the same element in the r x s-sized block directly to the right of the block of the original address.
- Pair 1: (m // r, r * k)
Together with the previous dimensions, this dimension transfers the entire m x k-sized matrix as blocks of r x s-sized tiles.
Wrap: m // r is the number of r x s-sized blocks along the m (rows) dimension.
Stride: r * k is the stride between starting elements of consecutive blocks along the m dimension, i.e. adding this stride to a memory address points to the same element in the r x s-sized block directly below the block of the original address.
The following image describes the pattern of the object fifos for matrix A:
4) Whole Array Matrix Multiplication
In this example, the above single core matrix multiplication is extended to use all compute tiles of the 4x4 AI Engine array. To run the whole array matrix multiplication:
/notebooks/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array
make clean && make run use_iron=1
The above image describes the whole array matrix multiplication design for the Ryzen AI device. You can visualize the dataflow using this notebook
The dataflow in the Matrix Multiplication design is as follows. Two submatrices of size r x s from matrix A are broacasted across the each row of the AIE tiles. Similarly, two submatrices of size s x t from matrix B are broadcasted across each column of the AIE tiles. The compute tiles perform the vector multiply-accumulate operations on the submatrices and store the results in the output matrix C.
Matrix |
Size |
Submatrix Size (1.) |
Vector Intrinsic Size (2.) |
|---|---|---|---|
A (Input) |
M x K |
m x k |
r x s |
B (Input) |
K x N |
k x n |
s x t |
C (Output) |
M x N |
m x n |
r x t |
Multiple levels of tiling (M,K,N) -> (m,k,n) -> (r,s,t)
We first specify the dimensions M, K, N for the input matrices A (MxK), and B (KxN), and the output matrix C (MxN), as well as their data type. To enable efficient computation, our design will split large input matrices into smaller sub-matrix blocks on two levels; we thus also define the sizes of those sub-matrices.
At the first level, the constants m, k, and n define the size of the submatrices processed by each AIE core. This is done in the dataflow code.At the second level, we further subdivide using smaller sizes r, s and t – these are the sizes of required by the vector computation intrinsics of the AIEs. We leverage the multidimensional DMAs available in AIEs, through a higher level abstraction (object fifo), to automatically tile and load at this level.
The two levels of tiling of the output matrix C (MxN) is shown below:
5) Questions
Q1, Q2 are based on mini tutorials. Q3 is a generic question. Q4 is based on the whole array matrix multiplication to observe the full performance of AI Engine. Q5-Q7 are your project, based on the starter code.
In Exercise 5b of the mini tutorial, generate 3 different tensor access patterns (TAPs) for a 2D array. Write the equivalent nested loops (pseudocode or Python code) for data access in each of them. Include the code and TAP images in the report.
What role do ObjectFIFOs play in the design of the data movement within the AIE array? Describe how ObjectFIFOs facilitate synchronization between compute cores and memory tiles.
Discuss the purpose of “ping” and “pong” phases in data transfer. How does this design choice improve performance in handling large matrices?
Measure the performance (runtime, GOPS) and limitations of AI Engines using the Whole Array Matrix Multiplication example. You may edit the code to do so.
Keep the datatypes and (r,s,t) default. Change the parameters: (M, K, N) only, and plot a graph with workload size (M*K*N) on the x-axis. Draw multiple lines, varying the workload’s asymmetry for the same workload sizes. Eg. M doubled, K halved, N unchanged. M doubled, K unchanged, N halved.
Keep (r,s,t) fixed. For a set of linearly increasing workload size (M*K*N), change input and output datatypes (int8, int16, int32, and float) and plot another graph.
For the same set of increasing workload sizes, change (r,s,t)s and input/output datatypes, and plot another graph.
If anything did not work, refer to the documentation (not GitHub) to find out why, and explain, citing that reference.
Extend the basic passthrough example provided, such that the data passes through two compute tiles sequentially (one after another) instead of one. Measure the performance (runtime, GB/s) and compare it with the single compute tile design. To measure performance, study the code for the
whole array matrix multiplication, get the idea from there, and write equivalent Python code.Modify
basic_mm.pyandmatmul.ccto implement a simple dense layer that performsY = ReLU(X @ W). The python file should be nameddense.py, kernel function should be nameddense()and the kernel file should be nameddense.cc. Hint: Mathematically,ReLU(z) = max(z,0). In the kernel code, you can create a vector of zeros withauto zeros = aie::zeros<DTYPE, MMUL::size_C>();. You can perform a vectorized max withauto vec3 = aie::max(vec1, vec2).Combine the dense layer from (Q6) with two tile passthrough (Q5) to create a two layer neural network. Fill the blanks in
nn.pyand get it working. Measure the performance and compare it with the single tile matrix multiplication design. Change the intrinsic sizes from2,8,8to4,8,4and describe your observations. If it passes, good. If it fails, explain why.
6) Submission Procedure
You must submit
Report.pdf to gradescope
Code to a PRIVATE repo.
Keep your GitHub repo PRIVATE and invite me (abarajithan11). Include the link to the repo in the report.
You must submit your code (and only your code, not generated files). We must be able to use what is provided (.cpp, *.py files) and directly run your design in NPU Cloud. If you change test benches to answer questions, please submit them as well. You must follow the file structure below. We use automated scripts to pull your data, so **DOUBLE CHECK* your file/folder names to make sure it corresponds to the instructions. Your repo must contain a folder named “aie” at the top-level. This folder must be organized as follows:
passthrough_two_tiles.py
dense.cc
dense.py
nn.py
Optimizing Whole Array Matrix Multiplication for Small N
NOTE: This is not a part of Project 4. You can choose to do this part as Project 5, if you prefer.
The whole array design is efficient for matrices that are much bigger than the 4x4 AI Engine array. However, if the N dimension is small, it would be wasteful to pad the matrix with zeros. The following is a design that would be more efficient for small N dimensions:
Note that dimenison N of input matrix B and the output matrix C is smaller than the number of columns in the AI Engine array. Therefore, we can split the input matrix A into two matrices of size M/2 x K We can then broadcast the two submatrices of A across the rows and the two submatrices of B across the different groups of columns of the AI Engine array as shown.
Start with the whole array design as reference and implement this project. Present a comparison of performance metrics between this design and the whole array design with padded B matrix for different matrix sizes and datatypes. Analyze your observations.