Project: AMD AI-Engine (AIE) Accelerator

AIE is a new kind of accelerator architecture developed by AMD that is distinct from CPUs, GPUs and FPGAs. It is designed to be a highly efficient and flexible architecture for accelerating a wide range of workloads. The AIE architecture is based on a dataflow model of computation, where data flows through a network of processing elements (PEs) that are connected by a network-on-chip (NoC).

An AI Engine has 24 tiles arranged as 6 rows and 4 columns as shown. Each column has a shim tile, a memory tile and four compute tiles. Each of the 16 compute tiles has a local (L1) memory, a DMA and a VLIW processor optimized for large vector operations.

There are three main components to programming an AI Engine:

Host code: This runs on the host CPU outside the AI Engine and is responsible for setting up the AI Engine, creating example data, transferring data & instructions to and from the AI Engine, and verifying the results.
Graph code: This Python script describes the dataflow within the AI engine. This is acheived by configuring high level primitives such as object fifos, to move data between tiles of the AI Engine. This script also points to the kernel function that will be executed in the VLIW processors of each tile.
Kernel functions: These are functions that are compiled to run in the VLIW processors that make the compute tiles of the AI Engine.

Quick Reference to architecture manual, programming guide, API & intrinsic documentation.

There are multiple ways to specify the host, graph and kernel codes.

C++ for host, graph and kernel code - legacy flow found in some AIE documentation
C++ for host and kernel code, Python for graph code - older MLIR-AIE/IRON flow. You may see this in some of the examples in this repository.
Python for host and graph code, C++/Python for kernel code - new IRON flow.

The new IRON flow aims to increase developer productivity by providing high level abstractions in Python to control the data movement between tiles of the AI engines. A single-source IRON program has the following parts:

0) Setup

Make sure you are connected to the campus network in one of the following ways:

On-campus ethernet

On-campus WiFi: UCSD_PROTECTED (not UCSD_GUEST)

UCSD VPN using Cisco Secure Client [More Info]

Check the UCSD_2025_Tokens.xlsx spreadsheets shared as a Canvas Announcement and on Discord #announcement channel, and get the unique token & URL assigned to you.
Open a browser and visit your assigned URL https://aupcloud.io/aipc-## and sign in with the unique token ucsd-iron25-user-## given to you.
Enter the duration (in minutes) you want the instance to run for (e.g., 180) and launch instance
Copy the files from AIE Project Directory to notebooks/mlir-aie/myproject directory in your AIE instance.

1) Mini Tutorial

Complete the mini tutorial to get familiar with the IRON programming flow. The tutorial can be summarized as follows:

Routing data through shim tiles & mem tiles:

Worker -> DDR

DDR -> Worker -> DDR

DDR -> Mem tile -> DDR (of.forward())

DDR -> Mem tile -> Worker -> DDR

DDR -> Mem tile -> Worker -> Mem tile -> DDR

Splitting and joining object fifos.

DDR -> Worker -> DDR

DDR -> OF -> 3x(OFs -> Workers -> OFs) -> OF -> DDR

Processing multiple inputs and outputs

DDR (A) -> Worker (C=A) -> DDR (A)

DDR (A,B) -> Worker (C=A+B) -> DDR (C)

Runtime Parameters & synchronization
TensorAccessPatterns (TAPs)

DDR -> Worker -> DDR

Write a TensorAccessPattern (TAP) on .fill() such that the output matches the ref

replace tap with simple_tiler()

Observe a new pattern

2) Understanding Tiled Matrix Multiplication

The following python code (which you can run) shows how tiled matrix multiplication works. It first generates two random matrices A and B, and computes the expected output matrix C using numpy’s built-in matrix multiplication. It then performs tiled matrix multiplication by breaking down the input matrices into smaller tiles, multiplying them, and accumulating the results into the tiled output matrix. Finally, it un-tiles the output matrix to bring it to the original shape and compares the computed output with the expected output to verify correctness.

import numpy as np

m,k,n = 64, 32, 16  # Matrix dimensions
r,s,t = 8, 4, 2     # Tile dimensions

'''
Matrix Multiplication: A[m x k]  @  B[k x n]  =  C[m x n]
'''

matA = np.random.randint(-10,10,(m,k), np.int32) # Matrix A of size (m x k)
matB = np.random.randint(-10,10,(k,n), np.int32) # Matrix B of size (k x n)
matC_exp = matA @ matB                           # Output matrix C of size (m x n)

'''
Matrix Multiplication tiled by r,s,t

matrix A: (m x k) tiled into (m//r, k//s) number of small matrices of size (r x s) each
matrix B: (k x n) tiled into (k//s, n//t) number of small matrices of size (s x t) each
matrix C: (m x n) tiled into (m//r, n//t) number of small matrices of size (r x t) each
'''

matA_tiled = matA.reshape(m//r,r,k//s,s).transpose(0,2,1,3) # shape (m//r, K//k, r, s)
matB_tiled = matB.reshape(k//s,s,n//t,t).transpose(0,2,1,3) # shape (k//s, n//t, s, t)
matC_tiled = np.zeros((m//r,n//t,r,t), dtype=np.int32)      # shape (m//r, n//t, r, t)

for im_tiles in range(m//r):
    for in_tiles in range(n//t):
        # each tile
        out_tile = np.zeros((r,t), dtype=np.int32)
        for ik_tiles in range (k//s):
            A_tile = matA_tiled[im_tiles, ik_tiles]
            B_tile = matB_tiled[ik_tiles, in_tiles]
            out_tile += A_tile @ B_tile
        matC_tiled[im_tiles,in_tiles] = out_tile
'''
Comparing outputs
'''

matC_out = matC_tiled.transpose(0,2,1,3).reshape(m,n)    # un-tile output matrix C to shape (m,n)
diff = matC_exp-matC_out
error = np.sum(np.abs(diff))
print(error)

3) Single Core Tiled Matrix Multiplication

To run the basic matrix multiplication:

cd /notebooks/mlir-aie/myproject

python basic_mm.py

The following image shows the dataflow for the single core matrix multiplication:

How it works:

r=2, s=8, t=8 are the intrinsic dimensions for the MMUL API used in the kernel (C++) code.
m, k, n are the dimensions of the matrices we want to multiply. We mulitply matrix A of size m x k with matrix B of size k x n to get output matrix C of size m x n.
main() function generates random input matrices, and a reference output matrix. It then calls the matrix_multiplication_single_core function. This compiles just-in-time (JIT) during the first call and executes.
The matrix_multiplication_single_core function sets up the dataflow through the AI engines via ObjectFIFOs, defines the kernel function and compute tile, then performs the runtime operations to transfer data to/from the AI engines and execute the kernel.

Vector intrinsic size: (r,s,t)

Each compute core of the AI Engine is a VLIW vector processor. That is, it can perform 512 int8, or 64 int16 multiply-accumulate operations in parallel, within one clock cycle. It also can perform two loads and one store of vectors in one clock cycle. Therefore, to maximize the performance, the kernel code uses aie::load_v() and aie::store_v() primitive functions to load entire vectors: a row from matrix A of size r and a column from matrix B of size s. We also use the MMUL::mac() primitive to perform the multiply-accumulate on a pair of vectors. We perform four such vector MACs at once to maximize performance. The APIs and primitives are listed here.

Tiling on the fly using ObjectFifos

The fifo_A and fifo_B receive sub-matrices of size m x k and k x n, respectively. The FIFOs translate those matrices from a row-major format into the r x s-sized and s x t-sized blocks required by the hardware’s vector instrinsics before sending them into the compute cores memory.

For matrix A (fifo_A), this transformation is expressed using the following wraps and strides as a list of tuples (wrap, stride), given as arguments to the object_fifo() operation: (Note that // denotes integer floor-division in Python.)

(m // r, r * k), # Pair 1
(k // s, s), # Pair 2
(r, k), # Pair 3
(s, 1), # Pair 4

Let us break down each component of this pattern. We do so back-to-front for ease of understanding:

Pair 4: (s, 1)
- This dimension represents the transfer of a single row of a r x s-sized tile (our target tile size after the transformation).
- Wrap: s is the length of a row of a r x s-sized block in units of 4 bytes (i32 elements).
- Stride: A stride of 1 retrieves contiguous elements.
Pair 3: (r, k)
- Together with the previous dimension, this dimenison represents the transfer of a single r x s-sized tile.
- Wrap: r is the number of rows of a r x s-sized tile.
- Stride: k is the stride between first element of each consecutive row along the m dimension, i.e. adding this stride to a memory address points to the element in the matrix directly below the original address.
Pair 2: (k // s, s)
- Together with the previous dimensions, this dimension represents the transfer of one row of r x s-sized tiles, i.e. the first k x s elements of the input array.
- Wrap: k // s is the number of r x s-sized tiles along the k (columns) dimension.
- Stride: s is the stride between starting elements of consecutive blocks along the k dimension, i.e. adding this stridde to a memory address points to the same element in the r x s-sized block directly to the right of the block of the original address.
Pair 1: (m // r, r * k)
- Together with the previous dimensions, this dimension transfers the entire m x k-sized matrix as blocks of r x s-sized tiles.
- Wrap: m // r is the number of r x s-sized blocks along the m (rows) dimension.
- Stride: r * k is the stride between starting elements of consecutive blocks along the m dimension, i.e. adding this stride to a memory address points to the same element in the r x s-sized block directly below the block of the original address.

The following image describes the pattern of the object fifos for matrix A:

4) Whole Array Matrix Multiplication

In this example, the above single core matrix multiplication is extended to use all compute tiles of the 4x4 AI Engine array. To run the whole array matrix multiplication:

/notebooks/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array

make clean && make run use_iron=1

The above image describes the whole array matrix multiplication design for the Ryzen AI device. You can visualize the dataflow using this notebook

The dataflow in the Matrix Multiplication design is as follows. Two submatrices of size r x s from matrix A are broacasted across the each row of the AIE tiles. Similarly, two submatrices of size s x t from matrix B are broadcasted across each column of the AIE tiles. The compute tiles perform the vector multiply-accumulate operations on the submatrices and store the results in the output matrix C.

Matrix	Size	Submatrix Size (1.)	Vector Intrinsic Size (2.)
A (Input)	M x K	m x k	r x s
B (Input)	K x N	k x n	s x t
C (Output)	M x N	m x n	r x t

Multiple levels of tiling (M,K,N) -> (m,k,n) -> (r,s,t)

We first specify the dimensions M, K, N for the input matrices A (MxK), and B (KxN), and the output matrix C (MxN), as well as their data type. To enable efficient computation, our design will split large input matrices into smaller sub-matrix blocks on two levels; we thus also define the sizes of those sub-matrices.

At the first level, the constants m, k, and n define the size of the submatrices processed by each AIE core. This is done in the dataflow code.At the second level, we further subdivide using smaller sizes r, s and t – these are the sizes of required by the vector computation intrinsics of the AIEs. We leverage the multidimensional DMAs available in AIEs, through a higher level abstraction (object fifo), to automatically tile and load at this level.

The two levels of tiling of the output matrix C (MxN) is shown below:

5) Questions

Q1, Q2 are based on mini tutorials. Q3 is a generic question. Q4 is based on the whole array matrix multiplication to observe the full performance of AI Engine. Q5-Q7 are your project, based on the starter code.

In Exercise 5b of the mini tutorial, generate 3 different tensor access patterns (TAPs) for a 2D array. Write the equivalent nested loops (pseudocode or Python code) for data access in each of them. Include the code and TAP images in the report.
What role do ObjectFIFOs play in the design of the data movement within the AIE array? Describe how ObjectFIFOs facilitate synchronization between compute cores and memory tiles.
Discuss the purpose of “ping” and “pong” phases in data transfer. How does this design choice improve performance in handling large matrices?
Measure the performance (runtime, GOPS) and limitations of AI Engines using the Whole Array Matrix Multiplication example. You may edit the code to do so.

Keep the datatypes and (r,s,t) default. Change the parameters: (M, K, N) only, and plot a graph with workload size (M*K*N) on the x-axis. Draw multiple lines, varying the workload’s asymmetry for the same workload sizes. Eg. M doubled, K halved, N unchanged. M doubled, K unchanged, N halved.

Keep (r,s,t) fixed. For a set of linearly increasing workload size (M*K*N), change input and output datatypes (int8, int16, int32, and float) and plot another graph.

For the same set of increasing workload sizes, change (r,s,t)s and input/output datatypes, and plot another graph.

If anything did not work, refer to the documentation (not GitHub) to find out why, and explain, citing that reference.

Extend the basic passthrough example provided, such that the data passes through two compute tiles sequentially (one after another) instead of one. Measure the performance (runtime, GB/s) and compare it with the single compute tile design. To measure performance, study the code for the whole array matrix multiplication, get the idea from there, and write equivalent Python code.
Modify basic_mm.py and matmul.cc to implement a simple dense layer that performs Y = ReLU(X @ W). The python file should be named dense.py, kernel function should be named dense() and the kernel file should be named dense.cc. Hint: Mathematically, ReLU(z) = max(z,0). In the kernel code, you can create a vector of zeros with auto zeros = aie::zeros<DTYPE, MMUL::size_C>();. You can perform a vectorized max with auto vec3 = aie::max(vec1, vec2).
Combine the dense layer from (Q6) with two tile passthrough (Q5) to create a two layer neural network. Fill the blanks in nn.py and get it working. Measure the performance and compare it with the single tile matrix multiplication design. Change the intrinsic sizes from 2,8,8 to 4,8,4 and describe your observations. If it passes, good. If it fails, explain why.

6) Submission Procedure

You must submit

Report.pdf to gradescope
Code to a PRIVATE repo.

Keep your GitHub repo PRIVATE and invite me (abarajithan11). Include the link to the repo in the report.

You must submit your code (and only your code, not generated files). We must be able to use what is provided (.cpp, *.py files) and directly run your design in NPU Cloud. If you change test benches to answer questions, please submit them as well. You must follow the file structure below. We use automated scripts to pull your data, so **DOUBLE CHECK* your file/folder names to make sure it corresponds to the instructions. Your repo must contain a folder named “aie” at the top-level. This folder must be organized as follows:

passthrough_two_tiles.py
dense.cc
dense.py
nn.py

Optimizing Whole Array Matrix Multiplication for Small N

NOTE: This is not a part of Project 4. You can choose to do this part as Project 5, if you prefer.

The whole array design is efficient for matrices that are much bigger than the 4x4 AI Engine array. However, if the N dimension is small, it would be wasteful to pad the matrix with zeros. The following is a design that would be more efficient for small N dimensions:

Note that dimenison N of input matrix B and the output matrix C is smaller than the number of columns in the AI Engine array. Therefore, we can split the input matrix A into two matrices of size M/2 x K We can then broadcast the two submatrices of A across the rows and the two submatrices of B across the different groups of columns of the AI Engine array as shown.

Start with the whole array design as reference and implement this project. Present a comparison of performance metrics between this design and the whole array design with padded B matrix for different matrix sizes and datatypes. Analyze your observations.