<table>
<thead>
<tr>
<th><strong>Title</strong></th>
<th>Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Author(s)</strong></td>
<td>Rafique, Abid; Constantinides, George A.; Kapre, Nachiket</td>
</tr>
<tr>
<td><strong>Date</strong></td>
<td>2015-01</td>
</tr>
<tr>
<td><strong>URL</strong></td>
<td><a href="http://hdl.handle.net/10220/39128">http://hdl.handle.net/10220/39128</a></td>
</tr>
<tr>
<td><strong>Rights</strong></td>
<td>© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [<a href="http://dx.doi.org/10.1109/TPDS.2014.6">http://dx.doi.org/10.1109/TPDS.2014.6</a>].</td>
</tr>
</tbody>
</table>
Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Abid Rafique, George A. Constantinides, Senior Member, IEEE Nachiket Kapre, Member, IEEE

Abstract—Trading communication with redundant computation can increase the silicon efficiency of FPGAs and GPU in accelerating communication-bound sparse iterative solvers. While \( k \) iterations of the iterative solver can be unrolled to provide \( O(k) \) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model and the growth in redundant computation. This paper presents a systematic procedure to select this algorithmic parameter \( k \), which provides communication-computation tradeoff on hardware accelerators like FPGA and GPU. We provide predictive models to understand this tradeoff and show how careful selection of \( k \) can lead to performance improvement that otherwise demands significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a \( 1.9 \times -42.6 \times \) speedup over standard iterative solver for a range of benchmarks and that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows large \( k \) and hence higher speedups. Our approach for FPGA provides a \( 0.3 \times -4.4 \times \) speedup over same generation GPU device where \( k \) is picked carefully for both architectures for a range of benchmarks.

Index Terms—Iterative Numerical Methods; Sparse Matrix–Vector Multiply; Matrix Powers Kernel; Field Programmable Gate Arrays (FPGAs); Graphics Processing Units (GPUs)

1 INTRODUCTION

The cost of a high performance scientific computation operating on large datasets consists of two factors (1) computation cost of performing floating-point operations (2) communication cost (both latency and bandwidth) of moving data within the memory hierarchy in sequential case or between processors in parallel case. One of the communication-intensive scientific computations is an iterative solver used for solving large-scale sparse linear system of equations \( (Ax = b) \) and eigenvalue problems \( (Ax = \lambda x) \) [1]. The solution of these problems is computed from a Krylov subspace \( \text{span}(x, Ax, A^2x, ..., A^r x) \) [1], where a new vector is generated in each iteration. Iterative solvers are challenging to accelerate as they spend most of the time in communication-bound computations, like sparse matrix-vector multiply (SpMV) and vector-vector operations (dot products and vector additions). Additionally, the data dependency between these operations hinder overlapping communication with computation. For large-scale problems where the matrix \( A \) does not fit on-chip, no matter how much parallelism can be exploited to accelerate SpMV, the performance of the iterative solver is bounded by the available off-chip memory bandwidth, e.g., with 2 flops per 4 bytes (single-precision) in SpMV, the maximum theoretical performance is 71 GFLOPs on an Nvidia C2050 GPU and 17 GFLOPs on a Virtex6 FPGA. This results in less than 7% and 4% efficiency of GPU and FPGA respectively (See Table 1 for peak single-precision GFLOPs and off-chip memory bandwidth).

The communication problem is connected to the memory wall problem [2]. Due to technology scaling, computation performance is increasing at a dramatic rate (flops/sec improves by 59% each year) whereas communication performance is improving but at a much lower rate (DRAM latency improves by 5.5% and bandwidth improves by 23% each year) [3]. It is a well-known idea to formulate algorithmic innovations that hide memory latency and optimize memory bandwidth [4] [5] [6]. For iterative solvers, Demmel et al. [7] trade communication with redundant computation by replacing \( k \) SpMVs with the matrix powers kernel. The key idea is to partition the matrix into blocks and performs \( k \) SpMVs on blocks without fetching the block again in the sequential case and performing redundant computation to avoid communication with other processors in the parallel case. In this way the communication cost is reduced by \( O(k) \) at the expense of increase in redundant computation. They show that such an approach can minimize latency in a grid [7] and both latency and bandwidth on a multi-core CPUs [8] to give up to \( 4 \times \) and \( 4.3 \times \) speedup respectively over \( k \) SpMVs for banded matrices. While
The communication-avoiding approach is promising, there are two main challenges associated with this communication-avoiding approach on parallel architectures (1) how to keep the redundant computation as low as possible to minimize the computation cost and (2) how to select the optimal value of the algorithmic parameter $k$, which minimizes overall runtime by providing a tradeoff between computation and communication cost.

In this paper, we show how we can increase the silicon efficiency of FPGAs and GPUs in accelerating communication-bound sparse iterative solver. As a motivation, we show a tradeoff between computation and communication cost for FPGA and GPU to minimize overall runtime as shown in Figure 1. We observe that in standard iterative solver ($k$ equal to 1), the communication cost is higher on FPGA as compared to GPU due to marked difference in off-chip memory bandwidth as shown in Table 1. However, we see a unique value of $k$, which trades communication with redundant computation to reduce overall cost and that this value needs to be selected carefully for each architecture. Additionally, we observe that unlike GPU, the computation cost does not grow in FPGAs allowing larger values of $k$, which leads to higher performance.

1) Communication optimization within the memory hierarchy of a single stream multiprocessor (SM) as well as between different SMs while mapping the matrix powers kernel to a GPU. As a result of these optimizations, we show $1.9 \times 42.6 \times$ speedup over $k$ SpMVs from CUSP library [15] for a range of randomly generated banded matrices.

2) An architecture-aware matrix powers kernel that matches the strength of the FPGAs to avoid redundant computation and a resource-constrained methodology to pick $k$ for a particular FPGA.

3) A unified predictive model of the matrix powers kernel for GPU and FPGA, which helps us understanding communication-computation tradeoffs in selecting the algorithmic parameter $k$. Using the steepest ascent approach, we also show which aspect of future GPU and FPGA architectures need to be improved to achieve higher performance.

4) For a range of problem sizes, a quantitative comparison of the matrix powers kernel on FPGA shows $0.3 \times 3.2 \times$ and $1.7 \times 4.4 \times$ speedup over GPU for largest and smallest band sizes respectively.

The paper is organized as follows. Section 2 provides the necessary background about the structurally sparse matrices used in this work. It also summarizes the matrix powers kernel based on [7] (See Section 1 of supplementary file for details). Section 3 shows how we map the parallel matrix powers kernel on GPUs along with discussion about the predictive model. Section 4 presents the proposed matrix powers kernel specifically targeting FPGAs. It also discusses custom architecture on FPGA and a resource-constrained methodology to select an optimal value of $k$. Section 5 introduces the evaluation methodology and performance of FPGA and GPU is compared in Section 6. Section 7 provides architectural insight and final conclusion is in Section 8.

## 2 Background

### 2.1 Matrix Structure

In this work, we target very large ($n \sim 10^6$) structured sparse banded matrices. The banded matrices are stored in band storage format where an $n \times n$ matrix of band size $b$ is stored as an $n \times b$ matrix [13]. We choose this band structure for two main reasons. First, computations on such matrices have been used as

<table>
<thead>
<tr>
<th>Device</th>
<th>Tech. (nm)</th>
<th>Peak GFLOPs (single-precision)</th>
<th>Memory (On-Chip)</th>
<th>Memory BW (Off-Chip)</th>
<th>Memory BW (Off-Chip)</th>
<th>Freq. MHz</th>
</tr>
</thead>
<tbody>
<tr>
<td>Virtex6-SX475T</td>
<td>40</td>
<td>450 [18]</td>
<td>4 MB</td>
<td>74 KB</td>
<td>5.4 TB/s</td>
<td>36 TB/s</td>
</tr>
<tr>
<td>Nvidia C2050 Fermi</td>
<td>40</td>
<td>1030</td>
<td>672 KB</td>
<td>1.7 MB</td>
<td>1.3 TB/s [19]</td>
<td>8 TB/s</td>
</tr>
</tbody>
</table>

Fig. 1. Computation-communication tradeoff for a banded matrix with band size 27 and $n = 1M$ on a Virtex6-SX475T FPGA and C2050 Fermi GPU.
Fig. 2. $k$ SpMVs vs. parallel matrix powers kernel for a matrix with size $n = 12$, band size $b = 3$, number of levels $k = 3$ and number of blocks $N_q = 3$.

an architectural evaluation benchmark due to high parallelism and low computational intensity, offering opportunities to exploit on-chip parallelism and challenges with associated memory systems [14]. Secondly, they naturally arise in numerous scientific computations like stencils in partial differential equation (PDE) solvers [10], semi-definite optimization problems [11] and model predictive control [12]. Although stencil computation in PDE solvers may not explicitly store data in band storage form but in the other two applications, the banded matrix $A$ is explicitly formed.

### 2.2 Matrix Powers Kernel

Given an $n \times n$ sparse matrix $A$, a dense vector $x^{(0)}$ of length $n$, the matrix powers kernel is computed as

$$
x^{(i)} = Ax^{(i-1)} \quad 1 \leq i \leq k
$$

(1)

The computation in Equation (1) can be unrolled for $k$ iterations as a graph as shown in Figure 2(b) for an example tri-diagonal matrix $A$ (See details of the graph notation in Section 1 of the supplementary file). Each vertex of the graph represents vector entry $x^{(i)}_j$ and there are $n$ vertices for $j = 0, 1, \ldots, n-1$ entries for each level $i = 0, 1, \ldots, k$. The vertices of new vector $x^{(i)} = Ax^{(i-1)}$ can be computed by multiplying the entries of the previous vector with the corresponding entries of the matrix as shown by the edges. To parallelize the matrix powers kernel, the graph can be partitioned into $N_q$ blocks where each block $q$ can be mapped on a single processor. In case, where we generate $k$ vectors using repeated SpMV, we need to synchronize after each step to get the corresponding entries of the previous vector that are computed by the neighbouring processors as shown by white vertices in Figure 2(b)(i). The synchronization in $k$ SpMVs is avoided by the parallel matrix powers kernel, which trades communication with redundant computation.

The key idea is to determine the dependency chain of each partition $q$ for computing the entries of the $k$th vector as shown by the bounded box in Figure 2(b)(ii). The vertices that are required from the neighbouring processors are fetched at once as shown by the white vertices in Figure 2(b)(ii). To avoid communication with neighbours, the redundant computation is then performed at each step (dotted white vertices). While this is well and good, two factors are crucial for optimal performance. First, it is desired to keep the surface to volume ratio ($\frac{\text{Redundant Flops}}{\text{Useful Flops}}$) as low as possible by efficient partitioning of the matrix. Secondly, the value of $k$ needs to be picked carefully as redundant flops grow as $O(k^2b^2)$ [7].

### 3 Matrix Powers Kernel on a GPU

While mapping the matrix powers kernel on a GPU, we want to answer two important questions (1) How to optimally utilize the current GPU memory subsystem and pick the algorithmic parameter $k$ to achieve the desired performance for a particular architecture? (2) How can current GPU architecture be changed to enhance the performance of iterative solvers? We first present the current GF100 GPU architecture and then discuss different optimization techniques that lead to high throughput. We then present an analytical model to predict and understand the performance of the matrix powers kernel on any GPU device (model parameters are obtained using micro-benchmarks). We use the same model to select $k$ for current GPU architectures (See Section 6.1) and also make architectural projections to obtain a desired performance with future devices (See Section 7).

#### 3.1 GPU Architecture

We select the Nvidia GF100 variant C2050 GPU, which is intended for high-performance numerical computing [14] [19]. A simplified architectural description of
the GPU is shown in Figure 3 highlighting memory hierarchy as well as capacity, bandwidth and latency of each memory.

Fig. 3. GPU Architecture (Nvidia C2050 Fermi).

The GPU comprises 14 streaming multiprocessors (SMs) each operating at 1.15 GHz. Each SM has 32 floating-point cores capable of performing 1 single-precision flop/cycle reaching a peak throughput of 1.03 TFLOPs for single-precision and 515 GFLOPs for double-precision. Tasks are scheduled on GPU as thread blocks. Each thread block can run independently on an SM without any communication with other SMs during a single parallel task, e.g. each SM can compute \( k \) SpMVs for a single block as shown in Figure 2(b)(ii) for \( q = 2 \).

3.2 Partitioning Strategy—One Partition Per Thread Block

Each of the \( N_q \) blocks within the matrix powers kernel is mapped to a thread block and all these thread blocks are computed independently in parallel and in any order. The size of each block is \( b_R \times b \) where \( b_R \) is the number of rows in each block. However, the actual size is \( (b_R+k(b-1)) \times b \) as we need some entries from neighbouring blocks for redundant computation that helps in avoiding communication. We assign each vertex to a single thread, which performs serial reduction to compute the dot product of the row with \( b \) components of vector \( x^{(i)} \). If \( N_T \) denotes the number of threads, we can represent partition size \( b_R \) and the total number of partitions as

\[
\begin{align*}
b_R &= N_T - k(b-1) \\
N_q &= \left\lceil \frac{n + b_R - 1}{b_R} \right\rceil
\end{align*}
\]

3.3 GPU Optimizations

To exploit memory hierarchy of the GPU for fast access of the matrix and vector partitions, we explore three possible mappings of the matrix powers kernel as shown in Table 2. We take a matrix of size \( n = 10^6 \) with band size \( b = 9 \), number of threads \( N_T = 512 \) and number of levels \( k = 8 \) and evaluate the performance of these mappings. We select the values of \( N_T \) and \( k \) to show performance scaling and later on show how these values impact performance and need to be picked carefully. We choose spmv_dia_kernel from CUSP library [15] as baseline to perform \( k \) SpMVs with a performance of 34.8 GFLOPs. We now briefly discuss the three possible GPU optimizations.

### TABLE 2

| Mapping on C2050 GPU \((n = 1M, b = 9, k = 8)\). | Efficiency = \( \frac{\text{Sustained GFLOPs}}{\text{Peak GFLOPs}} \) |
|---|---|---|
| Global Memory | Shared Memory | Reg. Memory |
| k SpMVs [15] | \( A, x^{(1)} \) | 34.8 | 3.3% |
| Matrix Powers (Thread Blocking) | \( A \) | \( x^{(1)} \) | 63 | 6.12% |
| Matrix Powers (Thread Blocking + Cache Blocking) | \( A, x^{(1)} \) | 97.6 | 9.4% |
| Matrix Powers (Thread Blocking + Reg. Blocking) | \( x^{(1)} \) | \( A \) | 123 | 11.9% |

3.3.1 Thread Blocking (63 GFLOPs)

Each thread within the thread block is responsible for computing a single entry of the vector \( x^{(i)} \) from the entries of matrix \( A \) and vector \( x^{(i-1)} \). We, therefore, only store \( N_T = b_R + k(b-1) \) components of the vector \( x^{(i-1)} \) within the shared memory of each SM. To avoid communication with other SMs, the entries from neighbouring blocks are pre-fetched. As the entries of the matrix \( A \) are not modified and do not require inter-SM synchronization at each level, we can access \( A \) from global memory (See Listing 1 of supplementary file).

3.3.2 Thread Blocking + Explicit Cache Blocking (97.6 GFLOPs)

In this case, in addition to thread blocking, we also use explicit cache blocking to store the partition of matrix \( A \) into on-chip shared memory of each SM (See Listing 2 of the supplementary file). Using this approach, we not only avoid communication between different SMs but also with the global memory as well. As a result, we see a significant performance improvement over \( k \) SpMVs.

3.3.3 Thread Blocking + Register Blocking (123 GFLOPs)

GPUs have an inverse memory hierarchy [19], i.e. registers have relatively large capacity as compared to L1 cache/shared memory and L2 cache. Also registers have low latency (~1 cycle) compared to shared memory (~27 cycles). To get high throughput, they have been recently used to block matrices arising in small linear algebra problems with high arithmetic intensities [16]. In the matrix powers kernel, each matrix partition can be blocked within the registers of the SM (See Listing 3 of the supplementary file). We store the \( N_T \times b \) partition matrix in a row cyclic
distributed fashion within these registers, i.e. each thread can store a single row of length \( b \). The vector partition is not register-blocked as its entries need to be shared among different threads and is, therefore, kept in the shared memory. We show the performance of these optimizations in Figure 4.

![Fig. 4. GPU Optimizations \((b = 9, N_T = 512, k = 8)\).](image)

The matrix powers kernel with thread blocking and register blocking gives higher throughput as we not only avoid communication within different SMs and within memory hierarchy of a single SM but also utilize low latency and high bandwidth memories of GPU. We see more than 3.5× speedup over \( k \) SpMVs for large matrices and this speedup is even more pronounced for small matrices with higher value of \( k \) (See Table 6).

### 3.4 Modelling Performance

Mapping the matrix powers kernel on GPU and selecting optimal \( k \) to trade communication with computation is not obvious due to the complex GPU architecture. To understand and predict the performance of the matrix powers kernel on the GPU, we characterize our discussion using bandwidth and latency of entire GF100 memory hierarchy. To that end, we assume data is either stored in global or shared memory and use two simple models to predict GPU performance. Our model is based on the LogP model used for distributed architectures [17]. The global and shared memory models are shown as:

\[
\begin{align*}
\tau_{glb} &= \#msg \times \alpha_{glb} + msize \times \beta_{glb} + \text{flops} \times \gamma \\
\tau_{sh} &= \#msg \times \alpha_{sh} + msize \times \beta_{sh} + \text{nsync} \times \alpha_{sync} + \text{flops} \times \gamma
\end{align*}
\]

Like LogP model, our models comprise three parameters \( \alpha, \beta \) and \( \gamma \). Overall runtime is the sum of three factors, memory latency (\( \alpha_{glb} \) or \( \alpha_{sh} \)), inverse memory bandwidth (\( \beta_{glb} \) or \( \beta_{sh} \)) and time per flop (\( \gamma \)). Additionally, the shared memory model also captures time required for thread synchronizations (\( \text{nsync} \times \alpha_{sync} \)). We estimate the model parameters for GF100 architecture using micro-benchmarks [16] and summarize them in Table 3.

**Table 3**

<table>
<thead>
<tr>
<th>Specifications</th>
<th>Nvidia C2050</th>
<th>Nvidia C2075</th>
<th>Nvidia K20</th>
</tr>
</thead>
<tbody>
<tr>
<td>Peak TFLOPs (single-precision)</td>
<td>1.03</td>
<td>1.03</td>
<td>3.95</td>
</tr>
<tr>
<td>Global memory clock (GHz)</td>
<td>1.13 GHz</td>
<td>1.15 GHz</td>
<td>0.725</td>
</tr>
<tr>
<td>Global memory bandwidth (GB/s)</td>
<td>144 GB/s</td>
<td>144 GB/s</td>
<td>280 GB/s</td>
</tr>
<tr>
<td>Core clock rate ((P_{reg})) (GHz)</td>
<td>1.13</td>
<td>1.15</td>
<td>0.725</td>
</tr>
<tr>
<td>Number of SMs ((P))</td>
<td>14</td>
<td>14</td>
<td>14</td>
</tr>
<tr>
<td>Number of cores per SM ((N_c))</td>
<td>32</td>
<td>32</td>
<td>192</td>
</tr>
</tbody>
</table>

We build separate models for global and shared memory accesses and then combine them to find out the latency \((L_q)\) of a single thread block (See Section 2.2 of the supplementary file for details about the model).

\[
L_q = l A_{glb2reg} + l x_{glb2sh} + k(l x_{sh2reg} + l_{compute}) + l_{condition} + l x_{reg2sh}
\]

Referring to Equation (4), \( l A_{glb2reg} \) is the number of cycles required in loading a block of the matrix \( A \) from the global memory to the register file and similarly \( l x_{glb2sh} \) represents number of cycles utilized in loading partition of the vector \( x \) from the global memory to the shared memory. As shown, these blocks are fetched only once and then they are used \( k \) times to calculate partitions for the \( k \) vectors which are stored in the shared memory (See other terms in Equation (4)). We calculate total cycles by plugging in the parameters from Table 3 and then find out overall runtime \( L \) by taking into account the total number of thread blocks \((N_T)\), number of SMs \((P)\) and the number of thread blocks concurrently running per SM (we obtain this information using CUDA Visual Profiler [20]). We run CUDA code shown in Listing 3 of the supplementary file on the GPUs to measure the actual results for validation of the performance model. We compare the predicted performance with the actual measured results for a range of GPU devices in Figure 5.

To indicate the accuracy of our model, we also show the mean \((\epsilon_{\mu})\) and standard deviation \((\epsilon_{\sigma})\) of error (absolute difference in measured and modelled performance) as a percentage of measured performance for each band size. Our model does not capture register spilling, i.e. when the data does not fit in GPU registers, the data is stored to local memory, which is a part of the global memory. Each thread can have a maximum of 64 registers, which are enough to store a single row of the matrix, the length of which is equal to the band size. The band sizes that arise in all practical applications can fit in these registers and therefore, there is no register spilling in our implementation.
3.5 Performance Optimization

Having an accurate model to predict performance on the GPU in terms of the problem parameters \((n, b)\), the architectural parameters \((\gamma, \alpha, \beta, \gamma_{sh}, \beta_{sh}, P)\) and the algorithmic parameters \((k, b_R)\), we can solve the following optimization problem to select the algorithmic parameters.

\[
\min_{k, b_R} \frac{L(n, b, \gamma, \alpha, \beta, \gamma_{sh}, \beta_{sh}, P, k, b_R)}{Freq}
\]

subject to

\[
k \leq \frac{2b_R}{b-1} \tag{5}
\]

Referring to (5), the constraint ensures only nearest neighbour communication in the matrix powers kernel as shown in Figure 2. We carry out sensitivity analysis of GPU performance with respect to the algorithmic parameters with constant architectural parameters in Section 6.1. We also highlight in Section 7 that by carefully picking the algorithmic parameters we can achieve higher performance over \(k\) SpMVs that otherwise requires significant architectural modifications in terms of global memory bandwidth and latency.

4 Matrix Powers Kernel on FPGA

The potential to use FPGAs in high-performance computing arises from the fact that computer architecture can be specialized to accelerate a particular task (See Section 3.1 of the supplementary file for details). Table 1 lists the important architectural features of FPGAs in terms of raw floating-point performance, on-chip memory capacity and on-chip as well as off-chip memory bandwidth. Referring to Table 1, although the off-chip memory bandwidth and peak floating-point performance is \(5 \times \) and \(2.3 \times\) lower than that of the same generation GPU device, it is the on-chip capacity and on-chip memory bandwidth coupled with rich communication-fabric, which make FPGAs suitable for accelerating iterative solvers. We now introduce the architecture-aware hybrid matrix powers kernel that exploits these architectural features to get high throughput. In this regard, we also present a resource-constrained methodology for selecting an optimal \(k\) for a target FPGA device.

4.1 Proposed Hybrid Matrix Powers Kernel

The proposed algorithm loads the blocks of the matrix from the slow memory into large on-chip memory using a sequential algorithm and then performs computations within the block in parallel without doing redundant computations. We show the proposed method in Algorithm 1.

\begin{algorithm}
\caption{Hybrid Matrix Powers Kernel}
\begin{algorithmic}
\For {\(q = 1\) to \(N_q\)}
\State load block \(q\) from slow memory into fast memory
\EndFor
\For {\(i = 1\) to \(k\)}
\State compute all locally computable \(x^{(i)}\)
\State wait for all the receives from neighbours to finish
\State compute the remaining entries of \(x^{(i)}\) with dependencies
\EndFor
\end{algorithmic}
\end{algorithm}

The outer loop is a sequential algorithm that loads the blocks of matrix \(A\) such that the block fits into the on-chip memory of the FPGA. The inner loop is a parallel algorithm, which is very similar to the one shown in Figure 2(b). The working of the algorithm is shown in Figure 6 with a toy example, where we have 2 outer blocks that are loaded sequentially from the slow memory into FPGA on-chip memory.

![Proposed hybrid matrix powers graph](image)

Fig. 6. Proposed hybrid matrix powers graph for \(n = 12, k = 3, b = 3\) and number of blocks \(N_q = 2\).

Each outer block is further partitioned into two sub-blocks that can be processed in parallel by an array of processing elements (PEs) working in a SIMD fashion as shown in Figure 7. All the vertices inside a sub-block are computed in a pipelined fashion using a reduction circuit within the PE (See Section 3.2 of the supplementary file for details). After each level in the graph, PEs need to communicate dependencies...
to their nearest neighbours. However, unlike GPU, where this communication is only possible using shared global memory and is therefore avoided using redundant computation, the PEs within the FPGA utilize low-latency FIFOs and hence avoid the redundant computation. This allows larger values of \( k \) and hence higher possible speedups. To provide motivation, we show the performance of this hybrid matrix powers kernel in Table 4 for the same matrix that we use for demonstrating GPU optimizations in Table 2. We observe that if \( A \) is blocked in on-chip memory of the FPGA along with the vector \( x^{(i)} \), we can get \( \sim 6 \times \) speedup over \( k \) SpMVs used in standard iterative solvers. This speedup factor is almost twice of what we achieved with parallel matrix powers kernel on GPU. Although the net performance of FPGA (85 GFLOPs) is less than that of GPU (123 GFLOPs) for this band size, however, we have better silicon efficiency with FPGA (18.8\%) compared to GPU (11.9\%). We show in Section 6 that for a range of matrix and band sizes, the FPGA even outperforms GPU because of the higher values of \( k \).

TABLE 4
Hybrid Matrix Powers Kernel Mapping on Virtex6-SX475T FPGA \((n = 1\mathrm{M}, b = 9, k = 10)\).

<table>
<thead>
<tr>
<th>Parameters</th>
<th>Off-Chip Memory</th>
<th>On-Chip Memory</th>
<th>GFLOPs</th>
<th>Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td>( k ) SpMVs</td>
<td>( A, x^{(i)} )</td>
<td>14.21</td>
<td>3.1%</td>
<td></td>
</tr>
<tr>
<td>( k ) SpMVs</td>
<td>( A )</td>
<td>15.68</td>
<td>3.4%</td>
<td></td>
</tr>
<tr>
<td>Hybrid Matrix Powers</td>
<td>( A, x^{(i)} )</td>
<td>85</td>
<td>18.8%</td>
<td></td>
</tr>
</tbody>
</table>

4.2 Modelling Performance

To understand the performance of matrix powers kernel on FPGA, we use the same LogP model which comprises both computation as well as communication cost as shown in Section 3.4 for the GPU. The model is exact due to the highly predictive nature of FPGAs as a computing platform. We show the parameters of the model in Table 5.

As FPGA has relatively larger on-chip memory compared to GPU, we intend to store the \( k \) vectors on-chip to be utilized by subsequent modules in communication-avoiding iterative solver. There are three stages in the matrix powers kernel on FPGA: loading the block, computing the sub-blocks in parallel and an optional stage for storing the \( k \) vectors back to the off-chip memory if they do not fit on-chip. The latency \( (L_q) \) of a single block is the summation of the latency of these three stages (See Section 3.3 of the supplementary file for details about the model).

\[
L_q = (lA_{glb2local} + lx_{glb2local}) + k \times l_{\text{compute}} + k \times l_{\text{local2glb}}
\]  

(6)

The overall latency \( L \) is then calculated by multiplying \( L_q \) with total number of blocks \( N_q \).

4.3 Resource-Constrained Methodology

Like GPU, the performance of the matrix powers kernel depends on the problem parameters \((n,b)\), the architectural parameters \((P,\gamma_A,\gamma_M,\alpha_{glb},\beta_{glb})\) and the algorithmic parameters \((k,b_R)\). We find the maximum number \( P \) of PEs that can be synthesized within the FPGA device for the given band size \( b \) (the number of floating-point units only depends on this parameter shown in Table 5). We calculate the memory bandwidth required for these \( P \) PEs (2\( b \) words per PE), partition the available on-chip memory in \( b_R \times b \) blocks and assign these blocks to \( P \) PEs. We solve the following constrained optimization problem to pick \( k \) on a particular FPGA for a given problem.

\[
\min_{k,b_R} \frac{L(n, b, \gamma_A, \gamma_M, \alpha_{glb}, \beta_{glb}, P, k, b_R)}{F_{\text{freq}}}
\]

subject to

\[
R(P) \leq FPGA_{\text{Logic}}
\]

\[
M(P, b_R, k) \leq FPGA_{\text{BRAMs}}
\]

\[
k \leq 2 \cdot \frac{b_R}{b - 1}
\]

(7)

Referring to Equation (7), our objective is to minimize the runtime based on constraints on FPGA resources. \( M(P, k, b_R) \) is the number of BRAMs (FPGA on-chip memories) required and \( R(P) \) is a vector containing the number of resources in terms of LUTs, FFs and DSP48Es that are used in floating-point adders and multipliers [22]. The last constraint ensures only nearest neighbour communication in the matrix powers kernel as shown in Figure 6.
5 Evaluation Methodology

We use the same generation (40nm) of GPU (NVidia C2050) and FPGA (Virtex6-SX475T) devices as mentioned in Table 1. We use CUDA 5.0 for compiling CUDA kernels and also use cusp-v0.3.1 [15], which is a sparse library optimized for GPUs. We prefer CUSP over vendor-specific cuSPARSE [21] since cuSPARSE lacks SpMV routine for banded matrices. We use Xilinx IP Core [22] for BRAMs, adders and multipliers. We synthesize as well as place and route the circuit using Xilinx ISE 13.4. To evaluate the performance of matrix powers kernel on GPU and FPGA, we use a range of randomly generated matrices with varying band sizes.

6 Results

As FPGAs and GPU are radically different computing platforms, we first analyze how the communication-avoiding approach of the matrix powers kernel can enhance their individual performance over k SpMVs in standard iterative solvers. We then compare the matrix powers kernel with optimal k for both GPU and FPGA and show which architecture is better in different problem and band sizes.

6.1 Sensitivity to Algorithmic Parameters

We use the formulations in (5) and (7) to select the algorithmic parameters for minimizing the runtime on GPU and FPGAs respectively. There are two algorithmic parameters, the partition size $b_R$ and the unroll factor k. While both parameters affect the surface to volume ratio but the impact of the partition size $b_R$ is marginal as compared to the unroll factor k (See Section 2.4 of the supplementary file). To show the performance variation with $k$, we take a problem size ($n = 1M$) and show both the communication and computation costs for band size equal to 9 in Figure 8 and for band size equal to 27 in Figure 1. We observe from Figure 8 and Figure 1 that in GPU, the optimal value of $k$ decreases as we increase the band size whereas in case of FPGA, it shows opposite trend. After a certain value of $k$, both computation and communication costs dominate on GPU. The computation cost increases due to $O(k^2b^2)$ growth in redundant operations and as a result the larger the band size, the smaller the value of $k$ whereas the communication cost increases with increasing value of $k$ due to shared memory accesses (See Figure 4 of the supplementary file).

In case of FPGA, the optimal value of $k$ increases with increasing band size. The communication cost decreases with increasing value of $k$ for all band sizes until it flattens as the vectors can no more be stored on-chip and they have to be stored back. For large band sizes, the computation to communication ratio is large and these vectors can be stored in an overlapped fashion. This allows large values of $k$ as shown in Figure 1 and 8. As a result of careful selection of the unroll factor $k$ on both GPU and FPGA, we see a significant increase in the silicon efficiency of these architectures as shown in Figure 9. We also observe that FPGAs have much better silicon efficiency as compared to GPU because of the relatively large values of $k$ as shown in Table 6.

6.2 Performance Comparison

Although FPGAs have better silicon efficiency than GPU as a result of careful selection of $k$, to compare these architectures in terms of raw performance for a range of problem and band sizes, we see three interesting scenarios in Figure 10.

6.2.1 Small Problem Sizes ($n \leq 20K$)

In this case, across all band sizes, due to small matrix size, $k$ vectors can also be stored in the on-chip memory of the FPGA with $k$ having large values. Since there is no off-chip communication involved except loading the matrix once, we see up to 4.4× speedup over GPU in this region due to the large on-chip capacity and zero redundant operations in FPGA.

6.2.2 Large Problem Sizes ($n \geq 20K$), Small Band Sizes ($b \leq 9$)

For large problem sizes and small band sizes, GPU performs slightly better than FPGA since the vectors...
spill into off-chip memory in case of FPGA and due to its relatively low off-chip memory bandwidth, the problem becomes communication-bound. On the other hand, GPU has higher off-chip bandwidth and as a result we see up to $\sim 3 \times$ speedup.

6.2.3 Large Problem Sizes ($n \geq 20K$), Large Band Sizes ($b > 9$)

In this region, as the band size increases, number of redundant operations grow rapidly which constrain GPU performance and as a result we see very small values of $k$. On the other hand, as the computation and communication (storing the vectors) ratio is high, the problem remains compute-bound as vectors can be stored in an overlapped fashion. As a result, FPGAs perform better and we get up to $1.7 \times$ speedup over GPU.

7 ARCHITECTURAL INSIGHT

Since we have an accurate predictive model for GPU and FPGA, we can answer questions

- if we do not change the algorithmic parameters, how we might have to change the architectural parameters?
- how to optimally change the architectural parameters to obtain a desired performance?

7.1 Sensitivity to GPU Architectural Parameters

We solve the optimization problem in (5) using a steepest ascent method. The steepest ascent curve shows different points of performance enhancement and the corresponding architectural parameters as shown in Figure 11. For example, to achieve a $3.5 \times$ speed up over a problem running on C2050 GPU, we show that our optimization vector $(\alpha_{glb}, \beta_{glb}, \alpha_{sh}, \beta_{sh})$ for a hypothetical GPU should be scaled as $(\sim \frac{1}{10}, \sim 10, \frac{1}{1.23})$. However, using our predictive model and measured results, we have already shown in Figure 5 that same performance can be obtained by careful selection of $k$ without changing the architecture.

7.2 Sensitivity to FPGA Architectural Parameters

We show the sensitivity of the FPGA performance contours in GFLOPs as a function of global memory bandwidth ($\beta_{glb}$) and shared memory bandwidth ($\beta_{sh}$) for band size $b = 9$ and $n = 1M$. Specific points (in red) on the steepest ascent curve (in black) are shown representing $(\alpha_{glb}, \beta_{glb}, \alpha_{sh}, \beta_{sh}, \frac{L_{base}}{\epsilon_{pred}})$ where $L_{base}$ is the performance obtained on C2050 GPU. $\alpha_{glb}$ and $\alpha_{sh}$ are in cycles whereas $\beta_{glb}$ and $\beta_{sh}$ are in GB/s.

Fig. 11. Architectural Sensitivity—GPU performance contours in GFLOPs as a function of global memory bandwidth ($\beta_{glb}$) and shared memory bandwidth ($\beta_{sh}$) for band size $b = 9$ and $n = 1M$. Specific points (in red) on the steepest ascent curve (in black) are shown representing $(\alpha_{glb}, \beta_{glb}, \alpha_{sh}, \beta_{sh}, \frac{L_{base}}{\epsilon_{pred}})$ where $L_{base}$ is the performance obtained on C2050 GPU. $\alpha_{glb}$ and $\alpha_{sh}$ are in cycles whereas $\beta_{glb}$ and $\beta_{sh}$ are in GB/s.

8 Conclusion

In this paper, we have presented an optimization framework for a kind of communication-bound problems.

Table 6

<table>
<thead>
<tr>
<th>Band Size</th>
<th>GPU</th>
<th>$k$ Range</th>
<th>Efficiency(%)</th>
<th>FPGA</th>
<th>$k$ Range</th>
<th>Efficiency(%)</th>
<th>FPGA vs. GPU SpeedUp</th>
</tr>
</thead>
<tbody>
<tr>
<td>$b=3$</td>
<td>160−32</td>
<td>0.04−2.6</td>
<td>2−11.6</td>
<td>354−16</td>
<td>2.8−3.3</td>
<td>15.1−8.8</td>
<td>3.2−0.3</td>
</tr>
<tr>
<td>$b=7$</td>
<td>58−16</td>
<td>0.1−3.2</td>
<td>2.3−10.5</td>
<td>436−28</td>
<td>3.0−3.3</td>
<td>19.8−16.1</td>
<td>3.7−0.7</td>
</tr>
<tr>
<td>$b=9$</td>
<td>43−8</td>
<td>0.1−3.5</td>
<td>2.4−11.9</td>
<td>458−12</td>
<td>3.0−3.3</td>
<td>19.6−18.6</td>
<td>3.5−0.7</td>
</tr>
<tr>
<td>$b=13$</td>
<td>30−16</td>
<td>0.4−3.7</td>
<td>2.7−9.1</td>
<td>466−21</td>
<td>3.0−3.3</td>
<td>23.1−22.7</td>
<td>3.3−1.4</td>
</tr>
<tr>
<td>$b=27$</td>
<td>10−3</td>
<td>0.3−3.7</td>
<td>2.6−7.4</td>
<td>481−28</td>
<td>3.1−3.3</td>
<td>27.3−29.9</td>
<td>4.4−1.7</td>
</tr>
</tbody>
</table>
Fig. 12. Architectural Sensitivity—FPGA performance in GFLOPs as a function of off-chip memory bandwidth ($\beta_{glb}$) for band size $b = 9$ and $n = 1M$. The starting point of the curves is a Virtex6-SX475T architecture with an off-chip bandwidth of 34 GB/s.

8 CONCLUSION

Trading communication with computation increases the silicon efficiency of hardware accelerators like FPGAs and GPU for accelerating computation-bound sparse iterative solver. Although unrolling $k$ iterations using the matrix powers kernel provides significant performance improvement compared to standard $k$ SpMVs on a GPU, the performance is constrained due to quadratic growth in redundant computations. Our proposed hybrid matrix powers kernel for FPGA exploits the architectural features of this radically different platform to minimize redundant computations. This allows us large value of $k$ and hence superior silicon efficiency compared to GPU. For a range of randomly generated banded matrices, we demonstrate $0.3\times-3.2\times$ and $1.7\times-4.4\times$ speedup over GPU for small and large band sizes respectively. Our architectural insight shows a tight algorithm-architecture interaction can provide similar performance, which otherwise requires significant enhancements in memory bandwidth.

9 FUTURE WORK

Besides Nvidia GPUs, we intend to validate our predictive model for the matrix powers kernel on other GPUs as well. We intend to extend the work to general sparse matrices with hyper-graph partitioning as a pre-processing step. We intend to use the matrix powers kernel instead of SpMV for applications [11] where we have to solve $Ax = b$ repeatedly in each iteration. As the sparsity pattern does not change over the iterations, we believe that the cost of this pre-processing step will be quite low as compared to the actual computation.

ACKNOWLEDGMENT

Dr. George A. Constantinides would like to acknowledge the support of EPSRC (EP/1020357/1 & EP/G031576/1). We would like to thank Michael Anderson, PAR Lab, University of California, Berkeley for providing us the GF101 micro-benchmarks. Also, we would like to thank Mark Hoemmen and Marghoob Mohiyuddin, University of California Berkeley for giving useful suggestions that help in improving the draft.

REFERENCES

Abid Rafique received the MSc. degree from Technical University Munich, Germany in 2010. Since 2010, he is a PhD student in the Circuits and Systems research group at Imperial College London. His research interests include high performance computing, parallel architectures and communication optimization in iterative numerical algorithms.

George A. Constantinides (S’96-M’01-SM’08) received the Ph.D. degree from Imperial College London in 2001. Since 2002, he has been with the faculty at Imperial College London, where he is currently Professor of Digital Computation and Head of the Circuits and Systems research group. He will be program (general) chair of the ACM International Symposium on Field-Programmable Gate Arrays in 2014 (2015). He serves on several programme committees and has published over 150 research papers in peer refereed journals and international conferences. Dr Constantinides is a Senior Member of the IEEE and a Fellow of the British Computer Society.

Nachiket Kapre is an Assistant Professor at Nanyang Technological University, Singapore since October 2012. He has received a PhD in Computer Science (2010) and a MS in Computer Science (2006) and an MS in Electrical Engineering (2005) all from California Institute of Technology, Pasadena, USA. He was awarded the prestigious Imperial College Junior Research Fellowship in 2010. He has won the best paper award at FPT 2011 and a HiPEAC paper award for his FCCM 2013 paper. His FCCM 2006 paper was featured in the FCCM20 list as one of the influential papers in past 20 years at FCCM conferences. He is broadly interested in exploiting the limits of modern VLSI architectures though reconfigurability, parallelism and domain-specific frameworks.