<table>
<thead>
<tr>
<th>Title</th>
<th>Off-chip decoupling capacitor allocation for chip package co-design</th>
</tr>
</thead>
<tbody>
<tr>
<td>Author(s)</td>
<td>Yu, Hao; Chu, Chunta Lei He</td>
</tr>
<tr>
<td>Date</td>
<td>2007</td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://hdl.handle.net/10220/7904">http://hdl.handle.net/10220/7904</a></td>
</tr>
</tbody>
</table>
Off-chip Decoupling Capacitor Allocation for Chip Package Co-Design

Hao Yu
Berkeley-DA Inc.
Santa Clara, CA 95054

Chunta Chu, Lei He
EE Department UCLA
Los Angeles, CA 90095

ABSTRACT
Off-chip decoupling capacitor (decap) allocation is a demanding task during package and chip co-design. Existing approaches can not handle large numbers of I/O counts and large numbers of legal decap positions. In this paper, we propose a fast decoupling capacitor allocation method. By applying a spectral clustering, a small amount of principal I/Os can be found. Accordingly, the large power supply network is partitioned into several blocks each with only one principal I/O. This enables a localized macromodeling for each block by a triangular-structured reduction. In addition, to systematically consider a large legal position map in a manageable fashion, the map of legal positions is decomposed into multiple rings, which are further parameterized in each block. The decaps are then allocated according to the sensitivity obtained from the parameterized macromodel for each block. Compared to the PRIMA-based macromodeling, experiments show that our method (TBS2) is 25X faster and has 3.04X smaller error. Moreover, our decap allocation reduces the optimization time by 97X, and reduces decap cost by up to 16% to meet the same power-integrity target.

Categories/Subject Descriptors: B.7.2[Hardware]: Integrated circuits – Design aids
General Terms: Algorithms, design
Keywords: Model Order Reduction, PG grid simulation

1. INTRODUCTION
High-performance system on chip (SoC) or system in package (SiP) integration leads to chip-package interface (I/O) operating in the Giga-bit range. Because the power supply planes in the package show strong electromagnetic resonance [1–3] under the injection of simultaneously switching I/O currents, they act as a significant source of noise in supply voltage and may create non-negligible jitters that limit the performance of I/Os. Therefore, it is necessary to obtain a clean power deliver system. Decoupling capacitors can be used to short power and ground planes at high frequencies to control power fluctuations. However, different from the on-chip decap, off-chip decaps are discrete passive components. Moreover, considering congestion from signal and power routing, off-chip decaps can be inserted only at selected slots, called legal positions in this paper, and legal positions are used to connect terminals of decaps inside or outside the package.

The following decap insertion algorithms have been developed recently. [1] calculates a multi-input multi-output (MIMO) impedance by model order reduction. It reduces the cost of decaps and resonance impedance in the frequency-domain. To explicitly consider the rising-time of the I/O current, [2] allocates decap to directly reduce the noise in the time-domain and avoids over-design compared to [1]. However, the simulated annealing (SA) based algorithms in [1,2] are capable of dealing with a pre-designed package with only a limited number of legal positions as the algorithm virtually tries on each legal position. Moreover, the models used in [1,2] need to be improved. The accuracy of PRIMA-based reduction in [1] decays when the I/O port number increases. In addition, the reduction in [1] ignores the structure information. The reduced model is dense and non-localized, and is inefficient to handle large-scale packages. On the other hand, [2] considers only the noise amplitude but not the noise pulse width. Because a very narrow noise with a large amplitude may not necessarily lead to noise violation, the noise model in [2] can lead to over-design.

Considering chip-package co-design, this paper formulates a decap allocation problem to minimize the decap cost subject to a dynamic noise violation constraint with consideration of noise pulse width. We develop a scalable algorithm using a localized and parameterized macromodel. The primary contributions of our paper are two-fold. First, to generate an effective macromodel considering large numbers of I/O ports, we propose a spectral clustering of I/Os based on correlation. The information of clustered I/Os is further used to partition the large RLC-network and leads to a structured macromodel with localized integrity analysis. Compared to the macromodel used in [1], our method is 3.04X more accurate and 25X more efficient. Secondly, given a large number of legal positions, we introduce a systematical decap allocation based on sensitivity of the transient response with respect to the decap location and type. Then, the decaps can be allocated according to sensitivity from a structured and parameterized macromodel that is only calculated once. Compared to the SA-based allocation in [1,2], experiments show that our allocation is 97X faster, and reduces the decap cost by up to 16%.

The rest of paper is organized as follows. In Section 2, we present the background and problem formulation. In Section 3, we introduce a parameterized description for P/G planes with allocated decaps. In Section 4, we propose a correlation based I/O clustering method. Using the I/O clustering information, in Section 5 we partition the parameterized RLC-plane into several blocks, and apply a triangular block-structured model reduction to locally generate the nominal response and sensitivity for each block. In Section 6, we introduce our decap allocation algorithm using the sensitivity, and present the experiment results. We conclude in Section 7.

2. PROBLEM FORMULATION
A complete RLC model is required for accurate representations of interactions among package layers, Cu bumps, vias, on-chip power grids and all other signal traces. The power/ground plane can be uniformly discretized into Nv tiles, and each tile is modeled by RLC element. We assume that the locations of chip I/O ports are known, and the allowed number of possible locations called legal positions for decaps are predefined for each region in a multi-layer package with consideration of congestion due to packaging

---

*This work was partially supported by NSF-CAREER 0401682 and UC-MICRO sponsored by Altera, Intel and Rio-DA. Address comments to hao@berkeley-da.com and lhe@ee.ucla.edu.
routing and ball assignment. The legal positions are slots to connect the terminals of decaps, but not necessarily where decaps are located. As shown in Fig. 1, the I/Os are located in the center of the package. With the consideration of reserved routing area, the legal positions to allocate decaps are surrounded the chip by rings with different distances. After one decap is assigned to one legal location the decap is then called legally placed.

In our decap allocation problem, the design freedoms are the legal locations and decap types. Brute-forcedly examining every possible combination is computationally expensive if not impossible. To allocate decaps in a manageable way, we propose a ring-based decomposition of all legal positions. This is based on the observation that the impact of decaps to I/O power-integrity can be distinguished by the distance of the center. As shown in Fig. 1 (a), the legal positions are decomposed into rings. Each ring is composed by a group of legal positions uniformly distributed on one edge. The illegal positions are not included in each ring.

Moreover, because of non-uniformly distributed I/Os in space, the orientation of legal positions can have different impact on I/O power-integrity as well. As a result, the decaps needs to be non-uniformly distributed on one ring. To consider this, all rings are hierarchically divided into M1 levelized positions, called templates in this paper. As shown by Fig. 1 (b), the level-0 template only allocates decaps on the edge centers, and the higher-level template allocates decaps more uniformly on the rings. To further consider M2 types of decaps, each levelized template is duplicated by M2 copies, each copy with a different decap type. Note that only one copy at one level is selected to allocate decaps. As a result, there are total M = M1 · M2 templates, and a vector of templates can be defined by T = [T1, T2, ..., TM], where T1 is one template with specified level and decap-type. Usually, there are less than 5 levels of decomposition and less than 10 types [1, 2] to choose.

In addition, to consider the dynamic noise taking in to account the noise pulse width, the noise integral instead of noise amplitude is used

\[ f_i = \int_{t_0}^{t_p} \max[y_i(T, t), V_{C1}] dt = \int_{t_0}^{t_e} [y_i(T, t) - V_{C1}] dt, \]

with a pulse-width \((t_s, t_e)\) in a sufficient long time-period \(t_p\). \(y_i(T, t)\) is transient noise waveform at \(i\)-th I/O. This applies to all p I/O cells, i.e., \(f_j \leq V_d \) \((j = 1, ..., p)\).

Recall that our design freedoms are two-fold: one is the level of ring, and another is the decap-type. Accordingly, our problem formulation becomes

**Formulation 1.** Given the allowed noise \((V_c)\), legal positions \((M1)\) and decap types \((M2)\), the decap allocation problem is to decide which decap to be placed at which legal position and minimize the total cost of decap under a given bound of decap number \((M)\), such that the voltage violations \(f\) at each I/Os are smaller than the allowed noise.

This problem can be mathematically represented by

\[ \min \sum_{i = 1}^{M} n_i T_i, \quad (i = 1, ..., M) \]

s.t. \(U_f \leq V_c \) and \(\sum_{j = 1}^{M} m_j \leq M\). \(\) \(\) \(\) \(\) \(\) \(\)

3. **PARAMETERIZED CIRCUIT EQUATION**

Because the partial inductance in PEEC introduces massive magnetic couplings, it would slow down the integrity analysis. As shown by [4], the inverse of \(L^{-1}\) described by VPEEC model can be stably sparsified, and stably and passively stamped in the circuit matrix by a vector-potential nodal analysis (VNA). In this paper, the nominal RLC-network for package planes is modeled by VPEEC model and is stamped by VNA in frequency \((s)\) domain:

\[ (G_0 + sC_0)y(s) = B(s), \quad y(s) = B^T x(s) \]

Note that \(G_0\) and \(C_0\) are nominal state matrices composed by conductance \(G\), capacitance \(C\), inverse-inductance \(L\) and incident-matrix \(B\). \(x\) is the state variable composed by the nodal voltage \(v_n\) and the branch vector-potential \(a_l\). In addition, \(B\) is the adjacent matrix to describe \(P\) identical input and output ports, where the inputs \(I = B(s)\) are I/O current sources, and outputs \(y\) are the voltage bounce at those I/Os. As discussed in Section 4, studying such a I/O map can guide the network partition.

To obtain the sensitivity, we need to first parameterize the system when we insert decaps with associated equivalent conductance \(g_i\), capacitance \(c_i\) and susceptance \(s_i\) (inverse of inductance). We describe each template \(T_i\) by a pair of topology matrices \(T^p_i\) and \(T^c_i\); where \(T^p_i\) describes how to connect the nodal equivalent conductance, and \(T^c_i\) defines how to connect the nodal capacitance and the branch susceptance. For an \(i\)-th template, adding decaps between tiles \(m\) and \(n\) results in:

\[ T^p_i(k, l) = T^p_i(l, k) = \begin{cases} -g_i & \text{if } k = m, n = l \text{ and } k \neq l \\ \sum_{l=1}^{N_i} |T^p_i(k, l)| & \text{if } k = l \\ 0 & \text{else} \end{cases} \]

where, \(k, l \in 1, 2, ..., N\). \(T^p_i(k, l)\) can be given similarly to add the equivalent capacitance and susceptance.

To separate the sensitivity from the nominal response, the state variable \(x(T, s)\) is first expanded into Taylor series with respect to \(T_i\) and a new state variable \(x_{ap}\) is reconstructed by the nominal values and the first-order sensitivities \(x_{ap} = [x_{0}^T; x_{1}^T; ..., x_{M}^T]^T\). Then, a dimension-Augmented system can be reorganized according to the expansion order

\[ (G_{ap} + sC_{ap})x_{ap} = B_{ap}I(s), \quad y_{ap} = B_{ap}^T x_{ap}, \]

Authorized licensed use limited to: Nanyang Technological University. Downloaded on May 20,2010 at 16:25:54 UTC from IEEE Xplore. Restrictions apply.
where

\[ \mathcal{G}_{ap} = \begin{bmatrix} G_0 & 0 & \ldots & 0 \\ T_0^T & \mathcal{C}_0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ T_M^T & 0 & \ldots & \mathcal{C}_0 \end{bmatrix}, \quad \mathcal{C}_{ap} = \begin{bmatrix} C_0 & 0 & \ldots & 0 \\ T_0^T & C_0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ T_M^T & 0 & \ldots & C_0 \end{bmatrix}, \]  

both have a lower triangular block structure. Due to the increased dimension, the augmented system needs to be reduced and represented by a macromodel.

4. SPECTRAL CLUSTERING OF I/O

Due to the large number of I/O ports, the macromodel by model reduction applied by [1] is still ineffective. Because the time/space-variant input I/O current are not independent, they can be represented by a function of a smaller number of independent variables based on the principal component analysis (PCA) using eigen-decomposition (ED). This becomes the motivation to apply the singular value decomposition (SVD) [5,6] based port reduction as SVD is equivalent to eigen-decomposition when the matrix to be decomposed is symmetric positive definite. These approaches [5,6] assume that the correlation or similarity of inputs can be inferred from a SVD analysis of the system transfer function. However, the correlation of I/O ports is actually dependent on the input signals. As a result, finding the `representative' ports or ignoring some `insignificant' ports based on the system similarity may lead to simulation errors, because there could be one significant output response caused by one significant signal that is applied at one port ignored from the system pole analysis.

**Algorithm 1 for Spectral Clustering of I/O Current Sources with PCA and K-means.**

In this paper, we propose to directly study the similarity or correlation of I/O current. The large number of I/Os are clustered into \( K \) groups, each with one principal I/O current as port. Given a typical set of \( P \) input vectors applied in a sufficient long period, the sampled transient-current \( I(t_k, n_j) \) for each I/O \( n_j \) can be described by a random process as follows:

\[ S_{n_k} = \{I(t_1, n_k), \ldots, I(T, n_k)\}, \quad S_{n_2} = \{I(t_1, n_2), \ldots, I(T, n_2)\} \]

A current spatial-correlation matrix is defined by

\[ C(i,j) = \frac{\text{cov}(i,j)}{\sigma_i \sigma_j}, \quad 1 \leq i, j \leq N \]  

where \( \text{cov}(i,j) \) is the covariance between nodes \( n_i \) and \( n_j \), and \( \sigma_i \), \( \sigma_j \) are the standard-variations of nodes \( n_i \) and \( n_j \). Those correlation coefficients \( C(i,j) \) can be precomputed and stored in a table.

After extracting the correlation for I/O currents, we can build a correlation graph by assigning the weight of edge between I/Os \( n_i \) and \( n_j \) by the correlation value \( C(i,j) \). A fast clustering based on spectral analysis [7] can be applied to efficiently handle a large-scale correlation graph to find \( K \) clusters \( A_1, \ldots, A_K \) using K-means method, where the I/Os in one cluster all show a similar current waveform. In addition, the number of I/O current sources can be reduced by PCA

\[ \mathcal{J}_F = \mathcal{V}_F = \mathcal{E} / \mathcal{I}(s) \in \mathbb{R}^{1 \times K}. \]

It is equivalent to reduce the port matrix

\[ B_2 = \mathcal{V} \mathcal{B} \in \mathbb{R}^{N \times K}. \]

As such, there is only one principal port selected to represent each cluster. The overall clustering is outlined in Algorithm 1. Usually, 1000 sources can be approximated by around 10 sources if I/Os are strongly correlated.

5. LOCALIZED INTEGRITY ANALYSIS

Because the I/O currents are distributed non-uniformly in space, they have different impact on voltage bounces at different locations. Therefore, it is possible that one level of timing can be non-uniformly allocated with different types of decaps. To this end, it better to decompose the I/O cells, the RLC-network for power supply, and the M templates into K blocks (See Fig. 1). A corresponding localized analysis can be then performed to decide how many decaps for one block of I/Os.

The decomposition needs to partition the network based on physical properties such as couplings and latency. The TBS method in [8] leverages the property of latency, which is more suitable for timing simulators. But for the verification of power integrity, it is more meaningful to study a partition based on I/O currents. Moreover, the partition in TBS [8] is to tear nodal voltage variables \( \phi_i \) for conductance and capacitance matrices, which is not suitable for inductance/susceptance partition because inductance/susceptance is described by the branch current/vector-potential. As shown in [4], this can be solved by Bordered-Block-Diagonal (BBD) decomposition of the VNA network \( \mathcal{G}_0, \mathcal{C}_0, \mathcal{B}_2 \). Note that each block has the specified ports \( A_i \) obtained from the spectral clustering.

As a result, the parameterized system can be described in a BBD matrix by

\[ \mathcal{G}_ap \rightarrow \mathcal{G}_{ap} = \begin{bmatrix} G_1 & 0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & G_K \end{bmatrix} X_{k,0} \begin{bmatrix} X_{k,0}^T \\ \vdots \\ \vdots \\ X_{k,0}^T \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ \vdots \\ \vdots \end{bmatrix}, \]  

where \( \mathcal{G}_0 \) is partitioned into \( K \) blocks \( G_j \) \( j = 1, \ldots, K \), and \( \mathcal{C}_{ap} \) is built similarly from \( \mathcal{C}_{ap} \). Note that those parameterized templates are also partitioned into \( T_{ij} \) \( i = 1, \ldots, M \) \( j = 1, \ldots, K \).

Note that a block matrix structure is implemented for the BBD formulation. As such, the reduction can be performed locally [4]. However, the system poles are not determined only by those blocks in diagonal. To achieve a high-order accuracy but with only a low-order reduction, the TBS reduction in [8] is extended to consider the parameterized BBD state matrices with the inverse-inductance. Moreover, one important observation is that, since only one principal port at each block is selected, a SIMO-reduction can be easily applied to achieve q-th order moment matching for each block, and the reduced macromodel for each block can be repeatedly used for any input signals.

As a result, a localized integrity analysis can be efficiently performed for each block to generate both nominal responses and sensitivities in time-domain

\[ \hat{\mathcal{G}}_{ib} + \frac{1}{h} \hat{\mathcal{C}}_{ib} \hat{x}_{ib}(t) = \frac{1}{h} \hat{\mathcal{C}}_{ib} \hat{x}_{ib}(t-h) + \mathcal{B}_{ib} \hat{\mathcal{I}}(t) \]

The k-th block power integrity at one principle I/O perturbed by i-th template is \( \hat{y}_{ib}(t) = \hat{y}_{ib}^{(0)}(t) + \hat{y}_{ib}^{(1)}(t) \).

6. ALGORITHM AND RESULTS

With use of the structured and parameterized macromodel in Section 5, the problem formulation (2) in Section 2 can be efficiently solved by the sensitivity based optimization. The overall optimization is outlined below. The nominal value and sensitivities are computed one-time from the structured and parameterized macromodel from (11). Afterwards, the decap is added into each block independently. In k-th block, the template vector \( \mathbf{T} \) is ordered according to the magnitude of sensitivities: \( \{\hat{y}_{ib,k}, \ldots, \hat{y}_{ib,M,k}\} \).
Table 1: Results of decap allocations by SA and our MRA method. The cost of decaps is normalized.

<table>
<thead>
<tr>
<th>clock (node#-I/O)</th>
<th>#level</th>
<th>#legal-pos</th>
<th>#partition</th>
<th>SA-NA</th>
<th>MRA-NA</th>
<th>MRA-NI</th>
</tr>
</thead>
<tbody>
<tr>
<td>256+40</td>
<td>0.1</td>
<td>20</td>
<td>4</td>
<td>192.2</td>
<td>16</td>
<td>10</td>
</tr>
<tr>
<td>1160+160</td>
<td>0.1</td>
<td>80</td>
<td>4</td>
<td>2hrs</td>
<td>55</td>
<td>50</td>
</tr>
<tr>
<td>4720+640</td>
<td>0.1</td>
<td>320</td>
<td>4</td>
<td>7hrs</td>
<td>102</td>
<td>96</td>
</tr>
<tr>
<td>10660+1440</td>
<td>0.1,2</td>
<td>720</td>
<td>8</td>
<td>1day</td>
<td>233</td>
<td>216</td>
</tr>
<tr>
<td>1552+3645</td>
<td>0.1,2</td>
<td>1701</td>
<td>8</td>
<td>NA</td>
<td>932.4s</td>
<td>277</td>
</tr>
<tr>
<td>55216+10880</td>
<td>0.1,2,3</td>
<td>5440</td>
<td>16</td>
<td>NA</td>
<td>51mins</td>
<td>340</td>
</tr>
</tbody>
</table>

Figure 3: Waveform accuracy comparisons between the original method in [1] and TBS2 in (a) time-domain and (b) frequency-domain for 4th principal port. The original and TBS2 are visually identical.

and is added according to this order until the integrity constraint of k-th block is satisfied. The algorithm then iterates to the next block until all the power integrity of all blocks are satisfied.

The proposed macromodeling and allocation algorithm are implemented in C/Matlab. We call our macromodeling method as TBS2, and our optimization as multi-ring based allocation (MRA). Experiments are run on a Linux workstation with 2G RAM. A typical FPGA package model is assumed with the specific application inputs. Four packages P/G planes are assumed with the size of lcm x lcm. The Vdd is assumed to be 2.5V, and the targeted noise is 0.25V. The worst-case I/O currents are modeled as triangle-waveforms with rising time 0.2ns, width 1ns and period 150ns, which are randomly distributed in a square of 0.2cm x 0.2cm located in the center of package planes. The 30% of remaining area are reserved for legal positions. The 4 decap-types in [2] are used. The total number of decaps and rings are 80 and 5. Each ring decomposed into four levels (0-3). We increase the circuit complexity by increasing the number of discretized tiles, and need more levels for legal positions when the tile number becomes larger. We allocate decaps by MRA and SA methods to satisfy the power integrity at I/Os under constraints of the noise amplitude (NA) or the noise integral (NI).

6.1 Comparison of Macromodels

We first compare our method with macromodeling in [1] as follows. The packages planes are discretized into 4720 tiles, described by a RLC-mesh with 12,810 resistors, 11,800 capacitors and 64,000 susceptors. There are 420 I/O current sources as inputs. As discussed in Section 4, the sequences of I/O currents are generated by simulating the specified application of input vectors for millions of cycles. One spatial correlation matrix C is extracted from the sequences. Then, the spectral clustering finds 8 principal ports by PCA and clusters the ports into 8 groups. Accordingly, the network is partitioned into 8 blocks by hmetis. Fig. 3 compares the frequency and time domain responses at 4th principal port. Due to the I/O port reduction and a localized reduction and analysis, our method is 21X faster (765s vs. 35.2s) to build and 25X faster (51mins vs. 2mins) to simulate compared to [1]. Moreover, because the TBS reduction can achieve a higher accuracy with use of triangularization, the waveform by TBS2 is visually identical to the original. But the reduced waveform by [1] has about 3.04X larger waveform error in the time-domain.

6.2 Comparison of Allocations

We also compare the runtime and the cost of allocated decaps between SA and MRA, where both methods use the noise amplitude as the constraint. As shown in Table 3, due to the systematical allocation with use of sensitivity, MRA reduces the allocation time by 97X on average compared to SA. In addition, SA can only handle circuits up to ~ 10,000 nodes. To obtain a result in a reasonable time, SA usually can not find the optimal solution. For a circuit with 10,680 nodes, MRA finds a solution with dollar cost about 215 in 13mins, but SA finds a solution with dollar cost about 233 (+9%) in 1day.

We further compare the runtime and the cost of allocated decaps by noise amplitude (NA) and noise integral (NI), both using MRA for allocation. As shown in Table 3, compared to the optimization with NA, the optimization with NI reduces the cost of allocated decaps by up to 7% within a similar allocation time. This is because the constraint by the noise amplitude ignores the accumulated effect of the transient noise waveform. In contrast, the constraint by NI can consider the noise pulse width, and accurately predict the decap allocation. As a result, MRI reduces the cost by up to 16% compared to the SA using NA [2].

7. CONCLUSIONS

To efficiently and accurately allocate the decap, this paper has presented a fast off-chip decoupling capacitor allocation considering I/O clustering. We have presented a spectral analysis to cluster larger numbers of I/Os. This clustering enables an I/O-based network partition with a localized integrity analysis. In addition, to systemically allocate decaps, we have also proposed a ring-based decap allocation based on the sensitivity, which is generated from a structured and parameterized macromodel. Experiments show that compared to the existing methods, our method is up to 97X faster, and also reduces decap cost by up to 16% to meet the same noise bound in time-domain.

8. REFERENCES