<table>
<thead>
<tr>
<th>Title</th>
<th>Robust on-chip signaling by staggered and twisted bundle</th>
</tr>
</thead>
<tbody>
<tr>
<td>Author(s)</td>
<td>Yu, Hao; He, Lei; Chang, Frank Mau Chung</td>
</tr>
<tr>
<td>Date</td>
<td>2009</td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://hdl.handle.net/10220/6237">http://hdl.handle.net/10220/6237</a></td>
</tr>
<tr>
<td>Rights</td>
<td>© 2009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. <a href="http://www.ieee.org/portal/site">http://www.ieee.org/portal/site</a></td>
</tr>
</tbody>
</table>
Robust On-Chip Signaling by Staggered and Twisted Bundle

Hao Yu
Nanyang Technological University, Singapore

Lei He and Mau-Chung Frank Chang
University of California, Los Angeles

Chip multiprocessors (CMPs) are becoming the semiconductor industry’s dominant hardware paradigm because they do not lead to unmanageable power dissipation, which often results from the technology scaling of uniprocessor designs. On-chip components can be easily shared in CMPs, improving the throughput rate. However, because of the additive effect of improved component utilization, the increased integration density and data sharing in a CMP exacerbate soft errors caused by the global interconnect when delivering signal or clock data. It is a well-known challenge for the gigascale integration of uniprocessor designs to consider delay variation and distribute a low-skew or low-jitter global signal or clock data in the presence of crosstalk.\(^{1-3}\) Because of the highly compact integration of the CMP interconnect, the adjacent coupling capacitance of the global interconnect increases. In addition, inductance becomes significant when the signal switches from one state to another with a sharp slew rate.\(^{4,5}\)

Crosstalk, resulting from capacitive and inductive coupling, could severely affect the timing and signal integrity of clock or signal nets. Because each interconnect experiences a different capacitive-coupling length or a different inductive-coupling return path, the interconnect exhibits varying signal or clock delay under different switching patterns. Consequently, high-speed digital circuits in CMPs that use dynamic logic are more susceptible to crosstalk than circuits that use static logic.

In this article, we introduce a twisted and staggered shielding (TSS) structure for robust on-chip signaling that minimizes both inductive and capacitive coupling, by means of uniformly distributing twisted shields (see the “Related Work” sidebar). We describe a transmission line model we developed to explain why TSS can reduce both capacitive and inductive couplings. Furthermore, we present an automatic layout synthesis for the design of a TSS bundle, which we validated with a Spice simulation of the worst-case delay and the worst-case noise. What’s more, we fabricated a test chip to verify the delay and its variations of all these structures under different switching patterns. As our results will show, compared to the coplanar shielding (COS) structure, the TSS structure reduces delay by 25% and reduces delay variation by 25×.

Twisted signal and shield pairs

In explaining the design of our TSS structure, we begin by discussing key characteristics of a pair of twisted and normal wires. The use of twisted wire is well-known in wired transmissions such as in telephone lines and cables. This approach interleaves the polarity of magnetic flux and so cancels inductive coupling and can
Related Work

Researchers have proposed various techniques to improve the robustness of on-chip signaling. Compared with buffer insertion, which is one way to improve robustness, shielding is an effective approach with a smaller implementation cost.\(^{1,2}\) Because coplanar shielding (COS) reduces the capacitive-coupling length and provides a local return path for inductive coupling, designers commonly use COS to reduce crosstalk in layouts. To minimize self-inductance, for example, Massoud et al. introduced an interdigitized coplanar shielding technique.\(^1\) In other work, Kaul et al. described an active shielding method that applies complementary signals on shields,\(^2\) and Qi et al. have experimentally demonstrated coplanar shielding’s effectiveness.\(^3\) However, because chip multiprocessors have many global signal nets, the use of COS in a CMP design would excessively consume the signal net’s routing resource. As a result, many signal nets typically share only one shield, which can lead to delay variation.

In another approach to shield distribution, interconnects are twisted together with shields, using via arrays in copper interconnects with small via resistive loss and improved reliability. Hidaka et al. were the first to introduce the twisted pair into differential signaling to reduce crosstalk in DRAM design.\(^4\) Recently, Mensink et al. conducted further studies of the optimal twist position for RC-dominant interconnects.\(^5\) Like COS shielding, in which each signal net is designed with its differential net, the use of differential pairs increases the design cost for multiple signal nets, or what in this article we call a bundle. To minimize that cost, Zhong et al. introduced a method to distribute shielding for multiple signal nets, that is, bundles.\(^6\) Using both twisted and normal interconnects with shields (TNS), they minimize inductive coupling by compensating for the polarity-interleaved magnetic fluxes between the twisted group and the normal group. Sato and Masuda further measured the delay variation of TNS by twisting only shields,\(^7\) a simplified design of work done by Zhong et al.\(^6\) However, with normal interconnects, shields are not uniformly distributed. Capacitive coupling is therefore significant not only between the twisted and normal interconnects but also among those interconnects inside the normal group. As a result, each bit in TNS could have a different capacitive-coupling length, which means delays can still vary widely among signal nets.

References

noise $V_{\text{cap}}$. Next, we derive the induced crosstalk voltage in the frequency domain.

We first determine the inductive-coupling-introduced noise, $V_{\text{ind}}$. As Figures 2a and 2b show, we assume that the current at the $i$th stage of the aggressor is $I_{A_i}$, and the mutual inductance (per unit length) is $M_0$ between two loops: one loop formed by the aggressor with its local ground, and the other formed by the victim with its local ground.

Then the superposed total $V_{\text{induced}}$ is

\[
V_{\text{induced}} = (sM_0l)(I_{A1} - I_{A2} + I_{A3} - \ldots) \tag{1}
\]

where the aggressor current $I_{A_i}$ is approximated by

\[
I_{A1} \approx I_{A2} \approx I_A = \frac{V_{\text{src}}}{(Z_{\text{src}} + Z_{\text{ld}})} \tag{2}
\]

Because the equivalent model works as a voltage divider, the inductive noise $V_{\text{ind}}$ observed at the receiver becomes

\[
V_{\text{ind}} = \frac{Z_{\text{ld}}}{(Z_{\text{ld}} + Z_{\text{src}})} \times V_{\text{induced}}
\]

Based on Equations 1 and 2, $V_{\text{ind}}$ becomes

\[
V_{\text{ind}} = \begin{cases} 
0 & \text{if } N \text{ is even} \\
\frac{Z_{\text{ld}}}{(Z_{\text{ld}} + Z_{\text{src}})} \times (sM_0l)I_A & \text{else}
\end{cases}
\tag{3}
\]

Therefore, when the pair is twisted into an even number of segments, the inductive crosstalk is 0. When the pair is twisted into an odd number of segments, the inductive crosstalk is as small as the coupling of one divided segment with length $l$.
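
As a quick numerical check of Equations 1 through 3, the following sketch sums the alternating segment contributions under the equal-current assumption of Equation 2. The function name and every component value are illustrative assumptions, not values taken from the test structures.

```python
import math

def inductive_noise(n_segments, s, m0, seg_len, z_src, z_ld, v_src):
    """Inductive noise at the receiver for a pair twisted into n_segments."""
    # Equation 2: the aggressor current is taken to be the same in every segment.
    i_a = v_src / (z_src + z_ld)
    # Equation 1: successive twisted segments contribute with alternating sign.
    alternating = sum((-1) ** k for k in range(n_segments))
    v_induced = (s * m0 * seg_len) * i_a * alternating
    # Voltage divider between source and load impedance at the receiver.
    return z_ld / (z_ld + z_src) * v_induced

# Hypothetical values, chosen only for illustration.
params = dict(
    s=1j * 2 * math.pi * 1e9,  # evaluate at 1 GHz
    m0=1e-7,                   # mutual inductance per unit length (H/m)
    seg_len=1e-3,              # segment length l (m)
    z_src=50.0,
    z_ld=50.0,
    v_src=1.0,
)

even = inductive_noise(4, **params)
odd = inductive_noise(5, **params)
single = inductive_noise(1, **params)
print(abs(even), abs(odd), abs(single))
```

With an even segment count the contributions cancel exactly, and with an odd count the result equals the coupling of a single segment of length $l$, which is the behavior Equation 3 states.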

We reached this conclusion on the basis of a low-frequency analysis in which the current at each twisted stage was approximately the same according to Equation 2. As shown by the experiment later in this section, this assumption still approximately holds in the high-frequency range when compared to the calculation by the exact 3D field solver.

We can further determine the capacitive-coupling-introduced noise, $V_{\text{cap}}$. As Figures 2c and 2d show, we assume that the coupling capacitance (per unit length) is $C_0$ between the aggressor and victim when there is no shielding. The coupling is $\alpha C_0$ when shielding exists. The $\alpha$ factor reflects the effect of shielding between the aggressor and victim. Its value can be larger than 1.0 and depends on both aggressor and victim switching patterns.

We can determine a superposed total, $I_{\text{induced}}$:

$$I_{\text{induced}} = (\alpha C_0 l)(V_{a1} + \alpha V_{a2} + V_{a3} + \cdots) \tag{4}$$

with the aggressor voltage $V_a$ calculated at each stage as

$$V_{a1} \approx V_{a2} \approx V_a = \frac{Z_{\text{ld}}}{Z_{\text{ld}} + Z_{\text{src}}} V_{\text{src}} \tag{5}$$

The capacitive crosstalk $V_{\text{cap}}$ observed at the receiver is

$$V_{\text{cap}} = Z_{V_{\text{src}}} \times (0.5\alpha C_0 Nl)(1 + \alpha) V_a \tag{6}$$

where the interconnect length, $Nl$, is a constant. Clearly, for the twisted pair, the capacitive coupling contributes the dominant crosstalk, and it differs from the inductive crosstalk by a factor $\alpha$.

In the TNS design by Zhong et al., the two signals experienced capacitive coupling (with no shield inside) in a range only half that of the interconnect length. This situation worsens with more signal nets—that is, in a twisted bundle structure—that share one shield. Therefore, the TNS structure has limited usefulness if we don’t reduce capacitive coupling.

Pair of twisted and staggered wires

We have found that we can alleviate the capacitive-coupling situation by staggering twists as Figure 3b shows, in which shields are alternately routed between signals.

To see how to reduce the crosstalk caused by capacitive coupling, let the number of staggered twists be $N_{\text{stag}}$. We can easily verify that the capacitive-coupling-introduced noise is effectively reduced by a factor of $2N_{\text{stag}}$:

$$V_{\text{cap}} = Z_{V_{\text{src}}} \times (0.5\alpha C_0 Nl)\left(1 + \frac{\alpha}{2N_{\text{stag}}}\right) V_a \tag{7}$$

Thus, by staggering the twisted interconnect in a chip design such as a CMP, we can reduce the capacitive-coupling length compared to the TNS structure.
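
To make the comparison between Equations 6 and 7 concrete, the sketch below evaluates both expressions and their ratio. The function names are hypothetical, the argument `z` stands for the impedance factor that multiplies both equations, and all numeric values are assumptions chosen only for illustration.

```python
def cap_noise_twisted(alpha, c0, total_len, z, v_a):
    """Equation 6: capacitive noise of a twisted pair without staggering."""
    return z * (0.5 * alpha * c0 * total_len) * (1 + alpha) * v_a

def cap_noise_staggered(alpha, c0, total_len, n_stag, z, v_a):
    """Equation 7: capacitive noise with n_stag staggered twists."""
    return z * (0.5 * alpha * c0 * total_len) * (1 + alpha / (2 * n_stag)) * v_a

# Hypothetical values: the total length N*l is held constant in both cases.
alpha, c0, total_len, z, v_a = 0.5, 1e-10, 4e-3, 50.0, 0.5

twisted = cap_noise_twisted(alpha, c0, total_len, z, v_a)
ratios = {
    n_stag: cap_noise_staggered(alpha, c0, total_len, n_stag, z, v_a) / twisted
    for n_stag in (1, 2, 4)
}
for n_stag, ratio in ratios.items():
    print(f"N_stag = {n_stag}: V_cap ratio (Eq. 7 / Eq. 6) = {ratio:.3f}")
```

The ratio depends only on $\alpha$ and $N_{\text{stag}}$, because every other factor is common to both equations; it shrinks as more staggered twists are added.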

Furthermore, by uniformly distributing staggered and twisted shields, we can compensate for flux. As a result, the signal-net flux also approaches 0. In the example shown in Figure 4b, with a 4,000-micron wire length, 1-micron width, and 2-micron spacing, we extracted the loop inductance matrix at 1 GHz using the FastHenry tool (http://www.rle.mit.edu/cpg/research_codes.htm). For a twisted pair with normal wires, the matrix is

$$L_{\text{TNS}} =
\begin{bmatrix}
3.501 \times 10^{-9} & 4.069 \times 10^{-14} & 5.159 \times 10^{-11} & 3.147 \times 10^{-15} \\
4.069 \times 10^{-14} & 3.504 \times 10^{-9} & 4.069 \times 10^{-14} & 5.160 \times 10^{-11} \\
5.159 \times 10^{-11} & 4.069 \times 10^{-14} & 3.501 \times 10^{-9} & 4.069 \times 10^{-14} \\
3.147 \times 10^{-15} & 5.160 \times 10^{-11} & 4.069 \times 10^{-14} & 3.504 \times 10^{-9}
\end{bmatrix}$$

For a staggered and twisted pair, the matrix is

$$L_{\text{TSS}} =
\begin{bmatrix}
3.491 \times 10^{-9} & 5.796 \times 10^{-14} & 5.151 \times 10^{-11} & 4.640 \times 10^{-15} \\
5.796 \times 10^{-14} & 3.491 \times 10^{-9} & 5.796 \times 10^{-14} & 5.151 \times 10^{-11} \\
5.151 \times 10^{-11} & 5.796 \times 10^{-14} & 3.491 \times 10^{-9} & 5.796 \times 10^{-14} \\
4.640 \times 10^{-15} & 5.151 \times 10^{-11} & 5.796 \times 10^{-14} & 3.491 \times 10^{-9}
\end{bmatrix}$$

We applied a finer discretization (4 × 4) at each wire cross section to account for skin and proximity effects.

![Figure 3. Twisted and normal interconnect (a) and twisted and staggered interconnect (b).](image-url)
On-Chip Signaling in Chip Multiprocessor Interconnects

Synthesis for staggered and twisted bundle

Once we have our TSS structure, we need a systematic synthesis methodology to design a layout for a bundle of signal-nets using twisted and staggered shields. We can summarize the problem formulation as follows:

**TSS synthesis problem:** Given the number of signal nets \( N_{\text{sig}} \), the signal-to-shield ratio \( N_{\text{cell}} \), and the number of staggering stages \( N_{\text{stag}} \), the objective of synthesizing a staggered and twisted bundle is to find a routing topology with \( N_{\text{gp}} = N_{\text{sig}} / N_{\text{cell}} \) groups of wires. Each group has \( N_{\text{stag}} \) stages formed by connecting a unit twisting cell \( (T) \), a unit normal cell \( (N) \), and their complements \( (T_b, N_b) \) alternately. We then generate adjacent groups of wires by cyclically shifting unit cells.

Figure 3b shows the wire diagram with unit cells for an example in which \( N_{\text{sig}} = 6 \), \( N_{\text{cell}} = 3 \), and \( N_{\text{stag}} = 1 \). We can define four kinds of unit cells \( (T, N, T_b, N_b) \) as we construct them.

First, we discuss how to synthesize a twisting cell. A twisting cell consists of \( N_{\text{cell}} \) signal nets with \( N_{\text{cell}} \) segments per net. If we assume that each bit of interconnect is equally divided into \( n \) segments, where \( n = N_{\text{cell}} + 1 \), we can describe the twisted pattern through a routing matrix \( T \):

\[
T = \begin{bmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,n} \\
t_{2,1} & t_{2,2} & \cdots & t_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
t_{n,1} & t_{n,2} & \cdots & t_{n,n}
\end{bmatrix}
\tag{8}
\]

where its \( i \)th row is

\[
T_i = \begin{bmatrix}
t_{i,1} & t_{i,2} & \cdots & t_{i,n}
\end{bmatrix}
\]

to represent wire segments in the \( i \)th bit. The changes between each neighboring column represent changes of routing connections. For example, a pair \( (t_{i,j}, t_{i,j}) \) means that the \( k \)th bit will change from the \( i \)th-bit track to the \( j \)th-bit track.

As Figure 3a shows, to minimize the inductive crosstalk, we must twist both the signal and shield segments such that the polarity of the current loop for each cell can change symmetrically. In the following, we discuss how to generate the staggered and twisted pattern for multiple signal nets with one shield.
When \( n \) is even, we can synthesize the routing matrix for one unit twisting cell as follows, similar to the approach of Zhong et al.\(^6\):

1. Begin with an initial row, \( T_0 = (n - 1 \ 1 \ 2 \ldots) \).
2. Cyclically shift \( T_0 \) up by one segment \((n - 1)\) times, obtain \((n - 1)\) number of permuted rows, and construct a cyclic permutation matrix \( C \).
3. Replace the diagonal element in the cyclic permutation matrix by 0 (representing a shield), attach the diagonal element to an additional column (row), and form an \( n \times n \) routing matrix \( T \).

On the other hand, when \( n \) is odd, we can apply the same procedure by adding one more dummy wire such that the total wires \((n + 1)\) in one unit’s twisting cell is still even. This avoids the additional design cost that Zhong et al. incurred to enforce permeability for an odd number of wires.\(^6\)

Let’s return for a moment to the example in Figure 3b and consider the leftmost cell in the top row. We take the following steps to construct one twisting cell. First, the initial row is

\[
T_0 = \begin{bmatrix} 3 & 1 & 2 \end{bmatrix}
\]

Second, the cyclic permutation matrix becomes

\[
C = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 2 & 3 \\ 2 & 3 & 1 \end{bmatrix}
\]

Third, the complete routing matrix is

\[
T = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 3 & 2 \\ 2 & 3 & 0 & 1 \\ 3 & 2 & 1 & 0 \end{bmatrix}
\]
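
The three-step construction for an even \( n \) can be sketched in a few lines of Python. This is only an illustrative reading of the procedure: the function name is hypothetical, "shifting up by one segment" is interpreted as a left rotation of the initial row, and the placement of the shield entry 0 in the corner of the attached row and column is inferred from the 4 × 4 routing matrix of the worked example.

```python
def twisting_cell(n):
    """Build the n x n routing matrix T for an even wire count n (0 marks the shield)."""
    if n < 2 or n % 2:
        raise ValueError("n must be even; add a dummy wire when it is odd")
    m = n - 1
    # Step 1: initial row (n-1, 1, 2, ...).
    t0 = [m] + list(range(1, m))
    # Step 2: cyclic shifts of the initial row form the permutation matrix C.
    c = [t0[k:] + t0[:k] for k in range(m)]
    # Step 3: replace the diagonal by 0 and attach it as an extra column and row.
    diag = [c[i][i] for i in range(m)]
    rows = [
        [0 if i == j else c[i][j] for j in range(m)] + [diag[i]]
        for i in range(m)
    ]
    rows.append(diag + [0])
    return rows

for row in twisting_cell(4):
    print(row)
```

For `n = 4` the function returns the routing matrix \( T \) of the example. Under this reading, every row and every column of the result contains each wire index exactly once, so no two wires occupy the same track in the same segment.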

We can’t synthesize the staggered and twisted bundle solely by using the above-constructed twisting cell; we also need to synthesize the other three unit cells. We obtain the complementary matrix \( T_b \) for the twisting cell \( T \) by reversing the order of each row in Equation 8:

\[
T_b = \begin{bmatrix}
t_{1,n} & t_{1,n-1} & \cdots & t_{1,1} \\
t_{2,n} & t_{2,n-1} & \cdots & t_{2,1} \\
\vdots & \vdots & \ddots & \vdots \\
t_{n,n} & t_{n,n-1} & \cdots & t_{n,1}
\end{bmatrix}
\tag{9}
\]

Furthermore, we can define a normal cell \((N)\) and its complementary \((N_b)\) by the following routing matrices:

\[
N = \begin{bmatrix} t_{1,1} & t_{1,1} & \cdots & t_{1,1} \\ t_{2,2} & t_{2,2} & \cdots & t_{2,2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{n,n} & t_{n,n} & \cdots & t_{n,n} \end{bmatrix},
\qquad
N_b = \begin{bmatrix} t_{n,n} & t_{n,n} & \cdots & t_{n,n} \\ t_{n-1,n-1} & t_{n-1,n-1} & \cdots & t_{n-1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ t_{1,1} & t_{1,1} & \cdots & t_{1,1} \end{bmatrix}
\tag{10}
\]

With these four unit cells defined by the routing matrices in Equations 8, 9, and 10, we can construct the staggered and twisted pattern by interleaving the twisted cell \((T, T_b)\) and the normal cell \((N, N_b)\). The synthesis procedure is the same as the one we used to form one twisted cell.

First, we construct an initial staggered row:

\[
R^0 = [T//N_b//T_b//N,\ldots,T//N_b//T_b//N]
\]

where we repeat the pattern \(N_{\text{stag}}\) times. We then cyclically permute \(R^0\) one unit cell at a time to obtain the routing matrix for each group as follows:

\[
R^1 = P^1 R^0 = [N//T//N_b//T_b,\ldots,N//T//N_b//T_b]
\]

\[
R^2 = P^2 R^0 = [T_b//N//T//N_b,\ldots,T_b//N//T//N_b]
\]

\[
R^3 = P^3 R^0 = [N_b//T_b//N//T,\ldots,N_b//T_b//N//T]
\]
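
The cyclic permutation of unit cells is simple to express in code. In the sketch below, the function name and the string labels for the cells are hypothetical, and the base pattern `T, Nb, Tb, N` is the one implied by the rows \(R^1\) through \(R^3\); each new group is obtained by rotating the previous row right by one unit cell.

```python
def staggered_rows(pattern, n_groups):
    """Return n_groups rows, each the previous row rotated right by one unit cell."""
    rows = [list(pattern)]
    for _ in range(n_groups - 1):
        prev = rows[-1]
        rows.append([prev[-1]] + prev[:-1])  # the last cell moves to the front
    return rows

# One period of the initial staggered row; labels stand for the four unit cells.
base = ["T", "Nb", "Tb", "N"]
rows = staggered_rows(base, 4)
for k, row in enumerate(rows):
    print(f"R^{k} = " + "//".join(row))
```

Because the pattern has four unit cells, a fifth rotation would reproduce the initial row, so larger bundles reuse the same four row types.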

Figure 4a shows the resulting general structure of the staggered and twisted pattern composed of those unit cells \((T, T_b, N, N_b)\). We further illustrate this procedure in Figure 4b, which is an example of 18 signal nets with 1-stage staggering. This means we have six groups to synthesize when the signal/shield ratio is 3:1. The routing matrix in Figure 4b is shown with dashed lines in different styles to indicate different unit cells. The initial staggering row is cyclically permuted six times, one cell at a time, and the resulting patterns form the overall routing matrix.

**Validation by circuit simulation**

With this synthesis, we next devised a circuit simulation in which we designed a more complicated layout for the staggered and twisted bundle, which enabled us to further study the TSS (interconnect) structure’s impact on worst-case delay (WCD) and worst-case noise (WCN).
We used 180-nm and 70-nm technologies (with a copper interconnect) in the Berkeley predictive model. We considered three interconnect structures for the simulation: the coplanar wire with shielding (COS), the twisted and normal wire with shielding (TNS), and the twisted and staggered wire with shielding (TSS). We assumed the use of M6 to lay out the signals and shields, a minimum wire width of 0.45 micron for 180-nm technology and 0.2 micron for 70-nm technology, and a spacing of 0.5 micron for 180 nm and 0.2 micron for 70 nm. The via we chose was a 2 × 2 array of the minimum size (0.2 micron × 0.2 micron). The wire length was 4,000 microns, and the driver size was about 100× the minimum inverter size. In the test structures we designed, the inductive effect is significant. Their driver strength and resistive loss are less than their characteristic impedance. Furthermore, we used an exponential voltage source for input signals, with a 50-ps rise time. We modeled the nonlinear driver with the Berkeley BSIM3 model within a modified Spice3 simulator (http://ngspice.sourceforge.net). We extracted the capacitance using the FastCap tool with each wire discretized into 100 boxes along its length. We used the FastHenry tool to extract the resistance and inductance, with an additional 4 × 4 discretization at the cross section. We then used a distributed RLC circuit to model the signal wire.

Tu et al. showed that the WCN occurs when the victim line switches in the same direction as neighboring aggressor lines. Because it would be too expensive to explore every switching pattern during design, for our simulation we measured the WCN and WCD according to Chen and He's work, which involved keeping the victim line quiet. Moreover, we assumed the aggressor and victim had a switching window of 200 ps—that is, the earliest and latest arrival times differed by 200 ps. The computational time depended linearly on the number of signal nets and the Spice simulation time for each alignment. To reduce the simulation time for large circuits, we applied model order reduction to generate macro-models for interconnects.

In our validation efforts, we compared the WCD and WCN of COS, TNS, and TSS shielding using six signal wires with a 3:1 signal-to-shield ratio in 180-nm technology. We used three shields for COS and two shields for both TNS and TSS. Figure 5 compares the WCD and WCN when each wire acted as the victim for all three structures. As Figure 5a shows, the variation of WCD between signal nets was smaller in TSS than in COS and TNS. In terms of the average WCD among all six bits, the delay in TSS was 11 ps shorter than in COS (51 ps vs. 62 ps).

Figure 5b shows the WCN. The WCNs of TSS were uniform among the six signal nets. Because of the large capacitive coupling between normal wires, the WCNs of the nets in the normal group (net 4, net 5, net 6), especially under TNS shielding, were much larger (they averaged a 15% delay variation) than those in the twisted group (net 1, net 2, net 3). As a result, the TSS structure performed best in the simulation in terms of both delay and noise.

**Figure 5.** Worst-case delay (WCD) and worst-case noise (WCN) comparison for each signal net of COS, TNS, and TSS structures with a 3:1 signal-to-shield ratio: worst-case delay for each signal (a) and worst-case noise for each signal (b).

**Layout and delay measurement circuit**

Finally, using the simulation we just described, we designed and fabricated a test chip: a layout and measurement circuit in IBM 130-nm technology with a copper interconnect. Our objective was to learn whether and when inductance becomes important to delay variation, and how the interconnect structure affects delay and delay variation.

As the die diagram in Figure 6 shows, our measurement circuit has four components. The first component is the device under test (DUT)—that is, four groups of interconnect structures. The second component is the testing generation module, including the programmable driver and delay element, and control logic to generate different logic patterns. The third component is a ring receiver. It forms an oscillator with the programmable driver by including DUTs in its signal path. The last component is the read-out circuitry.

We used a 16-bit synchronized counter to directly record the ring oscillator’s oscillation frequency. Finally, we applied a slow sampling clock to shift out the counted result. We could then infer the DUT’s delay information. Because the DUT is located in the signal paths, we achieved an overall operating frequency of approximately 1 GHz.

Layout of interconnect structure under test

Typically, in the IBM 130-nm technology, the top-layer metal, called MQ, is used for the normal signal net with \( L = 800 \) microns, \( W = 4 \) microns, and \( S = 6 \) microns. The third-bottom-layer metal, called M3, is used for the grounded interconnect (shield), and the second-top-layer metal, called MG, is used for twisting. To improve the reliability and reduce the resistance of vias, we used four 2 × 2 via arrays.

Figure 7 shows the four groups of interconnect structures we tested. Figure 7a shows 6-bit normal (NO) interconnects. In this case, the delay variation at each bit is determined by both the inductive-coupling return path and the capacitive-coupling length. Figure 7b shows 6 bits of coplanar (CO) interconnects with 3 bits of shielding. There are 6 bits of straight signal wires and 3 bits of shielding with the same wire length, width, and spacing. In this design, there are two groups each with three straight signal wires sharing two shields, and M3 for logic ground. Because multiple signals share only one shield, the results are a biased inductive return path, unequal capacitive coupling length, and, consequently, a non-uniform delay variation for the bit in the middle and at the boundary.

In addition, Figure 7c shows 6-bit twisted and normal (TN) interconnects with 2-bit shielding. There are 3-bit twisted signal wires, 3-bit straight signal wires, and 2-bit shielding. In this case, one normal group

![Figure 6. Die diagram of the layout and delay measurement circuit.](image)

![Figure 7. Four groups of interconnect structures under test: 6-bit normal interconnects (a), 6-bit coplanar interconnects with 2-bit shielding (b), 6-bit twisted and normal interconnects with 2-bit shielding (c), and 6-bit twisted and staggered interconnects with 2-bit shielding (d).](image)
shield distribution is still nonuniform in these two groups, so the delay variations are still significant.

Finally, Figure 7d shows 6-bit twisted and staggered (TS) interconnects with 2-bit shielding. There are two twisted groups, with each group having 3-bit twisted signal wires and a 1-bit shield. In this design, the shielding is uniformly distributed among signal nets to minimize both capacitive and inductive coupling. In addition, Figure 8 shows how to design a typical twist layout using two metal layers and one via array.

Test-generating circuits

In the actual test-chip design, we used three components to generate test signals: the decoder to select the aggressor and the victim, the control circuit for the switching pattern, and the driver with programmable driving strength. As Figure 9 shows, we first used the 3:8 decoder to select one from five aggressors. Then, each aggressor was enabled to switch by a pattern of 0 to 1 (rising) or 1 to 0 (falling), controlled by a preset 1:2 multiplexer, and the victim was disabled to stay at 0 (quiet). In addition, we could change the driver strength from 1× to about 8× by selecting three drivers in parallel with exponentially increasing widths (NMOS $W = 2.56$ microns, $W = 5.12$ microns, and $W = 10.24$ microns), which drove the interconnect structure under test.

Sampling circuits

When measuring a device's inductance in the frequency domain, both the impedance mismatch and the parasitic from the probe can impair the measurement's accuracy. Consequently, in our work, we measured delay through an on-chip sampling system by including test structures in the signal path of a ring oscillator and using an on-chip counter to record the delays.

We also used three components to sample the delay, as Figure 10 shows. The first component was the ring oscillator consisting of the programmable driver and the receiver, connected into an inverter chain with an odd number of stages, with the selected victim in the signal path to be measured. The second component was the 16-bit synchronous counter, which directly counted the output of the ring oscillator when enabled by an external fast-clock signal. The last component was the shift register, for serially shifting out the counter outputs when enabled by another external slow-clock signal.

By setting a different input switching pattern for aggressors, we measured the delay at one selected victim bit in the signal path. The delay, reflecting the impact from crosstalk, can be calculated as follows:

$$T_{\text{delay}} = \frac{t_{\text{enable}}}{N_{\text{counter}}} \tag{11}$$
where \( t_{\text{enable}} \) is the enable time of the fast clock, and \( N_{\text{counter}} \) is the counter output shifted by the slow clock.

Our work focused chiefly on the design of a signal-net bundle, in which delay variation naturally differs from that found in a design for a pair of signals. The delay uncertainties we discuss, therefore, are defined as the mean and standard deviation of measured delays by iterating each bit as the victim from a group of 6-bit lines.
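
The following sketch illustrates how Equation 11 and the delay statistics could be evaluated. The function name, the enable window, and the counter value are hypothetical; the six delays are the NO-interconnect entries of Table 5. The use of the sample (rather than population) standard deviation is an assumption, because the text does not say which one was used.

```python
from statistics import mean, stdev

def delay_from_counter(t_enable, n_counter):
    """Equation 11: delay inferred from the enable time and the counter output."""
    return t_enable / n_counter

# Hypothetical enable window and counter value, for illustration only.
example_delay = delay_from_counter(100e-6, 31250)
print(f"example delay = {example_delay * 1e9:.2f} ns")

# Delays of bits 0 to 5 for the NO interconnect in Table 5, in seconds.
no_delays = [3.2600e-9, 3.1993e-9, 3.1471e-9, 3.1718e-9, 3.2916e-9, 3.2456e-9]

mu = mean(no_delays)
sigma = stdev(no_delays)  # sample standard deviation (an assumption)
print(f"mean = {mu * 1e9:.3f} ns, std = {sigma * 1e9:.3f} ns")
```

Iterating each bit as the victim and collecting one delay per bit yields the six-element list; the mean and standard deviation of that list are the per-structure figures of the kind summarized in Table 4.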

Measurement results

In our measurements, we used a Tektronix CSA 907A pattern generator to generate a 20-MHz sampling clock, an HP 8130A pulse generator to generate control for counter enabling and shifting, and an Agilent 8867 logic analyzer to measure counter output. The enabling time for the ring oscillator was 50.5 ms (with a 90-ms period), the reset time was 100 ns (with a 90-ms period), and the sampling clock was 8.15 MHz.

We investigated three different switching patterns, and selected the driver strength to be 1×. For the first switching pattern, all 6 bits switched from 0 to 1. As a result, there was no dynamic capacitive coupling, and the impact of inductive coupling was amplified.

For the second switching pattern, two adjacent bits switched in the opposite direction. In this case, the inductive current could be returned locally by its neighbor, which reduced inductive coupling. Dynamic capacitive coupling, however, was magnified because of the Miller effect. Therefore, we used the second switching pattern to study the impact of capacitive coupling.

As for the third switching pattern, only one bit in each group switched; all other bits were quiet. Accordingly, we were able to study the combined effect of inductive and capacitive coupling.

Tables 1, 2, and 3 summarize the counter outputs and extracted delays for the three switching patterns. Table 4 lists calculations for the mean delay and its standard deviation of 6 bits for the three switching patterns.

Tables 1 and 4 show that, under switching-pattern 1, the NO interconnect had the largest delay variation (0.07 ns) among the four groups. Capacitive coupling was minimized because all bits switched in the same direction. Because there were no local shields serving as return paths, each bit had a different return path and hence a different loop inductance. As a result, the delay of each bit in the NO interconnect had the largest deviation. In contrast, the CO, TN, and TS interconnects provided similar return paths for the inductive coupling, even across different bits. Their delay variations, therefore, were all similar and smaller than in the NO interconnect. The TS interconnects had the smallest mean delay (0.9 ns) and delay variation (1 ps).

Tables 2 and 4 show the results for switching-pattern 2. The CO interconnect had the largest delay variation (0.06 ns) among the four groups. The capacitive coupling was maximized and each bit experienced a constant inductive coupling because two adjacent bits switched in the opposite direction. In addition, because the capacitive-coupling length differed significantly for the bit at the boundary and the bit in the middle, each bit’s delay in the CO interconnect had the largest deviation, larger than in the NO interconnect. In contrast, because the TS interconnect uniformly distributed the shields among 6 bits to minimize both the capacitive and inductive coupling, it had the smallest delay (0.9 ns) and delay variation (3 ps).

As for switching-pattern 3, as shown in Tables 3 and 4, the TN interconnect had a 0.06-ns delay variation and the NO interconnect had a 0.08-ns delay variation. Unlike the previous two switching patterns, in this case each bit’s delay variation was determined by both the inductive-coupling return path and the coupling length. Although the TN interconnect provided return paths to minimize the inductive coupling, the capacitive coupling between the twisted and normal groups and inside the normal group was still significant. Consequently, the delay of each bit in the TN interconnect had a large deviation (0.06 ns), similar to the NO interconnect (0.08 ns). In contrast, both the CO interconnect (0.02 ns) and the TS interconnect (3 ps) exhibited uniform delay variations. In addition, because the TS interconnect minimized the inductive coupling and also had a smaller capacitive-coupling length than the CO interconnect, it showed the smallest delay (0.9 ns). Table 5 also shows the delays and delay variations under switching-pattern 3 but with 8× driver strength. Because of the increased strength, 8× drivers have a higher tolerance to crosstalk than the 1× driver, so the delay and delay variations were smaller than those for 1× drivers.

By comparing the results of all three switching patterns, we see that the TS interconnect structure reduced delay by 25% and reduced delay variation by 25× compared to the CO interconnect structure. In addition, the TS interconnect structure reduced delay by 7.5% and reduced delay variation by 33× compared to the TN interconnect structure.

**Table 5. Measured counter outputs and delays of 6 bits by switching-pattern 3 with 8× driver strength.**

<table>
<thead>
<tr>
<th>Type of interconnect</th>
<th>Bit 0 (s)</th>
<th>Bit 1 (s)</th>
<th>Bit 2 (s)</th>
<th>Bit 3 (s)</th>
<th>Bit 4 (s)</th>
<th>Bit 5 (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>NO</td>
<td>3.2600 × 10⁻⁹</td>
<td>3.1993 × 10⁻⁹</td>
<td>3.1471 × 10⁻⁹</td>
<td>3.1718 × 10⁻⁹</td>
<td>3.2916 × 10⁻⁹</td>
<td>3.2456 × 10⁻⁹</td>
</tr>
<tr>
<td>CO</td>
<td>3.3002 × 10⁻⁹</td>
<td>3.3081 × 10⁻⁹</td>
<td>3.2907 × 10⁻⁹</td>
<td>3.3081 × 10⁻⁹</td>
<td>3.3107 × 10⁻⁹</td>
<td>3.3230 × 10⁻⁹</td>
</tr>
<tr>
<td>TS</td>
<td>3.1566 × 10⁻⁹</td>
<td>3.1487 × 10⁻⁹</td>
<td>3.1519 × 10⁻⁹</td>
<td>3.1519 × 10⁻⁹</td>
<td>3.1535 × 10⁻⁹</td>
<td>3.1511 × 10⁻⁹</td>
</tr>
</tbody>
</table>
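The per-interconnect mean delay and delay variation can be recomputed from the values printed in Table 5. The sketch below is one way to do so, assuming the sample standard deviation is the intended measure of delay variation; the article does not state whether a sample or population estimate was used, and the names `TABLE5_DELAYS_S` and `summarize` are illustrative.

```python
from statistics import mean, stdev

# Per-bit delays in seconds, copied from Table 5
# (switching-pattern 3, 8x driver strength).
TABLE5_DELAYS_S = {
    "NO": [3.2600e-9, 3.1993e-9, 3.1471e-9, 3.1718e-9, 3.2916e-9, 3.2456e-9],
    "CO": [3.3002e-9, 3.3081e-9, 3.2907e-9, 3.3081e-9, 3.3107e-9, 3.3230e-9],
    "TS": [3.1566e-9, 3.1487e-9, 3.1519e-9, 3.1519e-9, 3.1535e-9, 3.1511e-9],
}


def summarize(delays_s):
    """Return (mean, sample standard deviation) of per-bit delays, in seconds."""
    # Assumption: sample standard deviation; use statistics.pstdev for the
    # population estimate instead.
    return mean(delays_s), stdev(delays_s)


summary = {name: summarize(delays) for name, delays in TABLE5_DELAYS_S.items()}

for name, (mu, sigma) in summary.items():
    print(f"{name}: mean = {mu * 1e9:.4f} ns, std = {sigma * 1e12:.1f} ps")
```

With these inputs, the TS row yields both the smallest mean and the smallest spread of the three rows listed.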

Staggered and twisted signaling presents a fabrication challenge for multilevel nonglobal interconnects because it requires an extra metal level to twist the interconnects together. However, because most global interconnects, such as data buses and clocks, are usually designed with top-level metals, our proposed interconnect structure still seems promising for intra- and interchip communication with many signal or clock nets. Such a staggered and twisted interconnect (TSS) structure is particularly suitable for data communication in the design of CMPs. To fully utilize the proposed TSS interconnect structure in high-performance designs, more detailed studies are needed. For example, how can we optimally insert buffers for this kind of interconnect? Research is also needed to design a test chip with different interconnect structures and to measure how the fabrication process affects delay variation.

**Acknowledgments**

This work is sponsored by Semiconductor Research Corp. grant SRC-1100 and National Science Foundation grant NSF-0093273. We thank Yiyu Shi and Xinyi Zhang for the printed circuit board (PCB) design work, and Wei Yao for taking the die photo.

**Hao Yu** is an assistant professor in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. His research interests include the design and test of high-performance on-chip interconnects, temperature-aware 3D IC designs, and analog and RF circuit design verification platforms. He has a PhD in electrical engineering from the University of California, Los Angeles. He is a member of the IEEE.

**Lei He** is an associate professor in the Department of Electrical Engineering at the University of California, Los Angeles. His research interests include VLSI circuits and systems, and electronic design automation. He has a PhD in computer science from the University of California, Los Angeles. He is a senior member of the IEEE.

**Mau-Chung Frank Chang** is a professor in the Department of Electrical Engineering at the University of California, Los Angeles. His research interests focus mainly on the development of high-speed semiconductor devices and ICs for RF and mixed-signal communication and sensing systems. He is a Fellow of the IEEE and a member of the National Academy of Engineering.

Direct questions and comments about this article to Hao Yu, School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798; haoyu@ieee.org.
