<table>
<thead>
<tr>
<th>Title</th>
<th>Efficient VLSI implementation of $2^n$ scaling of signed integer in RNS {2^{n-1}, 2^n, 2^{n+1}}</th>
</tr>
</thead>
<tbody>
<tr>
<td>Author(s)</td>
<td>Tay, Thian Fatt; Chang, Chip-Hong; Low, Jeremy Yung Shern</td>
</tr>
<tr>
<td>Citation</td>
<td>Tay, T. F., Chang, C.-H., &amp; Low, J. Y. S. (2013). Efficient VLSI Implementation of $2^n$ scaling of signed integer in RNS {2^{n-1}, 2^n, 2^{n+1}}. IEEE transactions on very large scale integration (VLSI) systems, 21(10), 1936-1940.</td>
</tr>
<tr>
<td>Date</td>
<td>2013</td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://hdl.handle.net/10220/25187">http://hdl.handle.net/10220/25187</a></td>
</tr>
<tr>
<td>Rights</td>
<td>© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [<a href="http://dx.doi.org/10.1109/TVLSI.2012.2221752">http://dx.doi.org/10.1109/TVLSI.2012.2221752</a>].</td>
</tr>
</tbody>
</table>
Efficient VLSI Implementation of $2^n$ Scaling of Signed Integer in RNS \{2^n-1, 2^n, 2^n+1\}

Thian Fatt Tay, Chip-Hong Chang, and Jeremy Yung Shern Low

Abstract—Scaling is a problematic operation in Residue Number System (RNS) but a necessary evil in implementing many digital signal processing (DSP) algorithms for which RNS is particularly good. Existing signed integer RNS scalers entail a dedicated sign detection circuit, which is as complex as the magnitude scaling operation preceding it. In order to correct the incorrectly scaled negative integer in residue form, substantial hardware overheads have been incurred to detect the range of the residues upon magnitude scaling. In this paper, a fast and area efficient $2^n$ signed integer RNS scaler for the moduli set \{2^n-1, 2^n, 2^n+1\} is proposed. Complex sign detection circuit has been obviated and replaced by simple logic manipulation of some bit level information of intermediate magnitude scaling results. Comparing with the latest signed integer RNS scalers of comparable dynamic ranges, the proposed architecture achieves at least 21.6% of area saving, 28.5% of speedup and 32.5% of total power reduction for $n$ ranges from 5 to 8.

Index Terms—Chinese remainder theorem, residue number system, scaling, signed integer.

I. INTRODUCTION

Residue Number System (RNS), with its inherited modularity, parallelism and localized carry propagation arithmetic operations, has emerged as a promising substitute of the conventional two’s complement system for the data representation and computation of specific applications [1]-[4]. The moduli set selected for a RNS has an influence on its implementation efficiency. Of the known moduli sets, \{2^n-1, 2^n, 2^n+1\} has been most extensively studied. Due to its abundant number of well-developed residue arithmetic operations and reverse converter architectures, many applications have been built around this moduli set [1], [4], [5].

One problem inherited from the non-positional representation of RNS is the truncation of a number in residue domain, which makes overflow avoidance unwieldy. To overcome the magnitude scaling problem, an adaptive channel equalization filter with a large number of multiply-accumulations were recently implemented with an allied RNS-binary system to achieve significant power reduction over the two’s complement system without compromising throughput [2]. Scaling was performed in binary via residue-to-binary conversion to avoid the overflow of dynamic range. In the latest RNS implementation of a 32-bit low-pass finite impulse response (FIR) filter [3], signed number was represented in sign-magnitude form with its magnitude converted to RNS. This atypical RNS representation of signed integer suffers from the dual zero representations and the need to remap the magnitude for regular addition and multiplication, which undermines the advantages of RNS. RNS scaler is usually designed based on either the Chinese Remainder Theorem (CRT) or the Mixed Radix Conversion (MRC) [5]-[10]. Almost all of the existing RNS scalers focus only on magnitude scaling and have either downplayed or glossed over the problem of sign scaling.

Because sign detection itself is a difficult operation in RNS, the challenge of implementing a signed over an unsigned integer RNS scaler is the overheads required for sign detection in order to correctly map the scaled signed residues to the legitimate range. The state-of-the-art signed integer RNS scaler [9] comprises an unsigned integer RNS scaler and a correction circuit. The correction circuit involves a dedicated RNS sign detection circuit which is slow and consumes large logic area. In this paper, a unified architecture is proposed for the signed integer RNS scaling. The targeted moduli set and scaling factor are the popular \{2^n-1, 2^n, 2^n+1\} and \{2^n\}, respectively. Instead of using a dedicated RNS sign detection circuit, the residue representation of the signed integer scaling result is obtained by manipulating the intermediate computation results. Owing to its simplicity, the proposed architecture is faster, smaller and consumes lower power than the latest architectures for the same scaling factor and the same moduli set [9], and a different moduli set of comparable dynamic range [10].

II. UNSIGNED AND SIGNED INTEGERS IN RNS

A RNS is defined by a set of $N$ pairwise relative prime integers \{m_1, m_2, ..., m_N\}, where $m_i$ is called a modulus. An unsigned integer $X$ within the range of \{0, $M$-1\} can be uniquely represented by an $N$-tuple $(x_1, x_2, ..., x_N)$, where the dynamic range $M = \prod_{i=0}^{N-1} m_i$. The residue $x_i$ is the least positive remainder of the division of $X$ by $m_i$, and is usually represented as $X \mod m_i$ or $[X]_{m_i}$.

Let $\tilde{X}$ be a signed integer in the range \{[-M/2, M/2-1]\} if $M$ is even or \{[-(M-1)/2, (M-1)/2]\} if $M$ is odd [6]. $\tilde{X}$ can also be uniquely represented by an $N$-tuple $(\tilde{x}_1, \tilde{x}_2, ..., \tilde{x}_N)$ in signed RNS representation. The relationship between the residue representations of $X$ in unsigned RNS and $\tilde{X}$ in signed RNS under the same moduli set is

$$\tilde{X} = X \text{ if } \tilde{X} \geq 0 \text{ and } X - M \text{ if } \tilde{X} < 0$$

When $\tilde{X} \geq 0$, the residue representation of $X$ can be mapped to that of $\tilde{X}$ in the range of \{0, $M/2-1$\} if $M$ is even and \{0, $(M-1)/2$\} if $M$ is odd; When $\tilde{X} < 0$, the residue representation of $X$ can be mapped to that of $\tilde{X}$ in the range of \{M/2, $M-1$\} if $M$ is even and \{(M+1)/2, M-1\} if $M$ is odd.

III. PROPOSED SIGNED INTEGER RNS SCALER CIRCUIT

A. Magnitude Scaling Error in Signed RNS

Let $Y = (y_1, y_2, y_3)$ be the residues obtained by scaling $X = (x_1, x_2, x_3)$ by $k$ in the residue domain of RNS \{2^n-1, 2^n, 2^n+1\}. For $m_1 = 2^n-1$, $m_2 = 2^n$ and $m_3 = 2^n+1$, $[M_1^{-1}]_{m_1} = 2^{-n-1}$. $[M_2^{-1}]_{m_2} = 2^{n-1}$ and $[M_3^{-1}]_{m_3} = 2^{n+1}+1$ [11], where $M_i = M/m_i$.
and $|M^{-1}|_m$ is the multiplicative inverse of $|M|_m$.

According to Chinese Remainder Theorem (CRT) [1], [4], the integer $Y$ can be recovered by:

$$Y = \sum_{i=1}^{k} M_i \left[ |M^{-1}|_m y_i \right]_m$$

$$= 2^n \left( 2^n - 1 \right) y_1 + \left( 2^n - 1 \right) \left( 2^n - 1 \right) y_2 + \left( 2^n - 1 \right) \left( 2^n - 1 \right) y_3$$

(2)

However, if $X$ is treated as a signed integer $\tilde{X}$ and represented in signed RNS, (2) may not yield the correct result of scaling $X$ by $k$. This is because the scaling of magnitude in unsigned RNS creates an ambiguity in the range partitioning of signed RNS, causing the scaled residues to be incorrectly mapped for negative integers.

The range mismatch can be resolved upon detecting the sign of the input operand. Unfortunately, the sign of an integer is not distinguishable from its residue representation. Existing solutions to the RNS sign detection problem are mainly based on the CRT, MRC or core function [5], [12], [13]. ROM matrices are required to find the orthogonal projections in these approaches, making the overall circuit in conjunction with the magnitude scaling module to be more expensive than a residue-to-binary converter. To significantly lower the cost of signed integer scaling circuit, the sign detection and correction circuit must be kept simple and coherent with the magnitude scaling module.

### B. Correction Circuit for Signed RNS Scaling

From the definition of scaling in RNS, the following expressions of the scaled residues $(y_1, y_2, y_3)$ in terms of $(x_1, x_2, x_3)$ of an unsigned integer $X$ can be derived [4], [7]:

$$y_1 = \lfloor x_1 - x_2 \rfloor_{m_1}$$

(3)

$$y_2 = \left[ \lfloor 2^{2n-1} + 2^{n-1} \rfloor \right]_{m_2} x_1 - 2^n x_2 + \left( 2^{2n-1} + 2^{n-1} - 1 \right) x_3$$

(4)

$$y_3 = \lfloor x_2 + 2^n x_3 \rfloor_{m_3}$$

(5)

where $k = 2^n$ and $\lfloor \cdot \rfloor$ denotes the least integer function.

In order to obtain the correct residue representation of $\tilde{Y}$, $y_1, y_2$ and $y_3$ need to be corrected by detecting the sign of $\tilde{X}$ to map $X$ to $\tilde{X}$ based on their relationship expressed in (1).

No correction is required when $\tilde{X} \geq 0$ because $\tilde{X} = X$ according to (1). Hence, $\bar{y}_1 = y_1$, $\bar{y}_2 = y_2$ and $\bar{y}_3 = y_3$.

However, when $\tilde{X} < 0$, $\tilde{X} = X - M$. Since $k = m_2$

$$\tilde{Y} = \left\lfloor \frac{X}{k} \right\rfloor = \left\lfloor \frac{(X - M)}{k} \right\rfloor = \left\lfloor \frac{(X/k) - m_2 m_3} {m_1} \right\rfloor$$

(6)

Since $m_1 m_3$ is an integer, (6) can be rewritten as:

$$\tilde{Y} = \lfloor \frac{X}{k} \rfloor - m_1 m_3 = Y - m_1 m_3$$

(7)

The residue representation of $\tilde{Y}$ when $\tilde{X} < 0$ can be computed as follows [9]:

$$\bar{y}_1 = \lfloor Y - m_1 m_3 \rfloor_{m_1} = \lfloor Y \rfloor_{m_1} - m_1 m_3 \left[ \lfloor \frac{Y}{m_1} \rfloor_{m_1} \right]_{m_1} = \left\lfloor y_1 - 0 \right\rfloor_{m_1} = y_1$$

(8)

$$\bar{y}_2 = \lfloor Y - m_1 m_3 \rfloor_{m_2} = \lfloor Y \rfloor_{m_2} - m_1 m_3 \left[ \lfloor \frac{Y}{m_2} \rfloor_{m_2} \right]_{m_2} = \left\lfloor y_2 + 0 \right\rfloor_{m_2} = y_2$$

(9)

$$\bar{y}_3 = \lfloor Y - m_1 m_3 \rfloor_{m_3} = \lfloor Y \rfloor_{m_3} - m_1 m_3 \left[ \lfloor \frac{Y}{m_3} \rfloor_{m_3} \right]_{m_3} = \left\lfloor y_3 - 0 \right\rfloor_{m_3} = y_3$$

(10)

From (8) and (10), no correction is needed for $\bar{y}_1$ and $\bar{y}_3$ when $\tilde{X} < 0$, but $y_2$ needs to be incremented by one to obtain $\bar{y}_2$ as indicated in (9). To detect $\tilde{X} < 0$ from its residues, a full sign detection circuit is required. This is what we would like to avoid here.

Since the dynamic range $M = 2^{3n} - 2^n$ of the moduli set $\{2^{n-1}, 2^n, 2^{n+1}\}$ is always even, the residue representation of $X$ can be mapped to that of $\tilde{X}$ for $X$ in the range of $[0, M/2-1]$ if $\tilde{X} \geq 0$ and $[M/2, M-1]$ if $\tilde{X} < 0$. In order to correct $Y$ in unsigned RNS to $\tilde{Y}$ in signed RNS, it is necessary to know when $\tilde{Y}$ becomes negative. The ranges of the scaled unsigned integer $Y$ for positive and negative $\tilde{X}$ are determined as follows:

When $\tilde{X} \geq 0$, $0 \leq X \leq M/2 - 1$, $0 \leq Y \leq \lfloor (M-2)/2k \rfloor$ and

$$0 \leq Y \leq 2^{2n-1} - 1$$

(11)

When $\tilde{X} < 0$, $M/2 \leq X \leq M - 1$, $\lfloor (M/2k) \rfloor \leq Y \leq \lfloor (M-1)/k \rfloor$

and

$$2^{2n-1} - 1 \leq Y \leq 2^{2n-2} - 2$$

(12)

From (11) and (12), when $Y > 2^{2n-1} - 1$, $\tilde{Y}$ is negative and when $Y < 2^{2n-1} - 1$, $\tilde{Y}$ is positive. Thus, the condition when $\tilde{Y}$ is negative can be detected by the $(2n-1)$th bit of $Y$, i.e., $\bar{y}_2$. According to (9), this bit can be added at the least significant bit (LSB) of $y_2$ to correct the sign error of the magnitude-scaled residue, i.e.,

$$\bar{y}_2 = y_2 \left( Y_{2n-1} \right)$$

By adopting [7] for magnitude scaling, no additional circuit is needed to determine $(Y_{2n-1})$ because $Y$ is available in the result of (4) before taking the modulo of $m_2$.

However, when $Y = 2^{2n-1} - 1$, $\tilde{Y}$ can be either positive or negative according to (11) and (12). Based on the definition of magnitude scaling, $Y = \lfloor X/2^n \rfloor$, the range of $X$ for $Y = 2^{2n-1} - 1$ is given by:

$$2^n (2^{2n-1} - 1) \leq X \leq 2^n (2^{2n-1} - 1) + 2^n - 1$$

(13)

From (13), there are altogether $2^n$ integers in $[2^{3n-1} - 2^n, 2^{3n-1} - 1]$. From (1), when $X \geq M/2 = 2^{3n-1} - 2^n$, $\tilde{X} < 0$. Therefore, for $Y = 2^{2n-1} - 1$, when $\tilde{X} \geq 0$ and $\tilde{Y} \geq 0$, $2^{2n-1} - 2^n \leq X \leq 2^{2n-1} - 2^n - 1$

(14)

and when $\tilde{X} < 0$ and $\tilde{Y} < 0$, $2^{2n-1} - 2^n \leq X \leq 2^{2n-1} - 1$

(15)

Since the magnitude of $X$ is not directly available at the input, it can only be determined from the residue representation. By taking the modulo $2^n$ operation on (15), the range of the residue $x_3$ is found to be $2^{2n-3} \leq x_3 \leq 2^{2n-1}$. Since $\bar{x}_3 = x_3$, we have $2^{2n-3} \leq \bar{x}_3 \leq 2^{2n-1}$. Let $(xi)$ denote the $j$-th bit of $x_i$. It is observed that the bit $(\bar{x}_3)_{2n-1}$ is always ‘1’ for $\bar{x}_3$ in the range of (15) and ‘0’ otherwise. Hence, when $Y = 2^{2n-1} - 1$, $(\bar{x}_3)_{2n-1}$ alone is sufficient to determine the sign of $\tilde{Y}$, i.e., $(\bar{x}_3)_{2n-1}$ is ‘1’ when $\tilde{Y} < 0$ and ‘0’ when $\tilde{Y} \geq 0$.

In the above discussion, we have proven that the sign of $\tilde{Y}$ can be simply detected by using the bit $(Y)_{2n-1}$ except for the case when $Y = 2^{2n-1} - 1$ where we need an extra bit $(\bar{x}_3)_{2n-1}$ to determine the sign of $\tilde{Y}$. The condition of $Y = 2^{2n-1} - 1$ can be determined by checking if $2n-1$ LSBs of $Y$ are ‘1’s and the most significant bit (MSB) of $Y$ is ‘0’. However, this
implementation results in a very high fan-in multi-level logic structure, which is very slow and costly when \( n \) is large. We will show that the fan-in of the multi-level logic gate circuit for the detection of the condition of \( Y = 2^{2n-1} - 1 \) can be reduced from \( 2n \) bits to only \( n+2 \) bits, namely the \( n \)-bit residue \( y_1 \), and two other bits, \((Y_{2n-1}) \) and \((y_{2n}) \).

When \( Y = 2^{2n-1} - 1, \) \((Y_{2n-1}) = 0 \) and
\[
y_1 = [2^{2n-1}]_2 - 1 = \left(2^n\right)\left(2^n-1\right) - 1 = 2^n - 1 \quad (16)
\]

From (16), \( y_1 \) can be computed as \( 2^{2n-1} - 1 \) when \( Y = 2^{2n-1} - 1 \). However, \( y_1 \) alone is not sufficient to determine if \( Y = 2^{2n-1} - 1 \) because there are \( 2^n+1 \) different integers of \( Y \) that have \( y_1 = 2^{2n-1} - 1 \) in their residue representation. Among these integers, \( 2^n+1 \) of them have values less than or equal to \( 2^{2n-1} - 1 \). When \( y_1 = 2^{2n-1} - 1 \) and \( Y \leq 2^{2n-1} - 1, \) \( Y \) has values of \( 2^n+1, 2^n+2, 2^n+2, \ldots, 2^n+2^{2n-1}, 2^n+2^{2n-1} - 1 \). Thus, \( y_2 = 2^{2n-1} - 1, 2^{2n-1} - 2, \ldots, 0, 2^{2n-1} - 1 \). Among these \( y_2 \) values, only when \( Y = 2^{2n-1} - 1 \) can \( y_2 \) have values greater than \( 2^{2n-1} - 1 \) (i.e., \((y_2)_{2n-1} = 1\)). In other words, \((Y_{2n-1}, y_1 \) and \((y_{2n}) \) are sufficient to determine if \( Y = 2^{2n-1} - 1 \). Fig. 1 depicts the proposed architecture of the signed integer RNS scaler. The architecture of unsigned integer RNS scaler [7] is enclosed in the dashed-line box and the correction circuit is enclosed in the solid-line box. The role of the control signal generation under correction circuit is to detect the condition when \( Y = 2^{2n-1} - 1 \) and \( Y \) is in negative range. Under this condition, \( 1 \) will be selected as the LSB of the correction factor. Otherwise, \((Y_{2n-1}) \) will be selected. The control signal generation circuit can be built using \( n+2 \) two-input \( AND \) gates with \( y_1 \), \((\bar{x}_{2n-2}) \), \((Y_{2n-1}) \), and \((y_{2n}) \) as the inputs.

From (17), \([A + B]_{2^{2n-1}}\) can be implemented by a simple mod \( 2^n \) adder provided that when \( A + B \geq 2^n \), the sum is incremented by one. For simplicity, this simplified implementation of (17) is succinctly denoted by:
\[
[A + B]_{2^{2n-1}} = [A + B + c_m]_{2^n} \quad \text{if } A + B \geq 2^n \n\]
\[
[A + B]_{2^n} \quad \text{otherwise}
\]

\[\text{Fig. 2. Proposed simplified signed integer RNS scaler.}\]

\[\text{Fig. 3(a) depicts an example of the proposed mod } 2^n \text{ adder with } c_m \text{ for } n = 4. \text{ The computation of } y_3 \text{ in (4) can be performed by using the proposed modified mod } 2^n \text{ adder with } c_m. \text{ This adder adds two } n \text{-bit operands taken from the } n \text{ LSBs of the sum and carry vectors, } A \text{ and } B, \text{ produced by the } 2n \text{-bit CSA with } EAC \text{ block shown in Fig. 2. This adder has a similar structure as a standard mod } 2^n \text{ adder with an additional prefix level. This extra level of prefix operators is enclosed in the dashed-line box of Fig. 3(a). They are used to generate the carry signals due to } c_m \text{ [14]. } c_m \text{ in this case is equal to } c_{2n-1}, \text{ which is the carry output generated from the } A + B \text{ operation. By using this adder, we can avoid the use of a large and slow mod } 2^{2n-1} \text{ adder.} \]

Besides being one of the inputs to the modified control signal generation circuit, \((Y_{2n-1}) \) is also the LSB of the correction factor to map \( y_2 \) to \( y_2' \). The mapping is done by using a simplified mod \( 2^n \) adder which has a structure shown in Fig. 3(b), where \((Y_{2n-1}) \) is added to \( y_2 \) to obtain \( y_2' \), i.e., \( y_2' = y_2 + Y_{2n-1} \). Since all bits except the LSB of the correction factor are 0, the prefix operator, denoted by \( \cdot \), is simplified to \( \oplus \) to reduce the critical path delay.

Fig. 4 shows the circuit of the modified control signal generation block which is similar to that in Fig. 1 with the addition of a \((Y_{2n-1}) \) generation block. The role of the \((Y_{2n-1}) \) generation block is to compute \((Y_{2n-1}) \) and \( c_{2n-1} \) bits based on the \( n \) MSBs of \( A \) and \( B \), and \( G_{2n-1} \) and \( P_{2n-1} \) from the modified mod \( 2^n \) adder with \( c_m \). The \( c_{2n-1} \) bit is fed as the carry input, \( c_m \) to the modified mod \( 2^n \) adder with \( c_m \) shown in Fig. 3(a).

In Fig. 1, a multiplexer is needed to select the correction factor to be added to \( y_2 \). As \((y_{2n-1}) \) is an input to the control signal generation block, addition of the correction factor to \( y_2 \) will be delayed. In Fig. 2, \((Y_{2n-1}) \) is added to \( y_2 \) before the corrected output is selected by an \( AND \) gate array. This efficient implementation is derived as follows. When \( Y = 2^{2n-1} - 1 \) and \( Y \) is in negative range,
\[ \bar{y}_2 = |y_2 + 1|_2 = 2^n - 1 + 1 = 2^n = 0 \]  
(19)

For other conditions, we just need to add \((Y)_{2n-1}\) to \(y_2\).

\[ \bar{y}_2 = |y_2 + (Y)_{2n-1}|_2 \]  
(20)

From (19) and (20), the corrected output is either 0 or \(|y_2 + (Y)_{2n-1}|_2\). Instead of using \(n\)-way multiplexer, \(n\) two-input AND gates will suffice to obtain the correct output.

In this section, our design is compared against the latest signed integer RNS scaler architectures proposed in [9] and [10] for four different dynamic ranges of 15, 18, 21 and 24 bits. The three moduli set \([2^{n-1}, 2^n, 2^n+1]\) is used in our proposed design and [9]. For a fair comparison, the moduli sets \([23, 29, 31], [53, 59, 61], [109, 113, 127]\) and \([239, 241, 251]\) with comparable dynamic ranges (DRs) of 15, 18, 21 and 24 bits and the same scaling factor are used for [10] as all its moduli set are odd integers. These moduli sets are made up of adjacent prime numbers and are chosen in favor of [10] such that their exact dynamic ranges just fall below those of \([2^{n-1}, 2^n, 2^n+1]\) for \(n = 5, 6, 7, \text{and} 8\), respectively.

Unit-gate model [15] is adopted for the hardware area and delay estimation, where a two-input monotonic gate is assumed to have one unit of area and one unit of delay, an XOR gate has two units of area and two units of delay, and an inverter has zero unit of area and delay.

Channels \(m_1\) and \(m_2\) of our proposed design have similar architectures as those proposed in [7] and their unit-gate areas and delays are excerpted from Tables III and IV of [7], respectively. Channel \(m_2\) consists of bit rewiring, 2n-bit CSA with EAC and merged sign detection and correction blocks. The bit rewiring and 2n-bit CSA with EAC blocks are built up of \(n\) OR gates and \(2n\) full adders (FAs). The unit-gate areas for \(n = 5, 6, 7\) and 8 are listed in Table I, where \(A_{\text{MCA}}, A_{\text{MC}}, A_{\text{SMC}}, A_{\text{AND}}\) denote the unit-gate areas of the modified mod 2\(^n\) adder with \(c_{\text{in}}\), modified control signal generation block, simplified mod 2\(^n\) adder and AND gate array, respectively. The unit-gate areas of all other modules are added together under \(A\) in the ‘Others’ column. All component areas are summed to obtain the overall unit-gate area of the proposed design. Channel \(m_2\) has the longest path delay among the three parallel channels. The longest path of merged sign detection and correction module goes from the modified control signal generation block to the AND gate array through the simplified mod 2\(^n\) adder. Their unit-gate delays are also listed in Table I, where \(D_{\text{MCA}}, D_{\text{SMC}}, D_{\text{AND}}\) denote the unit-gate delays of the modified control signal generation block, simplified mod 2\(^n\) adder and AND gate array, respectively. The unit-gate delays of the bit-rewiring and 2n-bit CSA with EAC blocks are consolidated into \(D\) in the ‘Others’ column of Table I.

![Fig. 3. (a) Modified mod 2\(^n\) adder with \(c_{\text{in}}\) for \(n = 4\) (b) Simplified Mod 2\(^n\) Adder for \(n = 4\).](image1)

![Fig. 4. Modified control signal generation module.](image2)

### Table I: Unit-Gate Area and Delay of Proposed Design

<table>
<thead>
<tr>
<th>(n)</th>
<th>(A_{\text{MCA}})</th>
<th>(A_{\text{MC}})</th>
<th>(A_{\text{SMC}})</th>
<th>(A_{\text{AND}})</th>
<th>(D_{\text{MCA}})</th>
<th>(D_{\text{SMC}})</th>
<th>(D_{\text{AND}})</th>
<th>(A)</th>
<th>(D)</th>
<th>Overall</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>57</td>
<td>45</td>
<td>21</td>
<td>5</td>
<td>13</td>
<td>6</td>
<td>1</td>
<td>282</td>
<td>5</td>
<td>410</td>
</tr>
<tr>
<td>6</td>
<td>72</td>
<td>51</td>
<td>25</td>
<td>6</td>
<td>13</td>
<td>7</td>
<td>1</td>
<td>336</td>
<td>5</td>
<td>400</td>
</tr>
<tr>
<td>7</td>
<td>87</td>
<td>57</td>
<td>32</td>
<td>7</td>
<td>13</td>
<td>8</td>
<td>1</td>
<td>390</td>
<td>5</td>
<td>573</td>
</tr>
<tr>
<td>8</td>
<td>105</td>
<td>66</td>
<td>39</td>
<td>8</td>
<td>13</td>
<td>8</td>
<td>1</td>
<td>444</td>
<td>5</td>
<td>662</td>
</tr>
</tbody>
</table>

The total unit-gate area and unit-gate delay of the architecture of [9] is estimated to be \(13.5n \log_2 n + 56.5n + 8 + 8 \log_2 n\) + 18 units, respectively based on the same unit-gate analysis as [15]. In [10], the arithmetic operations are implemented by look-up tables, which are usually implemented with read-only memory (ROM) modules. Based on the ROM model of [8], the number of transistors and unit-gate delay of the architecture of [10] are estimated and shown in Table II. The unit-gate areas of the proposed design and [9] are also converted to transistor counts and compared in Table II, where one unit-gate area is equivalent to four transistors. This conversion assumes that a FA is implemented with 28 transistors in static CMOS logic style. The results show that the proposed architecture consumes at least 15\% and 95\% lesser transistors than [9] and [10], respectively and is at least 35\% and 54\% faster than [9] and [10], respectively for \(DR = 15, 18, 21\) and 24 bits.

### Table II: Comparison of Number of Transistors and Unit-Gate Delay (Value in Parenthesis is the Percentage Reduction)

<table>
<thead>
<tr>
<th>(DR) (bit)</th>
<th>(\text{Number of Transistors})</th>
<th>(\text{Unit-Gate Delay})</th>
</tr>
</thead>
<tbody>
<tr>
<td>(n)</td>
<td>(\text{This [9]})</td>
<td>(\text{This [10]})</td>
</tr>
<tr>
<td>15</td>
<td>1640</td>
<td>1972 (16.84)</td>
</tr>
<tr>
<td>18</td>
<td>1960</td>
<td>2360 (16.95)</td>
</tr>
<tr>
<td>21</td>
<td>2292</td>
<td>2748 (16.59)</td>
</tr>
<tr>
<td>24</td>
<td>2648</td>
<td>3136 (15.56)</td>
</tr>
</tbody>
</table>

Each design is also specified at gate level using Verilog HDL, synthesized and technology mapped to TSMC 0.18 \(\mu\)m CMOS technology standard cell library using Synopsys Design Compiler. The designs are independently optimized for speed to obtain their minimum achievable delays. The areas and delays after logic synthesis and optimization are
shown in Table III. The results show that our proposed design is at least 21% and 89% smaller, and at least 28% and 63% faster than those of [9] and [10], respectively, for DR = 15, 18, 21 and 24 bits. The comparison with [9] suggests that more aggressive area savings have been obtained from the actual implementation of our design than the theoretical estimation based on transistor count. The delay improvement estimated by the unit-gate analysis is somewhat optimistic for the lower dynamic range, but is fairly accurate otherwise. Comparing with [10], the speed improvement obtained by the unit-gate delay model has been underestimated.

The power consumptions of all circuits are measured using Synopsys PrimeTime PX at the same clock rate and the same supply voltage of 1.62V. For each dynamic range, a common clock period is set based on the slowest design. Monte Carlo simulation method [16] is used with randomly generated inputs to obtain the average power dissipation with 99% confidence that the error is bounded below 2.5%. The total power dissipations including both dynamic and leakage powers are listed in Table III. Despite operating at a higher data rate, our design saves more than 32% and 89% of power over [9] and [10], respectively. Due to the replacement of mod $2^{n-1}$ adder with merged sign detection and correction, our augmented signed RNS scaler is even 22.7% smaller and consumes 23.1% less power on average than the simplest unsigned RNS scaler [7] synthesized under the same technology library. The ineluctable delay overhead due to the sign correction has been reduced to slightly below 30% on average.

| TABLE III. COMPARISON OF SYNTHESIZED AREAS, DELAYS AND TOTAL POWER (VALUE IN PARENTHESIS IS THE PERCENTAGE REDUCTION) |
|---|---|---|---|---|---|---|
| DR (bit) | Synthesized Area ($\mu m^2$) | Synthesized Delay (ns) | Total Power ($\mu W$) |
| | This [9] [10] | This [9] [10] | This [9] [10] |
| 15 | 9094 (21.59) | 11599 (51.59) | 87152 (89.57) | 2.00 | 2.81 (28.38) | 5.42 (63.10) | 1602 (99.56) | 2374 (99.56) | 15338 (99.56) |
| 18 | 9164 (48.06) | 17643 (96.70) | 278000 (71.49) | 1.94 | 3.02 (35.76) | 6.80 (71.49) | 1468 (99.56) | 3010 (99.56) | 24888 (99.56) |
| 21 | 10096 (51.25) | 10710 (51.25) | 649897 (98.44) | 2.09 | 3.26 (35.89) | 8.30 (74.82) | 1297 (99.56) | 3257 (99.56) | 33664 (99.56) |
| 24 | 13725 (36.61) | 21652 (99.36) | 1243865 (84.12) | 2.00 | 3.34 (40.12) | 9.10 (78.02) | 1458 (99.56) | 3173 (99.56) | 68682 (99.56) |

Based on the synthesis results, the most competitive contender [9] and our design for the moduli set [255, 256, 257] were physically placed and routed using Cadence SoC Encounter with four metal layers and the same initial core utilization ratios. The postlayout netlists were back-annotated for the same timing and power simulations at 1.62V and 3.34 ns clock period. Both designs were able to attain their respective minimum delay in Table III after the placement and routing. The physical synthesis results are summarized in Table IV. The final core utilization ratios of our design and [9] are 0.636 and 0.685, respectively. Our proposed design is 36.78% smaller and consumes 55.82% lower power, which are slightly better than the pre-layout results due to its lower interconnect complexity.

| TABLE IV. COMPARISON OF POST-LAYOUT SIMULATION RESULTS |
|---|---|---|
| Core area ($\mu m^2$) | 216050 | 33298 |
| Delay (ns) | 2.00 | 3.34 |
| Leakage power ($\mu W$) | 0.623 | 1.054 |
| Total power (mW) | 0.324 | 1.186 |
| Final core utilization ratio | 0.636 | 0.685 |

REFERENCES


In this paper, a low complexity, high-speed and low-power 2n signed integer RNS scaler for moduli set (2−1, 2−1, 2−1) has been proposed. By simplifying and merging the expensive sign detection and correction circuits, the complexity of implementing the signed integer RNS scaler has been reduced substantially. We benchmark our proposed design against two latest signed integer RNS scalers using the unit-gate analysis and logic synthesized results. The logic synthesized results show that our proposed design is on average 39.58% smaller, 35.15% faster and 48.60% more power-efficient than the most competitive contender over four different dynamic ranges. This superiority is further corroborated by the physical synthesis results based on the same TSMC 0.18μm standard cell implementation for n = 8, where our proposed design runs 40% faster, consumes 56% lower power and occupies 37% lesser silicon area.