<table>
<thead>
<tr>
<th><strong>Title</strong></th>
<th>Novel design algorithm for low complexity programmable FIR filters based on extended double base number system</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Author(s)</strong></td>
<td>Chen, Jiajia; Chang, Chip-Hong; Feng, Feng; Ding, Weiao; Ding, Jiatao</td>
</tr>
<tr>
<td><strong>Date</strong></td>
<td>2014</td>
</tr>
<tr>
<td><strong>URL</strong></td>
<td><a href="http://hdl.handle.net/10220/25027">http://hdl.handle.net/10220/25027</a></td>
</tr>
<tr>
<td><strong>Rights</strong></td>
<td>© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at: [<a href="http://dx.doi.org/10.1109/TCSI.2014.2348072">http://dx.doi.org/10.1109/TCSI.2014.2348072</a>].</td>
</tr>
</tbody>
</table>
Novel Design Algorithm for Low Complexity Programmable FIR Filters Based on Extended Double Base Number System

Jiajia Chen, Member, IEEE, Chip-Hong Chang, Senior Member, IEEE, Feng Feng, Weiao Ding and Jiatao Ding

Abstract—Coefficient multipliers are the stumbling blocks in programmable finite impulse response (FIR) digital filters. As the filter coefficients change either dynamically or periodically, the search for common subexpressions for multiplierless implementation needs to be performed over the entire gamut of integers of the desired precision, and the amount of shifts associated with each identified common subexpression needs to be memorized. The complexity of a quality search is thus beyond the existing design algorithms based on conventional binary and signed digit representations. This paper presents a new design paradigm for the programmable FIR filters by exploiting the Extended Double Base Number Systems (EDBNS). Due to its sparsity and innate abstraction of the sum of binary shifted partial products, the sharing of adders in the time-multiplexed multiple constant multiplication block of the programmable FIR filters can be maximized by a direct mapping from the quasi-minimum EDBNS. The multiplexing cost can be further reduced by merging double base terms. Logic synthesis results on more than one hundred programmable filters with filter taps ranging from 10 to 100 and coefficient word lengths of 8, 12 and 16 bits show that the average logic complexity and critical path delay of the programmable FIR filters designed by our proposed algorithm have been reduced by up to 47.81% and 14.32%, respectively over the existing design methods.

Index Terms — FIR Filter, Double Based Number System, Programmable Filter, Digital Signal Processing.

I. INTRODUCTION

FINITE Impulse Response (FIR) filters offer many advantages, such as easily attainable linear phase response, computational efficiency in multi-rate applications and desirable numerical property for finite precision and fractional arithmetic [1]-[4]. Adaptive filters, in particular, are inevitable in many important applications in communications, image processing, computer vision, data acquisition and control [5]–[9]. With any adaptive filter, there is a requirement for a programmable filter, which is a primary reason behind the increasing dominance of digital instead of analog system implementations. In applications such as multi-rate decimation [6], discrete cosine transform [7], [8], channelization [9], [10], high efficiency video coding (HEVC) [11], wide bandwidth photonic filter [12] and high-rate communication [13], the filter coefficients need to be run-time reconfigurable by the error feedback signal or adaptable to varying filtering specifications in real time. To reduce the throughput rate from $N$ clock cycles to one clock cycle, the weighted sum of the past and present input samples with changing coefficient values in an $N$-tap adaptive filter is implemented on field programmable gate array (FPGA) with $N$ multiply-and-accumulate (MAC) units [14]-[17]. In [15], MAC units are used in the FIR filter architecture for the design of discrete wavelet transforms. In [16], a dot-product unit is designed using the multiplier core on FPGA and a mechanism to reallocate the partial products for better resource utilization is proposed. As multipliers are much slower and consume substantially more area than adders, programmable filter raises the cost and latency of the system, as exemplified by the dominating complexity of coefficient computation in channel equalization [10] and the HEVC decoding time contributed by the adaptive loop filter [11].

To avoid the time consuming and area intensive multipliers, they are replaced by simpler arithmetic operators in the form of shift-and-add network based on the transposed direct form implementation. The shifted partial products of the filter coefficients need to be recalculated before they are delayed and accumulated by the structural adders. Hence, the shifters are not fixed but vary with the changing partial products. The partial products also need to be either computed on demand or pre-computed and pre-stored for selection, which results in a Time Multiplexed Multiple Constant Multiplication (TM-MCM) block [8], [18]. Design heuristics [19]–[23] have been proposed to detect and eliminate the common subexpressions in the fixed filter coefficient set to reduce the implementation complexity of the multiple constant multiplication block. Unfortunately, the search for common subexpressions in a coefficient set is itself a complex process which cannot be easily implemented in hardware. This means that the shifted partial products cannot be dynamically modified in real time for a time varying coefficient set without some form of precomputation and storage. Design algorithms for Common Subexpression Elimination (CSE) [20] help, however, in
reducing the number of partial products to be pre-computed and hence the storage requirement and implementation complexity of the shifters and adders in the TM-MCM block of the programmable filter.

CSE algorithms [5], [9], [24]-[27] were developed to design programmable FIR filters with the coefficients represented in binary or Canonical Signed Digit (CSD) [28] representations. Algorithms [5] and [9] used the traditional pattern matching techniques historically adopted for high-level synthesis [29]. Due to the unspecified time-varying coefficients of programmable filters, the computational complexity is exceedingly high. The enormous search space renders these heuristics ineffective and inefficient. To overcome this problem, the computational sharing multiplier architecture [5] arbitrarily identifies all 4-bit binary expressions of odd integers from 0001 to 1111 as common subexpressions. Similarly, all 3-bit binary expressions from 001 to 111 are generated as common subexpressions in [9] and [24], and multiplexers are used to select the subexpressions to form the partial products to be summed into the updated coefficient values. Frequency response masking is used to develop the reconfigurable multi-mode filter bank in [9] whereas Constant Shifts Method and Programmable Shifts Method are proposed in [24]. The latter also provides the flexibility of changing the word length of the filter coefficients dynamically. These methods skip the common subexpression search, but the straightforward and simplistic approach of selecting common subexpressions has an inherently high opportunity cost – many reusable partial products were omitted.

Owing to the huge search space over the complete gamut of integer values of a given precision, the intricacy of common subexpression sharings within and across multiple sets of coefficients in a TM-MCM block is beyond the capability of existing CSE algorithms and fixed-coefficient filter design methodologies even for moderate coherent word length and filter taps. As the shifters are no longer fixed, more succinct number representation than CSD and binary are explored. DBNS and multidimensional logarithmic number system (MDLNS) are considered for the design of DSP operators such as constant multiplier [30]-[32]. In [30], logarithmic and double base number representations are used to synthesize inexact fractional filter coefficients for time-invariant filters. To minimize the additions of each coefficient multiplier, multiple-radix DBNS representation is used by limiting the maximum power of the second base number in [31]. In this paper, the double base number representation is uniquely exploited to maximize the subexpression sharings for all filter coefficients of a given word length in a more general time-varying filter implementation problem that has no leverage of the fixed quantized coefficients at design time. This work is an extension of our antecedent work [33], which is the first ever attempt to harness the sparseness of the canonical DBNS representation for the complexity reduction of TM-MCM block. In this paper, we have extended the Double Base Number System (DBNS) [34] to encapsulate the binary shifts in tandem with several most frequently encountered common subexpressions. A new formulation of the common subexpression search problem as a quasi-minimized extended DBNS (EDBNS) generation problem is proposed, which has led to considerable reduction in the number of distinct partial products for a given word length of programmable coefficients. With this number system, an efficient architecture for the implementation of TM-MCM is derived as shown in Fig. 1. It consists of a power-of-b generator (POBG), N blocks of power-of-b selector (POBS) and N blocks of double base coefficient generator (DBCG), where b is the second base number in EDBNS and N is the number of taps. In our design, those unique partial product terms can be generated incrementally with one adder each. The sizes of the multiplexers in the programmable units and the lookup tables for the selector logic are further minimized by exploiting the unique properties of EDBNS in the Exponential Diophantine Equation.

Fig. 1. Transposed form FIR filter with programmable coefficients.

II. EXTENDED DOUBLE BASE NUMBER SYSTEM

The output sequence of an N-tap finite impulse response (FIR) digital filter can be computed by the following discrete time convolution.

\[ y[n] = h[n]^* \cdot x[n] = \sum_{i=0}^{N-1} h[i] \cdot x[n-i] \]  (1)

where \( x[n] \) and \( y[n] \) are the n-th time domain input and output data samples. The samples \( h[i], i = 0, 1, \ldots, N-1 \), also known as the filter coefficients, are the impulse response of the filter transfer function \( H(z) \), i.e., \( H(z) = \sum_{i=0}^{N-1} h[i] z^{-i} \).

For fixed point implementation, each coefficient is quantized into a finite precision integer. The quantized coefficient, denoted by \( c \), can be written as a sum of Power-of-Two (POT) terms as follows:

\[ c = \sum_{i=0}^{w} b_i \cdot 2^i \]  (2)

where \( b_i \in \{0, 1\} \) for binary representation and \( b_i \in \{0, 1\} \) for signed digit representation. \( w \) is the word length required by the representation to express all the integer coefficients of the filter.

The discrete convolution in (1) can be implemented using a network of hardwired shifters and adders. The number of adders required is determined by the number of nonzero POT terms in the binary or signed digit representation of all coefficients of the filter. It can be greatly reduced by sharing common POT terms (also known as common subexpressions) [20]. The search space for the design is feasible if the filter coefficients are fixed. If the coefficient values are allowed to change, such as those in an adaptive filter, then all the \( 2^w \) \( w \)-bit integers need to be represented. The number of nonzero POT terms in (2) is
binomially distributed and the frequency of occurrences of \( r \) nonzero POT terms in all integers of \( w \)-bit binary representation is given by \( C^r_w \). On average, \( \frac{1}{2^w} \sum_{r=0}^{w} r \cdot C^r_w \) nonzero POT terms are required to represent a \( w \)-bit integer in binary. For \( w = 8 \), this number is 4. The number of nonzero POT terms can be saved by 33% on average if canonic signed digit (CSD) representation [28] is used, which means that the average number of signed POT terms for an 8-bit integer can be reduced to 2.67.

The sums of two adjacent nonzero POT terms in binary (i.e., bit pattern “11”) and CSD (i.e., bit patterns “101”), and their negation corresponding to the decimal numbers 3 and -3, appear to be the most frequently occurred common subexpressions in digital filter design. By recursively decomposing \( c \) in (2) into the sum of \( (2^{2m} + 2^n) = 3 \times 2^\alpha \) and a smaller integer of \( c - 3 \times 2^\alpha \), and factoring out 3 in each iteration, a sparse Double Base Number System (DBNS) can be obtained. This DBNS is defined in [35] for any integer \( c \) as:

\[
c = \sum_{\alpha, \beta} d_{\alpha, \beta} 2^\alpha 3^\beta
\]  

where \( d_{\alpha, \beta} \in \{0, 1\} \), and \( \alpha \) and \( \beta \) are the exponents of 2 and 3 of the double base (or 2-integer) term, respectively.

For any integer \( c \), the DBNS with the least number of nonzero double base terms is called the canonical DBNS [35]. It should be noted that unlike CSD, the integer representation in canonical DBNS is not unique. To avoid the ambiguity of the definition of canonicity, we will refer to canonic DBNS as the minimum DBNS. Due to the sparsity of the minimum DBNS, any integer within a given range can be generated with fewer additions of double base terms (i.e., \( 3^\beta \) shifted by \( \alpha \) for \( d_{\alpha, \beta} \neq 0 \)) and it will be shown in Section IV that each distinct double base term can be generated using only one adder.

The subexpression “101” of two adjacent nonzero POT terms in binary and CSD representations and its negation corresponding to the decimal integers 5 and -5 are also frequently encountered. To incorporate this additional common subexpression, we propose to extend the DBNS to include the nextprime 5 as a choice for the second base. To differentiate it from the accustomed form of DBNS defined in [34] with 2 and 3 as the only base numbers, we called this form of DBNS the Extended Double Based Number System (EDBNS). Its representation for any positive integer \( c \) is defined as:

\[
c = \sum_{i=1}^{T} 2^\alpha_i b^\beta_i
\]  

where \( b \in \{3, 5\} \), \( \alpha \) and \( \beta \) are respectively the non-negative exponents of 2 and \( b \), of the \( t \)-th nonzero double base term, and \( T \) is the total number of nonzero double base terms. Negative coefficient can be expressed in sign-magnitude form with its magnitude expressed in this EDBNS and its sign used to configure its structural adder as adder or subtractor to achieve the same effect as using signed EDBNS with configurable adder/subtractor.

It can be shown by exhaustive enumeration that every positive integer in the range \((0, 256)\) can be expressed in EDBNS with three or less double base terms. The numbers of occurrences of EDBNS representations with one, two and three double-base terms over the entire range are 26, 147 and 82, respectively. On average, only 2.21 double base terms are required to represent any integer in the range \([0, 255]\). This is 44.75% and 17.23% lower than the binary and CSD representations, respectively.

Similar to the definition of minimum DBNS, the minimum EDBNS representation of an integer \( c \) is an EDBNS representation of \( c \) with the minimum number of nonzero double base terms \( T_{\text{min}} \). The minimum EDBNS can be constructed as an abstraction of the maximum shareings of two adjacent POT terms in binary and CSD representations over the range of \( w \)-bit integers representable by the EDBNS.

### III. GENERATION OF QUASI-MINIMUM EDBNS

A negative coefficient can be replaced by a positive coefficient by subtracting the positive coefficient instead of adding the negative coefficient at the structural adder block. Let \( w \) be the word length of the unsigned binary representation of the largest coefficient magnitude of a programmable FIR filter. The CSE problem for the TM-MCM block can then be recast into the problem of searching for a minimum number of distinct power-of-\( b \) integers, i.e., \( b^\beta \), where \( b \in \{3, 5\} \) and \( \beta \) is a positive integer, to generate the minimum or near minimum EDBNS representations for all the integers in the range \((0, 2^w)\).

The highest number of double base terms \( T_{\text{max}} \) for a valid EDBNS representation of any \( w \)-bit integer \( c \) is \( T_{\text{max}} = c \), which is the case when the powers of both bases of all the terms are zero. If only the power of the second base \( b \) is always 0, then (4) reduces to (2), which gives a tighter upper bound of \( T_{\text{max}} \). If there is no restriction imposed in the number of terms \( T \), there are many possible ways to express a \( w \)-bit positive integer in DBNS [31] or EDBNS. For example, it is reported in [34] that the integer 127 has a total of 783 different DBNS representations. The minimum EDBNS representation of each integer in the range \((0, 2^w)\) has different number of terms \( T \). Let \( T_{\text{min}}(w) \) be the maximum number of double base terms among the minimum EDBNS representations of all the \( w \)-bit integers. To minimize the number of adders required to add up the double base terms, \( T = T_{\text{min}}(w) \).

Let \( F_{\text{min}}(T, w) \) be the smallest set of positive power-of-\( b \) (\( b \in \{3, 5\} \)) integers that appear in any terms of the EDBNS representations of \( w \)-bit integers with \( T \) or less double base terms. To obtain \( F_{\text{min}} \) all possible power-of-\( b \) sets that can be used to represent all the \( w \)-bit integers in EDBNS with \( T \) or less terms are sought. Among them, the set that has the minimum number of power-of-\( b \) integers is selected as \( F_{\text{min}} \). If there are two or more sets that have the same minimum number of power-of-\( b \) integers, the one with more power-of-3 integers is selected as \( F_{\text{min}} \) to avoid complicating the search and comparison. Table I shows \( T_{\text{min}}(w) \) and \( F_{\text{min}}(T_{\text{min}}(w), w) \) obtained by this method for \( w = 8, 12 \) and 16.

<table>
<thead>
<tr>
<th>( w )</th>
<th>8</th>
<th>12</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>( T_{\text{min}} )</td>
<td>3</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>( F_{\text{min}} ) ( b = 3 )</td>
<td>3, 9, 27</td>
<td>3, 9, 27, 81</td>
<td>3, 9, 27, 81, 1243, 729, 2187, 6561</td>
</tr>
<tr>
<td>( F_{\text{min}} ) ( b = 5 )</td>
<td>5</td>
<td>5</td>
<td>5, 25, 125</td>
</tr>
</tbody>
</table>
Obviously, with $T = T_{\text{min}}(w)$, the number of adders required to add up the double base terms of all $w$-bit coefficients can be minimized by the minimum EDBNS. However, this may also result in a larger set of distinct power-of-2 integers to guarantee that the minimum EDBNS exists for all $w$-bit integers. A good trade-off is made by the following proposition.

**Proposition 1:** The minimum EDBNS representations of all $w$-bit integers are first generated to obtain $F_{\text{min}}(T_{\text{min}}, w)$. Then, $T$ is incremented to $T_{\text{min}}(w) + 1$ and the quasi-minimum EDBNS representations of $T$ terms are generated. If $|F_{\text{min}}(T_{\text{min}}, w)| < |F_{\text{min}}(T, w)|$, where $|F|$ denotes the cardinality of $F$, then $T$ is further incremented and the process continues until $|F_{\text{min}}(T-1, w)| = |F_{\text{min}}(T, w)|$.

The cardinality of $F_{\text{min}}$ may decrease when $T$ increases above $T_{\text{min}}$. As the cardinality of $F_{\text{min}}$ stops to shrink with increasing $T$, the reduction in the quantity and bit width of adders required to realize the distinct power-of-2 terms in $F_{\text{min}}$ ceases. Further increment of $T$ will only increase the redundancies of one or more EDBNS representations for all $w$-bit integers. Hence, the search for EDBNS representations can stop as it will not lead to more efficient hardware implementation. The pseudo code for the EDBNS search algorithm is presented in Fig. 2.

**EDBNS**(w) {  
  $C$ = set of all $w$-bit coefficients;  
  $\exp_{3_{\text{max}}} = \lceil w/\log_3 3 \rceil$ // max exponent of power-of-3 integer $< 2^w$  
  $\exp_{5_{\text{max}}} = \lceil w/\log_5 5 \rceil$ // max exponent of power-of-5 integer $< 2^w$  
  $T = T_{\text{min}}(w)$; // min. no. double base terms for $w$-bit integers  
  $F_{\text{current}} = F_{\text{min}}(T, w)$; // current min. set of power-of-2 integers  
  $F_{\text{next}} = \varnothing$; // next min. set of power-of-2 integers  
  $S = \varnothing$; // initialize set of EDBNS representations found  
  $\text{EDBNS} \_\text{array} = \varnothing$; // Initialize storage for EDBNS representations of $C$  
  $P = \text{insert}(P, 0, 0, 1)$; // Initial entry for storage of double base terms  
  while ($F_{\text{next}} \neq F_{\text{current}}$) {  
    for ($i$ from 1 to $w-1$) {  
      for ($j$ from 1 to $\exp_{3_{\text{max}}}$) {  
        while ($2^i < 2^j$) $P = \text{insert}(P, i, j, 2^3 j)$;  
        while ($2^i < 2^j$) $P = \text{insert}(P, i, j, 2^5 j)$;  
      }  
      while (C is not empty) {  
        for ($i$ from 1 to $T$) {  
          $S = \text{Gen} \_\text{EDBNS}(P, i)$;  
          $I = \text{EDBNS} \_\text{Int}(S) \cap C$;  
          if ($I$ is not empty) {  
            $\text{EDBNS} \_\text{array} = \text{insert}(\text{EDBNS} \_\text{array}, I)$;  
            $C = C - I$;  
          }  
        }  
        $F_{\text{current}} = F_{\text{next}}$;  
        $T = T+1$;  
        $F_{\text{next}} = F_{\text{min}}(T, w)$;  
      }  
    }  
  }  
  return $\text{EDBNS} \_\text{array}$;  
}  

**Fig. 2.** Proposed search algorithm for quasi-minimum EDBNS.

The function $\text{EDBNS}(w)$ returns an array of EDBNS representations for all the $w$-bit integers in $C$. The function $\text{Tmin}(w)$ returns $T_{\text{min}}(w)$ for $w$-bit integers. The function $\text{Fmin}(T, w)$ returns the smallest set of power-of-2 integers that can be used to represent all $w$-bit integers with $T$ or less terms in EDBNS. The function $\text{insert}(P, i, j, t)$ appends the exponents, $i$ and $j$, of the two bases and the resultant double base term $t$ into the array $P$. The function $\text{Gen} \_\text{EDBNS}(P, i)$ generates EDBNS representations with $i$ unique double base terms from $P$. If each of the $i$ double base terms is obtained in any order, there will be $P! = i!$ permutations of $i$ terms from the same EDBNS representation of an integer. Replicated EDBNS representations due to the permutation of the double base terms are avoided by controlling the loop indices in $\text{Gen} \_\text{EDBNS}$. The function $\text{EDBNS} \_\text{Int}(S)$ converts the EDBNS representations in the array $S$ into a set of integers. The function $\text{insert}(\text{EDBNS} \_\text{array}, I)$ adds the subset $I$ of coefficients expressed in EDBNS into the array $\text{EDBNS} \_\text{array}$.

This search algorithm reduces the search complexity by seeking only the EDBNS representations in the reduced space from $T_{\text{min}}$ to a value of $T$ that satisfies the criterion of $|F_{\text{min}}(T-1, w)| = |F_{\text{min}}(T, w)|$. Efficient EDBNS representations with bigger $T$ but smaller $F_{\text{min}}$ will never be missed, which guarantees that the few most efficient EDBNS representations are generated out of many redundant ones without an exhaustive search for all possible combinations. The EDBNS representations sought by our algorithm represent a very small subset of all the EDBNS representations of $w$-bit integers. For example, for $w = 8$, only two sets of EDBNS representations with $T = 3$ and $4$ double base terms are sought and generated, from which the set with the more succinct representations is selected. Table II illustrates the frequencies of occurrences of power-of-2 integers in 8-bit, 12-bit and 16-bit integers respectively, when $T = T_{\text{min}}$. The percentage of occurrences of each power-of-2 integer $b^\beta$ over all power-of-2 integers of EDBNS is also listed.

<table>
<thead>
<tr>
<th>$w$</th>
<th>$b$</th>
<th>$\beta = 1$</th>
<th>$\beta = 2$</th>
<th>$\beta = 3$</th>
<th>$\beta = 4$</th>
<th>$\beta = 5$</th>
<th>$\beta = 6$</th>
<th>$\beta = 7$</th>
<th>$\beta = 8$</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>3</td>
<td>377</td>
<td>79</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>5</td>
<td>107</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>12</td>
<td>7338</td>
<td>1696</td>
<td>1893</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>16</td>
<td>104644</td>
<td>17606</td>
<td>12744</td>
<td>12746</td>
<td>12583</td>
<td>10768</td>
<td>10902</td>
<td>10790</td>
</tr>
<tr>
<td>12</td>
<td>3</td>
<td>5067</td>
<td>11711</td>
<td>13057</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td></td>
<td>5</td>
<td>1825</td>
<td>1729</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>16</td>
<td>3</td>
<td>42.83%</td>
<td>7.21%</td>
<td>7.06%</td>
<td>5.22%</td>
<td>5.15%</td>
<td>4.41%</td>
<td>4.46%</td>
<td>4.42%</td>
</tr>
<tr>
<td></td>
<td>5</td>
<td>6.22%</td>
<td>6.86%</td>
<td>6.17%</td>
<td>0%</td>
<td>0%</td>
<td>0%</td>
<td>0%</td>
<td>0%</td>
</tr>
</tbody>
</table>

IV. PROPOSED PROGRAMMABLE FIR FILTER DESIGN ALGORITHM BY EDBNS

The proposed programmable filter architecture is shown in Fig. 1. The structural adder block is fixed for a given number of taps $N$. The POBG block uses the input sample $x[n]$ to produce the set of unique power-of-2 integers obtained from the quasi-minimum EDBNS search algorithm presented in Section III for all $w$-bit coefficients. The set of coefficients $h[i], i = 0, 1, \ldots, N-1$ to be convoluted with the input sample $x[n]$ establishes the control logic of the $N$ POBS blocks to select the desired power-of-2 terms from the POBG block. Each POBS block consists of $T$ multiplexers and each multiplexer produces either a value of 0, 1 or a power-of-2 integer multiple of the input $x[n]$. In each successive DBCG block, the selected $b^\beta$ terms are shifted by $\alpha$ bit positions according to the
quasi-minimum EDBNS representation of \( h[i] \). These \( T \) or fewer \( 2^\alpha h_i^\beta \) terms are summed in parallel to produce a product of \( h[i] \) and \( x[n] \). The outputs of the \( N \) DBCG blocks are then delayed and accumulated in the final structural adder block to produce the transpose equivalent of (1), which is given by:

\[
y[n] = h[0]x[n] + z^{-1} \left( h[1]x[n] + z^{-1} (h[2]x[n] + \ldots) \right)
\]  

(5)

A. Optimization of POBG block

Only one POBG block is needed as the set \( F_{\text{min}}(w) \) of power-of-\( b \) integers, \( b^\alpha, b \in \{3, 5\} \), generated by the algorithm in Fig. 2 can be reused by all the \( N \) taps.

The power-of-\( b \) integers in \( F_{\text{min}} \) are generated in ascending order of their exponents \( \beta \) in the EDBNS search algorithm of Fig. 2. Since \( 3 = 2 + 1 \), the multiplication of an input variable \( x \) by 3 can be calculated by \( 3x = x \ll 1 + x \) using only one adder. Similarly, the product of \( x \) and 5 can be calculated by \( 5x = (2^2 + 1)x = x \ll 2 + x \). In general, the product of an input variable \( x \) and every element \( b^\beta \) of \( F_{\text{min}} \) can be generated by the following recursions for \( \beta > 1 \):

\[
3^\beta x = 3^{\beta-1} (2^1 + 1)x = 3^{\beta-1} x \ll 1 + 3^{\beta-1} x
\]  

(6)

\[
5^\beta x = 5^{\beta-1} (2^1 + 1)x = 5^{\beta-1} x \ll 2 + 5^{\beta-1} x
\]  

(7)

Using the above recursion, each successive power-of-\( b \) integer multiplier needs only one adder to implement. As the shift amount is fixed, it can be hardwired without incurring additional logic. Figs. 3(a) and (b) show the implementations of the POBG for \( F_{\text{min}} = \{3, 9, 27, 81\} \) and \( F_{\text{min}} = \{5, 25, 125, 625\} \), respectively. Since all the elements in \( F_{\text{min}} \) are odd integers, they are not divisible by any power-of-two integers and cannot be generated from any other elements directly without requiring at least one adder. Therefore, the adder cost for this method of implementation is the optimum.

\[
\begin{align*}
\text{(a)} & \quad x & \quad x \\
\text{(b)} & \quad x & \quad x \\
\text{(c)} & \quad x & \quad x
\end{align*}
\]

Fig. 3. POBG block implementation (a) \( F_{\text{min}} = \{3, 9, 27, 81\} \) without logic depth optimization (b) \( F_{\text{min}} = \{5, 25, 125, 625\} \) without logic depth optimization (c) \( F_{\text{min}} = \{3, 9, 27, 81\} \) with logic depth optimization

From Table II, it is evident that the number of elements for \( b = 3 \) outweighs that for \( b = 5 \) in \( F_{\text{min}} \). Thus the critical path of POBG block is dominated by the power-of-three integer multipliers. Without incurring additional adder and logic, the logic depth of the POBG can be further reduced for \( b = 3 \) by generating \( 3^k x \) by \( x \ll 3 + x \) in parallel with \( 3x \) in the first adder step. Subsequent products of \( x \) and the odd and even power-of-three multipliers \( 3^\beta \) for \( \beta > 2 \) can also be generated in pair concurrently from their respective preceding pair of odd and even power-of-three multipliers in each adder step by

\[
3^\beta x = 3^{\beta-1} x \ll 3 + 3^{\beta-1} x
\]  

(8)

Fig. 3 (c) shows the reduced logic depth minimum adder cost implementation for \( F_{\text{min}} = \{3, 9, 27, 81\} \). This method of implementation reduces the logic depth of the POBG for \( F_{\text{min}} \) of \( w = 16 \) in Table II by half from 8 to 4.

B. Minimization of POBS by EDBNS reduction properties and Exponential Diophantine Equation (EDE)

Each of the \( N \) POBS blocks consists of a bank of \( T \) multiplexers with inputs feeding from the POBG. The products of every power-of-\( b \) integers and the input signal \( x \) from the POBG are routed to the inputs of these multiplexers according to their frequencies of occurrences in the EDBNS representation of the coefficient. The same partial product may appear in the inputs of several multiplexers of the same POBS block if it occurs more than once in the EDBNS representation of the same coefficient. The implementation cost of a POBS block is determined by the number of multiplexers and the complexity of each multiplexer. The complexity of a multiplexer is approximately proportional to its number of inputs and the bit width of the inputs. Therefore, one way to reduce the hardware cost of a POBS block is to minimize the total number of input lines to the \( T \) multiplexers. Although \( T \) has been constrained as a whole in the generation of the EDBNS representations for all \( w \)-bit coefficients, the EDBNS representations of some coefficients may have fewer than \( T \) terms. Therefore, some elements in \( F_{\text{min}} \) do not have to be connected to the inputs of all \( T \) multiplexers. Further reduction in the total number of inputs to the multiplexers of a POBS block can be achieved with the help of the following two reduction rules [35] for DBNS. The second property is also valid for \( b = 5 \) of EDBNS.

Property 1: Sum of two consecutive double base terms with the same \( \alpha \).

\[
2^\alpha 3^\beta + 2^\alpha 3^{\beta+1} = 2^{\alpha+1} 3^\beta
\]

(9)

Property 2: Sum of two double base terms with the same \( \beta \).

\[
2^\alpha 3^\beta + 2^\alpha 3^{\beta+1} = 2^\alpha 3^{\beta+1}
\]

(10)

\[
2^\alpha 5^\beta + 2^\alpha 5^{\beta+1} = 2^\alpha 5^{\beta+1}
\]

(11)

Obviously, every application of Property 1 reduces the number of double base terms by one, and the same situation happens for Property 2 provided that the merged higher order term \( b^{\beta+1} \) exists in \( F_{\text{min}} \). These two properties can be applied recursively until the number of double base terms in an EDBNS cannot be reduced further. This process can be generalized and modeled as an Exponential Diophantine Equation (EDE) [36].

\[
2^{\alpha_1} 3^{\beta_1} + 2^{\alpha_2} 3^{\beta_2} + \ldots + 2^{\alpha_k} 3^{\beta_k} + 2^{\alpha_k} 5^{\beta_k} + \ldots + 2^{\alpha_k} 5^{\beta_k} = 2^{\alpha_1} 3^{\beta_1} + 2^{\alpha_2} 3^{\beta_2} + \ldots + 2^{\alpha_k} 5^{\beta_k} + \ldots + 2^{\alpha_k} 5^{\beta_k}
\]

(12)

where \( l + l' = k + k' \).

If (12) can be solved without introducing any new element into \( F_{\text{min}} \), the existing EDBNS for the coefficient, which is the left hand side (LHS) of (12) will be replaced by the right hand side (RHS) of (12) with less number of terms. With the aid of Properties 1 and 2 and the EDE solutions proved in Section 3 of [35], we can match the solutions of (12) against the EDBNS representations for all \( w \)-bit coefficients. Once a solution is found, the EDBNS will be replaced by the solution with a reduced number of terms. Eventually, some coefficients are
represented with a reduced number of terms even though $T$ for the overall $w$-bit coefficient sets remains unchanged. The unused input lines to the multiplexers of the corresponding POBS block can then be removed. The reduction process is summarized by the pseudo code in Fig. 4.

The shifters in DBCG block shift the $T$ outputs from POBS to generate the double base terms $2^t b^h$ for $t = 1, 2, \ldots, T$, and the subsequent adder tree sums up the $T$ terms to produce the coefficient multiplier output $h(x)[n]$ of each filter tap. Given $T$ and $F_{\min}$, the EDBNS representations of all $w$-bit coefficients that result in the minimum number of different $a_i$ for each of the $T$ POBS outputs are selected. Thus, each shifter needs only to realize $k$ different amounts of shift, where $k$ is determined by the minimum number of distinct exponents $a_i$ that can appear in the $t$-th double base term. Each POBS output is hardwire-shifted by $k$ different numbers of bits and fed into the $k$ data inputs of a multiplexer. One of these $k$ shifted versions of the POBS output will be selected from each multiplexer to compose the EDBNS representation of the programmed coefficient.

The control signals to the multiplexers and the programmable shifters are generated by an external look up table (LUT). Because the EDBNS representation of any even integers can be obtained by a double base scalar multiplication of a $2^a$ factor and an odd integer, i.e., $c = 2^a \times c'$ where $c$ is an even number and $c'$ is known as a fundamental [1], the control information of the even integers are stored together with their fundamentals in the LUT, plus a $2^a$ factor stored for each even number. This will reduce the LUT size by almost half. The complete design procedure is summarized as follows:

**Step 1:** Compute $T_{\min}$ and $F_{\min}$ for all $w$-bit coefficients. Generate the EDBNS array for all the $w$-bit coefficients using the search algorithm presented in Section III.

**Step 2:** Implement the POBG block by producing all power-of-$b$ integers in $F_{\min}$ using the method described in Section IV.A.

**Step 3:** Design the POBS block with $T$ multiplexers. Each power-of-$b$ integer from the POBG block is first connected to an input of $k$ different multiplexers, where $k$ is the maximum number of times that power-of-$b$ value can appear in the EDBNS representation of any coefficient. Then, minimize the number of input lines to the multiplexers of POBS block by the algorithm presented in Fig. 4.

**Step 4:** Design $T$ programmable shifters for the DBCG block. Extract the amount of shifts $a_i$ for each power-of-$b$ integer and store it in the LUT addressable by the fundamentals.

**Step 5:** Sum the $T$ double base terms in DBCG by a carry save adder (CSA) tree to reduce the delay.

**Design Example:** Consider the design of a programmable FIR filter with 8-bit coefficients. According to Table I, $T_{\min} = 3$. By fixing $T = T_{\min} = 3$, $F_{\min}$ is chosen to be $\{3, 9, 27, 5\}$. 136 out of the 255 integers can be implemented using two terms only. Another 26 integers can be implemented using only one term. Only 93 integers have to be expressed using three terms and cannot be further reduced. Thus, the input of each multiplexer in the POBS block is reduced from six to four compared to previous works [9] [24] where all factors from the precomputer block are connected to the subsequent multiplexers as inputs.
Using the EDBNS with $F_{min} = \{3, 9, 27, 5\}$ and $T = 3$ as an example, the POBG block is designed as shown in Fig. 5, where 3x, 5x and 9x are produced using only one adder each while 27x is generated from 3x by 3x $\ll 3 + 3x$ with one additional adder. The number of adders is 4 and the logic depth is only 2. Since $T = 3$, only three multiplexers are needed in each POBS block. To reduce the input lines to the multiplexers, the EDEs for the EDBNS of all 8-bit coefficients are solved using the algorithm presented in Fig. 4. The number of double base terms is reduced for the EDBNS representation of each coefficient that has an EDE solution. For example, 147 is initially expressed as $2^73^0 + 2^33^0 + 2^03^1$. The EDE $2^73^0 + 2^43^0 + 2^33^1 = 2^63^3 + 2^03^6$ has a solution of $a_x = 0, a_5 = 1$ and $b_x = 2$. Hence, the EDBNS representation for 147 is reduced to $2^63^3 + 2^03^6$ and one term has been saved. Upon minimizing the members of the EDBNS array, each multiplexer of the POBS block needs only four instead of six data inputs, including 0 (for 0 valued coefficient) and $x$ (for $b_xx$). Finally, each multiplexer output is connected to a programmable shifter in DBCG to generate a double base term, and the three double base terms are added up by an adder tree of two adders. Together with the tap delay registers and configurable structural adder/subtractors, the final design of this programmable filter is shown in Fig. 5.

![Fig. 5. 8-bit programmable FIR filter designed by proposed algorithm.](image)

V. SYNTHESIS RESULTS AND DISCUSSION

The proposed EDBNS generation as well as the POSG and POBS block optimization algorithms are implemented in Matlab. More than one hundred programmable filters with the number of filter taps ranging from 10 to 100 in step size of 10 and coefficient word lengths of 8, 12 and 16 bits are designed by our proposed method and two existing computation sharing programmable FIR filter design methods, [5] and [9]. The LUT and control logic circuit are independent of the filter taps and represent a small fraction of the total cost. They are not detailed in [5] and [9] and therefore cannot be included in the circuits for comparison. An input word length of 8 bits is assumed as this is a commonly used ADC resolution. All the programmable filter architectures are described in VHDL codes and synthesized by Mentor Graphics LeonardoSpectrum using 0.18$\mu$m standard cell library. As the number of structural adders and the registers in the tap-delay line are identical in all three design methods, only the synthesis results of the TM-MCM blocks of these programmable filters are compared.

Based on Proposition 1, more than one set of quasi-minimum EDBNS representations with different $T$ and $F_{min}$ are generated by our search algorithm for a given coefficient word length. Table III shows the synthesized areas in equivalent gate counts of the TM-MCM designs obtained by our proposed EDBNS based algorithm in Fig. 2 with different values of $T$ for 8-bit, 12-bit and 16-bit coefficients.

<table>
<thead>
<tr>
<th>$N$</th>
<th>$T = 3$</th>
<th>$T = 4$</th>
<th>$T = 5$</th>
<th>$T = 6$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$w = 8$</td>
<td>$w = 12$</td>
<td>$w = 16$</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>406</td>
<td>568</td>
<td>841</td>
<td>576</td>
</tr>
<tr>
<td>10</td>
<td>2526</td>
<td>3556</td>
<td>3785</td>
<td>5418</td>
</tr>
<tr>
<td>20</td>
<td>4861</td>
<td>6833</td>
<td>7423</td>
<td>10435</td>
</tr>
<tr>
<td>30</td>
<td>7176</td>
<td>10083</td>
<td>11056</td>
<td>23939</td>
</tr>
<tr>
<td>40</td>
<td>9501</td>
<td>13358</td>
<td>14708</td>
<td>31687</td>
</tr>
<tr>
<td>50</td>
<td>11850</td>
<td>16642</td>
<td>18348</td>
<td>39407</td>
</tr>
<tr>
<td>60</td>
<td>14157</td>
<td>19900</td>
<td>21973</td>
<td>47119</td>
</tr>
<tr>
<td>70</td>
<td>16482</td>
<td>23160</td>
<td>25610</td>
<td>54818</td>
</tr>
<tr>
<td>80</td>
<td>18831</td>
<td>26459</td>
<td>29253</td>
<td>62602</td>
</tr>
<tr>
<td>90</td>
<td>21139</td>
<td>29710</td>
<td>32876</td>
<td>70304</td>
</tr>
<tr>
<td>100</td>
<td>23483</td>
<td>32975</td>
<td>36527</td>
<td>78014</td>
</tr>
</tbody>
</table>

On average, the gate count for $T = 4$ and $F_{min} = \{3\}$ is about the same as that for $T = 3$ and $F_{min} = \{3, 9, 27, 5\}$ for programmable filters with 8-bit coefficients. For 12-bit coefficients, the designs implemented by EDBNS with $T = 5$ and $F_{min} = \{3, 9, 27\}$ has lower gate count when the number of filter taps is smaller than 10, whereas the designs implemented by the EDBNS with $T = 4$ and $F_{min} = \{3, 9, 27\}$ is more hardware efficient when the number of taps increased above 10. For 16-bit coefficients, the designs implemented by the EDBNS with $T = 6$ and $F_{min} = \{3, 9, 27, 81, 5\}$ is more hardware efficient when the number of taps increased above 10. The results indicate that the most efficient TM-MCM implementation is neither dictated by $T$ nor $F_{min}$ alone. The complexity of the POBG block is governed by the number of power-of-$b$ integers, which is the cardinality of $F_{min}$. On the other hand, the multiplexer cost of each POBS block is affected by the word length of the power-of-b integers and the number of double base terms $T$. Since only one POBG block is required irrespective of the number of filter taps $N$ as opposed to one POBS block for each tap, the total hardware cost will be dominated by the efficient implementation of POBS block more than the efficient implementation of POBG block as $N$ grows. This is observed in the case of 12-bit coefficients when $N > 10$. However, it should be highlighted that a larger $T$ does not necessarily incurs higher multiplexer cost in the POBS block. This is because the complexity of the POBS block is also dependent on the size and number of power-of-b integers. Those designs with larger $T$ but more efficient POBS blocks have consistently lower overall complexity regardless of $N$, as evinced by the 16-bit coefficient designs with $T = 6$.

Using the best design for each filter from Table III, we compare the gate counts of the programmable filters designed by our method against those designed by [5] and [9] in Tables
IV to VI for 8-bit, 12-bit and 16-bit coefficients, respectively. The percentage area reduction of our methods over [5] and [9] are calculated in the last two columns labeled ‘ΔA [5]’ and ‘ΔA [9]’, respectively. The equivalent gate counts against the number of filter taps \(N\) are also plotted in Figs. 6 to 8 to show the trend of improvements as \(N\) increases.

TABLE IV. Comparison of gate counts of 8-bit coefficient programmable filters of 10 to 100 taps designed by [5], [9] and the proposed method.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>2814</td>
<td>3799</td>
<td>2402</td>
<td>14.64%</td>
<td>35.24%</td>
</tr>
<tr>
<td>20</td>
<td>5238</td>
<td>7240</td>
<td>4774</td>
<td>8.86%</td>
<td>34.06%</td>
</tr>
<tr>
<td>30</td>
<td>7694</td>
<td>10798</td>
<td>7129</td>
<td>7.34%</td>
<td>33.98%</td>
</tr>
<tr>
<td>40</td>
<td>10178</td>
<td>14358</td>
<td>9490</td>
<td>6.76%</td>
<td>33.90%</td>
</tr>
<tr>
<td>50</td>
<td>12600</td>
<td>17889</td>
<td>11853</td>
<td>5.93%</td>
<td>33.74%</td>
</tr>
<tr>
<td>60</td>
<td>15062</td>
<td>21461</td>
<td>14208</td>
<td>5.67%</td>
<td>33.80%</td>
</tr>
<tr>
<td>70</td>
<td>17504</td>
<td>25006</td>
<td>16573</td>
<td>5.32%</td>
<td>33.72%</td>
</tr>
<tr>
<td>80</td>
<td>19967</td>
<td>28558</td>
<td>18932</td>
<td>5.18%</td>
<td>33.71%</td>
</tr>
<tr>
<td>90</td>
<td>22403</td>
<td>32103</td>
<td>21280</td>
<td>5.01%</td>
<td>33.71%</td>
</tr>
<tr>
<td>100</td>
<td>24869</td>
<td>35669</td>
<td>23653</td>
<td>4.89%</td>
<td>33.69%</td>
</tr>
</tbody>
</table>

TABLE V. Comparison of gate counts of 12-bit coefficient programmable filters of 10 to 100 taps designed by [5], [9] and the proposed method.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>4902</td>
<td>6003</td>
<td>3556</td>
<td>27.46%</td>
<td>40.76%</td>
</tr>
<tr>
<td>20</td>
<td>9473</td>
<td>11872</td>
<td>6833</td>
<td>27.87%</td>
<td>42.44%</td>
</tr>
<tr>
<td>30</td>
<td>14044</td>
<td>17793</td>
<td>10083</td>
<td>28.20%</td>
<td>43.09%</td>
</tr>
<tr>
<td>40</td>
<td>18615</td>
<td>23593</td>
<td>13358</td>
<td>28.24%</td>
<td>43.38%</td>
</tr>
<tr>
<td>50</td>
<td>23185</td>
<td>29447</td>
<td>16642</td>
<td>28.22%</td>
<td>43.48%</td>
</tr>
<tr>
<td>60</td>
<td>27724</td>
<td>35301</td>
<td>19900</td>
<td>28.22%</td>
<td>43.63%</td>
</tr>
<tr>
<td>70</td>
<td>32276</td>
<td>41183</td>
<td>23160</td>
<td>28.24%</td>
<td>43.76%</td>
</tr>
<tr>
<td>80</td>
<td>36850</td>
<td>47058</td>
<td>26459</td>
<td>28.20%</td>
<td>43.77%</td>
</tr>
<tr>
<td>90</td>
<td>41520</td>
<td>52912</td>
<td>29710</td>
<td>28.44%</td>
<td>43.85%</td>
</tr>
<tr>
<td>100</td>
<td>46091</td>
<td>58766</td>
<td>32975</td>
<td>28.46%</td>
<td>43.89%</td>
</tr>
</tbody>
</table>

TABLE VI. Comparison of gate counts of 16-bit coefficient programmable filters of 10 to 100 taps designed by [5], [9] and the proposed method.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>6714</td>
<td>8755</td>
<td>4640</td>
<td>30.89%</td>
<td>47.00%</td>
</tr>
<tr>
<td>20</td>
<td>13132</td>
<td>17345</td>
<td>9112</td>
<td>30.61%</td>
<td>47.47%</td>
</tr>
<tr>
<td>30</td>
<td>19487</td>
<td>25963</td>
<td>13596</td>
<td>30.23%</td>
<td>47.63%</td>
</tr>
<tr>
<td>40</td>
<td>25861</td>
<td>34561</td>
<td>18095</td>
<td>30.03%</td>
<td>47.64%</td>
</tr>
<tr>
<td>50</td>
<td>32238</td>
<td>43159</td>
<td>22569</td>
<td>29.99%</td>
<td>47.71%</td>
</tr>
<tr>
<td>60</td>
<td>38593</td>
<td>51785</td>
<td>27055</td>
<td>29.90%</td>
<td>47.76%</td>
</tr>
<tr>
<td>70</td>
<td>45068</td>
<td>60390</td>
<td>31533</td>
<td>30.03%</td>
<td>47.78%</td>
</tr>
<tr>
<td>80</td>
<td>51487</td>
<td>68994</td>
<td>36020</td>
<td>30.04%</td>
<td>47.79%</td>
</tr>
<tr>
<td>90</td>
<td>57847</td>
<td>77585</td>
<td>40502</td>
<td>29.98%</td>
<td>47.80%</td>
</tr>
<tr>
<td>100</td>
<td>64221</td>
<td>86218</td>
<td>45000</td>
<td>29.93%</td>
<td>47.81%</td>
</tr>
</tbody>
</table>

The results show that the TM-MCM blocks of the programmable FIR filters designed by our proposed EDBNS method have much lower logic complexity. Its average reductions in gate count over [5] are 6.96%, 28.16% and 30.16% for 8-bit, 12-bit and 16-bit coefficients, respectively. Its average gate count reductions over [9] are 33.96%, 43.21% and 47.64% for 8-bit, 12-bit and 16-bit coefficients, respectively. The reduction can be as high as 30.89% over [5] for \(N = 10\) and 47.81% over [9] for \(N = 100\) for the 16-bit coefficient filters. The savings are attributable to the sparsity of the quasi-minimum EDBNS, which effectively eliminates the high redundancies of common subexpressions encapsulated in its power-of-\(b\) terms. The adders and multiplexers of the POBG and POBS blocks have also been reduced significantly by our proposed techniques. As the number of filter taps grows, the net saving of the POBS block multiples, which leads to a more prominent overall area reduction.

The throughput rate is another important criterion especially for programmable filter as it determines the rate of adaptation and how fast the input signal can be sampled [37]. If the critical path delay is longer than the time interval between two samples, the input samples need to be buffered. Owing to the parallel architecture of the \(N\) identical POBS blocks in an \(N\)-tap filter, the delay is independent of \(N\) but dependent on the word length of the coefficients. The critical path delays of the TM-MCM blocks of 8-bit, 12-bit and 16-bit coefficient programmable filters designed by our method, [5] and [9] are compared in Table VII. The percentage reduction in delay over [5] and [9] are calculated in the last two rows labeled ‘ΔD [5]’ and ‘ΔD [9]’, respectively.

TABLE VII. Comparison of the critical path delays in ns of \(w\)-bit programmable filters designed by [5], [9] and our proposed EDBNS method.

<table>
<thead>
<tr>
<th>(w)</th>
<th>8</th>
<th>12</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>[9]</td>
<td>3.16</td>
<td>3.97</td>
<td>5.03</td>
</tr>
<tr>
<td>This</td>
<td>2.59</td>
<td>4.14</td>
<td>3.57</td>
</tr>
<tr>
<td>ΔD [5] (%)</td>
<td>23.15%</td>
<td>−8.95%</td>
<td>19.23%</td>
</tr>
<tr>
<td>ΔD [9] (%)</td>
<td>18.04%</td>
<td>−4.11%</td>
<td>29.03%</td>
</tr>
</tbody>
</table>
The POSG block lies in the critical path and its logic depth is a major contributor to the delay of the filter. Thanks to the reduced logic depth implementation method of the POSG block, although the hardware reduction by our method over [5] for 8-bit coefficient filters is not as significant, its critical path delay is much shorter. It should be noted that \( F_{\text{min}} = [3, 9, 27] \) is used for the 16-bit EDBNS while \( F_{\text{min}} = [3, 9, 27, 81, 5] \) is used for the 12-bit EDBNS. Due to the larger discrete adders used for the generation of \( 81 \times F_{\text{min}} \) in the POBG block, the delay of our 12-bit coefficient filter architecture is longer than those of [5] and [9] and even our 16-bit programmable filter (although the POBG blocks for the 12-bit and 16-bit coefficients have the sameadder depth). On average, the critical path delay of our TM-MCM block is reduced by 11.14% and 14.32% in comparison with those designed by [5] and [9], respectively.

Taken into consideration the importance of both logic complexity and logic depth, the area-time (AT) complexity, measured in terms of the product of equivalent gate count and complexity and logic depth, the area-time (AT) complexity, with coefficient word lengths of 8, 12 and 16 bits. It is evident that the AT complexity of the filters designed by our proposed EDBNS method is lower than those of [5] and [9] by significant margins of 31.0% and 49.9%, respectively.

Comparison of multiplierless programmable and parallel MAC FIR filters is not provided in existing publications with a tacit understanding that adder is cheaper, faster and more power efficient than multiplier. Nonetheless, an 8-bit 100-tap programmable FIR filter using the MAC approach is designed to compare with our method. Both designs are mapped to Xilinx Artix 7 FPGA using Xilinx ISE Design Suite 14.7. Excluding the programmable shifters, our method consumes 7210 logic slices and has a critical path delay of 6.37 ns, which is around 37.8% and 41% lower than the MAC method. With the inclusion of the programmable shifters, our design has a critical path delay of 7.45 ns, which is still considerably faster than the MAC method by 31% but uses about 19.1% more logic slices. The cost of the programmable shifters could be reduced by further optimization and better mapping. As it stands, the speed of programmable shift-add network makes it a prerogative choice for real-time adaptive system.

VI. CONCLUSION

Vector-scalar multiplications with programmable scalars are commonly found in application-specific digital circuits. Design methodologies aiming at minimizing their implementation cost have been intensively studied by many researchers. This paper presents a radically different approach to this problem for the minimization of the TM-MCM block of programmable FIR digital filters based on the extended form of DBNS. An algorithm for the generation of quasi-minimum EDBNS has been proposed. The obtained EDBNS can be directly mapped to an efficient TM-MCM architecture consisting of only adders, multiplexers, programmable shifters and a LUT. Synthesis results show that the proposed algorithm is capable of synthesizing programmable FIR filters with an average logic complexity reduction of up to 47.81% and critical path delay reduction of up to 14.32% over its contending designs.

ACKNOWLEDGEMENT

This research and the material reported in this document are supported by the SUTD-MIT International Design Centre (IDC) at Singapore University of Technology and Design.

REFERENCES


Jiajia Chen received his B. Eng. (Hons) and Ph.D. from Nanyang Technological University, Singapore, in 2004 and 2010, respectively. From August 2010 to March 2012, he was a lecturer in the School of Engineering at Ngee Ann Polytechnic, Singapore. Since April 2012, he has been a lecturer in Singapore University of Technology and Design. His research interest includes computational transformations of low-complexity digital filters, programmable filters, filter architectural optimization and EEG signal processing. Dr. Chen served as Web Chair of Asia-Pacific Computer Systems Architecture Conference 2005 and Technical Program Committee member of European Signal Processing Conference 2014.

Chip-Hong Chang (S’92–M’98–SM’03) received the B.Eng. (Hons.) degree from the National University of Singapore in 1989, and the M. Eng. and Ph.D. degrees from Nanyang Technological University (NTU) in 1993 and 1998, respectively. He served as a Technical Consultant in industry prior to joining the School of Electrical and Electronic Engineering (EEE) of NTU in 1999, where he is currently an Associate Professor. He holds joint appointments with the university as Assistant Chair of Alumni of the School of EEE from 2008 to 2014, Deputy Director of the Center for High Performance Embedded Systems from 2000 to 2011, and the Program Director of the Center for Integrated Circuits and Systems from 2003 to 2009. He has coedited one book, published four book chapters and more than 200 research papers in refereed international journals and conferences. His current research interests include hardware security and trust, residue number systems, low power arithmetic circuits and digital filter design.

Dr. Chang has served as Associate Editor of IEEE Access since 2013, IEEE Transactions on Circuits and Systems-I from 2010-2013, IEEE Transactions on Very Large Scale Integration (VLSI) Systems since 2011, Integration, the VLSI Journal since 2013 and Microelectronics Journal since 2014. He also guest edited several journal special issues and served in many international conference advisory and technical program committees. He is a Fellow of the IET.

Feng Feng has been studying for B.Eng. Degree in computer engineering in Singapore University of Technology and Design since 2013, and he is working as a research assistant at SUTD-MIT International Design Center at the same time. His current interest area is algorithm-based high level digital IC design.

Weiao Ding is currently pursuing B. Eng. degree in computer engineering at Singapore University of Technology and Design. He works as a research assistant at SUTD-MIT International Design Center. His research interests include digital signal processing, digital filter design and digital circuit design.

Jiatao Ding is pursuing his B.Eng. degree in computer engineering at Singapore University of Technology and Design. He is currently working as a part-time research assistant at SUTD-MIT International Design Center. His research interests include digital filter design and implementation, CAD algorithms for high-level synthesis and new logical operator design.