<table>
<thead>
<tr>
<th><strong>Title</strong></th>
<th>Peak power reduction and workload balancing by space-time multiplexing based demand-supply matching for 3D thousand-core microprocessor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Author(s)</strong></td>
<td>P. D., Sai Manoj; Kanwen Wang; Yu, Hao</td>
</tr>
<tr>
<td><strong>Citation</strong></td>
<td>Sai Manoj, P. D., Wang, K., &amp; Yu, H. (2013). Peak power reduction and workload balancing by space-time multiplexing based demand-supply matching for 3D thousand-core microprocessor. Design Automation Conference (DAC) 2013 50th ACM / EDAC / IEEE.</td>
</tr>
<tr>
<td><strong>Date</strong></td>
<td>2013</td>
</tr>
<tr>
<td><strong>URL</strong></td>
<td><a href="http://hdl.handle.net/10220/18220">http://hdl.handle.net/10220/18220</a></td>
</tr>
<tr>
<td><strong>Rights</strong></td>
<td>© 2013 IEEE. This is the author created version of a work that has been peer reviewed and accepted for publication by Design Automation Conference (DAC) 2013 50th ACM / EDAC / IEEE, IEEE. It incorporates referee’s comments but changes resulting from the publishing process, such as copyediting, structural formatting, may not be reflected in this document. The published version is available at: <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&amp;arnumber=6560768&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6560768">http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&amp;arnumber=6560768&amp;url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6560768</a>.</td>
</tr>
</tbody>
</table>
ABSTRACT

Space-time multiplexing is utilized for demand-supply matching between many-core microprocessors and power converters. Adaptive clustering is developed to classify cores by similar power level in space and similar power behavior in time. In each power management cycle, the minimum number of power converters is allocated for space-time multiplexed matching, which is physically enabled by 3D through-silicon-vias. Moreover, demand-response based task adjustment is applied to reduce peak power and to balance workload. The proposed power management system is verified by system models with physical design parameters and benchmarked power traces, which show 38.10% peak power reduction and 2.60x balanced workload.

Categories and Subject Descriptors: B.7.2 [Design Aids]

Keywords: Demand-supply matching, Peak power reduction, Workload balancing, 3D thousand-core

1. INTRODUCTION

Exa-scale cloud computing for big-data applications requires the integration of many-core microprocessors on a single chip [1, 2] at thousand-core scale. Though 3D integration is one promising solution [3] to increase integration density and communication bandwidth, supplying voltages to many cores while maintaining low power density remains an unresolved issue [4, 5, 6, 7, 8]. Supplying the same voltage-level to all cores results in high power density, because the demand of each core can differ from one time instant to the next. As such, a demand-supply matched dynamic voltage and frequency scaling (DVFS) scheme needs to be employed during power management for both peak power reduction and workload balancing.

From the physical hardware perspective, optimal demand-supply matching requires on-chip power converters [5, 6, 7, 8], which can provide prompt DVFS management with efficient power delivery. However, one power converter per core has a large area overhead because the buck inductor does not scale. Single-inductor-multiple-output (SIMO) power converters [6] use one common buck inductor to provide different voltage-levels in different time slots in a space-time-multiplexed manner. The capability of SIMO is, however, still limited for many-core microprocessors at thousand-core scale. Moreover, with hundreds of cores integrated on one chip, little area remains for on-chip power converters with buck inductors. 3D integration introduces additional room for on-chip power converters. The recent work in [8] has demonstrated the possibility of designing the power converter on one die and a 64-tile network-on-chip on the other die, integrated by through-silicon-vias (TSVs).

From the cyber management perspective, power management for a many-core power-supply system is no longer the same as that for a traditional single core. For big-data applications, various power patterns may be deployed on the cores, with multi-time-scale demands for power supply. Moreover, there are many microprocessor cores but a limited number of power converters. A number of power management works for many-core microprocessor systems have been explored before [5, 6, 7, 8], but one challenge is not fully resolved: not only matching the various demands from microprocessors with a limited number of power converters, but also reducing peak power and balancing the workload on each power converter. As such, smart power management of many-core microprocessors resembles smart-grid management, though at a different time-scale and with different workload behaviors. The study of workload behavior with classification, and also demand-response, can therefore be leveraged from smart-grid management [9] to deal with the on-chip demand-supply matching problem.

In this paper, a space-time multiplexing (STM) based DVFS power management is utilized for demand-supply matching between many-core microprocessors and power converters. In each power management cycle, an adaptive clustering is developed such that the minimum number of power converters is allocated to different groups classified by power magnitudes, called space multiplexing. Within one group, power converters are further reused in different time slots for different subgroups classified by power phases, called time multiplexing. Such a space-time multiplexed matching is physically enabled by designing a reconfigurable power switch network with the use of 3D through-silicon-vias (TSVs). Moreover, demand-response based task adjustment is applied to reduce peak power and to balance workload. The proposed power management system is verified by system models in SystemC-AMS. The physical design parameters are based on a 130nm CMOS process with TSV models. Experiment results show that the proposed power management can achieve 38.10% peak power reduction and 2.60x balanced workload.

The rest of this paper is organized as follows. In Section 2, we present the 3D many-core microprocessor system architecture with the space-time multiplexing (STM) problem formulation towards demand-supply matching. In Section 3, we show the solution by STM-based resource allocation of power converters with the use of adaptive clustering, which is based on singular-value-decomposition (SVD) analysis of workload correlation. In Section 4, we further show the demand-response based task scheduling to utilize demand slacks and to adjust tasks for both peak power reduction and workload balancing. The experimental results are included in Section 5 with the conclusion in Section 6.
2. 3D SYSTEM ARCHITECTURE WITH SPACE-TIME MULTIPLEXING

In this section, 3D many-core microprocessor system architecture with a reconfigurable power switch network is reviewed with a space-time multiplexing (STM) problem formulated for power management. Table 1 summarizes necessary notations used in this paper.

### 2.1 3D System Architecture

As shown in Fig. 1, the 3D many-core microprocessor system architecture is composed of two tiers. The bottom tier is for power management, including arrays of power converters and power switches. Each power converter is of SIMO type, capable of supplying multiple voltage-levels with one buck inductor. The top tier includes the array of many-core microprocessors and power storage to supply voltage during the multiplexing. The two tiers are connected by through-silicon-vias (TSVs), controlled by power switches.

The proposed 3D system architecture can be described by a demand-supply system model composed of the following three components:

- **Power Demand**: a set of cores $C$ with demanded voltage-levels with set-size $N_c$. Each core $c_i$ has a demanded voltage-level $v_{d}(c_i)$ to meet the deadline of its running workload. In addition, $v_{a}(c_i)$ is the allocated voltage-level to $c_i$ after power management.

- **Power Supply**: a set of power converters $R$ with set-size $N_r$. Each power converter outputs the voltage-level $v(r_i) \in V$ to supply the cores, where $V$ is the set of available voltage-levels before power management.

- **Power Switch Network**: a set of reconfigurable switch-boxes $S$ with set-size $N_s$ to connect between $R$ and $C$ for demand-supply matching.

### 2.2 Space-Time Multiplexing Problem

As mentioned in the introduction, the primary challenge for a 3D thousand-core system supporting exa-scale computing is to solve a large-scale demand-supply matching problem. Though there are various big-data applications with different power patterns, most of their power profiles can still be classified by magnitudes and phases. As such, if one performs a detailed power profile characterization by clustering cores with similar power behaviors, the complexity of matching can be reduced accordingly. With the further aim of implementing the system at the minimum cost in power converters, one can formulate a resource (power converter) allocation problem with demand-supply matching constraints. This gives the first subproblem as follows.

**Subproblem 1: Resource Allocation Problem** is to decide the minimum number of power converters such that demand-supply matching can be satisfied.

What is more, due to spatial and temporal variation of power profiles, there may exist lots of power slacks to be utilized for a demand-response based workload scheduling. Without violating the workload execution priority or deadline, one can delay over-loaded workloads in one time-slot to the other time-slot with under-loaded workloads. As such, the peak power can be reduced as well as the workload can be balanced at power converters, which can be formulated as the second subproblem below.

**Subproblem 2: Workload Scheduling Problem** is to delay over-loaded workloads to under-loaded time-slots based on availability of slack and without violation of priority.

In this paper, we show that based on the aforementioned 3D system architecture, a space-time multiplexing (STM) based power management can be developed to solve the two subproblems in sections 3 and 4, respectively.

3. ADAPTIVE CLUSTERING BASED RESOURCE ALLOCATION

This section deals with resource allocation by adaptive clustering, resulting in the use of the minimum number of power converters for matched demand-supply. To deal with a large-scale demand-supply matching problem, we start with classification of cores into clusters by studying their power...
3.1 Grouping by Power Magnitude for Space Multiplexing

Grouping is the process of clustering different cores, which have similar power magnitudes and hence will demand the similar voltage-level. Note that $z$-th group $g_z$, $g_z \in G$, can be formed by the following criteria:

$$g_z = \{c_i;\ v_d(c_i) = v_d(c_j) = v_z,\ \forall i, j = 1, \ldots, N_c,\ z \leq N_v\}. \quad (1)$$

Here, $v_z \in V$ represents the $z$-th voltage-level, $c_i \in C$, and $v_d(c_i) \in V$ is the demanded voltage-level of core $c_i$.

Based on the power magnitude levels, different groups are formed. Each group may contain a different number of cores, which have similar power magnitudes but possibly different power phases. The group formation can change at each control-cycle. Based on the partitioned groups, power converters can also be partitioned in space to provide the specified voltage-levels for the groups. This grouping process has low complexity because it involves only numerical comparisons.
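As a minimal sketch of the grouping rule in Eq. (1), the snippet below buckets cores by their demanded voltage-level using only equality comparisons. The function name `group_by_voltage` and the eight sample voltage values are hypothetical and chosen purely for illustration; they are not taken from the experiments.

```python
from collections import defaultdict

def group_by_voltage(demanded):
    """Map each demanded voltage-level to the list of core indices requesting it."""
    groups = defaultdict(list)
    for core, level in enumerate(demanded):
        groups[level].append(core)
    return dict(groups)

# Hypothetical demanded voltage-levels (in volts) of 8 cores in one control-cycle.
demanded = [1.2, 0.9, 1.2, 1.8, 0.9, 1.2, 1.8, 0.9]
groups = group_by_voltage(demanded)
for level in sorted(groups):
    print(level, groups[level])
```

Because the rule is a plain equality test per core, one pass over the cores is enough, which is in line with the low complexity noted above.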

3.2 Subgrouping by Power Phase for Time Multiplexing

Subgrouping is the process of clustering different cores, which have similar power phase (or pattern) and are within the same group $g_z$.

Subgroup $k_z$, $k_z \in K$, can be formed by the following criteria:

$$k_z = \{c_i;\ (v_d(c_i) = v_d(c_j) = v_z) \ \&\ (p_i \sim p_j),\ \forall i, j = 1, \ldots, N_c\}. \quad (2)$$

Here, $p_i, p_j \in P$ represent the phases or patterns of the power traces of cores $c_i, c_j \in C$; $v_d(c_i) \in V$ represents the demanded voltage-level of $c_i$; and $v_z \in V$ represents the $z$-th voltage-level. Subgrouping by phase is more difficult than grouping by magnitude and may take a bit more time in clustering. In the next subsection, we show a solution by means of spectral clustering to perform subgrouping of power profiles, which can be easily deployed to make power management faster compared to the one without subgrouping. Moreover, all the computations can be pre-stored in a look-up-table for implementing real-time control.

3.3 Spectral Clustering for Subgrouping

The spectral clustering algorithm is discussed below. To find the similarity between two power profiles $p_i$ and $p_j$, $p_i, p_j \in P$, with $N$ samples in one control-cycle, the correlation in terms of the covariance matrix can be evaluated by

$$X = \frac{1}{N} \sum_{i,j=1}^{N} (p_i - \overline{P})(p_j - \overline{P})^T \quad (3)$$

where $\overline{P}$ is the mean of all power profiles ($\frac{1}{N} \sum_{i=1}^{N}(p_i)$).

Based on the order of the covariance matrix, the number of clusters $K$ can be analyzed by the singular-value-decomposition (SVD) of the covariance matrix

$$X = U \times S \times V^{-1} \quad (4)$$

Matrices $U$ and $V$ are orthogonal matrices with $S$ as the diagonal matrix. Based on the rank analysis of $S$, the number of clusters $K$ can be decided. A new matrix can be formed with $K$ independent vectors, extracted from either of orthogonal matrices. Let the newly formed matrix be $V_K$, assuming it is extracted from $V$.

The product of $V_K$ with the covariance matrix $X$

$$X_K = X \times V_K \quad (5)$$

will result in a reduced matrix $X_K$, which becomes the basis of spectral clustering for subgrouping. For example, the $i$-th core will be allocated to the $j$-th subgroup if the value of $X_K(i,j)$ is the maximum in the $i$-th row. The procedure for subgrouping is described in Algorithm 1.

Algorithm 1 Subgrouping by correlation extraction and spectral clustering

**INPUT:** Power trace matrix $P$ with $p_i$ power trace vectors after grouping

1. Compute covariance matrix $R \in R^{p \times p}$
2. Perform SVD: $R = U \times S \times V^{-1}$
3. Determine number of clusters: $K = rank(S)$
4. Take the first $K$ singular vectors $v_1, ..., v_K$ of $V$
5. Let $V_K = [v_1, ..., v_K] \in R^{N \times K}$ and $R_K = R \times V_K$
6. Add $i$-th core to $j$-th cluster if $R_K(i,j)$ is maximum in the $i$-th row
7. Form $P_K$ matrices by finding corresponding indices in power trace matrix $P$

**OUTPUT:** New clustered subgroup matrices $P_k$, $(k = 1, ..., K)$
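The steps of Algorithm 1 can be sketched with NumPy as follows. This is an illustrative reading, not the authors' implementation: the function name `subgroup_by_phase`, the relative tolerance used for the numerical rank, the choice to build the covariance between cores after subtracting the mean profile, and the six sample traces are all assumptions. Note also that the sign of each singular vector returned by an SVD routine is arbitrary, so the cluster indices (and, for dissimilar traces, even the partition) may vary between libraries; only cores with identical traces are guaranteed to land in the same cluster.

```python
import numpy as np

def subgroup_by_phase(P, tol=1e-8):
    """Cluster rows of P (one power trace per core) along the lines of Algorithm 1.

    Returns (labels, K), where labels[i] is the cluster index of core i.
    """
    P = np.asarray(P, dtype=float)
    n_samples = P.shape[1]
    # Step 1: covariance between cores, centred on the mean profile.
    centred = P - P.mean(axis=0, keepdims=True)
    R = centred @ centred.T / n_samples
    # Step 2: SVD (NumPy returns V already transposed).
    U, S, Vt = np.linalg.svd(R)
    # Step 3: numerical rank of S; the relative tolerance is an assumption.
    K = max(1, int(np.sum(S > tol * S[0])))
    # Steps 4-5: first K right singular vectors and the reduced matrix R_K.
    V_K = Vt[:K].T
    R_K = R @ V_K
    # Step 6: core i joins the cluster whose entry is the maximum in row i.
    labels = np.argmax(R_K, axis=1)
    return labels, K

# Hypothetical traces: three distinct patterns, each shared by two cores.
P = np.array([
    [1, 5, 1, 5],
    [1, 5, 1, 5],
    [5, 1, 5, 1],
    [5, 1, 5, 1],
    [3, 3, 6, 6],
    [3, 3, 6, 6],
])
labels, K = subgroup_by_phase(P)
# Step 7: collect the traces of each subgroup.
P_k = [P[labels == k] for k in range(K)]
print(K, labels)
```

In this example the centred traces span two dimensions, so the rank test yields two clusters even though three patterns are present; a deployment would need to decide how to treat such cases.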

3.4 Solution to Subproblem 1

Once subgroups are formed, the maximum workloads of one subgroup can be determined. As such, the minimum number of power converters can be also determined to supply that subgroup. This results in one feasible solution to solve the Subproblem 1 in Section 2 as rephrased below.

$$\min \sum_{i=1}^{N_c} r_i$$

s.t.: (i) $v_a(c_j) \geq v_d(c_j), \forall c_j \in C$.

(ii) $d(r_i) \leq N_{max}, \forall r_i \in R$.

If one can determine the minimum number of power converters $r_i$ for each group, the total number of power converters can be correspondingly minimized. Note that constraint (i) guarantees that the supplied voltage-level $v_a(c_j)$ from the power converter will satisfy the demanded voltage-level $v_d(c_j)$ from core $c_j$. Moreover, constraint (ii) requires that the driving ability $d(r_i)$ of each power converter is at most $N_{max}$, i.e., the maximum number of cores to drive. The driving ability can vary with the voltage-level: the higher the voltage-level is, the lower the number of cores that one power converter can drive.

Next, we show that the minimization of the total number of power converters can be solved by grouping and subgrouping. With grouping, power converters are shared in space among \(N_c\) number of groups, and with subgrouping, power converters inside a group are shared in time. Based on the driving capability \(d^i_k\) of the \(i\)-th power converter in group \(g_j\), \(g_j \in G\), having \(k\) subgroups, and the maximum number of cores among different subgroups, \(\max(c_i)\), \(c_i \in C\), the minimum number of power converters for group \(g_j\) can be determined as

\[
r_{g_j} = \max(c_i)/d^i_k.
\]

As such, for the whole system, the total number of power converters needed will be \(\sum(r_{g_j})\), which is the minimum number to satisfy the demand-supply matching.
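A small numeric sketch of this count is given below. It assumes the ratio is rounded up to a whole converter (the expression above does not state the rounding), and the subgroup sizes and driving abilities are hypothetical. For contrast it also computes the count when a group is not time multiplexed, so that all of its cores must be driven at once.

```python
import math

def converters_with_time_mux(subgroup_sizes, driving_ability):
    """Converters needed when subgroups take turns: sized for the largest subgroup."""
    return math.ceil(max(subgroup_sizes) / driving_ability)

def converters_without_time_mux(subgroup_sizes, driving_ability):
    """Converters needed when every core of the group is driven simultaneously."""
    return math.ceil(sum(subgroup_sizes) / driving_ability)

# Hypothetical groups: (cores per subgroup, cores one converter can drive).
groups = {"g1": ([6, 4, 5], 4), "g2": ([3, 2], 2)}

total_stm = sum(converters_with_time_mux(s, d) for s, d in groups.values())
total_space_only = sum(converters_without_time_mux(s, d) for s, d in groups.values())
print(total_stm, total_space_only)
```

With these made-up numbers, time multiplexing inside each group lowers the total from 7 to 4 converters.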

4. DEMAND RESPONSE BASED WORKLOAD SCHEDULING

This section deals with peak reduction and load balancing after the minimum number of power converters are allocated. A demand-response based workload scheduling will be developed towards uniform distribution of workload with reduction in peaks at one power converter.

4.1 Peak Power Envelope Extraction

To deal with peak reduction and load balancing, we first discuss the extraction of peak power envelope in one control-cycle, because it is impractical to perform power management in continuous form. Based on the extracted peak power envelope, one can build workload behavior model for each subgroup to be used in scheduling.

Assume that in one control-cycle \(T^i\) for the \(i\)-th group, \(g_i\), \(g_i \in G\) with \(N_j\) number of subgroups, each core is assigned with one workload. One can have time slot \(T^j\), which is the amount of time to finish all workloads in a subgroup, \(k_j\), \(k_j \in K\). Relation between \(T^i\) and \(T^j\) is

\[
T^i = \sum_{j=1}^{N_j} T^j. \tag{7}
\]

As such, in one time-slot \(T^j\), peak power envelope \(P_e\) is extracted for workloads \(p(t)\) of one subgroup by

\[
P_e(T^j) = \max(p(t)). \tag{8}
\]

This is repeated for the whole control cycle \(T^i\). Thus peaks are extracted and one envelope is formed. Peak extraction by forming one envelope is shown in Fig. 3. The control-cycle \(T^i\) is 400ms with a time-slot \(T^j\) of 100ms. In each time slot, the power envelope is set to the peak value.
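The envelope extraction of Eq. (8) can be sketched as a per-slot maximum. The function name `peak_envelope`, the 50ms sampling interval and the power samples below are assumptions for illustration; only the 400ms control-cycle and 100ms time-slot come from the text.

```python
import numpy as np

def peak_envelope(p, n_slots):
    """Split a sampled power trace into equal time-slots and keep each slot's peak."""
    slots = np.array_split(np.asarray(p, dtype=float), n_slots)
    return np.array([slot.max() for slot in slots])

# Hypothetical trace: 400 ms control-cycle sampled every 50 ms (8 samples),
# divided into four 100 ms time-slots as in Eq. (7).
p = [2.0, 3.5, 1.0, 1.5, 4.0, 2.5, 0.5, 0.8]
envelope = peak_envelope(p, n_slots=4)
print(envelope)
```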

4.2 Peak Reduction and Load Balancing

When the peak envelope of subgroup \(k\), \(k \in K\) is compared with one threshold power \(P_{th}^i\) of group \(g_i\), the slack can be calculated by

\[
a^i_j = P_e(T^j) - P_{th}^i. \tag{9}
\]

If the value of slack is positive, then the allocated power converter, \(r_j\), \(r_j \in R\), is overloaded and not capable of handling extra load at that time-slot; otherwise, the power converter \(r_j\) is underloaded and can be allocated with additional workloads. After calculating the amount of slack, the workload of the power converter \(r_j\) can be rescheduled such that priority is not violated.

We call this demand-response based workload scheduling. The procedure for scheduling is described in Algorithm 2. It is deployed after clustering to decide the time slot. The first step in scheduling is to calculate the threshold and slack. Lines 2-4 of Algorithm 2 describe moving tasks away from a power converter that is overloaded, with the corresponding reduction of its load. Similarly, Lines 6-8 describe adding workloads to an underloaded power converter. In short, it can be viewed as re-clustering or refinement. The overhead includes the time to perform the calculation and movement, which is negligible in the whole control cycle.

Algorithm 2 Demand-response based workload scheduling

1: INPUT: Initial workload set \(L\), slack \(A\)
2: if \(a^i_j > 0\) then
3: Decrease workload on \(r_j\)
4: \(l(r_j) \leftarrow l(r_j) - 1;\)
5: else
6: while \(a^i_j < 0\) do
7: Increase workload on \(r_j\)
8: \(l(r_j) \leftarrow l(r_j) + 1;\)
9: \(a^i_j \leftarrow a^i_j + 1;\)
10: end while
11: end if


![Figure 3: Peak power envelope extracted in each time-slot in one control cycle](image)

![Figure 4: (a) Load before demand response scheduling (b) Peak reduction by demand response scheduling](image)

Table 5: Comparison of peak power reduction and workload balancing by demand-response scheduling

<table>
<thead>
<tr>
<th>Group</th>
<th>Peak Reduction</th>
<th>Balance before</th>
<th>Balance after</th>
</tr>
</thead>
<tbody>
<tr>
<td>Group 1</td>
<td>34.9%</td>
<td>90.00</td>
<td>88.67</td>
</tr>
<tr>
<td>Group 2</td>
<td>43.6%</td>
<td>80.00</td>
<td>75.62</td>
</tr>
<tr>
<td>Group 3</td>
<td>30.6%</td>
<td>70.00</td>
<td>67.14</td>
</tr>
<tr>
<td>Group 4</td>
<td>38.1%</td>
<td>60.00</td>
<td>56.87</td>
</tr>
</tbody>
</table>

Table 4: Comparison of number of allocated power converters under different PM schemes

<table>
<thead>
<tr>
<th>PM Scheme</th>
<th>STM</th>
<th>TM</th>
<th>STM/TM</th>
<th>STM/SM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Group 1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>Group 2</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>Group 3</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>Group 4</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
</tbody>
</table>

Table 3: Clustering result for 64 cores

The solution to this problem is to minimize the overall sum of slacks. This can be achieved by rescheduling workloads that demand more power than the threshold. So, peak reduction is performed first, followed by load balancing. If the slack of a subgroup \( k \) is positive, the workload on that subgroup needs to be delayed or advanced to another time-slot. As such, the workloads are allocated to subgroups with highly negative slack, and the differences in slack are reduced. As a result, peak reduction and load balancing can be achieved eventually.
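The rescheduling idea can be sketched as a greedy loop that computes the slack of Eq. (9) for every time-slot and moves one unit of workload at a time from the most overloaded slot to the slot with the most negative slack. This is a simplified illustration under stated assumptions: workloads are treated as identical unit loads, priority and deadline constraints are ignored, and the function name `demand_response`, the load values and the threshold are hypothetical.

```python
def demand_response(load, threshold):
    """Greedily shift unit workloads from overloaded to underloaded time-slots."""
    load = list(load)
    while True:
        # Eq. (9): positive slack means the slot exceeds the threshold.
        slack = [value - threshold for value in load]
        src = max(range(len(load)), key=lambda j: slack[j])
        dst = min(range(len(load)), key=lambda j: slack[j])
        # Stop when nothing is overloaded or no slot has spare capacity.
        if slack[src] <= 0 or slack[dst] >= 0:
            break
        load[src] -= 1
        load[dst] += 1
    return load

# Hypothetical peak envelope (in unit loads) over four time-slots.
before = [9, 3, 6, 2]
after = demand_response(before, threshold=6)
print(before, "->", after)
```

The total workload is preserved while the peak drops to the threshold and the spread between slots narrows, which is the combined peak reduction and balancing effect described above.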

5. SIMULATION RESULTS

5.1 System Modeling and Settings

The proposed system is validated by Matlab and system-level models built with SystemC-AMS. Table 2 summarizes the system design specifications. All units are scaled or modeled at the system-level and hence do not influence the load capacitance. What is more, for each TSV channel, one switch box is assigned with \( N_r \) power switches to support the core-converter connection. The switch box offers a compact reconfigurable unit driven by the controller. The power switch inside each switch box occupies \( 520\mu m^2 \) and is able to deliver the maximum core current with a switching time of 300ns. As such, the TSV coupling is small enough to neglect under such slow power switching.

5.2 Results and Comparisons

Initially, we take 32-core and 64-core microprocessors as two examples to show results under adaptive clustering. The input power traces are first grouped into 4 groups based on the power magnitudes, then in each group subgroups are formed based on their power phases.

Fig. 5 illustrates the adaptive clustering result of the 32-core case between two consecutive control cycles. Different filling-shapes represent different groups or voltage-levels. The clustering numbers in the lower-right corner of the cores represent different subgroups. For example, in the first control cycle, the 30th core is assigned to subgroup 4 with voltage-level 4 (group 4), and in the next control cycle, it is assigned to subgroup 2 with voltage-level 1 (group 1). For the 64-core case, Table 3 summarizes the clustering results, with the values in the table representing the core IDs. One can also observe that the runtime of clustering is small, at the scale of 200ms.

Next, we use the space-time multiplexing (STM) scheme to perform the demand-supply matching. The first step is resource allocation, for which adaptive clustering is deployed. After clustering, we extract simplified workload models to represent the peak power in one control cycle, and also determine the minimum number of power converters for each group. Compared with two other schemes, namely space-multiplexing (SM) and time-multiplexing (TM), under the same driving ability and time slot, the STM-based approach takes advantage of both space and time to minimize the number of power converters. Table 4 shows the comparison for the 32-core and 64-core cases with the three schemes. One can observe that the number of power converters is reduced by 55.00% relative to SM and 35.71% relative to TM for the 32-core case, and by 41.67% relative to SM and 36.36% relative to TM for the 64-core case. Therefore, STM based adaptive clustering can satisfy the demand-supply matching with the minimum number of power converters, reducing the area overhead and the on-chip implementation cost. Lastly, we perform demand-response based workload scheduling for time-multiplexing of power converters inside one group.

Table 2: System settings of 3D many-core microprocessors, on-chip power converters, TSVs and power switches

<table>
<thead>
<tr>
<th>Item</th>
<th>Description</th>
<th>Symbol</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Microprocessor</td>
<td>Performance</td>
<td>N.A.</td>
<td>310 DMIPS</td>
</tr>
<tr>
<td></td>
<td>Frequency</td>
<td>f</td>
<td>250MHz</td>
</tr>
<tr>
<td></td>
<td>Power Consumption</td>
<td>Pc</td>
<td>9.6W</td>
</tr>
<tr>
<td>Power Converter</td>
<td>Input Voltage</td>
<td>iV</td>
<td>4V</td>
</tr>
<tr>
<td></td>
<td>Output Voltage</td>
<td>oV</td>
<td>3.3V, 1.8V, 1.2V, 1.0V, 0.9V</td>
</tr>
<tr>
<td></td>
<td>Load Current</td>
<td>iL</td>
<td>1200mA, 2400mA, 4200mA, 5000mA</td>
</tr>
<tr>
<td></td>
<td>Number of Phases</td>
<td>Np</td>
<td>2</td>
</tr>
<tr>
<td></td>
<td>Inductor per Phase</td>
<td>Ip</td>
<td>1H</td>
</tr>
<tr>
<td></td>
<td>Switching Frequency</td>
<td>fs</td>
<td>25-200kHz</td>
</tr>
<tr>
<td></td>
<td>Peak Efficiency</td>
<td>Pef</td>
<td>76%</td>
</tr>
<tr>
<td></td>
<td>Length</td>
<td>L</td>
<td>2mm</td>
</tr>
<tr>
<td>TSV</td>
<td>Diameter</td>
<td>W</td>
<td>50μm</td>
</tr>
<tr>
<td></td>
<td>Induction Filament</td>
<td>r</td>
<td>1200mm</td>
</tr>
<tr>
<td></td>
<td>Resistance</td>
<td>Rts</td>
<td>20Ω</td>
</tr>
<tr>
<td></td>
<td>Capacitance</td>
<td>Cts</td>
<td>47F</td>
</tr>
<tr>
<td></td>
<td>Length</td>
<td>Lts</td>
<td>1.5mm</td>
</tr>
<tr>
<td>Power Switch</td>
<td>Width</td>
<td>Ws</td>
<td>4mm</td>
</tr>
<tr>
<td></td>
<td>Length</td>
<td>Ls</td>
<td>1.5mm</td>
</tr>
<tr>
<td></td>
<td>Switching Time</td>
<td>ts</td>
<td>300ns</td>
</tr>
</tbody>
</table>

Figure 6: Peak power reductions for 4 subgroups of 64-core case

The peak power reduction is defined as the difference between the peak power values before and after the scheduling. The workload balancing is defined in terms of the number of cores which one power converter drives over control cycles. We compare the peak power reduction by averaging the reduction in each group, and compare workload balancing by averaging the standard-deviation (SD) of workload on each power converter. For the 64-core microprocessor results shown in Fig. 6, in Group 3, the peak power value has been reduced from 9 to 6, a 33.33% peak power reduction. The average standard deviation of workload on each power converter is 0.91 before and 0.50 after scheduling, a standard deviation improvement of 1.82x. Table 5 shows the summarized results for peak reduction and workload balancing by demand-response scheduling. One can observe an average of 38.10% peak power reduction and 2.60x workload balancing.
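The two figures quoted for Group 3 can be reproduced with the short calculation below. It assumes that the percentage is the peak difference relative to the peak before scheduling and that the balancing factor is the ratio of the standard deviations before and after; both readings are consistent with the quoted numbers but are not spelled out as formulas in the text, and the function names are hypothetical.

```python
def peak_reduction_percent(peak_before, peak_after):
    """Peak difference relative to the peak before scheduling, in percent."""
    return 100.0 * (peak_before - peak_after) / peak_before

def balance_improvement(sd_before, sd_after):
    """Ratio of workload standard deviations before and after scheduling."""
    return sd_before / sd_after

reduction = round(peak_reduction_percent(9, 6), 2)
improvement = round(balance_improvement(0.91, 0.50), 2)
print(reduction, improvement)  # 33.33 1.82
```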

6. CONCLUSION

A space-time multiplexed power management is developed for large-scale demand-supply matching between on-chip power converters and many-core microprocessors. The power switch network is configured to perform space-time multiplexing between power converters and cores by vertical TSVs in 3D. Based on adaptive clustering of cores classified by both power magnitudes and power phases, the minimum number of power converters is allocated to supply the voltage-levels demanded by the cores. What is more, demand-response based workload scheduling is deployed by utilizing the power slacks, such that peak power can be reduced and workload can be balanced. As verified by system-level behavior models implemented in SystemC-AMS, and also physical-level models with design parameters, experiment results show that the space-time multiplexing can reduce peak power by 38.10% and improve load balancing by 2.60x on average with the minimum number of allocated power converters.

7. ACKNOWLEDGMENTS

This work is sponsored by Singapore MOE TIER-2 fund MOE2010-T2-2-037 (ARC 5/11) and A*STAR SERC-PSF fund 11201202015. Please address comments to haoyu@ntu.edu.sg.

8. REFERENCES