Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/161048
Title: Efficient compute-intensive job allocation in data centers via deep reinforcement learning
Authors: Yi, Deliang
Zhou, Xin
Wen, Yonggang
Tan, Rui
Keywords: Engineering::Computer science and engineering
Issue Date: 2020
Source: Yi, D., Zhou, X., Wen, Y. & Tan, R. (2020). Efficient compute-intensive job allocation in data centers via deep reinforcement learning. IEEE Transactions on Parallel and Distributed Systems, 31(6), 1474-1485. https://dx.doi.org/10.1109/TPDS.2020.2968427
Project: NRF2015ENC-GBICRD001-012
NRF2015ENC-GDCR01001-003
Journal: IEEE Transactions on Parallel and Distributed Systems
Abstract: Reducing the energy consumption of the servers in a data center via proper job allocation is desirable. Existing advanced job allocation algorithms, based on constrained optimization formulations capturing servers' complex power consumption and thermal dynamics, often scale poorly with the data center size and optimization horizon. This article applies deep reinforcement learning to build an allocation algorithm for long-lasting and compute-intensive jobs that are increasingly seen among today's computation demands. Specifically, a deep Q-network is trained to allocate jobs, aiming to maximize a cumulative reward over long horizons. The training is performed offline using a computational model based on long short-term memory networks that capture the servers' power and thermal dynamics. This offline training approach avoids slow online convergence, low energy efficiency, and potential server overheating during the agent's extensive state-action space exploration if it directly interacts with the physical data center in the usually adopted online learning scheme. At run time, the trained Q-network is forward-propagated with little computation to allocate jobs. Evaluation based on eight months' physical state and job arrival records from a national supercomputing data center hosting 1,152 processors shows that our solution reduces computing power consumption by more than 10 percent and processor temperature by more than 4°C without sacrificing job processing throughput.
URI: https://hdl.handle.net/10356/161048
ISSN: 1045-9219
DOI: 10.1109/TPDS.2020.2968427
Schools: School of Computer Science and Engineering 
Rights: © 2020 IEEE. All rights reserved.
Fulltext Permission: none
Fulltext Availability: No Fulltext
Appears in Collections: SCSE Journal Articles

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.