Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/179797
Title: Deep reinforcement learning solutions for multi-period inventory replenishment optimization
Authors: Shakya, Manoj
Keywords: Computer and Information Science
Issue Date: 2024
Publisher: Nanyang Technological University
Source: Shakya, M. (2024). Deep reinforcement learning solutions for multi-period inventory replenishment optimization. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/179797
Abstract: Deep Reinforcement Learning (DRL) represents a significant advancement in machine learning, particularly adept at addressing complex, high-dimensional decision-making problems. Its application to multi-period inventory replenishment marks a considerable shift from traditional inventory optimization techniques, such as integer programming, which often fall short in handling the complexities and uncertainties inherent in inventory management, especially when stochastic variables such as demand or lead time are involved. Traditional methods can be computationally intensive and may become impractical as problem complexity grows. DRL, with its model-free algorithms, offers a powerful alternative: because it requires no explicit mathematical model and uses function approximators such as deep neural networks, it can efficiently manage large state and action spaces, making it well suited to intricate inventory management challenges.

Our first study compares the performance of Q-learning against mixed integer linear programming (MILP) in a scenario involving a single product and a two-echelon supply chain network. The results show that Q-learning not only minimized the total cost of inventory management effectively but also maintained a high service level and fill rate, significantly outperforming MILP.

Building on this, our second study explores the Deep Q-Network (DQN) approach, focusing on inventory problems with stochastic lead times. Stochastic lead times add a further layer of complexity, significantly expanding the state space and making tabular methods such as Q-learning less effective. DQN's ability to handle high-dimensional state spaces through deep neural networks made it a suitable candidate for this challenge. We modeled the problem as a Markov Decision Process (MDP) and solved it with DQN, comparing its performance with the traditional (R, S) periodic-review, order-up-to policy and with Q-learning. The outcomes were compelling, with DQN outperforming the alternatives in minimizing inventory management costs. It was not without challenges, however: DQN struggled to balance cost with service levels, and rewards were realized with a delay, a common issue in environments with stochastic lead times, where the consequences of actions are not immediately apparent.

To address delayed rewards, our third study proposes a modification to the DQN architecture that incorporates a proxy experience replay buffer. This adjustment realigns rewards with the actions that produced them, a crucial step in scenarios involving deterministic and stochastic lead times. Our approach assigns a weighting to each reward, a strategy that proved effective in reducing inventory costs.

Our fourth and final study concentrates on improving the training of DRL agents for multi-period inventory replenishment, specifically addressing the challenges posed by warehouse capacity constraints. We introduce an action masking technique that strategically restricts the agent's available actions, thereby ensuring adherence to capacity constraints. The technique sets the Q-values of invalid actions to negative infinity at the logits level of the neural network, ensuring the agent's decisions remain within the feasible range.
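The masking step described above can be illustrated with a short sketch. This is not the thesis implementation: the network architecture, the capacity rule, and all names (QNet, feasible_action_mask, masked_greedy_action) are assumptions made only to show how invalid order quantities can be suppressed at the logits level.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Toy Q-network: maps an inventory state to one Q-value per order quantity."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def feasible_action_mask(on_hand: float, capacity: float, n_actions: int) -> torch.Tensor:
    # Action k means "order k units"; an order is treated as infeasible if it
    # would exceed warehouse capacity (an assumed rule for this illustration).
    orders = torch.arange(n_actions, dtype=torch.float32)
    return on_hand + orders <= capacity

def masked_greedy_action(qnet: QNet, state: torch.Tensor, mask: torch.Tensor) -> int:
    q = qnet(state)
    # Set Q-values of infeasible actions to -inf so greedy selection can never pick them.
    q = q.masked_fill(~mask, float("-inf"))
    return int(torch.argmax(q).item())

# Example: 4-dimensional state, order quantities 0..20, warehouse capacity 100.
qnet = QNet(state_dim=4, n_actions=21)
state = torch.tensor([95.0, 10.0, 5.0, 2.0])   # e.g. on-hand stock, pipeline, demand statistics
mask = feasible_action_mask(on_hand=95.0, capacity=100.0, n_actions=21)
print(masked_greedy_action(qnet, state, mask))  # always an order of at most 5 units
```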
We compared this approach with other methods, such as environment masking and invalid-action penalties, and found that action masking enabled more efficient learning and quicker convergence to optimal solutions during training.

In summary, these studies collectively illustrate the potential of DRL to transform multi-period inventory replenishment strategies. From demonstrating the advantages of Q-learning over traditional methods to addressing complex challenges such as stochastic lead times and capacity constraints, our research highlights the adaptability, robustness, and efficiency of DRL in managing intricate, high-dimensional decision-making problems. Moreover, our approaches to reward adjustment and action masking provide valuable insights and methodologies, paving the way for future research and practical implementations in inventory management and beyond.
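For concreteness, the sketch below illustrates the kind of periodic-review replenishment MDP and tabular Q-learning update discussed in the first study: the state is the inventory level, the action is an order quantity, and the reward is the negative of holding, shortage, and ordering costs. All cost parameters, the demand distribution, and the capacity rule are assumptions made for the example, not values from the thesis.

```python
import random
from collections import defaultdict

CAPACITY, MAX_ORDER = 50, 20                          # assumed warehouse and order limits
HOLD_COST, SHORT_COST, ORDER_COST = 1.0, 5.0, 10.0    # assumed per-period cost parameters
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1                # learning rate, discount, exploration

Q = defaultdict(float)  # Q[(inventory_level, order_qty)] -> estimated return

def step(inv: int, order: int) -> tuple[int, float]:
    """Simulate one review period; return (next inventory level, reward)."""
    demand = random.randint(0, 15)                    # assumed stochastic demand
    received = min(order, CAPACITY - inv)             # respect warehouse capacity
    next_inv = max(inv + received - demand, 0)
    shortage = max(demand - (inv + received), 0)
    cost = HOLD_COST * next_inv + SHORT_COST * shortage + (ORDER_COST if order else 0.0)
    return next_inv, -cost                            # reward = negative total cost

def policy(inv: int) -> int:
    if random.random() < EPSILON:                     # epsilon-greedy exploration
        return random.randint(0, MAX_ORDER)
    return max(range(MAX_ORDER + 1), key=lambda a: Q[(inv, a)])

inv = 20
for _ in range(50_000):                               # training periods
    action = policy(inv)
    next_inv, reward = step(inv, action)
    best_next = max(Q[(next_inv, a)] for a in range(MAX_ORDER + 1))
    # One-step Q-learning update toward the bootstrapped target.
    Q[(inv, action)] += ALPHA * (reward + GAMMA * best_next - Q[(inv, action)])
    inv = next_inv
```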
URI: https://hdl.handle.net/10356/179797
DOI: 10.32657/10356/179797
Schools: College of Computing and Data Science 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections:CCDS Theses

Files in This Item:
File: Thesis_17Aug2024.pdf
Description: PhD Thesis by Shakya Manoj
Size: 4.3 MB
Format: Adobe PDF

Page view(s): 94 (updated on Oct 9, 2024)
Download(s): 87 (updated on Oct 9, 2024)


Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.