Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/152271
Full metadata record
DC Field: Value (Language)

dc.contributor.author: Xu, He (en_US)
dc.date.accessioned: 2021-07-28T06:07:44Z
dc.date.available: 2021-07-28T06:07:44Z
dc.date.issued: 2021
dc.identifier.citation: Xu, H. (2021). Recommendation via reinforcement learning methods. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/152271 (en_US)
dc.identifier.uri: https://hdl.handle.net/10356/152271
dc.description.abstract (en_US): Recommender systems have been a persistent research topic for decades, aiming to recommend suitable items, such as movies, to users. Supervised learning methods are widely adopted, modeling recommendation as a prediction task. However, with the rise of online e-commerce platforms, many scenarios have emerged in which users make sequential decisions rather than one-time decisions, and reinforcement learning methods have therefore attracted increasing attention in recent years. This doctoral thesis investigates recommendation settings that can be solved by reinforcement learning methods, including multi-armed bandits and multi-agent reinforcement learning. In the recommendation domain, most scenarios involve a single agent that recommends items to users, aiming to maximize metrics such as click-through rate (CTR). Because candidate items change constantly in many online recommendation scenarios, one crucial issue is the trade-off between exploration and exploitation. We therefore consider multi-armed bandit problems, a classic framework in online learning and reinforcement learning for balancing exploration and exploitation, and propose two methods that alleviate issues in recommendation. First, we consider how users give feedback on the items or actions chosen by an agent. Previous work rarely accounts for the uncertainty in human feedback, especially when the optimal action is not obvious to the user. For example, when similar items are recommended, a user may give positive feedback to a suboptimal item, negative feedback to the optimal item, or no feedback at all in confusing situations. To capture uncertainty in both the learning environment and human feedback, we introduce a feedback model, and we propose a novel method that finds the optimal policy and a proper feedback model simultaneously.

Second, for online recommendation on mobile devices, item positions strongly influence clicks because of the limited screen size: 1) higher positions lead to more clicks for the same item, and 2) the 'pseudo-exposure' issue arises: only a few recommended items are visible at first glance, and users must scroll to browse the rest, so lower-ranked recommended items may never be viewed and should not be treated as negative samples. To address these two issues, we model online recommendation as a contextual combinatorial bandit problem and define the reward of a recommended set. We then propose a novel contextual combinatorial bandit method and provide a formal regret analysis. An online experiment was deployed on Taobao, one of the most popular e-commerce platforms in the world; results on two metrics show that our algorithm outperforms other contextual bandit algorithms. For the multi-agent reinforcement learning setting, we focus on a recommendation scenario in online e-commerce platforms that involves multiple modules, each recommending items with different properties such as deep discounts. A web page often consists of several independent modules whose ranking policies are decided by different teams and optimized individually without cooperation, which can lead to competition between modules and a sub-optimal global policy for the whole page. To address this issue, we propose a novel multi-agent cooperative reinforcement learning approach under the restriction that modules cannot communicate with each other. Experiments on real-world e-commerce data demonstrate that our algorithm outperforms the baselines. (en_US)
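The exploration-exploitation trade-off that the abstract centers on can be illustrated with a minimal UCB1 sketch. This is a standard bandit baseline, not the thesis's own algorithm (neither the feedback-model method nor the contextual combinatorial bandit is reproduced here), and the toy click probabilities below are invented for illustration:

```python
# Minimal UCB1 sketch: pick the arm with the highest mean reward plus an
# exploration bonus that shrinks as an arm is pulled more often.
import math
import random

def ucb1(reward_fn, n_arms, n_rounds):
    counts = [0] * n_arms    # times each arm has been pulled
    values = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0.0
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1      # pull each arm once to initialize
        else:
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        r = reward_fn(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean
        total_reward += r
    return counts, total_reward

# Toy environment: three "items" with fixed click probabilities; arm 2 is best.
random.seed(0)
probs = [0.2, 0.4, 0.8]
counts, total = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0, 3, 2000)
```

Over enough rounds, the best arm accumulates the bulk of the pulls while the others are still sampled occasionally, which is the balance the thesis's bandit chapters build on.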
dc.language.iso: en (en_US)
dc.publisher: Nanyang Technological University (en_US)
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). (en_US)
dc.subject: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence (en_US)
dc.title: Recommendation via reinforcement learning methods (en_US)
dc.type: Thesis-Doctor of Philosophy (en_US)
dc.contributor.supervisor: Bo An (en_US)
dc.contributor.school: School of Computer Science and Engineering (en_US)
dc.description.degree: Doctor of Philosophy (en_US)
dc.identifier.doi: 10.32657/10356/152271
dc.contributor.supervisoremail: boan@ntu.edu.sg (en_US)
item.grantfulltext: open
item.fulltext: With Fulltext
Appears in Collections: SCSE Theses
Files in This Item:
PhD_Thesis_xu_final.pdf (2.77 MB, Adobe PDF)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.