TY - JOUR
T1 - A Replaceable Curiosity-Driven Candidate Agent Exploration Approach for Task-Oriented Dialog Policy Learning
AU - Niu, Xuecheng
AU - Ito, Akinori
AU - Nose, Takashi
N1 - Publisher Copyright:
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
PY - 2024
Y1 - 2024
N2 - Task-oriented dialog policy learning is often formulated as a reinforcement learning (RL) problem whose rewards from the environment are extremely sparse, which means that the agent will rarely find the reward by acting randomly. Exploration techniques are therefore of primary importance when solving RL problems, and more sophisticated exploration methods must be devised. In this study, we propose a replaceable curiosity-driven candidate agent exploration approach that encourages the agent to balance action sampling and explore new environments without overly violating learned dialog strategies. In this framework, we adopt the curiosity model but design a weight for the curiosity reward to balance exploration and exploitation. We design a multi-candidate agent mechanism that selects an agent with relatively balanced action sampling for formal dialog training, motivating agents to escape pseudo-optimal actions in the early training stage. In addition, we propose, for the first time, a replacement mechanism that prevents the elected agent from performing poorly in the later stages of training and fully utilizes all the candidate agents. The experimental results show that the adjustable curiosity reward promotes dialog policy convergence. The agent replacement mechanism effectively blocks the training of poorly trained agents, significantly increasing the task's average success rate and reducing the number of dialog turns. In this research, an exploration approach for task-oriented dialog systems is designed to encourage agents to explore the environment through balanced action sampling without deviating significantly from learned dialog strategies. Compared to the baselines, the replaceable curiosity-driven candidate agent exploration approach yields a higher average success rate of 0.714 and a lower average number of turns of 20.6.
AB - Task-oriented dialog policy learning is often formulated as a reinforcement learning (RL) problem whose rewards from the environment are extremely sparse, which means that the agent will rarely find the reward by acting randomly. Exploration techniques are therefore of primary importance when solving RL problems, and more sophisticated exploration methods must be devised. In this study, we propose a replaceable curiosity-driven candidate agent exploration approach that encourages the agent to balance action sampling and explore new environments without overly violating learned dialog strategies. In this framework, we adopt the curiosity model but design a weight for the curiosity reward to balance exploration and exploitation. We design a multi-candidate agent mechanism that selects an agent with relatively balanced action sampling for formal dialog training, motivating agents to escape pseudo-optimal actions in the early training stage. In addition, we propose, for the first time, a replacement mechanism that prevents the elected agent from performing poorly in the later stages of training and fully utilizes all the candidate agents. The experimental results show that the adjustable curiosity reward promotes dialog policy convergence. The agent replacement mechanism effectively blocks the training of poorly trained agents, significantly increasing the task's average success rate and reducing the number of dialog turns. In this research, an exploration approach for task-oriented dialog systems is designed to encourage agents to explore the environment through balanced action sampling without deviating significantly from learned dialog strategies. Compared to the baselines, the replaceable curiosity-driven candidate agent exploration approach yields a higher average success rate of 0.714 and a lower average number of turns of 20.6.
KW - curiosity
KW - deep Dyna-Q
KW - dialog management
KW - multi-agent optimization
KW - reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85204514932&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85204514932&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2024.3462719
DO - 10.1109/ACCESS.2024.3462719
M3 - Article
AN - SCOPUS:85204514932
SN - 2169-3536
VL - 12
SP - 142640
EP - 142650
JO - IEEE Access
JF - IEEE Access
ER -