TY - GEN
T1 - Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks
T2 - 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
AU - Nguyen, Van Quang
AU - Suganuma, Masanori
AU - Okatani, Takayuki
N1 - Funding Information:
This work was partly supported by JSPS KAKENHI Grant Numbers 20H05952 and JP19H01110.
Publisher Copyright:
© 2021 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2021
Y1 - 2021
N2 - There is growing interest in making an embodied AI agent perform a complicated task while following natural language directives and interacting with an environment. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method, which outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative prediction of the action sequence. It then integrates the prediction with the visual information and other inputs, yielding the final prediction of an action and an object. As the class of the object to interact with is identified in the first stage, the method can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction. This contributes to the accurate prediction of actions for navigation. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves a success rate of 4.45% in unseen environments with a single view, which is further improved to 8.37% with multiple views.
AB - There is growing interest in making an embodied AI agent perform a complicated task while following natural language directives and interacting with an environment. Recent studies have tackled the problem using ALFRED, a well-designed dataset for the task, but achieved only very low accuracy. This paper proposes a new method, which outperforms the previous methods by a large margin. It is based on a combination of several new ideas. One is a two-stage interpretation of the provided instructions. The method first selects and interprets an instruction without using visual information, yielding a tentative prediction of the action sequence. It then integrates the prediction with the visual information and other inputs, yielding the final prediction of an action and an object. As the class of the object to interact with is identified in the first stage, the method can accurately select the correct object from the input image. Moreover, our method considers multiple egocentric views of the environment and extracts essential information by applying hierarchical attention conditioned on the current instruction. This contributes to the accurate prediction of actions for navigation. A preliminary version of the method won the ALFRED Challenge 2020. The current version achieves a success rate of 4.45% in unseen environments with a single view, which is further improved to 8.37% with multiple views.
UR - http://www.scopus.com/inward/record.url?scp=85111241133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85111241133&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85111241133
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 923
EP - 930
BT - Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI 2021
A2 - Zhou, Zhi-Hua
PB - International Joint Conferences on Artificial Intelligence
Y2 - 19 August 2021 through 27 August 2021
ER -