
Postgraduate thesis: Towards intelligent vision-and-language navigation : from structured memorization to aerial exploration

Title: Towards intelligent vision-and-language navigation : from structured memorization to aerial exploration
Author(s): Zhao, Ganlong (趙贛龍)
Advisor(s): Yu, Y
Issue Date: 2025
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Zhao, G. [趙贛龍]. (2025). Towards intelligent vision-and-language navigation : from structured memorization to aerial exploration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: This doctoral dissertation investigates the problem of vision-and-language navigation (VLN), which requires an agent to navigate an environment following natural language instructions. VLN represents a critical frontier in artificial intelligence, as it bridges the gap between linguistic understanding and embodied visual perception, key components for developing autonomous systems capable of seamless human-AI collaboration. It is a challenging problem that requires the agent to understand natural language instructions, perceive the environment, and plan a sequence of actions to reach the target. This dissertation explores VLN from three distinct perspectives: iterative VLN, zero-shot VLN based on Large Language Models (LLMs), and aerial VLN. Each perspective addresses a unique challenge in VLN and contributes to the development of more intelligent and versatile navigation agents.

Iterative VLN introduces a practical paradigm by maintaining the agent's memory across scene tours. While long-term memory better aligns with VLN's persistent nature, it presents challenges in utilizing highly unstructured memory with sparse supervision. This thesis proposes OVER-NAV, which advances iterative VLN by integrating LLMs and open-vocabulary detectors to distill key information and enable cross-modal correspondence, allowing on-the-fly generalization to unseen scenes without re-training. OVER-NAV also introduces Omnigraph, a structured representation that integrates multi-modal information, along with a novel fusion mechanism to improve navigation accuracy. OVER-NAV supports both discrete and continuous environments, demonstrating superior performance in experiments.

LLM-based agents are considered a promising solution for zero-shot VLN, owing to LLMs' impressive reasoning and planning capabilities. However, previous approaches separate image-to-text translation from navigation, creating a gap between visual perception and action planning. Additionally, the growing token count from navigation history and visual inputs degrades performance and increases inference costs. This study presents NavGemini, a purely multi-modal LLM-based navigation agent. NavGemini explores the visual-spatial and multi-modal capabilities of LLMs for VLN tasks, addressing token limits and the shortcomings of existing multi-modal LLMs. Extensive experiments demonstrate the effectiveness of NavGemini.

Aerial VLN aims to develop an unmanned aerial vehicle agent that can navigate 3D environments following human instructions by performing actions in both horizontal and vertical directions. This research proposes GVSMC, a grid-based view selection framework that discretizes the aerial scene and formulates action prediction as a grid-based view selection problem. GVSMC incorporates vertical action prediction, allowing the agent to adjust altitude and follow instructions more effectively. A bird's eye view map fuses navigation history and provides contextual information to reduce the impact of environmental obstacles. Experiments demonstrate the superiority of GVSMC.

In summary, this dissertation contributes to the VLN field by proposing novel methods that address key challenges, thereby advancing the development of more intelligent and versatile navigation agents. Extensive experiments on various VLN benchmark tasks demonstrate the effectiveness of the proposed methods, highlighting their potential for real-world applications.
Degree: Doctor of Philosophy
Subjects: Computer vision; Machine learning
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/360615

 

DC Field: Value
dc.contributor.advisor: Yu, Y
dc.contributor.author: Zhao, Ganlong
dc.contributor.author: 趙贛龍
dc.date.accessioned: 2025-09-12T02:02:07Z
dc.date.available: 2025-09-12T02:02:07Z
dc.date.issued: 2025
dc.identifier.citation: Zhao, G. [趙贛龍]. (2025). Towards intelligent vision-and-language navigation : from structured memorization to aerial exploration. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
dc.identifier.uri: http://hdl.handle.net/10722/360615
dc.language: eng
dc.publisher: The University of Hong Kong (Pokfulam, Hong Kong)
dc.relation.ispartof: HKU Theses Online (HKUTO)
dc.rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.
dc.rights: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Machine learning
dc.title: Towards intelligent vision-and-language navigation : from structured memorization to aerial exploration
dc.type: PG_Thesis
dc.description.thesisname: Doctor of Philosophy
dc.description.thesislevel: Doctoral
dc.description.thesisdiscipline: Computer Science
dc.description.nature: published_or_final_version
dc.date.hkucongregation: 2025
dc.identifier.mmsid: 991045060523503414
