Memory-augmented Multimodal RAG for Long-horizon Agents
A research-oriented project direction connecting memory, multimodal retrieval, and reliable long-horizon interaction.
Role: Research direction and proposal development
Confidentiality: Public research direction; no internal company data included.
Motivation
Current agents often lose context when tasks unfold across many steps; the problem compounds when the interaction spans text, screenshots, documents, audio, or video.
Research Question
How can an agent remember useful information, retrieve relevant multimodal context, and complete long-horizon tasks without drifting?
Possible Method
A candidate system might combine:
- Episodic memory for interaction history
- Semantic memory for reusable knowledge
- Multimodal retrieval over documents and media
- A controller that decides what to remember and what to ignore
- Evaluation over task trajectories instead of single-turn answers
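The components above can be sketched in miniature. The following is a minimal illustration, not a proposed implementation: the class names, the token-overlap salience score, and the fixed threshold are all placeholders (a real controller would likely be learned, and retrieval would use multimodal embeddings rather than word overlap).

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    content: str
    modality: str = "text"  # e.g. "text", "image_caption", "audio_transcript"

class EpisodicMemory:
    """Ordered log of everything observed during one task trajectory."""
    def __init__(self):
        self.items: list[MemoryItem] = []

    def append(self, item: MemoryItem) -> None:
        self.items.append(item)

class SemanticMemory:
    """Reusable knowledge, deduplicated by content."""
    def __init__(self):
        self.items: dict[str, MemoryItem] = {}

    def add(self, item: MemoryItem) -> None:
        self.items.setdefault(item.content, item)

def token_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

class Controller:
    """Decides what to remember: logs every step episodically, and
    promotes items salient to the task goal into semantic memory.
    The threshold rule is a stand-in for a learned write policy."""
    def __init__(self, goal: str, threshold: float = 0.2):
        self.goal = goal
        self.threshold = threshold

    def observe(self, item: MemoryItem,
                episodic: EpisodicMemory, semantic: SemanticMemory) -> None:
        episodic.append(item)  # keep the full trajectory
        if token_overlap(self.goal, item.content) >= self.threshold:
            semantic.add(item)  # salient enough to reuse across tasks

def retrieve(query: str, episodic: EpisodicMemory,
             semantic: SemanticMemory, k: int = 3) -> list[MemoryItem]:
    """Rank both memory stores against the query; embeddings would
    replace token_overlap in any multimodal version."""
    pool = episodic.items + list(semantic.items.values())
    return sorted(pool, key=lambda it: token_overlap(query, it.content),
                  reverse=True)[:k]
```

Even this toy version makes the evaluation point concrete: correctness depends on what the controller chose to write many steps earlier, so the system must be scored over whole trajectories rather than single retrieval calls.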
Why It Matters
This direction connects industrial agent problems with research questions in memory, retrieval, multimodal learning, and reliable reasoning.