Exploration is the drumbeat of any intelligent, self-improving system.
Key: Red: visited locations. Yellow: last path taken. Cyan: lowest-density frontier.
Most modern AI systems are built on machine learning, and machine learning achieves nothing without good, comprehensive data about its task. That’s why I’ve written a new paper, Meta-learning to Explore via Memory Density Feedback, demonstrating an improved approach to autonomous exploration. Watch above as the new RL-based agent, which observes only its own current coordinates and outputs movement actions, nonetheless learns to rapidly probe various maze problems, pushing its frontier and returning to less-visited areas along the way.
The innovation behind this agent is its use of memory and feedback to make intelligent decisions even when it has ventured beyond its training distribution. It uses reinforcement learning with an internally generated reward: roughly the negative probability density of what it sees now, given the distribution of what it has seen before. In other words, it is rewarded by unfamiliar experiences. That alone is standard practice for exploration algorithms; where this one differs is that it evaluates the familiarity of what it sees in real time, and that familiarity score, along with the actions it took, is fed back to it as input. With a good running memory of its inputs, the agent can feel its way around the landscape of familiar experiences. On simple maze tasks, the solution it learns is a bit like gradient descent over its own memory distribution.
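To make that loop concrete, here is a minimal sketch in Python. It assumes an unnormalized Gaussian kernel density over the memory and a network input that simply concatenates the observation, the familiarity score, and the previous action; the class name, the bandwidth parameter, and these layout choices are mine for illustration, not the paper’s exact formulation.

```python
import numpy as np

class DensityFeedbackExplorer:
    """Minimal sketch of the density-feedback loop: the intrinsic reward is the
    negative (unnormalized) density of the current observation under a running
    memory of past observations, and that score is fed back as input."""

    def __init__(self, bandwidth=0.5):
        self.memory = []              # running memory of observations
        self.bandwidth = bandwidth    # kernel width (assumed hyperparameter)

    def familiarity(self, obs):
        """Unnormalized kernel density of `obs` given everything seen so far."""
        if not self.memory:
            return 0.0
        mem = np.stack(self.memory)                  # (N, obs_dim)
        sq_dists = np.sum((mem - obs) ** 2, axis=1)  # distance to each memory
        return float(np.exp(-sq_dists / (2 * self.bandwidth ** 2)).mean())

    def step_feedback(self, obs, prev_action):
        """Return the intrinsic reward and the agent's next network input."""
        density = self.familiarity(obs)
        intrinsic_reward = -density                  # unfamiliarity is rewarded
        self.memory.append(obs)
        # The familiarity score and previous action are appended to the
        # observation, letting the policy "feel" the local memory landscape.
        net_input = np.concatenate([obs, [density], prev_action])
        return intrinsic_reward, net_input
```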
This system has several benefits over the preeminent Go-Explore family of algorithms. On tasks with fixed, repeatable paths (the left-hand video), its progress accelerates as it learns more efficient strategies: it learns to maximize the amount of new progress per episode, often making large leaps into new areas. If the task is not repeatable because the agent never restarts from a familiar point (center video), it can use its training, memory, and feedback to backtrack and relocate unexplored paths. If the task is not repeatable because the paths change unpredictably with every episode (right-hand video), the agent learns to rely less on the observations themselves and more on the feedback it gets as it experiments with different actions. In that demo, you can see that as training goes on, its exploration progresses despite unpredictable rearrangements of the walls.
There is another mechanism at play: offline reinforcement learning as planning. The training is just DQN, which involves replaying episodes and incrementing the value of each action that better maximized the temporally discounted* rewards. But when the initial hidden state of the replayed sequences is artificially set to the explorer’s current state, this offline training doubles as planning, synthesizing an action policy that will move the agent from its current location back to the frontier. By storing every episode that broke its own exploration record in a special, separate training database, the agent can re-train and plan on a stable recollection of the entire frontier. Lingering on a single goal direction too long does not cause it to forget previously discovered viable paths.
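Here is a rough sketch of that replay-as-planning update. The GRU-based Q-network, the record_buffer layout, and the function names are my own assumptions for illustration; the essential line is the one that seeds each replayed sequence with the explorer’s current hidden state.

```python
import random
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Assumed architecture for the sketch: a GRU followed by a Q-value head."""
    def __init__(self, in_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, seq, h0):
        out, hn = self.gru(seq, h0)     # seq: (1, T, in_dim), h0: (1, 1, hidden)
        return self.q_head(out), hn     # per-step Q-values, final hidden state

def replay_as_planning(qnet, optimizer, record_buffer, current_hidden,
                       gamma=0.99, batch_size=8):
    """DQN-style replay over record-breaking episodes, with each replayed
    sequence's initial hidden state overridden by the explorer's *current*
    hidden state, so the learned values route from "here" back to the frontier."""
    batch = random.sample(record_buffer, min(batch_size, len(record_buffer)))
    for obs_seq, act_seq, rew_seq in batch:
        # obs_seq: (1, T, in_dim); act_seq, rew_seq: (T,) tensors
        h0 = current_hidden.detach().clone()   # the planning trick: start "here"
        q_all, _ = qnet(obs_seq, h0)           # (1, T, n_actions)
        q_taken = q_all[0, :-1].gather(1, act_seq[:-1, None]).squeeze(1)
        with torch.no_grad():                  # one-step TD targets
            targets = rew_seq[:-1] + gamma * q_all[0, 1:].max(dim=1).values
        loss = nn.functional.mse_loss(q_taken, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```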
While the mazes above are crude, toy examples, they represent the general graph decision-making problem that RL agents face. Take, for instance, the same algorithm deployed to the RL benchmark game Crafter:
Of course, the goal of Crafter is not to sprint around like a headless chicken, so the next step is to decide how best to balance Crafter’s reward system against the intrinsic exploration reward, such that the agent optimally alternates between the two. Once the two are integrated, the agent should train much faster, as its initial attainment of achievements and tools will no longer be left entirely to chance.
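For concreteness, one standard way to combine the two reward streams, though not necessarily the scheme this work will settle on, is a weighted sum with an annealed exploration coefficient:

```python
def mixed_reward(extrinsic, intrinsic, step, beta0=1.0, decay=1e-5):
    """One common mixing scheme (an assumption, not the post's final design):
    weight the intrinsic exploration bonus by a coefficient that decays over
    training, so early learning is exploration-heavy and later learning is
    dominated by Crafter's own achievement rewards."""
    beta = beta0 / (1.0 + decay * step)   # annealed exploration weight
    return extrinsic + beta * intrinsic
```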
*See Sutton and Barto’s Reinforcement Learning: An Introduction for a complete treatment of the RL basics.