
Multi-agent navigation: learning

Jul 23, 2025

This is the second of three parts that describe developments related to multi-agent navigation carried out by SUPSI in the context of REXASI-PRO.

Machine learning (ML) has three primary applications in robotics navigation: perception (models that extract an environment description, such as the position of obstacles, from raw sensor readings), policies (models that generate actions from observations), and assessment (models that extract a robust and/or interpretable assessment from a large array of simulations). Navigation policies are particularly noteworthy for REXASI-PRO, as they offer an alternative to model-based navigation algorithms that may increase performance at the cost of reducing transparency.

There are two methods for training navigation policies: Imitation Learning and Reinforcement Learning. To investigate both approaches in simulation, we developed and released Navground Learning, a Python extension to train and deploy navigation policies in Navground. It supports mixed groups where some agents learn an ML policy, while others employ model-based behaviors.

Imitation Learning

Train a model using supervised learning to replicate actions collected from an expert, which can be a human or an algorithm. In the context of REXASI-PRO, DFKI has been training ML models to imitate expert wheelchair drivers using data collected in the real world. Human drivers are well aware of their surroundings and of social norms; therefore, replicating their actions should result in intelligent wheelchairs that behave appropriately in public spaces. SUPSI, on the other hand, has investigated imitating model-based navigation algorithms in simulation. This approach is useful when learning a policy that operates on partial information (e.g., with a reduced field of view or using low-level sensor readings) while the expert algorithm has access to more comprehensive information. For instance, we can train a policy that operates on laser scans to imitate a model-based behavior that operates on the precise position and shape of obstacles.
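To make the idea concrete, here is a minimal, self-contained sketch of the supervised-learning core of imitation learning. The "expert" (a proportional controller) and the linear policy are toy stand-ins chosen for illustration, not the project's actual models or the Navground API:

```python
import random

# Hypothetical "expert": a proportional controller that accelerates
# toward a goal at x = 0 (a stand-in for a model-based behavior).
def expert_action(position):
    return -0.5 * position

# 1. Collect (observation, action) pairs by querying the expert.
random.seed(0)
dataset = [(x, expert_action(x)) for x in (random.uniform(-10, 10) for _ in range(200))]

# 2. Fit a linear policy a = w * x by supervised learning
#    (stochastic gradient descent on the squared imitation error).
w = 0.0
lr = 0.01
for _ in range(100):
    for x, a in dataset:
        w -= lr * (w * x - a) * x  # gradient of 0.5 * (w*x - a)^2

# The learned weight should approach the expert's gain of -0.5,
# so the policy reproduces the expert's actions on new observations.
```

In the real setting the observations are sensor readings (e.g., laser scans) and the policy is a neural network, but the training loop follows the same pattern.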

Reinforcement Learning

Train a policy without predefined action examples, optimizing a reward instead. Suitable actions are identified during the learning process from trials that yield a larger reward. For instance, an agent may learn to avoid obstacles by discovering that moving too close incurs a penalty. In REXASI-PRO, we investigated single-agent and multi-agent Reinforcement Learning for navigation, focusing on smart wheelchairs and other mobile robots moving among people and static obstacles.
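The obstacle-penalty idea from the paragraph above can be sketched with tabular Q-learning on a toy corridor. This is a deliberately simplified illustration of the reward-driven learning loop, not the Soft Actor-Critic setup or Navground environment used in the project:

```python
import random

random.seed(1)

# Toy corridor: cells 0..5, an obstacle at cell 0, the goal at cell 5.
# Actions: 0 = step left, 1 = step right.
N, OBSTACLE, GOAL = 6, 0, 5
Q = [[0.0, 0.0] for _ in range(N)]
alpha, gamma, eps = 0.5, 0.9, 0.2

def step(s, a):
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    if s2 == OBSTACLE:
        return s2, -1.0, True   # moving into the obstacle incurs a penalty
    if s2 == GOAL:
        return s2, +1.0, True   # reaching the goal yields a reward
    return s2, 0.0, False

for _ in range(500):  # training episodes
    s = random.randint(1, N - 2)
    done = False
    while not done:
        # epsilon-greedy exploration: mostly exploit, sometimes try a random action
        a = random.randint(0, 1) if random.random() < eps else max((0, 1), key=lambda u: Q[s][u])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])  # temporal-difference update
        s = s2

# After training, the greedy policy in every interior cell moves right,
# i.e. away from the obstacle and toward the goal.
policy = [max((0, 1), key=lambda u: Q[s][u]) for s in range(1, N - 1)]
```

No example of a correct action is ever provided: the agent discovers that moving left (toward the obstacle) is penalized and adjusts its value estimates accordingly.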

Let us recall the scenario presented in the initial part of this series, wherein four smart wheelchairs use a model-based behavior to navigate an area crowded with people. By utilizing Navground Learning, we can replace the behavior with a policy to be learned by the smart wheelchairs. This requires just a few steps in a Python script:

  1. loading the original scenario;
  2. creating a training environment where smart wheelchairs use a sensor and a policy to output accelerations;
  3. training a policy, in this instance using the Soft Actor-Critic (SAC) Reinforcement Learning algorithm.

With this configuration, all wheelchairs share the same policy, which is trained in parallel, using the experience gathered by all wheelchairs. Once the policy is trained, we can select it in Navground as an additional navigation behavior to perform experiments or record a video, like the one below. The video captures the policy at the beginning (left), in the middle (center), and at the end (right) of the training process.
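The parameter-sharing scheme described above can be illustrated with a toy example: one set of policy weights is updated with the experience gathered by every agent, so a group of agents accumulates training data faster than a single one. The regression target below is a hypothetical stand-in for the navigation task, not the actual SAC training in Navground Learning:

```python
import random

random.seed(2)

# All agents share ONE policy (here, a single weight w) and every agent's
# experience contributes to the same update. The "expert" target a* = -0.5 * x
# is a hypothetical stand-in for the reward-optimal navigation action.
def train(n_agents, steps_per_agent, lr=0.005):
    w = 0.0
    for _ in range(steps_per_agent):
        for _agent in range(n_agents):  # each agent contributes one transition per step
            x = random.uniform(-10, 10)
            a_star = -0.5 * x
            w -= lr * (w * x - a_star) * x  # shared-policy update
    return w

# Four agents pooling experience see 4x the data in the same number of
# environment steps, so the shared policy converges faster.
w_shared = train(n_agents=4, steps_per_agent=50)
w_single = train(n_agents=1, steps_per_agent=50)
```

This is why training the four wheelchairs in parallel on one shared policy is more sample-efficient than training each wheelchair independently.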

The documentation of Navground Learning includes several reproducible tutorials that explore diverse scenarios. Notably, one tutorial focuses on the impact of training a single agent versus multiple agents simultaneously. We present a concise summary here to hint at the type of research questions we investigated.

Consider a scenario involving two groups of agents, similar to the previous one. One group (smart wheelchairs, in this case) learns an ML policy, while the other group (humans) uses a model-based navigation behavior. Let us start by training a single wheelchair to move among numerous humans. Will it exhibit superior performance compared to the humans? How will the humans’ navigation be affected?

As might be expected, the answer depends on the method used to train the policy. Policies trained with Reinforcement Learning lead to more efficient navigation for the wheelchair itself but are detrimental to the humans. This is because they learn to exploit the fact that the model-based navigation algorithms (used by the simulated humans) will avoid a collision by themselves if needed. Conversely, policies trained with Imitation Learning to imitate that same model-based behavior cause all the agents to behave in a similar manner. This distinction is visible in the following video, showcasing green wheelchairs using Imitation Learning and blue wheelchairs using Reinforcement Learning (at the end of the training process). The leftmost column, with a single wheelchair, corresponds to the training environment. In this scenario, the blue wheelchair navigates faster and straighter than the green one.

The other columns test the same policies in scenarios with a larger number of wheelchairs, which were not included in the training. We observe how this mismatch between training and test conditions leads to collisions for both policies. Furthermore, the performance of the policy trained with Reinforcement Learning deteriorates more when there are fewer agents (humans) willing to avoid collisions. A policy that is effective for a single agent is not guaranteed to be effective when utilized by multiple agents.

One strategy to address this challenge is to train multiple agents simultaneously, i.e., making them learn to navigate among peers. Will these policies perform better than the policies trained on a single wheelchair? The next video presents the same configuration as the previous one, but with a policy trained in a scenario where there are no humans but numerous wheelchairs learning collectively. Wheelchairs are again colored by the learning method: light green for Imitation Learning and light blue for Reinforcement Learning.

We observe a more consistent behavior across different ratios of humans and wheelchairs, in particular for the policy trained with Reinforcement Learning. Both policies are able to navigate effectively among humans, despite not having been exposed to them during training.