Written by: Lee Wee Sun, Oct 2019

Deep learning methods tend to use distributed representations where information is distributed throughout a network. In contrast, traditional engineering methods tend to decompose systems into modules, clearly separated by interfaces in a way where information can only pass from module to module through the interfaces. How do they compare? What is the role of modularity in deep learning?

We did a case study to help understand these issues in the setting of a robot system. For the system, imagine a mobile delivery robot arriving at a building it has never been to before. It receives a map from the building system which it then uses to navigate to its goal in the building. The map provides only the positions of the walls and doors through which the robot can travel; the robot has to further localize itself on the map and avoid obstacles that are not on the map. And it has to do that using only its vision system; this is a visual navigation task. The system is trained on many simulated training environments and tested on new unseen simulated environments. In the experiments, we used 10,000 environments for training with 5 trajectories for each environment.

Modularity has great engineering advantages. By decomposing the system into components, we can manage the complexity of each component, allowing each component to be designed with reasonable effort. In the case of the robot system, we have a component that processes the visual input to determine whether there are obstacles in the area directly in front of the robot, a component that implements a Bayes filter to localize the robot position, a component that implements the value iteration algorithm for planning on the map (without obstacles), and a local controller to avoid collision with obstacles. For learning in a decomposed system, each component is learned separately and then put together for execution. The system is shown in the figure below.
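To make the localization component more concrete, here is a minimal sketch of one predict-update step of a discrete (histogram) Bayes filter. It is not the code from our system: the 3-cell corridor, the motion model `transition` and the sensor model `likelihood_door` are all made-up values chosen for illustration.

```python
def bayes_filter_step(belief, transition, likelihood):
    """One predict-update step of a discrete (histogram) Bayes filter.

    belief[s]        : prior probability of being in state s
    transition[s][t] : probability of moving from state s to state t
    likelihood[t]    : probability of the current observation in state t
    """
    n = len(belief)
    # Predict: push the belief through the motion model.
    predicted = [sum(belief[s] * transition[s][t] for s in range(n)) for t in range(n)]
    # Update: weight by the observation likelihood, then renormalize.
    unnormalized = [predicted[t] * likelihood[t] for t in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the predicted belief")
    return [u / total for u in unnormalized]


# Hypothetical 3-cell corridor: the robot tries to move right, succeeding with
# probability 0.8 and staying put otherwise (the last cell is absorbing).
transition = [
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
]
# Hypothetical sensor model: a door is seen mostly from the middle cell.
likelihood_door = [0.1, 0.8, 0.1]

belief = [1 / 3, 1 / 3, 1 / 3]  # the robot starts with no idea where it is
belief = bayes_filter_step(belief, transition, likelihood_door)
print(belief)
```

Starting from a uniform belief, a single door observation concentrates most of the probability on the middle cell. Every operation in the step is a sum, a product or a division, which is what makes a filter of this kind easy to differentiate.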

The results? The system did quite poorly, with the robot successfully arriving at its destination only 76.6% of the time. What went wrong? Errors propagate. Each component was trained assuming that its inputs are correct and without considering what the other components would do with its outputs. For this particular system, the errors persist through time as well; the localization errors are likely fixed only when enough informative observations are obtained. All these likely contributed to poor performance.

A possible alternative method to design the system is to use a standard deep learning architecture and train it end-to-end. In this case, we use an LSTM for the controller with a CNN for the vision system. The system is shown below.

This system did even worse, with the robot arriving at its destination only 38.4% of the time. What went wrong? It is possible that we did not use a large enough network to represent a good policy, although we did a reasonable hyper-parameter search over the network structure. It is also possible that the network is sufficiently expressive, but 10,000 environments with 5 trajectories from each environment do not provide enough training data. Or it may be the case that the learning got stuck in local minima. There are no clear guidelines on how to design a good global policy structure; we are using the network as a powerful approximator and can essentially control only the size and depth of different parts of the network.

Can we combine the two methods to gain the advantages of each? What are the advantages that we want? Retaining modular design can be advantageous. It is much easier to design a small component. Furthermore, by limiting the scope of the component, we may be able to use known models and algorithms for the component; in this case, we use the Bayes filter and the value iteration algorithm and their corresponding models. What about the advantages of the deep learning system? One advantage is that it is trained end-to-end, optimizing the desired final objective, instead of intermediate objectives of each component. This allows different parts of the network to compensate for each other for the purpose of optimizing the final objective. How is that useful in this system? It should help to reduce error propagation. This can be particularly important when the algorithms within the components are only approximate. For our problem, we have used a Markov decision process (MDP) algorithm to solve the planning problem when the actual system is a partially observable Markov decision process (POMDP). An MDP assumes that the state is observed; in this case, it means that the robot knows where it is on the map. In a POMDP, the state is not observed and has to be inferred from observations; in this case, the observations are the camera images, and the robot needs to infer its location from a sequence of images. Solving a POMDP is much harder. In fact, it is PSPACE-hard and any practical algorithm would likely be approximate. We have used one of the simplest approximations, using an MDP to approximate the POMDP, and training end-to-end can help to compensate for the use of the approximate algorithm.
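To illustrate what planning with an MDP in a partially observed setting can look like, here is a small sketch. It runs value iteration on a made-up 4-cell corridor, then picks an action by weighting the MDP action values with a belief over states. This belief-weighted rule is a well-known heuristic (often called QMDP) that we use here only as an example of the idea; it is an assumption of the sketch and not necessarily how the planner in our system consumes the belief. The corridor, rewards and helper names are all hypothetical.

```python
def value_iteration(n_states, actions, step, reward, gamma=0.95, iters=100):
    """Return Q[s][i] for a deterministic MDP whose state is fully observed."""
    V = [0.0] * n_states
    Q = [[0.0] * len(actions) for _ in range(n_states)]
    for _ in range(iters):
        for s in range(n_states):
            for i, a in enumerate(actions):
                # Bellman backup: immediate reward plus discounted next-state value.
                Q[s][i] = reward(s, a) + gamma * V[step(s, a)]
        V = [max(q) for q in Q]
    return Q


def act_on_belief(belief, Q, actions):
    """Pick the action with the highest belief-weighted MDP value (QMDP-style)."""
    scores = [sum(b * Q[s][i] for s, b in enumerate(belief)) for i in range(len(actions))]
    return actions[scores.index(max(scores))]


# Hypothetical corridor with 4 cells and the goal in the last cell.
N, GOAL = 4, 3
ACTIONS = [-1, +1]  # move left, move right


def step(s, a):
    # The goal is absorbing; elsewhere the robot moves and is clipped at the walls.
    return s if s == GOAL else min(max(s + a, 0), N - 1)


def reward(s, a):
    # Each step costs 1 until the goal is reached.
    return 0.0 if s == GOAL else -1.0


Q = value_iteration(N, ACTIONS, step, reward)
# The robot is unsure whether it is in cell 0 or cell 1.
chosen = act_on_belief([0.5, 0.5, 0.0, 0.0], Q, ACTIONS)
print(chosen)
```

The planner itself never reasons about uncertainty: it assumes the state is known, and the belief only enters when the action values are averaged. That gap is the kind of approximation that end-to-end training can learn to compensate for.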

Can the composed system be trained end-to-end? Yes, if each component is differentiable, then a gradient-based method can be used. In fact, each component can be viewed as a computational graph and we compose the components by setting the outputs of some components to be the inputs of other components; the result is simply a larger computational graph. The entire graph can then be viewed as a neural network! We call such networks Differentiable Algorithm Networks (DAN). This is shown in the figure below where the result of the composition is treated as a single network and trained end-to-end.
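The idea of composing differentiable modules into one graph can be sketched with a toy example. The two modules below are hypothetical stand-ins with one parameter each, not the filter or planner from our system, and the gradients are written out by hand with the chain rule where a real implementation would rely on an automatic differentiation library.

```python
import math


def module_a(x, w):
    """Toy first module: squashes a scaled input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w * x))


def module_b(h, v):
    """Toy second module: a linear map of the first module's output."""
    return v * h


def loss_and_grads(x, target, w, v):
    # Forward pass through the composed graph: module_a feeds module_b.
    h = module_a(x, w)
    y = module_b(h, v)
    loss = (y - target) ** 2
    # Backward pass: the chain rule carries the final loss back through both modules.
    dloss_dy = 2.0 * (y - target)
    grad_v = dloss_dy * h                   # d(loss)/dv, through module_b only
    dloss_dh = dloss_dy * v                 # gradient crossing the module interface
    grad_w = dloss_dh * h * (1.0 - h) * x   # d(loss)/dw, through both modules
    return loss, grad_w, grad_v


# Train both parameters on the final objective with plain gradient descent.
w, v = 0.5, 0.5
x, target = 1.0, 1.0
losses = []
for _ in range(200):
    loss, grad_w, grad_v = loss_and_grads(x, target, w, v)
    losses.append(loss)
    w -= 0.5 * grad_w
    v -= 0.5 * grad_v

print(losses[0], losses[-1])
```

Neither module has its own training target here. The only supervision is the final loss, and the gradient with respect to `w` passes through `module_b` on its way back, so the first module is shaped by what the second one does with its output.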

Does it work? In this case, yes. The robot reached the goal in 99.8% of the test cases, compared to 76.6% for the individually trained modular system and 38.4% for the non-modular deep learning system.

Performance has improved substantially in this case. But have we lost modularity? The modules are still there, and information is still passed only through the interfaces. We have merely trained the system end-to-end. So the advantages of being simpler to design are still there. In particular, we are able to select architectures that are nearly correct for each module, instead of having to do a global search over generic architectures such as LSTM. However, we may have partially lost interpretability. We learned models for each module as part of end-to-end learning, but the models are trained to compensate for each other, hence the individual modules should no longer be regarded as correct and care should be taken when interpreting these components.

We did multiple experiments to understand what compensation strategies the networks may learn in such systems. We found that the system would learn to plan with macro-actions if the horizon of the planner is not long enough. It learned to compensate for incorrectly specified models: we used a spatially homogeneous transition model for the robot movement when the true transition is spatially varying, but the model learned to use the reward function to change the robot behaviour around different wall configurations. The system learned to take state uncertainty into account, allowing it to work well with a POMDP despite the use of an MDP planner. Finally, the system seamlessly combined the deep neural network for the vision and local controller components with the model-based filter and planner.

Check out the video of the system in action here.

The work has been published at the Robotics: Science and Systems 2019 conference, where it was a Best System Paper Award finalist and Best Student Paper Award finalist:

Differentiable Algorithm Networks for Composable Robot Learning, Peter Karkus, Xiao Ma, David Hsu, Leslie Pack Kaelbling, Wee Sun Lee, and Tomas Lozano-Perez.

You can find the paper at https://arxiv.org/abs/1905.11602.