A General Method for Improving the Performance of LSTM and GRU
Written by: Lee Wee Sun, Oct 2019
Gated recurrent neural networks such as LSTM and GRU are highly effective in practice. They use a vector of hidden variables as memory to capture information from the past for use in making current and future predictions. To improve performance, the memory sometimes needs to be enlarged, which means increasing the number of hidden variables. For these architectures, the number of parameters grows at least quadratically with the number of hidden variables, and a larger number of parameters usually means that a larger amount of training data is required for learning.
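To make the scaling concrete, here is a small sketch (using PyTorch, not code from the paper) that counts the parameters of a standard LSTM as the hidden size doubles; the weight matrices have shape roughly (4 x hidden) by (input + hidden), so doubling the hidden size roughly quadruples the parameter count.

```python
# Quick illustration (not from the paper): LSTM parameter count vs hidden size.
import torch.nn as nn

def lstm_param_count(input_size, hidden_size):
    lstm = nn.LSTM(input_size, hidden_size)
    return sum(p.numel() for p in lstm.parameters())

for h in (64, 128, 256):
    print(h, lstm_param_count(input_size=32, hidden_size=h))
# hidden=64  -> ~25k parameters
# hidden=128 -> ~83k parameters
# hidden=256 -> ~297k parameters
```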
A particle filter is an approximation to the Bayes filter that provides an improved approximation to the information from the past without increasing the number of parameters. It does this by approximating the posterior distribution of states given the observations by using a sampled set of states (multiple hypotheses, each particle represents a possible state of the system). The larger the number of particles used, the better the approximation, without changing the parameters representing the system.
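As a rough illustration of the idea (a generic bootstrap particle filter, not the model in the paper), one filtering step predicts, reweights, and resamples a set of particles. Here `transition_fn` and `obs_likelihood_fn` are placeholder model functions assumed for the sketch.

```python
# Generic bootstrap particle filter step (illustrative only).
# Each particle is a hypothesised state; the weights track how well each
# particle explains the observations. More particles give a better posterior
# approximation while the model parameters stay the same.
import numpy as np

def particle_filter_step(particles, weights, observation,
                         transition_fn, obs_likelihood_fn):
    # Predict: propagate each particle through the (stochastic) transition model.
    particles = transition_fn(particles)
    # Update: reweight particles by how likely the observation is under each.
    weights = weights * obs_likelihood_fn(observation, particles)
    weights = weights / weights.sum()
    # Resample: duplicate high-weight particles, drop low-weight ones.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```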
We designed particle filter versions of LSTM and GRU where each particle is a vector of hidden variables of the recurrent neural network. By increasing the number of particles, these networks give an improved approximation of the information from previous time steps without increasing the number of network parameters. The PF-LSTM and PF-GRU architectures are shown below. They use the reparameterization trick and soft resampling to maintain differentiability, allowing end-to-end training.
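The sketch below conveys the flavour of the approach for a GRU, under several simplifying assumptions: it is not the authors' implementation, the particle scorer `obs_weight_net` and the deterministic `nn.GRUCell` transition are stand-ins (the paper uses a stochastic cell update via the reparameterization trick), and only the soft-resampling step is shown.

```python
# Minimal PyTorch sketch of the particle-filter idea applied to a GRU.
# NOT the authors' implementation; names and simplifications are assumptions.
import math
import torch
import torch.nn as nn

class PFGRUSketch(nn.Module):
    def __init__(self, input_size, hidden_size, num_particles=20, alpha=0.5):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)   # shared parameters
        self.obs_weight_net = nn.Linear(hidden_size, 1)   # scores each particle
        self.num_particles = num_particles
        self.alpha = alpha                                 # soft-resampling mix

    def forward(self, x_seq):
        # x_seq: (seq_len, batch, input_size)
        seq_len, batch, _ = x_seq.shape
        K, H = self.num_particles, self.cell.hidden_size
        h = x_seq.new_zeros(batch * K, H)                  # K particles per example
        logw = x_seq.new_full((batch, K), -math.log(K))    # uniform log-weights
        for t in range(seq_len):
            x = x_seq[t].repeat_interleave(K, dim=0)
            h = self.cell(x, h)                            # move every particle
            logw = logw + self.obs_weight_net(h).view(batch, K)   # reweight
            logw = logw - torch.logsumexp(logw, dim=1, keepdim=True)
            # Soft resampling: sample from a mixture of the particle weights and
            # a uniform distribution, then correct the weights so gradients
            # still flow through the weight terms.
            probs = self.alpha * logw.exp() + (1 - self.alpha) / K
            idx = torch.multinomial(probs, K, replacement=True)
            h = (h.view(batch, K, H)
                  .gather(1, idx.unsqueeze(-1).expand(-1, -1, H))
                  .view(batch * K, H))
            logw = (logw.gather(1, idx).exp() / probs.gather(1, idx)).log()
            logw = logw - torch.logsumexp(logw, dim=1, keepdim=True)
        # Predict from the weighted mean of the particles.
        return (logw.exp().unsqueeze(-1) * h.view(batch, K, H)).sum(dim=1)
```

Note that increasing `num_particles` in this sketch adds no parameters: the GRU cell and the particle scorer are shared across all particles.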
Does it work? In our experiments, it worked very well. We tried 10 datasets: 4 regression datasets on stock index prediction (NASDAQ), appliances energy prediction (AEP), and air quality prediction (AIR and PM), as well as 6 classification datasets on activity recognition (UID, LPA, REM, and GAS) and text classification (R52 and MR). PF-LSTM worked as well as or better than LSTM on all datasets. PF-GRU worked as well as or better than GRU on 9 of the 10 datasets.
How does it do compared to other general machine learning methods for improving performance? We compared PF-LSTM and PF-GRU against bagging, a highly effective and commonly used ensemble method for improving predictor performance. As shown in the figures, PF-LSTM and PF-GRU outperform bagging most of the time (points above the diagonal indicate better PF-LSTM/PF-GRU performance on the classification datasets, where higher is better, while points below the diagonal indicate better performance on the regression datasets, where lower is better).
Check out our preprint:
Particle Filter Recurrent Neural Networks, Xiao Ma, Peter Karkus, David Hsu, and Wee Sun Lee
at https://arxiv.org/abs/1905.12885.