Written by: Lee Wee Sun, Oct 2019

Finding the optimal solution to most interesting AI problems is computationally intractable. Herbert Simon coined the term satisficing to describe how decision makers act when they are unable to find the optimal solution. So, what should AI researchers do?

One reason AI problems are often considered intractable is that we try to solve the worst-case instances. Perhaps we should instead solve the typical cases and try to do well on average. Machine learning provides excellent tools for the average case: machine learning algorithms typically try to work well on a training set, and if the learned solution generalizes well, it will also work well on the true distribution.

There are various ways we can try to learn effective algorithms that run efficiently; we will use inference in probabilistic graphical models as the running example in this discussion. One approach is to restrict the class of models to those that can be solved efficiently, then learn the model parameters. For example, we may learn tree-structured models or sum-product networks; these models have efficient inference algorithms, so once the model parameters have been learned, we can do inference efficiently on new instances. The drawback of this approach is that its effectiveness may be limited to problems that are well approximated by models with tractable inference algorithms. An alternative is to leave the model unrestricted but use an approximate inference algorithm, for example loopy belief propagation. We still learn the model parameters, and if the approximate inference algorithm is effective on the training set with the learned parameters, we expect it to work well on new instances. In this case, the drawback is that the approximate inference algorithm may only be effective on a restricted set of problems.

We would like to take another approach: try to learn a better approximate inference algorithm. How would we know that it is better? One way is to show that the algorithm we are comparing against is a special case of the algorithm we are learning, i.e. we parameterize a class of algorithms such that, for some setting of the parameters, we recover the algorithm we are comparing with. So, if we learn the best algorithm within the parameterized class, we should get a better algorithm (or at least one that is equally good). Given the powerful approximation capabilities of neural networks and their practical successes, we would like to parameterize the algorithms with neural networks. In our recent work, we took this approach with the Max Product Loopy Belief Propagation algorithm.

We will describe the graphical model inference problem and the Max Product algorithm first. We will learn approximate inference algorithms for Markov random fields (MRF); when the distribution is conditioned on the inputs, these models are called conditional random fields (CRF). An MRF is a probability distribution of the form

\[ p(x) = \frac{1}{Z} \prod_{c \in C} f_c(x_c) \]

where \(x\) is a set of random variables, \(x_c\) is a subset of the variables in \(x\), and \(Z\) is a normalizing constant. An MRF is conveniently specified using a factor graph, a bipartite graph \(G = (V, C, E)\) where \(V\) is the set of variables \(x_i\), \(C\) is the set of factors, which are the functions \(f_c\), and there is an edge between \(x_i\) and \(f_c\) if \(f_c\) depends on \(x_i\). In the example shown below, \(f_1\) depends on \(x_1\), \(x_2\), and \(x_3\), while \(f_2\) depends on \(x_3\).
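To make the factor graph concrete, here is a minimal sketch (in Python with NumPy, not code from the paper) of the example above as a data structure; the state-space sizes and factor tables are hypothetical placeholders.

```python
import numpy as np

# A minimal sketch of the example factor graph: f1 depends on (x1, x2, x3)
# and f2 depends on x3 alone. State sizes and tables are made-up placeholders.
num_states = {"x1": 2, "x2": 2, "x3": 3}

factors = {
    "f1": {"vars": ["x1", "x2", "x3"], "table": np.random.rand(2, 2, 3)},  # f1(x1, x2, x3)
    "f2": {"vars": ["x3"], "table": np.random.rand(3)},                    # f2(x3)
}

def unnormalized_prob(assignment):
    """Product of all factor values at a full assignment; p(x) is this divided by Z."""
    p = 1.0
    for f in factors.values():
        p *= f["table"][tuple(assignment[v] for v in f["vars"])]
    return p

print(unnormalized_prob({"x1": 0, "x2": 1, "x3": 2}))
```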

The aim of inference is to find the most likely joint assignment

\[ x^* = \arg\max_x p(x) = \arg\max_x \prod_{c \in C} f_c(x_c). \]

The Max Product algorithm is an iterative message passing algorithm. At each iteration, a message is sent from every variable to each factor that depends on it, and from every factor to each variable it depends on:

\[ m_{i \to c}(x_i) = \prod_{c' \in N(i) \setminus \{c\}} m_{c' \to i}(x_i), \qquad m_{c \to i}(x_i) = \max_{x_c \setminus x_i} f_c(x_c) \prod_{j \in N(c) \setminus \{i\}} m_{j \to c}(x_j), \]

where \(N(i)\) is the set of factors adjacent to variable \(x_i\) and \(N(c)\) is the set of variables adjacent to factor \(f_c\).
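For illustration, the following is a small, self-contained Max Product implementation on the same hypothetical example; the factor tables, state sizes, and iteration count are assumptions made for the sketch, and this is not the paper's code.

```python
import itertools
import numpy as np

# Illustrative Max Product on a toy factor graph: discrete variables, tabular
# factors, synchronous message updates for a fixed number of iterations.
num_states = {"x1": 2, "x2": 2, "x3": 3}
factors = {"f1": (["x1", "x2", "x3"], np.random.rand(2, 2, 3)),
           "f2": (["x3"], np.random.rand(3))}

var_to_fac = {(v, f): np.ones(num_states[v]) for f, (vs, _) in factors.items() for v in vs}
fac_to_var = {(f, v): np.ones(num_states[v]) for f, (vs, _) in factors.items() for v in vs}

for _ in range(10):
    # Variable -> factor: product of messages from the variable's other factors.
    for (v, f) in var_to_fac:
        msg = np.ones(num_states[v])
        for (g, u), m in fac_to_var.items():
            if u == v and g != f:
                msg = msg * m
        var_to_fac[(v, f)] = msg / msg.max()  # normalize for numerical stability
    # Factor -> variable: max over the other variables of the factor times incoming messages.
    for (f, v) in fac_to_var:
        vs, table = factors[f]
        i = vs.index(v)
        msg = np.zeros(num_states[v])
        for idx in itertools.product(*(range(num_states[u]) for u in vs)):
            val = table[idx]
            for j, u in enumerate(vs):
                if u != v:
                    val *= var_to_fac[(u, f)][idx[j]]
            msg[idx[i]] = max(msg[idx[i]], val)
        fac_to_var[(f, v)] = msg / msg.max()

# Decode: each variable takes the state maximizing the product of its incoming messages.
x_star = {v: int(np.argmax(np.prod([m for (f, u), m in fac_to_var.items() if u == v], axis=0)))
          for v in num_states}
print(x_star)  # exact MAP here, since this example factor graph is a tree
```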

We design the factor graph neural network (FGNN) to learn a message passing inference algorithm that contains the Max Product algorithm as a special case. The FGNN architecture is based on graph neural networks, which have been quite successful in practice. In each FGNN layer, we pass messages from variables to factors and from factors to variables. By stacking multiple layers together, we form the execution of an algorithm, and we are able to show that the Max Product algorithm can be represented by specific parameter settings of the FGNN.

Each FGNN layer has a set of variable nodes and a set of factor nodes, and each variable and factor node has its own feature vector. In addition, there is a set of edges, connected according to the factor graph, and each edge also has its own feature vector. Each factor receives messages from all the variables it is connected to as follows: the factor and variable features are fed to a neural network that outputs an n by 1 vector, while the features of the edge connecting the factor and variable are fed to another neural network that outputs an m by n matrix. The matrix multiplies the vector to produce an m by 1 message associated with the factor-variable pair. All the m by 1 messages sent by the variables to a factor are then aggregated by the factor node using the max operation to form the new factor features. Messages from factors to variables are sent in the same way. The pseudocode for an FGNN layer is shown on the left, while the architecture for sending a message to a factor (or a variable) is shown on the right in the figure below.
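As a rough illustration of one half of an FGNN layer (variables sending messages to factors), here is a sketch in PyTorch; the module name, MLP widths, tensor layout, and usage shapes are assumptions made for exposition and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class VariableToFactor(nn.Module):
    """Sketch of the variable-to-factor half of an FGNN layer (not the released code).

    For each edge (variable i, factor c):
      - an MLP on [factor feature, variable feature] produces an n-dim vector,
      - an MLP on the edge feature produces an m-by-n matrix,
      - their product is the m-dim message from i to c.
    Each factor takes an elementwise max over its incoming messages.
    """
    def __init__(self, var_dim, fac_dim, edge_dim, n, m):
        super().__init__()
        self.n, self.m = n, m
        self.value_net = nn.Sequential(nn.Linear(fac_dim + var_dim, 64), nn.ReLU(), nn.Linear(64, n))
        self.weight_net = nn.Sequential(nn.Linear(edge_dim, 64), nn.ReLU(), nn.Linear(64, m * n))

    def forward(self, var_feat, fac_feat, edge_feat, edge_index):
        # edge_index: (num_edges, 2) with columns [variable index, factor index].
        vi, ci = edge_index[:, 0], edge_index[:, 1]
        values = self.value_net(torch.cat([fac_feat[ci], var_feat[vi]], dim=-1))  # (E, n)
        weights = self.weight_net(edge_feat).view(-1, self.m, self.n)             # (E, m, n)
        messages = torch.bmm(weights, values.unsqueeze(-1)).squeeze(-1)           # (E, m)
        # Max-aggregate messages at each factor to form the new factor features
        # (scatter_reduce with "amax" requires a recent PyTorch, >= 1.12).
        new_fac = messages.new_full((fac_feat.size(0), self.m), float("-inf"))
        return new_fac.scatter_reduce(0, ci.unsqueeze(-1).expand(-1, self.m), messages, reduce="amax")

# Example usage on the toy graph above (shapes are arbitrary for the sketch):
layer = VariableToFactor(var_dim=8, fac_dim=8, edge_dim=4, n=16, m=8)
var_feat, fac_feat, edge_feat = torch.randn(3, 8), torch.randn(2, 8), torch.randn(4, 4)
edge_index = torch.tensor([[0, 0], [1, 0], [2, 0], [2, 1]])  # (x1,f1), (x2,f1), (x3,f1), (x3,f2)
print(layer(var_feat, fac_feat, edge_feat, edge_index).shape)  # torch.Size([2, 8])
```

The factor-to-variable direction uses the same construction with the roles of the factor and variable features exchanged, and stacking several such layers gives the learned inference procedure.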

Does it work well? On a synthetic inference problem, the learned FGNN is able to outperform Max Product in all settings of the problem we tried. It is also able to outperform a strong approximate inference algorithm, Alternating Direction Dual Decomposition, which is based on a linear programming relaxation, on two of the problem settings, with similar performance on a third setting, as shown below.

We also obtained state-of-the-art results on a 3D point cloud semantic segmentation problem.

Figure from http://stanford.edu/~rqi/pointnet/

From the point of view of graphical model inference algorithms, FGNN can learn more powerful inference algorithms, specialized to the problem distribution. From the point of view of graph neural networks, FGNN allows dependencies among variables, such as constraints involving multiple variables, to be naturally specified; hence we expect it to extend the usefulness of graph neural network type algorithms.

Check out our preprint:

Factor Graph Neural Network. Zhen Zhang, Fan Wu, Wee Sun Lee

at https://arxiv.org/abs/1906.00554.

Source code at: https://github.com/zzhang1987/Factor-Graph-Neural-Network