OPTIMIZATION WITH PARITY CONSTRAINTS: FROM BINARY CODES TO DISCRETE INTEGRATION
Stefano Ermon*, Carla P. Gomes*, Ashish Sabharwal+, and Bart Selman*
*Cornell University   +IBM Watson Research Center
UAI 2013

High-dimensional integration
High-dimensional integrals arise in statistics, ML, and physics:
- Expectations / model averaging
- Marginalization
- Partition function / rank models / parameter learning
Curse of dimensionality:

An n-dimensional hypercube with side L has L^n cells (L, L^2, L^3, L^4, ..., L^n): quadrature involves a weighted sum over an exponential number of items (e.g., units of volume).

Discrete Integration
We are given:
- A set of 2^n items
- Non-negative weights w

Goal: compute the total weight (in the figure, size visually represents the weight of each of the 2^n items).
The weight function is compactly specified in factored form (Bayes net, factor graph, CNF, ...).
Example 1: n = 2 variables, sum over 4 items with weights 5, 0, 2, 1.

Goal: compute 5 + 0 + 2 + 1 = 8.
Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 items (intractable by enumeration).
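As a minimal illustration of the discrete integration task, the slide's Example 1 can be summed by brute force (the assignment of the four weights to configurations is hypothetical; the slide only gives the multiset of weights):

```python
from itertools import product

# Example 1 from the slide: n = 2 variables, 4 items with weights 5, 0, 2, 1.
# (Which weight goes with which configuration is our hypothetical choice.)
w = {(0, 0): 5, (0, 1): 0, (1, 0): 2, (1, 1): 1}

# Discrete integration: sum the weights of all 2^n configurations.
Z = sum(w[x] for x in product([0, 1], repeat=2))
print(Z)  # 5 + 0 + 2 + 1 = 8
```

For n = 100 the same loop would range over 2^100 configurations, which is exactly the intractability the slide points out.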

Hardness
0/1 weights case:
- Is there at least one item of weight 1? SAT, which is NP-complete.
- How many items of weight 1 are there? #SAT, which is #P-complete: much harder.
Complexity landscape: P ⊆ NP ⊆ PH ⊆ P^#SAT ⊆ PSPACE ⊆ EXP.

General weights:
- Find the heaviest item (combinatorial optimization, MAP inference): comparatively easy.
- Sum all the weights (discrete integration): hard.
[ICML-13] WISH: Approximate Discrete Integration via Optimization, e.g., the partition function via MAP inference. MAP inference is often fast in practice thanks to relaxations / bounds and pruning.

WISH: Integration by Hashing and Optimization
- Outer loop over the n variables
- MAP inference on the model augmented with random parity constraints
- Repeat O(log n) times
- Aggregate the MAP inference solutions
The original graphical model over n binary variables {0,1}^n becomes an AUGMENTED MODEL:

parity check nodes enforce A x = b (mod 2). The algorithm requires only O(n log n) MAP queries to approximate the partition function within a constant factor.

How the algorithm works (visually)
Start from the function to be integrated; add 1 random parity constraint, then 2 random parity constraints,

then 3 random parity constraints, and so on, up to n. Each level i is repeated O(log n) times; with M_0 the mode (unconstrained MAP value) and M_i the median MAP value under i random parity constraints, the estimate is

Z ≈ M_0 + M_1·1 + M_2·2 + M_3·4 + ... = M_0 + Σ_{i=1}^{n} M_i · 2^{i-1}

Accuracy Guarantees
Theorem [ICML-13]: With probability at least 1 - δ (e.g., 99.9%), WISH computes a 16-approximation of the partition function (discrete integral) by solving Θ(n log n) MAP inference queries (optimization).
Theorem [ICML-13]: The approximation factor can be improved to (1 + ε) by adding extra variables and factors. Example: a factor-2 approximation with 4n variables.
Remark: faster than enumeration only when the combinatorial optimization is efficient.
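A minimal sketch of this loop, with a brute-force MAP oracle standing in for a real solver (the names `wish` and `map_with_parity` are ours, not from the paper's code):

```python
import random
from itertools import product
from statistics import median

def map_with_parity(weight, n, A, b):
    """Brute-force MAP oracle: max of weight(x) over x in {0,1}^n
    satisfying the parity constraints A x = b (mod 2)."""
    best = 0.0
    for x in product([0, 1], repeat=n):
        if all(sum(a * xi for a, xi in zip(row, x)) % 2 == bi
               for row, bi in zip(A, b)):
            best = max(best, weight(x))
    return best

def wish(weight, n, t=7, seed=0):
    """WISH estimate: for each i = 0..n, take the median over t draws of
    the MAP value under i random parity constraints, then aggregate as
    M_0 + sum_i M_i * 2^(i-1)."""
    rng = random.Random(seed)
    M = []
    for i in range(n + 1):
        vals = []
        for _ in range(t):
            # Random hash: A in {0,1}^{i x n}, b in {0,1}^i
            A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(i)]
            b = [rng.randint(0, 1) for _ in range(i)]
            vals.append(map_with_parity(weight, n, A, b))
        M.append(median(vals))
    return M[0] + sum(M[i] * 2 ** (i - 1) for i in range(1, n + 1))
```

On the slide's n = 2 example (weights 5, 0, 2, 1, so Z = 8), the unconstrained query returns the mode 5, and the aggregated estimate falls within the guaranteed constant factor of Z.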

Summary of contributions
Introduction and previous work:
- WISH: Approximate Discrete Integration via Optimization; partition function / marginalization via MAP inference, with accuracy guarantees.
MAP inference subject to parity constraints:
- Tractable cases and approximations
- Integer Linear Programming formulation
- New family of polynomial-time (probabilistic) upper and lower bounds on the partition function that can be iteratively tightened (reaching a constant-factor approximation).
Sparsity of the parity constraints:
- Techniques to improve solution time and bound quality
- Experimental improvements over variational techniques

MAP INFERENCE WITH PARITY CONSTRAINTS
Hardness, approximations, and bounds

Making WISH more scalable
Would approximations to the optimization (MAP inference with parity constraints) be useful? YES: bounds on MAP (optimization) translate to bounds on the partition function Z (discrete integral):
- Lower bounds on MAP (local search) give lower bounds on Z.
- Upper bounds on MAP (LP / SDP relaxations) give upper bounds on Z.
- Constant-factor approximations of MAP give constant-factor approximations of Z.
Question: are there classes of problems where we can efficiently approximate the optimization (MAP inference) in the inner loop of WISH?

Error correcting codes
Communication over a noisy channel:

Alice sends x = 0100|1 through a noisy channel; the redundant parity check bit is 0 XOR 1 XOR 0 XOR 0 = 1.
Bob receives y = 0110|1. Recomputing the parity, 0 XOR 1 XOR 1 XOR 0 = 0, which does not match the received check bit 1.
Bob: there has been a transmission error! What was the message actually sent by Alice?
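The parity-check computation in this Alice/Bob example can be sketched as:

```python
def parity(bits):
    """XOR of the message bits: the redundant parity check bit."""
    p = 0
    for b in bits:
        p ^= b
    return p

# Alice appends the parity bit 0 XOR 1 XOR 0 XOR 0 = 1 to her message.
msg = [0, 1, 0, 0]
codeword = msg + [parity(msg)]       # 0100|1

# The noisy channel flips the third bit; Bob receives 0110|1.
received = [0, 1, 1, 0, 1]

# Bob recomputes the parity: 0 XOR 1 XOR 1 XOR 0 = 0 != 1 -> error detected.
error_detected = parity(received[:-1]) != received[-1]
```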

The decoded message must be a valid codeword, as close as possible to the received message y.

Decoding a binary code
Maximum-likelihood (ML) decoding: recover the transmitted x = 0100|1 from the received y = 0110|1. ML decoding corresponds to a graphical model of the noisy channel; our case is a more general one.

In ML decoding the transmitted string must be a codeword (parity check nodes); in our case the channel is replaced by a more complex probabilistic model, again with parity check nodes. MAP inference here is NP-hard to approximate within any constant factor [Stern, Arora, ...]. Yet LDPC codes are routinely decoded in practice: 10GBase-T Ethernet, Wi-Fi 802.11n, digital TV, ...

Max w(x) subject to A x = b (mod 2): equivalent to MAP inference on the augmented model.

Decoding via Integer Programming
MAP inference subject to parity constraints can be encoded as an Integer Linear Program (ILP):
- Standard MAP encoding
- Compact (polynomial-size) encoding of the parity polytope by Yannakakis for the parity constraints
- LP relaxation (relax the integrality constraints) gives polynomial-time upper bounds
ILP solving strategy: cuts + branching + LP relaxations. Solving a sequence of LP relaxations yields upper and lower bounds that improve over time.
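As a small sanity check of the idea (this uses the textbook integer encoding of a single parity constraint, not the compact Yannakakis formulation from the slide): a parity constraint a·x = b (mod 2) becomes the linear constraint a·x = b + 2k for a non-negative integer slack k, which an ILP solver handles directly.

```python
from itertools import product

def parity_as_integer_constraint(row, b, x):
    """Feasibility of the linear encoding sum_j row[j]*x[j] = b + 2k
    (k >= 0 integer) of the parity constraint row . x = b (mod 2)."""
    s = sum(a * xi for a, xi in zip(row, x))
    return s >= b and (s - b) % 2 == 0

row, b = [1, 1, 1, 0], 1
# The linear encoding accepts exactly the assignments of the right parity:
ilp_feasible = [x for x in product([0, 1], repeat=4)
                if parity_as_integer_constraint(row, b, x)]
mod2_feasible = [x for x in product([0, 1], repeat=4)
                 if sum(a * xi for a, xi in zip(row, x)) % 2 == b]
```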

Iterative bound tightening
Polynomial-time upper and lower bounds on MAP that are iteratively tightened over time. Recall: bounds on the optimization (MAP) give (probabilistic) bounds on the partition function Z, a new family of bounds. When MAP is solved to optimality (lower bound = upper bound), WISH guarantees a constant-factor approximation of Z.

SPARSITY OF THE PARITY CONSTRAINTS
Improving solution time and bound quality

Inducing sparsity

Observations:
- Problems with sparse A x = b (mod 2) are empirically easier to solve (similar to Low-Density Parity Check codes).
- The quality of the LP relaxation depends on A and b, not just on the solution space: elementary row operations (e.g., summing two equations) do not change the solution space but affect the LP relaxation.
Procedure:
1) Reduce A x = b (mod 2) to row-echelon form with Gaussian elimination (linear equations over a finite field).
2) Greedily apply elementary row operations, yielding an equivalent but sparser set of parity check nodes.

Improvements from sparsity
The quality of the LP relaxations significantly improves.
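Step 1 of the procedure, Gaussian elimination over GF(2), can be sketched as follows (a straightforward dense implementation, not the paper's code):

```python
def row_echelon_gf2(A, b):
    """Reduce A x = b (mod 2) to row-echelon form by Gaussian elimination.
    Each XOR of two rows is an elementary row operation: it preserves the
    solution set but can change (and often sparsify) the system."""
    A = [row[:] for row in A]
    b = b[:]
    n = len(A[0]) if A else 0
    pivot_row = 0
    for col in range(n):
        # Find a row at or below pivot_row with a 1 in this column.
        for r in range(pivot_row, len(A)):
            if A[r][col]:
                A[pivot_row], A[r] = A[r], A[pivot_row]
                b[pivot_row], b[r] = b[r], b[pivot_row]
                break
        else:
            continue
        # Eliminate this column from every other row (XOR of equations).
        for r in range(len(A)):
            if r != pivot_row and A[r][col]:
                A[r] = [x ^ y for x, y in zip(A[r], A[pivot_row])]
                b[r] ^= b[pivot_row]
        pivot_row += 1
    return A, b
```

For example, the system {x1+x2 = 1, x1+x2+x3 = 0} reduces to the sparser, equivalent {x1+x2 = 1, x3 = 1}.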

The solver finds integer solutions faster (better lower bounds); without sparsification, it fails to find integer solutions (lower bounds) at all. The upper bound also improves. (Figure: improvements from sparsification using the IBM CPLEX ILP solver on a 10x10 Ising grid.)

Generating sparse constraints
WISH is based on universal hashing: randomly generate A in {0,1}^{i x n} and b in {0,1}^i, and optimize over the solutions of A x = b (mod 2) (parity constraints). Then A x + b (mod 2) is:

- Uniform over {0,1}^i
- Pairwise independent: given distinct variable assignments x and y, the events A x = b (mod 2) and A y = b (mod 2) are independent.
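For a tiny case these two properties can be verified exactly by enumerating every dense pair (A, b), rather than sampling (a brute-force check of the definitions, with our own helper names):

```python
from itertools import product

def dot_mod2(row, x):
    return sum(a * xi for a, xi in zip(row, x)) % 2

def satisfies(A, b, x):
    return all(dot_mod2(row, x) == bi for row, bi in zip(A, b))

n, i = 3, 2
x = (1, 0, 1)
y = (0, 1, 1)

# Enumerate every A in {0,1}^{i x n} and b in {0,1}^i exactly.
hits_x = hits_y = hits_both = total = 0
for bits in product([0, 1], repeat=i * n + i):
    A = [list(bits[r * n:(r + 1) * n]) for r in range(i)]
    b = list(bits[i * n:])
    total += 1
    sx, sy = satisfies(A, b, x), satisfies(A, b, y)
    hits_x += sx
    hits_y += sy
    hits_both += sx and sy

# Uniformity: Pr[Ax = b] = 2^-i for each fixed assignment.
# Pairwise independence: Pr[Ax = b and Ay = b] = 2^-2i
#                        = Pr[Ax = b] * Pr[Ay = b] for x != y.
```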

Suppose we instead generate a sparse matrix A, with at most k variables per parity constraint (up to k ones per row of A). Then A x + b (mod 2) is still uniform, but no longer pairwise independent. E.g., for k = 1, A x = b (mod 2) is equivalent to fixing i variables: lots of correlation (knowing A x = b tells me a lot about A y = b).

Using sparse parity constraints
Theorem: With probability at least 1 - δ (e.g., 99.9%), WISH with sparse parity constraints computes an approximate lower bound on the partition function.
PRO: easier MAP inference queries. For example, random parity constraints of length 1 (on a single variable) are equivalent to MAP with some variables fixed.
CON: we lose the upper bound part; the output can underestimate the partition function.
CON: no constant-factor approximation anymore.

MAP with sparse parity constraints
Evaluation of MAP inference with sparse constraints on a 10x10 attractive Ising grid and a 10x10 mixed Ising grid: ILP and Branch & Bound outperform message-passing methods (BP, MP, and MPLP).

Experimental results
ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW).

Experimental results (2)
ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW).

Conclusions
- [ICML-13] WISH: discrete integration reduced to a small number of optimization instances (MAP inference), with strong (probabilistic) accuracy guarantees; MAP inference is still NP-hard.
- Scalability: approximations and bounds; connection with maximum-likelihood decoding; ILP formulation + sparsity (Gaussian-elimination sparsification & universal hashing).
- New family of probabilistic, polynomial-time computable upper and lower bounds on the partition function that can be iteratively tightened (reaching a constant-factor approximation).
Future work: extension to continuous integrals and variables; sampling from high-dimensional probability distributions.

Extra slides