Primal-Dual Interior Point Methods

John Mitchell
November 2010

1 Introduction

Our standard linear programming problem is

              T
minx         c x
subject to   Ax   =   b      (P )
               x  ≥   0

Here, c and x are n vectors, b is an m vector, and A is an m×n matrix. The dual problem can be written

max         bT y
    y,s      T
subject to  A  y  +  s  =   c       (D )
                     s  ≥   0

where y is an M-vector and s is the n-vector of dual slacks.

The primal affine scaling algorithm sometimes makes very slow progress when it gets close to a non-optimal corner of the feasible region. One possible remedy is to try explicitly to center the iterates. We can do this by moving in the direction

 center     k
d     := X  PAk e,

where Xk is the diagonal matrix with diagonal entries equal to the components of xk and Ak = AXk. Notice that this direction is trying to increase all components simultaneously. Recall the example where we found the primal affine scaling direction,

min          - x1 -   x2  +   x3  +   x4
subject to    x           +   x           =  1                 (LP 1)
                1              3
                      x2          +   x4  =  2
                      xi  ≥    0       i  =  1,...,4.

At the point xk = (0.8, 0.1, 0.2, 1.9), the centering direction is

dcenter  =  XkPAk  e
           ⌊                    ⌋⌊                                    ⌋⌊    ⌋
           |  0.8   0    0    0  ||  0.059      0     - 0.235     0    ||  1 |
        =  ||  0   0.1   0    0  ||||    0      0.997      0     - 0.052  ||||  1 ||
           ⌈  0    0   0.2   0  ⌉⌈  - 0.235    0      0.941      0    ⌉⌈  1 ⌉
              0    0    0   1.9       0     - 0.052     0      0.003       1
           ⌊        ⌋
              - 0.14
           ||   0.09 ||
        =  |⌈   0.14 |⌉ .
              - 0.09
Notice that this has the effect of increasing x2 and decreasing x1, which does seem to move us towards the middle of the feasible region.

Current commercial implementations combine affine steps with centering and duality in predictor-corrector methods. In addition to excellent computational performance, variants of these methods converge in a number of iterations that is polynomial in the size of the problem.

Notation: We will let e = [1,, 1]T .

2 Log barrier function

In order to make the iterates more centered, we add a penalty term to the objective of (P), giving

                  n
         T       ∑
f(x) := c x  - μ    ln(xi)
                 i=1
(1)

where μ is a positive constant. Note that if xi gets close to zero then the function f(x) gets large. This gives a nonlinear program

              T      ∑n
minx         c x -  μ  i=1 ln(xi)
subject to   Ax   =  b                (P (μ))
              x  ≥   0

Given μ > 0, it can be shown that there is a unique optimal solution to the problem (P(μ)).

For problem (LP1), the barrier function is

                                ∑4
f (x ) = - x1 - x2 + x3 + x4 - μ    ln (xi).
                                i=1

With μ = 3, the value of the barrier function is 11.33 at the point xk = (0.8, 0.1, 0.2, 1.9) and it is 13.32 at the point ¯x = (0.95, 1.95, 0.05, 0.05), close to the optimal solution of (LP1). Conversely, when μ = 0.1, the barrier function takes the values 1.54 and -2.26 at the points xk = (0.8, 0.1, 0.2, 1.9) and ¯x = (0.95, 1.95, 0.05, 0.05), respectively. Thus, for larger choices of μ, the value f(xk) of the more central point is smaller than that of f(¯x), but for smaller choices of μ the results are reversed.

The motivation for this problem is that it should be easier to get an approximate optimal solution to (P(μ)) for a large value of μ, and then this solution can help in finding approximate optimal solutions for smaller values of μ. The limiting solution as μ 0 is an optimal solution to (P).

The gradient of f(x) is c-μX-1e. This eventually leads to a direction that combines the affine direction and a centering direction, as we will see later.

3 Primal-dual scaling and μ-complementary slackness

The conditions for a point ˆx to be optimal for (P(μ)) are closely related to the standard LP optimality conditions. In fact, we have the following theorem:

Theorem 1 Let μ > 0. A point x > 0 is optimal for (P(μ)) if and only if there exist vectors y and s satisfying

     Ax   =   b                         (primal  feasibility)
AT y + s  =   c                         (dual feasibility)
     x s  =   μ   for i = 1,...,n       (μ -  complementary  slackness)
      i i

This theorem makes it clear that there is no particular reason to favor (P) over (D). Typically, algorithms use iterates (xk,yk,sk) of both primal and dual variables. We will assume the iterates satisfy primal and dual feasibility, and work towards achieving μ-complementary slackness (and complementary slackness in the limit as μ 0). Further, we assume xk > 0 and sk > 0. Because both primal and dual iterates are available, a scaling matrix can be used that combines the primal variables x and the dual slacks s. This matrix is

       ⌊ ∘ xk-                 ⌋
       |   s1k1   0    ...   0   |
       ||       ∘ xk- .     .   ||
  k    ||  0      s2k2   ..   ..   ||
D   :=  ||   ..    ..   ..        ||.
       ||   .      .    .  ∘0-- ||
       ⌈                    xkn ⌉
          0     ...   0     skn
(2)

We get Ak = ADk and ck = Dkc. The directions used to update (x,y,s) are

             k       k        -1
Δx   :=   - D PADk D  (s - μX   e)
Δy   :=   (A (Dk )2AT )- 1A(Dk )2(s - μX -1e)

Δs   :=   - AT Δy
The direction Δx is a combination of an affine direction DkP ADkDks and a centering direction DkP ADkDkX-1e. The multiplication by DkP ADkDk serves to rescale the problem, project the direction in the rescaled problem, and scale back to the original space. The term s-μX-1e is the gradient of the log barrier objective function sT x - μ i=1n ln(x i).

The dual direction Δy can be calculated on the way to calculating Δx. It can be shown that Δx = -(Dk)2(s - μX-1e) - (Dk)2Δs.

Note that if x is primal feasible and (y,s) is dual feasible then

sTx =  (c - AT y)Tx =  cTx - bT y.

Therefore, for a given (y,s), it is equivalent to minimize cT x or sT x.

4 The algorithm

  1. Initialize with x0 > 0 feasible in (P) and (y0,s0) feasible in (D) with s0 > 0. Let k = 0.
  2. Let μk = ((xk)T sk)∕n2. (Other scalings are possible.)
  3. Calculate Dk andAk = ADk.
  4. Calculate Δy, Δs, Δx.
  5. Calculate primal and dual steplengths αP and αD, ensuring xk + α P Δx > 0 and sk + α DΔs > 0.
  6. Update xk+1 = xk + α P Δx, yk+1 = yk + α DΔy, sk+1 = sk + α DΔs.
  7. Let k k + 1. Calculate (xk)T sk. If small enough, STOP. Else return to Step 2.

5 An example

The problem

min          - x1  -  x2   +  x3  +   x4
subject to     x1          +  x3          =   1                (LP  1)
                      x2          +   x4  =   2
                       x   ≥   0       i  =   1,...,4
                        i

has dual

max         y1  +  2y2
subject to  y1          +   s1  =   - 1
                    y2  +   s2  =   - 1                        (LD1 )

            y1          +   s3  =    1
                    y2  +   s4  =    1
                            si  ≥    0  i = 1,...,4.

An initial feasible point for (LP1) is x0 = (0.6, 1.5, 0.4, 0.5) and an initial dual feasible solution is y0 = (-5,-2) giving s0 = (4, 1, 6, 3). Note that we have x i0s i0 = 2.4 for i = 1, 3 and x i0s i0 = 1.5 for i = 2, 4. Thus, this point is not centered. The current duality gap is 7.9, so we pick μ = 0.5 (approximately 7.9(42)). We get

       ⌊ √ ----                       ⌋                    ⌊       ⌋
       |   0.15  √ --0       0      0 |                       3.17
  0    ||      0    1.5  ∘ ---0-     0 ||             - 1    ||  0.67  ||
D   =  ||      0      0    1∕15      0 || ,    s - μX    e = |⌈  4.75  |⌉,
       ⌈                        ∘ ----⌉
              0      0       0    1∕6                         2.00

                      [              ]
             0 2 T      0.217    0
and     A (D  ) A   =     0    1.667   .

Further,

                        [       ]
     0 2        -1        0.792
A (D  ) (s - μX    e) =    1.333

so

                                            [       ]
            0 2  T -1    0 2        -1        3.654
Δy  =  (A(D  ) A  )  A (D  ) (s - μX   e) =    0.800

and

                  ⌊ - 3.654 ⌋
                  | - 0.800 |
Δs  =  - AT Δy =  ||         ||
                  ⌈ - 3.654 ⌉
                    - 0.800

Now,

                                            ⌊       ⌋    ⌊         ⌋    ⌊         ⌋
                                              0.475        - 0.548           0.073
            02        - 1       0 2         || 1.000 ||    || - 1.200  ||    ||   0.200 ||
Δx  =  - (D  )(s - μX    e) - (D ) Δs  = -  |⌈ 0.316 |⌉ -  |⌈ - 0.243  |⌉ =  |⌈ - 0.073 |⌉
                                              0.333        - 0.133         - 0.200

. Taking αP = 2 and αD = 1 gives

      ⌊       ⌋                              ⌊       ⌋
        0.746            [         ]           0.346
 1    || 1.900 ||     1      - 1.346      1    || 0.200 ||
x  =  |⌈ 0.254 |⌉ ,  y  =    - 1.200  ,  s  =  |⌈ 2.346 |⌉ .
        0.100                                  2.200

The complementarity values are now

  1 1           1 1            11            1 1
x 1s1 = 0.26,   x2s2 = 0.38,  x 3s3 = 0.60,   x4s4 = 0.22

and the new duality gap is 1.45. The individual complementarities are reasonably close to the target value of μ = 0.5, and the duality gap has been noticeably reduced.

6 Computational effort

Each iteration requires a lot of work. The major effort at each iteration is in calculating the projections. It is not necessary calculate (A(Dk)2AT )-1 explicitly; Gaussian elimination can be used instead.

It seems that an interior point method will solve almost any problem in at most 40 iterations. The number of iterations required grows very slowly as the size of the problem increases. By contrast, the simplex algorithm seems to need approximately 1.5m iterations to solve a problem with m constraints. The work required at each iteration of an interior point method is far larger than the amount of work needed to compute a simplex pivot, so there is a trade-off: an interior point algorithm needs far fewer iterations, but it takes considerably more time per iteration, when compared to the simplex algorithm.

Commercial implementations of interior point methods use algorithms that are similar to the one presented, but they have some more modifications to make them faster.

7 Further reading

Textbooks on interior point methods include [12345].

Current computational results indicate that interior point methods outperform the simplex algorithm on large problems.

References

[1]   S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.

[2]   D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, 3rd edition, 2008.

[3]   C. Roos, T. Terlaky, and J.-Ph. Vial. Interior Point Methods for Linear Optimization. Springer, 2nd edition, 2005.

[4]   R. J. Vanderbei. Linear Programming: Foundations and Extensions. Springer, 3rd edition, 2008.

[5]   S. J. Wright. Primal-dual interior point methods. SIAM, Philadelphia, 1996.