
arXiv:1903.02675v1 [cs.LG] 7 Mar 2019

Abstract We show that a simple randomized sketch of the matrix multiplicative weights (MMW) update enjoys the same regret bounds as MMW, up to a small constant factor. Unlike MMW, where every step requires full matrix exponentiation, our steps require only a single product of the form e^A b, which the Lanczos method approximates efficiently. Our key technique is to view the sketch as a randomized mirror projection, and perform mirror descent analysis on the expected projection. Our sketch solves the online eigenvector problem, improving the best known complexity bounds. We also apply this sketch to a simple no-regret scheme for semidefinite programming in saddle-point form, where it matches the best known guarantees.

1 Introduction

Consider the problem of online learning over the spectrahedron ∆n, the set of n × n symmetric positive semidefinite matrices with unit trace. At every time step t, a player chooses action Xt ∈ ∆n, an adversary supplies a symmetric gain matrix Gt, and the player earns reward ⟨Gt, Xt⟩ := tr(Gt Xt). We seek to minimize the regret with respect to the best single action (in hindsight),

sup_{X∈∆n} ⟨Σ_{t=1}^T Gt, X⟩ − Σ_{t=1}^T ⟨Gt, Xt⟩ = λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, Xt⟩.  (1)

Matrix multiplicative weights (MMW) (Arora et al., 2012), the matrix analog of the exponentiated gradient algorithm, is the best-known algorithm for this problem. It is given by

Xt = P^mw(η Σ_{i=1}^{t−1} Gi), where P^mw(Y) := e^Y / tr(e^Y),  (2)

and η > 0 is a step-size parameter. If the operator norm ‖Gt‖∞ ≤ 1 for every t, the MMW strategy (2) with η = √(2 log(n)/T) guarantees that the regret (1) is bounded by √(2 log(n)T); this guarantee is minimax optimal up to a constant (Arora et al., 2012).

Unlike standard (vector) multiplicative weights, MMW is computationally expensive to implement in the high-dimensional setting n ≫ 1. This is due to the high cost of computing matrix exponentials; currently they require an eigen-decomposition, which costs Θ(n³) with practical general-purpose methods and Ω(n^ω) in theory (Pan and Chen, 1999). This difficulty has led a number of researchers to consider a rank-k sketch of P^mw of the form

P_U(Y) := e^{Y/2} U Uᵀ e^{Y/2} / ⟨e^Y, U Uᵀ⟩, where U ∈ R^{n×k}  (3)

and the elements of U are i.i.d. standard Gaussian. For k ≪ n, P_U is much cheaper than P^mw to compute, since its computation requires only k products of the form e^A b, which can be evaluated

efficiently via iterative methods (see Section 3). Since we play rank-deficient matrices, an adversary with knowledge of Xt may choose the gain Gt to be in its nullspace, incurring regret linear in T. To rule such an adversary out, we assume that Gt and Xt must be chosen simultaneously. We formalize this as

Assumption A. Conditionally on X1, G1, . . . , Xt−1, Gt−1, the gain Gt is independent of Xt.

This assumption is standard in the literature on adversarial bandit problems (Bubeck and Cesa-Bianchi, 2012), where it is similarly unavoidable. While it comes at a significant loss of generality, Assumption A holds in two important applications, as described below.

The challenge of bias  Assumption A allows us to write

E[⟨Gt, P_{U_t}(η Σ_{i=1}^{t−1} Gi)⟩ | {Gi}_{i=1}^t] = ⟨Gt, E_U P_U(η Σ_{i=1}^{t−1} Gi)⟩.

However, even though U satisfies E_U[U Uᵀ] = I, we have E_U P_U(Y) ≠ P^mw(Y) for general Y. Therefore, the guarantees of MMW do not immediately apply to actions chosen according to the sketch (3), even in expectation. A common solution in the literature (Arora and Kale, 2007; Peng et al., 2016; Allen-Zhu et al., 2016) is to pick k = Õ(1/ε²) such that, by the Johnson-Lindenstrauss lemma, P_U(Y) approximates P^mw(Y) to within multiplicative error ε. This makes the MMW guarantees applicable again, but requires considerable computation per step, which matches the cost of full matrix exponentiation for sufficiently small ε. Garber et al. (2015) and Allen-Zhu and Li (2017) prove regret guarantees for sketches of fixed rank k ≤ 3 with forms different from (3); we discuss their approaches in detail in Section 1.1.

Our approach  In this work we use the sketch (3) with k = 1, playing the rank-1 matrix Xt = P_{u_t}(η Σ_{i=1}^{t−1} Gi), where P_u(Y) = vvᵀ/(vᵀv) for v = e^{Y/2}u, and u_t drawn uniformly from the unit sphere. Instead of viewing P_u as a biased estimator of P^mw, we define the deterministic function

P̄(Y) := E_u P_u(Y),

and view P_u as an unbiased estimator for P̄. Our primary contribution is in showing that

P̄ is nearly as good a mirror projection as P^mw.

More precisely, we show that replacing P^mw with P̄ leaves the regret bounds almost unchanged; if ‖Gt‖∞ ≤ 1 for every t, the actions X̄t = P̄(η Σ_{i=1}^{t−1} Gi) guarantee (with properly tuned η) regret of at most √(6 log(4n)T), worse than MMW by only a factor of roughly 3. To prove this, we establish that P̄ possesses the geometric properties necessary for mirror descent analysis: it is Lipschitz continuous and its associated Bregman divergence is appropriately bounded. Since P_u is, by definition, an unbiased estimator of P̄, we immediately obtain (thanks to Assumption A) that Xt = P_{u_t}(η Σ_{i=1}^{t−1} Gi) satisfies the same regret bound in expectation.
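The bias is real and easy to see numerically: a small Monte Carlo estimate of P̄(Y) = E_u P_u(Y) (the k = 1 case of the sketch (3)) visibly differs from P^mw(Y) even for n = 2. The snippet below is our own illustration, not code from the paper:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 2
Y = np.diag([4.0, 0.0])
Pmw = expm(Y) / np.trace(expm(Y))            # MMW projection (2)

# Monte Carlo estimate of the average projection E_u P_u(Y)
P_bar = np.zeros((n, n))
num_samples = 20000
for _ in range(num_samples):
    u = rng.standard_normal(n); u /= np.linalg.norm(u)  # u ~ Uni(S^{n-1})
    v = expm(Y / 2) @ u
    P_bar += np.outer(v, v) / (v @ v)
P_bar /= num_samples

assert abs(np.trace(P_bar) - 1) < 1e-8       # the average is still in the spectrahedron
assert abs(P_bar[0, 0] - Pmw[0, 0]) > 0.02   # but E_u P_u(Y) != P^mw(Y): the sketch is biased
```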
High-probability bounds follow immediately via martingale concentration.

Application to online eigenvectors  As our sketched actions are of the form Xt = xt xtᵀ, the regret they incur is λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T xtᵀ Gt xt. Therefore, the vectors xt can be viewed as streaming approximations of the principal component of the cumulative matrix Σ_{i=1}^{t−1} Gi. This online counterpart of the classical principal component analysis problem is the topic of a number of prior works (cf. Nie et al., 2013; Garber et al., 2015; Allen-Zhu and Li, 2017). Our sketch offers regret bounds that are optimal up to constants, with computational cost per step as low as any known alternative, and overall computational cost better than any in the literature by a factor of at least log⁵ n (see Section 1.1). As we explain in Section 3, it is also practical and simple to implement.

Application to semidefinite programming (SDP)  Any feasibility-form SDP is reducible to the matrix saddle-point game max_{X∈∆n} min_{y∈σm} ⟨Σ_{i=1}^m [y]_i Ai, X⟩, where σm is the simplex in R^m and A1, . . . , Am ∈ R^{n×n} are symmetric matrices. A simple procedure for approximating a saddle point (Nash equilibrium) for this game is to have each player perform online learning, where the max-player observes gains Gt = Σ_{i=1}^m [yt]_i Ai and the min-player observes costs [ct]_i = ⟨Ai, Xt⟩. Using standard/matrix multiplicative weights for the min/max players, respectively, we may produce approximate solutions with additive error ε in O(log(nm)/ε²) iterations, with each iteration costing O(n³) time, due to the MMW computation. In Section 4 we show that by replacing MMW with our sketch we guarantee error ε in a similar number of iterations, but with each iteration costing Õ(nnz(A)/√ε), where nnz(M) is the number of nonzero elements in matrix M, and nnz(A) = Σ_i nnz(Ai) is often significantly smaller than n². In Section 4.1, we compare the guarantees of this simple primal-dual scheme to others known in the literature, showing that it matches or improves on them in a number of scenarios.

Paper outline  After surveying related work in Section 1.1, we present our main contribution in Section 2: regret bounds for our rank-1 randomized projections P_u and their proof via the geometry of P̄. In Section 3 we describe how to compute Xt in Õ(√(ηt)) matrix-vector products using the Lanczos method.
In Section 4 we describe in detail the application of our sketching scheme to semidefinite programming, as described above. We conclude the paper in Section 5 by discussing a number of possible extensions of our results along with the challenges they present.

1.1 Related work

MMW appears in a large body of work spanning optimization, theoretical computer science, and machine learning (e.g. Nemirovski, 2004; Arora et al., 2012; Nie et al., 2013). Here, we focus on works that, like us, attempt to relieve the computational burden of computing the matrix exponential while preserving the MMW regret guarantees. To our knowledge, the first proposal along these lines is due to Arora and Kale (2007), who apply MMW with a Johnson-Lindenstrauss sketch to semidefinite relaxations of combinatorial problems. Subsequent works on positive semidefinite programming adopted this technique (Peng et al., 2016; Allen-Zhu et al., 2016). To achieve accurate solutions, these works require roughly ε⁻² matrix exponential-vector products per mirror projection.

Baes et al. (2013) apply the accelerated mirror-prox scheme of Nemirovski (2004) to matrix saddle-point problems and approximate P^mw using the rank-k sketch (3). Instead of appealing to the JL lemma, they absorb the bias and variance of this approximation directly into the algorithm's error estimates. This enables a more parsimonious choice of k; to attain additive error ε, they require k = Õ(ε⁻¹). See Section 4.1 for additional discussion of the performance of this method.

A different line of work, called Follow the Perturbed Leader (FTPL) (Garber et al., 2015), eschews matrix exponentiation, and instead produces rank-1 actions Xt = xt xtᵀ, where xt is an approximate top eigenvector of a random perturbation of Σ_{i=1}^{t−1} Gi. While a single eigenvector computation has roughly the same cost as a single matrix-exponential-vector product, the regret of FTPL (and hence also the total work) scales polynomially in the problem dimension n; it is

Õ(√(nT)) in general and improvable to Õ(n^{1/4}√T) for rank-1 updates (Dwork et al., 2014). In contrast, the regret of MMW and its sketches depends on n only logarithmically.

Allen-Zhu and Li (2017) give the first fixed-rank sketch with MMW-like regret, proposing a scheme called Follow the Compressed Leader (FTCL). Their approach is based on replacing the MMW mirror projection (2) with the projection corresponding to ℓ_{1−1/q} regularization, given by P^{q-reg}(Y) := (c(Y)I − Y)^{−q}, where c(Y) is the unique c ∈ R such that cI − Y ≻ 0 and tr[(cI − Y)^{−q}] = 1. They use a sketch of P^{q-reg} similar in spirit to (3) and prove that k = 3 suffices to obtain regret bounds within a polylogarithmic factor of MMW, with q chosen to be roughly log n.

The basis of the FTCL proof strategy is a potential argument used to derive regret bounds for the exact P^{q-reg}. Their analysis consists of carefully tracing this argument, and accounting for the errors caused by sketching in each step of the way. In comparison, we believe our analysis is more transparent; rather than control multiple series expansion error terms, we establish three simple geometric properties of our projection P̄. We also provide tighter bounds; to guarantee ε average regret, FTCL requires a factor of Θ(log⁵(n/ε)) more steps than our method. The per-step computational cost of our method is similar to that of FTCL, with better polylogarithmic dependence on n. On a practical note, the computational scheme we describe in Section 3 is significantly simpler to implement than the one proposed for FTCL.

1.2 Notation

We use upper-case letters for matrices and lower-case letters for vectors and scalars. We let Sn denote the set of symmetric n × n matrices, and let ∆n := {X ∈ Sn | X ⪰ 0, tr X = 1} denote the spectrahedron. We let ⟨Y, X⟩ = tr(Yᵀ X) denote the Frobenius inner product between matrices. For X ∈ Sn, we let λmax(X) = λ1(X) ≥ λ2(X) ≥ · · · ≥ λn(X) = λmin(X) denote the eigenvalues of X sorted in descending order. For x ∈ Rn and p ≥ 1 we let ‖x‖p = (Σ_{i=1}^n |x_i|^p)^{1/p} denote the ℓp norm, and for X ∈ Sn, we let ‖X‖p := ‖λ(X)‖p be the standard Schatten p-norm. In particular, ‖X‖∞ = max{λmax(X), −λmin(X)} is the Euclidean operator norm and ‖X‖1 = Σ_{i=1}^n |λi(X)| is the nuclear norm. We write Uni(S^{n−1}) for the uniform distribution over the unit sphere in Rn.
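As a quick numerical illustration of this notation, the Schatten p-norms reduce to vector ℓp norms of the eigenvalues (the helper function below is ours, for illustration only):

```python
import numpy as np

def schatten(X, p):
    """Schatten p-norm ||X||_p = ||lambda(X)||_p for symmetric X."""
    lam = np.linalg.eigvalsh(X)
    if np.isinf(p):
        return np.max(np.abs(lam))              # operator norm
    return np.sum(np.abs(lam) ** p) ** (1 / p)

X = np.diag([0.5, -0.25, 0.25])
assert np.isclose(schatten(X, np.inf), 0.5)     # largest |eigenvalue|
assert np.isclose(schatten(X, 1), 1.0)          # nuclear norm: sum of |eigenvalues|
```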

2 A rank-1 sketch of matrix multiplicative weights

In this section, we state and prove our main result: regret bounds for a rank-1 sketch of the matrix multiplicative weights method. Let us recall our sketch. At time step t, having observed gain matrices G1, . . . , Gt−1 ∈ Sn, we independently draw u_t ∼ Uni(S^{n−1}) and play the rank-1 matrix

Xt := P_{u_t}(η Σ_{i=1}^{t−1} Gi), where P_u(Y) := e^{Y/2} u uᵀ e^{Y/2} / (uᵀ e^Y u) = v vᵀ / (vᵀ v) for v = e^{Y/2} u.  (4)

We call Pu : Sn → ∆n the randomized mirror projection. The key computational consideration is that we can evaluate Pu (Y ) efficiently, while on the analytic side, we show that the update (4) defines on average an efficient mirror descent procedure. The regret bounds for Xt then follow.
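A direct dense rendering of the update (4), feasible at small n, is shown below; the efficient Lanczos-based evaluation is the subject of Section 3, and this loop is only our own toy illustration:

```python
import numpy as np
from scipy.linalg import expm

def P_u(Y, u):
    """Randomized mirror projection (4): P_u(Y) = v v^T / (v^T v), v = e^{Y/2} u."""
    v = expm(Y / 2) @ u
    return np.outer(v, v) / (v @ v)

rng = np.random.default_rng(1)
n, T, eta = 8, 5, 0.3
G_sum = np.zeros((n, n))
for t in range(T):
    u = rng.standard_normal(n); u /= np.linalg.norm(u)   # u_t ~ Uni(S^{n-1})
    X_t = P_u(eta * G_sum, u)                            # rank-1 action in Delta_n
    assert abs(np.trace(X_t) - 1) < 1e-8                 # unit trace
    G = rng.standard_normal((n, n)); G = (G + G.T) / 2   # adversary's symmetric gain
    G /= np.linalg.norm(G, 2)                            # enforce ||G_t||_inf <= 1
    G_sum += G
```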

2.1 Expected regret bounds

The focus of our analysis is the average mirror projection and action sequence

P̄(Y) := E_u P_u(Y) and X̄t := P̄(η Σ_{i=1}^{t−1} Gi),  (5)

where E_u denotes expectation with respect to u ∼ Uni(S^{n−1}). As we show in Section 2.3, P̄ is the gradient of the function

p̄(Y) := E_u log⟨e^Y, u uᵀ⟩ = E_u log(uᵀ e^Y u),

which we also show¹ is a convex spectral function (Lewis, 1996). As a consequence, we can write the average action X̄t in the familiar dual averaging (Nesterov, 2009) or Follow the Regularized Leader (e.g. Hazan, 2016, Ch. 5) form

X̄t = argmax_{X∈∆n} { η Σ_{i=1}^{t−1} ⟨Gi, X⟩ − r̄(X) },

where r̄(X) = sup_{Y∈Sn} {⟨Y, X⟩ − p̄(Y)} is the convex conjugate of p̄. In this standard approach, the regularizer r̄ defines the scheme, and regret analysis proceeds by showing that r̄ is strongly convex and has bounded range. The former property is equivalent to the smoothness of p̄.

In contrast, our starting point is the definition (5) of the projection P̄, and we find it more convenient to argue about P̄ and p̄ directly. Toward that end, for any Y, Y′ ∈ Sn we let

V̄_Y(Y′) := p̄(Y′) − p̄(Y) − ⟨Y′ − Y, P̄(Y)⟩  (6)

denote the Bregman divergence induced by p̄. We show that V̄_Y(·) has the properties necessary to establish our regret bounds, analogous to those arising from duality in analyses of dual averaging (Nesterov, 2009).

Proposition 1. The projection P̄ and divergence V̄ satisfy

1. Smoothness: for every Y, D ∈ Sn, V̄_Y(Y + D) ≤ (3/2)‖D‖∞².

1′. Refined smoothness for positive shifts: for every Y, D ∈ Sn such that D ⪰ 0 and ‖D‖∞ ≤ 1/6, V̄_Y(Y + D) ≤ 3‖D‖∞ ⟨D, P̄(Y)⟩.

2. Diameter bound: for every Y, Y′ ∈ Sn, V̄_Y(0) − V̄_Y(Y′) ≤ log(4n).

3. Surjectivity: for every X ∈ relint ∆n there exists Y ∈ Sn such that P̄(Y) = X.

We return to Proposition 1 and prove it in Section 2.3. The proposition gives the following regret bounds for the averaged actions X̄t.

Theorem 1. Let G1, . . . , GT be any sequence of gain matrices in Sn and let X̄t = P̄(η Σ_{i=1}^{t−1} Gi) as in Eq. (5). Then, for every T ∈ N,

λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, X̄t⟩ ≤ log(4n)/η + (3η/2) · Σ_{t=1}^T ‖Gt‖∞².  (7)

If additionally 0 ⪯ Gt ⪯ I for every t and η ≤ 1/6,

λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, X̄t⟩ ≤ log(4n)/η + 3η · λmax(Σ_{t=1}^T Gt).  (8)

¹For fixed u ∈ Rn, however, P_u ≠ ∇ log(uᵀ e^Y u) and we do not know if it is the gradient of any other function. Moreover, Y ↦ log(uᵀ e^Y u) is not convex.


We prove Theorem 1 in Appendix A. The proof is essentially the standard dual averaging telescoping argument (Nesterov, 2009), which we perform using only the properties in Proposition 1. Indeed, matrix multiplicative weights satisfies a version of Proposition 1 with slightly smaller constant factors, and its regret bounds follow similarly.

The projection P̄ is no easier to compute than the matrix multiplicative weights projection. However, P_u is easily computed and is unbiased for P̄. Consequently, under Assumption A, the sketch P_u inherits the regret guarantees in Theorem 1. To argue this formally, we define the σ-fields

Ft := σ(G1, X1, . . . , Gt, Xt, Gt+1),

so that Gt ∈ Ft−1 and X̄t ∈ Ft−1, while, under Assumption A, E[Xt | Ft−1] = X̄t because u_t ∼ Uni(S^{n−1}), independent of Ft−1. Consequently, we have the following

Corollary 1. Let G1, . . . , GT be symmetric gain matrices satisfying Assumption A and let Xt be generated according to Eq. (4). Then

E[λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, Xt⟩] ≤ log(4n)/η + (3η/2) · Σ_{t=1}^T E‖Gt‖∞².

If additionally 0 ⪯ Gt ⪯ I for every t and η ≤ 1/6,

E[λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, Xt⟩] ≤ log(4n)/η + 3η · E[λmax(Σ_{t=1}^T Gt)].

Proof. Using Gt ∈ Ft−1 and E[Xt | Ft−1] = X̄t, we have E⟨Gt, Xt⟩ = E[E[⟨Gt, Xt⟩ | Ft−1]] = E⟨Gt, X̄t⟩, and so the result is immediate from taking expectation in Theorem 1.

It is instructive to compare these guarantees to those for the full (non-approximate) matrix multiplicative weights algorithm. Let

R[T] := E[λmax((1/T) Σ_{t=1}^T Gt) − (1/T) Σ_{t=1}^T ⟨Gt, Xt⟩]

denote the expected average regret at time T. If ‖Gt‖∞ ≤ 1 for every t, the bound (7) along with Corollary 1 imply, for η = (2 log(4n)/(3T))^{1/2},

R[T] ≤ √(6 log(4n)/T), i.e. R[T] ≤ ε for T ≥ 6 log(4n)/ε².

In contrast, the matrix multiplicative weights procedure (2) guarantees average regret below ε in 2 log(n)/ε² steps, so our guarantee is worse by a factor of roughly 3. The bound (8) guarantees smaller average regret when we additionally assume 0 ⪯ Gt ⪯ I for every t and an a-priori upper bound of the form λmax((1/T) Σ_{t=1}^T Gt) ≤ λ⋆ ≤ 1. Here, a judicious choice of η guarantees R[T] ≤ ε ≤ 1 for every T ≥ 12λ⋆ log(4n)ε⁻². Again, this is slower than the corresponding guarantee for matrix multiplicative weights by a factor of roughly 3. Relative regret bounds of the form (8) are useful in several applications of multiplicative weights and its matrix variant (Arora et al., 2012), e.g. width-independent solvers for linear and positive semidefinite programs (Peng et al., 2016).
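As an illustrative sanity check (ours, with oblivious random gains so that Assumption A holds), one can compare the realized regret of the rank-1 sketch against the right-hand side of (7), allowing the additional martingale slack that appears in the high-probability bound (9) of the next subsection:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
n, T = 10, 200
eta = np.sqrt(2 * np.log(4 * n) / (3 * T))     # step size from the discussion above

G_sum, reward = np.zeros((n, n)), 0.0
for t in range(T):
    u = rng.standard_normal(n); u /= np.linalg.norm(u)
    v = expm(eta * G_sum / 2) @ u
    X = np.outer(v, v) / (v @ v)               # rank-1 action (4)
    G = rng.standard_normal((n, n)); G = (G + G.T) / 2
    G /= np.linalg.norm(G, 2)                  # ||G_t||_inf <= 1
    reward += np.sum(G * X)                    # <G_t, X_t>
    G_sum += G

regret = np.max(np.linalg.eigvalsh(G_sum)) - reward
bound = np.log(4 * n) / eta + 1.5 * eta * T    # right-hand side of (7)
slack = np.sqrt(2 * T * np.log(100))           # Azuma slack at delta = 0.01, cf. (9)
assert regret <= bound + slack
```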


2.2 High-probability regret bounds

Using standard martingale convergence arguments (cf. Cesa-Bianchi et al., 2004; Nemirovski et al., 2009), we can provide high-probability convergence guarantees for our algorithm. Indeed, we have already observed in Corollary 1 that E[⟨Gt, Xt⟩ | Ft−1] = ⟨Gt, X̄t⟩ and therefore ⟨Gt, Xt − X̄t⟩ is a martingale difference sequence adapted to the filtration Ft. As |⟨Gt, Xt⟩| ≤ ‖Gt‖∞ ‖Xt‖1 = ‖Gt‖∞, the martingale Σ_{i=1}^t ⟨Gi, Xi − X̄i⟩ has bounded differences whenever ‖Gt‖∞ is bounded, so that the next theorem is an immediate consequence of the Azuma-Hoeffding inequality and its multiplicative variant (Allen-Zhu and Li, 2017, Lemma G.1).

Corollary 2. Let G1, . . . , GT be symmetric gain matrices satisfying Assumption A and let Xt be generated according to Eq. (4). If ‖Gt‖∞ ≤ 1 for every t, then for every T ∈ N and δ ∈ (0, 1), with probability at least 1 − δ,

λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, Xt⟩ ≤ log(4n)/η + (3η/2)T + √(2T log(1/δ)).  (9)

If additionally 0 ⪯ Gt ⪯ I for every t and η ≤ 1/6, then with probability at least 1 − δ,

λmax(Σ_{t=1}^T Gt) − Σ_{t=1}^T ⟨Gt, Xt⟩ ≤ log(4n/δ)/η + 4η λmax(Σ_{t=1}^T Gt).  (10)

We give the proof of Corollary 2 in Appendix B. Our development uses Assumption A only through its consequence E[Xt | Ft−1] = X̄t. Therefore, our results apply to any adversary that produces gains with such martingale structure, a weaker requirement than Assumption A.

2.3 Analyzing the average mirror projection

In this section we outline the proof of Proposition 1, which constitutes the core technical contribution of our paper. Our general strategy is to relate the average mirror projection to the multiplicative weights projection, which satisfies a version of Proposition 1. Our principal mathematical tool is the theory of convex, twice-differentiable spectral functions (Lewis, 1996; Lewis and Sendov, 2001). We begin with the vector log-sum-exp, or softmax, function

lse(v) := log Σ_{j=1}^n e^{v_j} and its gradient ∇lse(v) = e^v / (1ᵀ e^v),

where we write e^v for exp(·) applied elementwise to v and 1 for the all-ones vector. Note that ∇lse : Rn → σn is the mirror projection associated with (vector) multiplicative weights. Let Y ∈ Sn have eigen-decomposition Y = Q diag(λ)Qᵀ. The matrix softmax function is

p^mw(Y) := log tr e^Y = lse(λ) and P^mw(Y) = ∇p^mw(Y) = e^Y / tr(e^Y) = Q diag(∇lse(λ))Qᵀ

is the matrix multiplicative weights mirror projection. We now connect the function p̄(Y) = E_u[log tr(e^Y uuᵀ)] and the projection P̄(Y) = E_u[e^{Y/2} uuᵀ e^{Y/2} / (uᵀ e^Y u)] to their counterparts p^mw, P^mw and lse.
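The identity P^mw(Y) = Q diag(∇lse(λ))Qᵀ is easy to check numerically, since ∇lse is the vector softmax (our own check):

```python
import numpy as np
from scipy.linalg import expm
from scipy.special import softmax

rng = np.random.default_rng(3)
Y = rng.standard_normal((7, 7)); Y = (Y + Y.T) / 2

lam, Q = np.linalg.eigh(Y)                      # Y = Q diag(lam) Q^T
P_spectral = Q @ np.diag(softmax(lam)) @ Q.T    # Q diag(grad lse(lam)) Q^T
P_direct = expm(Y) / np.trace(expm(Y))          # e^Y / tr e^Y

assert np.allclose(P_spectral, P_direct)        # the two forms of P^mw agree
```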


Lemma 1. Let Y ∈ Sn have eigen-decomposition Y = Q diag(λ)Qᵀ. Let w ∈ σn be drawn from a Dirichlet(1/2, . . . , 1/2) distribution. Then

p̄(Y) = E_w[lse(λ + log w)] = E_w p^mw(Y + Q diag(log w)Qᵀ),  (11)

where log is applied elementwise. The function p̄ is convex and its gradient is

P̄(Y) = ∇p̄(Y) = Q diag(E_w[∇lse(λ + log w)])Qᵀ = E_w P^mw(Y + Q diag(log w)Qᵀ).  (12)

Proof. Let u be uniformly distributed over the unit sphere in Rn and note that u and Qᵀu are identically distributed. Therefore, for Λ = diag(λ),

p̄(Y) = E_u log(uᵀ e^Y u) = E_u log((Qᵀu)ᵀ e^Λ (Qᵀu)) = E_u log(uᵀ e^Λ u) = p̄(Λ).

Further, a vector w with coordinates² w_i = u_i² has a Dirichlet(1/2, . . . , 1/2) distribution. Hence,

p̄(Λ) = E_u log(Σ_{i=1}^n u_i² e^{λ_i}) = E_w log(Σ_{i=1}^n e^{λ_i + log w_i}) = E_w lse(λ + log w),

establishing the identity (11).

Evidently, p̄(Y) is a spectral function: a permutation-invariant function of the eigenvalues of Y. Moreover, since lse is convex, λ ↦ E_w lse(λ + log w) is also convex, and Lewis (1996, Corollary 2.4) shows that p̄ is convex. Moreover, Lewis (1996, Corollary 3.2) gives

∇p̄(Y) = Q diag(∇E_w[lse(λ + log w)])Qᵀ = E_w P^mw(Y + Q diag(log w)Qᵀ).

It remains to show that P̄(Y) = ∇p̄(Y). Here we again use the rotational symmetry of u to write

P̄(Y) = E_u[e^{Y/2} uuᵀ e^{Y/2} / (uᵀ e^Y u)] = Q E_u[e^{Λ/2} (Qᵀu)(Qᵀu)ᵀ e^{Λ/2} / ((Qᵀu)ᵀ e^Λ (Qᵀu))] Qᵀ = Q P̄(Λ) Qᵀ.

Moreover,

[P̄(Λ)]_{ij} = E_u[u_i u_j e^{(λ_i+λ_j)/2} / Σ_{k=1}^n u_k² e^{λ_k}] (⋆)= E_u[u_i² e^{λ_i} I{i=j} / Σ_{k=1}^n u_k² e^{λ_k}] = E_w[∇_i lse(λ + log w)] I{i=j},

where the equality (⋆) above follows because u_i has a symmetric distribution, even conditional on u_j, j ≠ i, so E[u_i u_j | u_1², . . . , u_n², u_j] = 0 for i ≠ j.

Lemma 1 is all we need in order to prove parts 2 and 3 of Proposition 1.

Proof. (Proposition 1, parts 2 and 3) We first observe the following simple lower bound on p̄, immediate from identity (11) in Lemma 1,

p̄(Y) = E_w log Σ_{i=1}^n e^{λ_i(Y) + log w_i} ≥ λmax(Y) + E_{w_1} log w_1 ≥ λmax(Y) − log(4n),  (13)

where E_{w_1} log w_1 ≥ −log(4n) comes from noting that w_1 ∼ Beta(1/2, (n−1)/2) (see Lemma 8 in Appendix C.3). For matrices Y ∈ Sn and X ∈ ∆n,

⟨Y, X⟩ = ⟨Y − λmin(Y)I, X⟩ + λmin(Y) tr X ≤ ‖Y − λmin(Y)I‖∞ ‖X‖1 + λmin(Y) tr X = λmax(Y),

²The letter w naturally denotes a vector of 'weights' in the simplex. Here, it is also double-u.
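The change of variables w_i = u_i² used in the proof of Lemma 1 in fact makes the identity hold pointwise inside the expectation, which is easy to verify numerically (our own snippet):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(4)
n = 6
lam = rng.standard_normal(n)                         # eigenvalues of Y
u = rng.standard_normal(n); u /= np.linalg.norm(u)   # u ~ Uni(S^{n-1})
w = u ** 2                                           # w ~ Dirichlet(1/2, ..., 1/2)

lhs = logsumexp(lam + np.log(w))                     # lse(lambda + log w)
rhs = np.log(u @ (np.exp(lam) * u))                  # log(u^T e^Lambda u)
assert np.isclose(lhs, rhs)                          # identity behind Eq. (11), pointwise
```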


where the final equality is due to ‖Y − λmin(Y)I‖∞ = λmax(Y) − λmin(Y) for every Y ∈ Sn and ‖X‖1 = tr X = 1 for every X ∈ ∆n. Combining this bound with (13), we have that

⟨Y, X⟩ − p̄(Y) ≤ log(4n)  (14)

for every Y ∈ Sn and X ∈ ∆n. Part 2 follows since

V̄_Y(0) − V̄_Y(Y′) = ⟨Y′, P̄(Y)⟩ − p̄(Y′) + p̄(0) ≤ log(4n) + p̄(0) = log(4n),

where we used the bound (14) with X = P̄(Y) and the fact that p̄(0) = E_w log(1ᵀw) = 0.

To show Part 3, let r̄(X) := sup_{Y∈Sn}{⟨Y, X⟩ − p̄(Y)} be the convex conjugate of p̄. Eq. (14) implies that r̄(X) < ∞ for all X ∈ ∆n, and therefore relint ∆n ⊆ relint dom r̄. Every convex function has nonempty subdifferential on the relative interior of its domain (Hiriart-Urruty and Lemaréchal, 1993, Theorem X.1.4.2), and thus for X ∈ relint ∆n there exists Y ∈ ∂r̄(X). By definition of r̄, any such Y satisfies X = ∇p̄(Y) = P̄(Y), as required.

Proving parts 1 and 1′ requires second-order information on p̄. For a twice-differentiable function f, we denote ∇²f(A)[B, B] = (∂²/∂t²) f(A + tB)|_{t=0}. It is easy to verify that, for every λ, δ ∈ Rn,

δᵀ ∇²lse(λ) δ = ∇²lse(λ)[δ, δ] ≤ (δ²)ᵀ ∇lse(λ), where [δ²]_i = δ_i²;

this concisely captures the pertinent second-order structure of the multiplicative weights mirror projection. Nesterov (2007) shows that this property extends to the matrix case.
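The scalar inequality above is transparent once one writes ∇²lse(λ) = diag(p) − ppᵀ with p = ∇lse(λ): the quadratic form is then a variance under p, bounded by the corresponding second moment. A numerical check (ours):

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(5)
lam, delta = rng.standard_normal(8), rng.standard_normal(8)

p = softmax(lam)                        # grad lse(lambda), a probability vector
H = np.diag(p) - np.outer(p, p)         # Hessian of lse at lambda
quad = delta @ H @ delta                # Var_p(delta): E_p[d^2] - (E_p d)^2
bound = (delta ** 2) @ p                # (delta^2)^T grad lse(lambda) = E_p[d^2]
assert quad <= bound + 1e-12            # variance <= second moment
```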

Lemma 2. For any Y, D ∈ Sn, ∇²p^mw(Y)[D, D] ≤ ⟨D², P^mw(Y)⟩.

In Appendix C.1 we explain how to find this result in Nesterov (2007), as it is not explicit there. In view of Lemma 1, it is natural to hope that ∇²p̄ and ∇²p^mw are also related via simple expectation. Unfortunately, this fails; we can, however, derive a bound.

Lemma 3. For any Y, D ∈ Sn, orthogonal eigenbasis Q for Y, and w ∼ Dirichlet(1/2, . . . , 1/2),

∇²p̄(Y)[D, D] ≤ 3 · E_w ∇²p^mw(Y + Q diag(log w)Qᵀ)[D, D]  (15)
≤ 3 ⟨D², P̄(Y)⟩.  (16)

Our proof of Lemma 3 is technical; we sketch it here briefly and give it in full in Appendix C.2. The key ingredient in the proof is a formula for the Hessian of spectral functions (Lewis and Sendov, 2001). Using the spectral characterization (11), the formula gives that

∇²p̄(Y)[D, D] = diag(D̃)ᵀ E_w[∇²lse(λ + log w)] diag(D̃) + ⟨E_w A^w(λ), D̃ ∘ D̃⟩,

where D̃ = QᵀDQ, diag(D̃) ∈ Rn is the vector containing the diagonal entries of D̃, A ∘ B denotes elementwise multiplication of A and B, and A^w_{ij}(λ) := I{i≠j} (∇_i lse(λ + log w) − ∇_j lse(λ + log w))/(λ_i − λ_j). With the shorthand Y_{w} := Y + Q diag(log w)Qᵀ, we use the formula of Lewis and Sendov (2001) again to express ∇²p^mw(Y_{w}) as

∇²p^mw(Y_{w})[D, D] = diag(D̃)ᵀ ∇²lse(λ + log w) diag(D̃) + ⟨A^1(λ + log w), D̃ ∘ D̃⟩,

where A^1 = A^w̃ evaluated at w̃ = 1. The bulk of the proof is dedicated to establishing the entry-wise bounds

E_w A^w_{ij}(λ) ≤ E_w [ tanh((λ_i − λ_j + log(w_i/w_j))/2) / (λ_i − λ_j) ] ≤ 3 · E_w A^1_{ij}(λ + log w).

The first inequality follows from pointwise analysis of a symmetrized version of A^w_{ij}. The second inequality follows from piecewise monotonicity of A^w_{ij} as a function of log(w_i/w_j) ∼ logit Beta(1/2, 1/2), combined with tight exponential tail bounds for the latter. Substituting the bound on E_w A^w_{ij}(λ) into the expression for ∇²p̄(Y) and comparing with E_w ∇²p^mw(Y_{w}) yields the desired result (15). Applying Lemma 2 and recalling the identity (12) yields

E_w ∇²p^mw(Y + Q diag(log w)Qᵀ)[D, D] ≤ ⟨D², E_w P^mw(Y + Q diag(log w)Qᵀ)⟩ = ⟨D², P̄(Y)⟩,

establishing the final bound (16).

The bound (16) gives the remaining parts of Proposition 1.

Proof. (Proposition 1, parts 1 and 1′) Fix Y, D ∈ Sn and let p(t) := p̄(Y + tD). The Bregman divergence (6) admits the integral form

V̄_Y(Y + D) = p(1) − p(0) − p′(0) = ∫₀¹ (p′(t) − p′(0)) dt = ∫₀¹∫₀ᵗ p″(τ) dτ dt = ∫₀¹∫₀ᵗ ∇²p̄(Y + τD)[D, D] dτ dt.  (17)

Note that since P̄(Y) ∈ ∆n for every Y ∈ Sn, ⟨D², P̄(Y)⟩ ≤ ‖D²‖∞ ‖P̄(Y)‖1 = ‖D‖∞². Therefore, the bound (16) gives

∇²p̄(Y + τD)[D, D] ≤ 3‖D‖∞².

Substituting back into (17) and using ∫₀¹∫₀ᵗ dτ dt = 1/2 gives part 1 of Proposition 1. When D ⪰ 0, we have

⟨D², P̄(Y)⟩ = ⟨D, D^{1/2} P̄(Y) D^{1/2}⟩ ≤ ‖D‖∞ ‖D^{1/2} P̄(Y) D^{1/2}‖1 = ‖D‖∞ ⟨D, ∇p̄(Y)⟩.

Plugging the bound above into the bound (16) and substituting back into (17) gives

V̄_Y(Y + D) ≤ 3‖D‖∞ ∫₀¹∫₀ᵗ ⟨D, ∇p̄(Y + τD)⟩ dτ dt.  (18)

Moreover,

∫₀ᵗ ⟨D, ∇p̄(Y + τD)⟩ dτ = ∫₀ᵗ p′(τ) dτ = p(t) − p(0) = V̄_Y(Y + tD) + ⟨tD, P̄(Y)⟩,  (19)

where the final equality uses the definition (6) of the Bregman divergence. Note also that v(t) := V̄_Y(Y + tD) is increasing for t ≥ 0 due to convexity of p̄; tv′(t) = ⟨tD, ∇p̄(Y + tD) − ∇p̄(Y)⟩ ≥ 0. Therefore, the equality (19) implies ∫₀ᵗ ⟨D, ∇p̄(Y + τD)⟩ dτ ≤ V̄_Y(Y + D) + t · ⟨D, P̄(Y)⟩ for every 0 ≤ t ≤ 1. Substituting this back into (18) and rearranging gives

(1 − 3‖D‖∞) V̄_Y(Y + D) ≤ (3/2)‖D‖∞ ⟨D, P̄(Y)⟩,

establishing part 1′ of the proposition, as 1 − 3‖D‖∞ ≥ 1/2 by assumption.

3 Efficient computation of matrix exponential-vector products

The main burden in computing the randomized mirror projections (4) lies in computing e^A b for A ∈ Sn and b ∈ Rn. Matrix exponential-vector products have widespread use in solutions of differential equations (cf. Saad, 1992; Hochbruck and Ostermann, 2010), and also appear as core components in a number of theoretical algorithms (Arora and Kale, 2007; Orecchia et al., 2012; Jambulapati et al., 2018). Following a large body of literature (cf. Moler and Loan, 2003), we approximate e^A b via the classic Lanczos method (Lanczos, 1950), an iterative process for computing f(A)b for general real functions f applied to matrix A. The Lanczos approximation enjoys strong convergence guarantees upon which we base our analysis (Sachdeva and Vishnoi, 2014). It is also eminently practical: the only tunable parameter is the number of iterations, and each iteration accesses A via a single matrix-vector product.

Let ẽxp_k(A, b) be the result of k iterations of the Lanczos method for approximating e^A b. We provide a precise description of the method in Appendix D. Let

X̃_{t;k} = P̃_{u_t;k}(η Σ_{i=1}^{t−1} Gi), where P̃_{u;k}(Y) = vvᵀ/(vᵀv) for v = ẽxp_k(Y/2, u)  (20)

denote the approximate randomized mirror projection. The variant of the Lanczos method used to compute full eigen-decompositions has well-documented numerical stability issues (Meurant, 2006). In contrast, the approximation (20) appears to be numerically stable. To provide a theoretical basis for this observation, we exhibit error bounds under finite floating-point precision, leveraging the results of Musco et al. (2017), which in turn build off Druskin and Knizhnerman (1991, 1995). To account for computational cost, we let mv(Y) denote the time needed to multiply the matrix Y by any vector.

Proposition 2. Let ε, δ ∈ (0, 1) and Y ∈ Sn, and set M := max{‖Y‖∞, log(n/(εδ)), 1}. Let u be uniformly distributed on the unit sphere in Rn and independent of Y. If the number of Lanczos iterations k satisfies k ≥ Θ(1) · √M · log(nM/δ), then the approximation (20) satisfies

‖P_u(Y) − P̃_{u;k}(Y)‖1 ≤ ε with probability ≥ 1 − δ over u ∼ Uni(S^{n−1}),

when implemented using floating-point operations with B = Θ(1) log(nM/δ) bits of precision. The time to compute P̃_{u;k}(Y) is O(mv(Y)·k + k²B).

We prove Proposition 2 in Appendix D and describe here the main ingredients in the proof. First, we show by straightforward calculation that

‖P_u(Y) − P̃_{u;k}(Y)‖1 ≤ 8 · ‖e^{Y/2}u − ẽxp_k(Y/2, u)‖2 / ‖e^{Y/2}u‖2.

Therefore, a multiplicative error guarantee for ẽxp_k would imply our result. Unfortunately, for such a guarantee to hold for all vectors u we must have k = Ω(‖Y‖∞) (Orecchia et al., 2012, Section 3.3). We circumvent that by using the randomness of u to argue that w.h.p. ‖e^{Y/2}u‖2 ≳ (1/√n) e^{λmax(Y/2)} ‖u‖2. This allows us to use existing additive error guarantees for ẽxp_k to obtain our result. We connect the approximation to regret in the following corollary (see Appendix D.6).


Corollary 3. Let G1, . . . , GT be symmetric gain matrices satisfying ‖Gt‖∞ ≤ 1 for every t. There exists a numerical constant k0 < ∞ such that for every T ∈ N and δ ∈ (0, 1), X̃_{t;k_t} defined in (20) with k_t = ⌈k0 √(1 + ηt) log(nT/δ)⌉, and Xt defined in (4), satisfy

Σ_{t=1}^T ⟨Gt, X̃_{t;k_t}⟩ ≥ −1 + Σ_{t=1}^T ⟨Gt, Xt⟩ w.p. ≥ 1 − δ/2.  (21)

Let ε ∈ (0, 1], T = ⌈16 log(4en/δ)/ε²⌉ and η = √(2 log(4en)/(3T)). If Assumption A holds with respect to the actions X̃_{t;k_t}, then with probability at least 1 − δ,

(1/T) λmax(Σ_{t=1}^T Gt) − (1/T) Σ_{t=1}^T ⟨Gt, X̃_{t;k_t}⟩ ≤ ε.

Computing the actions X̃_{1;k_1}, . . . , X̃_{T;k_T} requires O(ε^{−2.5} log^{2.5}(n/δ)) matrix-vector products.
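A minimal textbook Lanczos routine for e^A b, without the reorthogonalization and finite-precision safeguards analyzed above, illustrates the fast convergence that motivates this section (the function below is our own sketch, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import expm

def lanczos_expv(A, b, k):
    """Approximate e^A b with k Lanczos iterations: ||b|| * Q exp(T) e_1."""
    n = b.shape[0]
    Q = np.zeros((n, k)); alpha = np.zeros(k); beta = np.zeros(k - 1)
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        w = A @ Q[:, j]                    # one matrix-vector product per step
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)  # tridiagonal
    return np.linalg.norm(b) * (Q @ expm(T)[:, 0])

rng = np.random.default_rng(6)
n = 60
A = rng.standard_normal((n, n)); A = (A + A.T) / (2 * np.sqrt(n))  # moderate norm
b = rng.standard_normal(n)
approx = lanczos_expv(A, b, k=15)
exact = expm(A) @ b
assert np.linalg.norm(approx - exact) <= 1e-6 * np.linalg.norm(exact)
```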

4 Application to semidefinite programming

Here we describe how to use our rank-1 sketch to solve semidefinite programs (SDPs). The standard SDP formulation is, given C̃, Ã1, . . . , Ã_m̃ ∈ S^ñ and b̃ ∈ R^m̃,

minimize_{Z⪰0} ⟨C̃, Z⟩ subject to ⟨Ãi, Z⟩ = b̃i ∀i ∈ [m̃].

A binary search over the optimum value reduces this problem to a sequence of feasibility problems. When the constraints imply tr Z ≤ r for some r < ∞, every intermediate feasibility problem is equivalent to deciding whether there exists X in the spectrahedron ∆_n such that ⟨A_i, X⟩ ≤ 0 for all i ∈ [m], with n, m and A_i ∈ S^n constructed from ñ, m̃, Ã_i, b̃, C̃ and r. This decision problem is in turn equivalent (cf. Garber and Hazan, 2016) to determining the sign of

    s = min_{y∈σ_m} max_{X∈∆_n} ⟨A*y, X⟩,  where  A*y := Σ_{i∈[m]} [y]_i A_i,   (22)

and σ_m is the simplex in ℝ^m. For every y ∈ σ_m and X ∈ ∆_n,

    min_{y′∈σ_m} ⟨A*y′, X⟩ ≤ s ≤ max_{X′∈∆_n} ⟨A*y, X′⟩.

Therefore, to determine s to additive error ε, it suffices to find y, X with duality gap

    Gap(X, y) := max_{X′∈∆_n} ⟨A*y, X′⟩ − min_{y′∈σ_m} ⟨A*y′, X⟩ = λ_max(A*y) − min_{i∈[m]} ⟨A_i, X⟩   (23)

below ε. A basic approach to solving convex-concave games such as (22) is to apply online learning for X and y simultaneously, where at each round the gains/costs to the max/min player are determined by the actions of the opposite player in the previous round. Importantly, such dynamics satisfy Assumption A. We use our rank-1 sketch as the online learning strategy of the (matrix) max player, and standard multiplicative weights for the (vector) min player; Algorithm 1 describes the resulting scheme. The algorithm enjoys a convergence guarantee that depends on the width parameter

    ω := max_{i∈[m]} ‖A_i‖_∞

and has the following form.


Algorithm 1: Primal-dual SDP feasibility

    Let G₀ := 0 and c₀ := 0
    for t = 1, …, T do
        Sample vector u_t uniformly at random from the unit sphere
        Play matrix X_t := P_{u_t}(η Σ_{i=1}^{t−1} G_i)
        Play vector y_t := ∇lse(−η Σ_{i=1}^{t−1} c_i) = (y_{t−1} ∘ e^{−η c_{t−1}}) / (1ᵀ(y_{t−1} ∘ e^{−η c_{t−1}}))
        Form gain matrix G_t := A*y_t = Σ_{i∈[m]} [y_t]_i A_i
        Form cost vector [c_t]_i := ⟨X_t, A_i⟩, i ∈ [m]
    end
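The following NumPy sketch of Algorithm 1 is our own illustration. For simplicity it forms the rank-1 projection P_u(Y) = (e^{Y/2}u)(e^{Y/2}u)ᵀ/‖e^{Y/2}u‖² exactly via an eigendecomposition; in the regime the paper targets one would replace exp_half_times with the Lanczos approximation of Section 3. All function and variable names are ours.

```python
import numpy as np

def exp_half_times(Y, u):
    """Compute exp(Y/2) @ u exactly via an eigendecomposition of Y."""
    evals, evecs = np.linalg.eigh(Y)
    return evecs @ (np.exp(evals / 2) * (evecs.T @ u))

def primal_dual_sdp(As, T, eta, rng):
    """Run Algorithm 1 on constraint matrices As; return averaged iterates."""
    m, n = len(As), As[0].shape[0]
    Y = np.zeros((n, n))             # Y_t = eta * sum_{i<t} G_i
    c_sum = np.zeros(m)              # sum_{i<t} c_i
    X_avg, y_avg = np.zeros((n, n)), np.zeros(m)
    for t in range(T):
        u = rng.standard_normal(n)   # direction uniform on the sphere
        v = exp_half_times(Y, u)
        X = np.outer(v, v) / (v @ v)             # rank-1 sketch P_u(Y_t)
        z = -eta * c_sum
        y = np.exp(z - z.max())
        y /= y.sum()                             # y_t = grad lse(-eta * sum c_i)
        G = sum(yi * A for yi, A in zip(y, As))  # gain matrix A* y_t
        c = np.array([np.tensordot(X, A) for A in As])  # costs <X_t, A_i>
        Y += eta * G
        c_sum += c
        X_avg += X / T
        y_avg += y / T
    return X_avg, y_avg
```

The returned averages are the iterates whose duality gap Theorem 2 bounds in expectation.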

Theorem 2. Let {X_t, y_t}_{t=1}^T be the actions produced by Algorithm 1, and define X_T^avg = (1/T) Σ_{t=1}^T X_t and y_T^avg = (1/T) Σ_{t=1}^T y_t. Then

    E Gap(X_T^avg, y_T^avg) ≤ log(4mn)/(ηT) + 2ηω².

Proof. Recalling the definition (23) of the duality gap, and that G_t = A*y_t and [c_t]_i = ⟨A_i, X_t⟩, we have

    Gap(X_T^avg, y_T^avg) = λ_max((1/T) Σ_{t=1}^T G_t) − min_{i∈[m]} (1/T) Σ_{t=1}^T [c_t]_i.

Note that y_t = ∇lse(−η Σ_{i=1}^{t−1} c_i) is a function of X₁, …, X_{t−1}. Therefore, G_t = A*y_t satisfies Assumption A and we may use Corollary 1 to write

    E[λ_max(Σ_{t=1}^T G_t) − Σ_{t=1}^T ⟨G_t, X_t⟩] ≤ log(4n)/η + (3η/2) · E Σ_{t=1}^T ‖G_t‖²_∞ ≤ log(4n)/η + 3ηω²T/2,

where in the second inequality we used ω = max_{i∈[m]} ‖A_i‖_∞ and y_t ∈ σ_m to bound ‖G_t‖_∞ = ‖A*y_t‖_∞ ≤ ω · 1ᵀy_t = ω. Similarly, we use the standard multiplicative weights regret bound (cf. Shalev-Shwartz, 2012, Theorem 2.21) to write

    Σ_{t=1}^T c_tᵀy_t − min_{i∈[m]} Σ_{t=1}^T [c_t]_i ≤ log(m)/η + (η/2) · Σ_{t=1}^T ‖c_t‖²_∞ ≤ log(m)/η + ηω²T/2,

where the second inequality again follows from |[c_t]_i| = |⟨A_i, X_t⟩| ≤ ‖A_i‖_∞ ≤ ω since X_t ∈ ∆_n. Finally,

    c_tᵀy_t = Σ_{i=1}^m [y_t]_i ⟨A_i, X_t⟩ = ⟨A*y_t, X_t⟩ = ⟨G_t, X_t⟩.

Hence, summing the two regret bounds and dividing by T gives the result.

For η = √(log(4mn)/(2ω²T)) and T = 8 log(4mn) ω²/ε², Theorem 2 guarantees E Gap(X_T^avg, y_T^avg) ≤ ε. A high-probability version of this guarantee follows readily via Corollary 2.

Let us now discuss the computational cost of Algorithm 1. Let mv(M) denote the time required to multiply the matrix M by any vector, and let mv(A) := Σ_{i∈[m]} mv(A_i). Except for the computation of X_t, every step in the for loop of Algorithm 1 takes O(mv(A)) work to execute (we may assume mv(A) ≥ max{n, m} without loss of generality). Let Y_t = η Σ_{i=1}^{t−1} G_i = A*(Σ_{i=1}^{t−1} η y_i), and note that, with the values of η and T above, ‖Y_t‖_∞ ≤ ηTω = Õ(ω/ε) for every t ≤ T. Per Section 3, the computation of X_t costs Õ(‖Y_t‖_∞^{0.5} mv(Y_t)) = Õ((ω/ε)^{0.5} mv(Y_t)). Writing mv(A*) := max_{α∈ℝ^m} mv(A*α) ≤ min{mv(A), n²}, the total computational cost of our algorithm is

    Õ(((ω/ε)^{0.5} mv(A*) + mv(A)) T) = Õ((ω/ε)^{2.5} mv(A*) + (ω/ε)² mv(A)).

In many settings of interest (namely, when the A_i have mostly non-overlapping sparsity patterns and yet the Y_t are sparse) we have mv(A*) ≈ mv(A), so that the computational cost is dominated by the first term.

4.1 Comparison with other algorithms

Let nnz(M) denote the number of nonzero entries of matrix M, and let nnz(A) := Σ_{i∈[m]} nnz(A_i) ≥ mv(A). If in Algorithm 1 we replace the randomized projection P_u with the matrix multiplicative weights projection P_mw, the regret bound of Theorem 2 still holds, but the overall computational cost becomes Õ((ω/ε)²(n³ + nnz(A))) due to full matrix exponentiation. Nemirovski (2004) accelerates this scheme using extra-gradient steps, guaranteeing duality gap below ε in Õ(ω/ε) iterations, with each iteration involving two full matrix exponential computations. The overall computational cost of such a scheme is consequently Õ((ω/ε)(n³ + nnz(A))). Nesterov (2007) attains the same rate by using accelerated gradient descent on a smoothed version of the dual problem. Our scheme improves on this rate for sufficiently sparse problems, with nnz(A)/n³ ≲ (ω/ε)^{−1.5}.

d'Aspremont (2011) applies a subgradient method to the dual problem, approximating the subgradients using the Lanczos method to compute a leading eigenvector of A*y. The method solves the dual problem to accuracy ε with total work Õ((ω/ε)^{2.5} mv(A*) + (ω/ε)² mv(A)), essentially the same as ours. However, it is not clear how to efficiently recover a primal solution from this method. Moreover, the surrogate duality gap d'Aspremont (2011) proposes will not always be 0 at the global optimum, whereas with our approach the true duality gap is readily computable.

Baes et al. (2013) replace the full matrix exponentiation in the accelerated scheme of Nemirovski (2004) with a rank-k sketch of the form (3), where k = Õ(ω/ε). Similarly to Nemirovski (2004), they require Õ(ω/ε) iterations to attain duality gap below ε. Baes et al. (2013) approximate matrix exponential-vector products by truncating a Taylor series, costing Õ(k(ω/ε) mv(A*)) = Õ((ω/ε)² mv(A*)) work per iteration. With the Lanczos method, the cost improves to Õ((ω/ε)^{1.5} mv(A*)) work per iteration. Every step of their method also computes ⟨A_i, X⟩ for all i ∈ [m] and a rank-k matrix X = Σ_{j=1}^k v_j v_jᵀ; this costs either k · mv(A) work (computing v_jᵀ A_i v_j for every i, j) or nnz(A) + n²k (when forming X explicitly). The former option yields total complexity identical to our method. The latter option is preferable only when nnz(A) ≫ n² ≥ mv(A*), and can result in an improvement over the running time of our method if mv(A*) ≪ nnz(A)(ω/ε)^{−1.5} + n²(ω/ε)^{−0.5}. Baes et al. (2013) report that k = 1 often gave the best results in their experiments, which is not predicted by their theory. A hypothetical explanation for this finding is that, with k = 1, they are essentially running Algorithm 1.

Finally, d'Aspremont (2011) and Garber and Hazan (2016) propose sub-sampling based algorithms for approximate SDP feasibility with runtimes potentially sublinear in mv(A*). However, because of their significantly worse dependence on ω/ε, as well as dependence on Frobenius norms, we match or improve upon their runtime guarantees in a variety of settings; see (Garber and Hazan, 2016) for a detailed comparison.


5 Discussion

We conclude the paper with a discussion of a number of additional settings where our sketch, or some variation thereof, might be beneficial. In the first two settings we discuss, the naturally arising online learning problem involves adversaries that violate Assumption A, demonstrating a limitation of our analysis.

Online convex optimization  In the online convex optimization problem, at every time step t the adversary provides a convex loss ℓ_t, the player pays a cost ℓ_t(X_t) and wishes to minimize the regret Σ_{t=1}^T ℓ_t(X_t) − min_X Σ_{t=1}^T ℓ_t(X). The standard reduction to the online learning problem is to construct an adversary with gains G_t = −∇ℓ_t(X_t). However, even if the losses ℓ_t follow Assumption A, the constructed gains G_t clearly violate it. Therefore, extensions of our results to online convex optimization will require additional work and probably depend on finer problem structure.

Positive semidefinite programming  Peng et al. (2016) and Allen-Zhu et al. (2016) propose algorithms for solving positive (packing/covering) semidefinite programs with width-independent running time, meaning that the computational cost of solving the problems to multiplicative error ε depends only logarithmically on the width parameter (denoted ω in Section 4). Both algorithms rely on matrix exponentiation, which they approximate with a rank k = Õ(ε^{−2}) sketch using the Johnson-Lindenstrauss lemma. The algorithm of Peng et al. (2016) uses matrix multiplicative weights in essentially a black-box fashion, so one could hope to replace their high-rank sketch with our rank-1 technique. Unfortunately, the gain matrices that Peng et al. (2016) construct violate Assumption A and so our results do not immediately apply. Adapting our sketch to positive semidefinite programming remains an intriguing open problem.

Improved computational efficiency against an oblivious adversary  An oblivious adversary produces gain matrices G₁, …, G_T independent of the actions X₁, …, X_T; this is a stronger version of Assumption A. For such an adversary, if we draw u ∼ Uni(S^{n−1}) and set u₁ = u₂ = ⋯ = u_T = u, the average regret guarantee of Corollary 1 still applies, as Allen-Zhu and Li (2017) explain. In this setting, it may be possible to make the computation of X_t more efficient by reusing X_{t−1}. Such savings exist in the stochastic setting (when the G_t are i.i.d.) via Oja's algorithm (Allen-Zhu and Li, 2017), and it would be interesting to extend them to the oblivious setting.

Online k eigenvectors  Nie et al. (2013) show that a variant of matrix multiplicative weights is also capable of learning online the top k-dimensional eigenspace, with similar regret guarantees. As our rank-1 sketch solves the k = 1 leading eigenvector problem, it is interesting to study whether a rank-k sketch solves the k leading eigenvectors problem.


References

Zeyuan Allen-Zhu and Yuanzhi Li. Follow the compressed leader: Faster online learning of eigenvectors and faster MMWU. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtain a width-independent, parallel, simpler, and faster positive SDP solver. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2016.

Horst Alzer. On some inequalities for the gamma and psi functions. Mathematics of Computation of the American Mathematical Society, 66(217):373–389, 1997.

S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In Proceedings of the Thirty-Ninth Annual ACM Symposium on the Theory of Computing. ACM, 2007.

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68:357–367, 1967.

Michel Baes, Michael Bürgisser, and Arkadi Nemirovski. A randomized mirror-prox method for solving structured large-scale matrix saddle-point problems. SIAM Journal on Optimization, 23(2):934–962, 2013.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.

Michael B. Cohen, Yin Tat Lee, Gary L. Miller, Jakub W. Pachocki, and Aaron Sidford. Geometric median in nearly linear time. arXiv:1606.05225 [cs.DS], 2016.

Alexandre d'Aspremont. Subsampling algorithms for semidefinite programming. Stochastic Systems, 1(2):209–436, 2011.

Vladimir Druskin and Leonid Knizhnerman. Error bounds in the simple Lanczos procedure for computing functions of symmetric matrices and eigenvalues. U.S.S.R. Computational Mathematics and Mathematical Physics, 31(7):970–983, 1991.

Vladimir Druskin and Leonid Knizhnerman. Krylov subspace approximation of eigenpairs and matrix functions in exact and computer arithmetic. Numerical Linear Algebra with Applications, 2(3):205–217, 1995.

Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on the Theory of Computing. ACM, 2014.


Dan Garber and Elad Hazan. Sublinear time algorithms for approximate semidefinite programming. Mathematical Programming, 158(1-2):329–361, 2016.

Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Ming Gu and Stanley C. Eisenstat. A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem. SIAM Journal on Matrix Analysis and Applications, 16(1):172–191, 1995.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.

J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.

Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.

Arun Jambulapati, Kirankumar Shiragur, and Aaron Sidford. Efficient structured matrix recovery and nearly-linear time algorithms for solving inverse symmetric M-matrices. arXiv:1812.06295 [cs.DS], 2018.

Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.

Adrian Lewis. Convex analysis on the Hermitian matrices. SIAM Journal on Optimization, 6:164–177, 1996.

Adrian S. Lewis and Hristo S. Sendov. Twice differentiable spectral functions. SIAM Journal on Matrix Analysis and Applications, 23(2):368–386, 2001.

Gérard Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations. Society for Industrial and Applied Mathematics, 2006.

Cleve B. Moler and Charles Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3–49, 2003.

Cameron Musco, Christopher Musco, and Aaron Sidford. Stability of the Lanczos method for matrix function approximation. arXiv:1708.07788 [cs.DS], 2017.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.

Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, Series A, 110:245–259, 2007.

Jiazhong Nie, Wojciech Kotłowski, and Manfred K. Warmuth. Online PCA with optimal regrets. In Proceedings of the Twenty-Sixth Annual Conference on Computational Learning Theory, 2013.

Lorenzo Orecchia, Sushant Sachdeva, and Nisheeth K. Vishnoi. Approximating the exponential, the Lanczos method and an Õ(m)-time spectral algorithm for balanced separator. In Proceedings of the Forty-Fourth Annual ACM Symposium on the Theory of Computing, 2012.

Victor Y. Pan and Zhao Q. Chen. The complexity of the matrix eigenproblem. In Proceedings of the Thirty-First Annual ACM Symposium on the Theory of Computing. ACM, 1999.

Richard Peng, Kanat Tangwongsan, and Peng Zhang. Faster and simpler width-independent parallel algorithms for positive semidefinite programming. arXiv:1201.5135v3 [cs.DS], 2016.

Yousef Saad. Analysis of some Krylov subspace approximations to the matrix exponential operator. SIAM Journal on Numerical Analysis, 29(1):209–228, 1992.

Sushant Sachdeva and Nisheeth K. Vishnoi. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125–210, 2014.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Lieven Vandenberghe, Martin S. Andersen, et al. Chordal graphs and semidefinite optimization. Foundations and Trends in Optimization, 1(4):241–433, 2015.

A Dual averaging regret bounds

Theorem 1. Let G₁, …, G_T be any sequence of gain matrices in S^n and let X̄_t = P̄(η Σ_{i=1}^{t−1} G_i) as in Eq. (5). Then, for every T ∈ ℕ,

    λ_max(Σ_{t=1}^T G_t) − Σ_{t=1}^T ⟨G_t, X̄_t⟩ ≤ log(4n)/η + (3η/2) · Σ_{t=1}^T ‖G_t‖²_∞.   (7)

If additionally 0 ⪯ G_t ⪯ I for every t and η ≤ 1/6,

    λ_max(Σ_{t=1}^T G_t) − Σ_{t=1}^T ⟨G_t, X̄_t⟩ ≤ log(4n)/η + 3η · λ_max(Σ_{t=1}^T G_t).   (8)

Proof. We start with the well-known Bregman three-point identity, valid for any Φ₀, Φ₁, Φ₂ ∈ S^n:

    ⟨Φ₂ − Φ₁, P̄(Φ₀) − P̄(Φ₁)⟩ = V̄_{Φ₀}(Φ₁) − V̄_{Φ₀}(Φ₂) + V̄_{Φ₁}(Φ₂);   (24)

the identity follows from the definition (6) of V̄ by direct substitution. Fix some S ∈ relint ∆_n and Ψ ∈ S^n such that S = P̄(Ψ) (which exists by Proposition 1.3). Let Y_t = η Σ_{i=1}^{t−1} G_i, so that X̄_t = P̄(Y_t). For a given t, we use the three-point identity with Φ₀ = Ψ, Φ₁ = Y_t and Φ₂ = Y_{t+1}, yielding

    η⟨G_t, S − X̄_t⟩ = ⟨Y_{t+1} − Y_t, P̄(Ψ) − P̄(Y_t)⟩ = V̄_Ψ(Y_t) − V̄_Ψ(Y_{t+1}) + V̄_{Y_t}(Y_{t+1}).

Summing these equalities over t = 1, …, T and dividing by η gives

    ⟨Σ_{t=1}^T G_t, S⟩ − Σ_{t=1}^T ⟨G_t, X̄_t⟩ = (V̄_Ψ(Y₁) − V̄_Ψ(Y_{T+1}))/η + (1/η) Σ_{t=1}^T V̄_{Y_t}(Y_{t+1})   (25)
        ≤ log(4n)/η + (3η/2) Σ_{t=1}^T ‖G_t‖²_∞.   (26)

Above, we used V̄_{Y_t}(Y_{t+1}) = V̄_{Y_t}(Y_t + ηG_t) ≤ (3/2) η² ‖G_t‖²_∞ (Proposition 1.1), along with Y₁ = 0 and V̄_Ψ(0) − V̄_Ψ(Y_{T+1}) ≤ log 4n (Proposition 1.2). Since the bound (26) is valid for any S ∈ relint ∆_n, we may take the supremum over S. The result (7) follows from noting that sup_{S∈relint ∆_n} ⟨Σ_{t=1}^T G_t, S⟩ = λ_max(Σ_{t=1}^T G_t).

To see the second bound (8), we return to the identity (25) and note that the assumptions 0 ⪯ G_t ⪯ I and η ≤ 1/6 imply ‖ηG_t‖_∞ ≤ 1/6. Therefore we may use Proposition 1.10 to obtain

    V̄_{Y_t}(Y_{t+1}) = V̄_{Y_t}(Y_t + ηG_t) ≤ 3 ‖ηG_t‖_∞ ⟨ηG_t, P̄(Y_t)⟩ ≤ 3η² ⟨G_t, X̄_t⟩.

Substituting back into (25), rearranging and taking the supremum over S as before, we obtain

    λ_max(Σ_{t=1}^T G_t) ≤ (1 + 3η) Σ_{t=1}^T ⟨G_t, X̄_t⟩ + log(4n)/η.   (27)

Dividing through by (1 + 3η) and noting that 1 − x ≤ 1/(1 + x) ≤ 1 for every x ≥ 0, we obtain the result (8), concluding the proof.
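The three-point identity (24) holds for the Bregman divergence of any differentiable convex potential. As an illustration (ours, not part of the proof), the following NumPy snippet verifies it numerically for the standard MMW potential pmw(Y) = log tr e^Y, whose gradient is Pmw.

```python
import numpy as np

def pmw(Y):
    """pmw(Y) = log tr exp(Y), computed stably from the eigenvalues."""
    ev = np.linalg.eigvalsh(Y)
    return ev.max() + np.log(np.exp(ev - ev.max()).sum())

def Pmw(Y):
    """Gradient of pmw: exp(Y) / tr exp(Y)."""
    ev, Q = np.linalg.eigh(Y)
    w = np.exp(ev - ev.max())
    return (Q * (w / w.sum())) @ Q.T

def V(Phi, Psi):
    """Bregman divergence of pmw: V_Phi(Psi)."""
    return pmw(Psi) - pmw(Phi) - np.tensordot(Pmw(Phi), Psi - Phi)

rng = np.random.default_rng(1)

def sym(n):
    B = rng.standard_normal((n, n))
    return (B + B.T) / 2

Phi0, Phi1, Phi2 = sym(4), sym(4), sym(4)
lhs = np.tensordot(Phi2 - Phi1, Pmw(Phi0) - Pmw(Phi1))
rhs = V(Phi0, Phi1) - V(Phi0, Phi2) + V(Phi1, Phi2)
print(abs(lhs - rhs))  # agrees to floating-point precision
```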

B High probability regret bounds

Corollary 2. Let G₁, …, G_T be symmetric gain matrices satisfying Assumption A and let X_t be generated according to Eq. (4). If ‖G_t‖_∞ ≤ 1 for every t, then for every T ∈ ℕ and δ ∈ (0, 1), with probability at least 1 − δ,

    λ_max(Σ_{t=1}^T G_t) − Σ_{t=1}^T ⟨G_t, X_t⟩ ≤ log(4n)/η + (3η/2) T + √(2T log(1/δ)).   (9)

If additionally 0 ⪯ G_t ⪯ I for every t and η ≤ 1/6, then with probability at least 1 − δ,

    λ_max(Σ_{t=1}^T G_t) − Σ_{t=1}^T ⟨G_t, X_t⟩ ≤ log(4n/δ)/η + 4η λ_max(Σ_{t=1}^T G_t).   (10)

Proof. We start with the first claim (9). Recall that a random process D_t adapted to a filtration F_t is σ²-sub-Gaussian if E[exp(λD_t) | F_{t−1}] ≤ exp(λ²σ²/2) for all λ ∈ ℝ. Using the boundedness assumption ⟨G_t, X_t⟩ ≤ ‖G_t‖_∞ ≤ 1, Hoeffding's lemma on bounded random variables (Hoeffding, 1963) implies that the martingale difference sequence ⟨G_t, X_t − X̄_t⟩ is 1-sub-Gaussian. Consequently, the Azuma-Hoeffding inequality (Azuma, 1967) immediately implies that

    Σ_{t=1}^T ⟨G_t, X_t⟩ ≥ Σ_{t=1}^T ⟨G_t, X̄_t⟩ − √(2T log(1/δ))   w.p. ≥ 1 − δ.

The bound (7) in Theorem 1 thus gives the result (9). For the multiplicative bound (10), we require a slightly different relative martingale convergence guarantee.

Lemma 4 (Allen-Zhu and Li (2017), Lemma G.1). Let {D_t} be adapted to the filtration {F_t} and satisfy 0 ≤ D_t ≤ 1. Then, for any δ, µ ∈ (0, 1) and any T ∈ ℕ,

    P(Σ_{t=1}^T D_t ≥ (1 − µ) Σ_{t=1}^T E[D_t | F_{t−1}] − log(1/δ)/µ) ≥ 1 − δ.

Similarly, the assumption 0 ⪯ G_t ⪯ I, along with X_t ∈ ∆_n, implies 0 ≤ ⟨G_t, X_t⟩ ≤ 1. Therefore, the conditions of Lemma 4 hold for D_t = ⟨G_t, X_t⟩, and we use it with µ = η ≤ 1, obtaining

    Σ_{t=1}^T ⟨G_t, X_t⟩ ≥ (1 − η) Σ_{t=1}^T ⟨G_t, X̄_t⟩ − log(1/δ)/η   w.p. ≥ 1 − δ.

The bound (8) in Theorem 1 thus yields that with probability at least 1 − δ over the randomness in X_t and G_t,

    Σ_{t=1}^T ⟨G_t, X_t⟩ ≥ (1 − η)(1 − 3η) λ_max(Σ_{t=1}^T G_t) − log(4n/δ)/η.

Noting that (1 − η)(1 − 3η) ≥ 1 − 4η completes the proof.
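As a quick sanity check of Lemma 4 (a toy simulation of our own, with arbitrarily chosen parameters), one can estimate the probability of the lower-bound event for an i.i.d. Bernoulli sequence, where E[D_t | F_{t−1}] = p:

```python
import math
import random

random.seed(0)

def lemma4_event(T=200, p=0.3, mu=0.2, delta=0.1):
    """One i.i.d. Bernoulli(p) run: does sum_t D_t clear the Lemma 4 bound?"""
    s = sum(1 for _ in range(T) if random.random() < p)
    bound = (1 - mu) * p * T - math.log(1 / delta) / mu
    return s >= bound

freq = sum(lemma4_event() for _ in range(2000)) / 2000
print(freq)  # empirically the event holds with probability >= 1 - delta
```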

C Proofs from Section 2.3

C.1 Proof of Lemma 2

Lemma 2. For any Y, D ∈ S^n, ∇²pmw(Y)[D, D] ≤ ⟨D², Pmw(Y)⟩.

Proof. While the result is evident from the development in (Nesterov, 2007), it is not stated there formally. We therefore derive it here using our notation and one key lemma from (Nesterov, 2007). First, note that

    ⟨D, ∇pmw(Y)⟩ = ⟨D, Pmw(Y)⟩ = ⟨D, e^Y⟩ / tr e^Y,

where throughout ∇ denotes differentiation with respect to Y and D is viewed as fixed. Applying ∇ again gives

    ∇²pmw(Y)[D, D] = ⟨D, ∇(⟨D, e^Y⟩ / tr e^Y)⟩ = ⟨D, ∇⟨D, e^Y⟩⟩ / tr e^Y − (⟨D, e^Y⟩ / tr e^Y)² ≤ ⟨D, ∇⟨D, e^Y⟩⟩ / tr e^Y.

Note that ∇⟨D, e^Y⟩ ≠ De^Y when D and Y do not commute. However, using the Taylor series for the exponential and the formula ∇⟨D, Y^k⟩ = Σ_{i=0}^{k−1} Y^i D Y^{k−1−i} gives

    ∇⟨D, e^Y⟩ = Σ_{k=0}^∞ (1/k!) ∇⟨D, Y^k⟩ = Σ_{k=1}^∞ (1/k!) Σ_{i=0}^{k−1} Y^i D Y^{k−1−i}.

Consequently, we may write

    ⟨D, ∇⟨D, e^Y⟩⟩ = Σ_{k=1}^∞ (1/k!) Σ_{i=0}^{k−1} ⟨D, Y^i D Y^{k−1−i}⟩ = Σ_{k=1}^∞ (1/(2·k!)) Σ_{i=0}^{k−1} ⟨D, Y^i D Y^{k−1−i} + Y^{k−1−i} D Y^i⟩.

Lemma 1 in (Nesterov, 2007) shows that, when Y ⪰ 0,

    ⟨D, Y^i D Y^{k−1−i} + Y^{k−1−i} D Y^i⟩ ≤ 2⟨D², Y^{k−1}⟩.

Substituting back, this gives

    ⟨D, ∇⟨D, e^Y⟩⟩ ≤ Σ_{k=1}^∞ (1/(k−1)!) ⟨D², Y^{k−1}⟩ = ⟨D², e^Y⟩,

and consequently

    ∇²pmw(Y)[D, D] ≤ ⟨D², e^Y⟩ / tr e^Y = ⟨D², Pmw(Y)⟩ = ⟨D², ∇pmw(Y)⟩,

as required. Finally, note that the assumption Y ⪰ 0 is without loss of generality, as Pmw(Y) = Pmw(Y + cI) for every c ∈ ℝ, and therefore ∇²pmw is also invariant to scalar shifts.
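A simple numerical illustration of Lemma 2 (our own, not from the paper): estimate the second directional derivative of pmw by central finite differences and compare it with the bound ⟨D², Pmw(Y)⟩.

```python
import numpy as np

def pmw(Y):
    """pmw(Y) = log tr exp(Y), computed stably from the eigenvalues."""
    ev = np.linalg.eigvalsh(Y)
    return ev.max() + np.log(np.exp(ev - ev.max()).sum())

def Pmw(Y):
    """Gradient of pmw: exp(Y) / tr exp(Y)."""
    ev, Q = np.linalg.eigh(Y)
    w = np.exp(ev - ev.max())
    return (Q * (w / w.sum())) @ Q.T

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
Y = (B + B.T) / 2
C = rng.standard_normal((n, n))
D = (C + C.T) / 2

# central finite-difference estimate of the Hessian quadratic form
h = 1e-4
hess_DD = (pmw(Y + h * D) - 2 * pmw(Y) + pmw(Y - h * D)) / h ** 2
bound = np.tensordot(D @ D, Pmw(Y))  # <D^2, Pmw(Y)>
print(hess_DD <= bound + 1e-6)
```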

C.2 Proof of Lemma 3

Lemma 3. For any Y, D ∈ S^n, orthogonal eigenbasis Q for Y, and w ∼ Dirichlet(1/2, …, 1/2),

    ∇²p̄(Y)[D, D] ≤ 3 · E_w ∇²pmw(Y + Q diag(log w) Qᵀ)[D, D]   (15)
                ≤ 3 ⟨D², P̄(Y)⟩.   (16)

Proof. Let D̃ = Qᵀ D Q, where as before Y = QΛQᵀ is an eigen-decomposition and Λ = diag(λ). Recall that lse : ℝ^n → ℝ denotes the vector softmax function, lse(y) := log(Σ_{i=1}^n e^{y_i}) = pmw(diag y). Similarly, define l̄se(y) := E_w lse(y + log w) for w ∼ Dirichlet(1/2, …, 1/2). By Lemma 1, p̄(Y) = l̄se(λ) is a spectral function. Lewis and Sendov (2001, Theorem 3.3) prove that

    ∇²p̄(Y)[D, D] = ∇²l̄se(λ)[diag D̃, diag D̃] + ⟨Ā(λ), D̃ ∘ D̃⟩,   (28)

where ∘ denotes elementwise multiplication, diag(D̃) is the vector comprised of the diagonal of D̃, and the matrix Ā is given by

    Āij(λ) = (∇_i l̄se(λ) − ∇_j l̄se(λ)) / (λ_i − λ_j) = E_w (∇_i lse(λ + log w) − ∇_j lse(λ + log w)) / (λ_i − λ_j) =: E_w A^w_ij(λ)

for i ≠ j and 0 otherwise, whenever λ has distinct elements. This distinctness assumption is without loss of generality, as p̄ is C² (Lewis and Sendov, 2001, Theorem 4.2), so we may otherwise consider an arbitrarily small perturbation of λ and appeal to continuity of ∇²p̄.

We now use the spectral function Hessian formula to write down ∇²pmw(Y_{w})[D, D], where Y_{w} := Y + Q diag(log w) Qᵀ (noting that Y and Y_{w} have the same eigenvectors):

    ∇²pmw(Y_{w})[D, D] = ∇²lse(λ + log w)[diag D̃, diag D̃] + ⟨A^mw(λ + log w), D̃ ∘ D̃⟩,   (29)

where

    A^mw_ij(λ) := (∇_i lse(λ) − ∇_j lse(λ)) / (λ_i − λ_j) = A¹_ij(λ)

for i ≠ j and 0 otherwise. Taking the expectation over w in (29) and recalling the definition l̄se(λ) = E_w lse(λ + log w) gives

    E_w ∇²pmw(Y_{w})[D, D] = ∇²l̄se(λ)[diag D̃, diag D̃] + ⟨E_w A^mw(λ + log w), D̃ ∘ D̃⟩.   (30)

Comparing Eq. (30) to (28) and the desired bound (15), we see that it remains to upper bound Ā(λ) = E_w A^w(λ) in terms of E_w A^mw(λ + log w). Fix indices i, j ∈ [n] such that i ≠ j, and let

    δ := (λ_i − λ_j)/2  and  ρ := (1/2) log(w_i/w_j).

Since Ā and A^mw are both symmetric matrices, we may assume that λ_i > λ_j and so δ > 0 (recall we assumed λ_i ≠ λ_j without loss of generality). Let w^{i↔j} denote a vector identical to w except coordinates i and j are swapped. With this notation, Lemma 5, which we prove in Section C.2.1, yields the bound

    A^w_ij(λ) + A^{w^{i↔j}}_ij(λ) ≤ (1 + |ρ| tanh(δ)/δ) (A^mw_ij(λ + log w) + A^mw_ij(λ + log w^{i↔j})).

Taking the expectation over w and using the fact that Dirichlet(1/2, …, 1/2) is invariant to permutations, we have

    Āij(λ) ≤ E_w (1 + |ρ| tanh(δ)/δ) A^mw_ij(λ + log w).   (31)

We now focus on the term E_w (|ρ| tanh(δ)/δ) A^mw_ij(λ + log w). We have

    E_w (|ρ| tanh(δ)/δ) A^mw_ij(λ + log w) = E_w (|ρ| tanh(δ)/δ) A^mw_ij(λ + log w) (I{|ρ|≤δ} + I{|ρ|>δ})
        ≤ tanh(δ) E_w A^mw_ij(λ + log w) I{|ρ|≤δ} + (tanh(δ)/δ) E_w |ρ| A^mw_ij(λ + log w) I{|ρ|>δ},   (32)

where the final transition uses |ρ| I{|ρ|≤δ} ≤ δ I{|ρ|≤δ} and A^mw_ij(ζ) ≥ 0 for every ζ ∈ ℝ^n. The latter is a consequence of the convexity of lse and is also evident from Eq. (37) in Section C.2.1.

Since w ∼ Dirichlet(1/2, …, 1/2), ρ = (1/2) log(w_i/w_j) is independent of w_{∖ij} := {w_k}_{k≠i,j}. Moreover, w_i, w_j are completely determined by ρ and w_{∖ij} (see the explicit expression in Section C.2.2). Therefore, conditional on w_{∖ij}, A^mw_ij(λ + log w) is a function of ρ. In Lemma 6 we prove that for every λ and w_{∖ij}, this function is decreasing in ρ for ρ > δ. Hence, conditionally on w_{∖ij} and the event ρ > δ, the random variables |ρ| and A^mw_ij(λ + log w) are negatively correlated: the expectation of their product is at most the product of their expectations. Let E_ρ denote expectation conditional on w_{∖ij}. Lemma 7, with f(ρ) = |ρ|, g(ρ) = A^mw_ij(λ + log w), and S = {ρ | ρ > δ}, gives that

    E_ρ |ρ| A^mw_ij(λ + log w) I{ρ>δ} ≤ (E_ρ[|ρ| | ρ > δ]) E_ρ A^mw_ij(λ + log w) I{ρ>δ}.   (33)

Similarly, Lemma 6 also gives that (conditional on w_{∖ij}) A^mw_ij(λ + log w) is increasing in ρ for ρ < −δ, and therefore, by Lemma 7,

    E_ρ |ρ| A^mw_ij(λ + log w) I{ρ<−δ} ≤ (E_ρ[|ρ| | ρ < −δ]) E_ρ A^mw_ij(λ + log w) I{ρ<−δ}.   (34)

Let z ∼ Beta(1/2, 1/2). The random variable ρ = (1/2) log(w_i/w_j) is symmetric and distributed as (1/2) log((1 − z)/z). Therefore

    E_ρ[|ρ| | ρ < −δ] = E_ρ[|ρ| | ρ > δ] = E[(1/2) log((1−z)/z) | log((1−z)/z) > 2δ] ≤(?) δ + √(1 + e^{−2δ}),

where we prove the inequality (?) in Lemma 10. Substituting this bound into inequalities (33) and (34) and summing them, we obtain

    E_ρ |ρ| A^mw_ij(λ + log w) I{|ρ|>δ} ≤ (δ + √(1 + e^{−2δ})) E_ρ A^mw_ij(λ + log w) I{|ρ|>δ}.

Taking the expectation over w_{∖ij} and substituting back into (32) therefore gives

    E_w (|ρ| tanh(δ)/δ) A^mw_ij(λ + log w) ≤ (tanh(δ) + √(1 + e^{−2δ}) · tanh(δ)/δ) · E_w A^mw_ij(λ + log w),

where we used again A^mw_ij(·) ≥ 0 in order to increase the multiplier of E_w A^mw_ij(λ + log w) I{|ρ|≤δ}. Computation shows that tanh(δ) + √(1 + e^{−2δ}) · tanh(δ)/δ ≤ 1.58 ≤ 2 for every δ ≥ 0. Therefore, by the bound (31) we have

    Āij(λ) ≤ 3 · E_w A^mw_ij(λ + log w).   (35)

Returning to (28), we write

    ∇²p̄(Y)[D, D] ≤ ∇²l̄se(λ)[diag D̃, diag D̃] + 3⟨E_w A^mw(λ + log w), D̃ ∘ D̃⟩
        ≤ 3 (∇²l̄se(λ)[diag D̃, diag D̃] + ⟨E_w A^mw(λ + log w), D̃ ∘ D̃⟩).

In the first inequality above, we substituted the bound (35), using the fact that all the entries of D̃ ∘ D̃ are nonnegative. In the second inequality, we used the fact that ∇²l̄se(λ)[diag D̃, diag D̃] ≥ 0 since l̄se is convex. Recalling the expression (30) gives (15). The final bound (16) follows from applying Lemma 2 to the right side of (15) and using the identity (12).

C.2.1 A pointwise bound for Lemma 3

In this section we prove an elementary inequality that plays a central role in the proof of Lemma 3. Let i, j ∈ [n] be such that i ≠ j. For λ ∈ ℝ^n, we define

    N_ij(λ) := ∇_i lse(λ) − ∇_j lse(λ) = (e^{λ_i} − e^{λ_j}) / Σ_{k=1}^n e^{λ_k} = sinh((λ_i − λ_j)/2) / (cosh((λ_i − λ_j)/2) + (1/2) Σ_{k≠i,j} e^{λ_k − (λ_i + λ_j)/2})   (36)

and

    A^mw_ij(λ) = N_ij(λ)/(λ_i − λ_j)  and  A^w_ij(λ) = N_ij(λ + log w)/(λ_i − λ_j).   (37)

Additionally, for any vector w ∈ ℝ^n, let w^{i↔j} denote a vector identical to w except coordinates i and j are swapped. With this notation in hand, we state and prove our bound.

Lemma 5. Let λ ∈ ℝ^n, w ∈ ℝ^n_+ and i, j ∈ [n], i ≠ j. Set δ = (λ_i − λ_j)/2 and ρ = (1/2) log(w_i/w_j). Then

    A^w_ij(λ) + A^{w^{i↔j}}_ij(λ) ≤ (1 + |ρ| tanh(δ)/δ) (A^mw_ij(λ + log w) + A^mw_ij(λ + log w^{i↔j})).

Proof. Define

    r = (1/2) Σ_{k∉{i,j}} e^{λ_k + log w_k − (λ_i + log w_i + λ_j + log w_j)/2} ≥ 0.

Observe that if we swap w_i and w_j, δ and r remain unchanged and the sign of ρ reverses. For x ∈ ℝ, let f(x) := sinh(x)/(cosh(x) + r). Using (36), we may write

    q₁ := 2A^w_ij(λ) + 2A^{w^{i↔j}}_ij(λ) = f(δ + ρ)/δ + f(δ − ρ)/δ

and

    q₂ := 2A^mw_ij(λ + log w) + 2A^mw_ij(λ + log w^{i↔j}) = f(δ + ρ)/(δ + ρ) + f(δ − ρ)/(δ − ρ).

With these definitions, our goal is to prove that (q₁ − q₂)/q₂ ≤ |ρ| tanh(δ)/δ. Since f(x) is an odd function of x, the terms q₁ and q₂ are invariant to sign flips in either δ or ρ. Therefore, we may assume both δ ≥ 0 and ρ ≥ 0 without loss of generality. Substituting back the expressions for q₁, q₂ and using that |ρ| = ρ by assumption yields

    (q₁ − q₂)/q₂ = (ρ/δ) · (f(δ+ρ)/(δ+ρ) − f(δ−ρ)/(δ−ρ)) / (f(δ+ρ)/(δ+ρ) + f(δ−ρ)/(δ−ρ)) = (ρ/δ) · (g(δ+ρ) − g(δ−ρ)) / (g(δ+ρ) + g(δ−ρ)),   (38)

where

    g(x) := f(x)/x = (tanh(x)/x) · (cosh(x)/(cosh(x) + r)).

Note that tanh(x)/x is decreasing in |x|. Since |δ − ρ| ≤ |δ + ρ| by the assumption ρ, δ ≥ 0, we have

    g(δ − ρ) ≥ (tanh(δ + ρ)/(δ + ρ)) · cosh(δ − ρ)/(cosh(δ − ρ) + r),

and therefore

    g(δ+ρ) − g(δ−ρ) ≤ (tanh(δ+ρ)/(δ+ρ)) (cosh(δ+ρ)/(cosh(δ+ρ)+r) − cosh(δ−ρ)/(cosh(δ−ρ)+r)),

and similarly

    g(δ+ρ) + g(δ−ρ) ≥ (tanh(δ+ρ)/(δ+ρ)) (cosh(δ+ρ)/(cosh(δ+ρ)+r) + cosh(δ−ρ)/(cosh(δ−ρ)+r)).

As g(x) > 0 for every x, we may divide these bounds and obtain via elementary manipulation

    (g(δ+ρ) − g(δ−ρ)) / (g(δ+ρ) + g(δ−ρ))
        ≤ (cosh(δ+ρ)/(cosh(δ+ρ)+r) − cosh(δ−ρ)/(cosh(δ−ρ)+r)) / (cosh(δ+ρ)/(cosh(δ+ρ)+r) + cosh(δ−ρ)/(cosh(δ−ρ)+r))
        = r [cosh(δ+ρ) − cosh(δ−ρ)] / (2 cosh(δ+ρ) cosh(δ−ρ) + r [cosh(δ+ρ) + cosh(δ−ρ)])
        ≤ (cosh(δ+ρ) − cosh(δ−ρ)) / (cosh(δ+ρ) + cosh(δ−ρ)) = tanh(ρ) tanh(δ) ≤ tanh(δ).

Substituting back into (38) establishes the desired bound. Examining the proof, we see that the bound is tight for large values of r and |ρ|.

C.2.2 Piecewise monotonicity of A^mw

Lemma 6. Let λ ∈ Rn , w ∈ σn (the simplex in Rn ), and i, j ∈ [n] such that δ := 21 (λi − λj ) > 0, wi and set ρ := 12 log w . When λ and {wk }k6=i,j are held fixed, Amw ij (λ + log w) is increasing in ρ for j ρ < −δ, and decreasing in ρ for ρ > δ. Proof. First, we write Amw ij (λ + log w) explicitly as a function of ρ, with λ and {wk }k6=i,j as fixed parameters. By (37) we have Amw ij (λ + log w) =

Let m = wi + wj = 1 − m wj = 1+e 2ρ . Therefore,

P

k6=i,j

√

sinh(ρ + δ) P 2(ρ + δ) cosh(ρ + δ) + 12 k∈{i,j} / wk . Since

wi wj

√

wk λk − wi wj e

λi +λj 2

.

= e2ρ and w ∈ σn , we have that wi =

m 1+e−2ρ

and

1p 2 1 = (1 + e−2ρ )(1 + e2ρ ) = cosh(ρ). wi wj m m

Thus, Amw ij (λ + log w) =

sinh(ρ + δ) , 2(ρ + δ) [cosh(ρ + δ) + r0 cosh(ρ)]

λ +λ P wk λk − i 2 j is a function of only λ and {wk }k6=i,j , and therefore Amw where r0 = k∈{i,j} ij (λ+log w) / me can be viewed as a function of ρ as claimed. Writing x = ρ + δ, showing the desired monotonicity properties is equivalent to showing that

b(x) :=

sinh(x) x (cosh(x) + r0 cosh(x − δ))

is decreasing for x > 2δ and increasing for x < 0. The derivative of b(x) is b0 (x) =

cosh(x) − x1 sinh(x) sinh(x) [sinh(x) + r0 sinh(x − δ)] − , x (cosh(x) + r0 cosh(x − δ)) x [cosh(x) + r0 cosh(x − δ)]2

and has, for all x ∈ R, the same sign as s :=

1 sinh(x) + r0 sinh(x − δ) x [cosh(x) + r0 cosh(x − δ)] 0 b (x) = coth(x) − − . sinh(x) x cosh(x) + r0 cosh(x − δ)

(39)

a1 a2 2 For x > 2δ, we have by Dan’s favorite inequality ( ab11 +a +b2 ≥ min{ b1 , b2 } for all a1 , a2 , b1 , b2 ≥ 0),

sinh(x) + r0 sinh(x − δ) ≥ min {tanh(x), tanh(x − δ)} = tanh(x − δ) > tanh(x/2), cosh(x) + r0 cosh(x − δ) where in the last transition we used the fact that x > 2δ implies x − δ > x/2. Therefore, for x > 2δ we have the following bound for s, s ≤ coth(x) −

1 1 1 − tanh(x/2) = − < 0, x sinh(x) x

so we have that b(x) is decreasing for x > 2δ as required, since s has the same sign as b0 (x). 25

Similarly, for x < 0, we have by Dan’s favorite inequality, − sinh(x) − r0 sinh(x − δ) ≥ min {− tanh(x), − tanh(x − δ)} = − tanh(x). cosh(x) + r0 cosh(x − δ) Therefore, for x < 0 we have s ≥ coth(x) −

1 1 2 − tanh(x) = − > 0, x −x sinh(−2x)

which shows that b(x) is increasing for x < 0, concluding the proof. The following Lemma proves the intuitive fact that decreasing and increasing functions of the same random variable are negatively correlated. Lemma 7. Let ρ be a real-valued random variable, let f, g be functions from R to R and let S ⊂ R be an interval. If f (x) is non-decreasing in x for x ∈ S and g(x) is non-increasing in x for x ∈ S, then Ef (ρ)g(ρ)I{ρ∈S} ≤ (E [f (ρ) | ρ ∈ S]) · Eg(ρ)I{ρ∈S} . Proof. For every x, x0 ∈ S we have (f (x) − f (x0 )) · (g(x) − g(x0 )) ≤ 0. Hence, for every x, x0 ∈ R, the bound (f (x) − f (x0 )) · (g(x) − g(x0 )) · I{x∈S} I{x0 ∈S} ≤ 0 holds as well. Let ρ0 be an independent copy of ρ, then E (f (ρ) − f (ρ0 )) · (g(ρ) − g(ρ0 )) · I{ρ∈S} I{ρ0 ∈S} ≤ 0. Rearranging and using the fact that ρ, ρ0 are i.i.d., we have Ef (ρ)g(ρ)I{ρ∈S} · EI{ρ0 ∈S} ≤ E f (ρ)I{ρ∈S} · E g(ρ0 )I{ρ0 ∈S} . Dividing by EI{ρ0 ∈S} = P(ρ ∈ S) yields the desired bound.

C.3 Facts about the Beta distribution

Here we collect properties of Beta-distributed random variables, which we use in our development.

Lemma 8. Let n ∈ N and let z ∼ Beta(1/2, (n−1)/2). Then

  E log(1/z) = ψ(n/2) − ψ(1/2) ≤ log(n) + log(2) + γ ≤ log(4n),

where ψ(x) = (d/dx) log Γ(x) is the digamma function, and γ is the Euler-Mascheroni constant.

Proof. E log(1/z) = ψ(n/2) − ψ(1/2) by the well-known formula for the expectation of the logarithm of a Beta random variable. We have ψ(x) ≤ log(x) (Alzer, 1997) and ψ(1/2) = −log(4) − γ. Moreover, γ ≤ log 2, giving the final bound.

Lemma 9. Let z ∼ Beta(1/2, 1/2) and ℓ ≥ 0. Then

  (2/π) · e^{−ℓ/2}/√(1 + e^{−ℓ}) ≤ P( log((1−z)/z) ≥ ℓ ) ≤ (2/π) · e^{−ℓ/2}.


Proof. The distribution Beta(1/2, 1/2) has density (1/π) x^{−1/2} (1 − x)^{−1/2}. Therefore

  P( log((1−z)/z) ≥ ℓ ) = P( z ≤ 1/(1 + e^ℓ) ) = (1/π) ∫₀^{(1+e^ℓ)^{−1}} x^{−1/2} (1 − x)^{−1/2} dx.

To obtain a lower bound, we use (1 − x)^{−1/2} ≥ 1 for every x ∈ [0, 1], and therefore

  P( log((1−z)/z) ≥ ℓ ) ≥ (1/π) ∫₀^{(1+e^ℓ)^{−1}} x^{−1/2} dx = (2/π) · 1/√(1 + e^ℓ) = (2/π) · e^{−ℓ/2}/√(1 + e^{−ℓ}).

For the upper bound, we use (1 − x)^{−1/2} ≤ (1 − 1/(1 + e^ℓ))^{−1/2} for every 0 ≤ x ≤ (1 + e^ℓ)^{−1}, giving

  P( log((1−z)/z) ≥ ℓ ) ≤ (1/π) √((1 + e^ℓ)/e^ℓ) ∫₀^{(1+e^ℓ)^{−1}} x^{−1/2} dx = (2/π) e^{−ℓ/2}.

Lemma 10. Let z ∼ Beta(1/2, 1/2) and ℓ ≥ 0. Then

  E[ log((1−z)/z) | log((1−z)/z) ≥ ℓ ] ≤ ℓ + 2√(1 + e^{−ℓ}).

Proof. Conditional on log((1−z)/z) ≥ ℓ, log((1−z)/z) is a nonnegative random variable, and we may therefore write

  E[ log((1−z)/z) | log((1−z)/z) ≥ ℓ ] = ∫₀^∞ P( log((1−z)/z) ≥ x | log((1−z)/z) ≥ ℓ ) dx
    = ℓ + ∫_ℓ^∞ [ P( log((1−z)/z) ≥ x ) / P( log((1−z)/z) ≥ ℓ ) ] dx.

By Lemma 9,

  P( log((1−z)/z) ≥ x ) / P( log((1−z)/z) ≥ ℓ ) ≤ √(1 + e^{−ℓ}) · e^{−(x−ℓ)/2}.

Integrating, we obtain the desired bound.

Lemma 11. Let 3 ≤ n ∈ N and let z ∼ Beta(1/2, (n−1)/2). For every δ ∈ (0, 1),

  P( z ≥ δ²/n ) > 1 − δ.

Proof. The random variable z has density

  [ Γ(n/2) / (Γ(1/2) Γ((n−1)/2)) ] x^{−1/2} (1 − x)^{(n−3)/2} ≤ √(n/(2πx)),

where we used Γ(1/2) = √π and Gautschi's inequality Γ(m+1)/Γ(m+s) ≤ (m+1)^{1−s} with m = n/2 − 1 and s = 1/2. Integrating the upper bound on the density, we find P(z ≤ δ²/n) ≤ √(2/π) δ < δ.
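The three lemmas above can be sanity-checked by Monte Carlo using only the standard library (an illustrative sketch of ours, not from the paper): Beta(1/2, (n−1)/2) variates can be formed from two Gamma variates, and Beta(1/2, 1/2) is the arcsine law z = sin²(πU/2). The constants below (n = 10, δ = 0.2, the grid of ℓ values) are arbitrary choices:

```python
# Monte Carlo sanity checks (ours) for the Beta-distribution lemmas, stdlib only.
import math
import random

random.seed(1)
N = 100000

# Lemma 8: E[log(1/z)] <= log(4n) for z ~ Beta(1/2, (n-1)/2).
n = 10
zs = []
for _ in range(N):
    x = random.gammavariate(0.5, 1.0)          # Beta via a ratio of Gammas
    y = random.gammavariate((n - 1) / 2, 1.0)
    zs.append(x / (x + y))
mean_log_inv = sum(math.log(1.0 / z) for z in zs) / N
assert mean_log_inv <= math.log(4 * n)

# Lemma 9: (2/pi) e^{-l/2}/sqrt(1+e^{-l}) <= P(log((1-z)/z) >= l) <= (2/pi) e^{-l/2}.
arcsine = [math.sin(0.5 * math.pi * random.random()) ** 2 for _ in range(N)]
for l in (0.0, 1.0, 3.0):
    p = sum(1 for z in arcsine if math.log((1 - z) / z) >= l) / N
    lo = (2 / math.pi) * math.exp(-l / 2) / math.sqrt(1 + math.exp(-l))
    hi = (2 / math.pi) * math.exp(-l / 2)
    assert lo - 0.01 <= p <= hi + 0.01          # 0.01 Monte Carlo slack

# Lemma 11: P(z >= delta^2 / n) > 1 - delta.
delta = 0.2
p11 = sum(1 for z in zs if z >= delta ** 2 / n) / N
assert p11 > 1 - delta
print("Beta-distribution lemmas pass the Monte Carlo checks")
```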

D Efficient computation of matrix exponential-vector products

In this section we give a more detailed discussion of matrix exponential-vector product approximation using the Lanczos method, and prove the results stated in Section 3. In Section D.1 we formally state the Lanczos method. In Section D.2 we survey known approximation guarantees and derive simple corollaries. In Section D.3 we show that we can apply the matrix exponential to a random vector with a multiplicative error guarantee, and in Section D.4 we prove that it implies Proposition 2. In Section D.5 we discuss some possible improvements to our guarantees via modifications of and alternatives to the Lanczos method. Finally, in Section D.6 we prove Corollary 3. Throughout this section we use mv(A) to denote the time required to multiply the matrix A with any vector.

D.1 Description of the Lanczos method

Algorithm 2: Lanczos method for computing the matrix exponential-vector product ẽxp_k(A, b)

  input: A ∈ S^n, number of iterations k, vector b ∈ R^n
  q_0 ← 0 ∈ R^n, q_1 ← b/‖b‖_2, β_1 ← 1
  for i = 1 to k do
    q_{i+1} ← A q_i − β_i q_{i−1} and α_i ← q_{i+1}^T q_i
    q_{i+1} ← q_{i+1} − α_i q_i and β_{i+1} ← ‖q_{i+1}‖_2
    if β_{i+1} = 0 then break else q_{i+1} ← q_{i+1}/β_{i+1}
  end
  Let Q = [q_1 ⋯ q_k] and

    T = [ α_1  β_2        0
          β_2  α_2  ⋱
               ⋱    ⋱   β_k
          0        β_k  α_k ]

  Compute the tridiagonal eigen-decomposition T = V Λ V^T
  return: ẽxp_k(A, b) = ‖b‖_2 · Q V exp(Λ) V^T e_1

Ignoring numerical precision issues, each iteration in the for loop requires O(mv(A)) time, and eigen-decomposition of a k-by-k tridiagonal matrix requires O(k²) time (Gu and Eisenstat, 1995), so the total complexity is O(mv(A) k + k²). In practical settings k ≪ n ≤ mv(A) and the cost of the eigen-decomposition is negligible. Nevertheless, there are ways to avoid performing it, which we discuss briefly in Section D.5.
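For concreteness, a minimal numpy rendering of Algorithm 2 follows. This is a sketch under our own naming (`lanczos_expm_mv` is not from the paper), and we add full reorthogonalization for numerical robustness, which the stated algorithm omits; the result is checked against an exact eigen-decomposition of A:

```python
# Minimal numpy sketch of Algorithm 2 (Lanczos approximation of exp(A) b).
import numpy as np

def lanczos_expm_mv(A, b, k):
    """Approximate exp(A) @ b with k Lanczos iterations (A symmetric)."""
    n = b.shape[0]
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)          # beta[i] couples q_{i-1} and q_i; beta[0] unused
    q = b / np.linalg.norm(b)
    q_prev = np.zeros(n)
    for i in range(k):
        Q[:, i] = q
        v = A @ q - (beta[i] * q_prev if i > 0 else 0.0)
        alpha[i] = q @ v
        v = v - alpha[i] * q
        v = v - Q[:, : i + 1] @ (Q[:, : i + 1].T @ v)   # full reorthogonalization
        nrm = np.linalg.norm(v)
        if nrm == 0.0:                                  # invariant subspace: stop early
            Q, alpha, beta = Q[:, : i + 1], alpha[: i + 1], beta[: i + 1]
            break
        if i + 1 < k:
            beta[i + 1] = nrm
        q_prev, q = q, v / nrm
    # tridiagonal T = Q^T A Q; exponentiate via its (cheap) eigen-decomposition
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    evals, V = np.linalg.eigh(T)
    e1 = np.zeros(len(alpha))
    e1[0] = 1.0
    return np.linalg.norm(b) * (Q @ (V @ (np.exp(evals) * (V.T @ e1))))

# check against an exact eigen-decomposition on a random symmetric matrix
rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
A = (B + B.T) / np.sqrt(n)      # symmetric with operator norm O(1)
b_vec = rng.standard_normal(n)
w, U = np.linalg.eigh(A)
exact = U @ (np.exp(w) * (U.T @ b_vec))
approx = lanczos_expm_mv(A, b_vec, 30)
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
assert rel_err < 1e-8
print(f"relative error with k = 30: {rel_err:.2e}")
```

With ‖A‖_∞ of constant order, k = 30 already drives the error far below the assertion threshold, illustrating the rapid convergence the next subsection quantifies.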

D.2 Known approximation results, and some corollaries

We begin with a result on uniform polynomial approximation of the exponential due to Sachdeva and Vishnoi (2014).

Theorem 3 (Sachdeva and Vishnoi (2014), Theorem 4.1, restated). For every b > 0 and every ε ∈ (0, 1] there exists a polynomial p : R → R of degree O(√(max{b, log(1/ε)} log(1/ε))) such that

  sup_{x ∈ [0, b]} |exp(−x) − p(x)| ≤ ε.

As an immediate corollary we obtain the following bound for approximating exp(x) over arbitrary intervals.

Corollary 4. For every a < b ∈ R and every ε ∈ (0, 1] there exists a polynomial p : R → R of degree O(√(max{b − a, log(1/ε)} log(1/ε))) such that

  sup_{x ∈ [a, b]} |exp(x) − p(x)| ≤ exp(b) ε.

Proof. For all x ∈ [a, b] we have b − x ∈ [0, b − a], and therefore by Theorem 3 there is a degree O(√(max{b − a, log(1/ε)} log(1/ε))) polynomial q : R → R such that

  sup_{x ∈ [a, b]} |exp(−(b − x)) − q(b − x)| ≤ ε.

Since exp(−(b − x)) = exp(−b) exp(x), the polynomial p(x) = exp(b) q(b − x) is as desired.

The classical theory of the Lanczos method tells us that its error is bounded by twice that of any uniform polynomial approximation. However, this theory does not account for finite precision. A recent result (Musco et al., 2017) ties polynomial approximation to the error of the Lanczos method using finite bitwidth floating point operations.

Theorem 4 (Musco et al. (2017), Theorem 1). Let A ∈ S^n, u ∈ R^n, and f : R → R. Suppose k ∈ N, η ∈ (0, ‖A‖_∞] and a polynomial p of degree < k satisfy

  sup_{x ∈ [λ_min(A) − η, λ_max(A) + η]} |f(x) − p(x)| ≤ ε_k  and  sup_{x ∈ [λ_min(A) − η, λ_max(A) + η]} |f(x)| ≤ C.

For any µ ∈ (0, 1), let y_{k,µ} be the output of k iterations of the Lanczos method for approximating f(A)u, using floating point operations with B ≥ c log(nk‖A‖_∞/(µη)) bits of precision (for a numerical constant c < ∞). Then y_{k,µ} satisfies

  ‖f(A)u − y_{k,µ}‖_2 ≤ (7k · ε_k + µ · C) ‖u‖_2.

If arithmetic operations with B bits of precision can be performed in O(1) time, then the method can be implemented in time O(mv(A) k + k B max{k, B}).

Specializing to the matrix exponential and using the uniform approximation guarantee of Corollary 4, we immediately obtain the following.

Corollary 5. Let A ∈ S^n, u ∈ R^n, and ε > 0, and set M = max{‖A‖_∞, log(1/ε), 1}. There exist numerical constants c, c′ < ∞ such that, for k ≥ c √(M log(M/ε)) and B ≥ c′ log(nM/ε), computing y = ẽxp_k(A, u) with B bits of floating point precision guarantees

  ‖exp(A)u − y‖_2 ≤ ε exp(λ_max(A)) ‖u‖_2.

The computation takes time

  O( mv(A) √(M log(M/ε)) + M log²(nM/ε) ),

provided Θ(log(nM/ε))-bit arithmetic operations can be performed in time O(1).


Proof. Let η = 1. Using λ_max(A) − λ_min(A) ≤ 2‖A‖_∞, Corollary 4 yields that for every α ∈ (0, 1] there exists a degree O(√(max{1 + ‖A‖_∞, log(1/α)} log(1/α))) polynomial p : R → R such that

  sup_{x ∈ [λ_min(A) − η, λ_max(A) + η]} |exp(x) − p(x)| ≤ α exp(η) exp(λ_max(A)).

Further, since |exp(x)| ≤ exp(η) exp(λ_max(A)) for all x ∈ [λ_min(A) − η, λ_max(A) + η], Theorem 4 with f(x) = e^x and η = 1 implies that for all µ ∈ (0, 1), applying Lanczos for k = O(√(max{‖A‖_∞, log(1/α)} log(1/α))) iterations on a floating point machine with Θ(B) bits of precision, for B = log(nk‖A‖_∞/µ), returns y with

  ‖f(A)u − y‖_2 ≤ ( µ + α · O(√(max{‖A‖_∞, log(1/α)} log(1/α))) ) exp(η) exp(λ_max(A)) ‖u‖_2

in time O((mv(A) + n) k + k B max{k, B}). Choosing α = O(ε/(M log(M/ε))) and µ = O(ε) yields the result.

D.3 Multiplicative approximation for random vectors

We now combine the known results cited in the previous section with the randomness of the vector fed to the matrix exponential, to obtain a multiplicative guarantee that holds with high probability over the choice of u, though not for every u on the unit sphere.

Proposition 3. Let ε ∈ (0, 1), δ ∈ (0, 1), and A ∈ S^n, and set M = max{‖A‖_∞, log(n/(εδ)), 1}. If u is sampled uniformly at random from the unit sphere and, for some k = Ω(√(M log(nM/(εδ)))) ∈ N, we let y = ẽxp_k(A, u) (see Algorithm 2), then

  ‖exp(A)u − y‖_2 ≤ ε ‖exp(A)u‖_2 with probability ≥ 1 − δ.

This can be implemented in time O(mv(A) √(M log(nM/(εδ))) + M log²(nM/(εδ))) on a floating point machine with O(log(nM/(εδ))) bits of precision where arithmetic operations take O(1) time.

Proof. Consider an application of Corollary 5 to compute y such that ‖exp(A)u − y‖_2 ≤ ε′ exp(λ_max(A)) ‖u‖_2. Now let v be a unit eigenvector of A with eigenvalue λ_max(A). Since v is an eigenvector of the PSD matrix exp(A) with eigenvalue exp(λ_max(A)), we have ‖exp(A)u‖_2 ≥ exp(λ_max(A)) |v^T u|. Since u is a random unit vector, |v^T u|²/‖u‖_2² ∼ Beta(1/2, (n−1)/2), and Lemma 11 therefore gives that |v^T u|²/‖u‖_2² ≥ δ²/n with probability at least 1 − δ. Consequently, exp(λ_max(A)) ‖u‖_2 ≤ (√n/δ) ‖exp(A)u‖_2 with the same probability. Choosing ε′ = εδ/√n and invoking Corollary 5 yields the result.
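The two facts driving this proof can be illustrated numerically (a sketch of ours, with arbitrary n and δ): for a uniformly random unit vector u and top unit eigenvector v of A, ‖exp(A)u‖_2 ≥ exp(λ_max(A)) |v^T u| always, and |v^T u|² ≥ δ²/n with frequency exceeding 1 − δ:

```python
# Numerical illustration (ours) of the two inequalities behind Proposition 3.
import numpy as np

rng = np.random.default_rng(3)
n, delta = 50, 0.3
B = rng.standard_normal((n, n))
A = (B + B.T) / np.sqrt(n)
w, U = np.linalg.eigh(A)
lmax, v = w[-1], U[:, -1]
expA = U @ np.diag(np.exp(w)) @ U.T

hits = 0
trials = 2000
for _ in range(trials):
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)                      # uniform on the unit sphere
    # deterministic lower bound on the norm of exp(A) u
    assert np.linalg.norm(expA @ u) >= np.exp(lmax) * abs(v @ u) - 1e-9
    if (v @ u) ** 2 >= delta ** 2 / n:          # the Lemma 11 event
        hits += 1
freq = hits / trials
assert freq > 1 - delta
print(f"P(|v^T u|^2 >= delta^2/n) ~= {freq:.3f} > {1 - delta}")
```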

D.4 Proof of Proposition 2

The following lemma relates the multiplicative approximation error for matrix exponential-vector products to the additive approximation error for P_u(Y) in trace norm. Combining it with Proposition 3 immediately yields Proposition 2.


Lemma 12. Let Y ∈ S^n, u ∈ R^n and ε ∈ [0, 1). If y ∈ R^n satisfies

  ‖exp(Y/2)u − y‖_2 ≤ (ε/√8) ‖exp(Y/2)u‖_2,

then

  ‖ P_u(Y) − y y^T/‖y‖_2² ‖_1 ≤ ε.

Proof. Let z := exp(Y/2)u, so that by assumption ‖z − y‖_2 ≤ (ε/√8) ‖z‖_2. Further, let z̄ := z/‖z‖_2 and ȳ := y/‖y‖_2. Direct calculation (see, e.g., Lemma 27 of Cohen et al. (2016)) yields that the nonzero eigenvalues of z̄ z̄^T − ȳ ȳ^T are ±√(1 − (z̄^T ȳ)²), and therefore the definition of P_u(Y) yields

  ‖ P_u(Y) − y y^T/‖y‖_2² ‖_1 = ‖ z̄ z̄^T − ȳ ȳ^T ‖_1 = 2 √(1 − (z̄^T ȳ)²).   (40)

Writing θ for the angle between z and y, we have √(1 − (z̄^T ȳ)²) = |sin θ| = dist(z, span(y))/‖z‖_2, and since y ∈ span(y),

  √(1 − (z̄^T ȳ)²) ≤ ‖z − y‖_2 / ‖z‖_2.   (41)

Combining (40) and (41) with the fact that ‖z − y‖_2 ≤ (ε/√8) ‖z‖_2 then yields

  ‖ P_u(Y) − y y^T/‖y‖_2² ‖_1 ≤ 2 · (ε/√8) = ε/√2 ≤ ε.

Therefore, Proposition 2 follows immediately by invoking Proposition 3 with a slightly smaller ε.
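Both the eigenvalue identity and the conclusion of Lemma 12 are easy to verify numerically. The following check (ours) draws random z and perturbations y within the assumed relative radius ε/√8 and confirms the trace norm equals 2√(1 − (z̄^T ȳ)²) and stays below ε:

```python
# Numerical check (ours) of Lemma 12's identity and trace-norm bound.
import numpy as np

rng = np.random.default_rng(2)
n, eps = 50, 0.5
for _ in range(100):
    z = rng.standard_normal(n)
    e = rng.standard_normal(n)
    # perturbation of norm at most (eps / sqrt(8)) * ||z||
    y = z + (eps / np.sqrt(8)) * np.linalg.norm(z) * rng.random() * e / np.linalg.norm(e)
    zb = z / np.linalg.norm(z)
    yb = y / np.linalg.norm(y)
    M = np.outer(zb, zb) - np.outer(yb, yb)
    trace_norm = np.abs(np.linalg.eigvalsh(M)).sum()
    identity = 2.0 * np.sqrt(max(0.0, 1.0 - (zb @ yb) ** 2))
    assert abs(trace_norm - identity) < 1e-8
    assert trace_norm <= eps
print("Lemma 12 identity and bound hold on all random instances")
```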

D.5 Improvements to the Lanczos method

In this paper we focus on the Lanczos method for approximating matrix exponential-vector products because of its excellent practicality and clean analysis. However, there are several modifications of the method with appealing features, which we now describe briefly.

A common theme among these modifications is the use of rational approximations to the exponential, which converge far faster than polynomial approximations (Orecchia et al., 2012; Sachdeva and Vishnoi, 2014). Consequently, it suffices to perform O(1) Lanczos iterations on a carefully shifted and inverted version of the matrix. Each of these iterations then involves solving a linear system, and the efficacy of the shift-invert scheme depends on how quickly these systems are solved.

One basic approach to solving these systems is via standard iterative methods, e.g., conjugate gradient. We expect such an approach to offer little to no advantage over applying the Lanczos approximation directly, as both methods produce vectors in the same Krylov subspace. However, the approach renders the number of Lanczos iterations k logarithmic in ‖A‖_∞, and therefore the cost k² will never dominate the cost of the matrix-vector products (Orecchia et al., 2012; Musco et al., 2017, Corollary 17).

There is, however, a simpler way of avoiding the eigen-decomposition: simply apply the rational approximation to the tridiagonal matrix formed by running the ordinary Lanczos method, as Saad (1992) proposes. With an appropriate rational function, computing a highly accurate approximation to exp(T)e_1 requires O(1) tridiagonal system solves, each costing O(k) time. We leave the derivation of explicit error bounds for this technique (similar to Corollary 4) to future work. In practice, the cost O(k²) of the tridiagonal eigen-decomposition will often be very small compared to the cost O(mv(A) k) of the matrix-vector products.

More significant improvements are possible if the linear system solving routine is able to exploit information beyond matrix-vector products. For example, consider the case where the matrix to be exponentiated is a sum of very sparse matrices; this happens for our sketch whenever the G_t matrices are much sparser than their cumulative sum. Then it is possible to use stochastic variance-reduced optimization methods to solve the linear system, as Allen-Zhu and Li (2017) describe. Another scenario of interest is when the input matrix has Laplacian/SDD structure; in this case the performance of specialized linear system solvers implies approximation guarantees in which the polynomial dependence on ‖A‖_∞ is removed altogether (Orecchia et al., 2012). A final useful structure is a chordal sparsity pattern (Vandenberghe et al., 2015), which enables efficient linear system solving through fast Cholesky decomposition.

D.6 Proof of Corollary 3

Corollary 3. Let G_1, …, G_T be symmetric gain matrices satisfying ‖G_t‖_∞ ≤ 1 for every t. There exists a numerical constant k_0 < ∞ such that for every T ∈ N and δ ∈ (0, 1), X̃_{t;k_t} defined in (20) with

  k_t = ⌈ k_0 √(1 + ηt) log(nT/δ) ⌉,

and X_t defined in (4), satisfy

  Σ_{t=1}^T ⟨G_t, X̃_{t;k_t}⟩ ≥ −1 + Σ_{t=1}^T ⟨G_t, X_t⟩ w.p. ≥ 1 − δ/2.   (21)

Let ε ∈ (0, 1], T = ⌈16 log(4en/δ)/ε²⌉ and η = √(2 log(4en)/(3T)). If Assumption A holds with respect to the actions X̃_{t;k_t}, then with probability at least 1 − δ,

  (1/T) λ_max( Σ_{t=1}^T G_t ) − (1/T) Σ_{t=1}^T ⟨G_t, X̃_{t;k_t}⟩ ≤ ε.

Computing the actions X̃_{1;k_1}, …, X̃_{T;k_T} requires O(ε^{−2.5} log^{2.5}(n/(εδ))) matrix-vector products.

Proof. To obtain the bound (21) we use Proposition 2 with ε ← 1/T and δ ← δ/(2T) (since we will use a union bound). At iteration t, since ‖G_i‖_∞ ≤ 1 for all i < t, the quantity M appearing in Proposition 2 can be bounded as

  M ≤ 1 + (η/2) ‖ Σ_{i=1}^{t−1} G_i ‖_∞ + log(2nT²/δ) ≤ O(1) (1 + ηt) log(nT/δ).

Therefore, our choice of k_t suffices to guarantee, for Y_t = η Σ_{i=1}^{t−1} G_i,

  ‖ P_{u_t}(Y_t) − P̃_{u_t;k_t}(Y_t) ‖_1 ≤ 1/T with probability ≥ 1 − δ/(2T),

and so by the union bound the inequality above holds for all t = 1, …, T with probability at least 1 − (δ/2). Note that when using Proposition 2 we use the fact that u_t is independent of Y_t. Thus, we have

  Σ_{t=1}^T ⟨G_t, X_t − X̃_{t;k_t}⟩ ≤ Σ_{t=1}^T ‖G_t‖_∞ ‖X_t − X̃_{t;k_t}‖_1 ≤ Σ_{t=1}^T ‖ P_{u_t}(Y_t) − P̃_{u_t;k_t}(Y_t) ‖_1 ≤ 1,

giving (21), where we have used ‖G_t‖_∞ ≤ 1 for every t.

Note that if Assumption A holds with respect to the actions X̃_{t;k_t}, then we have G_t ⊥ u_t | F_{t−1}, and therefore E[⟨G_t, X_t⟩ | F_{t−1}] = ⟨Ḡ_t, X_t⟩, so that Corollary 2 holds. Thus, to obtain the second part of the corollary, we use the bound (9) with δ ← δ/2 and η and T as specified; using a union bound again, we have that (21) and (9) hold together with probability at least 1 − δ. Note that η ≤ ε ≤ 1 and therefore 1/T ≤ 1/(ηT). This gives

  (1/T) λ_max( Σ_{t=1}^T G_t ) − (1/T) Σ_{t=1}^T ⟨G_t, X̃_{t;k_t}⟩
    ≤ 1/T + √(2 log(2/δ)/T) + 3η/2 + log(4n)/(ηT)
    ≤ √(2 log(2/δ)/T) + 3η/2 + log(4en)/(ηT)
    = √(2 log(2/δ)/T) + √(6 log(4en)/T) ≤ ε,

as required. Finally, note that 1 + ηT = O(ε^{−1} log(n/δ)), and consequently

  k_T = O( ε^{−1/2} log^{1/2}(n/δ) log(nT/δ) ) = O( ε^{−1/2} log^{1.5}(n/(εδ)) ).

Since k_1 ≤ k_2 ≤ ⋯ ≤ k_T, the total number of matrix-vector products is bounded by T · k_T = O(ε^{−2.5} log^{2.5}(n/(εδ))), which concludes the proof.
