Mar 11, 2016 - Bold face type is used for random variables and random vectors. The conditional expectation given a random variable Z is denoted EZ...

0 downloads 10 Views 565KB Size

Dept., University of Michigan, {krmoon,greenewk,hero}@umich.edu † Xerox PARC, [email protected]

arXiv:1601.06884v2 [cs.IT] 11 Mar 2016

Abstract Distributional functionals are integral functionals of one or more probability distributions. Distributional functionals include information measures such as entropy, mutual information, and divergence. Recent work has focused on the problem of nonparametric estimation of entropy and information divergence functionals. Many existing approaches are restrictive in their assumptions on the density support set or require difficult calculations at the support boundary which must be known a priori. The MSE convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary is not required. The theory of optimally weighted ensemble estimation is generalized to derive two estimators that achieve the parametric rate when the densities are sufficiently smooth. The asymptotic distribution of these estimators and some guidelines for tuning parameter selection are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that outperforms the standard kernel density plug-in estimator, especially in high dimension. The estimators are shown to be robust to the choice of tuning parameters.

I. I NTRODUCTION Distributional functionals are integral functionals of one or more probability distributions. The most common distributional functionals in information theory are entropy, mutual information, and information divergence which have many applications in the fields of information theory, statistics, signal processing, and machine learning. Information divergence is the most general of these information measures and is a measure of the difference between probability distributions. Some applications involving divergences include estimating the decay rates of error probabilities [1], estimating bounds on the Bayes error for a classification problem [2]–[8], extending machine learning algorithms to distributional features [9]–[12], testing the hypothesis that two sets of samples come from the same probability distribution [13], clustering [14]–[16], feature selection and classification [17]– [19], blind source separation [20], [21], image segmentation [22]–[24], and steganography [25]. For many more applications of divergence measures, see [26]. Mutual information and entropy are both special cases of divergences and have been used in some of the above applications as well as others such as determining channel capacity [1], fMRI data processing [27], intrinsic dimension estimation [28], [29], and texture classification and image registration [30]. An important subset of information divergences is the family of f -divergences [31], [32]. This family includes the well-known Kullback-Leibler (KL) divergence [33], the Rényi-α divergence [34], the Hellinger-Bhattacharyya distance [35], [36], the Chernoff-α divergence [5], the total variation distance, and the Henze-Penrose divergence [6]. We consider the problem of estimating distributional functionals when only a finite population of independent and identically distributed (i.i.d.) samples is available from each d-dimensional distribution that is unknown, nonparametric, and smooth. For simplicity, we focus on functionals of two distributions, i.e. divergence functionals. However, our methods are general enough that they can be easily extended to estimate functionals of any finite number of distributions. While several estimators of divergence functionals have been previously defined, the convergence rates are known for only a few of them. Furthermore, the asymptotic distributions of these estimators is unknown for nearly all of them. Thus these estimators cannot be easily used to perform inference tasks on the divergence such as testing that two populations have identical distributions or constructing confidence intervals. In this paper, we derive mean squared error (MSE) convergence rates for kernel density plug-in divergence functional estimators. We then generalize the theory of optimally weighted ensemble entropy estimation developed in [37] to obtain two divergence functional estimators with a MSE convergence rate of O N1 , where N is the sample size, when the densities are sufficiently smooth. We obtain the asymptotic distribution of the weighted ensemble estimators which enables us to perform hypothesis testing. We then examine the problem of tuning parameter selection. Finally, we empirically validate the theory and establish the estimators’ robustness to the choice of tuning parameters. A. Related Work Several nonparametric estimators for some functionals of two distributions including f -divergences already exist. For example, Póczos and Schneider [9] established weak consistency of a bias-corrected k-nn estimator for Rényi-α and other divergences This work was partially supported by ARO MURI grant W911NF-15-1-0479, NSF grant CCF-1217880, and a NSF Graduate Research Fellowship to the first author under Grant No. F031543.

of a similar form where k is fixed. Wang et al [38] provided a k-nn based estimator for the KL divergence. Mutual information and divergence estimators based on plug-in histogram schemes have been proven to be consistent [39]–[42]. Hero et al [30] provided an estimator for Rényi-α divergence but assumed that one of the densities was known. However none of these works study the convergence rates nor the asymptotic distribution of their estimators. There has been recent interest in deriving convergence rates for divergence estimators [43]–[48]. The rates are typically derived in terms of a smoothness condition on the densities, such as the Hölder condition [49]: Pd Definition 1 (Hölder Class): Let X ⊂ Rd be a compact space. For r = (r1 , . . . , rd ), ri ∈ N, define |r| = i=1 ri and |r| Dr = ∂xr1∂...∂xrd . The Hölder class Σ(s, K) of functions on L2 (X ) consists of the functions f that satisfy 1

d

s−r

|Dr f (x) − Dr f (y)| ≤ K kx − yk

,

for all x, y ∈ X and for all r s.t. |r| ≤ ⌊s⌋. From Definition 1, it is clear that if a function f belongs to Σ(s, K), then f is continuously differentiable up to order ⌊s⌋. In this work, we propose estimators that achieve the parametric MSE convergence rate of O(1/T ) when s ≥ d and s ≥ d+1 2 , respectively. Nguyen et al [44] proposed a method for estimating f -divergences by estimating the likelihood ratio of the two densities by solving a convex optimization problem and then plugging it into the divergence formulas. For this method they prove that the minimax MSE convergence rate is parametric (O T1 ) when the likelihood ratio is in the bounded Hölder class Σ(s, K) with s ≥ d/2. However, this estimator is restricted to true f -divergences and may not apply to the broader class of divergence functionals (as an example, the L22 divergence is not an f -divergence). Additionally, solving the convex problem of [44] has similar complexity to that of training a support vector machine (SVM) (between O(T 2 ) and O(T 3 )) which can be demanding when T is very large. In contrast, our method of optimally weighted ensemble estimation depends only on simple density plug-in estimates and the solution of an offline convex optimization problem. Thus the most computationally demanding step in our approach is the calculation of the density estimates, which has complexity no greater than O(T 2 ). Singh and Póczos [46], [47] provided an estimator for Rényi-α divergences as well as general density functionals that uses a “mirror image” kernel density estimator. They prove a convergence rate of O T1 when s ≥ d for each of the densities. However this method requires several computations at each boundary of the support of the densities which becomes difficult to implement as d gets large. Also, this method requires knowledge of the support of the densities which may not be possible for some problems. In contrast, while our assumptions require the density supports to be bounded, knowledge of the support is not required for implementation. The “linear” and “quadratic” estimators presented by Krishnamurthy et al [45] estimate divergence functionals that include ´ the form f1α (x)f2β (x)dµ(x) for given α and β where f1 and f2 are probability densities. These estimators achieve the parametric rate when s ≥ d/2 and s ≥ d/4 for the linear and quadratic estimators, respectively. However, the latter estimator is computationally infeasible for most functionals and the former requires numerical integration for some divergence functionals, which can be computationally difficult. Additionally, while a suitable α-β indexed sequence of divergence functionals of this form can be made to converge to the KL divergence, this does not guarantee convergence of the corresponding sequence of divergence estimators in [45], whereas our estimator can be used to estimate the KL divergence. Other important f -divergence functionals are also excluded from this form including some that bound the Bayes error [2], [4], [6]. In contrast, our method applies to a large class of divergence functionals and avoids numerical integration. Finally, Kandasamy et al [48] propose influence function based estimators of distributional functionals that achieve the parametric rate when s ≥ d/2. While their method can be applied to general functionals, their estimator requires numerical integration for some functionals. Additionally, the estimators in both Kandasamy et al [48] and Krishnamurthy et al [45] require an optimal kernel density estimator. This is difficult to construct when the density support is bounded as it requires knowledge of the support boundary and difficult computations at the boundary, whereas our method does not require knowledge of the support boundary. Asymptotic normality has been established for certain appropriately normalized divergences between a specific density estimator and the true density [50]–[52]. This differs from our setting where we assume that both densities are unknown. The asymptotic distributions of the estimators in [44]–[47] are currently unknown. Kandasamy et al [48] prove a central limit theorem for their data-splitting estimator but do not prove similar results for their leave-one-out estimator. We establish a central limit theorem for our proposed leave-one-out divergence estimator. Divergence functional estimation is also related to the problem of entropy functional estimation which has received a lot of attention. Some examples include [53]–[55] which used specialized kernel density estimators to achieve the parametric convergence rate when the density has smoothness parameter s ≥ d/4. Sricharan et al [37] derived an entropy functional estimator that uses a weighted average of an ensemble of bias-corrected estimators. While the approach in [37] requires the density to have smoothness parameter s ≥ d in order to achieve the parametric rate, their approach is simpler to implement compared to the estimators in [53]–[55] as long as the density support set is known.

This paper extends the work in [37] to functionals of two distributions and improves upon the results reported in [7], [43] which explored k-nearest neighbor (k-nn) based estimators of f -divergences. We derive more general conditions on the required smoothness of the densities for the MSE convergence rates of plug-in kernel density estimators (KDE) of general divergence functionals. We then generalize the theory of optimally weighted ensemble entropy estimation developed in [37] to obtain two divergence functional estimators that achieve the parametric MSE convergence rate of O N1 when the densities are sufficiently smooth, where N is the sample size. One of these estimators achieves this rate when the densities have s ≥ (d + 1)/2, whereas [43] requires s ≥ d. These estimators apply to general divergence functionals and are simpler to implement than other estimators that also achieve the parametric rate. This work also extends these estimators to more general bounded density support sets in Rd , whereas the proofs in [43] restricted the estimator to compactly supported densities with no boundary conditions (e.g., a support set equal to the surface of a torus), which is unnecessarily restrictive. Finally, we use a leave-one-out approach that uses all of the data for both density estimation and integral approximation in contrast to [7], [37], [43], [56] and others which use a less efficient data-splitting approach. We then derive the asymptotic distribution of the weighted ensemble estimators which enables us to construct confidence intervals and perform hypothesis testing. B. Organization and Notation The paper is organized as follows. Section II presents the kernel density plug-in divergence functional estimator and its MSE convergence rate. Our generalized theory of optimally weighted ensemble estimation and the proposed ensemble estimators are given in Section III. A central limit theorem for the ensemble estimators is also presented in Section III. In Section IV, we provide guidelines for selecting the tuning parameters based on experiments and the theory derived in the previous sections. We then perform experiments in Section IV that validate the theory and establish the robustness of the proposed estimators to the tuning parameters. Bold face type is used for random variables and random vectors. The conditional expectation given a random variable Z is denoted EZ . The variance of a random variable is denoted V and the bias of an estimator is denoted B. II. T HE D IVERGENCE F UNCTIONAL W EAK E STIMATOR This paper focuses on estimating functionals of the form ˆ G (f1 , f2 ) = g (f1 (x), f2 (x)) f2 (x)dx,

(1)

where g(x, y) is a smooth functional, and f1 and f2 are smooth d-dimensional probability densities. If g (f1 (x), f2 (x)) = f1 (x) g f2 (x) , g is convex, and g(1) = 0, then G (f1 , f2 ) defines the family of f -divergences. Some common divergences that belong to this family include the KL divergence (g(t) = − ln t), the Rényi-α divergence (g(t) = tα ), and the total variation distance (g(t) = |t − 1|). In this work, we consider a broader class of functionals than the f -divergences. A. The Kernel Density Plug-in Estimator We use a kernel density plug-in estimator of the divergence functional in (1). Assume that N1 i.i.d. realizations {Y1 , . . . , YN1 } are available from f1 and N2 i.i.d. realizations {X1 , . . . , XN2 } are available from f2 . Let hi > 0 be the kernel bandwidth for the density estimator of fi . Let K(·) be a kernel function with ||K||∞ < ∞ where kKk∞ is the ℓ∞ norm of the kernel K. The KDEs are N1 1 X Xj − Yi ˜f1,h (Xj ) = K , 1 h1 N1 hd1 i=1 N2 Xj − Xi 1 X ˜f2,h (Xj ) = K , 2 h2 M2 hd2 i=1 i6=j

where M2 = N2 − 1. G (f1 , f2 ) is then approximated as N2 X ˜ h ,h = 1 g ˜f1,h1 (Xi ) , ˜f2,h2 (Xi ) . G 1 2 N2 i=1

(2)

B. Convergence Rates Similar to [7], [37], [43], the principal assumptions we make on the densities f1 and f2 and the functional g are that: 1) f1 , f2 , and g are smooth; 2) f1 and f2 have common bounded support sets S; 3) f1 and f2 are strictly lower bounded on S. We also assume 4) that the support is smooth with respect to the kernel K(u). Our full assumptions are: • (A.0): Assume that the kernel K is symmetric, is a product kernel, and has bounded support in each dimension. Also ´ assume that it has order ν which means that the jth moment of the kernel Ki defined as tj Ki (t)dt is zero for all j = 1, . . . , ν − 1 and i = 1, . . . , d where Ki is the kernel in the ith coordinate. • (A.1): Assume there exist constants ǫ0 , ǫ∞ such that 0 < ǫ0 ≤ fi (x) ≤ ǫ∞ < ∞, ∀x ∈ S. • (A.2): Assume that the densities fi ∈ Σ(s, K) in the interior of S with s ≥ 2. • (A.3): Assume that g has an infinite number of mixed derivatives. ∂ k+l g(x,y) • (A.4): Assume that ∂xk ∂y l , k, l = 0, 1, . . . are strictly upper bounded for ǫ0 ≤ x, y ≤ ǫ∞ . d • (A.5): Assume the following boundary smoothness condition: Let px (u) : R → R be a polynomial in u of order q ≤ r = ⌊s⌋ whose coefficients are a function of x and are r − q times differentiable. Then assume that !t ˆ ˆ K(u)px (u)du dx = vt (h), x∈S

u:K(u)>0, x+uh∈S /

where vt (h) admits the expansion vt (h) =

r−q X i=1

ei,q,t hi + o hr−q ,

for some constants ei,q,t . We focus on finite support kernels for simplicity in the proofs although it is likely that our results extend to some infinitely supported kernels as well. The smoothness assumptions on the densities are weaker as compared to [7], [37], [43]. However, we assume stronger conditions on the smoothness of g to enable us to achieve good convergence rates without knowledge of the boundary of the support set. Assumption A.5 requires the boundary of the density support set to be smooth wrt the kernel K(u) in the sense that the expectation of the area outside of S wrt any random variable u with smooth distribution is a smooth function of the bandwidth h. It is not necessary for the boundary of S to have smooth contours with no edges or corners as this assumption is satisfied by the following case: Theorem 1: Assumption A.5 is satisfied when S = [−1, 1]d and when K is the uniform rectangular kernel; that is K(x) = 1 for all x : ||x||1 ≤ 1/2. The proof is given in Appendix A. Given the simple nature of this density support set and kernel, it is likely that other kernels and supports will satisfy A.5 as well. This is left for future work. Densities for which assumptions A.1 − A.2 hold include the truncated Gaussian distribution and the Beta on the distribution α x x unit cube. Functions for which assumptions A.3 − A.4 hold include g(x, y) = − ln y and g(x, y) = y . The following theorem on the bias follows under assumptions A.0 − A.5: Theorem 2: ˜ h1 ,h2 is of the form For general g, the bias of the plug-in estimator G h i ˜ h1 ,h2 B G =

r X j=1

r X r X c5,i,j hj1 hi2 + O (hs1 + hs2 ) c4,1,j hj1 + c4,2,j hj2 + j=1 i=1

1 1 +c9,1 + c9,2 +o d N 1 h1 N2 hd2

Furthermore, if g(x, y) has k, l-th order mixed derivatives

∂ k+l g(x,y) ∂xk ∂y l

1 1 + d N 1 h1 N2 hd2

.

(3)

that depend on x, y only through xα y β for some α, β ∈ R,

then for any positive integer λ ≥ 2, the bias is of the form r X r r X h i X ˜ h1 ,h2 c5,i,j hj1 hi2 + O (hs1 + hs2 ) c4,1,j hj1 + c4,2,j hj2 + B G = j=1 i=1

j=1 λ/2

r XX

hm 1

c9,1,j,m

N1 hd1

j=1 m=0

+

λ/2 r λ/2 r X X XX

j + c9,2,j,m

c9,j,i,m,n

j=1 m=0 i=1 n=0

+O

1

λ + d 2

N 1 h1

1 N2 hd2

λ2

hm 2 N2 hd2

j

n hm 1 h2 i j N2 hd2 N1 hd1

.

!

(4)

Divergence functionals that satisfy the mixed derivatives condition required for (4) include the KL divergence and the Rényiα divergence. Obtaining similar terms for other divergence functionals requires us to separate the dependence on hi of the derivatives of g evaluated at EZ ˜fi,hi (Z). This is left for future work. See Appendix B for details. The following variance result requires much less strict assumptions: Theorem 3: Assume that the functional g in (1) is Lipschitz continuous in both of its arguments with Lipschitz constant Cg . ˜ h1 ,h2 is bounded by Then the variance of the plug-in estimator G h i N1 10 ˜ h1 ,h2 ≤ C 2 ||K||2 + . V G g ∞ N2 N22 ˜ h1 ,h2 to be unbiased while the variance of From Theorems 2 and 3, it is clear that we require hi → 0 and Ni hdi → ∞ for G the plug-in estimator depends primarily on the number of samples. Note that the constants in front of the terms that depend on hi and Ni may not be identical for different i, j, m, n in (3) and (4). However, these constants depend on the densities f1 and f2 and their derivatives which are often unknown. The rates given in Thm. 2 and 3 are similar to the rates derived for the entropy plug-in estimator in [37] if hdi = ki /Ni . The differences lie in the constants in front of the rates and the dependence on the number of samples from two distributions instead of one. Additionally, as compared to (3), in (4) there are many more terms. These terms enable us to achieve the parametric MSE convergence rate when s ≥ (d + 1)/2 for an appropriate choice of bandwidths whereas the terms in (3) require s ≥ d to achieve the same rate. C. Optimal MSE Rate From Theorem 2, the dominating terms in the bias are Θ (hi ) and Θ Ni1hd . If no attempt is made to correct the bias, the i optimal choice of hi in terms of minimizing the MSE is −1 ∗ hi = Θ Nid+1 . −1 This results in a dominant bias term of order Θ Nid+1 . Note that this differs from the standard result for the optimal KDE bandwidth for minimum MSE density estimation which is Θ N −1/(d+4) for a symmetric uniform kernel [57]. −1 Figure 1 gives a heatmap showing the leading term O (h) as a function of d and N when h = N d+1 . The heatmap indicates that the bias of the plug-in estimator in (2) is small only for relatively small values of d. D. Proof Sketches of Theorems 2 and 3 To prove the expressions for the bias, the bias is first decomposed into two parts by adding and subtracting g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) within the expectation creating a “bias” term and a “variance” term. hApplyingi a Taylor series expansion on the bias and ˜i,hi (Z) := variance terms results in expressions that depend on powers of BZ ˜fi,hi (Z) := EZ˜fi,hi (Z) − fi (Z) and e ˜fi,h (Z) − EZ˜fi,h (Z), respectively. Within the interior of the support, moment bounds can be derived from properties of i i the KDEs and a Taylor series expansion of the densities. Near the boundary of the support, the smoothness assumption on the boundary A.5 is also required. Note that this approach differs from that in [37] which corrected the KDEs near the boundary of the support set and also used concentration inequalities for the KDEs. The full proof of Thm. 2 is given in Appendix B. The proof of the variance result takes a different approach. It uses the Efron-Stein inequality which bounds the variance by analyzing the expected squared difference between the plug-in estimator when one sample is allowed to differ. The full proof of Thm. 3 is given in Appendix C.

30 0.7 25

Dimension d

0.6 20

0.5

15

0.4 0.3

10 0.2 5

0.1 5

10

15

20 25 30 35 Sample Size N × 103

40

45

50

Figure 1. Heat map of predicted bias of divergence funtional plug-in estimator based on Theorem 2 as a function of dimension and sample size when −1

h = N d+1 . Note the phase transition in the bias as dimension d increases for fixed sample size N : bias remains small only for relatively small values of d. The proposed weighted ensemble estimator removes this phase transition when the densities are sufficiently smooth.

III. W EIGHTED E NSEMBLE E STIMATION As pointed out in Sec. II-C, Thm. 2 shows that when the dimension of the data is not small, the bias of the MSE-optimal ˜ h1 ,h2 decreases very slowly as a function of sample size, resulting in large MSE. However, by applying plug-in estimator G the theory of optimally weighted ensemble estimation, originally developed in [37] for entropy estimation, we can modify the minimum MSE estimator by taking a weighted sum of an ensemble of estimators where the weights are chosen to significantly reduce the bias. A. The Weighted Ensemble Estimator The bias expression in Theorem 2 is quite complicated due to its dependence on the sample size of two different distributions. ˜ h := G ˜ h,h . We can simplify it significantly by assuming that N1 = N2 = N and h1 = h2 = h. Define G ˜ Corollary 1: For general g, the bias of the plug-in estimator Gh is given by ⌊s⌋ h i X 1 1 j s ˜ c10,j h + c11 B Gh = . +O h + N hd N hd j=1 If g(x, y) has k, l-th order mixed derivatives any positive integer λ ≥ 2, the bias is

∂ k+l g(x,y) ∂xk ∂y l

h i ˜h = B G

⌊s⌋ X

that depend on x, y only through xα y β for some α, β ∈ R, then for

c10,j hj +

λ/2 ⌊s⌋ X X

c11,q,j

q=1 j=0

j=1

+O hs +

1 λ

(N hd ) 2

!

hj q (N hd )

.

Note that the corollary still holds if N1 and N2 are linearly related, i.e., N = N1 = Θ(N2 ) and similarly if h1 and h2 are linearly related, i.e., h = h1 = Θ(h2 ). We form an ensemble of estimators by choosing different values of h. Choose L = {l1 , . . . , lL } to be real positive numbers that index h(li ). Thus the parameter l indexes over different neighborhood sizes ˜ ˜ w := P for the kernel density estimates. Define w := {w (l1 ) , . . . , w (lL )} and G l∈L w(l)Gh(l) . The key to reducing the MSE is to choose the weight vector w to reduce the lower order terms in the bias without substantially increasing the variance. B. Finding the Optimal Weight The theory of optimally weighted ensemble estimation is a general theory originally presented by Sricharan et al [37] that can be applied to many estimation problems as long as the bias and variance of the estimator can be expressed in a specific way. We generalize the conditions given in [37] that were required to apply the theory. Let L = 1 , . . . , lL } be a set of index n {lo ˆl values and let N be the number of samples available. For an indexed ensemble of estimators E of a parameter E, the l∈L P weighted ensemble estimator with weights w = {w (l1 ) , . . . , w (lL )} satisfying l∈L w(l) = 1 is defined as X ˆw = ˆ l. E w (l) E l∈L

ˆ w is asyptotically unbiased if the estimators E n o ˆl : E

n o ˆl E

l∈L

are asymptotically unbiased. Consider the following conditions on

l∈L

•

C.1 The bias is expressible as

h i X 1 ˆl = ci ψi (l)φi,d (N ) + O √ B E , N i∈J

where ci are constants depending on the underlying density, J = {i1 , . . . , iI } is a finite index set with I < L, and ψi (l) are basis functions depending only on the parameter l and not on the sample size. • C.2 The variance is expressible as h i ˆ l = cv 1 + o 1 . V E N N n o ˆl Theorem 4: Assume conditions C.1 and C.2 hold for an ensemble of estimators E . Then there exists a weight vector l∈L w0 such that the MSE of the weighted ensemble estimator attains the parametric rate of convergence: 2 1 ˆ =O E Ew0 − E . N

The weight vector w0 is the solution to the following convex optimization problem:

minw ||w|| P 2 subject to l∈L w(l) P = 1, γw (i) = l∈L w(l)ψi (l) = 0, i ∈ J.

(5)

A more restrictive version of Theorem 4 was originally presented in [37] with the stricter condition of φi,d (N ) = N −1/(2d) . The proof of our generalized version (Theorem 4) is sketched below. Proof: From condition C.1, the bias of the weighted estimator is ! √ h i X L||w|| 2 ˆw = √ . B E ci γw (i)φi,d (N ) + O N i∈J The variance of the weighted estimator is bounded as h i L||w||2 2 ˆw ≤ V E . (6) N The optimization problem in (5) zeroes out the lower-order bias terms and limits the ℓ2 norm of the weight vector w to limit the variance contribution. This results in an MSE rate of O(1/N ) when the dimension d is fixed and when L is fixed independently of the sample size N . Furthermore, a solution to (5) is guaranteed to exist as long as L > I and the vectors ai = [ψi (l1 ), . . . , ψi (lL )] are linearly independent. This completes our sketch of the proof of Thm. 4. C. Optimally Weighted Distributional Functional (ODin) Estimators To achieve the parametric rate O (1/N ) in MSE convergence it is not necessary that γw (i) = 0, i ∈ J. Solving the following convex optimization problem in place of the optimization problem in Theorem 4 retains the O(1/N ) rate: minw ǫP subject to l∈L w(l) = 1, 1 γw (i)N 2 φi,d (N ) ≤ ǫ, i ∈ J,

(7)

2

kwk2 ≤ η,

where the parameter η is chosen to achieve a trade-off between bias and variance. Insteadof forcing γw (i) = 0, the relaxed √ optimization problem uses the weights to decrease the bias terms at the rate of O 1/ N yielding an MSE of O(1/N ). We refer to the distributional functional estimators obtained using this theory as Optimally Weighted Distributional Functional (ODin) estimators. Sricharan et al [37] applied the stricter version of Theorem 4 to obtain an entropy estimator with convergence rate O(1/N ). We also apply the same theory to obtain a divergence functional estimator with the same asymptotic rate. Let 1 −1/(2d) i h(l) = lN . From Theorem 2, we get ψi (l) = l , i = 1, . . . , d. Note that if s ≥ d, then we are left with O ld √ N in addition to the terms in the sum. To obtain a uniform bound on the bias with respect to w and L, we also include the function ψd+1 (l) = l−d in the optimization problem. The bias of the resulting base estimator satisfies condition C.1 with φi,d (N ) = N −i/(2d) for i = 1, . . . , d and φd+1,d (N ) = N −1/2 . The variance also satisfies condition C.2. The optimal weight

Algorithm 1 Optimally weighted ensemble estimator of divergence functionals Input: η, L positive real numbers L, samples {Y1 , . . . , YN } from f1 , samples {X1 , . . . , XN } from f2 , dimension d, function g, kernel K ˜ w0 ,2 Output: The optimally weighted divergence estimator G j+q − d+1 1: Solve for w0 using (7) with φj,q,d (N ) = N and basis functions ψj,q (l) = lj−dq , l ∈ ¯l, and {i, j} ∈ J defined in (8) ¯ 2: for all l ∈ l do −1 3: h(l) ← lN d+1 4: for i = 1 to N do PN Xi −Xj 1 ˜f1,h(l) (Xi ) ← 1 d PN K Xi −Yj , ˜f2,h(l) (Xi ) ← 5: j=1 K j=1 h(l) h(l) N h(l) (N −1)h(l)d 6: 7: 8: 9:

end for ˜ h(l) ← 1 PN g ˜f1,h(l) (Xi ), ˜f2,h(l) (Xi ) G i=1 N end for P ˜ ˜ w0 ,2 ← G l∈L w0 (l)Gh(l)

j6=i

˜ w ,1 with an MSE convergence rate of O 1 w0 is found by using (7) to obtain a plug-in divergence functional estimator G 0 N as long as s ≥ d. Otherwise, if s < d we can only guarantee the MSE rate up to O N 1s/d . We refer to this estimator as the ODin1 estimator. Another weighted ensemble estimator can be defined that requires less strict assumptions on the smoothness of the densities. −1 This is accomplished by letting h(l) decrease at a faster rate. Let h(l) = lN d+1 . From Theorem 2, we have that if g(x, y) has j+q mixed derivatives of the form of xα y β , then the bias has terms proportional to lj−dq N − d+1 where j, q ≥ 0 and j + q > 0. Theorem 4 can be applied to the ensemble of estimators to derive an estimator that achieves the parametric convergence rate j+q under these conditions. Let φj,q,d (N ) = N − d+1 and ψj,q (l) = lj−dq . Let J = {{j, q} : 0 < j + q < (d + 1)/2, q ∈ {0, 1, 2, . . . , λ/2}, j ∈ {0, 1, 2, . . . , ⌊s⌋}} .

(8)

˜ h(l) satisfies condition C.1. If L > |J| = I, then Theorem 4 can be applied to obtain the Then from (4), the bias of G ˜ ˜ w0 ,2 = P optimal weight vector. The estimator G l∈L w0 (l)Gh(l) achieves the parametric convergence rate if λ ≥ d + 1 and if 1 ˜ w0 ,2 is referred to as s ≥ (d + 1)/2. Otherwise, if s < (d + 1)/2 we can only guarantee the MSE rate up to O N 2s/(d+1) .G the ODin2 estimator and is summarized in Algorithm 1. D. Comparison of ODin1 and ODin2 Estimators ˜ w0 ,1 , h ∝ N −1 2d and the parametric convergence rate is guaranteed when s ≥ d. This can be For the ODin1 estimator G achieved with L ≥ d parameters and applies to any functional g in (1) that is infinitely differentiable. −1 ˜ w0 ,2 , h ∝ N d+1 In contrast, for the ODin2 estimator G if g(x, y) has mixed derivatives of the form of xα y β and the d+1 ˜ w ,2 under less parametric convergence rate is guaranteed when s ≥ 2 . Thus the parametric rate can be achieved with G 0 ˜ strict assumptions on the smoothness of the densities than those required for Gw0 ,1 . For large d the condition s ≥ (d + 1)/2 is just slightly stronger than the condition s ≥ d/2 required by the more complex estimators that achieve the parametric rate proposed in [45]. These rate improvements come at a cost in the number of parameters L required to implement the weighted ensemble then the size of J for ODin2 is on the order of d2 /8. This may lead to increased variance of the estimator. If s ≥ d+1 2 ˜ w ,2 can only be applied to functionals g(x, y) with mixed derivatives ensemble estimator as indicated by (6). Also, so far G 0 α β of the form of x y . Future work is required to extend this estimator to other functionals of interest. E. Central Limit Theorem ˜ w converges in distribution to a The following theorem shows that the appropriately normalized ensemble estimator G normal random variable. This enables us to perform hypothesis testing on the divergence functional. The proof is based on the Efron-Stein inequality and an application of Slutsky’s Theorem (Appendix D). Theorem 5: Assume that the functional g is Lipschitz in both arguments with Lipschitz constant Cg . Further assume that ˜ w is h = o(1), N → ∞, and N hd → ∞. Then for fixed L, the asymptotic distribution of the weighted ensemble estimator G ! r i h i h ˜ w ≤ t → P r(S ≤ t), ˜w −E G ˜w / V G Pr G where S is a standard normal random variable.

ODin1 ODin2

1

0

w (l)

2

0 −1 1.5

2

2.5

3

l α Figure 2. Examples of the optimal weights for g(x, y) = x , d = 4, N = 3100, L = 50, and l is uniformly spaced between 1.5 (ODin1) or 2 (ODin2) y and 3. The lowest values of l are given the highest weight. Thus the minimum value of bandwidth parameters L should be sufficiently large to render an adequate estimate.

IV. N UMERICAL VALIDATION A. Tuning Parameter Selection The optimization problem in (7) has parameters η, L, and L. The parameter η provides an upper bound on the norm of the weight vector, which gives an upper bound on the constant in the variance of the ensemble estimator. If all the constants in (3) or (4) and an exact expression for the variance of the ensemble estimator were known, then η could be chosen to minimize the MSE. Since the constants are unknown, by applying (7), the resulting MSE of the ensemble estimator is O ǫ2 /N + O Lη 2 /N , where each term in the √ sum comes from the bias and variance, respectively. Since there is a tradeoff between η and ǫ, in principle setting η = ǫ/ L would minimize these terms. √ In practice, we find that the variance of the ensemble estimator is less than the upper bound of Lη 2 /N and setting η = ǫ/ L is therefore overly restrictive. Setting η = ǫ instead works well in practice. For fixed L, the set of kernel widths L can in theory be chosen by minimizing ǫ in (7) over L in addition to w. However, this results in a nonconvex optimization problem since w does not lie in the non-negative orthant. A parameter search may not be practical as ǫ generally decreases as the size and spread of L increases. This decrease in ǫ does not always correspond to a decrease in MSE as high and low values of h(l) can lead to inaccurate density estimates. Denote the value of the minimum value of l so that ˜fi,h(lmin ) (Xj ) > 0 ∀i = 1, 2 as lmin and the diameter of the support S as D. To ensure the density estimates are bounded away from zero, we require that min(L) ≥ lmin . The weights in w0 are generally largest for the smallest values of L (see Fig. 2) so min(L) should also be sufficiently larger than lmin to render an adequate estimate. Similarly, max(L) should be sufficiently smaller than D as high bandwidth values lead to high bias. The remaining L values are chosen to be equally spaced between min(L) and max(L). As L increases, the similarity of bandwidth values h(l) and basis functions ψi,d (l) increases, resulting in a negligible decrease in the bias. Hence L should be chosen large enough to decrease the bias but small enough so that the h(l) values are sufficiently distinct (typically 30 ≤ L ≤ 60). B. Convergence Rates Validation: Rényi-α Divergence To validate our theory, we estimated the Rényi-α divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities have means µ ¯1 = 0.7 ∗ ¯1d , µ ¯2 = 0.3 ∗ ¯1d and covariance matrices 0.1 ∗ Id where ¯1d is a d-dimensional vector of ones, and Id is a d × d identity matrix. We used α = 0.5 and restricted the Gaussians to the unit cube. The left plots in Fig. 3 show the MSE (200 trials) of the standard plug-in estimator implemented with a uniform kernel, ˜ρ = the two proposed optimally weighted estimators ODin1 and ODin2, and a linear combination of ODin1 and ODin2, G ˜ w0 ,1 + ρG ˜ w0 ,2 , for various dimensions and sample sizes. The set of kernel widths L, L, and ρ are tuned to minimize (1 − ρ)G the MSE. The bandwidth used for the standard plug-in estimator was selected from the set L that resulted from the ODin2 optimization; specifically the member of the set that empirically minimized the MSE of the plug-in estimator. Note that for d = 4, the standard plug-in estimator performs comparably with the optimally weighted estimators. However, for d = 7, 10, the plug-in estimator performs considerably worse. This reflects the strength of ensemble estimators: the weighted sum of a set of poor estimators can result in a very good estimator. Note also that for most cases, the ensemble estimators’ MSE rates match the theoretical rate based on the estimated log-log slope given in Table I. ODin1 tends to do better than ODin2 when the dimension is lower (d = 4) while the opposite occurs for the higher dimensions. Further evidence for this is given in the right figures in Fig. 3 that show the corresponding average estimates with standard error bars compared to the true values. ODin1 has smaller variance than ODin2 when d = 4 and slightly larger

Dimension d=4

−1

Dimension d=4

10

0.75

−2

MSE

10

Renyi−α integral

Kernel ODin1 ODin2 Comb.

−3

10

0.7 0.65 0.6 0.55 0.5

−4

10

True Kernel ODin1 ODin2 Comb.

2

3

10

2

10

10

Sample Size N Dimension d=7

−1

3

10 Sample Size N Dimension d=7

4

10

10

Renyi−α integral

0.7 Kernel

−2

10 MSE

ODin1 ODin2

−3

Comb.

10

True Kernel

0.6

ODin1 ODin2

0.5

Comb. 0.4

−4

10

2

10

3

2

10 Sample Size N Dimension d=10

10

3

10 Sample Size N Dimension d=10

4

10

Renyi−α integral

0.8 Kernel ODin1

MSE

−2

10

ODin2 Comb.

−4

10

2

0.6

0.4

0.2 2 10

3

10

True Kernel ODin1 ODin2 Comb.

10 Sample Size N

3

10 Sample Size N

4

10

Figure 3. (Left) Log-log plot of MSE of the uniform kernel plug-in (“Kernel”), the two proposed optimally weighted estimators (ODin1 and ODin2), and the optimal linear combination of ODin1 and ODin2 for various dimensions and sample sizes. (Right) Plot of the average value of the same estimators with standard error bars compared to the true values being estimated. The proposed weighted ensemble estimators generally match the theoretical rate (see Table I) and perform much better than the plug-in estimator for high dimensions. Table I N EGATIVE LOG - LOG SLOPE OF THE MSE AS A

FUNCTION OF SAMPLE SIZE FOR VARIOUS DIMENSIONS AND ESTIMATORS

Estimator ODin1 ODin2 Comb.

d=4 1.04 0.83 1.03

d=7 1.07 1.08 1.04

d = 10 1.01 1.00 1.02

variance when d = 10. This seems to account for the differences in MSE between ODin1 and ODin2. The values for the weight ρ are given in Table II which indicate a preference for ODin1 when d = 4 and a preference for ODin2 for higher dimensions. Paired t-tests on the MSE (125 trials) of the two methods indicate that the MSE differences are statistically significant (see Table III). C. Tuning Parameter Robustness The results in Section IV-B were obtained by selecting the tuning parameters L and L for each pair of dimension and samples to minimize the MSE. Here we demonstrate the robustness of the estimators to variations in the tuning parameters. In all experiments, we estimated the Rényi-α divergence integral between the same distributions described in Section IV-B (truncated Gaussians with same covariance and different mean) and chose L = 50. In the first set of experiments, we set η = ǫ

Table II ˜ ρ = (1 − ρ)G ˜ w ,1 + ρG ˜ w ,2 VALUES OF THE WEIGHT ρ FOR THE ESTIMATOR G 0 0 Dim. d=4 d=7 d = 10

N = 100 0.15 0.6 0.55

N = 240 0 0.45 1

N = 560 0.1 0.75 0.5

N = 1330 0.05 0.75 0.65

THAT MINIMIZE MSE

N = 3200 0.05 0.55 0.5

Table III p- VALUES OF PAIRED T- TESTS OF OD IN 1 VS . OD IN 2 MSE (N = 1300). Dim. 4 7 10

ODin1>ODin2 1 8.7 × 10−52 0

ODin1

ODin1=ODin2 1.8 × 10−58 1.8 × 10−51 1.0 × 10−52

and chose the set of kernel bandwidths L to be linearly spaced between min(L) and max(L). Table IV provides the values chosen for min(L) and max(L). Figure 4 shows the results for these experiments when d = 5. As the number of samples increase, choosing a larger range of values for L (Sets 1 and 2) generally gives better performance for both ODin1 and ODin2 in terms of MSE than choosing a smaller range for L (e.g. Sets 4 and 5). This suggests that for large sample sizes, the estimators will perform well if a reasonably large range for L is chosen. In contrast, choosing a smaller range of values for L when the sample size is small results in smaller bias and variance compared to the larger ranges. Thus for small sample sizes, it may be useful to tighten the range of kernel bandwidths. Comparing the results for ODin1 and ODin2 indicates that ODin2 is more robust to the choice of L as the difference in MSE under the different settings is smaller for ODin2 than ODin1. This is due primarily to the relatively smaller bias of ODin2 as the variances of the two estimators under each setting are comparable (see the bottom plots in Fig. 4). Similar results hold when the dimension is increased to d = 7 (see Fig. 5). For larger sample sizes, a large range for L gives better results than a smaller range. However, for smaller sample sizes, the larger range for L does not perform as well as other configurations. Additionally, ODin2 again appears to be more robust to the choice of L as the difference in MSE at larger sample sizes is smaller for ODin2. Additionally, the Set 5 configuration of ODin1 does not even appear to be converging to the true value yet when N = 10000. For the second set of experiments, we fixed L to be linearly spaced values between 2 and 3. We then varied the values of η from 0.5 to 10. Figure 6 provides heatmaps of the MSE of the two ensemble estimators under this configuration with d = 5, 7. For d = 5, choosing η = 0.5 gives the lowest MSE when N ≥ 103.5 for ODin1 and for all sample sizes for ODin2. In fact, when d = 5, ODin2 with η = 0.5 outperforms all other configurations at all sample sizes, including those shown in Fig. 4. Increasing d to 7 changes this somewhat as choosing η = 0.5 results in the lowest MSE when N ≥ 1000 for ODin2 and for no sample sizes for ODin1. However, generally lower values of η (η < 2) result in the lowest MSE for ODin2 when N < 1000 and for ODin1 when N ≥ 103.5 (η ≤ 3). Both ODin1 and ODin2 are fairly robust to the choice of η when d = 5 as the MSE is relatively constant at each sample size for most η values. However, ODin2 has generally lower MSE values for N ≥ 102.5 (see Table V). When d = 7, ODin2 is more robust than ODin1 for larger samples (N ≥ 1000). Overall, based on our experiments, ODin1 has lower MSE on average for smaller sample sizes while ODin2 is generally more robust to the tuning parameters for larger sample sizes. Additionally, choosing a low value for η with ODin2 may result in better performance. Thus unless the sample size is small, we recommend ODin2 over ODin1.

Table IV VALUES OF min(L) AND max(L) FOR DIFFERENT EXPERIMENTS .

Set 1 2 3 4 5

ODin1 min(L) max(L) 1.5 3 1.75 3 2 3 2.25 3 2.5 3

ODin2 min(L) max(L) 2 3 2.25 3 2.5 3 2.75 3 2.75 3.25

ODin1, d=5

−1

10

−2

−2

10

10

MSE

MSE

ODin2, d=5

−1

10

−3

10

−3

10

Set 1 Set 2

−4

−4

10

Set 3

10

Set 4 Set 5 2

3

10 Sample Size N ODin1, d=5

4

2

10

0.8

0.8

0.7

0.7

0.6 0.5

10 Sample Size N ODin2, d=5

4

10

True Set 1 Set 2 Set 3 Set 4 Set 5

0.6 0.5 0.4

0.4 0.3

3

10

Renyi−α integral

Renyi−α integral

10

2

10

3

10 Sample Size N

0.3

4

2

3

10

10

10 Sample Size N

4

10

Figure 4. (Top) Log-log plot of MSE of the two proposed optimally weighted estimators (ODin1 and ODin2) as a function of sample size using different values for the range of kernel bandwidths L (see Table IV) when d = 5. (Bottom) Plot of the average value of the same estimators with standard error bars compared to the true value being estimated. For larger sample sizes, a larger range in L results in smaller MSE (see Sets 1 and 2), while a smaller range in L is more accurate at smaller sample sizes. ODin2 is generally more robust to the choice of L. Table V AVERAGE MSE OVER ALL VALUES OF η USED IN F IGURE 6 FOR A FIXED SAMPLE SIZE . Sample Size N Mean MSE, ODin1, Mean MSE, ODin2, Mean MSE, ODin1, Mean MSE, ODin2,

d=5 d=5 d=7 d=7

102 0.0225 0.0358 0.0394 0.0557

102.5 0.0159 0.0092 0.0233 0.0292

103 0.0118 0.0029 0.0155 0.0120

103.5 0.0071 0.0011 0.0115 0.0046

104 0.0040 0.0005 0.0091 0.0018

104.5 0.0022 0.0002 N/A N/A

D. Central Limit Theorem Validation: KL Divergence To verify the central limit theorem of both ensemble estimators, we estimated the KL divergence between two truncated Gaussian densities again restricted to the unit cube. We conducted two experiments where 1) the densities are different with means µ ¯1 = 0.7 ∗ ¯ 1d , µ ¯2 = 0.3 ∗ ¯ 1d and covariance matrices σi ∗ Id , σ1 = 0.1, σ2 = 0.3; and where 2) the densities are the same with means 0.3 ∗ ¯ 1d and covariance matrices 0.3 ∗ Id . For both experiments, we chose d = 6 and N = 1000. Figure 7 shows Q-Q plots of the normalized optimally weighted ensemble estimators ODin1 (left) and ODin2 (right) of the KL divergence when the two densities are the same (top) and when they are different (bottom). The linear relationship between the quantiles of the normalized estimators and the standard normal distribution validates Theorem 5 for both estimators under the two cases. V. C ONCLUSION We derived convergence rates for a kernel density plug-in estimator of divergence functionals. We generalized the theory of optimally weighted ensemble estimation and derived an estimator that achieves the parametric rate when the densities are

ODin2, d=7

ODin1, d=7 −1

−1

10

−2

10

10

−2

MSE

MSE

10

Set 1 Set 2

−3

−3

10

10

Set 3 Set 4 Set 5

−4

−4

10

2

3

4

10 Sample Size N ODin1, d=7

0.9

0.9

0.8

0.8

0.7

0.7

0.6 0.5 0.4 0.3

2

3

10

10

Renyi−α integral

Renyi−α integral

10

10

10 Sample Size N ODin2, d=7

4

10

True Set 1 Set 2 Set 3 Set 4 Set 5

0.6 0.5 0.4 0.3 0.2

0.2

0.1

0.1 2

10

3

4

10 Sample Size N

10

2

10

3

10 Sample Size N

4

10

Figure 5. (Top) Log-log plot of MSE of the two proposed optimally weighted estimators (ODin1 and ODin2) as a function of sample size using different values of the parameter L (see Table IV) when d = 7. (Bottom) Plot of the average value of the same estimators with standard error bars compared to the true value being estimated. Again, a larger range for L at large sample sizes results in smaller MSE.

(d + 1)/2 times differentiable. The estimators we derive apply to general bounded density support sets and do not require knowledge of the support which is a distinct advantage over other estimators. We also derived the asymptotic distribution of the estimator, provided some guidelines for tuning parameter selection, and validated the convergence rates for the case of empirical estimation of the Rényi-α divergence. We then performed experiments to examine the estimators’ robustness to the choice of tuning parameters and validated the central limit theorem for KL divergence estimation. Future work includes deriving expressions similar to (4) for more general divergence functionals that are not restricted to the class of functions whose mixed derivatives depend on x, y only through terms of the form xα y β . An important divergence which does not satisfy this condition is the Henze-Penrose divergence [6] which can be used to bound the Bayes error. Further future work will focus on extending this work on distributional functional estimation to k-nn based estimators where knowledge of the support is again not required. This will improve the computational burden as k-nn estimators require fewer computations than standard KDEs. A PPENDIX A P ROOF OF T HEOREM 1 Consider a uniform rectangular kernel K(x) that satisfies K(x) = 1 for all x such that ||x||1 ≤ 1/2. Also consider the family of probability densities f with rectangular support S = [−1, 1]d. We will prove Theorem 1 which is that that S satisfies the following smoothness condition (A.5): for any polynomial px (u) : Rd → R of order q ≤ r = ⌊s⌋ with coefficients that are r − q times differentiable wrt x, !t ˆ ˆ px (u)du dx = vt (h), (9) x∈S

/ u:||u||1 ≤ 12 , x+uh∈S

ODin1, d=5

ODin2, d=5 −1.5

2.5

−1.5 2.5

−2

−2

5

−2.5 η

η

−2.5 −3

5

−3

−3.5

−3.5

7.5

7.5 −4

−4

−4.5

10

−4.5

10

2.5 3.5 4.5 Log of Sample Size log10(N)

2.5 3.5 4.5 Log of Sample Size log (N) 10

ODin1, d=7

ODin2, d=7 −1

2.5

−1 2.5

−1.5

−1.5 −2

5

η

η

−2 5

−2.5 −3

7.5

−2.5 −3

7.5

−3.5 10 2 2.5 3 3.5 4 Log of Sample Size log (N)

−4

10

−3.5 10 2 2.5 3 3.5 4 Log of Sample Size log (N)

−4

10

Figure 6. Heatmaps of the ensemble estimators’ MSE (log10 scale) as a function of sample size and the tuning parameter η. Lower values of η tend to give the smallest MSE, especially for ODin2. Both estimators are fairly robust to the choice of η as the MSE is relatively constant at each sample size for most η values.

where vt (h) has the expansion vt (h) =

r−q X i=1

ei,q,t hi + o hr−q .

Note that the inner integral forces the x’s under consideration to be boundary points via the constraint x + uh ∈ / S. A. Single Coordinate Boundary Point We begin by focusing on points x that are boundary points by virtue of a single coordinate xi such that xi + ui h ∈ / S. Without loss of generality, assume that xi + ui h > 1. The inner integral in (9) can then be evaluated first wrt all coordinates other than i. Since all of these coordinates lie within the support, the inner integral over these coordinates will amount to integration P of the polynomial px (u) over a symmetric d − 1 dimensional rectangular region |uj | ≤ 21 for all j 6= i. This yields q a function m=1 p˜m (x)um ˜m (x) are each r − q times differentiable wrt x. i where the coefficients p

Same Distributions

4

4

3

3

Quantiles of Normalized ODin2

Quantiles of Normalized ODin1

Same Distributions

2 1 0 −1 −2 −3 −4 −4

−2 0 2 Standard Normal Quantiles

2 1 0 −1 −2 −3 −4 −4

4

4

3

3

2 1 0 −1 −2 −3 −4 −4

−2 0 2 Standard Normal Quantiles

4

Different Distributions

4 Quantiles of Normalized ODin2

Quantiles of Normalized ODin1

Different Distributions

−2 0 2 Standard Normal Quantiles

4

2 1 0 −1 −2 −3 −4 −4

−2 0 2 Standard Normal Quantiles

4

Figure 7. Q-Q plots comparing quantiles from the normalized weighted ensemble estimators ODin1 (left) and ODin2 (right) of the KL divergence (vertical axis) to the quantiles from the standard normal distribution (horizontal axis) when the two distributions are the same (top) and when they are different (bottom). The red line shows a reference line passing through the first and third quantiles. The linearity of the plot points validates the central limit theorem (Theorem 5) for all four cases. i With respect to the ui coordinate, the inner integral will have limits from 1−x to 12 for some 1 > xi > 1 − h2 . Consider h q the p˜q (x)ui monomial term. The inner integral wrt this term yields m+1 ! ˆ 12 q q X X 1 1 − xi 1 m . (10) − p˜m (x) p˜m (x) ui dui = 1−xi m + 1 2m+1 h h m=1 m=1

Raising the right hand side of (10) to the power of t results in an expression of the form j qt X 1 − xi pˇj (x) , h j=0

(11)

where the coefficients pˇj (x) are r − q times differentiable wrt x. Integrating (11) over all the coordinates in x other than xi results in an expression of the form j qt X 1 − xi p¯j (xi ) , (12) h j=0

where again the coefficients¯ pj (xi ) are r − q times differentiable wrt xi . Note that since the other cooordinates of x other than xi are far away from the boundary, the coefficients p¯j (xi ) are independent of h. To evaluate the integral of (12), consider the r − q term Taylor series expansion of p¯j (xi ) around xi = 1. This will yield terms of the form ˆ 1 j+k+1 xi =1 j+k (1 − xi ) (1 − xi ) dxi = − k k h h (j + k + 1) 1−h/2 xi =1−h/2

=

hj+1 , (j + k + 1)2j+k+1

for 0 ≤ j ≤ r − q, and 0 ≤ k ≤ qt. Combining terms results in the expansion vt (h) = B. Multiple Coordinate Boundary Point

Pr−q i=1

ei,q,t hi + o (hr−q ).

The case where multiple coordinates of the point x are near the boundary is a straightforward extension of the single boundary point case so we only sketch the main ideas here. As an example, consider the case where 2 of the coordinates are near the boundary. Assume for notational ease that they are x1 and x2 and that x1 + u1 h > 1 and x2 P + u2 h > 1. The inner q j integral in (9) can again be evaluated first wrt all coordinates other than 1 and 2. This yields a function m,j=1 p˜m,j (x)um 1 u2 where the coefficients p˜m,j (x) are each r − q times differentiable wrt x. Integrating this wrt x1 and x2 and then raising the result to the power of t yields a double sum similar to (11). Integrating this over all the coordinates in x other than x1 and x2 gives a double sum similar to (12). Then a Taylor series expansion of the coefficients and integration over x1 and x2 yields the result. A PPENDIX B P ROOF OF T HEOREM 2 ˜ h ,h can be In this appendix, we prove the bias results in Thm. 2. The bias of the base kernel density plug-in estimator G 1 2 expressed as h i h i ˜ h1 ,h2 B G = E g ˜f1,h1 (Z), ˜f2,h2 (Z) − g (f1 (Z), f2 (Z)) i h = E g ˜f1,h1 (Z), ˜f2,h2 (Z) − g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) i h (13) +E g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) − g (f1 (Z), f2 (Z)) ,

where Z is drawn from f2 . The first term is the “variance” term while the second is the “bias” term. We bound these terms using Taylor series expansions under the assumption that g is infinitely differentiable. The Taylor series expansion of the variance term in (13) will depend on variance-like terms of the KDEs while the Taylor series expansion of the bias term in (13) will depend on the bias of the KDEs. The Taylor series expansion of g EZ˜f1,h1 (Z), EZ ˜f2,h2 (Z) around f1 (Z) and f2 (Z) is h h i i j ˜ i ˜ ∞ ∞ X i+j (Z) B B f (Z) f X 1,h 2,h 1 2 Z Z ∂ g(x, y) = (14) g EZ˜f1,h1 (Z), EZ ˜f2,h2 (Z) x=f1 (Z) i ∂y j ∂x i!j! i=0 j=0 y=f2 (Z)

j i h where BjZ ˜fi,hi (Z) = EZ˜fi,hi (Z) − fi (Z) is the bias of ˜fi,hi at the point Z raised to the power of j. This expansion can be ˜ usedh to control i the second term (the bias term) in (13). To accomplish this, we require an expression for EZ fi,hi (Z) − fi (Z) = ˜ BZ fi,hi (Z) . i h To obtain an expression for BZ ˜fi,hi (Z) , we consider separately the cases when Z is in the interior of the support S or when Z is near the boundary of the support. A point X ∈ S is defined to be in the interior of S if for all Y ∈ / S, = 0. A point X ∈ S is near the boundary of the support if it is not in the interior. Denote the region in the interior K X−Y hi and near the boundary wrt hi as SIi and SBi , respectively. We will need the following. Lemma 1: Let Z be a realization of the density f2 independent of ˜fi,hi for i = 1, 2. Assume that the densities f1 and f2 belong to Σ(s, L). Then for Z ∈ SIi , ⌊s/2⌋ h i X s ci,j (Z)h2j EZ ˜fi,hi (Z) = fi (Z) + i + O (hi ) . j=ν/2

(15)

Proof: Obtaining the lower order terms in (15) is a common result in kernel density estimation. However, since we also require the higher order terms, we present the proof here. Additionally, some of the results in this proof will be useful later. From the linearity of the KDE, we have that if X is drawn from fi and is independent of Z, then X−Z 1 ˜ EZ fi,hi (Z) = EZ d K hi hi ˆ x−Z 1 K fi (x)dx = hi hdi ˆ = K (t) fi (thi + Z)dt, (16) where the last step follows from the substitution t = we can expand it as

x−Z hi .

fi (thi + Z) = fi (Z) +

Since the density fi belongs to Σ(s, K), using multi-index notation

X

0<|α|≤⌊s⌋

Dα fi (Z) α s (thi ) + O (kthi k ) , α!

(17)

where α! = α1 !α2 ! . . . αd ! and Y α = Y1α1 Y2α2 . . . Ydαd . Combining (16) and (17) gives X Dα fi (Z) |α| ˆ hi tα K(t)dt + O(hsi ) EZ ˜fi,hi (Z) = fi (Z) + α! 0<|α|≤⌊s⌋

=

fi (Z) +

⌊s/2⌋

X

s ci,j (Z)h2j i + O(hi ),

j=ν/2

where the last step follows from the fact that K is symmetric and of order ν. To obtain a similar result for the case when Z is near the boundary of S, we use assumption A.5. Lemma 2: Let γ(x, y) be an arbitrary function satisfying supx,y |γ(x, y)| < ∞. Let S satisfy the boundary smoothness conditions of Assumption A.5. Assume that the densities f1 and f2 belong to Σ(s, L) and let Z be a realization of the density ′ f2 independent of ˜fi,hi for i = 1, 2. Let h = min (h1 , h2 ). Then r ii h h X = c4,i,j,t hji + o (hri ) (18) E 1{Z∈SB } γ (f1 (Z), f2 (Z)) BtZ ˜fi,hi (Z) i

h E 1{Z∈SB

1 ∩SB2

j=1

i ii h h q ˜ t ˜ (Z) B f (Z) = f γ (f (Z), f (Z)) B 1,h 2,h 1 2 1 2 Z Z }

r−1 X r−1 X

′

c4,j,i,q,t hj1 hi2 h + o

j=0 i=0

Proof: For fixed X near the boundary of S, we have ˆ h i 1 X −Y ˜ E fi,hi (X) − fi (X) = K fi (Y )dY − fi (X) hi hd # " i Yˆ :Y ∈S X −Y 1 K fi (Y )dY − fi (X) = hi hdi Y :K X−Y >0 hi ˆ 1 X −Y fi (Y )dY − d K hi hi Y :Y ∈S / = T1,i (X) − T2,i (X).

′ r h

(19)

Note that in T1,i (X), we are extending the integral beyond the support of the density fi . However, by using the same Taylor series expansion method as in the proof of Lemma 1, we always evaluate fi and its derivatives at the point X which is within the support of fi . Thus it does not matter how we define an extension of fi since the Taylor series will remain the same. Thus T1,i (X) results in an identical expression to that obtained from (15). For the T2,i (X) term, we expand it as follows using multi-index notation as ˆ 1 X −Y T2,i (X) = K fi (Y )dY hi hd / ˆ i Y :Y ∈S K (u) fi (X + hi u)du = u:hi u+X ∈S,K(u)>0 /

=

X h|α| ˆ i K (u) Dα fi (X)uα du + o (hri ) . α! u:hi u+X ∈S,K(u)>0 /

|α|≤r

Recognizing that the |α|th derivative of fi is r −|α| times differentiable, we can apply assumption A.5 to obtain the expectation of T2,i (X) wrt X: ˆ ˆ 1 X −Y fi (Y )dY f2 (X)dx E [T2,i (X)] = K hi hdi X Y :Y ∈S / X h|α| ˆ ˆ i = K (u) Dα fi (X)uα duf2 (X)dX + o (hri ) α! X u:hi u+X ∈S,K(u)>0 / |α|≤r X h|α| X r−|α| |β| i = eβ,r−|α|hi + o hi + o (hri ) α! =

|α|≤r r X

1≤|β|≤r−|α|

ej hji + o (hri ) .

j=1

Similarly, we find that i h t E (T2,i (X))

t ˆ ˆ 1 X −Y f (Y )dY f2 (X)dx K i hi hdt X Y :Y ∈S / i t ˆ X h|α| ˆ i = K (u) Dα fi (X)uα du f2 (X)dX α! u:hi u+X ∈S,K(u)>0 X /

=

=

r X

|α|≤r

ej,t hji + o (hri ) .

j=1

Combining these results gives i t h ˜ E 1{Z∈SB } γ (f1 (Z), f2 (Z)) EZ fi,hi (Z) − fi (Z)

i h t = E γ (f1 (Z), f2 (Z)) (T1,i (Z) − T2,i (Z)) t X t j t−j (T1,i (Z)) (−T2,i (Z)) = E γ (f1 (Z), f2 (Z)) j j=0 =

r X

c4,i,j,t hji + o (hri ) ,

j=1

where the constants are functionals of the kernel, γ, and the densities. The expression in (19) can be proved in a similar manner. Applying Lemmas 1 and 2 to (14) gives r−1 r−1 X r X i X h ′ c5,i,j hj1 hi2 h + o (hr1 + hr2 ) . (20) c4,1,j hj1 + c4,2,j hj2 + E g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) − g (f1 (Z), f2 (Z)) = j=1

j=0 i=0

For the variance term (the first term) in (13), the truncated Taylor series expansion of g ˜f1,h1 (Z), ˜f2,h2 (Z) around EZ ˜f1,h1 (Z) and EZ˜f2,h2 (Z) gives λ X λ i+j X ˜i1,h1 (Z)˜ ej2,h2 (Z) ∂ g(x, y) e ˜λ2,h2 (Z) ˜λ1,h1 (Z) + e = g ˜f1,h1 (Z), ˜f2,h2 (Z) (21) +o e ˜ i j x=EZ f1,h1 (Z) ∂x ∂y i!j! i=0 j=0 f2,h2 (Z) y=EZ ˜

h i ˜ji,hi (Z) . ˜i,hi (Z) := ˜fi,hi (Z) − EZ˜fi,hi (Z). To control the variance term in (13), we thus require expressions for EZ e where e

Lemma 3: Let Z be a realization of the density f2 that is in the interior of the support and is independent of ˜fi,hi for i = 1, 2. Let n(q) be the set of integer divisors of q including 1 but excluding q. Then, P P⌊s/2⌋ 1 2m i h + O N1i , q ≥ 2 j∈n(q) (N hd )q−j m=0 c6,i,q,j,m (Z)hi q 2 2 ˜i,hi (Z) = EZ e (22) 0, q = 1, P⌊s/2⌋ P 1 2m × q, l ≥ 2 q−i m=0 c6,1,q,i,m (Z)h1 i∈n(q) (N1 hd1 ) h i q P P⌊s/2⌋ ˜1,h1 (Z)˜ (23) el2,h2 (Z) = EZ e 1 2t + O N11 + N12 , j∈n(l) (N hd )l−j t=0 c6,2,l,j,t (Z)h2 2 2 0, q = 1 or l = 1 where c6,i,q,j,m is a functional of f1 and f2 . Xi −Z Proof: Define the random variable Vi (Z) = K Xhi −Z − E K . This gives Z h2 2 ˜2,h2 (Z) e

= ˜f2,h2 (Z) − EZ ˜f2,h2 (Z) N2 1 X = Vi (Z). N2 hd2 i=1

Clearly, EZ Vi (Z) = 0. From (16), we have for integer j ≥ 1 Xi − Z = EZ K j h2 =

ˆ

K j (t) f2 (th2 + Z)dt

hd2

⌊s/2⌋

X

c3,2,j,m (Z)h2m 2 ,

m=0

where the constants c3,2,j,m depend on the density f2 , its derivatives, and the moments of the kernel K j . Note that since K is symmetric, the odd moments of K j are zero for Z in the interior of the support. However, all even moments may now be nonzero since K j may now be nonnegative. By the binomial theorem, j−k j h i X Xi − Z Xi − Z j j k EZ K EZ K EZ Vi (Z) = h2 h2 k k=0 ⌊s/2⌋ j X X j d(j−k) c3,2,k,m (Z)h2m hd2 O h2 = 2 k m=0 =

k=0 ⌊s/2⌋ X d h2 c3,2,j,m (Z)h2m 2 m=0

+ O h2d .

h i ˜q2,h2 (Z) . As an example, let q = 2. Then since the Xi s are independent, We can use these expressions to simplify EZ e 2 ˜2,h2 (Z) EZ e

=

1 EZ Vi2 (Z) N2 h2d 2

=

⌊s/2⌋ 1 1 X 2m c3,2,2,m (Z)h2 + O . d N2 N2 h2 m=0

Similarly, we find that 3 ˜2,h2 (Z) EZ e

= =

1 N22 h3d 2

EZ Vi3 (Z) ⌊s/2⌋

1 d 2

N 2 h2

X

m=0

c3,2,3,m (Z)h2m 2 +o

1 N2

.

For q = 4, we have 4 ˜2,h2 (Z) EZ e

= =

1 N23 h4d 2

EZ Vi4 (Z) + ⌊s/2⌋

1 N2 hd2

The pattern is then for q ≥ 2,

3

X

2 N2 − 1 EZ Vi2 (Z) 3 4d N 2 h2

c3,2,4,m (Z)h2m 2

+

m=0

h i X ˜q2,h2 (Z) = EZ e

i∈n(q)

⌊s/2⌋

1 N2 hd2

q−i

X

1

N2 hd2

2

⌊s/2⌋

X

c6,2,2,m (Z)h2m 2

m=0

c6,2,q,i,m (Z)h2m 2

m=0

+O

1 N2

+o

1 N2

.

.

For any integer q, the largest possible factor is q/2. Thus for given q, possible exponent on the N2 hd2 term is h the smallest i q ˜1,h1 (Z) except the Xi s are replaced with Yi , f2 is q/2. This increases as q increases. A similar expression holds for EZ e replaced with f1 , and N2 and h2 are replaced with N1 and h1 , respectively, all resulting in different constants. Then since ˜2,h2 (Z) are conditionally independent given Z, ˜1,h1 (Z) and e e ⌊s/2⌋ ⌊s/2⌋ h i X X X X 1 1 ˜q1,h1 (Z)˜ el2,h2 (Z) = EZ e c6,1,q,i,m (Z)h2m c6,2,l,j,t (Z)h2t q−i l−j 1 2 d d N h N h 1 2 m=0 t=0 1 i∈n(q) j∈n(l) 2 1 1 +O . + N1 N2 Applying Lemma 3 to (21) when taking the conditional expectation given Z in the interior gives an expression of the form ! λ/2 ⌊s/2⌋ h2m h2m X X 1 2 c7,1,j,m EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) j + c7,2,j,m EZ ˜f2,h2 (Z), EZ ˜f2,h2 (Z) j N1 hd1 N2 hd2 j=1 m=0 +

λ/2 ⌊s/2⌋ λ/2 ⌊s/2⌋ X XX X j=1 m=0 i=1 n=0

2n h2m 1 h2 i j N1 hd1 N2 hd2 1 1 +O λ2 + λ2 . d d N 1 h1 N 2 h2

c7,j,i,m,n EZ˜f2,h2 (Z), EZ ˜f2,h2 (Z)

(24)

Note that the functionals c7,i,j,m and c7,j,i,m,n depend on the derivatives of g and EZ˜fi,hi (Z) which depends on hi . To apply ensemble estimation, we need to separate the dependence on hi from the constants. If we use ODin1, then it is sufficient to note that in the interior of the support, EZ˜fi,hi (Z) = fi (Z) + o(1) and therefore c EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) = c (f1 (Z), f2 (Z)) + o(1) for some functional c. The terms in (24) reduce to 1 1 1 1 c7,1,1,0 (f1 (Z), f2 (Z)) . + c (f (Z), f (Z)) + o + 7,2,1,0 1 2 N1 hd1 N2 hd2 N1 hd1 N2 hd2 For ODin2, we need the higher order terms. To separate the dependence on hi from the constants, we need more information about the functional g and its derivatives. Consider the special case where the functional g(x, y) has derivatives of the form of xα y β with α, β < 0. This includes the important cases of the KL divergence and the Renyi divergence. The generalized α binomial theorem states that if m := α(α−1)...(α−m+1) and if q and t are real numbers with |q| > |t|, then for any complex m! number α, ∞ X α α−m m α q t . (25) (q + t) = m m=0 P ⌊s/2⌋ s Since the densities are bounded away from zero, for sufficiently small hi , we have that fi (Z) > j=ν/2 ci,j (Z)h2j i + O (hi ) . Applying the generalized binomial theorem and Lemma 1 gives that m ⌊s/2⌋ ∞ α X X α α−m s ci,j (Z)h2j fi (Z) . EZ ˜f1,h1 (Z) = i + O (hi ) m m=0 j=ν/2

Since m is an integer, the exponents of the hi terms are also integers. Thus (24) gives in this case h i = EZ g ˜f1,h1 (Z), ˜f2,h2 (Z) − g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z)

λ/2 ⌊s/2⌋ X X

h2m 1

c8,1,j,m (Z)

N1 hd1

j=1 m=0

+

λ/2 ⌊s/2⌋ λ/2 ⌊s/2⌋ XX X X

j + c8,2,j,m (Z)

c8,j,i,m,n (Z)

j=1 m=0 i=1 n=0

+O

1 N1 hd1

λ2 +

1 N2 hd2

h2m 2 N2 hd2

2n h2m 1 h2 i j N1 hd1 N2 hd2

s s λ2 + h1 + h2 .

j

!

(26)

As before, the case for Z close to the boundary of the support is more complicated. However, by using a similar technique to the proof of Lemma 2 for Z at the boundary and combining with the previous results, we find that for general g, i h 1 1 1 1 . (27) + c + o + E g ˜f1,h1 (Z), ˜f2,h2 (Z) − g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) = c9,1 9,2 N1 hd1 N2 hd2 N1 hd1 N2 hd2 If g(x, y) has derivatives of the form of xα y β with α, β < 0, then we can similarly obtain h i E g ˜f1,h1 (Z), ˜f2,h2 (Z) − g EZ ˜f1,h1 (Z), EZ ˜f2,h2 (Z) =

λ/2 r X X

hm 1

c9,1,j,m

N1 hd1

j=1 m=0

+

λ/2 r λ/2 r X X XX

j + c9,2,j,m

c9,j,i,m,n

j=1 m=0 i=1 n=0

Combining (20) with either (27) or (28) completes the proof.

+O

1 N1 hd1

λ2 +

1 N2 hd2

hm 2 N2 hd2

j

n hm 1 h2 i j N2 hd2 N1 hd1

s s λ2 + h1 + h2 .

!

(28)

A PPENDIX C P ROOF OF T HEOREM 3 ˜ h ,h , we will use the Efron-Stein inequality [58]: To bound the variance of the plug-in estimator G 1 2 ′ ′ Lemma 4 (Efron-Stein Inequality): Let X1 , . . . , Xn , X1 , . . . , Xn be independent random variables on the space S. Then if f : S × · · · × S → R, we have that n 2 ′ 1X E f (X1 , . . . , Xn ) − f (X1 , . . . , Xi , . . . , Xn ) . V [f (X1 , . . . , Xn )] ≤ 2 i=1 o n ′ Suppose we have samples {X1 , . . . , XN2 , Y1 , . . . , YN1 } and X1 , . . . , XN2 , Y1 , . . . , YN1 and denote the respective ˜′ ˜ h1 ,h2 and G estimators as G h1 ,h2 . We have that ˜ ˜ ′h ,h ≤ 1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) − g ˜f1,h1 (X′1 ), ˜f2,h2 (X′1 ) Gh1 ,h2 − G 1 2 N2 N2 ′ 1 X ˜ + (29) g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . N2 j=2 Since g is Lipschitz continuous with constant Cg , we have ′ ′ ˜ g f1,h1 (X1 ), ˜f2,h2 (X1 ) − g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) ≤ ′ ˜ f1,h1 (X1 ) − ˜f1,h1 (X1 )

′ 2 =⇒ E ˜f1,h1 (X1 ) − ˜f1,h1 (X1 )

=

≤ ≤

′ ′ , Cg ˜f1,h1 (X1 ) − ˜f1,h1 (X1 ) + ˜f2,h2 (X1 ) − ˜f2,h2 (X1 ) (30)

N !! ′ 1 X X1 − Yi X1 − Yi K −K h h 1 1 i=1 ! ′ N1 X1 − Yi X1 − Yi 1 X K − K h1 h1 N1 hd1 i=1 !!2 ′ N1 X X1 − Yi X1 − Yi 1 , E K −K h h1 N1 h2d 1 1 i=1 1 N1 hd1

(31)

′

′

X −Y

i where the last step follows from Jensen’s inequality. By making the substitution ui = X1h−Y and ui = 1h1 i , this gives 1 !!2 !!2 ˆ ′ ′ X1 − Yi X − Y X − Y 1 1 X1 − Yi i 1 i 1 = −K −K K × E K h1 h1 h2d h1 h1 h2d 1 ′

′

f2 (X1 )f2 (X1 )f1 (Yi )dX1 dX1 dYi ≤ 2||K||2∞ .

Combining this with (31) gives

′ 2 ˜ ˜ E f1,h1 (X1 ) − f1,h1 (X1 ) ≤ 2||K||2∞ .

Similarly,

′ 2 ˜ ˜ E f2,h2 (X1 ) − f2,h2 (X1 ) ≤ 2||K||2∞ .

Combining these results with (30) gives 2 ′ ′ E g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) − g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) ≤ 8Cg2 ||K||2∞ . The second term in (29) is controlled in a similar way. From the Lipschitz condition, 2 2 ′ ′ ˜ g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) ≤ Cg2 ˜f2,h2 (Xj ) − ˜f2,h2 (Xj ) Cg2 Xj − X1 K = −K h M22 h2d 2 X −X

′

(32)

′

Xj − X1 h

!!2

.

′

X −X

j 1 and uj = jh2 1 within the expectation to obtain The h2d 2 terms are eliminated by making the substitutions of uj = h2 2 2Cg2 ||K||2∞ ′ ˜ ˜ ˜ ˜ (33) E g f1,h1 (Xj ), f2,h2 (Xj ) − g f1,h1 (Xj ), f2,h2 (Xj ) ≤ M22 2 N2 X ′ ˜ =⇒ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj )

j=2

=

N2 X N2 X j=2 i=2

≤ ≤

h i ′ ′ E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) − g ˜f1,h1 (Xi ), ˜f2,h2 (Xi )

2 ′ M22 E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) 2Cg2 ||K||2∞ ,

(34)

where we use the Cauchy Schwarz inequality to bound the expectation within each summand. Finally, applying Jensen’s inequality and (32) and (34) gives 2 2 ′ ′ ′ 2 ˜ ˜ ˜ ˜ ˜ ˜ E g f1,h1 (X1 ), f2,h2 (X1 ) − g f1,h1 (X1 ), f2,h2 (X1 ) ≤ E Gh1 ,h2 − Gh1 ,h2 2 N2 2 N2 X ′ 2 ˜ + 2 E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) N2 j=2 ≤

20Cg2 ||K||2∞ . N22

o n ′ Now suppose we have samples {X1 , . . . , XN2 , Y1 , . . . , YN1 } and X1 , . . . , XN2 , Y1 , . . . , YN1 and denote the respective ˜′ ˜ h1 ,h2 and G estimators as G h1 ,h2 . Then ′ ′ ˜ g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) ≤ Cg ˜f1,h1 (Xj ) − ˜f1,h1 (Xj ) ! ′ Xj − Y1 Xj − Y1 Cg −K = K h1 h1 N1 hd1 ′ 2 2Cg2 ||K||2∞ =⇒ E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . ≤ N12 Thus using a similar argument as was used to obtain (34), 2 N 2 ′ 1 X ˜ 2 ˜ ˜′ ≤ E G E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) h1 ,h2 − Gh1 ,h2 2 N2 j=1 2Cg2 ||K||2∞ . N22

≤

Applying the Efron-Stein inequality gives h i 10C 2 ||K||2 Cg2 ||K||2∞ N1 g ∞ ˜ h1 ,h2 ≤ + . V G N2 N22 A PPENDIX D P ROOF OF T HEOREM 5

We are interested in the asymptotic distribution of h i p ˜ h1 ,h2 − E G ˜ h1 ,h2 = N2 G

N2 h i 1 X √ g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − EXj g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) N2 j=1

N2 h i h i 1 X √ EXj g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . + N2 j=1

Note that by the standard central limit theorem [59], the second term converges in distribution to a Gaussian random variable. If the first term converges in probability to a constant (specifically, 0), then we can use Slutsky’s theorem [60] to find the asymptotic distribution. So now we focus on the first term which we denote as VN2 . To prove convergence in probability, we will use Chebyshev’s inequality. Note that E [VN2 ] = 0. To bound the variance of ′ ′ VN2 , we again use the Efron-Stein inequality. Let X1 be drawn from f2 and denote VN2 and VN2 as the sequences using ′ X1 and X1 , respectively. Then i h ′ 1 ˜ VN2 − VN2 = √ g f1,h1 (X1 ), ˜f2,h2 (X1 ) − EX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) N2 h i ′ ′ ′ ′ 1 ˜ g f1,h1 (X1 ), ˜f2,h2 (X1 ) − EX′ g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) +√ 1 N2 N 2 ′ 1 X g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . (35) +√ N2 j=2 Note that E

h h i2 h ii g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) − EX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) = E VX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) .

h i p If we condition on X1 , then by the standard central limit theorem Ni hdi ˜fi,hi (X1 ) − EX1 ˜fi,hi (X1 ) converges in 2 distribution to a zero mean Gaussian random variable with variance σi (X1 ) = O(1). This is true even if X1 is close to the boundary of the support of the densities. The KDEs ˜f1,h1 (X1 ) and ˜f2,h2 (X1 ) are conditionally independent given X1 as are their limiting distributions. Thus the KDEs converge jointly in distribution to a Gaussian random vector with zero mean,

zero covariance, and their respective variances. By h i the delta method [61], we have that if g(x, y) is continuously differentiable ˜ with respect to both x and y at EX1 fi,hi (X1 ) for i = 1, 2, respectively, then i h 1 1 ˜ ˜ VX1 g f1,h1 (X1 ), f2,h2 (X1 ) = O + = o(1), N1 hd1 N2 hd2 h h ii provided that Ni hdi → ∞. Thus E VX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) = o(1). A similar result holds when we replace X1 with ′

X1 . For the third term in (35),

2 N2 X ′ ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) j=2

=

N2 N2 X X j=2 i=2

h i ′ ′ E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) − g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) .

There are M2 terms where i = j and we have from Appendix C (see (33)) that 2 2C 2 ||K||2 ′ ˜ g ∞ ˜ ˜ ˜ . E g f1,h1 (Xj ), f2,h2 (Xj ) − g f1,h1 (Xj ), f2,h2 (Xj ) ≤ M22 Thus these terms are O M12 . There are M22 − M2 terms when i 6= j. In this case, we can do four substitutions of the form uj =

Xj −X1 h2

to obtain

i 4C 2 ||K||2 h2d h ′ ′ g ∞ 2 . E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) − g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) ≤ M22

Then since hd2 = o(1), we get 2 N2 X ′ ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) = o(1),

(36)

j=2

=⇒ E

′

VN2 − VN2

2

≤

h i2 3 E g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) − EX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) N2 h i2 ′ ′ ′ ′ 3 ˜ ˜ ˜ ˜ + E g f1,h1 (X1 ), f2,h2 (X1 ) − EX′ g f1,h1 (X1 ), f2,h2 (X1 ) 1 N2 2 N 2 ′ ′ 3 X g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) + E N2 j=2

= o

1 N2

.

o n ′ Now consider samples {X1 , . . . , XN2 , Y1 , . . . , YN1 } and X1 , . . . , XN2 , Y1 , . . . , YN1 and the respective sequences VN2 ′

and VN2 . Then

′

VN2 − VN2

=

N2 ′ 1 X √ g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . N2 j=1

Using a similar argument as that used to obtain (36), we have that if hd1 = o(1) and N1 → ∞, then 2 N2 ′ X ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) = o(1) j=2

Applying the Efron-Stein inequality gives

2 ′ 1 . =o =⇒ E VN2 − VN2 N2 V [VN2 ] = o

N2 + N1 N2

= o(1).

Thus by Chebyshev’s inequality,

V [VN2 ] = o(1), ǫ2 h i √ ˜ h1 ,h2 − E G ˜ h1 ,h2 converges in distriand therefore VN2 converges to zero in probability. By Slutsky’s theorem, N2 G bution to a zero mean Gaussian random variable with variance ii h h , V EX g ˜f1,h1 (X), ˜f2,h2 (X) Pr (|VN2 | > ǫ) ≤

where X is drawn from f2 . i h √ ˜w = ˜w −E G ˜ w where G For the weighted ensemble estimator, we wish to know the asymptotic distribution of N2 G P ˜ l∈¯ l w(l)Gh(l) . We have that i h p ˜w −E G ˜w N2 G

=

N2 X h i 1 X √ w(l) g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − EXj g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) N2 j=1 ¯ l∈l N 2 X X 1 X w(l)g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − E EXj w(l)g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) . +√ N2 j=1 ¯ ¯ l∈l

l∈l

The second term again converges in distribution to a Gaussian random variable by the central limit theorem. The mean and variance are, respectively, zero and h i X w(l)EX g ˜f1,h(l) (X), ˜f2,h(l) (X) . V l∈¯ l

The first term is equal to N2 h i X X 1 g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − EXj g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) w(l) √ N 2 ¯ j=1

=

X

w(l)oP (1)

l∈¯ l

l∈l

=

oP (1),

where oP (1) denotes convergence to zero in probability. In the last step, we used the fact that if two random variables converge in probability to constants, then their linear combination converges in probability to the linear combination of the constants. Combining this result with Slutsky’s theorem completes the proof. R EFERENCES [1] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons, 2012. [2] H. Avi-Itzhak and T. Diep, “Arbitrarily tight upper and lower bounds on the Bayesian probability of error,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 1, pp. 89–91, 1996. [3] W. A. Hashlamoun, P. K. Varshney, and V. Samarasooriya, “A tight upper bound on the Bayesian probability of error,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 220–224, 1994. [4] K.R. Moon, V. Delouille, and A. O. Hero III, “Meta learning of bounds on the Bayes classifier error,” in IEEE Signal Processing and SP Education Workshop. IEEE, 2015, pp. 13–18. [5] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Annals of Mathematical Statistics, pp. 493–507, 1952. [6] V. Berisha, A. Wisler, A. O. Hero III, and A. Spanias, “Empirically estimable classification bounds based on a new divergence measure,” IEEE Transactions on Signal Processing, 2015. [7] K. R. Moon and A. O. Hero III, “Multivariate f-divergence estimation with confidence,” in Advances in Neural Information Processing Systems, 2014, pp. 2420–2428. [8] Stephen V Gliske, Kevin R Moon, William C Stacey, and Alfred O Hero III, “The intrinsic value of HFO features as a biomarker of epileptic activity,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2016. [9] B. Póczos and J. G. Schneider, “On the estimation of alpha-divergences,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 609–617. [10] J. Oliva, B. Póczos, and J. Schneider, “Distribution to distribution regression,” in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1049–1057. [11] Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur, “Two-stage sampled learning theory on distributions,” To appear in AISTATS, 2015.

[12] K. R. Moon, V. Delouille, J. J. Li, R. De Visscher, F. Watson, and A. O. Hero III, “Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization,” Journal of Space Weather and Space Climate, vol. 6, no. A3, 2016. [13] K. R. Moon, J. J. Li, V. Delouille, R. De Visscher, F. Watson, and A. O. Hero III, “Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis,” Journal of Space Weather and Space Climate, vol. 6, no. A2, 2016. [14] I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information theoretic feature clustering algorithm for text classification,” The Journal of Machine Learning Research, vol. 3, pp. 1265–1287, 2003. [15] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” The Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005. [16] J. Lewi, R. Butera, and L. Paninski, “Real-time adaptive information-theoretic optimization of neurophysiology experiments,” in Advances in Neural Information Processing Systems, 2006, pp. 857–864. [17] L. Bruzzone, F. Roli, and S. B. Serpico, “An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 33, no. 6, pp. 1318–1321, 1995. [18] X. Guorong, C. Peiqi, and W. Minhui, “Bhattacharyya distance feature selection,” in Pattern Recognition, 1996., Proceedings of the 13th International Conference on. IEEE, 1996, vol. 2, pp. 195–199. [19] D. M. Sakate and D. N. Kashid, “Variable selection via penalized minimum ϕ-divergence estimation in logistic regression,” Journal of Applied Statistics, vol. 41, no. 6, pp. 1233–1246, 2014. [20] K. E. Hild, D. Erdogmus, and J. C. Principe, “Blind source separation using Renyi’s mutual information,” Signal Processing Letters, IEEE, vol. 8, no. 6, pp. 174–176, 2001. [21] M. Mihoko and S. Eguchi, “Robust blind source separation by beta divergence,” Neural computation, vol. 14, no. 8, pp. 1859–1886, 2002. [22] B. C. Vemuri, M. Liu, S. Amari, and F. Nielsen, “Total Bregman divergence and its applications to DTI analysis,” Medical Imaging, IEEE Transactions on, vol. 30, no. 2, pp. 475–483, 2011. [23] A. B. Hamza and H. Krim, “Image registration and segmentation by maximizing the Jensen-Rényi divergence,” in Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer, 2003, pp. 147–163. [24] G. Liu, G. Xia, W. Yang, and N. Xue, “SAR image segmentation via non-local active contours,” in Geoscience and Remote Sensing Symposium (IGARSS), 2014 IEEE International. IEEE, 2014, pp. 3730–3733. [25] V. Korzhik and I. Fedyanin, “Steganographic applications of the nearest-neighbor approach to Kullback-Leibler divergence estimation,” in Digital Information, Networking, and Wireless Communications (DINWC), 2015 Third International Conference on. IEEE, 2015, pp. 133–138. [26] M. Basseville, “Divergence measures for statistical data processing–An annotated bibliography,” Signal Processing, vol. 93, no. 4, pp. 621–633, 2013. [27] B. Chai, D. Walther, D. Beck, and L. Fei-Fei, “Exploring functional connectivities of the human brain using multivariate information analysis,” in Advances in neural information processing systems, 2009, pp. 270–278. [28] K. M. Carter, R. Raich, and A. O. Hero III, “On local intrinsic dimension estimation and its applications,” Signal Processing, IEEE Transactions on, vol. 58, no. 2, pp. 650–663, 2010. [29] K. R. Moon, J. J. Li, V. Delouille, F. Watson, and A. O. Hero III, “Image patch analysis and clustering of sunspots: A dimensionality reduction approach,” in IEEE International Conference on Image Processing. IEEE, 2014, pp. 1623–1627. [30] A. O. Hero III, B. Ma, O. Michel, and J. Gorman, “Applications of entropic spanning graphs,” Signal Processing Magazine, IEEE, vol. 19, no. 5, pp. 85–95, 2002. [31] I. Csiszar, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. MAth. Hungar., vol. 2, pp. 299–318, 1967. [32] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966. [33] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. [34] A. Rényi, “On measures of entropy and information,” in Fourth Berkeley Sympos. on Mathematical Statistics and Probability, 1961, pp. 547–561. [35] E. Hellinger, “Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen.,” Journal für die reine und angewandte Mathematik, vol. 136, pp. 210–271, 1909. [36] A. Bhattacharyya, “On a measure of divergence between two multinomial populations,” Sankhy¯a: The Indian Journal of Statistics, pp. 401–406, 1946. [37] K. Sricharan, D. Wei, and A. O. Hero, “Ensemble estimators for multivariate entropy estimation,” Information Theory, IEEE Transactions on, vol. 59, no. 7, pp. 4374–4388, 2013. [38] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú, “Divergence estimation for multidimensional densities via k-nearest-neighbor distances,” IEEE Trans. Information Theory, vol. 55, no. 5, pp. 2392–2405, 2009. [39] Georges A Darbellay, Igor Vajda, et al., “Estimation of the information by an adaptive partitioning of the observation space,” IEEE Trans. Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999. [40] Jorge Silva and Shrikanth S Narayanan, “Information divergence estimation based on data-dependent partitions,” Journal of Statistical Planning and Inference, vol. 140, no. 11, pp. 3180–3198, 2010. [41] Trung Kien Le, “Information dependency: Strong consistency of Darbellay–Vajda partition estimators,” Journal of Statistical Planning and Inference, vol. 143, no. 12, pp. 2089–2100, 2013. [42] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú, “Divergence estimation of continuous distributions based on data-dependent partitions,” IEEE Trans. Information Theory, vol. 51, no. 9, pp. 3064–3074, 2005. [43] K. R. Moon and A. O. Hero III, “Ensemble estimation of multivariate f-divergence,” in Information Theory (ISIT), 2014 IEEE International Symposium on. IEEE, 2014, pp. 356–360. [44] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” Information Theory, IEEE Transactions on, vol. 56, no. 11, pp. 5847–5861, 2010. [45] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. Wasserman, “Nonparametric estimation of renyi divergence and friends,” in Proceedings of The 31st International Conference on Machine Learning, 2014, pp. 919–927. [46] S. Singh and B. Póczos, “Generalized exponential concentration inequality for rényi divergence estimation,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 333–341. [47] S. Singh and B. Póczos, “Exponential concentration of a density functional estimator,” in Advances in Neural Information Processing Systems, 2014, pp. 3032–3040. [48] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James Robins, “Nonparametric von mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems, 2015, pp. 397–405. [49] Wolfgang Härdle, Applied Nonparametric Regression, Cambridge University Press, 1990. [50] A Berlinet, L Devroye, and L Györfi, “Asymptotic normality of L1 error in density estimation,” Statistics, vol. 26, pp. 329–343, 1995. [51] A Berlinet, László Györfi, and István Dénes, “Asymptotic normality of relative entropy in multivariate density estimation,” Publications de l’Institut de Statistique de l’Université de Paris, vol. 41, pp. 3–27, 1997.

[52] Peter J Bickel and Murray Rosenblatt, “On some global measures of the deviations of density function estimates,” The Annals of Statistics, pp. 1071–1095, 1973. [53] Lucien Birgé and Pascal Massart, “Estimation of integral functionals of a density,” The Annals of Statistics, pp. 11–29, 1995. [54] Evarist Giné and David M Mason, “Uniform in bandwidth estimation of integral functionals of the density function,” Scandinavian Journal of Statistics, vol. 35, no. 4, pp. 739–761, 2008. [55] Béatrice Laurent et al., “Efficient estimation of integral functionals of a density,” The Annals of Statistics, vol. 24, no. 2, pp. 659–681, 1996. [56] Kumar Sricharan, Raviv Raich, and Alfred O Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Trans. Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012. [57] Bruce E Hansen, “Lecture notes on nonparametrics,” 2009. [58] Bradley Efron and Charles Stein, “The jackknife estimate of variance,” The Annals of Statistics, pp. 586–596, 1981. [59] Rick Durrett, Probability: Theory and Examples, Cambridge University Press, 2010. [60] Allan Gut, Probability: A Graduate Course, Springer Science & Business Media, 2012. [61] Robert Keener, Theoretical Statistics: Topics for a Core Course, Springer Science & Business Media, 2010.