3 days ago - E. S. Cross, S. Daniels, H. Danielsson, L. DeBruine, D. J. Dunleavy, ... Trabajos de Estadistica Y de Investigacion Operativa 31(1), 585â...

0 downloads 3 Views 634KB Size

Assessing Bayes factor surfaces using interactive visualization and computer surrogate modeling Christopher T. Franck Department of Statistics, Virginia Tech and Robert B. Gramacy Department of Statistics, Virginia Tech September 19, 2018

Abstract Bayesian model selection provides a natural alternative to classical hypothesis testing based on p-values. While many papers mention that Bayesian model selection is frequently sensitive to prior specification on the parameters, there are few practical strategies to assess and report this sensitivity. This article has two goals. First, we aim educate the broader statistical community about the extent of potential sensitivity through visualization of the Bayes factor surface. The Bayes factor surface shows the value a Bayes factor takes (usually on the log scale) as a function of user-specified hyperparameters. We provide interactive visualization through an R shiny application that allows the user to explore sensitivity in Bayes factor over a range of hyperparameter settings in a familiar regression setting. We compare the surface with three automatic procedures. Second, we suggest surrogate modeling via Gaussian processes (GPs) to visualize the Bayes factor surface in situations where computation of Bayes factors is expensive. That is, we treat Bayes factor calculation as a computer simulation experiment. In this context, we provide a fully reproducible example using accessible GP libraries to augment an important study of the influence of outliers in empirical finance. We suggest Bayes factor surfaces are valuable for scientific reporting since they (i) increase transparency, making potential instability in Bayes factors easy to visualize, (ii) generalize to simple and more complicated examples, and (iii) provide a path for researchers to assess the impact of prior choice on modeling decisions in a wide variety research areas.

Keywords: Bayes Factors, Bayesian model selection, prior distributions, emulator, Gaussian process, visualization 1

1

Introduction

In the current scientific landscape, multiple concerns surrounding the use of classical pvalues for null hypothesis significance testing have been raised, see e.g. Ioannidis (2005), Boos and Stefanski (2011), Young and Karr (2011), Trafimow and Marks (2015), and Wasserstein et al. (2016). Despite these concerns, interest in hypothesis testing persists. Bayesian model selection enables direct probabalistic statements about the plausibility of competing models and is thus a natural alternative to null hypothesis testing based on pvalues. However, Bayesian model selection also has drawbacks. Specifically, model selection from the Bayesian perspective is typically quite sensitive to prior choice on parameters, even ones deep within a hierarchy. Scientific reporting can be improved if these sensitivities are more broadly recognized and easily reported among practicing statisticians and the wider research community. In the statistical literature, the sensitivity to hyperparameters (parameters to the priors on parameters) has been mentioned previously, see e.g. Berger (1985), Kass and Raftery (1995), Berger and Pericchi (1996), O’Hagan (1995), Chipman et al. (2001), and Yao et al. (2018). Yet, it remains difficult for practitioners who are interested in using Bayesian model selection in their research to understand the degree of sensitivity they might expect in their specific analysis plan. A researcher hopes their Bayesian model selection “will be relatively insensitive to reasonable choices [of hyperparameters]” (Berger, 1985, p. 146) but worries that “... such hypothesis tests can pose severe diffiulties in the Bayesian inferential paradigm where the Bayes factors may exhibit undesirable properties unless parameter prior distributions are chosen with exquisite care” (Ormerod et al., 2017). There is currently a dearth of practical strategies to help researchers systematically assess and report sensitivity in Bayesian model selection. In order for an approach to be useful for this purpose, it must (i) be simple to implement, not requiring extensive reconceptualization in different scenarios, (ii) be efficient in instances where computing Bayes factors is computationally expensive, (iii) readily reveal the extent to which Bayes factors are sensitive to prior specification and facilitate the reporting of potential sensitivity. With the above requirements in mind, we propose systematic assessment of the Bayes factor surface with the aid of modern libraries and visualization tools. We define the Bayes 2

factor surface as the value Bayes factors (usually on the log scale) take as a function of the hyperparameters in the model. Bayes factor surfaces are easy to visualize, require only the ability to form a Bayes factor in order to produce, can be compared with available automatic procedures which appear as planes in Bayes factor surface visualizations. We propose approximating Bayes factor surfaces using computer surrogate modeling (e.g., Santner et al., 2003) if computation of a given Bayes factor is expensive. Importantly, Bayes factor surfaces inform their audience how inferences on competing models might vary as a function of prior belief about the parameters. We suspect that the instability revealed by Bayes factor surfaces might be considerably higher than many practicing statisticians realize, thus revealing a potential barrier for the greater adoption of Bayesian model selection in data analysis.

●

●

● ●

●

●

●

●

● ●

●

y

2

●

●

1 0

● ● ● ●●

●● ● ●

●

p−value=.0021

0.4

0.6

0.8

1.0

5

2 1

0

−1

−3

0

−5

−2

−5

−10

−4

−2

0

2

4

prior precision for slope − log10(φ)

x

Figure 1: Scatter plot (left panel) and Bayes factor surface contour plot (right panel). Contour plot shows Bayes factor comparing β = 0 and β 6= 0 hypotheses on natural log. Contours correspond to strength of evidence according to Kass and Raftery (1995). Figure 1 shows the dilemma. Consider a simple linear regression problem where the researcher wishes to test whether the slope parameter β is equal to zero or not. The data in the left panel were simulated with β = 2.5, error variance σ 2 = 1, and sample size 3

10

3

−4

●

0.2

−3

0

0.0

●

● ● ●

−1

1

prior mean for slope − µ

3

0

4

0

●

●

n = 30). The classical p-value for the slope is 0.0021, below the recently proposed α = 0.005 criterion to declare statistical significance for new discoveries (Benjamin et al., 2018). The contour plot shows the log10 Bayes factor (i.e., the Bayes factor surface) comparing H0 : β = 0 and H1 : β 6= 0 hypotheses. The x- and y-axes of the contour plot represent prior hyperparameters—user chosen inputs that encode prior belief. (We define these in detail in Section 3.1.) Blue favors H1 and red favors H0 . It is clear that the prior belief that the researcher brings to this analysis plays a major role in the conclusion. Decreasing prior precision (x-axis) yields equivocal evidence about hypotheses despite the apparent trend in the data. Precise prior beliefs near the least squares estimate yield Bayes factors “strongly” in favor of a non-zero slope (according to the scale of evidence proposed by Kass and Raftery, 1995). Incorrect but precise prior beliefs (i.e., red in lower right corner) provide very strong evidence in favor of H0 . Figure 1 is, at first blush, rather discouraging. Least squares regression is among the simplest data analytic approaches in any analyst’s toolbox. Even in the simplest of cases, it is clear that priors on parameters must be handled with care. The good news is that within the class of linear models, several well-calibrated automatic approaches are available for Bayes factors, which we review later. In more complicated settings such as the problem described in Section 3.2, automatic procedures for model selection are usually unavailable (although Fonseca et al. (2008) is a good starting point to develop an automatic procedure for this example). In these latter cases, the Bayes factor surface can play an important role in scientific reporting. A thorough overview of Bayes factors can be found in Kass and Raftery (1995). A useful guide for implementation of Bayesian model selection is Chipman et al. (2001). The remainder of this article is organized as follows. Section 2 reviews the general formulation of Bayesian model selection via Bayes factors and develops the idea of Bayes factor surfaces. Section 3 covers two examples of varying complexity. The first example includes an interactive R shiny (Chang et al., 2017) application hyperlink that allows the user to explore Bayes factor surfaces in real time. The second example addresses the common situation where Bayes factor approximation is computationally intensive, and often a “noisy” enterprise owing to the stochastic nature of approximation via Markov chain Monte Carlo

4

(MCMC). Therein we propose a space-filling design in the hyperparamter space, and surrogate modeling with Gaussian process models to emulate the Bayes factor surface. Modern libraries in R, like hetGP (Binois and Gramacy, 2018) on CRAN, cope well with both mean and variance features common in Bayes factor surfaces. Our fully reproducible illustration, with code provided in the supplementary material, is designed to be easily ported to Bayes factor calculations in novel contexts. Discussion is included in Section 4.

2

Bayesian model selection overview

Let y represent observed data. Let M1 and M2 represent two competing models (or hypotheses), and let θ1 and θ2 represent vectors of parameters associated with each of M1 and M2 , respectively. Index models and parameter vectors with k = 1, 2. Let the distribution of the data given the parameters (i.e., the likelihood normalized as a proper distribution) be denoted p(y|θk , Mk ). The marginal likelihood for model k (denoted p(y|Mk )) arises when parameters are integrated out of the joint distribution of the parameters and data, i.e., Z p(y|Mk ) =

p(y|θk , Mk )p(θk |Mk ) dθk .

(1)

The marginal likelihood plays a central role in Bayesian model comparison, selection, and averaging. The Bayes factor is the usual metric to compare M1 and M2 and is the ratio of those models’ marginal likelihoods: BF12 =

p(y|M1 ) . p(y|M2 )

(2)

In this work we consider Bayes factors based on parametric Bayesian models. Historically, scales of interpretation (Jeffreys, 1961; Kass and Raftery, 1995) have been used to describe strength of evidence in favor of M1 over M2 where large values of BF12 suggest that observed data support M1 more than M2 . However, using scales in this manner implicitly assumes prior model probabilities p(M1 ) = p(M2 ) = 21 , which is the unique setting where the Bayes factor describes the ratio of how much more probable M1 is compared

5

with M2 . In general, interpreting Bayes factors as the weight of evidence in favor of one hypothesis over another is not strictly justified. Technically, Bayes factors summarize the extent to which the observed data update prior model odds to posterior model odds. In our discussion we will assume equal model priors but the reader can adopt other model priors and interpret the Bayes factor as a multiplicative factor which updates prior belief about models. For a more complete discussion of this and related issues, see Lavine and Schervish (1999). The notation p(y|Mk ) for marginal likelihoods shown in Equation (1) is remarkably consistent across the statistical literature. Perhaps for brevity, this notation omits reference to the parameters that have been integrated, which facilitates an illusion that p(y|Mk ) does not depend on the priors assigned to θk . In the remainder of this paper we show that choices about the priors on θk can exert strong influence on Bayes factors. The first famous discussion about the sensitivity to priors for Bayesian model selection arose as the Jeffreys-Lindley-Bartlett paradox (Jeffreys, 1935, 1961; Bartlett, 1957; Lindley, 1957). The issue (which is not technically a paradox) was described (Lindley, 1957) in terms of a one parameter point null testing problem in which frequentist p-values strongly favor rejecting the null while Bayes factors provide strong support for the null. This simple example illustrates that as the prior on the parameter being tested becomes vague, the Bayes factor increasingly favors the simpler point-null hypothesis. See Ormerod et al. (2017) for thorough discussion and numerous references to the “paradox” in the literature. In addition to the effects of prior vagueness, Bayes factor surfaces allow for the visualization of the effects of multiple hyperparameters simultaneously. The methods described thus far apply when only two models are under consideration. When K > 2 models are of interest, usually a series of K − 1 Bayes factors are produced, where a specific “base” model is chosen, and a Bayes factor is formed comparing each other model to the base model. In cases such as these, the researcher may wish to examine Bayes factor surfaces corresponding to each of the K − 1 Bayes factors that are considered.

6

2.1

Bayes factor surfaces

The method we propose in this paper is simple. We suggest producing a graphical display of the log Bayes factor as a function of hyperparameters of interest for inclusion in scientific reports. Basically, enumerate a grid in the hyperparameter space, collect Bayes factor estimates under each setting, and stretch over an appropriate mesh—off-loading the heavy visual lifting, projections, slices, etc., to any number of existing rich visualization suites. One drawback to this approach is that numerical integration to obtain marginal likelihoods (1) is frequently expensive. When Monte Carlo integration is used the output is frequently noisy. Although that noise can be reduced with further computation (i.e., longer MCMC chains), convergence is typically not uniform in the hyperparameter space. For a fixed simulation effort in all hyperparameter settings, the level of noise in the output Bayes factor may change. In other words, the Bayes factor response surface can be heteroskedastic. In situations such as these, we propose treating Bayes factor calculation as a computer simulation experiment: space-filling design in the hyperparameter space and surrogate modeling via Gaussian processes in lieu of direct inspection. We illustrate a modern library called hetGP which copes with input dependent variance and leads to nice surrogate Bayes factor surface visualizations in our examples. The simplicity of the approach is a virtue. The core idea does not need to be reconceptualized in different settings and could be explored in any situation where Bayes factors are available and concern about sensitivity is present. For example, a Bayes factor surface plot could accompany statistical reporting as a method to indicate to reviewers and the broader audience whether the conclusions with respect to model selection are driven heavily by the specific priors chosen for θk . As a corollary this approach also fosters transparency since the audience can see how model selection would differ under other hyperparameter settings.

3 3.1

Examples Ordinary regression

The first example comes from the ordinary regression setting with model

7

yi = α + βxi + i

(3)

where α is the y-intercept, β is the slope, i = 1, . . . , n, xi represents fixed known data from iid 1 a predictor variable, and yi is the outcome. We assume i ∼ N 0, γ , where the error precision is γ =

1 σ2

and σ 2 is the usual error variance. Specifying error in terms of precision

is mathematically convenient for Bayesian analysis. Following Equation (2) and the color scale in Figure 1, under M1 we have β 6= 0 and under M2 implies β = 0. The remaining model specifications for M1 and M2 share the following features: β|M1 ∼ N (µ, φ)

(4)

γ ∼ IG(a, b)

(5)

p(α) ∝ 1

(6)

iid

yi |α, β, γ, Mk ∼ N

α + βxi ,

1 γ

.

(7)

The parameters in each model are vectors θ1 = (α, β, γ)> and θ2 = (α, γ)> . It has been shown previously that putting a flat prior on the intercept implicitly centers the yi and xi about their respective means during the computation of the marginal likelihood (Chipman et al., 2001). Denote wi = xi − x¯ and zi = yi − y¯. Choice of hyperparameters a, b, φ, and µ enable the researcher to incorporate prior information about parameters into the analysis. In the absence of specific prior information, the priors on β and γ are frequently taken to be vague yet proper. For example, a researcher may center β at zero and choose a small prior precision (i.e., large prior variance) and similar for γ they may choose to center at 1 and make the prior variance large. This is in an attempt to impart minimal information to the inference from the prior. The improper flat intercept α contributes an unspecified constant to the marginal likelihood in each model which cancels once the Bayes factor is formed. The included R shiny (Chang et al., 2017) application hyperlink allows users to entertain various settings of hyperprior, sample size, and true underlying effect size β in order to explore how these affect the Bayes factor surface. Experimentation with this application illustrates that hyperparameters for parameters common to both models (e.g., a and 8

b specified for γ) tend to have much less influence on Bayes factors than hyperparameters for parameters that are being tested (e.g., µ and φ specified for β). Under the simpler hypothesis M2 : β = 0 and (5–7) the marginal likelihood has a convenient closed-form solution. −(n−1) +a cba (2π) 2 Γ n−1 2 p(y|M1 ) = n−1 +a P √ Γ(a) n b + 12 ni=1 zi2 2

(8)

As noted in Section 2, the standard notation for marginal likelihoods, as in Eq. (8), omits reference to θ1 due to its integration. Yet, obviously specific choices about hyperparamters (a and b in this case) associated with θ1 appear in the marginal likelihood. The form of prior densities also influences the marginal likelihood in a less visible manner. Even though standard notation tends to obscure this fact, prior choice on parameters directly influences marginal likelihoods and thus Bayesian model selection. Across the universe of possible Bayesian model selection endeavors, there are no ironclad guarantees that increasing n diminishes the influence of hyperparameters. Under the M1 : β 6= 0 in Eqs. (4–7), integration of α and β can be handled analytically, but integration of γ requires numerical approximation. √ n−1 +a−1 φγ 2 1 P (φ + γ ni=1 wi2 ) 2 ! ! Pn n 2 X z w ) (φµ + γ 1 1 i i Pi=1 dγ φµ2 + γ zi2 + × exp −bγ − n 2 2 2 (φ + γ i=1 wi ) i=1 −(n−1)

c(2π) 2 ba √ p(y|M2 ) = Γ(a) n

Z

(9)

Of course, adopting the priors (4-6) is only one possible strategy in a universe of available Bayesian priors. In this study we will also consider three specific automatic procedures to obtain Bayes factors that are readily available for regression. The term “automatic” is used in Bayesian procedures to describe cases where the researcher does not have to specify priors subjectively. We explore Zellner–Siow mixture g-priors (Liang et al., 2008), Bayes factor approximation via Bayesian information Criterion (Schwarz, 1978, See also Kass and Raftery 1995), and Fractional Bayes factors (O’Hagan, 1995), each of which is briefly summarized next. As the reader probably expects, these automatic procedures are each

9

defensible individually, yet do not necessarily agree and could potentially lead to different conclusions in different situations. Zellner–Siow mixture g-prior: The Zellner–Siow mixture g-prior was introduced (Liang et al., 2008) as an automatic Bayesian specification to allow selection among nested ordinary linear models. Previous work (Zellner and Siow, 1980) has shown that a multuivariate Cauchy distribution on regression coefficients satisfies basic model consistency requirements. Liang et al. (2008) demonstrates that an inverse gamma with shape and rate parameters of

1 2

and

n 2

respectively, placed on the g parameter of Zellner’s g-prior (Zell-

ner, 1986) induces Cauchy tails on the prior of β while only requiring a one dimensional approximation to the integral for the marginal likelihood. The Appendix includes a brief description of the Bayesian model for the Zellner–Siow mixture g-prior. 1

Approximation by BIC: Asymptotically, BF12 ≈ e− 2 (BIC1 −BIC2 ) where BICk is the usual Bayesian information criterion (Schwarz, 1978) for model k. Further details relating BIC to approximate Bayesian model selection can be found in Kass and Raftery (1995). Fractional Bayes factor: The final automatic method we consider is the fractional Bayes factor (O’Hagan, 1995). Fractional Bayes factors are a variant of partial Bayes factors (Berger and Pericchi, 1996) which reserve a training sample from the full data, update the prior using the training sample, then form a Bayes factor using the remainder of the data in the likelihood. A central appeal of the partial Bayes factor approach is that parameters with noninformative priors (which are frequently improper) can be tested in an essentially automatic fashion. While intrinsic Bayes factors (Berger and Pericchi, 1996) are obtained by averaging over many partial Bayes factors, Fractional Bayes factors obviate the need to choose a specific training sample through a clever approximation of the “partial” likelihood and are thus computationally less expensive than intrinsic Bayes factors. In this approach we adopt noninformative improper priors and compute a Bayes factor surface using fractional Bayes factors. More detail can be found in the Appendix. R shiny application To accompany this paper and allow the reader to interactively explore Bayes factor surfaces in the regression setting, we have produced a R shiny (Chang et al., 2017) application.

10

The app allows the user to set sample size, parameter values and choose hyperparameters in Equations (3-7). The application provides 3 dimensional rotating Bayes factor surface plots with user options to superimpose planes corresponding to each of the three automatic procedures described. Figure 2 shows the interface with this specific output. Additionally, the application also provides scatter plots contour plots such as in Figure 1 and density plots for the priors. The specific surface visualized corresponds to a true β = 3, where user rotation reveals the same phenomnenon as Figure 1: strong evidence in favor of the correct model is achieved when the prior on the slope is precice and centered near the least squares estimate. As prior precision on the slope decreases, the Bayes factor reduces (i.e. Jeffreys-Lindley-Bartlett phenomenon), and precise but misguided prior beliefs about the slope incorrectly suggest the plausibility that β = 0. Each of the three automatic procedures provide strong evidence of a non-zero slope, where the Fractional Bayes factor plane (not shown) is between the BIC approximation and mixture g prior.

3.2

Surrogate modeling the BF surface

In the simple regression setting of Section 3.1, computation is cheap and many viable automatic procedures are available. In many other settings, obtaining marginal likelihoods is a delicate and computationally expensive task. For cases such as these, we propose that an adequate analogue of the visualizations provided above can be facilitated by emulating the Bayes factor surface in the hyperparameter space of interest. In other words, we propose treating expensive Bayes factor calculations, via MCMC say, as a (stochastic) computer simulation experiment. Bayes Factor calculations at a space-filling design in the (input) hyperparameter space can be used to map out the space and better understand the sensitivity of model selection to those settings. As motivation, consider an experiment described by Gramacy and Pantaleo (2010, Section 3.3–3.4) involving Bayes factor calculations to determine if data are leptokurtic (Student-t errors) or not (simply Gaussian). The context of that experiment is portfolio balancing with historical asset return data of varying length. Highly unbalanced history lengths necessitated a modeling in the Cholesky-space, through a cascade of least-squares style regressions decomposing a joint multivariate normal via conditionals. To address this,

11

Figure 2: Screenshot of R shiny application. The application allows users to simulate regression data with varying slope and sample size to explore Bayes factor surfaces under various hyperparameter settings. The application includes 3d rotatable surfaces, contour plots, scatter plots, and prior densities. Above, the red lines show the surface at µ = 0 and φ = 1. The grey plane corresponds to log(BF12 ) = 0, i.e. a BF such that prior and posterior odds are equal (for a uniform model prior). Users can superimpose planes corresponding to the three automatic procedures, including mixture g prior (green), the BIC approximation to BF(blue), and the fractional Bayes factor (not shown). These and additional graphics and analyses can be interactively explored using the accompanying shiny app hosted insert hyperlink. Gramacy and Pantaleo proposed Bayesian modernization of methods first introduced by Andersen (1957) in some generality, and Stambaugh (1997) for portfolio balancing. Gramacy and Pantaleo entertained modeling variations for those regressions including doubleexponential (lasso), Gaussian (ridge), normal-gamma (NG), and horseshoe priors for the regression coefficients, potentially under a reversible-jump (RJ) scheme for variable selection. Their monomvn package (Gramacy, 2017) on CRAN provided RJ-MCMC inference for the regression subroutines, and subsequently for Andersen’s reconstruction of the resulting joint MVN distributions of the asset returns, which are funneled through quadratic programs for Markowitz (1959)-style portfolio balancing. With the proposed methods coming 12

online just after the 2008 financial crisis, the authors were particularly interested in the appropriateness of the prevailing MVN modeling setup—theirs being one example—in the presence of data which may contain outliers, or otherwise heavy tails. To accommodate robust inference in the face out outliers, Gramacy and Pantaleo tacked the latent-variable-based Student-t sampler from Geweke (1992, 1993) onto their inferential framework. Next, to determine if the data warranted heavy-tailed modeling they further augmented the software with a Bayes factor calculation in the style Jacquier et al. (2004, Section 2.5.1), leveraging that Student-t (St) and normal (N ) models differ by just one parameter in the likelihood: ν, the degrees of freedom. In such cases, one can write the BF as the expectation of the ratio of un-normalized posteriors with respect to the posterior under the bigger (St) model. That is, BFStN = E

p(0|ψ, MN ) p(0|ψ, ν, MSt )

T 1 X p(0|ψ (t) , MN ) , ≈ T t=1 p(0|ψ (t) , ν (t) , MSt )

(10)

where (ψ (t) , ν (t) ) ∼ p(ψ, ν|0, MSt ) and ψ collects the parameters shared by both models. A subsequent simulation study, however, cast doubt on whether or not one could reliably detect heavy tails (when indeed they were present) in situations where the generating νvalue was of moderate size, say ν ∈ {3, 4, 5, 6, 7}. It is relevant to ask to what extent their conclusions are impacted by their hyperparameter choices, particularly on the prior for ν which they took to be ν ∼ Exp(θ = 0.1). Their intention was to be diffuse, but ultimately they lacked an appropriate framework for studying sensitivity to this choice. Consider the following warm-up experiment toward a more comprehensive analysis. We set up a grid of hyperparameter values in θ, evenly spaced in log10 space from 10−3 to 10m spanning “solidly Student-t” (even Cauchy) to “essentially Gaussian” in terms of the mean of the prior over ν. For each θi on the grid we ran the RJ-MCMC to approximate BFStN by feeding sample likelihood evaluations provided by monomvn’s blasso function through (10). The training data was generated following a setup identical to the one described in Gramacy and Pantaleo (2010), involving 200 observations randomly drawn from a linear model in seven input dimensions, with three of the coefficients being zero, and ν = 5. Details are provided by our fully reproducible R implementation provided in the supplementary

13

material. In order to understand the Monte Carlo variability in those calculations, ten replicates of the BFs under each hyperparameter setting were collected. ●

20

hetTP mean hetTP interval

●

15

● ● ● ●

●

5

log(bf)

10

● ● ●

●

−10

−5

0

●

● ● ● ● ● ● ● ● ● ●

−3

● ● ● ● ● ●

● ● ● ● ● ● ● ●

● ● ● ● ● ●

● ● ● ●

● ● ● ●

● ● ● ● ●

● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ●

● ● ● ● ● ● ●

● ● ● ● ● ● ●

● ● ● ● ●

● ● ● ●

● ● ● ● ● ●

● ● ● ● ●

● ● ● ●

● ● ●

●

●

●

−2

−1

0

1

log10(thetas)

Figure 3: Log Bayes factors for varying hyperparameter prior settings θ in ν ∼ Exp(θ). The x-axis (θ) is shown in log10 space. A predictive surface from hetTP is overlayed via mean and 95% error-bars. The results are shown in Figure 3. Each open circle is a BFStN evaluation, plotted in log10 − loge space. Observe that outputs on the y-axis span the full range of evidence classes (Kass and Raftery, 1995) for the hypothesis of Student-t errors over the Gaussian alternative. When θ is small, the Student-t is essentially a foregone conclusion; whereas if θ is large the Gaussian is. The data, via likelihood through to the posterior, is providing very little in the form of adjudication between these alternatives. An explanation is that the model is so flexible, the posterior is happy to attribute residual variability—what we know to be outliers from heavy tails (because we generated the data)—to mean variability via settings of the linear regression coefficients, β. A seemingly innocuous hyperparameter setting is essentially determining the outcome of a model selection enterprise. Each BFStN evaluation, utilizing T = 100000 MCMC samples, takes about 36 minutes to obtain on a 4.2 GHz Intel Core i7 processor, leading to a total runtime of about 120 hours to collect all 200 values used in the figure. Although that burden is tolerable (and perhaps we could have made due with fewer evaluations), extending to higher dimensions is problematic. Suppose we wanted to entertain ν ∼ Gamma(α, β), where the α = 1 case reduces to ν ∼ Exp(β ≡ θ) above. If we tried to have a similarly dense grid, the 14

runtime would balloon to 100 days, which is clearly unreasonable. Rather, we propose treating the BFStN calculation as a computer experiment: build a surrogate model from a more limited space-filling design, and use the resulting posterior predictive surface to understand variability in Bayes factors in the hyperparameter space. Before providing an example, the simpler experiment Figure 3 points to some potential challenges. The BFStN surface is heteroskedastic, even after log transform, and may itself have heavy tails. Fortunately, stochastic computer simulations with heteroskedastic errors are popping up all over the literature and some good libraries have recently been developed to cope. The hetGP package on CRAN (Binois and Gramacy, 2018), based on the work of Binois et al. (2018) with Student-t extensions (Chung et al., 2018), is easy to deploy in our context. The predictive surface from a so-called hetTP surrogate, a fitted heteroskedastic Student-t process, is overlayed on the figure via mean and 95% predictive intervals. Although this fitted surface doesn’t add much to the visualization here, it’s analog is essential in higher dimensions. As an illustration, lets return now to ν ∼ Gamma(α, β). Our supplementary material provides a code which evaluates BFs on a space-filling design in α × β-space, via a Latin hypercube sample (McKay et al., 1979) of size 40, using a recently updated version of the monomvn library to accommodate the Gamma prior. Five replicates are obtained at each input setting, for a total of 200 runs. I.e., at similar computational expense compared to the earlier Exp experiment despite the higher dimension. Figure 4 shows the outcome of that experiment via a fitted hetTP surface: mean on the left and standard deviation on the right. The numbers overlayed on the figure are the average BFStN obtained for the five replicates at each input location. Observe that the story here is much the same as before in terms of β, which is maps to θ in the earlier experiment, especially nearby α = 1 (i.e., log10 α = 0) where the equivalence is exact. The left panel shows that along that slice you can get just about whatever conclusion you want. Smaller α values tell a somewhat more nuanced story, however. A rather large range of smaller α values lead to somewhat less sensitivity in the outcome due to the β hyperparameter setting, except when β is quite large. It would appear that it is essential to have a small α setting if the data are going to have any influence on model selection via BF. The right panel shows that the variance

15

mean log BF 17

10 8.6

7.2

5

2.9

2 2

−1

0.83

1.4

3

2.2

0.92 0.76 −0.43

3

1.1

1

−1 −2

1.3

−0.6

0.066 −0.91 −0.89 −0.35

−0.52 −0.44

−1

−1.5 −1.3

−1.6

−3

−0.11

−0.073

−0.67

−8.8

−0.29 −0.0037

−5 −2.5 −4.8 −6.8 −7.1

0.4

−2

0

0.65

0.89 0.46

1.6 0.95

0.75 0.76

1

0.41

2.2

0.41

1.62.2

3.6 2.4

2

−2

0.5 0.0

1.6

−1

0

1

log10 alpha

Figure 4: Log Bayes factors for varying hyperparameter prior settings α and β in ν ∼ Gamma(α, β). The x-axis (β) and y-axis are both shown in log10 space. A predictive surface from hetTP is overlayed via mean (left) and standard deviation (right). The contours come from Kass and Raftery (1995). surface is indeed changing over the input space, justifying the heteroskedastic surrogate.

4

Discussion

The instability of Bayes factors as a function of prior choices on parameters has been commented on in the statistical literature that develops Bayesian model comparison and selection. While the issue has been frequently mentioned, less attention has been spent developing approaches to help researchers understand the extent of sensitivity in their own specific analyses. The two-fold purpose of this article has been to educate the broader statistical community about sensitivity in Bayes factors through interactive visualization, and to propose surrogate modeling in the form of computer surrogate models to examine the Bayes factor surface in cases where computing Bayes factors is too expensive to accomplish extensively. We hope these contributions will provide researchers the insight and tools

16

1.0

1

2 0.49

2.1

0.94

−3

log10 alpha

1.2

1.3

1.5

1.7

1.2 0.59 1.2

1.1

2.0

0.97

0.86 0.89

0.92 0.74

2.5 0.89

0.22

3.0

0.4

0.59

0.96

0.84

0.63

1.1 0.26 0.71 0.84

−3

2

1.5

1.1

0.57

−10

0.78

0.44

0.93

−9

−1

0.35 0.65 0.24

0.55

−2

0.32 0.16

0.35

0.39

0.33

−1

1

0.43 0.35

0.53 0.89 0.23

0.94 0.54

0.4

−9.9

0.98

−0.94

−3

−4.4

−1.2−2.1 −1.7

−1.4 −0.73

0

0.39

−0.72

0.47 0.49

0

0.26 0.39

0

5

−5

1.5

0.39 0.42

−2.2

1.5

0.73 0.84

0.5 0.33 0.72

0.96

0.65

1.3

2.2

1.9

0.91 2.5

4.7

−3

log10 beta

3.74.2

33.3

4.3 4.1

4.6

4.2 4.6 2.8

1

10

9.1

7.58.5

7.4 5

0

12 9

1314

15

log10 beta

1

std log BF

necessary to asses sensitivity in Bayes factors for their own specific research interests. Much has been said and written about the downside of the near ubiquitous use of pvalues for hypothesis testing across many branches of science. While it is unclear whether Bayesian model selection can or should ultimately supplant use of p-values on a broad scale, illustrating these sensitivities helps researchers use Bayes factors and Bayesian model selection responsibly. Further, we hope availability of Bayes factor surface plots will help journal reviewers assess these sensitivities and prevent statistical malpractice (intentional or otherwise) through exploitation of these sensitivities. Many of the concerns about p-values relate specifically to the practice of setting hard thresholds to declare results statistically significant or not (see, e.g., Lakens et al., 2018; Benjamin et al., 2018). Similar thresholding of Bayesian metrics surely will not serve as a panacea to these concerns. Whereas research papers used to include visuals of prior to posterior updating at the parameter level, and trace plots from MCMC chains to indicate good mixing, these are now pass´e. If they are not cut entirely, they are frequently relegated to an appendix. The extra transparency such visuals provide are apparently no longer of value to the literature, perhaps because audiences have matured and readers/referees are now better able to sniff out troublesome modeling and inference choices through other, less direct, means. Yet a similar setup is sorely needed for model selection. Helpful visuals are not well-established and, as we show in our examples, it is easy to gain a false sense of security from the outcome of an experiment for which further analysis (e.g., visualization of a Bayes factor surface) reveals was actually balanced on a knife’s edge. Studies involving Bayesian model selection desperately need greater transparency. We acknowledge that when many parameters are being tested simultaneously, it will be harder to visualize a single Bayes factor surface and multiple plots might be necessary. Individual surfaces can be helpful in facilitating lower-dimensional slices and projections. However, the quality of that fitted surrogate will depend on the input design, with spacefilling becoming harder (as a function of a fixed design size) as the input dimension increases. An attractive option here is sequential design. Binois et al. (2018) show how an integrated mean-square prediction error criteria can be used to dynamically select design sites and degrees of replication in order for focus sampling on parts of the input space where the

17

signal (e.g., in BFStN evaluations via (10), say) is harder to extract from the noise. Treating the Bayes factor surface surrogate as a sequential experiment in that context represents a promising avenue for further research. While this may seem potentially burdensome, we believe that the availability of Bayes factor surfaces will remain valuable. We advise researchers to prioritize surfaces that address parameters that differ between the hypotheses. Typically, hyperparameters for quantities common to both models affect Bayes factors to a much lesser degree than hyperparameters on the quantities being tested. We encourage users to use the accompanying R shiny application to explore the comparatively large impact of changing φ and µ (both related to the parameter β that is being tested) compared with the smaller influence a and b have, as these latter hypeparameters govern the error precision which is common to both models.

References Andersen, T. (1957, June). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J. of the American Statistical Association 52, 200–203. Bartlett, M. S. (1957). A comment on D. V. Lindley’s statistical paradox. Biometrika 44 (34), 533–534. Benjamin, D. J., J. O. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. D. Chambers, M. Clyde, T. D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A. P. Field, M. Forster, E. I. George, R. Gonzalez, S. Goodman, E. Green, D. P. Green, A. G. Greenwald, J. D. Hadfield, L. V. Hedges, L. Held, T. Hua Ho, H. Hoijtink, D. J. Hruschka, K. Imai, G. Imbens, J. P. A. Ioannidis, M. Jeon, J. H. Jones, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S. E. Maxwell, M. McCarthy, D. A. Moore, S. L. Morgan, M. Munaf, S. Nakagawa, B. Nyhan, T. H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. D. Schnbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. J. Watts, C. Winship, R. L.

18

Wolpert, Y. Xie, C. Young, J. Zinman, and V. E. Johnson (2018, January). Redefine statistical significance. Nature Human Behaviour 2 (1), 6–10. Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, second edition. Springer. Berger, J. O. and L. R. Pericchi (1996). The intrinsic bayes factor for model selection and prediction. Journal of the American Statistical Association 91 (433), 109–122. Binois, M. and R. B. Gramacy (2018). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.3. Binois, M., R. B. Gramacy, and M. Ludkovski (2018). Practical heteroskedastic Gaussian process modeling for large simulation experiments. Journal of Computational and Graphical Statistics to appear. arXiv preprint arXiv:1611.05902. Binois, M., J. Huang, R. B. Gramacy, and M. Ludkovski (2018). Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics to appear. arXiv preprint arXiv:1710.03206. Boos, D. D. and L. A. Stefanski (2011). P-value precision and reproducibility. The American Statistician 65 (4), 213–221. Chang, W., J. Cheng, J. Allaire, Y. Xie, and J. McPherson (2017). shiny: Web Application Framework for R. R package version 1.0.5. Chipman, H., E. I. George, and R. E. McCulloch (2001). The Practical Implementation of Bayesian Model Selection, Volume Volume 38 of Lecture Notes–Monograph Series, pp. 65–116. Beachwood, OH: Institute of Mathematical Statistics. Chung, M., R. B. Gramacy, M. Binois, D. Moquin, A. Smith, and A. Smith (2018). Parameter and uncertainty estimation for dynamical systems using surrogate stochastic processes. Technical report, Virginia Tech. preprint on arXiv:1802.00852. Fonseca, T. C. O., M. A. R. Ferreira, and H. S. Migon (2008). Objective bayesian analysis for the student-t regression model. Biometrika 95 (2), 325–333. 19

Geweke, J. (1992). Priors for microeconomic times series and their application. Technical Report Institute of Empirical Macroeconomics Discussion Paper No.64, Federal Reserve Bank of Minneapolis. Geweke, J. (1993, Dec). Bayesian treatment of the independent student–t linear model. J. of Applied Econometrics Vol. 8, Supplement: Special Issue on Econometric Inference Using Simulation Techniques, S19–S40. Gramacy, R. B. (2017). monomvn: Estimation for Multivariate Normal and Student-t Data with Monotone Missingness. R package version 1.9-7. Gramacy, R. B. and E. Pantaleo (2010, 06). Shrinkage regression for multivariate inference with missing data, and an application to portfolio balancing. Bayesian Anal. 5 (2), 237– 262. Ioannidis, J. P. A. (2005, 08). Why most published research findings are false. PLOS Medicine 2 (8). Jacquier, E., N. Polson, and P. E. Rossi (2004). Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. J. of Econometrics 122, 185–212. Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Mathematical Proceedings of the Cambridge Philosophical Society 31 (2), 203222. Jeffreys, H. (1961). The Theory of Probability, Volume 2 of Oxford Classic Texts in the Physical Sciences. Oxford University Press. Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90 (430), 773–795. Lakens, D., F. G. Adolfi, C. J. Albers, F. Anvari, M. A. J. Apps, S. E. Argamon, T. Baguley, R. B. Becker, S. D. Benning, D. E. Bradford, E. M. Buchanan, A. R. Caldwell, B. Van Calster, R. Carlsson, S.-C. Chen, B. Chung, L. J. Colling, G. S. Collins, Z. Crook, E. S. Cross, S. Daniels, H. Danielsson, L. DeBruine, D. J. Dunleavy, B. D. Earp, M. I. Feist, J. D. Ferrell, J. G. Field, N. W. Fox, A. Friesen, C. Gomes, M. Gonzalez-Marquez,

20

J. A. Grange, A. P. Grieve, R. Guggenberger, J. Grist, A.-L. van Harmelen, F. Hasselman, K. D. Hochard, M. R. Hoffarth, N. P. Holmes, M. Ingre, P. M. Isager, H. K. Isotalus, C. Johansson, K. Juszczyk, D. A. Kenny, A. A. Khalil, B. Konat, J. Lao, E. G. Larsen, G. M. A. Lodder, J. Lukavsk, C. R. Madan, D. Manheim, S. R. Martin, A. E. Martin, D. G. Mayo, R. J. McCarthy, K. McConway, C. McFarland, A. Q. X. Nio, G. Nilsonne, C. L. de Oliveira, J.-J. O. de Xivry, S. Parsons, G. Pfuhl, K. A. Quinn, J. J. Sakon, S. A. Saribay, I. K. Schneider, M. Selvaraju, Z. Sjoerds, S. G. Smith, T. Smits, J. R. Spies, V. Sreekumar, C. N. Steltenpohl, N. Stenhouse, W. witkowski, M. A. Vadillo, M. A. L. M. Van Assen, M. N. Williams, S. E. Williams, D. R. Williams, T. Yarkoni, I. Ziano, and R. A. Zwaan (2018, March). Justify your alpha. Nature Human Behaviour 2 (3), 168–171. Lavine, M. and M. J. Schervish (1999). Bayes factors: What they are and what they are not. The American Statistician 53 (2), 119–122. Liang, F., R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103 (481), 410–423. Lindley, D. V. (1957). A statistical paradox. Biometrika 44 (1/2), 187–192. Markowitz, H. (1959). Portfolio Selection: Efficient Diversification of Investments. New York: John Wiley. McKay, M. D., W. J. Conover, and R. J. Beckman (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245. O’Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society. Series B (Methodological) 57 (1), 99–138. Ormerod, J. T., M. Stewart, W. Yu, and S. E. Romanes (2017). Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too? arXiv:1710.09146 .

21

arXiv preprint

Santner, T. J., B. J. Williams, and W. I. Notz (2003). The Design and Analysis of Computer Experiments. New York, NY: Springer-Verlag. Schwarz, G. (1978, 03). Estimating the dimension of a model. Ann. Statist. 6 (2), 461–464. Stambaugh, R. F. (1997). Analyzing investments whose histories differ in lengh. J. of Financial Economics 45, 285–331. Trafimow, D. and M. Marks (2015). Editorial. Basic and Applied Social Psychology 37 (1), 1–2. Wasserstein, R. L., N. A. Lazar, et al. (2016). The asas statement on p-values: context, process, and purpose. The American Statistician 70 (2), 129–133. Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018). Using stacking to average bayesian predictive distributions. Advance publication. Young, S. S. and A. Karr (2011). Deming, data and observational studies. Significance 8 (3), 116–120. Zellner, A. (1986). On assessing prior distributions and bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, Chapter 15, pp. 233–243. New York: Elsevier Science Publishers, Inc. Zellner, A. and A. Siow (1980, Feb). Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa 31 (1), 585–603.

22

5

Appendix

The Bayesian model for the mixture g-prior comes from Liang et al. (2008): 1 yi |α, β, γ, Mk , ∼ N α + βxi , γ g β|γ, g ∼ N 0, Pn 2 γ i=1 xi 1 p α, γ = φ 1 n , g ∼ IG 2 2

iid

The marginal likelihood can be obtained by analytically integrating α, β, and γ, then approximating the integral over g. As in the original paper, we use Laplace approximation for this last integral. Fractional Bayes factors: Let m denote a training sample size, n the sample size, and b =

m . n

The likelihood raised to b approximates the likelihood of a training sample, a

fact used to form fractional Bayes factors. Fractional Bayes factors are appealing since there is no need to choose an arbitrary training sample or spend computational resources averaging over some or all possible training samples to obtain an intrinsic Bayes factor (Berger and Pericchi, 1996). The prior distribution for the fractional Bayes factor analysis is the reference prior: p(α, β, γ) =

c γ

where c is a constant. This prior is improper, though it yields a proper posterior distribution for inference on parameters and is eligible for model selection due to the availability of fractional Bayes factors. The minimal training sample in this study is m = 3. Further details about fractional Bayes factors are left to the literature.

23