Bayesian Inference in Nonparametric Dynamic State-Space Models

Anurag Ghosh, Soumalya Mukhopadhyay, Sandipan Roy and Sourabh Bhattacharya∗
arXiv:1108.3262v5 [stat.ME] 21 Feb 2014
Abstract: We introduce state-space models where the functionals of the observational and the evolutionary equations are unknown and treated as random functions evolving with time. Thus, our model is nonparametric and generalizes the traditional parametric state-space models. This random-function approach also frees us from the restrictive assumption that the functional forms, although time-dependent, are of fixed forms. The traditional approach of assuming known, parametric functional forms is questionable, particularly in state-space models, since validating the assumptions requires data on both the observed time series and the latent states; however, data on the latter are not available in state-space models. We specify Gaussian processes as priors of the random functions and exploit the "lookup table approach" of Bhattacharya (2007) to efficiently handle the dynamic structure of the model. We consider both univariate and multivariate situations, using the Markov chain Monte Carlo (MCMC) approach for studying the posterior distributions of interest. We illustrate our methods with simulated data sets, in both univariate and multivariate situations. Moreover, using our Gaussian process approach we analyse a real data set, which has also been analysed by Shumway & Stoffer (1982) and Carlin, Polson & Stoffer (1992) using the linearity assumption. Interestingly, our analyses indicate that towards the end of the time series, the linearity assumption is perhaps questionable. ∗
Anurag Ghosh is a PhD student in the Department of Statistical Science, Duke University; Soumalya Mukhopadhyay is a PhD student in the Agricultural and Ecological Research Unit, Indian Statistical Institute; Sandipan Roy is a PhD student in the Department of Statistics, University of Michigan, Ann Arbor; and Sourabh Bhattacharya is an Assistant Professor in the Bayesian and Interdisciplinary Research Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108. Corresponding email: [email protected]
Keywords: Evolutionary equation; Gaussian process; Lookup table; Markov chain Monte Carlo; Observational equation; State-space model.
1. INTRODUCTION
State-space models play an important role in dealing with dynamic systems that arise in various disciplines, such as finance, engineering, ecology, medicine, and statistics. The time-varying regression structure and the flexibility inherent in the sequential nature of state-space models make them very suitable for the analysis and prediction of dynamic data. Indeed, as is well known, most time series models of interest are expressible as state-space models; see Durbin & Koopman (2001) and Shumway & Stoffer (2011) for details. However, to date, state-space models have considered only known forms of the equations, typically linear. But testing the parametric assumptions requires data on both the observed time series and the unobserved states; unfortunately, data on the latter are not available in state-space models. Moreover, the regression structures of state-space models may evolve with time, changing from linear to nonlinear, and even the nonlinear structure may itself evolve with time, yielding further different nonlinear structures. We are not aware of any nonparametric state-space approach in the statistical literature that can handle unknown functional forms, which may or may not be evolving with time. Another criticism of the existing state-space models is the assumption that the (unobserved) states satisfy the Markov property. Although such Markov models have been useful in many situations where there are natural laws supporting such conditional independence, in general such an assumption is not expected to hold. These arguments point towards the need for developing general, nonparametric approaches to state-space models, and this, indeed, is our aim in this article. We adopt the Bayesian paradigm for its inherent flexibility.
In a nutshell, in this work, adopting a nonparametric Bayesian framework, we treat the regression structures as unknown, model them as Gaussian processes, and develop the consequent theory in the Bayesian framework, considering both univariate and multivariate situations. Our Gaussian process approach to viewing the unknown functional forms allows very flexible modeling of the unknown structures, even though they might evolve with time. Also, as we discuss in Section 4.7, as a consequence of our nonparametric approach, the unobserved state variables do not follow any Markov model. Thus our approach provides a realistic dependence structure between the state variables. We also develop efficient MCMC-based methods for simulating from the resulting posterior distributions. We demonstrate our methods in both univariate and multivariate situations using simulated data. Application of our ideas to a real data set, which has been analysed by Shumway & Stoffer (1982) and Carlin et al. (1992) assuming linearity, provided an interesting insight: although the linearity assumption may not be unreasonable for most of the time series, the assumption may be called into question towards the end of the time series. This vindicates that our approach is indeed capable of modeling unknown functions even if the forms change with time, without requiring any change point analysis or specification of functional forms before and after change points. Before introducing our approach, we provide a brief overview of state-space models.

2. OVERVIEW OF STATE-SPACE MODELS
Generally, state-space models are of the following form: for $t = 1, 2, \ldots$,
$$y_t = f_t(x_t) + \epsilon_t, \tag{1}$$
$$x_t = g_t(x_{t-1}) + \eta_t. \tag{2}$$
In the above, $f_t$ and $g_t$ are assumed to be functions of known forms which may or may not explicitly depend upon $t$; $\eta_t$, $\epsilon_t$ are usually assumed to be zero-mean iid normal variates. The choices $f_t(x_t) = F_t x_t$ and $g_t(x_{t-1}) = G_t x_{t-1}$, assuming known $F_t$, $G_t$, have found very wide use in the literature. Obviously, $x_t$, $y_t$ may be univariate or multivariate. Matrix-variate dynamic linear models have been considered by Quintana & West (1987) and West & Harrison (1997) (see also Carvalho & West (2007)). Equation (1) is called the observational equation, while (2) is known as the evolutionary equation. Letting $D_T = (y_1, \ldots, y_T)'$ denote the available data, the goal is to obtain inferences about $y_{T+1}$ (single-step forecast), $y_{T+k}$ ($k$-step forecast), $x_{T+1}$ conditional on $y_{T+1}$ (filtering), and $x_{T-k}$ (retrospection). In the Bayesian paradigm, interest centers upon analyzing the corresponding posteriors $[y_{T+1} \mid D_T]$, $[y_{T+k} \mid D_T]$, $[x_{T+1} \mid D_T, y_{T+1}]$ (also $[x_{T+1} \mid D_T]$) and $[x_{T-k} \mid D_T]$. In the non-Bayesian framework, solutions to dynamic systems are quite generally available via the well-known Kalman filter. However, the performance of the Kalman filter depends heavily on the assumptions of Gaussian errors and linearity of the functions in the observational and evolutionary equations. In the case of nonlinear dynamic models, various linearization techniques are used to obtain approximate solutions. For details on these issues, see Brockwell & Davis (1987), Meinhold & Singpurwalla (1989), West & Harrison (1997) and the references therein. The Bayesian paradigm frees the investigator from the restrictions of linear functions or Gaussian errors, and allows very general dynamic model building through coherent combination of the prior and the available time series data, using Markov chain Monte Carlo (MCMC) for inference. Bayesian nonlinear dynamic models with non-Gaussian errors, in conjunction with the Gibbs sampling approach for inference, have been considered in Carlin et al. (1992). For general details on nonlinear and non-Gaussian approaches to state-space models, see Durbin & Koopman (2001). However, even nonlinear and non-Gaussian approaches assume that there is an underlying known natural phenomenon supporting some standard parametric model. Except in well-studied scientific contexts, such assumptions are not unquestionable. In this work, we particularly concern ourselves with situations where parametric models are not established for the underlying scientific study. A case in point is climate change dynamics, where the observed climate depends upon various factors in the form of latent states, but, to our knowledge, no clear parametric model is available for this extremely important and challenging problem. In medicine, the growth of cancer cells at any time point may depend upon various unobserved factors (states), but again no clear parametric model is available, to our knowledge.
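As a concrete reference point, the widely used linear Gaussian choice $f_t(x_t) = F_t x_t$, $g_t(x_{t-1}) = G_t x_{t-1}$ can be simulated in a few lines. The sketch below uses illustrative scalar constants of our own choosing (not values from any data set analysed in this paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the classical linear Gaussian state-space model
#   y_t = F * x_t + eps_t,   x_t = G * x_{t-1} + eta_t,
# with assumed (illustrative) constants F, G and error standard deviations.
T, F, G = 100, 1.0, 0.9
sd_eps, sd_eta = 0.5, 0.3

x = np.empty(T + 1)
y = np.empty(T + 1)
x[0] = rng.normal(0.0, 1.0)      # x_0 drawn from its prior
y[0] = np.nan                    # no observation at t = 0
for t in range(1, T + 1):
    x[t] = G * x[t - 1] + rng.normal(0.0, sd_eta)   # evolutionary equation (2)
    y[t] = F * x[t] + rng.normal(0.0, sd_eps)       # observational equation (1)
```

In the state-space setting only the $y_t$ would be observed; the simulated $x_t$ play the role of the latent states about which parametric assumptions cannot be checked directly.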
Similar challenges exist in sociology, in studies of dynamic social networks; in astrophysics, associated with the study of the evolution of the universe; in computer science, for target tracking and data-driven computer animation; and in various other fields. Thus, we expect our nonparametric approach to be quite relevant and useful for these investigations. For the sake of clarity, in this main article we consider only one-dimensional $x_t$ and $y_t$; we provide additional details of the univariate cases, and generalize our approach to accommodate multivariate situations, in the supplement Ghosh, Mukhopadhyay, Roy & Bhattacharya (2013), whose sections, figures and tables have the prefix "S" when referred to in this paper. A brief description of the contents of the supplement can be found at the end of this article. In Section 3 we introduce our novel nonparametric dynamic model, where we use Gaussian processes to model the unknown functions $f_t$ and $g_t$. Assuming the functions to be random allows us to accommodate even those functions whose forms change with time. In order to describe the distribution of the unobserved states, we adopt the "lookup table" idea of Bhattacharya (2007). Since this idea is an integral part of the development of our methodology, we devote Section 4 to its detailed discussion. In Section 5 we provide the forms of the prior distributions of the hyperparameters of our model and build an MCMC-based methodology for Bayesian inference. In Section 6 we include a brief discussion of two simulation studies, the details of which are reported in Section S3 of the supplement. In Section 7 we consider application to a real, univariate data set. We present a summary of the current work, along with discussion of further work, in Section 8.

3. NONPARAMETRIC DYNAMIC MODEL: UNIVARIATE CASE
In this section we model the unknown observational and evolutionary functions using Gaussian processes, assuming that the true functions evolve with time $t$; our approach includes the time-invariant situation as a simple special case; see Section S3.2. In (1) and (2) we now assume $f_t$ and $g_t$ to be of unknown functional forms varying with time $t$. For convenience, we denote $f_t(x_t)$ and $g_t(x_{t-1})$ as $f(t, x_t)$ and $g(t, x_{t-1})$, respectively. That is, we treat time $t$ as an input to both the observational and evolutionary functions, in addition to the other relevant inputs $x_t$ and $x_{t-1}$. With this understanding, we rewrite (1) and (2) as
$$y_t = f(t, x_t) + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2), \tag{3}$$
$$x_t = g(t, x_{t-1}) + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2). \tag{4}$$
We assume that $x_0 \sim N(\mu_{x_0}, \sigma_{x_0}^2)$, with $\mu_{x_0}$, $\sigma_{x_0}^2$ known. Crucially, we allow $f(\cdot,\cdot)$ and $g(\cdot,\cdot)$ to be of unknown functional forms, which we model as two independent Gaussian processes. To present the details of the Gaussian processes, we find it convenient to use the notation $x^*_{t,t} = (t, x_t)'$ and $x^*_{t,t-1} = (t, x_{t-1})'$, so that (3) and (4) can be rewritten as
$$y_t = f(x^*_{t,t}) + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma_\epsilon^2), \tag{5}$$
$$x_t = g(x^*_{t,t-1}) + \eta_t, \quad \eta_t \sim N(0, \sigma_\eta^2); \tag{6}$$
in fact, more generally, we use the notation $x^*_{t,u} = (t, x_u)'$. This general notation will be convenient for describing theoretical and computational details. Next, we provide details of the independent Gaussian processes used to model $f$ and $g$.

3.1 Modeling the unknown observational and evolutionary time-varying functions using independent Gaussian processes
The functions $f$ and $g$ are modeled as independent Gaussian processes with mean functions $\mu_f(\cdot) = h(\cdot)'\beta_f$ and $\mu_g(\cdot) = h(\cdot)'\beta_g$, where $h(x^*) = (1, x^{*\prime})'$ for any $x^*$, and covariance functions of the forms $\sigma_f^2 c_f(\cdot,\cdot)$ and $\sigma_g^2 c_g(\cdot,\cdot)$, respectively. The process variances are $\sigma_f^2$ and $\sigma_g^2$, and $c_f$, $c_g$ are the correlation functions. Typically, for any $z_1$, $z_2$, $c_f(z_1, z_2) = \exp\{-(z_1 - z_2)' R_f (z_1 - z_2)\}$ and $c_g(z_1, z_2) = \exp\{-(z_1 - z_2)' R_g (z_1 - z_2)\}$, where $R_f$ and $R_g$ are $2 \times 2$ diagonal matrices consisting of the respective smoothness (or roughness) parameters $\{r_{1,f}, r_{2,f}\}$ and $\{r_{1,g}, r_{2,g}\}$, which govern the smoothness of the process realizations. These choices of the correlation functions imply that the functions modeled by the process realizations are infinitely smooth. The sets of parameters $\theta_f = (\beta_f, \sigma_f^2, R_f)$ and $\theta_g = (\beta_g, \sigma_g^2, R_g)$ are assumed to be independent a priori. We consider the following form of the prior distribution of the parameters:
$$[\beta_f, \sigma_f^2, R_f, \beta_g, \sigma_g^2, R_g, \sigma_\epsilon^2, \sigma_\eta^2] = [\beta_f, \sigma_f^2, R_f][\beta_g, \sigma_g^2, R_g][\sigma_\epsilon^2, \sigma_\eta^2].$$
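The correlation structure above is straightforward to compute. The following sketch evaluates $c(z_1, z_2) = \exp\{-(z_1 - z_2)' R (z_1 - z_2)\}$ for inputs of the form $x^*_{t,t} = (t, x_t)'$ and a diagonal $R$; the smoothness values and inputs are hypothetical placeholders:

```python
import numpy as np

def corr(z1, z2, r):
    """Squared-exponential correlation exp{-(z1 - z2)' R (z1 - z2)},
    with R = diag(r); r collects the smoothness parameters r_1, r_2."""
    d = np.asarray(z1, float) - np.asarray(z2, float)
    return float(np.exp(-d @ (np.asarray(r, float) * d)))

def corr_matrix(inputs, r):
    """Correlation matrix over a list of inputs z = (t, x)'."""
    n = len(inputs)
    return np.array([[corr(inputs[i], inputs[j], r)
                      for j in range(n)] for i in range(n)])

# Hypothetical smoothness parameters and inputs (t, x_t):
r = np.array([0.1, 1.0])
Z = [(1.0, 0.2), (2.0, -0.5), (3.0, 0.1)]
A = corr_matrix(Z, r)
```

Larger smoothness parameters make the correlation decay faster in the corresponding input coordinate, producing rougher process realizations.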
3.2 Hierarchical structure induced by our Gaussian process approach
In summary, our approach can be described in the following hierarchical form:
$$[y_t \mid f, \theta_f, x_t] \sim N\big(f(x^*_{t,t}), \sigma_\epsilon^2\big); \quad t = 1, \ldots, T, \tag{7}$$
$$[x_t \mid g, \theta_g, x_{t-1}] \sim N\big(g(x^*_{t,t-1}), \sigma_\eta^2\big); \quad t = 1, \ldots, T, \tag{8}$$
$$[x_0] \sim N\big(\mu_{x_0}, \sigma_{x_0}^2\big), \tag{9}$$
$$[f(\cdot) \mid \theta_f] \sim GP\big(h(\cdot)'\beta_f, \sigma_f^2 c_f(\cdot,\cdot)\big), \tag{10}$$
$$[g(\cdot) \mid \theta_g] \sim GP\big(h(\cdot)'\beta_g, \sigma_g^2 c_g(\cdot,\cdot)\big), \tag{11}$$
$$[\beta_f, \sigma_f^2, R_f, \beta_g, \sigma_g^2, R_g, \sigma_\epsilon^2, \sigma_\eta^2] = [\beta_f, \sigma_f^2, R_f][\beta_g, \sigma_g^2, R_g][\sigma_\epsilon^2, \sigma_\eta^2]. \tag{12}$$
In the above, GP stands for "Gaussian process". The forms of the prior distributions in (12) are provided in Section 5, and specific details are provided in the relevant applications.

3.3 Conditional distribution of the observed data induced by the Gaussian process prior on the observational function, and a brief discussion of the difficulty of obtaining the joint distribution of the state variables
It follows from the Gaussian process prior on the unknown observational function $f$ that the distribution of $D_T$, conditional on $x^*_{1,1}, \ldots, x^*_{T,T}$ (equivalently, conditional on $x_1, \ldots, x_T$) and the other parameters, is multivariate normal:
$$D_T \sim N_T\big(H_{D_T}\beta_f,\; \sigma_f^2 A_{f,D_T} + \sigma_\epsilon^2 I_T\big), \tag{13}$$
where $H'_{D_T} = \big(h(x^*_{1,1}), \ldots, h(x^*_{T,T})\big)$, $A_{f,D_T}$ is a $T \times T$ matrix with $(i,j)$-th element $c_f(x^*_{i,i}, x^*_{j,j})$; $i, j = 1, \ldots, T$, and $I_T$ is the identity matrix of order $T$.
The joint distribution of the state variables $(x_1, \ldots, x_T)$, however, is much less straightforward. Observe that, although $[x_0] \sim N(\mu_{x_0}, \sigma_{x_0}^2)$ and $[x_1 \mid x_0] \sim N(h(x^*_{1,0})'\beta_g, \sigma_g^2 + \sigma_\eta^2)$, we have $[x_2 \mid x_1, x_0] = [g(2, x_1) + \eta_2 \mid x_1, x_0] = [g(2, g(1, x_0) + \eta_1) + \eta_2 \mid g(1, x_0) + \eta_1, x_0]$, the rightmost expression suggesting that special techniques may be necessary to get hold of the conditional distribution. We adopt the procedure introduced by Bhattacharya (2007) to deal with this problem. The idea is to conceptually simulate the entire function $g$ modeled by the Gaussian process, and use the simulated process as a lookup table to obtain the conditional distributions of $\{x_i; i \geq 2\}$. The intuition behind the lookup table concept is briefly discussed in the next subsection, while the detailed procedure of approximating the joint distribution of the state variables is provided in Section 4.

3.4 Intuition behind the lookup table idea for approximating the joint distribution of the state variables
For the purpose of illustration only, let us assume that $\eta_t = 0$ for all $t$, yielding the model $x_t = g(x^*_{t,t-1})$. The concept of a lookup table in this problem can be briefly explained as follows. Let us first assume that the entire process $g(\cdot)$ is available. This means that for every input $z$, the corresponding $g(z)$ is available, thus constituting a lookup table, with the first column representing $z$ and the second column representing the corresponding $g(z)$. Conditional on $x^*_{t,t-1}$ (equivalently, conditional on $x_{t-1}$), $x_t = g(x^*_{t,t-1})$ can be obtained by simply picking the input $x^*_{t,t-1}$ from the first column of the lookup table and reporting the corresponding output value $g(x^*_{t,t-1})$, located in the second column. This hypothetical lookup table concept suggests that, conditional on the simulated process $g$, it can be safely assumed that $x_t$ depends only upon $x^*_{t,t-1}$ via $x_t = g(x^*_{t,t-1})$. Thus, if a simulation of the entire random function $g$, following the Gaussian process, is available for all possible inputs, then for any input $x^*_{t,t-1}$ we only need to identify the corresponding $g(x^*_{t,t-1})$ in the lookup table. In practice, we can have a simulation of the Gaussian process $g$ on a fine enough grid of inputs. Given this simulation on a fine grid, we can simulate from the conditional distribution of $g(x^*_{t,t-1})$, fixing $x^*_{t,t-1}$ as given. This simulation from the conditional distribution of $g(x^*_{t,t-1})$ will approximate $x_t$ as accurately as we desire by making the grid as fine as required. By repeating this procedure for each $t$, we can approximate the joint distribution of the state variables as closely as we desire. In the next section we provide details regarding this approach.
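To make the intuition concrete, the sketch below draws one realization of a zero-mean Gaussian process on a fine one-dimensional grid and then treats the pair (grid, draw) as the lookup table, reporting the value at the grid point nearest to any queried input. The grid range and correlation choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw of g on a fine grid: a finite stand-in for the hypothetical
# lookup table (first column: inputs; second column: g values).
grid = np.linspace(0.0, 5.0, 401)
A = np.exp(-(grid[:, None] - grid[None, :]) ** 2)   # correlation matrix
A += 1e-8 * np.eye(grid.size)                       # jitter for stability
g_table = rng.multivariate_normal(np.zeros(grid.size), A)

def lookup_g(z):
    """Report the tabulated g value at the grid point nearest to z."""
    return g_table[np.argmin(np.abs(grid - z))]
```

A finer grid makes `lookup_g` an arbitrarily accurate approximation of the drawn function; the conditional distribution given the grid values, developed in Section 4, plays exactly this role while also interpolating between grid points.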
4. DETAILED PROCEDURE OF APPROXIMATING THE JOINT DISTRIBUTION OF THE STATE VARIABLES USING THE LOOKUP TABLE CONCEPT

4.1 Distribution of $x_1$
Note that, given $x_0$, we can simulate $x_1 = g(x^*_{1,0}) \sim N(h(x^*_{1,0})'\beta_g, \sigma_g^2)$, which is the marginal distribution under the Gaussian process prior. Thus, $x_1$ is simulated without resorting to any approximation. It then remains to simulate the rest of the dynamic sequence, for which we need to simulate the rest of the process $\{g(x^*);\, x^* \neq x^*_{1,0}\}$.

4.2 Introduction of a set of auxiliary variables to act as proxy for the Gaussian process g
In practice, it is not possible to have a simulation of the entire set $\{g(x^*);\, x^* \neq x^*_{1,0}\}$. We only have available a set of grid points $G_n = \{z_1, \ldots, z_n\}$, obtained, perhaps, by Latin hypercube sampling (see, for example, Santner, Williams & Notz (2003)), and a corresponding simulation of $g$, given by $D^*_n = \{g(z_1), \ldots, g(z_n)\}$, the latter having a joint multivariate normal distribution with mean
$$E[D^*_n \mid \theta_g] = H_{D^*_n}\beta_g \tag{14}$$
and covariance matrix
$$V[D^*_n \mid \theta_g] = \sigma_g^2 A_{g,D^*_n}, \tag{15}$$
where $H'_{D^*_n} = [h(z_1), \ldots, h(z_n)]$, $A_{g,D^*_n}$ is the correlation matrix with $(i,j)$-th element $c_g(z_i, z_j)$, and $s_{g,D^*_n}(\cdot) = (c_g(\cdot, z_1), \ldots, c_g(\cdot, z_n))'$.

Given $(x_0, g(x^*_{1,0}))$, we simulate $D^*_n$ from $[D^*_n \mid \theta_g, g(x^*_{1,0}), x_0]$. Since the joint distribution of $(D^*_n, g(x^*_{1,0}))$ given $x_0$ is multivariate normal with mean vector $\begin{pmatrix} H_{D^*_n}\beta_g \\ h(x^*_{1,0})'\beta_g \end{pmatrix}$ and covariance matrix $\sigma_g^2 A_{D^*_n, g(x^*_{1,0})}$, where
$$A_{D^*_n, g(x^*_{1,0})} = \begin{pmatrix} A_{g,D^*_n} & s_{g,D^*_n}(x^*_{1,0}) \\ s_{g,D^*_n}(x^*_{1,0})' & 1 \end{pmatrix}, \tag{16}$$
it follows that the conditional $[D^*_n \mid g(x^*_{1,0}), x_0, \theta_g]$ has an $n$-variate normal distribution with mean vector
$$E[D^*_n \mid g(x^*_{1,0}), x_0, \theta_g] = \mu_{g,D^*_n} = H_{D^*_n}\beta_g + s_{g,D^*_n}(x^*_{1,0})\big(g(x^*_{1,0}) - h(x^*_{1,0})'\beta_g\big) \tag{17}$$
and covariance matrix
$$V[D^*_n \mid g(x^*_{1,0}), x_0, \theta_g] = \sigma_g^2 \Sigma_{g,D^*_n}, \tag{18}$$
where
$$\Sigma_{g,D^*_n} = A_{g,D^*_n} - s_{g,D^*_n}(x^*_{1,0})\, s_{g,D^*_n}(x^*_{1,0})'. \tag{19}$$
4.3 Distribution of each state variable conditional on the lookup table proxy $D^*_n$
We now seek the conditional distribution $[x_t = g(x^*_{t,t-1}) \mid D^*_n, x_{t-1}, x_{t-2}, \ldots, x_1]$. To notationally distinguish the conditional distribution of $g(\cdot)$ from the elements of the set $D^*_n$, we henceforth denote the elements of $D^*_n$ by $g_{true}(\cdot)$; in other words, we henceforth write $D^*_n = \{g_{true}(z_1), \ldots, g_{true}(z_n)\}$. Recall that the lookup table idea supports conditional independence; that is, given a simulation of the entire random function $g$, $x_t$ depends only upon $x_{t-1}$ via $x_t = g(x^*_{t,t-1})$, so that, given $g$, $x_t$ is conditionally independent of $\{x_k; k < t - 1\}$. Indeed, given a fine enough grid $G_n$, $D^*_n$ approximates the random function $g$, which contains all information regarding the conditioned state variables $x_1, \ldots, x_{t-1}$. Hence,
$$[g(x^*_{t,t-1}) \mid D^*_n, x_{t-1}, x_{t-2}, \ldots, x_1] \approx [g(x^*_{t,t-1}) \mid D^*_n, x_{t-1}], \tag{20}$$
and this approximation can be made arbitrarily accurate by making the grid $G_n$ as fine as desired. Hence, it is sufficient for our purpose to deal with the conditional distribution $[g(x^*_{t,t-1}) \mid D^*_n, x_{t-1}]$. This is easy to obtain: since, given $x_{t-1}$, $(g(x^*_{t,t-1}), D^*_n)$ is jointly multivariate normal, it is easily seen that $[g(x^*_{t,t-1}) \mid D^*_n, x_{t-1}]$ is normal with mean
$$\mu_t = h(x^*_{t,t-1})'\beta_g + s_{g,D^*_n}(x^*_{t,t-1})' A^{-1}_{g,D^*_n}\big(D^*_n - H_{D^*_n}\beta_g\big) \tag{21}$$
and variance
$$\sigma_t^2 = \sigma_g^2\Big\{1 - s_{g,D^*_n}(x^*_{t,t-1})' A^{-1}_{g,D^*_n}\, s_{g,D^*_n}(x^*_{t,t-1})\Big\}. \tag{22}$$
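For concreteness, the conditional mean (21) and variance (22) can be transcribed directly into code. The sketch below uses a hypothetical grid, trend coefficients, and smoothness values of our own choosing; at a grid point the variance collapses to (numerically) zero and the mean reproduces the tabulated value, which is the interpolation property discussed in Section 4.4:

```python
import numpy as np

def cond_moments(z, grid, D_star, beta_g, sigma2_g, r):
    """Mean (21) and variance (22) of g(z) given the lookup table
    (grid, D_star), with h(z) = (1, z')' and diagonal R = diag(r)."""
    z = np.asarray(z, float)
    c = lambda a, b: np.exp(-((a - b) @ (r * (a - b))))
    h = lambda u: np.concatenate(([1.0], u))
    A = np.array([[c(zi, zj) for zj in grid] for zi in grid])
    s = np.array([c(z, zi) for zi in grid])
    H = np.array([h(zi) for zi in grid])
    w = np.linalg.solve(A, s)                    # A^{-1} s
    mu = h(z) @ beta_g + w @ (D_star - H @ beta_g)
    var = sigma2_g * (1.0 - w @ s)
    return mu, var

# Hypothetical grid points z = (t, x)' and tabulated g values:
r = np.array([0.1, 1.0])
grid = [np.array([1.0, -1.0]), np.array([2.0, 0.0]), np.array([3.0, 1.0])]
D_star = np.array([0.3, -0.2, 0.5])
beta_g = np.zeros(3)
mu, var = cond_moments(grid[1], grid, D_star, beta_g, 1.0, r)
# at a grid point: mu equals D_star[1] and var is numerically zero
```

Away from the grid, `var` is strictly positive, reflecting the interpolation uncertainty between tabulated values.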
One subtlety involved in the assumption of conditional independence is that, conditional on $x^*_{1,0}$, $G_n$ must not contain $x^*_{1,0}$; otherwise $D^*_n$ would contain $g(x^*_{1,0})$, implying that $[g(x^*_{t,t-1}) \mid \theta_g, D^*_n, x_{t-1}]$ depends on $x_1 = g(x^*_{1,0})$, violating the conditional independence assumption.

4.4 Accuracy of the Markov approximation of the distributions of the state variables conditional on $D^*_n$
To understand heuristically why the approximation (20) can be made arbitrarily accurate, note that, thanks to the Gaussian process assumption, conditioning on $D^*_n$ forces the random function $g(\cdot)$ to pass through the points in $(G_n, D^*_n)$, since the conditional $[g(x) \mid \theta_g, D^*_n, x]$ has zero variance if $x \in G_n$ (see, for example, Bhattacharya (2007) and the references therein). In other words, if $x^*_{t,t-1} \in G_n$, then $\sigma_t^2 = 0$, so that
$$[g(x^*_{t,t-1}) \mid \theta_g, D^*_n, x_{t-1}] = \delta_{g_{true}(x^*_{t,t-1})}, \tag{23}$$
where $\delta_z$ denotes point mass at $z$. This property of the conditional associated with the Gaussian process is in keeping with the insight gained from the discussion of the lookup table associated with prediction of the outputs of a deterministic function having dynamic behaviour. However, $x^*_{t,t-1} \notin G_n$ with probability 1, and the conditional $[g(x^*_{t,t-1}) \mid \theta_g, D^*_n, x_{t-1}]$ provides spatial interpolation within $(G_n, D^*_n)$ (see, for example, Cressie (1993), Stein (1999)). The finer the set $G_n$, the closer is $[g(x^*_{t,t-1}) \mid \theta_g, D^*_n, x_{t-1}]$ to $\delta_{g(x^*_{t,t-1})}$. The conditional independence of $g(x^*_{t,t-1})$ from all $\{x^*_{t,k}; k < t - 1\}$ given $(G_n, D^*_n)$ is in accordance with the motivation provided by the deterministic sequence, and here $D^*_n$ acts as a set of auxiliary variables, greatly simplifying computation while not compromising accuracy. In Section S1 of the supplement we formally prove a theorem stating that, given a particular design $G_n$, the order of approximation of $g_{true}(\cdot)$ by the conditional distribution of $g(\cdot)$ given $D^*_n$, within a finite region (but as large as required for all practical purposes), is $O(n^{-1})$. Hence, given a judiciously chosen, sufficiently fine grid $G_n$ and corresponding $D^*_n$, the conditioned state variables $\{x_k; k < t\}$ do not provide any extra information to $x_t = g(x^*_{t,t-1})$ regarding $g_{true}(x^*_{t,t-1})$, in an asymptotic sense with respect to $G_n$. Hence, the Markov approximation (20) is valid for appropriate $G_n$, and the accuracy of the approximation can be improved arbitrarily.
4.5 Summary of the lookup table procedure for obtaining the joint distribution of the state variables
To summarize the ideas, let $x_0 \sim \pi$, where $\pi$ is some appropriate prior distribution of $x_0$. The entire dynamic sequence $\{x_0, x_1 = g(x^*_{1,0}), x_2 = g(x^*_{2,1}), \ldots\}$ can then be simulated using the following steps sequentially:

(1) Draw $x_0 \sim \pi$.

(2) Given $x_0$, draw $x_1 = g(x^*_{1,0}) \sim N(h(x^*_{1,0})'\beta_g, \sigma_g^2)$.

(3) Given $x_0$ and $x_1 = g(x^*_{1,0})$, draw $D^*_n \sim [D^*_n \mid \theta_g, g(x^*_{1,0}), x_0]$.

(4) For $t = 2, 3, \ldots$, draw $x_t \sim [x_t = g(x^*_{t,t-1}) \mid \theta_g, D^*_n, x_{t-1}]$.

Step (1) is a simulation of $x_0$ from its prior, and step (2) simply draws from the known marginal distribution of $g(x^*_{1,0})$ given $x_0$. In step (3), $D^*_n$ is drawn conditional on $g(x^*_{1,0})$ (and $x^*_{1,0}$), conceptually implying that the rest of the process $\{g(x^*);\, x^* \neq x^*_{1,0}\}$ is drawn once $g(x^*_{1,0})$ is known. Step (4) then uses this simulated $D^*_n$ to obtain the rest of the dynamic sequence, using the assumed conditional independence structure.
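The four steps above can be sketched in code for scalar states, with $h(z) = (1, z')'$ as in Section 3.1; the hyperparameter values, smoothness parameters, and grid $G_n$ below are hypothetical placeholders, and step (3) performs standard Gaussian conditioning as in (17)-(19):

```python
import numpy as np

rng = np.random.default_rng(2)

h = lambda z: np.concatenate(([1.0], z))
r = np.array([0.1, 1.0])                        # assumed smoothness values
beta_g, sigma2_g = np.zeros(3), 1.0             # assumed GP hyperparameters
c = lambda a, b: np.exp(-((a - b) @ (r * (a - b))))

# (1) x_0 from its prior; (2) x_1 = g(x*_{1,0}) from the GP marginal.
x = [rng.normal(0.0, 1.0)]
z10 = np.array([1.0, x[0]])
x.append(rng.normal(h(z10) @ beta_g, np.sqrt(sigma2_g)))

# (3) Draw D*_n on a small grid G_n, conditioning the joint normal of
# (D*_n, g(x*_{1,0})) on g(x*_{1,0}) = x_1, as in (17)-(19).
G_n = [np.array([t, u]) for t in (1.0, 2.0, 3.0) for u in (-2.0, 0.0, 2.0)]
Z = G_n + [z10]
A = np.array([[c(a, b) for b in Z] for a in Z])
m = np.array([h(a) @ beta_g for a in Z])
n = len(G_n)
cond_mu = m[:n] + A[:n, n] * (x[1] - m[n])      # A[n, n] = 1
cond_cov = A[:n, :n] - np.outer(A[:n, n], A[:n, n])
D_star = rng.multivariate_normal(cond_mu,
                                 sigma2_g * cond_cov + 1e-10 * np.eye(n))

# (4) x_t | D*_n, x_{t-1} for t = 2, 3, ..., via the conditional (21)-(22).
A_g, H = A[:n, :n], np.array([h(a) for a in G_n])
for t in range(2, 6):
    z = np.array([float(t), x[-1]])
    s = np.array([c(z, a) for a in G_n])
    w = np.linalg.solve(A_g, s)
    mu_t = h(z) @ beta_g + w @ (D_star - H @ beta_g)
    var_t = sigma2_g * (1.0 - w @ s)
    x.append(rng.normal(mu_t, np.sqrt(max(var_t, 0.0))))
```

In the full model the additive errors $\eta_t$ would enter the variances as in (24)-(26); they are omitted here to mirror the noiseless illustration of Section 3.4.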
4.6 Explicit form of the lookup-table-induced joint distribution of $\{x_0, x_1, \ldots, x_{T+1}, D^*_n\}$
Once $G_n$ and $D^*_n$ are available, we can write down the joint distribution of $(x_0, x_1, \ldots, x_{T+1}, D^*_n)$ conditional on the other parameters as
$$[x_0, x_1, \ldots, x_{T+1}, D^*_n \mid \theta_f, \theta_g, \sigma_\epsilon^2, \sigma_\eta^2] = [x_0]\,[x_1 = g(x^*_{1,0}) + \eta_1 \mid x_0, \sigma_\eta^2]\,[D^*_n \mid \theta_g] \prod_{t=1}^{T}[x_{t+1} = g(x^*_{t+1,t}) + \eta_{t+1} \mid D^*_n, x_t, \theta_g, \sigma_\eta^2]. \tag{24}$$
Recall that $[x_0] \sim N(\mu_{x_0}, \sigma_{x_0}^2)$, $[x_1 = g(x^*_{1,0}) + \eta_1 \mid x^*_{1,0}, \sigma_\eta^2] \sim N(h(x^*_{1,0})'\beta_g, \sigma_g^2 + \sigma_\eta^2)$, and the distribution of $D^*_n$ is multivariate normal with mean and covariance matrix given by (14) and (15). The conditional distribution $[x_{t+1} = g(x^*_{t+1,t}) + \eta_{t+1} \mid D^*_n, x_t, \theta_g, \sigma_\eta^2]$ is normal with mean
$$\mu_{x_t} = h(x^*_{t+1,t})'\beta_g + s_{g,D^*_n}(x^*_{t+1,t})' A^{-1}_{g,D^*_n}\big(D^*_n - H_{D^*_n}\beta_g\big) \tag{25}$$
and variance
$$\sigma_{x_t}^2 = \sigma_\eta^2 + \sigma_g^2\Big\{1 - s_{g,D^*_n}(x^*_{t+1,t})' A^{-1}_{g,D^*_n}\, s_{g,D^*_n}(x^*_{t+1,t})\Big\}. \tag{26}$$
Observe that in this case, even if $x^*_{t+1,t} \in G_n$, the conditional variance of $x_{t+1}$ is non-zero due to the presence of the additive error term $\eta_{t+1}$, equalling $\sigma_{x_t}^2 = \sigma_\eta^2$, the error variance.

4.7 Non-Markovian dependence structure of the marginalized joint distribution of the state variables $(x_0, x_1, \ldots, x_{T+1})$
As in Bhattacharya (2007), here also it is possible to marginalize out $D^*_n$ from (24) to obtain the approximate joint distribution of $(x_0, x_1, \ldots, x_{T+1})$. However, if $D^*_n$ is integrated out, it is clear that the conditional independence (Markov) property of the $x_t$'s given $D^*_n$, as seen in (24), will be lost. Thus, the marginalized conditional distribution of $x_{t+1}$ depends upon $\{x_k; k < t + 1\}$, that is, on the set of all the past state variables, unlike the non-marginalized case, where, conditionally on $D^*_n$, $x_{t+1}$ depends only upon $x_t$. This makes it clear that even though for a fixed (known) evolutionary function $g$ the corresponding equation (2) satisfies the Markov property, this Markov property is lost when the function is modeled as a Gaussian process. Hence, in our Gaussian process based approach, the state variables are non-Markovian.

4.8 To marginalize or not to marginalize with respect to $D^*_n$?
As discussed in detail in Bhattacharya (2007), the complicated dependence structure associated with the marginalized joint distribution of $(x_0, x_1, \ldots, x_{T+1})$ is also the root of the numerical instabilities associated with MCMC implementation of our model. To understand this heuristically, note that evaluations of the conditionals $[x_{t+1} \mid x_t, \ldots, x_0, \theta_g, \sigma_\eta^2]$ are required for MCMC implementation, but these evaluations require inversions of covariance matrices involving the random states $x_0, x_1, \ldots, x_{t-1}$ in the correlation terms. By sample path continuity of the underlying Gaussian process, the sampled states $x_0, x_1, \ldots, x_{t-1}$ will often be close to each other with high probability, particularly if $\sigma_g^2$ and $\sigma_\eta^2$ are small, rendering the correlation matrix nearly singular. Moreover, inversion of such correlation matrices at every iteration of the MCMC is also computationally very costly. These problems are much aggravated for large $t$.
On the other hand, if $D^*_n$ is retained, this problem is avoided, since in that case we only need to compute the conditionals $[x_{t+1} \mid D^*_n, x_t, \theta_g, \sigma_\eta^2]$, which involve inversion of the correlation matrix $A_{g,D^*_n}$, whose $(i,j)$-th element is of the form $c_g(z_i, z_j)$, where $z_1, \ldots, z_n$ are fixed constants selected by the user. Also, quite importantly, $A^{-1}_{g,D^*_n}$ can be computed even before beginning the MCMC simulations, and it remains fixed thereafter, saving a lot of computational time in addition to providing protection against numerical instability. For further details, see Bhattacharya (2007). Hence, we retain the set of auxiliary variables $D^*_n$ in (24) for implementation of our MCMC methods, and finally discard them from the resultant MCMC samples to infer about the quantities of interest. In the next section we complete the specification of our fully Bayesian model by choosing appropriate prior distributions of the parameters.
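This precomputation is simple to arrange in code. The sketch below, with an illustrative one-dimensional grid and correlation choice of our own, factorizes $A_{g,D^*_n}$ once by Cholesky decomposition and reuses the factor for every subsequent solve, avoiding any per-iteration matrix inversion:

```python
import numpy as np

# The grid z_1, ..., z_n is fixed by the user, so A_{g,D*_n} and its
# Cholesky factor can be computed once, before the MCMC begins.
grid = np.linspace(-3.0, 3.0, 12)                 # illustrative grid
A = np.exp(-2.0 * (grid[:, None] - grid[None, :]) ** 2)
A += 1e-8 * np.eye(grid.size)                     # numerical jitter
L = np.linalg.cholesky(A)                         # computed once, reused

def solve_A(v):
    """Return A^{-1} v via the precomputed factor (two triangular solves)."""
    return np.linalg.solve(L.T, np.linalg.solve(L, v))
```

Every MCMC iteration then calls `solve_A` with a new right-hand side instead of refactorizing or inverting, which is both faster and numerically safer.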
5. PRIOR SPECIFICATIONS AND BAYESIAN INFERENCE USING MCMC
We assume the following prior distributions:
$$[\log(r_{i,f})] \sim N\big(\mu_{r_{i,f}}, \sigma_{r_{i,f}}^2\big); \quad i = 1, 2, \tag{27}$$
$$[\log(r_{i,g})] \sim N\big(\mu_{r_{i,g}}, \sigma_{r_{i,g}}^2\big); \quad i = 1, 2, \tag{28}$$
$$[\sigma_\epsilon^2] \propto \big(\sigma_\epsilon^2\big)^{-\left(\frac{\alpha_\epsilon + 2}{2}\right)} \exp\left(-\frac{\gamma_\epsilon}{2\sigma_\epsilon^2}\right); \quad \alpha_\epsilon, \gamma_\epsilon > 0, \tag{29}$$
$$[\sigma_\eta^2] \propto \big(\sigma_\eta^2\big)^{-\left(\frac{\alpha_\eta + 2}{2}\right)} \exp\left(-\frac{\gamma_\eta}{2\sigma_\eta^2}\right); \quad \alpha_\eta, \gamma_\eta > 0, \tag{30}$$
$$[\sigma_f^2] \propto \big(\sigma_f^2\big)^{-\left(\frac{\alpha_f + 2}{2}\right)} \exp\left(-\frac{\gamma_f}{2\sigma_f^2}\right); \quad \alpha_f, \gamma_f > 0, \tag{31}$$
$$[\sigma_g^2] \propto \big(\sigma_g^2\big)^{-\left(\frac{\alpha_g + 2}{2}\right)} \exp\left(-\frac{\gamma_g}{2\sigma_g^2}\right); \quad \alpha_g, \gamma_g > 0, \tag{32}$$
$$[\beta_f] \sim N_m\big(\beta_{f,0}, \Sigma_{\beta_{f,0}}\big), \tag{33}$$
$$[\beta_g] \sim N_m\big(\beta_{g,0}, \Sigma_{\beta_{g,0}}\big). \tag{34}$$
All the prior parameters are assumed to be known. We now discuss our approach to selecting the prior parameters for applying our Bayesian model in the simulation studies and the real data application in the univariate situations.
In order to choose the parameters of the log-normal priors of the smoothness parameters, we set the mean of the log-normal prior with parameters $\mu$ and $\sigma^2$, given by $\exp(\mu + \sigma^2/2)$, to 1. This yields $\mu = -\sigma^2/2$. Since the variance of this log-normal prior is given by $(\exp(\sigma^2) - 1)\exp(2\mu + \sigma^2)$, the relation $\mu = -\sigma^2/2$ implies that the variance is $\exp(\sigma^2) - 1 = \exp(-2\mu) - 1$. We set $\sigma^2 = 1$, so that $\mu = -0.5$. This implies that the mean is 1 and the variance is approximately 2 for the prior of each smoothness parameter $r_{i,f}$ and $r_{i,g}$; $i = 1, 2$. For the choice of the parameters of the priors of $\sigma_f^2$, $\sigma_g^2$, $\sigma_\epsilon^2$ and $\sigma_\eta^2$, we first note that the mean is of the form $\gamma/(\alpha - 2)$ and the variance is of the form $2\gamma^2/\{(\alpha - 2)^2(\alpha - 4)\}$. Thus, if we set $\gamma/(\alpha - 2) = a$, then the variance becomes $2a^2/(\alpha - 4)$. Here we set $a = 0.5, 0.5, 0.1, 0.1$, respectively, for $\sigma_f^2$, $\sigma_g^2$, $\sigma_\epsilon^2$, $\sigma_\eta^2$. For each of these priors we set $\alpha = 4.01$, so that the variance is of the form $200a^2$. We set the priors of $\beta_f$ and $\beta_g$ to be trivariate normal with zero mean and the identity matrix as the covariance matrix.
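The calibration above amounts to simple arithmetic; the sketch below verifies that $\mu = -\sigma^2/2$ gives a log-normal mean of 1 (with variance $e - 1 \approx 1.72$, roughly 2, at $\sigma^2 = 1$), and that $\alpha = 4.01$ with $\gamma/(\alpha - 2) = a$ gives prior variance $200a^2$:

```python
import math

# Log-normal prior for a smoothness parameter: mean exp(mu + s2/2),
# variance (exp(s2) - 1) * exp(2*mu + s2).
s2 = 1.0
mu = -s2 / 2.0                       # forces the prior mean to 1
ln_mean = math.exp(mu + s2 / 2.0)
ln_var = (math.exp(s2) - 1.0) * math.exp(2.0 * mu + s2)

# Inverse-gamma-type prior for a variance parameter: mean gamma/(alpha - 2),
# variance 2*gamma^2 / {(alpha - 2)^2 * (alpha - 4)}.
a, alpha = 0.5, 4.01                 # a is the desired prior mean
gamma = a * (alpha - 2.0)
ig_var = 2.0 * gamma**2 / ((alpha - 2.0) ** 2 * (alpha - 4.0))
# ln_mean = 1, ln_var = e - 1 (about 1.72), ig_var = 200 * a**2 = 50
```

The large resulting variances (e.g. 50 for $a = 0.5$) make these priors rather vague, so the data dominate the posterior.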
5.1 MCMC-based Bayesian inference
In this section we begin with the problem of forecasting $y_{T+1}$ given the data set $D_T$. Interestingly, our approach to this problem provides an MCMC methodology which generates inference about all the required posterior distributions, either as by-products or by simple generalization of this MCMC approach using an augmentation scheme. Details follow. The posterior predictive distribution of $y_{T+1}$ given $D_T$ is
$$[y_{T+1} \mid D_T] = \int [y_{T+1} \mid D_T, x_0, x_1, \ldots, x_{T+1}, \beta_f, R_f, \sigma_\epsilon^2]\, [x_0, x_1, \ldots, x_{T+1}, \beta_f, \beta_g, R_f, R_g, \sigma_\epsilon^2, \sigma_\eta^2 \mid D_T]\, d\theta_f\, d\theta_g\, d\sigma_\epsilon^2\, d\sigma_\eta^2\, dx_0\, dx_1 \cdots dx_{T+1}. \tag{35}$$
This posterior is not available analytically, and so simulation methods are necessary to make inferences. In particular, once a sample is available from the posterior $[x_0, x_1, \ldots, x_{T+1}, \beta_f, \beta_g, R_f, R_g, \sigma_\epsilon^2, \sigma_\eta^2 \mid D_T]$, the corresponding samples drawn from $[y_{T+1} \mid D_T, x_0, x_1, \ldots, x_{T+1}, \beta_f, R_f, \sigma_\epsilon^2]$ are from the posterior predictive (35), using which the required posterior summaries can be obtained. Note that the conditional distribution $[y_{T+1} = f(x^*_{T+1,T+1}) + \epsilon_{T+1} \mid D_T, x_0, x_1, \ldots, x_{T+1}, \beta_f, \sigma_f^2, R_f, \sigma_\epsilon^2]$ is normal with mean
$$\mu_{y_{T+1}} = h(x^*_{T+1,T+1})'\beta_f + s_{f,D_T}(x^*_{T+1,T+1})' A^{-1}_{f,D_T}\big(D_T - H_{D_T}\beta_f\big) \tag{36}$$
and variance
$$\sigma_{y_{T+1}}^2 = \sigma_\epsilon^2 + \sigma_f^2\Big\{1 - s_{f,D_T}(x^*_{T+1,T+1})' A^{-1}_{f,D_T}\, s_{f,D_T}(x^*_{T+1,T+1})\Big\}. \tag{37}$$
Using D*_n, the conditional posterior [x_0, x_1, …, x_{T+1}, β_f, β_g, R_f, R_g, σ_ε², σ_η² | D_T] can be written as
\[
[x_0, x_1, \ldots, x_{T+1}, \theta_f, \theta_g, \sigma^2_\epsilon, \sigma^2_\eta \mid D_T]
= \int [x_0, x_1, \ldots, x_{T+1}, \theta_f, \theta_g, \sigma^2_\epsilon, \sigma^2_\eta, g(x^*_{1,0}), D^*_n \mid D_T]\, dg(x^*_{1,0})\, dD^*_n \tag{38}
\]
\[
\propto \int [x_0, x_1, \ldots, x_{T+1}, \theta_f, \theta_g, \sigma^2_\epsilon, \sigma^2_\eta, g(x^*_{1,0}), D^*_n, D_T]\, dg(x^*_{1,0})\, dD^*_n \tag{39}
\]
\[
= \int [\theta_f, \theta_g, \sigma^2_\epsilon, \sigma^2_\eta][x_0][g(x^*_{1,0}) \mid x^*_{1,0}][D^*_n \mid g(x^*_{1,0}), x_0, \theta_g]
\times [x_1 = g(x^*_{1,0}) + \eta_1 \mid g(x^*_{1,0}), x_0, \beta_g, \sigma^2_g, \sigma^2_\eta]
\times [x_{T+1} = g(x^*_{T+1,T}) + \eta_{T+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_T]
\times \prod_{t=1}^{T-1} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t]\, [D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon]\, dg(x^*_{1,0})\, dD^*_n. \tag{40}
\]
In the above, [x_1 = g(x*_{1,0}) + η_1 | g(x*_{1,0}), β_g, σ_g², σ_η²] ∼ N(g(x*_{1,0}), σ_η²). Although the analytic form of (40) is not available, MCMC simulation from [x_0, x_1, …, x_{T+1}, θ_f, θ_g, σ_ε², σ_η², g(x*_{1,0}), D*_n | D_T], which is proportional to the integrand in (40), is possible. Ignoring g(x*_{1,0}) and D*_n in these MCMC simulations yields the desired samples from [x_0, x_1, …, x_{T+1}, θ_f, θ_g, σ_ε², σ_η² | D_T]. The details of the MCMC are provided in Section S2.
5.2 MCMC-based single and multiple step forecasts, filtering and retrospection
Observe that one can readily study the posteriors [y_{T+1} | D_T], [x_{T+1} | D_T] and [x_{T−k} | D_T] using the readily available MCMC samples of {y_{T+1}, x_{T+1}, x_{T−k}}, after ignoring the samples corresponding to the rest of the unknowns. To study the posterior [x_{T+1} | D_T, y_{T+1}], we only need to augment y_{T+1} to D_T to create D_{T+1} = (D_T', y_{T+1})'. Then our methodology can be followed exactly to generate samples from [x_{T+1} | D_{T+1}].

Sample generation from [y_{T+k} | D_T] requires a slight generalization of this augmentation strategy. Here we use successive augmentation, adding each simulated y_{T+j} to the previous D_{T+j−1} to create D_{T+j} = (D_{T+j−1}', y_{T+j})'; j = 1, 2, …, k. Our MCMC methodology can then be implemented successively to generate samples from y_{T+j+1} and all other variables. This implies that at each augmentation stage we need to draw a single MCMC sample from [x_0, x_1, …, x_{T+j+1}, β_f, β_g, R_f, R_g, σ_f², σ_g², σ_ε², σ_η² | D_{T+j}]. Once this sample is generated, we can draw a single realization of y_{T+j+1} from y_{T+j+1} ∼ N(µ_{y_{T+j+1}}, σ²_{y_{T+j+1}}), where, analogous to (36) and (37),
\[
\mu_{y_{T+j+1}} = h(x^*_{T+j+1,T+j+1})'\beta_f + s_{f,D_{T+j}}(x^*_{T+j+1,T+j+1})' A^{-1}_{f,D_{T+j}}\left(D_{T+j} - H_{D_{T+j}}\beta_f\right) \tag{41}
\]
and
\[
\sigma^2_{y_{T+j+1}} = \sigma^2_\epsilon + \sigma^2_f\left\{1 - s_{f,D_{T+j}}(x^*_{T+j+1,T+j+1})' A^{-1}_{f,D_{T+j}} s_{f,D_{T+j}}(x^*_{T+j+1,T+j+1})\right\}. \tag{42}
\]
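The successive augmentation scheme above can be summarized in code. In the sketch below, `one_mcmc_draw` is a hypothetical placeholder for a full Metropolis-within-Gibbs sweep (here it returns dummy predictive moments so the recursion is runnable); only the augmentation logic follows the text.

```python
import numpy as np

def one_mcmc_draw(D, rng):
    """Placeholder for one MCMC draw of all unknowns given data D.
    A real implementation would run a full sweep and return the
    predictive mean and sd of the next observation, cf. (41)-(42)."""
    return {"mu_next": float(D[-1]), "sd_next": 1.0}

def k_step_forecast(D_T, k, rng):
    """Simulate y_{T+1},...,y_{T+k} by successive augmentation:
    each simulated value is appended to the data before the next stage."""
    D = list(D_T)
    forecasts = []
    for _ in range(k):
        draw = one_mcmc_draw(np.asarray(D), rng)                   # one sample of all unknowns
        y_next = rng.normal(draw["mu_next"], draw["sd_next"])      # y ~ N(mu, sd^2)
        forecasts.append(y_next)
        D.append(y_next)                                           # D_{T+j} = (D_{T+j-1}', y_{T+j})'
    return forecasts
```

A call such as `k_step_forecast([y_1, ..., y_T], k, rng)` returns one simulated k-step-ahead path; repeating it yields draws from the k-step posterior predictive.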
6. BRIEF DISCUSSION ON SIMULATION STUDIES
In Section S3 of the supplement we present two detailed simulation experiments. In the first experiment, both the true observational and the true evolutionary functions are linear. In the second study, both these true functions are nonlinear. In both cases we fitted our Gaussian process based nonparametric model to the data generated from the true models. Whenever the true parameters are comparable to the parameters associated with our model, the posteriors of the parameters of our model successfully captured them. The true time series {x_0, …, x_T} fell well within their respective 95% Bayesian credible intervals in both simulation studies. However, the lengths of the credible regions of the states seem to be somewhat larger than desired. Indeed, since the posterior distribution of the states depends upon the observational and evolutionary functions, both of which are treated as unknown, the somewhat larger credible intervals merely reflect the uncertainty associated with these unknown functions. In the context of our simulation studies, where the true functions are known, the lengths of our credible intervals may not seem particularly encouraging, but as we already mentioned in Section 1, we have in mind complex, realistic problems, where the true observational and evolutionary functions are extremely difficult to ascertain. In such problems, there exists a large amount of uncertainty regarding these functions, which would be coherently modeled by our Gaussian process approach; the relatively large Bayesian credible regions that arise as a result of acknowledging such uncertainty make good practical sense.

We now apply our ideas based on Gaussian processes to a real data set.

7. APPLICATION TO A REAL DATA SET
Assuming a parametric, dynamic, linear model set-up, Shumway & Stoffer (1982) and Carlin et al. (1992) analysed a data set consisting of estimated total physician expenditures by year (y_t; t = 1, …, 25), as measured by the Social Security Administration. The unobserved true annual physician expenditures are denoted by x_t; t = 1, …, 25. Shumway & Stoffer (1982) used a maximum likelihood approach based on the EM algorithm, while Carlin et al. (1992) considered a Gibbs sampling based Bayesian approach. We apply our Bayesian nonparametric Gaussian process based model and methodology to the same data set and check whether the results of the former authors, who analysed these data under the linearity assumption, agree with those obtained by our far more general analysis.

7.1 Choice of prior parameters and MCMC implementation
We use the same prior structures as detailed in Section 5. We set α = 4.01 so that the prior variance is of the form 200a²; we set a = 0.5, 0.5, 100, 100, respectively, for σ_f², σ_g², σ_ε² and σ_η². These imply that the prior expectations of the variance parameters are 0.5, 0.5, 100, and 100, respectively, along with the aforementioned variances. The choices of high values of the prior expectations of σ_ε² and σ_η² are motivated by Carlin et al. (1992), while the small variabilities of σ_f² and σ_g² reflect the belief that the true functional forms are perhaps not very different from linearity, the belief being motivated by the assumption of linearity by both Shumway & Stoffer (1982) and Carlin et al. (1992). Also motivated by the latter, we chose x_0 ∼ N(2500, 100²) to be the prior distribution of x_0. We set the priors of β_f and β_g to be trivariate normal distributions with zero mean and the identity matrix as the covariance matrix. For the prior parameters of the smoothness parameters we set σ² = 100 and µ = −50 in the log-normal prior distributions, so that the means are 1 and the variances are exp(100) − 1. Here we set high variances for the smoothness parameters to account for the much higher degree of uncertainty about smoothness in this real data situation compared to the simulation experiments.

To set up the grid G_n we noted that the MLEs of the time series obtained by Shumway & Stoffer (1982) using the linear state-space model are contained in [2000, 30000]. We divide this interval into 200 subintervals of equal length, and select a value randomly from each such subinterval, obtaining the values of the second component of the two-dimensional grid G_n. For the first component, we generate a number uniformly from each of the 200 subintervals [i, i + 1]; i = 0, …, 199. We discard the first 10,000 MCMC iterations as burn-in and store the next 50,000 iterations for inference. We used normal random walk proposals with variance 0.05 for updating σ_f, σ_g, σ_ε and σ_η, and normal random walk proposals with variance 10 for updating {x_0, x_1, …, x_T}. These choices of the variances are based on pilot runs of our MCMC algorithm. As before, informal diagnostics indicated good convergence properties of our MCMC algorithm. It took around 17 hours to implement this application on an ordinary laptop.
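The grid construction described above (one uniform draw from each equal-length subinterval, for each grid component) can be sketched as follows; `build_grid` is a name invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_grid(lo, hi, n, rng):
    """Divide [lo, hi] into n equal subintervals and draw one point
    uniformly from each, as described in the text."""
    edges = np.linspace(lo, hi, n + 1)
    return rng.uniform(edges[:-1], edges[1:])

# Second grid component: 200 draws from equal subintervals of [2000, 30000].
x_grid = build_grid(2000.0, 30000.0, 200, rng)
# First (time) component: one uniform draw from each [i, i+1], i = 0,...,199.
t_grid = build_grid(0.0, 200.0, 200, rng)
```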
7.2 Results of model-fitting
Figures 1, 2 and 3 show the posterior distributions of the unknowns. Our posterior time series of x_t has relatively narrow 95% highest posterior density credible intervals, supporting the claim of Shumway & Stoffer (1982) and Carlin et al. (1992) that the linearity assumption is not unreasonable. Indeed, for such linear relationships, it is well known that the Gaussian process priors we adopted are expected to perform very well.

However, perhaps not all is well with the aforementioned linearity assumption of Shumway & Stoffer (1982) and Carlin et al. (1992). Note that although the MLE time series obtained by Shumway & Stoffer (1982) fall mostly within our Bayesian 95% highest posterior density credible intervals of x_t, five observations towards the end of the time series fall outside. These observations correspond to the years 1968, 1970, 1971, 1972, and 1973. The values of y_t in these years are
Figure 1: Real data analysis: Posterior densities of β_{0,f}, β_{1,f}, β_{2,f}, β_{0,g}, β_{1,g}, β_{2,g}, σ_f, σ_g, and σ_ε.
Figure 2: Real data analysis: Posterior densities of σ_η, r_{1,f}, r_{2,f}, r_{1,g}, r_{2,g}, x_0, x_{T+1} (one-step forecasted x), and y_{T+1} (one-step forecasted y).
Figure 3: Real data analysis: 95% highest posterior density credible intervals of the time series x1 , . . . , xT . The solid line stands for the MLE time series obtained by Shumway and Stoffer (1982).
11,099; 14,306; 15,835; 16,916; and 18,200. We suspect that linearity breaks down towards the end of the time series, an issue which is perhaps overlooked by the linear model based approaches of Shumway & Stoffer (1982) and Carlin et al. (1992). On the other hand, without any change point analysis, our flexible nonparametric model, based on Gaussian processes, is able to accommodate such changes in the regression structures.
8. CONCLUSIONS AND FUTURE WORK
In this article, using Gaussian process priors and the "lookup table" idea of Bhattacharya (2007), we proposed a novel methodology for Bayesian inference in nonparametric state-space models, in both univariate and multivariate cases. The Gaussian process priors on the unknown functional forms of the observational and the evolutionary equations allow very flexible modeling of time-varying random functions, where even the functional forms may change over time, without requiring any change point analysis. We have demonstrated the effectiveness of our model and methodology with simulation experiments, and with a real data analysis which provided interesting insight into nonlinearity of the underlying time series towards its end.

For our current purpose, we have assumed iid Gaussian noises ε_t and η_t; however, it is straightforward to generalize these to other parametric distributions (thick-tailed or otherwise). It may, however, be quite interesting to consider nonparametric error distributions, for example, mixtures of Dirichlet processes; such work in the linear case has been undertaken by Caron, Davy, Doucet, Duflos & Vanheeghe (2008). We shall also consider matrix-variate extensions of our current work, in addition to nonparametric error distributions.

ACKNOWLEDGMENT

We are extremely grateful to an anonymous reviewer whose suggestions resulted in improved presentation of our ideas.
DESCRIPTION OF THE SUPPLEMENT

In Section S1 of the supplement we state and prove results on the accuracy of the lookup table idea, and in Section S2 we provide details of MCMC sampling in the univariate situation. In Section S3 two simulation studies are considered in detail. We provide an extension of our Gaussian process based methodology for univariate cases to multivariate situations in Section S4, followed by details of MCMC sampling in the multivariate set-up in Section S5. In Section S6 we conduct a detailed simulation study to illustrate our methodology in the multivariate set-up. Finally, in Section S7 we provide a detailed discussion of the posteriors of the composite observational and evolutionary functions and their comparisons with the respective true composite functions; in particular, we provide the comparisons for the three simulation studies (two univariate cases, and one multivariate case).
REFERENCES

Bhattacharya, S. (2007), "A Simulation Approach to Bayesian Emulation of Complex Dynamic Computer Models," Bayesian Analysis, 2, 783–816.

Brockwell, P. J., & Davis, R. A. (1987), Time Series: Theory and Methods, New York: Springer-Verlag.

Carlin, B. P., Polson, N. G., & Stoffer, D. S. (1992), "A Monte Carlo Approach to Nonnormal and Nonlinear State-Space Modeling," Journal of the American Statistical Association, 87, 493–500.

Caron, F., Davy, M., Doucet, A., Duflos, E., & Vanheeghe, P. (2008), "Bayesian Inference for Linear Dynamic Models with Dirichlet Process Mixtures," IEEE Transactions on Signal Processing, 56, 71–84.

Carvalho, C. M., & West, M. (2007), "Dynamic Matrix-Variate Graphical Models," Bayesian Analysis, 2, 69–98.

Cressie, N. A. C. (1993), Statistics for Spatial Data, New York: Wiley.

Durbin, J., & Koopman, S. J. (2001), Time Series Analysis by State Space Methods, Oxford: Oxford University Press.

Ghosh, A., Mukhopadhyay, S., Roy, S., & Bhattacharya, S. (2013), "Supplement to 'Bayesian Inference in Nonparametric Dynamic State-Space Models'." Submitted.

Meinhold, R. J., & Singpurwalla, N. D. (1989), "Robustification of Kalman Filter Models," Journal of the American Statistical Association, 84, 479–486.

Quintana, J., & West, M. (1987), "Multivariate Time Series Analysis: New Techniques Applied to International Exchange Rate Data," The Statistician, 36, 275–281.

Santner, T. J., Williams, B. J., & Notz, W. I. (2003), The Design and Analysis of Computer Experiments, Springer Series in Statistics, New York: Springer-Verlag.

Shumway, R. H., & Stoffer, D. S. (1982), "An Approach to Time Series Smoothing and Forecasting Using the EM Algorithm," Journal of Time Series Analysis, 3, 253–264.

Shumway, R. H., & Stoffer, D. S. (2011), Time Series Analysis and Its Applications, New York: Springer-Verlag.

Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, New York: Springer-Verlag.

West, M., & Harrison, P. (1997), Bayesian Forecasting and Dynamic Models, New York: Springer-Verlag.
Supplement to "Bayesian Inference in Nonparametric Dynamic State-Space Models" Anurag Ghosh, Soumalya Mukhopadhyay, Sandipan Roy and Sourabh Bhattacharya
Throughout, we refer to our main paper Ghosh, Mukhopadhyay, Roy & Bhattacharya (2013) as GMRB.

S1. RESULT ON THE ACCURACY OF THE MARKOV APPROXIMATION OF THE CONDITIONAL DISTRIBUTION OF x_T GIVEN D*_n
Theorem S1.1 Let g^{(t)}(·) denote the composition of g(·) with itself t (t ≥ 1) times, where g^{(1)}(·) = g(·). Let g^{(t)}_{true}(·) be defined analogously. Let the inputs {z_1, …, z_n} satisfy z_i − z_{i−1} = (b − a)/n for i = 2, 3, …, n, where a, b are finite constants such that a < min{z_1, …, z_n} ≤ max{z_1, …, z_n} < b. Also assume that G_n is of the form {z_1, …, z_n, g_{true}(z_1), …, g_{true}(z_n), g^{(2)}_{true}(z_1), …, g^{(2)}_{true}(z_n), …, g^{(T−1)}_{true}(z_1), …, g^{(T−1)}_{true}(z_n)}, so that D*_n = {g(z_1), …, g(z_n), g(g_{true}(z_1)), …, g(g_{true}(z_n)), g(g^{(2)}_{true}(z_1)), …, g(g^{(2)}_{true}(z_n)), …, g(g^{(T−1)}_{true}(z_1)), …, g(g^{(T−1)}_{true}(z_n))}. Then, for t = 1, …, T, and for any r > 0,
\[
\int_a^b \left| E[g^{(t)}(z) \mid D^*_n, \theta_g] - g^{(t)}_{true}(z) \right|^r dz = O\left(n^{-1}\right). \tag{1}
\]
Also, for any z* ∈ [a, b],
\[
\left| E[g^{(t)}(z^*) \mid D^*_n, \theta_g] - g^{(t)}_{true}(z^*) \right| = O\left(n^{-1}\right) \tag{2}
\]
and
\[
Var[g^{(t)}(z^*) \mid D^*_n, \theta_g] = O\left(n^{-1}\right). \tag{3}
\]
Proof: Note that for any r > 0, by Riemann sum approximation we have
\[
\int_a^b \left| E[g^{(t)}(z) \mid D^*_n, \theta_g] - g^{(t)}_{true}(z) \right|^r dz
= \frac{b-a}{n} \sum_{i=1}^n \left| E[g^{(t)}(z_i) \mid D^*_n, \theta_g] - g^{(t)}_{true}(z_i) \right|^r + O\left(n^{-1}\right).
\]
But by the choice of G_n and the interpolation property of Gaussian processes, it follows that, given D*_n, g(z_i) = g_{true}(z_i) with probability one. Hence, given D*_n, g(g(z_i)) = g(g_{true}(z_i)) = g_{true}(g_{true}(z_i)) with probability one, since g_{true}(z_i) is in G_n, and so on. In general, given D*_n, g^{(t)}(z_i) = g^{(t)}_{true}(z_i) with probability one. Hence we have |E[g^{(t)}(z_i) | D*_n, θ_g] − g^{(t)}_{true}(z_i)| = 0 for i = 1, …, n. The result (1) thus follows.

To see (2), note that if z* ∈ G_n, then |E[g^{(t)}(z*) | D*_n, θ_g] − g^{(t)}_{true}(z*)| = 0 by the interpolation property of Gaussian processes, so that (2) trivially holds. If z* ∈ [a, b] is not a design point in G_n, then there exists z_i ∈ G_n such that z* ∈ [z_i, z_{i+1}]. Hence, z* = z_i + h, where h < (b − a)/n. Then, letting ν(·) = E[g^{(t)}(·) | D*_n, θ_g] and γ(·) = g^{(t)}_{true}(·), we have
\[
\nu(z_i + h) = \nu(z_i) + h\nu'(\xi_1) \tag{4}
\]
and
\[
\gamma(z_i + h) = \gamma(z_i) + h\gamma'(\xi_2), \tag{5}
\]
where ξ_1 and ξ_2 lie between z_i and z_i + h. Since ν(z_i) = γ(z_i), and since ν'(·) and γ'(·) are bounded on [a, b] (the exponential correlation function of our Gaussian process ensures that ν and γ are continuously differentiable), we have, using (4) and (5),
\[
\left| E[g^{(t)}(z^*) \mid D^*_n, \theta_g] - g^{(t)}_{true}(z^*) \right| = \left| \nu(z_i + h) - \gamma(z_i + h) \right| = O(h).
\]
Since h < (b − a)/n, result (2) follows.

To prove (3), let φ(·) = Var[g^{(t)}(·) | D*_n, θ_g]. Then φ(z_i + h) = φ(z_i) + hφ'(ξ_3), where ξ_3 lies between z_i and z_i + h. But φ(z_i) = 0 due to the interpolation property of Gaussian processes, and φ'(·) is bounded on [a, b] (again, the exponential correlation function ensures that φ is continuously differentiable). Moreover, since h < (b − a)/n, result (3) follows. □
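The O(n⁻¹) Riemann-sum step invoked in the proof can be illustrated numerically on any smooth integrand; the snippet below is a generic illustration of that approximation rate, not the authors' code.

```python
import math

def riemann_error(f, F, a, b, n):
    """Absolute error of the left Riemann sum of f on [a, b] with n equal
    subintervals; F is an antiderivative of f, giving the exact integral."""
    h = (b - a) / n
    approx = sum(f(a + i * h) for i in range(n)) * h
    return abs(approx - (F(b) - F(a)))

e1 = riemann_error(math.sin, lambda x: -math.cos(x), 0.0, 1.0, 100)
e2 = riemann_error(math.sin, lambda x: -math.cos(x), 0.0, 1.0, 200)
# Doubling n roughly halves the error, consistent with an O(1/n) bound.
```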
S2. DETAILS OF MCMC SAMPLING IN THE UNIVARIATE SITUATION

Let G_{n+1} = G_n ∪ {x*_{1,0}}, D*_{n+1} = (D*_n', g(x*_{1,0}))',
\[
A_{g,n+1} = \begin{pmatrix} A_{g,D^*_n} & s_{g,D^*_n}(x^*_{1,0}) \\ s_{g,D^*_n}(x^*_{1,0})' & 1 \end{pmatrix},
\]
and H_{g,D*_{n+1}} = [H'_{g,D*_n}, h(x*_{1,0})]'. Clearly,
\[
[D^*_{n+1} \mid x^*_{1,0}, \theta_g] = [g(x^*_{1,0}) \mid x_0, \theta_g][D^*_n \mid g(x^*_{1,0}), x_0, \theta_g]. \tag{6}
\]
With these definitions, the forms of the full conditional distributions of the unknowns are provided below.
\[
[\beta_f \mid \cdots] \propto [\beta_f][D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon] \tag{7}
\]
\[
[\beta_g \mid \cdots] \propto [\beta_g][D^*_{n+1} \mid x^*_{1,0}, \theta_g] \prod_{t=1}^{T} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t] \tag{8}
\]
\[
[\sigma^2_f \mid \cdots] \propto [\sigma^2_f][D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon] \tag{9}
\]
\[
[\sigma^2_g \mid \cdots] \propto [\sigma^2_g][D^*_{n+1} \mid x^*_{1,0}, \theta_g] \prod_{t=1}^{T} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t] \tag{10}
\]
\[
[\sigma^2_\epsilon \mid \cdots] \propto [\sigma^2_\epsilon][D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon] \tag{11}
\]
\[
[\sigma^2_\eta \mid \cdots] \propto [\sigma^2_\eta][x_1 \mid g(x^*_{1,0}), x_0, \theta_g, \sigma^2_\eta] \prod_{t=1}^{T} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t] \tag{12}
\]
\[
[r_{i,f} \mid \cdots] \propto [r_{i,f}][D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon]; \quad
[r_{i,g} \mid \cdots] \propto [r_{i,g}][D^*_n \mid g(x^*_{1,0}), x_0, \theta_g] \prod_{t=1}^{T} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t]; \quad i = 1, 2 \tag{13}
\]
\[
[g(x^*_{1,0}) \mid \cdots] \propto [g(x^*_{1,0}) \mid x_0, \beta_g, \sigma^2_g][D^*_n \mid g(x^*_{1,0}), x_0, \theta_g][x_1 \mid g(x^*_{1,0}), x_0, \sigma^2_\eta] \tag{14}
\]
\[
[D^*_n \mid \cdots] \propto \prod_{t=1}^{T} [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t]\,[D^*_n \mid g(x^*_{1,0}), x_0, \theta_g] \tag{15}
\]
\[
[x_0 \mid \cdots] \propto [x_0][D^*_{n+1} \mid x_0, \theta_g] \tag{16}
\]
\[
[x_1 \mid \cdots] \propto [x_1 \mid g(x^*_{1,0}), x_0, \beta_g, \sigma^2_g, \sigma^2_\eta][x_2 \mid \theta_g, \sigma^2_\eta, D^*_n, x_1] \times [D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon] \tag{17}
\]
\[
[x_{t+1} \mid \cdots] \propto [x_{t+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_t][x_{t+2} \mid \theta_g, \sigma^2_\eta, D^*_n, x_{t+1}] \times [D_T \mid x_1, \ldots, x_T, \theta_f, \sigma^2_\epsilon]; \quad t = 1, \ldots, T-1 \tag{18}
\]
\[
[x_{T+1} \mid \cdots] \propto [x_{T+1} \mid \theta_g, \sigma^2_\eta, D^*_n, x_T] \tag{19}
\]
Although some of the full conditionals are of standard forms permitting Gibbs sampling steps, others are nonstandard, and sampling then requires Metropolis-Hastings (MH) steps. We describe below the Gibbs steps and construct proposal distributions for the cases where MH steps are needed.

S2.1 Updating β_f using Gibbs step
The full conditional of β_f is m-variate normal with mean
\[
E[\beta_f \mid \cdots] = \left\{ H_{D_T}'\left(\sigma_f^2 A_{f,D_T} + \sigma_\epsilon^2 I\right)^{-1} H_{D_T} + \Sigma_{\beta_f,0}^{-1} \right\}^{-1}
\left\{ H_{D_T}'\left(\sigma_f^2 A_{f,D_T} + \sigma_\epsilon^2 I\right)^{-1} D_T + \Sigma_{\beta_f,0}^{-1}\beta_{f,0} \right\} \tag{20}
\]
and variance
\[
V[\beta_f \mid \cdots] = \left\{ H_{D_T}'\left(\sigma_f^2 A_{f,D_T} + \sigma_\epsilon^2 I\right)^{-1} H_{D_T} + \Sigma_{\beta_f,0}^{-1} \right\}^{-1}. \tag{21}
\]
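As a sketch of how such a conjugate update can be implemented, the function below draws β_f from an m-variate normal with the moments (20)-(21); all names are assumptions of this illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_beta_f(H, D, A_f, sigma_f2, sigma_e2, Sigma0_inv, beta0, rng):
    """One Gibbs draw of beta_f: posterior precision H'V^{-1}H + Sigma0^{-1}
    and mean cov @ (H'V^{-1}D + Sigma0^{-1} beta0), with
    V = sigma_f2 * A_f + sigma_e2 * I, cf. (20)-(21)."""
    V = sigma_f2 * A_f + sigma_e2 * np.eye(len(D))
    W = np.linalg.solve(V, H)              # V^{-1} H
    prec = H.T @ W + Sigma0_inv
    cov = np.linalg.inv(prec)
    mean = cov @ (W.T @ D + Sigma0_inv @ beta0)
    return rng.multivariate_normal(mean, cov)

beta_draw = draw_beta_f(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
                        np.array([1.0, 2.0, 3.0]), np.eye(3),
                        0.5, 0.1, np.eye(2), np.zeros(2), rng)
```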
S2.2 Updating β_g using Gibbs step
The conditional distribution of β_g is m-variate normal with mean
\[
E[\beta_g \mid \cdots] = \Bigg\{ \Sigma_{\beta_g,0}^{-1} + \frac{H_{D^*_{n+1}}' A_{g,D^*_{n+1}}^{-1} H_{D^*_{n+1}}}{\sigma_g^2}
+ \sum_{t=1}^{T} \frac{\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right)\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right)'}{\sigma_{g,\eta,t}^2} \Bigg\}^{-1}
\]
\[
\times \Bigg\{ \Sigma_{\beta_g,0}^{-1}\beta_{g,0} + \frac{H_{D^*_{n+1}}' A_{g,D^*_{n+1}}^{-1} D^*_{n+1}}{\sigma_g^2}
+ \sum_{t=1}^{T} \frac{\left( x_{t+1} - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1} D^*_n \right)\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right)}{\sigma_{g,\eta,t}^2} \Bigg\} \tag{22}
\]
and variance
\[
V[\beta_g \mid \cdots] = \Bigg\{ \Sigma_{\beta_g,0}^{-1} + \frac{H_{D^*_{n+1}}' A_{g,D^*_{n+1}}^{-1} H_{D^*_{n+1}}}{\sigma_g^2}
+ \sum_{t=1}^{T} \frac{\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right)\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right)'}{\sigma_{g,\eta,t}^2} \Bigg\}^{-1}. \tag{23}
\]
In (22) and (23),
\[
\sigma_{g,\eta,t}^2 = \sigma_g^2\left\{ 1 - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right\} + \sigma_\eta^2. \tag{24}
\]

S2.3 Updating σ_f² and σ_g² using MH steps
The full conditionals of σ_f² and σ_g² are not available in closed forms, and MH steps are necessary here. For proposal distributions we first construct a new model by setting σ_ε² = σ_f² and σ_η² = σ_g². For this model the full conditional distributions of σ_f² and σ_g² are inverse-gamma distributions, given by
\[
q_{\sigma_f^2}(\sigma_f^2) \propto \left(\sigma_f^2\right)^{-\frac{T+\alpha_f+2}{2}} \exp\left[ -\frac{1}{2\sigma_f^2}\left\{ \gamma_f + \left(D_T - H_{D_T}\beta_f\right)'\left(A_{f,D_T} + I_T\right)^{-1}\left(D_T - H_{D_T}\beta_f\right) \right\} \right] \tag{25}
\]
and
\[
q_{\sigma_g^2}(\sigma_g^2) \propto \left(\sigma_g^2\right)^{-\frac{\alpha_g+4+n+T}{2}} \exp\Bigg[ -\frac{1}{2\sigma_g^2}\Bigg\{ \gamma_g + \left(x_1 - g(x^*_{1,0})\right)^2
+ \left(D^*_{n+1} - H_{D^*_{n+1}}\beta_g\right)' A_{g,D^*_{n+1}}^{-1}\left(D^*_{n+1} - H_{D^*_{n+1}}\beta_g\right)
+ \sum_{t=1}^{T} \frac{\left\{ x_{t+1} - h(x^*_{t+1,t})'\beta_g - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1}\left(D^*_n - H_{D^*_n}\beta_g\right) \right\}^2}{\sigma_{g,t}^2} \Bigg\} \Bigg]. \tag{26}
\]
In (26),
\[
\sigma_{g,t}^2 = 2 - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}). \tag{27}
\]
In (26) the term (σ_g²)^{−1/2} exp{−(x_1 − g(x*_{1,0}))²/(2σ_g²)} is also taken into account since, given σ_η² = σ_g², [x_1 | g(x*_{1,0}), x_0] ∼ N(g(x*_{1,0}), σ_g²). It is useful to remark here that unless σ_f ≈ σ_ε and σ_g ≈ σ_η, these proposal mechanisms may not be efficient. We shall discuss other proposal distributions in the context of applications.

S2.4 Updating σ_ε² and σ_η² using MH steps
As before, the full conditionals of σ_ε² and σ_η² are not available in closed forms. For MH steps, we construct proposal distributions obtained by setting σ_f² = σ_ε² and σ_g² = σ_η². The proposal distributions are given by
\[
q_{\sigma_\epsilon^2}(\sigma_\epsilon^2) \propto \left(\sigma_\epsilon^2\right)^{-\frac{T+\alpha_\epsilon+2}{2}} \exp\left[ -\frac{1}{2\sigma_\epsilon^2}\left\{ \gamma_\epsilon + \left(D_T - H_{D_T}\beta_f\right)'\left(A_{f,D_T} + I_T\right)^{-1}\left(D_T - H_{D_T}\beta_f\right) \right\} \right] \tag{28}
\]
and
\[
q_{\sigma_\eta^2}(\sigma_\eta^2) \propto \left(\sigma_\eta^2\right)^{-\frac{\alpha_\eta+4+n+T}{2}} \exp\Bigg[ -\frac{1}{2\sigma_\eta^2}\Bigg\{ \gamma_\eta + \left(x_1 - g(x^*_{1,0})\right)^2
+ \left(D^*_{n+1} - H_{D^*_{n+1}}\beta_g\right)' A_{g,D^*_{n+1}}^{-1}\left(D^*_{n+1} - H_{D^*_{n+1}}\beta_g\right)
+ \sum_{t=1}^{T} \frac{\left\{ x_{t+1} - h(x^*_{t+1,t})'\beta_g - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1}\left(D^*_n - H_{D^*_n}\beta_g\right) \right\}^2}{\sigma_{g,t}^2} \Bigg\} \Bigg]. \tag{29}
\]
In (29) the term (σ_η²)^{−(n+1)/2} exp{−(D*_{n+1} − H_{D*_{n+1}}β_g)' A_{g,D*_{n+1}}^{−1}(D*_{n+1} − H_{D*_{n+1}}β_g)/(2σ_η²)} occurs since, given σ_g² = σ_η², [D*_{n+1} | β_g, σ_g², R_g] ∼ N_{n+1}(H_{D*_{n+1}}β_g, σ_η² A_{g,D*_{n+1}}). Again, these proposals need not be efficient unless σ_f ≈ σ_ε and σ_g ≈ σ_η. In the context of specific applications we shall discuss other proposal distributions.

S2.5 Updating R_f and R_g using MH steps
For i = 1, 2, the full conditionals of the smoothness parameters r_{i,f} and r_{i,g} are not available in closed forms, and we suggest MH steps with normal random walk proposals with suitably tuned variances.
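A normal random-walk MH step for a positive parameter such as r_{i,f} can be sketched as follows. The log-normal prior from Section 5 of GMRB serves here as a stand-in target, and proposals at or below zero are simply rejected (one straightforward way to respect positivity); this is an illustration, not the authors' code.

```python
import math
import random

random.seed(0)

def rw_mh_positive(log_target, x0, step, n_iter):
    """Normal random-walk Metropolis-Hastings for a positive scalar."""
    x, chain = x0, []
    for _ in range(n_iter):
        prop = x + random.gauss(0.0, step)
        if prop > 0 and math.log(random.random()) < log_target(prop) - log_target(x):
            x = prop
        chain.append(x)
    return chain

def log_target(r):
    # log-normal(mu = -0.5, sigma^2 = 1) log-density, up to a constant
    return -math.log(r) - (math.log(r) + 0.5) ** 2 / 2.0

chain = rw_mh_positive(log_target, 1.0, 0.5, 5000)
```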
S2.6 Updating g(x*_{1,0}) using Gibbs step

The full conditional of g(x*_{1,0}) is univariate normal with mean and variance given, respectively, by
\[
E[g(x^*_{1,0}) \mid \cdots] = \left\{ \frac{1}{\sigma_\eta^2} + \frac{1 + s_{g,D^*_n}(x^*_{1,0})' \Sigma_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{1,0})}{\sigma_g^2} \right\}^{-1}
\left\{ \frac{x_1}{\sigma_\eta^2} + \frac{h(x^*_{1,0})'\beta_g + s_{g,D^*_n}(x^*_{1,0})' \Sigma_{g,D^*_n}^{-1} D^*_z}{\sigma_g^2} \right\} \tag{30}
\]
and
\[
V[g(x^*_{1,0}) \mid \cdots] = \left\{ \frac{1}{\sigma_\eta^2} + \frac{1 + s_{g,D^*_n}(x^*_{1,0})' \Sigma_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{1,0})}{\sigma_g^2} \right\}^{-1}. \tag{31}
\]
In (30),
\[
D^*_z = D^*_n - H_{D^*_n}\beta_g + s_{g,D^*_n}(x^*_{1,0})\, h(x^*_{1,0})'\beta_g. \tag{32}
\]

S2.7 Updating D*_n using Gibbs step
The full conditional distribution of D*_n is n-variate normal with mean
\[
E[D^*_n \mid \cdots] = \Bigg\{ \frac{\Sigma_{g,D^*_n}^{-1}}{\sigma_g^2} + A_{g,D^*_n}^{-1}\Bigg( \sum_{t=1}^{T} \frac{s_{g,D^*_n}(x^*_{t+1,t})\, s_{g,D^*_n}(x^*_{t+1,t})'}{\sigma_{g,\eta,t}^2} \Bigg) A_{g,D^*_n}^{-1} \Bigg\}^{-1}
\]
\[
\times \Bigg\{ \frac{\Sigma_{g,D^*_n}^{-1}\mu_{g,D^*_n}}{\sigma_g^2} + A_{g,D^*_n}^{-1} \sum_{t=1}^{T} \frac{s_{g,D^*_n}(x^*_{t+1,t})\left\{ x_{t+1} - \beta_g'\left( h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) \right) \right\}}{\sigma_{g,\eta,t}^2} \Bigg\} \tag{33}
\]
and variance
\[
V[D^*_n \mid \cdots] = \Bigg\{ \frac{\Sigma_{g,D^*_n}^{-1}}{\sigma_g^2} + A_{g,D^*_n}^{-1}\Bigg( \sum_{t=1}^{T} \frac{s_{g,D^*_n}(x^*_{t+1,t})\, s_{g,D^*_n}(x^*_{t+1,t})'}{\sigma_{g,\eta,t}^2} \Bigg) A_{g,D^*_n}^{-1} \Bigg\}^{-1}. \tag{34}
\]
In (33) and (34), µ_{g,D*_n} and Σ_{g,D*_n} are given by (9) and (10), respectively, of GMRB.

S2.8 Updating x_0 using MH step
Let β_f = (β_{0,f}', β_{1,f})' and β_g = (β_{0,g}', β_{1,g})', where β_{0,f} and β_{0,g} are two-component vectors. Assuming the prior of x_0 to be normal with mean µ_{x_0} and variance σ_{x_0}², and setting σ_g² = 0, the full conditional distribution of x_0 is univariate normal, with mean and variance given, respectively, by
\[
\xi_0 = \left( \frac{1}{\sigma_{x_0}^2} + \frac{\beta_{1,g}^2}{\sigma_\eta^2} \right)^{-1}\left( \frac{\mu_{x_0}}{\sigma_{x_0}^2} + \frac{(x_1 - k_1'\beta_{0,g})\beta_{1,g}}{\sigma_\eta^2} \right) \tag{35}
\]
and
\[
\phi_0^2 = \left( \frac{1}{\sigma_{x_0}^2} + \frac{\beta_{1,g}^2}{\sigma_\eta^2} \right)^{-1}. \tag{36}
\]
In (35), for any t, k_t = (1, t)'. We use q_{x_0}(x_0) ≡ N(x_0 : ξ_0, φ_0²) as the proposal distribution for updating x_0 using an MH step. Observe that, under σ_g² = 0, g(x*_{1,0}) = h(x*_{1,0})'β_g with probability one; hence [x_1 | g(x*_{1,0}), x_0, β_g, σ_η²] ∼ N(h(x*_{1,0})'β_g, σ_η²), which has been taken into account while constructing the above proposal distribution. Note that this proposal will only be efficient when g(·, ·) is close to linear. As a result, for nonlinear applications, we shall often use other proposal mechanisms, such as the normal random walk.

S2.9 Updating {x_1, …, x_T} using MH steps
We construct proposal distributions for simulating {x_1, …, x_T} based on linear observational and evolutionary equations, setting σ_f² = σ_g² = 0. Thus, for t = 0, …, T − 1, the proposal distributions for x_{t+1} are of the form q_{x_{t+1}}(x_{t+1}) ≡ N(x_{t+1} : ξ_{t+1}, φ²_{t+1}), that is, a normal distribution with mean ξ_{t+1} and variance φ²_{t+1}, where
\[
\xi_{t+1} = \left( \frac{1 + \beta_{1,g}^2}{\sigma_\eta^2} + \frac{\beta_{1,f}^2}{\sigma_\epsilon^2} \right)^{-1}
\left( \frac{k_{t+1}'\beta_{0,g} + \beta_{1,g}x_t + (x_{t+2} - k_{t+2}'\beta_{0,g})\beta_{1,g}}{\sigma_\eta^2} + \frac{(y_{t+1} - k_{t+1}'\beta_{0,f})\beta_{1,f}}{\sigma_\epsilon^2} \right) \tag{37}
\]
and
\[
\phi_{t+1}^2 = \left( \frac{1 + \beta_{1,g}^2}{\sigma_\eta^2} + \frac{\beta_{1,f}^2}{\sigma_\epsilon^2} \right)^{-1}. \tag{38}
\]
These proposal mechanisms will be efficient only if both f(·, ·) and g(·, ·) are close to linear. Hence, for nonlinear situations, we shall consider other proposal distributions.
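Under this linearization, (37)-(38) are a precision-weighted combination of the evolution, look-ahead and observation terms. The helper below computes them; the function and argument names are assumptions of this sketch.

```python
import numpy as np

def linear_proposal_params(t, x_t, x_t2, y_t1, beta0_f, beta1_f,
                           beta0_g, beta1_g, sigma_eta2, sigma_eps2):
    """Mean and variance of the linearized proposal for x_{t+1}, cf. (37)-(38);
    k_t = (1, t)' as in the text."""
    k_t1 = np.array([1.0, t + 1.0])
    k_t2 = np.array([1.0, t + 2.0])
    prec = (1.0 + beta1_g ** 2) / sigma_eta2 + beta1_f ** 2 / sigma_eps2
    num = ((k_t1 @ beta0_g + beta1_g * x_t
            + (x_t2 - k_t2 @ beta0_g) * beta1_g) / sigma_eta2
           + (y_t1 - k_t1 @ beta0_f) * beta1_f / sigma_eps2)
    return num / prec, 1.0 / prec

# With beta1_f = beta1_g = 0, the proposal reduces to the evolution mean
# k_{t+1}'beta0_g with variance sigma_eta2.
mean_p, var_p = linear_proposal_params(0, 0.0, 0.0, 0.0,
                                       np.array([0.0, 0.0]), 0.0,
                                       np.array([1.0, 0.0]), 0.0, 0.25, 0.1)
```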
S2.10 Updating x_{T+1} using Gibbs step

The full conditional distribution of x_{T+1} is normal with mean and variance given, respectively, by
\[
E[x_{T+1} \mid \cdots] = h(x^*_{T+1,T})'\beta_g + s_{g,D^*_n}(x^*_{T+1,T})' A_{g,D^*_n}^{-1}\left(D^*_n - H_{D^*_n}\beta_g\right) \tag{39}
\]
and
\[
V[x_{T+1} \mid \cdots] = \sigma_\eta^2 + \sigma_g^2\left\{ 1 - s_{g,D^*_n}(x^*_{T+1,T})' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{T+1,T}) \right\}. \tag{40}
\]

S3. SIMULATION STUDIES IN THE UNIVARIATE SITUATIONS
In this section we consider two simulation studies. In the first study we consider univariate data generated from linear observational and evolutionary equations, the linear models being invariant with respect to time; we fit our Gaussian process based model to this linear data. In the second case we generate data from the parametric, univariate growth model of Carlin, Polson & Stoffer (1992) and fit our nonparametric model to the data, thereby validating our model and methodology when the data are governed by nonlinear observational and evolutionary equations.
S3.1 Data generated from a univariate linear model
We generate data from the following linear model:
\[
x_t = \beta_{x,0} + \beta_{x,1} x_{t-1} + u_t \tag{41}
\]
\[
y_t = \beta_{y,0} + \beta_{y,1} x_t + v_t, \quad t = 1, \ldots, 100, \tag{42}
\]
where x_0 = 0. As before, we assume u_t ∼ N(0, σ_ε²) and v_t ∼ N(0, σ_η²), independently for all t, and set σ_ε = 0.1 = σ_η. We fix the true values of (β_{x,0}, β_{x,1}, β_{y,0}, β_{y,1}) to be (1.0, 0.1, 8.0, 0.05), respectively. We then generated the data set consisting of 101 data points using these true values. As before, we set aside the last data point for the purpose of forecasting.

Note that under the above true model, the coefficient of time t is zero for both the observational and the evolutionary equations. In other words, the functions are not time-varying. Thus, while fitting our Gaussian process based model, we expect the corresponding coefficients in the mean functions, β_{2,f} and β_{2,g}, and the corresponding smoothness parameters r_{1,f} and r_{1,g}, to have large posterior probabilities around zero. The results of our model implementation show that this is indeed the case.

S3.1.1 Choice of grid and MCMC implementation

To set up the grid G_n we first note the
interval containing the entire true time series; in this example, the entire true time series falls within [0.8, 1.5]. We then consider a much larger interval, [−30, 30], containing the aforementioned interval, and divide the larger interval into 100 subintervals of equal length. Then we select a value randomly from each subinterval. This yields the values of the second component of the two-dimensional grid G_n. It is worth mentioning that we experimented with other reasonable grid choices, but the results suggested considerable robustness with respect to the grid choice. For the first component we generate a number uniformly from each of the 100 subintervals [i, i + 1]; i = 0, …, 99.

We discarded the first 10,000 MCMC iterations as burn-in and stored the next 50,000 iterations for inference. For updating {x_0, …, x_T} we used the linear model based proposals detailed in Sections S2.8 and S2.9. Since the true models are also linear, the performance of these proposal distributions, as measured by informal MCMC diagnostics, turned out to be adequate. However, updating {σ_f, σ_g, σ_ε, σ_η} using the linear model based proposals detailed in Sections S2.3 and S2.4 did not perform satisfactorily, since the assumptions σ_f ≈ σ_ε and σ_g ≈ σ_η do not hold here (clearly, since the true models are linear, σ_f = σ_g = 0, whereas σ_ε and σ_η are positive). So, to update the variance parameters we used normal random walk proposals with variance 0.05, which worked adequately. We note, however, that it is possible to modify the proposals provided in Sections S2.3 and S2.4 so that the fact that σ_f = σ_g = 0 is taken into account, but since our random walk proposals performed adequately here, we refrained from further experiments on proposal distributions. It took around 15 hours on an ordinary laptop to implement this experiment.

S3.1.2 Results of model-fitting

Figures S1, S2 and S3 show the posterior distributions of
the unknowns. Note that whenever applicable, the true value is well captured by the posteriors in question. Also, as expected, the posteriors of β_{2,f}, β_{2,g}, r_{1,f} and r_{1,g} have high posterior probabilities around zero, supporting the time-invariant nature of the true observational and evolutionary equations. The true values of σ_ε, σ_η and x_0 also fell well within the highest density regions of their respective posterior distributions. The true values of x_{101} and y_{101} are well captured by their respective forecast posteriors.

Figure S1: Simulation study with data generated from the univariate linear model: Posterior densities of β_{0,f}, β_{1,f}, β_{2,f}, β_{0,g}, β_{1,g}, β_{2,g}, σ_f, σ_g, and σ_ε. The solid line stands for the true values of the respective parameters.

Figure S2: Simulation study with data generated from the univariate linear model: Posterior densities of σ_η, r_{1,f}, r_{2,f}, r_{1,g}, r_{2,g}, x_0, x_{T+1} (one-step forecasted x), and y_{T+1} (one-step forecasted y). The solid line stands for the true values of the respective parameters.

Figure S3: Simulation study with data generated from the univariate linear model: 95% highest posterior density credible intervals of the time series x_1, …, x_T. The solid line stands for the true time series.

The true time series {x_1, …, x_{100}} again fell well within the lower and upper 95% highest posterior density credible limits, but the credible regions seem to be wider than under the usual parametric approaches. This is a consequence of acknowledging uncertainty about the functional forms of both f and g. Particularly because the posterior distribution of the state variables depends upon both the unknown functions f and g, it is perhaps not surprising that the credible intervals reflect the effect of uncertainty about the random functions. In light of the knowledge of the true time series in this simple simulated example, which has a linear structure, the relatively wide credible intervals may not seem to indicate very encouraging performance, but in complex, realistic situations, where any structure of the true time series is impossible to guess, the relatively large credible intervals make good practical sense.

S3.2 Data generated from a univariate parametric growth model
Following Carlin et al. (1992) we generate data from the following model:

$x_t = \alpha x_{t-1} + \beta x_{t-1}/(1 + x_{t-1}^2) + \gamma\cos(1.2(t-1)) + u_t,$   (43)
$y_t = x_t^2/20 + v_t, \quad t = 1, \ldots, 100,$   (44)

where $x_0 = 0$. Here we assume $u_t \sim N(0, \sigma^2_\epsilon)$ and $v_t \sim N(0, \sigma^2_\eta)$, independently for all $t$. We set $\sigma_\epsilon = \sigma_\eta = 0.1$ and fix the true values of $(\alpha, \beta, \gamma)$ at $(0.05, 0.1, 0.2)$, respectively. We then generated the data set consisting of 101 data points using these true values. As before, we set aside the last data point for the purpose of forecasting.
S3.2.1. Choice of grid and MCMC implementation
In this example the entire true time series falls within [−0.33, 0.32]. As before, we consider the much larger interval [−30, 30], divide it into 100 subintervals of equal length, and then select a value randomly from each such subinterval. This yields the values of the second component of the two-dimensional grid Gn. As in the linear model experiment, here also we experimented with other reasonable grid choices, and even in this nonlinear experiment the results exhibited considerable robustness. For the first component we generate a number uniformly from each of the 100 subintervals [i, i + 1]; i = 0, . . . , 99. We discarded the first 10,000 MCMC iterations as burn-in and stored the next 50,000 iterations for inference. We used the normal random walk proposal with variance 0.05 for updating σf, σg, σε and ση. For updating x0 we used the normal random walk proposal with variance 1, but for updating xt; t = 1, . . . , T, we found in this example, using informal convergence diagnostics, that the random walk proposal with variance t provided much better mixing than those with constant variances. These proposals also performed much better than the linear model based proposals detailed in Sections S2.8 and S2.9. It took around 15 hours on an ordinary laptop to implement this experiment.
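The time-scaled random walk proposal for the latent states can be sketched as follows; a proposal variance of t corresponds to a standard deviation of sqrt(t). The starting value and range of t are illustrative.

```python
import random

def propose_state(x_t, t):
    """Random walk proposal for x_t with proposal variance t,
    i.e. standard deviation sqrt(t); t is the time index."""
    return x_t + random.gauss(0.0, t ** 0.5)

random.seed(0)
# Generate one proposal per time point t = 1, ..., 100 from current value 0.
proposals = [propose_state(0.0, t) for t in range((1), 101)]
```

Each proposal would then be accepted or rejected with the usual Metropolis–Hastings ratio computed from the full conditional of xt.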
S3.2.2. Results of model-fitting
Figures S4, S5 and S6 show the posterior distributions of the unknowns. Note that whenever applicable, the true value is well captured by the posteriors in question. To avoid any confusion, we mention that the true values of (α, β, γ) associated with the true evolutionary equation (43), for example, are not comparable with the parameters (β0,g, β1,g, β2,g) associated with our Gaussian process model. Hence, the figures do not show any true value associated with the posteriors of the Gaussian process parameters. Note that this is in contrast with the previous example, where the true model has a linear state space structure. There, the true functional forms are comparable with the linear mean functions of the underlying Gaussian processes, and that is why the figures of the posterior distributions of the Gaussian process parameters also contained vertical lines denoting the true values of the parameters. As before, the true values of x101 and y101 are captured very well by their respective posteriors. The true time series {x1, . . . , x100} again fell well within the lower and upper 95% highest posterior density credible limits. However, note that the credible intervals in this case are somewhat wider than those in the corresponding linear model, which is to be expected. The true values of σε, ση and x0 also fell well within the highest density regions of their respective posterior distributions. That our Gaussian process approach fits the data so well, despite the fact that the functions f(·, ·) and g(·, ·) are highly nonlinear, is quite encouraging.

Figure S4: Simulation study with data generated from the univariate growth model of CPS: Posterior densities of β0,f, β1,f, β2,f, β0,g, β1,g, β2,g, σf, σg, and σε. The solid line stands for the true values of the respective parameters.

Figure S5: Simulation study with data generated from the univariate growth model of CPS: Posterior densities of ση, r1,f, r2,f, r1,g, r2,g, x0, xT+1 (one-step forecasted x), and yT+1 (one-step forecasted y). The solid line stands for the true values of the respective parameters.

Figure S6: Simulation study with data generated from the univariate growth model of CPS: 95% highest posterior density credible intervals of the time series x1, . . . , xT. The solid line stands for the true time series.

S4. EXTENSION OF OUR GAUSSIAN PROCESS BASED APPROACH TO THE MULTIVARIATE SITUATION
Now we extend our nonparametric dynamic model to the case where both $y_t$ and $x_t$ are multivariate. In particular, we assume that they are $p$-component and $q$-component vectors, respectively. Then our multivariate model is of the form

$y_t = f(x^*_{t,t}) + \epsilon_t, \quad \epsilon_t \sim N_p(0, \Sigma_\epsilon),$   (45)
$x_t = g(x^*_{t,t-1}) + \eta_t, \quad \eta_t \sim N_q(0, \Sigma_\eta),$   (46)

where $x_0 \sim N_q(\mu_{x_0}, \Sigma_{x_0})$, with $\mu_{x_0}$, $\Sigma_{x_0}$ assumed known. In the above, $f(\cdot) = (f_1(\cdot), \ldots, f_p(\cdot))'$ is a function with $p$ components and $g(\cdot) = (g_1(\cdot), \ldots, g_q(\cdot))'$ is a function consisting of $q$ components. We assume that $f(\cdot)$ is a $p$-variate Gaussian process with mean $E[f(\cdot)] = B_f' h(\cdot)$ and covariance function $\mathrm{cov}(f(z_1), f(z_2)) = c_f(z_1, z_2)\Sigma_f$, for any $(q+1)$-dimensional inputs $z_1, z_2$. Here $h(\cdot) = (h_1(\cdot), \ldots, h_m(\cdot))'$ and $B_f = (\beta_{1,f}, \ldots, \beta_{p,f})$, where, for $j = 1, \ldots, p$, the $\beta_{j,f}$ are $m$-dimensional column vectors; clearly, $m = q + 2$. Also, $c_f(z_1, z_2) = \exp\{-(z_1 - z_2)' R_f (z_1 - z_2)\}$, where $R_f$ is a diagonal matrix consisting of $(q+1)$ smoothness parameters, denoted by $\{r_{1,f}, \ldots, r_{(q+1),f}\}$. Similarly, we assume that $g(\cdot)$ is a $q$-variate Gaussian process with mean $E[g(\cdot)] = B_g' h(\cdot)$ and covariance function $c_g(z_1, z_2)\Sigma_g$, where $c_g(z_1, z_2) = \exp\{-(z_1 - z_2)' R_g (z_1 - z_2)\}$; the notation is analogous to that used for the description of the Gaussian process $f(\cdot)$.
S4.1 Multivariate data and its distribution
The multivariate data are given by the following $T \times p$ matrix: $D_T = (y_1, y_2, \ldots, y_T)'$. Writing $D_T$ vectorially as a $Tp$-vector for convenience, as $D_{Tp} = (y_1', y_2', \ldots, y_T')'$, it follows that $[D_{Tp} \mid B_f, \Sigma_f, R_f, \Sigma_\epsilon]$ is $Tp$-variate normal with mean vector

$E[D_{Tp} \mid B_f, \Sigma_f, R_f, \Sigma_\epsilon] = \left((B_f' h(x^*_{1,1}))', (B_f' h(x^*_{2,2}))', \ldots, (B_f' h(x^*_{T,T}))'\right)' = \mu_{D_{Tp}}$ (say)   (47)

and covariance matrix

$V[D_{Tp} \mid B_f, \Sigma_f, R_f, \Sigma_\epsilon] = A_{f,D_T} \otimes \Sigma_f + I_T \otimes \Sigma_\epsilon = \Sigma_{D_{Tp}}$ (say).   (48)

Since (48) does not admit a Kronecker product form, the distribution of $D_{Tp}$ cannot be written as matrix normal, which requires the right and left covariance matrices to form a Kronecker product in the corresponding multivariate distribution of the vectorized matrix. It follows that $[y_{T+1} = f(x^*_{T+1,T+1}) + \epsilon_{T+1} \mid D_T, x_0, x_1, \ldots, x_{T+1}, B_f, \Sigma_f, R_f, \Sigma_\epsilon]$ is $p$-variate normal with mean

$\mu_{y_{T+1}} = B_f' h(x^*_{T+1,T+1}) + (D_T - H_{D_T} B_f)' A_{f,D_T}^{-1} s_{f,D_T}(x^*_{T+1,T+1})$   (49)

and variance

$\Sigma_{y_{T+1}} = \left\{1 - s_{f,D_T}(x^*_{T+1,T+1})' A_{f,D_T}^{-1} s_{f,D_T}(x^*_{T+1,T+1})\right\}\Sigma_f + \Sigma_\epsilon.$   (50)
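The covariance structure in (48) can be assembled directly. In the sketch below the dimensions, the correlation matrix standing in for $A_{f,D_T}$, and the two covariance matrices are all arbitrary placeholders chosen for illustration.

```python
import numpy as np

T, p = 4, 2  # illustrative sizes

# An arbitrary T x T correlation matrix playing the role of A_{f,D_T}.
z = np.arange(T, dtype=float)
A_f = np.exp(-(z[:, None] - z[None, :]) ** 2)

Sigma_f = np.array([[1.0, 0.3], [0.3, 1.0]])  # stand-in for Sigma_f
Sigma_eps = 0.1 * np.eye(p)                   # stand-in for Sigma_eps

# Equation (48): V[D_Tp] = A_f (x) Sigma_f + I_T (x) Sigma_eps.
V = np.kron(A_f, Sigma_f) + np.kron(np.eye(T), Sigma_eps)

# The sum of two Kronecker products is in general not itself a single
# Kronecker product, which is why D_Tp is Tp-variate normal but not
# matrix normal.
assert V.shape == (T * p, T * p)
assert np.allclose(V, V.T)
assert np.all(np.linalg.eigvalsh(V) > 0)  # positive definite
```

The final assertions simply confirm that the assembled matrix is a valid (symmetric, positive definite) covariance matrix.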
S4.2 Distributions of $g(x^*_{1,0})$ and $D^*_n$
Conditional on $x_0$, $g(x^*_{1,0})$ is $q$-variate normal with mean $B_g' h(x^*_{1,0})$ and covariance matrix $\Sigma_g$. In contrast with the distribution of $D_T$, $D_{z,nq} = (g(z_1)', g(z_2)', \ldots, g(z_n)')'$ has an $nq$-variate normal distribution with mean

$E[D_{z,nq} \mid B_g, \Sigma_g, R_g] = \left((B_g' h(z_1))', (B_g' h(z_2))', \ldots, (B_g' h(z_n))'\right)' = \mu_{D_{z,nq}}$ (say)   (51)

and covariance matrix

$V[D_{z,nq} \mid B_g, \Sigma_g, R_g] = A_{g,D^*_n} \otimes \Sigma_g = \Sigma_{D_{z,nq}}$ (say).   (52)

Hence, the distribution of the $n \times q$-dimensional matrix $D^*_n = (g(z_1), g(z_2), \ldots, g(z_n))'$ is matrix normal:

$[D^*_n \mid B_g, \Sigma_g, R_g] \sim N_{n,q}\left(H_z B_g, A_{g,D^*_n}, \Sigma_g\right).$   (53)

Conditionally on $(x_0, g(x^*_{1,0}))$, it follows that $D^*_n$ is $n \times q$-dimensional matrix normal:

$[D^*_n \mid g(x^*_{1,0}), x_0, B_g, \Sigma_g, R_g, \Sigma_\eta] \sim N_{n,q}\left(\mu_{g,D^*_n}, \Sigma_{g,D^*_n}, \Sigma_g\right).$   (54)

In (54), $\mu_{g,D^*_n}$ is the mean matrix, given by

$\mu_{g,D^*_n} = H_{D^*_n} B_g + s_{g,D^*_n}(x^*_{1,0})\left(g(x^*_{1,0})' - h(x^*_{1,0})' B_g\right),$   (55)

and

$\Sigma_{g,D^*_n} = A_{g,D^*_n} - s_{g,D^*_n}(x^*_{1,0})\, s_{g,D^*_n}(x^*_{1,0})'.$   (56)

Here we slightly abuse notation to denote both the univariate and multivariate versions of the mean matrix and the right covariance matrix by $\mu_{g,D^*_n}$ and $\Sigma_{g,D^*_n}$, respectively (see (9) and (11) of GMRB).
S4.3 Joint distribution of $\{x_0, \ldots, x_{T+1}, D^*_n\}$
Note that

$[x_1 \mid g(x^*_{1,0}), x_0, B_g, \Sigma_g] \sim N_q\left(g(x^*_{1,0}), \Sigma_\eta\right),$   (57)

and for $t = 1, \ldots, T$, the conditional distribution $[x_{t+1} = g(x^*_{t+1,t}) + \eta_{t+1} \mid D^*_n, x_t, B_g, \Sigma_g, R_g, \Sigma_\eta]$ is $q$-variate normal with mean

$\mu_{x_t} = B_g' h(x^*_{t+1,t}) + (D^*_n - H_{D^*_n} B_g)' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t})$   (58)

and variance

$\Sigma_{x_t} = \left\{1 - s_{g,D^*_n}(x^*_{t+1,t})' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t})\right\}\Sigma_g + \Sigma_\eta.$   (59)

Since $[x_0] \sim N_q\left(\mu_{x_0}, \Sigma_{x_0}\right)$ and the distribution of $D^*_n$ is given by (53), the joint distribution is obtained by taking products of the individual distributions.
S4.4
Prior distributions
We assume the following prior distributions. For $i = 1, \ldots, (q+1)$,

$[\log(r_{i,f})] \stackrel{iid}{\sim} N\left(\mu_{R_f}, \sigma^2_{R_f}\right);$   (60)

for $i = 1, \ldots, (p+1)$,

$[\log(r_{i,g})] \stackrel{iid}{\sim} N\left(\mu_{R_g}, \sigma^2_{R_g}\right).$   (61)

Further,

$[\Sigma_\epsilon] \propto |\Sigma_\epsilon|^{-\frac{\nu_\epsilon+p+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma_\epsilon^{-1}\Sigma_{\epsilon,0}\right)\right\}, \quad \nu_\epsilon > p-1;$   (62)
$[\Sigma_\eta] \propto |\Sigma_\eta|^{-\frac{\nu_\eta+q+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma_\eta^{-1}\Sigma_{\eta,0}\right)\right\}, \quad \nu_\eta > q-1;$   (63)
$[\Sigma_f] \propto |\Sigma_f|^{-\frac{\nu_f+p+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma_f^{-1}\Sigma_{f,0}\right)\right\}, \quad \nu_f > p-1;$   (64)
$[\Sigma_g] \propto |\Sigma_g|^{-\frac{\nu_g+q+1}{2}} \exp\left\{-\frac{1}{2}\mathrm{tr}\left(\Sigma_g^{-1}\Sigma_{g,0}\right)\right\}, \quad \nu_g > q-1;$   (65)
$[B_f \mid \Sigma_f] \sim N_{m,p}\left(B_{f,0}, \Sigma_{B_f,0}, \psi\Sigma_f\right);$   (66)
$[B_g \mid \Sigma_g] \sim N_{m,q}\left(B_{g,0}, \Sigma_{B_g,0}, \psi\Sigma_g\right).$   (67)
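The covariance priors in (62)–(65) all share the same inverse-Wishart-type kernel, so their log densities (up to an additive constant) can be evaluated by a single routine; the scale matrix and degrees of freedom passed below are placeholders.

```python
import numpy as np

def log_iw_kernel(Sigma, Sigma0, nu):
    """log of |Sigma|^{-(nu+d+1)/2} exp(-0.5 tr(Sigma^{-1} Sigma0)),
    dropping the normalizing constant; d is the matrix dimension."""
    d = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    trace_term = np.trace(np.linalg.solve(Sigma, Sigma0))
    return -0.5 * (nu + d + 1) * logdet - 0.5 * trace_term

# At Sigma = Sigma0 = I_2 with nu = 2: logdet = 0 and the trace term is 2,
# so the log kernel equals -1.
val = log_iw_kernel(np.eye(2), np.eye(2), nu=2.0)
assert abs(val - (-1.0)) < 1e-12
```

This evaluation is what a Metropolis–Hastings step for any of the covariance matrices would call when computing its acceptance ratio.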
All the prior parameters are assumed known; their choices will be discussed in the specific applications to be considered. The full conditionals have forms analogous to those in the univariate case, provided by equations (6) to (19), but now the univariate distributions must be replaced with multivariate distributions and the multivariate distributions with matrix-variate distributions. Interestingly, in some cases the full conditionals are not available in closed form, although they were available in the one-dimensional version of the problem. For example, although the full conditionals of βf and βg were available in the one-dimensional problem, the corresponding distributions of Bf and Bg are no longer available in closed form. The difficulty in obtaining the closed form of the full conditional of Bf can be attributed to the fact that (48) cannot be represented as a single Kronecker product. The non-availability of a closed form in the case of Bg is due to the fact that the covariance matrices of the xt are additive. In fact, it turns out that only the full conditional of xT+1 is of standard form. Details of our MCMC methods for the multivariate case are provided in the next section.
S5. DETAILS OF MCMC SAMPLING IN THE MULTIVARIATE SITUATION
To obtain good proposal distributions, we use our earlier strategy of ignoring the error terms $\epsilon_t$ and $\eta_t$. Below we provide details of the proposal distributions used in each case. For our purpose, we slightly abuse notation as in Section 4 of GMRB; that is, we denote by $G_{n+1}$, $D^*_{n+1}$, $A_{g,n+1}$ and $H_{g,D^*_{n+1}}$ the multivariate analogues of the quantities corresponding to the univariate situation described in Section 4 of GMRB. In other words, let $G_{n+1} = G \cup \{x^*_{1,0}\}$, $D^*_{n+1} = \left(D^{*\prime}_n, g(x^*_{1,0})\right)'$,

$A_{g,n+1} = \begin{pmatrix} A_{g,D^*_n} & s_{g,D^*_n}(x^*_{1,0}) \\ s_{g,D^*_n}(x^*_{1,0})' & 1 \end{pmatrix},$

and $H_{g,D^*_{n+1}}' = [H_{g,D^*_n}', h(x^*_{1,0})]$.

S5.1 Proposal distribution for updating $B_f$
Assuming $\Sigma_\epsilon = \Sigma_f$ in our multivariate dynamic state-space model, the full conditional of $B_f$ is $m \times p$-variate matrix normal:

$B_f \sim N_{m,p}\left(\mu_{B_f}, \Sigma_{B_f}, \Sigma_f\right),$   (68)

where

$\mu_{B_f} = \left\{H_{D_T}'(A_{f,D_T} + I_T)^{-1} H_{D_T} + \psi^{-1}\Sigma_{B_f,0}^{-1}\right\}^{-1} \left\{H_{D_T}'(A_{f,D_T} + I_T)^{-1} D_T + \psi^{-1}\Sigma_{B_f,0}^{-1} B_{f,0}\right\}$   (69)

and

$\Sigma_{B_f} = \left\{H_{D_T}'(A_{f,D_T} + I_T)^{-1} H_{D_T} + \psi^{-1}\Sigma_{B_f,0}^{-1}\right\}^{-1}.$   (70)

S5.2 Proposal distribution for $B_g$
Assuming $\Sigma_\eta = \Sigma_g$, the conditional distribution of $B_g$ is $m \times q$-variate matrix normal:

$B_g \sim N_{m,q}\left(\mu_{B_g}, \Sigma_{B_g}, \Sigma_g\right),$   (71)

where

$\mu_{B_g} = \Sigma_{B_g}\Big\{\psi^{-1}\Sigma_{B_g,0}^{-1} B_{g,0} + H_{D^*_{n+1}}' A_{g,D^*_{n+1}}^{-1} D^*_{n+1} + \sum_{t=1}^{T} \sigma_{g,t}^{-2}\left(h(x^*_{t+1,t}) - H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t})\right)\left(x_{t+1} - D_z' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t})\right)'\Big\}$   (72)

and

$\Sigma_{B_g} = \Big\{\psi^{-1}\Sigma_{B_g,0}^{-1} + H_{D^*_{n+1}}' A_{g,D^*_{n+1}}^{-1} H_{D^*_{n+1}} + \sum_{t=1}^{T} \sigma_{g,t}^{-2}\left(H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) - h(x^*_{t+1,t})\right)\left(H_{D^*_n}' A_{g,D^*_n}^{-1} s_{g,D^*_n}(x^*_{t+1,t}) - h(x^*_{t+1,t})\right)'\Big\}^{-1}.$   (73)

In (72) and (73), $\sigma^2_{g,t}$ is given by (27), with the obvious notational change from the univariate $x_t$ to the multivariate $x_t$. The corresponding changes from the univariate $x^*$ to the multivariate $x^*$ are also self-explanatory.

S5.3 Proposal distributions for updating $\Sigma_f$ and $\Sigma_g$
Setting $\Sigma_\epsilon = \Sigma_f$ and $\Sigma_\eta = \Sigma_g$, we obtain the following inverse Wishart proposal distributions for $\Sigma_f$ and $\Sigma_g$:

$q_{\Sigma_f}(\Sigma_f) \propto |\Sigma_f|^{-\frac{\nu_f+p+1+T+m}{2}} \exp\Big[-\frac{1}{2}\mathrm{tr}\Big\{\Sigma_f^{-1}\Big(\Sigma_{f,0} + \psi^{-1}(B_f - B_{f,0})'\Sigma_{B_f,0}^{-1}(B_f - B_{f,0}) + (D_T - H_{D_T}B_f)'(A_{f,D_T} + I_T)^{-1}(D_T - H_{D_T}B_f)\Big)\Big\}\Big]$   (74)

and

$q_{\Sigma_g}(\Sigma_g) \propto |\Sigma_g|^{-\frac{\nu_g+q+2+m+n+T}{2}} \exp\Big[-\frac{1}{2}\mathrm{tr}\Big\{\Sigma_g^{-1}\Big(\Sigma_{g,0} + \psi^{-1}(B_g - B_{g,0})'\Sigma_{B_g,0}^{-1}(B_g - B_{g,0}) + (x_1 - g(x^*_{1,0}))(x_1 - g(x^*_{1,0}))' + \sum_{t=1}^{T}\sigma_{g,t}^{-2}\big\{x_{t+1} - B_g'h(x^*_{t+1,t}) - (D^*_n - H_{D^*_n}B_g)'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{t+1,t})\big\}\big\{x_{t+1} - B_g'h(x^*_{t+1,t}) - (D^*_n - H_{D^*_n}B_g)'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{t+1,t})\big\}' + (D^*_{n+1} - H_{D^*_{n+1}}B_g)'A_{g,D^*_{n+1}}^{-1}(D^*_{n+1} - H_{D^*_{n+1}}B_g)\Big)\Big\}\Big].$   (75)

S5.4 Proposal distributions for updating $\Sigma_\epsilon$, $\Sigma_\eta$, $R_f$ and $R_g$
As in the univariate case, here we set $\Sigma_f = \Sigma_\epsilon$ and $\Sigma_g = \Sigma_\eta$. Then the proposal distributions of $\Sigma_\epsilon$ and $\Sigma_\eta$ are given by the following:

$q_{\Sigma_\epsilon}(\Sigma_\epsilon) \propto |\Sigma_\epsilon|^{-\frac{\nu_\epsilon+p+1+T}{2}} \exp\Big[-\frac{1}{2}\mathrm{tr}\Big\{\Sigma_\epsilon^{-1}\Big(\Sigma_{\epsilon,0} + (D_T - H_{D_T}B_f)'(A_{f,D_T} + I_T)^{-1}(D_T - H_{D_T}B_f)\Big)\Big\}\Big]$   (76)

and

$q_{\Sigma_\eta}(\Sigma_\eta) \propto |\Sigma_\eta|^{-\frac{\nu_\eta+q+3+n+T}{2}} \exp\Big[-\frac{1}{2}\mathrm{tr}\Big\{\Sigma_\eta^{-1}\Big(\Sigma_{\eta,0} + (x_1 - g(x^*_{1,0}))(x_1 - g(x^*_{1,0}))' + (D^*_{n+1} - H_{D^*_{n+1}}B_g)'A_{g,D^*_{n+1}}^{-1}(D^*_{n+1} - H_{D^*_{n+1}}B_g) + \sum_{t=1}^{T}\sigma_{g,t}^{-2}\big\{x_{t+1} - B_g'h(x^*_{t+1,t}) - (D^*_n - H_{D^*_n}B_g)'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{t+1,t})\big\}\big\{x_{t+1} - B_g'h(x^*_{t+1,t}) - (D^*_n - H_{D^*_n}B_g)'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{t+1,t})\big\}'\Big)\Big\}\Big].$   (77)

As in the case of the corresponding univariate proposals, in (77) the factor

$|\Sigma_\eta|^{-\frac{n}{2}}\exp\Big[-\frac{1}{2}\mathrm{tr}\Big\{\Sigma_\eta^{-1}(D^*_{n+1} - H_{D^*_{n+1}}B_g)'A_{g,D^*_{n+1}}^{-1}(D^*_{n+1} - H_{D^*_{n+1}}B_g)\Big\}\Big]$

occurs because, under $\Sigma_g = \Sigma_\eta$, $[D^*_{n+1} \mid B_g, \Sigma_g, R_g] \sim N_{n,q}\left(H_{D^*_{n+1}}B_g, A_{g,D^*_{n+1}}, \Sigma_\eta\right)$.

The full conditionals of the smoothness parameters in $R_f$ and $R_g$ are not available in closed form, and as before we suggest Metropolis–Hastings steps with normal random walk proposals with adequately optimized variances.

S5.5 Proposal distribution for updating $g(x^*_{1,0})$
Assuming $\Sigma_\eta = \Sigma_g$, the full conditional of $g(x^*_{1,0})$ is $q$-variate normal with mean and variance given, respectively, by

$\mu^*_g = E[g(x^*_{1,0}) \mid \cdots] = \left\{2 + s_{g,D^*_n}(x^*_{1,0})'\Sigma_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{1,0})\right\}^{-1}\left\{x_1 + B_g'h(x^*_{1,0}) + D_z^{*\prime}\Sigma_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{1,0})\right\}$   (78)

and

$\Sigma^*_g = V[g(x^*_{1,0}) \mid \cdots] = \left\{2 + s_{g,D^*_n}(x^*_{1,0})'\Sigma_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{1,0})\right\}^{-1}\Sigma_g.$   (79)

In the above, $\Sigma_{g,D^*_n}$ is given by (59) of GMRB, and $D^*_z$ is given by

$D^*_z = D^*_n - H_{D^*_n}B_g + s_{g,D^*_n}(x^*_{1,0})h(x^*_{1,0})'B_g.$   (80)

We consider $q_{g(x^*_{1,0})}(g(x^*_{1,0})) \equiv N_q\left(g(x^*_{1,0}) : \mu^*_g, \Sigma^*_g\right)$ as the proposal distribution for updating $g(x^*_{1,0})$.

S5.6 Proposal distribution for updating $D^*_n$
The full conditional distribution of $D^*_n$ in our dynamic model, after setting $\Sigma_\eta = \Sigma_g$, is matrix normal:

$[D^*_n] \sim N_{n,q}\left(\mu^*_{D^*_n}, \Sigma^*_{D^*_n}, \Sigma_g\right),$   (81)

where

$\mu^*_{D^*_n} = \Sigma^*_{D^*_n}\left\{\Sigma_{g,D^*_n}^{-1}\mu_{g,D^*_n} + A_{g,D^*_n}^{-1}\sum_{t=1}^{T}\sigma_{g,t}^{-2}\, s_{g,D^*_n}(x^*_{t+1,t})\left\{x_{t+1}' - \left(h(x^*_{t+1,t})' - s_{g,D^*_n}(x^*_{t+1,t})'A_{g,D^*_n}^{-1}H_{D^*_n}\right)B_g\right\}\right\}$   (82)

and

$\Sigma^*_{D^*_n} = \left\{\Sigma_{g,D^*_n}^{-1} + A_{g,D^*_n}^{-1}\left(\sum_{t=1}^{T}\sigma_{g,t}^{-2}\, s_{g,D^*_n}(x^*_{t+1,t})\,s_{g,D^*_n}(x^*_{t+1,t})'\right)A_{g,D^*_n}^{-1}\right\}^{-1}.$   (83)

In the above, $\mu_{g,D^*_n}$ and $\Sigma_{g,D^*_n}$ are given by (58) and (59) of GMRB, respectively. The distribution (81) will be used as the proposal distribution to update $D^*_n$ using a Metropolis–Hastings step.

S5.7 MH proposal for updating $x_0$
For $j = 1, \ldots, p$, let $\beta_{j,f} = (\beta_{j,f,1}', \beta_{j,f,2}')'$. Likewise, for $j = 1, \ldots, q$, let $\beta_{j,g} = (\beta_{j,g,1}', \beta_{j,g,2}')'$. Here $\beta_{j,f,1}$ and $\beta_{j,g,1}$ are two-dimensional and $\beta_{j,f,2}$ and $\beta_{j,g,2}$ are $q$-dimensional vectors. Also let $a_f = (\beta_{1,f,1}, \ldots, \beta_{p,f,1})'$ and $b_f = (\beta_{1,f,2}, \ldots, \beta_{p,f,2})'$. Similarly, $a_g = (\beta_{1,g,1}, \ldots, \beta_{q,g,1})'$ and $b_g = (\beta_{1,g,2}, \ldots, \beta_{q,g,2})'$. Thus, $a_f$ and $a_g$ are $p \times 2$- and $q \times 2$-dimensional matrices respectively, while $b_f$ and $b_g$ are, respectively, $p \times q$- and $q \times q$-dimensional matrices. Then $B_f'h(x^*) = a_f k_1 + b_f x$ and $B_g'h(x^*) = a_g k_1 + b_g x$, for any $q$-dimensional $x$. With these representations, and setting $\Sigma_g = 0$, the full conditional distribution of $x_0$ is $q$-variate normal with mean and variance given, respectively, by

$E[x_0 \mid \cdots] = \left(\Sigma_{x_0}^{-1} + b_g'\Sigma_\eta^{-1}b_g\right)^{-1}\left\{\Sigma_{x_0}^{-1}\mu_{x_0} + b_g'\Sigma_\eta^{-1}(x_1 - a_g k_1)\right\}$   (84)

and

$V[x_0 \mid \cdots] = \left(\Sigma_{x_0}^{-1} + b_g'\Sigma_\eta^{-1}b_g\right)^{-1}.$   (85)

S5.8 MH proposals for $\{x_1, \ldots, x_T\}$
Ignoring the errors of the functions $f$ and $g$, that is, setting $\Sigma_f = \Sigma_g = 0$, it turns out that the proposal distribution of $x_{t+1}$, for $t = 0, \ldots, T-1$, can be taken as a $q$-variate normal distribution
with mean and variance given, respectively, by

$E[x_{t+1} \mid \cdots] = \left(\Sigma_\eta^{-1} + b_g'\Sigma_\eta^{-1}b_g + b_f'\Sigma_\epsilon^{-1}b_f\right)^{-1}\left\{\Sigma_\eta^{-1}(a_g k_1 + b_g x_t) + b_g'\Sigma_\eta^{-1}(x_{t+2} - a_g k_1) + b_f'\Sigma_\epsilon^{-1}(y_{t+1} - a_f k_1)\right\}$   (86)

and

$V[x_{t+1} \mid \cdots] = \left(\Sigma_\eta^{-1} + b_g'\Sigma_\eta^{-1}b_g + b_f'\Sigma_\epsilon^{-1}b_f\right)^{-1}.$   (87)

S5.9 Gibbs step for updating $x_{T+1}$
The full conditional distribution of $x_{T+1}$ is $q$-variate normal with mean

$E[x_{T+1} \mid \cdots] = B_g'h(x^*_{T+1,T}) + (D^*_n - H_{D^*_n}B_g)'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{T+1,T})$   (88)

and variance

$V[x_{T+1} \mid \cdots] = \Sigma_\eta + \Sigma_g\left\{1 - s_{g,D^*_n}(x^*_{T+1,T})'A_{g,D^*_n}^{-1}s_{g,D^*_n}(x^*_{T+1,T})\right\}.$   (89)

The caveat for using these proposal distributions is that if the underlying assumptions $\Sigma_f = \Sigma_\epsilon$, $\Sigma_g = \Sigma_\eta$ and $\Sigma_g = 0$ do not hold, then the proposal mechanisms will turn out to be much less effective, and more so than in the univariate cases, due to the curse of dimensionality. So we shall also consider other approaches to these updating procedures, to be discussed in the context of the specific applications.

S6. SIMULATION STUDY: MULTIVARIATE CASE
We consider a simulation study where we generate the data from a 4-variate model, constructed using the univariate nonstationary growth model of Carlin et al. (1992). To reduce the computational burden in this multidimensional set-up, we generated 50 data points using the following model: for $t = 1, \ldots, 50$,

$x_{t,1} = \alpha x_{t-1,1} + \beta x_{t-1,1}/(1 + x_{t-1,1}^2) + \gamma\cos(1.2(t-1)) + \eta_{t,1},$   (90)
$x_{t,2} = \alpha x_{t-1,2} + \beta x_{t-1,2}/(1 + x_{t-1,2}^2) + \eta_{t,2},$   (91)
$x_{t,3} = \alpha + \beta x_{t-1,3} + \eta_{t,3},$   (92)
$x_{t,4} = \gamma\cos(1.2(t-1)) + \eta_{t,4},$   (93)
$y_{t,i} = x_{t,i}^2/20 + \epsilon_{t,i}; \quad i = 1, \ldots, 4,$   (94)

where the $\epsilon_{t,i}$ and $\eta_{t,i}$ are iid zero-mean normal random variables with variance 0.1. We set $\alpha = 0.05$, $\beta = 0.1$, $\gamma = 0.2$, and $x_0 = 0$.
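The generating model (90)–(94) translates directly into a simulation loop; the random seed below is arbitrary.

```python
import math, random

random.seed(42)
alpha, beta, gamma = 0.05, 0.1, 0.2
sd = math.sqrt(0.1)              # error variances are 0.1
x = [0.0, 0.0, 0.0, 0.0]         # x_0 = 0 in every coordinate
xs, ys = [], []
for t in range(1, 51):
    cos_term = gamma * math.cos(1.2 * (t - 1))
    # Equations (90)-(93): the four evolutionary components.
    x = [
        alpha * x[0] + beta * x[0] / (1 + x[0] ** 2) + cos_term + random.gauss(0, sd),
        alpha * x[1] + beta * x[1] / (1 + x[1] ** 2) + random.gauss(0, sd),
        alpha + beta * x[2] + random.gauss(0, sd),
        cos_term + random.gauss(0, sd),
    ]
    # Equation (94): the observational components.
    y = [xi ** 2 / 20 + random.gauss(0, sd) for xi in x]
    xs.append(x)
    ys.append(y)
```

Running the loop yields the 50 4-dimensional latent states and observations used in the experiment (with a different random stream than the authors').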
S6.1 Choice of prior parameters
For the priors of $B_f$ and $B_g$, we set $\psi = 1$, $B_{f,0}$ and $B_{g,0}$ to be null matrices, and $\Sigma_{B_f,0}$ and $\Sigma_{B_g,0}$ to be identity matrices. For the priors of $\Sigma_\epsilon$, $\Sigma_\eta$, $\Sigma_f$ and $\Sigma_g$ we set $\nu_\epsilon = p$, $\nu_\eta = q$, $\nu_f = p$, and $\nu_g = q$. We chose $\Sigma_{\epsilon,0} = \Sigma_{\eta,0} = 0.1 I$ and $\Sigma_{f,0} = \Sigma_{g,0} = 0.5 I$, where $I$ denotes the identity matrix. For the lognormal priors of the smoothness parameters we set $\mu_{R_f} = \mu_{R_g} = -0.5$ and $\sigma^2_{R_f} = \sigma^2_{R_g} = 1$. These choices imply that the prior mean and the prior variance of each of the smoothness parameters are, respectively, 1 and 2 (approximately). In our simulation experiment these choices seemed adequate.
S6.2 MCMC implementation
For updating the smoothness parameters we used normal random walk proposals with variance 0.005. To set up the grid $G_z$ for the model-fitting purpose, we considered $[-30, 30]^4$ to be the grid space for the 4-dimensional variable $x$. Indeed, this grid space is much larger than, and contains, the region where the entire true 4-dimensional time series lies. We divided $[-30, 30]$ into 100 equal subintervals and chose a point randomly from each of the 100 subintervals, in each dimension, yielding $n = 100$ 4-dimensional points corresponding to $x$. As before, we chose the first component of the grid (corresponding to the time component) by drawing uniformly from each subinterval $[i, i+1]$; $i = 0, \ldots, 99$.

The block updating proposal for updating $B_f$, described in Section S2.2, worked quite well, but those for block updating $B_g$, $g(x^*_{1,0})$, $D^*_n$, and those for the covariance matrices, as described in Section S2, yielded poor acceptance rates. In order to update the covariance matrices $\Sigma_\epsilon$, $\Sigma_\eta$, $\Sigma_g$ and $\Sigma_f$, we considered the following strategy: we rewrote each matrix in the form $CC'$, where $C$ is a lower triangular matrix, and used a normal random walk with variance 0.005 to update the nonzero elements in a single block. This improved the acceptance rates. For $B_g$, $g(x^*_{1,0})$ and $D^*_n$, the strategy of block updating using a random walk failed. As a remedy we considered transformation-based MCMC (TMCMC), recently developed by Dutta & Bhattacharya (2013); in particular, we used the additive transformation, which requires far fewer move types and is hence computationally cheaper than other, non-additive move types. Briefly, for each of the blocks, we generated $\xi \sim N(0, 0.05)I(\xi > 0)$. Then, for each parameter in the block, we either added or subtracted $\xi$ with equal probability. In other words, we used the same $\xi$ to update all the parameters in a block, unlike the block random walk proposal. This considerably improved the acceptance rates. For theoretical and implementation details, see Dutta & Bhattacharya (2013). We discarded the first 10,000 iterations of the MCMC run as burn-in and stored the subsequent 50,000 iterations for inference. Convergence was assessed by informal diagnostics, as before. It took an ordinary laptop about 24 hours to implement this multivariate experiment.
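The additive TMCMC move described above can be sketched as follows: a single positive ξ is drawn (|N(0, 0.05)| is distributionally the same as N(0, 0.05) truncated to ξ > 0), and each coordinate of the block is moved by +ξ or −ξ with equal probability. Only the proposal generation is shown; the TMCMC acceptance step is omitted.

```python
import random

def additive_tmcmc_proposal(block, scale_var=0.05):
    """Propose a new block: the same xi ~ N(0, scale_var) truncated to
    xi > 0 is added or subtracted, with an independent sign per coordinate."""
    xi = abs(random.gauss(0.0, scale_var ** 0.5))
    return [b + (xi if random.random() < 0.5 else -xi) for b in block]

random.seed(7)
block = [0.0, 1.0, -2.0]   # illustrative parameter block
prop = additive_tmcmc_proposal(block)
```

Note the contrast with a block random walk, which would draw an independent increment for every coordinate; here all coordinates share the magnitude ξ and differ only in sign.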
S6.3 Results of model-fitting
Figures S7 and S8 show that our Gaussian process based nonparametric model performs well in spite of the multidimensional situation: the true values are well captured by the posterior distributions, and the true forecast values are also well supported by the corresponding forecast distributions of $x_{51}$ and $y_{51}$. Moreover, each component of the true 4-variate time series falls entirely within the 95% highest posterior density credible intervals of the corresponding component of the 4-variate posterior time series.
Figure S7: Simulation study with data generated from the 4variate modification of the model of CPS: Posterior distributions of x0 , xT +1 (onestep forecasted x), and y T +1 (onestep forecasted y). The solid line stands for the true values of the respective parameters.
Figure S8: Simulation study with data generated from the 4variate modification of the model of CPS: 95% highest posterior density credible intervals of the time series x1,j , . . . , xT,j ; j = 1, 2, 3, 4. The solid line stands for the true time series.
S7. POSTERIORS OF THE FUNCTION COMPOSITIONS AND COMPARISONS WITH THE TRUE FUNCTION COMPOSITIONS
It may be of interest to check how the posterior distributions of the function compositions at different time points compare with the true function compositions generating the data. To clarify, for any $x$, let $g^*_0(x) = x$, and let $g^*_t(x) = g(t, g^*_{t-1}(x))$ for $t \geq 1$ be the $t$-step composition of $g(\cdot, \cdot)$. It may be of interest to compare the posterior of $g^*_t(\cdot)$ with the true $t$-step functional composition of the data-generating evolutionary equation. Similarly, for $t \geq 1$ it may be of interest to compare the true $t$-step composition of the observational equation with the posterior of $f^*_t(\cdot) = f(t, g^*_t(\cdot))$.

Although technically the comparisons are possible, there is a subtle issue that needs to be understood. It is important to note that the observed data have been generated by a single latent time series, guided by the evolutionary equation, beginning at a single value $x_0$. Since we consider $x_0$ unknown, we attempt to learn about this unknown through the posterior distribution of $x_0$. Given the observed data $D_T$, the support of this posterior distribution contains the likely values of $x_0$ that might have given rise to the observed data via the compositions of the observational and evolutionary equations. But beyond the support of the posterior of $x_0$ there is no information about the composite functions. Hence, it is reasonable to compare the true function compositions with $g^*_t(x)$ and $f^*_t(x)$ when $x$ belongs to the support of the posterior distribution of $x_0$, but not when $x$ lies outside that support. Since the support of the posterior of $x_0$ is not likely to be large if the data and the model contain sufficient information, the composite functions are likely to be close to linear on that relatively small region. Below we provide details of computing the posteriors of $g^*_t(\cdot)$ and $f^*_t(\cdot)$.

We note that it follows, on similar lines as in (16) and (17) of GMRB, that for $t \geq 1$, $[g^*_t(x) \mid D^*_n, g^*_{t-1}(x), \theta_g]$ is normal with mean

$\mu_{g,t}(x) = h(t, g^*_{t-1}(x))'\beta_g + s_{g,D^*_n}(t, g^*_{t-1}(x))'A_{g,D^*_n}^{-1}(D^*_n - H_{D^*_n}\beta_g)$
and variance

$\sigma^2_{g,t}(x) = \sigma^2_g\left\{1 - s_{g,D^*_n}(t, g^*_{t-1}(x))'A_{g,D^*_n}^{-1}s_{g,D^*_n}(t, g^*_{t-1}(x))\right\}.$

Noting that the posterior of $g^*_t(x)$ can be expressed (using conditional independence) as

$[g^*_t(x) \mid D_T] = \int [g^*_t(x) \mid D^*_n, g^*_{t-1}(x), \theta_g]\,[D^*_n, g^*_{t-1}(x), \theta_g \mid D_T]\, dD^*_n\, dg^*_{t-1}(x)\, d\theta_g,$

given the available posterior samples of $D^*_n$, $g^*_{t-1}(x)$ and $\theta_g$, drawing from $[g^*_t(x) \mid D^*_n, g^*_{t-1}(x), \theta_g]$ yields posterior samples from $[g^*_t(x) \mid D_T]$. Thus, for any $x$, we can compute the posterior distribution of $g^*_t(x)$; $t = 1, 2, \ldots$. Similarly, it follows, as in (27) and (28) of GMRB, that for $t \geq 1$, the full conditional of $f^*_t(x) = f(t, g^*_t(x))$, given $g^*_t(x)$, is normal with mean

$\mu_{f,t}(x) = h(t, g^*_t(x))'\beta_f + s_{f,D_T}(t, g^*_t(x))'A_{f,D_T}^{-1}(D_T - H_{D_T}\beta_f)$

and variance

$\sigma^2_{f,t}(x) = \sigma^2_f\left\{1 - s_{f,D_T}(t, g^*_t(x))'A_{f,D_T}^{-1}s_{f,D_T}(t, g^*_t(x))\right\}.$
Following the method discussed for generating posterior samples of $g^*_t(x)$, we can easily generate posterior samples of $f^*_t(x)$. The procedure in the multivariate situation is analogous; we only modify the notation $f^*_t(x)$ and $g^*_t(x)$ to $f^*_{jt}(x)$ and $g^*_{kt}(x)$ to denote the corresponding observational and evolutionary $t$-step composite functions for the $j$th and $k$th components of the functions, respectively, where $j = 1, 2, \ldots, p$ and $k = 1, \ldots, q$.

Figures S9 and S10 display the 95% highest posterior density intervals of $f^*_t(x)$ and $g^*_t(x)$ for $t = 1, 2, 3$ and for $x \in [-4, 4]$ in the simulation studies concerning the univariate linear model and the nonlinear growth model of CPS, respectively. The domain $[-4, 4]$ is chosen because the posterior of $x_0$ is supported on this interval in both the linear and the nonlinear cases (see Figures 2 and 5 of GMRB). As already mentioned, the true functions are almost linear within this short interval. That the true functions can actually be far from linear on a much wider domain is already clear from the observational and evolutionary equations in the nonlinear example; the true function $f^*_1(\cdot)$ in the nonlinear example, as shown in Figure S11, is clearly nonlinear over a much wider domain even though it is close to linear (indeed, almost constant) on $[-4, 4]$. Figures S12 and S13 show the true functions $f^*_{jt}$ and $g^*_{kt}$ for $j, k = 1, 2, 3, 4$ and $t = 1, 2, 3$
along with the corresponding 95% highest posterior density credible intervals in the multivariate situation. Here also we select the domain of $x$ as $[-4, 4]$ because Figure S7 shows that a posteriori each coordinate of $x_0$ is supported on $[-4, 4]$. In all the cases the true composite observational and evolutionary functions fall comfortably within their respective 95% highest posterior density credible intervals when $x \in [-4, 4]$ is considered. We also experimented with $x \in [-30, 30]$, but in this case, as already anticipated, the true values are in many cases (not reported here) excluded from the credible intervals when $x \notin [-4, 4]$, particularly towards the boundaries of the range.

Figure S9: Simulation study with data generated from the univariate linear model: The broken lines denote the 95% credible regions of the posterior distributions of $f^*_t(x)$ and $g^*_t(x)$ for $t = 1, 2, 3$; $x \in [-4, 4]$, which is the support of the posterior of $x_0$. The solid lines denote the true composite observational and evolutionary functions.
[Figure S10 appears here: six panels showing the posteriors of f*_1(x), f*_2(x), f*_3(x) and g*_1(x), g*_2(x), g*_3(x) for x ∈ [−4, 4].]
Figure S10: Simulation study with data generated from the univariate growth model of CPS: The broken lines denote the 95% highest posterior density credible regions of the posterior distributions of ft∗ (x) and gt∗ (x) for t = 1, 2, 3; x ∈ [−4, 4], which is the support of the posterior of x0 . The solid lines denote the true composite observational and evolutionary functions.
[Figure S11 appears here: a single panel plotting the true function f1∗ (·) against x on the domain [−4000, 4000].]

Figure S11: Simulation study with data generated from the univariate growth model of CPS: Displayed is the true function f1∗ (·) on the much wider domain [−4000, 4000].

REFERENCES

Carlin, B. P., Polson, N. G., & Stoffer, D. S. (1992), “A Monte Carlo Approach to Nonnormal and Nonlinear State-Space Modeling,” Journal of the American Statistical Association, 87, 493–500.

Dutta, S., & Bhattacharya, S. (2013), “Markov Chain Monte Carlo Based on Deterministic Transformations,” Statistical Methodology (to appear). Available at arXiv:1106.5850v3, with supplementary section at arxiv.org/pdf/1306.6684.

Ghosh, A., Mukhopadhyay, S., Roy, S., & Bhattacharya, S. (2013), “Bayesian Inference in Nonparametric Dynamic State-Space Models”. Submitted.
[Figure S12 appears here: twelve panels showing the posteriors of f*_jt(x) for j = 1, 2, 3, 4 and t = 1, 2, 3, for x ∈ [−4, 4].]
Figure S12: Simulation study with data generated from the 4-variate modification of the model of CPS: The broken lines denote the 95% highest posterior density credible regions of the posterior distributions of fjt∗ (x) for j = 1, 2, 3, 4 and t = 1, 2, 3; x ∈ [−4, 4], which is the support of the posterior of x0 . The solid lines denote the true composite observational and evolutionary functions.
[Figure S13 appears here: twelve panels showing the posteriors of g*_jt(x) for j = 1, 2, 3, 4 and t = 1, 2, 3, for x ∈ [−4, 4].]
Figure S13: Simulation study with data generated from the 4-variate modification of the model of CPS: The broken lines denote the 95% highest posterior density credible regions of the posterior distributions of gjt∗ (x) for j = 1, 2, 3, 4 and t = 1, 2, 3; x ∈ [−4, 4], which is the support of the posterior of x0 . The solid lines denote the true composite observational and evolutionary functions.