


Printed 30 May 2018

(MN LATEX style file v2.2)

Maximum-entropy image reconstruction using wavelets

Klaus Maisinger, M.P. Hobson and A.N. Lasenby

arXiv:astro-ph/0303246v1 12 Mar 2003

Astrophysics Group, Cavendish Laboratory, Madingley Road, Cambridge, CB3 0HE, UK

Accepted —. Received —; in original form 9th March 2003

ABSTRACT

Wavelet functions allow the sparse and efficient representation of a signal at different scales. Recently the application of wavelets to the denoising of maps of cosmic microwave background (CMB) fluctuations has been proposed. The maximum-entropy method (MEM) is also often used for enhancing astronomical images and has been applied to CMB data. In this paper, we give a systematic discussion of combining these two approaches by the use of the MEM in wavelet bases for the denoising and deconvolution of CMB maps and more general images. Certain types of wavelet transforms, such as the à trous transform, can be viewed as a multi-channel intrinsic correlation function (ICF). We find that the wavelet MEM has lower reconstruction residuals than conventional pixel-basis MEM in the case when the signal-to-noise ratio is low and the point spread function narrow. Furthermore, the Bayesian evidence for the wavelet MEM reconstructions is generally higher for a wide range of images. From a Bayesian point of view, the wavelet basis thus provides a better model of the image.

Key words: methods: data analysis – methods: statistical – techniques: image processing

1 INTRODUCTION

Both the maximum-entropy method (MEM) and wavelet techniques are used for astronomical image enhancement. In particular, both methods have recently been applied to the analysis of CMB data (see, for instance, Hobson, Jones & Lasenby 1999; Sanz et al. 1999a,b; Tenorio et al. 1999). Maps of CMB anisotropies are a useful tool in the analysis of CMB data. Making maps is rarely straightforward, since a multitude of systematic instrumental effects, calibration uncertainties and other deficiencies in the modelling of the telescope come into play. For example, interferometric maps suffer from the telescope's incomplete sampling in Fourier space and require the deconvolution of the synthesised beam (e.g. Thompson, Moran & Swenson 1994) and the suppression of receiver noise. CMB observations from single-dish telescopes use total power measurements and scan across the observed fields to assemble a map. Here it is the effect of the finite primary beam that needs to be deconvolved in order that a high-resolution map may be recovered. Beyond the area of the CMB, the task of image reconstruction is generic and occurs in virtually any type of astronomical map-making.

In a general imaging problem, we assume that the data d observed by an experiment are given by a convolution of the true sky signal, or image h, with the point spread function P of the instrument, plus some Gaussian random noise n: d = P ∗ h + n. In the discretised version, the data vector d is given by a multiplication of the vector h of the image pixels with the instrumental response matrix R that describes the convolution with the point spread function, and the additive noise vector n:

d = Rh + n.

To solve the inverse problem of recovering the image h from the data, some type of regularisation is usually required. A common technique is to use an entropic function S for the regularisation. The best reconstruction is then found by minimising the function F(h) = ½ χ²(h) − α S(h) that determines a suitable trade-off between a good fit to the data enforced by the χ²-statistic and a strong regularisation given by the entropy S(h) of the reconstruction. The maximum-entropy method has proven to be very successful for the deconvolution of noisy images. Despite its capabilities, the MEM suffers from several shortcomings. For example, the appropriate entropy functional depends on the properties of the distribution of image pixels, but it is not always evident what the theoretical distribution should be. For positive additive distributions, one uses the entropy

S(h) = Σ_{i=1}^{N_h} [ h_i − m_i − h_i ln(h_i / m_i) ],    (1)

where the sum is over all image pixels and m_i is a measure assigned to pixel i. Even in this case problems can arise when there is no appropriate background level, or the image brightness falls below the background level in some areas. Another defect of the MEM is that, in its simplest forms, correlations between image pixels are not taken into account properly. This problem manifests itself in several guises. Because of correlations between image pixels, the effective number of 'degrees of freedom' in the data is often much smaller than the number of parameters in the minimisation problem, making effective regularisation more difficult. In fact, MEM is inherently based on the assumption that image pixels are independent. Furthermore, ignoring correlations leads to the introduction of spurious features in the map, such as the characteristic ringing present on uniform backgrounds. There is no provision in the MEM algorithm to reward local smoothness of the image. It appears to be quite difficult to regularise in such a way as to reconstruct faithfully sharp features and uniform areas at the same time.

Several solutions have been proposed to remedy the problem of image correlations. Gull & Skilling (1999) have introduced the concept of an intrinsic correlation function (ICF) that is used to decorrelate the reconstructed image. The ICF framework has been extended to allow reconstructions of objects on different scales. Weir (1992) proposes a multi-channel approach, which allows for multiple scales of pixel-to-pixel correlations. In pyramidal maximum entropy (Bontekoe, Koper & Kester 1994), the number of pixels retained in the low-resolution channels is decimated. Despite these improvements, choosing an ICF is not straightforward. It is clear that there is no single set of ICFs that is universally optimal for all possible types of data. Choosing suitable scale lengths and weights is of great importance.

A slightly different approach to tackling the correlation problem is to use a representation of the image that is more efficient in identifying its information content. In other words, the task is to find an optimal basis set for the representation of the image. Furthermore, it is desirable to have a representation that can efficiently capture information present on different length scales in the image. For instance, for CMB observations several foreground components, such as radio point sources or SZ clusters, are very localised on the sky. Some theories of structure formation also predict localised non-Gaussian imprints at arcminute scales on the CMB itself, for example temperature fluctuations produced in the wake of cosmic strings.
On the other hand, the primordial CMB itself shows more diffuse structure that peaks on angular scales close to a degree. Representing these signals in real space (i.e. the image plane) requires large numbers of basis functions (i.e. the pixels) for a given image. Similarly, a reconstruction in Fourier space requires the determination of a large number of modes, which are often very poorly constrained by the data, since each localised feature on a map is expanded into an infinite number of basis functions in Fourier space. Maximum entropy has been applied to reconstructions in both real and Fourier space. A Fourier-space approach has been developed by Hobson et al. (1998) and has been applied to simulated Planck data, while Jones, Hobson & Lasenby (1999) simulate its use for the MAP satellite mission. The application of wavelets to CMB data analysis has recently been investigated (see, for instance, Hobson et al. 1999; Sanz et al. 1999a,b; Tenorio et al. 1999; Cayón et al. 2000; Vielva et al. 2001). Wavelets are special sets of functions that allow the efficient representation of signals both in real and in Fourier space. Furthermore, they can represent different objects of greatly varying sizes simultaneously. The term 'wavelet' does not refer to a single unique function. Instead, it comprises a whole class of functions with similar properties. In the context of CMB analysis, wavelets have predominantly been used for noise filtering or the separation of localised foreground sources. A combination of MEM and certain types of wavelets has been discussed by Pantin & Starck (1996) and Starck et al. (2001). In this paper, we investigate the use of wavelets in MEM more closely. We will compare different wavelet transforms, entropic priors and regularisations. We will also consider how the different approaches can be viewed in the ICF framework. In Section 2 we give an introduction to wavelet transforms, and the application of

wavelet filters to denoising is reviewed in Section 3. In Section 4 the maximum-entropy method is introduced. The combination of wavelet and MEM techniques is explored in Section 5, and the equivalence of intrinsic correlation functions and certain redundant wavelet transforms is discussed in Section 6. In Section 7, we test the techniques presented by applying them to simulated image reconstruction problems.
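As a concrete point of reference for the sections that follow, the forward model d = Rh + n and the entropic trade-off function can be sketched numerically. The sketch below is purely illustrative: the 1-D image size, Gaussian point spread function and noise level are assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D 'sky': two point sources on a unit background.
N = 64
h_true = np.full(N, 1.0)
h_true[20] += 5.0
h_true[45] += 3.0

# Response matrix R: circulant convolution with a Gaussian PSF
# (row i of R is the PSF centred on pixel i).
x = np.arange(N)
psf = np.exp(-0.5 * ((x - N // 2) / 2.0) ** 2)
psf /= psf.sum()
R = np.array([np.roll(psf, i - N // 2) for i in range(N)])

# Data: d = R h + n, with Gaussian noise of dispersion sigma.
sigma = 0.2
d = R @ h_true + sigma * rng.standard_normal(N)

def chi2(h):
    r = R @ h - d
    return r @ r / sigma**2

def entropy(h, m=1.0):
    # Positive-additive-distribution entropy of equation (1).
    return np.sum(h - m - h * np.log(h / m))

def F(h, alpha):
    # Trade-off between data fit and entropic regularisation.
    return 0.5 * chi2(h) - alpha * entropy(h)
```

Note that the entropy vanishes for the default solution h_i = m_i, so the regularisation pulls the reconstruction towards the model wherever the data do not constrain it.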

2 THE WAVELET TRANSFORM Wavelets are functions that enjoy certain properties, which will be further discussed below. A wavelet basis can be constructed from dilations and translations of a given wavelet function. The wavelet transform is an integral transform that uses the wavelet basis functions. The most widely used classes of wavelet transforms are orthogonal transforms. Like the Fourier transform, they are essentially rotations in function space from the pixel basis to appropriate wavelet basis functions. Unlike Fourier transforms, the basis functions can be well-localised in both real and Fourier space (even though they can only be compactly supported in one of the two).

2.1 The continuous wavelet transform

The continuous wavelet transform of a one-dimensional square-integrable function f ∈ L²(R) can be defined as

W(a, b) = (1/√a) ∫_{−∞}^{∞} f(x) ψ((x − b)/a) dx,

where the function ψ(x) is the wavelet (often called the analysing wavelet or mother wavelet). The real numbers a > 0 and b are scale and position parameters respectively. Dilations and translations of ψ(x) can be derived by varying a and b. Obviously, the wavelet transform is linear. The inverse wavelet transform is given by

f(x) = (1/C_ψ) ∫_0^∞ a^{−2} da ∫_{−∞}^{∞} db W(a, b) (1/√a) ψ((x − b)/a),

where the normalisation constant C_ψ = 2π ∫_0^∞ |k|^{−1} |ψ̃(k)|² dk can be obtained from the Fourier transform ψ̃(k) of the wavelet function ψ(x)¹. In order to construct a basis from the translations and dilations of the wavelet function ψ(x), a second function φ(x) must be introduced. It is called the scaling function or father wavelet, and its relation to ψ(x) will become clear in a moment. For the resulting basis to be discrete, compact and orthogonal, the wavelet functions ψ(x) and φ(x) must obey a set of mathematical restrictions first derived by Daubechies (e.g. Daubechies 1992), among them

∫_{−∞}^{∞} φ(x) dx = 1,    ∫_{−∞}^{∞} ψ(x) dx = 0,    (2)

and the normalisation

∫_{−∞}^{∞} |ψ(x)|² dx = 1.    (3)
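Conditions (2) and (3) can be checked numerically for the simplest standard example, the Haar wavelet (+1 on [0, 1/2), −1 on [1/2, 1)); this construction is an illustration and is not taken from the text.

```python
import numpy as np

def phi(x):
    # Haar scaling function: 1 on [0, 1), 0 elsewhere.
    return np.where((x >= 0) & (x < 1), 1.0, 0.0)

def psi(x):
    # Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1).
    return phi(2 * x) - phi(2 * x - 1)

# Midpoint-rule quadrature over an interval covering the support.
dx = 1e-4
x = np.arange(-0.5, 1.5, dx) + dx / 2

print(np.round(np.sum(phi(x)) * dx, 6))       # ≈ 1: integral of phi
print(np.round(np.sum(psi(x)) * dx, 6))       # ≈ 0: zero mean, condition (2)
print(np.round(np.sum(psi(x)**2) * dx, 6))    # ≈ 1: normalisation (3)
```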

A wavelet basis can then be constructed as follows. It is convenient to take special values for a and b in defining the basis: a = 2^{−j} and

¹ The normalisation of C_ψ assumes a Fourier transform that is scaled by 1/√(2π) in both directions.


Figure 1. Wavelet functions ψ(x): (a) Haar wavelet and (b) Daubechies-4 wavelet.

b = 2^{−j} l, where j and l are integers labelling the scale of the wavelet and the position at this scale. The resulting wavelet functions are

φ_{j,l}(x) = 2^{j/2} φ(2^j x − l),
ψ_{j,l}(x) = 2^{j/2} ψ(2^j x − l).

It can be shown that the set {φ_{0,l}, ψ_{j,l}} with j ≥ 0 and −∞ < l < ∞ forms a complete orthonormal basis in L²(R). One may then expand a function f(x) as

f(x) = Σ_{l=−∞}^{∞} c_{0,l} φ_{0,l}(x) + Σ_{j=0}^{∞} Σ_{l=−∞}^{∞} w_{j,l} ψ_{j,l}(x),    (4)

where the wavelet coefficients c_{0,l} and w_{j,l} are given by

c_{0,l} = ∫_{−∞}^{∞} f(x) φ_{0,l}(x) dx,
w_{j,l} = ∫_{−∞}^{∞} f(x) ψ_{j,l}(x) dx.    (5)
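A minimal discrete illustration of this orthonormality uses the Haar system on eight samples. The construction below is a standard one (not taken from the paper) and anticipates the matrix form of equation (7): analysis is a matrix multiplication, synthesis its transpose.

```python
import numpy as np

# Discrete Haar basis on N = 8 samples: one scaling (father) vector
# plus wavelets at the three dyadic scales j = 0, 1, 2.
N = 8
rows = [np.full(N, 1 / np.sqrt(N))]           # scaling vector
for j in range(3):
    for l in range(2**j):
        v = np.zeros(N)
        width = N // 2**j
        start = l * width
        v[start:start + width // 2] = 1.0      # +1 on first half
        v[start + width // 2:start + width] = -1.0   # -1 on second half
        rows.append(v / np.linalg.norm(v))
W = np.array(rows)

# Orthonormality: W Wᵀ = I, so analysis is f̃ = W f, synthesis f = Wᵀ f̃.
print(np.allclose(W @ W.T, np.eye(N)))   # True

f = np.arange(N, dtype=float)
coeffs = W @ f                            # discrete analogue of (5)
print(np.allclose(W.T @ coeffs, f))       # True: perfect reconstruction
```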

Since the wavelet basis consists of dilations and translations of the mother and father wavelets ψ(x) and φ(x), one can obtain a different orthogonal wavelet basis for each pair of these functions that obeys the Daubechies conditions. Thus there exists an infinite number of possible wavelet transforms, with different wavelet bases making different trade-offs between how compactly they are localised in real space and in frequency space. Unfortunately, in nearly all cases, the wavelet basis functions cannot be expressed in closed form in terms of simple functions. Nevertheless, two of the more commonly used wavelet basis functions, the Haar and Daubechies wavelets, are plotted in Fig. 1. The properties of wavelet and scaling functions can be more easily understood within the framework of multiresolution analysis (MRA), which is discussed in more detail in Appendix A.

2.2 The discrete wavelet transform

In many applications, one is not so much interested in a function f(x) defined at all values of x as in its samples f(x_i) at N = 2^J equally-spaced points x_i. By analogy with the discrete Fourier transform (DFT), this function can be represented by its J detail scales and its smooth component. Equation (4) becomes

f(x_i) = c_{0,0} φ_{0,0}(x_i) + Σ_{j=0}^{J−1} Σ_{l=0}^{2^j−1} w_{j,l} ψ_{j,l}(x_i),    (6)

and the integrals (5) for the wavelet coefficients have to be replaced by the appropriate summations. If the mean of the function samples f(x_i) is zero, then c_{0,0} = 0 and the function can be described entirely in terms of the wavelets ψ_{j,l}. As the scale index j increases from 0 to J − 1, the wavelets represent structure of the function on increasingly smaller scales, where each scale is by a factor of 2 more detailed than the previous one. The index l (which runs from l = 0 to 2^j − 1) denotes the position of the wavelet ψ_{j,l} within the jth scale level.

If we collect the function samples f_i = f(x_i) in a column vector f (whose length N must be an integer power of 2), then the DWT (like the DFT) is a linear operation that transforms f into another vector f̃ of the same length, which contains the wavelet coefficients of the (digitised) function. The action of the DWT can therefore be described as a multiplication of the original vector by the N × N wavelet matrix W:

f̃ = W f.    (7)

Again like the DFT, the matrix W is orthogonal, and the inverse transformation can be performed straightforwardly using the transpose of W. Thus both the DFT and DWT can be considered as rotations from the original orthonormal basis vectors e_i in signal space to some new orthonormal basis ẽ_i (i = 1, ..., N), with the transformed vector f̃ containing the coefficients in this new basis. The original basis vectors e_i have unity as the ith element and the remaining elements equal to zero, and hence correspond to the 'pixels' in the original vector f. Therefore the original basis is the most localised basis possible in real space. For the DFT, the new basis vectors ẽ_i are (digitised) complex exponentials and represent the opposite extreme, since they are completely non-local in real space but localised in frequency space. For the DWT, the new basis vectors are the wavelets, which enjoy the characteristic property of being fairly localised both in real space and in frequency space, thus occupying an intermediate position between the original 'delta-function' basis and the Fourier basis of complex exponentials. Indeed, it is the simultaneous localisation of wavelets in both spaces that makes the DWT such a useful tool for analysing data in a wide range of applications.

2.3 The à trous transform

So far we have restricted our attention to discrete wavelet bases that are orthogonal and non-redundant (i.e. the number of wavelet coefficients equals the number of points at which the original function is sampled). By relaxing some of the Daubechies conditions (2) and (3), one may represent a sampled function in terms of wavelet bases that are both non-orthogonal and redundant. Although this may at first seem a retrograde step, Langer, Wilson & Anderson (1993) have suggested that non-orthogonal, translationally and rotationally invariant (isotropic) wavelet transforms are better suited to the task of image reconstruction than orthogonal ones. The à trous ('with holes') algorithm (Holschneider et al. 1989; Bijaoui, Starck & Murtagh 1994) is an example of a non-orthogonal, redundant discrete wavelet transform that is widely used in image analysis. Starting from a data vector c_{J,l} (l = 1, ..., N, 2^J ≤ N), iteratively smoothed vectors are obtained by

c_{j−1,k} = Σ_l H_l c_{j, k + 2^{J−j} l},    (8)

where each step is effectively a convolution of the image with the filter mask H_l using varying step sizes 2^{J−j}. At each scale, the detail wavelet coefficients contain the difference between the smoothed image c_{j−1,k} and the image c_{j,k} at the previous scale:

w_{j−1,k} = c_{j,k} − c_{j−1,k}.

Since no decimation is carried out between consecutive filter steps,


Figure 2. The effective point spread functions of the à trous transform for the triangular scaling function (10). The wavelet coefficients at each scale are given by convolutions with the increasingly wider wavelet functions (dotted line, dash-dotted line, dashed line). The smooth component is given by a convolution with the scaling function (solid line).

Figure 3. The effective point spread functions of the à trous transform for a B3-spline. The wavelet coefficients at each scale are given by convolutions with the increasingly wider wavelet functions (dotted line, dash-dotted line, dashed line). The smooth component is given by a convolution with the scaling function (solid line). Compare the triangular mask in Fig. 2.

the à trous transform is redundant. Thus the final wavelet-transformed vector has length J × N. The inverse à trous transform is simply the sum over the coefficients at all scales:

c_{J,l} = c_{0,l} + Σ_{j=0}^{J−1} w_{j,l}.
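A sketch of the algorithm in one dimension, using the triangular mask (1/4, 1/2, 1/4) and periodic boundaries. The indexing below runs from fine to coarse, opposite to the convention of equation (8), but the algorithm is the same, and the exact-reconstruction property (data = smooth + sum of detail planes) holds by construction.

```python
import numpy as np

def atrous_forward(data, n_scales, mask=(0.25, 0.5, 0.25)):
    # À trous transform: smooth with the mask, doubling the hole
    # spacing at every scale; details are differences of scales.
    c = np.asarray(data, dtype=float)
    details = []
    for j in range(n_scales):
        step = 2**j
        smooth = sum(h * np.roll(c, -step * l)
                     for l, h in zip((-1, 0, 1), mask))
        details.append(c - smooth)      # w = c_j - c_{j+1}
        c = smooth
    return details, c                   # J detail planes + smooth part

def atrous_inverse(details, smooth):
    # Reconstruction is simply the sum over all scales.
    return smooth + sum(details)

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
details, smooth = atrous_forward(x, 4)
print(np.allclose(atrous_inverse(details, smooth), x))   # True
```

Note the redundancy: four detail planes plus the smooth component give 5 × 64 coefficients for a 64-sample input.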

2.3.1 Properties of the à trous transform

For the à trous transform the normalisation ∫ |ψ(x)|² dx = 1 no longer holds. This means that some convenient properties of the orthogonal transform are lost. For instance, the orthogonal wavelet transforms of Gaussian white noise have a constant dispersion in all wavelet domains. This is not the case for the à trous transform. As discussed above, the à trous transform constructs the wavelet coefficients by successive applications of the same filter mask with different spacings between pixels, followed by a subtraction of the smooth component in each step. This procedure is computationally efficient because of the compactness of the mask, as the wavelet coefficients in each domain can be constructed by a sum over only a small fraction of the coefficients on the previous scale. Perhaps not entirely obvious from this construction is that this algorithm is equivalent to a convolution of the original image with a series of point spread functions, which not only have different widths, but can also assume slightly different shapes on different scales. Wavelet coefficients are then given by the convolution

w_{j−1,k} = Σ_l ψ_{j−1, l−k} c_{J,l}    (9)

of the input vector and the wavelet function ψ_{j−1}(−x) at the (j − 1)th scale. For a symmetric wavelet, ψ_{j−1}(x) = ψ_{j−1}(−x), this is identical to a convolution with the wavelet itself. Popular choices for the corresponding scaling function φ(x) are the triangle function

φ(x) = 1 − |x|  if x ∈ [−1, 1],  and 0 otherwise,    (10)

or a B3-spline (Starck, Murtagh & Bijaoui 1998). The effective point spread functions for the triangular scaling function (10) are shown in Fig. 2. The horizontal axis denotes

Figure 4. The window functions for the triangular transform (10) at different scales in Fourier space. These functions have been obtained by a Fourier transform of the functions in Fig. 2. The horizontal scale denotes spatial frequencies.

pixel offsets. The first wavelet scale is given by a convolution with the narrow wavelet function ψ(x) (dotted line). The next scale is given by the same function, but scaled by a factor of 2 in width and 1/2 in height (dash-dotted line). The higher scales are produced from increasingly wider convolution masks. Finally, the last scale is the smooth component obtained from a convolution with the broad scaling function φ(x) (solid line). Fig. 3 shows the effective point spread functions for the B3-spline given by the convolution mask

(1/16, 1/4, 3/8, 1/4, 1/16).

Note how the wavelet functions become smoother as their width increases relative to the pixel resolution.

2.3.2 The à trous transform as a Fourier filter

The convolution functions shown in Figs 2 and 3 act as filters corresponding to different spatial frequency bands. The corresponding window functions can be obtained from a Fourier transform of the

convolution masks. The window functions for the triangular transform from Fig. 2 are plotted in Fig. 4. The detail coefficients are given by the high-pass filter extending to the large Fourier modes (dotted line with circles). The coarser scales are given by filter functions moving to smaller and smaller Fourier modes. The smooth component can be obtained from the low-pass filter centred on the smallest frequencies (solid line), or longest wavelengths. The fact that the inverse transform is simply given by a sum of the detail scales and the smooth component ensures that all filters add up to a constant rectangular window of unit height (crosses at the top of the diagram). The corresponding window functions for a B3-spline look very similar, but their sidelobes are much smaller.

2.4 Two-dimensional wavelet transforms

The extension of the discrete wavelet transform to two-dimensional objects or images is not unique. For orthogonal, non-redundant wavelet bases, there are two commonly used types of two-dimensional transform. The terminology used to name these transforms is quite variable, and different authors even use the same term to refer to different transforms. We choose to refer to them as 'tensor' and Mallat (or MRA) transforms respectively, even though both are constructed from tensor products of the one-dimensional wavelets. The tensor approach simply uses tensor products of the one-dimensional wavelets as a two-dimensional basis (e.g. Press et al. 1992). The MRA transform was developed by Mallat (1989) and uses dilated versions of three different tensor products of one-dimensional wavelets without mixing scales in different directions. Both tensor and Mallat transforms are presented in more detail in Appendix B. The non-orthogonal, redundant à trous transform lends itself particularly well to generalisations to higher dimensions.
The convolution mask for the two-dimensional à trous transform can be obtained from a two-dimensional rotation of the one-dimensional wavelet function. One of the most appealing features of the à trous algorithm is that it does not single out any special direction in the image plane, unlike the tensor products that are tailored to vertical, horizontal and diagonal image features. In this sense the à trous transform is isotropic. Furthermore, the à trous transform is also invariant under translations (the transform of a translated function is simply the translation of the transform). In practice, the convolution mask for the two-dimensional à trous transform can even be obtained from a simple product of two one-dimensional masks, while retaining the above properties to a sufficient approximation.
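For instance, a separable 2-D mask can be formed from the outer product of the 1-D B3-spline mask quoted above; this short sketch is illustrative of the construction, not a prescription from the paper.

```python
import numpy as np

# 1-D B3-spline mask (1/16, 1/4, 3/8, 1/4, 1/16) and its separable
# 2-D counterpart, formed as an outer product of two 1-D masks.
h1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
h2 = np.outer(h1, h1)

# Both masks sum to unity, so the smoothing preserves the image mean.
print(h1.sum())
print(h2.sum())
```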

3 WAVELET DENOISING The application of wavelets to the denoising of CMB maps has recently been proposed (Tenorio et al. 1999; Sanz et al. 1999a,b). Filtering techniques exploit the fact that the information content of the wavelet-transformed image has been compressed into fewer coefficients. On the other hand, the dispersion of Gaussian white noise is constant across all wavelet domains after an orthogonal transform, because white noise has equal power on all image scales. Thus some wavelet coefficients become statistically more significant than the original image pixels, since they are considerably enhanced compared to the noise. This observation leads to a filtering scheme where statistically significant coefficients are kept and the rest are discarded. The image obtained by an inverse transform then has much reduced noise. A simple realisation of such a filtering scheme uses ‘hard


thresholding’, where each wavelet coefficient wi is multiplied by a factor 1 if |wi | > τi Mi = . (11) 0 if |wi | < τi For Gaussian noise, the threshold τi is usually chosen as some mulj j tiple of the noise dispersion σN , i.e. τi = kσN , where the ith coefficient lies in the jth wavelet domain. Starck et al. (1998) propose to call the vector M from (11) the multiresolution support. A common choice for the threshold is k = 3. For ‘soft thresholding’ wi −τi if wi > τi wi Mi = 0 if |wi | < τi , (12) wi +τi if wi 6 τi wi

the value M_i can vary continuously. Coefficients that lie under the threshold are discarded, whereas the rest are shrunk by the values τ_i. Again, there are different prescriptions for choosing the τ_i. One possibility is τ_i = k σ_N^j where k is of the order of 1. A more advanced method is the procedure SureShrink (Donoho & Johnstone 1995), which assigns the threshold τ to each resolution level by minimising the Stein Unbiased Estimate of Risk (SURE) for threshold estimates. This is basically a method to minimise the estimated mean squared error of the filtered coefficients. The computational cost of this procedure is of the order N log N, where N is the number of coefficients in a domain. Additionally, one often sets a minimum τ_min and maximum τ_max threshold such that SureShrink is only applied in those wavelet domains j where τ_min ≤ σ_D^j / σ_N^j ≤ τ_max. This avoids damping of the signal in domains of high signal-to-noise and suppresses noise-dominated coefficients. The soft thresholding filter is non-linear, since the values of the filter coefficients M_i depend on the data w_i. The multiresolution support as a mask for regularisation will be discussed in Section 5.2. In Section 7, we will use reconstructions obtained from wavelet denoising as a benchmark for a comparison with MEM techniques operating in the wavelet basis.
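The two thresholding rules can be written compactly. The sketch below returns the filtered coefficients M_i w_i directly; the handling of the boundary case |w_i| = τ_i is an arbitrary choice, since (11) and (12) leave it unspecified.

```python
import numpy as np

def hard_threshold(w, tau):
    # Equation (11): keep coefficients whose magnitude exceeds the
    # threshold, zero the rest (returns M_i * w_i).
    return np.where(np.abs(w) >= tau, w, 0.0)

def soft_threshold(w, tau):
    # Equation (12): surviving coefficients are shrunk towards zero
    # by tau (again returns M_i * w_i).
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

w = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
print(hard_threshold(w, 1.0))   # small coefficients zeroed
print(soft_threshold(w, 1.0))   # survivors shrunk by tau
```

In a denoising pipeline these functions would be applied domain by domain, with tau set from the noise dispersion of each wavelet domain.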

4 THE MAXIMUM-ENTROPY METHOD The task of recovering the original image from blurred and noisy data is a typical inverse problem. This section discusses methods to draw inferences from data and to solve inverse problems. In particular, we focus on the maximum-entropy method (MEM) of image reconstruction. We give here a discussion of the background to the standard MEM technique and then, in Section 5, highlight the enhancements to the method provided by performing reconstructions in wavelet bases.

4.1 The inverse problem

In image reconstruction, the inverse problem consists of the task of estimating the N_h image pixels h_i (i = 1, ..., N_h) from the N_d data samples d_i. One may find a solution by optimising some measure of the goodness of fit to the data (see, for example, Titterington 1985). For Gaussian noise, the χ²-statistic is the preferred measure:

χ²(h) = (Rh − d)ᵀ N⁻¹ (Rh − d),    (13)

where N is the noise covariance matrix. By minimising χ², the vector ĥ that best fits the data can be found.


4.2 Regularisation

In image reconstructions, the number N_h of parameters is the number of image pixels, which can be very large. The parameter estimates are poorly constrained by the data, and over-fitting leads to a bad reconstruction of the original image. In order to avoid over-fitting and wildly oscillating solutions, some kind of additional information or regularisation is required. The regularisation is achieved by an additional function S(h) which penalises 'roughness' in the image. The choice of S(h) is determined by what exactly one considers as roughness. A compromise between the goodness of fit χ²(h) and the regularisation S(h) can then be found by minimising the function

F(h) = ½ χ²(h) − α S(h),    (14)

where α is a Lagrange multiplier which determines the degree of smoothing. As α is varied, solutions lie on a trade-off curve between optimal fit to the data and maximal smoothness (see, for example, Titterington 1985).

4.3 Bayes' theorem

The previous analysis can be more coherently expressed in a Bayesian framework. Bayes' theorem can be used as a starting point to draw statistical inferences from data. It states that the conditional probability Pr(h|d) for a hypothesis h to be true given some data d, the so-called 'posterior' probability, is given by

Pr(h|d) = Pr(d|h) Pr(h) / Pr(d).    (15)

The probability Pr(d|h) is called the likelihood of the data, and Pr(h) is the prior probability of the hypothesis. It can be used to incorporate our prior beliefs or expectations on possible solutions. The probability Pr(d), the evidence, only depends on the data and can, for the time being, be viewed as a normalisation constant. For Gaussian distributed errors on the data points, the likelihood is then given by

L(d|h) ≡ Pr(d|h) = (2π)^{−N_d/2} |N|^{−1/2} exp[ −½ (Rh − d)ᵀ N⁻¹ (Rh − d) ],    (16)

where χ²(h) = (Rh − d)ᵀ N⁻¹ (Rh − d) is the standard misfit statistic introduced in (13). If the parameters are well-constrained by the data, i.e. the likelihood function is narrow, the posterior probability will be sufficiently peaked and the errors on the parameters h will be small. Unfortunately, that is usually not the case in image reconstruction, where the number of parameters or image pixels is large. One then has to make use of the prior Pr(h) to obtain a solution.

4.4 The maximum entropy method (MEM)

If the likelihood function does not constrain the parameters sufficiently, the choice of a prior becomes important. There are many possible choices for the prior. The prior is usually of the exponential form (e.g. Skilling 1989)

Pr(h) ∝ exp[α S(h)],    (17)

where S(h) is a regularisation function and α is a constant. Assuming a likelihood function given by (16), the posterior probability from (15) then becomes

Pr(h|d) ∝ exp[ −½ χ²(h) + α S(h) ].    (18)

Clearly, the optimal choice of the regularisation has to reflect our knowledge of the expected solution (see Frieden 1983). A list of commonly used regularisation functions can be found in Titterington (1985). These are either functions that are quadratic in the hypothesis h or that use some kind of logarithmic entropy. An efficient prior is provided by the (Shannon) information entropy of the image. Restricting ourselves to images whose pixel values are strictly positive, the image can be considered as a positive additive distribution (PAD). The entropy function (1) is a generalisation of the Shannon entropy. The measure m is often called the ‘model’, because the entropy is maximised by the default solution hi = mi (i = 1 . . . Nh ). Given an entropy function S(h), the posterior probability can be maximised by minimising its negative logarithm 1 − ln[Pr(h|d)] = F(h) = χ2 (h) − αS(h). (19) 2 This is the maximum entropy method (MEM; see, for example, Gull 1989; Skilling 1989; Gull & Skilling 1999). In (19), we have omitted a constant additive term that comes from the normalisation of the posterior probability. This term includes the logarithm of the evidence Pr(d). The evidence becomes useful because it is conditional on the underlying assumptions, for example the values of the constant α and the model m, and can be written as Pr(d|α, m, . . .). From Bayes’ theorem (15), the posterior probability Pr(α, m, . . . |d) for the model m and regularisation α depends on the evidence Pr(d|α, m, . . .): Pr(α, m, . . . |d) =

Pr(d|α, m, . . .) Pr(α, m, . . .) . Pr(d)

(20)
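The posterior functional (19) can be evaluated numerically. The following toy sketch (ours, not from the paper; the boxcar response, noise level and all names are illustrative) computes F(h) = (1/2)χ²(h) − αS(h) for a small 1-D deconvolution problem, using the positive entropy that is written out later in (23):

```python
import numpy as np

def neg_log_posterior(h, d, R, sigma, alpha, m):
    """F(h) = (1/2) chi^2(h) - alpha * S(h), cf. equation (19).

    Uses the positive entropy S(h) = sum(h - m - h*log(h/m)),
    which is maximised (S = 0) at the default solution h = m.
    """
    chi2 = np.sum((d - R @ h) ** 2) / sigma ** 2
    S = np.sum(h - m - h * np.log(h / m))
    return 0.5 * chi2 - alpha * S

# Toy example: a 1-D signal blurred by a 3-pixel boxcar response.
rng = np.random.default_rng(0)
n = 32
h_true = 1.0 + np.exp(-0.5 * ((np.arange(n) - 16) / 2.0) ** 2)
R = sum(np.eye(n, k=k) for k in (-1, 0, 1)) / 3.0   # response matrix
sigma = 0.05
d = R @ h_true + sigma * rng.standard_normal(n)

m = np.full(n, 1.0)                  # model: expected background level
F_default = neg_log_posterior(m, d, R, sigma, alpha=1.0, m=m)
F_true = neg_log_posterior(h_true, d, R, sigma, alpha=1.0, m=m)
```

The default image h = m has zero entropy penalty but a large misfit, so the true image attains a much lower value of F, which is the balance the MEM exploits.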

Thus, in the same way as the likelihood discriminates between hypotheses, the evidence helps to discriminate between different priors or models (again, e.g. Gull & Skilling 1999). In Section 4.5, the evidence will be used to set a Bayesian value of the regularisation constant α. In image reconstructions the model m is often set uniformly across the image, to the value of the expected image background. Again, for a more refined analysis the evidence can be used to discriminate between models.

4.5 The regularisation parameter α

The parameter α introduced in (17) determines the amount of regularisation on the image. It is clear that minimising χ² only (by setting α = 0) would lead to a closer agreement with the data, and thus to noise-fitting. On the other hand, maximising the entropy alone by setting α = ∞ would lead to an image which equals the default everywhere. Indeed, for every choice of α there is an image h(α) corresponding to the minimum of F(h) for that particular choice. The images h(α) vary along a trade-off curve as α is varied.

There are several methods for assigning an optimal value to α. In early MEM applications, α was chosen such that for the final reconstruction ĥ the misfit statistic χ² equalled its expectation value, χ²(ĥ) = N_d, i.e. the number N_d of data values. This choice is often referred to as historic MEM. It can be shown that it leads to systematic underfitting of the data (Titterington 1985). However, an optimum value of α can be assigned within the Bayesian framework itself (Gull 1989; Gull & Skilling 1999). Treating α as another parameter in the hypothesis space, or rather as a part of the model or theory that the parameter h belongs to, one can remove the dependence of the posterior on this parameter by marginalising over α:

Pr(h|d) = ∫ Pr(h|d, α) Pr(α|d) dα.

From (20), we have Pr(α|d) ∝ Pr(d|α) Pr(α). If the evidence Pr(d|α) is sufficiently peaked, it will overwhelm any priors on α and one can simply use the optimal value α̂ which maximises the evidence. This choice of α is called classic maximum entropy (Gull 1989). It can be shown that the Bayesian value of α may be reasonably approximated by choosing its value such that the value of F(h) at its minimum is equal to half the number of data points, i.e. F ≈ N_d/2 (MacKay 1992).

4.6 Errors on the maximum entropy estimates

There are two principal methods of quantifying errors on maximum entropy reconstructions. Firstly, one may evaluate the Hessian matrix H = ∇_h∇_h F at the optimal ĥ, from which the covariance matrix of the parameters is given by C = H⁻¹. However, this typically requires the inversion of a large matrix, which is unfortunately non-diagonal and so requires a large computational effort. Secondly, one may take 'samples' from the posterior probability distribution and quantify the error on any particular parameter h_i (or combinations of parameters) by examining how h_i varies between samples from the distribution (Gull & Skilling 1999). Additionally, when testing the method on simulations, it is always possible to quantify the errors on the reconstruction by using a Monte-Carlo approach. Since this is in fact the most robust method, we adopt it in Section 7, in which we analyse some simulated data.

4.7 The intrinsic correlation function (ICF)

One of the fundamental axioms on which maximum entropy is based is that it should not by itself introduce correlations in the reconstructions (Gull & Skilling 1999). However, it is often the case that a priori there is reason to believe that for a specific application the reconstructed quantities are not uncorrelated. Fortunately, any additional information on the correlation structure can be exploited to improve the quality of the reconstruction. Denoting the reconstructed image by v instead of h, the correlation structure of v can be encoded in a function K called the intrinsic correlation function (ICF). Using the ICF K, the reconstructed ('visible') image v can be derived from a set of new ('hidden') parameters h by

v = Kh.  (21)

Instead of reconstructing the image vector directly, the entropy is maximised with respect to the 'hidden image' h. The elements of h are ideally a priori uncorrelated and of unit variance; the information on the correlations is absorbed into K. The relationship between data, visible and hidden image is depicted in Fig. 5. The visible reconstruction can be derived from the hidden image through the ICF, and the predicted noiseless data can be calculated using the response matrix and is given by d = Rv.

Figure 5. The relationship between data, visible and hidden space.

If the covariance matrix C = ⟨v_i v_j†⟩ is known, the ICF can be constructed straightforwardly by a diagonalisation or Cholesky decomposition of C. Unfortunately the correlation structure is usually not known in advance, and a suitable ICF has to be found from empirical or heuristic criteria. Without knowledge of the correlation structure, the ICF will have to be based on some assumptions about the correlations. One may prefer to find an ICF that works specifically well in certain cases, or one that performs well in the worst case. In the following, we will investigate several types of wavelet transforms as ICFs.

5 TRANSFORMING THE INVERSE PROBLEM

Let us consider more closely the relationship between the data, visible and hidden image vectors. The N_D-dimensional data vector d is an element of the data space D. Similarly, the visible vector v ∈ V, dim V = N_V, and the hidden vector h ∈ H, dim H = N_H. The data vector is then given by the transform

h ∈ H  --K-->  v ∈ V  --R-->  d ∈ D,  (22)

where the ICF K is a wavelet transform. In practice, we will use the transpose K = Wᵗ of the wavelet transform in this step, i.e. v = Wᵗh. For orthogonal wavelet transforms, Wᵗ = W⁻¹ and this choice of K ensures that the transformation v ↦ h is given by the wavelet transform and the hidden space simply consists of the wavelet coefficients of the reconstruction. For non-orthogonal transforms, however, Wᵗ = W⁻¹ does not hold. The χ²(v)-function is defined on the space V of visible vectors v and the entropy function S(h) on H. There are now two ways to construct a MEM algorithm.

(i) The first method is to choose an entropy function S(h) in the space H of hidden images and to minimise the functional F(h) = (1/2)χ²(Kh) − αS(h). In the following, we will call this approach 'wavelet MEM'. Fig. 6 shows a schematic depiction of wavelet MEM. It is the straightforward way to implement the MEM in a hidden-space algorithm. A hidden-space MEM kernel is, for example, given in the software package MEMSYS 5 (Gull & Skilling 1999). We also note that a numerical implementation requires the evaluation of derivatives of F. These are discussed in Appendix C and, from (C1), we see that the transpose of K is required for the evaluation of the gradient. This is why K = Wᵗ is a better choice than K = W⁻¹ for non-orthogonal transforms. Furthermore, in Section 6.2 we will show that the use of the transpose has a very simple interpretation for the non-orthogonal à trous transform.

(ii) The second method is to define F in terms of the visible-space quantities: F(v) = (1/2)χ²(v) − αS(Kᵗv). (Since we defined K = Wᵗ, the transform applied to the vector v in the entropy term is simply the wavelet transform.) This method has been pursued by, for instance, Pantin & Starck (1996) and Starck et al. (2001). In fact, the entropy S(Wv) can be viewed as a function on V, and its proponents call it the 'multiresolution entropy'. We will call this approach 'wavelet regularised MEM'.
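The two constructions can be contrasted in a few lines of code. The following sketch is ours, not from the paper: a small orthonormal Haar matrix stands in for the wavelet transforms of the text, and the quadratic entropy s₂ is used so that, for an orthogonal W, the two functionals agree on corresponding points v = Kh (as shown for quadratic regularisation in Appendix D):

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet matrix W for n = 2^k (a small stand-in
    for the orthogonal wavelet transforms discussed in the text)."""
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0])                  # averaging rows
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])  # difference rows
    return np.vstack([top, bottom]) / np.sqrt(2.0)

n = 8
W = haar_matrix(n)
K = W.T            # ICF: v = W^t h, so h holds the wavelet coefficients

def chi2(v, d, R, sigma):
    return np.sum((d - R @ v) ** 2) / sigma ** 2

def S_quad(x, m):
    """Quadratic entropy s2(x) = -x^2/(4m), summed over coefficients."""
    return -np.sum(x ** 2 / (4.0 * m))

R = np.eye(n)      # trivial response, for illustration only
d = np.array([1., 2., 3., 2., 1., 0., -1., 0.])
sigma, alpha, m = 0.1, 1.0, 1.0

# (i) wavelet MEM: regularise the hidden (wavelet) coefficients h.
def F_wavelet_mem(h):
    return 0.5 * chi2(K @ h, d, R, sigma) - alpha * S_quad(h, m)

# (ii) wavelet regularised MEM: regularise S(Wv), optimise over pixels v.
def F_wavelet_reg_mem(v):
    return 0.5 * chi2(v, d, R, sigma) - alpha * S_quad(W @ v, m)

h = np.array([0.5, -1., 2., 0., 1., -0.5, 0.2, 0.])
assert np.isclose(F_wavelet_mem(h), F_wavelet_reg_mem(K @ h))
```

For a non-orthogonal (redundant) transform, or for a non-quadratic entropy, the final assertion would fail, which is exactly the distinction drawn in the text.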
An illustration is shown in Fig. 7. The transform K is not an ICF in the same sense as discussed in Section 4.7, since the parameters that are reconstructed by the MEM, the visible image pixels, are still correlated. We also note that for redundant wavelet transforms, this approach has the advantage that the numerical minimisation problem is lower dimensional, since N_V < N_H.

Despite the use of the same linear transform K, 'wavelet regularised MEM' is not entirely equivalent to 'wavelet MEM' because of the non-linearity of the entropy functional. In Appendix D, we show that both methods become equivalent for orthogonal wavelet transforms if a quadratic regularisation function is used.

The type of wavelet transform is not the only free parameter one has to pick in wavelet MEM. For instance, there is still freedom in the choice of the entropy functional S. As mentioned before, an optimal choice should reflect the expected distribution of the hidden image vectors h. A more detailed discussion of entropy functionals used with wavelets and maximum entropy is given in Section 5.1.

There is no reason in the MEM that the data space D and visible space V be identical. For example, Bridle et al. (1998) and Marshall et al. (2002) have applied maximum entropy to the reconstruction of lensing mass profiles of galaxy clusters from shear or magnification data. In this case, D and V do not even share the same physical dimensions. However, in image reconstruction the data image and the reconstruction are often not only given in the same physical units, but on the same discretised grid, and we have D = V. In this case, it is possible to incorporate information gleaned directly from the data into the choice of the regularisation. The ICF K(d) and the entropy S(h, d) may become explicitly data-dependent. In practice, the data dependence is usually introduced by a modification of the model m.

Figure 6. Schematic depiction of wavelet MEM. The data are predicted by a composition of the transpose of the wavelet transform Wᵗ and the instrumental response R. The entropic regularisation is applied to the hidden (wavelet) space, and the posterior functional is maximised with respect to the wavelet coefficients.

Figure 7. Schematic depiction of wavelet regularised MEM. The entropic regularisation is applied to the wavelet coefficients obtained from the image, but the posterior functional is maximised with respect to the image pixels. The wavelet transform and entropy functional act as a new effective 'wavelet entropy'.

5.1 The entropy function

From (1), for a positive additive distribution h, the cross-entropy S+(h, m) of the image h with some model m of the image is given by the sum S+(h, m) = ∑_i s+(h_i, m_i), where

s+(h, m) = h − m − h log(h/m).  (23)

The function s+(h, m) reaches a global maximum of zero at h = m. Thus, in the absence of data, the reconstruction takes on the default value m for all pixels; in practice, the value of m is often set to the level of the expected image background. Many astronomical images, such as maps of CMB fluctuations, generally consist of both positive and negative pixels. Furthermore, even if an image were purely positive, some of its wavelet coefficients may be negative. The entropy (23) is thus inapplicable to such images. A simple generalisation of (23) to negative values is

s|·|(h, m) = |h| − m − |h| log(|h|/m),  (24)

which has been used by Pantin & Starck (1996). This function does not have a unique maximum, since both h = ±|m| maximise the entropy. For h → 0, one finds s|·|(h) → −m. However, the entropy is not defined at zero, which is generally the expected default state of the image if the mean has been subtracted. In practice, the model values m have to be close to zero and small compared to any realistic data in order to avoid the introduction of a spurious background signal. Pantin & Starck (1996) use the value m = kσ, where σ is the rms of the data and k = 1/100 is an arbitrarily chosen, small constant.

The proper way to extend the entropy (23) to images that take both positive and negative values is the entropy definition (Hobson & Lasenby 1998; Gull & Skilling 1999)

s±(h, m) = Ψ − 2m − h log[(Ψ + h)/(2m)],  (25)

where Ψ = √(h² + 4m²). The entropy s± has a maximum at h = 0. In the positive/negative entropy (25) the role of the model m is different from that in (23). The value of m determines the width of the entropy function and thus controls the magnitude of the allowed deviations from the default value. Hence the model can be considered as a level of 'damping' imposed on the image. From a dimensional analysis, the obvious choice for m is to set it to the expected signal rms. If data and visible space are identical and the signal-to-noise ratio is sufficiently high, the rms of the observed data can be a good approximation to that of the signal.

Fig. 8 shows the entropy functions s|·| and s±. From the definitions (24) and (25), we see that, for a given model, both functions differ only by a constant offset in the limit of large h: s|·|(h) → s±(h) + m (h → ∞). However, in order to minimise the cusp of s|·| around zero, the models will generally be chosen differently, and the maximum of s|·| will be significantly narrower than that of s±.
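The three entropy functions are simple to implement and compare. A minimal sketch (ours), checking the properties stated above: the maxima of s+ and s±, the quadratic behaviour of s± near the origin, and the constant offset m between s|·| and s± at large h:

```python
import numpy as np

def s_plus(h, m):
    """Positive entropy (23): maximum of zero at h = m; requires h > 0."""
    return h - m - h * np.log(h / m)

def s_abs(h, m):
    """Absolute-value entropy (24) of Pantin & Starck; undefined at h = 0."""
    return np.abs(h) - m - np.abs(h) * np.log(np.abs(h) / m)

def s_pm(h, m):
    """Positive/negative entropy (25); maximum of zero at h = 0."""
    psi = np.sqrt(h ** 2 + 4.0 * m ** 2)
    return psi - 2.0 * m - h * np.log((psi + h) / (2.0 * m))

m = 1.0
assert np.isclose(s_plus(m, m), 0.0)        # s+ peaks at h = m
assert np.isclose(s_pm(0.0, m), 0.0)        # s± peaks at h = 0 ...
assert np.isclose(s_pm(0.7, m), s_pm(-0.7, m))   # ... and is symmetric

# Quadratic approximation (27) near the origin: s± ~ -h^2/(4m).
h = 0.01
assert np.isclose(s_pm(h, m), -h ** 2 / (4.0 * m), rtol=1e-3)

# Large-h limit: s|.|(h) -> s±(h) + m.
h = 1e6
assert np.isclose(s_abs(h, m), s_pm(h, m) + m, rtol=1e-4)
```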

Figure 8. The entropy function s±(h) = Ψ − 2m − h log[(Ψ + h)/(2m)] (solid line), its quadratic approximation s₂(h) = −h²/(4m) (short dashed line) and s|·|(h) = |h| − m − |h| log(|h|/m) (long dashed line). In all cases, the model was chosen to be m = 1. The function s|·| is not defined at h = 0.

Expanding s± around zero,

s±(h, m) = ∑_{n=1}^∞ (−1)ⁿ [∏_{i=1}^n (2i − 3)]² / (2n)! · h^{2n} / (2m)^{2n−1}
         = −h²/(2(2m)) + h⁴/(4!(2m)³) − 3²h⁶/(6!(2m)⁵) + (3·5)²h⁸/(8!(2m)⁷) − …
         = −h²/(4m) + O(h⁴),  (26)

one obtains the quadratic approximation

s₂(h) = −h²/(4m),  (27)

which is plotted in Fig. 8. With this entropy, MEM reduces to a scaled least squares. The quadratic entropy can also be obtained from the large-α limit h ≃ m of the standard entropy s+(h), but with a different scale factor 1/(2m).

In a MEM reconstruction, one evaluates the product αS of the entropy S and a regularisation constant α. The constant α is not dimensionless; its dimension [α] = 1/[h] is given by the dimension [h] of h. If the model is chosen proportional to the signal rms σ_S, then, from (26), the product αs becomes invariant under a rescaling of h if α is also rescaled by 1/σ_S. In fact, to first order, any change in the model m can be absorbed by a reciprocal change in the regularisation constant α:

αs ∝ −(α/m) h².  (28)

This explains why s|·| produces reconstructions similar to those obtained for s± despite its narrow shape for small models. We note that in a more recent work, Starck et al. (2001) use the quadratic approximation s₂.

Finally we note that Pantin & Starck (1996) propose to choose the regularisation parameter α dependent on the scale or even the image pixel, instead of setting a constant α globally. This can be achieved by introducing an additional weighting factor α_i for each pixel i:

S(h, m) = ∑_i α_i s(h_i, m_i).  (29)

However, from (28) we see that this modification is again to first order equivalent to a corresponding change in the model value m_i, and so can be achieved by adopting a non-constant model.

5.2 Choosing the parameters α and m

The choice of the regularisation parameter α has been discussed in Section 4.5. Classic maximum entropy uses a Bayesian choice of α, as implemented in the software package MEMSYS 5. For wavelet regularised MEM, the maximisation takes place in visible space and the Hessian of the effective entropy is not diagonal, which means it cannot be implemented in MEMSYS 5. As mentioned above, Pantin & Starck (1996) propose to choose a pixel-dependent value of α in order to enhance or alleviate the regularisation on certain coefficients. They use

α_i ∝ σ_N^j (1 − M_i)  (30)

in (29), where M_i is the multiresolution support (11) defined in Section 3, the ith pixel is assumed to lie in the jth wavelet domain and the factor σ_N^j is the noise dispersion in the jth wavelet domain. The factor (1 − M_i) reduces the regularisation of coefficients with a good signal-to-noise ratio, and the factor σ_N^j introduces an additional pixel-dependent regularisation. The pixel-dependent factor α_i is scaled by the global parameter α. In the following, we will simply employ the historic maximum entropy criterion χ² = N_D to fix a global α in cases where MEMSYS 5 cannot be used to determine a Bayesian value.

In real-space MEM, there is often no a priori reason to assign a varying model to particular image regions, since in the absence of any additional information one would expect a constant signal dispersion across the image. The wavelet transform, however, is designed to enhance the signal-to-noise ratio on some wavelet coefficients to allow a sparse representation of the signal. Consequently one would expect different signal dispersions in each wavelet domain, and a uniform choice of the model seems appropriate only within domains. If data and image space are identical, the signal dispersion σ_S^j in the jth wavelet domain can be estimated from the dispersion σ_D^j of the data and σ_N^j of the noise:

σ_S^j = √( (σ_D^j)² − (σ_N^j)² ).

Now the model can be set to m_i = σ_S^j, where the ith pixel lies in the jth wavelet domain. If no analytic noise model is available, the noise dispersion can be obtained from Monte-Carlo simulations. Other models have been proposed that involve the use of signal-to-noise ratios rather than signal dispersions. For orthogonal transforms, the dispersion of the wavelet coefficients of white noise is scale-invariant and the signal-to-noise ratio is proportional to the signal dispersions.
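The domain-by-domain model assignment can be sketched as follows. This is an illustrative toy of ours with synthetic 'wavelet domains'; the function name and the clipping at zero for noise-dominated domains are our choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_domain_models(data_coeffs, noise_sigma_per_domain):
    """Set the model in each wavelet domain to the estimated signal
    dispersion sigma_S = sqrt(sigma_D^2 - sigma_N^2), clipped at zero
    when the noise dominates the measured dispersion."""
    models = []
    for w, sigma_n in zip(data_coeffs, noise_sigma_per_domain):
        sigma_d2 = np.var(w)                     # data dispersion squared
        models.append(np.sqrt(max(sigma_d2 - sigma_n ** 2, 0.0)))
    return models

# Toy data: three 'wavelet domains' with known signal and noise rms.
signal_rms = [2.0, 1.0, 0.1]     # coarse scales carry most of the signal
noise_rms = 0.5                  # white noise, same rms in every domain here
domains = [s * rng.standard_normal(4096) + noise_rms * rng.standard_normal(4096)
           for s in signal_rms]

models = estimate_domain_models(domains, [noise_rms] * 3)
```

As expected, the recovered models track the true signal dispersions, with the largest value on the signal-dominated coarse scale.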

5.3 Choice of the wavelet basis

For a given data set, one has to choose a specific wavelet basis for a MEM reconstruction. Some bases will obviously be more suitable to fit the data than others, and one naturally wants to find a way to choose the best one. A common selection criterion is the information entropy of the image in the wavelet representation (Coifman & Wickerhauser 1992; Zhuang & Baras 1994; Hobson et al. 1999). As will be discussed in Section 7.1, we find empirically that for orthogonal wavelets, the choice of the wavelet basis in the MEM algorithm does not play an important role. Only Haar wavelets seem to perform significantly worse than other types.

6 MULTI-RESOLUTION RECONSTRUCTION

The advantage of transforming the inverse problem to wavelet space is that it allows for a natural multiresolution description of the image. Indeed, our approach may be considered in the context of the multi-channel ICF proposed by Weir (1992), which we now discuss.

Figure 9. The multi-channel ICF. The hidden image consists of several images or channels (left). The visible channels are obtained by convolutions of the hidden parameters with point spread functions of different widths (middle). The visible image is a weighted sum of the image channels (right).

Figure 10. (a) The operation of a 3-level à trous transform on an image. (b) The transpose of the à trous transform operates on a wavelet vector with 3 scales. Each scale is convolved with the corresponding wavelet, and the resulting vector is obtained by a sum over all scales.

6.1 Multi-channel ICF

Traditionally the ICF K is taken to be a convolution with some point spread function P(x). Thus the visible image is a blurred version of the hidden variables. Popular point spread functions comprise B-splines of various orders (Gull & Skilling 1999) or Gaussians (e.g. Weir 1992). One deficiency of the 'blurring' ICF is that the width of the point spread function introduces a characteristic scale for the pixel correlations. In most applications, however, objects and correlations of varying sizes and scales are present in the same image.

A straightforward generalisation of the blurring ICF is the multi-channel ICF (Weir 1992). This method is illustrated in Fig. 9. The hidden variables consist of a number of different images or channels h_i. Each hidden channel is blurred with a different point spread function K_i such that v_i = K_i h_i. The visible image v is obtained by a weighted sum v = ∑_i w_i v_i of all blurred image channels v_i, where the weight on the ith channel is denoted by w_i. The entropy function is applied to the hidden variables. If the weights on different scales are chosen appropriately, it is entropically favourable to represent extended structure in the visible image by a single or very few coefficients in the hidden domain.

One weakness of this approach is that it introduces a large number of new free parameters, such as the widths of the point spread functions and the weighting factors w_k of the different channels (and a complete new hidden image for each scale). If there is a priori knowledge of the expected correlation scales of objects, the width of the point spread function can be chosen accordingly. However, this will rarely be the case. The reconstructions can become strongly sensitive to the chosen set of ICF widths, the number of channels, the weights and the models in different channels (Bontekoe et al. 1994). In the 'pyramidal MEM' (Bontekoe et al. 1994), the different channels contain different numbers of hidden parameters. With decreasing resolution, the number of parameters in each consecutive channel is reduced by a factor of a half in each image dimension. Bontekoe et al. (1994) find empirically that uniform models and weights (i.e. w_k = 1 for all channels k) and the same point spread function (scaled by factors of 2) can be used for all channels. They also note that the pyramid images can be interpreted as spatial band-pass filters.
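The multi-channel ICF can be sketched in a few lines. This is our own 1-D toy, not the paper's implementation: periodic Gaussian blurs of three widths, with uniform weights as suggested by Bontekoe et al. (1994), so that the visible image is the weighted sum v = ∑_i w_i K_i h_i of the blurred hidden channels:

```python
import numpy as np

def gaussian_kernel(n, fwhm):
    """Circulant 1-D Gaussian blur matrix of the given FWHM (pixels)."""
    x = np.arange(n)
    d = np.minimum(x, n - x)                      # periodic distance
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    row = np.exp(-0.5 * (d / sigma) ** 2)
    row /= row.sum()                              # normalised kernel
    return np.array([np.roll(row, k) for k in range(n)]).T

n = 64
widths = [2.0, 4.0, 8.0]                          # one ICF width per channel
weights = [1.0, 1.0, 1.0]                         # uniform weights
Ks = [gaussian_kernel(n, w) for w in widths]

rng = np.random.default_rng(2)
hidden = [rng.standard_normal(n) for _ in widths]     # hidden channels h_i

# Visible image: weighted sum of the blurred channels, v = sum_i w_i K_i h_i.
visible = sum(w * (K @ h) for w, K, h in zip(weights, Ks, hidden))
```

Each channel contributes structure on its own characteristic scale, which is precisely what the single-width blurring ICF cannot do.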

We note that the weights w_k of the visible image channels play a similar role to the scale-dependent regularisation parameters α_i from Section 5.1 that were proposed by Pantin & Starck (1996). The different weights correspond to a rescaling of the hidden image channels. From (28), this is again to first order equivalent to a corresponding rescaling (albeit quadratically) of the regularisation α or the model m.

6.2 The à trous transform as ICF

If one writes the non-orthogonal à trous transform as a series of convolutions with scaling or wavelet functions, as discussed in Section 2.3, one can show that, for the à trous transform, the ICF K = Wᵗ used in wavelet MEM is just a special case of a multi-channel ICF. From (9), the à trous wavelet coefficients w_i^j at the jth scale are given by the (discretised) convolution

w_i^j = ∑_k W_{ik}^j f_k = ∑_k ψ^j_{k−i} f_k,  (31)

of the original function or image f with a version of the wavelet ψ^j at the jth scale that was mirrored at the origin. For a symmetric wavelet function, this corresponds to a convolution with the wavelet itself. Examples of wavelet functions are shown in Figs. 2 and 3. (For an orthogonal wavelet the equivalent to (31) would be w_i = ∑_k ψ_{k−2i} h_k, where the factor of 2 in the index accounts for the decimation carried out between consecutive filter steps.) The transpose of the à trous transform operates on the wavelet coefficients and produces a new image g, which is given by

g_k = ∑_{j,i} W_{ik}^j w_i^j = ∑_{j,i} ψ^j_{k−i} w_i^j.  (32)
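The additivity over scales can be verified directly. The following 1-D sketch (ours; triangle kernel [1/4, 1/2, 1/4] with periodic boundaries, names illustrative) computes the detail domains w_j = c_{j−1} − c_j and checks that the final smooth image plus all details reproduces the input, the same summation-over-scales structure that makes the transpose (32) a sum of per-scale convolutions:

```python
import numpy as np

def atrous_1d(f, levels):
    """1-D a trous transform with the triangle kernel [1/4, 1/2, 1/4]
    and periodic boundaries.  Returns [w_1, ..., w_J, c_J]: the detail
    coefficients at each scale plus the final smooth approximation."""
    kernel = np.array([0.25, 0.5, 0.25])
    c = f.astype(float)
    out = []
    for j in range(levels):
        step = 2 ** j                       # insert 'holes' between taps
        c_next = np.zeros_like(c)
        for tap, offset in zip(kernel, (-step, 0, step)):
            c_next += tap * np.roll(c, offset)
        out.append(c - c_next)              # wavelet domain at scale j
        c = c_next
    out.append(c)
    return out

rng = np.random.default_rng(3)
f = rng.standard_normal(128)
coeffs = atrous_1d(f, levels=4)

# The smooth image plus all detail scales reproduces f exactly.
assert np.allclose(sum(coeffs), f)
```

The exact reconstruction by simple summation reflects the property, noted above, that the Fourier transforms of the scaling and wavelet functions add up to 1.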

Thus the transpose consists of a convolution of the jth wavelet domain with the corresponding wavelet ψ^j and a subsequent summation over all scales. The operation of the à trous transform and of its transpose is illustrated in Fig. 10. The effect of the transpose of the à trous transform is just the same as that of a multi-channel ICF. Its set of ICFs consists of the scaling function and the rescaled wavelet functions. Unlike in the standard multi-channel approach, the blurring functions take negative values in some areas. As in Section 6, one can introduce different weights for all channels in the à trous transform, or one can simply adapt the default model on different scales to enhance the reconstruction quality. However, if one introduces scale-dependent weights w_k, the inverse transform also needs to be rescaled. The nice property that the Fourier transforms of all wavelet functions add up to 1 is lost.

Figure 11. The original image (a) used in the simulations described in the text and the 'data image' (b) obtained after a convolution with a Gaussian point spread function with a FWHM of 5 pixels and the addition of Gaussian random noise whose rms is half the rms of the original image.

7 APPLICATION TO SIMULATED DATA

In this section, we compare the different maximum entropy methods discussed in this paper by applying them to a set of simulated two-dimensional images. One of the problems of assessing the capabilities of different reconstruction methods is that there is no single unique criterion for the quality of a reconstruction. Furthermore, the outcome of any quality measurement depends on a large number of variables, such as the type and content of the image, the properties of the instrument with which the data were observed (e.g. the point spread function and the noise), and the model and regularisation constant used in the maximum entropy algorithms. In this section, we use the rms differences between the original image and the reconstruction as a measure of the reconstruction quality. Of course, in real applications one does not have the original image available for comparison, but nevertheless one would usually hope to minimise the expected difference between true and reconstructed image. In Section 7.1 we use a photographic image as an arbitrarily chosen, but fairly general and typical test case for image reconstructions. In Section 7.2 we then turn to the reconstruction of CMB maps that are realisations obtained from an inflationary CDM model and that provide a useful astronomical test image complementary in its properties to the photographic image.

Figure 12. The dispersion of the wavelet coefficients of the data (solid line), the signal (long-dashed line) and the noise contribution (short-dashed line) of the image in Fig. 11 (b). There are four levels in the à trous transform, where level 1 corresponds to the coarse structure and level 4 is the domain with the most detailed structure.

7.1 A photographic image As a first test image we use the photographic portrait shown in Fig. 11 (a). The image contains 256 × 256 pixels and has had its mean subtracted and has been rescaled to an rms of 1. It is highly non-gaussian, and its sharp lines and contrasting continuous extended regions provide useful characteristics to assess the visual impression of different reconstruction methods. We simulate observations of this image by convolving it with different Gaussian point spread functions with FWHMs of 3, 5 and 10 pixels and adding Gaussian white noise with an rms of 0.1, 0.5 or 2. Bearing in mind that the original image had unity rms, this corresponds to signalto-noise ratios of 10, 2 and 0.5. For each FWHM and noise level, we use Monte-Carlo simulations of 15 different noise realisations. One of the realisations obtained for a FWHM of 5 pixels and a noise level of 0.5 is shown in Fig. 11 (b). To each simulation, we apply several different reconstruction

12

Klaus Maisinger, M.P. Hobson and A.N. Lasenby the spiky signatures of the Daubechies-4 wavelets are visible in the image. The reconstructions for the wavelet MEM using tensor and a` trous wavelets are shown in Fig. 15. They look very similar to the results obtained from the wavelet regularised MEM in Fig. 14. For a more quantitative comparison of the different methods, Table 1 lists the reconstruction errors for different FWHMs of the point spread function (3, 5 and 10 pixels) and different noise levels (0.1, 0.5 and 2). The quoted errors are the rms differences between the original image and the reconstruction averaged over 15 different noise realisations. The standard deviation of the error values is usually much less than 0.01 for low noise values and not more than 0.03–0.04 in the high-noise case, so the error on the mean (of the errors) is expected to be not more than 0.01 in most cases. There are several trends apparent from the numbers:

Figure 13. The 4-level a` trous transform of the image in Fig. 11 (a). Each level is plotted on a different greyscale.

algorithms. First, we use a real-space MEM algorithm whose image model is set to be constant across the image plane to the value of the estimated signal rms. The regularisation parameter α discussed in Section 4.5 is chosen from the historic MEM criterion which demands that the χ2 -statistic for the final reconstruction equal the number ND of data points, in this case 256 × 256. We also apply wavelet regularised MEM and wavelet MEM implementations, each with Daubechies-4 tensor, Daubechies-4 MRA and a` trous wavelets. The a` trous wavelets use the triangle function (10) with four levels. In all cases, the model is chosen to be constant within a given wavelet domain. The model values for each domain or scale are found by setting them to the expected signal dispersions as discussed in Section 5.2. Fig. 12 plots data, signal and noise dispersions for the 4-level a` trous transform. One can clearly see that on coarse scales (level 1) the data are dominated by the signal and on the finest scales (level 4) by the noise. The 4 levels of the a` trous transform are shown in Fig. 13. Each level is plotted on its own greyscale. The detail levels would be much fainter if they were plotted on the same scale. Some of the reconstructions produced from the ‘data’ in Fig. 11 (b) are presented in Fig. 14. The first image (a) shows a reconstruction obtained with the wavelet regularised MEM using tensor wavelets. The image looks more or less smooth. However, there are some faint structures along the horizontal and vertical directions. These are the signatures of the highly non-differentiable Daubechies-4 wavelets (compare Fig. 1 (b) for the one-dimensional profile). The second image (b), which was produced with the a` trous wavelets, is the visually most appealing reconstruction. The image is very smooth, but due to the high noise levels on the data there is no apparent attempt at a deconvolution of the point spread function. In (c) the results from real-space MEM are shown. 
The image is dominated by the ‘ringing’ or little speckles that are characteristic for the real-space method. Visually, this image is clearly the poorest reconstruction of the MEM reconstructions. Finally, in (d) we show a reconstruction obtained from a SureShrink filter with Daubechies-4 tensor wavelets, as introduced in Section 3. As in (a),

• For large point spread functions, all methods yield similar results (except perhaps for the real-space MEM at low signal-tonoise). • For high signal-to-noise, the methods also have a similar performance. • For sufficiently narrow point spread functions and poor signalto-noise ratios, real-space MEM performs clearly worse than its competitors. • There is some indication that the wavelet MEM performs better than the wavelet regularised MEM. • Also, there are some hints that a` trous wavelets may outperform the tensor wavelets at least for the wavelet MEM, although in wavelet regularised MEM they are less able to cope with high noise levels. • In the low signal-to-noise regime the SureShrink filter matches the performance of the wavelet MEM techniques, indicating that the errors are entirely dominated by the noise. For low noise levels, however, MEM is more effective than the filter, which makes no attempt at a deconvolution. In the case when the point spread function is narrow and the noise dominant, it is most difficult for the MEM algorithms to distinguish between signal and noise and the performance differences become most prominent. Within the wavelet algorithms, there are of course many free parameters that may influence the reconstruction errors. We have performed simulations with different types of orthogonal wavelets and both two-dimensional tensor and MRA transforms. There is no significant difference except for the case of the simple Haar wavelets, which seem to perform slightly worse than other wavelets. Likewise, for the a` trous algorithm the triangle function seems to be similarly efficient as the B3 -spline, and the number of levels in the transform does have marginal effects as long it is greater than a minimum of three or four. We have also tested different models for the wavelet coefficients. Generally, there are many models suppressing power on small wavelet scales that provide a similar performance. 
A stronger relative penalty on small-scale structure, for example by using the variance instead of the dispersion of the signal, can be advantageous for poor signal-to-noise ratios, since most of the fine structure will be noise. We find no benefits from methods that attempt a more data-dependent regularisation, for instance using the regularisation constant (30). Tracing the reconstruction errors back to the wavelet domain, it appears that, not surprisingly, the dominant contribution to the errors comes from those domains where the signal-to-noise is close to unity. The other domains are either perfectly reconstructed


Figure 14. Reconstructions of the data from Fig. 11 (b) using the ‘wavelet regularised MEM’ algorithm: (a) with Daubechies-4 tensor wavelets; (b) with 4-level à trous wavelets. A reconstruction with real-space MEM is shown in (c), and a SureShrink reconstruction in (d). Compare the original image in Fig. 11 (a).

PSF FWHM                      3                   5                   10
noise rms             0.1   0.5   2.0     0.1   0.5   2.0     0.1   0.5   2.0
real-space MEM        0.12  0.30  0.60    0.14  0.25  0.49    0.21  0.26  0.41
tensor reg. MEM       0.12  0.20  0.39    0.15  0.22  0.37    0.21  0.26  0.36
à trous reg. MEM      0.11  0.21  0.45    0.14  0.21  0.39    0.21  0.26  0.37
tensor wavelet MEM    0.12  0.20  0.34    0.16  0.22  0.34    0.22  0.27  0.36
à trous wavelet MEM   0.11  0.18  0.32    0.14  0.21  0.32    0.21  0.25  0.35
SureShrink            0.12  0.19  0.32    0.17  0.22  0.36    0.26  0.29  0.36

Table 1. Reconstruction errors for different methods as a function of the FWHM (in image pixels) of the convolution mask and of the noise level on the data. The numbers quote the rms differences between the reconstruction and the original image, averaged over a number of noise realisations as described in the text.



Figure 15. Reconstructions of the simulated image from Fig. 11 with ‘wavelet MEM’: (a) using Daubechies-4 tensor wavelets; (b) using à trous wavelets. Compare the corresponding reconstructions obtained from ‘wavelet regularised MEM’ shown in Figs. 14 (a) and (b).


Fig. 16 shows the dependence of the reconstruction error on the regularisation constant α for real-space MEM and for wavelet MEM using tensor and à trous wavelets. These results are obtained from a simulation with a FWHM of 5 pixels and a noise level of 0.5. Each point corresponds to the rms difference between the final reconstruction and the original image when the regularisation parameter α has been fixed to the given value. One can clearly see a trade-off between too close a fit to the data (and thus to spurious noise) and too strong a regularisation (and thus a suppression of real structure in the image). Both extremes result in high reconstruction errors. For each reconstruction method, a short horizontal line marks the point along the curve that is preferred by the χ² = N_D criterion. Two features of these curves are worth pointing out:


Figure 16. The rms reconstruction errors as a function of the regularisation constant α, for a simulation with a 5-pixel FWHM blurring function and a signal-to-noise ratio of 2. For real-space MEM (asterisks) and wavelet MEM using tensor (diagonal crosses) and à trous wavelets (perpendicular crosses), the plotted points show the final errors for a reconstruction with a fixed value of α. For each method, the small lines indicate the value of α that would have been obtained from the historic MEM criterion χ² = N_D.

• The global minima of the curves are indeed lower for the wavelet methods than for real-space MEM. This means that the lower errors of the wavelet methods in Table 1 are not merely artefacts of a poorly chosen regularisation constant. On the contrary, for the real-space algorithm the historic MEM criterion actually picks out points that are very close to the global minimum of the curve, despite the visual ringing evident from the image.

• The curves appear to be less narrowly peaked for the wavelet methods than for real-space MEM. This means the quality of the reconstruction is less sensitive to α. For the wavelet methods, one can vary α by about an order of magnitude around the minimum without significantly affecting the reconstruction errors.
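To make the historic criterion concrete, the sketch below (Python with numpy; a toy one-dimensional problem with a quadratic regulariser standing in for the entropy, not the MEMSYS implementation used in this paper) finds α by bisection, exploiting the fact that χ² increases monotonically with α:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
sigma = 0.5

# Toy 1-d problem: Gaussian blur (response matrix R) plus white noise.
x = np.arange(N)
R = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # blur std = 1 pixel
R /= R.sum(axis=1, keepdims=True)
h_true = (rng.random(N) < 0.1) * rng.random(N) * 10.0
d = R @ h_true + sigma * rng.normal(size=N)

def chi2(alpha):
    """Misfit of the regularised solution minimising chi^2/2 + alpha h.h."""
    A = R.T @ R / sigma**2 + 2.0 * alpha * np.eye(N)
    h = np.linalg.solve(A, R.T @ d / sigma**2)
    return np.sum((R @ h - d) ** 2) / sigma**2

# chi^2 grows monotonically with alpha, so bisect in log(alpha)
# until the historic criterion chi^2 = N_D is met.
lo, hi = 1e-8, 1e8
for _ in range(80):
    mid = np.sqrt(lo * hi)
    lo, hi = (mid, hi) if chi2(mid) < N else (lo, mid)
alpha_hist = np.sqrt(lo * hi)
print(chi2(alpha_hist))   # converges to N_D = 64
```

Too small an α over-fits the noise (χ² well below N_D); too large an α over-smooths (χ² well above N_D); the bisection lands on the unique crossing point.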

(the coarse structure) or contribute little to the image (the noise-dominated small-scale structure).

7.1.1 The influence of the regularisation constant α

Besides the choice of the model, the selection of the regularisation constant α is one of the major problems in the maximum-entropy method, as already discussed in Sections 4.5 and 5.2. By setting the χ²-statistic to the number of data points, χ² = N_D, one essentially fixes the ‘goodness of fit’, which may result in very similar error levels on the reconstructions independent of the basis functions or regularisation. Fig. 16 illustrates the dependence of the reconstruction error on α.

7.1.2 A Bayesian choice of the regularisation constant α

The proper Bayesian way to determine the regularisation constant α is to treat it as an additional model parameter and marginalise the posterior probability over α, i.e. integrate over all possible values of α (see Section 4.5). Because the evidence is often strongly peaked, it is usually sufficient to maximise the evidence to find the optimal value of α. The reconstruction errors for MEM with a Bayesian α (using MEMSYS 5) are shown in Table 2. It is evident that for the real-space MEM the reconstruction errors are very poor compared to those obtained with the historic MEM criterion

PSF FWHM                      3                   5                   10
noise rms             0.1   0.5   2.0     0.1   0.5   2.0     0.1   0.5   2.0
real-space MEM        0.47  0.63  0.85    0.43  0.57  0.74    0.38  0.50  0.65
tensor wavelet MEM    0.12  0.21  0.35    0.15  0.23  0.35    0.22  0.27  0.37
à trous wavelet MEM   0.15  0.27  0.43    0.15  0.22  0.35    0.20  0.27  0.42
ICF MEM               0.12  0.24  0.44    0.18  0.28  0.45    0.25  0.34  0.47

Table 2. Reconstruction errors as in Table 1, but for a Bayesian choice of the regularisation constant α (classic MEM).

PSF FWHM                       3                     5                  10
noise rms              0.1    0.5    2.0     0.1    0.5   2.0    0.1   0.5  2.0
real-space MEM            0      0      0       0      0     0      0     0    0
tensor wavelet MEM    17040   6910   1830    6240   3060  1010   1350   930  370
à trous wavelet MEM   15310   5850   1360    6090   2590   730   1290   120  170
ICF MEM               13900   5080   1220    4450   1890   560    850   430  150

Table 3. The logarithm ln Pr(d) of the evidence Pr(d) for classic MEM reconstructions with a Bayesian choice of the regularisation constant α. The logarithms are normalised such that they equal zero for the real-space MEM for each dataset.

in Table 1. This is expected from the curve in Fig. 16: the historic MEM criterion picks a value of α that is close to the minimum of the trade-off curve, so the reconstruction errors can only get worse for a different choice of α. For comparison with Fig. 16, the Bayesian value for the real-space MEM is α = 0.12. In fact, it is this poor performance of the real-space method that historically led to the introduction of ICFs and the search for a better image model (e.g. MacKay 1992). The wavelet MEM improves the reconstruction errors dramatically, especially in the case of the tensor wavelets (for comparison with Fig. 16, α = 0.5). The errors even approach those obtained for the à trous wavelet MEM in Table 1. The à trous transform (α = 0.04 in Fig. 16) has problems for narrow point spread functions; it tends to allow too much image structure on small scales. However, by suitably reweighting the individual scales of the transform, the reconstructions can be much improved. The optimal weighting factors will be investigated in future work. The last row of Table 2 shows the reconstruction errors obtained from a Gaussian multi-channel ICF with four channels. The widths of the Gaussians are scaled by a factor of 2 between channels, and the weights are chosen such that the volumes of the ICFs are identical between levels. Even though the weights are not specifically adapted to the data, the reconstruction errors are much lower than for real-space MEM. Again, the reconstructions can be much improved by the choice of different convolution masks and weightings. The evidence for classic real-space and wavelet MEM is presented in Table 3. The values are the logarithms ln Pr(d) of the evidence Pr(d) as introduced in Section 4. They have been normalised such that for a given FWHM and noise rms the logarithm for the real-space MEM equals zero. The values presented here thus allow a comparison of the evidence for a model with ICF and the simple real-space MEM without ICF.
The standard deviation across the different noise realisations is roughly 170 (in units of ln Pr(d)). It is apparent that the evidence for the ICF MEM and wavelet MEM is significantly higher than for the real-space MEM, as would be expected from the improved reconstruction errors. Furthermore, the lower reconstruction residuals for the tensor transform are matched by a higher evidence. From a Bayesian point of view, the ICF model

and the wavelet basis are therefore the favoured models for image reconstruction. Unfortunately, a similar calculation of the evidence for the wavelet regularised MEM is not possible, because the software package MEMSYS 5 that is used for the reconstruction is limited to regularisation functionals whose curvature matrix is diagonal.

7.2 CMB maps

Having applied the wavelet MEM techniques to a general image, we now turn to the reconstruction of CMB maps. We apply the different methods to five CMB maps that are realisations obtained from the inflationary cold dark matter (CDM) model. The map size is 16° × 16° with 256 × 256 pixels, corresponding to a pixel size of 3.75′. The maps are convolved with Gaussian beams with FWHMs of 11.25′ (3 pixels), 18.75′ (5 pixels) and 37.5′ (10 pixels) respectively. Gaussian white noise is added with signal-to-noise ratios of σS/σN = 10, 2 and 0.5. For each signal-to-noise ratio, we create five different noise realisations. With the five different input CMB maps, there are thus 25 different combinations of sky and noise realisations for a given FWHM and noise level. The maps are reconstructed on the same grid as the simulated data. Again we use the historic MEM criterion to determine the regularisation constant instead of classic MEM, since it produces sufficiently good reconstruction errors and is also easily applicable to wavelet-regularised MEM. The reconstruction errors for the different methods are quoted in Table 4. Because of the wealth of information present on all scales in the CMB maps, the errors are generally larger than for the photographic image (compare Table 1). Although the differences in reconstruction errors are less conspicuous, the results confirm the conclusions drawn from the photographic reconstructions. For low signal-to-noise ratios and narrow point spread functions, the wavelet-based methods are superior to real-space MEM.
However, there is no significant difference between wavelet regularised and wavelet MEM methods, even though there is still some indication that à trous wavelets perform better than orthogonal ones for high signal-to-noise ratios. In the last row, we also show that the SureShrink

PSF FWHM                      3                   5                   10
noise rms             0.1   0.5   2.0     0.1   0.5   2.0     0.1   0.5   2.0
real-space MEM        0.14  0.34  0.66    0.26  0.39  0.62    0.44  0.51  0.66
tensor reg. MEM       0.18  0.33  0.57    0.28  0.40  0.58    0.45  0.52  0.66
à trous reg. MEM      0.15  0.32  0.59    0.27  0.39  0.59    0.45  0.52  0.65
tensor wavelet MEM    0.18  0.33  0.56    0.30  0.41  0.59    0.47  0.54  0.67
à trous wavelet MEM   0.15  0.32  0.55    0.27  0.40  0.57    0.45  0.52  0.65
SureShrink filter     0.24  0.34  0.55    0.38  0.44  0.58    0.57  0.59  0.67

Table 4. Reconstruction errors between reconstructions and original CMB maps. The errors are averaged over a set of 25 simulations with 5 different noise and image realisations.


Figure 17. (a) One realisation of the CMB maps used in the simulations described in the text. The map is normalised to unit rms, and the axes are labelled in degrees. (b) The ‘data image’ obtained after a convolution with a Gaussian point spread function with a FWHM of 5 pixels and the addition of Gaussian random noise whose rms is half the rms of the original image.

filter again cannot match the performance of maximum entropy for high signal-to-noise ratios. For illustration, one of the realisations of the CMB maps used in the simulations is shown in Fig. 17 (a). A simulated observation of this map using a point spread function with a FWHM of 5 pixels and Gaussian random noise of rms 0.5 is shown in Fig. 17 (b). Reconstructions using wavelet MEM are shown in Figs. 18 (a) for Daubechies-4 tensor wavelets and (b) for à trous wavelets. While the reconstruction errors are virtually identical in both cases, the tensor reconstruction shows distinctly non-Gaussian features introduced by the spiky wavelet functions. Real-space MEM reconstructions of the same data are plotted in Figs. 18 (c) and (d). The reconstruction (c) was obtained with the historic MEM criterion; its reconstruction errors are similar to those of the wavelet methods. The image (d) was obtained from a Bayesian choice of the regularisation constant with classic MEM. The reconstruction quality is visibly inferior, which is confirmed by the reconstruction errors (0.59 compared to 0.40 for historic MEM). The averaged logarithms ln Pr(d) of the Bayesian evidence Pr(d) obtained from reconstructions of the CMB maps with the classic MEM criterion are shown in Table 5. They confirm the results from Table 3 and the visual impression from Fig. 18. The evidences for the wavelet methods are again significantly higher

than for the real-space MEM, even though the relative evidence ratios are not quite as pronounced as for the reconstructions of the photographic image.
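For reference, the SureShrink-type filter used as a comparison throughout can be sketched as soft thresholding in an orthogonal wavelet basis. The sketch below (Python with numpy) uses Haar wavelets and the simpler universal threshold σ√(2 ln N) as a stand-in for the level-dependent SURE thresholds of Donoho & Johnstone (1995):

```python
import numpy as np

def haar_dwt(c):
    """One level of the orthonormal Haar transform: smooth and detail halves."""
    s = (c[0::2] + c[1::2]) / np.sqrt(2.0)
    w = (c[0::2] - c[1::2]) / np.sqrt(2.0)
    return s, w

def haar_idwt(s, w):
    """Inverse of one Haar level (the transpose, since the transform is orthogonal)."""
    c = np.empty(2 * s.size)
    c[0::2] = (s + w) / np.sqrt(2.0)
    c[1::2] = (s - w) / np.sqrt(2.0)
    return c

def denoise(d, sigma, levels=4):
    """Soft-threshold the detail coefficients; the universal threshold
    sigma*sqrt(2 ln N) stands in for SureShrink's SURE-optimal thresholds."""
    t = sigma * np.sqrt(2.0 * np.log(d.size))
    details, c = [], d.copy()
    for _ in range(levels):
        c, w = haar_dwt(c)
        details.append(np.sign(w) * np.maximum(np.abs(w) - t, 0.0))
    for w in reversed(details):
        c = haar_idwt(c, w)
    return c

rng = np.random.default_rng(1)
n = 256
x = np.linspace(0, 1, n)
signal = np.where(x < 0.5, 2.0, -1.0)        # piecewise-constant test signal
sigma = 0.3
data = signal + sigma * rng.normal(size=n)
clean = denoise(data, sigma)
print(np.sqrt(np.mean((data - signal) ** 2)),    # rms error of the raw data
      np.sqrt(np.mean((clean - signal) ** 2)))   # substantially smaller after filtering
```

Because white noise spreads evenly over all wavelet coefficients while the signal is concentrated in a few large ones, thresholding removes most of the noise; as noted in the text, the filter makes no attempt at a deconvolution.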

8 CONCLUSIONS

Wavelets are functions that enable an efficient representation of signals or images. They help to identify and compress the information content of the signal into a small number of parameters. Because wavelet functions span a whole range of spatial scales, they can be used to describe signal correlations of different characteristic lengths. In this paper, we have investigated how wavelets can be combined with the maximum entropy method to improve the reconstruction of images from blurred and noisy data. There are two principal ways to incorporate wavelets into the maximum entropy method. First, the wavelet transform can be treated as an intrinsic correlation function that is used to decorrelate the data (wavelet MEM). Secondly, the wavelet transform can be combined with the entropy functional into a new effective wavelet entropy (wavelet regularised entropy). We have implemented both approaches for orthogonal wavelet transforms. Another type of wavelet transform is the à trous transform.


Figure 18. Reconstructions of the data from Fig. 17 (b) using the ‘wavelet MEM’ algorithm: (a) with Daubechies-4 tensor wavelets; (b) with 4-level à trous wavelets. A reconstruction with real-space historic MEM is shown in (c), and a real-space reconstruction using classic MEM with a Bayesian choice of the regularisation constant in (d). Compare the original image in Fig. 17 (a).

PSF FWHM                      3                    5                  10
noise rms             0.1    0.5   2.0     0.1   0.5   2.0    0.1   0.5  2.0
real-space MEM           0      0     0       0     0     0      0     0    0
tensor wavelet MEM    8474   3437   930    2121  1287   431    155   268   90
à trous wavelet MEM   6197   2526   631    2305  1262   423    542   296   75

Table 5. The logarithm ln Pr(d) of the Bayesian evidence Pr(d) for different reconstructions of the CMB maps. The values are averaged over a set of 25 simulations with 5 different noise and image realisations and are normalised such that they equal zero for the real-space MEM for each dataset.

It is non-orthogonal and has the benefit of being invariant under translations and rotations of the image. We show that the à trous transform can be considered as a special case of a multi-channel ICF, in which the image is produced from a linear combination of images convolved with point spread functions of different widths. The quality of the reconstruction depends on the relative weights assigned to each channel or scale.

We have applied MEM implementations using both orthogonal and à trous wavelets to simulated observations of CMB temperature anisotropies. We find that while the relative weighting of scales or channels is important, there is a range of different weightings that yield roughly similar results as long as they suppress small-scale structure in the image. It does not matter much whether the weighting is introduced by setting different channel weights or by rescaling the entropy expressions or default models. Weightings that suppress small-scale structure more efficiently can perform better for low signal-to-noise, while they are worse for high signal-to-noise. Furthermore, we also find that for images containing structure on different scales, like CMB maps, methods that try to improve the reconstruction by ad hoc assignments of pixel- and data-dependent weights usually do not enhance the reconstruction quality. The more complicated reconstruction prescriptions given by Pantin & Starck (1996) have no benefits for CMB map-making.

As far as reconstruction errors are concerned, wavelet-based maximum-entropy algorithms seem to match the standard MEM in pixel space (real-space MEM) for large point spread functions or low noise levels. There exist sufficient well-determined degrees of freedom in the data to make the image basis irrelevant. On the other hand, for poor signal-to-noise and narrow convolution masks, wavelet methods outperform real-space MEM. Thus the use of wavelet techniques can improve the reconstruction of images in many cases, while there is no disadvantage to using these methods in other situations. The improvement seems to be genuinely related to the basis set or ICF and not just an artefact of an improper choice of the regularisation. A Bayesian treatment of the regularisation constant and a comparison of the different reconstruction methods shows a much higher evidence for ICF methods than for the simple real-space MEM. In a Bayesian context, the wavelet basis can thus be interpreted as a better ‘model’ for the image. In this regard, it fulfills the promise of an improvement over the real-space method that the ICF was designed to address (see e.g. MacKay 1992). The isotropic à trous transform or the multi-channel ICF can in some cases improve on orthogonal wavelets. Furthermore, they can be implemented within the MEMSYS 5 maximum-entropy kernel and thus be applied in a proper Bayesian maximisation scheme.
To summarise, we conclude that the use of wavelets in MEM image reconstructions is a successful technique that can improve the quality of image reconstructions.

ACKNOWLEDGMENTS We thank Belen Barreiro, Steve Gull, Phil Marshall and Charles McLachlan for helpful discussions. Klaus Maisinger acknowledges support from an EU Marie Curie Fellowship.

REFERENCES

Bijaoui A., Starck J.-L., Murtagh F., 1994, Traitement du Signal, 11, 229
Bontekoe T. R., Koper E., Kester D. J. M., 1994, A&A, 284, 1037
Bridle S. L., Hobson M. P., Lasenby A. N., Saunders R., 1998, MNRAS, 299, 895
Cayón L., Sanz J. L., Barreiro R. B., Martínez-González E., Vielva P., Toffolatti L., Silk J., Diego J. M., Argüeso F., 2000, MNRAS, 315, 757
Coifman R. R., Wickerhauser M. V., 1992, IEEE Transactions on Information Theory, 38, 713
Daubechies I., 1992, Ten Lectures on Wavelets. SIAM, Philadelphia
Donoho D. L., Johnstone I. M., 1995, Journal of the American Statistical Association, 90, 1200
Frieden B. R., 1983, J. Opt. Soc. Am., 73, 927
Gull S. F., 1989, in Skilling J., ed., Maximum Entropy and Bayesian Methods. Kluwer, Dordrecht, p. 53
Gull S. F., Skilling J., 1999, Quantified Maximum Entropy, MemSys5 Users’ Manual. Maximum Entropy Data Consultants Ltd.
Hobson M. P., Jones A. W., Lasenby A. N., Bouchet F. R., 1998, MNRAS, 300, 1
Hobson M. P., Jones A. W., Lasenby A. N., 1999, MNRAS, 309, 125
Hobson M. P., Lasenby A. N., 1998, MNRAS, 298, 905
Holschneider M., Kronland-Martinet R., Morlet J., Tchamitchian P., 1989, in Combes J. M., Grossman A., Tchamitchian P., eds, Wavelets: Time-Frequency Methods and Phase Space. Springer-Verlag, Berlin, p. 286
Jones A. W., Hobson M. P., Lasenby A. N., 1999, MNRAS, 305, 898
Langer W. D., Wilson R. W., Anderson C. H., 1993, ApJ, 408, L45
MacKay D. J. C., 1992, Neural Computation, 4, 415
Mallat S. G., 1989, IEEE Trans. Acoust. Speech Sig. Proc., 37, 2091
Marshall P. J., Hobson M. P., Gull S. F., Bridle S. L., 2002, MNRAS, 335, 1037
Pantin E., Starck J.-L., 1996, A&AS, 118, 575
Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1992, Numerical Recipes, 2nd edn. Cambridge University Press
Sanz J. L., Argüeso F., Cayón L., Martínez-González E., Barreiro R. B., Toffolatti L., 1999a, MNRAS, 309, 672
Sanz J. L., Barreiro R. B., Cayón L., Martínez-González E., Ruiz G. A., Díaz F. J., Argüeso F., Silk J., Toffolatti L., 1999b, A&AS, 140, 99
Skilling J., 1989, in Skilling J., ed., Maximum Entropy and Bayesian Methods. Kluwer, Dordrecht, p. 45
Starck J.-L., Murtagh F., Bijaoui A., 1998, Image Processing and Data Analysis: The Multiscale Approach. Cambridge University Press, Cambridge, UK
Starck J.-L., Murtagh F., Querre P., Bonnarel F., 2001, A&A, 368, 730
Tenorio L., Jaffe A. H., Hanany S., Lineweaver C. H., 1999, MNRAS, 310, 823
Thompson A. R., Moran J. M., Swenson G. W., 1994, Interferometry and Synthesis in Radio Astronomy. Krieger Publishing Company, New York
Titterington D. M., 1985, A&A, 144, 381
Vielva P., Barreiro R. B., Hobson M. P., Martínez-González E., Lasenby A. N., Sanz J. L., Toffolatti L., 2001, MNRAS, 328, 1
Weir N., 1992, in ASP Conf. Ser. 25: Astronomical Data Analysis Software and Systems I, Vol. 1, p. 186
Zhuang Y., Baras J. S., 1994, report CSHCN TR 94-7 / ISR-TR 94-3, Center for Satellite and Hybrid Communication Networks / The Institute for Systems Research

APPENDIX A: MULTIRESOLUTION ANALYSIS

Multiresolution analysis provides a simple framework for describing the properties of wavelet and scaling functions discussed in Section 2. Let L²(R) be the set of square-integrable functions. A multiresolution analysis is a sequence of closed subspaces {V_j}, j ∈ Z, of L²(R) which approximate L²(R). The subspaces are nested,

. . . ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ . . . ,

such that their union ∪_{j∈Z} V_j is dense in L²(R), and for any function f(x)

f(x) ∈ V_j ⇔ f(2x) ∈ V_{j+1}.

In this picture, the scaling functions {φ(x − l), l ∈ Z} form an orthonormal basis of the reference space V_0. Because V_0 ⊂ V_1, elements of V_0 can be written as linear combinations of those of V_1:

φ(x) = Σ_k √2 H_k φ(2x − k).   (A1)

This expression is called a refinement relation for the scaling function φ(x). Given a subspace V_j and its basis of scaling functions φ(x), one can ask how this basis in V_j can be extended to a basis in V_{j+1}. The wavelet functions ψ(x) can be introduced as a set of functions needed to complete the basis in V_{j+1}. In other words, they form a basis of the orthogonal complement W_j of V_j, i.e. V_{j+1} = V_j ⊕ W_j. Consequently, the wavelet functions can be written in the form of a refinement relation

ψ(x) = Σ_k √2 G_k φ(2x − k).   (A2)

The coefficients G_k are related to the H_k via G_k = (−1)^k H_{1−k}. In a signal-processing context, the sequences H_k and G_k are called quadrature mirror filters. The sequence H_k is a low-pass filter, while G_k acts as a high-pass filter. The wavelet transform represents a function in terms of its smooth basis functions φ_{0,l} ∈ V_0 and the detail functions ψ_{j,l} ∈ W_j.

A1 The discrete wavelet transform

For the discrete wavelet transform introduced in Section 2.2, wavelet coefficients can be constructed directly from the data vector c_{J,l} = f(x_l) (l = 1, . . . , N) via

c_{j−1,k} = Σ_l H_{l−2k} c_{j,l},
w_{j−1,k} = Σ_l G_{l−2k} c_{j,l}.   (A3)

This prescription leads to a pyramidal algorithm for the implementation of the discrete wavelet transform. At each iteration, the data vector of length 2^j is split into 2^{j−1} detail components and 2^{j−1} smoothed components. The smoothed components are then used as input for the next iteration to construct the detail coefficients at the next larger scale.

A2 The à trous transform

For the à trous transform from Section 2.3, a data vector c_{J,l} (l = 1, . . . , N, 2^J ≤ N) is iteratively smoothed with a filter mask H_l (8):

c_{j−1,k} = Σ_l H_l c_{j,k+2^{J−j} l}.

The coefficients H_l can be derived from a refinement relation of the form

φ(x) = Σ_l 2 H_l φ(2x − l),

where φ(x) is a scaling function. Note the difference of a factor of √2 compared to (A1). The à trous wavelet vector is given by

w_{j−1,k} = Σ_l G_l c_{j,k+2^{J−j} l} = c_{j,k} − c_{j−1,k},

from which G_l = δ_{l,0} − H_l. For the wavelet ψ(x) one finds the scaling relation

ψ(x) = Σ_l 2 G_l φ(2x − l) = 2φ(2x) − φ(x).

APPENDIX B: TWO-DIMENSIONAL WAVELET TRANSFORMS

B1 Tensor products

The simplest approach to extending the wavelet transform to two dimensions uses tensor products of the one-dimensional wavelets as a two-dimensional basis (e.g. Press et al. 1992). The resulting algorithm is straightforward. First one performs a wavelet transform on the first index of the image matrix for all possible values of the second, and then on the second. The tensor basis functions mix one-dimensional wavelets from different scales; they are given by

φ_{0,0;l1,l2}(x, y) = φ_{0,l1}(x) φ_{0,l2}(y),
ζ_{j1,0;l1,l2}(x, y) = ψ_{j1,l1}(x) φ_{0,l2}(y),
ξ_{0,j2;l1,l2}(x, y) = φ_{0,l1}(x) ψ_{j2,l2}(y),
ψ_{j1,j2;l1,l2}(x, y) = ψ_{j1,l1}(x) ψ_{j2,l2}(y).

The two-dimensional pixelised image has dimensions N × N = 2^{J1} × 2^{J2}, and by analogy with (6) one has 0 ≤ j1 ≤ J1 − 1 and 0 ≤ l1 ≤ 2^{j1} − 1, and similarly for j2 and l2. If we denote such an image by the (N × N) matrix T, then the matrix of wavelet coefficients is given by

T̃ = W^{(2d)} T = W^{(1d)} T W^{(1d)t},   (B1)

where W^{(1d)} is the N × N matrix describing the one-dimensional transform that was introduced in (7), and W^{(1d)t} is its transpose. The matrix T̃ is partitioned into J1 × J2 separate domains of 2^{j1} × 2^{j2} wavelet coefficients (see Fig. B1), according to the scale indices j1 and j2 in the horizontal and vertical directions respectively.

Figure B1. The partitioning of the wavelet image into different domains for the tensor transform. There are J × J domains mixing different scales j1 and j2. Increasing j corresponds to finer scales.

By analogy with the one-dimensional case, as j1 increases the wavelets represent the horizontal structure in the image on increasingly smaller scales. Similarly, as j2 increases the wavelets represent the increasingly fine-scale vertical structure in the image. Thus domains that lie on the leading diagonal (i.e. with j1 = j2) contain coefficients of two-dimensional wavelets that represent the image at the same scale in the horizontal and vertical directions, whereas domains with j1 ≠ j2 contain coefficients of two-dimensional wavelets describing the image on different scales in the two directions.

B2 MRA transforms

The MRA transform uses three different tensor products of one-dimensional wavelets. Dilated versions of these tensor products form the detail wavelets. They do not mix scales and can be described in terms of a single scale index j. At each scale level j, these bases are given by

φ_{0;l1,l2}(x, y) = φ_{0,l1}(x) φ_{0,l2}(y),
ψ^H_{j;l1,l2}(x, y) = ψ_{j,l1}(x) φ_{j,l2}(y),
ψ^V_{j;l1,l2}(x, y) = φ_{j,l1}(x) ψ_{j,l2}(y),
ψ^D_{j;l1,l2}(x, y) = ψ_{j,l1}(x) ψ_{j,l2}(y).

The φ_{j;l1,l2} wavelet is simply an averaging function at the jth level, while the other three wavelets correspond to structure at the jth scale level in the horizontal, vertical and diagonal directions in the image. The structure of the matrix T̃ from (B1) is different for the MRA transform. Obviously, the matrix is square and its domains can be described by a single value of J. Fig. B2 shows how it is partitioned into 3(J − 1) detail domains and one domain with the smoothed components.

Figure B2. The partitioning of the wavelet image into different domains for the MRA transform. At each scale j, there are 3 different domains: horizontal (H), vertical (V) and diagonal (D). Again, increasing j corresponds to finer scales.

APPENDIX C: CALCULATION OF DERIVATIVES

In an implementation of the maximum entropy algorithm, the posterior functional F(h) = ½ χ²(h) − αS(h) has to be minimised numerically. Some minimisation routines, like the MEMSYS 5 package (Gull & Skilling 1999), require first derivatives of F with respect to the (visible or hidden) image pixels, while others, like the simple Newton-Raphson method, additionally require second derivatives. The derivatives of the χ²- and entropy terms can be calculated separately. The χ²-functional for a linear instrumental response matrix R is given by

χ²(h) = (RKh − d)^t N^{−1} (RKh − d),

where d is the data vector and K the ICF. The gradient and curvature can be derived straightforwardly:

∇_h χ² = (RK)^t N^{−1} (RKh − d),   (C1)
∇_h ∇_h χ² = K^t R^t N^{−1} RK.   (C2)

For an implementation, one thus needs to provide the transformation R, the wavelet transform W = K^t and their transposes R^t and W^t.

C1 Real-space MEM

For a maximum entropy algorithm operating on data and reconstructed images given in real space, i.e. in the same image plane, the data can be predicted by a convolution of the underlying image distribution h(x) with a point spread function P(x). The discrete predicted data samples d^P_i are given by

d^P_i = Σ_j R_{ij} h(x_j) = Σ_j P(x_i − x_j) h(x_j).   (C3)

The transpose R^t of the response matrix is easily obtained. Its operation on an image vector h, where h_i = h(x_i), is given by

Σ_j P(−x_i + x_j) h(x_j).   (C4)

This is a correlation of the functions P(x) and h(x), and not a convolution. For a symmetric beam, R = R^t. The terms R²_{ij}, which are useful to derive the curvature (C2), are also straightforward to calculate. The convolution can be most easily implemented with FFTs. By substituting (C3) and (C4) into the expressions (C1) and (C2), and setting the ICF to the identity K = 1, one obtains the derivatives of the χ²-functional.

C2 Wavelet MEM

Wavelet MEM uses the real-space response matrix derived in Section C1, and an ICF K that is given by the transpose W^t of the wavelet transform. Derivatives can be obtained by substitution into the expressions (C1) and (C2). For the orthogonal wavelet transforms, the transpose of the wavelet transform is simply the inverse transform. The transpose of the à trous transform is given by (32). We note that the χ²-curvature ∇_h ∇_h χ² = W R^t N^{−1} R W^t is numerically difficult to evaluate efficiently for the orthogonal transforms. However, for a diagonal noise covariance N, N_{ii} = σ²_i, the diagonal elements of the curvature are

∂²χ²/∂h²_n = Σ_{i,k,o,p} W_{nk} R^t_{ki} (2/σ²_i) R_{io} W^t_{op} δ_{pn}.

If the noise covariance is constant across pixels, with σ_i = σ, this reduces to

∂²χ²/∂h²_n = (2/σ²) Σ_{i,k,o,p} W_{nk} R^t_{ki} R_{io} W^t_{op} δ_{pn},

which only needs to be evaluated once for each wavelet domain. This makes a calculation of the curvature feasible in a reasonable amount of time.
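As a concrete illustration of the separable transform (B1), and of the fact used throughout this appendix that the transpose of an orthogonal wavelet matrix is its inverse, the sketch below (Python with numpy) builds a one-level Haar matrix; the full multi-level W would be a product of such matrices acting on the smooth half:

```python
import numpy as np

def haar_matrix(n):
    """One-level orthonormal Haar analysis matrix W (n x n, n even):
    the first n/2 rows average neighbouring pairs, the last n/2 difference them."""
    W = np.zeros((n, n))
    for k in range(n // 2):
        W[k, 2 * k] = W[k, 2 * k + 1] = 1.0 / np.sqrt(2.0)
        W[n // 2 + k, 2 * k] = 1.0 / np.sqrt(2.0)
        W[n // 2 + k, 2 * k + 1] = -1.0 / np.sqrt(2.0)
    return W

n = 8
W = haar_matrix(n)
rng = np.random.default_rng(2)
T = rng.normal(size=(n, n))       # a small test image
T_tilde = W @ T @ W.T             # separable 2-d transform, as in (B1)

# Orthogonality makes the transpose the inverse, so reconstruction is exact.
print(np.allclose(W.T @ W, np.eye(n)))      # True
print(np.allclose(W.T @ T_tilde @ W, T))    # True
```

The four quadrants of T_tilde are the smooth, horizontal, vertical and diagonal domains of the one-level partitioning sketched in Figs B1 and B2.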

C3 Wavelet-regularised MEM

In wavelet-regularised MEM, the data d^P are predicted from an image h in the same way as in the real-space algorithm from Section C1. The entropy functional, however, is calculated on the wavelet coefficients rather than the image pixels. This can be viewed as a new effective entropy functional created from a composition of the standard entropy and the wavelet transform. Its derivatives are given by

∇_v [S(Wv)] = W^t [∇_h S](Wv),
∇_v ∇_v [S(Wv)] = W^t [∇_h ∇_h S](Wv) W.

Again, second derivatives are numerically difficult to evaluate for orthogonal wavelet transforms.
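For completeness, the à trous recursion of Appendix A2, whose transpose enters the ICF operations above, can be sketched as follows (Python with numpy; the sketch assumes the triangle scaling function, i.e. the mask H = (1/4, 1/2, 1/4), and periodic boundaries):

```python
import numpy as np

def a_trous(c, levels=4):
    """1-d à trous decomposition with the triangle mask H = (1/4, 1/2, 1/4).
    At level j the mask is dilated by inserting 2^j - 1 zeros ('holes');
    the wavelet planes are simple differences w = c_j - c_{j-1}."""
    planes = []
    for j in range(levels):
        step = 2 ** j
        smoothed = 0.25 * np.roll(c, -step) + 0.5 * c + 0.25 * np.roll(c, step)
        planes.append(c - smoothed)   # detail plane at this scale
        c = smoothed
    return planes, c                  # detail planes and the final smooth plane

rng = np.random.default_rng(3)
signal = rng.normal(size=128)
planes, smooth = a_trous(signal)

# The transform is redundant but trivially invertible: the differences
# telescope, so summing all planes recovers the input exactly.
print(np.allclose(sum(planes) + smooth, signal))   # True
```

The trivial reconstruction by summation is what makes the channel weights discussed in the main text easy to apply: each plane can simply be rescaled before the sum.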

APPENDIX D: EQUIVALENCE OF METHODS

In this section, we show that the ‘wavelet MEM’ and ‘wavelet regularised MEM’ introduced in Section 5 become equivalent if the regularisation function is quadratic and the wavelet transform orthogonal. For wavelet MEM, the χ²-statistic is given by

χ²(h) = (RKh − d)^t N^{−1} (RKh − d).   (D1)

The most general form of quadratic regularisation is given by S(h) = −h^t M^{−1} h. In the special case of the quadratic approximation (27) to the positive/negative entropy with a model m, the matrix M is defined as M_{ij} = 4 m_i δ_{ij}. The maximum entropy solution ĥ can be found by minimising the function (14),

F(h) = ½ χ²(h) − αS(h)
     = ½ (RKh − d)^t N^{−1} (RKh − d) + α h^t M^{−1} h
     = ½ d^t N^{−1} d − h^t K^t R^t N^{−1} d + h^t (½ K^t R^t N^{−1} RK + α M^{−1}) h.

Demanding that the gradient of F with respect to h vanishes at the maximum ĥ, we obtain

(½ K^t R^t N^{−1} RK + α M^{−1}) ĥ = K^t R^t N^{−1} d.   (D2)

For wavelet regularised MEM, we have

χ²(v) = (Rv − d)^t N^{−1} (Rv − d)

and

S(K^t v) = −v^t K M^{−1} K^t v.

The maximum entropy solution v̂ is given by

(½ R^t N^{−1} R + α K M^{−1} K^t) v̂ = R^t N^{−1} d.   (D3)

For orthogonal wavelet transforms W = K^t, we have KK^t = 1 = K^t K. Multiplying (D3) by K^t, we obtain

(½ K^t R^t N^{−1} RK + α M^{−1}) K^t v̂ = K^t R^t N^{−1} d.

By comparison with (D2), we see that ĥ = K^t v̂. The solutions for wavelet MEM and wavelet regularised MEM are identical.
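The equivalence can also be checked numerically. The sketch below (Python with numpy) solves (D2) and (D3) for a random linear problem and confirms ĥ = K^t v̂; a random orthogonal matrix from a QR decomposition stands in for a genuine orthogonal wavelet transform:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 16
R = rng.normal(size=(N, N))              # response matrix
d = rng.normal(size=N)                   # data vector
Ninv = np.eye(N) / 0.25                  # inverse noise covariance, N = sigma^2 1
m = rng.random(N) + 0.5                  # model for the quadratic entropy
Minv = np.diag(1.0 / (4.0 * m))          # M_ij = 4 m_i delta_ij, eq. before (D2)
alpha = 0.7

# Orthogonal 'wavelet' matrix W = K^t via QR decomposition.
W, _ = np.linalg.qr(rng.normal(size=(N, N)))
K = W.T

# Wavelet MEM, eq. (D2): solve for the hidden image h_hat.
A = 0.5 * K.T @ R.T @ Ninv @ R @ K + alpha * Minv
h_hat = np.linalg.solve(A, K.T @ R.T @ Ninv @ d)

# Wavelet-regularised MEM, eq. (D3): solve for the visible image v_hat.
B = 0.5 * R.T @ Ninv @ R + alpha * K @ Minv @ K.T
v_hat = np.linalg.solve(B, R.T @ Ninv @ d)

print(np.allclose(h_hat, K.T @ v_hat))   # True: the two solutions coincide
```

The check relies only on K K^t = K^t K = 1, as in the derivation above; with a non-orthogonal transform such as à trous, the two methods are no longer equivalent.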