Nonparametric regression in exponential families
Abstract
Most results in nonparametric regression theory are developed only for the case of additive noise. In such a setting many smoothing techniques, including wavelet thresholding methods, have been developed and shown to be highly adaptive. In this paper we consider nonparametric regression in exponential families with the main focus on the natural exponential families with a quadratic variance function, which include, for example, Poisson regression, binomial regression and gamma regression. We propose a unified approach of using a mean-matching variance stabilizing transformation to turn the relatively complicated problem of nonparametric regression in exponential families into a standard homoscedastic Gaussian regression problem. Then in principle any good nonparametric Gaussian regression procedure can be applied to the transformed data. To illustrate our general methodology, in this paper we use wavelet block thresholding to construct the final estimators of the regression function. The procedures are easily implementable. Both theoretical and numerical properties of the estimators are investigated. The estimators are shown to enjoy a high degree of adaptivity and spatial adaptivity with near-optimal asymptotic performance over a wide range of Besov spaces. The estimators also perform well numerically.
DOI: 10.1214/09-AOS762. The Annals of Statistics, Volume 38, Issue 4 (2010), pages 2005–2046.
Lawrence D. Brown, T. Tony Cai and Harrison H. Zhou
Supported in part by NSF Grant DMS-0707033.
Supported in part by NSF Grant DMS-0604954 and NSF FRG Grant DMS-0854973.
Supported in part by NSF CAREER Award DMS-0645676 and NSF FRG Grant DMS-0854975.
AMS subject classifications: Primary 62G08; secondary 62G20. Key words and phrases: Adaptivity, asymptotic equivalence, exponential family, James–Stein estimator, nonparametric Gaussian regression, quadratic variance function, quantile coupling, wavelets.
1 Introduction
Theory and methodology for nonparametric regression are now well developed for the case of additive noise, particularly additive homoscedastic Gaussian noise. In such a setting many smoothing techniques, including wavelet thresholding methods, have been developed and shown to be adaptive and to enjoy other desirable properties over a wide range of function spaces. However, in many applications the noise is not additive and the conventional methods are not readily applicable. For example, such is the case when the data are counts or proportions.
In this paper we consider nonparametric regression in exponential families with the main focus on the natural exponential families with a quadratic variance function (NEF–QVF). These include, for example, Poisson regression, binomial regression and gamma regression. We present a unified treatment of these regression problems by using a mean-matching variance stabilizing transformation (VST) approach. The mean-matching VST turns the relatively complicated problem of regression in exponential families into a standard homoscedastic Gaussian regression problem, and then any good nonparametric Gaussian regression procedure can be applied.
Variance stabilizing transformations and closely related normalizing transformations have been widely used in many parametric statistical inference problems. See Hoyle (1973), Efron (1982) and Bar-Lev and Enis (1990). In the more standard parametric problems, the goal of the VST is often to optimally stabilize the variance. That is, one desires the variance of the transformed variable to be as close to a constant as possible. For example, Anscombe (1948) introduced VSTs for the binomial, Poisson and negative binomial distributions that provide the greatest asymptotic control over the variance of the resulting transformed variables. In the context of nonparametric function estimation, Anscombe's variance stabilizing transformation has also been briefly discussed in Donoho (1993) for density estimation. However, for our purposes it is much more essential to have optimal asymptotic control over the bias of the transformed variables. A mean-matching VST minimizes the bias of the transformed data while also stabilizing the variance.
Our procedure begins by grouping the data into many small-size bins and then applying the mean-matching VST to the binned data. In principle any good Gaussian regression procedure could be applied to the transformed data to construct the final estimator of the regression function. To illustrate our general methodology, in this paper we employ two wavelet block thresholding procedures. Wavelet thresholding methods have achieved considerable success in nonparametric regression in terms of spatial adaptivity and asymptotic optimality. In particular, block thresholding rules have been shown to possess impressive properties. In the context of nonparametric regression, local block thresholding has been studied, for example, in Hall, Kerkyacharian and Picard (1998), Cai (1999, 2002) and Cai and Silverman (2001). In this paper we shall use the BlockJS procedure proposed in Cai (1999) and the NeighCoeff procedure introduced in Cai and Silverman (2001). Both estimators were originally developed for nonparametric Gaussian regression. BlockJS first divides the empirical coefficients at each resolution level into nonoverlapping blocks and then simultaneously estimates all the coefficients within a block by a James–Stein rule. NeighCoeff also thresholds the empirical coefficients in blocks, but estimates wavelet coefficients individually. It chooses a threshold for each coefficient by referencing not only that coefficient but also its neighbors. Both estimators increase estimation accuracy over term-by-term thresholding by utilizing information about neighboring coefficients.
Both theoretical and numerical properties of our estimators are investigated. It is shown that the estimators enjoy excellent asymptotic adaptivity and spatial adaptivity. The procedure using BlockJS simultaneously attains the optimal rate of convergence under integrated squared error over a wide range of Besov classes. The estimators also automatically adapt to the local smoothness of the underlying function; they attain the local adaptive minimax rate for estimating functions at a point. A key step in the technical argument is the use of the quantile coupling inequality of Komlós, Major and Tusnády (1975) to approximate the binned and transformed data by independent normal variables. The procedures are easy to implement and computationally efficient. In addition to enjoying these desirable theoretical properties, the procedures also perform well numerically.
Our method is applicable in more general settings. It can be extended to treat nonparametric regression in general one-parameter natural exponential families. The mean-matching VST only exists in NEF–QVF (see Section 2). In the general case, when the variance is not a quadratic function of the mean, we apply the same procedure with the standard VST in place of the mean-matching VST. It is shown that, under slightly stronger conditions, the same optimality results hold in general. We also note that mean-matching VSTs exist for some useful non-exponential families, including some commonly used for modeling “overdispersed” data. Though we do not pursue the details in the present paper, it appears that because of this our methods can also be used effectively for nonparametric regression involving such error distributions.
We should note that nonparametric regression in exponential families has been considered in the literature. Among individual exponential families, the Poisson case is perhaps the most studied. Besbeas, De Feis and Sapatinas (2004) provided a review of the literature on nonparametric Poisson regression and carried out an extensive numerical comparison of several estimation procedures, including Donoho (1993), Kolaczyk (1999a, 1999b) and Fryźlewicz and Nason (2001). In the case of Bernoulli regression, Antoniadis and Leblanc (2000) introduced a wavelet procedure based on diagonal linear shrinkers. Unified treatments for nonparametric regression in exponential families have also been proposed. Antoniadis and Sapatinas (2001) introduced a wavelet shrinkage and modulation method for regression in NEF–QVF and showed that the estimator attains the optimal rate over the classical Sobolev spaces. Kolaczyk and Nowak (2005) proposed a recursive partition and complexity-penalized likelihood method. The estimator was shown to be within a logarithmic factor of the minimax rate under squared Hellinger loss over Besov spaces.
The paper is organized as follows. Section 2 discusses the mean-matching variance stabilizing transformation for natural exponential families. In Section 3, we first introduce the general approach of using the mean-matching VST to convert nonparametric regression in exponential families into a nonparametric Gaussian regression problem, and then present in detail specific estimation procedures based on the mean-matching VST and wavelet block thresholding. Theoretical properties of the procedures are treated in Section 4. Section 5 investigates the numerical performance of the estimators. We also illustrate our estimation procedures in the analysis of two real data sets: a gamma-ray burst data set and a packet loss data set. Technical proofs are given in Section 6.
2 Mean-matching variance stabilizing transformation
We begin by considering variance stabilizing transformations (VSTs) for natural exponential families. As mentioned in the Introduction, the VST has been widely used in many contexts and the conventional goal of the VST is to optimally stabilize the variance. See, for example, Anscombe (1948) and Hoyle (1973). For our purpose of nonparametric regression in exponential families, we shall first develop a new class of VSTs, called mean-matching VSTs, which asymptotically minimize the bias of the transformed variables while at the same time stabilizing the variance.
Let $Y_1, Y_2, \ldots, Y_m$ be a random sample from a distribution in a natural one-parameter exponential family with probability density/mass function
$q(y \mid \eta) = e^{\eta y - \psi(\eta)}\, h(y).$
Here $\eta$ is called the natural parameter. The mean and variance are, respectively,
$\mu(\eta) = \psi'(\eta) \quad \text{and} \quad \sigma^{2}(\eta) = \psi''(\eta).$
We shall denote the distribution by $\mathrm{NEF}(\mu)$. A special subclass of interest is the one with a quadratic variance function (QVF),
(1) $\sigma^{2} = V(\mu) \equiv v_0 + v_1 \mu + v_2 \mu^{2}.$
In this case we shall write $Y_i \sim \mathrm{NQ}(\mu)$. The NEF–QVF families consist of six distributions, three continuous: normal, gamma and NEF–GHS distributions; and three discrete: binomial, negative binomial and Poisson. See, for example, Morris (1982) and Brown (1986).
Set $X = \sum_{i=1}^{m} Y_i$ and $\bar{Y} = X/m$. According to the central limit theorem,
$\sqrt{m}\,\bigl(\bar{Y} - \mu(\eta)\bigr) \xrightarrow{\ \mathcal{L}\ } N\bigl(0, \sigma^{2}(\eta)\bigr).$
A variance stabilizing transformation (VST) is a function $G$ such that
(2) $G'(\mu) = \frac{1}{\sigma(\mu)}.$
The standard delta method then yields
$\sqrt{m}\,\bigl(G(\bar{Y}) - G(\mu(\eta))\bigr) \xrightarrow{\ \mathcal{L}\ } N(0, 1).$
It is known that the variance stabilizing properties can often be further improved by using a transformation of the form
(3) $H_{m}(X) = G\left(\frac{X + a}{m + b}\right)$
with a suitable choice of constants $a$ and $b$. See, for example, Anscombe (1948). In this paper we shall use the VST as a tool for nonparametric regression in exponential families. For this purpose, it is more important to optimally match the means than to optimally stabilize the variance. That is, we wish to choose the constants $a$ and $b$ such that $E H_{m}(X)$ optimally matches $G(\mu)$.
To derive the optimal choice of $a$ and $b$, we need the following expansions for the mean and variance of the transformed variable $H_{m}(X)$.
Lemma 1
Let $\Theta$ be a compact set in the interior of the natural parameter space. Then for $\eta \in \Theta$ and for constants $a$ and $b$,
(4) $E H_{m}(X) = G(\mu(\eta)) + \frac{a - b\mu - \frac{1}{4} V'(\mu)}{m\sqrt{V(\mu)}} + O(m^{-2})$
and
(5) $\operatorname{Var}\bigl(H_{m}(X)\bigr) = \frac{1}{m}\bigl(1 + O(m^{-1})\bigr).$
Moreover, there exist constants $a$ and $b$ such that
(6) $E H_{m}(X) = G(\mu(\eta)) + O(m^{-2})$
for all $\eta$ in any compact set $\Theta$ with positive Lebesgue measure if and only if the exponential family has a quadratic variance function.
The proof of Lemma 1 is given in Section 6. The last part of Lemma 1 can be easily explained as follows. Equation (4) implies that (6) holds if and only if
$a - b\mu - \tfrac{1}{4} V'(\mu) = 0 \qquad \text{for all } \mu,$
that is, $V'(\mu) = 4a - 4b\mu$. Solving this differential equation yields
(7) $V(\mu) = v_0 + 4a\mu - 2b\mu^{2}$
for some constant $v_0$. Hence the solution of the differential equation is exactly the subclass of natural exponential families with a quadratic variance function (QVF).
It follows from (7) that among the VSTs of the form (3), for the exponential family with a quadratic variance function
$\sigma^{2} = V(\mu) = v_0 + v_1 \mu + v_2 \mu^{2},$
the best constants $a$ and $b$ for mean-matching are
(8) $a = \frac{v_1}{4} \quad \text{and} \quad b = -\frac{v_2}{2}.$
We shall call the VST (3) with the constants $a$ and $b$ given in (8) the mean-matching VST. Lemma 1 shows that the mean-matching VST only exists in the NEF–QVF families, and with the mean-matching VST the bias $E H_{m}(X) - G(\mu)$ is of order $O(m^{-2})$. In contrast, for an NEF without a quadratic variance function, the $O(m^{-1})$ term in (4) does not vanish for all $\eta$ with any choice of $a$ and $b$, and in this case the bias satisfies
$E H_{m}(X) - G(\mu(\eta)) = O(m^{-1})$
instead of $O(m^{-2})$ as in (6). We shall see in Section 4 that this difference has important implications for nonparametric regression in NEF.
The following are the specific expressions of the mean-matching VST for the five distributions (other than normal) in the NEF–QVF families:

Poisson: $G(\mu) = 2\sqrt{\mu}$, $a = \frac{1}{4}$ and $b = 0$.

Binomial$(r, p)$ with $\mu = rp$: $G(\mu) = 2\sqrt{r}\arcsin\bigl(\sqrt{\mu/r}\bigr)$, $a = \frac{1}{4}$ and $b = \frac{1}{2r}$.

Negative binomial$(r, p)$: $G(\mu) = 2\sqrt{r}\operatorname{arcsinh}\bigl(\sqrt{\mu/r}\bigr)$, $a = \frac{1}{4}$ and $b = -\frac{1}{2r}$.

Gamma (with shape $r$ known): $G(\mu) = \sqrt{r}\log\mu$, $a = 0$ and $b = -\frac{1}{2r}$.

NEF–GHS (with $r$ known): $G(\mu) = \sqrt{r}\operatorname{arcsinh}(\mu/r)$, $a = 0$ and $b = -\frac{1}{2r}$.
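The mean-matching constant in the Poisson case can be checked numerically. The following pure-Python computation (an illustration added here, not part of the original development; $\lambda = 10$ is an arbitrary test value) evaluates the exact bias $E\,2\sqrt{X + c} - 2\sqrt{\lambda}$ of the root transform for a Poisson variable at $c = 0$ (no correction), $c = 3/8$ (Anscombe's variance-optimal choice) and $c = 1/4$ (the mean-matching choice).

```python
import math

def bias_root(lam, c, kmax=500):
    """Exact bias E[2*sqrt(X + c)] - 2*sqrt(lam) for X ~ Poisson(lam),
    computed by direct summation over the Poisson pmf."""
    total = 0.0
    logp = -lam  # log P(X = 0)
    for k in range(kmax + 1):
        total += math.exp(logp) * 2.0 * math.sqrt(k + c)
        logp += math.log(lam) - math.log(k + 1)  # advance to log P(X = k+1)
    return total - 2.0 * math.sqrt(lam)

lam = 10.0
b_usual = bias_root(lam, 0.0)        # usual root transform
b_anscombe = bias_root(lam, 3.0 / 8) # variance-optimal (Anscombe) choice
b_matching = bias_root(lam, 1.0 / 4) # mean-matching choice a = 1/4
```

The mean-matching choice gives a bias an order of magnitude smaller than either alternative, consistent with the $O(m^{-2})$ versus $O(m^{-1})$ comparison above.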
Note that the mean-matching VST is different from the more conventional VST that optimally stabilizes the variance. Take the binomial distribution with $r = 1$ (the Bernoulli case) as an example. In this case the VST is an arcsine transformation. Let $X = \sum_{i=1}^{m} Y_i$; then $X \sim \mathrm{Binomial}(m, p)$. Figure 1 compares the mean and variance of three arcsine transformations of the form
$H_{c}(X) = 2\arcsin\sqrt{\frac{X + c}{m + 2c}}$
for the binomial variable $X$. The choice $c = 0$ gives the usual arcsine transformation, $c = \frac{3}{8}$ optimally stabilizes the variance asymptotically, and $c = \frac{1}{4}$ yields the mean-matching arcsine transformation. The left panel of Figure 1 plots the bias
$E H_{c}(X) - 2\arcsin\sqrt{p}$
as a function of $p$ for $c = 0$, $\frac{3}{8}$ and $\frac{1}{4}$. It is clear from the plot that $c = \frac{1}{4}$ is the best choice among the three for matching the mean. On the other hand, the arcsine transformation with $c = 0$ yields significant bias, and the transformation with $c = \frac{3}{8}$ also produces noticeably larger bias. The right panel plots the variance of $H_{c}(X)$ for $c = 0$, $\frac{3}{8}$ and $\frac{1}{4}$. Interestingly, over a wide range of values of $p$ near the center, the arcsine transformation with $c = \frac{1}{4}$ is even slightly better than the variance-optimal case $c = \frac{3}{8}$, and $c = 0$ is clearly the worst choice of the three. Figure 2 below shows similar behavior for the Poisson case.
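The qualitative behavior in the left panel of Figure 1 can be reproduced exactly by summation over the binomial pmf. In the sketch below (added for illustration; the values $m = 30$ and $p = 0.3$ are arbitrary test choices, not taken from the figure), the constant $c$ parameterizes the arcsine transform as above.

```python
import math

def bias_arcsine(m, p, c):
    """Exact bias E[2*asin(sqrt((X + c)/(m + 2c)))] - 2*asin(sqrt(p))
    for X ~ Binomial(m, p), by direct summation over the binomial pmf."""
    total = 0.0
    for k in range(m + 1):
        pk = math.comb(m, k) * p**k * (1.0 - p)**(m - k)
        total += pk * 2.0 * math.asin(math.sqrt((k + c) / (m + 2.0 * c)))
    return total - 2.0 * math.asin(math.sqrt(p))

m, p = 30, 0.3
b0 = bias_arcsine(m, p, 0.0)       # usual arcsine transform
b38 = bias_arcsine(m, p, 3.0 / 8)  # variance-optimal choice
b14 = bias_arcsine(m, p, 1.0 / 4)  # mean-matching choice
```

As expected, the usual transform is biased downward for $p < 1/2$, the variance-optimal choice overshoots, and the mean-matching choice has by far the smallest bias.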
Let us now consider the Gamma distribution with shape parameter $r$ as an example for the continuous case. The VST in this case is a log transformation. Let $X = \sum_{i=1}^{m} Y_i$. Then $X$ is again a Gamma variable. Figure 3 compares the mean and variance of two log transformations of the form
(9) $H_{b}(X) = \sqrt{r}\,\log\!\left(\frac{X}{m + b}\right)$
for the Gamma variable $X$ with $m$ ranging from 3 to 40. The choice $b = 0$ gives the usual log transformation, and $b = -\frac{1}{2r}$ yields the mean-matching log transformation. The left panel of Figure 3 plots the bias $E H_{b}(X) - \sqrt{r}\log\mu$ as a function of $m$ for $b = 0$ and $b = -\frac{1}{2r}$. It is clear from the plot that $b = -\frac{1}{2r}$ is a much better choice than $b = 0$ for matching the mean. It is interesting to note that in this case there do not exist constants $a$ and $b$ that optimally stabilize the variance. The right panel plots the variance of $H_{b}(X)$ as a function of $m$. In this case, it is obvious that the variances are the same with $b = 0$ and $b = -\frac{1}{2r}$, since for fixed $m$ the two transformations in (9) differ only by a deterministic constant.

Remark. Mean-matching variance stabilizing transformations exist for some other important families of distributions. We mention two that are commonly used to model “overdispersed” data. The first family is often referred to as the gamma-Poisson family. See, for example, Johnson, Kemp and Kotz (2005), Berk and MacDonald (2008) and Hilbe (2007). Let $Y_i \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i)$ with the $\lambda_i$ drawn independently from a gamma distribution. The $\lambda_i$ are latent variables; only the $Y_i$ are observed. The scale parameter of the gamma distribution is assumed known, and the mean is the unknown parameter. The resulting family of distributions of each $Y_i$ is a subfamily of the negative binomial with the shape parameter held at a fixed constant. (Here the negative binomial family is defined for all real, not necessarily integer, values of the shape parameter.) This is a one-parameter family, but it is not an exponential family. It can be verified that a mean-matching variance stabilizing transformation for this family is given by
This transformation has the desired properties (5) and (6). For the second family, consider the beta-binomial family. See Johnson, Kemp and Kotz (2005). Here $Y_i \mid p_i \sim \mathrm{Binomial}(r, p_i)$ with the $p_i$ drawn independently from a beta distribution. Again, the $p_i$ are latent variables; only the $Y_i$ are observed. For the family of interest here, we assume the beta parameters are allowed to vary so that one functional of them is held at a known constant and the mean is the unknown parameter. The resulting one-parameter family of distributions of each $Y_i$ is again not a one-parameter exponential family. It can be verified that a mean-matching variance stabilizing transformation for this family is given by

This transformation has the desired properties (5) and (6).
3 Nonparametric regression in exponential families
We now turn to nonparametric regression in exponential families. We begin with the NEF–QVF. Suppose we observe
(10) $Y_i \overset{\mathrm{ind}}{\sim} \mathrm{NQ}\bigl(f(t_i)\bigr), \qquad t_i = \frac{i}{n},\ i = 1, \ldots, n,$
and wish to estimate the mean function $f(t)$. In this setting, for the five NEF–QVF families discussed in the last section, the noise is not additive and is non-Gaussian. Applying standard nonparametric regression methods directly to the data in general does not yield desirable results. Our strategy is to use the mean-matching VST to reduce this problem to a standard Gaussian regression problem based on a sample $\{Y^{*}_{j} : j = 1, \ldots, T\}$, where
$Y^{*}_{j} = G\left(\frac{Q_j + a}{m + b}\right)$
and $Q_j$ is the sum of the observations in the $j$th bin. Here $G$ is the VST defined in (2), $T$ is the number of bins, and $m = n/T$ is the number of observations in each bin. The values of $T$ and $m$ will be specified later.
We begin by dividing the interval $[0, 1]$ into $T$ equal-length subintervals with $m = n/T$ observations in each subinterval. Let $Q_j$ be the sum of the observations on the $j$th subinterval $I_j = \bigl(\frac{j-1}{T}, \frac{j}{T}\bigr]$, $j = 1, \ldots, T$,
(11) $Q_j = \sum_{i :\, t_i \in I_j} Y_i.$
The sums $\{Q_j\}$ can be treated as observations for a Gaussian regression directly, but this in general leads to a heteroscedastic problem. Instead, we apply the mean-matching VST discussed in Section 2 and then treat the transformed values as new observations in a homoscedastic Gaussian regression problem. To be more specific, let
(12) $Y^{*}_{j} = G\left(\frac{Q_j + a}{m + b}\right), \qquad j = 1, \ldots, T,$
where the constants $a$ and $b$ are chosen as in (8) to match the means. The transformed data $Y^{*} = (Y^{*}_{1}, \ldots, Y^{*}_{T})$ are then treated as the new equispaced sample for a nonparametric Gaussian regression problem.
We will first estimate $G \circ f$ and then take the inverse transformation of the estimator to estimate the mean function $f$. After the original regression problem is turned into a Gaussian regression problem through binning and the mean-matching VST, in principle any good nonparametric Gaussian regression method can be applied to the transformed data to construct an estimate of $G \circ f$. The general ideas of our approach can be summarized as follows.

Binning: divide the observations into $T$ equal-length intervals between 0 and 1. Let $Q_1, \ldots, Q_T$ be the sums of the observations in each of the intervals. Later results suggest a choice of $T$ satisfying $T \asymp n^{3/4}$ for the NEF–QVF case and $T \asymp n^{1/2}$ for the non-QVF case. See Section 4 for details.

VST: let $Y^{*}_{j} = G\bigl(\frac{Q_j + a}{m + b}\bigr)$, $j = 1, \ldots, T$, and treat $Y^{*} = (Y^{*}_{1}, \ldots, Y^{*}_{T})$ as the new equispaced sample for a nonparametric Gaussian regression problem.

Gaussian regression: apply your favorite nonparametric regression procedure to the binned and transformed data $Y^{*}$ to obtain an estimate $\widehat{G \circ f}$ of $G \circ f$.

Inverse VST: estimate the mean function $f$ by $\hat{f} = G^{-1}\bigl(\widehat{G \circ f}\bigr)$. If $\widehat{G \circ f}(t)$ is not in the range of $G$, which is an interval between $G_{\min}$ and $G_{\max}$ ($G_{\min}$ and $G_{\max}$ can be $-\infty$ or $+\infty$), we set $\hat{f}(t) = \lim_{x \downarrow G_{\min}} G^{-1}(x)$ if $\widehat{G \circ f}(t) \le G_{\min}$, and $\hat{f}(t) = \lim_{x \uparrow G_{\max}} G^{-1}(x)$ if $\widehat{G \circ f}(t) \ge G_{\max}$. This happens, for example, in the case of the negative binomial and NEF–GHS distributions.
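The steps above can be sketched in a few lines of code. The following pure-Python illustration (added here, not from the original development) treats the Poisson case, where $G(x) = 2\sqrt{x}$, $a = 1/4$ and $b = 0$; a 3-point moving average stands in for the wavelet regression step, and the sample size, bin count and test intensity function are arbitrary choices for the demonstration.

```python
import math, random

def rpois(lam, rng):
    """Poisson sampler by pmf inversion (adequate for moderate lam)."""
    u, k = rng.random(), 0
    p = math.exp(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k
        cdf += p
    return k

def poisson_regression(y, T):
    """Bin -> mean-matching VST (G(x) = 2*sqrt(x), a = 1/4, b = 0)
    -> smooth -> inverse VST.  A 3-point moving average stands in
    for the wavelet regression step."""
    n = len(y)
    m = n // T
    Q = [sum(y[j * m:(j + 1) * m]) for j in range(T)]        # binning
    ystar = [2.0 * math.sqrt((q + 0.25) / m) for q in Q]     # VST
    sm = [(ystar[max(j - 1, 0)] + ystar[j] + ystar[min(j + 1, T - 1)]) / 3.0
          for j in range(T)]                                 # stand-in smoother
    return [(v / 2.0) ** 2 for v in sm]                      # inverse VST

rng = random.Random(7)
n, T = 4096, 64
f = [5.0 + 2.0 * math.sin(2.0 * math.pi * i / n) for i in range(n)]
y = [rpois(fi, rng) for fi in f]
fhat = poisson_regression(y, T)
```

With $m = 64$ observations per bin, the transformed data have noise standard deviation roughly $1/\sqrt{m}$, and the recovered intensity tracks the true curve closely.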
3.1 Effects of binning and VST
As mentioned earlier, after binning and the mean-matching VST, one can treat the transformed data as if they were data from a homoscedastic Gaussian nonparametric regression problem. A key step in understanding why this procedure works is to understand the effects of binning and the VST. Quantile coupling provides an important technical tool that sheds light on the procedure.
The following result, which is a direct consequence of the quantile coupling inequality of Komlós, Major and Tusnády (1975), shows that the binned and transformed data can be well approximated by independent normal variables.
Lemma 2
Let $X = \sum_{i=1}^{m} Y_i$ with the $Y_i$ i.i.d. with mean $\mu$ and variance $V(\mu)$. Under the assumptions of Lemma 1, there exists a standard normal random variable $Z \sim N(0, 1)$ and constants $c_1, c_2, c_3 > 0$ not depending on $m$ such that whenever the event $\{|X - m\mu| \le c_3 m\}$ occurs,
(13) $\bigl|X - m\mu - \sqrt{m V(\mu)}\, Z\bigr| \le c_1 + c_2 Z^{2}.$
Hence, for large $m$, $X$ can be treated as a normal random variable with mean $m\mu$ and variance $m V(\mu)$. Let $Y^{*} = G\bigl(\frac{X + a}{m + b}\bigr)$ and let $Z$ be a standard normal variable satisfying (13). Then $Y^{*}$ can be written as
(14) $Y^{*} = G(\mu) + \varepsilon + \frac{1}{\sqrt{m}}\, Z + \xi,$
where
(15) $\varepsilon = E\, Y^{*} - G(\mu) \quad \text{and} \quad \xi = Y^{*} - E\, Y^{*} - \frac{1}{\sqrt{m}}\, Z.$
In the decomposition (14), $\varepsilon$ is the deterministic approximation error between the mean of $Y^{*}$ and its target value $G(\mu)$, and $\xi$ is the stochastic error measuring the difference between $Y^{*}$ and its normal approximation. It follows from Lemma 1 that when $m$ is large, $\varepsilon$ is “small”: $|\varepsilon| \le c m^{-2}$ for some constant $c > 0$. The following result, which is proved in Section 6.1, shows that the random variable $\xi$ is “stochastically small.”
Lemma 3
The discussion so far has focused on the effects of the VST for i.i.d. observations. In the nonparametric function estimation problem mentioned earlier, observations in each bin are independent but not identically distributed, since the mean function $f$ is not constant in general. However, through coupling, observations in each bin can in fact be treated as if they were i.i.d. random variables when the function $f$ is smooth. Let $X_i \sim \mathrm{NQ}(\mu_i)$, $i = 1, \ldots, m$, be independent. Here the means $\mu_i$ are “close” but not equal. Let $\mu$ be a value close to the $\mu_i$'s. The analysis given in Section 6.1 shows that the $X_i$ can in fact be coupled with i.i.d. random variables distributed as $\mathrm{NQ}(\mu)$. See Lemma 4 in Section 6.1 for a precise statement.
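To see the Gaussian approximation at work, here is a small simulation (illustrative only; the values $m = 100$ and $\mu = 5$ are arbitrary) checking that in the Poisson case the binned and transformed variable $2\sqrt{(X + 1/4)/m}$ has mean close to $G(\mu) = 2\sqrt{\mu}$ and variance close to $1/m$.

```python
import math, random

def rpois(lam, rng):
    """Poisson sampler by pmf inversion (adequate for moderate lam)."""
    u, k = rng.random(), 0
    p = math.exp(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(1)
m, mu, reps = 100, 5.0, 2000
vals = []
for _ in range(reps):
    X = sum(rpois(mu, rng) for _ in range(m))      # bin total ~ Poisson(m*mu)
    vals.append(2.0 * math.sqrt((X + 0.25) / m))   # mean-matching VST
mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / (reps - 1)
```

Up to simulation error, the sample mean matches $2\sqrt{5}$ and $m$ times the sample variance is close to 1, as the decomposition (14) predicts.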
How well the transformed data can be approximated by an ideal Gaussian regression model depends partly on the smoothness of the mean function $f$. For $0 < d \le 1$ and $M > 0$, define the Lipschitz class $\Lambda^{d}(M)$ by
$\Lambda^{d}(M) = \bigl\{f : |f(x) - f(y)| \le M |x - y|^{d},\ 0 \le x, y \le 1\bigr\}$
and let $F^{d}(M)$ denote the class of functions in $\Lambda^{d}(M)$ whose range is contained in a compact set in the interior of the mean parameter space of the natural exponential family. Lemmas 1, 2, 3 and 4 together yield the following result, which shows how far the transformed data are from the ideal Gaussian model.
Theorem 1
Let $\{Y^{*}_{j}\}$ be given as in (12) and let $m = n/T$. Then $Y^{*}_{j}$ can be written as
(17) $Y^{*}_{j} = G\bigl(f(t^{*}_{j})\bigr) + \varepsilon_{j} + \frac{1}{\sqrt{m}}\, Z_{j} + \xi_{j}, \qquad Z_{j} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1),$
where the $\varepsilon_{j}$ are deterministic approximation errors satisfying, for some constant $C > 0$,
(18) $|\varepsilon_{j}| \le C\bigl(T^{-d} + m^{-2}\bigr),$
and the $\xi_{j}$ are independent and “stochastically small” random variables satisfying, for any integer $k > 0$ and any constant $\lambda > 0$,
(19)  
where $C_k$ is a constant depending only on $k$ and $\lambda$.
Theorem 1 provides explicit bounds for both the deterministic and stochastic errors. This is an important technical result which serves as a major tool for the proof of the main results given in Section 4.

Remark. There is a tradeoff between the two terms in the bound (18) for the overall approximation error $\varepsilon_{j}$. There are two sources of approximation error: one is the variation of the function values within a bin, and the other comes from the expansion of the mean of the transformed variable (see Lemma 1). The former is related to the smoothness of the function $f$ and is controlled by the $T^{-d}$ term, while the latter is bounded by the $m^{-2}$ term. In addition, there is the discretization error between the sampled function and the whole function $f$, which is obviously a decreasing function of $T$. Furthermore, the choice of $T$ also affects the stochastic error $\xi_{j}$. A good choice of $T$ makes all three types of errors negligible relative to the minimax risk. See Section 4.2 for further discussion.

Remark. In Section 4 we introduce Besov balls for the analysis of wavelet regression methods. A Besov ball can be embedded into a Lipschitz class $\Lambda^{d}(M')$ for suitable $d$ and $M'$.
3.2 Wavelet thresholding
One can apply any good nonparametric Gaussian regression procedure to the transformed data to construct an estimator of the function . To illustrate our general methodology, in the present paper we shall use wavelet block thresholding to construct the final estimators of the regression function. Before we can give a detailed description of our procedures, we need a brief review of basic notation and definitions.
Let $\{\phi, \psi\}$ be a pair of father and mother wavelets. The functions $\phi$ and $\psi$ are assumed to be compactly supported with $\int \phi = 1$, and dilations and translations of $\phi$ and $\psi$ generate an orthonormal wavelet basis. For simplicity in exposition, in the present paper we work with periodized wavelet bases on $[0, 1]$. Let
$\phi^{p}_{j,k}(t) = \sum_{l \in \mathbb{Z}} \phi_{j,k}(t - l), \qquad \psi^{p}_{j,k}(t) = \sum_{l \in \mathbb{Z}} \psi_{j,k}(t - l) \qquad \text{for } t \in [0, 1],$
where $\phi_{j,k}(t) = 2^{j/2}\phi(2^{j} t - k)$ and $\psi_{j,k}(t) = 2^{j/2}\psi(2^{j} t - k)$. The collection $\{\phi^{p}_{j_0,k},\ k = 1, \ldots, 2^{j_0};\ \psi^{p}_{j,k},\ j \ge j_0,\ k = 1, \ldots, 2^{j}\}$ is then an orthonormal basis of $L^{2}[0, 1]$, provided the primary resolution level $j_0$ is large enough to ensure that the support of the scaling functions and wavelets at level $j_0$ is not the whole of $[0, 1]$. The superscript “p” will be suppressed from the notation for convenience. An orthonormal wavelet basis has an associated orthogonal Discrete Wavelet Transform (DWT) which transforms sampled data into the wavelet coefficients. See Daubechies (1992) and Strang (1992) for further details on wavelets and the discrete wavelet transform. A square-integrable function $f$ on $[0, 1]$ can be expanded into a wavelet series:
(20) $f(t) = \sum_{k = 1}^{2^{j_0}} \xi_{j_0,k}\, \phi_{j_0,k}(t) + \sum_{j = j_0}^{\infty} \sum_{k = 1}^{2^{j}} \theta_{j,k}\, \psi_{j,k}(t),$
where $\xi_{j_0,k} = \langle f, \phi_{j_0,k} \rangle$ and $\theta_{j,k} = \langle f, \psi_{j,k} \rangle$ are the wavelet coefficients of $f$.
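As a concrete instance of the orthogonal DWT just described, the following self-contained sketch (Haar case, added here for illustration; any compactly supported wavelet in the paper's setting could be used instead) computes an orthonormal Haar decomposition and its inverse.

```python
import math

def haar_dwt(x):
    """Full orthonormal Haar decomposition of x (len(x) a power of 2).
    Returns the coarsest scaling coefficient and the detail coefficients
    ordered from coarsest to finest level."""
    details = []
    a = list(x)
    while len(a) > 1:
        s = [(a[2 * i] + a[2 * i + 1]) / math.sqrt(2) for i in range(len(a) // 2)]
        d = [(a[2 * i] - a[2 * i + 1]) / math.sqrt(2) for i in range(len(a) // 2)]
        details.append(d)
        a = s
    return a[0], details[::-1]

def haar_idwt(s0, details):
    """Invert haar_dwt exactly (the transform is orthogonal)."""
    a = [s0]
    for d in details:
        a = [v for s_i, d_i in zip(a, d)
             for v in ((s_i + d_i) / math.sqrt(2), (s_i - d_i) / math.sqrt(2))]
    return a

x = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 0.0, 2.0]
s0, det = haar_dwt(x)
xr = haar_idwt(s0, det)
```

Orthogonality means the transform is exactly invertible and preserves the energy $\sum_i x_i^2$, which is what makes thresholding in the coefficient domain well behaved.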
3.3 Wavelet procedures for generalized regression
We now give a detailed description of the wavelet thresholding procedures BlockJS and NeighCoeff in this section and study the properties of the resulting estimators in Section 4. We shall show that our estimators enjoy a high degree of adaptivity and spatial adaptivity and are easily implementable.
Apply the discrete wavelet transform to the binned and transformed data $Y^{*} = (Y^{*}_{1}, \ldots, Y^{*}_{T})$ with $T = 2^{J}$, and let $U = T^{-1/2} W Y^{*}$ be the empirical wavelet coefficients, where $W$ is the discrete wavelet transformation matrix. Write
(21) $U = \bigl(\tilde{\xi}_{j_0,1}, \ldots, \tilde{\xi}_{j_0,2^{j_0}}, \tilde{\theta}_{j_0,1}, \ldots, \tilde{\theta}_{j_0,2^{j_0}}, \ldots, \tilde{\theta}_{J-1,1}, \ldots, \tilde{\theta}_{J-1,2^{J-1}}\bigr)^{\top}.$
Here the $\tilde{\xi}_{j_0,k}$ are the gross structure terms at the lowest resolution level, and the $\tilde{\theta}_{j,k}$ ($j = j_0, \ldots, J - 1$, $k = 1, \ldots, 2^{j}$) are empirical wavelet coefficients at level $j$ which represent fine structure at scale $2^{-j}$. The empirical wavelet coefficients can then be written as
(22) $\tilde{\theta}_{j,k} = \theta_{j,k} + \epsilon_{j,k} + \frac{1}{\sqrt{n}}\, z_{j,k} + \zeta_{j,k},$
where the $\theta_{j,k}$ are the true wavelet coefficients of $G \circ f$, the $\epsilon_{j,k}$ are “small” deterministic approximation errors, the $z_{j,k}$ are i.i.d. $N(0, 1)$, and the $\zeta_{j,k}$ are some “small” stochastic errors. The theoretical calculations given in Section 6 will show that both $\epsilon_{j,k}$ and $\zeta_{j,k}$ are negligible. If these negligible errors are ignored, then we have
(23) $\tilde{\theta}_{j,k} = \theta_{j,k} + \frac{1}{\sqrt{n}}\, z_{j,k},$
which is the idealized Gaussian sequence model with noise level $n^{-1/2}$. Both BlockJS [Cai (1999)] and NeighCoeff [Cai and Silverman (2001)] were originally developed for this ideal model. Here we shall apply these methods to the empirical coefficients as if they were observed as in (23).
We first describe the BlockJS procedure. At each resolution level $j$, the empirical wavelet coefficients $\tilde{\theta}_{j,k}$ are grouped into nonoverlapping blocks of length $L$. As in the sequence estimation setting, let $L = \lfloor \log n \rfloor$ and let $S^{2}_{j,b} = \sum_{(j,k) \in B^{b}_{j}} \tilde{\theta}^{2}_{j,k}$ denote the sum of squares of the coefficients in the $b$th block $B^{b}_{j}$ at level $j$. A modified James–Stein shrinkage rule is then applied to each block $B^{b}_{j}$, that is,
(24) $\hat{\theta}_{j,k} = \left(1 - \frac{\lambda_{*} L \sigma^{2}}{S^{2}_{j,b}}\right)_{+} \tilde{\theta}_{j,k} \qquad \text{for } (j, k) \in B^{b}_{j},$
where $\lambda_{*} = 4.50524$ is the solution to the equation $\lambda - \log \lambda = 3$ [see Cai (1999) for details], and $\sigma^{2}$ is approximately the variance of each $\tilde{\theta}_{j,k}$. For the gross structure terms at the lowest resolution level $j_0$, we set $\hat{\xi}_{j_0,k} = \tilde{\xi}_{j_0,k}$. The estimate of $G \circ f$ at the equally spaced sample points is then obtained by applying the inverse discrete wavelet transform (IDWT) to the denoised wavelet coefficients. The estimate of the whole function is given by
$\widehat{G \circ f}(t) = \sum_{k = 1}^{2^{j_0}} \hat{\xi}_{j_0,k}\, \phi_{j_0,k}(t) + \sum_{j = j_0}^{J - 1} \sum_{k = 1}^{2^{j}} \hat{\theta}_{j,k}\, \psi_{j,k}(t).$
The mean function $f$ is estimated by
(25) $\hat{f}(t) = G^{-1}\bigl(\widehat{G \circ f}(t)\bigr).$
Figure 4 shows the steps of the procedure for an example in the case of nonparametric Gamma regression.
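The block shrinkage step can be sketched as follows (a minimal pure-Python illustration; the constant $\lambda_{*} = 4.50524$, the root of $\lambda - \log \lambda = 3$, and the block length $L = \lfloor \log n \rfloor$ follow Cai (1999), while the toy coefficient values are arbitrary).

```python
import math

def blockjs_level(u, sigma2, n, lam_star=4.50524):
    """Block James-Stein rule: group one level of empirical coefficients u
    into blocks of length L = floor(log n) and apply
    (1 - lam_star * L * sigma2 / S2)_+ to each block (partial final
    blocks use their own length)."""
    L = max(1, int(math.log(n)))
    out = []
    for start in range(0, len(u), L):
        block = u[start:start + L]
        S2 = sum(v * v for v in block)
        thresh = lam_star * len(block) * sigma2
        shrink = max(0.0, 1.0 - thresh / S2) if S2 > 0 else 0.0
        out.extend(shrink * v for v in block)
    return out

n, sigma2 = 1024, 1.0
noise_block = [0.5] * 6    # noise-sized coefficients
signal_block = [10.0] * 6  # strong signal
est = blockjs_level(noise_block + signal_block, sigma2, n)
```

Blocks whose total energy is below the threshold are zeroed out entirely, while blocks dominated by signal are shrunk only slightly.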
We now turn to the NeighCoeff procedure. This procedure, introduced in Cai and Silverman (2001) for Gaussian regression, incorporates information about neighboring coefficients in a different way from the BlockJS procedure. NeighCoeff also thresholds the empirical coefficients in blocks, but estimates wavelet coefficients individually. It chooses a threshold for each coefficient by referencing not only to that coefficient but also to its neighbors. As shown in Cai and Silverman (2001), NeighCoeff outperforms BlockJS numerically, but with slightly inferior asymptotic properties.
Let the empirical coefficients $\tilde{\theta}_{j,k}$ be given as before. To estimate a coefficient $\theta_{j,k}$ at resolution level $j$, we form a block of size 3 by including the coefficient $\tilde{\theta}_{j,k}$ together with its immediate neighbors $\tilde{\theta}_{j,k-1}$ and $\tilde{\theta}_{j,k+1}$. (If periodic boundary conditions are not being used, then for the two coefficients at the boundary, blocks, again of length 3, are formed by extending in one direction only.) Estimate the coefficient $\theta_{j,k}$ by
(26) $\hat{\theta}_{j,k} = \left(1 - \frac{2\sigma^{2}\log n}{S^{2}_{j,k}}\right)_{+} \tilde{\theta}_{j,k},$
where $S^{2}_{j,k} = \tilde{\theta}^{2}_{j,k-1} + \tilde{\theta}^{2}_{j,k} + \tilde{\theta}^{2}_{j,k+1}$. The gross structure terms at the lowest resolution level are again estimated by $\hat{\xi}_{j_0,k} = \tilde{\xi}_{j_0,k}$. The rest of the steps are the same as before: the inverse DWT is applied to obtain an estimate of $G \circ f$, and the mean function $f$ is then estimated by $\hat{f} = G^{-1}\bigl(\widehat{G \circ f}\bigr)$.
We can envision a sliding window of size 3 which moves one position at a time, with only the coefficient at the center of the window estimated for a given window position. Each individual coefficient is thus shrunk by an amount that depends on the coefficient and on its immediate neighbors. Note that NeighCoeff uses a lower threshold level than the universal thresholding procedure of Donoho and Johnstone (1994). In NeighCoeff, a coefficient is estimated by zero only when the sum of squares of the empirical coefficient and its immediate neighbors is less than $2\sigma^{2}\log n$, that is, when the average of the squares is less than $\frac{2}{3}\sigma^{2}\log n$.
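The NeighCoeff rule as described can be sketched in code (pure Python, added for illustration; the periodic boundary handling and the toy inputs are choices made for this demonstration).

```python
import math

def neighcoeff_level(u, sigma2, n):
    """NeighCoeff rule: shrink each coefficient using the sum of squares
    S2 of itself and its two immediate neighbors,
    (1 - 2 * sigma2 * log(n) / S2)_+ * u_k,
    with periodic wrapping at the two ends."""
    T = len(u)
    lam = 2.0 * sigma2 * math.log(n)
    out = []
    for k in range(T):
        S2 = u[(k - 1) % T] ** 2 + u[k] ** 2 + u[(k + 1) % T] ** 2
        out.append(max(0.0, 1.0 - lam / S2) * u[k] if S2 > 0 else 0.0)
    return out

n, sigma2 = 1024, 1.0
coeffs = [0.5, 0.4, 0.3, 10.0, 9.0, 11.0, 0.2, 0.1]
est = neighcoeff_level(coeffs, sigma2, n)
```

Note how a small coefficient adjacent to large ones survives (its 3-term sum of squares exceeds the threshold), which is exactly the neighborhood effect that distinguishes NeighCoeff from term-by-term thresholding.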
4 Theoretical properties
In this section we investigate the asymptotic properties of the procedures proposed in Section 3. Numerical results will be given in Section 5.
We study the theoretical properties of our procedures over the Besov spaces that are by now standard for the analysis of wavelet regression methods. Besov spaces are a very rich class of function spaces and contain as special cases many traditional smoothness spaces such as Hölder and Sobolev spaces. Roughly speaking, the Besov space $B^{\alpha}_{p,q}$ contains functions having $\alpha$ bounded derivatives in the $L^{p}$ norm; the third parameter $q$ gives a finer gradation of smoothness. Full details of Besov spaces are given, for example, in Triebel (1992) and DeVore and Popov (1988). A wavelet $\psi$ is called $r$-regular if $\psi$ has $r$ vanishing moments and $r$ continuous derivatives. For a given $r$-regular mother wavelet $\psi$ with $r > \alpha$ and a fixed primary resolution level $j_0$, the Besov sequence norm of the wavelet coefficients of a function $f$ is then defined by
(27) $\|f\|_{b^{\alpha}_{p,q}} = \|\xi_{j_0}\|_{p} + \Biggl(\sum_{j = j_0}^{\infty} \bigl(2^{js}\|\theta_{j}\|_{p}\bigr)^{q}\Biggr)^{1/q},$
where $s = \alpha + \frac{1}{2} - \frac{1}{p}$, $\xi_{j_0}$ is the vector of the father wavelet coefficients at the primary resolution level $j_0$, and $\theta_{j}$ is the vector of the wavelet coefficients at level $j$ (with the usual modification when $q = \infty$). It is an important fact that the Besov function norm of index $(\alpha, p, q)$ of a function $f$ is equivalent to the sequence norm (27) of the wavelet coefficients of the function. See Meyer (1992). We therefore define
(28) $B^{\alpha}_{p,q}(M) = \bigl\{f : \|f\|_{b^{\alpha}_{p,q}} \le M\bigr\}$
and
(29) $F^{\alpha}_{p,q}(M) = \bigl\{f \in B^{\alpha}_{p,q}(M) : f(t) \in \mathcal{K} \text{ for all } t \in [0, 1]\bigr\},$
where $\mathcal{K}$ is a compact set in the interior of the mean parameter space of the natural exponential family.
The following theorem shows that our estimators achieve near-optimal global adaptation under integrated squared error over a wide range of Besov balls.
Theorem 2
Suppose the wavelet $\psi$ is $r$-regular. Then the estimator $\hat{f}$ defined in (25) satisfies
and the estimator satisfies
(30) 
Note that when , the condition implies that there exists such that with
where with , since it follows from Theorem 3 on page 344 and Remark 3 on page 345 of Runst (1986) that
Simple algebra shows that
For spatially inhomogeneous functions, the local smoothness varies significantly from point to point, and the global risk given in Theorem 2 cannot wholly reflect the performance of an estimator at a point. We use the local risk measure
(31) $R\bigl(\hat{f}, f; t_0\bigr) = E\bigl(\hat{f}(t_0) - f(t_0)\bigr)^{2}$
to measure spatial adaptivity.
The local smoothness of a function can be measured by its local Hölder smoothness index. For a fixed point $t_0 \in (0, 1)$ and $0 < \alpha \le 1$, define the local Hölder class $\Lambda^{\alpha}(M, t_0, \delta)$ as follows:
$\Lambda^{\alpha}(M, t_0, \delta) = \bigl\{f : |f(t) - f(t_0)| \le M |t - t_0|^{\alpha},\ t \in (t_0 - \delta, t_0 + \delta)\bigr\}.$
If $\alpha > 1$, then
$\Lambda^{\alpha}(M, t_0, \delta) = \bigl\{f : \bigl|f^{(\lfloor \alpha \rfloor)}(t) - f^{(\lfloor \alpha \rfloor)}(t_0)\bigr| \le M |t - t_0|^{\alpha'},\ t \in (t_0 - \delta, t_0 + \delta)\bigr\},$
where $\lfloor \alpha \rfloor$ is the largest integer less than $\alpha$ and $\alpha' = \alpha - \lfloor \alpha \rfloor$. Define the corresponding class $F^{\alpha}(M, t_0, \delta)$ by additionally restricting the range of $f$ to a compact set in the interior of the mean parameter space, as before.
In the Gaussian nonparametric regression setting, it is a well-known fact that for estimation at a point one must pay a price for adaptation. The optimal rate of convergence for estimating $f(t_0)$ over the function class $\Lambda^{\alpha}(M, t_0, \delta)$ with $\alpha$ completely known is $n^{-2\alpha/(1 + 2\alpha)}$. Lepski (1990) and Brown and Low (1996) showed that one has to pay a price for adaptation of at least a logarithmic factor. It is shown that the local adaptive minimax rate over the Hölder class $\Lambda^{\alpha}(M, t_0, \delta)$ is $(\log n / n)^{2\alpha/(1 + 2\alpha)}$.
The following theorem shows that our estimators achieve optimal local adaptation with the minimal cost.
Theorem 3
Suppose the wavelet $\psi$ is $r$-regular with $r \ge \alpha$. Let $t_0 \in (0, 1)$ be fixed. Then for or