5 The Uniform Distribution
5.1 Properties
The uniform distribution is often used within the realm of probability, in part because of its utility and in part because of its simplicity. We briefly touched upon this distribution earlier in this book, for instance when we talked about hypothesis test \(p\)-values (which are distributed uniformly between 0 and 1 when the null hypothesis is correct). Why do we return to the uniform distribution now? Because it is slightly different from other distributions: its two parameters, often denoted \(a\) and \(b\) (where \(b > a\)), do not dictate the shape of its probability density function, but rather its domain. This affects aspects of estimation, such as determining sufficient statistics and deriving maximum likelihood estimates. We highlight these “quirks” of the uniform pdf throughout this chapter.
Recall: a probability density function is one way to represent a continuous probability distribution, and it has the properties (a) \(f_X(x) \geq 0\) and (b) \(\int_x f_X(x) dx = 1\), where the integral is over all values of \(x\) in the distribution’s domain.
The uniform pdf is defined as \[ f_X(x) = \frac{1}{b-a} ~~\mbox{where}~~ x \in [a,b] \,. \] \(f_X(x)\) is thus constant between \(a\) and \(b\). (See Figure 5.1.) This means that we can think of the uniform distribution “geometrically,” as the following is true: \[ \underbrace{(b-a)}_{\mbox{domain}} \cdot \underbrace{\frac{1}{b-a}}_{f_X(x)} = 1 \] If we know the domain of the pdf, we immediately know \(f_X(x)\); conversely, if we know \(f_X(x)\), we immediately know the width of the domain (but not \(a\) and \(b\) themselves).
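We can check this identity numerically with R’s built-in dunif() function; below is a quick sketch, assuming hypothetical bounds \(a = 2\) and \(b = 7\):
dunif(c(3,4.5,6),min=2,max=7) * (7 - 2)  # (b-a) * f_X(x), for several x in [a,b]
## [1] 1 1 1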
Recall: the cumulative distribution function, or cdf, is another means by which to encapsulate information about a probability distribution. For a continuous distribution, it is defined as \(F_X(x) = \int_{y \leq x} f_Y(y) dy\), and it is defined for all values \(x \in (-\infty,\infty)\), with \(F_X(-\infty) = 0\) and \(F_X(\infty) = 1\).
The cdf for a uniformly distributed random variable is \[ F_X(x) = \int_a^x f_Y(y) dy = \int_a^x \frac{1}{b-a} dy = \frac{x-a}{b-a} ~~ x \in [a,b] \,, \] with a value of 0 for \(x < a\) and 1 for \(x > b\). (We can quickly confirm that the derivative of the cdf yields the pdf. Recall that for continuous distributions, \(f_X(x) = dF_X(x)/dx\).)
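For instance, assuming \(a = 1\) and \(b = 6\), the built-in punif() function returns \((x-a)/(b-a)\):
punif(3,min=1,max=6)  # (3-1)/(6-1) = 0.4
## [1] 0.4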
Recall: an inverse cdf function \(F_X^{-1}(q)\) takes as input a distribution quantile \(q \in [0,1]\) and returns the value of \(x\) such that \(q = F_X(x)\).
The inverse cdf is exceptionally simple to compute:
\[
q = \frac{x-a}{b-a} ~~ \Rightarrow ~~ x = (b-a)q + a \,.
\]
Technically the inverse cdf has no unique solution when \(q = 0\) or \(q = 1\). However, it is convention (for instance, within R) that the inverse cdf of a continuous distribution return the largest value of \(x\) for which \(F_X(x) = 0\) and the smallest value of \(x\) for which \(F_X(x) = 1\). Thus, for a Uniform(\(a,b\)) distribution, when \(q = 0\), then \(x = a\), and when \(q = 1\), \(x = b\).
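We can verify both the formula and the endpoint convention with R’s qunif() function (here assuming \(a = 2\) and \(b = 7\)):
qunif(c(0,0.25,0.5,1),min=2,max=7)  # x = (b-a)q + a, with qunif(0) = a and qunif(1) = b
## [1] 2.00 3.25 4.50 7.00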
A discrete analogue to the uniform distribution is the discrete uniform distribution, which is defined over a range of integers \([a,b]\). (The rolls of a fair, six-sided die would, for instance, be governed by the discrete uniform distribution.) The pmf for the discrete uniform is \[ p_X(x) = \frac{1}{n} ~~ x \in [a,b] \,, \] where \(n = b - a + 1\) is the number of possible experimental outcomes. The cdf is \[ F_X(x) = \frac{\lfloor x \rfloor - a + 1}{n} ~~ x \in [a,b] \,, \] where \(\lfloor x \rfloor\) is the largest integer that is smaller than or equal to \(x\), while the inverse cdf is given by the generalized inverse cdf formalism that we’ve previously seen for discrete distributions.
Note that there are no standard R functions of the form xdiscunif() for computing the pmf or cdf of the discrete uniform distribution, or for sampling from it. We will show how one can create such functions in an example below.
5.1.1 The Expected Value and Variance of a Uniform Random Variable
Recall: the expected value of a continuously distributed random variable is \[ E[X] = \int_x x f_X(x) dx\,, \] where the integral is over all values of \(x\) within the domain of the pdf \(f_X(x)\). The expected value is equivalent to a weighted average, with the weight for each possible value of \(x\) given by \(f_X(x)\).
The expected value of a random variable drawn from a Uniform(\(a,b\)) distribution is \[ E[X] = \int_a^b x f_X(x) dx = \int_a^b \frac{x}{b-a} dx = \frac{1}{b-a} \left. \frac{x^2}{2} \right|_a^b = \frac{1}{b-a} \frac{b^2-a^2}{2} = \frac{1}{b-a} \frac{(b-a)(b+a)}{2} = \frac{a+b}{2} \,. \]
Recall: the variance of a continuously distributed random variable is \[ V[X] = \int_x (x-\mu)^2 f_X(x) dx = E[X^2] - (E[X])^2\,, \] where the integral is over all values of \(x\) within the domain of the pdf \(f_X(x)\). The variance represents the square of the “width” of a probability density function, where by “width” we mean the range of values of \(x\) for which \(f_X(x)\) is effectively non-zero.
To find the variance, we work with the shortcut formula: \(V[X] = E[X^2] - (E[X])^2\). We know \(E[X]\) already; as for \(E[X^2]\), we utilize the Law of the Unconscious Statistician: \[\begin{align*} E[X^2] = \int_a^b x^2 f_X(x) dx &= \int_a^b \frac{x^2}{b-a} dx \\ &= \frac{1}{b-a} \left. \frac{x^3}{3} \right|_a^b \\ &= \frac{b^3-a^3}{3(b-a)} \\ &= \frac{(b-a)(a^2+ab+b^2)}{3(b-a)} = \frac{1}{3}\left(a^2 + ab + b^2\right) \,. \end{align*}\] Thus \[\begin{align*} V[X] &= \frac{1}{3}\left(a^2 + ab + b^2\right) - \left(\frac{a+b}{2}\right)^2 \\ &= \frac{1}{3}\left(a^2 + ab + b^2\right) - \frac{1}{4}\left(a^2+2ab+b^2\right) \\ &= \frac{1}{12}\left(4a^2 + 4ab + 4b^2 - 3a^2 - 6ab - 3b^2\right) \\ &= \frac{1}{12}\left(a^2 - 2ab + b^2 \right) \\ &= \frac{(a-b)^2}{12} \,. \end{align*}\]
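As a quick Monte Carlo sanity check of both formulas (a sketch, assuming \(a = 1\) and \(b = 5\), so that \(E[X] = 3\) and \(V[X] = 16/12 \approx 1.333\)):
set.seed(42)
x <- runif(100000,min=1,max=5)
mean(x)  # should be close to (1+5)/2 = 3
var(x)   # should be close to (5-1)^2/12 = 1.333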
5.1.2 Coding R-Style Functions for the Discrete Uniform Distribution
There are four standard functions associated with any distribution: the one prefaced by d, which returns the output of the probability mass function or probability density function, given a coordinate \(x\); the one prefaced by p, which returns the output of the cumulative distribution function, given \(x\); the one prefaced by q, which returns the output of the inverse cdf, given a quantile \(q \in [0,1]\); and the random sampler, a function prefaced by r.
For the discrete uniform distribution, one can code the probability mass function as follows:
ddiscunif <- function(x,min=0,max=1,step=1)
{
y <- seq(min,max,by=step)
if ( x %in% y ) return(1/length(y))
return(0)
}
ddiscunif(4,min=1,max=6) # assume a fair six-sided die
## [1] 0.1666667
As for the cumulative distribution function:
pdiscunif <- function(x,min=0,max=1,step=1)
{
y <- seq(min,max,by=step)
w <- which(y<=x)
if ( length(w) == 0 ) return(0)
return(length(w)/length(y))
}
pdiscunif(4,min=1,max=6)
## [1] 0.6666667
The inverse cdf implements the generalized inverse algorithm:
qdiscunif <- function(q,min=0,max=1,step=1)
{
y <- seq(min,max,by=step)
if ( q == 0 ) return(min(y))
if ( q == 1 ) return(max(y))
cdf <- (1:length(y))/length(y)
w <- which(cdf>=q)
if ( length(w) == 0 ) return(max(y))
return(y[min(w)])
}
qdiscunif(0.55,min=1,max=6)
## [1] 4
And last, the random data generator:
rdiscunif <- function(n,min=0,max=1,step=1)
{
y <- seq(min,max,by=step)
s <- sample(length(y),n,replace=TRUE)
return(y[s])
}
set.seed(235) # set to ensure consistent output
rdiscunif(10,min=1,max=6)
## [1] 6 5 5 6 2 1 5 1 3 6
5.2 Linear Functions of Uniform Random Variables
Let’s assume that we are given \(n\) iid Uniform random variables: \(X_1,X_2,\ldots,X_n \sim\) Uniform(\(a,b\)). What is the distribution of the sum \(Y = \sum_{i=1}^n X_i\)?
Recall: the moment-generating function, or mgf, is a means by which to encapsulate information about a probability distribution. When it exists, the mgf is given by \(E[e^{tX}]\). If \(Y = \sum_{i=1}^n a_iX_i\), then \(m_Y(t) = m_{X_1}(a_1t) m_{X_2}(a_2t) \cdots m_{X_n}(a_nt)\); if we can identify \(m_Y(t)\) as the mgf for a known family of distributions, then we can immediately identify the distribution of \(Y\) and the parameters of that distribution.
The mgf for the uniform distribution is \[\begin{align*} m_X(t) = E[e^{tX}] &= \int_a^b \frac{e^{tx}}{b-a} dx \\ &= \frac{1}{b-a} \left. \frac{1}{t}e^{tx} \right|_a^b \\ &= \frac{e^{tb}-e^{ta}}{t(b-a)} \,. \end{align*}\] The mgf for the sum \(Y = \sum_{i=1}^n X_i\) is thus \[ m_Y(t) = \prod_{i=1}^n m_{X_i}(t) = \left( \frac{e^{tb}-e^{ta}}{t(b-a)} \right)^n \,. \] This expression does not simplify such that we recognize the distribution of \(Y\). If \(a = 0\) and \(b = 1\), it turns out that the mgf does take on the form of that for an Irwin-Hall distribution. An Irwin-Hall random variable converges in distribution to a normal random variable as \(n \rightarrow \infty\).
We are placed in a similar situation if we look at the sample mean \(\bar{X} = Y/n\): \[ m_{\bar{X}}(t) = \prod_{i=1}^n m_{X_i}\left(\frac{t}{n}\right) = \left( \frac{n(e^{tb/n}-e^{ta/n})}{t(b-a)} \right)^n \,. \] If \(a = 0\) and \(b = 1\), \(\bar{X}\) is sampled from the Bates distribution. A Bates random variable converges in distribution to a normal random variable as \(n \rightarrow \infty\). For all other combinations of \(a\) and \(b\), we cannot write down a specific functional form for the sampling distribution of \(\bar{X}\) and thus we would have to perform simulations to test hypotheses, etc. (However, we note that because statistical inference for a uniform distribution involves determining the lower and/or upper bounds, we can utilize order statistics for inference instead of \(\bar{X}\). See the next section below.)
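A short simulation sketch of the Bates convergence (assuming \(a = 0\), \(b = 1\), and \(n = 20\)): the sample means should concentrate around \(1/2\), with standard deviation \(\sqrt{1/(12n)} \approx 0.0645\) and a visibly bell-shaped histogram.
set.seed(205)
n <- 20
x.bar <- replicate(10000,mean(runif(n)))  # 10,000 draws of the sample mean
mean(x.bar)  # should be close to 1/2
sd(x.bar)    # should be close to sqrt(1/(12*20)) = 0.0645
hist(x.bar)  # approximately normal in shape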
5.2.1 The Moment-Generating Function for a Discrete Uniform Distribution
The mgf for a discrete uniform random variable is \[ E[e^{tX}] = \sum_{x=a}^b e^{tx} p_X(x) = \frac{1}{n} \sum_{x=a}^b e^{tx} \,. \] We cannot say anything further without making an assumption. If we say that \(x \in [a,a+1,\ldots,b-1,b]\), i.e., that there are integer steps between the probability masses, then \[ E[e^{tX}] = \frac{1}{n} \sum_{x=a}^b e^{tx} = \frac{1}{n}e^{ta} \left( 1 + e^{t} + \cdots + e^{t(b-a)} \right) \,. \] If \(t\) is negative, then we can make use of a geometric sum: \[ 1 + e^t + \cdots = \frac{1}{1-e^t} = \underbrace{1 + e^t + \cdots + e^{t(b-a)}}_{\mbox{finite sum above}} + \underbrace{e^{t(b-a+1)} + \cdots}_{\mbox{remainder}} \,, \] where the first underbraced quantity is what appears above in the expected value. Thus we can rearrange terms and write \[\begin{align*} 1 + e^{t} + \cdots + e^{t(b-a)} &= \frac{1}{1-e^t} - \left( e^{t(b-a+1)} + \cdots \right) \\ &= \frac{1}{1-e^t} - e^{t(b-a+1)}\left(1 + e^t + \cdots\right) \\ &= \frac{1}{1-e^t} - \frac{e^{t(b-a+1)}}{1-e^t} = \frac{1-e^{t(b-a+1)}}{1-e^t} \,. \end{align*}\] Putting everything together, we find that \[ m_X(t) = \frac{1}{n}e^{ta} \frac{1-e^{t(b-a+1)}}{1-e^t} = \frac{e^{ta}-e^{t(b+1)}}{n(1-e^t)} \,. \] This is the usual form of the mgf presented for the discrete uniform distribution, but again, this is only valid if the masses are separated by one unit: \(x \in [a,a+1,\ldots,b-1,b]\).
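We can numerically check the closed form against the direct sum for a (hypothetical) negative value of \(t\), here with \(a = 1\), \(b = 6\), and \(t = -0.3\):
a <- 1
b <- 6
n <- b - a + 1
t <- -0.3
mean(exp(t*(a:b)))  # the direct sum (1/n)*sum(e^(tx))
## [1] 0.3976371
(exp(t*a)-exp(t*(b+1)))/(n*(1-exp(t)))  # the closed-form mgf
## [1] 0.3976371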
5.3 Sufficient Statistics and the Minimum Variance Unbiased Estimator
Recall: a sufficient statistic for a population parameter \(\theta\) captures all information about \(\theta\) contained in a data sample; no additional statistic will provide more information about \(\theta\). Sufficient statistics are not unique: functions of sufficient statistics are themselves sufficient statistics.
Before we discuss sufficient statistics in the context of the uniform distribution, it is useful to (re-)introduce the indicator function. This function, mentioned briefly in Chapter 1, takes on the value 1 if a specified condition is met and 0 otherwise. For instance, \[ \mathbb{I}_{x_i \in [0,1]} = \left\{ \begin{array}{cl} 1 & x_i \in [0,1] \\ 0 & \mbox{otherwise} \end{array} \right. \,. \] One use for the indicator function is to, well, indicate the domain of a pmf or pdf. For instance, we can write \[ f_X(x) = \left\{ \begin{array}{ll} e^{-x} & x \geq 0 \\ 0 & \mbox{otherwise} \end{array} \right. \] to express that the exponential distribution with rate \(\beta = 1\) is defined within the domain \(x \in [0,\infty)\), or, equivalently, we can write \[ f_X(x) = e^{-x} \mathbb{I}_{x \in [0,\infty)} \,. \] The latter form expresses the same information in a more condensed fashion.
So…what do indicator functions have to do with uniform distributions?
Let’s suppose we sample \(n\) iid data \(\{X_1,\ldots,X_n\}\) from a uniform distribution with lower bound 0 and upper bound \(\theta\), and our goal is to define a sufficient statistic for \(\theta\). Let’s work with the factorization criterion: \[ \mathcal{L}(\theta \vert \mathbf{x}) = g(\mathbf{x},\theta) \cdot h(\mathbf{x}) \,. \] The likelihood is \[ \mathcal{L}(\theta \vert \mathbf{x}) = \prod_{i=1}^n f_X(x_i \vert \theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n} \,. \] OK…no…wait, there are no data in this expression, so we cannot define a sufficient statistic. The way around this is to re-express the pdf as \[ f_X(x) = \frac{1}{\theta} \mathbb{I}_{x \in [0,\theta]} \] and to re-write the likelihood as \[ \mathcal{L}(\theta \vert \mathbf{x}) = \prod_{i=1}^n f_X(x_i \vert \theta) = \frac{1}{\theta^n} \prod_{i=1}^n \mathbb{I}_{x_i \in [0,\theta]} \,. \] A product of indicator functions will equal 1 if and only if all data lie in the domain \(x \in [0,\theta]\). This is equivalent to saying that \(\theta \geq X_{(n)}\), the order statistic representing the maximum observed datum. Thus \(X_{(n)}\) is a sufficient statistic for \(\theta\): we know \(\theta\) is greater than or equal to this statistic’s value, and none of the data aside from \(X_{(n)}\) provide any additional information about \(\theta\).
The upshot: when the parameter \(\theta\) dictates (at least in part) the domain of a distribution, the sufficient statistics for \(\theta\) will be functions of an order statistic.
When we first introduced the factorization criterion and sufficient statistics back in Chapter 3, we did it so that ultimately we could write down the minimum variance unbiased estimator (or MVUE).
Recall: the bias of an estimator is the difference between the average value of the estimates it generates and the true parameter value. If \(E[\hat{\theta}-\theta] = 0\), then the estimator \(\hat{\theta}\) is said to be unbiased.
Recall: deriving the minimum variance unbiased estimator involves two steps:
- factorizing the likelihood function to uncover a sufficient statistic \(U\) (that we assume is both minimal and complete); and
- finding a function \(h(U)\) such that \(E[h(U)] = \theta\).
When \(\{X_1,\ldots,X_n\} \stackrel{iid}{\sim}\) Uniform(\(0,\theta\)), can we define an MVUE for \(\theta\)? The answer is yes…but we have to recall how we define the pdf of \(X_{(n)}\) first.
Recall: the maximum of \(n\) iid random variables sampled from a pdf \(f_X(x)\) has a sampling distribution given by \[ f_{(n)}(x) = n f_X(x) [ F_X(x) ]^{n-1} \,, \] where \(F_X(x)\) is the associated cdf.
For the Uniform(\(0,\theta\)) distribution, \[ f_X(x) = \frac{1}{\theta} ~~\mbox{and}~~ F_X(x) = \int_0^x f_Y(y) dy = \int_0^x \frac{1}{\theta} dy = \frac{x}{\theta} \,, \] so \[ f_{(n)}(x) = n \frac{1}{\theta} \left[ \frac{x}{\theta} \right]^{n-1} = n \frac{x^{n-1}}{\theta^n} \,. \] To find the MVUE, we first compute the expected value of the sufficient statistic \(X_{(n)}\): \[ E[X_{(n)}] = \int_0^\theta x n \frac{x^{n-1}}{\theta^n} dx = \left. \frac{n}{(n+1)\theta^n} x^{n+1} \right|_0^\theta = \frac{n}{n+1} \theta \,, \] and then rearrange terms: \[ E\left[\frac{n+1}{n}X_{(n)}\right] = \theta \,. \] Thus \[ \hat{\theta}_{MVUE} = \frac{n+1}{n}X_{(n)} \] is the MVUE for \(\theta\). (Note that a similar calculation to this one can be used to determine, e.g., the MVUE for \(\theta\) when the data are sampled from a Uniform(\(\theta,b\)) distribution.)
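A simulation sketch of this result (assuming hypothetical values \(\theta = 2\) and \(n = 10\)): the scaled maximum should average to \(\theta\), while the unscaled maximum averages to \(n\theta/(n+1) \approx 1.818\).
set.seed(303)
n <- 10
theta <- 2
x.max <- replicate(10000,max(runif(n,min=0,max=theta)))
mean((n+1)/n*x.max)  # should be close to theta = 2
mean(x.max)          # should be close to n*theta/(n+1) = 1.818
var((n+1)/n*x.max)   # should be close to theta^2/(n*(n+2)) = 0.033 (derived below)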
5.3.1 The Sufficient Statistic for the Domain Parameter of the Pareto Distribution
The Pareto [puh-RAY-toh] distribution, also known in some quarters as the power-law, is \[ f_X(x) = \frac{\alpha \beta^\alpha}{x^{\alpha+1}} \,, \] where \(\alpha > 0\) is the shape parameter and \(x \in [\beta,\infty)\), where \(\beta\) is the scale (or location) parameter. Let’s assume \(\alpha\) is known. A sufficient statistic for \(\beta\), found via likelihood factorization, is \[ \mathcal{L}(\beta \vert \mathbf{x}) = \prod_{i=1}^n f_X(x_i) = \underbrace{\beta^{n\alpha}}_{g(\mathbf{x},\beta)} \cdot \underbrace{\frac{\alpha^n}{(\prod_{i=1}^n x_i)^{\alpha+1}}}_{h(\mathbf{x})} \,. \] Wait…again, as is the case for the uniform distribution, no data appear in the expression \(g(\cdot)\). So we would go back and introduce an indicator function into the pdf; it should be clear that when we do so, \(g(\mathbf{x},\beta)\) changes to \[ g(\mathbf{x},\beta) = \beta^{n\alpha} \prod_{i=1}^n \mathbb{I}_{x_i \in [\beta,\infty)} \] and thus that because all data have to be larger than \(\beta\), the sufficient statistic is the minimum observed datum, \(X_{(1)}\).
5.3.2 MVUE Properties for Uniform Distribution Bounds
The properties of estimators that we have examined thus far include the bias (are our estimates offset from the truth, on average?), the variance (over how large a range do our estimates vary?), etc. Let’s look at some of these properties here, assuming we sample \(n\) iid data from a Uniform(\(0,\theta\)) distribution.
Bias: the MVUE is by definition unbiased, since \(E[\frac{n+1}{n}X_{(n)}] = \theta\).
Variance: the variance of \(\hat{\theta}_{MVUE}\) is \[\begin{align*} V[\hat{\theta}_{MVUE}] &= E\left[\left(\hat{\theta}_{MVUE}\right)^2\right] - \left( E\left[ \hat{\theta}_{MVUE} \right] \right)^2 \\ &= \frac{(n+1)^2}{n^2} \left( E[X_{(n)}^2] - (E[X_{(n)}])^2 \right) \,, \end{align*}\] where \[ E[X_{(n)}^2] = \int_0^\theta x^2 n \frac{x^{n-1}}{\theta^n} dx = \left. \frac{n}{(n+2)\theta^n} x^{n+2} \right|_0^\theta = \frac{n}{n+2} \theta^2 \,. \] Thus \[\begin{align*} V[\hat{\theta}_{MVUE}] &= \frac{(n+1)^2}{n^2} \left( \frac{n}{n+2} \theta^2 - \frac{n^2}{(n+1)^2} \theta^2 \right) \\ &= \frac{(n+1)^2}{n^2} \left( \frac{n(n+1)^2 - n^2(n+2)}{(n+2)(n+1)^2} \theta^2 \right) \\ &= \frac{(n+1)^2}{n^2} \left( \frac{n}{(n+2)(n+1)^2} \theta^2 \right) \\ &= \frac{1}{n(n+2)} \theta^2 \rightarrow \frac{\theta^2}{n^2} ~~\mbox{as}~~ n \rightarrow \infty \,. \end{align*}\] We observe that since the variance goes to zero as \(n \rightarrow \infty\), the MVUE is a consistent estimator…
…but does the MVUE achieve the Cramér-Rao Lower Bound (CRLB), the theoretical lower bound on the variance of unbiased estimators? It turns out that it does not merely achieve that bound (which one can show equals \(\theta^2/n\)): it surpasses it, since \(\theta^2/(n(n+2)) < \theta^2/n\). This seemingly worrisome result is actually fine, however, because the CRLB theorem is not applicable in situations where the domain of a pmf or pdf depends on \(\theta\).
5.4 Maximum Likelihood Estimation
Recall: the value of \(\theta\) that maximizes the likelihood function is the maximum likelihood estimate, or MLE, for \(\theta\). The maximum is, thus far, found by taking the (partial) derivative of the (log-)likelihood function with respect to \(\theta\), setting the result to zero, and solving for \(\theta\). That solution is the maximum likelihood estimate \(\hat{\theta}_{MLE}\). Also recall the invariance property of the MLE: if \(\hat{\theta}_{MLE}\) is the MLE for \(\theta\), then \(g(\hat{\theta}_{MLE})\) is the MLE for \(g(\theta)\).
Now that we have recalled how maximum likelihood estimation works, we can state that this is not how the MLE is found for a domain-affecting parameter! (Hence the “thus far” in the recall statement above.) Let’s assume, for instance, that we sample \(n\) iid random variables from a Uniform(\(0,\theta\)) distribution. As stated above (without the indicator function), the likelihood is \[ \mathcal{L}(\theta \vert \mathbf{x}) = \frac{1}{\theta^n} \,. \] This means that the smaller \(\theta\) is, the larger the likelihood will be. So how small can \(\theta\) be? We can answer this intuitively: the domain \([0,\theta]\) has to just encompass all the observed data, i.e., \[ \hat{\theta}_{MLE} = X_{(n)} \,. \] If \(\theta\) were smaller, \(X_{(n)}\) would lie outside the domain. It is fine for \(\theta\) to be larger, since then all the data lie in the domain \([0,\theta]\)…but the larger \(\theta\) is, the smaller the likelihood.
We plot an example likelihood function in Figure 5.2. We observe immediately that the usual MLE algorithm will not work here, as the likelihood function is discontinuous at \(\theta = X_{(n)}\) and thus we cannot compute a first derivative. All we can do is, e.g., plot the likelihood and identify the MLE as that value for which the likelihood is maximized (or identify the value intuitively as we do above).
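As a sketch, the following code generates a likelihood curve like that of Figure 5.2, given a hypothetical sample of size \(n = 10\) drawn from a Uniform(\(0,2\)) distribution:
set.seed(89)
x <- runif(10,min=0,max=2)
theta <- seq(1.0,3.0,by=0.001)
L <- ifelse(theta >= max(x),1/theta^10,0)  # likelihood: theta^(-n) if theta >= X_(n), else 0
plot(theta,L,type="l",xlab=expression(theta),ylab="likelihood")
abline(v=max(x),lty=2)                     # the MLE: theta-hat = X_(n)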
5.4.1 The MLE for the Domain Parameter of the Pareto Distribution
In the last section above, we introduce the Pareto distribution, \[ f_X(x) = \frac{\alpha\beta^\alpha}{x^{\alpha+1}} \,, \] where \(\alpha > 0\) and \(x \in [\beta,\infty)\), and we show that the sufficient statistic for \(\beta\) (with \(\alpha\) fixed) is the smallest observed datum, \(X_{(1)}\). Because \(\beta\) is a parameter that dictates the domain, we find the MLE not via differentiation but rather by identifying that the likelihood is maximized when \(\beta\) is exactly equal to \(X_{(1)}\), i.e., \(\hat{\beta}_{MLE} = X_{(1)}\). See Figure 5.3.
5.4.2 MLE Properties for Uniform Distribution Bounds
In this example, we will mimic what we do above when discussing the properties of the MVUE, and look at estimator bias and variance, etc., assuming that \(\{X_1,\ldots,X_n\} \stackrel{iid}{\sim}\) Uniform(\(0,\theta\)).
Bias: we know, from our derivation of the MVUE, that \[ E[\hat{\theta}_{MLE}] = E[X_{(n)}] = \frac{n}{n+1}\theta \,, \] and thus the estimator bias is \[ B[\hat{\theta}_{MLE}] = E[\hat{\theta}_{MLE}] - \theta = \frac{n}{n+1}\theta - \theta = -\frac{1}{n+1}\theta \,. \] As we expect for the MLE, the estimator is at least asymptotically unbiased, as the bias goes to zero as the sample size \(n \rightarrow \infty\).
Variance: the variance of the MLE is \[ V[\hat{\theta}_{MLE}] = E[\hat{\theta}_{MLE}^2] - \left(E[\hat{\theta}_{MLE}]\right)^2 = E[X_{(n)}^2] - (E[X_{(n)}])^2 \,. \] We derived both \(E[X_{(n)}]\) and \(E[X_{(n)}^2]\) above when discussing the MVUE, so we can write down immediately that \[ V[\hat{\theta}_{MLE}] = \frac{n}{n+2}\theta^2 - \left( \frac{n}{n+1}\theta\right)^2 = \frac{n}{(n+2)(n+1)^2}\theta^2 \rightarrow \frac{\theta^2}{n^2} ~~\mbox{as}~~ n \rightarrow \infty\,. \] We observe that because the variance goes to zero as \(n \rightarrow \infty\), the MLE is a consistent estimator. The variance of the MLE is similar to, but not exactly the same as, the variance for the MVUE, although the variances converge to the same value in asymptopia.
5.5 Confidence Intervals
Recall: a confidence interval is a random interval \([\hat{\theta}_L,\hat{\theta}_U]\) that overlaps (or covers) the true value \(\theta\) with probability \[ P\left( \hat{\theta}_L \leq \theta \leq \hat{\theta}_U \right) = 1 - \alpha \,, \] where \(1 - \alpha\) is the confidence coefficient. We determine \(\hat{\theta}_L\) and \(\hat{\theta}_U\) by, e.g., solving for the root \(\theta_q\) in each of the following equations: \[\begin{align*} F_Y(y_{\rm obs} \vert \theta_{\alpha/2}) - \frac{\alpha}{2} &= 0 \\ F_Y(y_{\rm obs} \vert \theta_{1-\alpha/2}) - \left(1-\frac{\alpha}{2}\right) &= 0 \,. \end{align*}\] The construction of confidence intervals thus relies on knowing the sampling distribution of the adopted statistic \(Y\). One maps \(\theta_{\alpha/2}\) and \(\theta_{1-\alpha/2}\) to \(\hat{\theta}_L\) and \(\hat{\theta}_U\) by taking into account how the expected value \(E[Y]\) varies with the parameter \(\theta\). (See the table in section 14 of Chapter 1.)
Here, we will recall something else about confidence interval construction: we can choose the statistic that we use. This is important because, as we have seen, if we are given \(n\) iid data drawn from, e.g., a Uniform(\(0,\theta\)) distribution, we do not know the distribution of \(\bar{X}\) (unless \(\theta=1\))…while we do know the distribution of the order statistic \(Y = X_{(n)}\). We derive it above: \[ f_{(n)}(x) = n \frac{x^{n-1}}{\theta^n} \,. \] The cdf is thus \(F_{(n)}(x) = (x/\theta)^n\). We work with this expression in an example below, noting that \(E[Y] = n\theta/(n+1)\), i.e., that \(E[Y]\) increases with \(\theta\).
We conclude our coverage (so to speak) of confidence intervals by going back to the notion of the confidence coefficient \(1 - \alpha\). In a footnote in Chapter 1, we make the point that technically, the confidence coefficient is the infimum, or minimum value, of the probability \(P(\hat{\theta}_L \leq \theta \leq \hat{\theta}_U)\). What does this actually mean?
Let’s suppose that we have sampled \(n\) iid data from a normal distribution, and that we are going to construct a confidence interval of the form \[ P(S^2 - a \leq \sigma^2 \leq S^2 + a) \] for the population variance \(\sigma^2\). We can do this, right? Let’s see… \[\begin{align*} P(S^2 - a \leq \sigma^2 \leq S^2 + a) &= P(-a \leq S^2-\sigma^2 \leq a) \\ &= P\left(1 - \frac{a}{\sigma^2} \leq \frac{S^2}{\sigma^2} \leq 1 + \frac{a}{\sigma^2}\right) \\ &= P\left( (n-1)\left(1 - \frac{a}{\sigma^2}\right) \leq \frac{(n-1)S^2}{\sigma^2} \leq (n-1)\left(1 + \frac{a}{\sigma^2}\right) \right)\\ &= F_{W(n-1)}\left( (n-1)\left(1 + \frac{a}{\sigma^2}\right) \right) - F_{W(n-1)}\left( (n-1)\left(1 - \frac{a}{\sigma^2}\right) \right) \,. \end{align*}\] The key to interpreting the last line above is that \(\sigma^2\) is unknown (otherwise, why would we be constructing a confidence interval for it in the first place?), and thus can take on any positive value. What if \(\sigma^2\) is very large? \[ \lim_{\sigma^2 \to \infty} F_{W(n-1)}\left[ (n-1)\left(1 + \frac{a}{\sigma^2}\right) \right] - F_{W(n-1)}\left[ (n-1)\left(1 - \frac{a}{\sigma^2}\right) \right] = F_{W(n-1)}(n-1) - F_{W(n-1)}(n-1) = 0 \,. \] Thus the confidence coefficient for the interval \(S^2 - a \leq \sigma^2 \leq S^2 + a\) is \(1 - \alpha = 0\) (or, we have that \(\alpha = 1\)).
The upshot: one cannot write down just any interval and assume that it is a valid one!
5.5.1 Interval Estimation Given Order Statistics
We assume that we are given \(n\) iid data drawn from a Uniform(\(0,\theta\)) distribution. Above, we note that the cdf for the maximum observed datum, \(X_{(n)}\), is \(F_{(n)}(x) = (x/\theta)^n\). To find the lower and upper bounds on \(\theta\), respectively, we solve for \(\theta\) in the expressions \[\begin{align*} \left(\frac{X_{(n)}}{\theta_{1-\alpha/2}}\right)^n &= 1 - \frac{\alpha}{2} ~~ \mbox{(lower)} \\ \left(\frac{X_{(n)}}{\theta_{\alpha/2}}\right)^n &= \frac{\alpha}{2} ~~ \mbox{(upper)} \,. \end{align*}\] (We assume that we are constructing a two-sided interval; similar expressions would yield a one-sided lower or upper bound.) We thus find that \[ \hat{\theta}_L = \theta_{1-\alpha/2} = \frac{X_{(n)}}{(1-\alpha/2)^{1/n}} ~~\mbox{and}~~ \hat{\theta}_U = \theta_{\alpha/2} = \frac{X_{(n)}}{(\alpha/2)^{1/n}} \,. \]
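A sketch of this interval in action, with a coverage check (assuming hypothetical values \(\theta = 2\), \(n = 10\), and \(\alpha = 0.05\)):
set.seed(17)
n <- 10
theta <- 2
alpha <- 0.05
covered <- replicate(10000,{
  x.max <- max(runif(n,min=0,max=theta))
  theta.L <- x.max/(1-alpha/2)^(1/n)  # lower bound
  theta.U <- x.max/(alpha/2)^(1/n)    # upper bound
  (theta.L <= theta) & (theta <= theta.U)
})
mean(covered)  # should be close to 1 - alpha = 0.95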
5.5.2 Confidence Coefficient for a Uniform-Based Interval Estimator
In the example above, we show that the interval estimate with confidence coefficient \(1-\alpha\) for the uniform upper bound \(\theta\) has the form \([aX_{(n)},bX_{(n)}]\). Can we also define an appropriate interval estimator if, for instance, it has the form \([X_{(n)} + a,X_{(n)} + b]\)? The short answer is no…because the confidence coefficient will be zero! To see why, let’s expand out and solve: \[\begin{align*} P(X_{(n)} + a \leq \theta \leq X_{(n)} + b) &= P(\theta - b \leq X_{(n)} \leq \theta - a)\\ &= P(X_{(n)} \leq \theta - a) - P(X_{(n)} \leq \theta - b)\\ &= F_{(n)}(\theta-a) - F_{(n)}(\theta-b)\\ &= \left(\frac{\theta-a}{\theta}\right)^n - \left(\frac{\theta-b}{\theta}\right)^n\\ &= \left(1-\frac{a}{\theta}\right)^n - \left(1-\frac{b}{\theta}\right)^n \,. \end{align*}\] The confidence coefficient is the infimum (or minimum value) that this expression can take on. For an interval of the form \([aX_{(n)},bX_{(n)}]\), \(\theta\) does not appear, and thus the infimum is a constant. Here, however, \[ \lim_{\theta \to \infty} P(X_{(n)} + a \leq \theta \leq X_{(n)} + b) = 0 \,, \] and thus the confidence coefficient is zero (i.e., the proportion of computed intervals that overlap the true value \(\theta\) goes to zero). Thus an interval estimator of the form \([aX_{(n)},bX_{(n)}]\) is a better one than one of the form \([X_{(n)} + a,X_{(n)} + b]\).
5.6 Hypothesis Testing
Recall: a hypothesis test is a framework to make an inference about the value of a population parameter \(\theta\). The null hypothesis \(H_o\) is that \(\theta = \theta_o\), while possible alternatives \(H_a\) are \(\theta \neq \theta_o\) (two-tail test), \(\theta > \theta_o\) (upper-tail test), and \(\theta < \theta_o\) (lower-tail test). For, e.g., a one-tail test, we reject the null hypothesis if the observed test statistic \(y_{\rm obs}\) falls outside the bound given by \(y_{RR}\), which is a solution to the equation \[ F_Y(y_{RR} \vert \theta_o) - q = 0 \,, \] where \(F_Y(\cdot)\) is the cumulative distribution function for the statistic \(Y\) and \(q\) is an appropriate quantile value that is determined using the hypothesis test reference table introduced in section 17 of Chapter 1. Note that the hypothesis test framework only allows us to make a decision about a null hypothesis; nothing is proven.
One aspect of hypothesis testing that we reiterate here is that the hypotheses are always to be established, along with the level of the test, before we collect data. This should be obvious (looking at the data prior to establishing hypotheses and test levels can, and often will, lead to bias), so why are we reiterating this now? We are making this point because when we perform tests involving domain-specifying parameters, there are some quirks that we observe when we establish rejection regions.
Let’s look at an example: we sample \(n\) iid data from a Uniform(\(0,\theta\)) distribution, and we use these data to test the hypothesis \(H_o : \theta = \theta_o\) versus the hypothesis \(H_a : \theta \neq \theta_o\) at level \(\alpha\). We know that the sufficient statistic upon which we will build our test is \(X_{(n)}\).
Given this, what can we say about the rejection region right away? We can say that we will reject the null if \(X_{(n)} > \theta_o\). This is a “trivial” statement, as no mathematics is involved. Initially, this might seem off-putting: we would, of course, never set \(\theta_o\) to be less than the maximum datum, would we? (That would be silly.) But…that’s an incorrect way of looking at this situation, since that implies that we looked at the data first and only established the hypotheses afterwards. If we do things in the proper order, then it can be very much the case that the maximum datum will exceed \(\theta_o\). If we observe this, then life is easy: we simply reject the null and move on.
But…how do we establish the part of the rejection region involving values of \(X_{(n)}\) that are less than \(\theta_o\)? That seems simple enough: we are performing a two-tail test, so we set the cdf for \(X_{(n)}\) to \(\alpha/2\) and invert and…oh, but there’s a problem. If the null is actually correct, then the power of the test for \(\theta = \theta_o\) would be \(\alpha/2\) and not \(\alpha\). (If the null is correct, it is impossible to observe \(X_{(n)} > \theta_o\), so all rejections would happen “to the left” of \(\theta_o\)!) So this is quirk number two: we have to be careful about whether we use, e.g., \(\alpha/2\) or \(\alpha\) when finding the rejection region boundary. If in doubt, think about the power of the test when \(\theta = \theta_o\), and make sure that power is actually \(\alpha\).
The last hypothesis-test-related topic that we will touch upon is the concept of multiple comparisons. This is a somewhat opaque term that denotes the situation in which we perform many hypothesis tests simultaneously and need to correct for the fact that if the null is correct in all cases, it becomes more and more likely that we will observe (multiple) instances in which we decide to reject the null. We can illustrate this using the binomial distribution: if we collect \(k\) sets of data (e.g., \(k\) separate sets of \(n\) iid data sampled from a Uniform(0,\(\theta\)) distribution), and perform level-\(\alpha\) hypothesis tests for each, then the number of tests in which we reject the null is \[ X \sim \mbox{Binom}(k,\alpha) \,. \] The expected value of \(X\) is \(E[X] = k\alpha\), which increases with \(k\). The family-wise error rate, or FWER, is the probability that at least one test will result in a rejection when the null is correct: \[ FWER = P(X > 0) = 1 - P(X = 0) = 1 - \binom{k}{0} \alpha^0 (1-\alpha)^k = 1 - (1-\alpha)^k \,. \] For instance, if \(k = 10\) and \(\alpha = 0.05\), the family-wise error rate is 0.401: for every 10 tests we perform, the probability of erroneously rejecting one or more null hypotheses (i.e., detecting one or more false positives) is about 40 percent. This increase in the FWER with \(k\) is not good, and is well-known to be associated with a commonly seen data analysis issue dubbed data dredging or \(p\)-hacking. \(p\)-hacking greatly increases the probability that researchers will make incorrect claims about what their data say, and worse yet, that they will publish papers advancing these claims.
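For instance, we can reproduce the number quoted above directly:
k <- 10
alpha <- 0.05
1 - (1-alpha)^k  # the FWER for 10 independent level-0.05 tests
## [1] 0.4012631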
To mitigate this issue, we can attempt to change the test level for individual tests such that the overall FWER is reduced to \(\alpha\). There are many procedures for how we might go about changing the test level for individual tests, but the most commonly used one is the Bonferroni correction: \[ \alpha \rightarrow \frac{\alpha}{k} \,. \] What is the FWER given this correction? Let’s assume the null is correct for all \(k\) tests. Then, by the union bound, \[ FWER = P\left( p_1 \leq \frac{\alpha}{k} \cup \cdots \cup p_k \leq \frac{\alpha}{k} \right) \leq \sum_{i=1}^k P\left( p_i \leq \frac{\alpha}{k}\right) = \sum_{i=1}^k \frac{\alpha}{k} = \alpha \,. \] This works! Except…what happens if only \(k'\) out of the \(k\) nulls are actually correct? The FWER becomes at most \(k' \alpha / k \leq \alpha\). Thus when there are incorrect nulls sprinkled into the mix, the FWER is too low…which means that the Bonferroni correction is unduly conservative. Using it will lead to us possibly not detecting false nulls that we should have detected! We illustrate this issue in an example below.
An alternative to the Bonferroni correction and related procedures is to not focus upon the FWER, but to attempt to limit the false discovery rate, or FDR, instead. The simplest and most often used FDR-based procedure is the one of Benjamini and Hochberg (1995):
- compute all \(k\) \(p\)-values;
- sort the \(p\)-values into ascending order: \(p_{(1)},\ldots,p_{(k)}\);
- determine the largest value \(k'\) such that \(p_{(k')} \leq k' \alpha / k\); and
- reject the null for all tests that map to \(p_{(1)},\ldots,p_{(k')}\).
In an example below, we illustrate the use of the BH procedure. To be clear: \(\alpha\) here represents the proportion of rejected null hypotheses that are actually correct. (“We reject the null 20 times. Assuming \(\alpha = 0.05\), then we expect that we were right to reject the null 19 times, and that we’d be mistaken once.”) This is different from the FWER setting, where \(\alpha\) represents the probability of erroneously rejecting one or more null hypotheses. (“We perform 100 independent tests. Assuming \(\alpha = 0.05\), we expect to erroneously reject the null five percent of the time when the null is correct…but we can say nothing about how often we correctly reject the null.”) We can further illustrate this point with the following table:
|                | Null Correct | Null False | Total |
|----------------|--------------|------------|-------|
| Null Rejected  | V            | S          | R     |
| Fail to Reject | U            | T          | k-R   |
| Total          | k'           | k-k'       | k     |
The only observable random variable here is \(R\), the total number of rejected null hypotheses. In the FDR procedure, we focus on the first row: the main result of Benjamini and Hochberg (1995) is that their procedure guarantees \[ E\left[\frac{V}{V+S}\right] = E\left[\frac{V}{R}\right] \leq \frac{\alpha k'}{k} \leq \alpha \,. \] The FWER procedure, on the other hand, focuses on the first column, with \[ E\left[\frac{V}{V+U}\right] = \frac{\alpha k'}{k'} = \alpha \,. \]
5.6.1 The Power Curve for Testing the Uniform Distribution Upper Bound
Assume, as we do above, that we have sampled \(n\) iid data from a Uniform(\(0,\theta\)) distribution, and that we will use these data to test the hypotheses \[ H_o: \theta = \theta_o ~~\mbox{versus}~~ H_a: \theta \neq \theta_o \,. \]
The sufficient statistic is the maximum datum \(X_{(n)}\). As stated above, we know that we will reject the null when \(X_{(n)} > \theta_o\); that’s a “trivial” statement. As for the rejection region boundary when \(X_{(n)} \leq \theta_o\): we know that the cdf for \(X_{(n)}\) is \(F_{(n)}(x) = (x/\theta)^n\), so the lower boundary is \[ \left(\frac{x_{\alpha/2}}{\theta_o}\right)^n = \frac{\alpha}{2} ~~~ \Rightarrow ~~~ \ldots \,. \] Except, this is wrong: what would be the power if \(\theta = \theta_o\)? It would be \(\alpha/2\) and not \(\alpha\). So despite the fact that we are carrying out a two-tail test, all the \(\alpha\) “goes to the left” of \(\theta_o\) (because it is impossible to reject “to the right”: if the null is correct, \(X_{(n)} > \theta_o\) cannot occur). So we have that \[ \left(\frac{x_{\alpha}}{\theta_o}\right)^n = \alpha \] and thus that \[ x_{\alpha} = \theta_o \alpha^{1/n} \,. \] If \(X_{(n)} < x_{\alpha}\), we reject the null. Full stop.
The power of this test is \[\begin{align*} P(\mbox{reject}~\mbox{null} \vert \theta) &= P(X_{(n)} < x_\alpha \cup X_{(n)} > \theta_o \vert \theta) \\ &= F_{(n)}(x_\alpha \vert \theta) + [1 - F_{(n)}(\theta_o \vert \theta)] \\ &= \left\{ \begin{array}{rl} 1 & \theta < x_\alpha \\ \left(\frac{x_\alpha}{\theta}\right)^n & x_\alpha < \theta \leq \theta_o \\ 1 + \left(\frac{x_\alpha}{\theta}\right)^n - \left(\frac{\theta_o}{\theta}\right)^n & \theta > \theta_o \end{array} \right. \,. \end{align*}\] For the first condition above: if \(\theta < x_\alpha\), then \(X_{(n)} < x_\alpha\), so every test will result in a rejection, and the power is thus 1. We plot out the power curve for \(\theta_o = 1\) and \(n = 10\) in Figure 5.4.
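A sketch of code that reproduces the shape of the power curve in Figure 5.4 (with \(\theta_o = 1\) and \(n = 10\), and assuming \(\alpha = 0.05\)):
n <- 10
alpha <- 0.05
theta.o <- 1
x.alpha <- theta.o*alpha^(1/n)  # the lower rejection boundary
test.power <- function(theta)
{
  ifelse(theta < x.alpha,1,
    ifelse(theta <= theta.o,(x.alpha/theta)^n,
      1 + (x.alpha/theta)^n - (theta.o/theta)^n))
}
curve(test.power,from=0.5,to=1.5,xlab=expression(theta),ylab="power")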
5.6.2 An Illustration of Multiple Comparisons When Testing for the Normal Mean
In the code chunk below, we generate \(k = 100\) independent datasets of size \(n = 40\); for \(k' = 80\) datasets, \(\mu = 0\), and for the remainder, \(\mu = 0.5\). For simplicity, we assume \(\sigma^2\) is known and is equal to one. For each dataset, we test the hypotheses \(H_o : \mu = 0\) versus \(H_a : \mu > 0\).
set.seed(101)
n <- 40
k <- 100
k.p <- 80                # the number of datasets for which the null is correct
mu <- c(rep(0,k.p),rep(0.5,k-k.p))
mu.o <- 0
sigma2 <- 1
alpha <- 0.05            # the test level (used by both corrections below)
p <- rep(NA,k)
for ( ii in 1:k ) {
  X <- rnorm(n,mean=mu[ii],sd=sqrt(sigma2))
  p[ii] <- 1 - pnorm(mean(X),mean=mu.o,sd=sqrt(sigma2/n))
}
Below, we try two separate corrections for multiple comparisons: the Bonferroni correction (controlling FWER), and the Benjamini-Hochberg procedure (controlling FDR).
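For the Bonferroni correction, we reject the null for any test whose \(p\)-value falls below \(\alpha/k\); a minimal sketch of this computation is:
cat("The number of rejected null hypotheses for Bonferroni: ",
    sum(p < alpha/k),"\n")
cat("The number of falsely rejected null hypotheses is: ",
    sum(which(p < alpha/k) <= k.p),"\n")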
## The number of rejected null hypotheses for Bonferroni: 9
## The number of falsely rejected null hypotheses is: 0
p.sort <- sort(p)                      # sort the p-values into ascending order
cat("The number of rejected null hypotheses for FDR: ",
    sum(p.sort < (1:k)*alpha/k),"\n")
## The number of rejected null hypotheses for FDR: 17
p.rej <- p.sort[p.sort<(1:k)*alpha/k]  # the p-values satisfying the BH criterion
w <- which(p %in% p.rej)               # map back to the original test indices
cat("The number of falsely rejected null hypotheses is: ",sum(w<=k.p),"\n")
## The number of falsely rejected null hypotheses is: 0
With the Bonferroni correction, we reject nine null hypotheses (with the guarantee that there is, on average, a five percent chance that we erroneously reject one or more of the true nulls); here, we reject no correct null hypotheses. See Figure 5.5.
With the BH procedure, we reject 17 null hypotheses (with the guarantee that, on average, five percent of these 17 [meaning, effectively, one] are erroneous rejections…here, we reject no correct null hypotheses).