3 The Binomial (and Related) Distributions
3.1 Motivation
Let’s assume we are holding a coin. It may be a fair coin (meaning that the probabilities of observing heads or tails in a given flip of the coin are each 0.5)…or perhaps it is not. We decide that we are going to flip the coin some fixed number of times \(k\), and we will record the outcome of each flip: \[ \mathbf{Y} = \{Y_1,Y_2,\ldots,Y_k\} \,, \] where, e.g., \(Y_1 = 1\) if we observe heads or \(0\) if we observe tails. This is an example of a Bernoulli process, where “process” denotes a sequence of observations, and “Bernoulli” indicates that there are two possible discrete outcomes for each observation.
A binomial experiment is one that generates Bernoulli process data through the running of \(k\) trials (e.g., \(k\) separate coin flips). The properties of such an experiment are that:
- The number of trials \(k\) is fixed in advance.
- Each trial has two possible outcomes, generically denoted as \(S\) (success) or \(F\) (failure).
- The probability of success remains \(p\) throughout the experiment.
- The outcome of any one trial is independent of the outcomes of the others.
The random variable of interest for a binomial experiment is the number of observed successes. A closely related alternative to a binomial experiment is a negative binomial experiment, where the number of successes \(s\) is fixed in advance, instead of the number of trials \(k\), and the random variable of interest is the number of failures that we observe before achieving \(s\) successes. A simple example would be flipping a coin until \(s\) heads are observed and recording the overall number of tails that we observe.
As a side note to the third point above, about the probability of success remaining \(p\) throughout the experiment: binomial and negative binomial experiments rely on sampling with replacement…if we observe a head for a given coin flip, we can observe heads again in the future. In the real world, however, the reader will observe instances where, e.g., a binomial distribution is used to model experiments featuring sampling without replacement: we have \(K = 100\) widgets, of which ten are defective; we check one to see if it is defective (with probability \(p = 0.1\)) and set it aside, then check another (with probability either 10/99 or 9/99, depending on the outcome of the first trial), etc. The convention for using the binomial distribution to model data in such a situation is that it is fine to do so if the number of trials \(k \lesssim K/10\). However, in the age of computers, there is no reason to apply the binomial distribution when we can apply the hypergeometric distribution instead.
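For instance, a quick comparison in R, a sketch using the widget numbers above and an arbitrarily chosen number of draws \(k = 5 \leq K/10\), shows how closely the two pmfs agree:

K <- 100                    # population size
d <- 10                     # number of defective widgets
k <- 5                      # number of draws; note k <= K/10
dbinom(0:3,size=k,prob=d/K) # binomial model (sampling with replacement)
dhyper(0:3,m=d,n=K-d,k=k)   # hypergeometric model (sampling without replacement)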
And, as a side note to the fourth point above, about the outcome of each trial being independent of the outcomes of the others: in a general process, each datum can be dependent on the data observed previously. How each datum is dependent on previous data defines the type of process that is observed: a Markov process, a Gaussian process, etc. A Bernoulli process is termed a memoryless process because it is comprised of independent (and identically distributed) data.
3.2 Probability Mass Function
Let’s focus first on the outcome of a binomial experiment, with the random variable \(X\) being the number of observed successes in \(k\) trials. What is the probability of observing \(X=x\) successes, if the probability of observing a success in any one trial is \(p\)? \[\begin{align*} \mbox{$x$ successes}&: p^x \\ \mbox{$k-x$ failures}&: (1-p)^{k-x} \,. \end{align*}\] So \(P(X=x) = p^x (1-p)^{k-x}\)…but, no, this isn’t right. Let’s start again and think this through. Assume \(k = 2\). The sample space of possible experimental outcomes is \[ \Omega = \{ SS, SF, FS, FF \} \,. \] If \(p\) = 0.5, then we can see that the probability of observing one success in two trials is 0.5…but our proposed probability mass function tells us that \(P(X=1) = (0.5)^1 (1-0.5)^1 = 0.25\). What are we missing? We are missing that there are two ways of observing a single success…and we need to count both. Because we ultimately do not care about the order in which successes and failures are observed, we utilize counting via combination: \[ \binom{k}{x} = \frac{k!}{x! (k-x)!} \,, \] where the exclamation point represents the factorial function \(x! = x(x-1)(x-2)\cdots 1\). So now we can correctly write down the binomial probability mass function: \[ P(X=x) = p_X(x) = \binom{k}{x} p^x (1-p)^{k-x} ~~~ x \in \{0,\ldots,k\} \,. \] We denote the distribution of the random variable \(X\) as \(X \sim\) Binomial(\(k\),\(p\)). Note that when \(k = 1\), we have a Bernoulli distribution.
Recall: a probability mass function is one way to represent a discrete probability distribution, and it has the properties (a) \(0 \leq p_X(x) \leq 1\) and (b) \(\sum_x p_X(x) = 1\), where the sum is over all values of \(x\) in the distribution’s domain.
(The reader should note that the number of trials is commonly denoted as \(n\), not as \(k\). However, since \(n\) is conventionally used to denote the sample size in an experiment, to avoid confusion we use \(k\) to denote the number of trials in this text.)
In Figure 3.1, we display three binomial pmfs, one each for probabilities of success 0.1 (red, to the left), 0.5 (green, to the center), and 0.8 (blue, to the right). This figure indicates an important aspect of the binomial pmf, namely that it can attain a shape akin to that of a normal distribution, if \(p\) is such that any truncation observed at the values \(x=0\) and \(x=k\) is minimal. In fact, a binomial random variable converges in distribution to a normal random variable in certain limiting situations, as we show in an example below.
Recall: the expected value of a discretely distributed random variable is \[ E[X] = \sum_x x p_X(x) \,, \] where the sum is over all values of \(x\) within the domain of the pmf \(p_X(x)\). The expected value is equivalent to a weighted average, with the weight for each possible value of \(x\) given by \(p_X(x)\).
For the binomial distribution, the expected value is \[ E[X] = \sum_{x=0}^k x \binom{k}{x} p^x (1-p)^{k-x} \,. \] At first, this does not appear to be easy to evaluate. One trick in our arsenal is to pull constants out of the summation such that whatever is left as the summand is a pmf (and thus sums to 1). Let’s try this here: \[\begin{align*} E[X] &= \sum_{x=0}^k x \binom{k}{x} p^x (1-p)^{k-x} \\ &= \sum_{x=1}^k x \frac{k!}{x!(k-x)!} p^x (1-p)^{k-x} \\ &= \sum_{x=1}^k \frac{k!}{(x-1)!(k-x)!} p^x (1-p)^{k-x} \\ &= kp \sum_{x=1}^k \frac{(k-1)!}{(x-1)!(k-x)!} p^{x-1} (1-p)^{k-x} \,. \end{align*}\] The summation appears almost like that of a binomial random variable. Let’s set \(y = x-1\). Then \[\begin{align*} E[X] &= kp \sum_{x=1}^k \frac{(k-1)!}{(x-1)!(k-x)!} p^{x-1} (1-p)^{k-x} \\ &= kp \sum_{y=0}^{k-1} \frac{(k-1)!}{y!(k-(y+1))!} p^y (1-p)^{k-(y+1)} \\ &= kp \sum_{y=0}^{k-1} \frac{(k-1)!}{y!((k-1)-y)!} p^y (1-p)^{(k-1)-y} \,. \end{align*}\] The summand is now the pmf for the random variable \(Y \sim\) Binomial(\(k-1\),\(p\)), summed over all values of \(y\) in the domain of the distribution. Thus the summation evaluates to 1: \(E[X] = kp\). In an example below, we use a similar strategy to determine the variance \(V[X] = kp(1-p)\).
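(This result, and the variance quoted above, are easy to spot-check via simulation; the following is a minimal sketch, with \(k\) and \(p\) chosen arbitrarily.)

set.seed(101)
k <- 10
p <- 0.3
x <- rbinom(100000,size=k,prob=p) # a large sample of binomial variates
mean(x)                           # should be close to k*p = 3
var(x)                            # should be close to k*p*(1-p) = 2.1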
A negative binomial experiment is governed by the negative binomial distribution, whose pmf is \[ p_X(x) = \binom{x+s-1}{x} p^s (1-p)^x ~~~ x \in \{0,1,2,\ldots\} \,. \] The form of this pmf follows from the fact that the underlying Bernoulli process would consist of \(x+s\) data, with the last datum being the observed success that ends the experiment. The first \(x+s-1\) data would feature \(s-1\) successes and \(x\) failures, with the order of success and failure not mattering…so we can view these data as being binomially distributed (albeit with \(x\) representing failures…hence the “negative” in negative binomial!): \[ p_X(x) = \underbrace{\binom{x+s-1}{x} p^{s-1} (1-p)^x}_{\mbox{first $x+s-1$ trials}} \cdot \underbrace{p}_{\mbox{last trial}} \,. \] Note that when \(s = 1\), the resulting distribution is called the geometric distribution.
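(As it happens, R's dnbinom() function adopts exactly this parameterization, with \(x\) counting failures; a quick sketch with arbitrarily chosen \(s\) and \(p\) confirms that it matches the pmf above.)

s <- 3
p <- 0.6
x <- 0:5
choose(x+s-1,x)*p^s*(1-p)^x # the pmf as written above
dnbinom(x,size=s,prob=p)    # the same values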
In an example below, we show how one would derive the expected value of a negative binomial random variable, \(E[X] = s(1-p)/p\). (We can carry out a similar calculation to show that the variance is \(V[X] = s(1-p)/p^2\).)
3.2.1 Variance of a Binomial Random Variable
Recall: the variance of a discretely distributed random variable is \[ V[X] = \sum_x (x-\mu)^2 p_X(x) = E[X^2] - (E[X])^2\,, \] where the sum is over all values of \(x\) in the domain of the pmf \(p_X(x)\). The variance represents the square of the “width” of a probability mass function, where by “width” we mean the range of values of \(x\) for which \(p_X(x)\) is effectively non-zero.
The variance of a random variable is given by the shortcut formula that we have been using since Chapter 1: \(V[X] = E[X^2] - (E[X])^2\). So we would expect that we would need to compute \(E[X^2]\) here, since we already know that \(E[X] = kp\). But for reasons that will become apparent below, it is actually far easier for us to compute \(E[X(X-1)]\), and to work with that to eventually derive the variance: \[\begin{align*} E[X(X-1)] &= \sum_{x=0}^k x(x-1) \binom{k}{x} p^x (1-p)^{k-x} \\ &= \sum_{x=0}^k x(x-1) \frac{k!}{x!(k-x)!} p^x (1-p)^{k-x} \\ &= \sum_{x=2}^k \frac{k!}{(x-2)!(k-x)!} p^x (1-p)^{k-x} \\ &= k(k-1) p^2 \sum_{x=2}^k \frac{(k-2)!}{(x-2)!(k-x)!} p^{x-2} (1-p)^{k-x} \,. \end{align*}\] The advantage to using \(x(x-1)\) was that it matches the first two terms of \(x! = x(x-1)\cdots(1)\), allowing easy cancellation. If we set \(y = x-2\), we find that the summand above will, in a similar manner as in the calculation of \(E[X]\), become the pmf for the random variable \(Y \sim\) Binomial(\(k-2,p\))…and thus the summation will evaluate to 1.
So \(E[X(X-1)] = E[X^2] - E[X] = k(k-1)p^2\). Since \(E[X^2] = V[X] + (E[X])^2\), we have \(V[X] = E[X(X-1)] + E[X] - (E[X])^2 = k(k-1)p^2 + kp - k^2p^2 = kp - kp^2 = kp(1-p)\). Done.
3.2.2 The Expected Value of a Negative Binomial Random Variable
The calculation for the expected value \(E[X]\) for a negative binomial random variable is similar to that for a binomial random variable: \[\begin{align*} E[X] &= \sum_{x=0}^{\infty} x \binom{x+s-1}{x} p^s (1-p)^x \\ &= \sum_{x=0}^{\infty} x \frac{(x+s-1)!}{(s-1)!x!} p^s (1-p)^x \\ &= \sum_{x=1}^{\infty} \frac{(x+s-1)!}{(s-1)!(x-1)!} p^s (1-p)^x \,. \end{align*}\] Let \(y = x-1\). Then \[\begin{align*} E[X] &= \sum_{y=0}^{\infty} \frac{(y+s)!}{(s-1)!y!} p^s (1-p)^{y+1} \\ &= \sum_{y=0}^{\infty} s(1-p) \frac{(y+s)!}{s!y!} p^s (1-p)^y \\ &= \sum_{y=0}^{\infty} \frac{s(1-p)}{p} \frac{(y+s)!}{s!y!} p^{s+1} (1-p)^y \\ &= \frac{s(1-p)}{p} \sum_{y=0}^{\infty} \frac{(y+s)!}{s!y!} p^{s+1} (1-p)^y \\ &= \frac{s(1-p)}{p} \,. \end{align*}\] The summand is that of a negative binomial distribution for \(s+1\) successes, hence the summation is 1, and thus \(E[X] = s(1-p)/p\).
We can use a similar calculation in which we evaluate \(E[X(X-1)]\) in order to derive the variance of a negative binomial random variable: \(V[X] = s(1-p)/p^2\).
3.2.3 Binomial Distribution: Normal Approximation
In certain limiting situations, a binomial random variable converges in distribution to a normal random variable. In other words, if \[ P\left(\frac{X-\mu}{\sigma} < a \right) = P\left(\frac{X-kp}{\sqrt{kp(1-p)}} < a \right) \approx P(Z < a) = \Phi(a) \,, \] then we can state that \(X \stackrel{d}{\rightarrow} Y \sim \mathcal{N}(kp,kp(1-p))\), or that \(X\) converges in distribution to a normal random variable \(Y\). Now, what do we mean by “certain limiting situations”? For instance, if \(p\) is close to zero or one, then the binomial distribution is truncated at 0 or at \(k\), and the shape of the pmf does not appear to be like that of a normal pdf. One convention is that the normal approximation is adequate if \[ k > 9\left(\frac{\mbox{max}(p,1-p)}{\mbox{min}(p,1-p)}\right) \,. \] The reader might question why we would mention this approximation at all: if we have binomially distributed data and a computer, then we need not ever utilize such an approximation to, e.g., compute probabilities. This point is correct (and is the reason why, for instance, we do not mention the so-called continuity correction here; our goal is not to compute probabilities). The reason we mention this is that this approximation underlies a commonly used approximate interval construction, the Wald interval, that we will discuss later in the chapter.
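(To get a feel for the quality of the approximation, we can compare the exact and approximate cdfs directly; this is a sketch with arbitrarily chosen values of \(k\), \(p\), and \(a\) that satisfy the convention above. The agreement improves as \(k\) increases.)

k <- 100
p <- 0.5
a <- 55
pbinom(a,size=k,prob=p)              # exact cdf value (~0.864)
pnorm(a,mean=k*p,sd=sqrt(k*p*(1-p))) # normal approximation (~0.841)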
3.2.4 Computing Probabilities
Let \(X\) be a random variable sampled from a binomial distribution with number of trials \(k = 6\) and success probability \(p = 0.4\), and let \(Y\) be a random variable sampled from a negative binomial distribution with number of successes \(s = 3\) and success probability \(p = 0.6\).
- What is \(P(2 \leq X < 4)\)?
To find this probability, we sum over the binomial probability mass function for values \(x = 2\) and \(x = 3\). (The form of the inequalities matters for discrete distributions!) We can perform this computation analytically: \[\begin{align*} P(2 \leq X < 4) &= \binom{6}{2} (0.4)^2 (1-0.4)^4 + \binom{6}{3} (0.4)^3 (1-0.4)^3 \\ &= \frac{6!}{2!4!} \cdot 0.16 \cdot 0.1296 + \frac{6!}{3!3!} \cdot 0.064 \cdot 0.216 = 15 \cdot 0.0207 + 20 \cdot 0.0138 = 0.5875 \,. \end{align*}\] There is a 58.75% chance that the next time we sample data according to this distribution, we will observe a value of 2 or 3.
It is ultimately simpler to use R to perform this calculation. While we can compute the probability as the difference of two cumulative distribution function values, for discrete distributions it is often more straightforward to sum over the relevant probability masses directly, since then we need not worry about whether the input to the cdf is correct given the form of the inequality:
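sum(dbinom(2:3,size=6,prob=0.4))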
## [1] 0.58752
- What is \(P(Y > 1)\)?
If we were to determine this probability by hand, the first thing we would do is specify that we will compute \(1 - P(Y \leq 1)\), as this has a finite (and small!) number of terms: \[\begin{align*} P(Y > 1) &= 1 - \binom{0+3-1}{0} (0.6)^3 (1-0.6)^0 - \binom{1+3-1}{1} (0.6)^3 (1-0.6)^1 \\ &= 1 - 1 \cdot 0.216 \cdot 1 - 3 \cdot 0.216 \cdot 0.4 = 0.5248 \,. \end{align*}\] There is a 52.48% chance that we would observe more than one failure the next time we sample data according to this distribution.
Like before, it is ultimately simpler to use R:
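1-pnbinom(1,size=3,prob=0.6)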
## [1] 0.5248
3.3 Cumulative Distribution Function
Recall: the cumulative distribution function, or cdf, is another means by which to encapsulate information about a probability distribution. For a discrete distribution, it is defined as \(F_X(x) = \sum_{y\leq x} p_X(y)\), and it is defined for all values \(x \in (-\infty,\infty)\), with \(F_X(-\infty) = 0\) and \(F_X(\infty) = 1\).
For the binomial distribution, the cdf is \[ F_X(x) = \sum_{y=0}^{\lfloor x \rfloor} p_X(y) = \sum_{y=0}^{\lfloor x \rfloor} \binom{k}{y} p^y (1-p)^{k-y} \,, \] where \(\lfloor x \rfloor\) denotes the floor function, which returns the largest integer that is less than or equal to \(x\) (e.g., if \(x\) = 6.75, \(\lfloor x \rfloor\) = 6). (In closed form, we can represent this cdf with a regularized incomplete beta function, which is not analytically easy to work with.) Also, because a pmf is defined at discrete values of \(x\), its associated cdf is a step function, as illustrated in the left panel of Figure 3.3. As we can see in this figure, the cdf steps up at each value of \(x\) in the domain of \(p_X(x)\), and unlike the case for continuous distributions, the form of the inequalities in a probabilistic statement matters: \(P(X < x)\) and \(P(X \leq x)\) will not be the same if \(x\) is an integer in \(\{0,1,2,\ldots,k\}\).
Recall: an inverse cdf function takes as input a distribution quantile \(q \in [0,1]\) and returns the associated value of \(x\). A discrete distribution has no unique inverse cdf; it is convention to utilize the generalized inverse cdf, \(F_X^{-1}(q) = \mbox{inf}\{x : F_X(x) \geq q\}\), where “inf” indicates that the function returns the smallest value of \(x\) such that \(F_X(x) \geq q\).
In the right panel of Figure 3.3, we display the inverse cdf for the same distribution used to generate the figure in the left panel (\(k=4\) and \(p=0.5\)). Like the cdf, the inverse cdf for a discrete distribution is a step function. Below, in an example, we show how we adapt the inverse transform sampler algorithm of Chapter 1 to accommodate the step-function nature of an inverse cdf.
3.3.1 Computing Probabilities
Because computing the binomial pmf for a range of values of \(x\) can be laborious, we typically utilize R functions when computing probabilities.
- If \(X \sim\) Binomial(10,0.6), what is \(P(4 \leq X < 6)\)?
We first note that due to the form of the inequality, we do not include \(X=6\) in the computation. Thus \(P(4 \leq X < 6) = p_X(4) + p_X(5)\), which equals \[ \binom{10}{4} (0.6)^4 (1-0.6)^6 + \binom{10}{5} (0.6)^5 (1-0.6)^5 \,. \] Even computing this is unnecessarily laborious; instead, we call on R:
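sum(dbinom(4:5,size=10,prob=0.6))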
## [1] 0.3121349
(This utilizes R's vectorization feature: we need not explicitly define a for-loop to evaluate dbinom() for \(x=4\) and then at \(x=5\).) We can also utilize cdf functions here: \(P(4 \leq X < 6) = P(X < 6) - P(X < 4) = P(X \leq 5) - P(X \leq 3) = F_X(5) - F_X(3)\), which in R is computed via
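pbinom(5,size=10,prob=0.6)-pbinom(3,size=10,prob=0.6)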
## [1] 0.3121349
As we can see, the direct summation approach is the more straightforward one.
- If \(X \sim\) Binomial(10,0.6), what is the value of \(a\) such that \(P(X \leq a) = 0.9\)?
First, we set up the inverse cdf formula: \[ P(X \leq a) = F_X(a) = 0.9 ~~ \Rightarrow ~~ a = F_X^{-1}(0.9) \,. \] Note that we didn't do anything differently here than we would have done in a continuous distribution setting…and we can proceed directly to R because it utilizes the generalized inverse cdf algorithm.
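qbinom(0.9,size=10,prob=0.6)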
## [1] 8
We can see immediately that the cdf for a discrete distribution is not a one-to-one function: if we plug \(x = 8\) into the cdf, we do not recover the initial value \(q = 0.9\):
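pbinom(8,size=10,prob=0.6)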
## [1] 0.9536426
3.3.2 Sampling Data From an Arbitrary Probability Mass Function
While we would always utilize R shortcut functions like rbinom() when they exist, there may be instances when we need to code our own functions for sampling data from discrete distributions. The code below shows such a function for an arbitrary probability mass function.
set.seed(101)
x <- c(1,2,4,8) # domain of x
p.x <- c(0.2,0.35,0.15,0.3) # p_X(x)
F.x <- cumsum(p.x) # cumulative sum -> produces F_X(x)
n <- 100
q <- runif(n) # we still ultimately need runif!
i <- findInterval(q,F.x)+1  # findInterval() returns bin indices 0-3, not 1-4,
                            # hence we add 1: i = 1 means q lies between
                            # 0 and F.x[1], etc.
x.sample <- x[i]
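A quick sanity check on the sampler: the empirical frequencies should roughly match \(p_X(x)\) (only roughly, given that \(n = 100\)):

table(x.sample)/n # compare to p.x = (0.2,0.35,0.15,0.3)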
3.4 Linear Functions of Random Variables
Let’s assume we are given \(n\) iid binomial random variables: \(X_1,X_2,\ldots,X_n \sim\) Binomial(\(k\),\(p\)). Can we determine the distribution of the sum \(Y = \sum_{i=1}^n X_i\)? Yes, we can…via the method of moment-generating functions.
Recall: the moment-generating function, or mgf, is a means by which to encapsulate information about a probability distribution. When it exists, the mgf is given by \(m_X(t) = E[e^{tX}]\). Also, if \(Y = \sum_{i=1}^n a_iX_i\), then \(m_Y(t) = m_{X_1}(a_1t) m_{X_2}(a_2t) \cdots m_{X_n}(a_nt)\); if we can identify \(m_Y(t)\) as the mgf for a known family of distributions, then we can immediately identify the distribution for \(Y\) and the parameters of that distribution.
The mgf for the binomial distribution is \[\begin{align*} m_X(t) = E[e^{tX}] &= \sum_{x=0}^k e^{tx} \binom{k}{x} p^x (1-p)^{k-x} \\ &= \sum_{x=0}^k \binom{k}{x} (pe^t)^x (1-p)^{k-x} \,. \end{align*}\] We utilize the binomial theorem \[ (x+y)^k = \sum_{i=0}^k \binom{k}{i} x^i y^{k-i} \] to re-express \(m_X(t)\): \[ m_X(t) = [pe^t + (1-p)]^k \,. \] Note that one may see this written as \((pe^t+q)^k\), where \(q = 1-p\).
The mgf for \(Y = \sum_{i=1}^n X_i\) is thus \[ m_Y(t) = \prod_{i=1}^n m_{X_i}(t) = [m_X(t)]^n = [pe^t + (1-p)]^{nk} \,. \] We can see that this has the form of a binomial mgf: \(Y \sim\) Binomial(\(nk\),\(p\)), with expected value \(E[Y] = nkp\) and variance \(V[Y] = nkp(1-p)\). This makes sense, as the act of summing binomial data is equivalent to concatenating \(n\) separate Bernoulli processes into one longer Bernoulli process…whose data can subsequently be modeled using a binomial distribution.
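(We can spot-check this result by simulation; the following sketch, with arbitrarily chosen \(n\), \(k\), and \(p\), compares the sample moments of simulated sums against \(nkp\) and \(nkp(1-p)\).)

set.seed(101)
n <- 4
k <- 5
p <- 0.3
y <- colSums(matrix(rbinom(n*100000,size=k,prob=p),nrow=n)) # 100,000 sums of n binomial variates
mean(y)                                                     # should be close to n*k*p = 6
var(y)                                                      # should be close to n*k*p*(1-p) = 4.2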
While we can identify the distribution of the sum by name, we cannot say the same about the sample mean. We know that the expected value is \(E[\bar{X}] = \mu = kp\) and that the variance is \(V[\bar{X}] = \sigma^2/n = kp(1-p)/n\), but when we attempt to use the mgf method with \(a_i = 1/n\) instead of \(a_i = 1\), we find that \[ m_{\bar{X}}(t) = [pe^{t/n} + (1-p)]^{nk} \,. \] Changing \(t\) to \(t/n\) has the effect of creating an mgf that does not have the form of any known mgf. However, we do know the distribution: it has a pmf that is identical in form to that of the binomial distribution, but has the domain \(\{0,1/n,2/n,\ldots,k\}\). (We can derive this result mathematically by making the transformation \(\sum_{i=1}^n X_i \rightarrow (\sum_{i=1}^n X_i)/n\), as we see below in an example.) We could define the pmf ourselves using our own R function, but there is no real need to: as we will see, if we wish to construct a confidence interval for \(p\), we can just use the sum \(\sum_{i=1}^n X_i\) as our statistic. (We could also, in theory, utilize the Central Limit Theorem if \(n \gtrsim 30\), but there is absolutely no reason to do that to make inferences about \(p\): we know the distribution of the sum of the data exactly, and thus there is no need to fall back upon approximations.)
3.4.1 The MGF for a Geometric Random Variable
Recall that a geometric distribution is equivalent to a negative binomial distribution with number of successes \(s = 1\); its probability mass function is \[ p_X(x) = p (1-p)^x \,, \] with \(x \in \{0,1,\ldots\}\) and \(p \in (0,1]\). The moment-generating function for a geometric random variable is thus \[\begin{align*} m_X(t) = E[e^{tX}] &= \sum_{x=0}^{\infty} e^{tx} p (1-p)^x = p \sum_{x=0}^\infty e^{tx} (1-p)^x \\ &= p \sum_{x=0}^\infty [e^t(1-p)]^x = \frac{p}{1-e^t(1-p)} \,. \end{align*}\] The last equality utilizes the formula for the sum of an infinite geometric series: \(\sum_{i=0}^\infty x^i = (1-x)^{-1}\), when \(\vert x \vert < 1\). (If \(t < 0\) and \(p > 0\), then the condition that \(\vert e^t(1-p) \vert < 1\) holds.)
The sum of \(s\) geometric random variables has the moment-generating function \[ m_Y(t) = \prod_{i=1}^s m_{X_i}(t) = \left[\frac{p}{1-e^t(1-p)}\right]^s \,. \] This is the mgf for a negative binomial distribution for \(s\) successes. In the same way that the sum of \(k\) Bernoulli random variables is a binomially distributed random variable, the sum of \(s\) geometric random variables is a negative binomially distributed random variable.
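(A quick simulation sketch of this result, with arbitrarily chosen \(s\) and \(p\):)

set.seed(101)
s <- 3
p <- 0.4
y <- colSums(matrix(rgeom(s*100000,prob=p),nrow=s)) # 100,000 sums of s geometric variates
mean(y)                                             # should be close to s*(1-p)/p = 4.5
mean(rnbinom(100000,size=s,prob=p))                 # negative binomial samples should agree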
3.4.2 The PMF for the Sample Mean
Let’s assume we are given \(n\) iid binomial random variables: \(X_1,X_2,\ldots,X_n \sim\) Binomial(\(k\),\(p\)). As we observe above, the distribution of the sum \(Y = \sum_{i=1}^n X_i\) is binomial with mean \(nkp\) and variance \(nkp(1-p)\).
The sample mean is \(\bar{X} = Y/n\), and so \[ F_{\bar{X}}(\bar{x}) = P(\bar{X} \leq \bar{x}) = P(Y \leq n\bar{x}) = \sum_{y=0}^{n\bar{x}} p_Y(y) \,. \] (We note that \(n\bar{x}\) is integer-valued by definition; we do not need to round down here.) Because we are dealing with a pmf, we cannot simply take the derivative of \(F_{\bar{X}}(\bar{x})\) to find \(f_{\bar{X}}(\bar{x})\)…but what we can do is assess the jump in the cumulative distribution function at each step, because that is the pmf. In other words, we can compute \[ f_{\bar{X}}(\bar{x}) = P(Y \leq n\bar{x}) - P(Y \leq n\bar{x}-1) \] and store this as a numerically expressed pmf for \(\bar{X}\). See Figure 3.5.
But it turns out we can say more about this pmf, by looking at the problem in a different way. We know that \(Y \sim\) Binomial(\(nk\),\(p\)) and thus that \(Y \in \{0,1,\ldots,nk\}\). When we compute the quotient \(\bar{X} = Y/n\), all we are doing is redefining the domain of the pmf from \(\{0,1,\ldots,nk\}\) to \(\{0,1/n,2/n,\ldots,k\}\). We do not actually change the probability masses! So we can write \[ p_{\bar{X}}(\bar{x}) = \binom{nk}{n\bar{x}} p^{n\bar{x}} (1-p)^{nk-n\bar{x}} ~~ \bar{x} \in \{0,1/n,2/n,\ldots,k\} \,. \] This pmf has the functional form of a binomial pmf…but not the domain of a binomial pmf. For that reason, we cannot say that \(\bar{X}\) is binomially distributed. The pmf has a functional form, it has a domain, but it has no known “name” and thus no associated R functions that we can utilize when performing statistical inference. (This is why we will utilize the sum of the data instead: R functions for its distribution exist!)
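That said, expressing the pmf numerically takes only a few lines; the following sketch, with arbitrarily chosen \(n\), \(k\), and \(p\), illustrates the redefinition of the domain:

n <- 5
k <- 6
p <- 0.4
x.bar <- (0:(n*k))/n                       # the domain {0,1/n,2/n,...,k}
p.x.bar <- dbinom(0:(n*k),size=n*k,prob=p) # the (unchanged) Binomial(nk,p) masses
sum(p.x.bar)                               # a valid pmf: the masses sum to 1
sum(x.bar*p.x.bar)                         # E[X-bar] = k*p = 2.4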
3.5 Order Statistics
Let’s suppose that we have sampled \(n\) iid random variables \(\{X_1,\ldots,X_n\}\) from some arbitrary distribution. Previously, we have summarized such data with the sample mean and the sample variance. However, there are other summary statistics, some of which are only calculable if we sort the data into ascending order: \(\{X_{(1)},\ldots,X_{(n)}\}\). These are dubbed order statistics and the \(j^{th}\) order statistic is the sample’s \(j^{th}\) smallest value (i.e., the smallest-valued datum in the sample is \(X_{(1)}\) and the largest-valued datum is \(X_{(n)}\)). Examples of statistics based on ordering include \[\begin{align*} \mbox{Range:}& ~~X_{(n)} - X_{(1)} \\ \mbox{Median:}& ~~X_{((n+1)/2)} ~ \mbox{if $n$ is odd} \\ & ~~(X_{(n/2)}+X_{(n/2+1)})/2 ~ \mbox{if $n$ is even} \,. \end{align*}\] The most important point to keep in mind is that the probability mass and density functions for order statistics differ from the pmfs and pdfs for their constituent iid data. For instance, if we sample \(n\) data from a \(\mathcal{N}(0,1)\) distribution, we would not expect the minimum value to be distributed the same way; if anything, its mean should take on larger and larger negative values, and its variance should decrease, as \(n\) increases.
So: why are we discussing order statistics here, in the middle of a discussion of the binomial distribution? It is because we can derive, e.g., the pdf for an order statistic of a continuous distribution using the binomial pmf. (Note that order statistics exist for discretely valued data, but the probability mass functions for them are not easily derived and thus we will only consider order statistics for continuously valued data here.)
See Figure 3.6. Without loss of generality, we can assume that \(f_X(x) > 0\) for \(x \in [a,b]\) and that we sample \(n\) data from this distribution. The number of data with values less than or equal to some arbitrarily chosen \(x\) is a binomial random variable: \[ Y \sim \mbox{Binomial}(n,p=F_X(x)) \,. \] What is the probability that the \(j^{th}\) ordered datum has a value \(\leq x\)? That’s equivalent to asking for the probability that \(Y \geq j\), i.e., did we see at least \(j\) successes in \(n\) trials? \[ F_{(j)}(x) = P(X_{(j)} \leq x) = P(Y \geq j) = \sum_{i=j}^n \binom{n}{i} [F_X(x)]^i [1 - F_X(x)]^{n-i} \,. \] \(F_{(j)}(x)\) is the cdf for the \(j^{th}\) ordered datum.
Recall: a continuous distribution’s pdf is the derivative of its cdf.
Leaving aside algebraic details, we can write down the pdf for \(X_{(j)}\): \[ f_{(j)}(x) = \frac{d}{dx}F_{(j)}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1 - F_X(x)]^{n-j} \,, \] and write down simplified expressions for the pdfs for the minimum and maximum data values: \[ f_{(1)}(x) = n f_X(x) [1 - F_X(x)]^{n-1} ~~\mbox{and}~~ f_{(n)}(x) = n f_X(x) [F_X(x)]^{n-1} \,. \]
3.5.1 Distribution of the Minimum Value Sampled from an Exponential Distribution
The probability density function for an exponential random variable is \[ f_X(x) = \frac{1}{\theta} \exp\left(-\frac{x}{\theta}\right) \,, \] for \(x \geq 0\) and \(\theta > 0\), and the expected value of \(X\) is \(E[X] = \theta\). What is the pdf for the smallest value among \(n\) iid data sampled from an exponential distribution? What is the expected value for the smallest value?
First, if we do not immediately recall the cumulative distribution function \(F_X(x)\), we can easily derive it: \[ F_X(x) = \int_0^x \frac{1}{\theta} e^{-y/\theta} dy = 1 - e^{-x/\theta} \,. \] We plug \(F_X(x)\) into the expression of the pdf of the minimum datum given above: \[\begin{align*} f_{(1)}(x) &= n \frac{1}{\theta} e^{-x/\theta} \left[ 1 - (1-e^{-x/\theta}) \right]^{n-1} \\ &= n \frac{1}{\theta} e^{-x/\theta} e^{-(n-1)x/\theta} \\ &= \frac{n}{\theta} e^{-nx/\theta} \,. \end{align*}\] \(X_{(1)}\) is thus an exponentially distributed random variable with parameter \(\theta/n\) and expected value \(\theta/n\). We can derive this result as follows. \[ E[X_{(1)}] = \int_0^\infty x \frac{n}{\theta} e^{-nx/\theta} dx \,. \] We recognize this as almost having the form of a gamma-function integral: \[ \Gamma(u) = \int_0^\infty x^{u-1} e^{-x} dx \,. \] We effect a variable transformation \(y = nx/\theta\); for this transformation, \(dy = (n/\theta)dx\), and if \(x = 0\) or \(\infty\), \(y = 0\) or \(\infty\) (meaning the integral bounds are unchanged). Our new integral is \[ E[X_{(1)}] = \int_0^\infty \frac{\theta y}{n} \frac{n}{\theta} e^{-y} \frac{\theta}{n} dy = \frac{\theta}{n} \int_0^\infty y e^{-y} dy = \frac{\theta}{n} \Gamma(2) = \frac{\theta}{n} 1! = \frac{\theta}{n} = \frac{E[X]}{n} \,. \]
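(A simulation sketch of this result, with arbitrarily chosen \(\theta\) and \(n\):)

set.seed(101)
theta <- 2
n <- 5
x.min <- apply(matrix(rexp(n*100000,rate=1/theta),nrow=n),2,min) # 100,000 sample minima
mean(x.min)                                                      # should be close to theta/n = 0.4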
3.5.2 Distribution of the Median Value Sampled from a Uniform(0,1) Distribution
The probability density function for a Uniform(0,1) distribution is \[ f_X(x) = 1 \] for \(x \in [0,1]\). The cdf for this distribution is thus \[ F_X(x) = \int_0^x 1 dy = x \,. \] Let’s assume that we sample \(n\) iid data from this distribution, where \(n\) is an odd number. The index of the median value is thus \((n+1)/2\), and if we plug into the general expression for the pdf of the \(j^{th}\) ordered datum, we find that \[\begin{align*} f_{(n+1)/2}(x) &= \frac{n!}{\left(\frac{n+1}{2}-1\right)!\left(n - \frac{n+1}{2}\right)!} \cdot 1 \cdot x^{\left(\frac{n+1}{2}\right)-1} \cdot (1-x)^{n - \left(\frac{n+1}{2}\right)} \\ &= \frac{n!}{\left[\left(\frac{n-1}{2}\right)!\right]^2} x^{\left(\frac{n-1}{2}\right)} (1-x)^{\left(\frac{n-1}{2}\right)} \,. \end{align*}\] As we will see later, this is a beta distribution with parameters \(\alpha = \beta = (n+1)/2\). The median value has expected value 1/2 and a variance that shrinks with \(n\).
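(We can check this claim via simulation; the following is a sketch with an arbitrarily chosen odd value of \(n\).)

set.seed(101)
n <- 11
med <- replicate(100000,median(runif(n))) # 100,000 sample medians
mean(med)                                 # should be close to 1/2
var(med)                                  # should be close to 1/(4*(n+2)) ~ 0.019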
3.6 Point Estimation
In the first two chapters, we introduced a number of concepts related to point estimation, the act of using statistics to make inferences about a population parameter \(\theta\). We…
- assess point estimators using the metrics of bias, variance, mean-squared error, and consistency;
- utilize the Fisher information metric to determine the lower bound on the variance for unbiased estimators (the Cramer-Rao Lower Bound, or CRLB); and
- define estimators via the maximum likelihood algorithm, which generates estimators that are at least asymptotically unbiased and at least asymptotically reach the CRLB, and which converge in distribution to normal random variables.
We will review these concepts in the context of estimating population quantities for binomial distributions below, in the body of the text and in examples. For now…
Recall: the bias of an estimator is the difference between the average value of the estimates it generates and the true parameter value. If \(E[\hat{\theta}-\theta] = 0\), then the estimator \(\hat{\theta}\) is said to be unbiased.
Recall: the value of \(\theta\) that maximizes the likelihood function is the maximum likelihood estimate, or MLE, for \(\theta\). The maximum is found by taking the (partial) derivative of the (log-)likelihood function with respect to \(\theta\), setting the result to zero, and solving for \(\theta\). That solution is the maximum likelihood estimate \(\hat{\theta}_{MLE}\). Also recall the invariance property of the MLE: if \(\hat{\theta}_{MLE}\) is the MLE for \(\theta\), then \(g(\hat{\theta}_{MLE})\) is the MLE for \(g(\theta)\).
Here we will introduce another means by which to define an estimator. The minimum variance unbiased estimator (or MVUE) is the one that has the smallest variance among all unbiased estimators of \(\theta\). The reader’s first thought might be “well, why didn’t we use this estimator in the first place…after all, the MLE is not guaranteed to yield an unbiased estimator, so why have we put off discussing the MVUE?” The primary reasons are that MVUEs are sometimes not definable (i.e., we can reach insurmountable roadblocks when trying to derive them), and unlike MLEs, they do not exhibit the invariance property. (For instance, if \(\hat{\theta}_{MLE} = \bar{X}\), then \(\hat{\theta^2}_{MLE} = \bar{X}^2\), but if \(\hat{\theta}_{MVUE} = \bar{X}\), it is not necessarily the case that \(\hat{\theta^2}_{MVUE} = \bar{X}^2\).) However, we should always at least try to define the MVUE, because if we can, it will at least equal the performance of, if not improve upon, the MLE, in terms of bias and/or variance.
There are two steps to carry out when deriving the MVUE:
- determining a sufficient statistic for \(\theta\); and
- correcting any bias that is observed when we utilize that sufficient statistic as our initial estimator.
A sufficient statistic for a parameter \(\theta\) captures all information about \(\theta\) contained in the sample. In other words, any additional statistic, beyond the sufficient statistic, will not provide any additional information about \(\theta\). This does not mean that a sufficient statistic is necessarily unique, as any function of a sufficient statistic is also a sufficient statistic. For instance, if \(\sum_{i=1}^n X_i\) is a sufficient statistic, so is \(\bar{X}\). It just means that any additional statistic that is not a function of the sufficient statistic will not help us when we try to estimate \(\theta\). For instance, if \(\bar{X}\) is a sufficient statistic for \(\theta\), then using \(\bar{X}\) and the sample median to infer \(\theta\) will lead to no better result than when we just use \(\bar{X}\) by itself.
The simplest way by which to identify a sufficient statistic is by writing down the likelihood function and factorizing it into two functions, one of which, \(h(\mathbf{x})\), depends only on the observed data, and the other of which, \(g(\mathbf{x},\theta)\), depends on the parameter of interest and on the data only through a candidate statistic: \[ \mathcal{L}(\theta \vert \mathbf{x}) = g(\mathbf{x},\theta) \cdot h(\mathbf{x}) \,. \] This is the so-called factorization criterion. In the expression \(g(\mathbf{x},\theta)\), the data will appear within, e.g., a summation (e.g., \(\sum_{i=1}^n x_i\)) or a product (e.g., \(\prod_{i=1}^n x_i\)), and we would identify that summation or product as a sufficient statistic.
Let’s assume we are given a sample of \(n\) iid binomial random variables. The following is the factorized likelihood: \[ \mathcal{L}(p \vert \mathbf{x}) = \prod_{i=1}^n \binom{k}{x_i} p^{x_i} (1-p)^{k-x_i} = \underbrace{\left[ \prod_{i=1}^n \binom{k}{x_i} \right]}_{h(\mathbf{x})} \underbrace{p^{\sum_{i=1}^n x_i} (1-p)^{nk-\sum_{i=1}^n x_i}}_{g(\sum_{i=1}^n x_i,p)} \,. \] By inspecting the function \(g(\mathbf{x},\theta)\), we determine that a sufficient statistic is \(U = \sum_{i=1}^n X_i\).
(The next step, in general, is to demonstrate whether the identified statistic is both minimally sufficient and complete; it needs to be both so that we can use it to determine the MVUE. It suffices to say here that the factorization criterion typically identifies statistics that are minimally sufficient and complete, particularly for those distributions that we examine in this book. See Chapter 7 for more details on how to determine if a sufficient statistic is a minimally sufficient statistic and if it is complete. Note that complete statistics are always minimally sufficient, but not necessarily vice-versa, so we would always check for completeness first!)
If \(U\) is a minimally sufficient and complete statistic, and there is a function \(h(U)\) that is an unbiased estimator for \(\theta\) and that depends on the data only through \(U\), then \(h(U)\) is the MVUE for \(\theta\). Here, given \(U = \sum_{i=1}^n X_i\), we need to find a function \(h(\cdot)\) such that \(E[h(U)] = p\). Earlier in this chapter, we determined that the distribution for the sum of iid binomial random variables is Binomial(\(nk\),\(p\)), and thus we know that this distribution has expected value \(nkp\). Thus \[ E\left[U\right] = nkp ~\implies~ E\left[\frac{U}{nk}\right] = p ~\implies~ h(U) = \frac{U}{nk} = \frac{\bar{X}}{k} ~\mbox{is the MVUE for}~p \,. \]
The variance of \(\hat{p}\) is \[ V[\hat{p}] = V\left[\frac{\bar{X}}{k}\right] = \frac{1}{k^2}V[\bar{X}] = \frac{1}{k^2} \frac{V[X]}{n} = \frac{1}{k^2}\frac{kp(1-p)}{n} = \frac{p(1-p)}{nk} \,. \] We know that this variance abides by the restriction \[ V[\hat{p}] \geq \frac{1}{nI(p)} = -\frac{1}{nE\left[\frac{d^2}{dp^2} \log p_X(X \vert p) \right]} \,. \] But is it equivalent to the lower bound for unbiased estimators, the CRLB? (Note that in particular situations, the MVUE may have a variance larger than the CRLB; when this is the case, unbiased estimators that achieve the CRLB simply do not exist.) For the binomial distribution, \[\begin{align*} p_{X}(x) &= \binom{k}{x} p^{x} (1-p)^{k-x} \\ \log p_{X}(x) &= \log \binom{k}{x} + x \log p + (k-x) \log (1-p) \\ \frac{d}{dp} \log p_{X}(x) &= 0 + \frac{x}{p} - \frac{k-x}{(1-p)} \\ \frac{d^2}{dp^2} \log p_{X}(x) &= -\frac{x}{p^2} - \frac{k-x}{(1-p)^2} \\ E\left[\frac{d^2}{dp^2} \log p_{X}(X)\right] &= -\frac{1}{p^2}E[X] - \frac{1}{(1-p)^2}E[k-X] \\ &= -\frac{kp}{p^2}-\frac{k-kp}{(1-p)^2} \\ &= -\frac{k}{p}-\frac{k}{1-p} = -\frac{k}{p(1-p)} \,. \end{align*}\] The lower bound on the variance is thus \(p(1-p)/(nk)\), and so the MVUE does achieve the CRLB. We cannot define a better unbiased estimator for \(p\) than \(\bar{X}/k\).
3.6.1 The MLE for the Binomial Success Probability
Recall: the value of \(\theta\) that maximizes the likelihood function is the maximum likelihood estimate, or MLE, for \(\theta\). The maximum is found by taking the (partial) derivative of the (log-)likelihood function with respect to \(\theta\), setting the result to zero, and solving for \(\theta\). That solution is the maximum likelihood estimate \(\hat{\theta}_{MLE}\).
Above, we determined that the likelihood function for \(n\) iid binomial random variables \(\{X_1,\ldots,X_n\}\) is \[ \mathcal{L}(p \vert \mathbf{x}) = \left[\prod_{i=1}^n \binom{k}{x_i} \right] p^{\sum_{i=1}^n x_i} (1-p)^{nk-\sum_{i=1}^n x_i} \,. \] Recall that the value \(\hat{p}_{MLE}\) that maximizes \(\mathcal{L}(p \vert x)\) also maximizes \(\ell(p \vert x) = \log \mathcal{L}(p \vert x)\), which is considerably easier to work with: \[\begin{align*} \ell(p \vert \mathbf{x}) &= \left(\sum_{i=1}^n x_i\right) \log p + \left(nk - \sum_{i=1}^n x_i\right) \log (1-p) \\ \frac{d}{dp} \ell(p \vert \mathbf{x}) &= \frac{1}{p} \sum_{i=1}^n x_i - \frac{1}{1-p} \left(nk - \sum_{i=1}^n x_i\right) = 0 \,. \end{align*}\] (Here, we drop the binomial coefficient, which does not depend on \(p\) and thus differentiates to zero.) After rearranging terms, we find that \[ p = \frac{1}{nk}\sum_{i=1}^n x_i ~\implies~ \hat{p}_{MLE} = \frac{\bar{X}}{k} \,. \] The MLE matches the MVUE, thus we know that the MLE is unbiased and we know that it achieves the CRLB.
A useful property of MLEs is the invariance property, whereby the MLE for a function of \(\theta\) is given by applying the same function to the MLE itself. Thus
- the MLE for the population mean \(E[X] = \mu = kp\) is \(\hat{\mu}_{MLE} = \bar{X}\); and
- the MLE for the population variance \(V[X] = \sigma^2 = kp(1-p)\) is \(\widehat{\sigma^2}_{MLE} = \bar{X}(1-\bar{X}/k)\).
Last, note that asymptotically, \(\hat{p}_{MLE}\) converges in distribution to a normal random variable: \[ \hat{p}_{MLE} \stackrel{d}{\rightarrow} Y \sim \mathcal{N}\left(p,\frac{p(1-p)}{nk}\right) \,, \] where the variance \(p(1-p)/(nk)\) is \(1/(nI(p))\).
3.6.2 Sufficient Statistics for the Normal Distribution
If we have \(n\) iid data drawn from a normal distribution with unknown mean \(\mu\) and unknown variance \(\sigma^2\), then the factorized likelihood is \[ \mathcal{L}(\mu,\sigma^2 \vert \mathbf{x}) = \underbrace{(2 \pi \sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right)\exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i\right)\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)}_{g(\sum x_i^2, \sum x_i,\mu,\sigma)} \cdot \underbrace{1}_{h(\mathbf{x})} \,. \] Here, we identify \(\sum x_i^2\) and \(\sum x_i\) as joint sufficient statistics: we need two pieces of information to jointly estimate \(\mu\) and \(\sigma^2\). (To be clear: it is not necessarily the case that one of the parameters matches up to one of the sufficient statistics…rather, the two statistics are jointly sufficient for estimation.) We thus cannot proceed further to define an MVUE for \(\mu\) or for \(\sigma^2\), without knowing the joint bivariate probability density function for \(U_1 = \sum_{i=1}^n X_i^2\) and \(U_2 = \sum_{i=1}^n X_i\).
(Note that we can proceed if we happen to know either \(\mu\) or \(\sigma^2\); if one of these values is fixed, then there will only be one sufficient statistic and we can determine the MVUE for the other, freely varying parameter.)
3.6.3 The MVUE for the Exponential Mean
The exponential distribution is \[ f_X(x) = \frac{1}{\theta} \exp\left(-\frac{x}{\theta}\right) \,, \] where \(x \geq 0\) and \(\theta > 0\), and where \(E[X] = \theta\) and \(V[X] = \theta^2\). Let’s assume that we have \(n\) iid data drawn from this distribution. Can we define the MVUE for \(\theta\)? For \(\theta^2\)?
The likelihood function is \[ \mathcal{L}(\theta \vert \mathbf{x}) = \prod_{i=1}^n f_X(x_i \vert \theta) = \frac{1}{\theta^n}\exp\left(-\frac{1}{\theta}\sum_{i=1}^n x_i \right) = h(\mathbf{x}) \cdot g(\theta,\mathbf{x}) \,. \] Here, there are no terms that are functions of only the data, so \(h(\mathbf{x}) = 1\) and a sufficient statistic is \(U = \sum_{i=1}^n X_i\). We compute the expected value of \(U\): \[ E[U] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \theta = n\theta \,. \] The expected value of \(U\) is not \(\theta\), so \(U\) is not unbiased…but we can see immediately that \(E[U/n] = \theta\), so that \(U/n\) is unbiased. The MVUE for \(\theta\) is thus \(\hat{\theta}_{MVUE} = U/n = \bar{X}\).
Note that the MVUE does not possess the invariance property…it is not necessarily the case that \(\hat{\theta^2}_{MVUE} = \bar{X}^2\).
Let’s propose a function of \(U\) and see if we can use that to define \(\hat{\theta^2}_{MVUE}\): \(h(U) = U^2/n^2 = \bar{X}^2\). (To be clear, we are simply proposing a function and seeing if it helps us define what we are looking for. It might not. If not, we can try again with another function of \(U\).) Utilizing what we know about the sample mean, we can write down that \[ E[\bar{X}^2] = V[\bar{X}] + (E[\bar{X}])^2 = \frac{V[X]}{n} + (E[X])^2 = \frac{\theta^2}{n}+\theta^2 = \theta^2\left(\frac{1}{n} + 1\right) \,. \] So \(\bar{X}^2\) itself is not an unbiased estimator of \(\theta^2\)…but we can see that \(\bar{X}^2/(1/n+1)\) is. Hence \[ \hat{\theta^2}_{MVUE} = \frac{\bar{X}^2}{\left(\frac{1}{n}+1\right)} \,. \]
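(A simulation sketch confirming that this estimator is unbiased, with arbitrarily chosen \(\theta\) and \(n\):)

set.seed(101)
theta <- 2
n <- 5
x.bar <- colMeans(matrix(rexp(n*100000,rate=1/theta),nrow=n)) # 100,000 sample means
mean(x.bar^2)                                                 # biased: close to theta^2*(1/n+1) = 4.8
mean(x.bar^2/(1/n+1))                                         # unbiased: close to theta^2 = 4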
3.6.4 The MVUE for the Geometric Distribution
Recall that the geometric distribution is a negative binomial distribution with \(s = 1\): \[\begin{align*} p_X(x) = \binom{x + s - 1}{x} p^s (1-p)^x ~ \rightarrow ~ p(1-p)^x \,, \end{align*}\] where \(p \in (0,1]\) and where \(E[X] = (1-p)/p\) and \(V[X] = (1-p)/p^2\). Let’s assume that we sample \(n\) iid data from this distribution. Can we define the MVUE for \(p\)? For \(1/p\)?
The likelihood function is \[\begin{align*} \mathcal{L}(p \vert \mathbf{x}) = \prod_{i=1}^n p_X(x_i \vert p) = p^n (1-p)^{\sum_{i=1}^n x_i} = 1 \cdot g(p,\mathbf{x}) \,. \end{align*}\] We identify a sufficient statistic as \(U = \sum_{i=1}^n X_i\); the expected value of \(U\) is \[\begin{align*} E[U] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1-p}{p} = n\left(\frac{1}{p}-1\right) \,. \end{align*}\] From this we immediately see that \[\begin{align*} E\left[\frac{U}{n}+1\right] = \frac{1}{p} \,, \end{align*}\] so \(U/n+1 = \bar{X}+1\) is unbiased for \(1/p\): we have determined the MVUE for \(1/p\). However, we cannot simply invert this statistic to obtain the MVUE for \(p\), because the expected value of a reciprocal is not the reciprocal of the expected value: \[\begin{align*} E\left[\frac{1}{U/n+1}\right] \neq \frac{1}{E\left[\frac{U}{n}+1\right]} = p \,. \end{align*}\] As there is no invariance property for the MVUE, we cannot use \(\bar{X}+1\) to write down an MVUE for \(p\).
3.7 Confidence Intervals
Recall: a confidence interval is a random interval \([\hat{\theta}_L,\hat{\theta}_U]\) that overlaps (or covers) the true value \(\theta\) with probability \[ P\left( \hat{\theta}_L \leq \theta \leq \hat{\theta}_U \right) = 1 - \alpha \,, \] where \(1 - \alpha\) is the confidence coefficient. We determine \(\hat{\theta}\) by solving the following equation: \[ F_Y(y_{\rm obs} \vert \theta) - q = 0 \,, \] where \(F_Y(\cdot)\) is the cumulative distribution function for the statistic \(Y\), \(y_{\rm obs}\) is the observed value of the statistic, and \(q\) is an appropriate quantile value that is determined using the confidence interval reference table introduced in section 16 of Chapter 1.
The only new element to consider here regarding the construction of confidence intervals is that the sampling distribution for our adopted statistic (\(Y = \sum_{i=1}^n X_i\), where the \(X_i\)’s are iid samples from a binomial distribution) is now a discrete distribution as opposed to a continuous one. As it turns out, this does not impact our use of uniroot(): even though \(F_Y(y_{\rm obs})\) is discrete, the success probability \(p\) is still a continuously valued quantity, and we can tune its value so that \(F_Y(y_{\rm obs} \vert p)\) will match, e.g., \(\alpha/2\) or \(1-\alpha/2\) with arbitrary precision.
What the discreteness of the sampling distribution does impact is the coverage, or the fraction of intervals that overlap the true value.
In Figure 3.7, we display a continuous sampling distribution for a statistic \(Y\), given a specific (unknown to us!) true value \(\theta\). (Here, without loss of generality, we use a normal distribution, with \(\theta = \mu = 0\). We also assume that \(E[Y]\) increases with \(\theta\).) When running an experiment, if we sample a value of \(y_{\rm obs}\) in the red polygon, then turn around to derive a one-sided upper bound, that upper bound will not overlap the true value of \(\theta\); otherwise, if we sample a value not in the red polygon, the upper bound will overlap the true value. Since the probability of sampling a value in the red polygon is exactly \(\alpha\), then the probability that the interval will not overlap the true value is also exactly \(\alpha\)…meaning that the interval coverage is exactly \(100(1-\alpha)\) percent.
In Figure 3.8, we display a discrete sampling distribution for a statistic \(Y\), given a specific (and still unknown to us!) true value \(\theta\). (This distribution is specifically a binomial distribution with \(k = 12\) and \(p = 0.5\).) The three points that are colored red are the ones with cdf values less than \(\alpha\). (Given an arbitrary value of \(\theta\), we will never observe a cdf value for any value of \(y\) that is exactly \(\alpha\) due to the discreteness of the sampling distribution, unless we have an infinite number of data. For instance, in this figure the cdf for \(y = 2\) is 0.019, whereas the cdf for \(y = 3\) is 0.073. Neither value matches \(\alpha = 0.05\).) Above, we indicated that for a continuous distribution, sampling a datum \(y_{\rm obs}\) with a cdf value less than \(\alpha\) leads to the computation of a one-sided upper bound that does not cover the true value. The same situation holds here: sampling either \(y_{\rm obs} = 0\), 1, or 2, data for which the cdf values are less than \(\alpha\), will lead to the computation of upper bounds that do not overlap \(\theta\). But what is different here is that the probability of sampling one of these values is no longer \(\alpha\), but some quantity less than \(\alpha\)…let’s call this quantity \(\alpha'\). The probability that our one-sided upper bounds do not cover \(\theta\) is thus \(\alpha'\)…which means that our coverage is actually going to be \(100(1-\alpha')\) percent, which is greater than \(100(1-\alpha)\) percent. For one-sided lower bounds, we can apply a similar set of arguments: we would identify all values of \(y\) with cdf values less than \(1 - \alpha\), with \(\alpha' > \alpha\) being the sum of the masses for the remaining values of \(y\). Since \(\alpha' > \alpha\), the coverage, which is \(100(1-\alpha')\) percent, would be less than \(100(1-\alpha)\) percent. (For two-sided intervals, we would compute \(\alpha_{\rm lo}'/2\), the sum of all masses associated with values of \(y\) with cdf values less than \(\alpha/2\), and \(\alpha_{\rm hi}'/2\), the sum of all masses associated with values of \(y\) that do not have cdf values less than \(1-\alpha/2\). The coverage would then be \(100(1-\alpha_{\rm lo}'/2-\alpha_{\rm hi}'/2)\) percent, which can be less than or greater than \(100(1-\alpha)\) percent.)
Discrete distributions like the binomial have historically been difficult to work with analytically. Because of this, a number of algorithms have been developed through the years for constructing confidence intervals for binomial probabilities. (See, e.g., the Wikipedia page on binomial proportion confidence intervals.) It is our opinion that there is no particular reason to utilize any of these algorithms when one can compute exact intervals. However, for completeness, we illustrate how one would construct the most commonly seen approximating interval, the Wald interval, in an example below.
3.7.1 Confidence Interval for the Binomial Success Probability
Assume that we sample \(n\) iid data from a binomial distribution with number of trials \(k\) and probability \(p\). Then, as shown above, \(Y = \sum_{i=1}^n X_i \sim\) Binom(\(nk,p\)); our observed test statistic is \(y_{\rm obs} = \sum_{i=1}^n x_i\). For this statistic, \(E[Y] = nkp\) increases with \(p\), so \(q = 1-\alpha/2\) maps to the lower bound, while \(q = \alpha/2\) maps to the upper bound.
set.seed(101)
alpha <- 0.05
n <- 12
k <- 5
p <- 0.4
X <- rbinom(n,size=k,prob=p)
f <- function(p,y.obs,n,k,q)
{
pbinom(y.obs,size=n*k,prob=p)-q
}
uniroot(f,interval=c(0,1),y.obs=sum(X),n,k,1-alpha/2)$root
## [1] 0.2607037
uniroot(f,interval=c(0,1),y.obs=sum(X),n,k,alpha/2)$root
## [1] 0.5010387
We find that the interval is \([\hat{p}_L,\hat{p}_U] = [0.261,0.501]\), which overlaps the true value of 0.4. (See Figure 3.9.) Note that, unlike in Chapter 2, the interval over which we search for the root is [0,1], which is the range of possible values for \(p\).
We can compute the coverage of this interval following the prescription given above:
w.lo <- sum(pbinom(0:(n*k),size=n*k,prob=p) < alpha/2)           # number of y values with cdf < alpha/2
alpha_over_2.lo.prime <- pbinom(0:(n*k),size=n*k,prob=p)[w.lo]   # total mass of those y values
w.hi <- sum(pbinom(0:(n*k),size=n*k,prob=p) < 1-alpha/2)         # number of y values with cdf < 1-alpha/2
alpha_over_2.hi.prime <- 1-pbinom(0:(n*k),size=n*k,prob=p)[w.hi] # mass of the remaining y values
round(100*(1-alpha_over_2.lo.prime-alpha_over_2.hi.prime),2)
## [1] 95.29
Due to the discreteness of the binomial distribution, the true coverage of our two-sided intervals is 95.29%, not 95%.
3.7.2 Confidence Interval for the Negative Binomial Success Probability
Let’s assume that we have performed \(n = 12\) separate negative binomial experiments, each with a target number of successes \(s\), and recorded the number of failures \(X_1,\ldots,X_{n}\) for each. Further, assume that the probability of success is \(p\). Below we will show how to compute the confidence interval for \(p\), but before we start, we recall that \(Y = \sum_{i=1}^n X_i\) is a negative binomially distributed random variable for \(ns\) successes and probability of success \(p\). Here, \(E[Y] = ns(1-p)/p\)…as \(p\) increases, \(E[Y]\) decreases. Thus when we adapt the confidence interval code we use for binomially distributed data, we need to switch the mapping of \(q = 1-\alpha/2\) and \(q = \alpha/2\) to the upper and lower bounds, respectively.
set.seed(101)
alpha <- 0.05
n <- 12
s <- 5
p <- 0.4
X <- rnbinom(n,size=s,prob=p)
f <- function(p,y.obs,n,s,q)
{
pnbinom(y.obs,size=n*s,prob=p)-q
}
uniroot(f,interval=c(0.0001,1),y.obs=sum(X),n,s,alpha/2)$root
## [1] 0.3255305
uniroot(f,interval=c(0.0001,1),y.obs=sum(X),n,s,1-alpha/2)$root
## [1] 0.4822869
The confidence interval is \([\hat{p}_L,\hat{p}_U] = [0.326,0.482]\), which overlaps the true value 0.4. We note that in the code, we change the lower bound on the search interval from 0 (in the binomial case) to 0.0001 (something suitably small but non-zero): a probability of success of 0 maps to an infinite number of failures, which R cannot tolerate!
We can compute the coverage of this interval following the prescription given above:
large <- 200 # a stand-in for infinity: effectively the full domain of y
w.lo <- sum(pnbinom(0:large,size=n*s,prob=p) < alpha/2)
alpha_over_2.lo.prime <- pnbinom(0:large,size=n*s,prob=p)[w.lo]
w.hi <- sum(pnbinom(0:large,size=n*s,prob=p) < 1-alpha/2)
alpha_over_2.hi.prime <- 1-pnbinom(0:large,size=n*s,prob=p)[w.hi]
round(100*(1-alpha_over_2.lo.prime-alpha_over_2.hi.prime),2)
## [1] 94.77
Due to the discreteness of the negative binomial distribution, the true coverage of our two-sided intervals is 94.77%, not 95%.
3.7.3 Wald Interval for the Binomial Success Probability
Let’s assume that we have sampled \(n\) iid binomial variables with number of trials \(k\) and probability of success \(p\). When \(k\) is sufficiently large and \(p\) is sufficiently far from 0 or 1, we can assume that \(\bar{X}\) has a distribution whose shape is approximately that of a normal distribution, with mean \(E[\bar{X}] = kp\) and with variance and standard error \[ V[\bar{X}] = \frac{kp(1-p)}{n} ~~~ \mbox{and} ~~~ se(\bar{X}) = \sqrt{V[\bar{X}]} = \sqrt{\frac{kp(1-p)}{n}} \,. \] Furthermore, we can assume that \(\hat{p} = \bar{X}/k\) is approximately normally distributed, with mean \(p\), variance \(V[\hat{p}] = V[\bar{X}]/k^2 = p(1-p)/nk\), and standard error \(\sqrt{p(1-p)/nk}\). Given this information, it is simple to express, e.g., an approximate two-sided \(100(1-\alpha)\)% confidence interval for \(p\): \[ \hat{p} \pm z_{1-\alpha/2} se(\hat{p}) ~~ \Rightarrow ~~ \left[ \hat{p} - z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{nk}} \, , \, \hat{p} + z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{nk}} \right] \,, \] where \(z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)\). However, we don’t actually know the true value of \(p\)…so we plug in \(p = \hat{p}\).
This is the so-called Wald interval that is typically provided to students in introductory statistics courses, although it is typically provided assuming \(n = 1\) and assuming that \(\alpha = 0.05\): \[ \left[ \hat{p} - 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \, , \, \hat{p} + 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{k}} \right] \,, \] where in this case \(\hat{p} = X/k\).
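As a sketch, we can compute the Wald interval for the same simulated data used in the exact-interval example above; the two intervals will typically be similar, but not identical.

set.seed(101)
alpha <- 0.05
n <- 12
k <- 5
p <- 0.4
X <- rbinom(n,size=k,prob=p)
p.hat <- mean(X)/k                # the MLE/MVUE for p
se <- sqrt(p.hat*(1-p.hat)/(n*k)) # plug-in estimate of the standard error
p.hat+c(-1,1)*qnorm(1-alpha/2)*se # approximate 100(1-alpha)% interval for p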
3.8 Hypothesis Testing
Recall: a hypothesis test is a framework to make an inference about the value of a population parameter \(\theta\). The null hypothesis \(H_o\) is that \(\theta = \theta_o\), while possible alternatives \(H_a\) are \(\theta \neq \theta_o\) (two-tail test), \(\theta > \theta_o\) (upper-tail test), and \(\theta < \theta_o\) (lower-tail test). For, e.g., a one-tail test, we reject the null hypothesis if the observed test statistic \(y_{\rm obs}\) falls outside the bound given by \(y_{RR}\), which is a solution to the equation \[ F_Y(y_{RR} \vert \theta_o) - q = 0 \,, \] where \(F_Y(\cdot)\) is the cumulative distribution function for the statistic \(Y\) and \(q\) is an appropriate quantile value that is determined using the hypothesis test reference table introduced in section 17 of Chapter 1. Note that the hypothesis test framework only allows us to make a decision about a null hypothesis; nothing is proven.
In the previous chapter, we utilized \(\bar{X}\) when testing hypotheses about the normal mean \(\mu\). This is a principled choice for a test statistic\(-\)after all, \(\bar{X}\) is the MLE for \(\mu-\)but we do not yet know whether or not we can choose a better one. Can we differentiate hypotheses more easily if we use a test statistic other than \(\bar{X}\)?
To help answer this question, we now introduce a method for defining the most powerful test of a simple null hypothesis versus a simple alternative hypothesis: \[ H_o : \theta = \theta_o ~~\mbox{and}~~ H_a : \theta = \theta_a \,. \] Note that the word “simple” has a precise meaning here: it means that when we set \(\theta\) to a particular value, we are completely fixing the shape and location of the pmf or pdf from which data are sampled. If, for instance, we are dealing with a normal distribution with unknown variance \(\sigma^2\), the hypothesis \(\mu = \mu_o\) would not be simple, since the width of the pdf can still vary: the shape is not completely fixed. (The hypothesis \(\mu = \mu_o\) with variance unknown is dubbed a composite hypothesis. We will examine how to work with composite hypotheses in the next chapter.) For a given test level \(\alpha\), the Neyman-Pearson lemma states that the test that maximizes the power has a rejection region of the form \[ \frac{\mathcal{L}(\theta_o \vert \mathbf{x})}{\mathcal{L}(\theta_a \vert \mathbf{x})} < c(\alpha) \,, \] where \(c\) is a constant whose value depends on \(\alpha\) that we have to determine. While this formulation initially appears straightforward, it is in fact not necessarily clear how to derive \(c(\alpha)\). However: in typical analysis situations, we don’t need to! The NP lemma utilizes the likelihood function, so implicitly, what it is telling us is that the best statistic for differentiating between two simple hypotheses is a sufficient statistic. We simply determine a sufficient statistic for which we know the sampling distribution, and use it to define a hypothesis test via the procedure we laid out in previous chapters. Full stop.
For instance, when we draw \(n\) iid data from a binomial distribution, the sufficient statistic \(\sum_{i=1}^n X_i\) has a known and easily utilized sampling distribution\(-\)namely, Binom(\(nk,p\))\(-\)while \(\bar{X} = (\sum_{i=1}^n X_i)/n\) does not. So we would utilize \(U = \sum_{i=1}^n X_i\). (Note that it ultimately doesn’t matter which function of a sufficient statistic we use: if we can carry out the math, we will end up with tests that have the same power and result in the same \(p\)-values. It would be very problematic if this wasn’t the case: we’d have to “fish around” to determine which function of a sufficient statistic would give us the best test, which clearly would not be a good situation to find ourselves in.)
Let’s assume that we use \(U = \sum_{i=1}^n X_i\) to define, e.g., a lower-tail test \(H_o : p = p_o\) versus \(H_a : p = p_a < p_o\). Since \(E[U] = nkp\) increases with \(p\), we can go to the hypothesis test reference tables and write (in code) that
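A sketch of what that code might look like, assuming n, k, p.o, and alpha have already been assigned:
y.rr <- qbinom(alpha,size=n*k,prob=p.o)   # lower-tail/"yes" line: reject the null if y.obs < y.rr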
We see that our rejection region boundary depends on the value of \(p_o\), but not on the value of \(p_a\). This means that the test we define above is the most-powerful test regardless of the value \(p_a < p_o\). We have thus constructed a uniformly most powerful (or UMP) test for disambiguating the simple hypotheses \(H_o : p = p_o\) and \(H_a : p = p_a < p_o\). It is typically the case that when we use the NP lemma to define a most powerful test for \(\theta_o\) versus \(\theta_a\), we end up defining a UMP test as well.
We note that we cannot use the NP lemma to construct most powerful two-tail hypothesis tests. When we construct a two-tail test, it is convention to define one rejection region boundary assuming \(q = \alpha/2\) and the other assuming \(q = 1 - \alpha/2\). But that is just convention; we could put \(\alpha/10\) on “one side” and \(1-9\alpha/10\) on the other, etc., and in general we cannot guarantee that any one way of splitting \(\alpha\) will yield a more powerful test in a given situation than any other possible split.
3.8.1 UMP Test: Exponential Distribution
Let’s suppose we sample \(n = 3\) iid data from the exponential distribution \[ f_X(x) = \frac{1}{\theta} e^{-x/\theta} \,, \] for \(x \geq 0\) and \(\theta > 0\), and we wish to define the most powerful test of the simple hypotheses \(H_o : \theta_o = 2\) and \(H_a : \theta_a = 1\). We observe the values \(\mathbf{x}_{\rm obs} = \{0.215,1.131,2.064\}\), and we assume \(\alpha = 0.05\).
We begin by determining a sufficient statistic: \[ \mathcal{L}(\theta \vert \mathbf{x}) = \prod_{i=1}^3 \frac{1}{\theta} e^{-x_i/\theta} = \frac{1}{\theta^3} e^{-(\sum_{i=1}^3 x_i)/\theta} \,. \] We identify \(Y = \sum_{i=1}^3 X_i\) as a sufficient statistic. The next question is whether we can determine the sampling distribution for \(Y\). The moment-generating function for each of the \(X_i\)’s is \[ m_{X_i}(t) = (1 - \theta t)^{-1} \,, \] and so the mgf for \(Y\) is \[ m_Y(t) = \prod_{i=1}^3 m_{X_i}(t) = (1 - \theta t)^{-3} \,. \] We (might!) recognize this as the mgf for a Gamma(3,\(\theta\)) distribution. (The gamma distribution will be officially introduced in Chapter 4.) So: we know the sampling distribution for \(Y\), and we can use it to define the most powerful test. To reiterate: the only thing that the NP lemma is doing for us here is guiding our selection of a test statistic. Beyond that, we construct the test using the framework we already learned in Chapters 1 and 2.
Because \(\theta_a < \theta_o\), we define a lower-tail test. And since \(E[Y] = 3\theta\) increases with \(\theta\), we utilize the formulae from the hypothesis test reference tables that are on the “yes” line. The rejection-region boundary is thus \[ y_{\rm RR} = F_Y^{-1}(\alpha \vert \theta_o) \,, \] which in code is
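(The code below is a reconstruction consistent with the displayed output; x.obs holds the three observed values.)
x.obs <- c(0.215,1.131,2.064)
theta.o <- 2; alpha <- 0.05
( y.obs <- sum(x.obs) )                          # the observed statistic
( y.rr <- qgamma(alpha,shape=3,scale=theta.o) )  # the rejection-region boundary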
## [1] 3.41
## [1] 1.635383
Our observed statistic is \(y_{\rm obs} = 3.410\) and the rejection-region boundary is 1.635: we fail to reject the null and conclude that \(\theta_o = 2\) is a plausible value of \(\theta\).
We note that because \(y_{\rm RR}\) is not a function of \(\theta_a\), we have not only defined the most powerful test, but we have also defined a uniformly most powerful test for all alternative hypotheses \(\theta_a < \theta_o\).
What is the \(p\)-value, and what is the power of the test if \(\theta = 1.5\)?
According to the hypothesis test reference tables, the \(p\)-value is \[ p = F_Y(y_{\rm obs} \vert \theta_o) \,, \] which in code is
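pgamma(y.obs,shape=3,scale=theta.o)   # (reconstructed call, using y.obs and theta.o from above)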
## [1] 0.2440973
The \(p\)-value is 0.244, which is greater than \(\alpha\), as we expect.
The test power is \[ {\rm power}(\theta) = F_Y(y_{\rm RR} \vert \theta) \,, \] which in code is
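pgamma(y.rr,shape=3,scale=1.5)   # (reconstructed call) power at theta = 1.5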
## [1] 0.09762911
The power is 0.098…only 9.8% of the time will we reject the null hypothesis \(\theta_o = 2\) when \(\theta\) is actually 1.5.
3.8.2 UMP Test: Negative Binomial Distribution
Let’s assume that we sample \(n = 5\) data from a negative binomial distribution \[ p_X(x) = \binom{x+s-1}{x} p^s (1-p)^x \,, \] with \(x \in \{0,1,\ldots\}\) being the observed number of failures prior to observing the \(s^{\rm th}\) success, \(p \in (0,1]\), and \(s = 3\) successes. We wish to define the most powerful test of the simple hypotheses \(H_o : p_o = 0.5\) and \(H_a : p_a = 0.25\). We observe the values \(\mathbf{x}_{\rm obs} = \{5,3,10,12,4\}\), and we assume \(\alpha = 0.05\).
As in the last example, we begin by determining a sufficient statistic: \[ \mathcal{L}(p \vert \mathbf{x}) = \prod_{i=1}^5 \binom{x_i+s-1}{x_i} p^s (1-p)^{x_i} \propto \prod_{i=1}^5 p^s (1-p)^{x_i} = p^{ns} (1-p)^{\sum_{i=1}^5 x_i} \,. \] We identify \(Y = \sum_{i=1}^5 X_i\) as a sufficient statistic. Earlier in the chapter, we determined that if the \(X_i\)’s are iid draws from a negative binomial distribution with parameters \(p\) and \(s\), then the sum is also negative binomially distributed, with parameters \(p\) and \(ns\). So: we know the sampling distribution for \(Y\), and we can use it to define the most powerful test.
Because \(p_a < p_o\), we will define a lower-tail test. And since \(E[Y] = ns(1-p)/p\) decreases with \(p\), we will utilize the formulae from the hypothesis test reference tables that are on the “no” line (and we will reject the null hypothesis if \(y_{\rm obs} > y_{\rm RR}\)).
The rejection-region boundary is \[ y_{\rm RR} = F_Y^{-1}(1-\alpha \vert p_o) \,, \] which in code is
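(Again, the code below is a reconstruction consistent with the displayed output.)
x.obs <- c(5,3,10,12,4)
n <- 5; alpha <- 0.05; p.o <- 0.5
( y.obs <- sum(x.obs) )                         # the observed statistic
( y.rr <- qnbinom(1-alpha,size=3*n,prob=p.o) )  # the rejection-region boundary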
## [1] 34
## [1] 25
Our observed statistic is \(y_{\rm obs} = 34\) and the rejection-region boundary is 25: we reject the null hypothesis and state that there is sufficient evidence to conclude that \(p < 0.5\).
We note that because \(y_{\rm RR}\) is not a function of \(p_a\), we have not only defined the most powerful test, but we have also defined a uniformly most powerful test for all alternative hypotheses \(p_a < p_o\).
What is the \(p\)-value, and what is the power of the test if \(p = 0.3\)?
According to the hypothesis test reference tables, the \(p\)-value is \[ p = 1 - F_Y(y_{\rm obs} \vert p_o) \,, \] which in code is
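1 - pnbinom(y.obs,size=3*n,prob=p.o)   # (reconstructed: this is the call examined just below)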
## [1] 0.001900827
Our \(p\)-value is supposedly 0.0019, which is less than \(\alpha\), as we expect. However, there is a problem here! The function call
1 - pnbinom(y.obs,size=3*n,prob=p.o)
is equivalent to the function call
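sum(dnbinom((y.obs+1):100,size=3*n,prob=p.o))   # (reconstructed call)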
## [1] 0.001900827
where we use the number 100 as a stand-in for infinity. However, a \(p\)-value calculation should include the observed statistic value, and the above call does not. (It begins the calculation at y.obs+1, and not at y.obs.) The \(p\)-value is thus actually
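1 - pnbinom(y.obs-1,size=3*n,prob=p.o)   # subtract 1 so that y.obs itself is included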
## [1] 0.002757601
or
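sum(dnbinom(y.obs:100,size=3*n,prob=p.o))   # equivalently, sum the pmf starting at y.obs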
## [1] 0.002757601
The value 0.0028 is still less than \(\alpha\), so our qualitative conclusion is unchanged.
So: if our data are drawn from a discrete distribution, and we are computing a \(p\)-value for an upper-tail/yes test, or a lower-tail/no test (as is the case here), we need to subtract 1 from the observed statistic value when passing it into a cdf function. (The same rule applies to power calculations as well.)
The test power is \[ {\rm power}(p) = 1 - F_Y(y_{\rm RR} \vert p) \,, \] which in code is
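1 - pnbinom(y.rr-1,size=3*n,prob=0.3)   # (reconstructed call) power at p = 0.3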
## [1] 0.8364752
(Note how we subtract 1 from y.rr.) We find that the power is 0.836…83.6% of the time we will reject the null hypothesis \(p = 0.5\) when \(p\) is actually 0.3 (and when \(n = 5\) and \(s = 3\)).
3.8.3 Defining a UMP Test for the Normal Population Mean
In Chapter 2, we used the statistic \(\bar{X}\) as the basis for testing hypotheses about the normal population mean, \(\mu\). We justified this choice because, at the very least, \(\hat{\mu}_{MLE} = \bar{X}\), so it is a principled choice. However, beyond that, we did not (and indeed could not) provide any further justification. Is \(\bar{X}\) the basis of the UMP test for \(\mu\)?
Recall that the NP lemma does not apply if there are two freely varying parameters. The NP lemma applies to simple hypotheses, where the hypotheses uniquely specify the population from which data are drawn. Here, even if we set \(H_o: \mu = \mu_o\) and \(H_a: \mu = \mu_a\), \(\sigma^2\) can still freely vary. (So, technically, these hypotheses are composite hypotheses.) So to go further here, we must assume \(\sigma^2\) is known.
When \(\sigma^2\) is known, we can factorize the likelihood as \[ \mathcal{L}(\mu,\sigma^2 \vert \mathbf{x}) = \underbrace{\exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i\right)\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)}_{g(\sum x_i,\mu)} \cdot \underbrace{(2 \pi \sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right)}_{h(\mathbf{x})} \,, \] and identify \(Y = \sum_{i=1}^n X_i\) as a sufficient statistic. We know from using the method of moment-generating functions that the sum of \(n\) iid normal random variables is itself a normal random variable with mean \(n\mu\) and variance \(n\sigma^2\), so \(Y \sim \mathcal{N}(n\mu,n\sigma^2)\).
Let’s assume that \(\mu_a > \mu_o\), or, in other words, that we are performing an upper-tail test. Given that \(E[Y] = n\mu\), we know that we are on the “yes” line of the hypothesis test reference tables, so the rejection region is \[ Y > y_{\rm RR} = F_Y^{-1}(1-\alpha,n\mu_o,n\sigma^2) \,, \] or, in code,
qnorm(1-alpha,n*mu.o,sqrt(n*sigma2))   # note: qnorm expects the standard deviation, not the variance
The rejection-region boundary does not depend on \(\mu_a\), so we have defined a uniformly most powerful test of \(\mu = \mu_o\) versus \(\mu = \mu_a > \mu_o\).
“But wait. What about \(\bar{X}\)?”
Recall that a function of a sufficient statistic is itself a sufficient statistic, so we could also use \(Y' = \bar{X} = Y/n\) as our statistic, particularly as we do know its sampling distribution: \(\mathcal{N}(\mu,\sigma^2/n)\). Thus \[ Y' > y_{\rm RR}' = F_{Y'}^{-1}(1-\alpha,\mu_o,\sigma^2/n) \,, \] or, in code,
qnorm(1-alpha,mu.o,sqrt(sigma2/n))   # again, qnorm expects the standard deviation
“But…\(y_{\rm RR}\) and \(y_{\rm RR}'\) do not have the same value!”
This is fine: \(Y\) and \(Y'\) don’t have the same value either. The key point is that when we compute, e.g., \(p\)-values given observed data, they will be the same in both cases. And the test power will be the same. We can use any function of a sufficient statistic to define our test; in practice, we will use any function of a sufficient statistic for which we know the sampling distribution. Here we know the distributions for both the sample sum and the sample mean; in other situations (like when we are working with binomially distributed data), we will not.
Now, what if \(\sigma^2\) is unknown? We discuss this possibility in the next chapter, when we introduce the likelihood ratio test.
3.8.4 Defining a Test That is Not Uniformly Most Powerful
Let’s assume we draw one datum \(X\) from a normal distribution with mean \(\mu\) and variance \(\mu^2\), and we wish to test \(H_o : \mu = \mu_o = 1\) versus \(H_a : \mu = \mu_a\). Can we define a uniformly most powerful test of these hypotheses?
To try to do this, we utilize the NP lemma. However, there is an immediate issue that arises: here, there are two sufficient statistics that are identified via factorization (\(\sum_{i=1}^n X_i\) and \(\sum_{i=1}^n X_i^2\)), but just one parameter (\(\mu\)). So we cannot circumvent working directly with the likelihood ratio: \[\begin{align*} \frac{\mathcal{L}(\mu_o \vert x)}{\mathcal{L}(\mu_a \vert x)} = \frac{f_X(x \vert \mu_o)}{f_X(x \vert \mu_a)} = \mu_a \exp\left(-\frac{(x-1)^2}{2}+\frac{(x-\mu_a)^2}{2\mu_a^2}\right) \,. \end{align*}\] To reject the null hypothesis, the ratio must be less than some constant \(c(\alpha)\). First, we can divide both sides of the inequality by \(\mu_a\) (which is a fixed constant), so that now we reject the null if \[\begin{align*} \exp\left(-\frac{(x-1)^2}{2}+\frac{(x-\mu_a)^2}{2\mu_a^2}\right) < c'(\alpha) \,. \end{align*}\] We then take the natural logarithm of each side; we would reject the null if \[\begin{align*} -\frac{(x-1)^2}{2}+\frac{(x-\mu_a)^2}{2\mu_a^2} < c''(\alpha) \,. \end{align*}\] For the test to be uniformly most powerful, we would have to perform further manipulations to put \(x\) alone on the left-hand side of the inequality, without it appearing on the right-hand side. Here, that is impossible to do, meaning that the value of the rejection region boundary changes as \(\mu_a\) changes. Thus while we can define the most powerful test of \(H_o : \mu = \mu_o = 1\) versus \(H_a : \mu = \mu_a\), that test is not a uniformly most powerful one.
3.8.5 Large-Sample Tests of the Binomial Success Probability
Earlier in the chapter, we saw that if \[ k > 9\left(\frac{\mbox{max}(p,1-p)}{\mbox{min}(p,1-p)}\right) \,, \] then, given a random variable \(X \sim\) Binomial(\(k\),\(p\)), we can assume \[ X \stackrel{d}{\rightarrow} Y \sim \mathcal{N}(kp,kp(1-p)) \,, \] or that \[ \hat{p} = \hat{p}_{MLE} = \frac{X}{k} \stackrel{d}{\rightarrow} X' \sim \mathcal{N}(p,p(1-p)/k) \,. \] In hypothesis testing, this approximation forms the basis of two different tests: \[\begin{align*} \mbox{Score}~\mbox{Test:} ~~ & \hat{p} \sim \mathcal{N}(p_o,p_o(1-p_o)/k) \\ \mbox{Wald}~\mbox{Test:} ~~ & \hat{p} \sim \mathcal{N}(p_o,\hat{p}(1-\hat{p})/k) \,. \end{align*}\] The only difference between the two is in how the variance is estimated under the null. To carry out these tests, we can simply adapt the expressions for the rejection regions, \(p\)-values, and power given in Chapter 2 for tests of the normal population mean with variance known, with the primary change being the definition of the variance.
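To make the two recipes concrete, here is a minimal sketch in R, with made-up values (x = 60 successes in k = 100 trials, testing \(H_o : p = 0.5\) versus \(H_a : p > 0.5\)):
x <- 60; k <- 100; p.o <- 0.5
p.hat <- x/k
z.score <- (p.hat-p.o)/sqrt(p.o*(1-p.o)/k)      # Score: variance evaluated at p.o
z.wald  <- (p.hat-p.o)/sqrt(p.hat*(1-p.hat)/k)  # Wald: variance evaluated at p.hat
1 - pnorm(z.score)                              # upper-tail p-value (~0.023)
1 - pnorm(z.wald)                               # upper-tail p-value (~0.021)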
We are including the details of the Wald and Score tests here for completeness, in that they are ubiquitous in introductory statistics settings. As they yield only approximate results, we would encourage the reader to always use the exact testing mechanisms described above, while being mindful of the effects of the discreteness of the probability mass function, especially when \(n\) and \(k\) are both small.
3.9 Logistic Regression
In the last chapter, we introduced simple linear regression, in which we model the data-generating process as \[ Y_i = \beta_0 + \beta_1x_i + \epsilon_i \,. \] To perform hypothesis testing (e.g., \(H_o : \beta_1 = 0\) vs. \(H_a : \beta_1 \neq 0\)), we make the assumption that \(\epsilon_i \sim \mathcal{N}(0,\sigma^2)\), or equivalently, that \(Y_i \vert x_i \sim \mathcal{N}(\beta_0+\beta_1x_i,\sigma^2)\). When we make this assumption, we are implicitly stating that the response variable is continuous and that it can take on any value between \(-\infty\) and \(\infty\). But what if this doesn’t actually correctly represent the response variable? Maybe the \(Y_i\)’s are distances with values \(\geq 0\). Maybe each \(Y_i\) belongs to one of several categories, and thus the \(Y_i\)’s are discretely valued. When provided data such as these, it is possible, though not optimal, to utilize simple linear regression. A better choice is to generalize the concept of linear regression, and to utilize this generalization to implement a more appropriate statistical model.
To implement a generalized linear model (or GLM), we need to do two things:
- examine the \(Y_i\) values and select an appropriate distribution for them (discrete or continuous? what is the functional domain?); and
- define a link function \(g(\theta \vert x)\) that maps the line \(\beta_0 + \beta_1 x_i\), which has infinite range, into a more limited range (e.g., \([0,\infty)\)).
Suppose that, in an experiment, the response variable can take on the values “false” and “true.” To generalize linear regression so as to handle these data, we map “false” to 0 and “true” to 1, and assume that we can model the data-generating process using a Bernoulli distribution (i.e., a binomial distribution with \(k = 1\)): \(Y \sim\) Bernoulli(\(p\)). We know that \(0 < p < 1\), so we adopt a link function that maps \(\beta_0 + \beta_1 x\) to the range \((0,1)\). There is no unique choice, but a conventional one is the logit function: \[ g(p \vert x) = \log\left[\frac{p \vert x}{1-p \vert x}\right] = \beta_0 + \beta_1 x ~\implies~ p \vert x = \frac{\exp\left(\beta_0+\beta_1x\right)}{1+\exp\left(\beta_0+\beta_1x\right)} \,. \] Using the logit function to model dichotomous data is dubbed logistic regression. See Figure 3.11.
How do we learn a logistic regression model? In linear regression, we can determine values for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) by formulae, the ones we derive when we minimize the sum of squared errors (SSE). For logistic regression, such simple formulae do not exist, and so we estimate \(\beta_0\) and \(\beta_1\) via numerical optimization of the likelihood function \[ \mathcal{L}(\beta_0,\beta_1 \vert \mathbf{y}) = \prod_{i=1}^n p_{Y \vert \beta_0,\beta_1}(y_i \vert \beta_0,\beta_1) = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \,, \] where \(p_Y(\cdot)\) is the Bernoulli probability mass function and \(p_i = p \vert x_i\) is the success probability evaluated at \(x_i\). Specifically, we would take the first partial derivatives of \(\mathcal{L}(\beta_0,\beta_1 \vert \mathbf{y})\) with respect to \(\beta_0\) and \(\beta_1\), respectively, set them both to zero, and determine the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that jointly satisfy the resulting equations. These are the values at which the bivariate likelihood surface achieves its maximum value. Note that because numerical optimization is an iterative process, logistic regression models are learned more slowly than simple linear regression models.
Let’s step back for a moment. Why would we want to learn a logistic regression model in the first place? After all, it is a relatively inflexible model that might lack the ability to mimic the true behavior of \(p \vert x\). The reason is that because we specify the mathematical form of the model, we can after the fact examine the estimated coefficients and perform inference: we can examine how the response is affected by changes in the predictor variables’ values. This can be important in, e.g., scientific contexts, where explaining how a model works can be as important, if not more important, than its ability to generate accurate and precise predictions. (This is in contrast to commonly used machine learning models, whose mathematical forms cannot be specified a priori and are thus less interpretable after they are learned.)
In simple linear regression, if we increase \(x\) by one unit, we can immediately infer how the response value changes: \[ \hat{Y}' = \hat{\beta}_0 + \hat{\beta}_1 (x + 1) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_1 = \hat{Y} + \hat{\beta}_1 \,. \] For logistic regression, the situation is not as straightforward, because \(p\) changes non-linearly as a function of \(x\). So we fall back on the concept of odds: \[ O(x) = \frac{p(x)}{1-p(x)} = \exp\left(\hat{\beta}_0+\hat{\beta}_1x\right) \,. \] If, e.g., \(O(x) = 4\), then that means that, given \(x\), we are four times more likely to sample a success (1) than a failure (0). How does the odds change if we add one unit to \(x\)? \[ O(x+1) = e^{\hat{\beta}_0+\hat{\beta}_1(x+1)} = \exp\left(\hat{\beta}_0 + \hat{\beta}_1x\right) \times \exp\left(\hat{\beta}_1\right) = O(x) \exp\left(\hat{\beta}_1\right) \,. \] The odds change by a factor of \(\exp\left(\hat{\beta}_1\right)\), which can be greater than or less than one, depending on the sign of \(\hat{\beta}_1\).
It is important to note that when we perform logistic regression, we are simply estimating \(p \vert x\), which is the probability that we would sample a datum belonging to Class 1, given \(x\). How we choose to map \(p \vert x\) to either 0 or 1, if we choose to make that mapping, is not actually part of the logistic regression framework. Classification is a broad topic within statistical learning whose details are beyond the scope of this book. We direct the interested reader to, e.g., Introduction to Statistical Learning by James et al.
3.9.1 Logistic Regression in R
When observed with a telescope, a star is a point-like object, as opposed to a galaxy, which has some angular extent. Telling stars and galaxies apart visually is thus, in relative terms, easy. However, if the galaxy has an inordinately bright core because the supermassive black hole at its center is ingesting matter at a high rate, that core can outshine the rest of the galaxy so much that the galaxy appears to be point-like. Such galaxies with bright cores are dubbed “quasars,” or “quasi-stellar objects,” and they are much harder to tell apart from stars.
Let’s say we are given a dataset showing the difference in magnitude (a logarithmic measure of brightness) at two wavelengths, for 500 stars and 500 quasars. The response variable is a factor variable with two levels, which we dub “Class 0” (here, QSO) and “Class 1” (STAR). (The mapping of qualitative factors to quantitative levels is by default alphabetical; as STAR comes after QSO, STAR gets mapped to Class 1. One can always override this default behavior.)
We learn a simple logistic regression model using the R function glm(), rather than lm(), and we specify family=binomial, which means “do logistic regression.” (We can add additional arguments to, e.g., change the link function.)
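(The call below is a reconstruction matching the Call: line in the displayed output; df is assumed to be the data frame holding the class and r columns.)
fit.out <- glm(class~r,family=binomial,data=df)
summary(fit.out)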
##
## Call:
## glm(formula = class ~ r, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6465 -0.8950 0.0331 0.8436 3.9163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 23.44801 1.65428 14.17 <2e-16 ***
## r -1.25386 0.08817 -14.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386.3 on 999 degrees of freedom
## Residual deviance: 1040.9 on 998 degrees of freedom
## AIC: 1044.9
##
## Number of Fisher Scoring iterations: 5
The summary() of a learned logistic regression model is similar to that for a linear regression model.
The “deviance residuals,” which show the residuals for each datum, are defined as \[ d_i = \mbox{sign}(Y_i - \hat{Y}_i) \sqrt{-2[Y_i \log \hat{Y}_i + (1-Y_i)\log(1-\hat{Y}_i)]} \,, \] where \[ \hat{Y}_i = \hat{p}_i \vert x_i = \frac{\exp(\hat{\beta}_0+\hat{\beta}_1 x_i)}{1 + \exp(\hat{\beta}_0+\hat{\beta}_1 x_i)} \,. \] The sum of the squared deviance residuals is equal to \(-2 \log \mathcal{L}_{max}\), where \(\mathcal{L}_{max}\) is the maximum value of the joint likelihood of \(\hat{\beta}_0\) and \(\hat{\beta}_1\), given \(\mathbf{x}\). Here, we observe that the deviance residuals are seemingly well-balanced around zero.
The coefficients can be translated to \(y\)-coordinate values as follows: the intercept is \(e^{23.448}/(1+e^{23.448}) \rightarrow 1\) (this is the probability that a sampled datum with r equal to zero is a STAR), while the odds ratio is \(O_{new}/O_{old} = e^{-1.254}\) (so that every time r increases by one unit, the odds that a sampled datum is a star become 0.285 times what they had been before: the higher the value of r, the less and less likely that a sampled datum is actually a star, and the more and more likely that it is a quasar…at least, according to the logistic regression model).
The numbers in the other three columns of the coefficients section are estimated numerically using the behavior of the likelihood function (specifically, the rates at which it curves downward away from the maximum). Some details about how the numbers are calculated are given in an example in the “Covariance and Correlation” section of Chapter 6. It suffices to say here that if we assume that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are both normally distributed random variables, we reject the null hypothesis that the intercept is 0 (or equivalently \(y = 1/2\)), and that \(\beta_1 = 0\).
The null deviance is \(-2 \log \mathcal{L}_{max,o}\), where \(\mathcal{L}_{max,o}\) is the maximum likelihood when \(\beta_1\) is set to zero. The residual deviance is \(-2 \log \mathcal{L}_{max}\). The difference between these values (here, 1386.3-1040.9 = 345.4), under the null hypothesis that \(\beta_1 = 0\), is assumed to be sampled from a chi-square distribution for 999-998 = 1 degree of freedom. The \(p\)-value is thus
1 - pchisq(345.4,1)
or effectively zero: we emphatically reject the null hypothesis that \(\beta_1 = 0\). While this hypothesis test clearly indicates that \(\beta_1\) is not equal to zero, how can we know whether the model itself fits the data well in an absolute sense? This is a question of the model’s “goodness of fit,” a concept we introduce at the end of this chapter. There is no unique answer to this question in the context of logistic regression; the interested reader should look up, e.g., the Hosmer-Lemeshow statistic. (Note that the test carried out here is analogous to the \(F\)-test in linear regression, and is testing \(H_o: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) versus \(H_a:\) at least one of the \(\beta_i\)’s is non-zero. Because the sampling distribution here is the chi-square distribution as opposed to the normal distribution, the \(p\)-value here will not match the \(p\)-value seen in the coefficients section in general.)
Last, the AIC, or Akaike Information Criterion, is \(-2\) times the maximized log-likelihood (or here, the residual deviance) plus two times the number of variables (here, 2, as the intercept is counted as a variable). Adding the number of variables acts to penalize those models with more variables: the improvement in the maximum likelihood has to be sufficient to justify added model complexity. Discussion of the mathematical details of AIC is beyond the scope of this book; it suffices to say that if we compute it when learning a suite of different models, we would select the model with the smallest value.
We wrap up this example by showing how one would start moving from estimating the Class 1 probabilities for each datum towards predicting classes for each datum. Logistic regression is not in and of itself a “classifier”: it simply outputs probabilities. Naively, we would classify an object with an estimated probability below 0.5 as being an object of Class 0 and one with an estimated probability above 0.5 as being an object of Class 1, but that only works in general if the classes are balanced, i.e., if there are the same number of data of Class 1 as of Class 0. (Here, “works” means “acts to minimize misclassification rates across both classes at the same time.”) If there is class imbalance, then we will almost certainly have to change 0.5 to some other value for optimal classification. In the current example, the classes are balanced, and so putting the “dividing line” at 0.5, as we do in Figure 3.12, is an optimal choice.
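For instance, a sketch of this naive mapping, using the fit.out object we assigned above (predict() with type="response" returns the estimated Class 1 probabilities):
p.hat <- predict(fit.out,type="response")      # estimated P(Class 1 | r) for each datum
class.pred <- ifelse(p.hat>0.5,"STAR","QSO")   # dividing line at 0.5
table(class.pred,df$class)                     # tabulate predictions against truth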
3.10 Naive Bayes Regression
The Naive Bayes model is the basis for perhaps the simplest probabilistic classifier, one that is used to, e.g., detect spam emails. The meanings of the words “Naive” and “Bayes” will become more clear below. Note that one would often see this model referred to as the “Naive Bayes classifier.” However, as noted in the last section, there are actually two steps in classification, the first being the estimation of probabilities that a given datum belongs to each class, and the second being the mapping of those probabilities to class predictions. In the last section and here, we are only focusing on the generation of predicted probabilities…hence the title of this section.
(Many would argue that the Naive Bayes model is a machine-learning model. We would argue that it actually is not: the final form of a machine-learning model is unknown before we start the modeling process [e.g., the number of branches that will appear in a classification tree model is unknown and is learned by the machine], whereas with Naive Bayes the mathematical form of the model is fully specified and all we have to do is estimate unknown probabilities. This is akin to the situation with linear regression, where the mathematical form is set and all we have to do is estimate the coefficients…and no one would argue that linear regression is a machine-learning model!)
Let’s assume that we are in a similar setting as the one for logistic regression, but instead of having a response that is a two-level factor variable (i.e., one that represents two classes), the number of levels is \(K \geq 2\). (So instead of just, e.g., “chocolate” and “vanilla” as our response variable values, we can add “strawberry” and other ice cream flavors too!) The ultimate goal of the model is to assign conditional probabilities for each response class, given a datum \(\mathbf{x}\): \(p(C_k \vert \mathbf{x})\), where \(C_k\) denotes “class \(k\).” How do we derive this quantity?
The first step is to apply Bayes’ rule (hence the “Bayes” in “Naive Bayes”): \[ p(C_k \vert \mathbf{x}) = \frac{p(C_k) p(\mathbf{x} \vert C_k)}{p(\mathbf{x})} \,. \] The next step is to expand \(p(\mathbf{x} \vert C_k)\): \[ p(\mathbf{x} \vert C_k) = p(x_1,\ldots,x_p \vert C_k) = p(x_1 \vert x_2,\ldots,x_p,C_k) p(x_2 \vert x_3,\ldots,x_p,C_k) \cdots p(x_p \vert C_k) \,. \] (This is the multiplicative law of probability in action, as applied to a conditional probability. See section 1.4.) The right-most expression above is one that is difficult to evaluate in practice, given all the conditions that must be jointly applied…so this is where the “Naive” aspect of the model comes in. We simplify the expression by assuming (most often incorrectly!) that the predictor variables are all mutually independent, so that \[ p(x_1 \vert x_2,\ldots,x_p,C_k) p(x_2 \vert x_3,\ldots,x_p,C_k) \cdots p(x_p \vert C_k) ~~ \rightarrow ~~ p(x_1 \vert C_k) \cdots p(x_p \vert C_k) \] and thus \[ p(C_k \vert \mathbf{x}) = \frac{p(C_k) \prod_{i=1}^p p(x_i \vert C_k)}{p(\mathbf{x})} \,. \] OK…where do we go from here?
We need to make further assumptions!
- We need to assign “prior probabilities” \(p(C_k)\) to each class. Common choices are \(1/K\) (we view each class as equally probable before we gather data) and \(n_k/n\) (the number of observed data of class \(k\) divided by the overall sample size).
- We also need to assume probability mass and density functions \(p(x_i \vert C_k)\) for each predictor variable (a pmf if \(x_i\) is discrete and a pdf if \(x_i\) is continuous). It is convention to use binomial or multinomial pmfs, depending on the number of levels, with the observed proportions in each level informing the category probability estimate, and normal pdfs, with \(\hat{\mu} = \bar{x_i}\) and \(\hat{\sigma^2} = s_i^2\) (the sample variance).
Ultimately, this model depends on a number of (perhaps unjustified) assumptions. Why would we ever use it? Because the assumption of mutual independence makes model evaluation fast. Naive Bayes is rarely the model underlying the best classifier for any given problem, but when speed is needed (such as in the identification of spam email), one’s choices are limited. (As a general rule: if a model is simple to implement, one should implement it, even if the a priori expectation is that another model will ultimately generate better results. One never knows…)
3.10.1 Naive Bayes Regression With Categorical Predictors
Let’s assume that we have collected the following data:
\(x_1\) | \(x_2\) | \(Y\) |
---|---|---|
Yes | Chocolate | True |
Yes | Chocolate | False |
No | Vanilla | True |
No | Chocolate | True |
Yes | Vanilla | False |
No | Chocolate | False |
No | Vanilla | False |
We learn a Naive Bayes model given these data, in which we assume “False” is Class 0 and “True” is Class 1. If we have a new datum \(\mathbf{x}\) = (“Yes”,“Chocolate”), what is the probability that the response is “True”?
We seek the quantity \[ p(C_1 \vert \mathbf{x}) = \frac{p(C_1) p(\mathbf{x} \vert C_1)}{p(\mathbf{x})} = \frac{p(C_1) p(x_1 \vert C_1) p(x_2 \vert C_1)}{p(x_1,x_2)} \,. \] When we examine the data, we see that \(p(C_1) = 3/7\), as there are three data out of seven with the value \(Y\) = “True.” Now, given \(C_1\), at what rate do we observe “Yes”? (One time out of three…so \(p(x_1 \vert C_1) = 1/3\).) What about “Chocolate”? (Two times out of three…so \(p(x_2 \vert C_1) = 2/3\).) The numerator is thus \(3/7 \times 1/3 \times 2/3 = 2/21\). The value of the denominator, \(p(x_1,x_2)\), is determined utilizing the Law of Total Probability: \[\begin{align*} p(x_1,x_2) &= p(x_1 \vert C_1) p(x_2 \vert C_1) p(C_1) + p(x_1 \vert C_2) p(x_2 \vert C_2) p(C_2) \\ &= 2/21 + 2/4 \times 2/4 \times 4/7 = 2/21 + 4/28 = 2/21 + 3/21 = 5/21 \,. \end{align*}\] And so now we know that \[ p(\mbox{True} \vert \mbox{Yes,Chocolate}) = \frac{2/21}{5/21} = \frac{2}{5} \,. \]
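We can check this arithmetic with a few lines of R (a quick sketch; the vectors below simply transcribe the data table):
x1 <- c("Yes","Yes","No","No","Yes","No","No")
x2 <- c("Chocolate","Chocolate","Vanilla","Chocolate","Vanilla","Chocolate","Vanilla")
Y  <- c("True","False","True","True","False","False","False")
num <- mean(Y=="True")*mean(x1[Y=="True"]=="Yes")*mean(x2[Y=="True"]=="Chocolate")
den <- num + mean(Y=="False")*mean(x1[Y=="False"]=="Yes")*mean(x2[Y=="False"]=="Chocolate")
num/den   # 2/5 = 0.4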
3.10.2 Naive Bayes Regression With Continuous Predictors
Let’s assume we have the following data regarding credit-card defaults, where \(x\) represents the credit-card balance and \(Y\) is a categorical variable indicating whether a default has occurred:
\(x\) | 1487.00 | 324.74 | 988.21 | 836.30 | 2205.80 | 927.89 | 712.28 | 706.16 | 1774.69 |
---|---|---|---|---|---|---|---|---|---|
\(Y\) | Yes | No | No | No | Yes | No | No | No | Yes |
Let’s now suppose someone came along with a credit balance of $1,200. According to the Naive Bayes model, given this balance, what is the probability of a credit default?
The Naive Bayes model in this particular case is \[ p(Y \vert x) = \frac{p(x \vert Y) p(Y)}{p(x \vert Y) p(Y) + p(x \vert N) p(N)} \,, \] where we can observe immediately that \(p(Y) = 3/9 = 1/3\) and \(p(N) = 6/9 = 2/3\). To compute, e.g., \(p(x \vert Y)\), we first compute the sample mean and sample standard deviation for the observed data for which \(Y\) is “Yes”:
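(The code below is a reconstruction, with the data transcribed from the table; the rounding matches the displayed output.)
x <- c(1487.00,324.74,988.21,836.30,2205.80,927.89,712.28,706.16,1774.69)
Y <- c("Yes","No","No","No","Yes","No","No","No","Yes")
round(mean(x[Y=="Yes"]),2)
round(sd(x[Y=="Yes"]),2)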
## [1] 1822.5
## [1] 361.78
The mean is $1,822.50 and the standard deviation is $361.78, so the probability density associated with observing $1,200 is
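dnorm(1200,mean(x[Y=="Yes"]),sd(x[Y=="Yes"]))   # (reconstructed call)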
## [1] 0.0002509367
The density is 2.51 \(\times\) 10\(^{-4}\). As for the data for which \(Y\) is “No”:
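dnorm(1200,mean(x[Y=="No"]),sd(x[Y=="No"]))   # (reconstructed call)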
## [1] 0.0002748351
The density is 2.75 \(\times\) 10\(^{-4}\). Thus (the common factor of \(10^{-4}\) cancels) \[ p(Y \vert x) = \frac{2.51 \cdot 0.333}{2.51 \cdot 0.333 + 2.75 \cdot 0.667} = \frac{0.836}{0.836 + 1.834} = 0.313 \,. \] We would predict that there is roughly a 1 in 3 chance that a person with a credit balance of $1,200 will default on their debt.
3.10.3 Naive Bayes Applied to Star-Quasar Data
Here, we learn a Naive Bayes model that we can use to differentiate between stars and quasars, utilizing functions from R’s e1071 package.
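(A sketch of the call: per the Call: line in the output, X is the predictor data frame, here assumed to hold the single column r from the star-quasar data frame df, and Y is the two-level class factor.)
library(e1071)
X <- df[,"r",drop=FALSE]   # assumed predictor data frame
Y <- df$class              # assumed class factor
naiveBayes(x=X,y=Y)        # printing the returned object yields the summary below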
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## QSO STAR
## 0.5 0.5
##
## Conditional probabilities:
## r
## Y [,1] [,2]
## QSO 19.35770 0.8282726
## STAR 17.96389 1.3651572
The summary of the model shows that the classes are balanced (as the A-priori probabilities are 0.5 for each class). It then shows the estimated distributions of r values for quasars (a normal with estimated mean 19.358 and standard deviation 0.828) and stars (mean 17.964 and standard deviation 1.365). See Figure 3.13.
3.11 The Beta Distribution
The beta distribution is a continuous distribution that is commonly used to model random variables with finite, bounded domains. Its probability density function is given by \[ f_X(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,, \] where \(x \in [0,1]\), \(\alpha\) and \(\beta\) are both \(> 0\), and the normalization constant \(B(\alpha,\beta)\) is \[ B(\alpha,\beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1} dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \,. \] (See Figure 3.14.) We have seen the gamma function, \(\Gamma(\alpha)\), before; it is defined as \[ \Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x} dx \,. \] There are two things to note about the gamma function. The first is its recursive property: \(\Gamma(\alpha+1) = \alpha \Gamma(\alpha)\). (This can be shown by applying integration by parts.) The second is that when \(\alpha\) is a positive integer, the gamma function takes on the value \((\alpha-1)! = (\alpha-1)(\alpha-2) \cdots 1\). (Note that \(\Gamma(1) = 0! = 1\).)
Regarding the statement above about “model[ing] random variables with finite, bounded domains”: if a set of \(n\) iid random variables \(\mathbf{X}\) has domain \([a,b]\), we can define a new set of random variables \(\mathbf{Y}\) via the transformation \[ \mathbf{Y} = \frac{\mathbf{X}-a}{b-a} \] such that the domain becomes \([0,1]\). We can model these newly defined data with the beta distribution. (Note the word can: we can model these data with the beta distribution, but we don’t have to, and it may be the case that there is another distribution bounded on the interval \([0,1]\) that ultimately better describes the data-generating process. The beta distribution just happens to be commonly used.)
But…in the end, what does this all have to do with the binomial distribution, the subject of this chapter? We know the binomial has a parameter \(p \in [0,1]\) and the domain of the beta distribution is \([0,1]\), but is there more? Let’s write down the binomial pmf: \[ \binom{k}{x} p^x (1-p)^{k-x} = \frac{k!}{x!(k-x)!} p^x (1-p)^{k-x} \,. \] This pmf dictates the probability of observing a particular value of \(x\) given \(k\) and \(p\). But what if we turn this around a bit…and examine this function if we fix \(k\) and \(x\) and vary \(p\) instead? In other words, let’s examine the likelihood function \[ \mathcal{L}(p \vert k,x) = \frac{k!}{x!(k-x)!} p^x (1-p)^{k-x} \,. \] We can see immediately that the likelihood has the form of a beta distribution if we set \(\alpha = x+1\) and \(\beta = k-x+1\): \[ \mathcal{L}(p \vert k,x) = \frac{\Gamma(\alpha+\beta-1)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1} \,, \] except that the normalization term is not quite right: here we have \(\Gamma(\alpha+\beta-1)\) instead of \(\Gamma(\alpha+\beta)\). But that’s fine: there is no requirement that a likelihood function integrate to one over its domain. (Here, as the interested reader can verify, the likelihood function integrates to \(1/(k+1)\).) So, in the end, if we observe a random variable \(X \sim\) Binomial(\(k,p\)), then the likelihood function \(\mathcal{L}(p \vert k,x)\) has the shape (if not the normalization) of a Beta(\(x+1,k-x+1\)) distribution.
3.11.1 The Expected Value of a Beta Random Variable
The expected value of a random variable sampled from a Beta(\(\alpha,\beta\)) distribution is \[\begin{align*} E[X] = \int_0^1 x f_X(x) dx &= \int_0^1 x \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha,\beta)} dx \\ &= \int_0^1 \frac{x^{\alpha} (1-x)^{\beta-1}}{B(\alpha,\beta)} dx \\ &= \int_0^1 \frac{x^{\alpha} (1-x)^{\beta-1}}{B(\alpha+1,\beta)} \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} dx \\ &= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} \int_0^1 \frac{x^{\alpha} (1-x)^{\beta-1}}{B(\alpha+1,\beta)} dx \\ &= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} \,. \end{align*}\] The last result follows from the fact that the integrand is the pdf for a Beta(\(\alpha+1,\beta\)) distribution, and the integral is over the entire domain, hence the integral evaluates to 1.
Continuing, \[\begin{align*} E[X] &= \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} \\ &= \frac{\Gamma(\alpha+1) \Gamma(\beta)}{\Gamma(\alpha+\beta+1)} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \\ &= \frac{\alpha \Gamma(\alpha)}{(\alpha+\beta)\Gamma(\alpha+\beta)} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)} \\ &= \frac{\alpha}{\alpha+\beta} \,. \end{align*}\] Here, we take advantage of the recursive property of the gamma function.
We can utilize a similar strategy to determine the variance of a beta random variable, starting by computing \(E[X^2]\) and utilizing the shortcut formula \(V[X] = E[X^2] - (E[X])^2\). The final result is \[ V[X] = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \,. \]
3.11.2 The Sample Median of a Uniform(0,1) Distribution
The Uniform(0,1) distribution is \[ f_X(x) = 1 ~~~ x \in [0,1] \,. \] Let’s assume that we draw \(n\) iid data from this distribution, with \(n\) odd. Then we can utilize a result from order statistics to write down the pdf of the sample median, which is the \(j^{\rm th}\) order statistic (where \(j = (n+1)/2\)): \[ f_{((n+1)/2)}(x) = \frac{n!}{\left(\frac{n-1}{2}\right)! \left(\frac{n-1}{2}\right)!} f_X(x) \left[ F_X(x) \right]^{(n-1)/2} \left[ 1 - F_X(x) \right]^{(n-1)/2} \,. \] Given that \[ F_X(x) = \int_0^x dy = x ~~~ x \in [0,1] \,, \] we can write \[ f_{((n+1)/2)}(x) = \frac{n!}{\left(\frac{n-1}{2}\right)! \left(\frac{n-1}{2}\right)!} x^{(n-1)/2} (1-x)^{(n-1)/2} \,, \] for \(x \in [0,1]\). This function has both the form of a beta distribution (with \(\alpha = \beta = (n+1)/2\)) and the domain of a beta distribution, so \(\tilde{X} \sim\) Beta\(\left(\frac{n+1}{2},\frac{n+1}{2}\right)\).
(In fact, we can go further and state a more general result: \(X_{(j)} \sim\) Beta(\(j,n-j+1\)): all the order statistics for data drawn from a Uniform(0,1) distribution are beta-distributed random variables! See Figure 3.15.)
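A quick simulation check of this result (our own illustration): for odd \(n\), the variance of the sample median of Uniform(0,1) draws should match that of a Beta\(\left(\frac{n+1}{2},\frac{n+1}{2}\right)\) random variable.
set.seed(101)
n <- 11; a <- (n+1)/2                        # here, Beta(6,6)
med <- replicate(100000,median(runif(n)))    # 100,000 simulated sample medians
var(med)                                     # should be close to...
a^2/((2*a)^2*(2*a+1))                        # ...the Beta(6,6) variance, about 0.0192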
3.12 The Multinomial Distribution
Let’s suppose we are in a situation in which we are gathering categorical data. For instance, we might be
- throwing a ball and recording which of bins numbered 1 through \(m\) it lands in;
- categorizing the condition of old coins as “mint,” “very good,” “fine,” etc.; or
- classifying a galaxy as being a spiral galaxy, an elliptical galaxy, or an irregular galaxy.
These are examples of multinomial trials, a generalization of binomial trials to situations where the number of possible outcomes is \(m > 2\) rather than \(m = 2\). Earlier in this chapter, we wrote down the properties of binomial trials. The analogous properties of a multinomial experiment are the following.
- It consists of \(k\) trials, with \(k\) chosen in advance.
- There are \(m\) possible outcomes for each trial.
- Each trial results in exactly one realized outcome.
- The outcomes of each trial are independent.
- The probability of achieving the \(i^{th}\) outcome is \(p_i\), a constant quantity.
- The probabilities of each outcome sum to one: \(\sum_{i=1}^m p_i = 1\).
- The number of trials that achieve a particular outcome is \(X_i\), with \(\sum_{i=1}^m X_i = k\).
The probability of any given outcome \(\{X_1,\ldots,X_m\}\) is given by the multinomial probability mass function: \[ p(x_1,\ldots,x_m \vert p_1,\ldots,p_m) = \frac{k!}{x_1! \cdots x_m!}p_1^{x_1}\cdots p_m^{x_m} \,. \] The distribution for any given \(X_i\) itself is binomial, with \(E[X_i] = kp_i\) and \(V[X_i] = kp_i(1-p_i)\). This makes intuitive sense, as one either observes \(i\) as a trial outcome (success), or something else (failure). However, the \(X_i\)’s are not independent random variables; the covariance between \(X_i\) and \(X_j\), a metric of linear dependence, is Cov(\(X_i\),\(X_j\)) = \(-kp_ip_j\) if \(i \neq j\). (We will discuss the concept of covariance in Chapter 6.) This also makes intuitive sense: given that the number of trials \(k\) is fixed, observing more data that achieve one outcome will usually mean we will observe fewer data achieving any other given outcome.
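As a quick empirical check of the covariance formula (an illustration with made-up values \(k = 20\) and \(\mathbf{p} = (0.2,0.3,0.5)\)):
set.seed(101)
X <- rmultinom(100000,size=20,prob=c(0.2,0.3,0.5))   # one outcome vector per column
cov(X[1,],X[2,])                                     # should be near -k*p1*p2 = -1.2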
3.13 Chi-Square-Based Hypothesis Testing
Imagine that we are scientists, sitting on a platform in the middle of a plain. We are recording the number of animals of a particular species that we see, and the azimuthal angle for each one. (An azimuthal angle is, e.g., \(0^\circ\) when looking directly north, and 90\(^\circ\), 180\(^\circ\), and 270\(^\circ\) as we look east, south, and west, etc.) The question we want to answer is, are the animals uniformly distributed as a function of angle?
There is no unique way to answer this question. However, a very common approach is to bin the data and use a chi-square goodness-of-fit test. Let’s assume we’ve observed 100 animals and that we record the numbers seen in each of four quadrants:
angle range | 0\(^\circ\)-90\(^\circ\) | 90\(^\circ\)-180\(^\circ\) | 180\(^\circ\)-270\(^\circ\) | 270\(^\circ\)-360\(^\circ\) |
---|---|---|---|---|
number of animals | 28 | 32 | 17 | 23 |
There is nothing “special” about the choice of four quadrants\(-\)we could have chosen eight, etc.\(-\)but as we’ll see below, the chi-square GoF test is an “approximate” test and its results become more precise as the numbers of data in each bin get larger. So there is a tradeoff if we increase the number of quadrants: we might be able to detect smaller-scale non-uniformities, but our test results will become less precise.
To perform a chi-square GoF test with these data, we first specify our null and alternative hypotheses: \[ H_o : p_1 = p_{1,o},\cdots,p_m = p_{m,o} ~~ vs. ~~ H_a : \mbox{at least two of the probabilities differ} \,, \] where \(p_{i,o}\) is the null hypothesis proportion in bin \(i\). Here, \(k = 100\), \(m = 4\), and \(p_{1,o} = p_{2,o} = p_{3,o} = p_{4,o} = 0.25\)…we expect, under the null, to see 25 animals in each quadrant.
The null hypothesis is that the data are multinomially distributed with specified proportions being expected in each of the \(m\) defined bins. As we might imagine, performing a hypothesis test of the form given above that utilizes the multinomial distribution would be difficult to do by hand; multinomial distributions are intrinsically high-dimensional and the data (the numbers of counts in each bin) are not iid. In 1900, the statistician Karl Pearson proposed a “workaround” for testing multinomial hypotheses. He noted that as \(k \rightarrow \infty\), multinomial random variables converge in distribution to multivariate normal random variables (with the latter being something we will discuss in Chapter 6), and that the statistic \[ W = \sum_{i=1}^m \frac{(X_i-E[X_i])^2}{E[X_i]} = \sum_{i=1}^m \frac{(X_i - kp_i)^2}{kp_i} \] converges in distribution to the chi-square random variable for \(m-1\) degrees of freedom. (We subtract 1 because only \(m-1\) of the multinomial probabilities can freely vary; the \(m^{th}\) one is constrained by the fact that the probabilities must sum to 1. This constraint is what makes multinomial data not iid.) The computation of \(W\) is thus the basis of the chi-square goodness-of-fit test, or chi-square GoF test. For this test,
- we reject the null hypothesis if \(W > w_{RR}\);
- the rejection region boundary is \(w_{RR} = F_{W(m-1)}^{-1}(1-\alpha)\) (in R, qchisq(1-alpha,m-1));
- the \(p\)-value is \(1 - F_{W(m-1)}(w_{\rm obs})\) (e.g., 1-pchisq(w.obs,m-1)); and
- by convention, \(kp_i\) must be \(\geq\) 5 in each bin for the test to yield a valid result.
This last point underscores the tradeoff between splitting the data over more bins and test precision!
In a chi-square GoF test, the inputs are the observed data and the hypothesized proportions. There are variations on this test in which the inputs are tables of observed data, ones that differ depending upon how the data are collected:
- chi-square test of independence: the question is whether two variables are associated with each other in a population; subjects are selected and the values of two variables are recorded for each. For instance, we might select \(k\) people at random and record whether or not they have had Covid-19, and also record whether or not they initially had zero, one, or two vaccine shots, and put these data into a table with, e.g., “yes” and “no” defining the rows and 0, 1, and 2 defining the columns. If we reject the null, we are stating that the distributions of data along, e.g., each row are statistically significantly different from each other.
- chi-square test of homogeneity: the question is whether the distribution of a single variable is the same for two subgroups of a population, with subjects being selected randomly from each subgroup separately. For instance, we might select \(k\) people under 20 years of age and ask if they prefer vanilla, chocolate, or strawberry ice cream, and then repeat the process for people of age 20 or over, and put the data into a table similar to that described for the test of independence above.
Whether we perform a test of independence versus one of homogeneity affects the interpretation of results, but algorithmically the tests are identical. Under the null hypothesis, \[ \widehat{E[X_{ij}]} = \frac{r_i c_j}{k} \,, \] i.e., the expected count in the cell in row \(i\) and column \(j\) is the product of the row-\(i\) total \(r_i\) and the column-\(j\) total \(c_j\), divided by the total number of data \(k\). Then, \[ W = \sum_{i=1}^r \sum_{j=1}^c \frac{(X_{ij} - \widehat{E[X_{ij}]})^2}{\widehat{E[X_{ij}]}} \stackrel{\cdot}{\sim} \chi_{(r-1)(c-1)}^2 \,, \] i.e., the test statistic \(W\) is, under the null, approximately chi-square distributed for \((r-1) \times (c-1)\) degrees of freedom.
3.13.1 Chi-Square Goodness of Fit Test
Let’s work with the data presented above at the beginning of this section:
angle range | 0\(^\circ\)-90\(^\circ\) | 90\(^\circ\)-180\(^\circ\) | 180\(^\circ\)-270\(^\circ\) | 270\(^\circ\)-360\(^\circ\) |
---|---|---|---|---|
number of animals | 28 | 32 | 17 | 23 |
As stated above, we expect 25 counts in each bin. Given this expectation, are these data plausible, at level \(\alpha = 0.05\)?
We first compute the test statistic: \[\begin{align*} W &= \sum_{i=1}^m \frac{(X_i - kp_i)^2}{kp_i} = \frac{(28-25)^2}{25} + \frac{(32-25)^2}{25} + \frac{(17-25)^2}{25} + \frac{(23-25)^2}{25} \\ &= \frac{9}{25} + \frac{49}{25} + \frac{64}{25} + \frac{4}{25} = \frac{126}{25} = 5.04 \,. \end{align*}\] The number of degrees of freedom is \(m-1 = 3\), so the rejection region boundary is \(F_{W(3)}^{-1}(1-\alpha)\) = qchisq(0.95,3) = 7.815. Since 5.04 \(<\) 7.815, we fail to reject the null hypothesis that the animals are distributed uniformly as a function of azimuthal angle. (The \(p\)-value is 1-pchisq(5.04,3) = 0.169.) See Figure 3.16.
3.13.2 Simulating an Exact Multinomial Test
Let’s assume the same data as in the last example. One reason\(-\)in fact, the reason\(-\)why we utilize the chi-square GoF test when analyzing these data is that “it has always been done this way.” Stated differently, we have historically not worked with the multinomial distribution directly because we couldn’t…at least, not until computers came along. But now we can estimate the \(p\)-value for the hypothesis in the last example via simulation. Will we achieve a result very different from that above, \(p = 0.169\)?
set.seed(101)
O <- c(28,32,17,23)
p <- rep(1/4,4)
num.sim <- 100000
k <- sum(O)
m <- length(O)
pmf.obs <- dmultinom(O,prob=p) # the observed multinomial pmf
X <- rmultinom(num.sim,k,p) # generates an m x num.sim matrix
# (m determined as the length of p)
pmf <- apply(X,2,function(x){dmultinom(x,prob=p)}) # simulated pmf's
sum(pmf<pmf.obs)/num.sim
## [1] 0.15974
What does apply() do? It applies the function given as the third argument (which evaluates the multinomial probability mass function given a set of four data \(\mathbf{x}\) and a set of four probabilities \(\mathbf{p}\)) to each column (hence the “2” as the second argument…“1” would denote rows) of the matrix \(\mathbf{X}\). It is a convenient function that allows us to not have to embed the evaluation of the multinomial pmf into a for loop that would iterate over all the columns of the matrix.
We observe \(p = 0.160\). This is close to, but at the same time still substantively less than, 0.169. In case the reader is tempted to say “but, the computation time must be much longer than for the chi-square GoF test,” the above computation takes \(\sim\) 1 CPU second. The chi-square test is definitely important to know, in part because testing hypotheses about multinomial probabilities “has always been done this way”…but we would argue that when a simulation of the exact test is possible, one should code that simulation! (And run as many simulations as possible, to reduce uncertainty in the final result. Here, we can take our estimated \(p\)-value of 0.160 and state that our one standard error uncertainty is approximately \(\sqrt{p(1-p)/{\rm num.sim}} = 0.001\), i.e., we expect the true \(p\)-value to be in the range \(0.160 \pm 3 \cdot 0.001\) or \([0.157,0.163]\).)
3.13.3 Chi-Square Test of Independence
Let’s go back to our animal data. When we observe the animals in each quadrant, we record their color: black or red. So now our data look like this:
 | 0\(^\circ\)-90\(^\circ\) | 90\(^\circ\)-180\(^\circ\) | 180\(^\circ\)-270\(^\circ\) | 270\(^\circ\)-360\(^\circ\) | total |
---|---|---|---|---|---|
black | 20 | 18 | 5 | 14 | 57 |
red | 8 | 14 | 12 | 9 | 43 |
total | 28 | 32 | 17 | 23 | |
When we record two attributes for each subject (here, azimuthal angle and color), we can perform a chi-square test of independence to answer the question of whether the attributes are independent random variables. In other words, here, does the coloration depend on angle? The null hypothesis is no. We will test this hypothesis assuming \(\alpha = 0.05\).
We first determine the expected number of counts in each bin, \(\widehat{E[X_{ij}]} = r_i c_j / k\):
 | 0\(^\circ\)-90\(^\circ\) | 90\(^\circ\)-180\(^\circ\) | 180\(^\circ\)-270\(^\circ\) | 270\(^\circ\)-360\(^\circ\) | total |
---|---|---|---|---|---|
black | 57 \(\cdot\) 28/100 = 15.96 | 57 \(\cdot\) 32/100 = 18.24 | 57 \(\cdot\) 17/100 = 9.69 | 57 \(\cdot\) 23/100 = 13.11 | 57 |
red | 43 \(\cdot\) 28/100 = 12.04 | 43 \(\cdot\) 32/100 = 13.76 | 43 \(\cdot\) 17/100 = 7.31 | 43 \(\cdot\) 23/100 = 9.89 | 43 |
total | 28 | 32 | 17 | 23 | |
We can already see that working with the numbers directly is tedious. Can we make a matrix of such numbers using R?
r <- c(57,43)
c <- c(28,32,17,23)
E <- (r %*% t(c))/sum(r) # multiply r and the transpose of c
print(E) # much better
## [,1] [,2] [,3] [,4]
## [1,] 15.96 18.24 9.69 13.11
## [2,] 12.04 13.76 7.31 9.89
If we wish to continue using R, we need to define a matrix of observed data values:
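O <- matrix(c(20,18,5,14,8,14,12,9),nrow=2,byrow=TRUE)   # (reconstructed from the table above)
print(O)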
## [,1] [,2] [,3] [,4]
## [1,] 20 18 5 14
## [2,] 8 14 12 9
Now we have what we need. The test statistic is
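round(sum((O-E)^2/E),3)   # (reconstructed call) the statistic W, summed over all cells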
## [1] 7.805
and the number of degrees of freedom is \((r-1)(c-1) = 1 \cdot 3 = 3\), so the rejection region boundary is
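round(qchisq(0.95,3),3)   # (reconstructed call)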
## [1] 7.815
We find that if \(\alpha = 0.05\), we cannot reject the null hypothesis. We might be tempted to do so, as our test statistic very nearly falls into the rejection region, but we cannot. We could, if we were so inclined, remind ourselves that chi-square-based hypothesis tests are approximate, and run a simulation to try to estimate the true distribution of \(W\), and see what the rejection region and \(p\)-value actually are…that way, we might be able to actually reject the null. But really, at the end of the day, such a result should simply motivate us to gather more data!